diffopc+ : the diff tool for Open Packaging files

What : developer toolkit for Windows® ; diffing ; navigating ; validating
Where : Word® 2007, Excel® 2007, Powerpoint® 2007, XPS documents, OpenOffice ODF documents (Calc, Writer, Impress, Draw, Base, Math)
Who : ARsT Design, independent software vendor
When : November 2009 (latest update)

 

Testimonies
"I just saw this and it's really cool. Stephane Rodriguez has built a tool that allows you to view a diff of two Open XML files. I was actually bugging a number of folks to see if they would build something like this and it's awesome to see someone outside of Microsoft stepping up and pulling this together." Brian Jones, program manager, Microsoft Office
"If you need this, you need it bad. A program for analyzing the difference between OPC packages such as Open XML documents or XPS documents. But with diffopc+, navigating an OPC package is just like surfing the web: the relationship references are hyperlinks, and you can quickly navigate complex structures. It's as addictive as surfing the web, too -- you'll find yourself rooting around in all sorts of documents just to surf through them and see what's going on inside." Doug Mahugh, technical evangelist, Microsoft Office
"I looked around for Open XML diff tools and yours is the best one I can find. diffopc+ has paid for itself in my first day working with it. I love tools that do exactly what I need without a lot of clutter and fuss." Joe Erickson, SpreadsheetGear

 

(1.7 MB)

How much does diffopc+ cost ? $59 US. You can purchase a license at Plimus store. To activate the license, simply download and install diffopc+, click the Register button in the main dialog, and use the serial number returned by Plimus to download the license file (diffopc.license.lic).

 

How to install and run the product :

  1. Download the EXE file by clicking the button of your choice above
  2. Double-click on it and follow the steps (a number of sample files are provided just to get started). If you are running on Windows Vista or Windows 7, right-click on the EXE file and choose "Run as administrator".

System requirements :

  • Windows 9x/NT/2000/XP/2003/Vista/7
  • There are no run-time dependencies.

Support :

 

Supported file formats

  • Microsoft OOXML files (Office 2007 Word/Excel/Powerpoint)
  • OpenOffice ODF files (Calc, Writer, Impress, Base, Draw, Math)
  • Microsoft XPS files
  • Binary XLS files
  • regular ZIP files

 

The product in action

Experience diffopc+ in action here (OOXML file) or here (OpenOffice file).

 

The product in depth

While Microsoft® has posted specs for open packaging-based Microsoft® Office® 2007 files (Word®, Excel® and Powerpoint®), as well as XPS documents specs, there has not been much public visibility of tools for developers intending to directly access the file formats.

When you start programming with the file formats, you'll be trying to figure out the modifications in the parts before and after arbitrary changes you make. This work requires manually cracking open zip files, possibly moving files to some working folder, and opening differing revisions of parts in a special diff tool. And that for every part, without knowing which may have changed. Needless to say, this is time-consuming and error-prone. In fact, it consumes much more time than your changes, and it only gets worse as you iterate and figure out complex changes. What's actually needed is a tool that automates this for you.

diffopc+ is just that. It presents a diff view of parts between revisions, automatically, and using an interactive html front-end.


Choosing a pair of files

It can also be used to view individual files without having to crack them open by hand. Simply use the same file name in both edit boxes. To make that quick, the diffopc+ tool lets you pick one or two files in the open file dialog, and fills the edit boxes accordingly.

The tool first shows a flat, side-by-side view of all parts of two arbitrary OPC-based files, using colors to identify additions, deletions and changes (complete with the number of changes). Then it lets you click on any non-binary part to view the actual color-coded differences.

Here is an example of a diff between two revisions of an Excel® 2007 file where I have made some random changes by hand :


Automatically generating a diff between revisions of an arbitrary ZIP OPC file


Another example

The older revision of the file appears on the left side, and the newer on the right. Notice that the report tells us that 4 zip entries out of the 18 zip entries in the file have changed.

Here is how the color coding works :

 changed  added  deleted  corrupted 

If you click a part, it then opens up the diff view for this part, side-by-side, using the same color coding. But this time, this really reflects the changes in the part between revisions. Here is the diff view if you click on [Content_Types].xml :


Viewing the actual diff between two revisions of a part

Text-based diffs are broken down in rows. Changes correspond to non-matching rows.
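
To give an idea of what a row-based comparison looks like in code, here is a minimal sketch using Python's standard difflib module. It is not the algorithm used by diffopc+; the function name diff_rows and the labels (which echo the color legend above) are assumptions made for illustration.

```python
import difflib

def diff_rows(old_text, new_text):
    """Compare two texts row by row and return (label, old_rows, new_rows) blocks."""
    old_rows = old_text.splitlines()
    new_rows = new_text.splitlines()
    matcher = difflib.SequenceMatcher(a=old_rows, b=new_rows, autojunk=False)
    # Map difflib's opcode names to labels similar to the legend (hypothetical naming).
    labels = {"equal": "same", "replace": "changed",
              "insert": "added", "delete": "deleted"}
    blocks = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        blocks.append((labels[tag], old_rows[i1:i2], new_rows[j1:j2]))
    return blocks
```

Rows that match on both sides come back as "same" blocks; everything else is a non-matching block that a front-end could color.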

It gets better.

You can click on any part that has been detected as non-binary. Basically this means all xml, text, vml, html, rels, fdseq, and so on. Then, whenever the part is actually an XML stream, the diff tool automatically beautifies it (makes it wrapped and indented) so that the diff is easy to read.
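
The reason beautifying matters is that many XML parts are stored as one very long line, which makes a row-based diff useless. A rough equivalent of the wrapping and indenting step can be sketched with Python's standard xml.dom.minidom module. This only illustrates the idea; the helper name beautify_xml is an assumption, and the tool's own formatting rules may differ.

```python
import xml.dom.minidom

def beautify_xml(xml_text):
    """Re-indent an XML stream so that each element sits on its own row."""
    dom = xml.dom.minidom.parseString(xml_text)
    pretty = dom.toprettyxml(indent="  ")
    # Drop blank rows that minidom can emit when the input already had whitespace.
    return "\n".join(row for row in pretty.splitlines() if row.strip())
```

Running both revisions of a part through such a function before diffing means a one-attribute change shows up as one changed row instead of one changed megabyte-long line.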

Whenever binary parts are pictures, you can do a visual diff.

In the samples folder, there are a number of examples of what changes may look like. Here is the list of changes I made and the corresponding files :

Files                               Changes made
Word1_original.docx                 changed some formatting and changed the document summary properties.
Word1_modified.docx
Word2_original.docx                 added a picture and changed the document theme.
Word2_modified.docx
Word3_original.docx                 changed the picture.
Word3_modified.docx
Excel1_original.xlsx                added a chart and changed some formatting.
Excel1_modified.xlsx
Excel2_original.xlsx                added a databar and added a worksheet.
Excel2_modified.xlsx
Excel3_original.xlsx                manually created fake relationship parts (corrupt document).
Excel3_modified.xlsx
Excel_2007B2TR_original.xlsx        random changes, added a file and deleted a file.
Excel_2007B2TR_modified.xlsx
Powerpoint1_original.pptx           changed some formatting and added a table.
Powerpoint1_modified.pptx
Powerpoint2_original.pptx           changed the slide theme and added a new slide.
Powerpoint2_modified.pptx
Office2007_BULLETS_original.xps     updated the namespace in the fdseq zip entry. changed the fdoc zip entry.
Office2007_BULLETS_modified.xps
simpleOpenOffice_original.ods       (OpenOffice calc spreadsheet) changed actual cell values in the spreadsheet.
simpleOpenOffice_modified.ods
text1_original.zip                  added and removed text.
text1_modified.zip

Although its primary function is to generate diffs from separate files, it works very well with a single file : you can use it to view the content of the file without having to manually unzip it or work with folders, which is tedious. The obvious bonus of using the tool to introspect a single file is that XML streams are made easy to read.

Corrupted ZIP entries, from a ZIP standpoint, are detected. Whenever that occurs, there is a Corr label in the main HTML view.
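
For readers who want to run a similar check in their own scripts, here is a small sketch using Python's standard zipfile module. It reads every entry back and reports those that fail decompression or the CRC check. The function name corrupted_entries is an assumption, and this is not a description of how diffopc+ performs its own detection.

```python
import zipfile
import zlib

def corrupted_entries(path_or_file):
    """Return the names of ZIP entries whose data cannot be read back intact."""
    bad = []
    with zipfile.ZipFile(path_or_file) as package:
        for info in package.infolist():
            try:
                with package.open(info) as stream:
                    # Reading to the end triggers the CRC-32 verification.
                    while stream.read(65536):
                        pass
            except (zipfile.BadZipFile, zlib.error, EOFError):
                bad.append(info.filename)
    return bad
```

A damaged central directory makes zipfile.ZipFile itself raise BadZipFile, so this sketch only covers damage inside individual entries.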

 

diffopc+ options

There are a few options :

  • hide/unhide Word 2007 review tags
    this option will appeal to those who work with Word 2007 files and are overwhelmed by the many Word 2007 document review tags that obscure the XML streams. Note that it does not remove the tags in the file itself, only in the diff view. Hidden by default.
  • custom CSS stylesheet
    the ability to specify your custom CSS stylesheet is handy if you don't like the color codings, or if you are color blind.
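
To illustrate what hiding the review tags amounts to, here is a minimal sketch that strips the w:rsid* attributes from a copy of the XML text before display. It is an approximation under stated assumptions: the helper name hide_review_tags is hypothetical, the regular expression only handles double-quoted attribute values, and diffopc+ itself may proceed differently.

```python
import re

# Word 2007 review attributes: rsidDel, rsidR, rsidRPr, rsidRDefault, rsidP, rsidSect, rsidTr.
RSID_ATTRIBUTE = re.compile(
    r'\s+w:rsid(?:Del|RPr|RDefault|Sect|Tr|R|P)="[^"]*"'
)

def hide_review_tags(xml_text):
    """Return a copy of the XML text without w:rsid* attributes (the file is untouched)."""
    return RSID_ATTRIBUTE.sub("", xml_text)
```

Because the function works on a copy of the text, the document on disk keeps its review tags, which matches the behavior described for the option.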

 


diffopc+ options

 

The tool has a command-line mode to allow automated diffing : diffopcplus.exe [-d] <file1> <file2> [<outputFolder>].

The -d option filters out useless timestamp differences such as <ModifiedDate> in a document summary part of OPC packages.
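
The idea behind such a filter can be sketched as follows: blank out the content of timestamp elements on both sides before comparing, so that they always match. This is an illustration only, not the tool's implementation. The helper name mask_timestamps and the default element names are assumptions (ModifiedDate comes from the sentence above, dcterms:modified is the element found in OPC core properties parts).

```python
import re

def mask_timestamps(xml_text, names=("ModifiedDate", "dcterms:modified")):
    """Blank the text content of the named elements so that they compare equal."""
    for name in names:
        escaped = re.escape(name)
        pattern = re.compile(r"(<%s(?:\s[^>]*)?>)[^<]*(</%s>)" % (escaped, escaped))
        # Keep the opening and closing tags, drop only the timestamp value.
        xml_text = pattern.sub(r"\1\2", xml_text)
    return xml_text
```

Two saves of the same document then differ only where real content differs, which is what a regression test needs.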

The html files are stored in a workdir folder right beneath the executable by default. The workdir folder can be specified by passing it in the command-line arguments.
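
For automation, the command line above can be driven from a script. Here is a minimal Python sketch that builds the argument list in the documented order and launches the tool. The function names build_command and run_diff are assumptions, the executable is assumed to be reachable from the current directory or the PATH, and since the meaning of the exit code is not documented here, it is returned without interpretation.

```python
import subprocess

def build_command(file1, file2, output_folder=None, ignore_timestamps=False,
                  executable="diffopcplus.exe"):
    """Build: diffopcplus.exe [-d] <file1> <file2> [<outputFolder>]."""
    command = [executable]
    if ignore_timestamps:
        command.append("-d")
    command += [file1, file2]
    if output_folder is not None:
        command.append(output_folder)
    return command

def run_diff(file1, file2, output_folder=None, ignore_timestamps=False):
    """Run the tool and hand back its exit code (meaning not documented on this page)."""
    completed = subprocess.run(
        build_command(file1, file2, output_folder, ignore_timestamps))
    return completed.returncode
```

For example, build_command("Excel1_original.xlsx", "Excel1_modified.xlsx", "out", ignore_timestamps=True) produces the argument list for a timestamp-insensitive diff written to the out folder.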

 

How the tree view reveals the internal structure of packages

When viewing files individually, diffopc+ shows a tree structure whose indentation reflects the parent-children relations.


Actual tree view of parts (indentation reflects parent-children relations)

What we see in the screenshot above is that Book1.xlsx has a main part which has 3 child parts : docProps/app.xml, docProps/core.xml and xl/workbook.xml. And then we see that xl/workbook.xml has 3 child parts : xl/styles.xml, xl/theme/theme1.xml and xl/worksheets/sheet1.xml. And so on.
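
The parent-children relations come from the relationship (.rels) entries of the package. A simplified way to rebuild such a tree with Python's standard library is sketched below. It is an approximation, not the tool's code: the function names are assumptions, external targets are skipped, and part names are matched case-sensitively.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET

REL_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"

def rels_name(part):
    """Relationships entry for a part; '' stands for the package root."""
    folder, base = posixpath.split(part)
    return posixpath.join(folder, "_rels", base + ".rels")

def children(package, part):
    """Internal parts targeted by the relationships of `part`."""
    name = rels_name(part)
    if name not in package.namelist():
        return []
    found = []
    for rel in ET.fromstring(package.read(name)).iter(REL_NS + "Relationship"):
        if rel.get("TargetMode") == "External":
            continue
        target = rel.get("Target", "")
        if target.startswith("/"):
            found.append(target.lstrip("/"))  # absolute within the package
        else:
            # Relative targets are resolved against the folder of the source part.
            found.append(posixpath.normpath(
                posixpath.join(posixpath.dirname(part), target)))
    return found

def tree_rows(package, part="", depth=0, seen=None):
    """Yield (depth, part) rows; depth gives the indentation level."""
    seen = set() if seen is None else seen
    for child in children(package, part):
        yield depth, child
        if child not in seen:  # guard against relationship cycles
            seen.add(child)
            yield from tree_rows(package, child, depth + 1, seen)
```

Printing "  " * depth + part for each row of tree_rows(zipfile.ZipFile("Book1.xlsx")) gives an indented listing comparable to the screenshot.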


Actual tree view of OpenOffice parts (indentation reflects parent-children relations)

Likewise, the screenshot above shows the internal structure of OpenOffice documents, where meta-inf/manifest.xml is the main part, describing all child parts, indented accordingly.
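
For OpenOffice files, the list of child parts can be read straight from the manifest. The sketch below is a minimal illustration with Python's standard library. It assumes the entry is stored as META-INF/manifest.xml and uses the OpenDocument manifest namespace; older formats such as SXC may use a different namespace, and the function name manifest_entries is an assumption.

```python
import zipfile
import xml.etree.ElementTree as ET

MANIFEST_NS = "urn:oasis:names:tc:opendocument:xmlns:manifest:1.0"

def manifest_entries(package):
    """Return (full_path, media_type) pairs declared in META-INF/manifest.xml."""
    root = ET.fromstring(package.read("META-INF/manifest.xml"))
    entries = []
    for entry in root.iter("{%s}file-entry" % MANIFEST_NS):
        entries.append((
            entry.get("{%s}full-path" % MANIFEST_NS),
            entry.get("{%s}media-type" % MANIFEST_NS),
        ))
    return entries
```

Each returned path is a child of the main part, which is why the tree for OpenOffice documents is mostly one level deep.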

 

Navigation and validation features

There is a validation layer that changes the color of references to missing parts to dark red. While the navigation and validation features benefit all OPC-based files, i.e. Word® 2007, Excel® 2007, Powerpoint® 2007 and MS XPS documents, I wanted to also add features that are specific to certain types of documents.

Here is the list of features of diffopc+, on top of the general-purpose diffing features :

  • [navigation] relations and relationships hyperlinks for all OPC-based files.
  • [navigation] part hyperlinks for all OPC-based relations.
  • [validation] validation layer for all individual relations/parts of OPC-based files.
  • [validation] validation cross-layer for all relations/parts of OPC-based files.
  • [navigation] links to theme colors for all Powerpoint, Word and Excel files.
  • [navigation] additional hyperlinks for XPS documents
    • [navigation+validation] navigate internal fdseq, fdoc and fpage parts, as well as associated resources.
  • [navigation] additional hyperlinks for Excel® spreadsheets
    • [navigation] navigate, introspect (tool-tip) and validate shared strings
    • [navigation] navigate shared formulas
    • [navigation] navigate styles and all their ramifications (number format, font, border, fill, theme style, indexed color)
    • [navigation] navigate conditional formatting styles
    • [navigation] navigate table named styles
    • [navigation] navigate external references (cell formulas, defined names, conditional formattings, charts)
  • [navigation] additional hyperlinks for Word® documents
    • [navigation] navigate styles
    • [navigation] navigate footnotes and endnotes
    • [navigation] navigate bookmarks
  • [navigation] navigational links for styles in OpenOffice files (Calc, Writer, Impress, Base, Draw, Math)
    • Calc (*.ODS, *.OTS, *.SXC, *.STC)
    • Writer (*.ODT, *.OTT, *.SXW, *.STW)
    • Impress (*.ODP, *.OTP, *.SXI, *.STI)
    • Base (*.ODB)
    • Draw (*.ODG, *.OTG, *.SXD, *.STD)
    • Math (*.ODF, *.SXM)
  • [navigation] navigational links for internal resources in OpenOffice files through xlink:href attributes
  • [validation] validation for style inheritance across parts in OpenOffice files
    • :style-name anchors
    • style:___-name anchors
    • ___:___style-name anchors
    • parent-style-name and the likes
  • [navigation] navigational links for the main parts of OpenOffice files
  • [validation] validation layer for the main parts of OpenOffice files
  • [validation] tracking of duplicate relationship part identifiers
  • [readability] optional hiding/unhiding of Word 2007 review tags. Word 2007 documents contain a lot of review tags (w:rsidDel, w:rsidR, w:rsidRPr, w:rsidRDefault, w:rsidP, w:rsidSect, w:rsidTr) used for tracking changes made by users any time they save their documents. To hide/unhide the Word 2007 review tags, click on the Options button in diffopc+ and change the corresponding option.
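
Two of the validation checks listed above, missing parts and duplicate relationship identifiers, are easy to approximate in a script. The following sketch is only an illustration with Python's standard library, not the validation engine of diffopc+. The function name check_relationships and the wording of the messages are assumptions; external targets and anchor targets (of the form #Sheet2!A5) are skipped.

```python
import posixpath
import zipfile
from collections import Counter
import xml.etree.ElementTree as ET

REL_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"

def check_relationships(package):
    """Return (rels_entry, problem) tuples for duplicate Ids and missing targets."""
    names = set(package.namelist())
    problems = []
    for name in sorted(names):
        if not name.endswith(".rels"):
            continue
        # 'xl/_rels/workbook.xml.rels' describes targets relative to 'xl'.
        base = posixpath.dirname(posixpath.dirname(name))
        rels = list(ET.fromstring(package.read(name)).iter(REL_NS + "Relationship"))
        counts = Counter(rel.get("Id", "") for rel in rels)
        for rel_id, count in sorted(counts.items()):
            if count > 1:
                problems.append((name, "duplicate Id %s" % rel_id))
        for rel in rels:
            target = rel.get("Target", "")
            if rel.get("TargetMode") == "External" or target.startswith("#"):
                continue
            if target.startswith("/"):
                resolved = target.lstrip("/")
            else:
                resolved = posixpath.normpath(posixpath.join(base, target))
            if resolved not in names:
                problems.append((name, "missing part %s" % resolved))
    return problems
```

An empty result means that every internal relationship points to an existing ZIP entry and that no identifier is reused within a relationship part.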

To get a better sense of what this means, here are screen captures of the features :


Navigating relations is possible now. Just click!


Clicking navigates to the target relation and changes the background color to identify it.
Also, all parts are accessible by a simple click from there



When a part is missing (invalid rule), the background is turned to dark red.


Navigate internal XPS document parts.


Navigate internal Excel® shared strings (0-based index to another part).


Navigate internal Excel® shared strings (target string, background color changed to identify it).
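
What the shared string links resolve can be sketched in a few lines of Python. A cell marked t="s" stores in its v element a 0-based index into the list of si entries of the shared strings part. This is a simplified illustration: the function names are assumptions, and phonetic runs are not filtered out.

```python
import xml.etree.ElementTree as ET

MAIN_NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"

def shared_strings(sst_xml):
    """Strings of the shared strings part, in index order."""
    root = ET.fromstring(sst_xml)
    # An <si> entry holds either one <t> or several rich-text runs with their own <t>.
    return ["".join(t.text or "" for t in si.iter(MAIN_NS + "t"))
            for si in root.findall(MAIN_NS + "si")]

def cell_text(cell, strings):
    """Resolve a <c> element: t="s" means <v> is a 0-based shared string index."""
    value = cell.find(MAIN_NS + "v")
    if value is None:
        return None
    if cell.get("t") == "s":
        return strings[int(value.text)]
    return value.text
```

An index outside the list raises IndexError here, which is the kind of broken reference a validation layer would flag.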


You can click on s elements, indexing Excel® cell styles.


Target Excel® style, with all navigable ramifications (number format, font, border, fill, theme style).


You can click on Excel® conditional formatting styles.


Target Excel® conditional formatting style.


You can click on Excel® external references in formulas.


Target Excel® external reference, with clickable link to its actual definition.


You can click OpenOffice style references (tables, cells, charts, ...).


Target OpenOffice style definitions.

 

History

Oct 10, 2006 - first release
Oct 12, 2006 - fixed a bug in the UTF8 char handling. Added picture parts diffing.
Oct 16, 2006 - fixed the XML indent. Removed the modality of the dialog box. Removed the active content in the HTML source code which would bring a warning in IE XP SP2.
Oct 31, 2006 - facilitated single file introspection by auto-filling the second file path box. Added dialog shadow and minimize box.
Nov  8, 2006 - memory optimization phase I. Much faster and less memory consuming with large documents. Future phase II expected to bring near constant-memory diffing.
Nov 22, 2006 - diffopc+ released, a premium version of diffopc. A ton of interactions added. A validation layer that marks broken relationships or parts. Plus features specific to XPS and Excel® 2007 spreadsheets.
Dec 7, 2006 - UI improvement : keep dialog size. OPC management : removed a couple hardcoded part names to take full advantage of the underlying OPC library.
Dec 8, 2006 - added Excel 2007 color theme navigational links (both theme part and indexed colors)
Dec 17, 2006 - fixed a bug in the UI enabling the diff based on file paths.
Feb 24, 2007 - added row and column navigation styles for Excel 2007 packages.
Mar 17, 2007 - the main diff view is now sorted to get a clearer picture of how parts relate.
Mar 31, 2007 - navigation links for styles in OpenOffice documents (Calc, Writer, Impress).
Apr 18, 2007 - added -d option to command-line mode. This option allows timestamp-related differences, usually found in document summaries, to be ignored. Those fake differences are the result of applications automatically updating the document summary timestamp to reflect a new save, with the side effect that, without the -d option, it is impossible to automate the diffing of files as part of a regression detection system.
Apr 18, 2007 - added <outputFolder> option to the command-line options. With this, you can set where the output files go.
June 20, 2007 - added VML as regular XML part. Was considered raw text before.
July 1, 2007 - added navigational links for Word 2007 documents (styles, footnotes, endnotes and bookmarks)
July 6, 2007 - added tree view (reveals the structure)
July 7, 2007 - added navigation links for other OpenOffice documents (Base, Draw, Math)
July 8, 2007 - added navigation links and validation from main part for all OpenOffice documents
July 9, 2007 - added tree view for OpenOffice documents
July 10, 2007 - added proper style inheritance for all OpenOffice documents : :style-name anchors, style:___-name anchors, ___:___style-name anchors as well as parent-style-name anchors.
July 11, 2007 - added sorting to OOXML relationships to make it easier to compare entries in .rels
July 18, 2007 - added comprehensive reporting support for corrupt XML streams. diffopc+ now reports corrupted streams (impaired element, attribute, ...) in Red in the diff view, marking the first node that is corrupted. Plus the Corr label in the main diff tree view.
Nov 1, 2007 - workaround for an improper XML encoding in Microsoft-generated XML streams where attribute values don't encode apostrophes into XML entities.
Nov 6, 2007 - added built-in format tooltip in Excel style parts where the number format (numFmt attribute) is known to be built-in (i.e. implicit)
Dec 1, 2007 - excluded anchor relationships (of the form #Sheet2!A5) from validation checking, as it requires a lot more involvement to do it properly. Also, the sorting in the main diff view is now synced with relationships sorting so that it's more intuitive to navigate.
Dec 21, 2007 - anytime a relationship part is missing, it becomes flagged as a corrupt entry and is made visible as such. There is no longer the need to click on relationship parts manually.
Dec 21, 2007 - duplicate relationship part identifiers are flagged as corrupt entries. This applies to all OPC-based files.
Dec 21, 2007 - number of corrupt entries displayed in the main diff view, to better take advantage of both diff engine and validation engine.
Dec 22, 2007 - ability to hide review tags in Word 2007 documents. Click on the Options button, and check/uncheck the corresponding option.
Dec 22, 2007 - support for Office themes (.thmx)
Dec 28, 2007 - support for navigational links in external relationship targets such as http:// and mailto:
Dec 28, 2007 - better dependency tree in the main view, which allows easier and faster understanding of how parts relate together. For instance, Powerpoint slide parts have a slide layout part as a child dependency.
Dec 28, 2007 - additional checks in relationship parts. Check of Id/Target/TargetMode attribute values.
Dec 28, 2007 - additional navigational links for theme colors in Powerpoint files (slide layout, slide master, presentation), Word files (document part) and Excel files (drawing part, chart part).
Jan 7, 2008 - better diffing algorithms. Sometimes, diffing would create large blocks of deleted versus added lines which, while technically correct and accurate, was not the most intuitive. Now the algorithm creates diff blocks that are just the height of the actual changes.
Jan 12, 2008 - relationships in OPC files are now indented more as if they were children of their associated parts, making it more intuitive to read the tree views.
Jan 12, 2008 - legend in diffopc+ report adds a grey corrupted entry to explain what grey means in reports.
Feb 5, 2008 - attributes appear in light blue to make it easier to read XML streams.
Feb 14, 2008 - ability in the diffopc+ options to specify a custom CSS stylesheet for color codings, making it possible to adjust to color blind issues. The default CSS stylesheet is as follows :
diffopc.css
.N { background-color:white; } /* default */
.C { background-color:#FFD71D; } /* changed */
.A { background-color:#BBFFBB; } /* added */
.D { background-color:#E0686E; } /* deleted */
.CORR { background-color:#DDDDDD; } /* corrupted */
.MISS { background-color:red; } /* missing */
.B { color:#4444AA; } /* attributes */
Feb 22, 2008 - support for Chart templates (.crtx)
Feb 23, 2008 - support for r:link relationships
Mar 1, 2008 - binary parts now include a link in the main diff view to a simple page telling the size of the binary part. Such a page may host a hexadecimal dump in the future. (a hexadecimal dump is the common format when dealing with arbitrary binary parts)
Mar 1, 2008 - empty OpenOffice zip entries removed from the main diff view
Mar 1, 2008 - navigational links for xlink:href links to pictures and other resources in OpenOffice files
Mar 8, 2008 - navigational links for binary parts in Office files and XPS files
July 3, 2008 - fix for Firefox 3 (breaking change in Firefox 3's Javascript DOM collection handling)
July 18, 2008 - validation of relationship references (r:id, r:embed, r:link) in any part
Aug 2, 2008 - filename added in all diff views, which makes it easier to understand which side belongs to which file
Aug 2, 2008 - jump links added in diff views in order to scroll to the next diff in the current view. The jump link is the arrow in the middle of the screen.
Aug 12, 2008 - internal optimization of the software : memory consumption reduced by half and memory fragmentation greatly reduced (which avoids a slow down when large files are processed).
Aug 13, 2008 - progress bar added in the user interface to show how far the diffing process has gone.
Feb 6, 2009 - main diff view dynamically resizes itself to accommodate lengthy filenames.
March 5, 2009 - security update. Upgrade of ZIP library (now using version 1.2.3).
April 2, 2009 - improved handling of xlink:href links in OpenOffice files.
May 1, 2009 - support for binary XLS files, i.e. regular Excel spreadsheets.
June 28, 2009 - add theme links for spreadsheet tab colors
July 23, 2009 - add automatic sorting of [Content_Types].xml entries in OPC files, with the consequence that fewer false differences appear between OPC files. (a false difference is a difference related only to the order in which elements appear in the XML stream)
Oct 2, 2009 - more navigational links for conditional formatting dxfId(s) in OOXML files, notably auto-filters and tables.
Oct 17, 2009 - navigational links for shared formulas in spreadsheet OOXML files
Nov 17, 2009 - improved the dependency tree for cases where an entry could show up at level 0 instead of a deeper level
Nov 21, 2009 - more navigational links for named styles in Table part in spreadsheet OOXML files
Nov 25, 2009 - validation and introspection features for shared strings in Excel OOXML files.

 

 

Copyright ARsT Design 2010 - all rights reserved.