libopc: Open Packaging Convention
Note from Eric White: This is the first post in a series of guest posts by Florian Reuter. He has written a pretty cool library for working with OPC files (published at http://libopc.codeplex.com/). In upcoming posts, he is going to cover Markup Compatibility and Extensibility (MCE) and his libopc library.
The Open Packaging Convention (OPC) is Part 2 of the Office Open XML standard, the standard behind the new .docx, .xlsx and .pptx Office formats.
The OPC defines a container format which can store any kind of data and is not limited to Office formats. For example, the XML Paper Specification (XPS) also uses OPC as its packaging layer.
In many ways OPC can be seen as a successor of the OLE containers used by the proprietary .DOC, .XLS and .PPT formats. Unlike OLE containers, which are modeled on the FAT file system, OPC containers are valid ZIP archives plus some extra metadata.
This means that any OPC container can be opened with a ZIP program. Try it out yourself: create a .docx, .xlsx or .pptx file and change the extension to .zip. A simple double-click will expose the container structure in the built-in Windows viewer.
The metadata is encoded in the additional “_rels” folders and the “[Content_Types].xml”.
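The same experiment can be done in a few lines of code. The following is a minimal sketch using only the Python standard library, not libopc; the helper names (`is_opc_metadata`, `split_entries`, `build_sample`) are made up for this illustration, and the sample package is a tiny in-memory stand-in with placeholder content rather than a real Word document.

```python
import io
import zipfile


def is_opc_metadata(entry_name):
    """Return True for ZIP entries that carry OPC metadata, not part data."""
    segments = entry_name.split("/")
    in_rels_folder = len(segments) >= 2 and segments[-2] == "_rels"
    return entry_name == "[Content_Types].xml" or (
        in_rels_folder and entry_name.endswith(".rels")
    )


def split_entries(package):
    """Split a package's ZIP entries into (metadata, parts) name lists.

    `package` is a file path or a binary file-like object.
    """
    with zipfile.ZipFile(package) as archive:
        names = sorted(archive.namelist())
    metadata = [n for n in names if is_opc_metadata(n)]
    parts = [n for n in names if not is_opc_metadata(n)]
    return metadata, parts


def build_sample():
    """Build a tiny stand-in package in memory (entry bodies are placeholders)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in ("[Content_Types].xml", "_rels/.rels",
                     "word/_rels/document.xml.rels",
                     "word/document.xml", "word/styles.xml"):
            archive.writestr(name, "<placeholder/>")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    metadata, parts = split_entries(build_sample())
    print("metadata:", metadata)
    print("parts:   ", parts)
```

To try it on a real file, pass the path of a .docx file to `split_entries` instead of the in-memory sample.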
In order to really understand the OPC it is important to understand the abstract OPC container structure first.
Abstract OPC container structure
First of all, every OPC container specifies a set of MIME types, also known as content types. Typical content types in a .DOCX document are:
Content Types |
application/vnd.openxmlformats-officedocument.customXmlProperties+xml |
application/vnd.openxmlformats-officedocument.extended-properties+xml |
application/vnd.openxmlformats-officedocument.theme+xml |
application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml |
application/vnd.openxmlformats-officedocument.wordprocessingml.fontTable+xml |
application/vnd.openxmlformats-officedocument.wordprocessingml.settings+xml |
application/vnd.openxmlformats-officedocument.wordprocessingml.styles+xml |
application/vnd.openxmlformats-officedocument.wordprocessingml.webSettings+xml |
application/vnd.openxmlformats-package.core-properties+xml |
application/vnd.openxmlformats-package.relationships+xml |
application/xml |
Additionally, every OPC container has a “default” binding between an extension and a content type. E.g.:
Extension | Type |
rels | application/vnd.openxmlformats-package.relationships+xml |
xml | application/xml |
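To make the default binding concrete, here is a minimal sketch of how a part's content type could be looked up in “[Content_Types].xml”, again using plain Python rather than libopc. The function name `content_type_of` and the sample XML are made up for this illustration. Note one thing that goes beyond the text above: besides the `Default` entries, the file can also contain `Override` entries that bind one specific part name to a type, and the sketch checks those first.

```python
import xml.etree.ElementTree as ET

# Namespace of the elements inside [Content_Types].xml.
CT_NS = "{http://schemas.openxmlformats.org/package/2006/content-types}"

# Illustrative sample; a real package lists more entries.
SAMPLE_CONTENT_TYPES = """<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
  <Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
  <Default Extension="xml" ContentType="application/xml"/>
  <Override PartName="/word/document.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
</Types>"""


def content_type_of(part_name, content_types_xml):
    """Return the content type bound to `part_name`, or None if unbound."""
    root = ET.fromstring(content_types_xml)
    wanted = "/" + part_name.lstrip("/")

    # 1. An Override for exactly this part wins (names compared ignoring case).
    for override in root.iter(CT_NS + "Override"):
        if override.get("PartName", "").lower() == wanted.lower():
            return override.get("ContentType")

    # 2. Otherwise fall back to the default binding of the extension.
    filename = wanted.rsplit("/", 1)[-1]
    if "." in filename:
        extension = filename.rsplit(".", 1)[1].lower()
        for default in root.iter(CT_NS + "Default"):
            if default.get("Extension", "").lower() == extension:
                return default.get("ContentType")
    return None


if __name__ == "__main__":
    for name in ("word/document.xml", "customXml/item1.xml", "_rels/.rels"):
        print(name, "->", content_type_of(name, SAMPLE_CONTENT_TYPES))
```

With this sample, “customXml/item1.xml” falls back to the “xml” default, while “word/document.xml” gets its more specific type from the override.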
Next, every OPC container defines a set of “relation types”. Relation types have the same form as XML namespace names. Typical relation types in a .DOCX file are:
Relation Types |
http://schemas.openxmlformats.org/officeDocument/2006/relationships/customXml |
http://schemas.openxmlformats.org/officeDocument/2006/relationships/customXmlProps |
http://schemas.openxmlformats.org/officeDocument/2006/relationships/extended-properties |
http://schemas.openxmlformats.org/officeDocument/2006/relationships/fontTable |
http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink |
http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument |
http://schemas.openxmlformats.org/officeDocument/2006/relationships/settings |
http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles |
http://schemas.openxmlformats.org/officeDocument/2006/relationships/theme |
http://schemas.openxmlformats.org/officeDocument/2006/relationships/webSettings |
http://schemas.openxmlformats.org/package/2006/relationships/metadata/core-properties |
An OPC container also keeps a list of all external relations. E.g. when a .DOCX document contains a hyperlink to “http://naverage.com”, then this external link is stored as an external relation:
External Relations |
http://naverage.com |
Data is stored inside an OPC container as parts. A part has a hierarchical name and a type. Here are the typical parts of a .DOCX document:
Part | Type |
customXml/item1.xml | application/xml |
customXml/itemProps1.xml | application/vnd.openxmlformats-officedocument.customXmlProperties+xml |
docProps/app.xml | application/vnd.openxmlformats-officedocument.extended-properties+xml |
docProps/core.xml | application/vnd.openxmlformats-package.core-properties+xml |
word/document.xml | application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml |
word/fontTable.xml | application/vnd.openxmlformats-officedocument.wordprocessingml.fontTable+xml |
word/settings.xml | application/vnd.openxmlformats-officedocument.wordprocessingml.settings+xml |
word/styles.xml | application/vnd.openxmlformats-officedocument.wordprocessingml.styles+xml |
word/theme/theme1.xml | application/vnd.openxmlformats-officedocument.theme+xml |
word/webSettings.xml | application/vnd.openxmlformats-officedocument.wordprocessingml.webSettings+xml |
Finally, OPC containers store relations between parts. Consider e.g. the part “word/document.xml” and the part “word/styles.xml”. There is obviously a relation between these two parts, in that the “word/styles.xml” part contains the style definitions referenced in the “word/document.xml” part. Therefore in a typical .DOCX document a relation similar to the following is established:
Source | Id | Destination | Type |
word/document.xml | rId2 | word/styles.xml | http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles |
A relation inside an OPC container has a source part, a destination part as well as a relation id and a relation type. The relation id is unique with respect to the source part, i.e. no two relations which leave a source part have the same id.
An OPC container also has a virtual root part (here denoted with “[root]” or “/”), which is used to model the root of the relation hierarchy.
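On disk, the relations leaving a source live in a “.rels” file inside a “_rels” folder next to that source. The following minimal sketch, in plain Python and not using libopc, shows how such a file could be turned into the source / id / destination / type tuples described above. The names `Relation`, `rels_entry_for` and `parse_relations` are made up for this illustration, the virtual root is represented by an empty string, and the relative `Target` values in the sample are illustrative.

```python
import posixpath
import xml.etree.ElementTree as ET
from typing import NamedTuple

# Namespace of the elements inside a .rels file.
REL_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"
OFFICE_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships/"

# Illustrative relations leaving word/document.xml.
SAMPLE_RELS = (
    '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" Type="' + OFFICE_NS + 'customXml" Target="../customXml/item1.xml"/>'
    '<Relationship Id="rId2" Type="' + OFFICE_NS + 'styles" Target="styles.xml"/>'
    '<Relationship Id="rId5" Type="' + OFFICE_NS + 'hyperlink" Target="http://naverage.com"'
    ' TargetMode="External"/>'
    '</Relationships>'
)


class Relation(NamedTuple):
    source: str      # part name, or "" for the virtual root
    rel_id: str      # unique among the relations leaving `source`
    target: str      # part name, or a URI for an external relation
    rel_type: str
    external: bool


def rels_entry_for(source):
    """Name of the ZIP entry holding the relations that leave `source`."""
    folder, name = posixpath.split(source)
    return posixpath.join(folder, "_rels", name + ".rels")


def parse_relations(source, rels_xml):
    """Parse a .rels document into a list of Relation tuples."""
    base = posixpath.dirname(source)
    relations = []
    for element in ET.fromstring(rels_xml).iter(REL_NS + "Relationship"):
        external = element.get("TargetMode") == "External"
        target = element.get("Target", "")
        if not external:
            # Internal targets are resolved relative to the source's folder.
            target = posixpath.normpath(posixpath.join(base, target)).lstrip("/")
        relations.append(Relation(source, element.get("Id"), target,
                                  element.get("Type"), external))
    return relations


if __name__ == "__main__":
    print(rels_entry_for(""))                   # relations of the virtual root
    print(rels_entry_for("word/document.xml"))  # relations of the document part
    for relation in parse_relations("word/document.xml", SAMPLE_RELS):
        print(relation)
```

The external hyperlink keeps its URI as the target, while the two internal targets are resolved to the part names “customXml/item1.xml” and “word/styles.xml”.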
Here are the typical relations found in a .DOCX file:
Source | Id | Destination | Type |
[root] | rId1 | word/document.xml | http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument |
[root] | rId2 | docProps/core.xml | http://schemas.openxmlformats.org/package/2006/relationships/metadata/core-properties |
[root] | rId3 | docProps/app.xml | http://schemas.openxmlformats.org/officeDocument/2006/relationships/extended-properties |
Source | Id | Destination | Type |
customXml/item1.xml | rId1 | customXml/itemProps1.xml | http://schemas.openxmlformats.org/officeDocument/2006/relationships/customXmlProps |
Source | Id | Destination | Type |
word/document.xml | rId1 | customXml/item1.xml | http://schemas.openxmlformats.org/officeDocument/2006/relationships/customXml |
word/document.xml | rId2 | word/styles.xml | http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles |
word/document.xml | rId3 | word/settings.xml | http://schemas.openxmlformats.org/officeDocument/2006/relationships/settings |
word/document.xml | rId4 | word/webSettings.xml | http://schemas.openxmlformats.org/officeDocument/2006/relationships/webSettings |
word/document.xml | rId5 | http://naverage.com | http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink |
word/document.xml | rId6 | word/fontTable.xml | http://schemas.openxmlformats.org/officeDocument/2006/relationships/fontTable |
word/document.xml | rId7 | word/theme/theme1.xml | http://schemas.openxmlformats.org/officeDocument/2006/relationships/theme |
Navigating through an OPC container
One of the peculiarities of the OPC is how you navigate within an OPC container. Although most APIs give you the ability to access parts directly, the relations are usually used to find the right part.
Let’s suppose you want to open the document part of a DOCX document. The straightforward (but wrong) way would be to check whether the OPC container has a “word/document.xml” stream and open it if present. Even if you additionally check that the “word/document.xml” stream has the content type “application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml”, it would still be the wrong way to handle a DOCX document, since the part name “word/document.xml” is only a convention: a valid package may store the document part under any other name.
The right way to access the document part of a DOCX document is to check whether the OPC container has a relation leaving [root] of type “http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument”. If so, we follow the relation and then check the content type of the relation’s target part. If the content type of the target part is “application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml”, we have a DOCX document.
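The lookup just described can be sketched as follows. This is plain Python on top of the standard `zipfile` module, not the libopc API, and `find_main_document`, `content_type_of` and `build_sample` are names made up for this illustration. To stress that the part name does not matter, the in-memory sample package stores its document part under the hypothetical name “content/main.xml”.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET

REL_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"
CT_NS = "{http://schemas.openxmlformats.org/package/2006/content-types}"
OFFICE_DOCUMENT = ("http://schemas.openxmlformats.org/officeDocument/2006/"
                   "relationships/officeDocument")
DOCX_MAIN = ("application/vnd.openxmlformats-officedocument."
             "wordprocessingml.document.main+xml")


def content_type_of(archive, part_name):
    """Look up a part's content type: Override entries first, then Default."""
    root = ET.fromstring(archive.read("[Content_Types].xml"))
    for override in root.iter(CT_NS + "Override"):
        if override.get("PartName", "").lower() == "/" + part_name.lower():
            return override.get("ContentType")
    extension = posixpath.splitext(part_name)[1].lstrip(".").lower()
    for default in root.iter(CT_NS + "Default"):
        if extension and default.get("Extension", "").lower() == extension:
            return default.get("ContentType")
    return None


def find_main_document(package):
    """Return the part name of the DOCX document part, or None."""
    with zipfile.ZipFile(package) as archive:
        if "_rels/.rels" not in archive.namelist():
            return None
        # Step 1: look at the relations leaving the virtual root.
        root_rels = ET.fromstring(archive.read("_rels/.rels"))
        for rel in root_rels.iter(REL_NS + "Relationship"):
            if rel.get("Type") != OFFICE_DOCUMENT:
                continue
            if rel.get("TargetMode") == "External":
                continue
            # Step 2: follow the relation and check the target's content type.
            target = posixpath.normpath(rel.get("Target", "")).lstrip("/")
            if content_type_of(archive, target) == DOCX_MAIN:
                return target
    return None


def build_sample():
    """In-memory stand-in package with an unconventional part name."""
    content_types = (
        '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
        '<Default Extension="xml" ContentType="application/xml"/>'
        '<Override PartName="/content/main.xml" ContentType="' + DOCX_MAIN + '"/>'
        '</Types>')
    root_rels = (
        '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
        '<Relationship Id="rId1" Type="' + OFFICE_DOCUMENT + '" Target="content/main.xml"/>'
        '</Relationships>')
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("[Content_Types].xml", content_types)
        archive.writestr("_rels/.rels", root_rels)
        archive.writestr("content/main.xml", "<placeholder/>")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(find_main_document(build_sample()))
```

A check for a hard-coded “word/document.xml” entry would miss the document part of this sample, while following the root relation finds it.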
Libraries / libopc
Windows comes with two different libraries for handling OPC containers: an unmanaged COM-based API and a managed .NET-based API. Documentation about the two APIs can be found at http://msdn.microsoft.com/en-us/library/windows/desktop/dd742822.aspx and http://msdn.microsoft.com/en-us/library/system.io.packaging.aspx.
In this series of blog posts we will use libopc (libopc.codeplex.com), a free and open source library for dealing with the OPC which can be used on Windows as well as on Linux, iOS and Android.
Libopc comes with a command line tool “opc_dump” which can be used to dump the structure of an OPC container. This tool is very handy and it can be used like this:
> opc_dump "Hello World.docx" > dump.txt
Next
In the next post we will look at the layer above OPC, called Markup Compatibility and Extensibility (MCE), before we take a closer look at libopc.
Florian Reuter (CEO of Naverage UG http://naverage.com and coordinator of http://libopc.codeplex.com)