{"id":2777,"date":"2016-03-18T13:41:10","date_gmt":"2016-03-18T13:41:10","guid":{"rendered":"http:\/\/www.ericwhite.com\/home2\/bm8qcmjy\/public_html\/blog\/?page_id=2777"},"modified":"2016-03-20T11:48:14","modified_gmt":"2016-03-20T11:48:14","slug":"transforming-a-docx-with-hierarchical-numbering-into-different-xml","status":"publish","type":"page","link":"https:\/\/www.ericwhite.com\/blog\/transforming-a-docx-with-hierarchical-numbering-into-different-xml\/","title":{"rendered":"Transforming a Docx with Hierarchical Numbering into Different XML"},"content":{"rendered":"<p>The new <a href=\"https:\/\/www.ericwhite.com\/blog\/announcing-a-complete-re-write-of-listitemretriever-cs\/\">ListItemRetriever.cs module<\/a> in <a href=\"https:\/\/github.com\/OfficeDev\/Open-Xml-PowerTools\">PowerTools for Open XML<\/a> enables us to find out lots of information about the numbered and bulleted lists in our documents. &nbsp;We can write a small amount of code and transform a DOCX that contains hierarchical numbered lists into an alternative form of XML. 
&nbsp;For instance, we may want to transform a document that looks like this:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.ericwhite.com\/blog\/wp-content\/uploads\/2016\/03\/2656.DocxWithOutline-280x300.png\" alt=\"2656.DocxWithOutline\" width=\"280\" height=\"300\" class=\"alignnone size-medium wp-image-2780\" srcset=\"https:\/\/www.ericwhite.com\/blog\/wp-content\/uploads\/2016\/03\/2656.DocxWithOutline-280x300.png 280w, https:\/\/www.ericwhite.com\/blog\/wp-content\/uploads\/2016\/03\/2656.DocxWithOutline.png 603w\" sizes=\"auto, (max-width: 280px) 100vw, 280px\" \/><\/p>\n<p>Into XML that looks like this:<\/p>\n<pre class=\"prettyprint\">&lt;?xml version=\"1.0\" encoding=\"utf-8\"?&gt;\r\n&lt;Root&gt;\r\n  &lt;Paragraph&gt;Video provides a powerful way to help you prove your point.&lt;\/Paragraph&gt;\r\n  &lt;Indent Level=\"1\"&gt;\r\n    &lt;Heading&gt;Overview&lt;\/Heading&gt;\r\n    &lt;Indent Level=\"1.1\"&gt;\r\n      &lt;Heading&gt;Dr. Johnson&rsquo;s Vitae&lt;\/Heading&gt;\r\n      &lt;Paragraph&gt;You can also type a keyword.&lt;\/Paragraph&gt;\r\n    &lt;\/Indent&gt;\r\n    &lt;Indent Level=\"1.2\"&gt;\r\n      &lt;Heading&gt;Team List&lt;\/Heading&gt;\r\n      &lt;Paragraph&gt;John&lt;\/Paragraph&gt;\r\n      &lt;Paragraph&gt;Bob&lt;\/Paragraph&gt;\r\n      &lt;Paragraph&gt;Bill&lt;\/Paragraph&gt;\r\n      &lt;Paragraph&gt;Frank&lt;\/Paragraph&gt;\r\n    &lt;\/Indent&gt;\r\n    &lt;Indent Level=\"1.3\"&gt;\r\n      &lt;Heading&gt;Motivations&lt;\/Heading&gt;\r\n      &lt;Paragraph&gt;Lorem ipsum dolor sit amet.&lt;\/Paragraph&gt;\r\n    &lt;\/Indent&gt;\r\n    &lt;Indent Level=\"1.4\"&gt;\r\n      &lt;Heading&gt;Benefits&lt;\/Heading&gt;\r\n      &lt;Paragraph&gt;Nunc viverra imperdiet enim.&lt;\/Paragraph&gt;\r\n      &lt;Indent Level=\"1.4.1\"&gt;\r\n        &lt;Heading&gt;Aaa&lt;\/Heading&gt;\r\n      &lt;\/Indent&gt;\r\n      &lt;Indent Level=\"1.4.2\"&gt;\r\n        &lt;Heading&gt;Bbb&lt;\/Heading&gt;\r\n      
&lt;\/Indent&gt;\r\n      &lt;Indent Level=\"1.4.3\"&gt;\r\n        &lt;Heading&gt;CCC&lt;\/Heading&gt;\r\n      &lt;\/Indent&gt;\r\n    &lt;\/Indent&gt;\r\n  &lt;\/Indent&gt;\r\n  &lt;Indent Level=\"2\"&gt;\r\n    &lt;Heading&gt;Synopsis&lt;\/Heading&gt;\r\n    &lt;Indent Level=\"2.1\"&gt;\r\n      &lt;Heading&gt;Day 1 &ndash; Compression&lt;\/Heading&gt;\r\n      &lt;Paragraph&gt;Aenean nec lorem.&lt;\/Paragraph&gt;\r\n    &lt;\/Indent&gt;\r\n    &lt;Indent Level=\"2.2\"&gt;\r\n      &lt;Heading&gt;Day 2 &ndash; Distribution&lt;\/Heading&gt;\r\n      &lt;Paragraph&gt;In porttitor.&lt;\/Paragraph&gt;\r\n    &lt;\/Indent&gt;\r\n    &lt;Indent Level=\"2.3\"&gt;\r\n      &lt;Heading&gt;Day 3 &ndash; Time Line&lt;\/Heading&gt;\r\n      &lt;Paragraph&gt;Donec laoreet nonummy augue.&lt;\/Paragraph&gt;\r\n    &lt;\/Indent&gt;\r\n  &lt;\/Indent&gt;\r\n  &lt;Indent Level=\"3\"&gt;\r\n    &lt;Heading&gt;Technical Details&lt;\/Heading&gt;\r\n    &lt;Indent Level=\"3.1\"&gt;\r\n      &lt;Heading&gt;Engine&lt;\/Heading&gt;\r\n    &lt;\/Indent&gt;\r\n    &lt;Indent Level=\"3.2\"&gt;\r\n      &lt;Heading&gt;Passenger Compartment&lt;\/Heading&gt;\r\n    &lt;\/Indent&gt;\r\n    &lt;Indent Level=\"3.3\"&gt;\r\n      &lt;Heading&gt;Trunk&lt;\/Heading&gt;\r\n    &lt;\/Indent&gt;\r\n    &lt;Indent Level=\"3.4\"&gt;\r\n      &lt;Heading&gt;Tires&lt;\/Heading&gt;\r\n    &lt;\/Indent&gt;\r\n  &lt;\/Indent&gt;\r\n  &lt;Indent Level=\"4\"&gt;\r\n    &lt;Heading&gt;Summary&lt;\/Heading&gt;\r\n    &lt;Indent Level=\"4.1\"&gt;\r\n      &lt;Heading&gt;Contraindications&lt;\/Heading&gt;\r\n    &lt;\/Indent&gt;\r\n    &lt;Indent Level=\"4.2\"&gt;\r\n      &lt;Heading&gt;Index&lt;\/Heading&gt;\r\n    &lt;\/Indent&gt;\r\n  &lt;\/Indent&gt;\r\n&lt;\/Root&gt;<\/pre>\n<p>The ListItemRetriever module assembles detailed information about each item in a numbered list, and this information is easy to retrieve &#8211; it is stored as annotations on the paragraph XML elements after calling 
ListItemRetriever.RetrieveListItem. &nbsp;The following code accomplishes the above transform:<\/p>\n<pre class=\"prettyprint\">\/***************************************************************************\r\n\r\nCopyright (c) Microsoft Corporation 2014.\r\n\r\nThis code is licensed using the Microsoft Public License (Ms-PL).  The text of the license\r\ncan be found here:\r\n\r\nhttp:\/\/www.microsoft.com\/resources\/sharedsource\/licensingbasics\/publiclicense.mspx\r\n\r\n***************************************************************************\/\r\n\r\nusing System;\r\nusing System.Collections.Generic;\r\nusing System.IO;\r\nusing System.Linq;\r\nusing System.Text;\r\nusing System.Threading.Tasks;\r\nusing System.Xml.Linq;\r\nusing DocumentFormat.OpenXml.Packaging;\r\nusing OpenXmlPowerTools;\r\n\r\nclass ListItemRetriever01\r\n{\r\n    private class XmlStackItem\r\n    {\r\n        public XElement Element;\r\n        public int[] LevelNumbers;\r\n    }\r\n\r\n    \/*\r\n     * This example opens the source document read-only (the second argument to Open is false),\r\n     * so the source document is never modified.  It deletes any XML files left in the current\r\n     * directory by previous runs, then writes the transformed XML to Out.xml.\r\n     *\/\r\n\r\n    static void Main(string[] args)\r\n    {\r\n        var thisDir = new DirectoryInfo(\".\");\r\n        foreach (var xmlFile in thisDir.GetFiles(\"*.xml\"))\r\n            xmlFile.Delete();\r\n        using (WordprocessingDocument wDoc =\r\n            WordprocessingDocument.Open(\"..\/..\/NumberedListTest.docx\", false))\r\n        {\r\n            int abstractNumId = 0;\r\n            XElement xml = ConvertDocToXml(wDoc, abstractNumId);\r\n            Console.WriteLine(xml);\r\n            xml.Save(\"Out.xml\");\r\n        }\r\n        Console.WriteLine(\"Press Enter\");\r\n        Console.ReadLine();\r\n    }\r\n\r\n    private static XElement ConvertDocToXml(WordprocessingDocument wDoc, int abstractNumId)\r\n    {\r\n        XDocument xd = wDoc.MainDocumentPart.GetXDocument();\r\n\r\n        
\/\/ First, call RetrieveListItem so that all paragraphs are initialized with ListItemInfo\r\n        var firstParagraph = xd.Descendants(W.p).FirstOrDefault();\r\n        var listItem = ListItemRetriever.RetrieveListItem(wDoc, firstParagraph);\r\n\r\n        XElement xml = new XElement(\"Root\");\r\n        var current = new Stack&lt;XmlStackItem&gt;();\r\n        current.Push(\r\n            new XmlStackItem()\r\n            {\r\n                Element = xml,\r\n                LevelNumbers = new int[] { },\r\n            });\r\n        foreach (var paragraph in xd.Descendants(W.p))\r\n        {\r\n            \/\/ The following does not take into account documents that have tracked revisions.\r\n            \/\/ As necessary, call RevisionAccepter.AcceptRevisions before converting to XML.\r\n            var text = paragraph.Descendants(W.t).Select(t =&gt; (string)t).StringConcatenate();\r\n            ListItemRetriever.ListItemInfo lii = \r\n                paragraph.Annotation&lt;ListItemRetriever.ListItemInfo&gt;();\r\n            if (lii.IsListItem &amp;&amp; lii.AbstractNumId == abstractNumId)\r\n            {\r\n                ListItemRetriever.LevelNumbers levelNums = \r\n                    paragraph.Annotation&lt;ListItemRetriever.LevelNumbers&gt;();\r\n                if (levelNums.LevelNumbersArray.Length == current.Peek().LevelNumbers.Length)\r\n                {\r\n                    current.Pop();\r\n                    var levelNumsForThisIndent = levelNums.LevelNumbersArray;\r\n                    string levelText = levelNums\r\n                        .LevelNumbersArray\r\n                        .Select(l =&gt; l.ToString() + \".\")\r\n                        .StringConcatenate()\r\n                        .TrimEnd('.');\r\n                    var newCurrentElement = new XElement(\"Indent\",\r\n                        new XAttribute(\"Level\", levelText));\r\n                    current.Peek().Element.Add(newCurrentElement);\r\n               
     current.Push(\r\n                        new XmlStackItem()\r\n                        {\r\n                            Element = newCurrentElement,\r\n                            LevelNumbers = levelNumsForThisIndent,\r\n                        });\r\n                    current.Peek().Element.Add(new XElement(\"Heading\", text));\r\n                }\r\n                else if (levelNums.LevelNumbersArray.Length &gt; current.Peek().LevelNumbers.Length)\r\n                {\r\n                    \/\/ Create one Indent for each level between the current depth and the new depth,\r\n                    \/\/ labeling each Indent with its own prefix of the level numbers.\r\n                    for (int i = current.Peek().LevelNumbers.Length; \r\n                        i &lt; levelNums.LevelNumbersArray.Length; \r\n                        i++)\r\n                    {\r\n                        var levelNumsForThisIndent = levelNums\r\n                            .LevelNumbersArray\r\n                            .Take(i + 1)\r\n                            .ToArray();\r\n                        string levelText = levelNumsForThisIndent\r\n                            .Select(l =&gt; l.ToString() + \".\")\r\n                            .StringConcatenate()\r\n                            .TrimEnd('.');\r\n                        var newCurrentElement = new XElement(\"Indent\",\r\n                            new XAttribute(\"Level\", levelText));\r\n                        current.Peek().Element.Add(newCurrentElement);\r\n                        current.Push(\r\n                            new XmlStackItem()\r\n                            {\r\n                                Element = newCurrentElement,\r\n                                LevelNumbers = levelNumsForThisIndent,\r\n                            });\r\n                    }\r\n                    \/\/ Only the deepest new Indent receives the heading text.\r\n                    current.Peek().Element.Add(new XElement(\"Heading\", text));\r\n                }\r\n                else if (levelNums.LevelNumbersArray.Length &lt; current.Peek().LevelNumbers.Length)\r\n                {\r\n                    for (int i = 
current.Peek().LevelNumbers.Length;\r\n                        i &gt; levelNums.LevelNumbersArray.Length; \r\n                        i--)\r\n                        current.Pop();\r\n                    current.Pop();\r\n                    var levelNumsForThisIndent = levelNums.LevelNumbersArray;\r\n                    string levelText = levelNums\r\n                        .LevelNumbersArray\r\n                        .Select(l =&gt; l.ToString() + \".\")\r\n                        .StringConcatenate()\r\n                        .TrimEnd('.');\r\n                    var newCurrentElement = new XElement(\"Indent\",\r\n                        new XAttribute(\"Level\", levelText));\r\n                    current.Peek().Element.Add(newCurrentElement);\r\n                    current.Push(\r\n                        new XmlStackItem()\r\n                        {\r\n                            Element = newCurrentElement,\r\n                            LevelNumbers = levelNumsForThisIndent,\r\n                        });\r\n                    current.Peek().Element.Add(new XElement(\"Heading\", text));\r\n                }\r\n            }\r\n            else\r\n            {\r\n                current.Peek().Element.Add(new XElement(\"Paragraph\", text));\r\n            }\r\n        }\r\n        return xml;\r\n    }\r\n}<\/pre>\n<p>This use of the ListItemRetriever module is important enough that I&#8217;ve incorporated it into the core examples that I deliver with Open XML PowerTools.<\/p>\n<p>One note about the above code &#8211; I didn&#8217;t write it in a pure functional style. &nbsp;I wrote it as procedural code that uses a stack. &nbsp;Previously, I posted some <a href=\"http:\/\/blogs.msdn.com\/b\/ericwhite\/archive\/2008\/07\/10\/how-to-create-hierarchy-from-flat-data-using-linq.aspx\" class=\"broken_link\">LINQ code that took a recursive, functional approach<\/a>. 
&nbsp;I am not sure which is more approachable for less experienced developers, but I suspect that the procedural approach that uses a stack might be easier.<\/p>\n<p>This code isn&#8217;t data driven &#8211; it produces a specific XML structure that is hard coded into the example. &nbsp;It could be possible to convert this code to another more flexible form that could produce a variety of XML shapes. &nbsp;The code is only about 150 lines long &#8211; not too complicated, so it should be easy for developers to morph the code into a form that fits their scenario.<\/p>\n<p>Cheers, Eric<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The new ListItemRetriever.cs module in PowerTools for Open XML enables us to find out lots of information about the numbered and bulleted lists in our documents. &nbsp;We can write a small amount of code and transform a DOCX that contains hierarchical numbered lists into an alternative form of XML. &nbsp;For instance, we may want to [&hellip;]<\/p>\n","protected":false},"author":10567,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"_s2mail":"","footnotes":""},"class_list":["post-2777","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/www.ericwhite.com\/blog\/wp-json\/wp\/v2\/pages\/2777","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.ericwhite.com\/blog\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/www.ericwhite.com\/blog\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/www.ericwhite.com\/blog\/wp-json\/wp\/v2\/users\/10567"}],"replies":[{"embeddable":true,"href":"https:\/\/www.ericwhite.com\/blog\/wp-json\/wp\/v2\/comments?post=2777"}],"ver
sion-history":[{"count":5,"href":"https:\/\/www.ericwhite.com\/blog\/wp-json\/wp\/v2\/pages\/2777\/revisions"}],"predecessor-version":[{"id":3102,"href":"https:\/\/www.ericwhite.com\/blog\/wp-json\/wp\/v2\/pages\/2777\/revisions\/3102"}],"wp:attachment":[{"href":"https:\/\/www.ericwhite.com\/blog\/wp-json\/wp\/v2\/media?parent=2777"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}