Using XML DOM to Detect Tracked Revisions in an Open XML WordprocessingML Document

I’ve written a short article at OpenXMLDeveloper.org that shows how to detect tracked revisions using XmlDocument. Previously, I wrote an article on detecting tracked revisions using LINQ to XML or the strongly-typed object model of the Open XML SDK 2.0. However, some developers do not have the option of using LINQ, and instead must use one of a variety of XML DOM Document implementations. The OpenXMLDeveloper.org article presents some XmlDocument code that is pretty easy to translate to other languages and platforms.
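As a rough illustration of the general idea (this is a sketch, not the code from the OpenXMLDeveloper.org article), the following loads WordprocessingML markup into an XmlDocument and uses XPath to look for revision-tracking elements. The element list is a partial, illustrative subset, and the names RevisionDetector and HasTrackedRevisions are hypothetical. In practice the XML would come from the main document part of the package; an inline string keeps the sketch self-contained.

```csharp
using System;
using System.Xml;

public static class RevisionDetector
{
    const string W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Assumption: a partial list of revision-tracking element names, for illustration only.
    static readonly string[] RevisionElementNames =
    {
        "ins", "del", "moveFrom", "moveTo", "rPrChange", "pPrChange", "sectPrChange"
    };

    // Returns true if any of the listed revision elements appears anywhere in the part.
    public static bool HasTrackedRevisions(XmlDocument part)
    {
        XmlNamespaceManager nsmgr = new XmlNamespaceManager(part.NameTable);
        nsmgr.AddNamespace("w", W);
        foreach (string name in RevisionElementNames)
        {
            if (part.SelectSingleNode("//w:" + name, nsmgr) != null)
                return true;
        }
        return false;
    }

    public static void Main()
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(
            "<w:document xmlns:w='" + W + "'><w:body><w:p>" +
            "<w:ins w:id='1' w:author='A'><w:r><w:t>new</w:t></w:r></w:ins>" +
            "</w:p></w:body></w:document>");
        Console.WriteLine(HasTrackedRevisions(doc)); // prints True
    }
}
```

Because the sketch relies only on a namespace manager and XPath, it should translate fairly directly to other XML DOM implementations.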


Custom Formatting of XML using LINQ to XML

On StackOverflow, there is a question (posted by Otaku, an online friend of mine for some time) about how to serialize multiple XML elements on the same line.  It is a very interesting question.  After going down a couple of dead ends, I realized that it is pretty easy to iterate through an XML tree and write everything to an XmlWriter explicitly, bypassing LINQ to XML’s own logic for serializing through an XmlWriter.  This lets us do just about anything we want with the indentation of the XML, while still letting the XmlWriter class do all of the serializing of the XML itself.  Some folks at StackOverflow suggested post-processing the XML, but I know from hard experience that it is very difficult to post-process XML and really get it right, including handling CDATA sections and so on.  By letting the XmlWriter class do all of the output of XML, while injecting just a bit of white space in the right places, we can be confident of the validity of the XML.

His question: he has XML that looks like this:


<Canvas>
  <Grid>
    <TextBlock>
      <Run Text="r"/>
      <Run Text="u"/>
      <Run Text="n"/>
    </TextBlock>
    <TextBlock>
      <Run Text="far a"/>
      <Run Text="way"/>
      <Run Text=" from me"/>
    </TextBlock>
  </Grid>
  <Grid>
    <TextBlock>
      <Run Text="I"/>
      <Run Text=" "/>
      <Run Text="want"/>
      <LineBreak/>
    </TextBlock>
    <TextBlock>
      <LineBreak/>
      <Run Text="...thi"/>
      <Run Text="s to"/>
      <LineBreak/>
      <Run Text=" work"/>
    </TextBlock>
  </Grid>
</Canvas>

He wants to format it so that it looks like this:


<Canvas>
  <Grid>
    <TextBlock>
      <Run Text="r"/><Run Text="u"/><Run Text="n"/>
    </TextBlock>
    <TextBlock>
      <Run Text="far a"/><Run Text="way"/><Run Text=" from me"/>
    </TextBlock>
  </Grid>
  <Grid>
    <TextBlock>
      <Run Text="I"/><Run Text=" "/><Run Text="want"/>
      <LineBreak/>
    </TextBlock>
    <TextBlock>
      <LineBreak/>
      <Run Text="...thi"/><Run Text="s to"/>
      <LineBreak/>
      <Run Text=" work"/>
    </TextBlock>
  </Grid>
</Canvas>

He wants to do this because of some fairly obscure semantics of XAML for Silverlight 3.  Read his question on StackOverflow for more detail.

I posted code on StackOverflow that shows how to do that specialized serialization using VB.NET.  I actually wrote the code in C# first and, after getting it all working, translated it to VB.NET.  This post presents the C# code, followed by the VB code.

The key to solving this problem is to write a recursive function that iterates through the XML tree, writing the various elements and attributes to specially created XmlWriter objects.  There is an ‘outer’ XmlWriter object that writes indented XML, and an ‘inner’ XmlWriter object that writes non-indented XML.

The recursive function initially uses the ‘outer’ XmlWriter, writing indented XML, until it sees the TextBlock element (an element that triggers a desired change in the indenting behavior).  When it encounters the TextBlock element, it creates the ‘inner’ XmlWriter object, writing the child elements of the TextBlock element to it.  It also writes custom white space to the ‘inner’ XmlWriter.

When the ‘inner’ XmlWriter object is finished with writing the TextBlock element, the text that the ‘inner’ writer wrote is written to the ‘outer’ XmlWriter using the WriteRaw method.

As I mentioned, the advantage of this approach is that there is no post-processing of the XML.  It is extremely difficult to post-process XML and be certain that you have properly handled all cases, including arbitrary text in CDATA sections.  All of the XML is written using only the XmlWriter class, which ensures that the markup itself is always well-formed.  The only exception is the specially crafted white space that is written using the WriteRaw method, which achieves the desired indenting behavior.

One key point is that the ‘inner’ XmlWriter object’s conformance level is set to ConformanceLevel.Fragment, because the ‘inner’ XmlWriter needs to write XML that does not have a root element.
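To make the role of ConformanceLevel.Fragment concrete, here is a minimal sketch, separate from the full listing, of an ‘inner’ writer that emits raw indentation followed by two sibling elements with no root element. The class and method names (FragmentDemo, WriteSiblingsOnOneLine) are hypothetical and exist only for this illustration.

```csharp
using System;
using System.Text;
using System.Xml;

public static class FragmentDemo
{
    // Writes two sibling Run elements on one line, preceded by hand-written indentation.
    public static string WriteSiblingsOnOneLine()
    {
        XmlWriterSettings settings = new XmlWriterSettings();
        settings.Indent = false;                 // no automatic white space
        settings.OmitXmlDeclaration = true;
        // Fragment conformance allows more than one top-level element.
        settings.ConformanceLevel = ConformanceLevel.Fragment;

        StringBuilder sb = new StringBuilder();
        using (XmlWriter writer = XmlWriter.Create(sb, settings))
        {
            writer.WriteRaw("  ");               // custom indentation, written verbatim
            writer.WriteStartElement("Run");
            writer.WriteAttributeString("Text", "a");
            writer.WriteEndElement();
            writer.WriteStartElement("Run");     // second top-level element
            writer.WriteAttributeString("Text", "b");
            writer.WriteEndElement();
        }
        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(WriteSiblingsOnOneLine());
    }
}
```

With the default document-level conformance, the writer would reject the second top-level element, which is why the fragment setting matters here.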

To achieve the desired formatting of Run elements (i.e. adjacent Run elements have no insignificant white space between them), the code uses the GroupAdjacent extension method.  Some time ago, I wrote a blog post on the GroupAdjacent extension method.
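To show what grouping adjacent items means, here is a simplified stand-in, assuming nothing beyond the base class library. It is not the GroupAdjacent implementation used in this post, and the name GroupAdjacentSimple is hypothetical. Unlike GroupBy, it starts a new group every time the key changes, so items with the same key that are separated by something else end up in different groups.

```csharp
using System;
using System.Collections.Generic;

public static class GroupAdjacentSketch
{
    // Yields one (key, items) pair for each run of adjacent items that share a key.
    public static IEnumerable<KeyValuePair<TKey, List<T>>> GroupAdjacentSimple<T, TKey>(
        IEnumerable<T> source, Func<T, TKey> keySelector)
    {
        EqualityComparer<TKey> comparer = EqualityComparer<TKey>.Default;
        List<T> current = null;
        TKey currentKey = default(TKey);
        foreach (T item in source)
        {
            TKey key = keySelector(item);
            // Key changed: emit the finished group.
            if (current != null && !comparer.Equals(key, currentKey))
            {
                yield return new KeyValuePair<TKey, List<T>>(currentKey, current);
                current = null;
            }
            if (current == null)
            {
                current = new List<T>();
                currentKey = key;
            }
            current.Add(item);
        }
        if (current != null)
            yield return new KeyValuePair<TKey, List<T>>(currentKey, current);
    }

    public static void Main()
    {
        string[] names = { "Run", "Run", "LineBreak", "Run" };
        foreach (var group in GroupAdjacentSimple(names, n => n == "Run"))
            Console.WriteLine(group.Key + ": " + string.Join(",", group.Value));
        // prints:
        // True: Run,Run
        // False: LineBreak
        // True: Run
    }
}
```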

Here is the C# code to do the specialized formatting:


using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;

public static class Extensions
{
    public static IEnumerable<IGrouping<TKey, TSource>> GroupAdjacent<TSource, TKey>(
        this IEnumerable<TSource> source,
        Func<TSource, TKey> keySelector)
    {
        TKey last = default(TKey);
        bool haveLast = false;
        List<TSource> list = new List<TSource>();

        foreach (TSource s in source)
        {
            TKey k = keySelector(s);
            if (haveLast)
            {
                // Use the default comparer so that a null key does not throw.
                if (!EqualityComparer<TKey>.Default.Equals(k, last))
                {
                    yield return new GroupOfAdjacent<TSource, TKey>(list, last);
                    list = new List<TSource>();
                    list.Add(s);
                    last = k;
                }
                else
                {
                    list.Add(s);
                    last = k;
                }
            }
            else
            {
                list.Add(s);
                last = k;
                haveLast = true;
            }
        }
        if (haveLast)
            yield return new GroupOfAdjacent<TSource, TKey>(list, last);
    }
}

public class GroupOfAdjacent<TSource, TKey> : IEnumerable<TSource>, IGrouping<TKey, TSource>
{
    public TKey Key { get; set; }
    private List<TSource> GroupList { get; set; }

    System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
    {
        return ((System.Collections.Generic.IEnumerable<TSource>)this).GetEnumerator();
    }

    System.Collections.Generic.IEnumerator<TSource>
        System.Collections.Generic.IEnumerable<TSource>.GetEnumerator()
    {
        foreach (var s in GroupList)
            yield return s;
    }

    public GroupOfAdjacent(List<TSource> source, TKey key)
    {
        GroupList = source;
        Key = key;
    }
}

class Program
{
    static void WriteStartElement(XmlWriter writer, XElement e)
    {
        XNamespace ns = e.Name.Namespace;
        writer.WriteStartElement(e.GetPrefixOfNamespace(ns),
            e.Name.LocalName, ns.NamespaceName);
        foreach (var a in e.Attributes())
        {
            ns = a.Name.Namespace;
            string localName = a.Name.LocalName;
            string namespaceName = ns.NamespaceName;
            writer.WriteAttributeString(
                e.GetPrefixOfNamespace(ns),
                localName,
                namespaceName.Length == 0 && localName == "xmlns" ?
                    XNamespace.Xmlns.NamespaceName :
                    namespaceName,
                a.Value);
        }
    }

    public static void WriteElement(XmlWriter writer, XElement e)
    {
        if (e.Name == "TextBlock")
        {
            WriteStartElement(writer, e);
            writer.WriteRaw(Environment.NewLine);

            // Create an XML writer that outputs no insignificant white space so that we can
            // write to it and explicitly control white space.
            XmlWriterSettings settings = new XmlWriterSettings();
            settings.Indent = false;
            settings.OmitXmlDeclaration = true;
            settings.ConformanceLevel = ConformanceLevel.Fragment;
            StringBuilder sb = new StringBuilder();
            using (XmlWriter newXmlWriter = XmlWriter.Create(sb, settings))
            {
                // Group adjacent runs so that they can be output with no whitespace between them
                var groupedRuns = e.Nodes().GroupAdjacent(n =>
                {
                    XElement element = n as XElement;
                    if (element != null && element.Name == "Run")
                        return true;
                    return false;
                });
                foreach (var g in groupedRuns)
                {
                    if (g.Key == true)
                    {
                        // Write white space so that the line of Run elements is properly indented.
                        newXmlWriter.WriteRaw("".PadRight((e.Ancestors().Count() + 1) * 2));
                        foreach (var run in g)
                            run.WriteTo(newXmlWriter);
                        newXmlWriter.WriteRaw(Environment.NewLine);
                    }
                    else
                    {
                        foreach (var g2 in g)
                        {
                            // Write some white space so that each child element is properly indented.
                            newXmlWriter.WriteRaw("".PadRight((e.Ancestors().Count() + 1) * 2));
                            g2.WriteTo(newXmlWriter);
                            newXmlWriter.WriteRaw(Environment.NewLine);
                        }
                    }
                }
            }
            writer.WriteRaw(sb.ToString());
            writer.WriteRaw("".PadRight(e.Ancestors().Count() * 2));
            writer.WriteEndElement();
        }
        else
        {
            WriteStartElement(writer, e);
            foreach (var n in e.Nodes())
            {
                XElement element = n as XElement;
                if (element != null)
                {
                    WriteElement(writer, element);
                    continue;
                }
                n.WriteTo(writer);
            }
            writer.WriteEndElement();
        }
    }

    static string ToStringWithCustomWhiteSpace(XElement element)
    {
        // Create XmlWriter that indents.
        XmlWriterSettings settings = new XmlWriterSettings();
        settings.Indent = true;
        settings.OmitXmlDeclaration = true;
        StringBuilder sb = new StringBuilder();
        using (XmlWriter xmlWriter = XmlWriter.Create(sb, settings))
            WriteElement(xmlWriter, element);
        return sb.ToString();
    }

    static void Main(string[] args)
    {
        XElement root = XElement.Parse(
@"<Canvas a='1'>
  <Grid>
    <TextBlock>
      <Run Text='r'/>
      <Run Text='u'/>
      <Run Text='n'/>
    </TextBlock>
    <TextBlock>
      <Run Text='far a'/>
      <Run Text='way'/>
      <Run Text=' from me'/>
    </TextBlock>
  </Grid>
  <Grid>
    <TextBlock>
      <Run Text='I'/>
      <Run Text=' '/>
      <Run Text='want'/>
      <LineBreak/>
    </TextBlock>
    <TextBlock>
      <LineBreak/>
      <Run Text='...thi'/>
      <Run Text='s to'/>
      <LineBreak/>
      <Run Text=' work'/>
    </TextBlock>
  </Grid>
</Canvas>");
        Console.WriteLine(ToStringWithCustomWhiteSpace(root));
    }
}

And for completeness, here is the VB code:


Imports System.Collections.Generic
Imports System.Linq
Imports System.Text
Imports System.Xml
Imports System.Xml.Linq

Public Class GroupOfAdjacent(Of TElement, TKey)
    Implements IEnumerable(Of TElement)

    Private _key As TKey
    Private _groupList As List(Of TElement)

    Public Property GroupList() As List(Of TElement)
        Get
            Return _groupList
        End Get
        Set(ByVal value As List(Of TElement))
            _groupList = value
        End Set
    End Property

    Public ReadOnly Property Key() As TKey
        Get
            Return _key
        End Get
    End Property

    Public Function GetEnumerator() As System.Collections.Generic.IEnumerator(Of TElement) _
            Implements System.Collections.Generic.IEnumerable(Of TElement).GetEnumerator
        Return _groupList.GetEnumerator
    End Function

    Public Function GetEnumerator1() As System.Collections.IEnumerator _
            Implements System.Collections.IEnumerable.GetEnumerator
        Return _groupList.GetEnumerator
    End Function

    Public Sub New(ByVal key As TKey)
        _key = key
        _groupList = New List(Of TElement)
    End Sub
End Class

Module Module1
    <System.Runtime.CompilerServices.Extension()> _
    Public Function GroupAdjacent(Of TElement, TKey)(ByVal source As IEnumerable(Of TElement), _
                ByVal keySelector As Func(Of TElement, TKey)) As List(Of GroupOfAdjacent(Of TElement, TKey))
        Dim currentGroup As GroupOfAdjacent(Of TElement, TKey) = Nothing
        Dim allGroups As List(Of GroupOfAdjacent(Of TElement, TKey)) = New List(Of GroupOfAdjacent(Of TElement, TKey))()
        For Each item In source
            Dim thisKey As TKey = keySelector(item)
            ' Start a new group on the first item and whenever the key changes.
            If currentGroup Is Nothing OrElse Not Object.Equals(thisKey, currentGroup.Key) Then
                currentGroup = New GroupOfAdjacent(Of TElement, TKey)(thisKey)
                allGroups.Add(currentGroup)
            End If
            currentGroup.GroupList.Add(item)
        Next
        Return allGroups
    End Function

    Public Sub WriteStartElement(ByVal writer As XmlWriter, ByVal e As XElement)
        Dim ns As XNamespace = e.Name.Namespace
        writer.WriteStartElement(e.GetPrefixOfNamespace(ns), _
            e.Name.LocalName, ns.NamespaceName)
        For Each a In e.Attributes
            ns = a.Name.Namespace
            Dim localName As String = a.Name.LocalName
            Dim namespaceName As String = ns.NamespaceName
            writer.WriteAttributeString( _
                e.GetPrefixOfNamespace(ns), _
                localName, _
                If(namespaceName.Length = 0 AndAlso localName = "xmlns", _
                    XNamespace.Xmlns.NamespaceName, namespaceName), _
                a.Value)
        Next
    End Sub

    Public Sub WriteElement(ByVal writer As XmlWriter, ByVal e As XElement)
        If (e.Name = "TextBlock") Then
            WriteStartElement(writer, e)
            writer.WriteRaw(Environment.NewLine)

            ' Create an XML writer that outputs no insignificant white space so that we can
            ' write to it and explicitly control white space.
            Dim settings As XmlWriterSettings = New XmlWriterSettings()
            settings.Indent = False
            settings.OmitXmlDeclaration = True
            settings.ConformanceLevel = ConformanceLevel.Fragment
            Dim sb As StringBuilder = New StringBuilder()
            Using newXmlWriter As XmlWriter = XmlWriter.Create(sb, settings)
                ' Group adjacent runs so that they can be output with no whitespace between them
                Dim groupedRuns = e.Nodes().GroupAdjacent( _
                    Function(n) As Boolean?
                        If TypeOf n Is XElement Then
                            Dim element As XElement = DirectCast(n, XElement)
                            If element.Name = "Run" Then
                                Return True
                            End If
                            Return False
                        End If
                        Return False
                    End Function)
                For Each g In groupedRuns
                    If g.Key = True Then
                        ' Write white space so that the line of Run elements is properly indented.
                        newXmlWriter.WriteRaw("".PadRight((e.Ancestors().Count() + 1) * 2))
                        For Each run In g
                            run.WriteTo(newXmlWriter)
                        Next
                        newXmlWriter.WriteRaw(Environment.NewLine)
                    Else
                        For Each g2 In g
                            ' Write some white space so that each child element is properly indented.
                            newXmlWriter.WriteRaw("".PadRight((e.Ancestors().Count() + 1) * 2))
                            g2.WriteTo(newXmlWriter)
                            newXmlWriter.WriteRaw(Environment.NewLine)
                        Next
                    End If
                Next
            End Using
            writer.WriteRaw(sb.ToString())
            writer.WriteRaw("".PadRight(e.Ancestors().Count() * 2))
            writer.WriteEndElement()
        Else
            WriteStartElement(writer, e)
            For Each n In e.Nodes
                If TypeOf n Is XElement Then
                    Dim element As XElement = DirectCast(n, XElement)
                    WriteElement(writer, element)
                    Continue For
                End If
                n.WriteTo(writer)
            Next
            writer.WriteEndElement()
        End If
    End Sub

    Function ToStringWithCustomWhiteSpace(ByVal element As XElement) As String
        ' Create XmlWriter that indents.
        Dim settings As XmlWriterSettings = New XmlWriterSettings()
        settings.Indent = True
        settings.OmitXmlDeclaration = True
        Dim sb As StringBuilder = New StringBuilder()
        ' The variable is named outerWriter so that it does not shadow the XmlWriter type.
        Using outerWriter As XmlWriter = XmlWriter.Create(sb, settings)
            WriteElement(outerWriter, element)
        End Using
        Return sb.ToString()
    End Function

    Sub Main()
        Dim myXML As XElement = _
            <Canvas>
                <Grid>
                    <TextBlock>
                        <Run Text='r'/>
                        <Run Text='u'/>
                        <Run Text='n'/>
                    </TextBlock>
                    <TextBlock>
                        <Run Text='far a'/>
                        <Run Text='way'/>
                        <Run Text=' from me'/>
                    </TextBlock>
                </Grid>
                <Grid>
                    <TextBlock>
                        <Run Text='I'/>
                        <Run Text=' '/>
                        <Run Text='want'/>
                        <LineBreak/>
                    </TextBlock>
                    <TextBlock>
                        <LineBreak/>
                        <Run Text='...thi'/>
                        <Run Text='s to'/>
                        <LineBreak/>
                        <Run Text=' work'/>
                    </TextBlock>
                </Grid>
            </Canvas>
        Console.Write(ToStringWithCustomWhiteSpace(myXML))
        Console.ReadLine()
    End Sub

End Module


Align Attributes when Formatting XML using LINQ to XML

A few years ago, I wrote a blog post that showed how to align attributes when formatting XML using LINQ to XML. Here is an extension method that uses that technique.


using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;

namespace ConsoleApplication1
{
    public static class Extensions
    {
        public static string ToStringAlignAttributes(this XContainer xContainer)
        {
            XmlWriterSettings settings = new XmlWriterSettings();
            settings.Indent = true;
            settings.OmitXmlDeclaration = true;
            settings.NewLineOnAttributes = true;
            StringBuilder sb = new StringBuilder();
            using (XmlWriter xmlWriter = XmlWriter.Create(sb, settings))
                xContainer.WriteTo(xmlWriter);
            return sb.ToString();
        }
    }

    class Program
    {
        static void Main(string[] args)
        {
            XDocument doc = new XDocument(
                new XElement("Root",
                    new XAttribute("att1", 1),
                    new XAttribute("att2", 2),
                    new XAttribute("att3", 3),
                    new XElement("Child",
                        new XAttribute("att1", 1),
                        new XAttribute("att2", 2),
                        new XAttribute("att3", 3))));
            Console.WriteLine(doc.ToStringAlignAttributes());

            XElement el = new XElement("Root",
                new XAttribute("att1", 1),
                new XAttribute("att2", 2),
                new XAttribute("att3", 3),
                new XElement("Child",
                    new XAttribute("att1", 1),
                    new XAttribute("att2", 2),
                    new XAttribute("att3", 3)));
            Console.WriteLine(el.ToStringAlignAttributes());
        }
    }
}

Update: May 5, 2011 – I initially wrote a fancier version of this, but as it turns out, I got it wrong: it didn’t properly indent some XML documents, so I am reverting the code. When I get a chance, I’ll work out the issues with the code that implements the fancier alignment.


Screen-cast: Exploring Margins in Open XML WordprocessingML Documents

I’ve recorded a short screen-cast that explores some approaches to setting margins in a word-processing document.  I show how to set margins for cells, paragraphs, and sections.

The screen-cast uses the Open XML SDK 2.0 Productivity Tool to explore various ways to set margins in an Open XML WordprocessingML document.


Querying for Fields in an Open XML WordprocessingML Document

I’ve written a blog post at OpenXMLDeveloper.org that presents some code to query an Open XML WordprocessingML document for fields.  The code returns the field code for each field in the document.  Using this code, it becomes trivial to query a document for all hyperlinks in the document.  This will be the subject of my next post at OpenXMLDeveloper.org.
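As a hedged sketch of what such a query might look like (this is not the code from the OpenXMLDeveloper.org post), the following collects field codes from WordprocessingML markup using LINQ to XML. It reads the w:instr attribute of w:fldSimple elements, and for complex fields it concatenates the w:instrText found between a ‘begin’ w:fldChar and the next ‘separate’ or ‘end’. It assumes fields are not nested, returns simple fields before complex ones rather than in document order, and the names FieldCodeSketch and GetFieldCodes are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Xml.Linq;

public static class FieldCodeSketch
{
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    public static List<string> GetFieldCodes(XElement root)
    {
        List<string> codes = new List<string>();

        // Simple fields carry the field code in the w:instr attribute.
        foreach (XElement simple in root.Descendants(W + "fldSimple"))
            codes.Add(((string)simple.Attribute(W + "instr") ?? "").Trim());

        // Complex fields: gather w:instrText between 'begin' and 'separate'/'end'.
        // Assumption: no nested fields.
        string current = null;
        foreach (XElement e in root.Descendants())
        {
            if (e.Name == W + "fldChar")
            {
                string type = (string)e.Attribute(W + "fldCharType");
                if (type == "begin")
                    current = "";
                else if (current != null)
                {
                    codes.Add(current.Trim());
                    current = null;
                }
            }
            else if (e.Name == W + "instrText" && current != null)
                current += e.Value;
        }
        return codes;
    }

    public static void Main()
    {
        XElement body = XElement.Parse(
            "<w:body xmlns:w='" + W.NamespaceName + "'><w:p>" +
            "<w:fldSimple w:instr=' DATE '><w:r><w:t>today</w:t></w:r></w:fldSimple>" +
            "<w:r><w:fldChar w:fldCharType='begin'/></w:r>" +
            "<w:r><w:instrText> HYPERLINK \"http://example.com\" </w:instrText></w:r>" +
            "<w:r><w:fldChar w:fldCharType='separate'/></w:r>" +
            "<w:r><w:t>link</w:t></w:r>" +
            "<w:r><w:fldChar w:fldCharType='end'/></w:r>" +
            "</w:p></w:body>");
        foreach (string code in GetFieldCodes(body))
            Console.WriteLine(code);
    }
}
```

Once the field codes are available as strings, finding hyperlinks comes down to filtering for codes that start with HYPERLINK.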


Deep Dive into Open XML WordprocessingML Fields and Hyperlinks – Part 2

I’ve posted the second video in this series on Open XML WordprocessingML fields and hyperlinks.  In this video, I show how the MarkupSimplifier application makes the markup for fields more understandable.  In addition, I explore the ways in which fields can be nested inside other fields.  This is a powerful technique, but the markup requires a bit of explanation.  Before watching this video, watch the first video of this deep-dive into Open XML WordprocessingML fields and hyperlinks.

Part 2 shows how MarkupSimplifier can be used to good effect. It also shows what the markup looks like when you nest fields within other fields.


Microsoft Interoperability: Open Specifications Developer Center

Microsoft is continually upgrading and improving its interoperability documentation.  Recently, I received word of an update to the Open Specifications Developer Center.  There is a lot of updated material on the Learn tab.

Some interesting learning modules:

  • Interoperability 101: The Basics: Learning module landing page
  • Introduction to Office Interoperability: Learning module landing page
  • Introduction to Office Open XML: Learning module landing page
  • Understanding Office Binary File Formats: Learning module landing page

Here are some new technical articles:


Deep Dive into Open XML WordprocessingML Fields and Hyperlinks

Fields are one of the most powerful components of WordprocessingML markup.  You will see field markup in hyperlinks, the TOC, dates, page references, calculated values, and much more.  I’ve been asked a few questions lately about fields in WordprocessingML markup.  Fields are perhaps one of the least understood aspects of WordprocessingML markup, but they are really not very hard.  I’ve embarked on a four-part series to explain field markup, show some example code that makes it easier to work with fields, and then show some code that reliably retrieves all hyperlinks in an Open XML WordprocessingML document.  The following video is the first of this four-part series:

Deep dive into Open XML WordprocessingML markup


Ease your WordprocessingML Research using the Open XML Markup Simplifier Application

Sometimes when researching Open XML WordprocessingML markup, extraneous markup gets in the way of your research.  The extraneous markup makes it harder to see and understand the markup issues at hand.  The MarkupSimplifier class (which is part of the PowerTools for Open XML project) can help a lot, but as downloaded from CodePlex, it is only a class.  You need to write code to use the class, and if you want to use the markup simplifier as part of your research process, it is inconvenient.  I’ve written a small WinForm application that uses the MarkupSimplifier class, and makes the use of the simplifier class much more seamless in your research.

You can download the Markup Simplifier Application at OpenXMLDeveloper.org.  The code is attached to the blog post.

I’ve recorded a six minute video that shows the Markup Simplifier application in action:

This video shows how to build and run the Markup Simplifier application.


Release of Cross-Platform C Library for Open Packaging Conventions

Last week, Doug Mahugh announced the release on CodePlex of libOPC version 0.0.1, a new API for Open XML development.  From his blog post:

This API is the first open-source cross-platform API for developers working with Open Packaging Convention (OPC) packages as used by Open XML, XPS, and other formats. Full source code is available, and it’s written in portable C99, so can be used on all popular variants of Linux/Unix, Mac OS, Windows, Android, and many other platforms. The API uses other common cross-platform open-source APIs for some of the low-level details, including ZLIB for opening ZIP-compressed packages and libXML for parsing the XML streams from the parts in the package.

This is excellent news!
