Advice on the Regex

This topic contains 1 reply, has 2 voices, and was last updated by  Eric White 7 years, 8 months ago.

Viewing 2 posts - 1 through 2 (of 2 total)
  • #3801

    abill
    Participant

    Hi Eric,

    First of all, congratulations, and not just on a really damn useful library: your introductory videos are neat too.

    Being a novice, I would appreciate some quick advice on how to use some of your cool stuff correctly. In some docx files I need to identify certain keywords with a regular expression. Sometimes the keywords are in bold, italic, or different fonts and colors, so I need to take that into account somehow.

    How would you advise me to proceed? Should I use the regex extension that you provide on the text and then check the formatting properties? Is there a better approach?

    Thank you! And I hope that by using your tool I may help you out with tests and suggestions.

    #3804

    Eric White
    Keymaster

    Hi,

    I’m really happy you like the library and videos! I love working with Open XML – it is fun!

    I just want to be clear – in addition to selecting text in a document based on matching of a regex, you also want to optionally match based on styling, i.e. bold, italic?

    OpenXmlRegex was not designed to do this. However, one of the options to certain OpenXmlRegex overloads is a lambda expression that gets called for each match. One of the arguments to this lambda is the paragraph element, so it could be possible for you to examine this paragraph element and decide whether you want to do the replacement or not. If you return false from the lambda, then the replacement will not happen.
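
    To make the callback idea concrete, here is a minimal sketch in plain Python. It is not the OpenXmlRegex API: `Paragraph`, `replace_if`, and the `bold` flag are hypothetical stand-ins, and a real docx paragraph carries its formatting on individual runs rather than on one flag. It only shows the pattern of calling a predicate for each match and skipping the replacement when the predicate returns false.

```python
import re
from dataclasses import dataclass


@dataclass
class Paragraph:
    """Hypothetical stand-in for a paragraph element: text plus one style flag."""
    text: str
    bold: bool = False


def replace_if(paragraphs, pattern, replacement, should_replace):
    """Replace each match only when should_replace(paragraph, match) is true.

    Returns the number of replacements actually made.
    """
    count = 0
    for para in paragraphs:
        def substitute(match, para=para):
            nonlocal count
            if should_replace(para, match):
                count += 1
                return replacement
            # Returning the matched text unchanged leaves this match alone.
            return match.group(0)

        para.text = pattern.sub(substitute, para.text)
    return count


paragraphs = [
    Paragraph("Total due: KEYWORD", bold=True),
    Paragraph("Mention of KEYWORD in body text", bold=False),
]

# Only replace matches that sit in a bold paragraph.
replaced = replace_if(
    paragraphs,
    re.compile(r"KEYWORD"),
    "42",
    lambda para, match: para.bold,
)
```

    In this sketch the predicate sees the whole paragraph, so every match inside the same paragraph gets the same answer unless the predicate also inspects the match position.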

    You then have to decide what to do if a specific paragraph contains text that matches the regular expression more than once: how do you decide, based on styling, whether to replace each match or not?

    In my opinion, a more robust system simply relies on matching the text. I have taken to the practice of delineating such text using <# and #>. These symbols make sense from a user perspective, since they are easy to remember, and from a programming perspective, since those characters in that order are not valid in JavaScript, C#, or XML. So I have standardized on those two delimiters (which are the same as the ones used for T4 text templates). This is what I would recommend, if it fits your user scenarios.
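
    As a rough illustration of the delimiter approach, the sketch below finds `<# ... #>` placeholders in a plain string and fills them from a dictionary. The names `PLACEHOLDER`, `fill_placeholders`, and the sample field names are assumptions made up for this example, and it works on a plain string, so it ignores how a docx file may split the same text across several runs.

```python
import re

# Matches <# Name #> with optional whitespace inside the delimiters.
PLACEHOLDER = re.compile(r"<#\s*(.*?)\s*#>")


def fill_placeholders(text, values):
    """Replace each <# Name #> with values[Name]; leave unknown names untouched."""
    def substitute(match):
        name = match.group(1)
        return values.get(name, match.group(0))

    return PLACEHOLDER.sub(substitute, text)


filled = fill_placeholders(
    "Dear <# CustomerName #>, your order <#OrderId#> has shipped. <#Unknown#>",
    {"CustomerName": "Ada", "OrderId": "1001"},
)
```

    Because the match depends only on the delimiters, the keyword can be bold, italic, or in any font without changing the pattern.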

    Let me know how your project goes. 🙂

    Best, Eric
