Docugami Lights Up Dark Data Trapped in Unstructured Documents

Hi Friends,

It is great fun to join a startup that is doing new and incredibly innovative things with documents. One downside, of course, is that there have been limits on what I can publicly disclose. But now the wrappers are starting to come off. The folks at SignalFire, one of the investors in Docugami, have written a super interesting blog post about what we’re doing.

My work at Docugami runs the gamut from some fairly standard DOCX transformations to building parts of a super-cool Word add-in (JavaScript / TypeScript), but the most fun aspects of my job are tailing (virtually) the scientists who have built and are enhancing world-class AI for document analysis. Frankly, I’m having more fun than I have ever had in my career.

And by the way, we still have an open software engineer position and an open ML/AI scientist position. Seize the day!

-Eric

Docugami: The Document Engineering Company

Hi Friends,

After many years as an independent contractor for a variety of companies and organizations, I have taken a job as a developer at a company named Docugami! I’ve been an employee for a couple of months now. 😊

The focus of this company is a “multi-faceted AI solution that combines innovative Deep Learning, NLP, Bayesian, Evolutionary, and other AI techniques.” It is super fascinating to see the amazing technology that this team has put together.

The company recently raised $10M in a seed round, an unusually large early-stage deal. Also interesting: Grammarly is an investor.

The fine-tuning of our product is proceeding apace toward a public rollout later this year. If you want to get a glimpse of the future and see what AI can do when applied to business agreements and contracts, you can get early access from the Web site.

The absolute best thing about this job is the amazing team of people involved. The business, engineering, marketing, and public relations leaders are top-notch. The scientists who work on the AI and Machine Learning are amazing. Docugami is the complete package. I am humbled and honored to be part of this team.

You can meet some of my amazing colleagues on the Docugami LinkedIn page.

There are still open developer positions, an open Software Engineering Manager position, and an open Machine Learning / Data Scientist position. This is an amazing team, and good to work with. If you are top-notch (you obviously are, since you are here 😊), go to the careers page at docugami.com.

The tagline for this company is “The Document Engineering Company.”

This is perfect for me!

Cheers, Eric

Welcome, Open-Xml-PowerTools Users

Welcome, Open-Xml-PowerTools users. As you may have learned from the repo at OfficeDev/Open-Xml-PowerTools, Microsoft is going to archive that repo and will not maintain it in the future. The good folks at Microsoft asked if I would maintain this repo going forward, and I said “yes, absolutely”.

At this moment, the github.com/EricWhiteDev/Open-Xml-PowerTools repo is an exact clone of the github.com/OfficeDev/Open-Xml-PowerTools repo, but that is going to change in the near future.

I plan to:

  • Merge vNext into Master
  • Make significant enhancements to vNext

For those of you awaiting the enhancements to DocumentAssembler, I am nearly done with them, and expect to post an updated, refactored version shortly.

Unfortunately, this development came right as I am starting a 10-day vacation, so I will need to balance making the necessary enhancements with my other vacation activities! But stay tuned, I expect to make progress in the next couple of days.

Warm Regards,

Eric White

Building Open-Xml-PowerTools when pulling the Open-Xml-Sdk via NuGet

Recently, there was an update to the vNext branch of Open-Xml-PowerTools on GitHub (https://github.com/OfficeDev/Open-Xml-PowerTools) that enabled pulling the Open-Xml-Sdk via NuGet.

This reduces the steps necessary to build the PowerTools, and is a step in the right direction, which is to use the NuGet package manager to simplify the use of libraries. The following screen-cast shows the precise steps necessary to build the Open-Xml-PowerTools as they currently exist on GitHub.

https://www.youtube.com/watch?v=otod1wXg8wI

Introducing WmlComparer, a Module in Open-Xml-PowerTools

WmlComparer compares two Open XML documents, producing a new document that contains the precise differences in revision tracking markup. You can then iterate through those differences.

This module contains my latest thinking on how to process large documents character-by-character. The module converts the DOCX into a new form – an array – where every item in the array is a single piece of content, whether it be a character in a paragraph, the paragraph mark itself, an image, and so on. This thinking will go into the next version of OpenXmlRegex, so that it will be possible to include new lines in the search string, as well as replace matched strings with content that contains new lines.
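To make the flattening idea concrete, here is a minimal sketch in Python. It is not the C# code in WmlComparer; the `Atom` class, the `flatten` and `find` helpers, and the simplified document model (a paragraph is a list of text runs, with a dictionary standing in for an image) are all assumptions made for illustration. It shows why a flat array makes it straightforward to search across a paragraph boundary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Atom:
    kind: str        # "char", "para_mark", or "image"
    value: str       # the character, a newline for a paragraph mark, or an image id
    paragraph: int   # index of the paragraph the atom came from


def flatten(paragraphs):
    # Turn a list of paragraphs into one flat list with one atom per piece of content.
    atoms = []
    for p_index, runs in enumerate(paragraphs):
        for run in runs:
            if isinstance(run, dict):
                # Hypothetical stand-in for an image: the whole image is a single atom.
                atoms.append(Atom("image", run["image"], p_index))
            else:
                for ch in run:
                    atoms.append(Atom("char", ch, p_index))
        # The paragraph mark is itself an atom, so a search string can contain it.
        atoms.append(Atom("para_mark", "\n", p_index))
    return atoms


def find(atoms, needle):
    # Build one string with exactly one character per atom, so that string
    # indexes and atom indexes line up. Images become U+FFFC, which will not
    # appear in an ordinary search string.
    text = "".join("\ufffc" if a.kind == "image" else a.value for a in atoms)
    hits = []
    start = text.find(needle)
    while start != -1:
        hits.append(start)
        start = text.find(needle, start + 1)
    return hits


doc = [["Hello ", "world"], ["Second ", {"image": "rId5"}, "paragraph"]]
atoms = flatten(doc)
hits = find(atoms, "world\nSecond")  # the search string spans two paragraphs
```

Each hit is an index into `atoms`, and every atom remembers which paragraph it came from, so a match can be mapped back to the original content.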

This module is extremely useful when writing xUnit tests for code that processes Open XML markup – we can test that the code produced the exact document that we intended.
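As a rough illustration of comparing two pieces of content and then iterating through the differences, here is a small Python sketch built on the standard library's `difflib.SequenceMatcher`. This is a swapped-in technique for illustration only, not the algorithm WmlComparer uses, and the `compare` helper is a hypothetical name. It works on plain strings, where a newline stands in for a paragraph mark.

```python
from difflib import SequenceMatcher


def compare(before, after):
    # Return (op, text) pairs where op is "equal", "delete", or "insert".
    matcher = SequenceMatcher(a=before, b=after, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            changes.append(("equal", before[i1:i2]))
            continue
        # "replace" is reported as a deletion followed by an insertion,
        # similar in spirit to how tracked revisions record a change.
        if i2 > i1:
            changes.append(("delete", before[i1:i2]))
        if j2 > j1:
            changes.append(("insert", after[j1:j2]))
    return changes


before = "The quick fox\nSecond paragraph\n"
after = "The quick brown fox\nSecond paragraph\n"
changes = compare(before, after)

# Iterate through only the differences, as a test might do.
revisions = [(op, text) for op, text in changes if op != "equal"]
```

In a unit test, asserting that `revisions` is empty is one way to check that generated content matches the expected content exactly.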

Open-Xml-PowerTools and JavaScript

One issue that many Open XML developers face when doing Open XML development in JavaScript is the need for many of the important modules available in Open-Xml-PowerTools.

To date, almost none of the Open-Xml-PowerTools have been converted to JavaScript (in a publicly releasable form).

However, I designed the Open-Xml-Sdk for JavaScript with the specific intention to enable manual conversion of Open-Xml-PowerTools code from C# to JavaScript. The Ltxmljs library is semantically compatible with the .NET LINQ to XML library. The Open-Xml-Sdk for JavaScript was designed with the idea that the common idioms that we use in the Open-Xml-Sdk can be converted to JavaScript with ease.

Proof-in-the-pudding is that I converted thousands of lines of code (of a preliminary version of the WmlToHtmlConverter) to JavaScript. This included the FormattingAssembler module, the ListItemRetriever module, the HtmlConverter module, and more. It worked very nicely. However, because this was based on a half-baked WmlToHtmlConverter, I decided that I wanted to first polish those modules as written in C#, and then at some future point in time convert the finished modules to JavaScript. I fully expect to convert these modules to JavaScript (but it is not going to happen in the near future, as my current projects prohibit an effort of this magnitude).

Key points:

  • It is doable.
  • It is non-trivial. You should be expert in C#, HTML, CSS, JavaScript, LINQ to XML, functional programming, and Open XML markup. I don’t want to minimize what you must know in order to tackle conversion of one of these modules to JavaScript.

But it is doable. I have a goal to do this, but it must fit in with other priorities in my life, like keeping my kid and wife supplied with groceries. 🙂

Cheers, Eric

Welcome to the new EricWhite.com

Hi friends,

With the recent decision to shutter OpenXmlDeveloper.org, I will be focusing my efforts to support Open XML developers here at EricWhite.com.

Forums: I have set up forums on this web site, so that we have a convenient place to discuss Open XML specific questions. I have set up forum categories based on my experience with OpenXmlDeveloper.org – there are fewer categories, which should make it easier to pick the appropriate forum for your questions. Feel free to suggest new forum categories. Forum links are in the sidebar. You will need to have an account here to post in the forums. Please feel free to create one. At the top of each forum, you can subscribe, which will send email notifying you of activity.

Content: I have moved much of the content from OpenXmlDeveloper.org to this web site. Over the next month or two, I’ll be moving the rest of the interesting content here.  In the process of moving this content over, I’ll be reviewing and updating it, making sure that it recommends current best practices for Open XML development.  In addition to moving the content over, I have created ‘Developer Centers’ around specific topics such as Open-Xml-PowerTools, DocumentAssembler, DocumentBuilder, PresentationBuilder, WmlToHtmlConverter, and HtmlToWmlConverter.

Blogging: I will be regularly blogging about my Open XML activities here.

Twitter: Follow me (@EricWhiteDev) on Twitter to get the latest news about Open XML development and tools.

LinkedIn: Connect, especially if you are an Open XML developer.

I will continue to enhance and support Open-Xml-PowerTools – it is a ‘live project’, with interesting enhancements planned for the near future.  I consider that Open-Xml-PowerTools is perhaps 25% complete – there is an awful lot that we can do to make server-side Open XML document generation and processing easier.  My plans are firming up for the next round of enhancements – stay tuned – I’ll keep you posted.

We have done nowhere near enough with the Open-Xml-Sdk-for-JavaScript. There are great advantages to processing Open XML with JavaScript – cross-platform – process in browser – process with Node.js. And we should take advantage of TypeScript, I think.

I love document formats.  I love the capacity to create and process rich content that we have gained due to the standardization of the Open XML format.  We have made great progress over the years, and I am looking forward to continuing to advance the state-of-the-art of document generation and processing.

As an aside, one thing that I am going to enjoy a lot is moving off of the old version of Telligent Community Server that we were using for OpenXmlDeveloper.org – what a pain. EricWhite.com is a WordPress site, updated to the latest version. It is a pleasure to move onto a modern blogging / content management platform. The forums here are implemented using bbPress, which gives us a flexible, easy-to-use platform.

Cheers, Eric

Announcing Support for RTL in the HtmlConverter.cs Module

Hi all,

Today I am announcing support for RTL languages in the HtmlConverter module that is part of PowerTools for Open XML.

[Image: RtlConversion-b]

Please give it a try and let me know how it goes…

Cheers, Eric

PowerTools for Open XML Road Map, April 2014

I recently completed a new version of ListItemRetriever.cs – which is a super important module in PowerTools for Open XML, although it operates mostly behind the scenes.  This module is responsible for translating the various pieces of markup for numbered and bulleted lists into the actual text that HtmlConverter.cs will place in the generated HTML.  It was a test of my patience – I first patched the old version, then I re-wrote it, and then I threw it all out and re-wrote it again.  I am finally happy with it.
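To give a feel for what “translating markup into the actual text” involves, here is a heavily simplified Python sketch. It is not the ListItemRetriever code. It assumes each list level has already been reduced to three values that mirror the `w:numFmt`, `w:lvlText`, and `w:start` elements of the numbering markup, and that each paragraph is known only by its level index. The function names and dictionary keys are hypothetical, and real documents have many more cases (overrides, restarts, legal numbering, bullets, and so on) that this ignores.

```python
import re

ROMAN = [(1000, "m"), (900, "cm"), (500, "d"), (400, "cd"), (100, "c"), (90, "xc"),
         (50, "l"), (40, "xl"), (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i")]


def format_number(n, num_fmt):
    # Render a counter in one of a few common number formats.
    if num_fmt == "decimal":
        return str(n)
    if num_fmt in ("lowerLetter", "upperLetter"):
        # Assumption for this sketch: after z the letter repeats (aa, bb, ...).
        letters = chr(ord("a") + (n - 1) % 26) * ((n - 1) // 26 + 1)
        return letters if num_fmt == "lowerLetter" else letters.upper()
    if num_fmt in ("lowerRoman", "upperRoman"):
        out = ""
        for value, numeral in ROMAN:
            while n >= value:
                out += numeral
                n -= value
        return out if num_fmt == "lowerRoman" else out.upper()
    raise ValueError(f"unsupported number format: {num_fmt}")


def list_item_texts(paragraph_levels, level_defs):
    # Return the numbering text for each paragraph, given its level index.
    counters = [None] * len(level_defs)
    texts = []
    for ilvl in paragraph_levels:
        # Shallower levels that were skipped begin at their start value.
        for level in range(ilvl):
            if counters[level] is None:
                counters[level] = level_defs[level]["start"]
        # Start or advance the counter for this paragraph's level.
        if counters[ilvl] is None:
            counters[ilvl] = level_defs[ilvl]["start"]
        else:
            counters[ilvl] += 1
        # Deeper levels restart the next time they are used.
        for level in range(ilvl + 1, len(counters)):
            counters[level] = None

        def substitute(match):
            # %1 refers to the first level, %2 to the second, and so on.
            level = int(match.group(1)) - 1
            return format_number(counters[level], level_defs[level]["num_fmt"])

        texts.append(re.sub(r"%(\d)", substitute, level_defs[ilvl]["lvl_text"]))
    return texts


level_defs = [
    {"num_fmt": "decimal", "lvl_text": "%1.", "start": 1},
    {"num_fmt": "lowerLetter", "lvl_text": "%1.%2)", "start": 1},
    {"num_fmt": "lowerRoman", "lvl_text": "(%3)", "start": 1},
]
paragraph_levels = [0, 1, 1, 2, 0, 1]
texts = list_item_texts(paragraph_levels, level_defs)
```

The counters are the fiddly part: a paragraph at one level advances its own counter and resets every deeper level, which is why the second top-level item in the example starts its sub-items over again.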

Now that this module is completed, it is time to jump back into some serious coding for the high-fidelity HtmlConverter.cs module.  My next goal is to complete support for Right-To-Left (RTL) languages and East Asian languages.

I always try to do the hard stuff first, and because of my unfamiliarity with the Open XML markup that I need to parse, this is somewhat hard.  Also, because I don’t read any RTL or East Asian languages, I have to do this by pattern matching.  Sure would be easier if I could read them…  🙂

I don’t have a good idea of how long it will take.

Some time ago I wrote a crude program that uses search engines to find Open XML documents on the web and download them.  I have a pretty large collection of them – in general, my work will be to run HtmlConverter.cs on these documents and manually compare the docs in Word with the converted HTML in a browser.  Fun.

But where I’m going – I want to have this high-fidelity conversion from DOCX to HTML in really good shape in the next 2-3 months.

Following that, I want to re-write the portions of PowerTools that we use from PowerShell.  I want to re-write all of the cmdlets using the PowerShell language, not using C#.  After re-writing the cmdlets, I believe that the process to install and use the cmdlets will be a matter of dropping some files in a specific place.  It also will make it much easier for users of PowerTools to build new cmdlets, and to modify the existing cmdlets.

This is my vision for PowerTools for Open XML 3.0.

Cheers, Eric

New Focus for this Blog

This blog will have a new focus.  Up until now, the blog has been pretty much only about Open XML.  However, now I am going to go much wider afield.  Over the years I’ve gained some experience and insight into certain aspects of being a developer, and I’m going to start blogging about them – things like:

  • How I use PowerShell as a developer
  • My own approach to coverage testing, and why it is different and better
  • My approach to learning new programming languages quickly
  • Using screen-casts for intra-team communication – and how to create screen-casts super quickly
  • How I currently use the cloud as a development tool

I’m also going to discuss some of the more philosophical aspects of being a developer – I’ll even discuss what I did before I worked at Microsoft, and how and why I ended up doing what I’m doing.

And I’m certain I’ll discuss things that I haven’t contemplated yet.

This is going to be fun.
