At Versal, we have created quite a few lessons for internal use, many of which were originally written as Word documents. Now that we've released the first version of our educational platform, we want to convert those lessons for online publication. We have an internal API that accepts Markdown documents and converts them into courses with lessons and gadgets, so all we needed was to convert our Word documents to Markdown. Sounds easy, right?
Not so much. There are a few solutions, but they only work for very basic text formatting. Our documents were a bit more complex, containing tables, images, and math, which proved especially tricky! So, using a number of existing tools, we hacked together our own conversion script. It consists of nine consecutive steps:
Exporting to HTML using Microsoft Word 2012. We automated this on OS X using Automator. Solutions for other platforms are welcome!
Extracting the image types that we want to use. This keeps the original quality, unless the original is a proprietary .emz file. In this step we also fix some math.
Converting HTML to XML using TagSoup.
Converting OMML (Word's proprietary math format) into MathML equations, using Microsoft's own conversion XSLT and a custom version of this XSLT. Uses Saxon 8.
Some intermediate fixes for whitespace and math.
Conversion back into HTML using Tidy. Also strips a lot of stuff.
More intermediate fixes to deal with shortcomings of Tidy and Pandoc.
Conversion into Markdown using Pandoc.
Lots of cleanup and final fixes to the Markdown.
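To make the chain of tools a little more concrete, here is a rough Python sketch of the four tool-driven steps (TagSoup, Saxon, Tidy, Pandoc) plus one example of a final cleanup pass. It is not our actual script: the jar and stylesheet paths, the intermediate file names, the function names, and the specific cleanup rules are all assumptions made for illustration, and the Word export and image extraction steps are left out entirely.

```python
import re
import subprocess
from pathlib import Path

# Hypothetical locations of the external tools; adjust to your own setup.
TAGSOUP_JAR = "tagsoup.jar"
SAXON_JAR = "saxon8.jar"
OMML_XSLT = "OMML2MML.XSL"  # assumed path to Microsoft's OMML-to-MathML stylesheet


def build_commands(html_path):
    """Return (argv, stdout_path) pairs for the tool-driven steps.

    stdout_path is set when the tool writes its result to standard output
    and None when the tool writes its own output file.
    """
    src = Path(html_path)
    xml = src.with_suffix(".xml")
    mathml = src.with_name(src.stem + ".mathml.xml")
    clean = src.with_name(src.stem + ".clean.html")
    md = src.with_suffix(".md")
    return [
        # HTML -> well-formed XML (TagSoup prints to stdout by default).
        (["java", "-jar", TAGSOUP_JAR, str(src)], xml),
        # Word math -> MathML via XSLT.
        (["java", "-jar", SAXON_JAR, "-o", str(mathml), str(xml), OMML_XSLT], None),
        # XML -> cleaned-up (X)HTML.
        (["tidy", "-quiet", "-asxhtml", "-o", str(clean), str(mathml)], None),
        # HTML -> Markdown.
        (["pandoc", "-f", "html", "-t", "markdown", "-o", str(md), str(clean)], None),
    ]


def run_pipeline(html_path):
    """Run each step in order and stop at the first real failure."""
    for argv, stdout_path in build_commands(html_path):
        if stdout_path is None:
            result = subprocess.run(argv)
        else:
            with open(stdout_path, "wb") as out:
                result = subprocess.run(argv, stdout=out)
        # Tidy exits with 1 when it only reported warnings, so for Tidy
        # only an exit status of 2 or higher counts as a failure.
        failure_threshold = 2 if argv[0] == "tidy" else 1
        if result.returncode >= failure_threshold:
            raise RuntimeError("step failed: " + " ".join(argv))


def clean_markdown(text):
    """Illustrative final fixes: normalize spaces and blank lines."""
    text = text.replace("\u00a0", " ")      # non-breaking spaces left over from Word
    text = re.sub(r"\n{3,}", "\n\n", text)  # collapse runs of blank lines
    return text.strip() + "\n"
```

Building the commands separately from running them makes it easy to inspect or log each step before anything is executed, and to slot the intermediate fixes in between the tool invocations.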
We've released this pipeline as an open source project (MIT License), although it should be noted that you will need to purchase Microsoft Word for this to work. Hopefully this can be a starting point for a more reliable conversion of Word documents!