Converting a Word e-book to Markdown

This is a brief record of how I converted a few Word documents to Markdown. As with everything on this site I’ve typed it up mainly so that I can refer back if needs be…but if anybody else ever finds any element of it useful, then so much the better.

Also…if anyone happens across this and knows a better way, please leave a comment. My guess is that there is a more effective way of converting from Word to Markdown out there, but given that I only had a small number of differently formatted Word docs, I didn’t spend too long searching.
I’ve kept the main text as ‘what I did’. The ‘why I did it’ is in the footnotes.

What I was doing

PowerShell.org offers a number of free e-books on PowerShell. It was decided to move from having these as Word documents on Microsoft OneDrive to having them as Markdown docs hosted on Penflip. There are a bunch of pros and cons to each, but that’s beyond the scope of this post.

The converter

I used Ben Balter’s excellent Word to Markdown Converter web app.

As you can see from the rest of this page and the footnotes, some bits of some of the documents I was converting needed a bit of extra work, but the web application did most of the heavy lifting.

Just for the record, the following ‘just worked’ with no intervention that I can recall:

  • the text
  • bullets
  • bold and italic (although occasionally these were transposed)
  • headings
  • tables

The bits that didn’t ‘work’ in the way I wanted them to were the images and the quoted code. The rest of the post is mainly concerned with how I worked around these issues.


Saved the doc as a web page

First, I saved each document as a web page. I’d learnt previously that this is a quick way of extracting all of the images, and I’d found that the web conversion tool didn’t handle images in a way that worked well for Penflip (see footnote 1).

To save a Word doc as a web page, you just do ‘Save As’ in Word and, under ‘Save as type’, opt for ‘Web Page’.

This creates a sub-folder under the folder you saved to that contains all the images.
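
Once the sub-folder exists, the images can be gathered into one place ready for uploading. The following is only a minimal sketch: the function name `Copy-WebPageImage` and the paths in the usage line are hypothetical, and it assumes that only `.png`, `.jpg`, `.jpeg` and `.gif` files are wanted from the folder (which can also contain other supporting files).

```powershell
function Copy-WebPageImage {
    param(
        [Parameter(Mandatory)][string]$SourceFolder,      # the sub-folder Word created
        [Parameter(Mandatory)][string]$DestinationFolder  # where the images should end up
    )

    # Create the destination folder if it doesn't already exist
    New-Item -ItemType Directory -Path $DestinationFolder -Force | Out-Null

    # Copy only the image files, in name order, and return the copied items
    Get-ChildItem -Path $SourceFolder -File |
        Where-Object { $_.Extension -in '.png', '.jpg', '.jpeg', '.gif' } |
        Sort-Object Name |
        Copy-Item -Destination $DestinationFolder -PassThru
}
```

A call might look like `Copy-WebPageImage -SourceFolder 'C:\temp\MyBook_files' -DestinationFolder 'C:\temp\images'`, with both paths adjusted to wherever the web page was actually saved.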


Removed the images, and re-saved as a Word doc

Next I removed all the images and then saved it again, this time as a Word document.

Removing (or in this case replacing) all the images is easy enough to do.

You go into the normal ‘Find and Replace’ in Word, but instead of typing in a bit of text to find, you click on the ‘Special’ button at the bottom of the window.

This displays a big drop-down, from which you select ‘Graphic’. This puts what looks like ‘^g’ in the ‘Find what’ field.

Because I wanted to put all the images back in later, I entered ‘Image_nnn’ in the ‘Replace with’ field, so that each image left a placeholder behind in the text.

Uploaded the images to a holding page

In Penflip, you can just drag and drop a .jpg or .png from Windows Explorer into the appropriate place in the text.

This is really handy, but because some of the docs had a lot of images, I found a way of putting the images back in bulk: I created a holding page in the doc, then dragged and dropped all the images into that page at once.

Put pointers to the images back in

Then I put the pointers to the images back in with this bit of PowerShell code. The start point and increment for the count needed tweaking for different docs.

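# Settings: adjust the paths for each document
$Encoding = "ascii"
$InputFile = "C:\temp\htmk.md"
$OutputFile = "C:\temp\html_output.md"

# Image counter: the start point and increment may need tweaking per document
$cnt = 0

# Create (or overwrite) the output file; this writes one blank line at the top
Write-Output "" | Out-File -Encoding $Encoding $OutputFile

$MarkdownText = Get-Content $InputFile

foreach ($Line in $MarkdownText)
{
    # Move on to the next image number for each line that contains a placeholder
    # (assumes at most one placeholder per line)
    if ($Line -like "*Image_nnn*")
    {
        $cnt = $cnt + 1
        $PaddedCnt = $cnt.ToString("000")
    }

    # Format is ![image004.png](images/image004.png)
    [string]$OutLine = $Line.Replace("Image_nnn", "![image$PaddedCnt.png](images/image$PaddedCnt.png)")
    $OutLine | Out-File -Append -Encoding $Encoding $OutputFile
}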
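
Since the start point and increment varied between docs, the same idea could be wrapped in a function that takes them as parameters. This is a sketch rather than the code that was actually run: `Convert-ImagePlaceholder` and its parameter names are made up for illustration, it works on lines held in memory rather than on files, and unlike the script above `-Start` is the number given to the first image.

```powershell
function Convert-ImagePlaceholder {
    param(
        [string[]]$Line,                         # the Markdown text, one string per line
        [int]$Start = 1,                         # number used for the first image
        [int]$Increment = 1,                     # gap between consecutive image numbers
        [string]$Placeholder = 'Image_nnn'       # text left behind by Word's Find and Replace
    )

    $number = $Start
    foreach ($text in $Line) {
        if ($text.Contains($Placeholder)) {
            # Build a name such as image004.png and swap it in as a Markdown image link
            $name = 'image{0:000}.png' -f $number
            $text = $text.Replace($Placeholder, "![$name](images/$name)")
            $number += $Increment
        }
        # Emit every line, changed or not
        $text
    }
}

# Example: number the images 1, 3, 5... for a doc where every other image file is skipped
Convert-ImagePlaceholder -Line 'Intro', 'Image_nnn', 'More text', 'Image_nnn' -Start 1 -Increment 2
```

With the example call, the two placeholder lines come back as links to `image001.png` and `image003.png`, and the other lines pass through unchanged.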

Marked the code as code

The Word-to-Markdown converter didn’t really understand that some of the text was intended to look like code…indeed some of the authors of the documents didn’t actually change the formatting for code anyway.

So I did the markup for code manually.

Doing this fixed the comments too. PowerShell comments start with a ‘#’, so outside a code block Markdown rendered them as headings.

Fixed the backticks

One or two of the docs mentioned the fact that the ‘backtick’ (i.e. the ` character) has a special meaning in PowerShell: it’s the line continuation character. Unfortunately the backtick also has a special meaning in Markdown, where it marks text as code.

This was relatively easy to fix manually.
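
For anyone with more than a handful of these to fix, the usual Markdown trick is to wrap the text in a longer run of backticks than any run it contains, padding with a space where the text itself starts or ends with a backtick. The sketch below follows the CommonMark rule for code spans; whether Penflip renders it the same way is an assumption, and `ConvertTo-CodeSpan` is a hypothetical helper name.

```powershell
function ConvertTo-CodeSpan {
    param([string]$Text)

    # Find the longest run of backticks already inside the text
    $longest = 0
    foreach ($match in [regex]::Matches($Text, '`+')) {
        if ($match.Length -gt $longest) { $longest = $match.Length }
    }

    # The delimiter must be one backtick longer than that run
    $fence = '`' * ($longest + 1)

    # Pad with spaces so a leading or trailing backtick isn't read as part of the delimiter
    if ($Text.StartsWith('`') -or $Text.EndsWith('`')) {
        $Text = " $Text "
    }

    "$fence$Text$fence"
}
```

Passing a single backtick to the function returns it surrounded by a space and a double backtick on each side, which CommonMark renderers display as a lone backtick in code style.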


  1. The Converter converts images to a base64 encoding which looks like this (I’ve ‘snipped’ the string after the first 300 or so characters – it’s actually over 30,000 characters long):

    ![](data:image/*;base64,iVBORw0KGgoAAAANSUhEUgAAAmQAAAMYCAIA
    AADq5GzlAAAAGXRFWHRTb2Z0d2FyZQBBZG9iZSBJbWFnZVJlYWR5ccllPAAA
    A2tpVFh0WE1MOmNvbS5hZG9iZS54bXAAAAAAADw/eHBhY2tldCBiZWdpbj0i
    77u/IiBpZD0iVzVNME1wQ2VoaUh6cmVTek5UY3prYzlkIj8+IDx4OnhtcG1l
    dGEgeG1sbnM6eD0iYWRvYmU6bnM6bWV0YS8iIHg6eG1wdGs9IkFkb2JlIFhN
    UCBDb3JlIDUuNS1jMDE0IDc5LjE1MTQ4MSwgMjAxMy8wMy8xMy0xMjowOTox
    NSAgICAgICAgIj4gPHJkZjpSREYgeG1sbnM6cmRm<snip>
    

    Penflip didn’t seem to like this…although it could be that the string got corrupted when I cut and pasted it. In any case, I prefer to have the images handled in a more straightforward way.
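
If converter output with these embedded images is already to hand, the base64 strings could in principle be swapped for the same ‘Image_nnn’ placeholder after the event. This is an untested alternative to the Word find-and-replace step, not something described in the steps above: `Remove-EmbeddedImage` is a hypothetical name, and the pattern assumes every embedded image has the `![...](data:image/...)` shape shown in the footnote.

```powershell
function Remove-EmbeddedImage {
    param(
        [string]$Markdown,                       # the whole Markdown text as one string
        [string]$Placeholder = 'Image_nnn'       # what to leave where each image was
    )

    # Matches ![alt](data:image/...) up to the closing bracket,
    # including base64 strings that are wrapped over several lines
    $pattern = '!\[[^\]]*\]\(data:image/[^)]*\)'

    $Markdown -replace $pattern, $Placeholder
}
```

The text would need to be read in as a single string first, for example with `Get-Content -Raw` on the converted file, so that images spanning several lines are matched.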
