pandoc is a command line tool, written in Haskell, for converting between document formats and markup languages. Much like a compiler, pandoc parses the input document with a recursive grammar, converts it to an intermediate representation stored as an abstract syntax tree (AST), and then walks the AST to reproduce the document in the desired output format. pandoc also has its own API for scripted document conversion, and custom import/export filters can be written in Lua.
pandoc supports a vast number of input and output formats, but the two lists are not symmetric: some formats are export only. Additionally, pandoc has its own flavor of markdown that serves as its native document format.
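The parse-then-write pipeline described above can be observed directly by asking pandoc to write the AST itself instead of a document format. The following is a minimal sketch; the file name sample.md and its contents are illustrative assumptions, and the exact shape of the output depends on the installed pandoc version.

```shell
#!/bin/sh
# Hypothetical one-line Markdown input used only for illustration.
sample=sample.md
printf '%s\n' '# Hello *world*' > "$sample"

# Only call pandoc if it is installed.
if command -v pandoc >/dev/null 2>&1; then
    # "native" prints the AST in pandoc's Haskell-style notation.
    pandoc -f markdown -t native "$sample"
    # "json" serializes the same tree as JSON, the form external filters consume.
    pandoc -f markdown -t json "$sample"
fi
```

Comparing the two outputs with the result of a normal conversion (for example -t html) shows which parts of the input survive into the intermediate representation.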
USE flags for app-text/pandoc (Conversion between markup formats)
||Add extra documentation (API, Javadoc, etc). It is recommended to enable per package instead of globally|
||Embed data files in binary for relocatable executable.|
||Include coloured haskell sources to generated documentation (dev-haskell/hscolour)|
||Add support for software performance analysis (will likely vary from ebuild to ebuild)|
||Enable dependencies and/or preparations necessary to run tests (usually controlled by FEATURES=test but can be toggled independently)|
||Build trypandoc cgi executable.|
For the amd64 and arm64 architectures the binary package app-text/pandoc-bin is available. To install this precompiled version, replace pandoc with pandoc-bin in the following installation command:
emerge --ask app-text/pandoc
- $HOME/.local/share or as specified in $XDG_DATA_HOME - Local (per user) configuration file.
Converting between document types causes loss of some formatting information
To some extent this is expected behavior. Not all document formats are equally robust. Further, the intermediate representation used by pandoc does not preserve every possible formatting option.
Inability to convert MS Word .doc files
This is expected behavior: modern MS Office .docx files are supported but legacy .doc files are not. There are a few possible workarounds:
The most basic option is to use antiword to convert the .doc to plain text.
antiword legacy_document.doc > legacy_document.txt
This is valid for many use cases, but a lot of formatting information can be lost this way. A more robust solution is to leverage a little-known feature of antiword, DocBook XML output support, in order to ensure a well formatted document conversion.
antiword -x db infile.doc | pandoc -f docbook
The last option is to import the .doc document into LibreOffice and export it as either LibreOffice's native .odt or MS Office's modern .docx format. Once you have the .odt or .docx version of the file you can leverage pandoc as normal.
This can even be accomplished from the command line as follows:
libreoffice --convert-to odt legacy_document.doc
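Once LibreOffice has written the .odt file, the remaining step is an ordinary pandoc conversion. The following is a minimal sketch that assumes the LibreOffice command above produced legacy_document.odt in the current directory; Markdown as the target format and the output file name are illustrative choices, not requirements.

```shell
#!/bin/sh
# Assumed to have been produced by the LibreOffice conversion above.
in=legacy_document.odt
# Derive the output name by swapping the extension (Markdown chosen as an example target).
out="${in%.odt}.md"

# Run the conversion only if pandoc is installed and the input file exists.
if command -v pandoc >/dev/null 2>&1 && [ -f "$in" ]; then
    pandoc -f odt -t markdown -o "$out" "$in"
fi
```

Any other supported output format can be substituted for markdown by changing the -t value and the output file extension.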
Your mileage may vary as to whether the antiword DocBook XML method or the LibreOffice conversion method results in a better document conversion, but in most cases there should be few if any differences in the resulting output.
"'lmodern.sty' not found" when converting from markdown to pdf
emerge --ask --depclean --verbose app-text/pandoc
- antiword — a program for displaying legacy Microsoft Word .doc documents in common use from MS Word 97 – 2007 as plain text.
- unrtf — a program for displaying legacy Rich Text Format .rtf documents as HTML or plain text.