pandoc

pandoc is a command-line tool, written in Haskell, for converting between document formats and markup languages. Much like a compiler, pandoc parses documents with a recursive grammar, converts the input to an intermediate representation stored as an abstract syntax tree (AST), and then walks the AST to reproduce the document in the desired output format. pandoc also has its own API for scripted document conversion, and custom import/export filters can be written in Lua.
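
The intermediate representation can be inspected directly, because pandoc offers its AST as an output format named native. The following is a minimal sketch; the function name show_ast is hypothetical and chosen only for illustration.

```shell
# Print pandoc's native AST for a document read from standard input.
# $1: input format name (optional, defaults to pandoc's Markdown).
# show_ast is an illustrative helper name, not part of pandoc.
show_ast() {
    pandoc -f "${1:-markdown}" -t native
}
```

For example, `printf '%s\n' '*hello* world' | show_ast` prints the tree pandoc built for that one-line Markdown document. Replacing native with json gives the same tree in the JSON form that filters consume.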

pandoc supports a large number of input and output formats, but the two lists are not symmetric: some formats are export only. pandoc also has its own flavor of Markdown, which serves as its native document format.
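
To see which formats are export only on an installed version, the output of pandoc's --list-input-formats and --list-output-formats options can be compared. This is a sketch under the assumption that sort, comm, and mktemp are available; the function name export_only_formats is hypothetical.

```shell
# List formats that pandoc can write but cannot read.
# export_only_formats is an illustrative helper name, not part of pandoc.
export_only_formats() {
    in_list=$(mktemp) || return 1
    out_list=$(mktemp) || { rm -f "$in_list"; return 1; }
    pandoc --list-input-formats | sort > "$in_list"
    pandoc --list-output-formats | sort > "$out_list"
    # comm -13 keeps only lines unique to the second file (the output formats).
    comm -13 "$in_list" "$out_list"
    rm -f "$in_list" "$out_list"
}
```

Calling `export_only_formats` prints one format name per line. The result depends on the installed pandoc version.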

Installation

USE flags

USE flags for app-text/pandoc (Metapackage for pandoc version 3)

  • doc - Add extra documentation (API, Javadoc, etc). It is recommended to enable per package instead of globally.
  • embed-data-files - Embed data files in binary for relocatable executable.
  • hscolour - Include coloured Haskell sources in generated documentation (dev-haskell/hscolour).
  • profile - Add support for software performance analysis (will likely vary from ebuild to ebuild).
  • test - Enable dependencies and/or preparations necessary to run tests (usually controlled by FEATURES=test but can be toggled independently).
  • trypandoc - Build trypandoc cgi executable.

Emerge

For the amd64 and arm64 architectures the binary package app-text/pandoc-bin is available. To install this precompiled version, replace pandoc with pandoc-bin in the following installation command:

root #emerge --ask app-text/pandoc

Configuration

Files

  • $HOME/.local/share/pandoc, or the pandoc subdirectory of $XDG_DATA_HOME if that variable is set - Local (per user) data directory, used for custom templates, filters, and default settings.
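
The lookup can be sketched in shell as follows. It assumes the behavior described in the pandoc manual: the directory is the pandoc subdirectory of the XDG data directory, with $HOME/.pandoc used as a fallback when the XDG location does not exist and the legacy one does. The function name pandoc_user_data_dir is hypothetical.

```shell
# Print the directory pandoc is expected to use for per-user data.
# pandoc_user_data_dir is an illustrative helper name, not part of pandoc.
pandoc_user_data_dir() {
    xdg_dir="${XDG_DATA_HOME:-$HOME/.local/share}/pandoc"
    legacy_dir="$HOME/.pandoc"
    # Assumed fallback: the legacy directory wins only if the XDG one is absent.
    if [ ! -d "$xdg_dir" ] && [ -d "$legacy_dir" ]; then
        printf '%s\n' "$legacy_dir"
    else
        printf '%s\n' "$xdg_dir"
    fi
}
```

Running `pandoc --version` also reports the user data directory that the installed binary will use, which is the authoritative answer if the two disagree.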

Troubleshooting

Converting between document types causes loss of some formatting information

To some extent this is expected behavior. Not all document formats are equally robust. Further, the intermediate representation used by pandoc does not preserve every possible formatting option.

Inability to convert MS Word .doc files

This is expected behavior: modern MS Office .docx files are supported, but legacy .doc files are not. There are two possible workarounds, antiword and LibreOffice.

The most basic option is to use antiword to convert the .doc to plain text.

user $antiword legacy_document.doc > legacy_document.txt

This is valid for many use cases, but a lot of formatting information can be lost this way. A more robust solution is to use a little-known feature of antiword, DocBook XML output, and pass the result to pandoc so that more of the document structure survives the conversion.

user $antiword -x db infile.doc | pandoc -f docbook

The other option is to import the .doc document into LibreOffice and export it as either LibreOffice's native .odt or MS Office's modern .docx format. Once you have the .odt or .docx version of the file, you can use pandoc as normal.

This can even be accomplished from the command line as follows:

user $libreoffice --convert-to odt legacy_document.doc

Your mileage may vary as to whether the antiword DocBook XML method or the LibreOffice method produces the better conversion, but there should be few if any differences between the two in the resulting output.
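
To compare the two routes on a particular file, both can be wrapped in small shell functions that write Markdown, so the results can be diffed. This is a sketch only: the function names are hypothetical, Markdown is an arbitrary choice of target format, and the --headless and --outdir options are additions to the LibreOffice command shown above.

```shell
# Route 1: antiword emits DocBook XML, which pandoc reads from standard input.
# $1: input .doc file, $2: output Markdown file.
doc_via_antiword() {
    antiword -x db "$1" | pandoc -f docbook -t markdown -o "$2"
}

# Route 2: LibreOffice writes an .odt into a scratch directory,
# then pandoc converts that file.
# $1: input .doc file, $2: output Markdown file.
doc_via_libreoffice() {
    scratch=$(mktemp -d) || return 1
    base=$(basename "$1")
    base=${base%.*}    # strip the extension, e.g. legacy_document.doc -> legacy_document
    libreoffice --headless --convert-to odt --outdir "$scratch" "$1" >/dev/null &&
        pandoc -f odt -t markdown -o "$2" "$scratch/$base.odt"
    status=$?
    rm -rf "$scratch"
    return "$status"
}
```

After running `doc_via_antiword legacy_document.doc a.md` and `doc_via_libreoffice legacy_document.doc b.md`, `diff a.md b.md` shows where the two routes disagree.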

"'lmodern.sty' not found" when converting from markdown to pdf

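pandoc's default LaTeX template uses the Latin Modern fonts, and this error means they are missing from the TeX installation. Emerge dev-texlive/texlive-fontsrecommended, which provides lmodern.sty.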

Removal

Unmerge

root #emerge --ask --depclean --verbose app-text/pandoc

See also

  • antiword — a program for displaying legacy Microsoft Word .doc documents in common use from MS Word 97 – 2007 as plain text.
  • unrtf — a program for displaying legacy Rich Text Format .rtf documents as HTML or plain text.

External resources