Converting PDFs to Markdown with Marker
Vik Paruchuri just released Marker, a tool for converting PDFs to Markdown.
Here’s Vik’s Twitter/X thread and the codebase on GitHub.
On a 2023 Apple M2 MacBook Pro I was able to install the needed dependencies and run it on some PDFs from OpenStax without much difficulty.
From skimming a few pages of results, the output looks pretty good. Some of the Markdown formatting choices are a bit unusual or don’t respect the semantic structure of the page, but as a one-size-fits-all solution that also includes OCR it seems really useful.
Edit (April 2024): A related release is PDFText, which provides a fast PDF-extraction tool under the permissive Apache 2 license.