Vik Paruchuri just released Marker, a tool for converting PDFs to Markdown.

Here’s Vik’s Twitter/X thread and the codebase on GitHub.

On a 2023 Apple M2 MacBook Pro, I was able to install the needed dependencies and run it on some PDFs from OpenStax without much difficulty.

From skimming a few pages of results, it looks pretty good. Some of the Markdown formatting choices are a bit unusual or don’t respect the semantic structure of the page, but as a one-size-fits-all solution that also includes OCR, it seems really useful.
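
One rough way to spot-check the structure of converted output, beyond skimming by eye, is to pull out the heading outline and flag places where the heading level jumps by more than one. The sketch below is not part of Marker; `heading_outline` and `skipped_levels` are hypothetical helper names, and it assumes the output uses `#`-style headings.

```python
import re

# Matches ATX headings such as "## Section title" (assumed output style).
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
FENCE = "`" * 3  # marker that opens or closes a fenced code block


def heading_outline(markdown_text):
    """Return (level, title) pairs for headings, skipping fenced code blocks."""
    outline = []
    in_fence = False
    for line in markdown_text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        match = HEADING.match(line)
        if match:
            outline.append((len(match.group(1)), match.group(2)))
    return outline


def skipped_levels(outline):
    """Return headings that sit more than one level below the previous heading."""
    jumps = []
    previous = 0
    for level, title in outline:
        if previous and level > previous + 1:
            jumps.append((level, title))
        previous = level
    return jumps
```

Running `skipped_levels(heading_outline(text))` on a converted file's contents gives a short list of headings worth looking at by hand. A jump is only a hint that the structure may be off, not proof of a conversion error.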

Edit (April 2024): A related release is PDFText, which provides a fast PDF-extraction tool under the permissive Apache 2 license.