NLDB 2025 ∙ Long Paper ∙ Oral

Accelerating End-to-End PDF to Markdown Conversion through Assisted Generation

Author(s): Changxu Duan


Overview

Scientific knowledge is increasingly locked in machine-unreadable formats like PDFs, making it difficult to index, analyze, and reuse. While existing end-to-end transformer-based models can convert screenshots of PDFs into Markdown, they are often inefficient: they decode each token from scratch even when much of the content could be copied directly. This inefficiency is a significant bottleneck, especially for dense scientific documents, where textual overlap between the PDF and the desired output is high.

We build on Prompt Lookup Decoding (mPLD) to speed up generation by reusing overlapping text between PDFs and Markdown. We also introduce Copy Lookup Decoding (CLD), which uses layout-aware signals to identify copyable content and improve decoding efficiency.
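The lookup idea behind this kind of assisted generation can be illustrated with a minimal sketch. It is not the paper's mPLD or CLD implementation: the function name `lookup_candidates`, its parameters, and the whitespace tokenization are all assumptions made for illustration. The sketch matches the last few generated tokens against the text extracted from the PDF and proposes the tokens that follow the match as a draft.

```python
from typing import List, Sequence


def lookup_candidates(
    source_tokens: Sequence[str],
    generated_tokens: Sequence[str],
    max_ngram: int = 3,
    num_draft: int = 5,
) -> List[str]:
    """Propose draft tokens by matching the tail of the output against the source.

    Hypothetical helper for illustration only; not the paper's code.
    """
    # Try the longest suffix first: longer matches are less likely to be spurious.
    for n in range(min(max_ngram, len(generated_tokens)), 0, -1):
        suffix = list(generated_tokens[-n:])
        # Scan the source left to right and stop at the first occurrence.
        for start in range(len(source_tokens) - n + 1):
            if list(source_tokens[start:start + n]) == suffix:
                continuation = list(source_tokens[start + n:start + n + num_draft])
                if continuation:
                    return continuation
    # No overlap found: the model falls back to ordinary token-by-token decoding.
    return []


# Toy example with whitespace "tokens" (a real system would use model token ids).
pdf_text = "the model decodes each token from scratch".split()
output_so_far = "## Intro the model decodes".split()
draft = lookup_candidates(pdf_text, output_so_far)
print(draft)
```

In assisted generation, a draft like this is only a proposal. The model still checks the proposed tokens in a single forward pass and keeps the prefix that agrees with its own predictions, so a bad match costs speed but does not change the output.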

Key contributions include:

How to Cite

@inproceedings{duan2025cld,
    title = {Accelerating End-to-End PDF to Markdown Conversion Through Assisted Generation},
    ISBN = {9783031971419},
    ISSN = {1611-3349},
    url = {http://dx.doi.org/10.1007/978-3-031-97141-9_3},
    DOI = {10.1007/978-3-031-97141-9_3},
    booktitle = {Natural Language Processing and Information Systems},
    publisher = {Springer Nature Switzerland},
    author = {Duan, Changxu},
    year = {2025},
    month = jul,
    pages = {34--48}
}