Accelerating End-to-End PDF to Markdown Conversion through Assisted Generation

NLDB 2025 ∙ Long Paper ∙ Oral

Overview

Scientific knowledge is increasingly locked in formats that are hard for machines to read, such as PDF, which makes it difficult to index, analyze, and reuse. Existing end-to-end transformer-based models can convert screenshots of PDF pages into Markdown, but they are often inefficient: they decode every token from scratch even when much of the content could be copied directly. This is a significant bottleneck, especially for dense scientific documents, where the textual overlap between the PDF and the desired output is high.

We build on Prompt Lookup Decoding with a variant, mPLD, that speeds up generation by reusing text that overlaps between the PDF and its Markdown output. We also introduce Copy Lookup Decoding (CLD), which uses layout-aware signals to identify copyable content and further improve decoding efficiency.
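The lookup idea behind this family of methods can be illustrated with a minimal sketch. It assumes the PDF text and the partial Markdown output are already tokenized into integer ids. The function name `lookup_candidates` and its parameters `max_ngram` and `num_candidates` are hypothetical and not taken from the paper. The sketch covers candidate proposal only; in assisted generation the main model still verifies the proposed tokens before accepting them.

```python
from typing import List, Sequence


def lookup_candidates(
    source_ids: Sequence[int],
    generated_ids: Sequence[int],
    max_ngram: int = 3,
    num_candidates: int = 5,
) -> List[int]:
    """Propose draft tokens by matching the generated suffix inside the source.

    Illustrative assumption: tries the longest suffix first and returns the
    tokens that follow the first match found in the source text.
    """
    for n in range(min(max_ngram, len(generated_ids)), 0, -1):
        suffix = list(generated_ids[-n:])
        for start in range(len(source_ids) - n + 1):
            if list(source_ids[start:start + n]) == suffix:
                follow = list(source_ids[start + n:start + n + num_candidates])
                if follow:
                    return follow
    # No overlap found: fall back to ordinary token-by-token decoding.
    return []


# Toy example with made-up token ids.
pdf_tokens = [10, 11, 12, 13, 14, 15]
markdown_so_far = [99, 11, 12]
draft = lookup_candidates(pdf_tokens, markdown_so_far, num_candidates=2)
```

In this toy case the two-token suffix `[11, 12]` is found in the source, so the next two source tokens are proposed as the draft. When the draft is correct, the model accepts several tokens in one forward pass instead of one.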

Key contributions include:

mPLD: A plug-and-play decoding strategy requiring no model retraining, adaptable to existing Vision-Language Models.
CLD: A novel candidate generation method that integrates a layout-aware Copyable Text Identification (CTI) component. This component uses an ERNIE-Layout model, fine-tuned with LoRA and extended to a 1024-token input length via RoPE, to classify spans as copyable or not based on document layout.
Simplified token classification: Instead of complex multi-class layout tags, CLD uses a binary KEEP/DELETE scheme for efficient span-level predictions, resolved using a voting classifier.
Practical and lightweight integration: Despite introducing an additional 27 MB model component, our method remains efficient and easy to integrate. It enables up to 1.7× speedup over traditional decoding approaches, while supporting real-world applications like academic search engines and scientific content structuring.
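The step from token-level KEEP/DELETE labels to span-level decisions can be sketched as follows. The text above does not specify the voting rule, so this sketch assumes a simple majority vote over the tokens in each span, with ties resolved to DELETE as the conservative choice. The function name `vote_spans` and the half-open `(start, end)` span format are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def vote_spans(token_labels: List[str], spans: List[Tuple[int, int]]) -> List[str]:
    """Resolve per-token KEEP/DELETE labels into one decision per span.

    Assumption: majority vote, ties go to DELETE. Spans are half-open
    (start, end) index ranges into token_labels.
    """
    decisions = []
    for start, end in spans:
        counts = Counter(token_labels[start:end])
        decisions.append("KEEP" if counts["KEEP"] > counts["DELETE"] else "DELETE")
    return decisions


# Toy example: six token-level predictions grouped into three spans.
labels = ["KEEP", "KEEP", "DELETE", "DELETE", "DELETE", "KEEP"]
span_decisions = vote_spans(labels, [(0, 3), (3, 5), (5, 6)])
```

Only spans resolved to KEEP would be offered as copy candidates during decoding. Deciding at span level keeps one stray token-level error from splitting an otherwise copyable span.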

How to Cite

@inproceedings{duan2025cld,
    title = {Accelerating End-to-End PDF to Markdown Conversion Through Assisted Generation},
    ISBN = {9783031971419},
    ISSN = {1611-3349},
    url = {http://dx.doi.org/10.1007/978-3-031-97141-9_3},
    DOI = {10.1007/978-3-031-97141-9_3},
    booktitle = {Natural Language Processing and Information Systems},
    publisher = {Springer Nature Switzerland},
    author = {Duan, Changxu},
    year = {2025},
    month = jul,
    pages = {34--48}
}
