Changxu Duan

Computational Linguistics | Document AI | NLP

Computational linguist specializing in document understanding, information extraction, and multimodal NLP for scientific text processing. My research advances neural approaches to layout-aware document parsing, semantic segmentation of scholarly publications, and computational methods for bibliometric analysis. I develop novel architectures that bridge computer vision and natural language processing to extract structured knowledge from unstructured documents.

Education

Ph.D. Linguistics

2021 - 2026 (expected)

Technical University of Darmstadt

Dissertation: Enhancing Scholarly Document Accessibility and Analysis: Frameworks for Semantically Rich PDFs, Efficient Transformation, and Advanced Bibliometric Applications

Advisors: Prof. Dr. Andrea Rapp, Dr. Sabine Bartsch

M.Sc. Computational Linguistics

2018 - 2021

University of Stuttgart

Thesis: Semi-supervised Event-centered Emotion Analysis and Performance Prediction (completed at Robert Bosch GmbH)

B.Sc. Computer Science

2014 - 2018

Henan Normal University, China

Research Interests

Core Areas: Document Understanding, Information Extraction, Layout Analysis, Scholarly Document Processing, Citation Analysis

Methods: Multimodal Deep Learning, Sequence-to-Sequence Models, Representation Learning, Transfer Learning, Semi-supervised Learning

Applications: Scientific Knowledge Graphs, Bibliometrics, Cross-disciplinary Research Discovery, Document Accessibility

Publications

Beyond Catalogue Counts: Quantifying Visibility Bias in Low-Resource Multilingual NLP

Zhiyin Tan, Changxu Duan*, Proceedings of the Language Resources and Evaluation Conference (LREC 2026) [Conference] [GitHub]

Layout-Aware Text Editing for Efficient Transformation of Academic PDFs to Markdown

Changxu Duan, Proceedings of the 19th International Conference on Document Analysis and Recognition (ICDAR 2025) [GitHub] [Paper] [arXiv]

Semantically Orthogonal Framework for Citation Classification: Disentangling Intent and Content

Changxu Duan, Zhiyin Tan, Proceedings of the 29th International Conference on Theory and Practice of Digital Libraries (TPDL 2025) [GitHub] [Paper] [arXiv] [Slides]

Accelerating End-to-End PDF to Markdown Conversion through Assisted Generation

Changxu Duan, Proceedings of the 30th International Conference on Natural Language & Information Systems (NLDB 2025) [GitHub] [Paper] [arXiv]

Multi-Disciplinary Dataset Discovery from Citation-Verified Literature Contexts

Zhiyin Tan, Changxu Duan, Proceedings of the ACM/IEEE Joint Conference on Digital Libraries (JCDL 2025) [Web Page] [Paper] [arXiv] [GitHub] [Slides]

Bridging scientific publication accessibility: LaTeX-markup-PDF-alignment

Changxu Duan, TUGboat (Communications of the TeX Users Group) 45:2, 2024 [Paper]

LaTeX Rainbow: Universal LaTeX to PDF Document Semantic & Layout Annotation Framework

Changxu Duan, Zhiyin Tan, Sabine Bartsch, Proceedings of the Workshop on Information Extraction from Scientific Publications (WIESP) at IJCNLP-AACL 2023 [Paper] [GitHub]

Presenting an Annotation Pipeline for Fine-grained Linguistic Analyses of Multimodal Corpora

Elena Volkanovska, Sherry Tan, Changxu Duan, Debajyoti Paul Chowdhury, Sabine Bartsch, Proceedings of the 1st Workshop on Linguistic Insights from and for Multimodal Language Processing at KONVENS 2023 [Paper]

The InsightsNet Climate Change Corpus (ICCC)

Elena Volkanovska, Sherry Tan, Changxu Duan, Sabine Bartsch, Wolfgang Stille, Datenbank-Spektrum 23, 195–204 (2023) [Paper]

Research Experience

Research Assistant

08.2021 - 12.2024

Technical University of Darmstadt

BMBF-funded project InsightsNet — Knowledge Mining from Scientific Publications

  • Developed novel layout-aware text editing model for PDF-to-Markdown conversion achieving 1.7× inference speedup over baseline Transformer models (ICDAR 2025).
  • Introduced semantically orthogonal representation framework for citation classification that disentangles citation intent from content, improving F1-score by 8% over SOTA (TPDL 2025).
  • Constructed large-scale annotated corpus of 10,000+ scientific publications with fine-grained semantic and layout annotations, establishing annotation guidelines and inter-annotator agreement protocols.
  • Built cross-disciplinary knowledge graph from citation contexts containing 50,000+ entity relations, enabling novel bibliometric analyses and dataset discovery applications (JCDL 2025).
  • Collaborated with international research partners and presented work at conferences including ICDAR, TPDL, NLDB, JCDL, and WIESP.

Machine Learning Research Intern

04.2020 - 04.2021

Corporate Research, Robert Bosch GmbH

Master's thesis research on generative models for time-series forecasting in IoT sensor data (e-bike sensors).

  • Designed and benchmarked generative models (GRU-VAE, conditional GAN) for multivariate sensor prediction, achieving ~45% MAE improvement and ~20% faster inference over baseline LSTM.
  • Conducted ablation studies and robustness analyses under label scarcity and distribution shift, deriving empirically grounded model selection guidelines.
  • Collaborated with hardware engineering team to integrate models into a prototype system and validate on real sensor streams.

Mentoring & Service

Mentoring
  • Remote Research Training Mentor, Henan Normal University (2025–present)
Peer Review
  • Reviewer: TPDL 2025, TPDL 2026, ICME 2025, ICASSP 2026, ICDAR 2026
Talks & Posters
  • Poster: "Explaining Language Model Generation through Entity Linking," ML Operations Summer School (2022)

Open Source Contributions

Developed and maintained research codebases and tools: texannotate (LaTeX annotation framework), EditTrans (PDF conversion), SPARQL-autocomplete (VS Code extension; 10,000+ installs)

Languages & Honors

  • Languages: Chinese (Native), English (C1/Fluent), German (B1/Intermediate)
  • Grants & Awards: SIGIR Student Travel Grant (JCDL 2025); First Prize in National Algorithm Competition (2017); ACM-ICPC Bronze Medal (2017); Robotics Competition Awards (2016)