Education
Ph.D. Linguistics
2021 - 2026 (expected)
Technical University of Darmstadt
Dissertation: Enhancing Scholarly Document Accessibility and Analysis: Frameworks for Semantically Rich PDFs, Efficient Transformation, and Advanced Bibliometric Applications
Advisors: Prof. Dr. Andrea Rapp, Dr. Sabine Bartsch
M.Sc. Computational Linguistics
2018 - 2021
University of Stuttgart
Thesis: Semi-supervised Event-centered Emotion Analysis and Performance Prediction (completed at Robert Bosch GmbH)
B.Sc. Computer Science
2014 - 2018
Henan Normal University, China
Research Interests
Core Areas: Document Understanding, Information Extraction, Layout Analysis, Scholarly Document Processing, Citation Analysis
Methods: Multimodal Deep Learning, Sequence-to-Sequence Models, Representation Learning, Transfer Learning, Semi-supervised Learning
Applications: Scientific Knowledge Graphs, Bibliometrics, Cross-disciplinary Research Discovery, Document Accessibility
Publications
Beyond Catalogue Counts: Quantifying Visibility Bias in Low-Resource Multilingual NLP
Zhiyin Tan, Changxu Duan*
Proceedings of the Language Resources and Evaluation Conference (LREC 2026)
[Conference]
[GitHub]
Layout-Aware Text Editing for Efficient Transformation of Academic PDFs to Markdown
Changxu Duan
Proceedings of the 19th International Conference on Document Analysis and Recognition (ICDAR 2025)
[GitHub]
[Paper]
[arXiv]
Semantically Orthogonal Framework for Citation Classification: Disentangling Intent and Content
Changxu Duan, Zhiyin Tan
Proceedings of the 29th International Conference on Theory and Practice of Digital Libraries (TPDL 2025)
[GitHub]
[Paper]
[arXiv]
[Slides]
Accelerating End-to-End PDF to Markdown Conversion through Assisted Generation
Changxu Duan
Proceedings of the 30th International Conference on Natural Language & Information Systems (NLDB 2025)
[GitHub]
[Paper]
[arXiv]
Multi-Disciplinary Dataset Discovery from Citation-Verified Literature Contexts
Zhiyin Tan, Changxu Duan
Proceedings of the ACM/IEEE Joint Conference on Digital Libraries (JCDL 2025)
[Web Page]
[Paper]
[arXiv]
[GitHub]
[Slides]
Bridging scientific publication accessibility: LaTeX-markup-PDF-alignment
Changxu Duan
TUGboat (Communications of the TeX Users Group) 45:2, 2024
[Paper]
LaTeX Rainbow: Universal LaTeX to PDF Document Semantic & Layout Annotation Framework
Changxu Duan, Zhiyin Tan, Sabine Bartsch
Proceedings of the Workshop on Information Extraction from Scientific Publications (WIESP) at IJCNLP-AACL 2023
[Paper]
[GitHub]
Presenting an Annotation Pipeline for Fine-grained Linguistic Analyses of Multimodal Corpora
Elena Volkanovska, Sherry Tan, Changxu Duan, Debajyoti Paul Chowdhury, Sabine Bartsch
Proceedings of the 1st Workshop on Linguistic Insights from and for Multimodal Language Processing at KONVENS 2023
[Paper]
The InsightsNet Climate Change Corpus (ICCC)
Elena Volkanovska, Sherry Tan, Changxu Duan, Sabine Bartsch, Wolfgang Stille
Datenbank-Spektrum 23, 195–204 (2023)
[Paper]
Research Experience
Research Assistant
08.2021 - 12.2024
Technical University of Darmstadt
BMBF-funded project InsightsNet — Knowledge Mining from Scientific Publications
- Developed novel layout-aware text editing model for PDF-to-Markdown conversion achieving 1.7× inference speedup over baseline Transformer models (ICDAR 2025).
- Introduced semantically orthogonal representation framework for citation classification that disentangles citation intent from content, improving F1-score by 8% over SOTA (TPDL 2025).
- Constructed large-scale annotated corpus of 10,000+ scientific publications with fine-grained semantic and layout annotations, establishing annotation guidelines and inter-annotator agreement protocols.
- Built cross-disciplinary knowledge graph from citation contexts containing 50,000+ entity relations, enabling novel bibliometric analyses and dataset discovery applications (JCDL 2025).
- Collaborated with international research partners and presented work at conferences including ICDAR, TPDL, NLDB, JCDL, and WIESP.
Machine Learning Research Intern
04.2020 - 04.2021
Corporate Research, Robert Bosch GmbH
Master's thesis research on generative models for time-series forecasting in IoT sensor data (e-bike sensors).
- Designed and benchmarked generative models (GRU-VAE, conditional GAN) for multivariate sensor prediction, achieving ~45% MAE improvement and ~20% faster inference over baseline LSTM.
- Conducted ablation studies and robustness analyses under label scarcity and distribution shift, deriving empirically grounded model selection guidelines.
- Collaborated with hardware engineering team to integrate models into a prototype system and validate on real sensor streams.
Mentoring & Service
Mentoring
- Remote Research Training Mentor, Henan Normal University (2025–present)
Peer Review
- Reviewer: TPDL 2025, TPDL 2026, ICME 2025, ICASSP 2026, ICDAR 2026
Talks & Posters
- Poster: "Explaining Language Model Generation through Entity Linking," ML Operations Summer School (2022)
Open Source Contributions
Developed and maintained research codebases and tools:
texannotate (LaTeX annotation framework),
EditTrans (PDF conversion),
and SPARQL-autocomplete (VS Code extension; 10,000+ installs).
Languages & Honors
- Languages: Chinese (Native), English (C1/Fluent), German (B1/Intermediate)
- Grants & Awards: SIGIR Student Travel Grant (JCDL 2025); First Prize in National Algorithm Competition (2017); ACM-ICPC Bronze Medal (2017); Robotics Competition Awards (2016)