Changxu Duan

Knowledge Graph & Generative AI Engineer | NLP & Document Intelligence

5 years in NLP and Machine Learning, with deep expertise in Knowledge Graphs and their integration with Generative AI to address LLM challenges such as hallucination and compliance. Built enterprise-scale Knowledge Graphs (50K+ entities) and end-to-end data pipelines processing 100K+ documents, from schema design to production deployment. Experienced with the RDF/OWL/SPARQL stack and property graphs (Neo4j), with 8 peer-reviewed publications in NLP and document analysis. Passionate about applying Knowledge Graphs as key differentiators for Business AI solutions.

Professional Experience

Research Assistant (NLP & Knowledge Graph)

08.2021 - 12.2024

Technical University of Darmstadt

Served as the team's NLP engineer on the BMBF-funded project InsightsNet, building production ML systems for knowledge extraction from scientific publications.

  • Architected, developed, and deployed an end-to-end PDF-to-structured-data pipeline that processed over 100K scientific publications, incorporating layout analysis and entity-relation extraction.
  • Developed a novel layout-aware text editing model for PDF-to-Markdown conversion, achieving a 1.7× inference speedup over baseline models.
  • Built a citation intent classification system using orthogonal representation learning, improving F1-score by 8% over SOTA.
  • Designed and implemented annotation guidelines, knowledge graph schema and quality control framework for a multi-disciplinary corpus, leading a team of 15 annotators and constructing an enterprise-ready knowledge graph with over 50K verified entity relations.
  • Collaborated closely with domain experts, data owners, and software engineers to align knowledge graph modeling decisions with downstream analytics and application requirements.

Machine Learning Research Intern

04.2020 - 04.2021

Robert Bosch GmbH

Developed deep learning models for time-series forecasting in IoT sensor data (e-Bike sensors).

  • Designed and benchmarked multiple generative models (GRU-VAE, conditional GAN) for multivariate sensor prediction, achieving ~45% MAE improvement and ~20% faster inference over a baseline LSTM.
  • Collaborated with the hardware engineering team to integrate models into a production prototype system.
  • Delivered technical documentation and model handoff, enabling seamless production deployment by the engineering team.

Technical Skills

Programming & ML Stack: Python, C/C++, Java, PyTorch, HF Transformers, sklearn, NumPy, pandas, spaCy, Git, CI/CD, Docker

NLP & Document AI: NER, IE, entity/relation extraction, citation analysis, document layout analysis, PDF parsing, OCR post-processing, semantic/knowledge modeling, Generative AI (LLMs, vLLM, RAG), KG-augmented QA/search, hallucination/fact checking via entity linking

ML, Data & Knowledge Graphs: Transformer fine-tuning (BERT, Llama, Qwen), graph-based approaches & analytics, Academic Knowledge Graph architecture, RDF/OWL/SHACL/SPARQL, property graphs & graph DBs (Neo4j, ArangoDB), data/business schema modeling, annotation pipeline design, evaluation & error analysis, efficient training/inference, reproducible ML

Education

Ph.D. Computational Linguistics (dissertation submitted, awaiting defense)

2021 - 2026

Technical University of Darmstadt

Dissertation: Enhancing Scholarly Document Accessibility and Analysis

M.Sc. Computational Linguistics

2018 - 2021

University of Stuttgart

Thesis: Semi-supervised Event-centered Emotion Analysis and Performance Prediction (in collaboration with Robert Bosch GmbH)

B.Sc. Computer Science

2014 - 2018

Henan Normal University

Selected Publications

Layout-Aware Text Editing for Efficient Transformation of Academic PDFs to Markdown

Changxu Duan, ICDAR 2025 (top-tier Document Analysis conference) [GitHub] [Paper] [arXiv]

Semantically Orthogonal Framework for Citation Classification: Disentangling Intent and Content

Changxu Duan, Zhiyin Tan, TPDL 2025 (flagship Digital Libraries conference) [GitHub] [Paper]

Multi-Disciplinary Dataset Discovery from Citation-Verified Literature Contexts

Zhiyin Tan, Changxu Duan, JCDL 2025 (top-tier Digital Libraries conference) [Web Page] [Paper] [arXiv] [Code]

Open Source & Side Projects

KG-Augmented LLM Hallucination Reduction

Poster, DTU MLOps Summer School 2022. Built an entity-linking pipeline integrating Knowledge Graphs with LLMs to verify factuality and reduce hallucinations.

SPARQL Auto-Completion VS Code Extension

Open-source developer tool with 1K+ installs, providing SPARQL IntelliSense, prefix completion, and query authoring for knowledge graphs. [GitHub]

Languages & Awards

Languages: Chinese, English (C1), German (B1)  ·  Awards: SIGIR JCDL 2025 Travel Grant; ACM-ICPC Bronze 2017