Multi-Disciplinary Dataset Discovery from Citation-Verified Literature Contexts

JCDL 2025 ∙ Long Paper ∙ Oral

Authors: Zhiyin Tan, Changxu Duan

Pipeline: citation contexts → LLM extraction → entity resolution and ranking.

Overview

Finding suitable datasets for a research question remains difficult because most dataset search engines are metadata-driven (titles, keywords, repository fields). This works when metadata is complete and terminology matches, but often fails for interdisciplinary topics or datasets with sparse/inconsistent metadata.

We propose a literature-driven alternative that treats citation contexts as semantic evidence of dataset usage. Given a query, we (1) retrieve relevant papers from the Semantic Scholar Academic Graph (S2AG) and collect sentence windows around citation markers, (2) apply schema-guided LLM extraction to identify dataset mentions and their roles (e.g., Use/Modify/Evaluate Against), and (3) consolidate mentions via deterministic, provenance-preserving entity resolution to produce a ranked dataset list with evidence and links (URL/PID when available).
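As an illustration of the "sentence windows around citation markers" idea in step (1), here is a minimal sketch. It is not the authors' implementation: the numeric bracket marker pattern, the naive sentence splitter, and the names `citation_windows` and `radius` are all assumptions made for this example.

```python
import re

# Assumption: citations appear as numeric brackets such as [12] or [3, 4].
CITATION_MARKER = re.compile(r"\[\d+(?:\s*[,-]\s*\d+)*\]")


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def citation_windows(text, radius=1):
    """Return one window per sentence that contains a citation marker.

    Each window is the citing sentence plus up to `radius` sentences
    on either side (a hypothetical parameter, not from the paper).
    """
    sentences = split_sentences(text)
    windows = []
    for i, sentence in enumerate(sentences):
        if CITATION_MARKER.search(sentence):
            start = max(0, i - radius)
            end = min(len(sentences), i + radius + 1)
            windows.append(" ".join(sentences[start:end]))
    return windows


paragraph = (
    "Image classification is well studied. "
    "We train on ImageNet [12] with standard augmentation. "
    "Results are reported on the validation split. "
    "Hyperparameters follow prior work."
)
windows = citation_windows(paragraph, radius=1)
```

With `radius=1`, the single window here spans the first three sentences and leaves out the fourth. A production system would likely need a more robust sentence splitter and support for author-year citation styles.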

On eight survey-derived computer-science queries, we achieve 47.47% average normalized recall (up to 81.82%), compared with 2.70% for Google Dataset Search and 0.00% for DataCite Commons. Expert assessments across five top-level Fields of Science beyond computer science suggest that a substantial portion of the additional datasets is high-utility and that some are novel for the chosen topics. Across the CS benchmarks, the system extracts 1,330 unique dataset entities, 68.52% of which carry a DOI/PID signal.

Workflow: query → citation contexts → LLM extraction → entity resolution.

Key contributions include:

Citation-context mining paradigm that bridges research questions to datasets using usage evidence from scientific papers (not just metadata).
Scalable context retrieval over S2AG with configurable citation directions (citing/cited) and optional LLM pre-filtering for relevance.
Schema-guided LLM extraction that returns structured records grounded in text (dataset name, evidence span, usage role, and confidence), enabling downstream validation and analysis.
Provenance-preserving entity resolution that consolidates name variants deterministically and enriches entities with links (URL/PID) for practical access and auditing.
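The last two contributions can be sketched together: a structured mention record and a deterministic consolidation step that keeps every supporting mention. This is a hedged illustration only. The record fields mirror those listed above, but `paper_id`, the normalization rule (lowercase, strip non-alphanumerics), and ranking by mention count are assumptions for this example, not the paper's actual method.

```python
import re
from dataclasses import dataclass, field

# Usage roles named in the overview.
ROLES = {"Use", "Modify", "Evaluate Against"}


@dataclass(frozen=True)
class DatasetMention:
    name: str
    evidence_span: str
    role: str
    confidence: float
    paper_id: str  # hypothetical provenance field


def is_grounded(mention, context):
    """Accept a record only if its evidence occurs verbatim in the context."""
    return (
        mention.evidence_span in context
        and mention.role in ROLES
        and 0.0 <= mention.confidence <= 1.0
    )


def normalize_name(name):
    """Deterministic key (assumed rule): lowercase, drop non-alphanumerics."""
    return re.sub(r"[^a-z0-9]+", "", name.lower())


@dataclass
class DatasetEntity:
    key: str
    variants: list = field(default_factory=list)
    mentions: list = field(default_factory=list)  # provenance is kept here


def resolve(mentions):
    """Group mentions by normalized name without discarding any of them."""
    entities = {}
    for m in mentions:
        key = normalize_name(m.name)
        entity = entities.setdefault(key, DatasetEntity(key=key))
        if m.name not in entity.variants:
            entity.variants.append(m.name)
        entity.mentions.append(m)
    # Assumed ranking: more supporting mentions first, ties broken by key.
    return sorted(entities.values(), key=lambda e: (-len(e.mentions), e.key))


context = "We train on ImageNet [12] with standard augmentation."
mentions = [
    DatasetMention("ImageNet", "train on ImageNet", "Use", 0.9, "p1"),
    DatasetMention("Image-Net", "evaluated on Image-Net", "Evaluate Against", 0.8, "p2"),
    DatasetMention("CIFAR-10", "train on CIFAR-10", "Use", 0.95, "p3"),
]
ranked = resolve(mentions)
```

Because the grouping key is a pure function of the name and the sort has an explicit tie-breaker, the same input always yields the same ranked list, and each entity can be traced back to the papers and evidence spans that support it.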

How to Cite

@inproceedings{tan2025dataset,
    title = {Multi-Disciplinary Dataset Discovery from Citation-Verified Literature Contexts},
    url = {http://dx.doi.org/10.1109/JCDL67857.2025.00022},
    DOI = {10.1109/jcdl67857.2025.00022},
    booktitle = {2025 ACM/IEEE Joint Conference on Digital Libraries (JCDL)},
    publisher = {IEEE},
    author = {Tan, Zhiyin and Duan, Changxu},
    year = {2025},
    month = Dec,
    pages = {109--118}
}