About The Metascience Observatory Explorer

SciSciNet-V2 data ingestion

The primary data source for the Metascience Observatory Explorer is the 2025 SciSciNet-V2 dataset. If you find this explorer useful for your work, please acknowledge and cite the SciSciNet paper. Read more about SciSciNet-V2 on GitHub. SciSciNet-V2 is licensed under the MIT License.

How to cite: Lin, Z., Yin, Y., Liu, L. et al. SciSciNet: A large-scale open data lake for the science of science research. Sci Data 10, 315 (2023). https://doi.org/10.1038/s41597-023-02198-9

Which parts of SciSciNet-V2 were not included? The paper embeddings are not used, and the field-normalized citation scores are not presented here. The Nobel Laureate dataset has not yet been incorporated, and some of the papers imported from papers_raw are orphaned because they are not linked to any known person.

What data sources were used to construct SciSciNet-V2? SciSciNet-V2 integrates data from a range of upstream sources. The original SciSciNet (V1) was built on top of the Microsoft Academic Graph (MAG), which Microsoft retired at the end of 2021. SciSciNet-V2 replaces MAG with OpenAlex as its core bibliographic graph. The major upstream datasets integrated into SciSciNet-V2 are:

  • OpenAlex — the open bibliographic graph (successor to Microsoft Academic Graph) providing papers, authors, institutions, journals, concepts, and citation links. This is the base catalog for SciSciNet-V2.
  • PubMed — biomedical publication records with PubMed IDs, used to link SciSciNet papers to NIH projects, clinical trials, and other biomedical entities.
  • NIH RePORTER — data on NIH-funded projects with linkages to publications, patents, and clinical studies.
  • NSF Awards — data on NSF-funded projects with linkages to publications.
  • ClinicalTrials.gov — clinical studies and their linkages to referenced papers worldwide.
  • PatentsView and patent-to-science citations — USPTO and EPO patent records, plus citations extracted from patent front pages and body text that link back to scientific papers.
  • Crossref Event Data — online attention metadata used to track mentions of papers in news feeds and on Twitter/X.
  • Nobel Laureate publication records — publication and prize-winning paper records for Nobel laureates, used to flag laureate-authored work. (Not yet surfaced in the Explorer.)

Additional data sources

Beyond SciSciNet-V2, we have incorporated data from the following sources:

  • Scopus Source Title List (November 2025) — journal metadata, ISSNs, publishers, coverage dates, open access status, and ASJC subject classifications. Used to enrich journal records and to add Scopus-indexed journals that are not present in OpenAlex.
  • Directory of Open Access Journals (DOAJ) (December 2025) — open access licensing information, Article Processing Charges (APCs), Library of Congress Classification (LCC) codes, and language coverage for open access journals.
  • Retraction Watch Database — retracted paper records matched by DOI and PMID, including retraction dates and reasons.
  • Retraction Watch Hijacked Journal Checker — a registry of hijacked (counterfeit) journals, used to flag potentially fraudulent journal entries.
  • Stop Predatory Journals (based on Beall's List) — lists of standalone predatory journals and predatory publishers, used to flag journals of questionable quality.
  • MeSH (Medical Subject Headings) — the 2026 edition of the NLM biomedical controlled vocabulary, including descriptors, scope notes, and the full tree hierarchy.
  • MEDLINE / PubMed Baseline — used to extract MeSH term annotations for biomedical papers and link them back to the knowledge graph.
  • ClinicalTrials.gov (December 2025 snapshot) — loaded directly to provide full trial registration details (titles, conditions, interventions, sponsors, phases, enrollment, and status) on top of the paper→trial links that come from SciSciNet-V2.
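
For readers curious how the "matched by DOI and PMID" step for the Retraction Watch Database might work, the following is a minimal sketch, not the Explorer's actual ingestion code. It assumes papers and retraction records are plain dictionaries. The field names (paper_id, doi, pmid, reason) and the function names are hypothetical, chosen only for illustration. The sketch tries a normalized DOI first and falls back to the PMID.

```python
import re


def normalize_doi(doi):
    """Lowercase a DOI and strip common URL or 'doi:' prefixes; None if empty."""
    if not doi:
        return None
    doi = doi.strip().lower()
    doi = re.sub(r"^(https?://(dx\.)?doi\.org/|doi:)", "", doi)
    return doi or None


def normalize_pmid(pmid):
    """Return the PMID as a digit string, or None if missing or malformed."""
    if pmid is None:
        return None
    digits = str(pmid).strip()
    return digits if digits.isdigit() and digits != "0" else None


def match_retractions(papers, retractions):
    """Map paper_id -> retraction record, matching on DOI first, then PMID.

    All field names here are hypothetical placeholders.
    """
    by_doi, by_pmid = {}, {}
    for rec in retractions:
        doi = normalize_doi(rec.get("doi"))
        pmid = normalize_pmid(rec.get("pmid"))
        if doi:
            by_doi.setdefault(doi, rec)  # keep the first record per DOI
        if pmid:
            by_pmid.setdefault(pmid, rec)  # keep the first record per PMID

    matches = {}
    for paper in papers:
        rec = by_doi.get(normalize_doi(paper.get("doi")))
        if rec is None:
            rec = by_pmid.get(normalize_pmid(paper.get("pmid")))
        if rec is not None:
            matches[paper["paper_id"]] = rec
    return matches


# Illustrative, made-up records (not real data).
papers = [
    {"paper_id": "W1", "doi": "https://doi.org/10.1000/ABC", "pmid": None},
    {"paper_id": "W2", "doi": None, "pmid": 12345},
    {"paper_id": "W3", "doi": "10.1000/xyz", "pmid": None},
]
retractions = [
    {"doi": "10.1000/abc", "pmid": "", "reason": "Data concerns"},
    {"doi": "", "pmid": "12345", "reason": "Duplication"},
]
matches = match_retractions(papers, retractions)
```

Normalizing before comparison matters because DOIs are case-insensitive and often appear as full URLs, while PMIDs may be stored as integers in one source and strings in another.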