Documentation: Merged Academic Corpus

Overview

The Merged Academic Corpus is not publicly available. This documentation is provided as context for the ETO products that use the dataset.

What is this dataset?

The Merged Academic Corpus (MAC) contains detailed information on over 280 million scholarly articles, combining data from public and private sources to achieve an unmatched view of the global literature. The MAC is maintained by CSET and ETO and is not publicly available in raw form due to licensing restrictions.

Which ETO products use it?

What are its sources?

The MAC currently includes data from five commercial and open-access platforms, plus additional metadata derived from those platforms using CSET algorithms.

What are its main limitations?

The MAC doesn’t cover non-public research. It only includes research that has been publicly released and is included in one of our data sources.
The MAC’s sources may introduce problems. Errors, gaps, and judgment calls in the MAC’s underlying datasets are likely to make it into the MAC itself.
Recent years have incomplete data. It takes time for our data sources to incorporate the latest publications and metadata. The MAC reflects this lag.
The MAC’s merging and enrichment processes may introduce errors. The MAC uses automated processes to combine articles from different sources and link them to useful metadata. These processes usually work well, but there may be errors in some cases.
The MAC’s coverage of Chinese publications is incomplete. Although the MAC includes many Chinese publications, many others are only available in China-based journals that are not included in our data sources. Use particular caution when interpreting MAC-derived data related to Chinese research organizations, funders, or authors.

What are the terms of use?

Because this dataset contains licensed data from commercial providers, it is not publicly available in raw form. However, you can interact with some of the data using ETO tools.

How do I cite it?

Because the MAC is not publicly available, you should cite the relevant ETO tool or this documentation page instead.

Structure and content

The basic unit of the MAC is the article. For our purposes, "articles" include peer-reviewed publications, working papers, and other works appearing in journals, preprint servers, or similar venues. After deduplicating the articles, we compile data about each one from the MAC’s data sources, then structure the compiled data as a series of standard metadata fields.

Title

Each article can have an English-language title, a foreign-language title, both, or neither.
- 68% of articles have an English title.
- 30% have a foreign-language title.
- 3% have no available title.
When our sources include multiple titles in the same language for the same article, we use the title from the most recently published article.

Abstract

Each article can have an English-language abstract, one or (infrequently) more foreign-language abstracts, both, or neither.
- 41% of articles have an English abstract.
- 9% have a foreign-language abstract.
- 51% have no available abstract.
When our sources include multiple abstracts in the same language for the same article, we use the abstract from the most recently published article.

Title and abstract languages

We try to automatically detect the language of each article’s title(s) and abstract(s) using pycld2, a standard language identification library.
92% of articles with non-null titles have a confident language label from pycld2, and 98% of articles with non-null abstracts have a confident language label from pycld2.
English is the most common language for titles and abstracts. Outside of English, the most common language for titles is Japanese (at 5%) and for abstracts is Chinese (at 3%).

Year

Each article has an article year.
When our sources include multiple years for the same article, we use the earliest year.

Venue

75% of MAC articles have an associated publication venue, such as a journal (e.g., Nature), conference (e.g., Interspeech), or open repository (e.g., arXiv). For the remaining 25%, there was no venue data in our sources.
When our sources list more than one venue for the same article, we currently break the tie semi-arbitrarily, using the venue with the name that is last in dictionary order.
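The tie-breaking rules for article year and venue can be sketched as follows. This is an illustration only, not the MAC's pipeline code: the function names are hypothetical, and comparing venue names case-insensitively is our assumption about what "dictionary order" means.

```python
from typing import Optional, Sequence


def pick_year(years: Sequence[Optional[int]]) -> Optional[int]:
    """Return the earliest year reported by any source, ignoring nulls."""
    known = [y for y in years if y is not None]
    return min(known) if known else None


def pick_venue(venues: Sequence[Optional[str]]) -> Optional[str]:
    """Return the venue name that sorts last, ignoring nulls and empty strings."""
    known = [v for v in venues if v]
    # Assumption: "dictionary order" is treated as case-insensitive here.
    return max(known, key=str.casefold) if known else None
```

For example, `pick_venue(["arXiv", "Nature"])` returns "Nature" under this assumption; a case-sensitive comparison would instead return "arXiv".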

Authors

For each article, our data sources include each author’s name and affiliated organization (e.g., "Jane Doe, Georgetown University"). In some cases, the sources also tag the authors or their organizations with a unique identifier code, such as a ROR code or ORCID code.
For each article, we compile all of the authors associated with that article in the MAC’s data sources, then identify and remove duplicate authors using their names, affiliated organizations, and identifiers (as available), resulting in a final list of authors for the article. (For deduplication purposes only, we normalize author names by removing some special characters, reversing strings separated by exactly one comma, and standardizing whitespace and formatting for author initials.)
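A minimal sketch of this kind of name normalization is shown below. The exact set of special characters removed, the handling of initials, and the lowercasing step are all assumptions made for illustration; the source describes the steps only in general terms.

```python
import re


def normalize_author_name(name: str) -> str:
    """Normalize an author name for deduplication purposes (illustrative only)."""
    # Reverse "Last, First" when the string contains exactly one comma.
    if name.count(",") == 1:
        last, first = (part.strip() for part in name.split(","))
        name = f"{first} {last}"
    # Assumption: keep word characters, whitespace, periods, hyphens, apostrophes.
    name = re.sub(r"[^\w\s.\-']", "", name)
    # Standardize initials: "J.R." and "J. R." both become "J R".
    name = name.replace(".", " ")
    # Collapse runs of whitespace; lowercasing is an added assumption.
    return " ".join(name.split()).lower()
```

With this sketch, "Doe, J. R." and "J.R. Doe" both normalize to "j r doe", so the two records would be compared as the same name.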
79% of MAC articles have at least one associated author. For the remaining 21%, there was no author data in our sources.

Author organizations

MAC’s underlying data sources often contain both raw organizational affiliations and provider-cleaned affiliations, but for MAC author organizations we take the raw affiliations and run them through our own internal entity resolution pipeline to create a clean, unified output affiliation. Our internal pipeline also links the unified affiliation to a ROR identifier, in the case that such an identifier exists for that organization.
We use the resulting affiliation data to identify one or more organizations for each author of a given article in the MAC. (In some cases, an author is listed on an article as affiliated with multiple organizations.)
The MAC’s author organization data is article-specific. For example, if Professor Doe moves from Georgetown to Oxford, she will be counted as affiliated with Georgetown for the articles she published while at Georgetown, and with Oxford for the articles she published after moving there.
In general, you can think of the MAC’s affiliation data as a table with three columns: one listing the specific article, one listing the name of an author of that article, and one listing the organization that author is associated with in that article.
- If no source lists an organizational affiliation for an author, we assign that author a null organization.
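The three-column view of the affiliation data can be illustrated with a small in-memory table. The article identifiers, the second author, and the helper function are hypothetical; the MAC's actual schema is not public.

```python
from typing import List, NamedTuple, Optional


class Affiliation(NamedTuple):
    article_id: str
    author_name: str
    organization: Optional[str]  # None when no source lists an affiliation


ROWS = [
    Affiliation("article-1", "Jane Doe", "Georgetown University"),
    Affiliation("article-2", "Jane Doe", "University of Oxford"),
    Affiliation("article-2", "John Roe", None),
]


def organizations_for(author: str, article_id: str) -> List[Optional[str]]:
    """Return the organizations listed for one author on one article."""
    return [
        row.organization
        for row in ROWS
        if row.author_name == author and row.article_id == article_id
    ]
```

Because affiliations are stored per article, the same author can resolve to different organizations on different articles, as in the Georgetown and Oxford example above.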

Author organization countries

We try to assign a country to each author organization:
- If the organization has a ROR identifier after the organization resolution process, we use the country specified by ROR.
- If the organization doesn’t have a ROR identifier, we use the country identified by our internal entity resolution model.
- We use an internal mapping table to normalize country names (for example, assigning "USA" and "U.S." to the United States).
- If none of our sources include country information we can extract, we assign a null value for this field. 28% of distinct organizations have no country affiliation.
Here (and generally in ETO resources) we use "country" informally, as a shorthand term for sovereign countries, independent states, and certain other geographic entities. Specifically, any entity that has an ISO 3166-1 code is termed a "country" in ETO resources. In other words, entities such as territories and special administrative regions may or may not appear in MAC data separately from the sovereign countries with which they are associated (if any), but if they do, they are described as "countries."
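The country assignment rules can be sketched as a simple fallback chain. The alias table below is a hypothetical excerpt, and treating a missing ROR country as "no ROR identifier" is a simplifying assumption.

```python
from typing import Optional

# Hypothetical excerpt of a country-name normalization table.
COUNTRY_ALIASES = {
    "usa": "United States",
    "u.s.": "United States",
    "united states": "United States",
}


def normalize_country(raw: Optional[str]) -> Optional[str]:
    """Map a raw country string to a canonical name, or None if empty."""
    cleaned = raw.strip() if raw else ""
    if not cleaned:
        return None
    return COUNTRY_ALIASES.get(cleaned.lower(), cleaned)


def assign_country(ror_country: Optional[str],
                   model_country: Optional[str]) -> Optional[str]:
    """Prefer the ROR country; otherwise fall back to the model's country."""
    return normalize_country(ror_country) or normalize_country(model_country)
```

An organization with neither a ROR country nor a model-identified country ends up with a null value, matching the last rule in the list.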

Author organization types

Each author-affiliated organization is assigned to one of five types: Education, Company, Nonprofit, Government, or null (unknown).
We use sectoral data from the ROR database to assign types. We assign organizations without ROR identifiers a null value for this field.

Citations and citation percentile

Article citations

Most of the MAC’s data sources include extracted article citation metadata. For each article in the source dataset, these sources provide a list of citations from that article to other articles in the dataset.
We map each cited and citing article in these lists to deduplicated MAC articles using their source-specific unique identifiers.
Using this mapping, we create two consolidated lists for each deduplicated MAC article: one listing all of the other articles the original article cites (out-citations), and one listing all of the other articles that cite the original article (in-citations).
Finally, we calculate a citation percentile for each article. This value compares the article’s total number of in-citations with all other articles published in the same year. For example, a 90th percentile article has more in-citations than about 90% of all articles published in the same year.
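As a rough illustration, a within-year citation percentile could be computed as below. The input format is hypothetical, and the treatment of ties (counting only articles with strictly fewer in-citations) is an assumption; the source does not specify how ties are handled.

```python
from bisect import bisect_left
from collections import defaultdict
from typing import Dict, Tuple


def citation_percentiles(articles: Dict[str, Tuple[int, int]]) -> Dict[str, float]:
    """Map article id -> percentile, given id -> (year, in-citation count)."""
    counts_by_year = defaultdict(list)
    for year, count in articles.values():
        counts_by_year[year].append(count)
    for counts in counts_by_year.values():
        counts.sort()

    percentiles = {}
    for article_id, (year, count) in articles.items():
        counts = counts_by_year[year]
        # Number of same-year articles with strictly fewer in-citations.
        fewer = bisect_left(counts, count)
        percentiles[article_id] = 100.0 * fewer / len(counts)
    return percentiles
```

Each article is compared only with articles from the same year, so a 2021 article with few citations is not penalized relative to older, more heavily cited work.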

Patent citations

One of the MAC’s sources, The Lens, tracks citations from patents in The Lens’s patent dataset to publications in the MAC (the patents themselves are not part of the MAC).
We use this metadata to count patent in-citations (i.e., citations by patents) and in-citation percentiles for each MAC article available in The Lens, using the same basic process as for citations between articles.

Research disciplines and fields

Each MAC article is assigned general research disciplines and specific research fields. For example, an article might be assigned to Medicine (discipline) and Cancer Research (field).
Each discipline and field has a relevance score indicating how relevant the article is to that discipline or field. We use a set of models to generate each article's relevance score for every discipline and field we monitor.
We designate the three top-scoring disciplines and three top-scoring fields as the article's overall disciplines and fields. These article-level identifications are used across many ETO tools - for example, to identify cluster disciplines and fields in the RC Cluster Dataset, to generate cluster colors in the Map of Science, and to generate discipline- or field-wide article counts in the Research Almanac.
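Selecting the three top-scoring disciplines (or fields) from a set of relevance scores might look like this. The scores in the usage note are invented for illustration.

```python
import heapq
from typing import Dict, List


def top_three(scores: Dict[str, float]) -> List[str]:
    """Return the three highest-scoring names, best first."""
    best = heapq.nlargest(3, scores.items(), key=lambda item: item[1])
    return [name for name, _ in best]
```

For instance, given scores of 0.81 for Medicine, 0.64 for Biology, 0.40 for Chemistry, and 0.12 for Physics, the function returns Medicine, Biology, and Chemistry in that order.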
For fields of particular interest for emerging technology, we also determine article relevance to select research subfields and specific research topics.
The research fields that include these subfields and topics are: Artificial Intelligence, Computer Security, Semiconductors, Bioinformatics, Biotechnology, Genetics, Immunology, Neuroscience, and Virology.

Emerging subjects

In addition to research disciplines and fields, each MAC article is classified as relevant or not to certain emerging technology subjects, such as artificial intelligence and cybersecurity.
We use a set of models to apply these tags to each article.

Unique identifiers

The MAC’s data sources sometimes include unique identifier codes for articles, authors, or organizations. These may be proprietary (i.e., codes used by vendors to identify articles within that vendor’s datasets only) or public.
We aggregate the unique identifiers associated with each article across all sources and include them in the MAC. In some cases, we also use these identifiers for deduplication.
Major identifiers in the MAC include DOI and ROR.

Fulltext

Full article text is available from two of the MAC's data sources (Semantic Scholar and arXiv). When available, this data is incorporated into the MAC.

Sources and methodology

Parts of this section are adapted from Rahkovsky et al., "AI Research Funding Portfolios and Extreme Growth" (Frontiers in Research Metrics and Analytics, 2021).

Data sources

The MAC currently includes data from:

- The Lens (commercial/closed access).
- The arXiv platform for open-access scientific articles and preprints (open access). arXiv is one of our two sources of fulltext.
- Papers with Code, a free platform for machine learning articles and related resources (open access).
- Semantic Scholar, a large-scale open-access dataset and one of our two sources of fulltext (open access).
- OpenAlex, a large-scale open-access dataset viewed as the replacement for Microsoft Academic Graph (MAG).

Some of the data in the MAC is taken directly from these sources. Other data is derived from them algorithmically, as discussed below.

Consolidating raw data from different sources

We automatically incorporate raw data into the MAC from each source weekly. (The underlying data sources are updated between daily and quarterly.) We use a set of Airflow pipelines to retrieve the raw data and send it through the MAC’s merging and enrichment processes, described below.

Deduplicating articles

There is duplication between and within the sources feeding into the MAC - articles often appear in multiple sources, or multiple times within the same source. We resolve these duplicates with an automated process.

First, we normalize every article’s title and abstract.

For matching purposes, we apply the Unicode Normalization Form Compatibility Composition (NFKC) standard: Unicode characters are decomposed by compatibility, then recomposed by canonical equivalence. We also de-accent letters; strip HTML tags, copyright signs, punctuation, numbers, and non-alphanumeric character strings; and remove all white space from the strings.
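A minimal sketch of this normalization using Python's standard library follows. The HTML tag pattern and the order of the steps are assumptions, and the function is illustrative only rather than the MAC's actual code.

```python
import re
import unicodedata

# Assumption: a simple pattern is enough to strip HTML tags for illustration.
TAG_RE = re.compile(r"<[^>]+>")


def normalize_text(text: str) -> str:
    """Normalize a title or abstract for matching (illustrative only)."""
    text = TAG_RE.sub("", text)
    # NFKC: compatibility decomposition, then canonical recomposition.
    text = unicodedata.normalize("NFKC", text)
    # De-accent: decompose, then drop combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Keep letters only, which drops digits, punctuation, symbols such as
    # the copyright sign, and all white space.
    return "".join(ch for ch in text if ch.isalpha())
```

Two records whose titles differ only in markup, accents, punctuation, or spacing produce the same normalized string and can then be compared directly.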

Next, we identify and remove duplicates. For deduplication purposes, we first filter out any article with a title, abstract, or DOI occurring more than 10 times in the MAC; from our evaluation, these tend to be short or generic records that cause a significant number of false matches if included. Among the articles that remain, we presume two articles are duplicates if one of the following is true:

They match on at least two of the following metadata fields, excluding shared null and empty values:
- Title (normalized)
- Abstract (normalized)
- References within each article
- DOI
OR they match on one of the metadata fields listed above, plus one of the following metadata fields, again excluding shared null and empty values:
- Article year
- Author last names (normalized)
OR their normalized, concatenated titles and abstracts have simhash values that differ in at most two places (using a rolling window of three characters) AND they have the same article year.
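The matching rule above can be sketched as follows. This is not the code from our repo: the `Record` class is hypothetical, the simhash uses a 64-bit fingerprint built from MD5 hashes of three-character shingles (an assumption), and treating "match on references" as identical reference sets is also an assumption.

```python
import hashlib
from dataclasses import dataclass
from typing import FrozenSet, Optional


def simhash(text: str, window: int = 3, bits: int = 64) -> int:
    """Fingerprint a string from its rolling character windows."""
    totals = [0] * bits
    for i in range(max(len(text) - window + 1, 1)):
        digest = hashlib.md5(text[i:i + window].encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for bit in range(bits):
            totals[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(bits) if totals[bit] > 0)


def hamming(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


@dataclass(frozen=True)
class Record:
    title: Optional[str] = None          # normalized
    abstract: Optional[str] = None       # normalized
    references: FrozenSet[str] = frozenset()
    doi: Optional[str] = None
    year: Optional[int] = None
    last_names: FrozenSet[str] = frozenset()  # normalized


def _same(a, b) -> bool:
    """True if both values are non-null, non-empty, and equal."""
    return bool(a) and a == b


def is_duplicate(a: Record, b: Record) -> bool:
    primary = sum(_same(getattr(a, f), getattr(b, f))
                  for f in ("title", "abstract", "references", "doi"))
    secondary = sum(_same(getattr(a, f), getattr(b, f))
                    for f in ("year", "last_names"))
    if primary >= 2 or (primary >= 1 and secondary >= 1):
        return True
    text_a = (a.title or "") + (a.abstract or "")
    text_b = (b.title or "") + (b.abstract or "")
    return (bool(text_a and text_b) and _same(a.year, b.year)
            and hamming(simhash(text_a), simhash(text_b)) <= 2)
```

The simhash branch catches near-identical text (for example, a title with a small typo) that would fail an exact match on the normalized strings, while the same-year requirement limits false matches.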

We chose this method after iteratively testing different variants against a "ground truth" dataset that included true duplicates (pairs of documents confirmed by human annotators to be different instances of the same article) and false duplicates (pairs of apparently similar documents confirmed by human annotators not to be different instances of the same article). For more details on the deduplication process, visit our GitHub repo.

Compiling article metadata

After we deduplicate articles, we link each one to the metadata associated with it or any of its duplicates in the MAC’s underlying data sources. For most metadata fields, this is a relatively straightforward process of aggregation; see above for details pertaining to each field.

However, the MAC also includes some article-level metadata fields not present in any of the underlying sources. These include research fields and subfields as well as emerging subjects such as AI.

Identifying research disciplines and fields

We score each article for relevance to research disciplines, fields, subfields, and topics using the method initially described in Toney-Wails and Dunham (2022) and expanded upon in Gelles and Dunham (2024).

In short:

We begin with a taxonomy of academic disciplines, including 19 general areas corresponding to research disciplines (such as Biology or Computer Science) and 281 more specific areas corresponding to research fields (such as Radiochemistry, Media Studies, or Nanotechnology). This taxonomy was originally developed by Microsoft researchers by extracting scientific concepts from Wikipedia, and then modified in certain fields to better suit CSET and ETO analytic needs.
For a select number of these research fields (Artificial Intelligence, Computer Security, Semiconductors, Bioinformatics, Biotechnology, Genetics, Immunology, Neuroscience, and Virology) we also include research subfields and research topics. In total, there are 102 research subfields (such as Network Security, Reinforcement Learning, or Immunomics) and 706 research topics (such as Machine Translation, Neuroimaging, or Quantum Lithography).
We generate text embeddings for each subject from the Wikipedia articles on each subject.
We then generate a similar embedding for each article in the MAC.
Finally, we calculate the similarity between the article embedding and each field embedding, and assign the article a corresponding relevance score for each discipline and field.
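The final scoring step can be sketched as below. The source does not name the similarity measure, so cosine similarity is an assumption here, and the toy two-dimensional vectors in the usage note stand in for real text embeddings.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def relevance_scores(article_vec: Sequence[float],
                     field_vecs: Dict[str, Sequence[float]]) -> Dict[str, float]:
    """Score one article embedding against every field embedding."""
    return {name: cosine_similarity(article_vec, vec)
            for name, vec in field_vecs.items()}
```

For example, an article vector of [1.0, 0.0] scores 1.0 against a field vector of [2.0, 0.0] and 0.0 against [0.0, 1.0].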

Note that although this model-driven approach is reliable in the aggregate (see Toney-Wails and Dunham (2022) for details), it occasionally produces questionable results for individual articles - that is, it may assign an article a high relevance score for a field that is not actually highly relevant to the article (as judged by a human), or a low score for a field that is relevant.

Identifying relevance to emerging technology subjects

To identify articles related to specific emerging tech subjects, we use different methods. These subjects typically cross research field boundaries. For example, AI is included as a research field within the computer science discipline, but we also include AI as an emerging technology subject using an alternative method, described below, to capture AI research across fields and disciplines. Other emerging technology subjects we identify are computer vision, natural language processing, robotics, AI safety, cybersecurity, LLMs, and chip design and fabrication.

Emerging technology subjects have fuzzy boundaries; there's no objectively correct answer to whether a particular article is "AI safety" research (for example). For each emerging subject, we try to capture articles in the MAC that subject matter experts would consider highly relevant to the subject in question. We use different methods to identify these articles depending on the subject, and we evaluate our results against "ground truth" corpora that also vary by subject. Still, it's important to note that this process inevitably involves some judgment calls. In addition, we rely on statistical models to apply the subject tags.

For both reasons, analytic results derived from the MAC's emerging technology subject tags are necessarily imprecise and should be interpreted as estimates.

AI, AI subfields, and cybersecurity

To classify articles as relevant or not to artificial intelligence, computer vision, natural language processing, robotics, and cybersecurity, we use a set of machine learning models trained on arXiv data. Articles in arXiv include subject tags that are initially provided by arXiv authors and revised by arXiv editors as appropriate. These include tags for artificial intelligence, computer vision, natural language processing, robotics, and cybersecurity. For each of these categories, we trained a separate SPECTER model on the titles and abstracts of the tagged arXiv articles. Then, we ran each model over all of the other articles in the MAC (published in 2010 or later) with English titles or abstracts, assigning each one its own set of tags. (We then add the "artificial intelligence" tag to any articles tagged for NLP, computer vision, or robotics but not initially for AI.)
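The final tag-propagation step can be illustrated with a few lines of code. The tag strings are hypothetical labels chosen for this sketch.

```python
from typing import Set

# Subfield tags that imply the broader "ai" tag (hypothetical label names).
AI_SUBFIELD_TAGS = {"nlp", "computer_vision", "robotics"}


def propagate_ai_tag(tags: Set[str]) -> Set[str]:
    """Add the "ai" tag when any AI subfield tag is present."""
    if tags & AI_SUBFIELD_TAGS:
        return tags | {"ai"}
    return set(tags)
```

Note that a cybersecurity tag alone does not trigger the AI tag in this sketch, since the source lists only NLP, computer vision, and robotics as implying AI.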

For details about how we developed, evaluated, and deployed earlier, conceptually similar versions of these models, see Dunham et al. (2020). Details of subsequent changes (e.g., moving from SciBERT to SPECTER base models) are found in Toney-Wails et al. (2024), sections 3.1 and 5.1.

AI safety

We also tag articles as relevant or not to AI safety. We consider an article an "AI safety article" if it (a) is categorized as relevant to AI and (b) is categorized by our AI safety classifier as relevant to AI safety.

The field of AI safety research is young and quickly evolving (even compared to other emerging subjects), with no authoritative and comprehensive compilations of "AI safety articles" to refer to. Our classifier systematically identifies AI safety articles in the MAC, but we caution that the results are inherently imprecise; different methods could produce different results.

To develop the AI safety classifier, we began by creating our own definition of AI safety research encompassing safety-related concepts such as robustness, misspecification, unwanted bias, explainability, and value alignment. In parallel, we compiled a set of articles potentially related to AI safety, such as MAC articles categorized as relevant to AI in general and recent articles from safety-relevant AI conferences, workshops, and open-source repositories. ETO staff read the titles and abstracts of 2806 of these articles, then manually annotated each one as relevant or not to AI safety according to our definition. (To measure the stability of the definition in practice, 256 articles were independently double-annotated by other CSET researchers; intercoder agreement was 75%.)

We then used Snorkel Flow, a platform for developing models under programmatic weak supervision, to train the model. (For more information on this approach, see "Snorkel: rapid training data creation with weak supervision" and "A Survey on Programmatic Weak Supervision.") We ingested metadata for the 2806 manually annotated articles, plus 15,000 unlabeled articles marked relevant to AI by our AI classifier, into the Snorkel Flow platform. We split this data into 15% development, 15% validation, and 70% training sets. We then developed 83 labeling functions using the development set. Of these, 69 were keyword, regular expression, or time interval matches based on the article title, abstract, publication venue, or publication year. The other 14 were based on articles' membership in clusters derived from a custom support vector machine trained over word embeddings of the article titles and abstracts. Together, these labeling functions had coverage of 99% of the development set, conflict of 10.5%, and label density of 5.329.
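To make the labeling-function vocabulary concrete, here is a plain-Python sketch that does not use Snorkel Flow itself. The two labeling functions are hypothetical examples, not any of the 83 we developed, and the metric definitions (coverage as the share of articles with at least one vote, conflict as the share with disagreeing votes, label density as the mean number of votes per article) are the definitions commonly used in weak supervision, assumed here.

```python
import re
from typing import Callable, Dict, List

ABSTAIN, NOT_RELEVANT, RELEVANT = -1, 0, 1
Article = Dict[str, str]
LabelingFunction = Callable[[Article], int]


def lf_robustness_keyword(article: Article) -> int:
    """Hypothetical keyword rule: vote RELEVANT if the title mentions robustness."""
    pattern = r"\brobust(ness)?\b"
    return RELEVANT if re.search(pattern, article["title"], re.IGNORECASE) else ABSTAIN


def lf_gardening_venue(article: Article) -> int:
    """Hypothetical venue rule: vote NOT_RELEVANT for an off-topic venue."""
    return NOT_RELEVANT if article["venue"] == "Example Gardening Journal" else ABSTAIN


def summarize(articles: List[Article], lfs: List[LabelingFunction]) -> Dict[str, float]:
    """Compute coverage, conflict, and label density over a set of articles."""
    votes = [[lf(article) for lf in lfs] for article in articles]
    active = [[v for v in row if v != ABSTAIN] for row in votes]
    n = len(articles)
    return {
        "coverage": sum(1 for row in active if row) / n,
        "conflict": sum(1 for row in active if len(set(row)) > 1) / n,
        "label_density": sum(len(row) for row in active) / n,
    }
```

Each labeling function either votes or abstains on each article; the weak labels used for training are derived from the combined votes.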

Through the Snorkel Flow platform, we used these labeling functions to create "weak" AI safety relevance labels for the data we ingested. We then trained an AutoML-tuned model using logistic regression over the weak labels. This resulted in a model that achieved a macro-averaged F1 of 82.5% on the validation set, with precision of 73.1%, recall of 79.8%, and F1 of 76.3% over articles manually labeled relevant to AI safety.
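As a quick consistency check, the reported F1 for the relevant class follows from the reported precision and recall, since F1 is their harmonic mean. The function below is a generic helper, not part of our pipeline.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


relevant_f1 = f1_score(0.731, 0.798)
print(f"{relevant_f1:.3f}")  # prints 0.763
```

The macro-averaged F1 is the unweighted mean of the per-class F1 scores, which is why it differs from the F1 for the relevant class alone.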

LLMs and chips

Finally, we use a different approach to identify articles related to large language models and chip design and fabrication. For these subjects, we apply a series of prompts to a generative LLM, currently Google’s Gemini 1.5 Flash. In the first prompt, we instruct the LLM to write a one-sentence summary of the work described in a publication’s title and abstract, to include the motivation and then the problem or research task(s) addressed and the methods applied. Then, in a second prompt, we instruct the model to classify each publication, based on the summary output from the first prompt, as relevant to the development of LLMs, chip design and fabrication, or neither.
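The two-prompt chain can be sketched with a generic text-in, text-out callable standing in for the model. The prompt wording, label strings, and the `llm` parameter are all hypothetical; this sketch does not reproduce our actual prompts or call any real Gemini API.

```python
from typing import Callable

# Hypothetical prompt templates, written only to illustrate the two steps.
SUMMARY_PROMPT = (
    "Summarize in one sentence the work described below, giving the motivation, "
    "then the problem or research task(s) addressed and the methods applied.\n"
    "Title: {title}\nAbstract: {abstract}"
)
CLASSIFY_PROMPT = (
    "Classify the publication summarized below as 'llm', 'chips', or 'neither'.\n"
    "Summary: {summary}"
)
LABELS = {"llm", "chips", "neither"}


def classify_publication(title: str, abstract: str,
                         llm: Callable[[str], str]) -> str:
    """Summarize the publication, then classify it from the summary alone."""
    summary = llm(SUMMARY_PROMPT.format(title=title, abstract=abstract))
    label = llm(CLASSIFY_PROMPT.format(summary=summary)).strip().lower()
    # Assumption: treat any unexpected model output as "neither".
    return label if label in LABELS else "neither"
```

Classifying from the summary rather than the raw abstract means the second prompt sees a short, uniformly structured input regardless of how the original abstract was written.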

This zero-shot approach offers substantial efficiency gains. For each model, we manually labeled a small set of papers in 2024 for use in prompt development. We then drew and labeled a larger random sample for initial evaluation purposes, but overall annotated many fewer papers than would have been necessary under a supervised approach.

As an initial filtering step, we run this generative LLM method only on articles in broader domains relevant to LLMs and chips, respectively. Specifically:

We count an article as related to LLM research if it (a) is tagged as an AI article according to the method described above and (b) is flagged as relevant using the generative LLM method.
We count an article as related to chip design and fabrication research if it (a) includes chemistry, engineering, physics, or materials science as one of its three highest-scoring general subjects and (b) is flagged as relevant using our generative LLM method.

In each case, we include only articles published in 2010 or later with English titles or abstracts.
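These inclusion rules amount to a pair of boolean checks, sketched below. The parameter names and discipline strings are hypothetical; the flags are assumed to come from the AI classifier and the generative LLM method described above.

```python
from typing import Iterable

CHIP_DISCIPLINES = {"chemistry", "engineering", "physics", "materials science"}
MIN_YEAR = 2010


def in_scope(year: int, has_english_text: bool) -> bool:
    """Only articles from 2010 onward with an English title or abstract qualify."""
    return year >= MIN_YEAR and has_english_text


def is_llm_article(year: int, has_english_text: bool,
                   tagged_ai: bool, llm_flag: bool) -> bool:
    """LLM research: tagged as AI and flagged by the generative LLM method."""
    return in_scope(year, has_english_text) and tagged_ai and llm_flag


def is_chip_article(year: int, has_english_text: bool,
                    top_disciplines: Iterable[str], chip_flag: bool) -> bool:
    """Chip research: a relevant top-three discipline and flagged by the LLM method."""
    in_domain = any(d.lower() in CHIP_DISCIPLINES for d in top_disciplines)
    return in_scope(year, has_english_text) and in_domain and chip_flag
```

In practice the domain check would run first, so that the generative LLM is applied only to articles that pass it.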

For initial evaluation statistics for the LLM and chip research classification methods, read CSET's technical paper: Identifying Emerging Technologies in Research. Our public repo has related code and technical details.

Known limitations

The MAC doesn’t cover non-public research. The MAC only includes research that has been publicly released and is included in one of our data sources. We believe these sources cover a significant fraction of publicly released research worldwide (and in particular, the large majority of published research in English), but a great deal of research is never made public. In particular, military research and commercial research may never be written up in open sources, or even written up at all. We have no way of knowing how much of this "hidden" research exists, or how different the MAC would look if it were somehow included.
The MAC’s sources may introduce problems. The MAC incorporates articles from many different datasets, making it a uniquely comprehensive dataset on worldwide research. No dataset is perfect, though. Errors, gaps, lags, and judgment calls in the MAC’s underlying datasets are likely to make it into the MAC itself.
Recent years have incomplete data. It takes time for our data sources to incorporate the latest publications and metadata. The MAC reflects this lag. Different sources and data types may take longer or shorter to be integrated, but as a rough, conservative rule of thumb, MAC data from the most recent two years may be materially incomplete. (Even earlier years may also be incomplete, but to a lesser degree that we believe is less likely to affect analysis.) We continuously add new data for both the current year and all prior years as our sources provide it.
The MAC’s merging and enrichment processes may introduce errors. The MAC uses automated processes to combine articles from different datasets and create metadata about them. These processes usually work well, but they do have limitations, especially when there are issues with the raw article data. These issues include:
- Organization and country data has gaps. As described above, we rely on the MAC's underlying data sources to extract raw authors and organizations for each publication, and use our own methods to canonicalize these organizations and link them (and, in turn, the publications) with countries. However, organizational metadata is sometimes missing or incompletely extracted in the sources that feed the MAC, and our methods to fill these gaps are not 100% effective.
- We have not done complete metadata translation. So far we have completely translated only the names of funding organizations. Other fields have been translated in part. However, translating all article titles and abstracts is cost-prohibitive.
- Deduplication is less effective when articles have limited metadata. Our method of deduplicating articles across corpora relies on the presence of six metadata fields. If some of these fields are absent, we are less likely to successfully merge the articles. Additionally, if an article’s title or abstract appears in multiple languages across datasets, we will have to rely on that article’s other metadata to perform a successful match.
- Named entities may not be fully resolved. Named entities, such as organizations, authors, and articles, are often given different names by the authors of publications. For example, while an author may state that they work for "Google" in one article, a different author may identify themselves as working for "Google Cloud Services" in another, "Google UK" in a third, or even "AI Team - Google Research" in a different one. For the MAC, we have a well-performing entity resolution algorithm designed to resolve these disagreements, but there may still be cases where they go uncaught. This could affect some calculations using the MAC. For example, an author’s articles could be split across multiple versions of the author’s name, making it seem like that author has written fewer articles than she really has.
- Sub-organizations are not linked to their parents. Some organizations in the MAC are parts of other organizations. For example, an article might have some authors associated with "Google Research" and others associated with "Google Cloud Services." In some contexts, users might want to group these organizations (and their authors) together under "Google." The MAC doesn’t group organizations like this.
The MAC’s coverage of Chinese publications is incomplete. Although the MAC includes many Chinese publications, many others are only available in China-based journals that are not included in our data sources. (Earlier versions of the MAC had better coverage of these sources, but unfortunately, ETO and many other organizations outside China are no longer able to access them.) Use particular caution when interpreting MAC-derived data related to Chinese research organizations, funders, or authors.

Maintenance

How are the data updated?

We update the MAC through a sequence of automated pipelines that retrieve data from our sources, merge it together, and enrich it. These pipelines normally run weekly, with occasional pauses to resolve issues due to vendor data changes or failures of automated checks in the pipelines.

The underlying data sources are updated on their own schedules, between daily and quarterly.

Credits

Virtually all CSET data team members have contributed to the MAC in some form, whether by providing feedback, reviewing code, or helping generate ideas. In alphabetical order, some particular contributions follow:

Zach Arnold: documentation, data characterization and normalization
Daniel Chou: Chinese-language data parsing and normalization
James Dunham: Article classifier development, field of study modeling, citation percentile calculation, organizational entity resolution, metadata merge
Rebecca Gelles: Organizational entity resolution, metadata merge, fields of study modeling, documentation
Jennifer Melot: Article linkage, metadata merge, data orchestration, documentation
Katherine Quinn: Article linkage, metadata merge
Ilya Rahkovsky: Article linkage, metadata merge
Christian Schoeberl: Article classifier development
Autumn Toney-Wails: Subject modeling

Student research assistants Chenxi Liu, Luwei Lei, and Jerod Sun contributed data characterization and normalization.

Emerging technology topic classifications in the MAC are based upon work supported in part by the Alfred P. Sloan Foundation under Grant No. G-2023-22358.

Major change log

8/??/25	Entity resolution model, fields of study, and underlying data sources updated
12/29/24	New emerging subjects added
11/22/23	Updates related to new underlying data sources and discipline/field classifiers
5/19/23	New emerging subjects added as part of Research Almanac launch
10/13/22	Initial release (ETO/CSET internal)