In this paper, we investigate the representation and evolution of the concept of “ancestry” in PLOS Genetics from 2005 to 2025. We combine computational text-mining techniques with large language models (LLMs) to analyze a corpus of 15,758 sentences containing the stem “ancest” across 1,176 open-access articles. A set of definitional criteria distinguishing genealogical ancestry from genetic ancestry (quantitative estimates or DNA-based evidence) was developed by experts and manually applied to a subset of 200 sentences. We then tested three LLM-based categorization approaches using the OpenAI API, evaluating their run-to-run consistency and their alignment with our manual categorization. Results show that while simpler categorization schemes improve consistency between runs, overall agreement with human coding remains low, highlighting the challenges of automating nuanced conceptual categorizations. Applying the most reliable approach to the full corpus reveals a temporal trend: genealogical references to ancestry decrease over time, whereas genetic references increase, suggesting a shift toward quantitative and molecular framings in the scientific literature.
In this paper, we also investigate how “admixture” is defined in the scientific literature using text-mining approaches, which enable the analysis of a large corpus. We compiled a dataset of 5,625 sentences containing the prefix “admix” from 101 open-access PLOS Genetics articles. From this dataset, we selected a subset of 420 sentences for manual examination to identify definitional categories. Using the OpenAI API, we then automatically classified this subset according to the defined categories. Our results indicate that most references to “admixture” occur in the context of population- and individual-level effects or of methodological applications. In parallel, a co-occurrence analysis of the full corpus confirmed that these definitional categories could also be recovered automatically.
These findings demonstrate the potential of computational linguistics methods for studying the semantic evolution of biological concepts across the scientific literature. Work in progress uses collocation analysis and knowledge graphs to further refine the definitional categories embedded in the corpus.
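
As an illustration only, the sentence-selection and agreement steps described above might be sketched as follows. This is a minimal sketch under stated assumptions, not the pipeline used in the study: the function names, the naive sentence splitter, and the choice of Cohen's kappa as the agreement measure are all hypothetical, since the abstract does not specify them.

```python
import re
from collections import Counter

# Assumption: a term is matched by its stem at a word boundary, case-insensitively.
ANCEST = re.compile(r"\bancest\w*", re.IGNORECASE)


def sentences_with_stem(text, pattern=ANCEST):
    """Return the sentences of `text` that contain a match for `pattern`.

    Uses a naive splitter (break after '.', '!' or '?' followed by whitespace);
    a real corpus would likely need a proper sentence tokenizer.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if pattern.search(s)]


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two label sequences of equal length.

    Cohen's kappa is an assumed metric here, e.g. for comparing manual codes
    with automatic codes, or two automatic runs with each other.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:  # both raters used a single identical label
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    sample = ("Ancestry was inferred from SNP data. We then ran PCA. "
              "The ancestral allele is shown in bold.")
    print(sentences_with_stem(sample))

    manual = ["genealogical", "genealogical", "genetic", "genetic"]
    automatic = ["genealogical", "genetic", "genetic", "genetic"]
    print(cohen_kappa(manual, automatic))
```

The same `sentences_with_stem` helper could be reused for the “admix” prefix by passing a different compiled pattern.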
