BibFusion: A Python package to integrate, deduplicate, and harmonize exported bibliographic records from Scopus and Web of Science for bibliometric analysis

Authors

DOI:

https://doi.org/10.47909/ijsmc.342

Keywords:

bibliometrics, scientometrics, cross-database integration, Scopus, Web of Science, metadata preprocessing, author disambiguation, citation networks, reproducible research

Abstract

Objective. This study presents BibFusion, a Python software package that harmonizes bibliographic exports from Scopus and Web of Science into a single, traceable, analysis-ready corpus for bibliometric and scientometric research.

Design/Methodology/Approach. BibFusion ingests Scopus CSV and WoS TXT files, applies systematic normalization (e.g., ASCII/uppercase standardization of titles and SR keys, affiliation parsing with country extraction), and optionally enriches records via DOI-based resolution against OpenAlex to recover persistent identifiers (e.g., OpenAlex work IDs, OpenAlex author IDs, and ORCID when available). Cross-database integration uses a DOI-first deduplication cascade with a conservative fallback (title–year–first author) when a DOI is absent. Authors are disambiguated through a canonical PersonID hierarchy (ORCID → OpenAlexAuthorID → normalized name). Citation strings are cleaned and remapped to preserve consistent citation links, and journal/Scimago information is consolidated using ISSN/EISSN rules.
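
The DOI-first cascade and the PersonID hierarchy described above can be illustrated with a minimal Python sketch. This is not BibFusion's actual API: the function names (normalize_text, normalize_doi, dedup_key, deduplicate, person_id) and the record fields (doi, title, year, first_author, orcid, openalex_author_id, name) are hypothetical and chosen only for illustration, and the normalization rules are a simplified assumption of what ASCII/uppercase standardization might involve.

```python
import re
import unicodedata


def normalize_text(value):
    """ASCII-fold, uppercase, and collapse non-alphanumeric runs to one space."""
    folded = unicodedata.normalize("NFKD", value or "")
    ascii_only = folded.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^A-Z0-9]+", " ", ascii_only.upper()).strip()


def normalize_doi(doi):
    """Lowercase a DOI and strip a leading resolver URL or 'doi:' prefix."""
    doi = (doi or "").strip().lower()
    return re.sub(r"^(https?://(dx\.)?doi\.org/|doi:\s*)", "", doi)


def dedup_key(record):
    """DOI first; otherwise fall back to (title, year, first author)."""
    doi = normalize_doi(record.get("doi"))
    if doi:
        return ("doi", doi)
    return (
        "fallback",
        normalize_text(record.get("title")),
        str(record.get("year") or ""),
        normalize_text(record.get("first_author")),
    )


def deduplicate(records):
    """Group records by key, keeping every source record for traceability."""
    groups = {}
    for record in records:
        groups.setdefault(dedup_key(record), []).append(record)
    return groups


def person_id(author):
    """Hypothetical PersonID: ORCID, then OpenAlex author ID, then name."""
    for label, field in (("ORCID", "orcid"), ("OPENALEX", "openalex_author_id")):
        value = (author.get(field) or "").strip()
        if value:
            return f"{label}:{value}"
    return "NAME:" + normalize_text(author.get("name"))
```

Under these assumptions, a Scopus record with DOI https://doi.org/10.1000/ABC and a WoS record with DOI 10.1000/abc fall into the same group, and two DOI-less records whose titles differ only in accents or punctuation are merged by the fallback key. One limitation of this simplified version is that a record carrying a DOI is never matched to a DOI-less copy of the same work, which a full cascade would have to handle in a second pass.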

Results. In a demonstration on an entrepreneurial marketing query, BibFusion consolidated 436 source records into 253 unique main works and materialized a unified corpus of 8,569 articles. The resulting dataset demonstrated high levels of identifier and geographic completeness, and it provided an analysis-ready citation layer.

Conclusions/Value. BibFusion offers a reusable, auditable integration workflow that reduces duplicate inflation and metadata fragmentation. The workflow makes merge decisions and residual uncertainty explicit, which keeps downstream analyses transparent.

Author Biography

Sebastian Robledo, Universidad Nacional de Colombia

Sebastian is a professor and researcher at the National University of Colombia, Manizales campus, where he earned a degree in Industrial Engineering (2005), a Master’s in Business Administration (2013), and a Ph.D. in Engineering – Industry and Organizations (2018). His research focuses on scientometrics and entrepreneurial marketing, bridging data analysis with practical applications in academic and business contexts. He is also one of the creators of Tree of Science, a tool for managing and analyzing scientific information. His academic work has addressed topics such as the design of software tools for industrial engineering, the potential of passive income in multilevel marketing firms, and networking as a word-of-mouth marketing strategy in entrepreneurial contexts. He currently combines teaching with publishing and reviewing scientific articles.

Published

2026-02-14

How to Cite

Britto, A., Robledo, S., & Zuluaga, M. (2026). BibFusion: A Python package to integrate, deduplicate, and harmonize exported bibliographic records from Scopus and Web of Science for bibliometric analysis. Iberoamerican Journal of Science Measurement and Communication, 6(1), 1–21. https://doi.org/10.47909/ijsmc.342