BibexPy is a Python-based software designed to streamline bibliometric data integration, deduplication, metadata enrichment, and format conversion. It simplifies the preparation of high-quality datasets for advanced analyses by automating traditionally manual and error-prone tasks.

BibexPy

Harmonizing the Bibliometric Symphony of Scopus and Web of Science

Academic Citation

We appreciate the academic community's interest in BibexPy. If you find our tool useful in your research work, we kindly request that you cite our paper:

APA Citation Format

Kara, B. C., Şahin, A., & Dirsehan, T. (2025). BibexPy: Harmonizing the bibliometric symphony of Scopus and Web of Science. SoftwareX, 30, 102098. https://doi.org/10.1016/j.softx.2025.102098

BibTeX Citation Format

@article{bibexpy2025,
    title     = {BibexPy: Harmonizing the bibliometric symphony of {Scopus} and {Web of Science}},
    author    = {Kara, Burak Can and {\c{S}}ahin, Alperen and Dirsehan, Ta{\c{s}}k{\i}n},
    journal   = {SoftwareX},
    volume    = {30},
    pages     = {102098},
    year      = {2025},
    issn      = {2352-7110},
    publisher = {Elsevier},
    doi       = {10.1016/j.softx.2025.102098},
    url       = {https://www.sciencedirect.com/science/article/pii/S2352711025000652},
    keywords  = {Bibliometric analysis tools, Automated data integration, Metadata enrichment software, Scikit-learn, Machine learning, API-Based metadata processing}
}

IEEE Citation Format

B. C. Kara, A. Şahin and T. Dirsehan, "BibexPy: Harmonizing the bibliometric symphony of Scopus and Web of Science," SoftwareX, vol. 30, p. 102098, 2025, doi: 10.1016/j.softx.2025.102098.

Chicago Citation Format

Kara, Burak Can, Alperen Şahin, and Taşkın Dirsehan. "BibexPy: Harmonizing the Bibliometric Symphony of Scopus and Web of Science." SoftwareX 30 (2025): 102098. https://doi.org/10.1016/j.softx.2025.102098.

Tech Stack

Python, Pandas, NumPy, scikit-learn, NLTK, Excel

Scopus, Web of Science, VOSviewer

Features

  • DOI-Based Deduplication and Merging: Identifies and removes duplicate entries while enriching metadata by merging complementary records.
  • Enhanced Metadata Enrichment:
    • API-Based Enrichment: Completes missing fields using multiple APIs with detailed field statistics and API support information.
    • Machine Learning Enrichment (Experimental):
      • Currently supports prediction for:
        • Keywords (DE field)
        • Keywords Plus (ID field) - Independent model using TF-IDF + RandomForest
        • Subject Categories (SC field)
        • Web of Science Categories (WC field)
      • Shows training data statistics for each field
      • Displays progress during model training
      • Provides enrichment results summary
      • Saves detailed statistics to Excel file
    • Combined API + ML Enrichment:
      • Sequential processing combining both methods
      • API enrichment performed first with user confirmation
      • ML enrichment applied to API-enriched data
      • Comprehensive statistics for both processes
      • User confirmation at each step
      • Automatic cleanup of temporary files
      • Detailed statistics saved to Excel files
  • Flexible Workflow: Choose between API or ML enrichment in any order, with clear progress indicators and statistics.
  • Format Conversion: Generates outputs compatible with VOSviewer, Biblioshiny, and other analysis tools.
  • Command-Line Interface (CLI): Offers user-friendly interaction with minimal setup requirements.
  • Comprehensive Data Processing: Handles multiple data sources and formats efficiently.
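The DOI-based deduplication and merging listed above can be sketched with pandas. This is a minimal illustration under stated assumptions, not BibexPy's actual code: the function names `normalize_doi` and `merge_by_doi` are hypothetical, and both sources are assumed to be already mapped to Web of Science-style field tags with the DOI in a `DI` column.

```python
import pandas as pd

# Common ways a DOI is written with a resolver or scheme prefix.
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(doi):
    """Lowercase a DOI and strip a leading prefix; return None for empty values."""
    if not isinstance(doi, str):
        return None
    doi = doi.strip().lower()
    for prefix in DOI_PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi or None


def merge_by_doi(scopus, wos):
    """Stack both sources and keep one row per DOI.

    Within each DOI group, every field takes its first non-null value,
    so a gap in one source is filled from the other.
    """
    combined = pd.concat([scopus, wos], ignore_index=True)
    keys = combined["DI"].map(normalize_doi)
    # Assumption: records without a DOI get a unique key and are never merged.
    fallback = pd.Series(
        [f"__no_doi_{i}" for i in combined.index], index=combined.index
    )
    combined["_key"] = keys.where(keys.notna(), fallback)
    return combined.groupby("_key", sort=False).first().reset_index(drop=True)
```

Taking the first non-null value per field is one simple merge policy; it gives precedence to whichever source is listed first when both supply a value.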

Key Benefits

  • Time Saving: Automates manual data cleaning and enrichment tasks
  • Enhanced Data Quality: Reduces errors and inconsistencies in bibliometric data
  • Flexible Integration: Works with multiple data sources and output formats
  • Rich Metadata: Comprehensive metadata enrichment from multiple sources
  • Smart Enrichment: Choose between API-based or ML-based enrichment methods
  • Detailed Feedback: Clear statistics and progress indicators during processing
  • Easy to Use: Simple command-line interface with clear instructions

Prerequisites

Required Python Version

  • Python ≥ 3.9.0

Required Libraries

# Core Libraries - Required for Basic Functionality
pandas>=2.0.0          # Data manipulation and analysis
numpy>=1.24.0          # Required by pandas for numerical operations
openpyxl>=3.1.2        # Excel file handling

# Machine Learning - Required for ML Enrichment
scikit-learn>=1.3.0    # ML-based metadata enrichment and predictions
nltk>=3.8.1            # Text processing and feature extraction

# API and Network Libraries - Required for API Enrichment
requests>=2.31.0       # API interactions for metadata enrichment
urllib3>=2.0.0         # HTTP client for Python, used by requests
certifi>=2023.5.7      # Required for SSL certificate verification
python-dotenv>=1.0.0   # API configuration management

# Progress and User Interface
tqdm>=4.65.0          # Progress tracking for long operations
colorama>=0.4.6        # Console output formatting and colors

# Utilities
unidecode==1.3.6       # Text normalization and cleaning
typing-extensions>=4.7.0  # Type hints support

Installation

  1. Clone the Repository

    git clone https://github.com/bcankara/BibexPy.git
  2. Navigate to the Directory

    cd BibexPy
  3. Install Dependencies

    pip install -r requirements.txt
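  4. (Optional) Virtual Environment Setup: to keep dependencies isolated, create and activate the environment before installing dependencies in step 3.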

    python -m venv venv
    source venv/bin/activate  # Mac/Linux
    venv\Scripts\activate     # Windows

Usage

  1. Basic Usage

    python DataProcessor.py
    • Select your project
    • Upload Scopus (.csv) and Web of Science (.txt) files
    • Choose processing options
  2. Metadata Enrichment Options

    The application offers three main methods for enriching your bibliometric data:

    A. API-Based Enrichment

    • Provides detailed statistics about empty fields
    • Shows which APIs support each field
    • Displays percentage of empty records for each field
    • Supports multiple APIs:
      • CrossRef (Free)
      • OpenAlex (Free)
      • DataCite (Free)
      • Europe PMC (Free)
      • Scopus (API key required)
      • Semantic Scholar (Optional API key)
      • Unpaywall (Email required)

    B. Machine Learning Enrichment (Experimental)

    • Currently supports prediction for:
      • Keywords (DE field)
      • Keywords Plus (ID field) - Independent model using TF-IDF + RandomForest
      • Subject Categories (SC field)
      • Web of Science Categories (WC field)
    • Shows training data statistics for each field
    • Displays progress during model training
    • Provides enrichment results summary
    • Saves detailed statistics to Excel file

    C. Combined API + ML Enrichment

    • Sequential processing combining both methods
    • API enrichment performed first with user confirmation
    • ML enrichment applied to API-enriched data
    • Comprehensive statistics for both processes
    • User confirmation at each step
    • Automatic cleanup of temporary files
    • Detailed statistics saved to Excel files
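The TF-IDF + RandomForest approach named for the Keywords Plus model can be sketched with scikit-learn. This is only an illustration of the general technique: the function names are hypothetical, keywords are assumed to be semicolon-separated, and the input text is assumed to be something like title plus abstract. The features and hyperparameters BibexPy actually uses are not documented here.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer


def split_terms(cell):
    """Split a semicolon-separated keyword cell into a clean list of terms."""
    return [term.strip().lower() for term in cell.split(";") if term.strip()]


def train_keyword_model(texts, keyword_cells):
    """Fit on records that already have keywords.

    Each distinct keyword becomes one binary output, so a record can
    receive several keywords at once (multi-label classification).
    """
    binarizer = MultiLabelBinarizer()
    targets = binarizer.fit_transform([split_terms(c) for c in keyword_cells])
    model = make_pipeline(
        TfidfVectorizer(stop_words="english"),
        RandomForestClassifier(n_estimators=100, random_state=0),
    )
    model.fit(texts, targets)
    return model, binarizer


def predict_keywords(model, binarizer, texts):
    """Predict a semicolon-separated keyword string for each input text."""
    predicted = model.predict(texts)
    return ["; ".join(terms) for terms in binarizer.inverse_transform(predicted)]
```

In this sketch the model is trained on the records whose field is filled and applied to the records whose field is empty, which is why the amount of training data per field matters.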
  3. API Configuration

    For API-based enrichment, configure your APIs in API_config.json:

    {
        "scopus": {
            "api_key": "YOUR-SCOPUS-API-KEY",
            "description": "Get your API key from https://dev.elsevier.com/"
        },
        "semantic_scholar": {
            "api_key": "YOUR-SEMANTIC-SCHOLAR-KEY",
            "description": "Optional. Get your key from https://www.semanticscholar.org/product/api"
        },
        "unpaywall": {
            "email": "YOUR-EMAIL@example.com",
            "description": "Use your institutional email for Unpaywall access"
        },
        "crossref": {
            "email": "YOUR-EMAIL@example.com",
            "description": "Recommended for better rate limits with CrossRef"
        }
    }
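As a rough sketch of how such a configuration could be read, the snippet below loads the JSON file and works out which APIs are usable, following the requirements listed under API-Based Enrichment. The function names and the keys `openalex`, `datacite`, and `europe_pmc` are hypothetical, and treating values that start with `YOUR-` as unset is an assumption based on the placeholders in the example above.

```python
import json

# APIs listed as free, needing no credentials (key names are assumed).
FREE_APIS = ["crossref", "openalex", "datacite", "europe_pmc"]


def load_api_config(path="API_config.json"):
    """Read the API configuration file into a dictionary."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def is_set(value):
    """True if the value is non-empty and not a YOUR-... placeholder."""
    return bool(value) and not value.upper().startswith("YOUR-")


def usable_apis(config):
    """List the APIs that can be called with the given configuration."""
    apis = list(FREE_APIS)
    apis.append("semantic_scholar")  # its API key is optional
    if is_set(config.get("scopus", {}).get("api_key")):
        apis.append("scopus")  # API key required
    if is_set(config.get("unpaywall", {}).get("email")):
        apis.append("unpaywall")  # email required
    return apis
```

For example, `usable_apis(load_api_config())` run against an unedited template would return only the free APIs plus Semantic Scholar.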

Output Files and Formats

BibexPy generates several output files to support different analysis needs:

1. Unified Dataset (Prefix_Bib.xlsx)

  • Format: Excel Workbook
  • Contents:
    • Merged and deduplicated records
    • Enhanced metadata from multiple sources
    • Standardized author names and affiliations
    • Complete citation information
  • Uses:
    • Primary dataset for analysis
    • Input for other bibliometric tools
    • Reference database

2. VOSviewer Export (Prefix_Vos.txt)

  • Format: Tab-separated text file
  • Contents:
    • Author and co-authorship data
    • Citation networks
    • Keyword co-occurrence
    • Institution collaborations
  • Uses:
    • Direct import into VOSviewer
    • Network visualization
    • Cluster analysis
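For orientation, a tab-separated export can be produced with Python's standard `csv` module, as in the sketch below. The function name `to_tab_separated` is hypothetical, and Web of Science-style two-letter tags (`AU`, `TI`, `PY`) are assumed as column headers. The exact columns BibexPy writes to this file are not specified here.

```python
import csv
import io


def to_tab_separated(records, fields):
    """Render dict records as tab-separated text with one header row.

    Only the requested fields are written; missing values become empty cells.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fields, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for record in records:
        row = {}
        for field in fields:
            value = record.get(field)
            row[field] = "" if value is None else value
        writer.writerow(row)
    return buffer.getvalue()
```

The returned text can then be written to a `.txt` file opened with `encoding="utf-8"`.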

3. Quality Report (Prefix_Quality.xlsx)

  • Format: Excel Workbook
  • Contents:
    • Data completeness metrics
    • Field coverage statistics
    • Source distribution analysis
    • Duplicate detection results
    • API enrichment statistics
    • ML enrichment statistics
    • Field-wise enrichment rates
  • Uses:
    • Dataset quality assessment
    • Coverage analysis
    • Source verification
    • Enrichment performance tracking
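Field coverage statistics of the kind listed above, such as the percentage of empty records per field, can be computed with pandas along these lines. This is a sketch under assumptions: the function name `empty_field_report` is hypothetical, and a cell is counted as empty if it is missing or contains only whitespace.

```python
import pandas as pd


def empty_field_report(records):
    """Return per-field counts and percentages of empty cells."""
    # A cell is empty if it is NaN/None or a blank string.
    blank = records.apply(
        lambda col: col.isna() | col.astype(str).str.strip().eq("")
    )
    report = pd.DataFrame({"empty": blank.sum(), "total": len(records)})
    report["empty_pct"] = (100 * report["empty"] / report["total"]).round(1)
    return report.sort_values("empty_pct", ascending=False)
```

Calling `DataFrame.to_excel` on the result (with openpyxl installed) would save it as a worksheet. Running the report before and after enrichment gives a field-wise enrichment rate.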

4. Analysis Summary (Prefix_Summary.txt)

  • Format: Text file
  • Contents:
    • Processing statistics
    • API enrichment results
    • Error logs and warnings
    • Data transformation details
  • Uses:
    • Process verification
    • Quality control
    • Troubleshooting

Support and Community

License

BibexPy is licensed under the GNU General Public License (GPL). See the LICENSE file for details.


Enhance your bibliometric research with BibexPy, making data preparation efficient, reliable, and analysis-ready!
