BibexPy

Harmonizing the Bibliometric Symphony of Scopus and Web of Science


Academic Citation

We appreciate the academic community's interest in BibexPy. If you find our tool useful in your research work, we kindly request that you cite our paper:

APA Citation Format

Kara, B. C., Şahin, A., & Dirsehan, T. (2025). BibexPy: Harmonizing the bibliometric symphony of Scopus and Web of Science. SoftwareX, 30, 102098. https://doi.org/10.1016/j.softx.2025.102098

BibTeX Citation Format

```
@article{bibexpy2025,
    title     = {BibexPy: Harmonizing the bibliometric symphony of {Scopus} and {Web of Science}},
    author    = {Kara, Burak Can and {\c{S}}ahin, Alperen and Dirsehan, Ta{\c{s}}k{\i}n},
    journal   = {SoftwareX},
    volume    = {30},
    pages     = {102098},
    year      = {2025},
    issn      = {2352-7110},
    publisher = {Elsevier},
    doi       = {10.1016/j.softx.2025.102098},
    url       = {https://www.sciencedirect.com/science/article/pii/S2352711025000652},
    keywords  = {Bibliometric analysis tools, Automated data integration, Metadata enrichment software, Scikit-learn, Machine learning, API-Based metadata processing}
}
```

IEEE Citation Format

B. C. Kara, A. Şahin and T. Dirsehan, "BibexPy: Harmonizing the bibliometric symphony of Scopus and Web of Science," SoftwareX, vol. 30, p. 102098, 2025, doi: 10.1016/j.softx.2025.102098.

Chicago Citation Format

Kara, Burak Can, Alperen Şahin, and Taşkın Dirsehan. "BibexPy: Harmonizing the Bibliometric Symphony of Scopus and Web of Science." SoftwareX 30 (2025): 102098. https://doi.org/10.1016/j.softx.2025.102098.

BibexPy is a Python-based software designed to streamline bibliometric data integration, deduplication, metadata enrichment, and format conversion. It simplifies the preparation of high-quality datasets for advanced analyses by automating traditionally manual and error-prone tasks.

Features

DOI-Based Deduplication and Merging: Identifies and removes duplicate entries while enriching metadata by merging complementary records.
Enhanced Metadata Enrichment:
- API-Based Enrichment: Completes missing fields using multiple APIs with detailed field statistics and API support information.
- Machine Learning Enrichment (Experimental):
  - Currently supports prediction for:
    - Keywords (DE field)
    - Keywords Plus (ID field) - Independent model using TF-IDF + RandomForest
    - Subject Categories (SC field)
    - Web of Science Categories (WC field)
  - Shows training data statistics for each field
  - Displays progress during model training
  - Provides enrichment results summary
  - Saves detailed statistics to Excel file
- Combined API + ML Enrichment:
  - Sequential processing combining both methods
  - API enrichment performed first with user confirmation
  - ML enrichment applied to API-enriched data
  - Comprehensive statistics for both processes
  - User confirmation at each step
  - Automatic cleanup of temporary files
  - Detailed statistics saved to Excel files
Flexible Workflow: Choose between API or ML enrichment in any order, with clear progress indicators and statistics.
Format Conversion: Generates outputs compatible with VOSviewer, Biblioshiny, and other analysis tools.
Command-Line Interface (CLI): Offers user-friendly interaction with minimal setup requirements.
Comprehensive Data Processing: Handles multiple data sources and formats efficiently.
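
To illustrate the DOI-based deduplication and merging feature listed above, here is a minimal sketch, not BibexPy's actual implementation. It assumes records are plain dictionaries keyed by Web of Science style field tags (`DI` for DOI, `DE` for keywords, `SC` for subject categories); the function names `normalize_doi` and `merge_by_doi` are hypothetical.

```python
import re
from typing import Dict, Iterable, List

Record = Dict[str, str]

# Common prefixes that hide otherwise identical DOIs.
_DOI_PREFIX = re.compile(r"^(https?://(dx\.)?doi\.org/|doi:)", re.IGNORECASE)


def normalize_doi(raw: str) -> str:
    """Strip URL or 'doi:' prefixes and lowercase the DOI for comparison."""
    return _DOI_PREFIX.sub("", raw.strip()).lower()


def merge_by_doi(records: Iterable[Record]) -> List[Record]:
    """Collapse records that share a DOI, filling empty fields from later copies."""
    merged: Dict[str, Record] = {}
    without_doi: List[Record] = []
    for rec in records:
        doi = normalize_doi(rec.get("DI", ""))
        if not doi:
            without_doi.append(dict(rec))  # no DOI to match on, keep unchanged
            continue
        if doi not in merged:
            merged[doi] = {**rec, "DI": doi}
            continue
        kept = merged[doi]
        for field, value in rec.items():
            if value and not kept.get(field):  # only fill gaps, never overwrite
                kept[field] = value
    return list(merged.values()) + without_doi


# Made-up example: the same paper exported from two databases.
scopus = {"DI": "https://doi.org/10.1000/XYZ123", "TI": "A Study", "DE": ""}
wos = {"DI": "10.1000/xyz123", "TI": "A Study", "DE": "bibliometrics",
       "SC": "Information Science"}
result = merge_by_doi([scopus, wos])
print(result)
```

The first record seen for a DOI wins on conflicts, and later records only contribute values for fields that are still empty. A real pipeline would probably also need a fallback for records that lack a DOI.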

Key Benefits

Time Saving: Automates manual data cleaning and enrichment tasks
Enhanced Data Quality: Reduces errors and inconsistencies in bibliometric data
Flexible Integration: Works with multiple data sources and output formats
Rich Metadata: Comprehensive metadata enrichment from multiple sources
Smart Enrichment: Choose between API-based or ML-based enrichment methods
Detailed Feedback: Clear statistics and progress indicators during processing
Easy to Use: Simple command-line interface with clear instructions

Prerequisites

Required Python Version

Python ≥ 3.9.0

Required Libraries

```
# Core Libraries - Required for Basic Functionality
pandas>=2.0.0             # Data manipulation and analysis
numpy>=1.24.0             # Required by pandas for numerical operations
openpyxl>=3.1.2           # Excel file handling

# Machine Learning - Required for ML Enrichment
scikit-learn>=1.3.0       # ML-based metadata enrichment and predictions
nltk>=3.8.1               # Text processing and feature extraction

# API and Network Libraries - Required for API Enrichment
requests>=2.31.0          # API interactions for metadata enrichment
urllib3>=2.0.0            # HTTP client for Python, used by requests
certifi>=2023.5.7         # Required for SSL certificate verification
python-dotenv>=1.0.0      # API configuration management

# Progress and User Interface
tqdm>=4.65.0              # Progress tracking for long operations
colorama>=0.4.6           # Console output formatting and colors

# Utilities
unidecode==1.3.6          # Text normalization and cleaning
typing-extensions>=4.7.0  # Type hints support
```

Installation

Clone the Repository

```
git clone https://github.com/bcankara/BibexPy.git
```

Navigate to the Directory
```
cd BibexPy
```
Install Dependencies
```
pip install -r requirements.txt
```

(Optional) Virtual Environment Setup

```
python -m venv venv
source venv/bin/activate  # Mac/Linux
venv\Scripts\activate     # Windows
```

Usage

Basic Usage
```
python DataProcessor.py
```
- Select your project
- Upload Scopus (.csv) and Web of Science (.txt) files
- Choose processing options
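
For readers who want to inspect a Web of Science `.txt` export before loading it, the sketch below parses the tagged plain-text layout into records. It is an illustration under stated assumptions, not the parser BibexPy uses: it assumes the usual layout of a two-letter tag followed by a value, indented continuation lines, and `ER` closing each record. The function name `parse_wos_plaintext` and the sample data are made up.

```python
from typing import Dict, List, Optional


def parse_wos_plaintext(text: str) -> List[Dict[str, List[str]]]:
    """Return one dict per record, mapping each field tag to its list of lines."""
    records: List[Dict[str, List[str]]] = []
    current: Dict[str, List[str]] = {}
    tag: Optional[str] = None
    for line in text.lstrip("\ufeff").splitlines():
        if not line.strip():
            continue
        head, body = line[:2], line[3:].strip()
        if head == "ER":                    # end of the current record
            records.append(current)
            current, tag = {}, None
        elif head in ("FN", "VR", "EF"):    # file-level header and footer tags
            continue
        elif head.strip():                  # a new two-letter field tag
            tag = head
            current.setdefault(tag, []).append(body)
        elif tag is not None:               # indented continuation of the last tag
            current[tag].append(line.strip())
    return records


# Made-up sample in the assumed layout.
sample = "\n".join([
    "FN Clarivate Analytics Web of Science",
    "VR 1.0",
    "PT J",
    "AU Doe, J",
    "   Roe, A",
    "TI An example title",
    "DI 10.1000/xyz123",
    "ER",
    "",
    "EF",
])
records = parse_wos_plaintext(sample)
print(records[0]["AU"])
```

Keeping each field as a list of lines avoids guessing how continuation lines should be joined, since authors and multi-line titles need different separators.
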

Metadata Enrichment Options

The application offers three main methods for enriching your bibliometric data:

A. API-Based Enrichment
- Provides detailed statistics about empty fields
- Shows which APIs support each field
- Displays percentage of empty records for each field
- Supports multiple APIs:
  - CrossRef (Free)
  - OpenAlex (Free)
  - DataCite (Free)
  - Europe PMC (Free)
  - Scopus (API key required)
  - Semantic Scholar (Optional API key)
  - Unpaywall (Email required)
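
As a rough idea of what a single API lookup can look like, the sketch below queries the public CrossRef REST endpoint (`https://api.crossref.org/works/{doi}`) and copies values into empty fields only. It is not BibexPy's code. The mapping from CrossRef keys to the `TI`, `SO`, and `PU` field tags, and the helper names, are assumptions for illustration.

```python
import json
import urllib.parse
import urllib.request
from typing import Any, Dict, Optional

CROSSREF_WORKS = "https://api.crossref.org/works/"


def crossref_url(doi: str, mailto: Optional[str] = None) -> str:
    """Build the lookup URL; 'mailto' identifies the caller to CrossRef."""
    url = CROSSREF_WORKS + urllib.parse.quote(doi, safe="/")
    if mailto:
        url += "?" + urllib.parse.urlencode({"mailto": mailto})
    return url


def fetch_crossref(doi: str, mailto: Optional[str] = None,
                   timeout: float = 10.0) -> Dict[str, Any]:
    """Fetch the 'message' object for one DOI (performs a network request)."""
    with urllib.request.urlopen(crossref_url(doi, mailto), timeout=timeout) as resp:
        return json.load(resp)["message"]


def fill_missing(record: Dict[str, str], message: Dict[str, Any]) -> Dict[str, str]:
    """Return a copy of the record with empty fields filled from CrossRef data."""
    candidates = {
        "TI": (message.get("title") or [""])[0],            # assumed mapping
        "SO": (message.get("container-title") or [""])[0],  # assumed mapping
        "PU": message.get("publisher", ""),                 # assumed mapping
    }
    updated = dict(record)
    for field, value in candidates.items():
        if value and not updated.get(field):
            updated[field] = value
    return updated


# Offline demonstration with a made-up response payload.
fake_message = {"title": ["Example title"], "container-title": ["Other Journal"],
                "publisher": "Example Press"}
enriched = fill_missing({"TI": "", "SO": "SoftwareX"}, fake_message)
print(enriched)
```

In live use, `fill_missing(record, fetch_crossref(record["DI"], mailto="user@example.org"))` would combine the two steps; existing values such as `SO` above are never overwritten.
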
B. Machine Learning Enrichment (Experimental)
- Currently supports prediction for:
  - Keywords (DE field)
  - Keywords Plus (ID field) - Independent model using TF-IDF + RandomForest
  - Subject Categories (SC field)
  - Web of Science Categories (WC field)
- Shows training data statistics for each field
- Displays progress during model training
- Provides enrichment results summary
- Saves detailed statistics to Excel file
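
The TF-IDF + RandomForest combination mentioned for the ID field can be sketched with scikit-learn as follows. This is a toy illustration under assumptions, not the model BibexPy trains: the training texts and labels are invented, the hyperparameters are arbitrary, and treating Keywords Plus as a multi-label target via `MultiLabelBinarizer` is one possible design rather than the documented one.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Invented training data: text of records whose ID field is already filled.
train_texts = [
    "citation analysis of scientific journals",
    "co-citation networks in information science",
    "deep learning for image classification",
    "neural networks for object recognition in images",
]
train_labels = [
    ["CITATION", "SCIENCE"],
    ["CITATION", "NETWORKS"],
    ["CLASSIFICATION", "NETWORKS"],
    ["CLASSIFICATION", "RECOGNITION"],
]

# Turn the keyword lists into a binary indicator matrix (one column per keyword).
binarizer = MultiLabelBinarizer()
y = binarizer.fit_transform(train_labels)

# TF-IDF features feed a random forest, which accepts multi-label targets.
model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    RandomForestClassifier(n_estimators=50, random_state=0),
)
model.fit(train_texts, y)

# Predict keywords for a record whose ID field is empty.
missing_texts = ["citation networks of journals"]
predicted = binarizer.inverse_transform(model.predict(missing_texts))
print(predicted)
```

With so few training rows the predictions are not meaningful; the point is only the shape of the pipeline. Predictions of this kind should be reviewed before they are treated as real metadata.
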
C. Combined API + ML Enrichment
- Sequential processing combining both methods
- API enrichment performed first with user confirmation
- ML enrichment applied to API-enriched data
- Comprehensive statistics for both processes
- User confirmation at each step
- Automatic cleanup of temporary files
- Detailed statistics saved to Excel files

API Configuration

For API-based enrichment, configure your APIs in API_config.json:

```
{
    "scopus": {
        "api_key": "YOUR-SCOPUS-API-KEY",
        "description": "Get your API key from https://dev.elsevier.com/"
    },
    "semantic_scholar": {
        "api_key": "YOUR-SEMANTIC-SCHOLAR-KEY",
        "description": "Optional. Get your key from https://www.semanticscholar.org/product/api"
    },
    "unpaywall": {
        "email": "YOUR-EMAIL-ADDRESS",
        "description": "Use your institutional email for Unpaywall access"
    },
    "crossref": {
        "email": "YOUR-EMAIL-ADDRESS",
        "description": "Recommended for better rate limits with CrossRef"
    }
}
```
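
Before running API enrichment, it can help to check which entries in the configuration still hold template values. The sketch below is one way to do that, assuming the structure shown above (an `api_key` or `email` per service) and assuming that unedited placeholders start with `YOUR-`. The helper names are hypothetical and not part of BibexPy.

```python
import json
from typing import Dict

PLACEHOLDER_PREFIX = "YOUR-"  # assumption: unedited template values start like this


def load_config(path: str = "API_config.json") -> Dict[str, Dict[str, str]]:
    """Read the API configuration file from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def configured_apis(config: Dict[str, Dict[str, str]]) -> Dict[str, bool]:
    """Map each service name to True if it has a real-looking credential."""
    status: Dict[str, bool] = {}
    for name, settings in config.items():
        credential = settings.get("api_key") or settings.get("email") or ""
        status[name] = bool(credential) and not credential.startswith(PLACEHOLDER_PREFIX)
    return status


# Inline demonstration with made-up values instead of a file on disk.
sample = json.loads("""
{
    "scopus": {"api_key": "YOUR-SCOPUS-API-KEY"},
    "crossref": {"email": "user@example.org"},
    "unpaywall": {"email": ""}
}
""")
status = configured_apis(sample)
print(status)
```

With a real file, `configured_apis(load_config())` gives the same kind of summary.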

Output Files and Formats

BibexPy generates several output files to support different analysis needs:

1. Unified Dataset (`Prefix_Bib.xlsx`)

Format: Excel Workbook
Contents:
- Merged and deduplicated records
- Enhanced metadata from multiple sources
- Standardized author names and affiliations
- Complete citation information
Uses:
- Primary dataset for analysis
- Input for other bibliometric tools
- Reference database

2. VOSviewer Export (`Prefix_Vos.txt`)

Format: Tab-separated text file
Contents:
- Author and co-authorship data
- Citation networks
- Keyword co-occurrence
- Institution collaborations
Uses:
- Direct import into VOSviewer
- Network visualization
- Cluster analysis

3. Quality Report (`Prefix_Quality.xlsx`)

Format: Excel Workbook
Contents:
- Data completeness metrics
- Field coverage statistics
- Source distribution analysis
- Duplicate detection results
- API enrichment statistics
- ML enrichment statistics
- Field-wise enrichment rates
Uses:
- Dataset quality assessment
- Coverage analysis
- Source verification
- Enrichment performance tracking
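
A completeness metric of the kind listed above can be computed as the percentage of records with an empty value per field. The following sketch shows the arithmetic on plain dictionaries. The function name and sample records are hypothetical, and the actual report may define its metrics differently.

```python
from typing import Dict, List, Sequence


def empty_field_percentages(records: List[Dict[str, str]],
                            fields: Sequence[str]) -> Dict[str, float]:
    """Return, per field, the percentage of records where it is missing or blank."""
    total = len(records)
    if total == 0:
        return {field: 0.0 for field in fields}
    return {
        field: round(
            100.0 * sum(1 for rec in records
                        if not str(rec.get(field, "")).strip()) / total,
            1,
        )
        for field in fields
    }


# Made-up records: DE is blank in two of four, SC is missing in one of four.
sample_records = [
    {"DE": "bibliometrics", "SC": "Information Science"},
    {"DE": "", "SC": "Computer Science"},
    {"DE": "   ", "SC": "Information Science"},
    {"DE": "citation analysis"},
]
percentages = empty_field_percentages(sample_records, ["DE", "SC"])
print(percentages)
```

Comparing these percentages before and after enrichment gives a simple field-wise enrichment rate.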

4. Analysis Summary (`Prefix_Summary.txt`)

Format: Text file
Contents:
- Processing statistics
- API enrichment results
- Error logs and warnings
- Data transformation details
Uses:
- Process verification
- Quality control
- Troubleshooting

Support and Community

Issues and Bugs: Submit via GitHub Issues
Feature Requests: Use GitHub Discussions
Questions: Contact us at 📧 [email protected]
Updates: Follow us on Twitter @BibexPy

License

BibexPy is licensed under the GNU General Public License (GPL). See the LICENSE file for details.

Enhance your bibliometric research with BibexPy, making data preparation efficient, reliable, and analysis-ready!
