We appreciate the academic community's interest in BibexPy. If you find our tool useful in your research work, we kindly request that you cite our paper:
Kara, B. C., Şahin, A., & Dirsehan, T. (2025). BibexPy: Harmonizing the bibliometric symphony of Scopus and Web of Science. SoftwareX, 30, 102098. https://doi.org/10.1016/j.softx.2025.102098
@article{bibexpy2025,
title = {BibexPy: Harmonizing the bibliometric symphony of {Scopus} and {Web of Science}},
author = {Kara, Burak Can and {\c{S}}ahin, Alperen and Dirsehan, Ta{\c{s}}k{\i}n},
journal = {SoftwareX},
volume = {30},
pages = {102098},
year = {2025},
issn = {2352-7110},
publisher = {Elsevier},
doi = {10.1016/j.softx.2025.102098},
url = {https://www.sciencedirect.com/science/article/pii/S2352711025000652},
keywords = {Bibliometric analysis tools, Automated data integration, Metadata enrichment software, Scikit-learn, Machine learning, API-Based metadata processing}
}
B. C. Kara, A. Şahin and T. Dirsehan, "BibexPy: Harmonizing the bibliometric symphony of Scopus and Web of Science," SoftwareX, vol. 30, p. 102098, 2025, doi: 10.1016/j.softx.2025.102098.
Kara, Burak Can, Alperen Şahin, and Taşkın Dirsehan. "BibexPy: Harmonizing the Bibliometric Symphony of Scopus and Web of Science." SoftwareX 30 (2025): 102098. https://doi.org/10.1016/j.softx.2025.102098.
BibexPy is a Python-based software designed to streamline bibliometric data integration, deduplication, metadata enrichment, and format conversion. It simplifies the preparation of high-quality datasets for advanced analyses by automating traditionally manual and error-prone tasks.
- DOI-Based Deduplication and Merging: Identifies and removes duplicate entries while enriching metadata by merging complementary records.
- Enhanced Metadata Enrichment:
  - API-Based Enrichment: Completes missing fields using multiple APIs with detailed field statistics and API support information.
  - Machine Learning Enrichment (Experimental):
    - Currently supports prediction for:
      - Keywords (DE field)
      - Keywords Plus (ID field) - Independent model using TF-IDF + RandomForest
      - Subject Categories (SC field)
      - Web of Science Categories (WC field)
    - Shows training data statistics for each field
    - Displays progress during model training
    - Provides enrichment results summary
    - Saves detailed statistics to Excel file
  - Combined API + ML Enrichment:
    - Sequential processing combining both methods
    - API enrichment performed first with user confirmation
    - ML enrichment applied to API-enriched data
    - Comprehensive statistics for both processes
    - User confirmation at each step
    - Automatic cleanup of temporary files
    - Detailed statistics saved to Excel files
- Flexible Workflow: Choose between API or ML enrichment in any order, with clear progress indicators and statistics.
- Format Conversion: Generates outputs compatible with VOSviewer, Biblioshiny, and other analysis tools.
- Command-Line Interface (CLI): Offers user-friendly interaction with minimal setup requirements.
- Comprehensive Data Processing: Handles multiple data sources and formats efficiently.
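The DOI-based deduplication and merging listed above can be illustrated with a minimal pandas sketch. This is not BibexPy's actual implementation: the column names (`DI` for the DOI, `TI`, `DE`), the `normalize_doi` and `merge_by_doi` helpers, and the "first non-empty value wins" merge rule are all assumptions made for illustration.

```python
import pandas as pd


def normalize_doi(doi):
    """Lower-case a DOI and strip common prefixes (hypothetical helper)."""
    if pd.isna(doi):
        return None
    doi = str(doi).strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi or None


def merge_by_doi(scopus, wos, doi_col="DI"):
    """Stack both sources and collapse rows that share a DOI.

    For each DOI, every column takes its first non-missing value, so a field
    that is empty in one source can be filled from the other.
    """
    combined = pd.concat([scopus, wos], ignore_index=True)
    combined[doi_col] = combined[doi_col].map(normalize_doi)

    has_doi = combined[doi_col].notna()
    # GroupBy.first() returns the first non-null value per column in each group.
    merged = combined[has_doi].groupby(doi_col, as_index=False, sort=False).first()
    # Records without a DOI cannot be matched this way and are kept unchanged.
    return pd.concat([merged, combined[~has_doi]], ignore_index=True)


# Toy records standing in for a Scopus export and a Web of Science export.
scopus = pd.DataFrame({
    "DI": ["10.1000/ABC", "10.1000/xyz"],
    "TI": ["Paper A", "Paper B"],
    "DE": [None, "bibliometrics"],
})
wos = pd.DataFrame({
    "DI": ["https://doi.org/10.1000/abc", None],
    "TI": ["Paper A", "Paper C"],
    "DE": ["citation analysis", None],
})

result = merge_by_doi(scopus, wos)
print(result)
```

In this toy input, "Paper A" appears in both sources with differently formatted DOIs; after normalization the two rows collapse into one, and the missing `DE` value from the first source is filled from the second.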
- Time Saving: Automates manual data cleaning and enrichment tasks
- Enhanced Data Quality: Reduces errors and inconsistencies in bibliometric data
- Flexible Integration: Works with multiple data sources and output formats
- Rich Metadata: Comprehensive metadata enrichment from multiple sources
- Smart Enrichment: Choose between API-based or ML-based enrichment methods
- Detailed Feedback: Clear statistics and progress indicators during processing
- Easy to Use: Simple command-line interface with clear instructions
- Python ≥ 3.9.0
# Core Libraries - Required for Basic Functionality
pandas>=2.0.0 # Data manipulation and analysis
numpy>=1.24.0 # Required by pandas for numerical operations
openpyxl>=3.1.2 # Excel file handling
# Machine Learning - Required for ML Enrichment
scikit-learn>=1.3.0 # ML-based metadata enrichment and predictions
nltk>=3.8.1 # Text processing and feature extraction
# API and Network Libraries - Required for API Enrichment
requests>=2.31.0 # API interactions for metadata enrichment
urllib3>=2.0.0 # HTTP client for Python, used by requests
certifi>=2023.5.7 # Required for SSL certificate verification
python-dotenv>=1.0.0 # API configuration management
# Progress and User Interface
tqdm>=4.65.0 # Progress tracking for long operations
colorama>=0.4.6 # Console output formatting and colors
# Utilities
unidecode==1.3.6 # Text normalization and cleaning
typing-extensions>=4.7.0 # Type hints support
- Clone the Repository
git clone https://github.com/bcankara/BibexPy.git
- Navigate to the Directory
cd BibexPy
- Install Dependencies
pip install -r requirements.txt
- (Optional) Virtual Environment Setup (create and activate the environment before installing dependencies so that the packages are installed into it)
python -m venv venv
source venv/bin/activate # Mac/Linux
venv\Scripts\activate # Windows
- Basic Usage
python DataProcessor.py
  - Select your project
  - Upload Scopus (.csv) and Web of Science (.txt) files
  - Choose processing options
- Metadata Enrichment Options
The application offers three main methods for enriching your bibliometric data:
A. API-Based Enrichment
- Provides detailed statistics about empty fields
- Shows which APIs support each field
- Displays percentage of empty records for each field
- Supports multiple APIs:
  - CrossRef (Free)
  - OpenAlex (Free)
  - DataCite (Free)
  - Europe PMC (Free)
  - Scopus (API key required)
  - Semantic Scholar (Optional API key)
  - Unpaywall (Email required)
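The empty-field statistics mentioned above can be approximated with a few lines of pandas. This is only a sketch: the `empty_field_report` function and the sample columns are hypothetical, and treating blank strings as empty is an assumption about how missing values appear in the exported data.

```python
import numpy as np
import pandas as pd


def empty_field_report(df):
    """Return the count and percentage of empty records for each column."""
    # Treat both missing values and whitespace-only strings as empty.
    cleaned = df.replace(r"^\s*$", np.nan, regex=True)
    empty = cleaned.isna().sum()
    report = pd.DataFrame({
        "empty": empty,
        "empty_pct": (empty / len(df) * 100).round(1),
    })
    # Fields with the most gaps come first.
    return report.sort_values("empty_pct", ascending=False)


# Toy dataset: TI is complete, AB has two gaps, DE has three.
records = pd.DataFrame({
    "TI": ["Paper A", "Paper B", "Paper C", "Paper D"],
    "AB": ["Abstract text", "", None, "Another abstract"],
    "DE": [None, None, None, "bibliometrics"],
})

report = empty_field_report(records)
print(report)
```

A report like this makes it easier to decide which fields are worth enriching before any API calls are made.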
B. Machine Learning Enrichment (Experimental)
- Currently supports prediction for:
  - Keywords (DE field)
  - Keywords Plus (ID field) - Independent model using TF-IDF + RandomForest
  - Subject Categories (SC field)
  - Web of Science Categories (WC field)
- Shows training data statistics for each field
- Displays progress during model training
- Provides enrichment results summary
- Saves detailed statistics to Excel file
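To give a feel for the TF-IDF + RandomForest combination named above, here is a minimal scikit-learn sketch that fills missing `WC` values from article titles. It is not BibexPy's own model: using titles (`TI`) as the only input, treating the category as a single label, and the hyperparameters shown are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy dataset: two records are missing their category.
records = pd.DataFrame({
    "TI": [
        "Deep learning for image classification",
        "Neural networks in computer vision",
        "Convolutional models for object detection",
        "Consumer behavior in online retail markets",
        "Brand loyalty and customer satisfaction",
        "Marketing strategies for retail brands",
        "Image segmentation with neural networks",
        "Customer loyalty in retail marketing",
    ],
    "WC": [
        "Computer Science", "Computer Science", "Computer Science",
        "Business", "Business", "Business",
        None, None,
    ],
})

known = records[records["WC"].notna()]
missing = records[records["WC"].isna()]

# TF-IDF turns each title into a weighted term vector; the random forest
# then learns to map those vectors to the known categories.
model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
model.fit(known["TI"], known["WC"])

# Predict only for the records that lack a category and write them back.
predicted = model.predict(missing["TI"])
records.loc[missing.index, "WC"] = predicted
print(records[["TI", "WC"]])
```

With so few training rows the predictions are only illustrative; on real data, the training statistics for each field indicate whether there are enough labeled records for the predictions to be trusted.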
C. Combined API + ML Enrichment
- Sequential processing combining both methods
- API enrichment performed first with user confirmation
- ML enrichment applied to API-enriched data
- Comprehensive statistics for both processes
- User confirmation at each step
- Automatic cleanup of temporary files
- Detailed statistics saved to Excel files
- API Configuration
For API-based enrichment, configure your APIs in API_config.json:
{
  "scopus": {
    "api_key": "YOUR-SCOPUS-API-KEY",
    "description": "Get your API key from https://dev.elsevier.com/"
  },
  "semantic_scholar": {
    "api_key": "YOUR-SEMANTIC-SCHOLAR-KEY",
    "description": "Optional. Get your key from https://www.semanticscholar.org/product/api"
  },
  "unpaywall": {
    "email": "[email protected]",
    "description": "Use your institutional email for Unpaywall access"
  },
  "crossref": {
    "email": "[email protected]",
    "description": "Recommended for better rate limits with CrossRef"
  }
}
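Before starting an enrichment run, it can help to check which services in the configuration file actually have credentials filled in. The sketch below shows one way to do that with the standard library. The `load_api_config` and `configured_services` helpers and the `YOUR-` placeholder check are hypothetical and are not part of BibexPy.

```python
import json
from pathlib import Path

PLACEHOLDER_PREFIX = "YOUR-"  # assumption: matches the template values above


def load_api_config(path="API_config.json"):
    """Read the JSON configuration file and return it as a dict."""
    with Path(path).open(encoding="utf-8") as handle:
        return json.load(handle)


def configured_services(config):
    """List services whose credential (api_key or email) has been filled in."""
    ready = []
    for name, settings in config.items():
        credential = settings.get("api_key") or settings.get("email") or ""
        if credential and not credential.startswith(PLACEHOLDER_PREFIX):
            ready.append(name)
    return ready


# Example configuration: one untouched placeholder, one blank key, one real email.
example = {
    "scopus": {"api_key": "YOUR-SCOPUS-API-KEY"},
    "semantic_scholar": {"api_key": ""},
    "crossref": {"email": "researcher@example.org"},
}
print(configured_services(example))  # prints ['crossref']
```

In practice the dictionary would come from `load_api_config()`, and services missing from the returned list could be skipped or reported to the user.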
BibexPy generates several output files to support different analysis needs:
- Format: Excel Workbook
- Contents:
  - Merged and deduplicated records
  - Enhanced metadata from multiple sources
  - Standardized author names and affiliations
  - Complete citation information
- Uses:
  - Primary dataset for analysis
  - Input for other bibliometric tools
  - Reference database
- Format: Tab-separated text file
- Contents:
  - Author and co-authorship data
  - Citation networks
  - Keyword co-occurrence
  - Institution collaborations
- Uses:
  - Direct import into VOSviewer
  - Network visualization
  - Cluster analysis
- Format: Excel Workbook
- Contents:
  - Data completeness metrics
  - Field coverage statistics
  - Source distribution analysis
  - Duplicate detection results
  - API enrichment statistics
  - ML enrichment statistics
  - Field-wise enrichment rates
- Uses:
  - Dataset quality assessment
  - Coverage analysis
  - Source verification
  - Enrichment performance tracking
- Format: Text file
- Contents:
  - Processing statistics
  - API enrichment results
  - Error logs and warnings
  - Data transformation details
- Uses:
  - Process verification
  - Quality control
  - Troubleshooting
- Issues and Bugs: Submit via GitHub Issues
- Feature Requests: Use GitHub Discussions
- Questions: Contact us at 📧 [email protected]
- Updates: Follow us on Twitter @BibexPy
BibexPy is licensed under the GNU General Public License (GPL). See the LICENSE file for details.
Enhance your bibliometric research with BibexPy, making data preparation efficient, reliable, and analysis-ready!