Acknowledgment: This project utilizes the dataset and fine-tuned models developed by Dr. Uri Kartoun (https://urikartoun.com/).

Goal: This code fine-tunes a pre-trained language model to classify expressions about alcohol use taken from clinical narrative notes. The task is binary: decide whether an expression indicates alcohol use, or whether it describes appropriate use or non-use.

Model Overview: The classifiers are built on pre-trained models hosted on Hugging Face (e.g., "emilyalsentzer/Bio_ClinicalBERT" and "UFNLP/gatortron-base"), loaded through the Transformers library and adapted to recognize specific patterns in clinical narratives that relate to alcohol consumption.
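A minimal sketch of how such an adaptation could look with the Transformers `Trainer` API. It is not the repository's script: the column names (`text`, `label`), the presence of a `train` split, the 80/20 hold-out, and every hyperparameter are assumptions chosen for illustration, and `FineTuneConfig` and `fine_tune` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class FineTuneConfig:
    # All defaults are illustrative assumptions, not the repository's settings.
    base_model: str = "emilyalsentzer/Bio_ClinicalBERT"
    dataset_id: str = "kartoun/Alcohol_Use_Clinical_Notes_GPT4"
    text_column: str = "text"    # assumed column name
    label_column: str = "label"  # assumed column name (integer class ids)
    num_labels: int = 2
    max_length: int = 128
    epochs: int = 3
    batch_size: int = 16
    learning_rate: float = 2e-5
    output_dir: str = "fine_tuned_model"


def fine_tune(cfg: FineTuneConfig) -> None:
    """Fine-tune cfg.base_model as a sequence classifier and save it."""
    # Imported lazily so the config can be inspected without these packages.
    from datasets import load_dataset
    from transformers import (
        AutoModelForSequenceClassification,
        AutoTokenizer,
        Trainer,
        TrainingArguments,
    )

    tokenizer = AutoTokenizer.from_pretrained(cfg.base_model)
    # Adds a freshly initialized classification head on top of the encoder.
    model = AutoModelForSequenceClassification.from_pretrained(
        cfg.base_model, num_labels=cfg.num_labels
    )

    # Assumes the dataset exposes a "train" split with the configured columns.
    dataset = load_dataset(cfg.dataset_id)["train"]

    def tokenize(batch):
        return tokenizer(
            batch[cfg.text_column],
            truncation=True,
            padding="max_length",
            max_length=cfg.max_length,
        )

    dataset = dataset.map(tokenize, batched=True)
    if cfg.label_column != "labels":
        dataset = dataset.rename_column(cfg.label_column, "labels")

    # Hold out 20% of the samples for evaluation (assumed ratio).
    splits = dataset.train_test_split(test_size=0.2, seed=42)

    args = TrainingArguments(
        output_dir=cfg.output_dir,
        num_train_epochs=cfg.epochs,
        per_device_train_batch_size=cfg.batch_size,
        learning_rate=cfg.learning_rate,
    )
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=splits["train"],
        eval_dataset=splits["test"],
    )
    trainer.train()

    # Save the model and the tokenizer side by side so they can be reloaded together.
    trainer.save_model(cfg.output_dir)
    tokenizer.save_pretrained(cfg.output_dir)
```

Switching the base model is then a one-field change, for example `fine_tune(FineTuneConfig(base_model="UFNLP/gatortron-base"))`.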

Access: https://huggingface.co/kartoun/Bio_ClinicalBERT_for_Alcohol_Use_Classification (108M parameters), https://huggingface.co/kartoun/gatortron-base_for_Alcohol_Use_Classification (355M parameters).

Data for Fine-tuning: https://huggingface.co/datasets/kartoun/Alcohol_Use_Clinical_Notes_GPT4. This dataset contains 1,500 samples of expressions indicating alcohol use or its negation, generated from clinical narrative notes using OpenAI's ChatGPT 4 model. It is designed to support NLP applications that need to identify references to alcohol use in healthcare records.

Usage: Run the fine-tuning script ("Fine-tuning Alc Feb 7 2025.py") to load the data, initialize the model, train it, and save the model outputs. Then run the performance assessment script ("Performance assessment Alc Feb 8 2025.py") to compute the model's performance metrics. You can also query the model with a single text string and get back a classification value (0 or 1).
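Querying a model with a single string could look roughly like the sketch below. It assumes the published checkpoints load with `AutoModelForSequenceClassification`; the helper names (`softmax`, `predicted_class`, `classify`) are hypothetical and not part of the repository.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predicted_class(logits):
    """Return the index of the largest logit, i.e. the classification value."""
    return max(range(len(logits)), key=lambda i: logits[i])


def classify(text, model_id="kartoun/Bio_ClinicalBERT_for_Alcohol_Use_Classification"):
    """Return (class index, class probabilities) for a single text string."""
    # Lazy imports: torch and transformers are only needed when a model is loaded.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    # Assumption: the checkpoint ships a tokenizer and a classification head.
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    model.eval()

    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()
    return predicted_class(logits), softmax(logits)
```

A call such as `label, probs = classify("Patient reports drinking six beers daily.")` (an invented example sentence) returns the class index together with per-class probabilities. This README does not say which of 0 and 1 corresponds to which category, so check the model card before interpreting the value.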

Output: The fine-tuned model and tokenizer are saved in a designated directory, along with performance metrics for review.
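The README does not list which metrics are reported. As an illustration only, the usual binary classification metrics can be computed from held-out labels and predictions as follows; `binary_metrics` is a hypothetical helper, and treating class 1 as the positive class is an assumption.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for binary labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0

    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

Returning the confusion counts next to the derived scores makes it easy to check the ratios by hand when the held-out set is small.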

[Figure: classification performance using a held-out set]

Generalizability: The source code can be adapted to fine-tune pre-trained clinical models for a wide range of classification tasks beyond alcohol use. Examples include identifying Social Determinants of Health, multi-class tasks such as smoking status (current, past, never), and more complex scenarios such as assigning one of the seven classes of pancreatic cancer, all with minimal adjustments to the code and training set.
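For a multi-class task, the main code change is typically the number of output labels and the mapping between class names and ids. A minimal sketch using the smoking-status example, where `build_label_maps` and `load_classifier` are hypothetical helpers rather than functions from this repository:

```python
def build_label_maps(class_names):
    """Map class names to integer ids and back."""
    label2id = {name: i for i, name in enumerate(class_names)}
    id2label = {i: name for name, i in label2id.items()}
    return label2id, id2label


def load_classifier(base_model, class_names):
    """Load a pre-trained encoder with a classification head sized for class_names."""
    # Lazy import so the label helpers work without transformers installed.
    from transformers import AutoModelForSequenceClassification

    label2id, id2label = build_label_maps(class_names)
    return AutoModelForSequenceClassification.from_pretrained(
        base_model,
        num_labels=len(class_names),
        id2label=id2label,
        label2id=label2id,
    )


# Smoking status as a three-class problem.
smoking_label2id, smoking_id2label = build_label_maps(["current", "past", "never"])
```

The training labels then need to be encoded with the same `label2id` mapping, and a call such as `load_classifier("emilyalsentzer/Bio_ClinicalBERT", ["current", "past", "never"])` yields a model with three output logits instead of two.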

Contributing: Feel free to contribute to this project by submitting pull requests or opening issues for any bugs or enhancements you identify.

Repository files:
images
Fine-tuning Alc Feb 7 2025.py
Performance assessment Alc Feb 8 2025.py
README.md
Training metrics Bio_ClinicalBERT_for_Alcohol_Use_Classification_2025_02_06_01_32_56.csv
Training metrics gatortron-base_for_Alcohol_Use_Classification_2025_02_07_03_24_09.csv

kartoun/alcohol_use_classification_llms
