💻 Machine Learning Models for Complaint Classification

This repository focuses on applying machine learning models to classify complaint types using natural language processing (NLP) techniques. We aim to predict whether a complaint is related to computer issues or non-computer issues using various ML models.

📂 What Kind of Data Do We Have?

We are working with two important .csv files that contain:

Computer complaints
Non-computer complaints

All complaints are in Spanish, which adds an interesting dimension to the text processing and classification.
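As a rough illustration of how two such files could be combined into one labeled table, here is a minimal sketch using pandas. The in-memory CSV contents and the label values "computadora" and "no-computadora" are hypothetical stand-ins; only the column names Denuncias and tipo come from the code later in this README.

```python
import io

import pandas as pd

# Hypothetical stand-ins for the two CSV files; the real paths and rows differ.
computer_csv = io.StringIO("Denuncias\nMi computadora no enciende\nEl teclado no funciona\n")
other_csv = io.StringIO("Denuncias\nHay ruido en la calle\n")

computer_df = pd.read_csv(computer_csv)
other_df = pd.read_csv(other_csv)

# Tag each row with its source file so the class survives the merge.
computer_df["tipo"] = "computadora"      # hypothetical label value
other_df["tipo"] = "no-computadora"      # hypothetical label value

# Stack both tables into a single DataFrame with a fresh index.
df = pd.concat([computer_df, other_df], ignore_index=True)
print(df["tipo"].value_counts())
```

With real files, the io.StringIO objects would be replaced by the file paths passed to pd.read_csv.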

Taken together, the two files form a moderately sized dataset.

🎯 Goal of the Project

Our goal is to build a machine learning model that can accurately classify complaints into two categories:

Computer-related complaints
Non-computer-related complaints

🛠️ How Will We Use the Data?

Since machine learning models only work with numerical data, we need to transform our text data (complaints in Spanish) into a numerical format. To achieve this, we will use the TF-IDF (Term Frequency-Inverse Document Frequency) algorithm.

🔎 What's the TF-IDF Algorithm?

TF-IDF is a popular algorithm in NLP used to convert text into a format that can be processed by machine learning models. It assigns weights to words based on their importance in a document and across a collection of documents.

Term Frequency (TF): Measures how frequently a term appears in a document.
Inverse Document Frequency (IDF): Measures how rare a term is across the entire collection. Terms that appear in many documents receive a lower weight.
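The two definitions can be made concrete with a small hand computation. The sketch below uses the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1, which is the default in Scikit-learn's TfidfVectorizer. The three-sentence corpus is invented for illustration, and the sketch leaves out sublinear scaling and L2 normalization.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration (not taken from the project data).
docs = [
    "la computadora no enciende",
    "la pantalla de la computadora parpadea",
    "el vecino hace ruido",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf(term, tokens):
    # Raw term count times IDF (no sublinear scaling, no normalization).
    return tokens.count(term) * idf(term)

print(round(idf("la"), 3))                  # 1.288 ("la" occurs in 2 of 3 documents)
print(round(idf("ruido"), 3))               # 1.693 ("ruido" occurs in 1 of 3 documents)
print(round(tfidf("la", tokenized[1]), 3))  # 2.575 ("la" occurs twice in document 1)
```

The rarer term "ruido" gets the higher IDF, which is the behavior the definitions describe.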

In Scikit-learn, this transformation is done using TfidfVectorizer.

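from sklearn.feature_extraction.text import TfidfVectorizer

# STOP_WORDS is assumed to be a collection of Spanish stop words defined earlier
# (for example, spacy.lang.es.stop_words.STOP_WORDS); df is the complaints DataFrame.
legal_tfidf = TfidfVectorizer(sublinear_tf=True, min_df=5, norm='l2', ngram_range=(1, 2), stop_words=list(STOP_WORDS))

# Learn the vocabulary and IDF weights from the legal complaints only.
legal_tfidf.fit(df[df['tipo'] == 'denuncia-Legal']['Denuncias'])

legal_vocab = legal_tfidf.vocabulary_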

This snippet fits the vectorizer on the legal complaint texts, which learns the vocabulary and the IDF weights. Calling fit alone does not produce any vectors; the texts are converted to numbers in the next step.
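To show what fitting does and does not produce, here is a minimal sketch on an invented three-sentence corpus. It uses min_df=1 because the toy corpus is tiny (the project uses min_df=5) and omits stop words for brevity.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for illustration.
texts = [
    "la computadora no enciende",
    "la computadora se apaga sola",
    "el vecino hace ruido",
]

# min_df=1 because the toy corpus is tiny; the project uses min_df=5.
vectorizer = TfidfVectorizer(sublinear_tf=True, min_df=1, norm="l2", ngram_range=(1, 2))

vectorizer.fit(texts)                 # learns vocabulary_ and idf_ only
matrix = vectorizer.transform(texts)  # produces the sparse TF-IDF matrix

# One row per document, one column per unigram or bigram in the vocabulary.
print(matrix.shape)
print("la computadora" in vectorizer.vocabulary_)  # bigrams come from ngram_range=(1, 2)
```

Because norm="l2" is set, every non-empty row of the matrix has unit Euclidean length.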

📑 Defining the Labels and Features

After applying TF-IDF, we define the features and labels:

legal_features = TfidfVectorizer(
    sublinear_tf=True, min_df=5, norm='l2', ngram_range=(1, 2),
    stop_words=list(STOP_WORDS), vocabulary=legal_vocab,
).fit_transform(df[df['tipo'] == 'denuncia-Legal']['Denuncias'])

legal_labels = labels[df['tipo'] == 'denuncia-Legal']

The legal_features are extracted using the defined vocabulary, and legal_labels contain the corresponding labels for the complaints.
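The snippet above uses a labels object that is not defined in this README. One plausible way to build it, shown here only as an assumption, is to encode a category column as integers with pandas. The column name categoria and all row values below are hypothetical.

```python
import pandas as pd

# Toy frame; "categoria" and every value in it are hypothetical.
df = pd.DataFrame({
    "Denuncias": ["texto a", "texto b", "texto c", "texto d"],
    "tipo": ["denuncia-Legal", "denuncia-Legal", "otra", "denuncia-Legal"],
    "categoria": ["computadora", "no-computadora", "computadora", "computadora"],
})

# factorize assigns an integer code to each distinct value, in order of first appearance.
codes, categories = pd.factorize(df["categoria"])
labels = pd.Series(codes, index=df.index)

# The same boolean mask selects the texts and their labels, keeping rows aligned.
mask = df["tipo"] == "denuncia-Legal"
legal_labels = labels[mask]

print(list(categories))       # ['computadora', 'no-computadora']
print(legal_labels.tolist())  # [0, 1, 0]
```

Using one mask for both the texts and the labels is what keeps each feature row paired with the right label.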

🤖 Choosing the Model

We experiment with several machine learning models to find the best one for our classification task:

RandomForestClassifier

The RandomForestClassifier is an ensemble learning method based on decision trees. It creates multiple decision trees and combines their predictions to make the final classification.

n_estimators: Number of decision trees.
max_depth: Limits the depth of each tree.
random_state: Seed for reproducibility.
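The three parameters above can be seen in a minimal sketch. The six toy sentences, their labels, and the parameter values are invented for illustration and are not the settings used in the project.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy data, invented for illustration.
texts = [
    "la computadora no enciende", "el teclado no responde",
    "la pantalla parpadea", "el vecino hace ruido",
    "la calle no tiene luz", "hay basura en la vereda",
]
y = [1, 1, 1, 0, 0, 0]  # 1 = computer-related, 0 = not (toy labels)

X = TfidfVectorizer().fit_transform(texts)

# 100 trees, each at most 5 levels deep; the seed makes runs repeatable.
forest = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)
forest.fit(X, y)

print(len(forest.estimators_))                          # one fitted tree per estimator
print(max(t.get_depth() for t in forest.estimators_))   # never exceeds max_depth
```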

LinearSVC

LinearSVC (Linear Support Vector Classifier) is a linear model for classification:

Supports only a linear kernel (it has no kernel parameter).
Finds the best hyperplane that separates different classes.
Supports regularization to prevent overfitting.
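A minimal sketch of these points, on invented toy data: the parameter C controls regularization strength (smaller values regularize more), coef_ holds the learned hyperplane, and decision_function returns the signed distance-like score of a text from that hyperplane. The value C=1.0 is the library default, not a tuned setting from the project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy data, invented for illustration.
texts = [
    "la computadora no enciende", "el teclado no responde",
    "la pantalla parpadea", "el vecino hace ruido",
    "la calle no tiene luz", "hay basura en la vereda",
]
y = [1, 1, 1, 0, 0, 0]  # 1 = computer-related, 0 = not (toy labels)

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# C is the inverse regularization strength: smaller C means stronger regularization.
svc = LinearSVC(C=1.0)
svc.fit(X, y)

# For a binary problem, coef_ has a single row with one weight per vocabulary term.
print(svc.coef_.shape)

# Positive scores fall on the side of class 1, negative scores on the side of class 0.
scores = svc.decision_function(vectorizer.transform(["la computadora parpadea"]))
print(scores)
```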

MultinomialNB

MultinomialNB (Multinomial Naive Bayes) is based on Bayes' theorem:

Suitable for text classification tasks.
Models the likelihood of each feature's occurrence given the class.
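The per-class feature likelihoods can be inspected directly, as in this minimal sketch on invented toy data. The smoothing value alpha=1.0 is the library default, not a setting taken from the project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy data, invented for illustration.
texts = [
    "la computadora no enciende", "el teclado no responde",
    "la pantalla parpadea", "el vecino hace ruido",
    "la calle no tiene luz", "hay basura en la vereda",
]
y = [1, 1, 1, 0, 0, 0]  # 1 = computer-related, 0 = not (toy labels)

X = TfidfVectorizer().fit_transform(texts)

# alpha is the additive smoothing that keeps unseen terms from getting zero probability.
nb = MultinomialNB(alpha=1.0)
nb.fit(X, y)

# One row per class, one column per term: log P(term | class).
print(nb.feature_log_prob_.shape)

# Class probabilities for each training text; every row sums to 1.
proba = nb.predict_proba(X)
print(proba.shape)
```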

LogisticRegression

LogisticRegression is a linear model for binary and multi-class classification:

Estimates the probability of belonging to a certain class using a logistic function.
Supports regularization to control model complexity.
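Both points can be sketched on invented toy data: predict_proba exposes the estimated class probabilities, and C sets the inverse regularization strength. The values C=1.0 and max_iter=1000 are illustrative choices, not the project's settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy data, invented for illustration.
texts = [
    "la computadora no enciende", "el teclado no responde",
    "la pantalla parpadea", "el vecino hace ruido",
    "la calle no tiene luz", "hay basura en la vereda",
]
y = [1, 1, 1, 0, 0, 0]  # 1 = computer-related, 0 = not (toy labels)

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# Smaller C means stronger regularization; max_iter gives the solver room to converge.
clf = LogisticRegression(C=1.0, max_iter=1000)
clf.fit(X, y)

# Columns follow clf.classes_, so column 1 is the probability of class 1.
proba = clf.predict_proba(vectorizer.transform(["el teclado parpadea"]))
print(clf.classes_, proba)
```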

Applying our data to these models, we find that LinearSVC performs best, with a reported score of 100% on our dataset.
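This README does not show how the comparison was run. One common approach, sketched here as an assumption rather than the project's actual procedure, is cross-validation with a pipeline, so that TF-IDF is fitted only on each training fold and every score is measured on texts the model has not seen. The toy data, cv=3, and the accuracy metric are illustrative choices.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy data, invented for illustration.
texts = [
    "la computadora no enciende", "el teclado no responde",
    "la pantalla parpadea", "el vecino hace ruido",
    "la calle no tiene luz", "hay basura en la vereda",
]
y = [1, 1, 1, 0, 0, 0]  # 1 = computer-related, 0 = not (toy labels)

models = {
    "RandomForestClassifier": RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0),
    "LinearSVC": LinearSVC(),
    "MultinomialNB": MultinomialNB(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}

results = {}
for name, model in models.items():
    # The vectorizer lives inside the pipeline, so it is refitted on each training fold.
    pipeline = make_pipeline(TfidfVectorizer(), model)
    scores = cross_val_score(pipeline, texts, y, cv=3, scoring="accuracy")
    results[name] = scores.mean()

for name, mean_accuracy in results.items():
    print(f"{name}: {mean_accuracy:.2f}")
```

On the real dataset, the same loop would take the complaint texts and their labels in place of the toy lists.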
