
Data Cleaning Automation Tool with Generative AI

License: MIT · Python 3.7+ · Streamlit · Gemini API · Pandas

A generative AI-powered tool that automates data cleaning tasks by dynamically generating and executing Python code based on natural language input.

📊 Project Overview

This project leverages Google's Gemini API to create a data cleaning automation tool that simplifies the process of preparing and cleaning datasets. Users can describe their data cleaning requirements in natural language, and the tool automatically generates and executes the corresponding Python code. The tool is designed to streamline data processing tasks, reduce manual effort, and improve the efficiency and accuracy of data pipelines.

🎯 Key Features

  • Natural Language Processing: Users can describe data cleaning tasks in plain English.
  • Automated Code Generation: The tool dynamically generates Python code based on user input.
  • Streamlit Web Interface: A user-friendly interface for uploading CSV files, describing tasks, and viewing results.
  • Gemini API Integration: Uses Google's Gemini LLM to interpret user input and generate accurate cleaning code.
  • Data Cleaning Automation: Reduces manual effort and improves efficiency in data pipelines.

🧠 Technical Implementation

Files

  • sentient.py - Main script for the data cleaning automation tool.
  • requirements.txt - List of dependencies for the project.
  • example_data.csv - Sample dataset for testing the tool.

Data Flow

  1. User Input: The user uploads a CSV file and describes the data cleaning task in natural language.
  2. Code Generation: The Gemini API interprets the user's input and generates the corresponding Python code.
  3. Code Execution: The generated code is executed on the uploaded dataset.
  4. Result Display: The cleaned dataset is displayed to the user via the Streamlit interface.
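
The loop above can be sketched roughly as follows. This is an illustrative sketch, not the actual sentient.py implementation: the helper names are invented, the model reply is simulated rather than fetched from the Gemini API, and executing LLM output with `exec` is unsafe without sandboxing.

```python
import re
import pandas as pd

def extract_code(response_text: str) -> str:
    """Pull the Python source out of a fenced Markdown block, if present."""
    match = re.search(r"`{3}(?:python)?\n(.*?)`{3}", response_text, re.DOTALL)
    return match.group(1) if match else response_text

def run_generated_code(df: pd.DataFrame, code: str) -> pd.DataFrame:
    """Execute generated cleaning code against the uploaded DataFrame.

    The generated code is expected to read and write a variable named `df`.
    NOTE: exec on model output is a sketch only -- a real deployment needs
    sandboxing and validation.
    """
    namespace = {"df": df.copy(), "pd": pd}
    exec(code, namespace)
    return namespace["df"]

# Simulate a model reply for the task:
# "drop rows with missing values and lowercase column names"
fence = "`" * 3
fake_response = (
    f"{fence}python\n"
    "df = df.dropna()\n"
    "df.columns = [c.lower() for c in df.columns]\n"
    f"{fence}"
)

df = pd.DataFrame({"Name": ["Ada", None], "Age": [36, 41]})
cleaned = run_generated_code(df, extract_code(fake_response))
```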

🚀 Getting Started

Setup Instructions

Prerequisites

  • Python 3.7+
  • A Google Gemini API key
  • Pandas and Streamlit, plus the other packages listed in requirements.txt (install via pip)

Installation

  1. Clone this repository:
git clone https://github.com/yourusername/data-cleaning-automation-tool.git
  2. Install the required packages:
pip install -r requirements.txt
  3. Run the Streamlit app:
streamlit run sentient.py
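
The app also needs your Gemini API key. Assuming sentient.py reads it from an environment variable (the variable name below is hypothetical — check the script for the one it actually uses), export it before launching:

```shell
# Hypothetical variable name -- confirm against sentient.py
export GEMINI_API_KEY="your-api-key-here"
```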

🔮 Future Work

Data Engineering and Infrastructure

  • Real-time Streaming Architecture:
    • Implement Apache Kafka for high-throughput, fault-tolerant data streaming
    • Build real-time analytics with Kafka Streams or Apache Flink
  • Workflow Orchestration:
    • Migrate to Apache Airflow for robust pipeline scheduling and monitoring
    • Implement DAGs for complex data cleaning workflows
  • Big Data Processing:
    • Scale to distributed computing with Apache Spark for handling very large datasets
    • Implement batch processing with the Hadoop ecosystem for historical analysis
  • Data Warehousing & Storage:
    • Implement a Snowflake data warehouse for flexible scaling and analytics
    • Utilize AWS S3 for cost-effective long-term storage of datasets
  • Cloud Infrastructure:
    • Migrate to AWS cloud infrastructure (EC2, Lambda, SageMaker)
    • Implement containerization with Docker and Kubernetes for deployments
  • Advanced Analytics:
    • Develop a data lake architecture for combining structured and unstructured data
    • Implement dbt (data build tool) for analytics engineering and transformation
Advanced Features

Customizable Data Cleaning Pipelines: Allow users to save and reuse data cleaning workflows.

Integration with Cloud Storage: Enable users to directly import datasets from cloud storage services like Google Drive or AWS S3.

Error Handling and Suggestions: Improve the tool's ability to handle ambiguous or incomplete user input by providing suggestions and error messages.
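
As a first step in that direction, generated code could be checked with Python's built-in compile() before execution, turning syntax errors into actionable messages (or feedback for a retry prompt) instead of crashes. A hypothetical sketch, not part of the current tool:

```python
from typing import Optional

def validate_generated_code(code: str) -> Optional[str]:
    """Return None if the code parses cleanly, otherwise an error message
    that could be shown to the user or fed back into the next prompt."""
    try:
        compile(code, "<generated>", "exec")
        return None
    except SyntaxError as err:
        return f"Syntax error on line {err.lineno}: {err.msg}"

# A well-formed snippet passes; a truncated one yields a message.
ok = validate_generated_code("df = df.dropna()")
bad = validate_generated_code("df = df.dropna(")
```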

Support for Multiple Data Formats: Extend the tool to support other data formats such as Excel, JSON, and SQL databases.
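
Such support could lean on pandas' existing readers. A hypothetical dispatch helper — nothing like this exists in the tool yet, which reads CSV only:

```python
from pathlib import Path
import tempfile
import pandas as pd

# Hypothetical reader table -- extend as formats are added.
READERS = {
    ".csv": pd.read_csv,
    ".json": pd.read_json,
    ".xlsx": pd.read_excel,  # requires the optional openpyxl dependency
}

def load_dataset(path: str) -> pd.DataFrame:
    """Load a dataset, choosing the pandas reader by file extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"Unsupported format: {suffix}")
    return READERS[suffix](path)

# Quick demonstration with a throwaway CSV file.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as f:
    f.write("name,age\nAda,36\n")
    tmp_path = f.name

df = load_dataset(tmp_path)
```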

Scalability and Performance

Distributed Processing: Implement distributed data processing using Apache Spark for handling large datasets.

Real-time Collaboration: Allow multiple users to collaborate on data cleaning tasks in real-time.

Performance Optimization: Optimize the tool for faster code generation and execution, especially for large datasets.

User Experience Enhancements

Interactive Data Visualization: Integrate interactive data visualization tools like Plotly or Altair for better data exploration.

User Feedback Loop: Implement a feedback mechanism to improve the tool's accuracy and usability based on user input.

Tutorials and Documentation: Provide comprehensive tutorials and documentation to help users get started with the tool.

License

MIT License

Contact

Feel free to reach out if you have any questions or would like to collaborate!

Email: [email protected]
LinkedIn: https://www.linkedin.com/in/sriram-vivek-58a673269/
