
BUFFALO - An Enterprise LLM Gateway 🐃

Boundless User-Focused Framework for Advanced LLM Optimization


tl;dr - Cisco Research proudly presents BUFFALO, a flexible, open-source enterprise gateway for LLMs.

Check out our TechBlog, Slide Deck, Demo Input, and Demo Output for more information and exciting applications!

Make sure to use the Table of Contents (the three horizontal lines next to README.md) for easier viewing!


Why BUFFALO?

The guiding question in our creation of BUFFALO was simple:

"Why can't an enterprise deploy GPT-4, LLAMA-2, or any other LLM for both internal and external usage?"

Our team identified several limitations of the LLM landscape that prevent enterprises from readily deploying LLM-based solutions:

  • Data Privacy: Sharing PII, divulging customer/enterprise data
  • Expenses: Cost/power, usage limitation, redundant queries
  • Data Security: Leaking confidential data, unauthorized access
  • Hallucinations: Misleading information, improper tool usage
  • Explainability: Robustness, removal of bias, truthfulness
  • Competence: Lack of reasoning, attribution, false confidence

Installing Python Packages

To start, we can clone this repository using:

git clone https://github.com/cisco-open/BUFFALO.git

Next, we must install the necessary packages for BUFFALO. For this, we have two options:

A conda env/requirements.txt will be available soon!

Until then, pip install the following (see the example command after this list):

  • streamlit
  • annotated_text
  • faiss
  • langchain
  • rannet
  • nltk
  • openie
  • elasticsearch
  • spacy
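
For example, in a fresh virtual environment, a single command along these lines should work. Note that a few of these may be published on PyPI under slightly different names (for instance, faiss is typically installed as faiss-cpu), so adjust as needed:

pip install streamlit annotated_text faiss langchain rannet nltk openie elasticsearch spacy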


Installing Elasticsearch

Elasticsearch needs to be running in the background for the Output Verification component. Please download Elasticsearch 8.9.0 from the following link: Elasticsearch Download

Next, ensure that security settings are disabled by opening elasticsearch.yml (in the config directory of the Elasticsearch installation folder) and setting the following:

xpack.security.enabled: false
xpack.security.enrollment.enabled: false

More Info - Disabling Elasticsearch Security


Installing Stanford CoreNLP

Java needs to be installed to run OpenIE!

Stanford CoreNLP also needs to be installed in order to run the Output Verification component. The zip folder can be downloaded from the Stanford OpenIE Repo.
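
Once unzipped, you can typically launch the CoreNLP server from inside that directory with a command along these lines (the memory limit, port, and timeout values here are the illustrative defaults from the CoreNLP docs):

java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000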


Installing RANNET Model

Both the RANNET base English model and its model-store need to be downloaded. These can be found on the RANNET GitHub page.


Usage (Demos)

Having completed installation, launching BUFFALO is as easy as 1, 2, 3.

Step 1) Launching Elasticsearch

You can launch Elasticsearch by navigating to the downloaded, unzipped directory and running the corresponding command for your platform (shown below for Windows).

More Info - Launching Elasticsearch

.\bin\elasticsearch.bat
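
On Linux/macOS, the equivalent command is:

./bin/elasticsearch

Once Elasticsearch starts (by default on port 9200), you can confirm it is reachable, with security disabled as configured above, using:

curl http://localhost:9200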


Step 2) Starting up the REST API Server

Next, navigate to the gateway folder and start the backend:

cd gateway
python app.py 
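
Before moving on, you can confirm the backend is ready by hitting its root endpoint. The exact port depends on how app.py configures the server; 5000, the common Flask default, is assumed here:

curl http://localhost:5000/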


Step 3) Starting the Streamlit Frontend

Wait for a few seconds to ensure the server is up and running.

Next, launch Streamlit by going back to the top-level directory:

cd ..
streamlit run main.py

And voila! A link should pop up, taking you to the dashboard gateway.


Architecture

Let's take a quick look at how our BUFFALO gateway works under the hood.

The backend (gateway) is a REST API server, which we started by running python app.py.

This server contains ALL the actual gateway components/computations, allowing the Streamlit frontend to be as light as possible.

All Streamlit does is call the REST API server with the corresponding GET/POST requests and display the output.

All REST API endpoints can be seen in app.py; they are summarized here, with a short usage sketch after the list:

General
  • (GET) / = default, ensures server is up
  • (GET) /reset = resets server state
  • (GET) /docs = gets all documents and file sizes in Enterprise Knowledge Base (set to the .\gateway\docs folder)
  • (GET) /model_list = gets list of all models as specified in admin.yml
  • (POST) /model = receives query, returns answer given specific LLM
Input Layer
  • (POST) /prompt/post = receives input prompt and selected models, returns cost options
  • (POST) /cache/search/post = receives prompt, returns cache results
  • (POST) /cache/model/post = receives query, returns document-specific LLM response (if applicable)
  • (POST) /cache/add/post = receives prompt/query, adds to cache
Output Layer
  • (POST) /exfiltrator/post = receives output of llm, returns leaks, if any
  • (POST) /exfiltrator/topic/post = receives output of llm, returns list of relevant topics/facts
  • (POST) /verify/facts/post = receives LLM output, returns list of facts
  • (POST) /verify/ie/post = receives list of facts, returns relation triplets for each, if any
  • (POST) /verify/triplet/post = receives list of triplets, returns matches/info if found
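
For illustration, here is a minimal sketch of how a client (such as the Streamlit frontend) might call a few of these endpoints. It assumes the backend listens on localhost:5000 and that the payload keys shown (query, model) match what app.py expects; adjust both to your setup:

import requests

BASE = "http://localhost:5000"  # assumed host/port; check app.py

# Confirm the server is up (GET /)
print(requests.get(f"{BASE}/").status_code)

# List documents and file sizes in the Enterprise Knowledge Base (GET /docs)
print(requests.get(f"{BASE}/docs").json())

# Send a query to a specific LLM (POST /model); payload keys are illustrative
response = requests.post(f"{BASE}/model", json={"query": "Summarize our Q3 report.", "model": "gpt-4"})
print(response.json())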

New endpoints can be added to app.py in a similar format.

These endpoints simply call the corresponding functions from each of the classes. Four files house all of the classes/methods for the four components (prompt.py: QueryProcessor, cache.py: QACache, leaks.py: ExfiltrationModel, verify.py: VerificationModel).

And in Streamlit, our frontend is split into three parts: Col1, Col2A, and Col2B (a minimal layout sketch follows this list):

  • Col1 is on the left-hand side, where the user enters the prompt and can see which documents are in the Knowledge Base
  • Col2A is on the right-hand side and is displayed first; it shows the Prompt and Cache components
  • Col2B is shown after "Submit to LLM" is selected in Col2A; it shows the Exfiltration and Verification components
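
A minimal sketch of this layout, using the standard st.columns API (the widget names and width ratio here are illustrative, not the actual main.py code):

import streamlit as st

# Col1 (left): prompt entry and Knowledge Base contents
col1, col2 = st.columns([1, 2])  # illustrative width ratio

with col1:
    prompt = st.text_area("Enter your prompt")
    st.write("Documents in Knowledge Base ...")  # placeholder for the docs listing

with col2:
    # Col2A: Prompt and Cache components, displayed first
    st.header("Prompt & Cache")
    if st.button("Submit to LLM"):
        # Col2B: Exfiltration and Verification, shown after submission
        st.header("Exfiltration & Verification")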


Contributing & Future Directions

We have several TODOs spread throughout our files! We will continue working in these directions to improve BUFFALO, adjusting as needed based on our users' feedback.

Currently, we are working on implementing the following features, split by component-of-focus:

Overall
  • Some way to store session state so we can have chat history
  • JSON dictionary from end-to-end (need to discuss some new ideas)
  • Model Comparison (benchmark of prompts to multiple models, give truthfulness score)
  • Can sample a few datapoints to run checks against (show comparison of how they work)
  • Running multiple models (can ask Vamsi)
  • Using GPT-4 to tag (connecting prompt and verification)
Model-Specific
  • Change the model so we're loading from Hugging Face + have a dedicated class
Prompt Layer
  • Say cost options can be configured by the admin (verbosity), add to admin.yml
LLM Cache
  • Some way to give the cache more than one document
  • If the cache gets a partial hit, need an option to ONLY query the LLM with the non-cache hits (some way to specify for each)
  • If there is a cache hit but the LLM is still queried and the LLM's score is better, replace the cache entry with the better result or append it
Exfiltration
  • Remove the SQuAD documents (or be able to toggle them off)
  • Need to not only add file names, but also big ideas (e.g., "Financial Information About Cisco")
  • Solidify the Idea Embeddings research ^^
Verification
  • Need to separate retrieval-augmented correction from retrieval-augmented generation
  • Will NOT send docs to the LLM until we explicitly ask it to
Miscellaneous
  • Create + test the conda env
  • Merge the Elasticsearch launch into the backend start

If you wish to contribute or suggest any additional functionalities, please check out our Contributing Guidelines.

Acknowledgements

BUFFALO would not have been possible without the contributions of the following individuals, to whom we express our gratitude:

  • Advit Deepak, Will Healy, Raunak Sinha, Tarun Raheja, Jayanth Srinivasa, Charles Fleming, Goli Vamsi Krishna Mohan, Ramana Kompella, Vijoy Pandey
  • The entirety of the Cisco Research team for their feedback, ideas, and support 🥳

Thank you so much for checking us out! 🐃

