- Overview
- Getting Started
- Project Structure
- Running
- arXiv Data Format
- Key Terms and Abbreviations
- Stems
- Written in C# .NET with vector embeddings generated using gensim in Python.
- Download entries from arXiv to be indexed by Qdrant.
- Indexing and searching include lemmatization of key terms, stemming with both a custom stemmer and the Porter stemmer, and stop-word removal.
- Custom key terms and stems can be added via text files.
- Search by a text query, or find papers similar to a specific one.
- Infinite scroll to load more results.
- Uses Ollama to produce concise summaries with a local large language model.
- Uses Hunspell for spell correction.
- This will require installing Python.
- It is recommended that you create a virtual environment for Python.
- Install the Python packages from requirements.txt; example commands are shown after this list.
- Install Docker.
- Install Ollama, and ensure it is running.
- You need to have the required .NET tools installed.
- The easiest way to do this (even if you don't end up using it) is to install Visual Studio. When installing Visual Studio, simply check the "ASP.NET and web development" option, and it will install everything you need.
- You could alternatively manually download .NET and configure it.
- In an IDE of your choice (Rider, Visual Studio, Visual Studio Code), open the solution file "SearchEngine.sln".
- Rider, and all other JetBrains products, are free for students! I have worked with C# and .NET for years professionally, and I highly recommend using Rider if you have the option.
- You can get it for free here.
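A minimal setup sketch for the Python side, assuming a Unix-like shell (on Windows, activate with `.venv\Scripts\activate`):

```bash
# Create and activate a virtual environment, then install the Python dependencies.
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```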
All subprojects and scripts that can be run directly are listed here. Anything not listed is a helper subproject or file that is not meant to be run on its own.
- This handles the one-time building of the dataset from arXiv, summarizing it with Ollama, and indexing it into Qdrant.
- Arguments are positional and do not require a prefix, unlike the Python scripts in this repository.
- Default values will apply if you do not pass anything.
- Total Results - Integer - The amount of arXiv data to download. Defaults to `100000`. Note you will likely not be able to download more than 1,500,000 in total due to arXiv API limitations.
- Lower Clustering Bound - Integer - The lower number of clusters to fit to. Defaults to `5`.
- Upper Clustering Bound - Integer - The upper number of clusters to fit to. Defaults to `5`.
- Run Mitigation - Boolean - If mitigation should be run. Defaults to `true`.
- Run Clustering - Boolean - If clustering should be run. Defaults to `true`.
- Run PageRank - Boolean - If PageRank should be run. Defaults to `true`.
- Run Summarizing - Boolean - If summarizing with Ollama should be run. Defaults to `true`.
- Run Indexing - Boolean - If results should be indexed into the Qdrant database. Defaults to `true`.
- Reset Indexing - Boolean - If the Qdrant database should be reset before indexing. Defaults to `true`.
- Starting Category - String or Null - What arXiv category to start searching from. Defaults to `null` to run from the beginning.
- Starting Order - String or Null - What arXiv order to start searching from. Defaults to `null` to run from the beginning.
- Starting Sort By Mode - String or Null - What arXiv sort-by mode to start searching from. Defaults to `null` to run from the beginning.
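For example, a possible invocation from the repository root, assuming the subproject is started with `dotnet run` (the argument values here are arbitrary; trailing arguments can be omitted to take their defaults):

```bash
# Download 100000 entries, fit between 5 and 10 clusters, and run every stage.
dotnet run --project Builder -- 100000 5 10 true true true true true true
```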
- Hosts the backend server so we can run searches from our client.
- This also contains all Qdrant methods for indexing and searching.
- Will automatically scrape and index more arXiv data.
- This takes a single boolean argument to run the scraping service automatically in the background on repeat. Defaults to `false`.
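For instance, to have the backend scrape on repeat while it serves searches (the `Server` project name is a guess; this README only calls it the backend server):

```bash
# Pass true to run the scraping service automatically in the background.
dotnet run --project Server -- true
```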
- Generates word2vec embeddings.
- Checks classification on word2vec models and two language models.
- `-d` `--directory` - The folder to load data from. Defaults to `arXiv_processed_mitigated`.
- `-s` `--seed` - The seed for random state. Defaults to `42`.
- `-al` `--alpha_low` - The lower bound for word2vec alpha values. Defaults to `0.01`.
- `-au` `--alpha_upper` - The upper bound for word2vec alpha values. Defaults to `0.05`.
- `-as` `--alpha_step` - The step for word2vec alpha values. Defaults to `0.01`.
- `-wl` `--window_low` - The lower bound for word2vec window values. Defaults to `5`.
- `-wu` `--window_upper` - The upper bound for word2vec window values. Defaults to `10`.
- `-ws` `--window_step` - The step for word2vec window values. Defaults to `5`.
- `-nl` `--negative_low` - The lower bound for word2vec negative values. Defaults to `5`.
- `-nu` `--negative_upper` - The upper bound for word2vec negative values. Defaults to `10`.
- `-ns` `--negative_step` - The step for word2vec negative values. Defaults to `5`.
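A possible hyperparameter sweep, assuming the script file is `model_fitting.py` (the name used in the run steps below):

```bash
# Sweep alpha from 0.01 to 0.05 in steps of 0.01, keeping the default window and negative ranges.
python model_fitting.py -d arXiv_processed_mitigated -al 0.01 -au 0.05 -as 0.01
```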
- Plots PCA and t-SNE visualizations for an embeddings file.
- `-e` `--embeddings` - The embeddings file to use. Defaults to `embeddings.txt`.
- `-o` `--output` - The folder to output plots to. Defaults to `Plots`.
- `-p` `--perplexity` - Perplexity for t-SNE. Defaults to `20`.
- `-s` `--size` - The size of the plots. Defaults to `100`.
- `-n` `--no-labels` - Disable labels.
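For example (the file name `plot_embeddings.py` is hypothetical, as this README does not name the script):

```bash
# Plot PCA and t-SNE for the default embeddings file with a t-SNE perplexity of 20.
python plot_embeddings.py -e embeddings.txt -o Plots -p 20 -s 100
```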
- Calculates Heaps' Law and Zipf's Law, and gets the most frequent terms in the corpus.
- `-d` `--directory` - The root folder to run these statistics on. Defaults to `arXiv_processed`.
- `-o` `--output` - The output directory to save to. Defaults to `Text Statistics`.
- `-x` `--width` - The output width. Defaults to `10`.
- `-y` `--height` - The output height. Defaults to `5`.
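For example (the file name `text_statistics.py` is hypothetical). Note the default output directory contains a space, so quote it on the command line:

```bash
# Compute Heaps' Law, Zipf's Law, and frequent-term statistics over the processed corpus.
python text_statistics.py -d arXiv_processed -o "Text Statistics" -x 10 -y 5
```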
- Allows you to make a smaller training set of data if your corpus is very large.
- `-d` `--directory` - The root folder to build a training set for. Defaults to `arXiv`.
- `-s` `--size` - The amount to use for a training set, either a percentage as a float in the range (0, 1] or the number of files to copy as an integer. Defaults to `0.1`.
- `-r` `--seed` - The seed for randomly choosing files. Defaults to `42`.
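For example, to copy a random 10% of the corpus (the file name `build_training_set.py` is hypothetical):

```bash
# Sample 10% of the files under arXiv/ using seed 42.
python build_training_set.py -d arXiv -s 0.1 -r 42
```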
- Ensure Ollama is running.
- Run `Builder` with `NUMBER MIN MAX true false false true false`. See above for what these arguments are.
- Run `model_fitting`.
- From the `Embeddings` folder under `arXiv_processed_mitigated`, copy over an embeddings file to the root of your application and name it `embeddings.txt`.
- Launch Qdrant with Docker with `docker run -p 6334:6334 qdrant/qdrant`.
- Run `Builder` again with `NUMBER MIN MAX false true true false true`.
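Put together, the steps above look roughly like this (`NUMBER MIN MAX` as described earlier; the `dotnet run` invocations and the `model_fitting.py` file name are assumptions):

```bash
dotnet run --project Builder -- NUMBER MIN MAX true false false true false  # mitigate + summarize
python model_fitting.py                                                     # fit word2vec models
cp arXiv_processed_mitigated/Embeddings/<chosen_file> embeddings.txt        # pick one embeddings file
docker run -p 6334:6334 qdrant/qdrant                                       # runs in the foreground
dotnet run --project Builder -- NUMBER MIN MAX false true true false true   # cluster + PageRank + index
```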
arXiv Data Format
- The scraped arXiv data is organized into subfolders based on their primary category.
- Each file is named after its arXiv ID and saved as a text (.txt) file.
- The ID can be used to reach the main (abstract) page of the document, the PDF, or, for newer papers, the experimental HTML page, using the URLs below with "ID" replaced by the file name (less the ".txt").
- Main/abstract page - https://arxiv.org/abs/ID
- PDF - https://arxiv.org/pdf/ID
- HTML (Note that not all documents may have this) - https://arxiv.org/html/ID
- The first line of a text file contains the title of the document.
- The second line contains the abstract.
- The third line has the date and time in the format of "YYYY-DD-MM hh:mm:ss".
- The fourth line has all authors separated by a "|".
- The fifth line has all categories separated by a "|", with the primary category being the first one.
- All text has been preprocessed, ensuring all whitespace has been replaced by single spaces. Additionally, LaTeX/Markdown has been converted to plain text. A sketch of reading these fields is shown after this list.
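A quick sketch of pulling these fields out of one entry in a Unix-like shell (the path and ID here are hypothetical):

```bash
# Read the five metadata lines of a single arXiv entry.
file="arXiv/cs.CL/2301.00001.txt"
{ read -r title; read -r abstract; read -r datetime; read -r authors; read -r categories; } < "$file"
echo "Title: $title"
echo "Date: $datetime"
echo "Authors: $(echo "$authors" | sed 's/|/, /g')"
echo "Primary category: $(echo "$categories" | cut -d'|' -f1)"
```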
Key Terms and Abbreviations
- Our pipeline automatically replaces abbreviations with their full term. For instance, "LLM" automatically becomes "Large Language Model". This ensures the indexing process treats a term and its abbreviation equally.
- Key terms can be found in `terms.txt`. These have the following format: `term|abbreviation1|abbreviation2|...|abbreviationN`. You can have as many abbreviations as you want for a term. Everything is normalized to lowercase in our pipeline.
- Our key terms builder automatically handles plurals. For instance, you only need to have `large language model|llm` and not `large language model|llms` or `large language models|llms`.
- The key terms pass also recognizes and removes the places where an abbreviation is introduced. For instance, a paper will commonly write "Large Language Models (LLMs)" the first time large language models are mentioned. Our pipeline reduces all such instances to just the term, so the indexing process counts the term as written once, not twice, since the second occurrence only introduced the abbreviation to the reader. We make sure to capture all possible plural combinations. Examples are below:
- "Large Language Model (LLM)" becomes "Large Language Model".
- "Large Language Model (LLMs)" becomes "Large Language Model".
- "Large Language Models (LLM)" becomes "Large Language Models".
- "Large Language Models (LLMs)" becomes "Large Language Models".
Stems
- Custom stems can be found in `stems.txt`.