- Overview
- Getting Started
- Project Structure
- Running
- arXiv Data Format
- Key Terms and Abbreviations
- Stems
- Written in C# .NET with vector embeddings generated using gensim in Python.
- Download entries from arXiv to be indexed by Qdrant.
- Indexing and searching include lemmatization of key terms, stemming with both a custom stemmer and the Porter stemmer, and stop-word removal.
- Custom key terms and stems can be added via text files.
- Search by a text query, or find papers similar to a specific one.
- Infinite scroll to load more results.
- Uses Ollama to produce concise summaries with a local large language model.
- Uses Hunspell for spell correction.
- This will require installing Python.
- It is recommended that you create a virtual environment for Python.
- Install the Python packages from requirements.txt; example commands are shown after this list.
- Install Docker.
- Install Ollama, and ensure it is running.
- You need to have the required .NET tools installed.
- The easiest way to do this (even if you don't end up using it) is to install Visual Studio. When installing Visual Studio, simply check the "ASP.NET and web development" option, and it will install everything you need.
- You could alternatively manually download .NET and configure it.
- In an IDE of your choice (Rider, Visual Studio, Visual Studio Code), open the solution file "SearchEngine.sln".
- Rider, and all other JetBrains products, are free for students! I have worked with C# and .NET for years professionally, and I highly recommend using Rider if you have the option.
- You can get it for free here.
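A minimal setup sketch for the Python side, assuming a Unix-like shell (on Windows, activate with `.venv\Scripts\activate`):

```bash
# Create and activate a virtual environment, then install the Python dependencies.
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```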
All subprojects and scripts that can be run directly are listed here. Anything not listed is a helper subproject or file that is not meant to be run on its own.
- This handles the one-time building of the dataset from arXiv, summarizing it with Ollama, and indexing it into Qdrant.
- Arguments are positional and do not require a prefix, unlike the Python scripts in this repository.
- Default values will apply if you do not pass anything.
- Total Results - Integer - The amount of arXiv data to download. Defaults to `100000`. Note you will likely not be able to download more than 1,500,000 in total due to arXiv API limitations.
- Lower Clustering Bound - Integer - The lower number of clusters to fit to. Defaults to `5`.
- Upper Clustering Bound - Integer - The upper number of clusters to fit to. Defaults to `5`.
- Run Mitigation - Boolean - If mitigation should be run. Defaults to `true`.
- Run Clustering - Boolean - If clustering should be run. Defaults to `true`.
- Run PageRank - Boolean - If PageRank should be run. Defaults to `true`.
- Run Summarizing - Boolean - If summarizing with Ollama should be run. Defaults to `true`.
- Run Indexing - Boolean - If results should be indexed into the Qdrant database. Defaults to `true`.
- Reset Indexing - Boolean - If the Qdrant database should be reset before indexing. Defaults to `true`.
- Starting Category - String or Null - What arXiv category to start searching from. Defaults to `null` to run from the beginning.
- Starting Order - String or Null - What arXiv order to start searching from. Defaults to `null` to run from the beginning.
- Starting Sort By Mode - String or Null - What arXiv sort-by mode to start searching from. Defaults to `null` to run from the beginning.
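For example, a possible invocation from the repository root, assuming the subproject is started with `dotnet run` (the argument values here are arbitrary; trailing arguments can be omitted to take their defaults):

```bash
# Download 100000 entries, fit between 5 and 10 clusters, and run every stage.
dotnet run --project Builder -- 100000 5 10 true true true true true true
```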
- Hosts the backend server so we can run searches from our client.
- This also contains all Qdrant methods for indexing and searching.
- Will automatically scrape and index more arXiv data.
- This takes a single boolean argument to run the scraping service automatically in the background on repeat. Defaults to `false`.
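For instance, to have the backend scrape on repeat while it serves searches (the `Server` project name is a guess; this README only calls it the backend server):

```bash
# Pass true to run the scraping service automatically in the background.
dotnet run --project Server -- true
```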
- Generates word2vec embeddings.
- Checks classification on word2vec models and two language models.
- `-d` `--directory` - The folder to load data from. Defaults to `arXiv_processed_mitigated`.
- `-s` `--seed` - The seed for random state. Defaults to `42`.
- `-al` `--alpha_low` - The lower bound for word2vec alpha values. Defaults to `0.01`.
- `-au` `--alpha_upper` - The upper bound for word2vec alpha values. Defaults to `0.05`.
- `-as` `--alpha_step` - The step for word2vec alpha values. Defaults to `0.01`.
- `-wl` `--window_low` - The lower bound for word2vec window values. Defaults to `5`.
- `-wu` `--window_upper` - The upper bound for word2vec window values. Defaults to `10`.
- `-ws` `--window_step` - The step for word2vec window values. Defaults to `5`.
- `-nl` `--negative_low` - The lower bound for word2vec negative values. Defaults to `5`.
- `-nu` `--negative_upper` - The upper bound for word2vec negative values. Defaults to `10`.
- `-ns` `--negative_step` - The step for word2vec negative values. Defaults to `5`.
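A possible hyperparameter sweep, assuming the script file is `model_fitting.py` (the name used in the run steps below):

```bash
# Sweep alpha from 0.01 to 0.05 in steps of 0.01, keeping the default window and negative ranges.
python model_fitting.py -d arXiv_processed_mitigated -al 0.01 -au 0.05 -as 0.01
```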
- Plots PCA and t-SNE visualizations for an embeddings file.
- `-e` `--embeddings` - The embeddings file to use. Defaults to `embeddings.txt`.
- `-o` `--output` - The folder to output plots to. Defaults to `Plots`.
- `-p` `--perplexity` - Perplexity for t-SNE. Defaults to `20`.
- `-s` `--size` - The size of the plots. Defaults to `100`.
- `-n` `--no-labels` - Disable labels.
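For example (the file name `plot_embeddings.py` is hypothetical, as this README does not name the script):

```bash
# Plot PCA and t-SNE for the default embeddings file with a t-SNE perplexity of 20.
python plot_embeddings.py -e embeddings.txt -o Plots -p 20 -s 100
```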
- Calculates Heaps' Law and Zipf's Law, and gets the most frequent terms in the corpus.
- `-d` `--directory` - The root folder to run these statistics on. Defaults to `arXiv_processed`.
- `-o` `--output` - The output directory to save to. Defaults to `Text Statistics`.
- `-x` `--width` - The output width. Defaults to `10`.
- `-y` `--height` - The output height. Defaults to `5`.
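For example (the file name `text_statistics.py` is hypothetical). Note the default output directory contains a space, so quote it on the command line:

```bash
# Compute Heaps' Law, Zipf's Law, and frequent-term statistics over the processed corpus.
python text_statistics.py -d arXiv_processed -o "Text Statistics" -x 10 -y 5
```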
- Allows you to make a smaller training set of data if your corpus is very large.
- `-d` `--directory` - The root folder to build a training set for. Defaults to `arXiv`.
- `-s` `--size` - The amount to use for a training set, either a percentage as a float in the range (0, 1] or the number of files to copy as an integer. Defaults to `0.1`.
- `-r` `--seed` - The seed for randomly choosing files. Defaults to `42`.
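For example, to copy a random 10% of the corpus (the file name `build_training_set.py` is hypothetical):

```bash
# Sample 10% of the files under arXiv/ using seed 42.
python build_training_set.py -d arXiv -s 0.1 -r 42
```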
- Ensure Ollama is running.
- Run `Builder` with `NUMBER MIN MAX true false false true false`. See above for what these arguments are.
- Run `model_fitting`.
- From the `Embeddings` folder under `arXiv_processed_mitigated`, copy over an embeddings file to the root of your application and name it `embeddings.txt`.
- Launch Qdrant with Docker with `docker run -p 6334:6334 qdrant/qdrant`.
- Run `Builder` again with `NUMBER MIN MAX false true true false true`.
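Put together, the steps above look roughly like this (`NUMBER MIN MAX` as described earlier; the `dotnet run` invocations and the `model_fitting.py` file name are assumptions):

```bash
dotnet run --project Builder -- NUMBER MIN MAX true false false true false  # mitigate + summarize
python model_fitting.py                                                     # fit word2vec models
cp arXiv_processed_mitigated/Embeddings/<chosen_file> embeddings.txt        # pick one embeddings file
docker run -p 6334:6334 qdrant/qdrant                                       # runs in the foreground
dotnet run --project Builder -- NUMBER MIN MAX false true true false true   # cluster + PageRank + index
```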
arXiv Data Format
- The scraped arXiv data is organized into subfolders based on their primary category.
- Each file is named after its arXiv ID and saved as a text (.txt) file.
- The ID can be used to reach the main (abstract) page of the document, the PDF, or, for newer papers, the experimental HTML page, using the URLs below with "ID" replaced by the file name (less the ".txt").
- Main/abstract page - https://arxiv.org/abs/ID
- PDF - https://arxiv.org/pdf/ID
- HTML (Note that not all documents may have this) - https://arxiv.org/html/ID
- The first line of a text file contains the title of the document.
- The second line contains the abstract.
- The third line has the date and time in the format of "YYYY-DD-MM hh:mm:ss".
- The fourth line has all authors separated by a "|".
- The fifth line has all categories separated by a "|", with the primary category being the first one.
- All text has been preprocessed, ensuring all whitespace has been replaced by single spaces. Additionally, LaTeX/Markdown has been converted to plain text. A sketch of reading these fields is shown after this list.
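A quick sketch of pulling these fields out of one entry in a Unix-like shell (the path and ID here are hypothetical):

```bash
# Read the five metadata lines of a single arXiv entry.
file="arXiv/cs.CL/2301.00001.txt"
{ read -r title; read -r abstract; read -r datetime; read -r authors; read -r categories; } < "$file"
echo "Title: $title"
echo "Date: $datetime"
echo "Authors: $(echo "$authors" | sed 's/|/, /g')"
echo "Primary category: $(echo "$categories" | cut -d'|' -f1)"
```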
Key Terms and Abbreviations
- Our pipeline automatically replaces abbreviations with their full term. For instance, "LLM" automatically becomes "Large Language Model". This ensures the indexing process treats a term and its abbreviation equally.
- Key terms can be found in `terms.txt`. These have the following format: `term|abbreviation1|abbreviation2|...|abbreviationN`. You can have as many abbreviations as you want for a term. Everything is normalized to lowercase in our pipeline.
- Our key terms builder automatically handles plurals. For instance, you only need to have `large language model|llm` and not `large language model|llms` or `large language models|llms`.
- The key terms pass also recognizes and removes the places where an abbreviation is introduced. For instance, a paper will commonly write "Large Language Models (LLMs)" the first time large language models are mentioned. Our pipeline reduces all such instances to just the term, so the indexing process counts the term as written once, not twice, since the second occurrence only introduced the abbreviation to the reader. We make sure to capture all possible plural combinations. Examples are below:
- "Large Language Model (LLM)" becomes "Large Language Model".
- "Large Language Model (LLMs)" becomes "Large Language Model".
- "Large Language Models (LLM)" becomes "Large Language Models".
- "Large Language Models (LLMs)" becomes "Large Language Models".
Stems
- Custom stems can be found in `stems.txt`.