<!-- most of these are from mattermost -->

# 2024 (first exam)

**question**: The M in MAP stands for the mean over all Queries

answer (boolean): False

- in Mean Average Precision (MAP), the "M" stands for "Mean", which refers to the mean of the Average Precision (AP) values calculated over all queries
- i.e. it averages the per-query AP scores, not simply "the mean over all queries"
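
a minimal sketch (my own illustration, not from the lecture) of how AP and MAP could be computed from binary relevance labels in ranked order — note it divides by the number of relevant documents found in the ranking, which assumes all relevant documents appear in it:

```python
# sketch: Average Precision (AP) per query, MAP = mean of the per-query AP values
def average_precision(ranked_relevance):
    """ranked_relevance: 0/1 labels of the ranked results, e.g. [1, 0, 1, 0]"""
    hits, precisions = 0, []
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)  # precision at this relevant position
    return sum(precisions) / max(hits, 1)   # simplification: divides by relevant docs found

def mean_average_precision(all_queries):
    """all_queries: one ranked_relevance list per query"""
    return sum(average_precision(q) for q in all_queries) / len(all_queries)

print(mean_average_precision([[1, 0, 1], [0, 1]]))  # mean over the per-query AP values
```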

---

**question**: MRR only considers the first relevant document

answer (boolean): True

- the reciprocal rank is the inverse of the position of the first relevant doc; everything ranked after it is ignored
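
a minimal sketch (my own illustration) of the reciprocal-rank idea:

```python
# sketch: MRR only looks at the position of the first relevant document
def reciprocal_rank(ranked_relevance):
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            return 1.0 / rank  # everything after the first hit is ignored
    return 0.0

def mean_reciprocal_rank(all_queries):
    return sum(reciprocal_rank(q) for q in all_queries) / len(all_queries)

print(mean_reciprocal_rank([[0, 1, 0], [1, 0, 0]]))  # (1/2 + 1/1) / 2 = 0.75
```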

---

**question**: BERT does not transform Tokens

answer (boolean): False

- BERT = Bidirectional Encoder Representations from Transformers
- it transforms every input token into a contextualized representation
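
a small sketch (using the Hugging Face `transformers` library, not from the lecture notes) showing that BERT does transform tokens into contextual vectors:

```python
# sketch: BERT turns every (sub-)token into a contextualized vector
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("neural ranking models", return_tensors="pt")
outputs = model(**inputs)

# one vector per token, each one depending on the whole input sequence
print(outputs.last_hidden_state.shape)  # e.g. torch.Size([1, 5, 768])
```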

---

**question**: CNNs can form n-grams through the sliding window

answer (boolean): True

- 1D-CNNs are used in NLP by sliding a window of width $n$ over the sequence, so each filter application covers an n-gram (see the sketch below)
- use cases:
    - n-gram representation learning = generating word embeddings as char-n-grams
    - dimensionality reduction = capturing an embedding-n-gram
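
a minimal PyTorch sketch (my own, with made-up dimensions) of a 1D convolution sliding over word embeddings and producing one representation per 3-gram window:

```python
# sketch: a 1D convolution over word embeddings = a sliding n-gram window
import torch
import torch.nn as nn

emb_dim, n = 300, 3                  # embedding size and n-gram width (assumed values)
conv = nn.Conv1d(in_channels=emb_dim, out_channels=128, kernel_size=n)

words = torch.randn(1, emb_dim, 10)  # batch of 1 sequence with 10 word embeddings
ngram_reprs = conv(words)            # one 128-dim vector per 3-gram window
print(ngram_reprs.shape)             # torch.Size([1, 128, 8]) -> 10 - 3 + 1 = 8 windows
```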

---

**question**: IR Models made \[…] totally obsolete

answer (fill in):

- manual searching

---

*paper-specific questions*

based on: https://arxiv.org/pdf/2405.07767

- the authors disclosed the exact prompts they used
- the paper follows the Cranfield paradigm
- the authors used a specific TREC setting to evaluate
- \[insert random collection from paper] is a test collection
- this paper is the first to use LLMs in test collection creation
- how were the different methods correlated? what did the $\tau$ values mean? what metrics were used? (see the sketch below)
- ChatGPT’s queries were on average significantly longer than those of T5
- ChatGPT’s relevance judgements address the information need with far fewer documents
- further experiments showed bias in using LLMs for the generation of test collections
- synthetic test collections are used for documents that are never used by humans
- synthetic queries resulted on average in more relevant documents per query
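
for the $\tau$ question: Kendall's $\tau$ measures how similarly two evaluation setups order the same set of systems (1 = identical ranking, 0 = uncorrelated, -1 = reversed). a minimal sketch with made-up scores, using `scipy`:

```python
# sketch: Kendall's tau between the system rankings produced by two judgment sources
from scipy.stats import kendalltau

# hypothetical MAP scores of the same 5 systems under human vs. LLM-based judgments
human_scores = [0.31, 0.28, 0.45, 0.22, 0.39]
llm_scores = [0.33, 0.25, 0.48, 0.20, 0.41]

tau, p_value = kendalltau(human_scores, llm_scores)
print(tau)  # 1.0 here -> both judgment sources order the systems identically
```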

# 2023

**question**: Recall and nDCG are typically measured at a lower cutoff than MAP and MRR (you don't have to know the exact formula)

answer (boolean): False

- MAP is a more general metric and captures the area under the precision-recall-curve → ~@100-1000
- but MRR, DCG and nDCG measure how far up in the search results the relevant documents are positioned → ~@5-20 (see the nDCG sketch below)
- "… typically, we measure all those metrics at a certain cutoff at "k" of the top retrieved documents. And for MAP and recall this is typically done at 100 and at 1000, whereas for position MRR and nDCG we have a much lower cutoff, so at 5, at 10 or at 20 to kind of get the same experience as users would do." [(source: lectures)](https://github.com/sebastian-hofstaetter/teaching/blob/master/advanced-information-retrieval/Lecture%202%20-%20Closed%20Captions.md#14-ranking-list-evaluation-metrics)

---

**question**: Judgement pairs should use pooling of many diverse system results

answer (boolean): True

- this question refers to the pooling process in labelling, e.g. with Mechanical Turk, where the top results of many diverse systems form the candidate set, reducing the labor required to label your data (see the sketch below)
- "… if we use a diverse pool of different systems, we can then even reuse those pool candidates and this gives us confidence that we have at least some of the relevant results in those pooling results. It allows us to drastically reduce the annotation time compared to conducting millions of annotations by hand." [(source: lectures)](https://github.com/sebastian-hofstaetter/teaching/blob/master/advanced-information-retrieval/Lecture%203%20-%20Closed%20Captions.md#26-pooling-in-information-retrieval)
| 84 | + |
| 85 | +# 2022 |
| 86 | + |
| 87 | +**question**: Test collections should be statistically significant |
| 88 | + |
| 89 | +answer (boolean): False |
| 90 | + |
| 91 | +- the systems/models we build with the test-collections should be when we compare them, but not the test-collections themselves. |
| 92 | +- statistical significance tests are used to verify that the observed differences between systems/models are not due to chance. |
| 93 | +- "… we test whether two systems produce different rankings that are not different just by chance \[…]. Our hypothesis is that those systems are the same and now we test via a statistical significance test on a per-query basis" [(source: lectures)](https://github.com/sebastian-hofstaetter/teaching/blob/master/advanced-information-retrieval/Lecture%202%20-%20Closed%20Captions.md#29-statistical-significance-i) |

---

**question**: The quality of a test collection is measured with the inter-annotator agreement

answer (boolean): False

- the degree of agreement among raters measures the quality of the labels/annotators, not the quality of the test collection as a whole (see the sketch below)
- "We can measure the label quality of annotators based on their inter-annotation agreement" [(source: lectures)](https://github.com/sebastian-hofstaetter/teaching/blob/master/advanced-information-retrieval/Lecture%203%20-%20Closed%20Captions.md#25-evaluate-annotation-quality)
- see: https://en.wikipedia.org/wiki/Inter-rater_reliability
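
a minimal sketch (my own, with made-up judgments) of one common agreement measure, Cohen's kappa:

```python
# sketch: inter-annotator agreement of two annotators on the same documents
from sklearn.metrics import cohen_kappa_score

annotator_1 = [1, 0, 1, 1, 0, 1, 0, 0]  # hypothetical binary relevance judgments
annotator_2 = [1, 0, 1, 0, 0, 1, 0, 1]

print(cohen_kappa_score(annotator_1, annotator_2))  # 1 = perfect agreement, ~0 = chance level
```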

---

**question**: A word-1-gram that we use when training Word2Vec is also considered as a word-n-gram

answer (boolean): False

- 1-gram ≠ n-gram
- word2vec generates a single embedding for each word by learning to either guess the word from its surroundings (CBOW) or the surroundings from the word (skip-gram)
- you do have to train it with more than a single word and pass in a window size, but the unit it learns an embedding for is always a 1-gram / unigram (see the sketch below)
- don't confuse this with CNNs that generate n-gram representations (a single embedding for n words)
- "1-Word-1-Vector type of class which includes Word2Vec" [(source: lectures)](https://github.com/sebastian-hofstaetter/teaching/blob/master/advanced-information-retrieval/Lecture%204%20-%20Closed%20Captions.md#10-word-embeddings)
- see: https://radimrehurek.com/gensim/models/word2vec.html#usage-examples
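
a minimal sketch following the linked gensim usage examples (toy corpus, assumed parameters) — each individual word gets exactly one vector:

```python
# sketch: Word2Vec learns one vector per word (unigram); the window is only used for training
from gensim.models import Word2Vec

sentences = [["neural", "ranking", "models"], ["ranking", "with", "word", "embeddings"]]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)  # sg=1 -> skip-gram

print(model.wv["ranking"].shape)  # (50,) -- a single embedding for the unigram "ranking"
```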

---

**question**: ColBERTer achieves state-of-the-art performance on the MS Marco dev set

answer (boolean): True

- "When trained on MS MARCO Passage Ranking, ColBERTv2 achieves the highest MRR@10 of any standalone retriever." [(source: paper)](https://arxiv.org/abs/2112.01488)

---

**question**: "Assign each retrieval model its advantage of desirable properties for a retrieval model compared to the other models."

models:

- bert-cat
- tk
- colbert
- bert-dot

attributes:

- effectivity
- memory hog
- effort moved to indexing
- transformers combined with kernel-pooling

answer (assign attributes to sentences):

| model name | effectivity | latency | memory footprint | note                                                                          |
| ---------- | ----------- | ------- | ---------------- | ----------------------------------------------------------------------------- |
| bert-cat   | 1.00        | 950ms   | 10.4GB           | vanilla-bert and t5 are the slowest and most accurate                          |
| preTTR     | 0.97        | 445ms   | 10.9GB           | precomputes $n$ layers of bert for each doc                                    |
| colBERT    | 0.97        | 28ms    | 3.4GB            | precomputes representations for each doc                                       |
| bert-dot   | 0.87        | 23ms    | 3.6GB            | uses cosine similarity instead of a linear layer                               |
| tk         | 0.89        | 14ms    | 1.8GB            | limits number of transformer layers and context, then applies kernel pooling   |

above is a table from the lecture slides (with some added notes) that the following answers are based on:

- bert-cat ← effectivity
    - vanilla-bert and T5 are the most effective models, against which all others are benchmarked
- tk ← transformers combined with kernel-pooling
    - can have further optimizations such as TKL or TK-Sparse
- colbert ← memory hog
    - uses the most memory in practice (not really reflected by the table above), because we store the full contextualized representations of every document
- bert-dot ← effort moved to indexing
    - documents are encoded at indexing time and used with a nearest-neighbor-index, so at query time we only need a cosine similarity / dot product instead of a linear layer or max-pooling (see the sketch below)
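
a minimal sketch (my own, with random vectors) of the bert-dot idea: document vectors are precomputed at indexing time, and query-time scoring is just a dot product / nearest-neighbor lookup:

```python
# sketch: bert-dot style retrieval -- the heavy encoding work is moved to indexing time
import numpy as np

# indexing time: encode every document once (stand-in for precomputed BERT CLS vectors)
doc_vectors = np.random.randn(1000, 768)

# query time: encode the query once, then score all docs with a simple dot product
query_vector = np.random.randn(768)
scores = doc_vectors @ query_vector
top10 = np.argsort(-scores)[:10]  # in practice: an approximate nearest-neighbor index (e.g. FAISS)
print(top10)
```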

---

*course papers from that year:*

- ColBERTer was particularly popular: https://arxiv.org/pdf/2203.13088
- https://dl.acm.org/doi/pdf/10.1145/3269206.3271719
- https://discovery.ucl.ac.uk/id/eprint/10119400/1/Mitigating_the_Position_Bias_of_Transformer_Based_Contextualization_for_Passage_Re_Ranking.pdf
- https://ecai2020.eu/papers/82_paper.pdf

# 2019

**question**: What are the differences between MatchPyramid and KNRM?

answer (open question):

- both are neural re-ranking models
- interpretability:
    - KNRM: kernel-based models can be less interpretable, as the learned ranking function operates in a high-dimensional feature space
    - MatchPyramid: hierarchical convolutional layers provide some interpretability, as the model learns to capture matching patterns at different levels of hierarchy
- efficiency:
    - both scale linearly with the document length, but KNRM is faster because it's simpler [(source: paper)](https://www.ijcai.org/proceedings/2019/0758.pdf)
    - "the KNRM model is very fast, so it's definitely by far the fastest model we're talking about today, and in this course as a whole. And on its own, it has roughly the same effectiveness as MatchPyramid." [(source: lectures)](https://github.com/sebastian-hofstaetter/teaching/blob/master/advanced-information-retrieval/Lecture%207%20-%20Closed%20Captions.md)
- robustness against small vocabularies:
    - "MatchPyramid and KNRM suffer if they have small vocabularies \[… but.] if you then use FastText, you get better results overall, for all models except for KNRM were the results are quite on par." [(source: lectures)](https://github.com/sebastian-hofstaetter/teaching/blob/master/advanced-information-retrieval/Lecture%207%20-%20Closed%20Captions.md#44-effect-of-the-fixed-vocabulary-size)
- architectures:
    - MatchPyramid – hierarchical pattern extraction
        - i. compute the 2D match-matrix of all query-doc cosine-similarities
        - ii. apply a series of convolutional kernels and pooling layers on the match-matrix
        - iii. compute the final score with a neural net
    - KNRM – kernel-based approach that counts the amount of different similarities between query and doc (see the kernel-pooling sketch below)
        - i. (if it's a convolutional-KNRM) use CNNs to generate word-n-gram embeddings
        - ii. compute the 2D match-matrix of all query-doc cosine-similarities
        - iii. apply radial-basis-function kernels, summed along the document dimension
        - iv. compute the final score with a neural net
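
a condensed PyTorch sketch (my own simplification) of KNRM-style kernel pooling on a query-document cosine match-matrix:

```python
# sketch: KNRM kernel pooling -- soft-count how many similarities fall near each kernel mean
import torch

def kernel_pooling(match_matrix, mus, sigma=0.1):
    """match_matrix: (query_len, doc_len) cosine similarities"""
    m = match_matrix.unsqueeze(-1)                              # (q, d, 1)
    kernels = torch.exp(-((m - mus) ** 2) / (2 * sigma ** 2))   # RBF kernel per similarity bin
    per_query_term = kernels.sum(dim=1)                         # sum along the document dimension
    features = torch.log(per_query_term.clamp(min=1e-10)).sum(dim=0)  # log, then sum over query terms
    return features                                             # one feature per kernel -> final linear scoring layer

mus = torch.linspace(-1, 1, steps=11)      # kernel means spread over the similarity range [-1, 1]
match = torch.rand(4, 30) * 2 - 1          # toy match-matrix: 4 query terms vs. 30 doc terms
print(kernel_pooling(match, mus).shape)    # torch.Size([11])
```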

---

**question**: What would the precision-recall-curve of an ideal re-ranker look like?

answer (open question):

- "often, there is an inverse relationship between precision and recall" [(source: wikipedia)](https://en.wikipedia.org/wiki/Precision_and_recall#:~:text=Often%2C%20there%20is%20an%20inverse,illustrative%20example%20of%20the%20tradeoff)
- improving recall (completeness) typically comes at the cost of reduced precision (correctness), because you're likelier to make more mistakes as you retrieve more data
- usually we see high precision at low recall, gradually decreasing as recall increases. and after all relevant documents have been retrieved we have diminishing returns and a sharp drop in precision
- so ideally we'd like to have perfect precision until all relevant documents have been retrieved, and only then drop vertically to 0 precision (see the sketch below)
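
a tiny sketch (my own) that computes the precision/recall points of an ideal ranking, where all relevant documents come first:

```python
# sketch: precision-recall points for an ideal ranking (all relevant docs ranked first)
def pr_points(ranked_relevance, total_relevant):
    hits, points = 0, []
    for rank, rel in enumerate(ranked_relevance, start=1):
        hits += rel
        points.append((hits / total_relevant, hits / rank))  # (recall, precision)
    return points

ideal = [1, 1, 1, 0, 0, 0]  # 3 relevant docs, all at the top
print(pr_points(ideal, total_relevant=3))
# precision stays at 1.0 until recall reaches 1.0, then only falls as non-relevant docs follow
```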

---

**question**: Why are low-frequency words an issue for information retrieval but not so much for other tasks like information categorization?

answer (open question):

- information retrieval = deciding the relevance of docs to a query
    - in neural IR: "(1) The model performs poorly on previously unseen terms appearing at retrieval time (OOV terms). (2) Due to the lack of training data for low-frequency terms, the learned vectors may not be semantically robust." [(source: hofstätter's paper)](https://arxiv.org/pdf/1904.12683)
    - traditional IR models, on the other hand, handle them via exact matching with TF-IDF, where rare terms simply get a high IDF weight (see the sketch below)
- classification = deciding which of a set of predefined categories a document belongs to
    - we're capturing the general theme or topic of a document, so the presence / absence of rare words has less influence on the overall decision
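
a small illustration (my own toy numbers) of that IDF point — the rarer the term, the higher its weight:

```python
# sketch: rare terms get a high IDF weight, so exact-match models can still exploit them
import math

def idf(doc_freq, num_docs):
    return math.log(num_docs / doc_freq)

num_docs = 1_000_000
print(idf(doc_freq=500_000, num_docs=num_docs))  # frequent term -> ~0.7
print(idf(doc_freq=5, num_docs=num_docs))        # rare term     -> ~12.2
```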