<!-- most of these are from mattermost -->

# 2024 (first exam)

**question**: The M in MAP stands for the mean over all Queries

answer (boolean): False

- in the context of Mean Average Precision (MAP), the "M" stands for "Mean", which refers to the mean of the Average Precision (AP) values calculated over all queries
- so MAP averages the per-query AP values; it is not literally just "the mean over all queries", because the quantity being averaged is the Average Precision (see the sketch below)
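
a minimal sketch, assuming plain Python and made-up binary relevance lists, of how the per-query AP values turn into MAP:

```python
def average_precision(ranked_relevance):
    """AP for one query: mean of precision@k over the ranks k that hold a relevant doc
    (assuming every relevant doc of the query shows up in the ranking)."""
    hits, precisions = 0, []
    for k, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / max(hits, 1)

# hypothetical binary relevance labels of the ranked results for three queries
runs = [[1, 0, 1, 0], [0, 1, 0, 0], [1, 1, 0, 1]]
aps = [average_precision(run) for run in runs]
map_score = sum(aps) / len(aps)  # the "M": the mean of the per-query AP values
```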
---

**question**: MRR only considers the first relevant document

answer (boolean): True

- it takes the reciprocal of the rank of the first relevant document per query and averages that over all queries (see the sketch below)
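
a tiny sketch of that averaging, with made-up ranks:

```python
# rank of the first relevant document for three hypothetical queries
first_relevant_ranks = [1, 3, 2]
mrr = sum(1 / r for r in first_relevant_ranks) / len(first_relevant_ranks)  # (1 + 1/3 + 1/2) / 3
```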
---

**question**: BERT does not transform Tokens

answer (boolean): False

- bert = bidirectional encoder representations from transformers
- it generates contextual representations from tokens
---

**question**: CNNs can form n-grams through the sliding window

answer (boolean): True

- 1D-CNNs are applied in NLP by sliding a window (the convolution kernel) over the sequence (see the sketch below)
- use cases:
    - n-gram representation learning = generating word embeddings as char-n-grams
    - dimensionality reduction = capturing an n-gram of embeddings in a single vector
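
a minimal sketch of the sliding-window idea, assuming PyTorch: a `Conv1d` with `kernel_size=3` mixes every window of 3 consecutive token embeddings into one output vector, i.e. a word-3-gram representation.

```python
import torch
import torch.nn as nn

emb_dim, ngram_dim, seq_len = 300, 128, 10
tokens = torch.randn(1, seq_len, emb_dim)           # (batch, tokens, embedding dim), random stand-in embeddings

conv = nn.Conv1d(in_channels=emb_dim, out_channels=ngram_dim, kernel_size=3, padding=1)
ngram_reprs = conv(tokens.transpose(1, 2)).transpose(1, 2)  # Conv1d expects (batch, channels, tokens)
print(ngram_reprs.shape)                            # torch.Size([1, 10, 128]): one 3-gram vector per position
```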
---

**question**: IR Models made \[] totally obsolete

answer (fill in):

- manual searching
---

*paper-specific questions*

based on: https://arxiv.org/pdf/2405.07767

- the authors disclosed the exact prompts they used
- the paper follows the Cranfield paradigm
- the authors used a specific TREC setting to evaluate
- \[insert random collection from paper] is a test collection
- this paper is the first to use LLMs in test collection creation
- how were the different methods correlated? what did the $\tau$ values mean? what metrics were used?
- ChatGPT's queries were on average significantly longer than those of T5
- ChatGPT's relevance judgements address the information need with far fewer documents
- further experiments showed bias in using LLMs for the generation of test collections
- synthetic test collections are used for documents that are never used by humans
- synthetic queries resulted on average in more relevant documents per query
# 2023

**question**: Recall and nDCG are typically measured at a lower cutoff than MAP and MRR (you don't have to know the exact formula)

answer (boolean): False

- MAP is a more general metric and captures the area under the precision-recall curve → measured at ~@100-1000, and the same goes for recall
- but MRR, DCG and nDCG measure how far up in the ranking the relevant documents are positioned → measured at ~@5-20 (see the nDCG@k sketch below)
- so the statement groups them the wrong way: recall goes with MAP at the high cutoff, nDCG goes with MRR at the low one
- "… typically, we measure all those metrics at a certain cutoff at "k" of the top retrieved documents. And for MAP and recall this is typically done at 100 and at 1000, whereas for position MRR and nDCG we have a much lower cutoff, so at 5, at 10 or at 20 to kind of get the same experience as users would do." [(source: lectures)](https://github.com/sebastian-hofstaetter/teaching/blob/master/advanced-information-retrieval/Lecture%202%20-%20Closed%20Captions.md#14-ranking-list-evaluation-metrics)
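
a minimal sketch, in plain Python with made-up graded relevance labels, of DCG/nDCG at a cutoff k:

```python
import math

def dcg_at_k(gains, k):
    """DCG@k over graded relevance gains in ranked order."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

ranked_gains = [3, 0, 2, 1, 0]                    # hypothetical graded relevance of the ranked results
ideal_gains = sorted(ranked_gains, reverse=True)  # best possible ordering of the same labels
ndcg_at_10 = dcg_at_k(ranked_gains, 10) / dcg_at_k(ideal_gains, 10)
```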
---

**question**: Judgement pairs should use pooling of many diverse system results

answer (boolean): True

- this refers to the pooling process in labeling (e.g. with mechanical turk): the top-k results of many diverse systems are merged into a candidate pool, which drastically reduces the labor required to label your data (see the sketch below)
- "… if we use a diverse pool of different systems, we can then even reuse those pool candidates and this gives us confidence that we have at least some of the relevant results in those pooling results. It allows us to drastically reduce the annotation time compared to conducting millions of annotations by hand." [(source: lectures)](https://github.com/sebastian-hofstaetter/teaching/blob/master/advanced-information-retrieval/Lecture%203%20-%20Closed%20Captions.md#26-pooling-in-information-retrieval)
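
a minimal sketch of pooling, assuming each system's run is just an ordered list of doc ids for one query:

```python
# hypothetical ranked runs from three diverse systems for the same query
runs = {
    "bm25":     ["d1", "d7", "d3", "d9", "d4"],
    "bert-dot": ["d7", "d2", "d1", "d8", "d5"],
    "colbert":  ["d2", "d7", "d6", "d1", "d3"],
}
k = 3
pool = sorted({doc for run in runs.values() for doc in run[:k]})
# only this pooled candidate set goes to the annotators instead of the full collection
print(pool)  # ['d1', 'd2', 'd3', 'd6', 'd7']
```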
# 2022

**question**: Test collections should be statistically significant

answer (boolean): False

- statistical significance applies to the comparison of the systems/models we evaluate on a test collection, not to the test collection itself
- statistical significance tests are used to verify that the observed differences between systems/models are not due to chance (see the sketch below)
- "… we test whether two systems produce different rankings that are not different just by chance \[]. Our hypothesis is that those systems are the same and now we test via a statistical significance test on a per-query basis" [(source: lectures)](https://github.com/sebastian-hofstaetter/teaching/blob/master/advanced-information-retrieval/Lecture%202%20-%20Closed%20Captions.md#29-statistical-significance-i)
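
a minimal sketch of such a per-query significance test, assuming scipy and two hypothetical lists of per-query nDCG scores:

```python
from scipy.stats import ttest_rel

# hypothetical per-query nDCG@10 of two systems on the same queries
system_a = [0.41, 0.55, 0.38, 0.62, 0.47, 0.59, 0.33, 0.51]
system_b = [0.45, 0.58, 0.37, 0.70, 0.52, 0.61, 0.40, 0.55]

stat, p_value = ttest_rel(system_a, system_b)  # paired test: the same queries are compared
print(p_value < 0.05)  # reject "both systems are the same" only if the p-value is small enough
```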
---

**question**: The quality of a test collection is measured with the inter-annotator agreement

answer (boolean): False

- the degree of agreement among raters ≠ test-collection quality: agreement measures the consistency of the annotators' labels, not the collection itself (a common agreement measure is Cohen's kappa, sketched below)
- "We can measure the label quality of annotators based on their inter-annotation agreement" [(source: lectures)](https://github.com/sebastian-hofstaetter/teaching/blob/master/advanced-information-retrieval/Lecture%203%20-%20Closed%20Captions.md#25-evaluate-annotation-quality)
- see: https://en.wikipedia.org/wiki/Inter-rater_reliability
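
a minimal sketch of Cohen's kappa, assuming scikit-learn and two hypothetical annotators labeling the same query-document pairs:

```python
from sklearn.metrics import cohen_kappa_score

# hypothetical binary relevance labels from two annotators for the same query-document pairs
annotator_1 = [1, 0, 1, 1, 0, 0, 1, 0]
annotator_2 = [1, 0, 1, 0, 0, 0, 1, 1]

kappa = cohen_kappa_score(annotator_1, annotator_2)  # 1.0 = perfect agreement, 0 = chance-level agreement
```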
---

**question**: A word-1-gram that we use when training Word2Vec is also considered as a word-n-gram

answer (boolean): False

- 1-gram ≠ n-gram
- word2vec generates a single embedding for each word by learning to either guess the word from its surroundings or the other way around.
- you do have to train it on more than a single word and pass in a window size, but the unit it learns an embedding for is always a 1-gram / unigram.
- don't confuse this with CNNs that generate n-gram representations (a single embedding for n words)
- "1-Word-1-Vector type of class which includes Word2Vec" [(source: lectures)](https://github.com/sebastian-hofstaetter/teaching/blob/master/advanced-information-retrieval/Lecture%204%20-%20Closed%20Captions.md#10-word-embeddings)
- see: https://radimrehurek.com/gensim/models/word2vec.html#usage-examples (and the sketch below)
---
119+
120+
**question**: ColBERTer achieves state-of-the-art performance on the MS Marco dev set
121+
122+
answer (boolean): True
123+
124+
- "When trained on MS MARCO Passage Ranking, ColBERTv2 achieves the highest MRR@10 of any standalone retriever." [(source: paper)](https://arxiv.org/abs/2112.01488)
125+
126+
---

**question**: "Assign each retrieval model its advantage of desirable properties for a retrieval model compared to the other models."

models:

- bert-cat
- tk
- colbert
- bert-dot

attributes:

- effectivity
- memory hog
- effort moved to indexing
- transformers combined with kernel-pooling

answer (assign attributes to models):

| model name | effectiveness | latency | memory footprint | note                                                                          |
| ---------- | ------------- | ------- | ---------------- | ----------------------------------------------------------------------------- |
| bert-cat   | 1.00          | 950ms   | 10.4GB           | vanilla-bert and t5 are the slowest and most accurate                          |
| preTTR     | 0.97          | 445ms   | 10.9GB           | precomputes $n$ layers of bert for each doc                                    |
| colBERT    | 0.97          | 28ms    | 3.4GB            | precomputes representations for each doc                                       |
| bert-dot   | 0.87          | 23ms    | 3.6GB            | uses cosine similarity instead of a linear layer                               |
| tk         | 0.89          | 14ms    | 1.8GB            | limits the number of transformer layers and context, then applies kernel pooling |

above is a table from the lecture slides with some notes that the following answers are based on:

- bert-cat ← effectivity
    - vanilla-bert and T5 are the most effective models, against which all others are benchmarked
- tk ← transformers combined with kernel-pooling
    - can have a bunch of other optimizations, as in TKL or TK-Sparse
- colbert ← memory hog
    - uses the most memory in practice (i don't know why this doesn't align with the table above), because we're storing the entire contextualized representation of every document
- bert-dot ← effort moved to indexing
    - used in combination with a nearest-neighbor index, so we can just use the cosine similarity instead of a linear layer or max-pooling
---

*course papers from that year:*

- ColBERTer was particularly popular: https://arxiv.org/pdf/2203.13088
- https://dl.acm.org/doi/pdf/10.1145/3269206.3271719
- https://discovery.ucl.ac.uk/id/eprint/10119400/1/Mitigating_the_Position_Bias_of_Transformer_Based_Contextualization_for_Passage_Re_Ranking.pdf
- https://ecai2020.eu/papers/82_paper.pdf
# 2019

**question**: What are the differences between MatchPyramid and KNRM?

answer (open question):

- both are neural re-ranking models
- interpretability:
    - kernel-based models can be less interpretable, as the learned ranking function operates in a high-dimensional feature space
    - matchpyramid's hierarchical convolutional layers provide some interpretability, as the model learns to capture matching patterns at different levels of the hierarchy
- efficiency:
    - both scale linearly with the document length, but KNRM is faster because it's simpler [(source: paper)](https://www.ijcai.org/proceedings/2019/0758.pdf)
    - "the KNRM model is very fast, so it's definitely by far the fastest model we're talking about today, and in this course as a whole. And on its own, it has roughly the same effectiveness as MatchPyramid." [(source: lectures)](https://github.com/sebastian-hofstaetter/teaching/blob/master/advanced-information-retrieval/Lecture%207%20-%20Closed%20Captions.md)
- robustness against small vocabularies:
    - "MatchPyramid and KNRM suffer if they have small vocabularies \[… but.] if you then use FastText, you get better results overall, for all models except for KNRM were the results are quite on par." [(source: lectures)](https://github.com/sebastian-hofstaetter/teaching/blob/master/advanced-information-retrieval/Lecture%207%20-%20Closed%20Captions.md#44-effect-of-the-fixed-vocabulary-size)
- architectures (see the kernel-pooling sketch after this list):
    - matchpyramid – hierarchical pattern extraction
        - i. compute the 2D match-matrix of all query-doc cosine similarities
        - ii. apply a series of convolutional kernels and pooling layers on the match-matrix
        - iii. compute the final score with a neural net
    - knrm – kernel-based approach that soft-counts how many similarities of each level occur between query and doc
        - i. (if it's a convolutional KNRM) use CNNs to generate word-n-gram embeddings
        - ii. compute the 2D match-matrix of all query-doc cosine similarities
        - iii. apply the radial-basis-function kernels, summed along the document dimension
        - iv. compute the final score with a neural net
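
a minimal sketch of KNRM-style kernel pooling, assuming numpy and a random stand-in match-matrix of cosine similarities (the kernel centers and sigma are illustrative, not the paper's exact settings):

```python
import numpy as np

q_len, d_len, n_kernels = 5, 40, 11
match_matrix = np.random.uniform(-1, 1, size=(q_len, d_len))  # stand-in for query-doc cosine similarities

mus = np.linspace(-1.0, 1.0, n_kernels)  # kernel centers spread over the similarity range
sigma = 0.1
# RBF kernels: soft-count how many doc terms fall near each similarity level, per query term
kernel_scores = np.exp(-((match_matrix[..., None] - mus) ** 2) / (2 * sigma ** 2))
per_query_counts = kernel_scores.sum(axis=1)       # sum along the document dimension
features = np.log1p(per_query_counts).sum(axis=0)  # log, then sum over query terms → n_kernels features
# a final linear layer (not shown) maps these features to the relevance score
```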
---

**question**: What would the precision-recall curve of an ideal re-ranker look like?

answer (open question):

- "often, there is an inverse relationship between precision and recall" [(source: wikipedia)](https://en.wikipedia.org/wiki/Precision_and_recall#:~:text=Often%2C%20there%20is%20an%20inverse,illustrative%20example%20of%20the%20tradeoff)
- improving recall (completeness) typically comes at the cost of reduced precision (correctness), because you're likelier to make more mistakes as you retrieve more data.
- usually we see high precision at low recall, gradually decreasing as recall increases; once all relevant documents have been retrieved there are only diminishing returns and a sharp drop in precision.
- so ideally we'd like perfect precision (1.0) until all relevant documents have been retrieved, i.e. up to recall 1.0, and only then a vertical drop in precision (see the sketch below).
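
a minimal sketch tracing that curve for a hypothetical perfect ranking (all relevant documents placed first):

```python
# a perfect re-ranker puts all R relevant documents at the top of the list
ranking = [1, 1, 1, 0, 0, 0, 0, 0]       # 1 = relevant, R = 3 relevant docs in total
R = sum(ranking)

points = []
hits = 0
for k, rel in enumerate(ranking, start=1):
    hits += rel
    points.append((hits / R, hits / k))  # (recall@k, precision@k)
print(points)  # precision stays 1.0 until recall reaches 1.0, then falls: (1/3, 1.0), (2/3, 1.0), (1.0, 1.0), (1.0, 0.75), ...
```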
---

**question**: Why are low-frequency words an issue for information retrieval but not so much for other tasks like information categorization?

answer (open question):

- information retrieval = deciding the relevance of docs to a query, where rare query terms often carry most of the signal
- in neural IR: "(1) The model performs poorly on previously unseen terms appearing at retrieval time (OOV terms). (2) Due to the lack of training data for low-frequency terms, the learned vectors may not be semantically robust." [(source: hofstätter's paper)](https://arxiv.org/pdf/1904.12683)
- traditional IR models handle this with TF-IDF style weighting, where rare terms simply get high weights
- classification = deciding whether a document belongs to a set of predefined categories
- there we're capturing the general theme or topic of a document, so the presence / absence of rare words has less influence on the overall decision.

---

> Ma, Xueguang, Tommaso Teofili, and Jimmy Lin. "Anserini Gets Dense Retrieval: Integration of Lucene's HNSW Indexes." Proceedings of the 32nd ACM International Conference on Information and Knowledge Management. 2023.

anserini wraps lucene.

pyserini wraps lucene and faiss, so it provides both vector search and inverted indexes.

but it turns out that lucene (since a recent release) can also support hnsw indexes for dense vector search, in addition to inverted indexes, while being reasonably effective. there's a catch though: due to weird design choices in lucene, using its hnsw indexes requires models trained with cosine similarity.

this paper:

1. demonstrates that lucene can be used for both dense and sparse vectors
2. compares lucene and faiss for dense vectors

---

## introduction

_the retrieval problem:_

- = given an information need expressed as a query $q$, the text retrieval task is to return a ranked list of $k$ texts $d_1, d_2, \dots, d_k$ from an arbitrarily large but finite collection of texts $C = \{d_i\}$ that maximizes a metric of interest, for example nDCG, AP, etc.
- information retrieval is about searching information.
- in most contexts, "ranking" and "retrieval" mean the same thing.
- alternative names: the search problem, the information retrieval problem, the text ranking problem, the top-k document retrieval problem.

_definitions:_

- search query: representation of an information need.
- document collection / corpus: collection of documents with unique ids.
- ranking: ranked list of document ids, based on "relevance", which can be binary or graded on a scale.

_the evaluation problem:_

- "relevance judgments" / "qrels" are the known-correct query results used to benchmark retrieval systems.
- the "metric" quantifies the quality of a single query result.

_ir stages:_

1. training phase: the system gets trained on a dataset.
2. indexing phase: the indexer takes the document collection and builds an "index", a data structure that supports fast reads.
3. retrieval / search phase: the system returns a ranked list based on the query.
## bi-encoders

_data representation:_

- sparse representation vectors:
    - have mostly zero values (which we leave out) and only a few non-zero values.
    - unable to capture semantic relationships.
    - we use inverted indexes for them.
- dense representation vectors:
    - mostly contain non-zero values.
    - commonly embedding vectors (dense vector representations generated by learned encoders called "transformers").
    - the dimensions also capture some "latent semantic space".
    - we use hierarchical navigable small-world network (hnsw) indexes, e.g. from the faiss library.
- embedding vectors: output of learned encoders called "transformers".
- representations can be unsupervised/heuristic or supervised/learned.

_components:_

- document encoder: takes a document, returns a document representation.
- query encoder: takes a query, returns a query representation.
- comparison function: takes the 2 representations, returns a relevance score / query-document score.
    - e.g.: an n-dimensional float vector as the representation, the dot product as the score.
    - e.g.: k-nearest neighbor search in vector space (see the sketch below).
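
a minimal sketch of this bi-encoder pattern, assuming numpy and stand-in encoders that just return random vectors (a real system would use learned transformers):

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8

def encode_document(doc: str) -> np.ndarray:
    """stand-in document encoder: a real one would be a learned transformer"""
    return rng.normal(size=DIM)

def encode_query(query: str) -> np.ndarray:
    """stand-in query encoder"""
    return rng.normal(size=DIM)

docs = ["doc about lucene", "doc about faiss", "doc about bm25"]
doc_vectors = np.stack([encode_document(d) for d in docs])  # computed once, at indexing time

q = encode_query("vector search in lucene")
scores = doc_vectors @ q                 # comparison function: dot product
top_k = np.argsort(-scores)[:2]          # k-nearest neighbors by score (brute force here)
print([docs[i] for i in top_k])
```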
_example: BM25_

- representation: sparse vectors, no semantics, based on a heuristic.
- document encoder: bm25 encoding is a "bag of words" (sparse lexical vectors with term weights).
- query encoder: a (multi-)hot vector of the query terms.
- comparison function: top-k retrieval from a probabilistic model of how frequently words occur (see the usage sketch below).
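
a usage sketch of this BM25 setup, assuming pyserini's `LuceneSearcher` API and the prebuilt `msmarco-v1-passage` index name from the linked docs:

```python
from pyserini.search.lucene import LuceneSearcher

# prebuilt index name as used in the pyserini documentation (downloaded on first use)
searcher = LuceneSearcher.from_prebuilt_index("msmarco-v1-passage")
hits = searcher.search("what is a lobster roll", k=10)  # BM25 is the default scoring model

for hit in hits[:3]:
    print(hit.docid, round(hit.score, 2))
```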
## source

- https://github.com/castorini/anserini/blob/master/docs/start-here.md
- https://github.com/castorini/anserini/blob/master/docs/reproducibility.md
- https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-passage.md
- https://github.com/castorini/pyserini/blob/master/docs/conceptual-framework.md
- https://github.com/castorini/pyserini/blob/master/docs/experiments-nfcorpus.md
- https://github.com/castorini/pyserini/blob/master/docs/conceptual-framework2.md

---

> Xian, Jasper, et al. "Vector search with OpenAI embeddings: Lucene is all you need." Proceedings of the 17th ACM International Conference on Web Search and Data Mining. 2024.
>
> critique: https://news.ycombinator.com/item?id=37373635

**search problem in vector space:** given the query embedding, the system's task is to rapidly retrieve the top-k passage embeddings with the largest dot products.
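
a minimal brute-force sketch of that top-k problem, assuming numpy and random stand-in embeddings (HNSW indexes exist to approximate exactly this without scanning every passage):

```python
import numpy as np

rng = np.random.default_rng(0)
passages = rng.normal(size=(10_000, 1536))  # stand-in passage embeddings (1536 dims, as in OpenAI's ada-002)
query = rng.normal(size=1536)               # stand-in query embedding

k = 5
scores = passages @ query                   # dot product against every passage
top_k = np.argpartition(-scores, k)[:k]     # indices of the k largest dot products (unsorted)
top_k = top_k[np.argsort(-scores[top_k])]   # sort those k by score
```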

**assumption:** top-k retrieval on sparse vectors and on dense vectors / embeddings requires separate, dedicated vector stores that provide the operations around HNSW (k-nearest neighbor search in vector space) for generative AI applications.

- there have been many recent vector stores (pinecone, weaviate, chroma, milvus, qdrant, …).
- these systems are very convenient because, in addition to crud operations, they also handle nearest-neighbor search.

**observation:** this assumption is wrong. state-of-the-art vector search for generative AI does not require any AI-specific implementations; providing the operations around HNSW indexes does not require a separate vector store.

- we should build upon existing infrastructure: companies have already invested a lot of money into the lucene ecosystem (elasticsearch, opensearch, and solr) for sparse retrieval models.
- lucene already has HNSW built in. it offers the same feature set, just less performant and less convenient.
- embeddings can be computed with simple API calls (encoding as a service).
- indexing and searching dense vectors is conceptually identical to indexing and searching text with bag-of-words models that have been available for decades.

**but there's a catch:** the more mature software hasn't quite caught up yet.

- it's janky: lucene doesn't officially support this setup yet; getting it to work was a hack.
- it's slow: lucene achieves only around half the query throughput of faiss under comparable settings.
---

also interesting: postgres support

- https://www.crunchydata.com/blog/hnsw-indexes-with-postgres-and-pgvector
- https://neon.tech/blog/pg-embedding-extension-for-vector-search
- https://github.com/neondatabase/pg_embedding
