<!-- most of these are from mattermost -->

# 2024 (first exam)

**question**: The M in MAP stands for the mean over all Queries

answer (boolean): False

- in Mean Average Precision (MAP), the "M" stands for "Mean", which refers to the mean of the Average Precision (AP) values calculated over all queries
- i.e. it averages the per-query AP scores, not simply "the mean over all queries"
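
a minimal sketch (my own illustration, not from the lecture) of how AP and MAP could be computed from binary relevance labels in ranked order — note it divides by the number of relevant documents found in the ranking, which assumes all relevant documents appear in it:

```python
# sketch: Average Precision (AP) per query, MAP = mean of the per-query AP values
def average_precision(ranked_relevance):
    """ranked_relevance: 0/1 labels of the ranked results, e.g. [1, 0, 1, 0]"""
    hits, precisions = 0, []
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)  # precision at this relevant position
    return sum(precisions) / max(hits, 1)   # simplification: divides by relevant docs found

def mean_average_precision(all_queries):
    """all_queries: one ranked_relevance list per query"""
    return sum(average_precision(q) for q in all_queries) / len(all_queries)

print(mean_average_precision([[1, 0, 1], [0, 1]]))  # mean over the per-query AP values
```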

---

**question**: MRR only considers the first relevant document

answer (boolean): True

- the reciprocal rank is the inverse of the position of the first relevant doc; everything ranked after it is ignored
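
a minimal sketch (my own illustration) of the reciprocal-rank idea:

```python
# sketch: MRR only looks at the position of the first relevant document
def reciprocal_rank(ranked_relevance):
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            return 1.0 / rank  # everything after the first hit is ignored
    return 0.0

def mean_reciprocal_rank(all_queries):
    return sum(reciprocal_rank(q) for q in all_queries) / len(all_queries)

print(mean_reciprocal_rank([[0, 1, 0], [1, 0, 0]]))  # (1/2 + 1/1) / 2 = 0.75
```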

---

**question**: BERT does not transform Tokens

answer (boolean): False

- BERT = Bidirectional Encoder Representations from Transformers
- it transforms every input token into a contextualized representation
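
a small sketch (using the Hugging Face `transformers` library, not from the lecture notes) showing that BERT does transform tokens into contextual vectors:

```python
# sketch: BERT turns every (sub-)token into a contextualized vector
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("neural ranking models", return_tensors="pt")
outputs = model(**inputs)

# one vector per token, each one depending on the whole input sequence
print(outputs.last_hidden_state.shape)  # e.g. torch.Size([1, 5, 768])
```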

---

**question**: CNNs can form n-grams through the sliding window

answer (boolean): True

- 1D-CNNs are used in NLP by sliding a window of width $n$ over the sequence, so each filter application covers an n-gram (see the sketch below)
- use cases:
    - n-gram representation learning = generating word embeddings as char-n-grams
    - dimensionality reduction = capturing an embedding-n-gram
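
a minimal PyTorch sketch (my own, with made-up dimensions) of a 1D convolution sliding over word embeddings and producing one representation per 3-gram window:

```python
# sketch: a 1D convolution over word embeddings = a sliding n-gram window
import torch
import torch.nn as nn

emb_dim, n = 300, 3                  # embedding size and n-gram width (assumed values)
conv = nn.Conv1d(in_channels=emb_dim, out_channels=128, kernel_size=n)

words = torch.randn(1, emb_dim, 10)  # batch of 1 sequence with 10 word embeddings
ngram_reprs = conv(words)            # one 128-dim vector per 3-gram window
print(ngram_reprs.shape)             # torch.Size([1, 128, 8]) -> 10 - 3 + 1 = 8 windows
```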

---

**question**: IR Models made \[…] totally obsolete

answer (fill in):

- manual searching

---

*paper-specific questions*

based on: https://arxiv.org/pdf/2405.07767

- the authors disclosed the exact prompts they used
- the paper follows the Cranfield paradigm
- the authors used a specific TREC setting to evaluate
- \[insert random collection from paper] is a test collection
- this paper is the first to use LLMs in test collection creation
- how were the different methods correlated? what did the $\tau$ values mean? what metrics were used? (see the sketch below)
- ChatGPT’s queries were on average significantly longer than those of T5
- ChatGPT’s relevance judgements address the information need with far fewer documents
- further experiments showed bias in using LLMs for the generation of test collections
- synthetic test collections are used for documents that are never used by humans
- synthetic queries resulted on average in more relevant documents per query
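
for the $\tau$ question: Kendall's $\tau$ measures how similarly two evaluation setups order the same set of systems (1 = identical ranking, 0 = uncorrelated, -1 = reversed). a minimal sketch with made-up scores, using `scipy`:

```python
# sketch: Kendall's tau between the system rankings produced by two judgment sources
from scipy.stats import kendalltau

# hypothetical MAP scores of the same 5 systems under human vs. LLM-based judgments
human_scores = [0.31, 0.28, 0.45, 0.22, 0.39]
llm_scores = [0.33, 0.25, 0.48, 0.20, 0.41]

tau, p_value = kendalltau(human_scores, llm_scores)
print(tau)  # 1.0 here -> both judgment sources order the systems identically
```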

# 2023

**question**: Recall and nDCG are typically measured at a lower cutoff than MAP and MRR (you don't have to know the exact formula)

answer (boolean): False

- MAP is a more general metric and captures the area under the precision-recall-curve → ~@100-1000
- but MRR, DCG and nDCG measure how far up in the search results the relevant documents are positioned → ~@5-20 (see the nDCG sketch below)
- "… typically, we measure all those metrics at a certain cutoff at "k" of the top retrieved documents. And for MAP and recall this is typically done at 100 and at 1000, whereas for position MRR and nDCG we have a much lower cutoff, so at 5, at 10 or at 20 to kind of get the same experience as users would do." [(source: lectures)](https://github.com/sebastian-hofstaetter/teaching/blob/master/advanced-information-retrieval/Lecture%202%20-%20Closed%20Captions.md#14-ranking-list-evaluation-metrics)

---

**question**: Judgement pairs should use pooling of many diverse system results

answer (boolean): True

- this question refers to the pooling process in labelling, e.g. with Mechanical Turk, where the top results of many diverse systems form the candidate set, reducing the labor required to label your data (see the sketch below)
- "… if we use a diverse pool of different systems, we can then even reuse those pool candidates and this gives us confidence that we have at least some of the relevant results in those pooling results. It allows us to drastically reduce the annotation time compared to conducting millions of annotations by hand." [(source: lectures)](https://github.com/sebastian-hofstaetter/teaching/blob/master/advanced-information-retrieval/Lecture%203%20-%20Closed%20Captions.md#26-pooling-in-information-retrieval)
| 84 | + |
| 85 | +# 2022 |
| 86 | + |
| 87 | +**question**: Test collections should be statistically significant |
| 88 | + |
| 89 | +answer (boolean): False |
| 90 | + |
| 91 | +- the systems/models we build with the test-collections should be when we compare them, but not the test-collections themselves. |
| 92 | +- statistical significance tests are used to verify that the observed differences between systems/models are not due to chance. |
| 93 | +- "… we test whether two systems produce different rankings that are not different just by chance \[…]. Our hypothesis is that those systems are the same and now we test via a statistical significance test on a per-query basis" [(source: lectures)](https://github.com/sebastian-hofstaetter/teaching/blob/master/advanced-information-retrieval/Lecture%202%20-%20Closed%20Captions.md#29-statistical-significance-i) |

---

**question**: The quality of a test collection is measured with the inter-annotator agreement

answer (boolean): False

- the degree of agreement among raters measures the quality of the labels/annotators, not the quality of the test collection as a whole (see the sketch below)
- "We can measure the label quality of annotators based on their inter-annotation agreement" [(source: lectures)](https://github.com/sebastian-hofstaetter/teaching/blob/master/advanced-information-retrieval/Lecture%203%20-%20Closed%20Captions.md#25-evaluate-annotation-quality)
- see: https://en.wikipedia.org/wiki/Inter-rater_reliability
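
a minimal sketch (my own, with made-up judgments) of one common agreement measure, Cohen's kappa:

```python
# sketch: inter-annotator agreement of two annotators on the same documents
from sklearn.metrics import cohen_kappa_score

annotator_1 = [1, 0, 1, 1, 0, 1, 0, 0]  # hypothetical binary relevance judgments
annotator_2 = [1, 0, 1, 0, 0, 1, 0, 1]

print(cohen_kappa_score(annotator_1, annotator_2))  # 1 = perfect agreement, ~0 = chance level
```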

---

**question**: A word-1-gram that we use when training Word2Vec is also considered as a word-n-gram

answer (boolean): False

- 1-gram ≠ n-gram
- word2vec generates a single embedding for each word by learning to either guess the word from its surroundings (CBOW) or the surroundings from the word (skip-gram)
- you do have to train it with more than a single word and pass in a window size, but the unit it learns an embedding for is always a 1-gram / unigram (see the sketch below)
- don't confuse this with CNNs that generate n-gram representations (a single embedding for n words)
- "1-Word-1-Vector type of class which includes Word2Vec" [(source: lectures)](https://github.com/sebastian-hofstaetter/teaching/blob/master/advanced-information-retrieval/Lecture%204%20-%20Closed%20Captions.md#10-word-embeddings)
- see: https://radimrehurek.com/gensim/models/word2vec.html#usage-examples
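
a minimal sketch following the linked gensim usage examples (toy corpus, assumed parameters) — each individual word gets exactly one vector:

```python
# sketch: Word2Vec learns one vector per word (unigram); the window is only used for training
from gensim.models import Word2Vec

sentences = [["neural", "ranking", "models"], ["ranking", "with", "word", "embeddings"]]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)  # sg=1 -> skip-gram

print(model.wv["ranking"].shape)  # (50,) -- a single embedding for the unigram "ranking"
```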

---

**question**: ColBERTer achieves state-of-the-art performance on the MS Marco dev set

answer (boolean): True

- "When trained on MS MARCO Passage Ranking, ColBERTv2 achieves the highest MRR@10 of any standalone retriever." [(source: paper)](https://arxiv.org/abs/2112.01488)

---

**question**: "Assign each retrieval model its advantage of desirable properties for a retrieval model compared to the other models."

models:

- bert-cat
- tk
- colbert
- bert-dot

attributes:

- effectivity
- memory hog
- effort moved to indexing
- transformers combined with kernel-pooling

answer (assign attributes to sentences):

| model name | effectivity | latency | memory footprint | note                                                                          |
| ---------- | ----------- | ------- | ---------------- | ----------------------------------------------------------------------------- |
| bert-cat   | 1.00        | 950ms   | 10.4GB           | vanilla-bert and t5 are the slowest and most accurate                          |
| preTTR     | 0.97        | 445ms   | 10.9GB           | precomputes $n$ layers of bert for each doc                                    |
| colBERT    | 0.97        | 28ms    | 3.4GB            | precomputes representations for each doc                                       |
| bert-dot   | 0.87        | 23ms    | 3.6GB            | uses cosine similarity instead of a linear layer                               |
| tk         | 0.89        | 14ms    | 1.8GB            | limits number of transformer layers and context, then applies kernel pooling   |

above is a table from the lecture slides (with some added notes) that the following answers are based on:

- bert-cat ← effectivity
    - vanilla-bert and T5 are the most effective models, against which all others are benchmarked
- tk ← transformers combined with kernel-pooling
    - can have further optimizations such as TKL or TK-Sparse
- colbert ← memory hog
    - uses the most memory in practice (not really reflected by the table above), because we store the full contextualized representations of every document
- bert-dot ← effort moved to indexing
    - documents are encoded at indexing time and used with a nearest-neighbor-index, so at query time we only need a cosine similarity / dot product instead of a linear layer or max-pooling (see the sketch below)
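
a minimal sketch (my own, with random vectors) of the bert-dot idea: document vectors are precomputed at indexing time, and query-time scoring is just a dot product / nearest-neighbor lookup:

```python
# sketch: bert-dot style retrieval -- the heavy encoding work is moved to indexing time
import numpy as np

# indexing time: encode every document once (stand-in for precomputed BERT CLS vectors)
doc_vectors = np.random.randn(1000, 768)

# query time: encode the query once, then score all docs with a simple dot product
query_vector = np.random.randn(768)
scores = doc_vectors @ query_vector
top10 = np.argsort(-scores)[:10]  # in practice: an approximate nearest-neighbor index (e.g. FAISS)
print(top10)
```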

---

*course papers from that year:*

- ColBERTer was particularly popular: https://arxiv.org/pdf/2203.13088
- https://dl.acm.org/doi/pdf/10.1145/3269206.3271719
- https://discovery.ucl.ac.uk/id/eprint/10119400/1/Mitigating_the_Position_Bias_of_Transformer_Based_Contextualization_for_Passage_Re_Ranking.pdf
- https://ecai2020.eu/papers/82_paper.pdf

# 2019

**question**: What are the differences between MatchPyramid and KNRM?

answer (open question):

- both are neural re-ranking models
- interpretability:
    - KNRM: kernel-based models can be less interpretable, as the learned ranking function operates in a high-dimensional feature space
    - MatchPyramid: hierarchical convolutional layers provide some interpretability, as the model learns to capture matching patterns at different levels of hierarchy
- efficiency:
    - both scale linearly with the document length, but KNRM is faster because it's simpler [(source: paper)](https://www.ijcai.org/proceedings/2019/0758.pdf)
    - "the KNRM model is very fast, so it's definitely by far the fastest model we're talking about today, and in this course as a whole. And on its own, it has roughly the same effectiveness as MatchPyramid." [(source: lectures)](https://github.com/sebastian-hofstaetter/teaching/blob/master/advanced-information-retrieval/Lecture%207%20-%20Closed%20Captions.md)
- robustness against small vocabularies:
    - "MatchPyramid and KNRM suffer if they have small vocabularies \[… but.] if you then use FastText, you get better results overall, for all models except for KNRM were the results are quite on par." [(source: lectures)](https://github.com/sebastian-hofstaetter/teaching/blob/master/advanced-information-retrieval/Lecture%207%20-%20Closed%20Captions.md#44-effect-of-the-fixed-vocabulary-size)
- architectures:
    - MatchPyramid – hierarchical pattern extraction
        - i. compute the 2D match-matrix of all query-doc cosine-similarities
        - ii. apply a series of convolutional kernels and pooling layers on the match-matrix
        - iii. compute the final score with a neural net
    - KNRM – kernel-based approach that counts the amount of different similarities between query and doc (see the kernel-pooling sketch below)
        - i. (if it's a convolutional-KNRM) use CNNs to generate word-n-gram embeddings
        - ii. compute the 2D match-matrix of all query-doc cosine-similarities
        - iii. apply radial-basis-function kernels, summed along the document dimension
        - iv. compute the final score with a neural net
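
a condensed PyTorch sketch (my own simplification) of KNRM-style kernel pooling on a query-document cosine match-matrix:

```python
# sketch: KNRM kernel pooling -- soft-count how many similarities fall near each kernel mean
import torch

def kernel_pooling(match_matrix, mus, sigma=0.1):
    """match_matrix: (query_len, doc_len) cosine similarities"""
    m = match_matrix.unsqueeze(-1)                              # (q, d, 1)
    kernels = torch.exp(-((m - mus) ** 2) / (2 * sigma ** 2))   # RBF kernel per similarity bin
    per_query_term = kernels.sum(dim=1)                         # sum along the document dimension
    features = torch.log(per_query_term.clamp(min=1e-10)).sum(dim=0)  # log, then sum over query terms
    return features                                             # one feature per kernel -> final linear scoring layer

mus = torch.linspace(-1, 1, steps=11)      # kernel means spread over the similarity range [-1, 1]
match = torch.rand(4, 30) * 2 - 1          # toy match-matrix: 4 query terms vs. 30 doc terms
print(kernel_pooling(match, mus).shape)    # torch.Size([11])
```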

---

**question**: What would the precision-recall-curve of an ideal re-ranker look like?

answer (open question):

- "often, there is an inverse relationship between precision and recall" [(source: wikipedia)](https://en.wikipedia.org/wiki/Precision_and_recall#:~:text=Often%2C%20there%20is%20an%20inverse,illustrative%20example%20of%20the%20tradeoff)
- improving recall (completeness) typically comes at the cost of reduced precision (correctness), because you're likelier to make more mistakes as you retrieve more data
- usually we see high precision at low recall, gradually decreasing as recall increases. and after all relevant documents have been retrieved we have diminishing returns and a sharp drop in precision
- so ideally we'd like to have perfect precision until all relevant documents have been retrieved, and only then drop vertically to 0 precision (see the sketch below)
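
a tiny sketch (my own) that computes the precision/recall points of an ideal ranking, where all relevant documents come first:

```python
# sketch: precision-recall points for an ideal ranking (all relevant docs ranked first)
def pr_points(ranked_relevance, total_relevant):
    hits, points = 0, []
    for rank, rel in enumerate(ranked_relevance, start=1):
        hits += rel
        points.append((hits / total_relevant, hits / rank))  # (recall, precision)
    return points

ideal = [1, 1, 1, 0, 0, 0]  # 3 relevant docs, all at the top
print(pr_points(ideal, total_relevant=3))
# precision stays at 1.0 until recall reaches 1.0, then only falls as non-relevant docs follow
```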

---

**question**: Why are low-frequency words an issue for information retrieval but not so much for other tasks like information categorization?

answer (open question):

- information retrieval = deciding the relevance of docs to a query
    - in neural IR: "(1) The model performs poorly on previously unseen terms appearing at retrieval time (OOV terms). (2) Due to the lack of training data for low-frequency terms, the learned vectors may not be semantically robust." [(source: hofstätter's paper)](https://arxiv.org/pdf/1904.12683)
    - traditional IR models, on the other hand, handle them via exact matching with TF-IDF, where rare terms simply get a high IDF weight (see the sketch below)
- classification = deciding which of a set of predefined categories a document belongs to
    - we're capturing the general theme or topic of a document, so the presence / absence of rare words has less influence on the overall decision
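
a small illustration (my own toy numbers) of that IDF point — the rarer the term, the higher its weight:

```python
# sketch: rare terms get a high IDF weight, so exact-match models can still exploit them
import math

def idf(doc_freq, num_docs):
    return math.log(num_docs / doc_freq)

num_docs = 1_000_000
print(idf(doc_freq=500_000, num_docs=num_docs))  # frequent term -> ~0.7
print(idf(doc_freq=5, num_docs=num_docs))        # rare term     -> ~12.2
```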