As we look into adding automatic topic classification/topic modeling to pol.is conversations, we have mostly focused on single-topic models. See for example #1866 and references therein. For example, the BERTopic library (Grootendorst 2022) uses embeddings and then clusters them with a range of clustering algorithms, from the simplest K-means to HDBSCAN (Campello, Moulavi, and Sander 2013). Even evaluations of topic models, such as the V-measure of (Rosenberg and Hirschberg 2007) or the various metrics in the OCTIS library (Terragni et al. 2021) and Gensim (Řehůřek and Sojka 2010), seem -- unless I am mistaken -- to only consider non-overlapping clusters, i.e. a single cluster per data point.
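To illustrate the single-topic assumption baked into that kind of pipeline, here is a minimal sketch (assuming the `bertopic` and `scikit-learn` packages; the comment list and cluster count are placeholders):

```python
# Minimal single-topic BERTopic sketch. Each comment ends up in exactly one
# topic (or the outlier topic -1 when HDBSCAN is used) -- the single-label
# assumption discussed above.
from bertopic import BERTopic
from sklearn.cluster import KMeans

# Placeholder comments; the default UMAP/HDBSCAN steps need a decent sample size.
comments = [
    "Fix the potholes on Main Street",
    "More funding for public parks",
    "Expand the bus network downtown",
    "The library needs longer opening hours",
] * 25

# Default pipeline: sentence-transformer embeddings + UMAP + HDBSCAN.
topic_model = BERTopic()
# Or swap HDBSCAN for plain K-means (n_clusters is an arbitrary placeholder):
# topic_model = BERTopic(hdbscan_model=KMeans(n_clusters=5))

topics, probs = topic_model.fit_transform(comments)  # one topic id per comment
```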
This makes sense, as it is simpler to get started with. And, naively, when first approaching pol.is, I imagined that, due to their brevity, individual comments would be unlikely to surface multiple topics. This was a deeply flawed assumption.
Analyzing data from past conversations for which we have human-provided topic classifications, in particular Bowling Green 2018 and West Begroot 2022, we see that single-topic comments are the exception rather than the norm. Most comments are labelled with between 2 and 4 topics. On paper, Bowling Green 2018 only presents multi-topic labels at the subtopic level: each comment was assigned to a single top-level topic. However, those top-level topics are very imbalanced, with “Quality of Life” containing almost two-thirds of the comments and covering very diverse sub-topics -- see figure below. I am still working on computing deeper summary statistics, in particular double-checking that, as I suspect, those subtopics do not follow a hierarchical structure but are indeed a set of independent topics. West Begroot 2022 has no subtopics, just pure multi-topics.
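As a sketch of the kind of summary statistic I mean, here is a minimal example with a hypothetical per-comment label layout (the actual labelled files may be shaped differently):

```python
import pandas as pd

# Hypothetical layout: one row per comment with its list of human-provided topics
# (made-up data; the real labelled files may differ).
labels = pd.DataFrame({
    "comment_id": [1, 2, 3, 4],
    "topics": [
        ["Quality of Life"],
        ["Quality of Life", "Transportation"],
        ["Education", "Jobs", "Quality of Life"],
        ["Parks", "Quality of Life"],
    ],
})

# Distribution of how many topics each comment carries (1, 2, 3, ...).
print(labels["topics"].apply(len).value_counts().sort_index())
# Topic imbalance, e.g. "Quality of Life" dominating.
print(labels["topics"].explode().value_counts())
```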
I need to do a proper literature search, but off the top of my head, diverse directions we could look into:
There *are* algorithms for multi-topic modelling, especially for large documents. It remains to be seen how well those apply to short comments, which are a different ballgame, as highlighted by Colin, and by Richard in [Topic Models] Evaluating topic models #1866.
If we want to work with embeddings (which are convenient for short text), I suspect there might be standard algorithms allowing for multilabel clustering (the same way there are algorithms for multilabel classification), but I would need to do a deeper dive.
I imagine latent-variable algorithms, such as Latent Dirichlet Allocation (D. Blei, Ng, and Jordan 2001; D. M. Blei, Ng, and Jordan 2003), can probably be extended to multilabel latents (or a set of latents), thanks to the flexibility of hierarchical models. I imagine this has probably been done already.
And of course, for algorithms that do a first “Topic Discovery” step over the whole set of comments (e.g. LLM-based), followed by a comment-per-comment “Topic Assignment” step assigning each comment to one of the discovered topics, we can likely turn that last step into a multilabel classification, for which there are multiple known algorithms (see the sketch after this list). Nicely, the LLM-based Jigsaw SenseMaker (‘Jigsaw-Code/Sensemaking-Tools’ [2024] 2025) takes precisely that latter approach to support multiple topics per comment, as well as multiple sub-topics if needed: https://github.com/Jigsaw-Code/sensemaking-tools/blob/b792fac30d83ea484469502d8d398bc007b690f1/src/tasks/categorization.ts#L103-L127
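To make the multilabel “Topic Assignment” idea concrete, here is a minimal embedding-based sketch (this is not what SenseMaker does, which prompts an LLM; the model name and threshold below are placeholder assumptions that would need tuning against labelled data):

```python
# Sketch: assign each comment every discovered topic whose embedding similarity
# exceeds a threshold, yielding zero, one, or several topics per comment.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

discovered_topics = ["Transportation", "Parks and recreation", "Public safety"]
comments = ["Safer bike lanes would let kids ride to the park"]

topic_emb = model.encode(discovered_topics)
comment_emb = model.encode(comments)

similarity = cosine_similarity(comment_emb, topic_emb)  # shape: (n_comments, n_topics)

THRESHOLD = 0.3  # placeholder; would need tuning
for comment, sims in zip(comments, similarity):
    assigned = [t for t, s in zip(discovered_topics, sims) if s >= THRESHOLD]
    print(comment, "->", assigned)
```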
TLDR: we will need to support multiple topics per comment, and a proper literature review is needed :-) The nice part is that most of the “simple” evaluations / sanity checks (coverage, precision, recall) we come up with for single-topic do apply to multi-topic (see the sketch below). Of course, more elaborate metrics (such as the V-measure) will need more thinking.
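For instance, per-comment precision/recall over multi-topic labels is directly supported by scikit-learn's multilabel metrics; a minimal sketch with made-up labels:

```python
# Sketch: per-comment precision/recall for multi-topic labels (made-up data).
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import precision_score, recall_score

human = [["Transportation", "Parks"], ["Housing"], ["Parks"]]
model = [["Transportation"], ["Housing", "Parks"], ["Parks"]]

mlb = MultiLabelBinarizer()
mlb.fit(human + model)
y_true = mlb.transform(human)
y_pred = mlb.transform(model)

# average="samples": compute precision/recall per comment, then average.
print("precision:", precision_score(y_true, y_pred, average="samples", zero_division=0))
print("recall:   ", recall_score(y_true, y_pred, average="samples", zero_division=0))
```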
Note on the labelled data: these manually-labelled topics were provided to me by @DZNarayanan . I do not believe they are in our open data repository yet. I do have a converter to the JSON format Colin and Tim are using as input in their experiments with topic reporting. However, we should probably hold off on publishing that data, as we do not have much of it and we might want to keep it outside of LLM training sets if we want to eval LLMs.
References:
Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003. ‘Latent Dirichlet Allocation’. Journal of Machine Learning Research 3 (Jan): 993–1022.
Campello, Ricardo J. G. B., Davoud Moulavi, and Joerg Sander. 2013. ‘Density-Based Clustering Based on Hierarchical Density Estimates’. In Advances in Knowledge Discovery and Data Mining, edited by Jian Pei, Vincent S. Tseng, Longbing Cao, Hiroshi Motoda, and Guandong Xu, 160–72. Berlin, Heidelberg: Springer. https://doi.org/10.1007/978-3-642-37456-2_14.
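Grootendorst, Maarten. 2022. ‘BERTopic: Neural Topic Modeling with a Class-Based TF-IDF Procedure’. arXiv. https://doi.org/10.48550/arXiv.2203.05794.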
Řehůřek, Radim, and Petr Sojka. 2010. ‘Software Framework for Topic Modelling with Large Corpora’. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, 45–50. Valletta, Malta: ELRA.
Rosenberg, Andrew, and Julia Hirschberg. 2007. ‘V-Measure: A Conditional Entropy-Based External Cluster Evaluation Measure’. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), edited by Jason Eisner, 410–20. Prague, Czech Republic: Association for Computational Linguistics. https://aclanthology.org/D07-1043.
Terragni, Silvia, Elisabetta Fersini, Bruno Giovanni Galuzzi, Pietro Tropeano, and Antonio Candelieri. 2021. ‘OCTIS: Comparing and Optimizing Topic Models Is Simple!’ In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, edited by Dimitra Gkatzia and Djamé Seddah, 263–70. Online: Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.eacl-demos.31.