[Topic Models] Support multiple topics per comment #1877

Open · jucor opened this issue Jan 21, 2025 · 0 comments
Labels: feature-request (For new feature suggestions)


As we look into adding automatic topic classification / topic modeling to pol.is conversations, we have mostly focused on single-topic models; see for example #1866 and references therein. The BERTopic library (Grootendorst 2022), for instance, embeds comments and then clusters the embeddings with a range of clustering algorithms, from the simplest K-means to HDBSCAN (Campello, Moulavi, and Sander 2013). Even evaluations of topic models, such as the V-measure of Rosenberg and Hirschberg (2007) or the various metrics in the OCTIS library (Terragni et al. 2021) and Gensim (Řehůřek and Sojka 2010), seem -- unless I am mistaken -- to focus only on non-overlapping clusters, i.e. a single cluster per data point.
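To make the contrast concrete, here is a minimal sketch of that embed-then-hard-cluster pipeline, in the spirit of BERTopic but skipping its UMAP reduction and class-based TF-IDF steps. It assumes the `sentence-transformers` and `hdbscan` libraries; the model name, hyperparameters, and toy comments are illustrative placeholders, not our implementation.

```python
# Minimal single-topic sketch: embed comments, then hard-cluster the
# embeddings so each comment receives exactly one cluster label
# (-1 means "noise" for HDBSCAN).
from sentence_transformers import SentenceTransformer
import hdbscan

comments = [
    "The park needs better lighting at night.",
    "More bike lanes would reduce downtown traffic.",
    "Street lighting and bike safety go hand in hand.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
embeddings = embedder.encode(comments)

clusterer = hdbscan.HDBSCAN(min_cluster_size=2)
labels = clusterer.fit_predict(embeddings)  # one integer label per comment

for label, comment in zip(labels, comments):
    print(label, comment)
```

Note how the third comment, which arguably touches both lighting and cycling, is still forced into a single cluster (or flagged as noise): exactly the limitation discussed below.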

This single-topic focus makes sense, as it is simpler to get started with. And, naively, when first approaching pol.is, I imagined that, due to their brevity, single comments would be unlikely to surface multiple topics. That assumption turned out to be deeply flawed.

Analyzing data from past conversations for which we have human-provided topic classifications, in particular Bowling Green 2018 and West Begroot 2022, we see that single-topic comments are the exception rather than the norm: most comments are labelled with between 2 and 4 topics. On paper, Bowling Green 2018 only presents multiple topics at the subtopic level: each comment was assigned to a single top-level topic. However, those top-level topics are very imbalanced, with “Quality of Life” containing almost two-thirds of the comments and covering very diverse subtopics -- see the figure below. I am still working on computing deeper summary statistics, in particular double-checking that, as I suspect, those subtopics do not follow a hierarchical structure but are indeed a set of independent topics. West Begroot 2022 has no subtopics, just pure multi-topic labels.

[Figure: distribution of Bowling Green 2018 comments across top-level topics and their subtopics, showing “Quality of Life” dominating.]

I still need to do a proper literature search, but off the top of my head, here are some directions we could look into:

  • There *are* algorithms for multi-topic modelling, especially for large documents. It remains to be seen how well they apply to short comments, which are a different ballgame, as highlighted by Colin, and by Richard in [Topic Models] Evaluating topic models #1866.
  • If we want to work with embeddings (which are convenient for short text), I suspect there might be standard algorithms allowing for multilabel clustering (the same way there are algorithms for multilabel classification), but I would need to do a deeper dive.
  • Latent-variable algorithms, such as Latent Dirichlet Allocation (D. Blei, Ng, and Jordan 2001; D. M. Blei, Ng, and Jordan 2003), can probably be extended to multilabel latents (or a set of latents), thanks to the flexibility of hierarchical models; I imagine this has probably been done already. In fact, since LDA already infers a topic *mixture* per document, simply thresholding that mixture yields multiple topics -- see the first sketch after this list.
  • And of course, for algorithms that do a first “Topic Discovery” step over the whole set of comments (e.g. LLM-based), followed by a comment-per-comment “Topic Assignment” step assigning each comment to one of the discovered topics, we can likely turn the assignment step into a multilabel classification, for which there are multiple known algorithms -- see the second sketch after this list. Nicely, the LLM-based Jigsaw SenseMaker (‘Jigsaw-Code/Sensemaking-Tools’ [2024] 2025) takes precisely that latter approach to support multiple topics per comment, as well as multiple sub-topics if needed: https://github.com/Jigsaw-Code/sensemaking-tools/blob/b792fac30d83ea484469502d8d398bc007b690f1/src/tasks/categorization.ts#L103-L127
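
First sketch (for the LDA bullet above): LDA already infers a mixture of topics per document, so a hedged way to get multiple topics per comment is to threshold the per-document topic distribution instead of taking its argmax. This assumes scikit-learn; the number of topics, the threshold, and the toy comments are illustrative placeholders.

```python
# Multi-topic sketch via LDA: threshold the per-document topic
# distribution rather than taking the single most probable topic.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

comments = [
    "Fix the potholes on Main Street and add a crosswalk",
    "We need more affordable housing near the park",
    "Housing prices and street repairs both need attention",
    "The park playground needs new equipment",
    "Affordable housing should be the city's top priority",
]
counts = CountVectorizer(stop_words="english").fit_transform(comments)

lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topics = lda.fit_transform(counts)  # each row is P(topic | comment)

THRESHOLD = 0.2  # illustrative assumption
multi_labels = [
    [topic for topic, p in enumerate(row) if p >= THRESHOLD]
    for row in doc_topics
]
print(multi_labels)  # one *list* of topic indices per comment
```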
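
Second sketch (for the “Topic Assignment” bullet): once topics have been discovered, the assignment step can be recast as standard multilabel classification. This assumes scikit-learn and a small set of human-labelled comments to train on; the features, classifier, and topic names are illustrative placeholders.

```python
# Multilabel "Topic Assignment" sketch: one-vs-rest classification over
# a binary topic-indicator matrix, so each comment can get 0..n topics.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

train_comments = [
    "Fix the potholes on Main Street and add a crosswalk",
    "We need more affordable housing near the park",
    "Housing costs push people away from walkable streets",
]
train_topics = [
    ["infrastructure"],
    ["housing", "parks"],
    ["housing", "infrastructure"],
]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(train_topics)  # binary matrix, one column per topic

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(train_comments)

clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)

new = vectorizer.transform(["Repave the bike path through the park"])
print(mlb.inverse_transform(clf.predict(new)))  # zero or more topics
```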

TLDR: we will need to support multiple topics per comment, and a proper literature review is needed :-) The nice part is that most of the “simple” evaluations / sanity checks (coverage, precision, recall) we come up with for single-topic models do apply to multi-topic ones. Of course, more elaborate metrics (such as the V-measure) will need more thinking.
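
For instance, here is a minimal sketch of how per-comment precision and recall carry over once topic assignments are binarized. It assumes scikit-learn; the human and model labels below are made up for illustration.

```python
# Precision / recall over multi-topic labels: binarize the label sets,
# then average the usual scores per comment (average="samples").
from sklearn.metrics import precision_score, recall_score
from sklearn.preprocessing import MultiLabelBinarizer

human = [["housing", "parks"], ["transit"], ["housing"]]
model = [["housing"], ["transit", "parks"], ["housing"]]

mlb = MultiLabelBinarizer().fit(human + model)
y_true, y_pred = mlb.transform(human), mlb.transform(model)

print(precision_score(y_true, y_pred, average="samples"))  # 0.833...
print(recall_score(y_true, y_pred, average="samples"))     # 0.833...
```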

Note on the labelled data: these manually-labelled topics were provided to me by @DZNarayanan. I do not believe they are in our open-data repository yet. I do have a converter to the JSON format that Colin and Tim are using as input in their experiments with topic reporting. However, we should probably hold off on publishing that data: we do not have much of it, and we might want to keep it out of LLM training sets if we want to use it to evaluate LLMs.

References:

  • Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003. ‘Latent Dirichlet Allocation’. Journal of Machine Learning Research 3 (Jan): 993–1022.
  • Blei, David, Andrew Ng, and Michael Jordan. 2001. ‘Latent Dirichlet Allocation’. Advances in Neural Information Processing Systems 14. https://proceedings.neurips.cc/paper/2001/hash/296472c9542ad4d4788d543508116cbc-Abstract.html.
  • Campello, Ricardo J. G. B., Davoud Moulavi, and Joerg Sander. 2013. ‘Density-Based Clustering Based on Hierarchical Density Estimates’. In Advances in Knowledge Discovery and Data Mining, edited by Jian Pei, Vincent S. Tseng, Longbing Cao, Hiroshi Motoda, and Guandong Xu, 160–72. Berlin, Heidelberg: Springer. https://doi.org/10.1007/978-3-642-37456-2_14.
  • Grootendorst, Maarten. 2022. ‘BERTopic: Neural Topic Modeling with a Class-Based TF-IDF Procedure’. arXiv. https://doi.org/10.48550/arXiv.2203.05794.
  • ‘Jigsaw-Code/Sensemaking-Tools’. (2024) 2025. TypeScript. Jigsaw. https://github.com/Jigsaw-Code/sensemaking-tools.
  • Řehůřek, Radim, and Petr Sojka. 2010. ‘Software Framework for Topic Modelling with Large Corpora’. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, 45–50. Valletta, Malta: ELRA.
  • Rosenberg, Andrew, and Julia Hirschberg. 2007. ‘V-Measure: A Conditional Entropy-Based External Cluster Evaluation Measure’. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), edited by Jason Eisner, 410–20. Prague, Czech Republic: Association for Computational Linguistics. https://aclanthology.org/D07-1043.
  • Terragni, Silvia, Elisabetta Fersini, Bruno Giovanni Galuzzi, Pietro Tropeano, and Antonio Candelieri. 2021. ‘OCTIS: Comparing and Optimizing Topic Models Is Simple!’ In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, edited by Dimitra Gkatzia and Djamé Seddah, 263–70. Online: Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.eacl-demos.31.