The goal of this repository is to create embeddings for YouTube channels. These embeddings can be used as-is for content similarity, and can also be used to extract social dimensions.
We propose three types of embeddings:
- Social Sharing / Reddit embedding: made using shares of YouTube videos on Reddit (using Pushshift data).
- Content embedding: made from video titles and descriptions, fed through a Sentence Transformer.
- Recommendation embedding: made by recording the recommendations YouTube provides to a history-less user, then computing a node embedding.
The embeddings for our filtered set of 40K channels are available in the `embeds/` folder.
Similarly, the social dimensions are available in the `dims/` folder.
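As an illustration of using the embeddings as-is for content similarity, here is a minimal sketch that ranks channels by cosine similarity. It assumes the embeddings have already been loaded into a NumPy array with one row per channel. The function name `nearest_channels`, the channel IDs, and the toy vectors are hypothetical and do not reflect the actual file layout of `embeds/`.

```python
import numpy as np


def nearest_channels(embeddings, channel_ids, query_id, k=3):
    """Return the k channels most similar to query_id by cosine similarity."""
    # Normalise rows so that a dot product equals cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)

    query_index = channel_ids.index(query_id)
    scores = unit @ unit[query_index]

    # Exclude the query itself, then take the k highest scores.
    scores[query_index] = -np.inf
    top = np.argsort(-scores)[:k]
    return [(channel_ids[i], float(scores[i])) for i in top]


# Toy data standing in for a real embedding matrix (hypothetical values).
channel_ids = ["chan_a", "chan_b", "chan_c", "chan_d"]
embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [-1.0, 0.0, 0.0],
])

neighbours = nearest_channels(embeddings, channel_ids, "chan_a", k=2)
print(neighbours)
```

With real data, `embeddings` and `channel_ids` would be replaced by whatever is loaded from the embedding files, in matching row order.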
Create a conda (or mamba) environment using `conda env create -f environment.yml`. This creates a conda environment named `ytb` with all libraries necessary for running the code. It might be necessary to upgrade your conda version beforehand (`conda upgrade conda`) if you get any error.
The repository uses jupytext for notebook version control, so notebooks are saved in Markdown format. This keeps them readable on GitHub and strips cell outputs.
All of the notebooks for recreating the embeddings are in the `generate_embeddings/` folder. Please note that it will take some work to get everything running. Notably, the notebooks assume you have already extracted all links to YouTube from Reddit comments and submissions (the PySpark code for extracting them is not public, at least not yet).
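Since the extraction code is not available, the following is a rough sketch of how YouTube video IDs could be pulled out of a comment or submission body with a regular expression. This is not the authors' PySpark pipeline: the pattern, the function name `extract_video_ids`, and the set of URL shapes handled are all assumptions, and a real extraction would likely need to cover more URL variants.

```python
import re

# Matches two common YouTube URL shapes (youtube.com/watch?v=... and
# youtu.be/...) and captures the 11-character video ID.
# This pattern is an illustrative assumption, not the repository's own.
YOUTUBE_URL = re.compile(
    r"\b(?:https?://)?(?:www\.|m\.)?"
    r"(?:youtube\.com/watch\?(?:[^\s#]*&)?v=|youtu\.be/)"
    r"([A-Za-z0-9_-]{11})(?![A-Za-z0-9_-])"
)


def extract_video_ids(text):
    """Return the YouTube video IDs found in text, in order of appearance."""
    return YOUTUBE_URL.findall(text)


# Hypothetical comment body used only for demonstration.
comment = (
    "See https://www.youtube.com/watch?v=dQw4w9WgXcQ and "
    "[this](https://youtu.be/abcdefghijk?t=10), "
    "not https://example.com/watch?v=zzzzzzzzzzz"
)
print(extract_video_ids(comment))
```

Note that this yields video IDs, so a separate step would still be needed to map each video to its channel before building channel-level embeddings.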
Unfortunately, the Pushshift dumps appear to be no longer accessible at https://files.pushshift.io/ (although a torrent still seems to be available), and according to this post, Reddit revoked Pushshift's access, so more recent posts cannot be included in datasets.