embed course metadata as contentfile #2050

Draft
wants to merge 5 commits into
base: main

@shanbady shanbady commented Feb 14, 2025

What are the relevant tickets?

Fixes (part 1 of) https://github.com/mitodl/hq/issues/6725

Description (What does it do?)

This PR adds embeddings for general course info (what shows up in the resource panel) to the contentfiles collection, so that the chat agent can get that info directly from the contentfile vector endpoint.

How it works
Any time we embed a new resource, we also generate an "about this course" document containing all the course info and put it in the contentfiles collection. We can follow the same pattern for whatever else we might need to enrich the chat agent's responses.
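
A minimal sketch of what generating such a document could look like. This is not the code in this PR: the `CourseInfo` fields and the `render_course_info_document` helper are hypothetical names chosen for illustration. The only details taken from the description are that the chunk starts with "Information about this course" and is stored with a ".md" file extension.

```python
from dataclasses import dataclass, field

# Taken from the PR description: the generated document is stored as markdown.
FILE_EXTENSION = ".md"


@dataclass
class CourseInfo:
    """Hypothetical stand-in for the fields shown in the resource panel."""

    title: str
    offered_by: str
    instructors: list[str] = field(default_factory=list)
    price: str = ""
    certification: bool = False


def render_course_info_document(info: CourseInfo) -> str:
    """Render course info as a small markdown document (illustrative only)."""
    lines = [
        "Information about this course",
        "",
        f"Title: {info.title}",
        f"Offered by: {info.offered_by}",
    ]
    # Optional fields are skipped when empty so the chunk stays compact.
    if info.instructors:
        lines.append("Instructors: " + ", ".join(info.instructors))
    if info.price:
        lines.append(f"Price: {info.price}")
    lines.append("Certificate available: " + ("yes" if info.certification else "no"))
    return "\n".join(lines)
```

The resulting string would then be embedded alongside the resource's other contentfiles, so questions about instructors, cost, or certificates have at least one chunk to match.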

How can this be tested?

  1. Check out this branch.
  2. Make sure you have some contentfiles locally.
  3. Find the learning resource id of a resource that has contentfiles.
  4. Generate embeddings for that resource via python manage.py generate_embeddings --resource-ids <id>
  5. Hit the contentfile vector search endpoint, setting the resource_readable_id parameter to the resource's readable_id (make sure the readable_id is URL-encoded - this has caught me a few times).
    http://open.odl.local:8063/api/v0/vector_content_files_search/?limit=10&resource_readable_id=course-v1%3AMITxT%2B14.73x&q=who%20offers%20this%20course?
  6. Try setting the "q=" parameter to questions about the course (info that shows up in the resource panel), for example "q=who is teaching this course?" or "q=how much does it cost? will I earn a certificate?". The results should surface a chunk that starts with "Information about this course" and has a "file_extension" of ".md".
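
Steps 5 and 6 can be sketched in Python to avoid hand-encoding the readable_id. This is a rough helper, not part of the PR. The endpoint URL and parameter names are copied from step 5; `build_search_url`, `find_course_info_chunks`, and the `chunk_content` key are assumptions, since the response shape is not shown here (only the "file_extension" field is mentioned above).

```python
from urllib.parse import quote, urlencode

# Endpoint taken from step 5 of the testing instructions.
BASE_URL = "http://open.odl.local:8063/api/v0/vector_content_files_search/"


def build_search_url(readable_id: str, question: str, limit: int = 10) -> str:
    """Build the search URL with every query parameter percent-encoded."""
    params = {
        "limit": limit,
        "resource_readable_id": readable_id,
        "q": question,
    }
    # quote_via=quote encodes spaces as %20 rather than "+".
    return f"{BASE_URL}?{urlencode(params, quote_via=quote)}"


def find_course_info_chunks(results, text_key="chunk_content"):
    """Return results that look like the generated course info document.

    `text_key` is a guess at the field holding the chunk text; adjust it
    to match the actual response.
    """
    return [
        result
        for result in results
        if result.get("file_extension") == ".md"
        and str(result.get(text_key, "")).startswith("Information about this course")
    ]


url = build_search_url("course-v1:MITxT+14.73x", "who offers this course?")
```

The URL can then be opened in a browser or fetched with any HTTP client, and the parsed results passed to `find_course_info_chunks` to confirm the course info chunk was surfaced.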

Additional Context

  • The "course info" document won't always be surfaced as the top result - it's heavily dependent on the other contentfiles for the resource, and in many cases the same info is found in other chunks. It does, however, fill the gap when that info is absent from the contentfiles.
  • There will be a follow-up ticket about pulling full marketing site content into the chunk collection - related: https://github.com/mitodl/hq/issues/6699
