Too many tokens #4

Open
drob-xx opened this issue Nov 21, 2022 · 3 comments

Comments


drob-xx commented Nov 21, 2022

Hi @jalammar,

With my dataset I'm getting

CohereError: too many tokens: total number of tokens (prompt and prediction) cannot exceed 2048 - received 6354. Try using a shorter prompt or a smaller max_tokens value.

Is this b/c Cohere can't handle more than 2048 or is it a limitation on the freebie key I'm using? My toy data is 6354 records.

Thanks!

jalammar (Collaborator) commented:

It's a limitation of the model (2048 tokens), not of the trial key. Length is indeed a limitation of this approach.

Note that the 6354 number in the message does not refer to your records. It refers to the number of tokens in the naming prompt.

Here's an example of how that looks. Say we're naming these two clusters:

[screenshot: two example clusters of texts]

Topically starts by naming cluster 0. To do that, it:
1. takes a few example texts from the cluster,
2. adds these examples to the cluster-naming prompt, and
3. sends that prompt to the generative model.

[screenshot: the cluster-naming prompt built from the sampled texts]

It then does the same with the rest of the clusters.
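Very roughly, the flow looks like the sketch below. This is only an illustration of the steps above, not the library's actual code; the sampling, prompt template, and dictionary of prompts are simplified stand-ins.

```python
import random
from collections import defaultdict

def build_naming_prompts(texts, assignments, base_prompt, num_sample_texts=10):
    """Sketch of the flow above: group texts by cluster, sample a few from
    each, and append them to the generic cluster-naming prompt."""
    by_cluster = defaultdict(list)
    for text, cluster in zip(texts, assignments):
        by_cluster[cluster].append(text)

    prompts = {}
    for cluster, cluster_texts in by_cluster.items():
        samples = random.sample(cluster_texts, min(num_sample_texts, len(cluster_texts)))
        # The full prompt sent to the generative model is the generic prompt
        # plus the sampled texts -- and that whole thing must fit in 2048 tokens.
        prompts[cluster] = base_prompt + "\n- " + "\n- ".join(samples)
    return prompts
```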

A few ways to get around the length limitation:

  • Instead of passing the full texts, can you use shorter versions of them? For example, if you are topic modeling articles, can you do the naming on the titles instead of the articles themselves? Or, for academic papers, can you use the abstracts or titles?
  • If not, consider using an excerpt of each text for the naming. For example, create a new column in the dataframe containing an excerpt of the long document (maybe the first sentence or paragraph, if that tends to capture the gist of the document); see the sketch after this list.
  • A more advanced approach that would likely lead to better results is to pass a summary of each document.
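For the excerpt option, here is a minimal pandas sketch (the dataframe contents and column names are made up for illustration):

```python
import pandas as pd

# Toy dataframe; in practice "text" would hold your long documents.
df = pd.DataFrame({"text": [
    "First long document. It goes on and on...",
    "Second long document. Also quite long...",
]})

# Option 1: keep only the first sentence of each document for naming.
df["excerpt"] = df["text"].str.split(".").str[0]

# Option 2: keep a fixed-length slice, e.g. the first 500 characters.
# df["excerpt"] = df["text"].str.slice(0, 500)
```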

drob-xx (Author) commented Nov 21, 2022

Thanks for the detailed explanation. I understand the 2048-token limitation, but for some reason I am still getting the error even when I throttle the sample length down. Let me check whether I'm understanding the flow correctly. For example:

app.name_topics((first200Words, IssueTopicsBT.topics_))

So, where IssueTopicsBT.topics_ is the cluster assignment for each of the text strings in first200Words, the app will:

  • cycle through each topic cluster and select (how many?) doc string samples,
  • concatenate them, and
  • feed them to Cohere?

In other words, will I hit the 2048 limit when (length of the sampled records) × (number of samples) > 2048? If so, how many samples are concatenated? Hope this is clear.

jalammar (Collaborator) commented Dec 1, 2022

Clear! The samples are concatenated to the first (generic) prompt in this file: https://github.com/cohere-ai/sandbox-topically/blob/main/topically/prompts/prompts.py

It shows the model a few examples of the types of cluster names we want.

I believe with the new Command model we should be able to use shorter prompts and fit in more examples.

> cycle through each topic cluster, select (how many??) doc string samples,

10 is the default value. You can change it by passing, for example, num_sample_texts=5 to name_topics.
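So, adapting the call from earlier in this thread, something like the following should help stay under the limit. This is a sketch, assuming the Topically app object is created as in the repo's examples; the API key string is a placeholder, and first200Words / IssueTopicsBT.topics_ are the variables from your own setup.

```python
from topically import Topically

app = Topically("YOUR_COHERE_API_KEY")  # placeholder key

# first200Words: the shortened texts; IssueTopicsBT.topics_: cluster label per text
topic_names = app.name_topics(
    (first200Words, IssueTopicsBT.topics_),
    num_sample_texts=5,  # default is 10; fewer samples -> shorter naming prompt
)
```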
