Too many tokens #4

Open
drob-xx opened this issue Nov 21, 2022 · 3 comments

Comments


drob-xx commented Nov 21, 2022

Hi @jalammar,

With my dataset I'm getting

CohereError: too many tokens: total number of tokens (prompt and prediction) cannot exceed 2048 - received 6354. Try using a shorter prompt or a smaller max_tokens value.

Is this b/c Cohere can't handle more than 2048 or is it a limitation on the freebie key I'm using? My toy data is 6354 records.

Thanks!

jalammar (Collaborator) commented:

It's a limitation of the model (2048 tokens), not of the trial key. Length is indeed a limitation of this approach.

Note that the 6354 number in the message does not refer to your records. It refers to the number of tokens in the naming prompt.

Here's an example of how that looks. Say we're naming these two clusters:

[screenshot: two example clusters of texts]

Topically starts by naming cluster 0. To do that, it:
1. takes a few example texts from the cluster,
2. adds these examples to the cluster-naming prompt, and
3. sends that prompt to the generative model.

[screenshot: the cluster-naming prompt built from the sampled texts]

It then does the same with the rest of the clusters.
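Very roughly, the flow looks like the sketch below. This is only an illustration of the steps above, not the library's actual code; the sampling, prompt template, and dictionary of prompts are simplified stand-ins.

```python
import random
from collections import defaultdict

def build_naming_prompts(texts, assignments, base_prompt, num_sample_texts=10):
    """Sketch of the flow above: group texts by cluster, sample a few from
    each, and append them to the generic cluster-naming prompt."""
    by_cluster = defaultdict(list)
    for text, cluster in zip(texts, assignments):
        by_cluster[cluster].append(text)

    prompts = {}
    for cluster, cluster_texts in by_cluster.items():
        samples = random.sample(cluster_texts, min(num_sample_texts, len(cluster_texts)))
        # The full prompt sent to the generative model is the generic prompt
        # plus the sampled texts -- and that whole thing must fit in 2048 tokens.
        prompts[cluster] = base_prompt + "\n- " + "\n- ".join(samples)
    return prompts
```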

A few ways to get around the length limitation:

  • Instead of passing the full texts, can you use shorter versions of them? For example, if you are topic modeling articles, can you do the naming on the titles instead of the articles themselves? Or, for academic papers, can you use the abstracts or titles?
  • If not, consider using an excerpt of each text for the naming. For example, create a new column in the dataframe containing an excerpt of the long document (maybe the first sentence or paragraph, if that tends to capture the gist of the document); see the sketch after this list.
  • A more advanced approach that would likely lead to better results is to pass a summary of each document.
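For the excerpt option, here is a minimal pandas sketch (the dataframe contents and column names are made up for illustration):

```python
import pandas as pd

# Toy dataframe; in practice "text" would hold your long documents.
df = pd.DataFrame({"text": [
    "First long document. It goes on and on...",
    "Second long document. Also quite long...",
]})

# Option 1: keep only the first sentence of each document for naming.
df["excerpt"] = df["text"].str.split(".").str[0]

# Option 2: keep a fixed-length slice, e.g. the first 500 characters.
# df["excerpt"] = df["text"].str.slice(0, 500)
```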

drob-xx (Author) commented Nov 21, 2022

Thanks for the detailed explanation. I understand the 2048-token limitation, but for some reason I am still getting the error even when I throttle the sample length down. Let me check whether I'm understanding the flow correctly. For example:

app.name_topics((first200Words, IssueTopicsBT.topics_))

So, where IssueTopicsBT.topics_ is the cluster assignment for each of the text strings in first200Words, the app will:

  • cycle through each topic cluster and select (how many?) doc string samples,
  • concatenate them, and
  • feed them to Cohere?

In other words, will I hit the 2048 limit when (length of the sampled records) × (number of samples) > 2048? If so, how many samples are concatenated? Hope this is clear.

jalammar (Collaborator) commented Dec 1, 2022

Clear! The samples are concatenated to the first (generic) prompt in this file: https://github.com/cohere-ai/sandbox-topically/blob/main/topically/prompts/prompts.py

It shows the model a few examples of the types of cluster names we want.

I believe with the new Command model we should be able to use shorter prompts and fit in more examples.

> cycle through each topic cluster, select (how many??) doc string samples,

10 is the default value. You can change it by passing, for example, num_sample_texts=5 to name_topics.
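So, adapting the call from earlier in this thread, something like the following should help stay under the limit. This is a sketch, assuming the Topically app object is created as in the repo's examples; the API key string is a placeholder, and first200Words / IssueTopicsBT.topics_ are the variables from your own setup.

```python
from topically import Topically

app = Topically("YOUR_COHERE_API_KEY")  # placeholder key

# first200Words: the shortened texts; IssueTopicsBT.topics_: cluster label per text
topic_names = app.name_topics(
    (first200Words, IssueTopicsBT.topics_),
    num_sample_texts=5,  # default is 10; fewer samples -> shorter naming prompt
)
```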
