
Info table for code LMs #28

Open
niansong1996 opened this issue Apr 5, 2023 · 4 comments

@niansong1996 (Contributor):
We need a table to summarize all the code LMs that we test in this project.

An example of this (from this paper):

[table screenshot omitted]

Some more code-related features we would like to summarize about those models (feel free to add more):

  • programming languages that the model is trained on
  • whether the model is initialized from a natural-language LM
  • ...
@niansong1996 (Contributor Author):

@niansong1996 Also think about what columns are important for code models.

@yilunzhao's version is here in this link

@niansong1996 (Contributor Author):

@yilunzhao Let's have the discussion here.

This is an initial draft of the table columns; I have added my comments in *(italics)*.

  • Model
  • Release Time
  • Open-source
  • Basic Information
    • Size
    • Base Model *(do you mean the model architecture, or what the model is initialized from?)*
    • In-Context Learning *(do you mean whether they are instruction-tuned?)*
    • Code LM-specific *(is this used to distinguish general LMs from code LMs?)*
  • Pretraining Details
    • Pretrain data scale
      • *(We need a unified metric; right now some are in # instances and some in GB/TB)*
      • *(Also, we should be more specific about this, e.g. "The Pile", "BigQuery"; a short one-sentence description is also fine)*
    • Pretraining Programming Languages *(this is good, but maybe we can list the top-5 PLs instead?)*
    • Pretraining Hardware *(I think this is optional, but let's have it for now)*
    • Pretraining Time *(same as above)*

A couple of things I think we can add:

  • Access (released model weights? access through API?), can be merged with Open-Source in some way

Also some comments:

  • I think models in the same series that differ only in size can share a row, with the sizes listed like "small (60M), base (220M), ..."; but if they also differ in pretraining data (e.g., codegen-nl/multi/mono), they should get separate rows

We can also iterate on this, so feel free to start writing the rows as well. These two surveys can be good sources for checking whether we missed any important models (though some of them will not be relevant):
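To make the column discussion concrete, here is a minimal sketch of what one table row could look like as a data structure. This is only an illustration, assuming Python: the class and field names (`CodeLMRow`, `to_markdown`, etc.) are hypothetical, and the CodeGen-multi figures in the example row are illustrative and should be verified against the paper before use.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CodeLMRow:
    """One row of the code-LM summary table (column set is a sketch)."""
    model: str
    release_time: str            # e.g. "2022-03"
    open_source: bool
    sizes: list                  # e.g. ["small (60M)", "base (220M)"]
    base_model: str              # what the model was initialized from
    pretrain_data: str           # short description, e.g. "The Pile", "BigQuery"
    pretrain_scale: str          # unified metric, e.g. "119 GB"
    top_languages: list          # top-5 pretraining programming languages
    max_length: int              # maximum context length
    code_percentage: Optional[float] = None  # % of code in training data, if known

    def to_markdown(self) -> str:
        """Render this row as one line of a GitHub-flavored markdown table."""
        cells = [
            self.model,
            self.release_time,
            "yes" if self.open_source else "no",
            ", ".join(self.sizes),
            self.base_model,
            f"{self.pretrain_data} ({self.pretrain_scale})",
            ", ".join(self.top_languages),
            str(self.max_length),
            f"{self.code_percentage:.0f}%" if self.code_percentage is not None else "?",
        ]
        return "| " + " | ".join(cells) + " |"


# Illustrative example row; check all numbers against the CodeGen paper.
row = CodeLMRow(
    model="CodeGen-multi",
    release_time="2022-03",
    open_source=True,
    sizes=["350M", "2B", "6B", "16B"],
    base_model="CodeGen-NL",
    pretrain_data="BigQuery",
    pretrain_scale="119 GB",  # illustrative figure
    top_languages=["Python", "Java", "JavaScript", "C", "C++"],
    max_length=2048,
)
print(row.to_markdown())
```

A structured format like this makes it easy to render the same data as a markdown table for the README while keeping it machine-checkable (e.g., enforcing the unified pretrain-scale metric discussed above).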

@niansong1996 (Contributor Author):

One other important piece of information I just thought of is the maximum context length of the LM.

@niansong1996 (Contributor Author):

Maybe also add "percentage of code in the training data".
