
Info table for code LMs #28

Open
niansong1996 opened this issue Apr 5, 2023 · 4 comments

@niansong1996 (Contributor):
We need a table to summarize all the code LMs that we test in this project.

An example of this (from this paper):

[table screenshot omitted]

Some more code-related features we would like to summarize about those models (feel free to add more):

  • programming languages that the model is trained on
  • whether the model is initialized from a natural-language LM
  • ...
@niansong1996 (Contributor Author):

@niansong1996 Also think about what columns are important for code models.

@yilunzhao's version is here in this link

@niansong1996 (Contributor Author):

@yilunzhao Let's have the discussion here.

This is an initial draft of the table columns; I have added my comments in *(italics)*.

  • Model
  • Release Time
  • Open-source
  • Basic Information
    • Size
    • Base Model *(do you mean the model architecture, or what the model is initialized from?)*
    • In-Context Learning *(do you mean whether they are instruction-tuned?)*
    • Code LM-specific *(is this used to distinguish general LMs from code LMs?)*
  • Pretraining Details
    • Pretrain data scale
      • *(We need a unified metric; right now some are in # instances and some in GB/TB)*
      • *(Also, we should be more specific about this, e.g. "The Pile", "BigQuery"; a short one-sentence description is also fine)*
    • Pretraining Programming Languages *(this is good, but maybe we can list the top-5 PLs instead?)*
    • Pretraining Hardware *(I think this is optional, but let's have it for now)*
    • Pretraining Time *(same as above)*

A couple of things I think we can add:

  • Access (released model weights? access through API?), can be merged with Open-Source in some way

Also some comments:

  • I think models in the same series that differ only in size can share a row, with the sizes listed like "small (60M), base (220M), ..."; but if they also differ in pretraining data (e.g., codegen-nl/multi/mono), they should get separate rows

We can also iterate on this, so feel free to start writing the rows as well. These two surveys can be good sources for checking whether we missed any important models (though some of them will not be relevant):
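To make the column discussion concrete, here is a minimal sketch of what one table row could look like as a data structure. This is only an illustration, assuming Python: the class and field names (`CodeLMRow`, `to_markdown`, etc.) are hypothetical, and the CodeGen-multi figures in the example row are illustrative and should be verified against the paper before use.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CodeLMRow:
    """One row of the code-LM summary table (column set is a sketch)."""
    model: str
    release_time: str            # e.g. "2022-03"
    open_source: bool
    sizes: list                  # e.g. ["small (60M)", "base (220M)"]
    base_model: str              # what the model was initialized from
    pretrain_data: str           # short description, e.g. "The Pile", "BigQuery"
    pretrain_scale: str          # unified metric, e.g. "119 GB"
    top_languages: list          # top-5 pretraining programming languages
    max_length: int              # maximum context length
    code_percentage: Optional[float] = None  # % of code in training data, if known

    def to_markdown(self) -> str:
        """Render this row as one line of a GitHub-flavored markdown table."""
        cells = [
            self.model,
            self.release_time,
            "yes" if self.open_source else "no",
            ", ".join(self.sizes),
            self.base_model,
            f"{self.pretrain_data} ({self.pretrain_scale})",
            ", ".join(self.top_languages),
            str(self.max_length),
            f"{self.code_percentage:.0f}%" if self.code_percentage is not None else "?",
        ]
        return "| " + " | ".join(cells) + " |"


# Illustrative example row; check all numbers against the CodeGen paper.
row = CodeLMRow(
    model="CodeGen-multi",
    release_time="2022-03",
    open_source=True,
    sizes=["350M", "2B", "6B", "16B"],
    base_model="CodeGen-NL",
    pretrain_data="BigQuery",
    pretrain_scale="119 GB",  # illustrative figure
    top_languages=["Python", "Java", "JavaScript", "C", "C++"],
    max_length=2048,
)
print(row.to_markdown())
```

A structured format like this makes it easy to render the same data as a markdown table for the README while keeping it machine-checkable (e.g., enforcing the unified pretrain-scale metric discussed above).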

@niansong1996 (Contributor Author):

One other important piece of information I just thought of is the maximum context length of the LM.

@niansong1996 (Contributor Author):

Maybe also add "percentage of code in the training data".
