We are glad you are contributing to NeMo Curator! Before you make a PR, be sure to read over this guide in detail. This checklist ensures that NeMo Curator stays easy to use for both users and developers. Not every step is necessary for every contribution, so read the relevant sections for more information about each item.
- Follow the general principles in your design
- Write your code in the proper place
- Write examples and documentation for using your code
- Format using the style guide
- Write unit tests
- Make a pull request
- User-oriented: make it easy for end users, even at the cost of writing more code in the background
- Robust: make it hard for users to make mistakes.
- Reusable: for every piece of code, think about how it can be reused in the future and make it easy to be reused.
- Readable: code should be easy to read.
- Legal: if you copy even one line of code from the Internet, make sure that its license is compatible with the license NeMo Curator uses. Give credit and link back to the code.
- Sensible: code should make sense. If you think a piece of code might be confusing, write comments.
The repository is home to flexible Python modules, sample scripts, tests, and more. Here is a brief overview of where everything lives:
- config - A collection of example configuration files for many of the curator's modules.
- docs - Walkthroughs and motivations for each of the modules.
- examples - Example scripts for how users may want to compose the curator.
- nemo_curator - The main home for all the NeMo Curator's Python APIs.
- tests - Unit tests for each module.
Examples provide an easy way for users to see how the curator works in action.
There should be at least one example per module in the curator.
They should be incredibly lightweight and rely on the core `nemo_curator` modules for their functionality.
Most should be designed for a user to get up and running on their local machine, but distributed examples are welcome where they make sense.
Python scripts should be the primary way to showcase your module.
However, SLURM scripts or other cluster scripts should be included if special steps are needed to run the module.
The documentation should complement each example by going through the motivation behind why a user would use each module. It should include both an explanation of the module and how it is used in its corresponding example. The documentation should also cover potential pitfalls and performance considerations when running the module at scale. The existing examples and documentation should serve as a good reference for what is expected.
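As a rough illustration of the shape a lightweight example script might take, here is a minimal sketch that runs locally on in-memory data. The function `keep_long_lines` is a hypothetical stand-in for functionality that a real example would import from `nemo_curator`; it is not part of the library's API.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def keep_long_lines(lines, min_words):
    """Hypothetical stand-in for a module a real example would import from nemo_curator."""
    return [line for line in lines if len(line.split()) >= min_words]


def main():
    # A tiny in-memory dataset keeps the example runnable on a local machine.
    sample = [
        "too short",
        "this line has enough words",
        "ok",
        "another sufficiently long line",
    ]
    kept = keep_long_lines(sample, min_words=3)
    logger.info(f"Kept {len(kept)} of {len(sample)} documents")
    return kept


if __name__ == "__main__":
    main()
```

The script itself contains almost no logic of its own: everything interesting lives in the function it calls, which is the property an example should have.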
We use `black` as our style guide. To fix your format, run `pip install pre-commit && pre-commit install && pre-commit run --all`.
- Include docstrings for every class and method exposed to the user.
- Avoid wild imports: `from X import *` unless `__all__` is defined in `X.py`.
- Minimize the use of `**kwargs`.
- `raise Error` is preferred to `assert`. Write `if not X: raise Error` instead of `assert X`.
- Classes are preferred to standalone methods.
- Methods should be atomic. A method shouldn't be longer than 88 lines, i.e., it should fit on a computer screen without scrolling.
- If a method has arguments that don't fit on one line, each argument should be on its own line for readability.
- Add `__init__.py` to every folder.
- F-strings are preferred to formatted strings.
- Loggers are preferred to print.
- Private functions (functions whose names start with `_`) shouldn't be called outside their host file.
- If a comment spans multiple lines, use `'''` instead of `#`.
Unit tests should be simple and fast. Developers should be able to run them frequently while developing without any slowdown.
pytest
# If you don't have an NVIDIA GPU, run:
# pytest --cpu
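A simple, fast unit test might look like the sketch below. The function under test, `normalize_whitespace`, is a hypothetical example and not part of NeMo Curator; the layout relies on pytest's default discovery of classes prefixed with `Test` and methods prefixed with `test_`.

```python
def normalize_whitespace(text: str) -> str:
    """Hypothetical function under test: collapses runs of whitespace."""
    return " ".join(text.split())


class TestNormalizeWhitespace:
    def test_collapses_spaces(self):
        assert normalize_whitespace("a   b") == "a b"

    def test_strips_ends(self):
        assert normalize_whitespace("  a b\n") == "a b"

    def test_empty_string(self):
        assert normalize_whitespace("") == ""
```

Each test checks one behavior on a tiny in-memory input, so the whole class runs in milliseconds and needs neither a GPU nor any files on disk.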
Send your PRs to the `main` or `dev` branch.
- Make sure your PR does one thing. Have a clear answer to "What does this PR do?".
- Read the general principles and style guide above.
- Make sure you sign your commits, e.g. use `git commit -sS` when committing.
- Make sure all unit tests finish successfully before sending the PR: run `pytest` or (if your dev box does not have a GPU) `pytest --cpu` from the root folder.
- Send your PR and request a review.
The `dev` branch is for active development and may be unstable. Unit tests are expected to pass before merging into `dev` or `main`.
At every release, `dev` and `main` are synced to be the same.
Full text of the DCO:
Developer Certificate of Origin
Version 1.1
Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129
Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.
Developer's Certificate of Origin 1.1
By making a contribution to this project, I certify that:
(a) The contribution was created in whole or in part by me and I have the right to submit it under the open source license indicated in the file; or
(b) The contribution is based upon previous work that, to the best of my knowledge, is covered under an appropriate open source license and I have the right under that license to submit that work with modifications, whether created in whole or in part by me, under the same open source license (unless I am permitted to submit under a different license), as indicated in the file; or
(c) The contribution was provided directly to me by some other person who certified (a), (b) or (c) and I have not modified it.
(d) I understand and agree that this project and the contribution are public and that a record of the contribution (including all personal information I submit with it, including my sign-off) is maintained indefinitely and may be redistributed consistent with this project or the open source license(s) involved.
Request a review from Joseph Jennings (@jjennings) or Ryan Wolf (@rywolf).
They may ask for other reviewers depending on the scope of the change. Your pull requests must pass all checks and peer review before they can be merged.
Thank you for contributing to NeMo Curator!