Add notebook which integrates metag, metap, metab data #124

Open
wants to merge 35 commits into
base: main

Contributor

@bmeluch bmeluch commented Feb 5, 2025

This PR adds a new notebook to the repo that connects metagenomic (metag), metaproteomic (metap), and metabolomic (metab) data from the same samples. It relies heavily on the KEGG Orthology annotations provided by the NMDC workflows for each of these data types.

Links

nbviewer https://nbviewer.org/github/microbiomedata/nmdc_notebooks/blob/91-create-notebook-integrating-metag-metap-metab-data-r/omics_types_integration/R/integration_notebook.ipynb

colab https://colab.research.google.com/github/microbiomedata/nmdc_notebooks/blob/91-create-notebook-integrating-metag-metap-metab-data-r/omics_types_integration/R/integration_notebook.ipynb


All Submissions:

  • Have you followed the guidelines in our Contributing document?
  • Have you checked to ensure there aren't other open Pull Requests for the same update/change?
  • Does your PR link to an issue?
  • Have you described the changes this PR will make?

New Notebook Submissions:

  • Have you included a summary of the notebook in the README.md, including updated links to the notebook?
  • Does your PR include links to the new notebook (in the branch) for review using nbviewer, Colab, and reviewnb? These three are the preferred ways to review changes and additions to notebooks during review.
  • Does your PR include a test in a github workflow that tests the render-ability of your notebook?

@bmeluch bmeluch linked an issue Feb 5, 2025 that may be closed by this pull request
7 tasks

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.



@bmeluch
Contributor Author

bmeluch commented Feb 5, 2025

@samobermiller I got the first part of the notebook rendered. Does it look OK in ReviewNB, and does it make sense so far? Thank you :)

@@ -0,0 +1,730 @@
{
Collaborator

@samobermiller samobermiller Feb 20, 2025

I'd suggest rewriting the title to something more action-based. Maybe 'How can we relate different types of omics data in the NMDC database?'

See my later note about looking at crossover in KEGG IDs before using the KEGG API. If you decide to follow that idea, maybe add to the beginning of your note: 'NOTE: After finding overlap in KO identifications across omics types, this notebook uses the KEGGREST R package to interface with the KEGG API and determine the biological relevance of these identifications. Use of............' My thinking is that this will underline that your notebook is useful even if you don't use the licensed packages.


Agree with Sam about the title. And I wouldn't use a question. Instead something like:

"Identifying relationships between annotated omics data types in NMDC"

@@ -0,0 +1,730 @@
{
Collaborator

@samobermiller samobermiller Feb 20, 2025

Since the KEGG API and package are restricted, would it make sense to add something beforehand that looks at the crossover in KEGG IDs identified between the metagenomic and metaproteomic data? I know it won't show biological relevance until you use the licensed API to pull info, but it could show the use of your notebook even without the licensed API. Something like: 'Here is what you can do to see overlap in multi-omic identifications. If you'd like more information on their biological relevance, you can follow the tutorial below using the KEGG API, but it will require a license, blah blah blah.'
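
As a rough illustration of that idea, the overlap check needs nothing beyond set operations on the KO IDs. This is a minimal Python sketch, not the notebook's R code; the function name `ko_overlap` and the sample KO lists are hypothetical placeholders, not NMDC data.

```python
def ko_overlap(metag_kos, metap_kos):
    """Summarize which KO IDs are shared between two omics types."""
    metag = set(metag_kos)
    metap = set(metap_kos)
    shared = metag & metap
    return {
        "shared": sorted(shared),
        "metag_only": sorted(metag - metap),
        "metap_only": sorted(metap - metag),
        # Fraction of protein-level KOs that are also seen at the gene level
        "metap_fraction_in_metag": len(shared) / len(metap) if metap else 0.0,
    }


# Hypothetical KO annotations for one sample (placeholders only)
metag_kos = ["K00001", "K00002", "K00844", "K01810"]
metap_kos = ["K00844", "K01810", "K02111"]

summary = ko_overlap(metag_kos, metap_kos)
print(summary["shared"])
print(summary["metap_fraction_in_metag"])
```

A step like this requires no KEGG access at all, which is the point of the suggestion: the licensed lookup only becomes necessary when interpreting what the shared IDs mean biologically.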



@@ -0,0 +1,1100 @@
{
Collaborator

@samobermiller samobermiller Feb 21, 2025

I'm confused by 'we can look up the corresponding annotations in other KEGG databases'... what are 'other KEGG databases'?

Are you saying there is more than one place to search for corresponding annotations using the KEGG IDs found in this data? Based on the rest of the notebook, I think what you're saying is that you're going to look up annotation information for the KEGG IDs in a sample's metabolomic, proteomic, and genomic data, then compare the overlap of the annotation information between the three omics types?

This confusion is probably due to my own lack of background knowledge, but I think it could use some clarification.


Contributor Author

KEGG calls each of their sets of identifiers "databases", so the KEGG orthology IDs are in a different "database" from the EC numbers, which are in a different "database" from the compound IDs. The notebook code below that sentence uses an endpoint that looks up what Y is linked to in database X. It's confusing. I tried to add some more explanation; let me know if it makes sense!
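
To make the "databases" idea concrete: the KEGG REST `link` operation takes a target database and one or more source entries and returns tab-delimited pairs of linked IDs. The sketch below is Python rather than the notebook's R, makes no network call, and parses a hand-written sample response. The helper names and the sample pairs are assumptions for illustration, not output fetched from KEGG.

```python
from collections import defaultdict

KEGG_REST_BASE = "https://rest.kegg.jp"


def build_link_url(target_db, source):
    """Build a KEGG REST 'link' URL: which entries in target_db link to source?"""
    return f"{KEGG_REST_BASE}/link/{target_db}/{source}"


def parse_link_response(text):
    """Parse tab-delimited 'source<TAB>target' lines into {source: [targets]}."""
    links = defaultdict(list)
    for line in text.strip().splitlines():
        if not line.strip():
            continue
        source_id, target_id = line.split("\t")
        links[source_id].append(target_id)
    return dict(links)


# Ask the "enzyme" database what is linked to two KO entries
url = build_link_url("enzyme", "ko:K00844+ko:K00845")

# Hand-written sample in the response format (illustrative, not fetched)
sample_response = "ko:K00844\tec:2.7.1.1\nko:K00845\tec:2.7.1.2\n"

ko_to_ec = parse_link_response(sample_response)
print(url)
print(ko_to_ec)
```

The ID prefixes (`ko:`, `ec:`) are how each linked pair signals which "database" an identifier belongs to, so the same parsing works whichever two databases are being connected.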

@@ -0,0 +1,1100 @@
{
Collaborator

@samobermiller samobermiller Feb 21, 2025

I would add a bit more context on how to interpret the chord diagram, and maybe an example observation. Also, could you clarify what you mean by 'connections'? Is it the overlap in annotations? To me it looks like the figure is showing that nearly all protein annotations and about half the metabolomic annotations are also seen in the gene annotations. Does overlap of all three arcs mean anything, like it does in a Venn diagram?


Contributor Author

Weird, I can't see this comment in ReviewNB anymore, but I added an actual explanation. Let me know if it makes more sense!

@bmeluch bmeluch changed the title from "DRAFT: Add notebook which integrates metag, metap, metab data" to "Add notebook which integrates metag, metap, metab data" Feb 22, 2025
@bmeluch bmeluch marked this pull request as ready for review February 22, 2025 08:35
@@ -0,0 +1,1238 @@
{
@lamccue lamccue Feb 28, 2025

This is not very clear. I may not understand what is happening. There are two things, right? I suggest clearly separating them. Could make a separate small section.

First section:

"Gather protein information"

"Next we do the same thing for proteins - starting with the KO and EC IDs from the NMDC protein reports for each sample, identify the modules and pathways each protein is associated with."

Second section:

"Combine protein and metabolite information"

"We can now use the Enzyme Commission and KEGG annotations to connect the metabolites and proteins identified in each sample."


Contributor Author

I've reorganized section 2 so that all of the "is x found in y" comparisons happen together at the end. Let me know if it makes more sense.

@@ -0,0 +1,1238 @@
{
@lamccue lamccue Feb 28, 2025

This might be more interesting if we compared / contrasted heatmaps showing the coverage of each pathway when all data types are included (the existing one) and also one showing the coverage of each pathway when only metabolites and proteins are included (no metagenome).

This could demonstrate to someone how to pull different parts of the data to explore it from different angles.


Contributor Author

The pathways don't distinguish between genes and proteins, but I split out the heatmaps by gene/protein KO vs. compound ID. I'm afraid it still might not be interesting, though. I could take the heatmaps out if you think there's enough other stuff in here.
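
One way to picture that split: for each pathway, compute the fraction of its KO members and the fraction of its compound members detected in a sample, which gives the two matrices a pair of heatmaps would be drawn from. This is a hedged Python sketch, not the notebook's R code; the pathway names, membership tables, and detected IDs are invented placeholders.

```python
def pathway_coverage(pathway_members, detected_ids):
    """Return {pathway: fraction of its member IDs that were detected}."""
    detected = set(detected_ids)
    coverage = {}
    for pathway, members in pathway_members.items():
        member_set = set(members)
        if member_set:
            coverage[pathway] = len(member_set & detected) / len(member_set)
        else:
            coverage[pathway] = 0.0
    return coverage


# Hypothetical pathway membership, split by identifier type
pathway_kos = {
    "pathway_A": ["K00001", "K00002", "K00003", "K00004"],
    "pathway_B": ["K00005", "K00006"],
}
pathway_compounds = {
    "pathway_A": ["C00001", "C00002"],
    "pathway_B": ["C00003", "C00004", "C00005", "C00006"],
}

# Hypothetical IDs detected in one sample
detected_kos = ["K00001", "K00002", "K00005"]   # from gene/protein annotations
detected_compounds = ["C00001", "C00003"]       # from metabolite annotations

ko_coverage = pathway_coverage(pathway_kos, detected_kos)
compound_coverage = pathway_coverage(pathway_compounds, detected_compounds)
print(ko_coverage)
print(compound_coverage)
```

Keeping the two coverage tables separate makes it easy to see pathways that are well covered by one identifier type but sparse in the other.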

Development

Successfully merging this pull request may close these issues.

Create notebook integrating metag, metap, metab data (R)
3 participants