Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v0.0.10 #75

Merged
merged 56 commits into from
Jan 18, 2024
Merged
Show file tree
Hide file tree
Changes from 52 commits
Commits
Show all changes
56 commits
Select commit Hold shift + click to select a range
205d7cc
init joint solve
cameronmartino Apr 6, 2022
d4c9b99
init joint functions
cameronmartino Jun 20, 2022
1a56d87
update cov calculations
cameronmartino Aug 24, 2022
706f371
flake8 optspace
cameronmartino Oct 27, 2022
f975568
add transformation functions
cameronmartino Oct 27, 2022
6c6f222
joint-opt
cameronmartino Oct 27, 2022
d1b9f8f
add standalone script
cameronmartino Oct 27, 2022
42968bd
add scripts standalone and q2
cameronmartino Nov 7, 2022
32cc024
add scripts testing
cameronmartino Nov 7, 2022
385e745
add testing data scripts
cameronmartino Nov 7, 2022
4885ff2
update flake8 in optspcae
cameronmartino Nov 7, 2022
84216d8
update transform docs & param name
cameronmartino Nov 29, 2022
c426be2
add new data projection scripts
cameronmartino Nov 29, 2022
b18105a
add tests for transformer
cameronmartino Nov 29, 2022
28ec87e
fix index name, breaks conversion in q2
cameronmartino Nov 29, 2022
be58dc5
fix cv output to work with qiime2
cameronmartino Dec 1, 2022
8d89232
add utility functions to help with donwstream analysis
cameronmartino Dec 1, 2022
a6d50e5
add feature table corr type and function
cameronmartino Dec 1, 2022
ca59e93
fix type issues
cameronmartino Dec 1, 2022
32562e9
add index name to corr for transformer
cameronmartino Dec 1, 2022
4a4a541
first tutorial
cameronmartino Dec 1, 2022
740d91b
add tutorial data/images
cameronmartino Dec 1, 2022
04e52d5
update tutorial one
cameronmartino Dec 1, 2022
ed85e0d
fix new scripts
cameronmartino Dec 2, 2022
afa97e3
add tutorials for joint rpca
cameronmartino Dec 2, 2022
12ec3e3
catch up to master
cameronmartino May 25, 2023
7dfbff7
adding tempted & new sparse tensor class functionality
cameronmartino May 26, 2023
713d415
update tempted wrappers for commands
cameronmartino May 28, 2023
bf813f8
add command and q2
cameronmartino May 29, 2023
4d0c827
add transformations for tempted
cameronmartino May 30, 2023
906bea8
fix input types to composition
cameronmartino Jun 1, 2023
2db0fd0
Merge pull request #73 from biocore/tempted-dev
cameronmartino Nov 25, 2023
c05b8c8
add ability to input pre-transformed tables into joint-rpca
cameronmartino Nov 27, 2023
bde6656
depreciate auto rpca and rank estimation functionality see issue #70
cameronmartino Nov 27, 2023
8a49969
bug fix for issue #71
cameronmartino Nov 27, 2023
a50053b
add qc on distances see issue #70
cameronmartino Nov 28, 2023
a950c73
add updated functions for testing and visualizing issue #70
cameronmartino Nov 30, 2023
78d5665
update docs for issue #70
cameronmartino Nov 30, 2023
7104e1e
add commands for rpca with cv, see issue #70
cameronmartino Nov 30, 2023
7e4516e
update tutorials and add tempted tutorial
cameronmartino Nov 30, 2023
e976e47
update tutorials and add tempted tutorial
cameronmartino Nov 30, 2023
5b3dc81
changes logged and version up
cameronmartino Nov 30, 2023
d22a004
update readme
cameronmartino Dec 1, 2023
9247a54
update qc and cv intro and code
cameronmartino Dec 9, 2023
4bde03c
update README
cameronmartino Dec 9, 2023
1871ecd
update readme
cameronmartino Dec 11, 2023
c78d069
add updates tests examples of QC
cameronmartino Dec 12, 2023
fe282a2
update confusing flag in standalone command
cameronmartino Dec 13, 2023
00e24e7
make read me header easier to see
cameronmartino Dec 13, 2023
ba404a2
Merge branch 'master' into jointrpca
cameronmartino Dec 13, 2023
176973a
clean up text
cameronmartino Dec 18, 2023
1494b6d
Merge branch 'jointrpca' of https://github.com/biocore/gemelli into j…
cameronmartino Dec 18, 2023
7ad0ee3
fix permanova in tutorial
cameronmartino Jan 8, 2024
397f257
add breaking change
cameronmartino Jan 8, 2024
b42777e
remove unnecessary np.random seed setting
cameronmartino Jan 15, 2024
1fde9cf
add docstring
cameronmartino Jan 15, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
29 changes: 29 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,3 +1,32 @@

v0.0.10 (2023-12-01)

### Bug fixes

* Removed np.int types
* fixes bug that prevents use in newer versions of numpy / QIIME2
* see issue #71

### Features [experimental]

* rpca projection of new data and with cross validation
* A new function for RPCA where new data can be projected into an existing ordination.
* Also allows for internal CV, where the hold out data is projected into the training ordination, which is then used to re-build the test data, and the error between the projection and the real test data is calculated.
* see issue #70
* tempted.py and assocated tests/commands/tutorials
* Added in a python implementation TEMPTED, more details on the methods inm this paper [here](https://www.biorxiv.org/content/10.1101/2023.07.26.550749v1).
* joint-rpca and associated tests/commands/tutorials
* Joint-RPCA is an extention of RPCA for multiple omics types.
* qc_rarefaction
* This function compares the mantel correlation of distances to the abs. differences in sample sum between rarefied and unrarefied input data. This is an easy check to ensure the results are not bieng significantly altered by non-rarefaction is cases of large differences (e.g., low-biomass) or where the sample sum differences can not be seperated from the phenotypes/low-rank structure of the data (i.e., deep sequencing of controls and shallow of sick).
* see issue #70

### Deprecated functionality

* auto_rpca across the package and rank_estimate in optspace.py
* was not performing correctly and no reasonable quick fix is available.
* see issue #70

v0.0.9 (2023-05-24)

### Bug fixes
Expand Down
71 changes: 65 additions & 6 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,9 +1,9 @@

# Gemelli

Gemelli is a tool box for running both Robust Aitchison PCA (RPCA) and Compositional Tensor Factorization (CTF) on _sparse_ compositional omics datasets.
Gemelli is a tool box for running Robust Aitchison PCA (RPCA), Joint Robust Aitchison PCA (Joint-RPCA), TEMPoral TEnsor Decomposition (TEMPTED), and Compositional Tensor Factorization (CTF) on _sparse_ compositional omics datasets.

RPCA can be used on cross-sectional datasets where each subject is sampled only once. CTF can be used on repeated-measure data where each subject is sampled multiple times (e.g. longitudinal sampling). Both methods are [_unsupervised_](https://en.wikipedia.org/wiki/Unsupervised_learning) and aim to describe sample/subject variation and the biological features that separate them.
RPCA can be used on cross-sectional datasets where each subject is sampled only once. CTF can be used on repeated-measure data where each subject is sampled multiple times (e.g. longitudinal sampling). TEMPTED is specifically designed for longitundal (time series) repeated measure studies, especially when samples are irregularly sampled across subjects. Joint-RPCA allows for the exploration of multiple omics datasets with shared samples at once. All these methods are [_unsupervised_](https://en.wikipedia.org/wiki/Unsupervised_learning) and aim to describe sample/subject variation and the biological features that separate them.

The preprocessing transform for both RPCA and CTF is the robust centered log-ratio transform (rlcr) which accounts for sparse data (i.e. many missing/zero values). Details on the rclr can be found [here](https://msystems.asm.org/content/4/1/e00016-19) and a interactive introduction into the transformation can be found [here](https://github.com/biocore/gemelli/blob/master/ipynb/tutorials/introduction.ipynb). In short, the rclr log transforms the observed (nonzero) values before centering. RPCA and CTF then perform a matrix or tensor factorization on only the observed values after rclr transformation, similar to [Aitchison PCA](https://academic.oup.com/biomet/article-abstract/70/1/57/240898?redirectedFrom=fulltext) performed on dense data. If the data also has an associated phylogeny it can be incorporated through the phylogenetic rclr, details can be found [here](https://github.com/biocore/gemelli/blob/master/ipynb/tutorials/Phylogenetic-RPCA-moving-pictures.ipynb).

Expand All @@ -20,37 +20,50 @@ To install the most up to date version of gemelli, run the following command

Gemelli can be run standalone or through [QIIME2](https://qiime2.org/) and as a python API or CLI.

## Cross-sectional study (i.e. one sample per subject) with RPCA
## Cross-sectional / multi-omics study (i.e. one sample per subject) with RPCA

If you have a [cross-sectional study design](https://en.wikipedia.org/wiki/Cross-sectional_study) with only one sample per subject then RPCA is the appropriate method to use in gemelli. There are two commands within RPCA. The first is `rpca` and the second is `auto-rpca`. The only difference is that `auto-rpca` automatically estimates the underlying-rank of the matrix and requires no input for the `n_components` parameter. In the `rpca` command the `n_components` must be set explicitly. For examples of using RPCA we provide tutorials below exploring the microbiome between body sites.
If you have a [cross-sectional study design](https://en.wikipedia.org/wiki/Cross-sectional_study) with only one sample per subject then RPCA is the appropriate method to use in gemelli. For examples of using RPCA we provide tutorials below exploring the microbiome between body sites.

Joint-RPCA allows for the exploration of those feature that seperate jointly across sample groupings and the potential interactions of those features.

### Tutorials

#### Tutorials with QIIME2

* [RPCA QIIME2 CLI](https://github.com/biocore/gemelli/blob/master/ipynb/tutorials/RPCA-moving-pictures.ipynb)
* [Phylogenetic RPCA QIIME2 CLI](https://github.com/biocore/gemelli/blob/master/ipynb/tutorials/Phylogenetic-RPCA-moving-pictures.ipynb)
* [Joint-RPCA QIIME2 CLI](https://github.com/biocore/gemelli/blob/master/ipynb/tutorials/Joint-RPCA-QIIME2-CLI.ipynb)
* [Joint-RPCA QIIME2 API](https://github.com/biocore/gemelli/blob/master/ipynb/tutorials/Joint-RPCA-QIIME2-API.ipynb)


#### Standalone tutorial outside of QIIME2

* [RPCA Python API & CLI](https://github.com/biocore/gemelli/blob/master/ipynb/tutorials/RPCA-moving-pictures-standalone-cli-and-api.ipynb)
* [Joint-RPCA API & CLI](https://github.com/biocore/gemelli/blob/master/ipynb/tutorials/Joint-RPCA-CLI-API.ipynb)

## Repeated measures study (i.e. multiple sample per subject) with CTF
## Repeated measures study (i.e. multiple sample per subject) with CTF & TEMPTED

### Tutorials

If you have a [repeated measures study design](https://en.wikipedia.org/wiki/Repeated_measures_design) with multiple samples per subject over time or space then CTF is the appropriate method to use in gemelli. For optimal results CTF requires samples for each subject in each time or space measurement. In some cases, this can require binning time my larger windows (e.g. instead of days use months). For examples of using CTF we provide a microbiome time series IBD study in the tutorials below.
If you have a [repeated measures study design](https://en.wikipedia.org/wiki/Repeated_measures_design) with multiple samples per subject over time or space then CTF is the appropriate method to use in gemelli. For optimal results CTF requires samples for each subject in each time or space measurement. If that is not the case and your study has irregular time sampling, then TEMPTED should be used. TEMPTED also allows for the projection of new data into an existing factorization which is necessary for machine learning. For examples, explore the tutorials below.

#### Tutorials with QIIME2

* [CTF QIIME2 CLI](https://github.com/biocore/gemelli/blob/master/ipynb/tutorials/IBD-Tutorial-QIIME2-CLI.md)
* [CTF QIIME2 API](https://github.com/biocore/gemelli/blob/master/ipynb/tutorials/IBD-Tutorial-QIIME2-API.ipynb)
* [Phylogenetic CTF QIIME2 CLI](https://github.com/biocore/gemelli/blob/master/ipynb/tutorials/Phylogenetic-IBD-Tutorial-QIIME2-CLI.ipynb)
* [Phylogenetic CTF QIIME2 API](https://github.com/biocore/gemelli/blob/master/ipynb/tutorials/Phylogenetic-IBD-Tutorial-QIIME2-API.ipynb)
* [TEMPTED QIIME2 CLI](https://github.com/biocore/gemelli/blob/master/ipynb/tutorials/TEMPTED-QIIME2-CLI.ipynb)
* [TEMPTED QIIME2 API](https://github.com/biocore/gemelli/blob/master/ipynb/tutorials/TEMPTED-QIIME2-API.ipynb)

#### Standalone tutorial outside of QIIME2

* [CTF Standalone Python API](https://github.com/biocore/gemelli/blob/master/ipynb/tutorials/IBD-Tutorial-standalone-API.ipynb)
* [TEMPTED R implementation - Intallation and tutorials](https://github.com/pixushi/tempted)

## Performing parameter optimization and QC on results

For an introduction to these QC methods see the tutorial [here](https://github.com/biocore/gemelli/blob/master/ipynb/tutorials/RPCA-CV-QC-introduction.ipynb). Examples are also provided in the RPCA tutorials [here (RPCA QIIME2 CLI)](https://github.com/biocore/gemelli/blob/master/ipynb/tutorials/RPCA-moving-pictures.ipynb) & [here (RPCA Python API & CLI)](https://github.com/biocore/gemelli/blob/master/ipynb/tutorials/RPCA-moving-pictures-standalone-cli-and-api.ipynb). Users are encrouaged to report the QC/CV results for thier data.

# Citations

Expand Down Expand Up @@ -95,6 +108,52 @@ Martino, C. et al. A Novel Sparse Compositional Technique Reveals Microbial Pert
}
```


## Citation for Phylogenetic RPCA

```
Martino, C. et al. A Novel Sparse Compositional Technique Reveals Microbial Perturbations. mSystems 4, (2019)
```

```
@ARTICLE{Martino2022,
author = {Martino, Cameron and McDonald, Daniel and Cantrell, Kalen and
Dilmore, Amanda Hazel and Vázquez-Baeza, Yoshiki and Shenhav,
Liat and Shaffer, Justin P and Rahman, Gibraan and Armstrong,
George and Allaband, Celeste and Song, Se Jin and Knight, Rob},
title = {Compositionally Aware Phylogenetic {Beta-Diversity} Measures
Better Resolve Microbiomes Associated with Phenotype},
volume = {7},
number = {3},
elocation-id = {e0005022},
year = {2022},
doi = {10.1128/msystems.00050-22},
publisher = {American Society for Microbiology Journals},
URL = {http://dx.doi.org/10.1128/msystems.00050-22},
journal = {mSystems},
}
```

## Citation for TEMPTED

```
Shi, p. et al. Time-Informed Dimensionality Reduction for Longitudinal Microbiome Studies. bioRxiv, (2023)
```

```
@ARTICLE{Shi2023,
author = {Shi, Pixu and Martino, Cameron and Han, Rungang and Janssen,
Stefan and Buck, Gregory and Serrano, Myrna and Owzar, Kouros and
Knight, Rob and Shenhav, Liat and Zhang, Anru R},
title = {{Time-Informed} Dimensionality Reduction for Longitudinal
Microbiome Studies},
year = {2023},
doi = {10.1101/2023.07.26.550749},
URL = {https://www.biorxiv.org/content/10.1101/2023.07.26.550749v1},
journal = {bioRxiv},
}
```

## Other Resources

- The compositional data [wiki](https://en.wikipedia.org/wiki/Compositional_data)
Expand Down
2 changes: 1 addition & 1 deletion gemelli/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -6,4 +6,4 @@
# The full license is in the file COPYING.txt, distributed with this software.
# ----------------------------------------------------------------------------

__version__ = "0.0.9"
__version__ = "0.0.10"
88 changes: 87 additions & 1 deletion gemelli/_defaults.py
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,6 @@
# descriptions. This is used by both the standalone RPCA and QIIME 2 RPCA sides
# of gemelli.


DEFAULT_MTD = 0
DEFAULT_BL = None
DEFAULT_COMP = 3
Expand All @@ -21,6 +20,20 @@
DEFAULT_TENSALS_MAXITER = 25
DEFAULT_FMETA = None
DEFAULT_COND = None
DEFAULT_METACV = None
DEFAULT_COLCV = None
DEFAULT_TESTS = 10
DEFAULT_MATCH = True
DEFAULT_TRNSFRM = True
DEFAULT_TEMPTED_PC = 1
DEFAULT_TEMPTED_EP = 1e-4
DEFAULT_TEMPTED_SMTH = 1e-6
DEFAULT_TEMPTED_RES = 101
DEFAULT_TEMPTED_MAXITER = 20
DEFAULT_TEMPTED_RH = 'random'
DEFAULT_TEMPTED_RHC = 'random'
DEFAULT_TEMPTED_SVDC = True
DEFAULT_TEMPTED_SVDCN = 1
DESC_BIN = ("The feature table containing the "
"samples over which metric should be computed.")
DESC_COUNTS = ("The feature table in biom format containing the "
Expand Down Expand Up @@ -103,4 +116,77 @@
"sum of all children of that node (i.e. Fast UniFrac).")
QBIPLOT = "A biplot of the (Robust Aitchison) RPCA feature loadings"
QADIST = "The Aitchison distance of the sample loadings from RPCA."
QACV = "The cross-validation reconstruction error."
QRCLR = "A rclr transformed table. Note: zero/missing values have NaNs"
DESC_METACV = ("Sample metadata file in QIIME2 formatting. "
"Containing the columns with training and test labels.")
DESC_COLCV = ("Sample metadata column containing `train` and `test`"
" labels to use for the cross-validation evaluation.")
DESC_TESTS = ("Number of random samples to choose for test split samples "
"if sample metadata and a train-test column are not provided.")
DESC_TABLES = ("The collection of feature tables containing shared "
"samples over which metric should be computed.")
DESC_TRAINTABLES = ("The tables to be projected on the first"
" principal components previously extracted"
" from a training set.")
DESC_TRAINTABLE = ("The table to be projected on the first"
" principal components previously extracted"
" from a training set.")
DESC_TRAINORDS = ("A joint-biplot of the (Robust Aitchison) RPCA"
" feature loadings produced from the training data.")
DESC_TRAINORD = ("A biplot of the (Robust Aitchison) RPCA"
" feature loadings produced from the training data.")
DESC_MATCH = ("Subsets the input tables to contain only features used in the"
" training data. If set to False and the tables are not"
" perfectly. Matched a ValueError will be produced.")
DESC_TRNSFRM = ("If set to false the function will expect `tables`"
"to be dataframes already rclr transformed."
" This is used for internal functionality in the "
"joint-rpca function and is set to be only False.")
DESC_MTABLE = ("The biplot ordination in"
" skbio.OrdinationResults format to match.")
DESC_MORD = ("The feature table in biom format containing the"
"samples and features to match the ordination with.")
DESC_FM = "If set to True the features in the ordination will be matched."
DESC_SM = "If set to True the samples in the ordination will be matched."
DESC_MORDOUT = "A subset biplot with the input table."
DESC_CORRTBLORD = ("A joint-biplot or subset joint-biplot"
" of the (Robust Aitchison) RPCA feature loadings.")
DESC_CORRTBL = "A feature by feature correlation table."
DESC_TCOND = ("Metadata column containing time points"
" across which samples are paired.")
DESC_REP = ('Choose how replicate samples are handled. If replicates are'
'detected, "error" causes method to fail; "drop" will discard'
' all replicated samples; "random" chooses one representative at'
' random from among replicates.')
DESC_SVD = "Removes the mean structure of the temporal tensor."
DESC_SVDC = "Rank of approximation for average matrix in svd-centralize."
DESC_RES = ("Number of time points to evaluate the value"
" of the temporal loading function.")
DESC_SMTH = ("Smoothing parameter for RKHS norm. Larger means "
"smoother temporal loading functions.")
DESC_MXTR = "Maximum number of iteration in for rank-1 calculation."
DESC_EPS = ("Convergence criteria for difference between iterations "
"for each rank-1 calculation.")
DESC_IO = ("Compositional biplot of subjects as points and"
" features as arrows. Where the variation between"
" subject groupings is explained by the log-ratio"
" between opposing arrows.")
DESC_PIO = ("Compositional biplot of subjects as points from"
" new data projected into a pre-generated space.")
DESC_SLO = ("Each components temporal loadings across the"
"input resolution included as a column called"
"'time_interval'.")
DESC_SVDO = ("The loadings from the SVD centralize"
" function, used for projecting new data.")
DESC_TDIST = "Subject by subject distance matrix (not samples)."
DESC_PC = ("The pseudocount to add to the table before applying the "
"transformation. Default is zero which will add the"
" minimum non-zero value to all the values.")
DESC_TJNT = ("If flagged Joint-RPCA will not use the RCLR "
"transformation and will instead assume that "
"the data has already been transformed or "
"normalized. Disabling the RCLR step will also "
"disable any filtering steps. It will be expected "
"that all filtering was done previously."
"Default is to use RCLR.")
15 changes: 0 additions & 15 deletions gemelli/factorization.py
Original file line number Diff line number Diff line change
Expand Up @@ -6,14 +6,12 @@
# The full license is in the file COPYING.txt, distributed with this software.
# ----------------------------------------------------------------------------

import warnings
import numpy as np
import pandas as pd
from scipy.linalg import svd
from numpy.linalg import norm
from .base import _BaseImpute
from scipy.spatial import distance
from gemelli.optspace import rank_estimate


class TensorFactorization(_BaseImpute):
Expand Down Expand Up @@ -524,19 +522,6 @@ def tenals(tensor,
tensor_dimensions = tensor.shape
# Frobenius norm initial for ALS minimization.
initial_tensor_frobenius_norm = norm(tensor)**2
# rank est. (for each slice)
if len(tensor_dimensions) == 3: # only for 3D
for i in range(tensor_dimensions[0]):
obs_tmp = tensor[i, :, :].copy()
total_nonzeros = np.count_nonzero(mask[i, :, :].copy())
n_, m_ = obs_tmp.shape
eps_tmp = total_nonzeros / np.sqrt(n_ * m_)
if min(obs_tmp.shape) <= 2:
# two-subjects/time is already low-rank
continue
if rank_estimate(obs_tmp, eps_tmp) >= (min(obs_tmp.shape) - 1):
warnings.warn('A component of your data may be high-rank.',
RuntimeWarning)
# initial by Robust Tensor Power Method (modified for non-symmetric
# tensors).
initial_eigvals, initial_loadings = robust_tensor_power_method(
Expand Down
10 changes: 3 additions & 7 deletions gemelli/matrix_completion.py
Original file line number Diff line number Diff line change
Expand Up @@ -117,12 +117,8 @@ def _fit(self):
raise ValueError("max_iterations must be at least 1")

# check the settings for n_components
if isinstance(self.n_components, str) and \
self.n_components.lower() == 'auto':
# estimate the rank of the matrix
self.n_components = 'auto'
# check hardset values
elif isinstance(self.n_components, int):
if isinstance(self.n_components, int):
if self.n_components > (min(n, m) - 1):
raise ValueError("n-components must be at most"
" 1 minus the min. shape of the"
Expand All @@ -133,7 +129,7 @@ def _fit(self):
# otherwise rase an error.
else:
raise ValueError("n-components must be "
"an interger or 'auto'")
"an interger.")

# return solved matrix
self.U, self.s, self.V = OptSpace(n_components=self.n_components,
Expand Down Expand Up @@ -163,7 +159,7 @@ def fit_transform(self, X, y=None):
having right singular vectors as rows. Of shape (N,n_components)

"""
X_sparse = X.copy().astype(np.float64)
X_sparse = X.copy().astype(float)
self.X_sparse = X_sparse
self._fit()
return self.sample_weights
Loading
Loading