Anndata Object with counts? #23

Closed
Demond-dev opened this issue Jan 12, 2025 · 13 comments
Labels: help wanted (Extra attention is needed)

Comments

@Demond-dev commented Jan 12, 2025

Hi,

I was wondering if there is any way to incorporate the output from the ENACT pipeline into an AnnData object with the binned counts and the whole-slide image, for use in spatial plots with other packages? I've tried running the pipeline on data our lab generated, and it only produced an AnnData object with 1000 variables and no image to use for spatial plotting.

Thanks!

@LucFrancisENX commented Jan 28, 2025

Hi @Demond-dev, you can try this function, which I adapted from the bin2cell read_visium function. It reverses the changes ENACT makes to the pixel coordinates so that the spots fit the high-resolution image.

All you need is:
object_path: the path to the ENACT adata .h5 output file
spaceranger_path: the path to the Space Ranger 002_binned output directory for the Visium data files
source_image_path: the path to your .btf file

adata = read_visium_enact(enact_object_path=object_path,
                          spaceranger_path=spaceranger_path,
                          source_image_path=source_image_path)
# assumed imports (as in scanpy / bin2cell's read_visium, which this adapts):
# from pathlib import Path
# import json
# import pandas as pd
# from anndata import AnnData, read_h5ad
# from matplotlib.image import imread
# from scanpy import logging as logg
def read_visium_enact(
    enact_object_path: Path | str,
    spaceranger_path: Path | str,
    genome: str | None = None,
    *,
    count_file: str = "filtered_feature_bc_matrix.h5",
    library_id: str | None = None,
    load_images: bool | None = True,
    source_image_path: Path | str | None = None,
    spaceranger_image_path: Path | str | None = None,
) -> AnnData:
    """\
    Read 10x-Genomics-formatted Visium dataset with ENACT pipeline output.

    In addition to reading regular 10x output,
    this looks for the `spatial` folder and loads images,
    coordinates and scale factors.
    Based on the `Space Ranger output docs`_.

    See :func:`~scanpy.pl.spatial` for a compatible plotting function.

    .. _Space Ranger output docs: https://support.10xgenomics.com/spatial-gene-expression/software/pipelines/latest/output/overview

    The b2c.read_visium function has been adapted with the following changes:

    1. Read in the ENACT output h5 file.
    2. When reading the Space Ranger output, find the pixel boundaries of the cropped image.
    3. Add the pixel-boundary minimum values to each cell coordinate in the ENACT output.

    This should result in an ENACT output object complete with associated image data, ready for squidpy downstream analysis.

    Parameters
    ----------
    enact_object_path
        Path to the ENACT output h5 file.
        Should be in a format similar to: ".../output_files/{study}/tmap/{segmentation-method}|cell_typist_cells_adata.h5" (if the celltypist step was performed)
    spaceranger_path
        Path to the Space Ranger 002_binned output directory for the Visium data files.
    genome
        Filter expression to genes within this genome.
    count_file
        Which file in the passed directory to use as the count file. Typically one of:
        'filtered_feature_bc_matrix.h5' or 'raw_feature_bc_matrix.h5'.
    library_id
        Identifier for the Visium library. Can be modified when concatenating multiple adata objects.
    load_images
        Whether to load the Space Ranger images and spatial metadata.
    source_image_path
        Path to the high-resolution tissue image. This would be your btf file.
    spaceranger_image_path
        Path to the folder containing the Space Ranger hires/lowres tissue images. If `None`,
        defaults to the `spatial` folder of the provided `spaceranger_path`.

    Returns
    -------
    Annotated data matrix, where observations/cells are named by their
    barcode and variables/genes by gene name. Stores the following information:

    :attr:`~anndata.AnnData.X`
        The data matrix.
    :attr:`~anndata.AnnData.obs_names`
        Cell names
    :attr:`~anndata.AnnData.var_names`
        Gene names for a feature barcode matrix, probe names for a probe bc matrix
    # To be added:
    :attr:`~anndata.AnnData.var`\\ `['gene_ids']`
        Gene IDs
    :attr:`~anndata.AnnData.var`\\ `['feature_types']`
        Feature types
    #
    :attr:`~anndata.AnnData.var`
        Any additional metadata present in /matrix/features is read in.
    :attr:`~anndata.AnnData.uns`\\ `['spatial']`
        Dict of spaceranger output files with 'library_id' as key
    :attr:`~anndata.AnnData.uns`\\ `['spatial'][library_id]['images']`
        Dict of images (`'hires'` and `'lowres'`)
    :attr:`~anndata.AnnData.uns`\\ `['spatial'][library_id]['scalefactors']`
        Scale factors for the spots
    :attr:`~anndata.AnnData.uns`\\ `['spatial'][library_id]['metadata']`
        Files metadata: 'chemistry_description', 'software_version', 'source_image_path'
    :attr:`~anndata.AnnData.obsm`\\ `['spatial']`
        Spatial spot coordinates, usable as `basis` by :func:`~scanpy.pl.embedding`.
    """
 

    path = Path(spaceranger_path)
    #if not provided, assume the hires/lowres images are in the same folder as everything
    #except in the spatial subdirectory
    if spaceranger_image_path is None:
        spaceranger_image_path = path / "spatial"
    else:
        spaceranger_image_path = Path(spaceranger_image_path) 

    adata = read_h5ad(enact_object_path)   

    adata.uns["spatial"] = dict()

    from h5py import File

    with File(path / count_file, mode="r") as f:
        attrs = dict(f.attrs)
    if library_id is None:
        library_id = str(attrs.pop("library_ids")[0], "utf-8")

    adata.uns["spatial"][library_id] = dict()

    with File(path / count_file, mode="r") as f:
        # Extract feature metadata from the HDF5 count file
        feature_names = f['matrix/features/name'][:]  # load the full dataset
        gene_ids = f['matrix/features/id'][:]
        feature_types = f['matrix/features/feature_type'][:]
        genomes = f['matrix/features/genome'][:]  # renamed so it does not shadow the `genome` parameter

    # Decode byte strings (if necessary) and create a pandas DataFrame
    adata_var = pd.DataFrame({
        "gene_ids": [x.decode('utf-8') for x in gene_ids],
        "feature_types": [x.decode('utf-8') for x in feature_types],
        "genome": [x.decode('utf-8') for x in genomes]
    })

    # Set feature names as the index (assuming they are unique)
    adata_var.index = [x.decode('utf-8') for x in feature_names]

    # Some gene names are duplicated in adata_var; troubleshooting turned up TBCE, HSPA14 and TMSB15B.
    # Since they are not linked to any gene expression data, drop the duplicated
    # rows based on the index, keeping only the first occurrence of each.
    adata_var = adata_var[~adata_var.index.duplicated(keep='first')]

    adata.var = adata_var

    if load_images:
        tissue_positions_file = (
            path / "spatial/tissue_positions.csv"
            if (path / "spatial/tissue_positions.csv").exists()
            else path / "spatial/tissue_positions.parquet" if (path / "spatial/tissue_positions.parquet").exists()
            else path / "spatial/tissue_positions_list.csv"
        )
        files = dict(
            tissue_positions_file=tissue_positions_file,
            scalefactors_json_file=path / "spatial/scalefactors_json.json",
            hires_image=spaceranger_image_path / "tissue_hires_image.png",
            lowres_image=spaceranger_image_path / "tissue_lowres_image.png",
        )
        # check if files exist; continue if only images are missing
        for f in files.values():
            if not f.exists():
                if any(x in str(f) for x in ["hires_image", "lowres_image"]):
                    logg.warning(
                        f"You seem to be missing an image file.\n"
                        f"Could not find '{f}'."
                    )
                else:
                    raise OSError(f"Could not find '{f}'")

        adata.uns["spatial"][library_id]["images"] = dict()
        for res in ["hires", "lowres"]:
            try:
                adata.uns["spatial"][library_id]["images"][res] = imread(
                    str(files[f"{res}_image"])
                )
            except Exception:
                raise OSError(f"Could not find '{res}_image'")

        # read json scalefactors
        adata.uns["spatial"][library_id]["scalefactors"] = json.loads(
            files["scalefactors_json_file"].read_bytes()
        )

        adata.uns["spatial"][library_id]["metadata"] = {
            k: (str(attrs[k], "utf-8") if isinstance(attrs[k], bytes) else attrs[k])
            for k in ("chemistry_description", "software_version")
            if k in attrs
        }  

        # read coordinates
        if files["tissue_positions_file"].name.endswith(".csv"):
            positions = pd.read_csv(
                files["tissue_positions_file"],
                header=0 if tissue_positions_file.name == "tissue_positions.csv" else None,
                index_col=0,
            )
        elif files["tissue_positions_file"].name.endswith(".parquet"):
            positions = pd.read_parquet(files["tissue_positions_file"])
            #need to set the barcode to be the index
            positions.set_index("barcode", inplace=True)
        positions.columns = [
            "in_tissue",
            "array_row",
            "array_col",
            "pxl_row_in_fullres",  # in the original bin2cell.read_visium this column was set to pxl_col_in_fullres
            "pxl_col_in_fullres",  # in the original bin2cell.read_visium this column was set to pxl_row_in_fullres
        ]
        # see link below for structure of tissue_positions_file.parquet file
        # https://www.10xgenomics.com/support/software/space-ranger/latest/analysis/outputs/spatial-outputs

        # cleaning up: keep in-tissue bins only, then drop negative coordinates
        # (.copy() avoids pandas SettingWithCopyWarning on the assignments below)
        tissue_pos_list_filt = positions[positions.in_tissue == 1].copy()
        tissue_pos_list_filt["pxl_row_in_fullres"] = tissue_pos_list_filt[
            "pxl_row_in_fullres"
        ].astype(int)
        tissue_pos_list_filt["pxl_col_in_fullres"] = tissue_pos_list_filt[
            "pxl_col_in_fullres"
        ].astype(int)
        tissue_pos_list_filt = tissue_pos_list_filt.loc[
            (tissue_pos_list_filt.pxl_row_in_fullres >= 0)
            & (tissue_pos_list_filt.pxl_col_in_fullres >= 0)
        ]
        x_min = tissue_pos_list_filt["pxl_col_in_fullres"].min()
        y_min = tissue_pos_list_filt["pxl_row_in_fullres"].min()
        # the max values are computed for reference; only the minimums are used below
        x_max = tissue_pos_list_filt["pxl_col_in_fullres"].max()
        y_max = tissue_pos_list_filt["pxl_row_in_fullres"].max()

        # adding crop boundary minimums to the enact output cell coordinates
        adata.obsm['spatial']['cell_x'] = (adata.obsm['spatial']['cell_x'] + x_min) 
        adata.obsm['spatial']['cell_y'] = (adata.obsm['spatial']['cell_y'] + y_min) 

        # convert to an array for compatibility with squidpy
        adata.obsm['spatial'] = adata.obsm['spatial'].to_numpy()

        # put image path in uns
        if source_image_path is not None:
            # get an absolute path
            source_image_path = str(Path(source_image_path).resolve())
            adata.uns["spatial"][library_id]["metadata"]["source_image_path"] = str(
                source_image_path
            )
        return adata
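
With the function defined, a quick sanity check that the spots line up with the tissue image could look like this (a sketch: it assumes the ENACT object carries a cell_type column in .obs from the cell-typing step, and spot_size will likely need tuning for your data):

import scanpy as sc

adata = read_visium_enact(
    enact_object_path=object_path,
    spaceranger_path=spaceranger_path,
    source_image_path=source_image_path,
)
# plot cells over the Space Ranger hires image now stored in uns["spatial"]
sc.pl.spatial(adata, img_key="hires", color="cell_type", spot_size=30)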

@LucFrancisENX

This was used on the output of the 10x Visium HD human colon cancer dataset, and I was able to perform standard downstream processing using scanpy/squidpy.

One issue I am having is that if I try to save to raw beforehand (as below), my kernel crashes when running any downstream analysis (e.g. filter_cells).

adata.raw = adata.copy()

Would anyone have a solution for this, please? I don't encounter any such issues with the bin2cell output.
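
A possible workaround to try (a sketch, assuming the crash comes from duplicating a large dense matrix; converting .X to sparse before the snapshot avoids holding two dense copies in memory):

import numpy as np
from scipy.sparse import csr_matrix, issparse

# only convert if .X is still dense; a CSR float32 copy is far smaller
if not issparse(adata.X):
    adata.X = csr_matrix(adata.X).astype(np.float32)
adata.raw = adata.copy()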

@AlbertPlaPlanas (Contributor)

@Demond-dev in our latest release we added a notebook example that may help you visualize the outputs on an image: https://github.com/Sanofi-Public/enact-pipeline/blob/main/ENACT_outputs_demo.ipynb (note that GitHub does not render the image, as it is an interactive HTML plot, but once you load the notebook in Jupyter or VS Code it should be visible).

Also, the latest release adds compatibility with TissUUmaps for output visualization. The execution generates a .tmap file that lets you load your results in TissUUmaps: https://github.com/Sanofi-Public/enact-pipeline/blob/main/ENACT_outputs_demo.ipynb

@AlbertPlaPlanas (Contributor)

… was able to perform standard downstream processing using scanpy/squidpy. One issue I am having is that if I try to …

Have you checked your memory usage? It seems you may be running out of memory.
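
For reference, one quick way to check from inside the notebook (a sketch; assumes the psutil package is installed):

import psutil

# resident set size of the current Python process, in GB
rss_gb = psutil.Process().memory_info().rss / 1e9
print(f"current process memory: {rss_gb:.1f} GB")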

@AlbertPlaPlanas added the 'help wanted' label Jan 29, 2025
@LucFrancisENX

Have you checked your memory usage? It seems you may be running out of memory.

@AlbertPlaPlanas It does seem to be linked to memory usage, but I am not sure why, because the standard 8um binned object or a bin2cell object both run without any memory issues. Is there something different about the structure of the ENACT output object that would make it require more memory?

Thanks for your help

@AlbertPlaPlanas (Contributor) commented Jan 29, 2025

My first thought is that if you keep your ENACT object in memory after analysis, it may be eating a lot of your memory, as it will have all the raw counts, the bin-to-cell assignment, the whole-slide image, and a few other heavy elements stored in memory.

So, assuming you are doing something like this:

from enact.pipeline import ENACT

so_hd = ENACT(
  ...
)
so_hd.run_enact()
...

# Loading Anndata object
adata = sc.read_h5ad(cells_adata_path)

Could you try doing something like this instead?

from enact.pipeline import ENACT

so_hd = ENACT(
  ...
)
so_hd.run_enact()
...
del so_hd
import gc
gc.collect()

# Loading Anndata object
adata = sc.read_h5ad(cells_adata_path)

It may not be the most elegant solution, but let us know if it works so we can better manage this in future releases.

@LucFrancisENX commented Jan 29, 2025

so_hd = ENACT(
...
)
so_hd.run_enact()
...
del so_hd
import gc
gc.collect()

# Loading Anndata object

adata = sc.read_h5ad(cells_adata_path)

I have been running the downstream analysis after finishing the run_enact script, closing it, and loading a fresh environment with free memory, so I don't think this is the problem. I can try trimming the final anndata object from the "read_visium_enact" function I posted above, to see if that reduces the size.

I also noticed that the structure of adata.X from the ENACT output is different from that of an 8um bin or bin2cell object, which could be affecting the processing time (from the 10x Visium Human Colon Cancer dataset).

adata_8um.X

<Compressed Sparse Row sparse matrix of dtype 'float32'
with 209957484 stored elements and shape (418325, 18085)>

adata_b2c.X

<Compressed Sparse Row sparse matrix of dtype 'float64'
with 180551398 stored elements and shape (235546, 18059)>

adata_enact.X

array([[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 1, 0, 1],
[0, 0, 0, ..., 3, 0, 0],
...,
[0, 0, 0, ..., 0, 0, 2],
[0, 0, 0, ..., 1, 0, 7],
[0, 0, 0, ..., 1, 2, 3]])
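
The in-memory footprint of the two representations can also be compared directly (a sketch; for sparse matrices, .data.nbytes counts only the stored nonzero values):

# dense ndarray: every entry is stored
print(f"enact (dense): {adata_enact.X.nbytes / 1e9:.1f} GB")
# CSR matrix: only nonzero values (plus their indices) are stored
print(f"8um (sparse): {adata_8um.X.data.nbytes / 1e9:.1f} GB")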

@AlbertPlaPlanas (Contributor)

Thanks, this helps. Just one question before we revise this: what is your current setup?

@LucFrancisENX

Thanks, this helps. Just one question before we revise this: what is your current setup?

I have 36GB RAM on my local computer, with access to larger remote computing resources if needed.

Thanks again

@Mena-SA-Kamel (Contributor)

I also noticed that the structure of the adata.X from the enact output was different compared to a 8um bin or bin2cell object …

Thank you for pointing this issue out. We will update ENACT's df_to_adata() function to save the transcript counts as a sparse matrix. In the meantime, update df_to_adata() in enact-pipeline/src/package_results.py to the following:

# assumes anndata, numpy (np) and scipy.sparse.csr_matrix are already imported in package_results.py
def df_to_adata(self, results_df, cell_by_gene_df):
    """Converts pd.DataFrame objects with pipeline results to AnnData

    Args:
        results_df (pd.DataFrame): per-cell pipeline results (coordinates, bin stats, cell types)
        cell_by_gene_df (pd.DataFrame): cell-by-gene transcript counts

    Returns:
        anndata.AnnData: Anndata with pipeline outputs
    """
    file_columns = results_df.columns
    spatial_cols = ["cell_x", "cell_y"]
    stat_columns = ["num_shared_bins", "num_unique_bins", "num_transcripts"]
    results_df.loc[:, "id"] = results_df["id"].astype(str)
    results_df = results_df.set_index("id")
    results_df["num_transcripts"] = results_df["num_transcripts"].fillna(0)
    results_df["cell_type"] = results_df["cell_type"].str.lower()
    adata = anndata.AnnData(cell_by_gene_df.set_index("id"))
    adata.obs = adata.obs.merge(results_df, on="id").drop_duplicates(keep='first')

    adata.obsm["spatial"] = adata.obs[spatial_cols].astype(int)
    adata.obsm["stats"] = adata.obs[stat_columns].astype(int)

    # This column is the output of the cell type inference pipeline
    adata.obs["cell_type"] = adata.obs[["cell_type"]].astype("category")
    adata.obs["patch_id"] = adata.obs[["chunk_name"]]
    adata.obs = adata.obs[["cell_type", "patch_id"]]

    # Converting the cell transcript counts to sparse format for more efficient storage
    adata.X = csr_matrix(adata.X).astype(np.float32)  # *FIX FOR MEMORY ISSUES*
    return adata
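
For a sense of scale, a back-of-the-envelope comparison using the matrix sizes quoted earlier in this thread (illustrative arithmetic only):

# bin2cell-scale example from this thread: 235546 x 18059 with ~180.5M nonzeros
n_obs, n_var, nnz = 235_546, 18_059, 180_551_398
dense_gb = n_obs * n_var * 8 / 1e9  # float64 stores 8 bytes per entry
# CSR float32 with int32 indices: ~8 bytes per nonzero, plus row pointers
sparse_gb = (nnz * (4 + 4) + (n_obs + 1) * 4) / 1e9
print(f"dense float64: ~{dense_gb:.0f} GB, CSR float32: ~{sparse_gb:.1f} GB")  # ~34 GB vs ~1.4 GB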

ENACT's next release will include this fix.

@Mena-SA-Kamel (Contributor)

I was wondering if there was any way to incorporate the output from the ENACT pipeline into an anndata object … it only resulted in an anndata object with 1000 variables and no image to use for spatial plotting.

The reason you are only seeing 1000 variables is that you likely have the 'use_hvg' flag set to True, which restricts bin-to-cell assignment to the top 1000 highly variable genes. If you want all genes present in the final AnnData object, please set this flag to False. However, note that setting 'use_hvg' to False will cause bin-to-cell-assignment methods such as 'weighted_by_cluster' and 'weighted_by_gene' to take significantly longer and require more memory and storage on your machine.
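
For example (a sketch following the run snippet earlier in this thread; it assumes use_hvg is exposed as a constructor argument, so adjust to however you currently pass your configuration):

from enact.pipeline import ENACT

so_hd = ENACT(
    ...,  # your existing paths and settings
    use_hvg=False,  # keep all genes in the final AnnData (slower, needs more memory)
)
so_hd.run_enact()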

@Mena-SA-Kamel (Contributor)

We will update ENACT's df_to_adata() function to save the transcript counts as a sparse matrix … ENACT's next release will include this fix.

This is now fixed in ENACT version 0.2.2.

@MiZ26 commented Feb 12, 2025

Hello,
Thanks for the recent bug fixes! However, we encountered an OOM (out-of-memory) error when running the pipeline with all 18085 genes. This was due to the large dataframes created when loading the counts from the CSV files in the functions merge_[cellassign/sargent]_output_files (package_results.py) and merge_files (pipeline.py): a dataframe of 300000 cells x 18000 genes with float64 encoding has a size of ~43.2GB, even though the combined disk usage of the sparse CSV files is smaller.

Would it be possible (in a future version) to load the counts from the CSV files directly into a sparse matrix format, and perhaps avoid the intermediate step of first creating a dataframe that is then used to construct the adata object? Something like:

# assumes: import os, numpy as np, scipy.sparse as sp, anndata as ad
adata = None
for patch, df in merged_df.groupby(["chunk_name"], sort=False):
    # load one patch's counts, drop the index column, and sparsify immediately
    transcripts = sp.csr_matrix(
        np.loadtxt(os.path.join(bin_folder, patch[0]), delimiter=",", skiprows=1)[:, 1:]
    )
    adata = ad.AnnData(transcripts) if adata is None else ad.concat([adata, ad.AnnData(transcripts)])

where merged_df would be a dataframe of the merged files from the idx_lookup folder.

Thanks!
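
In case it helps, a sketch of one way to do that with a hypothetical helper (it assumes each counts CSV has an id column followed by gene columns, as implied above; only one row chunk is ever dense in memory):

import pandas as pd
import scipy.sparse as sp

def csv_to_csr(path, chunksize=10_000):
    # stream the wide counts CSV in row chunks and sparsify each chunk
    blocks = []
    for chunk in pd.read_csv(path, index_col=0, chunksize=chunksize):
        blocks.append(sp.csr_matrix(chunk.to_numpy(dtype="float32")))
    return sp.vstack(blocks, format="csr")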
