
Problem uploading samples #701

Open
Zymergen-SBRUBAKER opened this issue Oct 31, 2019 · 36 comments
@Zymergen-SBRUBAKER

Hi, I have an error when uploading samples through the browser. There also does not appear to be any documentation on how to do a manual upload using scp.

@glebkuznetsov
Member

Hi,

Can you describe the error? A common issue is not providing the correct full path to the files in the upload template; the full path to the location each fastq was scp'ed to needs to be provided. Can you share the upload template sheet you are using, showing the paths? (Feel free to provide a partially anonymized screenshot if you prefer.)

Thanks,
Gleb

@Zymergen-SBRUBAKER
Author

Zymergen-SBRUBAKER commented Nov 4, 2019 via email

@glebkuznetsov
Member

Hello,

Looks like the attachment didn't make it through. You might have to use the GitHub Issues interface directly (rather than replying by email) for it to go through. If you have any additional details about the failure, please let me know (e.g. is there a delay before the error appears?).

Sorry we never put up documentation for scp upload; it is in fact the method most users here in the Church Lab use today. Briefly, the steps are something like:

  1. scp files to a location of your choice on the machine running Millstone. E.g., create a data directory /home/ubuntu/raw-data and scp there (see the sketch after these steps).

  2. Through the browser Samples page, use New -> From Server Location...

     [screenshot: the New -> From Server Location... dialog on the Samples page]

  3. Fill in the template linked in that form, with each row representing a sample and giving the full path to its files.
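
For example, a minimal sketch of step 1, assuming paired-end reads for a hypothetical sample named sample_A (adjust the host, key, and filenames to your setup):

    # on your local machine: create the target directory, then copy the FASTQs over
    ssh ubuntu@<millstone-host> 'mkdir -p /home/ubuntu/raw-data'
    scp sample_A_R1.fastq.gz sample_A_R2.fastq.gz \
        ubuntu@<millstone-host>:/home/ubuntu/raw-data/

The paths you would then enter in the template for that sample are /home/ubuntu/raw-data/sample_A_R1.fastq.gz and /home/ubuntu/raw-data/sample_A_R2.fastq.gz.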

Unfortunately, we didn't get around to adding scp instructions to the official docs here:
https://millstone.readthedocs.io/en/latest/user_guide/projects_alignments.html

In case it's helpful, here's a draft of more complete documentation we started writing but never got around to posting. Feel free to glance through it in case something there is useful.
https://docs.google.com/document/d/1tbPiVaaVqECliw5Eu8xBJ8OxpVHynWpo1_kFEFkoJmU/edit?usp=sharing

Thanks!
Gleb

@Zymergen-SBRUBAKER
Author

Zymergen-SBRUBAKER commented Nov 5, 2019 via email

@glebkuznetsov
Member

Hi Shane,

The "queued to copy" state is actually temporary and should resolve with some time. Though usually it's pretty quick.

A couple of other things to check:

  • What EC2 instance type are you using?
  • How much EBS space is there?

Thanks,
Gleb

@Zymergen-SBRUBAKER
Author

Zymergen-SBRUBAKER commented Nov 6, 2019 via email

@glebkuznetsov
Member

Hey Shane,

Good to hear it's working. Sorry the UI affordances are not perfect; we're expert users here, so we hack features as needed :)

As far as instance type, I'd recommend running at least an m5.xlarge to make sure you have enough memory for the alignment and subsequent analysis. I'm also concerned you may run out of disk space; I'd recommend allocating at least 3-5x as much disk as the total size of your FASTQs.

There are a few other tricks we use here to speed things up and make efficient use of AWS, but some of them require manually modifying code and config files, and we haven't really documented this anywhere. For example, when aligning > 50 genomes, we'll normally change the instance type to one with many cores (e.g. c5.9xlarge) and tweak the relevant config setting to use all the cores. This actually turns out to be cheaper than running a smaller instance in a more serial fashion. We'll then change back to a smaller instance type to do analysis / export data.
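
For reference, the instance-type switch itself can be scripted with the AWS CLI; this is just a generic sketch with a placeholder instance id, and the Millstone core-count config change mentioned above still has to be done by hand:

    # stop the instance, switch it to a many-core type, then start it again
    aws ec2 stop-instances --instance-ids i-0123456789abcdef0
    aws ec2 wait instance-stopped --instance-ids i-0123456789abcdef0
    aws ec2 modify-instance-attribute --instance-id i-0123456789abcdef0 \
        --instance-type "{\"Value\": \"c5.9xlarge\"}"
    aws ec2 start-instances --instance-ids i-0123456789abcdef0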

Anyway, happy to discuss/advise further if you're interested in ramping up your microbial genome alignment/analysis pipelines.

Cheers,
Gleb

@Zymergen-SBRUBAKER
Author

Zymergen-SBRUBAKER commented Nov 6, 2019 via email

@glebkuznetsov
Member

Hey Shane,

No variants sounds surprising, and I agree that 2 minutes sounds fast for an alignment plus variant calling. It's hard to tell from the logs whether anything specifically went wrong; I can't remember whether "truncated input" is bad. It is possible the small machine ran out of memory at some point and a process failed in a way that didn't disrupt the rest of the pipeline.

You can look at our unit tests for some example data. For example if you trace through this test https://github.com/churchlab/millstone/blob/master/genome_designer/pipeline/tests/variant_calling/test_variant_calling.py#L138, you'll see variables pointing to an example genome and fastqs:

        self.KNOWN_SUBSTITUTIONS_ROOT = os.path.join(settings.PWD, 'test_data',
                'test_genome_known_substitutions')

        self.TEST_GENOME_FASTA = os.path.join(self.KNOWN_SUBSTITUTIONS_ROOT,
                'test_genome_known_substitutions.fa')

        self.FAKE_READS_FASTQ1 = os.path.join(self.KNOWN_SUBSTITUTIONS_ROOT,
                'test_genome_known_substitutions_0.snps.simLibrary.1.fq')

        self.FAKE_READS_FASTQ2 = os.path.join(self.KNOWN_SUBSTITUTIONS_ROOT,
                'test_genome_known_substitutions_0.snps.simLibrary.2.fq')

And test_data is located here: https://github.com/churchlab/millstone/tree/master/genome_designer/test_data
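
If you want to pull just that example genome and the simulated reads onto your instance to test an upload end to end, something like this should work (assuming the test_data layout on GitHub matches the paths in the snippet above):

    BASE=https://raw.githubusercontent.com/churchlab/millstone/master/genome_designer/test_data/test_genome_known_substitutions
    wget $BASE/test_genome_known_substitutions.fa
    wget $BASE/test_genome_known_substitutions_0.snps.simLibrary.1.fq
    wget $BASE/test_genome_known_substitutions_0.snps.simLibrary.2.fq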

@Zymergen-SBRUBAKER
Author

Zymergen-SBRUBAKER commented Nov 6, 2019 via email

@Zymergen-SBRUBAKER
Author

Zymergen-SBRUBAKER commented Nov 19, 2019 via email

@glebkuznetsov
Member

Hey Shane,

Looks like the attachment didn't make it through to Github. Maybe it's not going through by email? Can you try uploading directly at the issue URL: #701?

Thanks,
Gleb

@Zymergen-SBRUBAKER
Author

Here is the screenshot!
[screenshot: Screen Shot 2019-11-19 at 3 25 39 PM]

@Zymergen-SBRUBAKER
Author

Zymergen-SBRUBAKER commented Nov 22, 2019 via email

@glebkuznetsov
Member

Ah, interesting. It looks like it might be a version issue with either Django or Postgres, though it might also be the data.

Are you running Millstone on AWS using our pre-built AMI?

@Zymergen-SBRUBAKER
Author

Zymergen-SBRUBAKER commented Nov 22, 2019 via email

@glebkuznetsov
Member

Hmmm... I tried running the test data myself on a fresh image using our AMI and it seemed to work:

[screenshot: the test data aligning and calling variants successfully on a fresh instance]

One thing to confirm is that the Postgres on your Millstone instance is the supported version (9.3):

ubuntu@ip-172-30-0-134:~$ psql --version
psql (PostgreSQL) 9.3.15

@Zymergen-SBRUBAKER
Author

Zymergen-SBRUBAKER commented Nov 25, 2019 via email

@glebkuznetsov
Member

Hi Shane,

Hmm.. I was more concerned that the Postgres version had changed (e.g. due to a system update), but that appears not to be the case.

I suspect what might have happened on your end is that variant calling failed, and our UI is not set up very well to reflect this. Reviewing this thread, I was reminded that you are using "the smallest ec2 instance right now. It does say there is about 1.7GB still free in the upper right corner." So I think what might be happening is that the alignment is failing due to running short on either memory or storage.

We typically use at least an m4.2xlarge (32 GB RAM) and at least 3x as much EBS storage as the FASTQ size (or at least 100 GB). That's what I used for my test yesterday. I recall users who tried a smaller instance running into similar issues.

I think a good bet is to retry on a bigger machine (at least m4.2xlarge) with sufficient storage.
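
Before retrying, a quick sanity check on sizes can help (standard commands, nothing Millstone-specific; adjust the FASTQ path to wherever you scp'ed them):

    du -sh /home/ubuntu/raw-data    # total size of the FASTQs
    df -h /                         # free space on the root/EBS volume
    free -h                         # available memory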

-Gleb

@Zymergen-SBRUBAKER
Author

Zymergen-SBRUBAKER commented Nov 25, 2019 via email

@Zymergen-SBRUBAKER
Author

Zymergen-SBRUBAKER commented Jan 13, 2020 via email

@glebkuznetsov
Member

glebkuznetsov commented Jan 14, 2020

Hi Shane,

This isn't well-documented anywhere and the process is a little messy. A better short-term solution might be to spin up a Millstone instance with a bigger disk.

However, if you'd like to try extending to the bigger EBS, I believe the rough steps are:

  • Mount the EBS volume (e.g. at /millstone_data) and make sure you can write to it, following the standard AWS directions for attaching and mounting an EBS volume (see the mount sketch after this list); you'll probably have to update write permissions, i.e. sudo chown -R ubuntu:ubuntu /millstone_data/
  • Move the Millstone files from their previous location to the EBS location:
    mv /home/ubuntu/millstone/genome_designer/temp_data /millstone_data
  • In ~/millstone/genome_designer/conf/local_settings.py, add/update the param MEDIA_ROOT = '/millstone_data/temp_data'
  • Fix symlink required by Jbrowse:
    rm ~/millstone/genome_designer/jbrowse/gd_data
    ln -s /millstone_data/temp_data ~/millstone/genome_designer/jbrowse/gd_data
  • Restart Millstone server and related:
    supervisorctl restart all
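
Here is a rough sketch of the mount step in the first bullet, assuming the new volume shows up as /dev/xvdf and is empty (both assumptions; check with lsblk first, since formatting erases anything already on the volume):

    lsblk                            # confirm the device name of the new volume
    sudo mkfs -t ext4 /dev/xvdf      # only if the volume has no filesystem yet
    sudo mkdir -p /millstone_data
    sudo mount /dev/xvdf /millstone_data
    sudo chown -R ubuntu:ubuntu /millstone_data/
    # add an /etc/fstab entry if you want the mount to survive reboots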

I might have messed up a step or two above, so give that a try if spinning up a new Millstone instance isn't feasible.

@Zymergen-SBRUBAKER
Author

Zymergen-SBRUBAKER commented Jan 15, 2020 via email

@Zymergen-SBRUBAKER
Author

Zymergen-SBRUBAKER commented Jan 15, 2020 via email

@Zymergen-SBRUBAKER
Author

Zymergen-SBRUBAKER commented Jan 15, 2020 via email

@glebkuznetsov
Member

glebkuznetsov commented Jan 15, 2020 via email

@Zymergen-SBRUBAKER
Author

Zymergen-SBRUBAKER commented Jan 17, 2020 via email

@glebkuznetsov
Member

Hi Shane,

Indeed, gd_data needs to be a symlink to the location where Millstone actually stores files, millstone/genome_designer/temp_data; JBrowse just displays the actual BAM files.

You should be able to fix this by removing the gd_data folder and recreating the symlink:

ln -s /home/ubuntu/millstone/genome_designer/temp_data /home/ubuntu/millstone/genome_designer/jbrowse/gd_data
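
Afterwards you can confirm the link points at the right place with:

    ls -l /home/ubuntu/millstone/genome_designer/jbrowse/gd_data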

@Zymergen-SBRUBAKER
Author

Zymergen-SBRUBAKER commented Jan 21, 2020 via email

@Zymergen-SBRUBAKER
Author

Zymergen-SBRUBAKER commented Jan 27, 2020 via email

@glebkuznetsov
Member

Hi Shane,

That's surprising with so few samples, but hard for me to debug. Which strategy for export are you using?

-Gleb

@Zymergen-SBRUBAKER
Author

Zymergen-SBRUBAKER commented Jan 28, 2020 via email

@glebkuznetsov
Member

Hi Shane,

Got it. Indeed there might be some issue with exporting the entire project due to the instance size; it's not a feature we optimized.

However, what most users actually want to export is a .csv of all the called variants and metadata. That should work for your project; you just have to do it from the Analyze view. For example, in the public demo we host, you'd do it from this page:
http://ec2-52-4-236-89.compute-1.amazonaws.com/projects/b4cbc454/analyze/e0f7b0c1/variants?filter=&melt=0#

  1. Click the top checkbox that selects all.
  2. A blue notification appears informing you that only the first 100 results are selected; you probably want to press 'Select all results that match this filter.'
  3. In the dropdown, select 'Export as csv', as shown in the screenshot below:

[screenshot: the dropdown with 'Export as csv' selected]

The rest of the data (.fastq files, generated .bam files, etc.) lives in the Millstone filesystem (the temp_data folder discussed above), so you can browse or scp what you need from there. Folder names are software-generated UIDs, though, so it might take some extra detective work, or you can use the Django shell to query the database for the mapping.
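
For example, to locate the generated BAMs and FASTQs under the default data directory, a plain find works (swap in /millstone_data/temp_data if you moved the data onto a separate EBS volume as described earlier):

    find /home/ubuntu/millstone/genome_designer/temp_data \
        \( -name '*.bam' -o -name '*.fq*' -o -name '*.fastq*' \) -print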

Let me know if that's what you were looking for.

Thanks!
Gleb

@Zymergen-SBRUBAKER
Author

Zymergen-SBRUBAKER commented Feb 7, 2020 via email

@glebkuznetsov
Member

Hi Shane,

Great to hear.

All of the Millstone-related logs get written to files in /var/log/supervisor, specifically millstone-stdout.log or celery-stdout.log.
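
For example, to follow the worker log while a job runs:

    tail -f /var/log/supervisor/celery-stdout.log
    # or the web app log:
    tail -f /var/log/supervisor/millstone-stdout.log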

-Gleb

@Zymergen-SBRUBAKER
Author

Zymergen-SBRUBAKER commented Feb 10, 2020 via email
