Metaproteomic Workflow (v1.0.0)
Summary
The metaproteomics workflow is an end-to-end data processing pipeline for protein identification and characterization using MS/MS data. Briefly, data files generated by the mass spectrometry instrument (.RAW) are converted to mzML, an open data format, using MSConvert. Peptides are identified using MSGF+ together with the associated metagenomic information, supplied as protein sequences in FASTA format. Intensity information for the identified species is extracted using MASIC and combined with the protein information.
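As a rough illustration of the first two steps, the Python sketch below assembles the command lines for the mzML conversion and the database search without running them. The file names, the `build_commands` helper and the jar location are hypothetical. Only the `--mzML` and `-o` options of msconvert and the `-s`, `-d` and `-o` options of MSGF+ are used; the packaged workflow may pass additional parameters. The MASIC step is left out.

```python
from pathlib import Path


def build_commands(raw_file, fasta_file, work_dir, msgfplus_jar="MSGFPlus.jar"):
    """Return the first two pipeline steps as argument lists (nothing is executed)."""
    raw = Path(raw_file)
    out = Path(work_dir)
    mzml = out / (raw.stem + ".mzML")   # msconvert names its output after the input file
    mzid = out / (raw.stem + ".mzid")   # hypothetical name for the search result

    # Step 1: convert the vendor .raw file to mzML with ProteoWizard's msconvert.
    convert = ["msconvert", str(raw), "--mzML", "-o", str(out)]

    # Step 2: search the spectra against the metagenome-derived protein FASTA.
    search = ["java", "-jar", msgfplus_jar,
              "-s", str(mzml), "-d", str(fasta_file), "-o", str(mzid)]
    return [convert, search]


if __name__ == "__main__":
    for command in build_commands("sample.raw", "metagenome.faa", "work"):
        print(" ".join(command))
```

Returning argument lists instead of shell strings keeps the sketch safe to pass to `subprocess.run` later, since no shell quoting is involved.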
Workflow Diagram
Workflow Dependencies
Third party software
| Software | Version or license |
|----------------------------|------------------------------------------|
| MSGFPlus | v20190628 |
| Mzid-To-Tsv-Converter | v1.3.3 |
| PeptideHitResultsProcessor | v1.5.7130 |
| pwiz-bin-windows | x86_64-vc141-release-3_0_20149_b73158966 |
| MASIC | v3.0.7235 |
| sqlite-netFx-full-source | 1.0.111.0 |
| Conda | (3-clause BSD) |
Workflow Availability
The workflow is available on GitHub: https://github.com/microbiomedata/metaPro
The container is available at Docker Hub (microbiomedata/mepro): https://hub.docker.com/r/microbiomedata/mepro
Inputs
- .raw files, a metagenome (FASTA), parameter files for MSGFplus and MASIC, and a contaminant_file
Outputs
- Processing multiple datasets.
.
├── Data/
├── FDR_table.csv
├── Plots/
├── dataset_job_map.csv
├── peak_area_crosstab_by_dataset_id.csv
├── protein_peptide_map.csv
├── specID_table.csv
└── spectra_count_crosstab_by_dataset_id.csv
- Processing single FICUS dataset.
- metadatafile, [Example](https://jsonblob.com/400362ef-c70c-11ea-bf3d-05dfba40675b)
| Keys | Values |
|--------------------|--------------------------------------------------------------------------|
| id | str: "md5 hash of $github_url+$started_at_time+$ended_at_time" |
| name | str: "Metagenome:$proposal_extid_$sample_extid:$sequencing_project_extid" |
| was_informed_by | str: "GOLD_Project_ID" |
| started_at_time | str: "metaPro start-time" |
| ended_at_time | str: "metaPro end-time" |
| type | str: tag: "nmdc:metaPro" |
| execution_resource | str: infrastructure name to run metaPro |
| git_url | str: "url to a release" |
| dataset_id | str: "dataset's unique-id at EMSL" |
| dataset_name | str: "dataset's name at EMSL" |
| has_inputs | json_obj |
| has_outputs | json_obj |
| stats | json_obj |
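The `id` value in the table is described as an MD5 hash over three of the other fields. A minimal Python sketch of that computation follows, assuming the three strings are concatenated in the listed order with no separator and encoded as UTF-8. The table does not spell out either detail, and `activity_id` is a hypothetical helper name.

```python
import hashlib


def activity_id(git_url, started_at_time, ended_at_time):
    """MD5 hex digest of the three values joined in the order given in the table."""
    # Assumption: plain concatenation with no separator, encoded as UTF-8.
    payload = f"{git_url}{started_at_time}{ended_at_time}"
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # Placeholder timestamps, used only to show the call.
    print(activity_id("https://github.com/microbiomedata/metaPro",
                      "2020-01-01T00:00:00", "2020-01-01T01:00:00"))
```

The digest is deterministic, so re-running the workflow with the same release URL and timestamps would yield the same `id`.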
has_inputs:
| Keys | Values |
|------------------|--------------------------------------------------------------------------|
| MSMS_out | str: file_name \| file_size \| checksum |
| metagenome_file | str: file_name \| file_size \| checksum \| int: entry_count (# of gene sequences) \| int: duplicate_count (# of duplicate gene sequences) |
| parameter_files | str: for_masic/for_msgfplus: file_name \| file_size \| checksum (parameter file used for peptide identification search) |
| Contaminant_file | str: file_name \| file_size \| checksum (FASTA containing common contaminants in proteomics) |
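The per-file fields listed under has_inputs (file name, size, checksum, and for the metagenome the entry and duplicate counts) could be gathered along the following lines. This is a sketch under assumptions: the checksum algorithm is not named in the table, so MD5 is used here as a placeholder, and a "duplicate" is taken to mean an entry whose sequence has already appeared. Both helper names are hypothetical.

```python
import hashlib
import os


def describe_file(path):
    """Return (file_name, file_size, checksum) for one file."""
    digest = hashlib.md5()  # Assumption: the checksum algorithm is not stated in the table.
    with open(path, "rb") as handle:
        # Read in 1 MiB chunks so large .raw or FASTA files are not loaded at once.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return os.path.basename(path), os.path.getsize(path), digest.hexdigest()


def count_fasta_entries(path):
    """Return (entry_count, duplicate_count) for a FASTA file."""
    sequences = []
    current = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A header line closes the previous record and starts a new one.
                if current is not None:
                    sequences.append("".join(current))
                current = []
            elif line and current is not None:
                current.append(line)
    if current is not None:
        sequences.append("".join(current))
    # Assumption: a duplicate is any entry whose sequence already appeared earlier.
    duplicate_count = len(sequences) - len(set(sequences))
    return len(sequences), duplicate_count
```

Sequences that wrap over several lines are joined before comparison, so two entries with the same residues but different line wrapping count as duplicates.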
has_outputs:
| Keys | Values |
|----------------------|-----------------------------------------|
| collapsed_fasta_file | str: file_name \| file_size \| checksum |
| resultant_file | str: file_name \| file_size \| checksum |
| data_out_table | str: file_name \| file_size \| checksum |
stats:
| Keys | Values |
|----------------------|--------------------------------------------------------------------------|
| from_collapsed_fasta | int: entry_count (# of unique gene sequences) |
| from_resultant_file | int: total_protein_count |
| from_data_out_table | int: PSM (# of MS/MS spectra matched to a peptide sequence at 5% false discovery rate (FDR)) \| float: PSM_identification_rate (# of peptide-matching MS/MS spectra divided by total spectra searched, at 5% FDR) \| int: unique_peptide_seq_count (# of unique peptide sequences observed in pipeline analysis at 5% FDR) \| int: first_hit_protein_count (# of proteins observed assuming single peptide-to-protein relationships) \| int: mean_peptide_count (mean # of unique peptide sequences matching each identified protein) |
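To make the statistics above concrete, here is a small Python sketch that derives them from a list of peptide-spectrum matches. The row keys (`peptide`, `protein`, `qvalue`) and the `summarize_psms` helper are hypothetical, each row is assumed to be the single best match for one spectrum, and the 5% FDR cut is applied as `qvalue <= 0.05`. The mean is returned unrounded, although the table lists `mean_peptide_count` as an int.

```python
def summarize_psms(psms, total_spectra, fdr=0.05):
    """Compute the from_data_out_table statistics from PSM rows.

    psms: list of dicts with hypothetical keys 'peptide', 'protein', 'qvalue'.
    total_spectra: number of MS/MS spectra that were searched.
    """
    # Keep only matches that pass the FDR threshold.
    passing = [p for p in psms if p["qvalue"] <= fdr]

    # Collect the distinct peptides seen for each first-hit protein.
    peptides_by_protein = {}
    for p in passing:
        peptides_by_protein.setdefault(p["protein"], set()).add(p["peptide"])

    protein_count = len(peptides_by_protein)
    peptide_total = sum(len(s) for s in peptides_by_protein.values())
    return {
        "PSM": len(passing),
        "PSM_identification_rate": len(passing) / total_spectra if total_spectra else 0.0,
        "unique_peptide_seq_count": len({p["peptide"] for p in passing}),
        "first_hit_protein_count": protein_count,
        "mean_peptide_count": peptide_total / protein_count if protein_count else 0.0,
    }


if __name__ == "__main__":
    example = [
        {"peptide": "PEPA", "protein": "P1", "qvalue": 0.01},
        {"peptide": "PEPA", "protein": "P1", "qvalue": 0.02},
        {"peptide": "PEPB", "protein": "P1", "qvalue": 0.04},
        {"peptide": "PEPC", "protein": "P2", "qvalue": 0.03},
        {"peptide": "PEPD", "protein": "P2", "qvalue": 0.20},
    ]
    print(summarize_psms(example, total_spectra=10))
```

In the example, four of the five matches pass the threshold, so the identification rate is 4 out of 10 searched spectra.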
- data_out_table
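| DatasetName | PeptideSequence | FirstHitProtein | SpectralCount | sum(MasicAbundance) | GeneCount | FullGeneList | FirstHitDescription | DescriptionList | min(Qvalue) |
|-------------|-----------------|-----------------|---------------|---------------------|-----------|--------------|---------------------|-----------------|-------------|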
- collapsed_fasta_file
- resultant_file
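The data_out_table column names `SpectralCount`, `sum(MasicAbundance)` and `min(Qvalue)` suggest a group-by over individual spectrum matches. The Python sketch below shows one way such a roll-up could look, assuming rows are grouped by dataset, peptide sequence and first-hit protein. The grouping key, the input row layout and the `aggregate_psms` helper are assumptions for illustration, and the gene and description columns are omitted.

```python
from collections import defaultdict


def aggregate_psms(rows):
    """Roll per-spectrum rows up to one row per (dataset, peptide, first-hit protein)."""
    groups = defaultdict(list)
    for row in rows:
        key = (row["DatasetName"], row["PeptideSequence"], row["FirstHitProtein"])
        groups[key].append(row)

    table = []
    for (dataset, peptide, protein), members in sorted(groups.items()):
        table.append({
            "DatasetName": dataset,
            "PeptideSequence": peptide,
            "FirstHitProtein": protein,
            "SpectralCount": len(members),  # number of matched spectra in the group
            "sum(MasicAbundance)": sum(m["MasicAbundance"] for m in members),
            "min(Qvalue)": min(m["Qvalue"] for m in members),
        })
    return table


if __name__ == "__main__":
    example = [
        {"DatasetName": "ds1", "PeptideSequence": "PEPA", "FirstHitProtein": "P1",
         "MasicAbundance": 100.0, "Qvalue": 0.02},
        {"DatasetName": "ds1", "PeptideSequence": "PEPA", "FirstHitProtein": "P1",
         "MasicAbundance": 50.0, "Qvalue": 0.01},
        {"DatasetName": "ds1", "PeptideSequence": "PEPB", "FirstHitProtein": "P2",
         "MasicAbundance": 10.0, "Qvalue": 0.03},
    ]
    for line in aggregate_psms(example):
        print(line)
```

Sorting the group keys keeps the output order stable between runs, which makes the resulting table easier to diff.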
Requirements for Execution
- Docker or other Container Runtime
Version History
- 1.0.0
Point of contact
Package maintainer: Anubhav <anubhav@pnnl.gov>