Resource Description
Evolution of the SARS-CoV-2 proteome in three dimensions (3D) during the first six months of the COVID-19 pandemic
https://covid-19_proteome_evolution_paper.iqb.rutgers.edu
Legends for Supplementary Figures for 29 SARS-CoV-2 Study Proteins
Separate analysis of protein changes was performed for each study protein and complex. Description below applies to all figures.
A: Grayscale representation of observed frequencies for all USV substitutions of Native Residue (i.e., amino acid type in the reference protein sequence) changing to Substituted Residue for a given protein/complex. Red boxes enclose conservative substitutions among hydrophobic, uncharged polar, positively charged, and negatively charged amino acids, in that order from upper left to lower right. Cysteine, glycine, and proline are excluded from these groupings.
B-D: Normalized Frequency histograms for ΔΔGApp calculated for all USVs for a given protein/complex. These were calculated using three methods, which we refer to as hard-hard (B), soft-hard (C), and soft-soft (D), based on the scoring functions used for sidechain rotamer optimization and gradient-based energy minimization, respectively (see Methods). All energy values described in the text were obtained using the soft-hard method. Each energy histogram is overlaid with a fitted bi-Gaussian curve (solid red line) and fitted single Gaussian curves for subsets of USVs with surface (green), boundary layer (yellow), or core (blue) substitutions. USVs with multiple substitutions were included in single Gaussian fitting when all substitutions mapped to the same region of the study protein. The data used for fitting include the energies of all unique protein models produced by a given method, excluding extreme outliers with energy values more than 3 standard deviations away from the central mean.
E-G: USV Count histograms show, for each site, the number of USVs in the full set for a given protein that include a substitution at that site. Sites are separated by burial layer. Substitutions at sites that are absent from the available crystal structures are excluded from the histograms. In most cases, only a single protein is analyzed, and only panel E is included. In the case of complexes, a separate histogram is provided for each protein in the complex: for methyltransferase nsp10-nsp16, E is nsp10 and F is nsp16; for RDRP nsp12-nsp7-nsp8, E is nsp7, F is nsp8, and G is nsp12.
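The two selection rules described for panels B-D (outlier exclusion and layer subsetting) can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the "central mean" is the plain sample mean, uses the population standard deviation, and assumes burial layers arrive as a semicolon-separated string as in the `layer` table column. The function names are hypothetical.

```python
from statistics import mean, pstdev


def drop_extreme_outliers(energies, n_sigma=3.0):
    """Keep energies within n_sigma standard deviations of the sample mean.

    Assumption: the mean and standard deviation are computed over the
    full, unfiltered list of energies.
    """
    mu = mean(energies)
    sd = pstdev(energies)
    if sd == 0:
        # All values identical: nothing can be an outlier.
        return list(energies)
    return [e for e in energies if abs(e - mu) <= n_sigma * sd]


def single_layer(layer_field):
    """Return the burial layer shared by all substitutions of a USV.

    layer_field is a semicolon-separated string such as "core;core".
    Returns None when the substitutions span more than one layer, or
    when no layer is recorded, so the USV is left out of the
    per-layer single Gaussian fits.
    """
    layers = {part.strip() for part in layer_field.split(";") if part.strip()}
    return layers.pop() if len(layers) == 1 else None
```

Under these assumptions, a USV with `layer` equal to "core;core" would contribute to the core fit, while one with "surface;core" would appear only in the bi-Gaussian fit over all USVs.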
Legends for Supplementary Tables for 29 SARS-CoV-2 Study Proteins
Table: USVs: All identified USVs for a protein/complex. Columns are:
- date: Date of first collection of a strain with the USV reported to GISAID
- gisaid_count: The number of sequences in the GISAID database that include the USV
- id: The GISAID strain identification for the first collected instance of the USV
- location: The country in which the first strain including the USV was collected
- substitutions: All substitutions in the USV, in the form [chain]_[sequence][site][substitution], with multiple substitutions separated by semicolons
- is_in_PDB: whether a substitution is present in the PDB model used to generate the USV structure, with multiple substitutions separated by semicolons
- multiple: whether more than one amino acid substitution is present in the USV
- conservative: whether a substitution is conservative, with multiple substitutions separated by semicolons
- layer: Identification of the burial layer (surface, boundary, or core) of a substitution in the reference structure, with multiple substitutions separated by semicolons and substitutions absent from the PDB excluded
- sh_rmsd: The RMSD of the USV to the reference structure when modeled using the soft-hard method
- sh_ddG: The ΔΔGApp of the USV when modeled using the soft-hard method
- hh_rmsd: The RMSD of the USV to the reference structure when modeled using the hard-hard method
- hh_ddG: The ΔΔGApp of the USV when modeled using the hard-hard method
- ss_rmsd: The RMSD of the USV to the reference structure when modeled using the soft-soft method
- ss_ddG: The ΔΔGApp of the USV when modeled using the soft-soft method
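The `substitutions` column format can be parsed with a short routine such as the sketch below. It assumes that `[sequence]` is the one-letter reference residue and `[substitution]` the one-letter replacement, and that chain identifiers contain no underscore. The example string `A_D614G;A_L5F` is only a hypothetical illustration of the format, and the names `Substitution` and `parse_substitutions` are not part of the supplementary data.

```python
import re
from typing import NamedTuple


class Substitution(NamedTuple):
    chain: str      # chain identifier in the PDB model
    reference: str  # one-letter residue in the reference sequence
    site: int       # residue number
    mutant: str     # one-letter residue in the USV


# Assumed layout: [chain]_[reference residue][site][substituted residue]
_PATTERN = re.compile(
    r"^(?P<chain>[^_]+)_(?P<ref>[A-Z])(?P<site>\d+)(?P<mut>[A-Z])$"
)


def parse_substitutions(field):
    """Split a semicolon-separated substitutions field into records."""
    result = []
    for token in field.split(";"):
        token = token.strip()
        if not token:
            continue
        m = _PATTERN.match(token)
        if m is None:
            raise ValueError(f"unrecognized substitution: {token!r}")
        result.append(
            Substitution(m["chain"], m["ref"], int(m["site"]), m["mut"])
        )
    return result
```

Because `is_in_PDB` and `conservative` use the same semicolon-separated layout, their entries could be paired with the parsed records by position; note that `layer` omits substitutions absent from the PDB, so it may be shorter.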
Table: Substitutions: All substitutions identified for a protein/complex
- chain: The chain identifier of the protein in the PDB file in which the substitution is present
- site: The residue number at which the substitution is present
- reference: The one-letter amino acid name of the residue in the reference sequence
- mutant: The one-letter amino acid name of the residue in a USV
- conservative: Indication of whether a substitution is conservative
- in_pdb: whether the substitution site is present in the PDB model used to generate the USV structure
- layer: Identification of the burial layer (surface, boundary, or core) of a substitution in the reference structure
- date: Date of first collection of a strain with the substitution reported to GISAID
- location: The country in which the first strain including the substitution was collected
- gisaid_count: The number of sequences in the GISAID database including the substitution
- usv_count: The number of identified USVs including the substitution
- ddG: The soft-hard ΔΔGApp of the USV that includes only the substitution, left empty if no single-substitution USV was identified with the substitution
- single: Indication of whether the substitution was present in a single-substitution USV
- multiple: Indication of whether the substitution was present in a USV with multiple substitutions
- associates: List of all other substitutions that were identified in a USV that included the substitution
- strains: List of all USV-representative GISAID strains that included the substitution, with the single-substitution USV strain listed first if one was available
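The per-substitution columns `usv_count`, `single`, `multiple`, and `associates` could in principle be derived from the USV table. The sketch below shows one way to do so, assuming each USV is given as a list of substitution strings; it is an illustration of the column definitions, not the procedure used to build the tables, and `summarize_substitutions` is a hypothetical name.

```python
from collections import defaultdict


def summarize_substitutions(usvs):
    """Aggregate per-substitution statistics over a collection of USVs.

    usvs: iterable of lists of substitution strings, one list per USV.
    Returns a dict mapping each substitution to its usv_count, single
    and multiple flags, and the set of associated substitutions.
    """
    summary = defaultdict(
        lambda: {"usv_count": 0, "single": False,
                 "multiple": False, "associates": set()}
    )
    for subs in usvs:
        unique = set(subs)
        for s in unique:
            row = summary[s]
            row["usv_count"] += 1
            if len(unique) == 1:
                # The substitution is the only change in this USV.
                row["single"] = True
            else:
                row["multiple"] = True
                row["associates"] |= unique - {s}
    return dict(summary)
```

A substitution can have both `single` and `multiple` set when it occurs alone in one USV and together with other substitutions in another.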
Table: Gaussian Fit Statistics: Fitted models for the energies of all USVs either together (ALL) or by study protein.
- fit: The number of Gaussian curves in the fitted energy model
- protein: The protein/complex name
- method: The modeling method used to calculate energy values
- layer: The subset burial layer (surface, boundary, or core) of USVs for which the energy model was fitted, excluding all USVs with substitutions not in that layer
- μ1: Mean of the first Gaussian in the fitted model
- σ1: Variance of the first Gaussian in the fitted model
- wt1: Weight of the first Gaussian in the fitted model
- μ2: Mean of the second Gaussian in the fitted model
- σ2: Variance of the second Gaussian in the fitted model
- wt2: Weight of the second Gaussian in the fitted model
- R2: R-squared value indicating the goodness of fit
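To show how the fitted parameters combine, the sketch below evaluates a weighted two-Gaussian model and an R-squared value against histogram heights. Several points are assumptions rather than facts from the tables: the code treats each σ argument as a standard deviation (if the tabulated σ values are variances, as the column descriptions state, take the square root first), it uses normalized Gaussian densities, and it uses the usual 1 - SS_res/SS_tot definition of R-squared.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Normal density at x; sigma is a standard deviation (assumption)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def bi_gaussian(x, mu1, sigma1, wt1, mu2, sigma2, wt2):
    """Weighted sum of two Gaussian densities."""
    return wt1 * gaussian_pdf(x, mu1, sigma1) + wt2 * gaussian_pdf(x, mu2, sigma2)


def r_squared(observed, predicted):
    """Coefficient of determination between two equal-length sequences."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot
```

For a single-Gaussian row (fit equal to 1), the same function can be used with `wt2` set to 0.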
Description of Computed Structural Models for Unique Sequence Variants for 29 SARS-CoV-2 Study Proteins.
USV Computed Structural Models. Computed structural models for all amino acid substituted USVs. We are providing the structural models of all study proteins modeled using the soft-hard modeling method (see Methods). Structural models are named according to the GISAID strain identification of the first strain in which the USV was identified, followed by an underscore-separated list of substitutions in the f