This class gives basic tools to capture knowledge on microbial protein families and reviews integrative data-mining methods that make it possible to predict protein function. The focus is mainly on bacterial proteins and metabolism, but some eukaryotic resources are also given because some protein families are widely conserved. All tools used in this class are web-based. The number of web-based biological databases is very large (>6000) and growing. The Data Commons database is a great resource to search for ones that fit your needs, keeping in mind that many databases are short-lived. In this course, we list the ones we use the most in our research and teaching.
With the genomic revolution that started in 1995, microbiology has become a data science, as all aspects of the field are driven by knowledge derived from sequences (see extra reading here). With >400,000 genomes sequenced (see the current number in the GOLD database), gathering information on genomes and proteins is not easy.
Module Objectives
Many integrated databases are available to identify or gather genomic bacterial or archaeal sequences. We use BV-BRC routinely but IMG/JGI, Microscope and NCBI are also great resources.
Some genomes might not be correctly classified. Focusing on reference and/or complete genomes is always better when you are gathering genomes.
For comparative purposes, knowing a few fungal or plant databases is sometimes very useful.
Using BlastP at NCBI is not the most efficient way to gather a diverse set of sequences because you will retrieve too many very similar sequences unless you greatly restrict the genomes or the database searched. HMMER can help do sensitive searches more compactly. Our favorite methods to quickly gather sequences from a protein family are orthology databases or protein family databases. One must still be conscious of paralogs and fusions. UniProt is a great starting point as it federates many other resources. The ones we use frequently are listed below.
Orthology databases
Protein family databases
Finding the papers published on different members of the same protein family can be quite difficult, as explained in detail in this manuscript (Reed et al.). The main message is to always start with PaperBLAST with a family input (HMM) and to also use phylogenetically diverse sequences as input.
Mapping IDs from one database format to another is a constant issue when navigating databases. Mappers are available but their use can be tricky and frustrating.
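One way to make ID mapping less frustrating is to check explicitly which identifiers mapped cleanly, which mapped to several targets, and which did not map at all. The Python sketch below assumes you have already downloaded a two-column, tab-separated mapping table from a mapper; the table layout, the identifiers and the function names `load_mapping` and `map_ids` are all hypothetical and only serve as an illustration.

```python
import csv
import io
from collections import defaultdict

def load_mapping(handle):
    """Read a two-column TSV (source ID, target ID) into {source: [targets]}."""
    mapping = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) >= 2:
            source, target = row[0], row[1]
            if target not in mapping[source]:
                mapping[source].append(target)
    return dict(mapping)

def map_ids(ids, mapping):
    """Split IDs into uniquely mapped, ambiguous (one-to-many) and unmapped."""
    mapped, ambiguous, unmapped = {}, {}, []
    for identifier in ids:
        targets = mapping.get(identifier, [])
        if len(targets) == 1:
            mapped[identifier] = targets[0]
        elif targets:
            ambiguous[identifier] = targets
        else:
            unmapped.append(identifier)
    return mapped, ambiguous, unmapped

# Toy mapping table with invented identifiers (not real database entries).
toy_table = io.StringIO(
    "locus_0001\tACC001\n"
    "locus_0002\tACC002\n"
    "locus_0002\tACC003\n"
)
mapping = load_mapping(toy_table)
mapped, ambiguous, unmapped = map_ids(
    ["locus_0001", "locus_0002", "locus_0003"], mapping
)
print("mapped:", mapped)
print("ambiguous:", ambiguous)
print("unmapped:", unmapped)
```

Keeping the ambiguous and unmapped identifiers in separate lists makes it easy to resolve them by hand instead of losing them silently.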
To build multiple alignments and phylogenetic trees, the headers in FASTA sequences will often need to be reformatted. A user with little programming experience can use regular expressions (RegEx) to do that fairly easily.
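As a minimal sketch of the RegEx approach described above, the Python snippet below shortens UniProt-style FASTA headers to an accession plus an organism tag. The assumed header layout (`>sp|ACCESSION|ENTRY_NAME description OS=Genus species ...`), the toy records with invented accessions, and the function names `simplify_header` and `reformat_fasta` are illustrative assumptions; headers from other databases would need a different pattern.

```python
import re

# Assumed UniProt-style header:
# >sp|ACCESSION|ENTRY_NAME description OS=Genus species OX=...
HEADER_RE = re.compile(r"^>(?:sp|tr)\|([^|]+)\|\S+.*?OS=(\w+) (\w+)")

def simplify_header(line: str) -> str:
    """Return '>ACCESSION_Genus_species' if the header matches, else the line unchanged."""
    match = HEADER_RE.match(line)
    if match is None:
        return line
    accession, genus, species = match.groups()
    return f">{accession}_{genus}_{species}"

def reformat_fasta(text: str) -> str:
    """Rewrite every header line of a FASTA string; sequence lines are left untouched."""
    out = []
    for line in text.splitlines():
        out.append(simplify_header(line) if line.startswith(">") else line)
    return "\n".join(out)

# Toy records: accessions and sequences are invented for illustration.
example = """>sp|Q00001|PROT1_TEST Example protein OS=Escherichia coli (strain K12) OX=83333
MAAKDVKFGN
>tr|Q00002|Q00002_TEST Hypothetical protein OS=Bacillus subtilis OX=1423
MSTNPKPQRK"""

print(reformat_fasta(example))
```

Short headers without spaces or special characters are usually the safest choice, because many alignment and tree programs truncate or alter long headers.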
In the genomic era, protein function is predicted using automated methods during the genome annotation process (see the review by Stein, which is still very current). Many different types of errors can occur (see list here). One way to check for errors is to put the annotation in its biological context by performing metabolic reconstructions. It is also a path to identify missing genes or pathway holes. Comparative genomic methods such as physical clustering or gene neighborhood analyses can then identify candidates for these missing genes (see Review here and Additional Reading here). Automation of functional annotation requires controlled vocabularies that describe functions, such as Gene Ontology (GO) terms and/or Enzyme Commission (EC) numbers, as described here.
Module Objectives
Databases that compile pathway information and place predicted proteins in specific pathways starting from genomic information can be used to perform metabolic reconstruction, check whether a pathway is complete, and compare pathways between organisms. KEGG has both Pathways and Modules, even if the latter have also been integrated into the former. MetaCyc/BioCyc now use a subscription model.
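The logic behind checking whether a pathway is complete can be sketched in a few lines of Python: list the steps a pathway requires, collect the functions annotated in each genome, and report the steps that no gene covers. This is only an illustration of the idea, not how KEGG or BioCyc implement it; the step labels (`EC-A` to `EC-D`), gene names and function names are invented placeholders, not real EC numbers or a real pathway.

```python
from typing import Dict, List, Set

def find_pathway_holes(pathway_steps: List[str],
                       annotations: Dict[str, Set[str]]) -> List[str]:
    """Return the pathway steps that no annotated gene in the genome covers."""
    covered = set().union(*annotations.values())
    return [step for step in pathway_steps if step not in covered]

def compare_genomes(pathway_steps: List[str],
                    genomes: Dict[str, Dict[str, Set[str]]]) -> Dict[str, List[str]]:
    """Map each genome name to its list of missing steps (pathway holes)."""
    return {name: find_pathway_holes(pathway_steps, genes)
            for name, genes in genomes.items()}

# Hypothetical four-step pathway; labels stand in for EC numbers.
pathway = ["EC-A", "EC-B", "EC-C", "EC-D"]

# Hypothetical annotations: gene -> set of functions assigned to it.
genomes = {
    "genome_1": {"gene1": {"EC-A"}, "gene2": {"EC-B", "EC-C"}, "gene3": {"EC-D"}},
    "genome_2": {"gene4": {"EC-A"}, "gene5": {"EC-D"}},
}

holes = compare_genomes(pathway, genomes)
for name, missing in holes.items():
    status = "complete" if not missing else "missing " + ", ".join(missing)
    print(f"{name}: {status}")
```

A reported hole can mean a true absence, a mis-annotated gene, or a non-orthologous replacement, so each one still needs to be examined by hand.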
General genome integration databases also have tools to perform metabolic reconstructions and they have nice features to compare pathways between genomes.
Once you have identified a globally or locally missing gene case by doing metabolic reconstruction, through literature searches, or by looking at the Unknown Gene/Enzyme Databases, you can use different types of genomic or post-genomic derived associations to find gene candidates. In this module we focus on protein fusions, physical clustering and phylogenetic profiles; in the next modules we will look at co-regulation, protein interactions and omics-derived associations. All metabolic pathways have pathway holes and non-orthologous displacements that fill these holes. See this analysis of the folate pathway.
Integrative database. The most user-friendly database to integrate different types of association evidence is STRING. It is a good starting point, but everything is pre-computed. PubSEED has a great physical clustering tool but the genomes have not been updated in many years. This tool is now integrated in BV-BRC.
Additional gene clustering platforms. There are many tools to explore physical clustering associations. The ones we use the most in addition to integrative databases are listed below.
Fusion databases. It is difficult to differentiate true fusions (or Rosetta stone proteins) from multidomain proteins. The domains of the former must also be found as separate proteins in some organisms. Model SEED is a specific fusion database but otherwise general domain databases indirectly hold fusion information.
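The criterion above (a fusion counts as a Rosetta stone candidate only if its domains also occur as separate proteins elsewhere) can be sketched as follows. The input format (organism, then protein, then ordered list of domain names), the toy data and the function name `rosetta_stone_pairs` are assumptions made for illustration; real domain architectures would come from a domain database export.

```python
from itertools import combinations

def rosetta_stone_pairs(architectures):
    """Find domain pairs fused in one organism and stand-alone in another.

    architectures: {organism: {protein: [domain, ...]}}
    Returns {(domA, domB): {"fused_in": [...], "split_in": [...]}}.
    """
    fused = {}       # domain pair -> organisms where both are in one protein
    standalone = {}  # domain -> organisms where it is a single-domain protein
    for organism, proteins in architectures.items():
        for domains in proteins.values():
            unique = sorted(set(domains))
            if len(unique) == 1:
                standalone.setdefault(unique[0], set()).add(organism)
            for pair in combinations(unique, 2):
                fused.setdefault(pair, set()).add(organism)

    candidates = {}
    for (dom_a, dom_b), fused_orgs in fused.items():
        # Both domains must exist as separate proteins in the same organism.
        split_orgs = standalone.get(dom_a, set()) & standalone.get(dom_b, set())
        if split_orgs:
            candidates[(dom_a, dom_b)] = {
                "fused_in": sorted(fused_orgs),
                "split_in": sorted(split_orgs),
            }
    return candidates

# Toy data with invented organism, protein and domain names.
toy = {
    "org1": {"p1": ["DomA", "DomB"]},
    "org2": {"p2": ["DomA"], "p3": ["DomB"]},
    "org3": {"p4": ["DomC", "DomD"]},
}
result = rosetta_stone_pairs(toy)
print(result)
```

In the toy data, DomC and DomD are never found apart, so they are treated as an ordinary multidomain protein and not reported.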
Phylogenetic Distribution. The tools to extract the protein families that are present or absent in specific organisms are integrated in the general integrative genome databases. They vary by the number and taxonomy of genomes. The difficulty lies in choosing enough genomes to have a reasonable number of candidates.
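A phylogenetic profile query boils down to a set comparison: keep the families present in the genomes that have the pathway and absent from those that lack it. The sketch below is an assumption-laden illustration (family names, genome names and the `matching_profiles` function are invented), with a `max_mismatches` parameter to tolerate a few exceptions.

```python
def matching_profiles(profiles, with_pathway, without_pathway, max_mismatches=0):
    """Rank protein families whose distribution matches the expected profile.

    profiles: {family: set of genomes where the family is present}
    with_pathway / without_pathway: sets of genomes expected to have / lack it.
    Returns a list of (family, mismatches), best matches first.
    """
    hits = []
    for family, present_in in profiles.items():
        missing = len(with_pathway - present_in)        # expected but absent
        unexpected = len(without_pathway & present_in)  # present where not expected
        mismatches = missing + unexpected
        if mismatches <= max_mismatches:
            hits.append((family, mismatches))
    return sorted(hits, key=lambda hit: (hit[1], hit[0]))

# Toy presence/absence data with invented names.
profiles = {
    "FamA": {"g1", "g2", "g3"},
    "FamB": {"g1", "g2", "g3", "g4", "g5"},
    "FamC": {"g1", "g2"},
}
has_pathway = {"g1", "g2", "g3"}
lacks_pathway = {"g4", "g5"}

strict = matching_profiles(profiles, has_pathway, lacks_pathway)
relaxed = matching_profiles(profiles, has_pathway, lacks_pathway, max_mismatches=1)
print("strict:", strict)
print("relaxed:", relaxed)
```

Adding genomes to either set shrinks the candidate list, while raising `max_mismatches` lengthens it, which is the trade-off described above.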
There is currently an explosion in the use of AI in biology applications. The field of structural biology has been transformed by AlphaFold. In terms of predicting function, AI tools are helping to capture and propagate functional terms but can still make overpredictions and are not yet designed to discover novel functions, as discussed HERE.
In the last 20 years, different types of "omics" data (transcriptomics, proteomics and metabolomics, to name the most common) have become available, and the amount is increasing at an exponential rate as costs go down and technology improves. In parallel, predictions of regulatory networks based on motif analyses have improved and, when combined with transcriptomic data, can help identify genes that are co-regulated. One problem with all these data sets is that they are noisy and require replicates (from 3 to 6 depending on the technique). Also, ID mapping issues sometimes make the use of web-based analysis tools frustrating. Many of the best tools to analyze multi-omics data are R packages but there are few web-based alternatives to analyze both your own and published data. Faced with this data deluge, a useful resource to keep track is OmicsDI. See additional reading Here.
Module Objectives
Transcriptomics data were initially captured using microarrays and are now mainly generated using RNA-Seq. The tools to analyze the RNA-Seq data generated by experimentalists have matured and are available in pipelines that are quite easy to use. Understanding the steps and statistics in these analysis workflows is still very important both to design and to interpret experiments, as discussed in the papers HERE (to add). Analyzing published RNA-Seq data is still a challenge even if it is all publicly available in GEO. Specialized databases have emerged to help navigate the data deluge, some organism-specific, some for groups of organisms, but the bulk of the resources are for eukaryotes (mainly human). For Bacteria, web tools to identify co-expressed genes or to map transcriptomic data to genome browsers are available for specific genomes or groups of genomes, but these are very community-specific. A subset of examples that we use routinely is given below.
Platforms to analyze your microbial RNA-Seq data (free and web-based)
Multi-species depositories and/or analyses
Organism Specific databases
HTS data mapped to browser
Co-regulation can be predicted by identifying the operator sites for specific regulators. These sites can be predicted using in silico methods alone or by combining them with expression data.
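One common in silico approach to operator site prediction is to build a position weight matrix (PWM) from known binding sites and scan upstream regions for high-scoring windows. The sketch below is a simplified illustration: the example sites, the sequence, the score threshold and the function names are invented, only the forward strand is scanned, and a uniform background of 0.25 per base is assumed.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds matrix (one dict per position) from aligned sites of equal length."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for i in range(len(sites[0])):
        column = [site[i] for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in BASES
        })
    return pwm

def scan(sequence, pwm, threshold):
    """Return (start, window, score) for each window scoring at or above the threshold.

    The sequence must contain only A, C, G and T; only the given strand is scanned.
    """
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, window, round(score, 2)))
    return hits

# Invented training sites and upstream sequence, for illustration only.
known_sites = ["TTGACA", "TTGACA", "TTGATA", "TTGACT"]
upstream = "GGCGTTGACAGGCC"

pwm = build_pwm(known_sites)
hits = scan(upstream, pwm, threshold=5.0)
print(hits)
```

Genes whose upstream regions share high-scoring sites are candidates for co-regulation, and expression data can then be used to support or reject those candidates.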
Ordered mutant libraries have been constructed for a dozen model organisms. This has made it possible to identify essential genes and to screen these mutants in a wide variety of growth conditions, including the presence of many chemicals. More recently, TnSeq-derived fitness data or CRISPRi knockdown data can be generated more quickly. Accessing the phenome data can sometimes be difficult (for example, the data from the E. coli Carol Gross data set is no longer available).
Protein/protein interaction data was one of the first types of HTS data to be collected, using different types of approaches as discussed HERE. It is very noisy but can be useful when combined with other types of association evidence. This data has been linked in every UniProt entry and integrated in multi-omics databases such as STRING (see below), but more specific databases are discussed HERE and a few are listed below.
Not many web tools make it possible to map omics data to pathways specifically for Bacteria (see Review). A few databases generate very user-friendly association networks that compile all types of data, including omics data.
Advanced data queries
Precomputed networks that integrate different types of data
Mapping different types of data to pathway maps
Module Objectives
Many databases have integrated structure viewers such as PDB or iCn3D. These are sufficient to gather initial information about ligand binding sites, fold membership or structural similarity, but to do more advanced structural analysis two platforms are available: PyMOL or ChimeraX. PyMOL requires a subscription, even if educational access is available, but ChimeraX is free and quite easy to learn.
Protein structure prediction tools have progressed greatly in the last 10 years, with the CASP competition pushing modelers to build tools like Phyre. But the release of AlphaFold in 2020 has been a game changer for the field. AlphaFold predictions have now been integrated for every UniProt entry and new developments are being published constantly, as discussed HERE.
The nature and position of the ligand(s) is critical to understand protein function.
AlphaFold is not yet optimal to model interactions between macromolecules, even if it is a very active field of study, as discussed HERE. The Kurgan lab has a suite of interesting tools in the BioMine servers, including many nucleic acid binding tools. Unfortunately, these tools are not very reliable yet and the servers that provide them come and go. This paper reviews Protein/RNA binding prediction tools.
Protein/Protein Interaction Predictions
Protein/RNA or Protein/DNA Interaction Predictions
Tools to visualize gene and protein data are critical for both exploration and publication. A seminal review on data visualization can be found HERE. More generally, the VizBI community is very active in developing tools. For the purpose of this class we will focus on just a few types of tools.
Module Objectives
Depending on the explicit goal, exploring neighborhoods to find associations or creating figures for a manuscript, the choice of tool might differ. Many integrated databases such as STRING, IMG/JGI or BV-BRC have gene neighborhood (GN) tools. They can be useful for exploration but are not adequate for figures. The most extensive for exploration is the EFI/GNT. In recent years a flurry of user-friendly tools have been published and should be explored. The ones we use often are listed below.
Classical logo-creating tools such as WebLogo3 identify the residues that are conserved in a protein alignment. Sometimes it is useful to identify the residues that differ between two sets of sequences, as the TwoSampleLogo (TSL) tool below can do.
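The two ideas behind these logo tools can be sketched in Python: per-column information content (what a classical logo displays as stack height) and a comparison of two groups of aligned sequences. This is a simplified illustration and not the method used by WebLogo3 or TwoSampleLogo: no small-sample correction is applied, and the group comparison only checks whether the most common residue differs, whereas TwoSampleLogo relies on statistical tests. The toy alignments and function names are invented.

```python
import math
from collections import Counter

def column_counts(alignment):
    """Return one Counter of residues per alignment column."""
    return [Counter(column) for column in zip(*alignment)]

def information_content(alignment, alphabet_size=20):
    """Bits of information per column: log2(alphabet size) minus Shannon entropy (gaps ignored)."""
    info = []
    for counts in column_counts(alignment):
        residues = {aa: n for aa, n in counts.items() if aa != "-"}
        total = sum(residues.values())
        if total == 0:
            info.append(0.0)
            continue
        entropy = -sum((n / total) * math.log2(n / total) for n in residues.values())
        info.append(math.log2(alphabet_size) - entropy)
    return info

def differing_positions(group_a, group_b):
    """List (1-based position, top residue in A, top residue in B) where the two groups disagree."""
    diffs = []
    pairs = zip(column_counts(group_a), column_counts(group_b))
    for index, (counts_a, counts_b) in enumerate(pairs, start=1):
        top_a = counts_a.most_common(1)[0][0]
        top_b = counts_b.most_common(1)[0][0]
        if top_a != top_b:
            diffs.append((index, top_a, top_b))
    return diffs

# Invented toy alignments: two subfamilies of equal length.
group_a = ["MKHAD", "MKHAE", "MKHAD"]
group_b = ["MKDAD", "MKDAE", "MKDAD"]

info = information_content(group_a + group_b)
diffs = differing_positions(group_a, group_b)
print([round(bits, 2) for bits in info])
print(diffs)
```

Positions that are conserved within each group but differ between groups are often the ones that determine subfamily-specific function.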
Mapping data to a tree is very useful to represent all types of information, including gene presence/absence and the sizes of genes or families. Popular R packages exist (see ggtree), but in terms of web-based tools iTOL is by far the easiest to use, even if a few other platforms have been developed over the years.