This course is an introduction to the basic bioinformatics tools used in computational biology for life science research. The course uses web-based resources that analyze gene and protein sequences as pertinent data examples. No programming skills are needed.
**Student Learning Outcomes** – After successful completion of this course, students will be able to:
Module Objectives
Bioinformatics definition: Wikipedia gives a good one-sentence summary: "Bioinformatics is an interdisciplinary field of science that develops methods and software tools for understanding biological data, especially when the data sets are large and complex." Modern biology has become a "big data" science, and computational tools are needed to analyze and make sense of all the different data types that are being generated.
Numerous online resources are available for learning bioinformatics. A few free resources are listed below.
Biological databases are libraries of life science information, collected from scientific experiments, published literature, high-throughput experimental technologies, and computational analyses. They can be classified by many different criteria, such as the types of data they cover, whether they hold primary or curated data, and whether they are general or specialized. The quality of a database depends on how often it is updated, the curation effort that goes into it, and how well its maintainers respond to user feedback. The long-term persistence of databases is an issue, as it requires long-term funding that is very difficult to secure. Finding the best database to perform a task or access information is not easy. A few starting points are listed below, but biologists generally have a set of databases they use on a regular basis for the majority of their tasks.
PubMed is one of the databases that almost all biologists use to search and access biomedical literature. It is also a good starting point for beginners to learn the basics of database searches, such as the use of Boolean operators, issues with names and spelling, and the use of advanced search tools.
For users who want to explore literature resources other than PubMed, a few are listed in this Wiki.
Module Objectives
Extracting protein/gene sequences from databases is not trivial. First, names are not unique identifiers, and mapping between different databases and/or objects is neither well supported nor intuitive. Ontology resources try to homogenize names or provide numbering systems, such as the EC numbers for enzymes or the TC numbers for transporters. This module focuses on extracting protein and gene sequences from NCBI and UniProt resources.
The format of your input file is critical as most bioinformatic tools cannot deal with format errors. Make sure to understand the popular FASTA and GenBank formats. Tools to clean sequences up and convert one format to the other are also very useful.
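As an illustration of why format errors matter, here is a minimal sketch of a FASTA reader that also flags unexpected characters. The function names and the example records are hypothetical, and the course itself does not require any programming; this only shows what a tool has to assume about its input.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header = None
    chunks = []
    for line_number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        else:
            if header is None:
                raise ValueError(
                    f"line {line_number}: sequence data before any '>' header"
                )
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def invalid_characters(sequence, alphabet):
    """Return the sorted characters of `sequence` that are not in `alphabet`."""
    return sorted(set(sequence) - set(alphabet))


# Hypothetical two-record example; the second record contains a stray 'X'.
example = ">seq1 made-up example\nATGGCC\nTTAG\n>seq2\nATGNNX\n"
for name, seq in parse_fasta(example):
    print(name, len(seq), invalid_characters(seq, "ACGTN"))
```

A sequence split over several lines is joined into one string, and anything outside the chosen alphabet (here the DNA letters plus N, an assumption made for the example) is reported rather than silently accepted.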
With the ease of sequencing, some genomes get sequenced many times, generating identical gene and protein objects. Databases have dealt with this in different ways. UniProt has eliminated redundancy by combining identical protein sequences into one unique entry. NCBI has dealt with it by using WP records, then Identical Protein Reports, and finally Identical Protein Groups (IPG).
Advanced Boolean searches are a very powerful way to find sets of genes/proteins using a wide variety of filters (taxonomy, size, annotation), both in UniProt and NCBI. Each NCBI database has slightly different searchable fields; see the field lists for proteins or genes. All UniProt and NCBI entries have multiple links to the same entry in other databases that can provide similar or additional information, such as PDB, BRENDA, or KEGG.
Alignments are a powerful way to compare related DNA or protein sequences. They can be used to capture various facts about the sequences aligned, such as common evolutionary descent or common structural function. See a good overall summary HERE. We will focus first on pairwise sequence alignments, which are used to compare two sequences and to search databases.
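The idea of collapsing identical sequences can be sketched in a few lines. This is only an illustration of the concept, not how UniProt or NCBI actually implement it; the function name and the identifiers are made up.

```python
from collections import defaultdict


def group_identical(records):
    """Group identifiers that share exactly the same sequence.

    `records` is an iterable of (identifier, sequence) pairs.
    Returns a dict mapping each distinct sequence to its identifiers.
    """
    groups = defaultdict(list)
    for identifier, sequence in records:
        groups[sequence.upper()].append(identifier)
    return dict(groups)


# Hypothetical identifiers: p1 and p2 carry the same protein sequence.
records = [("p1", "MKVLA"), ("p2", "MKVLA"), ("p3", "MKVLG")]
for sequence, identifiers in group_identical(records).items():
    print(sequence, identifiers)
```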
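To make the structure of such a search concrete, the sketch below assembles a query string of the kind that can be pasted into the NCBI Protein search box. The helper function is hypothetical. The field tags used (`[Organism]`, `[SLEN]` for sequence length, `[Title]`) should be checked against the current NCBI field list, since fields differ between databases, and UniProt uses a different query syntax that is not shown here.

```python
def ncbi_protein_query(organism, min_length, max_length, title_word=None):
    """Build a Boolean query string combining taxonomy, size and annotation filters."""
    parts = [
        f'"{organism}"[Organism]',            # taxonomy filter
        f"{min_length}:{max_length}[SLEN]",   # sequence length range
    ]
    if title_word:
        parts.append(f"{title_word}[Title]")  # annotation filter
    return " AND ".join(parts)


print(ncbi_protein_query("Escherichia coli", 100, 200, "kinase"))
# "Escherichia coli"[Organism] AND 100:200[SLEN] AND kinase[Title]
```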
Module Objectives
A similarity matrix between two sequences, displayed as a dot plot (DotPlot), is the simplest way to compare them visually.
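A dot plot can be sketched as a matrix with one row per residue of the first sequence and one column per residue of the second, marking every identical pair. The function names are hypothetical, and real dot plot tools usually add a sliding window and a score threshold to reduce noise, which this minimal version omits.

```python
def dot_plot(seq_a, seq_b):
    """Return a matrix of booleans: True where seq_a[i] equals seq_b[j]."""
    return [[a == b for b in seq_b] for a in seq_a]


def render(seq_a, seq_b):
    """Draw the dot plot as text, with '*' for a match and '.' otherwise."""
    lines = ["  " + seq_b]
    for residue, row in zip(seq_a, dot_plot(seq_a, seq_b)):
        lines.append(residue + " " + "".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


print(render("GATTACA", "GATCACA"))
```

Stretches of similarity show up as diagonal runs of `*`; insertions or deletions shift the diagonal.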
Many pairwise sequence alignment programs have been developed. They can be global or local, use different matrices for computing similarity scores between amino acids, and be exact or heuristic.
Any sequence can be aligned with any other sequence if enough gaps are allowed. An alignment score is a measure of the similarity of the two sequences using a specific algorithm and parameters. It is NOT a measure of homology, even if it can be used as one of many criteria to infer homology, orthology, or paralogy.
BLAST is the most popular program for searching databases. Input can be DNA or protein sequences. The output gives the top-scoring pairwise alignment between the input sequence and individual database sequences. Different databases can be searched, and filters can be added to focus on specific taxonomic groups. This BLastFactSheet is a good starting point.
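The dependence of an alignment score on the algorithm and its parameters can be seen in a minimal sketch of an exact global alignment (the Needleman-Wunsch dynamic programming recurrence). The scoring values used here (match +1, mismatch -1, linear gap penalty -2) are arbitrary assumptions for illustration; real programs use substitution matrices and affine gap penalties.

```python
def global_alignment_score(seq_a, seq_b, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score of two sequences."""
    # Row 0: aligning an empty prefix of seq_a against prefixes of seq_b.
    previous = [j * gap for j in range(len(seq_b) + 1)]
    for i in range(1, len(seq_a) + 1):
        current = [i * gap]
        for j in range(1, len(seq_b) + 1):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            diagonal = previous[j - 1] + pair   # align the two residues
            up = previous[j] + gap              # gap in seq_b
            left = current[j - 1] + gap         # gap in seq_a
            current.append(max(diagonal, up, left))
        previous = current
    return previous[-1]


print(global_alignment_score("ACGT", "AGT"))          # 1
print(global_alignment_score("ACGT", "AGT", gap=0))   # 3
```

The same pair of sequences scores 1 or 3 depending only on the gap penalty, which is why a score on its own says nothing about homology.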
Two scores are given for every output sequence. The bit score is a similarity score. The E-value (expect value) is the number of hits with a score at least this high that are expected by chance, and it varies with the size of the queried database. More details in these videos: BLAST Results: Expect Values, Part 1 & Part 2.
WARNING: With the number of sequences now deposited in GenBank, using BLAST with default parameters is nearly useless. Using smaller databases, excluding specific taxa, or focusing on reference or model organisms are different ways to get around this issue.
Alternatives to BLAST for searching databases exist. The best known is FASTA, which actually preceded BLAST.
PaperBLAST is a cool tool to find the literature on potential homologs of a given input sequence.
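The relationship between the bit score and the E-value can be sketched with the standard formula E = m × n × 2^(−S′), where S′ is the bit score, m the query length, and n the total length of the database. The function name is hypothetical, and BLAST itself uses corrected "effective" lengths, so the numbers below are illustrative rather than what the program would report.

```python
def expect_value(bit_score, query_length, database_length):
    """Approximate E-value for a bit score, given the search space size."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Same hit (bit score 40, query of 200 residues) against two database sizes.
small = expect_value(40, 200, 1_000_000)
large = expect_value(40, 200, 1_000_000_000)
print(small, large)
```

The same bit score becomes a thousand times less significant when the database is a thousand times larger.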
Module Objectives
To generate a useful alignment, a set of phylogenetically diverse homologous sequences has to be gathered in FASTA format and used as input for the alignment programs. (Note: headers that are too long can get cut, and if their beginnings are identical this might lead to errors. Hidden characters can also cause errors. It is always useful to check files in plain-text editors such as TextEdit or Notepad++.)
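Both problems mentioned in the note can be checked before running an alignment. The sketch below is a hypothetical helper, not part of any alignment program. The truncation length is an assumption: programs differ in how many header characters they keep (strict PHYLIP format, for instance, keeps 10).

```python
from collections import Counter


def truncated_header_collisions(headers, max_length=10):
    """Return the prefixes shared by several headers once cut to max_length."""
    counts = Counter(header[:max_length] for header in headers)
    return sorted(prefix for prefix, n in counts.items() if n > 1)


def hidden_characters(text):
    """Return the non-printable characters in text, ignoring line feeds."""
    return sorted({ch for ch in text if not ch.isprintable() and ch != "\n"})


# Made-up headers: the first two become identical when cut to 16 characters.
headers = [
    "Escherichia_coli_K12_recA",
    "Escherichia_coli_O157_recA",
    "Bacillus_subtilis_recA",
]
print(truncated_header_collisions(headers, max_length=16))
print(hidden_characters("ATG\r\nCC\u200b"))
```

The second call reports a carriage return and a zero-width space, two characters that are invisible in most editors.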
Different tools can be used to edit alignments. Jalview is a popular option.
To visualize alignments, use this tool from EBI: http://www.ebi.ac.uk/Tools/msa/mview/
Multiple alignments are the foundational tools to classify proteins. A good primer on protein classification is available @EBI.
Tools to visualize conserved motifs in proteins and DNA
Profiles, motifs, and HMMs make it possible to find remote homologs in database searches.
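The principle behind a profile search can be sketched with a position-specific scoring matrix built from a few aligned sites and then slid along a sequence. Everything here is a simplified assumption: the sites are made up, the background is uniform, and a pseudocount of 1 is used. Real profile and HMM tools also model insertions and deletions.

```python
import math


def build_profile(aligned_sites, alphabet="ACGT", pseudocount=1.0):
    """Return one {letter: log-odds score} dict per alignment column."""
    background = 1.0 / len(alphabet)  # assumed uniform background
    profile = []
    for column in range(len(aligned_sites[0])):
        counts = {letter: pseudocount for letter in alphabet}
        for site in aligned_sites:
            counts[site[column]] += 1
        total = sum(counts.values())
        profile.append(
            {letter: math.log2(counts[letter] / total / background)
             for letter in alphabet}
        )
    return profile


def best_match(profile, sequence):
    """Return (score, start) of the highest-scoring window in sequence."""
    width = len(profile)
    return max(
        (sum(profile[i][sequence[start + i]] for i in range(width)), start)
        for start in range(len(sequence) - width + 1)
    )


sites = ["TATAAT", "TATGAT", "TACAAT"]  # hypothetical aligned sites
score, start = best_match(build_profile(sites), "GGCGTATAATGCG")
print(start, round(score, 2))
```

Because each column is scored separately, a window can score well even if it is not identical to any single input site, which is what allows profiles to detect more distant relatives than a pairwise search.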
Module Objectives
Genome browsers show different types of information mapped to a given genome. They can be very simple, showing just the organization of the genes, as in the genomic context section of a gene page @NCBI, or more complex with many tracks, like the UCSC or Ensembl browsers.
Most integrative databases, such as JGI-IMG or BV-BRC, have simple genome browsers, as do organism-specific databases such as SGD with its genome browser.
Predicting ncRNAs requires specific tools such as tRNAScan for tRNAs. For rRNAs, the widely used RNAmmer is no longer available and is being replaced by Rfam in many annotation pipelines as discussed HERE.
Predicting coding sequences (CDS) from open reading frames (ORFs) is not trivial, and most algorithms struggle with start site prediction, as discussed HERE.
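A naive ORF finder shows why start sites are the hard part. The sketch below scans the three forward reading frames only and simply takes the first ATG it meets in each frame as the start, which is an assumption that gene prediction programs cannot make: the true start may be a later ATG or an alternative start codon such as GTG or TTG. The function name and test sequence are made up.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end, sequence) for ATG-to-stop ORFs on the forward strand.

    `end` is exclusive and includes the stop codon; `min_codons` counts the
    codons before the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first ATG in this frame since the last stop
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, dna[start:pos + 3]))
                start = None
    return orfs


print(find_orfs("CCATGAAATTTGGGTAACC"))
# [(2, 17, 'ATGAAATTTGGGTAA')]
```

A complete tool would also scan the reverse complement strand and use additional signals, such as ribosome binding sites, to choose between candidate starts.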
Module Objectives
Identification of CDS in plants has additional challenges (introns, contamination by bacterial DNA), discussed HERE, but many plant genome resources are available.
Because E. coli is one of the most utilized bacterial models (see nice paper HERE), many resources are available that compile experimental evidence on its regulation.
Bioinformatic tools for promoter identification were not very efficient but are improving with the arrival of AI, as discussed HERE.
In Bacteria
In Plants
Tools for regulatory site identification in Bacteria. This review can provide good background.
Module Objectives
The analysis of restriction enzyme profiles in DNA for traditional cloning and the design of primers for PCR amplification were among the first bioinformatic tools to be developed for molecular biologists.
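Both tasks reduce to simple string operations, as the sketch below suggests. It searches for the EcoRI recognition site (GAATTC) and estimates a primer melting temperature with the Wallace rule, Tm = 2 × (A+T) + 4 × (G+C), a rough approximation meant for short oligonucleotides. The function names and sequences are made up, and dedicated primer design tools use more accurate thermodynamic models.

```python
def find_sites(dna, recognition_site):
    """Return the 0-based start positions of every occurrence of the site."""
    dna = dna.upper()
    site = recognition_site.upper()
    positions = []
    index = dna.find(site)
    while index != -1:
        positions.append(index)
        index = dna.find(site, index + 1)  # allow overlapping occurrences
    return positions


def wallace_tm(primer):
    """Estimate the melting temperature (deg C) with the Wallace rule."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


print(find_sites("AAGAATTCTTGAATTCAA", "GAATTC"))  # [2, 10]
print(wallace_tm("ATGCATGCATGCATGC"))               # 48
```

Because the EcoRI site is palindromic, searching one strand is enough; for a non-palindromic site the reverse complement would have to be searched as well.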
General biophysical properties can be derived from protein sequences. Many can be found @EXPASY or at @DTU.
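As one example of a property derived directly from sequence, the sketch below computes the grand average of hydropathy (GRAVY), the mean of the Kyte-Doolittle hydropathy values over all residues. The function name and the two peptides are made up; the web tools mentioned above report this value along with many others.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(protein):
    """Return the mean Kyte-Doolittle hydropathy of a protein sequence."""
    protein = protein.upper()
    if not protein:
        raise ValueError("empty sequence")
    # A KeyError here means the sequence contains a non-standard residue.
    return sum(KYTE_DOOLITTLE[residue] for residue in protein) / len(protein)


print(gravy("LLIVAVLLGA"))   # hydrophobic stretch: positive value
print(gravy("KRDEEKRDSN"))   # charged stretch: negative value
```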
Many protein localization prediction tools are available, and the choice can be quite overwhelming. Below are the ones heavily used for bacterial proteins:
Module Objectives
Phylogenetic trees are used to visualize the evolutionary relationships between protein or DNA sequences that derive from a common ancestor. Building and interpreting trees is not easy. It requires that the set of sequences and the algorithm used to construct the trees are chosen appropriately to answer the specific question that is asked.
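Distance-based tree methods start from pairwise distances computed on a multiple alignment. As a minimal sketch of that first step, the hypothetical function below computes the p-distance, the fraction of compared positions that differ, skipping any column where either sequence has a gap. This is only one possible choice; which distance or tree-building algorithm is appropriate depends on the question being asked.

```python
def p_distance(aligned_a, aligned_b):
    """Return the fraction of differing positions between two aligned sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    differences = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip columns containing a gap
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no ungapped positions to compare")
    return differences / compared


print(p_distance("ACGT-A", "ACCTTA"))  # 0.2
```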
Module Objectives
The coordinates of all experimentally validated structures are deposited in a central repository, the Protein Data Bank, which can be accessed from different knowledge base platforms that help analyze and mine this structural data.
Superfamily databases that combine structure and evolutionary relationships are available.
Simple tools to visualize and compare 3D structures are available through web interfaces, even if more elaborate analysis requires programs that are downloaded to the user's desktop, such as PyMOL or ChimeraX, which we do not cover in this class.
Predicting secondary (2D) structures was one of the first applications of bioinformatics. Worldwide competitions to predict 3D structures, held since 1994, led to the AI-based AlphaFold2 platform that revolutionized the field.