Bioinformatics Glossary

 

bioinformatics: The storage, management, retrieval, analysis and integration of the vast amounts of genomic data being produced globally. Today it also embraces protein structure analysis, gene and protein functional information, data from patients, pre-clinical and clinical trials, and metabolic pathways of numerous species. [CHI Bioinformatics report]

Computational methods are leading the "New Biology", the emerging discipline in which all biological parameters are interconnected. In this new discipline, biological pathways can be constructed by using computational algorithms to generate meaningful data from gene expression microarrays, mass spectrometry, 2-D gels, protein-protein interaction studies, and other experiments. In addition to showing how parameters are interconnected, this will enable us to identify therapeutically relevant targets and to define and diagnose disease on a molecular basis. The analysis of tissue-specific assays to test compounds may allow a more accurate prediction of drug response than animal studies, which are often poor and inconsistent predictors. Architectures of data underlie the accurate diagnosis and early intervention of disease based on genotype, gene expression signatures and protein transcription. This knowledge will help deliver personalized medicine and lower the cost of drug development. [Bioinformatics & Genome Research: The future of genome research, Beyond Genome, June 21-24, 2004, San Francisco, CA]

bottom-up: The classical reductionist approach to biology, which aims to examine the smallest units to gain insight into the larger ones. Mendelian genetics, which looks at single genes, is a bottom-up approach. Compare top-down.

clinical informatics: The application of informatics approaches to the clinical-evaluation phase of drug development. These approaches can include clinical-trial simulations to improve trial design and patient selection, as well as electronic capturing and storing of clinical data and protocols. The goal is to reduce expenses and time to market. [CHI Bioinformatics report]

computational biophysics:  Activities of the Theoretical and Computational Biophysics Group center on the structure and function of supramolecular systems in the living cell, and on the development of new algorithms and efficient computing tools for structural biology.  The Resource brings the most advanced molecular modeling, bioinformatics, and computational technologies to bear on questions of biomedical relevance.

data mining: Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns or rules.

annotated databases: Databases may contain a combination of amino acid sequences, comments, literature references and notes on known post-translational modifications to the sequence. A database that contains all of these elements is referred to as "annotated". Other databases contain only the sequence, an accession number and a descriptive title. Annotating each entry is very time-consuming and difficult to maintain without errors, so annotated databases usually have many fewer sequence entries than non-annotated ones. Annotation also implies that some functional or structural information is known about the mature protein, as opposed to a sequence that is known only from the translation of a stretch of nucleotide sequence. Even the best annotated databases now include large numbers of entries that have very little real information about the mature protein other than some reference to who sequenced and translated the nucleotide sequence. Annotated databases are technically superior for many purposes, because they contain information about the true form of the mature protein. [Biopolymer Markup Language (BIOML) Working Draft Proposal, 1999]

annotation: The annotation process identifies sequence features on the contigs - such as variation, sequence tagged sites, FISH mapped clone regions, known and predicted genes, and gene models. This stage provides contig, mRNA, and protein records with added feature annotation. [NCBI Contig Assembly and Annotation Process, 2001] 

Each fragment of DNA contains unique features. A DNA fragment may encode a portion of a gene or a gene control sequence, or the fragment may be a portion of a genome that has no apparent function. Bioinformaticists perform detailed analysis of DNA fragments, comparing new DNA sequences with previously annotated ones, identifying common characteristics, and assigning known or putative functions to the DNA sequence. Cross-species DNA sequence comparison is quite common and can reveal genes shared between organisms. A bioinformatic study may also require peptide-to-peptide comparisons, which allow common structural features of proteins to define the function of a DNA fragment encoding a specific protein or enzyme. [CHI High Throughput Genomics report, 2001]

The elucidation and description of biologically relevant features in the sequence is essential in order for genome data to be useful. The quality with which annotation is done will have direct impact on the value of the sequence. At a minimum, the data must be annotated to indicate the existence of gene coding regions and control regions. Further annotation activities that add value to a genome include finding simple and complex repeats, characterizing the organization of promoters and gene families, the distribution of G + C content, and tying together evidence for functional motifs and homologs. [Lawrence Berkeley Lab, US "Advanced Computational Structural Genomics"] 
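One of the annotation activities named above, the distribution of G + C content, is simple enough to sketch. The following is a minimal illustration only: the function names gc_content and gc_windows, and the window and step parameters, are assumptions made for this example and do not come from any annotation pipeline cited here.

```python
def gc_content(seq: str) -> float:
    """Return the fraction of G and C bases in a DNA sequence (0.0 if empty)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_windows(seq: str, window: int, step: int) -> list:
    """Return the G+C fraction of each full window sliding along the sequence."""
    return [
        gc_content(seq[start:start + window])
        for start in range(0, len(seq) - window + 1, step)
    ]


if __name__ == "__main__":
    # Hypothetical toy sequence, not real genome data.
    print(gc_windows("AAGGCCTT", window=4, step=2))
```

Plotting the windowed values along a contig is one common way to look at how G + C content varies by region.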

Explanatory notes, comments, analysis and commentaries added to a database. May refer to sequence data or protein structures and includes predictions, characterizations, summaries, and other detailed information, including gene function. Annotation can be manual (as in SWISS-PROT) or automated (as in TrEMBL). Since annotation is highly skilled and labor intensive, efforts are being made to automate the process, at least for preliminary data.

bioinformatics: Roughly, bioinformatics describes any use of computers to handle biological information. In practice the definition used by most people is narrower; bioinformatics to them is a synonym for "computational molecular biology" - the use of computers to characterise the molecular components of living things. [Damian Counsell, bioinformatics.org FAQ]

The discipline of storing, retrieving, analyzing, and integrating biological data. The field currently encompasses protein structure analysis, gene and protein functional information, data from patients, pre-clinical and clinical trial information, and studies of metabolic pathways in numerous species. Bioinformatics will be one of the keys to success for companies applying genomic tools to drug discovery and development. Demand for greater flexibility, better integration, and higher-value analytical tools is increasing. As a result, a growing number of companies are competing in this field, with a wider range of offerings and business models. During this, the "functional" and "high-throughput" phase of genomics, having top-level software products is simply not enough. The most promising contenders offer not just excellent applications but also access to databases and/or consulting services. [CHI Bioinformatics report]

biological databases: Biological databases have inherent complications stemming from the nature of the information they contain and the dependence of computational methods on these data. Most biological data are not digital, making machine-readability of the data (for automated data-mining) impossible. In addition, the lack of standardized nomenclature and ontology, the use of protein aliases (leading to ambiguity), the lack of interoperability across databases, and the presence of errors in database annotations have hindered and complicated the use of computational methods. [Defining the Mandate of Proteomics in the Post-Genomics Era, Board on International Scientific Organizations, National Academy of Sciences, 2002]

comparative bioinformatics: The genome sequences from several chordates are being completed; the bioinformatics needed to discover the protein-coding potential of those genomes largely exists in the research community. However, the bioinformatics to elucidate gene regulation encoded in genomes and gene regulatory networks is not so developed. New bioinformatics, new model organism resources, new experimental approaches, and new collaborations are needed if the community is to understand the gene networks that help create phenotypes of interest. A research team at ORNL and the University of Tennessee is developing some of the needed bioinformatics. The overall projects include 1) supplying several web services and collaborative bioinformatics that support large consortia of experimental researchers and 2) developing comparative bioinformatics and new data mining environments that can ultimately help in understanding the nature and evolution of gene regulatory networks. [J. Snoddy et al., Univ. of Tennessee, ORNL, International Mammalian Genome, 17 Nov. 2002]

computational biology: The development and application of data-analytical and theoretical methods, mathematical modelling and computational simulation techniques to the study of biological, behavioral, and social systems. [Biomedical Information Science and Technology Initiative (BISTI), Bioinformatics at the NIH, 2000]

curated databases: Often less complete than primary databases, but they have less redundancy and the added value of scientific annotation; therefore, a biologically significant sequence should be easier to find in such a database and of greater value. Naturally, the degree of redundancy and annotation in such a database depends on the experience, skills, aims, and devotion of its curators. ... The only proper way to curate databases is the way groups like those that developed OMIM [Online Mendelian Inheritance in Man], SWISS-PROT and most commercial databases have done it, that is, through making scientific judgments as data are cleaned up and merged. [CHI Bioinformatics report]

Databases maintained under the supervision of a curator. Other curated databases include LocusLink, RefSeq, and SGD (Saccharomyces cerevisiae Genome Database).

Ensembl: A joint project between EMBL-EBI and the Sanger Centre (UK) to develop a software system which produces and maintains automatic annotation on eukaryotic genomes. Human data are available now; they hope to add mouse data soon. http://www.ensembl.org/index.html

flat files: Pure text documents that are totally unstructured. This type of file generally does not provide very specific search answers, but it is the most popular type of file on the Web and is now a bit easier to search, thanks to the use of hyperlinks. [CHI Bioinformatics report] 

functional bioinformatics: The emerging field of functional bioinformatics focuses on the development of ontologies or concept classifications fed into algorithms used to perform computations of the functions of biomolecules. ["About bioinformatics" George Washington Univ. Medical Center, 2002] http://www.gwumc.edu/bioinformatics/about/bioinfo.htm

An emerging subfield of bioinformatics that is concerned with ontologies and algorithms for computing with biological function. Functional bioinformatics is the computational counterpart of functional genomics ... is concerned with managing and analyzing functional genomics data, such as gene expression experiments and large-scale knock-out experiments ... emphasizes large-scale computational problems, such as problems involving complete metabolic networks and genetic networks. [Peter D. Karp "An ontology for biological function based on molecular interactions" Bioinformatics Ontology 16 (3): 269-285, 2000]

NCBI  National Center for Biotechnology Information: Established in 1988 as a national  resource for molecular biology information, NCBI creates public databases, conducts research in computational biology, develops software tools for analyzing genome data, and disseminates biomedical information - all for the better understanding of molecular processes affecting human health and disease. Part of  NIH. http://www.ncbi.nlm.nih.gov

non-redundant databases: Researchers at the National Center for Biotechnology Information (NCBI) coined the term "nr" database (nonredundant database) to refer to a database in which the obviously redundant entries have been merged. These entries are typically those that are 100%, character-by-character identical, and algorithms exist that can remove such redundancy. Although such a database has less redundancy than a primary database, a substantial amount of redundancy remains, and it can be removed only by a curator using scientific judgment. [CHI Bioinformatics report]
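The "obvious" redundancy described above, entries that are identical character by character, can be removed with a single pass over the data. The sketch below is illustrative only and is not NCBI's actual procedure; the function name merge_identical and the accession strings are hypothetical.

```python
def merge_identical(entries):
    """Merge (accession, sequence) entries whose sequences match exactly.

    Returns a dict mapping each distinct sequence to the list of accessions
    that carried it, in first-seen order.
    """
    merged = {}
    for accession, sequence in entries:
        merged.setdefault(sequence, []).append(accession)
    return merged


if __name__ == "__main__":
    # Hypothetical accessions and toy sequences.
    demo = [("DEMO1", "MKTAYIAK"), ("DEMO2", "MKTAYIAK"), ("DEMO3", "MKTAYIAR")]
    for sequence, accessions in merge_identical(demo).items():
        print(sequence, accessions)
```

DEMO3 differs from the others by a single residue, so it survives as a separate entry. This is the kind of remaining redundancy that, per the definition above, only a curator's judgment can resolve.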

proprietary databases: Fee-based, copyrighted databases (in contrast to public databases such as those at DDBJ/EMBL/GenBank). Examples include Incyte's LifeSeq and Gene Logic's GeneExpress databases. Some databases charge subscription fees to commercial organizations, with other arrangements available to non-profits. Also referred to as private databases.

redundant databases: When sequence databanks were first created, primary [redundant] databases had the advantage of being more comprehensive than curated databases and more likely to contain recently discovered sequences. However, redundancy is no longer much of an advantage. In a highly redundant database, biologically significant results are more likely to be hidden among large numbers of irrelevant reported matches. [CHI Bioinformatics report] 

relational databases: Most or all of the data are structured. These files are the hardest to set up and maintain, and require specific knowledge on the part of the searcher, but they are the easiest to use when doing analysis or integration. Data are categorized by specific fields, so a searcher who knows the fields should be able to capture all the relevant data quite easily. The searchability of a relational database is totally dependent on how well the database has been structured. [CHI Bioinformatics report]
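To illustrate how knowing the fields lets a searcher capture the relevant records, here is a minimal sketch using Python's built-in sqlite3 module. The table name protein, its three fields, and the DEMO accessions are all assumptions invented for this example; they do not describe any real database's schema.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with one small, hypothetical table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE protein ("
        "accession TEXT PRIMARY KEY, organism TEXT NOT NULL, length INTEGER NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO protein VALUES (?, ?, ?)",
        [
            ("DEMO1", "S. cerevisiae", 120),
            ("DEMO2", "S. cerevisiae", 450),
            ("DEMO3", "E. coli", 300),
        ],
    )
    return conn


def accessions_for(conn, organism, min_length):
    """Query by field: all accessions for an organism at or above a length."""
    rows = conn.execute(
        "SELECT accession FROM protein "
        "WHERE organism = ? AND length >= ? ORDER BY accession",
        (organism, min_length),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = build_demo_db()
    print(accessions_for(conn, "S. cerevisiae", 200))
```

The query only works because organism and length exist as separate fields, which is the sense in which searchability depends on how well the database has been structured.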

algorithm:  A procedure consisting of a sequence of algebraic formulas and/or logical steps to calculate or determine a given task. [MeSH, 1987]

Rules or a process, particularly in computer science. In medicine a step by step process for reaching a diagnosis or ruling out specific diseases.  May be expressed as a flow chart in either sense. Greater efficiencies in algorithms, as well as improvements in computer hardware have led to advances in computational biology. A computable set of steps to achieve a desired result.

artificial intelligence (AI): A wide-ranging term encompassing computer applications that have the ability to make decisions; the ability to explain reasoning is evidence of intelligence. Also covers methods that have the ability to learn. [J Glassey et al. “Issues in the development of an industrial bioprocess advisory system” Trends in Biotechnology 18 (4):136-41 April 2000]

artificial neural nets: Algorithms that simulate the functioning of human neurons and may be used for pattern recognition problems, e.g., to establish quantitative structure-activity relationships. [IUPAC Computational]

artificial neural networks (ANN): Also referred to as connectionist architectures, parallel distributed processing, and neuromorphic systems, an artificial neural network (ANN) is an information-processing paradigm inspired by the way the densely interconnected, parallel structure of the mammalian brain processes information. Artificial neural networks are collections of mathematical models that emulate some of the observed properties of biological nervous systems and draw on the analogies of adaptive biological learning. The key element of the ANN paradigm is the novel structure of the information processing system. It is composed of a large number of highly interconnected processing elements that are analogous to neurons and are tied together with weighted connections that are analogous to synapses.

biometrics: The information age is quickly revolutionizing the way transactions are completed. Everyday actions are increasingly being handled electronically, instead of with pencil and paper or face to face. This growth in electronic transactions has resulted in a greater demand for fast and accurate user identification and authentication. Biometric technology is a way to achieve fast, user-friendly authentication with a high level of accuracy. [Biometrics Consortium]

bootstrapping: Kerr and Churchill use a bootstrapping procedure to calculate confidence intervals for the fitted values. Any bootstrapping procedure works by perturbing the original dataset and re-solving the model many times, often thousands of times. [Similar methods are sometimes called resampling or jackknife methods.] This generates a large number of values for each variable (one for each perturbed dataset), and one then estimates the true values of the variable, confidence intervals, and so on, from these values. The tricky part of the procedure is deciding how to perturb the dataset.
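The perturb-and-re-solve loop described above can be sketched in a few lines. This example is not Kerr and Churchill's procedure: it assumes the simplest perturbation (resampling the observations with replacement), uses the sample mean as the "model", and reads a percentile confidence interval off the sorted results. The function name bootstrap_ci and its parameters are hypothetical.

```python
import random
import statistics


def bootstrap_ci(data, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `data`.

    Each resample draws len(data) values with replacement (the perturbed
    dataset) and recomputes the mean (re-solving the model).
    """
    rng = random.Random(seed)
    n = len(data)
    means = sorted(
        statistics.fmean(rng.choices(data, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


if __name__ == "__main__":
    # Hypothetical measurements, e.g. log expression ratios.
    values = [0.8, 1.1, 0.9, 1.4, 1.0, 1.2, 0.7, 1.3]
    print(bootstrap_ci(values))
```

As the entry notes, the difficult decision is how to perturb the data. Resampling raw observations is only one choice, and resampling model residuals is another common one.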

cluster analysis: The clustering, or grouping, of large data sets (e.g., chemical and/or pharmacological data sets) on the basis of similarity criteria for appropriately scaled variables that represent the data of interest. Similarity criteria (distance based, associative, correlative, probabilistic) among the several clusters facilitate the recognition of patterns and reveal otherwise hidden structures (Rouvray, 1990; Willett, 1987, 1991). [IUPAC Computational]

A set of statistical methods used to group variables or observations into strongly inter-related subgroups. In epidemiology, it may be used to analyze a closely grouped series of events or cases of disease or other health-related phenomenon with well-defined distribution patterns in relation to time or place or both. [MeSH, 1990]

dendrogram: A tree diagram that depicts the results of hierarchical clustering. Often the branches of the tree are drawn with lengths that are proportional to the distance between the profiles or clusters. Dendrograms are often combined with heat maps, which can give a clear visual representation of how well the clustering has worked.

docking algorithms: The key to success for computational tools used in structure-based drug design is the ability to accurately place or 'dock' a ligand in the binding pocket of the protein target of interest. In this presentation, the effect of several factors on molecular docking accuracy, including force field parameters and ligand and protein flexibility, will be discussed. In order to examine the potential effect in an unbiased fashion, several test sets made up of ligand-protein co-complex x-ray structures were assembled that represent a diversity of size, flexibility and polarity with respect to the ligands and proteins.

expert systems: A computer-based program that encodes rules obtained from process experts, usually in the form of “if-then” statements.

fuzzy logic: A superset of conventional (Boolean) logic that has been extended to handle the concept of partial truth: truth values between “completely true” and “completely false”. Introduced by Dr. Lotfi Zadeh (Univ. of California - Berkeley) in the 1960s as a means to model the uncertainty of natural language.
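A small sketch may make "partial truth" concrete. It uses the standard min, max and complement operators usually attributed to Zadeh; the ramp membership function and the example "tall" and "heavy" predicates are assumptions added for illustration only.

```python
def fuzzy_and(a, b):
    """Fuzzy conjunction: the smaller of the two truth values."""
    return min(a, b)


def fuzzy_or(a, b):
    """Fuzzy disjunction: the larger of the two truth values."""
    return max(a, b)


def fuzzy_not(a):
    """Fuzzy negation: the complement of the truth value."""
    return 1.0 - a


def ramp(x, low, high):
    """Hypothetical membership function rising linearly from 0 at low to 1 at high."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    return (x - low) / (high - low)


if __name__ == "__main__":
    tall = ramp(180, 160, 200)   # degree to which 180 cm counts as "tall"
    heavy = 0.8                  # assumed degree of "heavy"
    print(fuzzy_and(tall, heavy), fuzzy_or(tall, heavy), fuzzy_not(tall))
```

When every truth value is exactly 0 or 1, these operators give the same results as Boolean AND, OR and NOT, which is the sense in which fuzzy logic is a superset of conventional logic.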

genetic algorithm GA:  Method for library design by evaluating the fit of a parent library to some desired property (e.g. the level of activity in a biological assay, or the computationally determined diversity of the compound set) as measured by a fitness function. The design of more optimal daughter libraries is then carried out by a heuristic process with similarities to genetic selection in that it employs replication, mutation, deletions etc. over a number of generations. [IUPAC Combinatorial Chemistry]

An optimization algorithm based on the mechanisms of Darwinian evolution which uses random mutation, crossover and selection procedures to breed better models or solutions from an originally random starting population or sample. (Rogers and Hopfinger, 1994).
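The mutation, crossover and selection steps named above can be sketched as follows. This is a toy illustration, not the method of Rogers and Hopfinger: the bit-string genome, tournament selection, single-point crossover, elitism, and all parameter values are assumptions chosen to keep the example short.

```python
import random


def evolve(fitness, genome_length=20, pop_size=30, generations=40,
           mutation_rate=0.02, seed=0):
    """Breed bit-string genomes toward higher fitness.

    Returns (best genome of the random starting population, best genome found).
    """
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(genome_length)]
                  for _ in range(pop_size)]
    initial_best = max(population, key=fitness)
    best = initial_best

    def pick_parent():
        # Selection: the fitter of two randomly drawn individuals wins.
        a, b = rng.sample(population, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        next_generation = [best[:]]  # elitism: carry the current best forward
        while len(next_generation) < pop_size:
            mother, father = pick_parent(), pick_parent()
            cut = rng.randrange(1, genome_length)       # single-point crossover
            child = mother[:cut] + father[cut:]
            child = [1 - gene if rng.random() < mutation_rate else gene
                     for gene in child]                 # random mutation
            next_generation.append(child)
        population = next_generation
        best = max(population, key=fitness)
    return initial_best, best


if __name__ == "__main__":
    # Toy fitness: the number of 1 bits in the genome.
    start, end = evolve(sum)
    print(sum(start), sum(end))
```

Because the best individual is copied unchanged into each new generation, the best fitness can never decrease. Without that elitism step a run could lose its best solution to mutation.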

heuristic: Tools such as genetic algorithms or neural networks employ heuristic methods to derive solutions which may be based on purely empirical information and which have no explicit rationalization. [IUPAC Combinatorial Chemistry] 

Trial and error methods.  Narrower term: heuristic algorithm

heuristic algorithm: A programming strategy for solving computationally resistant problems that utilizes self-educating techniques (i.e., feedback evaluation) to improve performance (e.g., FASTA). Problem solving by such experimental, trial-and-error methods does not guarantee the optimal solution. [labvelocity.com]

knowledge based systems: An extension of the expert system concept wherein additional forms of knowledge, such as mathematical models, are incorporated with the expert rules. [J Glassey et al. “Issues in the development of an industrial bioprocess advisory system” Trends in Biotechnology 18 (4):136-141 April 2000]

machine learning: In Knowledge Discovery, machine learning is most commonly used to mean the application of induction algorithms, which is one step in the knowledge discovery process. This is similar to the definition of empirical learning or inductive learning in Readings in Machine Learning by Shavlik and Dietterich. Note that in their definition, training examples are "externally supplied," whereas here they are assumed to be supplied by a previous stage of the knowledge discovery process. Machine Learning is the field of scientific study that concentrates on induction algorithms and on other algorithms that can be said to "learn." [Glossary of terms, Ron Kohavi, Machine Learning, 30, 271-274, 1998]

neural networks: Technique for optimizing a desired property given a set of items which have been previously characterized with respect to that property (the 'training set'). Features of members of the training set which correlate with the desired property are 'remembered' and used to generate a model for selecting new items with the desired property or to predict the fit of an unknown member. [IUPAC Combinatorial Chemistry]

Principal Components Analysis PCA: Computational approach to reducing the complexity of, for example, a set of descriptors, by identifying those features which provide the major contributions to observed properties, and thus reducing the dimensionality of the relevant property space. [IUPAC Combinatorial Chemistry]

A data reduction method using mathematical techniques to identify patterns in a data matrix. The main element of this approach consists of the construction of a small set of new orthogonal, i.e., non-correlated, variables derived from a linear combination of the original variables. [IUPAC Computational]

Often confused with common factor analysis.

consensus sequence (consensus): Poor Terminology! The simplest form of a consensus sequence is created by picking the most frequent base at some position in a set of aligned DNA, RNA or protein sequences.
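The simplest form described above, the most frequent symbol at each aligned position, can be written directly. The function name consensus is hypothetical, and the tie-breaking rule (the symbol seen first in the column wins) is an assumption of this sketch.

```python
from collections import Counter


def consensus(aligned):
    """Return the most frequent symbol at each column of equal-length sequences.

    Ties are broken in favour of the symbol that appears first in the column.
    """
    if not aligned:
        return ""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*aligned)
    )


if __name__ == "__main__":
    # Toy alignment, not real data.
    print(consensus(["ACGT", "ACGA", "TCGT"]))
```

Collapsing each column to one letter discards how strongly that letter was favoured, which is presumably part of why the entry calls the term poor terminology.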

schema (plural schemata): A description of the data represented within a database. The format of the description varies but includes a table layout for a relational database or an entity-relationship diagram.

supervised training:  Supervised training involves a mechanism of providing the network with the desired output either by manually "grading" the network's performance or by providing the desired outputs with the inputs. ... The vast bulk of networks utilize supervised training. [Artificial Neural Networks Technology, Data and Analysis Software, Dept. of Defense, 2000]

unsupervised training sets: Unsupervised training is where the network has to make sense of the inputs without outside help. ... Unsupervised training is used to perform some initial characterization on inputs. However, in the full blown sense of being truly self learning, it is still just a shining promise that is not fully understood, does not completely work, and thus is relegated to the lab. [Artificial Neural Networks Technology, Data and Analysis Software, Dept. of Defense, 2000]

AceDB  http://www.acedb.org/  A genome database system developed primarily by Jean Thierry-Mieg (CNRS, Montpellier, France) and Richard Durbin (Sanger Centre, UK). It provides a custom database kernel, with a non-standard data model designed specifically for handling scientific data flexibly, and a graphical user interface with many specific displays and tools for genomic data. AceDB is used both for managing data within genome projects, and for making genomic data available to the wider scientific community. AceDB was originally developed for the C. elegans genome project, from which its name was derived (A C. elegans DataBase). However, the tools in it have been generalized to be much more flexible and the same software is now used for many different genomic databases from bacteria to fungi to plants to man. It is also increasingly used for databases with non-biological content.

ArrayExpress, EBI, UK http://www.ebi.ac.uk/arrayexpress/ A public repository for microarray based gene expression data. Currently the EBI is establishing a pilot database containing microarray gene expression data that are available publicly.

DBGET/LinkDB, GenomeNet, Institute for Chemical Research, Kyoto University, Japan  http://www.genome.ad.jp/dbget/  Integrated database retrieval system. Currently supports the following databases and gene catalogs: nucleic acid sequences: GenBank, EMBL; protein sequences: SWISS-PROT, PIR, PRF, PDB, STR; 3D structures: PDB; sequence motifs: PROSITE, EPD, TRANSFAC; enzyme reactions: LIGAND; metabolic pathways: PATHWAY; amino acid mutations: PMD; amino acid indices: AAindex; genetic diseases: OMIM; literature: LITDB, Medline; gene catalogs: E. coli, H. influenzae, M. genitalium, M. pneumoniae, M. jannaschii, Synechocystis, S. cerevisiae; cross reference: EMBL and GenBank.

DDBJ (DNA DataBank of Japan)  http://www.ddbj.nig.ac.jp/  Shares information daily with EMBL and GenBank.

Dali, EBI European Bioinformatics Institute  http://www.embl-ebi.ac.uk/dali/  The Dali server is a network service for comparing protein structures in 3D. You submit the coordinates of a query protein structure and Dali compares them against those in the Protein Data Bank. A multiple alignment of structural neighbours is mailed back to you. In favourable cases, comparing 3D structures may reveal biologically interesting similarities that are not detectable by comparing sequences. If you want to know the structural neighbours of a protein already in the Protein Data Bank, you can find them in the FSSP database. Dali and HSSP are derived databases organizing protein space in the structurally known regions. The structure classification by Dali and the sequence families in HSSP can be browsed jointly from a web interface providing a rich network of links between domains and proteins and between structures and sequences. This results in a database of explicit multiple alignments of protein families in the twilight zone of sequence similarity.

dbEST, NCBI  http://www.ncbi.nlm.nih.gov/dbEST/index.html  Sequence data and other information on "single-pass" cDNA sequences or ESTs, from a number of organisms; part of GenBank.

dbSNP, NCBI  http://www.ncbi.nlm.nih.gov/SNP/  Uses a "looser variation" definition for SNPs (no requirement or assumption about minimum allele frequencies or the polymorphisms…). Short deletion and insertion polymorphisms and microsatellite repeats are included as well as SNPs. Disease-causing clinical mutations, as well as neutral polymorphisms, are also in scope.

dbSTS, NCBI  http://www.ncbi.nlm.nih.gov/dbSTS/  A subset of GenBank, with sequence and mapping data on short genomic landmark sequences (STSs). More comprehensive annotation than in GenBank and regularly updated with BLAST.

EMBL (European Molecular Biology Laboratory): Main laboratory is in Heidelberg, Germany, with outstations in Hamburg, Germany; Grenoble, France (access to high-powered instruments for structure studies); and Hinxton, UK (bioinformatics). Supported by 14 European countries and Israel; shares data daily with DDBJ and GenBank.  http://www.embl-heidelberg.de/

EPD (Eukaryotic Promoter Database), Bioinformatics Group, ISREC Swiss Institute for Experimental Cancer Research  http://www.epd.isb-sib.ch/  An annotated non-redundant collection of eukaryotic POL II promoters for which the transcription start site has been determined experimentally. Access to promoter sequences is provided by pointers to positions in nucleotide sequence entries. The annotation part of an entry includes a description of the initiation site mapping data, cross-references to other databases, and bibliographic references. EPD is structured in a way that facilitates dynamic extraction of biologically meaningful promoter subsets for comparative sequence analysis.

Entrez, NCBI  http://www.ncbi.nlm.nih.gov/Entrez/  A retrieval system for searching several linked databases. It provides access to: PubMed (Medline); Nucleotide: sequence database (GenBank); Protein: sequence database; Structure: three-dimensional macromolecular structures; Genome: complete genome assemblies; PopSet: population study data sets; Taxonomy: organisms in GenBank; OMIM: Online Mendelian Inheritance in Man.

Entrez Genomes, NCBI, US  http://www.ncbi.nlm.nih.gov/Entrez/Genome/org.html  The whole genomes of over 600 organisms can be found. The genomes represent both completely sequenced organisms and those for which sequencing is in progress. All three main domains of life - bacteria, archaea, and eukaryota - are represented, as well as many viruses and mitochondria.

Entrez Nucleotides, NCBI, US http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Nucleotide  A collection of sequences from several sources, including GenBank, RefSeq, and PDB. The number of bases grows at an exponential rate.

Entrez Proteins, NCBI, US http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?db=Protein  The protein entries in the Entrez search and retrieval system have been compiled from a variety of sources, including SwissProt, PIR, PRF, PDB, and translations from annotated coding regions in GenBank and RefSeq

ENZYME, ExPASy, Switzerland http://www.expasy.ch/enzyme/  Enzyme nomenclature database

ExPASy (Expert Protein Analysis System), Swiss Institute of Bioinformatics, Switzerland  http://www.expasy.ch/  Proteomics server

GOLD Genomes Online, Integrated Genomics, Inc UIUC/Argonne  http://igweb.integratedgenomics.com/GOLD/ Complete and ongoing genome projects information.

GenBank, NCBI, US http://www.ncbi.nlm.nih.gov/Genbank/  NIH genetic sequence database, an annotated collection of all publicly available DNA sequences. Mirrored at EMBL and DDBJ. It was estimated in early 2000 that over 2 million bases are deposited here each day. This growth will only accelerate in the future. Begun in the 1980s by DOE.

HOMOLOGENE, NCBI, US  http://www.ncbi.nlm.nih.gov/HomoloGene/   A homology resource which includes both curated and calculated orthologs and homologs for genes represented in UniGene and LocusLink for human, mouse, rat, and zebrafish. The curated orthologs include ortholog gene pairs reported in the Mouse Genome Database (MGD) at the Jackson Laboratory, the Zebrafish Information (ZFIN) database at the University of Oregon, and in published reports. The calculated orthologs and homologs are the result of nucleotide sequence comparisons between all UniGene clusters for each pair of organisms. These orthologs and homologs are considered putative since they are based only on sequence comparisons.

HUGO Mutation Database Initiative, Human Genome Organisation, Univ. of Melbourne, Australia  http://ariel.ucs.unimelb.edu.au/~cotton/dblist.htm  Links to locus-specific mutation databases, central and general mutation databases, national and ethnic mutation databases, complex disease databases, clinical and patient aspects, non-human mutations, artificial mutations and other related databases.

Human SNP Database, Whitehead Institute, US  http://www-genome.wi.mit.edu/SNP/human/index.html

International Nucleotide Database: Composed of  DDBJ, EMBL and GenBank. Often - but inaccurately - referred to as GenBank.

KEGG Pathway Database, http://www.genome.ad.jp/kegg/ Links to pathway and other databases (metabolic and regulatory)  http://kegg.genome.ad.jp/kegg/kegg4.html

LocusLink, NCBI, US http://www.ncbi.nlm.nih.gov/LocusLink/  A single query interface to curated sequence and descriptive information about genetic loci. It presents information on official nomenclature, aliases, sequence accessions, phenotypes, EC numbers, MIM numbers, UniGene clusters, homology, map locations, and related web sites.

MAGPIE Multipurpose Automated Genome Project Investigation Environment Genome Sequencing Projects (completed and in progress) http://genomes.rockefeller.edu/research.shtml   

MIPS Munich Information Center for Protein Sequences, Germany  http://www.mips.biochem.mpg.de/  A bioinformatics group of the GSF (National Research Center for Environment and Health) at the Max-Planck-Institut für Biochemie. MIPS is a member of PIR-International (Protein Identification Resource) and of EMBNET (European Molecular Biological Network).

MMDB Molecular Modeling DataBase, NCBI, US  http://www.ncbi.nlm.nih.gov/Entrez/structure.html  A database of macromolecular 3D structures (as well as tools for their visualization and comparative analysis). Contains experimentally determined biopolymer structures obtained from the Protein Data Bank (PDB). Structures can be anything from short oligonucleotides or peptides to very large macromolecular complexes containing dozens of individual molecules.

Nucleic Acids Database NDB, Rutgers Univ., US http://ndbserver.rutgers.edu/  Assembles and distributes structural information about nucleic acids.

OMIM, Online Mendelian Inheritance in Man, NCBI, US http://www.ncbi.nlm.nih.gov/Omim/searchomim.html Gene maps (cytogenetic locations of genes described in OMIM) and morbid maps (alphabetical list of diseases described in OMIM and their corresponding cytogenetic locations). [from the OMIM FAQ]

PDB Protein Data Bank, Research Collaboratory for Structural Bioinformatics http://www.rcsb.org/  3D macromolecular structural data. Incorporates NDB Nucleic Acid Database Project, Rutgers.

PIR Protein Information Resource, NBRF, Georgetown Univ. Medical Center, US http://www-nbrf.georgetown.edu/pirwww/pirhome.shtml  The Protein Information Resource (PIR), in collaboration with the Munich Information Center for Protein Sequences (MIPS) and the Japanese International Protein Sequence Database (JIPID) maintains the PIR- International Protein Sequence Database --- a comprehensive, annotated, and non- redundant protein sequence database in which entries are classified into family groups and alignments of each group are available.

PROSITE, Swiss Institute of Bioinformatics http://www.expasy.ch/prosite/  A database of protein families and domains. It consists of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family (if any) a new sequence belongs
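
As a rough illustration of how a PROSITE-style pattern can be used to scan a new sequence, the sketch below translates the pattern notation (elements separated by "-", x for any residue, [..] for allowed residues, {..} for excluded residues, (n) or (n,m) for repeats) into a Python regular expression. The function names and the test sequence are hypothetical and are not part of PROSITE itself; profiles and the less common corners of the pattern syntax are not handled.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Hypothetical helper for illustration; covers only the common syntax.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repeat count such as (3) or (2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core == "x":                  # any residue
            core = "."
        elif core.startswith("{"):       # excluded residues
            core = "[^" + core[1:-1] + "]"
        # [ABC] and single residue letters are already valid regex.
        repeat = ""
        if low:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)

def find_sites(pattern: str, sequence: str):
    """Return (1-based position, matched residues) for non-overlapping hits."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in regex.finditer(sequence)]

# N-glycosylation site pattern; the sequence is made up for the example.
hits = find_sites("N-{P}-[ST]-{P}", "MKNGSAANPTL")
print(hits)
```

A regular expression is only a convenient stand-in here; a short pattern like this one matches many unrelated sequences by chance, so a hit is a hint rather than a family assignment.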

PathDB, National Center for Genome Research, US  http://www.ncgr.org/pathdb/  A functional prototype research tool for biochemistry and functional genomics. One of the key underlying philosophies of our project is to capture discrete metabolic steps. This allows us to build tools to construct metabolic networks de novo from a set of defined steps.  PathDB is not simply a data repository but a system around which tools can be created for building, visualizing, and comparing metabolic networks.
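
The idea of assembling a network from discrete metabolic steps can be sketched in a few lines. This is not PathDB's data model or code: the step tuples, function names, and the tiny data set below are assumptions made for illustration, with each step reduced to one substrate, one product, and one enzyme.

```python
from collections import defaultdict, deque

# Each discrete step: (substrate, product, enzyme). Illustrative data only.
STEPS = [
    ("glucose", "glucose-6-phosphate", "hexokinase"),
    ("glucose-6-phosphate", "fructose-6-phosphate", "phosphoglucose isomerase"),
    ("fructose-6-phosphate", "fructose-1,6-bisphosphate", "phosphofructokinase"),
    ("glucose-6-phosphate", "6-phosphogluconolactone",
     "glucose-6-phosphate dehydrogenase"),
]

def build_network(steps):
    """Build a directed graph: compound -> list of (product, enzyme)."""
    network = defaultdict(list)
    for substrate, product, enzyme in steps:
        network[substrate].append((product, enzyme))
    return network

def find_route(network, start, goal):
    """Breadth-first search; returns the enzymes on a shortest route, or None."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        compound, enzymes = queue.popleft()
        if compound == goal:
            return enzymes
        for product, enzyme in network.get(compound, []):
            if product not in seen:
                seen.add(product)
                queue.append((product, enzymes + [enzyme]))
    return None

network = build_network(STEPS)
print(find_route(network, "glucose", "fructose-1,6-bisphosphate"))
```

Storing steps rather than whole pathways is what makes this kind of de novo construction possible: any set of steps, from any organism, can be fed to the same graph-building routine.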

Pfam (from SWISS-PROT and TrEMBL)  http://pfam.wustl.edu/ and various European mirror sites including EBI, UK  http://www.sanger.ac.uk/Software/Pfam/ and Sweden http://www.cgr.ki.se/Pfam/  A database of multiple alignments of protein domains or conserved protein regions. Hopefully they represent some evolutionarily conserved structure, which has implications for the protein's function. Pfam is actually formed in two separate ways. Pfam-A are accurate human-crafted multiple alignments, whereas Pfam-B is an automatic clustering of the rest of SWISS-PROT and TrEMBL using the program Domainer.

Prints, University College London, UK  http://www.biochem.ucl.ac.uk/bsm/dbbrowser/PRINTS/PRINTS.html  Compendium of protein fingerprints.

ProtFam, MIPS, Germany  http://www.mips.biochem.mpg.de/proj/protfam/  A curated protein classification database. In a joint effort, MIPS and PIR- NBRF classify sequences into superfamilies and families and annotate homology domains

REBASE, Restriction Enzyme DataBase, New England Biolabs http://rebase.neb.com/rebase/rebcit.html  A collection of information about restriction enzymes and related proteins. It contains published and unpublished references, recognition and cleavage sites, isoschizomers, commercial availability, methylation sensitivity, crystal and sequence data. DNA methyltransferases, homing endonucleases, nicking enzymes, specificity subunits and control proteins are also included. Putative DNA methyltransferases and restriction enzymes, as predicted from analysis of genomic sequences, are also listed. REBASE is updated daily and is constantly expanding.
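
To show how recognition and cleavage site data of the kind REBASE collects might be used, here is a minimal sketch that locates sites in a DNA sequence and cuts the top strand. The three enzymes and their sites are well known; the dictionary layout, function names, and example sequence are assumptions for illustration and do not reflect REBASE's file formats.

```python
# Enzyme -> (recognition sequence, cut offset within the site on the top strand).
# EcoRI cuts G^AATTC, BamHI cuts G^GATCC, HindIII cuts A^AGCTT.
ENZYMES = {
    "EcoRI": ("GAATTC", 1),
    "BamHI": ("GGATCC", 1),
    "HindIII": ("AAGCTT", 1),
}

def cut_positions(sequence: str, enzyme: str):
    """Return 0-based top-strand cut positions for one enzyme."""
    site, offset = ENZYMES[enzyme]
    sequence = sequence.upper()
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start + offset)
        start = sequence.find(site, start + 1)  # allow overlapping sites
    return positions

def digest(sequence: str, enzyme: str):
    """Split a linear sequence into fragments at every cut position."""
    sequence = sequence.upper()
    fragments, previous = [], 0
    for cut in cut_positions(sequence, enzyme):
        fragments.append(sequence[previous:cut])
        previous = cut
    fragments.append(sequence[previous:])
    return fragments

dna = "AAGAATTCGGGGATCCTT"  # made-up sequence
print(cut_positions(dna, "EcoRI"), digest(dna, "EcoRI"))
```

The sketch ignores methylation sensitivity, degenerate recognition sequences, and the bottom strand, all of which a real analysis would need to take from the database.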

SAGEmap, NCBI, US http://www.ncbi.nlm.nih.gov/SAGE/ Serial Analysis of Gene Expression, or SAGE, is an experimental technique designed to gain a quantitative measure of gene expression. The SAGE technique itself includes several steps utilizing molecular biological, DNA sequencing and bioinformatics techniques. These steps have been used to produce 9 or 10 base "tags", which are then, in some manner, assigned gene descriptions

SCOP: Structural Classification of Proteins, University of Cambridge UK http://scop.mrc-lmb.cam.ac.uk/scop/   SCOP mirrors http://scop.mrc-lmb.cam.ac.uk/scop/mirrors.html   Reference: Murzin A. G., Brenner S. E., Hubbard T., Chothia C. (1995). SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247, 536-540.

SRS Sequence Retrieval System  http://www.lionbio.co.uk/publicsrs.html  The URL lists public SRS servers, including EBI, DDBJ, INFOBIOGEN, and EMBL. SRS, developed initially as an academic system, is probably the best biological database browsing tool available. It allows you to browse the contents of databases through a web interface, exploring links to other databases and launching other programs on the retrieved database records.

SWISS-PROT, ExPASy (Expert Protein Analysis System) Swiss Institute of Bioinformatics http://www.expasy.ch/sprot/sprot-top.html  An annotated protein sequence database maintained by the Department of Medical Biochemistry of the University of Geneva and the EMBL Data Library.

TrEMBL, Swiss Institute of Bioinformatics, European Bioinformatics Institute UK  http://www.expasy.ch/sprot/  A computer- annotated supplement of SWISS- PROT that contains all the translations of EMBL nucleotide sequence entries not yet integrated in SWISS- PROT.

UniGene, NCBI, US http://www.ncbi.nlm.nih.gov/UniGene/index.html  An experimental system for automatically partitioning GenBank sequences into a non- redundant set of gene- oriented clusters. Each UniGene cluster contains sequences that represent a unique gene, as well as related information such as the tissue types in which the gene has been expressed and map location. Well- characterized genes and ESTs.

WIT2 What Is There, Argonne National Lab, US  http://wit.mcs.anl.gov/WIT2/  Attempts to produce metabolic reconstructions (models of the metabolism of the organism derived from sequence, biochemical, and phenotypic data) for sequenced (or partially sequenced) genomes. For each organism, a table connecting genes (ORFs) to hypothesized functional roles is included. "Being transferred to new server" (July 2004).

ArrayDB, NHGRI, US http://genome.nhgri.nih.gov/arraydb/  LIMS (Laboratory Information Management System) software for managing and analyzing a large-scale expression database. Information stored in ArrayDB is used to provide integrated gene expression reports by linking array target sequences with NCBI’s Entrez retrieval system, UniGene and KEGG pathway views. Designed to store information on hybridization targets (cDNA clones).

BLAST (Basic Local Alignment Search Tool): Software program from NCBI for searching public databases for homologous sequences or proteins. Designed to explore all available sequence databases regardless of whether query is protein or DNA. http://www.ncbi.nlm.nih.gov/BLAST/  

CLUSTALW at EBI, UK http://www2.ebi.ac.uk/clustalw/  

Cn3D http://www.ncbi.nlm.nih.gov/Structure/CN3D/cn3d.shtml  A helper application for your web browser that allows you to view 3-dimensional structures from NCBI's Entrez retrieval service.

FASTA: Software program, from the University of Virginia, used to scan a protein or DNA sequence library for similar sequences. http://fasta.bioch.virginia.edu/  

GRAIL: Genome Recognition and Assembly Internet Link  http://compbio.ornl.gov/Grail-1.3/ 

ORF Finder, NCBI, US http://www.ncbi.nlm.nih.gov/gorf/gorf.html  Gene prediction.
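
The basic idea behind open reading frame searching can be sketched as follows. This is not NCBI's implementation: it assumes the standard start codon ATG and the stop codons TAA, TAG and TGA, scans all six reading frames, and uses hypothetical function names and a made-up sequence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def orfs_in_strand(seq: str, min_codons: int = 3):
    """Return (start, end) pairs, 0-based and half-open, for ATG...stop
    stretches in the three frames of one strand. min_codons counts the
    start and stop codons as well."""
    found = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first ATG opens the frame
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    found.append((start, i + 3))
                start = None                   # look for the next ORF
    return found

def find_orfs(seq: str, min_codons: int = 3):
    """Scan both strands; coordinates on '-' refer to the reverse complement."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    orfs = [("+", s, e, seq[s:e]) for s, e in orfs_in_strand(seq, min_codons)]
    orfs += [("-", s, e, rc[s:e]) for s, e in orfs_in_strand(rc, min_codons)]
    return orfs

print(find_orfs("CCATGAAATTTTAGGG"))
```

Finding an open reading frame is only a first step toward gene prediction; in eukaryotic genomic DNA, introns mean that a gene rarely appears as one uninterrupted frame.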

PredictProtein Server  http://www.embl-heidelberg.de/predictprotein/predictprotein.html  Service for sequence analysis and protein structure prediction.  A Neural Network based prediction server, which automatically builds a multiple sequence alignment from the most recent version of SwissProt. Ab initio secondary structure prediction.

PSA Protein Structure Predicter Server, BMERC, Boston Univ. US  http://bmerc-www.bu.edu/psa/  Predicts probable secondary structures and folding classes for a given amino acid sequence.

Protein Explorer  http://www.umass.edu/microbio/chime/explorer/  Supersedes RasMol.

Protein Structure 2/3D Structure Prediction & Databases, CMS Molecular Biology Resource, San Diego Supercomputer Center, US http://restools.sdsc.edu/biotools/biotools9.html

RasMol homepage [Macromolecular structure viewer] See Protein Explorer, which is now recommended as easier to use and more powerful than RasMol.

SWISS-MODEL, Swiss Institute of Bioinformatics  http://www.expasy.ch/swissmod/SWISS-MODEL.html  

 

 

 

 

 
