BIOLOGICAL DATABASES

Bioinformatics databases are electronic reservoirs of information. Databases, first created during the 1960s, were introduced to solve the problems of file-oriented systems: they were compact, fast, easy to use, current, and accurate; they allowed easy sharing of data between multiple users; and they were secure.

One of the first biological sequence databases was the “Atlas of Protein Sequence and Structure” by Margaret Dayhoff and colleagues, first published in the 1960s. Initially, databases were maintained in books. Later, when computers became easily accessible to scientists, the computer became the storehouse. Databases were distributed on tape and later on various kinds of disks. Today databases are publicly accessible through web servers on the internet. Now that biology has turned into a data-rich science, the need for such databases has increased tremendously.

Thus a biological database is a large, organized body of persistent data usually associated with computerized software designed to update, query and retrieve components of the data stored within the system.


1. NUCLEOTIDE SEQUENCE DATABASES

Databases that contain information related to the sequences of DNA fragments. The principal ones are:

GenBank (National Center for Biotechnology Information, NCBI)

EMBL (European Molecular Biology Laboratory)

DDBJ (DNA Data Bank of Japan)

GenBank, which is built by the National Center for Biotechnology Information (NCBI), is part of the International Nucleotide Sequence Database Collaboration along with its two partners: the DNA Data Bank of Japan (DDBJ) and the European Molecular Biology Laboratory (EMBL) Nucleotide Sequence Database at the European Bioinformatics Institute (EBI). GenBank incorporates sequences from publicly available sources, primarily direct author submissions and large sequencing projects. The three databases exchange data on a daily basis, so all EMBL and DDBJ entries are also contained in GenBank.

The increasing size of the database, along with the diversity of the data sources available, has made it convenient to split GenBank into smaller, discrete divisions, which are as follows: PRI - primate; ROD - rodent; MAM - other mammalian; VRT - other vertebrate; INV - invertebrate; PLN - plant, fungal, algal; BCT - bacterial; RNA - structural RNA; VRL - viral; PHG - bacteriophage; SYN - synthetic; UNA - unannotated; EST - expressed sequence tags; PAT - patent; STS - sequence tagged sites; GSS - genome survey sequences; HTG - high-throughput genomic sequences.
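When working with downloaded records it can be handy to keep the division codes in a lookup table. The following is a minimal sketch in Python; the codes and descriptions are taken from the list above, while the names GENBANK_DIVISIONS and describe_division are made up for this illustration.

```python
# Division codes and descriptions as listed in the text above.
GENBANK_DIVISIONS = {
    "PRI": "primate",
    "ROD": "rodent",
    "MAM": "other mammalian",
    "VRT": "other vertebrate",
    "INV": "invertebrate",
    "PLN": "plant, fungal, algal",
    "BCT": "bacterial",
    "RNA": "structural RNA",
    "VRL": "viral",
    "PHG": "bacteriophage",
    "SYN": "synthetic",
    "UNA": "unannotated",
    "EST": "expressed sequence tags",
    "PAT": "patent",
    "STS": "sequence tagged sites",
    "GSS": "genome survey sequences",
    "HTG": "high-throughput genomic sequences",
}


def describe_division(code):
    """Return the description of a division code (case-insensitive)."""
    key = code.strip().upper()
    if key not in GENBANK_DIVISIONS:
        raise KeyError(f"unknown GenBank division code: {code!r}")
    return GENBANK_DIVISIONS[key]
```

For example, describe_division("bct") returns "bacterial".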

Information can be retrieved from GenBank using the Entrez integrated retrieval system, which combines data from the principal DNA and protein sequence databases with information from genome maps and protein structures. Additional information on the sequences can be accessed via the MEDLINE facility, which provides abstracts of the original published articles.

EMBL: EMBL, the nucleotide sequence database from the European Bioinformatics Institute (EBI), includes sequences from direct author submissions and genome sequencing groups, as well as from the scientific literature and patent applications.

Information can be retrieved from EMBL using the SRS (Sequence Retrieval System), which links the principal DNA and protein sequence databases with motif, structure, mapping and other specialist databases, and includes links to the MEDLINE facility.

DDBJ: The database is produced, maintained and distributed at the National Institute of Genetics and began as a collaboration with EMBL and GenBank. Sequences may be submitted to it from all corners of the world through a web-based data submission tool.

GENOME BIOLOGY: The genome biology site at NCBI contains information about the available complete genomes.

2. PROTEIN SEQUENCE DATABASES

Databases that contain protein sequences.

SWISS-PROT

PIR (Protein Information Resource)

PIR: A division of the National Biomedical Research Foundation (NBRF) in the US, updated and maintained on a daily basis. The second publication of the Atlas of Protein Sequence and Structure became the foundation of PIR.

SWISS-PROT: An annotated protein sequence database established in 1986 and maintained by the Swiss Institute of Bioinformatics (SIB) and the EBI. The SWISS-PROT protein sequence data bank consists of sequence entries. Each entry is composed of different line types, each with its own format. For standardization purposes, the format of SWISS-PROT follows as closely as possible that of the EMBL Nucleotide Sequence Database. The database can be accessed and searched through the SRS system at ExPASy (Expert Protein Analysis System), an expert molecular biology server.
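To illustrate what "different line types" means in practice, here is a minimal Python sketch that groups the lines of one flat-file entry by their two-letter line code. It assumes the usual layout (a two-letter code, three spaces, then the data; indented sequence lines after SQ; an entry terminated by "//"). The sample entry is invented for this illustration and is not a real record, and the function name group_by_line_type is hypothetical.

```python
from collections import defaultdict

# Abbreviated, made-up entry used only for illustration (not a real record).
SAMPLE_ENTRY = """\
ID   EXAMPLE_HUMAN   Reviewed;   12 AA.
AC   X00000;
DE   Hypothetical example protein.
OS   Homo sapiens (Human).
SQ   SEQUENCE   12 AA;
     MKTAYIAKQR QI
//
"""


def group_by_line_type(entry_text):
    """Group the lines of one flat-file entry by their two-letter line code."""
    fields = defaultdict(list)
    sequence_parts = []
    for line in entry_text.splitlines():
        if line.startswith("//"):  # terminator line: end of the entry
            break
        code, content = line[:2], line[5:].strip()
        if code.strip() == "":
            # Lines without a code are the residue lines that follow SQ.
            sequence_parts.append(content.replace(" ", ""))
        else:
            fields[code].append(content)
    fields["sequence"] = ["".join(sequence_parts)]
    return dict(fields)
```

Calling group_by_line_type(SAMPLE_ENTRY) gives a dictionary whose "AC" key holds the accession line and whose "sequence" key holds the residues joined into one string.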

TrEMBL: A supplement to SWISS-PROT that contains the translations of all EMBL nucleotide sequence entries not yet integrated into SWISS-PROT.

Index of Resources

Eukaryotes - Animals
Eukaryotes - Plant
Eukaryotes - Plant Organelles
Eukaryotes - Insects
Eukaryotes - Worms
Eukaryotes - Fungi
Eukaryotes - Protozoa
Prokaryotes - Bacteria
Viruses
Gene Map-Linkage
Comparative Analyses Tools
General Genome Database Resources

Eukaryotes - Animals

Bos [Bovine/Cow/Cattle]: BovGBASE - AGIS-USDA (US); BOVMAP Database - INRA (France); Cattle Genome Information - Cattle Genome Mapping Project (US); Cattle Genomic Database Search - Japan Animal Genome Server (Japan)
Canis [Canine/Dog]: Dog Genome Project - Dog Genome Center (Japan)
Equus [Equine/Horse]: Horse Genetics - UCD-School Vet Med (US)
Fugu [Pufferfish]: HGMP-RC FUGU Project - MRC-HGMP (UK)
Gallus [Chicken]: ChickMap - Roslin Institute (UK); ChickGBASE - AGIS-USDA (US); Chicken Gene Mapping Project - MSU (US)
Homo sapiens [Human]: CHLC Human Maps & Markers (US); GDB - Human Genome Database - JHU School of Medicine (US); OnLine Mendelian Inheritance in Man - NCBI (US); Human Genome Sequence Data - BLHGC (US); HuGeMap Database - Physical & Genetic Maps - Infobiogen (France); STS-Based Map of the Human Genome - Whitehead Institute/MIT-CGR (US); GenLink - Human Genetic Resources - WUSTL (US); Human Chromosome 22 Contig Map and Sequences - U Oklahoma (US); Human Genomic Database Search - Japan Animal Genome Server (Japan); IXDB - X-chromosome Integrated Database - MPIMG (Germany)
Mus [Mouse]: MGD - Japan Animal Genome Database Server (Japan); Mouse Genome Informatics Resource - The Jackson Laboratory (US); Mouse Genomic Database Search - Japan Animal Genome Server (Japan); Mouse Genetic Map Information - Whitehead Institute/MIT-CGR (US)
Ovis [Sheep]: SheepBASE - AGIS-USDA (US); SheepBase/SheepMap - Roslin Institute (UK); SheepBase/SheepMap - AgResearch (New Zealand); Sheep Genome Database - New Zealand Sheep Genome Programme (New Zealand); Sheep Genomic Database Search - Japan Animal Genome Server (Japan)
Rattus [Rat]: RATMAP - The Rat Genome Database - Goteborgs U (Sweden); Rat Genomic Database Search - Japan Animal Genome Server (Japan)
Sus [Porcine/Swine/Pigs/Hogs]: Swine Genome Map - Swine Genome Mapping Project - US Meat Animal Research Center (US); NAGRP Pig Map - Iowa State U (US); PiGBASE - Roslin Institute (UK); Pig Genome - Japan Animal Genome Server (Japan)

Eukaryotes - Plant

Arabidopsis: Arabidopsis - (AAtDB) A. thaliana Database - Stanford University (US); Arabidopsis thaliana Chromosome 4 - Genetic/Physical Maps - Nottingham Arabidopsis Stock Center (UK); Arabidopsis Information Management System (AIMS) - Michigan State U (US); Arabidopsis cDNA Sequence Analysis Project - MSU (US); The Arabidopsis Genome Center (ATGC) - U Pennsylvania (US); "Everything Arabidopsis!" Resource - Lehle Seeds (US); The TIGR Arabidopsis Database Project - TIGR (US)
Chlamydomonas: ChlamyDB - Chlamydomonas reinhardtii Genome Database - Chlamydomonas Genetics Center/Duke U (US)
Forest Trees: Forest Trees Genome Database - The Dendrome Project (US)
Plant Genome Databases, General: AGsDB - A Genus (multi-)species DataBase - NAL (US); Mnemonic/Numeric-Mendel Database - Rutgers U (US); Plant Genome Databases - USDA/AGIS (US); Plant Genome Data and Information Center - NAL (US)
Glycine max [Soybean]: SoyBase - NAL (US); SoyBase - Iowa State U (US)
Gossypium hirsutum [Cotton]: CottonDB - USDA/ARS - Southern Crops Research Laboratory (US); CottonDB - NAL (US)
Grain Genome Databases, General: GrainGenes - Wheat, Oat and Sugarcane Genome DBs - USDA/AGIS (US); GrainGenes - Wheat, Oat and Sugarcane Genome DBs - USDA/NAL Plant Genome Research Program (US)
Medicago sativa [Alfalfa]: AlfalfaGenes - A Medicago sativa Genome Database - Kansas State U (US)
Oryza sativa [Rice]: RiceGenes - Rice Genome Database - Cornell U (US); RGP - Rice Genome Research Program - DISC (Japan); The Korea Rice Genome Database - NIAST (Korea)
Phaseolus sp. [Beans]: BeanGenes - Genome Database of Phaseolus and Vigna species - North Dakota State U (US)
Solanaceae [Potato]: SolGenes - Solanaceae Genome Database - Cornell U (US)
Sorghum bicolor [Sorghum]: Sorghum - Texas A&M U (US)
Zea mays [Corn]: MaizeDB - Maize Database - U Missouri (US); Maize - USDA-ARS - U Missouri (US)

Eukaryotes - Plant Organelles

Chloroplast (US)
Complete Organelle Genomic Sequences (US)
Mitochondria (US)
Mitochondrial Plasmids & Other Nuclear Elements (US)

Eukaryotes - Insects

Aedes sp. [Mosquito]: Aedes aegypti - Mosquito Genomic Resources - MsqDB (US); Aedes albopictus - Mosquito Genomic Resources - MsqDB (US); Aedes triseriatus - Mosquito Genomic Resources - MsqDB (US)
Anopheles sp. [Mosquito]: Anopheles gambiae - Mosquito Genomic Resources - MsqDB (US); AnoDB - Anopheles Database - IMBB (Greece)
Ceratitis capitata [Medfly]: Ceratitis capitata - Medfly Resource - IMBB (Greece)
Drosophila sp. [Drosophila]: GIFTS - Gene Interactions in the Fly TransWorld Server - CNRS (France); FlyBase - Drosophila (FlyBase) Resource - IUBio (US); FlyBase - Drosophila (FlyBase) Resource - ANGIS (AU); FlyBase - Drosophila (FlyBase) Resource - Harvard U (US); FlyBase - Drosophila (FlyBase) Resource - EBI (UK); FlyBase - Drosophila (FlyBase) Resource - NIG (Japan); FlyBase - Drosophila (FlyBase) Resource - IBMC (France); Drosophila Sequence Data - Berkeley Drosophila Sequencing Project (US); The Interactive Fly - Guide to Drosophila Genes - Purdue U (US); FlyBrain - OnLine Atlas & Database of Drosophila Nervous System - Univ Freiburg (Germany)

Eukaryotes - Worms

Caenorhabditis elegans: ACeDB - Caenorhabditis elegans Database - USDA/AGIS (US); Caenorhabditis Genetics Center - U Minnesota (US)

Prokaryotes - Bacteria

Actinobacillus sp.: Actinobacillus actinomycetemcomitans Strain HK1651 - Genome Sequencing Project - ACGT (US)
Bacillus sp.: NRSub - Bacillus subtilis Non-Redundant DB - DDBJS (Japan); NRSub - Bacillus subtilis Non-Redundant DB - ACNUC (France); Bacillus subtilis Genome Sequencing Project - NIST (Japan); Micado Bacillus subtilis Genomic Resource - INRA (France); SwissProt B. subtilis Sequence Subset Collection - ExPASy (Switzerland); SubtiList - B. subtilis Protein & DNA Sequence Database - Institut Pasteur (France)
Escherichia sp.: ECCE - E. coli Cell Envelope Protein Data Collection - Cardiff U (UK); Escherichia coli WWW Home Page - NIG (Japan); EcoCyc - Encyclopedia of E. coli Genes & Metabolism - Pangea Systems, Inc. (US); GenProtEC - E. coli Genome & Proteome Database - MBL (US); Escherichia coli Genetic Map - INRA (France); Blastn Search of an E. coli Sequence-Specific Database - VGC (US); Escherichia coli DataBank - NIG (Japan); Colibri - E. coli Protein & DNA Sequence Database - Institut Pasteur (France)
Haemophilus sp.: HIDC - Haemophilus influenzae Database Collection - U Giessen (Germany); HIDB - Haemophilus influenzae Rd Genome Database - TIGR (US); HinCyc - Encyclopedia of H. influenzae Genes & Metabolism - SRI (US)
Methanococcus sp. [Archaeon]: MJGD - Methanococcus jannaschii Genome Database - TIGR (US)
Mycobacterium sp.: MycDB - Mycobacterium Database - RIT (Sweden); MycDB - Mycobacterium Database - USDA/AGIS (US); TubercuList - M. tuberculosis Protein & DNA Sequence Database - Institut Pasteur (France)
Mycoplasma sp.: Mycoplasma genitalium Genome Database - TIGR (US)
Neisseria sp.: Neisseria gonorrhoeae - Genomic Sequencing Project - ACGT (US)
Staphylococcus sp.: Staphylococcus aureus NCTC 8325 - Genome Sequencing Project - ACGT (US)
Streptococcus sp.: Streptococcus pyogenes - Genome Sequencing - U Oklahoma (US); Streptococcus mutans strain UAB159 - Genome Sequencing Project - ACGT (US)
Synechocystis sp.: CyanoBase - Genome Database for Cyanobacterium Synechocystis sp. strain PCC6803 - KDRI (Japan)

Eukaryotes - Fungi

Aspergillus sp.: Aspergillus nidulans - Cosmid & cDNA Sequencing Project - U Oklahoma (US)
Candida sp.: Candida albicans Genetic Information - VGC (US)
Fungi Genome Databases, General: PathoGenes - Fungal Pathogens of Small-Grain Cereals - AGIS-NAL (US)
Neurospora sp.: Neurospora crassa - Gene loci, Linkage maps, etc. - Fungal Genetics Stock Center (US)
Saccharomyces sp.: YPD - Yeast Protein (Genes) Database - Proteome Inc. (US); YPD - Yeast Protein (Genes) Database - CSHL (US); SGD - Saccharomyces cerevisiae Genome Database - Stanford U (US); Yeast Chromosome Sequences Database - VGC (US); S. cerevisiae - Sequencing Projects - Sanger Centre (UK)
Schizosaccharomyces sp.: Schizosaccharomyces pombe - Sequencing Project - Sanger Centre (UK)

Viruses

All the Virology Servers in the World - Tulane U (US)
Astrovirus Sequence Database - IAH (UK)
Calicivirus Sequence Database - IAH (UK)
HIV - HIV Sequence Database - Los Alamos Nat Labs (US)
Human Retrovirus and AIDS Database - LANL (US)
ICTVdB - Universal Virus Database - ANU (Australia)
Picornavirus Sequence Database - IAH (UK)
Sequivirus Sequence Database - IAH (UK)
VIDE Database - Plant Viruses OnLine - U Idaho (US)
Virus Sequences, Alignments & Phylogenetic Trees - U Wisconsin (US)
Virus Information Resource - ANU (Australia)

Vertebrate Comparative Database - UK HGMP Resource Center (UK)

General Genome Database Resources

BioTech Organisms & Strains Resource - BioTech (US)
Genome Databases Listing - EERIE-Nimes (France)
Genome Research Centers - A Listing of WWW sites - EERIE-Nimes (France)

Phylogenetic tree construction:

1. Nucleotide BLAST

i) To submit your query to NCBI BLAST, open the NCBI BLAST page in a browser window.
ii) Click on BLAST (BLASTn); a web page will appear in the same window.
iii) Click on nucleotide blast.
iv) A new page will open. Paste your sequence, or the accession number of any sequence, and submit the search.
v) A new window titled “Results of BLAST” will open.
vi) The alignment of the sequences is displayed at the top of the page, followed by a list of accession numbers. In the list, organisms are ordered by their similarity to the query (from maximum to minimum).
vii) Note down the accession numbers of the sequences you need; they are used in the next step.
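If the hit list is saved as a tab-separated hit table instead of being copied by hand, the accession numbers can be collected programmatically. The Python sketch below assumes the standard 12-column BLAST tabular layout (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The sample table, its accessions, and the function name top_accessions are invented for illustration and are not real search results.

```python
# Column order of the standard 12-column BLAST tabular output.
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

# Made-up hit table for illustration; accessions and scores are placeholders.
SAMPLE_HITS = (
    "query1\tHYPO0001.1\t99.50\t1400\t7\t0\t1\t1400\t1\t1400\t0.0\t2550\n"
    "query1\tHYPO0002.1\t97.10\t1398\t40\t1\t1\t1398\t3\t1400\t0.0\t2390\n"
    "query1\tHYPO0003.1\t98.20\t1400\t25\t0\t1\t1400\t1\t1400\t0.0\t2470\n"
)


def top_accessions(table_text, n=10):
    """Return up to n subject accessions ordered by percent identity, best first."""
    hits = []
    for line in table_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        row = dict(zip(COLUMNS, line.split("\t")))
        hits.append((float(row["pident"]), row["sseqid"]))
    hits.sort(key=lambda hit: hit[0], reverse=True)
    ordered = []
    for _, accession in hits:
        if accession not in ordered:  # keep each accession once
            ordered.append(accession)
    return ordered[:n]
```

With the sample table, top_accessions(SAMPLE_HITS, 2) keeps the two hits with the highest percent identity.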

2. Downloading DNA sequences

i) Next, compare your sequence with the sequences of closely related species. For this purpose, download those sequences as follows.
ii) Click on an accession number in the window showing the BLAST results.
iii) While downloading, select ‘FASTA’ in place of ‘default’ and click ‘Display’.
iv) Copy all the required sequences one by one into Notepad.
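As an alternative to clicking through each record in the browser, FASTA records can also be requested through the NCBI E-utilities efetch service. This is an assumed alternative route, not part of the procedure above. The Python sketch below only builds the request URL, so it runs without a network connection; the function name fasta_url and the accession used in the example are hypothetical.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities efetch service.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def fasta_url(accession, db="nuccore"):
    """Build an efetch URL that returns one record as FASTA-formatted text."""
    params = {
        "db": db,              # nucleotide database
        "id": accession,       # accession number noted from the BLAST results
        "rettype": "fasta",    # ask for FASTA instead of the default format
        "retmode": "text",
    }
    return f"{EFETCH_URL}?{urlencode(params)}"
```

The resulting address can be opened in a browser or passed to urllib.request.urlopen; when fetching many records, keep the request rate low in line with NCBI's usage guidelines.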

3. Input file preparation for CLUSTAL

i) Paste all the sequences in word/text format in Courier New font (size 10), each with a “>” symbol followed by the accession number or strain name, and the DNA sequence on the next line.
ii) Append all the other sequences one by one in the same way, without any blank lines between them.
iii) Save the WordPad file as plain text (e.g. Sequence.txt).
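The file layout described above (a “>” header line, the sequence on the following lines, no blank lines between records) can also be produced with a few lines of Python. This is a minimal sketch; the function name to_fasta and the 60-character line width are choices made for this illustration.

```python
def to_fasta(records, width=60):
    """Format (name, sequence) pairs as FASTA text with no blank lines.

    Whitespace inside a sequence is removed and letters are upper-cased.
    """
    lines = []
    for name, seq in records:
        cleaned = "".join(seq.split()).upper()
        lines.append(f">{name}")
        # Wrap long sequences so each line holds at most `width` characters.
        lines.extend(cleaned[i:i + width] for i in range(0, len(cleaned), width))
    return "\n".join(lines) + "\n"


def save_fasta(records, path="Sequence.txt"):
    """Write the records to a plain-text file ready to load into CLUSTAL."""
    with open(path, "w", encoding="ascii") as handle:
        handle.write(to_fasta(records))
```

For example, to_fasta([("strain1", "acgt")]) returns ">strain1" followed by "ACGT" on the next line.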

4. Sequence alignment with CLUSTAL

i) Open Clustal X.
ii) Click on File and choose Load Sequences.
iii) Click on Alignment and select Output Format Options; select the Clustal format and the Phylip format.
iv) Click on Alignment and choose Do Complete Alignment.
v) Three files (aln, dnd, and phylip) will be created. Save all the files.
vi) Open the ALN file and trim the alignment: delete the overhanging sequence from both ends so that all sequences start and end at the same positions. Save this file.
vii) Load the trimmed file in Clustal X.
viii) Click on Alignment, select Output Format Options, and select the Clustal format and the Phylip format.
ix) Choose Do Complete Alignment. Three files (aln, dnd, and phylip) will be created again.
x) The Phylip file will be used for the construction of the phylogenetic tree with the TREECON software.
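The manual trimming step can be sketched in Python as follows. The sketch assumes that "trimming" means removing leading and trailing alignment columns in which at least one sequence has a gap, and that the aligned sequences have already been read into a dictionary. The function name trim_alignment_ends is hypothetical, and parsing the ALN file itself is left out.

```python
def trim_alignment_ends(aligned, gap="-"):
    """Drop leading and trailing columns in which any sequence has a gap.

    aligned maps sequence names to aligned sequences of equal length.
    Gaps inside the retained region are kept as they are.
    """
    if not aligned:
        return {}
    seqs = list(aligned.values())
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")

    def is_complete(col):
        # True when no sequence has a gap in this column.
        return all(s[col] != gap for s in seqs)

    start = 0
    while start < length and not is_complete(start):
        start += 1
    end = length
    while end > start and not is_complete(end - 1):
        end -= 1
    return {name: s[start:end] for name, s in aligned.items()}
```

After trimming, every sequence starts and ends at the same alignment position, which is what the re-alignment step expects.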

5. Drawing the tree with TREECON

Open TREECON.

I. DISTANCE ESTIMATION
a. Click “Start distance estimation”.
b. Open the test sequence file.
c. Click “Nucleic acid sequences”.
d. Select Phylip interleaved.
e. Click OK.

II. INFER TREE TOPOLOGY
a. Click “Neighbour joining”.
b. Select bootstrap analysis.
c. Click Yes.
d. Set the number of bootstrap replicates to 1000.

III. ROOT UNROOTED TREES
a. Click “Rooted trees”.

IV. DRAW PHYLOGENETIC TREE
a. Click to launch the “Tree drawing program”.
b. Modify the tree using the icons.
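To give an idea of what the distance estimation and bootstrap steps compute, here is a small Python sketch. It is an illustration under stated assumptions, not a description of TREECON's internals: the Jukes-Cantor correction is an assumed choice of distance model, the function names are hypothetical, and the neighbour-joining step itself is not implemented.

```python
import math
import random


def p_distance(a, b, gap="-"):
    """Proportion of differing sites, ignoring columns with a gap in either sequence."""
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def jukes_cantor(p):
    """Jukes-Cantor corrected distance; only defined for p < 0.75."""
    if p >= 0.75:
        raise ValueError("p-distance too large for the Jukes-Cantor correction")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


def distance_matrix(aligned):
    """Pairwise corrected distances for a dict of aligned sequences."""
    names = list(aligned)
    return {(m, n): jukes_cantor(p_distance(aligned[m], aligned[n]))
            for m in names for n in names}


def bootstrap_replicate(aligned, rng):
    """Resample alignment columns with replacement to make one pseudo-alignment."""
    length = len(next(iter(aligned.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in columns) for name, seq in aligned.items()}
```

A bootstrap analysis with 1000 replicates repeats bootstrap_replicate 1000 times (for example with rng = random.Random(0)), builds a tree from the distance matrix of each replicate, and reports how often each grouping reappears.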

