Projects

All projects of Data Center


We aim to integrate big biological data; to build topic-specific databases covering tumor diseases, population polymorphism, biodiversity, microbiology, and other areas; to provide data-sharing systems and communities that meet the needs of researchers in different fields; and to enhance the value of data and promote its development and application.

MilletDB: Millet DataBase

The millet database is based on data from the millet genome project conducted by BGI and the Zhangjiakou Academy of Agricultural Sciences. The database records genotype-phenotype information for millet. Users can retrieve millet genotypes by phenotype, and the corresponding phenotypes by genotype. In addition, the database applies big data technology and machine learning methods to build a genotype-phenotype model that supports intelligent molecular breeding.

Database statistics:
Samples: 2540

OneKP: 1000 Plants

The 1,000 Plants project (OneKP or 1KP) is an international multi-disciplinary consortium that has generated large-scale genome sequencing data for over 1,000 species of plants. We built an online BLAST platform based on the OneKP datasets in the database. As of June 21, 2017, BLAST for OneKP had 947 users and had completed more than 74,726 jobs.

Database statistics:
Species: 1199
Samples: 1504
Scaffold data: 81G

B10K: Bird 10K Genomes

The Bird 10,000 Genomes (B10K) Project plans to generate representative draft genome sequences for all extant bird species within five years (2015-2020). The project will build a genome-level tree of all bird species, decode the relationship between genetic and phenotypic variation, uncover how genetic evolution correlates with biogeographical and biodiversity patterns, evaluate the impact of ecological factors and human influence on species evolution, and reveal demographic history. As of March 14, 2017, B10K had processed 2,500 samples, representing 2,400 species from 1,370 genera, 300 families, and 36 orders.

Database statistics:
Species: 933
Samples: 940
Scaffold data: 86G

MLGD: Marine Life Genome Database

The Marine Life Genome Database (MLGD) is an online database that aims to provide comprehensive knowledge and analysis of the genomes of marine life. We have collected genome, transcriptome, and proteome data and information for the marine species sequenced and published so far. This information is organized according to the taxonomy of marine life, and each species can be searched and viewed in the taxon tree. Each species entry contains a description, references, images, genetic information, and data. Data sets can be downloaded directly or reached through a redirect to the NCBI Genome database, and we will add online analysis tools in future editions. We warmly welcome collaborations to sequence and analyze new marine species for a better understanding of marine genomes. At present, MLGD holds information on 472,547 species, 7,538 genomic data sets, and 25,514 images.

Database statistics:
Species: 472547
Genomes: 7538
Images: 25514

FishT1K: Transcriptomes of 1,000 Fishes

The FishT1K (Transcriptomes of 1,000 Fishes) project was officially launched by BGI in November 2013, with the aim of generating genome-wide transcriptome sequences for 1,000 diverse species of fishes using RNA-seq. The FishT1K database will establish the first data storage, application, and sharing platform for research on fishes, greatly advancing the study of fish biology and eventually contributing to global fish biodiversity conservation and the sustainable use of natural resources. In addition, the database will promote the development of new technologies and software for transcriptome sequencing, data analysis, annotation, and storage.

Database statistics:
Species: 129
Samples: 158
Scaffold data: 21G

1KITE: 1K Insect Transcriptome Evolution (Beta)

Insects are one of the most species-rich groups of metazoan organisms. They play a pivotal role in most non-marine ecosystems, and many insect species are of enormous economic and medical importance. Unraveling the evolution of insects is essential for understanding how life in terrestrial and polar environments evolved. The 1KITE (1K Insect Transcriptome Evolution) project aims to study the transcriptomes (that is, the entirety of expressed genes) of more than 1,000 insect species encompassing all recognized insect orders.

Database statistics:
Subprojects: 12
Species: 105
Samples: 105
Contig sequence data: 4.3G

MT10K: 10K Mitochondrion Genome

The 10K Mitochondrial Genome Project (MT10K) is a global research project initiated by the China National GeneBank. The project plans to construct a mitochondrial database of more than 10,000 mitochondrial genomes covering all groups of animals. It aims to cover all animal families and create a truly comprehensive mitochondrial database. At present, the database consists of 5,157 species and about 35Gb of mitochondrial genomic data. The database also incorporates BLAST and PhyML tools to provide data and analytical support for researchers in the field.

Database statistics:
Data sources: 3
Species: 5157
Specimens: 5206
Images: 61

10KP: 10KP (Beta)

The 10,000 Plants project (tenKP or 10KP) aims to sequence over 10,000 genomes representing every major clade of plants and eukaryotic microbes. The project will generate large-scale plant genome data within five years (2017-2022), addressing fundamental questions about plant evolution. Major supporters include Beijing Genomics Institute in Shenzhen (BGI-Shenzhen) and China National Gene Bank (CNGB). BGI will support the project by developing new tools for de novo genome sequencing and assembly on BGISEQ platforms.

Database statistics:
Species: 381425
Samples: 2

BCGD: Biodiversity Comparative Genomics Database (Beta)

The Biodiversity Comparative Genomics Database aims to bring together knowledge and omics data sets for different species on Earth, including high-quality data sets from major international projects such as B10K and 1KITE, in order to understand species diversity and to construct phylogenetic trees that reveal the evolutionary relationships between species through cross-species comparative analysis. It also draws on species knowledge, biological data, barcodes, images, and other data to build a system for species identification and information queries.

Database statistics:
Species: 165

ADD: Agriculture BioDiversity Database (Beta)

The Agriculture BioDiversity Database (ADD) integrates data from agricultural genome projects with knowledge about different species, and offers agricultural researchers a platform with a user-friendly interface. ADD provides a visualized taxonomy structure to organize the integrated information, along with a search service. At present, the database archives the description, images, references, genomic data, and available gene structure and function annotation for each agriculture-related species, especially plants. In the future, online analysis tools will be added, and the content will be expanded to transcriptomes, SNPs, and related microorganisms. As the name ADD suggests, we hope the database will keep growing in information and become more helpful to agricultural researchers. We therefore warmly welcome collaborations to explore new species and dig deeper into sequenced ones, for a better understanding of agriculture.

Database statistics:
Species: 353
Data: 2.774T

PVD: Pathogen Variation Database

The Pathogen Variation Database (PVD) focuses on identifying and detecting unknown pathogens in human samples, and contains genomic data and related annotation for a range of pathogens. PVD presents results clearly through data analysis and visualization, and will provide toxicity identification and drug-resistance information for some pathogens (HBV/HIV/HCV/HP). It thereby offers fast, comprehensive detection services to clinicians, patients, and researchers.

Database statistics:
No statistics available.

GDRD: Genetic Disease and Rare Disease Database (Beta)

GDRD is an integrated platform for genetic disease and rare disease research and application. It focuses on the collection, storage, analysis, and mining of human genetic and phenotype data. In GDRD phase I, around 7,000 papers, 10,000 causative variants, and 300 families with rare disorders from BGI, ClinVar, and the OMIM database have been organized and presented on the website. GDRD phase II will then aggregate sequencing data and phenotypic traits from a variety of genetic disease and rare disease research projects at CNGB and its collaborators. By the end of 2017, it will add WGS data for 6,000 samples, and the amount of generated data will reach 20 to 30 TB. In addition, GDRD provides an automated interpretation process that will help clinicians interpret genetic test results, greatly improving efficiency and helping to avoid human error.

Database statistics:
Papers: 25
Families: 286
Variations: 11104

PIRD: Pan Immune Repertoire Database

The Pan Immune Repertoire Database (PIRD) focuses on immune data related to the human body. It collects BCR and TCR sequencing data for various diseases, together with experimental information and phenotype information for the corresponding individuals. PIRD V1 stores data for 1,923 samples and 554,696,060 sequences. PIRD V2 will integrate more samples and data. By the end of 2017, it will add information for 5,000 samples, and the total data will reach 10 TB. PIRD provides data comparison and visual analysis services for disease and health researchers and clinicians.

Database statistics:
Projects: 2
Samples: 1923
Individuals: 1860
Raw data: 4.7T
Analysis data (compressed): 37G

HMD: Human Microbiome Database

The Human Microbiome Database (HMD) provides sample and microbial data. HMD currently covers 83G of sequencing data and phenotypic information from 1,443 stool samples collected in 8 human intestinal microbiome research projects. It also contains the most complete human intestinal microbial gene set in the world. By the end of 2017, HMD phase II will collect 3,000 samples (blood, saliva, plaque, skin, genital tract flora, and so on), and the amount of data will increase by 250 GB. The corresponding gene sets will also be made available.

Database statistics:
Samples: 1443
Genes: 9879896

GeMap: Human omics-scale annotation system

GeMap is a comprehensive database that integrates genome data from 27 populations in 18 countries and collects data from six authoritative databases, covering 38,659 genes and millions of variants. GeMap provides a data retrieval service: users can search by rs ID, gene name, disease name, chromosome position, and so on. In the future, GeMap will integrate disease data, so that users can retrieve phenotype information by disease name or related diseases by gene name. GeMap will also build a personal data analysis workstation, where users can upload personal sequencing data and obtain analysis results for their own genome. GeMap thus not only provides massive data support for researchers and medical practitioners, but will also offer the public easy-to-use personal genome analysis tools and platforms.

Database statistics:
Data sources: 6
Populations: 27
Countries: 18
Genes: 38659
Variations: 151474076

DISSECT: Data Integration Solution for Systematic Exploration of Cancer Traits

DISSECT (Data Integration Solution for Systematic Exploration of Cancer Traits) is a comprehensive data integration platform for cancer research. It includes the first mirror site of the ICGC Data Portal in China, which provides important resources for domestic researchers. Building on big data research, we aim to establish the most comprehensive cancer big data integration system through large-scale, standardized platform construction. DISSECT already stores genomic and clinical data from around 20,000 cancer cases and will continue to release updated data. The second version of DISSECT will integrate WGS data from 10,000 Chinese cancer patients provided by the BGI cancer research institute. The system's main value lies in integrating omics data and supporting in-depth mining of large sample sets for single or multiple cancer types, in support of precision cancer medicine in China.

Database statistics:
Projects: 37
Samples: 50181
Genes: 59615
Tools: 8
All data: 18G

DHGV: Database of Human Genetic Variations (Beta)

The Database of Human Genetic Variations (DHGV) advances scientific research and human health by providing variation data for human populations around the world. So far, the database contains more than 10,145 human samples from around the world and more than 170 million variants. The database will continue to collect genetic variation data from Chinese populations. In the first release of DHGV, you can search for any variant of interest to obtain its variation information, including alleles, distribution, frequency, and genome annotations. The second version is intended to add information on the relationship between variants and disease. DHGV's free data service will also help drive research on the world's populations, especially the Chinese population, including research and applications in population origin and evolution, genetic diseases, and precision medicine.

Database statistics:
Samples: 4563
Variations: 1741376415
Countries: 33

SCDB: Single Cell DataBase (Beta)

The Single Cell Database will integrate the atlas of human cells, catalog all types of body cells including subtypes, build a complete list of human cells, define human cells, and construct the cell framework. The first version of the database presents four projects comprising 46 samples, 30,854 cells, and 470 GB of data.

Database statistics:
Projects: 4
Samples: 46
Cells: 30854

BDDB: Birth Defects Database (Beta)

The Birth Defects Database focuses on birth defects of genetic or partially genetic origin. It collects genetic and phenotype data from samples related to birth defects, with the aim of accelerating data sharing, exchange, and cooperation in this field. Researchers can search for general information on a particular disease and basic clinical information on cases.

Database statistics:
Disease: 190
Sample: 52

ICGC: ICGC Data Portal China Mirror

The ICGC Data Portal China Mirror provides visualization, query, and download services for tumor data, covering 70 tumor research projects, 19,290 samples, 46,429,997 individual mutations, and 57,658 mutated genes. The mirror site is regularly synchronized with the main ICGC site, providing faster service to researchers in China.

Database statistics:
Projects: 70
Cancer primary sites: 21
Donors: 19305
Simple somatic mutations: 46693172
Mutated genes: 57658

HIFTD: Human Infertility and Fetal Tissue Database (Beta)

The Human Infertility and Fetal Tissue Database mainly contains genomic sequencing information for several diseases related to human reproductive function, including infertility, recurrent miscarriage, azoospermia, polycystic ovary syndrome, uterine fibroids, endometriosis, and adenomyosis. The database also includes sequencing data from fetal tissue and from normal individuals. It provides SNPs (single nucleotide polymorphisms), InDels (small insertions and deletions, shorter than 50 bp), CNVs (copy number variants, including deletions and insertions), and SVs (structural variants, including translocations and inversions) for clinicians and researchers.

Database statistics:
Sample: 214
Variant: 2284
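
The size-based distinction between SNPs, small InDels, and larger events in the HIFTD description can be illustrated with a minimal sketch. Only the 50 bp InDel cutoff comes from the description; the function name `classify_variant`, the REF/ALT-style inputs, and the catch-all "CNV/SV" label for longer events are assumptions made for illustration, not the database's actual pipeline.

```python
def classify_variant(ref: str, alt: str, indel_max_bp: int = 50) -> str:
    """Classify a REF/ALT allele pair by size (hypothetical helper)."""
    if len(ref) == 1 and len(alt) == 1:
        # Single-base change, or identical alleles
        return "SNP" if ref != alt else "no variant"
    size = abs(len(ref) - len(alt))
    if size == 0:
        # Equal-length multi-base change, outside the categories listed above
        return "substitution"
    if size < indel_max_bp:
        # Small insertion or deletion, shorter than the 50 bp cutoff
        return "InDel"
    # Longer events would need dedicated CNV/SV calling; grouped here for simplicity
    return "CNV/SV"


print(classify_variant("A", "G"))
print(classify_variant("A", "ATT"))
```

A real classification of CNVs and SVs depends on read depth and breakpoint evidence rather than allele length alone, so the last branch is only a placeholder.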

CCDB: Children's Cancer DataBase (Beta)

The Children's Cancer Database was established to support research on, and improve treatments for, children with cancer. Cancer is the second leading cause of death (after accidents) in children aged 0 to 14 years. The most common types of cancer diagnosed in this age group are leukemias and pediatric brain tumors. In the last 40 years, the overall survival rate for children's cancer has increased from 10% to nearly 90%. DNA and other tissue samples from childhood cancer patients were collected at Beijing Tiantan Hospital and then sequenced at BGI. The database collects genotype and phenotype data for research on childhood cancer. Applying next-generation sequencing to childhood cancers can provide an unprecedented understanding of their molecular biology and clinically relevant alterations.

Database statistics:
Disease: 7
Donor: 254
Sample: 441

cfDNA: cfDNA Database (Beta)

Cell-free DNA (cfDNA) is DNA found in small amounts outside cells; it was first reported by scientists in the 1940s. cfDNA is also called circulating DNA when it is present in plasma or serum. During pregnancy, a small fraction of the cfDNA in maternal blood comes from the fetus; this is called cell-free fetal DNA (cff-DNA). Research on circulating DNA in human blood has become a hot topic in biomedicine and clinical diagnosis. With the development of next-generation sequencing (NGS), cfDNA can be studied quantitatively at the genomic level. cfDNA has several applications in scientific research and medicine, including noninvasive prenatal testing (NIPT), cancer detection and monitoring, and organ transplantation.

Database statistics:
Sample (BGISEQ-500): 10000

CNSA: CNGB Nucleotide Sequence Archive

The CNGB Nucleotide Sequence Archive (CNSA) is a service for online submission of data and metadata for projects, samples, experiments, and sequencing. It meets the need to share small batches of sequencing data. CNSA is committed to storing and sharing data of any type and from any phase of projects, samples, and experiments. It archives raw, intermediate, and final data files produced by a wide variety of sequencing platforms. As a data storage and sharing platform, CNSA adopts the International Nucleotide Sequence Database Collaboration (INSDC) standard, accepts data submissions from all over the world, and shares them with INSDC. In addition, each complete dataset (including phenotype, omics, experimental analysis methods, and other data) in CNSA will be assigned a DOI that can be used as a standard citation in future uses of the data. CNSA makes biological sequence data available to researchers to enhance reproducibility, improve utilization, and allow new discoveries through comparison of data sets.

Database statistics:
No statistics available.

BLAST: The public beta version of high-performance sequence alignment service

BLAST stands for Basic Local Alignment Search Tool. The BLAST service of CNGB is built on the standalone version of NCBI BLAST+ 2.6.0, downloaded from the NCBI FTP server, and provides sequence searching over public data from CNGB applications, BGI projects, and external data sources. In the name "the BLAST service of CNGB", the word BLAST stands for sequence search methods in general. More types of sequence search algorithms will be integrated in the future.

Database statistics:
CNGB Projects: 4

GigaDB: GigaDB

GigaDB primarily serves as a repository to host data and tools associated with articles in GigaScience. However, it also includes a subset of datasets that are not associated with GigaScience articles. GigaDB defines a dataset as a group of files (e.g., sequencing data, analyses, imaging files, software programs) that are related to and support an article or study. Through the association with DataCite, each dataset in GigaDB will be assigned a DOI that can be used as a standard citation for future use of these data in other articles by the authors and other researchers. Datasets in GigaDB all require a title that is specific to the dataset, an author list, and an abstract that provides information specific to the data included within the set. We encourage detailed information about the data we host to be submitted by their creators in ISA-Tab, a format used by the BioSharing and ISA Commons communities that we work with to maintain the highest data and metadata standards in our journal. To maximize its utility to the research community, all datasets in GigaDB are placed under a CC0 waiver (for more information on the issues surrounding CC0 and data see Hrynaszkiewicz and Cockerill, 2012). Datasets that are not affiliated with a GigaScience article are approved for inclusion by the Editors of GigaScience. The majority of such datasets are from internal projects at the BGI, given their sponsorship of GigaDB. Many of these datasets may not have another discipline-specific repository suitably able to host them or have been rapidly released prior to any publications for use by the research community, whilst enabling their producers to obtain credit through data citation. The GigaScience Editors may also consider the inclusion of particularly interesting, previously unpublished datasets in GigaDB, especially if they meet our criteria and inclusion as Data Note articles in the journal.

Database statistics:
No statistics available.

Biomigo: Biomigo

Biomigo is a foundational database of biological data built on CGA (CNGB Global Archive). It provides a full-data retrieval service, integrates a large number of biological data resources, and covers data such as genes, mutations, expression, proteins, and phenotypes. The current version of Biomigo (Version Beta) integrates research data from various fields, including 1,000 Plants (OneKP), 10K Mitochondrion Genome (MT10K), 1K Insect Transcriptome Evolution (1KITE), Transcriptomes of 1,000 Fishes (FishT1K), Bird 10K Genomes (B10K), and ICGC Data Portal China Mirror (ICGC). In the future, more foundational biological databases will be added to improve the data warehouse architecture and the retrieval performance of the search engine, and to better support CGA services.

Database statistics:
CNGB Projects: 23

Chicken Variation Database

On March 1, 2004, the National Human Genome Research Institute (NHGRI) announced the completion of the first draft genome sequence of the Red Junglefowl (RJF), which is believed to be the wild ancestor of domestic chickens. To complement this, BGI led an international team of scientists from China, the USA, the UK, Sweden, the Netherlands, and Germany that created a sequence variation map. To facilitate the application of our data to avian genetics and to provide a foundation for functional and evolutionary studies, we promptly implemented the Chicken Variation Database (ChickVD).

Database statistics:
No statistics available.

Pig Genome Database

Pig Genome Database (Data Access Limited)

Database statistics:
No statistics available.

Rice Information System

BGI, one of the major genome sequencing centers in China, has been carrying out the Superhybrid Rice Genome Project (SRGP) to understand the genome biology of rice. In the Rice Information System, we report the latest progress in the assembly and annotation of the genome of 93-11, a cultivar of Oryza sativa ssp. indica and the major food crop in China, and present the sequenced genomes and related information in systematic and graphical ways, laying the foundation for in-depth comparative studies between rice subspecies.

Database statistics:
No statistics available.

YanHuang Database

On October 11th, 2007, Beijing Genomics Institute at Shenzhen (BGI-Shenzhen) announced the completion of the first diploid genome sequence of a Han Chinese individual, a representative of the Asian population. The genome, named YH, marks the start of the YanHuang Project, which aims to sequence 100 Chinese individuals in 3 years. We set up this ‘YH database’ to present the entire DNA sequence, assembled from 3.3 billion reads (117.7 Gbp of raw data) generated on the Illumina Genome Analyzer. In total, 102.9 Gbp of nucleotides were mapped onto the NCBI human reference genome (Build 36) using the self-developed software SOAP (Short Oligonucleotide Alignment Program), and 3.07 million SNPs were identified.

Database statistics:
No statistics available.
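
The figures quoted for the YH genome can be cross-checked with simple arithmetic. The sketch below uses only the three numbers given in the description (3.3 billion reads, 117.7 Gbp of raw data, 102.9 Gbp mapped); the variable names are illustrative, and the derived values are back-of-the-envelope estimates rather than figures reported by the project.

```python
# Figures quoted in the YanHuang description
total_reads = 3.3e9   # number of reads
raw_gbp = 117.7       # raw data, in Gbp
mapped_gbp = 102.9    # data mapped to the reference, in Gbp

# Share of the raw bases that mapped to the reference genome
mapped_fraction = mapped_gbp / raw_gbp

# Average read length implied by the read count and raw data volume
mean_read_length_bp = raw_gbp * 1e9 / total_reads

print(f"Mapped fraction: {mapped_fraction:.1%}")
print(f"Mean read length: {mean_read_length_bp:.1f} bp")
```

The result is a mapped fraction of roughly 87% and an implied mean read length of about 36 bp.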

Silkworm Genome Database

SilkDB (Silkworm Genome Database) is a database of the integrated genome resource for the silkworm, Bombyx mori. This database provides access to not only genomic data including functional annotation of genes, gene products and chromosomal mapping, but also extensive biological information such as microarray expression data, ESTs and corresponding references. SilkDB will be useful for the silkworm research community as well as comparative genomics.

Database statistics:
No statistics available.

Giant Panda Database

On October 11th, 2008, Beijing Genomics Institute at Shenzhen (BGI-Shenzhen) announced the completion of the first draft genome sequence of a giant panda: Jingjing, a 3-year-old female chosen from the Chengdu and Wolong breeding centers. The genome was sequenced using next-generation sequencing technology (Illumina GA) and a self-developed short-read assembly method. We set up this database to present the entire panda genome sequence, as well as annotation information such as gene structure and function, non-coding RNAs, and repeat elements. Polymorphism information detected in the diploid genome, such as SNPs, indels, and structural variations (SVs), is also presented.

Database statistics:
No statistics available.

Monkey Database

The most commonly used non-human primates in medical research are of the genus Macaca, making it important to gain a better understanding of their genetic differences. The Macaca genus of Old World monkeys is closely related to humans, sharing a last common ancestor ~25 million years ago (Mya). The close relationship between humans and macaques has made several species attractive as animal models for a variety of biomedical analyses, including investigations of cancer, neurological disease, HIV infection, Parkinson’s disease, malaria, and drug abuse, as well as toxicology and vaccine and drug testing. Although the Indian subspecies of the rhesus macaque (Macaca mulatta mulatta) was originally the research model of choice, a ban on its export has greatly reduced the availability of these animals, leading to increased use of other macaque species and subspecies, in particular the Chinese rhesus macaque (Macaca mulatta lasiota) and the cynomolgus or crab-eating macaque (Macaca fascicularis). Here we present genome information for two newly sequenced macaques, the Chinese rhesus macaque and the cynomolgus/crab-eating macaque, and for the previously sequenced Indian rhesus macaque. Together with expression information for the Indian rhesus macaque and the cynomolgus/crab-eating macaque, we aim to characterize the differences between rhesus and cynomolgus macaques as model animals.

Database statistics:
No statistics available.

Ant Database

The long-term goal of the ant genome project is to establish ants as model organisms to gain insights into the epigenetic mechanisms that underlie social behavior and longevity. Ants present unique opportunities to address these questions at a molecular level, because genetically identical embryos can follow developmental trajectories giving rise either to reproductive queens or non-reproductive workers. These two types (morphs) of adults display striking differences in physiology, life span and behavior, which must be determined via epigenetic mechanisms. The ant genome database (Antbase) currently contains genomic data for the two ant species we have sequenced, and will include more ant species as their genomes become available in the near future. It already offers multiple bioinformatic tools, such as BLAST search and a genome browser, as well as detailed information for each gene. Its functions will be greatly improved in the next few months. (Data Access Limited)

Database statistics:
No statistics available.

Tobacco Database

Tobacco (Nicotiana tabacum L.) has long served as a model plant because it is a convenient system for research. Tobacco is a member of the Solanaceae, a plant family that includes several other economically important species, such as tomato, eggplant, petunia, potato and pepper. A high-quality, well-annotated genome sequence of N. tomentosiformis, combined with high-throughput analyses of the transcriptome, promises to radically enhance our ability to identify genetic factors involved in the formation of undesirable compounds in cigarette smoke and in important agronomic traits in tobacco. The tobacco database seeks to provide such a resource to the tobacco research and breeding community in the near future. (Data Access Limited)

Database statistics:
No statistics available.

Camel Genome Database

A camel is an even-toed ungulate within the genus Camelus, bearing distinctive fatty deposits known as humps on its back. There are two species of camels: the dromedary or Arabian camel has a single hump, and the Bactrian camel has two humps. They are native to the dry desert areas of West Asia, and Central and East Asia, respectively. Both species are domesticated to provide milk and meat, and as beasts of burden. (Data Access Limited)

Database statistics:
Data: 16G

Sheep Database

The first draft assembly of the sheep genome was generated from liver DNA of a single Texel ewe and currently consists of scaffolds covering 2,710 Mb. The assembly used approximately 75-fold whole-genome shotgun reads obtained with Illumina technology, combined with 360K BAC-end sequences deposited in NCBI, and was assembled by BGI using SOAPdenovo.

Database statistics:
Data: 2.1G

Oyster Database

The long-term goal of the oyster genome project is not only to establish the oyster as a model organism for mollusc research, but also to improve mollusc culture and gain insight into the interaction between the marine environment and human health. We therefore sequenced the genome of the Pacific oyster, Crassostrea gigas, which is generally considered to have high polymorphism and repeat content. This genome database currently contains genomic data for our sequenced C. gigas and will include more omics data in the near future. It already offers multiple bioinformatic tools, such as BLAST search and a genome browser, as well as detailed information for each gene. (Data Access Limited)

Database statistics:
No statistics available.

Birch Genome Database

In October 2013, we completed the first draft genome sequence of a B. platyphylla tree growing in Harbin, China (45°44′N, 126°36′E). The genome was sequenced using next-generation sequencing technology (Illumina GA) and a self-developed short-read assembly method. It is estimated at approximately 440 million base pairs contained in 28 chromosomes. We created this database to present the entire B. platyphylla genome sequence, as well as annotation information such as gene structure and function, non-coding RNAs, and repeat elements. (Data Access Limited)

Database statistics:
Data: 519M

Pepper Genome Database

Capsicum, commonly referred to as pepper, is an economically important genus of the Solanaceae family, which also includes tomato and potato. Because pepper is one of the most important vegetable crops, its genome will provide an invaluable new resource for biological research and breeding of Capsicum. To better manage the pepper genome data and make it easier for public academic users to access the genome data and related information, we developed the Pepper Genome Database.

Database statistics:
Data: 3.2G

Litchi Genome Database

A whole genome shotgun (WGS) strategy will be adopted for de novo sequencing of Litchi chinensis with Illumina Solexa sequencing technology. Gradient insert libraries of 200 bp, 500 bp, 800 bp to 2 Kb, 5 Kb, 10 Kb, and 20 Kb are constructed based on the features of repeat sequences in the species and are sequenced with paired-end reads in order to span many different repeats during assembly. The sequencing depth reaches at least 60X genome coverage to ensure the accuracy of each base and the completeness of the genome. The whole genome map of Litchi chinensis will be generated with BGI’s own assembly software SOAPdenovo, and bioinformatics analysis will be carried out to further decode the Litchi chinensis genome. (Data Access Limited)

Database statistics:
Data: 5.6G

Cotton Genome Project

The Cotton Genome Project (CGP) was initiated and carried out by the Institute of Cotton Research of CAAS, together with BGI. CGP focuses mainly on cotton sequencing and functional analysis. (Limited Cooperation Project)

Database statistics:
Data: 20.9G

Catfish Genome Database

The Silurus Genome Project was initiated and carried out by the Key Laboratory of Freshwater Fish Reproduction and Development (Ministry of Education), Key Laboratory of Aquatic Science of Chongqing, School of Life Science, Southwest University, together with BGI. The project focuses mainly on catfish genome sequencing, functional analysis, and comparative genomic analysis between S. meridionalis and S. asotus. (Limited Cooperation Project)

Database statistics:
Data: 554M

Jujube Genome Database

The jujube (Ziziphus jujuba Mill.) is the most economically important member of the Rhamnaceae, a large cosmopolitan family. It is one of the oldest cultivated fruit trees in the world, with evidence of domestication dating back 7,000 years. It is native to China and is now a major dry fruit crop with a cultivation area of 2 million ha, as well as a traditional herbal medicine in Asia. It has been introduced into more than 40 countries, from temperate to tropical zones across five continents, and is becoming increasingly popular worldwide. We sequenced the complete genome of one of the oldest and most widely cultivated jujube cultivars, ‘Dongzao’, using an integrated strategy. To better manage the jujube genome data and give more academic researchers access to the genome data and related information, we developed the Jujube Genome Database. (Limited Cooperation Project)

Database statistics:
Data: 159M

Whole-genome Resequencing Project of Chinese Indigenous Pig Breeds

Whole-genome Resequencing Project of Chinese Indigenous Pig Breeds. (Limited Cooperation Project)

Database statistics:
Data: 4.9G

Snapdragon Genome Database

Snapdragon is a popular floriculture plant and an ideal model for floral development and evo-devo studies, and its genome will provide an invaluable new resource for studies of plant development, adaptation, and evolution. To give public academic users easier access to the genome data and related information, the Snapdragon Genome Database has been developed and will be regularly updated. (Data Access Limited)

Database statistics:
No statistics available.

Cucumber Genome Database

Cucumber has seven pairs of chromosomes and a haploid genome of 367 Mb, which is smaller than that of other species in the Cucurbitaceae family. We have sequenced and assembled the genome of the domestic cucumber, C. sativus var. sativus L. The assembled N50 contig and scaffold sizes were 19.8 Kb and 1.14 Mb, respectively. Using the genetic map, we anchored 72.8% of the assembled sequences onto the 7 chromosomes. A total of 26,682 genes were predicted in the current cucumber genome. Cucumber is the first sequenced vegetable crop, and its genome will provide an invaluable new resource for biological research and breeding of cucurbits. To better manage the cucumber genome data and give public academic users easier access to the genome data and related information, we developed the Cucumber Genome Database.

Database statistics:
Data: 388M

Foxtail millet Database

Foxtail millet (2n=18) is an annual grass grown both as a cereal crop (for grain production) and as forage, mainly in temperate, subtropical, and tropical areas. Regarded as a healthy food, it is a nutritious dietary source of starch, protein, and various vitamins and minerals, such as calcium, iron, and sodium. It provides the main daily calorie intake for nearly one-third of the world population and is especially prevalent in dry climates or poor-soil regions that are not suited to the cultivation of many other crops. It is self-pollinating and has a short life cycle, small stature, and small genome size. These favorable attributes make it attractive as a model system for functional genomics and as a reference genome to aid the sequencing of other, larger grass genomes.

Database statistics:
Data: 3.46G

Pestalotiopsis Microspora Database

Pestalotiopsis microspora is a species of endophytic fungus capable of breaking down and digesting polyurethane. (Data Access Limited)

Database statistics:
Data: 445.2G

Chaetomium Globosum Database

Chaetomium globosum is a dematiaceous species of fungus in the Chaetomiaceae family. (Data Access Limited)

Database statistics:
No statistics available.

Phylogenomics Analysis of Birds

For phylogenomic analysis of birds, the genomes presented here were sequenced, assembled de novo, and annotated for functional elements. These genomes will help reconstruct the evolutionary history of birds, which has the potential to answer several outstanding fundamental evolutionary questions.

Database statistics:
Data: 105G

Naked Mole Rat Database

The naked mole-rat (Heterocephalus glaber), also known as the sand puppy or desert mole rat, is a burrowing rodent native to parts of East Africa. This unusual mammal has many remarkable physiological features that make it a unique animal model for researchers in a variety of fields. We present the entire naked mole-rat genome sequence, as well as gene structure and functional annotation. Information is also provided on gene expression levels in three organs. We hope these data will contribute to a better understanding of the genetic and biological basis of the naked mole-rat's extraordinary features.

Database statistics:
Data: 445G

Gut Meta FTP

Details to follow.

Database statistics:
Data: 1.1T

Paulownia Genome Database

A whole-genome shotgun (WGS) strategy was adopted for de novo sequencing of Paulownia fortunei with Illumina Solexa sequencing technology. Libraries with graded insert sizes, from 200 bp, 500 bp, and 800 bp up to 2 Kb, 5 Kb, 10 Kb, and 20 Kb, were constructed based on the features of the repeat sequences in the species and sequenced paired-end so that reads span many different repeats during assembly. The sequencing depth reached at least 60X genome coverage to ensure the accuracy of each base and the completeness of the genome. The whole-genome map of Paulownia fortunei was generated. (Data Access Limited)

Database statistics:
Data: 755M