The Gene Expression Omnibus (GEO) is a public repository that archives and freely distributes microarray, next-generation sequencing, and other forms of high-throughput functional genomic data submitted by the scientific community. In addition to storing data, it provides a set of web-based interfaces and applications that help users query and download the studies and gene expression patterns stored in GEO.
GEO data storage types
The GEO database stores four types of records: GSE, GDS, GSM, and GPL.
A GSE accession (GSExxx) identifies a Series: the complete data for one research project, which may involve several platforms.
A GDS accession (GDSxxx) identifies a DataSet whose samples all come from the same platform. The data are generated by microarray and high-throughput sequencing techniques, such as:
- Gene expression profiling by microarray or next-generation sequencing
- Non-coding RNA profiling by microarray or next-generation sequencing
- Chromatin immunoprecipitation (ChIP) profiling by microarray or next-generation sequencing
- Genome methylation profiling by microarray or next-generation sequencing
- High-throughput RT-PCR
- Genome variation profiling by array (arrayCGH)
- SNP arrays
- Serial analysis of gene expression (SAGE)
- Protein arrays
A GSM accession (GSMxxx) identifies a single Sample, which always comes from a single platform; a GSE or GDS usually contains several GSM records.
A GPL accession (GPLxxx) identifies a Platform record, which users rarely need to work with directly.
In addition, the GEO Profiles database, compiled by GEO staff from the data submitted by users, presents the expression profile of a single gene across the samples of a dataset.
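The four accession prefixes can be told apart programmatically. The following is a minimal sketch in base R; `geo_record_type` is a hypothetical helper written for illustration and is not part of GEO or GEOquery.

```r
# Hypothetical helper (not part of GEO or GEOquery): map an accession
# string to its record type based on the three-letter prefix.
geo_record_type <- function(accession) {
  types <- c(GSE = "Series", GDS = "DataSet", GSM = "Sample", GPL = "Platform")
  accession <- toupper(accession)
  prefix <- substr(accession, 1, 3)
  # Accept only a known prefix followed by digits
  valid <- grepl("^(GSE|GDS|GSM|GPL)[0-9]+$", accession)
  out <- unname(types[prefix])
  out[!valid] <- NA_character_
  out
}

geo_record_type(c("GSE57820", "GDS6100", "GSM1394594", "GPL10558"))
```

Anything that does not match one of the four prefixes followed by digits is returned as `NA`, which makes it easy to catch a mistyped accession before attempting a download.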
GEO data retrieval and download
The GEO database supports keyword searches with Boolean operators, much like PubMed, and searches are typically run against the GEO DataSets database. For example, to search for breast cancer:
1. Search for "breast cancer" to list all the array data related to breast cancer.
2. Select the dataset you want to study and click through to its record.
3. Click the accession number to jump to the download page, where data can be downloaded in formats such as SOFT, MINiML, and RAW.
4. The dataset can also be analyzed on the site itself, for example by viewing the expression profile of the BRCA1 gene. The "Profile neighbors" link lists genes with similar expression profiles, which is how to find candidate genes co-expressed with BRCA1.
5. After analyzing the expression profiles of all the genes, a candidate signaling pathway can also be identified.
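The search in step 1 and the record page in step 3 can also be reached by URL. The sketch below builds such URLs in base R; `geo_search_url` and `geo_accession_url` are hypothetical helpers, and the URL patterns are assumptions based on NCBI's public web pages rather than a documented API.

```r
# Hypothetical helpers. The URL patterns are assumptions taken from the
# addresses shown by the NCBI website, not from a documented interface.

# Combine search terms with the Boolean operator AND and build a
# GEO DataSets search URL.
geo_search_url <- function(...) {
  query <- paste(c(...), collapse = " AND ")
  paste0("https://www.ncbi.nlm.nih.gov/gds/?term=",
         utils::URLencode(query, reserved = TRUE))
}

# Build the URL of the record page for a single accession.
geo_accession_url <- function(accession) {
  paste0("https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=", accession)
}

geo_search_url('"breast cancer"', "BRCA1")
geo_accession_url("GSE57820")
```

Quoting a phrase, as in `"breast cancer"`, keeps the two words together in the query, in the same way as in a PubMed search.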
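R package installation and data download
> # Current Bioconductor releases install packages through BiocManager;
> # the older source("https://bioconductor.org/biocLite.R") route has been retired.
> if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
> BiocManager::install("GEOquery")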
Download with GSE
After finding a GSE accession of interest in the literature, use the getGEO() function in GEOquery to download the corresponding expression data (the series_matrix.txt file) and platform information. For example, GSE57820:
> library(GEOquery)
> # destdir sets the download directory (here the current directory); with getGPL and AnnotGPL set to TRUE, the platform annotation file is downloaded as well.
> GSE57820 <- getGEO("GSE57820", GSEMatrix = TRUE, destdir = ".", getGPL = TRUE, AnnotGPL = TRUE)
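With GSEMatrix = TRUE, getGEO() returns a list containing one ExpressionSet per platform in the series. The sketch below shows one way to pull out the expression matrix, sample information, and probe annotation using the Biobase accessors; `unpack_eset` is a hypothetical helper, and the download is wrapped in `interactive()` only so that the block does not fetch data when sourced non-interactively.

```r
library(GEOquery)
library(Biobase)  # supplies exprs(), pData(), and fData()

# Hypothetical helper: extract the three tables held by one ExpressionSet.
unpack_eset <- function(eset) {
  list(
    expression = exprs(eset),  # numeric matrix, probes x samples
    samples    = pData(eset),  # one row per sample (GSM)
    features   = fData(eset)   # platform annotation, one row per probe
  )
}

if (interactive()) {  # guarded because it downloads data from NCBI
  gse_list <- getGEO("GSE57820", GSEMatrix = TRUE, destdir = ".",
                     getGPL = TRUE, AnnotGPL = TRUE)
  length(gse_list)                 # number of platforms in the series
  parts <- unpack_eset(gse_list[[1]])
  dim(parts$expression)
  head(parts$samples[, 1:3])
}
```

If a series spans several platforms, each element of the list has to be unpacked separately, because the probe identifiers differ between platforms.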
Download with GDS
> GDS6100 <- getGEO("GDS6100", GSEMatrix = TRUE, destdir = ".", getGPL = TRUE, AnnotGPL = TRUE)
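A GDS download returns a GDS object rather than an ExpressionSet. As a rough sketch, the Meta() and Table() accessors in GEOquery expose its header metadata and data table, and GDS2eSet() converts it to an ExpressionSet; `describe_geo` is a hypothetical helper, and the call is again guarded so that nothing is downloaded unless run interactively.

```r
library(GEOquery)

# Hypothetical helper: summarise a GDS, GSM, or GPL object from getGEO().
describe_geo <- function(x) {
  tab <- Table(x)             # data table as a data.frame
  list(
    title   = Meta(x)$title,  # header metadata is a named list
    n_rows  = nrow(tab),
    columns = colnames(tab)
  )
}

if (interactive()) {  # guarded because it downloads data from NCBI
  gds <- getGEO("GDS6100", destdir = ".")
  str(describe_geo(gds))
  # Convert the dataset to an ExpressionSet for downstream analysis
  eset <- GDS2eSet(gds, do.log2 = FALSE)
  dim(Biobase::exprs(eset))
}
```

The same two accessors apply to the GSM and GPL objects returned for sample and platform accessions.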
Download with GSM
A GSM accession downloads the expression data of a single sample, for example GSM1394594:
> GSM1394594 <- getGEO("GSM1394594", GSEMatrix = TRUE, destdir = ".", getGPL = TRUE, AnnotGPL = TRUE)
Download with GPL
For an array platform, the data downloaded by GPL accession are the array design and annotation information, from which the mapping between probe sets and genes can be obtained, for example GPL10558:
> GPL10558 <- getGEO("GPL10558", GSEMatrix = TRUE, destdir = ".", getGPL = TRUE, AnnotGPL = TRUE)
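Once the probe-to-gene mapping is available, a probe-level expression matrix can be collapsed to gene level. The sketch below averages probes that share a gene symbol, which is one common choice among several; `collapse_to_genes` is a hypothetical helper, and the column names `ID` and `Symbol` are assumptions, so check `colnames(Table(GPL10558))` for the names actually present in the downloaded annotation.

```r
# Hypothetical helper: average probes that map to the same gene symbol.
# expr:  numeric matrix with probe IDs as row names
# annot: annotation data.frame, e.g. Table(GPL10558)
# id_col and symbol_col are assumed column names; adjust to the platform.
collapse_to_genes <- function(expr, annot, id_col = "ID", symbol_col = "Symbol") {
  idx <- match(rownames(expr), as.character(annot[[id_col]]))
  symbols <- as.character(annot[[symbol_col]])[idx]
  keep <- !is.na(symbols) & symbols != ""   # drop unannotated probes
  sums <- rowsum(expr[keep, , drop = FALSE], group = symbols[keep])
  counts <- as.vector(table(symbols[keep])[rownames(sums)])
  sums / counts                             # per-gene mean across probes
}

# Toy data standing in for a real expression matrix and annotation table
expr <- matrix(c(1, 3, 5, 2, 4, 6), nrow = 3,
               dimnames = list(c("p1", "p2", "p3"), c("s1", "s2")))
annot <- data.frame(ID = c("p1", "p2", "p3"),
                    Symbol = c("BRCA1", "BRCA1", "TP53"))
collapse_to_genes(expr, annot)
```

Averaging is only one way to handle several probes per gene; keeping the probe with the highest mean or variance is another, and the choice depends on the analysis.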