1
 
2
 
3
 
4
Microarray Technology Introduction

  • Nucleic acid hybridization/antibody based methodology for expression or mapping studies


  • High-throughput screening method


  • Utilizes very large collections of substrate immobilized DNA or Proteins


  • Large scale and concurrent survey of


        • Gene/Protein expression changes


          • Gene expression/protein profiling

        • Genomic Variation


          • Single nucleotide polymorphism detection / haplotype analysis (HapMap)


    • Requires bioinformatic approaches for data management and analysis

5
Expression Profiling
  • Molecular description of a cell (system) at ultra high resolution


    • Transcriptome Mapping
        • Molecular Finger Printing

    • Genetic Networks
        • Transcriptional dependencies

    • Inferential Pathway Analysis
        • Signal transduction targets

  • Systems biology tool
6
Some Terminology
  • Probe = spotted DNA
  • Target = labeled sample
  • Feature = Probe
  • Pitch = center to center spacing of probes
  • Feature density = total number of probes


        • Probes are not the same as genes

  • Unigene ID = EST to EST association
  • Entrez ID = sequence associated with genomic locus
  • GO ID = functionally annotated genes
7
 
8
 
9
What do we need?
  • Arrayer
  • Substrates
    • Membranes (33P)
    • Glass (Fluorescence)
    • Beads (Fluorescence)
  • Probe sets
    • Oligonucleotides
    • EST/cDNA
  • Scanner
  • RNA
10
Substrates and Arrays – what are the Options?
  • Substrate linked Synthesis (Glass, beads)
    • Affymetrix, Nimblegen (fluorescence)
      • Light directed synthesis
      • oligonucleotides


    •  Agilent (fluorescence)
      • Spatially directed fluidics based synthesis
      • oligonucleotides


    • Illumina, Luminex (fluorescence)
      • Bead linked synthesis
      • oligonucleotides

  • Contact Printing (glass, membranes)
    • cDNA (fluorescence, 33P)
    • Oligonucleotides (fluorescence, 33P)
    • Proteins (antibodies, fluorescence)
11
Detection of Hybridization Events
12
Where do we get the DNA to put on arrays?
  • Oligos - designed based on sequence data
  • ORFs - PCR primers designed based on sequence data,  often use tailed primers
  • cDNA - clones from various clone collections, PCR amplified and purified


13
Affymetrix (Nimblegen) GeneChip technology
  • Short Oligonucleotides
    • 25-mer (Affymetrix)
    • 50-mer (Nimblegen)
  • 40-1300k DNA oligos on a ~2.5 cm² glass surface


  • Expression arrays
    • Human, Mouse, Rat, Yeast, E. coli, Drosophila, C. elegans, Dog, Soybean, Plasmodium/Anopheles, Pseudomonas, Arabidopsis, Zebrafish, Xenopus, etc.
  • DNA analysis arrays
    • Resequencing, SNP analysis, LOH
  • Custom arrays
14
Genechip Synthesis
  • The manufacturing of GeneChip® probe arrays is a combination of photolithography and combinatorial chemistry. See: http://www.Affymetrix.com/Technology.html
15
Genechip Design
16
Agilent
  • Inkjet-synthesized long (60-mer) oligonucleotide arrays
  • Arrays include
    • Expression arrays (Human, Mouse, Rat, Arabidopsis, rice, Magnaporthe and yeast expression arrays)
    • Promoter arrays (human and mouse)
    • Custom arrays (rapid turnaround, 8.4K or 22K feature sizes, up to 8 fields per slide)
  • Agilent also creates custom arrays
17
Chip Manufacturing Process
18
Multiplexed DNA Micro Array Analysis
19
Quantum Dot Properties
20
Decoding of Dye based Beads
21
 
22
Direct Hybridization RNA Profiling Bead
  • 50 base gene-specific Probe linked to 23 base Address
  • Hybridized to labeled nucleic acid made from total RNA
  • Each bead coated with same probe oligo (100,000’s copies)
  • ~30 copies of each bead type per array
23
Decoding of SEBs - Fiber Optics & Microwells
24
 
25
 
26
 
27
 
28
 
29
 
30
 
31
Sources of Variance
  • Arrays
    • Feature Consistency
      • Spotting Volume
      • Pin characteristics
    • Substrate
      • Homogeneity
      • DNA binding capacity
  • Environment
    • Dust
    • Humidity
    • Aerosols
    • Light
    • Temperature
    • Oxidants
  • RNA
    • Purity → Efficiency of cDNA synthesis
    • Integrity → Length of FSR transcript
  • Labeling
32
 
33
False-positive error and array experimental costs
34
Planning A Good Array Experiment
  • Experimental Design
    • What is the biological question, i.e. what comparisons should be made?
    • What is the type of biological comparisons?
    • How is sample complexity controlled?
    • How many biological replicates are required?
  • Data Analysis
    • How will differentially expressed genes be identified?
    • How will errors be estimated?
35
Available Solutions
  • Most integrated and optimized:
  • Commercial Software
    • SAS, SPSS, S-Plus (general)
    • Spotfire, GeneSight, GeneSpring (specific)
  • Custom
    • TM4, BAMarray, Powerarray, ClusFavor etc. (specific)
  • Main Issues
    • Proprietary
    • Hidden data handling

  • Most versatile, most recent and transparent data handling:
  • Open Source
    • BioConductor/R
      • SAM
      • SPH/EB-arrays
      • LIMMA & Tcl GUI
      • R/MANOVA & JAVA GUI
      • D-Chip & C++ GUI
    • Main Issues
      •  Data import
      •  File format
      •  Requires programmer’s support


36
 
37
 
38
 
39
 
40
Data Preprocessing
  • How to remove systematic biases
41
 
42
Background Correction
  • None
    • DNA vs Substrate
    • No Imputation/Offset
  • Local
    • Negative Signal Intensities likely
    • Imputation/Offset required
  • Global
    • Negative Signal Intensities likely
    • Imputation/Offset required
  • Moving Minimum
    • 3x3 spot average background
    • Negative Signal Intensities likely
    • Imputation/Offset required
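
A minimal sketch of the local option, assuming per-spot foreground and background intensities are already available as plain lists. The function name `subtract_background` and the `floor` parameter are illustrative and not taken from any array package. The offset shifts all corrected values so that none is zero or negative before a log transform.

```python
import math


def subtract_background(foreground, background, floor=1.0):
    """Subtract per-spot background, then add a constant offset if needed.

    The offset is chosen so that the smallest corrected value equals `floor`
    (an assumed, illustrative choice), which keeps log2 defined for every spot.
    """
    corrected = [f - b for f, b in zip(foreground, background)]
    lowest = min(corrected)
    offset = floor - lowest if lowest < floor else 0.0
    return [c + offset for c in corrected], offset


# Hypothetical intensities: two spots are dimmer than their local background.
fg = [520.0, 95.0, 1300.0, 60.0]
bg = [100.0, 110.0, 90.0, 80.0]
corrected, offset = subtract_background(fg, bg)
log2_values = [math.log2(v) for v in corrected]
```

Without the offset, the second and fourth spots would be negative after subtraction and could not be log-transformed.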
43
 
44
 
45
Two Color Arrays: Loess Normalization
    • Intra-array
    • Intensity dependent
    • Model based Stat Tests (MAANOVA)
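
The intensity-dependent correction can be sketched as follows, assuming red and green channel intensities for one array. A local linear regression with tricube weights stands in for a full loess implementation; the function names and the `frac` span are illustrative assumptions, not the API of any package named on these slides.

```python
import math


def loess_fit(x, y, frac=0.5):
    """Local linear regression with tricube weights, evaluated at each x."""
    n = len(x)
    k = max(2, int(math.ceil(frac * n)))  # number of neighbours in each window
    fitted = []
    for x0 in x:
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[k - 1] or 1e-12  # window half-width
        w = [(1 - min(abs(xi - x0) / h, 1.0) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


def loess_normalize(red, green, frac=0.5):
    """Remove the intensity-dependent trend from the log ratios of one array."""
    m = [math.log2(r / g) for r, g in zip(red, green)]        # log ratio
    a = [0.5 * math.log2(r * g) for r, g in zip(red, green)]  # mean log intensity
    trend = loess_fit(a, m, frac)
    return [mi - ti for mi, ti in zip(m, trend)], a


# Synthetic array with a purely intensity-dependent dye bias: M = 0.2 * A - 1.
a_true = [6 + 0.5 * i for i in range(20)]
m_true = [0.2 * a - 1.0 for a in a_true]
red = [2 ** (a + m / 2) for a, m in zip(a_true, m_true)]
green = [2 ** (a - m / 2) for a, m in zip(a_true, m_true)]
m_norm, a_vals = loess_normalize(red, green)
```

Because the synthetic bias contains no biology, the normalized log ratios come out at essentially zero; on real data the residuals carry the differential expression signal.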



46
Two Color Arrays: Quantile Normalization
    • Intra/Inter Array
    • Factor based
    • Removal of High Intensity Outliers
    • Standard Stat Tests (SAM, SPH, LIMMA)
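
A minimal sketch of quantile normalization across arrays, assuming each array is a list of intensities for the same probes in the same order. The function name is illustrative, and ties are broken by position rather than averaged, which a production implementation would handle more carefully.

```python
def quantile_normalize(arrays):
    """Give every array the same intensity distribution.

    arrays: list of equal-length lists, one list per array.
    Each value is replaced by the mean of the values sharing its rank
    across all arrays.
    """
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean of the i-th smallest value over all arrays.
    reference = [sum(s[i] for s in sorted_arrays) / len(arrays) for i in range(n)]
    normalized = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized


# Hypothetical intensities for four probes on three arrays.
raw = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 6.0, 2.0],
    [3.0, 4.0, 8.0, 6.0],
]
norm = quantile_normalize(raw)
```

After normalization every array contains exactly the same set of values, only assigned to different probes, which is why high-intensity outliers on a single array disappear.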


47
Beware
  • No data adjustment, however sophisticated or painstaking, can convert low quality data into high quality data


  • Data adjustment always removes a part of the biology


  • !!Use it as sparingly as possible!!
48
Statistical Analysis
  • How to select differentially expressed Genes!
49
Identification of Differentially Expressed Genes
  • Degree of Regulation
    • Small changes
      • Less reproducible
      • Few genes @ Standard Significance threshold
      •  More replicates required
    • Large Changes
      • More reproducible
      • Many genes @ Standard Significance threshold
      • Fewer replicates required

  • But:
    • Effect size is gene dependent (Transcriptome Survey)


    • Data preparation and Statistical Analysis
50
 
51
Concordance Analysis for Replication
  • 11: ratios from A vs B comparison; replicate 1
  • 12: ratios from A vs B comparison; replicate 2
  • 13: ratios from B vs A comparison; replicate 3
  • 15: ratios from B vs A comparison; replicate 4








  •    Concordance coeff.: 0.947 – 0.961
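
The slide does not say which concordance coefficient was used. As a sketch under that caveat, the code below computes Lin's concordance correlation coefficient between two replicates of log ratios, and flips the sign of a B vs A replicate first so that it is comparable to an A vs B replicate. Function names and data are hypothetical.

```python
import statistics


def concordance_ccc(x, y):
    """Lin's concordance correlation coefficient for two paired lists."""
    n = len(x)
    mx, my = statistics.mean(x), statistics.mean(y)
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return 2 * cov / (vx + vy + (mx - my) ** 2)


def flip_dye_swap(log_ratios):
    """Convert B vs A log ratios into A vs B log ratios."""
    return [-v for v in log_ratios]


# Hypothetical log2 ratios for four genes.
rep1_a_vs_b = [1.0, -0.5, 2.0, 0.0]
rep3_b_vs_a = [-1.0, 0.5, -2.0, 0.0]
ccc = concordance_ccc(rep1_a_vs_b, flip_dye_swap(rep3_b_vs_a))
```

Unlike a plain correlation, this coefficient also penalizes a shift or scale difference between replicates, so it only reaches 1 when the two replicates agree point by point.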
52
 
53
 
54
 
55
 
56
 
57
Heuristic selection of genes I: by median & standard deviation (CV)
58
 
59
 
60
Data Example
61
 
62
 
63
Why Doesn’t the t-statistic Work?
  • When diff_g and its standard error se_g are both small, t_g = diff_g / se_g can still be large (and p becomes very small). So gene g is more likely to be declared a DE gene.
  • This method is therefore biased in favor of selecting genes with small diff_g and small se_g.
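
The bias can be illustrated with two hypothetical genes. SAM counters it by adding a small positive constant s0 to the denominator of the statistic; in this sketch s0 is an arbitrary illustrative value, whereas SAM estimates it from the data. The function name and all numbers are assumptions.

```python
import math
import statistics


def t_and_d(group_a, group_b, s0):
    """Return (difference, standard error, ordinary t, SAM-style d)."""
    na, nb = len(group_a), len(group_b)
    diff = statistics.mean(group_b) - statistics.mean(group_a)
    pooled_var = (
        (na - 1) * statistics.variance(group_a)
        + (nb - 1) * statistics.variance(group_b)
    ) / (na + nb - 2)
    se = math.sqrt(pooled_var * (1.0 / na + 1.0 / nb))
    t = diff / se
    d = diff / (se + s0)  # s0 damps genes whose se is tiny
    return diff, se, t, d


# Gene 1: tiny difference, tiny variance. Gene 2: large difference, larger variance.
gene1 = t_and_d([8.00, 8.01, 7.99], [8.05, 8.06, 8.04], s0=0.1)
gene2 = t_and_d([6.0, 6.4, 5.6], [8.0, 8.5, 7.5], s0=0.1)
```

With these numbers the ordinary t-statistic ranks gene 1 above gene 2 even though its log difference is only 0.05, while the damped statistic ranks gene 2 first.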
64
 
65
SAM Plot
66
Multi-Level Modeling:


67
Multi-Level Modeling
  • T-tests (SAM)
    • High accuracy data
    • Minimal preprocessing artifacts


  • Level1 Model (LIMMA, MAANOVA)
    • Gene specific
    • Yobs =f(Ytrue)
    • Yobs =g(Yarray) + g(Ydye) + g(Ybatch) +………g(Ytreatment)


  • Level2 Model (SPH, EBayes)
    • Population specific
    • Yobs =f(Yup) + f(Ydown) + f(Ynot)
    • Borrow and share information appropriately for better estimates
68
Multi-Level Modeling: LIMMA, MAANOVA, SPH
  • Ability to model various sources of variability:
  • Level 1 model: Gene Specific Model
  • Model the observed log intensities as a function of the unknown true log intensity.
  • detailed modeling of experimental variability: within array, between array, estimation of gene specific variability …


  • Level 2 model: Population Average Model
  • All unknown quantities are given prior distributions
  • Building of all these features into a common model
  • Ability to borrow and share information in appropriate ways to get better estimates
69
 
70
  • By computing the Analysis of Variance (ANOVA), we can mathematically estimate the different sources of variation and systematically detect treatment effects in the data.


  • The real question is: which genes are differentially expressed between the samples? In our framework, we ask which variety-by-gene (VG) effects are statistically significant.


  • Because these effects are handled by the model, certain preprocessing steps can be omitted.
71
 
72
MAANOVA produces Four F-like Statistics
  • F1g measures the gth gene treatment effect using the gene-specific variance estimator.
  • F2g measures the gth gene treatment effect using both the gene-specific variance estimator and the pooled variance estimator with equal weight.
  • F3g measures the gth gene treatment effect using the pooled variance estimator.
  • Fsg measures the gth gene treatment effect using a shrinkage variance estimator.
  • Fs is the most robust and usually the most powerful (Cui, Churchill et al.)
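
The difference between the first three statistics lies only in the denominator. The sketch below assumes a treatment mean square and an error variance are already available per gene, and takes the pooled variance as the plain average of the gene-specific variances (which assumes equal degrees of freedom). Fs is left out because its shrinkage estimator is not specified here. The function name is illustrative and is not the MAANOVA API.

```python
def f_like_statistics(ms_treatment, gene_variance):
    """Return (F1, F2, F3) lists, one value per gene."""
    pooled = sum(gene_variance) / len(gene_variance)
    # F1: gene-specific variance only.
    f1 = [ms / v for ms, v in zip(ms_treatment, gene_variance)]
    # F2: equal-weight average of gene-specific and pooled variance.
    f2 = [ms / ((v + pooled) / 2.0) for ms, v in zip(ms_treatment, gene_variance)]
    # F3: pooled variance only.
    f3 = [ms / pooled for ms in ms_treatment]
    return f1, f2, f3


# Two hypothetical genes with the same treatment mean square
# but different error variances.
f1, f2, f3 = f_like_statistics([6.0, 6.0], [1.0, 3.0])
```

F1 separates the two genes most strongly, F3 cannot separate them at all, and F2 sits in between.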


73
 
74
Volcano Plot
  • A ‘volcano’ plot provides a graphical summary of the simultaneous results from all four F-tests.
  • On the plot, the y-axis value is -log10(P-value) for the F1 test. The x-axis value is proportional to the fold changes.
  • A horizontal line represents the significance threshold of the F1 test.
  • Blue dots: EE genes



  • Green dots: F3
  • Orange dots: Fs
  • Red dots: F2
    • (In the example graph, F2 tests were not run.)
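
A minimal sketch of how the plot coordinates and the horizontal threshold line can be computed, assuming log2 fold changes and p-values per gene are already available. The function name, the `alpha` default and the data are illustrative assumptions; p-values must be greater than zero.

```python
import math


def volcano_points(log2_fold_changes, p_values, alpha=0.05):
    """Return ([(x, y, significant)], threshold) for a volcano plot."""
    threshold = -math.log10(alpha)  # height of the horizontal line
    points = []
    for lfc, p in zip(log2_fold_changes, p_values):
        y = -math.log10(p)
        points.append((lfc, y, y > threshold))
    return points, threshold


# Three hypothetical genes.
points, threshold = volcano_points([2.1, -0.2, -1.8], [0.0004, 0.6, 0.01])
flags = [significant for _, _, significant in points]
```

Genes above the line and far from x = 0 are both statistically significant and strongly regulated.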
75
Power of MAANOVA
  • Limited use of Preprocessing techniques
  • Per gene estimation of factors contributing to variance
  • Permits random effects
  • Systematic detection of treatment effects


  • But: parametric model




76
LIMMA: Linear Models For Microarray Data
  • Similar to MAANOVA
    • Linear model of factors (treatment, dye, etc.)
    • Factors are independent (no interaction)


  • Requires a Contrast Matrix


  • Uses either:
    • Log-odds (B) statistic (large B → DE gene)

      • Bg = log(odds ratio) = log(odds that gene g is DE versus EE)

    • Moderated t-statistic (large T → DE gene)
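
Since B is a log-odds (natural logarithm in LIMMA), it can be converted into a probability that the gene is differentially expressed. A small sketch; the function name is illustrative and is not part of LIMMA.

```python
import math


def posterior_probability_de(b):
    """Convert a log-odds B statistic into P(gene is DE)."""
    return 1.0 / (1.0 + math.exp(-b))


p_even = posterior_probability_de(0.0)   # odds 1:1
p_likely = posterior_probability_de(1.5)  # odds of about 4.5:1
```

A B of 0 therefore corresponds to an even chance of DE versus EE, and positive values favor DE.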

77
How to define the Rejection Region: the multiple Hypothesis Problem
78
Test Of A Single Hypothesis
  • gene g is equivalently expressed (EE)
  • gene g is differentially expressed (DE)
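
Applying this single test to every gene on an array is what creates the multiple hypothesis problem. The arithmetic below is a sketch with assumed numbers (10,000 genes, significance level 0.05); the Bonferroni threshold is shown as one simple, conservative adjustment and is not necessarily the method used in this course.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of EE genes called DE if every gene is truly EE."""
    return n_tests * alpha


def bonferroni_threshold(n_tests, alpha):
    """Per-gene p-value cutoff that bounds the family-wise error rate by alpha."""
    return alpha / n_tests


n_false = expected_false_positives(10_000, 0.05)
cutoff = bonferroni_threshold(10_000, 0.05)
```

At an unadjusted threshold of 0.05, hundreds of genes would be called DE by chance alone, which is why the rejection region has to be defined for the whole family of tests.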
79
 
80
 
81
 
82
 
83
 
84
 
85
 
86
Comparison of The Four Methods
87
 
88