|
1
|
- Jin Jun
- 12/09/2004
- Dept. of CS&E
|
|
2
|
- Biological background
- Previous works and Idea
- Problem definition
- Length constrained algorithms
- GC rich regions
- CpG islands
- Conclusions
|
|
3
|
- Regions of a DNA sequence that are rich in nucleotides C and G
- significant in gene recognition and comparative genomics
|
|
4
|
- Isochore model
- The genome of warm-blooded vertebrates is composed of isochores
- long (>300-kb) DNA regions homogeneous in nucleotide composition.
- Five distinct isochore classes were described for the human genome
- GC-poor classes L1 and L2 (~63% of the genome) and increasingly
GC-rich classes H1, H2, and H3 (~24%, 7.5%, and 4.7%, respectively).
- These five classes form the so called typical “genome phenotype”.
|
|
5
|
- Birds (chicken) and rodents (mouse and rats) have deviant genome
phenotypes
- The chicken possesses an additional extremely GC-rich isochore class
H4, whereas mouse and rat lack class H3. è Comparative genomics
- Genes are found predominantly in the GC-richest isochore classes H2 and
H3, which comprise only 12% of the human genome.
è Gene recognition
|
|
6
|
- Regions of DNA of at least 200bp in length
- with G+C content above 50%, and
- a ratio of observed vs. expected CpGs at least 0.6
- Most of the CpG islands are between 200 and 1400bp with a majority of
them being 200–400bp.
- play an important role in the prediction of promoter.
|
|
7
|
|
|
8
|
- A fixed window is moved down the sequence/alignment and calculating statistics
of GC richness or compositional heterogeneity at each position that the
window is moved to.
- Since an optimal region could span several windows, the window-based
approach might fail in finding the exact locations of some interesting
regions.
|
|
9
|
- x-pl : where x is the C+G count of the region, p is a positive constant
ratio, and l is the length of the region.
- In other words, each of nucleotides C and G is given a reward of 1-p,
and each of nucleotides A and T is penalized by p
|
|
10
|
- (i) given a sequence of real numbers of length n and an upper bound U,
find a consecutive subsequence of length at most U with the maximum sum
in O(n)-time.
- (ii) given a sequence of real numbers of length n and a lower bound L,
find a consecutive subsequence of length at least L with the maximum
average in O(n log L)-time.
|
|
11
|
|
|
12
|
- Given A = <a1, a2, …, an> and U ≤ n
- Find a consecutive subsequence of A of length at most U such that the
sum of the numbers in the subsequence is maximized.
|
|
13
|
- Let A1, A2, …, Ak be disjoint
(consecutive) subsequences of A forming a partition of A, i.e. A = A1A2
… Ak. Ai is called the ith segment of the partition.
- Denote w(A) as the sum of the sequence
|
|
14
|
- A real sequence A = <a1, a2, …, an> is left-negative iff the sum
of each proper prefix <a1, a2, …, an> (w(<a1, a2, …, an>))
is negative or zero for all 1 ≤ i ≤ n-1.
- <-4, 1, -2, 3> is left-negative, while <5, -3, 4, -1, 2,
-6> is not.
- A partition of the sequence A = A1A2 … Ak is minimal left-negative if
each Ai, 1 ≤ i ≤ k, is left-negative, and, for each 1 ≤
i ≤ k-1, the sum of Ai is positive, i.e. w(Ai) > 0.
- the partition <5> <-3, 4> <-1, 2> <-6> of
<5, -3, 4, -1, 2, -6> is minimal left-negative.
|
|
15
|
- Set up the left-negative pointers.
|
|
16
|
- The left-negative pointers of a 15-element sequence.
- 1 2
3 4 5
6 7 8
9 10 11 12 13 14 15
- ai 9 -3 1
7 -15 2 3 -4
2 -7 6 -2 8
4 -9
- p[i] 1 4 3
4 15 6
7 13 9 13 11 13 13 14 15
- w[i] 9 5 1
7 -12 2 3
3 2 5
6 6 8
4 -9
9 -3 1
7 -15 2 3 -4
2 -7 6 -2 8
4 -9
|
|
17
|
- Lemma 2 The algorithm MLN-Point given in Algorithm 1 finds all
left-negative pointers for a length n sequence in O(n) time
|
|
18
|
- Finding the maximum sum consecutive subsequence with length constraint.
|
|
19
|
- The left-negative pointers of a 15-element sequence with U=5.
- 1 2
3 4 5
6 7 8
9 10 11 12 13 14 15
- ai 9 -3 1
7 -15 2 3 -4
2 -7 6 -2 8
4 -9
- p[i] 1
4 3 4
15 6 7 13
9 13 11 13 13 14 15
- w[i] 9
5 1 7 -12
2 3 3
2 5 6
6 8 4 -9
- i 1
2 3 4
5 6 7
8 9 10 11 12 13 14 15
- j 4
4 4 4
7 7 7
9 13 14 14 14 14 14 15
- S(i,j) 14 5
8 4 -10 5
3 -2 7 9 16 10 12 4 -9
- mi 1 11
- mj 4 14
- ms 14 16
- result:<6, -2, 8, 4> and its sum = 16.
|
|
20
|
- Theorem 1 Given a length n real sequence, finding the consecutive
subsequence of length at most U with the maximum sum can be done in O(n)
time.
|
|
21
|
- Corollary 2 Given a length n real sequence and positive integers L ≤
U, finding the consecutive subsequence of length between L and U with
the maximum sum can be done in O(n) time
|
|
22
|
|
|
23
|
- u(A) = w(A)/|A| : the average of A
- A sequence A = <a1, a2, …, an> is right-skew iff the average of
any prefix <a1, a2, …, ai> is always less than or equal to the
average of the remaining suffix subsequence <ai+1, ai+2, …, an>.
- <-4, 1, -2, 3> is right-skew, while <5, -3, 4, -1, 2, -6>
is not.
- A partition A = A1A2 … Ak is decreasingly right-skew if each segment Ai of
the partition is right-skew and u(Ai) > u(Aj) for any i < j .
- the partition <5> <-3, 4> <-1, 2, -6> of <5, -3,
4, -1, 2, -6> is decreasingly right-skew.
|
|
24
|
- The left-negative pointers of a 15-element sequence.
- 1 2
3 4 5
6 7 8
9 10 11 12 13 14 15
- ai 9 -3 1
7 -15 2 3 -4
2 -7 6 -2 8
4 -9
- p[i] 1 4 4
4 14 7
7 14 9 14 11 14 13 14 15
- w[i] 9 5 8
7 -3 5
3 7 2
9 6 10 8
4 -9
- d[i] 1 3 2
1 10 2
1 7 1
5 1 3
1 1 1
9 -3 1
7 -15 2 3 -4
2 -7 6 -2 8
4 -9
|
|
25
|
- Finding the maximum average consecutive subsequence with length
constraint
|
|
26
|
- Lemma 6 The algorithm can computes all right-skew pointers for a length
n sequence in O(n) time.
- Theorem 3 Given a length n real sequence, finding the consecutive
subsequence of length at least L with the maximum average can be done in
O(n log L) time.
|
|
27
|
- Experiments
- mslc: A proposed algorithm 2 to locates the maximum-sum subsequence of
length at most U in O(n)-time.
- mslc slow: A brute-force O(nU)-time version.
- mavs: A proposed algorithm 3 to locates the maximum-average subsequence
of length at least L in O(n log L).
- mavs slow: A brute-force O(nL)-time version.
- mavs linear: this program linearly scan segments for the good
partnership. In the worst case, the time complexity is O(nL).
|
|
28
|
|
|
29
|
|
|
30
|
- Algorithm for finding a consecutive subsequence of length at most U with
the maximum sum in O(n)-time.
è GC rich regions
- Algorithm for finding a consecutive subsequence of length at least L
with the maximum average in O(n log L)-time.
è CpG islands
|
|
31
|
- Anton Nekrutenko and Wen-Hsiung Li, “Assessment of compositional
heterogeneity within and between eukaryotic genomes”, Genome Research,
10:1986–1995, 2000.
- Yaw-Ling Lin, Tao Jiang, and Kun-Mao Chao, “Efficient algorithms for
locating the length-constrained heaviest segments, with applications to
biomolecular sequence analysis”, Journal of Computer and System Science,
65:570-586, 2002.
- X. Huang, “An algorithm for identifying regions of a DNA sequence that
satisfy a content requirement”, CABIOS, 10:219–225, 1994.
|