Notes
Slide Show
Outline
1
Patterns and Clusters
2
Concept
  • Pattern
    • Dataset that can be concisely encoded
  • Cluster
    • Arrangement of data with respect to common features
3
Applications of Pattern Matching and Discovery
  • Matching
    • Similarity searches in databases
  • Discovery
    • Learning sequence motifs
4
Application of Cluster Analysis and Discovery
  • Analysis
    • Predicting whether a patient has a disease based on measuring certain disease-related variables
  • Discovery
    • Finding gene expression associated with disease in large data arrays
5
 
6
PATTERNS
7
String Pattern Matching
Regular Expressions
  • A regular expression (RE) describes a language exclusively by single symbols, by the empty symbol [1] l, by È, and by the Kleene Star operator[2] *.
  • Specifically, regular expressions over an alphabet S* are all strings over the alphabet
  •  SÈ{ (,),l,È,*}
  • [1] l is used to mean Æ
  • [2] The Kleene star * . For a, a* is l+a +a2 + a3 + a4 +......
8
String Pattern Matching
Regular Expressions
  • The REs are obtained recursively as follows:
  • Each member of S ,and l, is a regular expression
  • If a and b are REs, then so is the concatenation (ab)
  • If a and b are REs then so is the union (aÈb)
  • If a  is an RE then so is a*
  • Nothing is an RE unless it follows from the above.


9
Regular Expressions
  • A particularly clear example of a regular expression is from Gusfield :
  • Let S be the alphabet of lower case English characters.
  • RE= (a+c+t)ykk(p+q)*vdt(l+z+l)(pq)
  • A string specified by RE could be  aykkpqppvdtpq
  • Here is how that happens:


10
Implementations
  • GREP (General Regular Expression Processor) – a UNIX Utility
  • Aho and Corasick Algorithm for text searching


11
Image Pattern Recognition
Fourier and related Transforms
12
Image Pattern Recognition Wavelet Transform
13
Image Pattern Recognition
Wavelet Transform
14
Suffix Tree Pattern Discovery
from Pattern Discovery in Biomolecualr Data, Wang et al 1999
  • Distance between two sequences
    • A*B*CD and AB*G*D
    • is the EDIT Distance, where  the wildcard match is scored 0
  • The patterns are active if their respective lengths match the specified length and they are within a specified distance of one-another
15
Suffix Tree Pattern Discovery


  • Build a  Generalized Suffix Tree
  • Like a suffix tree only encodes groups of strings
16
Suffix Tree Discovery Algorithm
  • Look at the pattern structure for wildcards
  • If there is a sequence set off by a wildcard (*X*) , travel the tree to find 2 leaves whose lengths sum to the pattern length
  • If there are 2 sequences set apart by wildcards (*X*Y*), then find 3 leaves whose lengths sum to the pattern length
  • etc.
17
Suffix Tree Discovery Algorithm
  • Rank the candidate pattern activity with respect to distance, considering  all possible combinations
  • Compare most likely candidates with the entire set
18
CLUSTERS
19
Discriminant Analysis
20
Discriminant Analysis
  • Really is a classification scheme
  • We are given some two or more criterion (outcome) variables
  • We are given a bunch of predictor variables (features)
  • We figure out: given a set of specific observations of the features, what outcome does that set predict?
21
Discriminant Analysis
  • RESTATED:
  • Can we identify the source of a particular set of observations  (observation vector)?
22
Discriminant Analysis
  • Really a form of multiple regression
23
Discriminant Analysis
2 Groups
24
Discriminant Analysis
2 Groups Using 3 Dimensions
25
The Final Product
  • The analysis finds a linear combination of weights a,b,c,…..


  • such that
  • f=ax1+bx2+cx3+……


  • f’s value above or below some threshold value will identify the “source” cluster
26
An Example
27
An Example
  • Use 5 features, including the power under the  a-peak and under its 1st harmonic, combined with the 3 driving- power features depicted above
  • Apply to a training set of classical migraneurs and normals
  • Learn the weights of a 5 element discriminant function
  • Apply the function to new cases
28
Data Issues
  • Types of Data
  • Metrics
    • Similarities
    • Dissimilarities
  • Scaling
  • Handling Outliers
  • Correlations
29
Types of Data
  • Interval Scaled
    • Distance between two points in one interval same as in another
    • Energy to heat 1 gm water from -40° to -38° same as heating 1 gram water from 100° to 102°
30
Types of Data
  • Binary
    • Symmetric: approximately equal choices-male/female, married/single
    • Asymetric: has 2 heads, has one head--info is only in the rarity (cf information theory) -default in the problem setup should be the  common occurrence

31
Types of Data
  • Ratios
    • Can use interval scaling
    • Can take log
    • Can do ordinal scaling (rank)
32
Metrics
  • Euclidian or ‘usual’ : Pythagorean theorem
  • Manhattan or taxicab: sum of sides
  • Pipewrench : maximum
33
Similarities and Dissimilarities
  • Distances are natural dissimilarities
  • Similarity would be “how much do they look alike?” - high score means very similar
    • Always nonnegative number
    • A correlation coefficient is a natural example (except for the negative part)
34
Different Dimensions
  • Units change Þ patterns change
  •   eg changing cm to yards changes the scale
  • Alternative: standardize the data- make it dimensionless
    • Also has a downside
35
Effect of Scale Units in Clustering
36
Outliers
  • Define z-score




  • Define Mean Absolute Deviation
    • -less sensitive to outliers

37
Var vs MAD
38
Correlated Data
  • When one data point ‘follows’ another under an action (is correlated), there is redundancy in the data. The second data point offers no new information.


  • For ideal clustering, we must consider data reduction when there is correlation of the variables
39
Correlations Among Cluster Elements
40
Why Ask?
  • Can identify the underlying factor
  • Can screen variables (just pick one representing a factor)
  • Can summarize data: 11 variables®3 factors
  • Can sample uncorrelated ones to solve a problem
  • Can cluster objects  (inverse problem)
41
What we are looking for