|
1
|
|
|
2
|
- Pattern
- Dataset that can be concisely encoded
- Cluster
- Arrangement of data with respect to common features
|
|
3
|
- Matching
- Similarity searches in databases
- Discovery
|
|
4
|
- Analysis
- Predicting whether a patient has a disease based on measuring certain
disease-related variables
- Discovery
- Finding gene expression associated with disease in large data arrays
|
|
5
|
|
|
6
|
|
|
7
|
- A regular expression (RE) describes a language exclusively by single
symbols, by the empty symbol [1] l, by È, and by the Kleene Star operator[2] *.
- Specifically, regular expressions over an alphabet S* are all strings
over the alphabet
- SÈ{ (,),l,È,*}
- [1] l is used to mean Æ
- [2] The Kleene star * . For a, a* is l+a +a2
+ a3 + a4 +......
|
|
8
|
- The REs are obtained recursively as follows:
- Each member of S ,and l, is a regular expression
- If a and b are REs, then so is the concatenation (ab)
- If a and b are REs then so is the union (aÈb)
- If a is an RE then so is a*
- Nothing is an RE unless it follows from the above.
|
|
9
|
- A particularly clear example of a regular expression is from Gusfield :
- Let S be the alphabet of lower case English characters.
- RE= (a+c+t)ykk(p+q)*vdt(l+z+l)(pq)
- A string specified by RE could be
aykkpqppvdtpq
- Here is how that happens:
|
|
10
|
- GREP (General Regular Expression Processor) – a UNIX Utility
- Aho and Corasick Algorithm for text searching
|
|
11
|
|
|
12
|
|
|
13
|
|
|
14
|
- Distance between two sequences
- A*B*CD and AB*G*D
- is the EDIT Distance, where the
wildcard match is scored 0
- The patterns are active if their respective lengths match the specified
length and they are within a specified distance of one-another
|
|
15
|
- Build a Generalized Suffix Tree
- Like a suffix tree only encodes groups of strings
|
|
16
|
- Look at the pattern structure for wildcards
- If there is a sequence set off by a wildcard (*X*) , travel the tree to
find 2 leaves whose lengths sum to the pattern length
- If there are 2 sequences set apart by wildcards (*X*Y*), then find 3
leaves whose lengths sum to the pattern length
- etc.
|
|
17
|
- Rank the candidate pattern activity with respect to distance,
considering all possible
combinations
- Compare most likely candidates with the entire set
|
|
18
|
|
|
19
|
|
|
20
|
- Really is a classification scheme
- We are given some two or more criterion (outcome) variables
- We are given a bunch of predictor variables (features)
- We figure out: given a set of specific observations of the features,
what outcome does that set predict?
|
|
21
|
- RESTATED:
- Can we identify the source of a particular set of observations (observation vector)?
|
|
22
|
- Really a form of multiple regression
|
|
23
|
|
|
24
|
|
|
25
|
- The analysis finds a linear combination of weights a,b,c,…..
- such that
- f=ax1+bx2+cx3+……
- f’s value above or below some threshold value will identify the “source”
cluster
|
|
26
|
|
|
27
|
- Use 5 features, including the power under the a-peak and under its 1st harmonic, combined with
the 3 driving- power features depicted above
- Apply to a training set of classical migraneurs and normals
- Learn the weights of a 5 element discriminant function
- Apply the function to new cases
|
|
28
|
- Types of Data
- Metrics
- Similarities
- Dissimilarities
- Scaling
- Handling Outliers
- Correlations
|
|
29
|
- Interval Scaled
- Distance between two points in one interval same as in another
- Energy to heat 1 gm water from -40° to -38° same as heating 1 gram water from 100° to 102°
|
|
30
|
- Binary
- Symmetric: approximately equal choices-male/female, married/single
- Asymetric: has 2 heads, has one head--info is only in the rarity (cf
information theory) -default in the problem setup should be the common occurrence
|
|
31
|
- Ratios
- Can use interval scaling
- Can take log
- Can do ordinal scaling (rank)
|
|
32
|
- Euclidian or ‘usual’ : Pythagorean theorem
- Manhattan or taxicab: sum of sides
- Pipewrench : maximum
|
|
33
|
- Distances are natural dissimilarities
- Similarity would be “how much do they look alike?” - high score means
very similar
- Always nonnegative number
- A correlation coefficient is a natural example (except for the negative
part)
|
|
34
|
- Units change Þ
patterns change
- eg changing cm to yards changes
the scale
- Alternative: standardize the data- make it dimensionless
|
|
35
|
|
|
36
|
- Define z-score
- Define Mean Absolute Deviation
- -less sensitive to outliers
|
|
37
|
|
|
38
|
- When one data point ‘follows’ another under an action (is correlated),
there is redundancy in the data. The second data point offers no new
information.
- For ideal clustering, we must consider data reduction when there is
correlation of the variables
|
|
39
|
|
|
40
|
- Can identify the underlying factor
- Can screen variables (just pick one representing a factor)
- Can summarize data: 11 variables®3 factors
- Can sample uncorrelated ones to solve a problem
- Can cluster objects (inverse
problem)
|
|
41
|
|