The random projection algorithm is a randomized algorithm for motif finding. The input is a set of strings and the goal is to find substrings which appears in disproportional fraction of the strings. A challenge problem in this domain is defined by finding planted (l,d) motifs in n random strings of length m where an (l,d) motif is a string of length l with at most d point mutations. The random projection algorithm is very simple and elegant, using randomly selected subsets of the substring and hashing over all the strings. Genscan is a popular gene finding program by Chirs Burge. The goal here is to implement a gene finding algorithm and test it on the human genome. 1.Burge, C. and Karlin, S. (1997) Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268, 78-94. 2.GenScan site BLAST is a fast local alignment algorithm, which enjoys very wide popularity in the biological community and is used by thousands of researchers on a daily basis. The goal is to search for significant local alignments ofa target sequence in a large sequences database. In this project you shallprogram BLAST and test its performence on a sample sequence database. 1.Blast at NCBI 2. Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. (1990) ‘Basic local alignment search tool.’ J. Mol. Biol. 215:403-410. ClustalW is a progressive multiple alignment algorithm which use several heuristic to improve the performence of HMM based alignment tool. The idea is to incrementaly refine a Profile HMM that will anchor the target sequences. 1. ClustalW at EBI 2. Higgins D., Thompson J., Gibson T.Thompson J.D., Higgins D.G., Gibson T.J.(1994). CLUSTAL W: improving the sensitivity of progressivemultiple sequence alignment through sequence weighting,position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22:4673-4680 SVMs (Support Vector machines) are powerful classifiers which are in use in several machine learning applications. We are given a set of examples, each charachterized by an n dimensional real valued vector and a binary number (yes or no). The SVM predict the binary number (the class) based on the vector by test the position of the vector relative to some hyperplane (defined by its normal vector). SMO is a fast SVM training algorithm which is very easy to implement. In this project you shall implement SMO and use it to classify cancer patients. The data you will use is derived from DNA chips that can measure the activity of thousands of genes in a given tissue. Your SVM will try to distinguish between patients with different types of tumor based on the gene activity (or expression) profile. Suffix trees were (or will be)Ý introduced in class. The goal here is to implement and experiment with them paying attention to performence and implementation efficiency. To test your algorithm performence you shall use gnu gprof.gprof can be used with gcc compiled programs and provide detailed report on the time spent at each of the routines used by a given program. By analyzing gprofreoprt you can verify your implementation does not spend time in the ‘wrong’ places, identify bottle necks and try to remove them. A frequent example in that respect is to optimize the use of pointer dereferecing, to use iterators instead of direct access to vector objects and to cache computation whenever possible. The goal here is to train a profile HMM that can identify protein families. The HMM topology is a combination of two standard profile HMM with common start and end states. During the training process, we hope that one profile HMM will absorb one type of sequences and the other will take care of the rest. Source.