The resources provided here are of interest to all students who attend my lectures in the bioinformatics domain. Reading or printing the pdf documents requires Adobe Reader version 9 or later. Bioinformatics is an interdisciplinary subject that applies computer science, mathematics and statistics to support knowledge discovery in the life sciences. Acquiring bioinformatics skills thus requires studying biology, computer science and mathematics/statistics. Although we will revise important mathematical background in those courses which rely on such topics, prospective students who are interested in attending my lectures should consider brushing up on these skills. A level comparable to the LMBT mathematics lecture (a compulsory course in the food and biotechnology bachelor degree) is highly desirable. Additionally, knowledge of computer programming and of using operating system shells is advantageous.
Bioinformatics is a discipline which requires actively engaging with the taught material. All lectures described here thus have a strong practical component, and student assessment depends strongly on the practical. I am happy to assist during practicals in case of problems, and I also encourage interaction if you want to discuss a specific problem with a colleague. As a word of warning, however, I also want to point out that there is zero tolerance if I discover people handing in plagiarised work.
To reflect the difficulty of having to acquire skills from different disciplines, all elective lectures are awarded 1.5 ECTS points per weekly teaching hour. To ease understanding of the topics discussed in my lectures, I recommend attending the lectures in a particular order. Note that 793.403 will be held as a block just before the practical part of 793.307, so that both lectures can be taken in the same term.
Here you find all material required for successful completion of my FSPMA lecture. Lecture notes are available in pdf or gzipped postscript format. For the practical you need to download the latest source distribution of FSPMA, fspma_1.1.tar.gz. Installation instructions are provided here. Instructions and data for the hands-on experiments can be obtained as a gzipped tar archive. A login will be provided during the lecture. Copy the file into an empty directory, untar it (tar -xzf fspmaexercise.tar.gz) and follow the instructions in exercise.txt. The practical part of this lecture will determine the lecture result. It is thus essential that you work out the problems in exercise.txt, pack your completed version of exercise.txt together with all def files into a gzipped tar archive, and send it as an attachment to my BOKU email address (peter.sykacek@boku.ac.at) using the email subject fspma results, #your matriculation number#. Assuming that you have completed both the Affymetrix and the cDNA analysis and that all files are in the same directory, the command for packing the result files is tar -czf lecture-result.tar.gz exercise.txt affy.def cdna.def. The results archive to be attached to the email is the file lecture-result.tar.gz. The deadline for sending these results is announced during the lecture. Make sure to ask if you are unsure about it. After inspecting your solutions, I will send out emails with proposed marks. If you feel that you deserve a better assessment, we can arrange an oral examination. Such exams are based on a discussion in which you comment on certain aspects of the problems you solved during the practical.
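The packing step described above can also be scripted. The following is a minimal Python sketch, assuming the three result files named in the tar command sit in one directory. The function name pack_results is hypothetical and not part of the course material. The sketch checks that every expected file exists before it builds the archive.

```python
import tarfile
from pathlib import Path


def pack_results(directory,
                 archive_name="lecture-result.tar.gz",
                 members=("exercise.txt", "affy.def", "cdna.def")):
    """Pack the result files into a gzipped tar archive (hypothetical helper).

    The default file names are those used in the tar command in the text.
    """
    directory = Path(directory)
    # Refuse to build an incomplete archive.
    missing = [name for name in members if not (directory / name).is_file()]
    if missing:
        raise FileNotFoundError("missing result files: " + ", ".join(missing))
    archive_path = directory / archive_name
    with tarfile.open(archive_path, "w:gz") as archive:
        for name in members:
            # arcname keeps the member paths relative, as the tar command does.
            archive.add(directory / name, arcname=name)
    return archive_path
```

Calling pack_results(".") in the working directory should produce lecture-result.tar.gz, which is the file to attach to the email.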
The aim of the lecture Machine Learning and Pattern Recognition for Bioinformatics is to introduce data analysis concepts for bioinformatics applications. Approaching data analysis from a machine learning perspective has the advantage of a less formal introduction to the topic than is typically found in statistics. Despite the focus on bioinformatics applications of machine learning, the concepts taught in the lecture are universally applicable.
The lecture consists of a theoretical part of fourteen lecturing units and about three full days in the computer lab, where we apply machine learning to practical problems. The theory part introduces generic concepts to be taken into consideration when applying machine learning methods. We then move on to a classification of different data analysis paradigms covering supervised and unsupervised methods. This is followed by a revision of important prerequisites from mathematics and probability theory. Throughout the lecture, the importance of theoretical concepts is illustrated by referring to algorithms from machine learning or pattern recognition which require application of the respective mathematical or statistical techniques. To prepare for the use of MatLab in the practical part, such mathematical expressions are also shown in MatLab syntax.
After having established the basis for empirical knowledge discovery, the lecture continues by introducing richer model classes like neural networks (RBFs and MLPs) and machine learning methods for model fitting and diagnosis. Most noteworthy are approaches for controlling model complexity and for model selection. This is followed by an in-depth investigation of classification which covers models, error functions and classifier assessment. The theoretical part of the lecture ends with an overview of popular machine learning methods like ICA and mixture density models.
The lecture notes for "Machine Learning and Pattern Recognition for Bioinformatics (793.307)" are available in pdf format.
The material for the practical part is based on MatLab and the NETLAB toolbox provided by Prof. Ian Nabney. You will also need to download additional course material consisting of public data sets and instructions for the analyses to be done in the practical. The material is available as a gzipped tar archive. A login, which is required for the download, will be provided in the first practical session. Copy the file into an empty directory, untar it (tar -xzf mlprn_labresources.tgz) and follow the instructions in mlprn_practical.txt, which requires modifying this file by adding code fragments and answering questions. The practical part of this lecture will determine the lecture result. It is thus essential that you work out the problems and send the completed file as an attachment to my BOKU email address peter.sykacek@boku.ac.at, using the email subject mlprn results, #your matriculation number#. The deadline for sending these results is announced during the lecture. Make sure to ask if you are unsure about it. Proposed marks are sent to course participants by email. People who feel that they deserve a better assessment can arrange an oral examination. Such exams are based on a discussion in which you comment on certain aspects of the problems you solved during the practical.
Bayesian Data Analysis in the Life Sciences provides an introduction to applying Bayesian concepts to data analysis in medical informatics and bioinformatics. Unlike my other courses, which are more about applying data analysis methods, this lecture also deals with implementing them. In the course of this lecture, students will thus derive and implement a simple variant of Bayesian regression. In addition, we will implement a few auxiliary algorithms which are relevant for certain aspects of data analysis. Both the code implemented by students and some additional ready-made libraries will be used in the practical for analysing real-world life science data. Because of the shift from applying to deriving and implementing data analysis methods, this lecture requires a slightly higher level of mathematics and computer science than the other courses I offer at BOKU.
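To give a flavour of what a simple variant of Bayesian regression can look like, here is a minimal Python sketch (the course itself uses MatLab). It assumes a straight-line model y = w0 + w1*x, a zero-mean isotropic Gaussian prior with precision alpha on the two weights, and Gaussian noise with known precision beta. These modelling choices and all names are illustrative assumptions, not necessarily the variant derived in the lecture.

```python
def bayes_linreg_posterior(xs, ys, alpha=1.0, beta=1.0):
    """Posterior over (intercept, slope) for y = w0 + w1*x + Gaussian noise.

    Assumed prior: w ~ N(0, I/alpha); assumed noise precision: beta.
    Returns the posterior mean (w0, w1) and the 2x2 posterior covariance.
    """
    n = len(xs)
    sum_x = sum(xs)
    sum_xx = sum(x * x for x in xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))

    # Posterior precision A = alpha*I + beta*Phi'Phi, design rows are (1, x).
    a00 = alpha + beta * n
    a01 = beta * sum_x
    a11 = alpha + beta * sum_xx

    # Posterior covariance S = inverse of A (closed form for a 2x2 matrix).
    det = a00 * a11 - a01 * a01
    s00, s01, s11 = a11 / det, -a01 / det, a00 / det

    # Posterior mean m = beta * S * Phi'y.
    b0, b1 = beta * sum_y, beta * sum_xy
    mean = (s00 * b0 + s01 * b1, s01 * b0 + s11 * b1)
    covariance = ((s00, s01), (s01, s11))
    return mean, covariance


if __name__ == "__main__":
    xs = [float(i) for i in range(10)]
    ys = [1.0 + 2.0 * x for x in xs]
    mean, covariance = bayes_linreg_posterior(xs, ys, alpha=1e-6, beta=1.0)
    print("posterior mean (intercept, slope):", mean)
    print("posterior covariance:", covariance)
```

With a weak prior (small alpha) the posterior mean approaches the least squares fit, while a strong prior (large alpha) pulls both weights towards zero.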
The practical applies Bayesian methods to data analysis applications in the life sciences. In particular, we will use the ability of properly implemented Bayesian methods to compare different model classes quantitatively. This is useful for finding genes whose expression patterns relate to the production of certain metabolites and, in a second application, for finding genes whose expression patterns allow distinguishing different cancer types. The practical also discusses important calibration and diagnosis steps which help ensure that inference results are data driven and that modelling choices have little influence.
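One common way to compare model classes quantitatively is through the marginal likelihood (model evidence). The Python sketch below computes the log evidence of a linear-Gaussian regression model and compares a constant model with a straight-line model. It assumes a zero-mean isotropic Gaussian prior with precision alpha and known noise precision beta. All function names are hypothetical, and the models used in the practical may differ.

```python
import math


def solve_spd(matrix, rhs):
    """Solve matrix @ x = rhs for a symmetric positive definite matrix.

    Returns the solution and log(det(matrix)). Gaussian elimination without
    pivoting is used, which is safe because all pivots of an SPD matrix are
    positive.
    """
    size = len(rhs)
    a = [row[:] for row in matrix]
    b = rhs[:]
    log_det = 0.0
    for k in range(size):
        pivot = a[k][k]
        log_det += math.log(pivot)
        for i in range(k + 1, size):
            factor = a[i][k] / pivot
            for j in range(k, size):
                a[i][j] -= factor * a[k][j]
            b[i] -= factor * b[k]
    x = [0.0] * size
    for i in reversed(range(size)):
        tail = sum(a[i][j] * x[j] for j in range(i + 1, size))
        x[i] = (b[i] - tail) / a[i][i]
    return x, log_det


def log_evidence(design, targets, alpha=1.0, beta=1.0):
    """Log marginal likelihood of targets under a linear-Gaussian model.

    design is a list of feature rows; prior w ~ N(0, I/alpha) and noise
    precision beta are assumptions of this sketch.
    """
    n, m = len(design), len(design[0])
    # Posterior precision A = alpha*I + beta*Phi'Phi.
    precision = [[(alpha if i == j else 0.0)
                  + beta * sum(row[i] * row[j] for row in design)
                  for j in range(m)] for i in range(m)]
    rhs = [beta * sum(row[i] * t for row, t in zip(design, targets))
           for i in range(m)]
    mean, log_det = solve_spd(precision, rhs)
    # Regularised error at the posterior mean: beta/2 * y'y - 1/2 * rhs'mean.
    error = (0.5 * beta * sum(t * t for t in targets)
             - 0.5 * sum(r * w for r, w in zip(rhs, mean)))
    return (0.5 * m * math.log(alpha) + 0.5 * n * math.log(beta)
            - error - 0.5 * log_det - 0.5 * n * math.log(2.0 * math.pi))


if __name__ == "__main__":
    xs = [float(i) for i in range(10)]
    ys = [1.0 + 2.0 * x for x in xs]
    print("constant model:", log_evidence([[1.0] for _ in xs], ys))
    print("linear model:  ", log_evidence([[1.0, x] for x in xs], ys))
```

The model with the larger log evidence is preferred. Because the evidence integrates over the weights, the additional slope parameter is only rewarded when the data actually show a trend.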
The lecture notes for "Bayesian Data Analysis in the Life Sciences (793.402)" are available in pdf format. In addition, this course requires completing the assignment provided for download here as a self-study exercise.
The practical for Bayesian Data Analysis in the Life Sciences is based on a set of MatLab libraries, some of which are provided for download and some of which must be developed by the participants. Implementations should be prepared in a slot of a few weeks intentionally left free between the last theory block and the first practical session. The prepared code can be completed during the practical. The course material for the practical, containing libraries, data sets and instructions, can be downloaded here as a zip archive. A login, which is required for the download, will be provided in the first practical session. Copy the file into an empty directory, unzip it (unzip biobayes_lab_4student.zip) and follow the instructions in biobayes_labwork.txt. Completion of the practical requires modifying this file by adding code fragments and answering questions. In addition, there are several MatLab source files to be completed in subdirectories of ./mlbsrc. The practical part of this lecture will determine the lecture result. It is thus essential that you work out the problems and send the completed file together with all code files as an attachment to my BOKU email address peter.sykacek@boku.ac.at, using the email subject biobayes results, #your matriculation number#. You are advised to send a zip archive containing all files and subdirectories in your working directory. Under Linux this is achieved by issuing the command zip -r biobayesres.zip biobayes_labwork.txt data mlbsrc in the respective directory. The deadline for sending these results is announced during the lecture. Make sure to ask if you are unsure about it. Proposed marks are sent to course participants by email. People who feel that they deserve a better assessment can arrange an oral examination. Such exams are based on a discussion in which you comment on certain aspects of the problems you solved during the practical.
This material is my contribution to the new "Introduction to Bioinformatics", where I provide a two-hour introduction to machine learning. The lecture provides an overview of important skills such as writing a computer program and working with probabilities. It then moves on to structuring machine learning approaches, giving a few examples and highlighting the problems one has to consider when applying machine learning methods. If you are interested, you can get my handouts in pdf format. Those who need to take the exam can get prototypical questions and answers here.
The lecturing material provided here is valid for SS 2011 and subsequent terms. The material is now presented in the least mathematical style possible. In addition, I provide seven exam questions, including sketches of prototypical answers and references to slide numbers. My current lecture notes are provided here in pdf format. Exam questions and answers are also available in pdf format. Note that both documents load with the latest Adobe Reader; old versions might not work.
My contribution to the lecture Computational Mathematics and Bioinformatics (894.305) has a rather volatile history, with annual modifications introduced to adjust the level of the lecture such that it is comprehensible for an average Biotechnology master student and yet remains comprehensive enough to be relevant for the practical problems encountered in research and development in biotechnology.
In response to recommendations by the Biotechnology study commission to reduce the mathematical level of the lecture 894.305, the lecture was completely redesigned for SS 09/10. Due to this restructuring, the time available for teaching data analysis topics has been reduced to one third of the originally designated time. As a result, the content of my theory part has shrunk to a teaser which can give students a flavour of machine learning in bioinformatics, but certainly not the expertise they will need to solve practically relevant problems. I thus recommend that students who enjoy the data analysis (machine learning & statistics) contribution to 894.305 sign up for my specialist lectures.
In preparation for the machine learning overview which is presented as block five (L5) of the newly structured "Computational Mathematics and Bioinformatics" lecture, you should download and print the handouts provided here as pdf file. The skills presented in L5 are relevant for passing the corresponding practical part (P3). This implies that students must have these topics well prepared when attending P3. Knowledge and, more importantly, an understanding of L5 is thus implicitly tested in P3. Since I believe that examining the same material twice has no advantages, there will be no explicit questions about L5 in the written exam of the lecture. The assessment will, however, enter into the final mark via the assessment of P3. Please feel free to contact me (email to peter.sykacek@boku.ac.at) if you have topical questions concerning the material presented here. Questions concerning other aspects of the lecture should be addressed to Prof. Oostenbrink. Please consult the BOKU staff pages for his email address.
People interested in starting a PhD in data analysis in bioinformatics will need very good skills in computer science, mathematics and statistics. Prospective PhD students must be able to demonstrate good working knowledge of R or MatLab, as obtained for example in a master's project. Prospective PhD students who bring their own funding are always welcome if their ideas fit into the broad research focus of probabilistic methods in bioinformatics. Students who are interested in a funded position should apply to the respective job adverts, which are announced in various places including international job boards. Since I receive many applications from people who do not even come close to meeting the criteria stated explicitly in the job adverts, I want to point out here that sending such applications is a waste of time and energy. Please apply only if you really fulfil the requirements in the job advert.
In applied sciences, analytical results must not overly depend on possibly unreliable data. To account for known and unknown sources of problems while acquiring biological data, we need new statistical methods. Such methods can be based on uncommonly used distributions or new mathematical model structures. Furthermore, it will be necessary to thoroughly test and calibrate algorithms based on such models to validate that they work as expected on generic data sets of known structure. To ensure efficiency, computationally intensive algorithms will be implemented to make use of parallel processing where possible. An important part will be to apply such methods to available data sets and compare the results with existing ones.
I always welcome enthusiastic students who are interested in doing a data analysis related master's thesis under my guidance. If you are considering a thesis in data analysis in bioinformatics, the following guideline is for you. As you can see from the example abstracts of my previous master's students below, working on topics which emerge in my field requires excellent skills in computer science and a good knowledge of mathematics/statistics. It is thus recommended that you attend at least one or two of my elective courses first. Thesis topics are specified such that it is possible to complete the work within six months, once you have the relevant skill level (e.g. reasonable knowledge of R or MatLab). The advantage for students working with me is my scientific interest in the work they will do. I will provide a reasonable level of feedback without suppressing creativity. My thesis projects are purely scientific. This implies that you cannot expect payment, but you instead get a good starting point if you want to move into science or research and development.
Life-science research incorporating microarray expression profiling is often limited by the difficulty of making state-of-the-art data analysis tools accessible to laboratory scientists. Previously, attempts to bridge this gap have been made with the implementation of the friendly statistics package for microarray analysis (FSPMA), which is provided as a publicly available R toolbox.
The objective of this thesis is to enhance the functionality of the FSPMA package and to develop a graphical user interface (GUI). FSPMA allows efficient exploration of microarray data, providing powerful methods for data loading, cleaning, normalisation and selection of differentially expressed genes, as well as sophisticated operations like k-nearest-neighbour imputation and spike-based normalisation. FSPMA processing is controlled by a definition file that specifies all the steps necessary to derive analysis results from quantified microarray data. The definition file additionally provides complete documentation of the dataset explored and the analysis performed.
For this thesis, the existing FSPMA library was extended by two supplementary packages. The first package, fspma.extension, allows unbalanced experiments to be analysed and thereby removes a conceptual limitation of the FSPMA package, which allows analysing balanced designs only. The second, fspmaGUI, supplies a graphical user interface based on R-Tcl/Tk for the FSPMA library and the functions provided by the fspma.extension package. It provides a user-friendly interface to the statistical methods of the FSPMA and fspma.extension packages for R, and is itself implemented as an R package. For users uncomfortable with the command line computing environment of the R language, fspmaGUI considerably facilitates analysing microarray experiments. The software provides point-and-click access to methods for experiment definition, normalisation and analysis of microarray data.
The resulting new FSPMA version is a locally installed stand-alone application and covers quick first-time analysis for novice biologists and self-contained analyses for expert users. It provides a flexible and easy-to-use analysis tool for a broad application range.
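As an illustration of the k-nearest-neighbour imputation mentioned in the abstract above, the following Python sketch fills a missing value in one expression profile with the average of that column over the k most similar profiles. This is a generic textbook-style variant under stated assumptions (Euclidean distance over the jointly observed columns, missing values marked as None, hypothetical function name). It is not the FSPMA implementation, which is an R toolbox.

```python
import math


def knn_impute(rows, k=2):
    """Return a copy of rows in which None entries are replaced by the mean
    of the same column over the k nearest rows that have it observed."""

    def distance(a, b):
        # Root mean squared difference over columns observed in both rows.
        shared = [(x, y) for x, y in zip(a, b)
                  if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    imputed = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbours: other rows with column j observed and
            # at least one column shared with the incomplete row.
            candidates = []
            for h, other in enumerate(rows):
                if h == i or other[j] is None:
                    continue
                d = distance(row, other)
                if d < math.inf:
                    candidates.append((d, other[j]))
            candidates.sort(key=lambda pair: pair[0])
            nearest = candidates[:k]
            if nearest:
                imputed[i][j] = sum(v for _, v in nearest) / len(nearest)
    return imputed


if __name__ == "__main__":
    profiles = [
        [1.0, 2.0, None],
        [1.0, 2.0, 3.0],
        [1.1, 2.1, 5.0],
        [9.0, 9.0, 100.0],
    ]
    for profile in knn_impute(profiles, k=2):
        print(profile)
```

Distances are always computed from the original data, so the order in which missing values are filled does not affect the result. Entries for which no usable neighbour exists are left as None.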
In recent years, many statistical methods and tools have been developed for the analysis of microarrays. Although it is well known that microarrays often produce widely dispersed data, little consideration has been given to the robustness of the current methodology.
This work tests a possible approach for increasing the robustness of a hierarchical Bayesian ANOVA model, which is specifically designed for the analysis of microarrays, with respect to its underlying error model. Additionally, it aims to provide an understanding of how the results differ from those of the standard model and of the differing biological implications. The core of the method is the selection of a fitting likelihood function from a set of non-central Student's t distributions with different degrees of freedom and normal distributions. A hybrid MCMC sampler has been designed and implemented in MatLab in order to perform the model inference.
The proposed approach has been tested with several artificial and biological data sets. Applying the method to different biological settings has provided a clear answer to the question of whether Student's t distribution is a more reasonable model distribution for such data sets: Student's t distributions with low degrees of freedom are generally preferred as the error model. More importantly, the results showed that differences between the robust (Student's t) and the standard (Gaussian) model not only occurred in the statistical inference, but also led to different biological conclusions drawn from Gene Ontology analysis. This work thus shows the importance of handling the choice of model likelihood with great care in the field of microarray analysis.
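The effect described in this abstract can be illustrated with a much simpler experiment than the hybrid MCMC sampler used in the thesis. The Python sketch below compares the log-likelihood of a Gaussian error model with that of a location-scale Student's t model on residuals that contain one gross outlier. The plug-in estimates (maximum likelihood for the Gaussian, median and scaled median absolute deviation for the t model), the fixed three degrees of freedom, the example data and all function names are assumptions made for illustration only.

```python
import math
import statistics


def gaussian_log_likelihood(values):
    """Log-likelihood of a normal model at its maximum likelihood estimates."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return -0.5 * n * (math.log(2.0 * math.pi * variance) + 1.0)


def student_t_log_likelihood(values, dof, location, scale):
    """Log-likelihood of a location-scale Student's t model."""
    constant = (math.lgamma(0.5 * (dof + 1.0)) - math.lgamma(0.5 * dof)
                - 0.5 * math.log(dof * math.pi) - math.log(scale))
    total = 0.0
    for v in values:
        z = (v - location) / scale
        total += constant - 0.5 * (dof + 1.0) * math.log1p(z * z / dof)
    return total


def robust_location_scale(values):
    """Median and scaled median absolute deviation as plug-in estimates."""
    location = statistics.median(values)
    mad = statistics.median([abs(v - location) for v in values])
    return location, 1.4826 * mad


if __name__ == "__main__":
    # Hypothetical residuals: nine well-behaved values and one outlier.
    residuals = [-0.3, -0.1, 0.0, 0.1, 0.2, -0.2, 0.3, 0.1, -0.1, 8.0]
    location, scale = robust_location_scale(residuals)
    print("Gaussian:   ", gaussian_log_likelihood(residuals))
    print("Student's t:",
          student_t_log_likelihood(residuals, 3.0, location, scale))
```

A single outlier inflates the Gaussian variance estimate and lowers the likelihood of every other observation, whereas the heavy tails of the t distribution absorb it. A full Bayesian treatment, as in the thesis, would infer the parameters and the degrees of freedom rather than plug in fixed values.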