Unlike other methods probabilistic machine learning is based on one consistent principle which is used throughout the entire inference procedure. Probabilistic methods approach inference of latent variables, model coefficients, nuisance parameters and model order essentially by applying Bayesian Theory. Hence we may treat all unknown variables identically which is mathematically nice. For computational reasons a fully probabilistic model might not be feasible. In such situations we have to use approximations. Obviously for a methodology that has to stand the test in an empirical discipline, a mathematical consistency argument is not too convincing. So why should one use probabilistic methods?
The basic idea of Bayesian sensor fusion is to take uncertainty of information into account. In machine learning the seminal papers were those by (MacKay 1992) who discussed the effects of model uncertainty. In (Wright 1999) these ideas were later extended to input uncertainty. Related ideas have been used by (Dellaportas & Stephens 1995), who discuss models for errors in observables. I got interested in these issues in the context of hierarchical models where model parameters of a feature extraction stage are used for diagnosis or segmentation purposes. Such models are e.g. used for sleep analysis or also in the context of BCI. In a Bayesian sense these features are latent variables and should be treated as such. Again this is a consistency argument which has to be examined for its practical relevance.
In order to obtain a hierarchical model that does sensor fusion we simply regard the feature extraction stage as latent space and integrate (marginalize) over all uncertainties. The left figure compares a sensor fusing DAG with current practice in many applications of probabilistic modeling that regard extracted features as observations. I reported on a first attempt to approach this problem in (Sykacek 1999) which is in more detail described in section 4 in my Ph.D. thesis.
In order to see that a latent feature space has practical advantages, we consider a very simple case where two sensors are fused in a naive Bayes manner to predict the posterior probability in a two class problem. The model is similar to the one in the graph used above, however with two latent feature stages that are, conditional on the state of interest t, assumed to be independent. The plot on the right illustrates the effect of knowing one of the latent features with varying precision. Conditioning on a best estimate results obviously in probabilities that are independent of the precision. We hence obtain a flat line with a probability of class "2" of about 0.27. Marginalization changes this probability. Depending on how much the uncertainties differ, we can, as is illustrated in this figure, also obtain different predicted states. We may thus expect to improve in such cases where the precision of the distributions in the latent feature space varies. We have successfully applied a HMM based latent feature space model to classification of segments of multi sensor time series. Such problems arise in clinical diagnosis (sleep classification) and in the context of brain computer interfaces (BCI). A MCMC implementation and evaluation on synthetic and BCI data has been published in (Sykacek & Roberts 2002 a). This work was also the topic of a talk I gave at the NCRG in Aston in July 2002. A pdf version of the slides being available here. Recently (Beal et al. 2002) have applied similar ideas to sensor fusion of audio and video data.
This work was done by the authors, while contributing to the BBSRC funded "Shared Genetic Pathways in Cell Number Control" research program, which was awarded to the Department of Pathology, University of Cambridge, UK. As the project title suggests, this project investigates molecular biological processes that control development cycles in different biological systems. The search for the underlying genetic markers requires a principled approach that can infer which genes are of shared importance in several microarray experiments. We propose for that purpose a fully Bayesian model for an analysis of shared gene function. The approach assumes that several microarray experiments with known cross annotations between transcripts (genes) should be analyzed for common genetic markers. The implementation described in this work has in particular the advantage to combine data sets before applying thresholds and thus the advantage that the result is independent of that choice. For more information on the method, we refer to the original paper (Sykacek et al 2007 a) and the pdf supplement (Sykacek et al 2007 b).
The analysis of development processes in many tissues is faced with several interacting biological processes and a mix of various cell types. As an example we investigate in this work the shared biological activity at gene level in a mouse mammary gland development cycle (Clarkson et al 2004) and a human endothelial cell culture with apoptosis induced by serum withdrawal (Johnson et al 2004). The biological complexity of the experiments is best visualized, if we mark different development stages for active biological processes at a macro level. For the mouse mammary time course, we get the following table of active processes. During lactation, time is in days and during involution we use hours.
|Type I Apoptosis||-||-||-||+||+||?||-||-|
|Type II Apoptosis||-||-||-||-||-||?||+||+|
We use "+" to indicate that a process is active and "-" to indicate it's inactivity. A "?" indicates epochs where we are uncertain about the process activity. A similar though simpler classification can be obtained for the second experiment which studies human endothelial cells under serum deprivation. Duration during serum deprivation is in hours.
|Biological Process||control (t0)||t28|
|Type II Apoptosis||-||+|
In addition to the results we present in the original paper and in the pdf supplement, we provide here the top 20 genes we find important to contribute to both data sets.
The full gene list in comma separated format is available as zipped archive. To check, which biological processes we find attributed to this list, we follow the suggestion in (Lewin et al. 2006) and use Fishers exact test to infer significance levels of active gene ontology (GO) categories from the probabilistic rank list (see also (Al-Shahrour et al.)). The resulting GO categories for the gene list of this shared analysis can be obtained as comb_apo_all.xml in xml format. This file is compatible with Treemap - (C) University of Maryland and preserves the parents - child relationships from the directed acyclic GO graph. Note that the treeml.dtd file is part of the Treemap package and not available here. Treemap is under a non commercial license. If it is unavailable despite that, the xml ﬁle can be inspected with any reasonable web browser.
The software to calculate indicator probabilities that capture shared gene function comes as collection of MatLab libraries. The package consists of the main code, which uses the variational Bayesian approach described in (Sykacek et al 2007 a) and additional functions for data handling, output generation and an EM implementation for regularized probit link regression used during initialization of all Q-distributions. To make the software distribution flexible, all MatLab functions are collected in archives each containing functions of a particular type. The software is available under GPL 2 license and comes without any warranty.
To install the package, one has to download all required archives, provided as *.zip files or tared gzip archives (*.tar.gz), unpack the archives and set appropriate MatLab paths. Scripts using a hypothetical experiment derived from a mouse testis time course kindly provided by R. Furlong, demonstrate how to use the library.
|helpers.tar.gz||generic helper functions|
|statsgen.tar.gz||generic statistics functions|
|mca_base.tar.gz||basic microarray file handling (loading various microarray data formats)|
|mca_fuse.tar.gz||generic handling in connection with shared analysis (cross annotation and output generation)|
|probitem.tar.gz||Penalized maximum likelihood (MAP) for probit link regression via an EM algorithm.|
|combanalysis.glb.hphp.tar.gz||Variational Bayes for shared analysis of subset probabilities in probit regression.|
All components required to successfully run the experiment, will be installed automatically, if one creates a new directory and then downloads and runs the setup script in that directory. Linux (Unix) users should use combsyssetup.sh. After download, you might have to set executable permission by invoking the command chmod +x combsyssetup.sh. Windows users should either do the same after installing a cygwin environment or install the Wget and Unzip packages from GnuWin32 and then download and run combsyssetup.bat Note that this will install all required packages and, if run at later times, install updates.
After having run the script, the installation directory contains MatLab scripts and data which illustrate the approach discussed in (Sykacek et al 2007 a). The data are extracted from a subset of a mouse testis development time course, kindly provided by R. Furlong. The data consists of 7 time points: adult day 1 day 5 day 10 day 15 day 23 and day 35, with differential expression measured against the adult generation.
To illustrate all steps from cross annotation to generation of gene lists, we divided this data artificially into two "experiments". One experiment contains the samples of the adult generation and days 1 and 15. Here we use the original gene ids. We assume that the biological state change is between adult and the other two development periods. The corresponding data file is called "exp1mca.tsv" and is formatted like FSPMA normalized raw output: Gene ids are used as column headers and all samples as rows below. We also have a corresponding effects description as "exp1eff.tsv", which is used to generate the labels. The second experiment contains days 35, 23, 5 and 10 and artificially modified gene ids, to mimic a situation that requires between species annotation. Here we assume that the biological states correspond to days 35 and 23 versus days 5 and 10. The files are "exp2mca.tsv" for the microarray data and "exp2eff.tsv" for the labels. Note that the assumption is that each experiment provides information about differences in late and early stages of testis development. The analysis goal is thus similar to a problem, where we attempt to combine two experiments obtained from different platforms or species. This requires "cross annotation", which is here done according to the tab delimited file "crossann.tsv". In general, each row in this file contains a tuple that provides a unique mapping between all different unique gene ids one finds in a shared analysis. To complete the list of files, we provide in addition the tab delimited file "genespec.csv", which provides for the unique gene ids in the target genome, a mapping to gene symbols and descriptions.
|exp1mca.tsv, exp2mca.tsv||Normalized log ratios (location and scale adjustment)|
|exp1eff.tsv, exp2eff.tsv||(default) Labels|
|crossann.tsv||cross annotations between the different gene ids found in the gene lists|
|genespec.csv||mapping from unique gene ids (for the cross annotation target) to standardized symbols and gene descriptions|
|shareanalysis.def||specification of the cross annotated experiments that enter shared analysis|
|crossann.m||cross annotation of microarray experiments and preparation of shared analysis|
|runsim.m, calccoreg.m||calculation of gene indicator probabilities of shared gene function by variational Bayesian inference|
|combres2csv.m||extraction of gene ranking w.r.t. shared gene function as a tab delimited file|
Both artificial experiments have to be cross annotated. This step will align the gene ids in different experiments and provide two raw data files and a gene id to symbols and description annotation in MatLab 6 format. Cross annotation is done by the MatLab script crossann.m found in the installation folder.
After cross annotation, we have to prepare for the shared analysis. This requires to specify a tab delimited text file ("shareanalysis.def") which controls this process. The minimal requirement is to specify in this control file which (previously cross annotated) data files should be analyzed for shared gene function. In addition one can specify a different set of labels. This is useful to analyze the same data for different biological classifications. We may also provide independent test data, which will be used to obtain generalization errors. Analysis is stared with the script combanalysis.m in the installation folder. The simulation will, depending on the size of the problem and the mode of analysis, take up to several hours (this example is though done in less than one minute or in a few minutes time, if we want fold results). As a result we get all simulation output in MatLab format. Details of the calculated results require to look into the code and to analyze the variables stored in the MatLab output file.
The last step in an analysis of shared function is to generate a rank table of shared gene function. This is done with the script crossann2csv.m found in this folder as well. The result is a rank list of similar structure as the one provided in the supplement of the original paper.
All intermediate results generated during a shared analysis is stored in MatLab 6 format (for Octave compatibility). The final rank table is a comma separated file.
|exp1.mat, exp2.mat||cross annotated and normalized raw data|
|crossanngenespec.mat||reordered gene specifications (id, symbol and description)|
|state.mat, crosslog.mat||internal log files (see code)|
|sharetestres.mat||inference result about shared gene function. This file contains all results including probabilities, predictions and all Q-distributions found from variational Bayesian inference.|
|share_test_rank.csv||rank table as comma separated file.|
To run such an analysis on a different experiment, one must provide data files structured like those listed in the data files table. The structure of the microarray data and the default labels is identical to the output generated by FSPMA, which can thus be used as preprocessing tool. In addition, one has to generate a file which allows cross annotation between all gene sets that appear in any one data set. If there is only one set of gene ids, cross annotation should be done anyway, to obtain the data in the format as expected by runsim.m. In this case all parts in runsim.m that specify the shared analysis will refer to the same gene id column. Inference of shared gene function requires in addition a control file similar to shareanalysis.def. Finally one has to adjust all script files in the script files summary to meet the different requirements.
This page describes joint work with R. Clarkson, C. Print, R. Furlong and G. Micklem. We also thank David MacKay for advise. The project was moslty done at the Departments of Pathology and Genetics, University of Cambridge as part of the project "Shared Genetic Pathways in Cell Number Control", ref. 8/EGH16106 funded by the BBSRC within their Exploiting Genomics initiative. During completion of this work, Peter Sykacek moved to the Bioinformatics group at BOKU University, Vienna, which is funded by the WWTF, ACBT, Baxter AG, and ARC Seibersdorf.