Unlike other methods, probabilistic machine learning is based on one consistent principle that is used throughout the entire inference procedure. Probabilistic methods approach the inference of latent variables, model coefficients, nuisance parameters and model order essentially by applying Bayesian theory. Hence we may treat all unknown variables identically, which is mathematically attractive. For computational reasons a fully probabilistic model might not be feasible, and in such situations we have to use approximations. For a methodology that has to stand the test in an empirical discipline, however, a mathematical consistency argument alone is not convincing. So why should one use probabilistic methods?
The basic idea of Bayesian sensor fusion is to take the uncertainty of information into account. In machine learning the seminal papers were those by (MacKay 1992), who discussed the effects of model uncertainty. In (Wright 1999) these ideas were later extended to input uncertainty. Related ideas have been used by (Dellaportas & Stephens 1995), who discuss models for errors in observables. I got interested in these issues in the context of hierarchical models where model parameters of a feature extraction stage are used for diagnosis or segmentation purposes. Such models are used, for example, for sleep analysis and in the context of BCI. In a Bayesian sense these features are latent variables and should be treated as such. Again this is a consistency argument which has to be examined for its practical relevance.
In order to obtain a hierarchical model that does sensor fusion we simply regard the feature extraction stage as a latent space and integrate (marginalize) over all uncertainties. The left figure compares a sensor fusing DAG with current practice in many applications of probabilistic modeling that regard extracted features as observations. I reported on a first attempt to approach this problem in (Sykacek 1999), which is described in more detail in section 4 of my Ph.D. thesis.
In order to see that a latent feature space has practical advantages, we consider a very simple case where two sensors are fused in a naive Bayes manner to predict the posterior probability in a two class problem. The model is similar to the one in the graph used above, however with two latent feature stages that are, conditional on the state of interest t, assumed to be independent. The plot on the right illustrates the effect of knowing one of the latent features with varying precision. Conditioning on a best estimate obviously results in probabilities that are independent of the precision. We hence obtain a flat line with a probability of class "2" of about 0.27. Marginalization changes this probability. Depending on how much the uncertainties differ, we can, as is illustrated in this figure, also obtain different predicted states. We may thus expect improvements in cases where the precision of the distributions in the latent feature space varies. We have successfully applied an HMM based latent feature space model to the classification of segments of multi sensor time series. Such problems arise in clinical diagnosis (sleep classification) and in the context of brain computer interfaces (BCI). An MCMC implementation and an evaluation on synthetic and BCI data have been published in (Sykacek & Roberts 2002 a). This work was also the topic of a talk I gave at the NCRG in Aston in July 2002. A pdf version of the slides is available here. Recently (Beal et al. 2002) have applied similar ideas to sensor fusion of audio and video data.
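The difference between conditioning and marginalizing can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class-conditional Gaussians, the equal class priors and the function names are hypothetical, and each latent feature is assumed to be known up to a Gaussian with mean m and variance s2. It is not the HMM based model of (Sykacek & Roberts 2002 a).

```python
import math

def gauss(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

# Hypothetical class-conditional models p(feature | class), shared by both sensors.
CLASSES = {1: (-1.0, 1.0), 2: (1.0, 4.0)}   # class -> (mean, variance)
PRIOR = {1: 0.5, 2: 0.5}                    # assumed equal class priors

def posterior_class2(estimates, est_vars, marginalize):
    """P(class 2 | both sensors) for naive Bayes fusion.

    estimates: best estimate m of each latent feature
    est_vars:  variance s2 (inverse precision) of each latent feature
    """
    joint = {}
    for c, (mean, var) in CLASSES.items():
        p = PRIOR[c]
        for m, s2 in zip(estimates, est_vars):
            # Integrating N(z; m, s2) * N(z; mean, var) over z gives N(m; mean, var + s2).
            # Conditioning on the best estimate ignores s2 altogether.
            p *= gauss(m, mean, var + s2 if marginalize else var)
        joint[c] = p
    return joint[2] / (joint[1] + joint[2])

if __name__ == "__main__":
    for s2 in (0.01, 1.0, 5.0):
        cond = posterior_class2([0.2, -0.5], [0.01, s2], marginalize=False)
        marg = posterior_class2([0.2, -0.5], [0.01, s2], marginalize=True)
        print(f"s2={s2:4.2f}  conditioning={cond:.3f}  marginalizing={marg:.3f}")
```

The conditioning column stays constant while the marginalized probability moves with the variance of the second feature, which is the qualitative effect described in the paragraph above.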
In order to assess the effects on the bit rates of BCIs, we compare in (Curran et al. 2003) the communication bandwidth we may achieve with different cognitive tasks (Curran & Stokes 2003). We base comparisons on generalization accuracies obtained for independent test data. Differences are assessed for statistical significance using McNemar's test, a test for analyzing paired results that can be found in (Ripley 1996). In order to allow comparisons with other BCI systems, we also report bit rates as suggested in (Wolpaw et al. 2000). The BCI experiments in this study were done by 10 young, healthy and untrained subjects. They are based on 3 cognitive tasks: an auditory imagination task, an imagined spatial navigation task and an imagined right motor task. Each experiment consists of 10 repetitions of alternating pairs of these tasks, each of which was performed for seven seconds. EEG recordings are obtained from two electrode sites: T4, P4 (right temporo-parietal for spatial and auditory tasks) and C3', C3" (left motor area for right motor imagination). The ground electrode is placed just lateral to the left mastoid process.
comparison | accuracy (first pairing) | bit/s (first pairing) | accuracy (second pairing) | bit/s (second pairing) | P_{null} |
(a) vs. (b) | 74 % | 0.173 | 69 % | 0.107 | <<0.01 |
(a) vs. (c) | 74 % | 0.173 | 71 % | 0.131 | <0.01 |
(a) vs. (d) | 74 % | 0.173 | 71 % | 0.131 | 0.01 |
(b) vs. (c) | 69 % | 0.107 | 71 % | 0.131 | 0.02 |
(b) vs. (d) | 69 % | 0.107 | 71 % | 0.131 | 0.03 |
(c) vs. (d) | 71 % | 0.131 | 71 % | 0.131 | 0.40 |
An investigation of different classification paradigms reveals that on this data the BCI classifiers perform significantly better when allowing for a nonlinear decision boundary. The method applied in this comparison uses autoregressive (AR) features extracted from successive segments of EEG. We use a generative classifier that predicts probabilities of cognitive states. Table Comparison of different tasks summarizes the results of this comparison. Task pairing (a) refers to the combination navigation - auditory, task pairing (b) to the combination navigation - right motor, task pairing (c) to the combination auditory - right motor and task pairing (d) to the combination left motor - right motor, which we include in order to allow for a comparison with these classical tasks. Our results allow us to conclude that (a) vs. (b) result in slightly better correct classification rates than the classical imagined motor task. However, since we can extract information about the cognitive state in all cases, the main conclusion is that we might significantly increase the bit rate of BCI systems by using more than two cognitive tasks. For more details on this study we refer to (Curran et al. 2003).
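The bit rates in the table can be related to the accuracies with the information transfer formula of (Wolpaw et al. 2000). The sketch below is an illustration only. The rate of one decision per second is an assumption that is not stated on this page; under that assumption the formula reproduces the tabulated bit/s values for 74 %, 69 % and 71 %.

```python
import math

def wolpaw_bits_per_decision(accuracy, n_classes=2):
    """Bits per decision for a classifier with the given accuracy (0 < accuracy <= 1)."""
    p, n = accuracy, n_classes
    if p >= 1.0:
        return math.log2(n)          # a perfect classifier transfers log2(n) bits
    return (math.log2(n)
            + p * math.log2(p)
            + (1.0 - p) * math.log2((1.0 - p) / (n - 1)))

def bit_rate(accuracy, decisions_per_second=1.0, n_classes=2):
    """Bits per second; decisions_per_second = 1.0 is an assumed value."""
    return decisions_per_second * wolpaw_bits_per_decision(accuracy, n_classes)

if __name__ == "__main__":
    for acc in (0.74, 0.69, 0.71):
        print(f"accuracy {acc:.2f} -> {bit_rate(acc):.3f} bit/s")
```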
Probabilistic models can be used to describe many architectures that have been applied to static and adaptive BCI systems. Examples are hidden Markov models (HMMs), which have been successfully applied to BCI in (Obermeier et al. 2001). Probabilistic models have also been quite popular tools in the machine learning and statistics communities. Recently these communities have investigated efficient algorithms that allow inference of very complex models. These findings are of interest for the BCI community since they allow us to go beyond classical time series models and thereby improve different aspects of BCI systems. We have recently evaluated two such generalizations in the context of BCI systems. Coupled HMMs are generalizations of ordinary HMMs, where two hidden state sequences are probabilistically coupled using arbitrary lags. In (Rezek et al. 2002) these models have been applied to movement planning and shown to outperform classical HMMs.
Another modification of HMMs was proposed in (Sykacek & Roberts 2002), where we follow probabilistic principles and suggest that classifications based on feature extraction (like the use of spectral representations or AR models as used in our BCI) have to regard the features as latent variables. Hence inference and predictions need to marginalize over this latent space. The practical advantage of the proposed architecture is that both the feature uncertainty and the model uncertainty (the latter means the uncertainty about model order) are automatically taken into consideration. This affects model estimation and prediction and results in automatic artefact moderation and thus in intelligent sensor fusion. The idea exploits a property of marginalization (i.e. integration over the distribution) over (at least two) uncertain latent variables estimated from two sensor signals (e.g. EEG recorded at different electrode sites). Depending on the variance of the distribution, we will obtain different posterior probabilities. Section Bayesian sensor fusion illustrates the effect using two sensor signals X_{a} and X_{b} and the state of interest y (e.g. the cognitive state we want to predict). We see how the posterior probability depends on the variance. The effect may even result in a different assessment w.r.t. the predicted state.
The application of such a marginalization idea to BCI is illustrated in table BCI with fully Bayesian method. We illustrate results obtained with this paradigm using two task pairings from our neuro-cognitive study. We compare the generalization accuracies of the fully probabilistic model (full Bayes) with those of a classical approach that does feature extraction separately. Table BCI with fully Bayesian method also shows the probabilities of the null hypothesis P_{null} that the results are equal (McNemar's test). We may thus conclude that a fully Bayesian approach significantly outperforms classifications obtained when conditioning on feature estimates. Despite having found that a fully Bayesian approach improves BCI performance, the proposed method has the disadvantage of not being directly applicable to online BCI, because its computational complexity is too high. Hence we investigated an approximation which can be used in real time and nevertheless achieves the desired effects (Sykacek et al. 2003 b).
task pair | Accuracy (classical model) | bit rate b/s (classical model) | Accuracy (full Bayes) | bit rate b/s (full Bayes) | P_{null} |
(d) | 75.9% | 0.20 | 81.4% | 0.31 | 0.04 |
(a) | 76.2% | 0.21 | 84.5% | 0.38 | <<0.01 |
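The P_{null} values reported on this page come from McNemar's test on paired classifier results. As a rough sketch of how such a value can be obtained from two sets of predictions on the same test cases, the following uses the exact binomial form of the test. Which variant of the test (exact or chi-square approximation) was used in the studies is not stated here, so this choice and the function names are assumptions.

```python
from math import comb

def discordant_counts(truth, pred_a, pred_b):
    """Count test cases on which exactly one of the two classifiers is correct."""
    b = sum(1 for t, a, p in zip(truth, pred_a, pred_b) if a == t and p != t)
    c = sum(1 for t, a, p in zip(truth, pred_a, pred_b) if a != t and p == t)
    return b, c

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b: cases classifier A gets right and classifier B gets wrong
    c: cases classifier B gets right and classifier A gets wrong
    """
    n = b + c
    if n == 0:
        return 1.0                      # the classifiers never disagree
    k = min(b, c)
    # Under the null hypothesis the discordant cases split like a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

if __name__ == "__main__":
    truth  = [1, 1, 0, 0, 1, 0, 1, 1]
    pred_a = [1, 1, 0, 0, 1, 0, 0, 1]
    pred_b = [1, 0, 1, 0, 0, 0, 0, 1]
    print(mcnemar_exact(*discordant_counts(truth, pred_a, pred_b)))
```

Only the discordant cases enter the test, which is why two classifiers with similar accuracies can still differ significantly when their errors fall on different test cases.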
Current BCI architectures developed by other research groups rely on assuming that the EEG generated during cognitive tasks shows stationary behavior. This assumption must be wrong for several reasons:
We thus suggest a fully adaptive approach for the translation algorithm that, even in short time use of a BCI, resulted in higher communication bandwidth than conventional static BCIs. Probabilistic models can be of advantage in describing algorithms for adaptive BCI systems. A graph structure that illustrates such an approach is shown in my section on adaptive classification. We assume a model that predicts the probabilities of cognitive states and regard the parameters of the classifier as latent variables in a first order Markov process. In the linear Gaussian case the solution is given by the well known Kalman filter equations. In our case the non-linearity introduced by predicting probabilities requires us to use an approximation. We suggest for that purpose a variational technique and thus obtain variational Kalman filtering as an inference method (Sykacek & Roberts 2003).
Variational methods (Jordan et al. 1999, Attias 1999) are attractive for BCI systems because, compared with Laplace approximations (as e.g. used for classification problems in (Penny & Roberts 1999)), they allow for more flexibility and, contrary to particle filters, they still provide a parametric form of the posterior. Having a parametric posterior is important since it allows efficient real time implementations. We apply this algorithm to features extracted from EEG. My favorite approach is to use a lattice filter representation of autoregressive parameters since they have in (Sykacek et al. 1999 c) been found superior to other feature extraction techniques including conventional AR parameters. Results of applying the variational Kalman filter classifier to the BCI data described in our neuro-cognitive study are summarized in table Adaptive BCI. The last column contains the probabilities of the null hypothesis P_{null} that the results are equal (McNemar's test). The results suggest that a truly adaptive BCI (column vkf) significantly outperforms the equivalent static method (column vsi). A detailed description of our adaptive translation algorithm and a thorough evaluation can be found in (Sykacek et al. 2003 c).
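For orientation, the linear Gaussian special case mentioned above can be written down directly. The sketch below tracks a single regression weight that follows a first order Markov process (a random walk). It shows the classical Kalman filter recursion only and is not the variational Kalman filter of (Sykacek & Roberts 2003); the scalar model, the noise variances q and r and the function name are assumptions made for illustration.

```python
def kalman_step(w, p, x, y, q, r):
    """One Kalman update for y_t = x_t * w_t + noise(r), w_t = w_{t-1} + drift(q).

    w, p: posterior mean and variance of the weight before seeing (x, y)
    """
    p_pred = p + q                     # predict: the random walk inflates the variance
    s = x * p_pred * x + r             # variance of the predicted observation
    k = p_pred * x / s                 # Kalman gain
    w_new = w + k * (y - x * w)        # correct the mean by the prediction error
    p_new = (1.0 - k * x) * p_pred     # posterior variance after the update
    return w_new, p_new

if __name__ == "__main__":
    w, p = 0.0, 10.0                   # vague initial belief about the weight
    for t in range(50):
        w, p = kalman_step(w, p, x=1.0, y=2.0, q=0.0, r=0.1)
    print(w, p)
```

With q > 0 the filter keeps some uncertainty about the weight and can therefore follow slow changes, which is the property an adaptive translation algorithm relies on.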
Generalization results
Cognitive task | Accuracy (vkf) | bit/s (vkf) | Accuracy (vsi) | bit/s (vsi) | P_{null} |
navigation/auditory | 86% | 0.42 | 83% | 0.34 | 0.02 |
navigation/movement | 80% | 0.28 | 80% | 0.28 | 0.31 |
auditory/movement | 78% | 0.24 | 76% | 0.21 | <<0.01 |
This page is a summary of my SIESTA activities. I have looked at various aspects of the EEG model that should lead to the "core" sleep analyzer. These activities include deriving a Bayesian method for preprocessing, an investigation of resampling issues, feature subset selection and an EEG model based on variational Bayesian techniques that forms the core of the SIESTA sleep analyzer. We decided for the Bayesian paradigm, since we found it extremely useful for automatic sleep classification according to Rechtschaffen & Kales rules (Sykacek et al. 1998 and Sykacek et al. 2002 a). I also came up with ideas on how to combine models that were built separately for different biosignals. Below, we will use some acronyms: EEG - electroencephalogram (a signal obtained by recording from different positions on the scalp; it represents local brain activity), EOG - electrooculogram (a signal recorded from the forehead that shows eye movements), EMG - electromyogram (a signal recorded from various positions on the human body which represents the local muscle activity).
The SIESTA analyzer processes EEG with a Bayesian implementation of an AR lattice filter structure (Bayesian reflection coefficients). The difficulty with this representation is the calculation of the marginal likelihood of the model (i.e. integration of the distribution w.r.t. model parameters). Preprocessing results in a posteriori distributions of coefficients and posterior probabilities for models. Details have first been published in one of our EMBEC abstracts (Sykacek et al. 1999 b). Based on this lattice filter model, I have also tried to obtain a REM/non REM feature from EMG. However, this attempt failed, probably for the same reasons why an amplitude based feature could not be derived.
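To make the notion of reflection coefficients concrete, the following sketch estimates them with the classical Burg recursion, which yields point estimates for the stages of an AR lattice filter. This is a swapped-in textbook technique used for illustration: the SIESTA preprocessing described above is Bayesian and yields posterior distributions and model probabilities, which this sketch does not attempt. The function name is hypothetical.

```python
def burg_reflection(x, order):
    """Point estimates of the first `order` reflection coefficients of signal x."""
    n_samples = len(x)
    f = list(x)          # forward prediction errors of the current lattice stage
    b = list(x)          # backward prediction errors of the current lattice stage
    coeffs = []
    for m in range(1, order + 1):
        # Errors of stage m-1 are valid for indices n >= m (forward) and n-1 >= m-1 (backward).
        num = sum(f[n] * b[n - 1] for n in range(m, n_samples))
        den = sum(f[n] ** 2 + b[n - 1] ** 2 for n in range(m, n_samples))
        k = -2.0 * num / den if den > 0.0 else 0.0
        coeffs.append(k)
        new_f, new_b = f[:], b[:]
        for n in range(m, n_samples):
            new_f[n] = f[n] + k * b[n - 1]   # lattice update, forward branch
            new_b[n] = b[n - 1] + k * f[n]   # lattice update, backward branch
        f, b = new_f, new_b
    return coeffs

if __name__ == "__main__":
    signal = [0.5 ** n for n in range(30)]   # toy signal, not EEG
    print(burg_reflection(signal, order=2))
```

Each coefficient has magnitude at most one by construction, which is one reason why lattice representations are numerically convenient as features.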
From a technical point of view there are several possibilities for measuring the quality of feature subsets. We decided to look at the likelihood function of various classifiers and a nonparametric estimate of an impurity measure (We measure the Gini index by a k nearest neighbors approach).
Search for the "best" subset was based on suboptimal algorithms (forward selection and sequential elimination). This was necessary since the number of features (more than 200) rendered exhaustive search impossible.
The optimal subset size was determined with a statistical significance test. We used McNemar's test for comparing two paired classifiers and an appropriate p-value. In the forward selection scheme we add the most promising feature if it increases the classification accuracy with statistical significance. In the backward elimination strategy, we remove the least important feature if its removal does not result in a statistically significant difference.
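The forward selection scheme with a significance based stopping rule can be sketched as follows. The callbacks `evaluate` and `significantly_better` are hypothetical placeholders: in the setting described above, the first would score a feature subset (for example by classification accuracy) and the second would wrap McNemar's test on paired predictions. The toy usage replaces both with simple stand-ins.

```python
def forward_selection(candidates, evaluate, significantly_better):
    """Greedy forward selection that stops when no feature helps significantly.

    evaluate(subset) -> score of a feature subset (higher is better)
    significantly_better(new_score, old_score) -> bool (hypothetical test wrapper)
    """
    selected = []
    remaining = list(candidates)
    current = evaluate(selected)
    while remaining:
        # The most promising feature is the one whose addition scores best.
        best = max(remaining, key=lambda feat: evaluate(selected + [feat]))
        trial = evaluate(selected + [best])
        if not significantly_better(trial, current):
            break                      # stop: the improvement is not significant
        selected.append(best)
        remaining.remove(best)
        current = trial
    return selected

if __name__ == "__main__":
    # Toy stand-ins: additive accuracy gains and a fixed improvement threshold.
    gains = {"a": 0.10, "b": 0.05, "c": 0.01}
    score = lambda subset: 0.5 + sum(gains[f] for f in subset)
    better = lambda new, old: new - old > 0.02
    print(forward_selection(gains, score, better))
```

Backward elimination follows the same pattern with the roles reversed: start from the full set and drop the least important feature as long as the test finds no significant difference.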
The results in the following tables are from (Sykacek et al. 1999 c) .
Subset 1: Gini index and sequential forward selection
stochastic complexity at C3 |
Hjorth coefficient at Fp2: cmpl. /(act. * mob) |
Subset 2: Likelihood of logistic regression
ref. coefficient at C4: 1st. coefficient |
power spectral density at Fp1: Beta (12.5 Hz - 30.0 Hz) |
Kalman AR coefficient at C3: 2nd. coefficient |
Conventional FSS has a major problem: If two or more (similar sized) subsets explain the problem equally well, using just one of them is in a Bayesian sense not consistent with the information provided. The correct approach would be to integrate out feature subset uncertainty. Consequently we also applied such a Bayesian technique to the problem of determining relevant feature subsets. The result of such FSS is a posterior probability over feature subsets. Prediction would then consider all subsets according to their posterior probability, which for the SIESTA data is shown in the image to the left. For further details of this method, I want to refer to my NIPS 99 preprint (Sykacek 2000 b).
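Integrating out feature subset uncertainty amounts to averaging the predictions of all subsets, weighted by their posterior probabilities. A minimal sketch, assuming the subset posterior is available as a dictionary and that a hypothetical function `predict(subset, x)` returns class probabilities for a classifier restricted to that subset:

```python
def averaged_prediction(subset_posterior, predict, x):
    """Class probabilities averaged over feature subsets.

    subset_posterior: dict mapping a subset (tuple of feature names) to P(subset | data)
    predict(subset, x): dict mapping class label to P(class | x, subset)
    """
    total = sum(subset_posterior.values())      # renormalize in case of truncation
    averaged = {}
    for subset, weight in subset_posterior.items():
        for label, prob in predict(subset, x).items():
            averaged[label] = averaged.get(label, 0.0) + (weight / total) * prob
    return averaged

if __name__ == "__main__":
    # Toy numbers for illustration only, not results from the SIESTA data.
    posterior = {("f1",): 0.7, ("f1", "f2"): 0.3}
    toy = {("f1",): {"REM": 0.9, "wake": 0.1},
           ("f1", "f2"): {"REM": 0.5, "wake": 0.5}}
    print(averaged_prediction(posterior, lambda s, x: toy[s], x=None))
```

Picking only the most probable subset corresponds to setting its weight to one and all others to zero, which discards the information carried by competing subsets.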
Subset 3: Bayesian wrapper for C3 only. The posterior probability of this subset is 0.69
ref. coefficient at C3: 1st. coefficient |
ref. coefficient at C3: 3rd. coefficient |
Hjorth coefficient at C3: cmpl. /(act. * mob) |
The following considerations lead to the chosen architecture for the Siesta analyzer.
We decided for a model that allows class conditional densities to be mixtures of Gaussians. Together with prior probabilities for each class, we thus have a generative model for EEG features that - via Bayes' theorem - allows predicting probabilities for wake, REM and deep sleep. The corresponding graphical model is illustrated to the left. The basic architecture was previously used in a maximum likelihood setting, e.g. in (Tråvén 1991). The architecture is a latent variable model consisting of the "features" x, the class label t and the kernel indicator d. In addition the Bayesian approach requires us to include all model parameters and hyper parameters as well. These are the prior probabilities of each class P and the parameters of the mixture model. For the latter we have W as (class conditional) kernel allocation probabilities, μ as kernel mean and λ as kernel precision (inverse kernel variance). The remaining variables specify a partially hierarchical prior which is largely influenced by (Richardson & Green 1997). We have δ_{P} and δ_{W} as prior counts (we use 1) in the Dirichlet distributions over the corresponding variables P and W. Variables κ and ξ specify a Gaussian prior over the mean, μ. Variables α and β specify a Gamma prior over the kernel precisions, λ, where β is itself given a Gamma prior. This hierarchical setting is suggested by (Richardson & Green 1997) to make inference less sensitive to the hyper parameter settings.
For reasons of computational efficiency we decided for a variational implementation of the generative classifier. The idea follows from implementations by (Attias 1999), who derives a variational solution for a Gaussian mixture model, and (Ghahramani & Beal 2000), who derive a variational implementation for a mixture of factor analyzers. Variational methods and the EM algorithm (Dempster et al. 1977) share the same ideas. The EM algorithm is a special case in which the variational lower bound becomes exact at the maximum. In general variational algorithms, the objective remains a lower bound.
To obtain a lower bound, we apply Jensen's inequality to the log marginal likelihood. In the simplest case we use a mean field assumption of a factorizing posterior distribution. Each circular variable in the DAG gets its own approximate distribution. Details of how to derive the algorithm can be found in my PhD thesis (Sykacek 2000 a). As a result of iterating the variational updates to convergence we obtain a negative free energy for the model which can be used to guide model selection (Attias 1999). The plot to the right shows the negative free energies of a generative model for one of the electrodes (C3) for various numbers of Gaussian kernels. It suggests with very large probability a model with 15 kernels. More details on the prototype of the SIESTA sleep analyzer can be found in (Sykacek et al. 2001).
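For fixed parameter values, the prediction step of such a generative classifier is a direct application of Bayes' theorem. The sketch below uses a one dimensional feature and point values for P, W, μ and λ; all numbers are hypothetical, and the priors and the variational treatment of the parameters are left out.

```python
import math

def gauss(x, mean, precision):
    """Univariate Gaussian density parameterized by mean and precision (1 / variance)."""
    return math.sqrt(precision / (2.0 * math.pi)) * math.exp(-0.5 * precision * (x - mean) ** 2)

def class_posterior(x, P, W, mu, lam):
    """P(t | x) for class conditional Gaussian mixtures.

    P[t]: class prior, W[t]: kernel allocation probabilities,
    mu[t], lam[t]: kernel means and precisions of class t.
    """
    joint = {}
    for t in P:
        # Summing over the kernel indicator d gives the class conditional density p(x | t).
        density = sum(w * gauss(x, m, l) for w, m, l in zip(W[t], mu[t], lam[t]))
        joint[t] = P[t] * density
    evidence = sum(joint.values())
    return {t: j / evidence for t, j in joint.items()}

# Hypothetical toy parameters, two kernels per class.
P = {"wake": 0.3, "REM": 0.2, "deep": 0.5}
W = {"wake": [0.6, 0.4], "REM": [0.5, 0.5], "deep": [0.7, 0.3]}
MU = {"wake": [2.0, 3.0], "REM": [0.0, 0.5], "deep": [-2.0, -3.0]}
LAM = {"wake": [4.0, 4.0], "REM": [4.0, 4.0], "deep": [4.0, 4.0]}

if __name__ == "__main__":
    print(class_posterior(2.0, P, W, MU, LAM))
```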
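If the negative free energy is taken as an approximation of the log marginal likelihood and all model orders are given equal prior probability, approximate posterior model probabilities follow from a normalized exponential. Both conditions are assumptions of this sketch, and the free energy values in the usage example are made up for illustration; they are not the values shown in the plot.

```python
import math

def model_probabilities(neg_free_energy):
    """Approximate P(model | data) from negative free energies, assuming a uniform model prior.

    neg_free_energy: dict mapping model identifier (e.g. number of kernels) to F
    """
    top = max(neg_free_energy.values())
    # Subtracting the maximum before exponentiating avoids numerical underflow.
    unnorm = {k: math.exp(f - top) for k, f in neg_free_energy.items()}
    z = sum(unnorm.values())
    return {k: u / z for k, u in unnorm.items()}

if __name__ == "__main__":
    toy = {10: -1520.0, 15: -1500.0, 20: -1510.0}   # hypothetical values
    print(model_probabilities(toy))
```

Because the energies enter through an exponential, differences of a few units already concentrate almost all probability on one model order.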
If applied to new data, the SIESTA analyzer calculates probability traces that characterize the all-night sleep profile. An example plot is illustrated below.
This work was done by the authors while contributing to the BBSRC funded "Shared Genetic Pathways in Cell Number Control" research program, which was awarded to the Department of Pathology, University of Cambridge, UK. As the project title suggests, this project investigates molecular biological processes that control development cycles in different biological systems. The search for the underlying genetic markers requires a principled approach that can infer which genes are of shared importance in several microarray experiments. We propose for that purpose a fully Bayesian model for an analysis of shared gene function. The approach assumes that several microarray experiments with known cross annotations between transcripts (genes) should be analyzed for common genetic markers. The implementation described in this work has in particular the advantage of combining data sets before applying thresholds, so that the result is independent of that choice. For more information on the method, we refer to the original paper (Sykacek et al 2007 a) and the pdf supplement (Sykacek et al 2007 b).
The analysis of development processes in many tissues is faced with several interacting biological processes and a mix of various cell types. As an example we investigate in this work the shared biological activity at gene level in a mouse mammary gland development cycle (Clarkson et al 2004) and a human endothelial cell culture with apoptosis induced by serum withdrawal (Johnson et al 2004). The biological complexity of the experiments is best visualized, if we mark different development stages for active biological processes at a macro level. For the mouse mammary time course, we get the following table of active processes. During lactation, time is in days and during involution we use hours.
Biological Process | L_{0} | L_{5} | L_{10} | I_{12} | I_{24} | I_{48} | I_{72} | I_{96} |
Type I Apoptosis | - | - | - | + | + | ? | - | - |
Type II Apoptosis | - | - | - | - | - | ? | + | + |
Apoptosis | - | - | - | + | + | + | + | + |
Differentiation | + | + | + | ? | - | - | - | - |
Inflammation | ? | - | - | + | + | ? | - | - |
Remodeling | -/(?) | - | - | - | - | ? | + | + |
Acute Phase | + | - | - | - | + | + | + | + |
We use "+" to indicate that a process is active and "-" to indicate its inactivity. A "?" indicates epochs where we are uncertain about the process activity. A similar though simpler classification can be obtained for the second experiment, which studies human endothelial cells under serum deprivation. Duration of serum deprivation is in hours.
Biological Process | control (t_{0}) | t_{28} |
Type II Apoptosis | - | + |
Apoptosis | - | + |
Differentiation | + | - |
In addition to the results we present in the original paper and in the pdf supplement, we provide here the top 20 genes we find important to contribute to both data sets.
Gene Symbol | P(I_{t}|D) | P(G=t|D) | Co-Regulation |
SAT | 0.99951 | 0.047597 | anti |
ODC1 | 0.99921 | 0.029237 | co |
GRN | 0.99921 | 0.029125 | co |
BSCL2 | 0.99919 | 0.028601 | anti |
MLF2 | 0.99884 | 0.019988 | anti |
IFRD2 | 0.99867 | 0.017425 | co |
BTG2 | 0.99843 | 0.014688 | co |
CCNG2 | 0.99826 | 0.013274 | co |
TNK2 | 0.99789 | 0.010943 | anti |
C9orf10 | 0.99783 | 0.010614 | co |
HAGH | 0.99764 | 0.0097747 | co |
PPP2CB | 0.99759 | 0.0095567 | anti |
SSR1 | 0.99748 | 0.0091528 | co |
MUT | 0.99747 | 0.0091039 | co |
DHRS3 | 0.99746 | 0.0090926 | co |
PSMA1 | 0.99741 | 0.0089018 | anti |
HBLD2 | 0.99732 | 0.0086073 | co |
SYPL1 | 0.99724 | 0.0083639 | co |
C2F | 0.99723 | 0.0083374 | co |
ATP6V1B2 | 0.99706 | 0.0078419 | anti |
The full gene list in comma separated format is available as a zipped archive. To check which biological processes we find attributed to this list, we follow the suggestion in (Lewin et al. 2006) and use Fisher's exact test to infer significance levels of active gene ontology (GO) categories from the probabilistic rank list (see also (Al-Shahrour et al.)). The resulting GO categories for the gene list of this shared analysis can be obtained as comb_apo_all.xml in xml format. This file is compatible with Treemap - (C) University of Maryland and preserves the parent - child relationships from the directed acyclic GO graph. Note that the treeml.dtd file is part of the Treemap package and not available here. Treemap is under a non commercial license. If it is unavailable despite that, the xml file can be inspected with any reasonable web browser.
The software to calculate indicator probabilities that capture shared gene function comes as a collection of MatLab libraries. The package consists of the main code, which uses the variational Bayesian approach described in (Sykacek et al 2007 a), and additional functions for data handling, output generation and an EM implementation for regularized probit link regression used during initialization of all Q-distributions. To make the software distribution flexible, all MatLab functions are collected in archives, each containing functions of a particular type. The software is available under the GPL 2 license and comes without any warranty.
To install the package, one has to download all required archives, provided as *.zip files or tarred gzip archives (*.tar.gz), unpack the archives and set appropriate MatLab paths. Scripts using a hypothetical experiment derived from a mouse testis time course, kindly provided by R. Furlong, demonstrate how to use the library.
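A one sided Fisher's exact test for the over-representation of a GO category reduces to a hypergeometric tail probability. The sketch below assumes that the probabilistic rank list has already been turned into a selected gene set, for instance by taking the top ranked genes; how the rank list is actually converted in (Lewin et al. 2006) is not described here, so this step and the function name are assumptions.

```python
from math import comb

def fisher_enrichment_p(k, n, K, N):
    """One sided p-value P(X >= k) for GO category enrichment.

    N: number of genes on the array (the gene universe)
    K: genes in the universe annotated to the category
    n: genes selected from the rank list
    k: selected genes that are annotated to the category
    """
    upper = min(n, K)
    # Hypergeometric probabilities: choose i annotated and n - i unannotated genes.
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)

if __name__ == "__main__":
    # Toy numbers: 20 of 1000 genes annotated, 5 of the top 50 genes hit the category.
    print(fisher_enrichment_p(k=5, n=50, K=20, N=1000))
```

When many GO categories are tested in this way, the resulting significance levels are usually interpreted with a multiple testing correction in mind.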
Library File | Description |
helpers.tar.gz | generic helper functions |
statsgen.tar.gz | generic statistics functions |
mca_base.tar.gz | basic microarray file handling (loading various microarray data formats) |
mca_fuse.tar.gz | generic handling in connection with shared analysis (cross annotation and output generation) |
probitem.tar.gz | Penalized maximum likelihood (MAP) for probit link regression via an EM algorithm. |
combanalysis.glb.hphp.tar.gz | Variational Bayes for shared analysis of subset probabilities in probit regression. |
All components required to successfully run the experiment will be installed automatically if one creates a new directory and then downloads and runs the setup script in that directory. Linux (Unix) users should use combsyssetup.sh. After download, you might have to set executable permission by invoking the command chmod +x combsyssetup.sh. Windows users should either do the same after installing a Cygwin environment or install the Wget and Unzip packages from GnuWin32 and then download and run combsyssetup.bat. Note that this will install all required packages and, if run at later times, install updates.
After having run the script, the installation directory contains MatLab scripts and data which illustrate the approach discussed in (Sykacek et al 2007 a). The data are extracted from a subset of a mouse testis development time course, kindly provided by R. Furlong. The data consist of 7 time points: adult, day 1, day 5, day 10, day 15, day 23 and day 35, with differential expression measured against the adult generation.
To illustrate all steps from cross annotation to generation of gene lists, we divided this data artificially into two "experiments". One experiment contains the samples of the adult generation and days 1 and 15. Here we use the original gene ids. We assume that the biological state change is between adult and the other two development periods. The corresponding data file is called "exp1mca.tsv" and is formatted like FSPMA normalized raw output: Gene ids are used as column headers and all samples as rows below. We also have a corresponding effects description as "exp1eff.tsv", which is used to generate the labels. The second experiment contains days 35, 23, 5 and 10 and artificially modified gene ids, to mimic a situation that requires between species annotation. Here we assume that the biological states correspond to days 35 and 23 versus days 5 and 10. The files are "exp2mca.tsv" for the microarray data and "exp2eff.tsv" for the labels. Note that the assumption is that each experiment provides information about differences in late and early stages of testis development. The analysis goal is thus similar to a problem, where we attempt to combine two experiments obtained from different platforms or species. This requires "cross annotation", which is here done according to the tab delimited file "crossann.tsv". In general, each row in this file contains a tuple that provides a unique mapping between all different unique gene ids one finds in a shared analysis. To complete the list of files, we provide in addition the tab delimited file "genespec.csv", which provides for the unique gene ids in the target genome, a mapping to gene symbols and descriptions.
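The idea behind cross annotation can be illustrated outside MatLab with a small Python sketch. It is not a port of crossann.m. The file layout is an assumption based on the description above: gene ids in the first row, one sample per following row without a sample label column, and one tab delimited pair of matching gene ids per row in the cross annotation. The toy strings stand in for the real files.

```python
import csv
import io

def read_expression(text):
    """Parse a tab delimited table: first row gene ids, each following row one sample."""
    rows = [r for r in csv.reader(io.StringIO(text), delimiter="\t") if r]
    header, samples = rows[0], rows[1:]
    return {gene: [float(r[i]) for r in samples] for i, gene in enumerate(header)}

def align(exp1, exp2, crossann_text):
    """Keep only genes that the cross annotation maps between both experiments."""
    pairs = [r for r in csv.reader(io.StringIO(crossann_text), delimiter="\t") if len(r) >= 2]
    return [(r[0], r[1], exp1[r[0]], exp2[r[1]])
            for r in pairs if r[0] in exp1 and r[1] in exp2]

# Toy stand-ins for exp1mca.tsv, exp2mca.tsv and crossann.tsv (hypothetical content).
EXP1 = "g1\tg2\tg3\n0.1\t0.2\t0.3\n0.4\t0.5\t0.6\n"
EXP2 = "h2\th9\n1.0\t2.0\n"
CROSS = "g2\th2\ng3\th3\n"

if __name__ == "__main__":
    for id1, id2, values1, values2 in align(read_expression(EXP1), read_expression(EXP2), CROSS):
        print(id1, id2, values1, values2)
```

Genes without a counterpart in the other experiment drop out at this stage, so the shared analysis only sees transcripts that are measured in both data sets.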
File Name | Description |
exp1mca.tsv, exp2mca.tsv | Normalized log ratios (location and scale adjustment) |
exp1eff.tsv, exp2eff.tsv | (default) Labels |
crossann.tsv | cross annotations between the different gene ids found in the gene lists |
genespec.csv | mapping from unique gene ids (for the cross annotation target) to standardized symbols and gene descriptions |
shareanalysis.def | specification of the cross annotated experiments that enter shared analysis |
File Name | Description |
crossann.m | cross annotation of microarray experiments and preparation of shared analysis |
runsim.m, calccoreg.m | calculation of gene indicator probabilities of shared gene function by variational Bayesian inference |
combres2csv.m | extraction of gene ranking w.r.t. shared gene function as a tab delimited file |
Both artificial experiments have to be cross annotated. This step will align the gene ids in different experiments and provide two raw data files and a gene id to symbols and description annotation in MatLab 6 format. Cross annotation is done by the MatLab script crossann.m found in the installation folder.
After cross annotation, we have to prepare for the shared analysis. This requires specifying a tab delimited text file ("shareanalysis.def") which controls this process. The minimal requirement is to specify in this control file which (previously cross annotated) data files should be analyzed for shared gene function. In addition one can specify a different set of labels. This is useful to analyze the same data for different biological classifications. We may also provide independent test data, which will be used to obtain generalization errors. Analysis is started with the script combanalysis.m in the installation folder. The simulation will, depending on the size of the problem and the mode of analysis, take up to several hours (this example, though, completes in less than one minute, or in a few minutes if we want fold results). As a result we get all simulation output in MatLab format. To understand the details of the calculated results, one has to look into the code and analyze the variables stored in the MatLab output file.
The last step in an analysis of shared function is to generate a rank table of shared gene function. This is done with the script crossann2csv.m found in this folder as well. The result is a rank list of similar structure as the one provided in the supplement of the original paper.
All intermediate results generated during a shared analysis are stored in MatLab 6 format (for Octave compatibility). The final rank table is a comma separated file.
File Name | Description |
exp1.mat, exp2.mat | cross annotated and normalized raw data |
crossanngenespec.mat | reordered gene specifications (id, symbol and description) |
state.mat, crosslog.mat | internal log files (see code) |
sharetestres.mat | inference result about shared gene function. This file contains all results including probabilities, predictions and all Q-distributions found from variational Bayesian inference. |
share_test_rank.csv | rank table as comma separated file. |
To run such an analysis on a different experiment, one must provide data files structured like those listed in the data files table. The structure of the microarray data and the default labels is identical to the output generated by FSPMA, which can thus be used as preprocessing tool. In addition, one has to generate a file which allows cross annotation between all gene sets that appear in any one data set. If there is only one set of gene ids, cross annotation should be done anyway, to obtain the data in the format as expected by runsim.m. In this case all parts in runsim.m that specify the shared analysis will refer to the same gene id column. Inference of shared gene function requires in addition a control file similar to shareanalysis.def. Finally one has to adjust all script files in the script files summary to meet the different requirements.
This page describes joint work with R. Clarkson, C. Print, R. Furlong and G. Micklem. We also thank David MacKay for advice. The project was mostly done at the Departments of Pathology and Genetics, University of Cambridge as part of the project "Shared Genetic Pathways in Cell Number Control", ref. 8/EGH16106 funded by the BBSRC within their Exploiting Genomics initiative. During completion of this work, Peter Sykacek moved to the Bioinformatics group at BOKU University, Vienna, which is funded by the WWTF, ACBT, Baxter AG, and ARC Seibersdorf.