Neural Computing Research Group

Interdisciplinary Research in DNA Microarray Data Analysis & Modelling

Introduction

Interdisciplinary Research in DNA Microarray Data Analysis & Modelling is a project funded by the BBSRC as part of the 'Exploiting Genomics' Initiative. Microarrays are a revolutionary technology allowing biologists to study the activity of thousands of genes simultaneously and over time. This interdisciplinary research programme will develop techniques and tools for the mathematical modelling and statistical analysis of DNA microarray data. Microarray studies can be divided into three aspects: Data Generation (experimentation, microarray production); Data Management (storage and access of data and information); Data Analysis & Modelling (gene classification and network modelling). The project will pay particular attention to the first and third aspects of DNA microarray studies. In a unique partnership between biologists, control engineers, computer scientists, mathematicians and statisticians, this project will take an integrative approach to the study of both eukaryotic and bacterial systems. The project is led at Aston by Professor David Lowe and Dr. Ian Nabney. The focus of our research is on pattern analysis, particularly advanced methods of visualisation. The project is organised as a large consortium with the following research groups:

Research Funding

BBSRC:     136,636 GBP

Duration

The project started in February 2003 and will last for 3 years.

Project Background

The principal aim for data analysis and modelling is to develop methodologies and tools that are generic and robust; that is, the conclusions drawn from studying these two particular systems should be applicable to a wide range of microarray studies. Following data capture, image analysis, normalisation, data correction and pre-processing, the starting point for data analysis and modelling is a gene expression matrix (GEM) with n genes in its rows and m samples (conditions or time points) in its columns. The purpose of microarray data analysis is two-fold: to investigate the organisation (interrelationships) of genes and/or the dynamic behaviour (interactions) of genes. While the former employs pattern recognition techniques for gene classification, the latter requires parametric modelling to represent gene networks.
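As a minimal sketch of the GEM layout described above, the following Python stores genes in rows and samples in columns. The gene names, sample labels and expression values are made up for illustration, and Pearson correlation is only one assumed way of quantifying the interrelationship between two gene profiles; the project text does not prescribe a measure.

```python
import math

# Hypothetical toy data: 3 genes (rows) measured at 4 time points (columns).
genes = ["geneA", "geneB", "geneC"]
samples = ["t0", "t1", "t2", "t3"]
gem = [
    [0.1, 0.8, 1.5, 2.1],  # geneA
    [2.0, 1.4, 0.7, 0.2],  # geneB
    [0.2, 0.9, 1.4, 2.0],  # geneC
]

def gene_profile(gem, i):
    """Expression of gene i across all m samples (one row)."""
    return list(gem[i])

def sample_profile(gem, j):
    """Expression of all n genes in sample j (one column)."""
    return [row[j] for row in gem]

def pearson(u, v):
    """Pearson correlation between two equal-length profiles."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

n, m = len(gem), len(gem[0])
r_ac = pearson(gene_profile(gem, 0), gene_profile(gem, 2))  # similar trends
r_ab = pearson(gene_profile(gem, 0), gene_profile(gem, 1))  # opposite trends
```

In this toy matrix geneA and geneC rise together while geneB falls, so the first correlation is strongly positive and the second strongly negative.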

Cluster analysis is often used to try to explain the results of microarray experiments. The trouble with this approach is that there is little guidance on how to choose the algorithm or to decide how many clusters are present.
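The difficulty of deciding how many clusters are present can be illustrated with a small sketch. It uses k-means purely as a familiar example (the text above does not name an algorithm), a deterministic starting rule chosen for reproducibility, and made-up two-dimensional points standing in for gene profiles.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))

def kmeans(points, k, iters=50):
    """Plain Lloyd's algorithm; initial centres are evenly spaced data points."""
    step = len(points) / k
    centres = [points[int((i + 0.5) * step)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centres[c]))
            clusters[nearest].append(p)
        # Keep the old centre if a cluster happens to be empty.
        new_centres = [mean(c) if c else centres[i] for i, c in enumerate(clusters)]
        if new_centres == centres:
            break
        centres = new_centres
    # Within-cluster sum of squares: the quantity k-means tries to minimise.
    wcss = sum(sq_dist(p, centres[i]) for i, c in enumerate(clusters) for p in c)
    return centres, wcss

# Hypothetical data with three visually obvious groups.
points = [
    (0.0, 0.1), (0.2, 0.0), (0.1, 0.3),
    (3.0, 3.1), (3.2, 2.9), (2.9, 3.3),
    (6.0, 0.2), (6.1, 0.0), (5.8, 0.1),
]

wcss_by_k = {k: kmeans(points, k)[1] for k in (1, 2, 3, 4)}
```

On this data the within-cluster sum of squares keeps falling as k goes from 1 to 4, even past the three groups actually present, so the fit criterion on its own does not indicate the number of clusters.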

Visualisation plays a key role in developing good models for data, especially when the quantity of data is large. It is an important aid in feature selection, gives information about local deviations in performance and provides a useful 'sanity check' for objective quantitative measures (such as generalisation performance). The quantity and complexity of the data mean that simple visualisation methods, such as principal component analysis (PCA), are inadequate. Instead, we propose to develop further certain advanced non-linear visualisation techniques arising from earlier work at Aston:

The Generative Topographic Map (GTM) is a latent variable model with a non-linear function mapping a (usually two-dimensional) latent space to the data space. The data is visualised by computing a summary statistic of the posterior distribution (usually the mean or mode) for each data point. This model assumes that the data lies close to a two-dimensional manifold; however, this is too simple a model for microarray data. Because GTM is a generative latent variable model, it is straightforward to train mixtures of GTMs using a suitable extension of the EM algorithm. We have developed a hierarchical mixture of GTMs and used it with success on High Throughput Screening data and in some preliminary experiments with microarray data. This scheme models the whole data set with a GTM at the top level, which is broken down into clusters at deeper levels of the hierarchy. Because the data can be visualised at each level of the hierarchy, the selection of clusters, which are used to train GTMs at the next level down, can be carried out interactively by the user.
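The posterior summary used for visualisation can be sketched as follows. This assumes the standard GTM formulation (a grid of latent points with equal prior weight, each mapped to a centre in data space, with isotropic Gaussian noise of precision beta) and takes the mapped centres as given, as if from an already trained model. The grid, centres, data point and beta value are all hypothetical.

```python
import math

def responsibilities(t, centres, beta):
    """Posterior probability of each latent grid point given data point t."""
    log_p = [-0.5 * beta * sum((a - b) ** 2 for a, b in zip(t, c)) for c in centres]
    top = max(log_p)  # subtract the maximum for numerical stability
    w = [math.exp(lp - top) for lp in log_p]
    total = sum(w)
    return [x / total for x in w]

def posterior_mean(r, latent):
    """Responsibility-weighted average of the latent grid points."""
    dim = len(latent[0])
    return tuple(sum(rk * x[d] for rk, x in zip(r, latent)) for d in range(dim))

def posterior_mode(r, latent):
    """Latent grid point with the highest responsibility."""
    return latent[max(range(len(r)), key=r.__getitem__)]

# Hypothetical 2x2 latent grid and the data-space centres it maps to.
latent = [(-1.0, -1.0), (-1.0, 1.0), (1.0, -1.0), (1.0, 1.0)]
centres = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 0.0, 1.0), (1.0, 1.0, 1.0)]

t = (0.9, 0.1, 1.0)  # one data point, e.g. a gene profile over three samples
r = responsibilities(t, centres, beta=4.0)
mean_xy = posterior_mean(r, latent)
mode_xy = posterior_mode(r, latent)
```

Plotting `mean_xy` (or `mode_xy`) for every data point gives the two-dimensional picture. The mode always sits on a grid point, whereas the mean can fall between grid points when several centres explain the data point comparably well.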

Visualisation of Streptomyces data using GTM.

Neuroscale is an extension of the classical distance-preserving visualisation methods of Sammon mapping and multi-dimensional scaling. It uses radial basis function networks with a particularly efficient training algorithm known as Shadow Targets. There are two issues to address in the application of this technique. Firstly, an appropriate distance measure must be defined. Secondly, the training times and data storage requirements for the model scale as the square of the number of data points. For very large datasets, this incurs a prohibitive computational cost.
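Both issues can be seen in a small sketch of a distance-preserving criterion. The unweighted stress below is an assumed, simplified form of the kind of objective such methods minimise. It does not implement the radial basis function network or the Shadow Targets algorithm, and the `distance` parameter is a hypothetical hook for whichever data-space measure is chosen.

```python
import math
from itertools import combinations

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def stress(data, projected, distance=euclidean):
    """Sum over all pairs of squared (data distance - map distance)."""
    total = 0.0
    for i, j in combinations(range(len(data)), 2):
        d_star = distance(data[i], data[j])            # distance in data space
        d_map = euclidean(projected[i], projected[j])  # distance in the 2-D map
        total += (d_star - d_map) ** 2
    return total

def n_pairs(n):
    """Number of pairwise distances for n data points."""
    return n * (n - 1) // 2

# Hypothetical 3-D points lying in the plane z = 0.
data = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (6.0, 8.0, 0.0)]
faithful = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]   # drops z, keeps all distances
collapsed = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)]  # loses all distances

s_faithful = stress(data, faithful)
s_collapsed = stress(data, collapsed)
```

The faithful projection has zero stress and the collapsed one does not. Since `n_pairs` grows quadratically, a tenfold increase in the number of data points means roughly a hundredfold increase in the pairwise distances to compute and store.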

Many interesting microarray experiments involve monitoring a biological system at different times (time course experiments). We shall also aim to extend our models to incorporate this temporal element, through the use of latent variable models (such as GTM through time, which combines Hidden Markov Models with GTM) and techniques such as complexity pursuit, which combines autoregression and ICA.


This page is maintained by Ian Nabney (i.t.nabney@aston.ac.uk)