Projects, past and present

Data Science :

Bayesian networks of data from a cohort study:
One of my MPhil thesis chapters inferred dynamic Bayesian networks concerning the relationship between early respiratory infections and childhood asthma pathogenesis from longitudinal cohort data.

A question of interest was whether respiratory infections led to the wheeziness of childhood asthma, given that they are correlated and a viable mechanism has been postulated. My networks indicate that it is probably the other way around, with wheezing in response to airborne allergens allowing infectious agents to penetrate deeper into the lungs. (The atopy label indicates the class of antibody involved in the allergic response.)

This network shows how the variables from each year act upon those of the following year for the first five years of a child's life. The shade of the inferred edges indicates the posterior probability of the edge correctly fitting the data, with the dashed line indicating a negative effect. 

We tried a number of variable combinations, favouring those for which edges were either very dark or very faint, and avoiding intermediate shades as far as possible. We then got a stronger signal by considering not upper vs lower respiratory infections, but upper plus/minus lower respiratory infections. This indicated that the total number of infections didn't change, but that more infections penetrated the lower respiratory tract instead of being confined to the upper respiratory tract.

The signal was further improved by considering wheezy lower infections instead of lower tract infections in general.

Biological subtypes of asthma from exclusive predictors:
A variable predictive of only a given subtype will be weakly predictive of the more general case. The corresponding Areas Under the Curve (AUCs) are related in a way determined by the fraction of cases belonging to the subtype, allowing the definition of an index which indicates whether the predictive power of the variable is exclusive to the subtype in question.

Consider a sample set whose cases may be neatly divided into two or more sub-cases a, b, ... . For concreteness, the general case could be fifth-year asthma, divided into asthma which is triggered by allergens (sub-case a) and asthma which is not (sub-case b). Suppose a given classifier, P say, scores an AUC of $auc_a$ predicting sub-case a but cannot distinguish the other cases from controls. In our example, a classifier using antibody concentrations in the blood might be partially predictive of later allergic asthma but not of asthma without allergy. The case-control pairs can then be divided into those whose case is a member of a and those whose case is not, i.e. those for which the asthma is allergy-driven and those for which it is not. The former are predicted with an AUC of $auc_a$, the latter with an AUC of 0.5, which is equivalent to random guessing. It follows that when P is used to predict the general case, it would score an AUC of

$$AUC_P = q_a \, auc_a + (1 - q_a) \times 0.5,$$

where $q_a$ is the proportion of cases belonging to sub-case a. Rearranging gives the exclusivity index (EI)

$$EI = \frac{AUC_P - q_a \, auc_a}{1 - q_a},$$

which is, in effect, the AUC implied for the cases outside sub-case a. It equals 0.5 when the predictor is truly exclusive to sub-case a.
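The index is easy to compute once the two AUCs and the sub-case fraction are known. The following is only a minimal sketch: the function names and the numbers are illustrative assumptions, not taken from my thesis code or from the cohort data.

```python
def expected_general_auc(q_a, auc_a):
    """AUC expected on the general case if the predictor is exclusive to sub-case a.

    q_a   : proportion of cases belonging to sub-case a
    auc_a : AUC of the predictor on sub-case a
    """
    return q_a * auc_a + (1.0 - q_a) * 0.5


def exclusivity_index(auc_p, auc_a, q_a):
    """EI = (AUC_P - q_a * auc_a) / (1 - q_a).

    A value near 0.5 suggests the predictor is exclusive to sub-case a.
    """
    if not 0.0 <= q_a < 1.0:
        raise ValueError("q_a must lie in [0, 1)")
    return (auc_p - q_a * auc_a) / (1.0 - q_a)


# Hypothetical numbers, for illustration only.
q_a, auc_a = 0.4, 0.8

# An exclusive predictor: general-case AUC follows the mixture formula.
auc_exclusive = expected_general_auc(q_a, auc_a)
ei_exclusive = exclusivity_index(auc_exclusive, auc_a, q_a)

# A non-exclusive predictor: it does equally well on every case.
ei_shared = exclusivity_index(0.8, auc_a, q_a)

print(f"EI (exclusive): {ei_exclusive:.3f}")
print(f"EI (shared):    {ei_shared:.3f}")
```

In practice the AUCs are estimates, so EI will only be approximately 0.5 for an exclusive predictor, and some measure of uncertainty would be needed before drawing conclusions.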

When I applied this to the data I was studying, some antibody concentrations in the blood, such as house-dust-mite, were indeed predictive of allergy-driven asthma (actually wheeze, which we used as a proxy). Others, such as antibodies for cat, peanut, couch grass and rye allergies, were not. As one might expect, house-dust-mite antibodies were exclusively predictive of asthma with airborne allergies. However, they were exclusive to cases of multiple airborne allergies and were not predictive of allergic asthma driven by house-dust-mite alone! Sadly, I did not get to drill down on this any further.

Biochemical Property Prediction:
BioPPsy is a package to predict clinically relevant properties of small molecules from those of molecules for which such properties are already known. It was developed by Brian Smith's group at La Trobe University for predicting pharmacologically relevant properties such as solubility and permeability to either skin or the blood-brain barrier. When I took over this project it only had linear models. I have since added Partial Least Squares, Neural Networks and Support Vector Regression (these last two by incorporating the Weka package). Just as importantly, this has been done in such a way that anyone can code their own algorithm and simply add it to the project. BioPPsy may be downloaded from my SourceForge repository.

Modelling :

Epidemiology of sexually transmitted diseases among gay and bisexual men:
Pre-Exposure Prophylaxis (PrEP) is a simple drug regimen which is highly effective at protecting someone from contracting HIV. It has recently been added to the Pharmaceutical Benefits Scheme, raising concerns about increased transmission of other sexually transmitted infections (STIs) due to reduced condom use. I am currently developing PrEPSTI, an agent-based model to simulate STI transmission in this population.

Studying and modelling the dynamics of the plant hormone auxin and its molecular transporter PIN:
The distribution of auxin is at the heart of plant morphogenesis, but how do plant cells "decide" where to locate the proteins responsible for pumping auxin between cells? Our approach was based on the so-called "flux-based" model, in which the expression of PIN in a cell wall is proportional to the flow of auxin passing through it. We were able to show that this single model produces both diffuse and canalised distributions, as they are observed in the floral meristem, without implementing a change in model or model parameters.
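To make the flux-based idea concrete, one common way of writing such a model is sketched here. This is an illustrative form under assumed notation, not necessarily the exact equations we used: $a_i$ is the auxin concentration in cell $i$, $P_{ij}$ the PIN level on the wall of cell $i$ facing neighbour $j$, $D$ a passive permeability, $T$ a transport efficiency, and $\alpha$, $\beta$ production and turnover rates (all symbols hypothetical).

$$J_{i \to j} = D\,(a_i - a_j) + T\,(P_{ij}\,a_i - P_{ji}\,a_j), \qquad \frac{dP_{ij}}{dt} = \alpha \max(J_{i \to j},\, 0) - \beta\,P_{ij}$$

At steady state this gives $P_{ij} = (\alpha/\beta)\,\max(J_{i \to j}, 0)$, i.e. PIN on a wall proportional to the net auxin flux leaving through it.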