Research on common human disease has evolved into a complex, interdisciplinary pursuit. Modern epidemiological challenges such as the characterization of complex systems, the management of ‘big data’, and the integration of data for systems biology epitomize this trend. The early stages of biomedical research typically focus on connecting predictive factors, whether genetic, epigenetic, or environmental, to increased or decreased susceptibility to common disease. This attempt to detect patterns of association is likely complicated by non-linear phenomena such as complex gene-gene interactions, gene-environment interactions, genetic heterogeneity, and phenocopy. My primary research interests focus on the development, evaluation, and application of novel computational, statistical, and visualization methods to facilitate classification and data mining in the complex, noisy domain of biomedical research.
My thesis research focused on the adaptation of a learning classifier system (LCS) algorithm to the task of detecting, modeling, and characterizing epistatic and heterogeneous associations within single nucleotide polymorphism (SNP) association studies. The development and application of LCS algorithms has since become a particular area of specialization. My postdoctoral work expanded upon this LCS groundwork, leading to the development of ExSTraCS, an Extended Supervised Tracking and Classifying System. This work epitomizes my interest in (1) developing strategies that limit the number of assumptions made about the data and instead allow the data to speak for itself when detecting complex or heterogeneous patterns, (2) allowing for the integration of data types by offering an algorithmic framework that functions for all combinations of discrete and continuous attributes and endpoints, and (3) promoting a user-friendly, interpretable environment for knowledge discovery. My work with LCS algorithms has also led me to pursue visual and statistical strategies with which to guide and facilitate knowledge discovery. My interests have also branched into the theory and practice of complex disease model and data simulation, which led to the development of the open-source GAMETES software package. In addition, my interest in tackling issues related to ‘big data’ has motivated me to explore and expand attribute filtering approaches (ReliefF, SURF, SURFStar, MultiSURF) for computational and algorithmic flexibility and efficiency. These algorithms offer critical preprocessing steps for feature selection and for the generation and application of statistical, objective, and unbiased expert knowledge to more efficiently guide stochastic algorithm learning.
In summary, my research interests lie at the intersection of genetics, genomics, biostatistics, epidemiology, machine learning, and computer science. I have adopted a quantitative biomedical research strategy that embraces, rather than ignores, the complexity of the relationship between predictive factors and disease endpoints.
Recent Highlights and Research in Progress:
- I am currently exploring the key problem of how best to determine rule fitness in a given problem domain without prior knowledge of problem noisiness or complexity (i.e., the number of features involved and the nature of their relationship with the endpoint). Below are some figures related to our most recent publication on Pareto-inspired multi-objective rule fitness.
- To help the algorithm adapt to problems of classification or regression, we are currently exploring the integration of Genetic Programming (GP) trees and Rule-Based Machine Learning. The premise of this work is that an evolutionary system can adapt to the problem space. When the underlying problem can be described by a single function or model, as may be the case with simpler regression problems, GP will dominate the fitness space. In complex, heavily niched, or heterogeneous domains, rules will dominate instead, capturing the underlying pattern in the piece-wise manner characteristic of rule-based machine learning.
- We are also developing a new software package called REBATE that seeks to bring together and computationally optimize accessible code for a family of feature selection/feature weighting algorithms referred to as Relief-based algorithms. These include the relatively well-known ReliefF algorithm and enhanced versions developed in the Moore lab called SURF, SURF*, and MultiSURF. These algorithms are important because they have been shown to be sensitive to both epistatic interactions and heterogeneity effects, allowing the feature space to be reduced for machine learning based on more than simple linear main effects.
- We are also working to make an extended version of our GAMETES complex data simulation software available with a number of new features, including the ability to generate heterogeneous datasets and datasets with hierarchical additive effects, to build custom 2- and 3-locus models, and to simulate data with quantitative traits rather than only discrete binary class endpoints.
- A project that we have been piecing together for a number of years is the development of a user-friendly and intuitive GUI to pair with the ExSTraCS algorithm. It will take advantage of the new UPenn visualization lab, named the “Idea Factory”, spearheaded by Dr. Jason H. Moore.
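
To illustrate why Relief-based algorithms can pick up interactions that main-effect filters miss, the following is a minimal sketch of a simplified Relief-style weighting scheme on a toy epistatic (XOR) dataset. It is not the REBATE code, nor ReliefF, SURF, SURF*, or MultiSURF as published: the function name `relief_weights`, the toy data, and the choice to average over all neighbors tied at the minimum Hamming distance are assumptions made purely for illustration.

```python
from itertools import product


def relief_weights(X, y):
    """Score features with a simplified Relief-style update (illustrative only).

    For every instance, the nearest same-class neighbors (hits) and the
    nearest other-class neighbors (misses) are found by Hamming distance.
    Neighbors tied at the minimum distance are averaged, which is a
    simplification chosen here to keep the example deterministic.
    """
    n, n_features = len(X), len(X[0])
    weights = [0.0] * n_features

    def hamming(a, b):
        return sum(u != v for u, v in zip(a, b))

    def nearest(i, same_class):
        # Candidate neighbors with the requested class relation, excluding i.
        pool = [j for j in range(n) if j != i and (y[j] == y[i]) == same_class]
        best = min(hamming(X[i], X[j]) for j in pool)
        return [j for j in pool if hamming(X[i], X[j]) == best]

    for i in range(n):
        hits, misses = nearest(i, True), nearest(i, False)
        for a in range(n_features):
            # Fraction of nearest hits / misses that differ on feature a.
            hit_diff = sum(X[i][a] != X[j][a] for j in hits) / len(hits)
            miss_diff = sum(X[i][a] != X[j][a] for j in misses) / len(misses)
            # Reward features that separate classes, penalize those that
            # differ between same-class neighbors.
            weights[a] += (miss_diff - hit_diff) / n
    return weights


# Hypothetical toy data: every combination of four binary features.
# The class is the XOR of features 0 and 1; features 2 and 3 are irrelevant.
X = [list(row) for row in product([0, 1], repeat=4)]
y = [row[0] ^ row[1] for row in X]

weights = relief_weights(X, y)
print(weights)
```

In this toy dataset neither feature 0 nor feature 1 is associated with the class on its own, so a purely univariate filter would have no basis for keeping them. The neighbor-based update still assigns both of them positive weights and assigns the two irrelevant features negative weights, because the comparison is made between instances that are similar across the whole feature vector.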