Peer Reviewed Abstracts/Posters

Interpretable modeling of epistatic and heterogeneous associations in epidemiology with rule-based machine learning

  • Ryan J. Urbanowicz, Ben Yang, and Jason H. Moore (August 11-12, 2016) New York City, New York
  • Memorial Sloan Kettering Cancer Center Symposium on Statistical and Computational Methods for Pharmacogenetic Epidemiology of Cancer

Abstract: One important aspect of epidemiological research is the advancement of machine learning strategies that can detect complex patterns of association between genetic or environmental variants and common disease risk. In particular, methods that can detect, model, and characterize epistatic interactions and heterogeneous patterns of association can offer new insights in contrast with traditional methods that rely on simplifying assumptions. Data mining methods that can handle heterogeneity are almost non-existent; however, rule-based machine learning algorithms have been demonstrated to possess this ability. Over the last few years, we have been developing a rule-based machine learning algorithm called ExSTraCS, an Extended Supervised Tracking and Classifying System. This adaptive algorithm evolves a population of human-interpretable ‘IF:THEN’ rules that can break complex and heterogeneous problems into accessible pieces. In previous publications we have adapted rule-based machine learning to the needs (i.e., emphasis on both model prediction accuracy and interpretability) and characteristics of bioinformatic and epidemiological problems (i.e., noisy, often high-dimensional problems that may combine data types, include missing values, and have balanced or imbalanced, binary, multi-class, or quantitative traits as phenotypes).
Notably, we introduced (1) statistical and visualization-guided approaches to facilitate knowledge discovery, (2) a strategy to guide the rule-based evolutionary search with computationally derived expert knowledge, (3) an attribute tracking heuristic, akin to a long-term memory and ant colony optimization, which can be used to directly characterize heterogeneous patterns and candidate patient subgroups, (4) strategies to handle missing or imbalanced data as well as multi-class phenotypes, (5) strategies to improve scalability for handling a greater number of potentially predictive attributes, (6) an expansion to learn on data with quantitative traits as phenotypes, (7) an underlying multi-objective rule-fitness to improve performance and interpretability in noisy problem domains, and most recently (8) a Pareto-front-inspired multi-objective rule-fitness to further the goal of assumption-free, data-driven machine learning. Over the course of developing ExSTraCS, we demonstrated its efficacy and interpretability in an expansive simulation study as well as in an epidemiological investigation of bladder cancer that revealed evidence of heterogeneous phenocopy and epistasis concurrently, along with candidate patient subgroups identified through the application of attribute tracking. Additionally, ExSTraCS was the first, and is currently the only, algorithm to report solving the highly epistatic and heterogeneous 135-bit multiplexer data mining benchmark problem directly. Solving this benchmark requires the algorithm to find 128 heterogeneous 8-way epistatic interactions involving 135 predictive features, with no single-locus effects. Our most recent work seeks to extend the Pareto-inspired multi-objective fitness strategy with an intelligent rule compaction approach such that the output rule-set model may be optimized to the level of noise in the dataset without prior knowledge.
Furthermore, we are integrating the evolution of genetic programming trees in parallel with the rules of the ExSTraCS algorithm in order to improve performance and interpretability in analyses involving quantitative traits. This collective success, combined with the preliminary results of these new developments, demonstrates ExSTraCS to be a powerful open-source tool for exploring complexity, heterogeneity, and patient subgroups in cancer epidemiology.
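
The multiplexer benchmark mentioned in this abstract can be illustrated with a short sketch. It uses the standard definition of a multiplexer problem (k address bits select one of 2^k data bits as the class); the function name `multiplexer` and the 6-bit example are illustrative choices and are not taken from the ExSTraCS software. With 7 address bits the problem has 7 + 128 = 135 features, and each of the 128 address values defines its own 8-way interaction (7 address bits plus 1 data bit), which is consistent with the counts given in the abstract.

```python
def multiplexer(bits, address_bits):
    """Return the class of a multiplexer instance.

    The first `address_bits` values form a binary address that selects
    one of the remaining 2**address_bits data bits; that data bit is the class.
    """
    expected_len = address_bits + 2 ** address_bits
    if len(bits) != expected_len:
        raise ValueError(f"expected {expected_len} bits, got {len(bits)}")
    # Interpret the address bits as a binary number (most significant bit first).
    address = int("".join(str(b) for b in bits[:address_bits]), 2)
    return bits[address_bits + address]


# Small 6-bit case: 2 address bits select one of 4 data bits.
# Address '10' is 2, so the class is the data bit at index 2 + 2 = 4.
example = [1, 0, 0, 0, 1, 1]
print(multiplexer(example, address_bits=2))

# Sizes for the 135-bit problem.
address_bits = 7
n_features = address_bits + 2 ** address_bits   # 135 features in total
n_interactions = 2 ** address_bits              # 128 distinct address values
features_per_rule = address_bits + 1            # 8 features relevant per address
print(n_features, n_interactions, features_per_rule)
```

Because a different data bit determines the class for each address, no single feature is predictive on its own, which is why the problem serves as a combined test of epistasis and heterogeneity.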

GAMETES 2.0: expanding the complex model and data simulation software to generate heterogeneous datasets, custom models, and quantitative traits

  • Ryan J. Urbanowicz, Peter Andrews, and Jason H. Moore (October 4-6, 2015) Baltimore, Maryland
  • American Society of Human Genetics Conference

Abstract: Increasing acknowledgement of the complexity of common diseases, particularly with regard to complex multivariate patterns of association, has led to an increased interest in the development of new analytical methodologies, algorithms, and software able to detect, model, and characterize such patterns. In order to properly develop, test, and evaluate such methodologies, a variety of representative simulated datasets for simulation studies are required. Previously we developed the GAMETES software for the rapid, deterministic generation of strict, purely epistatic single nucleotide polymorphism (SNP) models based on user-defined parameters such as heritability, minor allele frequencies, prevalence, and the order of interaction (e.g., 2-way or 3-way). GAMETES 2.0 expands the capabilities of this software, allowing users to (1) combine multiple genetic models for the simulation of datasets with patterns of genetic heterogeneity, (2) generate custom 2-way or 3-way SNP models with a report of the model’s characteristics and predicted relative detection difficulty, and (3) generate datasets with a quantitative trait/endpoint as opposed to a binary discrete class endpoint. Quantitative endpoints are generated from GAMETES genetic models by using the penetrance values of specific genotype combinations as a centroid for selecting a continuous trait value for each subject in the dataset. We test this new simulation software by generating simulation studies with heterogeneity and quantitative traits, respectively, and demonstrate that we can identify these simulated patterns using advanced machine learning approaches (i.e., QMDR and ExSTraCS) and feature selection approaches (ReliefF, SURF, SURF*, and MultiSURF).
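
The centroid idea for quantitative endpoints can be sketched as follows. This is a minimal illustration and not the GAMETES implementation: the abstract states only that the penetrance of a genotype combination serves as a centroid, so the normal distribution, the standard deviation `sd`, the penetrance values in `PENETRANCE`, and the function name `simulate_quantitative_trait` are all assumptions made here.

```python
import random

# Hypothetical 2-locus penetrance table (illustrative values only).
# Keys are (genotype at SNP 1, genotype at SNP 2), each coded 0, 1, or 2.
PENETRANCE = {
    (0, 0): 0.1, (0, 1): 0.9, (0, 2): 0.1,
    (1, 0): 0.9, (1, 1): 0.1, (1, 2): 0.9,
    (2, 0): 0.1, (2, 1): 0.9, (2, 2): 0.1,
}


def simulate_quantitative_trait(genotypes, penetrance, sd=0.2, rng=None):
    """Draw a continuous trait value centered on the genotype's penetrance.

    Assumption: values are normally distributed around the centroid with
    standard deviation `sd`; the source does not specify the distribution.
    """
    rng = rng or random.Random()
    centroid = penetrance[genotypes]
    return rng.gauss(centroid, sd)


rng = random.Random(42)  # fixed seed so the sketch is reproducible
subjects = [(0, 1), (1, 1), (2, 1)]
for genotypes in subjects:
    value = simulate_quantitative_trait(genotypes, PENETRANCE, rng=rng)
    print(genotypes, round(value, 3))
```

Under these assumptions, a smaller `sd` keeps trait values close to their genotype centroids and so gives a cleaner signal, while a larger `sd` blurs the genotype groups together.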

A new ‘front’ in rule-based data mining for complex, heterogeneous and noisy association analyses

  • Ryan J. Urbanowicz and Jason H. Moore (October 6-10, 2015) Baltimore, Maryland
  • American Society of Human Genetics Conference

Abstract: Biological and statistical phenomena such as epistasis, genetic heterogeneity, and phenocopy can mask the relationship between genotypic, epigenetic, and environmental risk factors/markers and phenotypes of interest. It has often been suggested that such phenomena may account for a substantial portion of so-called ‘missing heritability’ across genetic association studies of common human heritable diseases. The modern data mining toolkit for association analyses mostly includes approaches that labor under restrictive assumptions, such as a limit on the number of predictive variables, the application of a specific genetic model, linearity, or homogeneity, in order to function quickly, effectively, and reliably. Previously, we developed a rule-based machine learning algorithm called ExSTraCS, an Extended Supervised Tracking and Classifying System, for assumption-free classification, prediction, and knowledge discovery, designed to be particularly advantageous in detecting, modeling, and characterizing complex, noisy, multivariate, epistatic, and heterogeneous patterns of association. The key to this flexibility is that ExSTraCS, like other Learning Classifier System algorithms, learns piece-wise effective generalizations of the problem space. In other words, human-interpretable rules are evolved to individually capture subspaces of the overall pattern and are collectively applied to form the predictive ‘model’. One major challenge is to compare and rank the ‘value’ of these evolved rules in a way that emphasizes both the accuracy and the correct coverage of the dataset in order to reduce overfitting and promote solution interpretability in noisy or heterogeneous problems. In the present study we introduce a Pareto-front-inspired methodology for the calculation of rule-fitness within ExSTraCS that provides a reliable, multi-objective global value metric.
Specifically, as rules emerge and are evaluated, a non-dominated front of points defined by the accuracy and correct coverage of a rule is updated and applied to calculate the fitness of every existing rule as a function of the distance from the non-dominated front and the relative area under the front. We find that this methodology significantly improves performance and interpretability, and allows for dramatic and simple rule compaction, across a spectrum of complex, noisy simulation studies concurrently modeling epistatic and heterogeneous patterns with assorted heritabilities and sample sizes.
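
The non-dominated front described in this abstract can be sketched in a few lines. This is an illustration under stated assumptions, not the ExSTraCS fitness calculation: each rule is reduced to an (accuracy, coverage) pair with coverage normalized to [0, 1], and `toy_fitness` is a simplified stand-in that uses only the distance to the nearest front point and omits the relative-area term mentioned above. The names `non_dominated_front` and `toy_fitness` and the example values are hypothetical.

```python
import math


def non_dominated_front(points):
    """Return the points that no other point dominates.

    A point q dominates p if q is at least as good as p on both
    objectives (higher is better) and q differs from p.
    """
    front = []
    for p in points:
        dominated = any(
            q[0] >= p[0] and q[1] >= p[1] and q != p
            for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(set(front))


def toy_fitness(rule, front):
    """Simplified stand-in: fitness falls with distance to the nearest front point."""
    nearest = min(math.dist(rule, f) for f in front)
    return max(0.0, 1.0 - nearest)


# Illustrative (accuracy, normalized coverage) pairs for five rules.
rules = [(0.9, 0.2), (0.7, 0.6), (0.5, 0.9), (0.6, 0.5), (0.4, 0.3)]
front = non_dominated_front(rules)
print(front)
for rule in rules:
    print(rule, round(toy_fitness(rule, front), 3))
```

In this toy version, rules on the front receive the maximum fitness of 1.0 and dominated rules receive less, which mirrors the idea of ranking rules on accuracy and correct coverage jointly instead of on either objective alone.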

ExSTraCS 2.0: a scalable Michigan-style classifier system for detecting, modeling, and characterizing heterogeneity and epistasis

  • Ryan J. Urbanowicz and Jason H. Moore (January 3-7, 2010) Big Island, Hawaii
  • Pacific Symposium on Biocomputing

A flexible learning classifier system for classification and data mining in genetic epidemiology

  • Ryan J. Urbanowicz and Jason H. Moore (October 18-20, 2012) Stevenson, Washington
  • International Genetic Epidemiology Society Meeting

Genetic heterogeneity detection using a learning classifier system

  • Ryan J. Urbanowicz and Jason H. Moore (October 11-15, 2011) Montreal, Canada
  • American Society of Human Genetics Conference

A fast, direct algorithm for generating pure, strict, epistatic models with random architectures

  • Ryan J. Urbanowicz, Jeff Kiralis, Jonathan Fisher, Nicholas Sinnott-Armstrong, Tamra Heberling, and Jason H. Moore (October 11-15, 2011) Montreal, Canada
  • American Society of Human Genetics Conference

Modeling disease in the presence of genetic heterogeneity and epistasis: a learning classifier system approach

  • Ryan J. Urbanowicz and Jason H. Moore (January 4-8, 2010) Big Island, Hawaii
  • Pacific Symposium on Biocomputing

A learning classifier system approach to detecting and modeling genetic heterogeneity in the presence of epistasis

  • Ryan J. Urbanowicz and Jason H. Moore (July 8-12, 2009) Montreal, Canada
  • Genetic and Evolutionary Computation Conference (GECCO’09)

Genetics of cognitive decline post cancer chemotherapy: DNA repair genes

  • Tim Ahles, Andrew Saykin, Brenna McDonald, Harker Rhodes, Jason Moore, Ryan J. Urbanowicz, Gregory Tsongalis, and Tor Tosteson (November 7-9, 2008) Washington, D.C.
  • National Cancer Institute Translational Science Meeting

Data-driven constructive induction MDR for genetic heterogeneity

  • Ryan J. Urbanowicz, Delany Granizo-MacKenzie, and Jason H. Moore (November 11-15, 2008) Philadelphia, Pennsylvania
  • American Society of Human Genetics Annual Meeting

Mask Functions for the Symbolic Modeling of Epistasis

  • Ryan J. Urbanowicz, Bill White, Nate Barney, and Jason H. Moore (January 3-7, 2007) Maui, Hawaii
  • Pacific Symposium on Biocomputing