Journal Publications


ExSTraCS 2.0: description and evaluation of a scalable learning classifier system

  • Ryan J. Urbanowicz and Jason H. Moore (2015)
  • Evolutionary Intelligence. Springer.
    BibTeX
    @article{urbanowicz2015exstracs,
    title={ExSTraCS 2.0: description and evaluation of a scalable learning classifier system},
    author={Urbanowicz, Ryan J and Moore, Jason H},
    journal={Evolutionary intelligence},
    volume={8},
    number={2-3},
    pages={89--116},
    year={2015},
    publisher={Springer}
    }

Abstract: Algorithmic scalability is a major concern for any machine learning strategy in this age of ‘big data’. A large number of potentially predictive attributes is emblematic of problems in bioinformatics, genetic epidemiology, and many other fields. Previously, ExSTraCS was introduced as an extended Michigan-style supervised learning classifier system that combined a set of powerful heuristics to successfully tackle the challenges of classification, prediction, and knowledge discovery in complex, noisy, and heterogeneous problem domains. While Michigan-style learning classifier systems are powerful and flexible learners, they are not considered to be particularly scalable. For the first time, this paper presents a complete description of the ExSTraCS algorithm and introduces an effective strategy to dramatically improve learning classifier system scalability. ExSTraCS 2.0 addresses scalability with (1) a rule specificity limit, (2) new approaches to expert knowledge guided covering and mutation mechanisms, and (3) the implementation and utilization of the TuRF algorithm for improving the quality of expert knowledge discovery in larger datasets. Performance over a complex spectrum of simulated genetic datasets demonstrated that these new mechanisms dramatically improve nearly every performance metric on datasets with 20 attributes and made it possible for ExSTraCS to reliably scale up to perform on related 200 and 2000-attribute datasets. ExSTraCS 2.0 was also able to reliably solve the 6, 11, 20, 37, 70, and 135 multiplexer problems, and did so in similar or fewer learning iterations than previously reported, with smaller finite training sets, and without using building blocks discovered from simpler multiplexer problems. Furthermore, ExSTraCS usability was made simpler through the elimination of previously critical run parameters.
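The rule specificity limit can be illustrated with a minimal covering sketch (hypothetical function and names, not the ExSTraCS code): when covering creates a rule to match a training instance, the number of attributes the rule specifies is capped, and every remaining attribute is left generalized. In ExSTraCS the choice of which attributes to specify is guided by expert knowledge weights; here it is uniform random for brevity.

```python
import random

def cover_with_specificity_limit(instance, rule_spec_limit, rng=random):
    """Hypothetical covering sketch: build a rule condition matching
    `instance` while specifying at most `rule_spec_limit` attributes;
    every other attribute stays generalized ('#' wildcard)."""
    n = len(instance)
    k = rng.randint(1, min(rule_spec_limit, n))   # how many attributes to specify
    specified = set(rng.sample(range(n), k))      # which positions to keep
    return [instance[i] if i in specified else '#' for i in range(n)]

# cover a 5-attribute instance with a specificity limit of 2
rule = cover_with_specificity_limit([0, 1, 1, 0, 2], rule_spec_limit=2)
```

Because no rule may specify more attributes than the limit, the rule search space stops growing exponentially with the total attribute count, which is the intuition behind the scalability gain described above.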


A classification and characterization of two-locus, pure, strict, epistatic models for simulation and detection

  • Ryan J. Urbanowicz, Ambrose Granizo-Mackenzie, Jeff Kiralis, and Jason H. Moore (2014)
  • BioData Mining. BioMed Central Ltd.
    BibTeX
    @article{urbanowicz2014classification,
    title={A classification and characterization of two-locus, pure, strict, epistatic models for simulation and detection},
    author={Urbanowicz, Ryan J and Granizo-Mackenzie, Ambrose LS and Kiralis, Jeff and Moore, Jason H},
    journal={BioData mining},
    volume={7},
    number={1},
    pages={1},
    year={2014},
    publisher={BioMed Central}
    }

Background: The statistical genetics phenomenon of epistasis is widely acknowledged to confound disease etiology. In order to evaluate strategies for detecting these complex multi-locus disease associations, simulation studies are required. The development of the GAMETES software for the generation of complex genetic models has provided the means to randomly generate an architecturally diverse population of epistatic models that are both pure and strict, i.e. all n loci, but no fewer, are predictive of phenotype. Previous theoretical work characterizing complex genetic models has yet to examine pure, strict epistasis, which should be the most challenging to detect. This study addresses three goals: (1) Classify and characterize pure, strict, two-locus epistatic models, (2) Investigate the effect of model ‘architecture’ on detection difficulty, and (3) Explore how adjusting GAMETES constraints influences diversity in the generated models.

Results: In this study we utilized a geometric approach to classify pure, strict, two-locus epistatic models by “shape”. In total, 33 unique shape symmetry classes were identified. Using a detection difficulty metric, we found that model shape was consistently a significant predictor of model detection difficulty. Additionally, after categorizing shape classes by the number of edges in their shape projections, we found that this edge number was also significantly predictive of detection difficulty. Analysis of constraints within GAMETES indicated that increasing model population size can expand model class coverage but does little to change the range of observed difficulty metric scores. A variable population prevalence significantly increased the range of observed difficulty metric scores and, for certain constraints, also improved model class coverage.

Conclusions: These analyses further our theoretical understanding of epistatic relationships and uncover guidelines for the effective generation of complex models using GAMETES. Specifically, (1) we have characterized 33 shape classes by edge number, detection difficulty, and observed frequency, (2) our results support the claim that model architecture directly influences detection difficulty, and (3) we found that GAMETES will generate a maximally diverse set of models with a variable population prevalence and a larger model population size. However, a model population size as small as 1,000 is likely to be sufficient.


A multi-core parallelization strategy for statistical significance testing in learning classifier systems

  • James Rudd, Jason H. Moore, and Ryan J. Urbanowicz (2013)
  • Evolutionary Intelligence. Springer.
    BibTeX
    @article{rudd2013multi,
    title={A multi-core parallelization strategy for statistical significance testing in learning classifier systems},
    author={Rudd, James and Moore, Jason H and Urbanowicz, Ryan J},
    journal={Evolutionary intelligence},
    volume={6},
    number={2},
    pages={127--134},
    year={2013},
    publisher={Springer}
    }

Abstract: Permutation-based statistics for evaluating the significance of class prediction, predictive attributes, and patterns of association have only appeared within the learning classifier system (LCS) literature since 2012. While still not widely utilized by the LCS research community, formal evaluations of test statistic confidence are imperative to large and complex real-world applications such as genetic epidemiology, where it is standard practice to quantify the likelihood that a seemingly meaningful statistic could have been obtained purely by chance. LCS algorithms are relatively computationally expensive on their own. The compounding requirements for generating permutation-based statistics may be a limiting factor for some researchers interested in applying LCS algorithms to real-world problems. Technology has made LCS parallelization strategies more accessible and thus more popular in recent years. In the present study we examine the benefits of externally parallelizing a series of independent LCS runs such that permutation testing with cross validation becomes more feasible to complete on a single multi-core workstation. We test our Python implementation of this strategy in the context of a simulated complex genetic epidemiological data mining problem. Our evaluations indicate that as long as the number of concurrent processes does not exceed the number of CPU cores, the speedup achieved is approximately linear.


The role of genetic heterogeneity and epistasis in bladder cancer susceptibility and outcome: a learning classifier system approach

  • Ryan J. Urbanowicz, Angeline Andrew, Margaret Karagas, and Jason H. Moore (2013)
  • Journal of the American Medical Informatics Association (JAMIA)
    BibTeX
    @article{urbanowicz2013role,
    title={Role of genetic heterogeneity and epistasis in bladder cancer susceptibility and outcome: a learning classifier system approach},
    author={Urbanowicz, Ryan John and Andrew, Angeline S and Karagas, Margaret Rita and Moore, Jason H},
    journal={Journal of the American Medical Informatics Association},
    volume={20},
    number={4},
    pages={603--612},
    year={2013},
    publisher={Oxford University Press}
    }

Background and Objective: Detecting complex patterns of association between genetic or environmental risk factors and disease risk has become an important target for epidemiological research. In particular, strategies that can capture multifactor interactions or heterogeneous patterns of association can offer new insights into association studies for which traditional analytic tools have had limited success.

Materials and Methods: To concurrently examine these phenomena, previous work has successfully considered the application of learning classifier systems (LCSs), a flexible class of evolutionary algorithms that distributes learned associations over a population of rules. Subsequent work dealt with the inherent problems of knowledge discovery and interpretation within these algorithms, allowing for the characterization of heterogeneous patterns of association. Whereas these previous advancements were evaluated using complex simulation studies, this study applied these collective works to a ‘real-world’ genetic epidemiology study of bladder cancer susceptibility.

Results and Discussion: We replicated the identification of previously characterized factors that modify bladder cancer risk—namely, single nucleotide polymorphisms from a DNA repair gene, and smoking. Furthermore, we identified potentially heterogeneous groups of subjects characterized by distinct patterns of association. Cox proportional hazard models comparing clinical outcome variables between the cases of the two largest groups yielded a significant, meaningful difference in survival time in years (survivorship). A marginally significant difference in recurrence time was also noted. These results support the hypothesis that an LCS approach can offer greater insight into complex patterns of association.

Conclusions: This methodology appears to be well suited to the dissection of disease heterogeneity, a key component in the advancement of personalized medicine.


Predicting the difficulty of pure, strict, epistatic models: metrics for simulated model selection

  • Ryan J. Urbanowicz, Jeff Kiralis, Jonathan Fisher, and Jason H. Moore (2012)
  • BioData Mining. BioMed Central Ltd.
    BibTeX
    @article{urbanowicz2012predicting,
    title={Predicting the difficulty of pure, strict, epistatic models: metrics for simulated model selection},
    author={Urbanowicz, Ryan J and Kiralis, Jeff and Fisher, Jonathan M and Moore, Jason H},
    journal={BioData mining},
    volume={5},
    number={1},
    pages={1},
    year={2012},
    publisher={BioMed Central}
    }

Background: Algorithms designed to detect complex genetic disease associations are initially evaluated using simulated datasets. Typical evaluations vary constraints that influence the correct detection of underlying models (i.e. number of loci, heritability, and minor allele frequency). Such studies neglect to account for model architecture (i.e. the unique specification and arrangement of penetrance values comprising the genetic model), which alone can influence the detectability of a model. In order to design a simulation study which efficiently takes architecture into account, a reliable metric is needed for model selection.

Results: We evaluate three metrics as predictors of relative model detection difficulty derived from previous works: (1) Penetrance table variance (PTV), (2) customized odds ratio (COR), and (3) our own Ease of Detection Measure (EDM), calculated from the penetrance values and respective genotype frequencies of each simulated genetic model. We evaluate the reliability of these metrics across three very different data search algorithms, each with the capacity to detect epistatic interactions. We find that a model’s EDM and COR are each stronger predictors of model detection success than heritability.
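As a rough illustration of a difficulty metric "calculated from the penetrance values and respective genotype frequencies", the sketch below computes a genotype-frequency-weighted variance of a two-locus penetrance table. This is one plausible PTV-style formulation, not the paper's exact definition of PTV, COR, or EDM, and the example table is hypothetical.

```python
def genotype_freqs(maf):
    """Hardy-Weinberg genotype frequencies (AA, Aa, aa) for one biallelic SNP."""
    p, q = 1.0 - maf, maf
    return [p * p, 2.0 * p * q, q * q]

def weighted_penetrance_variance(table, maf1, maf2):
    """Genotype-frequency-weighted variance of a 3x3 two-locus penetrance
    table -- one plausible PTV-style difficulty metric (see the paper for
    the exact PTV, COR, and EDM definitions)."""
    f1, f2 = genotype_freqs(maf1), genotype_freqs(maf2)
    mean = sum(f1[i] * f2[j] * table[i][j]          # = population prevalence
               for i in range(3) for j in range(3))
    return sum(f1[i] * f2[j] * (table[i][j] - mean) ** 2
               for i in range(3) for j in range(3))

# XOR-like table: purely epistatic at MAF 0.5 (no single-locus marginal effect)
xor_table = [[0.0, 0.1, 0.0],
             [0.1, 0.0, 0.1],
             [0.0, 0.1, 0.0]]
ptv = weighted_penetrance_variance(xor_table, 0.5, 0.5)  # -> 0.0025
```

Intuitively, the further the penetrance values spread from the population prevalence, the easier it is for a search algorithm to distinguish genotype classes, so larger values of such a variance correspond to easier models.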

Conclusions: This study formally identifies and evaluates metrics which quantify model detection difficulty. We utilize these metrics to intelligently select models from a population of potential architectures. This allows for an improved simulation study design which accounts for differences in detection difficulty attributed to model architecture. We implement the calculation and utilization of EDM and COR into GAMETES, an algorithm which rapidly and precisely generates pure, strict, n-locus epistatic models.


GAMETES: a fast, direct algorithm for generating pure, strict, epistatic models with random architectures

  • Ryan J. Urbanowicz, Jeff Kiralis, Nicholas Sinnott-Armstrong, Tamra Heberling, Jonathan Fisher, and Jason H. Moore (2012)
  • BioData Mining. BioMed Central Ltd.
    BibTeX
    @article{urbanowicz2012gametes,
    title={GAMETES: a fast, direct algorithm for generating pure, strict, epistatic models with random architectures},
    author={Urbanowicz, Ryan J and Kiralis, Jeff and Sinnott-Armstrong, Nicholas A and Heberling, Tamra and Fisher, Jonathan M and Moore, Jason H},
    journal={BioData mining},
    volume={5},
    number={1},
    pages={1},
    year={2012},
    publisher={BioMed Central}
    }

Background: Geneticists who look beyond single locus disease associations require additional strategies for the detection of complex multi-locus effects. Epistasis, a multi-locus masking effect, presents a particular challenge, and has been the target of bioinformatic development. Thorough evaluation of new algorithms calls for simulation studies in which known disease models are sought. To date, the best methods for generating simulated multi-locus epistatic models rely on genetic algorithms. However, such methods are computationally expensive, difficult to adapt to multiple objectives, and unlikely to yield models with a precise form of epistasis which we refer to as pure and strict. Purely and strictly epistatic models constitute the worst-case in terms of detecting disease associations, since such associations may only be observed if all n-loci are included in the disease model. This makes them an attractive gold standard for simulation studies considering complex multi-locus effects.

Results: We introduce GAMETES, a user-friendly software package and algorithm which generates complex biallelic single nucleotide polymorphism (SNP) disease models for simulation studies. GAMETES rapidly and precisely generates random, pure, strict n-locus models with specified genetic constraints. These constraints include heritability, minor allele frequencies of the SNPs, and population prevalence. GAMETES also includes a simple dataset simulation strategy which may be utilized to rapidly generate an archive of simulated datasets for given genetic models. We highlight the utility and limitations of GAMETES with an example simulation study using MDR, an algorithm designed to detect epistasis.
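A penetrance-based dataset simulation of the kind described can be sketched as follows (an illustrative simplification, not the GAMETES implementation; the penetrance values are hypothetical): sample each model SNP under Hardy-Weinberg equilibrium at its minor allele frequency, then draw case/control status from the penetrance of the resulting two-locus genotype.

```python
import random

def simulate_dataset(table, mafs, n_samples, seed=0):
    """Illustrative penetrance-based simulation (not the GAMETES code):
    sample the two model SNPs under Hardy-Weinberg equilibrium, then draw
    case/control status from the penetrance of the two-locus genotype."""
    rng = random.Random(seed)

    def genotype(maf):
        # minor-allele count (0, 1, or 2) from two independent allele draws
        return int(rng.random() < maf) + int(rng.random() < maf)

    data = []
    for _ in range(n_samples):
        g1, g2 = genotype(mafs[0]), genotype(mafs[1])
        status = int(rng.random() < table[g1][g2])  # 1 = case, 0 = control
        data.append((g1, g2, status))
    return data

# XOR-like pure, strict two-locus model (hypothetical penetrance values)
xor_table = [[0.0, 0.1, 0.0],
             [0.1, 0.0, 0.1],
             [0.0, 0.1, 0.0]]
data = simulate_dataset(xor_table, mafs=(0.5, 0.5), n_samples=200)
```

A real simulation study would additionally append noise SNPs with no bearing on status and balance cases against controls, but the core penetrance-sampling step is the one shown.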

Conclusions: GAMETES is a fast, flexible, and precise tool for generating complex n-locus models with random architectures. While GAMETES has a limited ability to generate models with higher heritabilities, it is proficient at generating the lower heritability models typically used in simulation studies evaluating new algorithms. In addition, the GAMETES modeling strategy may be flexibly combined with any dataset simulation strategy. Beyond dataset simulation, GAMETES could be employed to pursue theoretical characterization of genetic models and epistasis.


An analysis pipeline with statistical and visualization-guided knowledge discovery for Michigan-style learning classifier systems

  • Ryan J. Urbanowicz, Ambrose Granizo-Mackenzie, and Jason H. Moore (2012)
  • IEEE Computational Intelligence Magazine
    BibTeX
    @article{urbanowicz2012analysis,
    title={An analysis pipeline with statistical and visualization-guided knowledge discovery for michigan-style learning classifier systems},
    author={Urbanowicz, Ryan J and Granizo-Mackenzie, Ambrose and Moore, Jason H},
    journal={IEEE computational intelligence magazine},
    volume={7},
    number={4},
    pages={35--45},
    year={2012},
    publisher={IEEE}
    }

Abstract: Michigan-style learning classifier systems (M-LCSs) represent an adaptive and powerful class of evolutionary algorithms which distribute the learned solution over a sizable population of rules. However, their application to complex real-world data mining problems, such as genetic association studies, has been limited. Traditional knowledge discovery strategies for M-LCS rule populations involve sorting and manual rule inspection. While this approach may be sufficient for simpler problems, the confounding influence of noise and the need to discriminate between predictive and non-predictive attributes calls for additional strategies. Additionally, tests of significance must be adapted to M-LCS analyses in order to make them a viable option within fields that require such analyses to assess confidence. In this work we introduce an M-LCS analysis pipeline that combines uniquely applied visualizations with objective statistical evaluation for the identification of predictive attributes and reliable rule generalizations in noisy single-step data mining problems. This work considers an alternative paradigm for knowledge discovery in M-LCSs, shifting the focus from individual rules to a global, population-wide perspective. We demonstrate the efficacy of this pipeline applied to the identification of epistasis (i.e., attribute interaction) and heterogeneity in noisy simulated genetic association data.


Learning classifier systems: a complete introduction, review, and roadmap

  • Ryan J. Urbanowicz and Jason H. Moore (2009)
  • Journal of Artificial Evolution and Applications. Hindawi Publishing Corporation
    BibTeX
    @article{urbanowicz2009learning,
    title={Learning classifier systems: a complete introduction, review, and roadmap},
    author={Urbanowicz, Ryan J and Moore, Jason H},
    journal={Journal of Artificial Evolution and Applications},
    volume={2009},
    pages={1},
    year={2009},
    publisher={Hindawi Publishing Corp.}
    }

Abstract: If complexity is your problem, learning classifier systems (LCSs) may offer a solution. These rule-based, multifaceted, machine learning algorithms originated and have evolved in the cradle of evolutionary biology and artificial intelligence. The LCS concept has inspired a multitude of implementations adapted to manage the different problem domains to which it has been applied (e.g., autonomous robotics, classification, knowledge discovery, and modeling). One field that is taking increasing notice of LCS is epidemiology, where there is a growing demand for powerful tools to facilitate etiological discovery. Unfortunately, implementation optimization is nontrivial, and a cohesive encapsulation of implementation alternatives seems to be lacking. This paper aims to provide an accessible foundation for researchers of different backgrounds interested in selecting or developing their own LCS. Included is a simple yet thorough introduction, a historical review, and a roadmap of algorithmic components, emphasizing differences in alternative LCS implementations.