Pareto inspired multi-objective rule fitness for noise-adaptive rule-based machine learning
- Ryan J. Urbanowicz, Randal S. Olson, Jason H. Moore (2016)
- Proceedings of the Parallel Problem Solving from Nature Conference (PPSN XIV)
BibTeX@inproceedings{urbanowicz2016pareto,
title={Pareto Inspired Multi-objective Rule Fitness for Noise-Adaptive Rule-Based Machine Learning},
author={Urbanowicz, Ryan J and Olson, Randal S and Moore, Jason H},
booktitle={International Conference on Parallel Problem Solving from Nature},
pages={514–524},
year={2016},
organization={Springer}
}
Abstract: Learning classifier systems (LCSs) are rule-based evolutionary algorithms uniquely suited to classification and data mining in complex, multi-factorial, and heterogeneous problems. The fitness of individual LCS rules is commonly based on accuracy, but this metric alone is not ideal for assessing global rule ‘value’ in noisy problem domains and thus impedes effective knowledge extraction. Multi-objective fitness functions are promising but rely on prior knowledge of how to weigh objective importance (typically unavailable in real world problems). The Pareto-front concept offers a multi-objective strategy that is agnostic to objective importance. We propose a Pareto-inspired multi-objective rule fitness (PIMORF) for LCS, and combine it with a complementary rule-compaction approach (SRC). We implemented these strategies in ExSTraCS, a successful supervised LCS, and evaluated performance over an array of complex simulated noisy and clean problems (i.e. genetic and multiplexer) that each concurrently model pure interaction effects and heterogeneity. While evaluation over multiple performance metrics yielded mixed results, this work represents an important first step towards efficiently learning complex problem spaces without the advantage of prior problem knowledge. Overall, the results suggest that PIMORF paired with SRC improved rule set interpretability, particularly with regard to heterogeneous patterns.
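For readers unfamiliar with the Pareto-front idea that PIMORF draws on, the sketch below illustrates non-domination over two hypothetical rule objectives (accuracy and correct coverage); it is an illustration only, not the PIMORF implementation from the paper.

    # Illustrative sketch of Pareto non-domination over two rule objectives
    # (accuracy, correct coverage); not the PIMORF implementation from the paper.

    def dominates(a, b):
        """Rule a dominates rule b if it is no worse on both objectives
        and strictly better on at least one."""
        return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

    def pareto_front(rules):
        """Return the rules not dominated by any other rule."""
        return [r for r in rules
                if not any(dominates(other, r) for other in rules if other is not r)]

    # Hypothetical (accuracy, correct_coverage) pairs for four rules.
    rules = [(0.95, 10), (0.80, 40), (0.70, 35), (0.99, 5)]
    print(pareto_front(rules))   # -> [(0.95, 10), (0.80, 40), (0.99, 5)]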
Evaluation of a tree-based pipeline optimization tool for automating data science
- Randal S. Olson, Nathan Bartley, Ryan J. Urbanowicz, and Jason H. Moore (2016)
- Proceedings of the 2016 Annual Conference on Genetic and Evolutionary Computation
BibTeX@article{olson2016evaluation,
title={Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science},
author={Olson, Randal S and Bartley, Nathan and Urbanowicz, Ryan J and Moore, Jason H},
journal={arXiv preprint arXiv:1603.06212},
year={2016}
}
Abstract: As the field of data science continues to grow, there will be an ever-increasing demand for tools that make machine learning accessible to non-experts. In this paper, we introduce the concept of tree-based pipeline optimization for automating one of the most tedious parts of machine learning—pipeline design. We implement an open source Tree-based Pipeline Optimization Tool (TPOT) in Python and demonstrate its effectiveness on a series of simulated and real-world benchmark data sets. In particular, we show that TPOT can design machine learning pipelines that provide a significant improvement over a basic machine learning analysis while requiring little to no input nor prior knowledge from the user. We also address the tendency for TPOT to design overly complex pipelines by integrating Pareto optimization, which produces compact pipelines without sacrificing classification accuracy. As such, this work represents an important step toward fully automating machine learning pipeline design.
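TPOT is distributed as an open-source Python package, and a minimal usage sketch looks like the following; the dataset and parameter values are arbitrary illustrations (assuming TPOT and scikit-learn are installed), not the benchmark settings from the paper.

    # Minimal TPOT usage sketch; settings are illustrative only.
    from tpot import TPOTClassifier
    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split

    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    tpot = TPOTClassifier(generations=5, population_size=20,
                          verbosity=2, random_state=42)
    tpot.fit(X_train, y_train)                 # evolves a pipeline via genetic programming
    print(tpot.score(X_test, y_test))          # held-out accuracy of the best pipeline
    tpot.export('tpot_digits_pipeline.py')     # writes the best pipeline as Python code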
Automating biomedical data science through tree-based pipeline optimization
- Randal S. Olson, Ryan J. Urbanowicz, Peter C. Andrews, Nicole A. Lavender, La Creis Kidd, and Jason H. Moore (2016)
- Applications of Evolutionary Computation (EvoBIO)
BibTeX@inproceedings{olson2016automating,
title={Automating biomedical data science through tree-based pipeline optimization},
author={Olson, Randal S and Urbanowicz, Ryan J and Andrews, Peter C and Lavender, Nicole A and Moore, Jason H and others},
booktitle={European Conference on the Applications of Evolutionary Computation},
pages={123–137},
year={2016},
organization={Springer}
}
Abstract: Over the past decade, data science and machine learning have grown from a mysterious art form to a staple tool across a variety of fields in academia, business, and government. In this paper, we introduce the concept of tree-based pipeline optimization for automating one of the most tedious parts of machine learning — pipeline design. We implement a Tree-based Pipeline Optimization Tool (TPOT) and demonstrate its effectiveness on a series of simulated and real-world genetic data sets. In particular, we show that TPOT can build machine learning pipelines that achieve competitive classification accuracy and discover novel pipeline operators — such as synthetic feature constructors — that significantly improve classification accuracy on these data sets. We also highlight the current challenges to pipeline optimization, such as the tendency to produce pipelines that overfit the data, and suggest future research paths to overcome these challenges. As such, this work represents an early step toward fully automating machine learning pipeline design.
Continuous endpoint data mining with ExSTraCS: a supervised learning classifier system
- Ryan J. Urbanowicz, Niranjan Ramanand, Jason H. Moore (2015)
- Workshop Proceedings of the 17th Annual Conference on Genetic and Evolutionary Computation (GECCO’15)
BibTeX@inproceedings{urbanowicz2015continuous,
title={Continuous Endpoint Data Mining with ExSTraCS: A Supervised Learning Classifier System},
author={Urbanowicz, Ryan and Ramanand, Niranjan and Moore, Jason},
booktitle={Proceedings of the Companion Publication of the 2015 Annual Conference on Genetic and Evolutionary Computation},
pages={1029–1036},
year={2015},
organization={ACM}
}
Abstract: ExSTraCS is a powerful Michigan-style learning classifier system (LCS) that was developed for classification, prediction, modeling, and knowledge discovery in complex and/or heterogeneous supervised learning problems with clean or noisy signals. To date, ExSTraCS has been limited to problems with discrete endpoints (i.e. classes). Many real world problems, however, involve endpoints with continuous values (e.g. function approximation, or quantitative trait analyses). In some problems the goal is to predict a specific continuous value with low error based on input values. In other problems it may be more informative to predict continuous intervals (i.e. predict that an endpoint falls within some range to define meaningful thresholds within the endpoint continuum). Thus far, there has not been a supervised learning LCS designed to handle continuous endpoints, nor one that incorporates interval predictions within rules. In this paper, we propose and evaluate (1) a supervised learning approach for solving continuous endpoint problems that connects input states to endpoint intervals within rules, (2) a novel prediction scheme that converts interval predictions into a specific continuous value prediction, and (3) an alternate approach to rule subsumption. Following simulation study analyses, we discuss the benefits and drawbacks of these implementations within ExSTraCS.
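As a rough illustration of the interval-to-value prediction step described above, the sketch below averages the midpoints of the matching rules' endpoint intervals, weighted by rule fitness; this is an assumed simplification, not necessarily the exact prediction scheme implemented in ExSTraCS.

    # Illustrative only: convert interval predictions from matching rules into a
    # single continuous prediction via fitness-weighted interval midpoints.

    def point_prediction(match_set):
        """match_set: list of (interval_low, interval_high, fitness) for rules
        whose conditions match the current instance."""
        total_fitness = sum(f for _, _, f in match_set)
        if total_fitness == 0:
            return None  # no confident prediction
        return sum(((lo + hi) / 2.0) * f for lo, hi, f in match_set) / total_fitness

    # Hypothetical matching rules predicting overlapping endpoint intervals.
    print(point_prediction([(0.2, 0.4, 0.9), (0.25, 0.55, 0.5), (0.1, 0.3, 0.2)]))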
Retooling fitness for noisy problems in a supervised Michigan-style learning classifier system
- Ryan J. Urbanowicz and Jason H. Moore (2015)
- Proceedings of the 17th Annual Conference on Genetic and Evolutionary Computation (GECCO’15)
BibTeX@inproceedings{urbanowicz2015retooling,
title={Retooling Fitness for Noisy Problems in a Supervised Michigan-style Learning Classifier System},
author={Urbanowicz, Ryan and Moore, Jason},
booktitle={Proceedings of the 2015 Annual Conference on Genetic and Evolutionary Computation},
pages={591–598},
year={2015},
organization={ACM}
}
Abstract: An accuracy-based rule fitness is a hallmark of most modern Michigan-style learning classifier systems (LCS), a powerful, flexible, and largely interpretable class of machine learners. However, rule fitness based solely on accuracy is not ideal for identifying ‘optimal’ rules in supervised learning. This is particularly true for noisy problem domains where perfect rule accuracy essentially guarantees over-fitting. Rule fitness based on accuracy alone is unreliable for reflecting the global ‘value’ of a given rule since rule accuracy is based on a subset of the training instances. While moderate over-fitting may not dramatically hinder LCS classification or prediction performance, the interpretability of the solution is likely to suffer. Additionally, over-fitting can impede algorithm learning efficiency and lead to a larger number of rules being required to capture relationships. The present study seeks to develop an intuitive multi-objective fitness function that will encourage the discovery, preservation, and identification of ‘optimal’ rules through accuracy, correct coverage of training data, and the prior probability of the specified attribute states and class expressed by a given rule. We demonstrate the advantages of our proposed fitness by implementing it into the ExSTraCS algorithm and performing evaluations over a large spectrum of complex, noisy, simulated datasets.
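To make the idea concrete, here is a hedged sketch of a rule fitness that combines accuracy, correct coverage of the training data, and a chance (prior probability) baseline; the particular combination below is illustrative, not the exact function proposed in the paper.

    # Illustrative sketch of a multi-objective rule fitness combining accuracy,
    # correct coverage, and a chance baseline; not the exact GECCO'15 formulation.

    def rule_fitness(correct_matches, total_matches, n_training, class_prior):
        """Accuracy over the instances a rule matches, rewarded only when it
        beats the class prior, and scaled by correct coverage of the data."""
        if total_matches == 0:
            return 0.0
        accuracy = correct_matches / total_matches
        coverage = correct_matches / n_training          # fraction of data correctly covered
        lift = max(0.0, accuracy - class_prior)          # reward only above-chance accuracy
        return lift * coverage

    # A perfectly accurate rule covering 5 of 1000 instances vs. a slightly noisy
    # rule covering 200 of them (binary class with prior 0.5).
    print(rule_fitness(5, 5, 1000, 0.5))      # small value: accurate but narrow
    print(rule_fitness(170, 200, 1000, 0.5))  # larger value: broader and above chance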
An extended Michigan-style learning classifier system for flexible supervised learning, classification, and data mining
- Ryan J. Urbanowicz, Gediminas Bertasius, and Jason H. Moore (2014)
- Proceedings of the Parallel Problem Solving from Nature Conference (PPSN XIII)
BibTeX@inproceedings{urbanowicz2014extended,
title={An extended michigan-style learning classifier system for flexible supervised learning, classification, and data mining},
author={Urbanowicz, Ryan J and Bertasius, Gediminas and Moore, Jason H},
booktitle={International Conference on Parallel Problem Solving from Nature},
pages={211–221},
year={2014},
organization={Springer}
}
Abstract: Advancements in learning classifier system (LCS) algorithms have highlighted their unique potential for tackling complex, noisy problems, as found in bioinformatics. Ongoing research in this domain must address the challenges of modeling complex patterns of association, systems biology (i.e. the integration of different data types to achieve a more holistic perspective), and ‘big data’ (i.e. scalability in large-scale analysis). With this in mind, we introduce ExSTraCS (Extended Supervised Tracking and Classifying System), as a promising platform to address these challenges using supervised learning and a Michigan-Style LCS architecture. ExSTraCS integrates several successful LCS advancements including attribute tracking/feedback, expert knowledge covering (with four built-in attribute weighting algorithms), a flexible and efficient rule representation (handling datasets with both discrete and continuous attributes), and rapid non-destructive rule compaction. A few novel mechanisms, such as adaptive data management, have been included to enhance ease of use, flexibility, performance, and provide groundwork for ongoing development.
Rapid rule compaction strategies for global knowledge discovery in a supervised learning classifier system
- Jie Tan, Jason H. Moore, and Ryan J. Urbanowicz (2013)
- Advances in Artificial Life (ECAL)
BibTeX@article{tan2013rapid,
title={Rapid rule compaction strategies for global knowledge discovery in a supervised learning classifier system},
author={Tan, Jie and Moore, Jason and Urbanowicz, Ryan},
journal={Advances in Artificial Life, ECAL},
volume={12},
pages={110–117},
year={2013}
}
Abstract: Michigan-style learning classifier systems have availed themselves as a promising modeling and data mining strategy for bioinformaticists seeking to connect predictive variables with disease phenotypes. The resulting ‘model’ learned by these algorithms comprises an entire population of rules, some of which will inevitably be redundant or poor predictors. Rule compaction is a post-processing strategy for consolidating this rule population with the goal of improving interpretation and knowledge discovery. However, existing rule compaction strategies tend to reduce overall rule population performance along with population size, especially in the context of noisy problem domains such as bioinformatics. In the present study we introduce and evaluate two new rule compaction strategies (QRC, PDRC) and a simple rule filtering method (QRF), and compare them to three existing methodologies. These new strategies are tuned to fit with a global approach to knowledge discovery in which less emphasis is placed on minimizing rule population size (to facilitate manual rule inspection) and more is placed on preserving performance. This work identified the strengths and weaknesses of each approach, suggesting PDRC to be the most balanced approach, trading a minimal loss in testing accuracy for significant gains or consistency in all other performance statistics.
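The flavor of such light-touch compaction can be sketched as a simple filter that drops unused or below-chance rules while leaving the rest of the population intact; the fields and thresholds below are assumptions for illustration, not the exact QRF/QRC/PDRC criteria.

    # Illustrative rule-filtering sketch in the spirit of a quick rule filter;
    # rule fields and thresholds are assumptions, not the paper's exact criteria.

    def filter_rules(population, class_prior, min_matches=1):
        """Keep rules that have been matched and whose accuracy exceeds chance."""
        kept = []
        for rule in population:            # each rule is a plain dict in this sketch
            if rule['match_count'] < min_matches:
                continue
            accuracy = rule['correct_count'] / rule['match_count']
            if accuracy > class_prior:
                kept.append(rule)
        return kept

    population = [
        {'match_count': 0,  'correct_count': 0},    # never matched: dropped
        {'match_count': 40, 'correct_count': 18},   # below chance (0.45): dropped
        {'match_count': 40, 'correct_count': 31},   # above chance (0.775): kept
    ]
    print(len(filter_rules(population, class_prior=0.5)))  # -> 1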
A simple multi-core parallelization strategy for learning classifier system evaluations
- James Rudd, Jason H. Moore, and Ryan J. Urbanowicz (2013)
- Workshop Proceedings of the 15th Annual Conference on Genetic and Evolutionary Computation (GECCO’13)
BibTeX@inproceedings{rudd2013simple,
title={A simple multi-core parallelization strategy for learning classifier system evaluation},
author={Rudd, James and Moore, Jason and Urbanowicz, Ryan},
booktitle={Proceedings of the 15th annual conference companion on Genetic and evolutionary computation},
pages={1259–1266},
year={2013},
organization={ACM}
}
Abstract: Permutation strategies for statistically evaluating the significance of predictions and patterns identified within learning classifier systems (LCSs) have only appeared since 2012. While LCSs are already considered computationally expensive algorithms, a permutation-testing-based approach to determining statistical significance has the potential to be many times more demanding. One area of LCS research which has become both feasible and popularized in recent years is the adoption of parallelization strategies. In the present study we explore the simple benefits of parallelizing a set of LCS analyses in an attempt to make the completion of a permutation test with cross validation more feasible on a single multi-core workstation. We test our Python implementation of this strategy in the context of a simulated complex genetic epidemiological data mining problem. Our evaluations indicate that on Windows 7 computers, as long as the number of concurrent processes does not exceed the number of CPU cores, the speedup achieved is approximately linear.
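The parallelization strategy itself is simple to sketch with Python's standard library; run_lcs below is a hypothetical stand-in for one complete LCS analysis (e.g. a cross-validation fold on a permuted dataset), not the authors' actual code.

    # Minimal multiprocessing sketch for running independent LCS analyses
    # (CV folds x permutations) in parallel on one multi-core workstation.
    import multiprocessing as mp

    def run_lcs(job):
        fold, permutation_seed = job
        # ... train and evaluate one LCS analysis here (hypothetical) ...
        return fold, permutation_seed, 0.0   # placeholder for a testing accuracy

    if __name__ == '__main__':
        jobs = [(fold, seed) for fold in range(10) for seed in range(100)]
        # Keep the process count at or below the CPU core count for near-linear speedup.
        with mp.Pool(processes=mp.cpu_count()) as pool:
            results = pool.map(run_lcs, jobs)
        print(len(results))   # one result per (fold, permutation) job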
An expert knowledge guided Michigan-style learning classifier system for the detection and modeling of epistasis and genetic heterogeneity
- Ryan J. Urbanowicz, Delaney Granizo-Mackenzie, and Jason H. Moore (2012)
- Proceedings of the Parallel Problem Solving from Nature Conference (PPSN XII)
BibTeX@inproceedings{urbanowicz2012using,
title={Using expert knowledge to guide covering and mutation in a michigan style learning classifier system to detect epistasis and heterogeneity},
author={Urbanowicz, Ryan J and Granizo-Mackenzie, Delaney and Moore, Jason H},
booktitle={International Conference on Parallel Problem Solving from Nature},
pages={266–275},
year={2012},
organization={Springer}
}
Abstract: Learning Classifier Systems (LCSs) are a unique brand of multifaceted evolutionary algorithms well suited to complex or heterogeneous problem domains. One such domain involves data mining within genetic association studies which investigate human disease. Previously we have demonstrated the ability of Michigan-style LCSs to detect genetic associations in the presence of two complicating phenomena: epistasis and genetic heterogeneity. However, LCSs are computationally demanding and problem scaling is a common concern. The goal of this paper was to apply and evaluate expert knowledge-guided covering and mutation operators within an LCS algorithm. Expert knowledge, in the form of Spatially Uniform ReliefF (SURF) scores, was incorporated to guide learning towards regions of the problem domain most likely to be of interest. This study demonstrates that expert knowledge can improve learning efficiency in the context of a Michigan-style LCS.
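The core operation, specifying attributes in proportion to expert-knowledge scores, can be sketched as a weighted sample without replacement; the scores and counts below are assumptions for illustration, not the paper's operator or settings.

    # Illustrative sketch of expert-knowledge-guided covering: attributes with
    # higher (e.g. SURF/ReliefF-style) scores are more likely to be specified
    # in a new rule. Assumes non-negative scores (shift or normalize otherwise).
    import random

    def choose_attributes_to_specify(ek_scores, n_specify):
        """Sample attribute indices without replacement, weighted by expert knowledge."""
        scores = dict(enumerate(ek_scores))
        chosen = []
        while scores and len(chosen) < n_specify:
            total = sum(scores.values())
            r = random.uniform(0, total)
            running = 0.0
            picked = next(iter(scores))      # fallback if rounding leaves none picked
            for idx, s in scores.items():
                running += s
                if running >= r:
                    picked = idx
                    break
            chosen.append(picked)
            del scores[picked]
        return chosen

    ek_scores = [0.02, 0.30, 0.01, 0.25, 0.03]   # hypothetical expert-knowledge scores
    print(choose_attributes_to_specify(ek_scores, n_specify=2))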
Instance-linked attribute tracking and feedback for Michigan-style supervised learning classifier systems
- Ryan J. Urbanowicz, Ambrose Granizo-Mackenzie, and Jason H. Moore (2012)
- Proceedings of the 14th Annual Conference on Genetic and Evolutionary Computation (GECCO’12)
BibTeX@inproceedings{urbanowicz2012instance,
title={Instance-linked attribute tracking and feedback for michigan-style supervised learning classifier systems},
author={Urbanowicz, Ryan and Granizo-Mackenzie, Ambrose and Moore, Jason},
booktitle={Proceedings of the 14th annual conference on Genetic and evolutionary computation},
pages={927–934},
year={2012},
organization={ACM}
}
Abstract: The application of learning classifier systems (LCSs) to classification and data mining in genetic association studies has been the target of previous work. Recent efforts have focused on: (1) correctly discriminating between predictive and non-predictive attributes, and (2) detecting and characterizing epistasis (attribute interaction) and heterogeneity. While the solutions evolved by Michigan-style LCSs (M-LCSs) are conceptually well suited to address these phenomena, the explicit characterization of heterogeneity remains a particular challenge. In this study we introduce attribute tracking, a mechanism akin to memory, for supervised learning in M-LCSs. Given a finite training set, a vector of accuracy scores is maintained for each instance in the data. Post-training, we apply these scores to characterize patterns of association in the dataset. Additionally we introduce attribute feedback to the mutation and crossover mechanisms, probabilistically directing rule generalization based on an instance’s tracking scores. We find that attribute tracking combined with clustering and visualization facilitates the characterization of epistasis and heterogeneity while uniquely linking individual instances in the dataset to etiologically heterogeneous subgroups. Moreover, these analyses demonstrate that attribute feedback significantly improves test accuracy, efficient generalization, run time, and the power to discriminate between predictive and non-predictive attributes in the presence of heterogeneity.
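A hedged sketch of the bookkeeping behind attribute tracking and feedback is given below: one score vector per training instance, incremented when correct rules specify an attribute, then renormalized into mutation-bias probabilities. The update and normalization details are assumptions, not the exact mechanisms from the paper.

    # Illustrative attribute-tracking sketch: per-instance score vectors nudged
    # by correct rules, then reused as attribute-feedback probabilities.
    import numpy as np

    n_instances, n_attributes = 100, 20
    tracking = np.zeros((n_instances, n_attributes))

    def update_tracking(instance_index, correct_rules):
        """correct_rules: iterable of (specified_attribute_indices, rule_accuracy)
        pairs; a hypothetical rule structure for this sketch."""
        for specified_attributes, rule_accuracy in correct_rules:
            for a in specified_attributes:
                tracking[instance_index, a] += rule_accuracy

    def feedback_probabilities(instance_index):
        """Turn an instance's tracking scores into mutation-bias probabilities."""
        scores = tracking[instance_index]
        total = scores.sum()
        return scores / total if total > 0 else np.full(n_attributes, 1.0 / n_attributes)

    update_tracking(0, [({2, 7}, 0.9), ({2, 11}, 0.6)])
    print(feedback_probabilities(0)[[2, 7, 11]])   # attributes 2, 7, 11 carry the weight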
Random artificial incorporation of noise in a learning classifier system environment
- Ryan J. Urbanowicz, Nicholas Sinnott-Armstrong, and Jason H. Moore (2011)
- Proceedings of the 13th Annual Conference on Genetic and Evolutionary Computation (GECCO’11)
BibTeX@inproceedings{urbanowicz2011random,
title={Random artificial incorporation of noise in a learning classifier system environment},
author={Urbanowicz, Ryan and Sinnott-Armstrong, Nicholas and Moore, Jason},
booktitle={Proceedings of the 13th annual conference companion on Genetic and evolutionary computation},
pages={369–374},
year={2011},
organization={ACM}
}
Abstract: Effective rule generalization in learning classifier systems (LCSs) has long been an important consideration. In noisy problem domains, where attributes do not precisely determine class, overemphasis on accuracy without sufficient generalization leads to over-fitting of the training data, and a large discrepancy between training and testing accuracies. This issue is of particular concern within noisy bioinformatics problems such as complex disease gene association studies. In an effort to promote effective generalization, we introduce and explore a simple strategy which seeks to discourage over-fitting via the probabilistic incorporation of random noise within training instances. We evaluate a variety of noise models and magnitudes which either specify an equal probability of noise per attribute, or target higher noise probability to the attributes which tend to be more frequently generalized. Our results suggest that targeted noise incorporation can reduce training accuracy without eroding testing accuracy. In addition, we observe a slight improvement in our power estimates (i.e. ability to detect the true underlying model(s)).
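The core manipulation, probabilistically perturbing attribute values in training instances, can be sketched as follows; the genotype coding (0/1/2) and the per-attribute noise probabilities are assumptions for illustration, not the paper's exact noise models.

    # Illustrative sketch of random artificial noise incorporation: each attribute
    # value in a training instance is replaced by a random alternative with some
    # probability. Genotype coding and probabilities are assumed here.
    import random

    def add_noise(instance, noise_probs, values=(0, 1, 2)):
        """noise_probs[i] is the chance of perturbing attribute i; a targeted model
        would assign higher probabilities to frequently generalized attributes."""
        noisy = list(instance)
        for i, p in enumerate(noise_probs):
            if random.random() < p:
                noisy[i] = random.choice([v for v in values if v != noisy[i]])
        return noisy

    instance = [0, 1, 2, 1, 0]
    uniform = [0.05] * len(instance)              # equal noise probability per attribute
    targeted = [0.01, 0.01, 0.10, 0.10, 0.01]     # more noise on frequently generalized attributes
    print(add_noise(instance, targeted))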
The application of Pittsburgh-style learning classifier systems to address genetic heterogeneity and epistasis in association studies
- Ryan J. Urbanowicz, and Jason H. Moore (2010)
- Proceedings of the Parallel Problem Solving from Nature Conference (PPSN XI)
BibTeX@inproceedings{urbanowicz2010application,
title={The application of pittsburgh-style learning classifier systems to address genetic heterogeneity and epistasis in association studies},
author={Urbanowicz, Ryan J and Moore, Jason H},
booktitle={International Conference on Parallel Problem Solving from Nature},
pages={404–413},
year={2010},
organization={Springer}
}
Abstract: Despite the growing abundance and quality of genetic data, genetic epidemiologists continue to struggle with connecting the phenotype of common complex disease to underlying genetic markers and etiologies. In the context of gene association studies, this process is greatly complicated by phenomena such as genetic heterogeneity (GH) and epistasis (gene-gene interactions), which constitute difficult but accessible challenges for bioinformaticists. While previous work has demonstrated the potential of using Michigan-style Learning Classifier Systems (LCSs) as a direct approach to this problem, the present study examines Pittsburgh-style LCSs, an architecturally and functionally distinct class of algorithm, linked by the common goal of evolving a solution comprised of multiple rules as opposed to a single “best” rule. This study highlights the strengths and weaknesses of the Pittsburgh-style LCS architectures (GALE and GAssist) as they are applied to the GH/epistasis problem.
The application of Michigan-style learning classifier systems to address genetic heterogeneity and epistasis in association studies
- Ryan J. Urbanowicz, and Jason H. Moore (2010)
- Proceedings of the 12th Annual Conference on Genetic and Evolutionary Computation (GECCO’10)
BibTeX@inproceedings{urbanowicz2010michigan,
title={The application of michigan-style learning classifier systems to address genetic heterogeneity and epistasis in association studies},
author={Urbanowicz, Ryan J and Moore, Jason H},
booktitle={Proceedings of the 12th annual conference on Genetic and evolutionary computation},
pages={195–202},
year={2010},
organization={ACM}
}
Abstract: Genetic epidemiologists, tasked with the disentanglement of genotype-to-phenotype mappings, continue to struggle with a variety of phenomena which obscure the underlying etiologies of common complex diseases. For genetic association studies, genetic heterogeneity (GH) and epistasis (gene-gene interactions) epitomize well-recognized phenomena which represent a difficult but accessible challenge for computational biologists. While progress has been made addressing epistasis, methods for dealing with GH tend to “side-step” the problem, limited by a dependence on potentially arbitrary cutoffs/covariates, and a loss in power synonymous with data stratification. In the present study, we explore an alternative strategy (Learning Classifier Systems (LCSs)) as a direct approach for the characterization and modeling of disease in the presence of both GH and epistasis. This evaluation involves (1) implementing standardized versions of existing Michigan-Style LCSs (XCS, MCS, and UCS), (2) examining major run parameters, and (3) performing quantitative and qualitative evaluations across a spectrum of simulated datasets. The results of this study highlight the strengths and weaknesses of the Michigan LCS architectures examined, providing proof of principle for the application of LCSs to the GH/epistasis problem, and laying the foundation for the development of an LCS algorithm specifically designed to address GH.
Mask functions for the symbolic modeling of epistasis using genetic programming
- Ryan J. Urbanowicz, Bill White, Nate Barney, and Jason H. Moore (2008)
- Proceedings of the 10th Annual Conference on Genetic and Evolutionary Computation (GECCO’08)
BibTeX@inproceedings{urbanowicz2008mask,
title={Mask functions for the symbolic modeling of epistasis using genetic programming},
author={Urbanowicz, Ryan J and Barney, Nate and White, Bill C and Moore, Jason H},
booktitle={Proceedings of the 10th annual conference on Genetic and evolutionary computation},
pages={339–346},
year={2008},
organization={ACM}
}
Abstract: The study of common, complex multifactorial diseases in genetic epidemiology is complicated by nonlinearity in the genotype-to-phenotype mapping relationship that is due, in part, to epistasis or gene-gene interactions. Symbolic discriminant analysis (SDA) is a flexible modeling approach which uses genetic programming (GP) to evolve an optimal predictive model using a predefined collection of mathematical functions, constants, and attributes. This has been shown to be an effective strategy for modeling epistasis. In the present study, we introduce the genetic “mask” as a novel building block which exploits expert knowledge in the form of a pre-constructed relationship between two attributes. The goal of this study was to determine whether the availability of “mask” building blocks improves SDA performance. The results of this study support the idea that pre-processing data improves GP performance.
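One way to picture the “mask” building block is as a pre-constructed lookup over two genotype attributes that GP can use as a single primitive; the table below is hypothetical and not the SDA implementation from the paper.

    # Illustrative sketch of a genetic "mask" building block: a pre-constructed
    # mapping over two genotype attributes (coded 0/1/2) supplied to the GP
    # function set as one primitive. The table is hypothetical, not from the paper.
    import numpy as np

    # Rows index the genotype of SNP A, columns the genotype of SNP B; values
    # encode a presumed epistatic relationship derived from expert pre-processing.
    XOR_LIKE_MASK = np.array([[0, 1, 0],
                              [1, 0, 1],
                              [0, 1, 0]])

    def mask(genotype_a, genotype_b, table=XOR_LIKE_MASK):
        """A GP primitive: look up the joint two-attribute pattern."""
        return table[genotype_a, genotype_b]

    print(mask(0, 1), mask(1, 1), mask(2, 2))   # -> 1 0 0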