Recommending data preprocessing pipelines for machine learning applications in production

  • Empfehlung von Pipelines der Datenvorverarbeitung für Anwendungen des maschinellen Lernens in der Produktion

Frye, Maik; Schmitt, Robert H. (Thesis advisor); Behr, Marek (Thesis advisor)

[Corrected] 1st edition. - Aachen : Apprimus Verlag (2023)
Book, Dissertation / PhD Thesis

In: Ergebnisse aus der Produktionstechnik 8/2023
Page(s)/Article-Nr.: 1 online resource : illustrations, diagrams

Dissertation, RWTH Aachen University, 2022


The era of Industry 4.0 opens up the possibility of optimizing production systems in a data-driven way. To turn data into value, machine learning (ML) models are trained on production data with the aim of identifying patterns that can be used to optimize processes. A crucial prerequisite for achieving performant ML models is the availability of high-quality data. Since raw data generated in production exhibits multiple quality issues, data preprocessing (DPP) is required to increase the quality of the data. One of the key design decisions in any ML project is the choice of suitable DPP methods. The search space grows further when DPP methods are combined into DPP pipelines. Due to the high number of possible DPP pipelines, data scientists commonly select suitable pipelines manually and by trial and error. For these reasons, DPP nowadays accounts for approximately 80 % of the time in ML projects.
To guide data scientists, decision support systems (DSS) have been developed that assist in the selection of suitable DPP pipelines but do not cover production-specific requirements. Therefore, the main research question was: Can a DSS be developed that supports the recommendation of DPP pipelines for ML applications in production? To answer the main research question, a meta-learning-based decision support system, called Meta-DPP, was developed. Meta-DPP relies on three core components: the meta target selector, the meta features database, and the meta model. The meta target selector chooses between two preselected sets of overall well-performing pipelines, called pipeline pools, for both classification and regression tasks. Further, the meta features database stores learning-task-specific information about the data set, e.g., the number of instances, as well as past ML algorithm and DPP pipeline performances. The meta model then recommends a pipeline from the pipeline pool based on the meta features from the database.
When applying Meta-DPP, a user interface enables the data scientist, production expert, or IT expert to input their data set, learning task, ML algorithm, and information about explainability. Given these four inputs, Meta-DPP provides a ranked recommendation of the DPP pipelines from the pool. Probabilities provided by the meta model further indicate how certain Meta-DPP is about the recommendation. Verification and validation confirmed that Meta-DPP was developed and implemented correctly. The validation on 324 production use cases further showed that Meta-DPP outperforms essential pipelines on average, where essential pipelines are those that ensure the functioning of ML algorithms by performing only minimal DPP. In conclusion, the main research question was answered in the affirmative.
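The recommendation step described above can be illustrated with a minimal sketch. It is not the implementation from the thesis: the abstract does not specify the meta model, so a simple k-nearest-neighbour vote stands in for it here. The meta features, the database entries, the pipeline names, and the function `recommend` are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import dist

# Hypothetical meta features database (assumption, not from the thesis):
# each entry pairs the meta features of a past data set
# (log10 of instance count, number of features, share of missing values)
# with the DPP pipeline that performed best on it.
META_DB = [
    ((3.0, 10.0, 0.00), "scale_only"),
    ((3.2, 12.0, 0.01), "scale_only"),
    ((4.0, 50.0, 0.20), "impute_scale"),
    ((4.1, 55.0, 0.25), "impute_scale"),
    ((5.0, 200.0, 0.05), "impute_scale_pca"),
]

# Hypothetical pipeline pool for one learning task.
PIPELINE_POOL = ["scale_only", "impute_scale", "impute_scale_pca"]


def recommend(meta_features, k=3):
    """Rank the pool pipelines by their vote share among the k nearest past data sets."""
    # Find the k past data sets whose meta features are closest (Euclidean distance).
    neighbours = sorted(META_DB, key=lambda entry: dist(entry[0], meta_features))[:k]
    votes = Counter(label for _, label in neighbours)
    # The vote share serves as a rough probability for each pipeline in the pool.
    ranking = [(pipeline, votes.get(pipeline, 0) / k) for pipeline in PIPELINE_POOL]
    return sorted(ranking, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for pipeline, probability in recommend((4.05, 52.0, 0.22)):
        print(f"{pipeline}: {probability:.2f}")
```

The vote shares play the role of the probabilities mentioned above: a top share close to 1 suggests a confident recommendation, while evenly spread shares suggest an uncertain one. The meta features in this sketch are left unscaled for brevity, so the feature with the largest range dominates the distance; a real meta model would normally normalize them first.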


  • Laboratory for Machine Tools and Production Engineering (WZL) of RWTH Aachen University [417200]
  • Chair of Production Metrology and Quality Management [417510]