Association Rule Mining for High Dimensional Master Data

Association rule mining in large data sets has been approached in many ways. Typical applications include market basket analysis, medical diagnosis, biomedical literature, protein sequences, census data, logistic regression and fraud detection. Few approaches are known, however, for association rule mining on Master Data. Yet Master Data is a key asset for enterprises today: it is the single source of business objects across the enterprise. Its quality is of critical importance because business decisions depend on it, so organizations invest considerable effort in keeping it high. Traditionally, organizations use rule-based approaches to discover defects in Master Data. Defining these rules is expensive and is constrained by the availability of people with the right domain expertise. The aim of this project is to evaluate whether approaches such as association rule mining can support Master Data domain experts by suggesting validation rules.

Vision

The vision of this project is to demonstrate the applicability of association rule mining techniques for identifying validation rules in high-dimensional data in general, and then to adapt the approach to Master Data in particular. For the Master Data domain, we expect the project to yield an association rule mining algorithm optimized with regard to complexity and diversity. This algorithm can then be generalized into a Master Data analysis approach for different industry applications.

Challenges

This requires evaluating previous work on association rule mining and specifying the particular challenges that arise when working with Master Data. The research goal of this project can be subdivided into three questions:

  1. How can frequent itemsets be discovered efficiently, and how can association rules be generated from them?
  2. How can the resulting association rules be validated and evaluated with appropriately defined metrics?
  3. How can the resulting association rules be filtered efficiently to support the domain expert's decisions? If thousands of rules are generated, a single domain expert can hardly analyze the results.
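
The three questions can be illustrated with a minimal sketch. It is not the project's algorithm: it uses a textbook Apriori-style level-wise search, the standard support, confidence and lift metrics, and a simple ranking as the filter. The toy records, the attribute names, the thresholds and all function names are assumptions made purely for illustration.

```python
from itertools import combinations

# Hypothetical toy master-data records, each a set of "attribute=value" items.
records = [
    {"country=DE", "currency=EUR", "lang=de"},
    {"country=DE", "currency=EUR", "lang=de"},
    {"country=DE", "currency=EUR", "lang=en"},
    {"country=US", "currency=USD", "lang=en"},
    {"country=US", "currency=USD", "lang=en"},
    {"country=DE", "currency=USD", "lang=de"},  # candidate data defect
]


def support(itemset, records):
    """Fraction of records that contain every item of the itemset."""
    return sum(itemset <= r for r in records) / len(records)


def frequent_itemsets(records, min_support):
    """Question 1: Apriori-style level-wise search, returns {itemset: support}."""
    level = [frozenset([item]) for item in {i for r in records for i in r}]
    result, k = {}, 1
    while level:
        kept = {}
        for candidate in level:
            s = support(candidate, records)
            if s >= min_support:
                kept[candidate] = s
        result.update(kept)
        k += 1
        # Join: unions of two frequent (k-1)-itemsets that have exactly k items.
        joined = {a | b for a in kept for b in kept if len(a | b) == k}
        # Prune: keep a candidate only if all its (k-1)-subsets are frequent.
        level = [c for c in joined
                 if all(frozenset(s) in kept for s in combinations(c, k - 1))]
    return result


def rules(freq, min_confidence):
    """Question 2: derive rules and score them with support, confidence, lift."""
    out = []
    for itemset, s in freq.items():
        for n in range(1, len(itemset)):
            for ante in combinations(sorted(itemset), n):
                a = frozenset(ante)
                c = itemset - a
                confidence = s / freq[a]
                lift = confidence / freq[c]
                if confidence >= min_confidence:
                    out.append((a, c, s, confidence, lift))
    return out


def violations(antecedent, consequent, records):
    """Indices of records that match the antecedent but not the consequent."""
    return [i for i, r in enumerate(records)
            if antecedent <= r and not consequent <= r]


freq = frequent_itemsets(records, min_support=0.3)
mined = rules(freq, min_confidence=0.7)

# Question 3: a simple filter, ranking by confidence and then lift so that
# the domain expert reviews only the strongest few rules first.
top_rules = sorted(mined, key=lambda r: (r[3], r[4]), reverse=True)[:5]

for a, c, s, confidence, lift in top_rules:
    print(sorted(a), "->", sorted(c),
          f"support={s:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

A rule that holds with high but not perfect confidence, such as the hypothetical `country=DE -> currency=EUR`, can then be handed to the expert together with `violations(...)`, which lists the records that break it and are therefore candidates for a data defect. The exhaustive candidate generation shown here does not scale to high-dimensional data; it only makes the terms in the questions concrete.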

Results

The recently completed SDIL project leveraged rule-based approaches combined with supervised machine learning to discover interesting patterns in a unique industrial data set provided by SAP within the SDIL. Read the paper about the project results at: http://www.sdil.de/downloads/sdic-2016-konferenzband.pdf#page=48

Project Partners

KIT, SAP

Contact Person

Dr. Peter Neumayer, peter.neumayer@sap.com

Project Duration

Jan – Aug 2016