Data integration using machine learning: Automation of data mapping using machine learning techniques
Master's thesis
Complex adaptive systems (MPCAS), MSc
Data integration involves mapping the flow of data between systems. This task is usually performed manually, and much time can be saved if parts of it can be automated. This report presents three models based on statistics from previously mapped systems. The purpose of these models is to aid an expert in the mapping process by supplying a first guess at how to map two systems. The models are limited to mappings between two XML formats, where the path to a node carrying data is usually descriptive of its data content. The developed models are the following: 1. A shortest distance model, based on the idea that two nodes that have each been associated with a third node, but not with each other, are most likely related. 2. A network flow model, which connects words with similar semantic meaning in order to associate the words in two connected XML paths with each other. 3. A data value model, which connects data values to nodes based on similarities between the value and previously seen data. The results of the models agree with expectations. The shortest distance model can only make suggestions based on XML structures that are present in the training set supplied for the project. The network flow model has the advantage that it only needs to recognize parts of a path to map two nodes to each other, and even completely unfamiliar systems can be mapped if there are similarities between the two systems. Overall, the data value model performs the worst, but it can make correct mappings in some cases where neither of the others can.
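As a rough illustration of the idea behind the shortest distance model, the sketch below treats previously mapped XML paths as nodes in an undirected graph and suggests the candidate that is the fewest hops away from a given source path. This is only a minimal sketch under assumptions: the abstract does not specify how the thesis builds the graph or measures distance, so the unweighted breadth-first search, the function names (`build_graph`, `distance`, `suggest`) and the example paths are all hypothetical.

```python
from collections import defaultdict, deque


def build_graph(known_mappings):
    """Build an undirected graph from (path_a, path_b) pairs of earlier mappings."""
    graph = defaultdict(set)
    for a, b in known_mappings:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def distance(graph, start, goal):
    """Return the number of hops between two paths, or None if unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        # .get avoids creating empty entries for unknown paths
        for neighbour in graph.get(node, ()):
            if neighbour == goal:
                return hops + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, hops + 1))
    return None


def suggest(graph, source, candidates):
    """Suggest the reachable candidate closest to source, or None if none is reachable."""
    scored = [(distance(graph, source, c), c) for c in candidates]
    reachable = [(d, c) for d, c in scored if d is not None]
    return min(reachable)[1] if reachable else None


# Hypothetical earlier mappings: two paths were each mapped to a third one.
known = [
    ("Order/Customer/Name", "Invoice/Buyer/FullName"),
    ("Invoice/Buyer/FullName", "Shipment/Recipient/Name"),
]
graph = build_graph(known)
best = suggest(
    graph,
    "Order/Customer/Name",
    ["Shipment/Recipient/Name", "Shipment/Weight"],
)
print(best)
```

Here the two `Name` paths were never mapped to each other directly, but both were mapped to `Invoice/Buyer/FullName`, so they end up two hops apart and the sketch proposes them as a first guess. A path that never appeared in earlier mappings, such as `Shipment/Weight`, is unreachable, which mirrors the limitation noted in the abstract that this model can only make suggestions for structures present in the training set.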
Computer and Information Science, Information & Communication Technology