Data integration using machine learning: Automation of data mapping using machine learning techniques

Type
Master Thesis
Program
Complex adaptive systems (MPCAS), MSc
Published
2016
Authors
Birgersson, Marcus
Hansson, Gustav
Abstract
Data integration involves mapping the flow of data between systems. This task is usually performed manually, and much time can be saved if parts of it can be automated. This report presents three models based on statistics from previously mapped systems. The purpose of these models is to aid an expert in the mapping process by supplying a first guess at how to map two systems. The models are limited to mappings between two XML formats, where the path to a node carrying data is usually descriptive of its data content. The developed models are the following:
1. A shortest distance model, based on the concept that two nodes that have each been associated with a third node, but not with each other, most likely have something to do with each other.
2. A network flow model, which connects words with similar semantic meaning in order to associate the words in two connected XML paths with each other.
3. A data value model, which connects data values to nodes based on similarities between the value and previously seen data.
The results of the models agree with expectations. The shortest distance model can only make suggestions based on XML structures that are present in the training set supplied for the project. The network flow model has the advantage that it only needs to recognize parts of a path to map two nodes to each other, so even completely unfamiliar systems can be mapped if there are similarities between the two systems. Overall, the data value model performs the worst, but it can make correct mappings in some cases where neither of the others can.
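The idea behind the shortest distance model can be illustrated with a minimal sketch. This is not the thesis implementation: it assumes that earlier mappings are available as pairs of XML paths and that candidates are ranked by hop count in an undirected graph built from those pairs. The function names (`build_graph`, `suggest_mappings`) and the example paths are hypothetical.

```python
from collections import deque


def build_graph(known_mappings):
    """Build an undirected graph from (path_a, path_b) pairs of earlier mappings."""
    graph = {}
    for a, b in known_mappings:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def suggest_mappings(graph, source, candidates):
    """Rank candidate paths by breadth-first hop distance from the source path.

    Candidates that cannot be reached through earlier mappings are omitted.
    """
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    reachable = [(dist[c], c) for c in candidates if c in dist]
    return [c for _, c in sorted(reachable)]


# Hypothetical training data: systems A and B were each mapped to system C,
# but never directly to each other.
known = [
    ("A/Order/Id", "C/Hub/OrderRef"),
    ("B/Purchase/Number", "C/Hub/OrderRef"),
    ("B/Purchase/Date", "C/Hub/Created"),
]
graph = build_graph(known)
suggestions = suggest_mappings(
    graph, "A/Order/Id", ["B/Purchase/Number", "B/Purchase/Date"]
)
print(suggestions)
```

In this sketch a path that never occurs in the earlier mappings yields no suggestions, which mirrors the limitation noted above for the shortest distance model.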
Subject/keywords
Computer and Information Science, Information & Communication Technology