Data integration using machine learning: Automation of data mapping using machine learning techniques

Master Thesis

File: 232167.pdf (Fulltext, 1.49 MB, Adobe PDF)
Type: Master Thesis
Title: Data integration using machine learning: Automation of data mapping using machine learning techniques
Authors: Birgersson, Marcus
Hansson, Gustav
Abstract: Data integration involves mapping the flow of data between systems. This task is usually performed manually, and much time can be saved if parts of it can be automated. In this report, three models based on statistics from previously mapped systems are presented. The purpose of these models is to aid an expert in the mapping process by supplying a first guess of how to map two systems. The models are limited to mappings between two XML formats, where the path to a node carrying data is usually descriptive of its data content. The developed models are the following: 1. A shortest distance model, based on the concept that two nodes that have been associated with a third node, but not with each other, most likely have something to do with each other. 2. A network flow model, which connects words with similar semantic meaning in order to associate the words in two connected XML paths with each other. 3. A data value model, which connects data values to nodes based on similarities between the value and previously seen data. The results of the models agree with expectations. The shortest distance model can only make suggestions based on XML structures that are present in the training set supplied for the project. The network flow model has the advantage that it only needs to recognize parts of a path to map two nodes to each other, and even completely unfamiliar systems can be mapped if there are similarities between the two systems. Overall, the data value model performs the worst, but it can make correct mappings in some cases where neither of the others can.
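The idea behind the shortest distance model can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that earlier mappings are available as pairs of XML paths, treats them as edges of an undirected graph, and suggests the candidate target node that is the fewest hops away from the source node. The function names `build_graph` and `suggest`, and the example paths, are hypothetical.

```python
from collections import defaultdict, deque


def build_graph(known_mappings):
    """Build an undirected graph in which an edge joins two XML paths
    that were mapped to each other in an earlier integration."""
    graph = defaultdict(set)
    for a, b in known_mappings:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def suggest(graph, source, candidates):
    """Return the candidate path closest to `source` (fewest hops),
    or None if no candidate is reachable from `source`."""
    candidates = set(candidates)
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        # Sorted only to make the result deterministic when there are ties.
        for neighbour in sorted(graph.get(node, ())):
            if neighbour in seen:
                continue
            if neighbour in candidates:
                # Breadth-first search reaches nodes in order of distance,
                # so the first candidate found is a nearest one.
                return neighbour
            seen.add(neighbour)
            queue.append(neighbour)
    return None


# Hypothetical history: systems A and C were each mapped to system B earlier,
# but never to each other.
known = [
    ("A/Order/CustomerName", "B/Invoice/Buyer/Name"),
    ("C/Purchase/ClientName", "B/Invoice/Buyer/Name"),
]
graph = build_graph(known)
first_guess = suggest(
    graph,
    "A/Order/CustomerName",
    ["C/Purchase/ClientName", "C/Purchase/Total"],
)
```

In this example the two customer-name nodes are linked through the shared node in system B, so the sketch proposes `C/Purchase/ClientName` as the first guess. A node that never appears in the history is unreachable and yields no suggestion, which matches the limitation the abstract notes for this model.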
Keywords: Computer and Information Science; Information & Communication Technology
Issue Date: 2016
Publisher: Chalmers University of Technology / Department of Computer Science and Engineering (Chalmers)
Collection: Master Theses