Using blood metabolomics to identify dietary protein intake with Machine Learning methods Master’s thesis in Computer science and engineering KLEIO GKOUTZOMITROU Department of Computer Science and Engineering CHALMERS UNIVERSITY OF TECHNOLOGY UNIVERSITY OF GOTHENBURG Gothenburg, Sweden 2023 Master’s thesis 2023 Using blood metabolomics to identify dietary protein intake with Machine Learning methods KLEIO GKOUTZOMITROU Department of Computer Science and Engineering Chalmers University of Technology University of Gothenburg Gothenburg, Sweden 2023 A Chalmers University of Technology Master’s thesis template for LATEX KLEIO GKOUTZOMITROU © KLEIO GKOUTZOMITROU, 2023. Supervisor: Annikka Polster, Department of Biology and Biological Engineering Advisor: Helen Lindqvist, Biochemistry and Food Science (University of Gothen- burg) Examiner: Jean-Philippe Bernardy, Department of Computer Science and Engineer- ing Master’s Thesis 2023 Department of Computer Science and Engineering Chalmers University of Technology and University of Gothenburg SE-412 96 Gothenburg Telephone +46 31 772 1000 Cover: Description of the picture on the cover page (if applicable) Typeset in LATEX Gothenburg, Sweden 2023 iv A Chalmers University of Technology Master’s thesis template for LATEX KLEIO GKOUTZOMITROU Department of Computer Science and Engineering Chalmers University of Technology and University of Gothenburg Abstract This thesis examines how metabolomics data may be used to classify individuals based on the sources of protein in their diets. Developing accurate classification models that can distinguish between omnivores, vegans, vegetarians, and pescetari- ans is the aim of the study. Principal Component Analysis (PCA), Random Forest (RF), Support Vector Machines (SVM), and neural networks are used in this process as data analysis tools. The dataset, which was given by the Gothenburg University Department of Internal Medicine and Clinical Nutrition, included 120 healthy participants who followed various eating patterns. The subjects were chosen based on certain criteria, and blood samples and body composition were taken and examined. The dataset has been scaled and contains unidentified metabolites. The metabolic profile of the sample was shown using principal component analysis (PCA). The overall PCA analysis revealed that there was substantial individual vari- ation in the metabolomic profiles and that the food groups could not be effectively differentiated. The metabolic profiles of meat eaters and non-meat eaters might be used to distinguish them. Random Forest, SVM, and neural networks were the three machine learning tech- niques that were utilized for categorization. Neural Networks performed worse than Random Forest and SVM models in classifying each dietaryăgroup separately. Ran- dom Forest classified omnivores and non-omnivores with a high degree of accuracy. To measure the consumption of dairy, eggs, and meat, several scoring techniques were applied. The second method, which increased meat intake ratings by a factor of 1.5, produced the results with the highest degree of accuracy. This study sheds light on the metabolic effects of omnivorous diets and improves our understanding of the complex relationship between nutrition, metabolism, and health outcomes. It also highlights the potential of metabolomics and machine learning in predicting dietary patterns and categorizing people into different dietary categories. Keywords: metabolomics, machine learning, Principal Component Analysis, Ran- dom Forest, Support Vector Machines, Neural Networks. v Acknowledgements I would like to express my heartfelt gratitude to Annikka Polster, my supervisor, for her unwavering support, guidance, and valuable input throughout this project. Her expertise and mentorship have been instrumental in shaping the direction and execution of this thesis. I am also deeply grateful to Helen Lindqvist for generously providing me with the training data that served as a crucial foundation for my analysis. Her contribution has been pivotal in the success of this study, and I am sincerely appreciative of her assistance. Lastly, I would like to express my gratitude to my family, friends, and colleagues for their continuous support, encouragement, and understanding throughout this endeavor. Kleio Gkoutzomitrou, Gothenburg, 2023-06-16 vii Contents List of Figures xi List of Tables xiii 1 Introduction 1 1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.3 Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 2 Theory 3 2.1 Diet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2.2 Metabolomics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2.3 Nuclear Magnetic Resonance (NMR) . . . . . . . . . . . . . . . . . . 5 2.4 Statistical Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.4.1 Principal Component Analysis (PCA) . . . . . . . . . . . . . . 8 2.5 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.5.1 Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . 9 2.5.2 Unsupervised Learning . . . . . . . . . . . . . . . . . . . . . . 9 2.5.3 Semi-supervised Learning . . . . . . . . . . . . . . . . . . . . 9 2.5.4 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . 9 2.5.5 Applications of Machine Learning . . . . . . . . . . . . . . . . 9 2.5.6 Machine Learning Algorithms . . . . . . . . . . . . . . . . . . 10 2.5.6.1 Random Forest . . . . . . . . . . . . . . . . . . . . . 10 2.5.6.2 Support Vector Machine . . . . . . . . . . . . . . . . 11 2.5.6.3 Neural Networks . . . . . . . . . . . . . . . . . . . . 11 2.5.7 Cross Validation . . . . . . . . . . . . . . . . . . . . . . . . . 12 3 Methods 13 3.1 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 3.1.1 Unidentified Metabolites . . . . . . . . . . . . . . . . . . . . . 14 3.2 Pre-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 3.2.1 Missing Values . . . . . . . . . . . . . . . . . . . . . . . . . . 14 3.2.2 Scaling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 3.3 Statistical Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 3.4 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 3.4.1 Random Forest . . . . . . . . . . . . . . . . . . . . . . . . . . 15 ix Contents 3.4.2 Support Vector Machine . . . . . . . . . . . . . . . . . . . . . 18 3.4.3 Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . 19 3.5 Challenges and Limitations . . . . . . . . . . . . . . . . . . . . . . . 19 3.6 Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 4 Results 23 4.0.1 PCA Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 4.0.1.1 Overall PCA analysis results . . . . . . . . . . . . . 23 4.0.1.2 Gender-specific PCA analysis results . . . . . . . . . 23 4.0.1.3 Meat Consumption-specific PCA analysis . . . . . . 23 4.0.1.4 Outliers . . . . . . . . . . . . . . . . . . . . . . . . . 23 4.0.1.5 Most Important Features . . . . . . . . . . . . . . . 26 4.0.2 Classification Models Results . . . . . . . . . . . . . . . . . . 26 4.0.2.1 First Task: classifying each group separately . . . . . 26 4.0.2.2 Second Task: classifying omnivores and a combined group of vegetarians, vegans, and pescetarians . . . . 29 4.0.2.3 Third Task: classifying vegans and a combined group of vegetarians, omnivores, and pescetarians . . . . . 34 4.0.2.4 Fourth Task: classifying vegans and vegetarians to- gether and omnivores and pescetarians together . . . 35 5 Discussion and Conclusion 39 5.1 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 5.2 Limitations and Future Work . . . . . . . . . . . . . . . . . . . . . . 43 5.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 Bibliography 45 A Appendix 1 I x List of Figures 2.1 NMR spectroscopy: a toolset in metabolism studies. Pictorial rep- resentation of the various ways NMR spectroscopy can be used in metabolic studies, such as (A) structure elucidation, (B) quantita- tive NMR (qNMR), (C) metabolomics, (D) metabolite-protein inter- actions, and (E) isotope-tracing metabolomics or stable isotope re- solved metabolomics (SIRM) [19]. . . . . . . . . . . . . . . . . . . . . 6 2.2 NMR spectrum with identified metabolites. A visual representation of an NMR spectrum showing the spectral peaks corresponding to different identified metabolites. This illustration helps to enhance understanding and interpretation of NMR spectroscopy [21]. . . . . . 7 2.3 Neural networks, which are set up in layers and comprise a collection of linked nodes. Tens or even hundreds of hidden layers are common in networks. [34]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 4.1 3D plot of the PCA method. . . . . . . . . . . . . . . . . . . . . . . . 24 4.2 3D plot of the PCA method only for the men samples. . . . . . . . . 24 4.3 3D plot of the PCA method only for the women samples. . . . . . . . 24 4.4 3D plot of the PCA method for meat eaters and non-meat eaters. . . 25 4.5 20 most important features from RF model (Task 1). The seventh most important feature is a combination of phosphocholine, acetyl- choline, phosphoethanolamine and lipids/ffa, but it is not visible in the figure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 4.6 Confusion matrix for RF in the task 1. . . . . . . . . . . . . . . . . . 28 4.7 Confusion matrix for SVM in the task 1. . . . . . . . . . . . . . . . . 29 4.8 Confusion matrix for the RF in the task 2. . . . . . . . . . . . . . . . 30 4.9 The 50 most important features from RF model (Task 2). . . . . . . . 31 4.10 Confusion matrix for Random Forest in the task 3. . . . . . . . . . . 35 4.11 The 50 most important features from RF model (Task 3). The first most important feature is a combination of phosphocholine, acetyl- choline, phosphoethanolamine, and lipids/ffa, but it is not visible in the figure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 4.12 The 50 most important features from the RF model (Task 3). The 12th most important feature is a combination of phosphocholine, acetyl- choline, phosphoethanolamine, and lipids/ffa, but it is not visible in the figure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 xi List of Figures xii List of Tables 3.1 Confusion Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 4.1 Performance metrics for random forest model (Task 1) . . . . . . . . 26 4.2 Performance metrics for SVM model (Task 1) . . . . . . . . . . . . . 28 4.3 Performance metrics for neural network model (Task 1) . . . . . . . . 29 4.4 Performance metrics for random forest model (Task 2) . . . . . . . . 30 4.5 Performance metrics for RF, meat consumption score. . . . . . . . . . 32 4.6 Performance metrics for RF, meat/dairy/eggs consumption score with- out vegans. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 4.7 Performance metrics for RF, meat/dairy/eggs consumption score 1.5 x for omnivores, without vegans. . . . . . . . . . . . . . . . . . . . . . 33 4.8 Performance metrics for RF, meat/dairy/eggs consumption score from previous research, without vegans. . . . . . . . . . . . . . . . . . . . . 34 4.9 Performance metrics for the RF model (Task 3) . . . . . . . . . . . . 34 4.10 Performance metrics for SVM model (Task 3) . . . . . . . . . . . . . 35 4.11 Performance metrics for random forest model (Task 4) . . . . . . . . 37 4.12 Performance metrics for SVM model (Task 4) . . . . . . . . . . . . . 37 xiii List of Tables xiv 1 Introduction In this chapter, we will provide a brief background on the use of blood metabolomics, the investigation of blood’s small molecules known as metabolites, to identify dietary protein intake with machine learning methods. We will discuss why accurately assessing dietary protein intake is an important research problem and how traditional methods have limitations in addressing this problem. We will then introduce the potential of blood metabolomics and machine learning to provide a more accurate and personalized approach. Specifically, we will focus on the problem of identifying dietary protein intake and the approach we have implemented. We will define the scope of our work, including any assumptions and limitations. Next, we will present the contributions of this thesis and outline the chapters that follow. 1.1 Background A healthy diet is essential for good health and nutrition. In recent years, there has been growing concern about the negative health effects of meat consumption, particularly red meat [1]. This concern has prompted interest in vegetarianism or "flexitarianism," which involves eating less meat or being vegetarian but consuming fish. As a result, the "omnivore dietary category" now includes those who consume a variety of meat and fish, as well as those who consume less meat or no meat at all. Some vegetarians, known as lacto-ovo vegetarians, replace meat with full-fat dairy foods like cheese and eggs. Others follow an essentially vegan diet, replacing dairy foods with novel alternatives based on soy, rice, or oats. A vegan diet forgoes all foods of animal origin. These contemporary changes in food consumption may ac- count for the conflicting findings on health effects in research contrasting vegetarian and omnivorous diets [2]. We research nutrition because dietary intake affects our health in significant ways [3]. However, measuring diet with objectivity is challenging. Metabolomics could be useful for this task. Metabolomic analyses offer an option to study a comprehensive set of small molecules present in biofluids, cells or tissue, and allows for the profiling of thousands of molecules [4]. Blood metabolomics, the comprehensive analysis of small molecules in the blood, is a promising tool for assessing dietary protein intake and other aspects of nutrition. Machine learning (ML) is a powerful tool for analyzing large and complex data sets, including blood metabolomics data. The 1 1. Introduction combination of blood metabolomics and ML methods has the potential to provide new insights into the complex relationship between dietary protein intake and health outcomes [5]. 1.2 Methods The research methodology involved collecting blood samples from participants with varying dietary protein sources, including meat-eaters, fish-eaters, and vegetarians. The blood samples will be analyzed using Nuclear magnetic resonance (NMR), a powerful analytical technique for identifying and quantifying small molecules in com- plex biological samples. The resulting data will be preprocessed and analyzed using statistical methods and ML methods, including feature selection, dimensionality reduction, and classification algorithms. The significance of this research lies in its potential to advance our understanding of the relationship between dietary protein intake and health outcomes. Accurately assessing dietary protein intake using blood metabolomics and ML methods could improve our ability to develop personalized nutrition recommendations that are tailored to an individual’s specific needs and goals. 1.3 Structure The structure of this Master’s thesis is as follows: • Chapter 1 provides an overview of the research background, objectives, and significance. • Chapter 2 reviews the relevant literature on blood metabolomics, dietary protein intake, and ML methods. • Chapter 3 describes the research methodology in detail, including study de- sign, data description, and analysis methods. • Chapter 4 presents the study’s results, including biomarker identification, model development, and model evaluation. • Chapter 5 summarizes the main findings and conclusions, as well as recom- mendations for future research. 2 2 Theory The theory chapter of this thesis covers the fundamentals of diet and how it affects the human body, including the different types of macronutrients and their role in the body. Furthermore, the chapter explains the concept of metabolomics and its importance in the study of nutrition, including the techniques used in metabolomics analysis such as nuclear magnetic resonance (NMR) spectroscopy. The chapter then delves into the statistical method of principal component analysis (PCA) and its applications in metabolomics research. The section on machine learning provides an overview of the different types of machine learning algorithms used in metabolomics, including supervised and unsupervised learning. The focus will be on the use of machine learning techniques, such as support vector machines (SVMs), random forest, and deep learning, to analyze metabolomics data for identifying biomarkers of dietary protein intake. 2.1 Diet Diet is defined as the sum of foods consumed by an individual or population and plays a crucial role in maintaining good health and preventing diseases [6]. The human diet consists of macronutrients and micronutrients. Macronutrients are nu- trients that are required in large quantities by the body and include proteins, car- bohydrates, and fats. Micronutrients are nutrients required in smaller quantities by the body. These include vitamins and minerals, which are essential for various physiological processes in the body. The macronutrients known as proteins are made up of amino acids, which are the body’s building blocks. Proteins are necessary for the production of enzymes, hor- mones, and other compounds as well as for the development and upkeep of human tissues. There are 20 different types of amino acids, and the body can synthesize some of them, whereas others are essential and must be obtained from the diet. Sources of dietary protein include animal products such as meat, fish, eggs, and dairy, as well as plant-based sources such as legumes, nuts, and seeds [7]. Pro- teins from animal sources contain all essential amino acids, which are the building blocks of proteins that the body cannot produce on its own. In contrast, vegetarian sources of protein may lack one or more essential amino acids. Therefore, different plant-based protein sources must be combined to ensure an adequate intake of all the essential amino acids. This is known as protein complementation and is often 3 2. Theory necessary for vegetarians and vegans to meet their daily protein requirements [8]. Additionally, vegetarian protein sources often contain dietary fiber, which can have a positive impact on digestion and overall health. Protein content and quality can vary widely among different foods [9]. While most foods contain some amount of protein and amino acids, high protein sources are typically found in animal muscle products such as meat and fish. However, protein sources also differ in other metabolites such as carbohydrate content, fatty acids, and other micronutrients. For example, while meat and fish are low in carbohy- drates, they may contain varying amounts of saturated and unsaturated fatty acids, depending on the type of animal and its diet. In contrast, vegetarian protein sources such as legumes, nuts, and seeds, may contain higher levels of carbohydrates, fiber, and other micronutrients such as vitamins and minerals [9]. Carbohydrates are a subset of macronutrients that are the main sources of energy in the body [10]. They contain both complex carbohydrates, such as starch and fiber, and simple sugars, such as glucose, fructose, and galactose. Honey, fruits, and vegetables all have simple carbohydrates, but grains and legumes all include complex carbohydrates. Most carbohydrates provide energy by being broken down into glucose, which is used by the body as fuel. However, there are some exceptions to this. For example, some carbohydrates, such as dietary fiber, are not fully broken down by the human body and therefore do not provide energy in the same way as other carbohydrates. However, dietary fiber plays an essential role in maintaining good health by promoting the growth of beneficial bacteria in the colon, which can improve digestion and reduce the risk of certain diseases. Additionally, some carbohydrates, such as sugar alcohols, are only partially absorbed and utilized by the body for energy [11]. Fats are macronutrients that are important for the absorption of fat-soluble vitamins and play a role in hormone production [12]. While dietary fats contribute to the body’s energy supply, it is important to note that adipose tissue, which stores fat, is not directly equivalent to dietary fat. Adipose tissue serves as a storage site for excess energy in the form of triglycerides, which can be derived from dietary fats as well as other sources. Dietary fats consist of fatty acids, which can be categorized as saturated or unsaturated. Saturated fats are commonly found in animal products such as meat and dairy, while unsaturated fats are predominantly present in plant-based sources like nuts, seeds, and vegetable oils. Research suggests that excessive consumption of saturated fats has been associated with an increased risk of heart disease [13], whereas unsaturated fats are generally considered to be healthier options. 2.2 Metabolomics An increasing number of research studies has demonstrated that metabolomics ap- pear to be a possible objective tool to identify habitual intake of meat and other animal products in healthy subjects adhering to a vegan, vegetarian, or omnivore diet [14], [15], [16]. By studying metabolomics, we can identify the source and the 4 2. Theory amount of protein intake. Metabolomics is the study of small molecules or metabolites that are present in a biological sample [17]. The study of metabolomics is important in understanding the complex interactions between diet and health. A metabolite profile in the blood can reflect the metabolic state of an individual and provide insight into how different dietary patterns impact their health. Proteins are broken down into amino acids during digestion. The metabolites pro- duced during protein metabolism can be measured in blood, providing a snapshot of an individual’s dietary protein intake. Carbohydrates and fats also have specific metabolic pathways and produce metabolites that can be measured, providing a broader picture of an individual’s overall dietary intake. The ability to measure and analyze metabolites in blood will lead to advancements in the field of personalized nutrition. By identifying specific metabolites associ- ated with certain dietary patterns, individuals can be provided with personalized nutrition recommendations tailored to their unique metabolic profile. This has the potential to improve overall health outcomes and prevent chronic diseases associated with poor dietary habits. 2.3 Nuclear Magnetic Resonance (NMR) NMR spectroscopy, a powerful analytical technique, is employed to investigate the chemical and physical properties of molecules [18]. It is based on how specific atomic nuclei, such as those of hydrogen, carbon, and nitrogen, interact with a magnetic field. These nuclei have the ability to both absorb and release radiofrequency radia- tion when put in a high magnetic field. The intensity of the magnetic field and the atoms surroundings affect the frequency of radiation that is absorbed. In NMR spectroscopy, a sample is placed in a strong magnetic field, typically gener- ated by a superconducting magnet. After that, radiofrequency radiation is applied to the sample, usually in the form of a pulse. The samples nuclei take in the radi- ation, and as a result of this absorption, they align with the magnetic field. The nuclei relax back to their initial condition once the pulse is switched off and begin to release radiofrequency radiation. A sensitive antenna picks up this radiation, and the signal is processed to create an NMR spectrum [19]. The chemical and physical characteristics of the molecules in the sample are re- vealed by the NMR spectrum. The various types of nuclei in the sample and their surroundings are represented by the peaks in the spectrum. The frequency of radia- tion absorbed, which is correlated with the magnetic field and atomic environment, determines the position of the peak. The amount of nuclei in that environment has an impact on the peaks strength. We can ascertain the kind, number, and structure of the molecules in the sample by examining the NMR spectra [18]. The study of metabolism holds an interest in a number of NMR techniques we can see in Figure 2.1, including metabolomics analysis, metabolite identification 5 2. Theory Figure 2.1: NMR spectroscopy: a toolset in metabolism studies. Pictorial represen- tation of the various ways NMR spectroscopy can be used in metabolic studies, such as (A) structure elucidation, (B) quantitative NMR (qNMR), (C) metabolomics, (D) metabolite-protein interactions, and (E) isotope-tracing metabolomics or stable isotope resolved metabolomics (SIRM) [19]. and structure elucidation, metabolite quantification (qNMR), the use of stable iso- topes in metabolism investigations, and metabolite-protein interactions, among oth- ers. NMR is a versatile spectroscopy that can be used to answer questions about metabolism in a variety of biological systems. This spectroscopy can also help to clarify fundamental biochemical concepts such as metabolite identification, quan- tification, and turnover, metabolic activities, organelle compartmentalization, and metabolite interactions with macromolecules for enzymology or regulatory events [19]. Although there are different methods for metabolomics analysis, such as LC- MS, NMR spectroscopy offers its own unique advantages [20]. The method used in this study is 1H-NMR. To provide a visual representation of NMR spectroscopy, Figure 2.2 displays a spec- trum along with identified metabolites. This figure illustrates the spectral peaks corresponding to different metabolites, allowing a clearer understanding of the anal- ysis for readers who may be less familiar with spectroscopy. 6 2. Theory Figure 2.2: NMR spectrum with identified metabolites. A visual representation of an NMR spectrum showing the spectral peaks corresponding to different identified metabolites. This illustration helps to enhance understanding and interpretation of NMR spectroscopy [21]. 7 2. Theory 2.4 Statistical Methods 2.4.1 Principal Component Analysis (PCA) Principal Component Analysis (PCA) is a widely used statistical technique for re- ducing the dimensionality of high-dimensional data by identifying and extracting the most important features that capture the majority of the variance in the data [22]. PCA is a linear transformation method that transforms the original data into a new set of variables, called principal components, which are linear combinations of the original variables. Each successive principal component effectively captures the largest possible part of the remaining variance, since they are orthogonal and uncorrelated [23]. The coefficients of this linear combination are called "loadings", and they represent the contribution of each variable to that particular principal component. PCA decreases the dimensionality of the data by projecting it onto a lower-dimensional space and identifying the directions in the data that hold the greatest information. PCA (Principal Component Analysis) is a valuable tool in the analysis of metabolomics data, which typically involves numerous correlated variables (metabolites) [19]. It makes it possible to spot data patterns and trends that may not have been imme- diately obvious in the original dataset. Additionally, PCA makes feature selection easier by highlighting the metabolites that have the greatest impact on the data’s overall variability. This is particularly helpful in metabolomics studies if there are more metabolites than samples. It is feasible to identify the main metabolites in charge of the variations across samples by using PCA to reduce the dimensional- ity of the data. This method can be useful in locating possible biomarkers linked to particular medical disorders or dietary changes. However, it is crucial to make sure that the findings of PCA are carefully interpreted, taking into consideration the constraints and probable causes of data variance. To use PCA’s capabilities in metabolomics analysis successfully, other adjustments and considerations could be required. 2.5 Machine Learning Machine learning is a field of computer science and artificial intelligence that focuses on the development of algorithms and statistical models that enable computers to automatically learn from and make predictions or decisions based on data. In essence, machine learning is the process of training computer programs to recognize patterns and make decisions based on input data. Supervised learning and unsupervised learning are the two primary categories of machine learning. Other categories of machine learning are semi-supervised learning and reinforcement learning. 8 2. Theory 2.5.1 Supervised Learning In supervised learning, a model is trained using data that has already been classified or labeled with the desired outcome. The model then applies these labeled data to fresh, unobserved data to generate predictions or categorize them [24]. 2.5.2 Unsupervised Learning Unsupervised learning, in contrast, entails developing a model using a dataset that has not been categorized or labeled. In order to uncover significant insights or groups, the model then independently searches the data for patterns or structure without using any pre-existing labels [25]. 2.5.3 Semi-supervised Learning Between supervised and unsupervised learning is semi- supervised learning. During the training phase, it blends a small quantity of labeled data with a big amount of unlabeled data and utilizes context to spot data trends [26]. This technique, for instance, can be applied to classification situations when the solution calls for a supervised learning algorithm but little labeling. Due to the fact that it uses a combination of labeled and unlabeled data, it is quicker than supervised learning. Generative models, low-density separation, Laplacian regularization, and heuristic methods are a few examples. There are not many reported applications for this method in the field of metabolomics. 2.5.4 Reinforcement Learning Unsupervised ML was guided using the reinforcement learning approach, which rewards good behavior and penalizes undesirable behavior. The model’s capacity to link desired inputs and outputs is strengthened by positive feedback [26]. In a number of fields, including game theory, operations research, and swarm intelligence, reinforcement learning has drawn a lot of interest. It provides a potential foundation for training models to make the best choices possible, depending on feedback from the outside world. We can speed up learning, improve the efficiency of our models in identifying complex patterns, and optimize results by using reinforcement learning. 2.5.5 Applications of Machine Learning Machine learning is becoming increasingly important in many fields, including medicine, finance, and marketing, where large amounts of data are collected and analyzed to extract useful information and make predictions or decisions. Due to the enormous amount of data produced by high-throughput analytical methods, machine learning has also recently grown to be a crucial tool in metabolomics research [27]. This data may be analyzed and interpreted using machine learning algorithms, and predictions based on the found patterns can then be made. One of machine learning’s main ben- efits is its capacity for handling massive, complicated data sets and learning from them, enabling more precise predictions and insights. In this thesis, we have used 9 2. Theory machine learning techniques to predict the dietary source of protein intake from blood metabolomics data. 2.5.6 Machine Learning Algorithms 2.5.6.1 Random Forest Random forest is a popular machine learning algorithm used for classification, regres- sion, and other tasks. It leverages an ensemble approach, where multiple decision trees are created, and their predictions are combined to provide more accurate out- comes [28]. In this context, a decision tree is a tree-like model that recursively partitions the data based on selected features, aiming to achieve purity or a specific level of impurity. A random selection of characteristics and data samples from the original dataset is used to build each decision tree. The process of merging the predictions from each of these separate trees to get the final prediction is known as the ensemble approach. The random forest technique can capture a larger range of patterns and enhance overall performance by combining predictions from many trees. Measures like Gini impurity or information gain are frequently used to evaluate the quality of the splits inside the decision trees. A decision tree node’s impurity or homogeneity may be measured using the Gini impurity metric. It calculates the likelihood that a randomly selected element in the node will be classified incorrectly [29]. A purer node where the majority of the components fall within the same class or category is one with a lower Gini impurity score. Similarly, node impurity refers to the impurity or heterogeneity of a specific node in a decision tree. In a random forest, a voting mechanism is used to integrate the forecasts of many decision trees. Each tree makes a forecast, and the prediction with the highest per- centage is chosen as the outcome [28]. This method enhances the model’s accuracy while reducing overfitting. With high-dimensional datasets, random forest can still maintain accuracy while handling missing data. Random forest include knowledge regarding feature significance as one of its primary benefits. This is useful for locating a dataset’s most crucial characteristics and learning more about the underlying data. Additionally, random forest is scalable, has high computational efficiency, and can handle both categorical and numerical data. However, it might not work well when there are correlated features and can be sensitive to noisy data. Correlated features can cause problems including duplicated data, a decline in decision tree variety, and a higher chance of overfitting [30]. When characteristics are closely connected, the model could give them similar weights, which could distort judgment and prevent precise predictions. Additionally, linked features might restrict the variety of decision trees, making it more difficult for the model to fully capture the range of patterns in the data. On the other hand, noisy data might result in overfitting, cause inconsistencies, and affect evaluations of feature significance. Noisy data might result in inaccurate fluctuations that affect decision rules and reduce the model’s overall reliability and accuracy. Correlated 10 2. Theory characteristics and noisy data must be addressed and taken into consideration in the preprocessing step in order to guarantee the optimal performance of a random forest model. In general, random forest is an effective machine learning technique that may be used to solve many different types of issues. It has been used in many different industries, including biology, banking, and image recognition. 2.5.6.2 Support Vector Machine Support Vector Machines (SVMs) are a type of supervised learning algorithm that can be used for both classification and regression tasks [31]. SVMs are based on the concept of finding the best hyperplane that separates different classes of data. The ideal boundary between classes is chosen to be the hyperplane with the largest distance from the closest data points. In other words, SVMs look for the decision boundary that maximizes the margin or distance between the data points and the decision border between classes of data [32]. SVMs work by converting the input data into a high-dimensional feature space, which makes it easier to distinguish between the data points. A kernel function is used to execute this transformation, mapping the input data from the original space to the higher-dimensional space [32]. The SVM algorithm then looks for the hyperplane that best separates the data points once the data has been processed. In comparison to other classification algorithms, SVMs have a number of advantages, including the ability to handle high-dimensional data and the ability to handle non- linearly separable data using various kernel functions [33]. SVMs have also been demonstrated to work well in a wide range of applications and have a solid theoretical underpinning. In the context of metabolomics, SVMs have been used to classify samples based on their metabolic profiles, such as distinguishing between different disease states or identifying different types of food intake [5]. SVMs have also been applied to fea- ture selection to determine the most crucial metabolites for differentiating between various sample groups. 2.5.6.3 Neural Networks Neural networks are a class of machine learning algorithms that are modeled after the structure and function of the human brain. A neural network is made up of a number of linked nodes, or neurons, that process and send information. A neural network’s architecture generally consists of an input layer, one or more hidden layers, and an output layer, as shown in Figure 2.3. The connections between the neurons are weighted according to the strength of the relationships between them and connect neurons in each layer to neurons in the layer above them. Neural networks are commonly used in supervised learning tasks, where the model is trained on labeled data to make predictions or classifications on new, unseen data. To reduce the difference between the projected output and the actual output, the 11 2. Theory Figure 2.3: Neural networks, which are set up in layers and comprise a collection of linked nodes. Tens or even hundreds of hidden layers are common in networks. [34]. weights of the connections between neurons are altered during learning. The error between the projected output and the actual output is calculated in this procedure, which is also known as backpropagation, and it is then transmitted back through the network to change the weights [35]. Neural networks may be used for unsupervised learning tasks in addition to super- vised learning, where the model is trained on unlabeled data to find patterns and correlations in the data. Unsupervised learning is frequently used for tasks like dimensionality reduction and clustering. In a variety of applications, including computer vision, natural language processing, and speech recognition, neural networks have shown to be quite successful [36]. They may, however, be computationally expensive and need a lot of data to train well. 2.5.7 Cross Validation In machine learning, the cross-validation approach is used to evaluate a model’s per- formance. Cross-validation’s fundamental idea is that the model should be trained and tested using several subsets of the data. A model is trained and tested k times using a distinct subset as the test set and the remaining subsets as the training set in k-fold cross-validation. The data is separated into k subsets [37]. The problem of overfitting, which can happen when a model is too complicated and catches noise in the training data, is helped by the use of cross-validation. We may obtain a more accurate approximation of the model’s generalization performance by assessing the model’s performance on various subsets of the data. 12 3 Methods In this section, a description of the project’s process is presented. The procedures em- ployed are explained, along with an elucidation of the reasoning behind the decision- making process. 3.1 Data Collection The dataset used in this project was provided by the Department of Internal Medicine and Clinical Nutrition at Gothenburg University. The data were collected as part of a study aimed at identifying techniques for measuring habitual dietary exposure [14]. Prior to this project, the dataset underwent extensive preprocessing steps to ensure its suitability for analysis. These steps are described in the following section. The study enrolled 120 healthy volunteers, including 45 men and 75 women, who complied with habitual vegan, (lacto- ovo-)vegetarian, or omnivore diets [14]. Addi- tionally, a fourth group of pescetarians was included. Volunteers were recruited through advertisements and were considered suitable if they were between 18 and 65 years old, healthy, and had a BMI between 18 and 30. The screening process included clinical markers (to exclude participants with diseases), a short lifestyle questionnaire, a 4-day weighed food diary, and a food- frequency questionnaire (FFQ) developed specifically for the study. The FFQ in- cluded questions about food intake related to soy or soy products, legumes, vegeta- bles, fruit and berries, milk products, eggs and egg-based foods, fish and shellfish, poultry, red meat, and cookies and confectionery [38]. Body composition was measured with bioimpedance analysis (ImpediMed Bioimp version 5.3.1.1), and volunteers who were pregnant, lactating, or who used nicotine products regularly were excluded. Participants provided written informed consent before entering the study. Serum samples were collected and analyzed for metabo- lites using 1H-NMR spectroscopy. The study was conducted in two periods, from April to May 2013 and from August to December 2015. Volunteers were not allowed to drink alcohol the night before sampling or consume food supplements 1 week before sampling [38], [14]. 13 3. Methods 3.1.1 Unidentified Metabolites It is significant to note that certain metabolites were either left undetermined or were only tentatively assigned during the examination of the blood samples using 1H-NMR spectroscopy. The presence of unidentified metabolites can stem from several factors inherent to metabolomics studies. The complexity of the metabolomic profile and the limits of the available reference databases are two aspects that contribute to unidentified metabolites. There are still certain metabolites that have not been properly defined or included in the existing databases despite significant attempts to create comprehensive databases. As a result, it might not be possible to link particular NMR peak spectrum patterns to known metabolites. Another factor is the lack of accurate mass spectral libraries for comprehensive iden- tification. Mass spectral libraries are essential for metabolite identification because they make it possible to compare experimental and reference spectra. However, reliable identification may be limited by the lack of reference spectra for specific metabolites or by spectrum variability brought on by circumstances unique to a given sample. 3.2 Pre-processing In the pre-processing section, the steps taken to prepare the dataset for analysis are described. 3.2.1 Missing Values The first step in the data preprocessing was to check for missing values. As the dataset had been previously used for research purposes, we checked whether missing values had been addressed by the previous researchers [14], [38]. No missing values were found in the dataset we received for this study. 3.2.2 Scaling The next step in the data preprocessing was to perform Unit Variable (UV) scaling, which is a widely used scaling method in spectroscopy, including NMR spectroscopy. It can be difficult to compare spectra between samples because to variances in signal intensities caused by variations in sample concentrations, which are addressed via UV scaling. Each data point is divided by the square root of the corresponding feature’s standard deviation in order to address this problem. We adjust the data for variations in sample concentration using UV scaling, making it possible to compare and analyze NMR data in a more insightful manner. One reason why UV scaling is a popular choice for NMR data is that it does not affect the shape of the spectra or the relative intensities of the signals [39]. This means that the scaling does not distort the data in any way, which can be important for subsequent analyses, such as feature selection and modeling. 14 3. Methods 3.3 Statistical Methods We chose to use Principal Component Analysis (PCA) in this thesis project. PCA is a widely used method for reducing the dimensionality of data while retaining the most important information. It has been successfully applied in many different fields, including metabolomics [40]. In this study, we used PCA to select the most relevant features from the metabolomics dataset, which helped to reduce the dimensionality of the data and identify the most important variables. PCA is a powerful tool that can help to identify trends and patterns in large datasets. In order to perform the PCA, the Python library scikitlearn was used, and the function PCA() was called with n_components = 4, which specifies that we want to keep the top 4 principal components [41]. The data were then fitted using the PCA function, which created a new dataset made up of the chosen principle components and was put in a new dataframe dubbed pca_metabolomics. To visualize the results of the PCA, a 3D scatter plot was created using the mat- plotlib library [42], with each point in the plot representing a sample in the dataset. The x, y, and z coordinates of each point corresponded to the values of the first three principal components, respectively. Different colors and markers were used to distinguish between samples belonging to different classes. After performing the PCA, the most important features in the dataset were identified by examining the absolute values of the components using the print statement at the end of the code. This step helped to identify which variables were contributing the most to the separation between the different sample classes in the PCA plot. The purpose of this step was to select and retain only the most influential features, thereby reducing the dimensionality of the dataset. This reduction can increase computational effectiveness and assist in focusing the emphasis on the significant variables that contribute to the observed patterns and changes in the data. 3.4 Machine Learning 3.4.1 Random Forest Random forest is a popular machine learning algorithm used for classification tasks. It is an ensemble method that constructs multiple decision trees and combines their results to improve accuracy and prevent overfitting [28]. In this study, we utilized the scikitlearn library in Python to implement a random forest model. The first step in building a random forest model is to split the dataset into training and testing sets. We used the train_test_split function from scikitlearn to randomly split the dataset into 80% training data and 20% testing data [43]. We set the random state to 42 to ensure reproducibility. Setting a specific random state value ensures that the same random sequence is generated each time the algorithm is run, making the results consistent and reproducible. The choice of the number 42 is arbitrary, and any other integer value could have been used instead. 15 3. Methods Next, we defined the random forest classifier and created a parameter grid using the param_grid dictionary. The n_estimators parameter specifies the number of decision trees to be used in the random forest. The max_depth parameter con- trols the maximum depth of the decision trees. The min_samples_split parameter specifies the minimum number of samples required to split an internal node. The min_samples_leaf parameter specifies the minimum number of samples required to be at a leaf node. Finally, the max_features parameter specifies the maximum number of features to be considered when splitting a node. We used the GridSearchCV function to perform a grid search over the parameter grid to find the best combination of hyperparameters [44]. The CV parameter specifies the number of cross-validation folds to be used. We used 5-fold cross-validation to obtain reliable estimates of the model’s performance. In the specific case of using 5-fold cross-validation with the Random Forest model, it means that we divided the data into 5 subsets, and trained the model on 4 subsets while using the remaining subset as the test set. We repeated this process 5 times, each time using a different subset as the test set. This allowed us to obtain a more reliable estimate of the model’s performance by averaging the performance metrics over the 5 test sets. After finding the best hyperparameters, we created a new random forest classifier with the optimized hyperparameters and fit it to the training data using the fit function. We then used the fitted model to predict the classes of the testing data using the predict function. To evaluate the performance of the random forest model, we calculated the confusion matrix and the classification report. The confusion matrix shows the number of true positive, true negative, false positive, and false negative predictions as shown in Table 3.1. The classification report provides metrics such as precision, recall, support and F1-score for each class. Accuracy, precision, recall, support and F1 score are commonly used metrics to evaluate the performance of a classification model. These metrics are calculated using the confusion matrix. Table 3.1: Confusion Matrix Actual Class Positive Negative Pr ed ic te d C la ss Positive True Posi- tive (TP) False Posi- tive (FP) Negative False Nega- tive (FN) True Nega- tive (TN) Accuracy is the most commonly used metric, and it represents the proportion of correct predictions among all predictions. However, it can be misleading in cases where the classes are imbalanced, meaning that one class has many more observa- tions than the other. For example, if a model is predicting whether a patient has a 16 3. Methods rare disease or not, and 99% of the patients do not have the disease, a model that always predicts "no disease" will have an accuracy of 99%, but it is not useful in practice [45]. Accuracy = TP + TN TP + FP + FN + TN (3.1) Precision represents the proportion of true positives among all predicted positives. It is a useful metric when the cost of false positives is high [45]. For example, consider a predictive model designed to identify the presence of a particular medical condition in a patient, based on a set of observable symptoms. In this case, a false positive (a patient who is predicted to have the condition but actually does not) could lead to unnecessary medical procedures, such as surgery or medication, that can have negative side effects on the patient’s health. On the other side, a false negative (a patient who is expected not to have the illness but really has) might cause a delay in treatment, which may result in the disease progressing and potentially even death. In order to guarantee the accuracy of the model’s predictions, precision is a critical parameter in medical diagnosis. Precision = True Positive True Positive + False Positive (3.2) Recall represents the proportion of true positives among all actual positives. It is a useful metric when the cost of false negatives is high. For example, in the case of a model that predicts whether a patient has a disease or not, a false negative (a patient that has the disease but is classified as healthy) can be life-threatening [45]. Recall = TP TP + FN (3.3) The support metric, commonly referred to as the sample size or the number of instances, offers important insights into how classes are distributed across datasets. It displays the proportion of observations in the dataset that correspond to each type. The support indicator is crucial for assessing classification performance since it determines how reliable the outcomes are. A larger number of cases for a given class are indicated by a higher support value, indicating a more accurate and trustworthy evaluation. A lower support value, on the other hand, denotes a fewer number of instances, which might result in less accurate estimations of performance indicators. The samples included in the test set are represented by the support numbers in the evaluation metrics, not the dataset’s overall sample size. This distinction results from the division of the data into training and testing subsets for the purpose of evaluating the effectiveness of the classification model. As a result, the evaluation metrics’ support numbers may look lower than the dataset’s overall sample sizes. F1 score is the harmonic mean of precision and recall, and it provides a balance between the two metrics. It has a scale from 0 to 1, with 1 denoting optimal performance, which minimizes both false positives and false negatives. When both 17 3. Methods types of errors are significant and must be taken into account when assessing the performance of the model, the F1 score can be particularly helpful [45]. F1 Score = 2 · Precision · Recall Precision + Recall (3.4) Precision, recall, and F1-score may still be calculated and used to assess the model’s performance while making predictions for many classes. However, as compared to binary categorization settings, their interpretation becomes more complex. When using multiple classes, accuracy, which measures the percentage of true posi- tives among all occurrences predicted to fit in a given class, is determined separately for each class. Recall, often referred to as sensitivity, is determined for each class and denotes the percentage of cases that really fit in a given category out of all occurrences. The F1-score can also be calculated for each class using the harmonic mean of precision and recall. Overall, we chose to use random forest because it is a powerful and flexible algo- rithm that can handle high- dimensional data and nonlinear relationships between features and target variables. The use of grid search and cross- validation allowed us to find the optimal hyperparameters and obtain reliable estimates of the model’s performance. 3.4.2 Support Vector Machine To implement SVM, we first split the data into training and testing sets using an 80-20 ratio [43]. SVM is a binary classification algorithm by default, meaning it separates data into two classes. However, it can be extended to handle multiclass classification tasks through techniques like one-vs-rest or one-vs-one, where multiple binary classifiers are trained to distinguish each class from the rest [46]. We then normalized the data using the StandardScaler function, which standardizes the data by removing the mean and scaling to unit variance. Next, we defined the SVM model using the Support Vector Classification (SVC) function from the scikitlearn library [41]. To find the best hyperparameters for the SVM model, we used a grid search with cross-validation. We defined a hyperparameter grid with different values for the regularization parameter C, the kernel function, and the gamma parameter. The regularization parameter (C) determines the trade-off between achieving a low train- ing error and a low complexity model. To investigate various levels of regularization, we took into account a range of C values, namely [0.001, 0.01, 0.1, 1, 10]. In addi- tion, we investigated the linear, RBF, and polynomial kernel functions. The gamma parameter regulates how much each training sample has an impact on the decision boundary. For gamma, we took into account the values [0.001, 0.01, 0.1, 1, 10] to evaluate the effects of various degrees of influence. It is important to note that the selection of hyperparameters is influenced by the particular dataset and issue at hand. The grid search performed a five-fold cross-validation to evaluate the model 18 3. Methods with each combination of hyperparameters and returned the set of hyperparameters that yielded the best performance on the training data. We then trained an SVM model with the best hyperparameters on the training data and evaluated its performance on the testing data using the classification_report function from scikit learn. The classification report provides a summary of the precision, recall, and F1 score for each class, as well as the the overall accuracy of the model. 3.4.3 Neural Network Neural networks are a type of machine learning model that can learn to recognize patterns in data. Specifically, we used a feedforward neural network with three layers: an input layer with 237 nodes, a hidden layer with 64 nodes, and another hidden layer with 32 nodes. The output layer had four nodes, each corresponding to one of the four diet types in the dataset. The neural network is built using the Keras API with a sequential model structure. The first layer has 64 neurons and uses the rectified linear unit (ReLU) activation function, which is commonly used in neural networks for its ability to handle non- linearity. The input dimension of this layer is set to 237, which is the number of features in the input data. The second layer has 32 neurons and also uses the ReLU activation function. This layer is followed by a dropout layer, which helps prevent overfitting by randomly dropping out some of the neurons during training [47]. The final layer is a dense layer with 4 neurons, which corresponds to the number of classes in the output data. This layer uses the softmax activation function, which is commonly used in multi-class classification problems. The model is compiled using the categorical cross-entropy loss function, which is commonly used for multi-class classification problems. The optimizer used is Adam, which is an adaptive learning rate optimization algorithm that is commonly used in deep learning [48]. During training, the model is fed with mini-batches of 32 data points and trained over 100 epochs. The performance of the model is evaluated using the accuracy metric, which measures the proportion of correctly classified samples. Overall, the neural network architecture we used is a relatively simple feedforward network with three dense layers. The ReLU activation function is used in the hidden layers to introduce non-linearity, and the softmax activation function is used in the output layer to predict the probabilities of the different classes [49]. 3.5 Challenges and Limitations The present study acknowledges certain limitations inherent in the methodologies employed for the execution of this thesis project. 19 3. Methods One limitation of PCA is that it assumes a linear relationship between the vari- ables, which may not be true for metabolomics data. Complex interactions between metabolites can provide non- linear connections between variables. When using PCA, this may lead to the loss of crucial data. In addition, PCA may be sensitive to outliers in the data, which could affect the findings and provide false conclusions. In metabolomics, PCA has been used to identify biomarkers for various diseases and conditions, including cancer, diabetes, and obesity. However, the use of PCA in metabolomics has been criticized for its limitations, and alternative methods such as Partial Least Squares (PLS) have been proposed as more suitable alternatives for certain types of data [50]. Random Forest is a popular machine-learning algorithm that is used in metabolomics to predict the metabolite concentrations of unknown samples. However, one of the main limitations of random forest is that it is prone to overfitting, especially when the number of features is large [51]. This can be a problem in metabolomics, where there are often many thousands of metabolites that can be measured. Additionally, random forest can be computationally expensive, especially when there are many trees in the forest. Support Vector Machines (SVMs) are another popular machine learning algorithm that can be used for metabolomics data analysis. One of the limitations of SVMs is that they can be sensitive to the choice of kernel function. Additionally, SVMs can be sensitive to outliers, which can be a problem in metabolomics where there may be systematic errors in the measurement of metabolite concentrations. Finally, SVMs can be computationally expensive, especially when there are many samples and/or features [51]. Neural Networks is a powerful machine learning algorithm that can be used for var- ious applications, including metabolomics. However, one of the main limitations of neural networks is that they are prone to overfitting, especially when there are many parameters to be learned. In metabolomics, neural networks can also be lim- ited by the availability of large datasets, as they require large amounts of data to be trained effectively. Additionally, neural networks can be computationally expensive, especially when there are many layers and/or neurons in the network. In conclusion, the use of algorithms such as PCA, random forest, SVMs and neu- ral networks in metabolomics has both advantages and limitations. While these algorithms can provide valuable insights into complex biological systems, they can also be limited by their assumptions, computational requirements and sensitivity to different types of data. As such, it is important for researchers to carefully consider the strengths and limitations of these algorithms when analyzing metabolomics data, and using a range of approaches in order to obtain the most comprehensive insights possible. 3.6 Objective The primary objective of this Master’s thesis is to develop and evaluate a machine learning model that can accurately predict the type of dietary protein consumed 20 3. Methods by an individual based on their blood metabolomics data. Specifically, the study aims to investigate whether the consumption of meat, fish, or vegetarian sources of protein can be determined through blood metabolomics data analysis. The ability to accurately assess the type of dietary protein intake is essential for research in various fields, including nutrition, metabolism, and public health. However, traditional methods for assessing dietary protein intake have limitations, such as reliance on self-reported dietary intake or incomplete nutrient databases. Blood metabolomics, on the other hand, offers a promising approach for assessing dietary protein intake, as it allows for the identification of specific metabolites that can serve as biomarkers of protein consumption. Our goal is to identify these biomarkers and develop a machine learning model that can use them to predict an individual’s type of dietary protein intake with a high degree of precision. Overall, this thesis aims to advance the field of dietary assessment by providing a novel approach for accurately predicting an individual’s dietary protein intake using blood metabolomics and machine learning. 21 3. Methods 22 4 Results In this chapter, we present the results of our research, which aimed to investigate the metabolic differences between different dietary groups. 4.0.1 PCA Results PCA was employed to visualize the metabolic profile of the dataset. 4.0.1.1 Overall PCA analysis results The 3D plot (Figure 4.1) does not show a clear separation between the four di- etary groups, indicating that the metabolomic profiles of the groups are not distinct enough to be separated by the chosen number of principal components. 4.0.1.2 Gender-specific PCA analysis results Furthermore, two separate 3D PCA plots were created for men and women, respec- tively. The plots (Figure 4.2 and 4.3) showed that there is not a clear separation between the two groups in both cases, indicating that there are no clear sex-specific metabolic differences. 4.0.1.3 Meat Consumption-specific PCA analysis In addition, a 3D PCA plot was generated to compare the metabolic profiles of meat eaters and non-meat eaters. Figure: 4.4 showed a clear separation between the two groups, indicating that meat consumption has an impact on the metabolic profile. 4.0.1.4 Outliers From the 3D plots of the PCA analysis (Figures 4.1, 4.2, 4.3), it was clear that there were some outliers present in the dataset. To identify and remove these outliers, we used the Z-score method which computes the deviation of each data point from the mean in terms of standard deviation. We then set a threshold of 2.45 standard deviations, which is a commonly used threshold to identify outliers. In total, we identified and removed 10 outliers, 5 men and 5 women. Notably, none of the outliers belonged to the pescetarian group. The removal of outliers improved the quality and accuracy of our analysis, allowing us to draw more reliable conclusions from the data. 23 4. Results Figure 4.1: 3D plot of the PCA method. Figure 4.2: 3D plot of the PCA method only for the men samples. Figure 4.3: 3D plot of the PCA method only for the women samples. 24 4. Results Figure 4.4: 3D plot of the PCA method for meat eaters and non-meat eaters. 25 4. Results 4.0.1.5 Most Important Features To identify the most important features in our dataset, we used principal com- ponent analysis (PCA) and extracted the first principal component (PC1). The absolute values of the loadings for each feature on PC1 were then calculated using the ’pca.components_’ attribute. We selected the top 100 features with the highest loadings on PC1 as the most important features for our analysis. This allowed us to focus on the most informative variables in our dataset and improve the accuracy and efficiency of our analysis. In addition to the full dataset, we also performed the same procedure for the datasets containing only men and only women. However, after comparing the results, we decided to keep the most important features found from the full dataset for further analysis. We believe that this approach will provide a more generic and accurate representation of the data, rather than focusing solely on gender-specific patterns. 4.0.2 Classification Models Results Based on the research question of classifying individuals into four groups (vegan, veg- etarian, pescetarian, and omnivore) based on their dietary habits, we applied three popular machine learning algorithms: Random Forest, Support Vector Machines (SVM), and Neural Networks. We attempted multiple tasks with each algorithm. The first task involved classifying each group separately. The second task separated the samples into omnivores and all the other groups together. The third task sepa- rated the samples into vegans and all the other groups together. Finally, the fourth task separated the samples into omnivores and pescetarians together and vegans and vegetarians together. 4.0.2.1 First Task: classifying each group separately After running the Random Forest algorithm to classify each group separately, the results showed an overall accuracy of 57%. Table 4.1 shows the precision, recall, f1- score, and support for each of the four groups in the first task of using random forest for classification. The numbers for the support metric refers to the test samples. Due to that we observe smaller values of support compared with the total samples of the dataset. Table 4.1: Performance metrics for random forest model (Task 1) Group Precision Recall F1-score Support Vegans 0.58 0.64 0.61 11 Vegetarians 0.00 0.00 0.00 3 Omnivores 0.86 0.75 0.80 8 Pescetarians 0.00 0.00 0.00 1 After analyzing the results, we proceeded to extract the top 20 most important features for this model. Figure 4.5 displays these features in descending order of importance. 26 4. Results Figure 4.5: 20 most important features from RF model (Task 1). The seventh most important feature is a combination of phosphocholine, acetylcholine, phospho- ethanolamine and lipids/ffa, but it is not visible in the figure. 27 4. Results Figure 4.6: Confusion matrix for RF in the task 1. In addition to the classification results and the feature importance analysis, we also calculated and visualized the confusion matrix (Figure 4.6) for the best-performing random forest classifier in task 1. Following the random forest classifier, we employed a support vector machine (SVM) model for the same dataset. The obtained accuracy of 74% was recorded, and the performance metrics, namely precision, recall, F1-score, and support, are presented in Table 4.2. Table 4.2: Performance metrics for SVM model (Task 1) Group Precision Recall F1-score Support Vegans 0.69 1.00 0.81 11 Vegetarians 0.00 0.00 0.00 3 Omnivores 0.86 0.75 0.80 8 Pescetarians 0.00 0.00 0.00 1 In addition to the classification results and the feature importance analysis, we also calculated and visualized the confusion matrix (Figure 4.7) for the best-performing SVM classifier in task 1. The third classifier that was used for Task 1 is neural networks. The model was trained with 100 epochs and a batch size of 32. After training, we evaluated the model using the test set and obtained an accuracy of 30%. We also calculated 28 4. Results Figure 4.7: Confusion matrix for SVM in the task 1. precision, recall, F1-score and support for each class. These metrics are shown in Table 4.3 Table 4.3: Performance metrics for neural network model (Task 1) Group Precision Recall F1-score Support Vegans 0.00 0.00 0.00 11 Vegetarians 0.00 0.00 0.00 3 Omnivores 0.32 0.88 0.47 8 Pescetarians 0.00 0.00 0.00 1 4.0.2.2 Second Task: classifying omnivores and a combined group of vegetarians, vegans, and pescetarians The Second Task involved grouping all non-omnivorous samples together and treat- ing them as a single class while considering omnivorous samples as the other class. This task allowed us to investigate whether the metabolomics data can distinguish between omnivorous and non-omnivorous diets, which could be useful for develop- ing biomarkers of dietary intake and assessing the health effects of different dietary patterns. For this task, we utilized the Random Forest and SVM models to classify the samples. The accuracy of the best random forest classifier for the second task was 87%. The 29 4. Results Figure 4.8: Confusion matrix for the RF in the task 2. other metrics can be seen in Table 4.4 and the confusion matrix is shown in Figure 4.8. Table 4.4: Performance metrics for random forest model (Task 2) Group Precision Recall F1-score Support Non-omnivores 0.83 1.00 0.91 15 Omnivores 1.00 0.62 0.77 8 In addition to classifying omnivores and non-omnivores, we also wanted to under- stand which features are the most important for this classification. To achieve this, we used the Random Forest classifier and obtained the feature importance. We then kept the top 50 features with the highest importance scores and used them as input for the SVM model. This allowed us to not only achieve a high accuracy of 87% but also to identify the most relevant features that contribute to the classification. These features and their importance scores can be seen in the Figure 4.9. Along with the prior classification tasks, we further explored the relationship be- tween dietary patterns and metabolomics data by attempting to predict the amount of meat consumed by individuals. Leveraging the information available in the dataset, which indicated the frequency of meat consumption in the four days pre- ceding the sample collection, we categorized the individuals into three groups: 0 (no meat consumption), 1 (meat consumed 1-4 times), and 2 (meat consumed 5-6 times). 30 4. Results Figure 4.9: The 50 most important features from RF model (Task 2). 31 4. Results While our predictive models were unable to accurately estimate the exact amount of meat consumed, they demonstrated moderate success in distinguishing individuals who abstained from meat consumption (group 0) from those who consumed meat to varying degrees. For this specific task of predicting the amount of meat consumption, we employed the Random Forest algorithm. The accuracy score achieved by the model was 72%, indicating moderate success in the prediction task. However, it is important to note that the model struggled to accurately predict the amount of meat consumed, particularly for the categories with lower sample sizes. The precision, recall, and F1-score varied across the three groups and can be seen in Table 4.5. Table 4.5: Performance metrics for RF, meat consumption score. Group Precision Recall F1-score Support Group 0 0.68 1.00 0.81 15 Group 1 0.00 0.00 0.00 3 Group 2 1.00 0.20 0.33 5 Group 0 (no meat consumption) achieved the highest precision and recall, while Group 1 (meat consumed 1-4 times) and Group 2 (meat consumed 5-6 times) exhib- ited lower scores. This indicates that the model was more successful in identifying individuals who abstained from meat consumption, compared to differentiating be- tween different levels of meat consumption. Overall, the results highlight the chal- lenges in precisely estimating the amount of meat consumed based on metabolomics data and emphasize the need for further research. Continuing our exploration of dietary patterns and their metabolic associations, we proceeded to investigate the consumption of meat, dairy, and eggs among individuals in the dataset. For this particular task, we opted to exclude individuals following a vegan diet from our analysis. Since vegans abstain from consuming meat, dairy, and eggs altogether, their exclusion allowed us to focus specifically on individuals with varying levels of meat, dairy, and egg consumption. By removing vegans from the dataset, we aimed to develop a predictive model that could estimate the amount of consumption among non-vegan individuals. In our analysis of meat, dairy, and egg consumption, we employed three different scoring methods to quantify the dietary patterns of individuals. The first scoring task involved calculating the cumulative score based on the total number of times an individual reported consuming meat, dairy, and eggs. In the second method, we introduced a scoring adjustment by multiplying the number of times individuals consumed meat by a factor of 1.5. This modification aims to assign a higher score to individuals who consume protein from animal sources, reflecting the potentially greater impact of animal-based protein consumption on metabolic profiles. Lastly, the third task involved utilizing a scoring system derived from previous research. After calculating the cumulative consumption of these products, we classified indi- viduals into three groups based on their consumption frequency: group 1 represented 32 4. Results individuals who consumed meat, dairy, and eggs 0-4 times, group 2 included indi- viduals who consumed them 5-9 times, and group 3 consisted of individuals who consumed them 10-14 times within a specified period. Our goal was to develop pre- dictive models capable of estimating an individual’s group based on their metabolic profile. Using the Random Forest classifier with the best hyperparameters found during the model optimization process, we obtained an accuracy score of 50.36%. The confusion matrix revealed that the model struggled to accurately classify the individuals into the respective consumption groups. There were misclassifications across all three groups, resulting in low precision, recall, and f1-scores for each group (Table 4.6). Table 4.6: Performance metrics for RF, meat/dairy/eggs consumption score without vegans. Group Precision Recall F1-score Support Group 1 0.00 0.00 0.00 4 Group 2 0.33 0.75 0.46 4 Group 3 0.40 0.33 0.36 6 In the method where we multiplied the score of meat consumption by 1.5 and sep- arated individuals into two groups, we aimed to capture the differential impact of animal-based protein intake on metabolic profiles. The two groups were defined as follows: Group 1 included individuals with a score ranging from 0 to 8, and Group 2 consisted of individuals with a score ranging from 9 to 16.5, with an average score of 8.25. Using the Random Forest classifier with the best hyperparameters found during the model optimization process, we obtained an accuracy score of 75.76%. The precision, recall, and F1-score for each group varied as we can see in Table 4.7 Table 4.7: Performance metrics for RF, meat/dairy/eggs consumption score 1.5 x for omnivores, without vegans. Group Precision Recall F1-score Support Group 1 0.36 0.67 0.47 6 Group 2 0.50 0.22 0.31 9 Group 1 exhibited a precision of 36%, recall of 67%, and F1-score of 47%. Group 2 showed a precision of 50%, recall of 22%, and F1-score of 31%. The overall performance indicates that the model had limited success in accurately predicting the groupings based on the multiplied meat consumption score. In the third method, we employed the Omnivore Index, which was previously de- scribed in the paper titled "Identification of Single and Combined Serum Metabolites Associated with Food Intake." [38] This index serves as a metric for assessing an indi- vidual’s dietary pattern and quantifying their level of omnivorous consumption. The scores ranged from 2 to 14, and our objective was to classify individuals into either a high or low group. The two groups were defined as follows: Group 1 consisted of individuals with scores ranging from 2 to 9, while Group 2 included individuals with scores ranging from 10 to 14. 33 4. Results Using the Random Forest classifier with the best hyperparameters obtained during the model optimization process, we achieved an accuracy score of 73% correctly classifying 9 out of 15 individuals. The precision, recall, and F1-score for each group indicate the performance of the classifier in distinguishing between high and low omnivore index groups can be seen in Table 4.8 Table 4.8: Performance metrics for RF, meat/dairy/eggs consumption score from previous research, without vegans. Group Precision Recall F1-score Support Group 1 0.50 1.00 0.67 6 Group 2 1.00 0.33 0.50 9 4.0.2.3 Third Task: classifying vegans and a combined group of vegetar- ians, omnivores, and pescetarians In this task, our objective was to classify vegans separately from all the other groups. We utilized the Random Forest classifier to train and evaluate the model’s perfor- mance. By using the GridSearchCV function with cross-validation, we performed hyperparameter tuning to find the best combination of hyperparameters that max- imizes the model’s performance. The corresponding best accuracy score achieved was 73.07%. Both classes’ precision, recall, and F1-score can be observed in Table 4.9. For class 1 (vegan), the precision was 0.60, recall was 0.27, and F1-score was 0.37. For class 3 (non-vegan), the precision was 0.56, recall was 0.83, and F1-score was 0.67. Table 4.9: Performance metrics for the RF model (Task 3) Group Precision Recall F1-score Support Vegans 0.60 0.27 0.37 11 Other groups 0.56 0.83 0.67 12 When evaluating the model’s performance on the test set, the confusion matrix, Figure 4.10, showed that out of the 23 samples, 3 were correctly classified as class 1 (vegan), while 10 were correctly classified as class 3 (non-vegan). However, there were 8 misclassifications for class 1 and 2 misclassifications for class 3. In order to gain a deeper understanding of the factors influencing the classification of vegans and non-vegans, we performed an analysis to determine the most important features for this task. From this analysis, we identified the top 50 most important features that can be seen in Figure 4.11 Similarly, we applied the SVM (Support Vector Machine) algorithm to classify ve- gans and non-vegans. Utilizing the best hyperparameters obtained from the grid search, we trained the SVM model and examined its performance. The SVM model trained with these hyperparameters achieved an accuracy of 78% in predicting the dietary groups. Table 4.10 demonstrates the metrics precision, recall, F1-score and 34 4. Results Figure 4.10: Confusion matrix for Random Forest in the task 3. support. The precision, recall, and F1-score were also calculated for each class, demonstrating satisfactory performance with precision values of 0.80 for class 1 (ve- gans) and 0.77 for class 3 (non-vegans). The recall values were 0.73 for class 1 and 0.83 for class 3, indicating good performance in correctly identifying instances from each class. The F1-score, which considers both precision and recall, was 0.76 for class 1 and 0.80 for class 3. Table 4.10: Performance metrics for SVM model (Task 3) Group Precision Recall F1-score Support Vegans 0.80 0.73 0.76 11 Other groups 0.77 0.83 0.80 12 For the SVM model, since it used a non-linear kernel, feature importance analysis using coefficients was not applicable. Therefore, we focused solely on the Random Forest model to identify the most influential features. 4.0.2.4 Fourth Task: classifying vegans and vegetarians together and omnivores and pescetarians together In the last task, we focused on classifying vegans and vegetarians together against omnivores and pescetarians. We employed both Random Forest and Support Vector Machine (SVM) models to perform the classification task. 35 4. Results Figure 4.11: The 50 most important features from RF model (Task 3). The first most important feature is a combination of phosphocholine, acetylcholine, phospho- ethanolamine, and lipids/ffa, but it is not visible in the figure. 36 4. Results Upon applying the Random Forest (RF) model to classify the combined group of vegans and vegetarians and the combined group of omnivores and pescetarians, we obtained the following outcomes. The best hyperparameters for the RF model were determined as follows: max_depth = 10, max_features = ’sqrt’, min_samples_leaf = 4, min_samples_split = 2, and n_estimators = 500. With these parameters, the RF model achieved an accuracy score of 82.03%. When evaluating the performance of the Random Forest model, we obtained the classification metrics on the test set, shown in Table 4.11. Table 4.11: Performance metrics for random forest model (Task 4) Group Precision Recall F1-score Support Vegans 0.81 0.93 0.87 14 Other groups 0.86 0.67 0.75 9 The confusion matrix revealed that out of the 23 instances, 13 were correctly clas- sified as belonging to the combined group of vegans and vegetarians, while 6 were accurately identified as belonging to the combined group of omnivores and pescetar- ians. However, there was one misclassification in classifying the combined group of vegans and vegetarians and three misclassifications in classifying the combined group of omnivores and pescetarians. The 50 most important features are shown in Figure 4.12. Moving on to the SVM model, we also performed a grid search with cross-validation to determine the best hyperparameters. The best hyperparameters for the SVM model were determined as C = 0.01, gamma = 0.001, and kernel = ’linear’. The model achieved an accuracy of 74%, indicating a moderately accurate classification. The classification report for the SVM model on the test set showed the following results (Table 4.12): Table 4.12: Performance metrics for SVM model (Task 4) Group Precision Recall F1-score Support Vegans 0.75 0.86 0.80 11 Other groups 0.71 0.56 0.63 12 37 4. Results Figure 4.12: The 50 most important features from the RF model (Task 3). The 12th most important feature is a combination of phosphocholine, acetylcholine, phospho- ethanolamine, and lipids/ffa, but it is not visible in the figure. 38 5 Discussion and Conclusion The Results chapter presented the findings of our research, which aimed to investi- gate the metabolic differences between different dietary groups. In this discussion section, we will interpret and analyze the results, provide insights into the impli- cations of the findings, discuss the limitations of the study, and suggest potential directions for future research. 5.1 Discussion The principal component analysis (PCA) was employed to visualize the metabolic profile of the dataset. The four dietary groups were not clearly distinguished by the overall PCA analysis, indicating that there may be significant individual variation in the metabolomic profiles of each group. This variation may result from causes other than diet, suggesting that individual variations or other factors may have an impact on metabolic profiles. Another explanation is that although sharing the same dietary group, the diets within a category may really demonstrate significant variances. The principal component analysis (PCA) was employed to visualize the metabolic profile of the dataset. The overall PCA analysis revealed that the four dietary groups exhibited some degree of overlap in their metabolic profiles. However, it is worth noting that omnivores were more clearly distinguished from the other three groups in Figure 4.4. This indicates that there are discernible metabolic differences between omnivores and the remaining dietary groups. Nonetheless, there is still notable individual variation within each group, suggesting that factors other than diet, such as genetic variations or lifestyle factors, may contribute to the observed metabolic variations. Another explanation is that although sharing the same dietary group, the diets within a category may really demonstrate significant variances. Additionally, gender-specific PCA analyses were carried out separately for men and women. But in all instances, the 3D PCA plots failed to clearly separate the two groups. This suggests that within the population under study, there are no obvious sex-specific metabolic differences. Interestingly, when comparing the metabolic profiles of meat eaters and non-meat eaters, a clear separation was observed between the two groups. This suggests that meat consumption has an impact on the metabolic profile and can be a distinguishing 39 5. Discussion and Conclusion factor in the analysis. However, it is important to note that this separation does not necessarily imply causality. The Z-score approach allowed outliers to be located and eliminated from the dataset, which enhanced the analysis’s accuracy and quality. The removal of outliers allowed for more reliable conclusions to be drawn from the data. The identification of the most important features in the dataset was performed us- ing PCA, focusing on the first principal component (PC1). The top 100 features on PC1 with the highest loadings were chosen as the most crucial features. This method enabled a more concentrated study on the dataset’s most important vari- ables, potentially increasing the analysis’s accuracy and effectiveness. Both men and women were analyzed separately to investigate potential gender-specific differences in metabolism and dietary patterns. The results from the entire dataset, however, were assessed to be more representative and were chosen for further analysis in light of the goal of collecting a thorough and accurate picture of the data. The differ- ences in metabolization and diet between men and women, as well as the impact of disparities in muscle mass on the patterns found, were also of interest to investigate. Moving on to the classification models, three popular machine learning algorithms, namely Random Forest, Support Vector Machines (SVM), and Neural Networks were applied to classify individuals into four dietary groups based on their habits. In the first approach of classifying each group separately, the Random Forest al- gorithm achieved an overall accuracy of 57%. The precision, recall, and F1-score for each group varied, indicating different levels of classification performance. The SVM model performed better, achieving an accuracy of 74%. However, the Neu- ral Network model yielded lower accuracy of 30%. These results suggest that the classification of individuals into specific dietary groups based on metabolomic data is challenging and may require more sophisticated approaches or additional data sources to achieve higher accuracy. In the second approach, where omnivores and non-omnivores were classified together, the Random Forest model achieved an accuracy of 87%. The results indicate that the metabolomics data can distinguish between omnivorous and non-omnivorous diets, highlighting the potential of metabolomics in developing biomarkers for dietary intake assessment and exploring the health effects of different dietary patterns. Dietary studies can benefit from the ability to predict the absence of meat intake, since it provides information on the metabolic signatures connected to non-meat- eating habits. Our models demonstrated the ability of metabolomics data to capture variations in dietary patterns and give information on the metabolic impacts of meat consumption, despite the difficulty of correctly measuring meat consumption. Certain metabolites found in Approach 2 (Table 4.9), including glycine, creatine, trimethylamine, glutamine, and valine, indicate a likely link between protein intake and meat eating [14]. Glycine, a crucial amino acid essential in protein synthesis, is frequently included in meals high in protein, such as meat [52]. Creatine, which is mostly present in meat and other animal products, plays a role in energy metabolism [53]. Trimethylamine (TMA), which is produced by gut bacteria when they break 40 5. Discussion and Conclusion down nutrients like the choline in meat, may signify increasing consumption of nutri- ents obtained from animals [54]. A vital amino acid called glutamine is a component of proteins, suggesting that eating meat may result in a larger protein intake. An- imal protein sources are a good supply of valine, another important amino acid [55]. These metabolites likely reflect the metabolic processes related to digestion, metabolism, and utilization of proteins obtained from meat. In our analysis of meat, dairy, and egg consumption, we utilized three different scor- ing methods to quantify individuals’ dietary patterns and explore their relationship with metabolic profiles. With an accuracy score of just 50.36%, the first scoring method, which was based on cumulative consumption, had poor classification per- formance. Low accuracy, recall, and F1-scores for each group show that the model had difficulty correctly classifying individuals into the corresponding consumption categories. This suggests that simply counting the number of times individuals re- ported consuming these products the last 4 days before the sample taking, may not be sufficient to capture the nuances of their dietary patterns. The model’s accuracy was greater while using the second score (75.76%), which multiplied the meat consumption score by a factor of 1.5. However, there were differences in the precision, recall, and F1-scores between the two groups, demon- strating that the adjusted score couldn’t properly predict consumption habits. The performance metrics for Group 1 were superior to those for Group 2, indicating that those with lower meat intake levels were more accurately classified. The Omnivore Index, drawn from earlier research, was used as the third score. The model performed quite well in identifying groups with high and low omnivore indexes, with an accuracy score of 73%. However, there were some misclassifications between the two groups based on the precision, recall, and F1-score differences. Comparing the three scoring methods, the second approach, involving the multiplica- tion of meat consumption scores by 1.5, yielded the highest accuracy. This indicates that better classification performance may be achieved by taking into consideration the various effects that consuming animal-based protein may have on metabolic pro- files. The addition of a weighting factor accounts for the increased protein content and possible metabolic effects of sources of protein produced from animals. The first scoring approach, based on cumulative consumption, may have limited accuracy due to its simplistic nature. It does not take into account potential changes in dietary patterns, such as frequency or various types of meat, dairy, and egg products. The third approach, using the Omnivore Index showed differences in performance indicators for the two groups while reaching a reasonably high accuracy score. The complex nature of people’s eating habits and the inherent difficulties in representing their variation within a single scoring system may be to blame for this. Overall, the second strategy performed the best, but it is crucial to keep in mind that dietary habits are complex and affected by a variety of factors in addition to consuming meat, dairy, and eggs. The accuracy and predictive value of models designed to evaluate dietary patterns based on metabolic profiles may be further 41 5. Discussion and Conclusion improved by the inclusion of additional dietary and lifestyle factors in subsequent research. In the third approach, our objective was to classify vegans separately from all other dietary groups. To complete this challenge, we used the Support Vector Machine (SVM) method and the Random Forest (RF) classifier. The accuracy score for the RF model was 73.07%, while the accuracy score for the SVM model was 78%. Numerous metabolites that were relevant in the categorization of vegans and non- vegans were discovered through the examination of critical features. According to the RF model, phosphocholine, acetylcholine, phosphoethanolamine, and lipids/ffa were the most significant features, followed by creatine and lysine, glutamine and an unidentified metabolite, and separately, creatine and valine. These significant features, as depicted in Figure 4.11, provide information on how the metabolic profiles of vegans and non-vegans differ, based on an analysis of serum samples from individuals. The interaction of lipids/ffa, phosphoethanolamine, acetylcholine, and phosphocholine appears to be a particularly important element in distinguishing between the two groups. The discriminating ability of the model is also enhanced by creatine, lysine, glutamine, and the unidentified metabolite. The classification performance of both models suggests that distinguishing between vegans and non-vegans based on metabolic profiles is challenging. The low recall for class 1 (vegans) in both models suggests that it may be challenging to accurately identify every vegan, maybe as a result of the diversity of the vegan population. Vegans may adhere to various dietary subtypes, such as processed vegan diets or whole-food plant-based diets, which might alter their metabolic profiles. Further- more, the dataset’s small sample size for the vegan group may have hindered the models’ ability to correctly categorize this particular dietary category. In summary, whereas the RF and SVM models classified vegans and non-vegans with respectable accuracy, the findings highlight the difficulty of identifying these dietary categories simply based on metabolic profiles. The low recall of vegans points to significant diversity among the vegan population, which may be impacted by various dietary subtypes and personal characteristics. Furthermore, the existence of metabolic patterns that overlap between vegans and non-vegans emphasizes the importance of variables other than food alone, such as heredity, lifestyle, and general eating habits. In the fourth approach, we aimed to classify vegans and vegetarians together against omnivores and pescetarians. Both the Random Forest (RF) (82% accuracy) and Support Vector Machine (SVM) (74%) models were utilized for this classification task. In the analysis of the most influential features for classifying the combined group of vegans and vegetarians against the combined group of omnivores and pescetarians, several metabolites emerged as crucial contributors: creatine + lysine, valine, crea- tine, glycine, phosphocholine + acetylcholine + phosphoethanolamine + lipids/ffa, trimethylamine, glutamine + unknown, and isoleucine. 42 5. Discussion and Conclusion Creatine and lysine are closely related and are involved in energy metabolism and protein synthesis [52]. The essential amino acids valine, glycine, and isoleucine are crucial for the synthesis of proteins and the creation of energy. The classification’s dependence on these metabolites shows that the two groups’ approaches to the metabolism of proteins and energy differ from one another. The observed disparities may be a result of variations in the sources of dietary protein and their associated amino acid compositions. The combination of phosphocholine + acetylcholine + phosphoethanolamine + lipids/ffa reflects various lipid-related compounds. Fatty acids and lipids are important ele- ments of cell membranes and are crucial for metabolic pathways and energy storage [56]. The presence of these metabolites highlights any differences between omni- vores and pescetarians and omnivores and vegetarians in terms of lipid metabolism, especially phospholipid metabolism. 5.2 Limitations and Future Work The limitations observed in the accuracy of predicting meat consumption can be attributed to several factors. Firstly, the dataset’s features may not have fully cap- tured the nuances and variations in meat consumption levels, including the limited number of metabolites obtained from NMR metabolomics. This highlights the need for more comprehensive features that provide a deeper understanding of dietary habits. Secondly, the accuracy of the model may have been influenced by the distri- bution of the data and the specific features used for prediction. A more diverse and representative dataset could potentially improve the model’s performance. The lower accuracy for categories with smaller sample sizes indicates that the model had difficulty learning the patterns of those categories due to their limited represen- tation in the dataset. To address this limitation, efforts should be made to gather a more diverse dataset that includes a wider range of meat consumption levels across different demographic groups. Future research may examine bigger and more varied datasets to confirm and expand on our findings. Additionally, incorporating data from additional -omics subjects, such as genomics or transcriptomics, may help us get a deeper understanding of the biological processes that underlie the observed metabolic variations between dietary groups. We can improve individualized nutrition advice, comprehend the effect of dietary decisions on health outcomes, and perhaps design focused therapies for enhancing people’s well-being by unraveling the metabolic fingerprints linked to various eating patterns. 5.3 Conclusion In this work, we looked at how metabolomics data may be used to categorize people into various dietary groups. We examined four distinct methods to do this assign- 43 5. Discussion and Conclusion ment using several machine learning techniques, including Random Forest, Support Vector Machine and Neural Networks models. Our results demonstrate the potential of metabolomics to distinguish between various food categories and predict eating habits and the kind of protein consumed. Through the examination of our results, we identified different levels of categoriza- tion accuracy and performance across the various methods used. Notably, we clas- sified vegetarians and vegans with acceptable accuracy rates, indicating potential similarities in their eating habits. However, it was difficult for our models to differ- entiate between omnivores and pescetarians, showing that the consumpti