MASTER’S THESIS Department of Mathematical Sciences Division of Mathematical Statistics CHALMERS UNIVERSITY OF TECHNOLOGY UNIVERSITY OF GOTHENBURG Gothenburg, Sweden 2014 Analysis of gene expression patterns in wild eelpout from the Baltic Sea Identification of expression patterns for differentially expressed genes connected to pollution EMMA WIJKMARK Thesis for the Degree of Master of Science Department of Mathematical Sciences Division of Mathematical Statistics Chalmers University of Technology and University of Gothenburg SE – 412 96 Gothenburg, Sweden Gothenburg, September 2014 Analysis of gene expression patterns in wild eelpout from the Baltic Sea Identification of expression patterns for differentially expressed genes connected to pollution Emma Wijkmark Abstract In a world where the human makes large toxic footprints the anthropogenic pollution of the aquatic environment is evident. It is therefore of great importance to understand what and how pollution adversely affect organisms in the environmental ecosystems. The aim of this study is to understand microarray gene expression patterns in the fish eelpout and how they change in connection to pollution. The analysed dataset is unique due to its extensive set-up with 158 wild fish from 16 sites located in four large geographical regions in the Baltic Sea. The expression profiles were found to be highly varying. Clustering analysis showed that fish from the same region and site had a high tendency to group together. Compar- ison between fish from reference and polluted sites showed significant differences in gene expression but the effects were in general small. This is likely an indication that there are few shared effects between the polluted sites. Similarly, there were small effects on the gene expression between fish with a high and low reproduction success. Large and significant effects were however seen when comparing fish with low and high values of the known biomarker EROD. We also observed strong correlations between measured gene expression for the CYP1A gene and EROD. The strength of the correlation varied between regions and the highest correlation was found at the Swedish west coast. Fi- nally, we assessed the independence between sampled fish and found that several genes had a high within-site correlation (median correlation over all genes were 0.18). Our results suggest that there are small effects on gene expression connected to pollution and reproduction success. We could, however, identify large effects connected to region and site which may indicate that ecological factors and population parameters had a substantial impact on the observed gene expression profiles. Acknowledgements First of all I want to express my gratitude to my supervisor Erik Kristiansson for his excellent guidance into the world of applied statistics and for his never-ending time for questions. Then I would like to thank my collaborators at the University of Gothenburg who not only made this thesis possible but also much more interesting by giving a bio- logical view of the statistical results. It wouldn’t have been as fun without you. Lastly, I want to turn my gratitude to everyone who not contributed directly to the thesis but indirectly by creating a warm atmosphere in which to spend the breaks. Emma Wijkmark, Gothenburg ,July 2014 Contents 1 Introduction 1 1.1 Aim . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2 Theory 4 2.1 Linear models with empirical Bayes . . . . . . . . . . . . . . . . . . . . . . 4 2.1.1 Formulation of linear model using design and contrast matrix . . . 4 2.1.2 Empirical Bayes and modified t-test . . . . . . . . . . . . . . . . . 5 2.1.3 Multiple testing correction . . . . . . . . . . . . . . . . . . . . . . . 6 2.2 Hierarchical clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.3 Principal component analysis . . . . . . . . . . . . . . . . . . . . . . . . . 9 3 Materials / Method 11 3.1 Sampling set-up and data . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 3.1.1 Experimental design . . . . . . . . . . . . . . . . . . . . . . . . . . 11 3.1.2 Data included in the study . . . . . . . . . . . . . . . . . . . . . . 12 3.2 Exploratory analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 3.2.1 Hierarchical clustering . . . . . . . . . . . . . . . . . . . . . . . . . 13 3.2.2 Principal component analysis . . . . . . . . . . . . . . . . . . . . . 13 3.3 Gene rank analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 3.3.1 The linear model and empirical Bayes . . . . . . . . . . . . . . . . 14 3.3.2 Differences in gene expression based on pollution classification . . 14 3.3.3 Gene expression in relation to the biomarker EROD . . . . . . . . 15 3.3.4 Difference in gene expression based on reproduction success . . . . 16 3.4 Within-site correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 4 Results 18 4.1 Exploratory analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 4.1.1 Hierarchical clustering . . . . . . . . . . . . . . . . . . . . . . . . . 18 4.1.2 Principal component analysis . . . . . . . . . . . . . . . . . . . . . 19 4.2 Gene rank analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 i CONTENTS 4.2.1 Differences in gene expression based on pollution classification . . 22 4.2.2 Gene expression in relation to the biomarker EROD . . . . . . . . 23 4.2.3 Difference in gene expression based on reproduction success . . . . 27 4.3 Within-site correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 5 Discussion 33 Bibliography 38 A Appendix 39 B Appendix 41 C Appendix 42 D Appendix 44 ii 1 Introduction In a world where the human makes large toxic footprints it is of great importance to understand the impact this has on the environment. The pollution is today widespread and many of the emissions are accumulated in the coastal environment. The anthro- pogenic pollution of the aquatic environment is evident and it is therefore a main issue to investigate what and how the pollution adversely affects organisms in the aquatic ecosystems. To do this, fishmonitoring is being developed as a tool for investigation of adverse effects caused by pollution and the traditional strategies in biomonitoring by using histopathology can today be complemented by microarrays. This makes it possible to investigate the relation between pollution and the aquatic organisms’ gene expressions which as a next step might be used as early warning signs. Within the project called Balcofish gene expression profiles of the fish eelpout (Zoarces viviparus) have been analysed using microarrays. The species eelpout has shown to over- all be a valuable bio-indicator. One major advantage compared to other fish species is that it gives birth to live young, or rather the eggs develops into fry within the female. Thus it is possible to directly examine the reproduction for each female (Hedman et al., 2011). The Balcofish project is a large EU-BONUS financed project with the aim to investigate chemical pollution by developing fish monitoring and by providing solid in- formation to the management of the Baltic Sea. To achieve this, fish have been sampled from 16 sites located in four large geographical regions in the Baltic Sea. The Balcofish project is a joint research project involving scientists from Denmark, Germany and Swe- den. This thesis is performed in collaboration with the scientists at the University of Gothenburg, Department of Biological & Environmental Sciences and Department of Infectious Diseses, and is part of the Balcofish project. The DNA microarrays, developed in the 1990s, revolutionised the field of gene ex- pression analysis since it gave the possibility to investigate a very large number of genes simultaneously. The gene expression microarrays that have been used to retrieve the data for this study consist of a solid surface with microscopic DNA spots, each of them 1 CHAPTER 1. INTRODUCTION containing many copies of a unique DNA sequence called probe. The probe can be a part of a known gene or an unidentified sequence. mRNA from each individual fish is isolated and reverse transcribed into cDNA (a more stable molecule) and fluorescently labeled. The cDNA from a sample is then hybridized to the microarray. cDNA that have not attached to any probe is washed away. The amount of cDNA sequences that attaches to each probe tells how much of that specific sequence that was found in the sample and by that how much this gene was expressed. This amount of attached sequences is usually examined by detection of fluorescent using a laser scanner (Alberts, 2002). Since microarrays is used to investigate a large number of genes simultaneously the retrieved data will be high dimensional. The raw intensities are often noisy and this together with the high dimensionality urges the need for statistical tools. Common pre- processing of the intensities to aid the succeeding analysis is quantile normalization and log transformation. The log transformed gene expression data is often close to normal and therefore linear models can be used. The linear model gives a flexible frame work and there is a variety of statistical tests that can be used for inference. Among these we find simpler methods such as the classical t-test but alternatively more robust estimators can be applied for example the estimators found in the limma-package for the software R (Smyth, 2004). Many of previous studies of gene expression differences in fish have investigated the short term impacts in controlled studies . The step from this into field implementation is not done without complications. Sellin Jeffries et al. (2012) looked at taking microarrays to the field by deploying fatheaded minnows (Pimephales promelas) to be caged for 7 days at four sites with different anthropogenic impact. It was shown that microarrays can be utilized to discriminate between sites with different contamination loads. The gene expression profiles were found to be site specific and that fish from low- and high- impact sites aggregated into distinct groups. In Williams et al. (2014) long term impact of contamination exposure was investigated when European flounder (Platichthys flesus), all sampled from the same site, were kept for 7 months in mesocosms with sediment from sites with different contamination loads. Small but statistically significant alterations in transcriptomic and metabolomic responses in liver tissue were detected between fish exposed to different sediments. Neither of these studies concerned fish from different populations. To be able to study gene expression for chronically exposed wild fish the fish need to be sampled at site and for large-scale studies the population effect cannot be avoided. For the final step in taking microarrays into field implementation the long term impact of exposure, ecological parameters and population effects will all be needed to be taken into account. In Falciani et al. (2008) European flounder sampled from six widely spread sites were compared and a wide range of genes seemed to be differently expressed between the sites. It was demonstrated that gene expression signatures in livers can predict the geographical sampling site but that the accuracy is limited to specific sites. The question is if discrimination between field sites with different levels of contaminations still is apparent when the study concerns wild sampled fish from different populations. And furthermore, if gene expression signatures can predict geographical sampling site and origin even when the number of sites is increased. 2 1.1. AIM CHAPTER 1. INTRODUCTION 1.1 Aim In this master’s thesis project gene expression patterns in wild sampled fish will be in- vestigated and questions concerning how to handle data from large-scale sampling will be adressed such as impact of site and region. The aim of this study is to analyse gene expression patterns in eelpout liver and find possible relation between gene expres- sion differences and exposure of pollution as well as relation between gene expression differences and biomarkers. The specific objectives are: 1. Identify tendencies of groupings of the fish. Investigate the general appearance of the gene expression profiles and discrimination between samples based on sampling site, region and exposure of pollution. 2. Find differentially expressed genes connected to pollution. Identify consistent dif- ferences between the gene expression profiles of the fish in relation to the contam- ination load at the sites. 3. Find differentially expressed genes connected to biomarkers such as reproduction success. This by defining a measure of reproduction success and then identify consistent differences in gene expression profiles between groupings of the fish based on reproduction success. 3 2 Theory The analysis of the gene expression data will rely on statistical methods and the following chapter will introduce the methods used for the data analysis. The structure of the linear model used for inference and the basics for handling the arising problem of high dimensionality will be explained. Moreover, an introduction to the unsupervised analysis methods hierarchical clustering and principal component analysis is given. 2.1 Linear models with empirical Bayes 2.1.1 Formulation of linear model using design and contrast matrix Let yg denote the log transformed gene expression for probe g and denote the coefficient vector as αg holding the coefficients representing the exploratory variables. These two vectors will be yg =  yg,1 yg,2 : yg,n , αg =  αg,1 αg,2 : αg,p  where n is the number of samples, for this study n = 158, and p is the number of covari- ates included in the model. The so-called design matrix, denoted as X, is constructed as follows and has dimension n× p, X =  x1,1 x1,2 . x1,p x2,1 x2,2 . x2,p : : . : xn,1 xn,2 . xn,p  . 4 2.1. LINEAR MODELS WITH EMPIRICAL BAYES CHAPTER 2. THEORY The design matrix is such that each column holds data for one of the included covariates. Each row represents a sample since the i:th row is the covariate profile for the i:th sample and will not change with the probes. This profile holds information for this specific sample about the properties included in the model which can be both individual properties and properties of its environment. The linear model can now be formulated Yg ∼ Nn(Xαg,σ 2 gIn) where all samples are assumed to be independent. For each covariate that is included in the design matrix we get an estimation of its coefficient and by that the possibility to remove its impact in the model or to examine its influence. This choice of which coefficients to examine can be made using a contrast matrix C. It holds information about which coefficients that are to be compared and how these should be weighted. For the following setting each column should sum to zero. The number of rows in the contrast matrix is the same as the length of the coefficient vector αg. In the same way as for the design matrix, the contrast matrix will be the same for all probes. The contrast of interest is given by βg = CTαg where for this study the contrast matrix is always one dimensional C =  c1 c2 : cp  . Let the null hypothesis H0 be no difference in gene expression and let the alternative hypothesis HA be that there is difference in gene expression according to H0 :βg = 0 HA :βg 6= 0. The covariates that are included in the design matrix but have entries zero in the contrast matrix are covariates which effects the model tries to compensate for. Covariates that have non-zero values in the contrast matrix are the ones which effects are being examined. The magnitude and sign of the elements in the contrast matrix tell how the features are weighted together. For example, if site effects are being examined the weights can be by the number of samples in the sites (Smyth, 2004). 2.1.2 Empirical Bayes and modified t-test The variances for each probe are not assumed to be equal. But if estimating each probe’s variance using a frequentist approach we will run into problems due to the high dimensionality. When having a very large number of tests, such as the microarrays will give, the risk is almost certain that some of the estimated variances will be too small due 5 2.1. LINEAR MODELS WITH EMPIRICAL BAYES CHAPTER 2. THEORY to randomness. The result of this might be that even if the absolute difference in gene expression is too small to be of biological interest the result will still seem significant due to the by randomness grossly underestimated variance. Due to this the ordinary t-test might give a bad rating of the biological effects. To avoid this problem an empirical Bayes method can be used which borrows information across the probes to fit hyperparameters ξ and η. Consider the following for testing the contrast βg. The variance σ2 is a random variable thought to be independent between the probes. Yg is the log transformed gene expression for probe g. Yg|σ2g ∼ Nn(Xαg,σ 2 gIn) σ2g ∼ Γ−1(ξ,η) The inverse gamma distribution has not only been empirically shown to fit microarray data well but has good statistical properties since it is the conjugate prior to the normal distribution. The hyperparameters ξ and η are point estimated using all the given data across the probes. Since the number of probes is large the parameters are well estimated and considered as known. They decide how the variance behaves by setting the shape and rate of the inverse gamma distribution. Let us consider the null hypothesis H0 : βg = 0 and the alternative hypothesis HA : βg 6= 0. Without further explanation and only giving an intuitive feeling for the result of empirical Bayes we can say that we have now moved from the ordinary t-test to the moderated t-test Tg = √ n− p+ 2ξ CT (XTX)−C Yg√ S2 g + 2η . where Yg is the projection of Yg on the subspace spanned by H0, Yg = C(XTX)−Yg, and S2 g is the variance according to S2 g = YT g (I − X(XTX)−1X)Yg. The distribution of Tg follows a t-distribution with n − p + 2ξ degrees of freedom (Smyth et al., 2003; Kristiansson et al., 2005). One way to interpret this is that a large variance will lead to a large η and the variance S2 g is less trusted in the empirical Bayes setting. This approach makes the model more robust against underestimated variances. 2.1.3 Multiple testing correction The above described model and computation of test-statistic will be done for each probe separately. The large number of probes tested in the microarray analysis gives a multiple testing situation where each comparison gives roughly 135 000 test-statistics. Therefore we expect that many test-statistics will have large magnitudes and thus small p-values by chance. This problem is referred to as the multiple testing problem. To correct for this the p-values can be adjusted by making them stricter, in other words less significant. There are different ways of adjusting the p-values which differ in what type of error that is being controlled and how strictly. 6 2.2. HIERARCHICAL CLUSTERING CHAPTER 2. THEORY One simple and strict correction is the Bonferroni correction which controls the family-wise error rate (FWER). The FWER is the probability of at least one false rejec- tion of the null hypothesis in favour of the alternative hypothesis for a chosen threshold α according to FWER = P (V ≥ 1) ≤ α where V is the number of falsely rejected null hypotheses. The Bonferroni corrected p-value is the non-adjusted p-value multiplied by the number of tests and this should be smaller than α for the change to be called significant. If the number of tests is large, as for microarray experiments, the Bonferroni correction along with other methods for FWER correction are too conservative which will result in too few genes called significantly changed. Another method is Benjamini-Hochberg which instead controls the false discovery rate (FDR). The FDR is the number of tests incorrectly called significant according to FDR = E[V/R] ≤ α where V again is the number of falsely rejected null hypotheses in favour of the alter- native hypotheses and R is the total number of rejected null hypotheses (Hastie, 2009). This means that the Benjamini-Hochberg adjusted p-values are such that for a chosen threshold, α, all probes with a FDR adjusted p-value lower than α are called differentially expressed. Among these the expected proportion of false discoveries, i.e. probes that are falsely called differentially expressed, are less than the threshold value α (Smyth, 2005). 2.2 Hierarchical clustering Hierarchical clustering is used to visualise groupings in the data using a binary tree struc- ture. Each level of the tree describes a hierarchy and represents a particular grouping of the observations into disjoint clusters. At the lowest level each cluster contains a single observation and at the highest level there is one cluster which contains all of the observa- tions. The clustering is often visualised in a binary tree called dendrogram, see figure 2.1 for an example. A dendrogram is a tree where the nodes at the first level is at height zero and the consecutive levels of the dendrogram have branch lengths proportional to the dissimilarity between the joined clusters. Long branch lengths between clusters indicate dissimilar clusters while short branch lengths imply more similar clusters. The mono- tonicity property possessed by dendrograms means that dissimilarity between joined clusters is monotone increasing with the height at which the clusters are joined. By cutting the tree at different levels we get different groupings of the observations and if long branches are cut it is more likely that these groupings are representations of natural clusters. A distance metric is defined for the pairwise distances between the observations which for example can be Euclidean distance or correlation distance. Left to determine is the meaning of two clusters being dissimilar and the answer to this question will change de- pending on the analysis. Three measures often used are single linkage, complete linkage 7 2.2. HIERARCHICAL CLUSTERING CHAPTER 2. THEORY 0. 01 5 0. 02 5 0. 03 5 Example of hierarchical clustering hclust (*, "complete") corrDistMat H ei gh t S am pl e 1 S am pl e 2 S am pl e 3 S am pl e 4 S am pl e 5 Figure 2.1: Example of a dendrogram from hierarchical clustering where 5 samples are being clustered based on correlation distance metric and using complete linkage. The clusters are recursively joined and the height at which clusters are joined describes the proportional dissimilarities between them. and average linkage which all compare the pairwise dissimilarities between clusters and joins the two clusters with smallest between cluster dissimilarity. Single linkage defines the between cluster dissimilarity to be the smallest distance between two observation belonging to different clusters. In contrast, complete linkage defines the between cluster dissimilarity to be the distance of the most dissimilar pair of observations belonging to different clusters. Complete linkage will result in compact clusters since the algorithm favours clusters where all observations are as similar as possible. The trade-off for this is that some observations might be assigned to a cluster while being closer to some obser- vations from another cluster. The average linkage can be seen as a compromise between the above described linkage choices since it defines the between cluster dissimilarity to be the average over all pairwise distances between the observations of different clusters. It creates clusters that are relatively compact where all observations are relatively close. The drawback is that the result is dependent on the numerical scale of the dissimilarity measure of the observations. Applying a monotone strictly increasing transformation to the dissimilarities can change the result while single and complete linkage only depend on the ordering of the dissimilarities. The average linkage has however statistical con- sistency property in contrast to single and complete linkage. For single and complete linkage it is unclear what aspects of the population distribution that are being estimated by the clustering algorithm. There are two different clustering strategies. One of them is the agglomerative, a 8 2.3. PRINCIPAL COMPONENT ANALYSIS CHAPTER 2. THEORY bottom-up method which starts with each observation as a single cluster. Then the two least dissimilar clusters are recursively merged and the method ends with one cluster containing all observations. The other strategy, namely divisive method, is a top-down method which starts with all observations in one cluster and then recursively split the clusters until each observation is a cluster. Using divisive methods it is not guaranteed to get the monotonicity property required for the result to be shown as a dendrogram (Hastie, 2009). 2.3 Principal component analysis Principal component analysis is a projection of the data onto a subspace spanned by un- correlated variables. It is used for shrinking dimensions by an orthogonal transformation to components which sequentially captures the most variance. The principal components are in short the linear combinations of the original variables that give uncorrelated com- ponents with maximum variance among all sequential linear combinations of the original components. Let X be an p-dimensional random vector holding the observed data having a positive semidefinite covariance matrix Σ. Let further, Γ be an orthogonal matrix such that Σ = Γ∆ΓT where ∆ =  δ1 0 . . . 0 δp  and δi are the ordered roots of |Σ − δI| = 0 i.e. δ1 ≥ δ2 ≥ ..δp ≥ 0. Define U to be [U1,...,Up] T and U = ΓTX. Then U is called a vector of principal components of X and Ui is called the ith principal component of X. The covariance matrix of U is ∆ = ΓTΣΓ and therefore the components Ui are uncorrelated with variance δi i.e. the variance of the ith component is equal to the ith eigenvalue of Σ. Since Γ is orthogonal, observing U is equivalent to observing X. The covariance matrix Σ is rarely known and the principal components needs to be found by estimations (Arnold, 1981). Keeping the above in mind and omitting the details when having unknown Σ, let us consider x to be the vector of observed components. Denote the rotation matrix by R which by having the principal components as columns and each variable as a row works as a matrix of basis vectors. By Z = xR we get Z the matrix holding new coordinates for each sample in the subspace spanned by the principal components. PCA is often used for variable reduction. By only using the first k components of Z for further analysis of the data the number of variables are reduced to the k variables that captures as much of the variance as possible. The choice of how many variables to use for analysis can be based on the amount of explained variance or by exterior restriction of the maximum number of dimensions. Interpretation of the principal components is not always obvious since they are linear combinations of the observed variables. It might be the case that the feature of interest is not captured by the first principal components 9 2.3. PRINCIPAL COMPONENT ANALYSIS CHAPTER 2. THEORY but by the ith and jth components instead. Therefore, principal components are often analysed by plotting the data using pairwise combinations of the principal components and the components giving rise to groupings in the data will be further investigated. The components of x are often standardized unless the numerical scale is on its own thought to be valuable information. The reason is that the same data but using different units will give completely different principal components (Johnson and Wichern, 1998). 10 3 Materials / Method 3.1 Sampling set-up and data 3.1.1 Experimental design The fish, later called samples, are of the species European eelpout (Zoarces viviparus) which has shown to overall be a valuable bio-indicator. Its stationary behaviour makes it possible to investigate long term ecological and anthropogenic impact in contrast to other fish species. One major advantage compared to other fish species is that the reproduction for each female can be directly examined since it gives birth to live young, or rather the eggs develops into fry within the female (Hedman et al., 2011). The fish were sampled at 16 coastal sites in four regions. The sites are located such that four sites are on the Swedish west coast (Fjällbacka, Stenungsund, Göteborg, Vendelsö), four sites on Swedish east coast (Kvädöfjärden, Marsö, G̊asö, Slakmöre), five sites in Denmark (Agersø, Karrebæk Fjord, Isefjord, Frederiksværk, Roskilde Fjord) and three sites in Germany (Wismar, Eggers Wiek, Slazhaff), see table 3.1. The sites have been classified based on pollution level, some as polluted, some as medium polluted and some as reference sites (Albertsson et al., 2011). From each site, 8 to 11 female samples have been hybridized to microarrays. It is thought to be different populations of eelpout in the different regions but also that Denmark might be divided into two different populations. The samples taken from the same site are assumed to be from the same population even if there is a possibility that occasional exchanges of individuals between populations have occurred (Hedman et al., 2011). Workshops have been held to get the sampling procedure as undiversified as possi- ble but there are some aspects of the sampling procedure with large differences. One example is that the samples from German sites were moved by car before testing was performed. Considerable effort has been put into keeping the number of affecting fac- tors at a minimum and as an example all microarray and enzyme analyses have been performed by the same laboratory. 11 3.2. EXPLORATORY ANALYSIS CHAPTER 3. MATERIALS / METHOD Table 3.1: Sampling sites included in the study divided into four geographical regions and classified by the prior belief of their pollution levels. Region Site Classification Sweden - West coast Fjällbacka Reference Stenungsund Polluted Göteborg Polluted Vendelsö Reference Sweden - East coast Kvädöfjärden Reference Marsö Reference G̊asö Polluted Slakmöre Medium polluted Denmark Agersø Reference Karrebæk Fjord Polluted Isefjord Medium Polluted Frederiksværk Polluted Roskilde Fjord Polluted Germany Wismar Polluted Eggers Wiek Medium polluted Slazhaff Medium polluted 3.1.2 Data included in the study The gene expression data is one-channel microarray intensities and has been preprocessed by log transforming and normalizing each sample. Roughly 135 000 data points for each sample were analysed from the DNA microarrays as 2-4 probes had been designed from approximately 50 000 eelpout liver sequences (contigs). Data for ethoxyresorufin-o-deethylase (EROD) activity and reproduction success for each fish is also included in the study. The reproduction data is in the form of number of dead and abnormal fry together with the total brood size. 3.2 Exploratory analysis First unsupervised analysis was performed on all gene expression profiles using hierarchi- cal clustering and principal component analysis. This was done to identify and visualise patterns in the general appearance. 12 3.2. EXPLORATORY ANALYSIS CHAPTER 3. MATERIALS / METHOD 3.2.1 Hierarchical clustering Hierarchical clustering is an easy way to visualise groupings in data. An agglomerative (bottom-up) clustering procedure was used and implemented using hclust in R. The samples were clustered and each sample’s gene expression profile was compared with the other samples’ profiles. Since the absolute levels were not of interest but the patterns of the gene expressions, correlation distance was chosen instead of Euclidean distance. The distance metric used was (1 − P)/2 where P denotes the correlation matrix with Pij defined as the Pearson correlation between the ith and jth samples’ gene expression profiles. When using average linkage a monotone strictly increasing transformation applied to the dissimilarities can change the result while single and complete linkage only depend on the ordering of the dissimilarities. The chaining effect that single linkage might have is not a desirable effect in this case. Instead it is more appealing to have compact clusters for the trade-off of risking that a few observations might be assigned to the wrong cluster. Hence complete linkage was chosen to overall have compactness such that all observations in a cluster were alike and still not having to depend on any transformation. The samples were clustered using different gene expression profiles. These gene ex- pression profiles were created by filtering based on how much each probe varied. For example, a high variance profile was created using the top 10% most varying probes and a low variance profile based on the 10% least varying probes. The clustering was performed using no information about site belonging or pollution level. This information was however used when visualising the results in dendrograms. 3.2.2 Principal component analysis The scales of the variables in PCA have a huge impact on the result. If there is no information in scale or the variables have been measured using different unit systems the general recommendation is to scale the variables. In this case however, all the variables included were of the same type and by scaling we would risk getting an increase of noise by giving low variance probes to much impact. Therefore the variables were not scaled or centred since we rather risk losing some informative low expressed probes in favour for not bringing up to much noise. First principal component analysis was performed by prcomp in R using all samples. Principal components were pairwise plotted against each other and the samples marked by region or pollution. This was then repeated for each region such that now only samples belonging to the same region were used in each of the regional principal component analyses. The samples were marked either by pollution level or site belonging. 13 3.3. GENE RANK ANALYSIS CHAPTER 3. MATERIALS / METHOD 3.3 Gene rank analysis 3.3.1 The linear model and empirical Bayes The main outline for the following analysis was to fit a linear model to the log transformed expression data for each probe and then use empirical Bayes method to fit hyperparam- eters. Inference was done using moderated t-test and p-values were adjusted by FDR (Benjamini Hochberg) for cut-off 0.01. The statistical model used in microarray analysis needs to be able to handle the hierarchical structure of the data and the inference should take the high dimensionality and the multiple testing situation into account. The linear model has shown to work well for microarray data and Sjögren et al. (2007) showed that it has a higher power to detect differentially expressed genes compared to standard methods. Log transformed gene expression data is often, as well as in this case, close to normal which makes the linear model favourable and we have robust estimators in the limma-package (Smyth, 2004). Based on which comparison that was made different aspects were taken into account and these features included in the design matrix. The samples were divided into different groups and compared using the contrast matrix. This allowance for general experiments to be analysed using the same framework by only changing the design and contrast matrix gives a very flexible and appealing model (Smyth, 2005). The linear model was fitted using least squares approach and then empirical Bayes was used. Empirical Bayes can be seen as borrowing information across all probes and through this we did not have to assume equal variance across the probes. Instead the variances were thought to be from the same distribution with the fitted hyperparameters. To correct for the multiple testing problem Benjamini Hochberg correction was used to adjust the p-values which controls the false discovery rate. The FDR adjusted p- values are interpreted such that for the chosen threshold 0.01, all probes with adjusted p-values lower than 0.01 are called significantly differentially expressed. Among these, the expected proportion of probes falsely called differentially expressed are less than the threshold value i.e. 1% (Smyth, 2005). The above described steps were done in the software R using the limma package in Bioconductor which calculates the log fold change and adjusted p-value. 3.3.2 Differences in gene expression based on pollution classification For comparing sites with different pollution classification the design matrix consisted of covariates indicating site, meaning that an overall site effect was estimated for each site and this was done for each probe separately. The interpretation of a site coefficient with large magnitude is that all together something in the environment or population of the fish from this site had a large impact on the expression of this particular gene in comparison to fish from other sites. Using the contrast matrix two groups were then formed. One consisted of all samples from reference sites and the other of samples from polluted sites. By comparing groups chosen to be reference and polluted we compared, for each probe, if there was any difference in effects between these groups. In other words, 14 3.3. GENE RANK ANALYSIS CHAPTER 3. MATERIALS / METHOD we test how likely it is that the regulation of this gene was caused by the fish being in reference or polluted environment. This was done by having positive weights in the contrast for reference sites and negative weights for polluted sites. For this comparison the site coefficients were weighted by the number of samples from each site. The medium polluted sites had elements zero in the contrast matrix. The group sizes for this contrast were 51 samples in the reference group and 68 samples in the polluted group. When looking at the regional comparison the sites could not be divided as easily into one reference group and one polluted group. The reason for this was the number of sites and the unbalanced partition of sites with different pollution classifications, see table 3.1. To overall get as good groups as possible and to stay consistent the group division for the regional comparisons were made such that medium polluted sites were combined with reference sites and compared against polluted sites. Notice that this means that for the Swedish west coast reference sites were compared against polluted sites while for Germany medium polluted sites were compared against the polluted site. The group sizes using above contrast were for: Swedish west coast 21-21, Swedish east coast 30-8, Denmark 20-29 and for Germany 19-10. Also for the regional comparisons all site coefficients were estimated but now only site effects for sites belonging to the same region were compared. This by setting all but one regions coefficients in the contrast matrix to zero and reweight. 3.3.3 Gene expression in relation to the biomarker EROD In an aquatic system the type or mixture of contamination may be hard to know and a biomarker, indicator of some biological state or condition, can be of good use. Hepatic EROD induction in fish has shown to be useful in such situations and has successfully been used as a biomarker of general contamination (Whyte et al., 2000). EROD mea- sures the enzyme activity of the CYP1A gene and its level of gene expression has been measured using four probes. Correlation between EROD activity and expression of CYP1A The within-site correlation between the log transformed EROD values and the log trans- formed measured expression of CYP1A was computed as Pearson correlation coefficient. This was done for each site separately and the regional estimations of the correlation were retrieved by averaging the within-site correlation for each region. The reason to first compute the within-site correlations is to get the correlation at individual level. Otherwise it might be the case that EROD level and expression of CYP1A might cor- relate at site level but that this correlation is not representative for the correlation at individual level. Differences in gene expression for samples with high and low EROD activity To compare the difference in gene expression in relation to levels of EROD two groups were formed. The high EROD group consisted of the 5% of the samples with highest 15 3.4. WITHIN-SITE CORRELATION CHAPTER 3. MATERIALS / METHOD EROD values, values above 0.400, and the low EROD group were the 5% of the samples with lowest EROD values, values 0.037 and lower. This gave a group size of 8 in the high level group and 7 in the low level group. No other factors were taken into account for this comparison and only samples within these groups were used to fit the linear model. This was repeated with high and low EROD individuals from the Swedish west coast. This time the limit for high EROD values were 0.35 and low EROD values were considered to be values 0.06 and under, giving group sizes of 3 and 4. 3.3.4 Difference in gene expression based on reproduction success Reproduction success was measured using rate of total abnormal fry which was defined to be the number of dead and malformed fry divided by the total brood size. A low rate of abnormal fry means good reproduction success. In the same way as for the high and low EROD comparison the only features included in the comparison were indication of belonging to the high or low rate of abnormal fry group and only samples within these groups were included in the model. High rate of abnormal fry was chosen to be rates 0.05 or more while low rates were values 0.03 or lower. This to create two large groups of size 60 for the high rate group and 77 for the low rate group. 3.4 Within-site correlation The common assumption of independence between samples is most likely not true since there are common factors affecting all fish from the same site such as ecological parame- ters and pollution. The assumption was tested by estimating ρ describing the correlation between samples belonging to the same site and σ2 describing the variance. This was done for each probe separately and the subscript indicating probe has therefore been omitted. A simplifying assumption made was that these parameters were the same for all sites and that there were no correlations between samples belonging to different sites. The covariance matrix was as follows Σ =  Σ1 0 · · · 0 0 Σ2 · · · 0 ... ... . . . ... 0 0 · · · ΣM  where each Σi, for sites 1,...,M , is the submatrix describing the covariance for site i and has dimension ni × ni where ni is the number of samples at site i. The submatrix for each site of the covariance matrix Σ was Σi = σ2  1 ρ · · · ρ ρ 1 · · · ρ ... ... . . . ... ρ ρ · · · 1  . 16 3.4. WITHIN-SITE CORRELATION CHAPTER 3. MATERIALS / METHOD We hence assumed that there existed within-site correlation for the samples but no between-site correlation. The estimation was done by maximizing the multivariate normal log likelihood l(σ2,ρ) = −n 2 ln(2π)− 1 2 ln(det(Σ))− 1 2 (y − µ)TΣ−1(y − µ). The average of the log transformed gene expressions was subtracted for each sample by replacing µ with ȳ where ȳ = ∑N i=1 yi and N is the total number of samples. To ease the optimization this was rewritten by taking advantage of the structure of the covariance matrix Σ, using the matrix determinant lemma, Sherman-Morrison formula and the multinomial theorem (Bernstein (2009) Fact 2.16.1 concerning determinant and invers and Fact 1.17.1), yielding l(σ2,ρ) =− 1 2 ( Nlog(2πσ2) + (N −M)log(1− ρ) + M∑ i=1 log(1− ρ+ ρ · ni) ) − 1 2 ( 1 σ2 M∑ i=1 (aiAi + biBi) ) where ai = −(ni − 2)ρ− 1 ((ni − 1)ρ+ 1)(ρ− 1) , bi = ρ ((ni − 1)ρ+ 1)(ρ− 1) and Ai = ni∑ k=1 (yik − µi)2, Bi = ( ni∑ k=1 yik − µi )2 −Ai where yik is the kth sample in the ith site. For maximization of the likelihood numerical optimization in R using optim was performed. 17 4 Results The result section is divided into three parts. First is the exploratory analysis which investigates the general appearance and similarities between samples. In the next part changes in the gene expression patterns are analysed. The final part investigates inde- pendence between the samples. 4.1 Exploratory analysis To get an understanding of the data the unsupervised analysis methods hierarchical clus- tering and principal component analysis were performed on the samples’ gene expression profiles. 4.1.1 Hierarchical clustering The dendrogram in figure 4.1 shows a distinct clustering by region. A German cluster containing two thirds of the German samples and only one non-German sample was found. The Swedish east coast samples were also well clustered except the samples from the site Marsö. These were clustered into a separate cluster and then joined with clusters for other regions while the large Swedish east coast cluster was late joined with other clusters. Two clear Swedish west coast clusters were formed. However, the cluster consisting of samples from Fjällbacka and Göteborg in the lower part of figure 4.1 is likely due to technical variation since all these were measured on the same physical chip. The Danish samples were more scattered than the samples from the other regions and no complete Danish clusters are seen even though some Danish samples were clustered together. A clear grouping was seen for the clustering based on high variance probes, figures 4.1 and 4.2, in comparison to clustering using low variance probes, figures A.1 and A.2 in appendix. 18 4.1. EXPLORATORY ANALYSIS CHAPTER 4. RESULTS SW Site 15 SW Site 11 SW Site 6 Ger Site 16 S W S ite 1 1 SE Site 10 Dk Site 1 Dk Site 1 G er Site 16 G er S ite 2 Dk Site 12 G er S ite 13 G er S ite 13 D k S ite 7 Dk Site 8 SE Site 10 SE Site 9 SE S ite 9 SE Site 9 SE Site 5 SE Site 5 SE Site 14 SE S ite 14 SW S ite 6 S W S ite 6 SE Site 9 SE Site 9 SE Site 10 SE Site 10 SE S ite 9 SE S ite 9 SW S ite 15 SE Site 5 S W S ite 1 1 S W S ite 1 1 Ger Site 13 G er S ite 13 G er S ite 2 G er S ite 2 SW Site 3 S W S ite 3 S W S ite 1 5 SW S ite 15 SE Site 5 S W S ite 1 1 SE S ite 14 SE Site 14 G er S ite 2 G er S ite 2 SE Site 5 S W S ite 3 S E S ite 9 Dk Site 8 Dk Site 1 Dk Site 1 Dk Site 4 Dk Site 4 Dk Site 8 Dk Site 8 SW Site 3 S W S ite 6 Dk Site 12 Dk Site 12 S E S ite 10 S E S ite 10 D k S ite 4 D k S ite 4 D k Site 7 D k S ite 7 G er S ite 13 G er S ite 13 Ger Site 16 Ger Site 16 Dk Site 1 S E S ite 5 Dk Site 1 Dk Site 1 D k S ite 7 Dk Site 7 SE Site 14 SE Site 14 Dk Site 1 Dk Site 8 Ger Site 2 SE Site 10 SE Site 10 SW Site 3 SE S ite 9 D k S ite 7 D k S ite 7 G er S ite 13 G er S ite 13 Ger Site 16Ger Site 16 Ger Site 2 G er S ite 2 Dk Site 1 Dk Site 1Dk Site 4 Dk Site 4 Dk S ite 8 Dk S ite 8 G er S ite 13 G er S ite 13 Dk Site 12 D k Site 12 S W S ite 1 5 SE S ite 14 SE Site 14 S W S ite 1 1 Dk Site 4 Dk Site 4 SW Site 15 S W S ite 1 5 SW Site 15 S W S ite 6 SW Site 3 SE Site 5 SE Site 5 SW S ite 6 SE Site 10 S W S ite 1 1 S W S ite 1 5 Dk S ite 7 Dk Site 8 Dk Site 12 Dk Site 12 G er Site 2 G er Site 16 SE Site 14 SE S ite 9 S W S ite 11 SW Site 11 D k S ite 4 D k S ite 7 Dk S ite 8 Dk Site 12 G er Site 16 Ger Site 16 SE Site 14 SE Site 10 S W S ite 1 1 S W S ite 1 5 D k S ite 4 Dk Site 7 Dk Site 8 Dk Site 12 Ger Site 16 SW S ite 6 S W S ite 3 SW S ite 6 S W S ite 3 SW S ite 6 S W S ite 3 SW S ite 6 S W S ite 3 S W S ite 3 SW S ite 6 SW SE Dk Ger Figure 4.1: Hierarchical clustering using correlation distance metric and complete linkage based on the 10% most varying probes. Dendrogram coloured by region and sites numbered in alphabetic order which shows a clear regional clustering of the samples. A large part of the regional clustering came from clustering by site which was found at a lower hierarchical level than the regional clustering and is seen in figure 4.2 where the dendrogram is coloured by site. Even if not all samples from the same site were clustered together small clusters tended to be formed by samples from the same site. We did not identify any clustering based on levels of pollution, figure 4.3. The small tendencies seen seem more to be site or region effects that make some pattern appear. 4.1.2 Principal component analysis The principal component analysis gave no distinct division of the samples into regions or pollution classification when looking at two dimensional principal component plots. The first 10 principal components captured 57 % of the variance and how the variance captured by the principal components declines is shown in figure B.1 in appendix. As in line with the hierarchical clustering there seemed to be more grouping by region than 19 4.1. EXPLORATORY ANALYSIS CHAPTER 4. RESULTS Site 15 Site 11 Site 6 Site 16 S ite 1 1 Site 10 Site 1 Site 1 Site 16 S ite 2 Site 12 S ite 13S ite 13 S ite 7 Site 8 Site 10 Site 9 Site 9 Site 9 Site 5 Site 5 Site 14Site 14 Si te 6S ite 6 Site 9 Site 9 Site 10 Site 10 Si te 9 Si te 9 Site 15 Site 5 S ite 1 1 S ite 1 1 Site 13 S ite 13S ite 2 S ite 2 Site 3 S ite 3 S ite 1 5 Site 15 Site 5 S ite 1 1 Site 14 Site 14 S ite 2 S ite 2 Site 5 S ite 3 S ite 9 Site 8 Site 1 Site 1 Site 4 Site 4 Site 8 Site 8 Site 3 S ite 6 Site 12 Site 12 S ite 10 S ite 10 S ite 4 S ite 4 Site 7 S ite 7 S ite 13 S ite 13 Site 16 Site 16 Site 1 S ite 5 Site 1 Site 1 S ite 7 Site 7 Site 14 Site 14 Site 1 Site 8 Site 2 Site 10 Site 10 Site 3 Site 9 S ite 7 S ite 7 S ite 13 S ite 13 Site 16 Site 16 Site 2 S ite 2 Site 1 Site 1Site 4 Site 4 Site 8 Site 8 S ite 13 S ite 13 Site 12 Site 12 S ite 1 5 Site 14 Site 14 S ite 1 1 Site 4 Site 4 Site 15 S ite 1 5 Site 15 S ite 6 Site 3 Site 5Site 5 Si te 6 Site 10 S ite 1 1 S ite 1 5 Site 7 Site 8 Site 12 Site 12 Site 2 Site 16 Site 14 Site 9 S ite 11 Site 11 S ite 4 S ite 7 Site 8 Site 12 Site 16Site 16 Site 14 Site 10 S ite 1 1 S ite 1 5 S ite 4 Site 7 Site 8 Site 12 Site 16 Si te 6 S ite 3Si te 6 S ite 3Si te 6 S ite 3 Site 6 S ite 3 S ite 3 Si te 6 Figure 4.2: Hierarchical clustering using correlation distance metric and complete linkage based on the 10% most varying probes. Dendrogram coloured by site and numbered in alphabetic order visualising that the samples were clustered by site. pollution. Figure 4.4 shows an example of the principal component analysis. In the left plot it is seen that some discrimination between regions can be found but the overlap was quite large for the confidence ellipses at confidence level 75%. The right plot in the same figure shows the same principal components but this time the samples were marked by pollution level for which the confidence ellipses coincide. Comparison by the left and right subplots tells us that these two principal components divided better between regions than pollution. Even when performing PCA within each region the overlap for samples from different sites were considerable. It was however possible to find principal components that to some extent divided the samples by site, for an example see figure 4.5. As for the PCA using all samples there were no distinct segregation based upon pollution classification but tendencies could be found. For example figure 4.6 shows Danish samples marked either as reference or medium polluted and polluted where a division between polluted 20 4.2. GENE RANK ANALYSIS CHAPTER 4. RESULTS P Site 15 R Site 11 P Site 6 P Site 16 R S ite 1 1 R Site 10 R Site 1 R Site 1 P Site 16 M S ite 2 P Site 12 M S ite 13 M S ite 13 M S ite 7 P Site 8 R Site 10 R Site 9 R S ite 9 R Site 9 P Site 5 P Site 5 M Site 14 M S ite 14 P Si te 6 P S ite 6 R Site 9 R Site 9 R Site 10 R Site 10 R S ite 9 R S ite 9 P S ite 15 P Site 5 R S ite 1 1 R S ite 1 1 M Site 13 M S ite 13 M S ite 2 M S ite 2 R Site 3 R S ite 3 P S ite 1 5 P S ite 15 P Site 5 R S ite 1 1 M S ite 14 M Site 14 M S ite 2 M S ite 2 P Site 5 R S ite 3 R S ite 9 P Site 8 R Site 1 R Site 1 P Site 4 P Site 4 P Site 8 P Site 8 R Site 3 P S ite 6 P Site 12 P Site 12 R S ite 10 R S ite 10 P S ite 4 P S ite 4 M Site 7 M S ite 7 M S ite 13 M S ite 13 P Site 16 P Site 16 R Site 1 P S ite 5 R Site 1 R Site 1 M S ite 7 M Site 7 M Site 14 M Site 14 R Site 1 P Site 8 M Site 2 R Site 10 R Site 10 R Site 3 R S ite 9 M S ite 7 M S ite 7 M S ite 13 M S ite 13 P Site 16 P Site 16 M Site 2 M S ite 2 R Site 1 R Site 1P Site 4 P Site 4 P Site 8 P S ite 8 M S ite 13 M S ite 13 P Site 12 P Site 12 P S ite 1 5 M S ite 14 M Site 14 R S ite 1 1 P Site 4 P Site 4 P Site 15 P S ite 1 5 P Site 15 P S ite 6 R Site 3 P Site 5P Site 5 P Si te 6 R Site 10 R S ite 1 1 P S ite 1 5 M S ite 7 P Site 8 P Site 12 P Site 12 M Site 2 P Site 16 M Site 14 R S ite 9 R S ite 11 R Site 11 P S ite 4 M S ite 7 P S ite 8 P Site 12 P Site 16 P Site 16 M Site 14 R Site 10 R S ite 1 1 P S ite 1 5 P S ite 4 M Site 7 P Site 8 P Site 12 P Site 16 P Si te 6 R S ite 3 P Si te 6 R S ite 3 P Si te 6 R S ite 3 P S ite 6 R S ite 3 R S ite 3P Si te 6 R Reference M Medium P Polluted Figure 4.3: Hierarchical clustering using correlation distance metric and complete linkage based on the 10% most varying probes. Dendrogram coloured by pollution classification and sites numbered in alphabetic order showing that no clustering based on pollution was found. and non-polluted samples are seen. This could however be a result of aggregation by site that occasionally looks like grouping by pollution. 4.2 Gene rank analysis For the following analysis the linear statistical model presented in theory has been used for each probe assuming independent normally distributed log transformed gene expres- sions. The empirical Bayes method in the limma-package was then used to estimate the variance (see theory chapter for details). The results were visualised in volcano plots where each probe is a point represented by its log fold change on the horizontal axis and its false discovery rate adjusted p-value on the vertical axis. These are interpreted such that large magnitudes in horizontal direction are probes which differ a lot in expression between the groups while points close to zero in horizontal direction do not alter in expression between the groups. The 21 4.2. GENE RANK ANALYSIS CHAPTER 4. RESULTS −150 −100 −50 0 50 100 − 10 0 − 50 0 50 10 0 PC 2 Var: 0.0904 P C 3 V ar : 0 .0 71 ● SW SE Dk Ger ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −150 −100 −50 0 50 100 − 10 0 − 50 0 50 10 0 PC 2 Var: 0.0904 P C 3 V ar : 0 .0 71 ● Reference Medium Polluted ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Unscaled PCA Figure 4.4: Unscaled principal component analysis performed on all samples with confi- dence ellipses at confidence level 75%. Same principal components are visualised in both subplots and samples coloured by region to the left and coloured by pollution to the right. This shows that the samples were to some extent divided into region but not into pollution classification by these components. vertical axis describes the significance in the changes represented as minus the base ten logarithm of the FDR adjusted p-values. This means that points far up on the vertical axis have very small p-values and high significance. Probes in the top left and right corners are the probes that altered the most and had changes with high significance. 4.2.1 Differences in gene expression based on pollution classification Since reference sites were compared against polluted sites a negative log fold change means that the probe was upregulated in the polluted sites compared to in the reference sites. The effects seen were not very strong and a FDR cut-off at 0.01 gave 1553 signif- icantly differentially expressed probes. Figure 4.7 shows that the comparison between samples from reference sites against samples from polluted sites had more downregulated than upregulated probes in the polluted sites. There was however a larger number of high significance changes for the in polluted sites upregulated probes than the number of high significance changes for the downregulated probes When moving to regional comparisons the log fold changes and significance were overall larger (figures 4.8-4.11). The number of significantly differentially expressed probes for FDR cut-off 0.01 were 5276 for Swedish west coast, 7085 for Swedish east coast, 3112 for Denmark and 5172 for Germany. The volcano plots were quite symmetrical for all but Swedish east coast (figure 4.9) which had much more high significance probes with positive log fold changes than negative log fold changes. This means that Swedish 22 4.2. GENE RANK ANALYSIS CHAPTER 4. RESULTS ● ● ● ●● ● ● ● ● −200 −150 −100 −50 0 50 100 − 15 0 − 10 0 − 50 0 50 10 0 PC 1 Var: 0.2066 P C 2 V ar : 0 .1 53 ● Eggers Wiek Salzhaff/Kroy Wesmar/Wendorf Unscaled PCA for Germany Figure 4.5: Unscaled principal component analysis performed on German samples with confidence ellipses at confidence level 75%. Samples are marked by site which shows that these principal components gave some division of the German samples into site. −100 −50 0 50 100 − 50 0 50 PC 3 Var: 0.071 P C 6 V ar : 0 .0 38 8 ● Reference Med Poll/Poll ● ● ● ● ● ● ● ● ● ● Unscaled PCA for Denmark Figure 4.6: Unscaled principal component analysis performed on Danish samples with confidence ellipses at confidence level 75%. The samples are coloured by pollution clas- sification where medium polluted sites have been grouped with polluted sites which gave a division of the samples. east coast seems to have had more significantly downregulated genes in the polluted sites compared to the other regions but the cause for this is not known. 4.2.2 Gene expression in relation to the biomarker EROD Correlation between EROD activity and expression of CYP1A The regional correlations were formed by taking average of the site correlations for each of the four probes and are summarised in table 4.1 while all site correlations can be seen in table C.1 in appendix. There was a clear positive correlation between measured expression of CYP1A and levels of EROD, measure of the enzyme activity of CYP1A. Two of the sites had however negative correlations. In figure 4.12 log transformed EROD values are shown at the horizontal axis and log transformed expression for CYP1A for one of the probes on the vertical axis. Linear regression was performed for each site to get a visualisation of the relation at individual level. The mean of the regional correlations were 0.45 for Swedish west coast, 0.35 for Swedish east coast, 0.25 for Denmark and 0.26 for Germany. The correlation was highest for the Swedish west coast. If not for the site G̊asö the correlation for the Swedish east coast would have been high as well. It can be noticed that the German samples had lower EROD levels compared to the other regions but the same levels of CYP1A expression. 23 4.2. GENE RANK ANALYSIS CHAPTER 4. RESULTS Figure 4.7: Volcano plot for comparison of hepatic gene expression for samples from refer- ence sites against polluted sites. A negative log fold change tells that an upregulation was seen in the fish from polluted sites. The line represents a FDR cut-off at 0.01 and shows that small but significant effects were seen. Differences in gene expression for samples with high and low EROD activity As expected all CYP1A probes were upregulated in the high EROD group and we saw log fold changes between 0.56 to 1.39 with adjusted p-values between 0.08 and 0.195 (table 4.2). It was however not the probes for gene expression of CYP1A that were the probes with most extreme log fold changes even if the groups were chosen based on EROD. Instead other probes had log fold changes with very large magnitudes. When using a FDR cut-off at 0.01 there were 1178 significantly changed probes, the probes for CYP1A not being any of them. There were less significantly differentially expressed probes in the comparison of high and low EROD individuals from the Swedish west coast than in the all regions comparison of high and low EROD individuals. The number of significantly differentially expressed probes was 51 for the within Swedish east coast comparison. The log fold changes in the comparison between high and low EROD samples from the Swedish west coast had 24 4.2. GENE RANK ANALYSIS CHAPTER 4. RESULTS Figure 4.8: Volcano plot for comparison of hepatic gene expression for samples from Swedish west coast where samples from ref- erence sites were compared against sam- ples from polluted sites. Negative log fold changes indicate upregulations in fish from the polluted sites and the line represents a FDR cut-off at 0.01. Larger effects were seen here compared to the all regions com- parison of reference against polluted. Figure 4.9: Volcano plot for comparison of hepatic gene expression for samples from Swedish east coast where samples from ref- erence and medium polluted sites were com- pared against samples from polluted sites. Negative log fold changes indicate upregu- lations in fish from the polluted sites and the line represents a FDR cut-off at 0.01. A large proportion of the probes were down- regulated in the polluted sites. Table 4.1: Table showing the regional correlations of log transformed EROD levels and log transformed gene expression of CYP1A measured using four different probes. All regions had clear positive correlations for all probes. The strongest correlation was found at the Swedish west coast. Probe 13544 Probe 13545 Probe 13546 Probe 13547 Mean SW 0.386 0.473 0.375 0.580 0.453 SE 0.292 0.347 0.401 0.368 0.352 Dk 0.220 0.210 0.332 0.250 0.253 Ger 0.270 0.267 0.171 0.322 0.257 larger magnitudes than in the all regions comparison and also the log fold changes for the CYP1A probes were larger. The p-values were smaller but still not significant (table 4.3). Notice the difference in rank of the CYP1A probes between the all sample comparison and the within Swedish west coast comparison (table 4.2 and table 4.3). 25 4.2. GENE RANK ANALYSIS CHAPTER 4. RESULTS Figure 4.10: Volcano plot for compari- son of hepatic gene expression for samples from Denmark where samples from refer- ence and medium polluted sites were com- pared against samples from polluted sites. Negative log fold changes indicate upregu- lations in fish from the polluted sites and the line represents a FDR cut-off at 0.01 show- ing large and significant changes. Figure 4.11: Volcano plot for compari- son of hepatic gene expression for samples from Germany where samples from medium polluted sites were compared against sam- ples from polluted sites. Negative log fold changes indicate upregulations in fish from the polluted sites and the line represents a FDR cut-off at 0.01. Large and significant changes were seen despite the absence of ref- erence sites. Table 4.2: Table of changes for the probes measuring expression of CYP1A in the comparison between the 5% of the individuals with highest levels of EROD against the 5 % with lowest EROD lev- els. There was large upregulation of the CYP1A gene in the high EROD group but using a FDR cut-off at 0.01 none of these changes were considered as signifi- cant. Probe Rank log FC FDR adj. p-val 13547 12372 1.39 0.08 13545 26170 0.56 0.18 13544 27235 0.41 0.19 13546 27931 1.01 0.20 Table 4.3: Table of changes for the probes measuring expression of CYP1A in the comparison between high and low EROD individuals at the Swedish west coast. A larger upregulation was seen in this comparison than in the all regions comparison based on EROD but still, us- ing a FDR cut-off at 0.01, none of the changes were considered to be significant. Probe Rank log FC FDR adj. p-val 13546 3206 2.18 0.08 13544 4213 0.95 0.09 13547 4472 2.45 0.10 13545 9442 1.15 0.15 26 4.2. GENE RANK ANALYSIS CHAPTER 4. RESULTS ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● −4.0 −3.5 −3.0 −2.5 −2.0 −1.5 −1.0 −0.5 10 11 12 13 14 15 Region correlation SW: 0.58 Logged EROD Lo gg ed g en e ex pr es si on Stenungsund Ringhals Göteborg harbour Fjällbacka ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● −4.0 −3.5 −3.0 −2.5 −2.0 −1.5 −1.0 −0.5 10 11 12 13 14 15 Region correlation SE: 0.37 Logged EROD Lo gg ed g en e ex pr es si on Marsö Kvädöfjärden Gåsö Slakmöre ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −4.0 −3.5 −3.0 −2.5 −2.0 −1.5 −1.0 −0.5 10 11 12 13 14 15 Region correlation Dk: 0.25 Logged EROD Lo gg ed g en e ex pr es si on Agersö Roskilde Fjord Isefjord Karrebaek Fjord Fredriksvaerk ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −4.0 −3.5 −3.0 −2.5 −2.0 −1.5 −1.0 −0.5 10 11 12 13 14 15 Region correlation Ger: 0.32 Logged EROD Lo gg ed g en e ex pr es si on Wesmar/Wendorf Eggers Wiek Salzhaff/Kroy Figure 4.12: Scatterplots of log transformed EROD levels against log transformed gene expression for one of the probes measuring expression of CYP1A. Linear regression has been plotted for each site. The region correlation is the average of this probe’s within-site correlations for each region. All sites but Frederiksværk and G̊asö had positive correlations and most correlations were strong. Notice the low levels of EROD in the German samples compared to the other regions. 4.2.3 Difference in gene expression based on reproduction success A change in gene expression as a result of external impact is on its own not a bad sign for the fish. Instead it could be a confirmation that the species has the necessary tools to cope with a changing environment. What it all comes down to in the end is the possibility to successfully reproduce. By looking at differences in gene expression in relation to rate of abnormal fry we got a linking between the processes at mRNA level in the fish and impact on reproduction ability. Figure 4.14 shows the volcano plot for the comparison based on rate of abnormal fry. There was a high density of probes with low significance and close to zero log fold change. The number of significantly changed probes using a FDR cut-off at 0.01 was 100. Very few probes had a log fold change with larger magnitude than one which makes this comparison the one with the smallest differences in gene expression. For the differences that were found there was approximately the same amount of upregulated as downregulated genes. 27 4.3. WITHIN-SITE CORRELATION CHAPTER 4. RESULTS Figure 4.13: Volcano plot for comparison of hepatic gene expression for the 5% of the individuals with highest levels of EROD against the 5 % with lowest EROD levles. A positive log fold change tells that an upregulation was seen in fish in the high EROD group. Large effects were found and the line representing a FDR cut-off at 0.01 shows that many of these were significant but the four probes measuring expression of CYP1A, listed in table 4.2 and here marked with triangles, were not. 4.3 Within-site correlation To investigate the within-site relationship between the samples a within-site correlation parameter ρ was estimated together with an estimation of the variance σ2 for each probe. The estimated variances ranged between 0.006 and 10.3 with median 0.183. In em- pirical Bayes for microarray data the variances are assumed to be from an inverse gamma distribution which the histogram of σ̂2 (figure 4.15) follows. This is consistent with re- sults of Smyth et al. (2003) and Kristiansson et al. (2005) which concluded that inverse gamma is a good distribution assumption in microarray analysis. In the histogram the 412 probes with σ̂2 larger than 3 have been left out. Often the within-site correlation is assumed to be zero but the clustering gave indices 28 4.3. WITHIN-SITE CORRELATION CHAPTER 4. RESULTS Figure 4.14: Volcano plot for comparison of hepatic gene expression based on rate of abnormal fry which was defined to be the number of dead and abnormal fry didvided by the total brood size. Two large groups contaning almost all samples were formed and the high rate of abnormal fry group was compared aganist the low rate group. Small effects were seen and the line representing a FDR cut-off at 0.01 shows that few probes were significantly changed. that there are apparent similarities between samples belonging to the same site and here we saw that independence between samples from the same site was most of the times not the case. A majority of genes seem to have been affected by within-site correlation. The within-site correlation ranged between -0.09 and 0.840 with median 0.175 (figure 4.16). Out of the 135 091 probes the number of negative ρ̂ was 4980. The relation between the variance and the within-site correlation for each probe is seen in figure 4.17. The highest density of probes was found for σ̂2 close to 0.11 and ρ̂ close to 0.16. Noticeable is that for both high and low variance probes large values of ρ̂ were found and the within-site correlations were spread over the same range. There were numerous probes with large within-site dependence. The probes that represent known genes and with largest ρ̂ are listed in table 4.4, for a more extensive 29 4.3. WITHIN-SITE CORRELATION CHAPTER 4. RESULTS Histogram of σ2 σ2 D en si ty 0.0 0.5 1.0 1.5 2.0 2.5 3.0 0 1 2 3 4 Figure 4.15: Truncated histogram of esti- mated variance, σ̂2, of each probe. The his- togram seems to follow an inverse gamma distribution which is consistent with the as- sumption made in empirical Bayes. Histogram of ρ ρ D en si ty 0.0 0.2 0.4 0.6 0.8 1.0 0 1 2 3 4 Figure 4.16: Histogram of estimated within-site correlation, ρ̂, for each probe. This shows that many genes seem to have been affected by within-site correlation and several strong within-site correlations were seen. listing see table D.1 in appendix. At the top we find genes associated with tryptase-2 precursor, ATP binding cassette, ice structuring protein and antifreeze proteins. Genes with smallest ρ̂ were connected to ribosomal functions. 30 4.3. WITHIN-SITE CORRELATION CHAPTER 4. RESULTS Figure 4.17: Scatterplot showing the relation between estimated variance, σ̂2, and esti- mated within-site correlation, ρ̂, for the probes. The dispersion of the points, each point representing a probe, shows that the range of estimated within-site correlations was the same for all variances. 31 4.3. WITHIN-SITE CORRELATION CHAPTER 4. RESULTS Table 4.4: List of the probes representing known genes and which had largest estimated within-site correlations. Genes appearing multiple times are represented as the probe with largest estimated within-site correlation. Rank ρ̂ Annotation 1 0.84 tryptase-2 precursor 15 0.73 ATP-binding cassette, sub-family B (MDR/TAP), member 6a 22 0.72 ANP2 ANALU Type-3 ice-structuring protein 31 0.69 type III antifreeze protein 45 0.66 cytochrome P450, family 2, subfamily J, polypeptide 2 51 0.65 type III antifreeze protein 58 0.64 protein FAM110B-like 71 0.61 PAX3 and PAX7 binding protein 78 0.60 RNA-binding protein 6-like 81 0.59 PAX3- and PAX7-binding protein 82 0.59 hypothetical protein 87 0.59 splicing regulatory glutamine/lysine-rich protein 99 0.58 ANP3 ZOAAM Ice-structuring protein 32 5 Discussion In this study hepatic gene expression microarray data from wild fish of the species eelpout was analysed. The fish, 158 in total, were sampled at 16 sites located in four large geographical regions in the Baltic Sea. The aim was to understand microarray gene expression patterns in the fish eelpout and how they change in connection to pollution. Discrimination between samples based on field site, population and exposure of pollu- tion was investigated. Differences in gene expression of the fish in relation to exposure of contamination and the biomarkers EROD and reproduction success were identified using linear model and empirical Bayes. Using numerical optimization under the as- sumption of normal distributed gene expressions the within-site correlation between the gene expressions was estimated for each probe separately. Clustering analysis showed that fish from the same region and site had a high tendency to group together. The com- parison between fish from reference and polluted sites had significant differences in gene expression but the effects were in general small. This is likely an indication that there were few shared differentially expressed genes between the polluted sites. The effects where however larger and more significant for the regional comparisons based pollution classification which could either be due to differences in population that was blurring out the signal or that the pollution in each region was more homogeneous. The large effects identified in connection to region and site may indicate that ecological factors and population parameters had a substantial impact on the observed gene expression profiles. Large and significant effects were also seen when comparing fish with low and high values of the known biomarker EROD. We observed strong correlations between measured gene expression for CYP1A gene and levels of EROD. The strength of the correlation varied between regions and the highest correlation was found at the Swedish west coast. There were, however, small effects on the gene expression between fish with high and low rate of abnormal fry and a more thorough investigation of the reproduc- tion success and its connection to changes in gene expression should be performed. We assessed the independence between sampled fish and found that several genes had a high 33 CHAPTER 5. DISCUSSION within-site correlation. This underlines the importance to use a model that models site specific effects and/or correlations. Large-scale field studies will always have factors that affect the sampling procedure and retrieving of data. Such factors are sampling bias and systematic effects caused by different sampling circumstances for the regions. An example from this study is that testing was done at site for some regions and for others the samples needed to be moved before testing was performed. It is often hard to remove the impact of these factors and larger and more complex studies will typically result in more varying conditions. The extensive set-up of this study is unique and much care has been put into making the sampling and testing procedure for all sites as similar as possible such as performing all microarray and enzyme analyses by the same laboratory. The general appearance was analysed by hierarchical clustering, using correlation distance metric and complete linkage, as well as by unscaled PCA. This showed that site and region had a high tendency to group together. According to the hierarchical clustering of the samples, region and site seem to be the factors that had largest influence on the general appearance of the gene expression profiles of the fish. For discrimination into site and region the high variance probes contained more information than the low variance probes. Three of the regions, namely Swedish west coast, Swedish east coast and Germany, had clear regional grouping while the Danish fish did not seem to stand out compared to fish from the other regions. The variable reduction into two variables using principal component analysis was not able to capture the division into region or site in a credible way. Summing up the explanatory analysis there was apparent division into region and site but no perfect segregation was seen. It could also be concluded that neither the clustering nor the PCA showed clear grouping based on pollution which in- dicate that there were likely not a large number of genes collectively induced by different types of contamination. Often large-scale gene expression studies address short term hypotheses for controlled experiments in laboratory. In contrast, this study considered wild sampled fish that, due to its stationary, had lived at the same site for a long time. Therefore, the fish was assumed to have been chronically exposed to the ecological and anthropogenic conditions at the site and many of the short term stress responses could be assumed to have settled down. Using a linear model, gene expression profiles between samples from reference and polluted sites were compared. There were significant differences in gene expression but the effects were in general small. The effects were however larger and more significant when the pollution classification based comparisons were performed within each region. The in general small effects in the comparison between fish from reference and polluted sites could be due to various reasons. One of them is the difficulty of prior pollution classification. It is for example possible that some reference sites are not completely clean. It is also possible that the level of exposure differed between the polluted sites, which may induce variation in the gene response. The regional comparisons showed larger effects despite that reference and medium polluted sites now were combined into one group and compared against polluted sites. This indicated that there was another cause for the small effects in the all regions comparison than solely the pollution classification. 34 CHAPTER 5. DISCUSSION It should also be noted that the polluted sites are likely to have been exposed in different ways which led to differences in gene expression profiles within the group of polluted. This implied that there were significant effects in the data but either the differences in population was blurring out the signal or there was more similar types of pollution within the same region. The regional averages of the within-site Pearson correlation coefficients between mea- sured expression of CYP1A and levels of EROD, measure of the enzyme activity of CYP1A, showed strong positive correlations. The clear positive correlations we saw for all regions, average between 0.26 and 0.45, lies in line with the correlations seen in Guo et al. (2008) which were between 0.21 and 0.64. In Guo et al. (2008) the relation between mRNA levels and protein expression in human circulating monocytes was investigated and even though this study investigated mRNA levels in the liver of eelpouts compared to activity of protein this gave an indication of what to be expected. Considering the difference in measuring protein expression and activity of protein it seems as the correla- tions seen here were almost a bit stronger than expected especially for the Swedish west coast. It can be noticed that the German samples had lower levels of EROD compared to the other regions but the same measured expression of CYP1A. This could have been a cause of inhibition or population effects. The gene expression profiles of the 5% of the individuals with highest levels of EROD were compared against the 5% of the individuals with lowest levels of EROD. It was found to be large and significant differences in gene expression between the groups. EROD is used as a general biomarker for pollution but its biological function is complex and involves detoxification of foreign chemical substances which includes both ecological impact and anthropogenic pollution. The more specialized biological function of EROD than as protection against general pollution could be the reason for the larger effects in the comparison based on EROD than the effects seen in the comparisons based on pollution classification. We saw upregulation for many genes in the high EROD group and it is likely that other genes than CYP1A also were upregulated due to the same external influence. As an indicator of reproduction success the rate of abnormal fry was chosen and defined to be the number of dead and abnormal fry divided by the total brood size. Two large groups were formed containing almost all samples and fish with high rates of abnormal fry were compared against samples with low rates. Small differences in the gene expression profiles were seen for these groups. The reproduction process involves many biological systems and therefore it might be too complex to be captured by a comparison of this type. This would be the case if having many different reasons for the high abnormal rate which means that different genes are changing for the samples within the same group. It is likely that the number of fry and the rate of abnormal fry were affected by physiological and ecological parameters. The number of fry was for example strongly correlated with the length of the fish which was biased by region. Furthermore, small brood sizes tended to have a lower rate of abnormal fry. There was therefore no clear way how to choose the groups for the reproduction comparison. Based on the comparison performed it cannot be concluded that the data do not contain information 35 CHAPTER 5. DISCUSSION about the relationship between gene differences and reproduction success but rather that the comparison needs to be done more carefully. By maximizing the normal log likelihood for each probe with respect to variance and within-site correlation the dependencies between the samples were investigated. This was performed under the assumptions that the parameters were the same for all sites and that there were no correlation between samples from different sites. It is a common assumption in microarray analysis using linear models that the samples are independent but it was here shown that many genes were affected by within-site correlation and that it was strong for several genes. Due to this dependence the p-values calculated under the assumption of independence cannot be assured to be accurate. For many biological question the most interesting is however the ordering of the genes and even if the magnitude of the p-values was not reliable due to correlation the ranking of the probes may be more robust.Two approaches that could be used to take care of the within-site correlation are either by modifying the covariance matrix in the linear model or to use mixed models. Out of the roughly 135 000 probes 4% had negative within- site correlation, all with small magnitudes. Among the genes with negative within-site correlation there was an overrepresentation of genes associated with basal functions such as ribosomal processes. These processes are thought not to be easily affected by environmental impact and therefore not subject of within-site correlation. Considering the large number of estimations performed, one for each probe, the negative within-site correlations are likely to be due to variations around zero seen when having a large number of independent samples. To be able to make inference on individual level the dependence should be taken into account. One way to interpret the impact of within-site correlation is in the context of sampling design. If a specific gene is to be investigated and its within-site correlation is low fish from the same site can be considered to be independent replicates. To conclude, this work shows that large-scale gene expression is a viable tool for assessing differentially expressed genes in wild fish. Our results suggest there are small effects on gene expression connected to pollution and rate of abnormal fry in the case of large-scale sampling and that factors such as population might interfere. We could, however, identify large effects connected to region and site which may indicate that eco- logical factors and population parameters have a substantial impact on gene expression profiles. Finally, we demonstrated that genes tend to have similar expression levels in fish from the same site which underlines the need of taking dependences into account if inference on individual level is to be made. 36 Bibliography Bruce Alberts. Molecular biology of the cell. Garland Science, New York, 2002. ISBN 0815340729. E Albertsson, J Gercken, J Strand, N Asker, S Bergek, I Holmqvist, U Kammann, J. Parkkonen, and L Förlin. Biomarker responses in eelpout from different coastal sites in Sweden, Denmark and Germany. Manuscript, 2011. Steven Arnold. The theory of linear models and multivariate analysis. Wiley, New York, 1981. ISBN 0471050652. Dennis Bernstein. Matrix mathematics theory, facts, and formulas. Princeton University Press, Princeton, N.J, 2009. ISBN 9780691132877. F. Falciani, A.M. Diab, V. Sabine, T.D. Williams, F. Ortega, S.G. George, and J.K. Chipman. Hepatic transcriptomic profiles of European flounder (Platichthys flesus) from field sites and computational approaches to predict site from stress gene responses following exposure to model toxicants. Aquatic Toxicology, 90(2):92–101, 2008. Yanfang Guo, Peng Xiao, Shufeng Lei, Feiyan Deng, Gary Guishan Xiao, Yaozhong Liu, Xiangding Chen, Liming Li, Shan Wu, Yuan Chen, Hui Jiang, Lijun Tan, Jingyun Xie, Xuezhen Zhu, Songping Liang, and Hongwen Deng. How is mRNA expression predic- tive for protein expression? A correlation study on human circulating monocytes. 40 (5):426–436, 2008. doi: 10.1111/j.1745-7270.2008.00418.x. Trevor Hastie. The elements of statistical learning data mining, inference, and prediction. Springer, New York, 2009. ISBN 0387848576. J.E. Hedman, H. Rüdel, J. Gercken, S. Bergek, J. Strand, M. Quack, M. Appelberg, L. Förlin, A. Tuvikene, and A. Bignert. Eelpout (Zoarces viviparus) in marine envi- ronmental monitoring. Marine Pollution Bulletin, 62(10):2015–2029, 2011. R.A. Johnson and D.W. Wichern. Applied Multivariate Statistical Analysis. Prentice Hall series in statistics. 1998. ISBN 9780138341947. 37 BIBLIOGRAPHY BIBLIOGRAPHY Erik Kristiansson, Anders Sjögren, Mats Rudemo, and Olle Nerman. Weighted Analysis of Paired Microarray Experiments. Statistical Applications in Genetics and Molecular Biology, 4(1):1–47, 2005. M.K. Sellin Jeffries, A.C. Mehinto, B.J. Carter, N.D. Denslow, and A.S. Kolok. Taking microarrays to the field: Differential hepatic gene expression of caged fathead minnows from Nebraska watersheds. Environmental Science and Technology, 46(3):1877–1885, 2012. A. Sjögren, E. Kristiansson, M. Rudemo, and O. Nerman. Weighted analysis of general microarray experiments. BMC Bioinformatics, 8, 2007. G. K. Smyth. In Bioinformatics and Computational Biology Solutions Using R and Bioconductor, chapter 23. New York, 2005. G.K. Smyth. Linear models and empirical bayes methods for assessing differential ex- pression in microarray experiments. Statistical Applications in Genetics and Molecular Biology, 3(1), 2004. GK Smyth, NP Thorne, and J Wettenhall. LIMMA: Linear Models for Microarray Data User’s Guide, 2003. URL http://www. bioconductor. org, 2003. J.J. Whyte, R.E. Jung, C.J. Schmitt, and D.E. Tillitt. Ethoxyresorufin-O-deethylase (EROD) activity in fish as a biomarker of chemical exposure. Critical Reviews in Toxicology, 30(4):347–570, 2000. T.D. Williams, I.M. Davies, H. Wu, A.M. Diab, L. Webster, M.R. Viant, J.K. Chipman, M.J. Leaver, S.G. George, C.F. Moffat, and C.D. Robinson. Molecular responses of European flounder (Platichthys flesus) chronically exposed to contaminated estuarine sediments. Chemosphere, 108:152–158, 2014. 38 A Appendix SW Site 15 SW Site 11 SW Site 6Ger Site 16 SW Site 11 SE Site 10 Dk Site 1 D k Si te 1 Ger Site 16 Ger Site 2 Dk Site 12 Ger Site 13 Ger Site 13 Dk Site 7 Dk Site 8 SE Site 10 SE Site 9 SE Site 9 SE Site 9 SE S ite 5 SE Site 5 SE Site 14 SE Site 14 SW Site 6 SW S ite 6 SE S ite 9 SE S ite 9 SE Site 10 SE Site 10 SE Site 9 SE Site 9 SW Site 15 SE Site 5 SW Site 11 SW S ite 11 Ger Site 13 Ger Site 13 Ger Site 2 Ger Site 2 SW Site 3 SW Site 3 SW Site 15 SW Site 15 SE Site 5 SW S ite 11 SE Site 14 SE Site 14 Ger Site 2 Ger Site 2 SE Site 5 SW Site 3 SE Site 9 Dk Site 8 Dk S ite 1 D k Si te 1 Dk Site 4 D k Si te 4 Dk Site 8 Dk Site 8 SW Site 3 SW Site 6 Dk Site 12 Dk Site 12 SE S ite 1 0 SE S ite 1 0 Dk S ite 4 Dk S ite 4 Dk Site 7 Dk Site 7 Ger Site 13 G er Site 13 G er Site 16 Ger Site 16D k S ite 1 S E S ite 5 Dk S ite 1 D k S ite 1 Dk S ite 7 D k S ite 7 SE Site 14 SE Site 14 Dk Site 1 Dk S ite 8 G er S ite 2SE Site 10 SE Site 10 SW S ite 3 SE Site 9 D k S ite 7 D k S ite 7 Ger Site 13 Ger S ite 13 G er Site 16 Ger Site 16 G er Site 2 G er Site 2 D k S ite 1 D k S ite 1 D k S ite 4 D k S ite 4 D k S ite 8 D k S ite 8 Ger Site 13 Ger Site 13 D k S ite 1 2 Dk Site 12 SW Site 15 SE S ite 1 4 SE Site 14 SW Site 11 D k S ite 4 Dk S ite 4 S W S ite 1 5 S W S ite 15 SW Site 15 SW Site 6 SW S ite 3 SE Site 5 SE Site 5 S W S ite 6 S E S ite 10 SW Site 11 S W S ite 15 D k S ite 7 D k S ite 8 D k S ite 1 2 D k S ite 1 2 G er S ite 2 Ger Site 16 S E S ite 1 4 S E S ite 9 S W S ite 1 1 S W S ite 1 1 D k S ite 4 D k S ite 7 D k S ite 8 D k S ite 1 2 G er S ite 16 G er S ite 16 SE Site 14 S E S ite 10 S W S ite 11 S W S ite 15 D k S ite 4 D k S ite 7 D k S ite 8 D k S ite 1 2 G er S ite 16 S W S ite 6 S W S ite 3 S W S ite 6 S W S ite 3 S W S ite 6 S W S ite 3 S W S ite 6 S W S ite 3 S W S ite 3 S W S ite 6 SW SE Dk Ger Figure A.1: Hierarchical clustering of the samples based on correlation distance metric and complete linkage using the 10% least varying probes. Dendrogram coloured by region and sites numbered in alphabetic order showing less grouping than was seen for clustering using the 10% most varying probes. 39 APPENDIX A. APPENDIX Site 15 Site 11 Site 6Site 16 Site 11 Site 10 Site 1 Si te 1 Site 16 Site 2 Site 12 Site 13 Site 13 Site 7 Site 8 Site 10 Site 9 Site 9 Site 9 Site 5 Site 5 Site 14 Site 14 Site 6 Site 6 Site 9 Site 9 Site 10 Site 10 Site 9 Site 9 Site 15 Site 5 Site 11 Site 11 Site 13 Site 13 Site 2 Site 2 Site 3 Site 3 Site 15 Site 15 Site 5 Site 11 Site 14 Site 14 Site 2 Site 2 Site 5 Site 3 Site 9 Site 8 Site 1 Si te 1 Site 4 Si te 4 Site 8 Site 8 Site 3 Site 6 Site 12 Site 12 Site 1 0 Si te 1 0 Site 4Site 4 Site 7 Site 7 Site 13 Site 13 Site 16 Site 16 S ite 1 S ite 5 Site 1 S ite 1 Site 7 S ite 7 Site 14 Site 14 Site 1Site 8 Si te 2 Site 10 Site 10 Si te 3 Site 9 S ite 7 S ite 7 Site 13 Site 13 Site 16 Site 16 Site 2 Site 2 S ite 1S ite 1 S ite 4 S ite 4 S ite 8 S ite 8 Site 13 Site 13 S ite 1 2 Site 12 Site 15 Si te 1 4 Site 14 Site 11 S ite 4 Site 4 S ite 1 5 S ite 15 Site 15 Site 6 Si te 3 Site 5 Site 5 S ite 6 S ite 10 Site 11 S ite 15 S ite 7 S ite 8 S ite 1 2 S ite 1 2 S ite 2 Site 16 S ite 1 4 S ite 9 S ite 1 1 S ite 1 1 S ite 4 S ite 7 S ite 8 S ite 1 2 S ite 16 S ite 16 Site 14 S ite 10 S ite 11 S ite 15 S ite 4 S ite 7 S ite 8 S ite 1 2 S ite 16 S ite 6 S ite 3 S ite 6 S ite 3 S ite 6 S ite 3S ite 6 S ite 3 S ite 3 S ite 6 Figure A.2: Hierarchical clustering of the samples based on correlation distance metric and complete linkage using the 10% least varying probes. Dendrogram coloured by site and numbered in alphabetic order. Less grouping was seen here than for the clustering using the 10% most varying probes. P Site 15 R Site 11 P Site 6P Site 16 R Site 11 R Site 10 R Site 1 R S ite 1 P Site 16 M Site 2 P Site 12 M Site 13 M Site 13 M Site 7 P Site 8 R Site 10 R Site 9 R Site 9 R Site 9 P S ite 5 P Site 5 M Site 14 M Site 14 P Site 6 P S ite 6 R S ite 9 R S ite 9 R Site 10 R Site 10 R Site 9 R Site 9 P Site 15 P Site 5 R Site 11 R S ite 11 M Site 13 M Site 13 M Site 2 M Site 2 R Site 3 R Site 3 P Site 15 P Site 15 P Site 5 R S ite 11 M Site 14 M Site 14 M Site 2 M Site 2 P Site 5 R Site 3 R Site 9 P Site 8 R Site 1 R S ite 1 P Site 4 P Si te 4 P Site 8 P Site 8 R Site 3 P Site 6 P Site 12 P Site 12 R S ite 1 0 R S ite 1 0 P S ite 4 P S ite 4 M Site 7 M Site 7 M Site 13 M Site 13 P Site 16 P Site 16R S ite 1 P S ite 5 R S ite 1 R S ite 1 M S ite 7 M S ite 7 M Site 14 M Site 14 R Site 1 P Site 8 M S ite 2R Site 10 R Site 10 R S ite 3 R Site 9 M S ite 7 M S ite 7 M Site 13 M Site 13 P Site 16 P Site 16 M Site 2 M Site 2 R S ite 1 R S ite 1 P S ite 4 P S ite 4 P S ite 8 P S ite 8 M Site 13 M Site 13 P S ite 1 2 P Site 12 P Site 15 M S ite 1 4 M Site 14 R Site 11 P S ite 4 P S ite 4 P S ite 1 5 P S ite 15 P Site 15 P Site 6 R S ite 3 P Site 5 P Site 5 P S ite 6 R S ite 10 R Site 11 P S ite 15 M S ite 7 P S ite 8 P S ite 1 2 P S ite 1 2 M S ite 2 P Site 16 M S ite 1 4 R S ite 9 R S ite 1 1 R S ite 1 1 P S ite 4 M S ite 7 P S ite 8 P S ite 1 2 P S ite 16 P S ite 16 M Site 14 R S ite 10 R S ite 11 P S ite 15 P S ite 4 M S ite 7 P S ite 8 P S ite 1 2 P S ite 16 P S ite 6 R S ite 3 P S ite 6 R S ite 3 P S ite 6 R S ite 3 P S ite 6 R S ite 3 R S ite 3 P S ite 6 R Reference M Medium P Polluted Figure A.3: Hierarchical clustering of the samples based on correlation distance metric and complete linkage using the 10% least varying probes. Dendrogram coloured by pollution classification and sites numbered in alphabetic order. Neither this nor the clustering using the 10% most varying probes gave any division based on pollution classification. 40 B Appendix Figure B.1: The percentage of variance that was explained by the principal components in the unscaled principal component analysis performed on samples from all regions. 41 C Appendix 42 APPENDIX C. APPENDIX Table C.1: Table showing the within-site Pearson correlation coefficients for the correlation between log transformed EROD levels and log transformed gene expression of CYP1A. There were clear positive correlations for almost all sites and strongest correlations were seen for Swedish sites. Probe Probe Probe Probe Mean Sd 13544 13545 13546 13547 Agersö 0.291 0.235 0.438 0.244 0.302 0.094 Eggers Wiek -0.258 -0.003 -0.195 0.085 -0.093 0.161 Fjällbacka 0.776 0.808 0.825 0.783 0.798 0.022 Fredriksvaerk -0.031 -0.009 0.129 -0.002 0.022 0.073 G̊asö -0.100 -0.219 -0.018 -0.312 -0.162 0.130 Göteborg harbour 0.579 0.565 0.534 0.794 0.618 0.119 Isefjord 0.471 0.305 0.450 0.573 0.450 0.111 Karrebaek Fjord -0.167 0.204 -0.017 0.050 0.017 0.154 Kvädöfjärden 0.525 0.717 0.643 0.736 0.655 0.096 Marsö 0.640 0.656 0.663 0.583 0.636 0.036 Ringhals -0.073 0.244 0.042 0.354 0.142 0.193 Roskilde Fjord 0.538 0.318 0.661 0.384 0.475 0.155 Salzhaff/Kroy 0.674 0.367 0.492 0.460 0.498 0.129 Slakmöre 0.105 0.236 0.315 0.466 0.280 0.151 Stenungsund 0.262 0.277 0.097 0.388 0.256 0.120 Wesmar/Wendorf 0.393 0.436 0.217 0.421 0.367 0.101 Mean 0.289 0.321 0.330 0.375 Sd 0.335 0.274 0.302 0.305 43 D Appendix 44 APPENDIX D. APPENDIX Table D.1: The within-site correlation was estimated for each probe under the assump- tion of no between-site correlation. The probes representing known genes and had largest estimated within-site correlations, ρ̂, are listed below. Rank ρ̂ Annotation 1 0.84 tryptase-2 precursor 5 0.83 tryptase-2 precursor 8 0.81 tryptase-2 precursor 9 0.81 tryptase-2 precursor 15 0.73 ATP-binding cassette, sub-family B (MDR/TAP), member 6a 22 0.72 ANP2 ANALU Type-3 ice-structuring protein 23 0.71 ANP2 ANALU Type-3 ice-structuring protein 26 0.70 ATP-binding cassette, sub-family B (MDR/TAP), member 6a 28 0.70 ATP-binding cassette, sub-family B (MDR/TAP), member 6a 31 0.69 type III antifreeze protein 35 0.68 ATP-binding cassette, sub-family B (MDR/TAP), member 6a 36 0.68 type III antifreeze protein 39 0.678 ANP2 ANALU Type-3 ice-structuring protein 42 0.67 ATP-binding cassette, sub-family B (MDR/TAP), member 6a 45 0.66 cytochrome P450, family 2, subfamily J, polypeptide 2 46 0.66 ANP2 ANALU Type-3 ice-structuring protein 48 0.65 cytochrome P450, family 2, subfamily J, polypeptide 2 49 0.65 ATP-binding cassette, sub-family B (MDR/TAP), member 6a 50 0.65 cytochrome P450, family 2, subfamily J, polypeptide 2 51 0.65 type III antifreeze protein 54 0.65 ATP-binding cassette, sub-family B (MDR/TAP), member 6a 56 0.64 type III antifreeze protein 58 0.64 protein FAM110B-like 59 0.64 ATP-binding cassette, sub-family B (MDR/TAP), member 6a 63 0.63 ATP-binding cassette, sub-family B (MDR/TAP), member 6a 65 0.62 cytochrome P450, family 2, subfamily J, polypeptide 2 70 0.61 type III antifreeze protein 71 0.61 PAX3 and PAX7 binding protein 72 0.61 type III antifreeze protein 73 0.61 type III antifreeze protein 75 0.60 PAX3 and PAX7 binding protein 45 Introduction Aim Theory Linear models with empirical Bayes Formulation of linear model using design and contrast matrix Empirical Bayes and modified t-test Multiple testing correction Hierarchical clustering Principal component analysis Materials / Method Sampling set-up and data Experimental design Data included in the study Exploratory analysis Hierarchical clustering Principal component analysis Gene rank analysis The linear model and empirical Bayes Differences in gene expression based on pollution classification Gene expression in relation to the biomarker EROD Difference in gene expression based on reproduction success Within-site correlation Results Exploratory analysis Hierarchical clustering Principal component analysis Gene rank analysis Differences in gene expression based on pollution classification Gene expression in relation to the biomarker EROD Difference in gene expression based on reproduction success Within-site correlation Discuss