Estimating Travel Demand from Twitter using an Individual Mobility Model In Sweden, The Netherlands and São Paulo Master’s thesis in Computer science and engineering KRISTOFFER EK ERIC WENNERBERG Department of Computer Science and Engineering CHALMERS UNIVERSITY OF TECHNOLOGY UNIVERSITY OF GOTHENBURG Gothenburg, Sweden 2020 Master’s thesis 2020 Estimating Travel Demand from Twitter using an Individual Mobility Model In Sweden, The Netherlands and São Paulo KRISTOFFER EK ERIC WENNERBERG Department of Computer Science and Engineering Chalmers University of Technology University of Gothenburg Gothenburg, Sweden 2020 Estimating Travel Demand from Twitter using an Individual Mobility Model In Sweden, The Netherlands and São Paulo KRISTOFFER EK ERIC WENNERBERG © KRISTOFFER EK, ERIC WENNERBERG, 2020. Supervisor: Sonia Yeh, Department of Space, Earth and Environment Advisor: Yuan Liao, Department of Space, Earth and Environment Examiner: Carl Seger, Department of Computer Science and Engineering Master’s Thesis 2020 Department of Computer Science and Engineering Chalmers University of Technology and University of Gothenburg SE-412 96 Gothenburg Telephone +46 31 772 1000 Typeset in LATEX Gothenburg, Sweden 2020 iv Estimating Travel Demand from Twitter using an Individual Mobility Model KRISTOFFER EK ERIC WENNERBERG Department of Computer Science and Engineering Chalmers University of Technology and University of Gothenburg Abstract The cost of conducting household travel surveys is increasing, while the response rate is decreasing, pushing researchers to explore new sources of data that can be used to estimate travel demand. Among these new data sources is geotagged tweets from Twitter due to its large quantity of available data and low cost of access. At the same time, using Twitter for travel demand estimation has garnered criticism regarding the biases inherent in Twitter data. This thesis uses geotagged tweets from three regions: Sweden, the Netherlands and São Paulo, to quantify the bias in Twitter data and develop a novel model that estimates travel demand by de-biasing the raw Twitter data. The model integrates two natural dimensions of individ- ual mobility: regularly returning to habitual locations and occasionally exploring new locations. The proposed model addresses the under-representation of habitual places such as home and workplace and corrects the geotagging behavioural bias of overly representing long-distance travel. The model is validated against external data sources in each of the three regions and it is found to result in significant im- provements over contemporary methods for using Twitter data for travel demand estimation. The model’s parameters are robust across regions studied, and by using the parameters found in this thesis one can expect the same improvements compared to contemporary approaches when applied to other regions. Keywords: human mobility, travel demand estimation, Twitter, individual mobility model. v Acknowledgements We would like to thank our supervisors, Sonia Yeh and Yuan Liao, for their tireless support and guidance. Kristoffer Ek & Eric Wennerberg, Gothenburg, June 2020 vii Contents List of Figures xi List of Tables xiii 1 Introduction 1 1.1 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.1.1 Data sources . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.1.2 Measuring human mobility with geotagged tweets . . . . . . . 3 1.1.3 Modelling travel demand: from individuals to population . . . 4 1.2 Thesis objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.3 Disposition of this thesis . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.4 Ethical considerations . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2 Methods 7 2.1 Data collection and preprocessing . . . . . . . . . . . . . . . . . . . . 7 2.2 Feature construction . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.3 Individual mobility model . . . . . . . . . . . . . . . . . . . . . . . . 10 2.3.1 Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.3.2 Exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.3.3 Returning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.3.4 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 3 Validation 17 3.1 External data sources . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.1.1 EU-wide population grid . . . . . . . . . . . . . . . . . . . . . 17 3.1.2 Sweden . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.1.3 The Netherlands . . . . . . . . . . . . . . . . . . . . . . . . . 18 3.1.4 São Paulo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 3.2 Population representation of top geotag Twitter users . . . . . . . . . 20 3.3 Mobility representation of the proposed model . . . . . . . . . . . . . 20 4 Results 23 4.1 Population representation . . . . . . . . . . . . . . . . . . . . . . . . 23 4.2 Individual mobility model: parameters and validation . . . . . . . . . 27 5 Discussion 35 5.1 Top geotag Twitter users vs general population . . . . . . . . . . . . 35 ix Contents 5.2 Mobility measured by geotagged tweets . . . . . . . . . . . . . . . . . 35 5.3 Individual mobility model . . . . . . . . . . . . . . . . . . . . . . . . 36 5.4 Model sensitivity to different parameters . . . . . . . . . . . . . . . . 37 5.5 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 6 Conclusion 39 Bibliography 41 A Notations I B Parameter tuning III x List of Figures 2.1 Hierarchy of timeline for an individual. . . . . . . . . . . . . . . . . . 11 2.2 Influence of parameters ρ and γ on the exploration probability. ni is the number of distinct places visited by an individual. . . . . . . . . . 12 2.3 A Bearing distribution for one individual i. B Jump size distribution for one individual i. C Visual explanation of the shift function, where θ is drawn from the jump size distribution and α is drawn from the bearing distribution. . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.4 Comparison of cumulative distributions of observed visitation fre- quency, fj, and re-scaled visitation frequency, P (s), for one individ- ual. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.5 Preference for short-distance travel for varying values of β. . . . . . . 14 2.6 Example of the model choices when simulating three visits of a daily trajectory. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 3.1 (A) A geographical overview of Sweden’s national and regional bound- aries. (B) Snapshot of zones in West area zoomed in on Gothenburg. (C) Snapshot of zones in East area zoomed in on Stockholm. . . . . . 18 3.2 Geographical overview of OViN zones in the Netherlands . . . . . . . 19 3.3 A geographical overview of the São Paulo Metropolitan region in Brazil (left) and its distribution of research zones (right). . . . . . . . 20 4.1 Spatial distribution of estimated home locations of Twitter users com- pared to census data in Sweden. The numbers on the colour bar represent the Twitter-derived population percentage divided by the percentage derived from GEOSTAT. 1 represents an equal ratio of residents between the Twitter users and census data, in the specific zone. A: Comparison at the county level. B: Comparison at the municipality level. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 4.2 Comparison of estimated home locations of Twitter users with census data in Sweden. The diagonal line represents a perfect correlation. Each data point represents the share of population in a zone calcu- lated from census (x axis) and top geotag Twitter users (y axis). A: County level. B: Municipality level. . . . . . . . . . . . . . . . . . . 24 xi List of Figures 4.3 Spatial distribution of estimated home locations of Twitter users com- pared to census data in the Netherlands. The numbers on the colour bar represent the Twitter-derived population percentage divided by the percentage derived from GEOSTAT. 1 represents an equal ratio of residents between the Twitter users and census data, in the spe- cific zone. A: Comparison at the county level. B: Comparison at the municipality level. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 4.4 Comparison of estimated home locations of Twitter users with cen- sus data in the Netherlands. The diagonal line represents a perfect correlation. Each data point represents the share of population in a zone calculated from census (x axis) and top geotag Twitter users (y axis). A: County level. B: Municipality level. . . . . . . . . . . . . . 26 4.5 Spatial distribution of estimated home locations of Twitter users com- pared to census data in São Paulo. The numbers on the colour bar represent the Twitter-derived population percentage divided by the percentage derived from GEOSTAT. 1 represents an equal ratio of residents between the Twitter users and census data, in the specific study zone. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 4.6 Comparison of estimated home locations of Twitter users with census data in São Paulo. The diagonal line represents a perfect correla- tion. Each data point represents the share of population in a zone calculated from census (x axis) and top geotag Twitter users (y axis). 27 4.7 Parameter topology in Sweden, the Netherlands and São Paulo. A Influence of exploration parameters γ and ρ on MSE - β is fixed at 0.04. B Influence of β on MSE - γ and ρ is fixed at different values. One pair of ρ and γ was included in both the first and the second phase of grid search, thus, having results for additional β values. . . . 28 4.8 Trip distance distributions for the National area, Sweden (Source=Sampers- National). Cumulative percentage of trips in each distance quantile. The black vertical lines indicate the upper and lower boundaries for the distance quantiles. The same below for all figures on distance distributions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 4.9 Trip distance distributions for the East area (Source=Sampers-East). 30 4.10 Trip distance distributions for the West area (Source=Sampers-West). 30 4.11 Trip distance distributions for the Netherlands (Source=OViN). . . . 31 4.12 Trip distance distributions for São Paulo (Source=OD Survey 2017 in São Paulo). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 xii List of Tables 2.1 Summary of data for each region, before and after processing. . . . . 7 2.2 Summary of mobility features constructed for individual i. . . . . . . 10 4.1 Optimal set of parameters for the model in each region. . . . . . . . . 29 4.2 Summary of MSE in Sweden, comparing baselines to the best model configuration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 4.3 Summary of MSE in the Netherlands, comparing baseline to the best model configuration. . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 4.4 Summary of MSE in São Paulo, comparing baseline to the best model configuration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 5.1 MSE and parameters of the best performing model in each region compared to the baselines. . . . . . . . . . . . . . . . . . . . . . . . . 37 A.1 Lookup table with the main symbols and relevant notations used in this thesis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . II B.1 Performance, in terms of MSE*, for the different model configurations in Sweden, sorted by MSE*. MSE* is the sum of the three MSE values received for the areas “National”, “East” and “West”. . . . . . . . . IV B.2 Performance, in terms of MSE, for the different model configurations in the Netherlands, sorted by MSE. . . . . . . . . . . . . . . . . . . . V B.3 Performance, in terms of MSE, for the different model configurations in São Paulo, sorted by MSE. . . . . . . . . . . . . . . . . . . . . . . VI xiii List of Tables xiv 1 Introduction People visit different places to participate in a variety of activities every day, and the series of these places form a trajectory. Aggregating the whole population’ tra- jectories reveals the flows of the population between places. Travel demand, as quantified by such flows, is vital for making informed policies in transport and other areas such as urban planning, public health, and greenhouse gas (GHG) mitigation. In understanding how people move, empirical data with good quality plays an im- portant role. To date, daily and short-distance travel have been extensively studied by transportation and geographic researchers using traditional household travel sur- veys. However, the costs of conducting these surveys are increasing, response rate decreasing to an alarmingly low rate, and are typically conducted every 5-10 years, if at all, meaning that it is hard to keep up-to-date [1]. The increased availability of humans’ spatiotemporal records via various social media platforms have provided researchers with new data sources for estimating travel demand. Among these sources, Twitter is especially appealing due to its low cost and easy access to a significant volume of mobility traces. Previous studies have used data from Twitter to answer important mobility questions, for example how geographical features and cultural norms affect long-distance travel, how people move within cities, and how people can be clustered based on their mobility patterns. The main criticism, however, of using Twitter to measure the mobility pattern of the general population pertains to three aspects: (1) Twitter users do not necessarily represent the whole population, (2) The users’ incentives for using Twitter, such as showing off being at unusual places, might skew the observed mobility, (3) Twitter data lack of regular, and often sparse, sampling. So far, there is not a good way to address these criticisms other than to acknowledge that the issues exist, but being able to address them is essential to bringing Twitter into play for travel demand estimation in a more rigorous manner. This thesis examines Criticism (1) and creates individual mobility model to address Criticism (2)-(3), and by doing so, advance the case of using Twitter for mobility studies. The rest of this chapter reviews the related work on common data sources and models to represent mobility and how the above criticisms are considered in the modelling. Finally, the thesis objectives and ethical considerations conclude the chapter. 1 1. Introduction 1.1 Related work There are a few data sources that have been used for travel demand estimation, such as household travel surveys, GPS log, Call Detailed Record (CDR), and Twitter as an example of social media. This section reviews the strengths and limitations of Twitter data as compared with the other sources (Section 1.1.1). In Section 1.1.2, we show how mobility is measured by geotagged tweets in the literature and the corresponding problems. Section 1.1.2 reviews a range of mobility models for characterising human mobility at both individual and population level, where we justify the selection of an individual mobility model to build upon in this thesis to model mobility trajectories for travel demand estimation. 1.1.1 Data sources In the last decade, new technologies and services have offered an alternative source of human mobility data in addition to the traditional household travel survey. Three categories of unconventional data sources have been widely used: GPS-enabled tracking devices (GPS logs), Call Detail Records (CDR), and geotagged social me- dia. These new data sources have different characteristics that impact the type of research questions suitable to explore. GPS log data contain time-series data of GPS coordinates indicating individuals’ whereabouts. They are commonly collected from a limited number of subjects who willingly carry a GPS tracker. Most studies applying GPS log data collect data from a small group of individuals, typically in a range of 20-500 [2]. GPS devices produce positions that have an accuracy of 10 meters [3]. The major advantage of GPS logs is their high temporal resolution (e.g., every 10 seconds [4]). Therefore, they provide a relatively complete and accurate picture of an individual’s movements during the observation period, which usually lasts several days to a few months. However, in comparison to other sources, GPS log data is used infrequently by the broader research community due to small sample size, high cost, and privacy concerns. Mobile phone CDR is collected by cell towers in a specific area and contain informa- tion, timestamps and position, about calls and text messages sent in the vicinity of each tower. While more data is collected by cell-providers, researchers typically only have access to records on calls and texts. CDR is collected long-term with a large number of individuals due to the high penetration rate of cell phone users. CDR is the most frequently used data source today to estimate the mobility patterns of the general population. For example, one study used a one-year-long CDR data set with 3 million individuals tracked to model the fundamental patterns of individual mobility, e.g., the long-tailed distribution of trip distance [5]. Because the position attributed to each record is the position of the closest cell tower, the spatial reso- lution is directly correlated to the density of the cell tower network, where towers are typically spaced 200-300 meters apart in urban areas and up to 30 km in rural areas. The spatial sparsity in rural areas, and the limitation to a single cell phone provider, restrict the usage of CDR to urban cities, and consequently, short distance travel. Furthermore, locations are only recorded when an individual makes a phone 2 1. Introduction call or sends a text message, leading to sparse and irregular samples. Lastly, while CDR is often used, it is difficult to access, and the data are anonymised for pri- vacy consideration, resulting in various drawbacks depending on the anonymisation techniques. Despite some drawbacks, in contrast to both CDR and GPS logs, data from Twitter is easy, cheap to access, and scale-free, e.g., adaptive to both regional and global scales [2]. A tweet in which the user has selected to attach its position is called a geotagged tweet. Geotagged tweets contain information useful for transportation research: timestamp, position, and text. In general, only a small number of tweets have position attached, varying between different regions of the world, typically 1-3% [6]. The relatively low number of geotagged tweets could be improved by inferring position from the text content, but the accuracy is low [7]. Because the position of geotagged tweets is based on the GPS-enabled device the user sent the tweet from, it shares the same spatial resolution of GPS-log data, typically 10m. Geotagged tweets are collected when an individual sends a tweet, which leads to the same issues, such as sparsity and irregularity, as CDR. Geotagged tweets enabled observation of a large number of individuals’ movement over long periods, while not restricted by geographical and administrative boundaries such as cities and countries. Previous studies have used geotagged tweets to derive summarised mobility pat- terns around the world. Hawelka et al. (2014) [8] analysed global mobility patterns observed from one billion tweets and found that geographical features and cultural norms influenced the mobility patterns. For example, individuals in isolated coun- tries, such as Australia and New Zealand, exhibited a relatively larger radius of gy- ration and individuals from Arabic countries travelled significantly less during the period of Ramadan. Lenormand et al. (2014) [9] compared the commuting-mobility, travel between home and work, during weekdays derived from Twitter data and CDR in Barcelona and Madrid to travel surveys and found a high correlation between the three data sources. To summarise, though geotagged social media as an emerging data source has been used widely to quantify mobility in the last ten years, careful investigations and validation are still needed to further extend its application in estimating travel demand more robustly for the entire population and across different regions. This thesis aims to address these challenges as we describe in the following sections. 1.1.2 Measuring human mobility with geotagged tweets Challenges with using Twitter data for travel demand estimation pertains to two aspects: user behaviour and temporal sparsity. User behaviour manifests itself in how Twitter users interact with the platform. Tasse et al. (2017) [10] found that users geotag their tweets in order to show off being at cool places and to keep their friends and family updated. This consequently leads to users geotagging tweets at places they rarely visit, and even being reluctant to geotag at routine places. Further, they find that users generally geotag their tweets at places far from home, with only 46.7% of geotagged tweets originating from the user’s home city. Due to 3 1. Introduction geotagged tweets being the result of a conscious decision [10], they are naturally sparse and irregular in time. The issue with temporal sparsity lies in the observed movement from Twitter data. Travel demand is an aggregation of individuals’ trips, the connection between two consecutive stays. Due to the temporal sparsity of geotagged tweets, trips are not directly observed in Twitter data [11], but rather what is observed is more precisely named ’displacements’, the connection between two, possibly but not necessarily, consecutive stays. Most previous studies use geotagged tweets without considering, or over-simplifying these challenges. One approach used in literature to translate displacements to trips is to apply a time threshold between 4 and 24 hours [12, 13]. Displacements with a time interval shorter than the threshold are considered trips, and all other dis- placements are discarded. Regardless of the exact time threshold used, the amount of available data is massively reduced. Lenormand et al. (2014) [9] bypassed the issue of temporal sparsity by estimating the home and workplace of individuals, and derived commuting-mobility, travel between home and work. While this in some sense solves the temporal sparsity, it does so by falsely relying on the second prob- lem that Twitter users are reluctant to post on habitual places. Therefore, careful modelling of the individual mobility trajectories using Twitter data is essential to use geotagged tweets for travel demand estimation. 1.1.3 Modelling travel demand: from individuals to popu- lation Travel demand estimation manifests at the population level, reflecting the flows between regions. There are two types of models, population-based and individual- based. Population-based models, as the name implies, operate on the entire popu- lation at once. In contrast, the individual-based models operate on individuals, and its output can be aggregated to the population level. One of the most applied population-level models in travel demand estimation is the gravity model [14], which states that the trip number between two places can be determined by the production (e.g., population) and attraction (e.g., workplaces) of both places, and their distance from one another. Despite the widespread use of the gravity model, it has notable limitations such as over-simplification and being data- demanding. Another model that recently has gained attention is called Radiation model, which improves the traditional gravity model [15]. The estimated travel de- mand has fixed zoning once the model is constructed. Its further application is more constrained than the combination of individual-based modelling and aggregation. Population-based models often require good-quality data at the individual level. Liao et al. (2020) [11] explore the feasibility of using geotagged displacements from Twitter with the gravity model to estimate travel demand where, however, the individual trajectories are not modelled sufficiently to consider behavioural biases. When using Twitter data for travel demand estimation, the individual trajectories require reasonable de-biasing to be better used to estimate the population flows between regions. 4 1. Introduction Of the individual-based models, Markovian models are among the typical ones that have been used widely. Cárcamo et al. (2017) [16] used CDR and constructed a Markov model capturing the transitions between antennas based on the entire pop- ulation. Gambs et al. (2012) [17] described an algorithm for next place prediction based on a Markov mobility model of an individual, called n-MMC, which achieved up to 95% accuracy. Approaches similar to Markovian models require observed transitions between locations which, however, are not observable in Twitter data due to the sparsity issue. Therefore, these types of models are not suitable for travel demand estimation based on Twitter data. Lévy-flight and random walk models have been successfully used to model the mobil- ity of animals, and have also been used to model the individual mobility of humans [18]. Although these models exhibit a striking statistical resemblance to human mobility[19], they are based on the assumption that human movement is random. Song et al. (2010), proposes that human mobility is barely random, but follows re- producible scaling laws [5]. In their Individual Mobility Model, they focus primarily on two generic mechanisms, exploration and preferential return, both unique to hu- man mobility. According to the model, an individual’s next displacement can either be exploring a new location or returning to a previously visited location. The Individual Mobility Model captures the asymptotic/general mobility of individu- als, meaning it is a good fit for estimating travel demand after a proper aggregation. Furthermore, the model does not rely on observed trips between places but instead uses the visitation frequency of places to determine the next displacement. The model reflects the input data, and without modification, it would output the be- haviour bias observed in Twitter data. However, given the generality of the model, it holds the potential to be adapted to Twitter data, and allow travel demand esti- mation to be recovered. 1.2 Thesis objectives To address the limitations of traditional travel surveys, Twitter data has gained increased interests among emerging data sources in estimating mobility. However, some of the well-known criticisms have continued to be ignored in practice. This thesis examines the biases of geotagged tweets and proposes an individual mobility model for travel demand estimation with the attempt to address some of the biases. The model is applied to several regions globally and validated against other "official" data sources in the field of transport. Specifically, this thesis has four objectives: • Examine the representativeness of geotagged Twitter users and potential bi- ases. • Propose an individual mobility model to address the sparsity issue and be- havioural biases of using geotagged tweets. • Calibrate and validate the proposed model against travel surveys and the traffic model output to obtain optimal model parameters. Investigate the 5 1. Introduction performance of the proposed model compared with the common practice of utilising geotagged tweets. • Compare the sensitivity of model parameters and discuss the ability to gener- alising it to the other global regions. 1.3 Disposition of this thesis The remainder of this thesis consists of five chapters; Methods, Validation, Re- sults, Discussion, and Conclusion. Methods describe the data and how they are processed and the proposed model. Validation describes the data sets that are val- idated against, and also how the validation is carried out. Results show the effects of applying the validation to the model. Discussion deliberates the main findings, limitations, and future work. At last, Conclusion summarises this thesis. 1.4 Ethical considerations The available data will contain users’ Twitter handles, a unique identifier for the user, meaning that the derived travel behaviour can be tied to a specific user. To combat this, the data will be anonymised, i.e. the Twitter handles will be replaced with pseudonymised strings, to remove the direct connection between the data and the underlying Twitter user. To prevent "reverse engineering", we do not publish individual trajectories and their locations. Furthermore, users have opted in to share their location, meaning that some form of consent is already given. The users have however not consented for their data to be part of this study, but this is, in part, mitigated by the fact that the data is publicly available on their profiles, meaning anyone could potentially retrieve it. 6 2 Methods This chapter describes the methods applied in this thesis. Section 2.1 describes the dataset in terms of collection and preprocessing. Section 2.2 describes the features constructed from the dataset. Section 2.3 introduces the model of individual mobility and how to deal with the biases found in Twitter data. 2.1 Data collection and preprocessing The Twitter data used in this thesis have been collected and extracted by a previous study [20]. The data set consists of 23 regions, both countries and cities, and was collected in two stages. In the first stage, tweets during six months (20 December 2015 - 20 June 2016) were collected using a geographical bounding box containing the region. From these tweets, the users who geotag their tweets most frequently was identified. In the second stage, the 3200 most recent tweets of the identified users were retrieved. Not all users have a total of 3200 tweets, and some tweet more frequently than others, such that the total number of tweets and timespan varies from user to user. In total, there are between 30 and 65 million tweets for each region. To provide more detailed validation, this thesis focuses on the following regions: Sweden, São Paulo and the Netherlands. Furthermore, only geotagged tweets are of interest, and all regular tweets are removed from the data set. Table 2.1 shows the number of tweets and geotagged tweets in each region. The geotagged tweets that user i have sent are: (X, Y, t)i,p, p = 1, 2, ..., Ni, where X Before processing After processing Region Tweets Individuals Geotweets Individuals Geotweets Sweden 31 591 697 7 773 2 943 731 3 961 1 248 158 São Paulo 65 089 103 22 853 8 059 448 10 943 3 513 796 The Netherlands 31 997 687 12 638 4 418 891 5 375 1 479 674 Table 2.1: Summary of data for each region, before and after processing. 7 2. Methods is the decimal degree of latitude, Y is the decimal degree of longitude, and t is the local time converted from the original UTC timestamp of the pth geotagged tweet using the location (X, Y ). Ni is the total number of geotagged tweets sent by this user. For each geotagged tweet, we also calculate the day of the week and hour of the day based on t. The sequence of the user’s geotagged tweets is, therefore: Gi = {X, Y, t, w, h}i,p, p = 1, 2, ..., Ni (2.1) Cross-platform posting Twitter supports cross-platform posts, meaning that a user can link their account to other platforms and posts shared on one platform will be shared on the other. Cross-posting is problematic for geotagged tweets, as platforms have different spatial accuracy in their reported locations. For example, when a user shares an image on Instagram and tags it with a place, the reported location will be in the centre of that place. So if an individual sets their location on Instagram as Sao Paulo, the geotagged tweet will be reported in the middle of Sao Paulo, as opposed to an exact location. To deal with this artefact, all geotagged tweets are grouped based on their exact latitudinal and longitudinal coordinates, and groups with more than 0.1% of the tweets in the region are removed. This could potentially remove some regular geotagged tweets (not cross-posted), but the likelihood of several geotagged tweets having exact coordinates is minimal. Clustering locations into places The spatial resolution of GPS devices is typically within 10 meters [3], meaning that geotagged tweets from the same place can have slightly different coordinates, X and Y . To deal with this problem, each individual’s observed locations are grouped to places using DBSCAN [21], a clustering algorithm that groups points tightly packed based on density. The algorithm is parameterised with distance threshold, ε, and the minimum number of points per cluster, nmin. Parameter ε controls the geographic size of the clusters but has been found to not be very sensitive for travel demand estimation as the overall patterns emerge despite different values [3]. We use ε = 100m as it prevents a large number of small places being identified, while still separating different places. We use nmin = 1 in order to not ignore places that have only been visited once. Let ni be the number of distinct places for user i, Xj and Yj be the spatial centroid of the of place j, obtained from DBSCAN. Furthermore, let K denote the number of geotagged tweets at this place. The set of the user’s distinct places is, therefore: Si = {X, Y,K}i,j, j = 1, 2, ..., ni (2.2) Estimating home place It is possible that during the first stage of data collection, some top geotag users were only visiting the region, but live somewhere else. As we focus on the residents in 8 2. Methods each region, this artefact needs to be removed. Therefore, each user’s home location, sh ∈ Si, is detected using the assumption that they live at the most visited location during weekends and 7 pm-8 am on weekdays [22]. Due to Twitter users’ reluctance to geotag their tweets at habitual places, including their home, combined with the fact that not everyone has the same circadian rhythm (some working night shifts for example) this method is not perfect. Although the method is not perfect, it is the most common approach for identifying home locations for this type of data. Finally, individuals with an estimated home located outside of the region are removed from the data set. Considering the long study period, up to 9 years for some users, it is probable that some have moved during the study period, resulting in multiple home locations detected from their timelines. The home location is an essential part of a person’s mobility pattern; thus, moving will most likely results in a significant change in their mobility pattern. Hence, as a measure to reduce the complexity of the analysis, we only consider the period where the user lives at his/her latest home location. Bot accounts and insufficient geotagged tweets Another artefact of the data set is bot accounts, which, for example, only tweet about job postings and weather updates. Some of these accounts also geotag their tweets, often at the same place. These are identified by selecting users with only one distinct place and subsequently removed. After the processing to deal with the artefacts in the original data set, we further remove users based on the amount of available data. This is done to ensure mobility patterns can be identified. Users with less than 20 tweets are removed. Table 2.1 presents the data sets, before and after processing. 2.2 Feature construction For the trajectory of each Twitter user, not all distinct places are visited with the same frequency; some are visited more frequently than others. Thus, a place j has a visitation frequency of being visited, fi,j, that is calculated based on the number of geotagged tweets individual i has sent at the place j. fi,j = Ki,j∑ni j=1 Ki,j (2.3) Jump size, θp,p+1, refers to the distance between two consecutive geotagged tweets p and p+ 1, and is defined as θp,p+1 = haversine(Xp, Yp, Xp+1, Yp+1) (2.4) where haversine(Xp, Yp, Xp+1, Yp+1) is the Haversine distance (distance along the curved surface of the earth) between two coordinates. 9 2. Methods Bearing, αp,p+1, refers to the direction of the straight line one travels between two consecutive geotagged tweets p and p+ 1 and it is defined as ∆Yp,p+1 =Yp+1 − Yp (2.5) yp,p+1 = sin(∆Yp,p+1) ∗ cos(Xp+1) (2.6) xp,p+1 = cos(Xp) ∗ sin(Xp+1)− sin(Xp) ∗ cos(Xp+1) ∗ cos(∆Yp,p+1) (2.7) αp,p+1 = arctan2(yp,p+1, xp,p+1) (2.8) Feature Description Si Set of distinct places visited by the individual sh ∈ Si Estimated home place of the individual fi,j Visitation frequency of place j, j = 1, 2, ..., ni Prob(θ)i Jump size distribution Prob(α)i Bearing distribution Table 2.2: Summary of mobility features constructed for individual i. 2.3 Individual mobility model As mentioned earlier in Section 1.2, the goal of this thesis is to demonstrate a novel method to estimate travel demand using Twitter data while addressing biases such as under-representation of short-distance trips. In this section, we describe the Individual Mobility Model, first proposed by Song et al. (2010) [5], without considering any potential bias in the data source, and our adaptions. We start this section with an overview of the framework in which the model operates, followed by a detailed description of how the steps in the model are carried out. 2.3.1 Framework The raw trajectory of geotagged tweets constitutes a biased observation of the actual mobility trajectory for a Twitter user. The model developed aims to construct the mobility pattern of an average week for individuals. To achieve this, we create a timeline, Li, for each individual i. The timeline consists of trajectories of multiple days, Li = (Ti,d), d = 1, 2, ..., D. Each daily trajectory, Ti,d, consists of multiple visits Ti,d = (vi,d,m), m = 1, 2, ...,Md. This is depicted in Figure 2.1. The number of daily trajectories, D, is set at twenty weeks (D = 140). It is selected to be large enough such that we can find a regular pattern for an average week. We found that increasing D, i.e. increasing the number of modelling days, does not change our results. The number of visits generated per day, Md, is drawn from a normal distribution estimated from travel survey from Sweden[23], N(3.14, 1.8), and 10 2. Methods Figure 2.1: Hierarchy of timeline for an individual. the same distribution is used for all regions. Due to drawingMd from a distribution, each day has a different number of visits per day. Each visit vi,d,m consists of latitude and longitude, X and Y , expressed in decimal degrees. vi,d,m = (X, Y ) (2.9) For the first visit of each daily trajectory, vi,d,1, it is assumed that the individual is located at their estimated home, sh. This assumption reflects humans’ tendency to return home at the end of every day. To generate the remaining visits for the daily trajectory, it is assumed that the individual can perform one of two choices: exploration or preferential return. When exploring, the model generates a visit to a place j not observed in the individual’s distinct places, j /∈ Si. On the other hand, when returning, the model generates a visit to place j observed in the individual’s places, j ∈ Si. The probability of each choice is dependant on the number of distinct places, ni, de- rived from the individual’s geotagged tweets. The more distinct places, the smaller the probability of exploration is. How much the number of distinct locations influ- ences the probabilities is controlled via two parameters: 0 < ρ ≤ 1 and 0 ≤ γ. Prob(explore)i = ρn−γi (2.10) Prob(return)i = 1− Prob(explore)i (2.11) The parameters ρ and γ are not specific for an individual but is shared across the population. Figure 2.2 shows their influence on the exploration probability. From an individuals timeline, a set of trips is constructed by considering each pair of consecutive visits to be the origin and destination of a trip. For the remainder of this chapter, we describe the process of modelling individual mobility in detail with two possible options: exploration and preferential return. The chapter is concluded with an example showing how the model works. 11 2. Methods Figure 2.2: Influence of parameters ρ and γ on the exploration probability. ni is the number of distinct places visited by an individual. Figure 2.3: A Bearing distribution for one individual i. B Jump size distribution for one individual i. C Visual explanation of the shift function, where θ is drawn from the jump size distribution and α is drawn from the bearing distribution. 2.3.2 Exploration For day d, let m denote the current place. When exploring, the individual i makes a visit to an unobserved location, m+ 1 /∈ Si. The new location’s coordinates, Xm+1 and Ym+1, is generated based on the individual’s jump size distribution (Prob(θ)i), bearing distribution (Prob(α)i), and current location, (Xm, Ym), as depicted in Fig- ure 2.3. θ ← Prob(θ)i (2.12) α← Prob(α)i (2.13) Xm+1, Ym+1 = shift(Xm, Ym, θ, α) (2.14) vi,d,m+1 = (Xm+1, Ym+1) (2.15) 2.3.3 Returning For day d, let m denote the current place. When returning, the individual moves to one of their previously visited places m+1 ∈ Si. The selection of place m+1 among 12 2. Methods all places in Si depends on two factors: the visitation frequency of the candidate place m + 1, and the travel distance from the current place, m, to the candidate place. Visitation frequency Previous research on human mobility suggests that the visitation frequency of hu- mans is uneven, such that the frequency z of the kth most visited location follows Zipf’s law[5] with parameter ζ ≈ 1.2± 0.1. zk ∼ k−ζ (2.16) Because Twitter users are reluctant to geotag tweets at habitual places, we assume that the observed visitation frequency fi,j is skewed. Habitual places, such as home (sh) and work, have an observed visitation frequency that is lower than the actual visitation frequency of the place. On the other hand, infrequent place, such as one- time visits to bars, have an observed visitation frequency that is higher than the actual visitation frequency. We assume that, while the observed visitation frequency is skewed, the order of places based on visitation frequency is correct. Therefore we use the below equation to re-scale the visitation frequency of geotagged places where s ∈ Si and rank(s) denotes the relative order of places by visitation frequency. P (s) = zrank(s)∑ s′∈Si zrank(s′) (2.17) Figure 2.4 shows the effect of re-scaling the visitation frequency as described in Equation 2.17 for one individual. The re-scaling results in habitual places being visited more frequently than they have been observed in the geotagged tweets. Figure 2.4: Comparison of cumulative distributions of observed visitation fre- quency, fj, and re-scaled visitation frequency, P (s), for one individual. 13 2. Methods Impedance to the candidate places The other factor determining the selection of next place, m+ 1, is its distance from the last visit’s location, Xm and Ym. Including distance in the selection helps de-bias the trajectories generated, accounting for the fact that Twitter users are more likely to geotag tweets far from home. The intuition is that we want to slightly increase the probability of visiting places closer to where the individual is currently located. To achieve this, we use an approach similar to that of the Gravity model, by modelling the probability of travelling to a place to be inversely proportional to the distance to it. In other words, the longer the distance one needs to travel from a place to another, the more unlikely the visit happens. Therefore, the impedance between a candidate place s and previous place m is expressed as exp(−β ∗ haversine(Xm, Ym, Xs, Ys)) and normalised with the below equation. I(s) = exp(−β ∗ haversine(Xm, Ym, Xs, Ys))∑nj j=1 exp(−β ∗ haversine(Xm, Ym, Xj, Yj)) (2.18) The strength of preference for short-distance travel, I(s), is controlled by parameter β. Figure 2.5 shows how different values for β yield a stronger och weaker preference. Figure 2.5: Preference for short-distance travel for varying values of β. Combining visitation frequency and impedance Both factors, P (s) and I(s), are in the range [0, 1]. And they are combined with multiplication and re-normalised to sum to 1. The place the individual moves to for the next visit, vi,d,m+1 = (Xm+1, Ym+1), is drawn from the resulting distribution of the candidate places’ probability. Prob(s) = P (s) · I(s)∑ s′∈Si P (s′) · I(s′) (2.19) vi,d,m+1 ← Prob(s) (2.20) 14 2. Methods 2.3.4 Example This section show an example of how the model works, by simulating the choices made during one daily trajectory, Ti,d. Three different visits are simulated and illustrated in Figure 2.6. The individual in the example have visited three distinct places, and their observed visitation frequency is depicted in the left-most part of the figure, indicated by their size. Figure 2.6: Example of the model choices when simulating three visits of a daily trajectory. The first visit of the daily trajectory is to the individual’s home location, sh. In the example, the second visit is assumed to be preferential return. The figure shows the combination of visitation frequency and impedance, P (s) ∗ I(s), indicated by the size of the circles. Note how the place in the bottom right has lower probability to returned to, due to the distance to the current location. Instead, the top-left place is returned to, marked with 2 in the figure, because of it’s proximity to the current location. For the final visit in the example, it is assumed to be exploration, and an unobserved place will be visited based on the bearing and jump size distributions, Prob(θ) and Prob(α). The sampled values from these two distributions are depicted in the figure. The individual will move to where they intersect, marked with 3 in the figure. This process is repeated until the daily trajectory is completed, and then repeated for the remaining daily trajectories of the individual’s timeline. 15 2. Methods 16 3 Validation To validate the proposed model, we compare the model outputs with established data sources including travel survey and traffic model output. When validating in- dividual mobility patterns, it is uncommon to have two data sources which include the same group of individuals. Hence, validation is conducted at the population level. Aggregating individual trajectories of mobility yields a picture of how the population flows between regions. One of the most common ways to validate travel demand is to construct and compare Origin-Destination (OD) matrices from differ- ent data sources. An OD matrix represents the volume of trips between any two zones in a study area. Using OD matrices, it is possible to compare and analyse two independent travel demand estimations of the same study area. In this chapter, Section 3.1 presents the external data sources used for validation in each region: Sweden, the Netherlands, and São Paulo. Section 3.2 and Section 3.3 describe the methods used in population representation and trip distance validation, respectively. 3.1 External data sources This section details four different data sources; three of them are survey-based travel demand estimations, and one is a population distribution estimation. 3.1.1 EU-wide population grid The GEOSTAT initiative was taken jointly by Eurostat and the National Statistical Institutes to establish a data and production infrastructure for geospatial statistics. The GEOSTAT 2011 dataset [24] represent the main characteristics of the 2011 population and housing census in a 1 km2 grid system for the entire European Union. This information allows us to identify potential population biases in terms of the spatial distribution of Twitter users’ detected home location as compared with the general population in Sweden and the Netherlands. 3.1.2 Sweden Sampers [25] is a tool owned and managed by the Swedish Transport Administra- tion (Trafikverket), that estimates the historical and future traffic volumes based on 17 3. Validation Figure 3.1: (A) A geographical overview of Sweden’s national and regional bound- aries. (B) Snapshot of zones in West area zoomed in on Gothenburg. (C) Snapshot of zones in East area zoomed in on Stockholm. studies of travel demand. From the Sampers model, three OD matrices have been retrieved for Sweden from 2014; “National”, “East”, and “West” (see Figure 3.1). The cell value of the OD matrices represents the estimated number of trips between the origin and destination zone. Although the data is, at time of this thesis, six years old, it is the most representative ground truth for Sweden. The OD matri- ces represent the domestic travel demand during an average weekday. Each area is segmented into zones, and the segmentation depends on the area. The national model considers trips longer than 100 km done by residents in all of Sweden and is segmented into 682 zones. The models in East and West consider all trips within Sweden, done by residents in the respective study area and is segmented such that spatial resolution decreases further away from the largest city in the area, Stock- holm and Gothenburg respectively. The East and West study area each contains approximately 3000 zones. 3.1.3 The Netherlands OViN (Onderzoek Verplaatsingen in Nederland) is a recent dataset on daily mobility of the Dutch population. The dataset consists of a basic survey at a national level and possible follow-up surveys. The research is a continuous daily study of the travel behaviour of Dutch people. Respondents are asked to keep track of where 18 3. Validation Figure 3.2: Geographical overview of OViN zones in the Netherlands they go for that particular day of the year, for what purpose, with what means of transport and how long it takes to get there. Based on this research, information can be obtained about all daily trips by Dutch people on Dutch territory. All trips in the OViN data set originates and ends in postal code areas, grouped by their first four digits. In other words, the geographical partitioning into zones is defined by the location’s first four postal code digits. In the Netherlands, this results in 4066 zones that trips can occur in between. A geographical overview of these zones can be seen in Figure 3.2. 3.1.4 São Paulo The city of São Paulo has collected information on trips from citizens in São Paulo. The study, carried out in 2017, interviewed 32,000 households distributed in 517 research zones. In total, approximately 100,000 people were interviewed. Trips were collected from each respondent over 24 hours, and only weekdays were considered. Of the 517 research zones, 342 represents the municipality of São Paulo, and 175 represents the neighbouring municipalities. Their geographical distribution is shown in Figure 3.3. 19 3. Validation Figure 3.3: A geographical overview of the São Paulo Metropolitan region in Brazil (left) and its distribution of research zones (right). 3.2 Population representation of top geotag Twit- ter users The representativeness of top geotag Twitter users for the whole population is crucial for our study and the validity using Twitter data for mobility estimation. The density distribution at the zone-level shows the discrepancy between Twitter users’ derived home locations and the census number of residents in the corresponding zones. A similar process has been done at the county level [20], compared to which this project will move one step forward to quantify the population bias at a much finer geographical resolution. For Sweden and the Netherlands, the GEOSTAT 2011 data source is used as the ground truth. The spatial resolution of GEOSTAT, 1 km2 grids, is too detailed for our analysis. Hence, the grids are grouped into counties and municipalities before comparison. The derived distribution from Twitter will then be compared to the GEOSTAT distribution for each of the two resolutions (counties and municipalities). For São Paulo, the population distribution is included in the data source, i.e. there is an estimated number of residents in each of the 517 research zones included in the study. Thus, the comparison of population distribution will be conducted at the zone level for the São Paulo region. 3.3 Mobility representation of the proposed model While trip distance (d, km) does not encompass the direction of flow between places, it is an essential metric of mobility whose distribution reveals the validity of using Twitter to estimate travel demand [11]. Therefore, it is selected to test the proposed model as calibrated and validated against the external data sources. 20 3. Validation Two baselines In order to quantify the improvement of the model, we compare our model to two other models (hereafter called baselines) from the literature[20, 13, 12, 26]. The two baselines differ in how they construct trips from displacements observed in the raw geotagged tweets. The first baseline, henceforth called just baseline, considers every displacement to be a trip, i.e. every pair of consecutive geotagged tweets form the origin and destination of a trip. The second baseline, henceforth called baseline-24, considers displacements with a duration shorter than 24 hours to be trips, i.e. every pair of consecutive geotagged tweets posted within 24 hours of each other becomes an origin and a destination of a trip. Comparison procedure For each set of trips (baseline, baseline-24, model), an OD matrix is created. First, the coordinates of origin and destination for each trip is geographically joined to the zones of the external data source. Secondly, the number of trips between each origin-destination pair is calculated and normalised so that the sum of all cells in the OD matrix add up to 1. Based on the distance between zones, 100 distance quantiles are calculated (Q = 100), such that each quantile contains the same number of origin-destination pairs. For each quantile, 0 < q ≤ Q, the share of trips from the external data source, tq, and the model, t′q is calculated. The similarity of the model compared to the external data source is quantified by Mean Squared Error, MSE (see eqs. (3.1) and (3.2)). SEq = (tq − t′q)2 (3.1) MSE = ∑Q q=1 SEq Q (3.2) Model calibration Because the model is parameterized (ρ, γ, and β), the question of which parameters are optimal, or close to optimal, should be addressed. Furthermore, it is possible that the optimal parameters in one region is not optimal for another region, for example due to geographical differences. Hence, the difference in optimal parameters for different regions should also be explored. To find the optimal model parameters within a region a two-phased grid search is conducted. In the first phase a wide range of parameters are tested, as seen Eq. 3.3. After evaluating all parameters, the ones that achieves the lowest MSE of trip distance distribution, in comparison to external data source, are selected. 21 3. Validation ρ ∈ [0.3, 0.6, 0.9] γ ∈ [0.2, 0.5, 0.8] β ∈ [0.01, 0.04, 0.07] (3.3) In the second phase of the grid search, a narrow set of parameters is selected for evaluation centred around the best parameters found in the first phase. For example, if ρ = 0.6, γ = 0.5, β = 0.04 was selected in the first phase, the parameters to be evaluated in second phase are depicted in Eq. 3.4. After evaluating all parameters in the second phase, the ones that achieve the lowest MSE are considered the optimal parameters for that region. ρ ∈ [0.5, 0.6, 0.7] γ ∈ [0.45, 0.5, 0.55] β ∈ [0.03, 0.04, 0.05] (3.4) When the two phased grid search is completed for the three regions, the optimal parameters of the regions, and their respective MSE, are compared. In the best case, the optimal parameters in each region would be the same, and thus indicate that the model is robust across these regions. 22 4 Results In this chapter, the results are presented; Section 4.1 shows the representativeness of the estimated home locations of observed Twitter geotag users compared with the external data sources described in the previous section. Section 4.2 presents the parameters of the model and the process used for setting them, and Section 4.2 shows the performance of the model in terms of the trip distance distribution. 4.1 Population representation This section presents the spatial distribution of the estimated home locations of geotag Twitter users and compare their spatial distributions with census/survey data sources in Sweden, the Netherlands, and São Paulo. Sweden Figure 4.1 shows the representativeness of the estimated home locations of geotag Twitter users compared with the census population in Sweden. The top geotag Twitter users in Sweden are overly representing the residents in Stockholm county, where the capital of Sweden is located, but under-representing the residents in Väs- tra Götaland county, where the second-largest city Gothenburg is located. At the municipality level, Figure 4.1.B shows that Twitter users are overly representing the residents who live in urban centres as well as a few more rural municipalities, such as Åre and Rättvik.The municipality of Sweden’s capital, Stockholm, is over- represented by a factor of 2.2. Comparing Figure 4.1.A and Figure 4.1.B, shows that even if the population level is close to census at the county level, individual municipalities within the county can still be over- or under-represented. This is due to the smaller zones at municipality level, which sheds light on how the population is distributed in each county. Figure 4.2 compares the estimated home locations of Twitter users and census pop- ulation. For both county and municipality level, less populated areas are more likely to be under-represented by top Twitter users. 23 4. Results Figure 4.1: Spatial distribution of estimated home locations of Twitter users com- pared to census data in Sweden. The numbers on the colour bar represent the Twitter-derived population percentage divided by the percentage derived from GEO- STAT. 1 represents an equal ratio of residents between the Twitter users and census data, in the specific zone. A: Comparison at the county level. B: Comparison at the municipality level. Figure 4.2: Comparison of estimated home locations of Twitter users with census data in Sweden. The diagonal line represents a perfect correlation. Each data point represents the share of population in a zone calculated from census (x axis) and top geotag Twitter users (y axis). A: County level. B: Municipality level. 24 4. Results The Netherlands Figure 4.3 shows the population representation of estimated home locations of geo- tag Twitter users relative to the derived GEOSTAT distribution. The population distribution at the county level is similar to that of GEOSTAT. The county of North Holland is slightly over-represented, by a factor of 1.7. Figure 4.3.B shows the dis- tribution at the municipality level. At this resolution, it is evident that the two sparsely populated islands Vlieland and Terschelling are over-represented by the Twitter users. More urban municipalities, such as Amsterdam and Utrecht are also over-represented; Amsterdam by a factor of 3.35 and Utrecht by a factor of 1.69. Figure 4.3: Spatial distribution of estimated home locations of Twitter users com- pared to census data in the Netherlands. The numbers on the colour bar represent the Twitter-derived population percentage divided by the percentage derived from GEOSTAT. 1 represents an equal ratio of residents between the Twitter users and census data, in the specific zone. A: Comparison at the county level. B: Comparison at the municipality level. Figure 4.4 shows the relationship between the two sources at both municipality and county level. Similar to the findings in Sweden (Figure 4.2), geotagged tweets in the Netherlands tend to be more prevalent among the residents in urban and populated areas. São Paulo Figure 4.5 illustrates the results of population representation for the 517 research zones. It indicates that rural zones, located far from the inner city of São Paulo, are under-represented. Moreover, the Twitter distribution of inner São Paulo resembles that of the travel survey, with one exception. That is the research zone of Barra Funda, which has a travel survey estimate of 324 out of the 20 821 671 residents in São Paulo. For twitter, 22 out of 10 686 users are estimated to have a home location in Barra Funda. This results in an over-representation by a factor of 132. 25 4. Results Figure 4.4: Comparison of estimated home locations of Twitter users with census data in the Netherlands. The diagonal line represents a perfect correlation. Each data point represents the share of population in a zone calculated from census (x axis) and top geotag Twitter users (y axis). A: County level. B: Municipality level. Figure 4.5: Spatial distribution of estimated home locations of Twitter users com- pared to census data in São Paulo. The numbers on the colour bar represent the Twitter-derived population percentage divided by the percentage derived from GEO- STAT. 1 represents an equal ratio of residents between the Twitter users and census data, in the specific study zone. Resembling what is observed in Sweden and the Netherlands, top geotag Twitter users in São Paulo tend to overly represent the residents in densely populated areas. However, as shown Figure 4.6 the discrepancy between Twitter users and census population is more salient than Sweden and the Netherlands, i.e., the top geotag Twitter users in São Paulo display a lower population representation. 26 4. Results Figure 4.6: Comparison of estimated home locations of Twitter users with census data in São Paulo. The diagonal line represents a perfect correlation. Each data point represents the share of population in a zone calculated from census (x axis) and top geotag Twitter users (y axis). 4.2 Individual mobility model: parameters and validation Model parameters This section presents the results obtained in each region with different model con- figurations. Figure 4.7 shows the influence of different model configurations for Sweden, the Netherlands and São Paulo. For Sweden, which consists of three areas, the optimi- sation is based on the sum of the three MSE values. The optimal values for explo- ration rate parameters γ and ρ are different for Sweden compared to the other two regions. In Sweden, the exploration rate parameters are optimal when γ ∈ [0.75, 0.8] and ρ ∈ [0.3, 0.4], while in São Paulo and the Netherlands they are optimal when γ ∈ [0.45, 0.5] and ρ ∈ [0.6, 0.7]. The results indicate that a lower probability for exploration is preferential in Sweden, while a slightly higher exploration rate in the Netherlands and São Paulo. Parameter β, controlling preference for short-distance travel, is subsequently anal- ysed when exploration parameters is fixed in their respective range, depicted in the right column of figure 4.7. For all regions, there is one value of β that achieves a better score, regardless of the values of exploration rate parameters. In Sweden, that value is β = 0.03, in the Netherlands β = 0.04, and São Paulo β = 0.05. This indicates that the smaller the region in the study, the larger the optimal value of β is. Furthermore, the smaller the region under study, the less influence the exact value of β has. 27 4. Results Figure 4.7: Parameter topology in Sweden, the Netherlands and São Paulo. A Influence of exploration parameters γ and ρ on MSE - β is fixed at 0.04. B Influence of β on MSE - γ and ρ is fixed at different values. One pair of ρ and γ was included in both the first and the second phase of grid search, thus, having results for additional β values. 28 4. Results Table 4.1 shows the best model configuration for each region. The complete table of results for different model configurations across the three regions can be found in Appendix B. Region γ ρ β Sweden 0.75 0.4 0.03 The Netherlands 0.45 0.6 0.04 São Paulo 0.45 0.6 0.05 Table 4.1: Optimal set of parameters for the model in each region. Validation This section presents the results of validation for each region using the method described in Section 3.3. In this section, the model’s results are presented using the best set of parameters found in the previous section. Sweden The trip distance distribution for the region of Sweden has three areas: “National”, “East” and “West”. Note that the maximum straight-line distance for Sweden is around 1500 km. National Sampers-National model considers trips with a minimum distance of 100 km by residents from all over Sweden. Figure 4.8 shows the trip distance distributions for the National region from the external source (Sampers), the two baselines, and the model. Both the baseline and the baseline-24 under-represents medium distance trips, 100 km to 250 km, and over-represents long-distance trips up to about 600 km. Although baseline-24 yields an improvement of the baseline distribution, it still deviates from Sampers. Our model closely follows the cumulative distribution of Sampers up until 200 km. For the distance above 200 km, it slightly deviates from Sampers for the distance of 200 - 500 km. East For the East area, the Sampers output considers trips in all of Sweden, made by the residents who live in the East area. The trip distance distributions for the area are illustrated in Figure 4.9. The first distance quantile, which contains trips up to 3.2 km, is under-represented by both the baseline and the model. The baseline continues to under-represent trips up to 100 km and then heavily over-represents long-distance trips. baseline-24 resembles the baseline except for having a lot more short-distance trips, with 76% of the trips in the first distance quantile. The model generates slightly more trips in the range of 3.2 - 7 km than Sampers. For distance above 7 km, the model approximates the cumulative distribution derived from Sampers. 29 4. Results Figure 4.8: Trip distance distributions for the National area, Sweden (Source=Sampers-National). Cumulative percentage of trips in each distance quan- tile. The black vertical lines indicate the upper and lower boundaries for the distance quantiles. The same below for all figures on distance distributions. Figure 4.9: Trip distance distributions for the East area (Source=Sampers-East). West Figure 4.10: Trip distance distributions for the West area (Source=Sampers-West). For the West area, the Sampers output considers trips in all of Sweden, made by the 30 4. Results residents who live in the West area, and the trip distance distributions are illustrated in Figure 4.10. Compared to Sampers, the two baselines overly represent the first quantile, 0 - 2.1 km, but then under-represents trips with a distance less than 100 km. The time threshold in baseline-24 generates more trips in the first quantile than the unbounded baseline. Moreover, the baseline also over-represents long-distance trips. In contrast, the model slightly over-represents trips up to 5 km and then follows the same cumulative distribution as Sampers. In summary, the model improves over the two baselines across all three areas, espe- cially on the national level. On the national level baseline-24 is slightly better than baseline while being worse on both East and West. According to figures 4.10 and 4.9, this is due to the much larger share of short-distance trips (<10 km). The MSE between the Sampers’ output and the baselines’ as well as the model are summarised in Table 4.2. MSE (10−5) between Source (Sampers) Region Baseline Baseline-24 Model National 14.9 9.49 0.79 East 2.06 7.13 0.61 West 2.51 23.2 1.17 Table 4.2: Summary of MSE in Sweden, comparing baselines to the best model configuration. The Netherlands The longest distance one can travel within the Netherlands is around 300 km, shorter than Sweden due to its different geometry of the territory. Figure 4.11 shows the cu- mulative trip distance distribution, comparing the source (OViN), baseline, baseline- 24 and model. Figure 4.11: Trip distance distributions for the Netherlands (Source=OViN). The baseline significantly under-represents travel shorter than 10km, being off by 31 4. Results 10%. The cumulative trips distance distribution for the baseline continues to be lower than expected until around 50km, and trips longer than 50km are slightly over-represented. The baseline-24, however, has a significantly better share of short distance trips but still follow the under and over-representation pattern as the base- line. In comparison, the model corrects the short distance under-representation of Twitter data, and achieves a good fit with the source data for all the distance ranges. The MSE between the trips in the external source (OViN) and the baselines’ as well as the model are summarised in Table 4.3. The baseline-24 outperforms the baseline, due to it’s higher share of short-distance trips (<10 km). The model further improves on both of the baselines, because of a better share of medium-distance trips, which are very similar to OViN. MSE (10−5) between Source (OViN) Region Baseline Baseline-24 Model The Netherlands 6.72 0.63 0.10 Table 4.3: Summary of MSE in the Netherlands, comparing baseline to the best model configuration. São Paulo São Paulo is different from Sweden and the Netherlands because it is a city where the longest distance one can travel is significantly shorter, slightly more than 100km. Figure 4.12 shows the cumulative trip distance distribution, comparing the source (travel survey), baseline, baseline-24 and the model. The baseline overly represents travel in the shortest distance interval, 0 - 2.15 km, while slightly under represents travel on distances up to 20 km. Here, the baseline-24 actually performs worse than the baseline, by having a 20 percentage point increase in the shortest distance interval as compared with the source data. The model, however, corrects the short distance over-representation, and all together achieves a good fit with the source data. The MSE between the trips in the travel survey and the baselines’ as well as the model are summarised in Table 4.4. The baseline model over-represents the share of short-distance trips (<3 km), and the baseline-24 further increase the share of short-distance trips leading to a worse MSE. The model, however, achieves a good fit for the share of short-distance trips, which leads to the best MSE of the three models. 32 4. Results Figure 4.12: Trip distance distributions for São Paulo (Source=OD Survey 2017 in São Paulo). MSE (10−5) between Source (travel survey) Region Baseline Baseline-24 Model São Paulo 12.6 47.7 0.33 Table 4.4: Summary of MSE in São Paulo, comparing baseline to the best model configuration. 33 4. Results 34 5 Discussion This thesis estimates travel demand using geotagged tweets. We develop an individ- ual mobility model that accounts for and corrects the observed mobility behaviour biases found in geotagged tweets so that geotagged tweets can improve the estimates of travel demand. The model is calibrated and validated in two countries, Sweden and the Netherlands, and one city, São Paulo. 5.1 Top geotag Twitter users vs general popula- tion In agreement with previous studies, we find that the top geotag users on Twitter overly represent residents in urban areas (see figs. 4.1, 4.3 and 4.5). Stockholm county in Sweden and North Holland in the Netherlands both indicate an over- representation by a factor of, approximately, 1.7. In addition to previous work, we also find that the urban area over-representation can be attributed to the most central areas. The centre of Stockholm city in Sweden is overly represented by a factor of 2.3, and this number is 3.4 for Amsterdam in the Netherlands, and 130 for Barra Funda in São Paulo. Quantifying the population biases sheds light on the need for further de-biasing. For instance, the discrepancy between the spatial distribution of top geotag Twitter users and the general population could be used to attach a weight to each top geotag Twitter user when modelling the population flows between places. An issue with the high-resolution comparison of population density arises due to the small sample size of Twitter users in each region. This leads to the summary statis- tics in small zones being very sensitive to scale, and them being disproportionately over- or under-represented in comparison to the true population. This indicates the importance of properly choosing study zones when using geotagged tweets. 5.2 Mobility measured by geotagged tweets The baseline trips are created by connecting every two consecutive geotagged tweets by the same user without any time threshold, similar to other methods in the liter- ature [20, 26]. We found that the trips tend to overly represent long-distance travel 35 5. Discussion (see figs. 4.8 to 4.11). For Sweden and the Netherlands, the distance at which the baseline starts to overly represent is approximately 50 km. The baseline’s tendency to over-represent long distance trips further confirms the behaviour bias described by Tasse et al (2017) [10]. However, it should also be noted that the external data sources used in validation might not perfectly represent the true mobility for the region. Traditional household travel surveys have, for example, been shown to under-report long-distance trips [27, 28]. We also show that, if the considered region is geographically small, a short distance over-representation, less than 5 km, emerges in the trip distance distribution. For example, in São Paulo, where the maximum distance is 100 km, trips less below 3 km are over-represented by 10% (see Figure 4.12). Hence, regardless of the study area’s scale, trips derived directly from geotagged tweets will not yield a representative trip distance distribution. Furthermore, the share of short distance trips is affected by the spatial aggregation used, and the spatial resolution of the source. Geotagged tweets have a higher spatial resolution, leading to many short-distance trips which is more realistic than travel surveys because surveys sometimes only have tips with minimum distance of 1 km. The baseline-24 uses a time threshold of 24 hours, as commonly found in litera- ture [12, 13]. The threshold appears to primarily remove long-distance trips, which in turn yields a more representative share of long-distance trips compared to the baseline. We also find that the threshold produces a considerably greater share of short-distance trips, i.e. 0 - 10 km, than the baseline. In all regions except for the Netherlands, this impacts the MSE negatively, as the ground truth has a lower share of short-distance trips. Overall, we find that the baseline-24 produce trip distance distributions worse than the baseline. Furthermore, the common practice [12, 13] of adding a time threshold to convert displacements into trips reduces the amount of available data, and as the accuracy of the estimations is actually decreased, there are not many benefits of this practice. 5.3 Individual mobility model Based on the mobility biases found in Twitter data, we propose an Individual Mo- bility Model that accounts for the observed behavioural biases of geotagged tweets. The proposed model integrates two natural dimensions of individual mobility: reg- ularly returning to habitual locations and occasionally exploring new locations. To address the caveat of under-representation of habitual places such as home and work- place, the model re-scales the observed visitation frequency of various locations by combining their order and Zipf’s law [5]. In addition, the model combines visitation frequency and distance when selecting a location to return to. By doing so, the tendency for long-distance travel observed in Twitter data, is corrected. As a result, the modelled mobility prefers making visits to places closer to where the individual is currently located more than the baselines. The model consistently improves upon the baseline models across all regions (see Table 5.1), especially on short-distance trips (< 50 km). Despite the good results, 36 5. Discussion the model have some limitations that should be explored further; the dependence on estimated home location, and assumed independence of jump size and bearing distributions. First, the model is dependant on the estimated home location from the tweets, as the first visit every day is assumed to be at home. Consequently, inherent in the the model is the uncertainties of the estimated home locations. Theoretically, the model only requires the first visit of the user’s timeline to be at a known preferred location, and the remaining visits could be generated without starting at the home location every day. This would drastically decrease the dependence on estimated home locations, but other effects are still unknown. Secondly, the jump size and bearing distributions are independent in the proposed model which is a simplification of the reality. For example, it is not likely that a person living in Sweden makes a 6000 km north (crossing the north pole), while taking a flight to New York of the same distance is much more likely. This could be addressed by sampling the jump size conditionally based on the bearing in future work. The consequences of this assumption do not emerge in the results, however, because the trip distance distribution is one aspect of mobility, and does not take bearing into account. MSE (10−5) Parameters Region Baseline Baseline-24 Model γ ρ β Sweden: National 14.9 9.49 0.79 0.75 0.4 0.03 Sweden: East 2.06 7.13 0.61 0.75 0.4 0.03 Sweden: West 2.51 23.2 1.17 0.75 0.4 0.03 The Netherlands 6.72 0.63 0.10 0.45 0.6 0.04 São Paulo 12.6 47.7 0.33 0.45 0.6 0.05 Table 5.1: MSE and parameters of the best performing model in each region compared to the baselines. 5.4 Model sensitivity to different parameters There are three parameters in the proposed model: ρ, γ and β We calibrate the model parameters for three of the regions in this thesis: Sweden, the Netherlands and São Paulo. The β parameter, controlling preference for short-distance travel, is the parameter that influences the MSE the most. Furthermore it appears to be correlated with the maximum length travel within the region. That is, the longer the maximum travel, the lower β is found. Despite this, it is found that the shorter the maximum travel in the region, the less influence β has on the model. This is because the lower probability of long-distance trips, controlled by β, becomes irrelevant when only considering short-distance trips. The optimal value of β in each region is summarised in Table 5.1. Parameters controlling exploration rate, ρ and γ, have 37 5. Discussion less influence on the MSE in the studied regions than β. However, across all regions the optimal parameters suggest that small amounts of exploration is beneficial for the MSE presumably due to biases towards unusual locations in geotagged tweets as compared with the actual mobility. Although the optimal model parameters found for each region is slightly different, our grid search results (see Figure 4.7) suggests that the optimal values are in the same range for all regions: β around 0.04, ρ around 0.5 and γ between 0.45 and 0.75. We expect that applying the model with these parameters to a new region would yield better results than using consecutive tweets directly, with or without a time threshold. 5.5 Future work This thesis carefully examines the biases of geotagged tweets in two aspects, popu- lation representation and mobility representation as measured by geotagged tweets. Despite addressing the behavioural biases by proposing the individual mobility model, the population bias of the top geotag Twitter users are not integrated into the modelling. In order to de-bias on both aspects for travel demand estimation, one future direction can be adding varying weights to those top geotag Twitter users as compared with the general population when aggregating their trajectories to create an origin-destination matrix. The proposed individual mobility model performs well as compared with the other established data sources in the three regions that we studied. However, the evalua- tion method in this thesis is limited to the trip distance distribution which consti- tutes one part of the travel demand. The next step is to further validate the model on where visits are generated to take spatial orientation into consideration. Another future work is to generalise the proposed model into global regions to further test its feasibility and conduct cross-regional analysis on the individual mobility to make full use of social media data as an emerging data source in mobility study. 38 6 Conclusion Traditional household travel surveys is a commonly used method of estimating travel demand. However, the cost of conducting these travel surveys is increasing, while the response rate is decreasing. This has led researchers to explore new sources of data that can be used to estimate travel demand. Among these new data sources is social media data such as geotagged tweets from Twitter, which is promising due to it’s large quantity of available data and low cost of access. At the same time, using Twitter for travel demand estimation has garnered criticism regarding the biases inherent in Twitter data. We quantify and confirm the results of previous research regarding the biases of Twitter users. (1) The users of Twitter are overly represented in urban areas, and even more so in the absolute centre of these areas. (2) Twitter users geotag their tweets during trips far from home leading to an over-representation of long distance trips if used directly as a proxy of human mobility. Our main contribution to the field, corresponding to the revealed biases, is to de- velop a novel model to generate individual mobility trajectories for travel demand estimation using geotagged tweets. It takes behavioural biases into consideration. The proposed model produces individual based series of visits based on the observed geotagged activities. The proposed model integrates two natural dimensions of indi- vidual mobility: regularly returning to habitual locations and occasionally exploring new locations that have not been visited before. The proposed model addresses the under-representation of habitual places such as home and workplace and corrects the geotagging behavioural bias of being less constrained by distance. Validation on three different regions suggests that the model is able to capture the essential travel demand in multiple regions of distinct geographical properties as compared with the other established data sources. Finally, the results suggest that the model’s param- eters are robust across regions studied, and by using the parameters found in this thesis one can expect similar improvements compared to contemporary approaches across other regions. Given that geotagged tweets as an emerging data source have been used widely to characterise travel demand, it is imperative to address the known biases, as we have demonstrated here. Future work includes examining the performance of the model by using more validation metrics than trip distance distribution, integrating the population de-biasing in the model and test the model in other regions. 39 6. Conclusion 40 Bibliography [1] Yang Yue, Tian Lan, Anthony Yeh, and Qing-Quan Li. Zooming into individu- als to understand the collective: A review of trajectory-based travel behaviour studies. Travel Behaviour and Society, 1:69–78, 05 2014. [2] Yuan Liao. Understanding human mobility with emerging data sources: Vali- dation, spatiotemporal patterns, and transport modal disparity. 2020. [3] Raja Jurdak, Kun Zhao, Jiajun Liu, Maurice AbouJaoude, Mark Cameron, and David Newth. Understanding human mobility from twitter. PLOS ONE, 10(7):1–16, 07 2015. [4] Juha K Laurila, Daniel Gatica-Perez, Imad Aad, Olivier Bornet, Trinh-Minh- Tri Do, Olivier Dousse, Julien Eberle, Markus Miettinen, et al. The mobile data challenge: Big data for mobile computing research. In Pervasive Computing, number EPFL-CONF-192489, 2012. [5] Chaoming Song, Tal Koren, Pu Wang, and Albert-Laszlo Barabasi. Modelling the scaling properties of human mobility. Nature Physics, 6, 10 2010. [6] Fred Morstatter, Jürgen Pfeffer, Huan Liu, and Kathleen M. Carley. Is the Sample Good Enough? Comparing Data from Twitter’s Streaming API with Twitter’s Firehose. arXiv:1306.5204 [physics], June 2013. arXiv: 1306.5204. [7] Quan Yuan, Gao Cong, Zongyang Ma, Aixin Sun, and Nadia Magnenat Thal- mann. Who, where, when and what: discover spatio-temporal topics for twitter users. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’13, pages 605–613, Chicago, Illi- nois, USA, August 2013. Association for Computing Machinery. [8] Bartosz Hawelka, Izabela Sitko, Euro Beinat, Stanislav Sobolevsky, Pavlos Kazakopoulos, and Carlo Ratti. Geo-located Twitter as proxy for global mobil- ity patterns. Cartography and Geographic Information Science, 41(3):260–271, May 2014. [9] Maxime Lenormand, Miguel Picornell, Oliva Garcia Cantu Ros, Antonia Tu- gores, Thomas Louail, Ricardo Herranz, Marc Barthelemy, Enrique Frias- Martinez, and Jose Javier Ramasco. Cross-checking different sources of mobility information. PLoS ONE, 9, 04 2014. 41 Bibliography [10] Dan Tasse, Zichen Liu, Alex Sciuto, and Jason I Hong. State of the geotags: Motivations and recent changes. In ICWSM, pages 250–259, 2017. [11] Yuan Liao, Sonia Yeh, and Jorge Gil. Feasibility of estimating travel demand using social media data. Transportation, 2020. [12] Jae Hyun Lee, Adam Davis, Elizabeth McBride, and Konstadinos G Goulias. Statewide comparison of origin-destination matrices between california travel model and twitter. In Mobility Patterns, Big Data and Transport Analytics, pages 201–228. Elsevier, 2019. [13] A. Kheiri, F. Karimipour, and M. Forghani. Intra-Urban Movement Flow Esti- mation Using Location Based Social Networking Data. ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sci- ences, 15:781–785, December 2015. [14] Fan Yang, Peter J. Jin, Yang Cheng, Jian Zhang, and Bin Ran. Origin- Destination Estimation for Non-Commuting Trips Using Location-Based So- cial Networking Data. International Journal of Sustainable Transportation, 9(8):551–564, November 2015. [15] Filippo Simini, Marta C González, Amos Maritan, and Albert-László Barabási. A universal model for mobility and migration patterns. Nature, 484(7392):96– 100, 2012. [16] Juan Gonzalo Cárcamo, Roderick Grahm Vogel, Adam M. Terwilliger, Jonathan P. Leidig, and Greg Wolffe. Generative models for synthetic popula- tions. In Proceedings of the Summer Simulation Multi-Conference, SummerSim ’17, San Diego, CA, USA, 2017. Society for Computer Simulation International. [17] Sébastien Gambs, Marc-Olivier Killijian, and Miguel Nunez del Prado Cortez. Next place prediction using mobility markov chains. 04 2012. [18] Yu Liu, Chaogui Kang, Song Gao, Yu Xiao, and Yuan Tian. Understanding intra-urban trip patterns from taxi trajectory data. Journal of Geographical Systems, 14(4):463–483, October 2012. [19] I. Rhee, M. Shin, S. Hong, K. Lee, S. J. Kim, and S. Chong. On the levy-walk nature of human mobility. IEEE/ACM Transactions on Networking, 19(3):630– 643, 2011. [20] Yuan Liao, Sonia Yeh, and Gustavo S Jeuken. From individual to collective behaviours: exploring population heterogeneity of human mobility based on social media data. EPJ Data Science, 8(1):34, 2019. [21] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A density- based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, KDD’96, page 226–231. AAAI Press, 1996. 42 Bibliography [22] Christian M. Schneider, Vitaly Belik, Thomas Couronné, Zbigniew Smoreda, and Marta C. González. Unravelling daily human mobility motifs. Journal of The Royal Society Interface, 10(84):20130246, 2013. [23] Official Statistics of Sweden. Swedish National Travel survey (RVU Sweden) 2011—2016, 2016. [24] Eurostat. Population grids, 2018. Data retrieved from Statistics Explained. [25] Trafikverket. Sampers, Nov 2019. [26] Song Gao, Jiue-An Yang, Bo Yan, Yingjie Hu, Krzysztof Janowicz, and Grant McKenzie. Detecting origin-destination mobility flows from geotagged tweets in greater los angeles area. 09 2014. [27] Maxim Janzen. Population synthesis for long-distance travel demand simu- lations. In 6th symposium of the European association for research in trans- portation (hEART 2017). ETH Zurich, Institute for Transport Planning and Systems, 2017. [28] Zhenzhen Wang, Sylvia He, and Yee Leung. Applying mobile phone data to travel behaviour research: A literature review. Travel Behaviour and Society, 03 2017. 43 Bibliography 44 A Notations I A. Notations Notation Definition i Individual index (X,Y ) Decimal coordinates of a geotagged tweet t Local time of a geotagged tweet converted Gi Sequence of a user’s geotagged tweets p Index of geotagged tweet in Gi Ni Total number of geotagged tweets in Gi w Day of week h Hour of day Si Set of distinct places visited by individual i j Index of distinct place in Si ni Total number of distinct places in Si Kj Frequency of visiting a place j for individual i fj Frequency rate of place j among total visited places in Si θp,p+1 Distance between two consecutive geotagged tweets αp,p+1 Bearing between two consecutive geotagged tweets s A place in Si sh Identified home place of the individual Li Model output of individual mobility trajectory for individual i D Total number of days for Li Ti,d A series of visits for individual i at day d Md Total number of visits for Ti,d vi,d,m The mth visit at dth day for individual i γ, ρ Two parameters that control exploration in the proposed model β The parameter that controls returning in the proposed model ζ The parameter of Zipf’s Law k Rank of visited places by their visiting frequency Table A.1: Lookup table with the main symbols and relevant notations used in this thesis. II B Parameter tuning III B. Parameter tuning ρ γ β MSE* (10−5) 0.4 0.75 0.03 2.57 0.4 0.8 0.03 2.71 0.3 0.8 0.03 2.92 0.3 0.75 0.03 3.08 0.4 0.85 0.03 3.22 0.4 0.75 0.04 3.39 0.3 0.8 0.04 3.46 0.4 0.85 0.04 3.51 0.3 0.75 0.04 3.59 0.4 0.8 0.04 3.62 0.3 0.85 0.03 3.74 0.6 0.8 0.04 3.75 0.3 0.85 0.04 4.04 0.2 0.75 0.04 4.13 0.3 0.5 0.04 4.23 0.2 0.8 0.04 4.61 0.2 0.75 0.03 4.74 0.9 0.8 0.04 4.82 0.2 0.85 0.04 5.05 0.2 0.8 0.03 5.83 0.6 0.5 0.04 6.23 0.4 0.8 0.05 6.84 0.3 0.75 0.05 6.92 0.9 0.5 0.07 6.92 0.4 0.75 0.05 6.98 0.2 0.85 0.03 7.0 0.3 0.2 0.07 7.09 0.4 0.85 0.05 7.25 0.2 0.75 0.05 7.27 0.2 0.85 0.05 7.45 0.2 0.8 0.05 7.67 0.3 0.8 0.05 7.73 0.3 0.85 0.05 7.83 0.6 0.5 0.07 8.05 0.3 0.2 0.04 8.77 0.9 0.5 0.04 9.66 0.9 0.8 0.07 10.75 0.3 0.5 0.07 12.54 0.6 0.8 0.07 13.03 0.3 0.8 0.01 13.72 0.6 0.8 0.01 15.69 0.3 0.8 0.07 15.73 0.3 0.5 0.01 16.55 0.6 0.2 0.07 16.91 0.9 0.8 0.01 18.57 0.6 0.5 0.01 23.72 0.6 0.2 0.04 25.87 0.3 0.2 0.01 28.8 0.9 0.5 0.01 31.22 0.9 0.2 0.07 42.41 0.6 0.2 0.01 52.99 0.9 0.2 0.04 53.84 0.9 0.2 0.01 81.65 Table B.1: Performance, in terms of MSE*, for the different model configurations in Sweden, sorted by MSE*. MSE* is the sum of the three MSE values received for the areas “National”, “East” and “West”. IV B. Parameter tuning ρ γ β MSE (10−5) 0.6 0.45 0.04 0.1 0.7 0.5 0.04 0.12 0.7 0.45 0.04 0.14 0.6 0.5 0.04 0.16 0.7 0.55 0.04 0.16 0.6 0.5 0.04 0.17 0.5 0.45 0.04 0.18 0.3 0.2 0.04 0.19 0.9 0.5 0.04 0.21 0.5 0.55 0.03 0.26 0.5 0.5 0.04 0.29 0.6 0.55 0.04 0.3 0.7 0.45 0.05 0.3 0.5 0.5 0.03 0.4 0.6 0.55 0.03 0.4 0.5 0.55 0.04 0.44 0.6 0.2 0.07 0.49 0.5 0.45 0.03 0.57 0.7 0.55 0.03 0.58 0.6 0.45 0.05 0.59 0.6 0.5 0.03 0.59 0.7 0.5 0.05 0.65 0.9 0.8 0.04 0.66 0.3 0.5 0.04 0.82 0.7 0.5 0.03 0.83 0.6 0.45 0.03 0.92 0.6 0.8 0.04 1.0 0.6 0.5 0.05 1.02 0.7 0.55 0.05 1.03 0.5 0.45 0.05 1.04 0.7 0.45 0.03 1.25 0.6 0.55 0.05 1.41 0.5 0.5 0.05 1.42 0.3 0.8 0.04 1.52 0.9 0.5 0.07 1.73 0.5 0.55 0.05 1.88 0.3 0.2 0.07 1.91 0.6 0.2 0.04 3.49 0.6 0.5 0.07 4.06 0.9 0.2 0.07 5.77 0.9 0.8 0.07 6.82 0.3 0.5 0.07 7.37 0.6 0.8 0.07 8.43 0.3 0.8 0.01 9.42 0.3 0.8 0.07 10.09 0.6 0.8 0.01 10.39 0.3 0.5 0.01 10.64 0.9 0.8 0.01 11.0 0.9 0.2 0.04 11.08 0.6 0.5 0.01 12.85 0.3 0.2 0.01 14.76 0.9 0.5 0.01 15.21 0.6 0.2 0.01 21.77 0.9 0.2 0.01 29.77 Table B.2: Performance, in terms of MSE, for the different model configurations in the Netherlands, sorted by MSE. V B. Parameter tuning ρ γ β MSE (10−5) 0.7 0.5 0.05 0.33 0.6 0.45 0.05 0.34 0.6 0.55 0.04 0.37 0.7 0.45 0.05 0.38 0.5 0.5 0.04 0.38 0.6 0.5 0.05 0.39 0.7 0.55 0.05 0.39 0.5 0.45 0.04 0.4 0.5 0.55 0.04 0.4 0.9 0.5 0.07 0.41 0.9 0.8 0.04 0.42 0.7 0.55 0.04 0.42 0.6 0.5 0.04 0.43 0.5 0.45 0.05 0.43 0.6 0.55 0.05 0.51 0.6 0.45 0.04 0.53 0.5 0.5 0.05 0.54 0.7 0.5 0.04 0.55 0.5 0.55 0.03 0.55 0.3 0.5 0.04 0.58 0.6 0.8 0.04 0.64 0.3 0.2 0.04 0.66 0.5 0.5 0.03 0.67 0.6 0.55 0.03 0.69 0.5 0.55 0.05 0.71 0.7 0.45 0.04 0.76 0.3 0.2 0.07 0.79 0.5 0.45 0.03 0.82 0.6 0.5 0.03 0.89 0.7 0.55 0.03 0.92 0.9 0.5 0.04 1.04 0.3 0.8 0.04 1.06 0.6 0.45 0.03 1.15 0.7 0.5 0.03 1.18 0.3 0.8 0.01 1.29 0.6 0.2 0.07 1.39 0.6 0.5 0.07 1.48 0.7 0.45 0.03 1.58 0.6 0.8 0.01 1.76 0.3 0.5 0.01 1.86 0.9 0.8 0.01 2.33 0.9 0.8 0.07 2.95 0.6 0.5 0.01 3.26 0.3 0.5 0.07 3.75 0.3 0.2 0.01 4.02 0.6 0.8 0.07 4.09 0.6 0.2 0.04 4.39 0.9 0.5 0.01 5.12 0.3 0.8 0.07 5.67 0.9 0.2 0.07 7.15 0.6 0.2 0.01 9.75 0.9 0.2 0.04 11.32 0.9 0.2 0.01 16.83 Table B.3: Performance, in terms of MSE, for the different model configurations in São Paulo, sorted by MSE. VI List of Figures List of Tables Introduction Related work Data sources Measuring human mobility with geotagged tweets Modelling travel demand: from individuals to population Thesis objectives Disposition of this thesis Ethical considerations Methods Data collection and preprocessing Feature construction Individual mobility model Framework Exploration Returning Example Validation External data sources EU-wide population grid Sweden The Netherlands São Paulo Population representation of top geotag Twitter users Mobility representation of the proposed model Results Population representation Individual mobility model: parameters and validation Discussion Top geotag Twitter users vs general population Mobility measured by geotagged tweets Individual mobility model Model sensitivity to different parameters Future work Conclusion Bibliography Notations Parameter tuning