Maritime Shipping Network Graph - a Model Derived from Vessel AIS Data Creating and evaluating a graph representation of maritime vessel traffic using AIS data Master’s thesis in Computer science and engineering HAMPUS LARSSON, WENDY PAU Department of Computer Science and Engineering CHALMERS UNIVERSITY OF TECHNOLOGY UNIVERSITY OF GOTHENBURG Gothenburg, Sweden 2024 Master’s thesis 2024 Maritime Shipping Network Graph - a Model Derived from Vessel AIS Data Creating and evaluating a graph representation of maritime vessel traffic using AIS data HAMPUS LARSSON, WENDY PAU Department of Computer Science and Engineering Chalmers University of Technology University of Gothenburg Gothenburg, Sweden 2024 Maritime Shipping Network Graph Representation Creating and evaluating a graph representation of maritime vessel traffic using AIS data HAMPUS LARSSON, WENDY PAU © HAMPUS LARSSON, WENDY PAU, 2024. Supervisor: Peter Damaschke, Data Science and AI. Advisor: Kayed Mahra & Jesper Olsson Examiner: Magnus Myreen, Computer Science and Engineering. Master’s Thesis 2024 Department of Computer Science and Engineering Chalmers University of Technology and University of Gothenburg SE-412 96 Gothenburg Telephone +46 31 772 1000 Cover: Constructed graph representation of the maritime shipping network. Typeset in LATEX Gothenburg, Sweden 2024 iv Maritime Shipping Network Graph Representation Creating and evaluating a graph representation of maritime vessel traffic using AIS data HAMPUS LARSSON, WENDY PAU Department of Computer Science and Engineering Chalmers University of Technology and University of Gothenburg Abstract Maritime shipping shoulders more than 90% of global trade. Data science and ML, another immensely profitable industry, currently experiences an unprecedented evolution of techniques. This project proposes novel methods that apply data science techniques to the domain of maritime shipping. The main objective of this project is to construct a graph closely modelling the global maritime shipping structure, from which analytics can be derived. The node set is constructed using a pipeline of Change Point Detection to identify preliminary waypoints and reduce data quantity, KDE is utilised for geographical density estimation and partitioning the AIS data into different density areas, and lastly, the geospatial indexing framework S2 Geometry is used for final waypoint extraction to a node set. The edge set is constructed using a transition matrix that is used together with the final node set to construct the graph representation. Simulation results on the graph representation reveal the ability to construct routes with high resemblance to real-world routes. Further testing revealed high likeness between the most influential nodes in the graph representation and influential points-of-interest in the maritime shipping structure. In turn, the maritime shipping network graph representation is a tool for analysing the maritime shipping structure. Keywords: AIS data, maritime, graph, network, kernel density estimation, S2 Google, traffic route, change point detection, path-finding, A*. v Acknowledgements We would like to express our sincere gratitude to all those who have been a part of this project and contributed to its results. First and foremost, we sincerely thank our advisors, Jesper Olsson and Kayed Mahra for supervising this project. Their supportive natures, combined with heaps of domain knowledge, have been vital in shaping the project into what it is today. We are also grateful for the support of professor Peter Damaschke, our supervisor at Chalmers University of Technology. It has been invaluable to have the guidance of someone as well-versed in the norms and customs of the scientific community as him. Special thanks go out to professor Magnus Myreen at Chalmers University of Tech- nology. His insightful feedback and correspondence has truly elevated this report. Hampus Larsson & Wendy Pau, Gothenburg, 2024-06-25 vii Contents List of Figures xi List of Tables xv List of Algorithms xvii 1 Introduction 1 1.1 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 Purpose . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.3 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.4 Ethical Considerations and Risks . . . . . . . . . . . . . . . . . . . . 3 2 Theory 5 2.1 Automatic Identification System . . . . . . . . . . . . . . . . . . . . . 5 2.2 Google S2 Geometry . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.3 Change Point Detection . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.3.1 PELT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.4 Kernel Density Estimation . . . . . . . . . . . . . . . . . . . . . . . . 8 3 Related Work 11 4 Data & Analysis 15 4.1 Data analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 4.1.1 AIS data sources . . . . . . . . . . . . . . . . . . . . . . . . . 18 4.1.2 Ports . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 4.1.3 Speed and course over ground . . . . . . . . . . . . . . . . . . 20 4.1.4 Position time frequency . . . . . . . . . . . . . . . . . . . . . 21 5 Methods 23 5.1 Graph Representation . . . . . . . . . . . . . . . . . . . . . . . . . . 23 5.2 Creation of Node Set . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 5.2.1 Change Point Detection . . . . . . . . . . . . . . . . . . . . . 24 5.2.2 Data Splitting Using Kernel Density Estimation . . . . . . . . 25 5.2.3 Waypoint Extraction Using S2 Geometry . . . . . . . . . . . . 26 5.3 Creation of Edge Set . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 5.3.1 Transition Matrix . . . . . . . . . . . . . . . . . . . . . . . . . 28 ix Contents 5.4 Graph Construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 5.5 Graph Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 5.5.1 Graph Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 5.5.2 A* Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 5.5.3 A* - Test Cases . . . . . . . . . . . . . . . . . . . . . . . . . . 35 6 Results 39 6.1 Graph Evaluation Using A* . . . . . . . . . . . . . . . . . . . . . . . 39 6.2 Graph Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 6.3 Change Point Detection . . . . . . . . . . . . . . . . . . . . . . . . . 44 6.4 Data Splitting of Change Points . . . . . . . . . . . . . . . . . . . . . 46 6.5 Extraction of Waypoints . . . . . . . . . . . . . . . . . . . . . . . . . 47 6.6 Transition Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 7 Discussion 53 7.1 Graph Evaluation Using A* . . . . . . . . . . . . . . . . . . . . . . . 53 7.2 Network Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 7.3 Extraction of Waypoints . . . . . . . . . . . . . . . . . . . . . . . . . 55 7.4 Transition Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 8 Conclusion 57 Bibliography 59 A Appendix 1 I A.1 A* paths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I x List of Figures 2.1 A world map showing two of the six face cells, where the red face cell has been subdivided into four smaller cells several times. The world map is plotted using Cartopy [13]. CAVEAT: This figure is a conceptual visualisation of cell divisions. The cell sizes are incorrect and do not represent the true sizes. . . . . . . . . . . . . . . . . . . . 6 2.2 Plots of the function f(x) = |x| in both a continuous manner and a discrete manner. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.3 A conceptual visualisation of an 1D case of how KDE is constructed. The black lines indicate the locations of the five observations, the red bumps are the smoothed observations, and lastly, the blue curve is the resulting density estimate. . . . . . . . . . . . . . . . . . . . . . . 8 2.4 KDE with different smoothing bandwidths. The plots are created using seaborn.kdeplot() [17], with samples randomly drawn from a normal distribution using the Python package NumPy [18]. . . . . . . 9 4.1 AIS data aggregated with H3 geographical index and coloured by the mode navigational status of samples in each hexagon. . . . . . . . . . 15 4.2 Global ports (red) from the port list excluding ports of type NaN. . . 17 4.3 Distribution plot of each dataset from the three AIS data sources. . . 18 4.4 Categorical scatter plot of ports showing different port types. The red ports represent those with no port type (NaN). . . . . . . . . . . . . 19 4.5 Ports with missing port type. . . . . . . . . . . . . . . . . . . . . . . 19 4.6 Ports excluding missing port types. . . . . . . . . . . . . . . . . . . . 19 4.7 Distribution plots showing 4 different sog intervals. . . . . . . . . . . 20 4.8 Distribution plots showing 4 different cog intervals. . . . . . . . . . . 21 4.9 Count plot of the mean update frequencies for the journeys. . . . . . 21 5.1 An example showing a newly extracted waypoint being added to the waypoint tree depending on 4 different cases. . . . . . . . . . . . . . . 27 5.2 Main maritime routes divided up into core and secondary routes [40]. Red circles denote primary choke points, whereas yellow circles denote secondary choke points. . . . . . . . . . . . . . . . . . . . . . . . . . 36 6.1 A* paths: Port of Singapore to Port of Rotterdam, Port of Tianjin to Port of Nynäshamn. Ports visited on the path are marked by a polygon. Non-port nodes visited are marked by a circle. . . . . . . . . 39 xi List of Figures 6.2 A* paths: Port of Hong Kong to Port of Mahshahr, Port of Novorossiysk to Port of Lianyungang. Ports visited on the path are marked by a polygon. Non-port nodes visited are marked by a circle. . . . . . . . . 40 6.3 A* paths: Port of Busan to Port of New York, Cape Town to Port of Durban. Ports visited on the path are marked by a polygon. Non-port nodes visited are marked by a circle. . . . . . . . . . . . . . . . . . . 40 6.4 A plot of all nodes in the graph representation of the maritime struc- ture. Edges are omitted for ease of viewing. . . . . . . . . . . . . . . 41 6.5 A plot of all nodes and edges in the graph representation of the maritime structure. Edges are directed in the actual graph; here, direction is omitted. . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 6.6 Scatter plots of the 50 most important nodes with respect to degree, betweenness, and closeness centrality. . . . . . . . . . . . . . . . . . . 43 6.7 Visuals of the community detection on the graph representation. . . . 43 6.8 Results of change point detection on AIS dataset. The red dots represent identified change points, and the black markings portray all maritime traffic extracted from the AIS data. . . . . . . . . . . . . . 44 6.9 Elbow method for choosing the penalty parameter for change point detection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 6.10 Change point detection on sog (blue lines). The detected change points are the start of each segmentation block. . . . . . . . . . . . . 45 6.11 Change point detection on cog (blue lines). The detected change points are the start of each segmentation block. . . . . . . . . . . . . 46 6.12 Heat map of the result from KDE of change points. The change points are also plotted as scatter points in the figure. . . . . . . . . . . . . . 46 6.13 Combined map plot of change points filtered by KDE. Sparse (green), mid (purple), and dense (blue). . . . . . . . . . . . . . . . . . . . . . 47 6.14 Map plot showing waypoints as polygons of different levels ranging from level 4-8. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 6.15 Map plot showing two of the vital waterway regions with the extracted waypoint in Figure 6.14 with scatter points of the AIS data. . . . . . 48 6.16 Map plot of the waypoints in the East Asia region and around the Suez Canal and Hormuz Strait. . . . . . . . . . . . . . . . . . . . . . 49 6.17 Transition matrix plots detailing the sparseness of it. . . . . . . . . . 50 6.18 Density plot of the transition matrix values. . . . . . . . . . . . . . . 50 6.19 Port of Hong Kong (HKHKG) to Port of Mahshahr (IRMRX). . . . . 51 6.20 Port of Busan (KRPUS) to Port Of Tianjin (CNTXG). . . . . . . . . 51 A.1 A* paths for test case VNSGN ↔ NLRTM . . . . . . . . . . . . . . . I A.2 A* paths for test case USLGB ↔ CNTXG. . . . . . . . . . . . . . . I A.3 A* paths for test case CNTXG ↔ DEHAM. . . . . . . . . . . . . . . II A.4 A* paths for test case HKHKG ↔ IRMRX. . . . . . . . . . . . . . . II A.5 A* paths for test case USNYC ↔ KRPUS. . . . . . . . . . . . . . . II A.6 A* paths for test case SGSIN ↔ NLRTM. . . . . . . . . . . . . . . . II A.7 A* paths for test case BEANR ↔ SAJED. . . . . . . . . . . . . . . . III A.8 A* paths for test case RUPKC ↔ NLRTM. . . . . . . . . . . . . . . III xii List of Figures A.9 A* paths for test case CNLYG ↔ RUNVS. . . . . . . . . . . . . . . . III A.10 A* paths for test case CNTXG ↔ SENYN. . . . . . . . . . . . . . . IV A.11 A* paths for test case SENYN ↔ USHNL. . . . . . . . . . . . . . . . IV A.12 A* paths for test case IRMRX ↔ RUNVS. . . . . . . . . . . . . . . . IV A.13 A* paths for test case SAJED ↔ CLVAP. . . . . . . . . . . . . . . . V A.14 A* paths for test case ZADUR ↔ ZACPT. . . . . . . . . . . . . . . . V xiii List of Figures xiv List of Tables 4.1 Available information about the vessel information. Some fields can be unavailable and is thus set as null. The descriptions were taken from [27], [28]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 4.2 Example of AIS data sample structure, showing data fields 1-6. . . . . 16 4.3 Example of AIS data sample structure, showing data fields 7-11. . . 16 4.4 Example of AIS data sample structure, showing data fields 12-16. . . 16 4.5 Example of AIS data sample structure, showing data fields 17-21. . . 16 4.6 Available information about the port data. Some fields can be unavail- able and is thus set as null. . . . . . . . . . . . . . . . . . . . . . . . . 17 4.7 Information about the data from the three AIS data sources. . . . . . 18 5.1 Set-up of change point detection. . . . . . . . . . . . . . . . . . . . . 25 5.2 For an estimated pdf value of a data point est_pdf, the data point is categorized as one of the three areas: sparse, mid, and dense, according to the respective thresholding in the table. The variables max_low and min_high are manually chosen and represents the maximum pdf value for the sparse area and the minimum pdf for the dense area respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 5.3 Arguments for Algorithm 1. . . . . . . . . . . . . . . . . . . . . . . . 28 5.4 Primary choke points and the naval route they occupy. . . . . . . . . 37 5.5 Test cases used for evaluating the graph representation using the A* algorithm. Routes are tested both ways. Choke points are represented by their row number in Table 5.4. . . . . . . . . . . . . . . . . . . . 37 6.1 Estimated travel times (in hours) for the test cases. . . . . . . . . . . 41 6.2 Metrics concerning the graph representation. . . . . . . . . . . . . . . 42 6.3 Number of change points compared to total number of AIS data points. Row 3-4 is the number of points, where a point is a change point based on either the course or speed value, while the last row is based on both values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 6.4 The min and max threshold pdf values for a point to be categorised as one of the following regions: sparse, mid, and dense. . . . . . . . . 47 6.5 Final chosen values for the variables in Table 5.3 used for Algorithm 1. 48 xv List of Tables xvi List of Algorithms 1 Create Waypoints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 2 Populate Transition Matrix . . . . . . . . . . . . . . . . . . . . . . . 29 3 Graph Construction Algorithm . . . . . . . . . . . . . . . . . . . . . 31 4 A* . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 5 Travel Time Heuristic . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 xvii List of Algorithms xviii 1 Introduction The maritime transportation sector stands as a cornerstone of our current economy. The demand for moving goods continues to surge, propelled by customer expectations for cost-effective, top-tier services and the expenses tied to providing these transport services. Maritime shipping dominates alternative modes of transport such as trucks, aeroplanes, or trains, due to its cost efficiency, rapid speeds, and dependable schedules [1]. Dating back through history, maritime transportation remains one of humanity’s earliest means of connecting, playing an integral role in trade and cultural exchange. Presently, maritime transport shoulders more than 90% of global trade, particularly within global scale logistical networks [2]. In the last 30 years, containerised cargo has grown by more than 8% annually, with more than 5150 container ships globally operational in 2017 [3]. In recent years, data science and AI have exploded. The data science market is projected to increase from $96.3 billion (2022) to $378.7 billion (2030) [4], a potential largely based in the acquisition of data science and AI into new disciplines. As of 2023, 18.2% of firms in America report the usage of AI — a greater than three-fold increase since 2017 [5]. In the maritime shipping sector, data science is no foreign concept. As it stands, the query ’maritime shipping AI’ yields approximately 80 900 results on Google Scholar [6]. However, due to the rapid development of data science and AI, many concepts are left unexplored in the context of maritime shipping. The complex structure of global maritime shipping share a multitude of properties with that of a mathematical graph. Ports exhibit node-like properties, being crucial points of connectivity in the maritime shipping structure. Likewise, the movements of vessels between ports model the function of edges in a graph [7]. By leveraging emerging data science and AI/ML techniques, global maritime shipping and its network structure can be modelled as a graph. A graph representation of the maritime shipping structure could provide multifaceted benefits, primarily stemming from inherent properties of the graph structure. Firstly, a graph representation enables the visualisation of complex connections within the maritime structure, thus allowing stakeholders comprehensive insights into the flow of cargo, route optimisation, and bottleneck analysis. Secondly, by leveraging graph algorithms the analysis of key nodes, vital shipping lanes, and strategic ports is facilitated; decision- making processes are aided. Moreover, a graph representation would provide means for analysing dynamic changes in the shipping structure. In simulation and testing, the understanding of the global maritime shipping network could be furthered. 1 1. Introduction 1.1 Problem Description This thesis contributes to the maritime shipping sector by utilising data science techniques to construct a graph representation modelling the maritime shipping network, with naval points of interest as nodes and vessel routes as edges. A graph representation of the maritime shipping network is constructed using Automatic Identification System (AIS) data (Section 2.1), offering valuable insights into the operation and structure of maritime shipping. Through its construction via historical AIS data, the graph representation provides a unique platform for analysis where vital hubs and points of interest (POI) in the maritime shipping structure are deduced. Furthermore, shipping journeys can be planned using algorithmic approaches. As such, the construction of a “true-to-life” graph representation paves the way for savings analysis; both of economic and environmental nature. One of the main challenges lies in producing a graph representation that is both meaningful and accurately reflects the intricacies of the actual maritime network. Determining what the nodes and edges shall represent poses a significant challenge. While the existing research commonly adopts an approach that treats ports as nodes and the connections between them as edges, this approach might overlook valuable insights [7]. One have to consider that there are additional elements that should be included within the node set, such as anchorages, sea lanes, and other geographical POI’s. Furthermore, the geographical density of data plays a crucial role in defining nodes. Certain locations experience heavy maritime traffic in contrast to others; the data density is uneven. Therefore, many clustering algorithms, such as the original DBSCAN [8], are not suited for the task of aggregating AIS data into nodes. Another challenge, presented by the issue of uneven data coverage, is dealing with missing data. Since vessels often turn off their AIS, the AIS data typically have gaps and uneven geographical distribution. This missing data further complicates the problem since important milestones risk being omitted from the AIS data. Without comprehensive data coverage, there are difficulties in accurately discerning the actual routes undertaken by vessels, which in turn risk causing a disconnect between reality and the model graph representation. 1.2 Purpose As alluded to, the primary aim of this project is to construct a close-to-reality graph representation of the maritime shipping network structure. With the purpose of allowing transfer of insights from the graph representation to real life scenarios, closeness to reality is a main concern. To achieve a close to reality graph representation, focus lay on constructing both a comprehensive node set and a meaningful edge set. Thus, with the purpose of identifying nodes suitable for inclusion in the node set, maritime points of interest (waypoints) are extracted from AIS data. By applying ML techniques, the density 2 1. Introduction disparities and missing data mentioned in Section 1.1 can be combated. As the aim is to accurately represent maritime shipping in the form of a graph, edges are also a focal point of research. By observing historical vessel journeys, an edge set representative of actual ship journeys is constructed. Finally, this project aims to demonstrate the congruence of the graph representation with real-world maritime traffic. With this objective in mind, path finding algorithms are explored to compare the navigation between nodes to that of the real life navigation of vessels between ports. 1.3 Limitations One limitation in constructing the graph representation pertained to the time frame of AIS data integration. The graph is based on historical AIS data spanning from 2018 to the present day. Consequently, shipping patterns predating 2018 are not taken into account. Another limitation imposed on the project concerns the “observation of actualised success”. In Section 1.1 a key motivator of the project is explained to be the potential for environmental and economic savings. It is not within the scope of this project to see insights gathered from the graph representation actualised in the real world maritime sector. Instead, the scope is limited to simulation on historical data, as opposed to the identification and actualisation of future savings. 1.4 Ethical Considerations and Risks The process of constructing a comprehensive maritime network graph representa- tion featured a multitude of inherent risks and ethical considerations, that each required meticulous attention throughout the duration of the project work. Ethical considerations included, but are not limited to: • Unbiased analytics: In the case of future real-world application of the graph representation for analytical purposes, fair and unbiased analytics will be of vital importance. Therefore, project work has to actively reject discriminatory practices, and consider the potential impact on diverse stakeholders. • Environmental obligations: One could argue it is an ethical obligation to advocate for a sustainable tomorrow. Thus, the project always aims to highlight the environmental implications of optimised maritime traffic; the research has to align with the broader goals of society. Apart from the necessity of the project aligning with the ethical considerations above, there are also risks in need of management to ensure successful completion of the study: • Geographical coverage and data sparsity: The AIS data is uneven as seen to geographical coverage. There is a significant difference between AIS coverage in high density regions, as compared to low density regions. Therefore, the 3 1. Introduction generalisability of the maritime network graph would have been at risk if the unevenness could not be managed. One could even imagine some important trade route to be missing from the graph representation. However, areas with a high degree of AIS coverage also naturally happens to be areas of high importance for maritime shipping. Furthermore, given that AIS data extends back to 2018, it would be debatable to regard a maritime trade route as significant if it has been inactive ever since then. Even though the project did not fail due to this risk, in order to cover all bases of potential failure, the following mitigation is proposed at the start of work: In the case of insufficient global data coverage the project work will pivot to focus on an area more local in scope, but vast in its AIS coverage (e.g. the English Channel or the Mediterranean Sea). • Model alignment with reality: In Section 5.5 the evaluation process is described. The model’s usefulness can be validated through path-finding algorithms, and by comparing proposed routes with the historically travelled ones. Alas, this proved only partial usefulness; for the graph representation to truly consolidate its usefulness, one has to further make sure of its alignment with real maritime shipping structure via alternate avenues. Thus, if extra proof of alignment is required, further action can be taken. To diminish the severity of this ’risk’ the following future mitigation is proposed: by engaging in dialogue with domain experts at our disposal, both in modelling and in maritime shipping, an extra level of certainty can be ensured. 4 2 Theory In this chapter, theoretical concepts of importance to the purpose of this project are presented. Firstly, the concept of the Automatic Identification System is described, coupled with examples of its immense usability. Secondly, Googles S2 Geometry library and its usability for this project is explained. Subsequently, concepts pertain- ing to node set extraction from AIS data follow: Change Point Detection and Kernel Density Estimation. 2.1 Automatic Identification System The Automatic Identification System (AIS) is a vital tool for monitoring vessels at sea and can be used for multiple purposes, as described in Section 3. It utilises VHF radio to broadcast information about the ship, such as its identification, location, speed, and direction. This real-time data is used to track vessels, avoid collisions, control traffic, and enforce maritime regulations. AIS data is also essential for studying traffic patterns, optimising routes, and aiding search and rescue operations [9]. 2.2 Google S2 Geometry In traditional cartography, so-called ’map projections’ are employed and involve mapping functions from the set of points on the planet’s surface to a set of points on a planar map. This is the cause of the infamous distortions in the Mercator projection. This distortion problem can be largely circumvented by using spherical geometry and spherical projections. While the earth is not perfectly spherical, it is more so than flat. A spherical projection, as the name implies, includes mapping points from the surface of the Earth to points on a mathematical sphere. This procedure achieves a distortion of only 0.56% [10]. To approach this problem, Google created the S2 library for spherical geometry. The main idea of the framework is that the unit sphere can be decomposed into a hierarchy of cells, where the Earth is projected onto six top-level ’face cells’ [11], [12]. Each top-level face cell can successively be subdivided into four smaller children cells, where the level of a cell represents the number of subdivisions starting with the face cell (see Figure 2.1). The levels range from 0 to 30, where a leaf cell is the smallest 5 2. Theory cell at level 30, while the largest cell is a face cell at level 0. A cell is called ’S2Cell’ in the library, and each S2Cell is uniquely identified by a 64-bit ’S2CellId’. Figure 2.1: A world map showing two of the six face cells, where the red face cell has been subdivided into four smaller cells several times. The world map is plotted using Cartopy [13]. CAVEAT: This figure is a conceptual visualisation of cell divisions. The cell sizes are incorrect and do not represent the true sizes. 2.3 Change Point Detection Change point detection is the task of finding changes in the underlying model of a signal. In Figure 2.2a the function f(x) = |x| is displayed. This would be the underlying model of the signal. Say we observe this function and sample at a predetermined interval (Figure 2.2b). The task of change point detection is then to observe these discrete samples and determine points where the function f(x) changes behaviour; such points are referred to as change points. If one were looking for change points in f(x), then x = 0 would be a suitable candidate. For it is at this point where the derivative of the function abruptly changes from f ′(x) = −1 to f ′(x) = 1. Change point detection can be performed both offline (as is the case in this report) and online (real-time). The offline case assumes a signal y = {y1, ..., yT} that takes values in Rd space. Furthermore, y is assumed to be piece-wise stationary. This means that some characteristics of the signal change abruptly at some unknown times t1 < t2 < tK , where K can be unknown. The task of change point detection lies in identifying these time instances t1, ..., tK . This proves useful in detecting maritime waypoints. By observing changes in the ’speed’ and ’course’ signals of a vessel, change points can be found. 6 2. Theory (a) Plot of the function f(x) = |x|. (b) Discrete sampling of the function f(x) = |x|. Figure 2.2: Plots of the function f(x) = |x| in both a continuous manner and a discrete manner. 2.3.1 PELT PELT — Pruned Exact Linear Time is an algorithm for detecting multiple change points in a sequence of data. It was proposed as a computationally efficient alternative to the Binary Segmentation and Segment Neighbourhood search methods [14]. The algorithm works as follows: 1. Initialisation: The algorithm starts with no change points and calculates the cost of this model. In this case, cost is defined as the cumulative cost of a segment of data points x0..xi..xn ∈ X, where cost is the negative log likelihood of the data under a Gaussian distribution: cost = − n∑ i=1 log ( 1√ 2πσ exp ( −(xi − µ)2 2σ2 )) This cost is stored in a list of minimum costs, and the change point which led to this cost is stored in a separate list (initially empty). 2. Iteration over data: For each data point from 2 to n (where n is the total number of data points), the algorithm calculates the cost of models with each possible last change point and finds the one with the minimum cost. This cost is then added to the list of minimum costs, and the corresponding change point is added to the list of change points. 3. Pruning: To improve efficiency, the algorithm discards change points that could not possibly minimise the cost in future iterations. This is done by considering the cost and the rate of change of cost for each change point. 4. Termination: The algorithm stops when it has iterated over all data points. The detected change points are those stored in the list of change points. PELT requires a penalty. This is most commonly chosen using a derivation of the Bayesian information criterion (BIC) as seen in Equation 2.1 [15]. 7 2. Theory β = p ∗ log n (2.1) Here p is a calibration parameter and n is the sample size. 2.4 Kernel Density Estimation Kernel Density Estimation (KDE), also called the kernel estimator, is a non- parametric density estimator used for estimating the underlying probability density function (pdf) of a random variable [16]. Due to the non-parametric nature, requiring no assumption on the underlying parametric form, KDE allows more flexibility by automatically learning the density shape from the data. This is one of the things that makes KDE a suitable approach for data drawn from an unknown or complicated distribution. Let (X1, X2, ..., Xn) be independent random samples drawn from an unknown dis- tribution with density function f . The kernel estimator can formally be expressed as f̂(x) = 1 nh n∑ i=1 K ( x−Xi h ) . (2.2) where h > 0 is the bandwidth (also called window width or smoothing parameter), and K is a kernel function that satisfies the condition ∫ ∞ −∞ K(x)dx = 1. (2.3) The basic idea of KDE involves applying the kernel function K on each sample, which returns individual smooth bumps that are then summed together to obtain a density estimator. Figure 2.3 shows an 1D example of KDE construction where the red bumps are the small bumps after applying the kernel function on each sample, and the blue line is the summed, resulting density estimator. Figure 2.3: A conceptual visualisation of an 1D case of how KDE is constructed. The black lines indicate the locations of the five observations, the red bumps are the smoothed observations, and lastly, the blue curve is the resulting density estimate. 8 2. Theory Bandwidth selection is the problem of selecting the smoothing bandwidth h and is a classical topic researched in non-parametric statistics. Figure 2.4 presents another KDE example that uses different values for h on a dataset randomly sampled from a normal distribution. Figure 2.4: KDE with different smoothing bandwidths. The plots are created using seaborn.kdeplot() [17], with samples randomly drawn from a normal distribution using the Python package NumPy [18]. The left subfigure shows undersmoothing of the density estimate, where a lot of (wanted and unwanted) features are captured. Whereas, the right subfigure shows the effect of oversmoothing where all features are lost or obscured. There exists common approaches to bandwidth selection such as Silverman’s Rule and Scott’s Rule to name a few [19], [20]. For more details and discussions about KDE and bandwidth selection, see [16]. 9 2. Theory 10 3 Related Work As a consequence of the high monetary volume present in the maritime shipping sector, previous work has been carried out with the aim of optimising the industry. What is presented below is only a subset of the studies; albeit, a relevant subset. Because of the grand nature of the maritime sector, research has branched in a multitude of directions such as: route planning, shipping network creation, and vessel optimisation. In Yang et al. [21], the authors successfully reconstructed the shipping service network between Europe and Asia while considering the One Belt One Road initiative. The proposal, as laid out by China, involved the improvement of the Budapest-Piraeus railway and the New Eurasian Land Bridge rail services. Considering the initiative, the paper explored the coping strategy for the Chinese liner shipping company. Their findings showed increased profit by 5− 6% using their shipping network, meeting more origin-destination demands, finding ’better’ hub ports and lastly, discovering the changes and effects the initiative would come to bring. Here, they were able to see significant benefit of increased port calls to Piraeus. An immediate increase of cargo volume could be attributed to the increased capacity, and reduced freight rate of the East Europe rail. The increased capacity usage rate in the new shipping network alleviated pressure from the rest of the system. From these findings, the authors were able to gain managerial insights, showcasing the usefulness of shipping network design for controlling transportation costs and improving shipping efficiency. The findings were also useful for gaining understanding of the current shipping network and the effects of potential future changes. A study similar to this project was conducted by Liu et al. [22], in which the authors proposed a framework of global maritime shipping network extraction using machine learning methods, including waypoint and berthing area identification, trajectory segmentation and separation, edge generation, and the construction of directed graphs modelling the global maritime shipping network using AIS data of bulk carriers in 2018. To prove and explore the application of the shipping network, a probability-based route planning method using the A* algorithm was proposed. The estimated shipping networks showed effectiveness for ship routing where generation of navigable routes given any departure and destination was one of the important applications, especially when there is no historical data on recorded paths between them. The paper highlighted some challenges such as the difficulty of recognising unique thresholds for changes in the navigation status for different ships in different 11 3. Related Work scenarios, and ineffective differentiation of waypoint areas among highly dense areas in large waters despite divided global regions and separately determined clustering parameters for each region. Moreover, the probability-based route planning method showed some unreasonable routes with high probabilities in large waters, where the authors suspect it to be due to the simplicity of the way of calculating the probability of node connections. For future work, the authors suggest either focusing on solving the mentioned problems, improving the methodology of the shipping network extraction, or conducting a deeper analysis of the estimated networks by following current research. Filipiak et al. [23] proposed a novel approach for extracting maritime traffic networks from AIS data using the cumulative sum (CUSUM) method for change detection and a genetic algorithm (GA) for waypoint discovery. For edge discovery, the authors annotated the nearest waypoint for each AIS data-point by looking at every trajectory of the vessels passing an area of interest and tracking the visited waypoints based on historical AIS data. Then, by grouping the AIS data by from_waypoint and to_waypoint, a dataset containing edges is constructed. CUSUM was used for finding preliminary waypoints by detecting AIS data-points with a significant change in course or speed. The preliminary waypoints were partitioned using the spatial partitioning algorithm k-d B-trees, where each partition was treated separately by the GA to detect the final waypoints in a distributed manner. The spatial partitioning was a crucial step for using the GA, as the dense areas represented the fittest genes while sparse areas were ignored. The final network was evaluated by comparing four generated routes with the respective routes from a contemporary software within the Baltic Sea area with AIS data spanning over 8 weeks. The AIS data was further filtered, separating the data by passengers, tankers, and cargo ships. The resulting graphs showed a closer resemblance for the tanker and cargo vessels than the passenger vessels where the authors attribute it to data imbalance in the AIS messages caused by some passenger routes operating on shorter and repetitive patterns leading to numerous messages in small regions. Moreover, some irregularities in the results are attributed to the stochastic nature of the GA. Lastly, the authors highlighted that the proposed method could be improved due to its susceptibility to missing or incorrect AIS messages and bad performance in sparse areas. A suggestion was to use AIS trajectory reconstruction methods to alleviate the issue. In a study conducted by Le Tixerant et al. [24], the authors investigated the uses of Automatic Identification System (AIS) data for maritime spatial planning (MSP). The results showed that AIS has many benefits in analysis and further studies where it can be used to describe sea-fishing activities, construct trajectory and position density maps, highlight hierarchical networks of maritime routes, asses risks of use conflicts in marine space by contributing to spatio-temporal interactions between activities for different time spans, and produce more qualitative spatial indicators by crossing AIS with other data. Another benefit is potentially having access to data on a worldwide scale, enabling analysis both at global and local levels. However, due to the nature of AIS data and being easily accessible while having an unclear legal framework, the authors discuss how AIS can infringe on public privacy where it is possible to extract leisure and fishing navigation activities and link it to a person and 12 3. Related Work how private organisations exploit, distribute and market it online raising concerns from shipping companies. 13 3. Related Work 14 4 Data & Analysis For this project, the dataset was provided by the company. The dataset contained AIS data from three different sources, together with a list of ports (see visualisation example of traffic data in Figure 4.1). The three sources were: Gatehouse, AISHub and VesselFinder, where the data ranged from year 2018 to the present day and consisted of a mix of terrestrial and satellite data. The vessel information that is common to all sources can be seen in Table 4.1 where it contains information about a vessel and its status at some time point. Figure 4.1: AIS data aggregated with H3 geographical index and coloured by the mode navigational status of samples in each hexagon. 15 4. Data & Analysis Vessel Information Field name Description TIMESTAMP Date and Time (in UTC) when position was recorded by AIS IMO Unique seven-digit vessel number MMSI Maritime mobile service identity (AIS identifier) NAME Name of the vessel CALLSIGN Callsign of the vessel LATITUDE Geographical latitude (WGS84) LONGITUDE Geographical longitude (WGS84) COURSE Course over ground (degrees) SPEED Speed over ground (knots) HEADING Heading of the vessel’s hull (degrees) NAVSTAT Navigation status according to [25] AISTYPE Type of the vessel according to [26] A Dimension (meters) from AIS GPS antenna to the Bow of the vessel B Dimension (meters) from AIS GPS antenna to the Stern of the vessel (vessel length = A+B) C Dimension (meters) from AIS GPS antenna to the Port of the vessel D Dimension (meters) from AIS GPS antenna to the Starboard of the vessel (vessel width = C+D) DRAUGHT Current draught of the vessel (meters) DESTINATION Port of destination ETA Estimated time of arrival Table 4.1: Available information about the vessel information. Some fields can be unavailable and is thus set as null. The descriptions were taken from [27], [28]. Tables 4.2, 4.3, 4.4, 4.5 show a single manufactured AIS data sample from the Gatehouse data, consisting of 21 additional data fields. The data sample was manufactured in the sense that it does not reflect any actual data transmitted by a vessel, but the general picture painted by the actual data content is structurally equivalent. This data consists more of port-to-port information such as port of loading (pol), port of destination (pod), and estimated time of arrival (ETA). added_datetime position_datetime id start_datetime end_datetime vessel_imo_number 2022-12-05 20:11:09.721494 UTC 2022-12-05 19:58:05.000000 UTC 999 2022-10-19 15:15:10.887625 UTC null 9684689 Table 4.2: Example of AIS data sample structure, showing data fields 1-6. input_pol_un_location_code pol_un_location_code pol_name pol_country pol_arrival_datetime CNYSN CNYSN Yangshan China 2022-07-20 00:00:00.000000 UTC Table 4.3: Example of AIS data sample structure, showing data fields 7-11. input_pod_un_location_code pod_un_location_code pod_name pod_country pod_arrival_datetime LKCMB LKCMB Port of Colombo Sri Lanka 2022-09-11 23:59:59.999999 UTC Table 4.4: Example of AIS data sample structure, showing data fields 12-16. sta source_eta eta add_intermediate matched_schedule 2022-09-11T18:00:00 2022-12-26 18:00:00.000000 UTC 2022-09-11T18:00:00 TRUE null Table 4.5: Example of AIS data sample structure, showing data fields 17-21. 16 4. Data & Analysis Figure 4.2: Global ports (red) from the port list excluding ports of type NaN. The port data contains information about global ports such as location code, location name, latitude, and longitude (Figure 4.2, Table 4.6). The combination of the AIS data and the list of ports formed the foundation for constructing the graph representation. Port Information Field name Description un_location_code UN/LOCODE, the United Nations Code for Trade and Transport Locations location_name Name of port location location_country Country of port location longitude Geographical longitude latitude Geographical latitude location_timezone Timezone of port location location_subdivision Name of subdivision location of the port location_port_type Type of port e.g. harbour, seaport, river port location_port_size Size of port (VS, S, M, L, VL) Table 4.6: Available information about the port data. Some fields can be unavailable and is thus set as null. 4.1 Data analysis Data exploration and visualization were performed in order to gain understanding and insight of the data. This aided in discovering variable correlation or relationship that would be useful for the following steps in creation of the node and edge sets. Data visualization helps identify data imbalances, insufficient or missing data among the variables that could pose a future problem or to also help interpret the results. The analysis and visualization tasks consisted of the following: 17 4. Data & Analysis 1. Data parsing, where relevant data such as vessel IMO, timestamp, coordinates, and port visits were extracted from CSV files. 2. Data exploration of each dataset from the providers and the port list. 3. Distribution map of ship points in different speed and course ranges to identify the quantity of stationary and moving states. 4. A histogram of the position time frequency for a vessel and journey. 4.1.1 AIS data sources The AIS data from the three data sources consists of data from different time intervals and contains different amount of points, as can be seen in Table 4.7. Both the Gatehouse and AISHub data contains global data (Figure 4.3c, 4.3a), but the AISHub data is limited to land based stations. Moreover, the VesselFinder data covers only the Asia-Europe regions and contains both terrestrial and satellite data (Figure 4.3b). Name Start End #rows Resolution Gatehouse 2022-09-21 2022-12-31 1 973 916 Hourly VesselFinder 2017-06-01 2018-06-30 1 002 733 Hourly AISHub 2023-02-22 Ongoing Many Hourly Table 4.7: Information about the data from the three AIS data sources. (a) AISHub (b) VesselFinder (c) Gatehouse Figure 4.3: Distribution plot of each dataset from the three AIS data sources. Foremost, because of the computation time and data imbalance, it was decided to only use the AIS data from Gatehouse as it contains global data compared to the 18 4. Data & Analysis other two sources that only covers specific areas. Moreover, as the Gatehouse data also contains additional vessel information involving port-to-port information, it helped in the extraction of the edge set. By knowing in which port a vessel starts and ends up in, it makes it possible to extract the journey, which was later used in the calculation of the transition matrix. More details in Section 5.3. 4.1.2 Ports The list of ports contains ports of port types: NaN (missing port type), canal, ferry, harbour, marina, seaport, river port, port terminal, waterway region, deep-water seaport, offshore terminal, pier, and jetty or wharf. The ports of different types are shown in Figure 4.4 where the majority of the visible ports have no port type (red points). Figure 4.4: Categorical scatter plot of ports showing different port types. The red ports represent those with no port type (NaN). Figure 4.5: Ports with missing port type. Figure 4.6: Ports excluding missing port types. Further observations on the ports of missing port type shows that they are mostly located inland or at minuscule islands. An overview of the ports containing and 19 4. Data & Analysis excluding ports with no port type can be seen in Figure 4.5, 4.6. Since the project focuses on the maritime transport, the inland ports are not of much interest. Moreover, since the port list excluding NaN port types contains most of the busiest container ports and covers the world, it was decided to exclude the NaN ports in the next steps as the final port list was deemed adequate for the project. 4.1.3 Speed and course over ground The speed of a vessel relative to the Earth’s surface is called Speed over ground (sog) and is usually measured by GPS. Four distribution plots of different sog intervals can be seen in Figure 4.7. A speed of 0 knots would imply a vessel not moving, which usually are when they are in a port or anchored. The most common speed globally seemed to be around 15–20 knots, while some vessels travel at up to 25 knots. A lower speed of around 5–10 knots are most common in dense areas such as Europe, East Asia and Central America, while a speed over 25 knots occurred in some few individual points in the world. (a) sog = 0 (b) sog = [5, 10] (c) sog = [15, 20] (d) sog = [20, 25] Figure 4.7: Distribution plots showing 4 different sog intervals. The actual course of where a vessel is moving towards relative to the Earth’s surface is called Course over ground (cog) and is also usually measured using GPS. Four distribution plots showing different cog intervals ranging from 0 to 360 degrees can be seen in Figure 4.8. Various complete routes between different regions can be identified in the plots, such as vessels travelling between America and Europe or Asia. Moreover, the course indicates which regions the vessel is travelling from and to, e.g., for vessels with cog = [225, 270] it is more probable that the vessels are travelling from America to China than the other way. 20 4. Data & Analysis (a) cog = [0, 0] (b) cog = [0, 45] (c) cog = [225, 270] (d) cog = [315, 360] Figure 4.8: Distribution plots showing 4 different cog intervals. 4.1.4 Position time frequency Even though the AIS data resolution is hourly, it does not necessarily imply that the number of data for each vessel or journey is equal. The consequences of unequal amount of data points per journey leads to some vessel journeys being over- and underrepresented, which can later affect the construction of the edge set. Figure 4.9: Count plot of the mean update frequencies for the journeys. A journey was represented by grouping the dataset by a vessel’s IMO, pol, pol and pod arrival time. This ensures that each group contains a journey for some vessel 21 4. Data & Analysis starting at some port. By also grouping by the pol and pod arrival time, it ensures that the data points in the group are from the same journey as their start and end time are matching. Afterwards, the mean update frequency for a journey/group can be calculated, where the total occurrences are shown in Figure 4.9. The figure indicates data imbalance for some journeys as a steep hill can be seen between mean frequency of 1–5 hours. In the way the edge set is constructed (Subsection 5.3) using a transition matrix, the journeys that have longer update time will have longer edges between the waypoint nodes, while the journeys with shorter update time will have many short edges between many nodes. This can pose a problem for the final graph construction, as longer edges would imply the possibility of travelling longer and illogical distances, such as a vessel starting in the United Kingdom and suddenly arrived at Yemen without passing any waypoints through the Suez Canal. 22 5 Methods This chapter presents the methods used during this project to construct and evaluate the graph representation. The process employed consisted of three phases: Con- struction of the node set from historical AIS data (Section 5.2), construction of the edge set from historical AIS data (Section 5.3), and lastly, evaluation of the graph representation (Section 5.5). 5.1 Graph Representation The purpose of this project was explained as ’the creation and evaluation of a comprehensive & global maritime shipping network graph representation’. For the sake of clarity in the methodologies below, a maritime shipping network graph representation is defined and explained as follows: • A graph representation in the context of this project refers to an instance of the graph data structure. More precisely, the base of a graph representation is a graph G = (V, E), where G is the graph consisting of nodes (vertices) V and edges E. This graph G is formed in accordance with some modelling rules in order to model some object O. To complete the graph representation of O, additional information I is embedded into the graph. Thus, the graph is extended to G = (V, E, I), where I is of type dictionary and maps object-specific properties of O for each node. • The maritime shipping network, in the context of this project, refers to the shipping by sea undertaken by cargo ships and other maritime shipping vessels. These vessels transport cargo between the world’s seaports; often passing by intermediary goals such as canals, anchorages, or geographical points of interests. • A maritime shipping network graph representation is a graph representation of the maritime shipping network. A model of the maritime shipping network is created by translating it into the structure of a graph G. Nodes and edges are created in a manner suitable to the maritime shipping structure. • In the case of this project, the graph representation will model maritime shipping at a global scale. Nodes and edges are formed to preserve the historical accuracy of the AIS data, upon with the graph is built. Object specific information is embedded to convey position, and character of each node. Every edge is 23 5. Methods weighted. The procedure and exact details of graph creation is explained in Sections 5.2–5.4. 5.2 Creation of Node Set The data analysis seen in Chapter 4 served as a precursor to the node set creation. All methodologies seen in this section were derived from insights gathered during analysis of the AIS data. As such, by applying the methodologies described below, a node set representing the actual maritime shipping structure and its complexities could be formed. Naval ports exhibit many node-like qualities. Thus, port are strong candidates for inclusion in the node set of the graph representation; the most obvious node-like behaviour being their role as origin and destination in the transport of cargo. Every cargo ship is routed between pairs of ports, and thus they are the focal points of maritime traffic. In addition, ports are naturally connected to each other via the cargo vessels. Connectivity between nodes is a crucial component in most graphs. On a high abstraction level, creation of port nodes include: 1. Obtaining a port list coupled with the respective UN location code (UN/LOCODE) for each port. 2. Gathering location coordinates for each port. 3. Embedding the information into nodes. However, ports do not reveal the entire picture of maritime shipping. Two routes, both with origin O and destination D, can vary immensely when compared to each other. When comparing maritime journeys there are two other factors to consider apart from origin and destination: the geographical route undertaken and the time required to complete the journey. In the pursuit of capturing more journey context, the node set was extended past solely ports. Ship berthing areas will typically be located in the near vicinity of ports, although, one might also include other points of interests in the node set. To identify waypoints, common anchorages and other points of interest, the AIS data was utilised. 5.2.1 Change Point Detection As previously mentioned, the complete dataset is far too vast to consider all data points when creating a node set. In addition, many data points do not contain valuable information for the cause of node set creation. The node set should contain both port nodes and waypoint nodes. To deduce meaningful waypoints, change point detection was applied. Change point detection can be applied to all dimensions of the AIS data, however, interest lied mainly in the sog and cog data. A change in speed might indicate entrance into an area of interest, for example a canal, anchorage, or similar. Similarly, a change in course might indicate another area of interest; a vessel that drastically redirects its course is more in line with the expected behaviour at a node than at 24 5. Methods an edge. Therefore, the change point set would come to include both speed change points and course change points. To detect change points, the python package ruptures was used [29]. In Table 5.1 one can regard the configuration employed. KernelCPD is the C implementation of both the dynamic programming approach and the PELT approach for change point detection, where it was shown to be much faster in both cases [30]. At first BIC was used for calculating the penalty but showed unsatisfying results. Hence, it was opted to use the elbow method for choosing the penalty value instead. Parameter Configuration Cost function rbf Search method KernelCPD Penalty Elbow method min_size 1 Table 5.1: Set-up of change point detection. The change point detection was performed twice on the AIS dataset, first considering the speed of vessels as the signal, and second considering the course of vessels as the signal. Before applying the change point detection on the variables, the data was first sorted by their time of position and missing sog and cog values were removed. The algorithm was then applied on individual vessels grouped by their IMO to detect changes on a journey. The resulting change points were then compiled into the change point set. 5.2.2 Data Splitting Using Kernel Density Estimation Having identified the change point set, further work needs to be done where the geographical density of the points needs to be considered. For example, consider sparse areas where the observations are scattered across large water and dense areas where the observations are densely packed. For sparse areas, a cluster containing more than 3 points could be rare, while clusters containing more than 10 points could be normal in dense areas. Sparse −1 < est_pdf ≤ max_low Mid max_low < est_pdf ≤ min_high Dense min_high < est_pdf ≤ max{est_pdf} Table 5.2: For an estimated pdf value of a data point est_pdf, the data point is categorized as one of the three areas: sparse, mid, and dense, according to the respective thresholding in the table. The variables max_low and min_high are manually chosen and represents the maximum pdf value for the sparse area and the minimum pdf for the dense area respectively. Different constraints would thus be needed for respective areas because of their dissimilarities in density. By identifying different density areas and splitting the data 25 5. Methods by density, the data becomes more homogeneous and thus, facilitates in waypoint extraction for the different areas. For the kernel density estimation of the change points, scipy.stats.gaussian_kde from the SciPy package was used [31]. After obtaining the plot of the density estimation and density levels, the dataset was then split into three different areas: dense, mid, and sparse, by thresholding using the manually chosen thresholds: max_low, the maximum pdf value for sparse area and min_high, the minimum pdf value for dense area as seen in Table 5.2. The estimated pdf value, est_pdf, was evaluated for each data point and were then appended to one of the three areas according to the aforementioned thresholding. 5.2.3 Waypoint Extraction Using S2 Geometry The Python package s2sphere was used for identifying the waypoints [32]. The s2sphere package is an implementation of a part of the C++ S2 Geometry Library mentioned in Section 2.2. In this section, the geographical coverage of the change point set needed to be considered. Here, there could be single points spread across wide waters i.e. sparse areas, and multiple packed points by a port i.e. dense areas. Algorithm 1 Create Waypoints df := AIS data level := minimum cell size min_level := maximum cell size threshold := total number of points in a cell at level tree← root node df_temp← copy of df for current_level = level...min_level,−1 do cell_counts← group and count the rows by ’S2CellId’ waypoints← cell_counts ≥ threshold if waypoints is not empty then df_waypoints← get rows from df_temp that has S2CellId in waypoints add waypoints to tree remove waypoints from df_temp end if cell_counts← get parent cell ids for all cell ids in cell_counts end for waypoints← collect all leaves in tree return list of waypoints To identify the waypoints, a waypoint was defined to be a cell with a threshold representing the maximum point count in the cell. The main idea was to start from the smallest cell level, and check if some cells contains n amount of points. Add the cells fulfilling the constraint in the waypoint set and repeat the process for the rest of 26 5. Methods the set at level − 1 until no more cells are left, or it has reached min_level. At the end of the process, it resulted in a list of S2CellIds that represents the waypoints. The first step in this process was to construct a list containing S2CellId of each point in the change point set at the smallest cell size, level. This was done by taking a point’s longitude and latitude coordinates and converting it into a S2CellId using the constructor. In this step, the list could contain duplicates as the points could exist in the same cells at the same level. After obtaining the S2CellIds representing the change points, the waypoints were identified as described in Algorithm 1. The waypoints are stored using a tree data structure to avoid duplicated waypoints. However, since ports are always a waypoint, an exception was needed to handle multiple ports in the same cell, i.e. they have the same S2CellId. To solve this issue, it was decided that a cell can contain multiple ports as the route up until the port of destination (pod) will be the same and that the difference between going to pod A versus pod B is negligible. Moreover, to enable evaluation of the network from port-to-port in the later stage, the port data was embedded in the respective waypoints. The main idea for the tree structure was that for each non-existing waypoint in the tree: (1) a new branch is added to the root node, where the waypoint is the leaf node and all preceding nodes are the successive parent cells. Therefore, whenever a waypoint is extracted, it only requires checking if the parent cell/node already exists at the level subsequent to the root node. (2) If the parent cell already exists, then the algorithm ignores the new waypoint. (3) If the new waypoint is a port, then a new neighbouring node leaf is added by the current waypoint. (4) However, if the port already exists in the tree, then the new port data is embedded in the already existing tree leaf. Figure 5.1: An example showing a newly extracted waypoint being added to the waypoint tree depending on 4 different cases. A visual example of the tree construction is shown in Figure 5.1; an example showing a newly extracted waypoint being added to the waypoint tree depending on 4 different 27 5. Methods cases. Each case is depicted with a S2Cell split into 2 levels together with its respective waypoint tree. Let the new waypoint cell be B, the cell marked in blue. Furthermore, let A, the pink cell, be an already existing waypoint cell in the tree. The 4 cases are then: (1) When there exists no parent cell subsequent to the root node that contains B. (2) There exists a parent cell at level 1 that contains B. (3) B is a port while A is not. (4) Both A and B are ports. This subsection is an iterative process consisting of analysing the extracted waypoints and changing the argument values in Table 5.3. As mentioned in Subsection 5.2.2 the different density areas requires different constraints. The current method for deciding the values for the constraints was done by analysing the waypoints in different areas and adjust accordingly until satisfying waypoints were obtained. Variable Description level minimum cell size (higher level) min_level maximum cell size (smaller level) threshold total number of points in a cell at level Table 5.3: Arguments for Algorithm 1. 5.3 Creation of Edge Set This section outlines the process of creating an edge set suitable for representing connections between nodes in a maritime setting. The precursor to this section is the successful creation of a node-set according to Section 5.2. By considering the intended use-case of the graph representation (as per Section 5.4) the optimal structure of the graph can be inferred; the edges have to be directed in order to preserve information and be able to accurately model vessel journeys between nodes. Furthermore, in the name of performance and since the evaluation method involves route planning, it would be nonsensical to consider a complete graph. Thus, edges were created in mirror of actual historical ship journeys — an edge models an actual plausible route undertaken by a ship between two nodes. 5.3.1 Transition Matrix In graph theory, a transition matrix is typically a 2D n× n matrix, where n is the number of nodes in the graph. Each row in the matrix contains fractions representing the probability of transition from a node to every other node. As a result, each row has a sum of 1. In the maritime case, considering the AIS data of vessel patterns (see Chapter 4), it is clear some origin-destination pairs are more likely than others. For example, if a vessel enters the Suez Canal from the Mediterranean Sea it is infinitely more likely that the vessel will exit the Suez Canal at Port Suez than that the vessel’s next stop will be the port Shenzhen, China. While this might seem trivial, it was of essence that the transition matrix captured such real world relations. To ensure 28 5. Methods that the transition matrix accurately models real world transitions between points of interest (nodes) all AIS data from Gatehouse was considered in the following transition matrix population algorithm (Algorithm 2): Algorithm 2 Populate Transition Matrix 1: AIS := AIS data 2: N := Number of nodes in node set 3: Initialise matrix TM[N][N] 4: Initialise matrix Times[N][N] 5: 6: Remove data rows that are not in any waypoint cells 7: sorted_ais ← sort by imo and datetime 8: 9: freq ← datetime frequency to group by 10: grouped_ais ← sorted_ais grouped by imo, pol_un_location_code, datetime, freq 11: for group in grouped_ais do 12: current_cell, prev_cell ← None 13: current_time, prev_time ← None 14: for row in group do 15: if row[cell] == current then 16: continue 17: end if 18: prev_cell ← current_cell 19: current_cell ← row[cell] 20: prev_time ← current_time 21: current_time ← row[datetime] 22: if prev_cell is not None then 23: diff ← (current_time - prev_time).total_seconds() 24: Times[prev_cell][current_cell] += diff 25: TM[prev_cell][current_cell] += 1 26: end if 27: end for 28: end for 29: 30: Normalise TM & Times by dividing each row element by TM’s corresponding total row sum The general operation of Algorithm 2 can be summarised as follows: the outermost loop iterates over groups of sorted AIS data. The data was sorted by time and IMO, port of loading and destination, and time period; group Gi = gi0, gi1, ..., gin then models a journey of the vessel i in chronological order. The logic of the algorithm was then to keep track of the previous cell visited and noticing when the vessel enters a new cell. Every time such a connection was discovered, it was added to the transition matrix TM . Then, every entry was normalised to ensure all entries in the 29 5. Methods transition matrix are probabilities; using division by the total accesses to each row, normalisation was achieved. TM then served as the basis upon which the graph was constructed. As described in Subsection 4.1.4, the update frequency for a journey had to be considered and was done by manually choosing a time period/frequency using the pandas groupby instruction, pd.Grouper(position_datetime, freq) in the groupby [33], [34]. This splits the journeys by a time interval, i.e. the chosen frequency. Let some journey take one month and the chosen frequency is one day. Then the journey will be split into groups with a days’ interval. This ensured that a vessel can not travel from the start point to the end point in one step. Also, one can note that a Times matrix was updated every iteration alongside the transition matrix TM . Times was also normalised in the same manner as TM . This mean estimation of travel time served as a precursor to the custom heuristic function later employed in A* evaluation (Section 5.5.2). 5.4 Graph Construction This step entails the combination of the node set and the edge set to obtain a shipping network graph representation. Firstly, nodes were added to an empty graph data structure G. Nodes were in turn embedded with location data, signalling the position of each node. In addition, each node contains a port property, which is a boolean property indicating whether the node is a port. Then, in order to suitably connect nodes to one another, the transition matrix (TM) was used; for every non-zero element in TM a directed edge is formed from the node corresponding to the row index to the node corresponding to the column index. Each edge was supplied with two properties: weight and probability. The probability from TM directly translates to the probability property. The weight property, on the other hand, required an additional step. To ensure correct function, the A* algorithm requires that a weight is associated with each edge [35]. Furthermore, a lower weight should imply a lesser inclination to journey across the particular edge. Therefore, each edge was assigned a weight equal to the negative logarithm of the probability (Equation 5.1). weight = − log(probability) (5.1) Using more concrete notation, the weight assignment can be formulated as seen below (Equation 5.2): weights[i][j] = − log(TM [i][j]) (5.2) The above steps can be comprised and summarised in Algorithm 3, where addNode(node) is a function of the graph data structure that allows for the addition of nodes into the node set. Likewise, addEdge(start, end, probability, weight) is another utility 30 5. Methods function of the graph. The function addEdge(start, end, probability, weight) permits the addition of edges into the edge set. Algorithm 3 Graph Construction Algorithm G := Empty Graph nodeDict := Dictionary with nodes as keys and coordinates as values portDict := Dictionary with nodes as keys and ports as values TM := Transition Matrix for node in nodeDicts do G.addNode(node) G[node].coords = nodeDict[node] if node in portDict.keys() then G[node].port = portDict[node] end if end for for i in TM.rows do for j in TM.cols do if TM[i][j] > 0 then G.addEdge(i,j, probability=TM[i][j], weight=(-log(TM[i][j]))) end if end for end for 5.5 Graph Evaluation The second phase of the project, following the construction of a graph representation, concerned evaluation. As per Section 1.2 the purpose of the project lie in creating a close-to-reality graph representation of the maritime shipping structure. For this project, two methods of evaluating closeness to real life were chosen. Firstly, by studying the properties of the graph representation it was possible to deduce characteristics that in turn could be utilised for comparison; The graph representation could be compared to its real life counterpart through the use of graph metrics. Secondly, to ensure operational closeness to reality on top of the structural closeness, the graph representation was subjected to a set of test cases. The test cases were designed to test how well the graph representation lends itself to route planning. Real life major origin-destination pairs were tested on the graph representation and evaluated for closeness to their real life counterparts. 5.5.1 Graph Metrics Three closely correlated metrics, often utilised in the pursuit of explaining and understanding graphs, are Degree Centrality, Betweenness Centrality, and Closeness Centrality [36]. All three measurements try to discern the most important nodes 31 5. Methods in a graph: Degree Centrality measure the number of edges incident to each node, normalised by the maximum possible degree in the graph. It is often the case that nodes with a high degree centrality represent vital hubs in the graph structure. In Equation 5.3 the formula is presented as it was implemented to measure degree centrality in the graph representation. CD(j) denotes the degree centrality of node j, and TM [i][j] denotes the transition matrix entry corresponding to the connection between nodes i and j. sgn is a sign function returning 1 if x > 0 and 0 otherwise. Finally, the summation was normalised with respect to the maximum possible degree in the graph. CD(j) = ∑n j=i sgn(TM [i][j]) n− 1 (5.3) Betweenness Centrality, on the other hand, will quantify the extent to which nodes lie on the shortest path between other pairs of nodes. As such, the metric is useful for detecting bottlenecks in the graph structure. The implemented version of betweenness centrality is found below in Equation 5.4, where CB(v) denotes betweenness centrality. In addition, σij(v) denotes the number of shortest paths from node i to j passing through node v, and σij(v) denotes the total number of shortest paths from node i to j, with which the measure is normalised. CB(v) = ∑ i ̸=j ̸=v σij(v) σij (5.4) Finally, Closeness Centrality measure the closeness to other nodes for nodes in the graph. As a result, one can detect nodes with high potential for quick interaction with other nodes. The implementation of closeness centrality was precisely derived from Equation 5.5. In Equation 5.5, d(u, v) denotes the distance between nodes u and v, and n − 1 denotes the total number of reachable nodes from node v. CC(v) = n− 1∑n−1 u=1 d(u, v) (5.5) These above metrics were examined in the network graph representation formed from the procedures in Section 5.4. Apart from the three metrics above, density, average clustering coefficient, assortativity, and reciprocity were measured. Subsequently, these metrics were plotted together with the centrality measures in order to visualise the closeness to reality of the network graph representation. 5.5.2 A* Algorithm A* (’A-star’) is a path-finding algorithm in the family of weighted graph-traversal algorithms. The algorithm was employed to find the shortest path between two nodes in a graph, considering both the distance travelled and the estimated distance to 32 5. Methods the goal node. Like Dijkstra’s algorithm, A* operates by maintaining a lowest-cost path tree from the starting node to the goal node [35]. However, unlike Dijkstra’s algorithm, A* utilises a heuristic function; A* expands potential paths using the following formula (equation 5.6): f(n) = g(n) + h(n) (5.6) in which f(n) denotes the total estimated cost of reaching the goal through node n, g(n) denotes the cost of reaching node n from the starting node, and h(n) is the heuristic function that estimates the cost of reaching the goal node from node n. One should note that in the case of h(n) = 0, A* is computationally equivalent to Dijkstra’s algorithm. Algorithm 4 A* function astar(start, goal) openSet← {start} ▷ Set of nodes to be evaluated cameFrom← empty map ▷ Map to store the previous node in optimal path gScore[start]← 0 ▷ Cost from start along best known path fScore[start]← heuristic(start, goal) ▷ Estimated total cost from start to goal while openSet is not empty do current← node in openSet with the lowest fScore ▷ Get the node with lowest fScore if current = goal then return reconstructPath(cameFrom, current) end if Remove current from openSet for all neighbor of current do tentativeGScore← gScore[current] + distance(current, neighbor) if tentativeGScore < gScore[neighbor] then cameFrom[neighbor]← current gScore[neighbor]← tentativeGScore fScore[neighbor]← gScore[neighbor] + heuristic(neighbor, goal) if neighbor not in openSet then Add neighbor to openSet end if end if end for end while return failure ▷ No path exists end function In Algorithm 4 the implementation of the A* algorithm is presented where it makes a couple of simplifications and assumptions. Firstly, heuristic(node, goal) is assumed to be an appropriate heuristic function. As stated, this could be the 33 5. Methods Euclidean distance, or even the Manhattan distance if the application domain calls for it. Moreover, working implementations of distance(node1, node2) and reconstructPath(cameFrom, current) are assumed. For the sake of robustness, it was necessary to be able to verify expected behaviour of the graph representation. A*, as presented in Algorithm 4, was one such tool for evaluation. For implementation the Python package NetworkX and its method astar_path(G, source, target, heuristic=None, weight=’weight’) was employed [37]. Furthermore, to be able to effectively path find in the maritime setting, a custom heuristic function had to be supplied. Euclidean distance is a common heuristic function when handling distances on a 2-dimensional plane (Equation 5.7). However, in real life, shipping vessels traverse the sea on a globe. Additionally, the positions of nodes in the graph are supplied in coordinate format (see Section 5.4). Haversine distance exists to fill this gap and provide the tools for distance calculation on a globe. More specifically, the Haversine formula is used to find the distance d in-between two points lying on the surface of a sphere (globe) utilising their latitudes (ϕ1, ϕ2) and longitudes (λ1, λ2) (Equation 5.8). dist(x1, y1, y2, y2) = √ (x2 − x1)2 + (y2 − y1)2 (5.7) d = 2r · arcsin  √√√√sin2 ( ϕ2 − ϕ1 2 ) + cos(ϕ1) · cos(ϕ2) · sin2 ( λ2 − λ1 2 ) (5.8) Equation 5.8 consists of the following elements: • d: the distance between the two points. • r: the radius of the sphere (the earth has a radius of approximately 6, 371 km). • ϕ1 and ϕ2: the latitudes of the two points in radians. • λ1 and λ2: the longitudes of the two points in radians. Even though Haversine distance more closely models the distance between two nodes in the graph, context is left out. Firstly, the Earth is not a perfect sphere. The fact is that the Earth’s shape more closely resembles that of an oblate spheroid. Vincenty’s formulae [38], remedies this fallacy of the Haversine formula. Still, innate properties of distance based heuristic functions render them complicated for global maritime applications. When traversing the sea, the shortest distance route might not always be possible. These impossibilities are often found where waters are separated by a landmass. In such a case, the shortest distance would be to traverse the landmass in a straight line between the waters. However, in reality, the vessel has to avoid the landmass and navigate around it. 34 5. Methods Travel Time Heuristic To combat the above problems, a time based heuristic function was used. By considering the actual time required to travel between geographical POI’s as the heuristic cost, one can circumvent the issues related to a distance based heuristic function. The key component in constructing such a heuristic function lies in study of the historical travel patterns of vessels. The first component of the time based heuristic function is seen in Algorithm 2. In tandem with calculating the transition matrix entries, the mean travel time between nodes is calculated and stored in the matrix Times. The matrix is of size nxn, where n is the number of nodes in the node-set. Each element in the Times matrix is calculated as per Equation 5.9. current_time and prev_time are defined as in Algorithm 2, i and j refers to node i and node j in the node-set; i, j ∈ nodeset and i ̸= j . ∆time = current_time− prev_time T imes[i][j]← Times[i][j] + ∆time (5.9) Following the calculation of each matrix element in Times, all elements were nor- malised by dividing by the total number of accesses to the matrix (full procedure seen in Algorithm 2). The resulting Times matrix was then stored in memory, upon which the heuristic function could be constructed (Algorithm 5). Algorithm 5 and its operation can be summarised as follows: first the expected travel time between the two input nodes is read from Times, a penalty value is then returned if the time read equals 0, otherwise the time read is returned. Algorithm 5 Travel Time Heuristic function Time_Heuristic(node1, node2) Times := Matrix of size #nodes×#nodes containing estimated travel times time = Times[node1][node2] if time == 0 then time = 1e9 end if return time end function 5.5.3 A* - Test Cases To verify the expected behaviour using A* one need to intelligently design suitable test cases. Thus, for a test case to be suitable, it has to test conformity of the graph representation to reality. That is, for any pair of ports, the A* algorithm should be able to suggest a feasible route. Ideally, every origin-destination port pair in existence would be scrutinised. Although, such extensive testing is not feasible. Instead, intelligent design has to be employed in order to form a suitable subset of origin-destination pairs; a subset of pairs covering all areas of interest during testing. 35 5. Methods Research has uncovered an imbalance in the importance within the set of all maritime shipping routes. While there is an infinite number of possible routes, as alluded to above, only a subset are truly viable routes for global scale maritime shipping [39]. The central axis of the maritime structure connects North America, Europe, and Pacific Asia through the Suez Canal, the Strait of Malacca, and the Panama Canal. Transatlantic and transpacific shipping involves a large spread of ports. As a consequence, routes are not as set in stone as in the central axis structure. Trans-Indian ocean traffic serves predominantly as intermediary traffic from Pacific Asia to Europe, which implies a series of clearly defined routes between the Strait of Malacca and Bab el-Mandeb. Deviations from these standard routes are only viable in the case of coastal and local shipping. Figure 5.2 displays theses standard routes. A core route is defined as a route catering towards the major commercial shipping demands, and a secondary route is defined as a route connecting minor markets. Figure 5.2: Main maritime routes divided up into core and secondary routes [40]. Red circles denote primary choke points, whereas yellow circles denote secondary choke points. Furthermore, Figure 5.2 also visualises so-called choke points. Choke points are geographical locations that serve as motivators for the standard shipping routes; they are strategic locations through which vessels are routed to optimise the shipping structure. Thus, main routes are formed by routing via choke points. For a choke point to be classified as primary, it has to meet a certain criterion: There should be a limited set of cost-effective alternative routes, not routed via the choke point. The importance of well functioning major choke points is clearly illustrated by the Suez Canal blockage in 2021. Estimations reveal that by freezing the Suez Canal for six days, a loss of $10 billion each day is recorded [41]. Table 5.4 lists all primary choke points. 36 5. Methods Choke point Route 1: Bab el-Mandeb Red Sea 2: Bosphorus Black Sea 3: Cape of Good Hope Atlantic — Indian Ocean 4: Gibraltar Atlantic — Mediterranean Sea 5: Suez Canal Mediterranean Sea — Red Sea 6: Strait of Hormuz Persian Gulf 7: Strait of Malacca Indian Ocean — South China Sea 8: Panama Canal Atlantic-Pacific 9: Oresund Baltic Sea — Nordic Sea Table 5.4: Primary choke points and the naval route they occupy. In order to subject the graph representation to meaningful test-cases, this knowledge of the maritime shipping structure was leveraged. The test set includes origin- destination pairs where the historic journey would have followed a standard naval route. Furthermore, origin-destination pairs were selected to test the willingness of routing via choke points. A well engineered graph representation will naturally lend itself to routing via choke points, as is the case in the real world scenario. In Table 5.5 the chosen subset of test-cases can be regarded. The first two columns display port codes; codes used to identify naval ports. Thus, each row is a test case, where the choke points tested are displayed in the third column. Port 1 Port 2 Test VNSGN (Ho Chi Minh City) NLRTM (Port of Rotterdam) 1,4,5,7 USLGB (Long Beach) CNTXG (Port Of Tianjin) - CNTXG (Port Of Tianjin) DEHAM (Port of Hamburg) 1,4,5,7 HKHKG (Port of Hong Kong) IRMRX (Port of Mahshahr) 6,7 USNYC (Port of New York) KRPUS (Port of Busan) 8 SGSIN (Port of Singapore) NLRTM (Port of Rotterdam) 1,4,5,7 BEANR (Port of Antwerp) SAJED (Port of Jeddah) 4,5 RUPKC (Port of Petropavlovsk) NLRTM (Port of Rotterdam) 1,4,5,7,8 CNLYG (Port of Lianyungang) RUNVS (Port of Novorossiysk) 1,2,5,7 CNTXG (Port of Tianjin) SENYN (Port of Nynäshamn) 1,4,5,7,9 SENYN (Port of Nynäshamn) USHNL (Port of Honolulu) 8,9 IRMRX (Port of Mahshahr) RUNVS (Port of Novorossiysk) 2,6 SAJED (Port of Jeddah) CLVAP (Port of Valparaiso) 4,5,8 ZADUR (Port Of Durban) ZACPT (Cape Town) 3 Table 5.5: Test cases used for evaluating the graph representation using the A* algorithm. Routes are tested both ways. Choke points are represented by their row number in Table 5.4. Tests were conducted using the configuration of A* described above in Section 5.5. The resulting paths were saved and visualized. 37 5. Methods 38 6 Results This chapter presents the results obtained following the methodologies presented in the previous chapter (Chapter 5). Firstly, the results of evaluating the graph representation via A* are relayed. Thereafter, graph metrics pertaining to structural properties are presented. Results covering the inner workings of the graph represen- tation follow last; detected change points, change point splitting results, waypoints extracted, and the transition matrix. 6.1 Graph Evaluation Using A* The graph representation was subjected to evaluation through A* on the test cases stated in Section 5.5.3. The graph representation proved successful path-finding on all the test cases, without failing a single test case. In Figure 6.1a the A* path from the port of Singapore (SGSIN) to the port of Rotterdam (NLRTM) is visualised. The proposed path crosses multiple primary choke points: the Strait of Malacca, Bab el-Mandeb, the Suez Canal, and Gibraltar. A secondary choke point is also crossed in the form of the English canal. No expected choke points were omitted in the path. (a) SGSIN → NLRTM. (b) CNTXG → SENYN. Figure 6.1: A* paths: Port of Singapore to Port of Rotterdam, Port of Tianjin to Port of Nynäshamn. Ports visited on the path are marked by a polygon. Non-port nodes visited are marked by a circle. In Figure 6.1b the A* path between the port of Tianjin (CNTXG) and the port of Nynäshamn (SENYN) is visualised. The port of Tianjin is the largest seaport in southern China, with a large volume of vessels journeying from here on export endeavours. The port of Nynäshamn is a smaller seaport, located in Sweden. The discovered A* path connects the two, omitting no expected choke point; The path is 39 6. Results routed via the secondary choke point in the China sea, through the Strait of Malacca, Bab el-Mandeb, the Suez Canal, Gibraltar, the secondary choke point the English canal, and finally Oresund. Note that the passage through the secondary choke point in the China sea is only implied, as the route visits a coastal port in China. (a) HKHKG → IRMRX. (b) RUNVS → CNLYG. Figure 6.2: A* paths: Port of Hong Kong to Port of Mahshahr, Port of Novorossiysk to Port of Lianyungang. Ports visited on the path are marked by a polygon. Non-port nodes visited are marked by a circle. In Figure 6.2a the A* path from the port of Hong Kong (HKHKG) to the Iranian port of Mahshahr (IRMRX) is visualised. The discovered best path by A* resemble the core route in Figure 5.2. Also, the expected choke points of Strait of Hormuz and Strait of Malacca are visited. In Figure 6.2b the suggested A* path from the Russian port of Novorossiysk to the Chinese port of Lianyungang is visualised. In this case, all expected choke points are traversed: Bosphorus, the Suez Canal, Bab el-Mandeb, the strait of Malacca. Furthermore, two secondary choke points are found on the suggested path. (a) KRPUS → USNYC. (b) ZACPT → ZADUR. Figure 6.3: A* paths: Port of Busan to Port of New York, Cape Town to Port of Durban. Ports visited on the path are marked by a polygon. Non-port nodes visited are marked by a circle. In Figure 6.3a the A* path from the port of Busan (KRPUS) to the port of New York (USNYC) is visualised. The A* path closely mimics the core route in Figure 5.2. Furthermore, three secondary choke points are passed through together with traversal through the primary choke point, the Panama Canal. Lastly, in Figure 6.3b the A* path between two South African ports is visualised: Cape Town to Durban. The choke point Cape of Good Hope is included in the path. 40 6. Results Table 6.1 shows the estimated travel time for each test case; based on mean historic travel time as per Section 5.5.2. Port 1 Port 2 Time 1 → 2 (hours) Time 2 → 1 (hours) VNSGN NLRTM 103.17 74.55 USLGB CNTXG 117.68 98.75 CNTXG DEHAM 114.46 145.46 HKHKG IRMRX 65.23 49.95 USNYC KRPUS 203.44 174.48 SGSIN NLRTM 91.97 92.52 BEANR SAJED 64.82 63.97 RUPKC NLRTM 142.73 131.45 CNLYG RUNVS 76.16 155.40 CNTXG SENYN 137.94 134.32 SENYN USHNL 132.66 204.84 IRMRX RUNVS 99.56 96.75 SAJED CLVAP 96.42 128.83 ZADUR ZACPT 10.77 27.40 Table 6.1: Estimated travel times (in hours) for the test cases. 6.2 Graph Metrics The graph was constructed in accordance with Subsection 5.4. The resulting graph contained a total of 1748 nodes. Graph edges totalled in at 33 462 after construction via the transition matrix. The resulting graph is visualised in Figure 6.4. A red dot indicates the existence of a node at the specified location. Figure 6.4: A plot of all nodes in the graph representation of the maritime structure. Edges are omitted for ease of viewing. Figure 6.5: A plot of all nodes and edges in the graph representation of the mar- itime structure. Edges are directed in the actual graph; here, direction is omitted. As stated above, there were a total of 7288 edges. Almost all of these are visible in Figure 6.5. Edges where nodes have a difference in longitude greater than 350 are omitted in the plot. These are edges that cross the International Date Line (180th meridian), and would have spanned the entire plot. In Section 6.6 it is mentioned 41 6. Results that the transition matrix has a sparsity ratio of 0.00101. In graph theory, density is a measurement of how dense connections are within a graph. To calculate density, the below equation is utilised (Equation 6.1): Density = #Edges #PossibleEdges (6.1) Calculating the density of the graph representation yields 0.00239, which is an increase when compared to the sparsity ratio of the transition matrix; a natural consequence of multiple waypoints having the same coordinates. Another metric is Average Clustering Coefficient (ACC ). Ranging from 0 to 1 the ACC tells the average tendency of nodes to cluster. In the case of the graph representation, an ACC of 0.333 could be recorded. This ACC indicates that the nodes tend to form local clusters to a degree. In other words, nodes can form clusters, but connections outside immediate neighbours also occur. Metric Value Density 0.00239 ACC 0.333 Assortativity 0.480 Reciprocity 0.718 Degree Centrality (Mean) 0.005 Betweenness Centrality (Mean) 0.004 Closeness Centrality (Mean) 0.032 Table 6.2: Metrics concerning the graph representation. In Table 6.2 the average degree-, betweenness-, and closeness centrality as per Section 5.5.1, are displayed together with the ACC, density, assortativity, and reciprocity. As can be noted, the assortativity coefficient of the graph representation was calculated as 0.480 and the reciprocity coefficient was calculated as 0.718. In Figure 6.6, the top 50 nodes with respect to degree centrality, betweenness centrality, and closeness centrality, are visualized. 42 6. Results (a) Degree Centrality. (b) Betweenness Centrality. (c) Closeness Centrality. Figure 6.6: Scatter plots of the 50 most important nodes with respect to degree, betweenness, and closeness centrality. Since the ACC alluded to the tendency for some clustering, community detection was performed in the pursuit of uncovering any communities/modules in the graph. The uncovered communities can be regarded in Figure 6.7a, coupled with a bar plot over class frequency seen in Figure 6.7b. (a) The results of community detection in the graph representation. Each colour represents belonging to a different class. (b) The frequency of which community classes occur. A few classes dominate. Figure 6.7: Visuals of the community detection on the graph representation. 43 6. Results 6.3 Change Point Detection The first step of construction of the node set for the graph representation is to apply change point detection as presented in Subsection 5.2.1. The task of change point detection aimed to single out a subset of the total dataset; only a select few data points display significant changes in speed or course behaviour. Figure 6.8 shows the final data points identified as change points. Figure 6.8: Results of change point detection on AIS dataset. The red dots represent identified change points, and the black markings portray all maritime traffic extracted from the AIS data. Compared to the original AIS dataset, which consisted of 1 973 916 data points, the change point set was reduced to a total of 130 470 data points. Thus, dimensionality could be reduced whilst keeping all important data necessary for waypoint identifica- tion. Of the total change points detected, 61 715 were solely course change points, 49 679 were speed change points, and 19 076 were both course and speed change points (Table 6.3). Data #points AIS data 1 973 916 Change Points 130 470 Course Change Points 61 715 Speed Change Points 49 679 Course & Speed Change Points 19 076 Table 6.3: Number of change points compared to total number of AIS data points. Row 3-4 is the number of points, where a point is a change point based on either the course or speed value, while the last row is based on both values. Figure 6.9 shows the resulting plots of the elbow method that was used to decide 44 6. Results the penalty value for cog and sog. The penalties range from 0 to 10 with a step of 0.5. From the figures, it seems like the elbow of the curve is at value 2.5 for sog and 3 for cog. But by testing other values around those, a penalty value of 2.3 for sog and 2 for cog showed the best results. (a) Chosen penalty value for sog: 2.3 (b) Chosen penalty value for cog: 2 Figure 6.9: Elbow method for choosing the penalty parameter for change point detection. The Figure 6.10, 6.11 displays the signals (sog, cog) from four different vessels. The start of each segment of alternating colour represents at which points the algorithm detects the change points. In the top figure in of the sog example, the algorithm performs well where the ’blocks of signals’ are properly segmented. However, in the bottom figure with less occurring ’blocks’ the algorithm performs inadequately where it misses to segment the two rightmost blocks and instead segments a seemingly flat part. Figure 6.10: Change point detection on sog (blue lines). The detected change points are the start of each segmentation block. In the change point detection on cog, the top figure shows again good performance of the algorithm where it was able to segment tightly and noisy signals. However, the 45 6. Results bottom figure shows an example of drastically varying signals where the algorithm performed worse. Figure 6.11: Change point detection on cog (blue lines). The detected change points are the start of each segmentation block. 6.4 Data Splitting of Change Points After running the KDE on the change point set, the results are shown in Figure 6.12. The yellow to green spots shows where the density of points is the highest and can, thus, be classified as dense areas. Around the dense areas, some faint light blue spots can be seen where it becomes fainter across large waters. These areas can further be split into mid dense and sparse areas. Figure 6.12: Heat map of the result from KDE of change points. The change points are also plotted as scatter points in the figure. From observing the density plot above, the change points were split as described in Subsection 5.2.2, into three datasets by the following areas with the respective 46 6. Results thresholds seen in Table 6.4. The resulting split of change points can be seen in the Figure 6.13 where the points belonging to the sparse areas were coloured green, the mid-points were coloured pink, and the dense points were coloured blue. Observing further, it can be seen that the dense areas covers the majority of the world’s vital waterways for global trade: the English Channel, Malacca strait, Hormuz strait, Suez Canal, and the Panama Canal. Min Max Sparse 0 0.09 Mid 0.09 0.28 Dense 0.28 ∼ 1.62 Table 6.4: The min and max threshold pdf values for a point to be categorised as one of the following regions: sparse, mid, and dense. Figure 6.13: Combined map plot of change points filtered by KDE. Sparse (green), mid (purple), and dense (blue). 6.5 Extraction of Waypoints The waypoints were extracted using the methods described in Subsection 5.2.3. For each data set, the maximum and minimum cell levels and threshold for point count were set as in Table 6.5. The resulting waypoints after combining the waypoints from each split of change point set can be seen in Figure 6.14. A total of 2686 waypoints, which is approximately, ∼ 2% were extracted from the change point set where the cell levels ranged from 4 to 8. The larger cells/polygons (blue) are positioned in sparse and mid dense areas, while the smallest cells (green) are positioned in the denser areas. 47 6. Results level min_level threshold Sparse 7 4 8 Mid 8 5 6 Dense 8 6 30 Table 6.5: Final chosen values for the variables in Table 5.3 used for Algorithm 1. Figure 6.14: Map plot showing waypoints as polygons of different levels ranging from level 4-8. (a) The Malacca Strait. (b) The Panama Ca