Studying Imperfect Communication In Distributed Optimization Algorithm

Master's thesis in Computer Science and Engineering

Swati Math, Madhumitha Venkatesan

Department of Computer Science and Engineering
CHALMERS UNIVERSITY OF TECHNOLOGY
UNIVERSITY OF GOTHENBURG
Gothenburg, Sweden 2024

© Swati Math, 2024.
© Madhumitha Venkatesan, 2024.

Supervisor: Ashkan Panahi, DS&AI
Examiner: Ashkan Panahi, DS&AI

Master's Thesis 2024
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000
Gothenburg, Sweden 2024

Abstract

Distributed optimization methods are essential in machine learning, especially when data is distributed across multiple nodes or devices. These algorithms enable effective model training without data consolidation, improving privacy and reducing communication costs. However, their performance is strongly influenced by the quality of communication, which may degrade due to factors such as quantization and erasure. Quantization, which approximates values during transmission, can result in loss of information and requires strategic optimization to balance distortion against communication cost. Similarly, erasure causes loss of transmitted information, leading to delayed convergence and increased energy usage. This study explores how communication imperfections affect the performance of distributed optimization algorithms, emphasizing convergence rates, scalability, and overall efficiency. The research examines how quantization and erasure impact different distributed architectures, such as Federated Learning and push-pull gradient methods, under different network topologies and suggests ways to reduce their effects.

Keywords: Distributed optimization, machine learning, quantization, erasure, convergence, communication overhead, scalability, Federated Learning, push-pull gradient methods, distributed systems.

Acknowledgements

We wish to express our gratitude to Prof. Ashkan Panahi, who supervised us at the Department of Computer Science and Engineering at Chalmers University. His steadfast support, wise advice, and invaluable expertise have been crucial to the accomplishment of our project. We greatly appreciate his support and his generous sharing of his extensive knowledge of distributed optimization techniques, which improved our comprehension and enabled us to navigate the complexities of this difficult subject. We could not have completed this work without his commitment and guidance.

Swati Math, Gothenburg
Madhumitha Venkatesan, Gothenburg
2024-11-08

Contents

List of Figures
1 Introduction
1.1 Background
1.2 Previous Work on FEDAVG and PUSHPULL algorithms
1.3 Goals
2 Theory
2.1 Distributed Optimization
2.2 Overview of Algorithms
2.2.1 FEDAVG
2.2.1.1 Algorithm: FedAvg
2.2.2 PUSH-PULL
2.2.2.1 Algorithm: Push-Pull
2.2.3 Evaluation Metrics
2.3 Quantization
2.3.1 Trade-offs in Quantization
2.3.1.1 Reduced Model Size and Communication Time
2.3.1.2 Communication Efficiency and Computational Overhead
2.3.1.3 Optimal Balance in Quantization
2.3.2 Quantization in FedAvg and Push-Pull
2.4 Erasure
2.4.1 Effects of Erasure
2.4.1.1 Impact on Convergence Time
2.4.1.2 Increased Communication Overhead
2.4.2 Strategies for Mitigation
2.4.3 Erasure in FedAvg and Push-Pull
3 Methods and Results
3.1 Methods
3.1.1 Implementation
3.1.1.1 Architecture of High Performance Computing Cluster
3.2 Results
3.2.1 Analysis of Quantization Impact on FedAvg
3.2.1.1 Accuracy
3.2.1.2 Average Communication Time
3.2.1.3 Execution Time
3.2.1.4 Average Convergence Time and Convergence Loop
3.2.2 Analysis of Quantization Impact on Push-Pull
3.2.2.1 Accuracy
3.2.2.2 Average Communication Time
3.2.2.3 Average Convergence Time and Convergence Loop
3.2.2.4 Execution Time
3.2.3 Analysis of Erasure Impact on FedAvg
3.2.4 Analysis of Erasure Impact on Push-Pull
3.2.5 Communication Topologies
3.2.5.1 Topology in FedAvg
3.2.5.2 Topology in Push-Pull
3.2.6 Analysis of Scalability in FEDAVG
3.2.7 Analysis of Scalability in Push-Pull
4 Conclusion and Discussion
4.1 Discussion
4.2 Conclusion
Bibliography
A Appendix 1

List of Figures

1.1 Communication graph for a distributed optimization problem
2.1 Centralized Federated Averaging
2.2 Decentralized Federated Averaging [31]
2.3 Algorithm: FedAvg [33]
2.4 Illustration of the Push-Pull algorithm's working mechanism.
The left side of the figure shows node 1 actively pulling information from nodes 2, 3, and 4, while the right side shows node 1 pushing information to nodes 2, 3, and 4. This visualizes how node 1 acts as a central hub, pulling information in one phase and pushing information in the other phase, consistent with the push-pull dynamics [34].
2.5 Push-pull block diagram
2.6 Algorithm: Push-Pull [34]
3.1 Impact of Quantization Levels on Accuracy, Communication Time, Convergence Time, and Convergence Rate for MNIST and CIFAR-10 Datasets. The leftmost subplot compares the accuracy percentages of the datasets across different quantization levels. The second subplot shows the communication time in seconds, highlighting the efficiency of data transfer. The third subplot presents the convergence time in seconds, reflecting the overall time required for the models to converge during training. The rightmost subplot shows the convergence rate, measured in FedAvg loops, indicating the number of communication rounds required for the model to converge to a stable state.
3.2 Impact of Quantization Levels on Accuracy, Communication Time, and Convergence Time for MNIST and CIFAR-10 Datasets. The left subplot compares the accuracy percentages of the datasets across different quantization levels, indicating the effect on model performance. The middle subplot shows the communication time in seconds, highlighting the efficiency of data transfer. The right subplot presents the convergence time in seconds, reflecting the overall time required for the models to converge during training.
3.3 Impact of Erasure on Accuracy, Communication Time, Convergence Time and Loop for MNIST and CIFAR-10.
3.4 The figure illustrates the performance differences between the baseline and erasure methods in a peer-to-peer quantization strategy for both MNIST and CIFAR-10 datasets, particularly in managing missing packets or data loss.
3.5 Figure presents a comparative analysis of three different algorithm topologies (Star, Fully Connected (FC), and Circular) across two datasets, CIFAR-10 and MNIST.
3.6 Comparison of Different Graph Topologies on Accuracy, Communication Time, and Convergence Time for MNIST and CIFAR-10 Datasets. The left subplot compares the accuracy percentages of the datasets across different graph topologies (star, fully connected, circular, and random), indicating the effect on model performance. The middle subplot shows the communication time in seconds, highlighting the efficiency of data transfer. The right subplot presents the convergence time in seconds, reflecting the overall time required for the models to converge during training.
3.7 Scalability analysis of MNIST and CIFAR-10 datasets showing the impact of increasing clients on accuracy, convergence time, communication time, and convergence loops.
3.8 Scalability analysis of MNIST and CIFAR-10 datasets showing the impact of increasing tasks on accuracy, convergence time, communication time, and convergence loops.
The plots compare the performance metrics across 8, 16, and 24 tasks for both datasets, highlighting the trade-offs between accuracy and computational resources.

1 Introduction

Distributed optimization algorithms are designed to solve optimization problems in which the computation is spread across multiple nodes, clients, or devices. These algorithms are widely employed in large-scale systems such as machine learning, where decentralized processing is critical for handling huge amounts of data and model parameters. By distributing the workload across several agents, such algorithms provide scalable and efficient solutions without the need for a centralized system. The workload is distributed across the nodes or clients, allowing for parallel processing, which increases computational efficiency and reduces training time. This parallelism enables simultaneous computation on different sections of the data, resulting in faster convergence and shorter overall processing time. Beyond machine learning, distributed optimization also plays a crucial role in other large-scale systems, such as networked systems, where it optimizes resource allocation, routing, and load balancing, ensuring efficient network performance. Distributed optimization is particularly important in sensor networks for tasks such as data gathering, target tracking, and environmental monitoring, as it allows sensors to collectively process data and make decisions [1].

In traditional centralized systems, a single central server has access to the entire dataset and is responsible for performing all computations. While this approach works well for smaller datasets, it quickly becomes inefficient for large-scale systems due to scalability issues and communication overhead. Centralized methods also create a single point of failure, meaning that if the central server crashes or becomes overloaded, the entire system can be compromised. These limitations highlight the importance of decentralized approaches in large-scale systems.

In a decentralized system, no single node has access to the entire dataset; instead, each node operates on a local data subset. To ensure a global solution, nodes must communicate with one another; however, developing effective communication protocols is critical, as excessive communication can cause bottlenecks and slow convergence.

To minimize the communication overhead, quantization is used, which is the process of reducing the precision of the data being communicated between nodes or clients. While this technique can optimize bandwidth (reducing communication overhead) and speed up training, it may introduce small errors, called quantization errors, which affect model performance. In addition to quantization errors, erasure, or message loss, may occur. This means that information such as model updates or gradients is lost during transmission between nodes. This data loss can occur for various reasons, such as network issues, corrupted data packets, or dropped connections. It can slow down the convergence process or lead to inaccurate updates in the model, as missing messages prevent nodes from fully synchronizing their local states.

The convergence rate in distributed optimization is influenced not only by the number of nodes or clients involved and the desired level of accuracy, but also by the structure and nature of the network over which the nodes communicate.
This includes factors such as whether links are directed or undirected, and static or time-varying. State-of-the-art algorithms and their analyses are tailored to these different scenarios, highlighting the crucial role of network topology [2]. It has also been highlighted that denser networks tend to accelerate convergence due to faster information propagation, but at the cost of increased communication overhead, whereas sparser networks reduce communication costs but may require more iterations to reach consensus.

In our research, the primary focus is on the effects of quantization, erasure, and network topology on communication efficiency and convergence in distributed optimization. By understanding how these factors interact, we aim to provide insights into how quantization errors and message loss affect the performance of distributed optimization algorithms.

Figure 1.1: Communication graph for a distributed optimization problem

1.1 Background

Over the past few years, distributed optimization and distributed machine learning (DML) have become increasingly important, particularly due to the rise of large-scale data processing and the development of highly distributed computing infrastructures. Distributed approaches to optimization aim to break complex problems into smaller, more manageable subproblems that can be solved collaboratively by multiple agents or nodes in a network. These methods are crucial for applications where data is inherently distributed (e.g., edge computing, federated learning, and cloud-based systems) or where data cannot be centralized due to privacy or resource constraints. This section provides an overview of centralized and decentralized methods in distributed optimization, highlighting key techniques such as gossip-based and dual-based methods.

Centralized optimization methods rely on a central controller that collects information from all nodes, solves a global optimization problem, and disseminates the solution back to the nodes. Methods like Stochastic Gradient Descent (SGD) and Gradient Boosting Machines exemplify this centralized approach, requiring a central server to aggregate gradients or updates after each iteration. While these methods can achieve high accuracy and convergence, they suffer from several drawbacks in distributed settings. Specifically, centralized methods are sensitive to single points of failure, are limited by the capacity of the central server, and can be inefficient due to the need for massive data transfer. To address these limitations, recent advances have focused on distributed optimization techniques. Downpour SGD was introduced as a variant of asynchronous stochastic gradient descent designed to handle large datasets by using multiple replicas of a single DistBelief model [3]. This method, along with Sandblaster L-BFGS, leverages distributed systems to mitigate the drawbacks of centralized methods. By distributing the computational load across multiple nodes and reducing dependence on a single central server, these distributed approaches significantly improve efficiency and scalability.

Federated optimization refers to the optimization problem implicit in federated learning, drawing a connection (and contrast) to distributed optimization.
Federated optimization has several significant characteristics that distinguish it from a conven- tional distributed optimization problem: (1)Non-IID: The training data on a given client is typically based on the usage of the mobile device by a particular user, and hence any particular users local dataset will not be representative of the population distribution. (2)Unbalanced: Similarly, some users will make much heavier use of the service or app than others, leading to varying amounts of local training data. (3)Massively distributed: The number of clients participating in an optimization to be much larger than the average number of examples per client. (4)Limited commu- nication: Mobile devices are frequently offline or on slow or expensive connections[4]. Decentralized optimization techniques distribute the computation workload across nodes, each solving a local problem and sharing information with its immediate neighbors. This approach removes the dependency on a central node, enhancing robustness and scalability. However, this methods faces challenges such as slower convergence due to limited information sharing and the need for consensus across nodes. Decentralized methods are often seen in environments such as wireless sen- sor networks, edge computing. where data privacy and communication overhead are critical concerns. In recent years, Federated Learning (FL) has emerged as a groundbreaking approach in distributed machine learning, particularly suited to the challenges posed by sensi- 3 tive and voluminous data generated by mobile and edge devices[5], [6]. FL enables model training directly on devices, preserving data privacy and reducing the need for central data aggregation[7]. A pivotal algorithm in Federated Learning is Federated Averaging (FedAvg), which combines local stochastic gradient descent with server- side model averaging. FedAvg is notably effective in handling non-IID (Independent and Identically Distributed) data and reducing communication overhead compared to traditional methods[7], [8]. Consensus methods are another class of decentralized approaches where nodes itera- tively communicate with their neighbors to reach agreement on the global solution. Methods such as Distributed Consensus-ADMM and Gradient Descent (DGD) be- long to this category, ensuring that all nodes converge to a common solution over time. Nedic et al.[9] considered several algorithms that use different types of con- sensus models, namely weighted-averaging and push-sum models[10]. Alternating direction method of multipliers (ADMM)[11] is a simple but powerful algorithm that is well suited to distributed convex optimization, and in particular to problems arising in applied statistics and machine learning. It takes the form of a decomposition-coordination procedure, in which the solutions to small local sub- problems are coordinated to find a solution to a large global problem. ADMM can be viewed as an attempt to blend the benefits of dual decomposition and augmented Lagrangian methods for constrained optimization, two earlier approaches. It turns out to be equivalent or closely related to many other algorithms as well, such as Douglas-Rachford splitting from numerical analysis, Spingarns method of partial inverses, Dykstras alternating projections method, Bregman iterative algorithms for problems in signal processing, proximal methods, and many others. The fact that it has been re-invented in different fields over the decades underscores the intuitive appeal of the approach. 
Gossip-based methods also known as gossip algorithm are motivated by applications to sensor, peer-to-peer, and ad hoc networks for exchanging information and com- puting in an arbitrarily connected network of nodes. These constraints motivate the design of simple decentralized algorithms for computation where each node ex- changes information with only a few of its immediate neighbors in a time instance (or, a round). The goal in this setting is to design algorithms so that the desired computation and communication are done as quickly and efficiently as possible. The averaging time of a gossip algorithm depends on the second largest eigenvalue of a doubly stochastic matrix characterizing the algorithm. Designing the fastest gossip algorithm corresponds to minimizing this eigenvalue, which is a semidefinite pro- gram (SDP). There two well-known cases for which the performance and scaling of gossip matrices are studied: Wireless Sensor Networks, which are modeled as Ge- ometric Random Graphs, and the Internet graph under the so-called Preferential Connectivity (PC) model[12]. A key focus of recent research is improving distributed optimization methods. The paper titled "Push-Pull Gradient Methods for Distributed Optimization in Networks" 4 by Shi Pu, Wei Shi, Jinming Xu, and Angelia Nedi introduces a novel gradient- based algorithm for distributed (consensus-based) optimization in directed graphs. This new approach differs from previous push-sum protocols[13], [14] by using a row stochastic matrix to mix decision variables and a column stochastic matrix to track average gradients. The paper[15] also explores a random-gossip variant of the push-pull method, known as the G-Push-Pull algorithm. In this variant, agents com- municate randomly with one or two of their neighbors during each iteration. The authors show that both the Push-Pull and G-Push-Pull methods converge linearly to the optimal solution for strongly convex and smooth objective functions. On- going research aims to refine these methods and enhance communication-efficient distributed learning techniques. 1.2 Previous Work on FEDAVG and PUSHPULL algorithms The field of decentralized and distributed optimization has seen substantial progress, particularly in consensus-based methods over both undirected and directed graphs. Early research efforts focused on static undirected networks and leveraged algorithms like the Alternating Direction Method of Multipliers (ADMM), achieving linear con- vergence through the use of doubly stochastic matrices for consensus [16][17]. How- ever, the construction of doubly stochastic matrices can be impractical for directed graphs, leading to the development of push-sum algorithms that facilitate consensus in directed networks[18], [19]. These methods were later extended to accommodate time-varying graphs and smooth, strongly convex functions [15], [20], [21], although they often required careful step size selection, which posed stability challenges that have been addressed in more recent work[13]. The Federated Averaging (FedAvg) approach, developed by [4], [7],combines local stochastic gradient descent with model averaging at the server to handle federated learning difficulties such as non-IID data and communication restrictions. This im- proves on past distributed optimisation efforts, such as those by [22] and [23], which focused on cluster-based settings but did not address the particular difficulties of federated settings. 
FedAvg is contrasted to classic approaches such as synchronous SGD[8] and asynchronous SGD[3], demonstrating its effectiveness in lowering com- munication overhead. It also contains privacy considerations, relying on secure mul- tiparty computing and differential privacy[24] to alleviate vulnerabilities associated with on-device data. In particular, the push-pull gradient methods introduced in these works provide a significant foundation for decentralized optimization, where row-stochastic matrices are used for decision variable updates and column-stochastic matrices for gradient tracking. These methods unify different network architectures, including peer-to- peer, master-slave, and leader-follower systems, allowing for flexibility in distributed optimization environments. Notably, these algorithms have also been extended to handle asynchronous communication through random-gossip variants, making them 5 robust to network imperfections such as delays and stochastic updates[25][26]. 1.3 Goals This study evaluates of Federated Averaging (FEDAVG) and push-pull optimiza- tion algorithms within the context of distributed machine learning, particularly in high-performance computing (HPC) environments using MPI (Message Passing In- terface) communication protocols. The study will examine how these algorithms perform under various communication constraints and predefined network topolo- gies, such as fully connected, circular, star-shaped, and random networks, involving interconnected nodes within an HPC cluster. The primary objective is to explore the impact of communication imperfections on important algorithm metrics, specifically the Quantization, used to reduce commu- nication overhead by approximating continuous values with fewer bits, can intro- duce errors (quantization errors)and erasure flaws that affect the precision of model updates, potentially slowing convergence and reducing accuracy. The study will specifically focus on FEDAVG and push-pull optimization algorithms, intentionally excluding other decentralized machine learning techniques from the scope of analysis. Experiments are conducted within an HPC cluster environment, utilizing MPI for communication between nodes. This setup allows for the study of algorithm perfor- mance under realistic conditions, reflecting the challenges encountered in practical distributed machine learning scenarios. The investigation will be limited to predeter- mined network topologies and specific communication techniques, with the purpose of understanding and documenting the effects of communication defects on algo- rithm performance under different scenarios. This delimitation ensures that the research remains focused on its core objectives, providing detailed insights into the specific challenges and performance characteris- tics of FEDAVG and push-pull algorithms in the context of quantization, erasure effects, and network topology variations within an HPC framework. This dissertation methodically investigates federated learning algorithms, starting with implementing two core techniques: using FedAvg in a star topology and a push-pull mechanism in a peer-to-peer network. These applications act as standard models, setting a starting point for additional examination. The study initially presents quantization methods to FedAvg and push-pull algorithms, carefully ana- lyzing their impact on communication costs, model performance, and convergence. 
Quantization aims to improve communication efficiency by decreasing data precision, and this research assesses the compromises associated with this strategy. After this, the thesis explores how resilient these algorithms are when messages are lost. Understanding the impact of message loss on federated learning's robustness and reliability is essential for gaining insights into the fault tolerance of the algorithms.

The study then moves on to examine how various communication topologies impact the effectiveness of federated learning by investigating the role of network topology. By testing different topological designs, the research identifies configurations that optimize both training speed and accuracy, revealing insights into how network topology influences the results of federated learning.

In the last stage, the thesis evaluates how well the baseline algorithms FedAvg and push-pull can scale with varying network sizes, using the MNIST and CIFAR-10 datasets. This analysis of scalability plays a crucial role in assessing the algorithms' suitability for widespread implementation in practical situations. Throughout the thesis, there is a constant focus on the relationship between communication efficiency, accuracy, and convergence, offering a thorough insight into the performance variations of federated learning under various circumstances.

2 Theory

2.1 Distributed Optimization

Distributed optimization is an important technique used to solve large-scale machine learning problems where data is distributed across multiple nodes/clients that communicate over a network. Each agent i in the network aims to minimize its local objective function f_i(x), while sharing information with the other agents to ensure convergence to the optimal solution of the global objective, which is usually the sum of the local objectives of all nodes. Communication between nodes is often limited by bandwidth or latency constraints, necessitating the optimization of both computation and communication.

Each node i has its own local data and aims to minimize a local objective function f_i(x). The main goal of the distributed system is to find the value of x that minimizes the sum of all the local objective functions from every node, represented as:

\min_{x} \sum_{i=1}^{N} f_i(x)

Where:
• N is the number of nodes (or clients).
• f_i(x) is the local objective function at node i, which depends on local data.
• Each node/client computes its gradient ∇f_i(x) locally, and the clients communicate with each other or with a central server to update x in a coordinated manner.
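To make this setup concrete, the following minimal sketch (illustrative only; the synthetic data, dimensions, and the helper names local_objective and local_gradient are not taken from the thesis implementation) builds local least-squares objectives on four nodes and evaluates the global objective as the sum of the local ones.

# Illustrative sketch (not the thesis code): each node i holds a local
# least-squares objective f_i(x) = 0.5 * ||A_i x - b_i||^2 built from its own data,
# and the network's goal is to minimize the sum of these local objectives.
import numpy as np

rng = np.random.default_rng(0)
N, d = 4, 3                                   # number of nodes, dimension of x
local_data = [(rng.normal(size=(10, d)), rng.normal(size=10)) for _ in range(N)]

def local_objective(i, x):
    A, b = local_data[i]
    return 0.5 * np.sum((A @ x - b) ** 2)     # f_i(x), depends only on node i's data

def local_gradient(i, x):
    A, b = local_data[i]
    return A.T @ (A @ x - b)                  # grad f_i(x), computed locally at node i

x = np.zeros(d)
global_objective = sum(local_objective(i, x) for i in range(N))   # sum_i f_i(x)
print(global_objective, local_gradient(0, x))

The point of the sketch is simply that no single node can evaluate the global objective on its own; it only has access to its own f_i and ∇f_i, which is why communication over the graph described next is required.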
The nodes/clients communicate over a communication graph. The communication graph G represents how nodes communicate with each other, and the structure of this graph has a considerable impact on the performance of the distributed optimization process. A well-connected graph can speed up information transmission and convergence; conversely, a poorly connected graph can slow down the process and lead to poor performance [27]. The communication graph G = (V, E) plays an important role in determining how fast and efficiently consensus can be achieved, where V represents the nodes and E the communication links between them. Each node in the graph represents an agent, and each edge represents a communication link between two agents. The graph can be directed or undirected.

In symmetric communication [28], data flow between nodes is bidirectional, which means that node A can transmit information to node B and node B can send information back to node A. This sort of communication is widespread in balanced and evenly connected networks, resulting in easier analysis and faster consensus. Asymmetric communication [28] occurs when information transmission is one-way, i.e., node A can send data to node B but node B cannot send data back. Asymmetric communication can occur due to limitations in network resources or hierarchical systems. While it complicates analysis and may hinder consensus, it can model real situations such as master-slave or client-server relationships. Symmetric communication simplifies convergence analysis and frequently leads to faster consensus across nodes, whereas asymmetric communication, while more complex, can represent practical systems with limited data flow.

The communication graph is central to the distributed consensus algorithm, which plays an important role in distributed and multi-agent systems. Its goal is to ensure that all nodes (or agents) in a system agree on the same data value, even if some nodes fail or deliver inaccurate data [29]. A consensus algorithm ensures that each node i updates its local decision variable x_i by incorporating information from its neighboring nodes j. Nodes in a network use the communication graph to repeatedly update their local decision variables with information from neighboring nodes, thus achieving both local convergence and global consensus [29]. The consensus update rule can be represented as:

x_i^{t+1} = x_i^t - \eta \sum_{j \in N_i} W_{ij} (x_i^t - x_j^t)

where:
• η is the learning rate,
• W_{ij} represents the communication weight between node i and node j,
• N_i is the set of neighbors of node i.
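As a minimal illustration of this update rule (assuming an undirected four-node graph with uniform neighbor weights, which is only an example and not a configuration prescribed by the thesis), the following sketch applies the rule repeatedly until the nodes agree.

# Minimal sketch of the consensus update above, assuming an undirected graph given
# by an adjacency matrix and uniform weights W_ij on its edges (illustrative only).
import numpy as np

adj = np.array([[0, 1, 1, 0],                 # communication graph G: 4 nodes, edges as 1s
                [1, 0, 1, 0],
                [1, 1, 0, 1],
                [0, 0, 1, 0]])
W = adj / adj.sum(axis=1, keepdims=True)      # simple neighbor weights W_ij

eta = 0.5
x = np.random.default_rng(1).normal(size=4)   # each node holds a scalar decision variable

for t in range(100):
    # x_i <- x_i - eta * sum_j W_ij (x_i - x_j), applied simultaneously at all nodes
    x = x - eta * (W.sum(axis=1) * x - W @ x)

print(x)   # after enough iterations the x_i values agree (consensus)

With a connected graph, the disagreement between the x_i shrinks at every step, which is the behavior the later algorithms build on when they add gradient information to the consensus step.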
2.2 Overview of Algorithms

2.2.1 FEDAVG

Federated Averaging (FEDAVG) is a widely used algorithm in Federated Learning, where multiple devices (clients) collaborate to train a single, powerful model without sharing their private data. Each client trains a copy of the model using its local data and, after completing the local training, does not send this data to the server; instead, it sends only the updated model parameters (gradients or weights) to a central server. The server averages these parameters from all devices (this averaging step is the core of the FedAvg algorithm) and broadcasts the averaged parameters back to the clients. Each device receives the averaged parameters and updates its local model copy accordingly. This process of local training, sending updates, averaging them, and updating the global model is repeated over several rounds, gradually improving the global model [30][4]. Traditionally, FedAvg uses a central server to average the updates, but it can be adapted to decentralized topologies (e.g., fully connected or circular).

Federated Learning is a machine learning approach for training models on clients without sharing the original data, for example for next-word prediction on mobile phones. Figure 2.2 illustrates a distributed version of federated learning, where remote devices/clients communicate directly with each other without the need for a central server. Each client performs local training on its own data and then exchanges model updates with other devices in a peer-to-peer fashion, shown by the dashed communication lines between smartphones [31]. The privacy of text data is maintained by training a prediction model locally on non-identically distributed user data and delivering updates to the server. The server aggregates these updates to form a new global model, iteratively improving predictions until convergence or a stopping criterion is met. The decentralized/distributed approach reduces reliance on a central server and allows devices to collaborate more flexibly, although it introduces additional complexity in communication and coordination.

Figures 2.1 and 2.2 illustrate these two important variants of Federated Learning (FL), showcasing the traditional centralized and the decentralized approach, respectively.

Figure 2.1: Centralized Federated Averaging

Figure 2.2: Decentralized Federated Averaging [31]

Constantly updating the model parameters, which for deep neural networks can involve a large number of parameters, requires frequent transmission of substantial amounts of data between clients and the server. This creates communication overhead between the central server and the clients/devices, especially in large-scale networks with many participants, which can significantly slow down the model's convergence. Given that the communication channels between devices/clients and the server usually have limited bandwidth and power, this frequent data transfer becomes a bottleneck. To address this, techniques such as quantization, compression, and adaptive communication strategies are often employed to reduce the communication load while maintaining model performance [32][30]. In our study, however, we concentrate on quantization techniques, how they help to save communication time, and their effect on the convergence of the algorithm.

2.2.1.1 Algorithm: FedAvg

In the FedAvg algorithm, D is the entire dataset used for training and is distributed among the clients, w(0) is the initial global model, which is the same for all clients at the beginning of the training process, η is the learning rate, and w(T) is the updated global model after the specified number of training rounds. The FedAvg algorithm involves two steps: local updates and global updates. In the local update, each client calculates the average gradient ∇L_i over its own dataset D_i and updates its local model w_i(t):

w_i(t) = w_i(t-1) - \eta \cdot \nabla L_i

where w_i(t-1) is the client's local model from the previous round (t-1), η is the learning rate, and ∇L_i is the average gradient calculated over the client's local data.

Figure 2.3: Algorithm: FedAvg [33]

In the global update, the server aggregates the local model updates from all clients:

w(t) = \frac{\sum_{i=1}^{N} p_i \cdot w_i(t)}{\sum_{i=1}^{N} p_i}

where w(t) is the updated global model, p_i is the weight of client i, and w_i(t) is the local update from client i. The FedAvg algorithm iterates through these two steps for a specified number of rounds T, gradually improving the local and global models based on collaborative learning from all clients' datasets [33].
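As a simplified illustration of one FedAvg round under these two update steps, the following sketch uses plain gradient steps on synthetic least-squares data in place of the neural-network training used in our experiments, and takes the weights p_i to be the clients' local dataset sizes (assumptions made here only for illustration).

# Simplified sketch of one FedAvg round following the local/global updates above.
# Plain gradient steps on synthetic least-squares data stand in for the local
# neural-network training used in the experiments (illustrative assumption).
import numpy as np

rng = np.random.default_rng(0)
clients = [(rng.normal(size=(n, 5)), rng.normal(size=n)) for n in (20, 50, 30)]
w_global = np.zeros(5)                        # w(0): initial global model
eta, local_epochs = 0.01, 5

def fedavg_round(w_global):
    local_models, sizes = [], []
    for A, b in clients:                      # each client trains on its own data
        w = w_global.copy()
        for _ in range(local_epochs):
            grad = A.T @ (A @ w - b) / len(b) # average gradient over the local dataset
            w = w - eta * grad                # local update: w_i(t) = w_i(t-1) - eta * grad
        local_models.append(w)
        sizes.append(len(b))                  # p_i: here, the client's number of samples
    p = np.array(sizes, dtype=float)
    # global update: weighted average of the local models
    return np.average(local_models, axis=0, weights=p)

for t in range(10):
    w_global = fedavg_round(w_global)
print(w_global)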
2.2.2 PUSH-PULL

Push-Pull optimization: push-pull algorithms are essential tools for distributed optimization, enabling collaborative optimization among interconnected agents or nodes in various network topologies. These algorithms use local computations to reach consensus on a global objective function, facilitating effective information exchange and synchronization among agents.

We now explain the Push-Pull Gradient Method, introduced by Shi Pu, Wei Shi, Jinming Xu, and Angelia Nedić in "Push-Pull Gradient Methods for Distributed Optimization in Networks" [34], which unifies different types of distributed architectures, including decentralized, centralized, and semi-centralized architectures. This method provides a cohesive framework that bridges various distributed paradigms, offering a unified approach to optimizing computational processes across diverse system configurations.

For the fully decentralized case, suppose we have a graph G that is undirected and connected. Here, G represents the network topology. We can set G_R = G_C = G, where G_R and G_C are the graphs associated with the pull matrix R and the push matrix C, respectively. In this context, R and C are symmetric matrices representing specific weight assignments within the network, and the proposed algorithm reduces to the one considered in [13], [35]. If the graph is directed and strongly connected, we can still let G_R = G_C = G and design the weights for R and C accordingly.

The matrix R defines the weight distribution for the network when information is pulled by the neighbors (receiver nodes). The first row has a weight of 1 for the first node and 0 for the others, indicating that node 1's information is solely determined by itself. The other rows distribute the information equally between two nodes, reflecting the connections in a decentralized or partially centralized system where nodes other than node 1 have some degree of interaction with multiple neighbors.

The matrix C defines the weight distribution when information is pushed to the neighbors (controller nodes). The first row reflects that node 1 pushes its information equally to all other nodes, consistent with its role as a central node in a semi-centralized architecture. The other rows have lower weights, reflecting a less active role in the push process for nodes 2, 3, and 4, which mainly respond to node 1.

Figure 2.4: Illustration of the Push-Pull algorithm's working mechanism. The left side of the figure shows node 1 actively pulling information from nodes 2, 3, and 4, while the right side shows node 1 pushing information to nodes 2, 3, and 4. This visualizes how node 1 acts as a central hub, pulling information in one phase and pushing information in the other phase, consistent with the push-pull dynamics [34].

To illustrate the less straightforward situation of (semi-)centralized networks, let us provide a simple example. Consider a four-node star network composed of nodes 1, 2, 3, and 4, where node 1 is situated at the center and nodes 2, 3, and 4 are bidirectionally connected with node 1 but not connected to each other. In this case, the matrix R in our algorithm can be chosen as described above: node 1's information regarding x_{1,k} is pulled by the neighbors (the entire network in this case) through G_R, while the others only passively take in the information from node 1. At the same time, node 1 is pushed information regarding y_{i,k} (for i = 2, 3, 4) from the neighbors through G_C; the other nodes only actively comply with the requests from node 1. This motivates the algorithm's name, push-pull gradient method. Although nodes 2, 3, and 4 update their y_i values accordingly, these quantities do not have to contribute to the optimization procedure and will die out geometrically fast due to the weights in the last three rows of C. Consequently, in this special case, the local step size for agents 2, 3, and 4 can be set to 0. Without loss of generality, suppose f_1(x) = 0. Then the algorithm becomes a typical centralized algorithm for minimizing \sum_{i=2}^{4} f_i(x), where the master node 1 utilizes the slave nodes 2, 3, and 4 to compute the gradient information in a distributed way.
Taking the above as an example for explaining the semi-centralized case, it is worth noting that node 1 can be replaced by a strongly connected subnet in GR and GC , respectively. Correspondingly, nodes 2, 3, and 4 can all be replaced by subnets as long as the information from the master layer in these subnets can be diffused to all the slave layer agents in GR, while the information from all the slave layer agents can be diffused to the master layer in GC . Specific requirements on the connectivities of slave subnets can be understood by using the concept of rooted trees. We refer to the nodes as leaders if their roles in the network are similar to the role of node 1; and the other nodes are termed as followers. Note that after the replacement of the individual nodes by subnets, the network structure in all subnets is decentralized, while the relationship between the leader subnet and follower sub- nets is master-slave. This is why we refer to such an architecture as semi-centralized. This example illustrates a semi-centralized case where node 1 can be replaced by a strongly connected subnet in GR and GC respectively. Similarly, nodes 2, 3, and 4 can be replaced by subnets, ensuring that information can be diffused appropriately between master and slave layers in GR and GC . The connectivity requirements of these slave subnets can be understood using rooted trees. Nodes fulfilling roles sim- ilar to node 1 are termed leaders, while other nodes are termed followers based on their network roles [34]. Figure 2.5: Push-pull block diagram Sender Node A -> Receiver and Requester Node B: The arrow from Node A to Node B is labelled "Push Request" suggesting that Node A is passing data to Node B without receiving a request. 14 Receiver Node B -> Responder Node C: The arrow from Node B to Node C is labelled "Pull Request" indicating that Node B is seeking data from Node C. Responder Node C -> Receiver, Requester Node B: The arrow from Node C to Node B is labelled "Response" indicating that Node C is returning the requested data to Node B. Where nodes both push data to specific recipients and respond to pull requests from other nodes.From the viewpoint of an agent, the information about the gradients is pushed to the neighbors, while the information about the decision variable is pulled from the neighbors hence giving the name push-pull gradient methods[34]. The 2.5 diagram provide a visual representation of how push and pull algorithms work in distributed systems, where nodes exchange data or messages based on dif- ferent communication patterns. In a push phase of distributed optimization, each node computes its local updates based on its own data and current model parameters. These local updates are then transmitted or "pushed" to neighboring nodes in the network. This dissemination of information enables nodes to share their insights and collaborate effectively towards a common objective. Conversely, in the pull phase, nodes receive updates or information from their neigh- bors. This process involves aggregating the received updates, typically through techniques like averaging or consensus, to update their own local model parameters. The pull phase ensures that nodes benefit from the collective intelligence of neigh- boring nodes to enhance their own understanding and refine their model parameters. However, in practical scenarios with imperfect communication channels, such as quantization or erasure channels, the efficacy of push and pull algorithms can be challenged. 
Erasure channels, for example, may drop transmitted data packets. These imperfections can lead to loss of information or introduce errors in the communication process.

2.2.2.1 Algorithm: Push-Pull

Algorithm 1 (Push-Pull) can be rewritten in the following aggregated form. A matrix is nonnegative if all its elements are nonnegative. In the context of the algorithm, let α = diag{α_1, α_2, ..., α_n} be a nonnegative diagonal matrix, and R = [R_{ij}], C = [C_{ij}] ∈ R^{n×n}. We make the following assumption regarding the matrices R and C:

Assumption 2: The matrix R ∈ R^{n×n} is a nonnegative row-stochastic matrix, and C ∈ R^{n×n} is a nonnegative column-stochastic matrix. This means that R satisfies R1 = 1, and C satisfies 1^T C = 1^T. Additionally, the diagonal elements of both R and C are positive, i.e., R_{ii} > 0 and C_{ii} > 0 for all i ∈ N.

As a consequence of C being column-stochastic, it can be shown by induction that:

\frac{1}{n} \mathbf{1}^T y_k = \frac{1}{n} \mathbf{1}^T \nabla F(x_k), \quad \forall k.

This relation is crucial, as it allows (a subset of) the agents to track the average gradient \frac{1}{n} \mathbf{1}^T \nabla F(x_k) through the y-update.

In each iteration, every agent shares gradient information with its outgoing neighbors and receives decision variables from its incoming neighbors. Each component of y_k tracks the average gradient using the column-stochastic matrix C, while each component of x_k drives the optimization through average consensus using the row-stochastic matrix R. Algorithm 1 resembles gradient tracking methods seen in prior work, where the doubly stochastic matrix is split into row-stochastic and column-stochastic components. This asymmetric R-C structure, previously used for achieving average consensus, differs from linear systems due to the nonlinear dynamics introduced by the gradient terms. Finally, the graph structures G_R and G_C are induced by the matrices R and C, respectively; notably, the graph induced by C^T mirrors G_C with all its edges reversed [34].

Figure 2.6: Algorithm: Push-Pull [34]
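The per-agent update rules themselves are given in Figure 2.6. As a rough illustration only, the following sketch iterates the aggregated form reported in [34], x_{k+1} = R(x_k - α y_k) and y_{k+1} = C y_k + ∇F(x_{k+1}) - ∇F(x_k), on a small directed ring with simple quadratic local objectives; the matrices, step size, and objectives below are illustrative choices and not the setup used in the thesis experiments.

# Rough sketch of the push-pull iteration in aggregated form (based on [34]):
#   x_{k+1} = R (x_k - alpha * y_k),   y_{k+1} = C y_k + grad(x_{k+1}) - grad(x_k),
# with R row-stochastic and C column-stochastic. Scalar quadratic objectives
# f_i(x) = 0.5*(x - c_i)^2 are used as stand-ins (illustrative assumption).
import numpy as np

n = 4
c = np.array([1.0, 2.0, 3.0, 4.0])            # minimizer of sum_i f_i is mean(c) = 2.5
grad = lambda x: x - c                        # stacked local gradients

# Directed ring: pull decision variables from one neighbor, push gradients to another.
R = 0.5 * (np.eye(n) + np.roll(np.eye(n), 1, axis=1))   # row-stochastic, positive diagonal
C = 0.5 * (np.eye(n) + np.roll(np.eye(n), 1, axis=0))   # column-stochastic, positive diagonal

alpha = 0.05
x = np.zeros(n)
y = grad(x)                                   # y_0 = gradient at x_0 (tracking initialization)

for k in range(1000):
    x_new = R @ (x - alpha * y)
    y = C @ y + grad(x_new) - grad(x)
    x = x_new

print(x)   # all entries approach the global minimizer 2.5

The key property the sketch exercises is the gradient-tracking relation above: since 1^T C = 1^T, the column sums of C preserve 1^T y_k = 1^T ∇F(x_k) at every iteration.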
2.2.3 Evaluation Metrics

In the field of federated learning and distributed machine learning, the performance and efficiency of algorithms are critically influenced by several key factors. Understanding these factors is essential for optimizing algorithm performance and ensuring robust, scalable machine learning models. In our project, we implemented and analyzed key metrics such as communication time, average loss, convergence threshold, convergence time, convergence loop, and accuracy.

Average Communication Time: The communication time measures how long it takes a client to exchange gradients with all other clients participating in the training. It includes both the time to receive gradients synchronously from all clients and the time to send its gradients synchronously to all other clients. Accurate measurement of communication time is essential for understanding the efficiency of distributed training processes. The Average Communication Time represents the average time spent by all clients during each training iteration to exchange gradients in a distributed setup. The accumulated total communication time over multiple iterations is given by:

\text{Total Comm Time} = \sum_{i=1}^{N} T_{\text{comm},i}

Where:
• N is the total number of iterations.
• T_{comm,i} is the communication time during the i-th iteration, which includes both the reduction and broadcast times.

The Average Communication Time is calculated as:

\text{Average Comm Time} = \frac{\text{Total Comm Time}}{\text{Number of Clients}}

Where:
• Total Comm Time is the sum of communication times from all clients.
• Number of Clients is the total number of clients participating in the training process.

This formula effectively represents the overall communication overhead experienced during the training process.

Convergence Time: The convergence time is the time consumed by a client from the start of training until it reaches a point where the training loss falls below a predefined convergence threshold (0.01 in our experiments). This metric, typically measured in seconds, reflects the speed at which the model stabilizes and begins producing reliable outputs. Factors such as hardware specifications, dataset size, and model complexity can influence the convergence time. The Average Convergence Time is the average time taken by all clients to reach convergence, where the training loss falls below a predefined threshold.

The convergence threshold ϵ is a predefined value that determines when the training is considered to have converged. In our experiments, this threshold is set to ϵ = 0.01. The training is considered to have converged when the average loss falls below this threshold. The individual client convergence time is

T_{\text{convergence}} = T_k - T_{\text{start}} \quad \text{where } L_{\text{avg},k} < \epsilon

Step-by-step representation:
1. Start timing at T_start.
2. For each epoch k, compute the average loss L_{avg,k}.
3. Identify the first epoch k where the average loss L_{avg,k} falls below the threshold ϵ.
4. Calculate the convergence time as the difference between the end time of this epoch and the start time.

The average loss L_{avg,k} for the k-th epoch is calculated as the mean of the loss values over all batches within that epoch. It measures how well the model is learning during the training process:

L_{\text{avg},k} = \frac{1}{n} \sum_{i=1}^{n} L_{i,k}

Where:
• n is the total number of batches in the k-th epoch (len(train_loader)).
• L_{i,k} is the loss value of the i-th batch in the k-th epoch.

The total convergence time across all clients that have reached convergence is given by:

T_{\text{convergence}} = \sum_{j=1}^{M} T_{\text{convergence},j}

Where:
• M is the total number of clients that have reached convergence.
• T_{convergence,j} is the convergence time for the j-th client.

The Average Convergence Time is calculated as:

\text{Average Convergence Time} = \frac{T_{\text{convergence}}}{\text{Number of Clients}}

Where:
• T_convergence is the sum of convergence times from all clients that have reached convergence.
• Number of Clients is the total number of clients participating in the training process.

Convergence Loops: This metric refers to the specific training loop during which the training loss drops below the predefined convergence threshold (in this case, a convergence threshold of 0.01). The number of loops required provides insight into the stability and efficiency of the training process, indicating how quickly the model is able to meet the convergence criteria.

Accuracy: In this context, accuracy refers to the test accuracy, which measures how well the trained model performs on a separate test dataset that was not used during training. It is a key indicator of the model's ability to generalize to new, unseen data, reflecting the effectiveness of the model in practical applications.

These metrics (communication time, convergence time, convergence loops, and accuracy) were measured during the implementation of the distributed optimization algorithms FedAvg and Push-Pull. By defining, implementing, and analyzing these measures within our code, we gained valuable insights into the real-world difficulties and trade-offs associated with distributed learning settings.
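As an illustration of how these metrics could be accumulated inside a client's training loop, the following sketch uses a hypothetical exchange_gradients() placeholder in place of the synchronous MPI exchange and simulated per-epoch losses; it records the total communication time, the convergence time, and the convergence loop against the threshold ϵ = 0.01. It is a sketch of the measurement logic, not the thesis implementation.

# Sketch of how the metrics above could be accumulated in a client's training loop.
# exchange_gradients() is a hypothetical stand-in for the synchronous send/receive
# step used in the experiments; the loss values here are simulated for illustration.
import time

EPSILON = 0.01                                 # convergence threshold used in the thesis
total_comm_time = 0.0
convergence_time = None
convergence_loop = None

def exchange_gradients():
    time.sleep(0.001)                          # placeholder for the real communication step

t_start = time.perf_counter()
simulated_losses = [0.5 / (k + 1) for k in range(100)]   # pretend average loss per epoch

for k, batch_losses in enumerate([[l] * 10 for l in simulated_losses]):
    t0 = time.perf_counter()
    exchange_gradients()                       # communication time for this iteration
    total_comm_time += time.perf_counter() - t0

    avg_loss = sum(batch_losses) / len(batch_losses)     # L_avg,k over the epoch's batches
    if convergence_time is None and avg_loss < EPSILON:
        convergence_time = time.perf_counter() - t_start # first epoch below the threshold
        convergence_loop = k

print(total_comm_time, convergence_time, convergence_loop)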
In this study, we focused on the implementation of the federated averaging algorithm and the push-pull algorithm, with specific attention to the data partitioning strategies, quantization of gradients, and erasure of gradients. We also analyzed the impact of different network topologies using these metrics.

2.3 Quantization

Quantization is the process of mapping continuous values to discrete levels, commonly used in machine learning to reduce the size of models and the volume of data transmitted during training in distributed systems. Quantization can be applied in two ways: post-training quantization and quantization-aware training. Post-training quantization quantizes models or gradients after training has been completed. In the case of gradient quantization, this means applying quantization techniques to gradients that have been computed at full precision. This is useful in scenarios where the goal is to minimize the communication overhead. In quantization-aware training, on the other hand, quantization is integrated into the training phase, leading to better-optimized models with minimal accuracy degradation, but it requires more computational resources and time due to the need for retraining [36].

Gradient quantization addresses the problem of bandwidth and communication cost, which are significant bottlenecks in decentralized and distributed learning systems. In gradient quantization [37], gradients are scaled to a range determined by the number of bits and then rounded to the nearest quantization level. This quantization step maps each gradient component to a discrete level, reducing its precision but significantly compressing the amount of data transmitted.

Let:
• g ∈ R^d be the original gradient vector (tensor),
• max_val = max(|g|) be the maximum absolute value of the gradient,
• b be the number of bits used for quantization.

The quantization process can be broken down into normalization, scaling, rounding, and clamping. The gradient is normalized using the maximum absolute value:

\hat{g}_i = \frac{g_i}{\text{max\_val}}

where \hat{g}_i is the normalized version of the gradient. The number of discrete levels available for quantization is determined by the b bits. The range of quantization levels is:

\text{scale} = 2^{b-1} - 1

The normalized gradient is scaled to this range:

\tilde{g}_i = \hat{g}_i \times \text{scale}

where \tilde{g}_i is the scaled gradient ready for quantization. The scaled gradient is rounded to the nearest integer to map it to a discrete level:

q(g_i) = \text{round}(\tilde{g}_i)

The rounded value is clamped to ensure it lies within the range [-scale, scale]:

q(g_i) = \min(\max(q(g_i), -\text{scale}), \text{scale})

where q(g_i) is the quantized value of g_i. Dequantization in the multi-bit case is given by:

\hat{g}_i = \frac{q(g_i)}{2^{b-1} - 1} \times \text{max\_val}

For the special case where b = 1, the gradient [38] is simply quantized to its sign:

q(g_i) = \text{sign}(g_i)

and dequantization for b = 1 is:

\hat{g}_i = q(g_i) \times \text{max\_val}

where:

\text{sign}(g_i) = \begin{cases} +1, & \text{if } g_i > 0 \\ -1, & \text{if } g_i < 0 \\ 0, & \text{if } g_i = 0 \end{cases}

Quantization introduces a loss of precision by representing weights or gradients with fewer bits. While this reduces communication overhead, the reduced precision may introduce small errors, called quantization errors, which affect the accuracy of the model, its convergence, and the number of iterations needed for convergence compared to a non-quantized model.
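The steps above translate almost directly into code. The following NumPy sketch (an illustration, not the thesis implementation; the function names quantize and dequantize are ours) applies the normalize-scale-round-clamp procedure and the corresponding dequantization, including the 1-bit sign case.

# Sketch of the quantization/dequantization steps described above (NumPy, illustrative).
import numpy as np

def quantize(g, b):
    """Map each gradient component to one of 2^b - 1 symmetric integer levels."""
    max_val = np.max(np.abs(g))
    if max_val == 0 or b == 1:
        return np.sign(g), max_val             # 1-bit case: keep only the sign
    scale = 2 ** (b - 1) - 1
    q = np.round(g / max_val * scale)          # normalize, scale, round
    return np.clip(q, -scale, scale), max_val  # clamp to [-scale, scale]

def dequantize(q, max_val, b):
    if b == 1:
        return q * max_val
    return q / (2 ** (b - 1) - 1) * max_val

g = np.array([0.03, -1.7, 0.4, 0.9])
q, m = quantize(g, b=4)
print(q, dequantize(q, m, b=4))                # reconstruction shows the quantization error

Only the integer levels q(g_i) and the single scalar max_val need to be transmitted, which is where the communication savings come from; the receiver recovers an approximation of g by dequantizing.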
Reducing quan- tization levels/bits minimizes communication overhead but it introduces the risk 20 of quantization error, leading to a loss of precision and slower convergence. Con- versely, increasing more quantization levels/bits enhances data precision but also raises computational cost and communication time. The challenge lies in determin- ing the optimal quantization level based on specific distributed learning application requirements. 2.3.1.1 Reduced Model Size and Communication time Using fewer bits of quantization levels reduces the size of the model updates, de- creasing the amount of data transmitted between clients and server.This also reduce communication overhead as less data needs to be communicated between clients and server. since smaller data packets can be transmitted more quickly and also model size requires less memory on device . However, this reduction can lead to more distortion and potentially lower model accuracy due to information loss.[39][40] 2.3.1.2 Communication Efficiency and Computational Overhead Quantization may offer communication efficiency but in turn it increases compu- tation overhead, as lowering the model parameter and gradients from the original format to lower bit forms and back again de-quantization may add extra computa- tions [41]. Quantization introduces information loss. This quantized values may not be as ex- act as the original numbers, depending on bit width that was used.This may result in more calculations to make up for decreased precision. And slow convergence due to quantization error. Finding the right balance between these factors is crucial for optimizing the performance of distributed systems, especially in scenarios involving resource-constrained devices where efficient computation is vital for successful model training and convergence. 2.3.1.3 Optimal balance in Quantization Optimizing quantization requires a careful approach to balance between factors like, communication efficiency, computational complexity, and data precision in our project’s distributed learning application. Employing more quantization levels improves data precision but comes at the cost of increased computational complexity and communication overhead as discussed in above section. The challenge lies in determining the optimal quantization level tailored to our specific distributed learning application requirements. This involves considering the trade-offs between communication efficiency, data precision, and computational resources. By navigating these trade-offs effectively , we can optimize the performance and efficiency of our distributed learning application while maintaining communication reliability and computation efficiency and data precision. 21 2.3.2 Quantization in FedAvg and push-pull In a federated learning, the gradients computed locally on each device are sent back to a central server for averaging. However, transmitting high-precision gradi- ents from millions of devices/clients results in communication overhead. The use of gradient quantization reduces this overhead by compressing the gradients, thereby allowing the learning process to be more communication-efficient[4]. In order to alleviate the limitation of bandwidth in distributed networks, quan- tized messages should be transmitted between the nodes. To solve this distributed optimization problem, a gradient descent method is combined with a distributed quantized consensus algorithm that requires nodes to exchange quantized messages and converges in a finite number of steps [40]. 
This integration helps to significantly reduce the communication load while maintaining the algorithm's overall performance.

The application of quantization in FedAvg to increase efficiency has been the subject of several studies. For instance, a recent study [42] shows that incorporating quantization robustness into the training procedure produces FL models that are significantly more robust to different bit-widths during on-device inference. Quantization leads to smaller model sizes and lower computational demands by reducing the number of bits needed to describe model parameters. This is especially useful for resource-constrained client devices, like those utilized in FedAvg setups.

In a similar vein, the study in [43] explores various quantization techniques to reduce communication overhead in federated learning, particularly with the FedAvg algorithm. The study highlights the motivation for quantization, its impact on convergence rates, and its robustness against quantization noise. Experimental results demonstrate that quantized FedAvg can achieve similar accuracy to unquantized versions with significantly lower communication costs. Additionally, [43] addresses the application of these techniques to heterogeneous devices, enhancing efficiency across diverse participants. In federated averaging, gradient quantization takes place after the local training of each client, before the gradients are sent back to the server for aggregation.

Similarly, quantization emerges as a promising technique within push-pull algorithms, and several studies have explored its application in push-pull frameworks to optimize performance and reduce computational demands. At each optimization step, each node executes (i) a gradient descent step, subtracting the scaled gradient from its current estimate, and (ii) a finite-time calculation of the quantized average of all nodes' estimates in the network. As a result, this technique closely resembles the centralized gradient descent algorithm. The approach is demonstrated to asymptotically converge to a neighborhood of the optimal solution at a linear convergence rate [40]. This ensures that the push-pull method remains effective even with the reduced precision due to quantization; a simplified sketch of such a quantized update step is shown below.
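To make this update step concrete, the sketch below simulates one round of the scheme described above: each node takes a local gradient step and the nodes then average quantized copies of their estimates. For simplicity it replaces the finite-time quantized consensus of [40] with an exact average of the quantized estimates, so it should be read as an illustration under that assumption, not as the cited algorithm; all names are ours.

    import torch

    def quantize(v: torch.Tensor, bits: int) -> torch.Tensor:
        """Uniform quantize-then-dequantize of a vector (see Section 2.3)."""
        max_val = v.abs().max()
        if max_val == 0:
            return v.clone()
        scale = 2 ** (bits - 1) - 1
        q = torch.clamp(torch.round(v / max_val * scale), -scale, scale)
        return q / scale * max_val

    def quantized_round(x, grads, lr=0.2, bits=4):
        """One simplified round: local gradient step, then averaging of the
        quantized estimates across all nodes (exact average used here in
        place of finite-time quantized consensus)."""
        stepped = [xi - lr * gi for xi, gi in zip(x, grads)]   # (i) gradient step
        quantized = [quantize(xi, bits) for xi in stepped]     # quantize estimates
        avg = torch.stack(quantized).mean(dim=0)               # (ii) network average
        return [avg.clone() for _ in x]                        # every node adopts it

    # Toy example: 3 nodes minimizing f_i(x) = 0.5 * ||x - t_i||^2.
    targets = [torch.tensor([1.0, 0.0]), torch.tensor([0.0, 1.0]), torch.tensor([2.0, 2.0])]
    x = [torch.zeros(2) for _ in targets]
    for _ in range(50):
        grads = [xi - ti for xi, ti in zip(x, targets)]        # local gradients
        x = quantized_round(x, grads, lr=0.2, bits=4)
    # Estimates approach the average of the targets, [1.0, 1.0], up to quantization error.

As the theory predicts, the iterates settle in a neighborhood of the optimum whose size shrinks as the number of quantization bits grows.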
Both FedAvg and push-pull algorithms benefit from quantization, though their implementations and benefits differ. Quantization in FedAvg lowers the communication cost of aggregating local models from distributed devices, allowing for effective model updates throughout a federated network of mobile devices. Because these devices have limited bandwidth and computational capacity, quantization is essential for maintaining efficiency. In contrast, the push-pull algorithm employs quantization to manage communication constraints within distributed optimization tasks. Here, quantization focuses on reducing the data exchanged between nodes to address the bandwidth limitations inherent in large-scale distributed systems.

To evaluate the effectiveness of quantization in FedAvg and Push-Pull, we conducted experiments using the MNIST and CIFAR-10 datasets. These experiments were designed to measure the impact of various quantization levels (1, 2, 4, 8, and 16 bits) on the performance and efficiency of the FedAvg and Push-Pull algorithms. The high-level findings indicate that quantized FedAvg can achieve nearly the same accuracy as the non-quantized version while significantly reducing communication time, though convergence time may increase with higher levels of quantization. For Push-Pull, however, accuracy decreases, as do communication times at lower quantization bit-widths, for both datasets. Details about the effects of quantization on both algorithms are given in Section 3.2.

2.4 Erasure

In networking or data transmission, a "missing packet" or erasure occurs when a data packet, a small unit of data sent through a network, does not reach its intended destination. This might happen due to various factors such as network congestion, transmission issues, or signal deterioration. The absence of packets may lead to communication disruptions, decreased efficiency, or incomplete data transmission. Commonly used methods like packet re-transmission or error correction are employed to reduce the effects of missing packets in protocols that depend on accurate and full packet delivery.

Communication between nodes (devices) and the central node (CN) in distributed systems, particularly in Federated Learning [44], frequently takes place over unstable channels that might cause data loss, known as erasure channels. In an erasure channel, updates or gradient vectors transmitted from a node to the CN are lost or "erased" with a specific probability p_erase. When this happens, the CN discards the missing updates and only includes the received, non-erased updates. This method reduces the need for complete information and allows the optimization process to proceed despite communication faults, but with some influence on convergence and speed.

2.4.1 Effects of Erasure

The concept of erasure in communication systems involves data loss during transmission, with the receiver being informed of which specific data was lost. Unlike bit errors, which lead to receiving inaccurate data, erasures signify the absence of data altogether, significantly impacting communication time, convergence time, and accuracy. This part discusses the influence of erasure on system performance and various methods to minimize its effects. Erasure affects the convergence and efficiency of communication systems in several ways.

2.4.1.1 Impact on Convergence Time

Missing data may lead to multiple retransmissions, which can slow down the communication process as a whole. This issue becomes particularly problematic with protocols that depend on acknowledgements, as the sender must wait for confirmation of successful receipt before transmitting the next packet. Increased erasure rates can lead to significant delays in the acknowledgment process, resulting in delayed convergence of the algorithm, which in turn needs more loops to converge. Erasure channels, where updates are lost during transmission with a specific probability p_erase, directly impact distributed learning algorithms such as Federated Learning (FL). Such channels result in incomplete gradient aggregation at the central node, as lost updates are not considered in the model update. Consequently, the model's updates are based on partial information, delaying convergence and requiring more iterations to achieve optimal performance [44], and therefore increasing convergence time.
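As a simple back-of-the-envelope illustration (our own, not taken from [44]): if each transmission is erased independently with probability p_erase and a lost update is retransmitted until it arrives, the number of attempts per update is geometrically distributed, so the expected number of transmissions is 1 / (1 − p_erase). The small simulation below confirms this empirically; the function name and parameter values are illustrative only.

    import random

    def transmissions_until_success(p_erase: float) -> int:
        """Count how many attempts are needed before one update gets through
        an erasure channel that drops each attempt with probability p_erase."""
        attempts = 1
        while random.random() < p_erase:
            attempts += 1
        return attempts

    # Empirical average over many updates; with p_erase = 0.2 this is close
    # to the theoretical value 1 / (1 - 0.2) = 1.25 transmissions per update.
    p_erase = 0.2
    trials = 100_000
    avg = sum(transmissions_until_success(p_erase) for _ in range(trials)) / trials

Even moderate erasure rates therefore inflate the effective communication cost per round; when lost updates are dropped rather than retransmitted, the same effect appears instead as additional convergence loops.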
2.4.1.2 Increased Communication Overhead

While erasure typically increases communication overhead due to re-transmissions or redundancy, in some systems it might paradoxically reduce communication time. This can happen if the system chooses to continue with partial data to maintain speed, or if it sends less data overall as a result of dropped packets. However, a shorter communication time does not always translate into a more efficient system; there can be other costs involved, such as a slower rate of convergence (requiring more loops to reach the optimal solution) or less precision [44].

2.4.2 Strategies for Mitigation

There are multiple strategies that can be utilized to lessen the impact of erasure in communication systems. Data packets are lost during transmission due to network congestion, device failure, or interference, resulting in missing packets. TCP and other re-transmission protocols mitigate the effects of missing packets by resending them. Error-correcting codes are sent along with the data to repair problems that arise during transmission. Duplicate data can be sent so that the recipient can reconstruct the original data if any packets are missing. Quality of Service (QoS) helps by assigning higher priority to certain types of traffic, reducing the likelihood of losing essential data packets. Network redundancy, such as using multiple paths for data delivery, also reduces the likelihood of packet loss. Furthermore, buffering can smooth out variations in packet arrival timing, lowering the impact of packet loss in live applications. Finally, error detection and recovery methods such as checksums and CRCs ensure the identification and re-transmission of any damaged or lost packets.

Although the major focus of these techniques is on broader data protection and redundancy, the same principles are useful for managing missing packets in network connections. While the strategies mentioned above, such as re-transmission protocols, error correction codes, QoS, and network redundancy, are vital for minimizing the impact of erasure, our study specifically focuses on the effects of erasure on communication time, convergence time, accuracy, and convergence loops. Understanding these effects is crucial for developing more efficient algorithms and systems that can better tolerate erasures without relying solely on mitigation techniques.

2.4.3 Erasure in FedAvg and Push-Pull

Missing data can greatly impact distributed machine learning methods like Federated Averaging (FedAvg) and Push-Pull. In federated learning (FL), particularly in wireless networks, the presence of faulty communication channels adds heterogeneity to the system. Communication issues such as packet loss or transmission faults can severely disrupt the learning process by delivering incomplete or inaccurate model updates to the central server. This disruption is especially significant in algorithms such as FedAvg, which assumes that all client updates contribute evenly and consistently to the overall model. In environments with unreliable communication, FedAvg may struggle to maintain stable convergence because missing or corrupted updates can cause considerable variations in the model's training trajectory [45]. The paper [44] explores the challenges and solutions for federated learning in environments where communication links between clients and the server are unreliable and prone to packet erasure.
In such settings, the communication links are modeled as packet erasure channels, where local updates from clients can be lost with a certain probability. This affects the learning process because some updates are lost during communication, which may slow down the convergence of the algorithm and reduce accuracy. However, the paper also shows that the impact on convergence can be significantly reduced by using the strategies it proposes, such as reusing the last received update when a new update is lost. These strategies help to maintain a convergence rate close to what would be achieved with perfect communication, demonstrating that federated learning can be robust even in the presence of communication errors. Similarly, erasures in the decentralized Push-Pull protocol can lead to stale or inconsistent updates spreading through the network, changing the local models and slowing down convergence at each node. Nodes missing crucial updates from neighbors due to lost packets can cause inconsistencies and potentially hinder the overall optimization process. These effects highlight the critical role that reliable communication plays in maintaining model accuracy and ensuring efficient convergence in federated learning systems.

We consider a scenario where the sender communicates with the receiver over an erasure channel, with each action index sent by the learner potentially being erased with probability ϵ. The receiver is aware of these erasures, but the sender is not, which can lead to suboptimal learning if the receiver does not receive the intended action. This issue mirrors challenges in decentralized systems, such as the Push-Pull protocol, where erasure channels can cause lost messages between nodes. These lost messages can lead to inconsistencies, delayed convergence, and potentially incorrect outcomes in distributed optimization processes. Reliable communication is thus crucial both for effective learning and for maintaining accuracy in federated learning [46].

In our work, we simulated the effects of erasure channels on both the FedAvg and Push-Pull algorithms to study their impact on convergence loops and time, accuracy, and communication time. We introduce a controlled probability of erasure ϵ in the communication between clients (or nodes) and the central server for FedAvg, and between clients for the Push-Pull algorithm. The code implements a stochastic gradient erasure mechanism in which each node (except the central node) can have its gradient erased (set to zero) with a certain probability, denoted p_erase. For FedAvg, the central node (CN) aggregates only the available (non-erased) updates from the devices, without waiting for full gradients or reusing past updates [44]. A similar approach applies to the Push-Pull algorithm, where nodes aggregate only the received updates without waiting for missing data, maintaining progress despite communication interruptions. This allowed us to observe how these algorithms perform under real-world conditions where communication is unreliable. The results, detailed in Section 3.2, demonstrate the varying degrees of robustness of these algorithms. For instance, while FedAvg showed sensitivity to missing updates, resulting in delayed convergence (in both time and loops) and lower accuracy, Push-Pull exhibited challenges in maintaining consistency across nodes.
The process of gradient aggregation with erasure in a distributed learning setup can be represented as follows. Let g_i be the gradient vector of the i-th node and let p_erase be the probability of erasure. For each node i (except the central node), the gradient vector is set to zero with probability p_erase:

    g′_i = g_i with probability 1 − p_erase,   g′_i = 0 with probability p_erase.

The central node then aggregates the gradients from all nodes. If G is the matrix of gradients from all nodes, the aggregated gradient G_agg is

    G_agg = Σ_{i=1}^{N} g′_i

where N is the total number of nodes.

Our code uses this probability-driven approach to simulate communication failures by erasing gradients, with a star topology for FedAvg and a peer-to-peer topology for the Push-Pull algorithm. For FedAvg, the gradient of each non-central node is set to zero with a certain probability, whereas for the Push-Pull algorithm the erasure applies uniformly, reflecting equal communication unreliability across all nodes. This approach allows testing the robustness of distributed optimization algorithms against communication failures, providing insights into their performance under realistic, unreliable communication conditions.

3 Methods and Results

3.1 Methods

This section describes the framework underlying the metrics and the important implementation considerations, providing the foundation for the subsequent analysis of our experimental results. In this study, we focused on the implementation of the federated averaging algorithm and the push-pull algorithm, with specific attention to data partitioning strategies, quantization of gradients, and erasure of gradients. We also analyzed the impact of different network topologies on these metrics.

Tools and Libraries: The implementation leverages PyTorch, a popular deep learning library, for dataset pre-processing, model definition, training, gradient synchronization, and evaluation. Communication between clients during distributed training is managed using MPI (Message Passing Interface) through PyTorch's torch.distributed package. This setup allows for efficient communication and synchronization across the multiple clients participating in the federated learning process.

Quantization Type: The implemented code uses uniform quantization of gradient values during the synchronization phase. This process involves scaling the gradients using a fixed scaling factor based on the number of quantization bits specified (e.g., 1, 2, 4, 8, or 16 bits). The scaled gradients are then rounded to the nearest integer and clamped within a specified range, ensuring that the quantization process is both efficient and effective in reducing communication overhead. Quantization in our framework is applied specifically during the gradient synchronization phase after local training. This means that while the model weights are updated in floating-point precision, the gradients communicated between clients are quantized. This form of gradient quantization reduces the bandwidth required for gradient exchange without significantly impacting the model's performance.

Erasure: We simulated the occurrence of data loss during communication by inducing erasure based on a probability factor. This probability represents the likelihood that the gradients (model updates) from client nodes will be lost or corrupted before they reach the server. During each communication round, a random number is generated and compared against the probability factor. If the random number falls within the probability range, the gradient is erased (not sent), simulating a failed transmission. This mimics a realistic scenario in which communication channels are unreliable, such as networks experiencing high traffic congestion, poor signal strength, or intermittent connectivity. A minimal sketch of this erasure mechanism is shown below.
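The sketch below illustrates the erasure mechanism and the aggregation of the (possibly erased) updates described above and formalized in Section 2.4.3. It is a simplified, self-contained illustration; the actual implementation performs the erasure and synchronization through torch.distributed, the function names here are ours, and the aggregate is written as a sum as in the formula above (an average over the received updates differs only by a scaling factor).

    import random
    import torch

    def erase(grad: torch.Tensor, p_erase: float) -> torch.Tensor:
        """Erasure channel: with probability p_erase the update is lost
        (modelled, as in Section 2.4.3, by a zero gradient)."""
        return torch.zeros_like(grad) if random.random() < p_erase else grad

    def aggregate_with_erasure(client_grads, p_erase: float) -> torch.Tensor:
        """Central-node aggregation G_agg = sum_i g'_i over the (possibly
        erased) client gradients, as in the formula above."""
        erased_grads = [erase(g, p_erase) for g in client_grads]
        return torch.stack(erased_grads).sum(dim=0)

    # Example: 8 clients with toy gradients and a 20% erasure probability.
    client_grads = [torch.randn(10) for _ in range(8)]
    g_agg = aggregate_with_erasure(client_grads, p_erase=0.2)

In the experiments the aggregate is then averaged over the updates that actually arrived, consistent with the description above, so an erased client simply does not contribute to that round.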
Network Topology: Different topologies were considered for our implementation of FedAvg and Push-Pull (star, fully connected, peer-to-peer, circular), along with their impact on the performance metrics of interest, such as convergence time, communication time, and accuracy.

3.1.1 Implementation

3.1.1.1 Architecture of the High Performance Computing Cluster

The HPC system consists of Intel Xeon Gold 6130 and Intel Xeon Gold 6338 cores with a few NVIDIA GPU cards. It also has a central storage system, a Slurm scheduler, and a login node. All of these components are interconnected through InfiniBand and Ethernet. The cluster is accessed through the login node, through which we can transfer input files to the cluster, prepare batch scripts, and send the Slurm script to the scheduler. The scheduler then allocates the resources and starts the script on the compute nodes.

Steps to schedule model training on the HPC system: First we log in to the cluster through the login node and prepare a Slurm batch script and a Python script for training the model. The Slurm script specifies the resources needed, including the number of cores, CPUs, GPUs, wall time, and memory. Once submitted, the job is placed in a queue, ordered by priority. The job starts when the requested resources are available, and automatically set environment variables then tell MPI how to run the job. There is also a temporary directory available within each node which can be used during file I/O operations; however, this directory is cleaned immediately when the job ends, fails, or crashes.

For this thesis we have considered 1 and 4 nodes, and 8, 16, and 24 tasks. Nodes in an HPC cluster are individual compute units with their own CPUs, memory, and sometimes GPUs. In this study, we ran on both a single node and four nodes to explore the scalability of our model training. By using different numbers of nodes, we tested how the distributed training algorithms (FedAvg and Push-Pull) scale when moving from a single node to multiple nodes. This helps in understanding the performance impact of scaling across nodes and how well communication and computation distribute when using more hardware resources.

In this context, tasks refer to the number of clients or processes involved in the distributed training protocols FedAvg and Push-Pull. Each task can be viewed as a different client in a federated learning setup or as a separate process in a parallel computing environment. Each task in FedAvg represents a client that trains separately on local data before communicating updates to a central server for aggregation; the number of tasks (8, 16, 24) is the number of clients involved in the federated learning system. Push-Pull uses tasks to represent clients or processes that connect directly with one another to exchange gradients or model updates, rather than going through a centralized server. Increasing the number of tasks examines the communication protocol's ability to manage larger, more complex interactions.
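As an illustration of how these tasks map onto the training code, the following sketch shows the basic torch.distributed pattern for star-topology gradient averaging: each rank (task) computes a local gradient, rank 0 acts as the central node that collects and averages the updates, and the averaged gradient is broadcast back. This is a minimal sketch of the pattern, assuming a PyTorch build with MPI support; the thesis's actual scripts are more elaborate, and the file name in the comment is only an example.

    import torch
    import torch.distributed as dist

    def average_gradients_star(local_grad: torch.Tensor) -> torch.Tensor:
        """Star-topology averaging: every rank sends its gradient to rank 0,
        rank 0 averages, and the result is broadcast back to all ranks."""
        world_size = dist.get_world_size()
        summed = local_grad.clone()
        dist.reduce(summed, dst=0, op=dist.ReduceOp.SUM)  # sum at the central node
        if dist.get_rank() == 0:
            summed /= world_size                          # central node averages
        dist.broadcast(summed, src=0)                     # share the averaged gradient
        return summed

    if __name__ == "__main__":
        # With an MPI-enabled PyTorch build this would be launched as e.g.
        #   mpirun -np 8 python average_grads.py
        dist.init_process_group(backend="mpi")
        local_grad = torch.randn(10)   # stand-in for a locally computed gradient
        avg = average_gradients_star(local_grad)

The number of MPI tasks requested in the Slurm script directly becomes the world size seen by this code, which is how the 8, 16, and 24 client configurations are realized.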
Data pre-processing: In this thesis we used the MNIST and CIFAR-10 datasets.

MNIST: a dataset of grayscale images containing 60,000 training samples and 10,000 test samples, where each image represents a handwritten digit (0-9).

CIFAR-10: a dataset of RGB images containing 60,000 samples, split into 50,000 training samples and 10,000 test samples, covering 10 different classes such as airplanes, cars, birds, etc.

Once the scheduler allocates the requested resources (nodes and tasks as specified in the Slurm batch script), the job is sent to the compute nodes. The job script is replicated on the other nodes as per the Slurm configuration, allowing independent execution on each computing node. Each Python script is executed independently on its computing node, while communication between nodes is established through MPI (Message Passing Interface). MPI ensures seamless data exchange, synchronization, and coordination between the nodes during training. The Python script initially downloads the dataset (MNIST or CIFAR-10) to a central storage location accessible by all nodes. The dataset is divided according to its predefined classes, and each data sample is transformed into a tensor format suitable for neural network training. Of the 60,000 images, 10,000 are reserved for testing; these test samples are not exposed to the model during training, ensuring an unbiased evaluation. MNIST images are grayscale, whereas CIFAR-10 images are RGB, requiring different preprocessing steps for image normalization and data augmentation depending on the dataset type.

Data Distribution Across Nodes: Each computing node receives a unique subset of the dataset using the DistributedSampler, which ensures that the data is randomly and identically distributed across all nodes (IID). This means that each node gets a balanced portion of the data that reflects the overall dataset distribution. In our implementation, the data across nodes is IID because each node's data subset is randomly sampled and contains a similar distribution of classes and features as the entire dataset. This arrangement differs from real-world federated learning scenarios, in which data is frequently non-IID due to client-specific characteristics. To achieve this IID distribution, we employed a data partitioning strategy that shuffles the dataset before dividing it among clients; using the PyTorch DistributedSampler with shuffle=True ensures that each client receives a randomly selected subset of the data, maintaining the IID property (see the sketch below). The IID sampling approach ensures that each node processes statistically similar data, providing a controlled environment for evaluating distributed training without the additional complication of non-IID data and serving as a baseline for assessing the performance of the federated learning model under ideal data distribution conditions.
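The following sketch shows how the IID partitioning described above can be set up with torchvision and DistributedSampler. It is a minimal illustration assuming the process group has already been initialized (as in the earlier sketch); the data path, batch size, and function name are placeholders of ours.

    import torch
    import torch.distributed as dist
    from torch.utils.data import DataLoader
    from torch.utils.data.distributed import DistributedSampler
    from torchvision import datasets, transforms

    def build_mnist_loader(data_root: str = "./data", batch_size: int = 64) -> DataLoader:
        """Give each rank (client) a shuffled, roughly equal, IID shard of MNIST."""
        transform = transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize((0.1307,), (0.3081,)),  # standard MNIST normalization
        ])
        train_set = datasets.MNIST(data_root, train=True, download=True, transform=transform)
        sampler = DistributedSampler(
            train_set,
            num_replicas=dist.get_world_size(),  # one shard per task/client
            rank=dist.get_rank(),
            shuffle=True,                        # shuffle before sharding -> IID subsets
        )
        return DataLoader(train_set, batch_size=batch_size, sampler=sampler)

Because the sampler shuffles the full dataset before sharding, each client's subset has roughly the same class distribution as the whole dataset, which is what makes the partition IID.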
CNN Model: The model considered for training is a CNN with two convolutional layers and a number of output classes matching the dataset used. The dataset and the model are loaded onto the computing node, and the image samples are passed through the layers of the model, which produces the loss. This step is iterated over all the image samples available to each node. An epoch is defined as one pass in which the model has trained on the complete dataset once without repetition. The model is trained for a few epochs, after which we have updated model parameters at each computing node.

At this point the nodes need to exchange the model parameters to update the model, and to achieve this MPI calls are used to send the gradients from all the nodes to a single node, which averages the gradients and updates the model. Since we have used a star topology for node placement, the central node requests the gradients from all participating nodes, averages them, and updates the model by sharing the new gradients with all nodes, including itself. This procedure of training for a few epochs, averaging the gradients, updating the model, and then training again keeps reducing the loss, indicating that the model is learning effectively and improving its ability to recognise the images. After a certain point the loss no longer reduces significantly, indicating that the model has converged and no additional improvement can be achieved, so we stop the training and log the evaluation metrics for further analysis.

Testing the model: Once the model is trained, its performance is tested using the image samples that were not used during training. This evaluates the model on a separate dataset and assesses how it performs on unseen images, giving an indication of its real-world performance. If the model performs well on the training images but not on the test images, the model is overfitting: it has memorized the images rather than learning the underlying, generalizable information. Similarly, if the model performs poorly on both training and test images, it indicates underfitting, meaning the model is too simple for the dataset used.

3.2 Results

This section provides a detailed analysis of the experimental results obtained from applying different quantization levels, erasure, and network topologies to the FedAvg and Push-Pull algorithms on the MNIST and CIFAR-10 datasets.

The FedAvg algorithm was implemented using a star topology, where a central server aggregates the model updates from all clients, making it the default approach in centralized federated learning setups. The Push-Pull algorithm, typically used in distributed optimization, was applied in a peer-to-peer topology, where each client communicates directly with others. In this setup, the Push-Pull algorithm facilitates a decentralized learning process, allowing clients to collaboratively train a model without a central server. Both algorithms were analyzed under various quantization levels to assess their performance on the MNIST and CIFAR-10 datasets. The quantization method used here is uniform quantization, which scales the gradient values to a fixed range based on the number of bits and then rounds them to the nearest integer; 1-bit quantization is a special case that uses the sign function. The metrics considered for evaluation are defined in Section 3.1.

3.2.1 Analysis of Quantization Impact on FedAvg

3.2.1.1 Accuracy

For the MNIST dataset, the accuracy remains high across all quantization levels, with a slight decrease observed at 1-bit quantization. Specifically, the accuracy decreases from 98.42% with no quantization to 97.25% with 1-bit quantization, due to the reduction in precision, which introduces noise. The accuracy stabilizes around 98% for higher-bit quantization levels (2-bit, 4-bit, 8-bit, and 16-bit), indicating the robustness of the model to quantization. For the CIFAR-10 dataset, the accuracy of the model slightly decreases as the quantization level decreases.
The accuracy drops from 80% with no quantization to 77.56% with 1-bit quantization, indicating a minor loss of precision. However, the accuracy stabilizes around 79-80% for higher-bit quantization levels. This behavior is expected, as lower-bit quantization introduces more approximation error, while higher bit levels maintain sufficient precision to preserve model performance. While both datasets exhibit a drop in accuracy at the lowest bit level, the difference in how the accuracy drops is due to the complexity of the datasets, the nature of the images, and the model's sensitivity to quantization noise. The accuracy stabilizes at higher bit levels, demonstrating that quantization can be applied effectively with minimal loss of performance, especially at bit levels that strike a balance between precision and efficiency.

3.2.1.2 Average Communication Time

The average communication time for the MNIST dataset decreases significantly with lower-bit quantization levels. The communication time reduces from 943.60 seconds with no quantization to 553.38 seconds with 1-bit quantization, a reduction of approximately 41.36%. As the bit width increases, the communication time gradually increases: 714.19 seconds with 2-bit quantization (24.33% reduction), 718.66 seconds with 4-bit quantization (23.82% reduction), 816.75 seconds with 8-bit quantization (13.44% reduction), and 806.25 seconds with 16-bit quantization (14.56% reduction).

The average communication time for the CIFAR-10 dataset also generally decreases with lower-bit quantization levels. The communication time reduces from 8591.18 seconds with no quantization to 7541.56 seconds with 1-bit quantization, a reduction of approximately 12.22%. However, it is noteworthy that 1-bit quantization does not achieve the lowest communication time among all levels. The other bit widths give 7236.27 seconds with 2-bit quantization (15.79% reduction), 7513.10 seconds with 4-bit quantization (12.55% reduction), 7518.80 seconds with 8-bit quantization (12.50% reduction), and 7737.53 seconds with 16-bit quantization (9.92% reduction).

The slightly higher communication time observed with 1-bit quantization on CIFAR-10, compared to 2-bit, may be attributed to the computational overhead associated with quantizing and dequantizing the model parameters at such a low bit width. Handling these very small data packets may introduce inefficiencies that offset some of the gains from the reduced data volume. Additionally, network and implementation specifics, such as error handling and retransmissions due to quantization noise, could also play a role. These insights are supported by findings in related works such as [47], which highlights the trade-offs involved in gradient compression methods like signSGD, where reducing data volume can sometimes introduce inefficiencies that counterbalance the gains, and [48], which discusses the challenges and overheads associated with low-bit quantization. The communication time reduction was more significant for MNIST due to its simpler nature, while CIFAR-10's complexity introduced overhead at 1-bit quantization.

In large-scale federated learning deployments, even small percentage gains can lead to significant absolute time savings. The results demonstrate that using lower-bit quantization levels can substantially improve communication efficiency without severely compromising the model's performance. These findings underscore the practical benefits of quantization in enhancing communication efficiency in federated learning, validating our approach and encouraging further research into optimal quantization levels for various distributed learning scenarios.
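For reference, the reduction percentages quoted above follow the usual relative-reduction formula; for example, for MNIST with 1-bit quantization,

    reduction = (t_no_quant − t_quant) / t_no_quant × 100% = (943.60 − 553.38) / 943.60 × 100% ≈ 41.4%,

where t_no_quant and t_quant denote the communication times without and with quantization, respectively.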
3.2.1.3 Execution Time

Execution time is a crucial metric for assessing the efficiency of machine learning models. The data clearly shows that execution times are typically reduced by lower quantization levels, most likely as a result of the decreased computational cost of handling smaller numerical representations. The execution time shows a decreasing trend with lower-bit quantization for both the MNIST and CIFAR-10 datasets. For MNIST, it reduces from 103.67 minutes without quantization to 98.57 minutes with 1-bit quantization. For CIFAR-10, the time drops from 585.22 minutes without quantization to 518.04 minutes with 2-bit quantization. However, 1-bit quantization for CIFAR-10 results in a relatively higher execution time, due to the need for more communication rounds to compensate for the reduced gradient precision on this complex dataset. Despite some variation at higher bit levels, the overall reduction in execution time demonstrates the computational efficiency gained through quantization.

3.2.1.4 Average Convergence Time and Convergence Loop

Convergence time tends to decrease slightly with lower quantization levels, but the differences are less pronounced than for communication time. Both datasets show similar convergence-time patterns, with MNIST converging faster due to its simplicity, while CIFAR-10 requires more time due to its complexity. The convergence loop, represented by the number of FedAvg loops, indicates the number of communication rounds required for the model to converge to a stable state.

Quantization reduces the precision of the model updates sent from clients to the server and can introduce noise that affects the convergence rate. Lower-bit quantization (e.g., 1-bit) can significantly slow down convergence due to the high noise in gradient updates. However, studies [49] have shown that with careful design, such as using higher-bit quantization or adjusting the learning rate, the convergence rate can be improved and made comparable to full precision.

The difference in convergence behavior between datasets such as CIFAR-10 and MNIST is primarily due to the complexity of the datasets. CIFAR-10, with its colored images and intricate features, tends to have a slower convergence rate compared to the simpler grayscale digit images in MNIST. The sensitivity to quantization noise is higher for CIFAR-10 due to its more complex feature space, leading to more pronounced impacts on convergence.

The convergence loop for the CIFAR-10 dataset remains relatively stable across different quantization levels, with minor variations. The convergence rate is 20 FedAvg loops with no quantization, increases slightly to 21 loops for 1-bit and 2-bit quantization, and then stabilizes at 19 loops for higher bit levels. This stability indicates that the FedAvg algorithm can handle quantization noise without significantly affecting the convergence loop. In contrast, the MNIST dataset, being less complex, shows greater resilience to quantization noise. The convergence rate for MNIST remains stable, requiring 9 FedAvg loops without quantization, increasing slightly to 11 loops with 1-bit quantization, and stabilizing at 10 loops for higher bit levels.
The convergence loop analysis highlights the trade-off between quantization level and algorithm performance. While lower-bit quantization can lead to minor increases in convergence loops, the benefits in communication efficiency and execution time often outweigh these costs. These findings emphasize the importance of selecting appropriate quantization levels to balance communication overhead with convergence performance, particularly in federated learning scenarios where communication costs are a significant consideration. Future research should focus on adaptive quantization strategies that dynamically adjust bit widths to optimize convergence rates based on dataset complexity and network conditions. The quantization bits and their effects on these metrics are shown in Figure 3.1. Quantization effectively reduced communication time and execution time without severely impacting accuracy or convergence. While lower-bit quantization generally improved efficiency, the optimal bit level may vary based on the specific dataset and model characteristics.

Figure 3.1: Impact of Quantization Levels on Accuracy, Communication Time, Convergence Time, and Convergence Rate for MNIST and CIFAR-10 Datasets. The leftmost subplot compares the accuracy percentages of the datasets across different quantization levels. The second subplot shows the communication time in seconds, highlighting the efficiency of data transfer. The third subplot presents the convergence time in seconds, reflecting the overall time required for the models to converge during training. The rightmost subplot shows the convergence rate, measured in FedAvg loops, indicating the number of communication rounds required for the model to converge to a stable state.

3.2.2 Analysis of Quantization Impact on Push-Pull

3.2.2.1 Accuracy

The MNIST and CIFAR-10 datasets were examined with the push-pull method (peer-to-peer) under varying levels of quantization and configurations, including 8 clients and no quantization. Accuracy for the MNIST dataset rose from 76.35% with 1-bit quantization to 97.62% with 8-bit quantization, then decreased slightly to 97.14% with 16-bit quantization. The configuration with 8 clients and no quantization achieved the top accuracy of 98%. Accuracy for the CIFAR-10 dataset increased from 79% with 1-bit to 82% with 4-bit quantization and stayed consistent up to 16-bit quantization, with no quantization achieving above 82%.

3.2.2.2 Average Communication Time

The paper [40] considers the unconstrained distributed optimization problem in which the exchange of information in the network is captured by a directed graph topology; thus, nodes can only communicate with their neighbors. Additionally, in that problem the communication channels among the nodes have limited bandwidth. To alleviate this limitation, quantized messages should be exchanged among the nodes. To solve this distributed optimization problem, a gradient descent method is combined with a distributed quantized consensus algorithm, which requires the nodes to exchange quantized messages and converges in a finite number of steps. Specifically, at every optimization step, each node (i) performs a gradient descent step (i.e., subtracts the scaled gradient from its current estimate) and (ii) performs a finite-time calculation of the quantized average of every node's estimate in the network.
In Koloskova et al. (2019), the authors present a gossip-based stochastic gradient descent algorithm, which uses arbitrarily compressed messages and exhibits linear convergence. The main idea of quantization in this setting is that nodes transmit a compressed (i.e., quantized) value of their stored information, which requires only a few bits for representation compared to the non-compressed values (i.e., real values), which in theory require an infinite number of bits. For this reason, communication-efficient distributed optimization has recently received significant attention in the control and machine learning communities.

For the MNIST dataset, the impact of quantization on the push-pull algorithm is as follows: the communication time without quantization is 917.97 seconds, but the transmission latency fluctuates significantly depending on the number of quantization bits employed. With 1-bit quantization, the communication time is reduced substantially to 141.67 seconds; this notable decrease can be attributed to the reduced amount of data exchanged among peers. Nonetheless, the communication time increases sharply to 1118.22 seconds when the quantization bit count is raised to 2 bits. This sudden increase suggests that the benefit of smaller messages is outweighed by the overhead of handling the slightly more precise quantization. The communication times for 4-bit, 8-bit, and 16-bit quantization are 584.68 seconds, 384.22 seconds, and 688.08 seconds, respectively.

The varying quantization levels present a different picture of the communication times for the CIFAR-10 dataset. Without quantization, the communication time is 5380.20 seconds; the effect of quantization is not as dramatic as for MNIST. The communication time decreases to 5132.59 seconds with 1-bit quantization, and for 2, 4, 8, and 16 bits the communication times are 5310.19, 5332.63, 5432.17, and 5459.84 seconds, respectively. As the number of quantization bits increases (2, 4, 8, 16 bits), the communication time gradually increases, from 5310.19 seconds to 5459.84 seconds for CIFAR-10 and from 141.67 to 688.08 seconds for MNIST. Although higher-bit quantization provides more precise data, it increases the data size and introduces additional computational overhead, leading to longer communication times.

3.2.2.3 Average Convergence Time and Convergence Loop