Grasp Synthesis Methods on Known Objects for Bin-Picking
A comparison of deep-learning based and analytical-model based 6 DoF grasp pose synthesisers

Master's thesis in Systems, Control, and Mechatronics

SEVAG TAFNAKAJI
SIMON WIDERBERG

Department of Electrical Engineering
Division of Systems and Control
Chalmers University of Technology
Gothenburg, Sweden 2025
www.chalmers.se

© Sevag Tafnakaji, Simon Widerberg, 2025.

Supervisor: Atieh Hanna, Volvo GTO
Examiner: Karinne Ramirez-Amaro, Electrical Engineering Department

Master's Thesis 2025
Department of Electrical Engineering
Division of Systems and Control
Chalmers University of Technology
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Cover: Visualization in Rviz of multiple grasps generated using Contact-GraspNet in relation to a captured depth image of the scene.

Typeset in LaTeX, template by Kyriaki Antoniadou-Plytaria
Printed by Chalmers Reproservice
Gothenburg, Sweden 2025

Abstract

Automated pick-and-place operations are foundational tasks in robotics, with wide-ranging applications in industrial automation, logistics, and service robotics. Central to these operations is the ability of a robotic system to reliably plan and execute grasps on a diverse set of objects. Grasp synthesis, which is the process of determining suitable contact points and hand configurations for successful object manipulation, remains a challenging problem due to the inherent uncertainties in perception, object variability, and physical interactions. To address the challenges of grasp synthesis, this thesis explores and evaluates two distinct approaches to grasp pose generation. The first approach leverages a database-driven method, storing precomputed grasp poses for known, proprietary objects. The second employs an end-to-end deep learning model capable of generalizing grasp predictions across a wide variety of novel objects. A complete robotic pipeline was developed to integrate these grasp synthesis methods into practical pick-and-place and bin-picking tasks. Using this pipeline, we conducted experimental evaluations on a physical robotic platform to compare the grasp success rates of both approaches in real-world scenarios. We conclude that the deep learning method, using Contact-GraspNet for generating grasps, appears more fitting for the applications Volvo desires in their production environment due to its flexible and scalable nature, as well as achieving an overall 50% success rate compared to 39% for the database-driven method for single-object pick-and-place. While the database-driven method could still work, it is not as scalable and is reliant on an object pose estimation system.
Keywords: Robotics, Automation, Grasp Synthesis, Bin-Picking, Pick-And-Place, Deep Learning, ROS

Acknowledgements

Firstly, I would like to thank my examiners and supervisors. Your direction and assistance on this project saved us much time and effort that would have otherwise been wasted. I would also like to thank my family, whose eternal patience and support helped me see this project through, despite the setbacks and delays. To my friends, Hannah, Karthik, Monika, and Abhishek, I can only say thank you for the number of times you have raised my spirits enough to finish the project. Now, I will have to find new things to complain about.

Sevag Tafnakaji, Gothenburg, June 2025

I, too, would like to thank my examiner Karinne and my supervisors Atieh and Maximillian. Your expertise and knowledge were instrumental to the project, and your detailed feedback helped us stay on track. Thank you! Finally, I would like to thank my friends and family - thank you for being so supportive of every decision I have made during my time at Chalmers. It's been a long journey, but you've helped me realise how much I truly love the time spent as a student.

Simon Widerberg, Gothenburg, June 2025

This work was supported by the Vinnova project AIHURO (Intelligent human-robot collaboration).

Contents

List of Figures
List of Tables
1 Introduction
1.1 Objective
1.2 Assumptions and Limitations
1.3 Literature Review
1.3.1 End-to-End Deep Learning Models
1.3.2 Datasets for Deep Learning Models
1.3.3 Analytical Grasp Pose Synthesisers
2 Theory
2.1 Robot Operating System
2.1.1 Nodes
2.1.2 Topics and Communication
2.1.3 Coordinate Frames
2.1.4 Controllers and MoveIt!
2.2 GraspIt
2.3 Contact-GraspNet
3 Methodology & Design
3.1 Using a Deep Learning Model to Generate Grasps
3.1.1 Retraining the Model
3.1.2 Segmentation Map
3.1.3 Inference for Generating Grasp Poses
3.2 Using an Analytical Model to Generate Grasps
3.2.1 Environment Setup
3.2.2 Using the Grasps in ROS2
3.2.3 Additional Contact Points
3.3 Implementation of the Pick & Place Pipeline
3.4 Controllers and Motion Planner
3.5 Gripper Control Node
3.6 Grasp Synthesis Node
3.7 Pick And Place Node
3.8 Visualisation Node
3.9 Camera Node
3.10 Experiment Setup
4 Results
4.1 Results from Pick and Place Operations
4.1.1 GraspIt
4.1.2 Contact-GraspNet
4.1.3 Special Cases
5 Discussion & Conclusion
5.1 Discussion
5.1.1 Contact-GraspNet: Baseline vs Retrained
5.1.2 Effect of Number of Contact Points for GraspIt
5.1.3 Comparison between Models
5.1.4 Effect of available setup
5.2 Future Work
5.2.1 Improvements to the Grasp Synthesis Model
5.2.2 Extensions and Improvements to the Pick and Place Pipeline
5.3 Conclusion
Bibliography
A Appendix 1
A.1 Custom Service Files
A.1.1 GenerateGrasp.srv
A.1.2 PickAndPlace.srv
A.1.3 SetDesiredPoseFrame.srv
A.1.4 SetGripperState.srv

List of Figures

1.1 Current vs Volvo Group vision of the kitting process. Rather than finding the correct object in one of the bins, it would be delivered to the worker through automated guided vehicles.
2.1 Image taken from the official ROS2 tutorials showing an example node setup with common ways of inter-node communication (described in section 2.1.2).
2.2 Example of coordinate frames in TF2. The coordinate frames and their transformations on the robot were defined through the robot URDF file, whereas the coordinate frames of the objects were manually defined.
2.3 Friction cones used to determine whether the forces applied at a contact can be resisted by friction. Taken from [18].
2.4 Robotiq Adaptive 2F-140 gripper grasping a toy airplane. Results gathered after 100,000 iterations using the Guided Potential Quality energy function.
2.5 Figure showing the architecture of Contact-GraspNet (from left to right). The input is the RGB-D image and segmentation map from the camera, and the output is the synthesised grasp poses with estimated probabilities of success.
2.6 Image taken from the original Contact-GraspNet paper [8]. Figure showing all estimated variables from each head in relation to each other. a⃗ and b⃗ are the approach and grasp direction vectors, w is the grasp width, and c is the contact point.
3.1 Plots showing the loss and validation loss for both the baseline model's training and the retrained model. "dir loss" represents the grasp direction estimation loss, "ce loss" (short for cross entropy loss) represents the contact point classification loss, "off loss" represents the grasp width estimation loss, and "app loss" represents the approach vector estimation loss. Baseline (blue) includes 14,000 iterations over 16 epochs while retrained (orange) includes only 1,300 iterations over 16 epochs.
3.2 ROI selection applied to image (1) and resulting SAM segmentation map overlaid on top of the original image (2).
3.3 The inference node receives camera data from dedicated topics. The depth and color image as well as the camera intrinsics are passed into the inference script along with a segmentation map generated from the color image and a manually selected ROI. The inference script outputs up to 200 grasps, many of which can be duplicates, which are filtered out. Finally, the poses are converted to ROS-interpretable TF2 frames.
3.4 An example of the worlds used to generate grasps for one of the tested objects. Note the red lines on the gripper fingers, which are the contact points and their normals; see section 2.2 for the definition of contact points.
3.5 Figure describing a single pass of grasp synthesis using the grasp database generated by GraspIt. Filtering depends on the specific object that is going to be grasped and on how many attempts have been made so far to generate a grasp.
3.6 Filter angle representation. The figure shows an example object coordinate frame with an example grasp pose coordinate frame above it. Θ shows the angle between the position vector of the grasp pose and its projection onto the horizontal plane, Ψ shows the yaw Euler angle of the grasp pose (rotation around the z-axis), and Φ is the angle between the projected position vector of the grasp pose and the x-axis.
3.7 Two different contact point sets defined, which lead to different grasp poses stored in the database.
3.8 System architecture shown through the ROS2 nodes. The red arrows are connections through topics (i.e. publishers and subscribers), whereas the purple arrows are connections through services. Here the Pick & Place node contains the client of each service called.
3.9 The service that uses the MoveIt interface, showing the expected inputs and outputs of the service.
3.10 The service that controls whether the gripper is open or closed.
3.11 The service that generates grasps when called.
3.12 The service that performs the pick and place task by iterating through sub-tasks using the other services.
3.13 10 generated grasps on an object displayed in Rviz as MarkerArrays along with their index.
3.14 Figure showing the robot used to test the implemented pick and place pipeline and grasp pose synthesis models.
3.15 The four different objects that grasps can be generated for using both of the methods. While Contact-GraspNet can work on many different objects effectively, GraspIt required more information about the objects, provided through CAD files.
3.16 Objects all placed in their fixed poses for testing, in relation to the robot and the blue bucket.
3.17 Pictures showing the difference between configurations as seen by MoveIt's planning scene. Right shows the grouped object setup, where the green boxes are modelled obstacles, whereas left shows tests using single objects (note the missing green boxes around the objects).
4.1 Example grasp of object 2 using GraspIt. Left shows the point at which the gripper was closed, right shows the object at its starting pose before being dropped off.
4.2 Example grasp of object 1 using GraspIt. Left shows the point at which the gripper was closed, right shows the object at its starting pose before being dropped off.
5.1 Figure showing the difference in clustering of generated grasps between the baseline model (left) and the retrained model (right) on the same object.
5.2 Baseline model generating grasps on object 3 with object 4 present in the scene (left) and without object 4 (right).
5.3 Retrained model generating grasps whose poses are located within the object.

List of Tables

3.1 Filter angles (θ1, θ2, ψmax, ϕmax) (in degrees) based on object and attempt number. These angles are part of what determines which synthesised grasp pose is the "best" and most likely to succeed.
4.1 GSR (%) / PNPSR (%) results for the GraspIt based grasp-pose synthesis method. Contact points shown in figure 3.7.
4.2 GSR (%) / PNPSR (%) results for the Contact-GraspNet based grasp-pose synthesis method, where "Baseline" is the model using the weights from the public Contact-GraspNet repository, and "Retrained" is the model using weights trained for the Robotiq gripper.
4.3 Table showing the fraction of the successes that were due to the object slipping into a more stable configuration such that it would not fall, and the fraction of the cases where failures were due to no valid grasps being generated at all.

1 Introduction

Many companies introduce automation onto their production lines to safely increase productivity and output. Automation comes in many forms, such as using conveyor belts to transport products, using camera systems for quality assurance, or using robotic arms to lift heavy objects. The use of robotic arms has become synonymous with automation thanks to their ability to repetitively complete precise, heavy, and hazardous tasks such as welding, painting, or material handling.

With improved and cheaper technologies, research over the past couple of decades has focused on making these automation solutions more generalizable and intelligent [1]. For example, in robotic manipulation, the utilization of machine learning methods for object recognition and grasp planning can eliminate the need for statically defining the grasp location and trajectory. These advancements benefit companies by streamlining production lines, reducing human intervention, and increasing system adaptability to minor variations. One application of these advancements is bin picking, a popular research topic in industrial automation [2].
Bin picking refers to the process of using a robotic arm equipped with a vision system to identify, pick up, and place objects from a bin or container. This is a crucial application in industrial automation, particularly in manufacturing, logistics, and warehousing. However, bin picking presents several challenges [3], such as accurately distinguishing individual objects in a cluttered environment and ensuring the system operates quickly and reliably in a production setting.

Volvo Group wishes to implement such a system in an attempt to solve an inefficiency in supplying their kitting stations with the necessary kit components. Volvo envisions a solution where a robotic manipulator is attached to the component rack; the manipulator receives an order of components, which are then picked and placed aside for transport to the desired kitting station. For example, consider a brake caliper. The caliper is composed of multiple components, such as brake pads, bushings, and seals, that need to be assembled together. A worker currently needs to find these components in the correct rack and bucket and bring them back to their kitting station. Instead, Volvo wishes to have a robot system attached to the racks which gathers the desired components from the rack onto an automated guided vehicle, which then transports the parts to the stations. An example of how kitting stations are used today compared to this vision can be seen in figure 1.1.

Figure 1.1: Current vs Volvo Group vision of the kitting process. Rather than finding the correct object in one of the bins, it would be delivered to the worker through automated guided vehicles.

Volvo is currently researching one possible way to achieve their vision of having a robotic system detect and grasp objects for further transport. Volvo identified that this can be achieved through grasp pose synthesis, which is the process of generating valid grasp poses to attempt to pick up an object [4], [5]. However, their current implementation takes a long time to set up, since the grasp pose needs to be manually configured for each object. This implementation also uses expensive proprietary hardware and software, which would not be feasible to use on a large scale. These challenges are addressed in this thesis by relying on a cheaper camera and by developing a compatible method for synthesising these grasp points using only the visual information given by the camera and existing CAD (Computer Aided Design) files of the objects.

1.1 Objective

In order to determine a reliable method of synthesising grasps, two methods are implemented, tested, and compared: one method where grasp poses are synthesised offline using analytical models and then queried during runtime, and another method utilizing a machine learning model to generate grasps from vision data. The goal of the thesis is to answer the following research question:

Is a deep-learning model more viable than an analytical model for the purpose of grasp pose synthesis, given that the objects being picked are already known?

In order to answer this research question, we developed a pipeline which allows for easy and repeatable testing of grasp pose synthesis models, where we test their effectiveness in generating grasp poses that are good enough for a pick-and-place task. We measured this effectiveness through the use of two metrics: Grasp Success Rate (GSR) and Pick-and-Place Success Rate (PNPSR). GSR measures how often the model generates a grasp that is able to simply pick up the object. We measure
it as a success if the robot is able to move the object from its current pose to an intermediary pose. PNPSR is similar to GSR; the main difference is that it measures how often the robot is able to complete the full pick-and-place task using the grasp pose synthesis model. Both are measured by counting how many successful attempts were made out of a total number of attempts.

What differentiates our approach from the current research is that many of the cutting-edge models assume no previous knowledge of the object when generating grasp poses. This means that the models do not require the full 3D CAD file of the object they attempt to grasp; usually a depth camera and a colour image of the scene are sufficient. We are more restrictive in our approach for two reasons:

• In manufacturing, it is very rarely the case that there are no CAD files for any of the objects to be picked from a bin.
• Access to the 3D CAD files allows the use of analytical models, which in turn allows for a stored database of possible grasp poses around an object.

1.2 Assumptions and Limitations

One of the main assumptions is that there will never be any "unknown" objects, i.e. objects for which 3D CAD files are not available. This is what makes the comparison of methods feasible, as cases in which previously unknown objects can be encountered would make the analytical models useless, since they require the 3D CAD files to generate grasps.

Since our research question focuses on grasp pose synthesisers and will compare two different models, we must ensure that the data collected from our tests is not affected by other factors. Estimating the pose of the object is out of scope for this project. Therefore, we utilized known, fixed object poses for testing. The reason for this choice is that we do not wish any vision-based pose estimator to affect the performance of the grasp pose synthesising methods.

From the files and objects that Volvo can share, only 4 were available in both 3D CAD and physical form. Therefore, testing will only be performed on these 4 objects.

For training and inferring grasp poses using the deep-learning model, we are limited to the hardware that is available. For both training and inference, a laptop with 16 GB of RAM is connected to an external GPU, an Nvidia RTX 3090 with 24 GB of VRAM. This limitation in RAM proved an issue for training and will be discussed later.

The thesis is limited to only testing cases where objects are positioned in a structured way on a table in front of the robotic arm, as opposed to the final configuration where objects will be placed in buckets on racks. The reason for this is that it would be very difficult, if not impossible, to manually provide accurate poses for objects in a pile, so a structured setup was chosen instead to make repeatable tests possible.

Lastly, the thesis will only make use of a single gripper, which is a parallel-finger gripper from Robotiq [6]. This is a common type of gripper in the industry, but there are other types, such as grippers with three or more fingers, human-like grippers, and vacuum/suction grippers. These other grippers will have grasp poses that are different from those of the Robotiq gripper, and are therefore not considered in this thesis.

1.3 Literature Review

1.3.1 End-to-End Deep Learning Models

Deep learning is a promising approach for grasp synthesis, driven by the rise of data-based solutions and advances in machine learning [4], [5].
One simple version [7] would be to generate grasps from a top-down perspective, which assumes the gripper will always be positioned vertically, such that only the 2D position and the yaw orientation need to be predicted. This provides a compact output that is easier for models to infer, but lacks generalized performance, since the best point to grasp an object in different orientations is not always from the top. Therefore, in general cases, the full 6 degrees of freedom (DoF) are required to reliably find optimal grasps, but this comes at the cost of making the model and computations more complex.

Grasp synthesis in 6 DoF is a difficult problem to solve due to the number of dimensions that need to be "searched" in order to find a viable solution. One possible solution to this is to add constraints on some dimensions. Sundermeyer et al. [8] use what they call contact points, which are surface points from the point cloud representing the object, at which the parallel gripper must make contact. This reduces the dimensionality of the problem by redefining it as a 3 DoF estimation around the estimated contact point. They showed improved results over the baseline model GraspNet [9] and showed that even unknown objects can still be grasped in a satisfactory manner. They also show promising convergence rates for training and integration of segmentation maps for filtering the input depth image. It is for these reasons that we decided to utilize Contact-GraspNet [8] for our implementation.

Deep learning models other than the aforementioned end-to-end models have also been developed. A combined model-based and reinforcement learning (RL) solution was developed by Borja-Diaz et al. [10]. Rather than generating a grasp pose and then using a planner to move the robot to the grasp pose, they seek to remove that step and use their solution both for grasp pose estimation and for navigation towards the goal pose. Standard deep learning models learn affordance regions (areas where the robot can interact with an object for a specific affordance/task) and estimate the object's centre. The robot then uses an RL model to navigate towards the goal pose. The use of RL allows for better navigation, as it can more easily deal with errors in the prediction due to the stochastic nature of estimation based on visual information. This solution is able to deal with never-before-seen objects as well as tasks/affordances it was previously unfamiliar with, and it can estimate where it would need to grasp in order to accomplish the task, doing so more effectively than the baseline it was compared against. The reason this is useful is that a good place to grasp an object is highly dependent on what it will be used for. For example, you can grasp scissors by the blades (as long as they are closed) in order to simply transport them, but you cannot do that if you plan to actually use the scissors. However, due to the nature of the implementation, there can only be one affordance region per object, and additional constraints on the region or grasp pose cannot be added without changing the critic in the reinforcement learning model or retraining the affordance region estimator model.

Other solutions have also been developed, such as using transformers and/or attention modules to allow for a more human-like understanding of where to grasp.
For example, a transformer-based model might infer that grasping one end of a long, thin object (like a pen) might be unstable and instead suggest a grip closer to the center [11], [12]. Additionally, some approaches leverage graph representations of the input point cloud, enabling the use of graph neural networks and attention mechanisms to capture spatial relationships more effectively [13], [14]. These networks, however, were shown not to work in real-time grasp synthesis applications due to the large size of the models and the resulting high computational cost, leading to slow inference.

1.3.2 Datasets for Deep Learning Models

Developing deep learning models requires a vast amount of data for training and validation. However, collecting this data is a time-consuming process, as it often involves many millions of samples. Collecting and labelling this amount of data is infeasible for this thesis; therefore, existing datasets such as GraspNet-1Billion [9] and ACRONYM [15] can be used to train initial models. Both datasets are the basis for well-performing models [9], [8], but they differ in their implementation: GraspNet-1Billion is based on real RGB-D images of 88 objects with over 1 billion annotated grasps, while ACRONYM is a synthetic dataset comprised of 8,872 objects and 17.7 million annotated grasps. This gives GraspNet-1Billion a much denser set of annotations, while ACRONYM boasts a much wider selection of objects.

The creators of GraspNet-1Billion also provide a baseline model to compare results against, using common metrics in grasp synthesis research, such as grasp success rate, that were shown to be popular in recent literature surveys [4], [5].

1.3.3 Analytical Grasp Pose Synthesisers

The previously mentioned approaches mostly focus on generating grasp positions online, meaning in real time while the robot is actively operating. The alternative is offline methods that aim to create a large set of possible grasp positions that an online evaluation function can choose from, such as Kleeberger et al. [16], who use clustering algorithms to pre-compute a diverse set of possible grasp positions for known objects. The grasp positions can also be generated using more intuitive heuristics, such as Euclidean distance from the reference frame, or based on constraints such as collisions or infeasible joint angles [17].

Other alternatives include attempting to find grasps that will ensure the object does not slip from between the gripper's fingers when moving. This is done by simulating the friction between contact points, and the forces and torques around these points. One of the most popular simulators for this purpose is GraspIt [18], where friction as well as wrenches (forces and torques) are modelled, and tools are provided in the simulator to find suitable grasps for a multitude of different grippers.

The problem with finding grasps through suitable forces, however, is that the dimensionality of the search is very high. This can lead to very long search times, and finding optimal solutions would be difficult. This search problem can instead be turned into an optimisation problem, as shown in [19], where, using neuroscience research, the degrees of freedom of modelled human grasps were reduced from about 20 down to 2 by using so-called eigengrasps.
Then, by using cost functions, these eigengrasps along with the grasp poses can be evaluated on whether they are good grasps or not, and solvers like simulated annealing [20] can be used to find the local minima of these cost functions, and hence the best grasps.

The issue with these simulators is that they require knowledge of the object's pose in order to make use of the generated grasps, and since pose estimation is a non-trivial problem, how the user solves it is a design choice that must be made. Deep learning models avoid this issue, as they do not require object pose information, but they are usually fixed to a single camera perspective instead.

2 Theory

Robotics is inherently a multi-disciplinary field, combining mechanical, electrical, and software engineering. Therefore, some relevant information regarding vital parts of the thesis is provided in this chapter.

2.1 Robot Operating System

Establishing communication between different components of a robotic system is essential for ensuring coordinated and efficient operation. This is achieved using the Robot Operating System (ROS), specifically ROS2 Humble [21]. ROS is a framework that allows easier and quicker implementations for robotics thanks to its open-source foundation. According to the developers of ROS, it is a popular tool in research and academia thanks to its ease of use and quick implementations, which allow the user to focus on the application rather than setting it up. This is due to the infrastructure that ROS provides, which is communication through so-called topics, services, and nodes (among many other features).

2.1.1 Nodes

A node in ROS can be considered a block that is responsible for some behaviour. For example, a node can represent a sensor taking measurements of the environment, doing any pre-/post-processing that is necessary, and then providing the result to other nodes through so-called topics. These nodes can also set specific parameters that could be useful to know across different nodes as ROS parameters. Using a depth camera as an example, a node that represents it could set a maximum depth/distance value as a ROS parameter so that other nodes, like the control system, could take that information into account.

2.1.2 Topics and Communication

Communication between the nodes can be done in many ways, depending on the exact communication behaviour desired. One of the most common ways is making use of topics. Topics are useful for continuous, stream-type communication, which is common for sensor readings, both from sensors that capture the environment (cameras, LiDARs, etc.) and sensors that capture robot states (encoders, IMUs, etc.). Typically, topics have one node publish information to them through publishers, and one or more nodes listen to the published information through subscribers. If desired, multiple nodes can publish to the same topic, but if done simultaneously, it can lead to unexpected behaviour if one is not careful. The information that is published to the topics must be standardised through message types defined in message files. Within these publishers and subscribers, the user can define some script to process the message however they wish in order to make use of the information. The standard example to explain publishers and subscribers is a topic that takes string messages, with a publisher that sends "Hello" and an incrementing value, and a subscriber that reports what it received from the topic.
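A minimal rclpy sketch of this standard example is shown below; it is illustrative only, and the node and topic names are not taken from the thesis.

```python
# Minimal ROS2 publisher/subscriber sketch (illustrative; names are placeholders).
import rclpy
from rclpy.executors import SingleThreadedExecutor
from rclpy.node import Node
from std_msgs.msg import String


class Talker(Node):
    def __init__(self):
        super().__init__('talker')
        self.pub = self.create_publisher(String, 'chatter', 10)
        self.count = 0
        self.create_timer(0.5, self.tick)  # publish every 0.5 s

    def tick(self):
        msg = String()
        msg.data = f'Hello {self.count}'
        self.count += 1
        self.pub.publish(msg)


class Listener(Node):
    def __init__(self):
        super().__init__('listener')
        self.create_subscription(String, 'chatter', self.on_msg, 10)

    def on_msg(self, msg):
        self.get_logger().info(f'I heard: "{msg.data}"')


def main():
    rclpy.init()
    talker, listener = Talker(), Listener()
    executor = SingleThreadedExecutor()
    executor.add_node(talker)
    executor.add_node(listener)
    try:
        executor.spin()
    finally:
        rclpy.shutdown()


if __name__ == '__main__':
    main()
```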
If the user instead wishes to trigger a process and receive a response, one way to accomplish that is through services. Services are defined through servers and clients, where the server defines the process that the service is expected to perform and generates the response, and the client is where the request is generated. Similar to topics, services are defined through standardised service types. The services used in the thesis are defined in the appendix. The common example to explain services is a simple sum service, where the client sends a request with two values to be added, and the server returns the sum.

Figure 2.1 shows how communication through topics and services occurs. The publisher sends its messages continuously through the topic, and the topic sends duplicates of this message to any subscribers attached to it. Services are more one-to-one, where the client gets a response that is dependent on the request sent to the server.

Figure 2.1: Image taken from the official ROS2 tutorials (https://docs.ros.org/en/humble/Tutorials/Beginner-CLI-Tools/Understanding-ROS2-Nodes/Understanding-ROS2-Nodes.html) showing an example node setup with common ways of inter-node communication (described in section 2.1.2).

2.1.3 Coordinate Frames

In robotics, keeping track of where things are is essential for accurate planning and control. To accomplish this, coordinate frames and transformations are used. Positions must be defined relative to some fixed point, and in robotics there is at least one fixed point that everything else is relative to, commonly referred to as "world". However, it is sometimes simpler to change this frame of reference to another one that is more suitable, for example the lens of a camera or the base of a robot arm. Therefore, there are many frames of reference, called coordinate frames, and the change from one coordinate frame to another is called a transformation.

To aid with transformations, the ROS2 package TF2 [22] is used. The user only needs to define the coordinate frames relative to a parent, and the package will figure out the necessary transformations. The links and joints of a robot can be abstracted as transformations, and using them, the end-effector of the robot can be tracked for any valid joint values. The user can define the robot through a xacro or URDF file, and ROS will interpret this file as a set of transformations. Once initialised, all of the latest transformations between the coordinate frames are stored in a buffer, and all interfacing with TF2 requires that the buffer is kept up to date. Any new transformations are added to the buffer, and therefore the current pose of any link can be extracted through the buffer itself, defined in terms of any of the coordinate frames, so long as there is a transformation between the two. Examples of some coordinate frames can be seen in figure 2.2.

Figure 2.2: Example of coordinate frames in TF2. The coordinate frames and their transformations on the robot were defined through the robot URDF file, whereas the coordinate frames of the objects were manually defined.
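As a brief illustration of how such a pose can be queried from the TF2 buffer at runtime, the sketch below looks up the transform between two frames; the frame names ('world', 'tool0') are placeholders and not necessarily those used in the thesis.

```python
# Sketch of a TF2 buffer lookup in ROS2 (illustrative; frame names are placeholders).
import rclpy
from rclpy.node import Node
from rclpy.time import Time
from tf2_ros import Buffer, TransformListener


class FrameTracker(Node):
    def __init__(self):
        super().__init__('frame_tracker')
        self.tf_buffer = Buffer()
        self.tf_listener = TransformListener(self.tf_buffer, self)
        self.create_timer(1.0, self.lookup)

    def lookup(self):
        try:
            # Latest available transform from "world" to "tool0" (end-effector).
            t = self.tf_buffer.lookup_transform('world', 'tool0', Time())
            p = t.transform.translation
            self.get_logger().info(f'tool0 in world: ({p.x:.3f}, {p.y:.3f}, {p.z:.3f})')
        except Exception as exc:  # the frame may not be in the buffer yet
            self.get_logger().warn(str(exc))


def main():
    rclpy.init()
    rclpy.spin(FrameTracker())
    rclpy.shutdown()


if __name__ == '__main__':
    main()
```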
2.1.4 Controllers and MoveIt!

Moving a robot is commonly done through controllers. The responsibility of a controller is to take in the current joint values/angles and the desired joint values, and decide what inputs to apply to each actuator to reach the desired joint values. Depending on the controller, the user can decide to control different aspects. Commonly, the user controls the joint positions (joint values), but the user can also choose to control the forces and torques applied at each joint. The choice of controller, and its tuning, is application dependent. ROS2 already provides many common control packages, and robot manufacturers also commonly provide packages to control their products/robots through ROS. However, these controllers are basic and are not sufficient on their own for robotics applications.

In the case of robotic manipulation, the controller packages do not consider collisions with the environment when reaching the desired joint values. This is left to a process called motion planning. There are many different algorithms for motion planning, and many collections of packages that already implement them. One of the most common packages for motion planning and its applications is MoveIt! [23], as it combines implementations of popular stochastic motion planning algorithms. MoveIt allows for an easy connection between controllers and motion planning, and allows the robot to consider the environment when planning its path. Additionally, it also allows for planning a path in which the end-effector travels in as straight a path as possible, by controlling the end-effector's path in Cartesian space. This is beneficial, as planning in joint space may require more movement than necessary to bring the end-effector from its current pose to its goal pose, since planning in joint space considers the joint values at the goal pose instead of the goal pose itself.

2.2 GraspIt

One of the grasp synthesis methods used for this thesis is a simulation environment called GraspIt! [18]. This simulator models friction between two objects using the Coulomb model. Simply put, the forces that can be applied at the contact point between two objects/materials are determined by the coefficient of friction between the two materials. The coefficient defines the opening angle of a cone (called the friction cone), and as long as the force lies within this cone, its magnitude can be withstood and resisted by friction.

These friction cones are vital, as GraspIt attempts to find grasps that have a property called force-closure, where the grasp is able to resist any disturbance wrench (combination of force and torque), under the assumption that the contact forces are strong enough. This is done by determining if the wrench space origin is contained within the generated Grasp Wrench Space (GWS), which is "the space of wrenches that can be applied to an object by a grasp given limits on the contact normal forces" [18]. Colloquially, force-closure is simply a check of a "strong grip": that the object will not slip from the grasp of the gripper.

Figure 2.3: Friction cones used to determine whether the forces applied at a contact can be resisted by friction. Taken from [18].
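For reference, the Coulomb friction cone and the force-closure check described above can be written compactly as follows (a standard textbook formulation, not an equation taken from the thesis):

```latex
% Coulomb friction cone at a single contact: f_t and f_n are the tangential and
% normal components of the contact force, mu is the friction coefficient, and
% theta is the angle between the contact force and the inward contact normal.
\[
\lVert \mathbf{f}_t \rVert \;\le\; \mu\, f_n
\qquad\Longleftrightarrow\qquad
\theta \;\le\; \arctan(\mu)
\]
% Force closure: the origin of wrench space lies in the interior of the
% Grasp Wrench Space (GWS), so the contact forces can cancel any sufficiently
% small disturbance wrench applied to the object.
\[
\mathbf{0} \;\in\; \operatorname{int}\!\big(\mathrm{GWS}\big)
\]
```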
Force closure is not enough to determine if a grasp is good, however. To aid in this search, two steps are taken to improve grasp quality and the speed at which good grasps are found: dimensionality reduction, and conversion from a search problem to an optimisation problem. Both of these steps are described in detail in [19]. The dimensionality of grasps is reduced from the number of degrees of freedom that a gripper may have to a set of so-called "eigengrasps", which are common postures taken by the gripper. For example, with a human hand, one eigengrasp is described as "thumb rotation, thumb flexion, MCP flexion, and Index abduction" [19]. This can effectively be converted to a single dimension with a minimum and maximum "joint" value. One or two of these eigengrasps are then enough to significantly reduce the dimension of the search.

By using so-called "energy functions", which take the eigengrasp amplitudes (the value between the minimum and maximum for each eigengrasp) and the pose of the grasp and return a single value representing how good the grasp is, the problem of searching for a good grasp pose is converted into an optimisation problem. The specific energy function recommended in GraspIt is called Guided Potential Quality Energy; it first checks if the grasp has force-closure, and if it does, it evaluates how close the contact points between gripper and object are to the user-defined desired contact points. The planner/simulator uses simulated annealing for the optimisation, and since that method minimises the search function, the energy value is made negative. This means that any grasp with a non-zero energy value has force closure, and the lower the value, the closer each contact point gets to the user-defined points. With these user-defined points, the user can determine whether only the tips of the gripper fingers should be used, or whether the object should be covered by as much of the gripper as possible.

(a) Grasp with the highest energy score. Thanks to the force-closure criterion, it is highly likely that the toy airplane will not slip from its grasp. (b) Grasp with the second highest energy score. Thanks to the force-closure criterion, it is highly likely that the toy airplane will not slip from its grasp.

Figure 2.4: Robotiq Adaptive 2F-140 gripper grasping a toy airplane. Results gathered after 100,000 iterations using the Guided Potential Quality energy function.

2.3 Contact-GraspNet

Contact-GraspNet [8] is a grasp synthesis model published at ICRA 2021 by Nvidia and TUM researchers Sundermeyer et al. They proposed an end-to-end network that efficiently generates a distribution of 6-DoF parallel-jaw grasps directly from a depth recording of a scene. This depth recording can be an RGB-D image coupled with the camera intrinsics and, optionally, a segmentation map of the objects in the scene, or it can be a point cloud composed of 3-dimensional points. Including a segmentation map allows the model to filter out grasps that were generated outside of the intended object. The segmentation map also allows the use of local regions of interest for preprocessing the depth scene by cropping the scene to focus only on the segmented area. This maximizes the number of contact points the model can find as well as minimizing the inference time.

Figure 2.5 shows the architecture as the paper defines it. First, the dense point cloud is converted to a feature list, which the model can learn from. To do this, a highly popular architecture called PointNet++ is used [24]. PointNet++ can be interpreted as a more effective alternative to volumetric CNNs (Convolutional Neural Networks) that similarly captures features in 3D space, but without rigidly scanning the space with a set stride. These features are then fed to four different heads, which are used for the four different estimation tasks.

Figure 2.5: Figure showing the architecture of Contact-GraspNet (from left to right).
The input is the RGB-D image and segmentation map from the camera, and the output is the synthesised grasp poses with estimated probabilities of success.

The heads for the approach and baseline (which we will refer to as grasp direction) vectors have very similar structures, as they are both estimating vectors. The outputs of these heads are vital in determining the orientation of the grasp pose. The grasp width head is a simpler variant, estimating the width through equidistant bins ranging from zero to the maximal width of the gripper. Each bin contains a range of values, and the final grasp width is the centre value of the bin with the highest confidence.

The final head (the Prediction Scores Head) is the basis of the grasp pose representation, as this head is used to predict the contact point c, which is the point from the point cloud that the gripper will be in contact with, and relative to which the approach and grasp direction vectors are applied. This head converts the input point cloud into likelihoods of grasp success, such that the point with the highest likelihood of success is subsequently chosen as c. The combination of all four heads can then represent a parallel grasp, as seen in figure 2.6.

Figure 2.6: Image taken from the original Contact-GraspNet paper [8]. Figure showing all estimated variables from each head in relation to each other. a⃗ and b⃗ are the approach and grasp direction vectors, w is the grasp width, and c is the contact point.

The network is trained on the ACRONYM dataset [15], which consists of 8,872 meshes from the ShapeNet dataset [25] and 17.7 million simulated grasps under varying friction. These meshes are placed in random stable poses in scenes. Point clouds of the scenes are then rendered and used as inputs that can be compared to the ACRONYM ground truths.

3 Methodology & Design

In this chapter, we present how and why we implemented the two different methods for synthesising grasps. We also go over the implementation for using the robotic manipulator with ROS2 and how we designed the experiments to compare the two synthesis methods.

3.1 Using a Deep Learning Model to Generate Grasps

We decided to use Contact-GraspNet for our deep learning based solution because it stands out among grasp synthesis models for several practical and technical reasons, particularly because its authors claim stellar performance and sufficiently fast runtime for complex scenes. The model also accepts segmentation maps as an input, allowing it to operate in cluttered environments.

The Anaconda environment provided in the official GitHub repository (https://github.com/NVlabs/contact_graspnet) is not compatible with 3000-series Nvidia cards; therefore, different versions of several libraries needed to be used. The model also utilizes TensorFlow 2.5, which has limited compatibility with newer libraries. The environment used in our implementation is based on a forked repository (https://github.com/tlpss/contact_graspnet) that also implements Contact-GraspNet for Docker and a 3000-series Nvidia card.

The original configuration for training the model is built around the Franka Emika Panda gripper, which has a maximum grasp width of 80 mm. For our implementation, the model needs to accommodate a 140 mm grasp width. The model's authors recommend either retraining the model with the necessary configuration, or scaling the point cloud dimensions and the resulting grasp positions using the fraction Wd/Wp, where Wp is the Panda gripper width and Wd is the desired gripper width.
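A minimal sketch of this rescaling is shown below. It assumes the scene point cloud is shrunk into the Panda gripper's scale before inference and the predicted grasp translations are scaled back afterwards; the constants and function names are illustrative, not the thesis' actual code.

```python
import numpy as np

# Gripper widths in metres (Panda: 80 mm, Robotiq 2F-140: 140 mm).
W_PANDA = 0.08
W_DESIRED = 0.14
SCALE = W_PANDA / W_DESIRED  # shrink the scene so the 140 mm gripper behaves like an 80 mm one


def scale_cloud_for_inference(points_xyz: np.ndarray) -> np.ndarray:
    """Scale an (N, 3) scene point cloud before feeding it to the pretrained model."""
    return points_xyz * SCALE


def unscale_grasp(grasp_pose: np.ndarray) -> np.ndarray:
    """Scale the translation of a predicted 4x4 grasp pose back to the real scene."""
    grasp_pose = grasp_pose.copy()
    grasp_pose[:3, 3] /= SCALE  # i.e. multiply by Wd/Wp; rotations are unaffected by uniform scaling
    return grasp_pose
```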
Scaling the input and output works very well for many objects and has the benefit of using the provided model weights, which are advertised to have great performance. However, when using the model with RGB-D images captured with the RealSense camera, there were issues generating grasps for objects with heights lower than approximately 5 cm. Scaling down the point cloud also has the negative effect of making the scene tighter, so that it is harder to produce collision-free grasps. Therefore, we also attempted to retrain the model to validate whether a reconfigured model could remedy these problems.

3.1.1 Retraining the Model

Contact-GraspNet is trained on the ACRONYM dataset, which is built using mesh data from ShapeNetSem [26], a smaller and more densely annotated subset of the large 3D CAD dataset ShapeNet. Compared to ShapeNet, ShapeNetSem enriches the dataset with useful semantics such as weights, material composition, and physical sizes of common household items. These meshes then need to be processed further to make them watertight, meaning they do not have holes, gaps, or overlaps in their surfaces. Finally, the mesh complexity needs to be reduced in order to load as many objects as possible into memory at once. These meshes are then combined with the ACRONYM grasp annotations and Contact-GraspNet's provided scene configurations and contact point annotations.

Figure 3.1: Plots showing the loss and validation loss for both the baseline model's training and the retrained model. "dir loss" represents the grasp direction estimation loss, "ce loss" (short for cross entropy loss) represents the contact point classification loss, "off loss" represents the grasp width estimation loss, and "app loss" represents the approach vector estimation loss. Baseline (blue) includes 14,000 iterations over 16 epochs while retrained (orange) includes only 1,300 iterations over 16 epochs.

The Contact-GraspNet authors recommend a system with a CUDA-capable GPU with more than 24 GB of VRAM and 64 GB of system RAM, to be able to load the entire dataset into memory at once. The host computer used for this task was limited to 16 GB of RAM and could therefore only load a small subset of the dataset. Less training data leads to decreased model performance, especially in terms of generalisation, as overfitting to certain objects becomes more likely. In our case, the model did not run enough iterations to start improving the grasp direction estimation, as can be seen in figure 3.1, where "dir loss" only improved slightly over all 16 epochs compared to the training logs from the baseline model. The grasp direction vector is the vector from the finger of the gripper to the contact point, and is orthogonal to the approach vector.

3.1.2 Segmentation Map

One of the inputs to the grasp synthesis model is a segmentation map of the object(s) in the scene that grasps should be generated for. Meta's Segment Anything Model (SAM) [27], combined with manual region-of-interest (ROI) selection, was used to generate a segmentation map of the object of interest. The ROI selection is made with an interactive OpenCV [28] window.

Figure 3.2: ROI selection applied to image (1) and resulting SAM segmentation map overlaid on top of the original image (2).
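A minimal sketch of this ROI-plus-SAM step is given below; the SAM checkpoint, model type, and file names are assumptions rather than the exact setup used in the thesis.

```python
# Sketch of ROI selection + SAM segmentation (illustrative; checkpoint path,
# model type, and image source are assumptions, not the thesis' exact setup).
import cv2
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Load a SAM checkpoint (here assumed to be the ViT-H weights downloaded separately).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# Read the colour image and let the user draw a region of interest around the object.
image_bgr = cv2.imread("scene_color.png")
x, y, w, h = cv2.selectROI("Select object", image_bgr, showCrosshair=True)
cv2.destroyAllWindows()

# Prompt SAM with the ROI as a box prompt; take the single best mask as the segmentation map.
predictor.set_image(cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB))
masks, scores, _ = predictor.predict(
    box=np.array([x, y, x + w, y + h]),
    multimask_output=False,
)
segmap = masks[0].astype(np.uint8)  # 1 inside the object, 0 elsewhere
```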
3.1.3 Inference for Generating Grasp Poses

During runtime, the inference is performed in a ROS2 node; figure 3.3 shows the pipeline of the node. The inference script allows for several runtime arguments to increase the probability of finding a grasp as well as to filter what grasps are returned. These are the runtime arguments used:

• forward_passes: Determines how many batched forward passes the model will perform; more passes increase the number of potential grasps but also increase inference time.
• z_range: Limits the search for grasps to within the thresholds [zmin, zmax].
• local_regions: A boolean flag determining whether to reduce the search area to the region around the provided segmentation map.
• filter_grasps: Only returns grasps in the scene that have a contact point inside the segmentation map.
• skip_border_objects: Ignores grasps generated on the boundaries of the final depth image.

Figure 3.3: The inference node receives camera data from dedicated topics. The depth and color image as well as the camera intrinsics are passed into the inference script along with a segmentation map generated from the color image and a manually selected ROI. The inference script outputs up to 200 grasps, many of which can be duplicates, which are filtered out. Finally, the poses are converted to ROS-interpretable TF2 frames.

Each resulting grasp pose is a homogeneous matrix describing the rotation and translation of the grasp. The rotation then needs to be converted into a quaternion, and the pose is finally placed in a transformation frame with the camera colour lens as the reference frame.

3.2 Using an Analytical Model to Generate Grasps

For the analytical model, we chose to use GraspIt. The reason for this choice is that the GraspIt simulator models the forces applied at the contacts and ensures that the object will not slip from the grasp, and it allows for custom evaluation functions if need be, making it highly adaptable to Volvo's use case.

3.2.1 Environment Setup

In order to use GraspIt in ROS2, the user must first generate a database of possible grasps. This is done through packages provided by the repository in [29]. These packages allow for direct interfacing with GraspIt, without having to use the graphical window. They require that a setup is defined in so-called "worlds"; an example setup is shown in figure 3.4. The table in this environment is loaded in as an obstacle, so that GraspIt knows not to generate grasps that would lead to collisions with the table.

Figure 3.4: An example of the worlds used to generate grasps for one of the tested objects. Note the red lines on the gripper fingers, which are the contact points and their normals; see section 2.2 for the definition of contact points.

As described in section 2.2, the user must add specific contact points along the gripper. Then, the ROS1 interface to the GraspIt planner generates a list of the best grasp poses it can find and stores them all in separate files, defined in a similar fashion to the input world file. A separate script, outside of ROS1, then combines all of them into a single database that can be read when needed. This database can then be used in ROS2 inside the grasp synthesis node.

3.2.2 Using the Grasps in ROS2

Inside the grasp synthesis node, when a request for grasp poses is received, the name of the object is expected. The reason for this is to extract the corresponding coordinate frame from the TF2 buffer. It is assumed that the name of the coordinate frame defined in the buffer is the same as the object name itself. This means that how the pose of the object is given/estimated is not relevant for this node.
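The grasp synthesis node therefore only needs the object's frame to exist in the buffer. As a minimal sketch (frame names and pose values are placeholders, not the thesis' actual code), a fixed object pose can be provided as a static transform, matching the static-transform approach described next:

```python
# Sketch: publishing a known, fixed object pose as a static transform so that the
# grasp synthesis node can look it up by name (names and values are illustrative).
import rclpy
from geometry_msgs.msg import TransformStamped
from rclpy.node import Node
from tf2_ros import StaticTransformBroadcaster


class ObjectPosePublisher(Node):
    def __init__(self):
        super().__init__('object_pose_publisher')
        self.broadcaster = StaticTransformBroadcaster(self)
        t = TransformStamped()
        t.header.stamp = self.get_clock().now().to_msg()
        t.header.frame_id = 'world'    # parent frame
        t.child_frame_id = 'object_1'  # frame name equals the object name
        t.transform.translation.x = 0.45
        t.transform.translation.y = -0.10
        t.transform.translation.z = 0.02
        t.transform.rotation.w = 1.0   # identity orientation
        self.broadcaster.sendTransform(t)


def main():
    rclpy.init()
    rclpy.spin(ObjectPosePublisher())
    rclpy.shutdown()


if __name__ == '__main__':
    main()
```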
Figure 3.5: Figure describing a single pass of grasp synthesis using the grasp database generated by GraspIt. Filtering depends on the specific object that is going to be grasped and on how many attempts have been made so far to generate a grasp.

For this thesis, the pose of each graspable object was manually fed into the buffer through static transformations. After the pose is loaded from the buffer, the grasp pose database is loaded, and then the grasp poses are filtered. The filtering step is the most vital, as it re-orders the grasp database to best fit the current environment. Through iteration, the following filtering conditions were determined (a simplified sketch of this filtering is given at the end of this section):

1. If the angle between the grasp position vector and the horizontal plane, represented by Θ in figure 3.6, is less than a minimum given angle, θ1, remove the grasp pose from the database.
2. If Θ is between θ1 and a second provided angle, θ2, remove the grasp pose from the database if its grasp rotation value, represented by Ψ in figure 3.6, is less than a given value, ψmax.
3. If the angle between the position vector projected onto the horizontal xy-plane and the x-axis, represented by the angle Φ in figure 3.6, is less than a given ϕmax, remove the grasp pose from the database.
4. If the distance from the robot end-effector to the object pose is smaller than the distance from the robot end-effector to the grasp pose, remove that grasp pose from the database.
5. In the case of Object 2, if attempt number 3 is reached, add a slight offset of 5 cm to the grasp position along the z-axis. This is done to compensate for the top part of the object, which is not present in the CAD file. Since it is not present in the CAD file, it cannot be taken into consideration by GraspIt.

If none of the conditions above are met for a grasp pose, that grasp pose stays in the database.

Figure 3.6: Filter angle representation. The figure shows an example object coordinate frame with an example grasp pose coordinate frame above it. Θ shows the angle between the position vector of the grasp pose and its projection onto the horizontal plane, Ψ shows the yaw Euler angle of the grasp pose (rotation around the z-axis), and Φ is the angle between the projected position vector of the grasp pose and the x-axis.

The exact values for θ1, θ2, ψmax, and ϕmax depend on both the object itself and the attempt number, and are shown in table 3.1.

Attempt Nr. | Object 1         | Object 2         | Object 3        | Object 4
1           | (0, 30, 60, 90)  | (0, 30, 80, 90)  | (0, 30, 60, 10) | (0, 30, 60, 10)
2           | (20, 60, 75, 90) | (20, 60, 85, 90) | (0, 60, 75, 20) | (0, 60, 75, 20)
3           | (40, 90, 90, 90) | (40, 90, 90, 90) | (0, 90, 90, 25) | (0, 90, 90, 25)

Table 3.1: Filter angles (θ1, θ2, ψmax, ϕmax) (in degrees) based on object and attempt number. These angles are part of what determines which synthesised grasp pose is the "best" and most likely to succeed.

The values of these angles were determined through trial and error, and could still be improved further through more testing or better filtering methods. After the filtering step is done, the top five grasps (or however many are left after filtering) are extracted; these grasps are then visualised in Rviz and their poses are added to the TF2 buffer of transformations. Once the TF2 buffer is updated, a response is sent back to the client containing information about the generated grasps (pose, index in the database, etc.).
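The sketch below roughly illustrates conditions 1-4 of the filtering pass. It is not the thesis' actual implementation; it assumes the grasp position and yaw are expressed in the object frame, with the thresholds taken from table 3.1, and the use of absolute values for the angles is an assumption.

```python
import numpy as np


def keep_grasp(position, yaw, theta1, theta2, psi_max, phi_max,
               dist_ee_to_object, dist_ee_to_grasp):
    """Return True if a grasp pose (given in the object frame) survives filters 1-4.

    position: (x, y, z) of the grasp pose in the object frame.
    yaw: yaw Euler angle of the grasp pose (degrees).
    theta1, theta2, psi_max, phi_max: thresholds in degrees (table 3.1).
    """
    x, y, z = position
    # Theta: angle between the position vector and its projection onto the horizontal plane.
    theta = np.degrees(np.arctan2(z, np.hypot(x, y)))
    # Phi: angle between the projected position vector and the x-axis.
    phi = abs(np.degrees(np.arctan2(y, x)))

    if theta < theta1:                                    # condition 1
        return False
    if theta1 <= theta <= theta2 and abs(yaw) < psi_max:  # condition 2
        return False
    if phi < phi_max:                                     # condition 3
        return False
    if dist_ee_to_object < dist_ee_to_grasp:              # condition 4
        return False
    return True
```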
3.2.3 Additional Contact Points
The definition of contact points on the gripper is vital, as it affects the output of the energy function used during GraspIt's grasp synthesis. It is suggested that the contact points should be placed along the fingers of the gripper, but to encourage grasps with the gripper closer to the object, contact points can be added along other parts of the gripper. To test whether grasps with more contact points were better, we created two different sets of contact points: one with the standard suggestion of contact points only along the fingers (figure 3.7a), and another with more contact points (figure 3.7b). The idea is that heavier objects, such as objects 3 and 4 (figures 3.15c and 3.15d), would perform better when the grasps are encouraged to be closer to the object itself, since the applied forces would then be closer to the center of gravity.
(a) Contact points only on the finger pads, encouraging that the contact between the object and the gripper occurs mostly on the finger pads. This configuration will be known as "G, few CP" (GraspIt, few Contact Points). (b) Additional contact points added to force the gripper to be closer to the object. This configuration will be known as "G, more CP" (GraspIt, more Contact Points).
Figure 3.7: Two different sets of contact points, which lead to different grasp poses stored in the database.
For an explanation of how the defined contact points affect grasp pose synthesis in GraspIt, see section 2.2.
3.3 Implementation of the Pick & Place Pipeline
In this section, we cover the main nodes that facilitate the pick and place task and allow the user to keep track of the performance of the grasp synthesis models as well as the overall system performance. Figure 3.8 shows how these nodes communicate with each other, where services are marked in purple and topics are marked in red.
Figure 3.8: System architecture shown through the ROS2 nodes. The red arrows are connections through topics (i.e. publishers and subscribers), whereas the purple arrows are connections through services. Here the Pick & Place node contains the client of each service called.
Since most communication between nodes uses topics and services, the internal processing of a node does not matter as long as it sends and receives data in the same format. This makes the implementation highly modular, and allows the user to swap out any of the nodes shown in the figure for another implementation. A new user can choose to control the robot with a different controller, use a different planner instead of MoveIt, or even implement new grasp pose synthesis methods. This is one of the main benefits of using ROS2, and it is what allows for the easy comparison of the two different grasp pose synthesis methods.
3.4 Controllers and Motion Planner
The controller chosen for this thesis is the scaled joint trajectory controller, which is an extension of the popular joint trajectory controller. The joint trajectory controller finds a viable trajectory such that, at the end, the robot ends up at the given joint values, and attempts to follow that trajectory within a given execution time. The extension exists because the UR arms can have a velocity scale set, such that the maximum velocity is reduced to some fraction of its nominal value.
If the scale is set to a value less than one, the standard joint trajectory controller is not aware that it can no longer use the full speed of the robot, and will hence fall behind the trajectory. The scaled joint trajectory controller mitigates this problem and does not suffer from these issues. This controller is provided by Universal Robots [30]. These controllers also work in tandem with the real robot, so any control outputs are sent to the real robot for execution.
The MoveIt node starts all the necessary configurations for the MoveIt package, which is necessary for our custom interface to the MoveIt planners and collision checks to work.
Figure 3.9: The service that uses the MoveIt interface, showing the expected inputs and outputs of the service.
The MoveIt interface node contains the custom services defined for this thesis, which allow the user to request that the gripper base link coordinate frame be placed on top of another coordinate frame. The coordinate frame passed into the service server through the request should already be defined in the TF2 transformation buffer, such that it can be used as a navigation target. The idea is that the grasp poses generated by the grasp synthesis nodes are added to the TF2 transformation buffer, and the grasp task node uses this interface to check whether it is feasible to navigate there, and to plan a path there if it is. Figure 3.9 shows the expected inputs and outputs of this service. In order to let the planning algorithm find a direct, straight line between the end-effector's current pose and the desired goal pose, we added a flag for cartesian path planning, which uses the MoveIt interface to find a viable plan in cartesian space rather than in the joint space of the robot. The last two inputs deal with the planning scene, which is the representation of the environment (scene) that MoveIt uses for planning. It always includes the robot itself, but can also include other objects as obstacles; in our case, we add a rectangular box to represent a table.
3.5 Gripper Control Node
The gripper control node provides an additional custom service that allows the user to easily command the gripper to open and close. Thanks to the aforementioned force sensors on the gripper, the user does not need to consider the grasp width that is estimated by the grasp pose synthesis models, and can instead simply command the gripper to close as much as it can. The firmware on the gripper stops the gripper before it damages the object.
Figure 3.10: The service that controls whether the gripper is open or closed.
3.6 Grasp Synthesis Node
The grasp synthesis node is the core of the thesis, and it is here that the two different methods of grasp pose generation are interchangeable. In its current form, the two methods of grasp synthesis use the same service name, and are hence never run simultaneously, but the node can easily be adapted so that each solution is exposed as a separate service. Figure 3.11 shows the expected inputs and outputs of these services.
Figure 3.11: The service that generates grasps when called.
For our implementation, only the boolean value indicating whether grasps were synthesised is used for logic; the rest of the fields were useful for debugging. The user can directly make use of the provided grasp poses by looping through the response, or they can use the TF2 buffer, as the grasp poses are also added there.
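To make this service pattern concrete, the sketch below shows how a client could call the grasp synthesis service and then ask the MoveIt interface whether the first returned grasp frame is reachable. The service types and names (GenerateGrasps, PlanToFrame, generate_grasps, plan_to_frame) and the interfaces package are hypothetical placeholders; the actual service definitions used in the thesis differ.

    import rclpy
    from rclpy.node import Node
    # Hypothetical service definitions; the real interfaces are custom to this work.
    from my_grasp_interfaces.srv import GenerateGrasps, PlanToFrame

    class GraspClient(Node):
        def __init__(self):
            super().__init__('grasp_client')
            self.grasp_cli = self.create_client(GenerateGrasps, 'generate_grasps')
            self.plan_cli = self.create_client(PlanToFrame, 'plan_to_frame')

        def pick(self, object_name: str) -> bool:
            # 1) Ask the grasp synthesis node for grasps on the named object.
            req = GenerateGrasps.Request(object_name=object_name)
            future = self.grasp_cli.call_async(req)
            rclpy.spin_until_future_complete(self, future)
            res = future.result()
            if not res.success:          # only the boolean is used for control flow
                return False
            # 2) Ask the MoveIt interface whether the best grasp frame is reachable.
            plan_req = PlanToFrame.Request(target_frame=res.grasp_frames[0],
                                           cartesian=True)
            plan_future = self.plan_cli.call_async(plan_req)
            rclpy.spin_until_future_complete(self, plan_future)
            return plan_future.result().success

Because each step is a plain service call, swapping out the grasp synthesis method only requires that the replacement node serves the same request and response types.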
3.7 Pick And Place Node
To test the grasp synthesis models, a simple pick-and-place task is used, in order to determine whether the generated grasp pose is robust enough to transport an object. The grasp task node is what coordinates this. It is the central node that is expected to be the main interaction point between the system and the user. The pick and place task consists of the following steps:
1. Plan and navigate to the start pose.
2. Synthesise grasp poses.
3. Use the MoveIt interface to determine if any valid grasp pose was given (i.e. a plan for navigation was found and there are no collisions). If a valid plan is found, execute it and wait until the robot is at the goal pose.
4. If none of the provided grasp poses were valid, repeat steps 2 & 3 for a maximum of three attempts.
5. If no valid grasp pose was synthesised after a maximum of three attempts, the pick and place task has failed. Otherwise, continue.
6. At this stage, the robot should be at the goal pose. Close the gripper.
7. Move to the start pose.
8. Move to the drop-off pose (passing through an intermediary pose that is some vertical offset away from the drop-off pose).
9. Open the gripper, dropping the object into the bin.
10. Move to the start position.
If more than one string is provided in the list-of-strings input field shown in figure 3.12, steps 2-10 are repeated for each object name/string. The planning and navigation, grasp pose synthesis, and gripper control are done through the aforementioned services.
Figure 3.12: The service that performs the pick and place task by iterating through sub-tasks using the other services.
3.8 Visualisation Node
To visualise how well each system is working and the planned path for movements, Rviz is used. Almost every node makes data available for visualisation through topics, so that the user can more easily track how each sub-system is behaving. The motion planner visualises the final pose it will plan to, as well as the planned trajectory it expects to execute (if the plan is not valid, the trajectory is only visualised, not executed). The grasp synthesis node also visualises the grasps by placing a simplified model of a parallel-fingered gripper at each grasp pose, along with its index, see figure 3.13.
Figure 3.13: 10 generated grasps on an object displayed in Rviz as MarkerArrays along with their index.
3.9 Camera Node
To capture the environment, the Realsense ROS2 wrapper is used to spawn a camera node. This node publishes the color image, depth image, an aligned depth-to-color image, and the camera intrinsics to its own topics. The aligned depth-to-color image contains a transformed depth image that has the same pixel mapping as the color image. This allows any classification or segmentation computed on the color image to be mapped directly onto the depth image. This alignment is also necessary for the grasp synthesis model. The ROS wrapper provides several post-processing filters for the depth imaging that are used to reduce noise as well as decrease processing time. The filters applied are:
• Disparity filter: Performs the transformation between the depth and disparity domains, which enhances the following filters.
• Spatial filter: Applies edge-preserving smoothing of the depth data.
• Temporal filter: Filters the depth data using information from previous frames.
Other filters, such as the decimation filter and hole-filling filter, are available but were not used; a sketch of a launch configuration enabling the chosen filters is shown below.
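As an illustration only, a ROS2 Python launch file along the lines of the following sketch could start the camera node with the chosen filters enabled. The exact parameter names (e.g. disparity_filter.enable) vary between versions of the realsense2_camera wrapper, so this should be read as an assumption rather than the configuration used in the thesis.

    from launch import LaunchDescription
    from launch_ros.actions import Node

    def generate_launch_description():
        # Hypothetical parameterisation of the Realsense wrapper; parameter names
        # depend on the realsense2_camera version in use.
        return LaunchDescription([
            Node(
                package='realsense2_camera',
                executable='realsense2_camera_node',
                name='camera',
                parameters=[{
                    'align_depth.enable': True,         # aligned depth-to-color stream
                    'disparity_filter.enable': True,    # depth <-> disparity transform
                    'spatial_filter.enable': True,      # edge-preserving smoothing
                    'temporal_filter.enable': True,     # use previous frames
                    'decimation_filter.enable': False,  # not used: loses information
                    'hole_filling_filter.enable': False,
                }],
            ),
        ])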
The decimation filter was not used because it downsamples the depth image, which loses information, and because the processing time was fast enough without it. While the hole-filling filter created a denser depth image, it also created unwanted artifacts around the edges of objects that were not spatially distinct enough from their background.
3.10 Experiment Setup
The system consists of a UR10e robot [31] with a Robotiq Adaptive 2F-140 gripper [32] attached to it. Also attached to the robot is a Realsense D435i depth camera. In order to have a full view of the workspace, we chose to attach the Realsense camera using a custom-designed holder. This holder fits on the end effector of the UR10e and adds a vertical offset such that the camera is not occluded by the gripper in front of it. The part was designed to be just long enough so that the fingers of the gripper are out of view. The camera can then be screwed onto the holder as shown in figure 3.14.
Figure 3.14: Figure showing the robot used to test the implemented pick and place pipeline and grasp pose synthesis models.
A laptop hosting a Docker container with an Ubuntu 22.04 environment and ROS2 Humble is connected to the robot with an Ethernet cable. Communication with the gripper is done through the same Ethernet cable.
To answer the main research question of the thesis, and to facilitate a fair comparison of the grasp synthesis models, the aim was to define a highly repeatable test setup.
Available to test with were 4 objects, shown in figure 3.15. These are objects that require "kitting" on the Volvo Group production line, and are hence primary candidates for automated retrieval.
(a) Object 1 (b) Object 2 (c) Object 3 (d) Object 4
Figure 3.15: The four different objects that grasps can be generated for using both of the methods. While Contact-GraspNet can work effectively on many different objects, GraspIt required more information about the objects, provided through CAD files.
Due to the geometry and weight of objects 3 and 4, it was expected that both methods would have the most difficulty with these objects, as slight deviations from an optimal grasp pose would result in a failed grasp or the object slipping from the grasp during movement.
It is worth noting that there is some discrepancy between the CAD file of object 2 and the object that we received to test with. In the CAD file, there is no top black section, and since we did not have access to more objects, we had no choice but to continue with the discrepancy. This only affects the GraspIt results for object 2.
In order to test the efficacy of the methods, two categories of tests were set up: single objects and grouped objects. Each object has a fixed, defined pose in the environment, as can be seen in figure 3.16; during the first testing phase, only one object is present in its pose, whereas in the second phase, all of them are present at the same time.
Figure 3.16: Objects placed in their fixed poses for testing, in relation to the robot and the blue bucket.
The reason for these two different configurations is to see whether the grasp synthesis models are able to generate grasps that do not lead the robot to collide with the other obstacles, which would be a common issue in bin picking if one is not careful.
To mitigate this, we modelled the table, as well as all of the objects, as obstacles when testing with grouped objects, whereas only the table was modelled when testing with single objects. This difference is shown in figure 3.17. For each test setup, the robot attempts to perform a simple pick and place task, where the objects in the scene are transported to the nearby blue bucket. The robot begins from 5 different configurations, such that different perspectives and views of the scene can be achieved. For each configuration, the test was repeated 10 times, and the number of successful grasps, as well as the number of successful pick-and-place operations, was recorded. For a robust grasp synthesis model, it was expected that these two values should be as close as possible, since a robust grasp entails that disturbances such as moving the arm during the grasp should not let the object slip from the grip.
Figure 3.17: Pictures showing the difference between configurations as seen by MoveIt's planning scene. Right shows the grouped object setup, where the green boxes are modelled obstacles, whereas left shows tests using single objects (note the missing green boxes around the objects).
The metric most commonly used to measure the efficacy of grasp synthesis models is called Grasp Success Rate (GSR). In the literature, Grasp Success Rate does not have a standard definition, and the authors choose which condition counts as a success. Since bin picking is the main application of these methods, it was deemed necessary to have the additional metric of measuring the success of the pick-and-place task as well, even though it has more points of failure due to motion planners and implementation. However, the risk of additional failures due to implementation was deemed low enough to test and measure the success rate of the whole pick and place task. To avoid confusion, the pick and place success rate was labelled Pick And Place Success Rate (PNPSR).
4 Results
4.1 Results from Pick and Place Operations
The following tables show the Grasp Success Rate (GSR) and Pick-And-Place Success Rate (PNPSR) for each category, for both the offline synthesised method using GraspIt and the end-to-end deep learning method using Contact-GraspNet. Each cell represents the success rate over 50 attempts, 10 tries from each of the 5 starting configurations, totalling 1600 separately tested grasps (see figure 3.16).
4.1.1 GraspIt
As can be seen in table 4.1, grasp poses generated through GraspIt were not able to grasp or transport object 3 at all, and had many difficulties with object 4.
Single Object Grouped Objects
G, few CP G, More CP G, few CP G, More CP
Object 1 88 / 88 100 / 100 90 / 90 76 / 76
Object 2 60 / 60 100 / 100 82 / 82 78 / 78
Object 3 0 / 0 0 / 0 0 / 0 0 / 0
Object 4 8 / 8 0 / 0 16 / 6 0 / 0
Table 4.1: GSR (%) / PNPSR (%) results for the GraspIt-based grasp-pose synthesis method. Contact points are shown in figure 3.7.
For objects 1 and 2, it can be seen that GraspIt with fewer contact points on the gripper performs better than GraspIt with more contact points only when there are multiple objects in the scene. The behaviour is reversed when only one object is being grasped.
Figure 4.1: Example grasp of object 2 using GraspIt. Left shows the point at which the gripper was closed, right shows the object at the starting pose before being dropped off.
Figure 4.2: Example grasp of object 1 using GraspIt.
Left shows the point at which the gripper was closed, right shows the object at the starting pose before being dropped off.
4.1.2 Contact-GraspNet
Table 4.2 shows that the baseline model of Contact-GraspNet performs very well on both objects 1 and 2. It also manages to pick objects 3 and 4, despite these objects being very heavy and unwieldy. The retrained model has very poor performance, but the fact that it manages to generate suitable grasps for object 2 is still a good indication that retraining the model to accommodate a larger gripper width has potential. For grouped objects, both models are uniformly worse for all the objects. For our experiments, synthesising grasps took around 6-8 seconds on average.
Single Object Grouped Objects
CGN, Baseline CGN, Retrained CGN, Baseline CGN, Retrained
Object 1 60 / 60 0 / 0 50 / 50 6 / 6
Object 2 92 / 90 46 / 42 74 / 72 18 / 18
Object 3 32 / 26 0 / 0 12 / 6 0 / 0
Object 4 18 / 8 0 / 0 0 / 0 0 / 0
Table 4.2: GSR (%) / PNPSR (%) results for the Contact-GraspNet based grasp-pose synthesis method, where "Baseline" is the model using the weights from the public Contact-GraspNet repository, and "Retrained" is the model using weights trained for the Robotiq gripper.
4.1.3 Special Cases
CGN, Baseline CGN, Retrained G, few CP G, more CP
Successes due to object slipping into a stable pose 6.41% 24.24% 10.78% 0%
Failures due to no valid grasp poses being synthesised 18.44% 26.70% 60.09% 94.17%
Table 4.3: Table showing the fraction of the successes that were due to the object slipping into a more stable configuration such that it would not fall, and the fraction of the cases where failures were due to no valid grasps being generated at all.
Given that tables 4.1 and 4.2 are averages over all iterations of all starting poses, we collected in table 4.3 some interesting cases that occurred during testing. For some of the successes, where both the grasp and the pick and place task were successful, we found that the gripper was not fully closed and the object was then able to slip into a more stable position, such that it was still transported to the drop-off point. While we still marked these as successes, we kept track of these grasps. We can see that the retrained Contact-GraspNet model has the highest ratio of this scenario occurring, and that GraspIt with more contact points does not have a single case of it occurring. Similarly, among the failures, there were some cases where simply no valid grasp poses were synthesised. These could indicate limitations of the synthesis model itself, and could possibly be remedied by better tuning of the model or some additional recovery strategies for these scenarios. It is interesting to note that GraspIt has the highest rates of this occurring.
5 Discussion & Conclusion
In this chapter, a thorough analysis of the results gathered during testing is presented. Potential improvements to the system and future research topics are also discussed, as well as a final conclusion and answer to the research question based on the analysis provided.
5.1 Discussion
5.1.1 Contact-GraspNet: Baseline vs Retrained
The baseline model of Contact-GraspNet performs surprisingly well considering that the input scene is scaled down significantly. There were some clear issues with generating grasps for object 1, resulting in severely lower-scored grasps, requiring us to lower the score threshold from the recommended 0.23 down to 0.13.
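Concretely, this thresholding amounts to a simple mask over the predicted grasp scores; the sketch below is an illustrative stand-in with assumed array names and random data, not Contact-GraspNet's actual variables.

    import numpy as np

    # Hypothetical outputs of a single inference call: N predicted grasp poses
    # (4x4 homogeneous matrices) and their confidence scores in [0, 1].
    grasp_poses = np.random.rand(200, 4, 4)
    scores = np.random.rand(200)

    SCORE_THRESHOLD = 0.13   # lowered from the recommended 0.23 for object 1

    keep = scores > SCORE_THRESHOLD
    kept_scores = scores[keep]
    filtered_poses = grasp_poses[keep]
    # Sort the surviving grasps so the highest-scoring candidates are tried first.
    order = np.argsort(kept_scores)[::-1]
    filtered_poses = filtered_poses[order]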
Generally, the grasps that are generated are very consistent between tries. A high-scoring grasp from one perspective is usually also considered a good grasp from a different perspective, given that the contact point is visible from both perspectives.
The retrained model performs far worse, which is to be expected given that it is trained on only about 1/10 of the data used for the baseline model. But given the small amount of training data, we are still impressed by the performance. The grasps generated are more sparsely spread around the object, while the baseline usually outputs more clustered grasps; the difference is illustrated in figure 5.1. Both models share the problem of having very high computational cost and slow inference time, which can be partially reduced by a sparser depth image.
Figure 5.1: Figure showing the difference in clustering of generated grasps between the baseline model (left) and the retrained model (right) on the same object.
The baseline Contact-GraspNet actively generated grasps that avoid collisions with other objects in the scene. This could be observed in the grasps generated on object 3 with and without object 4 in the scene, which produced distinctly different grasps in the two scenarios, as can be seen in figure 5.2.
Figure 5.2: Baseline model generating grasps on object 3 with object 4 present in the scene (left) and without object 4 (right).
The retrained version of Contact-GraspNet proposed a lot of colliding grasps, especially on objects 3 and 4, where grasps were generated from inside the object, suggesting that extending the training to more data significantly increases collision avoidance and the model's intuition of object shape.
Figure 5.3: Retrained model generating grasps whose poses are located within the object.
5.1.2 Effect of Number of Contact Points for GraspIt
When more contact points are added to the gripper in GraspIt, the performance on objects 1 and 2 shows clear improvement if there are no obstacles nearby. This is reasonable, as the gripper is able to get close to the object more often, whereas in the case of grouped objects, many of the previously valid grasps were discarded due to collisions with other objects. This highlights how important the filtering of grasps is, and how it needs to take into consideration both the grasps stored in the database and the environment the object is in.
Another point of interest is how performance improved when using fewer contact points in the presence of obstacles. Having the grasp poses use only the finger pads as possible contact points led to the generated grasp poses being further away, and hence more likely to yield some grasp pose that is both stable and not hindered by any of the obstacles we had placed. Given that there will be many obstacles in bin picking tasks (other objects in the bin, the bin itself, etc.), this suggests that having contact points only on the finger pads would lead to better performance for bin picking and other pick and place tasks where objects are close to each other.
For both objects 3 and 4, it is clear that neither configuration of contact points leads to stable performance in generating valid grasp poses.
This can be attributed to the geometry and weight distribution of these objects, which makes them highly sensitive to any disturbances in the grasp poses, but also to the fact that GraspIt did not take into consideration any information about the actual weight of the object, only the friction at the surface. This leads to grasps that might have worked for lighter objects, but that are highly unstable for the ones that were available to test with.
During testing, some of the test configurations (start position, object being grasped, etc.) showed high sensitivity to the object pose. Because the objects were placed back into their original positions manually, they may not have aligned perfectly with the estimated object pose. This can be interpreted as the kind of noise a pose estimation model would introduce. What this showed was that some configurations and objects are more sensitive to this noise than others, and that this sensitivity is highly dependent on how "stable" the final choice of grasp was. This goes to show how important the filtering of grasps is, such that only stable grasps are chosen and not just any valid grasp.
5.1.3 Comparison between Models
We can see that both synthesis models had varying performance, and some are more fit than others in certain situations. While both models can be further optimised and tuned such that their performance is improved, it is clear that they already have their benefits and downsides.
Using GraspIt as the model for grasp synthesis means that the automation that Volvo wishes to introduce would be cheaper, given that it can run on weaker hardware at the kitting station, on the condition that more effort is placed on the filtering of grasps, and that engineer-hours are spent importing the necessary CAD files correctly into the GraspIt environment. Any new part that is added to the kitting station would necessitate an update to the grasp database and a check that the filtering of grasps works and is viable for that new object, which is a large time investment for something that happens constantly in manufacturing: the addition of new parts. This model would also require a precise 6 DoF object pose estimator, which is not trivial and would require more research into determining the best approach for Volvo Group.
On the other hand, using Contact-GraspNet is more flexible and adaptive, given that it can estimate grasp poses for never-before-seen objects. Considering that the baseline model was never trained or tested on the objects we used in the tests, it had very good performance, and we hoped that retraining the model with the right gripper configuration would improve it, but due to hardware limitations we were not able to confirm this. However, retraining with the full dataset is vital to achieve good performance on proprietary parts with complex shapes, since scaling down the point cloud with the baseline model worsens the performance of the model on some small objects, and is only a temporary solution to show whether the network architecture is a functional one. Given the performance of the baseline model, simply training it with the exact same dataset but with the correct gripper configuration is likely to improve the model performance, and it is therefore reasonable to conclude that adding the proprietary objects to the dataset would improve the performance even more.
This solution is also dependent on powerful hardware, which would require either powerful computers at the racks or the infrastructure for decentralised computing to be in place, both of which are expensive options.
5.1.4 Effect of the Available Setup
While we tried to implement the pipeline in such a fashion that it would allow us to measure the effectiveness of the grasp pose synthesis models, we were also limited by some of our hardware and software. The addition of more objects to GraspIt is time-consuming and difficult to get correct, and since we had access to so few CAD files and even fewer actual objects, this led to very few objects that we could test our grasp synthesis models with. The limited variety in the data we collected also means that while we can show whether there is potential in either solution for automating the kitting process at Volvo Group, we believe that our conclusions should be validated through more tests with more types of objects commonly found in the kitting process.
Due to the limited amount of time, the number of parameters we could tune, test, and vary was limited. While we attempted to test the models using some varied configurations, other factors could affect their performance. Since we determined that a generated grasp is valid only if a plan could be found without colliding with obstacles, investigating different parameters and configurations in the navigation planners of MoveIt could improve performance. Lastly, since the deep learning model is highly dependent on the quality of its input, investigating the filters applied to the depth camera would also be beneficial.
5.2 Future Work
5.2.1 Improvements to the Grasp Synthesis Model
We have pinpointed some possible starting points for future work that aims to improve the grasp synthesis models and their use in automating the kitting process at Volvo Group. Retraining Contact-GraspNet with the full ACRONYM dataset should be the first step, such that the right gripper configuration is used when making inferences. After this point, transfer learning can be investigated, such that a smaller dataset containing the proprietary objects is used to extend the newly trained Contact-GraspNet model, possibly with better performance as well. One can also revisit newer grasp synthesis models that utilize deep learning, in case models that are faster, easier to train, or generally better than Contact-GraspNet can be found.
To improve GraspIt, an investigation into how to filter the generated grasps is worth the time. Exactly how that is done, whether through Bayesian learning models, deep learning models, or some simple conditional statements, is left to the future investigator, and to Volvo Group to determine whether it is viable for their purposes.
Lastly, in order to fully use these synthesis models for automation, a full object pose estimation and segmentation model must be implemented. The expectation is that these systems should be fully automated, and should be able to provide the grasp synthesis models with both object pose estimates and segmentation maps when needed, rather than having the pose manually defined or requiring the user to manually select a region of interest.
5.2.2 Extensions and Improvements to the Pick and Place Pipeline
In order to ensure that the bin-picking tasks are performed well, we must ensure that the robot is able to find good plans to navigate towards the proposed grasp poses.
One way to improve this is to extend the MoveIt implementation, such that it can continuously update its planning scene and account for obstacles in a better fashion using the same depth camera that is mounted on the robot arm.
Another improvement could be to the planning algorithms themselves. A thorough investigation into a good planning algorithm for cartesian path planning of the end effector, as well as good detection of a valid path through the MoveIt interface, would allow for safer automation, less time spent recovering from failed paths, and the ability to have people work closer to the robot.
Another possible improvement is to introduce collaborative behaviour to the robot. While the goal is that the robot takes the objects from bins and drops them off at a table or at automated guided vehicles, there would still be humans nearby. Investigating how collaborative behaviour between the human and the robot could best be implemented would allow the human to also take any required object from a nearby bin without risk of harm.
5.3 Conclusion
Below is the research question we wished to answer through this thesis:
• Is a deep-learning model more viable than an analytical model for the purpose of grasp pose synthesis, given that the objects being picked are already known?
We find that while both solutions show high potential to succeed, and each has its benefits and downsides, the deep learning model would be more adaptive and future-proof, given its quick and easy expandability and the fact that it showed the highest success rate in our trials. However, given the already available infrastructure and previous attempts at automation tested at Volvo Group R&D, the analytical model would be the easiest and quickest to get implemented, at the cost of more time spent each time a new object is to be added.
At the same time, we emphasize that before any development is made, more investigation needs to be done on improving both systems to ensure that these conclusions can be generalised to the entire Volvo Group kitting automation.
Bibliography
[1] R. X. Gao, J. Krüger, M. Merklein, H.-C. Möhring, and J. Váncza, “Artificial intelligence in manufacturing: State of the art, perspectives, and future directions,” CIRP Annals, vol. 73, no. 2, pp. 723–749, Jan. 1, 2024, issn: 0007-8506. doi: 10.1016/j.cirp.2024.04.101. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S000785062400115X (visited on 02/21/2025).
[2] A. Cordeiro, L. F. Rocha, C. Costa, P. Costa, and M. F. Silva, “Bin picking approaches based on deep learning techniques: A state-of-the-art survey,” in 2022 IEEE International Conference on Autonomous Robot Systems and Competitions (ICARSC), Apr. 2022, pp. 110–117. doi: 10.1109/ICARSC55462.2022.9784795. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/9784795 (visited on 02/25/2025).
[3] M. Alonso, A. Izaguirre, and M. Graña, “Current research trends in robot grasping and bin picking,” in International Joint Conference SOCO’18-CISIS’18-ICEUTE’18, M. Graña, J. M. López-Guede, O. Etxaniz, et al., Eds., Cham: Springer International Publishing, 2019, pp. 367–376, isbn: 978-3-319-94120-2. doi: 10.1007/978-3-319-94120-2_35.
[4] R. Newbury, M. Gu, L. Chumbley, et al., “Deep learning approaches to grasp synthesis: A review,” IEEE Transactions on Robotics, vol. 39, no. 5, pp. 3994–4015, Oct. 2023, issn: 1941-0468. doi: 10.1109/TRO.2023.3280597. [Online].
Available: https://ieeexplore.ieee.org/document/10149823 (visited on 01/21/2025).
[5] R. Platt, “Grasp learning: Models, methods, and performance,” Annual Review of Control, Robotics, and Autonomous Systems, vol. 6, pp. 363–389, May 3, 2023, issn: 2573-5144. doi: 10.1146/annurev-control-062122-025215. [Online]. Available: https://www.annualreviews.org/content/journals/10.1146/annurev-control-062122-025215 (visited on 01/17/2025).
[6] “Robotiq gripper product sheet.” [Online]. Available: https://blog.robotiq.com/hubfs/Product-sheets/Adaptive%20Grippers/Product-sheet-Adaptive-Grippers-EN.pdf (visited on 02/26/2025).
[7] D. Morrison, P. Corke, and J. Leitner, Closing the loop for robotic grasping: A real-time, generative grasp synthesis approach, May 15, 2018. doi: 10.48550/arXiv.1804.05172. arXiv: 1804.05172 [cs]. [Online]. Available: http://arxiv.org/abs/1804.05172 (visited on 05/13/2025).
[8] M. Sundermeyer, A. Mousavian, R. Triebel, and D. Fox, “Contact-GraspNet: Efficient 6-DoF grasp generation in cluttered scenes,” in 2021 IEEE International Conference on Robotics and Automation (ICRA), ISSN: 2577-087X, May 2021, pp. 13 438–13 444. doi: 10.1109/ICRA48506.2021.9561877. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/9561877 (visited on 01/22/2025).
[9] H.-S. Fang, C. Wang, M. Gou, and C. Lu, “GraspNet-1billion: A large-scale benchmark for general object grasping,” in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), ISSN: 2575-7075, Jun. 2020, pp. 11 441–11 450. doi: 10.1109/CVPR42600.2020.01146. [Online]. Available: https://ieeexplore.ieee.org/document/9156992 (visited on 01/20/2025).
[10] J. Borja-Diaz, O. Mees, G. Kalweit, L. Hermann, J. Boedecker, and W. Burgard, Affordance learning from play for sample-efficient policy learning, Mar. 1, 2022. doi: 10.48550/arXiv.2203.00352. arXiv: 2203.00352 [cs]. [Online]. Available: http://arxiv.org/abs/2203.00352 (visited on 01/17/2025).
[11] Z. Zhao, H. Yu, H. Wu, and X.
Zhang, “Bio-inspired affordance learning for 6-DoF robotic grasping: A transformer-based global feature encoding approach,” Neural Networks, vol. 171, pp. 332–342, 2024. doi: 10.1016/j.neunet.2023.12.005.
[12] Z. Chen, Z. Liu, S. Xie, and W.-S. Zheng, “Grasp region exploration for 7-DoF robotic grasping in cluttered scenes,” in 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), ISSN: 2153-0866, Oct. 2023, pp. 3169–3175. doi: 10.1109/IROS55552.2023.10341757. [Online]. Available: https://ieeexplore.ieee.org/document/10341757 (visited on 01/17/2025).
[13] C. Zhuang, H. Wang, W. Niu, and H. Ding, “A parallel graph network for generating 7-DoF model-free grasps in unstructured scenes using point cloud,” Robotics and Computer-Integrated Manufacturing, vol. 92, p. 102 879, Apr. 1, 2025, issn: 0736-5845. doi: 10.1016/j.rcim.2024.102879. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0736584524001662 (visited on 01/17/2025).
[14] H. Wang, W. Niu, and C. Zhuang, “GraNet: A multi-level graph network for 6-DoF grasp pose generation in cluttered scenes,” in 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), ISSN: 2153-0866, Oct. 2023, pp. 937–943. doi: 10.1109/IROS55552.2023.10341549. [Online]. Available: https://ieeexplore.ieee.org/document/10341549 (visited on 01/17/2025).
[15] C. Eppner, A. Mousavian, and D. Fox, ACRONYM: A large-scale grasp dataset based on simulation, Nov. 18, 2020. doi: 10.48550/arXiv.2011.09584. arXiv: 2011.09584 [cs]. [Online]. Available: http://arxiv.org/abs/2011.09584 (visited on 02/04/2025).
[16] K. Kleeberger, F. Roth, R. Bormann, and M. F. Huber, “Automatic grasp pose generation for parallel jaw grippers,” in Intelligent Autonomous Systems 16, M. H. Ang Jr, H. Asama, W. Lin, and S. Foong, Eds., Cham: Springer International Publishing, 2022, pp. 594–607, isbn: 978-3-030-95892-3. doi: 10.1007/978-3-030-95892-3_45.