A Multi-task Convolutional Neural Network for Autonomous Robotic Grasping in Object Stacking Scenes

Hanbo Zhang, Xuguang Lan, Site Bai, Lipeng Wan, Chenjie Yang, and Nanning Zheng

(Hanbo Zhang and Xuguang Lan are with the Institute of Artificial Intelligence and Robotics, the National Engineering Laboratory for Visual Information Processing and Applications, School of Electronic and Information Engineering, Xi'an Jiaotong University, No.28 Xianning Road, Xi'an, Shaanxi, China.)

Abstract - Autonomous robotic grasping plays an important role in intelligent robotics. However, how to help the robot grasp specific objects in object stacking scenes is still an open problem, because there are two main challenges for autonomous robots: (1) it is a comprehensive task to know what and how to grasp; (2) it is hard to deal with situations in which the target is hidden or covered by other objects. In this paper, we propose a multi-task convolutional neural network for autonomous robotic grasping, which can help the robot find the target, make the plan for grasping and finally grasp the target step by step in object stacking scenes. We integrate vision-based robotic grasp detection and visual manipulation relationship reasoning in one single deep network and build an autonomous robotic grasping system. Experimental results demonstrate that with our model, the Baxter robot can autonomously grasp the target with a success rate of 90.6%, 71.9% and 59.4% in object cluttered scenes, familiar stacking scenes and complex stacking scenes respectively.

I. INTRODUCTION

In the research of intelligent robotics [1], autonomous robotic grasping is a very challenging task [2]. For use in daily-life scenes, autonomous robotic grasping should satisfy the following conditions:

- Grasping should be robust and efficient.
- The desired object can be grasped in a multi-object scene without potential damage to other objects.
- The correct decision can be made when the target is not visible or is covered by other things.

For human beings, grasping can be done naturally with high efficiency even if the target is unseen or grotesque. However, robotic grasping involves many difficult steps including perception, planning and control. Moreover, in complex scenes (e.g. when the target is occluded or covered by other objects), robots also need a certain reasoning ability to grasp the target in the right order. For example, as shown in Fig. 1, in order to prevent potential damage to other objects, the robot has to plan the grasping order through reasoning, perform multiple grasps in sequence to complete the task and finally get the target. These difficulties make autonomous robotic grasping more challenging in complex scenes.

Fig. 1. Grasping task in a complex scene. The target is the tape, which is placed under several things and nearly invisible. The complete manipulation relationship tree indicates the correct grasping order, and grasping in this order will avoid damage to other objects. The correct decision made by a human being would be: if the target is visible, the correct grasping plan should be made and a sequence of grasps should be executed to get the target, while if the target is invisible, the visible things should be moved away in a correct order to find the target.

Therefore, in this paper, we propose a new vision-based multi-task convolutional neural network (CNN) to solve the mentioned problems for autonomous robotic grasping, which can help the robot complete the grasping task in complex scenes (e.g. grasping an occluded or covered target).
To achieve this, three functions should be implemented: grasping the desired object in multi-object scenes, reasoning the correct order for grasping, and executing the grasp sequence to get the target. To help the robot grasp the desired object in multi-object scenes, we design the Perception part of our network, which can simultaneously detect objects and their grasp candidates. The grasps are detected in the area of each object instead of the whole scene. In order to deal with situations in which the target is hidden or covered by other objects, we design the Reasoning part to obtain visual manipulation relationships between objects and enable the robot to reason the correct order for grasping, preventing potential damage to other objects. For transferring network outputs to configurations for grasp execution, we design the Grasping part of our network. For the perception and reasoning process, RGB images are taken as input of the neural network, while for the execution of grasping, depth information is needed for approaching vector computation and coordinate transformation.
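Concretely, the three parts cooperate in a closed loop: perceive the scene, reason about which objects are currently safe to grasp (the leaf nodes of the manipulation relationship tree), remove one of them, and repeat until the target itself has been grasped. The Python sketch below is only our own illustration of that loop; the callables perceive, reason and execute and the dictionary-style object records are hypothetical placeholders, not interfaces defined by the paper.

```python
# A minimal sketch of the perceive-reason-grasp loop, under our own assumptions.
# perceive() -> (objects, grasps): per-object detections and their best grasps.
# reason(objects) -> dict mapping object index i to the set of objects stacked on i.
# execute(obj, grasp): moves one object away with the robot.

def grasp_target_step_by_step(perceive, reason, execute, target_cls, max_steps=10):
    """Repeatedly grasp a currently-graspable (leaf) object until the target is removed."""
    for _ in range(max_steps):
        objects, grasps = perceive()
        blockers = reason(objects)  # blockers[i]: indices lying on top of object i

        # Leaf nodes of the manipulation relationship tree have nothing on top of them.
        leaves = [i for i in range(len(objects)) if not blockers.get(i)]
        if not leaves:
            return False
        targets = [i for i in leaves if objects[i]["cls"] == target_cls]
        chosen = targets[0] if targets else leaves[0]

        execute(objects[chosen], grasps[chosen])
        if objects[chosen]["cls"] == target_cls:
            return True  # the desired target has been grasped
    return False
```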
Though there are some previous works that try to complete grasping in dense clutter [3]-[6], to the best of our knowledge, our proposed algorithm is the first to combine perception, reasoning and grasp planning simultaneously in one neural network, and it attempts to realize autonomous robotic grasping in complex scenarios. To evaluate our proposed algorithm, we validate the performance of our model on the VMRD dataset [7]. For the robotic experiments, a Baxter robot is used as the executor to complete grasping tasks, in which the robot is required to find the target, make the plan for grasping and grasp the target step by step.

The rest of this paper is organized as follows: related work is reviewed in Section II; our proposed algorithm is detailed in Section III; experimental results including validation on the VMRD dataset and robotic experiments are shown in Section IV; and finally the conclusion and discussion are given in Section V.

II. RELATED WORK

A. Robotic Grasp Detection

With the development of deep learning, robotic grasp detection based on convolutional neural networks (CNN) achieves state-of-the-art performance on several datasets such as the Cornell dataset [8]-[13] and the CMU grasp dataset [14]. These methods are suitable for grasp detection in single-object scenes. There are some works proposed for grasping in dense clutter [3]-[6]. A deep network is used by Guo et al. [3] to simultaneously detect the most exposed object and its best grasp, trained on a fruit dataset including 352 RGB images. However, their model can only output the grasp affiliated to the most exposed object, without perception and understanding of the overall environment or reasoning about the relationships between objects, which limits the use of the algorithm. The algorithms proposed in [4], [5] and [6] only focus on the detection of grasps in scenes where objects are densely cluttered, rather than on what the grasped objects are. Therefore, the existing algorithms detect grasps on features of the whole image, and can only be used to grasp an unspecified object instead of a pointed one in stacking scenes.

B. Visual Manipulation Relationship Reasoning

Recent works prove that CNNs achieve advanced performance on visual relationship reasoning [15]-[17]. Different from visual relationships, the visual manipulation relationship [7] is proposed to solve the problem of grasping order in object stacking scenes with consideration of the safety and stability of objects. However, when this algorithm is directly combined with a grasp detection network to solve the grasping problem in object stacking scenes, there are two main difficulties: (1) it is difficult to correctly match the detected grasps to the detected objects in object stacking scenes; (2) the cascade structure causes a lot of redundant computation (e.g. the extraction of scene features), which makes the speed slow.

Therefore, in this paper, we propose a new CNN architecture to combine object detection, grasp detection and visual manipulation relationship reasoning, and build a robotic autonomous grasping system. Different from previous works, the grasps are detected on the object features instead of the whole scene. Visual manipulation relationships are applied to decide which object should be grasped first. The proposed network can help our robot grasp the target following the correct grasping order in complex scenes.

Fig. 2. Our task is to grasp the target (here the tape) in complex scenes. The target is covered by several other objects and almost invisible. The robot needs to find the target, plan the grasping order and execute the grasp sequence to get the target.

III. TASK DEFINITION

In this paper, we focus on grasping tasks in scenes where the target and several other objects are cluttered or piled up, which means there can be occlusions and overlaps between objects, or the target is hidden under other objects and cannot be observed by the robot. Therefore, we set up an environment with several different objects each time. In each experiment, we test whether the robot can find and grasp the specific target. The target of the task is input manually. The desired robot behavior is that the final target can be grasped step by step following the correct manipulation relationships predicted by the proposed neural network.

In detail, we focus on grasping tasks in realistic and challenging scenes as follows: each experimental scene includes 6-9 objects, where objects are piled up and there are severe occlusions and overlaps. At the beginning of each experiment, the target is difficult to detect in most cases. This setting tests whether the robot can make correct decisions to find the target and successfully grasp it.

IV. PROPOSED APPROACH

A. Architecture

The proposed architecture of our approach is shown in Fig. 3. The input of our network is RGB images of working scenes. First, we use a CNN (e.g. ResNet-101 [18]) to extract image features. As shown in Fig. 3, the convolutional features are shared among the region proposal network (RPN, [19]), object detection, grasp detection and visual manipulation relationship reasoning. The shared feature extractor used in our work consists of the ResNet-101 layers up to the end of conv4 (30 ResBlocks in total). Therefore, the stride of the shared features is 16. The RPN follows the feature extractor to output regions of interest (ROIs). The RPN includes three 3×3 convolutional layers: an intermediate convolutional layer, a ROI regressor and a ROI classifier. The ROI regressor and classifier are both cascaded after the intermediate convolutional layer to output the locations of ROIs and the probability of each ROI being a candidate object bounding box.
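For illustration, the sketch below lays out the RPN head as described above: an intermediate 3×3 convolution on the shared conv4 features, followed by a 3×3 ROI regressor and a 3×3 ROI classifier. The 1024-channel input matches the ResNet-101 conv4 output; the hidden width of 512 and the number of anchors per location (9) are assumptions of ours, not values given in the paper.

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """Sketch of the RPN described above: an intermediate 3x3 conv followed by
    a 3x3 ROI regressor and a 3x3 ROI classifier, run on the shared conv4
    features (1024 channels, stride 16 for ResNet-101). The hidden width and
    the number of anchors per location are illustrative assumptions."""

    def __init__(self, in_channels=1024, mid_channels=512, num_anchors=9):
        super().__init__()
        self.intermediate = nn.Conv2d(in_channels, mid_channels, 3, padding=1)
        self.roi_regressor = nn.Conv2d(mid_channels, num_anchors * 4, 3, padding=1)
        self.roi_classifier = nn.Conv2d(mid_channels, num_anchors * 2, 3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, shared_features):
        x = self.relu(self.intermediate(shared_features))
        deltas = self.roi_regressor(x)   # per-anchor box offsets -> ROI locations
        scores = self.roi_classifier(x)  # per-anchor object / background scores
        return deltas, scores

# Example: a stride-16 feature map from the shared extractor.
feats = torch.randn(1, 1024, 38, 50)
deltas, scores = RPNHead()(feats)   # shapes: (1, 36, 38, 50) and (1, 18, 38, 50)
```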
Fig. 3. Architecture of our proposed approach. The input is RGB images of working scenes. The solid arrows indicate forward-propagation while the dotted arrows indicate backward-propagation. In each iteration, the neural network produces one robotic grasp configuration and the robot moves one object. The iteration is not terminated until the desired target is grasped. (a): Network architecture; (b): Perception part with object detector and grasp detector; (c): Reasoning part with visual manipulation relationship predictor; (d): Expected results.

The main body of our approach includes three parts: Perception, Reasoning and Grasping. “Perception” is used to obtain the detection results of objects and grasps together with the affiliations between them. “Reasoning” takes the object bounding boxes output by “Perception” and the image features as input to predict the manipulation relationship between each pair of objects. “Grasping” uses the perception results to transform grasp rectangles into robotic grasp configurations to be executed by the robot. Each detection produces one robotic grasp configuration, and the iteration is terminated when the desired target is grasped.

B. Perception

In the “Perception” part, the network simultaneously detects objects and their grasps. The convolutional features and the ROIs output by the RPN are first fed into a ROI pooling layer, where the features are cropped by the ROIs and adaptively pooled into the same size W × H (in our work, 7 × 7). The purpose of ROI pooling is to enable the corresponding features of all ROIs to form a batch for network training.

1) Object Detector: The object detector takes a mini-batch of ROI-pooled features as input. As in [18], a ResNet conv5 layer including 9 convolutional layers is adopted as the header for the final regression and classification, taking the ROI-pooled features as input. The header's output is then averaged on each feature map. The regressor and classifier are both fully connected layers with 2048-d input and no hidden layer, outputting the locations of refined object bounding boxes and the classification results respectively.

2) Grasp Detector: The grasp detector also takes ROI-pooled features as input to detect grasps on each ROI. Each ROI is first divided into W × H grid cells, where each grid cell corresponds to one pixel of the ROI-pooled feature maps. Inspired by our previous work [13], the grasp detector outputs k (in this paper, k = 4) grasp candidates on each grid cell with oriented anchor boxes as priors. Different oriented anchor sizes are explored in the experiments, including 12 × 12 and 24 × 24 pixels. A header including 3 ResBlocks is cascaded after ROI pooling in order to enlarge the receptive field of the features used for grasp detection. The reason is that a large receptive field can prevent the grasp detector from being confused by grasps that belong to different ROIs. Then, similar to [13], the grasp regressor and grasp classifier follow the grasp header and output 5k and 2k values for grasp rectangles and graspable confidence scores respectively. Therefore, the output for each grasp candidate is a 7-dimensional vector: 5 for the location of the grasp (x_g, y_g, w_g, h_g, θ_g) and 2 for the graspable and ungraspable confidence scores (c_g, c_ug).

Therefore, the output of the “Perception” part for each object contains two parts: the object detection result O and the grasp detection result G. O is a 5-dimensional vector (x_min, y_min, x_max, y_max, cls) representing the location and category of an object, and G is a 5-dimensional vector (x_g, y_g, w_g, h_g, θ_g) representing the best grasp.
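To make these output sizes concrete, the following sketch shows one plausible PyTorch layout of the grasp detector head, emitting 5k regression values and 2k confidence scores at every grid cell of a ROI-pooled feature map. Only k and the 5k / 2k output widths come from the text above; the simplified two-convolution header standing in for the 3 ResBlocks, as well as the channel widths, are our own assumptions.

```python
import torch
import torch.nn as nn

class GraspHead(nn.Module):
    """Sketch of the grasp detector head: for every grid cell of a ROI-pooled
    feature map it outputs 5k regression values (x, y, w, h, theta offsets)
    and 2k scores (graspable / ungraspable) for k oriented anchors.
    The 3-ResBlock header is approximated by two plain conv layers here."""

    def __init__(self, in_channels=1024, mid_channels=256, k=4):
        super().__init__()
        self.k = k
        self.header = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, mid_channels, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.regressor = nn.Conv2d(mid_channels, 5 * k, 1)   # grasp rectangle offsets
        self.classifier = nn.Conv2d(mid_channels, 2 * k, 1)  # graspable / ungraspable scores

    def forward(self, roi_feats):                 # roi_feats: (N_roi, C, 7, 7)
        x = self.header(roi_feats)
        return self.regressor(x), self.classifier(x)

# Example: 8 ROIs pooled to 7x7 with 1024 channels.
rois = torch.randn(8, 1024, 7, 7)
reg, cls = GraspHead()(rois)   # shapes: (8, 20, 7, 7) and (8, 8, 7, 7)
```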
(x_g, y_g, w_g, h_g, θ_g) is computed from the regression outputs and the corresponding oriented anchor by Eq. (1):

x_g = \hat{x}_g w_a + x_a
y_g = \hat{y}_g h_a + y_a
w_g = w_a \exp(\hat{w}_g)
h_g = h_a \exp(\hat{h}_g)
\theta_g = \hat{\theta}_g (90/k) + \theta_a        (1)

where (\hat{x}_g, \hat{y}_g, \hat{w}_g, \hat{h}_g, \hat{\theta}_g) are the regression outputs of the grasp detector and (x_a, y_a, w_a, h_a, θ_a) is the corresponding oriented anchor.

C. Reasoning

Inspired by our previous work [7], we combine visual manipulation relationship reasoning in our network to help the robot reason about the grasping order without potential damage to other objects.

Manipulation Relationship Predictor: To predict the manipulation relationships of object pairs, we adopt the Object Pairing Pooling Layer (OP2L) to obtain features of object pairs. As shown in Fig. 3(c), the input of the manipulation relationship predictor is the features of object pairs. The features of each object pair (O1, O2) include the features of O1, of O2 and of their union bounding box. Similar to the object detector and grasp detector, the features of each object are also adaptively pooled into the same size W × H. The difference is that the convolutional features are cropped by object bounding boxes instead of ROIs. Note that (O1, O2) is different from (O2, O1) because the manipulation relationship is not commutative. If there are n detected objects, the number of object pairs will be n(n-1), and n(n-1) manipulation relationships will be predicted. In the manipulation relationship predictor, the features of the two objects and of the union bounding box are first passed through several convolutional layers respectively (in this work, ResNet conv5 layers are applied), and finally the manipulation relationships are classified by a fully connected network containing two 2048-d hidden layers.

Fig. 4. (Panels: input image, manipulation relationships, manipulation relationship tree.) When all manipulation relationships are obtained, the manipulation relationship tree can be built by combining all manipulation relationships. The leaf nodes represent the objects that should be grasped first.

After getting all manipulation relationships, we can build a manipulation relationship tree to describe the correct grasping order in the whole scene, as shown in Fig. 4. Leaf nodes of the manipulation relationship tree should be grasped before the other nodes. Therefore, it is worth noting that the most important part of the manipulation relationship tree is the leaf nodes. In other words, if we can make sure that the leaf nodes are detected correctly in each step, the grasping order will be correct regardless of the other nodes.

D. Grasping

The “Grasping” part is used to complete inference on the outputs of the network. In other words, the input of this part is the object and grasp detection results, and the output is the corresponding robotic configuration to grasp each object. Note that there are no trainable weights in the “Grasping” part.

1) Grasp Selection: As described above, the grasp detector outputs a large set of grasp candidates for each ROI. Therefore, the best grasp candidate should be selected first for each object. According to [12], there are two methods to find the best grasp: (1) choose the grasp with the highest graspable score; (2) choose the one closest to the object center among the Top-N candidates. The second one is proved to be the better way in [12], and it is used to get the grasp of each object in our paper. In our experiments, N is set to 3.
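As an illustration of this selection rule, the short NumPy sketch below keeps the Top-N (N = 3) candidates by graspable score and returns the one whose center is closest to the object center. The array layouts (grasp rows as [x, y, w, h, theta, score], object boxes as [x_min, y_min, x_max, y_max]) are assumptions made for the example, not formats prescribed by the paper.

```python
import numpy as np

def select_best_grasp(grasps, obj_box, top_n=3):
    """Pick one grasp for an object: take the Top-N candidates by graspable
    score, then return the one whose center is closest to the object center.

    grasps:  (M, 6) array, rows are [x_g, y_g, w_g, h_g, theta_g, score].
    obj_box: (4,) array [x_min, y_min, x_max, y_max] from the object detector.
    """
    grasps = np.asarray(grasps, dtype=float)
    # Keep the Top-N candidates with the highest graspable scores.
    top = grasps[np.argsort(-grasps[:, 5])[:top_n]]

    # Object center from its bounding box.
    cx = 0.5 * (obj_box[0] + obj_box[2])
    cy = 0.5 * (obj_box[1] + obj_box[3])

    # Distance from each grasp center (x_g, y_g) to the object center.
    dists = np.hypot(top[:, 0] - cx, top[:, 1] - cy)
    return top[np.argmin(dists)]

# Example with three dummy candidates for one object.
candidates = [[50, 60, 30, 15, 45, 0.90],
              [55, 58, 28, 14, 30, 0.80],
              [80, 90, 25, 12,  0, 0.95]]
best = select_best_grasp(candidates, np.array([40, 50, 70, 75]))
```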
In this paper, an affi ne transformation is used approximately for this mapping. The affi ne transfor- mation is obtained through four reference points with their coordinates in the image and robot coordinate system. The grasp point is defi ned as the point in grasp rectangle with minimum depth while the approaching vecto