Cambricon TensorFlow User Manual
Contents (version 2021-05-06)

List of Figures
List of Tables
Release Notes
  2.1 Version History
Overview
  Cambricon TensorFlow Architecture
  Native Mode
  Fusion Mode
  Unsupported TensorFlow Modules
Installing Cambricon TensorFlow
  Installing the Cambricon Neuware Dependencies
  Obtaining the Cambricon TensorFlow Source Code
Using MLU Devices
  Getting the Number of MLU Devices
  Getting Detailed MLU Information
  MLU Memory Usage Policy
  Using a Single MLU in a Multi-MLU System
  Using Multiple MLU Devices
  MLU Device Visibility
Cambricon TensorFlow Tutorial Based on the Low-Level API
Cambricon TensorFlow Tutorial Based on Keras
Horovod Distributed Training
Adding a Custom MLU Operator
MLU Memory Overstep Detection
Tensor Dump and the Error Calculation Tool
  TensorDump

List of Figures
  Figure 3.1  Position of Cambricon TensorFlow in the Cambricon software stack
  Figure 3.2  Cambricon TensorFlow architecture

List of Tables
  Table 2.1  Cambricon TensorFlow version history

Legal Notice

Cambricon does not warrant the contents of this document, and Cambricon assumes no liability arising from the application or use of any product or service. Cambricon shall not be liable for any issues arising from, among other things, customer product design. The performance tests and ratings listed in this guide are measured with specific chips, computer systems, or components; based on such tests, the results shown in this guide reflect the approximate performance of Cambricon products. Any difference in system hardware or software design or configuration will affect actual performance. As stated above, Cambricon does not represent, warrant, or guarantee that the products described in this guide are suitable for any particular purpose. Cambricon does not represent or warrant that every product has been tested so as to avoid defects in applications or products. Weaknesses in customer product design may affect the quality and reliability of Cambricon products and lead to additional or different conditions and/or requirements beyond the scope of this guide.

? 2021 Cambricon

Release Notes

Table 2.1  Cambricon TensorFlow version history (update dates 2021-05-06 and 2021-02-01):
- Removed the content related to fusion mode. For instructions on using fusion mode, see the dedicated Cambricon TensorFlow fusion-mode documentation.
- Added the Quick Start chapter.

Overview

TensorFlow is a dataflow-based programming framework that is widely used for machine learning and deep neural networks and is highly portable. To support the Cambricon MLU (Machine Learning Unit), Cambricon extended the open-source deep learning framework TensorFlow (hereafter Cambricon TensorFlow), adding MLU support to native TensorFlow so that existing TensorFlow models can run on MLU devices. For more about native TensorFlow, see the documentation on the TensorFlow website. This manual mainly describes installing and using Cambricon TensorFlow on Cambricon MLUs. Cambricon TensorFlow is currently based on TensorFlow v1.15.4.

The Cambricon software stack is unified across edge and cloud: it serves both edge AI chips and cloud AI boards. Cambricon TensorFlow is the deep learning programming framework in this stack that supports the Cambricon MLU; its position in the stack is shown in Figure 3.1.

Figure 3.1  Position of Cambricon TensorFlow in the Cambricon software stack

Cambricon TensorFlow Architecture

Built on native TensorFlow, Cambricon TensorFlow integrates the Cambricon software stack, extends the TensorFlow device layer, and adds support for the Cambricon MLU. It hides hardware-specific details, lets users develop with the native TensorFlow API, and also adapts part of the Keras API. The architecture is shown in Figure 3.2.

Figure 3.2  Cambricon TensorFlow architecture

Native Mode

In native mode, TensorFlow's graph executor dispatches operators to devices just as it does for CPU and GPU. After graph optimization, the graph executor issues the operators of the computation graph to the MLU one by one. At the software-stack level, users reach Cambricon's deeply optimized general-purpose operators, or flexibly add their own, through CNNL (Cambricon Neuware Neural Network Library), CNPlugin (Cambricon Neuware Plugin, the Cambricon custom operator library), and CNCC (Cambricon Neuware Compiler Collection, the compiler for the Cambricon BANG C language). For how TensorFlow operators are defined, see "Create an op" in the TensorFlow documentation. Users can also add custom operators with the Cambricon BANG language; see "Adding a Custom MLU Operator".

Fusion Mode

Cambricon TensorFlow also has a built-in fusion mode based on CNML (Cambricon Neuware Machine Learning Library). In fusion mode, Cambricon TensorFlow optimizes the user's whole computation graph, fusing many small operators into one large operator and eliminating unnecessary intermediate results, which yields more efficient, more hardware-friendly MLU instructions. For more about running fusion mode, including model export, see the dedicated Cambricon TensorFlow fusion-mode documentation.

Unsupported TensorFlow Modules

Cambricon TensorFlow does not yet support the JIT (just-in-time) compilation mechanism, the AOT (ahead-of-time) compilation mechanism, or TensorFlow Lite. For information about TensorFlow Lite, see the TensorFlow website.

Installing Cambricon TensorFlow

At runtime, Cambricon TensorFlow depends on CNToolkit and on the acceleration libraries CNML, CNNL, CNCL, and CNPlugin. This chapter mainly describes how to install Cambricon TensorFlow; for how to install the Cambricon Neuware SDK, see the corresponding Cambricon documentation. You can also use the Docker containers released with Cambricon TensorFlow, in which the components that Cambricon TensorFlow depends on are already installed.

Installing from the Wheel Package

The Cambricon TensorFlow release package contains wheel packages built with manylinux. Choose the wheel that matches your Python version and install it. Taking Python 2.7 as an example (paths abbreviated):

virtualenv --system-site-packages -p python2.7 ...
source .../bin/activate
# install the wheel for your Python version, then verify the installation:
python -c "import tensorflow as tf; ..."

Using Docker

Before using a Docker container that already contains the Cambricon TensorFlow runtime environment:

1. Install the MLU driver on the host. For MLU driver installation, see the corresponding Cambricon documentation.
2. Install Docker. For Docker installation, see the Docker documentation.

Importing the TensorFlow Docker image. Run the following commands to load the TensorFlow Docker image into the local image repository. Load the devel image or the release image according to your needs; this section uses the released Ubuntu 16.04 images as an example:

# import the devel image
...
# import the release image
...

After the import finishes, run docker image ls to confirm the images are present.

Starting the Cambricon TensorFlow container. The Cambricon TensorFlow images come with CNToolkit and Python 2.7 or Python 3.7 pre-installed. Start the TensorFlow Docker container with a command of the following form:

docker run [-it] [--rm] --privileged [-p hostPort:containerPort] [-v ...] TENSORFLOW_IMAGE_NAME:TAG

For example:

docker run -it --privileged -v /path/to/your/models:/path/to/your/models -v ...

For more about Docker commands, see the Docker documentation.

Using the release image or the devel image

Using the release image:

1) Check that TensorFlow can be imported:

python -c "import tensorflow as tf; ..."

2) Alternatively, see the Quick Start chapter and run the demo inside Docker to verify the TensorFlow environment.
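The verification command above is truncated in this copy. A minimal sketch of the kind of check one might run inside the release container; the printed text is illustrative, and the MLU device-creation log lines shown in "Using MLU Devices" should appear when the session is created:

import tensorflow as tf

# Print the TensorFlow version shipped in the image (TF 1.x attribute).
print("TensorFlow version:", tf.VERSION)

# Creating a session initializes the Cambricon runtime; the MLU
# device-creation log lines should be printed at this point.
with tf.Session() as sess:
    print(sess.run(tf.constant("Cambricon TensorFlow import OK")))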
Using the devel image to build and install TensorFlow:

cd ...
# after a successful build, the virtual environment virtualenv_mlu is created in the
# current directory and the wheel is installed into it automatically
source ...
python -c "import tensorflow as tf; ..."

Building from Source

To build TensorFlow from source, first satisfy the TensorFlow community's requirements on the development environment; see "Build from source" in the TensorFlow documentation. You can use the development Docker image released with Cambricon TensorFlow (tagged devel): inside that container the build tools, toolchain, and dependency libraries needed for compilation are already installed, so you can go straight into the TensorFlow source directory and build. Building the TensorFlow source requires GCC 7.3; Ubuntu 18 is recommended.

Installing Bazel (TensorFlow is built with Bazel):

# download Bazel 0.24.1
wget .../bazelbuild/bazel/releases/download/0.24.1/bazel-0.24.1-installer-linux-x86_64.sh
bash bazel-0.24.1-installer-linux-x86_64.sh --...

After installation, verify Bazel with the following command (output abbreviated):

bazel version
Build label: ...
Build target: bazel-out/k8-...
Build time: Tue Apr 2 16:29:26 2019
Build timestamp: ...
Build timestamp as int: ...

Installing the Cambricon Neuware dependencies. Install the following Cambricon Neuware packages: CNToolkit (install it first), CNML, CNNL, CNCL, and CNPlugin. For the detailed installation steps, see the Cambricon Neuware installation documentation. For the CNToolkit package version requirements, see the version-requirements file under cambricon_tensorflow/tensorflow/ in the source package.

Obtaining the Cambricon TensorFlow source code. Obtain the released Cambricon TensorFlow source package and extract it:

tar -zxvf ...

Building. During the build, the compiler searches the system default paths for the header files and shared libraries of the Cambricon Neuware packages. If your Neuware packages are not installed in a system default path, set the environment variable NEUWARE_HOME to tell the build where to look:

export NEUWARE_HOME=/path_to_neuware_home

You can build Cambricon TensorFlow with the build_tensorflow_mlu.sh script in the source package:

cd ...
# set NEUWARE_HOME if Neuware is not in a default path
# export NEUWARE_HOME=/path_to_neuware_home
./build_tensorflow_mlu.sh

After the build, the TensorFlow wheel is generated under the ${TENSORFLOW_HOME}/virtualenv_mlu directory and is installed into virtualenv_mlu automatically.

Performance notes. The TensorFlow pip and Docker packages distributed by the community are conservative about which CPU instruction sets they use, so native TensorFlow commonly prints a warning such as:

I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that ...

To give good out-of-the-box performance, the pip and Docker packages released for Cambricon TensorFlow take a comparatively aggressive approach and enable, by default, several instruction-set optimizations that are common on x86-64 platforms. The corresponding GCC options are:

--copt=-mfma --copt=-mavx2 --copt=-mavx --copt=-msse4.1 --copt=-...

The released packages leave out the following cloud-service integrations:

aws: Amazon Web Services
gcp: Google Cloud Platform

The corresponding bazel options are --config=noaws and the matching --config option for GCP.

Cambricon TensorFlow does not enable AVX512F support by default. If you need the compiler's AVX512F optimizations, modify the build script and add the --copt=-mavx512f option.

If the pip- or Docker-distributed Cambricon TensorFlow cannot run on your CPU because of ISA compatibility, or if you need the AWS or GCP service features, follow "Building from Source" above, adjust the build configuration, and rebuild TensorFlow.
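After sourcing virtualenv_mlu, a quick sanity check that the freshly built wheel is the one being imported can be done with the standard TF 1.x version attributes. A minimal sketch; the MLU-specific version banner shown in the next chapter is printed by the Cambricon runtime itself, not by this snippet:

import tensorflow as tf

# Version of the wheel that was just built and installed into virtualenv_mlu.
print("TensorFlow version:", tf.VERSION)

# Git revision the wheel was built from; this can be matched against the
# "tf git version" line in the Cambricon startup log.
print("Git version:", tf.GIT_VERSION)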
Using MLU Devices

This chapter describes the basic usage of the MLU for compute acceleration in Cambricon TensorFlow. In general, using an MLU works the same way as using a GPU, for both the TensorFlow API and tf.keras; for more about GPU usage, see "Using GPUs" in the TensorFlow documentation.

Getting the number of MLU devices:

import tensorflow as tf
print("Num MLUs Available: ", ...)

On a machine with 4 MLU cards this prints 4.

Getting detailed MLU information:

import tensorflow as tf
...

On a 4-MLU machine, a log like the following is printed (timestamps, file names, and line numbers omitted):

# Cambricon TensorFlow version, build time, and the Neuware libraries used at build time
I Cambricon TensorFlow version: 1.1.0 was compiled at 2020-12-16 17:04:50 with Toolkit: ..., CNNL: 1.1.1, CNML: 7.9.2, CNPLUGIN: 1.11.1, CNCL: ...
# TensorFlow git version
I tf git version: v0.9.2-21252-...
# currently installed library versions
I Current library versions: DRIVER: 407, CNRT: 40803, CNNL: 1102, CNML: ...
I Current MLU_Visible_Devices Count: ...
I Created TensorFlow device (/job:localhost/replica:0/task:0/device:MLU:0 with 16384 ... memory) -> physical MLU (device: 0, name: ...)
I Created TensorFlow device (/job:localhost/replica:0/task:0/device:MLU:1 with 16384 ... memory) -> physical MLU (device: 1, name: ...)
I Created TensorFlow device (/job:localhost/replica:0/task:0/device:MLU:2 with 16384 ... memory) -> physical MLU (device: 2, name: ...)
I Created TensorFlow device (/job:localhost/replica:0/task:0/device:MLU:3 with 16384 ... memory) -> physical MLU (device: 3, name: MLU270)
['/job:localhost/replica:0/task:0/device:CPU:0', '/job:localhost/replica:0/task:0/device:MLU:0', '/job:localhost/replica:0/task:0/device:MLU:1', ...]

Logging device placement. Set log_device_placement to True in the session options:

import tensorflow as tf

# Creates a graph.
a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
c = tf.matmul(a, b)
# Creates a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# Runs the op.
print(sess.run(c))

The output shows which device each operator was placed on:

Device mapping:
/job:localhost/replica:0/task:0/device:MLU:0 -> device: 0, name: ...
/job:localhost/replica:0/task:0/device:MLU:1 -> device: 1, name: ...
/job:localhost/replica:0/task:0/device:MLU:2 -> device: 2, name: ...
/job:localhost/replica:0/task:0/device:MLU:3 -> device: 3, name: ...
>>> # Runs the op.
MatMul: (MatMul): /job:localhost/replica:0/task:0/device:MLU:0
a: (Const): /job:localhost/replica:0/task:0/device:MLU:0
b: (Const): ...
[[22. 28.]
 [49. 64.]]

Device strings. Cambricon TensorFlow operators support both CPU and MLU devices, which are identified by strings such as:

"/cpu:0": the CPU of the machine.
"/device:MLU:0": the first MLU of the machine.
"/device:MLU:1": the second MLU of the machine.

When an operator has both CPU and MLU implementations, the MLU device takes priority over the CPU: Cambricon TensorFlow assigns the operator to the MLU to accelerate the computation. If you want an operator to run on a device of your choice instead of the automatically selected one, use with tf.device to create a device context; the operators created in that context are placed on the specified device:

import tensorflow as tf

# Creates a graph, pinning the constants to the CPU with tf.device.
with tf.device('/cpu:0'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
c = tf.matmul(a, b)
# Creates a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# Runs the op.
print(sess.run(c))

The output now shows the constant on the CPU and MatMul on the MLU:

Device mapping:
/job:localhost/replica:0/task:0/device:MLU:0 -> device: 0, name: ...
/job:localhost/replica:0/task:0/device:MLU:1 -> device: 1, name: ...
/job:localhost/replica:0/task:0/device:MLU:2 -> device: 2, name: ...
/job:localhost/replica:0/task:0/device:MLU:3 -> device: 3, name: ...
>>> # Runs the op.
MatMul: (MatMul): /job:localhost/replica:0/task:0/device:MLU:0
a: (Const): /job:localhost/replica:0/task:0/device:cpu:0
b: (Const): ...
[[22. 28.]
 [49. 64.]]

MLU memory usage policy. TensorFlow's policy for GPU memory is that, by default, all GPU memory is claimed at startup; users can instead configure allocation to grow on demand. Whichever policy is used, the device-side memory is managed by TensorFlow. For details, see "Allowing GPU memory growth" in the TensorFlow documentation.

Cambricon TensorFlow's policy for MLU memory is similar to the GPU policy. The main difference is that TensorFlow does not pre-allocate all MLU memory at startup; it allocates MLU memory on demand at runtime. As with the GPU, MLU memory requested through TensorFlow stays under TensorFlow's management. An environment variable can be exported to turn this MLU memory behavior off:

export ...

Using a single MLU in a multi-MLU system. Usage is the same as for GPUs; for more information, see "Using a single GPU on a multi-GPU system" in the TensorFlow documentation.

Using multiple MLU devices. Usage is the same as for GPUs; for more information, see "Using multiple GPUs" in the TensorFlow documentation.

MLU device visibility. By default, all MLU devices in the system are visible to TensorFlow. To restrict a TensorFlow process to one or several MLU devices, use the MLU_VISIBLE_DEVICES environment variable provided by the Cambricon driver. Note that TensorFlow renumbers the visible MLU devices starting from 0. For example, on an 8-MLU machine with no restriction, the startup log lists all eight devices:
I tensorflow/core/common_runtime/direct_session.cc:362] Device mapping:
/job:localhost/replica:0/task:0/device:MLU:0 -> device: 0, name: ...
/job:localhost/replica:0/task:0/device:MLU:1 -> device: 1, name: ...
/job:localhost/replica:0/task:0/device:MLU:2 -> device: 2, name: ...
/job:localhost/replica:0/task:0/device:MLU:3 -> device: 3, name: ...
/job:localhost/replica:0/task:0/device:MLU:4 -> device: 4, name: ...
/job:localhost/replica:0/task:0/device:MLU:5 -> device: 5, name: ...
/job:localhost/replica:0/task:0/device:MLU:6 -> device: 6, name: ...
/job:localhost/replica:0/task:0/device:MLU:7 -> device: 7, name: ...

To expose a single MLU device to the process:

export MLU_VISIBLE_DEVICES=...

The startup log then shows only one device:

/job:localhost/replica:0/task:0/device:MLU:0 -> device: 0, name: ...

To expose several MLU devices:

export MLU_VISIBLE_DEVICES=...

The startup log shows that the selected devices are numbered 0, 1, 2, 3 inside TensorFlow:

/job:localhost/replica:0/task:0/device:MLU:0 -> device: 0, name: ...
/job:localhost/replica:0/task:0/device:MLU:1 -> device: 1, name: ...
/job:localhost/replica:0/task:0/device:MLU:2 -> device: 2, name: ...
/job:localhost/replica:0/task:0/device:MLU:3 -> device: 3, name: ...
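Because MLU_VISIBLE_DEVICES is an ordinary environment variable read when the runtime initializes, one common pattern (an assumption based on the driver behavior described above) is to set it from inside the script before TensorFlow is imported. A minimal sketch with illustrative device IDs:

import os

# Must be set before the first `import tensorflow`, because device
# enumeration happens when the runtime is initialized.
os.environ["MLU_VISIBLE_DEVICES"] = "0,1"  # illustrative device IDs

import tensorflow as tf

# Only the two selected MLUs should now be visible, renumbered as MLU:0 and MLU:1.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))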

Cambricon TensorFlow Tutorial Based on the Low-Level API

CNMIX is the Cambricon mixed-precision component used with Cambricon TensorFlow on MLU200-series devices; the tutorials in this chapter use it to build the session configuration.

1. Confirm that Cambricon TensorFlow is installed. For more installation information, see the Installation chapter.

2. Import TensorFlow and the other required modules:

import numpy as np
import cnmix
import tensorflow as tf
import tensorflow.keras.datasets.mnist as mnist

3. Load the MNIST data:

(x_train, y_train), (x_test, y_test) = mnist.load_data()
batch_size, data_num = 50, len(x_train)

4. Define the network:

def weight_variable(shape):
    initial = tf.truncated_normal(shape, stddev=0.1)
    return tf.Variable(initial)

def bias_variable(shape):
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial)

def conv2d(x, W):
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

def max_pool_2x2(x):
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')

x = tf.placeholder(tf.float32, [None, 28, 28])
y_ = tf.placeholder("float", [None, 10])

W_conv1 = weight_variable([5, 5, 1, 32])
b_conv1 = bias_variable([32])
x_image = tf.reshape(x, [-1, 28, 28, 1])
h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)
h_pool1 = max_pool_2x2(h_conv1)

W_conv2 = weight_variable([5, 5, 32, 64])
b_conv2 = bias_variable([64])
h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)
h_pool2 = max_pool_2x2(h_conv2)

W_fc1 = weight_variable([7 * 7 * 64, 1024])
b_fc1 = bias_variable([1024])
h_pool2_flat = tf.reshape(h_pool2, [-1, 7 * 7 * 64])
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)

keep_prob = tf.placeholder("float")
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)

W_fc2 = weight_variable([1024, 10])
b_fc2 = bias_variable([10])
y_conv = tf.nn.softmax(tf.matmul(h_fc1_drop, W_fc2) + b_fc2)

cross_entropy = -tf.reduce_sum(y_ * tf.log(y_conv))
train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
correct_prediction = tf.equal(tf.argmax(y_conv, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))

5. Build the session with the CNMIX session config:

session_config = cnmix.mixed_precision_graph_rewrite_config(batch_size, data_num)
sess = tf.Session(config=session_config)
sess.run(tf.global_variables_initializer())

6. Train and evaluate:

EPOCH, iters = 10, int(data_num / batch_size)
for epoch in range(EPOCH):
    for i in range(iters):
        start, end = batch_size * i, batch_size * (i + 1)
        train_x, train_y = x_train[start:end], np.eye(10)[y_train[start:end]]
        sess.run(train_step, feed_dict={x: train_x, y_: train_y, keep_prob: 0.5})
    test_loss, test_accuracy = sess.run([cross_entropy, accuracy],
        feed_dict={x: x_test, y_: np.eye(10)[y_test], keep_prob: 1.0})
    print("epoch %d, Test Loss: %.10f, Test Accuracy: %.5f"
          % (epoch + 1, test_loss, test_accuracy))

The run prints deprecation warnings followed by the per-epoch results, with the final accuracy around 98%:

WARNING:tensorflow:From a.py:12: The name tf.enable_eager_execution is deprecated. ...
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
epoch 1, Test Loss: 0.7170808105, Test Accuracy: 0.95410
epoch 2, Test Loss: 0.5037763672, Test Accuracy: ...
epoch 3, Test Loss: 0.4459250977, Test Accuracy: ...
epoch 4, Test Loss: 0.3748795898, Test Accuracy: ...
epoch 5, Test Loss: 0.3611684570, Test Accuracy: ...
epoch 6, Test Loss: 0.3054287842, Test Accuracy: ...
epoch 7, Test Loss: 0.2697987305, Test Accuracy: ...
epoch 8, Test Loss: 0.2209449219, Test Accuracy: ...
epoch 9, Test Loss: 0.2141630127, Test Accuracy: ...
epoch 10, Test Loss: 0.2400469238, Test Accuracy: ...

Cambricon TensorFlow Tutorial Based on Keras

1. Confirm that Cambricon TensorFlow is installed. For more installation information, see the Installation chapter.

2. Import TensorFlow, CNMIX, and the other required modules:

import os
import cnmix
import tensorflow as tf

3. Load and prepare the MNIST dataset:

mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

4. Stack layers with tf.keras.Sequential to build the model:

model = tf.keras.Sequential()
model.add(tf.keras.layers.Flatten(input_shape=(28, 28)))
model.add(tf.keras.layers.Dense(128, activation='relu'))
model.add(tf.keras.layers.Dense(10, activation='softmax'))

5. Build the session config with CNMIX, then train and evaluate:

session_config = cnmix.mixed_precision_graph_rewrite_config(batch_size, data_num, model. ...)
model.fit(x_train, y_train, epochs=5, ...)
model.evaluate(x_test, y_test, ...)

Training and evaluation produce output like the following:

60000/60000 [==============================] - 7s 118us/sample - loss: 0.3359 - acc: ...
60000/60000 [==============================] - 7s 117us/sample - loss: 0.1624 - acc: ...
60000/60000 [==============================] - 6s 93us/sample - loss: 0.1224 - acc: ...
60000/60000 [==============================] - 6s 98us/sample - loss: 0.1002 - acc: ...
60000/60000 [==============================] - 6s 96us/sample - loss: 0.0858 - acc: ...
10000/10000 - 1s - loss: 0.0768 - acc: ...
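In the Keras example above, the step that attaches session_config to the Keras backend is not shown; with TF 1.x Keras this is commonly done through tf.keras.backend.set_session. A minimal sketch under that assumption:

import tensorflow as tf

# Hypothetical glue step: hand the CNMIX-configured session to Keras
# before calling model.fit, so training runs on that session.
tf.keras.backend.set_session(tf.Session(config=session_config))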
Migrating a Network to the MLU

In network training, model code is usually written against CPU or GPU devices. To train the network on an MLU device instead, the device placement and the session configuration change as follows.

Device placement. On a GPU:

with tf.device("/GPU:0"):
    res = tf.add(1., 2.)

On an MLU:

with tf.device("/MLU:0"):
    res = tf.add(1., 2.)

Configuring CNMIX and the session config. Import the CNMIX module and use it to build the TensorFlow session config.

GPU-related session config:

config = ...

MLU-related session config:

import cnmix

# CNMIX config
data_num = ...
batch_size = ...
config = cnmix.mixed_precision_graph_rewrite_config(
    global_batch_size=global_batch_size,
    data_num=data_num,
    opt_level=...)

Distribution. To distribute over several MLUs with MirroredStrategy:

devices = ['/device:MLU:' + str(mlu_id) for mlu_id in range(num_mlus)]
distribution = tf.distribute.MirroredStrategy(devices=devices, ...)

Cambricon TensorFlow supports MirroredStrategy only through the Keras API.

Horovod. For Horovod distributed training, device visibility is set through the session config; the GPU and MLU options differ only in name:

# GPU
config.gpu_options.visible_device_list = ...
# MLU
config.mlu_options.visible_device_list = ...

Adding a Custom MLU Operator

This chapter uses the mean square error (MSE) operator as an example to show how to add a complete custom MLU operator inside the framework.

Registering the OpDef of the MSE operator. In tensorflow/core/user_ops/mlu_special_ops.cc, register the Mse operator with the REGISTER_OP macro (elided details are marked with ...):

REGISTER_OP("Mse")
    .Input("a: T")
    .Input("b: T")
    .Output("output: T")
    .Attr("T: {half, float}")
    .SetShapeFn(MseShapeFn);

The REGISTER_OP call also installs MseShapeFn as the shape function:

Status MseShapeFn(shape_inference::InferenceContext* c) {
  ShapeHandle input_a;
  ShapeHandle input_b;
  ShapeHandle input;
  // ... shape checks ...
  std::vector<DimensionHandle> dims;
  for (int i = 0; i < c->Rank(input) - 1; i++) {
    dims.push_back(c->Dim(input, i));
  }
  c->set_output(0, c->MakeShape(dims));
  return Status::OK();
}
After the two pieces above are added, the REGISTER_OP macro completes the OpDef definition of the MSE operator. In addition, the macro automatically adds the MSE operator to the Python API.

Registering the OpKernel of the MSE operator. Under the tensorflow/core/kernels/mlu_kernels folder, create a new file mse_op.cc. In this file, REGISTER_KERNEL_BUILDER defines two backends for the MSE operator: DEVICE_CPU and DEVICE_MLU. When using this macro, make sure the string passed to Name matches the operator name used in REGISTER_OP above. TF_CALL_MLU_FLOAT_TYPES is used to instantiate REGISTER_MLU for both half and float. The code in mse_op.cc looks like the following (abridged; elided details are marked with ...):

#include ...

template <typename Device, typename T>
class MseOp : public OpKernel {
 public:
  explicit MseOp(OpKernelConstruction* context) : OpKernel(context) {}
  void Compute(OpKernelContext* ctx) override {
    // simple CPU reference implementation of MSE
    ...
  }
};

template <typename T>
class MLUMseOp : public OpKernel {
 public:
  explicit MLUMseOp(OpKernelConstruction* ctx) : OpKernel(ctx) {}

  void Compute(OpKernelContext* ctx) override {
    if (!ctx->ValidateInputsAreSameShape(this)) return;
    auto input = ctx->input(0);
    Tensor* output = nullptr;
    TensorShape output_shape;
    for (int i = 0; i < input.dims() - 1; i++) {
      output_shape.AddDim(input.dim_size(i));
    }
    OP_REQUIRES_OK(ctx, ctx->allocate_output(0, output_shape, &output));
    OP_REQUIRES_OK(ctx, ComputeInternal(ctx));
  }

  Status ComputeInternal(OpKernelContext* ctx) {
    auto input_a = ctx->input(0);
    auto input_b = ctx->input(1);
    int line_width = input_a.dim_size(input_a.dims() - 1);
    cnrtQueue_t queue = ...;
    CHECK_NOTNULL(reinterpret_cast<cnrtQueue*>(ctx->...));
    // prepare the arguments and call the BANG C MseKernel (see below)
    ...
    return Status::OK();
  }
};

REGISTER_KERNEL_BUILDER(..., MseOp<CPUDevice, float>);
REGISTER_KERNEL_BUILDER(..., MseOp<CPUDevice, Eigen::half>);
#define REGISTER_MLU(T) ...
TF_CALL_MLU_FLOAT_TYPES(REGISTER_MLU);
#undef REGISTER_MLU

}  // namespace

The MseOp class implements a simple CPU version of the MSE operator. In MLUMseOp::ComputeInternal, the arguments of the MseKernel function are prepared and the MseKernel function defined in CustomedOps is called, which in turn invokes the BANG C version of MSE.

Implementing MseKernel in CustomedOps. All kernel files implemented in custom BANG C live in the tensorflow/core/kernels/mlu_kernels/customed_ops directory. Adding the MseKernel function requires adding a new file, mse_kernel.mlu, and modifying two existing files, bangc_kernels.h and BUILD. In mse_kernel.mlu, the computation logic of the MSE operator is written in BANG C and wrapped in an MseKernel entry function.

bangc_kernels.h declares all the BANG C kernels:

#define ...
#include <stdint.h>
#include "cnrt.h"

namespace mlu {

static void MishKernel(cnrtQueue_t queue,
                       void* input, void* output,
                       int input_size, int data_type, int ...);

static void MseKernel(cnrtQueue_t queue,
                      void* input_a, void* input_b, void* output,
                      int data_nums, int line_width, int data_type);

}  // namespace mlu
#endif  // ...

mse_kernel.mlu looks like the following (abridged):

#include "bangc_kernels.h"

#define BLOCK_SIZE 12800
#define ALIGN_UP(a, b) (((a) + (b) - 1) / (b) * (b))

template <typename T>
__mlu_func__ void mseKernelInter(T* input_a, T* input_b, T* output,
                                 int data_nums, int line_width) {
  // BANG C computation logic for MSE
  ...
}

__mlu_entry__ void mseKernel(void* input_a, void* input_b, void* output,
                             int data_nums, int line_width, int data_type) {
  if (data_type == 6) {
    mseKernelInter((float*)input_a, (float*)input_b, (float*)output,
                   data_nums, line_width);
  } else if (data_type == 3) {
    mseKernelInter((half*)input_a, (half*)input_b, (half*)output,
                   data_nums, line_width);
  }
}

void MseKernel(cnrtQueue_t queue,
               void* input_a, void* input_b, void* output,
               int data_nums, int line_width, int data_type) {
  cnrtDim3_t dim = {4, 1, ...};
  cnrtFunctionType_t func_type = ...;
  mseKernel<<<dim, func_type, queue>>>(input_a, input_b, output,
                                       data_nums, line_width, data_type);
}

For the changes to the BUILD file, see "Modifying the BUILD File" below.

Adding unit tests. To test the functionality and accuracy of the new operator, add unit-test files at the same time. After the operator is added as described in this chapter, it can be called directly from both the Python API and the C++ API, so two test files are added: tensorflow/core/kernels/mlu_kernels/mse_op_mlu_test.cc and tensorflow/python/kernel_tests/mlu_kernel_tests/mse_op_test.py. In the C++ API and the Python API, the operator is invoked as Mse and tensorflow.mse respectively.
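A rough sketch of what the Python-side test in mse_op_test.py might look like, assuming the operator is exposed as tensorflow.mse as described above. The reference computation assumes the kernel produces the mean of squared differences over the last dimension; both the binding name and the exact semantics should be checked against the real implementation:

import numpy as np
import tensorflow as tf

class MseOpTest(tf.test.TestCase):
    def testMseMatchesNumpyReference(self):
        a = np.random.rand(8, 16).astype(np.float32)
        b = np.random.rand(8, 16).astype(np.float32)
        # Assumed semantics: mean squared error reduced over the last dimension,
        # so the output shape drops the last axis (matching MseShapeFn above).
        expected = np.mean((a - b) ** 2, axis=-1)
        with tf.Session() as sess:
            result = sess.run(tf.mse(a, b))  # hypothetical Python binding
        self.assertAllClose(expected, result, atol=1e-3)

if __name__ == "__main__":
    tf.test.main()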

Modifying the BUILD File

When adding the new operator, only tensorflow/core/kernels/mlu_kernels/BUILD needs to be modified. The changes do three things:

- Use tf_kernel_library to compile mse_op.cc, mse_op_mlu.h, and the mse_kernel sources into a new target, mse_op. This target can also be depended on by other targets.
- Use the tf_cc_test_mlu function to compile the mse_op_mlu_test.cc unit test into a binary.
- Add mse_op to mlu_core_kernels so that it is built into TensorFlow: in the deps field of the mlu_core_kernels target, add the line ":mse_op".

A sketch of the file (abridged):

tf_kernel_library(
    name = "mse_op",
    srcs = ["mse_op.cc"],
    hdrs = ["mse_op_mlu.h"],
    deps = [
        "@local_config_mlu//:mlu",
        ...
    ],
)

tf_cc_test_mlu(
    name = ...,
    size = "medium",
    srcs = [...],
    linkstatic = 1,
    deps = [
        # Public support ...
        ...
    ],
)

... (
    name = "mlu_core_kernels",
    deps = [
        ":mse_op",
        ...
    ],
)

MLU Memory Overstep Detection

This chapter describes how Cambricon TensorFlow helps to find out-of-bounds memory accesses introduced when operators are added to the MLU. Out-of-bounds bugs show up at random times with random symptoms and are hard to localize. The overstep checker inserts a header and a footer around each allocation and, on deallocation, copies them back to the host and checks whether the previously inserted header and footer have been overwritten.

When the log contains keywords such as [cnrtError] or "mlu unfinished", first turn on the overstep check:

export TF_MLU_DEBUG_MEMORY_OVERSTEP=true

If the output then contains "Footer mask has been overwritten" or "Header mask has been overwritten", an out-of-bounds access exists. To locate it, build a debug version of TensorFlow (./build_tensorflow_mlu.sh -g) and then enable recording of the allocation call stacks.

Example of locating an MLU memory overstep

Take a custom Overstep operator as an example. Its Compute function simulates an out-of-bounds access by copying into the output a buffer larger than output.NumElements(). Suppose a network calls the Overstep operator; the log shows:

2021-01-18 10:50:48.701641: [cnrtError] [186985] [Card: 0] Error occurred in ...
2021-01-18 10:50:48.701678: [cnrtError] [186985] [Card: 0] Return value is 230, ... CHECK failed. Function: MLUCnrtMemcpyAsync @ line: 221 returned code 632046: Unknown error. ... error detected, force terminating ...
Aborted (core dumped)

The log contains [cnrtError], so run again with the overstep check enabled (export TF_MLU_DEBUG_MEMORY_OVERSTEP=true):

2021-01-18 11:15:05.364044: E ...cc:141] [230845] i=0 mask=0xcdcdcdcdcdcdcdcd
2021-01-18 11:15:05.364053: E ...cc:141] [230845] i=1 mask=0xcdcdcdcdcdcdcdcd
2021-01-18 11:15:05.364059: E ...cc:229] [230845] Footer mask has been overwritten, ptr = ...
2021-01-18 11:15:05.364067: E ...cc:235] [230845] was allocated ...
2021-01-18 11:15:05.364238: I ...cc:101] [230845] call ...

The new log contains "Footer mask has been overwritten", and the recorded allocation call stack is printed (abridged):

... tensorflow::MLUDebugAllocator::AllocateRaw(unsigned long, unsigned ...)
... tensorflow::Allocator::AllocateRaw(unsigned long, unsigned ..., tensorflow::AllocationAttributes ...)
... float* tensorflow::TypedAllocator::Allocate<float>(tensorflow::Allocator*, unsigned ..., tensorflow::AllocationAttributes ...)
... tensorflow::Tensor::Tensor(tensorflow::Allocator*, ..., tensorflow::TensorShape const&, tensorflow::AllocationAttributes ...)
... tensorflow::OpKernelContext::allocate_output(int, tensorflow::TensorShape ..., tensorflow::Tensor**, ...)
... tensorflow::OpKernelContext::allocate_output(int, tensorflow::TensorShape ...)
...
... tensorflow::BaseMLUDevice::Compute(tensorflow::OpKernel*, ...)
...
... Eigen::ThreadPoolTempl<tensorflow::thread::EigenEnvironment>::ThreadPoolTempl(int, ...)
... tensorflow::thread::EigenEnvironment::CreateThread(std::function<void ()> ...)
... /lib/x86_64-linux-gnu/libpthread.so.0 (+0x76ba)
... /lib/x86_64-linux-gnu/libc.so.6 (clone+0x6d)
As the stack shows, OverstepOp's Compute appears at frame [10], which means the out-of-bounds access happens inside OverstepOp's Compute.

Tensor Dump and the Error Calculation Tool

TensorDump. Setting the following environment variables dumps the output tensor of every node in the network:

export ...
export ...
export ...

After the run finishes, the output tensors of the intermediate nodes are dumped into the dump_dir folder. This dumps the outputs of both CPU and MLU nodes into the specified folder. The dumped files are named ${node_name}_output_${output_port}_${execution_order}, where:

- node_name: the node name; in this example, resnet50.conv_1.conv2d. The original node name uses "/" as the separator, which is replaced in the file name.
- output_port: which output of the node this file holds; 0 in this example.
- execution_order: the order in which this tensor was dumped, which follows the execution order of the nodes.

It is best to run only one inference, for example a single batch, before using the dumped data.

Using the error calculation tool. Before dumping, empty the dump_dir folder. Then run the tool with the two dump directories as inputs (the example below uses a 5% threshold):

cd ...
... --truth_value_dir=/path/to/truth_value_dump_dir --test_value_dir=/path/to/test_value_dump_dir ...

The tool accepts two kinds of input:

- The data in the two input folders was dumped with the tensor dump feature. The tool matches files by the naming rule above, computes the errors, and writes a report.
- The data in the two input folders was not produced by tensor dump, but corresponding files have exactly the same names. The tool then computes the errors in file-read order and writes the report. In this case each file must contain exactly one number per line.

The reports are:

- output_right.txt: entries for files whose error is less than or equal to --threshold.
- output_abnormal.txt: entries for files whose error could not be computed because of an anomaly: NaN values, Inf values, different element counts in the two files, or data that cannot be parsed (for example, corrupted text).

A normal report entry looks like:

truth_value_file: import.Preprocessor.mul_output_0_1503
test_value_file: import.Preprocessor.mul_output_0_1507
tensor_name: import/Preprocessor/mul:0
iteration: ...
error_rate: ...
mse: ...
is_same_size: ...

where:

- truth_value_file: the file taken from the --truth_value_dir directory.
- test_value_file: the file taken from the --test_value_dir directory.
- tensor_name: the name of the tensor in the network.
- iteration: which dump of this tensor within the run the entry corresponds to; if a tensor is dumped more than once in a single inference, the iteration distinguishes the dumps.
- error_rate: the relative error, compared against the --threshold given on the command line.
- mse: the mean square error.
- is_same_size: whether the two files contain the same number of elements.

When the two files contain different numbers of elements, no error is computed and the entry goes to output_abnormal.txt with is_same_size set to False:

truth_value_file: import.Preprocessor.mul_output_0_1503
test_value_file: ...
tensor_name: import/Preprocessor/mul:0
iteration: 1
error_rate: None
mse: None
is_same_size: False

When a file contains nan or inf values, the entry also goes to output_abnormal.txt and records which file is affected:

truth_value_file: ...
test_value_file: ...
iteration: 1
error_rate: None
mse: None
is_same_size: True
has nan/-nan: ...
has inf/-inf: test_value_file: True
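The error calculation tool ships with the release; as an illustration of the kind of per-file comparison described above (relative error and MSE over files that hold one number per line), here is a minimal standalone sketch. It is not the released tool, and the exact error formulas it uses are assumptions:

import numpy as np

def compare_dumped_tensors(truth_path, test_path):
    """Compare two tensor-dump files that contain one number per line."""
    truth = np.loadtxt(truth_path).ravel()
    test = np.loadtxt(test_path).ravel()
    report = {"is_same_size": truth.size == test.size}
    values = np.concatenate([truth, test]) if report["is_same_size"] else None
    if values is None or np.isnan(values).any() or np.isinf(values).any():
        # Mirrors the output_abnormal.txt cases: no error is computed.
        report.update(error_rate=None, mse=None)
        return report
    # Illustrative formulas: relative L2 error and mean square error.
    report["error_rate"] = float(np.linalg.norm(truth - test) /
                                 (np.linalg.norm(truth) + 1e-12))
    report["mse"] = float(np.mean((truth - test) ** 2))
    return report

# Example (hypothetical file names following the dump naming rule):
# print(compare_dumped_tensors(
#     "truth_dump/import.Preprocessor.mul_output_0_1503",
#     "test_dump/import.Preprocessor.mul_output_0_1507"))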
