Exact inference in structured prediction

Kevin Bello
Department of Computer Science, Purdue University, West Lafayette, IN 47906, USA

Jean Honorio
Department of Computer Science, Purdue University, West Lafayette, IN 47906, USA

Abstract

Structured prediction can be thought of as a simultaneous prediction of multiple labels. This is often done by maximizing a score function on the space of labels, which decomposes as a sum of pairwise and unary potentials. The above is naturally modeled with a graph, where edges and vertices are related to pairwise and unary potentials, respectively. We consider the generative process proposed by Globerson et al. (2015) and apply it to general connected graphs. We analyze the structural conditions of the graph that allow for the exact recovery of the labels. Our results show that exact recovery is possible and achievable in polynomial time for a large class of graphs. In particular, we show that graphs that are bad expanders can be exactly recovered by adding small edge perturbations coming from the Erdős–Rényi model. Finally, as a byproduct of our analysis, we provide an extension of Cheeger's inequality.

1 Introduction

Throughout the years, structured prediction has been continuously used in multiple domains such as computer vision, natural language processing, and computational biology. Examples of structured prediction problems include dependency parsing, image segmentation, part-of-speech tagging, named entity recognition, and protein folding. In this setting, the input X is some observation, e.g., a social network, an image, a sentence. The output is a labeling y, e.g., an assignment of each individual of a social network to a cluster, an assignment of each pixel in the image to foreground or background, or the parse tree for the sentence. A common approach to structured prediction is to exploit local features to infer the global structure. For instance, one could include a feature that encourages two individuals of a social network to be assigned to different clusters whenever there is a strong
disagreement in opinions about a particular subject. Then, one can define a posterior distribution over the set of possible labelings conditioned on the input. Some classical methods for learning the parameters of the model are conditional random fields (Lafferty et al. 2001) and structured support vector machines (Taskar et al. 2003, Tsochantaridis et al. 2005, Altun & Hofmann 2003). In this work we focus on the inference problem and assume that the model parameters have already been learned. In the context of Markov random fields (MRFs), for an undirected graph G = (V, E), one is interested in finding a solution to the following inference problem:

max_{y ∈ M^{|V|}}  Σ_{v∈V, m∈M} c_v(m) 1[y_v = m]  +  Σ_{(u,v)∈E, m,n∈M} c_{u,v}(m,n) 1[y_u = m, y_v = n],   (1)

where M is the set of possible labels, c_v(m) is the cost of assigning label m to node v, and c_{u,v}(m,n) is the cost of assigning m and n to the neighbors u, v respectively.¹

¹In the literature, the cost functions c_v and c_{u,v} are also known as unary and pairwise potentials, respectively.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Similar inference problems arise in the context of statistical physics, sociology, community detection, average case analysis, and graph partitioning. Very few cases of the general MRF inference problem are known to be exactly solvable in polynomial time. For example, Chandrasekaran et al. (2008) showed that (1) can be solved exactly in polynomial time for a graph G with low treewidth via the junction
tree algorithm. In the case of Ising models, Schraudolph & Kamenetsky (2009) showed that the inference problem can also be solved exactly in polynomial time for planar graphs via perfect matchings. Finally, polynomial-time solvability can also stem from properties of the pairwise potentials: under this view, the inference problem can be solved exactly in polynomial time via graph cuts for binary labels and submodular pairwise potentials (Boykov & Veksler 2006). Despite the intractability of maximum likelihood estimation, maximum a-posteriori estimation, and marginal inference for most models in the worst case, the inference task seems to be easier in practice than the theoretical worst case. Approximate inference algorithms can be extremely effective, often obtaining state-of-the-art results for these structured prediction tasks. Some important theoretical and empirical work on approximate inference includes (Foster et al. 2018, Globerson et al. 2015, Kulesza & Pereira 2007, Sontag et al. 2012, Koo et al. 2010, Daumé et al. 2009). In particular, Globerson et al. (2015) analyze the hardness of approximate inference in the case where performance is measured through the Hamming error, and provide conditions for the minimum-achievable Hamming error by studying a generative model. Similar to the objective (1), the authors in (Globerson et al. 2015) consider unary and pairwise noisy observations. As a concrete example (Foster et al. 2018), consider the problem of trying to recover the opinions of individuals in a social network. Suppose that every individual in a social network can hold one of two opinions, labeled −1 or +1. One observes a measurement of whether neighbors in the network agree in opinion, but the value of each measurement is flipped with probability
p (pairwise observations). Additionally, one receives estimates of the opinion of each individual, perhaps using a classification model on their profile, but these estimates are corrupted with probability q (unary observations). Foster et al. (2018) generalize the work of Globerson et al. (2015), who provide results for grid lattices, by providing results for trees and general graphs that allow tree decompositions (e.g., hypergrids and ring lattices). Note that the above problem is challenging since there is a statistical and computational trade-off, as in several machine learning problems. The statistical part focuses on giving highly accurate labels while ignoring computational constraints. In practice this is unrealistic: one cannot afford to wait a long time for each prediction, which has motivated several studies on this trade-off (e.g., Chandrasekaran & Jordan (2013), Bello & Honorio (2018)). However, while the statistical and computational trade-off appears in general, an interesting question is whether there are conditions under which recovery of the true labels is achievable in polynomial time, that is, conditions under which the Hamming error of the prediction is zero and can be obtained efficiently. The present work addresses this question. In contrast to (Globerson et al. 2015, Foster et al. 2018), we study sufficient conditions for exact recovery in polynomial time, and provide high-probability results for general families of undirected connected graphs, which we consider to be a novel result to the best of our knowledge. In particular, we show that weak-expander graphs (e.g., grids) can be exactly recovered by adding small perturbations (edges coming from the Erdős–Rényi model with small probability). Also, as a byproduct of our analysis, we provide an extension of Cheeger's inequality (Cheeger 1969). Finally, another work in this line was done by Chen et al. (2016), where the authors consider exact recovery for edges on sparse graphs such as grids and rings. However, (Chen et al. 2016) consider the case where one has multiple i.i.d. observations of edge labels. In contrast, we focus on the case where there is a single (noisy) observation of each edge and node in the graph.

2 Notation and Problem Formulation

This section introduces the notation used throughout the paper and formally defines the problem under analysis. Vectors and matrices are denoted
by lowercase and uppercase boldfaced letters respectively (e.g., a, A), while scalars are in normal font weight (e.g., a). Moreover, random variables are written in upright shape (e.g., a, A). For a random vector a and a random matrix A, their entries are denoted by a_i and A_{i,j} respectively. Indexing starts at 1, with A_{i,:} and A_{:,i} indicating the i-th row and i-th column of A respectively. Finally, sets and tuples are both expressed in uppercase calligraphic fonts and shall be distinguished by the context. For example, R will denote the set of real numbers.

We now present the inference task. We consider a problem setting similar to the one in (Globerson et al. 2015), with the only difference that we consider general undirected graphs. That is, the goal is to predict a vector of n node labels y = (y_1, …, y_n), where y_i ∈ {+1, −1}, from a set of observations X and c, where X and c correspond to corrupted measurements of edges and nodes respectively. These observations are assumed to be generated from a ground-truth labeling ȳ by a generative process defined via an undirected connected graph G = (V, E), an edge noise p ∈ (0, 0.5), and a node noise q ∈ (0, 0.5). For each edge (u,v) ∈ E, the edge observation X_{u,v} is independently sampled to be ȳ_u ȳ_v (good edge) with probability 1 − p, and −ȳ_u ȳ_v (bad edge) with probability p. For each edge (u,v) ∉ E, the observation X_{u,v} is always 0. Similarly, for each node u ∈ V, the node observation c_u is independently sampled to be ȳ_u (good node) with probability 1 − q, and −ȳ_u (bad node) with probability q.
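The generative process just described can be simulated directly. The sketch below is illustrative (function and variable names are ours, not from the paper): it samples the noisy edge matrix X and node vector c from a ground-truth labeling on a small graph.

```python
import numpy as np

def sample_observations(A, y_bar, p, q, rng):
    """Sample noisy observations for the generative process above.
    A: symmetric 0/1 adjacency matrix of G; y_bar: ground-truth +/-1 labels;
    p: edge noise; q: node noise.  (Illustrative implementation.)"""
    n = len(y_bar)
    X = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            if A[i, j]:
                # good edge (y_i * y_j) with prob 1-p, flipped with prob p
                z = 1 if rng.random() < 1 - p else -1
                X[i, j] = X[j, i] = y_bar[i] * y_bar[j] * z
    # good node with prob 1-q, flipped with prob q
    z_nodes = np.where(rng.random(n) < 1 - q, 1, -1)
    c = y_bar * z_nodes
    return X, c

rng = np.random.default_rng(0)
# a 4-cycle with ground truth (+1, +1, -1, -1); non-edges always observe 0
A = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]])
y_bar = np.array([1, 1, -1, -1])
X, c = sample_observations(A, y_bar, p=0.1, q=0.2, rng=rng)
```

Note that X is symmetric with entries in {−1, 0, +1}, zero exactly on non-edges, and c has entries in {−1, +1}, matching the formulation above.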
Thus, we have a known undirected connected graph G, an unknown ground-truth label vector ȳ ∈ {+1,−1}^n, and noisy observations X ∈ {−1,0,+1}^{n×n} and c ∈ {−1,+1}^n, and our goal is to predict a label vector ŷ ∈ {−1,+1}^n.

Definition 1 (Biased Rademacher variable). Let z_p ∈ {+1,−1} be such that P(z_p = +1) = 1 − p and P(z_p = −1) = p. We call z_p a biased Rademacher random variable with parameter p; its expected value is 1 − 2p.

From the definition above, we can write the edge observations as X_{u,v} = ȳ_u ȳ_v z_p^{(u,v)} 1[(u,v) ∈ E], where z_p^{(u,v)} is a biased Rademacher variable with parameter p, while the node observation is c_u = ȳ_u z_q^{(u)}, where z_q^{(u)} is a biased Rademacher variable with parameter q.

Given the generative process, we aim to solve the following optimization problem, which is based on the maximum likelihood estimator that returns the label argmax_y P(X, c | y) (see Globerson et al. (2015)):

max_y  (1/2) yᵀ X y + λ cᵀ y   subject to y_i ∈ {−1,+1},   (2)

where λ = log((1−q)/q) / log((1−p)/p). In general, the above combinatorial problem is NP-hard to compute (e.g., see (Barahona 1982) for results on grids). Our goal is to find which structural properties of the graph G suffice to achieve, with high probability, exact recovery in polynomial time.

3 On Exact Recovery of Labels

Our approach consists of two stages, similar in spirit to (Globerson et al. 2015). We first use only the quadratic term of (2), which gives us two possible solutions, and then, as a second stage, the linear term is used to decide the best of these two solutions.

3.1 First Stage

We analyze a semidefinite program
(SDP) relaxation of the following combinatorial problem (3), motivated by the techniques in (Abbe et al. 2016):

max_y  (1/2) yᵀ X y   subject to y_i ∈ {−1,+1}.   (3)

We denote the degree of node i as δ_i, and the maximum node degree as δ_max = max_{i∈V} δ_i. For any subset S ⊆ V, we denote its complement by S^C, such that S ∪ S^C = V and S ∩ S^C = ∅. Furthermore, let E(S, S^C) = {(i,j) ∈ E | i ∈ S, j ∈ S^C, or j ∈ S, i ∈ S^C}; i.e., |E(S, S^C)| denotes the number of edges between S and S^C.

Definition 2 (Edge expansion). For a set S ⊂ V with |S| ≤ n/2, its edge expansion φ_S is defined as φ_S = |E(S, S^C)| / |S|. The edge expansion of a graph G = (V, E) is then defined as φ_G = min_{S⊂V, |S|≤n/2} φ_S.

In the literature, φ_G is also known as the Cheeger constant, due to the geometric analogue defined by Cheeger in (Cheeger 1969). Next, we define the Laplacian matrix of a graph and the Rayleigh quotient, which are also used throughout this section.

Definition 3 (Laplacian matrix). For a graph G = (V, E) of n nodes, the Laplacian matrix L is defined as L = D − A, where D is the degree matrix and A is the adjacency matrix.

Definition 4 (Rayleigh quotient). For a given symmetric matrix M ∈ R^{n×n} and non-zero vector a ∈ R^n, the Rayleigh quotient R_M(a) is defined as R_M(a) = (aᵀ M a) / (aᵀ a).

We now define a signed Laplacian matrix.

Definition 5 (Signed Laplacian matrix). For a graph G = (V, E) of n nodes, a signed Laplacian matrix M is a symmetric matrix that satisfies xᵀ M x = Σ_{(i,j)∈E} (y_i x_i − y_j x_j)², where y is an eigenvector of M with eigenvalue 0 and y_i ∈ {+1,−1}.

Note that the typical Laplacian matrix, as in Definition 3, fulfills the conditions of Definition 5 with y_i = +1 for all i. Next, we present an intermediate result for later use.

Lemma 1. Let G = (V, E) be an undirected graph of n nodes with Laplacian L. Let M ∈ R^{n×n} be a signed Laplacian with eigenvector y as in Definition 5, and let a ∈ R^n be a vector such that ⟨y, a⟩ = 0. Finally, let 1 ∈ R^n be the vector of all ones. Then we have that, for any β ∈ R,

R_L(a ∘ y + β1) ≤ R_M(a),

where the operator ∘ denotes the Hadamard product.

Proof. First, note that L has a 0 eigenvalue with corresponding eigenvector 1. Also, we have that xᵀ L x = Σ_{(i,j)∈E} (x_i − x_j)² for any vector x. Then,

(a∘y + β1)ᵀ L (a∘y + β1) = Σ_{(i,j)∈E} ((y_i a_i + β) − (y_j a_j + β))² = Σ_{(i,j)∈E} (y_i a_i − y_j a_j)² = aᵀ M a.

Therefore, the numerators of R_L(a∘y + β1) and R_M(a) are equal. For the denominators, one can observe that

(a∘y + β1)ᵀ(a∘y + β1) = (a∘y)ᵀ(a∘y) + 2β⟨1, a∘y⟩ + β² 1ᵀ1 = Σ_i a_i y_i a_i y_i + 2β⟨a, y⟩ + β² n = aᵀa + β² n ≥ aᵀa,

which implies that R_L(a∘y + β1) ≤ R_M(a). ∎

In what follows, we present our first result, which has a connection to Cheeger's inequality (Cheeger 1969).

Theorem 1. Let G, M, L, y be defined as in Lemma 1, and let λ_1 ≤ λ_2 ≤ ⋯ ≤ λ_n be the eigenvalues of M. Then, we have that φ_G² / (4 δ_max) ≤ λ_2.

Proof. Since y is an eigenvector of M with eigenvalue 0, and M is a symmetric matrix, we can express λ_2 using the variational characterization of eigenvalues as follows:

λ_2 = min_{a ∈ R^n, aᵀy = 0} R_M(a),   (4)

where we used the fact that y is orthogonal to all the other eigenvectors, by the spectral theorem. Assume that a is the eigenvector associated with λ_2, i.e., we have that Ma = λ_2 a and aᵀy = 0. Then, by Lemma 1, we have
that:

R_L(a ∘ y + β1) ≤ R_M(a) = λ_2.   (5)

Next, we choose β ∈ R such that (a_1 y_1 + β, a_2 y_2 + β, …, a_n y_n + β) has median 0. The reason for the zero median is to later ensure that the subset of vertices S has at most n/2 vertices. Let w = a ∘ y + β1. From equation (5), we have that R_L(w) ≤ λ_2. Let w⁺ = (w⁺_i) be such that w⁺_i = w_i if w_i ≥ 0 and w⁺_i = 0 otherwise. Let w⁻ = (w⁻_i) be such that w⁻_i = w_i if w_i ≤ 0 and w⁻_i = 0 otherwise. Then, we have that either R_L(w⁺) ≤ 2 R_L(w) or R_L(w⁻) ≤ 2 R_L(w). Suppose w.l.o.g. that R_L(w⁺) ≤ 2 R_L(w); then it follows that R_L(w⁺) ≤ 2λ_2. Let us scale w⁺ by some constant γ ∈ R so that w⁺_1, w⁺_2, …, w⁺_n ∈ [0, 1]. It is clear that R_L(γ w⁺) = R_L(w⁺); therefore, we will still use w⁺ to denote the rescaled vector. That is, the entries of the vector w⁺ are now between 0 and 1. Next, we will show that there exists a set S ⊂ V with |S| ≤ n/2 such that E|E(S, S^C)| / E|S| ≤ √(2 R_L(w⁺) δ_max). We construct the set S as follows: we choose t ∈ [0, 1] uniformly at random and let S = {i | (w⁺_i)² ≥ t}.
Let B_{i,j} = 1 if (i ∈ S and j ∈ S^C) or (j ∈ S and i ∈ S^C), and B_{i,j} = 0 otherwise. Then, E|E(S, S^C)| = E[Σ_{(i,j)∈E} B_{i,j}] = Σ_{(i,j)∈E} E[B_{i,j}] = Σ_{(i,j)∈E} P((w⁺_j)² < t ≤ (w⁺_i)²), taking w.l.o.g. (w⁺_i)² ≥ (w⁺_j)². Recall that (w⁺_i)² ∈ [0, 1]; therefore, the probability above is |(w⁺_i)² − (w⁺_j)²|. Thus,

E|E(S, S^C)| = Σ_{(i,j)∈E} |w⁺_i − w⁺_j| · |w⁺_i + w⁺_j|
 ≤ √( Σ_{(i,j)∈E} (w⁺_i − w⁺_j)² ) · √( Σ_{(i,j)∈E} (w⁺_i + w⁺_j)² )   (6)
 ≤ √( Σ_{(i,j)∈E} (w⁺_i − w⁺_j)² ) · √( 2 Σ_{(i,j)∈E} ((w⁺_i)² + (w⁺_j)²) )
 ≤ √( Σ_{(i,j)∈E} (w⁺_i − w⁺_j)² ) · √( 2 δ_max Σ_i (w⁺_i)² ),   (7)

where eq. (6) is due to the Cauchy–Schwarz inequality and eq. (7) uses the maximum degree of a node as an upper bound. Now consider another random variable b_i such that b_i = 1 if i ∈ S, and b_i = 0 otherwise. Therefore, we have that E|S| = E[Σ_i b_i] = Σ_i E[b_i] = Σ_i P(t ≤ (w⁺_i)²) = Σ_i (w⁺_i)². Thus,

E|E(S, S^C)| / E|S| ≤ √( 2 δ_max Σ_{(i,j)∈E} (w⁺_i − w⁺_j)² / Σ_i (w⁺_i)² ) = √(2 R_L(w⁺) δ_max) ≤ √(2 · 2λ_2 · δ_max) = 2√(λ_2 δ_max).

The above implies that there exists some S such that |E(S, S^C)| / |S| ≤ 2√(λ_2 δ_max). Therefore, φ_G ≤ 2√(λ_2 δ_max), or equivalently, φ_G² / (4 δ_max) ≤ λ_2. ∎

Remark 1. For a given undirected graph G, its Laplacian matrix L fulfills the conditions of Lemma 1 and Theorem 1. That is, if M = L in Theorem 1, then it becomes the known Cheeger inequality. Therefore, our result in Theorem 1 applies to more general matrices and is of use for our next result.

We now provide the SDP relaxation of problem (3). Let Y = yyᵀ; then yᵀ X y = Tr(XY) = ⟨X, Y⟩. Since our prediction is a column vector y, we have that yyᵀ is rank-1 and symmetric, which implies that Y is a positive semidefinite matrix.
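The rank-1 identity just stated is easy to verify numerically. A minimal check, with an arbitrary symmetric X and ±1 vector y (the sizes and values are illustrative, chosen by us):

```python
import numpy as np

rng = np.random.default_rng(2)
# an arbitrary symmetric "observation" matrix with zero diagonal
X = rng.integers(-1, 2, size=(5, 5)).astype(float)
X = np.triu(X, 1)
X = X + X.T
# an arbitrary +/-1 labeling
y = rng.choice([-1, 1], size=5)

Y = np.outer(y, y)                               # Y = y y^T, rank-1, symmetric
assert np.isclose(y @ X @ y, np.trace(X @ Y))    # y^T X y = Tr(X Y) = <X, Y>
assert np.all(np.linalg.eigvalsh(Y) >= -1e-9)    # Y is positive semidefinite
assert np.all(np.diag(Y) == 1)                   # Y also satisfies Y_ii = 1
```

The last assertion shows that any such Y = yyᵀ is feasible for the relaxation that follows, which keeps only the constraints Y_ii = 1 and Y ⪰ 0.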
Therefore, our relaxation of the combinatorial problem (3) results in the following primal formulation²:

max_Y  ⟨X, Y⟩   subject to Y_{ii} = 1,  Y ⪰ 0.   (8)

We will make use of the following matrix concentration inequality for our main proof.

Lemma 2 (Matrix Bernstein inequality, Theorem 1.4 in (Tropp 2012)). Consider a finite sequence {N_k} of independent, random, self-adjoint matrices with dimension n. Assume that each random matrix satisfies E[N_k] = 0 and λ_max(N_k) ≤ R almost surely. Then, for all t ≥ 0,

P( λ_max(Σ_k N_k) ≥ t ) ≤ n · exp( −(t²/2) / (σ² + Rt/3) ),

where σ² = ‖Σ_k E[N_k²]‖.

The next theorem includes our main result and provides the conditions for exact recovery of labels with high probability.

Theorem 2. Let G = (V, E) be an undirected connected graph with n nodes, Cheeger constant φ_G, and maximum node degree δ_max. Then, for the combinatorial problem (3), a solution ŷ ∈ {ȳ, −ȳ} is achievable in polynomial time by solving the SDP-based relaxation (8), with probability at least 1 − ε_1(φ_G, δ_max, p), where p is the edge noise from our model, and

ε_1(φ_G, δ_max, p) = 2n · exp( − 3(1−2p)² φ_G⁴ / (1536 δ_max³ p(1−p) + 32(1−2p)(1−p) φ_G² δ_max) ).

Proof. Without loss of generality, assume that ȳ = 1. The first step of our proof corresponds to finding sufficient conditions for when Y* = ȳȳᵀ is the unique optimal solution to the SDP (8), for which we make use of the Karush–Kuhn–Tucker (K
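To illustrate how the two stages of Section 3 fit together end-to-end, the sketch below uses a cheap spectral stand-in for the first stage (rounding the leading eigenvector of X to ±1); this substitution is our assumption for brevity and is not the SDP relaxation (8) analyzed above. The second stage then resolves the global sign ambiguity using the linear term of (2).

```python
import numpy as np

def two_stage_decode(X, c, lam):
    """Illustrative two-stage decoding: stage 1 recovers a labeling up to a
    global sign from the quadratic term only; stage 2 uses the linear term
    lam * c^T y to break the sign ambiguity.  Stage 1 here rounds the leading
    eigenvector of X -- a spectral stand-in for the SDP relaxation (8)."""
    # stage 1: candidate labeling, determined only up to a global sign
    eigvals, eigvecs = np.linalg.eigh(X)
    v = eigvecs[:, -1]                       # leading eigenvector of X
    y_hat = np.where(v >= 0, 1, -1)
    # stage 2: the quadratic term is identical for y_hat and -y_hat,
    # so only the linear term lam * c^T y distinguishes them
    return y_hat if lam * (c @ y_hat) >= lam * (c @ -y_hat) else -y_hat

# toy example: two groups with opposite labels, noiseless observations
y_bar = np.array([1, 1, 1, -1, -1, -1])
X = np.outer(y_bar, y_bar).astype(float)
np.fill_diagonal(X, 0)                       # no self-loop observations
c = y_bar.copy()                             # noiseless node observations
p, q = 0.1, 0.1
lam = np.log((1 - q) / q) / np.log((1 - p) / p)
assert np.array_equal(two_stage_decode(X, c, lam), y_bar)
```

In this noiseless toy case the decoder recovers ȳ exactly; under the noise model above, Theorem 2 is precisely about when the (SDP-based) first stage still succeeds with high probability.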