Sparse Logistic Regression Learns All Discrete Pairwise Graphical Models

Shanshan Wu, Sujay Sanghavi, Alexandros G. Dimakis
Department of Electrical and Computer Engineering, University of Texas at Austin

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Abstract

We characterize the effectiveness of a classical algorithm for recovering the Markov graph of a general discrete pairwise graphical model from i.i.d. samples. The algorithm is (appropriately regularized) maximum conditional log-likelihood, which involves solving a convex program for each node; for Ising models this is $\ell_1$-constrained logistic regression, while for more general alphabets an $\ell_{2,1}$ group-norm constraint needs to be used. We show that this algorithm can recover any arbitrary discrete pairwise graphical model, and also characterize its sample complexity as a function of model width, alphabet size, edge parameter accuracy, and the number of variables. We show that along every one of these axes, it matches or improves on all existing results and algorithms for this problem. Our analysis applies a sharp generalization error bound for logistic regression when the weight vector has an $\ell_1$ (or $\ell_{2,1}$) constraint and the sample vector has an $\ell_\infty$ (or $\ell_{2,\infty}$) constraint. We also show that the proposed convex programs can be efficiently solved in $\tilde{O}(n^2)$ running time (where $n$ is the number of variables) under the same statistical guarantees. We provide experimental results to support our analysis.

1 Introduction

Undirected graphical models provide a framework for modeling high-dimensional distributions with dependent variables and have many applications, including in computer vision (Choi et al., 2010), bioinformatics (Marbach et al., 2012), and sociology (Eagle et al., 2009). In this paper we characterize the effectiveness of a natural, and already popular, algorithm for the structure learning problem. Structure learning is the task of finding the dependency graph of a Markov random field (MRF) given i.i.d. samples; typically one is also interested in finding estimates for the edge weights as well. We consider the structure learning problem in general (non-binary) discrete pairwise graphical models. These are MRFs where the variables take values in a discrete alphabet, but all interactions are pairwise. This includes the Ising model as a special case (which corresponds to a binary alphabet).

The natural and popular algorithm we consider is (appropriately regularized) maximum conditional log-likelihood for finding the neighborhood set of any given node. For the Ising model, this becomes $\ell_1$-constrained logistic regression; more generally, for non-binary graphical models the regularizer becomes an $\ell_{2,1}$ norm. We show that this algorithm can recover all discrete pairwise graphical models, and characterize its sample complexity as a function of the parameters of interest: model width, alphabet size, edge parameter accuracy, and the number of variables. We match or improve dependence on each of these parameters, over all existing results for the general alphabet case when no additional assumptions are made on the model (see Table 1). For the specific case of Ising models, some recent work has better dependence on some parameters (see Table 2 in Appendix A). We now describe the related work, and then outline our contributions.
| Paper | Assumptions | Sample complexity (N) |
| Greedy algorithm (Hamilton et al., 2017) | 1. Alphabet size $k \ge 2$; 2. Model width $\lambda$; 3. Degree $d$; 4. Minimum edge weight $\eta > 0$; 5. Probability of success $1 - \rho$ | $O\big(\exp(k^{O(d)} \exp(O(d\lambda^2))/\eta^{O(1)}) \ln(nk/\rho)\big)$ |
| Sparsitron (Klivans and Meka, 2017) | 1. Alphabet size $k \ge 2$; 2. Model width $\lambda$; 3. Minimum edge weight $\eta > 0$; 4. Probability of success $1 - \rho$ | $O\big(\lambda^2 k^5 \exp(14\lambda) \ln(nk/\rho)/\eta^4\big)$ |
| $\ell_{2,1}$-constrained logistic regression (this paper) | (same assumptions as Sparsitron) | $O\big(\lambda^2 k^4 \exp(14\lambda) \ln(nk/\rho)/\eta^4\big)$ |

Table 1: Sample complexity comparison for different graph recovery algorithms. The pairwise graphical model has alphabet size $k$. For $k = 2$ (i.e., Ising models), our algorithm reduces to $\ell_1$-constrained logistic regression (see Table 2 in Appendix A for related work on learning Ising models). Our sample complexity has a better dependency on the alphabet size ($O(k^4)$ versus $O(k^5)$) than that in (Klivans and Meka, 2017)[2].

Related Work

In a classic paper, Ravikumar et al. (2010) considered the structure learning problem for Ising models. They showed that $\ell_1$-regularized logistic regression provably recovers the correct dependency graph with a very small number of samples by solving a convex program for each variable.
This algorithm was later generalized to multi-class logistic regression with $\ell_{2,1}$ group-sparse regularization, for learning MRFs with higher-order interactions and non-binary variables (Jalali et al., 2011). A well-known limitation of (Ravikumar et al., 2010; Jalali et al., 2011) is that their theoretical guarantees hold only for a restricted class of models. Specifically, they require that the underlying model satisfies technical incoherence assumptions that are difficult to validate or check. A large amount of recent work has since proposed various algorithms to obtain provable learning results for general graphical models without requiring the incoherence assumptions. We now describe the most related part of the extensive related work, followed by our results and comparisons (see Table 1).

For a discrete pairwise graphical model, let $n$ be the number of variables and $k$ be the alphabet size; define the model width $\lambda$ as the maximum neighborhood weight (see Definitions 1 and 2 for the precise definitions). For structure learning algorithms, a popular approach is to focus on the sub-problem of finding the neighborhood of a single node; once every neighborhood is correctly learned, the overall graph structure follows by a simple union bound. Indeed, all the papers we now discuss are of this type. As shown in Table 1, Hamilton et al. (2017) proposed a greedy algorithm to learn pairwise (and higher-order) MRFs with general alphabet. Their algorithm generalizes the approach of Bresler (2015) for learning Ising models. The sample complexity in (Hamilton et al., 2017) grows logarithmically in $n$, but doubly exponentially in the width $\lambda$ (only a single-exponential dependence is necessary (Santhanam and Wainwright, 2012)). Klivans and Meka (2017) provided a different algorithmic and theoretical approach, setting structure learning up as an online learning problem and leveraging results for the Hedge algorithm. Their algorithm, Sparsitron, achieves a single-exponential dependence on $\lambda$.

Our Contributions

Our main result: We show that a classical algorithm, $\ell_{2,1}$-constrained[3] logistic regression, can recover the edge weights of a discrete pairwise graphical model from i.i.d. samples (see Theorem 2).
For the special case of Ising models (see Theorem 1), this reduces to $\ell_1$-constrained logistic regression. For the general setting with a non-binary alphabet, since each edge has a group of parameters, it is natural to use an $\ell_{2,1}$ group-sparse constraint to enforce sparsity at the level of groups. We make no incoherence assumption on the graphical models. As shown in Table 1, our sample complexity scales as $O(k^4)$, which improves[4] on the previous best result with $O(k^5)$ dependency[5].

[2] Theorem 8.4 in (Klivans and Meka, 2017) has a typo: the correct dependence should be $k^5$ instead of $k^3$. In Section 8 of (Klivans and Meka, 2017), after rewriting the conditional distribution as a sigmoid function, the weight vector $w$ has length $(n-1)k + 1$. Their derivation uses the incorrect bound $\|w\|_1 \le 2\lambda$, while it should be $\|w\|_1 \le 2\lambda k$. This gives rise to an additional $k^2$ factor in the final sample complexity.

[3] It may be possible to prove a similar result for the regularized version of the optimization problem using techniques from (Negahban et al., 2012). One would need to prove that the objective function satisfies restricted strong convexity (RSC) when the samples come from a graphical model distribution (Vuffray et al., 2016; Lokhov et al., 2018). It would be interesting to see whether the proof presented in this paper is related to the RSC condition.
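To make the Ising-model special case concrete, the following is a minimal sketch (our own illustration, not the paper's reference implementation) of the node-wise estimator: for each node $i$, minimize the empirical logistic loss of predicting $z_i$ from $z_{-i}$ under an $\ell_1$ constraint, halve the recovered weights to get edge estimates, and keep edges whose magnitude exceeds a threshold. The projected-gradient solver and the parameters `l1_radius`, `lr`, `iters`, and `tau` are illustrative choices, not values prescribed by the theorems; the mean-field terms $\theta_i$ are assumed to be zero for brevity.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def project_l1_ball(v, radius):
    """Euclidean projection of v onto {w : ||w||_1 <= radius} (Duchi et al., 2008)."""
    if np.abs(v).sum() <= radius:
        return v
    u = np.sort(np.abs(v))[::-1]
    cssv = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, v.size + 1) > (cssv - radius))[0][-1]
    theta = (cssv[rho] - radius) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def fit_node(X, y, l1_radius, lr=0.1, iters=2000):
    """l1-constrained logistic regression via projected gradient descent.
    X: (N, n-1) array of +/-1 covariates; y: (N,) array of +/-1 labels."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        # gradient of (1/N) * sum_i log(1 + exp(-y_i <w, x_i>))
        g = -(X * (y * sigmoid(-y * (X @ w)))[:, None]).mean(axis=0)
        w = project_l1_ball(w - lr * g, l1_radius)
    return w

def recover_ising_graph(Z, l1_radius, tau):
    """Node-wise neighborhood estimation from samples Z: (N, n) in {-1,1}.
    Assumes zero mean-field; returns the estimated edge set as a boolean matrix."""
    N, n = Z.shape
    A_hat = np.zeros((n, n))
    for i in range(n):
        X = np.delete(Z, i, axis=1)
        # P(z_i = 1 | z_-i) = sigmoid(2 <A(i,-i), z_-i>), so w estimates 2 A(i,-i)
        A_hat[i, np.arange(n) != i] = fit_node(X, Z[:, i], l1_radius) / 2.0
    A_hat = (A_hat + A_hat.T) / 2.0   # average the two estimates of each edge
    return np.abs(A_hat) > tau        # threshold to recover the edge set
```

Solving each node's program with $T$ batch gradient steps costs $O(TNn)$ time, so the full loop over $n$ nodes takes $O(TNn^2)$ time, matching the runtime discussion in Section 2.3.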
Our analysis applies a sharp generalization error bound for logistic regression when the weight vector has an $\ell_{2,1}$ (or $\ell_1$) constraint and the sample vector has an $\ell_{2,\infty}$ (or $\ell_\infty$) constraint (see Lemmas 8 and 11 in Appendix E). Our key insight is that a generalization bound can be used to control the squared distance between the predicted and true logistic functions (see Lemmas 1 and 2 in Appendix B), which in turn implies an $\ell_\infty$ norm bound between the weight vectors (see Lemmas 5 and 6 in Appendix B).

We show that the proposed algorithms can run in $\tilde{O}(n^2)$ time without affecting the statistical guarantees (see Section 2.3). Note that $\tilde{O}(n^2)$ is an efficient runtime for graph recovery over $n$ nodes. Previous algorithms in (Hamilton et al., 2017; Klivans and Meka, 2017) also require $\tilde{O}(n^2)$ runtime for structure learning of pairwise graphical models.

We construct examples that violate the incoherence condition proposed in (Ravikumar et al., 2010) (see Figure 1; a numerical check of this condition is sketched below). We then run $\ell_1$-constrained logistic regression and show that it recovers the graph structure when given enough samples. This verifies our analysis and shows that our conditions for graph recovery are weaker than those in (Ravikumar et al., 2010).

We empirically compare the proposed algorithm with the Sparsitron algorithm of (Klivans and Meka, 2017) over different alphabet sizes, and show that our algorithm needs fewer samples for graph recovery (see Figure 2).
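As an aside, for small models the incoherence quantity of Ravikumar et al. (2010) can be evaluated by exact enumeration. The sketch below is our own illustration (the graph used here is arbitrary and is not the construction of Figure 1): it computes one standard form of the quantity, $\|Q^*_{S^c S}(Q^*_{SS})^{-1}\|_\infty$ at a node $r$, where $Q^*$ is the population Hessian of the node's logistic loss and $S$ is its true neighborhood; the condition requires this to be at most $1 - \alpha$ for some $\alpha > 0$.

```python
import itertools
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def incoherence(A, r):
    """Incoherence quantity of Ravikumar et al. (2010) at node r, computed by
    exact enumeration (zero mean-field, small n only). Assumes node r has at
    least one neighbor and at least one non-neighbor."""
    n = A.shape[0]
    states = np.array(list(itertools.product([-1, 1], repeat=n)))
    energy = 0.5 * np.einsum('si,ij,sj->s', states, A, states)  # sum_{i<j} A_ij z_i z_j
    p = np.exp(energy - energy.max())
    p /= p.sum()
    x = np.delete(states, r, axis=1)          # covariates z_{-r} for each state
    w = 2.0 * np.delete(A[r], r)              # true logistic weights for node r
    eta = sigmoid(x @ w) * sigmoid(-(x @ w))  # conditional variance weights
    Q = np.einsum('s,si,sj->ij', p * eta, x, x)  # population Hessian (Fisher matrix)
    S = np.nonzero(w)[0]                      # true neighborhood of r
    Sc = np.setdiff1d(np.arange(n - 1), S)
    M = Q[np.ix_(Sc, S)] @ np.linalg.inv(Q[np.ix_(S, S)])
    return np.abs(M).sum(axis=1).max()        # max row-wise l1 norm

# Arbitrary small example: values >= 1 would violate the incoherence condition.
A = np.zeros((4, 4))
for i, j, wt in [(0, 1, 0.5), (0, 2, 0.5), (1, 2, 0.5), (2, 3, 0.5)]:
    A[i, j] = A[j, i] = wt
print(incoherence(A, 3))
```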
Notation. We use $[n]$ to denote the set $\{1, 2, \ldots, n\}$. For a vector $x \in \mathbb{R}^n$, we use $x_i$ or $x(i)$ to denote its $i$-th coordinate. The $\ell_p$ norm of a vector is defined as $\|x\|_p = (\sum_i |x_i|^p)^{1/p}$. We use $x_{-i} \in \mathbb{R}^{n-1}$ to denote the vector after deleting the $i$-th coordinate. For a matrix $A \in \mathbb{R}^{n \times k}$, we use $A_{ij}$ or $A(i,j)$ to denote its $(i,j)$-th entry. We use $A(i,:) \in \mathbb{R}^k$ and $A(:,j) \in \mathbb{R}^n$ to denote the $i$-th row vector and the $j$-th column vector. The $\ell_{p,q}$ norm of a matrix $A \in \mathbb{R}^{n \times k}$ is defined as $\|A\|_{p,q} = \big\|(\|A(1,:)\|_p, \ldots, \|A(n,:)\|_p)\big\|_q$. We define $\|A\|_\infty = \max_{ij} |A(i,j)|$. We use $\langle \cdot, \cdot \rangle$ to represent the dot product between two vectors, $\langle x, y \rangle = \sum_i x_i y_i$, or two matrices, $\langle A, B \rangle = \sum_{ij} A(i,j) B(i,j)$.
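As a quick illustration of the matrix norm just defined (our own example, not from the paper), $\|A\|_{2,1}$ takes the $\ell_2$ norm of each row and then sums them; a row of zeros contributes nothing, which is what makes this norm useful for enforcing sparsity at the level of groups:

```python
import numpy as np

def matrix_pq_norm(A, p, q):
    """||A||_{p,q}: the l_p norm of each row, then the l_q norm of those values."""
    row_norms = np.linalg.norm(A, ord=p, axis=1)
    return np.linalg.norm(row_norms, ord=q)

A = np.array([[3.0, 4.0],   # row l2 norm: 5
              [0.0, 0.0],   # row l2 norm: 0 (an entirely "off" group)
              [1.0, 0.0]])  # row l2 norm: 1
print(matrix_pq_norm(A, 2, 1))  # l_{2,1} norm: 5 + 0 + 1 = 6.0
```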
2 Main results

We start with the special case of binary variables (i.e., Ising models), and then move to the general case with non-binary variables.

2.1 Learning Ising models

We first give the definition of an Ising model distribution.

Definition 1. Let $A \in \mathbb{R}^{n \times n}$ be a symmetric weight matrix with $A_{ii} = 0$ for $i \in [n]$. Let $\theta \in \mathbb{R}^n$ be a mean-field vector. The $n$-variable Ising model is a distribution $\mathcal{D}(A, \theta)$ on $\{-1, 1\}^n$ that satisfies

$$\mathbb{P}_{Z \sim \mathcal{D}(A,\theta)}[Z = z] \propto \exp\Big(\sum_{1 \le i < j \le n} A_{ij} z_i z_j + \sum_{i \in [n]} \theta_i z_i\Big).$$
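Definition 1 implies the standard fact, used throughout the paper, that each coordinate's conditional distribution is a logistic function of the remaining coordinates: $\mathbb{P}[Z_i = 1 \mid Z_{-i} = x] = \sigma(\langle w, x \rangle + b)$ with $w = 2A(i, -i)$, $b = 2\theta_i$, and $\sigma(t) = 1/(1 + e^{-t})$. The brute-force check below (a sketch for small $n$, on an arbitrary randomly generated model of our own choosing) verifies this by exact enumeration:

```python
import itertools
import numpy as np

def ising_probs(A, theta):
    """Exact probability table of D(A, theta) over {-1,1}^n (small n only)."""
    n = A.shape[0]
    states = np.array(list(itertools.product([-1, 1], repeat=n)))
    energy = 0.5 * np.einsum('si,ij,sj->s', states, A, states) + states @ theta
    p = np.exp(energy - energy.max())
    return states, p / p.sum()

n = 4
rng = np.random.default_rng(0)
A = rng.normal(size=(n, n))
A = np.triu(A, 1)
A = A + A.T                        # symmetric with zero diagonal
theta = rng.normal(size=n)
states, p = ising_probs(A, theta)

i = 2
for x in itertools.product([-1, 1], repeat=n - 1):
    x = np.array(x)
    # conditional P(Z_i = 1 | Z_{-i} = x) computed from the joint table
    mask = (np.delete(states, i, axis=1) == x).all(axis=1)
    cond = p[mask & (states[:, i] == 1)].sum() / p[mask].sum()
    # logistic form: sigmoid(2 <A(i,-i), x> + 2 theta_i)
    logistic = 1.0 / (1.0 + np.exp(-(2 * np.delete(A[i], i) @ x + 2 * theta[i])))
    assert np.isclose(cond, logistic)
print("conditional law matches the logistic form")
```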
[...]

Theorem 3. In the setting of Theorem 1 and Theorem 2, given any $\rho > 0$, if the number of mirror descent iterations satisfies $T = O(\lambda^2 k^3 \exp(O(\lambda)) \ln(n/\rho)/\epsilon^4)$, then (6) and (15) still hold with probability at least $1 - \rho$. The time and space complexity of Algorithm 1 are $O(TNn^2)$ and $O(TN + n^2)$. The time and space complexity of Algorithm 2 are $O(TNn^2k^2)$ and $O(TN + n^2k^2)$.
The proof of Theorem 3 requires bounding $\|\tilde{w} - w\|_\infty$, where $\tilde{w}$ is the iterate after $T$ mirror descent steps. This is non-trivial because we are in the high-dimensional regime: since the number of samples is $N = O(\ln(n))$, the empirical loss functions in (11) and (4) are not strongly convex. Due to the space limit, more details of this section can be found in Appendices I, J, and K.

Note that $\tilde{O}(n^2)$ is an efficient time complexity for graph recovery over $n$ nodes. Previous structure learning algorithms for Ising models require either $\tilde{O}(n^2)$ complexity (Bresler, 2015; Klivans and Meka, 2017) or a worse time complexity (Ravikumar et al., 2010; Vuffray et al., 2016). It is possible to improve the time complexity given in Theorem 3 (especially the dependence on $\epsilon$ and $\lambda$) by using stochastic or accelerated versions of mirror descent (instead of the batch version given in Appendix J). In fact, the Sparsitron algorithm proposed by Klivans and Meka (2017) can be seen as an online mirror descent algorithm for optimizing the $\ell_1$-constrained logistic regression (see Algorithm 3 in Appendix J). Furthermore, Algorithms 1 and 2 can be parallelized, as every node solves an independent regression problem. We would like to remark that our goal here is not to give the fastest first-order optimization algorithm; instead, our goal is to show provably that Algorithm 1 and Algorithm 2 can run in $\tilde{O}(n^2)$ time without affecting the original statistical guarantees.
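For intuition, here is a minimal sketch of batch mirror descent on the $\ell_1$ ball using the standard EG$\pm$ parametrization $w = u - v$ with $u, v \ge 0$ (our own illustration; the precise update and step size of Appendix J may differ, and `eta` and `T` below are arbitrary choices):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def mirror_descent_logreg(X, y, radius, eta=0.05, T=500):
    """Batch mirror descent (EG+/-) for minimizing the empirical logistic loss
    subject to ||w||_1 <= radius. X: (N, d) in {-1,1}; y: (N,) in {-1,1}."""
    N, d = X.shape
    # Parametrize w = u - v with u, v >= 0 and sum(u) + sum(v) = radius,
    # so that ||w||_1 <= radius holds automatically.
    u = np.full(d, radius / (2 * d))
    v = np.full(d, radius / (2 * d))
    w_avg = np.zeros(d)
    for _ in range(T):
        w = u - v
        g = -(X * (y * sigmoid(-y * (X @ w)))[:, None]).mean(axis=0)
        u *= np.exp(-eta * g)          # entropic (multiplicative) updates
        v *= np.exp(eta * g)
        scale = radius / (u.sum() + v.sum())
        u *= scale
        v *= scale
        w_avg += u - v
    return w_avg / T                   # averaged iterate, as is standard for MD
```

Each iteration costs $O(Nd)$ time, so running the per-node solver for all $n$ nodes with $d = n - 1$ gives the $O(TNn^2)$ total runtime stated in Theorem 3.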
3 Proof outline

We give a proof outline for Theorem 1; the proof of Theorem 2 follows a similar outline. Let $\mathcal{D}$ be a distribution over $\{-1,1\}^{n-1} \times \{-1,1\}$, where $(x, y) \sim \mathcal{D}$ satisfies $\mathbb{P}[y = 1 \mid x] = \sigma(\langle w, x \rangle)$. Let $L(w') = \mathbb{E}_{(x,y)\sim\mathcal{D}}[\ln(1 + e^{-y\langle w', x\rangle})]$ and $\hat{L}(w') = \sum_{i=1}^{N} \ln(1 + e^{-y_i\langle w', x_i\rangle})/N$ be the expected and empirical logistic losses. Suppose $\|w\|_1 \le 2\lambda$. Let $\hat{w} \in \arg\min_{\|w'\|_1 \le 2\lambda} \hat{L}(w')$. Our goal is to prove that $\|\hat{w} - w\|_\infty$ is small when the samples are constructed from an Ising model distribution. Our proof can be summarized in three steps:

1. If the number of samples satisfies $N = O(\lambda^2 \ln(n/\rho)/\gamma^2)$, then $L(\hat{w}) - L(w) \le O(\gamma)$. This is obtained using a sharp generalization bound that holds when $\|w'\|_1 \le 2\lambda$ and $\|x\|_\infty \le 1$ (see Lemma 8 in Appendix E).

2. For any $w'$, we show that $L(w') - L(w) \ge \mathbb{E}_x\big(\sigma(\langle w', x\rangle) - \sigma(\langle w, x\rangle)\big)^2$ (see Lemma 10 and Lemma 9 in Appendix E). Hence, Step 1 implies that $\mathbb{E}_x\big(\sigma(\langle \hat{w}, x\rangle) - \sigma(\langle w, x\rangle)\big)^2 \le O(\gamma)$ (see Lemma 1 in Appendix B).

3. We now use a result from (Klivans and Meka, 2017) (see Lemma 5 in Appendix B) to conclude that $\|\hat{w} - w\|_\infty$ is small.
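In summary, the three steps chain together as follows ($\gamma$ is then chosen in terms of the target accuracy, which drives the sample complexity in Theorem 1):

$$L(\hat{w}) - L(w) \le O(\gamma) \;\overset{\text{Step 2}}{\Longrightarrow}\; \mathbb{E}_x\big(\sigma(\langle \hat{w}, x\rangle) - \sigma(\langle w, x\rangle)\big)^2 \le O(\gamma) \;\overset{\text{Step 3}}{\Longrightarrow}\; \|\hat{w} - w\|_\infty \text{ is small.}$$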