Lecture Notes in Computer Science
Commenced Publication in 1973
Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board
David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Alfred Kobsa, University of California, Irvine, CA, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, TU Dortmund University, Germany
Madhu Sudan, Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Gerhard Weikum, Max Planck Institute for Informatics, Saarbruecken, Germany
6516
Ravi Kumar D. Sivakumar (Eds.)
Algorithms and Models for the Web Graph
7th International Workshop, WAW 2010
Stanford, CA, USA, December 13-14, 2010
Proceedings
Volume Editors

Ravi Kumar, Yahoo! Research, 701 First Ave, Sunnyvale, CA 94089, USA
E-mail: [email protected]

D. Sivakumar, Google Research, 1600 Amphitheatre Parkway, Mountain View, CA 94043, USA
E-mail: [email protected]
Library of Congress Control Number: 2010941363
CR Subject Classification (1998): I.2, H.3, H.4, H.2.8, I.4, H.5
LNCS Sublibrary: SL 1 – Theoretical Computer Science and General Issues
ISSN: 0302-9743
ISBN-10: 3-642-18008-6 Springer Berlin Heidelberg New York
ISBN-13: 978-3-642-18008-8 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. springer.com © Springer-Verlag Berlin Heidelberg 2010 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper 06/3180
Preface
This volume contains the papers presented at WAW 2010, the 7th Workshop on Algorithms and Models for the Web Graph, held during December 13-14, 2010, at Stanford University, USA. There were 19 submissions, and each submission was reviewed by at least three Programme Committee members. Papers were submitted and reviewed using the EasyChair online system. The committee decided to accept 13 papers. The programme also included an invited talk by Andrei Z. Broder of Yahoo! Research.

November 2010
Ravi Kumar D. Sivakumar
Organization
Conference Chairs
Andrei Z. Broder, Yahoo! Research
Fan Chung Graham, University of California, San Diego

Programme Chairs
Ravi Kumar, Yahoo! Research
D. Sivakumar, Google, Inc.

Programme Committee
Lars Backstrom, Facebook
Tanya Berger-Wolf, University of Illinois, Chicago
Anthony Bonato, Ryerson University
Anirban Dasgupta, Yahoo! Labs
Luca de Alfaro, Google and University of California, Santa Cruz
Abraham Flaxman, University of Washington
Ashish Goel, Stanford University and Twitter
David Kempe, University of Southern California
Jure Leskovec, Stanford University
Lincoln Lu, University of South Carolina
Frank McSherry, Microsoft Research
Jennifer Neville, Purdue University
Alessandro Panconesi, Sapienza University, Rome

WINE 2010 Liaison and Local Organization
Amin Saberi, Stanford University
Yinyu Ye, Stanford University
Table of Contents
The Anatomy of the Long Tail of Consumer Demand (Invited Talk)
   Andrei Broder ....................................................... 1

A Sharp PageRank Algorithm with Applications to Edge Ranking and
Graph Sparsification
   Fan Chung and Wenbo Zhao ............................................ 2

Efficient Triangle Counting in Large Graphs via Degree-Based Vertex
Partitioning
   Mihail N. Kolountzakis, Gary L. Miller, Richard Peng, and
   Charalampos E. Tsourakakis .......................................... 15

Computing an Aggregate Edge-Weight Function for Clustering Graphs
with Multiple Edge Types
   Matthew Rocklin and Ali Pinar ....................................... 25

Component Evolution in General Random Intersection Graphs
   Milan Bradonjić, Aric Hagberg, Nicolas W. Hengartner, and
   Allon G. Percus ..................................................... 36

Modeling Traffic on the Web Graph
   Mark R. Meiss, Bruno Gonçalves, José J. Ramasco,
   Alessandro Flammini, and Filippo Menczer ............................ 50

Multiplicative Attribute Graph Model of Real-World Networks
   Myunghwan Kim and Jure Leskovec ..................................... 62

Random Walks on Digraphs, the Generalized Digraph Laplacian and
the Degree of Asymmetry
   Yanhua Li and Zhi-Li Zhang .......................................... 74

Finding and Visualizing Graph Clusters Using PageRank Optimization
   Fan Chung Graham and Alexander Tsiatas .............................. 86

Improving Random Walk Estimation Accuracy with Uniform Restarts
   Konstantin Avrachenkov, Bruno Ribeiro, and Don Towsley .............. 98

The Geometric Protean Model for On-Line Social Networks
   Anthony Bonato, Jeannette Janssen, and Pawel Pralat ................. 110

Constant Price of Anarchy in Network Creation Games via Public
Service Advertising
   Erik D. Demaine and Morteza Zadimoghaddam ........................... 122

Fast Katz and Commuters: Efficient Estimation of Social Relatedness
in Large Networks
   Pooya Esfandiar, Francesco Bonchi, David F. Gleich, Chen Greif,
   Laks V.S. Lakshmanan, and Byung-Won On .............................. 132

Game-Theoretic Models of Information Overload in Social Networks
   Christian Borgs, Jennifer Chayes, Brian Karrer, Brendan Meeder,
   R. Ravi, Ray Reagans, and Amin Sayedi ............................... 146

Author Index .......................................................... 163
The Anatomy of the Long Tail of Consumer Demand

Andrei Broder

Yahoo! Research, 701 First Ave., Sunnyvale, CA 94089, USA
[email protected]
Abstract. The long tail of consumer demand is consistent with two fundamentally different theories. The first, and more popular, hypothesis is that a majority of consumers have similar tastes and only a few have any interest in niche content; the second is that everyone is a bit eccentric, consuming both popular and niche products. By examining extensive data on user preferences for movies, music, web search, and web browsing, we found overwhelming support for the latter theory. Our investigation suggests an additional factor in the success of "infinite-inventory" retailers such as Netflix and Amazon: besides the significant revenue obtained from tail sales, tail availability may boost head sales by offering consumers the convenience of "one-stop shopping" for both their mainstream and niche interests. However, the observed taste eccentricity is much less than what is predicted by a fully random model whereby every consumer makes his product choices independently and proportionally to product popularity. Hence, it appears consumers have a small a priori propensity towards the popular or the exotic, but constructing a good model that agrees with the observed data, as well as characterizing "eccentricity", are still open questions that we will discuss in some detail. This talk is largely based on joint work with Sharad Goel, Evgeniy Gabrilovich, and Bo Pang: "Anatomy of the long tail: Ordinary people with extraordinary tastes", WSDM 2010.
A Sharp PageRank Algorithm with Applications to Edge Ranking and Graph Sparsification

Fan Chung and Wenbo Zhao

University of California, San Diego, La Jolla, CA 92093
{fan,w3zhao}@ucsd.edu
Abstract. We give an improved algorithm for computing personalized PageRank vectors with tight error bounds, which can be as small as Ω(n^{-p}) for any fixed positive integer p. The improved PageRank algorithm is crucial for computing a quantitative ranking of edges in a given graph. We will use the edge ranking to examine two interrelated problems: graph sparsification and graph partitioning. We can combine the graph sparsification and the partitioning algorithms using PageRank vectors to derive an improved partitioning algorithm.
1 Introduction
PageRank, which was first introduced by Brin and Page [9], is at the heart of Google's web searching algorithms. Originally, PageRank was defined for the Web graph (which has all webpages as vertices and hyperlinks as edges). For any given graph, PageRank is well-defined and can be used for capturing quantitative correlations between pairs of vertices as well as pairs of subsets of vertices. In addition, PageRank vectors can be efficiently computed and approximated (see [3,4,8,13,14]). The running time of the approximation algorithm in [3] for computing a PageRank vector within an error bound of ε is basically O(1/ε). For the problems that we will examine in this paper, it is quite crucial to have a sharper error bound. In Section 2, we will give an improved approximation algorithm with running time O(m log(1/ε)) to compute PageRank vectors within an error bound of ε.

PageRank was originally meant for determining the "importance" of vertices in the Web graph. It is also essential to identify the "importance" of edges in dealing with various problems. We will use PageRank vectors to define a quantitative ranking of edges, which we call Green values for edges because of its connection with discrete Green functions. The Green values for edges can also be viewed as a generalized version of effective resistances in electric network theory. The detailed definition of the Green values of edges will be given in Section 3. We then use the sharp approximate PageRank algorithm to compute Green values within sharp error bounds.

To illustrate the use of Green values, we examine a basic problem of sparsifying graphs. Graph sparsification was first introduced by Benczúr and Karger
[7,15,16,17] for approximately solving various network design problems. The heart of these graph sparsification algorithms is the sampling technique for randomly selecting edges. The goal is to approximate a given graph G with m edges on n vertices by a sparse graph G̃, called a sparsifier, with O(n log n) edges (or fewer) on the same set of vertices, in such a way that every cut in the sparsifier G̃ is within a factor of (1 ± ε) of the corresponding cut in G for some constant ε > 0. It was shown that, for all x ∈ {0,1}^n, |x^T L x − x^T L̃ x| ≤ ε x^T L x, where L and L̃ are the Laplacians of the graph G and its sparsifier G̃, respectively. Spielman and Teng [24] devised a sampling scheme to construct a spectral sparsifier with O(n log^c n) edges, for some (large) constant c, in O(m polylog(n)) time. A spectral sparsifier G̃ for a graph G is a sparsifier satisfying |x^T L x − x^T L̃ x| ≤ ε x^T L x for all x ∈ R^n. In [25], Spielman and Srivastava gave a different sampling scheme using the effective resistances of electrical networks to construct an improved spectral sparsifier with only O(n log n) edges. In the process of constructing this spectral sparsifier, they use the Spielman-Teng solver [24] as a subroutine for solving O(log n) linear systems. The running time of their sparsification algorithm is mainly dominated by the running time of the Spielman-Teng solver, which is O(m log^c n) for a very large constant c [24]. Recently, Batson, Spielman and Srivastava [6] gave an elegant construction of a spectral sparsifier with a linear number of edges, although the running time is O(n^3 m). Here, we will use Green values to sample the edges of G in order to form a sparsifier G̃ with O(n log n) edges. There are two advantages of sampling using PageRank and Green values: the running time of our sparsification algorithm is significantly faster, and the algorithm is simpler than those in [6,24], since we avoid using the Spielman-Teng solver for solving linear systems.

In addition, the graph sparsification problem is closely related to graph partitioning algorithms. For graph partitioning, a previously widely used approach is the recursive spectral method, which finds a balanced cut in a graph on n vertices in time O((n²/λ) polylog(n)) (see [23]), with an approximation guarantee within the square root of the optimal conductance (where λ denotes the spectral gap of the normalized Laplacian). The running time can be further improved to O(n² polylog(n)) by using the Spielman-Teng solver for linear systems [24]. Another approach to the balanced cut problem is to use commodity flows [2,20]. In [2] the approximation guarantee is within a factor of log n of the optimal, which was further reduced to O(√(log n)) in [5], but the running time is still O(n² polylog(n)). In another direction, Spielman and Teng [24,25] introduced local graph partitioning, which yields a cut near specified seeds with running time depending only on the volume of the output. Their local partitioning algorithm has an approximation guarantee similar to the spectral method, obtained via a mixing result on random walks [19]. Andersen, Chung and Lang [3] used PageRank vectors to give a local partitioning algorithm with improved approximation guarantee and running time. Recently, Andersen and Peres used evolving sets instead of PageRank to further improve the running time [4].

Our balanced-cut algorithm consists of two parts. First we use PageRank vectors to sparsify the graph. Then we use the known PageRank partitioning
algorithm to find a balanced cut. Both parts have the same complexity as computing the PageRank vectors. Consequently, the complexity of our PageRank balanced-cut algorithm is O(m log² n/φ + n polylog(n)) for any input graph on n vertices and m edges. The balanced-cut algorithm here can be viewed as an application of graph sparsification. In this paper, we omit the proofs of most of the lemmas and theorems (except Theorems 4, 5, and 7); they will be included in the full version of the paper.
2 A Sharp PageRank Approximation Algorithm
We consider an undirected, weighted graph G = (V, E, w) with n vertices and m edges, where the edge weights satisfy w(u, v) = w(v, u) ≥ 0 and the edge set E consists of all pairs (u, v) with w(u, v) > 0. The weighted degree d(u) of vertex u is the sum of w(u, v) over all v, i.e., d(u) = Σ_v w(u, v). A typical random walk is defined by its transition probability matrix P satisfying P(u, v) = w(u, v)/d(u). We may write P = D^{-1}A, where A is the weighted adjacency matrix satisfying A(u, v) = w(u, v) for all pairs (u, v) ∈ E and D is the diagonal matrix of weighted degrees. We here consider the lazy walk Z on G, defined by Z = (I + P)/2. In [3], the PageRank vector pr is defined by a recurrence relation involving a seed vector s (a probability distribution) and a positive jumping constant α < 1 (or transportation constant): pr = αs + (1 − α) pr Z, where pr and s are taken to be row vectors. In this paper, we consider the PageRank pr as a discrete Green's function αs(I − (1 − α)Z)^{-1} = βs(βI + L)^{-1}, where β = 2α/(1 − α) and L = I − P. Note that the usual Green's function is associated with the pseudo-inverse of L. Another (equivalent) way to express the recurrence of PageRank in terms of β and s is the following: for a positive value β > 0, the (personalized) PageRank vector pr_{β,s} with seed vector s is the unique solution of the equation

pr_{β,s} = (β/(2+β)) s + (2/(2+β)) pr_{β,s} Z.

If the seed vector is the characteristic function χ_u of a single vertex u, we may write pr_{β,χu} = pr_{β,u} if there is no confusion. It is easy to check that Σ_{v∈V} pr_{β,s}(v) = 1 since Σ_{v∈V} s(v) = 1.

The PageRank approximation algorithm in [3] contains the following routines, called Push and ApproximatePR, which serve as subroutines later in the sharp approximate PageRank algorithm. Given a vertex u, an approximate PageRank vector p and a residual vector r, the Push operation is as follows:

Push(u): Let p′ = p and r′ = r, except for these changes:
1. let p′(u) = p(u) + (β/(2+β)) r(u) and r′(u) = r(u)/(2+β);
2. for each vertex v such that (u, v) ∈ E: r′(v) = r(v) + r(u)/((2+β) d(u)).
Lemma 1 ([3]). Let p′ and r′ denote the resulting vectors after performing operation Push(u) with vectors p and r. Then p = pr_{β,s−r} implies p′ = pr_{β,s−r′}.

Theorem 1 ([3]). For any vector s with ‖s‖₁ ≤ 1 and any constant ε ∈ (0, 1], the algorithm ApproximatePR(s, β, ε) computes an approximate PageRank
vector p = pr_{β,s−r} such that the residual vector r satisfies |r(v)/d(v)| ≤ ε for all v ∈ V. The running time of the algorithm is O((2+β)/(εβ)).

ApproximatePR(s, β, ε):
1. Let p = 0 and r = s.
2. While r(u) ≥ ε d(u) for some vertex u: pick any vertex u where r(u) ≥ ε d(u) and apply operation Push(u).
3. Return p and r.

We will improve the error bound of the above algorithm by the following iterative process.

SharpApproximatePR(s, β, ε):
1. Let ε′ = 1, r = s and p = 0.
2. While ε′ > ε:
   (a) set ε′ = ε′/2;
   (b) let p′ and r′ be the output of ApproximatePR(r, β, ε′);
   (c) let p = p + p′ and r = r′.
3. Return p and r.

Theorem 2. Given a constant ε ∈ (0, 1], β > 0 and a seed vector s, to approximate the PageRank vector pr_{β,s}, the algorithm SharpApproximatePR(s, β, ε) computes an approximate PageRank vector p = pr_{β,s−r} such that the residual vector r satisfies |r(v)/d(v)| ≤ ε for all v ∈ V, and the running time is O((2+β) m log(1/ε)/β). In particular, if ε is the inverse of a polynomial in n of degree p, i.e. ε = Ω(n^{-p}), the running time can be bounded by O(m log n/β).
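To make the push mechanics concrete, the following Python sketch implements ApproximatePR and SharpApproximatePR for an unweighted graph stored as adjacency lists. The function names and the dict-based layout of p, r and the degrees are our own assumptions; this is an illustration of the routines above, not the authors' implementation.

```python
from collections import defaultdict

def approximate_pr(graph, degree, seed, beta, eps):
    # Repeatedly apply Push(u) at any vertex u with r(u) >= eps * d(u).
    p = defaultdict(float)
    r = defaultdict(float, seed)               # seed: dict vertex -> mass
    queue = [u for u in list(r) if r[u] >= eps * degree[u]]
    while queue:
        u = queue.pop()
        if r[u] < eps * degree[u]:
            continue                           # stale queue entry
        ru = r[u]
        p[u] += beta / (2 + beta) * ru         # step 1 of Push(u)
        r[u] = ru / (2 + beta)
        if r[u] >= eps * degree[u]:
            queue.append(u)
        share = ru / ((2 + beta) * degree[u])  # step 2 of Push(u)
        for v in graph[u]:
            r[v] += share
            if r[v] >= eps * degree[v]:
                queue.append(v)
    return p, r

def sharp_approximate_pr(graph, degree, seed, beta, eps):
    # Iterated halving: rerun ApproximatePR on the current residual
    # and accumulate the partial PageRank vectors.
    cur, p, r = 1.0, defaultdict(float), dict(seed)
    while cur > eps:
        cur /= 2
        p_new, r = approximate_pr(graph, degree, r, beta, cur)
        for u, val in p_new.items():
            p[u] += val
    return p, r
```

For example, sharp_approximate_pr(graph, degree, {u: 1.0}, beta, 1e-6) approximates pr_{β,u}; each halving round is one ApproximatePR pass, which is where the m log(1/ε) factor in Theorem 2 comes from.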
3 The Green Values for Edges in a Graph
Recall that the combinatorial Laplacian of G is defined by L = D − A. If we orient the edges of G in an arbitrary but fixed way, we can write the Laplacian as L = B^T W B, where W_{m×m} is the diagonal matrix with W(e, e) = w(e) and B_{m×n} is the signed edge-vertex incidence matrix with B(e, v) = 1 if v is e's head, B(e, v) = −1 if v is e's tail, and B(e, v) = 0 otherwise. The normalized Laplacian of G is defined to be 𝓛 = D^{-1/2} L D^{-1/2}, and we can write 𝓛 = S^T W S where S_{m×n} = B D^{-1/2}. Since 𝓛 is symmetric, we have 𝓛 = Σ_{i=0}^{n−1} λ_i φ_i^T φ_i, where λ₀ = 0 < λ₁ ≤ λ₂ ≤ ... ≤ λ_{n−1} ≤ 2 are the eigenvalues of 𝓛 and φ₀, ..., φ_{n−1} form a corresponding orthonormal basis of eigenvectors. Various properties concerning the eigenvalues of the normalized Laplacian can be found in [10]. Denote the β-normalized Laplacian by L_β = βI + 𝓛; we may write L_β = S′^T W_β S′, where S′ is the (n+m)×n matrix obtained by stacking the identity I_{n×n} on top of S, and W_β is the (n+m)×(n+m) block-diagonal matrix with blocks βI_{n×n} and W.
Here the index set for the rows of S′ and the columns (rows) of W_β is V ∪ E, where the first n rows are indexed by V and the last m rows are indexed by E.

Green's functions were first introduced in a celebrated essay by George Green [12] in 1828. The discrete analog of Green's functions associated with the normalized Laplacian of graphs was considered in [11], in connection with the study of Dirichlet eigenvalues with boundary conditions. In particular, the following modified Green's function G_β was used in [11]: for β ∈ R⁺, let the Green's function G_β denote the symmetric matrix satisfying L_β G_β = I. Clearly, we have G_β = Σ_{i=0}^{n−1} (1/(λ_i + β)) φ_i^T φ_i. We remark that the discrete Green's function is basically a symmetric form of the PageRank; namely, it is straightforward to check that pr_{β,s} = β s D^{-1/2} G_β D^{1/2}. For each edge e = {u, v} ∈ E, we define the Green value g_β(u, v) of e to be the following combination of four terms from PageRank vectors:

g_β(u, v) = pr_{β,u}(u)/d(u) − pr_{β,u}(v)/d(v) + pr_{β,v}(v)/d(v) − pr_{β,v}(u)/d(u).   (1)
One can also verify that g_β(u, v) = β (χ_u − χ_v) D^{-1/2} G_β D^{-1/2} (χ_u − χ_v)^T. Using the above properties of Green's functions, we can prove the following facts, which will be useful later.

Lemma 2. The Green value g_β(u, v) can be expressed as

g_β(u, v) = Σ_{i=0}^{n−1} (β/(λ_i + β)) (φ_i(u)/√d(u) − φ_i(v)/√d(v))².

In particular, for two distinct vertices u, v ∈ V,

(β/(2+β)) (1/d(u) + 1/d(v)) ≤ g_β(u, v) ≤ 1/d(u) + 1/d(v).
Since the Green values can be relatively small (e.g., of order Ω(1/n^p) by Lemma 2, for a positive integer p), we need very sharply approximate PageRank vectors, within a factor of 1 + O(1/n^p) of the exact values, in the analysis of the performance bound for the graph sparsification algorithms that we will examine in Section 4. For all pairs (u, v), we define the approximate Green value g̃_β(u, v) by

g̃_β(u, v) = pr_{β,χu−r_{χu}}(u)/d(u) − pr_{β,χu−r_{χu}}(v)/d(v) + pr_{β,χv−r_{χv}}(v)/d(v) − pr_{β,χv−r_{χv}}(u)/d(u).
Here, pr_{β,χu−r_{χu}} and pr_{β,χv−r_{χv}} are the approximate PageRank vectors output by ApproximatePR for pr_{β,u} and pr_{β,v} respectively, and r_{χu} and r_{χv} are the corresponding residual vectors satisfying ‖r_{χu} D^{-1}‖₁ ≤ ε/4 and ‖r_{χv} D^{-1}‖₁ ≤ ε/4. With this definition, we can prove:

Lemma 3. For two distinct vertices u, v ∈ V, we have |g_β(u, v) − g̃_β(u, v)| ≤ ε.

Here we give a rough estimate of the cost of computing Green values by directly using Lemma 3 and Theorem 1. Note that invoking Theorem 1 without further techniques does not seem to improve the running time.

Theorem 3. Given any constant ε > 0 and any pair (u, v) ∈ V × V, the approximate Green value g̃_β(u, v) can be computed in O((2+β)/(εβ)) time such that |g_β(u, v) − g̃_β(u, v)| ≤ ε. In particular, after O((2+β)n/(εβ)) preprocessing time, for each (u, v) ∈ V × V we can compute such a g̃_β(u, v) using a constant number of queries.

Recall that in Lemma 2 we established a lower bound for g_β(u, v). Denote by Δ the maximum degree of the graph G. A direct consequence of the above theorem is the following.

Corollary 1. Given any constant ε > 0 and any pair (u, v) ∈ V × V, we can compute a quantity g̃_β(u, v) in O(Δ/(εβ²)) time such that |g_β(u, v) − g̃_β(u, v)| ≤ ε g_β(u, v). In particular, after O(Δn/(εβ²)) preprocessing time, for each (u, v) ∈ V × V we can compute such a g̃_β(u, v) using a constant number of queries.

We will improve both Theorem 3 and Corollary 1 in Section 5 by using sharp approximate PageRank algorithms and dimension reduction techniques.
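As a sketch of how equation (1) and the preprocessing of Theorem 3 translate into code, the following Python fragment reuses the hypothetical approximate_pr helper from the sketch in Section 2: one PageRank computation per vertex, then a constant number of lookups per edge. Vertex ids are assumed comparable (e.g., integers), and the residual accuracy ε/4 follows the condition stated before Lemma 3.

```python
def green_value(u, v, pr_u, pr_v, degree):
    # Equation (1): a combination of four PageRank entries, where
    # pr_u and pr_v are (approximate) PageRank dicts seeded at u and v.
    return (pr_u.get(u, 0.0) / degree[u] - pr_u.get(v, 0.0) / degree[v]
            + pr_v.get(v, 0.0) / degree[v] - pr_v.get(u, 0.0) / degree[u])

def all_edge_green_values(graph, degree, beta, eps):
    # One ApproximatePR call per vertex (the preprocessing of Theorem 3).
    pr = {u: approximate_pr(graph, degree, {u: 1.0}, beta, eps / 4)[0]
          for u in graph}
    return {(u, v): green_value(u, v, pr[u], pr[v], degree)
            for u in graph for v in graph[u] if u < v}
```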
4 Graph Sparsification Using Green Values
To construct our sparsifier, we use a method quite similar to the scheme of Spielman and Srivastava, except that PageRank is used here instead of effective resistance. We will give several graph sparsification algorithms, some of which involve approximate Green values (examined in Section 5). In this section, we use exact Green values for edges.

Recall that the graph G = (V, E, w) we consider is an undirected weighted graph. For a subset S of vertices of G, the edge boundary ∂(S) of S consists of all edges with exactly one endpoint in S. The weight of ∂(S), denoted by w(S, S̄), is the sum of the weights of the edges in ∂(S). The volume of S, denoted by vol(S), is the sum of the degrees d(v) over all v in S; when S = V, we write vol(S) = vol(G). The Cheeger ratio (or conductance) h_G(S) of S in G is defined by h_G(S) = w(S, S̄)/min{vol(S), vol(S̄)}. The conductance h_G of G is the minimum Cheeger ratio among all subsets S with vol(S) ≤ vol(G)/2.

The goal of sparsification is to approximate a given graph G by a sparse graph G̃ on the same set of vertices while preserving the Cheeger ratio of every subset of vertices to within a factor of 1 ± ε. The main step in any sparsification algorithm [7,15,16,17,24,25] is to choose an appropriate probability distribution for randomly sampling the edges in a way that changes the Cheeger ratios of subsets very little. Our sparsification algorithm is a sampling process using probabilities proportional to the Green values g_β, as follows:

G̃ = SparsifyExactGreen(G, q, β):
For each e = (u, v) ∈ E, set the probability p_e ∝ w(e) g_β(e), and repeat the following steps q times:
1. Choose an edge e ∈ G randomly with probability p_e.
2. Add e to G̃ with weight w(e)/(q p_e).
3. Sum the weights if an edge is chosen more than once.
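The sampling loop itself is short; here is a minimal Python sketch (our own naming), where green maps each edge to its Green value g_β(e):

```python
import random
from collections import defaultdict

def sparsify_exact_green(edges, weight, green, q):
    # Sample q edges with replacement, Pr[e] proportional to w(e) g_beta(e);
    # re-weight each sample by w(e) / (q p_e) and merge repeated edges.
    scores = [weight[e] * green[e] for e in edges]
    total = sum(scores)
    new_weight = defaultdict(float)
    for e in random.choices(edges, weights=scores, k=q):
        pe = weight[e] * green[e] / total
        new_weight[e] += weight[e] / (q * pe)
    return new_weight          # edge -> weight in the sparsifier G~
```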
The analysis of the above algorithm is carried out in the following subsections. Our main theorem is the following.

Theorem 4. Given an unweighted graph G on n vertices with m edges, for any ε ∈ (0, 1), let G̃ denote the output of the algorithm SparsifyExactGreen(G, q, β), where q = 256C²n log n/ε², β = 1/2, and C ≥ 1 is an absolute constant. Then with probability at least 1/2, we have |h_G̃(S) − h_G(S)| ≤ ε for all S ⊂ V.

4.1 Analyzing the Sparsifier
Our analysis follows the general scheme of [25]. In our analysis of the sparsifier, we consider the matrix Λ_β = W_β^{1/2} S′ G_β S′^T W_β^{1/2}. Note that Λ_β is an (n+m) × (n+m) matrix; we index its first n columns (rows) by V and its last m columns (rows) by E. From the definition and the properties of Green values in Section 3, one can verify that

Λ_β(e, e) = (1/β) √(W_β(e, e)) g_β(e) √(W_β(e, e)) = (1/β) w(e) g_β(e).

Here are several useful properties of Λ_β.

Lemma 4. (i) Λ_β² = Λ_β. (ii) The dimension (or rank) of Λ_β, denoted dim(Λ_β), is n. (iii) The eigenvalues of Λ_β are 1 with multiplicity n and 0 with multiplicity m. (iv) Λ_β(e, e) = ‖Λ_β(·, e)‖₂².

Next, we introduce some notation and several lemmas on which the theorems in later sections rely. Let w̃(e) denote the edge weight of edge e in G̃; recall that q is the number of samples and p_e is the sampling probability for e ∈ E. Denote by I_β the nonnegative (n+m)×(n+m) block-diagonal matrix with blocks I_{n×n} and R, where

R(e, e) = w̃(e)/w(e) = (# of times e is sampled)/(q p_e).

Lemma 5. Suppose I_β is a nonnegative diagonal matrix such that ‖Λ_β I_β Λ_β − Λ_β Λ_β‖₂ ≤ ε. Then for all x ∈ R^n, |x L̃_β x^T − x L_β x^T| ≤ ε x L_β x^T, where L_β = S′^T W_β S′ and L̃_β = S′^T W_β^{1/2} I_β W_β^{1/2} S′.
1 q drawn from p. Then for every 1 > > 0, P q i=1 yiT yi − Ey T y > ≤ 2 2 exp(−2 /a2 ), where a = min CM logq q , 1 and C is an absolute constant. 4.2
Proof of Theorem 4
We first prove the following theorem, which leads to the proof of Theorem 4.

Theorem 5. Let 𝓛 be the normalized Laplacian of G and let G̃ be the output of the algorithm SparsifyExactGreen(G, q, β), where q = 4C²n log n/ε², ε ∈ (0, 1], and C is an absolute constant. Then with probability at least 1/2, we have, for all x ∈ R^n, |x L̃_β x^T − x L_β x^T| ≤ ε x L_β x^T, where L_β = βI + 𝓛 = S′^T W_β S′ and L̃_β = S′^T W_β^{1/2} I_β W_β^{1/2} S′.
Brief proof of Theorem 5: Before applying Lemma 5 and Lemma 6, we observe that

Λ_β I_β Λ_β = Σ_{e∈E} R(e, e) Λ_β(e, ·)^T Λ_β(e, ·) + Σ_{v∈V} Λ_β(v, ·)^T Λ_β(v, ·)

and

Λ_β Λ_β = Σ_{e∈E} Λ_β(e, ·)^T Λ_β(e, ·) + Σ_{v∈V} Λ_β(v, ·)^T Λ_β(v, ·).

Thus we have Λ_β I_β Λ_β − Λ_β Λ_β = Σ_{e∈E} (R(e, e) − 1) Λ_β(e, ·)^T Λ_β(e, ·). Now, consider Σ_{e∈E} R(e, e) Λ_β(e, ·)^T Λ_β(e, ·), which can be expressed as

Σ_{e∈E} ((# of times e is sampled)/(q p_e)) Λ_β(e, ·)^T Λ_β(e, ·) = (1/q) Σ_{i=1}^{q} y_i^T y_i,

where y₁, ..., y_q are random vectors drawn independently with replacement from the distribution p defined by setting y = (1/√p_e) Λ_β(e, ·) with probability p_e. We also need to bound the norm of the expectation of y^T y and the norm of y. By using the properties of Λ_β in Lemma 4, we can show that ‖E_p y^T y‖₂ ≤ 1 and ‖(1/√p_e) Λ_β(e, ·)‖₂ ≤ √n (details are omitted). Notice that if we let q = 4C²n log n/ε², then we have

min(CM √(log q / q), 1) ≤ √( C²n log(4C²n log n/ε²) / (4C²n log n/ε²) ) ≤ ε/2.

By applying Rudelson and Vershynin's lemma [22] (Lemma 6), we complete the proof of the theorem.
Before applying Theorem 5 to prove Theorem 4, we still need the following two lemmas. We here consider G as an unweighted graph first, i.e., w(e) = 1 for all edges, although the results can easily be extended to general weighted graphs.

Lemma 7. For any constant ε ∈ (0, 1], let G̃ be the output of algorithm SparsifyExactGreen(G, q, β), where q = 4C²n(β + 2) log n/ε². Then, with probability 1 − 1/n, for all subsets S ⊂ V we have |vol_G̃(S) − vol_G(S)| ≤ ε vol_G(S).

Lemma 8. If the sparse graph G̃ corresponding to graph G satisfies the two conditions: (a) for all x ∈ R^n, |x L̃_β x^T − x L_β x^T| ≤ ε x L_β x^T; (b) for all subsets S ⊂ V, |vol_G̃(S) − vol_G(S)| ≤ ε vol_G(S); then |h_G̃(S) − h_G(S)| ≤ 2ε h_G(S) + εβ.

Proof of Theorem 4: To prove Theorem 4, we combine Lemma 7, Lemma 8 and Theorem 5. For any 1 > ε > 0, let G̃ be the output of the algorithm SparsifyExactGreen(G, q, β), where q = 256C²n log n/ε², β = 1/2, and C ≥ 1 is a constant. By Theorem 5 and Lemma 7, the conditions of Lemma 8 are satisfied with probability at least 1/2. Note that we have chosen β to be 1/2 and h_G(S) ≤ 1; thus algorithm SparsifyExactGreen can be applied using O(n log n/ε²) samples. Furthermore, for all S ⊂ V, we have |h_G̃(S) − h_G(S)| ≤ ε.
By choosing a different β, namely β = φ/2, we have the following:

Theorem 6. Given constants ε, φ ∈ (0, 1], let G̃ be the output of the algorithm SparsifyExactGreen(G, q, β), where q = 256C²n log n/ε², β = φ/2, and C ≥ 1. Then with probability at least 1/2, we have |h_G̃(S) − h_G(S)| ≤ ε h_G(S) for all S ⊂ V with h_G(S) ≥ φ.
5 Sparsification Using Approximate PageRank Vectors
By Corollary 1, we can compute approximate Green values g̃_β(u, v) satisfying (1 − κ) g_β(u, v) ≤ g̃_β(u, v) ≤ (1 + κ) g_β(u, v) for all edges (u, v) ∈ E in O(Δn/β²) time, where κ ∈ (0, 1] is any absolute constant (e.g., κ = 0.01). Instead of using exact Green values, we can use approximate Green values when running the algorithm SparsifyExactGreen. The approximate Green values g̃_β are combinations of approximate PageRank vectors, and here we choose the error parameter for algorithm ApproximatePR so that the approximation of pr_{β,v} is ApproximatePR(χ_v, β, κβ/((2+β)Δ)). It is not difficult to verify that all the results in Section 4 change by at most a constant factor if we run the algorithm SparsifyExactGreen with approximate Green values: the performance guarantee and the number of sampled edges in Theorem 4 differ by at most a constant factor, although the computational complexity increases to O(Δn/β²). In order to further improve the running time, we use several methods in the following subsections.

5.1 Graph Sparsification by Using Sharply Approximate Green Values
In order to obtain a better error estimate for the approximate Green values, we need to improve the error estimate for the approximate PageRank vectors in Theorem 1. We will use the strengthened approximate PageRank algorithm SharpApproximatePR and the dimension reduction technique in [25] to approximate the Green values by means of the sharply approximate PageRank vectors produced by SharpApproximatePR.

First, recall that g_β(u, v) = β (χ_u − χ_v) D^{-1/2} G_β D^{-1/2} (χ_u − χ_v)^T, and thus

g_β(u, v) = β (χ_u − χ_v) D^{-1/2} G_β L_β G_β D^{-1/2} (χ_u − χ_v)^T
          = β ‖W_β^{1/2} S′ G_β D^{-1/2} (χ_u − χ_v)^T‖₂²
          = (1/β) ‖W_β^{1/2} S′ D^{1/2} [β D^{-1/2} G_β D^{-1/2}] (χ_u − χ_v)^T‖₂².

Therefore, up to the factor 1/β, the g_β(u, v) are just pairwise distances between the vectors {Z χ_v^T}_{v∈V}, where Z = W_β^{1/2} S′ D^{1/2} [β D^{-1/2} G_β D^{-1/2}]. However, the vectors in {Z χ_v^T}_{v∈V} have dimension m + n. In order to reduce the computational complexity of computing these vectors, we project them into a lower-dimensional space while preserving their pairwise distances, using the following lemma.

Lemma 9 ([1]). Given vectors x₁, ..., x_n ∈ R^d and constants ε, γ > 0, let k₀ = c_γ log n/ε², where c_γ is a constant depending on γ. For an integer k ≥ k₀, let R_{k×d} be a random matrix whose entries R_{ij} are independent random variables with values ±1/√k. Then with probability 1 − 1/n^γ, we have

(1 − ε) ‖x_i − x_j‖₂² ≤ ‖Rx_i − Rx_j‖₂² ≤ (1 + ε) ‖x_i − x_j‖₂²  for i, j = 1, 2, ..., n.
Now we are ready to state our algorithm for approximating the Green values. Later, in order to analyze the algorithm ApproximateGreen, we will bound the y_i's via Lemma 10.

ApproximateGreen(β, ε, k):
1. Let R_{k×(n+m)} = [R₁, R₂] be a random matrix whose entries are independent random variables with values ±1/√k, where R₁ is a k × n matrix and R₂ is a k × m matrix.
2. Let Y = R W_β^{1/2} S′ D^{1/2} and Z̃ = RZ.
3. For i = 1, ..., k, do the following:
   (a) Let y_i be the i-th row of Y and z̃_i be the i-th row of Z̃.
   (b) Approximate z̃_i by z̃′_i = SharpApproximatePR(y_i, β, ε/n^r).
4. Let Z̃′ be the approximating matrix for Z̃ whose rows are z̃′₁, ..., z̃′_k. For all (u, v) ∈ E, return g̃_β(u, v) = (1/β) ‖Z̃′(χ_u − χ_v)^T‖₂².

Lemma 10. Given an integer k and a random matrix R whose entries are independent random variables with values ±1/√k, with probability 1 − 1/n², we have ‖y_i‖₁ ≤ c √(Σ_{v∈V} d(v) log n) for 1 ≤ i ≤ k, where c is an absolute constant.

By combining the above lemmas and Theorem 2, we have the following theorem.

Theorem 7. Given any constant ε > 0, let k = c log n/ε². If β = Ω(poly(1/n)), algorithm ApproximateGreen(β, ε, k) outputs g̃_β(u, v) satisfying |g_β(u, v) − g̃_β(u, v)| ≤ ε g_β(u, v) in O(m log² n/(βε²)) time.

Proof. To bound the running time, note that step 1 can be completed in O(m log n/ε²) time, since R is a k × (n+m) random matrix. In step 2, we set Y = R W_β^{1/2} S′ D^{1/2}, which takes only O(m log n/ε²) time since S′ has O(m) entries and W_β is a diagonal matrix. In step 3, let y_i be the i-th row of Y and z̃_i the i-th row of Z̃, which is the matrix (Y[β D^{-1/2} G_β D^{-1/2}])_{k×n}. Therefore z̃_i = y_i [β D^{-1/2} G_β D^{-1/2}], and we can view z̃_i as a scaled PageRank vector with seed vector y_i. In Lemma 10, we proved that with probability at least 1 − 1/n², ‖y_i‖₁ ≤ c √(Σ_{v∈V} d(v) log n) for 1 ≤ i ≤ k. Without loss of generality, we may assume c √(Σ_{v∈V} d(v) log n) = O(m), since otherwise the graph is sufficiently sparse. Thus z̃_i can be approximated using algorithm SharpApproximatePR with an arbitrarily small absolute error, say ε′; each call of SharpApproximatePR then takes O(m log(1/ε′)/β) time. By Lemma 2, g_β(u, v) = Ω(β/n), which implies that we only need to set ε′ = ε/n^r for some large enough but fixed constant r, so each call of SharpApproximatePR actually takes O(m log n/β) time. Since there are k = O(log n/ε²) calls, the total running time of step 3 is O(m log² n/(βε²)). In step 4, since each column of Z̃′ has k = O(log n/ε²) entries and there are m edges in the graph, the running time of step 4 is O(m log n/ε²). The theorem then follows.
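The following NumPy sketch mirrors ApproximateGreen. To stay self-contained it performs the PageRank-type solves exactly with a dense linear solve where the algorithm calls SharpApproximatePR with additive error ε/n^r; the matrix names follow the text and everything else is our own simplification, so this illustrates the structure rather than the stated running time.

```python
import numpy as np

def approximate_green_values(B, w, d, beta, eps, seed=0):
    # B: m x n signed edge-vertex incidence matrix, w: edge weights (m,),
    # d: weighted degrees (n,).
    rng = np.random.default_rng(seed)
    m, n = B.shape
    k = max(1, int(np.ceil(np.log(n) / eps ** 2)))    # k = c log n / eps^2
    R = rng.choice([-1.0, 1.0], size=(k, n + m)) / np.sqrt(k)
    # W_beta^{1/2} S' D^{1/2}: sqrt(beta) D^{1/2} stacked over W^{1/2} B.
    stacked = np.vstack([np.sqrt(beta) * np.diag(np.sqrt(d)),
                         np.sqrt(w)[:, None] * B])
    Y = R @ stacked                                   # rows are the y_i
    # y_i [beta D^{-1/2} G_beta D^{-1/2}] = beta * y_i (beta D + L)^{-1},
    # with L = B^T W B; this dense solve stands in for SharpApproximatePR.
    L = B.T @ (w[:, None] * B)
    Z = beta * np.linalg.solve(beta * np.diag(d) + L, Y.T).T
    g = {}
    for e in range(m):
        u, v = (int(x) for x in np.flatnonzero(B[e])[:2])
        g[(u, v)] = float(np.sum((Z[:, u] - Z[:, v]) ** 2)) / beta
    return g
```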
G̃ = SparsifyApprGreen(G, q, β, ε):
1. For all (u, v) ∈ E, compute the approximate Green values g̃_β(e) by calling ApproximateGreen(β, κ, k), where κ = 1/2 and k = c log n/ε².
2. Apply SparsifyExactGreen(G, q, β) with the approximate Green values.

Theorem 8. Given constants ε > 0, φ > 0 and a graph G on n vertices with m edges, set q = 256C²n log n/ε² and β = φ/2, and let G̃ be the output of the algorithm SparsifyApprGreen(G, q, β, ε). Then with probability at least 1/2: (i) |h_G̃(S) − h_G(S)| ≤ ε h_G(S) for all S ⊂ V satisfying h_G(S) ≥ φ; (ii) algorithm SparsifyApprGreen can be performed with O(m log² n/φ) preprocessing time and O(n log n/ε²) sampling.
The proof is quite similar to the analysis in Section 4 and is omitted.
6 Partitioning Using Approximate PageRank Vectors
In this section, we combine the graph sparsification and the partitioning algorithms using PageRank vectors to derive an improved partitioning algorithm. An application of our sparsification algorithm by PageRank is the balanced cut problem. For a given graph G, we first use our sparsification algorithm to preprocess the graph; then we apply the local partitioning algorithm using PageRank vectors [3,4] on the sparsifier. Since the local partitioning algorithm is a subroutine for the balanced cut problem, we obtain a balanced cut algorithm with an improved running time.

Spielman and Teng [24] gave a local partitioning algorithm which, for a fixed value of φ, gives a cut with approximation ratio O(φ^{1/2} log^{3/2} n) and of volume v_φ in O(m log^c(n)/φ^{5/3}) time, where v_φ is the largest volume of a set with Cheeger ratio φ. Note that the constant c above is quite large [24]. In [3], PageRank vectors were used to derive a local partitioning algorithm with an improved running time of O((m + nφ^{-1}) polylog(n)). In [4], the running time was further reduced to O((m + nφ^{-1/2}) polylog(n)) by preprocessing with the sparsification algorithms of [7].

Given an undirected, weighted graph G = (V, E, w) with n vertices and m edges, we can apply the algorithm SparsifyApprGreen as a preprocessing procedure on G to get a sparsifier G̃ with only O(n log n/ε²) edges in time O(m log² n/φ), such that |h_G̃(S) − h_G(S)| ≤ ε h_G(S) for all S with h_G(S) ≥ φ. Then we use the algorithm PageRank-Partition [3] on the graph G̃ instead of G for the balanced cut problem. The algorithm PageRank-Partition has two inputs, a parameter φ and a graph with m edges. As stated in [3], the PageRank-Partition algorithm has expected running time O(m log⁴ m/φ²). Furthermore, with high probability the PageRank-Partition algorithm was shown to find a set S, if one exists, such that vol(S) ≥ vol(C)/2 and h_G(S) ≤ φ. This can be summarized as follows:
Theorem 9. Given an undirected, weighted graph G = (V, E, w) with n vertices and m edges and constants φ > 0 and ε > 0, with probability 1/2 we can preprocess the graph G in O(m log² n/φ) time to obtain a sparse graph G̃ with O(n log n/ε²) edges such that for all S ⊂ V satisfying h_G(S) ≥ φ, |h_G̃(S) − h_G(S)| ≤ ε h_G(S). Algorithm PageRank-Partition takes as inputs a parameter φ and the graph G̃, and has expected running time O(n log⁶ m/(φ²ε²)). If there exists a set C with h_G(C) = O(φ²/log² n), then with high probability the PageRank-Partition algorithm finds a set S such that vol(S) ≥ vol(C)/2 and h_G(S) ≤ φ.
References

1. Achlioptas, D.: Database-friendly random projections. In: PODS 2001, pp. 274–281 (2001)
2. Arora, S., Kale, S.: A combinatorial, primal-dual approach to semidefinite programs. In: STOC 2007, pp. 227–236 (2007)
3. Andersen, R., Chung, F., Lang, K.: Local graph partitioning using PageRank vectors. In: FOCS 2006, pp. 475–486 (2006)
4. Andersen, R., Peres, Y.: Finding sparse cuts locally using evolving sets. In: STOC 2009, pp. 235–244 (2009)
5. Arora, S., Hazan, E., Kale, S.: Θ(√(log n)) approximation to sparsest cut in Õ(n²) time. In: FOCS 2004, pp. 238–247 (2004)
6. Batson, J., Spielman, D.A., Srivastava, N.: Twice-Ramanujan sparsifiers. In: STOC 2009, pp. 255–262 (2009)
7. Benczúr, A.A., Karger, D.R.: Approximating s-t minimum cuts in Õ(n²) time. In: STOC 1996, pp. 47–55 (1996)
8. Berkhin, P.: Bookmark-coloring approach to personalized PageRank computing. Internet Mathematics 3(1), 41–62 (2007)
9. Brin, S., Page, L.: The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems 30(1-7), 107–117 (1998)
10. Chung, F.: Spectral Graph Theory. AMS Publications, Providence (1997)
11. Chung, F., Yau, S.-T.: Discrete Green's Functions. Journal of Combinatorial Theory, Series A 91(1-2), 191–214 (2000)
12. Green, G.: An Essay on the Application of Mathematical Analysis to the Theories of Electricity and Magnetism, Nottingham (1828)
13. Haveliwala, T.H.: Topic-sensitive PageRank: A context-sensitive ranking algorithm for web search. IEEE Trans. Knowl. Data Eng. 15(4), 784–796 (2003)
14. Jeh, G., Widom, J.: Scaling personalized web search. In: WWW 2003, pp. 271–279 (2003)
15. Karger, D.R.: Random sampling in cut, flow, and network design problems. In: STOC 1994, pp. 648–657 (1994)
16. Karger, D.R.: Using randomized sparsification to approximate minimum cuts. In: SODA 1994, pp. 424–432 (1994)
17. Karger, D.R.: Minimum cuts in near-linear time. JACM 47(1), 46–76 (2000)
18. Lovász, L.: Random walks on graphs: A survey. Combinatorics, Paul Erdős is Eighty 2, 1–46 (1993)
19. Lovász, L., Simonovits, M.: The mixing rate of Markov chains, an isoperimetric inequality, and computing the volume. In: FOCS 1990, pp. 346–354 (1990)
20. Orecchia, L., Schulman, L.J., Vazirani, U.V., Vishnoi, N.K.: On partitioning graphs via single commodity flows. In: STOC 2008, pp. 461–470 (2008)
21. Page, L., Brin, S., Motwani, R., Winograd, T.: The pagerank citation ranking: Bringing order to the web, Technical report, Stanford Digital Library Technologies Project (1998) 22. Rudelson, M., Vershynin, R.: Sampling from large matrices: An approach through geometric functional analysis. Journal of the ACM 54(4) (2007) 23. Spielman, D.A., Teng, S.-H.: Spectral partitioning works: Planar graphs and finite element meshes. In: FOCS 1996, pp. 96–105 (1996) 24. Spielman, D.A., Teng, S.-H.: Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. In: STOC 2004, pp. 81–90 (2004) 25. Spielman, D.A., Srivastava, N.: Graph sparsification by effective resistances. In: STOC 2008, pp. 563–568 (2008)
Efficient Triangle Counting in Large Graphs via Degree-Based Vertex Partitioning

Mihail N. Kolountzakis¹, Gary L. Miller², Richard Peng², and Charalampos E. Tsourakakis³

¹ Department of Mathematics, University of Crete, Greece
² School of Computer Science, Carnegie Mellon University, USA
³ Department of Mathematical Sciences, Carnegie Mellon University, USA
[email protected], [email protected], [email protected], [email protected]
Abstract. In this paper we present an efficient triangle counting algorithm which can be adapted to the semistreaming model [12]. The key idea of our algorithm is to combine the sampling algorithm of [31,32] with the partitioning of the set of vertices into a high-degree and a low-degree subset, as in [1], treating each set appropriately. We obtain a running time of O(m + m^{3/2}Δ log n/(tε²)) and an ε-approximation (multiplicative error), where n is the number of vertices, m the number of edges and Δ the maximum number of triangles an edge is contained in. Furthermore, we show how this algorithm can be adapted to the semistreaming model with space usage O(m^{1/2} log n + m^{3/2}Δ log n/(tε²)) and a constant number of passes (three) over the graph stream. We apply our methods to various networks with several millions of edges and we obtain excellent results. Finally, we propose a random projection based method for triangle counting and provide a sufficient condition to obtain an estimate with low variance.

Note: Due to space constraints several details are omitted here. The full version of this work can be found on arXiv at http://arxiv.org/abs/1011.0468 [21].
1 Introduction

Graphs are ubiquitous: the Internet, the World Wide Web (WWW), social networks, protein interaction networks and many other complicated structures are modeled as graphs [7]. The problem of counting subgraphs is one of the typical graph mining tasks that has attracted a lot of attention. The most basic, non-trivial subgraph is the triangle. Given a simple, undirected graph G(V, E), a triangle is a three-node fully connected subgraph. Many social networks are abundant in triangles, since typically friends of friends tend to become friends themselves [35]. This phenomenon is observed in other types of networks as well (biological, online networks, etc.) and is one of the main reasons which gave rise to the definitions of the transitivity ratio and the clustering coefficients of a graph in complex network analysis [25]. Triangles are used in several applications, such as uncovering the hidden thematic structure of the web [10], as a feature to assist the classification of web activity [4], and for link recommendation in
online social networks [33]. Furthermore, triangles are used as a network statistic in the exponential random graph model [11].

In this paper, we propose a new triangle counting method which provides an ε-approximation to the number of triangles in the graph and runs in O(m + m^{3/2}Δ log n/(tε²)) time, where n is the number of vertices, m the number of edges and Δ the maximum number of triangles an edge is contained in. The key idea of the method is to combine the sampling scheme introduced by Tsourakakis et al. in [31,32] with the partitioning idea of Alon, Yuster and Zwick [1] in order to obtain a more efficient sampling scheme. Furthermore, we show that this method can be adapted to the semistreaming model with a constant number of passes and O(m^{1/2} log n + m^{3/2}Δ log n/(tε²)) space. We apply our methods to various networks with several millions of edges and we obtain excellent results both with respect to accuracy and running time. Furthermore, we optimize the cache properties of the code in order to obtain a significant additional speedup. Finally, we propose a random projection based method for triangle counting and provide a sufficient condition for obtaining an estimate with low variance.

The paper is organized as follows: Section 2 briefly presents the existing work and the theoretical background, Section 3 presents our proposed method, and Section 4 presents the experimental results on several large graphs. In Section 5 we provide a sufficient condition for obtaining a concentrated estimate of the number of triangles using random projections, and in Section 6 we conclude and provide new research directions.
2 Preliminaries

In this section, we briefly present the existing work on the triangle counting problem and the theoretical background necessary for our analysis, namely a version of the Chernoff bound and the Johnson-Lindenstrauss lemma. A detailed description of the existing work on the problem can be found in the full version of this work [21]. We use the following notation in the rest of the paper: G([n], E) stands for an undirected simple graph on n labeled vertices with edge set E, m = |E|, t = #triangles, Δ(u, v) = #triangles containing vertices u and v, Δ = max_{e∈E(G)} Δ(e), and p denotes the sparsification parameter.

2.1 Existing Work

There exist two categories of triangle counting algorithms, exact and approximate. It is worth noting that for the applications described in Section 1 the exact number of triangles is not crucial; thus approximate counting algorithms, which are faster and output a high-quality estimate, are desirable for the practical applications in which we are interested in this work. The state-of-the-art exact algorithm is due to Alon, Yuster and Zwick [1] and runs in O(m^{2ω/(ω+1)}) time, where currently the fast matrix multiplication exponent ω is 2.371 [8]; thus the Alon et al. algorithm currently runs in O(m^{1.41}) time. In planar graphs, triangles can be found in O(n) time [14,26]. In practical exact counting, simple listing algorithms such as the node- and edge-iterator are used. Practical improvements over this family
of algorithms have been achieved using various techniques, such as hashing and sorting by degree [22,27]. On the approximate counting side, most of the triangle counting algorithms have been developed in the streaming and semistreaming setting [3,4,16,5,28]. Other approximate algorithms include spectral algorithms [29,30,2]. Of immediate interest to this work is the DOULION algorithm of [31], which tosses a coin independently for each edge, keeping the edge with probability p and throwing it away with probability q = 1 − p. It was shown later by Tsourakakis, Kolountzakis and Miller [32], using a powerful theorem due to Kim and Vu [19], that under mild conditions on the triangle density the method results in a strongly concentrated estimate of the number of triangles.

2.2 Concentration of Measure

In Section 3 we make extensive use of the following version of the Chernoff bound [6].

Theorem 1. Let X₁, X₂, ..., X_k be independently distributed {0, 1} variables with E[X_i] = p. Then for any ε > 0, we have

Pr[ |(1/k) Σ_{i=1}^{k} X_i − p| > εp ] ≤ 2e^{-ε²pk/2}.

2.3 Random Projections
A random projection x → Rx from R^d to R^k approximately preserves all Euclidean distances. One version of the Johnson-Lindenstrauss lemma [15] is the following:

Lemma 1 (Johnson-Lindenstrauss). Suppose x₁, ..., x_n ∈ R^d and ε > 0, and take k = Cε^{-2} log n. Define the random matrix R ∈ R^{k×d} by taking all R_{i,j} ∼ N(0, 1) (standard Gaussian) and independent. Then, with probability bounded below by a constant, the points y_j = Rx_j ∈ R^k satisfy

(1 − ε)|x_i − x_j| ≤ |y_i − y_j| ≤ (1 + ε)|x_i − x_j|  for i, j = 1, 2, ..., n.
3 Proposed Method

Our algorithm combines two approaches that have been taken to triangle counting: sparsify the graph by keeping a random subset of the edges [31,32], followed by triple sampling using the idea of vertex partitioning due to Alon, Yuster and Zwick [1].

3.1 Edge Sparsification

The following method was introduced in [31] and was shown to perform very well in practice: keep each edge with probability p, independently. Then each triangle is kept with probability p³, so the expected number of triangles left is p³t. This is an inexpensive way to reduce the size of the graph, as it can be done in one pass over the edge list using O(mp) random variables (more details can be found in Section 4.2 of [21] and in [20]).
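A sketch of this step in Python, with count_triangles standing in for any exact counting routine:

```python
import random

def sparsified_estimate(edges, p, count_triangles):
    # Keep each edge independently with probability p; a triangle
    # survives iff all 3 of its edges do, so rescale the count by 1/p^3.
    kept = [e for e in edges if random.random() < p]
    return count_triangles(kept) / p ** 3
```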
In a later analysis [32], it was shown that the number of triangles in the sampled graph is concentrated around the actual triangle count as long as p³ ≥ Ω̃(Δ/t). Here we show a similar bound using more elementary techniques. Suppose we have a set of k triangles such that no two share an edge; for each such triangle we define a random variable X_i which is 1 if the triangle is kept by the sampling and 0 otherwise. Then, as the triangles do not have any edges in common, the X_i are independent and take value 1 with probability p³ and 0 with probability 1 − p³. So by the Chernoff bound, the concentration is bounded by

Pr[ |(1/k) Σ_{i=1}^{k} X_i − p³| > εp³ ] ≤ 2e^{-ε²p³k/2}.

So when p³kε² ≥ 4d log n, the probability of sparsification returning an ε-approximation is at least 1 − n^{-d}. This is equivalent to p³k ≥ (4d log n)/ε², so to sample with a small p and throw out many edges, we would like k to be large. Our main theorem is the following:

Theorem 2. If p³ ∈ Ω(dΔ log n/(ε²t)), then with probability 1 − n^{d−3}, the sampled graph has a triangle count that ε-approximates t.
3.2 Triple Sampling

Since each triangle corresponds to a triple of vertices, we can construct a set U of triples that includes all triangles. From this set we sample s triples uniformly; number the samples from 1 to s, and for the i-th triple sampled let X_i be 1 if it is a triangle and 0 otherwise. Since we pick triples uniformly at random from U and t of them are triangles, we have E(X_i) = t/|U| and the X_i are independent. So by the Chernoff bound we obtain

Pr[ |(1/s) Σ_{i=1}^{s} X_i − t/|U|| > ε t/|U| ] ≤ 2e^{-ε²ts/(2|U|)}.

So when s = Ω((|U|/t) log n/ε²), the quantity (Σ_{i=1}^{s} X_i/s)|U| approximates t within a factor of ε with probability at least 1 − n^{-d} for any d of our choice. As |U| ≤ n³, this immediately gives an algorithm with runtime O(n³ log n/(tε²)) that approximates t within a factor of ε. Slightly more careful bookkeeping gives tighter bounds on |U| in sparse graphs.

Consider a triple (u, v, w) containing vertex u, with uv, uw ∈ E. The number of such triples involving u is at most deg(u)². Also, as vw ∈ E, another bound on the number of such triples is m. When deg(u)² > m, or deg(u) > m^{1/2}, the second bound is tighter, and the first is in the other case. These two cases naturally suggest that low-degree vertices, with degree at most m^{1/2}, be treated separately from high-degree vertices, with degree greater than m^{1/2}. For the number of triples around low-degree vertices: since x² is convex, the value of Σ_u deg(u)² is maximized when all edges are concentrated in as few vertices as possible. Since the maximum degree of such a vertex is m^{1/2}, the number of such triples
is upper bounded by m^{1/2} · (m^{1/2})² = m^{3/2}. Also, as the sum of all degrees is 2m, there can be at most 2m^{1/2} high-degree vertices, which means the total number of triples incident to these high-degree vertices is at most 2m^{1/2} · m = 2m^{3/2}. Combining these bounds gives that |U| can be upper bounded by 3m^{3/2}. Note that this bound is asymptotically tight when G is a complete graph (n = m^{1/2}). However, in practice the second bound can be further reduced by summing over the degrees of all v adjacent to u, becoming Σ_{uv∈E} deg(v). As a result, an algorithm that implicitly constructs U by picking the better one among these two cases, by examining the degrees of all neighbors, achieves |U| ≤ O(m^{3/2}). This better bound on |U| gives an algorithm that approximates the number of triangles in time

O(m + m^{3/2} log n/(tε²)).

As our experimental data in Section 4.1 indicate, the value of t is usually Ω(m) in practice. In such cases, the second term in the above bound becomes negligible compared to the first one. In fact, in most of our data, just sampling the first type of triples (a.k.a. pretending all vertices are of low degree) brings the second term below the first.

3.3 Hybrid Algorithm

Edge sparsification with probability p allows us to work on only O(mp) edges, so the total runtime of the triple sampling algorithm after sparsification with probability p becomes O(mp + (mp)^{3/2} log n/(tp³ε²)). As stated above, since the first term is much larger in most practical cases, we can set the value of p to balance these two terms out, resulting in p = (m^{1/2} log n/(tε²))^{2/5}. The actual value of p picked would also depend heavily on the constants in front of both terms, as sampling is likely much less expensive due to factors such as cache effects and memory efficiency. Nevertheless, our experimental results in Section 4 do seem to indicate that this type of hybrid algorithm can perform better in certain situations.
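To make the triple sampling concrete, here is a Python sketch of the simple variant (the 'V'-shaped triples underlying both the plain and the hybrid algorithm). The per-vertex weights C(deg(u), 2) and the final division by 3 (each triangle is seen from its three apexes) are our reading of the scheme above; adjacency lists are assumed.

```python
import random

def simple_triple_estimate(adj, s):
    # adj: dict mapping each vertex to a list of its neighbors.
    verts = list(adj)
    weights = [len(adj[u]) * (len(adj[u]) - 1) // 2 for u in verts]
    U = sum(weights)                      # implicit size of the triple set
    edges = {frozenset((u, v)) for u in adj for v in adj[u]}
    hits = 0
    for _ in range(s):
        u = random.choices(verts, weights=weights, k=1)[0]
        v, w = random.sample(adj[u], 2)   # a uniform 'V' centered at u
        hits += frozenset((v, w)) in edges
    return hits / s * U / 3               # each triangle has 3 apexes
```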
4 Experiments

4.1 Experimental Setup and Implementation Details

The graphs used in our experiments are shown in Table 1. Multiple edges and self-loops were removed (if any). The experiments were performed on a single machine with an Intel Xeon CPU at 2.83 GHz, 6144 KB cache and 50 GB of main memory. The graphs are real-world web and social graphs; details are given in Table 1. The algorithm was implemented in C++. For further implementation details, see [21].
Table 1. Datasets used in our experiments

Name               Nodes      Edges        Triangle Count  Description
AS-Skitter         1,696,415  11,095,298   28,769,868      Autonomous Systems
Flickr             1,861,232  15,555,040   548,658,705     Person to Person
Livejournal-links  5,284,457  48,709,772   310,876,909     Person to Person
Orkut-links        3,072,626  116,586,585  285,730,264     Person to Person
Soc-LiveJournal    4,847,571  42,851,237   285,730,264     Person to Person
Web-EDU            9,845,725  46,236,104   254,718,147     Web Graph (page to page)
Web-Google         875,713    3,852,985    11,385,529      Web Graph
Wikipedia 2005/11  1,634,989  18,540,589   44,667,095      Web Graph (page to page)
Wikipedia 2006/9   2,983,494  35,048,115   84,018,183      Web Graph (page to page)
Wikipedia 2006/11  3,148,440  37,043,456   88,823,817      Web Graph (page to page)
Wikipedia 2007/2   3,566,907  42,375,911   102,434,918     Web Graph (page to page)
Youtube [24]       1,157,822  2,990,442    4,945,382       Person to Person
A major optimization that we used was to sort the edges of the graph and store the input file as a sequence of neighbor lists, one per vertex. Each neighbor list begins with its size, followed by the neighbors. This is similar to how software such as Matlab stores sparse matrices, and the preprocessing time needed to convert the data into this format is not counted. It can significantly improve the cache behavior of the stored graph, and therefore the performance.

Several implementation details build on this storage format. Since each triple that we check already has two of its edges in the graph, it suffices to check whether the third edge is in the graph. This can be done offline by comparing a smaller list of queried edges against the initial edge list of the graph and counting the number of entries they have in common. Once we sort the query list, the entire process can be done offline in one pass through the graph. This also means that instead of picking a predetermined sample rate for the triples, we can vary the rate so that the number of queries is about the same as the size of the graph. Finally, the details behind efficient binomial sampling are discussed in [21], Section 4.2; specifically, picking a random subset of expected size p|S| from a set S can be done in expected sublinear time [20].

4.2 Results

The six variants of the code involved in the experiment are first separated by whether the graph was first sparsified by keeping each edge with probability p = 0.1. In either case, an exact algorithm based on hybrid sampling with performance bounded by O(m^{3/2}) is run. Then two triple-based sampling algorithms are also considered; they differ in whether an attempt is made to distinguish between low- and high-degree vertices, so the simple version essentially samples all 'V'-shaped triples off each vertex. Note that no sparsification plus exact generates the exact number of triangles. Errors are measured by the absolute difference between the value produced and the exact number of triangles, divided by the exact number. The results on error and running time are averages over five runs. Results on the graphs described above, for the methods listed in the columns, are shown in Table 2.
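Before turning to the results in Table 2, here is a minimal Python sketch of the offline membership check just described: the queried third edges are sorted and merged against the pre-sorted edge list in a single pass.

```python
def count_closed_triples(queries, sorted_edges):
    # queries: (min(v, w), max(v, w)) pairs from the sampled triples;
    # sorted_edges: the graph's edge list, sorted once in preprocessing.
    i = hits = 0
    for q in sorted(queries):
        while i < len(sorted_edges) and sorted_edges[i] < q:
            i += 1
        if i < len(sorted_edges) and sorted_edges[i] == q:
            hits += 1     # pointer stays put, so repeated queries count
    return hits
```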
Table 2. Results of experiments, averaged over 5 trials (err in %, time in seconds)

No sparsification:

Graph             | Exact err | Exact time | Simple err | Simple time | Hybrid err | Hybrid time
AS-Skitter        | 0.000 | 4.452   | 1.308 | 0.746 | 0.128 | 1.204
Flickr            | 0.000 | 41.981  | 0.166 | 1.049 | 0.128 | 2.016
Livejournal-links | 0.000 | 50.828  | 0.309 | 2.998 | 0.116 | 9.375
Orkut-links       | 0.000 | 202.012 | 0.564 | 6.208 | 0.286 | 21.328
Soc-LiveJournal   | 0.000 | 38.271  | 0.285 | 2.619 | 0.108 | 7.451
Web-EDU           | 0.000 | 8.502   | 0.157 | 2.631 | 0.047 | 3.300
Web-Google        | 0.000 | 1.599   | 0.286 | 0.379 | 0.045 | 0.740
Wiki-2005         | 0.000 | 32.472  | 0.976 | 1.197 | 0.318 | 3.613
Wiki-2006/9       | 0.000 | 86.623  | 0.886 | 2.250 | 0.361 | 7.483
Wiki-2006/11      | 0.000 | 96.114  | 1.915 | 2.362 | 0.530 | 7.972
Wiki-2007         | 0.000 | 122.395 | 0.943 | 2.728 | 0.178 | 9.268
Youtube           | 0.000 | 1.347   | 1.114 | 0.333 | 0.127 | 0.500

Sparsified (p = .1):

Graph             | Exact err | Exact time | Simple err | Simple time | Hybrid err | Hybrid time
AS-Skitter        | 2.188 | 0.641 | 3.208 | 0.651 | 1.388 | 0.877
Flickr            | 0.530 | 1.389 | 0.746 | 0.860 | 0.818 | 1.033
Livejournal-links | 0.242 | 3.900 | 0.628 | 2.518 | 1.011 | 3.475
Orkut-links       | 0.172 | 9.881 | 1.980 | 5.322 | 0.761 | 7.227
Soc-LiveJournal   | 0.681 | 3.493 | 0.830 | 2.222 | 0.462 | 2.962
Web-EDU           | 0.571 | 2.864 | 0.771 | 2.354 | 0.383 | 2.732
Web-Google        | 1.112 | 0.251 | 1.262 | 0.371 | 0.264 | 0.265
Wiki-2005         | 1.249 | 1.529 | 7.498 | 1.025 | 0.695 | 1.313
Wiki-2006/9       | 0.402 | 3.431 | 6.209 | 1.843 | 2.091 | 2.598
Wiki-2006/11      | 0.634 | 3.578 | 4.050 | 1.947 | 0.950 | 2.778
Wiki-2007         | 0.819 | 4.407 | 3.099 | 2.224 | 1.448 | 3.196
Youtube           | 1.358 | 0.210 | 5.511 | 0.302 | 1.836 | 0.268
4.3 Remarks

From Table 2 it is clear that no single variant outperforms the others on all the data. The gain/loss from sparsification is likely due to the fixed sampling rate; varying it as in earlier work [31] is likely to mitigate this discrepancy. The difference between simple and hybrid sampling is due to the fact that handling the second case of triples has a much worse cache access pattern, as it examines vertices that are two hops away. There are alternative ways to handle this situation, which would be interesting for future implementations. A fixed sparsification rate of p = 10% was used mostly to simplify the experimental setup. In practice, varying p and looking for a rate at which the result stabilizes is the preferred option [32], as sketched below. When compared with previous results on this problem, the error rates and running times of our method are significantly lower. In fact, on the wiki graphs our exact counting algorithm is about as fast as other approximate triangle counting implementations.
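As an illustration of that stabilization search, one might double p and stop once successive estimates agree to within a tolerance. The stopping rule, constants, and the placeholder count_triangles (any exact counter) are our own choices, not the procedure prescribed in [32]:

```python
import random

def sparsified_estimate(edges, p, count_triangles):
    """DOULION-style estimate: keep each edge independently with probability
    p, count triangles exactly in the sparsified graph, rescale by 1/p^3."""
    kept = [e for e in edges if random.random() < p]
    return count_triangles(kept) / p ** 3

def stabilized_estimate(edges, count_triangles, tol=0.05):
    """Double p until two consecutive estimates agree to within tol."""
    prev, p = None, 0.025
    while p <= 0.4:
        est = sparsified_estimate(edges, p, count_triangles)
        if prev is not None and abs(est - prev) <= tol * max(est, 1.0):
            return est
        prev, p = est, p * 2
    return prev
```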
5 Theoretical Ramifications

5.1 Random Projections and Triangles

Consider any two vertices i, j ∈ V that are connected, i.e., (i, j) ∈ E. Observe that the inner product of the i-th and j-th columns of the adjacency matrix A of graph G gives the number of triangles in which edge (i, j) participates. Viewing the adjacency matrix as a collection of n points in $\mathbb{R}^n$, a natural question is whether we can use results from the theory of random projections [15] to reduce the dimensionality of the points while preserving the inner products that contribute to the triangle count. Magen and Zouzias [23] have considered a similar problem, namely random projections that approximately preserve the volume of all subsets of at most k points. Our main results are the following:
Lemma 2. Let $t = \sum_{u \sim v} A_u^{\top} A_v$, where $u \sim v$ means $A_{uv} = 1$, and let $Y = \sum_{u \sim v} (RA_u)^{\top} (RA_v)$ (with R defined as in Lemma 1). Then the expectation of Y is given by

$$E[Y] = \sum_{l=1}^{k} \sum_{i=1}^{n} \#\{i - \ast - \ast - i\} = k \cdot t, \quad (1)$$

and the variance of Y is given by

$$\mathrm{Var}[Y] = C \cdot k \cdot (\text{number of circuits of length 6 in } G). \quad (2)$$

Applying Chebyshev's inequality gives us the following simple corollary:

Corollary 1. Let $c > 0$ be a constant. A sufficient condition for the concentration inequality

$$\Pr\big[\,|Y - E[Y]| > E[Y]\,\big] < 1 - c \quad (3)$$

is the condition

$$\mathrm{Var}[Y] = o\big(k \cdot (E[Y])^2\big). \quad (4)$$
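A small numerical sketch of the statistic Y from Lemma 2 follows. Since Lemma 1 is not reproduced here, the ±1 entry convention for R is an assumption on our part, as is the per-edge orientation convention; the function name and input format are illustrative.

```python
import numpy as np

def projected_triangle_statistic(A, k, seed=0):
    """Compute Y = sum over edges (u, v) of <R A_u, R A_v> for a random
    k x n matrix R with i.i.d. +-1 entries. With this convention each
    inner product has expectation k * (A_u^T A_v), so Y / k estimates the
    edge-summed inner products t of Lemma 2."""
    n = A.shape[0]
    rng = np.random.default_rng(seed)
    R = rng.choice([-1.0, 1.0], size=(k, n))
    P = R @ A                            # k x n matrix of projected columns
    us, vs = np.nonzero(np.triu(A, 1))   # each undirected edge once
    Y = sum(P[:, u] @ P[:, v] for u, v in zip(us, vs))
    return Y / k
```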
5.2 Sampling in the Semi-streaming Model

The previous analysis of triangle counting by Alon, Yuster and Zwick was done in the streaming model [1], where the assumption is a constant space overhead. We show that our sampling algorithm can be implemented in a slightly weaker model, with space usage equaling $O\!\left(m^{1/2}\log n + \frac{m^{3/2}\log n}{t\,\epsilon^{2}}\right)$. We assume the edges adjacent to each vertex are given in order [12]. We first need to identify the high-degree vertices, specifically the ones with degree higher than $m^{1/2}$. This can be done by sampling $O(m^{1/2}\log n)$ edges and recording the vertices that are endpoints of one of those edges. For the proof and the implementation details, see [21].
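A one-pass sketch of that identification step follows; the oversampling constant c and the function name are illustrative, not taken from [21]:

```python
import math
import random

def high_degree_candidates(edges, n, c=4):
    """Sample O(sqrt(m) log n) edges uniformly with replacement and return
    the set of sampled endpoints. A vertex of degree above sqrt(m) appears
    among the samples with high probability, so the true high-degree
    vertices are contained in this candidate set w.h.p."""
    m = len(edges)
    k = int(c * math.sqrt(m) * math.log(n + 1))
    sample = random.choices(edges, k=k)
    return {v for e in sample for v in e}
```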
6 Conclusions and Future Work

In this work, we extended previous work [31,32] by introducing the powerful idea of Alon, Yuster and Zwick [1]. Specifically, we propose a Monte Carlo algorithm which approximates the true number of triangles within $1 \pm \epsilon$ and runs in $O\!\left(m + \frac{m^{3/2}\,\Delta\,\log n}{t\,\epsilon^{2}}\right)$ time. Our method can be extended to the semi-streaming model using three passes and a memory overhead of $O\!\left(m^{1/2}\log n + \frac{m^{3/2}\,\Delta\,\log n}{t\,\epsilon^{2}}\right)$. In practice our methods obtain excellent running times, typically a few seconds for graphs with several million edges. The accuracy is also satisfactory, especially for the type of applications we are concerned with. Finally, we propose a random projection based method for triangle counting and provide a sufficient condition to obtain an
estimate with low variance. A natural question is the following: can we provide some reasonable condition on G that would guarantee the sufficient condition (4) of Corollary 1? Finally, since our proposed methods are easily parallelizable, developing an implementation in the MapReduce framework, see [9] and [18,17], is a natural practical direction.
References 1. Alon, N., Yuster, R., Zwick, U.: Finding and Counting Given Length Cycles. Algorithmica 17(3), 209–223 (1997) 2. Avron, H.: Counting triangles in large graphs using randomized matrix trace estimation. In: Proceedings of KDD-LDMTA 2010 (2010) 3. Bar-Yossef, Z., Kumar, R., Sivakumar, D.: Reductions in streaming algorithms, with an application to counting triangles in graphs. In: SODA (2002) 4. Becchetti, L., Boldi, P., Castillo, C., Gionis, A.: Efficient Semi-Streaming Algorithms for Local Triangle Counting in Massive Graphs. In: KDD (2008) 5. Buriol, L., Frahling, G., Leonardi, S., Marchetti-Spaccamela, A., Sohler, C.: Counting Triangles in Data Streams. In: PODS (2006) 6. Chernoff, H.: A Note on an Inequality Involving the Normal Distribution. Annals of Probability 9(3), 533–535 (1981) 7. Chung, F., Lu, L.: Complex Graphs and Networks, vol. 107. American Mathematical Society, Providence (2006) 8. Coppersmith, D., Winograd, S.: Matrix multiplication via arithmetic progressions. In: STOC (1987) 9. Dean, J., Ghemawat, S.: MapReduce: Simplified Data Processing on Large Clusters. In: OSDI (2004) 10. Eckmann, J.-P., Moses, E.: Curvature of co-links uncovers hidden thematic layers in the World Wide Web. In: PNAS (2002) 11. Frank, O., Strauss, D.: Markov Graphs. Journal of the American Statistical Association 81(395), 832–842 (1986) 12. Feigenbaum, J., Kannan, S., McGregor, A., Suri, S., Zhang, J.: On graph problems in a semi-streaming model. Journal of Theoretical Computer Science 348(2), 207–216 (2005) 13. Hajnal, A., Szemerédi, E.: Proof of a Conjecture of Erdős. In: Combinatorial Theory and Its Applications, vol. 2, pp. 601–623. North-Holland, Amsterdam (1970) 14. Itai, A., Rodeh, M.: Finding a minimum circuit in a graph. In: STOC (1977) 15. Johnson, W., Lindenstrauss, J.: Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics 26, 189–206 (1984) 16. Jowhari, H., Ghodsi, M.: New Streaming Algorithms for Counting Triangles in Graphs. In: Wang, L. (ed.) COCOON 2005. LNCS, vol. 3595, pp. 710–716. Springer, Heidelberg (2005) 17. Kang, U., Tsourakakis, C., Appel, A.P., Faloutsos, C., Leskovec, J.: Radius Plots for Mining Tera-byte Scale Graphs: Algorithms, Patterns, and Observations. In: SIAM Data Mining, SDM 2010 (2010) 18. Kang, U., Tsourakakis, C., Faloutsos, C.: PEGASUS: A Peta-Scale Graph Mining System. In: IEEE Data Mining, ICDM 2009 (2009) 19. Kim, J.H., Vu, V.H.: Concentration of multivariate polynomials and its applications. Combinatorica 20(3), 417–434 (2000) 20. Knuth, D.: Seminumerical Algorithms, 3rd edn. Addison-Wesley Professional, Reading (1997)
21. Kolountzakis, M., Miller, G.L., Peng, R., Tsourakakis, C.E.: Efficient Triangle Counting in Large Graphs via Degree-based Vertex Partitioning, http://arxiv.org/abs/1011.0468 22. Latapy, M.: Main-memory triangle computations for very large (sparse (power-law)) graphs. Theor. Comput. Sci. 407, 458–473 (2008) 23. Magen, A., Zouzias, A.: Near Optimal Dimensionality Reductions That Preserve Volumes. In: Goel, A., Jansen, K., Rolim, J.D.P., Rubinfeld, R. (eds.) APPROX and RANDOM 2008. LNCS, vol. 5171, pp. 523–534. Springer, Heidelberg (2008) 24. Mislove, A., Marcon, M., Gummadi, K., Druschel, P., Bhattacharjee, B.: Measurement and Analysis of Online Social Networks. In: IMC (2007) 25. Newman, M.: The structure and function of complex networks (2003) 26. Papadimitriou, C., Yannakakis, M.: The clique problem for planar graphs. Information Processing Letters 13, 131–133 (1981) 27. Schank, T., Wagner, D.: Finding, Counting and Listing all Triangles in Large Graphs, An Experimental Study. In: Nikoletseas, S.E. (ed.) WEA 2005. LNCS, vol. 3503, pp. 606–609. Springer, Heidelberg (2005) 28. Schank, T., Wagner, D.: Approximating Clustering Coefficient and Transitivity. Journal of Graph Algorithms and Applications 9, 265–275 (2005) 29. Tsourakakis, C.E.: Fast Counting of Triangles in Large Real Networks, without counting: Algorithms and Laws. In: ICDM (2008) 30. Tsourakakis, C.E.: Counting Triangles Using Projections. KAIS Journal (2010) 31. Tsourakakis, C.E., Kang, U., Miller, G.L., Faloutsos, C.: Doulion: Counting Triangles in Massive Graphs with a Coin. In: KDD (2009) 32. Tsourakakis, C.E., Kolountzakis, M., Miller, G.L.: Approximate Triangle Counting (Preprint), http://arxiv.org/abs/0904.3761 33. Tsourakakis, C.E., Drineas, P., Michelakis, E., Koutis, I., Faloutsos, C.: Spectral Counting of Triangles via Element-Wise Sparsification and Triangle-Based Link Recommendation. In: ASONAM (2010) 34. Vu, V.H.: On the concentration of multivariate polynomials with small expectation. Random Structures and Algorithms 16(4), 344–363 (2000) 35. Wasserman, S., Faust, K.: Social Network Analysis: Methods and Applications (Structural Analysis in the Social Sciences). Cambridge University Press, Cambridge (1994)
Computing an Aggregate Edge-Weight Function for Clustering Graphs with Multiple Edge Types

Matthew Rocklin (Department of Computer Science, University of Chicago; [email protected]) and Ali Pinar (Sandia National Laboratories; [email protected])
Abstract. We investigate the community detection problem on graphs in the presence of multiple edge types. Our main motivation is that similarity between objects can be defined by many different metrics, and aggregating these metrics into a single one poses several important challenges, such as recovering the aggregation function from ground-truth data and investigating the space of different clusterings. In this paper, we address how to find an aggregation function that generates a composite metric which best resonates with the ground-truth. We describe two approaches: solving an inverse problem, where we seek parameters that generate a graph whose clustering gives the ground-truth clustering; and choosing parameters that maximize the quality of the ground-truth clustering. We present experimental results on real and synthetic benchmarks.
1 Introduction

A community or a cluster in a network is a subset of vertices that are tightly coupled among themselves and loosely coupled with the rest of the network. Finding these communities is one of the fundamental problems of network analysis and has been the subject of numerous research efforts. Most of these efforts begin with the premise that a simple graph has already been constructed; that is, the relation between two objects (hence the existence of an edge between two nodes) is already quantified with a binary variable or a single number representing the strength of the connection. This paper studies the community detection problem on networks with multiple edge types, or multiple similarity metrics, as opposed to traditional networks with a single edge type. In many real-world problems, similarities between objects can be defined by many different relationships. For instance, similarity between two scientific articles can be defined based on authors, citations to, citations from, keywords, titles, where they are published, text similarity, and many more. Relationships between people can be based on the nature of communication (e.g., business, family, friendships) or the means of
This work is supported by the Laboratory Directed Research and Development program of Sandia National Laboratories. This author is also supported by the DOE Applied Mathematics Research Program. Sandia National Laboratories is a multi-program laboratory operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.
communication (e.g., emails, phone, in person). Electronic files can be grouped by their type (LaTeX, C, HTML), their names, the times they were created, or the patterns by which they are accessed. In these examples, there are actually multiple graphs defining relationships between the subjects. Reducing all this information to a single composite graph is convenient, as it enables the application of many strong results from the literature; however, the information lost during this aggregation may be crucial. The community detection problem on networks with multiple edge types raises many interesting questions. If the ground-truth clustering is known, can we recover an aggregation scheme that best resonates with the ground-truth data? Is there a meta-clustering structure (i.e., are the clusterings clustered), and how do we find it? How do we find significantly different clusterings of the same data? These problems add another level of complexity to the already difficult problem of community detection in networks. As in the single-edge-type case, the challenges lie not only in algorithms, but also in the formulation of these problems. Our ongoing work addresses all of these problems. In this paper, however, we focus on recovering an aggregation scheme from a ground-truth clustering. Our techniques rely on nonlinear optimization and on methods for classical community detection (i.e., community detection with a single edge type). We present results on real as well as synthetic data sets.
2 Background

Traditionally, a graph G is defined as a tuple (V, E), where V is a set of vertices and E is a set of edges. A weight $w_i \in \mathbb{R}$ may be associated with each edge, corresponding to the strength of the connection between its two end vertices. In this work, we deal with multiple edge types that correspond to different measures of similarity. Accordingly, we replace the weight of an edge $w_i \in \mathbb{R}$ with a weight vector $\langle w_i^1, w_i^2, \ldots, w_i^K \rangle \in \mathbb{R}^K$, where K is the number of edge types. A composite similarity can be defined by a function $\mathbb{R}^K \to \mathbb{R}$ that reduces the weight vector to a single number. In this paper, we restrict ourselves to linear functions, so that the composite edge weight is defined as $w_i(\alpha) = \sum_{j=1}^{K} \alpha_j w_i^j$.

2.1 Clustering in Graphs

Intuitively, the goal of clustering is to break the graph into smaller groups such that vertices in each group are tightly coupled among themselves and loosely coupled with the remainder of the network. Both the translation of this intuition into a well-defined mathematical formulation and the design of associated algorithms pose big challenges. Despite the high quality and volume of the literature, the area continues to draw a lot of interest, due to the growing importance of the problem and to the challenges posed by the sizes of the subject graphs and by the mathematical variety that appears as one gets into the details. Our goal here is to extend the concept of clustering to graphs with multiple edge types without getting into the details of clustering algorithms and formulations, since such a detailed study would be well beyond the scope of this paper. In this paper, we used Graclus, developed by Dhillon et al. [1], which uses a top-down approach that recursively splits the graph into smaller pieces.
2.2 Comparing Two Clusterings

At the core of most of our discussions is the similarity between two clusterings. Several metrics and methods have been proposed for comparing clusterings, such as the variation of information [6], the scaled coverage measure [11], classification error [3,5,6], and Mirkin's metric [7]. Out of these, we used the variation of information metric in our experiments. Let $C_0 = \{C_0^1, C_0^2, \ldots, C_0^K\}$ and $C_1 = \{C_1^1, C_1^2, \ldots, C_1^K\}$ be two clusterings of the same node set. Let n be the total number of nodes, and let $P(C, k) = \frac{|C^k|}{n}$ be the probability that a node is in cluster $C^k$ of a clustering C. We also define $P(C_i, C_j, k, l) = \frac{|C_i^k \cap C_j^l|}{n}$. Then the entropy of the information in $C_i$ is

$$H(C_i) = -\sum_{k=1}^{K} P(C_i, k) \log P(C_i, k),$$

the mutual information shared by $C_i$ and $C_j$ is

$$I(C_i, C_j) = \sum_{k=1}^{K} \sum_{l=1}^{K} P(C_i, C_j, k, l) \log \frac{P(C_i, C_j, k, l)}{P(C_i, k)\, P(C_j, l)},$$

and the variation of information is given by

$$d_{VI}(C_i, C_j) = H(C_i) + H(C_j) - 2 I(C_i, C_j). \quad (1)$$

Meila [6] explains the intuition behind this metric as follows. $H(C_i)$ denotes the average uncertainty of the position of a node in clustering $C_i$. If, however, we are given $C_j$, then $I(C_i, C_j)$ denotes the average reduction in uncertainty about where a node is located in $C_i$. If we rewrite Equation (1) as

$$d_{VI}(C_i, C_j) = \big(H(C_i) - I(C_i, C_j)\big) + \big(H(C_j) - I(C_i, C_j)\big),$$

the first term measures the information lost if $C_j$ is the true clustering and we obtain $C_i$, and the second term is vice versa. The variation of information metric can be computed in O(n) time.
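For concreteness, here is a short Python sketch of this computation; the list-of-labels input format is our own assumption.

```python
import math
from collections import Counter

def variation_of_information(c0, c1):
    """Compute d_VI between two clusterings, each given as an equal-length
    list mapping node index -> cluster label."""
    n = len(c0)
    p0, p1 = Counter(c0), Counter(c1)
    joint = Counter(zip(c0, c1))
    h0 = -sum(c / n * math.log(c / n) for c in p0.values())
    h1 = -sum(c / n * math.log(c / n) for c in p1.values())
    mi = sum(c / n * math.log((c / n) / ((p0[a] / n) * (p1[b] / n)))
             for (a, b), c in joint.items())
    return h0 + h1 - 2 * mi
```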
3 Recovering a Graph Given a Ground-Truth Clustering

Suppose we have ground-truth clustering information for a graph with multiple similarity metrics. Can we recover an aggregation scheme that best resonates with the ground-truth data? Such a scheme, which reduces multiple similarity measurements to a single one, can be a crucial enabler, since it reduces the problem of finding communities under multiple similarity metrics to a well-known, fundamental problem in data analysis. Additionally, if we can learn the aggregation scheme from data sets for which the ground-truth is available, we may then apply the same aggregation to other data instances in the same domain.
Formally, we work on the following problem. We are given a graph G = (V, E) with multiple similarity measurements $\langle w_i^1, w_i^2, \ldots, w_i^K \rangle \in \mathbb{R}^K$ for each edge, and a ground-truth clustering $C^*$ for this graph. Our goal is to find a weighting vector $\alpha \in \mathbb{R}^K$ such that $C^*$ is an optimal clustering for the graph G whose edges are weighted as $w_i = \sum_{j=1}^{K} \alpha_j w_i^j$. Note that this is only a semi-formal definition, as we have not formally defined what we mean by an optimal clustering. In addition to the well-known difficulty of defining what a good clustering means, matching the ground-truth data has specific challenges, which we discuss in the subsequent sections. Below, we describe two approaches. The first is based on inverse problems: we try to find weighting parameters for which the clustering of the graph yields the ground-truth clustering. The second computes weighting parameters that maximize the quality of the ground-truth clustering.

3.1 Solving an Inverse Problem
Inverse problems arise in many scientific computing applications where the goal is to infer unobservable parameters from finite observations. Solutions typically iterate between guessing parameters and solving the forward problem to compute the quality of the guess. Our problem can be considered an inverse problem, since we are trying to compute an aggregation function from a given clustering; the forward problem in this case is the clustering operation. We can start with a random guess for the edge weights, cluster the resulting graph, and use the distance between the two clusterings as a measure of the quality of the guess. We can further put this process inside an optimization loop to find the parameters that yield the clustering closest to the ground-truth. The disadvantage of this method is that it relies on the accuracy of the forward solution, i.e., the clustering algorithm. Even given the true parameters, we may not reconstruct the same clustering, for two reasons: first, there is no perfect clustering algorithm; and second, even if we could solve the clustering problem optimally, we do not know the exact objective function for clustering. Moreover, the need to solve many clustering problems is time-consuming, especially for large graphs.
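A minimal sketch of one iteration of this loop follows. All names are illustrative: cluster_fn stands in for any black-box single-edge-type clustering routine (e.g., Graclus), and variation_of_information is the sketch given in Section 2.2.

```python
def inverse_problem_objective(alpha, edge_weights, ground_truth, cluster_fn):
    """One forward solve: aggregate the per-edge weight vectors with alpha,
    cluster the composite graph, and score the guess by its VI distance to
    the ground-truth clustering (lower is better)."""
    composite = {e: sum(a * w for a, w in zip(alpha, ws))
                 for e, ws in edge_weights.items()}   # edge -> weight vector
    labels = cluster_fn(composite)                    # node -> cluster labels
    return variation_of_information(labels, ground_truth)
```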
3.2 Maximizing the Quality of the Ground-Truth Clustering

An alternative approach is to find an aggregation function that maximizes the quality of the ground-truth clustering. For this purpose, we have to take into account not only the overall quality of the clustering, but also the placement of individual vertices, as the individual vertices represent local optimality: if the quality of the clustering improves by moving a vertex to a cluster other than its ground-truth cluster, the current solution cannot be ideal. While it is fair to assume that some vertices may be misclassified in the ground-truth data, there should be a penalty for such vertices. Thus we have two objectives while computing α: (i) justifying the location of each vertex, and (ii) maximizing the overall quality of the clustering.

Justifying the Locations of Individual Vertices. For each vertex $v \in V$ we define the pull of v to each cluster $C^k$ in $C = \{C^1, C^2, \ldots, C^K\}$ as the cumulative weight of the edges between v and its neighbors in $C^k$:

$$P_\alpha(v, C^k) = \sum_{w_i = (u,v) \in E;\; u \in C^k} w_i(\alpha). \quad (2)$$

We further define the holding power $H_\alpha(v)$ of each vertex as the pull of the cluster to which the vertex belongs in $C^*$ minus the next largest pull among the remaining clusters. If this number is positive, then v is held more strongly to its proper cluster than to any other. We can then maximize the number of vertices with positive holding power, i.e., $|\{v : H_\alpha(v) > 0\}|$. What is important here is the concept of pull and hold; the specific definitions may be changed without altering the core idea. While this method is local and easy to compute, its discrete nature limits the tools that can be used to solve the associated optimization problem: because gradient information is not available, it hinders our ability to navigate the search space. In our experiments, we smoothed the step-like nature of the indicator $H_\alpha(v) > 0$ by replacing it with $\arctan(\beta H_\alpha(v))$. This functional form still encodes that we want the holding power to be positive for each node, but it allows the optimization routine to benefit from small improvements. It emphasizes nodes that are close to the $H_\alpha(v) = 0$ crossing point (large gradients) over nodes that are well entrenched (low gradients near the extremes); that is, the objective sacrifices holding scores for nodes that are safely entrenched in their clusters (high holding power) or are lost causes (very low holding power) in favor of nodes near the cross-over point. The extent to which it does this is tuned by β, the steepness parameter of the arctangent: for very steep parameters this function resembles classification (a step function), while for very shallow parameters it resembles a simple linear sum, as seen in Fig. 1. We thus solve the following optimization problem to maximize the number of vertices whose positions in the ground-truth clustering are justified by the weighting vector α:

$$\arg\max_{\alpha \in \mathbb{R}^K} \; \sum_{v \in V} \arctan\big(\beta H_\alpha(v)\big) \quad (3)$$

Fig. 1. Arctangent provides a smooth blend between step and linear functions.
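A sketch of the holding power and of the smoothed objective (3); the data-structure names are ours:

```python
import math

def holding_power(v, alpha, edges, cluster_of):
    """Pull of v toward each cluster under weights alpha (Eq. (2)); holding
    power is the pull of v's own cluster minus the largest rival pull.
    edges[v] lists (neighbor, weight_vector) pairs."""
    pull = {}
    for u, w_vec in edges[v]:
        c = cluster_of[u]
        pull[c] = pull.get(c, 0.0) + sum(a * w for a, w in zip(alpha, w_vec))
    own = pull.pop(cluster_of[v], 0.0)
    return own - (max(pull.values()) if pull else 0.0)

def smoothed_objective(alpha, beta, nodes, edges, cluster_of):
    """Eq. (3): arctangent-smoothed count of positively held vertices."""
    return sum(math.atan(beta * holding_power(v, alpha, edges, cluster_of))
               for v in nodes)
```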
Overall Clustering Quality. In addition to individual vertices being justified, the overall quality of the clustering should be maximized. Any quality metric can potentially be used for this purpose; however, some strictly linear functions have a trivial solution. Consider an objective function that measures the quality of a clustering by the sum of the inter-cluster edge weights. To minimize the cumulative weight of cut edges, or equivalently to maximize the cumulative weight of internal edges, we solve

$$\min_{\alpha:\,|\alpha| = 1} \; \sum_{e_j \in Cut} \sum_{i=1}^{K} \alpha_i w_j^i,$$

where Cut denotes the set of edges whose end points are in different clusters. Let $S^k$ denote the sum of the cut edges with respect to the k-th metric, that is, $S^k = \sum_{e_j \in Cut} w_j^k$. Then the objective function can be rewritten as $\min_\alpha \sum_{k=1}^{K} \alpha_k S^k$. Because this is linear, it has a trivial solution that assigns 1 to the weight of the maximum $S^k$, which means only one similarity metric is taken into account. While additional constraints may exclude this specific solution, a linear formulation of the quality will always yield only a trivial solution within the new feasible region.

In our experiments we used the modularity metric [9]. The modularity metric uses a random graph generated with respect to the degree distribution as the null hypothesis, setting the modularity score of a random clustering to 0. Formally, the modularity score of an unweighted graph is

$$\frac{1}{2m} \sum_{ij} \left(e_{ij} - \frac{d_i d_j}{4m}\right) \delta_{ij}, \quad (4)$$

where $e_{ij}$ is a binary variable that is 1 if and only if vertices $v_i$ and $v_j$ are connected; $d_i$ denotes the degree of vertex i; m is the number of edges; and $\delta_{ij}$ is a binary variable that is 1 if and only if vertices $v_i$ and $v_j$ are in the same cluster. In this formulation, $\frac{d_i d_j}{4m}$ corresponds to the number of edges between vertices $v_i$ and $v_j$ in a random graph with the given degree distribution, and its subtraction corresponds to the null hypothesis. This formulation can be generalized to weighted graphs by redefining $e_{ij}$ as the weight of the edge (0 if no such edge exists), $d_i$ as the cumulative weight of edges incident to $v_i$, and m as the cumulative weight of all edges in the graph [8].

3.3 Solving the Optimization Problems

We have presented several nonlinear optimization problems for which derivative information is not available. To solve these problems we used HOPSPACK (Hybrid Optimization Parallel Search PACKage) [10], developed at Sandia National Laboratories to solve linear and nonlinear optimization problems when derivatives are not available.
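Returning to the modularity objective, here is a weighted-modularity sketch following the conventions of Eq. (4) as printed above (including the d_i d_j/(4m) null term); the dictionary-based input format is our own assumption.

```python
def modularity(weights, cluster_of):
    """Weighted modularity in the spirit of Eq. (4): e_ij generalized to
    edge weights, d_i to weighted degrees, and m to the total edge weight.
    `weights` maps an undirected edge (u, v) to its composite weight;
    self-pairs are ignored."""
    m = sum(weights.values())
    deg, internal = {}, {}
    for (u, v), w in weights.items():
        deg[u] = deg.get(u, 0.0) + w
        deg[v] = deg.get(v, 0.0) + w
        if cluster_of[u] == cluster_of[v]:
            c = cluster_of[u]
            internal[c] = internal.get(c, 0.0) + 2 * w  # counts (i,j) and (j,i)
    deg_sum = {}
    for u, d in deg.items():
        c = cluster_of[u]
        deg_sum[c] = deg_sum.get(c, 0.0) + d
    return sum(internal.get(c, 0.0) - ds * ds / (4 * m)
               for c, ds in deg_sum.items()) / (2 * m)
```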
4 Experimental Results

4.1 Recovering Edge Weights

The goal of this set of experiments is to see whether we can find aggregation functions that justify a given clustering. We performed our experiments on three data sets.

Synthetic data: Our generation method is based on the work of Lancichinetti et al. [2], which proposes a method to generate graphs as benchmarks for clustering algorithms. We generated networks with 500, 1000, 2000, and 4000 nodes, 30 edges per node on average, mixing parameters $\mu_t = .7$ and $\mu_w = .75$, and known communities. We then perturbed the edge weights $w_i$ with additive and multiplicative noise, $w_i \leftarrow \nu (w_i + \sigma)$ with $\sigma \in (-2 w_a, 2 w_a)$ and $\nu \in (0, 1)$ drawn uniformly, independently, and identically, where $w_a$ is the average edge weight. After the noise, none of the original metrics preserved the original clustering structure. We display this in Fig. 2, which presents histograms of the holding powers of the vertices. The green bars correspond to the vertices of the original graph; they all have positive holding power. The blue bars correspond to the holding powers after noise is added.
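The noise model itself is simple enough to state as code; a sketch (the function name is ours):

```python
import random

def perturb_weights(weights, w_avg):
    """Apply the benchmark's noise model w <- nu * (w + sigma), with
    sigma ~ Uniform(-2*w_avg, 2*w_avg) and nu ~ Uniform(0, 1), drawn
    i.i.d. for every edge weight."""
    return [random.uniform(0.0, 1.0) * (w + random.uniform(-2 * w_avg, 2 * w_avg))
            for w in weights]
```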
Fig. 2. Three histograms of holding powers for Blue: an example perturbed (poor) edge type, Green: the original data (very good), Red: the optimal blend of ten of the perturbed edge types
We only present one edge type, for clarity of presentation. As can be seen, a significant portion (30%) of the vertices have negative holding power, which means they would rather be in another cluster. The red bars show the holding powers after we compute an optimal linear aggregation. As seen in the figure, almost all vertices move to the positive side, justifying the ground-truth clustering. A few vertices with negative holding power are expected even after optimization, due to the noise. These results show that a composite similarity that resonates with a given clustering can be computed out of many metrics, none of which gives a good solution by itself. In Table 1, we present these results on graphs with different numbers of vertices. While the percentages change with the number of vertices, our main conclusion, that a good clustering can be achieved via a better aggregation function, remains valid.

Table 1. Fraction of nodes with positive holding power for ground-truth, perturbed, and optimized networks

Number of nodes | Number of clusters | Ground-truth | Optimized | Perturbed (average)
500  | 14  | .965 | .922 | .703
1000 | 27  | .999 | .977 | .799
2000 | 58  | .999 | .996 | .846
4000 | 118 | 1.00 | .997 | .860
File system data: An owner of a workstation classified 300 of his files as belonging to one of his three ongoing projects, which we took as the ground-truth clustering. We used filename similarity, time-of-modification/time-of-creation similarity, ancestry (distance in the directory tree), and parenthood (edges between a directory node and the file nodes in that directory) as the similarity metrics among these files. Our results showed that only three metrics (time-of-modification, ancestry, and parenthood) affected the clustering. However, the solutions were sensitive to the choice of the arctangent parameter. In Fig. 3, each column corresponds to an optimal solution for the corresponding arctangent parameter. Recall that a higher value of the arctangent
Fig. 3. Optimal solutions for the file system data for different arctangent parameters
parameter corresponds to a sharper step function. Hence, the right side of the figure corresponds to maximizing the total number of vertices with positive holding power, while the left side corresponds to maximizing the sum of the holding powers. The difference is that the solutions on the right may have many nodes with barely positive values, while those on the left may have nodes further away from zero at the cost of more nodes with negative holding power. This is expected in general, but the drastic change in the optimal solutions as we go from one extreme to the other surprised us, and should be taken into account in further studies.

Arxiv data: We took 30,000 high-energy physics articles published on arXiv.org and considered abstract text similarity, title similarity, citation links, and shared authors as edge types for these articles. We used the top-level domain of the submitter's e-mail address (.edu, .uk, .jp, etc.) as a proxy for the region where the work was done, and used these regions as the ground-truth clustering. The best parameters explaining the ground-truth clustering were 0.0 for abstracts, 1.0 for authors, 0.059 for citations, and 0.0016 for titles. That is, the shared-authors edge type is almost entirely favored, with cross-citations a distant second. This is intuitive, because a network of articles linked by common authors is linked both by topic (we work with people in our field) and by geography (we often work with people in nearby institutions), whereas edge types like abstract text similarity tend to encode only the topic of a paper, which is less geographically correlated. Different groups can work on the same topic, so it was good to see that citations factored in; still, such a clear dominance of the author information was noteworthy. As future work, we plan to investigate nonlinear aggregation functions on this graph.

4.2 Clustering Quality vs. Holding Vertices

We have stated two goals in computing an aggregation function: justifying the position of each vertex, and the overall quality of the clustering. In Fig. 4, we present the Pareto frontier for the two objectives. The vertical axis represents the quality of the clustering
Fig. 4. Pareto frontier for two objectives: normalized modularity and percentage of nodes with positive holding power
with respect to the modularity metric [9], while the horizontal axis represents the percentage of nodes with positive holding power. The modularity numbers are normalized by the modularity of the ground-truth clustering; normalized numbers can exceed 1, since the ground-truth clustering does not specifically aim at maximizing modularity. As expected, Fig. 4 shows a trade-off between the two objectives. However, the difference in scale between the two axes should be noted: the full range of change is limited to only 3% for modularity, while it is more than 20% for the fraction of vertices with positive holding power. More importantly, by looking only at the holding powers we can preserve the original modularity quality. The reason is that we have relatively small clusters, and almost all vertices have a connection to a cluster besides their own. If we had clusters in which many vertices had all their connections within their own cluster (e.g., much larger clusters), this would not have been the case, and a separate quality-of-clustering metric would have made sense. However, we know that most complex networks have small communities no matter how big the graphs are [4]. Therefore, we expect that looking only at the holding powers of the vertices will be sufficient to recover aggregation functions.

4.3 Inverse Problems vs. Maximizing Clustering Quality

We used the file system data set to investigate the relationship between the two proposed approaches, and present the results in Fig. 5. For this figure we compute the objective function of the ground-truth clustering for various aggregation weights, and use the same weights to compute clusterings with Graclus. From these clusterings we compute the variation of information (VI) distance to the ground-truth. Fig. 5 presents the correlation between the two measures: the VI distance of the Graclus clusterings for the first approach, and the objective function values for the second. This asks whether solutions with higher objective function values yield clusterings closer to the ground-truth under Graclus. In this figure, a horizontal line fixed at 1 would indicate complete agreement. Our results show a strong correlation for moderate values of β (the arctan function neither too step-like nor too linear). These results are not sufficient
Fig. 5. The correlation of the arctan-smoothed objective function with variation of information distance using clusterings generated by Graclus as we vary the steepness parameter
to be conclusive, as we need more experiments and other clustering tools. However, this experiment produced promising results and shows how such a study may be performed.

4.4 Runtime Scalability

In our final set of experiments we show the scalability of the proposed method. First, we note that the number of unknowns in the optimization problem depends only on the aggregation function and is independent of the graph size. The number of operations required for one function evaluation, on the other hand, depends linearly on the size of the graph, as illustrated in Fig. 6. In this experiment, we used synthetic graphs with average degree 30, and the presented numbers are averages over 10 runs. As expected, the runtimes scale linearly with the number of edges. The runtime of the optimization algorithm depends on the number of function evaluations. Since the algorithm we used is nondeterministic, the number of function evaluations, and hence the runtime, varies even across runs on the same problem, and is thus less informative; we do not present these results in detail due to space constraints. However, we re-emphasize that the size of the optimization problem does not grow with the graph size, and we do not expect the number of function evaluations to cause any scalability problems. We also observed that the number of function evaluations increases linearly with the number of similarity metrics. These results are also omitted due to space constraints.

Fig. 6. Scalability of the proposed method
5 Conclusion and Future Work

We have discussed the problem of graph clustering with multiple edge types, and studied how to compute an aggregation function that produces composite edge weights that best resonate with a given ground-truth clustering. We applied our methods to real and synthetic data sets and presented experimental results showing that our methods are scalable and can recover aggregation functions that yield high-quality clusterings. This paper only scratches the surface of the clustering problem with multiple edge types. There are many interesting problems to be investigated, such as meta-clustering (i.e., clustering the clusterings) and finding significantly different clusterings of the same data, which are part of our ongoing work. We also plan to extend our experimental work on the present problem of computing aggregation functions from ground-truth data.
References 1. Dhillon, I., Guan, Y., Kulis, B.: Weighted graph cuts without eigenvectors: a multilevel approach. IEEE T. Pattern Analysis and Machine Intelligence 29(11), 1944–1957 (2007) 2. Lancichinetti, A., Fortunato, S., Radicchi, F.: Benchmark graphs for testing community detection algorithms. Physical Review E 78(4), 1–5 (2008) 3. Lange, T., Roth, V., Braun, M.L., Buhmann, J.M.: Stability-based validation of clustering solutions. Neural Computation 16, 1299–1323 (2004) 4. Leskovec, J., Lang, K., Dasgupta, A., Mahoney, M.: Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters. Internet Mathematics 6, 29–123 (2009) 5. Luo, X.: On coreference resolution performance metrics. In: Proc. Human Language Technology Conf. and Conf. Empirical Methods in Natural Language Processing, Vancouver, British Columbia, Canada, pp. 25–32. Association for Computational Linguistics (2005) 6. Meila, M.: Comparing clusterings: an axiomatic view. In: Proceedings of the 22nd International Conference on Machine Learning, pp. 577–584 (2005) 7. Mirkin, B.: Mathematical Classification and Clustering. Kluwer Academic Press, Dordrecht (1996) 8. Newman, M.: Analysis of weighted networks. Phys. Rev. E 70(5), 056131 (2004) 9. Newman, M.: Modularity and community structure in networks. PNAS 103, 8577–8582 (2006) 10. Plantenga, T.: HOPSPACK 2.0 user manual. Technical Report SAND2009-6265, Sandia National Laboratories (2009) 11. van Dongen, S.: Performance criteria for graph clustering and Markov cluster experiments. Technical Report INS-R0012, Centre for Mathematics and Computer Science (2000)
Component Evolution in General Random Intersection Graphs

Milan Bradonjić (1), Aric Hagberg (2), Nicolas W. Hengartner (3), and Allon G. Percus (4)

(1) Theoretical Division and Center for Nonlinear Studies, Los Alamos National Laboratory, Los Alamos, NM 87545, USA; [email protected]
(2) Theoretical Division, Los Alamos National Laboratory, Los Alamos, NM 87545, USA; [email protected]
(3) Information Sciences Group, Los Alamos National Laboratory, Los Alamos, NM 87545, USA; [email protected]
(4) School of Mathematical Sciences, Claremont Graduate University, Claremont, CA 91711, USA; [email protected]
Abstract. Random intersection graphs (RIGs) are an important random structure with algorithmic applications in social networks, epidemic networks, blog readership, and wireless sensor networks. RIGs can be interpreted as a model for large, randomly formed non-metric data sets. We analyze the component evolution in general RIGs, giving conditions for the existence and uniqueness of the giant component. Our techniques generalize existing methods for the analysis of component evolution: we analyze survival and extinction properties of a dependent, inhomogeneous Galton-Watson branching process on general RIGs. Our analysis relies on bounding the branching processes and inherits the fundamental concepts of the study of component evolution in Erdős-Rényi graphs. The major challenge comes from the underlying structure of RIGs, which involves both a set of nodes and a set of attributes, with different probabilities associated with each attribute. Keywords: General random intersection graphs, random graphs, branching processes, giant component, stochastic processes in relation with random discrete structures.
1 Introduction

Bipartite graphs, consisting of two sets of nodes with edges only connecting nodes in opposite sets, are a natural representation for many algorithmic problems on networks. Social networks can often be cast as bipartite graphs built from sets of individuals connected to sets of attributes, such as membership of a club or organization, work colleagues, or fans of the same sports team. A well-known example is a collaboration graph, where the two sets might be scientists and research papers, or actors and movies [27,18]. Simulations of epidemic spread in human populations are often performed on networks constructed from bipartite graphs of people and the locations they
visit during a typical day [11]. Bipartite structure is hardly limited to social networks. The relation between nodes and keys in secure wireless communication, for example, forms a bipartite network [6]. Factor graphs have become a standard representation for constraint satisfaction problems such as k-SAT and graph coloring. In general, bipartite graphs are well suited to problems of classifying objects, where each object has a set of properties [10]. However, modeling such networks remains a challenge. The well-studied Erdős-Rényi model, $G_{n,p}$, successfully used for average-case analysis of algorithm performance, does not satisfactorily represent many randomly formed social or collaboration networks: it does not capture the typical scale-free degree distribution of many real-world networks [3]. More realistic degree distributions can be achieved by the configuration model [20] or the expected degree model [7], but even those fail to capture common properties of social networks such as the high number of triangles (or cliques) and strong degree-degree correlation [19,1]. A straightforward way of remedying these problems is to characterize each of the bipartite sets separately. One step in this direction is an extension of the configuration model that specifies degrees in both sets [14]. We study the related approach of random intersection graphs (RIGs), first introduced in [26,16]. Any undirected graph can be represented as an intersection graph [9]. The simplest version is the "uniform" RIG, G(n, m, p), containing a set of n nodes and a set of m attributes, where any given node-attribute pair is connected with a fixed probability p, independently of other pairs. Two nodes of the graph are taken to be connected if and only if they are both connected to at least one common element of the attribute set. In our work, we study the more general RIG, G(n, m, p) [22,21], where the node-attribute edge probabilities are not given by a uniform value p but rather by a set $p = \{p_w \mid w \in W\}$; a node is attached to attribute w with probability $p_w$.¹ This general model has only recently been developed and only a few results have been obtained, such as expander properties, cover time, and the existence and efficient construction of large independent sets [22,21,23]. In this paper, we generalize results previously obtained for the uniform RIG [4,6], analyzing the evolution of components in general RIGs and obtaining conditions for the existence and uniqueness of the giant component. Our main contribution is a generalization of the branching process used for analyzing $G_{n,p}$ [2]. By considering an auxiliary process that is stochastically equivalent, we bound the stopping time of the branching process on general RIGs, yielding bounds on the sizes of graph components. The major challenge comes from the underlying structure of RIGs, which involves both the set of nodes and the set of attributes, as well as the set of different probabilities $p = \{p_w \mid w \in W\}$. Our approach requires us to keep track of the history of the branching process, which is directly dictated by this structure.
2 Model and Previous Work

In this paper, we consider the general intersection graph G(n, m, p), introduced in [22,21], with a set of probabilities $p = \{p_w \mid w \in W\}$, where $p_w \in (0, 1)$. We now formally define the model.

¹ Note that the $p_w$ do not generally sum up to 1. Furthermore, we can eliminate the trivial cases of $p_w = 0$ and $p_w = 1$, corresponding to the absence of attribute w and to a complete graph.
Model. Given a set of nodes $V = \{1, 2, \ldots, n\}$, attributes $W = \{1, 2, \ldots, m\}$, and probabilities $p = \{p_w \mid w \in W\}$, for all $(v, w) \in V \times W$ define the i.i.d. indicator random variables

$$I_{v,w} \sim \mathrm{Bernoulli}(p_w). \quad (1)$$

Every node $v \in V$ is assigned a random set of attributes $W(v) \subseteq W$,

$$W(v) = \{w \in W \mid I_{v,w} = 1\}. \quad (2)$$

This is illustrated schematically in Fig. 1. A set of edges $E \subseteq V \times V$ is then defined such that for two distinct nodes $v_i, v_j \in V$,

$$\{v_i, v_j\} \in E \iff |W(v_i) \cap W(v_j)| \ge s \quad (3)$$

for a given integer $s \ge 1$. Thus, two nodes are adjacent if and only if they have at least s attributes in common. One limitation of our analysis is that, for simplicity, we fix s = 1.

Fig. 1. Random intersection graph. V is the set of nodes and W is the set of attributes. A particular attribute $w_i$ is associated with every node independently at random with probability $p_i$.

Our model generalizes the uniform model G(n, m, p), studied in [4,6], where all $p_w$ take on the same value p. Different generalizations and special cases have been studied in [13,15,17,8]. To complete the picture of previous work, in [8] it was shown that when n = m, a set of probabilities $p = \{p_w \mid w \in W\}$ can be chosen to tune the degree and clustering coefficient of the graph.
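A direct Python sketch of sampling from this model (quadratic in n, so for illustration only; the function name is ours):

```python
import random
from itertools import combinations

def sample_general_rig(n, m, p, s=1):
    """Sample G(n, m, p): node v holds attribute w independently with
    probability p[w] (Eqs. (1)-(2)); two nodes are adjacent iff they share
    at least s attributes (Eq. (3); s = 1 throughout this paper)."""
    attrs = [{w for w in range(m) if random.random() < p[w]} for _ in range(n)]
    edges = {(u, v) for u, v in combinations(range(n), 2)
             if len(attrs[u] & attrs[v]) >= s}
    return attrs, edges
```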
3 Mathematical Preliminaries

In this paper, we analyze the component evolution of the general RIG structure. As already mentioned, the major challenge comes from the underlying structure of RIGs, which involves both a set of nodes and a set of attributes, as well as a set of different probabilities $p = \{p_w \mid w \in W\}$. Moreover, the edges of a RIG are not independent. Hence, a RIG cannot be treated as an Erdős-Rényi random graph $G_{n,\hat{p}}$ with edge probability $\hat{p} = 1 - \prod_{w \in W}(1 - p_w^2)$. However, in [12] the authors compare $G_{n,\hat{p}}$ and G(n, m, p), showing that for $m = n^\alpha$ with $\alpha > 6$, these two classes of graphs have asymptotically the same properties. In [25], Rybarczyk has recently shown the equivalence of sharp threshold functions between $G_{n,\hat{p}}$ and G(n, m, p) when $m \ge n^3$. In this work, we do not impose any constraints between n and m, and we develop methods for the analysis of branching processes on RIGs, since the existing methods for the analysis of branching processes on $G_{n,p}$ do not apply.
We now briefly describe the edge dependence. Consider three distinct nodes $v_i, v_j, v_k$ from V, and let "↔" denote adjacency, so that $v_i \leftrightarrow v_j$ iff $|W(v_i) \cap W(v_j)| \ge 1$. Conditional on the set $W(v_k)$, by the definition (2), the sets $W(v_i) \cap W(v_k)$ and $W(v_j) \cap W(v_k)$ are mutually independent, which implies conditional independence of the events $\{v_i \leftrightarrow v_k \mid W(v_k)\}$ and $\{v_j \leftrightarrow v_k \mid W(v_k)\}$, that is,

$$P[v_i \leftrightarrow v_k,\, v_j \leftrightarrow v_k \mid W(v_k)] = P[v_i \leftrightarrow v_k \mid W(v_k)]\; P[v_j \leftrightarrow v_k \mid W(v_k)]. \quad (4)$$

However, the latter does not imply independence of the events $\{v_i \leftrightarrow v_k\}$ and $\{v_j \leftrightarrow v_k\}$, since in general

$$P[v_i \leftrightarrow v_k,\, v_j \leftrightarrow v_k] = E\big[P[v_i \leftrightarrow v_k,\, v_j \leftrightarrow v_k \mid W(v_k)]\big] = E\big[P[v_i \leftrightarrow v_k \mid W(v_k)]\; P[v_j \leftrightarrow v_k \mid W(v_k)]\big] \ne P[v_i \leftrightarrow v_k]\; P[v_j \leftrightarrow v_k]. \quad (5)$$

Furthermore, the conditional pairwise independence (4) does not extend to three or more nodes. Indeed, conditionally on the set $W(v_k)$, the sets $W(v_i) \cap W(v_j)$, $W(v_i) \cap W(v_k)$, and $W(v_j) \cap W(v_k)$ are not mutually independent, and hence neither are the events $\{v_i \leftrightarrow v_j\}$, $\{v_i \leftrightarrow v_k\}$, and $\{v_j \leftrightarrow v_k\}$, that is,

$$P[v_i \leftrightarrow v_j,\, v_i \leftrightarrow v_k,\, v_j \leftrightarrow v_k \mid W(v_k)] \ne P[v_i \leftrightarrow v_j \mid W(v_k)]\; P[v_i \leftrightarrow v_k \mid W(v_k)]\; P[v_j \leftrightarrow v_k \mid W(v_k)].$$
4 Auxiliary Process on General Random Intersection Graphs

Our analysis of the emergence of a giant component is inspired by the process described in [2], which measures the size of a component by counting the number of steps until a breadth-first search terminates. The difficulty in using this approach to analyze the evolution of the stochastic process defined by equations (1), (2), and (3) resides in the fact that we would need, at least in principle, to keep track of the temporal evolution of the sets of nodes and attributes being explored, which results in a process that is not Markovian. Therefore, we instead construct an auxiliary process that is simpler to analyze but whose stopping time is, in distribution, identical to that of the breadth-first search. The process is defined algorithmically as follows.

Auxiliary Process. Start from an arbitrary node $v_0 \in V$. Denote by $V_t$ the cumulative set of nodes visited by time t, initialized to $V_0 = \{v_0\}$. Denote the cumulative set of all attributes [6] associated with the set $V_t$ by

$$W_t = \bigcup_{\tau=0}^{t} W(v_\tau). \quad (6)$$

Now consider the set of nodes adjacent to $V_t$ but not yet visited:

$$\{v \in V \setminus V_t : W(v) \cap W_t \ne \emptyset\}. \quad (7)$$
Following [2], we call this the set of alive nodes at time t. Unlike in [2], however, we do not keep track of the actual list of alive nodes, but only of the size of the set, which we denote by the random variable

$$Y_t = \big|\{v \in V \setminus V_t : W(v) \cap W_t \ne \emptyset\}\big|.$$

The process evolves as follows: for $t \ge 1$, pick a node $v_t$ uniformly at random from the set $V \setminus V_{t-1}$ and update the set of visited nodes, $V_t = V_{t-1} \cup \{v_t\} = \{v_0, \ldots, v_t\}$. Then update the set of alive nodes and $Y_t$. The process terminates when $Y_t$ reaches 0. To understand why this auxiliary process is useful, notice that due to the independence of the random variables $I_{v,w}$, at step t every node in $V \setminus V_t$ is equally likely to belong to the set (7) of alive nodes. Consequently, picking the next node $v_{t+1}$ uniformly from $V \setminus V_t$ is the same random process as picking $v_{t+1}$ uniformly from the set of alive nodes (as in [2]), conditional on the history of the attribute sets uncovered up through time t:

$$\mathcal{H}_t = \{W(v_0), W(v_1), \ldots, W(v_t)\}. \quad (8)$$

In the latter process, the stopping time

$$T(v_0) = \inf\{t \ge 0 : Y_t = 0\} \quad (9)$$

would simply equal $|C(v_0)| - 1$, where $|C(v_0)|$ is the size of the component containing $v_0$ [2]. Thus in the auxiliary process, since it is stochastically equivalent,

$$T(v_0) \stackrel{d}{=} |C(v_0)| - 1. \quad (10)$$
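A short simulation sketch of the auxiliary process (function name and input format are ours):

```python
import random

def auxiliary_stopping_time(attrs):
    """Simulate the auxiliary process: visit nodes in a uniformly random
    order, maintain the cumulative attribute set W_t, and return the
    stopping time T = inf{t >= 0 : Y_t = 0}, which by (10) is distributed
    as |C(v_0)| - 1. attrs[v] is the attribute set of node v (e.g., as
    produced by sample_general_rig above)."""
    n = len(attrs)
    order = random.sample(range(n), n)   # v_0 and the uniform picks v_1, ...
    W_t = set()
    for t, v in enumerate(order):
        W_t |= attrs[v]
        y_t = sum(1 for u in order[t + 1:] if attrs[u] & W_t)
        if y_t == 0:
            return t
    return n - 1
```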
4.1 Process Description in Terms of the Random Variable $Y_t$

Let us characterize the process $\{Y_t\}_{t \ge 1}$ in terms of the number $Z_t$ of newly discovered neighbors of $V_t$:

$$Z_t = Y_t - Y_{t-1} + 1, \quad (11)$$

where the term +1 reflects the fact that $v_t$ is discovered at time step t but is not counted in $Y_t$, because it has been visited. For nodes that are neither visited nor alive, the events of becoming alive at time t are conditionally independent given the history $\mathcal{H}_t$, since each event involves a different subset of the indicator random variables $I_{v,w}$. $W(v_t)$ and $W_{t-1}$ are mutually independent, hence the conditional probability that such a node u becomes alive at time t is

$$r_t = P[u \leftrightarrow v_t,\ u \not\leftrightarrow v_{t-1},\ u \not\leftrightarrow v_{t-2},\ \ldots,\ u \not\leftrightarrow v_0 \mid \mathcal{H}_t] = P[W(u) \cap W(v_t) \ne \emptyset,\ W(u) \cap W_{t-1} = \emptyset \mid \mathcal{H}_t]$$
$$= P[W(u) \cap W(v_t) \ne \emptyset,\ W(u) \cap W_{t-1} = \emptyset \mid W(v_t), W_{t-1}] = P[W(u) \cap W(v_t) \ne \emptyset \mid W(v_t)]\ P[W(u) \cap W_{t-1} = \emptyset \mid W_{t-1}]$$
$$= \Big(1 - \prod_{\alpha \in W(v_t)} (1 - p_\alpha)\Big) \prod_{\beta \in W_{t-1}} (1 - p_\beta).$$

The last expression can be rewritten as

$$r_t = \prod_{\beta \in W_{t-1}} q_\beta - \prod_{\alpha \in W_t} q_\alpha = \phi_{t-1} - \phi_t, \quad (12)$$

where we set $q_w = 1 - p_w$ for all $w \in W$ and $\phi_t = \prod_{\alpha \in W_t} q_\alpha$, with the convention $W_{-1} = \emptyset$ and $\phi_{-1} = 1$. Observe that the probability (12) does not depend on u. Hence the number of new alive nodes at time t is, conditional on the history $\mathcal{H}_t$, a binomially distributed random variable with parameters $r_t$ and $N_t = n - t - Y_t$:

$$Z_{t+1} \mid \mathcal{H}_t \sim \mathrm{Bin}(N_t, r_t). \quad (13)$$

Now, by mathematical induction in t, it easily follows that for times $t \ge 1$ the number of alive nodes $Y_t$ satisfies

$$Y_t \mid \mathcal{H}_{t-1} \sim \mathrm{Bin}\Big(n - 1,\ 1 - \prod_{\tau=0}^{t-1} (1 - r_\tau)\Big) - t + 1. \quad (14)$$
4.2 Expectation and Variance of $\phi_t$

The history $\mathcal{H}_t$ embodies the evolution of how the attributes are discovered over time. It is insightful to recast that history in terms of the discovery times $\Gamma_w$ of each attribute in W. Given any sequence of nodes $v_0, v_1, v_2, \ldots$, the probability that a given attribute w is first discovered at time $t < n$ is

$$P[\Gamma_w = t] = P[I_{v_t,w} = 1,\ I_{v_{t-1},w} = 0,\ \ldots,\ I_{v_0,w} = 0] = p_w (1 - p_w)^t.$$

If an attribute w is not discovered by time $n - 1$, we set $\Gamma_w = \infty$ and note that $P[\Gamma_w = \infty] = (1 - p_w)^n$. From the independence of the random variables $I_{v,w}$, it follows that the discovery times $\{\Gamma_w : w \in W\}$ are mutually independent. We now focus on the distribution of $\phi_t = \prod_{\alpha \in W_t} q_\alpha$. For $t \ge 0$, we have

$$\phi_t = \prod_{\alpha \in W_t} q_\alpha = \prod_{j=0}^{t} \prod_{\alpha \in W_j \setminus W_{j-1}} q_\alpha \stackrel{d}{=} \prod_{j=0}^{t} \prod_{w \in W} q_w^{I(\Gamma_w = j)} = \prod_{w \in W} q_w^{I(\Gamma_w \le t)}. \quad (15)$$

$$E[\phi_t] = \prod_{w \in W} \Big(1 - (1 - q_w)\big(1 - q_w^{t+1}\big)\Big). \quad (16)$$

The concentration of $\phi_0$ will be crucial for the analysis of the supercritical regime, hence we provide $E[\phi_0]$ and $E[\phi_0^2]$ here. In Subsection 5.2, we will assume that $p_w = o(\log n / n)$. Under this condition, it follows from (16) that

$$E[\phi_0] = \prod_{w \in W} (1 - p_w^2) = 1 - \sum_{w \in W} p_w^2 + o\Big(\sum_{w \in W} p_w^2\Big). \quad (17)$$

Moreover, under the same condition, it follows from (15) that

$$E[\phi_0^2] = E\Big[\prod_{w \in W} q_w^{2 I(\Gamma_w \le 0)}\Big] = \prod_{w \in W} \big(1 - (1 - q_w^2)\, P[\Gamma_w = 0]\big) = \prod_{w \in W} \big(1 - (1 - q_w^2)\, p_w\big) = \prod_{w \in W} \big(1 - 2 p_w^2 + p_w^3\big) = 1 - 2 \sum_{w \in W} p_w^2 + o\Big(\sum_{w \in W} p_w^2\Big). \quad (18)$$
5 Giant Component With the process {Yt }t≥0 defined in the previous section, we analyze both the subcritical and supercritical regime of a general random intersection graph by adapting the percolation-based techniques used to analyze Erd˝os-R´enyi random graphs [2]. The technical difficulty in analyzing that stopping time rests in the fact that the distribution of Yt depends on the history of the process, dictated by the structure of the general RIG. In the next two subsections, we will give conditions for the absence as well as for the existence and uniqueness of the giant component, in general RIGs. 5.1 Subcritical Regime Theorem 1. Suppose that pw = O(1/n) for all w
and
p3w = O(1/n2 ).
w∈W
For any positive constant c < 1, if w∈W p2w = c/n, then whp2 all components in a general random intersection graph G(n, m, p) are of order O(log n). Proof. We generalize the techniques used in the proof for the sub-critical case in Gn,p presented in [2]. Let T (v0 ) be the stopping time defined in (9), for the process starting d
at node v0 and recall that T (v0 ) = |C(v0 )|. We will bound the size of the largest component, and prove that under the conditions of the theorem, all components are of order O(log n), whp. For all t ≥ 0, P[T (v0 ) > t] = E P[T (v0 ) > t | Ht ] ≤ E P[Yt > 0 | Ht ] t−1
= E P[Bin(n − 1, 1 − (1 − rτ )) ≥ t | Ht ] . (19) τ =0
Bounding from above, we have 1−
t−1
(1 − rτ ) ≤
τ =0 2
t−1 τ =0
rτ =
t−1
(φτ −1 − φτ ) = 1 − φt−1 ,
(20)
τ =0
“With high probability,” meaning with probability 1 − o(1), as the number of nodes n → ∞.
Component Evolution in General Random Intersection Graphs
43
which can readily be shown by induction on t for r_τ ∈ [0, 1]. By using the stochastic ordering of the binomial distribution, both in n and in Σ_{τ=0}^{t−1} r_τ, and for any positive constant ν < 1, to be specified later, it follows that

P[T(v_0) > t | H_t] ≤ P[Bin(n, Σ_{τ=0}^{t−1} r_τ) ≥ t | H_t] = P[Bin(n, 1 − φ_{t−1}) ≥ t | H_t]
= P[Bin(n, 1 − φ_{t−1}) ≥ t | 1 − φ_{t−1} < (1 − ν)t/n ∩ H_t] · P[1 − φ_{t−1} < (1 − ν)t/n | H_t]
 + P[Bin(n, 1 − φ_{t−1}) ≥ t | 1 − φ_{t−1} ≥ (1 − ν)t/n ∩ H_t] · P[1 − φ_{t−1} ≥ (1 − ν)t/n | H_t]
≤ P[Bin(n, 1 − φ_{t−1}) ≥ t | 1 − φ_{t−1} < (1 − ν)t/n ∩ H_t] + P[1 − φ_{t−1} ≥ (1 − ν)t/n | H_t].    (21)
Furthermore, using the fact that the event {1 − φ_{t−1} < (1 − ν)t/n} is H_t-measurable, together with the stochastic ordering of the binomial distribution, we obtain

P[Bin(n, 1 − φ_{t−1}) ≥ t | 1 − φ_{t−1} < (1 − ν)t/n ∩ H_t] ≤ P[Bin(n, (1 − ν)t/n) ≥ t | H_t].

Taking the expectation with respect to the history H_t in (21) yields

P[T(v_0) > t] ≤ P[Bin(n, (1 − ν)t/n) ≥ t] + P[1 − φ_{t−1} ≥ (1 − ν)t/n].

For t = K_0 log n, where K_0 is a constant large enough and independent of the initial node v_0, the Chernoff bound ensures that P[Bin(n, (1 − ν)t/n) ≥ t] = o(1/n), and

P[1 − φ_{t−1} ≥ (1 − ν)t/n] = P[ ∏_{w∈W} q_w^{I(Γ_w≤t)} ≤ 1 − (1 − ν)t/n ]
= P[ Σ_{w∈W} I(Γ_w ≤ t) log(1/(1 − p_w)) ≥ −log(1 − (1 − ν)t/n) ]
≤ P[ Σ_{w∈W} I(Γ_w ≤ t) log(1/(1 − p_w)) ≥ (1 − ν)t/n ].    (22)
Define the auxiliary random variables X_{t,w} = n log(1/(1 − p_w)) I(Γ_w ≤ t), so that

E[X_{t,w}] = n log(1/(1 − p_w)) (1 − (1 − p_w)^t) = n (p_w + o(p_w)) (t p_w + o(t p_w)) = n t p_w^2 + o(n t p_w^2),    (23)

which implies

Σ_{w∈W} E[X_{t,w}] = n t Σ_{w∈W} p_w^2 (1 + o(1)).    (24)

Thus, under the stated condition that n Σ_{w∈W} p_w^2 = c < 1,
it follows that for some constant c′, where c < c′ < 1, and for sufficiently large n, Σ_{w∈W} E[X_{t,w}] ≤ c′t. In light of (22) and Bernstein's inequality [5],

P[1 − φ_{t−1} ≥ (1 − ν)t/n] ≤ P[ Σ_{w∈W} X_{t,w} ≥ (1 − ν)t ]
≤ P[ Σ_{w∈W} (X_{t,w} − E[X_{t,w}]) ≥ (1 − ν − c′)t ]
≤ exp( − (3/2)((1 − ν − c′)t)^2 / ( 3 Σ_w Var[X_{t,w}] + n t max_w p_w (1 + o(1)) ) ).    (25)
Since

E[X_{t,w}^2] = n^2 log^2(1/(1 − p_w)) (1 − (1 − p_w)^t) = n^2 (p_w^2 + o(p_w^2)) (t p_w + o(t p_w)) = n^2 t p_w^3 + o(n^2 t p_w^3),    (26)

it follows that for some large constant K_1 > 0,

Σ_{w∈W} Var[X_{t,w}] ≤ Σ_{w∈W} E[X_{t,w}^2] = n^2 t Σ_{w∈W} p_w^3 + o(n^2 t Σ_{w∈W} p_w^3) ≤ K_1 t,

where the last inequality uses the assumption Σ_{w∈W} p_w^3 = O(1/n^2).
Finally, the assumption of the theorem implies that there exists a constant K_2 > 0 such that n max_{w∈W} p_w ≤ K_2.
Substituting these bounds into (25) yields

P[1 − φ_{t−1} ≥ (1 − ν)t/n] ≤ exp( − 3(1 − ν − c′)^2 t / (2(3K_1 + K_2)) ).

Taking ν ∈ (0, 1 − c′) and t = K_3 log n for some constant K_3 large enough and not depending on the initial node v_0, we conclude that P[1 − φ_{t−1} ≥ (1 − ν)t/n] = o(n^{−1}), which in turn implies that taking the constant K_4 = max{K_0, K_3} ensures that P[T(v_0) > K_4 log n] = o(1/n) for any initial node v_0. Finally, the union bound over the n possible starting nodes v_0 gives

P[ max_{v_0∈V} T(v_0) > K_4 log n ] ≤ n · o(n^{−1}) = o(1),
which implies that all connected components are of size O(log n), whp.

5.2 Supercritical Regime

We now turn to the study of the supercritical regime, in which lim_{n→∞} n Σ_{w∈W} p_w^2 = c > 1.
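The contrast with the subcritical regime of Theorem 1 can be observed directly in simulation. The sketch below is our own illustration (it assumes NumPy and, for simplicity, uniform attribute probabilities): it samples a general RIG by drawing each attribute's node set and merging the induced cliques with union-find, then reports the largest component size on either side of the threshold n Σ_w p_w² = 1.

```python
import numpy as np
from collections import defaultdict

def rig_largest_component(n, p, seed=0):
    """Sample G(n, m, p) -- nodes are adjacent iff they share an attribute --
    and return the size of the largest connected component."""
    rng = np.random.default_rng(seed)
    parent = list(range(n))

    def find(x):                      # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    counts = rng.binomial(n, p)       # how many nodes hold each attribute w
    for k in counts[counts >= 2]:     # attributes held by >= 2 nodes form cliques
        members = rng.choice(n, size=int(k), replace=False)
        for u, v in zip(members[:-1], members[1:]):
            parent[find(int(u))] = find(int(v))

    sizes = defaultdict(int)
    for u in range(n):
        sizes[find(u)] += 1
    return max(sizes.values())

n, m = 2000, 40000
for c in (0.5, 2.0):                  # subcritical vs. supercritical
    p = np.full(m, np.sqrt(c / (n * m)))   # uniform p_w with n * sum p_w^2 = c
    print(c, rig_largest_component(n, p))
```

With these illustrative sizes, the run at c = 0.5 yields only small components, while at c = 2 a component containing a constant fraction of the n nodes appears, as Theorem 2 below predicts.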
Theorem 2. Suppose that p_w = o(log n/n) for all w and Σ_{w∈W} p_w^3 = o(log n/n^2). For any constant c > 1, if Σ_{w∈W} p_w^2 = c/n, then whp there exists a unique largest component in G(n, m, p), of order Θ(n). Moreover, the size of the giant component is given by nζ_c(1 + o(1)), where ζ_c is the solution in (0, 1) of the equation 1 − e^{−cζ} = ζ, while all other components are of size O(log n).
Remark. The conditions on p_w and Σ_{w∈W} p_w^3 are weaker than the ones in the case of the subcritical regime.
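For concrete values of c, the giant-component fraction ζ_c can be computed by iterating the map ζ ↦ 1 − e^{−cζ}, which converges to the unique solution in (0, 1) whenever c > 1. A minimal numerical sketch (our own illustration):

```python
import math

def giant_fraction(c, tol=1e-12):
    """Solve 1 - exp(-c * zeta) = zeta for zeta in (0, 1) by fixed-point iteration."""
    zeta = 0.5
    while True:
        nxt = 1.0 - math.exp(-c * zeta)
        if abs(nxt - zeta) < tol:
            return nxt
        zeta = nxt

for c in (1.5, 2.0, 3.0):
    print(c, round(giant_fraction(c), 6))   # e.g. c = 2 gives zeta ~ 0.7968
```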
Proof. We start by bounding 1 − ∏_{τ=0}^{t−1}(1 − r_τ). The upper bound Σ_{τ=0}^{t−1} r_τ has already been established in (20). For the lower bound, we apply Jensen's inequality to the function log(1 − x) to get

log ∏_{τ=0}^{t−1}(1 − r_τ) = Σ_{τ=0}^{t−1} log(1 − r_τ) = Σ_{τ=0}^{t−1} log(1 − (φ_{τ−1} − φ_τ))
≤ t log( 1 − (1/t) Σ_{τ=0}^{t−1} (φ_{τ−1} − φ_τ) ) = t log( 1 − (1 − φ_{t−1})/t ).    (27)
In light of (15), φ_t is decreasing in t, and hence

1 − ∏_{τ=0}^{t−1}(1 − r_τ) ≥ 1 − (1 − (1 − φ_{t−1})/t)^t ≥ 1 − (1 − (1 − φ_0)/t)^t.    (28)

To bound 1 − (1 − (1 − φ_0)/t)^t further, consider the function f_t(x) = 1 − (1 − x/t)^t for x in a neighborhood of the origin, with t ≥ 1. For any fixed x, f_t(x) decreases to 1 − e^{−x} as t tends to infinity. The latter function is concave, and hence for all x ≤ ε,

f_t(x) ≥ 1 − e^{−x} ≥ ((1 − e^{−ε})/ε) x.
Focusing on 1 − φ_0, from (18) and (17), by using Chebyshev's inequality with Σ_{w∈W} p_w^2 = c/n, it follows that φ_0 is concentrated around its mean E[φ_0] = 1 − c/n. Therefore, with probability 1 − o(1/n), 1 − φ_0 = o(1). But (1 − e^{−ε})/ε can be made arbitrarily close to 1 by taking ε small enough, so it follows that 1 − ∏_{τ=0}^{t−1}(1 − r_τ) > c′/n for some constant c′ ∈ (1, c) arbitrarily close to c. Hence, the branching process on the RIG is stochastically lower bounded by Bin(n − 1, c′/n). But this bound itself stochastically dominates a branching process on G_{n,c′/n}. Because c′ > 1, there exists whp a giant component of size Θ(n) in G_{n,c′/n}. This implies that the stopping time of the branching process associated to G_{n,c′/n} is Θ(n) with high probability, as is therefore the stopping time T_v for some v ∈ V. Thus, whp there is a giant component in a general RIG. We now show that this giant component is unique and that all other components have size O(log n). Consider the size of the giant component. From the representation (15)
for φ_{t−1}, consider the previously introduced random variables X_{t,w} = n log(1/(1 − p_w)) I(Γ_w ≤ t). Similarly to the proof of Theorem 1, it follows that under the conditions of the theorem there is a positive constant δ > 0 such that Σ_w X_{t,w} is concentrated within (1 ± δ) Σ_w E[X_{t,w}] = (1 ± δ)ct, with probability 1 − o(1). Hence, there exists p^+ = c^+/n, for some constant c^+ > c > 1, such that 1 − φ_{t−1} ≤ 1 − (1 − p^+)^t, which is equivalent to −log φ_{t−1} ≤ −t log(1 − p^+) = tp^+ + o(tp^+) = tc^+/n + o(t/n). Similarly, the concentration of φ_{t−1} implies that there exists p^− = c^−/n, with c > c^− > 1, such that 1 − (1 − p^−)^t ≤ 1 − (1 − (1 − φ_{t−1})/t)^t, which implies that −log φ_{t−1} ≥ −t log(1 − p^−) = tp^− + o(tp^−) = tc^−/n + o(t/n). Combining the upper and lower bounds, we conclude that, with probability 1 − o(1), the rate of the branching process on the RIG is bracketed by
1 − (1 − p^−)^t ≤ 1 − ∏_{τ=0}^{t−1}(1 − r_τ) ≤ 1 − (1 − p^+)^t.    (29)
The stochastic dominance of the binomial distribution, together with (29), implies

P[Bin(n − 1, 1 − (1 − p^−)^t) ≥ t] ≤ P[Bin(n − 1, 1 − ∏_{τ=0}^{t−1}(1 − r_τ)) ≥ t] ≤ P[Bin(n − 1, 1 − (1 − p^+)^t) ≥ t].    (30)

In light of (29), the branching process {Y_t}_{t≥0} associated to a RIG is stochastically bounded from below and above by the branching processes associated with G_{n,p^−} and G_{n,p^+}, respectively [2]. Since both c^−, c^+ > 1, there exist giant components in both G_{n,p^−} and G_{n,p^+}, whp. In [24], it has been shown that the giant component in G_{n,λ/n}, for λ > 1, is unique and of size ≈ nζ_λ, where ζ_λ is the unique solution in (0, 1) of the equation

1 − e^{−λζ} = ζ.    (31)
Moreover, the size of the giant component in G_{n,λ/n} satisfies the central limit theorem

( max_v |C(v)| − ζ_λ n ) / √n ∼ N( 0, ζ_λ(1 − ζ_λ)/(1 − λ + λζ_λ)^2 ).    (32)
From the definition of the stopping time (see (19)) and given (30) and (32), there is a giant component in a RIG of size at least nζ_λ(1 − o(1)), whp. Furthermore, the stopping times of the branching processes associated to G_{n,p^−} and G_{n,p^+} are approximately ζn, where ζ satisfies (31), with λ^− = np^− and λ^+ = np^+, respectively. These two stopping times are close to one another, which follows from analyzing the function F(ζ, c) = 1 − ζ − e^{−cζ}, where (ζ, c) is the solution of F(ζ, c) = 0 for given c. Since all partial derivatives of F(ζ, c) are continuous and bounded, the stopping times of the branching processes defined from G_{n,p^−} and G_{n,p^+} are “close” to the solution of (31) for λ = c. From (30), the stopping time of a RIG is bounded by the stopping times on G_{n,p^−} and G_{n,p^+}.
For the last part of the proof, the uniqueness of the giant component, we adapt the arguments in [2] to our setting. Let us assume that there are at least two giant components in a RIG, with sets of nodes V_1, V_2 ⊂ V. Let us create a new, independent “sprinkling” R̂IG on top of our RIG, with the same sets of nodes and attributes, but with attribute probabilities p̂_w = p_w^γ, for γ > 1 to be defined later. Now, our object of interest is RIG_new = RIG ∪ R̂IG. Let us consider all Θ(n^2) pairs {v_1, v_2}, where v_1 ∈ V_1 and v_2 ∈ V_2, which are independent in R̂IG (but not in RIG); hence the probability that two nodes v_1, v_2 ∈ V are connected in R̂IG is given by
1 − ∏_w (1 − p̂_w^2) = 1 − ∏_w (1 − p_w^{2γ}) = Σ_w p_w^{2γ} + o(Σ_w p_w^{2γ}),    (33)
since γ > 1 and p_w = O(1/n) for any w. Given that Σ_w p_w^2 = c/n, we choose γ > 1 so that Σ_w p_w^{2γ} = ω(1/n^2). Now, by the Markov inequality, whp there is a pair {v_1, v_2} such that v_1 is connected to v_2 in R̂IG, implying that V_1 and V_2 are connected, whp, forming one connected component within RIG_new. From the previous analysis, it follows that this component is of size at least 2nζ_λ(1 − δ) for any small constant δ > 0. On the other hand, the probabilities p_w^new in RIG_new satisfy

p_w^new = 1 − (1 − p_w)(1 − p̂_w) = p_w + p̂_w(1 − p_w) = p_w + p_w^γ(1 − p_w) = p_w(1 + o(1)),

again since γ > 1 and p_w = O(1/n) for any w. Thus,

Σ_{w∈W} (p_w^new)^2 = Σ_{w∈W} ( p_w^2 + Θ(p_w^{1+γ}(1 − p_w)) ) = Σ_{w∈W} p_w^2 (1 + o(1)) = c/n + o(1/n).    (34)

Given that the stopping time on RIG is bounded by the stopping times on G_{n,p^−} and G_{n,p^+}, and from its continuity, it follows that the giant component in RIG_new cannot be of size 2nζ_λ(1 − δ), which is a contradiction. Thus, there is only one giant component in RIG, of size given by nζ_c(1 + o(1)), where ζ_c satisfies (31) for λ = c. Moreover, knowing the behavior of G_{n,p}, from (30), it follows that all other components are of size O(log n).
6 Conclusion

The analysis of random models for bipartite graphs is important for the study of algorithms on networks formed by associating nodes with shared attributes. In the random intersection graph (RIG) model, nodes have certain attributes with fixed probabilities. In this paper, we have considered the general RIG model, where these probabilities are represented by a set of probabilities p = {p_w : w ∈ W}, where p_w denotes the probability that a node is attached to the attribute w. We have analyzed the evolution of components in general RIGs, giving conditions for the existence and uniqueness of the giant component. We have done so by generalizing the branching process argument used to study the birth of the giant component in Erdős–Rényi graphs. We have considered a dependent, inhomogeneous Galton–Watson process, where the number of offspring follows a binomial distribution with a different number of nodes and a different rate at each step of the evolution. The analysis of
such a process is complicated by the dependence on its history, dictated by the structure of general RIGs. We have shown that in spite of this difficulty, it is possible to give stochastic bounds on the branching process, and that under certain conditions the giant component appears at the threshold n Σ_{w∈W} p_w^2 = 1, with probability tending to one as the number of nodes tends to infinity.
Acknowledgments Part of this work was funded by the Department of Energy ASCR program, by the Air Force Office of Scientific Research MURI grant FA9550-10-1-0569, and by the Office of Naval Research grant N00014-10-1-0641. Nicolas W. Hengartner was supported by DOE-LDRD 20080391ER.
References

1. Albert, R., Barabási, A.L.: Statistical mechanics of complex networks. Rev. Mod. Phys. 74(1), 47–97 (2002)
2. Alon, N., Spencer, J.H.: The Probabilistic Method, 2nd edn. John Wiley & Sons, New York (2000)
3. Barabási, A.L., Albert, R.: Emergence of scaling in random networks. Science 286(5439), 509–512 (1999)
4. Behrisch, M.: Component evolution in random intersection graphs. Electr. J. Comb. 14 (2007)
5. Bernstein, S.N.: On a modification of Chebyshev's inequality and of the error formula of Laplace. Ann. Sci. Inst. Sav. Ukraine, Sect. Math. 4(25) (1924)
6. Bloznelis, M., Jaworski, J., Rybarczyk, K.: Component evolution in a secure wireless sensor network. Netw. 53(1), 19–26 (2009)
7. Chung, F., Lu, L.: The average distances in random graphs with given expected degrees. Proceedings of the National Academy of Sciences of the United States of America 99(25), 15879–15882 (2002)
8. Deijfen, M., Kets, W.: Random intersection graphs with tunable degree distribution and clustering. Probab. Eng. Inf. Sci. 23(4), 661–674 (2009)
9. Erdős, P., Goodman, A.W., Pósa, L.: The representation of a graph by set intersections. Canad. J. Math. 18, 106–112 (1966)
10. Godehardt, E., Jaworski, J., Rybarczyk, K.: Random intersection graphs and classification. In: Advances in Data Analysis, vol. 45, pp. 67–74 (2007)
11. Eubank, S., Guclu, H., Anil Kumar, V.S., Marathe, M.V., Srinivasan, A., Toroczkai, Z., Wang, N.: Modelling disease outbreaks in realistic urban social networks. Nature 429(6988), 180–184 (2004)
12. Fill, J.A., Scheinerman, E.R., Singer-Cohen, K.B.: Random intersection graphs when m = ω(n): An equivalence theorem relating the evolution of the G(n, m, p) and G(n, p) models. Random Struct. Algorithms 16(2), 156–176 (2000)
13. Godehardt, E., Jaworski, J.: Two models of random intersection graphs and their applications. Electronic Notes in Discrete Mathematics 10, 129–132 (2001)
14. Guillaume, J.-L., Latapy, M.: Bipartite graphs as models of complex networks. Physica A: Statistical and Theoretical Physics 371(2), 795–813 (2006)
15. Jaworski, J., Stark, D.: The vertex degree distribution of passive random intersection graph models. Comb. Probab. Comput. 17(4), 549–558 (2008)
16. Karoński, M., Scheinerman, E., Singer-Cohen, K.: On random intersection graphs: the subgraph problem. Combinatorics, Probability and Computing 8 (1999)
17. Lagerås, A.N., Lindholm, M.: A note on the component structure in random intersection graphs. Electronic Journal of Combinatorics 15(1) (2008)
18. Newman, M.E.J.: Scientific collaboration networks. I. Network construction and fundamental results. Phys. Rev. E 64(1), 016131 (2001)
19. Newman, M.E.J., Park, J.: Why social networks are different from other types of networks. Phys. Rev. E 68(3), 036122 (2003)
20. Newman, M.E.J., Strogatz, S.H., Watts, D.J.: Random graphs with arbitrary degree distributions and their applications. Phys. Rev. E 64(2), 026118 (2001)
21. Nikoletseas, S., Raptopoulos, C., Spirakis, P.: Large independent sets in general random intersection graphs. Theor. Comput. Sci. 406, 215–224 (2008)
22. Nikoletseas, S.E., Raptopoulos, C., Spirakis, P.G.: The existence and efficient construction of large independent sets in general random intersection graphs. In: Díaz, J., Karhumäki, J., Lepistö, A., Sannella, D. (eds.) ICALP 2004. LNCS, vol. 3142, pp. 1029–1040. Springer, Heidelberg (2004)
23. Nikoletseas, S.E., Raptopoulos, C., Spirakis, P.G.: Expander properties and the cover time of random intersection graphs. Theor. Comput. Sci. 410(50), 5261–5272 (2009)
24. van der Hofstad, R.: Random graphs and complex networks. Lecture notes in preparation, http://www.win.tue.nl/~rhofstad/NotesRGCN.html
25. Rybarczyk, K.: Equivalence of the random intersection graph and G(n, p) (2009) (submitted), http://arxiv.org/abs/0910.5311
26. Singer-Cohen, K.: Random intersection graphs. PhD thesis, Johns Hopkins University (1995)
27. Watts, D.J., Strogatz, S.H.: Collective dynamics of “small-world” networks. Nature 393(6684), 440–442 (1998)
Modeling Traffic on the Web Graph
Mark R. Meiss1,3, Bruno Gonçalves1,2,3, José J. Ramasco4, Alessandro Flammini1,2, and Filippo Menczer1,2,3,4
1 School of Informatics and Computing, Indiana University, Bloomington, USA
2 Center for Complex Networks and Systems Research, Indiana University, Bloomington, USA
3 Pervasive Technology Institute, Indiana University, Bloomington, USA
4 Complex Networks and Systems Lagrange Laboratory, ISI Foundation, Turin, Italy
Abstract. Analysis of aggregate and individual Web requests shows that PageRank is a poor predictor of traffic. We use empirical data to characterize properties of Web traffic not reproduced by Markovian models, including both aggregate statistics such as page and link traffic, and individual statistics such as entropy and session size. As no current model reconciles all of these observations, we present an agent-based model that explains them through realistic browsing behaviors: (1) revisiting bookmarked pages; (2) backtracking; and (3) seeking out novel pages of topical interest. The resulting model can reproduce the behaviors we observe in empirical data, especially heterogeneous session lengths, reconciling the narrowly focused browsing patterns of individual users with the extreme variance in aggregate traffic measurements. We can thereby identify a few salient features that are necessary and sufficient to interpret Web traffic data. Beyond the descriptive and explanatory power of our model, these results may lead to improvements in Web applications such as search and crawling.
1 Introduction

PageRank [6] has been a remarkably influential model of Web browsing, framing it as random surfing activity. The measurement of large volumes of Web traffic enables systematic testing of PageRank's underlying assumptions [22]. Traffic patterns aggregated across users reveal that some of its key assumptions—uniform random distributions for walk and teleportation—are widely violated, making PageRank a poor predictor of traffic, despite its standard interpretation as a stationary visit frequency. This raises the question of how to design a more accurate navigation model. We expand on our previous empirical analysis [22,20] by considering also individual traffic patterns [15]. Our results provide further evidence for the limits of Markovian traffic models such as PageRank and suggest the need for an agent-based model with features such as memory and topicality that can account for both individual and aggregate traffic patterns.

Models of user browsing have important practical applications. Traffic clearly has a direct impact on the financial success of companies and institutions. Indirectly, understanding traffic patterns aids in advertising, both to predict revenues and establish rates [12]. Second, realistic models of Web navigation can guide the behavior of crawling algorithms, improving search engines' coverage of important sites [9,24]. Finally, improved traffic models may lead to enhanced ranking algorithms [6,28,18].
After background material, we describe in § 3 a data set collected through a field study of over 1,000 users at Indiana University. In § 4 we introduce an agent-based navigation model, ABC, with three key, realistic ingredients: (1) bookmarks used as teleportation targets, defining boundaries between sessions and capturing the diversity of starting pages; (2) a back button is used to model branching observed in empirical traffic; and (3) topical interests drive an agent’s decision to continue browsing, leading to diverse session sizes. The model also considers the topical locality of the Web, so that interesting pages tend to link to other such pages. In § 5 we compare the traffic generated by our model with measurements of both aggregate and individual Web traffic data. These results allow us to identify features that are necessary and sufficient to interpret Web traffic data.
2 Background

Empirical studies of Web traffic have most often involved the analysis of server logs, with scales ranging from small samples of users from a few selected servers [17] to large groups from the logs of large organizations [15]. This approach has the advantage of distinguishing users by IP address (even if they may be anonymized), thus capturing individual traffic patterns [15]. However, the choice of target server will bias both the user sample and the part of the Web graph observed. Browser toolbars offer another source of traffic data; these gather information based on the activity of many users. While toolbars involve a larger population, their data are still biased toward users who opt to install such software. Moreover, their data are not generally available to researchers. Adar et al. [1] used this approach to study patterns of page revisitation. A related approach is to have a select panel of users install tracking software, which can eliminate many biases but incur experimental costs. Such an approach has been used to describe exploratory browsing [4]. These studies did not propose models to explain observed traffic patterns. Our own study measures traffic directly through packet capture, an approach adopted by Qiu et al. [27], who used captured HTTP packet traces from the UCLA CS department to study the influence of search engines on browsing. We use a larger sample of residential users, reducing the biases attendant to a workplace study of technical users.

We focus strongly on the analysis of browsing sessions. A common assumption is that long pauses correspond to breaks between sessions, leading many to rely on timeouts as a way of defining sessions. Flaws in this technique motivated our definition of time-independent logical sessions, based on the reconstruction of session trees rooted at pages requested without a referrer [20]. One goal of our model is to explain the broad distributions of size and depth for these logical sessions. The role of page content in driving users' browsing patterns has received relatively little attention, with the notable exception of a study of the correlation between changes in page content and revisit patterns [2].

Under a basic model of Web navigation, users perform a random walk through pages in the Web graph. PageRank [6] is a random walk modified by teleportation, which uses uniformly random starting points to model how users start new sessions. This Markovian process has no memory or backtracking and no notion of user interests or page content. The stationary distribution of visitation frequency generated by PageRank constitutes a prediction of actual click rates, which can then be compared with
empirical traffic data. We have shown that the assumptions underlying PageRank— uniform link selection, uniform teleportation sources and targets—are violated by actual user behavior, making it a poor predictor of actual traffic [22]. Our goal here is to present a more predictive model, using PageRank as a null model for evaluation. Other researchers have also introduced more realistic models to capture features of real browsing behavior, such as the back button and tabbed browsing [19,5,8]. There have also been attempts to model the interplay between user interests and page content; Huberman et al. proposed a model in which visited pages have interest values described by a random walk that continues as long as the current page has a value above a threshold [16]. Such an approach relates to algorithms for improved topical crawlers [24]. We previously proposed a model in which users maintain a list of bookmarks from which new sessions begin, providing memory of past navigation [3]. While it is able to reproduce empirical page and link traffic distributions, it fails to account for patterns exhibited by individual users, such as entropy and session characteristics. The ABC model builds upon this previous model; some initial results were reported in [21]. Here we extend this effort to encompass both individual and aggregate measures of Web traffic, offering a comprehensive comparison among ABC, empirical measurements and a baseline model. We also discuss the key role of the topology of the Web graph.
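To make this baseline concrete, the following sketch (our own illustration, not code from the study) simulates such a Markovian random surfer on a graph stored as a dict of out-link lists, with the teleportation probability p_t = 0.15 used in § 4; normalized visit counts approximate the PageRank-style stationary distribution that is compared against real traffic.

```python
import random

def random_surfer(graph, steps=100_000, pt=0.15, seed=0):
    """Baseline Markovian walker: teleport to a uniformly random page with
    probability pt (starting a new 'session'), else follow a uniformly
    random out-link. Visit counts approximate PageRank."""
    rng = random.Random(seed)
    nodes = list(graph)
    visits = dict.fromkeys(nodes, 0)
    here = rng.choice(nodes)
    for _ in range(steps):
        visits[here] += 1
        if rng.random() < pt or not graph[here]:
            here = rng.choice(nodes)          # uniform teleportation
        else:
            here = rng.choice(graph[here])    # uniform link selection
    return visits
```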
3 Empirical Traffic Data

We gathered our HTTP request data from an undergraduate dormitory at Indiana University under a methodology described in detail in our previous work [20]. The requests are gathered directly from a mirror of the building's network connection and reflect only unencrypted traffic. We use some basic heuristics to filter the data to include only requests made from browsers for actual page fetches, retaining a series of (user, referrer URL, target URL) triples. We also strip query parameters from the URLs, which affects roughly one-third of the requests. While this helps in the common case that parameters affect content within a static framework, it is less accurate when embedded CGI parameters select a page. Our analysis indicates that this effect is greatly mitigated by search-engine-friendly design principles. The resulting data set contains 29.5 million page requests that come from 967 distinct users. They visited 2.5 million unique URLs, of which 2.1 million appeared as targets and 0.86 million appeared as referrers.

We organize each user's clicks into tree-based logical sessions using an algorithm described in our previous work [20]. The basic notions are that new sessions start with empty-referrer requests; that each request represents a directed edge from a referring URL to a target URL; and that requests belong to the session in which their referring URL was most recently requested. These session trees mimic users' multitasking behavior by permitting several active sessions at once. The properties of these session trees, such as size and depth, are relatively insensitive to an additional timeout constraint [20]. We impose such a timeout as we form the sessions: a click cannot be associated with a session tree that has been dormant for thirty minutes. This process yields 11.1 million logical sessions in all, with a mean of over 11 thousand per user. The structure of these trees allows us to infer how users backtrack as they browse. Modern caching mechanisms mean that a browser generally does not issue a request for
a recently accessed page, preventing direct observation of multiple links pointing to the same page within a single session. While we have no direct way of detecting when the user presses the back button, session trees allow us to infer “backwards” traffic: if the next request in a tree comes from a URL other than the most recently visited, the user must have navigated to that page or opened it in a separate tab.

Any statistical description involves a compromise between summarizing the data and describing it accurately. For many human activities, including Web traffic, the data are not normally distributed, but rather fit heavy-tailed distributions best approximated by power laws [7,22]. The mean and median often describe the data poorly, as shown by a large and diverging variance and strong skew. The next best description is a histogram; we thus present these distributions in terms of their probability density functions rather than measures of central tendency. To characterize properties of traffic data and evaluate models of navigation, we focus on six quantities, several of which are discussed in preliminary work [20,21]:

Page traffic. The total number of visits to each page. Because of caching mechanisms, the majority of revisits to a page by a single user beyond the first visit within each session will not be represented in the data.
Link traffic. The number of times each hyperlink has been traversed by a user, as identified by the referrer and destination URLs in each request. We typically observe only the first visit to a destination page within a session.
Empty referrer traffic. The number of times each page initiates a session. We assume that a request without a referrer corresponds to using a bookmark, opening a link from another application, or entering an address manually.
Entropy. Shannon information entropy. For an individual user j, the entropy is defined as S_j = −Σ_i ρ_{ij} log_2 ρ_{ij}, where ρ_{ij} is the fraction of visits of user j to site i, aggregated across sessions.
Session size. The number of unique pages visited in a logical session tree.
Session depth. The maximum tree distance between the starting page of a session and any page within that session. (Recall that sessions have a tree-like structure because backtracking requests are usually served from the browser cache.)
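Two of these constructions are simple enough to sketch in code. The fragment below is our own illustration, not the pipeline used in the study: `build_sessions` implements the logical-session rules described above for one user's time-ordered clicks — empty-referrer requests open new trees, each click joins the tree in which its referrer was most recently seen, and a thirty-minute dormancy cutoff is applied — and `user_entropy` computes S_j.

```python
import math
from collections import Counter, namedtuple

Click = namedtuple("Click", "time referrer target")   # referrer is None if empty
TIMEOUT = 30 * 60                                     # thirty minutes, in seconds

def build_sessions(clicks):
    """Group one user's time-ordered clicks into logical session trees."""
    sessions = []          # each: {"root": url, "edges": [...], "last": time}
    last_seen_in = {}      # url -> index of the session where it was last requested
    for c in clicks:
        idx = last_seen_in.get(c.referrer)
        dormant = idx is not None and c.time - sessions[idx]["last"] > TIMEOUT
        if c.referrer is None or idx is None or dormant:
            sessions.append({"root": c.target, "edges": [], "last": c.time})
            idx = len(sessions) - 1
        else:
            sessions[idx]["edges"].append((c.referrer, c.target))
            sessions[idx]["last"] = c.time
        last_seen_in[c.target] = idx
    return sessions

def user_entropy(visited_sites):
    """Shannon entropy S_j = -sum_i rho_ij * log2(rho_ij) of one user's visits."""
    counts = Counter(visited_sites)
    total = sum(counts.values())
    return -sum(k / total * math.log2(k / total) for k in counts.values())
```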
4 ABC Model

We now introduce the models for interpreting the empirical data. As a baseline, we consider a PageRank-like reference model with teleportation probability p_t = 0.15. This value is standard in the literature and best approximates the empirical data. We simulate a population of random walkers equal in number to our users. Each walker browses for as many sessions as its corresponding real-world user. These sessions are terminated by the teleportation jumps, so the total number of pages visited by a walker differs from that of the corresponding user. Teleportation jumps lead to session-starting pages selected uniformly at random.

We call our own model ABC for its main ingredients: agents, bookmarks and clicks, as illustrated in Fig. 1. Agents possess some amount of energy, which represents their attention; it is spent by navigating and acquired by visiting interesting pages. Agents also have a back button and a bookmark list that is updated during navigation. Each agent starts at a random page with initial energy E_0. Then, for each time step:
Fig. 1. Schematic illustration of the ABC model
1. If E ≤ 0, the agent starts a new session.
2. Otherwise, if E > 0, the user continues the current session, following a link from the present node. There are two alternatives:
   (a) With probability p_b, the back button is used, leading back to the previous page. The agent's energy is decreased by a fixed cost c_b.
   (b) Otherwise, with probability 1 − p_b, a forward link is clicked. The agent's energy is updated to E − c_f + Δ, where c_f is a fixed cost and Δ is a stochastic value representing the new page's relevance to the user. The visitation is recorded in the bookmark list, which is kept ranked from most to least visited.

To initiate a new session, the bookmark with rank R is chosen with probability P(R) ∝ R^{−β}. This selection mechanism mimics the use of frequency ranking in various features of modern browsers, such as URL completion. The functional form is motivated by data on selection among a ranked list of search results [14].

The back button is our basic mechanism for producing branching behavior. The data indicate that the incoming and outgoing traffic of a site are seldom equal, with a ratio distributed over many orders of magnitude [22]. This violation of flow conservation cannot be explained by teleportation alone; the sessions of real users have many branches. Our prior results show an average node-to-depth ratio among session trees of almost two. These observations are consistent with the use of tabs and the back button, behavior confirmed by other studies [10,30].

The role of energy is critical. The duration of a real-world session depends on a user's individual goals and interests: visiting relevant pages leads to more clicks and longer sessions. ABC therefore incorporates agents with distinct interests and page topicality, relying on the intuition that an agent spends energy when navigating and gains it by discovering pages that match its interests. Moving forward costs more than using the back button. Known pages yield no energy, while novel pages increase energy by a random amount representing their relevance. Agents browse until they run out of energy, then start another session.
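The rank-based bookmark choice can be sketched directly; this is our own illustration, with the default β = 1.33 being the value fitted in § 5.

```python
import random

def pick_start_bookmark(bookmarks, beta=1.33):
    """Draw a session-starting page from a bookmark list sorted from most to
    least visited, with P(R) proportional to R**(-beta) for rank R = 1, 2, ..."""
    weights = [rank ** (-beta) for rank in range(1, len(bookmarks) + 1)]
    return random.choices(bookmarks, weights=weights, k=1)[0]
```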
Fig. 2. Representation of a few typical and representative session trees from the empirical data (top) and from the ABC model (bottom). Animations are available at cnets.indiana.edu/groups/nan/webtraffic.
The dynamic variable Δ reflects a page's relevance to an agent. If Δ values are independent, identically distributed random variables, the amount of stored energy will behave as a random walk. The session duration ℓ (number of clicks until E = 0) will then have a power-law tail P(ℓ) ∼ ℓ^{−3/2} [16]. However, empirical results suggest a larger exponent [20]. Moreover, studies show that content similarity between pages is correlated with their link distance, as is a page's relevance to a given topic [11,23]. Neighboring pages are topically similar, and the relevance of page t to a user is close to that of page r linking to t. To capture such topical locality, we correlate the Δ values of adjacent pages. We initially use Δ_0 = 1; then, when a page t is first visited in a given session, Δ_t is given by Δ_t = Δ_r(1 + ε), where r is the referrer page, ε is uniformly distributed in [−η, η], and η controls the degree of topical locality. A visited page can again become interesting in a later session and provide the agent with energy. However, it will yield different energy in different sessions, modeling drift in user interests.
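The dynamics described above can be condensed into a single-session sketch. The following is our own simplified illustration (the graph is a dict of out-link lists, bookmark bookkeeping is omitted, and the parameter defaults are the values chosen in § 5).

```python
import random

def abc_session(graph, start, E0=0.5, cf=1.0, cb=0.5, pb=0.5, eta=0.15):
    """Simulate one ABC browsing session; returns the sequence of visited pages."""
    energy = E0
    stack = [start]            # navigation history, for the back button
    delta = {start: 1.0}       # Delta_0 = 1 at the session's starting page
    visited = [start]
    while energy > 0:
        if random.random() < pb and len(stack) > 1:
            stack.pop()                      # back button (only if there is history)
            energy -= cb
        else:
            here = stack[-1]
            if not graph[here]:
                break                        # dead end: nothing to click forward
            nxt = random.choice(graph[here]) # uniformly chosen forward link
            if nxt not in delta:             # novel page: topical locality
                delta[nxt] = delta[here] * (1.0 + random.uniform(-eta, eta))
                energy += delta[nxt]         # relevance replenishes energy
            energy -= cf                     # every forward click costs cf
            stack.append(nxt)
            visited.append(nxt)
    return visited
```

The session size reported in the paper corresponds to the number of unique pages, e.g. `len(set(abc_session(graph, start)))` under this sketch.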
5 Model Evaluation

Our simulations take place on a scale-free network with N nodes and degree distribution P(k) ∼ k^{−γ}, generated according to the Molloy–Reed algorithm [25], which we call G1. This graph has N = 10^7 nodes, more than observed in the data, to ensure adequate room for navigation. We also set γ = 2.1 to match our data set. To prevent dangling links, we construct G1 with symmetric edges. We also ran simulations of ABC on a second graph (G2) derived from an independent, empirical data set obtained by extracting the largest strongly connected component from the Web traffic of the entire university population (about 100,000 people) [22]. G2 is thus an actual subset of the Web graph with no dangling links. Based on three weeks of traffic as measured in November 2009, G2 has N = 8.14 × 10^6 nodes and the same degree distribution, with γ ≈ 2.1. Within each session we simulate caching by recording traffic only when the target page is novel to the session. This lets us count the unique pages visited, which mirrors the empirical session size. These cached pages are reset between sessions.
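A G1-like substrate can be produced with standard tools. The sketch below is our own illustration and assumes the networkx library: it draws a degree sequence with P(k) ∼ k^{−2.1} and wires it with the Molloy–Reed configuration model, then collapses multi-edges and self-loops (which leaves the tail of the degree distribution essentially unchanged).

```python
import networkx as nx

def scale_free_substrate(n, gamma=2.1, seed=0):
    """Molloy-Reed (configuration model) graph with a power-law degree sequence."""
    seq = nx.utils.powerlaw_sequence(n, gamma, seed=seed)
    degrees = [max(1, round(d)) for d in seq]
    if sum(degrees) % 2:              # the configuration model needs an even sum
        degrees[0] += 1
    g = nx.configuration_model(degrees, seed=seed)
    g = nx.Graph(g)                   # collapse parallel edges
    g.remove_edges_from(nx.selfloop_edges(g))
    return g
```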
56
M.R. Meiss et al. a 10
10
Empirical PageRank ABC (G1) ABC (G2)
-3
Empirical PageRank ABC (G1) ABC (G2)
-3
-6
10
-9
10
-6
10
10
-12
10
0
10
P(ω)
P(T)
10
b
0
0
10
10
1
2
10
3
10
T
4
10
5
10
6
10
10
-9
-12 0
10
10
1
2
10
3
10
ω
10
4
5
10
6
10
Fig. 3. Distribution of (a) page traffic and (b) link traffic generated by ABC model versus data and baseline
from the data. The initial energy E_0, the forward and backward costs c_f and c_b, and the topical locality parameter η control session duration. We thus set E_0 = 0.5 arbitrarily and estimate the costs as follows. Empirically, the average session size is roughly two pages. The expected net energy loss per click is δE = p_b c_b + (1 − p_b)(c_f − ⟨Δ⟩), where ⟨Δ⟩ = 1 is the expected energy value of a new page. By setting c_f = 1 and c_b = 0.5 we obtain δE = 0.5 · 0.5 + 0.5 · (1 − 1) = 0.25, and hence an expected session size of 1 + (1 − p_b)E_0/δE = 2 (counting the initial page). In general, higher costs lead to shorter sessions and lower entropy. We explored the effects of η by simulation, settling on η = 0.15. Small values mean that all pages have similar relevance, making the session distributions too narrow. Large values erase topical locality, making the distributions too broad. Our results refer to this combination of parameters, with the numbers of users and sessions per user being drawn from the empirical data. We use the same parameters for both G1 and G2, without any further tuning to match the properties of these networks.

The ABC agents generate session trees similar to those in the empirical data, as shown in Fig. 2. For a quantitative evaluation, we compare ABC with the empirical distributions described in § 3 and the reference PageRank model as simulated on the artificial G1 network. We first consider the aggregate distributions, starting with traffic received by individual pages, as shown in Fig. 3(a). The empirical data show a broad power-law distribution for page traffic, P(T) ∼ T^{−α}, with exponent α ≈ 1.75, consistent with prior results for host-level traffic [22,20]. Theoretical arguments [26] suggest that PageRank should behave similarly. In general, a node will be visited if a neighbor has just been visited, making its traffic proportional to its degree in the absence of assortativity. This idea and prior results [22] lead us to expect PageRank's distribution of page traffic to fit a power law P(T) ∼ T^{−α} where α ≈ 2.1 matches the exponent of the in-degree [7,13], as shown in Fig. 3(a). In contrast, traffic from ABC is biased toward previously visited pages (bookmarks), yielding a broader distribution and matching empirical measurements.

In Fig. 3(b), we compare the distributions of traffic per link ω from the models with the empirical data, revealing a power law for P(ω) with exponent ≈ 1.9, agreeing with prior measurements of host-level traffic [22]. The comparison with PageRank illustrates the diversity of links with respect to their probability of being clicked. A rough argument
Fig. 4. Distribution of (a) traffic originating from jumps (page requests with empty referrer) and (b) user entropy generated by the ABC model versus data and baseline
can explain the reference model's poor performance at reproducing the data. Recall that, disregarding teleportation, page traffic is roughly proportional to in-degree. The traffic expected on a link would thus be proportional to the traffic of the source page and inversely proportional to its out-degree, assuming that links are chosen uniformly at random. In-degree and out-degree are equal in our simulated graphs, leading to link traffic that is independent of degree and nearly constant for all links, as shown by the decaying distribution for PageRank. For ABC, the stronger heterogeneity in the probability of visiting pages is reflected in a heterogeneous choice of links, resulting in a broad distribution better fitting the empirical data, as shown in Fig. 3(b).

Our empirical data in Fig. 4(a) show that pages are not equally likely to start a browsing session. Their popularity as starting points is roughly distributed as a power law with exponent of about −1.8 (consistent with results for host-level traffic [22]), implying diverging variance and mean as the number of sessions increases. While not unexpected, this demonstrates a serious flaw in the hypothesis of uniform teleportation. Because PageRank assumes uniform probability among starting pages, its failure to reproduce the empirical data is evident in Fig. 4(a). In contrast, ABC's bookmarking mechanism captures the non-uniform probability of starting pages, yielding a distribution similar to the empirical data, as shown in Fig. 4(a), supporting the idea that rank-based bookmark selection is a sound cognitive mechanism for initiating sessions.

When it comes to individual users, the simplest hypothesis is that the broad distributions for aggregate behavior reflect extreme variability within the traffic of each user, suggesting that there is no “typical” user as described by their overall traffic. To examine users' diversity of behavior, we adopt Shannon's information entropy as defined in § 3. Entropy measures the focus of a user's interests, offering a better description of a single user than, e.g., the number of distinct pages visited; two users who have visited the same number of pages can have very different measures of entropy. Given a number of visits N_v, the entropy is maximum (S = log_2 N_v) when N_v pages are visited once, and minimum (S = 0) when all visits are to a single page. The distribution of entropy across users is shown in Fig. 4(b). We observe that the PageRank model produces higher entropy than observed in the data: a PageRank walker picks starting pages with uniform probability, while a real user most often starts from
Fig. 5. Distribution of (a) session size (unique pages per session) and (b) session depth generated by the ABC model versus data and baseline
a previously visited page, leading them to revisit neighboring pages. The ABC model yields entropy distributions that are influenced by the underlying network but fit empirical entropy data better than PageRank, suggesting that bookmarks, the back button, and topicality help to explain the focused habits of real users.

Finally, we consider two distributions that describe logical sessions: size (number of unique pages) and depth (distance from the starting page), both of which affect entropy. Figs. 5(a) and (b) show that the empirical distributions are broad, spanning three orders of magnitude, with a large proportion of long sessions. The brief sessions seen for the PageRank model originate from its teleportation mechanism, which cannot capture broadly distributed session sizes. The jump probability p_t bounds the length ℓ (number of clicks) of a session, with a narrow, exponential distribution P(ℓ) ∼ (1 − p_t)^ℓ. These exponentially short sessions do not conflict with the high entropy of PageRank walkers (Fig. 4(b)), which arises from jumps to random targets rather than browsing itself. In contrast, user interest and topical locality in ABC yield broad distributions of both session size and depth, as seen in Fig. 5(a) and (b). Agents visiting relevant pages tend to keep browsing, and relevant pages lead to more relevant pages, creating longer and deeper sessions. We believe the diversity shown in aggregate measures of traffic is a consequence of this diversity of interests rather than the behavior of extremely eclectic users—as shown by the narrow distribution of entropy.

To study the dependence of ABC on network topology, we ran the model on additional artificial networks. We eliminated any limitation of network size by simulating an infinite graph that generates new nodes as agents navigate. In this limit case, an agent's energy level is a random walk, with session size obeying a power law with exponent −3/2 [16]. However, the constant availability of new content leads to too many large sessions, as shown in Fig. 6. We then considered a Barabási–Albert (BA) tree with the same node count as G1. The large number of leaf nodes affects the distribution of session size dramatically. Agents that begin a session in a leaf seldom backtrack sufficiently high up the tree to discover new nodes; they quickly run out of energy, yielding a narrow distribution (Fig. 6). If we lift the constraint that the network contain no cycles, agents can escape these cul-de-sacs. Using an Erdős–Rényi (ER) network broadens the
Fig. 6. Effect of network topology on session size. The curves correspond to simulations of the ABC model on different artificial networks (infinite network, BA tree, ER network, and the Molloy–Reed network G1). The resulting session size distributions are compared with the empirical one. The dashed line is a guide to the eye for P(N_S) ∼ N_S^{−3/2}.
distribution of session size (Fig. 6), bringing it closer to the empirical data while still underestimating the number of large sessions due to the lack of hubs. For comparison, Fig. 6 also shows the distribution obtained with G1, a network with cycles and broadly distributed degree. As already seen (Fig. 5(a)), this network gives excellent results, showing that both hubs and cycles are needed for the exploration of distant regions of the network. If either element is missing, agents can reach only limited content, leading to shortened sessions.
6 Conclusions

Previous studies have shown that Markovian processes such as PageRank cannot explain many patterns observed in real measurements of Web activity, especially the diversity of starting points, the global diversity of link traffic, and the heterogeneity of session sizes. Furthermore, individual behaviors are quite focused in spite of such diverse aggregate measurements. These observations call for a stateful, agent-based model that can help explain the empirical data through more realistic browsing behavior. We have proposed three key ingredients for such a model. First, agents maintain individual lists of bookmarks (a memory mechanism) for use as teleportation targets. Second, agents have a back button (a branching mechanism) that can also simulate tabbed browsing. Finally, agents have topical interests that are matched by page content, modulating the probability of an agent starting a new session and leading to heterogeneous session sizes.

We have shown that the resulting ABC model is capable of reproducing with remarkable accuracy the aggregate traffic patterns we observe in our empirical measurements. More importantly, our model offers the first account of a mechanism that can generate key properties of logical sessions. This allows us to argue that the diversity apparent in page, link, and bookmark traffic is a consequence of the diversity of individual interests rather than the behavior of very eclectic users. Our model is able to capture, for the first time, the extreme heterogeneity of aggregate traffic measurements while explaining the narrowly focused browsing patterns of individual users. While ABC is more complex than prior models, its greater predictive power suggests that bookmarks, tabbed browsing, and topicality are salient features of how we browse the Web. We believe that ABC may lead the way to more sophisticated, realistic, and hence more effective ranking and crawling algorithms.
The model does rely on several key parameters. While we have attempted to make reasonable and realistic choices for most of these parameters and explored the sensitivity of our model with respect to the rest, further work is needed to understand the combined effect of these parameters in a principled way. For example, we already know that parameters such as network size, costs, and topical locality play a key role in modulating the balance between individual diversity (entropy) and session size. In the future, we hope to analyze the model from a more theoretical perspective. Finally, while the ABC model is a clear step in the right direction, it shares some limitations of existing efforts, most notably the uniform choice among outgoing links from a page, which may cause the imperfect match between the individual entropy values of our agents and those of actual users. Acknowledgements. We thank CNetS and PTI at Indiana University and L. J. Camp of the IU School of Informatics and Computing, for support and infrastructure. We also thank IU’s network engineers for support in data collection. This work was supported in part by the I3P research program, managed by Dartmouth College and supported under Award 2003-TK-TX-0003 from the Science and Technology Directorate of the U.S. DHS. BG was supported in part by grant NIH-1R21DA024259 from the NIH. JJR is funded by the project 233847-Dynanets of the EUC. This material is based upon work supported by NSF award 0705676. This work was supported in part by a gift from Google. Opinions, findings, conclusions, recommendations or points of view in this document are those of the authors and do not necessarily represent the official position of the U.S. DHS, Science and Technology Directorate, I3P, NSF, IU, Google, or Dartmouth College.
References

1. Adar, E., Teevan, J., Dumais, S.: Large scale analysis of web revisitation patterns. In: Proc. CHI (2008)
2. Adar, E., Teevan, J., Dumais, S.: Resonance on the web: Web dynamics and revisitation patterns. In: Proc. CHI (2009)
3. Gonçalves, B., Meiss, M.R., Ramasco, J.J., Flammini, A., Menczer, F.: Remembering what we like: Toward an agent-based model of Web traffic. Late Breaking Results, WSDM (2009)
4. Beauvisage, T.: The dynamics of personal territories on the web. In: Proc. HT (2009)
5. Bouklit, M., Mathieu, F.: BackRank: an alternative for PageRank? In: Proc. WWW Special Interest Tracks and Posters, pp. 1122–1123 (2005)
6. Brin, S., Page, L.: The anatomy of a large-scale hypertextual Web search engine. Computer Networks 30(1-7), 107–117 (1998)
7. Broder, A., Kumar, S., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins, A., Wiener, J.: Graph structure in the Web. Computer Networks 33(1-6), 309–320 (2000)
8. Chierichetti, F., Kumar, R., Tomkins, A.: Stochastic models for tabbed browsing. In: Proc. WWW, pp. 241–250 (2010)
9. Cho, J., Garcia-Molina, H., Page, L.: Efficient crawling through URL ordering. Computer Networks 30(1-7), 161–172 (1998)
10. Cockburn, A., McKenzie, B.: What do web users do? An empirical analysis of web use. Int. J. of Human-Computer Studies 54(6), 903–922 (2001)
11. Davison, B.: Topical locality in the Web. In: Proc. 23rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 272–279 (2000)
12. Douglis, F.: What's your PageRank? IEEE Internet Computing 11(4), 3–4 (2007)
13. Fortunato, S., Boguñá, M., Flammini, A., Menczer, F.: Approximating PageRank from in-degree. In: Aiello, W., Broder, A., Janssen, J., Milios, E.E. (eds.) WAW 2006. LNCS, vol. 4936, pp. 59–71. Springer, Heidelberg (2008)
14. Fortunato, S., Flammini, A., Menczer, F., Vespignani, A.: Topical interests and the mitigation of search engine bias. Proc. Natl. Acad. Sci. USA 103(34), 12684–12689 (2006)
15. Gonçalves, B., Ramasco, J.J.: Human dynamics revealed through web analytics. Phys. Rev. E 78, 026123 (2008)
16. Huberman, B., Pirolli, P., Pitkow, J., Lukose, R.: Strong regularities in World Wide Web surfing. Science 280(5360), 95–97 (1998)
17. Catledge, L.D., Pitkow, J.E.: Characterizing browsing strategies in the World-Wide Web. Computer Networks and ISDN Systems 27, 1065–1073 (1995)
18. Liu, Y., Gao, B., Liu, T.Y., Zhang, Y., Ma, Z., He, S., Li, H.: BrowseRank: letting Web users vote for page importance. In: Proc. SIGIR, pp. 451–458 (2008)
19. Mathieu, F., Bouklit, M.: The effect of the back button in a random walk: application for PageRank. In: Proc. WWW Alternate Track Papers & Posters, pp. 370–371 (2004)
20. Meiss, M., Duncan, J., Gonçalves, B., Ramasco, J.J., Menczer, F.: What's in a session: tracking individual behavior on the Web. In: Proc. HT (2009)
21. Meiss, M., Gonçalves, B., Ramasco, J.J., Flammini, A., Menczer, F.: Agents, bookmarks and clicks: A topical model of Web navigation. In: Proc. HT (2010)
22. Meiss, M., Menczer, F., Fortunato, S., Flammini, A., Vespignani, A.: Ranking web sites with real user traffic. In: Proc. WSDM, pp. 65–75 (2008)
23. Menczer, F.: Mapping the semantics of web text and links. IEEE Internet Computing 9(3), 27–36 (2005)
24. Menczer, F., Pant, G., Srinivasan, P.: Topical web crawlers: Evaluating adaptive algorithms. ACM Transactions on Internet Technology 4(4), 378–419 (2004)
25. Molloy, M., Reed, B.: A critical point for random graphs with a given degree sequence. Random Structures and Algorithms 6(2-3), 161–180 (1995)
26. Noh, J.D., Rieger, H.: Random walks on complex networks. Phys. Rev. Lett. 92, 118701 (2004)
27. Qiu, F., Liu, Z., Cho, J.: Analysis of user web traffic with a focus on search activities. In: Proc. 8th International Workshop on the Web and Databases (WebDB), pp. 103–108 (2005)
28. Radlinski, F., Joachims, T.: Active exploration for learning rankings from clickthrough data. In: Proc. KDD (2007)
29. Fortunato, S., Flammini, A., Menczer, F.: Scale-free network growth by ranking. Phys. Rev. Lett. 96, 218701 (2006)
30. Tauscher, L., Greenberg, S.: How people revisit web pages: Empirical findings and implications for the design of history systems. Int. J. of Human-Computer Studies 47(1), 97–137 (1997)
Multiplicative Attribute Graph Model of Real-World Networks
Myunghwan Kim and Jure Leskovec
Stanford University, Stanford, CA 94305, USA
[email protected],
[email protected]
Abstract. Large scale real-world network data such as social and information networks are ubiquitous. The study of such networks seeks to find patterns and explain their emergence through tractable models. In most networks, and especially in social networks, nodes have a rich set of attributes associated with them. We present the Multiplicative Attribute Graphs (MAG) model, which naturally captures the interactions between the network structure and the node attributes. We consider a model where each node has a vector of categorical latent attributes associated with it. The probability of an edge between a pair of nodes depends on the product of individual attribute-attribute similarities. The model yields itself to mathematical analysis. We derive thresholds for the connectivity and the emergence of the giant connected component, and show that the model gives rise to networks with a constant diameter. We also show that MAG model can produce networks with either log-normal or power-law degree distributions. Keywords: social networks, network model, latent attribute node model.
1 Introduction

With the emergence of the Web, large online social computing applications have become ubiquitous, which in turn gave rise to a wide range of massive real-world social and information network data such as social networks, Web graphs, and so on. The unifying theme of studying real-world networks is to find patterns of connectivity and explain them through models. The main objective is to answer questions such as “What do real graphs look like?”, “How can we find models that explain the observed patterns?”, and “What are algorithmic consequences of the models?”. Research on empirical observations about the structure of networks and the models giving rise to such structures go hand in hand. The empirical analysis of large real-world networks aims to discover common structural properties, such as heavy-tailed degree distributions [5], local clustering of edges [19], small diameter [11], and so on. In parallel, there have been efforts to develop the network formation mechanisms that naturally generate networks with the observed structural features.

In these network formation mechanisms, there have been two dichotomous modeling approaches. Broadly speaking, the theoretical computer science and physics communities have mainly focused on relatively simple “mechanistic” but analytically tractable network models where
The full version of this paper appears in http://arxiv.org/abs/1009.3499.
connectivity patterns observed in the real world naturally emerge from the model. The prime example in this line of work is the Preferential Attachment model with its many variants [3,1,4], which specifies a simple but natural edge creation mechanism that in the limit leads to networks with power-law degree distributions. Other models of similar flavor include the Copying Model [8], the Small-world model [19], the Forest Fire model [11], and models of bipartite affiliation networks [9]. On the other hand, in statistics, machine learning and social network analysis, a different approach to modeling network data has emerged. There the effort is in the development of statistically sound models that consider the structure of the network as well as the features of nodes and edges in the network. Examples of such models include the Exponential Random Graphs [18], the Stochastic Block Model [2], and the Latent Space Model [6].

“Mechanistic” and “Statistical” models. Generally, there has been some gap between the above two lines of research. The “mechanistic” models are analytically tractable in the sense that one can mathematically analyze properties of the networks that arise from the models. These models emphasize the natural emergence of networks that have structural properties which are found in real-world networks. However, such models are usually not statistically interesting in the sense that they do not nicely lend themselves to model parameter estimation and are generally too simplistic to model heterogeneities between individual nodes. On the contrary, “statistical” models are generally analytically intractable, and the network properties do not naturally emerge from the model in general. However, these models are usually accompanied by statistical procedures for model parameter estimation and are very useful for testing various hypotheses about the interaction of connectivity patterns and the properties of nodes and edges.

Although models of network structure and formation are seldom both analytically tractable and statistically interesting, an example of a model satisfying both features is the Kronecker graphs model [10], which is based on the recursive tensor product of small graph adjacency matrices. Kronecker graphs are analytically tractable in the sense that one can analyze global structural properties of networks that emerge from the model [14,10]. In addition, this model is statistically meaningful because there exists an efficient parameter estimation technique based on maximum likelihood [10]. It has been empirically shown that with only four parameters Kronecker graphs accurately model the global structural properties of real-world networks such as degree distributions, edge clustering, diameter and spectral properties of the graph adjacency matrices.

Modeling networks with rich node attribute information. Network models investigate edge creation mechanisms, but generally a rich set of attributes is associated with each node. This is especially true in social networks, where not only people's connections but also their profiles like age and gender have been collected. In this sense, both node characteristics and the network structure need to be considered simultaneously. The attempt to model the interaction between the network structure and node attributes raises a wide range of questions. For instance, how do we account for the heterogeneity in the population of the nodes, or how do we combine node features in an interesting way to obtain probabilities of individual links?
While the earlier work on a general class of latent space models formulated such questions, most resulting
64
M. Kim and J. Leskovec
Fig. 1. Schematic representation of the Multiplicative Attribute Graphs (MAG) model. Given a pair of nodes u and v with the corresponding binary attribute vectors a(u) and a(v), the probability of edge P [u, v] is the product over the entries of attribute-attribute similarity matrices Θi where values of ai (u) and ai (v) “select” the appropriate entries (row/column) of Θi .
models were either analytically tractable but statistically uninteresting or statistically very powerful but do not lend themselves to mathematical analysis. To bridge this gap, we propose a class of stochastic network models that we refer to as Multiplicative Attribute Graphs (MAG). The model naturally captures the interactions between the network structure and the node attributes in a clean and tractable manner. We consider a model where each node has a vector of categorical attributes associated with it. Individual attributes of nodes are then combined to model the emergence of links. The model allows for rich interaction between node features in a sense that one can simultaneously model features that reflect homophily (i.e., love of the same) as well as those reflecting heterophily (i.e., love of the different). For example, if people share certain features like hobby, they are more likely to be friends. However, for some other features like gender, people may tend to form a relationship with someone with the opposite characteristic. The proposed MAG model is designed to capture the homophily as well as the heterophily that both naturally occur in social networks. We proceed by formulating the model and show that it is both statistically interesting and analytically tractable. In the following sections, we present our mathematical results. First, we examine the number of edges and show that our model naturally obeys the Densification Power Law [11]. Second, we examine the connectivity of MAG model, which includes the conditions not only when the network contains a giant connected component but also when it becomes connected. Third, we show that the diameter of the MAG model remains small even though the number of nodes is large. Fourth, we show that networks emerging from the MAG model have log-normal degree distributions. Furthermore, we describe a more general version of the model that can also capture the power-law degree distributions. We view this as particularly interesting in the light of a long-standing debate about how to distinguish power-law distributions from log-normal in empirical data [16] and what implications this would make for realworld networks. Also, our results imply that the MAG model is flexible in a sense that networks with different properties emerge depending on the parameter configuration.
2 Formulation of the Multiplicative Attribute Graph Model General considerations. We consider a setting where each node u has a vector a(u) of l categorical attributes associated with it. For example, one can regard the attribute vectors as a sequence of answers to l yes/no questions such as “Are you female?”.
Multiplicative Attribute Graph Model of Real-World Networks
65
The other essential ingredient of our model is to specify a mechanism that generates the probability of an edge between two nodes based on their attribute vectors. As mentioned before, we want our model to account for the homophily of certain features as well as the heterophily of the others. More precisely, we associate each attribute i (i.e., ith question) with an attribute-attribute similarity matrix Θi . For the binary example in Fig. 1, each Θi is a 2 × 2 matrix. The entries of matrix Θi represent the probability of an edge given the values of the i-th attribute of both nodes, i.e., attribute values act as “selectors” of an appropriate cell of Θi . Thus, if the attribute reflects homophily, the corresponding matrix Θi would have large values on the diagonal (i.e., the edge probability is high when the nodes’ answers match), whereas if the attribute represents heterophily the off-diagonal values of Θi would be high (i.e., the edge probability is high when nodes’ answers do not match). The top of Fig. 1 illustrates the concept of node attributes acting as selectors of entries of matrices Θi . The Multiplicative Attributes Graph (MAG) model. Now we formulate a general version of the MAG model. To start with, let each node u have a vector of l categorical attributes and let each attribute have cardinality di for i = 1, 2, · · · l. We also have l matrices, Θi ∈ di × di for i = 1, 2, · · · l. Each entry of Θi is a probability, i.e., a real value between 0 and 11 . Then, the probability of an edge (u, v), P [u, v], is defined as the multiplication of probabilities corresponding to individual attributes, i.e., P [u, v] =
l
Θi [ai (u), ai (v)]
(1)
i=1
where ai (u) denotes the value of i-th attribute of node u. Note that edges appear independently with probability determined by node attributes and matrices Θi . Figure 1 illustrates the model. One can interpret the MAG model in the following sense. In order to construct a social network, we ask each node u a series of multiple-choice questions and the attribute vector a(u) stores the answers to these questions. Θi then reflects the marginal edge probability over the i-th answers of for a pair of nodes. That is, the answers of nodes u and v on a question i select an entry of matrix Θi , i.e., u’s answer selects a row and v’s answer selects a column. One can thus think of matrices Θi ’s as the attribute-attribute similarity matrices. Assuming that the questions are appropriately chosen so that answers are independent of each other, the product over the entries of matrices Θi results in the probability of the edge between u and v. The choice of multiplicatively combining entries of Θi is very natural. In particular, the social network literature defines a concept of Blau-spaces [15] where sociodemographic attributes act as dimensions. Organizing force in Blau space is homophily as it has been argued that the flow of information between a pair of nodes decreases with the “distance” in the corresponding Blau space. In this way, small pockets of nodes appear and lead to the development of social niches for human activity and social organization. In this respect, multiplication is a natural way to combine node attribute data 1
Note that there is no condition for Θi to be stochastic, we only require each entry of Θi to be on interval (0, 1).
66
M. Kim and J. Leskovec
(i.e., the dimensions of the Blau space) so that even a single attribute can have profound impact on the linking structure (i.e., it creates a narrow social niche communitiy). The proposed MAG model is also analytically tractable in a sense that we can formally analyze the properties of the model. Moreover, the MAG model is also statistically interesting as it can account for the heterogeneities in the node population and can be used to study the interaction between properties of nodes and their linking behavior. Moreover, one can pose many interesting statistical inference questions: Given attribute vectors of all nodes and the network structure, how can we estimate the values of matrices Θi ? Or, given a network, how can we estimate both the node attributes and the matrices Θi ? The focus of this paper is in mathematical analysis and we leave the questions of parameter estimation for the future work. Simplified version of the model. Next we delineate a simplified version of the model that we will mathematically analyze in the further sections of the paper. First, while the general MAG model applies to directed networks, we consider undirected version of the model by requiring each Θi to be symmetric. Second, we assume binary node attributes and thus matrices Θi have 2 rows and 2 columns. Third, to further reduce the number of parameters, we also assume that the similarity matrices for all attributes are the same, i.e., Θi = Θ for all i. These three conditions imply that Θ[1, 1] = α, Θ[1, 0] = Θ[0, 1] = β, and Θ[0, 0] = γ for 0 ≤ α, β, γ ≤ 1. Furthermore, all our results will hold for α > β > γ. As we show later, the assumption α > β > γ is natural since most large real-world networks have a common structure [10]. Last, we also assume a simple generative model of node attributes. We use i.i.d. Bernoulli distribution parameterized by μ to model the attribute vector of each node u, i.e., P (ai (u) = 1) = μ for i = 1, 2, · · · , l and 0 < μ < 1. Putting it all together, the MAG model M (n, l, μ, Θ) is fully specified by six parameters: n is the number of nodes, l is the number of attributes of each node, μ is the probability that an attribute takes a value of 1, and Θ = [α β; β γ] specifies the attribute-attribute similarity matrix. We now study the properties of the random graphs that result from the M (n, l, μ, Θ) where every unordered pair of nodes (u, v) is independently connected with probability P [u, v] defined in (1). Since the probability of an edge exponentially decreases in l, the most interesting case occurs when l = ρ log n for some constant ρ.2 Connections to other models. We note that our model belongs to a general class of latent space network models, where nodes have some discrete or continuous valued attributes and the probability of linking depends on the values of attribute of the two nodes. For example, the Latent Space Model [6] assumes that nodes reside in d-dimensional Euclidean space and the probability of an edge depends on the Euclidean distance between the locations of the nodes. Similarly, in Random Dot Product Graphs [20], the linking probability depends on the inner product between the vectors associated with node positions. Furthermore, recently introduced Multifractal Network Generator [17] can also be viewed as a special case of MAG model where the node attribute distribution and the similarity matrix are equal for every attribute. 2
Throughout the paper, log(·) indicates log2 (·) unless explicitly specified as ln(·).
Multiplicative Attribute Graph Model of Real-World Networks
67
The Kronecker graphs model [10] takes a small (usually 2 × 2) initiator matrix K and tensor-powers it l times to obtain a large graph adjaceny matrix G. The MAG model generalizes this Kronecker graphs model in the following sense. Proposition 1. A Kronecker graph G on 2l nodes with a 2 × 2 initiator matrix K is equivalent to the following MAG graph M : Let us number the nodes of M as 0, · · · , 2l − 1. Let the binary attribute vector of a node u of M be a binary representation of its node id, and let Θi = K. Then individual edge probabilities (u, v) of nodes in G match those in M , i.e., PG [u, v] = PM [u, v]. The above observation is interesting for several reasons. First, all results obtained for Kronecker graphs naturally apply to a subclass of MAG graphs where the node’s attribute values are the binary representation of its id. This means that in a Kronecker graph version of the MAG model each node has a unique combination of attribute values (i.e., each node has different node id) and all attribute value combinations are occupied (i.e., node ids range 0, . . . , 2l − 1). Second, building on this correspondence between Kronecker and MAG graphs, we also note that the estimates of the Kronecker parameter matrix K nicely transfer to matrix Θ of MAG model. For example, Kronecker parameter matrix K = [α = 0.98, β = 0.58, γ = 0.05] accurately models the graph of the internet connectivity [10]. Thus, in the rest of the paper, we will consider the above values of K as the typical values that the matrix Θ would normally take. In this respect, the assumption that α > β > γ appears as very natural. Furthermore, the fact that most large real-world networks satisfy α > β > γ tells us that such networks have an onion-like “core-periphery” structure [12,10]. In other words, the network is composed from denser and denser layers of edges as one moves towards the core of the network. Basically, α > β > γ indicates that more edges are likely to appear between nodes which share 1’s on more attributes and these nodes form the core of the network. Since more edges appear between pairs of nodes with attribute combination “1–0” than between those with “0–0”, there are more edges between the core and the periphery nodes (edges “1–0”) than between the nodes of the periphery themselves (edges “0–0”). In following sections, we analyze the properties of the MAG model. We focus mostly on the simplified version. Each section states the main theorem and the overview of its proof. We omit the proofs and describe them in the full paper [7].
3 The Number of Edges In this section, we derive the expression for the expected number of edges in MAG model. Moreover, this formula can valdiate not only the assumption, l = ρ log n, but also a substantial social network property, namely the Densification Power Law. Theorem 1. For a MAG graph M (n, l, μ, Θ), the number of edges, m, satisfies E [m] =
l n(n − 1) 2 l μ α + 2μ(1 − μ)β + (1 − μ)2 γ + n (μα + (1 − μ)γ) . 2
The expression is divided into two diffrent terms. The first term indicates the number of edges between distinct nodes, whereas the second term means the number of self-edges.
68
M. Kim and J. Leskovec
If we exclude self-edges, the number of edges would be therefore reduced to the first term. Before the actual analysis, we define some useful notations that will be used throughout this paper. First, let V be the set of nodes in M (n, l, μ, Θ). We refer to the weight of a node u as the number of 1’s in its attribute vectors, and denote it as |u| , i.e.,|u| = l i=1 1 {ai (u) = 1} where 1 {·} is an indicator function. We additionally define Wj as a set which consists of all nodes with the same weight j, i.e., Wj = {u ∈ V : |u| = j} for j = 0, 1, · · · , l. Similarly, Sj denotes the set of nodes with weight which is greater than or equal to j, i.e., Sj = {u ∈ V : |u| ≥ j}. By definition, Sj = ∪li=j Wi . To complete the proof of Theorem 1, using the definition of the simplified MAG model, we can derive the main lemma as follows: Lemma 1. For u ∈ V , E [deg(u)|u ∈ Wi ] is equal to i
(n − 1) (μα + (1 − μ)β) (μβ + (1 − μ)γ)
l−i
+ 2αi γ l−i .
By using this lemma, the outline of the proof for Theorem 1 is as follows. Since the number of edges is half of the degree sum, all we need to do is to sum E [deg(u)] over the degree distribution. However, because E [deg(u)] = E [deg(v)] if the weights of u and v are the same, we can add up E [deg(u)|u ∈ Wi ] over the weight distribution, i.e., binomial distribution Bin(l, μ). On the other hand, more significantly, Theorem 1 can result in two substantial features of MAG model. First, the assumption that l = ρ log n for a constant ρ can be validated by the next two corollaries. Corollary 1. m ∈ o(n) w.h.p.3 as n → ∞, if
l log n
1 > − log(μ2 α+2μ(1−μ)β+(1−μ) 2 γ) .
Corollary 2. m ∈ Θ(n2−o(1) ) w.h.p. as n → ∞, if l ∈ o(log n). Note that log μ2 α + 2μ(1 − μ)β + (1 − μ)2 γ < 0 because both μ and γ are less than 1. Thus, in order for M (n, l, μ, Θ) to have a proper number of edges (e.g., more than n), l should be bounded by the order of log n. On the contrary, since most social networks are sparse, l ∈ o(log n) case can be also reasonably excluded. In consequence, both Corollary 1 and Corollary 2 provide the upper and lower bounds of l for social networks. These bounds eventually support the assumption of l = ρ log n. Although we do not technically define any process of MAG graph evolution, we can interpret it in the folllowing way. When a new node joins the network, its behavior is governed by the node attribute distribution which is seemingly independent of the network structure. However, in a long term, since the number of attributes grows slowly as the number of nodes increases, the node attributes and the network structure are not independent. This phenomenon is somewhat aligned with the real world. When a new person enters the network, he or she seems to act independently of other people, but people eventually constitue a structured network in the large scale and their behaviors can be categorized into more classes as the network evolves. Second, under this assumption, the expected number of edges can be approximately 2 2 restated as 12 n2+ρ log(μ α+2μ(1−μ)β+(1−μ) γ ) . We can easily figure out that this fact 3
With high probability. It indicates the probability 1 − o(1).
Multiplicative Attribute Graph Model of Real-World Networks
69
agrees with the Densification Power Law [11], one of the properties of social networks, which indicates m(t) ∝ n(t)a for a > 1. For example, an instance of MAG model with ρ = 1, μ = 0.5 (Proposition 1), would have the densification exponent a = log(|Θ|) where |Θ| denotes the sum of all entries in Θ. The full proofs of all are described in the full paper [7].
4 Connectivity In the previous section, we observed that MAG model obeys the Densification Power Law. In this section, we mathematically investigate MAG model for another general property of social networks, the existence of a giant connected component. Furthermore, we also examine the situation where this giant component covers the entire network, i.e., the network is connected. We begin with the theorems that MAG graph has a giant component and further becomes connected. Theorem 2 (Giant Component). Only one connected component of size Θ(n) exists in M (n, l, μ, Θ) w.h.p. as n → ∞ , if and only if ρ 1 μ 1−μ (μα + (1 − μ)β) (μβ + (1 − μ)γ) ≥ . 2 Theorem 3 (Connectedness). Let the connectedness criterion function of M (n, l, μ, Θ) be when (1 − μ)ρ ≥ 12 (μβ + (1 − μ)γ)ρ ρ Fc (M ) = ν 1−ν (μα + (1 − μ)β) (μβ + (1 − μ)γ) otherwise where 0 < ν < μ is a solution of
μ ν 1−μ 1−ν ρ ν
1−ν
= 12 .
Then, M (n, l, μ, Θ) is connected w.h.p. as n → ∞, if Fc (M ) > M (n, l, μ, Θ) is disconnected w.h.p. as n → ∞ if Fc (M ) < 12 .
1 2.
In contrast,
To show the above theorems, we require the monotonicity property of MAG model. Theorem 4 (Monotonicity). For u, v ∈ V , P [u, v||u| = i] ≤ P [u, v||u| = j] if i ≤ j. Theorem 4 ultimately demonstrates that a node of larger weight is more likely to be connected with other nodes. In other words, a node of large weight plays a ”core” role in the network, whereas the node of small weight is regarded as ”periphery”. This feature of the MAG model has direct effects on the connectivity as well as on the existence of a giant component. By the monotonicty property, the minimum degree is likely to be the degree of the minimum weight node. Therefore, the disconnectedness is proved by showing that the expected degree of the minimum weight node is too small to be connected with any other node. Conversely, if this lowest degree is large enough, then any subset of nodes would be connected with the other part of the graph. Thus, to show the connectedness, the degree of the minimum weight node should be necessarily inspected, using Lemma 1.
70
M. Kim and J. Leskovec
Note that the criterion in Theorem 3 is separated into two cases depending on μ, which tells whether or not the expected number of weight 0 nodes, E [|W0 |], is greater than 1, because |Wj | is a binomial random variable. If this expectation is larger than 1, then the minimum weight is O(1) with high probability. Otherwise, if E [|W0 |] < 1, the equation of ν describes the ratio of the minimum weight to l as n → ∞. Therefore, the condition for connectedness actually depends on the minimum weight node. In fact, the proof of Theorem 3 is accomplished by computing the expected degree of this minimum weight node and by using some techniques introduced in [14]. Similar explanation works for the existence of a giant component. Instead of the minimum weight node, Theorem 2 shows that the existence of Θ(n) component relies on the degree of the median weight node. We intuitively understand this in the following way. We might throw away the lower half of nodes by degree. If the degree of the median weight node is large enough, then the half of the network is likely to be connected. The connectedness of this half network implies the existence of Θ(n) component, the size of which is at least n2 . In the proof, we actually examine the degrees of nodes of three different weights: μl, μl + l1/6 , and μl + l2/3 . The existence of Θ(n) component is determined by the degrees of these nodes. However, the existence of Θ(n) component does not necessarily indicate that it is a giant component, since there might be another Θ(n) component. Therefore, to prove Theorem 2 more strictly, the uniqueness of Θ(n) component has to follow the existence of it. We can prove the uniqueness by showing that if there are two connected subgraphs of size Θ(n) then they are connected each other with high probability. The proofs of those three theorems are in the full paper [7].
5 Diameter Another property of social networks is that the diameter of the network remains small although the number of nodes grows large. We can show this property in MAG model by applying the similar idea as in [14]. ρ
Theorem 5. If (μβ + (1 − μ)γ) > w.h.p. as n → ∞.
1 2,
then M (n, l, μ, Θ) has a constant diameter
This theorem does not specify the exact diameter, but, under the given condition, it guarantees the bounded diameter even though n → ∞ by using the following lemmas: Lemma 2. If (μβ + (1 − μ)γ)ρ > 12 , for λ = w.h.p. as n → ∞. ρ
μβ μβ+(1−μ)γ ,
Lemma 3. If (μβ + (1 − μ)γ) > 12 , for λ = directly connected to Sλl w.h.p. as n → ∞.
Sλl has a constant diameter
μβ μβ+(1−μ)γ ,
all nodes in V \Sλl are
By Lemma 3, we can conclude that the diameter of the entire graph is limited to (2+ diameter of Sλl ). Since by Lemma 2 the diameter of Sλl is constant with high probability under the given condition, the actual diameter is also constant. The proofs are represented in the full paper [7].
Multiplicative Attribute Graph Model of Real-World Networks
71
6 Degree Distribution In this section, we analyze the degree distribution of the simplified MAG model under some reasonable assumptions.4 Depending on Θ, MAG model produces graphs of various degree distributions. For instance, since the MAG model becomes a sparse Erd¨os-R´enyi random graph if α ≈ β ≈ γ < 1, the degree distribution will approximately follow the binomial distribution. For another extreme example, in case of α ≈ 1 and μ ≈ 1, the network will be close to a complete graph, which represents a degree distribution different from a sparse Erd¨os-R´enyi random graph. For this reason, we need to narrow down the conditions on μ and Θ as follows. If μ is close to 0 or 1, then the graph becomes an Erd¨os-R´enyi random graph with edge probability p = α (when μ ≈ 1) or γ (when μ ≈ 0). Since the degree distribution of Erd¨os-R´enyi random graph is binomial, we will exclude these extreme cases of μ. On the other hand, with regard to Θ, we assume that a reasonable configuration space for Θ would be where μα+(1−μ)β μβ+(1−μ)γ is between 1.6 and 3. For the previous Kronecker graph example, this ratio is actually about 2.44. Our approach for the condition on Θ can be also supported by real
examples in [10]. This condition is crucial for us, since in the analysis we use that μα+(1−μ)β μβ+(1−μ)γ
x
grows faster than the polynomial function of x. If μα+(1−μ)β μβ+(1−μ)γ is close to 1, we cannot make use of this fact. Assuming all these conditions on μ and Θ, we result in the following theorem. Theorem 6. In M (n, l, μ, Θ)that follows above assumptions, if
μ
1−μ
(μα + (1 − μ)β) (μβ + (1 − μ)γ)
ρ
>
1 , 2
then the tail of degree distribution, pk , follows a log-normal, specifically,
lμ(1 − μ)(ln R)2 N ln n(μβ + (1 − μ)γ)l + lμ ln R + , lμ(1 − μ)(ln R)2 , 2 for R =
μα+(1−μ)β μβ+(1−μ)γ
as n → ∞.
In other words, the degree distribution of MAG model approximately follows a quadratic relationship on log-log scale. This result is nice since some social networks follow the log-normal distribution. For instance, the degree distribution of LiveJournal network looks more parabolic than linear on log-log scale [13]. In brief, since the expected degree is an exponential function of the node weight by Lemma 1, the degree distribution is mainly affected by the distribution of node weights. Since the node weight follows binomial distribution, it can be approximated by a normal distribution for sufficiently large l. Because the logarithmic value of the expected degree is linear in the node weight and this weight follows a binomial distribution, the log value of degree approximately follows a normal distribution for large l. This eventually indicates that the degree distribution roughly follows a log-normal. 4
We trivially exclude self-edges not only because computations become simple but also because other models usually do not include them.
72
M. Kim and J. Leskovec
ρ Note that there exists a condition, (μα + (1 − μ)β)μ (μβ + (1 − μ)γ)1−μ > 12 , which is related to the existence of a giant component. First, this condition is perfectly acceptable because real-world networks have a giant component. Second, as we described in Sec. 4, this condition ensures that the median degree is large enough. Equivalently, it also indicates that the degrees of a half of the nodes are large enough. If we refer to the tail of degree distribution as the degrees of nodes with degrees above the median degree, then we can show Theorem 6. The full proofs for this analysis are described in the full paper [7].
7 Extensions: Power-Law Degree Distribution So far we have handled the simplified version of MAG model parameterized by only few variables. Even with these few parameters, many well-known properties of social networks can be reproduced. However, regarding to the degree distribution, even though the log-normal is one of the distributions that social networks commonly follow, many social networks also follow the power-law degree distribution [5]. In this section, we show that the MAG model produces networks with the power-law degree distribution by releasing some constraints. We do not attempt to analyze it in a rigorous manner, but give the intuition by suggesting an example of configuration. We still hold the condition that every attribute is binary and independently sampled from Bernoulli distribution. However, in contrast to the simplified version, we allow each attribute to have a different Bernoulli parameter as well as a different attribute-attribute similarity matrix associated wit it. The formal definition of this model is as follows: l Θj [aj (u), aj (v)] . P (aj (u) = 1) = μj , P [u, v] = j=1
The number of parameters here is 4l, which consist of μj ’s and Θj ’s for j = 1, 2, · · · , l. For convenience, we denote this power-law version of MAG model as M (n, l, μ, Θ) where μ = {μ1 , · · · , μl } and Θ = {Θ1 , · · · , Θl }. With these additional parameters, we are able to obtain the power law degree distribution as the following theorem describes. −δ
μj μj αj +(1−μj )βj Theorem 7. For M (n, l, μ, Θ), if 1−μ = for δ > 0, then the μj βj +(1−μj )γj j 1
degree distribution satisfies pk ∝ k −δ− 2 as n → ∞. In order to investigate the degree distribution of this model, the following two lemmas are essential. Lemma 4. The probability that node u in M (n, l, μ, Θ) has an attribute vector a(u) is l
(μi )1{ai (u)=1} (1 − μi )1{ai (u)=0} .
i=1
Lemma 5. The expected degree of node u in M (n, l, μ, Θ) is l 1{a (u)=1} 1{a (u)=0} (n − 1) (μi αi + (1 − μi )βi ) i (μi βi + (1 − μi ) γi ) i . i=1
Multiplicative Attribute Graph Model of Real-World Networks
73
By Lemmas 4 and 5, if the condition in Theorem 7 holds, the probability that a node has the same attribute vector as node u is proportional to (−δ)-th power of the expected degree of u. In addition, (− 21 )-th power comes from the Stirling approximation for large k. This roughly explains Theorem 7. We provide the full proof and simulation experiments in the full paper [7].
Acknowledgments We thank to Daniel McFarland for discussion and comments. Myunghwan Kim was supported by the Kwanjeong Educational Foundation fellowship. The research was supported in part by NSF grants CNS-1010921, IIS-1016909, LLNL grant B590105, Albert Yu & Mary Bechmann Foundation, IBM, Lightspeed, Microsoft and Yahoo.
References 1. Aiello, W., Chung, F., Lu, L.: A random graph model for massive graphs. In: STOC (2000) 2. Airoldi, E.M., Blei, D.M., Fienberg, S.E., Xing, E.P.: Mixed membership stochastic blockmodels. JMLR (2007) 3. Barab´asi, A.L., Albert, R.: Emergence of scaling in random networks. Science (1999) 4. Borgs, C., Chayes, J., Daskalakis, C., Roch, S.: First to market is not everything: an analysis of preferential attachment with fitness. In: STOC (2007) 5. Faloutsos, M., Faloutsos, P., Faloutsos, C.: On power-law relationships of the internet topology. In: SIGCOMM (1999) 6. Hoff, P., Raftery, A.: Latent space approaches to social network analysis. JASA (2002) 7. Kim, M., Leskovec, J.: Multiplicative attribute graph model of real-world networks (2010), http://arxiv.org/abs/1009.3499 8. Kumar, R., Raghavan, P., Rajagopalan, S., Sivakumar, D., Tomkins, A., Upfal, E.: Stochastic models for the web graph. In: FOCS (2000) 9. Lattanzi, S., Sivakumar, D.: Affiliation networks. In: STOC (2009) 10. Leskovec, J., Chakrabarti, D., Kleinberg, J., Faloutsos, C., Ghahramani, Z.: Kronecker Graphs: An Approach to Modeling Networks. JMRL (2010) 11. Leskovec, J., Kleinberg, J.M., Faloutsos, C.: Graphs over time: densification laws, shrinking diameters and possible explanations. In: KDD (2005) 12. Leskovec, J., Lang, K.J., Dasgupta, A., Mahoney, M.W.: Statistical properties of community structure in large social and information networks. In: WWW (2008) 13. Liben-Nowell, D., Novak, J., Kumar, R., Raghavan, P., Tomkins, A.: Geographic routing in social networks. PNAS (2005) 14. Mahdian, M., Xu, Y.: Stochastic kronecker graphs. In: Bonato, A., Chung, F.R.K. (eds.) WAW 2007. LNCS, vol. 4863, pp. 179–186. Springer, Heidelberg (2007) 15. McPherson, M.: An ecology of affiliation. American Sociological Review (1983) 16. Mitzenmacher, M.: A brief history of generative models for power law and lognormal distributions. Internet Mathematics (2004) 17. Palla, G., Lov´asz, L., Vicsek, T.: Multifractal network generator. PNAS (2010) 18. Wasserman, S., Pattison, P.: Logit models and logistic regressions for social networks. Psychometrika (1996) 19. Watts, D.J., Strogatz, S.H.: Collective dynamics of ’small-world’ networks. Nature (1998) 20. Young, S.J., Scheinerman, E.R.: Random Dot Product Graph Models for Social Networks. In: Bonato, A., Chung, F.R.K. (eds.) WAW 2007. LNCS, vol. 4863, pp. 138–149. Springer, Heidelberg (2007)
Random Walks on Digraphs, the Generalized Digraph Laplacian and the Degree of Asymmetry Yanhua Li and Zhi-Li Zhang University of Minnesota, Twin Cities {yanhua,zhzhang}@cs.umn.edu
Abstract. In this paper we extend and generalize the standard random walk theory (or spectral graph theory) on undirected graphs to digraphs. In particular, we introduce and define a (normalized) digraph Laplacian matrix, and prove that 1) its Moore-Penrose pseudo-inverse is the (discrete) Green’s function of the digraph Laplacian matrix (as an operator on digraphs), and 2) it is the normalized fundamental matrix of the Markov chain governing random walks on digraphs. Using these results, we derive new formula for computing hitting and commute times in terms of the Moore-Penrose pseudo-inverse of the digraph Laplacian, or equivalently, the singular values and vectors of the digraph Laplacian. Furthermore, we show that the Cheeger constant defined in [6] is intrinsically a quantity associated with undirected graphs. This motivates us to introduce a metric – the largest singular value of Δ := (L˜ − L˜T )/2 – to quantify and measure the degree of asymmetry in a digraph. Using this measure, we establish several new results, such as a tighter bound (than that of Fill’s in [9] and Chung’s in [6]) on the Markov chain mixing rate, and a bound on the second smallest singular value ˜ of L.
1 Introduction Graphs arising from many applications such as web are directed, where direction of links contains crucial information. Random walks are frequently used to model certain dynamic processes on (directed or undirected) graphs, for example, to reveal important network structural information, e.g., importance of nodes as in the Page-Rank algorithm [5], or to study ways to efficiently explore complex networks. Random walks on undirected graphs have been extensively studied and are wellunderstood (see [13]). They are closely related to spectral graph theory [7], which has produced powerful tools to study many important properties of (undirected) graphs that are of both theoretical and practical significance. Well-known results include bounds on Cheeger constant and mixing rate in terms of the second smallest eigenvalue of the graph Laplacian. On the other hand, there are relatively few similar studies on directed graphs, see, e.g., [6, 15], where the authors circumvent the “directedness” of digraphs by converting them into undirected graphs through symmetrization.
The work was supported in part by the National Science Foundation grants CNS-0905037 and CNS-1017647, the DTRA grant HDTRA1-09-1-0050, and a University of Minnesota DTC DTI grant.
R. Kumar and D. Sivakumar (Eds.): WAW 2010, LNCS 6516, pp. 74–85, 2010. © Springer-Verlag Berlin Heidelberg 2010
Random Walks on Digraphs
75
In this paper we develop a spectral graph theory (or random walk theory) for directed graphs (in short, digraphs). We introduce the notion of digraph Laplacian, a direct ˜ Instead generalization of the graph Laplacian (for undirected graphs), denoted by L. of using the node degrees (as in the case of undirected graphs), the digraph Laplacian is defined using stationary probabilities of the Markov chain governing random walks on digraphs. Furthermore, instead of relying on the positive semi-definiteness of the graph Laplacian matrix (for undirected graphs), we establish a key connection between the digraph Laplacian L˜ and its Moore-Penrose pseudo-inverse [10], denoted by L˜+ , and use the properties of this connection to prove several parallel results for random walks on digraphs. In particular, we show that 1) the Moore-Penrose pseudo-inverse L˜+ of the digraph Laplacian is exactly the (discrete) Green’s function of the digraph ˜ acting as an operator on digraphs [8], and 2) L˜+ is the normalized Laplacian matrix L, fundamental matrix [1] of the Markov chain governing random walks on digraphs. Based on the connection between L˜+ and the fundamental matrix, we show how hitting and commute times can be directly computed in terms of the singular values and vectors of the digraph Laplacian – this yields a more direct and efficient way to compute hitting and commute times than existing methods based on the fundamental matrix. More generally, our results suggest a “spectral graph theory” for digraphs, where graph properties can be studied using the singular values of the digraph Laplacian in place of the eigenvalues of the graph Laplacian. In particular, our theory of random walks on digraphs subsumes the existing theory of random walks on undirected graphs as a special case. Furthermore, we show that the well-known Cheeger constant – generalized by Chung to digraphs in [6] – is fundamentally a quantity intrinsic to undirected graphs, as there are infinitely many digraphs with the same symmetrized (undirected) graph. Hence bounds based on the eigenvalues of the symmetrized graph Laplacian do not uniquely capture the properties of digraphs. This leads us to introduce the degree of asymmetry to capture the overall “directedness” of digraphs, formally defined as follows: we express a digraph Laplacian L˜ in terms of a symmetric part, L¯ = (L˜ + L˜T )/2 , and a skew-symmetric part, Δ = (L˜ − L˜T )/2. (L¯ is the (symmetrized) graph Laplacian for digraphs introduced by Chung in [6].) The largest singular value of Δ, δmax , is referred to as the degree of asymmetry, which provides a quantitative measure of the asymmetry in digraphs. Many key properties of digraphs can then be bounded by the eigenvalues of L¯ and the degree of asymmetry. For instance, by accounting for the asymmetry of digraphs, we are able to obtain a tighter bound (than that of Fill’s in [9] and Chung’s in [6]) on (non-reversible) Markov chain mixing rate.
2 Preliminaries: Random Walks on Undirected Graphs We use a triple G = (V, E, A) to denote an undirected and weighted graph on the node set V = {1, 2, . . . , n}. The n × n (nonnegative) weight matrix A = [aij ] is symmetric, and is defined in such a way that aij = aji > 0, if i, j ∈ nE, and aij = aji = 0 otherwise. For 1 ≤ i ≤ n, the degree of node i is di = volume of j=1 aij . The n G, denoted by vol(G), is defined as the sum of all node degrees, d = i=1 di , i.e., vol(G) = d.
76
Y. Li and Z.-L. Zhang
A random walk on G is a Markov chain defined on G with the transition probability matrix P = [pij ], where pij = aij /di . Let D = diag[di ] be a diagonal matrix of node degrees, then P = D−1 A. Without loss of generality, we assume that the undirected graph G is connected (i.e., any node can reach any other node in G). Then it can be shown (see, e.g., [1]) that the Markov chain is irreducible, and there exists a unique stationary distribution, {π1 , π2 , . . . , πn }. Let π = [πi ]1≤i≤n be the column vector of the stationary probabilities. Then π T P = π T , where the superscript T represents (vector or matrix) transpose. Furthermore, this Markov chain (random walk) on G is reversible, namely (1) πi pij = πj pji , for any i, j, and di di πi = = , i = 1, 2, . . . , n. d d k k
(2)
Following [7], we will use the normalized graph Laplacian (instead of the unnormalized version L = D −A). Given an undirected G, the normalized graph Laplacian of G (also called normalized Laplacian matrix of G) is defined as follows: 1
1
1
1
L = D− 2 (D − A)D− 2 = D 2 (I − P )D− 2 .
(3)
A key property of the graph Laplacian (for an undirected graph) is that L is symmetric and positive semi-definite [10]. Hence all eigenvalues of L are nonnegative real numbers. In particular, for a connected undirected graph G, L has rank n−1 and has exactly one zero eigenvalue (its smallest one). Let λ1 = 0 < λ2 ≤ · · · ≤ λn be the n eigenvalues of L arranged in an increasing order, and μi , 1 ≤ i ≤ n, be the corresponding eigenvectors (of unit norm). In particular, one can show that the (column) eigenvector, μ1 , of L associated with the eigenvalue λ1 = 0, is given by √ √ di μ1 = π = [ πi ] = [ √ ]. d 1 2
(4)
Define Γ := diag[λ1 , . . . , λn ], the diagonal matrix formed by the eigenvalues, and U = [μ1 , . . . , μn ], an orthonormal matrix formed by the eigenvectors of L, where U U T = U T U = I. It is easy to see that the graph Laplacian L admits an eigendecomposition [10], namely, L = U Γ U T . Using the eigenvalues and eigenvectors of L, we can compute the hitting times and commute times using the following formula [13]: Hij =
d μ2kj μki μkj ( − ), λk dj di dj k>1
(5)
Cij =
d μki μkj ( √ − )2 , λk d dj i k>1
(6)
and
where μkj is the jth entry of the column vector μk .
Random Walks on Digraphs
77
3 Random Walk Theory on Digraphs In this section, we develop the random walk theory for digraphs. In particular, we generalize the graph Laplacian defined for undirected graphs, and introduce the digraph Laplacian matrix. We prove that the Moore-Penrose pseudo-inverse of this digraph Laplacian is exactly equal to (a normalized version of) the fundamental matrix of the Markov chain governing random walks on digraphs, and show that it is also the Green’s function of the digraph Laplacian. Using these connections, we illustrate that how hitting and commute times of random walks on digraphs can be directly computed using the singular values and vectors of the digraph Laplacian. We also show that when the underlying graph is undirected, our results reduce to the well-known results for undirected graphs. Hence our theory includes undirected graphs as a special case. 3.1 Random Walks on Directed Graphs and Fundamental Matrix As alluded earlier, random walks can be defined not only on undirected graphs, but also on digraphs. Let G = (V, E, A) be a (weighted) digraph defined on the vertex set V = {1, 2, . . . , n}, where A is a nonnegative, but generally asymmetric weight matrix such that aij > 0 if and only if the directed edge (or arc) i, j ∈ E. As before, we will simply refer to A as the adjacency matrix of G. For i = 1, 2, . . . , n, we define n n the out− = a , and the in-degree of vertex i, d = aji . In degree of vertex i, d+ ij i i j=1 n n j=1 n − general, d+ = d− . However, we have d := ni=1 d+ = d = i i=1 i i=1 j=1 aij . As before, we refer to d as the volume of the directed graph G, i.e., vol(G) = d. For conciseness, in the following unless otherwise stated, we refer to the out-degree of a vertex simply as its degree, and use di for d+ i . Let D = diag[di ] be a diagonal matrix of the vertex out-degrees, and define P = D−1 A. Then P = [pij ] is the transition probability matrix of the Markov chain associated with random walks on G, where at each vertex i, a random walk has the probability pij = aij /di to transit from vertex i to vertex j, if i, j ∈ E. We assume that G is strongly connected, i.e., there is a (directed) path from any vertex i to any other vertex j. Then the Markov chain P is irreducible, and has a unique stationary probability distribution, {πi }, where πi > 0, 1 ≤ i ≤ n. Namely, π T P = π T , where π = [π1 , . . . , πn ]T be the (column) vector of stationary probabilities. Unlike undirected graphs, the Markov chain associated with random walks on directed graphs is generally non-reversible, and eqs.(1) and (2) for undirected graphs do not hold. For random walks on directed graphs, quantities such as hitting times and commute times can be defined exactly as in the case of undirected graphs. However, since the (normalized) Laplacian matrix L is (so far!) defined only for undirected graphs, we cannot use the relations eqs.(5) and (6) to compute hitting times and commute times for random graphs on directed graphs. On the other hand, using results from the standard Markov chain theory, we can express the hitting times and commute times in terms of the fundamental matrix. In [1], Aldous and Fill define the fundamental matrix Z = [zij ] for an irreducible Markov chain with the transition probability matrix P : zij =
∞ t=0
(t)
(pij − πj ), 1 ≤ i, j ≤ n,
(7)
78
Y. Li and Z.-L. Zhang (t)
where pij is the (i, j)-th entry in the t-step transition probability matrix P t = P · · P. · t
Let Π = diag[πi ] be the diagonal matrix containing the stationary probabilities πi ’s on the diagonal, and J = [Jij ] the all-one matrix, i.e., Jij = 1, 1 ≤ i, j ≤ n. We can express Z alternatively as the sum of an infinite matrix series: Z=
∞
(P − JΠ) = t
t=0
∞
(P t − 1π T ),
(8)
t=0
where 1 = [1, . . . , 1]T is the all-one column vector. Hence J = 1 · 1T , and 1T Π = π T . While the physical meaning of the fundamental matrix Z may not be obvious from its definition eq.(7) (or eq.(8)), it plays a crucial role in computing various quantities related to random walks, or more generally, various stopping time properties of Markov chains [1]. For instance, the hitting times and commute times of random walks on a directed graph can be expressed in terms of Z as follows (see [1]): zjj − zij πj
(9)
zjj − zij zii − zji + . πj πi
(10)
Hij = and Cij =
In eqs.(7) and (8), the fundamental matrix Z is defined as an infinite sum. We show that Z in fact satisfies a simple relation eq.(11), and hence can be computed directly using the standard matrix inverse. Theorem 1. Let P be the transition probability matrix for an irreducible Markov chain. Then its corresponding fundamental matrix Z as defined in eq.(7) satisfies the following relation Z + JΠ = (I − P + JΠ)−1 .
(11)
Proof: Note that JΠ = 1π T . From π T P = π T and P 1 = 1, we have JΠP = JΠ and P JΠ = JΠ. Using these two relations, it is easy to prove the following equation by induction. P m − JΠ = (P − JΠ)m , for any integer m > 0.
(12)
Plugging this into eq.(8) yields Theorem 1. As undirected graphs are a special case of directed graphs, eqs.(9) and (10) provide an alternative way to compute hitting times and commute times for random walks on fully connected undirected graphs. In this paper we will show that eqs.(5) and (6) are in fact equivalent to eqs.(9) and (10).
Random Walks on Digraphs
79
3.2 (Normalized) Digraph Laplacian and Green’s Function for Digraphs We now generalize the existing spectral graph theory defined for undirected graphs to directed graphs by introducing an appropriately generalized Laplacian matrix for (strongly connected) diagraphs. Let G = (V, E, A) be a strongly connected (weighted) digraph defined on the vertex set V = {1, 2, . . . , n}, where in general the weight (or adjacency) matrix A is asymmetric. A major technical difficulty in dealing with digraphs is that if one naively extends the (normalized) Laplacian matrix, L = D−1/2 (D − A)D−1/2 , (or its un-normalized version, L = D − A), defined for undirected graphs to digraphs, L is in general asymmetric; hence the nice properties such as positive semidefiniteness of L no longer hold. Past attempts in generalizing the spectral graph theory to digraphs have been simply symmetrized L, e.g., by introducing a symmetric matrix, L¯ := (L + LT )/2 [6, 15]. Unfortunately, as will be shown in the Section 4, such symmetrized L¯ does not directly capture the unique characteristic of the random walk on ¯ the digraph as defined earlier, since a set of diagraphs can have the same L. √ 1 2 For a strongly connected digraph G, let Π = diag[ πi ]. We define the (normalized) digraph Laplacian for G (also referred to as the generalized (normalized) Laplacian matrix1 ), L˜ = [L˜ij ] as follows: ˜ Definition 1 (Normalized Digraph Laplacian L) 1 1 L˜ = Π 2 (I − P )Π − 2 ,
(13)
namely, for 1 ≤ i, j ≤ n, ⎧ if i = j, ⎨ 1 −1pii − 12 ˜ 2 Lij = −πi pij πj if i, j ∈ E, ⎩ 0 otherwise.
(14)
Treating this (normalized) digraph Laplacian matrix L˜ as an (asymmetric) operator on a digraph G, we now define the (discrete) Green’s function G˜ (without boundary conditions) for digraphs in exactly the same manner as for undirected graphs [8]. Namely, G˜ is a matrix with its entries, indexed by vertices i and j, that satisfies the following conditions: ˜ i,j = Ii,j − √πi πj , 1 ≤ i, j ≤ n, (15) [G˜L] and expressed in the matrix form, T
1 1 G˜L˜ = I − π 2 π 2 .
(16)
In the following we will show that G˜ is precisely L˜+ , the pseudo-inverse of the Laplacian operator L˜ on the digraph G. Furthermore, we will relate L˜+ directly to the fundamental matrix Z of the Markov chain associated with random walks on the digraph G. Before we establish a main result of this paper, we first introduce a few more notations and then prove the following useful lemma. 1
An un-normalized digraph Laplacian is defined as L = Π(I − P ) in [3].
80
Y. Li and Z.-L. Zhang
1 1 Lemma 1. Define Z˜ = Π 2 ZΠ − 2 (the normalized fundamental matrix), and J˜ = 1 1 1 1T Π 2 JΠ 2 = π 2 π 2 . The following relations regarding Z˜ and J˜ hold: (1) J˜ = J˜2 , (2) ˜ 12 = π 12 T L˜ = 0, and (3) J˜Z˜ = Z˜J˜ = Zπ ˜ 12 = π 12 T Z˜ = 0. J˜L˜ = L˜J˜ = Lπ
Proof Sketch: These relations can be established using the facts that J = 11T , 1T Π = π T , π T J = 1T , ΠJ = π1T , JΠJ = J, π T (I − P ) = 0, (I − P )1 = 0, π T Z = 0, and Z1 = 0. The last four equalities imply that the matrices I − P and Z have the same left and right eigenvectors, π T and 1, corresponding to the eigenvalue 0. We are now in a position to prove a main theorem of the paper, which states the Green’s function for the (normalized) digraph Laplacian is exactly its Moore-Penrose pseudo˜ inverse, and it is equal to the normalized fundamental matrix. Namely, G˜ = L˜+ = Z. For completeness, we also include the key definitions in the statement of the theorem. Theorem 2 (Laplacian matrix and Green’s function for digraphs). Given a strongly connected digraph G = (V, E, A) where V = {1, . . . , n} and A is a (generally asymmetric) nonnegative weight/adjancency matrix of G such that aij > 0 if and only if i, j ∈ E, let D = diag[di ] be the diagonal (out-)degree matrix, i.e., di = j aij . Then P = D−1 A is the transition probability matrix for the (irreducible and generally non-reversible) Markov chain associated with random walks on the digraph G. Let π = [π1 , . . . , πn ]T be the stationary probability distribution (represented as a column vector) for the Markov chain P , and Π = diag[πi ] be the diagonal stationary probability matrix. We define the (normalized) digraph Laplacian matrix L˜ of G as in eq.(13), 1 1 1 1 i.e., L˜ = Π 2 (I − P )Π − 2 . Define Z˜ = Π 2 ZΠ − 2 , where Z is the fundamental matrix of the Markov chain P as defined in eq.(7). ˜ Furthermore, Z˜ is Then Z˜ = L˜+ , is the pseudo-inverse of the Laplacian matrix L. ˜ the (discrete) Green’s function for L. Namely, T
1 1 1 1 Z˜L˜ = I − Π 2 JΠ 2 = I − π 2 π 2 , 1
1
(17)
1
where J is the all-one matrix and π 2 = [π12 , . . . , πn2 ]T (a column vector). Proof Sketch: From eq.(11) in Theorem 1, we have ˜ −1 . Z˜ + J˜ = (L˜ + J)
(18)
˜ and using Lemma 1, it is easy to see that Multiplying eq.(18) from the right by L˜ + J, ˜ Z˜L˜ = I − J,
(19)
˜ Similarly, which establishes that Z˜ is the Green’s function of the digraph Laplacian L. ˜ ˜ ˜ ˜ by multiplying eq.(18) from the left by L + J, we can likewise prove LZ˜ = I − J. T ˜ ˜ ˜ ˜ ˜ ˜ ˜ ˜ Hence Z L = LZ = I − J, which is a real symmetric matrix. Hence (LZ) = LZ˜ ˜ T = Z˜L. ˜ Furthermore, as J˜Z˜ = 0, eq.(19) yields Z˜L˜Z˜ = Z. ˜ Similarly, as and (Z˜L) 1 1 ˜ ˜ ˜ ˜ ˜ ˜ LJ = Π 2 (I − P )JΠ 2 = 0, eq.(19) yields LZ L = L. These establish that Z˜ satisfies the four conditions of matrix pseudo-inverse. Hence Z˜ is also the Moore-Penrose ˜ Therefore, G˜ = Z˜ = L˜+ . pseudo-inverse of L.
Random Walks on Digraphs
81
3.3 Computing Hitting and Commute Times for Digraphs Using Digraph Laplacian ˜ pseudo-inverse Using the relationship between the (normalized) digraph Laplacian L,its ˜ we can now express the hitting times L˜+ , and the (normalized) fundamental matrix Z, and commute times of random walks on digraphs in terms of L˜+ , or alternatively in ˜ terms of the singular values and singular vectors of the digraph Laplacian matrix L. 1 ˜ 12 = Π − 12 L˜+ Π 12 , and using eqs.(9) and (10), we can compute From Z = Π − 2 ZΠ the hitting times and commute times for random walks on digraphs directly in terms of the entries of L˜+ . Theorem 3 (Computing hitting and commute times using L˜+ ). The hitting times and commute times of random walks on a strongly connected digraphs can be computed using the pseudo-inverse of the (generalized) normalized Laplacian matrix L˜+ as follows: Hij =
+ + L˜jj L˜ij −√ , πj πi πj
(20)
and Cij = Hij + Hji =
+ + + + L˜jj L˜ii L˜ij L˜ji + −√ −√ , πj πi πi πj πi πj
(21)
+ where L˜ij is the (i, j)-th entry of L˜+ , and πi is the stationary probability of vertex i.
For undirected graphs, we show that eqs.(20) and (21) reduce to eqs.(5) and (6) in Section 2. This can be seen from the fact that for undirected graphs, L˜ = L is symmetric and positive semi-definite. Hence the singular value decomposition of L is the same as the eigen-decomposition of L.
4 Degree of Asymmetry, Generalized Cheeger Constant and Bounds on Mixing Rate In this section we explore the relation between digraph Laplacian L˜ and its symmetrized ¯ We first show that the symmetrized Laplacian matrix L, ¯ and the Cheeger version L. constant h(G) as defined in [6] are in a sense primarily determined by an undirected graph associated with the random walks with the transition probability matrix P¯ = (P + Π −1 P T Π)/2, thus cannot capture the unique characteristics of each individual diagraph. As a result, we investigate two questions: 1) how can the “degree of asymmetry” of a digraph be quantified and measured? and 2) how does the degree of asymmetry affect crucial properties of a digraph such as the mixing rate? In the following we propose one metric – the largest singular value of Δ := (L˜ − L˜T )/2 – as a measure of the degree of asymmetry in a digraph. We show that by explicitly accounting for the degree of asymmetry, we can obtain generally tighter bounds on quantities (e.g., mixing rate) associated with random walks (or Markov chains) on digraphs.
82
Y. Li and Z.-L. Zhang
4.1 The Degree of Asymmetry, and Relations to Symmetrized Digraph Laplacian ˜ ˜T In [6], Chung introduces the symmetrized Laplacian matrix for digraphs, L¯ = L+2L , generalizes the Cheeger constant to digraphs and bounds it in terms of the second small¯ In the following we show that the symmetrized Laplacian L¯ and the est eigenvalue of L. Cheeger constant introduced by Chung are in fact two quantities intrinsic to undirected graphs.
Theorem 4. Given a digraph G, with transition probability matrix P , there exist infinite digraphs which have the same stationary distribution matrix Π and the same symmetrized transition probability matrix P¯ = (P + Π −1 P T Π)/2. As a result, all these graphs have the same symmetrized Laplacian matrix and Cheeger constant. Proof : We prove it by construction. Given a digraph G = (V, E, A), with transition probability matrix P , all the digraphs G ’s with the transition probability P as P (α) = αP + (1 − α)Π −1 P T Π,
(22)
form an infinite digraph set, denoted by G(G), where α ∈ [0, 1]. (1)It is easy to check that any P (α) defined in eq.(22) is non-negative, and satisfies π T P (α) = π T , and P (α)1 = 1, thus P (α) represents a transition probability matrix of a random walk with stationary distribution π. 1 1 For any G ∈ G(G), the digraph Laplacian matrix is given by L˜ = Π 2 (I−P )Π − 2 , and the symmetrized Laplacian is only determined by P¯ , since we have L˜ + L˜T 1 1 1 1 P + Π −1 P T Π L¯ = = Π 2 (I − )Π − 2 = Π 2 (I − P¯ )Π − 2 . 2 2
(23)
¯ For any (2) In particular, when α = 12 , P ( 12 ) = P¯ represents the undirected graph G. 1 S ⊂ N := {1, . . . , n}, define an n-element vector fS , where fS (i) = Fπ (S) , i ∈ S ¯ and fS (i) = − Fπ1(S) ¯ , i ∈ S, where Fπ (S) := i∈S πi is the circulation function [6]. 1
Define xS = Π − 2 fS . Then min S
˜ S xTS Lx fST Π(I − P )fS Fπ (∂S) = min ≤ 2 inf T T ¯ = 2h(G). (24) S S xS xS fS ΠfS min{Fπ (S), Fπ (S)}
The above inequality indicates that the Cheeger constant h(G) is closely related to ˜ xT Lx ˜ S = xT L˜T xS = 1 (xT Lx ˜ S + xT L˜T xS ) = minS xST xSS . On the other hand, xTS Lx S S S 2 S
¯ S . Hence for any digraph G with a digraph Laplacian L˜ such that (L˜ +L˜ T )/2 = xTS Lx ¯ S . We see that the left-hand side of eq.(24) hinges ¯ we have xT L˜ xS = xT Lx L, S S ¯
only on L. Therefore, any graph G ∈ G(G) has the same Cheeger constant, i.e. ¯ h(G ) = h(G) = h(G).
To capture the “degree of asymmetry” in a digraph, we express L˜ as a sum of a symmetric part and a skew-symmetric part: L˜ = L¯ + Δ, where Δ = (L˜ − L˜T )/2. Note that
Random Walks on Digraphs
83
L˜T = L¯ + ΔT = L¯ − Δ. Hence Δ captures the difference between L˜ and its transpose (which induces a reserved Markov chain or random walk). When L˜ is symmetric, then Δ = 0. Let (0 =)σ1 ≤ σ2 ≤ . . . ≤ σn denote the singular values (in an increasing ¯2 ≤ . . . ≤ λ ¯ n denote the eigenvalues of L, ¯1 ≤ λ ¯ and ˜ Likewise, let (0 =)λ order) of L. (0 =)δ1 ≤ δ2 ≤ . . . ≤ δn (= δmax ) the singular values of Δ. The following relations among them hold (See [2]): ¯i + δn , i = 1, 2, . . . , n. ¯ i ≤ σi ≤ λ λ
(25)
¯ i ≤ δn , i = 2, . . . , n. We therefore propose the From eq.(25), we see that σi − λ largest singular value of Δ, δn (= δmax ) as a measure of the degree of asymmetry in the underlying digraph. Note that δn = Δ , where · is the operator (bound) ˜ − y, L˜T x| = norm of a matrix: Δ := supx=1 Δx 2 = supy=x=1 |y, Lx ˜ − x, Ly| ˜ supy=x=1 |y, Lx (see, e.g., [2], p.6 and p.91). On the other hand, T ˜ − x, L˜ x = 0 for any x. In the following, we relate and bound δn – the x, Lx degree of asymmetry – to two other important quantities associated with the Markov chain on a digraph: the digraph gap g(G) defined below and the second largest singular value of the transmission probability matrix P . where Fπ (i, j)= πi Pij , obeys Given a digraph G, the circulation function Fπ (·), the flow conservation law at every node of a digraph: F (k, i) = j F (i, j) for all k i’s. Now, define the digraph gap g(G) = maxS i∈S | j∈S¯ (Fπ (i, j) − Fπ (j, i))|, which quantifies the maximum difference between two bipartite subgraphs S and S¯ among all partitions. We have the following theorem relating the degree of asymmetry with g(G) and σn−1 (P ), the second largest singular value of P . Theorem 5 (Bounds on the degree of asymmetry) 1
2 2g(G) ≤ δn ≤ λn−1 (P˜ T P˜ ) = σn−1 (P ), 1
(26)
1
where P˜ = Π 2 P Π − 2 . Proof : The proof of this theorem is delegated to the technical report [12]. Theorem 6 below relates and bounds the second smallest singular value σ2 of L˜ in terms of the degree of asymmetry δn , the Cheeger constant, and the second smallest ¯ 2 of L. ¯ The proof of this theorem is delegated to the technical report [12]. eigenvalue λ ¯2 , δn and the Cheeger constant). Given a strongly Theorem 6 (Relations among σ2 , λ 1 1 connected graph G = (V, E, A), and its Laplacian matrix L˜ = Π 2 (I − P )Π − 2 , we have the bounds for the second smallest singular value of L˜ as h2 (G) δn ≤ σ2 ≤ (1 + ¯ ) · 2h(G). 2 λ2 When the graph is undirected, we have the same as the bounds obtained in [6].
h2 (G) 2
(27)
¯ 2 ≤ 2h(G), which is exactly ≤ σ2 = λ
84
Y. Li and Z.-L. Zhang
˜ Finally, we introduce a generalized Cheeger constant, h(G), defined as ˜ S ˜ S ) 12 Lx (xT L˜T Lx ˜ , = min1 S T h(G) = min 1 S xS (xS xS ) 2 xS ⊥π 2
(28)
1
where for any S ⊂ N := {1, 2, . . . , n}, xS = Π − 2 fS is defined above. We see that the generalized Cheeger constant thus defined minimizes the 2-norm of the circulations ¯ whereas h(G) minimizes the 1-norm (the sum of across bipartite subgraphs S and S, ˜ ¯ Clearly, σ2 ≤ h(G). absolute values) of the circulations across S and S. 4.2 Bounding the Mixing Rate of Random Walks on Digraphs In this section, using mixing rating bounds as an example, we show that by considering the degree of asymmetry, we obtain a better bound for the mixing rate of random walks on digraphs. The mixing rate is a measure of how fast a random walk converges to its stationary distribution. Many papers have studied the problem of bounding the mixing rate of random walks (or reversible Markov chains) on undirected graphs, such as [4,11].Relatively few papers [6,9,14] have addressed the problem of bounding the mixing rate of Markov chains (or random walks) on digraphs. In bounding the convergence rate from an initial distribution to the stationary distribution of a Markov Chain with the transition probability matrix P , the χ-square distance [6, 9] is commonly used, and is defined as follows: (P t (i, j) − πj )2 1 χ(t) = max ( )2 . (29) i∈V (G) πj j∈V (G)
Fill in [9] derives an upper bound on the mixing rate of a random walk on digraphs in ¯2 of L. ¯ When the Markov chain P is strongly terms of the second smallest eigenvalue, λ 1 1 − 2 ˜ aperiodic, define P = Π 2 P Π 2 , then χ (t) ≤ εt maxi πi−1 , where ε = max1 f ⊥π 2
f T P˜ T P˜ f P˜ f 2 ¯2 . = max ≤1−λ 2 1 fT f f 2 f ⊥π
(30)
From Theorem 4, we know that this bound leads to the same upper bound for all di¯ By accounting for the degree of asymmetry, we obtain a lower graphs with the same L. P˜ f 2 bound and a (generally) tighter upper bound on f 2 as follows, which in turn yields a tighter bound on χ(t): Theorem 7. For a strongly aperiodic Markov chain P , P˜ f 2 ¯ 2 )2 + 2δn λ ¯n + δ2 . δn2 ≤ ε = max1 ≤ (1 − λ n 2 f f ⊥π 2
(31)
Proof : First, the lower bound can be obtained from Theorem 5. To prove the upper bound, we note that ¯ + ΔT L¯ + ΔT Δ)f f T (L¯ − I)2 f f T (LΔ f T P˜ T P˜ f ¯ 2 )2 + 2δn λ ¯ n + δn2 . = + ≤ (1 − λ fT f fT f fT f (32)
Random Walks on Digraphs
85
The above theorem states that the mixing rate of a random walk on a digraph cannot be ¯2 , λ ¯ n and δn . In particular, slower than δn2 ; and it is upper bounded by a function of λ ˜ when L is symmetric (i.e., the underlying graph is undirected), the bound in eq.(31) reduces to ˜ 2 Pf ¯2 )2 , ≤ (1 − λ ε = max1 2 f ⊥π 2 f ¯ 2 )2 is attained. In contrast, eq.(30) yields the bound as where the equality ε = (1 − λ ¯ ε = 1 − λ2 . Hence when the underlying graph is undirected, our bound is tighter. As a final remark, we note that similar derivations can be applied to obtain a tighter bound (than that of Chung’s [6]) on the mixing rate of lazy random walk on G with transition probability matrix P = I+P 2 (see [12]).
References 1. Aldous, D., Fill, J.A.: Reversible markov chains and random walks on graphs, http://www.stat.berkeley.edu/˜aldous/RWG/book.html 2. Bhatia, R.: Matrix Analysis. Springer, Heidelberg (1997) 3. Boley, D., Ranjan, G., Zhang, Z.-L.: An asymmetric laplacian for a directed graph. University of minnesota computer science department technical report: Tr09-009 (2009) 4. Boyd, S., Ghosh, A., Prabhakar, B., Shah, D.: Mixing times for random walks on geometric random graphs. In: SIAM Workshop on Analytic Algorithmics and Combinatorics (ANALCO), Vancouver, pp. 240–249. SIAM, Philadelphia (2005) 5. Brin, S., Page, L.: The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems 30(1-7), 107–117 (1998) 6. Chung, F.R.K.: Laplacians and the cheeger inequality for directed graphs. Annals of Combinatorics 9, 1–19 (2005) 7. Chung, F.R.K.: Spectral Graph Theory. In: CBMS Regional Conference Series in Mathematics, vol. 92 (2006) 8. Chung, F.R.K., Yau, S.T.: Discrete green’s functions. Journal of Combinatorial Theory, Series A, 191–214 (July 2000) 9. Fill, J.A.: Eigenvalue bounds on convergence to stationarity for nonreversible Markov chains, with an application to the exclusion process. Ann. Appl. Probab. 1(1), 62–87 (1991) 10. Horn, R., Johnson, C.R.: Matrix Analysis, 1st edn. Cambridge University Press, Cambridge (1985) 11. Jerrum, M., Son, J.-B.: Spectral gap and log-sobolev constant for balanced matroids. In: The 43rd Annual IEEE Symposium on Foundations of Computer Science, FOCS 2002 (2002) 12. Li, Y., Zhang, Z.-L.: Random walks on diagraphs, the generalized digraph laplacian and the degree of asymmetry. Technical Report. CSE Department of University of Minnesota (2010), http://www.cs.umn.edu/˜yanhua/ 13. Lov`asz, L.: Random walks on graphs: A survey. Combinatorics 2, 1–46 (1993) 14. Mihail, M.: Conductance and convergence of markov chains-a combinatorial treatment of expanders. In: Proceedings of FOCS 1989, pp. 526–531 (1989) 15. Zhou, D., Huang, J., Sch¨olkopf, B.: Learning from labeled and unlabeled data on a directed graph. In: Proceedings of the 22nd International Conference on Machine Learning, ICML 2005, pp. 1036–1043. ACM, New York (2005)
Finding and Visualizing Graph Clusters Using PageRank Optimization Fan Chung Graham and Alexander Tsiatas Department of Computer Science and Engineering University of California, San Diego {fan,atsiatas}@cs.ucsd.edu
Abstract. We give algorithms for finding graph clusters and drawing graphs, highlighting local community structure within the context of a larger network. For a given graph G, we use the personalized PageRank vectors to determine a set of clusters, by optimizing the jumping parameter α subject to several cluster variance measures in order to capture the graph structure according to PageRank. We then give a graph visualization algorithm for the clusters using PageRank-based coordinates. Several drawings of real-world data are given, illustrating the partition and local community structure.
1
Introduction
Finding smaller local communities within a larger graph is a well-studied problem with many applications. For example, advertisers can more effectively serve niche audiences if they can identify their target communities within the larger social web, and viruses on technological or population networks can be effectively quarantined by distributing antidote to local clusters around their origins [7]. There are numerous well-known algorithms for finding clusters within a graph, including k-means [20,22], spectral clustering [26,31], Markov cluster algorithms [11], and numerous others [17,23,24,25,27]. Many of these algorithms require embedding a graph into low-dimensional Euclidean space using pairwise distances, but graph distance-based metrics fail to capture graph structure in real-world networks with small-world phenomena since all pairs of nodes are connected within short distances. PageRank provides essential structural relationships between nodes and is particularly well suited for clustering analysis. Furthermore, PageRank vectors can be computed more efficiently than performing a dimension reduction for a large graph. In this paper, we give clustering algorithms PageRank-Clustering that use PageRank vectors to draw attention to local graph structure within a larger network. PageRank was first introduced by Brin and Page [5] for Web search algorithms. Although the original definition is for the Web graph, PageRank is well defined for any graph. Here, we will use a modified version of PageRank, known as personalized PageRank [18], using a prescribed set of nodes as a seed vector. PageRank can capture well the quantitative correlations between pairs or subsets of nodes, especially on small-world graphs where the usual graph distances R. Kumar and D. Sivakumar (Eds.): WAW 2010, LNCS 6516, pp. 86–97, 2010. c Springer-Verlag Berlin Heidelberg 2010
Finding and Visualizing Graph Clusters Using PageRank Optimization
87
are all quite small. We use PageRank vectors to define a notion of PageRank distance which provides a natural metric space appropriate for graphs. A key diffusion parameter in deriving PageRank vectors is the jumping constant α. In our clustering algorithms, we will use α to control the scale of the clustering. In particular, we introduce two variance measures which can be used to automatically find the optimized values for α. We then use PageRank vectors determined by α to guide the selection of a set of centers of mass and use them to find the clusters via PageRank distances. We further apply our clustering algorithm to derive a visualization algorithm PageRank-Display to effectively display local structure when drawing large networks. The paper is organized as follows: The basic definitions for PageRank are given in Section 2. In Section 3, we describe two cluster variance measures using PageRank vectors, and we give clustering algorithms in Section 4, with analysis in Section 5. A graph drawing algorithm is given in the last section and several examples are included.
2
Preliminaries
We consider general undirected graphs G = (V, E) with vertex set V and edge set of E. For a vertex v, let dv denote the degree of v which is the number of neighbors v. For a set of nodes T ⊆ V , the volume of T is defined to be vol(T ) = v∈T dv . Let D denote the diagonal degree matrix and A the adjacency matrix of G. We consider a typical random walk on G with the transition probability matrix defined by W = D−1 A. Let π denote the stationary distribution of the random walk, if it exists. Personalized PageRank vectors are based on random walks with two governing parameters: a seed vector s, representing a probability distribution over V , and a jumping constant α, controlling the rate of diffusion. The personalized PageRank pr(α, s) is defined to be the solution to the following recurrence relation: pr(α, s) = αs + (1 − α)pr(α, s)W. Here, s (and all other vectors) will be treated as row vectors. The original definition of PageRank defined in [5] is the special case where the seed vector is the the uniform distribution. If s is simply the distribution which is 1 for a single node v and 0 elsewhere, we write pr(α, v). In general, it can be computationally expensive to compute PageRank exactly; it requires using the entire graph structure which can be prohibitive on large networks. Instead, we use an approximate PageRank algorithm as given in [3,8]. This approximation algorithm is much more tractable on large networks, because it can be computed using only the local graph structure around the starting seed vector s. Besides s and the jumping constant α, the algorithm requires an approximation parameter . For a set of points S = {s1 , . . . , sn } in Euclidean space, the Voronoi diagram is a partition of the space into disjoint regions R1 , . . . , Rn such that each Ri contains si and the region of space containing the set of points that are closer to si than any other sj . Voronoi diagrams are well-studied in the field of computational geometry. Here we consider Voronoi diagrams on graphs using PageRank vectors as a notion of closeness.
88
F. Chung Graham and A. Tsiatas
For two vertices u, v, we define the PageRank distance with jumping constant α as: distα (u, v) = pr(α, u)D−1/2 − pr(α, v)D−1/2 . We can further generalize this distance to two probability distributions p and q defined on the vertex set V of G. Namely, the PageRank distance, with jumping constant α, between p and q is defined by distα (p, q) = u,v p(u)q(v)dist(u, v). With this definition, for a subset S of vertices, we can generalize the notion of a center of mass for S to be a probability distribution c. For a given > 0, we say c is an -center or center of mass for S if v∈S distα (c, v) ≤ . Let C denote a set of k (potential) centers. The goal is for each center c to be a representative center of mass for some cluster of vertices. We let Rc denote the set of all vertices x which are closest to c in terms of PageRank, provided the jumping constant α is given: Rc = {x ∈ V : distα (c, x) ≤ distα (c , x) for all c ∈ C}.
3
PageRank Variance and Cluster Variance Measures
For a vertex v and a set of centers C, let cv denote the center that is closest to v, (i.e., cv is the center of mass c ∈ C such that v ∈ Rc ). We follow the approach as in k-means by defining the following evaluative measure for a potential set of k centers C, using PageRank instead of Euclidean distances. dv pr(α, v)D−1/2 − pr(α, cv )D−1/2 2 = dv distα (v, cv )2 . μ(C) = v∈V
v∈V
Selecting a set of representative centers within a graph is a hard problem, known to be NP-complete. There are many approximate and heuristic algorithms used in practice (see [30]). Here, we will develop algorithms that use personalized PageRank vectors to select the centers. In the Web graph, links between websites can be interpreted as votes for a website’s importance, and PageRank vectors are used to determine which pages are intrinsically more important in the overall graph. Personalized PageRank vectors are local information quantifying the importance of every node to the seed. Thus, the uth component of the personalized PageRank vector pr(α, v) quantifies how well-suited u is to be a representative cluster center for v. To evaluate a set of cluster centers in a graph G, we consider two measures that capture the community structure of G with respect to PageRank: 2 dv pr(α, v)D−1/2 − pr(α, pr(α, v))D−1/2 Φ(α) = v∈V
=
v∈V
Ψ (α) =
dv distα (v, pr(α, v))2 , 2 dv pr(α, pr(α, v))D−1/2 − πD−1/2
v∈V
=
v∈V
dv distα (pr(α, v), π)2 .
Finding and Visualizing Graph Clusters Using PageRank Optimization
89
The α-PageRank-variance Φ(α) measures discrepancies between the personalized PageRank vectors for nodes v and possible centers nearest to v, represented by the probability distribution pr(α, v). The α-cluster-variance Ψ (α) measures large discrepancies between personalized PageRank vectors for nodes v and the overall stationary distribution π. If the PageRank-variance Φ(α) is small, then the ‘guesses’ by using PageRank vectors for the centers of mass give a good upper bound for the k-means evaluation μ using PageRank distance, indicating the formation of clusters. If the cluster-variance Ψ (α) is large, then the centers of masses using the predictions from PageRank vectors are quite far from the stationary distribution, capturing a community structure. Thus, our goal is to find the appropriate α such that Φ(α) is small but Ψ (α) is large. For a specific set of centers of mass C, we use the following for an evaluative metric Ψα (C), suggesting the structural separation of the communities represented by centers in C: 2 Ψα (C) = vol(Rc ) pr(α, c)D−1/2 − πD−1/2 = vol(Rc ) distα (c, π)2 . c∈C
c∈C
We remark that this measure is essentially the analog of k-means in terms of PageRank distance, and it has a similar flavor as a heuristic given by Dyer and Frieze [9] for the traditional center selection problem.
4
The PageRank-Clustering Algorithms
These evaluative measures give us a way to evaluate a set of community centers, leading to the PageRank-Clustering algorithms presented here. The problem of finding a set of k centers minimizing μ(C) is then reduced to the problem of minimizing Φ(α) while Ψ (α) is large for appropriate α. In particular, for a special class of graphs which consist of k clusters of vertices where each cluster has Cheeger ratio at most α, the center selection algorithm is guaranteed to be successful with high probability. A natural question is to find the appropriate α for a given graph, if such α exists and if the graph is clusterable. A direct method is by computing the variance metrics for a sample of α and narrowing down the range for α using binary search. Here, we give a systematic method for determining the existence of an appropriate α and finding its value is by differentiating Φ(α), and finding roots α satisfying Φ (α) = 0. It is not too difficult to compute that the derivative of Ψ satisfies 1−α −1/2 2 −1 Φ (α) = g (α)D − 2g (α), pr(α, g (α))D (1) v v v α3 where gv (α) = pr(α, pr(α, v)(I − W )). Here, we give two versions of the clustering algorithm. For the sake of clarity, the first PageRank clustering algorithm uses exact PageRank vectors without approximation. The second PageRank clustering algirhtm allows for the use of approximate PageRank vectors as well as approximate PageRank-variance and cluster-variance for faster performance.
90
F. Chung Graham and A. Tsiatas
PageRank-ClusteringA(G, k, ): – For each vertex v in G, compute the PageRank vector pr(α, v). – Find the roots of Φ (α) (1). (There can be more than one root if the graph G has a layered clustering structure.) For each root α, we repeat the following process: – Compute Φ(α). If Φ(α) ≤ , also compute Ψ (α). Otherwise, go to the next α. – If k < Ψ (α) − 2 − , go to the next α. Else, select c log(n) sets, each consisting of k potential centers randomly chosen according to the stationary distribution π. (Here, c is some absolute constant c ≤ 100 to allow sampling with high probability.) – For each set S = {v1 , . . . , vk }, let C be the set of centers of mass where ci = pr(α, vi ). – Compute μ(C) and Ψα (C). If |μ(C)−Φ(α)| ≤ and |Ψα (C)−Ψ (α)| ≤ , determine the k Voronoi regions according to the PageRank distances using C. We can further reduce the computational complexity by using approximate PageRank vectors. PageRank-ClusteringB(G, k, ): – Find the roots of Φ (α) (1) within an error bound /2, by using sampling techniques from [29] involving O(log n) nodes, log(1/) values of α and δapproximate PageRank vectors [3,8] where δ = /n2 . There can be more than one root if the graph G has a layered clustering structure. For each root α, we repeat the following process: – Approximate Φ(α). If Φ(α) ≤ /2, also compute Ψ (α). Else, go to the next α. – If k < Ψ (α) − 2 − /2, go to the next α. Else, select c log(n) sets, each consisting of k potential centers randomly chosen according to the stationary distribution π. (Here, c is some absolute constant c ≤ 100 to allow sampling with high probability.) – For each set S = {v1 , . . . , vk }, let C be the set of centers of mass where ci is an approximate PageRank vector for pr(α, vi ). – Compute μ(C) and Ψα (C). If |μ(C)−Φ(α)| ≤ and |Ψα (C)−Ψ (α)| ≤ , determine the k Voronoi regions according to the PageRank distances using C. We remark that by using the sharp approximate PageRank algorithm in [8], the error bound δ for PageRank can be set to be quite small since the time complexity is proportional to log(1/δ). If we choose δ to be a negative power of n such as δ = /n2 , then approximate PageRank vectors lead to sharp estimates for Φ and
Finding and Visualizing Graph Clusters Using PageRank Optimization
91
Φ within an error bound of . Thus for graphs with k clusters, the PageRankClusteringB algorithm will terminate after approximating the roots of Φ , O(k log n) approximations of μ and Ψα and O(n) approximate PageRank computations. By using approximation algorithms using sampling, this can be done quite efficiently. We also note that there might be no clustering output if the conditions set within the algorithms are not satisfied. Indeed, there exist graphs that inherently do not have a k-clustered structure within the error bound that we set for . Another reason for no output is the probabilistic nature of the above sampling argument. We will provide evidence to the correctness of the above algorithm by showing that, with high probability, a graph with a k-clustered structure will have outputs that capture its clusters in a feasible manner which we will specify further. For a subset of nodes H in a graph G, the Cheeger ratio h(H) is the ratio of the number of edges leaving H and vol(H). We say a graph G is (k, h, β, )clusterable if the vertices of G can be partitioned into k parts so that (i) each part S has Cheeger ratio at most h and (ii) S has volume at least βvol(G)/k for some constant β, and (iii) there is a subset S of S, with vol(S ) ≤ (1 − )vol(S) √ has Cheeger ratio at least h. We will provide evidence for the correctness of PageRank-ClusteringA by proving the following theorem: Theorem 1. Suppose a graph G has an (k, h, β, )-clustering and α, ∈ (0, 1) satisfy ≥ hk/(2αβ). Then with high probability, PageRank-ClusteringA returns a set C of k centers with Φ(α) ≤ , Ψ (C) > k − 2 − , and the k clusters are near optimal according to the PageRank k-means measure μ with an additive error term .
5
Analyzing PageRank Clustering Algorithms
We wish to show that the PageRank-clustering algorithms are effective for treating graphs which are (k, h, β, )-clusterable. We will use a slightly modified version of a result in [3] which provides a direct connection between the Cheeger ratio and the personalized PageRank within S: Theorem A. [3] For any set S and any constants α, δ in (0, 1], there is a subset Sα ⊆ S with volume vol(Sα ) ≥ δvol(S)/2 such that for any vertex v ∈ Sα , the PageRank vector pr(α, v) satisfies pr(α, v)(S) ≥ 1 − h(S) αδ . To see that our clustering algorithm can be applied to an (h, k, β, )-clusterable hk . graph G, we will need the following condition: ≥ 2αβ Theorem A implies that in a cluster R of G, most of the vertices u in R have pr(α, u)(S) ≥ 1 − /(2k). This fact is essential in the subsequent proof that Ψ (α) ≥ k − 2 − . A sketched proof for Theorem 1: We note that pr(0, s) = π and pr(1, s) = s for any distribution s. This implies that Φ(0) = Φ(1) = Ψ (0) = 0 and Ψ (1) = n − 1. It is not hard to check that Ψ is
92
F. Chung Graham and A. Tsiatas
an increasing function since Ψ (α) > 0 for α ∈ (0, 1]. The function of particular interest is Φ. Since we wish to find α such that Φ is small, it suffices to check the roots of Φ . Suppose α is a root of Φ . To find k clusters, we can further restrict ourselves to the case of Ψ (α) ≥ k − 2 − by establishing the following claim: Claim: If a graph G can be partitioned into k clusters having Cheeger ratio at most h and ≥ hk/(2αβ), then Ψ (α) ≥ k − 2 − . Before proving the claim, we note that by sampling c log n sets of k vertices from π, for sufficiently large c, the values μ(C) and Ψ (C) for one such random set of k centers are close to Φ(α) and Ψ (α), respectively, with high probability (exponentially decreasing depending on c and β) by probabilistic concentration arguments. In this context, the upper bound for μ(C) implies that the set consisting of distributions pr(α, c) for c ∈ C serves well as the set of centers of mass. Thus, the resulting Voronoi regions using C give the desired clusters. This proves the correctness of our clustering algorithm with high probability for (k, h, β, )-clusterable graphs. Proof of the Claim: 2 dv pr(α, pr(α, v))D−1/2 − πD−1/2 Ψ (α) = v∈V
=
dv pr(α, pr(α, v))D−1/2 − πD−1/2 2
c∈C v∈Rc
=
c∈C v∈Rc
≥
dv
x
2 pr(α, pr(α, v))D−1/2 (x) − πD−1/2 (x) dv
c∈C v∈Rc
=
c∈C v∈Rc
≥
c∈C v∈Rc
≥
c∈C v∈Rc
2 pr(α, pr(α, v))D−1/2 (x) − πD−1/2 (x)
x∈Rc
dv
2 pr(α, pr(α, v))D−1/2 (x) − πD−1/2 (x)
x∈Rc
dv vol(Rc )
pr(α, pr(α, v))(x) − π(x)
2
x∈Rc
dx vol(Rc )
x∈Rc
dv vol(Rc ) 2 1 − /2 − vol(Rc ) vol(G)
vol(Rc ) 2 − 1− 2k vol(G) c∈C
2 1 vol(Rc ) ≥ − 1− k 2k vol(G) =
c∈C
1 k − 1 − )2 ≥ k − 2 − = k 2
Finding and Visualizing Graph Clusters Using PageRank Optimization
93
To illustrate PageRank-ClusteringB, we consider a dumbbell graph U as an example. This graph U has two complete graphs K20 connected by a single edge, yielding a Cheeger ratio of h ≈ 0.0026. Plotting Φ(α) (Fig. 1) and its derivative (Fig. 2) shows that there is a local minimum near α ≈ 0.018. When Ψ is large, many individual nodes have personalized PageRank vectors that differ greatly from the overall distribution. This indicates that there are many nodes that are more representative of a small cluster than the entire graph. By plotting Ψ (α) (Fig. 3) and its derivative (Fig. 4), we can see that there is a distinct inflection point in the plot of Ψ for the dumbbell graph U as well.
6
Fig. 1. Φ(α) for the dumbell graph U
Fig. 2. Φ (α) for the dumbell graph U , with the line y = 0 for reference
Fig. 3. Ψ (α) for the dumbell graph U
Fig. 4. Ψ (α) for the dumbell graph U
A Graph Drawing Algorithm Using PageRank
The visualization of complex graphs provides many computational challenges. Graphs such as the World Wide Web and social networks are known to exhibit ubiquitous structure, including power-law distributions, small-world phenomena, and a community structure [1,6,12]. With large graphs, it is easy for such intricate
94
F. Chung Graham and A. Tsiatas
structures to be lost in the sheer quantity of the nodes and edges, which can result in drawings that reflect a network’s size but not necessarily its structure. Given a set of nodes S, we can extract communities around each node and determine the layout of the graph using personalized PageRank. The arrangement can be done using a force-based graph layout algorithm such as the KamadaKawai algorithm [19]. The goal is to capture local communities; we can do this by assigning edges {s, v} for each s ∈ S and v ∈ V \ S with weight inversely proportional to the personalized PageRank. This way, unrelated nodes with low PageRank will be forced to be distant, and close communities will remain close together. We also add edges {s, s } for s, s ∈ S with large weight to encourage separation of the individual communities. We use an implementation from Graphviz [14]. We note that because force-based algorithms are simulations, they do not guarantee the exact cluster structure, but we will illustrate that it works well in practice. Additionally, there are algorithms specifically designed for clustered graph visualization [10,28] and highlighting high-ranking nodes [4], but they impose a lot of artificial hierarchical structure onto the drawing and often require precomputing the clusters. Once we have a layout for all the nodes in the graph, we can partition them by using a Voronoi diagram. We compute the Voronoi diagram efficiently using Fortune’s algorithm [13]. We tie together personalized PageRank and Voronoi diagrams in the following graph visualization algorithm: PageRank-Display(G, S, α, ) Input: a graph G = (V, E), a seed set S ⊆ V , a jumping constant α ∈ (0, 1], and an approximation factor > 0. 1. For each s ∈ S, compute an approximate PageRank vector p(α, s). 2. Construct a new graph G with vertex set V and edges as follows: – {s, v} for s ∈ S and v ∈ V \ S with weight 1/ps (v), as long as ps (v) > 0. – {s, s } for s, s ∈ S with weight 10 × maxs,v 1/ps (v). 3. Use a force-based display algorithm on G to determine coordinates cv for each v ∈ V . 4. Compute the Voronoi diagram on S. 5. Draw G using the coordinates cv , highlighting S, and overlaying the Voronoi diagram. The jumping constant α is associated with the scale of the clustering. We can determine α either by trial and error or by optimizing Φ and Ψ as in section 4. As long as G is connected, the PageRank vector will be nonzero on every vertex. Using the algorithms from [3,8], the approximation factor acts as a cutoff, and any node v with PageRank less than dv will be assigned zero. This is advantageous because the support of the approximate PageRank vector will be limited to the local community containing its seed. In PageRank-Display, we give weights to the edges equal to 1/ps (v), but this is problematic if ps (v) = 0. In that case, we omit the edge from G entirely.
Finding and Visualizing Graph Clusters Using PageRank Optimization
95
We remark that the selection of will influence the size of the local communities: the subset of nodes with nonzero approximate PageRank has volume at |S| 2 . most (1−α) (see [3]). This implies that a good selection of is O (1−α)vol(G) We also remark that the selection of S is important. If S contains vertices that are not part of communities or two nodes in the same community, then there will be no structure to display. In general, the selection of S is similar to the geometric problem of finding a set of points with minimum covering radius, which can be intractable (see [16]). There are several algorithms that can automatically choose S, including PageRank-Clustering as presented here. We used PageRank-Display to demonstrate the existence of local structure in two real-world datasets. The first is a social network of 62 dolphins [21], and one can see in Fig. 5 that they can be divided into two communities. A more interesting example is shown in Fig. 6. The graph represents games between 114 NCAA Division I American collegiate football teams [15] in 2000. The league is divided into smaller conferences; for each team, about half of its games are against conference opponents. An appropriate selection of the 8 teams in Fig. 6 reveal a partition that separates their conferences, and others are placed on the periphery of the drawing.
Fig. 5. PageRank-Display (α = 0.03) on the dolphin social network [21], separating the dolphins into two communities
Fig. 6. PageRank-Display (α = 0.1) on the football game network [15], highlighting 8 of the major collegiate conferences
References 1. Albert, R., Barab´ asi, A.-L., Jeong, H.: Diameter of the World Wide Web. Nature 401, 130–131 (1999) 2. Andersen, R., Chung, F.: Detecting sharp drops in PageRank and a simplified local partitioning algorithm. In: Cai, J.-Y., Cooper, S.B., Zhu, H. (eds.) TAMC 2007. LNCS, vol. 4484, pp. 1–12. Springer, Heidelberg (2007) 3. Andersen, R., Chung, F., Lang, K.: Local graph partitioning using PageRank vectors. In: Proceedings of the 47th Annual IEEE Symposium on Foundation of Computer Science (FOCS 2006), pp. 475–486 (2006)
96
F. Chung Graham and A. Tsiatas
4. Brandes, U., Cornelsen, S.: Visual ranking of link structures. In: Dehne, F., Sack, J.-R., Tamassia, R. (eds.) WADS 2001. LNCS, vol. 2125, pp. 222–233. Springer, Heidelberg (2001) 5. Brin, S., Page, L.: The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems 30, 107–117 (1998) 6. Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins, A., Wiener, J.: Graph structure in the Web. Computer Networks 33, 1–6 (2000) 7. Chung, F., Horn, P., Tsiatas, A.: Distributing antidote using PageRank vectors. Internet Mathematics 6(2), 237–254 (2009) 8. Chung, F., Zhao, W.: A sharp PageRank algorithm with applications to edge ranking and graph sparsification (Preprint), http://www.math.ucsd.edu/~ fan/wp/sharp.pdf 9. Dyer, M.E., Frieze, A.M.: A simple heuristic for the p-centre problem. Operations Research Letters 3(6), 285–288 (1985) 10. Eades, P., Feng, Q.: Multilevel visualization of clustered graphs. In: Proceedings of the International Symposium on Graph Drawing, pp. 101–112 (1996) 11. Enright, A.J., Van Dongen, S., Ouzounis, C.A.: An efficient algorithm for largescale detection of protein families. Nucleic Acids Research 30(7), 1575–1584 (2002) 12. Faloutsos, M., Faloutsos, P., Faloutsos, C.: On power-law relationships of the Internet topology. In: Proceedings of the Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication (SIGCOMM 1999), pp. 251–262 (1999) 13. Fortune, S.: A sweepline algorithm for Voronoi diagrams. In: Proceedings of the Second Annual Symposium on Computational Geometry, pp. 313–322 (1986) 14. Gansner, E., North, C.: An open graph visualization system and its applications to software engineering. Software — Practice and Experience 30(11), 1203–1233 (2000) 15. Girvan, M., Newman, M.E.J.: Community structure in social and biological networks. Proceedings of the National Academy of Sciences 99(12), 7821–7826 (2002) 16. Guruswami, V., Micciancio, D., Regev, O.: The complexity of the covering radius problem on lattices and codes. Computational Complexity 14(2), 90–120 (2005) 17. Harel, D., Koren, Y.: Graph drawing by high-dimensional embedding. In: Goodrich, M.T., Kobourov, S.G. (eds.) GD 2002. LNCS, vol. 2528, pp. 207–219. Springer, Heidelberg (2002) 18. Jeh, G., Widom, J.: Scaling personalized Web search. In: Proceedings of the 12th International Conference on World Wide Web, pp. 271–279 (2003) 19. Kamada, T., Kawai, S.: An algorithm for drawing general undirected graphs. Information Processing Letters 31(1), 7–15 (1989) 20. Lloyd, S.: Least square quantization in PCM. IEEE Transactions on Information Theory 28(2), 129–137 (1982) 21. Lusseau, D., Schneider, K., Boisseau, O.J., Haase, P., Slooten, E., Dawson, S.M.: The bottlenose dolphin community of Doubtful Sound features a large proportion of long-lasting associations. Behavioral Ecology and Sociobiology 54(4), 396–405 (2003) 22. MacQueen, J.: Some methods for classification and analysis of multivariate observations. In: Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297 (1967) 23. Mancoridis, S., Mitchell, B.S., Chen, Y., Gansner, E.R.: Bunch: a clustering tool for the recovery and maintenance of software system structures. In: Proceedings of the IEEE International Conference on Software Maintenance, pp. 50–59 (1999)
Finding and Visualizing Graph Clusters Using PageRank Optimization
97
24. Moody, J.: Peer influence groups: identifying dense clusters in large networks. Social Networks 23(4), 261–283 (2001) 25. Newman, M.E.J., Girvan, M.: Finding and evaluating community structure in networks. Physical Review E 69, 026113 (2004) 26. Ng, A., Jordan, M., Weiss, Y.: On spectral clustering: analysis and an algorithm. Advances in Neural Information Processing Systems 14(2), 849–856 (2002) 27. Noack, A.: Modularity clustering is force-directed layout. Physical Review E 79, 026102 (2009) 28. Parker, G., Franck, G., Ware, C.: Visualization of large nested graphs in 3D: navigation and interaction. Journal of Visual Languages and Computing 9(3), 299–317 (1998) 29. Rudelson, M., Vershynin, R.: Sampling from large matrices: An approach through geometric functional analysis. Journal of the ACM 54(4), Article 21 (2007) 30. Schaeffer, S.E.: Graph clustering. Computer Science Review 1(1), 27–64 (2007) 31. Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(8), 888–905 (2000)
Improving Random Walk Estimation Accuracy with Uniform Restarts Konstantin Avrachenkov1, Bruno Ribeiro2 , and Don Towsley2 1
2
INRIA, 2004 Route des Lucioles, Sophia-Antipolis, France Dept. of Computer Science, University of Massachusetts Amherst, Amherst, MA
Abstract. This work proposes and studies the properties of a hybrid sampling scheme that mixes independent uniform node sampling and random walk (RW)-based crawling. We show that our sampling method combines the strengths of both uniform and RW sampling while minimizing their drawbacks. In particular, our method increases the spectral gap of the random walk, and hence, accelerates convergence to the stationary distribution. The proposed method resembles PageRank but unlike PageRank preserves time-reversibility. Applying our hybrid RW to the problem of estimating degree distributions of graphs shows promising results.
1
Introduction
Many networks, including on-line social networks (OSNs) and peer-to-peer (P2P) networks, exist for which it is impossible to obtain a complete picture of the network. This leaves researchers with the need to develop sampling techniques for characterizing and searching large networks. Sampling methods can be classified as based on independent uniform sampling or crawling. These two classes of sampling methods have their advantages and drawbacks. Our work proposes and studies the properties of a hybrid sampling scheme that mixes independent uniform node sampling and random walk (RW)-based crawling. We show that our sampling method combines the strengths of both uniform and RW sampling while minimizing their drawbacks. Within the class of uniform sampling methods, uniform node sampling is widely popular and has the advantage of sampling disconnected graphs. In an online social network (OSN) where users are associated with unique numeric IDs, uniform node sampling is performed by querying randomly generated IDs. In a P2P network like Bittorrent, uniform node sampling is performed by querying a tracker server [14]. In practice, however, these samples are expensive (resourcewise) operations (the ID space in an OSN, such as Facebook and MySpace, is large and sparse and tracker queries can be rate-limited [14]). For instance, in MySpace we expect only 10% the IDs to belong to valid users [10], i.e., only one in every ten queries successfully finds a valid MySpace account. Within crawl-based sampling methods, random walk (RW) sampling is among the most popular methods [6,11,12,19,21,24]. Let G = (V, E) be an undirected, R. Kumar and D. Sivakumar (Eds.): WAW 2010, LNCS 6516, pp. 98–109, 2010. c Springer-Verlag Berlin Heidelberg 2010
Improving Random Walk Estimation Accuracy with Uniform Restarts
99
non-bipartite graph with n nodes. RW sampling is preferred because it requires few resources and, when G is connected, can be shown to produce asymptotically unbiased estimates of f (G) = h(v). (1) ∀v∈V
Moreover, when G is connected, a RW visits all nodes in G in O(n3 ) [17] steps w.h.p., a useful property when searching unstructured networks (such as P2P networks). Note that the above formal RW guarantees require G to be connected. In the real-world, however, networks may consist of several disconnected components, e.g. Twitter [23] and Livejournal [21], to cite two known examples. Moreover, the performance of such methods are closely tied to the difference between the largest and the second largest eigenvalues of the associated RW transition probability matrix. This difference is also known as the spectral gap which we denote as δ. More precisely, let (Xt : t = 0, 1, 2, . . . ) be a discrete-time Markov chain associated with a random walk over G with transition probability matrix P = [pij ], ∀i, j ∈ V , where pij = 1/di and di is the degree of node i ∈ V . The eigenvalues of P are 1 = λ1 ≥ λ2 ≥ λ3 ≥ · · · ≥ λn ≥ −1. The spectral gap is defined as δ = (1 − λ2 ). Note that with the help of the lazy random walk modification [18] we can concentrate on the second largest eigenvalue λ2 and not worry about the value of the largest by modulus negative eigenvalue λn . When G is disconnected δ = 0. Typically, even a connected real complex network has small δ, explained by the clustered structure of these networks [15], and as a consequence the RW tends to get “trapped” inside subgraphs. Unfortunately the accuracy of estimates of f (G) (eq. (1)) obtained with a RW is inversely proportional to the spectral gap δ (revisited later in this section). The main contribution of this work is to increase the spectral gap, δ, of a RW by combining it with independent uniform node sampling. Although node sampling can be very expensive, when combined with RW sampling, the resulting algorithm can significantly reduce the estimation error and search time at a negligible increase in overhead. The idea is as follows, add auxiliary edges among all the nodes with weight α/n > 0. The hybrid sampling scheme, which we call RWuR, corresponds to a random walk on this modified graph. Our results show a significant increase in the spectral gap δ when our hybrid scheme is used (even when α is small). Note that when α = 0 we RWuR is a regular RW and in the limit when α → ∞ RWuR performs independent uniform sampling. In what follows we revisit the connection between the spectral gap, δ, and estimation errors; connecting δ with the Mean Squared Error (MSE) of the RW estimates.
Spectral Gap and Estimation Error For now we assume G is connected. Suppose we wish to estimate f (G) of eq. (1) from B sampled nodes obtained by a stationary RW, (X0 , X1 , . . . , XB−1 ).
100
K. Avrachenkov, B. Ribeiro, and D. Towsley
A widely used estimator of f (G) is [21,24] B−1 1 fˆ = h (Xt ) , Xt ∈ V B t=0
(2)
where h (v) = h(v)/πv , v ∈ V , and πv is the stationary distribution of the random walk. We note that if the graph is undirected, the RW is time reversible and its stationary distribution is given by πi =
di , 2|E|
∀i ∈ V.
(3)
To simplify the notation we drop the dependence of f on G. The MSE of fˆ is given by E[(fˆ − f )2 ]. Now we explore the connection between the spectral gap, δ, and the MSE of fˆ. In a stationary RW, fˆ is an unbiased estimator of f [24]. Thus, the Mean Squared Error (MSE) is also the variance of fˆ. Let varπ (fˆ) denote the variance of fˆ in a stationary RW and E[(fˆ − f )2 ] = var(fˆ) + bias(fˆ), where bias(fˆ) = (E[fˆ] − f )2 . The assymptotic ratio between (var(fˆ) + bias(fˆ)) and varπ (fˆ) is a function of the spectral gap δ [1, Chapter 4.1] 1 + λ2 var(fˆ) + bias(fˆ) 2−δ = . = B→∞ 1 − λ2 δ varπ (fˆ)
sup lim f
Note that δ also determines the mixing time of the RW [22], which means that when δ 1 (i.e., λ2 ≈ 1) it takes many steps for the random walk to converge to the stationary distribution (a potential source of bias in fˆ when the RW does not start in steady state). We now turn our attention to finding the relationship between varπ (fˆ) and δ. Consider the class of f (G) functions that measures θk , the fraction of nodes with degree k, Δ fk (G) = wk (v), ∀v∈V
where wk (v) = 1(dv = k)/n, 1(x = y) = 1 if x = y and 1(x = y) = 0, otherwise. h from eq. (1). Let Δ σ 2 = lim B (var(fˆ) + bias(fˆ)). B→∞
From the inequality [1, Chapter 4, Proof of Proposition 29] (condition ∀v∈V h(v) = 0 of [1, Chapter 4, Proposition 29] is not necessary in our scenario) σ2 2 h 22 2 δ 1− ≤ varπ (fˆ) ≤ 1+ (4) B δB δB 2B we have a relationship between the MSE of gˆ and the spectral gap δ. Eq. (4) shows that the estimator fˆ can have large MSE even in the stationary regime
Improving Random Walk Estimation Accuracy with Uniform Restarts
101
despite the fact that it has no bias in the stationary regime. Therefore, a larger value of δ not only helps to accelerate the convergence to stationarity but also it decreases MSE in the stationary regime. Let wk (v) = wk (v)/πv , v ∈ V . Note that wk (v)2 1 1(dv = k)
wk 22 = = . πv n n πv ∀v∈V ∀v∈V Let Πk = ∀v∈V 1(dv = k)πv . As nπv(k) ≥ Πk ,
wk 22 =
θk θk ≤ , nπv(k) Πk
which yields 2θk varπ (fˆk ) ≤ Πk δB
δ 1+ 2B
.
(5)
Eq. (5) shows that the error in estimating the fraction of nodes with degree k is uper bounded by the inverse of: (1) the spectral gap δ and (2) the probability that the RW finds a node with degree k, Πk . In Section 2 we see that increasing parameter α of our hybrid RW increases δ but also decreases Πk when k is larger than the average degree. This tradeoff can be seen in the experiments of Section 3, where the MSE of the fraction of high degree nodes, at first, decreases as we increase α until a certain (unknown optimal) point where a further increase in α increases the MSE. As future work we will investigate this optimal value of α.
2
Reducing Mixing Time by Restart
As we have observed in the previous section, many complex networks have small spectral gaps and hence random walks on such networks can have a negative impact on the accuracy (variance) and bias of the estimates. In this work we are interested in methods for accelerating the rate of convergence to the stationary distribution based on the addition of auxiliary transitions. Thus, the natural first method to investigate is PageRank [9]. Namely, a random walk follows some outgoing link with probability c and with probability 1 − c it jumps to an arbitrary node of the network chosen according to the uniform distribution. Then, the modified random walk is described by the following transition matrix 1 (6) P˜ = cP + (1 − c) 1 1T , n where 1 is a vector of ones with an appropriate dimension. Then, the stationary distribution of the modified random walk π ˜ (c) is a unique solution of the following equations π ˜ (c)P˜ = π ˜ (c), π ˜ (c)1 = 1. It is known [13] that the second largest eigenvalue of matrix P˜ is equal to c. Thus, by choosing c not close to one, we can significantly increase the spectral grap δ.
102
K. Avrachenkov, B. Ribeiro, and D. Towsley
However, as was observed for example in [16], PageRank’s steady state distribution can be just weakly correlated with node degree. Furthermore, there are cases when ranking of nodes according to PageRank is sensitive to the change of parameter c [8]. Thus, the stationary distribution of the modified random walk can be significantly distorted. To mitigate the latter problem, we suggest a variation where we connect all the nodes in the graph with a weight α/n. The difference with PageRank is that in our variation the uniform restart occurs not with a fixed probability but with a probability depending on the node degree. Specifically, the transition probability in our modification is given by α/n+1 di +α , if i has a link to j, pˆij = (7) α/n di +α , if i does not have a link to j. The advantage of such a modification is that the new random walk is also reversible with the following stationary distribution π ˆi (α) =
di + α 2|E| + nα
∀i ∈ V,
(8)
from which the original stationary distribution (3) can easily be retrieved. Let us now show that this second method also improves algebraic connectivity. First, we consider a regular graph where all nodes have degree d. Theorem 2.1. Let G = (V, E) be an undirected regular graph with degree d. ˆ k (α) and λk be the eigenvalues of the Markov chains associated with the Let λ modified and original random walks on G, respectively. Then, all the eigenvalues corresponding to the modified random walk except the unit eigenvalue are scaled as follows: ˆ k (α) = d λk , k = 2, ..., n. (9) λ d+α Proof. The transition probabilities of the modified random walk (7) can be written in the following matrix form d α T Pˆ = P+ 11 , d+α dn where P is the transition matrix corresponding to the original random walk. Since 1 is an eigenvector corresponding to the unit eigenvalue, we can apply Brauer’s Theorem [7]. Brauer’s Theorem says that if λ is an eigenvalue and x the corresponding eigenvector of matrix A then λ + v T x is an eigenvalue of matrix A + xv T for any vector v and the other eigenvalues of A + xv T coincide with α 11T has the following eigenvalues: the eigenvalues of A. Thus, matrix P + dn 1 + α/d, λ2 , ..., λn . Since the eigenvalues of a matrix multiplied by a scalar are the eigenvalues of that matrix multiplied by the same scalar, the eigenvalues of matrix Pˆ are: 1, d/(d + α)λ2 , ..., d/(d + α)λn . If we expand d/(d+ α) in (9) into a power series with respect to α we can rewrite (9) as follows: ˆ 2 (α) = 1 − α λ2 + o(α). λ d
Improving Random Walk Estimation Accuracy with Uniform Restarts
103
Thus, for small values of α the spectral gap can be approximated as follows: δ≈
α λ2 . d
(10)
Let us now consider the case of a general undirected graph. Theorem 2.2. Let G = (V, E) be a general undirected graph. Then, the second ˆ2 (α) of the modified random walk has the following connection largest eigenvalue λ with the second largest eigenvalue λ2 of the original random walk
n n n 1 1 k=1 dk u2k v2k k=1 dk u2k j=1 v2j ˆ n α λ2 + α + o(α), (11) λ2 (α) = 1 − n n k=1 u2k v2k k=1 u2k v2k where u2 and v2 are respectively left and right Fiedler eigenvectors of the original graph. Proof. Let us analyze the equation ˆ 2 (α)ˆ Pˆ (α)ˆ v2 (α) = λ v2 (α)
(12)
with the help of perturbation theory techniques [2,5]. We expand Pˆ (α) as a power series with respect to α. Namely, we write Pˆ (α) = P + αΓ (1) + α2 Γ (2) + ... ,
(13)
where in particular the coefficient of the first order term is given by 1 1 T (1) Γ = diag 11 − P , (14) dk n where diag d1k is a diagonal matrix with the elements d1k on the diagonal. We ˆ 2 (α) and vˆ2 (α) as power series also expand λ ˆ 2 (α) = λ(0) + αλ(1) + α2 λ(2) + ... λ
(15)
vˆ2 (α) = v (0) + αv (1) + α2 v (2) + ... .
(16)
and ˆ2 (α) and vˆ2 (α) in the form of power series (13), (15) Next we substitute Pˆ (α), λ and (16) into equation (12). Thus, we have (P + αΓ (1) + α2 Γ (2) + ...)(v (0) + αv (1) + α2 v (2) + ...) = (λ(0) + αλ(1) + α2 λ(2) + ...)(v (0) + αv (1) + α2 v (2) + ...). Equating terms with the same powers of α yields P v (0) = λ(0) v (0) .
(17)
104
K. Avrachenkov, B. Ribeiro, and D. Towsley
Since we are interested in the second largest eigenvalue of Pˆ (α), we conclude that v (0) = v2 and λ(0) = λ2 , where v2 is the eigenvector of P corresponding to the second largest eigenvalue λ2 of P . Then, collecting terms with α, we get Γ (1) v2 + P v (1) = λ(1) v2 + λ2 v (1) .
(18)
Premultiplication of equation (18) by the left eigenvector uT2 corresponding to the second largest eigenvalue λ2 (uT2 P = λ2 uT2 ) leads to (1)
uT2 Γ (1) v2 = λ2 uT2 v2 . or λ(1) =
uT2 Γ (1) v2 λ2 uT2 v2
Let us consider in more detail the expression uT2 Γ (1) v2 . 1 1 1 uT2 Γ (1) v2 = uT2 diag 1 1T v2 − uT2 diag P v2 n dk dk =
n n n 1 1 1 u2k v2j − u2k v2k λ2 n dk dk j=1 k=1
k=1
In the latter equality we use the fact that P v2 = λ2 v2 . Thus, we obtain formula (11). Even though Theorem 2.2 provides a connection between the second largest eigenvalues of the original and modified graphs, we cannot readily deduce from expression (11) if the spectral gap actually decreases. Therefore, next we analyse a typical case where we can obtain more insight from formula (11). We note that v2 and uT2 are Fiedler vectors. The Fiedler vectors indicate principal clusters of the original graph. Let us represent the transition matrix for the original graph in the following form P1 0 P = P (0) + εC = + εC, (19) 0 P2 where P1 , P2 represent the transitions inside the principal clusters and εC represents transitions between the principal clusters. We choose the blocks P1 and P2 to be transition matrices, which means that some elements of εC corresponding to the blocks P1 and P2 are negative. Of course, all elements of the sum are non-negative. Now we are ready to state the next result. Theorem 2.3. Given that the original graph has two principal Fiedler compo¯ the following connection between the nents with the same average node degree d, eigenvalues of the modified and original graphs take place α ˆ λ2 (α) = 1 − (20) λ2 + o(α) + O(ε, |E1 |−1 , |E2 |−1 ). d¯
Improving Random Walk Estimation Accuracy with Uniform Restarts
Proof. The proof can be found in our technical report [3].
105
In particular, we conclude from (20) that for small values of α the value of the spectral gap can be approximated as follows: δ≈
α λ2 . d¯
(21)
Comparing (10) and (21), it is curious to observe that in the case of the general undirected graph the parameter d in (10) is replaced by the average node ¯ degree d. The expression (21) provides simple guidelines for the choice of α. For instance, if the original graph has spectral gap very close to zero and we would like the spectral gap of the modified graph to be approximately equal to 0.1 we ¯ choose α = 0.1d.
3
Numerical Results
In the following preliminary experiments we use one real-world graph and one random graph. The real-world graph has 5, 204, 176 nodes and 77, 402, 652 edges and was collected in a nearly complete crawl of the Livejournal social blog network [20]. The Livejournal graph has average degree 14.6 and a giant strongly connected component with 5, 189, 809 nodes and the remaining nodes form a number of small connected components. The random graph is created by connecting, with one edge, two Barabási-Albert graphs [4], G1 and G2 , with average degrees 2 and 10, respectively. We call the former the BA2 graph. A RW over the BA2 graph resembles the RW transition probability matrix described in eq. (19), where Pi , i = 1, 2, are the transition probability matrices of the RW on each Barabási-Albert graph (G1 and G2 , respectively) and is small. Different from the example shown in Section 2, the average degrees of G1 and G2 are different. Our goal is to compare Random Walks (RWs) against Random Walks with uniform Restarts (RWuRs) in estimating the degree distribution of the graph, i.e., we seek to estimate Θk , the fraction of nodes with degree greater than k. ˆk be the estimated value of Θk .We use Let Θ
ˆk − Θk )2 /Θk NMSEk = E (Θ (22) to measure the estimation accuracy. Parameters: Our experiments have the following parameters. The sampling budget B, which is used in both RW and RWuR. The sampling budget of a RW determines the number of steps. The sampling budget of RWuR does not directly determine the number of steps. This is because there is a sampling budget penalty, c, associated with each restart. For instance, a RWuR that performs m uniform restarts walks B − mc steps and gathers B − mc + m observations. In all our experiments we use B = n/100 (i.e., the budget is 1% of the total number of nodes in the graph).
106
K. Avrachenkov, B. Ribeiro, and D. Towsley
Initial RW states: All experiments initialize RW in steady state while RWuR is initialized from uniformly sampled nodes (i.e., RWuR does not start in steady state). This initialization favors RW over our RWuR algorithm. Still, as seen next, our RWuR algorithm outperforms RW in all scenarios (for a given choice of α). RW RWuR (α=0.01) RWuR (α=10)
1
0.2 0.5 NMSE
NMSE
0.1 0.2 0.1 0.02 0.05 10-2
RW RWuR (α=0.1) RWuR (α=10)
0.03 1
10
2
10 degree
(a) Livejournal
10
3
10
4
2
10
2
10 vertex in-degree
10
3
10
4
(b) BA2
Fig. 1. Estimation error of RW and RWuR with varying α (larger is worst)
Our first experiment is based on an undirected version of the Livejournal graph. Figure 1(a) shows the empirical NMSE, eq. (22), on the Livejournal graph obtained from 20, 000 runs, RWuR restart weights α = 0.01, 10, and RWuR restart penalty c = 10. We choose c = 10 to match the 1/10 hit ratio of MySpace’s uniform node sampling [10]. We observe that estimates obtained with RWuR α = 0.01 are more precise than the estimates obtained with RW (particularly for small degree nodes and with almost no difference for high degree nodes). Thus, RWuR is able to reduce the NMSE even when restarts are rare (i.e., α is small). We perform the same experiment with increased restart weight α = 10 (Figure 1(a)) and observe that increasing α also increases the accuracy of RWuR for estimating the head of the distribution but decreases the accuracy at estimating its tail than both RW and RWuR with α = 0.01. Note that as we increase α RWuR gets closer to perfoming independent uniform node sampling. We also perform the same experiment over the BA2 graph (with the only difference that the smallest restart weight is now α = 0.1). Note that the BA2 graph has a clear RW bottleneck (the edge that connects the two otherwise disconnected components). Figure 1(b) shows the empirical NMSE. Unsurprisingly, the improvement in NMSE obtained by RWuR against RWs is even more pronounced than the improvement observed in our Livejournal experiments. Note that here, similar to the Livejournal experiment, increasing the restart weight from α = 0.1 to α = 10 also increases the accuracy of RWuR for estimating the head of the distribution but decreases its accuracy at estimating the tail. The above empirical observations, on the relationship between the NMSE of the degree distribution tail and α, prompt us to revisit the analysis performed at the end of Section 1. Putting together eqs. (5) and (22) yield
Improving Random Walk Estimation Accuracy with Uniform Restarts
107
1 , (23) Πk δ where Πk is as defined at the end of Section 1. From Section 2 we know that k+α . 1(dv = k) Πk = 2|E| + nα NMSEk ∝
∀v∈V
Thus, when k > d¯ (d¯ is the average degree), Πk decreases with α which implies (eq. (23)) that the NMSE increases with α. Similarly, when k < d¯ the NMSE decreases with α. Now let us look at the spectral gap δ. Section 2 shows that δ = 1 − d/(d + α)λ2 for a d-regular graph and δ ≈ αλ2 /d¯ for graphs with two quasi-disconnected components that have same average degree, assuming α to be small. These results indicate that δ increases with α. As one increases α, when k > d¯ the tradeoff between Πk increasing and δ decreasing the NMSE can ¯ explain the behavior observed in our numerical results. Similarly when k < d, the NMSE monotonically decreases with α, which can also be observed in our numerical results. 2
Fig. 2. Estimation error of RWuR with varying restart cost c (larger is worse). (a) Livejournal (α = 0.01): NMSE vs. degree, curves RW, RWuR (c = 1), RWuR (c = 100). (b) BA2 (α = 0.1): NMSE vs. vertex in-degree, curves RW, RWuR (c = 1), RWuR (c = 10), RWuR (c = 100), RWuR (c = 1000). [Plot data omitted.]
Effect of the restart cost c: The uniform restarts required by our RWuR algorithm can be expensive. In MySpace [10] one needs to perform, on average, 10 queries in order to obtain a valid address that can be used to restart the RWuR. In the following experiments we compare RW and RWuR with different restart costs c. Figure 2(a) shows the empirical NMSE of RW and RWuR (c = 1, 100; α = 0.01) on the Livejournal graph. The curves in Figure 2(a) all lie on top of each other for degrees higher than 120; we notice no loss in NMSE when increasing the restart cost. This is expected, as restarts are rare when α = 0.01. Figure 2(b) shows the empirical NMSE of RW and RWuR (c = 1, 10, 100, 1000; α = 0.1) on the BA2 graph. Observe that RWuR outperforms RW when c = 1, 10, 100; the exception happens when c = 1000. Thus, in a graph with a strong RW bottleneck, RWuR is the estimation method of choice even if the cost of a restart is high (e.g., c = 100).
4 Conclusions and Future Work
Our work proposed and studied the properties of a hybrid sampling scheme (RWuR) that mixes independent uniform node sampling and random walk (RW)-based crawling. Our sampling method combines the strengths of both uniform and RW sampling while minimizing their drawbacks. RWuR can be used in any OSN or P2P network that allows uniform node sampling (usually at a premium cost), such as MySpace, Facebook, and BitTorrent. We have formally shown, under two scenarios, that when compared to a regular RW, RWuR has a larger spectral gap and consequently a reduced mixing time. We also observe that RWuR has a positive impact on reducing the estimation error when compared to regular RWs. As part of our future work we plan to investigate the use of RWuRs to reduce the time to search for a file in a P2P network or a user in an OSN.
Acknowledgements. This research was sponsored in part by the U.S. Army Research Laboratory under Cooperative Agreement Number W911NF-09-2-0053. The views and conclusions contained in this document are those of the authors and do not represent the official policies, either expressed or implied, of the U.S. Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon. This research was also sponsored in part by the European Commission through the ECODE project (INFSO-ICT-223936) of the European Seventh Framework Programme (FP7).
References
1. Aldous, D., Fill, J.A.: Reversible Markov Chains and Random Walks on Graphs. Book in preparation (1995), http://www.stat.berkeley.edu/~aldous
2. Avrachenkov, K.: Analytic perturbation theory and its applications. PhD Thesis, University of South Australia (1999)
3. Avrachenkov, K., Ribeiro, B., Towsley, D.: Improving random walk search and estimation accuracy with uniform restarts. Tech. rep., INRIA Research Report no. 7394 (2010), http://hal.inria.fr
4. Barabási, A.L., Albert, R.: Emergence of scaling in random networks. Science 286, 509–512 (1999)
5. Baumgärtel, H.: Analytic perturbation theory for matrices and operators. Birkhäuser, Basel (1985)
6. Bisnik, N., Abouzeid, A.A.: Optimizing random walk search algorithms in P2P networks. Computer Networks 51(6), 1499–1514 (2007)
7. Brauer, A.: Limits for the characteristic roots of a matrix, IV: Applications to stochastic matrices. Duke Math. J. 19, 75–91 (1952)
8. Bressan, M., Peserico, E.: Choose the damping, choose the ranking? In: Avrachenkov, K., Donato, D., Litvak, N. (eds.) WAW 2009. LNCS, vol. 5427, pp. 76–89. Springer, Heidelberg (2009)
9. Brin, S., Page, L.: The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems 30, 107–117 (1998)
10. Gauvin, W., Ribeiro, B., Liu, B., Towsley, D., Wang, J.: Measurement and gender-specific analysis of user publishing characteristics on MySpace. IEEE Network Special Issue on Online Social Networks (2010)
11. Gkantsidis, C., Mihail, M.: Hybrid search schemes for unstructured peer-to-peer networks. In: Proceedings of IEEE INFOCOM, pp. 1526–1537 (2005)
12. Gkantsidis, C., Mihail, M., Saberi, A.: Random walks in peer-to-peer networks: algorithms and evaluation. Perform. Eval. 63(3), 241–263 (2006)
13. Haveliwala, T., Kamvar, S.: The second eigenvalue of the Google matrix. Tech. Rep., Stanford (2003), http://ilpubs.stanford.edu:8090/582/
14. Konrath, M.A., Barcellos, M.P., Mansilha, R.B.: Attacking a swarm with a band of liars: evaluating the impact of attacks on BitTorrent. In: Proc. of the IEEE International Conference on Peer-to-Peer Computing, pp. 37–44 (2007)
15. Leskovec, J., Lang, K.J., Dasgupta, A., Mahoney, M.W.: Statistical properties of community structure in large social and information networks. In: Proc. of the WWW, pp. 695–704 (2008)
16. Litvak, N., Scheinhardt, W., Volkovich, Y., Zwart, B.: Characterization of tail dependence for in-degree and pagerank. In: Avrachenkov, K., Donato, D., Litvak, N. (eds.) WAW 2009. LNCS, vol. 5427, pp. 90–103. Springer, Heidelberg (2009)
17. Lovász, L.: Random walks on graphs: a survey. Combinatorics 2, 1–46 (1993)
18. Lovász, L., Simonovits, M.: Random walks in a convex body and an improved volume algorithm. Random Struct. Alg. 4, 359–412 (1993)
19. Lv, Q., Cao, P., Cohen, E., Li, K., Shenker, S.: Search and replication in unstructured peer-to-peer networks. In: Proc. of the 16th International Conference on Supercomputing, pp. 84–95 (2002)
20. Mislove, A., Marcon, M., Gummadi, K.P., Druschel, P., Bhattacharjee, B.: Measurement and analysis of online social networks. In: Proc. of the IMC (October 2007)
21. Ribeiro, B., Towsley, D.: Estimating and sampling graphs with multidimensional random walks. In: Proc. of the ACM SIGCOMM IMC (October 2010)
22. Sinclair, A.: Improved bounds for mixing rates of Markov chains and multicommodity flow. Combinatorics, Probability and Computing 1, 351–370 (1992)
23. Twitter (2010), http://twitter.com
24. Volz, E., Heckathorn, D.D.: Probability based estimation theory for Respondent-Driven Sampling. Journal of Official Statistics (2008)
The Geometric Protean Model for On-Line Social Networks

Anthony Bonato 1, Jeannette Janssen 2, and Pawel Pralat 3,⋆

1 Department of Mathematics, Ryerson University, Toronto, Canada
[email protected]
2 Department of Mathematics and Statistics, Dalhousie University, Halifax, Canada
[email protected]
3 Department of Mathematics, West Virginia University, Morgantown, USA
[email protected]
Abstract. We introduce a new geometric, rank-based model for the link structure of on-line social networks (OSNs). In the geo-protean (GEO-P) model for OSNs, nodes are identified with points in Euclidean space, and edges are stochastically generated by a mixture of the relative distance of nodes and a ranking function. With high probability, the GEO-P model generates graphs satisfying many observed properties of OSNs, such as power law degree distributions, the small world property, a densification power law, and bad spectral expansion. We introduce the dimension of an OSN based on our model, and examine this new parameter using actual OSN data.
1 Introduction
On-line social networking sites such as Facebook, Flickr, LinkedIn, MySpace, and Twitter are examples of large-scale, complex, real-world networks, with an estimated total number of users that equals half of all Internet users [2]. We may model an OSN by a graph with nodes representing users and edges corresponding to friendship links. While OSNs gain increasing popularity among the general public, there is a parallel increase in interest in the cataloguing and modelling of their structure, function, and evolution. OSNs supply a vast and historically unprecedented record of large-scale human social interactions over time. The availability of large-scale social network data has led to numerous studies that revealed emergent topological properties of OSNs. For example, the recent study [15] crawled the entire Twitter site and obtained 41.7 million user profiles
⋆ The authors gratefully acknowledge support from MITACS, NSERC, Ryerson, and WVU.
and 1.47 billion social relations. The next challenge is the design and rigorous analysis of models simulating these properties. Graph models were successful in simulating properties of other complex networks such as the web graph (see the books [4,8] for surveys of such models), and it is thus natural to propose models for OSNs. Few rigorous models for OSNs have been posed and analyzed, and there is no universal consensus on which properties such models should simulate. Notable recent models are those of Kumar et al. [14], Lattanzi and Sivakumar [16], and the Iterated Local Transitivity model [5]. Researchers are now in the enviable position of observing how OSNs evolve over time, and as such, network analysis and models of OSNs typically incorporate time as a parameter. While by no means exhaustive, some of the main observed properties of OSNs include the following.

(i) Large-scale. OSNs are examples of complex networks with the number of nodes (which we write as n) often in the millions; further, some users have disproportionately high degrees. For example, each of the nodes of Twitter corresponding to the celebrities Ashton Kutcher, Ellen Degeneres, and Britney Spears has degree over five million [23].

(ii) Small world property and shrinking distances. The small world property, introduced by Watts and Strogatz [25], is a central notion in the study of complex networks (see also [13]). The small world property demands a low diameter of O(log n), and a higher clustering coefficient than found in a binomial random graph with the same number of nodes and same average degree. Adamic et al. [1] provided an early study of an OSN at Stanford University, and found that the network has the small world property. Similar results were found in [2], which studied Cyworld, MySpace, and Orkut, and in [21], which examined data collected from Flickr, YouTube, LiveJournal, and Orkut. Low diameter (of 6) and high clustering coefficient were reported in Twitter by both Java et al. [12] and Kwak et al. [15]. Kumar et al. [14] reported that in Flickr and Yahoo!360 the diameter actually decreases over time. Similar results were reported for Cyworld in [2]. Well-known models for complex networks such as preferential attachment or copying models have logarithmically growing diameters with time. Various models (see [17,18]) have been proposed that simulate power law degree distributions and decreasing distances.

(iii) Power law degree distributions. In a graph G of order n, let N_k be the number of nodes of degree k. The degree distribution of G follows a power law if N_k is proportional to k^{−b}, for a fixed exponent b > 2. Power laws were observed over a decade ago in subgraphs sampled from the web graph, and are ubiquitous properties of complex networks (see Chapter 2 of [4]). Kumar, Novak, and Tomkins [14] studied the evolution of Flickr and Yahoo!360, and found that these networks exhibit power law degree distributions. Power law degree distributions for both the in- and out-degree distributions were documented in Flickr, YouTube, LiveJournal, and Orkut [21], as well as in Twitter [12,15].

(iv) Bad spectral expansion. Social networks often organize into separate clusters in which the number of intra-cluster links is significantly higher than the number of inter-cluster links. In particular, social networks contain communities (characteristic of
social organization), where tightly knit groups correspond to the clusters [22]. As a result, it is reported in [9] that social networks, unlike other complex networks, possess bad spectral expansion properties, realized by small gaps between the first and second eigenvalues of their adjacency matrices.

Our main contributions in the present work are twofold: first, to provide a model, the geo-protean (GEO-P) model, which provably satisfies each of the properties (i) to (iv) above (see Section 3; note that while the model does not generate graphs with shrinking distances, its parameters can be adjusted to give constant diameter); and second, to suggest a reverse engineering approach to OSNs. Given only the link structure of OSNs, we ask whether it is possible to infer the hidden reality of such networks. Can we group users with similar attributes from only the link structure? For instance, a reasonable assumption is that out of the millions of users on a typical OSN, if we could assign the users various attributes such as age, sex, religion, geography, and so on, then we should be able to identify individuals or at least small sets of users by their set of attributes. Thus, if we can infer a set of identifying attributes for each node from the link structure, then we can use this information to recognize communities and understand connections between users.

Characterizing users by a set of attributes leads naturally to a vector-based or geometric approach to OSNs. In geometric graph models, nodes are identified with points in a metric space, and edges are introduced by probabilistic rules that depend on the proximity of the nodes in the space. We envision OSNs as embedded in a social space, whose dimensions quantify user traits such as interests or geography; for instance, nodes representing users from the same city or in the same profession would likely be closer in social space. A first step in this direction was given in [19], which introduced a rank-based model in an m-dimensional grid for social networks (see also the notion of social distance provided in [24]). Such an approach was taken in the geometric preferential attachment models of Flaxman et al. [10], and in the SPA geometric model for the web graph [3]. The geo-protean model incorporates a geometric view of OSNs, and also exploits ranking to determine the link structure: higher ranked nodes are more likely to receive links. A formal description of the model is given in Section 2. Results on the model are summarized in Section 3. We present a novel approach to OSNs by assigning them a dimension; see the formula (4). Given certain OSN statistics (order, power law exponent, average degree, and diameter), we can assign each OSN a dimension based on our model. The dimension of an OSN may be roughly defined as the least integer m such that we can accurately embed the OSN in m-dimensional Euclidean space. Proofs of some of our results are presented in Section 4; the full version of the paper will contain proofs of all the results.
2 The GEO-P Model for OSNs
We now present our model for OSNs, which is based on both the notions of embedding the nodes in a metric space (geometric), and a link probability based
on a ranking of the nodes (protean). We identify the users of an OSN with points in m-dimensional Euclidean space. Each node has a region of influence, and nodes may be joined with a certain probability if they land within each other's region of influence. Nodes are ranked by their popularity from 1 to n, where n is the number of nodes and 1 is the highest rank. Nodes that are ranked higher have larger regions of influence, and so are more likely to acquire links over time. For simplicity, we consider only undirected graphs. The number of nodes n is fixed but the model is dynamic: at each time-step, a node is born and one dies. A static number of nodes is more representative of the reality of OSNs, as the number of users in an OSN would typically have a maximum (an absolute maximum arises from roughly the number of users on the internet, not counting multiple accounts). For a discussion of ranking models for complex networks, see [11,20].

We now formally define the GEO-P model. The model produces a sequence (G_t : t ≥ 0) of undirected graphs on n nodes, where t denotes time. We write G_t = (V_t, E_t). There are four parameters: the attachment strength α ∈ (0, 1), the density parameter β ∈ (0, 1 − α), the dimension m ∈ N, and the link probability p ∈ (0, 1]. Each node v ∈ V_t has rank r(v, t) ∈ [n] (we use [n] to denote the set {1, 2, . . . , n}). The rank function r(·, t) : V_t → [n] is a bijection for all t, so every node has a unique rank. The highest ranked node has rank equal to 1; the lowest ranked node has rank n. The initialization and update of the ranking is done by random initial rank. (Other ranking schemes may also be used.) In particular, the node added at time t obtains an initial rank R_t which is randomly chosen from [n] according to a prescribed distribution. Ranks of all nodes are adjusted accordingly. Formally, for each v ∈ V_{t−1} ∩ V_t,

r(v, t) = r(v, t − 1) + δ − γ,

where δ = 1 if r(v, t − 1) > R_t and 0 otherwise, and γ = 1 if the rank of the node deleted in step t is smaller than r(v, t − 1), and 0 otherwise.

Let S be the unit hypercube in R^m, with the torus metric d(·, ·) derived from the L_∞ metric. In particular, for any two points x and y in R^m,

d(x, y) = min{ ||x − y + u||_∞ : u ∈ {−1, 0, 1}^m }.

The torus metric thus "wraps around" the boundaries of the unit cube, so every point in S is equivalent. The torus metric is chosen so that there are no boundary effects, and altering the metric will not significantly affect the main results. To initialize the model, let G_0 = (V_0, E_0) be any graph on n nodes that are chosen from S. We define the influence region of node v at time t ≥ 0, written R(v, t), to be the ball around v with volume

|R(v, t)| = r(v, t)^{−α} n^{−β}.

For t ≥ 1, we form G_t from G_{t−1} according to the following rules. 1. Add a new node v that is chosen uniformly at random from S. Next, independently, for each node u ∈ V_{t−1} such that v ∈ R(u, t − 1), an edge vu is
created with probability p. Note that the probability that u receives an edge is equal to p r(u, t − 1)^{−α} n^{−β}. The negative exponent (−α) guarantees that nodes with higher ranks (r(u, t − 1) close to 1) are more likely to receive new edges than lower ranked nodes. 2. Choose uniformly at random a node u ∈ V_{t−1}, and delete u and all edges incident to u. 3. Update the ranking function r(·, t) : V_t → [n]. Since the process is an ergodic Markov chain, it converges to a stationary distribution. The random graph corresponding to this distribution with given parameters α, β, m, p is called the geo-protean (or GEO-P model) graph, and is written GEO-P(α, β, m, p).
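As an illustration, here is a minimal Python sketch (ours, and deliberately simplified: ties in the rank shift and the exact update order are handled loosely) of one birth/death step of the process just defined:

    import random

    def torus_dist(x, y):
        # L-infinity distance on the unit torus
        return max(min(abs(a - b), 1.0 - abs(a - b)) for a, b in zip(x, y))

    def radius(r, n, alpha, beta, m):
        # an L-infinity ball of volume r^(-alpha) * n^(-beta) has side
        # volume^(1/m), hence radius volume^(1/m) / 2
        return (r ** (-alpha) * n ** (-beta)) ** (1.0 / m) / 2.0

    def geo_p_step(pos, rank, edges, new_id, n, alpha, beta, m, p):
        """One birth/death step of GEO-P(alpha, beta, m, p) (sketch)."""
        # birth: uniform point, uniform random initial rank; shift others down
        x_v = tuple(random.random() for _ in range(m))
        r_v = random.randint(1, n)
        for u in rank:
            if rank[u] >= r_v:
                rank[u] += 1
        # link to each u whose influence region contains the newcomer, w.p. p
        for u, x_u in pos.items():
            if torus_dist(x_v, x_u) <= radius(rank[u], n, alpha, beta, m):
                if random.random() < p:
                    edges.add(frozenset((u, new_id)))
        pos[new_id], rank[new_id] = x_v, r_v
        # death: delete a uniformly random old node and close the rank gap
        victim = random.choice([w for w in pos if w != new_id])
        edges -= {e for e in edges if victim in e}
        gone = rank.pop(victim); pos.pop(victim)
        for w in rank:
            if rank[w] > gone:
                rank[w] -= 1
        return new_id + 1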
3 Results and Dimension
We now state the main theoretical results we discovered for the geo-protean model; proofs are supplied in the next section. The model generates with high probability graphs satisfying each of the properties (i) to (iv) in the introduction. Throughout, we will use the stronger notion of wep in favour of the more commonly used aas, since it simplifies some of our proofs. We say that an event holds with extreme probability (wep) if it holds with probability at least 1 − exp(−Θ(log^2 n)) as n → ∞. Thus, if we consider a polynomial number of events that each holds wep, then wep all events hold. Let N_k = N_k(n, p, α, β) denote the number of nodes of degree k, and let N_{≥k} = Σ_{l≥k} N_l. The following theorem demonstrates that the geo-protean model generates power law graphs with exponent

b = 1 + 1/α.    (1)

Note that the variables N_{≥k} represent the cumulative degree distribution, so the degree distribution of these variables has power law exponent 1/α.

Theorem 1. Let α ∈ (0, 1), β ∈ (0, 1 − α), m ∈ N, p ∈ (0, 1], and n^{1−α−β} log^{1/2} n ≤ k ≤ n^{1−α/2−β} log^{−2α−1} n. Then wep GEO-P(α, β, m, p) satisfies

N_{≥k} = (1 + O(log^{−1/3} n)) (α/(α+1)) p^{1/α} n^{(1−β)/α} k^{−1/α}.

For a graph G = (V, E) of order n, define the average degree of G by d = 2|E|/n. Our next result shows that geo-protean graphs are dense.
Theorem 2. Wep the average degree of GEO-P(α, β, m, p) is

d = (1 + o(1)) (p/(1−α)) n^{1−α−β}.    (2)
Note that the average degree tends to infinity with n; that is, the model generates graphs satisfying a densification power law. In [17], densification power laws were reported in several real-world networks such as the physics citation graph and the internet graph at the level of autonomous systems. Our next result describes the diameter of graphs sampled from the GEO-P model. While the diameter is not shrinking, it can be made constant by allowing the dimension to grow as a logarithmic function of n.

Theorem 3. Let α ∈ (0, 1), β ∈ (0, 1 − α), m ∈ N, and p ∈ (0, 1]. Then wep the diameter of GEO-P(α, β, m, p) is

O( n^{β/((1−α)m)} log^{2α/((1−α)m)} n ).    (3)
We note that in a geometric model where regions of influence have constant volume and which has the same average degree as the geo-protean model, the diameter is Θ(n^{(α+β)/m}). This is a larger diameter than in the GEO-P model. If m = C log n, for some constant C > 0, then wep we obtain a diameter bounded by a constant. We conjecture that wep the diameter is of order n^{β/((1−α)m)+o(1)}. In the full version of the paper, we prove that wep the GEO-P model generates graphs with a constant clustering coefficient.

The normalized Laplacian of a graph relates to important graph properties; see [7]. Let A denote the adjacency matrix and D denote the diagonal degree matrix of a graph G. Then the normalized Laplacian of G is L = I − D^{−1/2} A D^{−1/2}. Let 0 = λ_0 ≤ λ_1 ≤ · · · ≤ λ_{n−1} ≤ 2 denote the eigenvalues of L. The spectral gap of the normalized Laplacian is

λ = max{ |λ_1 − 1|, |λ_{n−1} − 1| }.

A spectral gap bounded away from 0 is an indication of bad expansion properties, which are characteristic of OSNs (see property (iv) in the introduction). The next theorem represents a drastic departure from the good expansion found in binomial random graphs, where λ = o(1) [7,8].

Theorem 4. Let α ∈ (0, 1), β ∈ (0, 1 − α), m ∈ N, and p ∈ (0, 1]. Let λ(n) be the spectral gap of the normalized Laplacian of GEO-P(α, β, m, p). Then wep
1. If m = m(n) = o(log n), then λ(n) = 1 + o(1).
2. If m = m(n) = C log n for some C > 0, then

λ(n) ≥ 1 − exp( −(α+β)/C ).
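For small graphs the spectral gap λ is easy to compute directly; the following sketch (ours, using numpy) illustrates the definition on a toy graph with two dense clusters joined by a single edge, which exhibits the bad expansion the theorem describes:

    import numpy as np

    def spectral_gap(A: np.ndarray) -> float:
        # L = I - D^{-1/2} A D^{-1/2}; gap = max(|lambda_1 - 1|, |lambda_{n-1} - 1|)
        d = A.sum(axis=1)
        D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
        L = np.eye(len(A)) - D_inv_sqrt @ A @ D_inv_sqrt
        eig = np.sort(np.linalg.eigvalsh(L))   # 0 = eig[0] <= ... <= eig[-1] <= 2
        return max(abs(eig[1] - 1.0), abs(eig[-1] - 1.0))

    # two 5-cliques joined by one edge: the gap is close to 1 (bad expansion)
    A = np.zeros((10, 10))
    A[:5, :5] = 1; A[5:, 5:] = 1
    np.fill_diagonal(A, 0)
    A[4, 5] = A[5, 4] = 1
    print(spectral_gap(A))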
3.1 Dimension of OSNs
Given an OSN, we describe how we may estimate the corresponding dimension parameter m if we assume the GEO-P model. In particular, if we know the order n, power law exponent b, average degree d, and diameter D of an OSN, then we
can calculate m using our theoretical results. Formula (1) gives an estimate for α based on the power law exponent b. If d* = log d / log n, then equation (2) implies that, asymptotically, 1 − α − β = d*. If D* = log D / log n, then (3) and our conjecture about the diameter imply that, asymptotically, D* = β/((1−α)m). Thus, an estimate for m is given by:

m = (1/D*) ( 1 − ((b−1)/(b−2)) d* ).    (4)

Note that (4) suggests that the dimension depends on log n / log D. If D is constant, this means that m grows logarithmically with n. Recall that the dimension of an OSN may be roughly defined as the least integer m such that we can accurately embed the OSN in m-dimensional Euclidean space. Based on our model we conjecture that the dimension of an OSN is best fit by approximately log n.

The parameters b, d, and D have been determined for samples from OSNs in various studies such as [2,12,15,21]. The following chart summarizes this data and gives the predicted dimension for each network. We round m up to the nearest integer. Estimates of the total number of users n for Cyworld, Flickr, and Twitter come from Wikipedia [26], and that for YouTube comes from their website [27]. When the data consisted of directed graphs, we took b to be the power law exponent for the in-degree distribution. As noted in [2], the power law exponent of b = 5 for Cyworld holds only for users whose degree is at most approximately 100. When taking a sample, we assume that some of the neighbours of each node will be missing. Hence, when computing d*, we used n equal to the number of users in the sample. As we assume that the diameter of the OSN is constant, we compute D* with n equal to the total number of users.

Parameter | Cyworld    | Flickr     | Twitter    | YouTube
n         | 2.4 × 10^7 | 3.2 × 10^7 | 7.5 × 10^7 | 3 × 10^8
b         | 5          | 2.78       | 2.4        | 2.99
d*        | 0.22       | 0.17       | 0.17       | 0.1
D*        | 0.11       | 0.19       | 0.1        | 0.16
m         | 7          | 4          | 5          | 6
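As a worked check of formula (4) (our own arithmetic; small differences in the rounding of the published parameters can shift the rounded value by one), the Twitter row of the chart is reproduced by:

    import math

    def dimension(b: float, d_star: float, D_star: float) -> int:
        # m = (1/D*) * (1 - ((b-1)/(b-2)) * d*), rounded up
        return math.ceil((1.0 / D_star) * (1.0 - (b - 1.0) / (b - 2.0) * d_star))

    print(dimension(b=2.4, d_star=0.17, D_star=0.1))  # 10*(1 - 3.5*0.17) = 4.05 -> 5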
4 Proofs of Results
We sketch the proofs of our results here, emphasizing those parts that give insight into the model. Detailed proofs of all our results will appear in a full paper.
4.1 Degree Distribution: Proof of Theorem 1
Theorem 1 follows immediately from the following theorem which shows how the degree of a given vertex depends precisely on its age rank and prestige label.
A vertex v has age rank a(v, t) = i at time t if it is the i-th oldest vertex of all vertices existing at time t. The result below refers to the degree of a vertex at a time L when the steady state of the GEO-P model has been reached. The proof of the theorem follows standard methods, and is omitted here.

Theorem 5. Let i = i(n) ∈ [n]. Let v_i be the vertex in GEO-P(α, β, m, p) whose age rank at time L equals a(v_i, L) = i, and let R_i be the initial rank of v_i. If R_i ≥ √n log^2 n, then wep

deg(v_i, L) = (1 + O(log^{−1/2} n)) p ( i/((1−α)n) + (R_i/n)^{−α} (n−i)/n ) n^{1−α−β}.

Otherwise, that is if R_i < √n log^2 n, wep

deg(v_i, L) ≥ (1 + O(log^{−1/2} n)) p ( i/((1−α)n) + n^{α/2} log^{−2α} n · (n−i)/n ) n^{1−α−β}.
The proof of Theorem 1 is now a consequence of Theorem 5. One can show by an omitted calculation that wep each vertex v_i that has initial rank R_i ≥ √n log^2 n such that

R_i/n ≥ ( (1 + log^{−1/3} n) (p n^{1−α−β})^{−1} k · n/(n−i) )^{−1/α}

has fewer than k neighbours, and each vertex v_i for which

R_i/n ≤ ( (1 − log^{−1/3} n) (p n^{1−α−β})^{−1} k · n/(n−i) )^{−1/α}

has more than k neighbours. Let i_0 be the largest value of i such that

( (p n^{1−α−β})^{−1} k · n/(n−i) )^{−1/α} ≥ 2 log^2 n / √n.

This guarantees that the equations above do not contradict the requirement that R_i ≥ √n log^2 n. Note that i_0 = n − O(n/log n), since k ≤ n^{1−α/2−β} log^{−2α−1} n. Using this result, we can compute the expected value of N_{≥k}:

E N_{≥k} = Σ_{i=1}^{i_0} (1 + O(log^{−1/3} n)) ( (p n^{1−α−β})^{−1} k · n/(n−i) )^{−1/α} + O( Σ_{i=i_0+1}^{n} (√n log^2 n)/n )
         = (1 + O(log^{−1/3} n)) (α/(α+1)) p^{1/α} n^{(1−β)/α} k^{−1/α}.

The concentration follows from the well-known Chernoff bound.
4.2 Bad Expansion: Proof of Theorem 4
For the proof of Theorem 4 we show that there are sparse cuts in the GEO-P model. For sets X and Y we use the notation e(X, Y) for the number of edges with one end in each of X and Y. Suppose that the unit hypercube S = [0, 1]^m is partitioned into two sets of the same volume, S_1 = { x = (x_1, x_2, . . . , x_m) ∈ S : x_1 ≤ 1/2 } and S_2 = S \ S_1. Both S_1 and S_2 contain (1 + o(1)) n/2 vertices wep. In a good expander (for instance, the binomial random graph G(n, p)), wep there would be

(1 + o(1)) |E|/2 = (1 + o(1)) (p/(4(1−α))) n^{2−α−β}

edges between S_1 and S_2. Below we show that this is not the case in our model. The proof of the following theorem is omitted.

Theorem 6. Let α ∈ (0, 1), β ∈ (0, 1 − α), m ∈ N, and p ∈ (0, 1]. Then wep GEO-P(α, β, m, p) has the following properties.
1. If m = m(n) = o(log n), then e(S_1, S_2) = o(n^{2−α−β}).
2. If m = m(n) = C log n for some C > 0, then

e(S_1, S_2) ≤ (1 + o(1)) (p/(4(1−α))) n^{2−α−β} exp( −(α+β)/C ).

To finish the proof of Theorem 4, we use the expander mixing lemma for the normalized Laplacian (see [7] for its proof). For sets of nodes X and Y we use the notation vol(X) for the volume of the subgraph induced by X, X̄ for the complement of X, and, as introduced before, e(X, Y) for the number of edges with one end in each of X and Y. (Note that X ∩ Y does not have to be empty; in general, e(X, Y) is defined to be the number of edges between X \ Y and Y plus twice the number of edges that contain only vertices of X ∩ Y.)

Lemma 1. For all sets X ⊆ G,

| e(X, X) − (vol(X))^2 / vol(G) | ≤ λ · vol(X) vol(X̄) / vol(G).

It follows from (2) and the Chernoff bound that wep

vol(G_L) = (1 + o(1)) (p/(1−α)) n^{2−α−β},
vol(S_1) = (1 + o(1)) (p/(2(1−α))) n^{2−α−β} = (1 + o(1)) vol(S_2).

Suppose first that m = o(log n). From Theorem 6 we get that wep

e(S_1, S_1) = vol(S_1) − e(S_1, S_2) = (1 + o(1)) vol(S_1) = (1 + o(1)) (p/(2(1−α))) n^{2−α−β},
and Lemma 1 implies that wep λ(n) ≥ 1 + o(1). By definition, λ(n) ≤ 1, so λ(n) = 1 + o(1). Suppose now that m = C log n for some constant C > 0. By Theorem 6, we obtain that wep

e(S_1, S_1) = (1 + o(1)) (p/(1−α)) n^{2−α−β} ( 1/2 − (1/4) exp( −(α+β)/C ) ).

The assertion follows directly from Lemma 1.
4.3 Diameter: Proof of Theorem 3
In order to show that the graph has a relatively small diameter, we will first show that wep there exists a "backbone" of vertices with large influence regions (which allow for long links), and that all vertices are within graph distance at most two from this backbone. To find the backbone, fix A, and partition the unit hypercube into 1/A smaller hypercubes, each of volume A. Fix R, and consider nodes with initial rank at most R and age at most n/2; we call these the influential nodes. We now choose A and R so that (i) in each small hypercube, wep there are log^2 n influential nodes, and (ii) the influence region of each influential node, from its birth until the end of the process, contains the whole hypercube in which it is located, and also all neighbouring hypercubes. It can be shown that (ii) holds wep if the initial influence region of each influential node has volume at least 5^m A. Therefore, we obtain that

R^{−α} n^{−β} = 5^m A.    (5)

Property (i) holds if the expected number of influential nodes in each hypercube is at least 2 log^2 n (Chernoff bound). Hence, we require that

(n/2) (R/n) A = 2 log^2 n.    (6)

Combining (5) and (6) we obtain that the number of hypercubes is equal to

1/A = 5^{(m+α)/(1−α)} n^{β/(1−α)} log^{2α/(1−α)} n.

Now, since wep there are log^2 n nodes in each hypercube to choose from, wep we can select exactly one node from each hypercube so that each node is adjacent to the chosen nodes from all neighbouring hypercubes (the younger node falls into the region of influence of the older neighbours, and creates an edge with probability p). This subgraph then forms the backbone. It is clear that the diameter of the backbone is

(1/A)^{1/m} = O( n^{β/((1−α)m)} log^{2α/((1−α)m)} n ).
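For concreteness, (5) and (6) can be solved in closed form for A and R; a small sketch (ours; the closed form is derived by us by substituting R = 4 log^2(n)/A from (6) into (5), and the sample parameter values are arbitrary):

    import math

    def backbone_params(n, m, alpha, beta):
        # From (6): R * A = 4 * log(n)**2. Substituting into (5) gives
        # A = (5^m * n^beta * (4 log^2 n)^alpha)^(-1/(1-alpha)).
        log2n = math.log(n) ** 2
        A = (5 ** m * n ** beta * (4 * log2n) ** alpha) ** (-1.0 / (1.0 - alpha))
        R = 4 * log2n / A
        return A, R

    A, R = backbone_params(n=10**6, m=3, alpha=0.5, beta=0.25)
    print(f"number of hypercubes: {1/A:.3g}, influential-rank cutoff R: {R:.3g}")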
We now show that wep a node v that is not in the backbone is at distance at most two from some node in the backbone. Since wep the minimum degree is Ω(n^{1−α−β}), wep Ω(n^{1−α−β}) neighbours of v have age rank at least n/2. Since each such neighbour falls into the region of influence of some node in the backbone, wep at least one neighbour of v must be connected to the backbone.
5 Conclusion and Discussion
We introduced the geo-protean (GEO-P) geometric model for OSNs, and showed that with high probability the model generates graphs satisfying each of the properties (i) to (iv) in the introduction. We introduced the dimension of an OSN based on our model, and examined this new parameter using actual OSN data. We observed that the dimension of various OSNs ranges from 4 to 7. It may therefore be possible to group users via a relatively small number of attributes, although this remains unproven. The Logarithmic Dimension Hypothesis (or LDH) conjectures that the dimension of an OSN is best fit by log n, where n is the number of users in the OSN. The idea of using geometry and dimension to explore OSNs deserves to be more thoroughly investigated. Given the availability of OSN data, it may be possible to fit the data to the model to determine the dimension of a given OSN. Initial estimates from actual OSN data indicate that the spectral gap found in OSNs correlates with the spectral gap found in the GEO-P model when the dimension is approximately log n, giving some credence to the LDH. Another interesting direction would be to generalize GEO-P to a wider array of ranking schemes (such as ranking by age or degree), and determine when similar properties (such as power laws and bad spectral expansion) provably hold. We finish by mentioning that recent work [6] indicates that social networks lack high compressibility, especially in contrast to the web graph. We propose to study the relationship between the GEO-P model and the incompressibility of OSNs in future work.
References
1. Adamic, L.A., Buyukkokten, O., Adar, E.: A social network caught in the web. First Monday 8 (2003)
2. Ahn, Y., Han, S., Kwak, H., Moon, S., Jeong, H.: Analysis of topological characteristics of huge on-line social networking services. In: Proceedings of the 16th International Conference on World Wide Web (2007)
3. Aiello, W., Bonato, A., Cooper, C., Janssen, J., Pralat, P.: A spatial web graph model with local influence regions. Internet Mathematics 5, 175–196 (2009)
4. Bonato, A.: A Course on the Web Graph. American Mathematical Society Graduate Studies Series in Mathematics, Providence, Rhode Island (2008)
5. Bonato, A., Hadi, N., Horn, P., Pralat, P., Wang, C.: Models of on-line social networks. Accepted to Internet Mathematics (2010)
6. Chierichetti, F., Kumar, R., Lattanzi, S., Mitzenmacher, M., Panconesi, A., Raghavan, P.: On compressing social networks. In: Proceedings of the 15th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2009 (2009)
7. Chung, F.R.K.: Spectral Graph Theory. American Mathematical Society, Providence (1997)
8. Chung, F.R.K., Lu, L.: Complex Graphs and Networks. American Mathematical Society, U.S.A. (2004)
9. Estrada, E.: Spectral scaling and good expansion properties in complex networks. Europhys. Lett. 73, 649–655 (2006)
10. Flaxman, A., Frieze, A., Vera, J.: A geometric preferential attachment model of networks. Internet Mathematics 3, 187–205 (2007)
11. Janssen, J., Pralat, P.: Protean graphs with a variety of ranking schemes. Theoretical Computer Science 410, 5491–5504 (2009)
12. Java, A., Song, X., Finin, T., Tseng, B.: Why we twitter: understanding microblogging usage and communities. In: Proceedings of the Joint 9th WEBKDD and 1st SNA-KDD Workshop 2007 (2007)
13. Kleinberg, J.: The small-world phenomenon: An algorithmic perspective. In: Proceedings of the 32nd ACM Symposium on Theory of Computing (2000)
14. Kumar, R., Novak, J., Tomkins, A.: Structure and evolution of on-line social networks. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2006)
15. Kwak, H., Lee, C., Park, H., Moon, S.: What is Twitter, a social network or a news media? In: Proceedings of the 19th International World Wide Web Conference (2010)
16. Lattanzi, S., Sivakumar, D.: Affiliation networks. In: Proceedings of the 41st Annual ACM Symposium on Theory of Computing (2009)
17. Leskovec, J., Kleinberg, J., Faloutsos, C.: Graphs over time: densification laws, shrinking diameters and possible explanations. In: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2005)
18. Leskovec, J., Chakrabarti, D., Kleinberg, J., Faloutsos, C.: Realistic, mathematically tractable graph generation and evolution, using Kronecker multiplication. In: Jorge, A.M., Torgo, L., Brazdil, P.B., Camacho, R., Gama, J. (eds.) PKDD 2005. LNCS (LNAI), vol. 3721, pp. 133–145. Springer, Heidelberg (2005)
19. Liben-Nowell, D., Novak, J., Kumar, R., Raghavan, P., Tomkins, A.: Geographic routing in social networks. Proceedings of the National Academy of Sciences 102, 11623–11628 (2005)
20. Łuczak, T., Pralat, P.: Protean graphs. Internet Mathematics 3, 21–40 (2006)
21. Mislove, A., Marcon, M., Gummadi, K., Druschel, P., Bhattacharjee, B.: Measurement and analysis of on-line social networks. In: Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement (2007)
22. Newman, M.E.J., Park, J.: Why social networks are different from other types of networks. Phys. Rev. E 68(3), 036122 (2003)
23. Twitterholic, http://twitterholic.com/ (accessed September 12, 2010)
24. Watts, D.J., Dodds, P.S., Newman, M.E.J.: Identity and search in social networks. Science 296, 1302–1305 (2002)
25. Watts, D.J., Strogatz, S.H.: Collective dynamics of 'small-world' networks. Nature 393, 440–442 (1998)
26. Wikipedia: List of social networking websites, http://en.wikipedia.org/wiki/List_of_social_networking_websites (accessed September 12, 2010)
27. YouTube, Advertising and Targeting, http://www.youtube.com/t/advertising_targeting (accessed September 12, 2010)
Constant Price of Anarchy in Network Creation Games via Public Service Advertising

Erik D. Demaine and Morteza Zadimoghaddam

MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA 02139, USA
{edemaine,morteza}@mit.edu
Abstract. Network creation games have been studied recently in many different settings. These games are motivated by social networks in which selfish agents want to construct a connection graph among themselves. Each node wants to minimize its average or maximum distance to the others, without paying much to construct the network. Many generalizations have been considered, including non-uniform interests between nodes, general graphs of allowable edges, bounded-budget agents, etc. In all of these settings, there is no known constant bound on the price of anarchy. In fact, in many cases, the price of anarchy can be very large, namely a constant power of the number of agents. This means that we have no control over the behavior of the network when agents act selfishly. On the other hand, the price of stability in all these models is constant, which means that there is a chance that agents act selfishly and we end up with a reasonable social cost. In this paper, we show how to use an advertising campaign (as introduced in SODA 2009 [2]) to find such efficient equilibria. More formally, we present advertising strategies such that, if an α fraction of the agents agree to cooperate in the campaign, the social cost is at most O(1/α) times the optimum cost. This is the first constant bound on the price of anarchy and, interestingly, it can be adapted to different settings. We also generalize our method to work in cases where α is not known in advance. Moreover, we do not need to assume that the cooperating agents spend all their budget in the campaign; even a small fraction (a β fraction) gives us a constant price of anarchy.

Keywords: algorithmic game theory, price of anarchy, selfish agents.
1 Introduction
In network creation games, nodes construct an underlying graph in order to have short routing paths among themselves. Each node thus incurs two types of cost: the network design cost, which is the amount the node contributes to constructing the network, and the network usage cost, which is the sum of its distances to all other nodes. Nodes act selfishly, and everyone wants to minimize its own cost, i.e., the network design cost plus the usage cost. The social cost in these games is equal to the sum of the costs of all nodes.
To study the behaviour of social networks, we try to understand how large the social cost can be in the presence of selfish agents. Nash Equilibria are the stable networks in which every agent is acting selfishly. More formally, in a Nash Equilibrium every agent has no incentive to change her strategy assuming all other agents keep the same strategies. In this setting, the price of anarchy is the worst-case ratio between the social cost of a Nash Equilibrium and the optimal social cost of a network designed by a central authority. The price of anarchy was introduced by Koutsoupias and Papadimitriou in [9,11], and is used to measure the behaviour of games and networks with selfish agents. Small values of the price of anarchy show that allowing agents to be selfish does not increase the social cost much. On the other hand, large values of the price of anarchy mean that the selfish behaviour of agents can lead the whole game (network) to stable situations whose social cost is large in comparison with the optimum.

Model: In a network creation game there is a set of selfish nodes. Every node can construct an undirected¹ edge to any other node at a fixed given cost. Each node also incurs a usage cost related to its distances to the other nodes: the usage cost of a node is the sum of its distances to all other nodes. Clearly every node is trying to minimize its own total cost, i.e., usage cost plus construction cost. In another variant of network creation games, called the (n, k)-uniform bounded budget connection game, we have n nodes in the graph, and each node can construct k edges to other nodes. So every node only has the usage cost, but its budget to build edges is limited.

The advertising campaign scenario can be applied to different game-theoretic situations. In these scenarios, we can encourage people, via public service advertising, to follow a specific strategy, and we can design the strategy to improve the social cost. In our model, we find an advertising strategy that reduces the price of anarchy and controls the behaviour of the selfish nodes. We do not need everyone's help to achieve a small price of anarchy. We assume that an α fraction of people are willing to follow our strategy, and each of them agrees to spend a β fraction of its budget in the campaign. Formally, we assume that every node accepts to contribute in the campaign with probability α. We call these users receptive users, as in the literature [2]. Every receptive user is willing to use βk of its edges for the campaign. At first we assume that α and β are known parameters, and we present a strategy that leads the network to an equilibrium with a small price of anarchy. Then we adapt our strategies to work in cases where α and β are not known in advance. To get constant bounds on the price of anarchy, we assume that k is greater than (c log n)/(αβ) for some sufficiently large constant c.

Previous Work: Fabrikant et al. introduced network creation games [6]. They studied the price of anarchy in these games, and achieved the first non-trivial
One can get the same results using the same techniques and maintaining two ingoing and outgoing trees from the root for directed graphs as well.
bounds on it. They studied the structure of Nash Equilibria, and conjectured that only trees can be stable graphs in this model. Later, Albers et al. came up with an interesting class of stable graphs, and disproved the tree conjecture [1]. They also presented better upper bounds on the price of anarchy: they proved that the price of anarchy cannot be more than O(n^{1/3}) in general, and in some cases they obtained a constant upper bound. Corbo and Parkes [3] considered a slightly different model, called bilateral network formation games, and studied the price of anarchy in this model. They were able to prove an O(√c) upper bound on the price of anarchy, where c is the cost of constructing one edge in their model. Since c can be as large as n, this bound is also as large as n to the power of a constant. Demaine et al. studied the sizes of neighborhood sets in stable graphs, and with a recursive technique they presented the first sub-polynomial bounds on the price of anarchy [5]. They also studied a variant of these games called cooperative network creation games, for which they achieved the first poly-logarithmic upper bounds on the price of anarchy [4]. This result actually shows that the diameter of stable graphs is poly-logarithmic, which implies the small-world phenomenon in these games. For more details about the small-world phenomenon, we refer to Kleinberg's works [7,8]. Laoutaris et al. studied network creation games in the bounded budget model [10]. They argued that in many practical settings a selfish agent cannot build an arbitrary number of edges to other nodes, even when there is an incentive to do so. In this model, every node has a limited amount of budget, and according to this limit, each node can build up to a given number of edges. They call these games uniform bounded budget connection games, and they achieve sub-linear upper and lower bounds on the price of anarchy: they prove that the price of anarchy is between Ω(√((n/k) / log_k(n))) and O(n / log_k(n)), where n and k are respectively the number of nodes and the maximum number of edges that each node can have. Although this is an interesting model in the sense that each node has a limited number of edges, the price of anarchy can be very large in these games. In many games, including network creation, selfish routing, fair cost sharing, etc., the cost of a stable graph can vary in a large range; in other words, there are both low-cost and high-cost Nash Equilibria. Balcan et al. claim that in such games one can hope to lead the game to low-cost Equilibria using public service advertising [2]. They study the price of anarchy under some advertising strategies. In some cases, like fair cost sharing, they present advertising strategies that reduce the price of anarchy to a constant, and in some other games, like scheduling games, they show that no useful advertising strategy exists.

Our Results: First we prove a tight upper bound in the uniform bounded budget games: we prove that the price of anarchy is O(√((n/k) / log_k(n))). According to the lower bound in [10], this is a tight result, and it shows that the price of anarchy is in fact Θ(√((n/k) / log_k(n))).
Since the uniform games have a very large price of anarchy, we try to find advertising strategies that reduce the price of anarchy to a constant in uniform bounded budget games. This way we can be sure that the degree of each node is bounded, so no node is overwhelmed in the network; on the other hand, we also know that the price of anarchy is small, so the behavior of these games is under control. Formally, we present an advertising strategy that leads the game to Equilibria with price of anarchy at most O(1/α), where α is the fraction of nodes that follow our strategy. We do not assume that everyone is willing to contribute to our strategy; even if α is very small, we still get a small price of anarchy. We also do not assume that every contributing node is willing to spend all its k edges as we say: we use only βk edges of a player that contributes to the advertising strategy, where 0 < β < 1 can be a small constant. In Section 3, we present an advertising strategy that knows the values of α and β in advance. Then, in Section 4, we adapt our strategy to work in cases where these two parameters are not given in the input and their values must be discovered as well.
2 A Tight Upper Bound for Price of Anarchy in Uniform Games
Here we show that the price of anarchy cannot be more than O(√((n/k) / log_k(n))) in the (n, k)-uniform BBC game². According to the Ω(√((n/k) / log_k(n))) lower bound for the price of anarchy presented in [10], this is the best upper bound that can be achieved in this game. This means that setting any limit on the budget of each node implies very large values of the price of anarchy. We prove that the diameter of any stable graph in this model is bounded by O(√(n log_k(n) / k)).

Lemma 1. The diameter of any stable graph in the (n, k)-uniform game is at most O(√(n log_k(n) / k)).

Proof. We just need to show that there is a vertex v whose distance to any other vertex is at most O(√(n log_k(n) / k)). Let G be a stable graph. Define g to be c log_k(n) for a sufficiently large constant c ≥ 1. Delete all edges that are contained in a cycle of length at most g, and let G′ be the remaining graph. Clearly G′ has no cycle of length at most g. We claim that G′ has a vertex with degree at most k/2. If all degrees are at least k/2, we have at least (k/2)^{g/2} walks of length g/2 starting from an arbitrary vertex u. The endpoints of these walks are all different: otherwise we would find two walks of length g/2 starting from u and ending at the same vertex, and hence a cycle of length at most 2(g/2) = g in the remaining graph, which is a contradiction. So there are at least (k/2)^{g/2} different endpoints for these walks. On the other hand there are
Bounded Budget Connection game.
at most n vertices in the graph. For any k, there is a sufficiently large value of c for which (k/2)^{g/2} is greater than n, which is a contradiction. So there is a vertex v with degree at most k/2 in G′. This means that v has at least k/2 edges e_1, e_2, · · · , e_{k/2}, each of which is contained in a cycle of length at most g. For each vertex u ≠ v in G, consider a shortest path from v to u in graph G. We thus have n − 1 shortest paths, and each of these paths might use at most one of the k/2 edges. So there is an edge e_i that is used in at most n/(k/2) = 2n/k paths. If vertex v deletes edge e_i, its distance to at most 2n/k other vertices might increase, by at most g − 1 each, because edge e_i is contained in a cycle of length at most g. So the cost of v increases by at most 2ng/k. Now let d be the maximum distance from v in the stable graph G, and assume that d is the distance between v and v′. If v deletes edge e_i and adds edge (v, v′), its cost increases by at most 2ng/k and decreases by at least (d/3)^2 = d^2/9. To see this, consider the shortest path from v to v′: there are at least d/3 vertices whose distance to v is at least 2d/3, and by adding edge (v, v′) their distances to v would become at most d/3. So the cost of v decreases by at least d^2/9 by adding edge (v, v′). Since we are in a stable graph, 2ng/k should be at least d^2/9. This means that d is at most O(√(ng/k)). Note that g is c log_k(n), and this completes the proof.

Theorem 1. The price of anarchy in an (n, k)-uniform BBC game is at most O(√((n/k) / log_k(n))).
Proof. Using Lemma 1, we know that the diameter of any stable graph in this model is bounded by O(√(n log_k(n) / k)). As mentioned in the proof of Theorem 3 in [10], the average distance in the optimum solution is at least Ω(log_k(n)). This shows that the price of anarchy in these uniform games is not more than

O(√(n log_k(n) / k)) / Ω(log_k(n)) ≤ O(√((n/k) / log_k(n))).
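To get a feel for the magnitude of this bound (our own arithmetic, with arbitrary sample values of n and k):

    import math

    def poa_bound(n, k):
        # Theorem 1's tight bound: sqrt((n/k) / log_k(n))
        return math.sqrt((n / k) / (math.log(n) / math.log(k)))

    n = 10**6
    for k in (2, 10, 100):
        print(k, round(poa_bound(n, k), 1))
    # roughly 158, 129, and 58: still polynomial in n even for generous
    # budgets, which motivates the advertising strategies of the next section.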
3 How the Public Service Advertising Affects the Price of Anarchy
In this section we present strategies that lead the network to stable graphs with low social cost. We assume that every node follows our strategy with probability α. We call these follower nodes receptive nodes because of their interest in the advertised strategy. We also do not ask a node to spend all its budget on our strategy: a receptive node only has to spend βk edges on our strategy (0 < β < 1), and can use the rest of its edges arbitrarily. At first we assume that α and β are parameters given in advance. In Section 4, we change our strategies to be adaptive and to work when these parameters are not revealed in advance.

The advertising strategy is as follows. Define k′ to be (αβ / (c log n)) k for a sufficiently large constant c (c ≥ 5 would work); we assume that k′ > 1. We partition the nodes into l ≤ log_{k′}(n) sets S_1, S_2, · · · , S_l such that |S_1| = βk/2 and |S_{i+1}| / |S_i| = k′ for each 1 ≤ i < l. Note that the only important properties of these sets are
their sizes. For example, we can set S1 to be the nodes 1, 2, · · · , |S1 |, set S2 to be the nodes |S1 | + 1, · · · , |S1 | + |S2 |, and so on. We ask nodes in the first set S1 to construct edges to all other nodes in set S1 . So every receptive node in set S1 uses βk/2 − 1 edges to get directly connected to all other nodes in S1 . For i > 1, we ask each node in set Si to pick c log (n)/2α nodes randomly from set Si−1 and construct edges to them. Note that c log (n)/2α is at most βk/2 because k is greater than one, and it is also equal to αβk/c log (n). On the other hand nodes in set Si−1 receive some incoming edges from set Si . We do not assume that every such an edge is accepted by nodes in set Si−1 . For example if a non-receptive node receives an edge, the node might delete the edge, i.e. the node is not interested in our strategy or it is a malicious player. This assumption just makes our work harder because we have to find a way to take care of deleted edges. Even if the node in set Si−1 is receptive we might have a problem. Assume that the node receives more than βk/2 edges from set Si , it might delete some edges. Because a receptive node is not necessarily willing to contribute in the strategy with more than βk edges. So the node might get overwhelmed by the nodes in lower set. In these cases we just do not rely on these edges in our analysis. So we assume that if a receptive node receives at most βk/2 edges from the nodes of the lower set, it does not delete these edges. This assumption is true because we are basically asking a receptive node in set Si to handle at most βk/2 edges from the set Si+1 , and build c log (n)/2α ≤ βk/2 edges to the nodes of set Si−1 which is at most βk edges in total. Lemma 2. The edges built in the above strategy form a hierarchical tree shaped subgraph with logk (n) levels. The diameter of this subgraph is at most 2 logk n, and every receptive node is contained in this subgraph with high probability3 . Proof. We just need to prove that every receptive node v in set Si gets connected to a receptive node v in set Si−1 , and node v does not delete the edge (v, v ), i.e. node v does not get overwhelmed. Node v picks c log (n)/2α random nodes in set Si−1 . There are c log (n)/2 receptive nodes among these nodes in expectation because every node is receptive with probability α. Using Chernoff bound, we can say that there are at least log (n) receptive nodes among them with high probability (note that c is sufficiently large). So every receptive vertex v in level i is connected to at least log (n) receptive nodes in set i − 1 unless they delete their incoming edges because they have been overwhelmed. Now we prove that every node is overwhelmed in this structure with probability at most 1/2. Each node in set Si is receptive with probability α. Each receptive node makes c log (n)/2α edges to the nodes in set Si−1 randomly. So the expected number (n)/2α) . of incoming edges from set Si to a node in set Si−1 is equal to α|Si |(c|Slog i−1 | We also know that 3
|Si | |Si−1 |
is equal to k =
αβ c log (n) k.
Probability 1 − 1/nc for some large constant c.
We conclude that every node
128
E.D. Demaine and M. Zadimoghaddam
u in set Si−1 receives αβk/2 edges in expectation. Using Markov inequality, we can say that a node can be overwhelmed with probability at most α/2 < 1/2. So every node v ∈ Si is connected to at least log (n) receptive nodes in set Si−1 . Each of them is overwhelmed with probability at most 1/2. Since the overwhelming events for different nodes are negatively correlated, we can say that with high probability node v is connected to at least one receptive node in set Si−1 that is not overwhelmed. This is sufficient to see that with high probability, each receptive node has a path of length at most l to some receptive node in set S1 , where l is the number of levels. Since receptive nodes in set S1 makes direct edges to all other nodes in set S1 (and to themselves as well), they form a complete graph. We conclude that the diameter of all receptive nodes is at most 2l = 2 logk (n) with high probability. Now we can bound the diameter of the whole graph (not only the subgraph of receptive nodes). Lemma 3. The diameter of a stable graph after running the advertisement strategy is at most O(logk (n)/α). Proof. Using Lemma 2, we know that with high probability the diameter of receptive nodes is at most 2l. There are αn receptive nodes in expectation, and with high probability the number of them is not less than αn/2. Consider a receptive vertex v. Let d be the maximum distance of other nodes from v. We prove that d is O(l + log (n)/α). Delete all edges in G that are contained in at least a cycle of length at most l = l + 2 logk (n) + 1. Consider a non-receptive vertex u. We prove that if one of the k edges of u is in a cycle of length at most l , the distance from u to v is at most l /α. Let e be an edge owned by u which is in a cycle of length at most l . Let x be the distance between u and v. If vertex u deletes edge e, its distance to other nodes increases by at most l × n. On the other hand, if u makes an edge to vertex v, its distance to all receptive nodes decreases by at least x − 4l − 1 (before adding the edge its distances to receptive nodes were at least x − 2l, and after that the distances are at most 2l + 1). So the total decrease in the cost of αn u would be at least αn 2 (x − 4l − 1) because there are at least 2 receptive nodes αn with high probability. Since we are in a stable graph, 2 (x − 4l − 1) should not be greater than l × n. So x is O(l /α + l) = O(l /α) in this case. We call a vertex incomplete if at least one of its edges is deleted. As proved above, each incomplete vertex is in distance at most O(l /α) from v. We also note that the remaining graph does not have a cycle of length at most l . We claim that each vertex is either incomplete or has distance at most l from one of the incomplete vertices. So the distances of all vertices from v is at most l + O(l /α) = O(l /α). Consider an incomplete vertex u, and all walks of length l /2 starting from u in the remaining graph. If one of these walks passes over an incomplete vertex, the claim is proved. Otherwise we have k l /2 walks starting from the same vertex u. The endpoints of these walks are also different, otherwise we find a cycle of length at most l in the remaining graph. So there are at least
k^{l'/2} > n different vertices in the graph, which is a contradiction because l' is greater than 2 log_k(n). So the distances of all vertices from a receptive vertex v are at most O(l'/α) = O((l + log_k(n))/α). Note that l is equal to log_{k'}(n), and k' is at most k. So the diameter of the whole graph is simply at most O(log_{k'}(n)/α).
Theorem 2. The price of anarchy is at most O(log_{k'}(n)/(α log_k(n))) = O(log_{k'}(k)/α) using the advertising strategy, where k' is αβk/(c log(n)) for a constant c.
Proof. Using Lemma 3, the diameter of a stable graph is at most O(log_{k'}(n)/α). On the other hand, as mentioned in the proof of Theorem 3 in [10], the average distance in the optimal graph is at least Ω(log_k(n)). Combining these two facts completes the proof of the theorem.
Corollary 1. For k > Ω(log^{1+ε}(n)), the price of anarchy is O(1/α).
Proof. Note that α and β are some constant parameters. So k/k' is O(log(n)). Since k is at least Ω(log^{1+ε}(n)), we can say that k is at most O(k'^{1/ε}). This shows that log_{k'}(k) is O(1/ε), which completes the proof.
Corollary 2. For k > Ω(log(n)), the price of anarchy is at most O(log log(k)/α).
Proof. One just needs to set k' to an appropriate constant. The rest is similar to the above.
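For intuition, the following small sketch (our own illustration in Python, with simplified rounding and without the per-node receptiveness coin flips; all names are ours) builds the level structure that the advertising strategy asks receptive nodes to create: a clique on S_1 and c log(n)/(2α) random edges from each node of S_i down to S_{i−1}.

import math
import random

def advertising_subgraph(n, alpha, beta, k, c=4):
    # Level sizes grow by a factor k' = alpha*beta*k / (c*log n); the first
    # level has roughly beta*k/2 nodes so its clique costs beta*k/2 - 1 edges
    # per node (a sketch; the paper's exact rounding is not reproduced).
    kprime = alpha * beta * k / (c * math.log(n))
    assert kprime > 1, "the strategy requires k' > 1"
    levels, size, used = [], max(int(beta * k / 2), 1), 0
    while used < n:
        levels.append(list(range(used, min(used + size, n))))
        used += len(levels[-1])
        size = int(size * kprime) + 1
    edges = [(u, v) for u in levels[0] for v in levels[0] if u != v]  # S_1 clique
    d = int(c * math.log(n) / (2 * alpha)) + 1   # edges per node to the level below
    for i in range(1, len(levels)):
        for u in levels[i]:
            for v in random.sample(levels[i - 1], min(d, len(levels[i - 1]))):
                edges.append((u, v))
    return levels, edges

The number of levels produced is about log_{k'}(n), matching Lemma 2.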
4 How to Deal with Unknown α and β
In Section 3, we presented an advertising strategy that leads the network to equilibria with small price of anarchy, given the two parameters α and β. Here we try to make our strategy adaptive for the cases where the parameters are not known in advance, i.e. sometimes a lot of agents contribute to the campaign, and sometimes only a small fraction of them participate. So in these cases, we know that an α > ε fraction of agents are willing to spend a β > ε' fraction of their budget in the campaign, where ε and ε' are two given lower bounds on these two parameters. We note that these two lower bounds are two constants that can be very small. Define m and m' to be the two smallest integers such that ε > 1/2^m and ε' > 1/2^{m'}. So there exist two integers i and j such that 1/2^i ≤ α ≤ 1/2^{i−1} and 1/2^j ≤ β ≤ 1/2^{j−1}, where 1 ≤ i ≤ m and 1 ≤ j ≤ m'. Note that we do not need to know the exact values of the parameters α and β in the advertising strategy; an estimate would work. For example, if we know two integers i and j such that 1/2^i ≤ α ≤ 1/2^{i−1} and 1/2^j ≤ β ≤ 1/2^{j−1}, we can run the above strategy with parameters 1/2^i and 1/2^j instead of α and β. The same probabilistic bounds would work in the same way, and we can prove the same claims as proved in Section 3. But we do not even have good estimates of these two parameters. The only thing we know is that they are in the ranges [ε, 1] and [ε', 1] respectively.
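To make the bracketing concrete, a small sketch (our own illustration) computes m, m', and the grid of dyadic estimates; exactly one pair in this grid brackets the true (α, β).

import math

def estimate_grid(eps, eps_prime):
    # Smallest integers m, m' with eps > 1/2^m and eps' > 1/2^{m'}.
    m = math.floor(math.log2(1.0 / eps)) + 1
    mp = math.floor(math.log2(1.0 / eps_prime)) + 1
    return [(0.5 ** i, 0.5 ** j) for i in range(1, m + 1) for j in range(1, mp + 1)]

print(len(estimate_grid(0.1, 0.2)))  # 4 x 3 = 12 parallel runs, budget k/12 each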
But we know that α is in one of these m ranges: [1/2, 1], [1/4, 1/2], ..., [1/2^m, 1/2^{m−1}], and the same holds for β with m' ranges. We should run the strategy for the different estimates of α and β in a parallel manner. So there are m × m' different pairs of estimates for our parameters. But a receptive agent contributes to the campaign with only βk edges. We can ask a receptive node to spend βk/(m × m') edges in each of these runs. Note that in order to run a strategy we need to set four parameters: α, β, k, and n. Here we want to use the strategy for m × m' parallel runs. So for each pair (i, j), we run the strategy with parameters 1/2^i, 1/2^j, k/(m × m'), and n (instead of α, β, k, and n), for each 1 ≤ i ≤ m and 1 ≤ j ≤ m'. Each receptive node spends at most βk edges in all the runs. The only thing that changes our upper bounds on the price of anarchy is the new value of k in each run. In fact we are using k/(m × m') edges to reduce the price of anarchy. So we have the following theorem for the cases where the parameters are not known in advance.
Theorem 3. When the parameters α > ε and β > ε' are not known in advance, the price of anarchy is at most O(log_{k'}(n)/(α log_k(n))) = O(log_{k'}(k)/α) using the above advertising strategy (updated version), where k' is (αβ/(c log(n))) × (k/(m × m')) for a constant c. The integers m and m' are log(1/ε) and log(1/ε') respectively.
Proof. When we run the original strategy for the different pairs (i, j), one of these pairs is a good estimate for α and β. Using the edges constructed by the receptive nodes in this specific run of the strategy and Theorem 2, we obtain this bound. The only difference is that we can use only k/(m × m') edges in each run, which is why the value of k is divided by a factor of m × m'. Since ε and ε' are two constant (and probably very small) parameters, we can say that m and m' are also constant (and probably large) numbers. We conclude that Corollaries 1 and 2 are also true in this case (unknown α and β).
References
1. Albers, S., Eilts, S., Even-Dar, E., Mansour, Y., Roditty, L.: On Nash Equilibria for a Network Creation Game. In: Proceedings of the 17th Annual ACM-SIAM Symposium on Discrete Algorithms, Miami, FL, pp. 89–98 (2006)
2. Balcan, M.-F., Blum, A., Mansour, Y.: Improved equilibria via public service advertising. In: Proceedings of the 20th Annual ACM-SIAM Symposium on Discrete Algorithms, New York, NY, pp. 728–737 (2009)
3. Corbo, J., Parkes, D.: The price of selfish behavior in bilateral network formation. In: Proceedings of the 24th Annual ACM Symposium on Principles of Distributed Computing, Las Vegas, Nevada, pp. 99–107 (2005)
4. Demaine, E.D., Hajiaghayi, M., Mahini, H., Zadimoghaddam, M.: The Price of Anarchy in Cooperative Network Creation Games. SIGecom Exchanges 8.2 (December 2009); a preliminary version appeared in Proceedings of the 26th International Symposium on Theoretical Aspects of Computer Science, pp. 171–182 (2009)
5. Demaine, E.D., Hajiaghayi, M., Mahini, H., Zadimoghaddam, M.: The Price of Anarchy in Network Creation Games. In: Proceedings of the 26th Annual ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing, pp. 292–298 (2007); to appear in ACM Transactions on Algorithms
6. Fabrikant, A., Luthra, A., Maneva, E., Papadimitriou, C.H., Shenker, S.: On a network creation game. In: Proceedings of the 22nd Annual Symposium on Principles of Distributed Computing, Boston, Massachusetts, pp. 347–351 (2003)
7. Kleinberg, J.: Small-World Phenomena and the Dynamics of Information. In: Advances in Neural Information Processing Systems (NIPS), vol. 14 (2001)
8. Kleinberg, J.: The small-world phenomenon: An algorithmic perspective. In: Proceedings of the 32nd ACM Symposium on Theory of Computing (2000)
9. Koutsoupias, E., Papadimitriou, C.: Worst-case equilibria. In: Meinel, C., Tison, S. (eds.) STACS 1999. LNCS, vol. 1563, pp. 404–413. Springer, Heidelberg (1999)
10. Laoutaris, N., Poplawski, L.J., Rajaraman, R., Sundaram, R., Teng, S.-H.: Bounded budget connection (BBC) games or how to make friends and influence people, on a budget. In: Proceedings of the 27th ACM Symposium on Principles of Distributed Computing, pp. 165–174 (2008)
11. Papadimitriou, C.: Algorithms, games, and the internet. In: Proceedings of the 33rd Annual ACM Symposium on Theory of Computing, Hersonissos, Greece, pp. 749–753 (2001)
Fast Katz and Commuters: Efficient Estimation of Social Relatedness in Large Networks
Pooya Esfandiar¹, Francesco Bonchi², David F. Gleich³, Chen Greif¹, Laks V.S. Lakshmanan¹, and Byung-Won On¹
¹ University of British Columbia, Vancouver BC, Canada
{pooyae,greif,laks,bwon}@cs.ubc.ca
² Yahoo! Research, Barcelona, Spain
[email protected]
³ Sandia National Laboratories, Livermore CA, USA
[email protected]
Abstract. Motivated by social network data mining problems such as link prediction and collaborative filtering, significant research effort has been devoted to computing topological measures including the Katz score and the commute time. Existing approaches typically approximate all pairwise relationships simultaneously. In this paper, we are interested in computing the score for a single pair of nodes, and the top-k nodes with the best scores from a given source node. For the pairwise problem, we apply an iterative algorithm that computes upper and lower bounds for the measures we seek. This algorithm exploits a relationship between the Lanczos process and a quadrature rule. For the top-k problem, we propose an algorithm that only accesses a small portion of the graph and is related to techniques used in personalized PageRank computing. To test the scalability and accuracy of our algorithms we experiment with three real-world networks and find that these algorithms run in milliseconds to seconds without any preprocessing.
1 Introduction
The availability of large social networks and social interaction data (on movies, books, music, etc.) has caused people to ask: what can we learn by mining this wealth of data? Measures of social relatedness play a fundamental role in answering this question. For example, Liben-Nowell and Kleinberg [13] identify a variety of topological measures as features for link prediction, the problem of predicting the likelihood of users/entities forming social ties in the future, given the current state of the network. The measures they studied fall into two categories: neighborhood-based measures and path-based measures. The former are cheaper to compute, yet the latter are more effective at link prediction. Katz
Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.
scores [11] were among the best link predictors, and the commute time [6] also performed well. Other uses of Katz scores and commute time are anomalous link detection [18], recommendation [20], and clustering [19]. Katz scores measure the affinity between nodes via a weighted sum of the number of paths between them. Formally, the Katz score between nodes i and j is K_{i,j} = Σ_{ℓ=1}^{∞} α^ℓ paths_ℓ(i, j), where paths_ℓ(i, j) denotes the number of paths of length ℓ between i and j and α < 1 is an attenuation parameter. Let A be the symmetric adjacency matrix, and recall that (A^ℓ)_{i,j} is the number of paths of length ℓ between nodes i and j. Then for all pairs of nodes,

K = αA + α²A² + ··· = (I − αA)^{−1} − I,

where the series converges if α < 1/‖A‖₂. The hitting time from node i to j is the expected number of steps for a random walk started at i to visit j, and the commute time between nodes is defined as the sum of hitting times from i to j and from j to i. The hitting time may be expressed using the row-stochastic transition matrix P with first-transition analysis: H_{i,i} = 0 and H_{i,j} = 1 + Σ_k P_{i,k} H_{k,j}. Unlike Katz, hitting time is not symmetric; but commute time is by definition, since C = H + H^T. Computing H and C via these definitions is not straightforward, and using the graph Laplacian, L = D − A where D is the diagonal matrix of degrees, provides another means of computing the commute time. With the Laplacian, C_{i,j} = Vol(G)(L†_{i,i} − 2L†_{i,j} + L†_{j,j}), where Vol(G) is the sum of the elements in A and L† is the pseudo-inverse of L [5]. Computing both of these measures between all pairs of nodes involves inverting a matrix, i.e. (I − αA)^{−1} or L†. Standard algorithms for a matrix inverse require O(n³) time and O(n²) memory and are inappropriate for a large network (see Section 2 for a brief survey of existing alternatives). Inspired by applications in anomalous link detection and recommendation [18,20], we focus on computing only a single Katz score or commute time and on computing the k most related nodes by Katz score. In Section 3, we propose customized methods for the pairwise problems based on the Lanczos/Stieltjes procedure [8]. We specialize it for the Katz and commute time measures, providing a novel and useful application for the Lanczos/Stieltjes procedure. In Section 4, we present an algorithm to approximate the strongest ties between a given source node and its neighbors in terms of the Katz score (while we discuss the case of commute time in the conclusion section). This algorithm is inspired by a technique for personalized PageRank computing [14,2,3], though heavily adapted to the Katz score. We evaluate these methods on three real-world networks and report the results in Section 5. Our methods produce answers in seconds or milliseconds, whereas preprocessing techniques may often take over 10 minutes. We have made our codes and data available for others to reproduce our results: http://stanford.edu/~dgleich/publications/2010/codes/fast-katz/.
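To make these definitions concrete, the following dense-algebra sketch (our own illustration, feasible only for tiny graphs) computes all-pairs Katz scores as K = (I − αA)^{−1} − I and all-pairs commute times from the Laplacian pseudo-inverse.

import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)        # a toy undirected graph
n = A.shape[0]
alpha = 0.9 / np.linalg.norm(A, 2)               # alpha < 1/||A||_2 for convergence
K = np.linalg.inv(np.eye(n) - alpha * A) - np.eye(n)   # all-pairs Katz scores

L = np.diag(A.sum(axis=1)) - A                   # graph Laplacian L = D - A
Ldag = np.linalg.pinv(L)                         # pseudo-inverse of L
vol = A.sum()                                    # Vol(G): sum of entries of A
diag = np.diag(Ldag)
C = vol * (diag[:, None] + diag[None, :] - 2 * Ldag)   # C_ij = Vol(G)(Ldag_ii - 2 Ldag_ij + Ldag_jj)

print(K[0, 3], C[0, 3])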
2 Related Work
Most existing techniques to compute the Katz score and commute time determine the scores among all pairs of nodes simultaneously [1,24,20]. These methods tend to involve some preprocessing of the graph and a single, rather expensive, computation. In this paper instead we focus on quick estimates of these measures between a single pair of nodes and between a single node and all other nodes in the graph. Standard techniques to approximate Katz scores include truncating the series expansion to paths of length less than ℓ_max [4,24] and low-rank approximation [13,1]. (Note that computing these Katz scores between nodes is quite different from computing Katz's status index.) In general, these techniques for all the scores require more time and memory than our approach, and we do not compare against them for this reason. Sarkar and Moore [20] proposed an interesting and efficient approach for finding approximate nearest neighbors with respect to a truncated version of the commute time measure. In [21], Sarkar et al. use their truncated commute time measure for link prediction over a collaboration graph and show that it outperforms personalized PageRank [15]. Spielman and Srivastava [22] show how to approximate the effective resistance of all edges (which is proportional to commute time) in O(m log n) time for a graph with m edges and n nodes. These procedures all involve some preprocessing. Recently, Li et al. studied pairwise approximations of SimRank scores [12].
3 Algorithms for Pairwise Score
Consider the Katz score and commute time between a single pair of nodes: K_{i,j} = e_i^T (I − αA)^{−1} e_j − δ_{i,j} and C_{i,j} = Vol(G)(e_i − e_j)^T L† (e_i − e_j). In these expressions, e_i and e_j are vectors of zeros with a 1 in the ith and jth position, respectively; and δ_{i,j} is the Kronecker delta function. A straightforward means of computing them is to solve the linear systems (I − αA)x = e_j and (L + (1/n)ee^T)y = e_i − e_j. Then K_{i,j} = e_i^T x − δ_{i,j} and C_{i,j} = Vol(G)(e_i − e_j)^T y. This form of commute time follows after substituting L† = (L + (1/n)ee^T)^{−1} − (1/n)ee^T (see [19]). Solving these linear systems is an effective method to compute only the pairwise scores. In what follows, we show how a technique combining the Lanczos iteration and a quadrature rule produces the pairwise Katz score and commute time score, as well as upper and lower bounds on the estimate. Our technique is based on the methodology developed in [8,9], which we describe below. Note that for α < 1/‖A‖₂, (I − αA) is symmetric positive definite, as is (L + (1/n)ee^T). Thus, the pairwise Katz score and the commute time score are related to the problem of computing the bilinear form u^T f(E) v, where E is a symmetric positive definite matrix. In the most general setting, u and v are given vectors and f is an analytic function on an interval containing the eigenvalues of E. In the application to Katz scores and commute time, f(E) = E^{−1}. Note we need only consider u = v because

u^T f(E) v = (1/4)[(u + v)^T f(E)(u + v) − (u − v)^T f(E)(u − v)].    (1)
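As a quick numerical check of this reduction (our own illustration, with f(E) = E^{−1}), the off-diagonal bilinear form can be recovered from the two quadratic forms in identity (1):

import numpy as np

n = 50
B = np.random.rand(n, n)
E = B @ B.T + n * np.eye(n)                      # a random symmetric positive definite matrix
u, v = np.eye(n)[3], np.eye(n)[7]                # e_i and e_j
q = lambda w: w @ np.linalg.solve(E, w)          # the quadratic form w^T E^{-1} w
lhs = u @ np.linalg.solve(E, v)                  # e_i^T E^{-1} e_j
rhs = 0.25 * (q(u + v) - q(u - v))               # identity (1)
print(abs(lhs - rhs))                            # ~ machine precision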
Golub and Meurant [8,9] introduced techniques for evaluating such bilinear forms. They provided a solid mathematical framework and a rich collection of possible applications. These techniques are well known in the numerical linear algebra community, but they do not seem to have been used in data mining problems. We utilize this methodology to compute pairwise scores, which extends to a large-scale setting. The algorithm has two main components: Gauss-type quadrature rules for evaluating definite integrals, and the Lanczos algorithm for partial reduction to symmetric tridiagonal form. Because E is symmetric positive definite, it has a unitary spectral decomposition, E = QΛQ^T, where Q is an orthogonal matrix whose columns are eigenvectors of E with unit 2-norm, and Λ is a diagonal matrix with the eigenvalues of E along its diagonal. We use this decomposition only for the derivation that follows; it is never explicitly computed in our algorithm. Given this decomposition, for any analytic function f,

u^T f(E) u = u^T Q f(Λ) Q^T u = Σ_{i=1}^{n} f(λ_i) ũ_i²,

where ũ = Q^T u. The last sum can be thought of as a quadrature rule for computing the Stieltjes integral

u^T f(E) u = ∫_a^b f(λ) dγ(λ).    (2)
Here γ is a piecewise constant measure, which is monotonically increasing, and its values depend directly on the eigenvalues of E; λ denotes the set of all eigenvalues; γ is a discontinuous step function, each of whose pieces is a constant function. Specifically, γ(λ) is identically zero if λ < min_i λ_i(E), is equal to Σ_{j=1}^{i} ũ_j² if λ_i ≤ λ < λ_{i+1}, and is equal to Σ_{j=1}^{n} ũ_j² if λ ≥ max_i λ_i(E). The first of Golub and Meurant's key insights is that we can compute an approximation for an integral of the form (2) using a quadrature rule. The second insight is that the Lanczos procedure constructs a tridiagonal matrix whose eigenvalues are the quadrature nodes for the specific measure γ, and u = e_i. Since we use a quadrature rule, an estimate of the error is readily available. More importantly, we can use variants of the Gaussian integration formula to obtain both lower and upper bounds and "trap" the value of the element of the inverse that we seek between these bounds. The ability to estimate bounds for the value is powerful and provides effective stopping criteria for the algorithm. It is important to note that such component-wise bounds cannot be easily obtained if we were to extract the value of the element from a column of the inverse by solving the corresponding linear system. Indeed, typically for the solution of a linear system, norm-wise bounds are available, but obtaining bounds pertaining to the components of the solution is significantly more challenging, and results of this sort are harder to establish. Algorithm 1 reproduces a concise procedure from [9] to estimate u^T E^{−1} u. The input is a matrix E, a vector u, estimates of extremal eigenvalues of E,
Algorithm 1. Computing Score Bounds
Input: E, u, a < λ_min(E), b > λ_max(E), k
Output: b_k^lo ≤ u^T E^{−1} u ≤ b_k^up
1: Initial step: h_{−1} = 0, h_0 = u, ω_1 = u^T E u, γ_1 = ‖(E − ω_1 I)u‖, b_1 = ω_1^{−1}, d_1 = ω_1, c_1 = 1, d_1^up = ω_1 − a, d_1^lo = ω_1 − b, h_1 = (E − ω_1 I)u / γ_1
2: for j = 2, ..., k do
3:   ω_j = h_{j−1}^T E h_{j−1}
4:   h̃_j = (E − ω_j I) h_{j−1} − γ_{j−1} h_{j−2}
5:   γ_j = ‖h̃_j‖
6:   h_j = h̃_j / γ_j
7:   b_j = b_{j−1} + γ_{j−1}² c_{j−1}² / (d_{j−1}(ω_j d_{j−1} − γ_{j−1}²))
8:   d_j = ω_j − γ_{j−1}²/d_{j−1};   c_j = c_{j−1} γ_{j−1}/d_{j−1}
9:   d_j^up = ω_j − a − γ_{j−1}²/d_{j−1}^up;   d_j^lo = ω_j − b − γ_{j−1}²/d_{j−1}^lo
10:  ω_j^up = a + γ_j²/d_j^up;   ω_j^lo = b + γ_j²/d_j^lo
11:  b_j^up = b_j + γ_j² c_j² / (d_j(ω_j^up d_j − γ_j²));   b_j^lo = b_j + γ_j² c_j² / (d_j(ω_j^lo d_j − γ_j²))
12: end for
(The superscripts up/lo stand in for the over- and under-bars of the original notation, which did not survive extraction.)
a and b, and a number of iterations k. In practice we can use the infinity norm of the original matrix as an estimate for b; this quantity is trivial to compute. The value of a is known to be small and positive, and in our experiments we set it to 10^{−4}. (We note here that dynamically varying alternatives exist, but these were not necessary in our experiments.) The algorithm computes b_j^lo and b_j^up, lower and upper bounds for u^T E^{−1} u. The core of the algorithm is steps 3–6, which are nothing but the Lanczos algorithm. In line 7 we apply the summation for the quadrature formula. The computation needs to be done for the upper bound as well as the lower bound; see lines 9 and 10. Line 11 computes the required bounds that "trap" the required quadratic form from above and below. For Katz we set E = (I − αA) and use (1) to get e_i^T E^{−1} e_j by running the procedure twice and transposing the upper and lower bounds due to the subtraction. For commute time we approximate (e_i − e_j)^T (L + (1/n)ee^T)^{−1} (e_i − e_j).
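The following is a direct transcription of Algorithm 1 into Python/NumPy, a sketch under our reconstruction of the listing above. It assumes ‖u‖ = 1 (as for u = e_i); for other vectors, normalize u and scale both bounds by ‖u‖². Which Radau modification yields the upper versus the lower bound follows the analysis in [8,9].

import numpy as np

def score_bounds(E, u, a, b, k):
    # Bounds on u^T E^{-1} u for symmetric positive definite E, ||u|| = 1.
    h_prev, h = np.zeros_like(u), u.copy()       # h_{-1} = 0, h_0 = u
    omega = h @ (E @ h)                          # omega_1
    res = E @ h - omega * h
    gamma = np.linalg.norm(res)                  # gamma_1
    bj = 1.0 / omega                             # b_1 (Gauss estimate)
    d, c = omega, 1.0                            # d_1, c_1
    dup, dlo = omega - a, omega - b              # d_1^up, d_1^lo
    blo, bup = bj, bj
    h_prev, h = h, res / gamma                   # h_1
    for _ in range(2, k + 1):
        omega = h @ (E @ h)                                        # line 3
        res = E @ h - omega * h - gamma * h_prev                   # line 4
        gamma_new = np.linalg.norm(res)                            # line 5
        bj += gamma**2 * c**2 / (d * (omega * d - gamma**2))       # line 7
        c = c * gamma / d                                          # line 8
        dup = omega - a - gamma**2 / dup                           # line 9
        dlo = omega - b - gamma**2 / dlo
        d = omega - gamma**2 / d                                   # line 8
        wup = a + gamma_new**2 / dup                               # line 10
        wlo = b + gamma_new**2 / dlo
        bup = bj + gamma_new**2 * c**2 / (d * (wup * d - gamma_new**2))  # line 11
        blo = bj + gamma_new**2 * c**2 / (d * (wlo * d - gamma_new**2))
        h_prev, h = h, res / gamma_new                             # line 6
        gamma = gamma_new
    return blo, bup

For Katz one would call this with E = I − αA and the two vectors from identity (1); for commute time, with E = L + (1/n)ee^T and u proportional to e_i − e_j.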
4 Top-k Algorithms
In this section, we show how to adapt techniques for rapid personalized PageRank computation [14,2,3] to the problem of computing the top-k largest Katz scores. These algorithms exploit the graph structure by accessing the edges of individual vertices, instead of accessing the graph via a matrix-vector product. They are “local” because they only access the outlinks of a small set of vertices and need not explore the majority of the graph. See the conclusions for a discussion of commute time and why we cannot utilize this procedure for that measure. The basis of the algorithm is a variant on the Richardson stationary method for solving a linear system [23]. Given a linear system Ax = b, the Richardson iteration
is x^{(k+1)} = x^{(k)} + ω r^{(k)}, where r^{(k)} = b − Ax^{(k)} is the residual vector at the kth iteration and ω is an acceleration parameter. While updating x^{(k+1)} is a linear time operation, computing the next residual requires another matrix-vector product. To take advantage of the graph structure, the personalized PageRank algorithms [14,2,3] propose the following change: do not update x^{(k+1)} with the entire residual, and instead change only a single component of x. Formally, x^{(k+1)} = x^{(k)} + ω r_j^{(k)} e_j, where r_j^{(k)} is the jth component of the residual vector. Now, computing the next residual involves accessing a single column of the matrix A:

r^{(k+1)} = b − Ax^{(k+1)} = b − A(x^{(k)} + ω r_j^{(k)} e_j) = r^{(k)} − ω r_j^{(k)} A e_j.

Suppose that r, x, and Ae_j are sparse; then this update introduces only a small number of new nonzeros into both x and the new residual r. Each column of A is sparse for most graphs, and thus keeping the solution and residual sparse is a natural goal for graph algorithms where the solution x is localized (i.e., many components of x can be rounded to 0 without dramatically changing the solution). By choosing the element j based on the largest entry in the sparse residual vector (maintained in a heap), this algorithm often finds a good approximation to the largest entries of the solution vector x while accessing only a small subset of the graph. Dropping the heap as in [2] yielded slightly worse localization, and thus we did not use it in these experiments. For a particular node i in the graph, the Katz scores to the other nodes are given by k_i = [(I − αA)^{−1} − I] e_i. Let (I − αA)x = e_i. Then k_i = x − e_i. We use the above process with ω = 1 to compute x. For this system, x and r are always positive, and the residual converges to 0 geometrically if α < 1/‖A‖₁. We observe convergence empirically for 1/‖A‖₁ < α < 1/‖A‖₂ and have developed some theory to justify this result, but do not have space to present it here. To terminate our algorithm, we wait until the largest element in the residual is smaller than a specified tolerance, for example 10^{−4}.
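A compact sketch (ours) of this procedure for the Katz system follows; the lazy max-heap over residual entries and the dictionary-based sparse vectors are our own implementation choices, not prescribed by the paper.

import heapq

def topk_katz(A, i, alpha, tol=1e-4):
    # Richardson with omega = 1 and single-entry residual updates for
    # (I - alpha*A) x = e_i. A is a symmetric {0,1} adjacency matrix in
    # scipy.sparse CSR format; returns approximate entries of
    # k_i = [(I - alpha*A)^{-1} - I] e_i as a dict.
    x, r = {}, {i: 1.0}
    heap = [(-1.0, i)]                       # lazy max-heap of residual entries
    while heap:
        neg, j = heapq.heappop(heap)
        rj = r.get(j, 0.0)
        if -neg != rj:                       # stale entry: a newer value exists
            continue
        if rj < tol:                         # largest residual below tolerance
            break
        x[j] = x.get(j, 0.0) + rj            # x <- x + r_j e_j
        r[j] = 0.0
        lo, hi = A.indptr[j], A.indptr[j + 1]
        for u in A.indices[lo:hi]:           # r <- r + alpha * r_j * (A e_j)
            r[u] = r.get(u, 0.0) + alpha * rj
            heapq.heappush(heap, (-r[u], u))
    if i in x:
        x[i] -= 1.0                          # subtract the e_i term: k_i = x - e_i
    return x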
5 Empirical Evaluation
Our experimental goals are: (i) to test the convergence speed; (ii) to measure the accuracy and scalability of our algorithms; and (iii) to compare our algorithms against the conjugate gradient (CG) method. Recall our setting: we only want a single score or top-k set. We use the CG iterative method as a reference point for our pairwise and top-k algorithms because it provides solutions in the large scale case without any preprocessing, just like our algorithms. As we previously mentioned, approaches based on preprocessing or simultaneously computing all the scores take considerably longer but provide more information. In the case of finding a small set of pairwise values, we leave finding the trade-off between our fast pairwise algorithms and the all-at-once approaches to future work. Experiment settings. We implemented our methods in Matlab and Matlab mex codes. All computations and timings were done in Linux on a laptop with a Core2Duo T7200 processor (2 core, 2GHz) with 2GB of memory. We used
Table 1. Basic statistics about our datasets: number of nodes and edges, average degree, max singular value (‖A‖₂), and size of the 2-core in vertices

Graph    Nodes     Edges       Avg Degree   ‖A‖₂       2-core Size
dblp     93,156    178,145     3.82         39.5753    76,578
arxiv    86,376    517,563     11.98        99.3319    45,342
flickr   513,969   3,190,452   12.41        663.3587   233,395
three real-world networks for our experiments: two citation-based networks based on publication databases, and one social network. The dataset¹ statistics are reported in Table 1.

Pairwise results. We begin by studying the accuracy of the pairwise algorithms for Katz scores and commute times. For this task, we first compute a highly accurate answer using the minres method [7] to solve the corresponding linear systems: (I − αA)x = e_i for Katz and (L + (1/n)ee^T)x = (e_i − e_j) for commute time. We used a tolerance of 10^{−8} in these solutions. Next, we run our pairwise method. Recall that using Algorithm 1 requires a lower bound on the smallest eigenvalue of the matrix E. We use 10^{−4} for this bound. We terminate our algorithms when the relative change in the upper and lower bounds is smaller than 10^{−4} or the upper and lower bounds cross each other. We evaluate the accuracy at each iteration of Algorithm 1. Because our approach to compute Katz scores requires two applications of Algorithm 1, the work at each iteration takes two matrix-vector products. As described in previous sections, our pairwise algorithm is closely related to iterative methods for linear systems, but with the added benefit of providing lower and upper bounds. As such, its convergence closely tracks that of the conjugate gradient method, a standard iterative method. We terminate conjugate gradient when the norm of the residual is smaller than 10^{−4}. For convergence of the Katz scores, we use a value of α that makes B = I − αA nearly indefinite. Such a value produces the slowest convergence in our experience. The particular value we use is α = 1/(‖A‖₂ + 1). For a single pair of nodes in arxiv, we show how the upper and lower bounds "trap" the pairwise Katz scores in Figure 1 (top left). At iteration 13, the lower bound approaches the upper bound. Beyond this point the algorithm converges quickly. Similar convergence results are produced for the other two graphs. We show the convergence of both bounds to the exact solution in the bottom row. Both the lower and upper bounds converge similarly. In comparison with the conjugate gradient method, our pairwise algorithm takes more matrix-vector products to converge. This happens because we must perform two applications of Algorithm 1. However, the conjugate gradient method does not provide upper and lower bounds on the element of the inverse, which our techniques do. The forthcoming experiments with commute time illustrate a case where it is difficult to terminate conjugate gradient early because of erratic convergence. For these problems, we also evaluated techniques
¹ In the interest of space we provide processing details of the datasets on our web page: http://stanford.edu/~dgleich/publications/2010/codes/fast-katz/
[Figure 1 appears here: six panels plotting bounds and relative error against matrix-vector products, with curves for cg and the lower and upper bounds, for k-hard α.]
Fig. 1. Upper and lower bounds (top) and approximation error (bottom) for pairwise Katz on arxiv (left), dblp (center), and flickr (right)
based on the Neumann series for I − αA, but those took over 100 times as many iterations as conjugate gradient or our pairwise approach. The Neumann series is the same algorithm used in [24] but customized for the linear system, not the matrix inverse, which is a more appropriate comparison for the pairwise case. In Figure 2, we show how commute time converges for the same pairs of nodes. Again, the top row shows the convergence of the upper and lower bounds, and the bottom row shows the convergence of the error. While Katz took only a few iterations, computing pairwise commute times requires a few hundred iterations. A notable result is that the lower bound from the quadrature rule provides a more accurate estimate of commute time than does the upper bound. See the curve of the lower bound in the bottom row of Figure 2. This observation suggests that using the lower bound as an approximate solution is probably better for commute time. Note that the relative error in the lower bound produced by our algorithm is almost identical to the relative error from CG. This behavior is expected in cases where the largest eigenvalue of the matrix is well-separated from the remaining eigenvalues – a fact that holds for the Laplacians of our graphs. When this happens, the Lanczos procedure underlying both our technique and CG quickly produces an accurate estimate of the true largest eigenvalue, which in turn corrects the effect of our initial overestimate of the largest eigenvalue. (Recall from Algorithm 1 that the estimate of b is present in the computation of the lower bound b_j^lo.) Here, the conjugate gradient method suffers two problems. First, because CG does not provide bounds on the score, it is not possible to terminate it until the residual is small. Thus, the conjugate gradient method requires about twice as many iterations as our pairwise algorithms. Note, however, this result is simply a
[Figure 2 appears here: six panels plotting bounds and relative error against matrix-vector products, with curves for cg and the lower and upper bounds.]
Fig. 2. Upper and lower bounds (top) and approximation error (bottom) for pairwise commute time scores on arxiv (left), dblp (center), and flickr (right)
matter of detecting when to stop – both conjugate gradient and our lower bound produce similar relative errors for the same work. Second, the relative error for conjugate gradient displays erratic behavior. Such behavior is not unexpected, because conjugate gradient optimizes the A-norm of the solution error and it is not guaranteed to provide smooth convergence in the norm of the residual. These oscillations make early termination of the CG algorithm problematic, whereas no such issues occur for the upper and lower bounds from our pairwise algorithms.

Top-k results. We now proceed to a similar investigation of the top-k algorithms for Katz scores. In this section, we are concerned with the convergence of the set of top-k results. Thus, we evaluate each algorithm in terms of the precision between the top-k results generated by our algorithms and the exact top-k set produced by solving the linear system. Natural alternatives are other iterative methods and specialized direct methods that exploit sparsity. The latter – including approaches such as truncated commute time [20] – are beyond the scope of this work, since they require a different computational treatment in terms of caching and parallelization. Thus, we again use conjugate gradient (CG) as an example of iterative methods. Let T_k^alg be the top-k set from our algorithm and T_k^* be the exact top-k set. The precision at k is |T_k^alg ∩ T_k^*|/k, where |·| denotes cardinality. We also look at the Kendall-τ correlation coefficient between our algorithm's results and the exact top-k set. This experiment will let us evaluate whether the algorithm is ordering the true set of top-k results correctly. Let x_{k*}^alg be the scores from our algorithm on the exact top-k set, and let x_{k*}^* be the true top-k scores. The τ coefficients are computed between x_{k*}^alg and x_{k*}^*. Both of these measures should tend to 1 as we increase the work in our algorithms. However, some of the exact
top-k results contain tied values. Our algorithm has trouble capturing precisely tied values, and the effect is that our Kendall-τ score does not always tend to 1 exactly. To compare with the pairwise results, we present the algorithm performance in effective matrix-vector products. An effective matrix-vector product corresponds to our algorithm examining the same number of edges as a matrix-vector product. In other words, suppose the algorithm accesses a total of 80 neighbors in a graph with 16 edges. Then this instance corresponds to (80/16)/2 = 2.5 effective matrix-vector products. For our first set of tests, we let the algorithm run for a prescribed number of steps and evaluate the results at the end. In Figure 3, we plot the convergence of the top-k set for k = 10, 25, 100, and 1000 for a single node. The top figures plot the precision at k, and the bottom figures plot the Kendall-τ correlation with the exact top-k set. Both of these measures trend to 1 quickly. In fact, the top-25 set is nearly converged after the equivalent of a single matrix-vector product – equivalent to just one iteration of the CG algorithm. We show results from the conjugate gradient method for the top-25 set after 2, 5, 10, 15, 25, and 50 matrix-vector products.
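The two evaluation measures are straightforward to compute; a small sketch (ours) using NumPy and scipy.stats:

import numpy as np
from scipy.stats import kendalltau

def precision_at_k(x_alg, x_exact, k):
    # Fraction of the exact top-k set recovered by the algorithm's top-k set.
    top_alg = set(np.argsort(-x_alg)[:k])
    top_exact = set(np.argsort(-x_exact)[:k])
    return len(top_alg & top_exact) / k

def kendall_on_exact_topk(x_alg, x_exact, k):
    # Kendall-tau between approximate and exact scores on the exact top-k set.
    idx = np.argsort(-x_exact)[:k]
    tau, _ = kendalltau(x_alg[idx], x_exact[idx])
    return tau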
Fig. 3. Precision (top) and Kendall-τ correlation (bottom) for top-k Katz scores on arxiv (left), dblp (center), and flickr (right). We use the same value of α as Figure 1.
On the dblp graph, the top-k algorithm produces almost the exact Katz top-k set with just slightly more than 1 effective matrix-vector product. For flickr, we see a striking transition around 1 effective matrix-vector product, when it seems to suddenly “lock” the top-k sets, then slowly adjust their order. In all of the experiments, the CG algorithm does not provide any useful information until it converges. Our top-k algorithm produces useful partial information in much less work and time.
Runtime. Finally, we present the runtime of our pairwise and top-k methods in Table 2. We explore two values of α for Katz: easy-α = 1/(10‖A‖₁ + 10) and hard-α = 1/(max(λ(A)) + 1). The former should converge more quickly than the latter. In the pairwise case, we evaluate the runtime on three pairs of nodes. These pairs were chosen such that there was a high degree-high degree pair, a high degree-low degree pair, and a low degree-low degree pair. For these, we use the shorthand high-high pair, etc. The results show the impact of these choices. As expected, the easy-α cases converged faster and commute time converged slower than either Katz score. In this small sample, the degree of the pairs played a role. On flickr, for example, the low-low pair converged fastest for Katz, whereas the high-low pair converged fastest for commute time. The solution tolerance was 10^{−4}. We do not report separate computation times for the conjugate gradient method, but note that the previous experiments suggest it should take about half the time for the Katz problems and about twice as long for the commute time experiments. In the top-k problems, we start the algorithm at one of the vertices among the same pairs of nodes. We terminate it when the largest element in the residual vector is smaller than 10^{−4} α d_u, where d_u is the degree of the source node. For most of the experiments, this setting produced a 2-norm residual smaller than 10^{−4}, which is the same convergence criterion as for CG.

Table 2. Runtime (in seconds) of the pairwise (left) and top-k (right) algorithms for Katz scores and commute time. See the text for a description of the cases.

Pairwise algorithm:
Graph    Pairs        Katz easy-α   Katz hard-α   Commute
arxiv    High, high   0.6081        2.6902        24.8874
arxiv    High, low    0.6068        2.3689        19.7079
arxiv    Low, low     0.3619        0.5842        10.7421
dblp     High, high   0.3266        1.7044        10.3836
dblp     High, low    0.3436        1.3010        8.8664
dblp     Low, low     0.2133        0.5458        8.3463
flickr   High, high   5.1061        12.7508       227.2851
flickr   High, low    4.2578        11.0659       82.0949
flickr   Low, low     2.6037        3.4782        172.5125

Top-k algorithm:
Graph    Degree   Katz easy-α   Katz hard-α
arxiv    High     0.0027        0.2334
arxiv    Low      0.0003        0.2815
arxiv    Low      0.0004        0.5315
dblp     High     0.0012        0.0163
dblp     Low      0.0011        0.0161
dblp     Low      0.0007        0.0173
flickr   High     0.0741        0.0835
flickr   Low      0.0036        36.2140
flickr   Low      0.0040        0.0063
6 Conclusions and Future Work
Measures based on ensembles of paths such as the Katz score and the commute time have been found useful in several applications such as link prediction and collaborative filtering. In this paper, motivated by applications, we
focused on two problems related to fast approximations for these scores. First, for finding the score between a specified pair of nodes, we have proposed an efficient algorithm to compute it and also obtain upper and lower bounds, making use of a technique for computing bilinear forms. Second, for finding the top-k nodes that have the highest Katz scores with respect to a given source node, we have proposed a top-k algorithm based on a variant of the Richardson stationary method used in personalized PageRank. We have conducted a set of experiments on three real-world datasets and obtained many encouraging results. Our experiments demonstrate the scalability of the proposed method to large networks, without giving up much accuracy with respect to the direct methods (which are infeasible on large networks). There are many possible extensions of our techniques. For example, the algorithm we propose for computing the Katz score and commute time between a given pair of nodes extends to the case where one wants to find the aggregate score between a node and a set of nodes. This could be useful in methods that find clusters using commute time [16,17,25]. In these cases, the commute time between a node and a group of nodes (e.g., a cluster) measures their affinity. We plan to explore this generalization in future work. Furthermore, in link prediction, anomalous link detection, and recommendation, the underlying graph is dynamic and evolving in time. These tasks require almost real-time computation because the results should reflect the latest state of the network, not the results of an offline cached computation. Therefore, calculation of these metrics must be as fast as possible. We hope to evaluate our algorithms in such a dynamic setting, where we believe they should fit nicely because of their fast computation and preprocessing-free nature. An alternative is to combine some offline processing with techniques to get fast online estimates of the scores. These techniques invariably involve a compromise between scalability of the approach (e.g., computing a matrix factorization offline) and the complexity of implementation (see [10,3] for examples in personalized PageRank). One key weakness of our current top-k algorithms is that they do not apply to estimating the closest commute time neighbors. This problem arises because the expression for all the commute times relative to a given node involves all of the diagonal entries of the matrix inverse, whereas the top-k algorithm only finds an approximation to a single linear system. We are currently investigating a diffusion-based measure that is inspired by commute time and can be used with our Richardson technique. Preliminary results show good agreement between the k closest nodes using commute time and the top-k set of the diffusion measure.
References
1. Acar, E., Dunlavy, D.M., Kolda, T.G.: Link prediction on evolving data using matrix and tensor factorizations. In: Proceedings of the 2009 IEEE International Conference on Data Mining Workshops, ICDMW 2009, pp. 262–269. IEEE Computer Society, Los Alamitos (2009)
2. Andersen, R., Chung, F., Lang, K.: Local graph partitioning using PageRank vectors. In: Proc. of the 47th Annual IEEE Sym. on Found. of Comp. Sci. (2006)
3. Berkhin, P.: Bookmark-coloring algorithm for personalized PageRank computing. Internet Math. 3(1), 41–62 (2007)
4. Foster, K.C., Muth, S.Q., Potterat, J.J., Rothenberg, R.B.: A faster Katz status score algorithm. Comput. & Math. Organ. Theo. 7(4), 275–285 (2001)
5. Fouss, F., Pirotte, A., Renders, J.-M., Saerens, M.: Random-walk computation of similarities between nodes of a graph with application to collaborative recommendation. IEEE Trans. Knowl. Data Eng. 19(3), 355–369 (2007)
6. Göbel, F., Jagers, A.A.: Random walks on graphs. Stochastic Processes and their Applications 2(4), 311–336 (1974)
7. Golub, G.H., Van Loan, C.F.: Matrix Computations, 3rd edn. Johns Hopkins Univ. Press, Baltimore (1996)
8. Golub, G.H., Meurant, G.: Matrices, moments and quadrature. In: Numerical Analysis 1993 (Dundee, 1993). Pitman Res. Notes Math. Ser., vol. 303, pp. 105–156. Longman Sci. Tech., Harlow (1994)
9. Golub, G.H., Meurant, G.: Matrices, moments and quadrature II: How to compute the norm of the error in iterative methods. BIT Num. Math. 37(3), 687–705 (1997)
10. Jeh, G., Widom, J.: Scaling personalized web search. In: Proceedings of the 12th International Conference on the World Wide Web, pp. 271–279. ACM, New York (2003)
11. Katz, L.: A new status index derived from sociometric analysis. Psychometrika 18, 39–43 (1953)
12. Li, P., Liu, H., Yu, J.X., He, J., Du, X.: Fast single-pair SimRank computation. In: Proc. of the SIAM Intl. Conf. on Data Mining (SDM 2010), Columbus, OH (2010)
13. Liben-Nowell, D., Kleinberg, J.M.: The link prediction problem for social networks. In: Proc. of the ACM Intl. Conf. on Inform. and Knowlg. Manage. (CIKM 2003) (2003)
14. McSherry, F.: A uniform approach to accelerated PageRank computation. In: Proc. of the 14th Intl. Conf. on the WWW, pp. 575–582. ACM Press, New York (2005)
15. Page, L., Brin, S., Motwani, R., Winograd, T.: The PageRank citation ranking: Bringing order to the web. Technical Report 1999-66, Stanford University (November 1999)
16. Qiu, H., Hancock, E.R.: Commute times for graph spectral clustering. In: Gagalowicz, A., Philips, W. (eds.) CAIP 2005. LNCS, vol. 3691, pp. 128–136. Springer, Heidelberg (2005)
17. Qiu, H., Hancock, E.R.: Clustering and embedding using commute times. IEEE Trans. Pattern Anal. Mach. Intell. 29(11), 1873–1890 (2007)
18. Rattigan, M.J., Jensen, D.: The case for anomalous link discovery. SIGKDD Explor. Newsl. 7(2), 41–47 (2005)
19. Saerens, M., Fouss, F., Yen, L., Dupont, P.: The principal components analysis of a graph, and its relationships to spectral clustering. In: Boulicaut, J.-F., Esposito, F., Giannotti, F., Pedreschi, D. (eds.) ECML 2004. LNCS (LNAI), vol. 3201, pp. 371–383. Springer, Heidelberg (2004)
20. Sarkar, P., Moore, A.W.: A tractable approach to finding closest truncated-commute-time neighbors in large graphs. In: Proc. of the 23rd Conf. on Uncert. in Art. Intell. (UAI 2007) (2007)
21. Sarkar, P., Moore, A.W., Prakash, A.: Fast incremental proximity search in large graphs. In: Proc. of the 25th Intl. Conf. on Mach. Learn. (ICML 2008) (2008)
22. Spielman, D.A., Srivastava, N.: Graph sparsification by effective resistances. In: Proc. of the 40th Ann. ACM Symp. on Theo. of Comput. (STOC 2008), pp. 563–568 (2008)
23. Varga, R.: Matrix Iterative Analysis. Prentice-Hall, Englewood Cliffs (1962)
24. Wang, C., Satuluri, V., Parthasarathy, S.: Local probabilistic models for link prediction. In: Proceedings of the 2007 Seventh IEEE International Conference on Data Mining, ICDM 2007, Washington, DC, USA, pp. 322–331. IEEE Computer Society, Los Alamitos (December 2007)
25. Yen, L., Fouss, F., Decaestecker, C., Francq, P., Saerens, M.: Graph nodes clustering based on the commute-time kernel. In: Zhou, Z.-H., Li, H., Yang, Q. (eds.) PAKDD 2007. LNCS (LNAI), vol. 4426, pp. 1037–1045. Springer, Heidelberg (2007)
Game-Theoretic Models of Information Overload in Social Networks
Christian Borgs¹, Jennifer Chayes¹, Brian Karrer¹,², Brendan Meeder¹,³, R. Ravi¹,³, Ray Reagans¹,⁴, and Amin Sayedi¹,³
¹ Microsoft Research New England, Cambridge, MA
² University of Michigan, Ann Arbor
³ Carnegie Mellon University
⁴ MIT Sloan School of Management
Abstract. We study the effect of information overload on user engagement in an asymmetric social network like Twitter. We introduce simple game-theoretic models that capture rate competition between celebrities producing updates in such networks where users non-strategically choose a subset of celebrities to follow based on the utility derived from high quality updates as well as disutility derived from having to wade through too many updates. Our two variants model the two behaviors of users dropping some potential connections (followership model) or leaving the network altogether (engagement model). We show that under a simple formulation of celebrity rate competition, there is no pure strategy Nash equilibrium under the first model. We then identify special cases in both models when pure rate equilibria exist for the celebrities: For the followership model, we show existence of a pure rate equilibrium when there is a global ranking of the celebrities in terms of the quality of their updates to users. This result also generalizes to the case when there is a partial order consistent with all the linear orders of the celebrities based on their qualities to the users. Furthermore, these equilibria can be computed in polynomial time. For the engagement model, pure rate equilibria exist when all users are interested in the same number of celebrities, or when they are interested in at most two. Finally, we also give a finite though inefficient procedure to determine if pure equilibria exist in the general case of the followership model.
1 Introduction
Social networking sites such as Facebook and Twitter allow users to sign up and keep in touch with others by friending them and receiving updates from their friends at their convenience by logging into web pages that provide the current feed of most recent updates; various mobile applications also port this convenience to smart phones. The two conceptually different modes of establishing this online relationship are symmetric or asymmetric, requiring consent from both sides or only from the following side to initiate and maintain the tie. These two modes are
Work done when all authors were visiting Microsoft Research New England in the summer of 2010.
reflected in two major social network platforms of today: Facebook and Twitter. There are also other major differences between these two platforms in that Facebook is a full-fledged sharing site that allows users to share photos and longer updates, while Twitter is a micro-blogging site restricting each update to a text message of up to 140 characters. These design choices drive the differences in user enrollment and engagement between these two platforms (see, e.g., the infographic in [1]), with the convenience of small tweets and the asymmetric nature of Twitter giving very high rates of adoption, but the more substantial updates and symmetric consent maintaining more users in Facebook. Despite these differences, one feature is common to the basic versions of both home pages: in both platforms, users see one linear update feed reflecting a chronologically sorted order of posts from friends whenever they log into these sites¹. Given that the foremost advantage of these sites is the convenience of getting a quick asynchronous update, the balance of information from various friends in the feed becomes important in determining the value the user will gain from the update. However, the mix of the current feed, such as the proportion from various friends, is determined by the activity level of these friends. This leads to a fundamental shortcoming of the online medium in replicating real-world ties: even though the time and frequency of updating oneself is in one's own control, how much one hears from one particular friend is not². We model the implications of two natural control strategies that users adopt in response (i.e., unfollowing and disengagement) in this paper. The lack of control of regulating the rate of flow of information in a particular link to a tie is part of the broader difficulty of online social networks replicating offline relationships: in the latter, even with asynchronous forms of communication such as postal and electronic mail, as well as in the more familiar forms of synchronized communication such as telephone or face-to-face conversations, the amount of time, and hence information, communicated can be regulated at will by both parties. Even in asymmetric follow cases of this kind, such as attending a lecture at a conference, the user usually has the cognitive ability to switch off from the information presented to her and focus on other activities. However, the current design of online information update systems has reduced the number of degrees of freedom of a user in sending information out to her friends from the number of friends down to essentially one³. Our study can be seen as a first step in examining the effects of this key design decision.
¹ There are variants to this common feature as well: Facebook learns and filters the feed to show what it believes are the most interesting recent updates, and various Twitter clients that allow access to the tweets allow other view options, but we will stick to the basic unfiltered version here for our models.
² This also speaks to an important feature that is lacking in such sites and might greatly impact the ability of the user to tailor the feeds and hence stay engaged in the platform; this idea has already been formulated as information overload being a "filter failure problem" by Clay Shirky, see [6].
³ There are exceptions to these too, of course, such as using lists in Twitter or groups in Facebook, but again our discussion applies to the plain version of these platforms that new users are exposed to.
1.1 Modeling the Resulting Networks
In this paper, we propose two models for explaining the set of ties that are realized and stabilized over time in an asymmetric follow network such as Twitter. This allows us to restrict our attention solely to the update rate feature while not worrying about symmetric consent; furthermore, given that Facebook has already implemented filters for enhancing the update experience while Twitter has not yet done so, our models of information overload focus on the latter, where they are more relevant. Both of our models assume that the rate of sending updates to followers is the key decision variable of an agent in the network to help maintain a good follower circle. Also, both models assume that updates from friends are useful to the user, but an excessive rate of updates has diminishing returns and eventually negative returns of value in terms of flooding out the stream of useful updates from other friends. The first set of "followership" models assumes that users in the network will stay in the network but will unfollow agents from whom the updates are too frequent. The second "engagement" model assumes that users are frustrated by the high update rate of their followees and leave the network altogether with probability proportional to the rate of annoyance from excessively updating friends. Both sets of models are static in that they assume a fixed set of agents; moreover, our models explicitly partition users as producers or consumers of information but not both. This is not accurate in that agents in social networks serve both functions, but it is a useful enough abstraction for our purpose. Moreover, given that over 80% of the tweets come from under 20% of the users, the bipartite model of agents separated loosely into "celebrities" that produce information and "users" that simply consume it is not a major deviation from reality.⁴ The information contained in specific posts is not included in our models, although it can vary significantly (from being updates of key news and events, to information nuggets about specific topic areas or simple grooming updates). Instead, each user associates a value to updates from a celebrity that reflects how interested they are in hearing from that celebrity. This value reflects the average utility the user gets from reading an update from that celebrity, and in this way we avoid the specific content of posts. In both sets of models, our goal is to build simple models of rate competition between celebrities and characterize the ties that remain in the network at equilibrium, and also to infer the "optimal" rates resulting from the model for the celebrities. This can shed light on the disparities between offline and online networks based on the above discussion, as well as on how far away from optimality current celebrity tweet rates can be. While there is anecdotal discussion in the media, especially pertaining to social media marketing, about producers of information (such as corporate Twitter accounts) being cognizant of the rate and quality of their updates, this strategic aspect of information producer behavior has not been studied carefully before our work.
⁴ Applying our followership model to a non-bipartite network can be done by duplicating each individual into a celebrity and a user.
1.2 Our Models
Our basic models assume a complete bipartite graph on two disjoint sets of nodes: the producers or celebrities (denoted C) and the consumers, or simply users or followers (denoted F). Every edge between celebrity i and user j comes with a non-negative quality score q_ij that determines the proportional utility that j derives from following celebrity i. If an edge has quality score zero, we will commonly treat that edge as non-existent. An edge between celebrity i and user j in this bipartite graph represents the possibility of j following i. In the game-theoretic models, the players are the celebrities that compete by setting nonnegative rates r_i, and deriving utility proportional to their influence: namely, celebrity i updating at rate r_i has payoff equal to r_i times the number of followers he has at this rate (which will depend on the rates of other celebrities as dictated by the particular model). The main difference between the two models is the behavior of the users, who are simple payoff maximizers.

Followership Games: For the first model, which we term "Followership Games", the user's payoff is the sum of quality-weighted update rates of the celebrities he follows, minus λ ≥ 0 times a superlinear function of the total rate of all celebrities he follows (for some parameter input λ); we will henceforth use the quadratic function of the total rate, but other choices such as the exponential or some polynomial can also be adopted. Higher λ's imply higher disutility from an increasing total rate of updates of followed celebrities, and the choice of superlinear function also has a similar effect, as the function magnifies the amplitude of the total updates the user receives. One can also use a sublinear function of the total rate of updates, but unless λ is very large, this will result in all competing celebrities updating at unbounded rates in the resulting competition. Furthermore, using a linear rather than quadratic function for the disutility term leads to a decomposition of the user's utility into one term per celebrity. When viewed from the celebrity's side, the competition problem then becomes trivial: every celebrity i faces a certain net linear response from each user j (with parameters varying based on the quality q_ij the user feels from this celebrity and λ) and has to choose an optimal rate of update r_i that maximizes her utility. This problem has no influence from the rates of other celebrities and hence does not involve any competition. For these reasons, we only focus on the superlinear, and in particular the quadratic, disutility case.

Engagement Games: In the second model, the users are assumed to get disengaged and leave the network altogether if they receive too many updates from the celebrities. In terms of the bipartite graph, the user is assumed to be currently following all celebrities for whom they have non-zero quality, and for this paper, we assume all quality values are one in this model. The result of the Engagement Game determines whether the user continues to engage with all of the celebrities (for which their quality is non-zero) by following them, or they disengage and stop following all of them.
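To make the user side concrete, here is a brute-force sketch (our own illustration; the function names are ours) of a user's payoff-maximizing follow set in the Followership model with quadratic disutility, i.e. maximizing Σ_{i∈S} q_ij r_i − λ (Σ_{i∈S} r_i)² over follow sets S.

from itertools import combinations

def best_follow_set(q_j, rates, lam):
    # q_j[i]: quality of celebrity i for this user; rates[i]: update rate r_i.
    # Exhaustive search over subsets; fine for a handful of celebrities.
    celebs = range(len(rates))
    best, best_val = frozenset(), 0.0
    for size in range(len(rates) + 1):
        for S in combinations(celebs, size):
            total = sum(rates[i] for i in S)
            val = sum(q_j[i] * rates[i] for i in S) - lam * total**2
            if val > best_val:
                best, best_val = frozenset(S), val
    return best, best_val

print(best_follow_set([3.0, 1.0, 2.0], [1.0, 2.0, 1.5], 0.5))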
1.3 Our Results
Our main result is negative, showing that even the simple followership model of rate competition does not admit pure Nash equilibria in general (Example 1). This comes from the fundamentally discrete, and hence discontinuous, nature of the utility that producers derive as they increase their rates: at certain breakpoints a producer starts to lose the followers whose "tolerance" limit she exceeds, and hence there are kinks in the producer's utility function. While we do not present necessary conditions for the existence of pure equilibria, we describe several special cases with sufficient conditions. For the Followership model, one is when there is a global ranking of celebrities such that the quality ordering for any user follows the same order among the celebrities (Proposition 3); a slight generalization holds when a dependency graph among the celebrities (reflecting quality dominance for some user) is acyclic (Proposition 4). For the Engagement model, one case where pure rate equilibria exist is when every user has unit quality for exactly k celebrities and zero quality for the others (Proposition 5), and another is when every user has unit quality for at most two celebrities, the case of users with sparse interests among the celebrities (Proposition 6). In the Followership model, we use simple first-order conditions to arrive at a finite set of linear conditions, based on matchings, that determine whether there is a pure equilibrium (Section 2.3). These matching conditions characterize the set of all rates that are candidates for equilibrium. While not an efficient characterization, they do provide a finite necessary characterization of pure strategy rate equilibria for the Followership Game. We propose a similar set of linear conditions for the Engagement Game based on sets of possibly engaged users, but are unable to verify that they characterize all pure equilibria.
1.4 Related Work
The concept of information overload has a long history in the management literature. Roughly speaking, an individual's ability to process information for a task is supposed to follow an inverted U-shape with respect to information quantity: increasing information at first increases an individual's capability, but eventually additional information becomes unhelpful and information-processing ability declines. This general phenomenon is observed in a wide variety of disciplines [2]. Empirical studies have attempted to identify a "core group" of links in networks like Twitter that are active and that matter [3] in capturing the real underlying activity in the network. Our work differs from these streams of work in that we focus on models for the emergence of such a core network in Twitter. Many papers have addressed network formation games in other contexts [4], but have modeled the utility of a node from connecting to a neighbor in coarser terms, without taking into account temporal details of the interaction, such as the rate of information flow, that we attempt to model in our investigation.
[Fig. 1. Number of unfollow events against the number of tweets authored in the observation period. The plot shows the mean, median, 25th, and 75th percentile of the number of users unfollowing, against the number of tweets between August 28 and September 13, 2010.]
Our models are related to problems of pricing between substitutable products by a monopolist. In this analogy, the updates represent quantities of products, celebrities represent substitutable products, followers represent consumers, and the quality of an update from a celebrity to a user resembles the utility multiplier of the product for the consumer. In this context there is much work on competitive pricing across products, and on how this results in over- or under-entry of products into the market, starting from the work of Sattinger [5]. A key difference in our models, however, is the additional disutility that results from consuming too much.
1.5 Empirical Evidence
To support our work empirically, we performed a large-scale analysis of over ten million Twitter users. We collected the profile information of users who have unique user IDs between 20,000,000 and 30,000,000; these users created their accounts between February 3, 2009 and April 9, 2009. Within this set of users we focused on those who had between 1,000 and 5,000 followers on August 28, 2010. In our data set, 65,654 users meet this condition, and we call such accounts "micro-celebrities." These users have possessed accounts for a long enough duration that we believe they are in some sort of equilibrium: they understand how Twitter works, have built up a substantial following, and have developed their own usage patterns. We recollected profile information and a list of followers for each micro-celebrity several times over a period of two weeks. We partitioned the users into buckets based on the number of tweets they created between August 28 and September 13, 2010. The buckets are exponentially sized: $B_0 = \{0\}$, $B_1 = \{1\}$, $B_2 = \{2, 3\}$, ..., $B_{13} = [2^{12}, 2^{13} - 1]$. For each bucket, we looked at how many accounts unfollowed each user in the bucket and computed the mean, median, 25th, and 75th percentile of these unfollow counts. We present these numbers in Figure 1; it is clear that as a micro-celebrity tweets at a higher rate, the number of accounts unfollowing her increases.
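The bucketing just described can be reproduced with a short script; the sketch below is our own, and it assumes the tweet and unfollow counts have already been collected into arrays (the variable names are hypothetical).

```python
import numpy as np

def bucket_stats(tweet_counts, unfollow_counts):
    """Summarize unfollow counts per exponentially sized tweet-count
    bucket: B0 = {0}, B1 = {1}, B2 = {2, 3}, ..., B13 = [2^12, 2^13 - 1]."""
    stats = {}
    for b in range(14):
        lo, hi = (0, 0) if b == 0 else (2 ** (b - 1), 2 ** b - 1)
        in_bucket = (tweet_counts >= lo) & (tweet_counts <= hi)
        if in_bucket.any():
            u = unfollow_counts[in_bucket]
            stats[b] = {"mean": u.mean(), "median": np.median(u),
                        "p25": np.percentile(u, 25),
                        "p75": np.percentile(u, 75)}
    return stats
```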
In the next two sections, we present our two classes of models, Followership and Engagement, and then conclude with some directions for current and future work.
2 A Model of Friendship Selection: The Followership Game
Assume that there are n celebrities and m users. Consider some celebrity i and suppose that the rate of every other celebrity is fixed. We assume that the utility of a user for a celebrity, as the celebrity changes her tweet-rate, has an inverse U-shape, as suggested by the information overload literature [2]; see Figure 2. The parameters of the curve, e.g., where it hits the axis and its rate of change of slope, however, depend on the user model and the tweet-rates of the other celebrities. Given these assumptions, it is clear that a user will not follow a celebrity for whom he has negative utility. Therefore, if an individual celebrity keeps increasing her tweet-rate, her followers will stop following her one by one. If we take the utility of a celebrity to be her influence, i.e., her tweet-rate times the number of followers she has, her utility-rate curve looks like Figure 3, with discontinuities at the points where her followers drop her. The curve has slope k in the first piece, where k ≤ m is the number of users who follow the celebrity if she has a very low tweet-rate; the second piece has slope k − 1, and so on, until the last piece, which has slope 1. If a celebrity wants to maximize her utility given the tweet-rates of the other celebrities, she has to find the peak of this curve and update at that rate.
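Finding that peak amounts to scanning the curve's breakpoints. The following sketch is ours; it presupposes a hypothetical precomputed array where tolerances[j] is the rate at which follower j would drop the celebrity, given the other celebrities' fixed rates.

```python
def best_response_rate(tolerances):
    """Peak of the piecewise-linear utility curve of Figure 3.
    tolerances[j] is the highest rate at which follower j still follows
    this celebrity, holding all other rates fixed.  At rate r the payoff
    is r times the number of followers with tolerance >= r, so the
    maximum is attained at one of the tolerance values themselves."""
    ts = sorted(tolerances, reverse=True)            # most tolerant first
    payoffs = [t * (k + 1) for k, t in enumerate(ts)]
    best = max(range(len(ts)), key=payoffs.__getitem__)
    return ts[best], payoffs[best]                   # optimal rate, utility
```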
2.1 A Specific Utility Model
We assume a complete weighted bipartite graph between celebrities and users. For celebrity i and user j, the weight $q_{ij}$ associated with edge (i, j) indicates the quality that celebrity i has for user j: the higher $q_{ij}$ is, the more utility user j gets from tweets of celebrity i. We assume that all $q_{ij}$'s are known to the celebrities and the users. As stated before, we define the utility of celebrity i to be $U_i = |F_i| r_i$, where $F_i$ is the set of users who follow her and $r_i$ is her tweet-rate. For the utility function of a user, we want a function that captures the property described in Figure 2. Moreover, we need the function to capture the competition between the celebrities. For example, suppose that user a really likes celebrity x but is also following celebrity y. If celebrity x increases her tweet-rate, the rate of information intake for user a increases and he gets overloaded. As a result, he would consider dropping one of the celebrities that he is following; since he likes celebrity x more than celebrity y, chances are that celebrity y will be dropped. Let $C_j$ be the set of celebrities that user j follows; we define the utility of user j to be

$$U_j = \sum_{i \in C_j} r_i q_{ij} - \lambda \Big( \sum_{i \in C_j} r_i \Big)^2.$$
[Fig. 2. Utility of User for a Specific Celebrity (utility against rate)]

[Fig. 3. Utility of a Celebrity (utility against rate)]
The first term represents how much the user benefits from consuming tweets, while the second term represents the information-overload effect. By scaling the qualities, we may assume without loss of generality that λ = 1/2. Users constantly optimize their utility function, i.e., choose the set of celebrities that they want to follow. At the same time, the celebrities are adjusting their rates to maximize their utility. Fix user j and let $x_i$ be the indicator variable for following celebrity i. User j wants to maximize $U_j = \sum_i x_i r_i q_{ij} - \lambda (\sum_i x_i r_i)^2$ subject to $x_i \in \{0, 1\}$. This is a non-linear integer program and is generally hard to solve. To simplify, let us first consider the fractional version of the problem where $x_i \in [0, 1]$, i.e., the user can follow celebrities fractionally.

Fractional Following. If celebrity i is followed fractionally ($0 < x_i < 1$) by user j, we must have $\partial U_j / \partial x_i = 0$. This simplifies to $q_{ij} = 2\lambda \sum_k x_k r_k = \sum_k x_k r_k$. For any celebrity l with $q_{lj} > q_{ij}$ we must have $x_l = 1$, and for any celebrity l with $q_{lj} < q_{ij}$ we must have $x_l = 0$. The graphical representation of the optimal fractional solution is given in Figure 4. Assume that $q_1 \geq \ldots \geq q_n$ is the sorted sequence of $q_{1j}, \ldots, q_{nj}$. Each horizontal segment corresponds to a celebrity; the length of the segment is the tweet-rate of the celebrity and its height is the quality of the user for this celebrity. There is a dashed line with slope 1 that goes through the origin. The user follows the celebrities to the left of the dashed line; the celebrity whose segment intersects the dashed line is followed fractionally. The utility of celebrity i in the fractional setting should be adjusted to $U_i = \sum_j x_{ij} r_i$, where $x_{ij}$ indicates the fraction of celebrity i that user j is following. The following proposition helps us characterize the pure Nash equilibria of this Followership Game in the fractional setting.

Proposition 1. Consider celebrity i and fix the tweet-rates of the other celebrities. If i increases her tweet-rate, her utility $U_i$ will not decrease.
[Fig. 4. User's Strategy: horizontal segments of length $r_i$ at heights $q_i$ (quality against tweet-intake), cut by a dashed line of slope 1 through the origin]
Proof. Consider an arbitrary user j. If celebrity i changes her rate from $r_i$ to $\alpha r_i$ ($\alpha > 1$), the variable $x_{ij}$ in the optimal solution of user j will change to a value of at least $x_{ij}/\alpha$. Therefore, the expression $U_i = \sum_j x_{ij} r_i$ will not decrease. Furthermore, if for some user j and celebrity l we have $x_{lj} > 0$ and $q_{ij} > q_{lj}$, increasing $r_i$ strictly increases $U_i$.

Proposition 1 suggests that in equilibrium the celebrities tweet at a very high rate and each user follows only one celebrity (possibly fractionally). Since this equilibrium is unrealistic, we consider an alternative "integral" user behavior in the Followership Game.
2.2 Greedy Users
One way of modifying the fractional solution obtained in Section 2.1 is to drop the one celebrity (if any) who is followed fractionally. More precisely, if the optimal fractional solution for user j is to follow celebrity i fractionally (we know that at most one such celebrity exists for each user), we now assume that he does not follow her at all; i.e., we set every variable $x_{ij} < 1$ to 0. Equivalently, the user only follows those celebrities that lie completely to the left of the dashed line in Figure 4. The following is an equivalent definition of the user's behavior in this model.

Definition 1. Consider user j and let $q_1 \geq \ldots \geq q_n$ be the sorted order of $q_{1j}, \ldots, q_{nj}$. Let k be the largest index such that $\sum_{i=1}^{k} r_i \leq q_k$. Under the greedy users model, user j follows the k celebrities for whom he has the highest quality and no one else.

Next, we show that a pure strategy Nash equilibrium does not necessarily exist in this model.
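Before turning to that, note that Definition 1 translates directly into code. Here is a sketch of the greedy rule (our own; it exploits the fact that the cumulative rate increases while the sorted qualities decrease, so the scan can stop at the first violation):

```python
def greedy_follow_set(rates, qualities):
    """Greedy user of Definition 1: follow celebrities in decreasing
    order of quality while the cumulative rate of the prefix stays at
    most the quality of its last member (the dashed line of Figure 4)."""
    order = sorted(range(len(rates)), key=lambda i: -qualities[i])
    followed, total = [], 0.0
    for i in order:
        total += rates[i]
        if total > qualities[i]:    # prefix crosses the dashed line
            break
        followed.append(i)
    return followed
```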
[Fig. 5. Non-existence of Pure Nash Equilibrium: the instance of Example 1, with celebrities x, y, z, special users a, b, c attached with qualities k and k + 1, and a group of N unit-quality users attached to each celebrity]
Example 1. There are three celebrities x, y, and z, and 3(1 + N) users. Three special users are labeled a, b, and c; see Figure 5. Qualities for these users are as follows: $q_{ax} = k + 1$, $q_{az} = k$, $q_{bx} = k$, $q_{by} = k + 1$, $q_{cy} = k$, and $q_{cz} = k + 1$. The other 3N users are partitioned into three equal-size groups $G_x$, $G_y$, and $G_z$; the N users in group $G_i$ ($i \in \{x, y, z\}$) have quality 1 for celebrity i and quality 0 for the other celebrities. All other qualities are 0, and edges with zero quality are not shown.

Proposition 2. If $2k - 2 > N > k + 1$, the instance depicted in Example 1 does not have a pure strategy Nash equilibrium.

Proof. Assume for the sake of contradiction that a pure strategy Nash equilibrium exists. First note that if a celebrity i sets her tweet-rate to 1, she has utility at least N + 1 (N from $G_i$, and 1 from the user who has quality k + 1 for her). Since a celebrity can have at most one follower if her rate is strictly above k, any rate > k is strictly dominated. Also observe that with any rate > 1 the celebrity can have at most two followers; therefore, any rate 1 < r < (N + 1)/2 is also dominated. As a result, in any Nash equilibrium the tweet-rate of each celebrity must lie either in the interval (0, 1] or in the interval [(N + 1)/2, k]. In a Nash equilibrium, either zero, one, two, or all of the celebrities tweet at a rate in (0, 1]. We prove case by case that none of these possibilities can hold, concluding that no equilibrium exists. If all three tweet at a rate in (0, 1], it is clearly beneficial for celebrity x (or y, or z, by symmetry) to deviate to tweet-rate k, because her utility would change from N + 2 to 2k. If at least two celebrities tweet at a rate higher than 1, we show that one of them would benefit from changing her rate to 1. By the symmetric structure of the graph, assume without loss of generality that celebrities x and y tweet at rate ≥ (N + 1)/2 ≥ k/2 + 1. Since the sum of their rates is larger than k, user b will only follow y, which makes the utility of celebrity x at most k. Therefore, she would benefit from deviating to tweet-rate 1, which makes her utility N + 2.
Finally, consider the case where only one celebrity is tweeting at a rate in [(N + 1)/2, k]; by the symmetric structure of the graph, assume without loss of generality that this celebrity is x. If y changes her tweet-rate to k, users b and c will both follow her, so her utility will increase to 2k, which contradicts the equilibrium assumption. This completes the proof that no pure strategy Nash equilibrium exists for this example.

Although this example proves that a pure strategy Nash equilibrium does not always exist in the greedy version of the Followership Game, there are special cases for which we can prove existence.

Pure Rate Equilibria Exist Under Global Ranking. Suppose that there is a global ranking of the celebrities such that every user prefers higher-ranked celebrities to lower-ranked ones. Without loss of generality, assume that the ordering is 1, . . . , n; more precisely, the global ranking assumption means that for every user j we have $q_{1j} \geq \ldots \geq q_{nj}$.

Proposition 3. A pure strategy Nash equilibrium exists for the Followership Game when there is a global ranking of the celebrities.

Proof. We construct an equilibrium iteratively, calculating the equilibrium rate $r_i^*$ of celebrity i in step i. In step i, set the rates of celebrities 1, . . . , i − 1 to the equilibrium rates calculated in the previous steps, and set the rates of celebrities i + 1, . . . , n to 0. We can calculate the optimal rate $\hat{r}_i$ of celebrity i when the rates of the others are given (it follows from computing the utility-rate curve of Figure 3); we simply set $r_i^*$ to $\hat{r}_i$. Note that if some celebrity $i' > i$ changes her rate, the rates of everyone else being fixed, the utility of celebrity i is not affected. Therefore, if the utility of celebrity i is optimized at $r_i^*$ when $r_{i+1} = \ldots = r_n = 0$, it is still optimized when $r_{i+1}, \ldots, r_n$ take any other values. Hence, if celebrities 1, . . . , i − 1 do not benefit from deviating from their rates, celebrity i does not benefit either. By induction, the rates $r_i^*$ form a pure strategy Nash equilibrium of the game.

This global ranking is a special case of a more general result.

Proposition 4. For a given Followership Game instance, create a dependency graph on the celebrity nodes as follows: let x have a directed edge to y if the non-zero quality of x is greater than the non-zero quality of y for some user u. If the resulting dependency graph is a directed acyclic graph, a pure strategy Nash equilibrium exists for the rate competition game.

Proof. An edge from x to y means that y takes x's rate into account when myopically maximizing her utility. Since the dependency graph is acyclic, the nodes can be topologically ordered, and the induction argument above applies to that ordering.
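The proof of Proposition 3 is constructive, and its induction is easy to express in code. A sketch (ours), with the single-celebrity optimization abstracted into a hypothetical best_response(i, rates) callback, such as the breakpoint scan sketched earlier:

```python
def global_ranking_equilibrium(n, best_response):
    """Iterative construction from the proof of Proposition 3: process
    celebrities in the globally preferred order 1..n, fixing each one's
    rate to her best response against the already-fixed rates of the
    higher-ranked celebrities (lower-ranked ones are silent, rate 0).
    best_response(i, rates) must return celebrity i's optimal rate."""
    rates = [0.0] * n
    for i in range(n):
        rates[i] = best_response(i, rates)
    return rates
```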
2.3 Matchings Characterize Pure Rate Equilibria
For this part, assume that all quality values for a user are distinct, so there is a strict quality ordering of the celebrities for each user. We show that we can enumerate over a finite set to check whether there are any pure Nash equilibria for the rates. Note that, from the description of the best-response function, for every celebrity whose rate is at equilibrium there are two possibilities: either there exists a user for whom the celebrity is critically tweeting, i.e., an increase in her rate would make this user drop her, or the celebrity has rate zero. These saturated users are distinct for distinct celebrities in the greedy users model when the celebrity qualities for a user are distinct. Thus, there is a simple (albeit slow) procedure for finding a pure Nash equilibrium of rates if one exists: first, find a matching from each subset of celebrities (those updating at a nonzero rate in a possible equilibrium) to users that are critical for them; then, solve the first-order conditions, with equality for this set of users and with the matched celebrity being the last one followed by that user. If the resulting system is nonsingular, these rates give a candidate Nash equilibrium; we then check whether it satisfies the other equilibrium conditions (nonnegative rates and no profitable myopic deviations) to confirm whether the candidate solution is indeed an equilibrium.
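The procedure can be skeletonized as follows. This is a sketch only (ours): solve_foc and is_equilibrium stand in for the first-order-condition solve and the final equilibrium check, and the enumeration is exponential by design.

```python
from itertools import combinations, permutations

def candidate_pure_equilibria(n_celebs, n_users, solve_foc, is_equilibrium):
    """Enumerate matchings from subsets of nonzero-rate celebrities to
    distinct critical users, solve the tight first-order conditions for
    each (solve_foc returns rates, or None if the system is singular),
    and keep the candidates that pass the full check (is_equilibrium)."""
    found = []
    for size in range(1, n_celebs + 1):
        for subset in combinations(range(n_celebs), size):
            for critical in permutations(range(n_users), size):
                rates = solve_foc(subset, dict(zip(subset, critical)))
                if rates is not None and is_equilibrium(rates):
                    found.append(rates)
    return found
```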
3 A Model of User Engagement: The Engagement Game
Suppose that user j follows the set of celebrities $C_j$ and celebrity i is followed by the set of users $F_i$. Instead of considering strict quality values, in the Engagement Game we merely assume that a user j currently follows all celebrities for whom he has non-zero quality and no others. Celebrity i then decides on her tweet-rate $r_i$, and user j decides whether to remain active on Twitter or to disengage. So, unlike the Followership Game, where follow relationships resulted from the game, here follow relationships are fixed and users decide whether to leave the network rather than break individual ties. One way of interpreting this is that users joining Twitter follow some subset of celebrities based on their interests; after using the service for some time, they might find that the combined rate of information from these celebrities is too much and decide to delete their account. Let $F_i^* \subseteq F_i$ denote the set of engaged followers of celebrity i. The utility of celebrity i is $U_i = r_i |F_i^*|$, where $F_i^*$ is a random variable whose distribution depends on the celebrities' rates; we therefore assume that celebrities maximize their expected utility. Tweeting more increases $r_i$ but at the same time decreases the expected size of $F_i^*$: the celebrities compete for users' attention, but in order to keep the users engaged, they have to avoid high tweet-rates. We suppose that there is a function $S : [0, \infty) \to [0, 1]$ that maps the total rate of all celebrities followed by a user to the probability that the user stays in the social network, and that this function is the same for all users. This function is intended to capture the frustration of a typical user from receiving too many
updates; interpreting the function's outcome as a probability adds heterogeneity among the individual users even though they all share the same function. Thus, user j stays in the network with probability $S(\sum_{i \in C_j} r_i)$, and the (expected) utility of celebrity i is

$$U_i(r) = r_i \sum_{j \in F_i} S\Big(\sum_{i' \in C_j} r_{i'}\Big).$$
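As a sketch (ours, with hypothetical container names), the expected utility above can be evaluated directly for any choice of the stay-probability function S:

```python
import math

def expected_celebrity_utility(i, rates, followers, followees,
                               S=lambda r: math.exp(-r)):
    """U_i(r) = r_i * sum over i's followers j of S(total rate j gets).
    followers[i] lists the users following celebrity i; followees[j]
    lists the celebrities user j follows; S maps a user's total intake
    to a stay probability (exp(-r) is one of the choices discussed below)."""
    return rates[i] * sum(S(sum(rates[c] for c in followees[j]))
                          for j in followers[i])
```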
Certain choices of S admit a closed-form solution for the celebrity strategies. For example, if $S(r) = \exp(-r)$, then the rate of each celebrity becomes decoupled:

$$U_i(r) = r_i \sum_{j \in F_i} \exp\Big(-\sum_{i' \in C_j} r_{i'}\Big) = r_i \exp(-r_i) \sum_{j \in F_i} \exp\Big(-\sum_{i' \in C_j, i' \neq i} r_{i'}\Big) = r_i \exp(-r_i) \sum_{j \in F_i} c_j,$$
where $c_j = \exp(-\sum_{i' \in C_j, i' \neq i} r_{i'})$ is constant with respect to $r_i$. An easy calculation shows that the optimal solution is $r_i = 1$; notice that even if any of the other rates change, the optimal solution for producer i is still $r_i = 1$. Thus, for this particular S, all the producers tweet at rate 1. Instead of this factorized form, we assume a non-trivial function that causes interaction between producers. Suppose that $S(r) = \max(0, 1 - r)$; then the probability that user j stays engaged is $\max(0, 1 - \sum_{i \in C_j} r_i)$. In general, pure Nash equilibria of this game are characterized by a set of rates along with a subset F of users j that have nonnegative values of $1 - \sum_{i \in C_j} r_i$. The pure equilibrium rate for celebrity i is a solution to the constrained optimization problem of maximizing the celebrity's utility $U_i(r)$ subject to the constraints that for every j in the subset F the probability expression $1 - \sum_{i' \in C_j} r_{i'}$ of staying engaged is nonnegative, while it is nonpositive for the remaining users. We relax these two sets of constraints to get sufficient conditions for the equilibrium rates and exhibit two special cases where the rates obey these extra constraints for free, thus giving us true equilibria. Assume for the moment that we know the set of users who have positive probability of being engaged; this set of users is the analogue of the matching in the Followership Game, and we construct a candidate Nash equilibrium based on it. Let $\tilde{F}_i$ be the set of followers of celebrity i who have positive probability of being engaged, i.e., $\tilde{F}_i = \{j \in F_i \mid \sum_{l \in C_j} r_l \leq 1\}$. Given $\tilde{F}_i$ and relaxing the constraints, the optimal rate $r_i^*$ is

$$r_i^* = \arg\max_{r_i} \; r_i \sum_{j \in \tilde{F}_i} \Big(1 - \sum_{l \in C_j} r_l\Big).$$

Using the first-order conditions, we get

$$r_i^* = \frac{|\tilde{F}_i| - \sum_{j \in \tilde{F}_i} \sum_{l \in C_j \setminus \{i\}} r_l}{2 |\tilde{F}_i|} \qquad (1)$$
which is linear with respect to the rates. Therefore, if the set of users who have positive probability of being engaged is given, we can solve the corresponding set of linear equations and calculate $r_i^*$ for all celebrities. However, we then require that

$$\forall i, j : \; j \in F_i \setminus \tilde{F}_i \;\Rightarrow\; \sum_{l \in C_j} r_l^* \geq 1 \qquad (2)$$

$$\forall i, j : \; j \in \tilde{F}_i \;\Rightarrow\; \sum_{l \in C_j} r_l^* \leq 1. \qquad (3)$$
Conditions (2) and (3) imply that the solution to the linear system is self-consistent with the assumption about the $\tilde{F}_i$, i.e., about which users have positive probability of being engaged. Further, condition (3) demands, in combination with the linear system, that the rates are non-negative.⁵ As in Section 2.3, a solution to the above linear system that obeys the conditions is not necessarily a Nash equilibrium; it is then necessary to check that no celebrity has an incentive to unilaterally deviate in her rate and change the set of possibly engaged users. In Section 3.1, we prove the existence of pure strategy Nash equilibria for two classes of instances, sparse users and regular users, by computing a solution to the linear system and proving that no unilateral deviation is profitable.
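The procedure just outlined (guess the engaged set, solve the linear system of Equation (1), then verify conditions (2) and (3)) can be sketched as follows. The code is ours; C maps each user to the set of celebrities he follows.

```python
import numpy as np

def engagement_candidate(C, engaged, n_celebs):
    """Solve Equation (1) for a guessed set of possibly engaged users,
    then test self-consistency: engaged users must receive total rate
    at most 1 (condition 3), the rest at least 1 (condition 2)."""
    F = [[j for j in engaged if i in C[j]] for i in range(n_celebs)]
    A = np.diag([2.0 * max(len(F[i]), 1) for i in range(n_celebs)])
    b = np.array([float(len(F[i])) for i in range(n_celebs)])
    for i in range(n_celebs):
        for j in F[i]:
            for l in C[j]:
                if l != i:
                    A[i, l] += 1.0      # coefficient a_il of r_l in row i
    r = np.linalg.solve(A, b)
    consistent = all(
        (sum(r[l] for l in C[j]) <= 1) if j in engaged
        else (sum(r[l] for l in C[j]) >= 1)
        for j in C)
    return r if consistent and (r >= 0).all() else None
```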
3.1 Pure Equilibria under Sparse or Regular Users
Proposition 5. A pure strategy Nash equilibrium exists for the Engagement Game when all users have the same degree.

Proof. We show that if the graph is d-regular, i.e., $|C_j| = d$ for all j, then every celebrity tweeting at rate $1/(d + 1)$ is a pure strategy Nash equilibrium of the game. We let $\tilde{F}_i = F_i$ and solve the linear system defined by Equation (1) for every i; the solution is $r_i^* = 1/(d + 1)$. We now confirm that the solution satisfies conditions (2) and (3). Condition (2) is trivially satisfied since $F_i \setminus \tilde{F}_i = \emptyset$; since $\sum_{l \in C_j} r_l^* = d/(d + 1) \leq 1$, condition (3) is also satisfied. Now let us examine whether a particular celebrity can profitably change her rate. Excluding this celebrity's rate, each of her followers receives a total rate of $(d - 1)/(d + 1)$. So this celebrity can either keep her neighboring users possibly engaged with a rate of at most $2/(d + 1)$, or cause them all to disengage with a rate greater than $2/(d + 1)$. She always prefers the first, meaning that she will not deviate so as to induce a different engagement set. Hence, by Equation (1), we obtain a pure strategy Nash equilibrium of the game.

Proposition 6. A pure strategy Nash equilibrium exists for the Engagement Game when all users follow one or two celebrities.
⁵ As mentioned above, these conditions will not necessarily be satisfied for every choice of the $\tilde{F}_i$'s; in fact, it is not even clear whether there exists any choice of $\tilde{F}_i$'s for which Equation (1) and conditions (2) and (3) all hold. We conjecture that if a Nash equilibrium exists, then there must exist such a self-consistent choice of $\tilde{F}_i$'s.
Proof. This is the case where $|C_j| \leq 2$ for all j. First note that since $r_i^*$ is upper-bounded by 1/2 for any set of possibly engaged users, and each user follows at most two celebrities, every user is always possibly engaged, and we do not need to consider unilateral deviations that change the set of possibly engaged users. Letting $\tilde{F}_i = F_i$ then satisfies conditions (2) and (3) for any solution $r_i^*$; therefore, any solution satisfying Equation (1) is a pure strategy Nash equilibrium of the game. We now prove the existence and uniqueness of a solution to the linear system defined by (1), assuming that every celebrity has at least one follower. Equation (1) reduces to

$$r_i^* = \frac{|F_i| - \sum_{l=1}^{n} a_{il} r_l^*}{2 |F_i|},$$

where $a_{il}$ denotes the number of users that follow both celebrities i and l, with $a_{ii} = 0$. Note that the set of followers of celebrity i consists of those who follow only celebrity i, which we call $M_i$, and those who follow celebrity i and some other celebrity l. Define the $n \times n$ matrix B such that $b_{ij} = a_{ij}$ for $i \neq j$ and $b_{ii} = 2|F_i|$. The linear system defined by (1) has a unique solution if and only if the determinant of B is non-zero. Since $|F_i| = |M_i| + \sum_{l=1}^{n} a_{il}$, every element on the diagonal of B is at least twice the sum of all other elements in its column. Therefore, B is strictly diagonally dominant, and consequently, by the Gershgorin circle theorem [8], the determinant of B is non-zero.
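A quick numeric sanity check of Propositions 5 and 6, using the engagement_candidate sketch above (the instance and the check are again our own illustration): a 2-regular instance, in which every user also follows at most two celebrities, should yield all rates equal to 1/(d + 1) = 1/3.

```python
import numpy as np

# Three celebrities and three users, each user following exactly d = 2
# celebrities; by Proposition 5 the equilibrium rate is 1/(d + 1) = 1/3.
C = {0: {0, 1}, 1: {1, 2}, 2: {0, 2}}
r = engagement_candidate(C, engaged={0, 1, 2}, n_celebs=3)
assert r is not None and np.allclose(r, 1.0 / 3.0)
```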
4 Conclusions
We have introduced new models of rate competition in asymmetric follow networks like Twitter. We hope our models will be investigated further, both for empirical validation and for further theoretical study. We also hope that our insights will highlight the importance of update rate to users of social media, as well as the importance of effective filtering techniques for automatically coping with such overload. Several directions for future work present themselves. One important open problem is to extend our Followership model to non-bipartite networks where nodes are both users and celebrities and there is a correlation between the production and consumption utilities of agents; a similar direction is to extend the Engagement model to the non-bipartite case. Another avenue for future work relates to computing the price of anarchy and the price of stability, even in the special cases where pure strategy equilibria exist for these models. A natural alternative way of analyzing information overload, particularly in the context of a fast-growing network like Twitter, is to postulate a stochastically growing model and study the results of best-response adjustment of rates during this stochastic growth. We could then try to find limiting properties of the resulting network, such as its degree sequence and the effect of the entry time of celebrities as well as their overall aggregate quality, and see whether this provides a better model of network evolution than other, more mechanistic and rate-oblivious, models of network growth.
References

1. The Blog Herald: Twitter's Meteoric Rise, http://www.blogherald.com/2010/06/28/twitters-meteoric-rise-compared-to-facebook-infographic/twitter-statistics-infographic-911/
2. Eppler, M.J., Mengis, J.: The Concept of Information Overload: A Review of Literature from Organization Science, Accounting, Marketing, MIS, and Related Disciplines. The Information Society 20(5), 325–344 (2004)
3. Huberman, B., Romero, D., Wu, F.: Social Networks that Matter: Twitter under the Microscope. First Monday 14(1) (2009)
4. Jackson, M.O.: Social and Economic Networks. Princeton University Press, Princeton (2009)
5. Sattinger, M.: Value of an Additional Firm in Monopolistic Competition. The Review of Economic Studies 51(2), 321–332 (1984)
6. Clay Shirky on Information Overload versus Filter Failure, video from Web 2.0 Expo NY (2010), http://www.boingboing.net/2010/01/31/clay-shirky-on-infor.html
7. Doyle, P.G., Snell, J.L.: Random Walks and Electric Networks. The Mathematical Association of America (1984)
8. Horn, R.A., Johnson, C.R.: Matrix Analysis. Cambridge University Press, Cambridge (1990)
Author Index

Avrachenkov, Konstantin; Bonato, Anthony; Bonchi, Francesco; Borgs, Christian; Bradonjić, Milan; Broder, Andrei; Chayes, Jennifer; Chung, Fan; Demaine, Erik D.; Esfandiar, Pooya; Flammini, Alessandro; Gleich, David F.; Gonçalves, Bruno; Greif, Chen; Hagberg, Aric; Hengartner, Nicolas W.; Janssen, Jeannette; Karrer, Brian; Kim, Myunghwan; Kolountzakis, Mihail N.; Lakshmanan, Laks V.S.; Leskovec, Jure; Li, Yanhua; Meeder, Brendan; Meiss, Mark R.; Menczer, Filippo; Miller, Gary L.; On, Byung-Won; Peng, Richard; Percus, Allon G.; Pinar, Ali; Pralat, Pawel; Ramasco, José J.; Ravi, R.; Reagans, Ray; Ribeiro, Bruno; Rocklin, Matthew; Sayedi, Amin; Towsley, Don; Tsiatas, Alexander; Tsourakakis, Charalampos E.; Zadimoghaddam, Morteza; Zhang, Zhi-Li; Zhao, Wenbo