GraphLab Introduction

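[Abstract: This article describes the advantages GraphLab offers for distributed machine learning algorithms and the underlying mechanisms that provide them.]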

Characteristics of Machine Learning Algorithms and the Emergence of GraphLab

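Most machine learning algorithms share a few common properties. Exploiting these properties can improve the performance of distributed computation: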

1.1. Sparse Computational Dependencies

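 From an application perspective, graph models express the relationships among data well, and more valuable information can be extracted from those relationships.
 One example is collaborative filtering in recommender systems.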

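 From a computational perspective, solving for a parameter usually requires only local data, for example when computing a conditional probability.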
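 As one illustration of this locality, assume a pairwise Markov random field (an assumption made here for the example; the text does not name a specific model) with node potentials \phi_i, edge potentials \psi_{ij}, and neighbor set N(i). The conditional distribution of a single variable then depends only on its neighbors:

```latex
p(x_i \mid x_{-i}) \;=\; p\bigl(x_i \mid x_{N(i)}\bigr)
\;\propto\; \phi_i(x_i) \prod_{j \in N(i)} \psi_{ij}(x_i, x_j)
```

 Every factor that does not involve x_i cancels in the normalization, which is why only the adjacent vertices and edges need to be read.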

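 GraphLab represents data as a graph. Each computation operates on a vertex and all of the vertices adjacent to it,
 so the full data set never has to take part in a single computation.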

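 During iterative computation, the values on vertices and edges are updated, but the structure of the graph itself never changes.
 As a result, developing an algorithm with GraphLab only requires designing the iterative computation.
 Unlike MPI programming, there is no need to manage how data is distributed across the cluster.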

1.2. Asynchronous Iterative Computation

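 Synchronous iterative computation must wait for the parameter values produced by the previous round before it can start the next round.
 When the number of parameters is very large, some parameters update quickly and others slowly,
 so under synchronous updating the overall speed is limited by the slowest parameters.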

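 GraphLab uses asynchronous iterative computation. Each update reads the most recently available parameter values
 and does not wait for the previous round of updates to finish, which speeds up the computation.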

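 Factors that affect how quickly parameters are updated include:
 1. Uneven computing power across the machines in the cluster
 2. Network latency in the cluster, which affects data transfer speed
 3. Uneven amount of computation per vertex, as with the well-known power-law distribution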
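 The difference between reading last round's values and reading the latest values can be sketched on one machine. The example below is only an analogy, not GraphLab code: it solves a small linear system with a Jacobi-style sweep (synchronous) and a Gauss-Seidel-style sweep (each update sees the newest values). The matrix, the function names, and the tolerance are all assumptions chosen for illustration, and the sketch does not model stragglers or network delay.

```python
def sweep_sync(A, b, x):
    """Jacobi-style sweep: every update reads only last round's values."""
    n = len(b)
    return [
        (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
        for i in range(n)
    ]


def sweep_async(A, b, x):
    """Gauss-Seidel-style sweep: each update reads the most recent values."""
    x = list(x)
    n = len(b)
    for i in range(n):
        others = sum(A[i][j] * x[j] for j in range(n) if j != i)
        x[i] = (b[i] - others) / A[i][i]
    return x


def solve(sweep, A, b, tol=1e-9, max_sweeps=1000):
    """Repeat sweeps until no value moves by more than tol."""
    x = [0.0] * len(b)
    for k in range(1, max_sweeps + 1):
        new_x = sweep(A, b, x)
        if max(abs(p - q) for p, q in zip(new_x, x)) < tol:
            return new_x, k
        x = new_x
    return x, max_sweeps


# Hypothetical diagonally dominant system whose exact solution is [1, 1, 1].
A = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
b = [5.0, 6.0, 5.0]
x_sync, sweeps_sync = solve(sweep_sync, A, b)
x_async, sweeps_async = solve(sweep_async, A, b)
print("synchronous sweeps:", sweeps_sync)
print("latest-value sweeps:", sweeps_async)
```

 On this system the latest-value variant needs fewer sweeps to reach the same answer. In a real cluster the larger gain described above comes from not waiting on slow updates, which this single-threaded sketch cannot show.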

1.3. Dynamic Computation

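 The parameters of a machine learning algorithm converge at different rates, so the amount of computation needed for each parameter also differs.
 In PageRank, for example, most vertices converge after a few iterations, while a small number need many.
 GraphLab can adaptively adjust the priority of computation tasks while the algorithm runs.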
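 A rough sketch of this idea follows. It is not GraphLab's scheduler: the function name `dynamic_pagerank`, the toy graph, the damping factor, and the rule "priority equals the size of the last change" are all assumptions made for illustration. A vertex is recomputed only while something it depends on keeps moving, and the vertex whose input moved the most goes first.

```python
import heapq


def dynamic_pagerank(out_edges, damping=0.85, tol=1e-10):
    """Priority-scheduled PageRank sketch; returns (rank, update_counts)."""
    in_edges = {v: [] for v in out_edges}
    for u, targets in out_edges.items():
        for v in targets:
            in_edges[v].append(u)

    rank = {v: 1.0 for v in out_edges}
    counts = {v: 0 for v in out_edges}
    pending = {v: 1.0 for v in out_edges}  # vertex -> current priority
    heap = [(-p, v) for v, p in pending.items()]
    heapq.heapify(heap)

    while heap:
        neg_p, v = heapq.heappop(heap)
        if pending.get(v) != -neg_p:
            continue  # stale heap entry, superseded or already processed
        del pending[v]
        counts[v] += 1
        new = (1 - damping) + damping * sum(
            rank[u] / len(out_edges[u]) for u in in_edges[v]
        )
        change = abs(new - rank[v])
        rank[v] = new
        if change > tol:
            # Reschedule dependents, prioritised by how far their input moved.
            for w in out_edges[v]:
                if change > pending.get(w, 0.0):
                    pending[w] = change
                    heapq.heappush(heap, (-change, w))
    return rank, counts


# Hypothetical graph: vertices 0, 1, 2 form a cycle; vertex 4 has no in-edges.
graph = {0: [1, 2], 1: [2], 2: [0], 3: [2], 4: [3, 2]}
rank, counts = dynamic_pagerank(graph)
print(counts)
```

 Vertex 4 has no incoming edges, so it is computed once and never rescheduled, while the vertices on the cycle are updated repeatedly until their values settle.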

1.4. Data Consistency (Sequential Consistency)

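 Some algorithms, such as Gibbs sampling, require the data to remain consistent during distributed computation.
 In addition, maintaining consistency helps stochastic optimization algorithms converge faster.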

GraphLab Abstraction

2.1 The Data Graph

The data graph represents user-modifiable program state. It stores the mutable user-defined data and encodes the sparse computational dependencies.
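As a minimal sketch of such a structure, the class below keeps mutable data on vertices and edges while freezing the graph structure once it is built. The class name `DataGraph`, its methods, and the user/movie example are hypothetical and chosen only for illustration; they are not GraphLab's API.

```python
class DataGraph:
    """Toy data graph: mutable vertex/edge data, fixed structure after finalize()."""

    def __init__(self):
        self.vertex_data = {}   # vertex id -> user-defined data
        self.edge_data = {}     # (u, v) -> user-defined data
        self._neighbors = {}    # vertex id -> set of adjacent vertex ids
        self._finalized = False

    def _check_open(self):
        if self._finalized:
            raise RuntimeError("graph structure is fixed after finalize()")

    def add_vertex(self, v, data):
        self._check_open()
        self.vertex_data[v] = data
        self._neighbors.setdefault(v, set())

    def add_edge(self, u, v, data):
        self._check_open()
        if u not in self.vertex_data or v not in self.vertex_data:
            raise KeyError("both endpoints must be added before the edge")
        self.edge_data[(u, v)] = data
        self._neighbors[u].add(v)
        self._neighbors[v].add(u)

    def finalize(self):
        self._finalized = True

    def neighbors(self, v):
        return frozenset(self._neighbors[v])


# Hypothetical collaborative-filtering graph: users rate a movie.
g = DataGraph()
g.add_vertex("u1", {"factor": 0.0})
g.add_vertex("u2", {"factor": 0.0})
g.add_vertex("m1", {"factor": 0.0})
g.add_edge("u1", "m1", {"rating": 4.0})
g.add_edge("u2", "m1", {"rating": 2.0})
g.finalize()

g.vertex_data["u1"]["factor"] = 0.5   # data stays mutable
print(sorted(g.neighbors("m1")))
```

The edges record which data items depend on each other, so a computation on "m1" only needs to read "u1", "u2", and the two rating edges.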

2.2 The Update Functions

f(υ,Sυ)→(Sυ,T)

Sυ denotes the scope of vertex υ: the data stored in υ together with the data stored in all adjacent vertices and adjacent edges. This is shown in Fig. 2a.
The update function returns the new versions of the data in the scope along with a set of vertices T. Each vertex μ∈T is eventually executed by applying the update function f(μ,Sμ).
GraphLab gives user-defined update functions complete freedom to read and modify any data in the scope Sυ. This simplifies user code and eliminates the need for users to reason about the movement of data. Furthermore, by controlling which vertices are returned in T and thus scheduled for execution, GraphLab update functions can efficiently express adaptive computation. For example, an update function may choose to return (schedule) its neighbors only when it has made a substantial change to its local data.
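The shape f(υ,Sυ)→(Sυ,T) can be sketched as a plain function. The layout of the scope dictionary, the function name `smooth_update`, the averaging rule, and the threshold are all assumptions made for this example and do not come from GraphLab itself.

```python
def smooth_update(v, scope, threshold=1e-3):
    """Set v to the mean of its neighbors; schedule them only on a big change.

    scope is a hypothetical dict:
      {"data": {vertex_id: value, ...}, "neighbors": [vertex_id, ...]}
    holding only v and the vertices adjacent to v.
    Returns (scope, T) where T is the set of vertices to schedule next.
    """
    values = [scope["data"][u] for u in scope["neighbors"]]
    new_value = sum(values) / len(values)
    change = abs(new_value - scope["data"][v])
    scope["data"][v] = new_value
    # Adaptive computation: stay quiet when the local data barely moved.
    to_schedule = set(scope["neighbors"]) if change > threshold else set()
    return scope, to_schedule


scope = {"data": {0: 0.0, 1: 1.0, 2: 3.0}, "neighbors": [1, 2]}
scope, first_T = smooth_update(0, scope)    # value jumps from 0.0 to 2.0
scope, second_T = smooth_update(0, scope)   # value is already 2.0
print(first_T, second_T)
```

The first call changes the vertex substantially and returns both neighbors. The second call changes nothing and returns an empty set, so no further work is scheduled.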

2.3 The GraphLab Execution Model

The GraphLab execution model, presented in Alg. 2, follows simple single-loop semantics.
The only requirement imposed by the GraphLab abstraction is that all vertices in T are eventually executed.
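A single-threaded sketch of such a loop is given below: take a vertex out of T, apply the update function, and add whatever it returns back into T until T is empty. The FIFO order, the function names `run` and `min_label_update`, and the connected-components example are assumptions for illustration. The abstraction itself only requires that scheduled vertices eventually run, so any other order would also be valid.

```python
from collections import deque


def run(neighbors, data, update, initial):
    """Apply `update` to scheduled vertices until nothing is left to run."""
    queue = deque(initial)
    queued = set(initial)
    while queue:
        v = queue.popleft()
        queued.discard(v)
        for u in update(v, neighbors, data):
            if u not in queued:      # T is a set: schedule each vertex once
                queued.add(u)
                queue.append(u)
    return data


def min_label_update(v, neighbors, data):
    """Adopt the smallest label in the scope; reschedule neighbors on change."""
    best = min([data[v]] + [data[u] for u in neighbors[v]])
    if best == data[v]:
        return set()
    data[v] = best
    return set(neighbors[v])


# Hypothetical graph with two components: {0, 1, 2} and {3, 4}.
neighbors = {0: [1], 1: [0, 2], 2: [1], 3: [4], 4: [3]}
labels = run(neighbors, {v: v for v in neighbors}, min_label_update, list(neighbors))
print(labels)
```

Each vertex ends up labelled with the smallest vertex id in its component, and the loop stops on its own once no update function returns any vertices.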

2.4 Ensuring Serializability

The GraphLab runtime ensures a serializable execution. A serializable execution implies that there exists a corresponding serial schedule of update functions that, when executed by Alg. 2, produces the same values in the data graph. To achieve this, GraphLab ensures that the scopes of concurrently executing update functions do not overlap.
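The non-overlap condition can be illustrated with a greedy selection of vertices whose scopes are pairwise disjoint. This is only a sketch of the condition, not the mechanism the GraphLab runtime actually uses; the helper names `scope` and `pick_concurrent_batch` are hypothetical, and the scope is simplified to a set of vertex ids.

```python
def scope(neighbors, v):
    """Vertices touched by an update on v: v itself plus its neighbors."""
    return {v} | set(neighbors[v])


def pick_concurrent_batch(neighbors, pending):
    """Greedily pick pending vertices whose scopes do not overlap."""
    locked = set()
    batch = []
    for v in pending:
        s = scope(neighbors, v)
        if locked.isdisjoint(s):
            batch.append(v)
            locked |= s
    return batch


# Hypothetical path graph 0 - 1 - 2 - 3 - 4 - 5.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
batch = pick_concurrent_batch(path, [0, 1, 2, 3, 4, 5])
print(batch)
```

Updates on the vertices in one batch read and write disjoint data, so running them in parallel gives the same result as running them one after another in any order.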

References:

Low Y, Bickson D, Gonzalez J, et al. Distributed GraphLab: A framework for machine learning and data mining in the cloud[J]. Proceedings of the VLDB Endowment, 2012, 5(8): 716-727.
C. Guestrin. NIPS Big Learning Workshop 12/18/2011
Technical report describing the GraphLab abstraction
GraphLab A Distributed Abstraction for Large Scale Machine Learning
