Felix Halim .NET  
University Experience
IOI 2002 Yong In, Korea
ACM ICPC Regional Manila 2003
ACM ICPC Regional Manila 2004
ACM ICPC Regional Manila 2005
ACM ICPC Regional Kaohsiung 2006
ACM ICPC Regional Singapore 2007
ACM ICPC Regional Jakarta 2008 (ext)
ACM ICPC Regional Jakarta 2009 (ext)
ACM ICPC Regional Jakarta 2010
ACM ICPC Regional Jakarta 2012 (Problem H)
ACM ICPC Regional Jakarta 2013 (Problem J, new!)
ACM ICPC World Final Tokyo 2007
Google India Code Jam 2005
Google India Code Jam 2006
Indonesia National Contest 2007
Indonesia National Contest 2008
Indonesia National Contest 2010
Facebook Hacker Cup 2011


MapReduce-Based Maximum-Flow Algorithm for Large Small-World Network Graphs [ICDCS'11 Paper, Slides]

Felix Halim, Roland H.C. Yap, Yongzheng Wu

Abstract

Maximum-flow algorithms are used to find spam sites, build content voting systems, discover communities, etc., on graphs from the Internet. Such graphs are now so large that they have outgrown conventional memory-resident algorithms. In this paper, we show how to effectively parallelize a max-flow algorithm based on the Ford-Fulkerson method on a cluster using the MapReduce framework. Our algorithm exploits the property that such graphs are small-world networks with low diameter, and employs optimizations to improve the effectiveness of MapReduce and increase parallelism. We are able to compute max-flow on a subset of a large social network graph with 411 million vertices and 31 billion edges using a cluster of 21 machines in reasonable time.

Background and Motivation

Nowadays, real-world graphs (the WWW, social networks, etc.) have grown very large. For example, a Facebook graph with 700 million users, where each user has 130 friends on average, requires 700 * 10^6 * 130 * 4 bytes = 364 gigabytes just to store the relationships alone. Even with a supercomputer holding terabytes of memory, it is unclear whether running the best max-flow algorithm on such a large graph is practical (the best known max-flow algorithms have complexity at least quadratic in the number of vertices). In this paper, we investigate the feasibility of computing max-flow on a very large small-world network (SWN) graph. One property of an SWN graph is its very small (expected) diameter: the shortest-path length between any two vertices in the graph is expected to be small. This allows us to develop variants of the Ford-Fulkerson method that compute max-flow effectively on SWN graphs.
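The back-of-the-envelope storage estimate above can be checked directly; it assumes each relationship is stored once as a 4-byte (32-bit) friend ID:

```python
users = 700 * 10**6      # 700 million users
avg_friends = 130        # average number of friends per user
bytes_per_id = 4         # one 32-bit vertex ID per stored relationship

total_bytes = users * avg_friends * bytes_per_id
print(total_bytes / 10**9)  # 364.0 (decimal gigabytes)
```

Note this counts only the adjacency lists; any per-edge metadata or 64-bit IDs would grow the figure further.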
Since we are dealing with graphs whose size far exceeds the memory capacity of a single machine, we took the approach of distributing the graph across several machines in a cluster to be processed in parallel. MapReduce has become the de facto standard for storing, managing, and processing very large datasets on clusters of thousands of commodity machines. We designed and developed our max-flow variants to work on top of the MapReduce framework.

The Challenges and Approach

The classical max-flow algorithms were designed under the assumption that the entire graph is small enough to fit into main memory. Such algorithms are not directly applicable to distributed systems (such as the MapReduce framework) since they require a global view of the entire graph. On the other hand, the existing distributed max-flow algorithm based on the Push-Relabel algorithm, while it can work locally, relies heavily on heuristics to decide where to push flow. A wrong push can waste a very large number of MR rounds circulating the flow, and is thus not suitable for MapReduce. However, in a system where performing one round is cheap (such as the Bulk-Synchronous Parallel model), Push-Relabel should perform much better. We decided to develop max-flow variants based on the Ford-Fulkerson method. Under the assumption that the graphs being processed have small diameter, we transformed the naive (sequential) Ford-Fulkerson method into highly parallelizable MR algorithm variants through incremental finding of augmenting paths, bidirectional search, and multiple excess paths. In the next subsections we illustrate the inner workings of the naive Ford-Fulkerson method and the variants. Note the number of MR rounds required by each variant (the lower the better). We combined these variants to form an effective max-flow algorithm on the MR framework, which we call FF1_{MR}.
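A useful mental model for "one MR round" is a single map/shuffle/reduce pass over all vertices. The minimal sketch below (an illustration, not the paper's implementation) simulates parallel BFS in that style: the map phase emits tentative distances to each vertex's neighbors, the shuffle groups them by vertex, and the reduce phase keeps the minimum. The round count grows with the graph's diameter, which is why small-diameter graphs suit this model:

```python
from collections import defaultdict

def bfs_round(dist, adj):
    """One simulated MR round of parallel BFS.

    Map: each reached vertex emits (neighbor, my_dist + 1) and re-emits
    its own distance. Reduce: keep the minimum distance per vertex."""
    emitted = defaultdict(list)
    for u, d in dist.items():
        emitted[u].append(d)            # re-emit own state
        for v in adj.get(u, ()):
            emitted[v].append(d + 1)    # tentative distance to neighbor
    return {v: min(ds) for v, ds in emitted.items()}

def mr_bfs(adj, s):
    """Run rounds until distances converge; returns (distances, rounds)."""
    dist, rounds = {s: 0}, 0
    while True:
        new = bfs_round(dist, adj)
        rounds += 1
        if new == dist:
            return dist, rounds
        dist = new
```

On a path graph 1-2-3 starting from vertex 1, the distances converge to {1: 0, 2: 1, 3: 2} after a few rounds, mirroring how each MR round advances the frontier by one hop.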
The Naive Ford-Fulkerson Method

The naive Ford-Fulkerson method works by repeatedly finding an augmenting path in the current residual graph and augmenting along it:

    while true do
        P = find an augmenting path in the current residual graph G_{f}
        if (P does not exist) break
        Augment the flow f along the path P

The Ford-Fulkerson method does not dictate how to find an augmenting path. The naive way is to use Breadth-First Search (BFS). The complexity of this method is O(f* D) rounds, where f* is the max-flow value and D is the diameter of the graph. Note that for MR algorithms, we measure complexity in terms of the number of MR jobs (or rounds) performed. Below is an illustration of how the max-flow can be found this way.
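As a sequential, in-memory baseline, the loop above with BFS is the classic Edmonds-Karp instantiation of Ford-Fulkerson. A minimal sketch (single-machine, not the MR variant):

```python
from collections import deque, defaultdict

def max_flow(capacity, s, t):
    """Ford-Fulkerson with BFS (Edmonds-Karp): repeatedly find a shortest
    augmenting path in the residual graph G_f and augment along it."""
    residual = defaultdict(lambda: defaultdict(int))
    for u in capacity:
        for v, c in capacity[u].items():
            residual[u][v] += c
    flow = 0
    while True:
        # BFS for an augmenting path s -> t in the residual graph
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:        # no augmenting path left: flow is maximal
            return flow
        # recover the path and its bottleneck capacity
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(residual[u][v] for u, v in path)
        # augment: decrease forward residuals, increase reverse residuals
        for u, v in path:
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
        flow += bottleneck
```

Each BFS here corresponds to O(D) MR rounds in the distributed setting, and up to f* augmentations may be needed, giving the O(f* D) round bound stated above.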
Finding Augmenting Paths Incrementally

An improvement over the previous method is to find the augmenting paths incrementally. That is, we do not restart the BFS from scratch every time an augmenting path is found. Instead, we continue from the previous round's result, updating parts of the graph and destroying/removing only the results that are no longer fruitful. Finding augmenting paths incrementally reduces the complexity to far below O(f* D) rounds, since augmenting paths may continue to arrive in every subsequent round. We expect the complexity to be O(f*) rounds.
Bidirectional Search

Bidirectional search is a two-way search: one search originates from the source vertex s and the other from the sink vertex t. This doubles the utilization of work per round, so that more work is done each round, and effectively halves the number of rounds required to compute the max-flow. We also allow more than one augmenting path to be accepted per round. The expected complexity of this variant is O(f* / A) rounds, where A is the average number of augmenting paths accepted per round. Below is an illustration of the bidirectional search variant, which improves on the incremental-updates variant.
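The core idea of bidirectional search can be sketched on plain unweighted graphs: grow one frontier from s and one from t, expanding the smaller frontier each round, and stop when the two frontiers touch. This is an illustrative sketch of the search pattern only, not the paper's flow computation:

```python
def bidir_bfs_distance(adj, s, t):
    """Two-way BFS on an undirected adjacency dict. Returns the length of a
    path found when the two frontiers meet (the shortest-path length on
    simple graphs like the test below), or -1 if t is unreachable.
    Compared to one-sided BFS this roughly halves the expansion rounds."""
    if s == t:
        return 0
    dist_s, dist_t = {s: 0}, {t: 0}
    frontier_s, frontier_t = {s}, {t}
    while frontier_s and frontier_t:
        # expand the smaller side (one "round" of the search)
        from_s = len(frontier_s) <= len(frontier_t)
        frontier = frontier_s if from_s else frontier_t
        dist = dist_s if from_s else dist_t
        other = dist_t if from_s else dist_s
        nxt, best = set(), None
        for u in frontier:
            for v in adj.get(u, ()):
                if v in other:  # frontiers meet: candidate path length
                    cand = dist[u] + 1 + other[v]
                    best = cand if best is None else min(best, cand)
                if v not in dist:
                    dist[v] = dist[u] + 1
                    nxt.add(v)
        if best is not None:
            return best
        if from_s:
            frontier_s = nxt
        else:
            frontier_t = nxt
    return -1  # t is unreachable from s
```

In the MR setting, each expansion corresponds to one round, so meeting in the middle cuts the round count roughly in half relative to searching from s alone.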
Multiple Excess Paths

When an augmenting path is found and augmented, many vertices lose their excess paths if those paths conflict with the augmenting path. To prevent vertices from losing all of their excess paths, we allow each vertex to store up to K excess paths, so that even when a large number of augmenting paths are augmented, many vertices remain active and continue to supply streams of new augmenting paths every round.
Among all the variants described above, the multiple excess paths variant gives the largest decrease in the number of rounds. The complexity of these variants combined is O(f* / A), where A is the number of augmenting paths accepted per round. The value of A is so large that the complexity becomes very close to O(D).
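The bookkeeping idea can be illustrated with a toy helper (a hypothetical sketch, not the paper's data structure): after a set of augmenting paths is accepted, each vertex drops only the stored excess paths that use a newly saturated edge, and keeps up to K survivors.

```python
def retain_excess_paths(excess_paths, saturated_edges, K):
    """Per vertex, keep up to K stored excess paths, dropping only those
    that traverse an edge saturated by an accepted augmenting path.

    excess_paths: {vertex: [path_as_vertex_tuple, ...]}
    saturated_edges: set of (u, v) edges used up by accepted paths."""
    kept = {}
    for v, paths in excess_paths.items():
        survivors = [p for p in paths
                     if not any(e in saturated_edges for e in zip(p, p[1:]))]
        kept[v] = survivors[:K]
    return kept
```

With K > 1 a vertex that loses one path to a conflict still has others to offer in the next round, which is what keeps the stream of augmenting paths flowing.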
The experimental results above show that the number of excess paths stored in each vertex (K) has a significant impact when computing the max-flow between two random vertices s and t on a social network subgraph (FB1) with 21 million vertices and 112 million edges. The larger the K, the fewer MR rounds are required. For a very large value of K, the number of rounds approaches the diameter of the graph.

MapReduce Optimizations

MapReduce is a general-purpose framework; it is not necessarily the best framework for processing graph-based data. In this section we describe our MR optimizations that work around the limitations of MR in processing graph-based data. We implemented each optimization as a variant of our FF1_{MR} algorithm.

FF2_{MR}: Stateful Extension for MR

When many augmenting paths are found in one round, a (single-threaded) worker must decide which augmenting paths to accept and which to reject. In FF1, this decision is made by the reducer responsible for vertex t. The larger the number of augmenting paths, the longer that reducer needs to run. This can cause a convoy effect where the other reducers have already finished but the reducer processing vertex t lags behind. This led us to create an extension for MR: a dedicated worker outside MR whose job is to process the augmenting paths generated by the reducers in each round. The advantage is that the dedicated worker can start working as soon as it receives augmenting paths from the reducers, and it does not need the extra step of sending augmenting paths (as messages) to vertex t, which wastes one MR round. This stateful extension removes the bottleneck of FF1_{MR}.
The improvement it brings is significant: FF2_{MR} is up to 3.41 times faster for FB4 and 1.85 times faster for FB1.

FF3_{MR}: Schimmy Method

The Schimmy method is an MR design pattern for processing graph-based data. The improvement is more apparent for the larger graph FB4 (1.74 times) and smaller for the smaller graph FB1 (1.25 times).

FF4_{MR}: Avoid Object Instantiations

This is a common optimization. The improvement is around 1.16 to 1.41 times.

FF5_{MR}: Storage Space vs. Number of Rounds and Communication Costs

Last but not least, we avoid sending messages (as intermediate records) each round by sacrificing some storage space (used as flags). We also sacrifice storage space by storing the maximum number of excess paths (setting K = 5000) to reduce the number of rounds. We avoid sending delta updates as messages (intermediate records) whenever we can recompute the delta in the reducers.
We found a high correlation between the number of shuffled bytes and the runtime, so in this variant we try to minimize the number of shuffled bytes as much as possible. The number of reduce shuffle bytes depends on the number and size of the intermediate records. The experiment above shows the number of bytes shuffled for each variant (FF1 to FF5). FF1 is the worst since it sends all the augmenting paths found in the current round to vertex t as messages; hence from rounds #4 to #8 its reduce shuffle bytes are high. FF2 uses the external worker to process the augmenting paths immediately, so its reduce shuffle bytes for rounds #4 to #8 are small; however, they grow afterwards since every active vertex keeps resending its excess paths to its neighbors. FF3 avoids shuffling the master graph, so it is a consistent improvement over FF2. FF4 does not reduce the shuffled bytes, hence it is not shown here. FF5 sets K to the maximum and prevents redundant messages (by recomputation or by flags). FF5 keeps the shuffled bytes small and is the best of our MR variants.

Scalability

We tested our FF5_{MR} variant with a very large flow value on a very large social network subgraph (FB6) with 411 million vertices and 31 billion edges. To create a very large flow, we combine w = 128 random vertices into a super source vertex s, and similarly for the super sink t. The flow between the super source and the super sink can then be up to 128 * 5000 = 640,000.
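The super source/sink construction is a standard transformation: add a new vertex S with an edge to each chosen source, and an edge from each chosen sink to a new vertex T. A minimal sketch (the names S, T, and the helper are illustrative, not from the paper):

```python
def add_super_terminals(capacity, sources, sinks, cap_per_vertex):
    """Combine w chosen source vertices into one super source 'S' (and the
    chosen sink vertices into one super sink 'T') by adding edges with a
    fixed capacity to/from each chosen vertex.

    capacity: {u: {v: cap}} adjacency-with-capacities dict (left unmodified)."""
    g = {u: dict(vs) for u, vs in capacity.items()}  # shallow copy per vertex
    g["S"] = {v: cap_per_vertex for v in sources}
    for v in sinks:
        g.setdefault(v, {})["T"] = cap_per_vertex
    g.setdefault("T", {})
    return g
```

With w = 128 terminals on each side and cap_per_vertex = 5000, the S-to-T flow is capped at 128 * 5000 = 640,000, matching the bound above.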
The experiment shows that even for such large max-flow values, the number of rounds required to compute the max-flow stays small (around 7 to 8 rounds). This suggests that the approaches we put into FF1_{MR} are effective in minimizing the number of rounds. The runtime increases linearly while the max-flow value increases exponentially, which shows the scalability of our FF_{MR} in handling very large max-flow values. Another scalability test is in terms of graph size and number of machines (from 5 to 20 machines). We plot the runtime and the number of rounds required to compute the max-flow on several subgraphs (FB1 to FB6). We again created a super source and a super sink with w = 128 (the vertices are chosen randomly in each subgraph). We also plot the best-case scenario of running a single Breadth-First Search on each subgraph.
The experiment shows that our best variant, FF5_{MR}, is only a constant factor slower than BFS_{MR}! The max-flow value is written under each subgraph name (FB1 to FB6). If we view the results in terms of the number of edges processed per second:
The experiment shows that the larger the graph, the higher the number of edges processed per second. This may mean several things: the larger the graph, the smaller its diameter and the more robust the graph (that is, despite the large number of edge deletions in the residual graph, the graph still maintains a small diameter).

Conclusion and Future Work

We developed what we believe to be the first effective and practical max-flow algorithms for MapReduce. While the best sequential max-flow algorithms have roughly quadratic runtime complexity, we show that it is still possible to compute max-flow efficiently and effectively on very large real-world graphs with billions of edges. We achieve this by designing FF_{MR} algorithms that exploit the small-diameter property of such real-world graphs while providing large amounts of parallelism. We also present novel MR optimizations that significantly improve on the initial design by minimizing the number of rounds, the number of intermediate records, and the size of the biggest record. Our optimizations also exploit trade-offs between the space needed for the data and the number of rounds. Our preliminary experiments show a promising and scalable algorithm for computing max-flow on very large real-world social network subgraphs. We still see room for improvement in several areas, such as optimizing the last few rounds and approximating the max-flow value for faster runtimes.
We also see the need to benchmark against a custom-built graph framework that optimizes memory management for the case where the entire graph fits into the total memory capacity of a cluster of machines. A comparison with the Push-Relabel algorithm implemented on a Bulk-Synchronous Parallel framework would be a good direction as well.
