| Felix Halim .NET | ||||
|
University Experience
IOI 2002 Yong In, Korea
ACM ICPC Regional Manila 2003
ACM ICPC Regional Manila 2004
ACM ICPC Regional Manila 2005
ACM ICPC Regional Kaohsiung 2006
ACM ICPC Regional Singapore 2007
ACM ICPC World Final Tokyo 2007
Google India Code Jam 2005
Google India Code Jam 2006
Indonesia National Contest 2007
Indonesia National Contest 2008
|
||||
Fast and Effective Histogram ConstructionFelix Halim, Panagiotis Karras, Roland Yap National University of Singapore Abstract [Full Paper (pdf)]Histogram construction or sequence segmentation is a basic task with applications in database systems, information retrieval, and knowledge management. Its aim is to approximate a sequence by line segments. Unfortunately, the quadratic algorithm that derives an optimal histogram for Euclidean error lacks the desired scalability. Therefore, sophisticated approximation algorithms have been recently proposed, while several simple heuristics are used in practice. Still, these solutions fail to resolve the efficiency-quality tradeof in a satisfactory manner. In this paper we take a fresh view on the problem. We propose conceptually clear and scalable algorithms that efficiently derive high-quality histograms. We experimentally demonstrate that existing approximation schemes fail to deliver the desired efficiency and conventional heuristics do not fare well on the side of quality. On the other hand, our schemes match or exceed the quality of the former and the efficiency of the latter. Related WorkGiven a finite data sequence D = d0, d1, ..., dn-1. Construct a histogram HB from D using at most B buckets that minimizes the total error of all the buckets. The error for a bucket is defined as the sum of squared difference between all elements in that bucket with the mean value of that bucket. This problem can be solved optimally using Dynamic Programming (DP). However, the running time of the optimal construction is quadratic O(n^2 * B). Therefore, approximations algorithms have been proposed recently to get high quality solution efficiently such as AHistL and DnS. AHistL decompose the problem into two parts. In the inner part gives the error for a parameter value is within a "good" range. The outer part searches for the appropriate range and sets the parameter. The parameter to this algorithm are delta and epsilon. The delta is adjusted over time by the outer part while the epsilon is a predetermined parameter. The smaller the epsilon, the more accurate the result (the denser the list generated in the inner part) and the slower the algorithm. The complexity of this algorithm is O(n + B3 * (e-2 + log(n)) * log(n)). AHistL is not practical when B is large and as shown in our experiments, after a certain value of B, it's better off to switch to the optimal construction algorithm. DnS (Divide and Segment) works by spliting the input sequence into X sub-partitions. Then it segments each of the X sub-partitions (into B partitions each) optimally using DP. Each sub-partition produces B partitions (samples), yielding a total of X * B samples. The optimal DP segmentation is run on the X * B samples to get the final best of B partitions. Choosing X = (n/B)2/3 yield the best complexity of O(n4/3 * B2/3). Our ContributionsWe developed an algorithm that can produce good quality (low total error) as fast as possible. We started by developing an effective and efficient greedy algorithm and used it to produce high quality samples. The samples are used to generate the final B partitions optimally using DP. We experimentally show that the number of samples collected is less than those of DnS and yet have better quality thus giving a better segmentation results and faster runtime than DnS. We also compared the time vs. quality tradeoffs with many approximation algorithms. Even though we could not give the bound for the error, we can show the efficiency and the quality of our algorithms using some well known datasets. The experiment results showed that our algorithm matches or outperforms the approximation algorithms for both quality and speed, especially for large number of elements (n) and segments (B). Our greedy algorithm works by moving from one solution to another. A solution is a set of partitions S containing B+1 elements: s0,s1,s2,...,sB where s0 is 0 and sB is n and s1,s2,..,s(B-1) are the "movable" partitions that separates the n elements into B buckets. We move from one solution to another by picking one movable partition that yields to the least increase in total error and put it in between partitions that yields the most decrease in total error. Running the Greedy algorithm multiple times with randomized initial partitions will produce many good samples which then are used to construct the final B partitions using DP. This Greedy + DP algorithm alone manage to outperform DnS. To cope with very large n ~ 100,000 as in the synthetic dataset, we do the DP in batch. That is for roughly every sqrt(n) samples, we do DP on it. For every two endpoints that has P partitions in between which also has roughly sqrt(n) samples in between, we do DP on the samples to produces the same number of P partitions thus optimizing the P partitions inside the two endpoints using the sample and DP with O(n) complexity. Greedy Runs Produce Good SamplesA single greedy run produces a set of partitions (samples). Surprisingly, more than 50% of the partitions are the partitions that belong to the global optimum partitions. Running the randomized greedy algorithm multiple times can improve the percentage of samples that belong to the optimum partitions. DP is able to pick the final best B partitions out of the samples yielding a solution that is very close to the optimal. Play with the tool below to see the partitions returned by the algorithms: DataSets:Segments: We can see that the number of samples obtained from running 50 GDYs is mostly less than DnS and the samples are mostly very good (they are close to the optimal partitions). This shows that GDY produces better samples (in quality and in number of samples) than DnS. Moreover, DnS uses uniform sampling for each X sub-partitions. It maybe the case that the first 90% of the input is a very steady sequence and the last 10% is the random sequence. In this case, if DnS splits into X = 10 partitions then the first 9*B samples from the first 9 partitions are useless (bad samples) only the last B samples are good. Our Greedy will be most likely treat the first 90% of the sequence as a single bucket and samples more on the last 10% of the input sequence. This explaines why DnS can be a poor way of sampling. Comparison Tools
DatasetsDisplay the first input sequence.balloon 2001 x 2 http://lib.stat.cmu.edu/datasets/ darwin 1400 x 1 http://www.stat.duke.edu/~mw/ts_data_sets.html shuttle 1000 x 6 http://www-aig.jpl.nasa.gov/public/mls/time-series/ winding 2500 x 7 http://www.esat.kuleuven.ac.be/~tokka/daisydata.html phone 1708 x 8 http://www.teco.edu/tea/datasets/phone1.xls exrates 2567 x 12 http://www.stat.duke.edu/data-sets/mw/ts_data/all_exrates.html djdc0093 26612 x 2 http://lib.stat.cmu.edu/datasets/djdc0093 djia16K 16384 x 1 http://lib.stat.cmu.edu/datasets/djdc0093 (filtered) synthetic 100001 x 10 http://kdd.ics.uci.edu/databases/synthetic/synthetic.html | ||||