TimSort algorithm in JavaScript sorts

The `Array.prototype.sort()` function is used to sort the elements of an array in place and returns the reference to the same array, now sorted. The default sort order is ascending, built upon converting the elements into strings, then comparing their sequences of UTF-16 code units values. Firefox uses merge sort, while Chrome and Safari use Timsort. Timsort is a hybrid stable sorting algorithm derived from merge sort and insertion sort. 

Who developed it

Timsort is a sorting algorithm that Tim Peters created in 2002 for the Python programming language. The algorithm was so good that other popular programming languages such as Java also used it. Timsort has many implementation improvements, a few heuristics, and some fine tuning. Its main idea is simple: The array that needs to be sorted is first split greedily into monotonic runs (sorted subarrays), which are then combined pairwise according to specific rules.

Comparisons with other sorting algorithms

Timsort outperforms both quicksort and merge sort in various scenarios.
TimSort complexity comparison

Basic Overview

Timsort uses the fact that real-world data is often partly sorted. The algorithm goes through the array once to find already sorted parts, called "runs." These runs should be at least minRun in length, which we'll explain later.

The algorithm looks at each element one by one, finding out if it is part of an ascending or descending run. When it sees an element not part of the sequence, it records the run and starts a new one. If a run's length is less than minRun, the algorithm makes the run longer by adding the next element and sorting the array using Insertion Sort.

After this loop, we have an array made up of several runs, each with run.length >= minRun. These sorted runs are then smartly merged.

Timsort is a stable sorting algorithm (order of elements with same value is kept) and tries to do balanced merges (a merge that merges runs of similar sizes).

Insertion Sort

Insertion sort is a simple method of sorting a list by gradually moving each item to its correct position, just like arranging playing cards in your hand.

Here is insertion sort in action:


You can look at how it works step by step here: https://visualgo.net/en/sorting

Insertion Sort vs Merge Sort

Insertion Sort is good at sorting partly sorted and small arrays. Merge Sort, on the other hand, tries to call the merge() function as few times as possible, preferring to merge a few large sorted runs. This is where the minRun property is useful. The best run length balances the performance of Insertion Sort and the number of merges needed. The ideal spot usually lies between 32 and 64.

MinRun Selection

The best minRun size depends on each array. Let's look at a scenario suggested by Tim Peters. Suppose we have 2112 elements in the array and a minRun of 32. If the data is randomly ordered, we'll probably end up with 66 runs, each 32 elements long. The first 64 runs merge nicely into a single run of length 2048. Now, we have two runs with lengths 2048 and 64. This needs a lot of data movement to put those 64 elements in place.

The solution is to pick a larger minRun — in this case, 33 — which is more likely to result in 64 runs, each with a length of 33. As a result, all merges are perfectly balanced.

Runs Stack

Timsort also employs a clever merging pattern. For random data, it's nearly impossible to find a sequence with more than 32 elements. In this case, all our runs will likely have the same length, and merge() performs best when merging runs of equal size. However, merging them consecutively can lead to problems.

Imagine we have 1000 runs, each 32 elements long, and we've merged all but two of them. We'd have run A with a length of 999 * 32 and run B with a length of 32. This is not ideal. To avoid this issue, we examine the last three runs on the stack (let's call them X, Y, and Z, where Z is the length of the most recent) and check for these invariants:

  • Invariant 1: Z > Y + X
  • Invariant 2: Y > X

If both invariants hold true, the algorithm continues to add runs until one of them fails. When this occurs, it merges run Y with the lesser of Z and X. Merging Z and X directly could compromise stability, which ensures that elements with the same value maintain their relative positions after sorting.

This approach essentially ensures that the largest run is merged last.

Stack of runs

Consider a stack of runs with lengths [128, 64, 32, 16, 8, 4, 2, 2]. Timsort will merge the two last runs, producing a new one with a length of 4, and keep going, making perfect merges, much like merging the squares in the 2048 game.

Galloping

Merging runs of different lengths is often necessary, particularly when data is partially sorted. To improve the speed of the merge, Tim Peters introduced the galloping technique.

Typically, the merge() function has a complexity of O(m+n), where m and n represent the lengths of the two runs. It must loop through one of the runs completely.

Consider the following case: we need to merge two arrays in ascending order, each containing 10,000 elements:

arr1 = [1, 2, 3...10,000] arr2 = [10,001, 10,002, 10,003...20,000]

Should we compare every value from arr1 to arr2[0] before merging the runs? Instead, we can detect that one of the arrays is consistently "winning" and trigger galloping. That way we move the whole winning portion of the run.

How can we do that? Rather than incrementing by 1 through the winning array, we double our increment until the winning streak breaks. When it does, we use binary search to find the new starting point.

In our case, we determine that arr1 is winning after the first seven comparisons. It's proven that a winning streak of seven comparisons is the best trigger for galloping. We move our pointer by 1, then 2, 4, 8, 16, etc. In our example, the winning streak will never end, so we will make log(n) comparisons.

GallopingAll red elements are smaller than blue (here, 21). Thus they can be moved in a chunk to the final array

However, if the winning streak stops, we look for the biggest value in arr1 that is still smaller than the value at the arr2 pointer, put all elements before it into the resulting array, and start again from the pointer in arr2.

Summary

Timsort is a clever algorithm. It mixes multiple methods from different sorts and makes them better. It has more secrets to reveal than I can explain in one article.
You can find the JS implementation here: https://github.com/lxsmnsyc/TimSort/blob/master/timsort.js

And finally, Timsort in action:

Useful links

Sources:
Algorithm overview by Tim Peters himself
Wikipedia
Proof of complexity

No comments:

Post a Comment