Study Topic: Algo: Searching And Sorting

From Matt Morris Wiki

This is a Study Leave topic.

What Is It?

Basic survey of sort techniques: e.g. bubble/replacement/merge/heap.

Why Study It?

A way in to Algorithmics more generally: I know this stuff but could use going through it in a structured fashion. Once I've gone through this I can think about where to go next.

Toggl code

Toggl code is ALGO-SEARCHSORT

Deliverables

Next checkpoint: 6 hours

Checkpoint 1

[DONE] 3 hours to first checkpoint. Start with Knuth and see how it goes. Decide on whether I should:

  • stay on this topic
  • fix on another algo topic to look at next
  • do something else entirely

Also think about whether Knuth is the best way in or whether I'd do better on a random trawl.

Definitely stay with Knuth for now

WriteUp

There's a great Wikipedia comparison page at Sorting algorithm

Need to include a compare and contrast on the most common library sorts: quicksort, merge sort, heap sort, Timsort, introsort

Knuth Vol 3 (Sorting and Searching)

Permutations: A sort is a permutation that satisfies the desired ordering, so permutation theory is useful when studying efficiency. An inversion of a permutation is a pair of elements out of their natural order: swapping two adjacent elements changes the inversion count by exactly one. You can construct an inversion table for a permutation a(i) by letting b(j) be the number of elements to the left of j in the permutation that are greater than j. There is a 1-1 correspondence between inversion tables and permutations. Other relevant concepts are the average number of runs in a permutation, and their length. An involution is a permutation that is its own inverse: for instance (2 1 4 3).
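A short Python sketch of the inversion table (function name is mine; the construction follows Knuth's definition above):

```python
# Inversion table of a permutation of 1..n: b[j-1] counts the elements
# that appear to the left of the value j and are greater than j.
def inversion_table(perm):
    b = []
    for j in range(1, len(perm) + 1):
        pos = perm.index(j)
        b.append(sum(1 for x in perm[:pos] if x > j))
    return b

# The involution (2 1 4 3) from the text:
print(inversion_table([2, 1, 4, 3]))  # [1, 0, 1, 0]
```

Note that the entries of the table sum to the total inversion count (here 2: the pairs (2,1) and (4,3)).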

Basic Approaches:

  • insertion sort: items are considered one at a time and each is put in the correct positions relative to previously-sorted items
  • exchange sort: exchange out-of-order items
  • selection sort: pick the smallest (or largest) and place apart from the rest, then the next-smallest (or largest) and so on
  • enumeration sort: an item's position is determined by the number of other items' keys that its key exceeds

Insertion Algorithms:

  • simple insertion sort: N*N: add items in, one-by-one, shifting records above the gap up by one each time
  • Shell's method (diminishing increments): around N*logN*logN for good gap sequences: insertion sort the records in gap-separated groups, shrinking the gap each pass until a final ordinary insertion sort
  • address calculation sorting: divide into tables - works if there's an even distribution of keys across the (known) range
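A minimal sketch of Shell's method, using the simple halving gap sequence (Shell's original; better sequences exist):

```python
# Shell sort: insertion sort over gap-separated subsequences with a
# shrinking gap; the final pass (gap 1) is an ordinary insertion sort.
def shell_sort(a):
    a = list(a)
    gap = len(a) // 2
    while gap > 0:
        for i in range(gap, len(a)):
            item, j = a[i], i
            # Shift gap-separated predecessors up until item fits.
            while j >= gap and a[j - gap] > item:
                a[j] = a[j - gap]
                j -= gap
            a[j] = item
        gap //= 2
    return a

print(shell_sort([5, 2, 9, 1, 7, 3]))  # [1, 2, 3, 5, 7, 9]
```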

Exchange Algorithms:

  • bubble sort: N*N: pairwise compare neighbours and swap if an inversion, successive passes along the array
  • Batcher's parallel method: an exchange sort that admits parallelisation
  • quick sort (1962): N*logN: recursively pivot, each time moving in from ends to shift elements: switch to insertion below a threshold (10 or so)
    • with a naive pivot choice (e.g. the first element) it degrades to N*N on already-ordered or nearly-ordered input
    • can pick random pivots, or take median value of start+middle+end elements in current tranche
  • radix exchange sort: N*logN: works on binary representation: sort on successively less significant bits
    • can generalise to other bases
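The quicksort refinements above can be sketched together; this is an illustrative version (the cutoff value of 10 and all names are mine), combining a median-of-three pivot with a switch to insertion sort on small ranges:

```python
CUTOFF = 10  # illustrative threshold for switching to insertion sort

def insertion_sort(a, lo, hi):
    for i in range(lo + 1, hi + 1):
        item, j = a[i], i
        while j > lo and a[j - 1] > item:
            a[j] = a[j - 1]
            j -= 1
        a[j] = item

def quicksort(a, lo=0, hi=None):
    if hi is None:
        hi = len(a) - 1
    if hi - lo < CUTOFF:
        insertion_sort(a, lo, hi)
        return a
    mid = (lo + hi) // 2
    # Median of first, middle and last guards against ordered input.
    pivot = sorted((a[lo], a[mid], a[hi]))[1]
    i, j = lo, hi
    while i <= j:  # move in from both ends, exchanging misplaced pairs
        while a[i] < pivot:
            i += 1
        while a[j] > pivot:
            j -= 1
        if i <= j:
            a[i], a[j] = a[j], a[i]
            i += 1
            j -= 1
    quicksort(a, lo, j)
    quicksort(a, i, hi)
    return a
```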

Selection Algorithms:

  • straight selection sort: N*N: pick largest & exchange into place, then next largest, and so on.
  • refine by making use of information already gathered in comparisons
  • think about building a "tournament tree" and moving the next-best up each time the best is taken
  • heapsort (1964): N*logN: let us say {K(j)} is a heap if K(floor(j/2)) >= K(j). Heapsort "sifts" elements into heap form
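A heapsort sketch using that array-as-heap convention (the text's condition is 1-indexed; with 0-indexing the children of i sit at 2*i+1 and 2*i+2):

```python
# Sift a[i] down until the max-heap condition holds on a[..end].
def sift_down(a, i, end):
    while True:
        child = 2 * i + 1
        if child > end:
            return
        if child + 1 <= end and a[child + 1] > a[child]:
            child += 1  # pick the larger child
        if a[i] >= a[child]:
            return
        a[i], a[child] = a[child], a[i]
        i = child

def heapsort(a):
    n = len(a)
    for i in range(n // 2 - 1, -1, -1):   # sift into heap form
        sift_down(a, i, n - 1)
    for end in range(n - 1, 0, -1):       # repeatedly extract the max
        a[0], a[end] = a[end], a[0]
        sift_down(a, 0, end - 1)
    return a

print(heapsort([4, 10, 3, 5, 1]))  # [1, 3, 4, 5, 10]
```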

Priority Queues

  • heaps are also good for priority queues, where you want deletions to remove the item with the largest (or smallest) key
  • you can represent PQs by a leftist binary tree or a balanced tree
  • heaps use minimal memory, leftist trees are great for queue merging, balanced trees give most flexibility in searching and other operations
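Python's standard-library heapq module is an array-backed heap PQ of exactly this kind (it is a min-heap, so deletions remove the smallest key):

```python
import heapq

pq = []
for key in [7, 2, 9, 4]:
    heapq.heappush(pq, key)

# Each pop removes the item with the smallest key.
print([heapq.heappop(pq) for _ in range(len(pq))])  # [2, 4, 7, 9]
```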

Merging Algorithms

  • two-way merge sort: N*logN: merge runs from left and right (building runs backward from the right). Requires O(N) extra storage
  • list merge sort: N*logN: here we work with link fields instead, which still needs O(N) links, but that's potentially better than N element copies
  • top-down merge sort: recursively split into halves, then merge as the recursion unwinds (a bottom-up variant instead merges runs of doubling width iteratively)
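The bottom-up variant can be sketched without recursion (names are mine): merge adjacent runs of width 1, then 2, then 4, and so on, with O(N) auxiliary storage per merge:

```python
def merge_sort(a):
    n = len(a)
    width = 1
    while width < n:
        # Merge each adjacent pair of runs a[lo:mid] and a[mid:hi].
        for lo in range(0, n, 2 * width):
            mid = min(lo + width, n)
            hi = min(lo + 2 * width, n)
            merged, i, j = [], lo, mid
            while i < mid and j < hi:
                if a[i] <= a[j]:
                    merged.append(a[i]); i += 1
                else:
                    merged.append(a[j]); j += 1
            merged.extend(a[i:mid])
            merged.extend(a[j:hi])
            a[lo:hi] = merged
        width *= 2
    return a

print(merge_sort([5, 2, 9, 1, 7, 3, 8]))  # [1, 2, 3, 5, 7, 8, 9]
```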

Distribution Algorithms

  • The key idea is to sort on each component of a lexicographic ordering in turn.
  • You move elements back and forth between piles, distribution sorting on each digit of the keys from least significant upward, with an initial counting pass each round to work out how much space to allocate per pile
  • This gives radix sorts, whose performance depends on key digit distribution.
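A least-significant-digit radix sort sketch on decimal keys (a real implementation would use the counting pass to presize the piles; here Python lists just grow):

```python
# LSD radix sort: distribute into ten piles per decimal digit, least
# significant digit first; stable appends preserve earlier passes.
def radix_sort(a, digits=None):
    if digits is None:
        digits = len(str(max(a))) if a else 0
    for d in range(digits):
        piles = [[] for _ in range(10)]
        for x in a:
            piles[(x // 10 ** d) % 10].append(x)
        a = [x for pile in piles for x in pile]
    return a

print(radix_sort([170, 45, 75, 90, 802, 24, 2, 66]))
# [2, 24, 45, 66, 75, 90, 170, 802]
```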

Minimising Comparisons

  • heapsort is less efficient than quicksort on average, but heapsort is guaranteed to be of order NlogN. Mergesort has that too, but needs more storage
  • can get a lower bound on the number of comparisons S(n) by noting 2^S(n) >= n!; taking base-2 logs (lg) gives S(n) >= lg(n!) ~ n*lg(n) - n/ln(2) + 0.5*lg(n) + O(1), hence order n*lg(n)
  • determining S(n) exactly gets harder as n increases: you need to work through all possible comparison trees
  • Knuth presents merge insertion as a good algorithm for minimising comparisons
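The information-theoretic bound above is easy to check numerically: 2^S(n) comparison outcomes must distinguish all n! orderings, so any comparison sort needs at least ceil(lg n!) comparisons in the worst case.

```python
import math

for n in [4, 8, 16, 100]:
    bound = math.ceil(math.log2(math.factorial(n)))
    print(n, bound)
# e.g. n=4 gives ceil(lg 24) = 5, n=8 gives ceil(lg 40320) = 16
```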

Sorting Networks

  • Here you fix an oblivious sequence of comparisons in advance (that is, you don't change what you do next based on this comparison's results)
  • Again there is extensive analysis on what the minimal delay is, where you carry out comparisons simultaneously where possible
  • Search techniques such as genetic algorithms have found networks that outstrip the best human designs
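A small concrete example: the 5-comparator network for 4 keys (known to be optimal for that size), checked exhaustively. Since the sequence is oblivious, the same fixed comparisons sort every input order:

```python
from itertools import permutations

# Each pair (i, j) is a compare-exchange; the sequence never depends
# on the data.
NETWORK = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]

def apply_network(a):
    a = list(a)
    for i, j in NETWORK:
        if a[i] > a[j]:
            a[i], a[j] = a[j], a[i]
    return a

# Exhaustive check over all 24 input orders.
print(all(apply_network(p) == [0, 1, 2, 3]
          for p in permutations(range(4))))  # True
```

Note the first two comparators are independent and could run simultaneously, which is exactly the "delay" being minimised in the parallel analysis.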