1 | n/a | """Heap queue algorithm (a.k.a. priority queue). |
---|
2 | n/a | |
---|
3 | n/a | Heaps are arrays for which a[k] <= a[2*k+1] and a[k] <= a[2*k+2] for |
---|
4 | n/a | all k, counting elements from 0. For the sake of comparison, |
---|
5 | n/a | non-existing elements are considered to be infinite. The interesting |
---|
6 | n/a | property of a heap is that a[0] is always its smallest element. |
---|
7 | n/a | |
---|
8 | n/a | Usage: |
---|
9 | n/a | |
---|
10 | n/a | heap = [] # creates an empty heap |
---|
11 | n/a | heappush(heap, item) # pushes a new item on the heap |
---|
12 | n/a | item = heappop(heap) # pops the smallest item from the heap |
---|
13 | n/a | item = heap[0] # smallest item on the heap without popping it |
---|
14 | n/a | heapify(x) # transforms list into a heap, in-place, in linear time |
---|
15 | n/a | item = heapreplace(heap, item) # pops and returns smallest item, and adds |
---|
16 | n/a | # new item; the heap size is unchanged |
---|
17 | n/a | |
---|
18 | n/a | Our API differs from textbook heap algorithms as follows: |
---|
19 | n/a | |
---|
20 | n/a | - We use 0-based indexing. This makes the relationship between the |
---|
21 | n/a | index for a node and the indexes for its children slightly less |
---|
22 | n/a | obvious, but is more suitable since Python uses 0-based indexing. |
---|
23 | n/a | |
---|
24 | n/a | - Our heappop() method returns the smallest item, not the largest. |
---|
25 | n/a | |
---|
26 | n/a | These two make it possible to view the heap as a regular Python list |
---|
27 | n/a | without surprises: heap[0] is the smallest item, and heap.sort() |
---|
28 | n/a | maintains the heap invariant! |
---|
29 | n/a | """ |
---|
30 | n/a | |
---|
31 | n/a | # Original code by Kevin O'Connor, augmented by Tim Peters and Raymond Hettinger |
---|
32 | n/a | |
---|
33 | n/a | __about__ = """Heap queues |
---|
34 | n/a | |
---|
35 | n/a | [explanation by François Pinard] |
---|
36 | n/a | |
---|
37 | n/a | Heaps are arrays for which a[k] <= a[2*k+1] and a[k] <= a[2*k+2] for |
---|
38 | n/a | all k, counting elements from 0. For the sake of comparison, |
---|
39 | n/a | non-existing elements are considered to be infinite. The interesting |
---|
40 | n/a | property of a heap is that a[0] is always its smallest element. |
---|
41 | n/a | |
---|
42 | n/a | The strange invariant above is meant to be an efficient memory |
---|
43 | n/a | representation for a tournament. The numbers below are `k', not a[k]: |
---|
44 | n/a | |
---|
45 | n/a | 0 |
---|
46 | n/a | |
---|
47 | n/a | 1 2 |
---|
48 | n/a | |
---|
49 | n/a | 3 4 5 6 |
---|
50 | n/a | |
---|
51 | n/a | 7 8 9 10 11 12 13 14 |
---|
52 | n/a | |
---|
53 | n/a | 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 |
---|
54 | n/a | |
---|
55 | n/a | |
---|
56 | n/a | In the tree above, each cell `k' is topping `2*k+1' and `2*k+2'. In |
---|
57 | n/a | a usual binary tournament we see in sports, each cell is the winner |
---|
58 | n/a | over the two cells it tops, and we can trace the winner down the tree |
---|
59 | n/a | to see all opponents s/he had. However, in many computer applications |
---|
60 | n/a | of such tournaments, we do not need to trace the history of a winner. |
---|
61 | n/a | To be more memory efficient, when a winner is promoted, we try to |
---|
62 | n/a | replace it by something else at a lower level, and the rule becomes |
---|
63 | n/a | that a cell and the two cells it tops contain three different items, |
---|
64 | n/a | but the top cell "wins" over the two topped cells. |
---|
65 | n/a | |
---|
66 | n/a | If this heap invariant is protected at all time, index 0 is clearly |
---|
67 | n/a | the overall winner. The simplest algorithmic way to remove it and |
---|
68 | n/a | find the "next" winner is to move some loser (let's say cell 30 in the |
---|
69 | n/a | diagram above) into the 0 position, and then percolate this new 0 down |
---|
70 | n/a | the tree, exchanging values, until the invariant is re-established. |
---|
71 | n/a | This is clearly logarithmic on the total number of items in the tree. |
---|
72 | n/a | By iterating over all items, you get an O(n ln n) sort. |
---|
73 | n/a | |
---|
74 | n/a | A nice feature of this sort is that you can efficiently insert new |
---|
75 | n/a | items while the sort is going on, provided that the inserted items are |
---|
76 | n/a | not "better" than the last 0'th element you extracted. This is |
---|
77 | n/a | especially useful in simulation contexts, where the tree holds all |
---|
78 | n/a | incoming events, and the "win" condition means the smallest scheduled |
---|
79 | n/a | time. When an event schedule other events for execution, they are |
---|
80 | n/a | scheduled into the future, so they can easily go into the heap. So, a |
---|
81 | n/a | heap is a good structure for implementing schedulers (this is what I |
---|
82 | n/a | used for my MIDI sequencer :-). |
---|
83 | n/a | |
---|
84 | n/a | Various structures for implementing schedulers have been extensively |
---|
85 | n/a | studied, and heaps are good for this, as they are reasonably speedy, |
---|
86 | n/a | the speed is almost constant, and the worst case is not much different |
---|
87 | n/a | than the average case. However, there are other representations which |
---|
88 | n/a | are more efficient overall, yet the worst cases might be terrible. |
---|
89 | n/a | |
---|
90 | n/a | Heaps are also very useful in big disk sorts. You most probably all |
---|
91 | n/a | know that a big sort implies producing "runs" (which are pre-sorted |
---|
92 | n/a | sequences, which size is usually related to the amount of CPU memory), |
---|
93 | n/a | followed by a merging passes for these runs, which merging is often |
---|
94 | n/a | very cleverly organised[1]. It is very important that the initial |
---|
95 | n/a | sort produces the longest runs possible. Tournaments are a good way |
---|
96 | n/a | to that. If, using all the memory available to hold a tournament, you |
---|
97 | n/a | replace and percolate items that happen to fit the current run, you'll |
---|
98 | n/a | produce runs which are twice the size of the memory for random input, |
---|
99 | n/a | and much better for input fuzzily ordered. |
---|
100 | n/a | |
---|
101 | n/a | Moreover, if you output the 0'th item on disk and get an input which |
---|
102 | n/a | may not fit in the current tournament (because the value "wins" over |
---|
103 | n/a | the last output value), it cannot fit in the heap, so the size of the |
---|
104 | n/a | heap decreases. The freed memory could be cleverly reused immediately |
---|
105 | n/a | for progressively building a second heap, which grows at exactly the |
---|
106 | n/a | same rate the first heap is melting. When the first heap completely |
---|
107 | n/a | vanishes, you switch heaps and start a new run. Clever and quite |
---|
108 | n/a | effective! |
---|
109 | n/a | |
---|
110 | n/a | In a word, heaps are useful memory structures to know. I use them in |
---|
111 | n/a | a few applications, and I think it is good to keep a `heap' module |
---|
112 | n/a | around. :-) |
---|
113 | n/a | |
---|
114 | n/a | -------------------- |
---|
115 | n/a | [1] The disk balancing algorithms which are current, nowadays, are |
---|
116 | n/a | more annoying than clever, and this is a consequence of the seeking |
---|
117 | n/a | capabilities of the disks. On devices which cannot seek, like big |
---|
118 | n/a | tape drives, the story was quite different, and one had to be very |
---|
119 | n/a | clever to ensure (far in advance) that each tape movement will be the |
---|
120 | n/a | most effective possible (that is, will best participate at |
---|
121 | n/a | "progressing" the merge). Some tapes were even able to read |
---|
122 | n/a | backwards, and this was also used to avoid the rewinding time. |
---|
123 | n/a | Believe me, real good tape sorts were quite spectacular to watch! |
---|
124 | n/a | From all times, sorting has always been a Great Art! :-) |
---|
125 | n/a | """ |
---|
126 | n/a | |
---|
127 | n/a | __all__ = ['heappush', 'heappop', 'heapify', 'heapreplace', 'merge', |
---|
128 | n/a | 'nlargest', 'nsmallest', 'heappushpop'] |
---|
129 | n/a | |
---|
130 | n/a | def heappush(heap, item): |
---|
131 | n/a | """Push item onto heap, maintaining the heap invariant.""" |
---|
132 | n/a | heap.append(item) |
---|
133 | n/a | _siftdown(heap, 0, len(heap)-1) |
---|
134 | n/a | |
---|
135 | n/a | def heappop(heap): |
---|
136 | n/a | """Pop the smallest item off the heap, maintaining the heap invariant.""" |
---|
137 | n/a | lastelt = heap.pop() # raises appropriate IndexError if heap is empty |
---|
138 | n/a | if heap: |
---|
139 | n/a | returnitem = heap[0] |
---|
140 | n/a | heap[0] = lastelt |
---|
141 | n/a | _siftup(heap, 0) |
---|
142 | n/a | return returnitem |
---|
143 | n/a | return lastelt |
---|
144 | n/a | |
---|
145 | n/a | def heapreplace(heap, item): |
---|
146 | n/a | """Pop and return the current smallest value, and add the new item. |
---|
147 | n/a | |
---|
148 | n/a | This is more efficient than heappop() followed by heappush(), and can be |
---|
149 | n/a | more appropriate when using a fixed-size heap. Note that the value |
---|
150 | n/a | returned may be larger than item! That constrains reasonable uses of |
---|
151 | n/a | this routine unless written as part of a conditional replacement: |
---|
152 | n/a | |
---|
153 | n/a | if item > heap[0]: |
---|
154 | n/a | item = heapreplace(heap, item) |
---|
155 | n/a | """ |
---|
156 | n/a | returnitem = heap[0] # raises appropriate IndexError if heap is empty |
---|
157 | n/a | heap[0] = item |
---|
158 | n/a | _siftup(heap, 0) |
---|
159 | n/a | return returnitem |
---|
160 | n/a | |
---|
161 | n/a | def heappushpop(heap, item): |
---|
162 | n/a | """Fast version of a heappush followed by a heappop.""" |
---|
163 | n/a | if heap and heap[0] < item: |
---|
164 | n/a | item, heap[0] = heap[0], item |
---|
165 | n/a | _siftup(heap, 0) |
---|
166 | n/a | return item |
---|
167 | n/a | |
---|
168 | n/a | def heapify(x): |
---|
169 | n/a | """Transform list into a heap, in-place, in O(len(x)) time.""" |
---|
170 | n/a | n = len(x) |
---|
171 | n/a | # Transform bottom-up. The largest index there's any point to looking at |
---|
172 | n/a | # is the largest with a child index in-range, so must have 2*i + 1 < n, |
---|
173 | n/a | # or i < (n-1)/2. If n is even = 2*j, this is (2*j-1)/2 = j-1/2 so |
---|
174 | n/a | # j-1 is the largest, which is n//2 - 1. If n is odd = 2*j+1, this is |
---|
175 | n/a | # (2*j+1-1)/2 = j so j-1 is the largest, and that's again n//2-1. |
---|
176 | n/a | for i in reversed(range(n//2)): |
---|
177 | n/a | _siftup(x, i) |
---|
178 | n/a | |
---|
179 | n/a | def _heappop_max(heap): |
---|
180 | n/a | """Maxheap version of a heappop.""" |
---|
181 | n/a | lastelt = heap.pop() # raises appropriate IndexError if heap is empty |
---|
182 | n/a | if heap: |
---|
183 | n/a | returnitem = heap[0] |
---|
184 | n/a | heap[0] = lastelt |
---|
185 | n/a | _siftup_max(heap, 0) |
---|
186 | n/a | return returnitem |
---|
187 | n/a | return lastelt |
---|
188 | n/a | |
---|
189 | n/a | def _heapreplace_max(heap, item): |
---|
190 | n/a | """Maxheap version of a heappop followed by a heappush.""" |
---|
191 | n/a | returnitem = heap[0] # raises appropriate IndexError if heap is empty |
---|
192 | n/a | heap[0] = item |
---|
193 | n/a | _siftup_max(heap, 0) |
---|
194 | n/a | return returnitem |
---|
195 | n/a | |
---|
196 | n/a | def _heapify_max(x): |
---|
197 | n/a | """Transform list into a maxheap, in-place, in O(len(x)) time.""" |
---|
198 | n/a | n = len(x) |
---|
199 | n/a | for i in reversed(range(n//2)): |
---|
200 | n/a | _siftup_max(x, i) |
---|
201 | n/a | |
---|
202 | n/a | # 'heap' is a heap at all indices >= startpos, except possibly for pos. pos |
---|
203 | n/a | # is the index of a leaf with a possibly out-of-order value. Restore the |
---|
204 | n/a | # heap invariant. |
---|
205 | n/a | def _siftdown(heap, startpos, pos): |
---|
206 | n/a | newitem = heap[pos] |
---|
207 | n/a | # Follow the path to the root, moving parents down until finding a place |
---|
208 | n/a | # newitem fits. |
---|
209 | n/a | while pos > startpos: |
---|
210 | n/a | parentpos = (pos - 1) >> 1 |
---|
211 | n/a | parent = heap[parentpos] |
---|
212 | n/a | if newitem < parent: |
---|
213 | n/a | heap[pos] = parent |
---|
214 | n/a | pos = parentpos |
---|
215 | n/a | continue |
---|
216 | n/a | break |
---|
217 | n/a | heap[pos] = newitem |
---|
218 | n/a | |
---|
219 | n/a | # The child indices of heap index pos are already heaps, and we want to make |
---|
220 | n/a | # a heap at index pos too. We do this by bubbling the smaller child of |
---|
221 | n/a | # pos up (and so on with that child's children, etc) until hitting a leaf, |
---|
222 | n/a | # then using _siftdown to move the oddball originally at index pos into place. |
---|
223 | n/a | # |
---|
224 | n/a | # We *could* break out of the loop as soon as we find a pos where newitem <= |
---|
225 | n/a | # both its children, but turns out that's not a good idea, and despite that |
---|
226 | n/a | # many books write the algorithm that way. During a heap pop, the last array |
---|
227 | n/a | # element is sifted in, and that tends to be large, so that comparing it |
---|
228 | n/a | # against values starting from the root usually doesn't pay (= usually doesn't |
---|
229 | n/a | # get us out of the loop early). See Knuth, Volume 3, where this is |
---|
230 | n/a | # explained and quantified in an exercise. |
---|
231 | n/a | # |
---|
232 | n/a | # Cutting the # of comparisons is important, since these routines have no |
---|
233 | n/a | # way to extract "the priority" from an array element, so that intelligence |
---|
234 | n/a | # is likely to be hiding in custom comparison methods, or in array elements |
---|
235 | n/a | # storing (priority, record) tuples. Comparisons are thus potentially |
---|
236 | n/a | # expensive. |
---|
237 | n/a | # |
---|
238 | n/a | # On random arrays of length 1000, making this change cut the number of |
---|
239 | n/a | # comparisons made by heapify() a little, and those made by exhaustive |
---|
240 | n/a | # heappop() a lot, in accord with theory. Here are typical results from 3 |
---|
241 | n/a | # runs (3 just to demonstrate how small the variance is): |
---|
242 | n/a | # |
---|
243 | n/a | # Compares needed by heapify Compares needed by 1000 heappops |
---|
244 | n/a | # -------------------------- -------------------------------- |
---|
245 | n/a | # 1837 cut to 1663 14996 cut to 8680 |
---|
246 | n/a | # 1855 cut to 1659 14966 cut to 8678 |
---|
247 | n/a | # 1847 cut to 1660 15024 cut to 8703 |
---|
248 | n/a | # |
---|
249 | n/a | # Building the heap by using heappush() 1000 times instead required |
---|
250 | n/a | # 2198, 2148, and 2219 compares: heapify() is more efficient, when |
---|
251 | n/a | # you can use it. |
---|
252 | n/a | # |
---|
253 | n/a | # The total compares needed by list.sort() on the same lists were 8627, |
---|
254 | n/a | # 8627, and 8632 (this should be compared to the sum of heapify() and |
---|
255 | n/a | # heappop() compares): list.sort() is (unsurprisingly!) more efficient |
---|
256 | n/a | # for sorting. |
---|
257 | n/a | |
---|
258 | n/a | def _siftup(heap, pos): |
---|
259 | n/a | endpos = len(heap) |
---|
260 | n/a | startpos = pos |
---|
261 | n/a | newitem = heap[pos] |
---|
262 | n/a | # Bubble up the smaller child until hitting a leaf. |
---|
263 | n/a | childpos = 2*pos + 1 # leftmost child position |
---|
264 | n/a | while childpos < endpos: |
---|
265 | n/a | # Set childpos to index of smaller child. |
---|
266 | n/a | rightpos = childpos + 1 |
---|
267 | n/a | if rightpos < endpos and not heap[childpos] < heap[rightpos]: |
---|
268 | n/a | childpos = rightpos |
---|
269 | n/a | # Move the smaller child up. |
---|
270 | n/a | heap[pos] = heap[childpos] |
---|
271 | n/a | pos = childpos |
---|
272 | n/a | childpos = 2*pos + 1 |
---|
273 | n/a | # The leaf at pos is empty now. Put newitem there, and bubble it up |
---|
274 | n/a | # to its final resting place (by sifting its parents down). |
---|
275 | n/a | heap[pos] = newitem |
---|
276 | n/a | _siftdown(heap, startpos, pos) |
---|
277 | n/a | |
---|
278 | n/a | def _siftdown_max(heap, startpos, pos): |
---|
279 | n/a | 'Maxheap variant of _siftdown' |
---|
280 | n/a | newitem = heap[pos] |
---|
281 | n/a | # Follow the path to the root, moving parents down until finding a place |
---|
282 | n/a | # newitem fits. |
---|
283 | n/a | while pos > startpos: |
---|
284 | n/a | parentpos = (pos - 1) >> 1 |
---|
285 | n/a | parent = heap[parentpos] |
---|
286 | n/a | if parent < newitem: |
---|
287 | n/a | heap[pos] = parent |
---|
288 | n/a | pos = parentpos |
---|
289 | n/a | continue |
---|
290 | n/a | break |
---|
291 | n/a | heap[pos] = newitem |
---|
292 | n/a | |
---|
293 | n/a | def _siftup_max(heap, pos): |
---|
294 | n/a | 'Maxheap variant of _siftup' |
---|
295 | n/a | endpos = len(heap) |
---|
296 | n/a | startpos = pos |
---|
297 | n/a | newitem = heap[pos] |
---|
298 | n/a | # Bubble up the larger child until hitting a leaf. |
---|
299 | n/a | childpos = 2*pos + 1 # leftmost child position |
---|
300 | n/a | while childpos < endpos: |
---|
301 | n/a | # Set childpos to index of larger child. |
---|
302 | n/a | rightpos = childpos + 1 |
---|
303 | n/a | if rightpos < endpos and not heap[rightpos] < heap[childpos]: |
---|
304 | n/a | childpos = rightpos |
---|
305 | n/a | # Move the larger child up. |
---|
306 | n/a | heap[pos] = heap[childpos] |
---|
307 | n/a | pos = childpos |
---|
308 | n/a | childpos = 2*pos + 1 |
---|
309 | n/a | # The leaf at pos is empty now. Put newitem there, and bubble it up |
---|
310 | n/a | # to its final resting place (by sifting its parents down). |
---|
311 | n/a | heap[pos] = newitem |
---|
312 | n/a | _siftdown_max(heap, startpos, pos) |
---|
313 | n/a | |
---|
314 | n/a | def merge(*iterables, key=None, reverse=False): |
---|
315 | n/a | '''Merge multiple sorted inputs into a single sorted output. |
---|
316 | n/a | |
---|
317 | n/a | Similar to sorted(itertools.chain(*iterables)) but returns a generator, |
---|
318 | n/a | does not pull the data into memory all at once, and assumes that each of |
---|
319 | n/a | the input streams is already sorted (smallest to largest). |
---|
320 | n/a | |
---|
321 | n/a | >>> list(merge([1,3,5,7], [0,2,4,8], [5,10,15,20], [], [25])) |
---|
322 | n/a | [0, 1, 2, 3, 4, 5, 5, 7, 8, 10, 15, 20, 25] |
---|
323 | n/a | |
---|
324 | n/a | If *key* is not None, applies a key function to each element to determine |
---|
325 | n/a | its sort order. |
---|
326 | n/a | |
---|
327 | n/a | >>> list(merge(['dog', 'horse'], ['cat', 'fish', 'kangaroo'], key=len)) |
---|
328 | n/a | ['dog', 'cat', 'fish', 'horse', 'kangaroo'] |
---|
329 | n/a | |
---|
330 | n/a | ''' |
---|
331 | n/a | |
---|
332 | n/a | h = [] |
---|
333 | n/a | h_append = h.append |
---|
334 | n/a | |
---|
335 | n/a | if reverse: |
---|
336 | n/a | _heapify = _heapify_max |
---|
337 | n/a | _heappop = _heappop_max |
---|
338 | n/a | _heapreplace = _heapreplace_max |
---|
339 | n/a | direction = -1 |
---|
340 | n/a | else: |
---|
341 | n/a | _heapify = heapify |
---|
342 | n/a | _heappop = heappop |
---|
343 | n/a | _heapreplace = heapreplace |
---|
344 | n/a | direction = 1 |
---|
345 | n/a | |
---|
346 | n/a | if key is None: |
---|
347 | n/a | for order, it in enumerate(map(iter, iterables)): |
---|
348 | n/a | try: |
---|
349 | n/a | next = it.__next__ |
---|
350 | n/a | h_append([next(), order * direction, next]) |
---|
351 | n/a | except StopIteration: |
---|
352 | n/a | pass |
---|
353 | n/a | _heapify(h) |
---|
354 | n/a | while len(h) > 1: |
---|
355 | n/a | try: |
---|
356 | n/a | while True: |
---|
357 | n/a | value, order, next = s = h[0] |
---|
358 | n/a | yield value |
---|
359 | n/a | s[0] = next() # raises StopIteration when exhausted |
---|
360 | n/a | _heapreplace(h, s) # restore heap condition |
---|
361 | n/a | except StopIteration: |
---|
362 | n/a | _heappop(h) # remove empty iterator |
---|
363 | n/a | if h: |
---|
364 | n/a | # fast case when only a single iterator remains |
---|
365 | n/a | value, order, next = h[0] |
---|
366 | n/a | yield value |
---|
367 | n/a | yield from next.__self__ |
---|
368 | n/a | return |
---|
369 | n/a | |
---|
370 | n/a | for order, it in enumerate(map(iter, iterables)): |
---|
371 | n/a | try: |
---|
372 | n/a | next = it.__next__ |
---|
373 | n/a | value = next() |
---|
374 | n/a | h_append([key(value), order * direction, value, next]) |
---|
375 | n/a | except StopIteration: |
---|
376 | n/a | pass |
---|
377 | n/a | _heapify(h) |
---|
378 | n/a | while len(h) > 1: |
---|
379 | n/a | try: |
---|
380 | n/a | while True: |
---|
381 | n/a | key_value, order, value, next = s = h[0] |
---|
382 | n/a | yield value |
---|
383 | n/a | value = next() |
---|
384 | n/a | s[0] = key(value) |
---|
385 | n/a | s[2] = value |
---|
386 | n/a | _heapreplace(h, s) |
---|
387 | n/a | except StopIteration: |
---|
388 | n/a | _heappop(h) |
---|
389 | n/a | if h: |
---|
390 | n/a | key_value, order, value, next = h[0] |
---|
391 | n/a | yield value |
---|
392 | n/a | yield from next.__self__ |
---|
393 | n/a | |
---|
394 | n/a | |
---|
395 | n/a | # Algorithm notes for nlargest() and nsmallest() |
---|
396 | n/a | # ============================================== |
---|
397 | n/a | # |
---|
398 | n/a | # Make a single pass over the data while keeping the k most extreme values |
---|
399 | n/a | # in a heap. Memory consumption is limited to keeping k values in a list. |
---|
400 | n/a | # |
---|
401 | n/a | # Measured performance for random inputs: |
---|
402 | n/a | # |
---|
403 | n/a | # number of comparisons |
---|
404 | n/a | # n inputs k-extreme values (average of 5 trials) % more than min() |
---|
405 | n/a | # ------------- ---------------- --------------------- ----------------- |
---|
406 | n/a | # 1,000 100 3,317 231.7% |
---|
407 | n/a | # 10,000 100 14,046 40.5% |
---|
408 | n/a | # 100,000 100 105,749 5.7% |
---|
409 | n/a | # 1,000,000 100 1,007,751 0.8% |
---|
410 | n/a | # 10,000,000 100 10,009,401 0.1% |
---|
411 | n/a | # |
---|
412 | n/a | # Theoretical number of comparisons for k smallest of n random inputs: |
---|
413 | n/a | # |
---|
414 | n/a | # Step Comparisons Action |
---|
415 | n/a | # ---- -------------------------- --------------------------- |
---|
416 | n/a | # 1 1.66 * k heapify the first k-inputs |
---|
417 | n/a | # 2 n - k compare remaining elements to top of heap |
---|
418 | n/a | # 3 k * (1 + lg2(k)) * ln(n/k) replace the topmost value on the heap |
---|
419 | n/a | # 4 k * lg2(k) - (k/2) final sort of the k most extreme values |
---|
420 | n/a | # |
---|
421 | n/a | # Combining and simplifying for a rough estimate gives: |
---|
422 | n/a | # |
---|
423 | n/a | # comparisons = n + k * (log(k, 2) * log(n/k) + log(k, 2) + log(n/k)) |
---|
424 | n/a | # |
---|
425 | n/a | # Computing the number of comparisons for step 3: |
---|
426 | n/a | # ----------------------------------------------- |
---|
427 | n/a | # * For the i-th new value from the iterable, the probability of being in the |
---|
428 | n/a | # k most extreme values is k/i. For example, the probability of the 101st |
---|
429 | n/a | # value seen being in the 100 most extreme values is 100/101. |
---|
430 | n/a | # * If the value is a new extreme value, the cost of inserting it into the |
---|
431 | n/a | # heap is 1 + log(k, 2). |
---|
432 | n/a | # * The probability times the cost gives: |
---|
433 | n/a | # (k/i) * (1 + log(k, 2)) |
---|
434 | n/a | # * Summing across the remaining n-k elements gives: |
---|
435 | n/a | # sum((k/i) * (1 + log(k, 2)) for i in range(k+1, n+1)) |
---|
436 | n/a | # * This reduces to: |
---|
437 | n/a | # (H(n) - H(k)) * k * (1 + log(k, 2)) |
---|
438 | n/a | # * Where H(n) is the n-th harmonic number estimated by: |
---|
439 | n/a | # gamma = 0.5772156649 |
---|
440 | n/a | # H(n) = log(n, e) + gamma + 1 / (2 * n) |
---|
441 | n/a | # http://en.wikipedia.org/wiki/Harmonic_series_(mathematics)#Rate_of_divergence |
---|
442 | n/a | # * Substituting the H(n) formula: |
---|
443 | n/a | # comparisons = k * (1 + log(k, 2)) * (log(n/k, e) + (1/n - 1/k) / 2) |
---|
444 | n/a | # |
---|
445 | n/a | # Worst-case for step 3: |
---|
446 | n/a | # ---------------------- |
---|
447 | n/a | # In the worst case, the input data is reversed sorted so that every new element |
---|
448 | n/a | # must be inserted in the heap: |
---|
449 | n/a | # |
---|
450 | n/a | # comparisons = 1.66 * k + log(k, 2) * (n - k) |
---|
451 | n/a | # |
---|
452 | n/a | # Alternative Algorithms |
---|
453 | n/a | # ---------------------- |
---|
454 | n/a | # Other algorithms were not used because they: |
---|
455 | n/a | # 1) Took much more auxiliary memory, |
---|
456 | n/a | # 2) Made multiple passes over the data. |
---|
457 | n/a | # 3) Made more comparisons in common cases (small k, large n, semi-random input). |
---|
458 | n/a | # See the more detailed comparison of approach at: |
---|
459 | n/a | # http://code.activestate.com/recipes/577573-compare-algorithms-for-heapqsmallest |
---|
460 | n/a | |
---|
461 | n/a | def nsmallest(n, iterable, key=None): |
---|
462 | n/a | """Find the n smallest elements in a dataset. |
---|
463 | n/a | |
---|
464 | n/a | Equivalent to: sorted(iterable, key=key)[:n] |
---|
465 | n/a | """ |
---|
466 | n/a | |
---|
467 | n/a | # Short-cut for n==1 is to use min() |
---|
468 | n/a | if n == 1: |
---|
469 | n/a | it = iter(iterable) |
---|
470 | n/a | sentinel = object() |
---|
471 | n/a | if key is None: |
---|
472 | n/a | result = min(it, default=sentinel) |
---|
473 | n/a | else: |
---|
474 | n/a | result = min(it, default=sentinel, key=key) |
---|
475 | n/a | return [] if result is sentinel else [result] |
---|
476 | n/a | |
---|
477 | n/a | # When n>=size, it's faster to use sorted() |
---|
478 | n/a | try: |
---|
479 | n/a | size = len(iterable) |
---|
480 | n/a | except (TypeError, AttributeError): |
---|
481 | n/a | pass |
---|
482 | n/a | else: |
---|
483 | n/a | if n >= size: |
---|
484 | n/a | return sorted(iterable, key=key)[:n] |
---|
485 | n/a | |
---|
486 | n/a | # When key is none, use simpler decoration |
---|
487 | n/a | if key is None: |
---|
488 | n/a | it = iter(iterable) |
---|
489 | n/a | # put the range(n) first so that zip() doesn't |
---|
490 | n/a | # consume one too many elements from the iterator |
---|
491 | n/a | result = [(elem, i) for i, elem in zip(range(n), it)] |
---|
492 | n/a | if not result: |
---|
493 | n/a | return result |
---|
494 | n/a | _heapify_max(result) |
---|
495 | n/a | top = result[0][0] |
---|
496 | n/a | order = n |
---|
497 | n/a | _heapreplace = _heapreplace_max |
---|
498 | n/a | for elem in it: |
---|
499 | n/a | if elem < top: |
---|
500 | n/a | _heapreplace(result, (elem, order)) |
---|
501 | n/a | top = result[0][0] |
---|
502 | n/a | order += 1 |
---|
503 | n/a | result.sort() |
---|
504 | n/a | return [r[0] for r in result] |
---|
505 | n/a | |
---|
506 | n/a | # General case, slowest method |
---|
507 | n/a | it = iter(iterable) |
---|
508 | n/a | result = [(key(elem), i, elem) for i, elem in zip(range(n), it)] |
---|
509 | n/a | if not result: |
---|
510 | n/a | return result |
---|
511 | n/a | _heapify_max(result) |
---|
512 | n/a | top = result[0][0] |
---|
513 | n/a | order = n |
---|
514 | n/a | _heapreplace = _heapreplace_max |
---|
515 | n/a | for elem in it: |
---|
516 | n/a | k = key(elem) |
---|
517 | n/a | if k < top: |
---|
518 | n/a | _heapreplace(result, (k, order, elem)) |
---|
519 | n/a | top = result[0][0] |
---|
520 | n/a | order += 1 |
---|
521 | n/a | result.sort() |
---|
522 | n/a | return [r[2] for r in result] |
---|
523 | n/a | |
---|
524 | n/a | def nlargest(n, iterable, key=None): |
---|
525 | n/a | """Find the n largest elements in a dataset. |
---|
526 | n/a | |
---|
527 | n/a | Equivalent to: sorted(iterable, key=key, reverse=True)[:n] |
---|
528 | n/a | """ |
---|
529 | n/a | |
---|
530 | n/a | # Short-cut for n==1 is to use max() |
---|
531 | n/a | if n == 1: |
---|
532 | n/a | it = iter(iterable) |
---|
533 | n/a | sentinel = object() |
---|
534 | n/a | if key is None: |
---|
535 | n/a | result = max(it, default=sentinel) |
---|
536 | n/a | else: |
---|
537 | n/a | result = max(it, default=sentinel, key=key) |
---|
538 | n/a | return [] if result is sentinel else [result] |
---|
539 | n/a | |
---|
540 | n/a | # When n>=size, it's faster to use sorted() |
---|
541 | n/a | try: |
---|
542 | n/a | size = len(iterable) |
---|
543 | n/a | except (TypeError, AttributeError): |
---|
544 | n/a | pass |
---|
545 | n/a | else: |
---|
546 | n/a | if n >= size: |
---|
547 | n/a | return sorted(iterable, key=key, reverse=True)[:n] |
---|
548 | n/a | |
---|
549 | n/a | # When key is none, use simpler decoration |
---|
550 | n/a | if key is None: |
---|
551 | n/a | it = iter(iterable) |
---|
552 | n/a | result = [(elem, i) for i, elem in zip(range(0, -n, -1), it)] |
---|
553 | n/a | if not result: |
---|
554 | n/a | return result |
---|
555 | n/a | heapify(result) |
---|
556 | n/a | top = result[0][0] |
---|
557 | n/a | order = -n |
---|
558 | n/a | _heapreplace = heapreplace |
---|
559 | n/a | for elem in it: |
---|
560 | n/a | if top < elem: |
---|
561 | n/a | _heapreplace(result, (elem, order)) |
---|
562 | n/a | top = result[0][0] |
---|
563 | n/a | order -= 1 |
---|
564 | n/a | result.sort(reverse=True) |
---|
565 | n/a | return [r[0] for r in result] |
---|
566 | n/a | |
---|
567 | n/a | # General case, slowest method |
---|
568 | n/a | it = iter(iterable) |
---|
569 | n/a | result = [(key(elem), i, elem) for i, elem in zip(range(0, -n, -1), it)] |
---|
570 | n/a | if not result: |
---|
571 | n/a | return result |
---|
572 | n/a | heapify(result) |
---|
573 | n/a | top = result[0][0] |
---|
574 | n/a | order = -n |
---|
575 | n/a | _heapreplace = heapreplace |
---|
576 | n/a | for elem in it: |
---|
577 | n/a | k = key(elem) |
---|
578 | n/a | if top < k: |
---|
579 | n/a | _heapreplace(result, (k, order, elem)) |
---|
580 | n/a | top = result[0][0] |
---|
581 | n/a | order -= 1 |
---|
582 | n/a | result.sort(reverse=True) |
---|
583 | n/a | return [r[2] for r in result] |
---|
584 | n/a | |
---|
585 | n/a | # If available, use C implementation |
---|
586 | n/a | try: |
---|
587 | n/a | from _heapq import * |
---|
588 | n/a | except ImportError: |
---|
589 | n/a | pass |
---|
590 | n/a | try: |
---|
591 | n/a | from _heapq import _heapreplace_max |
---|
592 | n/a | except ImportError: |
---|
593 | n/a | pass |
---|
594 | n/a | try: |
---|
595 | n/a | from _heapq import _heapify_max |
---|
596 | n/a | except ImportError: |
---|
597 | n/a | pass |
---|
598 | n/a | try: |
---|
599 | n/a | from _heapq import _heappop_max |
---|
600 | n/a | except ImportError: |
---|
601 | n/a | pass |
---|
602 | n/a | |
---|
603 | n/a | |
---|
604 | n/a | if __name__ == "__main__": |
---|
605 | n/a | |
---|
606 | n/a | import doctest |
---|
607 | n/a | print(doctest.testmod()) |
---|