My Favorite Algorithm: Linear Time Median Finding (2018) (rcoh.me)

371 points by skanderbm 1 year ago | 181 comments

danlark 1 year ago |

Around 4 years ago I compared lots of different median algorithms and the article turned out to be much longer than I anticipated :)

https://danlark.org/2020/11/11/miniselect-practical-and-gene...

thanatropism 1 year ago | |

Are any of those easily modifiable to return the arg-median (the index of the element that is the median)?
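
For illustration only, here is a minimal sketch of what an arg-median could look like. It sorts indices by value, so it runs in O(n log n) rather than linear time; the name `arg_median` and the choice of the lower median for even lengths are assumptions made here, not something taken from the article. Running a selection routine on (value, index) pairs should give the same result in linear time.

```python
def arg_median(values):
    """Return the index of the (lower) median element of a non-empty list."""
    if not values:
        raise ValueError("arg_median of an empty list")
    # Sort the indices by the value they point at, then take the middle one.
    order = sorted(range(len(values)), key=values.__getitem__)
    return order[(len(values) - 1) // 2]


data = [30, 10, 20]
i = arg_median(data)
print(i, data[i])
```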

cfors 1 year ago | |

Just wanted to say thank you for this article - I've read and shared this a few times over the years!

rented_mule 1 year ago |

10-15 years ago, I found myself needing to regularly find the median of many billions of values, each parsed out of a multi-kilobyte log entry. MapReduce was what we were using for processing large amounts of data at the time. With MapReduce over that much data, you don't just want linear time, but ideally single pass, distributed across machines. Subsequent passes over much smaller amounts of data are fine.

It was a struggle until I figured out that knowledge of the precision and range of our data helped. These were timings, expressed in integer milliseconds. So they were non-negative, and I knew the 90th percentile was well under a second.

As the article mentions, finding a median typically involves something akin to sorting. With the above knowledge, bucket sort becomes available, with a slight tweak in my case. Even if the samples were floating point, the same approach could be used as long as an integer (or even fixed point) approximation that is very close to the true median is good enough, again assuming a known, relatively small range.

The idea is to build a dictionary where the keys are the timings in integer milliseconds and the values are a count of the keys' appearance in the data, i.e., a histogram of timings. The maximum timing isn't known, so to ensure the size of the dictionary doesn't get out of control, use the knowledge that the 90th percentile is well under a second and count everything over, say, 999ms in the 999ms bin. Then the dictionary will be limited to 2000 integers (keys in the range 0-999 and corresponding values) - this is the part that is different from an ordinary bucket sort. All of that is trivial to do in a single pass, even when distributed with MapReduce. Then it's easy to get the median from that dictionary / histogram.
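
A minimal sketch of the capped-histogram idea described above, in plain Python rather than MapReduce. The names `CAP_MS`, `histogram`, and `median_from_histogram` are made up for this illustration, and the sample timings are invented.

```python
from collections import Counter

CAP_MS = 999  # assumed cap: anything slower is counted in this last bin


def histogram(timings_ms):
    """Single pass: count each timing, clamping values above CAP_MS."""
    counts = Counter()
    for t in timings_ms:
        counts[min(t, CAP_MS)] += 1
    return counts


def median_from_histogram(counts):
    """Walk the bins in key order until the middle sample is covered."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no samples")
    target = (total - 1) // 2  # zero-based rank of the lower median
    seen = 0
    for key in sorted(counts):
        seen += counts[key]
        if seen > target:
            return key


# Histograms built on separate machines can be merged by simple addition.
merged = histogram([12, 40, 1500]) + histogram([7, 40, 95, 3000])
print(median_from_histogram(merged))
```

Because the histograms add up, each mapper can emit its own counts and a single reducer only ever sees at most 1000 keys. The result is only trustworthy as long as the median itself falls below the cap.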

justinpombrio 1 year ago | |

Did you actually need to find the true median of billions of values? Or would finding a value between 49.9% and 50.1% suffice? Because the latter is much easier: sample 10,000 elements uniformly at random and take their median.

(I made the number 10,000 up, but you could do some statistics to figure out how many samples would be needed for a given level of confidence, and I don't think it would be prohibitively large.)
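
A rough sketch of the sampling idea, assuming the data fits in a list that can be sampled uniformly (which is exactly the part that may be hard in a distributed setting). The helper names are invented for this example. As a rule of thumb, the rank of a sample median has a standard deviation of about 0.5/sqrt(m) for m samples, so 10,000 samples gives roughly half a percentage point; a 49.9% to 50.1% window would need far more.

```python
import random
import statistics


def sampled_median(values, sample_size=10_000, seed=None):
    """Estimate the median from a uniform random sample without replacement."""
    rng = random.Random(seed)
    if len(values) <= sample_size:
        return statistics.median(values)
    return statistics.median(rng.sample(values, sample_size))


def rank_fraction(values, x):
    """Fraction of values that are <= x, to check how central x really is."""
    return sum(v <= x for v in values) / len(values)


data = list(range(1_000_000))
estimate = sampled_median(data, seed=1)
print(estimate, rank_fraction(data, estimate))
```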

rented_mule 1 year ago | | |

The kind of margin you indicate would have been plenty for our use cases. But, we were already processing all these log entries for multiple other purposes in a single pass (not one pass per thing computed). With this single pass approach, the median calculation could happen with the same single-pass parsing of the logs (they were JSON and that parsing was most of our cost), roughly for free.

Uniform sampling also wasn't obviously simple, at least to me. There were thousands of log files involved, coming from hundreds of computers. Any single log file only had timings from a single computer. What kind of bias would be introduced by different approaches to distributing those log files to a cluster for the median calculation? Once the solution outlined in the previous comment was identified, that seemed simpler than trying to understand if we were talking about 49-51% or 40-50%. And if it was too big a margin, restructuring our infra to allow different log file distribution algorithms would have been far more complicated.

enriquto 1 year ago | | |

> the latter is much easier: sample 10,000 elements uniformly at random and take their median

Do you have a source for that claim?

I don't see how that could possibly be true... For example, if your original points are sampled from two Gaussians with centers -100 and 100, of small but slightly different variance, then the true median can be anywhere between the two centers, and you may need a humongous number of samples to get anywhere close to it.

True, in that case any point between say -90 and 90 would be equally good as a median in most applications. But this does not mean that the median can be found accurately by your method.

andruby 1 year ago | | |

I was thinking the same thing.

In all use-cases I've seen a close estimate of the median was enough.

hhmc 1 year ago | | |

You also can use the fact that for any distribution, the median is never further than 1SD away from the mean.
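
A quick numerical check of that bound on a deliberately skewed sample, using only the standard `statistics` module. The function name and the sample data are invented for this illustration; the population standard deviation (`pstdev`) is used, and the data must not be constant or the division fails.

```python
import statistics


def mean_median_gap_in_sd(data):
    """Return |mean - median| divided by the population standard deviation."""
    gap = abs(statistics.fmean(data) - statistics.median(data))
    return gap / statistics.pstdev(data)


# A heavily skewed sample: mostly 1 ms, with a few 1000 ms outliers.
skewed = [1] * 95 + [1000] * 5
print(mean_median_gap_in_sd(skewed))  # stays at or below 1
```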

digaozao 1 year ago | |

I am not sure, but from the outside it looks like what Prometheus does behind the scenes. It seems to me that Prometheus works like that because it had a latency limit of around 10s on some systems I worked on. So when we had requests above that limit, they were all recorded as 10s, even though the real value could be higher. Interesting.

Filligree 1 year ago | |

Was this by any chance for generating availability metrics, and were you an intern at the time? The system sounds, ah, very familiar.

rented_mule 1 year ago | | |

The metrics were about speed. And I was decades past my last internship at the time in question. But, as is so often the case, more than one of us may have been reinventing pretty similar wheels. :)

ant6n 1 year ago | |

I’m not sure why you use a dictionary with keys 0…999, instead of an array indexed 0…999.

rented_mule 1 year ago | | |

I was using the term dictionary for illustration purposes. Remember, this was all in the context of MapReduce. Computation within MapReduce is built around grouping values by keys, which makes dictionaries a natural way to think about many MapReduce oriented algorithms, at least for me. The key/value pairs appear as streams of two-tuples, not as dictionaries or arrays.

tomrod 1 year ago | | |

That's just a dict/map with less flexibility on the keys :D

ashton314 1 year ago | |

Where were you working? Sounds like you got lucky to work on some fun problems!

rented_mule 1 year ago | | |

Sorry, but I'm trying to keep this account relatively anonymous to sidestep some of my issues with being shy.

But, you're right, I was lucky to work on a bunch of fun problems. That period, in particular, was pretty amazing. I was part of a fun, collaborative team working on hard problems. And management showed a lot of trust in us. We came up with some very interesting solutions, some by skill and some by luck, that set the foundation for years of growth that came after that (both revenue growth and technical platform growth).

xinok 1 year ago |

> P.S: In 2017 a new paper came out that actually makes the median-of-medians approach competitive with other selection algorithms. Thanks to the paper’s author, Andrei Alexandrescu for bringing it to my attention!

He also gave a talk about his algorithm in 2016. He's an entertaining presenter, highly recommended!

There's Treasure Everywhere - Andrei Alexandrescu

https://www.youtube.com/watch?v=fd1_Miy1Clg

_yid9 1 year ago | |

Andrei Alexandrescu is awesome; around 2000 he gave a talk on lock-free wait-free algorithms that I immediately applied to a huge C++ industrial control networking project at the time.

I'd recommend that anyone who writes software listen to and read anything of Andrei's they can find; this one is indeed a Treasure!

fasa99 1 year ago | |

that's wild, a bit of a polymath by computer science standards. I know him from template metaprogramming fame and here he is shifting from programming languages to algorithms

mabbo 1 year ago |

I learned about the median-of-medians quickselect algorithm when I was an undergrad and was really impressed by it. I implemented it, and it was terribly slow. Its runtime grew linearly, but that only really mattered if you had at least a few billion items in your list.

I was chatting about this with a grad student friend who casually said something like "Sure, it's slow, but what really matters is that it proves that it's possible to do selection of an unsorted list in O(n) time. At one point, we didn't know whether that was even possible. Now that we do, we know there might be an even faster linear algorithm." Really got into the philosophy of what Computer Science is about in the first place.

The lesson was so simple yet so profound that I nearly applied to grad school because of it. I have no idea if they even recall the conversation, but it was a pivotal moment of my education.

kwantam 1 year ago |

One of the fun things about the median-of-medians algorithm is its completely star-studded author list.

Manuel Blum - Turing award winner in 1995

Robert Floyd - Turing award winner in 1978

Ron Rivest - Turing award winner in 2002

Bob Tarjan - Turing award winner in 1986 (oh and also the inaugural Nevanlinna prizewinner in 1982)

Vaughan Pratt - oh no, the only non-Turing award winner in the list. Oh right but he's emeritus faculty at Stanford, directed the SUN project before it became Sun Microsystems, was instrumental in Sun's early days (director of research and designer of the Sun logo!), and is responsible for all kinds of other awesome stuff (near and dear to me: Pratt certificates of primality).

Four independent Turing awards! SPARCstations! This paper has it all.

jiggawatts 1 year ago | |

Job interview question for an entry-level front end developer: "Reproduce the work of four Turing award winners in the next thirty minutes. You have a dirty whiteboard and a dry pen. Your time begins... now."

ted_dunning 1 year ago | | |

And if you really want to impress, you reach into your pack and pull out the pens you carry just in case you run into dry pens at a critical moment.

Munksgaard 1 year ago | |

Here's a direct link for anyone who, like me, would be interested in reading the original article: https://people.csail.mit.edu/rivest/pubs/BFPRT73.pdf

That's an impressive list of authors, for sure.

praptak 1 year ago | |

Some other awesome stuff by Pratt:

Pratt parsing (HN discussion: https://news.ycombinator.com/item?id=39066465), the "P" in the KMP algorithm.

someplaceguy 1 year ago |

    return l[len(l) / 2]

I'm not a Python expert, but doesn't the `/` operator return a float in Python? Why would you use a float as an array index instead of doing integer division (with `//`)?

I know this probably won't matter until you have extremely large arrays, but this is still quite a code smell.

Perhaps this could be forgiven if you're a Python novice and hadn't realized that the two different operators exist, but this is not the case here, as the article contains this even more baffling code which uses integer division in one branch but float division in the other:

    def quickselect_median(l, pivot_fn=random.choice):
        if len(l) % 2 == 1:
            return quickselect(l, len(l) // 2, pivot_fn)
        else:
            return 0.5 * (quickselect(l, len(l) / 2 - 1, pivot_fn) +
                           quickselect(l, len(l) / 2, pivot_fn))

That we're 50 comments in and nobody seems to have noticed this only serves to reinforce my existing prejudice against the average Python code quality.

jononor 1 year ago | |

Well spotted! In Python 2 there was only one operator, but in Python 3 they are distinct. Indexing an array with a float raises an exception, I believe.
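
The behavior discussed here is easy to check. The sketch below reuses the `quickselect_median` signature from the quoted snippet with `//` in both branches; `quickselect` is only a sorted-list stand-in so the example runs on its own, not the article's implementation, and `float_index_fails` is a helper invented for the demonstration.

```python
import random


def quickselect(l, k, pivot_fn):
    """Stand-in for the article's helper: returns the k-th smallest item."""
    return sorted(l)[k]


def quickselect_median(l, pivot_fn=random.choice):
    # Integer division in both branches, so every index is an int.
    if len(l) % 2 == 1:
        return quickselect(l, len(l) // 2, pivot_fn)
    else:
        return 0.5 * (quickselect(l, len(l) // 2 - 1, pivot_fn) +
                      quickselect(l, len(l) // 2, pivot_fn))


def float_index_fails(l):
    """True if indexing with true division raises TypeError (Python 3)."""
    try:
        l[len(l) / 2]
    except TypeError:
        return True
    return False


print(float_index_fails([1, 2, 3, 4]))
print(quickselect_median([4, 1, 3, 2]))
```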

runeblaze 1 year ago | |

I do agree that it is a code smell. However given that this is an algorithms article I don't think it is exactly that fair to judge it based on code quality. I think of it as: instead of writing it in pseudocode the author chose a real pseudocode-like programming language, and it (presumably) runs well for illustrative purposes.

TacticalCoder 1 year ago |

I really enjoyed TFA but this:

> Technically, you could get extremely unlucky: at each step, you could pick the largest element as your pivot. Each step would only remove one element from the list and you’d actually have O(n^2) performance instead of O(n)

If adversarial input is a concern, doing a O(n) shuffle of the data first guarantees this cannot happen. If the data is really too big to shuffle, then only shuffle once a bucket is small enough to be shuffled.

If you do shuffle, probabilities are here to guarantee that that worst case cannot happen. If anyone says that "technically" it can happen, I'll answer that then "technically" an attacker could also guess correctly every bit of your 256 bits private key.

Our world is built on probabilities: all our private keys are protected by the mathematical improbability that someone shall guess them correctly.

From what I read, a shuffle followed by quickselect is O(n) for all practical purposes.

bo1024 1 year ago | |

You're already using your own randomness to pick the pivot at random, so I don't see why the shuffle helps more. But yes, if your randomness is trustworthy, the probability of more than O(n) runtime is very low.

Reubend 1 year ago | |

> If adversarial input is a concern, doing a O(n) shuffle of the data first guarantees this cannot happen.

It doesn't guarantee that you avoid the worst case, it just removes the possibility of forcing the worst case.

furstenheim 1 year ago |

Floyd-Rivest also does the job. A bit more efficient IIRC.

However I never managed to understand how it works.

https://en.m.wikipedia.org/wiki/Floyd%E2%80%93Rivest_algorit...

throwaway294531 1 year ago |

If you're selecting the n:th element, where n is very small (or large), using median-of-medians may not be the best choice.

Instead, you can use a biased pivot as in [1] or something I call "j:th of k:th". Floyd-Rivest can also speed things up. I have a hobby project that gets 1.2-2.0x throughput when compared to a well implemented quickselect, see: https://github.com/koskinev/turboselect

If anyone has pointers to fast generic & in-place selection algorithms, I'm interested.

[1] https://doi.org/10.4230/LIPIcs.SEA.2017.24

mgaunard 1 year ago |

You could also use one of the streaming algorithms which allow you to compute approximations for arbitrary quantiles without ever needing to store the whole data in memory.

anonymoushn 1 year ago |

One commonly sees the implication that radix sort cannot be used for data types other than integers, or for composite data types, or for large data types. For example, TFA says you could use radix sort if your input is 32-bit integers. But you can use it on anything. You can use radix sort to sort strings in O(n) time.
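
To make the string case concrete, here is a small sketch of an MSD (most significant digit first) radix sort on Python strings. It is an illustration written for this thread, not code from the article: the function name is invented, it buckets by one character at a time, and the work is proportional to the total number of characters examined plus a sort of the distinct characters seen at each level.

```python
def msd_radix_sort(strings, pos=0):
    """Sort strings by bucketing on the character at position `pos`."""
    if len(strings) <= 1:
        return list(strings)
    # Strings that end exactly here come before any longer string.
    out = [s for s in strings if len(s) == pos]
    buckets = {}
    for s in strings:
        if len(s) > pos:
            buckets.setdefault(s[pos], []).append(s)
    # Only the distinct characters are ordered, never the strings themselves.
    for ch in sorted(buckets):
        out.extend(msd_radix_sort(buckets[ch], pos + 1))
    return out


words = ["banana", "band", "ban", "apple", "", "bandana", "apple"]
print(msd_radix_sort(words))
```

The recursion depth equals the length of the longest shared prefix, so very long identical strings would need an iterative version.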

ncruces 1 year ago |

An implementation in Go, that's (hopefully) simple enough to be understandable, yet minimally practical:

https://github.com/ncruces/sort/blob/main/quick/quick.go

Xcelerate 1 year ago |

I received a variant of this problem as an interview question a few months ago. Except the linear time approach would not have worked here, since the list contains trillions of numbers, you only have sequential read access, and the list cannot be loaded into memory. 30 minutes — go.

First I asked if anything could be assumed about the statistics on the distribution of the numbers. Nope, could be anything, except the numbers are 32-bit ints. After fiddling around for a bit I finally decided on a scheme that creates a bounding interval for the unknown median value (one variable contains the upper bound and one contains the lower bound based on 2^32 possible values) and then adjusts this interval on each successive pass through the data. The last step is to average the upper and lower bound in case there are an even number of integers. Worst case, this approach requires O(log n) passes through the data, so even for trillions of numbers it’s fairly quick.
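
One way such a scheme could look, as a sketch under assumptions: `read_pass` is a hypothetical callable that returns a fresh sequential iterator over the signed 32-bit integers each time it is called, and only a few counters are kept in memory. This is a reconstruction for illustration, not the code from the interview. In this version the number of passes is bounded by the bit width of the values (about 32 per selected rank).

```python
def median_by_bisection(read_pass):
    """Median of a stream that can only be re-read sequentially."""
    n = sum(1 for _ in read_pass())
    if n == 0:
        raise ValueError("empty input")

    def kth(k):
        """k-th smallest value (zero-based) by bisecting the value range."""
        lo, hi = -2**31, 2**31 - 1
        while lo < hi:
            mid = (lo + hi) // 2
            # One full pass: how many values are at or below the midpoint?
            at_or_below = sum(1 for x in read_pass() if x <= mid)
            if at_or_below > k:
                hi = mid       # the answer is mid or smaller
            else:
                lo = mid + 1   # the answer is strictly larger than mid
        return lo

    if n % 2 == 1:
        return kth(n // 2)
    # Even count: average the two middle values.
    return (kth(n // 2 - 1) + kth(n // 2)) / 2


data = [5, -3, 9, 1, 1]
print(median_by_bisection(lambda: iter(data)))
```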

I wrapped up the solution right at the time limit, and my code ran fine on the test cases. Was decently proud of myself for getting a solution in the allotted time.

Well, the interview feedback arrived, and it turns out my solution was rejected for being suboptimal. Apparently there is a more efficient approach that utilizes priority heaps. After looking up and reading about the priority heap approach, all I can say is that I didn’t realize the interview task was to re-implement someone’s PhD thesis in 30 minutes...

I had never used leetcode before because I never had difficulty with prior coding interviews (my last job search was many years before the 2022 layoffs), but after this interview, I immediately signed up for a subscription. And of course the “median file integer” question I received is one of the most asked questions on the list of “hard” problems.

jagged-chisel 1 year ago |

It's quicksort with a modification to select the median during the process. I feel like this is a good way to approach lots of "find $THING in list" questions.

mnw21cam 1 year ago | |

It's quicksort, but neglecting a load of the work that quicksort would normally have to do. Instead of recursing twice, leading to O(nlogn) behaviour, it's only recursing once.

KMag 1 year ago | | |

I used to ask how to find the 10th percentile value from an arbitrarily ordered list as an interview question. Most candidates suggested sorting, and then I'd ask if they could do better. If they got stuck, I'd ask them which sorting algorithm they'd suggest. If they suggested quicksort, then I could gently guide them down optimizing quicksort to quickselect. Most candidates made the mistake of believing getting rid of half the work at every division results in half the work overall. They realized it was significantly faster, but usually didn't realize it was O(N) expected time.

If we had time, I'd ask about the worst-case scenario, and see if they could optimize heapsort to heapselect. Good candidates could suggest starting out with quickselect optimistically and switching to heapselect if the number of recursions exceeded some constant times the number of expected recursions.
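
For readers who have not seen heapselect, here is a sketch of one common variant: keep a bounded max-heap holding the k+1 smallest values seen so far. It assumes numeric values (they are negated to get a max-heap out of `heapq`), and the function name is chosen for this example. It runs in O(n log k) worst case, with no bad pivots to worry about.

```python
import heapq


def heapselect(values, k):
    """Return the k-th smallest value (zero-based) of a list of numbers."""
    if not 0 <= k < len(values):
        raise IndexError("k out of range")
    # heapq is a min-heap, so store negated values to track the maximum.
    heap = [-v for v in values[:k + 1]]
    heapq.heapify(heap)
    for v in values[k + 1:]:
        if -v > heap[0]:  # v is smaller than the largest value kept so far
            heapq.heapreplace(heap, -v)
    return -heap[0]


timings = list(range(100, 0, -1))
print(heapselect(timings, len(timings) // 10 - 1))  # a 10th-percentile pick
```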

If they knew about median-of-medians, they could probably just suggest introselect at the start, and move on to another question.

someplaceguy 1 year ago |

I found this part of the code quite funny:

    # If there are < 5 items, just return the median
    if len(l) < 5:
        # In this case, we fall back on the first median function we wrote.
        # Since we only run this on a list of 5 or fewer items, it doesn't
        # depend on the length of the input and can be considered constant
        # time.
        return nlogn_median(l)

Hell, why not just use 2^140 instead of 5 as the cut-off point, then? This way you'd have constant time median finding for all arrays that can be represented in any real-world computer! :) [1]

[1] According to https://hbfs.wordpress.com/2009/02/10/to-boil-the-oceans/

ignoramous 1 year ago |

If an approximation is enough, the p2 quantile estimator (O(1) memory) is pretty neat: https://news.ycombinator.com/item?id=25201093

saagarjha 1 year ago |

This is hinted at in the post but if you're using C++ you will typically have access to quickselect via std::nth_element. I've replaced many a sort with that in code review :) (Well, not many. But at least a handful.)

conradludgate 1 year ago | |

Same with rust, there's the `select_nth_unstable` family on slices that will do this for you. It uses a more fancy pivot choosing algorithm but will fall back to median-of-medians if it detects it's taking too long

chpatrick 1 year ago |

Another nice one is O(1) weighted sampling (after O(n) preprocessing).

https://en.wikipedia.org/wiki/Alias_method
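
A compact sketch of the alias method (in the style usually attributed to Vose), for anyone who wants to see the O(n) table construction and O(1) draw side by side. The function names `build_alias` and `alias_sample` are invented for this example, and weights are assumed to be non-negative with a positive sum.

```python
import random


def build_alias(weights):
    """O(n) preprocessing: returns the (prob, alias) tables."""
    n = len(weights)
    total = sum(weights)
    scaled = [w * n / total for w in weights]
    prob = [0.0] * n
    alias = [0] * n
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s, g = small.pop(), large.pop()
        prob[s] = scaled[s]          # keep s with this probability
        alias[s] = g                 # otherwise fall through to g
        scaled[g] += scaled[s] - 1.0  # g donated the rest of s's column
        (small if scaled[g] < 1.0 else large).append(g)
    for i in large + small:          # leftovers are (numerically) exactly 1
        prob[i] = 1.0
    return prob, alias


def alias_sample(prob, alias, rng=random):
    """O(1) draw: pick a column uniformly, then flip one biased coin."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]


prob, alias = build_alias([1, 3])
rng = random.Random(0)
draws = [alias_sample(prob, alias, rng) for _ in range(20_000)]
print(draws.count(1) / len(draws))  # should be close to 0.75
```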

melonmouse 1 year ago |

The linked proof that median of medians is O(n) feels counterintuitive to me. Here's a (simpler?) alternative.

  T(0) = 0
  T(1) = 1
  T(n) = n + T(n/5) + T(7/10*n)

We want to prove that:

  T(n) ≤ C*n

It is intuitive that T(a+b) ≥ T(a) + T(b), or in other words, T is superadditive. That can be shown by induction:

Induction base: it holds for all a+b < 1, the only case being a=0, b=0:

  T(0+0) = 0 + T(0) + T(0) ≥ T(0) + T(0)

Induction step: suppose it holds for all a+b < k. Let a+b = k.

  T(a+b) = T(k)
         = k + T(k/5) + T(7/10*k)
         ≥ k + T(a/5) + T(b/5) + T(7/10*a) + T(7/10*b)
         = [a + T(a/5) + T(7/10*a)] + [b + T(b/5) + T(7/10*b)]
         = T(a) + T(b)

Because T is superadditive:

  T(n) = n + T(n/5) + T(7/10*n)
       ≤ n + T(n/5 + 7/10*n)
       = n + T(9/10*n)

Now we can apply the master theorem. Or to write out the proof (using a geometric series):

  T(n) ≤ n + T(9/10*n)
       ≤ n * ∑_{i=0}^{∞} (9/10)^i
       = n * 1/(1-9/10)
       = 10*n

So, we have shown the algorithm is O(n) with C=10 (or less).

beyondCritics 1 year ago | |

I like the idea of using superadditivity, but in a proof you cannot creatively extend T to the reals; this should be fixed.

Here is the slightly mopped-up proof I had in mind when I posted my hints below:

  Let r >= 1 and 0 < a(i) for all 1 <= i <= r, and 1/a(1) + ... + 1/a(r) =: s < 1.
  Then a(i) > 1 for all 1 <= i <= r.

  Let c > 0 and
  T(0) := 0
  T(n) := c * n + T(floor(n/a(1))) + ... + T(floor(n/a(r)))

  Then T(n) <= b * n for all n with b := c/(1-s) > 0 !
  Proof by induction: 
  "n=0" : 
   The statement holds trivially.

  "k->n": 
   Let n>=1 and assume the statement holds for all 0<=k<n. 
   Now since a(i)>1 we have floor(n/a(i)) <= n/a(i) < n. By the induction hypothesis therefore
   T(floor(n/a(i))) <= b * floor(n/a(i)) <= b * n/a(i). 
   Apply this to get:
   T(n) =  c * n + T(floor(n/a(1))) + ... + T(floor(n/a(r)))
        <= c * n + b * n/a(1) + ... +  b * n/a(r)
        = (c + b*s) * n
        = b * n.
   Hence T(n) <= b * n.

hammeiam 1 year ago |

The "Split the array into subarrays of length 5, now sorting all of the arrays is O(n) instead of O(n log n)" feels like cheating to me

marcosdumay 1 year ago | |

O(n log 5) is O(n). There's no cheating, sorting small arrays in a list is a completely different problem from sorting a large array.

tptacek 1 year ago | |

They're not sorting all the arrays?

Later

(i was going to delete this comment, but for posterity, i misread --- sorting the lists, not the contents of the list, sure)

IncreasePosts 1 year ago | |

It would only be cheating if you could merge the arrays in O(1), which you can't.

hammeiam 1 year ago | | |

ahh this is the insight I was missing, thank you!

Sharlin 1 year ago | |

It’s unambiguously O(n), there’s no lg n anywhere to be seen. It may be O(n) with a bit larger constant factor, but the whole point of big-O analysis is that those don’t matter.

pfortuny 1 year ago | |

Actually lots of algorithms "feel" like cheating until you understand what you were not looking at (fast matrix multiplication, fast fourier transforms...).

Someone 1 year ago |

FTA:

“Proof of Average O(n)

On average, the pivot will split the list into 2 approximately equal-sized pieces. Therefore, each subsequent recursion operates on 1⁄2 the data of the previous step.”

That “therefore” doesn’t follow, so this is more an intuition than a proof. The problem with it is that the median is more likely to end up in the larger of the two pieces, so you more likely have to recurse on the larger part than on the smaller part.

What saves you is that O(n) doesn’t say anything about constants.

Also, I would think you can improve things a bit for real world data by, on subsequent iterations, using the average of the set as pivot (You can compute that for both pieces on the fly while doing the splitting. The average may not be in the set of items, but that doesn’t matter for this algorithm). Is that true?

sfpotter 1 year ago |

A nice way to approximate the median: https://www.stat.berkeley.edu/~ryantibs/papers/median.pdf

RcouF1uZ4gsC 1 year ago |

> The C++ standard library uses an algorithm called introselect which utilizes a combination of heapselect and quickselect and has an O(nlogn) bound.

Introselect is a combination of Quickselect and Median of Medians and is O(n) worst case.

Tarean 1 year ago |

Love this algorithm. It feels like magic, and then it feels obvious and basically like binary search.

Similar to the algorithm to parallelize the merge step of merge sort. Split the two sorted sequences into four sequences so that `merge(left[0:leftSplit], right[0:rightSplit])+merge(left[leftSplit:], right[rightSplit:])` is sorted. leftSplit+rightSplit should be half the total length, and the elements in the left partition must be <= the elements in the right partition.
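
A sketch of how that split point could be found with a binary search over `leftSplit`, assuming both inputs are already sorted lists. The function names are made up for this example, and `heapq.merge` stands in for the two merges that would run in parallel.

```python
from heapq import merge


def merge_split(left, right):
    """Return (leftSplit, rightSplit) with leftSplit + rightSplit == half."""
    half = (len(left) + len(right)) // 2
    lo = max(0, half - len(right))
    hi = min(half, len(left))
    while lo < hi:
        ls = (lo + hi) // 2
        rs = half - ls
        # right[rs - 1] would land in the lower part but is bigger than
        # left[ls] in the upper part: take more elements from left.
        if right[rs - 1] > left[ls]:
            lo = ls + 1
        else:
            hi = ls
    return lo, half - lo


def split_merge(left, right):
    """Merge via two independent halves (these could run in parallel)."""
    ls, rs = merge_split(left, right)
    lower = list(merge(left[:ls], right[:rs]))
    upper = list(merge(left[ls:], right[rs:]))
    return lower + upper


print(split_merge([1, 3, 5, 7], [2, 4, 6, 8, 10]))
```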

Seems impossible, and then you think about it and it's just binary search.

teo_zero 1 year ago |

> On average, the pivot will split the list into 2 approximately equal-sized pieces.

Where does this come from?

Even assuming a perfect random function, this would be true only for distributions that show some symmetry. But if the input is all 10s and one 5, each step will generate quite different-sized pieces!

paldepind2 1 year ago | |

I think you answered your own question. It's the standard average-time analysis of Quicksort and the (unmentioned) assumption is that the numbers are from some uniform distribution.

Why would the distribution have to be symmetric? My intuition is that if you sample n numbers from some distribution (even if it's skewed) and pick a random number among the n numbers, then on average that number would separate the numbers into two equal-sized sets. Are you saying that is wrong?

teo_zero 1 year ago | | |

With real numbers, I have the same intuition. But with integers, where 2 or more elements can be exactly the same, and with the two sets defined as they are defined in TFA, that is one "less than" and one "greater or equal", then I'd argue that the second set will be bigger than the former.

In the pathological case where all the elements are the same value, one set will always be empty and the algorithm will not even terminate.

In a less extreme case where nearly all the items are the same except a few ones, then the algorithm will slowly advance, but not with the progression n, n/2, n/4, etc. that is needed to prove it's O(n).

Please note that the "less extreme case" I depicted above is quite common in significant real-world statistics. For example, how many times a site is visited by unique users per day: a long sequence of 1s with some sparse numbers>1. Or how many children/cars/pets per family: many repeated small numbers with a few sparse outliers. Etc.
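
If the partition really is two-way as described in this comment, one common remedy is a three-way split that sets aside everything equal to the pivot. The sketch below is an independent illustration (iterative, with invented names), not the article's code; with it, an input of all-equal values returns after a single pass.

```python
import random


def quickselect(values, k, rng=random):
    """k-th smallest (zero-based), splitting into <, == and > the pivot."""
    values = list(values)
    if not 0 <= k < len(values):
        raise IndexError("k out of range")
    while True:
        pivot = rng.choice(values)
        lows = [v for v in values if v < pivot]
        highs = [v for v in values if v > pivot]
        equal_count = len(values) - len(lows) - len(highs)
        if k < len(lows):
            values = lows
        elif k < len(lows) + equal_count:
            return pivot  # k falls inside the run of duplicates
        else:
            k -= len(lows) + equal_count
            values = highs


visits = [10] * 1000 + [5]
print(quickselect(visits, len(visits) // 2))
```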

runiq 1 year ago |

Why is it okay to drop not-full chunks? The article doesn't explain that and I'm stupid.

Edit: I just realized that the function where non-full chunks are dropped is just the one for finding the pivot, not the one for finding the median. I understand now.

ValleZ 1 year ago |

I was asked to invent this algorithm on a whiteboard in 30 minutes. Loved it.

beyondCritics 1 year ago |

> It’s not straightforward to prove why this is O(n).

Replace T(n/5) with T(floor(n/5)) and T(7n/10) with T(floor(7n/10)) and show by induction that T(n) <= 10n for all n.
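
The bound can also be checked numerically before attempting the induction. This sketch evaluates the floored recurrence with T(0) = 0 and a per-call cost of exactly n, which is an assumption about the constant; the function names and the range tested are arbitrary choices.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def T(n):
    """T(n) = n + T(floor(n/5)) + T(floor(7n/10)), with T(0) = 0."""
    if n == 0:
        return 0
    return n + T(n // 5) + T(7 * n // 10)


def max_ratio(limit):
    """Largest value of T(n)/n over 1 <= n <= limit."""
    return max(T(n) / n for n in range(1, limit + 1))


print(max_ratio(100_000))  # stays at or below 10
```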

kccqzy 1 year ago |

> Quickselect gets us linear performance, but only in the average case. What if we aren’t happy to be average, but instead want to guarantee that our algorithm is linear time, no matter what?

I don't agree with the need for this guarantee. Note that the article already says the pivot is selected at random. You can simply choose a very good random function to avoid an attacker crafting an input that needs quadratic time. You'll never be unlucky enough for this to be a problem. This is basically the same kind of mindset that leads people into thinking, what if I use SHA256 to hash these two different strings to get the same hash?

mitthrowaway2 1 year ago | |

It's a very important guarantee for use in real-time signal processing applications.

forrestthewoods 1 year ago | |

> I don't agree with the need for this guarantee.

You don’t get to agree with it or not. It depends on the project! Clearly there exist some projects in the world where it’s important.

But honestly it doesn’t matter. Because, as the article shows with random data, median-of-medians is strictly better than random pivot. So even if you don’t need the requirement there is zero loss to achieve it.

kccqzy 1 year ago | | |

The median-of-median comes at a cost for execution time. Chances are, sorting each five-element chunk is a lot slower than even running a sophisticated random number generator.

zelphirkalt 1 year ago |

You can simply pass once over the data, and while you do that, count occurrences of the elements, memorizing the last maximum. Whenever an element is counted, you check, if that count is now higher than the previous maximum. If it is, you memorize the element and its count as the maximum, of course. Very simple approach and linear in time, with minimal book keeping on the way (only the median element and the count (previous max)).

I don't find it surprising or special at all, that finding the median works in linear time, since even this ad-hoc thought of way is in linear time.

EDIT: Ah right, I mixed up mode and median. My bad.

gcr 1 year ago | |

This finds the mode (most common element), not the median.

Wouldn't you also need to keep track of all element counts with your approach? You can't keep the count of only the second-most-common element because you don't know what that is yet.

zelphirkalt 1 year ago | | |

Yes, you are right. I mixed up mode and median.

And yes, one would need to keep track of at least a key for each element (not a huge element, if they are somehow huge). But that would be about space complexity.

vismit2000 1 year ago |

This is covered in section 9.3 in CLRS book - Medians and Order Statistics

SkiFire13 1 year ago |

I wonder what's the reason of picking groups of 5 elements instead of 2 or 8.

danlark 1 year ago | |

With groups of 3 or 4 elements, the standard proof that the complexity is linear fails.

You still can do 3 or 4 but with slight modifications

https://arxiv.org/abs/1409.3600

For example, for 4 elements, it's advised to take lower median for the first half and upper median for the second half. Then the complexity will be linear

lalaland1125 1 year ago | |

1. You want an odd number so the median is the middle element of the sublist.

2. One and three are probably too small

nilslindemann 1 year ago |

"ns" instead of "l" and "n" instead of "el" would have been my choice (seen in Haskell code).

robinhouston 1 year ago | |

The trouble with using this convention (which I also like) in Python code is that sooner or later one wants to name a pair of lists 'as' and 'bs', which then causes a syntax error because 'as' is a keyword in Python. There is a similar problem with 'is' and 'js'.
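
The clash is easy to check with the standard `keyword` module. The helper name below is invented for this example.

```python
import keyword


def keyword_clashes(names):
    """Return the candidate variable names that are reserved words."""
    return [name for name in names if keyword.iskeyword(name)]


print(keyword_clashes(["ns", "as", "bs", "is", "js"]))
```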

nilslindemann 1 year ago | | |

Sure, naming is hard, but avoid "l", "I", "O", "o".

Very short variable names (including "ns" and "n") are always some kind of disturbance when reading code, especially when the variable lasts longer than one screen of code – one has to memorize the meaning. They sometimes have a point, e.g. in mathematical code like this one. But variables like "l" and "O" are bad for a further reason, as they can not easily be distinguished from the numbers. See also the Python style guide: https://peps.python.org/pep-0008/#names-to-avoid