Edit: I just realized that the function where non-full chunks are dropped is just the one for finding the pivot, not the one for finding the median. I understand now.
Replace T(n/5) with T(floor(n/5)) and T(7n/10) with T(floor(7n/10)) and show by induction that T(n) <= 10n for all n.
I don't agree with the need for this guarantee. Note that the article already says the selection of the pivot is by random. You can simply choose a very good random function to avoid an attacker crafting an input that needs quadratic time. You'll never be unlucky enough for this to be a problem. This is basically the same kind of mindset that leads people into thinking, what if I use SHA256 to hash these two different strings to get the same hash?
You don’t get to agree with it or not. It depends on the project! Clearly there exist some projects in the world where it’s important.
But honestly it doesn’t matter. Because as the article shows with random data that median-of-medians is strictly better than random pivot. So even if you don’t need the requirement there is zero loss to achieve it.
I don't find it surprising or special at all, that finding the median works in linear time, since even this ad-hoc thought of way is in linear time.
EDIT: Ah right, I mixed up mode and median. My bad.
Wouldn't you also need to keep track of all element counts with your approach? You can't keep the count of only the second-most-common element because you don't know what that is yet.
And yes, one would need to keep track of at least a key for each element (not a huge element, if they are somehow huge). But that would be about space complexity.
You still can do 3 or 4 but with slight modifications
https://arxiv.org/abs/1409.3600
For example, for 4 elements, it's advised to take lower median for the first half and upper median for the second half. Then the complexity will be linear
2. One and three are probably too small
Very short variable names (including "ns" and "n") are always some kind of disturbance when reading code, especially when the variable lasts longer than one screen of code – one has to memorize the meaning. They sometimes have a point, e.g. in mathematical code like this one. But variables like "l" and "O" are bad for a further reason, as they can not easily be distinguished from the numbers. See also the Python style guide: https://peps.python.org/pep-0008/#names-to-avoid
https://danlark.org/2020/11/11/miniselect-practical-and-gene...
It was a struggle until I figured out that knowledge of the precision and range of our data helped. These were timings, expressed in integer milliseconds. So they were non-negative, and I knew the 90th percentile was well under a second.
As the article mentions, finding a median typically involves something akin to sorting. With the above knowledge, bucket sort becomes available, with a slight tweak in my case. Even if the samples were floating point, the same approach could be used as long as an integer (or even fixed point) approximation that is very close to the true median is good enough, again assuming a known, relatively small range.
The idea is to build a dictionary where the keys are the timings in integer milliseconds and the values are a count of the keys' appearance in the data, i.e., a histogram of timings. The maximum timing isn't known, so to ensure the size of the dictionary doesn't get out of control, use the knowledge that the 90th percentile is well under a second and count everything over, say, 999ms in the 999ms bin. Then the dictionary will be limited to 2000 integers (keys in the range 0-999 and corresponding values) - this is the part that is different from an ordinary bucket sort. All of that is trivial to do in a single pass, even when distributed with MapReduce. Then it's easy to get the median from that dictionary / histogram.
(I made the number 10,000 up, but you could do some statistics to figure out how many samples would be needed for a given level of confidence, and I don't think it would be prohibitively large.)
Uniform sampling also wasn't obviously simple, at least to me. There were thousands of log files involved, coming from hundreds of computers. Any single log file only had timings from a single computer. What kind of bias would be introduced by different approaches to distributing those log files to a cluster for the median calculation? Once the solution outlined in the previous comment was identified, that seemed simpler that trying to understand if we were talking about 49-51% or 40-50%. And if it was too big a margin, restructuring our infra to allow different log file distribution algorithms would have been far more complicated.
Do you have a source for that claim?
I don't see how could that possibly be true... For example, if your original points are sampled from two gaussians of centers -100 and 100, of small but slightly different variance, then the true median can be anywhere between the two centers, and you may need a humungous number of samples to get anywhere close to it.
True, in that case any point between say -90 and 90 would be equally good as a median in most applications. But this does not mean that the median can be found accurately by your method.
In all use-cases I've seen a close estimate of the median was enough.
But, you're right, I was lucky to work on a bunch of fun problems. That period, in particular, was pretty amazing. I was part of a fun, collaborative team working on hard problems. And management showed a lot of trust in us. We came up with some very interesting solutions, some by skill and some by luck, that set the foundation for years of growth that came after that (both revenue growth and technical platform growth).
He also gave a talk about his algorithm in 2016. He's an entertaining presenter, I highly recommended!
There's Treasure Everywhere - Andrei Alexandrescu
I'd recommend anyone who writes software listening and reading anything of Andrei's you can find; this one is indeed a Treasure!
I was chatting about this with a grad student friend who casually said something like "Sure, it's slow, but what really matters is that it proves that it's possible to do selection of an unsorted list in O(n) time. At one point, we didn't know whether that was even possible. Now that we do, we know there might an even faster linear algorithm." Really got into the philosophy of what Computer Science is about in the first place.
The lesson was so simple yet so profound that I nearly applied to grad school because of it. I have no idea if they even recall the conversation, but it was a pivotal moment of my education.
Manuel Blum - Turing award winner in 1995
Robert Floyd - Turing award winner in 1978
Ron Rivest - Turing award winner in 2002
Bob Tarjan - Turing award winner in 1986 (oh and also the inaugural Nevanlinna prizewinner in 1982)
Vaughan Pratt - oh no, the only non-Turing award winner in the list. Oh right but he's emeritus faculty at Stanford, directed the SUN project before it became Sun Microsystems, was instrumental in Sun's early days (director of research and designer of the Sun logo!), and is responsible for all kinds of other awesome stuff (near and dear to me: Pratt certificates of primality).
Four independent Turing awards! SPARCstations! This paper has it all.
That's an impressive list of authors, for sure.
Pratt parsing (HN discussion: https://news.ycombinator.com/item?id=39066465), the "P" in the KMP algorithm.
return l[len(l) / 2]
I'm not a Python expert, but doesn't the `/` operator return a float in Python? Why would you use a float as an array index instead of doing integer division (with `//`)?I know this probably won't matter until you have extremely large arrays, but this is still quite a code smell.
Perhaps this could be forgiven if you're a Python novice and hadn't realized that the two different operators exist, but this is not the case here, as the article contains this even more baffling code which uses integer division in one branch but float division in the other:
def quickselect_median(l, pivot_fn=random.choice):
if len(l) % 2 == 1:
return quickselect(l, len(l) // 2, pivot_fn)
else:
return 0.5 * (quickselect(l, len(l) / 2 - 1, pivot_fn) +
quickselect(l, len(l) / 2, pivot_fn))
That we're 50 comments in and nobody seems to have noticed this only serves to reinforce my existing prejudice against the average Python code quality.> Technically, you could get extremely unlucky: at each step, you could pick the largest element as your pivot. Each step would only remove one element from the list and you’d actually have O(n2) performance instead of O(n)
If adversarial input is a concern, doing a O(n) shuffle of the data first guarantees this cannot happen. If the data is really too big to shuffle, then only shuffle once a bucket is small enough to be shuffled.
If you do shuffle, probabilities are here to guarantee that that worst case cannot happen. If anyone says that "technically" it can happen, I'll answer that then "technically" an attacker could also guess correctly every bit of your 256 bits private key.
Our world is build on probabilities: all our private keys are protected by the mathematical improbability that someone shall guess them correctly.
From what I read, a shuffle followed by quickselect is O(n) for all practical purposes.
It doesn't guarantee that you avoid the worst case, it just removes the possibility of forcing the worst case.
However I never managed to understand how it works.
https://en.m.wikipedia.org/wiki/Floyd%E2%80%93Rivest_algorit...
Instead, you can use a biased pivot as in [1] or something I call "j:th of k:th". Floyd-Rivest can also speed things up. I have a hobby project that gets 1.2-2.0x throughput when compared to a well implemented quickselect, see: https://github.com/koskinev/turboselect
If anyone has pointers to fast generic & in-place selection algorithms, I'm interested.
First I asked if anything could be assumed about the statistics on the distribution of the numbers. Nope, could be anything, except the numbers are 32-bit ints. After fiddling around for a bit I finally decided on a scheme that creates a bounding interval for the unknown median value (one variable contains the upper bound and one contains the lower bound based on 2^32 possible values) and then adjusts this interval on each successive pass through the data. The last step is to average the upper and lower bound in case there are an odd number of integers. Worst case, this approach requires O(log n) passes through the data, so even for trillions of numbers it’s fairly quick.
I wrapped up the solution right at the time limit, and my code ran fine on the test cases. Was decently proud of myself for getting a solution in the allotted time.
Well, the interview feedback arrived, and it turns out my solution was rejected for being suboptimal. Apparently there is a more efficient approach that utilizes priority heaps. After looking up and reading about the priority heap approach, all I can say is that I didn’t realize the interview task was to re-implement someone’s PhD thesis in 30 minutes...
I had never used leetcode before because I never had difficulty with prior coding interviews (my last job search was many years before the 2022 layoffs), but after this interview, I immediately signed up for a subscription. And of course the “median file integer” question I received is one of the most asked questions on the list of “hard” problems.
If we had time, I'd ask about the worst-case scenario, and see if they could optimize heapsort to heapselect. Good candidates could suggest starting out with selectsort optimistically and switching to heapselect if the number of recursions exceeded some constant times the number of expected recursions.
If they knew about median-of-medians, they could probably just suggest introselect at the start, and move on to another question.
# If there are < 5 items, just return the median
if len(l) < 5:
# In this case, we fall back on the first median function we wrote.
# Since we only run this on a list of 5 or fewer items, it doesn't
# depend on the length of the input and can be considered constant
# time.
return nlogn_median(l)
Hell, why not just use 2^140 instead of 5 as the cut-off point, then? This way you'd have constant time median finding for all arrays that can be represented in any real-world computer! :) [1][1] According to https://hbfs.wordpress.com/2009/02/10/to-boil-the-oceans/
T(0) = 0
T(1) = 1
T(n) = n + T(n/5) + T(7/10*n)
We want to prove that: T(n) ≤ C*n
It is intuitive that T(a+b) ≥ T(a) + T(b), or in other words, T is superadditive. That can be shown by induction:Induction base: it holds for all a+b < 1, the only case being a=0, b=0:
T(0+0) = 0 + T(0) + T(0) ≥ T(0) + T(0)
Induction step: suppose it holds for all a+b < k. Let a+b = k. T(a+b) = T(k)
= k + T(k/5) + T(7/10*k)
≥ k + T(a/5) + T(b/5) + T(7/10*a) + T(7/10*b)
= [a + T(a/5) + T(7/10*a)] + [b + T(b/5) + T(7/10*b)]
= T(a) + T(b)
Because T is superadditive: T(n) = n + T(n/5) + T(7/10*n)
≤ n + T(n/5 + 7/10*n)
= n + T(9/10*n)
Now we can apply the master theorem. Or to write out the proof (using a geometric series): T(n) ≤ n + T(9/10*n)
≤ n * ∑ᵢ₌₀ᶦⁿᶠᶦⁿᶦᵗʸ (9/10)^i
= n * 1/(1-9/10)
= 10*n
So, we have shown the algorithm is O(n) with C=10 (or less).Here is the slightly mopped up proof i had in mind, when i posted my hints below:
Let be r>=1 and 0<a(i) for all 1<=i<=r and 1/a(1) + ... + 1/a(n) =: s < 1.
Then a(i) > 1 for all 1 <= i <= r.
Let be c > 0 and
T(0) := 0
T(n) := c \* n + T(floor(n/a(1))) + ... + T(floor(n/a(r)))
Then T(n) <= b * n for all n with b := c/(1-s) > 0 !
Proof by induction:
"n=0" :
The statement holds trivially.
"k->n":
Let n>=1 and assume the statement holds for all 0<=k<n.
Now since a(i)>1 we have floor(n/a(i)) <= n/a(i) < n. By the induction hypothesis therefore
T(floor(n/a(i))) <= b * floor(n/a(i)) <= b * n/a(i).
Apply this to get:
T(n) = c * n + T(floor(n/a(1))) + ... + T(floor(n/a(r)))
<= c * n + b * n/a(1) + ... + b * n/a(r)
= (c + b*s) * n
= b * n.
Hence T(n) <= b * n.Later
(i was going to delete this comment, but for posterity, i misread --- sorting the lists, not the contents of the list, sure)
“Proof of Average O(n)
On average, the pivot will split the list into 2 approximately equal-sized pieces. Therefore, each subsequent recursion operates on 1⁄2 the data of the previous step.”
That “therefore” doesn’t follow, so this is more an intuition than a proof. The problem with it is that the medium is more likely to end up in the larger of the two pieces, so you more likely have to recurse on the larger part than on the smaller part.
What saves you is that O(n) doesn’t say anything about constants.
Also, I would think you can improve things a bit for real world data by, on subsequent iterations, using the average of the set as pivot (You can compute that for both pieces on the fly while doing the splitting. The average may not be in the set of items, but that doesn’t matter for this algorithm). Is that true?
Introselect is a combination of Quickselect and Median of Medians and is O(n) worst case.
Similar to the algorithm to parallelize the merge step of merge sort. Split the two sorted sequences into four sequences so that `merge(left[0:leftSplit], right[0:rightSplit])+merge(left[leftSplit:], right[rightSplit:])` is sorted. leftSplit+rightSplit should be halve the total length, and the elements in the left partition must be <= the elements in the right partition.
Seems impossible, and then you think about it and it's just binary search.
Where does this come from?
Even assuming a perfect random function, this would be true only for distributions that show some symmetry. But if the input is all 10s and one 5, each step will generate quite different-sized pieces!
Why would the distribution have to be symmetric? My intuition is that if you sample n numbers from some distribution (even if it's skewed) and pick a random number among the n numbers, then on average that number would be separate the number into two equal-sized sets. Are you saying that is wrong?
In the pathological case where all the elements are the same value, one set will always be empty and the algorithm will not even terminate.
In a less extreme case where nearly all the items are the same except a few ones, then the algorithm will slowly advance, but not with the progression n, n/2, n/4, etc. that is needed to prove it's O(n).
Please note that the "less extreme case" I depicted above is quite common in significant real-world statistics. For example, how many times a site is visited by unique users per day: a long sequence of 1s with some sparse numbers>1. Or how many children/cars/pets per family: many repeated small numbers with a few sparse outliers. Etc.
> there might be an even faster linear algorithm,
but
> it's possible to do selection of an unsorted list in O(n) time. At one point, we didn't know whether that was even possible.
For me, the moment of clarity was understanding that theoretical CS mainly cares about problems, not algorithms. Algorithms are tools to prove upper bounds on the complexity of problems. Lower bounds are equally important and cannot be proved by designing algorithms. We even see theorems of the form "there exists an O(whatever) algorithm for <problem>": the algorithm's existence can sometimes be proven non-constructively.
So if the median problem sat for a long time with a linear lower bound and superlinear upper bound, we might start to wonder if the problem has a superlinear lower bound, and spend our effort working on that instead. The existence of a linear-time algorithm immediately closes that path. The only remaining work is to tighten the constant factor. The community's effort can be focused.
A famous example is the linear programming problem. Klee and Minty proved an exponential worst case for the simplex algorithm, but not for linear programming itself. Later, Khachiyan proved that the ellipsoid algorithm was polynomial-time, but it had huge constant factors and was useless in practice. However, a few years later, Karmarkar gave an efficient polynomial-time algorithm. One can imagine how Khachiyan's work, although inefficient, could motivate a more intense focus on polynomial-time LP algorithms leading to Karmarkar's breakthrough.
I think we got a constant factor of 22 for this algorithm so maybe it was a related one or something.
Personally I would gravitate towards the quickselect algorithm described in the OP until I was forced to consider a streaming median approximation method.
You write the query and the UI knows you're querying metric xyz_inflight_requests, it runs a preflight check to get the cardinality of that metric, and gives you a prompt: "xyz_inflight_requests is a high-cardinality metric, this query may take some time - consider using estimated_median instead of median".
- quantile sketches, such as t-digest, which aim to control the quantile error or rank error. Apache DataSketches has several examples, https://datasketches.apache.org/docs/Quantiles/QuantilesOver...
- histograms, such as my hg64, or hdr histograms, or ddsketch. These control the value error, and are generally easier to understand and faster than quantile sketches. https://dotat.at/@/2022-10-12-histogram.html
You keep an estimate for the current quantile value, and then for each element in your stream, you either increment (if the element is greater than your estimate) or decrement (if the element is less than your estimate) by fixed "up -step" and "down-step" amounts. If your increment and decrement steps are equal, you should converge to the median. If you shift the ratio of increment and decrement steps you can estimate any quantile.
For example, say that your increment step is 0.05 and your decrement step is 0.95. When your estimate converges to a steady state, then you must be hitting greater values 95% of the time and lesser values 5% of the time, hence you've estimated the 95th percentile.
The tricky bit is choosing the overall scale of your steps. If your steps are very small relative to the scale of your values, it will converge very slowly and not track changes in the stream. You don't want your steps to be too large because they determine the precision. The FAME algorithm has a step where you shrink your step size when your data value is near your estimate (causing the step size to auto-scale down).
[1]: http://www.eng.tau.ac.il/~shavitt/courses/LargeG/streaming-m...
This one has a streaming variant.
It's actually hard to come up with something that cannot be sorted lexicographically. The best example I was able to find was big fractions. Though even then you could write them as continued fractions and sort those lexicographically (would be a bit trickier than strings).
Radix sort is awesome if k is small, N is huge and/or you are using a GPU. On a CPU, comparison based sorting is faster in most cases.
I evaluated various sorts for strings as part of my winning submission to https://easyperf.net/blog/2022/05/28/Performance-analysis-an... and found https://github.com/bingmann/parallel-string-sorting to be helpful. For a single core, the fastest implementation among those in parallel-string-sorting was a radix sort, so my submission included a radix sort based on that one.
The only other contender was multi-key quicksort, which is notably not a comparison sort (i.e. a general-purpose string comparison function is not used as a subroutine of multi-key quicksort). In either case, you end up operating on something like an array of structs containing a pointer to the string, an integer offset into the string, and a few cached bytes from the string, and in either case I don't really know what problems you expect to have if you're dealing with null-terminated strings.
A very similar similar radix sort is included in https://github.com/alichraghi/zort which includes some benchmarks, but I haven't done the work to have it work on strings or arbitrary structs.
But when you run into <algorithm> or <feels like algorithm>, the correct solution is to slow down, Google, then document your work as you go. In a real application log n may be insufficient. But coding interview exercises need tight constraints to fit the nature of the interview.
The whole problem was kind of miscommunicated, because the interviewer showed up 10 minutes late, picked a problem from a list, and the requirements for the problem were only revealed when I started going a direction the interviewer wasn’t looking for (“Oh, the file is actually read-only.” “Oh, each number in the file is an integer, not a float.”)
I get the notion of making the point out of principle, but it’s sort of like arguing on the phone with someone at a call center—it’s better to just cut your losses quickly and move on to the next option in the current market.
My own response would have been a variant on radix-sort. Keep an array of 256 counters, and make a pass counting all of the high bytes. Now you know the high byte of the median. Make another pass keeping a histogram of the second byte of all values that match the high byte. And so on.
This takes four passes and requires 256 x 8 byte counters plus incidentals.
In a single pass you can't get the exact answer.
What a bullshit task. I’m beginning to think this kind of interviewing should be banned. Seems to me it’s just an easy escape hatch for the interviewer/hiring manager when they want to discriminate based on prejudice.
Any halting program that runs on a real world computer is O(1), by definition.
Except that there is no such thing as "arbitrarily large storage", as my link in the parent comment explained: https://hbfs.wordpress.com/2009/02/10/to-boil-the-oceans/
So why would you want to talk about arbitrarily large input (where the input is an array that is stored in memory)?
As I understood, this big-O notation is intended to have some real-world usefulness, is it not? Care to elaborate what that usefulness is, exactly? Or is it just a purely fictional notion in the realm of ideas with no real-world application?
And if so, why bother studying it at all, except as a mathematical curiosity written in some mathematical pseudo-code rather than a programming or engineering challenge written in a real-world programming language?
Edit: s/pretending/intended/
In practice you might also want to use a O(n^2) algorithm like insertion sort under some threshold.
Sure, but the author didn't argue that the simpler algorithm would be faster for 5 items, which would indeed make sense.
Instead, the author argued that it's OK to use the simpler algorithm for less than 5 items because 5 is a constant and therefore the simpler algorithm runs in constant time, hence my point that you could use the same argument to say that 2^140 (or 2^256) could just as well be used as the cut-off point and similarly argue that the simpler algorithm runs in constant time for all arrays than can be represented on a real-world computer, therefore obviating the need for the more complex algorithm (which obviously makes no sense).
Since these remaining fractions combine multiplicatively, we actually care about the geometric mean of the remaining fraction of the array, which is e^[(integral of ln(x) dx from x=0.5 to x=1) / (1 - 0.5)], or about 73.5%.
Regardless, it forms a geometric series, which should converge to 1/(1-0.735) or about 3.77.
Regarding using the average as the pivot: the question is really what quantile would be equal to the mean for your distribution. Heavily skewed distributions would perform pretty badly. It would perform particularly badly on 0.01*np.arange(1, 100) -- for each partitioning step, the mean would land between the first element and the second element.
Only in the first iteration. There’s a good chance it will be in the smaller one in the second iteration, for example.
So, your analysis is a bit too harsh, but probably good enough for a proof that it’s O(n) on average.
> Heavily skewed distributions would perform pretty badly
That’s why I used the weasel worlds “real world data” ;-)
I also thought about mentioning that skew can be computed streaming (see for example https://www.boost.org/doc/libs/1_53_0/doc/html/accumulators/...), but even if you have that, there still are distributions that will perform badly.
I totally respect that—don't let me pressure you any further.
Yeah, sounds like you had the ideal work setup: cool problem requiring cool technical solutions with good management. That's wonderful!
https://github.com/tdunning/t-digest https://www.sciencedirect.com/science/article/pii/S266596382... https://arxiv.org/pdf/2102.09299
And, as I mentioned else-thread, exponential histograms are the best choice in almost all practical situations.
> The Space required to store the elements in Heap is O(n).
I don't think this algorithm is suitable for trillions of items.
The n log n approach definitely works though.
---
Had some "lucky" candidate stumbled upon an optimization you had never seen before, would you recognize it? If so, would you be honest and let them keep their discovery? After all, this isn't work for hire ...
Moving on. Did these interviews reflect the day-to-day work these software engineers would be performing if they were accepted? I guarantee your business isn't going to recoup millions of dollars because someone, in their day-to-day, hit on an optimation of an existing algorithm. Nor is it likely they'll be discovering wonderful new money-saving algorithms for your business.
If you're a pure research lab, employing PhD candidates and PhDs, maybe this kind of interview does indicate the required skills.
Among other things, I used to work on Google's indexing system, and also high volume risk calculations and market data for a large multinational bank, and now high volume trading signals for a hedge fund.
For instance, corporate clients of banks sometimes need some insurance for some scenario, but if you can narrow down that insurance to exactly what they need, you can offer them that insurance cheaper than competitors. These structured products/exotic options can be difficult to model. For instance, say an Australian life insurance provider is selling insurance in Japan, getting paid in JPY, and doing their accounting in AUD. They might want insurance against shifts in the Japanese mortality curve (Japanese dying faster than expected) over the next 30 years, but they only need you to cover 100% of their losses over 100 million AUD. You run the numbers, you sell them this insurance at a set price in AUD for the next 30 years, and you do your accounting in USD. (The accounting currency (numerare) is relevant.) There's basically nobody who would be willing to buy these contracts off of you, so to a first-order approximation, you're on the hook for these products for the next 30 years.
If you can offer higher fidelity modeling, you can offer cheaper insurance than competitors. If you re-calculate the risk across your entire multinational bank daily, you can safely do more business by better managing your risk exposures.
Daily re-calculations of risk for some structured end up cutting into the profit margins by double-digit percentages. Getting the algorithms correct, minimizing the amount of data that needs to be re-fetched, and maximizing re-use of partial results can make the difference of several million dollars in compute cost per year for just a handful of clients, and determines if the return-on-investment justifies keeping an extra structurer or two on the desk.
In order to properly manage your risk, determine what other trades all of your business are able to trade every day, etc., you need to calculate your risk exposure every day. This is basically the first partial derivatives of the value of the structured product with respect to potentially hundreds of different factors. Every day, you take current FX futures and forward contracts in order to try and estimate JPY/AUD and AUD/USD exchange rates for the next 30 years. You also make 30-year projections on the Japanese mortality curve, and use credit index prices to estimate the probability the client goes out of business (counterparty risk) for the next 30 years. Obviously, you do your best to minimize re-calculation for the 30 years, to incrementally update the simulations as the inputs change rather than re-calculating from scratch.
For the next 30 years, shifts in the Japanese mortality curves for the structured product desk affects the amount of trading in Japanese Yen and Australian Dollars that the FX desk can make, how much the Japanese equities desk needs to hedge their FX exposure, etc. You could have fixed risk allocations to each desk, but ignoring potential offsetting exposures across desks means leaving money on the table.
You can't sell these trades to another party if you find your risk calculations are getting too expensive. You are stuck for 30 years, so your only option is to roll up your sleeves and really get optimizing. Either that, or you re-calculate your risk less often and add in a larger safety margin and do a bit less business across lots of different trading desks.
I started asking this as an interview question when I noticed a colleague had implemented several O(N*2) algorithms (and even one O(N*3)) that had O(N) alternatives. Analysis for a day of heavy equity trading went from 8 hours down to an hour once I replaced my colleague's O(N*2) algorithm with an O(N) algorithm.
The point of the exercise is mostly to see if they can properly analyze the effects of algorithm changes, in an environment where the right algorithm saves tens of millions of dollars per year.
Granted, it's a bit niche, but not really that niche. I've been doing this sort of stuff in several different industries.
In the article n was set to 5. All of those arrays (except maybe 1) have exactly 5 elements. There is no variance (and even if there was, it would be tiny, there is no point in talking about limits of 5-element sequences).
No, the code was:
# If there are < 5 items, just return the median
if len(l) < 5:
return nlogn_median(l)
> and even if there was, it would be tiny, there is no point in talking about limits of 5-element sequencesSo your point is: not all constants are created equal. Which circles all the way back to my original point that this argument is pretty funny :)
You can do dynamic sampling: e.g. take double the samples, see what decimal in your result budges. Adjust.
[1] Bahadur, R. R. (1966). A note on quantiles in large samples. Annals of Mathematical Statistics, 37, 577–580.
If you want the n-9s rate of failure (eg., n=5, 99.999) for a system with a power-law performance distribution, you could be waiting much more than a billion samples to see a failure.
eg., 3E10 ms (30 bn samples) in a year, at 5-9s failure rate has 3E3 ms of failure (3 thousand samples) -- which will be lower given a sampling strategy.
Two observations: 1. With sufficiently large n, you can control your variance to an acceptable level. 2. The factor f is outside of your control. If your problem has a small density around the median, then you'll need to throw more samples to compensate for it.
I think your concern is about #2: you can always construct a pathological distribution to make the sampling-based approach unattainable. The paper provided guidance on what to expect when you apply the sampling-based approach.
https://en.m.wikipedia.org/wiki/Non-uniform_random_variate_g...
That's a good point, though, that the 49th percentile and 51st percentile can be arbitrarily far from the median.
E.g., if you need to run a task on 10M inputs, then knowing that your algorithm is O(N) doesn't tell you anything at all about how long your task will take. It also doesn't tell you whether that algorithm will be faster than some other algorithm that's O(N^2).
But it does tell you that if your task size doubles to 20M inputs, you can expect the time required for the first algorithm to double, and the second to quadruple. And that knowledge isn't arcane or theoretical, it works on real-world hardware and is really useful for modeling how your code will run as inputs scale up.
So you're right that dividing into chunks of 5 and iterating over them doesn't affect the scaling - you wind up with an array of (N/5) medians, and it's still O(N) to iterate over that array. But the next step isn't to sort that array, you only need to quickSelect on it, and that's why the final step is O(N) rather than O(NlogN).
If the input is an array that is stored in a piece of real-world memory, then the only possible complexity is O(1). It's just how big-O works. Big-O notation is an abstraction that is much much closer to mathematics than to engineering.
> this big-O notation is pretending to have some real-world usefulness...
Big-O notation is not a person so I'm not sure whether it can pretend something. CS professors might exaggerate big-O notation's real-world usefulness so their students don't fall asleep too fast.
> fictional
Theoretical. Just like the other theoretical ideas, at best big-O notation gives some vague insights that help people solve real problems. It gives a very rough feeling about whether an algorithm is fast or not.
By the way, Turing machine is in this category as well.
I really want to know what a one-pass, low-memory solution looks like, lol.
This is assuming a deterministic algorithm; maybe a random algorithm could work with high probability and less memory?
Note also that there are other bounds of importance and each has trade-offs.
T-digest gives you a strict bound on memory use and no dynamic allocation. But it does not have guaranteed accuracy bounds. It gives very good accuracy in practice and is very good at relative errors (i.e. 99.999th percentile estimate is between the 99.9985%-ile and 99.9995%-ile)
KL-sketch gives you a strict bound on memory use, but is limited to absolute quantile error. (i.e. 99.99%-ile is between 99.9%-ile and 100%-ile. This is useless for extrema, but fine for medians)
Cormode's extension to KL-sketch gives you strict bound on relative accuracy, but n log n memory use.
Exponential histograms give you strict bounds on memory use, no allocation and strict bounds on relative error in measurement space (i.e. 99.99%-ile ± % error). See the log-histogram[1] for some simple code and hdrHistogram[2] for a widely used version. Variants of this are used in Prometheus.
The exponential histogram is, by far, the best choice in most practical situations since an answer that says 3.1 ± 0.2 seconds is just so much more understandable for humans than a bound on quantile error. I say this as the author of the t-digest.
[1] https://github.com/tdunning/t-digest/blob/main/core/src/main...
https://aakinshin.net/posts/greenwald-khanna-quantile-estima...
I've used online quantile estimators (GK in particular) to very effectively look for anomalies in streaming data.
It worked much better than the usual mean/stddev threshold approach (which was embedded in competing producsts), because it made no assumptions about the distribution of the data.
One thing to note is that GK is online, but not windowed, so it looks back to the very first value.
However this can be overcome by using multiple, possibly overlapping, summaries, to allow old values to eventually drop off.
That would mean it's possible to sort N random 64-bit integers in O(N+M) which is just O(N) with a constant factor of 9 (if taking the length in bytes) or 65 (if taking the length in bits), so sort billions of random integers in linear time, is that truly right?
EDIT: I think it does make sense, M is length*N, and in scenarios where this matters this length will be in the order of log(N) anyway so it's still NlogN-sh.
Yes. You can sort just about anything in linear time.
> EDIT: I think it does make sense, M is length*N, and in scenarios where this matters this length will be in the order of log(N) anyway so it's still NlogN-sh.
I mainly wrote N+M rather than M to be pedantically correct for degenerate inputs that consist of mostly empty strings. Regardless, if you consider the size of the whole input, it's linear in that.
Your comment makes me think it would be swell to add a fast path for when the input range compares equal for the current byte or bytes though. In general radix sorts have some computation to spare (on commonly used CPUs, more computation may be performed per element during the histogramming pass without spending any additional time). Some cutting-edge radix sorts spend this spare computation to look for sorted subsequences, etc.
> Some cutting-edge radix sorts
Where do you find these? :-)
As an example: computing the .min() and .max() of an enumerable is two passes even though it could be done with one pass.
I'd love to see a language embrace a more efficient style similar to how a SQL does it, where you can elegantly request this as a single pass over the data: "SELECT min(x), max(x) FROM y"
Why go to the store and back twice to buy two things instead of buying two things in one trip? ;p
You could implement a similar (and simpler) fast path for types with contiguous memory by performing min and max per iteration instead.
We were an infra-focused team building up a scalable data handling / computation platform in preparation to apply ML at a non-trivial scale. When we finally hired some people with deep stats knowledge 10 or 12 years ago, there was a lot of great learning for everyone about how each of our areas of expertise helped the others. I wish we'd found the right stats people earlier to help us understand things like this more deeply, more quickly.
In fact, this is often where decisions on what metrics to have in the first place come from. Ask why. You can go far without deep stats knowledge!
If you're testing on algorithm implementation and requirements gathering in thirty minutes, you're not testing for the skills you claim to be testing for. There's no way you're getting a good (let alone accurate) picture of the candidate's ability to gather requirements and implement those requirements, especially if your selection tactic is to deny them because they didn't get the PhD answer.
You're testing for how good of a minion this candidate will be.
> You're testing for how good of a minion this candidate will be.
I agreed with much of what you had to say up until this point. Well, and I'm not about 2-hour interviews either; that's too disruption to my work.
To me, a coding challenge is a conversation piece. I will see some skills like "can the candidate get a software project in their top-listed language on their cv running in less than an hour," and I do judge questions they ask or don't ask. But then, I don't just pull a leetcode challenge from the shelf. My favorite "coding challenge" isn't actually that challenging (which can send leetcode-grinders spiraling); it's about the journey and not the destination.
I'm not sure. My original C++ implementation which is for non-adversarial inputs puts the array containing histograms and indices on the heap but uses the stack for a handful of locals, so it would explode if you passed the wrong thing. The sort it's based on in the parallel-string-sorting repo works the same way. Oops!
> cache misses & memory pressure due to potentially long input elements (and thus correspondingly large stack)
I think this should be okay? The best of these sorts try to visit the actual strings infrequently. You'd achieve worst-case cache performance by passing a pretty large number of strings with a long, equal prefix. Ideally these strings would share the bottom ~16 bits of their address, so that they could evict each other from cache when you access them. See https://danluu.com/3c-conflict/
> Where do you find these? :-)
There's a list of cool sorts here https://github.com/scandum/quadsort?tab=readme-ov-file#varia... and they are often submitted to HN.
Otherwise - cool, thanks!
This is the item I want to respond to: https://news.ycombinator.com/item?id=19768492 When you took a Classical Mechanics course you were puzzled by the form of the Lagrangian: L = T - V
I have created a resource for the purpose of making application of calculus of variations in mechanics transparent. As part of that the form of the Lagrangian L=T-V is explained.
http://cleonis.nl/physics/phys256/calculus_variations.php
http://cleonis.nl/physics/phys256/energy_position_equation.p...
I recognize the 'you are certainly entitled to ask why' quote, it's from the book 'Classical Mechanics' by John Taylor.
Here's the thing: there is a good answer to the 'why' question. Once you know that answer things become transparent, and any wall is gone.
I'm aware your expectations may be low. Your thinking may be: if textbook authors such as John Taylor don't know the why, then why would some random dude know?
The thing is: this is the age of search machines on the internet; it's mindblowing how searcheable information is. I've combed, I got to put pieces of information together that hadn't been put together before, and things started rolling.
I'm stoked; that's why I'm reaching out to people.
I came across the ycombinator thread following up something that Jess Riedel had written.
- I mean this kindly, but this is getting a tad bit aggressive. Please let me glance at my own pace (or not glance, if I don't find the interest to), rather than periodically following up with me around the forum to make sure I look at what you wrote. I was much more interested in this problem at the time I came across it than I was in 2019, and I was similarly more interested in 2019 than I am now. During this time I have both (a) come across other explanations that I've found ~semi-satisfactory, and (b) gathered other things to occupy my brain with. While I do get some enjoyment from the topic, refreshing my understanding of analytical mechanics is really not my topmost interest or priority right now. It could easily be months or years before I become interested in the problem again to even think about it.
- I find it rather... jarring to see a sentence like "Summing bars of signed area, in the limit of subdividing into infinitely many bars: that is evaluating the integral" in the middle of an explanation about Lagrangians and the principle of stationary action. Yes... integration is the limit of addition; you should hope your reader knows that well by now. You wouldn't re-teach "multiplication is really just repeated addition" in the middle of a lecture about Fourier series; this feels equally out-of-place. Not only does it make it seem like you don't know your audience, it also wastes the reader's time, and makes it hard for them to find any gems you might have shared.
- I took a quick look at your site and I don't actually see you attempt to explain the rationale for why the quantity of interest is T - V (as opposed to, say, T + V) anywhere. You mention "you are looking for the sweet spot: the point where as the object is moving the rate of change of kinetic energy matches the rate of change of potential energy", but... am I? Really? I certainly wasn't looking for that, nor have any idea why you thought I would be looking for that. It almost seems to assume what you were trying to prove!
About the presentation: I think I agree: once I'm up to the level of discussing Lagrangians and stationary action I should not re-teach integration; the reader will be familiar with that.
That particular presentation grew over time; I agree it is uneven. I need to scrap a lot of it.
The preceding article http://cleonis.nl/physics/phys256/calculus_variations.php Is more an overarching concept.
Also, I'm active on the stackexchange physics forum. Over the years: Hamilton's stationary action is a recurring question subject. Some weeks ago I went back to the first time a stationary action question was posted, submitting an answer. In that answer: I aimed to work the exposition down to a minimum, presenting a continuous arch. https://physics.stackexchange.com/a/821469/17198
three sections:
1. Work-Energy theorem
2. The central equation of the work 'Mécanique Analytique' by Joseph Louis Lagrange (I discuss _why_ that equation obtains.)
3. Hamilton's stationary action
It's a tricky situation. I'm not assuming the thing I present derivation of, but I can see how it may appear that way.