Nearly all binary searches and mergesorts are broken (2006) (ai.googleblog.com)

163 points by finnlab 3 years ago | 195 comments

dataflow 3 years ago |

Fun fact, there are some other lessons here: it can sometimes pay off to (1) generalize your function, and (2) respect the mathematical axioms you're supposed to be following. This (obviously) isn't to say you should always generalize everything, but you should at least consider what would happen if you did so, and if the difference is small, perhaps do it. The benefit of doing so being that it can avoid problems that aren't otherwise obvious—sometimes by design, sometimes by accident.

In particular, (x + y) / 2 is the wrong implementation of midpoint in general, because it would fail to even compile on objects you can't add together. But midpoint is well-defined on anything you can subtract (i.e. anything you can define a consistent distance function for)—and it doesn't require addition to be well-defined between those objects!

One obvious (in C/C++, and not-so-obvious in Java) counterexample here is pointers/iterators. You can subtract them, but not add them. And, in fact, if you implement midpoint in a manner that generalizes to those and respects the intrinsic constraints of the problem, you end up with the same x + (y - x) / 2 implementation, which doesn't have this bug.
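A minimal C sketch of the pointer case, assuming both pointers refer to the same array and `lo <= hi`. The name `mid_ptr` is made up for illustration.

```c
#include <stddef.h>

/* Midpoint of the half-open range [lo, hi) within one array.
 * `lo + hi` would not compile: C has no pointer + pointer.
 * `hi - lo` is a ptrdiff_t, and pointer + integer is well defined. */
static const int *mid_ptr(const int *lo, const int *hi)
{
    return lo + (hi - lo) / 2;
}
```

Because the only arithmetic is a difference and an offset, there is no intermediate value that can exceed the range the pointers already span.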

europeanguy 3 years ago | |

Interesting. Another example is datetimes. You can't add datetimes. You can add a datetime and a time delta, and the difference of two datetimes is a timedelta.

I guess in maths this is called a generating Lie algebra (maybe someone can comment on this?)

maxiepoo 3 years ago | | |

I think the concept you are looking for is a ["torsor"](https://en.wikipedia.org/wiki/Principal_homogeneous_space).

Basically,

1. You have a 0 time delta, and you can add and subtract them satisfying some natural equations. (time deltas form a group)

2. You can add time deltas to a datetime to get a new datetime, and this satisfies some natural equations relating to adding time deltas to each other (time deltas act on datetimes).

3. You can subtract two datetimes to get a time delta satisfying some more natural equations (the action is free and transitive).
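The three rules above can be sketched in C with two distinct types, so that adding two datetimes is simply not expressible. `Instant`, `Span`, and the function names are hypothetical, chosen only for this illustration.

```c
#include <stdint.h>

/* Hypothetical types: a point in time and a signed distance between two points. */
typedef struct { int64_t unix_seconds; } Instant;
typedef struct { int64_t seconds; } Span;

/* Rule 3: instant - instant = span. */
static Span instant_diff(Instant later, Instant earlier)
{
    return (Span){ later.unix_seconds - earlier.unix_seconds };
}

/* Rule 2: instant + span = instant. */
static Instant instant_shift(Instant t, Span d)
{
    return (Instant){ t.unix_seconds + d.seconds };
}

/* Halving a span is meaningful; halving an instant is not. */
static Span span_half(Span d)
{
    return (Span){ d.seconds / 2 };
}

/* There is deliberately no instant + instant. The midpoint only needs
 * the operations above, which is the a + (b - a) / 2 shape again. */
static Instant instant_midpoint(Instant a, Instant b)
{
    return instant_shift(a, span_half(instant_diff(b, a)));
}
```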

enriquto 3 years ago | | |

> I guess in maths this is called a generating Lie algebra

This is often called an affine structure.

zeroonetwothree 3 years ago | |

Not all metric spaces have midpoints (or unique midpoints) so it’s not true you can compute a midpoint any time you have a distance function (you are right you can define it but that’s kind of useless computationally since it doesn’t give you an algorithm).

dataflow 3 years ago | | |

If we're going the pedantic route, note that you don't need (and in fact half the time cannot have) uniqueness in our case anyway. There isn't really a unique midpoint for {0, 1, 2, 3}; both 1 and 2 are valid midpoints, even for binary search. We just pick the first one arbitrarily and work with that.

But note that that sentence was just about calculating midpoints, not about the larger binary search algorithm. And in any case, I was just trying to convey layman intuition, not write a mathematically precise theorem.

morelisp 3 years ago | |

This should also be obvious after a bit of thought to anyone who has worked with timestamps, and is also well-known in e.g. animation where midpoint is just a special case of p=0.5.

pmayrgundter 3 years ago | |

Not sure about midpoint being well-defined on anything we can subtract.. Z and R are infinite.. there are a lot of values in there that don't compute.

To vary your point here, the axioms for twos complement and IEEE floating-point aren't well known or observed.

ChadNauseam 3 years ago | | |

There are countably infinite turing machines and there is one for every element in Z. But there are uncountably infinite real numbers, so we’re out of luck for almost all of them.

tromp 3 years ago |

The bug in question is trying to compute an average as

    avg = (x + y) / 2

which fails both for signed ints (when adding positive x and y overflows maxint) and for unsigned ints (when x + y wraps around 0). Note that this can only be considered a bug for array indices x,y when these are 32 bit variables and the array can conceivably grow to more than 2 billion elements.

I wonder what is the simplest fix if the ordering between x and y is not known (e.g. in applications when x and y are not range bounds) and the language has no right-shift operation...
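One possible answer for unsigned values, as a sketch rather than a claim about the simplest fix: compare first, then offset from the smaller value. It uses no shifts. The function names are made up, and the wraparound shown assumes a platform where `uint32_t` arithmetic is done in 32 bits (true wherever `int` is 32 bits).

```c
#include <stdint.h>

/* The version under discussion: the sum wraps modulo 2^32 before the division. */
static uint32_t avg_naive(uint32_t x, uint32_t y)
{
    return (x + y) / 2;
}

/* Floor of the true average, for x and y in either order.
 * Uses only comparison, subtraction, addition and division. */
static uint32_t avg_safe(uint32_t x, uint32_t y)
{
    return x <= y ? x + (y - x) / 2
                  : y + (x - y) / 2;
}
```

The signed case is harder, because the difference of two signed values can itself overflow.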

dang 3 years ago |

Google Research Blog: Nearly All Binary Searches and Mergesorts Are Broken - https://news.ycombinator.com/item?id=16890739 - April 2018 (1 comment)

Nearly All Binary Searches and Mergesorts Are Broken (2006) - https://news.ycombinator.com/item?id=14906429 - Aug 2017 (86 comments)

Nearly All Binary Searches and Mergesorts Are Broken (2006) - https://news.ycombinator.com/item?id=12147703 - July 2016 (35 comments)

Nearly All Binary Searches and Mergesorts are Broken (2006) - https://news.ycombinator.com/item?id=9857392 - July 2015 (43 comments)

Read All About It: Nearly All Binary Searches and Mergesorts Are Broken - https://news.ycombinator.com/item?id=9113001 - Feb 2015 (2 comments)

Nearly All Binary Searches and Mergesorts are Broken (2006) - https://news.ycombinator.com/item?id=7594625 - April 2014 (2 comments)

Nearly All Binary Searches and Mergesorts are Broken (2006) - https://news.ycombinator.com/item?id=6799336 - Nov 2013 (46 comments)

Nearly All Binary Searches and Mergesorts are Broken (2006) - https://news.ycombinator.com/item?id=1130463 - Feb 2010 (49 comments)

Google Research Blog: Nearly All Binary Searches and Mergesorts are Broken [2006] - https://news.ycombinator.com/item?id=621557 - May 2009 (9 comments)

fnordpiglet 3 years ago |

This was always my go to interview question when I wanted to smugly prove to someone I’m smarter than them because I knew in fact they were smarter than me and I was feeling insecure. Good to see others use overflow gotchas too.

0x445442 3 years ago | |

My favorite was: write a function that determines the number of games necessary to be played in a single elimination tournament with N participants. It’s interesting to watch how many go off into recursion land when they get into the mindset of solving these Leet Code puzzles.

fnordpiglet 3 years ago | | |

My favorite is when interviewers expect you to know sportsball stuff like tournament elimination rules when interviewing programmers who clearly don’t care about sportsball

quag 3 years ago | | |

N-1 games?

junon 3 years ago | |

I hate when interviewers rely on niche recall-only interview questions...

latency-guy2 3 years ago | | |

Eh, I don't think integer overflow is a recall-only type question

This type of issue is pretty common to encounter and I make at least a few fixes a year specifically addressing integer overflow across many companies

legosexmagic 3 years ago |

the right solution is to parametrize the search region as (offset, length) instead of (start, end). then the midpoint is just offset+length/2.

you can also remove that unpredictable branch in the loop if you want.

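  // Returns a pointer to the last element <= x (or to the first element
  // if none is). Assumes length >= 1; uses C++'s declaration-in-condition.
  whatever_t *bisect(whatever_t *offset, size_t length, whatever_t x) {
    while(size_t midpoint = length / 2) {
      bool side = x < offset[midpoint];  // true: answer lies in the left half
      midpoint &= side - 1;              // side ? 0 : midpoint, without a branch
      length >>= side;                   // left half: keep the first length/2 elements
      offset += midpoint;                // right half: advance past the left half
      length -= midpoint;
    }
    return offset;
  }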

abecedarius 3 years ago | |

(offset, length) was how I coded it in the 90s, too, precisely because it made correctness clearer. "Nearly all" broken, hmph.
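For comparison, here is a more conventional branching sketch of the (offset, length) idea in plain C. It is not either commenter's code; `lower_bound` and its `int` element type are assumptions for the example.

```c
#include <stddef.h>

/* Returns the index of the first element >= key in sorted base[0..n),
 * or n if every element is smaller. The search region is tracked as
 * (first, n), so no two indices are ever added together. */
static size_t lower_bound(const int *base, size_t n, int key)
{
    const int *first = base;
    while (n > 0) {
        size_t half = n / 2;
        if (first[half] < key) {
            first += half + 1;   /* discard first[0..half] */
            n -= half + 1;
        } else {
            n = half;            /* keep first[0..half) */
        }
    }
    return (size_t)(first - base);
}
```

The only quantities computed are a length and half of that length, both bounded by the original `n`.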

bugfix-66 3 years ago |

Here is the approach taken in Go's sort.Search()

Do the sum using signed int.

Then cast to unsigned int before the division (i.e., use a logical rather than arithmetic right shift).

Then cast back to signed int.

  func Search(n int, f func(int) bool) int {
      // Define f(-1) == false and f(n) == true.
      // Invariant: f(i-1) == false, f(j) == true.
      i, j := 0, n
      for i < j {
          h := int(uint(i+j) >> 1) // avoid overflow when computing h
          // i ≤ h < j
          if !f(h) {
              i = h + 1 // preserves f(i-1) == false
          } else {
              j = h // preserves f(j) == true
          }
      }
      // i == j, f(i-1) == false, and f(j) (= f(i)) == true  =>  answer is i.
      return i
  }

If you care about stuff like this you may enjoy the puzzle "Upside-Down Arithmetic Shift":

https://bugfix-66.com/76b563beb6f4e61801fce4e835be862fb3dbbe...

morelisp 3 years ago | |

The solution here is not really interesting except from a language design perspective. Go avoids this problem by having the maximum array length be int, but doing the math in uint. This won’t work in languages that lack uints (Java) or have maximum array sizes in uint (C/C++).

LoganDark 3 years ago | | |

Java lacks a distinct uint type, but (since Java 8) allows you to perform unsigned operations on a regular int, effectively treating it as a uint.

It doesn't help that almost nobody knows this, though.

wizeman 3 years ago | |

This wouldn't work for C/C++ because in these languages signed integer overflow is undefined behavior.

bugfix-66 3 years ago | | |

That is correct. A serious mistake in C.

Go was designed by (among others) the father of Unix Ken Thompson, with an understanding of the mistakes of C and C++.

Another example is that Go requires explicit integer casts (disallowing implicit integer casts) to avoid what is now understood to be an enormous source of confusion and bugs in C.

You can understand Go as an improved C, designed for a world where parallel computing (e.g., dozens of CPU cores) is commonplace.

morelisp 3 years ago | | |

You could write the same approach in C as `(size_t)i+(size_t)j` without UB. The real reason it doesn't work in C is because a memory region can be large enough to still overflow in that case.
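A sketch of that cast-first variant for 32-bit signed indices, assuming `0 <= lo <= hi`. The function name is hypothetical.

```c
#include <stdint.h>

/* Assumes 0 <= lo <= hi, as for array indices. Each index converts to
 * uint32_t without loss, the unsigned sum is at most 2^32 - 2 and so does
 * not wrap, and the halved result fits in int32_t again. */
static int32_t mid_via_unsigned(int32_t lo, int32_t hi)
{
    return (int32_t)(((uint32_t)lo + (uint32_t)hi) >> 1);
}
```

The conversion happens before the addition, so no signed overflow occurs. The trick depends on the indices using only half of the unsigned range, which is why it stops helping when the indices are already `size_t`.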

IncRnd 3 years ago |

There are still edge cases here - various posters here have mentioned them.

The proper method is to type promote first - not just to unsigned but to a wider variable type - 32 to 64 bits or from 64 to 128 bits. Unsigned simply gives a single extra bit, while erasing negative semantics. Promoting to twice the size works for either addition or multiplication. The benefits are correctness and the ability to be understood at a glance.
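A minimal sketch of the widening approach for 32-bit inputs. `mid_widened` is an illustrative name, and the result rounds toward zero, which matters only when the sum is negative and odd.

```c
#include <stdint.h>

static int32_t mid_widened(int32_t lo, int32_t hi)
{
    int64_t sum = (int64_t)lo + (int64_t)hi;  /* cannot overflow 64 bits */
    return (int32_t)(sum / 2);                /* truncates toward zero */
}
```

Unlike the unsigned cast, this also works when either input is negative.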

dataflow 3 years ago | |

> There are still edge cases here - various posters here have mentioned them.

Are you sure? What's an example of an array.length that would trigger a remaining edge case here? (Keep in mind array.length is 32-bit in Java.)

delusional 3 years ago |

Calling binary search and mergesort implementations "broken" does the author no service with his argument. If the key lesson is to "carefully consider your invariants" then the proper takeaway is that binary search and mergesort implementations lose generality with large arrays.

The implementation shown works perfectly for arrays of up to about 2^30 elements. Calling it broken is like saying strlen is broken for strings that aren't null terminated.

lkuty 3 years ago |

"It is not sufficient merely to prove a program correct; you have to test it too."

Well in fact it is exactly the contrary.

Jtsummers 3 years ago | |

I took it as a reference to Knuth: "Beware of bugs in the above code; I have only proved it correct, not tried it."

https://staff.fnwi.uva.nl/p.vanemdeboas/knuthnote.pdf [PDF] page 7 of the PDF, 5 of the classroom note.

User23 3 years ago | |

It’s a clumsy formulation, but if what he means is that you need to be assured that the model you’re proving in accurately reflects the behavior of what is being modeled then he is correct at least sometimes. For example a naive Z3 proof of the mid procedure would be valid since Z3 ints are unbounded. The issue isn’t that the proof is wrong, it’s that the model is.

If the system has a well written formal specification then your model can be built from that without error if done diligently. One real world example is the first Algol 60 compiler, which was built to a formal specification. On the other hand if there is no useful spec or no spec at all then you end up needing to experiment, ie test, and get your model as close as you can.

a1369209993 3 years ago | |

No, what you've observed is the (IIRC the terminology) converse, namely:

It is not sufficient merely to test a program; you have to prove it correct too.

In addition, it is not sufficient merely to prove a program correct; you have to test it too.

In summary, you have to both prove a program correct, and test it; skipping either will result in buggy garbage.

joshuamorton 3 years ago | | |

Grandparent is correct. If you've proven the behavior correct, you don't need to test. The proof is the test. This is usually only true in languages-that-are-proof-assistants (idris). In the cases above, they hadn't actually formally proven the behavior correct.

EdSchouten 3 years ago |

If instead of 'int' you were to use 'size_t' (or the equivalent of that provided by your programming language of choice), then there should be no issues in practice. Then you would only see overflows if your elements were 1 byte in size, and the input spans more than half of the virtual address space. This is unlikely for two reasons:

1. If you only have single byte elements, you'd better use counting sort.

2. There always tend to be parts of the virtual address space that are reserved. On x86-64, most userspace processes can only access 2^47 bytes of space.

junon 3 years ago | |

> input spans more than half of the virtual address space

Not only that, but in practice most general purpose operating systems are designed with higher-half kernels[0].

[0] https://wiki.osdev.org/Higher_Half_Kernel

valleyer 3 years ago | | |

32-bit Mac OS X was not (it had a 4/4 scheme).

Though even then I'm not sure you could reliably allocate two gigs of contiguous virtual space without running into some immovable OS-provided thing.

GuB-42 3 years ago |

It is unfortunate that the language doesn't have a built-in "average between two ints" function. It is a common operation, people often get it wrong, as shown by this article, and it may have a really simple and correct assembly representation that the compiler may take advantage of.

Such a function, even if it seems trivial, has some educative value as it opens an opportunity to explain the problem in the documentation.

fay59 3 years ago | |

I feel that it’s so simple that many people will overlook that it even exists. In languages that have both, it’s hard for functions to compete with operators. I don’t think that this is the best design to promote correctness.

GuB-42 3 years ago | | |

Maybe, but providing simple functions for "obvious" operations, whether to promote correctness, to make things easier for the compiler, or just for convenience, is not uncommon at all. Most languages have a min/max function somewhere, sometimes built-in, sometimes in the standard library, even though it is trivial to implement. C is a notable exception, and that is a problem because you end up with a lot of ad-hoc solutions, all with their own issues.

If you look at GLSL, it has many functions that do obvious things, like exp2(x) that does the same thing as pow(2,x), and I don't think anyone has any issue with that. It even has a specific "fma" operation (fma(a,b,c) = a*b+c, precisely), that solves a similar kind of problem as the overflowing average.

User23 3 years ago |

Knuth’s section on binary search in The Art of Computer Programming is enlightening. One historical curiosity that he notes is that it took something like a decade from the discovery of the algorithm to an implementation that was correct for all inputs.

I briefly tried using binary search as a weeder problem and quickly abandoned it when no one got it right.

kazinator 3 years ago |

This article is poorly/incompletely reasoned.

Suppose your high, low and mid indexes are as wide as a pointer on your machine: 32 or 64 bits. Unsigned.

Suppose you're binary searching or merge sorting a structure that fits entirely into memory.

The only way (low + high)/2 will overflow is if the object being subdivided fills the entire address space, and is an array of individual bytes. Or else is a sparsely populated, virtual structure.

If the space contains distinct objects from [0] to [high-1], and they are more than a byte wide, this is a non-issue. If the objects are more than two bytes wide, you can use signed integers.

Also, you're never going to manipulate objects that fill the whole address space. On 32 bits, some applications came close. On 64 bits, people are using the top 16 bits of a pointer for a tag.

kragen 3 years ago | |

> Suppose your high, low and mid indexes are as wide as a pointer on your machine: 32 or 64 bits. Unsigned.

Yeah, if you suppose that, you can correctly conclude that you only run into overflow if the object is a byte array that fills more than half the address space (though not the entire address space as you say). And that's why this problem remained unnoticed from 01958 or whenever someone first published a correct-on-my-machine binary search until 02006.

But suppose they aren't. Suppose, for example, that you're in Java, where there's no such thing as an unsigned type, and where ints are 32 bits even on a 64-bit machine. Suddenly the move to 64-bit machines around 02006 demonstrates that you have this problem on any array with more than 2³⁰ elements. It's easy to have 2³⁰ elements on a 64-bit machine! Even if they aren't bytes.
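The 2³⁰ threshold can be checked with a small sketch. It assumes the classic inclusive-bounds search over indices 0..n-1, where the largest sum formed is `lo + hi` with both equal to `n - 1`. The function name is made up.

```c
#include <stdbool.h>
#include <stdint.h>

/* For a search over indices 0..n-1, the largest sum formed is
 * lo + hi with lo == hi == n - 1. Returns true if that sum still
 * fits in a signed 32-bit int. Computed in 64 bits to stay well defined. */
static bool int32_midpoint_is_safe(int32_t n)
{
    if (n <= 0)
        return true;
    int64_t worst_sum = 2 * ((int64_t)n - 1);
    return worst_sum <= INT32_MAX;
}
```

An array of exactly 2³⁰ elements is still safe under this assumption; one more element is enough for the worst-case sum to exceed `INT32_MAX`.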

dunhuang_nomad 3 years ago |

Does anyone know why the bitshift method works?

Is it that low and high are both floating point, so you're not constrained by int precision and don't get an overflow error? The article makes it sound like sign switching is the issue, but this is just a general overflow problem, right?

dataflow 3 years ago | |

The ">>>" operator works, the ">>" operator doesn't. The reason the former works is that it basically performs unsigned division by a power of 2; the latter does it signed. There's no floating-point.

erikpukinskis 3 years ago | | |

What do the five >s and the , mean in this comment?

odo1242 3 years ago | |

No. Integers overflow this way because, in the two's complement representation most computers use, negative numbers are stored as bit patterns that are larger (read as unsigned) than those of positive numbers. Neither low nor high is a float.

Example with 8-bit integers (from wikipedia):

    Bits        Unsigned value   Signed value
    0000 0000   0                0
    0000 0001   1                1
    0000 0010   2                2
    0111 1110   126              126
    0111 1111   127              127
    1000 0000   128              −128

When the logical bit shift is conducted on -128, -128 is treated as an unsigned integer. Its sign bit gets shifted such that the integer becomes 0100 0000, aka 64.
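The same idea can be played through on 8-bit values in C. This is a sketch that models the two Java shifts by hand on `uint8_t` bit patterns, so nothing depends on how a particular compiler shifts negative numbers. All function names are hypothetical.

```c
#include <stdint.h>

/* Reads an 8-bit pattern as a two's complement signed value (-128..127). */
static int as_signed8(uint8_t bits)
{
    return bits >= 128 ? (int)bits - 256 : (int)bits;
}

/* Sum of two indices in 0..127, kept to 8 bits as a narrow register would. */
static uint8_t sum8(uint8_t lo, uint8_t hi)
{
    return (uint8_t)(lo + hi);
}

/* Logical shift: a zero moves into the top bit (Java's >>>). */
static uint8_t shift_logical(uint8_t bits)
{
    return (uint8_t)(bits >> 1);
}

/* Arithmetic shift: the sign bit is copied into the top bit (Java's >>). */
static uint8_t shift_arithmetic(uint8_t bits)
{
    return (uint8_t)((bits >> 1) | (bits & 0x80));
}
```

With `lo = hi = 100`, the 8-bit sum is the pattern for 200, which reads as -56 when signed. The logical shift recovers the correct midpoint 100, while the arithmetic shift yields -28.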

dunhuang_nomad 3 years ago | | |

Oh I see, this is very helpful. Thank you.

david_allison 3 years ago |

16 years later, it's still incorrect on Wikipedia

https://en.wikipedia.org/wiki/Binary_search_algorithm#Proced...

enriquto 3 years ago | |

But this is pseudocode. For all you know, it could be implemented in a language whose integers are arbitrary precision, in which case it is perfectly correct and appropriate.

gp 3 years ago | | |

> language whose integers are arbitrary precision

I’m not sure what this could mean. Could you please share some examples?

froh 3 years ago | |

see "implementation issues" in the same article, with

  M = L + (R - L)/2

david_allison 3 years ago | | |

I'm aware (plus the fact that the algorithm is correct in Python). It's very unlikely that this is an argument I can win.

I'm taking a pragmatic perspective: like it or not, people are going to skim the article and copy & paste the pseudocode.

Given that the pseudocode is buggy in the vast majority of programming languages and the user isn't informed about this in the pseudocode, it's going to lead to unnecessary bugs.

elcomet 3 years ago | |

They do discuss it though at the end of the article

https://en.wikipedia.org/wiki/Binary_search_algorithm#Implem...

And as others mentioned, this is pseudocode and not an implementation. But if you think it's incorrect, feel free to correct it.

jansan 3 years ago |

Spoiler: If you are using Javascript, this bug only affects you if your arrays have more than Number.MAX_SAFE_INTEGER/2 entries, which is about 2^52. In other words, don't waste your time with fixing this bug.

chowells 3 years ago | |

Unless you're binary searching something other than a data structure. Fascinatingly, binary search works just fine in optimization problems where the function to optimize is monotonic.
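As an illustration of searching something that is not an array, here is a sketch of an integer square root found by bisecting a monotonic predicate. The name `isqrt` and the choice of problem are assumptions for the example, not taken from the comment.

```c
#include <stdint.h>

/* Largest r such that r * r <= n. The predicate "r * r <= n" is true for
 * small r and false for large r, which is all binary search needs. */
static uint32_t isqrt(uint64_t n)
{
    uint64_t lo = 0;            /* predicate known true at lo */
    uint64_t hi = 1ULL << 32;   /* predicate known false at hi: 2^64 > n */
    while (hi - lo > 1) {
        uint64_t mid = lo + (hi - lo) / 2;
        if (mid * mid <= n)     /* mid < 2^32, so mid * mid fits in 64 bits */
            lo = mid;
        else
            hi = mid;
    }
    return (uint32_t)lo;
}
```

Here the search bounds span the whole 32-bit range, so `(lo + hi) / 2` in 32-bit arithmetic would wrap even though no large array is involved.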

butlerm 3 years ago |

Anyone dealing with arrays containing a billion elements or more really ought to be using 64 bit arithmetic to avoid problems like this. Certainly better to do this the right way though.

PartiallyTyped 3 years ago | |

Is there any reason not to use 64bit arithmetic anyway?

queuebert 3 years ago |

This is a great example of how good algorithms are software plus hardware. The idea that a pure mathematical idea can be naively implemented on any hardware has never truly materialized.

Yes, we are a long way from flipping switches to input machine code, but there are still hardware considerations for correctness and performance, e.g. the entire industry of deep learning running somewhat weird implementations of linear algebra to be fast on GPUs.

runeblaze 3 years ago |

Oh boy, in 2022 you could not afford writing a broken binary search in any serious coding interview. Back before 2006 apparently PhD students in CMU could not.

feoren 3 years ago | |

Are you kidding? If you were asked in a coding interview to write a binary search, and you wrote the broken version in the post on a whiteboard, you'd be in the top 5% of applicants. Most applicants can barely write a for loop on the board.

altaltalt 3 years ago |

Can't it simply be written like this?

    mid = low/2 + high/2

Godel_unicode 3 years ago | |

Division is not associative:

https://www.khanacademy.org/math/arithmetic-home/multiply-di...

mimon 3 years ago | | |

While that is true it is not relevant here, since this example does not involve associativity.

What is relevant here is that integer division is not distributive over addition.

dataflow 3 years ago | |

Nope, try low = 1, high = 1 and you get mid = 0.

benmmurphy 3 years ago | | |

i think you can fix it with: (low >> 1) + (high >> 1) + (low & 1 & high)

for unsigned numbers. not sure if it works for signed numbers.

curling_grad 3 years ago | |

For low=3, high=5 case, this gives mid=3.

kfajdsl 3 years ago |

My data structures professor took off points for that in an assignment once :(

yarskegg 3 years ago |

I think this might be better for c/c++ though admittedly a bit more cryptic:

(x>>1) + (y>>1) + (0x01 & x & y)
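For unsigned values the formula can be checked exhaustively at a small width. The sketch below does so for all pairs of 8-bit inputs; the function names are made up. It says nothing about signed inputs, where right-shifting a negative value is implementation-defined in C.

```c
#include <stdint.h>

/* Halve first, then add back the 1 that is lost when both inputs are odd. */
static uint8_t mid_halves(uint8_t x, uint8_t y)
{
    return (uint8_t)((x >> 1) + (y >> 1) + (x & y & 1));
}

/* Returns the number of (x, y) pairs where mid_halves disagrees with the
 * exact floor((x + y) / 2) computed in a wider type. */
static unsigned count_mismatches(void)
{
    unsigned bad = 0;
    for (unsigned x = 0; x <= UINT8_MAX; x++)
        for (unsigned y = 0; y <= UINT8_MAX; y++)
            if (mid_halves((uint8_t)x, (uint8_t)y) != (x + y) / 2)
                bad++;
    return bad;
}
```

The correction term is needed exactly when both inputs are odd, which is the low=1, high=1 and low=3, high=5 cases raised elsewhere in the thread.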

vintermann 3 years ago |

On mobile, this site is broken too. Text doesn't wrap and scrolling seems to be disabled.

rgovostes 3 years ago | |

It's a post from before the iPhone came out, try reading the WAP version of the blog on your Cingular connection.

remram 3 years ago | | |

The blog is still active though. Somehow they fixed their layouts but kept old posts on the old layout?

andai 3 years ago | |

Yeah I had to use reader mode.

tiagod 3 years ago | |

I really dislike when devs disable mobile scrolling without knowing for sure their content is wrapping properly.

blacklight 3 years ago |

I'd put the blame on languages that don't allow exceptions, and whose return values in case of errors belong to the same domain as the solution.

I've coded binary searches and sorts tons of times in C++, and yet none was susceptible to this bug. Why? Because, whenever you're talking indices, you should ALWAYS use unsigned int. Since an array can't have negative indices, if you use unsigned ints the problem is solved by design. And, if the element is not found, you throw an exception.

Instead, in C you don't have exceptions, and you have to figure out creative ways for returning errors. errno-like statics work badly with concurrency. And doing something like int search(..., int* err), and setting err inside of your functions, feels cumbersome.

So what does everyone do? Return a positive int if the index is found, or -1 otherwise.

In other words, we artificially extend the domain of the solution just to include the error. We force into the signed integer domain something that was always supposed to be unsigned.

This is the most common cause of integer overflow problems out there.

    # Temp.java
    import java.util.*;

    public class Temp {
        public static void main(String[] args) {
            byte[] arr = new byte[0x7FFFFFF0];
            new Random().nextBytes(arr);
            Arrays.sort(arr);
        }
    }

    $ javac Temp.java && time -p java Temp
    real 10.30

    char array[60000]; // 5KB left for code and stack if not segmented
    size_t i = 40000;
    size_t j = 50000;
    size_t mid = (i+j)/2; // should be 45000
    // i+j = (size_t)90000 = 24464
    // mid = 24464/2 = 12232 != 45000

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main() {
        // 127 TiB
        size_t size = 127ULL * 1024 * 1024 * 1024 * 1024;
        printf("allocating %zu bytes of virtual address space...\n", size);
        void *p = malloc(size);
        if (p == NULL) {
            perror("malloc");
            exit(1);
        }
        printf("success: %p\n", p);
        sleep(3600);
    }