More Itertools

More Itertools (more-itertools.readthedocs.io)

206 points by stereoabuse 2 years ago | 49 comments

I've implemented the "chunked" iterator a million times. Glad to see I can just import this next time.

fastily 2 years ago | |

Since Python 3.12, the built-in itertools module includes a batched function: https://docs.python.org/3/library/itertools.html#itertools.b...
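For anyone on an older Python, the chunking idea can be sketched with islice. This is only a minimal sketch, not the more-itertools or stdlib implementation, and the name chunked_sketch is made up for illustration. Note that itertools.batched yields tuples, while this sketch yields lists.

```python
from itertools import islice

def chunked_sketch(iterable, n):
    """Yield successive lists of up to n items (hypothetical helper)."""
    if n < 1:
        raise ValueError("n must be at least one")
    iterator = iter(iterable)
    # iter(callable, sentinel) keeps calling until an empty list comes back.
    yield from iter(lambda: list(islice(iterator, n)), [])

chunks = list(chunked_sketch(range(7), 3))
print(chunks)
```

The last chunk is simply shorter when the input does not divide evenly.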

jszymborski 2 years ago | | |

Even better! Thanks :)

tempcommenttt 2 years ago |

If you like this sort of thing, why not check out “boltons” - things that should be built-in in Python?

https://pypi.org/project/boltons/

benkuykendall 2 years ago |

My favorite function here is more_itertools.one. Especially in something like a unit test, where ValueErrors from unexpected conditions are desirable, we can use it to turn code like

  results = list(get_some_stuff(...))
  assert len(results) == 1
  result = results[0]

into

  result = one(get_some_stuff(...))

I guess you could also use tuple-unpacking:

  result, = get_some_stuff(...)

But the syntax is awkward to unpack a single item. Doesn't that trailing comma just look implausible? (Also I've worked with type-checkers that will complain when a tuple-unpacking could potentially fail, while one has a clear type signature Iterable[T] -> T.)
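To make the behavior concrete, here is a rough stand-in for what one does, as described above: return the single item, or raise ValueError when there are zero or several. The function one_sketch is a hypothetical illustration, not the library's code.

```python
def one_sketch(iterable):
    """Return the only item of iterable, raising ValueError otherwise."""
    iterator = iter(iterable)
    try:
        first = next(iterator)
    except StopIteration:
        raise ValueError("expected exactly one item, got none") from None
    try:
        next(iterator)
    except StopIteration:
        # No second item, so the first one was the only one.
        return first
    raise ValueError("expected exactly one item, got more")

result = one_sketch(x for x in [42])
print(result)
```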

rjmill 2 years ago | |

You can also do

  [result] = get_some_stuff(...)

adammarples 2 years ago | |

Do tuple unpacking like this

  result, *_ = iterable()

plyp 2 years ago | | |

That’s not the same though. Your unpacking allows for any non-empty iterable while OPs only allows for an iterable with exactly one item or else it throws an exception.

jauntywundrkind 2 years ago |

Shout out to JavaScript massively delaying https://github.com/tc39/proposal-async-iterator-helpers in the 23rd hour.

The proposal seemed very close to getting shipped alongside https://github.com/tc39/proposal-iterator-helpers while basically accepting many of the constraints of current async iteration (one at a time consumption). But the folks really accepted that concurrency needs had evolved, decided to hold back & keep iterating & churning for better.

I feel like a lot of the easy visible mood on the web (against the web) is that there's too much, that stuff is just piled in. But I see a lot of caring & deliberation & trying to get shit right & good. Sometimes that too can be maddening, but ultimately with the web there aren't really re-do-es & the deliberation is good.

jacobolus 2 years ago | |

You can implement quite a lot of Python's itertools in Javascript without too much trouble. For instance, https://observablehq.com/@jrus/itertools

Disclaimer: this code was written several years ago with few downstream users, not all of these are super high performing, and they have not been super extensively tested.

raymondh 2 years ago | | |

Your nice work on the JS itertools port has a todo for a "better tee". This was my fault because the old "rough equivalent" code in the Python docs was too obscure and didn't provide a good emulation.

Here is an update that should be much easier to convert to JS:

        def tee(iterable, n=2):
            iterator = iter(iterable)
            shared_link = [None, None]
            return tuple(_tee(iterator, shared_link) for _ in range(n))

        def _tee(iterator, link):
            try:
                while True:
                    if link[1] is None:
                        link[0] = next(iterator)
                        link[1] = [None, None]
                    value, link = link
                    yield value
            except StopIteration:
                return
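A quick check that any port could be tested against: each returned iterator should see every value, regardless of how far the others have advanced. The sketch below uses the stdlib itertools.tee rather than the emulation, purely to show the expected behavior.

```python
from itertools import tee

a, b = tee(iter([1, 2, 3]))

first_a = next(a)   # advancing a does not advance b
all_b = list(b)     # b still sees every value
rest_a = list(a)    # a continues from where it left off

print(first_a, all_b, rest_a)
```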

danpalmer 2 years ago | |

> But the folks really accepted that concurrency needs had evolved, decided to hold back & keep iterating & churning for better

I'm not sure if it was this proposal or another one in a similar space, but I've recently heard about several async improvements that were woefully under-spec'd, and would likely have caused much more harm than good due to all the edge cases that were missed.

PLenz 2 years ago |

This library is my Python productivity secret weapon. So many things I've needed to implement in the past are now just a matter of chaining functions from itertools, functools, and this.

elijahbenizzy 2 years ago |

Nice! These can make code a ton simpler. Also no python dependencies, which is a requirement for me adopting. Would love to see this brought into the standard lib at some point.

slig 2 years ago |

What's the process for adding these to Python's stdlib? Is it even possible to adopt a whole library such as this one?

loloquwowndueo 2 years ago | |

Yes. unittest.mock used to be a third-party library.

For an idea of the process followed, look up PEP 417 (Python Enhancement Proposal).

slig 2 years ago | | |

Thank you!

appplication 2 years ago | |

It’s possible but tends not to be common for a multitude of reasons. The biggest issue is library updates become synced to version patch updates, which doesn’t provide a lot of flexibility. A package would have to be exceptionally stable to be a reasonable candidate.

cosmic_quanta 2 years ago | |

It must be possible, because the 'dataclasses' library used to be third-party.

ericvsmith 2 years ago | | |

That’s not actually true. While dataclasses took most of its inspiration from attrs, there are many features of attrs that were deliberately not implemented in dataclasses, just so it could “fit” in the stdlib.

Or maybe you mean the backport of dataclasses to 3.6 that is available on PyPI? That actually came after dataclasses was added to 3.7.

Source: I wrote dataclasses.

jdeaton 2 years ago |

it has always annoyed me that flatten isn't already part of itertools

jdeaton 2 years ago | |

ok itertools has chain.from_iterable but that name is hard to remember

Myrmornis 2 years ago | | |

Yes, I think it might have been a slight design mistake to make the variadic version the default. I've only very rarely used it, whereas I use chain.from_iterable a lot.

isoprophlex 2 years ago | |

Amen. For a language that gloats on about "flat is better than nested" you have to jump through too many hoops to get your stuff flattened.

vismit2000 2 years ago | |

It's there in form of chain:

  from itertools import chain

  flatten = chain.from_iterable

Ref: pytudes - https://github.com/norvig/pytudes/blob/main/ipynb/Advent-202...
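For comparison, here are the variadic form and chain.from_iterable side by side on a small made-up input. Both flatten exactly one level of nesting.

```python
from itertools import chain

nested = [[1, 2], [3], [], [4, 5]]
flatten = chain.from_iterable

# Variadic form: the outer iterable is unpacked up front into arguments.
variadic = list(chain(*nested))

# from_iterable: the outer iterable is consumed lazily, one inner at a time.
from_iter = list(flatten(nested))

print(variadic, from_iter)
```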

jonathan_landy 2 years ago | |

Is np.flatten not a workable option in some cases?

benkuykendall 2 years ago | | |

Maybe in some cases, but the performance characteristics are way different. The functions in `more_itertools` return lazy generators, but it looks like `np.flatten` materializes the results in an ndarray.
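The laziness point can be shown without NumPy: a lazy flatten works even on an endless input, which a function that materializes its result could never finish. This sketch uses the stdlib chain.from_iterable as a stand-in for a lazy flatten.

```python
from itertools import chain, count, islice

# An endless stream of small lists: [0, 1], [1, 2], [2, 3], ...
pairs = ([n, n + 1] for n in count())

# Nothing is consumed until items are requested.
lazy_flat = chain.from_iterable(pairs)
first_five = list(islice(lazy_flat, 5))

print(first_five)
```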

rnewme 2 years ago | | |

Is np part of the itertools?

zhukovgreen 2 years ago |

I was frustrated by the itertools design, because the chain of operations reads from the inside out. Iterative design in Scala is much friendlier to me

https://pybites.circle.so/c/python-discussion/functional-com...
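One way to get a left-to-right reading order in plain Python is a tiny pipe helper. This is just a sketch; pipe is a hypothetical name, not something from itertools or more-itertools.

```python
from functools import reduce
from itertools import islice

def pipe(value, *steps):
    """Pass value through each step in order (hypothetical helper)."""
    return reduce(lambda acc, step: step(acc), steps, value)

# Inside out: read from the innermost call outward.
nested = list(islice(map(lambda x: x * x, filter(lambda x: x % 2, range(10))), 3))

# Left to right: the same operations in reading order.
piped = pipe(
    range(10),
    lambda xs: filter(lambda x: x % 2, xs),
    lambda xs: map(lambda x: x * x, xs),
    lambda xs: islice(xs, 3),
    list,
)

print(nested, piped)
```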

hiAndrewQuinn 2 years ago |

itertools is a gem and has been since the 2.7 days. Glad to see people waking up to its powerful abstractions.

wenc 2 years ago | |

itertools (iterators) and collections (data structures) are both underrated modules in stdlib.

drexlspivey 2 years ago | | |

And are both written by Raymond Hettinger

zokier 2 years ago | |

itertools and more-itertools are two different libs

screye 2 years ago |

This looks great.

Usually, I'd cast my arrays into a pandas DF and then use the equivalent dataframe operations. To me, pandas and numpy might as well be part of the Python stdlib.

How should I reason about the tradeoff of using something like this vs pandas/numpy? Especially with NumPy 2.0 supporting the string dtype.

almostgotcaught 2 years ago | |

> Usually, I'd cast my arrays into a pandas DF

I promise I mean no offense by this but this is so comically absurd. Like you know it's not a cast right? Ie that you're constructing pandas dataframes.

> How should I reason about the tradeoff of using something like this vs pandas/numpy?

For small sizes, operations on native types will be faster than the construction of complex objects.

mabster 2 years ago | | |

Also, my grief with DF is they aren't typed (typing module) by column. Maybe that's changed though? It's been a while.

The only way to understand what's going on with DF code is to step it in a debugger. I know they can be much faster, but man you pay a maintainability price!

screye 2 years ago | | |

No offense taken.

My tasks aren't usually bottlenecked by the df creation operation. To me, the convenience offered by dfs outstrips the compute hit. However, if this is an order of magnitude difference, then it would push me to adopt the more-itertools formulation.

samsquire 2 years ago |

This is really helpful. Thank you.

I would like to see some kind of query AST for this stuff, in a query engine whose semantics let its ops be fused together for efficiency. For example, like a Clojure transducer.
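As a rough idea of what transducer-style fusion could look like in Python: each operation wraps a reducing step, so a filter and a map run in a single pass with no intermediate sequence. All names here (mapping, filtering, compose, append) are hypothetical and only loosely modeled on Clojure's transducers.

```python
from functools import reduce

def mapping(f):
    # Wrap a reducing step so each item is transformed before it is added.
    def transducer(step):
        return lambda acc, x: step(acc, f(x))
    return transducer

def filtering(pred):
    # Wrap a reducing step so items failing pred are skipped.
    def transducer(step):
        return lambda acc, x: step(acc, x) if pred(x) else acc
    return transducer

def compose(*xforms):
    # Items flow through xforms left to right.
    def composed(step):
        for xf in reversed(xforms):
            step = xf(step)
        return step
    return composed

def append(acc, x):
    acc.append(x)
    return acc

xform = compose(filtering(lambda x: x % 2), mapping(lambda x: x * x))
result = reduce(xform(append), range(10), [])
print(result)
```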

  import time
  import pandas as pd

  ls = list(range(10))

  b = time.monotonic_ns()
  odds = [v for v in ls if v % 2]
  e = time.monotonic_ns() - b
  print(f"{e=}")

  bb = time.monotonic_ns()
  df = pd.DataFrame(ls)
  odds = df[df % 2 == 1]
  ee = time.monotonic_ns() - bb
  print(f"{ee=}")
  print("ratio", ee/e)

  >>> e=1166
  >>> ee=656792
  >>> ratio 563.2864493996569