Making Python Less Random(healeycodes.com) |
In python you generally have two kinds of randomness: cryptographically-secure randomness, and pseudorandomness. The general recommendation is: if you need a CSRNG, use ``os.urandom`` -- or, more recently, the stdlib ``secrets`` module. But if it doesn't need to be cryptographically secure, you should use the stdlib ``random`` module.
The thing is, the ``random`` module gives you the ability to seed and re-seed the underlying PRNG state machine. You can even create your own instances of the PRNG state machine, if you want to isolate yourself from other libraries, and then you can seed or reseed that state machine at will without affecting anything else. So for pseudorandom "randomness", the stdlib already exposes a purpose-built function that does exactly what the OP needs. Also, within individual tests, it's perfectly possible to monkeypatch the root PRNG in the random module with your own temporary copy, modify the seed, etc, so you can even make this work on a per-test basis, using completely bog-standard python, no special sauce required. Well-written libraries even expose this as a primitive for dependency injection, so that you can have direct control over the PRNG.
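To make that concrete, here is a minimal sketch of both options using only the stdlib: an isolated `random.Random` instance, and a per-test swap of the module-level function. `roll_die` and `seeded_rolls` are hypothetical names made up for illustration; they are not from the article.

```python
import random
from unittest import mock

# An isolated PRNG: seeding it does not touch the module-level generator.
rng_a = random.Random(123)
rng_b = random.Random(123)
same_stream = [rng_a.random() for _ in range(3)] == [rng_b.random() for _ in range(3)]


def roll_die():
    """Hypothetical code under test that uses the module-level PRNG."""
    return random.randint(1, 6)


def seeded_rolls(seed, n=5):
    """Seed the global PRNG for one test, then restore its previous state."""
    state = random.getstate()
    try:
        random.seed(seed)
        return [roll_die() for _ in range(n)]
    finally:
        random.setstate(state)  # do not leak the seed into other tests


first = seeded_rolls(42)
second = seeded_rolls(42)

# Alternatively, temporarily replace random.randint with the bound method
# of a private, seeded instance. The patch is undone when the block exits.
with mock.patch.object(random, "randint", random.Random(7).randint):
    patched_a = roll_die()
with mock.patch.object(random, "randint", random.Random(7).randint):
    patched_b = roll_die()
```

Neither approach helps with code that reads `os.urandom` or `secrets`, which is the case the rest of this comment is about.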
Meanwhile, for applications that require CSRNG... you really shouldn't be writing code that is testing for a deterministic result. At least in my experience, assuming you aren't testing the implementation of cryptographic primitives, there are always better strategies -- things like round-trip tests, for example.
So... are the 3rd-party deps just "misbehaving" and calling ``os.urandom`` for no reason? Does the OP author not know about ``random.seed``? Does the author want to avoid monkeypatches in tests (which are completely standard practice in python)? Is there something else going on entirely? Intercepting syscalls to get deterministic randomness in python really feels like bringing an atom bomb to a game of fingerguns.
As far as the approach, I agree in that I don't understand why 'no code changes' is that important, especially in the context of Python which has a general attitude of consent towards monkeypatching code. Maybe one of the randomness sources was hashing all the source files? :P
In C++, if you use std::mt19937, everything from seeding to the explicit generator is crystal clear while being terse as well.
Editing to add that another thing that trips me up every few years is that the built-in hash function isn't repeatable between runs. Meaning if you run the program and record a hash of an object, then run it again, they'll be different. This is good for more secure maps and stuff, but not good if you were thinking you could store the hashes in a file and use them later.
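For what it's worth, the per-run randomization applies to `str` and `bytes` hashes, and it can be pinned by setting the `PYTHONHASHSEED` environment variable before the interpreter starts. If the goal is a value you can write to a file and compare later, a sketch like the following sidesteps `hash()` entirely. `stable_hash` is a made-up helper name, and truncating SHA-256 to 64 bits is just an assumption for the example.

```python
import hashlib


def stable_hash(text: str) -> int:
    """Return a 64-bit integer that is the same on every run and machine."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    # Keep the first 8 bytes; the full digest works too if size is no concern.
    return int.from_bytes(digest[:8], "big")


key = stable_hash("hello")
```

Unlike `hash("hello")`, the value of `key` does not depend on the process it was computed in.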
$ strace python3 -c 'with open("/dev/random", "rb") as f: print(f.read(8))'
[snip-snip]
openat(AT_FDCWD, "/dev/random", O_RDONLY|O_CLOEXEC) = 3
newfstatat(3, "", {st_mode=S_IFCHR|0666, st_rdev=makedev(0x1, 0x8), ...}, AT_EMPTY_PATH) = 0
ioctl(3, TCGETS, 0x7ffd8198d640) = -1 EINVAL (Invalid argument)
lseek(3, 0, SEEK_CUR) = 0
read(3, "\366m@\t5Q9\206\341\316/pXK\266\273~J\27\321:\34\330VL\253L\34\217\264L\373"..., 4096) = 4096
write(1, "b'\\xf6m@\\t5Q9\\x86'\n", 19b'\xf6m@\t5Q9\x86'
) = 19
close(3) = 0
There is also /dev/urandom.
1. Run each simulation in its own process, e.g. using multiprocessing.Pool
2. Processes receive a specification for the simulation as a simple dictionary, one key of which is "seeds"
3. Seed the global RNGs we use (random and np.random) at the start of each simulation
4. For some objects, we seed the state separately from the global seeds, run the random generation, then save the RNG state to restore later so we can have truly independent RNGs
5. Spot check individual simulations by running them twice to ensure they have the same results (1/1000, but this is customizable)
This has worked very well for us so far, and is dead simple.
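A rough sketch of steps 2 to 5, under some loud assumptions: it uses only the stdlib `random` module (no NumPy), it leaves out the `multiprocessing.Pool` dispatch from step 1, and the spec keys other than "seeds" (`n_steps`, `global`, `weather`) are invented for the example.

```python
import random


def run_simulation(spec):
    """Run one hypothetical simulation described by a plain dict."""
    seeds = spec["seeds"]

    # Step 3: seed the global RNG at the start of the simulation.
    random.seed(seeds["global"])

    # Step 4: an object that needs an independent stream gets its own
    # generator, so draws from the global RNG cannot perturb it.
    # (getstate()/setstate() on this instance can snapshot and restore it.)
    weather_rng = random.Random(seeds["weather"])
    weather = [weather_rng.random() for _ in range(3)]

    # The "simulation": a random walk driven by the global RNG.
    steps = [random.randint(-1, 1) for _ in range(spec["n_steps"])]
    return {"position": sum(steps), "weather": weather}


def spot_check(spec):
    """Step 5: run the same spec twice and confirm the results match."""
    return run_simulation(spec) == run_simulation(spec)


spec = {"n_steps": 100, "seeds": {"global": 1234, "weather": 99}}
result = run_simulation(spec)
```

Because `run_simulation` takes a plain dict and returns a plain dict, it could be handed to `Pool.map` as-is.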
import os
os.urandom = lambda n: b'\x00' * n
import random
random.randint = lambda a, b: a
I love it!
And while the article serves as a nice introduction to ptrace(), I think as a solution to the posted problem it's strictly more complicated than just replacing the getrandom() implementation with LD_PRELOAD (which the author also mentions as an option). For reference, that can be done as follows:
% cat getrandom.c
#include <string.h>
#include <sys/types.h>
ssize_t getrandom(void *buf, size_t buflen, unsigned int flags) {
    memset(buf, 0, buflen);
    return buflen;
}
% cc getrandom.c -shared -o getrandom.so
% LD_PRELOAD=./getrandom.so python3 -c 'import os; print(os.urandom(8))'
b'\x00\x00\x00\x00\x00\x00\x00\x00'
Note that these solutions work slightly differently: ptrace() intercepts the getrandom() syscall, but LD_PRELOAD replaces the getrandom() implementation in libc.so (which normally invokes the getrandom() syscall on Linux).
I'm not sure if the problem had anything to do with Python. The article is a bit silent on the specific issue with randomness. If detouring urandom() fixed it, it was probably the randomized hash tables.
It cannot have been third party modules calling random.seed() since that would not have been fixed by the hack (meant positively).
You can say that randomized hash tables by default are a mistake, same as the crippled arbitrary precision arithmetic.
If you write a web service, just set the proper defaults at the start of your program.
A good starting point is this article (though it's a little outdated): https://martinfowler.com/articles/injection.html
DI is not very common in Python, for a variety of reasons, but there apparently are DI frameworks, like: https://python-dependency-injector.ets-labs.org/index.html
And the answer was basically along the lines of: it's a fancy way to pass something like a function argument.
Taking an example like the article. Let's say you have a game with a ghost which randomly moves left or right. This would NOT be dependency injected:
import random

class Ghost:
    def __init__(self):
        self.pos = 5

    def move(self):
        # Depends on the hidden, module-level RNG.
        match random.randint(0, 1):
            case 0:
                self.pos -= 1
            case 1:
                self.pos += 1
It's constructed like this: ghost = Ghost()
Ghost's behaviour depends on the state of the global RNG, but that isn't obvious from the perspective of the user of this class.
So instead we apply DI, and pass a random number generator in:
class Ghost:
    def __init__(self, rng):
        self.pos = 5
        self.rng = rng

    def move(self):
        match self.rng.randint(0, 1):
            case 0:
                self.pos -= 1
            case 1:
                self.pos += 1
It's constructed like: ghost = Ghost(rng=random)
Now the fact that the class uses random numbers is explicit, and you can pass in an alternative RNG for testing purposes.
DI is a very useful technique that can make the construction of your system understandable, and make it easy to mock out dependencies. Much like mocking however - it shouldn't be over-used. If you use DI too much your code will become opaque as you'll never know what the concrete type of code you're calling is. Python's progressive typing can help here to some extent.
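As a sketch of what "pass in an alternative RNG for testing purposes" could look like: the Ghost below is the injected version, rewritten with if/else instead of `match` only so that it also runs on Python versions before 3.10. `AlwaysLeft` and `walk` are hypothetical names invented for the example.

```python
import random


class Ghost:
    def __init__(self, rng):
        self.pos = 5
        self.rng = rng

    def move(self):
        # Same behaviour as the match-based version, written with if/else.
        if self.rng.randint(0, 1) == 0:
            self.pos -= 1
        else:
            self.pos += 1


class AlwaysLeft:
    """Hypothetical test double: implements only the method Ghost calls."""

    def randint(self, a, b):
        return a  # always the low end, so the ghost always moves left


ghost = Ghost(rng=AlwaysLeft())
for _ in range(3):
    ghost.move()
# Three forced left moves from the starting position 5 leave ghost.pos at 2.


def walk(seed, n=10):
    """Drive a ghost with a private, seeded generator; return its final position."""
    g = Ghost(rng=random.Random(seed))
    for _ in range(n):
        g.move()
    return g.pos
```

The same seed gives the same walk every time, and nothing here touches the global RNG.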
-----------
Dependency injection is not to be confused with dependency injection systems, which are complicated beasts that obscure what dependencies are actually constructed or provided. They make DI implicit again with the argument that it's better because you don't have to pass parameters manually. I would argue that if you need a dependency injection system, maybe you've over-used dependency injection.
-----------
I like to think of it as being related to capability based security[1], where you have to explicitly provide your dependencies, otherwise you won't be able to access them.
[1]: https://en.wikipedia.org/wiki/Capability-based_security
Back in the day, sometimes we had to monkey-patch interface layers like database drivers and other code that was open to modification but closed to extension. Usually to disable some legacy or proprietary feature that broke everything else. Like “you have to use a database from 1993” and it had an `assert check_winxp_version()` or something dumb at the top level of an `__init__.py`.
These days, there are mature or python-native solutions to all of those that I recall.
However! This article is more like using the debugger and ptrace as a Game Genie or save editor than about the utility of `prng = random.Random(123)`. The actual point of the article wasn’t much about python ;)
> Patching should be done as early as possible in the lifecycle of the program. For example, the main module (the one that tests against __main__ or is otherwise the first imported) should begin with this code, ideally before any other imports:
from gevent import monkey
monkey.patch_all()
> A corollary of the above is that patching should be done on the main thread and should be done while the program is single-threaded.
It's possible to patch later on, but much more involved. If you patch module A after you've already loaded module B, which itself loads module A, then you have to both patch module A and track down and patch every reference to module A in module B. Usually those will just be global references, but not always.
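A small illustration of why late patching is more involved, using the stdlib `random` module as a stand-in for "module A" and a `from ... import` in this file as a stand-in for "module B". This is only a sketch of the reference problem, not of what gevent does internally.

```python
import random
from random import randint as early_ref  # "module B" grabbed its reference early

original = random.randint
random.randint = lambda a, b: a  # patch "module A" after the fact
try:
    # Lookups that go through the module attribute see the patch...
    patched_result = random.randint(1, 6)
    # ...but the name imported earlier still points at the old function.
    early_is_original = early_ref is original
finally:
    random.randint = original  # always undo the patch
```

Every `from A import name` that ran before the patch leaves behind a reference like `early_ref`, and each of those would need to be found and replaced separately.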
The big picture of what a DI framework does is let you declare your structural object graph using a config file or decorators and have the whole thing instantiated at runtime automagically.
The detailed view is "a fancy way to pass something like a function argument". An object that has dependencies gets them passed in ("injected") at runtime rather than calling a dependent function directly or internally instantiating dependent objects and calling them.
Doing things this way in OO languages has a number of benefits, including improved testability.
Injecting dependencies when you instantiate an object, or passing dependencies into a method via arguments, rather than having methods or constructors create their dependencies, is good design and increases testability.
On the other hand, DI frameworks are, IMO, an awful awful mess and mistake and I'm glad the industry has moved away from them. The problem with DI frameworks is that setting parameters automatically is only one part of it; the other part of it is object lifecycle management. After all, if a Foo is automatically injected into your class, the DI framework needs to know how to create a Foo and how to dispose of it.
This is where you get into per-request sessions, context hooks, and all of the stupid Spring bean bullshit everyone has come to hate and is really the main reason for so much anti-"enterprise" software patterns.
Funnily enough, it is unnecessary. Just make your dependencies into parameters and you'll get 95% of the benefit of DI. The last 5% is the code to wire up the object graph at certain entrypoints which can be large, but should be simple code.
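A minimal sketch of that split, with invented class names (`Dice`, `Game`) and an invented `build_game` entrypoint: the classes only take parameters, and one plain function wires the graph together.

```python
import random


class Dice:
    def __init__(self, rng):
        self.rng = rng

    def roll(self):
        return self.rng.randint(1, 6)


class Game:
    def __init__(self, dice):
        self.dice = dice

    def play_round(self):
        return self.dice.roll() + self.dice.roll()


def build_game(seed=None):
    """The wiring code: the only place that picks concrete types."""
    rng = random.Random(seed)  # seed=None means seeded from the OS
    return Game(Dice(rng))


score = build_game(seed=3).play_round()
```

Production code would call `build_game()` once at startup; a test can call `build_game(seed=3)` or construct `Game` directly with a fake `Dice`.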
I don't love spring, but I've seen the benefits from having a well-organized, declarative graph of the top-level objects (i.e. the core objects that live for the entire process lifetime). It provides a clear pattern for how to add new code to a large, growing codebase. Without such a structure, devs end up tacking new code on in arbitrary ways.
Yes. I specifically asked like that to avoid getting a description of DI just couched in more OOP jargon but also to provide some alternative programming vocabulary (so you don't have to try to explain everything in everyday English, which seldom goes well).