Myths Programmers Believe about CPU Caches (2018) (software.rajivprab.com)
I found that book very readable, and it cleared up most misconceptions I had. It also teaches a universal vocabulary for discussing coherency/consistency, which is useful for conveying the nuances of the topic.
Cache coherence is not super relevant to most programmers though. Every language provides an abstraction on top of caches, and nearly every language uses the “data race free -> sequentially consistent” model. Having an understanding of data races and sequential consistency is much more important than understanding caching: the compiler/runtime has more freedom to mess with your code than your CPU (unless you are on something like the DEC Alpha, which you probably aren't).
If you are writing an OS/Hypervisor/Compiler (or any other situation where you touch asm), cache coherence is a subject you probably need a solid grasp of.
edit: only those PDFs with a checkmark are available to download as PDF; the rest can be bought. Quite a few are actually available for download.
Of course you might end up analyzing assembly level perf traces in a hot path for some game console with a less known CPU architecture and making a cross cpu cache miss slightly less slow could just maybe be helped by the detailed understanding of the machine model, but by that time you're already far in the not-giant-turd territory (at least if you're optimizing the right thing).
Of course computer architecture is fascinating and fun to learn about.
Of course performance matters, but surely having performance tests, rather than trying to second guess what the whole stack below you might be doing, is
1. more efficient 2. more accurate 3. more likely to detect changes in a timely way.
That's not to say, you shouldn't be curious and deep understanding isn't a good thing.
Just saying that understanding inside-out the abstraction you are working with (e.g. the Java Memory Model) and its performance characteristics (from real-world testing) is more important than some passing knowledge of real-world CPU design.
This app I am using right now is in a webbrowser - not sure how understanding cache coherency helps in a single threaded javascript.
For the bulk of the eng I work with the concept of StoreLoad reordering on x86 would be an academic distraction.
It was before my time :)
The issue that comes up on Alpha is this code:
thread1() {
    x = …;              // Store to *p
    release_barrier();  // guarantee global happens-before
    p = &x;             // ... and now store the p value.
}
thread2() {
    r1 = p;             // If this reads &x from thread1,
    r2 = *r1;           // this doesn't have to read the value of x!
}
The Alpha's approach to memory was to impose absolutely no constraints on memory unless you asked it to. And each CPU had two cache banks, which means that from the hardware perspective, you can think of it as having two threads reading from memory, each performing its own coherency logic. So you can have one cache bank reading the value of p which, having processed all pending cache traffic, saw both the stores, and then you can turn around and request the other cache bank to read *p which, being behind on the traffic, hasn't seen either store yet.
Architectures with only one cache bank don't have this problem. Other architectures with cache banks feel obligated to solve the issue by adding extra hardware to make sure that the second cache bank has processed sufficient cache coherency traffic to not be behind the first one if there's a dependency chain crossing cache banks.
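To make the pattern above concrete, here is a minimal C++11 sketch of the same pointer publication. The names (Payload, published, writer, reader, run_publish_demo) are hypothetical and chosen only for illustration. It uses a release store paired with an acquire load, which is portable to every target including ones that do not order dependent loads; memory_order_consume was meant to be the cheaper variant, but mainstream compilers currently treat it as acquire anyway.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical payload type, standing in for "x" in the pseudocode above.
struct Payload { int value; };

// Plays the role of "p": null until the writer publishes.
std::atomic<Payload*> published{nullptr};

// Writer: initialise the payload, then publish the pointer with release
// ordering so the initialisation happens-before any acquire load that sees it.
void writer(Payload* slot) {
    slot->value = 42;
    published.store(slot, std::memory_order_release);
}

// Reader: spin until the pointer is visible, then dereference it.
// The acquire load is what makes reading p->value safe on every target.
int reader() {
    Payload* p;
    while ((p = published.load(std::memory_order_acquire)) == nullptr) {
        std::this_thread::yield();
    }
    return p->value;
}

// Runs one writer and one reader and returns what the reader observed.
int run_publish_demo() {
    published.store(nullptr, std::memory_order_relaxed);
    Payload slot{0};
    int seen = -1;
    std::thread t2([&] { seen = reader(); });
    std::thread t1(writer, &slot);
    t1.join();
    t2.join();
    assert(seen != -1);  // the reader always runs to completion
    return seen;
}
```

Calling run_publish_demo() starts both threads and returns the value the reader saw through the published pointer. Replacing both orderings with memory_order_relaxed would remove the guarantee on weakly ordered hardware, which is the Alpha scenario described above.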
The C++11/C11 memory model added memory_order_consume specifically to support the Alpha.
https://preshing.com/20140709/the-purpose-of-memory_order_co...
https://github.com/torvalds/linux/blob/master/Documentation/...
ARM systems are more relaxed, and therefore need more barriers than on x86. Memory barriers (which also function as "compiler barriers" for the memory / register thing discussed in the article) are handled as long as you properly use locks (or other synchronization primitives like semaphores or mutexes).
It's good to know how things work "under the covers" for performance reasons at least. Especially if you ever write a lock-free data structure (not allowed to use... well... mutexes or locks), so you need to place the barriers in the appropriate spot.
------
I think the acquire/release model of consistency will become more important in the coming years. PCIe 4.0 is showing signs of supporting acquire/release... ARM and POWER have added acquire/release model instructions, and even CUDA has acquire/release semantics being built.
As high-performance code demands faster-and-faster systems, the more-and-more relaxed our systems will become. Acquire/release is quickly becoming the standard model.
Your options here are either to manually invalidate your caches and synchronize with the DMA (e.g. via interrupts), or to request from the OS that the given memory section be entirely uncached; or in some cases, you can get away with a write-through cache policy, if the DMA is only ever reading the memory.
https://austingwalters.com/the-cache-and-multithreading/
Hope it helps someone, I tend to link it to my co-workers when they ask me why I PR'd re-ordering of loops & functions OR when they ask how I get speedups without changing functionality.
I'm wondering if the MESI protocol could be used in a networked database manner? I feel like you need a master node to coordinate everything though (like the L2 does in the example).
Take a typical game loop and assume we have a list of Transforms (e.g. world matrix, translation/rotation/scale, whatever - each Transform is a collection of floats in contiguous memory)
Different systems that run in that loop need those transforms in different orders. Rendering may want to organize it by material (to avoid shader switching), AI may want to organize it by type of state machine, Collision by geometric proximity, Culling for physics and lighting might be different, and the list goes on.
Naive answer is "just duplicate the transforms when they are updated" but that introduces its own complexity and comes at its own cost of fetching data from the cache.
I guess what I'm getting at is:
1) I would love to learn more about how this problem is tackled by robust game engines (I guess games too - but games have more specific knowledge than engines and can have unique code to handle it)
2) Does it all just come out in the wash at the end of the day? Not saying just throw it all on the heap and don't care... but, maybe say optimizing for one path, like "updating world transforms for rendering", is worth it and then whatever the cost is to jump around elsewhere doesn't really matter?
Sorry if my question is a bit vague... any insight is appreciated
In short, write a transform once and treat it as immutable. Do not reuse the Transform allocation for a good while for subsequent frames to ensure that its cache lines are no longer in cache. If you do need to reuse right away, you can force invalidate cache lines by addresses, so that the single-writer in the next step is the single (O)wner and no other caches need to invalidate anything.
It seems like this would be generally applicable for a wide variety of distributed & concurrent applications.
It's also not fully accessible, each core can only directly access the 16 MB in its group of 4. Everything else is the same as a cross cache read.
Whereas today, 512MB is a bare minimum amount of DRAM in a general-purpose desktop. Times change.
Different processes can certainly have different versions of the same state, different values for the same variable, and different values at the same virtual address.
And what about virtual caches? Non-coherent cache lines?
Moreover, even in the face of cache coherency you can still have race conditions.
What do you mean? Either two caches agree on the content of a cache line or one of the cache lines is marked invalid (and the stale content is irrelevant). There are components of a core that might not respect coherency, like load and store buffers and arguably registers, but not caches (on cache-coherent systems of course).
Virtually addressed caches are an issue and that's why they have fallen out of favor.
I mean after we account for memory operations reordering on each core, the memory address storing a single value that is visible to all cores is a correct model from the concurrency-correctness point of view, right?
Or in other words, the only kind of visible reordering that is allowed to occur is that writes can be delayed past reads.
An example of a situation where this is significant:
thread 1          thread 2
mov [X], 0        mov [Y], 0
mov [X], 1        mov [Y], 1
mov r1, [Y]       mov r2, [X]
After this sequence of code, r1 == r2 == 0 is legal. (As is any other combination of 1 and 0.)
(edit:) And just to add, all this reordering is of course impossible to detect on just one core, as when a read request hits a recent write on the same core, it reads it out of the store queue. This can sometimes be really bad for performance, though: if you read a value that is partially in the store queue (such as writing a 16-bit value to x, then immediately reading 32 bits from x), some CPUs will stall that read, and all that follow it, until the entire store queue is flushed. Since the store queue can easily take tens if not hundreds of cycles to clear, this can be very expensive.
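The same store-buffering test can be written against the C++11 memory model. The sketch below uses hypothetical helper names (both_zero, count_both_zero) and takes the memory orders as parameters. With memory_order_seq_cst on all four operations, the r1 == r2 == 0 outcome is forbidden by the language; with memory_order_relaxed it is permitted, although starting a fresh pair of threads per round makes it unlikely to show up in practice, so treat a zero count there as inconclusive.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> X{0}, Y{0};

// One round of the store-buffering litmus test. Each thread stores 1 to its
// own variable and then loads the other one, using the given orderings.
// store_order must be valid for a store (relaxed, release or seq_cst) and
// load_order valid for a load (relaxed, acquire or seq_cst).
// Returns true if both loads saw 0.
bool both_zero(std::memory_order store_order, std::memory_order load_order) {
    X.store(0, std::memory_order_relaxed);
    Y.store(0, std::memory_order_relaxed);
    int r1 = -1, r2 = -1;
    std::thread t1([&] {
        X.store(1, store_order);
        r1 = Y.load(load_order);
    });
    std::thread t2([&] {
        Y.store(1, store_order);
        r2 = X.load(load_order);
    });
    t1.join();
    t2.join();
    return r1 == 0 && r2 == 0;
}

// Counts how many of `rounds` iterations ended with r1 == r2 == 0.
int count_both_zero(int rounds, std::memory_order store_order,
                    std::memory_order load_order) {
    assert(rounds >= 0);
    int hits = 0;
    for (int i = 0; i < rounds; ++i) {
        if (both_zero(store_order, load_order)) ++hits;
    }
    return hits;
}
```

On x86, a seq_cst store is typically compiled to an instruction that drains the store buffer before later loads, which is exactly the barrier the plain mov sequence above lacks.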
* there might be some minor exceptions
The reason some "app" is slow is not because of cache coherence traffic. It's because somebody chose the wrong data structure, created some stupid architecture, wrote incomprehensible code that the next guy extended in a horribly inefficient way because they didn't understand the original. My web browsing experience is slow because people include heaps of bloated JS crap and trackers and ad networks so I have to load 15 megabytes of nonsense and wait for JS to render stuff I don't want to see. None of this would be any better if anybody involved understood CPU caches better.
Even in the kernel or HPC applications, most code is not in the hot path. Programmers should rather focus on clean architectures and understandable code. How does it help if you hand-optimize some stuff that nobody understands, where nobody can fix a bug and nobody can extend it? That's horrible code, even if it's 5% faster than what somebody else wrote.
TL;DR: This is interesting, but likely totally irrelevant to your day job. In the list of things to do to improve your code, it comes so far down that you'll never get there.
"Memory models
Different architectures provide different memory models. As of this writing, ARM and POWER architecture machines have comparatively “weak” memory models: the CPU core has considerable leeway in reordering load and store operations in ways that might change the semantics of programs in a multi-core context, along with “memory barrier” instructions that can be used by the program to specify constraints: “do not reorder memory operations across this line”. By contrast, x86 comes with a quite strong memory model."
Even the description of MOESI is just an introduction and, as the article mentions, actual systems use more complicated protocols.
Edit: if anything, the misconception is that memory barriers have anything to do with cache coherency.
They have implemented the acquire-release consistency model since day one (or, the day they started supporting multi-processors). Yes, there are some subtleties there that have in some cases been tightened later on, e.g. multi-copy atomicity.
ARM and POWERPC CPU devs worked very hard to get consume/release into C++11, but no compiler writer actually implemented that part of the standard. As such, consume/release can be safely forgotten into the annals of computer history (much like DEC Alpha's fully relaxed semantics)
Then in ARMv8, ARM simply added LDAR (Load-acquire) and STLR (Store-release) instructions. https://developer.arm.com/docs/100941/0100/barriers . So the ARM CPU now fully supports the acquire/release model. Apparently IBM's POWER instruction set was similarly strengthened to acquire/release (either POWER8 or POWER9).
ARM / POWER "normal" loads and stores are still consume/release semantics. But compilers can simply emit LDAR (load-acquire) for the stronger guarantee.
----------
I remember at least one talk that showed that consume/release is ideal for things like Linux's RCU or something like that (that acquire/release is actually "too strong" for RCU, and therefore suboptimal). But because compiler-writers found consume/release too hard to reason about in practice, we're stuck with acquire/release.
It seems like the C++ standard continues to evolve to push for memory_order_consume (http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p075...), but all the details are still up for discussion.
Linux relies heavily on performant RCU for scalability, which a pure acquire/release SW programming model can't support.
https://www.kernel.org/doc/Documentation/RCU/whatisRCU.txt
In it is:
> The rcu_read_lock() and rcu_read_unlock() primitive read-acquire and release a global reader-writer lock.
Seems like RCU-operations in the Linux kernel are defined in acquire-barrier and release-barrier terms. I heard a while ago that RCU could be discussed in terms of release-consume semantics (which are slightly faster but harder to understand...) but very few people understand release-consume.
As such, release-acquire is probably the memory model of the future. I'm not really aware of anything aside from: Fully Relaxed (unordered), the obscure release-consume, release-acquire, and finally sequentially consistent (too slow for modern systems)
---------
Are you perhaps confusing "acquire-release" semantics (which is a memory-barrier / cache coherence principle) with spinlocks? Acquire-release seems to be the "fastest practical" memory consistency model. (Since Relaxed doesn't work, and release-consume is too confusing.)
For more info on acquire-release, Preshing's blogposts are great: https://preshing.com/20130922/acquire-and-release-fences/
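The names "acquire" and "release" do come from lock operations, which is probably the source of the confusion. As a rough illustration, here is a minimal test-and-set spinlock built on std::atomic_flag, where taking the lock is an acquire operation and dropping it is a release operation. SpinLock and locked_count are hypothetical names, and the lock has no backoff or fairness, so it is a sketch rather than something to ship.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Minimal test-and-set spinlock; illustrative only.
class SpinLock {
public:
    void lock() {
        // Acquire: later reads/writes in this thread cannot move above this point.
        while (flag_.test_and_set(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
    }
    void unlock() {
        // Release: earlier reads/writes in this thread cannot move below this point.
        flag_.clear(std::memory_order_release);
    }
private:
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};

// Two threads increment a plain (non-atomic) counter under the lock and the
// final value is returned. The acquire/release pairing is what makes each
// thread see the other's increments.
long locked_count(long per_thread) {
    assert(per_thread >= 0);
    SpinLock lock;
    long counter = 0;
    auto work = [&] {
        for (long i = 0; i < per_thread; ++i) {
            lock.lock();
            ++counter;
            lock.unlock();
        }
    };
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    return counter;
}
```

The memory model itself is more general than this: the same orderings apply to any atomic load or store, with or without a lock around them.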
The downsides of MESI are that (a) it requires broadcasts, which don't scale very well; and (b) it doesn't tolerate partitions -- which also imposes an effective scaling limit, since large systems are always partitioned (usually with a partition of N-k and k partitions of 1, due to k nodes having failed).
No, it can be implemented with directory instead, e.g.
https://en.wikipedia.org/wiki/Directory-based_cache_coherenc...
Or various combinations of snooping and directories ("snoop filters", or directories that act as "bridges" between broadcast domains, etc.).
In current Xeon processors (and presumably AMD EPYC as well, though I don't yet have first-hand experience with those), you have a couple of directories per CPU with snoop filtering, as with tens of cores broadcasting becomes a scalability bottleneck. In the BIOS you can change the mode in which it operates, with slightly different names and semantics depending on the CPU generation.
The short answer is yes, it can, though you tend to use key ranges (or a "predicate range" for generality) rather than cache lines and addresses.
Taken to its extreme it can produce extremely fast and versatile distributed databases to the extent that different nodes are accessing non-overlapping key ranges, or only reading from shared key ranges.
Neither a master nor broadcasts are really needed. (Though avoiding masters is quite complicated to do right; and a little trickle of broadcasting is often used, just for nodes to discover each other, and check they are still up, but not for every cache transaction.)
The general MESI-like pattern is probably used more commonly for coherent network filesystems than databases, though.
Many network filesystems use "leases" (or "oplocks") on files or ranges, which are similar to the states in MESI if the filesystem is one of the good ones that is coherent.
This bit of Documentation for SMB, the Microsoft network filesystem, might remind you of the S and M/E states in MESI:
· Read-caching lease: allows caching reads and can be *shared* by multiple clients.
· Write-caching lease: allows caching writes and is *exclusive* to only one client.
Also, cache is essentially optional, and its configuration (not just sizes, but how it is shared amongst cores and other units, speeds relative to other memory levels, even how many levels there are (and each level can have different properties), how cache rows are arranged and mapped, their size, ..., etc.) can and will vary between otherwise identical looking systems. If you are compiling to optimise for cache use you end up either having to JiT compile, or compile several versions of some routines and include them all so which one is used can be chosen at run-time, or have different versions of the whole compiled output for different systems. All of those things happen at times anyway for other reasons, but presumably the overall pay-off of doing the same for cache variances isn't high enough for it to be worthwhile building into general purpose compilers (though the cases for/against this sort of work in domain specific compilers and other tool chains may be quite different).
Some designs argue that we shouldn't need to care about the implementation details of any memory let alone L1/2/3/? cache - just access storage and let the OS & hardware make use of the faster memory levels it has access to as it sees fit to optimise that storage access.
However, there are many machines, which instead of using caches, use programmer-visible “scratchpads”. There are many reasons for this, but two big reasons are performance-predictability (real-time systems) or to avoid the hw complexity of cache management.
In general though, scratchpads are hugely painful to program for and terrible for code portability.
Turning the argument the other way, you as a programmer can control a lot about caches: you can prefetch cache lines, invalidate them and use streaming instructions to bypass them.
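As a small illustration of the prefetching part, the sketch below sums a gather through an index array and hints the cache about an element needed a few iterations later. It relies on __builtin_prefetch, which is a GCC/Clang builtin rather than standard C++; gather_sum and the default distance of 8 are hypothetical choices, and whether the hint helps at all depends on the hardware and access pattern. Invalidation and streaming stores are not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sums data[index[i]] for all i, hinting the cache about the element that
// will be needed `distance` iterations from now.
long long gather_sum(const std::vector<int>& data,
                     const std::vector<std::size_t>& index,
                     std::size_t distance = 8) {
    long long total = 0;
    for (std::size_t i = 0; i < index.size(); ++i) {
        assert(index[i] < data.size());  // debug-only bounds check
        if (i + distance < index.size()) {
            // GCC/Clang builtin: second argument 0 = read access,
            // third argument 1 = low temporal locality.
            // It is only a hint; the result is the same with or without it.
            __builtin_prefetch(&data[index[i + distance]], 0, 1);
        }
        total += data[index[i]];
    }
    return total;
}
```

Because the prefetch never changes the result, the only way to judge it is to benchmark with and without the hint on the target machine.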
In my opinion an important part of being a good programmer is understanding - at least at a broad level - how the set of abstractions you're working on top of work.
At the end of the day, our job is making computer hardware operate on data in memory. The more that we forget that, and think about computing as some abstract endeavor performed in Plato's heaven, the more tendency we have for bloat and inefficiency to creep into our various abstractions.
In other words, I think it's better to think about abstractions as a tool for interacting with hardware, not as something to save us from dealing with hardware.
Does this mean when you use double-checked locking on p on non-alpha systems, you do not need any kind of synchronization on the fast path where p is initialized?
So this would be correct?
if (!p) {
    T *x = new T;
    release_barrier();
    if (!compare_and_swap(p, 0, x))
        delete x;
}
At my previous company, I would have to spend the better part of a day getting my "documentation" into the "correct" Confluence Style And Manner. They were adamant THAT'S where the value is: having documentation in the most beautiful and absurd style and double-linked structure. You would have to block out a day or two in your scrum (what nonsense) just to focus on your documentation. And this is not some sort of important software like Linux or banking... but a stupid website.
AFAIR Paul McKenney was the primus motor, and the motivation was largely RCU. Then again, McKenney also worked for IBM at the time and certainly had an interest in pushing a model that mapped well to POWER.
But it turned out to be both somewhat mis-specified and hard to implement cleanly, so most compilers just implemented it as an acquire.
As you mention, there is ongoing work to fix it.
As for ARM, it seems the big thing they've done since the initial release of ARMv8 is to banish non-multicopy atomicity. See https://www.cl.cam.ac.uk/~pes20/armv8-mca/armv8-mca-draft.pd...
And of course, making the decompression 15-20% faster by optimizing the decompression code (which is usually not even written by the developers of said software but just some external library) won't even make a difference because 20% less than 5 seconds is still 4 seconds which is way too long for a program to start. Instead using a different compression algorithm that increases file size by 25% but decompression speed by 10x would actually start solving the problem, with the next step being to ask why the program needs to read so much damn data at the start in the first place.
But since NVMe SSDs and Intel CPUs with very high boost clocks are quickly becoming the norm now even for laptops I don't see much of that happening, because Firefox starts pretty quickly (~1 second) on those machines.
As for why so much data needs to be read... I just checked, and on Mac the main Firefox library (the executable itself is mostly a stub) is 120MB. So that's going to take a second or three just to read in at typical HDD speeds (faster on a good SSD), and then the dynamic linker has to do its thing on that big library, which is not instantaneous either.
Yes and no. Alpha is relevant because it is the only architecture where consume requires an explicit barrier, but then again, I think the revised C++11 memory model might not even be fully implementable on Alpha.
Consume primarily exists because acquire is very expensive to implement in traditional RISCs like POWER and 32-bit ARM, while consume can be implemented with a data or control dependency. AArch64 has efficient acquire/release barriers, so it is less of an issue.
/pedantic
acq/rel remains efficient on a platform that doesn't need barriers to implement it (ie, x86).
consume remains efficient on a platform that doesn't need barriers to enforce a dataflow dependency (ie, ARMv7).
This is because it requires individual bytes to be accessible in atomic way between multiple threads, which without BWX is not possible on Alpha, which started out with minimum 32bit read/writes.
Or if you work in finance, mining, oil industry, medical, biosciences and countless other fields where you need to get good performance out of your hardware. Yes, even if you use GPUs, they're no magic bullet, they also have their architectural bottlenecks, strengths and weaknesses.
Or if you care about power consumption.
There are a lot of reasons to optimize hot paths and inner loops. CPU single core performance isn't improving much anymore, and we need to make better use of what we have.
You could just as well for example say that front end Javascript developer almost never needs to understand event callbacks or how DOM works.
If you write multithreaded high performance code, yeah, you do need to know about cache coherency at varying levels of detail. Sometimes rough rules of thumb work; only occasionally do you need to understand all those annoying, performance-destroying details that leak through the cache abstraction.
Very succinct and correct. This is particularly pernicious with atomics, which people use for lock-free stuff such as queues and work stealing thread pools. If atomics used by different threads/cores share a cache line and you mutate them a lot, perf gets instantly fucked, sometimes worse than if you used a mutex in the first place. And if you don't benchmark, you aren't going to notice.
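A rough sketch of the false-sharing situation described above: two atomic counters either packed next to each other or each aligned to its own cache line. The 64-byte line size is an assumption (common on x86-64, not universal), and PackedCounters, PaddedCounters and hammer are hypothetical names. Both layouts give the same final count; only the timing differs, so you would need to benchmark hammer() on each to see the effect.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>

// Assumed cache-line size; check your target before relying on it.
constexpr std::size_t kCacheLine = 64;

// Two counters packed together: very likely to share one cache line.
struct PackedCounters {
    std::atomic<long> a{0};
    std::atomic<long> b{0};
};

// Same counters, each aligned to its own cache line.
struct PaddedCounters {
    alignas(kCacheLine) std::atomic<long> a{0};
    alignas(kCacheLine) std::atomic<long> b{0};
};

static_assert(sizeof(PaddedCounters) >= 2 * kCacheLine,
              "each counter should occupy its own line");

// Two threads each hammer their own counter; returns a + b afterwards.
template <typename Counters>
long hammer(Counters& c, long iterations) {
    assert(iterations >= 0);
    std::thread t1([&] {
        for (long i = 0; i < iterations; ++i)
            c.a.fetch_add(1, std::memory_order_relaxed);
    });
    std::thread t2([&] {
        for (long i = 0; i < iterations; ++i)
            c.b.fetch_add(1, std::memory_order_relaxed);
    });
    t1.join();
    t2.join();
    return c.a.load() + c.b.load();
}
```

In the packed layout every fetch_add by one thread forces the line out of the other core's cache even though the threads never touch the same variable, which is why timing the two variants is the only reliable way to notice the problem.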
What about Total Store Ordering (TSO), which is what e.g. the obscure and rare x86(-64) architecture implements (and SPARC as well)?
Well, that is until you enable gcc -O3 optimizations, which will move memory around, merge variables together, and other such optimizations that will follow the acquire-release model instead of TSO. Remember that the compiler has to consider the memory-consistency model between registers and RAM (when are registers holding "stale" data and need to be re-read from RAM?)
-------
The thing is, acquire-release is becoming far more popular and is the gold standard that C++11 has more or less settled upon. C++11, ARM, POWER9, CUDA, OpenCL have moved onto acquire-release semantics for their memory model.
Next generation PCIe 5.0, CXL, OpenCAPI, are all looking at extending cache-coherence out to I/O devices such as NVMe flash and GPUs / Coprocessors. I'm betting that Acquire/release will become more popular in the coming years. TSO is too "strict" in practice, people actually want their reads-and-writes to "float" out of order with each other in most cases, especially when you're talking about a PCIe-pipe that takes 5-microseconds (20,000 clock-ticks!!) to communicate over.
Yes, in a way it's a race to the bottom; code that works on TSO hw works on acquire-release hw, but not the other way around. There are only two ways to combat this race: education, and using concurrency libraries written by people who know what they're doing.
> acquire-release is becoming far more popular and is the golden-standard that C++11 has more or less settled upon
Hmm, how come? C++11 supports many different models, relaxed, acquire/release, and sequential consistency, with sequential consistency being the default for atomic variables. Now, acquire/release looks like a decent compromise between ease of hw implementation and programming complexity, but AFAICS it's not the anointed one true model.
To some extent I think that's a failing of the C++11 model. Instead of choosing one (sane) model, they made people choose between an array of models with subtle semantics. That's what the recent formal Linux kernel model did, although that's not ideal either, with the requirement to not be too different from the previous informal description and boatloads of legacy code. See http://www0.cs.ucl.ac.uk/staff/j.alglave/papers/asplos18.pdf
In general, it seems to me that progress is being made in formal memory models, and I hope that in some years time there will be some kind of synthesis giving us a model that is both reasonably easy to implement in hw with good performance, easy enough to reason about, as well as formally provable. We'll see.
This means that as a programmer, you still have to code against the release/acquire model because the compiler may reorder your memory accesses. Having TSO in hardware is still helpful though, because it means the compiler has to emit fewer explicit barrier instructions at the end. That is, the barriers that you do have in your original code end up being a little bit cheaper (at the cost of having an overall more complex hardware architecture).
Well, nothing will ever be "officially" blessed as the one true model. As the saying goes: we programmers are like cats, we all will be moving off in our own direction, doing our own thing.
Overall, I just think that "programmer culture" is beginning to settle down on acquire-release semantics. It's just a hunch... but more and more languages (C++, CUDA) and systems (ARM, POWER, NVidia GPUs, AMD GPUs) seem to be moving towards acquire-release.
And in the next few years, we'll have cache-coherency over PCIe 4.0 or PCIe 5.0 in some form (CXL or other protocols on top of it). A unified memory model across CPU, DDR4 RAM, the PCIe-bus, and co-processors (GPUs, FPGAs, or Tensor cores), and high-speed storage (Optane and Flash SSDs over NVMe) is needed.
The community is just a few years out from having a unified memory model + coherent caches across the I/O fabric. Once this "de facto standard" is written, it will be very hard for it to change. That's why I think acquire-release is here to stay for the long term. It's the leading memory model right now.
Or are we having a terminology confusion and you use the coherency term for software visible performance characteristics of caches generally? I do agree that understanding cache effects and their intersection with multiprocessing is generally important in perf work. As is understanding the architected memory model, which tells you what you can and can't rely on semantically.
Yes. When they're writing high performance multithreaded analysis software and they're deciding which cache lines to write to and which only to read from. Those lines you only read from can be in S (shared) or F (forward) state.
And why is this important? Performance characteristics of a line entering in M (modified) state are pretty bad, if the line is shared between multiple cores.
Perhaps you'll also want to do all reads at once, knowing the line will more likely remain in F state for short periods, instead of bouncing S/F line state between CPU cores?
NUMA also comes into play, making this mistake even more costly. You really want to keep inter-core (especially cross-socket NUMA!) communication to the minimum.
Of course, you could say you don't strictly need to understand MESI (or MESIF), but it really helps in understanding why you do things a certain way and in reasoning about multi-core performance. The thing is, you can say the same "you don't need to know this" about a lot of "low level details" in the software trade.
Just like you need to understand DOM as a front-end developer to minimize DOM changes, even if you don't access it directly.
In cache coherency case it's analogously about reducing unnecessary multicast and broadcast messages sent between cores.
So now we get to thinking about whether this gives the developer an advantage over just knowing "Dirtying cache lines across different cores/threads is slow". I don't think I would conclude so here.
But yeah, I like reading about microarchitectural details and other computer architecture topics, and am sympathetic to the point of view that knowing the "why" is nice. Just like I find it interesting to read about how DOM APIs are implemented in browsers and why they are hard to make faster...
I know that TSO makes it easier to recover sequential consistency with additional barriers (Intel strengthened their original memory model to TSO for this reason).
Acquire-release consistency allows many more reorderings in addition to store->load (load->load, load->store, store->store).
For more info see e.g. Table 5 in http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2...
The hidden thing behind all this is that even if the data is just read-shared, it can still generate traffic between cores and sockets.
Since these communication links are a shared resource [0], doing things wrong hurts performance in unrelated code and cores, just because a storm of cache coherency packets is being sent between cores.
So yeah, you really do want to minimize this to maximize performance and scalability across the whole system!
[0]: In Intel's case, this shared resource is ring bus inside CPU socket and QPI between CPU sockets.
Only because most software "engineers" don't give two shits about the actual user experience of their glacially slow over-engineered garbage.
Usually 1-7 are all you need. If you get all the way to 10 you are in the deep end for most things.
Big O is good for many things. But in reality big O is O(N)+C where the C can get you. That is where the later steps help. But usually you can totally ignore it. Most of the big wins I get are just from swapping out a bad search for an O(log(n)) search, or removing 'extra' code.
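For the "swap a bad search for an O(log(n)) search" case, a minimal sketch assuming the data is already sorted (or can be sorted once up front). find_linear and find_binary are hypothetical names; std::lower_bound is the standard library's binary search.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Linear scan: O(n) comparisons. Returns the index of `key`, or -1.
int find_linear(const std::vector<int>& sorted, int key) {
    for (std::size_t i = 0; i < sorted.size(); ++i) {
        if (sorted[i] == key) return static_cast<int>(i);
    }
    return -1;
}

// Binary search via std::lower_bound: O(log n) comparisons.
// Precondition: `sorted` is in ascending order.
int find_binary(const std::vector<int>& sorted, int key) {
    // Debug-only check of the precondition; compiled out with NDEBUG.
    assert(std::is_sorted(sorted.begin(), sorted.end()));
    auto it = std::lower_bound(sorted.begin(), sorted.end(), key);
    if (it == sorted.end() || *it != key) return -1;
    return static_cast<int>(it - sorted.begin());
}
```

For small inputs the linear scan can still win because of the constant factor mentioned above, so it is worth measuring before and after the swap.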
Fixing that did not require knowledge of cache lines.
While I agree that the details of StoreLoad are likely a distraction, the big-picture concepts of cache coherence presented in this article are table stakes for performant systems.
This is true for literally every hard problem in dev though, and the implication that you need to grasp everything just in case you need it is silly. The problem space in compsci is too big to know everything. We have to choose.
I encourage people I work with to read many of the books in this book series. I particularly encourage them to read “Hardware and Software Support for Virtualization”, since it’s basically a book on their job.
If you had to rank them in importance for the average engineer, how would you rank them?
Of course the minority that do need performance are where the real engineers are needed.
The lag is already noticeable. If I don't pay attention to performance, I know the end result will be slow and unpleasant to use.
It is a good framework to get you in the ballpark of the correct thing. Usually, 99% of the time, it is right. But sometimes the architecture bites back due to your data.