Parallella: A Supercomputer For Everyone (kickstarter.com)
For personal computers - desktops and laptops - I don't think we have a shortage of processor cycles. Even the minimal specs of the Raspberry Pi make it usable: 256MB of RAM, a 700 MHz CPU, a few GB of storage and enough network bandwidth to saturate a home broadband connection. What is compelling about the best contemporary personal computing devices is form factor: how easy is it to provide input, how nice is the screen and, if it is a mobile device, how heavy is it and does the battery last long enough.
Does a personal parallel computer really help me? At first blush, I am having a hard time seeing how. Clearly, there are CPU-intensive workloads that people have mentioned in this discussion - ray tracing is one. The video mentions robotics and algorithms. I have mixed feelings about that, since I personally believe the future of robotics lies in computation off the physical robot itself - aka cloud robotics. A use case I personally would find beneficial is the ability to run dozens of VMs on the same machine. Heck ... each of my 50 open browser tabs could run inside a separate VM. I know lightweight container technology has been around for a while, e.g. jails and LXC. But what about hypervisor-based virtualization, e.g. VMware, Xen, etc.? While the parallelization offered by this tech would be awesome, what seems to be missing is the ability to address lots and lots of memory.
Back in '06, I remember seeing fear in the eyes of some hardware and software engineers. Within the next year, we were supposed to have 100 cores in our plain old desktops. How the heck were we going to program them? I found the situation a bit irrational. Every talk started with the death of Moore's Law because we couldn't shrink dies any further. More cores were posited as the only solution. Except no one could code them for general purpose apps like Word, Excel, etc. In retrospect, I wonder why I don't have 100 cores in my desktop in 2012. I suspect it is because they aren't useful for the average Joe user.
P.S. Forgive my directionless rambling. I don't have a particularly strong opinion on this subject anymore.
That plus cheap access to a massively parallel computer could also be very interesting.
Except where Raspberry Pi + online storage could be useful to many, many people. Massive parallelism is probably only interesting to folks like us.
I suspect the problem is that it has no compelling (and immediate) "use case". If they could communicate a set of application ideas, then I suspect that a whole new raft of supporters would be happy to risk at least $99.
Also, the $3 million stretch goal is just waaaaay too far. Too bad that the better design is only offered at that level.
Hoping their funding drive succeeds. I like the fact that the ISA is being fully documented and that we will have a fully open-source toolchain to work with the system.
(Disclaimer: Not associated with Adapteva in any way).
But on purely geek terms this thing seems to warrant a "holy shit":
http://www.adapteva.com/products/silicon-devices/e64g401/
Again I don't know how (un)common that sort of thing is but I wasn't expecting to see 64 cores in that tiny form factor. Does anyone here know how cutting edge this thing is if at all?
[Edit]
Also does anyone here want to address use cases for this thing?
It is not really a performance designation. It doesn't define a certain architecture or design.
It is pretty clearly an economic designation.
In general: money. Buying more of the most performant equipment available.
So. It's an economic designation.
(though I see the more informative "50 GFLOPS/Watt" below... and I like the prospect of something that would make it cheap to play with large scale real time neural nets...)
That the cores don't run in lockstep could be shader heaven! I'm imagining using the cores in a pipeline with zoning, so that some core 'owns' some tile of the screen and does z-buffering, another core does clipping of graphics primitives for each tile, and a sea of compute nodes between them chews up work and pushes it onwards.
The cores could also act as a kind of spatial index, passing rays to other cores as they propagate beyond the AABB belonging to a core.
Doubtless it wouldn't work like that. And wouldn't work well. But it's fun thinking about it! :)
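For what it's worth, the tile-ownership idea can be sketched in a few lines. This is only an illustration: the 8x8 grid matches the 64-core part discussed in this thread, but the screen size, the tile layout and the name `owner_core` are assumptions for the sake of the example, not anything from Adapteva's SDK.

```python
# Hypothetical sketch: map a pixel to the core that "owns" its screen tile.
GRID_W, GRID_H = 8, 8            # 64 cores laid out as an 8x8 mesh (assumed)
SCREEN_W, SCREEN_H = 640, 480    # assumed framebuffer size


def owner_core(x, y):
    """Return the (row, col) of the core owning the tile that contains pixel (x, y)."""
    if not (0 <= x < SCREEN_W and 0 <= y < SCREEN_H):
        raise ValueError("pixel outside the screen")
    col = x * GRID_W // SCREEN_W   # integer tile index along x
    row = y * GRID_H // SCREEN_H   # integer tile index along y
    return row, col
```

With these assumed sizes each core owns an 80x60 pixel tile, so a primitive only needs to be forwarded to the cores whose tiles it overlaps.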
I can see this platform being a good tool for students and researchers to experiment with algorithm speedups by making their sequential code parallel.
In my parallel programming class, our teacher had to rig together a computer lab connecting 12 quad-core computers to simulate a 64-core cluster. Then again, a 64-core cluster of Parallella boards would cost something like $7000. You can get the same 64-core setup by buying 8 x 8-core consumer desktop computers for under $3000, which would still be more cost effective and would probably have ten times more computing power because of the x86 architecture.
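A quick back-of-the-envelope check of the per-core cost, using only the rough dollar figures quoted in the comment above (they are the commenter's estimates, not vendor prices, and the helper name is made up for illustration):

```python
def cost_per_core(total_cost_usd, cores):
    """Dollars per core for a cluster with the given total cost."""
    return total_cost_usd / cores


# Figures as quoted in the comment; both setups have 64 cores.
parallella_cluster = cost_per_core(7000, 64)   # 109.375 USD per core
desktop_cluster = cost_per_core(3000, 8 * 8)   # 46.875 USD per core
```

This says nothing about per-core performance or power draw, which is where the two setups differ most.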
It is a more powerful expression of the benefit of scaling with parallelism. Principally, instead of scaling speed with respect to a fixed data size, you scale the data size with respect to a fixed speed.
Having more cores means you (sometimes) can have more data. You still need those parallel programmers with their parallel algorithms though :-)
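The two ways of scaling described here are usually written down as Amdahl's law (fixed data size) and Gustafson's law (data size grows with the core count). The comment does not name them, so treat that mapping, and the 95% parallel fraction below, as assumptions for illustration.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Speedup on a FIXED problem size when only part of the work parallelizes."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / cores)


def gustafson_speedup(parallel_fraction, cores):
    """Scaled speedup when the problem size grows with the number of cores."""
    serial = 1.0 - parallel_fraction
    return serial + parallel_fraction * cores


# Assumed workload: 95% of the run time is parallelizable, on 64 cores.
fixed_size = amdahl_speedup(0.95, 64)      # roughly 15x
scaled_size = gustafson_speedup(0.95, 64)  # roughly 61x
```

Under these assumptions the same 64 cores look about four times more useful when you let the data grow instead of insisting on a fixed problem size.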
And yes I get that it's open source blah blah blah, but this project is certainly part of the plan for an institutionally-funded business to make money. Adapteva is a .com, not a .org.
Separately: if Adapteva is only 8 months from delivering completed product to users, shouldn't they be able to raise more funds through traditional channels? They clearly have/had VC buy-in and can raise through institutional channels. If they are just finishing the final debugging/SDKs/etc. at this point, it's not a good sign that they can't raise another $750k from their existing backers to cover final launch costs.
I don't have a horse in this race, but it doesn't feel quite right to me.
http://www.youtube.com/user/GreenArraysInc?feature=CAQQwRs%3...
I would totally agree that memory constraint is sort of tied to manycore architectures, but in this case I find it pushed to the limits.
If the Kickstarter falls through, what options could you still make available to hobbyists? Is there some version of your current prototype setup that you could sell, even if it's not one convenient board?
And if that is so, would expecting an Erlang compiler be out of the question? :)
Having looked at the data a bit more: I like their specs concerning system balance. 100 GFLOPS over 6.4GB/s gives you a system balance of 15.625 FLOPs per byte of memory traffic. That's about the same balance as a Westmere Xeon - pretty good for real world algorithms.
For comparison: NVIDIA Fermi has a system balance of about 20. Meaning: Fermi becomes bound by memory bandwidth sooner, and memory bandwidth is very often the limiting factor in real world computations.
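The balance figure is just peak compute divided by memory bandwidth. A minimal sketch, using the numbers quoted above and treating the ratio as FLOPs per byte; the function names and the example kernel intensity of 17 are made up for illustration:

```python
def system_balance(peak_gflops, bandwidth_gb_per_s):
    """FLOPs the chip can perform per byte it can move to or from memory."""
    return peak_gflops / bandwidth_gb_per_s


def is_bandwidth_bound(kernel_flops_per_byte, balance):
    """A kernel that does fewer FLOPs per byte than the balance is limited by memory."""
    return kernel_flops_per_byte < balance


epiphany_balance = system_balance(100, 6.4)  # about 15.6, from the quoted specs
fermi_balance = 20.0                         # "about 20", as quoted in the comment

# Hypothetical kernel doing 17 FLOPs per byte of memory traffic:
bound_on_epiphany = is_bandwidth_bound(17, epiphany_balance)  # False
bound_on_fermi = is_bandwidth_bound(17, fermi_balance)        # True
```

A lower balance means more kernels stay compute-bound, which is the sense in which the lower number is the better one here.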
One thing though: High Performance Computing is all about software / tooling support. If this company comes out with OpenCL in C (even better Fortran 90+) support, then we're talking.
Edit: By similar 'range' I meant core per mm^2 ratio.
For example, one particular embedded 40nm GPU design that I know about can deliver about 25 GFlops or so in the same die area.
edit: No OpenMP support.
Tilera did a very similar-looking 64 cores on a chip in 2007, which is the oldest instance I know of off the top of my head. Their devices cost (or at least they used to) a few grand though. Tilera has bumped it up to around 100 cores per chip these days. I don't know anything about either architecture, so it is hard to say how 64 1GHz Adapteva cores compare with 64 1.5GHz Tilera cores.
So not quite cutting edge, just an under-explored side channel.
Dedicated machines to host backend applications -- SQL servers, Apache, nginx, etc.
http://www.anandtech.com/show/2918/2
That first picture shows 4 cores made of 4 sub-cores with 32 processing elements each. Now Nvidia would claim each of those 32 processing elements is a core, but those cores cannot act independently. So it is more like a very wide, very hyper-threaded 16-core processor.
This was our thought process:
We have received a lot of negative feedback regarding this number so we want to explain the meaning and motivation. A single number can never characterize the performance of an architecture. The only thing that really matters is how many seconds and how many joules YOUR application consumes on a specific platform.
Still, we think multiplying the core frequency (700MHz) by the number of cores (64) is as good a metric as any. As a comparison point, the theoretical peak GFLOPS number often quoted for GPUs is really only reachable if you have an application with significant data parallelism and limited branching. Other numbers used in the past for processors include: peak GFLOPS, MIPS, Dhrystone scores, CoreMark scores, SPEC scores, Linpack scores, etc. Taken by themselves, datasheet specs mean very little. We have published all of our data and manuals and we hope it's clear what our architecture can do. If not, let us know how we can convince you.
That said, I still think that the GHz stat is just about as BAD a metric as any (I suppose "pin count times # of cores" would be worse :-). About the only positive inference I can draw from this is that you have the thermal situation in your system under control.
But piling up cores and cooling them is, IMHO, one of the easiest parts of designing a massively parallel system. The interesting part of the design is the interconnections between the cores, and any metric that multiplies single core performance by number of cores tells me nothing about that.
So not only am I not learning a key part of the performance characteristics of your system, but by omitting it, you make me wonder whether the ENGINEERING of the system might be similarly misguided on this aspect as the MARKETING seems to be (i.e. does marketing omit this aspect of the system because it was not important to the engineers either?).
Linpack at least has benchmarks both for showing off the cores in nearly independent scenarios, and for showing the system when actual communication has to take place. Obviously, each parallel application is different, but you could at least show ONE indication of performance in situations that are not embarrassingly parallel (http://en.wikipedia.org/wiki/Embarrassingly_parallel).
http://www.adapteva.com/white-papers/using-a-scalable-parall...
Corner turns for 2D FFTs are usually quite challenging for GPUs and CPUs.[ref] Yaniv, our DSP guru, completed the corner turn part of the algorithm with ease in a couple of days, and the on-chip data movement constitutes a very small portion of the total application wall time (complete with published source code if you really want to dig).
It's hard to market FFT cycle counts to the general audience:-)
I just take issue with raising money from unsophisticated unaccredited investors without even providing complete disclosure or binding contracts in return. I know a lot of companies do it, and I dislike it in those cases too. I also know I'm in the minority here and that it's only a matter of time before companies with huge VC backing & public companies are using Kickstarter to raise money. I think that's a bad thing, but others disagree.
Also, quickly:
- I'm not against for-profits using Kickstarter; most of the efforts there are for-profits. But companies that have raised millions of dollars probably should disclose that fact prominently in their campaigns.
- Similarly, the fact that you've been denied investment by >50 institutional investors is relevant in asking for money. It might be positive for some, negative for some. But it's likely not going to be a no-op for most.
- My figures come straight from Crunchbase, I'm not more connected than that.
>> Kickstarter funds are "cheap" (no dilution & no debt)
Of course they are, and that's kind of my point. Raising money from unsophisticated unaccredited investors without providing full disclosure or even a contract in exchange is obviously a great source of capital. However, I'm not convinced pitches like this would withstand scrutiny by the relevant regulators if they were not asleep at the switch.
I didn't mean to be overly critical of the project. I wish them all due success. That doesn't mean I can't dislike the Kickstarter campaign. (I would similarly dislike AMD raising funds on Kickstarter, even if I liked the project.)
Defining it on any architecture or performance metric is just pointless because the march of time renders such things utterly moot. Remember when PlayStation 2s were "supercomputers"? Please.
I agree that you're not going to get a "supercomputer" for less than about $100,000. But supercomputers are defined by what they can do. Their cost is secondary. Necessary in a world without magic, but secondary. I can spend $100,000 on a computer, but that alone does not make it a "supercomputer".
I suspect that the people who would be happiest buying something like this are going to be very technical, not just USING Linpack, FFTs, neural networks, or HMMs on a regular basis, but used to IMPLEMENTING them as well. This audience is definitely going to want red meat like the paper you're linking to.
With the Kickstarter campaign, you may also get customers who just think it's cool to own a supercomputer, but when they realize they can't run Crysis on it, they may be disappointed.
So: you can get large amounts of performance with a simple architecture, but only for some problems, and graphics is not in that set of problems.
- not simple SIMD. NVIDIA calls it SIMT (single instruction multiple thread), mostly since you can branch a subset of them, so for the programmer it does feel somewhat like threads.
- not just optimized for Graphics anymore. E.g. since Fermi, the Tesla cards have DP performance = 50% of SP - which has been specifically introduced for HPC purposes. They have also constantly improved the schedulers to go more into general purpose computing, e.g. Kepler 2 seems to support arbitrary call graphs on the device. Again, that's useless for graphics.
- suitable for pretty much all stencil computations. Even for heavily bandwidth-bound problems GPUs are generally ahead of CPUs, since they have very high memory bandwidth. The performance estimate I use for my master's thesis comes out at 5x for Fermi over a six-core Westmere Xeon for bandwidth-bound problems and 7.5x for compute-bound problems.
HPC is all about performance per dollar, performance per watt - and (sadly) sometimes Linpack results because some institution wants to be at the top of some arbitrary list. In all of these aspects GPUs come out ahead of x86, which has been very dominant since the '90s. Which is why GPUs are now in 4 of the top 20 systems - each of which represents hundreds of millions of dollars in investment. That wouldn't be done if they weren't suitable for most computational problems.
And as for SIMD/SIMT, I mentioned SIMD mostly in relation to operations on short vectors done by one thread. That is mostly irrelevant to the overall architecture of the core, as it can very well be implemented by pure combinational logic in one cycle given enough space. My mental model of how a modern GPU core (physical, not logical) works is essentially some kind of simplistic RISC/VLIW design with a large number of registers, with the compiler and/or hardware interleaving instructions of multiple threads into one pipeline. That may or may not be how it actually works, but it looks probable to me.
In my opinion most chips like Epiphany IV or XMOS or whatever, in contrast to GPUs, are useful for only limited classes of workloads, as they tend to be memory starved.
Here's a simple test.
Suppose that I give you these two measurements:
SPECmark: 1.2 million.
SPECmark: 3 million.
Which one is the supercomputer? Without knowing the date, there's simply no way for you to tell.
Suppose instead I write:
System cost: $20 million 1990 dollars
System cost: $400 1990 dollars
Which one is the supercomputer? I think most people will be able to pick which one is which.
The major difficulty with your second list is that expensive computers don't need to be high performance. Consider the computers that go into satellites and spacecraft. They are extremely expensive, but not high performing.