PetaBricks: A Language and Compiler Based on Autotuning

NCSU Creative Services
Saman Amarasinghe, Computer Science, MIT
October 17, 2011

Male Speaker: All right. So, let's get started over here. It's my pleasure to introduce Saman Amarasinghe. Was that close enough? Oh, sorry. He leads the compiler group at MIT, and you may know him for L.L. Briggs [ph], for DynamoRIO, for StreamIt, or for some of the other projects he has led, which are well known in the community. He got his master's from Cornell, right? Sorry, it was your undergraduate; the bachelor's is from Cornell, and both the master's and Ph.D. are from Stanford. So, all yours.
Amarasinghe: Okay, thank you. I am doing a small bait and switch with the title. The title that was advertised was probably a lot more interesting, but it's almost the same material; exactly the same material. What I'm going to do is start by telling three side stories. These are interesting experiences that have led to the language and compiler I'm going to talk about, and they also give a more general, global view of where I think the world stands with respect to high performance. Then I'll give three observations that directly led to the language, and then I have a bunch of material; let's see how much I can cover within an hour.
The first side story is about how current programmers deal with performance. If you look at today's programmers, for the last 10-20 years they have been very happy with the world they got. Most of them were completely oblivious to the processor their code was running on, because Moore's law gave all the performance the average programmer needed. So I call him the average programmer, the Joe programmer. For him, about a 50%-60% improvement in performance every year was good enough, more than good enough, and he was pretty happy. What that means is we were able to build a solid boundary between hardware and software. If you look at something like Java, it basically abstracts the hardware away from the program. You don't even know where your program is going to run; it's going to run on some machine out there. And that was really nice for these people. This abstraction provided a lot of freedom for the average programmer. They were able to build very complex programs and worry about the features of the program, not where it's running or how much performance they're getting. So this is where the average programmer was. And then there were, in places like research labs, a few people who really cared about performance; these were few, probably even less than a percent, and they were the ones going all the way down, dealing with hardware issues and trying to get all the performance. So this boundary was really there for 99% of the people; they could just live above it and not think about what was below. This was all made possible by Moore's law.
In a talk like this I'm obligated to show this slide. What it shows is that every 18 months the number of transistors keeps doubling; that's basically what Moore's law says. I probably need to update this slide, but this line has continued. What happened is that, until about 2005, this line basically tracked performance; this is the SPEC performance for all these machines. As the transistors doubled, performance actually doubled as well for a while, and now this is slowing down. But before I get into why it slowed down, I want to look at this performance phenomenon. All of us, and for most of my lifetime I have lived in this world, got a doubling every 18 months. We have gotten a huge amount of performance.
So the first question is: when you have an excess of something, no matter what it is, people normally squander most of it. How have we squandered the Moore's dividend? The key fact is that we have gotten about a 10,000x performance gain over these 30 years. Where did it go? Are we using it? Of course, if you look at the modern software era, a lot of it is useful, but a lot of it is actually wasted. If you look at the last decade, we concentrated heavily on everything except performance. We looked at things like correctness and programmer productivity; that's where all the research was done, that's where a lot of the teaching was done, and if you look at languages and tools, they all reflect that. Basically, everything except performance. In fact, software engineering is probably the only engineering discipline out there where performance and efficiency are not the central theme. If you describe the way we do software engineering to somebody like a mechanical engineer or a civil engineer, they will think you're crazy: those fields are all about doing more with fewer resources, and we are basically about doing less with more resources. And the interesting thing is, this is reflected through our entire community.
So I teach a software engineering course, and at some point I was looking at all the things I am teaching our students and I said, wait a minute, let me see what happens when I apply some of the things we talk about. So I took Matrix Multiply. I had a bunch of lectures about immutable types, dynamic dispatch, object-oriented programming, all done very nicely; this is what our students are learning these days, and probably what a lot of you learned. I took Matrix Multiply, used all these great concepts, ran it on a basic 2K-by-2K matrix, and got a number; it was in fact pretty darn slow. So I said, okay, this doesn't look good. Immutable types are not something you want in Matrix Multiply, because they make it O(n^4): you have to keep copying the matrix. That's not nice. So I got rid of immutable types, and suddenly I get a 220x performance improvement. This is probably where most people would actually start; they wouldn't use immutable types here. Then I said, look, I was doing this nice dynamic dispatch because I wanted different types of matrices, with integers and floats. Wait a minute: for matrices you don't need dynamic dispatch; most matrices are doubles, so just use one type. If I do that, I get another 2.2x. Then I said, a matrix is a matrix; it's just a nice two-dimensional array. You don't need these fancy objects, so I got rid of the objects: another 2x improvement. Now I have basically gotten rid of everything I was teaching in that class; we're back to a very simple language. Then I said, okay, we might as well change the language itself: I was in Java, let's go to C. And to my surprise, copying that almost exact piece of code to C gave me only a 2.1x performance increase. So it's like, wow, this is interesting: Java has a good JIT compiler. On the other hand, had I done this in Python or PHP or JavaScript, I'd guess it would be a couple thousand x slower. At this point I said, wait a minute. This is interesting, but when I was a graduate student a long time ago, there was a huge amount of emphasis on getting good performance; we actually went and hacked the code to get good memory-system performance and things like that. Let me try to remember how we did all those things in the good old days. If you're doing Matrix Multiply, you transpose one matrix so you get unit stride: another 3.4x. You tile for cache locality: another 1.7x. You vectorize: another 2.8x. You prefetch: another 2.7x; at that point I started using the BLAS library, because I didn't want to write prefetching code by hand. And then finally, parallelization: that got 3.5x. The interesting thing is, compared to where we started, I now have a 296,000x performance gain for this program. So this compares the typical software engineering approach, very high-level, object-oriented, all these nice cool things we are teaching everybody and what they're practicing these days, to what a good performance engineer would do: looking at every ounce of performance, looking at all the architectural details, and getting really good performance. A huge difference.
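To make the kind of transformations described here concrete, the following is a minimal C++ sketch (not the code from the talk; the tile size, layout, and names are illustrative) of two of the steps: transposing one operand for unit-stride access, and tiling for cache locality.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

using Matrix = std::vector<double>;  // row-major n x n

void multiply_tiled(const Matrix& A, const Matrix& B, Matrix& C, std::size_t n) {
    const std::size_t T = 64;        // tile size; an autotuner would pick this
    Matrix Bt(n * n);
    for (std::size_t i = 0; i < n; ++i)      // transpose B so the inner loop
        for (std::size_t j = 0; j < n; ++j)  // walks both operands with
            Bt[j * n + i] = B[i * n + j];    // unit stride
    for (std::size_t ii = 0; ii < n; ii += T)   // sweep the output in tiles
        for (std::size_t jj = 0; jj < n; jj += T)
            for (std::size_t i = ii; i < std::min(ii + T, n); ++i)
                for (std::size_t j = jj; j < std::min(jj + T, n); ++j) {
                    double sum = 0.0;
                    for (std::size_t k = 0; k < n; ++k)
                        sum += A[i * n + k] * Bt[j * n + k];
                    C[i * n + j] = sum;
                }
}
```

Each step is mechanical once you see it, but nothing in the high-level object-oriented version even exposes the opportunity.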
A lot of times when I show this, people are surprised by how much there is. Matrix Multiply is probably not typical; you can't do this to every program. But this is huge. And a lot of times when we look at performance we also look at power, or energy; power is an interesting issue. Consider fuel consumption: the gallons-per-mile difference between a supertanker and a small car is only about 14,000x. So our 296,000x is like 20 supertankers. If somebody came along with a technology saying, look, I can run 20 supertankers on the amount of fuel required to run this small car, it would be a revolution; we'd have solved the world's energy crisis. And that's the kind of factor that, at least in this Matrix Multiply example, we are just wasting.
So, the first point is that there's a lot of performance we left behind. Also look at the stack we have built. On today's machines: at the bottom we have the hardware, then a virtualization layer on top of that, an operating system running on top of that, a browser on top of that, an interpreter on top of that, something like JavaScript on top of that. If you look at that stack, we are just wasting a huge amount of resources. So even though we feel like we don't have any performance left, we have a lot that we left behind, and when we design the next systems, if we design them well, we will actually be able to gain a lot. The reason we have to do this is, as I was alluding to before, performance has started flattening out; it has basically flattened, and in fact it's almost going down for single-thread performance. We are not getting single-thread performance gains anymore. You can't just wait a year and say, look, I wrote the program, it's running slow, wait until the next generation of processors. That's how it used to be. It's not going to work anymore.
Performance has to come from somewhere else. If you want performance, you'd better use better languages, you need to learn disciplined programming; you can't just use whatever is easiest. And/or you do performance engineering: you actually engineer the program for performance. And of course the other way of getting performance is parallelism. What has happened is that Moore's law has morphed into providing performance through parallelism: every generation, you're not getting additional single-thread performance, but you are getting additional cores. Every 18 months you double the number of cores. So if you want performance, you'd better use those cores. Now the key thing is that performance is parallelism. But the problem is, this is really hard. Take these programmers who were completely oblivious to performance; now not only do they have to deal with performance, they have to deal with parallelism. They have to go all the way down and deal with all these complexities.
Putting it more graphically: we had this performance-oblivious programmer on a nice flat road, no issues. And now we've said, okay, you had a nice 20 years on a flat road; now go climb this mountain. Goodbye. That's basically what we are doing, because we handed the whole problem to the programmers: okay, you worry about all those issues. The interesting question is: is there a better trajectory? We'll never get the flat trajectory back; those good old days are gone, and you can't just ignore the change. But can you ask programmers to do something less than deal with every problem? One thing we have been looking at is: can we ask programmers to deal only with concurrency? You say where the program can run in parallel, but you don't have to do all the performance optimization; the compiler will try to get that. We are trying to find the right middle ground, because I don't think the old way is going to work anymore, this is not sustainable, and we need a better picture.
One way to do this is to separate the problem in two. First is parallelism extraction: you want to find the things that are parallel. The interesting observation is that most programs are written to simulate something in the real world, and that real world is actually parallel. So there is a lot of natural parallelism in there, and by the time we write the program in a language like C or Java, we have basically obscured it. Can we get it back without obscuring it in the first place? There is also a need for parallel thinking: things like theory, algorithms, and data structures basically have to be taught in a parallel way. I'll get back to this point a little later. And then there is parallelism management. Once we know the parallelism that is available, we need to map it to the machine and get good performance. That, I think, is something the compiler should handle. The key thing here is: when there is parallelism, let the users provide it. If there is no parallelism available, you shouldn't expect the system to invent it; that's a non-starter. But for the easy part, let the programmers express the parallelism, and then let's build a system that can manage it and actually lead to good performance. That's basically the big motivation for a bunch of my work.
The next side story I want to tell is about future-proofing software. This slide is out of order, so bear with me. Look at the Joe programmer: if he wrote a program in, say, 1970, the program will still work, and not only will it work, it will work faster. These programs still run faster because they were written against the same basic abstract machine model, a simple von Neumann model, and since machines got much faster, the programs did too. Except the new programming model out there is not the simple von Neumann model anymore; we have these multicores, and they keep changing. The new reality is that machine models are going to keep changing a lot, and if you write a program for today's multicore, it's not going to work well on tomorrow's multicore, because they are very different. So future-proofing programs, for Joe, suddenly becomes really hard. And I'm going to go back in my slides for a moment. The programmers in those research labs I talked about have had this issue for the longest time, because they were trying to get the last ounce of performance. That means every time there was a new machine, they had to target it. At some point it was a vector machine, so they vectorized; next they got a shared-memory machine, so they wrote OpenMP; then they got a distributed-memory machine, so they wrote MPI. They have been re-porting programs year after year, and that's a huge problem. Put simply: most software lasts 30+ years, while in a research lab the lifetime of a computer system is less than six years, and every three years you get a new system. So every three years you re-port, and very soon the program atrophies: the original program was written beautifully by a physicist, then somebody hacked it, and somebody else hacked it, and after about three or four levels of hacks the original meaning is very hard to figure out. And this is reflected here. Whereas if you look at production software on the consumer side: if you look at even the latest Microsoft software, when there's a vulnerability, you often see the same vulnerability exists in Windows XP. That means a piece of code set down 20 years ago is still running. That's very difficult to do in this new era; we can't write a piece of code and expect it to survive that long.
So what happens when there is no stable machine model anymore? Processor types keep changing, generations keep changing, and this is the first time an exponential is exposed to software: the number of cores. Before, the exponential was in transistors, which were not exposed to software, and we didn't care about it. But now the number of cores is exposed, so the software has to deal with something that doubles roughly every two years. That's a really big change. So, can we do for performance what Java did for portability? With Java we said write once, run anywhere, and we got functional portability. Now we need performance portability, and this is really, really hard. Why is it hard? Any time anyone designs a language, there is a big tension.
On one hand, of course, you want really good performance right now. The way you get good performance is to expose a lot of architectural detail all the way down: okay, I will expose everything you need to get good performance. You can do that; you can write the program and it will run really, really fast. But what that means is you get stuck in a local minimum, and if the next machine is somewhere else, there's no way you can just climb out and get there; you need a heroic effort. There's no way to do it automatically, because you might not even have the information anymore. For example, look at MPI, the message passing interface. It exposes all the granularity of the machine, all the hierarchy and things like that, and you program to that. It works beautifully today, but tomorrow it may not work at all. That's the hard part. On the other hand, you can try to go much higher level and hide those architectural details. But if you go too high, you're above the clouds and you don't see the landscape. You might land on top of a mountain rather than at the bottom, because you can't see what's going on. A lot of languages, things like Lisp-style languages, raised the level of abstraction too high and were never able to get good enough performance. Those programs lasted forever, because they were not tied to anything, but on the other hand they never gave good performance either. Something like High Performance Fortran: a beautiful language, with a lot of abstract constructs, and you will uniformly run slow, no matter what. So to be effective at future-proofing, you have to strike a very delicate balance. You should restrict choices when a property is hard to automate or constant across architectures: if every architecture is going to do the same thing again and again, you don't want to abstract it out; abstracting it serves no purpose. Expose it to the user. A language like C did a beautiful job, across all the von Neumann architectures, of hiding the differences, things like the number and types of registers, while exposing the common properties, like the fact that executing one instruction at a time is cheap. On the other hand, if something is automatable and variable, meaning it varies between machines, you hide it from the user, hoping you can automate it. In C, register allocation was hidden from the user in the hope that it could be automated; the early compilers were not that great at it, but they became better and better, and at some point they became better than what a person can do by hand. So the key is finding that balance; that's really critical, and going forward, we need to do it well.
The third interesting story, before I get into what we are doing, is the evolution of programming languages. How did we get here? The reason I talk about this is to motivate the fact that we need new languages, and why. If you look at the first computers, they were very limited in power. In fact, one of the most complicated, most daunting tasks those computers did was compiling programs. So you had to make life easy for the compiler; you couldn't make the compiler's job too difficult. And the way you make the compiler's life easy is to create languages that don't give the compiler many choices. You tell the compiler exactly what the machine has to do, the compiler does it, and we're good. So when we had an algorithm, instead of giving anything high level, you told the compiler: do this, do this, do this, in this very specific way, and you'll get the result. You gave it exactly one algorithm, exactly one order through the program, one data layout, one kind of parallelism if any. It was beautiful. Think of it this way: in your brain you know a bunch of ways of solving the problem; you found what you thought was the best way and mapped it directly into the space of programs. Say green means "runs really fast" and red means "runs slow": you found a good enough point in that space and mapped to it. Do it this way and you'll get good performance; and you got the performance you cared about. What happened as time progressed? Computers became powerful, which means compilers could do more. So we said: that's the program you gave us, but let me try to figure out whether there's something better I can do for it; can it run faster? From the point you provided, the compiler started looking around: can I improve the register allocation strategy, can I improve the memory layout strategy? It looked at different things, trying to find a better point nearby; within that neighborhood, it could find something better. That's where we started. But in the last decade, with optimizing compilers, we had a lot more compute power, so we decided we needed to find a lot more things to do, and we started doing heroic analyses. The language gives you this one way of doing things, and we try to recover the other possibilities: expand, expand, expand. The problem is that by the very choice of writing the program one way, you have eliminated a lot of knowledge. The programmer had a lot of ideas in their head, but by choosing this one, all the others are no longer available to the system. So even with heroic analysis, there's a limit to what you can do. This is where we decided we needed to rethink the language. The way we approached it was twofold. First: if you know multiple ways of solving a problem, we ask you to give us all those different ways. That means we ask you to express the intent of the program, not one method of solving it; tell me what you are trying to solve, don't tell me exactly how to solve it. The second important realization is that muscle basically outpaces brain. Computing keeps doubling every 18 months, while the human brain is still what it was; we haven't doubled our capacity. So don't solve the whole problem in your head; build systems where the machine's power gives you more and more performance. How do you do those things? That is basically what led to PetaBricks.
Continuing the theme of telling stories, I'm going to give you three direct observations that led to the language. The first is that algorithmic choice matters. I'm a compiler guy: I take somebody's program and make it run really fast if I can. And after I do all these creative transformations and get it to run faster on my machine, I go talk to whoever wrote the algorithm, or some other numerical guy, and say: look what I can do with this algorithm, I can run it this fast. And a lot of times they look at it and say: wait a minute, if you're going to run it on that machine, I wouldn't use that algorithm; there's another one that works much better in that regime. That's really frustrating, because I do all this work and they tell me it's useless, they would do something different. What that means is the realization that, as we go about solving a problem, there are many solutions, and some solutions are better under some circumstances. For large input sizes you might want one thing; and once you bring in parallelism, for most problems there's no such thing as the absolute best algorithm, no matter what. So how do you take advantage of that? This becomes an even bigger issue when you go to multicores, because multicores expose that exponential parameter, the number of cores. The best algorithm for one or two cores is not the best one for four cores, and really not for 16, and not for 1,000. Algorithms have to keep changing, and if you pick one, you're stuck at one data point. So there is no single algorithm that can handle all cases well, and we had to deal with that at the language level. That's the first observation.
The second observation, which I alluded to before, is that the world is parallel, and many people have figured this out. Look at mathematicians: they describe things in parallel. They have set notation, sigma notation, simultaneous equations; these are very nice parallel ways of describing things. For some reason, computer scientists decided parallelism is hard: we will come up with a sequential abstraction, and that will simplify our lives. Everything we do in theory basically sequentializes the world. Statements execute sequentially. If you have a set, you say: oh, it's not a set, it's a loop; it goes from 1 to n, in a chosen order. Even a recursive definition: given f(n), we can give you f(n + 1). We took a parallel world and imposed a sequence on it. And this worked wonderfully for a while; it let us limit complexity in many problems. But now it's really coming back to bite us, and we have to redo it. Some of it is not that difficult. When I look at a lot of programs, I find the problems weren't really sequential: the problem was parallel, you sequentialized it in C, and then you're asking the compiler to re-parallelize it. That doesn't make sense. So how do you make parallelism easy? The idea is to make parallelism the easiest way to describe something, and make sequential code harder to write than parallel code. Sequential execution should be the special case that you have to work a little harder to get, rather than the other way around.
The third observation: when I started writing compilers, we had models. We knew what the hardware was going to be; there was a processor, and we knew exactly how the processor worked. If you had a nice manual, Intel or IBM or whoever told us what it does, and we knew what the compiler did. These days, the systems are way too complicated. Sometimes Intel doesn't even know what their processors are doing; some people there say they get surprised by what happens in the processor. And a compiler like GCC has 30+ passes; it's really hard to know, if you do something early, what's going to happen later, or why it's doing something. There are a lot of subtle interactions going on all over the place, and there are thousands of knobs in there. But computers are cheap, and in fact we have all this parallel resource available. So instead of trying to build these complex models in our heads, why don't we do end-to-end execution, observe it, and use machine learning to figure out what's best? How can we use machine learning to simplify our lives? Those are basically the three driving forces behind PetaBricks. For the next part, I'm going to describe the PetaBricks language a little, give an example, talk about the compiler, and cover two extensions we have made to the language, trying to give you a feel for where we are going. Some of this is still fairly early work, and you will see that in the results.
So let's look at PetaBricks. Here's a simple matrix multiply. What it says is: you take a Matrix A and a Matrix B, and you produce Matrix AB. The way you do it is with this piece of code, the rule, that calculates a cell. The rule says: I'm going to calculate a cell of AB, and I don't tell you which cell it is. I don't give you any order. All the cells have to be calculated this way; you figure out how to do it. And of course you calculate a cell by taking a row and a column and computing a dot product. So what I have given you is an implicitly parallel description of matrix multiply. I didn't specify an order; the system has the freedom to do it in any order it wants. If I did need a sequential constraint, I would have to do some additional work to say: by the way, some of the orders are not allowed. So I made parallelism the easy thing to express. The next thing I said was: okay, if you know other ways to do matrix multiply, tell me those too. You can do matrix multiply by taking two regions, doing two smaller multiplies, and adding the results together. Or you can do it by subdividing the matrices a different way and calculating submatrices. Or other decompositions; there are multiple ways. In fact, given the first rule, our compiler can derive these automatically, so you don't really have to give these three. However, that's not all the ways you can do matrix multiply. There is Strassen's algorithm, which runs asymptotically better, and it's very different. I don't know of a single compiler that, given the simple matrix multiply, can automatically derive Strassen's. And the worst part is that Strassen's is not always the best way either: it has bad cache behavior, and sometimes blocked versions are better. There's no way in a modern language to get that point across. In PetaBricks, it's just another way of doing the computation; the compiler figures out what works best. So what we have done is make algorithmic choice a key aspect of the language. We make it easy for programmers to give us multiple ways of doing things, and the compiler can use these multiple ways and figure out the right hybrid, basically a poly-algorithm, that combines multiple algorithms to get the best performance. And that can change: given different data, given different architectures, it can do something different. We also made a language where we synthesize the outer control flow. We can iterate by rows, columns, diagonals, reverse order, in parallel; we can do it any way we want, and it's up to the compiler to figure that out. We don't have to undo one order and impose another; we just made that completely free, which makes our life much easier. What we ask programmers to do is: if there is some order we cannot use, tell us. For sequentiality, give us the additional information and we will drop those orders. This basically allows the compiler to explore a nice choice space.
Let me give you a quick background on what the compiler does, and walk through it with a small example. The example is a rolling sum. I give two ways of doing a rolling sum. The first way says: to calculate an element of B, I take the previous element of B and add the corresponding element of A. So I take this one, add, take the next one, add. That is, I calculate cell i of B using the previous cell of B and a value from A. The other way is: for each output cell, I figure out everything that needs to be added and add it all up directly. You can see the difference. The first one does a lot fewer operations, but it has a sequential chain: I have to compute this before this, and this before this, so I have to do it sequentially. The second is very parallel; it can calculate all the sums in parallel, except I do a lot more operations, because for each cell I'm adding everything up from scratch. Given different architectures, different situations, one might be better than the other, and deciding that is what the system, the compiler, is able to do. This is how we describe the choices.
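Here is a rough C++ rendering of the two RollingSum choices (illustrative only, not PetaBricks syntax):

```cpp
#include <cstddef>
#include <vector>

// Choice 0: O(n) additions, but b[i] depends on b[i-1],
// so it forms a sequential chain.
std::vector<double> rolling_sum_sequential(const std::vector<double>& a) {
    std::vector<double> b(a.size());
    double running = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        running += a[i];
        b[i] = running;
    }
    return b;
}

// Choice 1: every output cell is recomputed from scratch, O(n^2) additions
// in total, but each iteration of the outer loop is independent and could
// run on a different core.
std::vector<double> rolling_sum_parallel(const std::vector<double>& a) {
    std::vector<double> b(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        double sum = 0.0;
        for (std::size_t j = 0; j <= i; ++j) sum += a[j];
        b[i] = sum;
    }
    return b;
}
```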
After you describe this program, the compiler goes through a set of analyses. To give you a rough feel for what happens: we do three new kinds of analyses in the compiler, which we call applicable regions, choice grids, and choice dependency graphs. The first comes from the fact that we are given these rules. Interestingly, in most programs a lot of the complexity comes from the boundary conditions. You have this nice matrix where you can do something simple, but at the boundaries you have to do something very complicated. The boundary condition really complicates the code, yet the boundary is not where performance matters; performance matters in the middle. With rules, we let you write different rules: some that only work on the boundary, others that work in the middle, or vice versa. In the rolling sum, one rule only works away from the first element, because to calculate an element it needs the previous element, and the first element has no previous element. The other rule applies to all the elements. So we have different ways of calculating different parts: the applicable-region analysis takes the rules and figures out which rule can be used to calculate which part of the data. By doing that, we can nicely separate out these weird boundary conditions; you don't have to write one complicated mess. The next thing we do is divide the entire data set by rules: the choice grid. We can say: this data item can only be calculated with rule 1; these data items can use both rules, 0 and 1. By doing that, I can do a different analysis for this item, because it has a different set of choices than that one; I partition the data accordingly. And finally, we build the choice dependency graph, which says: for each region, what are the rules you can use and what are their dependencies? Here I can only use rule 0, with this dependence; for the others I have several different rules I can use. This captures all the freedom the execution has in producing the result.
Putting it all together: you take your source program, and the PetaBricks compiler creates an autotuning binary. That binary carries all the choices you can use and all the dependences you have: some rules can be applied to all the data; some rules have to wait until certain conditions are satisfied. It gives you all the choices. Then you do offline autotuning: we generate data and keep running through the possibilities, and once we figure out a really good set of choices, the ones that give the best algorithm, we can recompile and get a final binary that doesn't carry all those choices around. The autotuner uses a genetic tuner to decide the hard binary choices, plus an n-ary search for parameters like block size, where you have to search over a linear space. One interesting thing we do in our autotuner is start with a very small data set. If you have a 2-by-2 matrix, you can try millions of different choices. Then we use the best set as the starting point for the next, larger size. As you go to larger and larger problems, you don't have to go through the entire space; you don't look at all the possibilities. We're not eliminating anything; we're just giving bad choices less probability. Take blocking, for example: on a 2-by-2 matrix, blocking does nothing and is pure overhead, so at that size the evaluation will say it's really bad, and it gets a low priority. But as you go larger and larger, and you start looking at a matrix that doesn't fit into anyone's cache, when you try blocking you suddenly realize it's a good algorithm; it improves things and moves up. The reason for doing it this way is that if you run a really bad algorithm on a really large data set, you can be there forever. We want to make sure we don't try really bad things too often, especially on large inputs. This bottom-up autotuning actually helps a lot.
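As a rough illustration of this bottom-up idea (the real tuner is a genetic algorithm; everything here, including the Candidate structure and the survivor policy, is a made-up stand-in), a tuner can score candidate configurations on doubling input sizes and let only the better half seed the next size:

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

struct Candidate {
    int cutoff;          // e.g., where a hybrid algorithm switches strategies
    double score = 0.0;  // measured running time (lower is better)
};

// time_on(c, n) runs candidate c on an input of size n and returns its time.
std::vector<Candidate> tune(const std::function<double(const Candidate&, int)>& time_on,
                            std::vector<Candidate> population, int max_n) {
    for (int n = 2; n <= max_n; n *= 2) {   // small inputs first, then double
        for (auto& c : population) c.score = time_on(c, n);
        std::sort(population.begin(), population.end(),
                  [](const Candidate& a, const Candidate& b) {
                      return a.score < b.score;
                  });
        std::size_t keep = std::max<std::size_t>(1, population.size() / 2);
        population.resize(keep);  // drop the losers; a real genetic tuner
                                  // would also mutate and recombine survivors
    }
    return population;  // the configurations that survived to the largest size
}
```

Because hopeless configurations are culled while inputs are still tiny, they are rarely paid for at the sizes where a single run is expensive.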
Let me show you some of our early results. Here is Sort, the classic algorithm. There are many different ways of sorting; insertion sort is really good when the data is small, but with a lot of data, things like radix sort become much better. And there are more crossovers than you can see at this scale. What we can do is learn a hybrid, a poly-algorithm, that is better than all of them: as the recursion goes down, it hits a switch point and switches to a different algorithm. Let me show a little picture of the kind of thing we do, with four sorting algorithms: merge sort, insertion sort, radix sort, and quicksort. Poly-algorithms are not new. If you go look at the Standard Template Library, its sort is in fact a poly-algorithm: a merge sort that, at 15 elements, switches to insertion sort. We were pretty impressed; wow, these people did that. And then we said, wait a minute, why 15? We tried to figure out where that came from, and we realized that around 1999 it was introduced in SGI's STL, and they picked 15. And it's still 15, because 15 is a constant in the program; there's no way the compiler can change it. So we asked: does 15 make sense? In fact, something like 2,500 makes more sense today, because caches are big. But it's still 15; if you go look at your STL, there will be a constant saying that at 15 elements it switches. So, instead of 15, what happens if you actually tune it? The interesting thing is that on a Xeon with one core, it starts with radix sort, at about 98 elements switches to 4-way merge sort, and at about 75 switches to insertion sort. But on an 8-way Xeon, it's a very different story: it starts with an N-way merge sort, switches at about 1,200, because now it's parallel, to quicksort, and at about 600 to insertion sort. Very different cutoff points. And a Sun Niagara is yet another story; it has a very different memory system, so it never does anything other than merge sort: it just does different-width merge sorts, 2-way, 4-way, 8-way, 16-way, at different levels.
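All of these tuned sorts share one shape; here is a minimal C++ sketch of such a poly-algorithm, with the crossover as a parameter instead of a hard-coded 15 (illustrative; the real tuned versions also choose among several algorithms and run the recursive calls in parallel):

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

void insertion_sort(std::vector<int>& v, std::size_t lo, std::size_t hi) {
    for (std::size_t i = lo + 1; i < hi; ++i)
        for (std::size_t j = i; j > lo && v[j] < v[j - 1]; --j)
            std::swap(v[j], v[j - 1]);
}

// cutoff is the tuned crossover point (the STL's hard-coded "15").
void hybrid_sort(std::vector<int>& v, std::size_t lo, std::size_t hi,
                 std::size_t cutoff) {
    if (hi - lo < 2 || hi - lo <= cutoff) {  // small range: switch algorithms
        insertion_sort(v, lo, hi);
        return;
    }
    std::size_t mid = lo + (hi - lo) / 2;
    hybrid_sort(v, lo, mid, cutoff);  // the two halves are independent, so a
    hybrid_sort(v, mid, hi, cutoff);  // parallel version could run them on
                                      // separate cores
    std::inplace_merge(v.begin() + lo, v.begin() + mid, v.begin() + hi);
}
```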
If you put them all together, here's the kind of chart you get. What I show here are different architectures, a Mobile Core 2 Duo, a couple of Xeon systems, and a Sun Niagara, and for each, the algorithm the tuner creates; for instance, one starts with quicksort, then switches to one merge sort variant, then another, and finally insertion sort. You get all these combination algorithms, and they are very different from machine to machine. The first question is: how important is this selection? Is there a big performance difference? To test that, we took the algorithm tuned for each system, ran it on the other systems, and looked at the slowdown. If you take the algorithm tuned for the mobile machine and run it on a 1-way Xeon, it runs about 60% slower; on an 8-way Xeon, also about 60% slower. And if you take the algorithm tuned for the Niagara and run it on the Xeon, it's about 2.3x slower. What that means is that even between architectures of almost the same generation, which are not that different, the tuned algorithm doesn't transfer; if you don't tune for the machine itself, you leave a lot behind. And this is a simple algorithm on straightforward architectures, and you already get a big difference. If you look at something like Matrix Multiply, there are all these different ways of doing it, Strassen's, transposing, recursive blocking, the simple basic version, each with its own performance, and what we can do is build a hybrid that is better than each of them. The nice thing about the hybrid is that it should always be at least as good as the best single algorithm; we shouldn't end up in a place that's worse, because we can always just switch to that algorithm. And if you look at something like an eigenvector solve, because the solution is composite, you can keep switching algorithms as you go down the recursion, and you can actually get performance that is much better than everything else: at the low end it uses one method, and then it basically settles into divide and conquer. So we are writing a lot more benchmarks these days, trying to get a better understanding of the language.
Now I want to switch gears a little and talk about two extensions, depending on how much time I have. The first one is variable accuracy. We are used to a world where things look black and white: your algorithm solves the problem and gives you the one result, always. But the world is not that simple; the world has a lot of shades of gray. Take iterative algorithms: you iterate some number of times, and at some point somebody says, okay, this is good enough, I'm going to stop, I like this result. Or take signal processing, images and sound: what's the perfect sound? If you're happy, it's good, and different people have different perceptions. So there's a lot of give and take, things like approximation algorithms. In many of these you can trade accuracy for speed; or, given a certain amount of time, ask what's the best accuracy you can get. Most of the time this is what users want: solve this to a certain accuracy as fast as possible, using whatever you can; or vice versa, given a certain time, give me the best accuracy you can.

One interesting algorithm in this space is multigrid. For people who don't know multigrid, here's a quick intro. It is a way of iteratively solving partial differential equations on a gridded domain. Think of it this way: assume you have a big sheet of metal; in one place you apply a heat source, and you want to figure out how the heat propagates. One way to do it iteratively is to divide the sheet into a bunch of grid points, and in every iteration update each point from its neighbors. Each iteration propagates the heat a little, and if you do it enough times, at some point you get the full propagation of the heat. But you can do this many different ways. One is relaxation: update points using neighboring values. It's an expensive computation; with a five-point stencil, you look at your point and its four nearest neighbors and update. And it takes a long time, because on a large sheet each step propagates information by only one pixel, slowly moving across; to get across a large grid takes a very long time. The other tool is restriction and interpolation. I have a very fine grid, because I want a fine result. Why not first go to a coarser grid? Then you have far fewer points, and values propagate very fast. After the values propagate there, you can interpolate back up to the fine grid. So you take the fine resolution, restrict down to a coarser grid, relax a lot there where it's cheap, and once you are happy, interpolate back up to the finer grid you want. On the coarsened grid, a value propagates across in a couple of steps, where on the fine grid it would take many more steps. In the field, there are papers on how to structure this: things called V-cycles, W-cycles, full multigrid V-cycles. Start at this resolution and go down, down, and come up; or go down, come up a little, go down again. Each of these shapes is a paper. So there are a lot of questions: which relaxation operations do you use? How many iterations do you run? How coarse do you go? At each level there are three main things you can do. If you have a small number of data points, you can do a direct solve: write out the system of equations and solve it exactly; with too many data points, that's too expensive. Or you can solve iteratively: run relaxation again and again. Or you can recursively go to a smaller, coarser grid, do something there, and come back up. So you have the standard V shape, where at some point you do a direct solve at the bottom; or you do the direct solve a little higher up; and all kinds of combinations of iterations around that. And whenever you iterate, you have to figure out how many times. I hope that gives a very fast overview of what this algorithm is about.
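To show the structure of that choice space, here is a toy 1-D sketch in C++ (so it compiles; the operators are stand-ins, not a real PDE solver), with the decisions an autotuner would make exposed as parameters:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

using Grid = std::vector<double>;

Grid relax(Grid g, int iterations) {             // stand-in smoother
    for (int it = 0; it < iterations; ++it)
        for (std::size_t i = 1; i + 1 < g.size(); ++i)
            g[i] = 0.5 * (g[i - 1] + g[i + 1]);  // average the neighbors
    return g;
}

Grid restrict_grid(const Grid& g) {              // down-sample by two
    Grid c(g.size() / 2);
    for (std::size_t i = 0; i < c.size(); ++i) c[i] = g[2 * i];
    return c;
}

Grid interpolate(const Grid& c, std::size_t n) { // up-sample back to size n
    Grid f(n);
    for (std::size_t i = 0; i < n; ++i)
        f[i] = c[std::min(i / 2, c.size() - 1)];
    return f;
}

// The tuned decisions: when to stop recursing and solve directly
// (direct_cutoff), and how many relaxation sweeps before and after the
// recursive coarse-grid solve (pre, post). Call with direct_cutoff >= 2.
Grid v_cycle(const Grid& g, std::size_t direct_cutoff, int pre, int post) {
    if (g.size() <= direct_cutoff)
        return relax(g, 100);  // stand-in for an exact direct solve
    Grid coarse = v_cycle(restrict_grid(relax(g, pre)),
                          direct_cutoff, pre, post);
    return relax(interpolate(coarse, g.size()), post);
}
```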
What you get is this trade-off between accuracy and time. There are a bunch of algorithms, each giving you a result with a certain accuracy in a certain time, and you get this two-dimensional scatter. This direction is better: higher accuracy; you don't want something that gets low accuracy and runs a long time. What actually matters is the optimal frontier: the algorithms that trade accuracy against time most efficiently. And of course you'd like everything; depending on the accuracy you need, you trade along that frontier. There are a lot of algorithms on this frontier, too many to remember them all, so what we do is quantize the space: each quantum gets one algorithm, the best algorithm for that accuracy level. What we have done is use dynamic programming to manage this autotuning: we search through these cycle shapes, figure out the optimum for each size, and build upward. If you look at the resulting program: it has the best direct solver; it has an iterative solver it can run; and it can call itself recursively, going down a level and coming back up. All of those have iteration counts that are also tuned: run this many times, or however many turn out to be good. We don't tell it how many; it figures out what's good for the target accuracy and gives the answer. To support this, we added a bunch of language extensions. For example, we have tunable variables where you say: I don't assign a value; you figure out what's best. We have an accuracy metric, where you tell us how to measure the accuracy of the program's output; sometimes it's something simple like a mean squared error, or you can write your own. You can say how many quantization levels you want. And of course you need a data generator, so we can generate enough training data. That's what we ask of the programmer, and then the system does the rest.
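Rendered in plain C++ rather than the actual PetaBricks syntax, the three pieces just described look roughly like this (all names are invented for illustration):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// "Tunable": the programmer declares it but never assigns it; the autotuner
// picks a value per accuracy level. (Here it is just an ordinary global.)
int k_iterations = 0;

// Accuracy metric: scores an output against a reference; here the negative
// root-mean-squared error, so larger means more accurate.
double accuracy(const std::vector<double>& out, const std::vector<double>& ref) {
    double sq = 0.0;
    for (std::size_t i = 0; i < out.size(); ++i)
        sq += (out[i] - ref[i]) * (out[i] - ref[i]);
    return -std::sqrt(sq / out.size());
}

// Data generator: supplies training inputs for the tuner.
std::vector<double> generate_input(std::size_t n) {
    std::vector<double> v(n);
    for (std::size_t i = 0; i < n; ++i) v[i] = std::sin(0.01 * i);
    return v;
}
```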
When you train, you find the best algorithm for each accuracy level at one resolution. For the next resolution, we construct algorithms using those as building blocks. We get all these data points, with all the algorithms built out of the previous level; some of them will be the best in some region, and we keep those. So each resolution uses the previous resolution's results to build its algorithms. What happens is that an algorithm at one level might use the algorithms of the level below: for example, we might say, coarsen once, run this level's algorithm twice at the coarser grid, and then use that result. To get to the higher level, you have all these choices: run this once at the coarse level, run that twice, then maybe direct solve at the bottom and come back up.

Audience: By resolution, you mean the grid is twice as big?

Amarasinghe: Yes, exactly. Assume I have a grid of 4,096, and I want, say, accuracy 10^7. That solution might want a less accurate solution on a smaller grid, run twice; and that one in turn uses an even less accurate one: coarsen, run, and so on. At some point, when it's small enough, I can do a direct solve and come back up. The best algorithm can be exactly this kind of combination. So we end up with these crazy shapes, different shapes for different problems, and we can figure out, given the accuracy and the size of the data, the best way to go through it. We can learn that, and that's the really cool part. And because we are using this optimal substructure, these pieces compose into even larger programs.
Here's one result. Because the results are two-dimensional, accuracy and time, it's hard to show everything for one accuracy value; you can have all these individually tuned programs, but we find something that basically matches the best of all of them. To give you another view, here's another interesting algorithm: bin packing. You get a bunch of boxes of different sizes and bins to pack them into; finding the optimal packing is NP-hard. But there are a lot of heuristics. Some heuristics are very fast, but they might not pack that well. What we show here is: if you only need to come within 50% of the best packing, then, depending on the size, there are some algorithms that are really fast. But if you want to come within 10%, there are other algorithms, things like first fit, that get you close to 10% at this size, but take longer. So given the accuracy bound and the data size, you can say: okay, this is the algorithm I want to run. We can learn that space. It's a complex space, but that's what our learning algorithm handles. Okay.
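For reference, first fit, the heuristic mentioned above, is only a few lines (a straightforward sketch):

```cpp
#include <cstddef>
#include <vector>

// Returns the number of bins used. Each item goes into the first open bin
// with enough remaining space; if none fits, a new bin is opened.
std::size_t first_fit(const std::vector<double>& items, double bin_capacity) {
    std::vector<double> space;  // remaining space in each open bin
    for (double item : items) {
        bool placed = false;
        for (double& s : space)
            if (s >= item) { s -= item; placed = true; break; }
        if (!placed) space.push_back(bin_capacity - item);
    }
    return space.size();
}
```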
I think we're running short on time, so I'm going to skip the next section and go directly to conclusions. I think it's time we came up with languages that are based on autotuning. Autotuning itself is established; a lot of people have done autotuning in libraries, but I think we need to design it into languages, because there is a convergence of multiple forces. First, the multicore menace: we have multicores, we have little idea how to use them well, these things keep changing fast, and we have to find a way to deal with that. Second, we want to future-proof our programs: write a program once, and ten years down the line we want it to still work well. We had that until now, and the way we are going with multicores, we are going to lose it unless we figure out some way to keep it. The other interesting thing is that we have a lot of computational muscle available, but no more human brain. In fact, compilers and programming languages are what everybody uses to take advantage of computing power, yet we are among the least aggressive users of computing ourselves. Look at things like CAD tools: they are all embracing high-performance computing; their tools run on large systems, doing beautiful visualization and all of that. Compilers still run on one core. There is no parallel compiler: we have parallelizing compilers, but none of the compilers themselves are parallel. The key question is how we, the compiler community, can use the computing power that we are trying to manage to actually make things better. I think there's a lot more we can do if we embrace that and say: okay, look, we are going to use a lot of computing power. With PetaBricks we have shown that it can be done, and that it has benefits.

However, the project is far from done; there are a lot of questions to be answered. For example, we're asking people to do more work. Normally you get the first algorithm written and working, and only then would you try a second one and a third one. These days, time to market is critical; it's very hard to convince somebody, once something is finished, to do extra work. Will people be willing to say: okay, I will give you another option, even though my first option is perfectly sufficient? We're asking for extra effort, and whether the average Joe programmer who has something released will go and provide other choices, or just move on to the next thing, is an open question. Also: you take a couple of programs and build a very complex poly-algorithm out of them; how do you know it's correct? There's a lot of complexity in verification and validation, and if you tune dynamically, every time you run, you may run a different algorithm. That's a nightmare for a lot of people. One saving grace we found is that, at least when you're not doing variable accuracy, we can compare the algorithms against each other. If you write a new algorithm, it had better produce the same results as the other ones. So when we are doing the learning, we're not only looking at the time, we're also comparing the results. We actually found a lot of bugs that way: wait a minute, this algorithm predicted a different result; something might be wrong. So we can actually use it for testing.
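A sketch of that cross-checking idea (illustrative, not the actual PetaBricks test harness): while tuning, run two candidate implementations of the same transform on one input and flag any disagreement.

```cpp
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

using Algo = std::function<std::vector<double>(const std::vector<double>&)>;

// Reports whether two variants agree to within a tolerance on one input;
// a mismatch points at a bug in one of them.
bool variants_agree(const std::vector<double>& input,
                    const Algo& f, const Algo& g, double tol = 1e-9) {
    std::vector<double> a = f(input), b = g(input);
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i)
        if (std::fabs(a[i] - b[i]) > tol) return false;
    return true;
}
```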
But verification and validation remain open: how do you validate a poly-algorithm that keeps changing? So, that's all I have. I'm open for questions. [APPLAUSE]
Amarasinghe: Should we open it up for some questions [INDISCERNIBLE] because they might help [INDISCERNIBLE]. Questions? I know we have [INDISCERNIBLE]. They're sitting, they're paying attention; they might be listening, otherwise they wouldn't be sitting there for hours. Okay, there's a question. Go ahead.

Audience: Thanks for the talk. Can you hear me?

Amarasinghe: Yes.

Audience: Thanks for the talk. Towards the end, when you were talking about variable accuracy computation, how do you anticipate users reasoning about the various dials they can specify for their particular code? Because, in some sense, one of your examples had a field where you had to specify the range of accuracy, or accuracy quantiles. Do you have some tool that helps users figure out exactly what those quantiles should be?
Amarasinghe: Right now, the accuracy quantiles are an optional thing. What we do by default is look at the range of accuracy you can get and basically chop it into pieces. Specifying the quantiles yourself becomes important if you have a non-linear kind of range. At this point we don't have tools for that, because a lot of it is really the semantics of the program. If it's a simple linear range, you don't need to specify anything. But we have found programs where accuracy becomes very important in certain ranges, where you have to quantize much more finely than in other ranges. So if you know more about your program, we provide the hooks to express that, but it's optional. Thinking about some of these accuracies can be hard, but on the other hand it might be liberating, too, because what we're asking you to do is tell us less. Instead of saying, "i equals... hmm, how many times should I iterate?", you say: do whatever you want. Instead of picking a number, I leave it blank. In some sense I'm asking you to do less programming, not more. Hopefully; this of course has to be proven. In fact, I think this might be easier for programmers, because with something like an iterative algorithm, people spend a huge amount of time trying to figure out the right number of iterations. And here, what we're saying is: don't do it.

[END RECORDING]

NCSU Creative Services, TCSDLS-20111017-Amarasinghe, October 17, 2011. Transcript prepared by Rogers Word Service, www.rogersword.com.
