CS6810 — Lecture 20. Lectures on Compiler-Based ILP.

In this last video on compiler techniques for high ILP, I'll focus on the topic of memory dependencies.

OK. So let's go back to the same example I had before, where I was loading something into R1 from an address sitting in R2. And there's a store before that, which is writing something into a location specified in R4. What I want to do is move the load as early as possible in the program, because I know this is a long-latency load, and I'd like to bring the value in well before it's actually used. So my compiler examines this code. It sees that there are no register dependencies, and so it moves the load higher up. So the new code becomes like this. And then I have the branch. The branch is really not very important in this example, but I'll just retain it over here. OK. So as I said, there are no compiler-detected register dependencies, and so this should be safe.
OK. But now, let's look at my data memory. I have a location sitting here, in memory. And its address is, let's say, ABC. And it has the value 50 sitting over here. Let's say that as you're executing this code, and let's look at the original code first, it turns out that R4 happens to contain the value ABC. That is, it's pointing to the same location. And let's say R3 happens to contain the value 70. So when I execute the store, I'm going to put the value 70 in over here. So the 50 becomes 70. And then later, when I do the load, again, it so happens that R2 contains the address ABC. So I load from here, and I put 70 into this register. OK. So I should be leaving this point in the program with the value 70 sitting in R1.

Now let's look at this new code over here. All right. So again, R2 is going to match ABC. So the first thing I'm doing is bringing the value in ABC into R1. And at first, the value in ABC was 50. So when I finish this code, the register R1 will have the value 50, because the store comes later and changes this 50 into a 70.

OK. So essentially, in my original code, there was a write into that location, and then a read from that location. So this is a read after write data dependence, which means that if I were to flip the ordering of these two instructions, I will proceed with the wrong result.
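A minimal C sketch may make this concrete. The names abc, r1, r2, r3, and r4 below are just stand-ins for the memory location and the registers in the example on the slide:

    #include <stdio.h>

    int main(void) {
        int abc = 50;       /* the memory location at address ABC, initially holding 50 */
        int *r4 = &abc;     /* R4 happens to contain the address ABC */
        int *r2 = &abc;     /* R2 happens to contain the same address ABC */
        int r3  = 70;       /* the value the store writes */
        int r1;

        /* Original order: store, then load. */
        *r4 = r3;           /* the 50 in ABC becomes 70 */
        r1  = *r2;          /* the load sees 70 */
        printf("original order:  R1 = %d\n", r1);   /* prints 70 */

        /* Reset ABC and run the reordered code, with the load hoisted above the store. */
        abc = 50;
        r1  = *r2;          /* the load reads the stale value 50 */
        *r4 = r3;           /* the store updates ABC to 70, but too late for R1 */
        printf("reordered order: R1 = %d\n", r1);   /* prints 50, the wrong result */

        return 0;
    }

If R2 and R4 had held different addresses, both orderings would have produced the same result; the problem appears only when they alias.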
OK. So remember, when we talked about data dependencies before, we said that when somebody writes something into R1, and then you read something from R1, that's a read after write register data dependence, or register data hazard. And because of this, we were very careful to introduce stall cycles in our pipeline to make sure that the read happened after the write. OK. So we were disallowing any kind of reordering of these two instructions. The hardware was able to detect this and introduce stall cycles, and the compiler was careful to make sure that it didn't move the second instruction above the first instruction.
OK. But now, what we have here is a memory data hazard, where you have a store which writes something into a given memory location, and then a load which is reading from that same memory location. So this is an example of a read after write memory data hazard, and we should not be reordering these two instructions either.

So how can the compiler figure out that there is a memory data hazard? When we were dealing with register data hazards, those register data dependencies were encoded in the names of the registers. I could look at R1 and know there was a register data dependence, and that's why I would not move instructions up or down. In this case, the memory dependence is encoded not in the register names but in the contents of those registers. If you look at the names, they are completely different. But the fact that they both contain the same value, that R4 contains ABC and R2 contains ABC, is what creates the memory data dependence. So at compile time, the compiler is just going to look at the register names and conclude that they are different. But since it has no idea about the contents of the registers, it actually cannot figure out memory data dependencies at compile time, because the contents of R4 and R2 depend on whatever inputs the program may have received and whatever is being computed on the fly.
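This is the classic pointer aliasing problem. In the C sketch below (the function name and parameters are made up for illustration), the compiler sees only the two pointer names; whether the load could safely be hoisted above the store depends entirely on the addresses the caller happens to pass in at run time:

    #include <stdio.h>

    /* The compiler sees two differently named pointers; it cannot tell
       from the names whether they refer to the same memory location.   */
    int store_then_load(int *r4, int *r2, int r3) {
        *r4 = r3;       /* store to the address held in R4 */
        return *r2;     /* load from the address held in R2 */
    }

    int main(void) {
        int x = 50, y = 50;
        printf("%d\n", store_then_load(&x, &y, 70));  /* no aliasing: hoisting the load would be safe, prints 50 */
        x = 50;
        printf("%d\n", store_then_load(&x, &x, 70));  /* aliasing: hoisting the load would be wrong, prints 70 */
        return 0;
    }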
So since it does not know about memory data dependencies, the compiler has one of two options before it. One is to be conservative and assume that every load and store might depend on each other, because there's a small chance that the values sitting in R4 and R2 might match. And so the compiler says, well, I'll never move a load above a prior store. If we were to adopt that approach, then the compiler has very few options, because loads and stores are very common in the program. OK. They occur very frequently, which means that a load can probably not be moved up by more than a few instructions. So that wouldn't help us maximize our performance. So instead, we look at option two. And what does that do?

Let's walk through an example again. OK. So my new code would become something like this, where I have a load.s. I'm letting my compiler be aggressive. I'm letting it move instructions above a prior store. But if there's a memory dependence, I'm going to try and detect this at runtime. So my new code will look like this. Just as before, I now have a sentinel instruction, which is later down in the code.

And now, when I execute the load, the hardware recognizes that this is a speculative load. This load has been moved up. So I have a special table, which, in some architectures, is called the advanced load address table, or ALAT. Every time I have a load that is being executed in advance, the address that I fetch from is being recorded in this table. So if this matches ABC, this table will now record the fact that the value fetched in R1 from address ABC is a speculative load.

Later, when I execute any store, I'll look at the address that I'm storing into. In this case, that address is ABC. Now I have to go and check my ALAT table and see if there's a match. So in this case, I detect that this ABC matches the ABC that I had previously fetched from, which means that the advanced load has moved ahead with a possibly wrong value. So I put an X over here saying that I just invalidated that advanced load. Later, when I get to this check instruction, I'm going to look at the entry for R1, and I realize that this invalidation has happened. So I jump to my recovery code, which might re-execute the load and get the correct value. And it might also re-execute any dependents of the load, which might have also been moved up earlier.

OK. If the store had been to ABCD, then this checkmark would not have been there, and the check instruction would have said that this has succeeded. The advanced load has succeeded, and I can move on with the correct value already sitting in R1. OK. So this is how I handle memory dependencies, with an additional table which detects memory dependence violations on the fly.
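To make the mechanism concrete, here is a small C sketch of such a table. It is a simplified model for illustration, not the actual IA-64 ALAT format, and the structure and function names are invented. An advanced load records its destination register and the address it read from, every store probes the table and invalidates any matching entry, and the later check either succeeds or falls into recovery code that re-executes the load:

    #include <stdio.h>
    #include <stdbool.h>

    #define ALAT_ENTRIES 8

    /* One table entry: which register holds the speculatively loaded value,
       which address it was loaded from, and whether the entry is still valid. */
    struct alat_entry {
        int  reg;
        int *addr;
        bool valid;
    };

    static struct alat_entry alat[ALAT_ENTRIES];

    /* Advanced (hoisted) load: perform the load and record it in the table. */
    static int advanced_load(int reg, int *addr) {
        alat[reg % ALAT_ENTRIES] = (struct alat_entry){ reg, addr, true };
        return *addr;
    }

    /* Every store writes memory and probes the table; a matching address
       invalidates the entry, because the advanced load used a stale value. */
    static void store_and_probe(int *addr, int value) {
        *addr = value;
        for (int i = 0; i < ALAT_ENTRIES; i++)
            if (alat[i].valid && alat[i].addr == addr)
                alat[i].valid = false;
    }

    /* The check instruction: true means the advanced load survived,
       false means recovery code must redo the load and its dependents. */
    static bool check_advanced_load(int reg) {
        struct alat_entry *e = &alat[reg % ALAT_ENTRIES];
        return e->valid && e->reg == reg;
    }

    int main(void) {
        int abc = 50;                     /* the memory location ABC */
        int r1, r3 = 70;

        r1 = advanced_load(1, &abc);      /* the hoisted load brings 50 into R1 early */
        store_and_probe(&abc, r3);        /* the store aliases ABC and invalidates the entry */

        if (!check_advanced_load(1))      /* the check detects the violation */
            r1 = abc;                     /* recovery code redoes the load and sees 70 */

        printf("R1 = %d\n", r1);          /* prints 70, the correct value */
        return 0;
    }

If the store had gone to a different address, the probe would have found no match, the check would have succeeded, and the recovery code would never run.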
OK. So when we got to this entire topic of compiler techniques for ILP, we said that our goal over here was to try and maximize performance without putting too much onus on the hardware. So we said that we should keep the hardware simple, and let's leave it up to the compiler to reorder instructions, and hide stall cycles, and maximize performance. But as we have moved along, we have said that there are certain limitations on what the compiler can do. And to overcome those limitations, you need techniques like predication, or in this case, an ALAT table, to detect memory dependencies. And all of these put onus back on the hardware. So you need a table like this, in this case, and you need to do a check every time you do a store. You also need techniques like predication, which increase the number of register file ports and the pressure on your bypassing circuits, and so on.

So ultimately, if you want really high performance, it is really hard to do that with purely compiler techniques. You ultimately need some hardware support as well. So we've almost come full circle in some sense. What this means is, if you want really high performance, you do need to resort to techniques that are somewhat hardware intensive. And so starting with the next video, we'll look at these hardware-intensive techniques, which require almost no support from the compiler. These are techniques like out-of-order execution. But if you want to really keep the hardware simple, then there's only so much additional performance that you can get. And hence, these compiler-based techniques are restricted to domains such as embedded processors, where you don't really need the ultimate highest performance, but you want to keep the hardware really simple. OK. So starting with the next video, we'll look at more hardware-intensive techniques.
