Apache Spark for Java Developers – Course Extract – Performance

– [Instructor] Well, we’re approaching the end of this module on Apache Spark. There will be at least one further module on Spark where we’ll be covering, at the very least, how to work with Spark SQL, or the Spark SQL API. Now this is a newer API; it’s quite a bit richer than the core API that we’ve been working with on this module. The core API though is still very rich and worthwhile, so I hope you’ve got a lot out of this course. But I have been putting off talking about, not just performance, but also some of the
inner workings of Spark. And I have been deliberately
ignoring some things that I don’t think you should be ignoring if you’re planning to be a
serious user of Apache Spark. Of course, you don’t
need to know everything that’s going on inside Spark, but I believe that having a good handle on the underlying model of Spark gives you a much better chance
of writing good Spark jobs. So that’s what we’re going
to be doing in this chapter. Now I seem to have, throughout this course, been constantly saying, oh, we’ll talk about that in the last chapter on performance. So for that reason, we do
have quite a lot of topics to talk about in this chapter. But I’ll try to keep it
to one single chapter, so it’s all in one place. And the plan for this session is we’re going to look at, first of all, the difference between
transformations and actions, which I think we’ve probably
touched on through the course but it’s definitely worth
really nailing that down and making sure that we’re clear on what the difference between those two things is. We’ll then go on to have a
look in detail at the DAG, or the execution plan, and we’re going to use, once again, the WebUI that we used back in the chapter on Amazon Elastic MapReduce, we’re going to return to that console and we’re going to really
drill into the details that we’re seeing in there. We’ll then return to
looking at transformations and we’ll talk in detail
about the difference between a narrow and
a wide transformation. And another way of putting it is we’re going to look
at what shuffles are and what the stages are
that you might have noticed appearing on the console when
we’re running our programmes. We’ll have a look at what’s
happening with partitioning, and I’ll talk in particular
about one possible problem you might have in real life
which is called a Key Skew. And I’ll also talk about the
groupByKey transformation which we have done at
least once at some point on the course. And I hinted very strongly
that there are some problems with groupByKey — not really performance problems, more just an operational problem with groupByKey. It’s very easy to get
out of memory exceptions in your cluster if you’re
working with groupByKey. The general advice would
be to avoid groupByKey because you can always do
things in a different way and I’ll show you an example of that. And we’ll also look at
caching and persistence which sounds like a big, scary topic but actually if you understand
transformations and actions, and you can read the DAG, then you will find the concepts of caching and persisting quite natural. It’s basically going to
be a single line of code. Well, we’ve returned here back to the programme that we wrote in the first part of the course, where we were doing the keyword ranking. We were reading in this input text file, which is a big subtitle file, and we’re generating a set of keywords ranked by their popularity. Now I talked in the introduction chapter about the fact that on
each of these lines of code when we’re sort of thinking in terms of, let’s take this one for example, we’re taking an RDD and then
we’re going to filter that RDD, and we’re going to end up with a new RDD. And, you might have in
your head, a mental model of a great, big set of data being the RDD. And for each line of code,
we’re modifying that RDD. Well, I had to say this in
the introduction chapter because I didn’t want anyone to be misled, but at runtime, that’s absolutely not what is happening. We’re not building RDDs
on each of these steps. In fact, what we’re building is an execution plan. Only when we get to an operation where Spark actually has to do a calculation to provide a result does anything really happen. Now, in this programme, in fact, the only point where Spark
needs to do any calculations is on the line here where
we’re doing the take. Now the reason that it has
to do something at this point is we are asking for a Java result. Notice the left-hand side of this is a regular list, rather than an RDD. So it turns out that it’s
only on this line of code that Spark has to commit
to doing its calculations. Every single previous line of code has merely been building
the execution plan. Now for that reason, Spark makes a differentiation
between its operations. The operation here, the
final operation, the take, is called an action. Actions are going to
make calculations happen and they’re generally going to result in some regular Java objects as a result. So, you can think of the actions as being, get me the results. The previous operations
where we’ve been building, what I’ve been calling, new RDDs, actually there’s no new
RDD being built here. What we’re doing is adding a
new step to the execution plan and Spark calls these transformations. So you really do need to
understand the difference between a transformation and an action. Now, it might not be obvious immediately why the fact that there’s a difference between transformations
and actions is relevant. But as we go through this chapter, you’ll see that it really is relevant so we will need to keep in mind whether we’re doing a
transformation or an action. But before we see the impact of this, I’d just like to give you a quick demo really. So I’m going to use the debugger, which I don’t use very often, but we can do that with a double-click in the left-hand column here next to the line number. So I’ve put a breakpoint on line 28 and I’m going to do a Run As, Debug on the main programme. I assume you’re familiar with the debugger, so I won’t go on about it. But we’ve hit the breakpoint now, so we’ll switch to the debug perspective. We’re now on this line of code here; we haven’t yet executed line 28. Now, up until now on this course, I would have been commenting
on this code by saying on this line, we read in
the text file input.txt. Well, can I be really clear and precise now: we are not loading in that text file on this line of code. We are telling Spark to
include in its execution plan the loading of this text file. And the key thing about
that is, this text file will only be read in when it needs to run the execution plan, and that means all the
way down here on line 48. Now, if I do a step over — I felt that that line was a little bit slow, but can you see in the logging? I’ll do a clear on the console: nothing happened there. So, where you’ve been seeing the progress of the jobs and the little animated bar at the bottom, nothing happened. And actually as we step over each of these transformations, we’re not seeing anything happen in the console. Now I’m not going to do it
here, but if you want to, you could drill into the
structure of these RDD objects and you’ll see they
don’t contain any data. We have read no data in at this point. I’m kind of halfway through the script now. No data has been read in; we’re merely building that execution plan. Let’s keep stepping over — I’m just stepping, not editing, at this point. So let’s take this one, this is just the interesting words, where we’re doing the mapToPair. Now, I don’t know how long that operation is going to take when it actually executes, but I imagine it’s a significant time. And yet when I click the step over button it’s taking milliseconds. It’s not doing anything heavy. So, I keep stepping over. Now, actually the reduceByKey was a little bit sluggish, but I imagine it’s having to do complicated things with the execution plan. It’s not working on the data. So, I hope I’ve got that clear. And as we get closer to — now when I stepped over the sortByKey, there, something did happen on the console. Now, sortByKey is a transformation. So, once again, at the risk of boring you: no data has been read at this point, no calculations were performed there, but something happened on the output, and I’ll talk about why that was the case in a short while. But we
are now at the action. And if I step over this,
now that was the point that the calculations happened. And if I click on the
results here in the debugger, the results are going to
be normal Java objects, so at that point, Spark had
to do its calculations. That’s the first thing to get across, and it can be difficult to remember which of these operations
are transformations? And which of them are actions? And my best advice here is to
use the Spark documentation. Look for the link to
RDD programming guide, and there’s a link here
to transformations, and this one here for actions. And you have here a list of at least the common transformations. So, we know now that map, filter, flatMap and distinct — quite a few of the operations we’ve been doing on the course —
are all transformations. And we often say these are lazily executed, meaning that these jobs only have to be done a little further down, when we reach an action. So, common actions are collect, count, first — I don’t think we used first, but it’s very similar to just doing a take of one — and things such as saving to a text file. So all of these operations will result in the execution plan actually being executed. So, I hope that’s understandable, and really that concept is going to be relevant to every single one of these performance aspects that we’re going to look at through the rest of the chapter.
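To make that concrete, here’s a minimal sketch — and I’ll stress this is my own illustrative class, not code from the course download: the filter only extends the plan, and nothing actually runs until the take.

    import java.util.Arrays;
    import java.util.List;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class LazyEvaluationDemo {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("LazyDemo").setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(conf);

            JavaRDD<String> lines = sc.parallelize(
                    Arrays.asList("WARN: Tuesday 4 September", "ERROR: Tuesday 4 September", "WARN: Wednesday 5 September"));

            // Transformation: merely adds a step to the execution plan.
            JavaRDD<String> warnings = lines.filter(line -> line.startsWith("WARN"));

            // Action: only here does Spark run the plan and hand back
            // regular Java objects.
            List<String> results = warnings.take(2);
            results.forEach(System.out::println);

            sc.close();
        }
    }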
Now, the really difficult thing about Spark is it almost feels like magic, what it’s doing under the hood. Spark is incredibly complicated, and it can be very difficult to reason about what’s happening at runtime. Even after working with Spark for quite a long time, even with relatively simple programmes, I still find myself scratching my head and thinking, why did that happen? And why did this perform slowly when I expected it to be quick? So, rather than just
looking at the programme and trying to guess at what’s happening, I strongly recommend you get used to looking at the web user interface to get a full view of that execution plan. It’s otherwise known as the DAG, which, as I mentioned in the introduction, stands for Directed Acyclic Graph — a bit of jargon from computer science that just means a graph that doesn’t have any loops inside it. Well, we did see this very briefly in the chapter on Elastic MapReduce, but I reckon it would be good if we could look at that web
user interface on our programmes when we’re running locally. So how can we do that? Well, you saw that on Elastic MapReduce, we could look at the history server which was running on localhost:18080 and the history server,
will gather together all of the previous jobs
that you’ve already run. Now, this history server is actually supplied by a script which was running on our Elastic MapReduce server, and it’s just that that was pre-configured by Amazon, so we didn’t have to start that script. You can go and download Apache Spark, and you’ll get as part of that distribution the scripts that you can run which will run a local web server and give you access to that history server. Now, we could do that; I just kind of feel we’re getting towards the end of the course and I don’t want to be getting you to install software at this stage. Do investigate that option if you want. But I have a little bit of a hack, which isn’t very elegant, I must admit. But it does at least
work for basic testing that we’re doing here. Now, it turns out that when
your programme is running, Spark will start a Web server on port 4040 that you can visit, and that will show you the progress of the current executing job. So that’s gonna be 4040, but
we have the same problem, that you’re only going
to get access to that while the job is running, or at least while your
Java programme is running. So, what I would commonly
do, and I will warn you this is a hack. At the bottom of the script, just before we do the sc.close, I’m just gonna hold the console. I’m going to create a Scanner object from java.util — if you’re familiar with that, it allows you to read from the console. So we create a new Scanner, passing to the constructor an instance of System.in. And actually, be careful: you’ve got to make sure you’ve imported the right one. There are a lot of Scanners here; it’s the java.util Scanner that you’re looking for, standard built-in Java. And then you can call scanner.nextLine, and that will just make the console wait for a text input. And that means we can visit the WebUI.
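Put together, the hack is just a few lines — a sketch, assuming the Spark context variable is called sc as it is elsewhere on the course:

    import java.util.Scanner; // make sure it's the java.util one

    // ...at the bottom of the main method, just before closing the context:
    Scanner scanner = new Scanner(System.in);
    scanner.nextLine(); // blocks here, keeping the WebUI on localhost:4040 alive
    sc.close();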
Now, the downside of doing this is it’s very easy then to forget to terminate this programme. And you’ll notice you’ve done that when you get a warning when you run the next programme, saying it cannot bind to port 4040. So, it’s a minor downside of doing this — it’s a quick and dirty hack, but it does at least work. So let’s run our word counts again, not with the debugger this
time, just a normal regular run. And there it’s run through,
we’ve got the results, but can you see there? The red button is still illuminated this programme still running, and it’s actually waiting for me, to enter something and hit return. So, I’m not going to hit
return, but back to the browser localhost 4040 gives us
a very similar webpage to the history webpage that we saw when running on Elastic MapReduce. Now, there are some things on
here that I can’t yet explain because they are a little bit complicated. But what we have here is
a list of completed jobs. Now, a job roughly represents an action that you’ve performed in your script. I say roughly because actually we’ve only performed one action in our script, as we know, and that was the take. However — and it’s just a
little bit of a quirk, this, and it’s difficult for me to explain; to be honest I’ve never fully understood it either — the internal implementation of sortByKey will generate a job. We know sortByKey is not an action, it’s just a transformation, but it does show up as a job on the Spark console. Now, I don’t want to get too
hung up on that just now, I might go in a bit more
detail about that later, but for now, what I want to show you is let’s follow the link
to the sortByKey job. And what we can see
now for the first time, is the execution plan
as I’ve been calling it, or as Spark calls it, the DAG. And this is what we’ve been building up in our code: each of these transformations is added to the DAG. And I know at first it looks quite scary, but I think once you understand what these stages are, then really, all it is showing is the transformations being carried out one after the other. So, let’s try to understand
what a stage is all about. And a stage is related to a shuffle. So, we’ll now go on to
look at the difference between Narrow and Wide Transformations. Let’s run through simple
example to illustrate and I will actually code this up as well so we can see the
resulting execution plans over on real Spark. But imagine we have a text
file, this is terabytes from some kind of server, and it’s the logging that we saw before. So, we have a logging
level followed by a colon, followed by a date and time. Now, on the execution plan we’re going to call the text file method which adds a node into the execution plan, meaning that when we get to an action, Spark will have to load
this file into memory. Well, of course it doesn’t really; what it will do is tell the worker nodes to load fragments of the
text file into memory. And we’re going to end up
with a series of partitions, which I know I introduced in the opening chapter; just to remind you, a partition is a chunk of data. Now, imagine that on our deployment we are going to have two worker nodes, just like we did when we ran Elastic MapReduce. Although, thinking about it, for a terabyte file we’d probably need a lot more than two worker nodes, because of course we need to fit each partition into RAM — but I don’t have space on the caption. So, assume that two worker nodes is sufficient for this example. And we’re going to typically
see multiple partitions on each of the nodes. Now, that’s one example
of where I just waved my hands around in the early chapter, and I now want to be specific about why we’re seeing multiple partitions. Why doesn’t it just have one partition per node? Or, to put that another way, how will Spark determine this partitioning? And the answer is, it depends on the input source for the RDD. But when you’re working
with a text file like this then it’s a simple case of
Spark will chunk the data. I think we saw this in the
Elastic MapReduce chapter: I loaded that big 2 GB file, and you saw that the partitions on Elastic MapReduce were 64 MB each. So, there’s nothing
complicated going on here other than Spark is
just arbitrarily saying, okay, we’ll have each partition containing some number of megabytes of the data. The number of megabytes isn’t that important to me; it actually varies depending on what kind of file system you’re using. We saw though that it’s 64 MB when you’re using Amazon S3. So that’s how we obtained
the initial partitions. Now, if we’d not loaded in the data from the text file — I can’t really think of how that could be the case in this particular example, but my point is, if this RDD had been seeded by a data source other than a text file — it would have simply used the hashes of these strings; remember they are just simple strings. It would have calculated the hash code of each of the strings, and then it would’ve used a modulus of that hash code to, if you like, randomly spread the strings between the different partitions. And that’s what I mean by it depends on the input source. So you can see then,
we’ve got our strings now scattered, sort of randomly
across the partitions. And the randomness is just that we’ve taken chunks of the data. But now we are going to go on, and we’re going to do a transformation. So, one of the easy transformations is a filter. Let’s say we’re only interested in the warnings for this example. So we’re going to filter, and we’re going to only include the lines that start with the string WARN. So you know how this works by now. The lambda expression is
going to be distributed to the individual nodes. Actually distributed to
each of the partitions. I haven’t quite got that
across on the slide here, but the point is, this function is sent to each of these partitions, where it will become a task. So, the block of code executing against the partition is called a task. So, with this simplified example we’re going to end up with six tasks. Those tasks will execute, and we’re going to end up
with our transformed RDDs. So one of the partitions is now empty because there were no warnings,
and the other partitions have been modified accordingly. Now, I hope you can see that all of this could be done totally in parallel. And this function could be applied to each partition in turn. Now, Spark implemented that transformation without having to move any data around. It didn’t have to change the
partitions in any way at all. So for that reason, this type of transformation is called a Narrow Transformation.
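In code, that’s the kind of filter we’ve written throughout the course — a sketch, assuming the RDD of log lines is called logLines:

    // A narrow transformation: the lambda is shipped to each partition
    // and applied in place, so no data moves between the nodes.
    JavaRDD<String> warnings = logLines.filter(line -> line.startsWith("WARN"));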
Let’s have a look at another transformation. If we go back to our original input RDD, back to the six partitions, and I’ve got my Errors and Fatals back in place. We’ve very commonly on this course done a mapToPair transformation. So, the input data is a kind
of an almost useless string, just a single string. So we’ve seen on the course, it’s very useful to split it out into key-value pairs. So, this will be the resulting RDD. I’ve not bothered showing the lambda function for this; it’s that business of taking the string, splitting it, and extracting the zeroth and first elements. The caption doesn’t get this across brilliantly, but I’m denoting there, with the round brackets and the Warn comma and then a date, that this is a key and this is a value in a pair RDD. And you know the purpose for doing that: we can do rich operations on the keys, such as grouping and sorting, and we’ll look at that in a minute. But the important thing is, this is also a narrow transformation. There is going to be no need for Spark to move the data around in any way to make this happen. Each input partition will simply become an output partition. So, it’s a narrow transformation.
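As a rough sketch of that step — the split on the colon is my assumption about the line format:

    import scala.Tuple2;

    // Another narrow transformation: each line becomes a (level, date)
    // pair without any data leaving its partition.
    JavaPairRDD<String, String> pairs = logLines.mapToPair(line -> {
        String[] parts = line.split(":", 2);
        return new Tuple2<>(parts[0], parts[1].trim());
    });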
But the reason I wanted to show you this is just to show that when you have one of these Narrow Transformations, you can end up with a situation where the keys are scattered across the partitions. We have an Error here, an Error here, an Error here, and an Error here. So, that’s another Narrow Transformation. So, now that we’ve got this key-value RDD, let’s have a look at
another transformation we can apply to it. You’ll remember, we only did this briefly on the course, but there’s a transformation called groupByKey, and that’s going to create a new pair RDD where, for each key, we have a list — in this case it will be of all of the dates on which that logging level occurred. Now, the only way Spark can implement this is it’s got to, for example, gather together all of the Fatals. So to gather together all of the Fatals, it’s going to have to move data around from one partition to another. And most significantly, because
of course the partitions are spread across multiple physical nodes, that’s going to involve network traffic. And not just network traffic, in fact: it’s going to have to convert these values — which, OK, for us are strings, but remember, they could be any Java objects. It’s going to have to convert those objects into binary so that it can transfer them across the network. In other words, it has to serialise all of the data before moving it across the nodes. Now that, I hope you can just feel intuitively, is a very painful operation; it’s going to be an expensive operation. And for that reason, these transformations have a special name, and that name is a Wide Transformation.
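In code, the wide transformation itself is just a one-liner — a sketch, carrying on from the pairs RDD above:

    // A wide transformation: matching keys must end up on the same
    // partition, so a shuffle (serialise, transfer, deserialise) occurs.
    JavaPairRDD<String, Iterable<String>> byLevel = pairs.groupByKey();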
Now, I’ve not finished the caption yet, because my animation here was still containing exactly the same information, and that’s because I kind of wanted to illustrate this by saying, well, I’ve got to rework this caption now. And to rework the caption, I’ve got to move the data around. I’ve got to get all the
Fatals in the same place. So, I’ll copy this line
here, and maybe move it to this partition. So, paste that into there,
and there’s an error here, let’s move all of the errors
across to this partition. I’m being a little bit silly here, by showing you how painful my work is — I’ve got to do all of this
changing of the caption, while I’m illustrating there that Spark has to do that work as well — it is gonna be pretty painful. To avoid boring you, I think I’ll finish this captioning off-camera. Cool, and that was really hard work; I had to shuffle all of these strings around to get them so that they’re all clumped together. And as you probably know
by now, Spark does indeed call this process shuffling. So, shuffle is triggered by
any of the Wide Transformations, and it is always an expensive operation. Of course, the actual result of the groupByKey transformation is we end up with a single key of Warn, together with a list of all of the dates for warnings, and the same is true for the other keys. So, the end result of the groupByKey will look something like this. Now, I want to point out,
that although we’ve ended up with, in this example, three partitions with data, there are still three other partitions that are now empty.
when we try this in code. So, shuffle is always
an expensive operation, and as part of a training course, many of you will be looking for concrete advice that you need to follow. So, I guess I could say, avoid shuffles. Well, that would be a silly piece of advice really, because you can’t avoid shuffles in any nontrivial Spark job. If you need to do a shuffle,
you need to do a shuffle. But you can think
carefully about when you do a Wide Transformation in your job. Let me give you a simple example here, let’s say the whole
purpose of this exercise was to get a report that counts the number of Fatal log
messages that we had over the period of this log file. Well, we could have done exactly the steps that we’ve just seen, and now we can simply go in, get the key of Fatal, count the number of elements in here, and we get the answer of two. Now, obviously that’s a simple example, and you can probably spot immediately that there’s a major
performance improvement that we could make here. We had to do all of that shuffling, and then we’ve actually ignored all of the Warn and the Errors, and we’ve just gone
straight for the Fatals. I guess it’s fairly obvious in this example that it would’ve been much, much more sensible, at the early stages of this job — because we’re not interested in anything other than Fatals — to have done a filter, which you know now is a Narrow Transformation, and that would’ve ended up with an RDD that looks something like this. We’ve still got six partitions, but most of them are empty now, and we just have a much smaller amount of data. And now at this point, we can go ahead and do whatever grouping we need to do. If we do the groupByKey now, there’s much, much less data to shuffle around.
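Something along these lines — a sketch, again assuming the pairs RDD from earlier and a FATAL log level:

    // Filter first (narrow, no shuffle), so the expensive groupByKey
    // only has to move the small number of Fatal entries around.
    JavaPairRDD<String, Iterable<String>> fatals =
            pairs.filter(pair -> pair._1().equals("FATAL"))
                 .groupByKey();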
That was an easy example; when you get deeper into Spark and you start writing much more complex jobs, it’s not always obvious that you’ve done a Wide Transformation too early. So, that’s a really good reason why the Spark UI is a really good guide to where you may have made some performance blunders. Now, I thought it would be good to actually run through the example not just on a caption, but in the code. So, this is not something that we’ve done previously on the course. I’ll include the PartitionTesting.java file in the practicals and code folder for this chapter. I’ll also put the big log.txt file in there as well, although it is about 300 MB. It is a text file, so I think it will
compress down quite well. So, let’s have a look
at this file in detail. And it’s just doing exactly what I was doing on the caption really. I’m going to add into the execution plan the loading of that text file, and then, just for illustration, I’m going to output the number of partitions that Spark has determined for the RDD. And I can tell you that, given the size of that input file, it’s actually going to split that down into 32 MB partitions by default, which I think is the default when we’re working with a local file system. It’s fairly arbitrary — I’m not entirely sure on this one — but I think that’s purely because the HDFS default for the Hadoop file system is 32 MB. As we saw in a previous chapter though, when we ran this on Amazon Elastic MapReduce, it was 64 MB partitions. Doesn’t really matter how big they are; it’s just that they are of that kind of size. So, I think we’re going to end up with about 11 or 12 partitions based on the input size of that file.
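The opening of that file looks something like this sketch — the path and variable names are mine, not necessarily what’s in the download:

    JavaRDD<String> logLines = sc.textFile("src/main/resources/biglog.txt");
    // Just for illustration: how many partitions did Spark decide on?
    System.out.println("Initial partition count: " + logLines.getNumPartitions());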
So, now I’m going to do the mapToPair, which is going to split out the log level and the date. And remember, that’s a Narrow Transformation: there will be no shuffling. And I’m going to output how many partitions there are after that transformation. We should expect that number to be the same; there’s definitely no repartitioning going on when we have done a Narrow Transformation. But now we get to the Wide Transformation, and that’s where we’re going to do the groupByKey. Notice I’ve not done the filter — I am going to group everything — so we should have a very
expensive operation here. There’s going to be a
lot of shuffling to do. And then we’re going to output
the number of partitions after the Wide Transformation. Now, I will admit that for quite a long time when working with Spark, I always thought in my head, well, there’s now only going to be — well, on the caption we actually have three partitions, but I can tell you that the file I’ve generated here just contains, I think, Errors and Warnings — so in my head I’m thinking there’ll be two partitions after the Wide Transformation. Actually there will still be 11 or 12; the number of partitions will not have changed. It’s just that most of those partitions will be empty. And then all I’m doing
here, in the foreach, is I’m looping through each key, and I’m gonna output how many elements there are in the resulting value.
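That foreach might look something like this sketch — the results variable and the manual walk of the Iterable are my assumptions:

    results.foreach(tuple -> {
        // Count the elements in the grouped value by walking the Iterable.
        long count = 0;
        for (String date : tuple._2()) {
            count++;
        }
        System.out.println(tuple._1() + " : " + count + " elements");
    });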
So, I hope that’s all very straightforward, very similar to the kinds of things we’ve been doing before. Now, before I run this — I’ve actually done it off-camera, but I want to remind you that if you’ve got a previous run in the console, make sure you hit return, and make sure that the stop button is no longer lit up, otherwise you won’t get the WebUI. So let’s give this a run, and I mean it’s not massive data, but it’s going to be bigger than what we’ve had before. So, I told you in the chapter
on Amazon Elastic MapReduce what you’re seeing here in the logging. So, this is the number of completed tasks. So we have 11 tasks, so we
can immediately see actually without even looking at the print line that there were 11 partitions
in the original RDD. So this is just the number of completed tasks going up and up. But for the first time now, I can draw your attention to the fact that we have a stage zero and a stage one. You can probably guess now what the stages are all about; if not, I’ll explain when we get to the UI. But can you see that our initial RDD partition count was indeed 11? Now, that was just based on the size of the input file. If we doubled the size of the input file, we would get 22 partitions. After the Narrow Transformation, definitely still 11 partitions. But even after that Wide Transformation, we still have 11 partitions. But my strong suspicion is that only two of those partitions
had any data inside them. Well, I say it’s a strong suspicion. How do we prove it? Well, let’s have a look at that WebUI. All right then, back
to localhost:4040, and this time it’s simpler than before. We only have one completed job, and that’s because we only had one action, and the action was the foreach at the end. So let’s drill into this. And there is so much
information in this WebUI. I can’t show you everything, but I hope I’m going to show you all of the kind of main parts
of what we’re seeing here. The DAG, as we’ve seen previously each of these boxes is a transformation. So we have the initial transformation here which is the loading in of the text file. By the way if you click
on any of these boxes, it will take you to an expanded view and very usefully, will point
you back to the line of code that resulted in that transformation. I find that so useful. When you get a big graph,
it can be really confusing: you see a transformation, you’ve
no idea where it came from. Always good to map it back to the code. So, line 28 of the code
resulted in this transformation. Let’s go back. The important thing I
want to show you here is what these stages are all about. By and large — there are some exceptions — a stage is a series of transformations that don’t need a shuffle. When we get to the point
where a shuffle is required, then Spark creates a new stage. So we can immediately
see at a glance here, that our Spark job, needed one shuffle. Because this stage here,
got as far as the map. Let’s find out where the map came from. The map is here, line 32. This map was a Narrow Transformation so it was done in the same
stage as the previous. But then the next transformation
was the groupByKey. So, the shuffle, if you like — I’m kind of waving my pointer around a bit here — the shuffle kind of occurred here, after the map. It doesn’t explicitly show
it, but it doesn’t need to. So, this is really
useful, and this tells us, if we look at the stages — stage zero here. The name of the stage, by the way, is just the name of the last transformation. And some really important information here: we can get a feel for how long it took, it was about six seconds for the first stage. And we had 348 MB of input data; that was the size of the input file. But can you see that there was a shuffle write at the end? Now, when it does a shuffle, it has to write data to disc so that it can be serialised
before being transferred across the network. Now, the internal implementation of that is that data is compressed. So that’s why what we’re seeing
here is quite a lot smaller than the input, even
though we haven’t removed any of that data before doing the shuffle. So, we are seeing quite
a significant compression of that data. And can you see, that for the next stage, stage one, the first thing
it has to do for that stage is read that data back in. So that’s just how it is implemented — they call it a push-pull model: each stage will output its results and the next stage reads them back in. Okay, let’s drill into the stages, so let’s have a look at stage zero. And if we go further down here, there’s a, well, overwhelming amount of information, but let me point you to the highlights. We can see that there were 11 tasks, and this is very helpful really. It will tell us the average duration of each of the tasks. So, I can see on average — and that’s going to be the median here — they were taking about two seconds. And the minimum, the fastest of them, took about one second. And the slowest took about three seconds. Now, why is that interesting? Well, it’s interesting because there doesn’t seem to be a
big spread across those times. Now, that makes me quite happy, and I’ll explain why when we
have a look at the other stage. We’ve also got a line covering
garbage collection here. Now I doubt it will be a
problem for our simple task, but if you see some big
values sticking out here, then you might have a problem with creating too many
objects in your tasks. But we can see that the input sizes were on average 32 MB. It looks like one of the partitions didn’t quite get a complete block of 32 MB. Well, we can actually find that out, because if we go down to the bottom — this is incredibly useful — this is a list of all of the tasks, which were Java threads, executed against our 11 partitions. And they were all successful; there we can see the durations. And crucially, we can also see each of them was given 32 MB of input data, except the last one, which kind of got the remainder there. Each of them, when it came to the shuffle, wrote about 7 MB of data. So that’s an overview of what
these metrics are telling you. Now, in real life it will be up to you to work out if these
values are good or bad. I reckon for this particular stage there’s nothing worrying there, it’s the kinds of things we
would’ve expected to have seen. But if I go back and drill
into the second stage. So, this is where we’re doing the groupByKey on line 42 of the code. Line 42 here. That’s our Wide Transformation of course, and that’s why a new
stage is being triggered. Let’s have a look at what
we’re seeing in here. And I’m a little more worried. We can see the durations of these — remember we still have 11 partitions, so therefore 11 tasks. The minimum time was three milliseconds. The median was 29 milliseconds, but the maximum was nine seconds. So, I have an alarm bell ringing here: there’s a very wide spread of values there. And what that suggests is, some of the tasks are doing nothing, and some of the tasks
are working really hard. So with that alarm bell
ringing in my head, if we go further down. And we’ll have a look at the 11 tasks that executed in this stage. We now have the answer
laid out in front of us. Exactly as I described
on the captions really. It turns out that in this case, tasks six and seven got all of the data. So we’ve ended up with
a partition containing all of the Warn keys, and
we’ve got a second partition containing all of the Error keys, and all of the other
partitions are just empty. And that means that the Java code that’s running on partition six had an awful lot of work to do — nine seconds — and similarly for the task here, it had nine seconds of work to do. And the other tasks were basically — I guess the 35 milliseconds is just the time it took to start up, see there was nothing to do, and then stop again. And we can also see the shuffle reads — that’s the data they’ve read in from the results of the previous stage. Well, most of them read nothing in. So, you might imagine in real life I might have thought, oh well, it makes sense to
deploy this to a cluster with 11 worker nodes. And then we’ll get really
good parallel performance on the 11 partitions. That would’ve been fine
for the first stage but when it got to the second stage, we’re gonna find that most of that cluster will be sat there lying idle. So is that a problem? Well, it’s hard for me answer because it depends entirely
on your real-world data set. But it’s clearly something of concern that on this second stage, we’re
experiencing the possibility of our cluster being
massively underutilised. Now, if it turns out,
that in the real world stage zero is taking 90% of the time and stage one is taking 10% of the time, it probably doesn’t matter. For 90% of the time, all
11 nodes on my cluster are busy working away. And then okay, for the next stage only two nodes in the cluster are working, but it’s just a kind
of a finishing off job, and is only a short amount
of time spent there anyway. But if it turned out that stage zero is only taking 10% of the time, but then for 90% of the time, most of the cluster is sitting idle then clearly you would have a problem, and you would have to do some rework. So, this is definitely a problem and I’ve seen this happen
many, many times in Spark where you have a job
which is I don’t know, taking half an hour to run, and you need to get that job
running in say 10 minutes. So you invest in expanding
your cluster size and you end up paying for lots
of very expensive instances, and you find that the performance isn’t improving at all. And very likely, the reason for that is you have this problem where all of your data is ending
up on just a single node or just a small number of nodes, because of this keying problem. In the example I’ve been giving, I’m sort of running out
of steam with the example because I’m just counting
the number of messages. Which probably wouldn’t be a problem because that’s not going
to take long to do, and we’re finished anyway. But we have been doing quite
a lot of joins on this course, and I can tell you that, because joins work on keys, a join is a Wide Transformation. And so when you do a join, the data will be shuffled. And if you don’t have a wide spread of keys, you can end up with massive joined RDDs just residing on single nodes or single partitions. And you might see out-of-memory exceptions. And the work that you do after the join — and it’s quite typical that after doing a join you’re gonna be doing a lot more work — suddenly all of that work is taking forever, because of exactly what we’re seeing here. How do you avoid this? Well, the first suggestion is, as before: if you can, possibly rework your scripts so that the Wide Transformations
happen late in the process, so that your joins, or your groupBys, are only working on
small subsets of the data that can be run really quickly. If that’s not possible, then
you might have to resort to salting your keys. Salting might sound scary,
but it’s quite a simple hack. It does make your scripts
more complicated however so I would only be
thinking about doing this as a last resort if I
really did have problems. So, with the previous example, we’re getting all of our Warnings appearing in a single partition. So, what we could have done, in an earlier stage of the process, is we could have changed the keys so that the key is the log-level Warn, followed by some random number. So, this random number
is called the salt, and you might be familiar with salts if you’ve done anything with security. So in this example — it wouldn’t be as neat as this, but — the Warn ones are going to be in this partition, the Warn fours will be in this partition, the Warn twos and so on, meaning that the values
are now nicely spread across the entire cluster. And now after that, I can
do further transformations. Now, let’s say for example I needed a count of all the Warns. Well, I’d have to count the Warn ones, the Warn twos, the Warn threes and so on. And then add those subtotals together. So, it’s a little bit like
doing a MapReduce in a way. I won’t go on too long about this because for me it’s a
kind of a last resort, it’s something I’d only do
if I was really in trouble and I had no other way of getting the data spread across multiple partitions. Because it does complicate things — at some point, we’ve got to get this
data back together again. But it does at least work. Now, my objective with showing you this was only to give you a
brief flavour of salting so you know what it is. And I think you also need to know that it is not some really
big scary complicated process, it’s a bit of a hack really. Just to show you how this
might have worked in the code. So, with this code, remember, we ended up at stage two with all of the data on two partitions. And we had nine partitions that were absolutely unused. And therefore, if we’d used an 11-node cluster for example, we would have nine nodes burning up electricity and yet doing nothing. What I could’ve done in that scenario, as a last resort: when I did the map here, and I extracted the level, I could’ve just added — I’m going to add a Math.random. I’m not going to do a great job of this. I multiply the Math.random by 11 and convert that — it feels hacky, this; it is hacky — I’ll convert that into an int, and that’s going to be appended to the string. So I’m going to get Warn zero, Warn one, Warn two, in a random fashion.
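Something like this sketch, inside the existing mapToPair — the split and the names are my assumptions:

    // Salt the key: append a random suffix from 0 to 10, so one hot key
    // is spread across up to 11 partitions instead of just one.
    JavaPairRDD<String, String> salted = logLines.mapToPair(line -> {
        String[] parts = line.split(":", 2);
        String saltedKey = parts[0] + (int) (Math.random() * 11);
        return new Tuple2<>(saltedKey, parts[1].trim());
    });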
And I mustn’t forget to hit the enter key to end that previous job. And I’m going to run that again now. Now, while that’s running, I’ll just remind you that my programme is going to output the counts of all of the keys. Okay, so you can see now that rather than having a single key for each level, we’ve got multiple keys for each level. So yeah, if I wanted to count how many of each of these there were, I would have to work to gather together the Warns, and gather together the Errors. But I might be doing that
after some further work that filters them down a bit or something. But all I want to show you there is, as a result of doing that,
if I now look at stage one — and you remember it was all very skewed, the durations had a really wide spread — this time it’s looking a lot better. The minimum was less than a second, and the maximum was a second. Well, we’ll look at why that’s the case in a bit. But it looks like the median is actually looking pretty good. And the big difference now is for the tasks: yeah, the durations are a lot more balanced. They are not actually perfect, but certainly better than before, and you can see from the shuffle reads that all 11 partitions are reading in at least some of the data. So, I definitely want
to consider doing that if, after doing the groupByKey, I was doing lots more complicated operations where I wanted to retain the power of the parallelism. So, it’s difficult to explain that, and I just want to remind
you that salting is definitely an advanced operation — only do it if you really, really, really need to, because it will cause
you further headaches, but it is at least now
a tool in your toolbox. So we’re nearly there. We’ve had a look at
partitions and shuffles, and what I’ve just been describing is an example of Key Skew, and that’s where you
have a lot of instances of one particular key, and very few instances of other keys. Now, my example was just
a very specific example of Key Skew, but I hope you get the idea. But on my list of topics was actually to avoid groupByKey where possible. Now, it’s not really like me to say don’t use this, because you could argue that it depends on your requirements and it depends on the situation. But I’ve found — and I think you’ll find — that performance guides on Spark all start with avoid groupByKey. And I don’t need a new caption for this; I can pretty much use the
same example as previously. So, we started out with
our data nicely partitioned across lots of partitions. Remember it was chunked
from the file size. But then when doing groupByKey, we know now a shuffle is required
which is not a good thing. We do want to avoid
shuffles where possible, but really the worst
thing about groupByKey is it is now going to have
to store all of the values on a single node in your cluster. And actually this
example is a perfect one. I said the input data was terabytes in size. Well, what we’re going to end up with here is one of these nodes in the cluster ending up having to store at least a third of that file in memory. So, okay, we’ve got 11 nodes in this cluster, but after doing the groupByKey we’re going to have potentially a third of a terabyte on one node, because all of this data is being grouped together. So it’s very easy with groupByKey to end up with out-of-memory
exceptions on your nodes even if you have a lot
of nodes in your cluster. So that’s one reason
for avoiding groupByKey. I think you’ll find that groupByKey can always be replaced by a
more performant alternative. I’m going to give you a
specific example here. Let’s say that our
requirement with this exercise was to count the number of Fatals, the number of Warnings and the number of Errors, and it’s not working because we’re using groupByKey and this node is crashing
because it’s trying to fit too many of these Warning
messages in memory. Now, in this example, and
we’ve seen this on the course, it will be much, much more efficient to do a reduceByKey. Just a revision of how we did that: here is our raw pair RDD, with keys and values. We would then do the trick
of doing the mapToPair where we map each logging
level to the number one. Now remember, this is a
Narrow Transformation, so no shuffling will be
required at this stage. The caption doesn’t
capture this very well, but remember at this stage,
we will have terabytes of data but it will still be spread
evenly, fairly evenly, across the many nodes in our cluster. You know that the next
part of this process is to do a reduceByKey, we’ve done this many, many
times on the course so far. Now, reduceByKey will require a shuffle. So you might be thinking, well, how is this any better than groupByKey? But here is the big, important difference with a reduceByKey: reduceByKey has two phases, if you like. In the first phase, no
shuffling is required because Spark can apply this lambda to the elements on each partition. It will apply the reduce first
of all, on each partition without needing to do any shuffling. So, we will end up at the
end of this first stage with the same number of partitions, no shuffling will have occurred, but there will be if you
like a kind of subtotal, on each partition. Because my example is not very rich here, and I’ve only got a small
number of messages here. It doesn’t really come across brilliantly because you don’t see a
massive difference here. You’ll notice on this partition — that one came from here — there were two warnings, you can just see a Warn there, and that’s resulted in the reduced value of Warn comma two. So, let me rework that and put in some more realistic values. So the reduce has been done just on a partition-by-partition basis. So, we’ll end up with the
partitions with a subtotal of how many messages there
were, just in that partition. So these are much, much bigger values. And that is called a Map-Side Reduce — I’m not sure why they call it Map-Side Reduce, I think I’d rather call it a partition-side reduce — but the point of all of that is, I mean, we’re not at the final answer yet; it’s got to gather together the Fatals, for example, from each partition onto a single partition. Therefore it has to do a shuffle now. But I hope you can see that the amount of data being shuffled is dramatically reduced. In each partition, there’s going to be at most one entry for each key in our key space. For us that’s going to be just four or five. So, the shuffling is really minimal. So we end up now, after the shuffle, with all the Fatals on one partition, the Warns on another, the Errors on another. So, of course, all it now needs to do is apply the reduce again on the subtotals to get the final results, as seen here. So, just to reiterate then: reduceByKey will do a reduce on each partition first — that’s the map-side reduce — and that will greatly reduce the amount of data shuffled around. So you’re going to find that groupByKey in most circumstances can be replaced with an operation such as reduceByKey.
If you’ve got a richer requirement, then there are other transformations available in the API — we don’t have enough time to go into the details of all those transformations, but I hope you get a feel for what I’m getting at: groupByKey is generally to be avoided.
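As a minimal sketch of that counting pattern — the names are mine, and the split assumes the level-colon-date format from earlier:

    // Map each line to (level, 1), then reduce. The map-side reduce
    // produces one subtotal per key on each partition, so the shuffle
    // only has to move a handful of numbers.
    JavaPairRDD<String, Long> counts = logLines
            .mapToPair(line -> new Tuple2<>(line.split(":", 2)[0], 1L))
            .reduceByKey((a, b) -> a + b);
    counts.foreach(t -> System.out.println(t._1() + " : " + t._2()));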
Just one last topic to cover then, and that’s caching and persistence, which I know sounds big and scary, but I hope I will be able to demonstrate this with a simple example. If we go back to our partition testing — I’m just using this because it’s simple. I’ll just remember to hit return to terminate the previous programme. And let’s rework this, so on line 34 we don’t have that salting anymore; we would avoid salting unless absolutely necessary in real life. So we’re back to the state now, where we’re running
through our transformations and then we’re counting how many elements there are for each key. But let’s say we also want
to get the grand total: what was the total count of all of the elements? Well, a really simple way of doing that would be to take our RDD, which is called results; we can call count on the RDD, and we can pass that into System.err.println.
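That’s just one extra line — a sketch, assuming results is the grouped pair RDD:

    // count() is an action, so this forces another run of the plan.
    System.err.println("Total: " + results.count());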
I hope you’ll agree, nothing scary or confusing there. But I have actually made quite a serious performance blunder. It’s quite subtle, and I don’t know if you’ve spotted it, but I certainly could not spot that when I first started working with Spark. Let’s run the programme, and
we’ll have a look at the WebUI. I wonder if you expected that output. So we have the two stages as usual, and it output the elements for the keys, which is what we wanted. But then when we call
count, it ran another stage, which seems to suggest that to do the count it had to do a shuffle. Really? Why would it have to do a shuffle? And so this time, if we
go to the jobs page. And what we’re seeing here, is a list of all of the
actions that were performed. We only had one entry in here before and that was the foreach. But we now have a separate
entry for the count. So, I think we’ll find,
if we look at the foreach, nothing will have changed in there, this is exactly the same as before. But if I go into the count, you might be surprised to
see this is stage three, the extra stage that we didn’t expect. It has done the groupByKey again. And it’s also got a stage two here, which it skipped. Now, that’s confusing; I’ll come back to explain what that’s all about, but the first thing to address is why is it running another stage just to get that count? It appears to be redoing previous steps that were already done — if
I go back to the first job, it already did this groupByKey and mapValues in stage one. Clearly there’s something worrying here. And again, it’s all back to the fact that when we have a JavaRDD here, we don’t really have any data at all. There is no data loaded in by this line. What we’re doing is we’re saying, we need to add an element
to the execution plan. Now, it’s only when it reaches an action such as the foreach,
this is the first action and therefore that will become a job. And it executes the plan, required to achieve these results. So, it’s going to do the transformations, it’s going to do the shuffling, it’s going to do the grouping. And then it will output the results. But once it’s output the
results, the real RDD in memory — the actual data — will be discarded. So when it comes to the count, believe it or not, it has to go all the way back, in this case all the way back to line 28, to execute all of these steps again. Which is a real surprise to
anybody starting with Spark. And it’s the one big secret
that I’ve been keeping all the way through this course. And, I feel a little bit bad about that but I hope you understand
that you really need to have done quite a bit of work with Spark before you can address this concept. So when you get to an action, it has to go all the way back
to the initial RDD to rerun. It’s not storing all of this intermediate data in memory. Now, there is one optimization. If I go back and look at —
this is the count job, that’s the second job that was carried out. But as I said, it’s had to go all the way back to the beginning, to where it loaded in the text file, and it’s had to run all of those stages again. Well, you can see it hasn’t: it’s actually skipped stage two. Now, this is a performance optimization
in something like Spark 1.3 and it’s simply that
whenever a shuffle happens, so we know shuffles happen
here at the end of this stage — or I should say, actually, if I go back to the first job — when the shuffle happened
here at the end of stage zero, the results of the shuffle
are written to disc. I briefly mentioned that before, as a result of the shuffle writes. And that means, when it came to do the second job — I hope you’re clear now; it may be surprising, but it has to go all the way back to the beginning — it’s able to optimise, because it knows it’s already done this and already written the results to disc. So that stage can be skipped, and stage three can simply read that shuffle data back into memory. So, it’s not quite true that it always has to go
all the way back to the start. In fact, for an action to occur, it’s going to have to go back to the last shuffle stage, basically. So, in this example, to do the count, it’s had to do the groupByKey again. By the way, I didn’t mention this: this mapValues — we don’t have a mapValues in our code, but it’s actually just part of the groupByKey process. It’s sort of an internal implementation detail, and we can tell that if we follow the link there. You’ll see that they both map to the same line of our code; line 42 is indeed the groupByKey. So the point of that is, it’s had to do the groupByKey again, when we’ve already done the groupByKey. So this may seem really inefficient: every time an action is performed, Spark is, if you like, going all the way back to the beginning — actually not the beginning,
but it will be all the way back to the results of the previous shuffling. And it’s very easy to
end up with a situation where you’re doing a lot more work than you think you’re
doing, in your scripts. Now, what we can do to
avoid this inefficiency is with any RDD we can tell Spark that we want to store the RDD in memory, and keep it there for future operations. So, in our example when
we did this groupByKey, before we perform the actions, we could have said that the results — again, it’s another operation on the RDD, and it’s simply called cache, pretty easy. And remember, it’s a method that returns an object, so you need to reassign it. And that’s just telling Spark that the actual data that’s going to result from this transformation here — I think of it as a checkpoint — needs to be stored physically in memory. And that will mean, in this specific example, when it comes to the foreach, it will be working from that cached data. And similarly, when we get to the count, it will work from this cached data as well. In other words, it won’t have to run the groupByKey again.
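That single line is something like this — again assuming the RDD variable is called results:

    // cache() hands back an RDD reference, so reassign it. The data
    // produced at this point in the plan is then kept in memory.
    results = results.cache();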
So let’s see the results of that. Again, I’ll need to remember to stop that programme, and run again. Aah, well actually, I didn’t feel a difference there, and the reason is — you can see this warning here — it is warning me that there was not enough space to cache the RDD in memory. But yeah, unfortunately cache will only work if there’s enough space in RAM. Now, what will happen if cache fails? Well, it will just give you this warning, and then it will carry on as before. So, unfortunately, stage three did have to run that groupByKey again. And I definitely felt the
performance in parts of that. So cache is okay, if you’ve
got relatively small RDDs. The other alternative, instead
of cache is to use persist. Now, persist takes a parameter
of type StorageLevel. And the set of values you can use there, they're not very readable, and I'm afraid there's no good documentation in the Java docs for this. But we have, basically, the option of disc only, memory and disc, or memory only. Now, if you go for memory only, you're going to get exactly the same as cache; they're the same operation. I think in our case, if we go for memory and disc, it will try to use memory if possible, but if there's not enough space in memory, it will fall back to writing that RDD to disc.
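As a rough sketch of what that looks like in code, reusing the illustrative grouped RDD from before:

    import org.apache.spark.storage.StorageLevel;

    // Keep the RDD in memory where it fits, spilling the rest to disc.
    grouped = grouped.persist(StorageLevel.MEMORY_AND_DISK());

    // For reference, these two are equivalent:
    // grouped.cache();
    // grouped.persist(StorageLevel.MEMORY_ONLY());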
So we can try that. Again, I'll need to hit return to terminate the programme, and let's run again. Okay, so the warnings now are telling us that there wasn't enough space to cache it, but instead it's persisted to disc. Now, I have to admit that in this run, I didn't feel a performance
improvement from this. I really don't have time to analyse this any further, because I've got to release this course, but my guess is that the overhead of writing it to disc is actually not saving me any time on this specific example. But if you were working with much bigger data sets, on a much bigger cluster, then this could well
make a big difference. Back on the Spark UI,
again we’ve got two jobs. One for each of the actions. Now, this time if we look at
the first job, the foreach, the difference is, when we get to the mapValues stage, the green circle denotes that this RDD at this stage was cached. And now if we look in job one, you might not expect this: I expected not to see a stage three appearing here on the graph at all. I think it's slightly confusing that the WebUI still shows it on the graph. But the fact that there's a green circle there is telling us that, actually, this RDD is being read from the cache. So this groupByKey step wouldn't have needed to take place. And that's why, although there was a stage three, the green circle is telling us that it's worked from the cached value. Now, this has just been a small example, just to give us a feel for the WebUI, but what I'd like to do,
to complete this chapter is to go back to our
viewing figures example, which was a far more significant exercise. And I don’t know if you did that exercise but yeah, we had quite a
complicated set of transformations. So I'm planning to do the same thing here: before I close the Spark context, I'm going to do the trick with the Scanner, calling scanner.nextLine, so the programme blocks and the WebUI stays available. And I'll make sure that in the console, yeah, I've terminated the previous run.
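For reference, a minimal sketch of that trick, assuming the JavaSparkContext is held in a variable called sc:

    import java.util.Scanner;

    // Block until the user hits return, so the WebUI on localhost:4040 stays up.
    Scanner scanner = new Scanner(System.in);
    scanner.nextLine();

    // Only now close the Spark context (which takes the WebUI down with it).
    sc.close();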
I'm going to run this viewing figures example now, and we'll have a quick look at how this looks in the Spark console. And yes, it's significantly more complicated: we've got several stages, a zero, a one, a three, and a stage six. But can you tell, from this logging, it's trying to express that stage six here has been run in parallel with stage one? Now, that's interesting; we
haven't seen that happen so far, and that's what Spark is going to do if it determines that, on a particular pipeline of transformations, one stage isn't dependent upon the output of another stage: then it can run them in parallel. So that's really the point of
this DAG, this execution plan, and it's one of the main reasons why Spark runs things lazily: so that it can spot optimizations like this. But let's have a look at it, on localhost:4040. Now, we're seeing two jobs again, for the same reason we
saw two jobs earlier. When we’re doing a sortByKey,
it will trigger off a job to do the sorting. So, we’re seeing a separate
step for the sorting, and a separate step for the collecting. I won’t get hung up on that just now, but I think if we follow
the link to sortByKey, that’s going to give us actually quite a scary visualisation of the DAG. Certainly, far more complicated than anything else we’ve seen so far. Now, this video is getting long, so I won’t go too deep into it, but you should be able to piece together what’s happening, based on
what we've learned so far. This DAG is only a bit more complicated because, on that exercise, we worked with three
different input files. So, stage zero here, if I follow the link, is pointing to line 103, and it's actually reading in the titles of the courses, whereas this stage here is reading in the viewing figures. And this stage here is
reading in the chapter data. And then, well, I can't quite get my head around what we did after that, it's quite complicated, but we have various stages where we're doing things like joins, for example. If we take stage four
as a working example: this was on line 47, where we were joining together the viewing figures and the chapter data. So, going back, stage four clearly depends, just following these lines back, on these two sets of input data, this one here and this one here, being present. And actually, this input data here was the viewing figures, and you might remember we had to remove the duplicated views, which is the reason for this intermediate stage here.
here was the viewing figures and you might remember we had to remove the duplicated views, which is the reason for this intermediate stage here. It all hangs together, but I’ve realised, just scanning through this, that there’s something
very odd going on here. We only have three input text files. We have a stage reading them in, but if we go all the way
over to Stage Six here, you’ll see we’re reading
one of the files in a game. What’s going on there? Well, let’s follow the link. This is the chapters
data, that’s really odd because if I go to stage two, we’ve already read that
filing in stage II. Now, I admit this is complicated and you might want to the stop video and have a good look at this for yourself, and have a deep hard think
about what's going on. But again, it's all to do with the lazy evaluation, and the fact that when you do a transformation, the input data for the transformation is not stored in memory unless you ask otherwise. Now, what's going on here? Let's have a look at this map. This is the first transformation that we apply to that text file, and that map is happening on line 134. So actually, that map is right down here, and it's just the splitting
of the data into a pair RDD. And then the results of that are passed on to this join in stage four. So it's been joined, and it looks like, if I follow this line through, it's being joined to those distinct viewing figures. So the data inside the chapters
file is clearly needed so that it can do this join. But then, much later on in the process, and yeah, it is complicated, we have a reduceByKey operation here, on line 39. And this is where we worked out how many chapters there were for each course. I think that was a warm-up exercise, but the data is needed later on.
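Here's a rough sketch of that chapter-counting idea, again with illustrative names, assuming chapterData is a pair RDD of (chapterId, courseId):

    import org.apache.spark.api.java.JavaPairRDD;
    import scala.Tuple2;

    // Re-key by courseId with a count of 1 per chapter, then sum the counts.
    JavaPairRDD<Integer, Integer> chapterCounts =
            chapterData.mapToPair(row -> new Tuple2<>(row._2, 1))
                       .reduceByKey((a, b) -> a + b);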
So, my point is, we're reading the chapter data RDD here, but we're also doing a transformation on the chapter data RDD here. And as you know now, the data in these RDDs is not, by default, automatically cached unless you force Spark to do so. To put that another way: when it had to do the join here, it clearly needed to load the data in from the chapters file. But then, in a different
part of the process, we needed that data again; that's the point of this. We've used this input data in two separate places in our script. Now, I wouldn't have spotted that without looking at this graph and spending quite some time trying to work out what's going on. So, I'm guessing you might
be confused at this point, so let me just reiterate what we've identified: the data in this text file is used in two separate places in the process. And you know, from what we learned previously, that Spark is going to have to recalculate that data each time it's used. You know the solution to this now. If I hover over here, you can see this mapping operation that's done here on line 134 is the same in stage two, and it's the same in stage six.
that mapping operation. So, if we look on line 134 of the Java, and that’s here were
we’ve done this mapToPair. Now we’re returning
the Java pair RDD here, so I think what we can do
quite simply on the end of here is simply fluently add in a cache. So now, this data when it’s read in, will be stored in memory, and I think we’ll have
room in memory for this, and that will be reused
wherever it’s referenced later on in the graph. So, quite an advanced example of that, I’ll just terminate the previous run, and let’s see the difference. Now, I have benchmarked this off-camera, and because we’re working
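In code, the change is just a fluent .cache() on the end of the existing chain. A sketch, with an assumed file path and column layout rather than the exact course line:

    // Read the chapter data, split it into a pair RDD, and cache the result
    // so both stage two and stage six can reuse it without re-reading the file.
    JavaPairRDD<Integer, Integer> chapterData =
            sc.textFile("src/main/resources/chapters.csv") // hypothetical path
              .mapToPair(line -> {
                  String[] cols = line.split(",");
                  return new Tuple2<>(Integer.parseInt(cols[0]),  // chapterId
                                      Integer.parseInt(cols[1])); // courseId
              })
              .cache();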
So, quite an advanced example of that. I'll just terminate the previous run, and let's see the difference. Now, I have benchmarked this off-camera, and because we're working on such a small set of data, I'm not really seeing a big, big difference. And actually, thinking about it, the chapters.csv file, I don't know if you remember, is actually quite small data. So in practise, for this example, I don't think we really need that cache; I think we can easily take the hit of loading in that data twice, it wouldn't be a big deal. But in real life, if you
had done further transformations on this RDD, you might have done another 15 transformations on it before you then go on to use it. Well, you'd be doing those 15 operations more than once, and that would obviously be quite wasteful, so that's where the cache
would become useful. Anyway, I just want to show
you the difference now; in fact, I'll just refresh that page. And now, the difference is
you can see that green circle highlighting that these
two points in the graph are working from the same data. Now, I realise that was quite advanced but I didn’t want to just
work with simple examples, I wanted to get a little
bit more complicated. So you can see the kinds of problems you’re going to face when
working with Spark in real life. Hopefully in this chapter, I’ve given you a good look under the hood. And you know, I could have
spent an entire course, I could probably have
spent a series of courses looking at the internal workings of Spark but I think that’s probably enough if you’re wanting to
get started with Spark and certainly if you
are a Spark programmer. I hope you’ve now got a
feel for some of the traps that could catch you out when working in Spark. Thanks for watching the course. I'll see you, hopefully, in the next module, where we'll be looking at Spark SQL, or "Spark Sequel". In the meantime, happy Sparking.
