Cloud Computing – Computer Science for Business Leaders – July 2016


DAVID MALAN: All right. So we are back. Now it’s time for the cloud. What the heck is the cloud? Who’s in the cloud? Who uses the cloud? Yeah, OK. Everyone over here and over here. OK. So what does this mean? We kind of take it for granted now. But what does cloud computing mean? Yeah? AUDIENCE: Off premise? DAVID MALAN: Off premise. OK, good. So what’s off premise? AUDIENCE: You’re hosting all your
data off of your physical premises. [INAUDIBLE]. DAVID MALAN: Good. AUDIENCE: [INAUDIBLE]. DAVID MALAN: OK, applications, too. David? AUDIENCE: [INAUDIBLE] off-site storage. DAVID MALAN: Yeah, and this holds true
for both businesses and consumers. Like, for us consumers in the room,
we might be in the cloud in the sense that we’re backing up
our photos to iCloud or you’re using Dropbox
or some such service to store your data in the cloud. But a company could
certainly do it, too. And that’s where we’ll focus
today– what you actually do if you’re doing
something for a business and you don’t want to run
your own on-premise servers. You don’t want to have to worry about
power and cooling and electricity and physical security and hiring
someone to run all of that and paying for all of that. Rather, you’d much rather
leverage someone else’s expertise and infrastructure
for that kind of stuff and actually use their
off-premise servers. And so cloud computing kind of came
onto the scene a few years ago, when really it was just
a nice new rebranding of what it means to outsource or to
rent someone else’s server space. But it’s been driven, in
part, by technological trends. Does anyone have a sense of
what it is that’s been happening in industry technologically that’s made
cloud computing all the more the rage and all the more
technologically possible? AUDIENCE: It’s faster? DAVID MALAN: Faster? What’s faster? AUDIENCE: The give and take of
the data between [INAUDIBLE]. DAVID MALAN: OK, so transfer rates,
bandwidth between points A and B. It’s possible to move
data around faster. And now that that’s the case,
it’s not such a big deal, maybe, if your photos are stored in the cloud
because you can see them almost as quickly on your device, anyway. Sean? AUDIENCE: Cellular technology. DAVID MALAN: Cellular technology. How so? AUDIENCE: As far as speed. [INAUDIBLE]. DAVID MALAN: Yeah. Yeah, this is definitely kind
of the bottleneck these days. But it’s getting better. There was, like, EDGE, and then 3G,
and now LTE and other such variants thereof. And so it’s becoming a
little more seamless, such that it doesn’t matter where the data
is because if you see it pretty much instantaneously, it doesn’t matter
how close or how far the data is. Other trends? AUDIENCE: Web hosts. DAVID MALAN: What’s that? AUDIENCE: Web hosts DAVID MALAN: Web– AUDIENCE: Like, AWS, Google [INAUDIBLE]. DAVID MALAN: Oh, OK, sure. So these providers, these
big players especially that have really popularized this. Amazon, in particular,
was one of the first some years ago to start building out
this fairly generic infrastructure as a service, IaaS, which is kind of the
silly buzzword or buzz acronym for it– Infrastructure As a Service, which
describes the sort of virtualization of low-level services. And we’ll come back to this in just a
bit as to what that menu of options is and how they are representative
of other offerings from, like, Google and
Microsoft and others. What– Grace? AUDIENCE: [INAUDIBLE] scale much faster? Like, the volumes of data [INAUDIBLE]. DAVID MALAN: OK. And what do you mean by the
ability to scale faster? Where does that come from? AUDIENCE: Cloud server could
scale up additional memory that you couldn’t do if you
were in a physical location. DAVID MALAN: OK. AUDIENCE: Like, having to buy and
build out more servers at Harvard is much harder to do, to get more space. DAVID MALAN: Yeah. Absolutely. And let me toss in the word
spikiness or spiky traffic, especially when websites
have gotten mentioned on Buzzfeed or Slashdot or other
such websites, where all of a sudden your baseline users might be some number
of hundreds or thousands of people per day or per second or whatever. And then all of a sudden,
there’s a massive spike. And in yesteryear, the results of that
would be your website goes offline. Well, why is that? Well, let’s actually focus
on that for a moment. Why would your website or
web server physically go offline or break or crash just
because you have a lot of users? What’s going on? Sean? AUDIENCE: Kind of [INAUDIBLE] over here. The computer only has so
many connections, I guess? DAVID MALAN: Uh-huh, OK. AUDIENCE: Maxed out? DAVID MALAN: Yeah. So if you– a computer,
of course, can only do a finite amount of work
per unit of time, right? Because it’s a finite device. There’s some ceiling on how much disk
space it has, CPU cycles, so to speak, how much it can do per second,
how much RAM it actually has. So if you try to exceed that
amount, sometimes the behavior is undefined, especially if
the programmers didn’t really worry about that upper bound scenario. And at the end of the day,
there’s really nothing you can do. If you are just getting request after
request after request, at some point, something’s got to break along the way. And maybe it’s the
routers in between you and point B that just
start dropping the data. Maybe it’s your own server
that can’t handle the load. And it just gets so consumed
with handling request, request, request, even if it runs out of
RAM, it might use virtual memory, as we discussed. But then it’s spending all of its time swapping that data to and from disk until you get locked in this cycle where
now you’ve used all of your disk space. And frankly, computers do not like
it when they run out of disk space. Bad things happen, mostly
because the software doesn’t anticipate that actually happening. And so things slow to
a crawl, and the server effectively freezes, crashes, or
some other ill-defined behavior. And so your server goes offline. So what do you do in cases of
that spiky traffic, at least before cloud computing? AUDIENCE: More servers. DAVID MALAN: Right, more servers. So you sort of hope that
the customers will still be there tomorrow or next week or
next month when you’ve actually bought the equipment and plugged it
in and installed it and configured it. And the thing is, there’s
a lot of complexity when it comes to wiring things up,
both virtually and physically, as to how you design your software. So we’ll come to that in just a moment. So cloud computing’s gotten
super alluring insofar as you can amortize your costs over
all of the other hardware there– you and other people can do this. And so when you do get spiky behavior,
assuming that not every other company and website on the internet is also getting a spike at the same time– which stands to reason, since there’s only a finite number of users, so spikes tend to hit one place or another, not everywhere at once. And there’s many different
providers out there. You can consume all the more of Amazon
or Microsoft’s or Google’s services, and then as soon as people
lose interest in your site or the article that got
reblogged or whatnot, then you sort of turn off
those rented servers that you were borrowing from someone else. Now, ideally, all of this
is automatic, and you yourself don’t have to log in
anywhere or make a phone call and actually scale up your
services by saying, hey, we’d like to place an
order for two more servers. And indeed, all the rage these days
would be something like autoscaling, where you actually configure the service
or write software that monitors, well, how much of my RAM am I using? How much disk space am I using? How many users are currently
on my website right now? And you yourself define
some threshold such that if your servers are
at, like, 80% of capacity, you still have 20%, of course. But maybe you should
get ahead of that curve, turn on automatically some more
servers configured identically so that your overall
utilization is maybe in the more comfortable
zone of 50% or whatever it is you want to be comfortable
with so that you can sort of grow and contract based on actual load. Yeah? AUDIENCE: That’s like an elastic cloud. DAVID MALAN: Exactly. So elastic cloud is– that’s
using two different Amazon terms. But yes, anything elastic means
exactly this– having the software automatically add or subtract resources
based on actual load or thresholds that you set. Absolutely.
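For the sake of illustration, a minimal sketch of that kind of threshold logic, written here in Python, might look like the following; the three helper functions are hypothetical stand-ins for whatever monitoring and provisioning API your cloud provider actually exposes.

    import random

    SCALE_UP = 0.80      # add capacity above 80% utilization
    SCALE_DOWN = 0.50    # shed capacity below 50% utilization
    MIN_SERVERS = 2

    def get_average_cpu(servers):
        # Hypothetical stand-in for a real monitoring API.
        return sum(random.random() for _ in servers) / len(servers)

    def add_server():
        return "new-server"   # pretend to boot another identically configured server

    def remove_server(server):
        pass                  # pretend to shut one down and stop paying for it

    def autoscale(servers):
        utilization = get_average_cpu(servers)
        if utilization > SCALE_UP:
            servers.append(add_server())
        elif utilization < SCALE_DOWN and len(servers) > MIN_SERVERS:
            remove_server(servers.pop())
        return servers

Run in a loop, logic along these lines is what lets the fleet grow and contract without anyone placing a phone call.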
So what’s the downside of this? AUDIENCE: Security? DAVID MALAN: Security? How so? AUDIENCE: Getting all of your data types. DAVID MALAN: OK. So yeah. I mean, especially for
particularly sensitive data, whether it’s HR or financial or
intellectual property or otherwise. Cloud computing literally
means moving your data off of your own, what might have been
internal servers, to someone else’s servers. Now, there are private clouds,
which is a way of mitigating this. And this is sort of a
sillier marketing term. Private cloud means just having
your own servers, like you used to. But maybe more technically it will often
mean running certain software on it so that you abstract away the detail
that those are your own servers so that functionally, they’re
configured to behave identically to the third-party servers. And all it is is like a line in
a configuration file that says, send users to our private cloud
or send users to our public cloud. So abstraction comes
into play here, where it doesn’t matter if
it’s a Dell computer or IBM computer or anything else. You’re running software
that creates the illusion to your own code, your own
programs, that it could just be third-party servers or your own. It doesn’t matter.
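As a toy illustration of that abstraction, in Python: the application code asks for a server without caring which cloud it lives in, and a single setting decides. Both hostnames here are made up.

    CLOUD = "private"    # flip to "public" and nothing else in the code changes

    ENDPOINTS = {
        "private": "https://cloud.internal.example.com",   # your own servers
        "public": "https://rented.example.com",            # a third party's servers
    }

    def server_url():
        return ENDPOINTS[CLOUD]

    print(server_url())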
But putting your data out there might not be acceptable. In fact, there are so many
popular web services out there. GitHub, if you’re
familiar, for instance, is a very popular service for
hosting your programming code. And your connection to GitHub will
be encrypted between point A and B. But your code on their
servers isn’t going to be encrypted because the
whole purpose of that site is to be able to share your code,
either publicly or even internally, privately, to other people
without jumping through hoops with encryption and whatnot. So it could exist, but it doesn’t. But so many companies are putting
really all of their software in the cloud because it’s kind of
trendy to do or it’s cheaper to do or they didn’t really think it
through, any number of reasons. But it’s very much in vogue,
certainly, these days. What’s another downside of using
the cloud or enabling autoscaling? AUDIENCE: If you don’t have
internet, you don’t have cloud. DAVID MALAN: Yeah. So if you don’t have internet access,
you don’t have, in turn, the cloud. And this has actually
happened in some ways, too. It’s very much in vogue these
days for software developers to just assume constant internet access
and that third-party services will just be alive so much so that– and
we ourselves here on campus do this because it’s cheaper and
easier at the end of the day. But we increase these risks. If this is little old me
on my laptop, and we’re using some third-party
service like GitHub here– and there’s equivalents
of this– and then maybe this is Amazon Web Services over here. And this is our cloud provider,
this is the middleman that’s storing our programming code
just because it’s convenient and it’s an easy way for us
to share so that I can use it, my buddies here on their
laptops can use it. And so all of us might–
let’s just put one arrow. So if these lines represent our
laptops’ connections to GitHub, where we’re constantly sharing
code and using it in a cloud sense to kind of distribute. If I make changes, I can push
it here, then persons B and C can also access the same. It’s very common these
days with cloud services to have what are called
hooks, so to speak, whereby a hook is a
reaction to some event. And by that I mean if I, for instance,
make some change to our website and I push it, so to speak, to GitHub
or whatever third-party service, you can define a hook
on GitHub’s server that says any time you hear a push
from one of our customers, go ahead and deploy that customer’s
code to some servers that have been preconfigured to
receive code from GitHub. So you use this as sort of a middleman
so that everyone can push changes here. Then that code automatically
gets pushed to Amazon Web Services, where our customers can
actually then see those changes.
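A rough sketch of what the receiving end of such a hook might look like, using Python and Flask purely for illustration; deploy_latest_code is a made-up placeholder for whatever your actual deploy step is.

    from flask import Flask, request

    app = Flask(__name__)

    @app.route("/hooks/push", methods=["POST"])
    def on_push():
        # A service like GitHub would POST here whenever someone pushes code.
        event = request.get_json(silent=True) or {}
        deploy_latest_code(event.get("repository"))
        return "", 204

    def deploy_latest_code(repository):
        print(f"deploying {repository}...")   # stand-in for pulling the new code onto your servers

    if __name__ == "__main__":
        app.run(port=8080)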
And this is an example of something called Continuous Deployment, or CD, whereby, whereas in yesteryear
or yester-yesteryear, companies would update their software
once a year, every two years. You would literally receive in
the mail a shrink-wrapped box before the internet. And then even when there
was the internet and things like Microsoft Office, they might
update themselves every few years. Microsoft Office 2008, Microsoft Office
2013, or whatever the milestones were. Much more in vogue these days,
certainly among startups, is continuous deployment, whereby
you might update your website’s code five times a day, 20 times
a day, 30 times a day. Anytime someone makes even the
smallest change to the code, it gets pushed to some
central repository, you maybe have some
tests run– also known as Continuous Integration,
whereby certain tests are run automatically to make sure the code is working as expected. And if so, it gets pushed
to someone like Amazon. So among the upsides here
is just the ease of use. We as users over here, we don’t have
to worry about how to run our servers. We don’t have to worry about how to
share our code among collaborators. We can just pay each of those
folks a few dollars per month, and they just make all
this happen for us. But beyond money, what other
prices must we be paying? What are the risks? AUDIENCE: [INAUDIBLE]. DAVID MALAN: What’s that? AUDIENCE: [INAUDIBLE]. DAVID MALAN: Sure. So we’re consuming a lot more
bandwidth, which at least for stuff like code, not too
worrisome, since it’s small. Video sites, absolutely, would
consume an order of magnitude more. Netflix has run up against that. Yeah, Victor? AUDIENCE: Latency? DAVID MALAN: Latency? How so? AUDIENCE: Servers are on
different sides of the country. DAVID MALAN: Yeah. So whereas if you had a central server
at your company, in this building, for instance, we could save our code
centrally within a few milliseconds, let’s say, or, like, a second. But if we have to push it
to California or Virginia or whatever GitHub’s
main servers are, that could take longer, more
milliseconds, more seconds. So you might have latency, the time
between when you start an action and it actually completes. Other thoughts? Yeah. AUDIENCE: What about trust,
security issue, [INAUDIBLE]. DAVID MALAN: Yeah, this is kind of
one of the underappreciated details. I mean, fortunately
for our academic uses, we’re not really too worried about
people stealing our code or the code that we use to administer
assignments and such. But we’re fairly unique in that sense. Any number of companies that
actually use services like GitHub are putting all of their intellectual
property in a third party’s hands because it’s convenient. But there’s a massive
potential downside there. Now thankfully, as an
aside, just because I’m picking on GitHub as
the most popular, they have something called GitHub Enterprise
edition, which is the same software, but you get to run it on your
own computer or your own servers or in the cloud. But even then, Amazon, in
theory, has access to it because Amazon has physical
humans as employees who could certainly physically
access those devices, as well. And really all that’s
protecting you in that case there are SLAs or policy agreements
between you and the provider. But this is, I think, an
underappreciated thing. I mean, most any internet
startup certainly just jumps on the latest and
greatest technologies, I dare say without really
thinking through the implications. But all it takes is for GitHub
or whatever third-party service to be compromised. You could lose all of your
intellectual property. But even more mundanely but
significantly, what else could go wrong here? AUDIENCE: If the server went down– DAVID MALAN: Yeah. When GitHub goes down,
half of the internet seems to break these days, at
least among trendy startups and so forth who are using
this third-party service. Because if your whole system is built to
deploy your code through this pipeline, so to speak, if this piece breaks,
you’re kind of dead in the water. Now, your servers are
probably still running. But they’re running older
code because you haven’t been able to push those changes. Now, you could certainly
circumvent this. There’s no fixed requirement
of using this middleman. But it’s going to take time, and
then you have to re-engineer things. And it’s just kind of a headache. So it’s probably better just to kind of
ride it out or wait until it resolves. But there’s that issue, too. Very common too is for
software development– more on this tomorrow– to
rely on third-party libraries. A library is a bunch of code
that someone else has written, often open source, freely available,
that you can use in your own project. But it’s very much in vogue these days
to resolve your dependencies at deployment time. What do I mean by this? Suppose that I’m writing software
that uses some third-party library. Like, I have no idea how
to send emails, but I know that someone else wrote
a library, a bunch of code, that knows how to send emails. So I’m using his or her library in
my website to send my emails out. It’s very common these days not
to save copies of your libraries you’re using in your own code
repository, partly ’cause of principle. It’s just redundant, and if someone
else is already saving and archiving different versions of their email
library, why should you do the same? It’s wasting space. Things might get out of sync. And so what people will sometimes do
is you store only your website’s code here. You push it to some central source. And the moment it gets
deployed to Amazon Web Services or wherever is when
automatically, some program grabs all these other third-party
services that you might have been using that get linked in as well. And we’ll call these libraries.
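As a rough sketch of what resolving dependencies at deployment time can amount to, here are a few lines of Python a deploy script might run; the library names are placeholders, not real packages.

    import subprocess
    import sys

    DEPENDENCIES = ["some-email-library", "some-left-padding-library"]   # placeholders

    for name in DEPENDENCIES:
        # Fetch each library fresh at deploy time rather than keeping a copy in your
        # own repository. If any of them has vanished, the deployment fails.
        subprocess.run([sys.executable, "-m", "pip", "install", name], check=True)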
Of course, the problem there is exactly the same as if GitHub goes down. And it’s funny, it’s
the stupidest thing– let me see– node.js
left pad. OK. So this was all the rage. Here you go– how one developer
just broke Node, Babel, and thousands of projects. So this was delightful to
read because in recent years, it’s just become very common,
this kind of paradigm– to not only use libraries, which
has been happening for decades, but to use third-party libraries
that are hosted elsewhere and are pulled in dynamically
for your own project, which has some upsides but also some downsides. And essentially, for reasons
I’ll defer to this article or can send the URL
around later, someone who was hosting a library called
left-pad whose purpose in life is just to add, I think, white
space– so space characters– to the left of a sentence, if you want. If you want to kind of shift
a sentence over this way, it’s not hard to do in code. But someone wrote this, and it’s
very popular to make it open source. And so a lot of people were
relying on this very small library. And for whatever reason–
some of them, I think, personal– this fellow removed his
library from public distribution. And to this article’s
headline, all of these projects suddenly broke because
these companies and persons are trying to deploy their
code or update their code and no longer can this
dependency be resolved. And I think– I mean, what’s
amazing is how simple this is. So let me see if I can find the code. OK, so even if you’re unfamiliar
with programming– well, this is not that much code. It looks a little scary because
it has so many lines here. But half of these lines are
what are called comments, just human-readable strings. This is not, like, a huge
amount of intellectual property. Someone could whip this up in probably
a few minutes and a bit of testing. But thousands of
projects were apparently using this tiny, tiny piece of software. And the unavailability of it
suddenly broke all of these projects.
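The original library was a handful of lines of JavaScript; a rough Python equivalent of the same idea is just this:

    def left_pad(text, length, fill=" "):
        # Pad a string on the left, with spaces by default, out to a given length.
        if len(text) >= length:
            return text
        return fill * (length - len(text)) + text

    print(left_pad("hello", 10))     # "     hello"
    print(left_pad("42", 5, "0"))    # "00042"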
So these are the kinds of decisions, too, that might come as a surprise, certainly, to managers and folks who ask, why is the website down? Well, someone took down
their third-party library. This is not, like, a great threat
to software development per se. But it is sort of a side
effect of very popular trends and paradigms in engineering–
having very distributed approaches to building your software, but you
introduce a lot of what we would call, more generally, single
points of failure. Like if GitHub goes down, you go down. If Amazon Web Services go down, if
you haven’t engineered around this, you go down, as well. And so that’s what’s both exciting
and sort of risky about the cloud, is if you don’t necessarily
understand what building blocks exist and how you can assemble
all of those together. So let’s come back to one final
question, but first Vanessa. AUDIENCE: So what would
be a best practice? ‘Cause I know engineers
I’ve worked with don’t want to create dependency in their code. So they would do exactly [INAUDIBLE]. DAVID MALAN: Yeah. It kind of depends on
cost and convenience. Like the reality is it
is just– especially for a young startup,
where you really want to have high returns quickly from
the limited resources and labor that you have. You don’t necessarily want your humans
spending a day, a weekend, a week sort of setting up a
centralized code repository and all of the sort of
configuration required for that. You don’t necessarily want them to have
to set up their own servers locally because that could take, like, a week or
a month to buy and a week to configure. And so it’s kind of this
judgment call, whereby, yes, those would be better in terms of
security and robustness and uptime. But it’s going to cost us a month or
two of labor or effort by that person. And so now we’re two months
behind where we want to be. So I would say it is
not uncommon to do this. For startups, it’s probably fine because
you’re young enough and small enough that if you go offline, it’s
not great, but you’re not going to be losing millions of dollars
a day as a big fish like Amazon might. So it kind of depends on what
the cost benefit ratio is. And only you and they
could determine that. I would say it’s very common to do this. It is not hard to add your
dependencies to your own repository. And this is perhaps a stupid trend. So I would just do that
because it’s really no cost. But then there’s other issues here
that we’ll start to explore in a moment because you can really go overboard
when it comes to redundancy and planning for the worst. And if there’s only a, like, 0.001%
chance of your website going offline, do you really want to spend 10,000
times more to avoid that threat? So it depends on what the
expected cost is, as well. So we’ll come to those
decisions in just a moment. So one final question– what
else has spurred forward the popularity of cloud computing
besides the sort of benefits to users and companies? What technologically has made this,
perhaps, all the more of a thing? Ave? AUDIENCE: We’re so
reliant on [INAUDIBLE]. DAVID MALAN: Yeah. So this is a biggie. I mean, I alluded earlier
to this verbal list of, like, power and
cooling and physical space, not to mention the money
required to procure servers. And back in the day– it was
only, like, 10 or so years ago– I still remember doing
this consulting gig once where we bought a whole lot
of hardware because we wanted to run our own servers and
run them in a data center where we were renting space. And maybe the first
time around, it was fun to kind of crawl around on
the floor and wire everything together and make sure that
all of the individual servers had multiple hard drives for
redundancy and multiple power supplies for redundancy and think
through all of this. But once you’ve kind of done that once
and spent for that much redundancy only to find that, well,
occasionally your usage is here. Maybe it’s over here. But you sort of have to pay for up here. It’s not all that compelling. And it’s also a huge
amount of work that doesn’t need to be done by you and
your more limited team. So that’s certainly driven this. Not to mention lack of space. Like at Harvard, we
started using the cloud, in part, because we, for
our team– we had no space. We had no cooling. We kind of didn’t really have power. So we really had no other options other
than putting it under someone’s desk. AUDIENCE: I was gonna say
one other [INAUDIBLE] server and storage technology
that makes it actually cost effective for these
companies to do this. Where before, they could
only do it for themselves. DAVID MALAN: That’s what’s
really helped technologically. If you’ve heard of Moore’s law,
whose definition kind of gets tweaked every few
years– but it generally says that the number
of transistors on a CPU doubles every 18 months
or 12 months or 24 months, depending on when you’ve
looked at the definition. But it essentially says
that technological trends double every year, give
or take, which means you have twice as many transistors
inside of your computer every year. You have twice as much storage space
for the same amount of money every year. You have twice as much
CPU speed or cores, so to speak, inside of your
computer every year. So there’s this doubling.
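To see how quickly that compounds, here’s a two-second calculation in Python; the starting figure is just an illustrative round number, not any particular chip.

    transistors = 1_000_000              # illustrative starting point
    for year in range(1, 11):
        transistors *= 2
        print(f"after year {year}: {transistors:,}")
    # Ten doublings later, you're at roughly 1,000 times where you started.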
And if you think about a doubling, it’s the opposite of the logarithmic curve we
saw earlier, which still rises, but ever more slowly. Something like Moore’s law is
more like a hockey stick, where we’re kind of more on
this side nowadays, where the returns of having things
double and double and double and double have really started to yield
some exciting returns, so much so that this Mac here–
let’s see, About This Mac. This is three gigahertz,
running an Intel Core i7, which is a type of CPU,
16 gigabytes of RAM. So this means in terms of
processor, CPU, speed, my computer can do 3.1 billion things per second. What is the limiting factor,
then, in using a computer? I can only check my email so fast
or reload Facebook so quickly. The human is by far the slowest
piece of equipment standing in front of this laptop. And so we’re at the point even
with desktop or laptop computers that we have far more resources than
even we humans know what to do with. And servers, by contrast, will
have not just one or two CPUs. They might have 16 or 32 or
64 brains inside of them. They might have tens of gigabytes
or hundreds of gigabytes of RAM that those CPUs can use. They might have terabytes and
terabytes of space to use. And it’s sort of more than
individuals might necessarily need. You have so much more
hardware and performance packed into such a small
package that it would be nice to amortize the
costs over multiple users. But at the same time, I don’t want
my intellectual property and my code and my data sitting alongside Nicholas’s
data and Ave’s data and Sarah’s data. I want at least my own user accounts
and administrative privileges. I want some kind of barrier
between my data and their data. And so the way that has– what’s
really popularized this of late has been virtualization
or virtual machines. And this is a diagram drawn
from Docker’s website, which is an even newer incarnation
of this general idea. But if you’re unfamiliar
with virtualization, the user-facing feature
that it provides is it allows you to run one operating
system on top of another. So if you’re running Mac
OS, you can have Windows running in a window on your computer. And conversely, if you run
Windows, you can, in theory, run Mac OS in a window on your computer. But Apple doesn’t like to let
people do this, so it’s hard. But you can run Linux
or Unix– these are other operating systems–
on top of Mac OS or on top of Windows, again, sort of
visually within a window. But what that means is you are
virtualizing one operating system and one computer, and using one
computer to pretend that it can actually support multiple ones. So pictorially, you might have this. So infrastructure is just referring
to your Mac or PC in this story. Host operating system’s
going to be Mac OS or Windows for most people in the room. Hypervisor is the fancy name given
to a virtual machine monitor. It’s virtualization software, like
VMware Fusion, VMware Workstation, VMware Player– suffice
it to say, VMware is a big company in this space– Oracle
VirtualBox, Microsoft Virtual PC. And there’s a few other– something,
the company’s name might be Parallels. Parallels for Mac OS. There’s a lot of different
software that can do this. And as this picture suggests, it
runs on top of the operating system. So it’s just a program
running on Mac OS or Windows. But then as these three towers
suggest, what the hypervisor does for you is it lets you run as
many as three different operating systems, even more, on top of your own. And you can think of it as
being in separate windows. So now that this is possible, if I
might go out and rent, effectively, in the cloud a really big server with
way more disk space, way more RAM, way more CPU cycles than I need for my
little business, well, you know what? I could chop this up, effectively,
for Nicholas, Ave, Sarah, and myself so that each of us can
run our own operating system– different operating systems, no less. We can each have our own
usernames and passwords. All of our data and code can be
isolated from everyone else’s. Now, whoever owns that machine, in
theory, could access all of our work because they have physical access. But at least Nicholas, Sarah, and
Ave are compartmentalized, as am I, so that no one else can
get at our own data. And so one of the reasons that
we have virtualization so trendy these days is we just have almost more
CPUs and more space and more memory than we even know what to do with,
at least within the footprint of a single machine. So that too, has spurred things forward. Now, as an aside, there’s another
technology– no break yet. There’s another technology that I alluded to a moment ago called containerization, which is, if you’ve not heard the term, an even lighter-weight version of this, whereby containers are similar in spirit
to virtual machines but can be started and can be booted much faster than
full-fledged virtual machines. We’ll have more on those another time. Yeah, Anessa. AUDIENCE: So I know at
least the team that I worked with [INAUDIBLE]
containerization is the thing right now. And they’re even building [INAUDIBLE]. What are some of the–
I just want to get a better understanding of the values
and the risks of containerization. DAVID MALAN: Sure. So big fan. In fact, I and our team are in the
process of containerizing everything that we do right now. So big fan. Let me see, what is Docker? So Docker is sort of the
de facto standard right now, though there’s
variations of this idea. And the picture I showed is
actually from their own comparison. Oh, they seem to have changed it. Now they’ve changed it to blue. But here is kind of a side-by-side
comparison of the two ideas. So on the left is virtualization,
a sort of two-dimensional version of what we just saw in blue. And on the right is containerization. So one of the takeaways the
picture is meant to convey is look how much lighter-weight
Docker is on the right-hand side. There’s just less clutter there. But that’s kind of true. Containerization does the
following– or rather, virtualization has you running one base operating
system and hypervisor on top of it, and then multiple copies of some
other OS or OSes on top of those. Containerization has you
run one operating system that all of your so-called
containers share access to. So you install one operating
system underneath it all. And then all of your containers
share some other operating system of your choice. So that’s already reducing from
three down to just one operating system, for instance. Moreover, containerization tends to use
a technique called union file system. A file system is just the fancy term
for the way in which you store data on your hard drives and solid
state drives and so forth. A union file system
gives you the ability to layer things so that, for
instance, the owner of this machine would install some base
layer of software– like, only the minimal
amount of software necessary to boot the computer. But then Anessa, you and
your team might need– you might be writing
your product in Python. So you need certain Python
software and certain libraries. I, by contrast, might be
writing my site in PHP. I don’t need that layer. I need this layer of software. And what containerization
allows you to do is all share everything
that’s down here, but only optionally add these layers
such that only you see your layer, only I see my layer. But we share enough of
the common resources that we can do more work on
that machine per unit of time because we’re not trying to run one,
two, three separate operating systems. We’re really just running
one at that layer. So that’s the gist of it.
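A toy sketch of that layering idea in Python: look a file up in your own layer first, then fall back to the shared layers beneath. The paths and layer contents are made up.

    base_layer   = {"/bin/sh": "minimal OS", "/etc/hosts": "shared config"}
    python_layer = {"/usr/bin/python": "Python runtime"}   # one team's extra layer
    php_layer    = {"/usr/bin/php": "PHP runtime"}         # another team's extra layer

    def read_file(path, layers):
        for layer in layers:             # check the top-most layer first
            if path in layer:
                return layer[path]
        raise FileNotFoundError(path)

    print(read_file("/usr/bin/python", [python_layer, base_layer]))   # only that container sees this
    print(read_file("/etc/hosts", [php_layer, base_layer]))           # everyone shares the base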
And I would say the risks and the downsides are it’s just so bleeding edge, still. I mean, it’s very popular. I just came back from Dockercon,
the Docker conference in Seattle a few weeks ago. And there were a couple
thousand people there. It was apparently doubled
in size from last year. So containerization is all the rage. But the result of which is even on
my own computer– you can see Docker is installed on my computer. Actually, you can see
the version number there. I am running version 1.12.0,
release candidate three, beta 18, which means this is the
18th beta version or test version of the software. So stuff breaks. And bleeding edge can be painful. So the upside, though,
on the other hand, is that Amazon, Google,
Microsoft and others are all starting to support this. And what’s nice is that
it’s a nice commoditization of what have been cloud providers. For many years, you would
have to write your code and build your product
in a way that’s very specific to Google or
Microsoft or Amazon or any number of third-party companies. And it’s great for them. You’re kind of bought in. But it’s not great for you if you
want to jump ship or change or use multiple cloud providers. So containerization is
nice, popularized by Docker, the sort of leading
player in this, in that it allows you to abstract away–
perfect tie in to earlier– what it means to be the cloud. And you write your software
for Docker, and you don’t have to care if
it’s ending up on Google or Amazon or Microsoft or the like. So it’s great in that regard. So you shouldn’t have
any regrets, but you should realize that maybe
with higher probability, you’ll run into technical headaches
versus other technologies. Really good question. All right. So if we now have the ability
to have all of these various– if we have the ability to
run so many different things all on the same hardware,
that means that no longer do we have to have just
one server for our website. And indeed, this was
inevitable because if you have a server that can only handle
so many users per second or per day, surely once you’re
popular enough, you’re going to need more hardware
to support more users. So let’s consider what starts
to happen when we do that. So if I have just one little server
here, called my www web server, per our conversation before
lunch, what does that web server need to have in order
to work on the internet? AUDIENCE: An IP address. DAVID MALAN: An IP address. So it has to have an IP address. And it has to have a DNS entry so
that if I type in www.something.com, the servers in the world can
convert that to an IP address, and Macs and PCs and everyone
can find this on the internet. And we’ll just abstract
away the internet as a cloud so that the server is somehow
connected to the internet. So that’s great. The world is nice and simple
when you just have one server. Now, suppose I want to
store data for my website. Users are registering. Users are buying things. I want to store that information. And where, of course, is data like that
usually stored, if generally familiar? What kind of technology
do you use to store data? AUDIENCE: A database. DAVID MALAN: A database. Yeah, so a database. You could just store it in files. You could just save text files
every time someone buys something, and that works. But it’s not very scalable. It’s not very searchable. So databases are products like
Microsoft Access is a tiny version. Microsoft’s SQL Server is a bigger one. Oracle is a behemoth. There’s PostgreSQL. There’s MySQL and bunches of others. But at the end of the
day– and actually, these are only the relational databases. There’s also things like MongoDB. There’s Redis for certain applications,
though not necessarily as persistent, and bunches of others still. And those are object-oriented
databases or document stores. But there’s just a long list
of ways of storing your data. And generally what all of these
things provide, a database, is a way to save information, delete
information, update information, and search for information. And the last one is the really juicy
detail because especially as you’re big and you’re popular– and to
your analytics comment earlier, it’d be nice if you could actually
select and search over data quickly so as to get answers more quickly. And that’s what databases do.
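As a tiny, self-contained illustration of that save-and-search idea, here’s SQLite, a small relational database that ships with Python; the table and the data are made up.

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE purchases (customer TEXT, item TEXT, price REAL)")
    db.execute("INSERT INTO purchases VALUES ('Alice', 'widget', 9.99)")
    db.execute("INSERT INTO purchases VALUES ('Bob', 'gadget', 24.50)")

    # The juicy part: searching over the data quickly.
    for row in db.execute("SELECT * FROM purchases WHERE price > 10"):
        print(row)    # ('Bob', 'gadget', 24.5)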
Oracle’s intellectual property is sort of the secret sauce that helps you find your data
fast, and same with all of these products, doing it better,
for instance, than the competitor. So with that said, you could
run not only web server software and a database on one physical server. In fact, super common, especially
for startups or someone who’s just got a test server under his or her desk. You just run all of these same
servers on the same device. And among the servers you might run,
you might have– so these are databases. Let me keep this all together. These are database technologies. And on the other hand, we might have web
servers like Microsoft IIS– Internet Information Server. Apache is a very popular web server
for Linux and other operating systems. There’s NGINX, which is also very
popular, and bunches of others. So this is web server software. This is the server software that knows
what to do when it receives a request like GET / HTTP/1.1.
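In fact, a web server in the software sense of the word can be just a few lines of Python using the standard library; products like Apache, NGINX, and IIS do this same job, just far more robustly and quickly.

    from http.server import HTTPServer, BaseHTTPRequestHandler

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            # Respond to a request like GET / HTTP/1.1 with a tiny page.
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            self.wfile.write(b"<h1>hello, world</h1>")

    HTTPServer(("", 8080), Handler).serve_forever()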
So when we did that quick example earlier when I visited google.com, they are running something
like this on their server. But if they want to store data
because people are buying things or they’re logging
information, they probably need to also run one of these servers. And a server, even though almost all
of us think of it as a physical device, a server is really just
a piece of software. And you can have
multiple servers running on one physical device, one server. So it’s confusing. The term means different
things in different contexts. But you can certainly run multiple
things on the same server. In fact, if this server is supposed
to send email confirmations when people check out, this could
be an email server, as well. If they’ve got built-in chat
software for customer service, it could also be running there. But at the end of the day, no matter
how much work this thing is doing, it can only do a finite amount of work. So what starts to break as soon
as we need a second server? So suppose we need to
invest in a second server. We have the money. We can do so. What do you do now if this now
becomes one, and this becomes www2. What kinds of questions
do you need to ask? Or what might the engineers
need to do to make this work? And I’ve deliberately removed the line
because now, what gets wired to what? How does it all work? Yeah? AUDIENCE: If you update www1, does www2 get updated, as well? DAVID MALAN: Good question. Updated in what sense? Like, your code? AUDIENCE: Yeah, [INAUDIBLE]
any server aspect. DAVID MALAN: Yeah, hopefully. So there’s this wrinkle, right? If you want to update
the servers, you could try to push the updates simultaneously. But there could be a slight delay. So one user might see the old software. One user might see the new, which
doesn’t feel great, but is a reality. What else comes to mind? Yeah? AUDIENCE: [INAUDIBLE]. DAVID MALAN: Yeah. It’s more worrisome with the database. If I now continue my super
simple world where I have a web server and an email server
and a database server all on the same physical box,
what if I happen to log in here, but Sarah– sorry. I happen to end up here. Sarah ends up here. Now our data is on separate servers. And then maybe tomorrow,
we visit the website again, and somehow, Sarah ends up over
here, doesn’t see her data. I end up over here. I don’t see my data. This does not feel like a great design. So already, our super
simple initial design of assume one server, everything
running on it, breaks. What else might break? Or what else might we want to
consider before we start fixing? Or if you put on the head of the
manager– so I’m the engineering guy. I can answer all your questions. But you have to ask me
the technical questions to get to a place of comfort yourself
that this will actually work. What other questions
should spring to mind? This is your business. No questions? Because I will just as soon
leave everything disconnected. AUDIENCE: Yeah, I was gonna say, how do
you have two databases– no matter how a person logs into our website, how
do we make sure their data is intact no matter where they log in? DAVID MALAN: Ah. OK. Well, I’ve thought about that. Don’t worry. We’re actually going
to have a third server. It’s often drawn as a
cylinder like this here. This will be database. And these guys are both
going to be connected to it. So I’m now going to have two tiers. And let me introduce some new jargon. I would typically call this my
front end, here, or my front end. And this I shall call the back end. And generally, front
end means anything user facing, that the user’s laptop
or desktop might somehow talk to. Back end is something that the
servers might only talk to. The user’s not going to
be allowed to talk here. All right? So I’ve answered that. We’re going to centralize the
database here so that there is no more data on individual servers. It’s now centralized here. What other questions have you now? AUDIENCE: Do we need another DNS entry? Or how does it– we have one IP address? DAVID MALAN: Well, we’ll just tell our
customers to go to www1.something.com. Or if it seems busy, go to www2. So how do we fix that? AUDIENCE: Both those should
go to the same IP address. DAVID MALAN: Both of those should
go to the same– ideally, yes. OK, so I– oh, Anessa,
do you want to comment? AUDIENCE: I mean, you somehow need
to be able to run [INAUDIBLE]. DAVID MALAN: And though,
to be clear, I claim now there is no right server
because the database is central. So now these are commodity. It doesn’t matter which one you end up
on so long as it has capacity for you. AUDIENCE: Right. So you need to do something to make sure
that you’re going to one [INAUDIBLE]. DAVID MALAN: OK. So what’s the simplest way we’ve
seen a company do this so far? We’ve only seen one. What did Yahoo do to balance
load across their servers? AUDIENCE: [INAUDIBLE]. DAVID MALAN: Yeah, the round robin. Rotating their traffic via DNS. So, for instance, if someone’s laptop
out there on the internet requests www.something.com or Yahoo, the DNS
server, which is not pictured here but is somewhere– let’s just say, yeah. We have a DNS server. It’s over here. And I won’t bother drawing the
lines because it’s kind of– we’ll just assume it exists. The first time someone
asks for something.com, I’m going to give them the
IP address of this server. The second time someone asks, I’m
going to give them the IP address. And then this one, and then this, and
then da, da, da, and back and forth. What’s good about this?
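A sketch of that round robin in Python; the addresses come from a block reserved for documentation, not real servers.

    from itertools import cycle

    addresses = cycle(["192.0.2.10", "192.0.2.11", "192.0.2.12"])

    def resolve(hostname):
        return next(addresses)    # each lookup gets the next address in the rotation

    for _ in range(4):
        print(resolve("www.something.com"))   # .10, .11, .12, then back to .10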
AUDIENCE: [INAUDIBLE]. DAVID MALAN: Now one doesn’t get too busy, and– AUDIENCE: It’s pretty simple. DAVID MALAN: Simple is good, right? And this is underappreciated,
but the easier you can architect your systems, in theory,
the fewer things that might go wrong. So simple is good. It’s really like one line
in a configuration file on Yahoo’s end to implement load
balancing in apparently that way, though, to be fair, they
have more than three servers. So there’s a whole other layer of load
balancing they’re absolutely doing. So I’m oversimplifying. What’s bad about this approach? Yeah? AUDIENCE: It depends what you’re having
people do when they get those servers. If people are doing things
that are radically out of scale with one another, [INAUDIBLE]. DAVID MALAN: Exactly. Even though with 50% odds, you’re
going one place or the other, what if Sarah is spending way more
time on the website than I am? So she’s consuming
disproportionately more resources. I might want to send more
users to the server I’m on and avoid sending anyone to her
server for some amount of time. So 50-50. Overall, given an
infinite amount of time and an infinite number of resources,
we’ll all just kind of average out. But in reality, there
might be spikiness. So that might not
necessarily be the best way. So what else could we do? DNS feels overly simplistic. Let’s not go there. Some companies, as an aside–
and look for this in life. It doesn’t seem to happen
too often, but it usually happens with kind of
bad, bigger companies. Sometimes, you will see in
the URL that you are at, literally www1.something.com
and www2.something.com. And this is, frankly, because
of moronic technical design decisions where they are somehow
redirecting the user to a different named server simply to balance load. And I say moronic partly to be
judgmental, technologically, but also because it’s
technologically unnecessary, and it actually has downsides. Why might you not want to send a user
to a name like www1, www2, and so forth? Why might you regret that decision? AUDIENCE: Then they might
go there themselves. DAVID MALAN: They might
go there themselves. So I go to www2.something.com. Why is that bad? Won’t it be there? AUDIENCE: Maybe. DAVID MALAN: Maybe. AUDIENCE: Maybe next time,
they want you to go to one. And now [INAUDIBLE]. DAVID MALAN: Yeah. So if your users maybe
bookmark a specific URL, and they just kind of out of habit
always go back to that bookmark, now your whole 50-50 fantasy
is kind of out the window, unless people bookmark the websites
with equal probability, which might be the case. But in either case, you’re
sort of losing a bit of control over the process. What if– we’re only talking about
two servers, but what if it was www20? And you know what? You only need 19 servers nowadays,
so you turned off number 20 or you stopped renting or
paying for those resources. Now they’ve bookmarked a
dead end, which isn’t good. And frankly, most users won’t
have the wherewithal to realize when they click on that bookmark
or whatnot, why isn’t it working? They’re just going to assume
your whole business is offline and maybe shop elsewhere. So that’s not good, either. So what could we do to solve this? And, in fact, let’s not
even give these things names because if we don’t want users
knowing about them or seeing them, they might as well just have
an IP address number one. We’ll just abstract it away. This is IP address number 2. But the user doesn’t need to
know or care what those are. Yeah, Daria? AUDIENCE: Are those both running
the same amount of services? Like, you’ve got an email server on
one and, like, the web server software on one? Each of those IP one and two have
everything except for the database? DAVID MALAN: At the moment,
I was assuming that. But as an aside, even if they weren’t,
turns out with most of the strategies we’ll figure out, you can weight your
percentages a little differently. So it doesn’t have to be 50-50. Could be 75-25, or you can
take into account how much. AUDIENCE: It’s like, can you
pull more things out and just run them like a database? DAVID MALAN: Ah, OK. So let’s say there is an email server. So let’s call this the email server. Let’s factor that out because it was
consuming some resources unnecessarily. So I like the instincts. Unfortunately, you can only
do this finitely many times until all that’s left is
the web server on both. And even then, if we get more
users than we can handle, we’re just talking about two servers. We might need a third or a fourth. So we can never quite
escape this problem. We can just postpone
it, which is reasonable. Katie? AUDIENCE: Is there a way to put a cap
on once one server has a certain amount of traffic, go to the other server? DAVID MALAN: Absolutely. We could impose caps. But to my same comment
here, that still breaks as soon as we overload server
number one and server number 2. So we’re still going
to need to add more. But even then, how do we decide
how to route the traffic? One idea that doesn’t seem to
have come yet– a buzzword, too, that we can toss up here is
what’s called vertical scaling. You can throw money at
the problem, so to speak. So we kind of skipped a step, right? Instead of going from– we
went from one server to two servers, which was
nice because it created a lot of possibilities but problems. But why don’t we just
sell the old server, buy a bigger, better server, more
RAM, more CPUs, more disk space, and just literally throw money at
the problem, an upside of which is this whole conversation we’re
having now, let’s just avoid it. Right? Let’s just get rid of this. This is just too hard. Too many problems arise. Let’s just put this as our web server. What’s the upside? What’s an upside? AUDIENCE: Simplest. DAVID MALAN: Simplest, right? I literally didn’t have
to think about this. All I had to do was buy
a server, configure it, but it’s configured identically. I just spent more money on it. What’s the downside, of course? Same thing. I spent a lot of money on it. And, more fundamentally,
what’s the problem here? AUDIENCE: It’s not a long-term solution. DAVID MALAN: It’s not
a long-term solution. I’ve postponed the issue,
which is reasonable, if I’ve just got to get
through some sales cycle or somehow get through the holiday
season or something like that. But there’s going to be this
ceiling on just how many resources you can fit into one machine. Typically, especially from companies
like Apple and even Dell and others, you’re going to pay a premium for
getting the very top of the line. So you’re overspending on the hardware. And so companies like
Google years ago began to popularize what has been
called horizontal scaling, where instead of getting one big,
souped-up version of something, you get the cheapest version, perhaps. You go the other extreme and just get
lots and lots of cheaper or medium spec devices. But unfortunately–
well, fortunately, that’s great because in theory, it
allows you to scale infinitely, so long as you have the money and
the space and so forth for it. But it creates a whole slew of problems. So we’re kind of back
to where we were before. So DNS we proposed. Eh, it doesn’t really cut it. It’s not smart enough because DNS
has no notion of weights or load. It has no feedback loop. All it does is translate domain
names to IP addresses and vice versa. So what could we introduce
to help solve this problem? The answer’s kind of
implicit on the board because we used a
technique twice already now that could help us balance load. Yeah. AUDIENCE: Could you just
have a feedback loop so that when you need more
service space, you scale up, and when you need less, you scale down? DAVID MALAN: OK. So that’ll get us to the
point of auto scaling. And that’ll allow us to add IP address
number three and four and five. But fundamentally, two is
interesting because it’s representative of an infinite
supply of problems now, which is what if you have
more than one server? Question at hand is how do we decide
or what pieces of hardware or features do we need to add to this story in order
to get data from users to server one or two or three or four or five or six. Yeah? AUDIENCE: Could you put– I don’t know–
another server or something on top of it that’s just directing? DAVID MALAN: Yeah. In fact it has a wonderful– AUDIENCE: Like a router? DAVID MALAN: –word. Yeah, it wouldn’t technically
be called a router, though it’s similar in spirit. Load balancer, which is,
in a sense, a router. So a lot of these terms
are interchangeable and more just conventions
than anything else. I’ll call this LB for load balancer. And that’s exactly right. Now let me connect some lines. This looks like CB. That’s LB, load balancer. So now, it is on the internet
somehow with a public IP address. And these two servers
have an IP address. But you know what? I’m going to call this private. And this one too will be private. This guy needs an IP that’s public,
which is not unlike our home router. So calling it a router is not
unreasonable in that regard. And what does this load
balancer need to do? Well, he’s got to decide whether to
route data to the left or to the right. And just to be clear, what
might feed into that decision? AUDIENCE: Usage. DAVID MALAN: Usage. So I’m going to specifically draw
these lines as bi-directional arrows. So there’s some kind of feedback loop,
or constantly these servers are saying, I’m at 10% capacity. I’m at 20% capacity. Or I have 1,000 users,
or I have no users. Whatever the metric is
that you care about, there could be that feedback loop. And then the load balancer
could indeed route the traffic. And so long as the response
that goes back also knows to go through the
load balancer, it’ll just kind of work seamlessly,
much like our home network. So we’ve fixed that problem. I like that.
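A sketch of the decision the load balancer is making, in Python; in real life the utilization numbers come from that feedback loop rather than being typed in.

    servers = {
        "10.0.0.1": 0.10,    # private IP : fraction of capacity reported in use
        "10.0.0.2": 0.20,
    }

    def pick_server():
        return min(servers, key=servers.get)   # route the next request to the least-busy server

    print(pick_server())    # "10.0.0.1" for now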
What new problems have we created at this point in the whole story? AUDIENCE: Bottleneck? DAVID MALAN: Bottleneck? Where at? AUDIENCE: In the load balancer? DAVID MALAN: Yeah. This is kind of besides
the point, right? Like, Grace, haven’t you kind
of broke– it’s a regression. Like we solved our
problem of load earlier by doubling the number of servers. But to get that to work, you’ve
proposed that we go back to one server because then it all just kind of
works and we somehow flail the traffic to the left or to the right. So it’s not wrong. So what’s a pushback? This is OK, in some sense. Why is this OK, even though before it
was not OK to just confine ourselves to one server? AUDIENCE: Because the
load balancer’s only job is to push people to [INAUDIBLE]. DAVID MALAN: Exactly. That’s its sole purpose in life. And if it’s reasonable
to assume, which it kind of is, that the
web servers probably have to do a little more work– right? They have to talk to a database. They have to check a user out. They might have to
trigger emails to be sent. It just feels like there’s a
bunch of work they need to do. Load balancer literally,
in the dumbest sense, needs to just send 50% of
traffic here, 50% here. But we know we can do better. So it needs to have a little
bit of sense of metrics. But at the end of the day,
it’s just like a traffic cop going this way or that way. Intuitively, that feels
like a little bit less work. And so indeed, you could throw maybe
more resources at this one server and then get really
good economy of scale by horizontally scaling
your front end tier, so to speak, up until
some actual threshold. And the thresholds are going to be
twofold– whether this is software or hardware that you either
download or you buy physically, either you’re going to have one,
licensing restrictions, whereby whatever company you buy
it from is going to say, this can handle 10,000
concurrent connections at a time. After that, you need to upgrade to
our more expensive device or something like that. Or it could just be technological, like
this device only has so much capacity. It can only physically
handle 10,000 connections at a time, after which you’re going to
need to upgrade to some other device altogether. So it can be a mix of those. So that actually is a nice
revelation of the next problem. OK, so I can easily spread
this load out here to three. And I can add in another one over here. But what’s going to break next, if not
my front end web tier, so to speak? AUDIENCE: [INAUDIBLE]. DAVID MALAN: Database? And what makes you think that? AUDIENCE: [INAUDIBLE]
simultaneous queries. DAVID MALAN: Yeah. Like, to my concern
earlier, we’re horizontally scaling this to handle more users. But we just still are sending
all the data to one place. And at some point, if we’re successful–
and it’s a good problem to have– we’re going to overwhelm
our one database. So what might we do there? How do we fix that? More, all right. But now, the problem with
doing this to your database is that the database,
of course, is stateful. It actually stores information,
otherwise generally known as state. The web servers I proposed
earlier can be ignorant of state. All of their permanent data
gets stored on the database. So if we do add another database
into the picture, like down here, where do I put my data? Right? I could put my data
here, Sarah’s data here. But then we need to make sure that every
subsequent request from me goes here and from Sarah goes here
so that it’s consistent. Or, of course, she’s not
going to see her data. I’m not going to see my data. So how do we solve that problem? AUDIENCE: The types of data [INAUDIBLE]. DAVID MALAN: OK. So we can use a technique that–
let me toss in a buzzword, sharding. To shard data means to, using
some decision-making process, send certain data this way
and other data this way. And an example we often
use here on campus is back in the day of
Facebook or thefacebook.com, when Mark finally started
expanding it to other campuses, it was harvard.thefacebook.com
and mit.thefacebook.com and berkeley.thefacebook.com or
whatever the additional schools were, which was a way of sharding your
setup because Harvard people were going to one setup and MIT people
were going to another setup. But, of course, a downside early
on, if you remember back then, is you couldn’t really be friends
with people in other networks. That was a whole other feature. But that was an example of sharding. Now, if you just have one website,
it’s possible and reasonable that anyone whose name starts with
D might go to the left database. Anyone whose name starts with S
might go to the second database. Or you could divide
the alphabet similarly. So you could shard based on that.
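A minimal sketch of that name-based sharding rule, in Python; the database names here are hypothetical stand-ins.

def pick_shard(username):
    # Send A through M to one database, everything else to the other.
    first = username[0].upper()
    if "A" <= first <= "M":
        return "database-1"
    return "database-2"

print(pick_shard("David"))   # prints "database-1"
print(pick_shard("Sarah"))   # prints "database-2"

The important property is that the same name always maps to the same database, so every subsequent request for that user lands on the data it wrote before.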
just sharding based on people’s names? AUDIENCE: Growth. DAVID MALAN: Growth? You could maybe put A through
M on one database server and N through Z on the other. And that works fine initially, but
eventually, you got to split, like, the A’s through the M’s. And what about– then you get down to
the A’s and the BA’s, and then you have just A’s. But one server’s not
enough for all the A’s. So it’s an ongoing problem. What else? AUDIENCE: Would sharding also apply if
you’re topically separating your data? Like, this becomes my sales
database and my profile database– DAVID MALAN: Absolutely. AUDIENCE: –and my product database. DAVID MALAN: Yep, you can
decide it however you want. Names is one approach for very
consumer-oriented databases. You can put different types of
data on different databases. Same problem ultimately,
as Griff proposes, whereby if you just
have so much sales data, you might still need
to shard that somehow. So at some point, you need to kind
of figure out how to part the waters and scale out. But quite possible. Sarah? AUDIENCE: Could you shard based on time? Like long-term data gets stored– DAVID MALAN: Oh, of course. Yep. You could put short-term data on some
servers, long-term data on others. And long-term data, frankly, could be
on older, even slower, or temporarily offline servers. And in fact, I can only
conjecture, but I really don’t understand why certain banks and
big companies like Comcast and others only let me see my
statements a year past. Surely in 2016, if I can get my order
history on Amazon from the 2000s, you can give me my bank
statements from 13 months ago. And it's probably either a foolish
or a conscious design decision that they’re just offloading old data
because it costs them, ironically, too much money to keep
around their user’s data. Other thoughts? AUDIENCE: What about
instead of sharding, what about have some sort of
overlay that’s able to point to– DAVID MALAN: OK. So we could have, even for sharding,
some notion of load balancing here. And it’s not load balancing
in a generic sense. It needs to be a decision maker. Am I misinterpreting? AUDIENCE: Yes. So it can choose either one. It could figure out
which database it’s in. DAVID MALAN: OK. So based on that, I’m either
going to call it– well, it wouldn’t be a load balancer if
it’s just generically doing this. It’s some kind of device
that’s doing the sharding. And the sharding could either
be done in software, to be fair, and the web servers could be programmed
to say the A through M’s over here and the N’s through Z’s over here. You don’t necessarily need a middleman. But we certainly could,
and it could be making those decisions such that all of these
servers talk to that middleman first. But it turns out you could load balance. We could take the same principle
of earlier of load balancing, but also solve this
problem in a different way. Let me push this up for a moment. And if we erase this, let me
actually call this a load balancer. And let me assume that now this is
going to go to databases as follows. Let’s actually do this, just
so I have some more room. So here, I’ll draw a
slightly bigger database. Not uncommon with databases maybe
to throw a little more money at it because it’s a lot easier to keep
your data initially all in one place. And so we might just vertically
scale this thing initially. So throw money at it, so we still
have the simplicity of one database, albeit a problem for downtime
if this thing goes offline. But more on that in a moment. But what I’m really
concerned with is writes. Changing information is
what’s ideally centralized. But reading information could
come from redundant copies. And so what’s fairly common is maybe
you have a bunch of read replicas. And that’s not necessarily
a technical term. It’s just kind of a
term of art here, where the replica, as the
name suggests, is really just a duplication of this thing. But maybe it’s a little
smaller or slower or cheaper. But there’s some kind of synchronization
from this one to this one. So writes are coming into this one,
and I’ll represent writes with a W. But when the user’s code running on the
web servers wants to read information, that data's not going to come from here. So that arrow is not going to exist. Rather, all of the reads
going back to the web servers are going to come from this
device, historically called the slave or secondary, whereas
this would be the master or primary. And what’s nice about
this topology is that we could have multiple read replicas. We could even add a third one in here. And the decision as to
whether or not this works well is kind of a side effect of whatever
your business is or the use case is. If your website is very read
heavy, this works great. You have one or just a few
database servers devoted to writes– so changes, deletions,
additions, that kind of thing. But you can have as many read
replicas allocated as you want, which are just real-time
copies of the master database that your code actually reads from.
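Here is a minimal sketch of that routing rule in Python: writes go to the one primary, and reads are spread across the replicas. The host names are placeholders, purely for illustration.

import random

PRIMARY = "db-primary.example.com"
REPLICAS = ["db-replica-1.example.com", "db-replica-2.example.com"]

def pick_database(is_write):
    if is_write:
        return PRIMARY                 # all changes go to one place
    return random.choice(REPLICAS)     # reads can come from any copy

print(pick_database(is_write=True))    # always the primary
print(pick_database(is_write=False))   # one of the replicas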
So something like Facebook– it depends on the user. Many people on Facebook probably read
more information than they post, right? Every time you log in, you might
post one thing, maybe, let’s say. But you might read 10
posts from friends. So in that sense, you’re
sort of read-heavy. And you can imagine other
applications– maybe in Amazon, maybe you tend to window shop more. So you rarely buy things, but
you go to shop around a lot. So you might be very read-heavy,
but you only checkout infrequently. So this topology might work well. Now, there’s kind of a problem here. If I keep drawing more and
more databases like this, what might start to break, eventually? AUDIENCE: The master. DAVID MALAN: The master, right? If we’re asking the master to copy
itself– in parallel, no less– to all of these secondary databases, at some
point this has got to break, right? It can’t infinitely handle traffic
over here and then infinitely duplicate itself to all of
these replicas down here. But what’s nice and
what’s not uncommon is to have a whole hierarchical structure. You know what? So if that is worrisome, let’s
just then have one medium sized or large replica here,
but then replicate in sort of tree fashion off of
it to these other replicas. So kind of push the problem away from
the super-important special database, the write database– write with a W.
And then the read replicas down here have their own sort of
hierarchy that gives us a bit of defense against that issue. Now, what problem still
remains in this picture? What could go wrong, critique, somehow? AUDIENCE: I don’t understand
how you would [INAUDIBLE]. DAVID MALAN: So what I’m proposing
is this is allowing us to scale. So if, I assume, as in the
Facebook scenario described, that most of my business
involves reads, where I want to have as many
read servers as possible, but I can get by with just
one writeable server, then what’s nice about this is that we
can add additional read replicas, so to speak, and handle
more and more and more users without having to
deal with the problem that I proposed earlier as a bit of a
headache– how do we actually decide, based on sharding or some other
logic, how to split our data? I just avoid splitting
our data altogether. So that’s the problem
we’ve solved, is scaling. We can handle lots and lots
and lots of reads this way. But there’s still a problem, even if
this is plenty of capacity for writing. Grace? AUDIENCE: The timing, then,
between the writing and the reading and if you want to overwrite
or update something you’ve just written where you’re reading it from. DAVID MALAN: Yeah. There’s a bit of latency, again,
so to speak, between the time you start to do something
and that actually happens. And you can sometimes see this, in
fact, because of latency or caching, even on things like Facebook. An example I always think of is on
occasion, I feel like I’ve posted or commented something on Facebook. Then in another tab, I might hit
Reload, and I don’t see the change. And then I have to reload,
and then it’s there. But it didn’t have the immediate
effect that I assumed it would. And that could be any number of reasons. But one of them could just
be propagation delays. Like, this does not happen instantly. It’s going to take some amount of
time, some number of milliseconds, for that data to propagate. So you get minor inconsistencies. And that is problematic if, for
instance, you read a value here, you write that change,
then you’re like, oh, no. Wait a minute. I want to fix whatever I just did. I want to fix a typo or something. You might end up changing this
version instead of that version. That has to be a
conscious design decision. Yes, that is possible,
so this is imperfect.
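A toy illustration in Python of that propagation delay: a write lands on the primary, but a read from a replica that has not yet synced still returns the old value. The data is made up, and the delay is simulated simply by not having copied the change yet.

primary = {"post": "old text"}
replica = {"post": "old text"}

def write(key, value):
    # Changes go to the primary; replication to the replica happens
    # some milliseconds later, not instantly.
    primary[key] = value

write("post", "new text")
print(replica["post"])      # prints "old text" -- a stale read

replica.update(primary)     # ...once replication catches up...
print(replica["post"])      # prints "new text"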
What else is worrisome here? Management should not sign off on this design, I would propose, at least if
management has money with which they’re willing to solve this problem. AUDIENCE: From a user point of view,
it might be a lot of time [INAUDIBLE]. DAVID MALAN: Good– not bad instinct. But we’re talking milliseconds. So I would push back at this point
and say, users are rarely if ever going to notice this. But fair. What could go wrong? Always consider that, and especially
if you are starting something. What are the questions you would ask
of the engineers you’re working with? What could break first? You don’t even have to be an
engineer to sort of identify intuitively what could go wrong. Grace? AUDIENCE: You still have one master
where you’re writing everything. There’s no redundancy there. There’s no [INAUDIBLE]. DAVID MALAN: Yeah. And the buzz word here is
single point of failure. Single points of failure, generally bad. And here, too, it does not take an
engineering degree to isolate that, so long as you have a conceptual
understanding of things. The fact that we have
just one database for writes, one master database–
very bad in that sense. If this goes offline, it would seem
that our entire back end goes offline, which is probably bad if our back end is
where we’re storing all of our products or all of our user data or
all of our Facebook posts or whatever the tool might be doing. This is really the stuff we
care about, the actual data. So not good. And, in fact, we dealt with that
earlier by introducing some redundancy at the web tier. So propose what’s good and bad about
this– suppose that, all right, I’m going to deal with this by
adding a second writeable database. I realize it’s going to cost money. But if you want me to fix this,
that’s the price we pay, literally. But there’s technical
prices we now need to pay. How do we kind of wire this
thing in, the second database? Or what questions does it invite again? AUDIENCE: Write things simultaneously? DAVID MALAN: Yeah. Like, all right. So why don’t we do that– like
sharding sounded like so much work. It’s so hard to solve. It doesn’t fundamentally
solve the problem long-term. Let me just go ahead and write my
data in duplicate to two places. Well, this would be an incorrect
approach, but the theory isn’t bad. What would typically
happen, though, is this– there’s the notion of master-master
relationships, whereby it doesn’t matter which one you write to. You can configure databases
to make sure that any changes that happen on one
immediately and automatically get synced to the other. So it’s called master-master
replication, in this case. So that helps with that.
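A toy sketch in Python of that master-master idea, using in-memory dictionaries as stand-ins for real databases: a write can land on either one, and each automatically forwards the change to its peer so both stay in sync.

class Master:
    def __init__(self, name):
        self.name = name
        self.data = {}
        self.peer = None

    def write(self, key, value, from_peer=False):
        self.data[key] = value
        if not from_peer and self.peer is not None:
            # Replicate the change to the other master automatically.
            self.peer.write(key, value, from_peer=True)

a, b = Master("db-a"), Master("db-b")
a.peer, b.peer = b, a

a.write("user:david", "posted a comment")
print(b.data["user:david"])   # prints "posted a comment" -- synced to the peer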
And now, frankly, this opens up really interesting opportunities because we could have
another database over here, and it could have its own databases. So we have this whole
family tree thing going on that really gives us a lot
of capacity– horizontal scaling even though, paradoxically, it’s
all hierarchical in this case. So that’s pretty good. And that’s kind of a
pretty common solution, if you have the money
with which to cover that and you’re willing to take
on the additional complexity. Like Anessa, to some of
your concerns with the team, this is more complicated. To someone else’s point
earlier, simple is good. This is no longer very simple. Thankfully, this is a
common problem, so there’s plenty of documentation and
precedent for doing this. But this is the added kind
of complexity that we have. There’s still some other single
points of failure on the screen. What are those? AUDIENCE: Load balancer? DAVID MALAN: Yeah, the load balancer. So damn it, that was such
a nice solution earlier. But if we really want to be uptight
about this, got to fix this, too. So let me go in there and put
in one, two load balancers. Of course now, all right– so we
have two problems, on the outside and on the inside. So how does this actually work? What would you propose we do
to get this topology working? AUDIENCE: Put another
load balancer on top? DAVID MALAN: Yeah, we could
kind of do this all day long, just kind of keep stacking
and unstacking and stacking. So not bad, but not going to
solve the problem fundamentally. What else might work here? AUDIENCE: Master-master
load balancer [INAUDIBLE]. DAVID MALAN: Yeah, that’s not bad. That doesn’t really solve– let’s
solve the first problem first. How do we decide for the users
where their traffic ends up? So I am someone on a laptop. I type in something.com. I’m here in the cloud. I’m coming out of the cloud,
ready to go to your website. Which load balancer do I hit and how? Yeah, Anessa? AUDIENCE: [INAUDIBLE]. DAVID MALAN: Yeah. So we didn't talk about this earlier. You could use DNS to
segregate your users geographically. And companies like Akamai
have done this for years, whereby if they detect an
IP address that they’re pretty sure is coming
from North America, they might send you to one destination. If you’re coming from Africa, you
might go to another destination. And with high probability,
they can take into account where the IP addresses
are coming from just based on the allocation of IP addresses
globally around the world, which is a centralized process. So that could help. We could do a little bit of
that, and that’s not a bad thing, thereby splitting our traffic. Unfortunately, in that model, if
Unfortunately, in that model, if America's load balancer goes offline, then only African customers can visit. Or conversely, if the other one
goes down, only the others can. So we’d have to do something
adaptive, where we’d then have to quickly change DNS so
that, OK, if the American load balancer’s offline, then we
better send all of our traffic to the servers and load
balancer that’s in Africa. Unfortunately, in the world of DNS, what
have other DNS servers and Macs and PCs and browsers unfortunately done? They’ve cached the damn old address,
which means some of our users might still be given the
appearance that we’re offline, even though we’re up and running
just fine with our servers in Africa in this case. So trade-off there. I’ll spoil this one, only
because it’s perhaps not obvious. Typically, what you would do with a
load balancer situation is only one of them would really be operational at a
time, just because it’s nice and simple and it avoids exactly that kind
of slippery slope of a problem. But these two things are talking
to one another via a technique you generally call heartbeats. And as the name implies, both
of them kind of have heartbeats. And that means in technical
terms that each of them just kind of sends a signal to
the other every second or minute or whatever– I’m alive. I’m alive. I’m alive. Because what the other
can do when it stops hearing the heartbeat from the other,
it can infer with reasonable accuracy that, oh, that server must have
died, literally or figuratively. I am going to proactively
take over its IP address and start listening for requests on
the internet on that same IP address. So you have one IP address, still, but
it floats between the load balancers based on whichever one has decided I
am now in charge, which then allows you to tolerate one of them going offline.
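A minimal sketch in Python of that heartbeat logic from the standby's point of view: it notes when it last heard from its peer, and if too much time passes, it concludes the peer is dead and takes over. The timeout value is arbitrary.

import time

HEARTBEAT_TIMEOUT = 3.0            # seconds of silence before taking over
last_heartbeat = time.monotonic()  # updated whenever a heartbeat arrives

def on_heartbeat():
    # Called each time the active load balancer says "I'm alive."
    global last_heartbeat
    last_heartbeat = time.monotonic()

def should_take_over():
    # If the peer has gone quiet for too long, assume it died and
    # claim its IP address.
    return time.monotonic() - last_heartbeat > HEARTBEAT_TIMEOUT

on_heartbeat()
print(should_take_over())   # prints False, since we just heard from the peer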
And in theory, you could make this true for three or four, but you get diminishing returns. We could do the same thing down here. So if we really care
and worry about this, we could do the same kind of
heartbeat approach down here. And then with the databases, we’re
still OK with this kind of hierarchy. And this isn’t so much a heartbeat,
recall, as a synchronization. But dear God, look at what we’ve
just built. What a nightmare, right? Remember what we started with. We started with this. And now we’re up to this. But this is truly what it means to
design, like, an enterprise class architecture and to be resilient
against failure and to handle load and to have built into the whole
architecture the ability to scale. So a lot of desirable features. And we didn’t go from
that directly to that. It was, hopefully, a fairly logical
story of frustrations and solutions along the way. But dear God, the complexity. Like, we are a far cry from what
was proposed as simple before. So what does this mean? So in the real, physical world back in
the day, you would buy these servers. You would wire them together. You would configure them. And when something dies,
hopefully you have alerts set up. And it’s just a laundry
list of operational things. And so a common role in a company
would be ops or operations, which is all the hardware
stuff, the networking stuff, the behind-the-scenes
stuff, the lower level details. And it’s fun, and it
appeals to certain people. But it is being supplanted,
in part, by cloud services. And what you get from
these– where’s our acronym? Might have erased it–
Infrastructure As A Service, IAAS, is these same capabilities from
companies like Amazon and Microsoft, but in the cloud. So if you want a load balancer,
you don’t buy a server and physically plug it in. You click a button on a website
that gives you a software implementation of a load balancer. So what’s nice is because of
virtualization and because of software being so
configurable, you can implement in software what has
historically been a physical device. You can create the illusion
that it’s the same thing. And so that’s what you’re doing
with a lot of cloud services. You’re saying, give me a load balancer. Give it this IP address. Give me two back end servers
or four back end servers. Give me a database, two databases. And it’s all click, click, click or
with a command line, textual interface. You're wiring things together virtually.
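As one hypothetical example of that, here is how asking for a couple of back-end servers might look with Amazon's Python library, boto3, rather than with physical cables. The machine image ID is a placeholder, the instance type is illustrative, and actually running this would require an AWS account and credentials.

import boto3

ec2 = boto3.client("ec2")

# Ask Amazon for up to two small virtual servers.
response = ec2.run_instances(
    ImageId="ami-xxxxxxxx",     # placeholder machine image
    InstanceType="t2.small",    # a small virtual server
    MinCount=1,
    MaxCount=2,
)

for instance in response["Instances"]:
    print(instance["InstanceId"])   # the IDs of the servers just created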
day, it’s the same skill set other than you don’t need to
physically plug things in anymore. But it’s the same mental paradigm. It’s the same kind of decision
process, the same amount of complexity. But it’s more virtual
than it is physical. And, in fact, if I pull up the
one I keep mentioning, only because I tend to use them myself here. But this is Amazon Web Services. You’ll see an overwhelming
list of products these days. Frankly, it’s to a fault, I think, how
many damn different services they have. It is completely overwhelming,
I think, to the uninitiated. And even I have kind of started
to get confused as to what exists. But just to give you a teaser so
you’ve at least heard of a few of them, Amazon EC2 is elastic compute cloud. This is what you would use, typically,
to implement your front end, your web server tier. But it really is just generic,
virtualized servers that can do anything you want them to do. In our story, we would
use them as web servers. Amazon has elastic load balancing,
which replaces our load balancers. And it’s elastic in the sense that
if you start to get a lot of traffic, they give you more capacity,
either by moving your load balancer to a different virtual machine,
a bigger one with more resources, or maybe giving you multiple ones. But what’s nice and what’s
beautiful about their latching on to this word elastic is you
don’t have to think about it. Conceptually, though, it’s doing this. So understanding the problem is still
important and daresay requisite. But you don’t have to worry
as much about the solution. Autoscaling is what decides
how many of these front end web servers in our story to turn on. So Amazon for you, albeit
with some complexity– it’s not nearly as
easily done as said– can decide to turn these things on or off
and give you one or two or three or 100 based on the load you’re
currently experiencing and the metrics you've specified. Amazon RDS, Relational Database Service. That's what can replace
all of this complexity. You can just say, give me
one big database server, and they’ll figure out how big to make
it and how to contract and expand. And actually, you can literally
check a box that says replicate this, so you get an automated backup of it. So this is the kind of
stuff that you would just spend so much time and money as a human,
building up, figuring out, updating, configuring. All of this has been
abstracted away– there, too, to our story earlier about abstraction. We certainly won’t go
through most of these, partly because I don’t know all of
them and partly because they’re not so germane. But S3 is a common one–
Simple Storage Service. This is a way of getting nearly
infinite disk space in the cloud. So for some years,
Dropbox, for instance, was using Amazon S3 for their data. I believe they have since moved to
running their own servers, presumably because of cost or security or the like. But you have gigabytes
or terabytes of space. And, in fact, for
courses I teach, we move all of our video files, which tend
to be big– we don’t run anything on Harvard’s servers. It’s all run in the Amazon
cloud because they abstract all of that detail away for us. So it’s both overwhelming
but also exciting in that there are all these ingredients. And the fact that this is
so low-level, so to speak, is what makes it
Infrastructure As A Service. Frankly, for startups
and the like, more, I think, appealing is Platform As
A Service, which is, if you will, a layer on top of this
in terms of abstraction because at the end of
the day, as interesting as it might be to a computer scientist
or an engineer, oh, my dear God. I don’t really care about
load balancing, sharding. I don’t really need to think
about this to build my business, especially if it’s just
me or just a few people. The returns are probably higher on
focusing on our application, not this whole narrative. And so if you go to companies
like Heroku, which is very popular and is built on top of
Amazon, you’ll see, one, a much more reasonable list of options. But what’s nice is–
let me find the page. It’s a Platform As A
Service in the sense that, ah, now we’re focusing on what
I care about as a software developer. What language am I using? What database technology
do I want to use? I don’t care what’s wired to what or
where the load balancing is happening. Just please abstract
that all away from me, so you just give me a
black box, effectively, that I can put my product
on, and it just runs. And so here, your first decision
point isn’t the lower level details I just rattled off on Amazon. It’s higher level details
like languages, which we’ll talk about more tomorrow, as well. So for a younger startup, honestly,
starting at, like, the Heroku layer tends to be much more pleasurable
than the Amazon layer. Google, for instance, has App
Engine and their Compute Cloud. They have any number
of options, as well. Microsoft has their own. So if you google Microsoft
or you bing Microsoft Azure, you’ll find your way here. And in terms of how you
decide which to use, I would generally, especially for
a startup, go with what you know or what you know someone knows. Looking for recommendations? You can google around. Just so I don’t forget
to mention it, if you go to Hacker News, whose website is
news.ycombinator.com and which is run by the investment fund Y Combinator, this is a good
place to stay current with these kinds of technologies. I would say that quora.com
is very good, too. It’s kind of the right community to
have these kinds of technical discussions. And TechCrunch, although that's more
newsy than it is thoughtful discussion. So those three sites together,
I would say– especially if you are part of a startup, keeping
those kinds of sources in mind and just kind of passively
reading those things will help keep you at
least current on a lot of these options and
tools and techniques. But more on that tomorrow, as well. Any questions? Yeah? AUDIENCE: Could you go back to Heroku? DAVID MALAN: Sure. AUDIENCE: So essentially,
it’s all about [INAUDIBLE]. DAVID MALAN: You do. And let me see if I can
find one more screen. The docs kind of change
pretty frequently. AUDIENCE: [INAUDIBLE]. DAVID MALAN: I’m sorry? AUDIENCE: [INAUDIBLE]. DAVID MALAN: Oh, let’s see. Deploy, build– oh, yeah. So this is actually a
very clever way– OK. So this is all the stuff we just
spent an hour talking about. Heroku makes the world feel like that. Yeah, so infrastructure, platform,
infrastructure, platform. So this is nice, and this is compelling. So what are some of the
downsides– let's see. This is just random [INAUDIBLE]. I do want to show one thing. Let's see. Pricing is interesting. Hobby. Dyno. So they have some of their– dyno
is not really a technical term. This is their own marketing
thing for how much resource you get on their servers. This is what I wanted, like databases. No, that's not what I want. Feature, add-ons. Explore add-ons pricing. OK, this is what's fun. So most of these– well,
it’s also overwhelming, too. There are so many third-party tools,
databases, libraries, software, services. We could spend a month,
I’m sure, just even looking at the definitions of these things. What’s nice about Heroku
is they have figured out how to install all of this stuff,
how to configure this stuff. And so if you want it, you just
sort of click it and add it to your shopping cart, so to speak. And so long as you adhere
to certain conventions that Heroku has– so you have
to design your app a little bit to be consistent with
their design approach. So you’re tied a little bit to
their topology, though not hugely. You can just so much more
easily use these services. And it’s a beautiful, beautiful thing. But what’s the downside then? Sounds like all win. AUDIENCE: A little less customizable. DAVID MALAN: A little less customizable. You’re much more dependent on their
own design decisions and their sort of optimization, presumably, for common
cases that maybe your own unique cases or whatever don’t fit well. Yeah, Sarah? AUDIENCE: So are we
monitoring the databases less frequently so you’re less
likely to predict something bad is going to happen down the road? DAVID MALAN: Yeah, good point. So maybe they’re
monitoring less frequently. And I would also say,
even if that’s not true, the fact that there’s another party
involved– so it’s not just Amazon. Now there’s multiple layers
where things could go wrong. That feels, too, worrisome
along those lines. Sounds all good. So what’s the catch? What’s another catch? AUDIENCE: [INAUDIBLE]. DAVID MALAN: Yeah, it costs more. It’s great to have all of these
features and live in this little world where none of that underlying
infrastructure exists, which is probably fine
for a lot of people, certainly when they’re first starting
or getting a startup off of the ground. But certainly once
you have good problems like lots of users, that’s the point
at which you might need to decide, did we really engineer
this at the right level? Should we have maybe started in advance,
albeit at greater cost or complexity, so that over time, we’re just ready
to go and we’re ready to scale? Or was it the right
call to simplify things, pay for this value-added
service, and not have to worry about those
lower-level implementation details? So it totally depends. But I would say in general, certainly
for a small, up-and-coming startup, simpler is probably good, at
least if you have the money to cover the marginal costs. And I’ll defer to the
pricing pages here. But what is typical here is if we
look at AWS pricing, for instance, there’s any number of things they charge
for, nickel and diming here and there. But it’s often literally
nickels and dimes. So thankfully, it takes a while
for the costs to actually add up. Just to give you a sense,
though, if you get, let’s call it, a small server that
has two gigabytes of memory, which is about as much as a small laptop
might have these days, and essentially one CPU, you'll pay about $0.026
per hour to use that server. So if you do this– if we
pull up my calculator again, so it’s that many cents per hour. There’s 24 hours in a day. There’s 365 days in a year. You’ll pay $227 to rent
that server year round.
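Spelling out that calculator math, assuming the roughly $0.026-per-hour rate quoted above:

hourly_rate = 0.026            # dollars per hour for the small server
hours_per_year = 24 * 365      # 8,760 hours in a year

annual_cost = hourly_rate * hours_per_year
print(round(annual_cost, 2))   # prints 227.76, i.e., about $227 per year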
there’s a trade-off. In the consulting gig
I alluded to earlier, we had to handle some– I forget the
number offhand– hundreds of thousands of hits per day, which
was kind of a lot. And we were moving from one
old architecture to another. And so when we did the math back in
the day, and it was some years ago, it was actually going to cost
us quite a bit in the cloud because we were going to
have so many recurring costs. Certainly after a year,
after two years, we worried it was really going to add up. By contrast, we happened to go
the hardware route at the time. The cloud and Amazon were
not as mature at the time, either, so there were
some risk concerns. But the upfront costs, significant–
thousands and thousands of dollars. But very low marginal costs
or recurring costs thereafter. So that, too, was a trade-off, as well. Other questions or comments? AUDIENCE: [INAUDIBLE]. DAVID MALAN: I’m sorry? AUDIENCE: Why do they do it by hour? Why not do it by year? DAVID MALAN: Oh, so
that’s a good question. So why do they do it per hour, per year? It’s not uncommon in
this cloud-based economy to really just want to spin
up servers for a few hours. Like, if you get spiky traffic,
you might get a real hit around the holidays. Or maybe you get blogged
about, and so you have to tolerate this for a
few hours, a few days or weeks. But after that, you definitely
don’t want to commit, necessarily, for the whole year. One of the best articles years
ago when Amazon was first maturing was the New York Times– let me see. New York Times Amazon EC2 TIFF. They did this– yeah, 2008. Oh, no. It's this one, 2007. So this was an article I remember
being so inspired by at the time. They had– let’s see,
public domain articles. They had 11 million articles as PDF. Or they wanted to create, it
sounds, 11 million articles as PDFs. So this was an example of
something that hopefully wasn’t going to take them a whole year. And, in fact, if you
read through the article, one of the inspiring
takeaways at the time was that the person who set this up used
Amazon’s cloud service to sort of scale up suddenly from zero to, I don’t
know, a few hundred or a few thousand servers, ran them for a few
hours or days, shut it all down, and paid some dollar amount, but some
modest dollar amount in the article. And I think one of his cute comments
is he screwed up at one point. And the PDFs didn’t come out right. But no big deal, they just ran it again. So twice the cost, but it was
still relatively few dollars. And that was without
having to buy or set up a single server at the New York Times. So for those kinds of
workloads or data analytics where you really just need to
do a lot of number crunching, then shut it all down,
the cloud is amazing because it would cost you
massive amounts of money and time to do it locally otherwise. Other questions or comments? That was the cloud. Let me propose this– I sent
around an email last night. And if you haven’t already,
do read that email, and make sure you are able to log
in to cs50.io during the break. If not, just call me over,
and I’ll lend a hand. But otherwise, why don’t we take our
15-minute break here and come back right after 3:15 to finish off the day?
