Lec 21 | MIT 6.00SC Introduction to Computer Science and Programming, Spring 2011

The following content is
provided under a Creative Commons license. Your support will help MIT
OpenCourseWare continue to offer high quality educational
resources for free. To make a donation or view
additional materials from hundreds of MIT courses, visit
MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: Good morning. Oh, it’s so nice to get a response. Thank you. I appreciate it. I have a confession to make. I stopped at my usual candy
store this morning and they didn’t have any. So I am bereft of anything
other than crummy little Tootsie Rolls for today. But I promise I’ll have a
new supply by Thursday. We ended up the last lecture
looking at pseudocode for k-means clustering and talking
a little bit about the whole idea and what it’s doing. So I want to start today by
moving from the pseudocode to some real code. This was on a previous
handout but it’s also on today’s handout. So let’s look at it. Not so surprisingly I’ve chosen
to call it k-means. And you’ll notice that it’s
got some arguments. The points to be clustered,
k, and that’s an interesting question. Unlike hierarchical clustering,
where we could run it and get what’s called a
dendrogram and stop at any level and see what we liked,
k-means involves knowing in the very beginning how many
clusters we want. We’ll talk a little bit about
how we could choose k. A cutoff. What the cutoff is doing, you
may recall that in the pseudocode k-means was
iterative and we keep re-clustering until the change
is small enough that we feel it’s stable. That is to say, the new clusters
are not that much different from the
old clusters. The cutoff is the definition
of what we mean by small enough. We’ll see how that gets used. The type of point
to be clustered. The maximum number
of iterations. There’s no guarantee that
things will converge. As we’ll see, they usually
converge very quickly in a small number of iterations. But it’s prudent to have
something like this just in case things go awry. And to print. Just my usual trick of being
able to print some debugging information if I need it, but
not getting buried in output if I don’t. All right. Let’s look at the code. And it very much follows the
outline of the pseudocode we started with last time. We’re going to start by
choosing k initial centroids at random. So I’m just going to go and take
all the points I have. And I’m assuming, by the way,
I should have written this down probably, that I have
at least k points. Otherwise it doesn’t
make much sense. If you have 10 points,
you’re not going to find 100 clusters. So I’ll take k random centroids,
and those will be my initial centroids. There are more sophisticated
ways of choosing centroids, as discussed in the problem set,
but most of the time people just choose them at random,
because at least if you do it repetitively it guarantees
against some sort of systematic error. Whoa. What happened? I see. Come back. Thank you. All right. Then I’m going to say that
the clusters I have initially are empty. And then I’m going to create a
bunch of singleton clusters, one for each centroid. So all of this is just
the initialization, getting things going. I haven’t had any
iterations yet. And the biggest change so far
I’m just setting arbitrarily to the cutoff. All right. And now I’m going to iterate
until the change is smaller than the cutoff while biggest
change is at least the cutoff. And just in case numIters is
less than the maximum, I’m going to create a list
containing k empty lists. So these are the new clusters. And then I’m going to go through
for i in range k. I’m going to append
the empty cluster. These are going to
be the new ones. And then for p in all the
points I’m going to find the centroid in the existing
clustering that’s closest to p. That’s what’s going on here. Once I’ve found that, I’m going
to add p to the correct cluster, go and do it
for the next point. Then when I’m done, I’m going to
compare the new clustering to the old clustering and
get the biggest change. And then go back and
do it again. All right? People, understand that basic
structure and even some of the details of the code. It’s not very complicated. But if you haven’t seen
it before, it can be a little bit tricky. When I’m done I’m going to just
get some statistics here about the clusters, going to
keep track of the number of iterations and the maximum
diameter of a cluster, so the cluster in which things are
least tightly grouped. And this will give me an
indication of how good a clustering I have. OK? Does that make sense
to everybody? Any questions about
the k-means code? Well, before we use it, let’s
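[The handout code itself isn’t reproduced in this transcript. What follows is a minimal sketch of the loop being described, written for plain tuples of numbers rather than the course’s Point and Cluster classes; the function and helper names are illustrative, not the handout’s.]

```python
import math
import random

def euclidean(p, q):
    # Euclidean distance between two equal-length feature vectors
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def mean_point(points):
    # Component-wise mean of a non-empty list of feature vectors
    dim = len(points[0])
    return tuple(sum(p[i] for p in points) / len(points) for i in range(dim))

def kmeans(points, k, cutoff, max_iters=100, to_print=False):
    # Choose k distinct points at random as the initial centroids
    # (this assumes there are at least k points, as the lecture notes)
    centroids = random.sample(points, k)
    clusters = [[] for _ in range(k)]      # defined even if max_iters is 0
    for num_iters in range(1, max_iters + 1):
        # Start each iteration with k empty clusters
        clusters = [[] for _ in range(k)]
        # Assign every point to the cluster of its nearest centroid
        for p in points:
            index = min(range(k), key=lambda i: euclidean(p, centroids[i]))
            clusters[index].append(p)
        # Recompute the centroids and measure how far they moved
        biggest_change = 0.0
        for i in range(k):
            if clusters[i]:                # keep the old centroid if a cluster is empty
                new_c = mean_point(clusters[i])
                biggest_change = max(biggest_change,
                                     euclidean(new_c, centroids[i]))
                centroids[i] = new_c
        if to_print:
            print('Iteration', num_iters, 'biggest change =', biggest_change)
        # Stop once the clustering is stable enough
        if biggest_change < cutoff:
            break
    return clusters, centroids
```

[The maximum-iteration argument plays exactly the safety role described above: if the centroids never settle below the cutoff, the loop still terminates.]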
look at how we use it. I’ve written this function
testOne that uses it. Some arbitrary values
for k and the cutoff. Number of trials is kind
of boring here. I’ve only said one is the
default and I’ve set print steps to false. The thing I want you to notice
here, because I’m choosing the initial clustering at random,
I can get different results each time I run this. Because of that, I might want
to run it many times and choose the quote, “best
clustering.” What metric am I using for best clustering? It’s a minmax metric. I’m choosing the minimum of
the maximum diameters. So I’m finding the worst cluster
and trying to make that as good as I can make it. You could look at the
average cluster. This is like the linkage
distances we talked about before. That’s the normal
kind of thing. It’s like when we did Monte
Carlo simulations or random walks, flipping coins. You do a lot of trials and then
you can either average over the trials, which wouldn’t
make sense for the clustering, or select the
trial that has some property you like. This is the way people
usually use k-means. Typically they may do 100 trials
and choose the best, the one that gives them
the best clustering.
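[A sketch of that strategy, reusing the kmeans and euclidean helpers from the sketch above; again, these are hypothetical names, not the handout’s test code.]

```python
def diameter(cluster):
    # Largest pairwise distance within a cluster (0.0 for empty or singleton clusters)
    return max((euclidean(p, q) for p in cluster for q in cluster), default=0.0)

def best_of_trials(points, k, cutoff, num_trials):
    # Run k-means num_trials times and keep the clustering whose worst
    # (largest-diameter) cluster is smallest: the minmax metric
    best_clusters, best_score = None, float('inf')
    for _ in range(num_trials):
        clusters, _ = kmeans(points, k, cutoff)
        score = max(diameter(c) for c in clusters)
        if score < best_score:
            best_clusters, best_score = clusters, score
    return best_clusters
```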
Let’s look at this, and let’s try it for a couple of examples here. Let’s start it up. And we’ll just run testOne on
our old mammal teeth database. We get some clustering. Now we’ll run it again. We get a clustering. I don’t know, is it the
same clustering? Kind of looks like it is. No reason to suspect
it would be. We run it again. Well you know, this is
very unfortunate. It’s supposed to give different
answers here because it often does. I think they’re the same
answers, though. Aren’t they? Yes? Anyone see a difference? No, they’re the same. How unlucky can you be? Every time I ran it at my desk
it came up the first two times with different things. But take my word for it, and
we’ll see that with other examples, it could come out
with different answers. Let’s try it with some
printing on. We get some things here. Let’s try it. What have we got out
of this one? All right. Oh, well. Sometimes you get lucky and
sometimes you get unlucky with randomness. All right. So, why did we start
with k-means? Not because we needed it
for the mammals’ teeth. The hierarchical worked fine,
but because it was too slow when we tried to look
at something big like the counties. So now let’s move on and talk
about clustering the counties. We’ll use exactly the
k-means code. It’s one of the reasons we’re
allowed to pass in the point type as an argument. But the interesting thing will
be what we do for the counties themselves. This gets a little
complicated. In particular, what I’ve added
to the counties is this notion of a filter. The reason I’ve done this is,
as we’ve seen before, the choice of features can make
a big difference in what clustering you get. I didn’t want to do a lot of
typing as we do in these examples, so what I did is I
created a bunch of filters. For example, no wealth, which
says, all right, we’re not going to look at home value. We’re giving that
a weight of 0. We’re giving income a weight
of 0, we’re giving poverty level a weight of 0. But we’re giving the
population a weight of 1, et cetera. OK. What we see here is each filter
supplies the weight, in this case either 0 or
1, to a feature. This will allow me as we
go forward to run some experiments with different
features. All features, everything
has a weight of 1. I made a mistake though. That should have been a 1. Then I have filter names, which
are just a dictionary. And that’ll make it easy for
me to run various kinds of tests with different filters.
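[The actual feature names and filters are in the problem-set code; this sketch just illustrates the idea, with invented feature names and 0/1 weights.]

```python
# Hypothetical feature order for each county's attribute vector
features = ['HomeValue', 'Income', 'Poverty', 'Population',
            'PercentFemale', 'Education']

# Each filter maps a feature name to a weight; 0 drops the feature, 1 keeps it
filters = {
    'all':       {f: 1 for f in features},
    'noWealth':  {'HomeValue': 0, 'Income': 0, 'Poverty': 0,
                  'Population': 1, 'PercentFemale': 1, 'Education': 1},
    'education': {f: (1 if f == 'Education' else 0) for f in features},
}

filterNames = list(filters.keys())   # makes it easy to loop over experiments
```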
Then I’ve got init, which takes as its arguments the things you would expect,
plus the filter name. So it takes the original
attributes, the normalized attributes. And you will recall that, why
do we need to normalize attributes? If we don’t, we have something
like population, which could number in the millions, and
we’re comparing it to percent female, which we know cannot
be more than a 100. So the small values become
totally dominated by the big absolute values and when we run
any clustering it ends up only looking at population or
number of farm acres, or something that’s big. Has a big dynamic range. Manhattan has no farm acres. Some county in Iowa has a lot. Maybe they’re identical in
every other respect. Unlikely, but who knows? Except I guess there’s no
baseball teams in Iowa. But at any rate, we always scale
or we try and normalize so that we don’t get
fooled by that. Then I go through and, if I
haven’t already, this is a class variable attribute
filter, which is initially set to none. Not an instance variable,
but a class variable. And what we see here is, if that
class variable is still none, this will mean it’s the
first time we’ve generated a point of type county, then what
we’re going to do is set up the filter to only look at
the attributes we care about. So only the attributes which
have a value of 1. And then I’m going to override
distance from class point to look at the features
we care about. OK. Does this basic structure and
idea make sense to people? It should. I hope it does, because the
current problem set requires you to understand it in which
you all will be doing some experiments. So now I want to do some
experiments with it. I’m not going to spend too
much time, even though it would be fun, because I don’t
want to deprive you of the fun of doing your problem sets. So let’s look at an example. I’ve got test, which is pretty
much like testOne. Runs k-means number of times
and chooses the best. And we can start. Well, let’s start by running
some examples ourselves. So I’m going to start
by clustering on education level only. I’m going to get 20 clusters, 20
chosen just so it wouldn’t take too long to run, and we’ll
filter on education. And we’ll see what we get. Well, I should have probably
done more than one cluster just to make it work. But we’ve got it and just for
fun I’m keeping track of what cluster Middlesex County, the
county in which MIT shows up. So we can see that
it’s similar to a bunch of other counties. And it happens to have an
average income of $28,665, or at least it did then. And if we look, we
should also see– no, let me go back. I foolishly didn’t uncomment
pyLab.show. So we better go back
and do that. Well, we’re just going to nuke
it and run it again because it’s easy and I wanted
to run it with a couple of trials anyway. So, we’ll first do
the clustering. We get cluster 0. Now we’re getting
a second one. It’s going to choose whichever
was the tightest. And we’ll see that that’s
what it looks like. So we’ve now clustered the
counties based on education level, no other features. And we see that it’s got some
interesting properties. There is a small number of
counties, clusters, out here near the right side
with high income. And, in fact, we’ll see
that we are fortunate to be in that cluster. One of the clusters that
contains wealthy counties. And you could look at it and see
whether you recognize any of the other counties that
hang out with Middlesex. Things like Marin County,
San Francisco County. Not surprisingly. Remember, we’re clustering by
education and these might be counties where you would expect
the level of education to be comparable to the level
of education in Middlesex. All right. Let me get rid of
that for now. Sure. I ran it. I didn’t want you to have to sit
through it, but I ran it on a much bigger sample size. So here’s what I got
when I ran it asking for 100 clusters. And I think it was 5 trials. And you’ll notice that this
case, actually, we have a much smaller cluster containing
Middlesex. Not surprising, because I’ve
done 100 rather than 20. And it should be pretty tight
since I chose the best– you can see we have a
distribution here. Now, remember that the name of
the game here is we’re trying to see whether we can infer
something interesting by clustering. Unsupervised learning. So one of the questions we
should ask is, how different is what we’re getting here
from if we chose something at random? Now, remember we did
not cluster on things based on income. I happened to plot income here
just because I was curious as to how this clustering
related to income. Suppose we had just chosen at
random and split the counties at random into 100 different
clusters, What would you have expected this kind of
graph to look like? Do we have something that
is different, obviously different, from what we might
have gotten if we’d just done a random division into 100
different clusters? Think about it. What would you get? AUDIENCE: A bell curve? PROFESSOR: Pardon? AUDIENCE: We’d get
a bell curve. PROFESSOR: Well, a bell curve
is a good guess because bell curves occur a lot in nature. And as I said, I apologize for
the rather miserable quality of the rewards. It’s a good guess but I think
it’s the wrong guess. What would you expect? Would you expect the different
clusters– yeah, go ahead. AUDIENCE: You probably might
expect them all to average at a certain point for a very
sharp bell curve? PROFESSOR: A very sharp bell
curve was one comment. Well, someone else
want to try it? That’s kind of close. I thought you were on the right
track in the beginning. Well, take a different
example. Let’s take students. If I were to select a 100 MIT
students at random and compute their GPA, would you it to be
radically different from the GPA of all of MIT? The average GPA of
all MIT students? Probably not, right? So if I take 100 counties and
put them into a cluster, the average income of that cluster
is probably pretty close to the average income
in the country. So you’d actually expect it
to be kind of flat, right? That each of the randomly chosen
clusters would have the same income, more or less. Well, that’s clearly not
what we have here. So we can clearly infer from the
fact that this is not flat that there is some interesting
correlation between level of income and education. And for those of us who earn our
living in education, we’re glad to see it’s positive,
actually. Not negative. As another experiment,
just for fun, I clustered by gender only. So this looked only
at the female/male ratio in the counties. And here you’ll see Middlesex
is with– remember we had about 3,000
counties to start with. So the fact that there were
so few in the cluster on education was interesting,
right? Here we have more. And we get a very
different-looking picture. Which says, perhaps, that the
female/male ratio is not unrelated to income, but it’s
a rather different relation than we get from education. This is what would be called
a bi-modal distribution. A lot here and a lot here and
not much in the middle. But again the dynamic range
is much smaller. But we do have some
counties where the income is pretty miserable. All right. We could play a lot more with
this but I’m not going to. I do want to, before we leave
it, because we’re about to leave machine learning,
reiterate a few of the major points that I wanted
to make sure were the take home messages. So, we talked about supervised
learning much less than we talked about unsupervised. Interestingly, because
unsupervised learning is probably used more often in the
sciences than supervised. And when we did supervised
learning, we started with a training set that had labels. Each point had a label. And then we tried to infer the
relationships between the features of the points and
the associated labels. Between the features
and the labels. We then looked at unsupervised
learning. The issue here was, our
training set was all unlabeled data. And what we try and infer is
relationships among points. So, rather than trying to
understand how the features relate to the labels, we’re just
trying to understand how the points, or actually, the
features related to the points, relate to one another. Both of these, as I said
earlier, are similar to what we saw when we did regression
where we tried to fit curves to data. You need to be careful and wary
of over-fitting just as you did with regression. In particular, if the training
data is small, a small set of training data, you may learn
things that are true of the training data that are not true
of the data on which you will subsequently run the
algorithm to test it. So you need to be
wary of that. Another important lesson is
that features matter. Which features matter? It matters whether they’re
normalized. And in some cases you can even
weight them if you want to make some features more
important than the others. Features need to be relevant to
the kind of knowledge that you hope to acquire. For example, when I was trying
to look at the eating habits of mammals, I chose features
based upon teeth, not features based upon how much hair they
had or their color of the lengths of the tails. I chose something that I had
domain knowledge which would suggest that it was probably
relevant to the problem at hand. Question at hand. And then we discovered it was. Just as here, I said, well,
maybe education has something to do with income. We ran it and we discovered,
thank goodness, that it does. OK. So I probably told you ten times
that features matter. If not, I should have
because they do. And it’s probably the most
important thing to get right in doing machine learning. Now, our foray into machine
learning is part of a much larger unit. In fact, the largest unit of
the course really, is about how to use computation to make
sense of the kind of information one encounters
in the world. A big part of this is finding
useful ways to abstract from the situation you’re initially
confronted with to create a model about which
one can reason. We saw that when we
did curve fitting. We would abstract from
the points to a curve to get a model. And we see that with machine
learning, that we abstract from every detail about a county
to say, the education level, to give us a model
of the counties that might be useful. I now want to talk about another
kind of way to build models that’s as popular
a way as there is. Probably the most common
kinds of models. Those models are graph
theoretic. There’s a whole rich theory
about graphs and graph theory that are used to understand
these models. Suppose, for example, you had
a list of all the airline flights between every city in
the United States and what each flight cost. Suppose also, counterfactual
supposition, that for all cities A, B and C, the cost of
flying from A to C by way of B was the cost of A to B plus the cost from B to C. We happen to know that’s not true, but
we can pretend it is. So what are some of the
questions you might ask if I gave you all that data? And in fact, there’s a company
called ITA Software in Cambridge, recently acquired by
Google for, I think, $700 million, that is built upon
answering these kinds of questions about these
kinds of graphs. So you could ask, for example,
what’s the shortest number of hops between two cities? If I want to fly from here to
Juneau, Alaska, what’s the fewest number of stops? I could ask, what’s the
least expensive– different question– flight
from here to Juneau. I could ask what’s the least
expensive way involving no more than two stops, just in
case I don’t want to stop too many places. I could say I have ten cities. What’s the least expensive
way to visit each of them on my vacation? All of these problems are nicely
formalized as graphs. A graph is a set of nodes. Think of those as objects. Nodes are also often called
vertices or a vertex for one of them. Those nodes are connected
by a set of edges, often called arcs. If the edges are
uni-directional, the equivalent of a one-way street,
it’s called a digraph, or directed graph. Graphs are typically used in
situations in which there are interesting relationships
among the parts. The first documented use of this
kind of a graph was in 1735 when the Swiss
mathematician Leonhard Euler used what we now call graph
theory to formulate and solve the Königsberg bridges problem. So this is a map of Königsberg,
which was then the capital of East Prussia, a part
of what’s today Germany. And it was built at the
intersection of two rivers and contained a lot of islands. The islands were connected to
each other and to the mainland by seven bridges. For some bizarre reason which
history does not record and I cannot even imagine, the
residents of this city were obsessed with the question of
whether it was possible to take a walk through the city
that involved crossing each bridge exactly once. Could you somehow take a walk
and go over each bridge exactly once? I don’t know why they cared. They seemed to care. They debated it, they walked
around, they did things. It probably would be unfair for
me to ask you to look at this map and answer
the question. But it’s kind of complicated. Euler’s great insight was that
you didn’t have to actually look at the level of detail
represented by this map to answer the question. You could vastly simplify it. And what he said is, well, let’s
represent each land mass by a point, and each
bridge as a line. So, in fact, his map of
Königsberg looked like that. Considerably simpler. This is a graph. We have some vertices
and some edges. He said, well, we can just look
at this problem and now ask the question. Once he reformulated the problem
this way it became a lot simpler to think about and
he reasoned as follows. If a walk were to traverse each
bridge exactly once, it must be the case that each node,
except for the first and the last, in the walk must have
an even number of edges. So if you were to go to an island and leave it, then unless there were an even number of bridges to that island, you couldn’t traverse each one exactly once. If there were only one bridge,
once you got to the island you were stuck. If there were two bridges you
could get there and leave. But if there were three bridges
you could get there, leave, get there and
you’re stuck again. He then looked at it and said,
well, none of these nodes has an even number of edges. Therefore you can’t do it. End of story. Stop arguing. Kind of a nice piece of logic.
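[Euler’s condition is easy to check mechanically: in a connected graph, a walk that uses every edge exactly once exists only if zero or two nodes have odd degree. A small sketch, with the seven Königsberg bridges written out as edges:]

```python
def has_eulerian_walk(edges):
    # edges is a list of (u, v) pairs for an undirected multigraph;
    # assumes the graph is connected
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    odd = sum(1 for d in degree.values() if d % 2 == 1)
    return odd in (0, 2)

# Königsberg: land masses A, B, C, D joined by seven bridges
bridges = [('A', 'B'), ('A', 'B'), ('A', 'C'), ('A', 'C'),
           ('A', 'D'), ('B', 'D'), ('C', 'D')]
print(has_eulerian_walk(bridges))   # False: all four land masses have odd degree
```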
And then Euler later went on to generalize this theorem to cover a lot of other
situations. But what was important was not
the fact that he solved this problem but that he thought
about the notion of taking a map and formulating
it as a graph. This was the first example of
that and since then everything has worked that way. So if you take this kind of idea
and now you extend it to digraphs, you can deal
with one-way bridges or one-way streets. Or suppose you want to look at
our airline problem, you can extend it to include weights. For example, the number of miles
between two cities or the amount of toll you’d have
to pay on some road. So, for example, once you’ve
done this you can easily represent the entire US highway
system or any roadmap by a weighted directed graph. Or, used more often, probably,
the World Wide Web is today typically modeled as a directed
graph where there’s an edge from page A to page B,
if there’s a link on page A to page B. And then maybe if you want to
care, ask the question how often do people go from A to B,
a very important question, say, to somebody like Google,
who wants to know how often people click on a link to get
to another place so they can charge for those clicks, you
use a weighted graph, which says how often does someone
go from here to there. And so a company like Google
maintains a model of what happens and uses a weighted,
directed graph to essentially represent the Web, the clicks
and everything else, and can do all sorts of analysis
on traffic patterns and things like that. There are also many less
obvious uses of graphs. Biologists use graphs to measure
things ranging from the way proteins interact with
each other to, more obviously, gene expression networks
are clearly graphs. Physicists use graphs to model
phase transitions with typically the weight of the edge
representing the amount of energy needed to go from
one phase to another. Those are again weighted
directed graphs. The direction is, can you get
from this phase to that phase? And the weight is how much
energy does it require? Epidemiologists use graphs to
model diseases, et cetera. We’ll see an example
of that in a bit. All right. Let’s look at some code now
to implement graphs. This is also in your handout and
I’m going to comment this out just so we don’t
run it by accident. As you might expect, I’m
going to use classes to implement graphs. I start with a class node
which is at this point pretty simple. It’s just got a name. Now, you might say, well, why
did I even bother introducing a class here? Why don’t I just use strings? Well, because I was kind of
wary that sometime later I might want to associate more
properties with nodes. So you could imagine if I’m
using a graph to model the World Wide Web, I don’t want
more than the URL for a page. I might want to have all the
words on the page, or who knows what else about it. So I just said, for safety,
let’s start with a simple class, but let’s make it a class
so that any code I write can be reused if at some later
date I decide nodes are going to be more complicated
than just strings. Good programming practice. An edge is only a little
bit more complicated. It’s got a source, a
destination, and a weight. So you can see that I’m using
the most general form of an edge so that I will be able to use
edges not only for graphs and digraphs, but also weighted
directed graphs by having all the potential
properties I might need and then some simple things
to fetch things.
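[The handout has the real versions; here is a minimal sketch of what such Node and Edge classes might look like. The default weight of 1.0 is an assumption.]

```python
class Node(object):
    # Just a name for now, but wrapping it in a class means more
    # properties can be attached later without rewriting other code
    def __init__(self, name):
        self.name = name
    def getName(self):
        return self.name
    def __str__(self):
        return self.name

class Edge(object):
    # The most general form of an edge: a source, a destination, and a weight,
    # so the same class serves graphs, digraphs, and weighted digraphs
    def __init__(self, src, dest, weight=1.0):
        self.src = src
        self.dest = dest
        self.weight = weight
    def getSource(self):
        return self.src
    def getDestination(self):
        return self.dest
    def getWeight(self):
        return self.weight
    def __str__(self):
        return str(self.src) + '->' + str(self.dest)
```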
The next class in the hierarchy is a digraph. So it’s got an init, of course,
I can add nodes to it. I’m not going to allow myself
to add the same node more than once. I can add edges. And I’m going to check to make
sure that I’m only connecting nodes that are in the graph. Then I’ve got childrenOf, which gives me all the children of a node, and hasNode, and a string method. And then interestingly enough,
maybe surprising to some of you, I’ve made graph a
sub-class of digraph. Maybe that seems a little odd. After all, when I started, I started talking about graphs, and then said we could add this feature. But now I’m going the
other way around. Why is that? Why do you think that’s
the right way to structure a hierarchy? What’s the relation of
graphs to digraphs? Digraphs are more general
than graphs. A graph is a specialization
of a digraph. Just like a county is a
specialization of a point. So, typically as you design
these class hierarchies, the more specialized something is,
the further it has to be down. More like a subclass. Does that make sense? I can’t really turn
this on its head. I can specialize a digraph
to get a graph. I can’t specialize a graph
to get a digraph. And that’s why this hierarchy
is organized the way it is. What else is there are
interesting to say about this? A key question, probably the
most important question in designing and implementation
of graphs, is the choice of data structures to represent
the digraph in this case. There are two possibilities
that people typically use. They can use an adjacency
matrix. So if you have N nodes, you have
an N by N matrix where each entry gives, in the case
of a digraph, a weighted digraph, the weight of the edge connecting the two nodes. Or in the case of a graph it
can be just true or false. So this is, can I
get from A to B? Is that going to
be sufficient? Suppose you have a graph
that looks like this. And we’ll call that Boston
and this New York. And I want to model the roads. Well, I might have a road that
looks like this and a road that looks like this from
Boston to New York. As in, I might have more
than one road. So I have to be careful when
I use an adjacency matrix representation, to realize
that each element of the matrix could itself be somewhat
more complicated in the case that there
are multiple edges connecting two nodes. And in fact, in many graphs we
will see there are multiple edges connecting the
same two nodes. It would be surprising
if there weren’t. All right. Now, the other common
representation is an adjacency list. In an adjacency list, for every
node I list all of the edges emanating from
that node. Which of these is better? Well, neither. It depends upon your
application. An adjacency matrix is often
the best choice when the connections are dense. Everything is connected
to everything else. But is very wasteful if the
connections are sparse. If there are no roads connecting
most of your cities or no airplane flights
connecting most of your cities, then you don’t want to
have a matrix where most of the entries are empty. Just to make sure that people
follow the difference, which am I using in my implementation
here? Am I using an adjacency matrix
or an adjacency list? I heard somebody say an
adjacency list and because my candy supply is so meager they
didn’t even bother raising their hand so I know
who said it. But yes, it is an
adjacency list. And we can see that by
looking what happens when we add an edge. I’m associating it with
the source node. So from each node, and we can
see that when we look at the children of– here, I just return the
edges of that node. And that’s the list of all
the places you can get to from that node. So it’s very simple, but
it’s very useful.
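[A sketch of an adjacency-list digraph along those lines, building on the Node and Edge sketches above; the method names are illustrative.]

```python
class Digraph(object):
    def __init__(self):
        self.nodes = set()
        self.edges = {}                # maps each node to the edges leaving it
    def addNode(self, node):
        if node in self.nodes:
            raise ValueError('Duplicate node')
        self.nodes.add(node)
        self.edges[node] = []
    def addEdge(self, edge):
        src, dest = edge.getSource(), edge.getDestination()
        if src not in self.nodes or dest not in self.nodes:
            raise ValueError('Node not in graph')
        # Associate the edge with its source node: this is the adjacency list
        self.edges[src].append(edge)
    def childrenOf(self, node):
        # All the nodes you can reach directly from this node
        return [e.getDestination() for e in self.edges[node]]
    def hasNode(self, node):
        return node in self.nodes
```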
Next lecture we’ll look– Yeah? Thank you. Question. I love questions. AUDIENCE: Going back to the digraph, what makes the graph
more specialized? PROFESSOR: Good question. The question is, what makes the
graph more specialized? What we’ll see here when we look
at graph, it’s not a very efficient implementation, but
every time you add an edge, I add an edge in the reverse
direction. Because if you can get from node
A to node B you can get from node B to node A. So I’ve
removed the possibility that you have, say, one-way
streets in the graph. And therefore it’s
a specialization.
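[Building on the sketches above, the override being described might look like this:]

```python
class Graph(Digraph):
    # A graph is a specialization of a digraph: every added edge gets a
    # reverse edge too, which rules out the equivalent of one-way streets
    def addEdge(self, edge):
        Digraph.addEdge(self, edge)
        rev = Edge(edge.getDestination(), edge.getSource(), edge.getWeight())
        Digraph.addEdge(self, rev)
```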
There are things I cannot do with graphs that I can do with digraphs, but anything I can do
with a graph I can do with the digraph. It’s more general. That make sense? That’s a great question and
I’m glad you asked it. All right. Thursday we’re going to look at
a bunch of classic problems that can be solved using graphs,
and I think that should be fun.
