UW Allen School Colloquia: GRAIL Lab at the Allen School

All right, thanks,
everyone, for being here, and welcome to our
GRAIL presentation. So GRAIL is the graphics
and imaging lab. And we work on a variety of
different problems related to– basically,
computational problems related to visual
information, everything from capturing,
understanding, processing, predictive simulation, modeling,
and creation of 3D content. We have a diverse
set of faculty, and we have quite a few
students here today representing the different problems
that GRAIL studies. So our first presenter
is Daniel Gordon. Daniel works with Ali. And he’ll be telling us
about his work on how to use simulation in
virtual environments to learn how to
accomplish complex tasks. Thank you. Thanks, Adriana,
for introducing me. That takes up one
of my six slides. So, yeah. I’m going to be– is the mic good? There we go. So I’m going to be
talking about trying to push the envelope on what
we can learn from 3D simulation environments. And I’m going to go through
a few different ways that we’ve looked at
solving this problem, or looked at learning. So, first off, we need a
simulation environment. And luckily enough, there is
this nice company, group– I don’t know what
to call it– called AI2 that has been
working for a few years on making this really
nice set of diverse rooms and interactions you can
perform in these rooms. So this is the simulator
called AI2-THOR. And basically, you can
think about it as, I think, 30 different virtual homes. So there’s 120
different environments, and each environment
is a single room where you can interact with the
objects in a variety of ways. You can kind of think about
this like a video game, except that there is also a
lot of metadata under the hood that we can use for learning
and training purposes. And so inside this
environment we can do things like picking up a laptop,
and closing a book, and reading Python, I guess. And speaking of Python, there’s
a really nice Python interface. And basically, you’re just
communicating back and forth with this game, for
lack of a better word. And so we have this nice
tool, and we want to see, like, what can we actually
learn to do with this? So the first thing
we looked at was what we call interactive
question answering. So the task here is imagine that
you are now put in this room, and someone asks you a
question about the environment. And these questions
are not just about kind of is there a
fridge in the room, but they’re more about the small
objects that may or may not be there or there might
be one or more of them. The reason for doing this is
because then we can basically randomize the positions
and the numbers of the different
objects in the scene, and we can create
a large data set of different virtual
environments. So then the question is actually
can we learn how to do this? Can we learn how to
answer questions? And the answer is yes. We can actually just put
an agent in the environment and give it some pretty sparse
training signal, basically just whether it gets a
question right or wrong, as well as whether it
tries to perform an action and whether that action
succeeds or fails. And with these two
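As an aside, that sparse training signal can be sketched as a toy reward function. The constants and names below are hypothetical illustrations, not from the actual system; the point is just that only two signals are needed.

```python
# Toy sketch of the sparse reward described in the talk. The constants are
# made up; the real system only needs the two signals: whether the answer
# was right, and whether each action succeeded.

def step_reward(action_succeeded: bool) -> float:
    """Small time cost per action; a larger penalty for a failed action."""
    return -0.01 if action_succeeded else -0.1

def answer_reward(predicted: bool, ground_truth: bool) -> float:
    """Large terminal reward only when the yes/no answer is correct."""
    return 1.0 if predicted == ground_truth else -1.0

# Example episode: two failed actions, one success, then a correct answer.
total = (step_reward(False) + step_reward(False)
         + step_reward(True) + answer_reward(True, True))
```

Summed over an episode, this gives the agent a reason both to act efficiently and to answer only once it is confident.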
pretty basic things, we can already get some
pretty interesting behavior. So this is an example
that I pulled out where the question we were asked
was, is there bread in the room? And so in the data set,
this is a balanced data set where there is a 50% chance
of there being bread, and a 50% chance
of there not being bread. So the agent starts off with
basically a 50/50 guess on whether there is or
isn’t bread, which is these red and blue
dots sort of hovering in the middle of 50/50 here. And then the first
thing it does, it says, OK, well, obviously,
I don’t know anything. I’ve just been put
in this environment. I need to actually walk around. So you can see at
the bottom here– yeah, here– that this is
tracing out the agent’s path. And it starts, basically,
walking around the table trying to look at parts of the room and
observe different things that might be around. And lo and behold, this
tiny, little orange square there is actually
bread that it detects. The orange square is there just
as an output of the method. It’s not actually an input. But we are showing that we can
successfully detect the bread. But the agent still
has to actually decide whether to answer or not. And so at the following
step, it actually says, OK, well,
I’ve seen the bread. So it sees the bread right
here, and it’s still at 50/50. And then at the next step, it
says, OK, I’ve seen the bread. So I know the answer
is yes, there’s bread, and I should answer now to
basically end the episode and get my reward. So given this, we
actually wanted to dive in and see what did
we actually learn? What types of
information do we learn? And one interesting
thing we saw was we could basically cluster
the features that our network outputs,
and look and make sure that they were learning
something interesting. And in this case, we are
plotting the basically open objects versus closed objects. And we can actually see a
pretty nice separation, even in a low dimensional space
that the features basically learn on their own with
no inherent supervision. So the other thing I’m going
to talk about for 30 seconds is actually work that
is ongoing right now. And we’re trying to extend
this sort of question regime to much more complicated tasks. So in this case, we start– actually, we start
with a video here where we are doing some complex task. And then we give this video
to Amazon mechanical turkers and ask them to describe
the task step by step. And then the goal is basically
to flip that around and then say, OK, from a
step-by-step instruction, can we actually learn
how to accomplish a task? And even harder, can we
just give the outline of the task, which is like
the main title, wash a knife, and put it on the countertop? And can we actually learn
kind of all of the things that we’d need to do to be
able to accomplish that? So as I said, this
is ongoing work. But right now, we’re sort
of collecting the data and training some
initial models. And it looks promising
that especially with the sort of
step-by-step instructions, we are able to
actually learn how to accomplish these
pretty complex goals with multiple different steps. So I think with that,
I will take questions. [APPLAUSE] Yeah? AUDIENCE: So pretty cool stuff. So I didn’t quite
understand what it was, what it was trained on. When you asked, hey, is
there bread on this table, does it know what
bread is already, or is that in a training set,
or try to understand the semantics of what bread
is and that it’s on the table, or is that stuff we’ll learn? DANIEL GORDON: Yeah, so
the question is kind of do we bake an object
detection into the system, or do we try to learn that end
to end, or what are the inputs? And, in this case, we
actually tried both. We tried training on
detecting the objects, and we tried it
sort of end to end. And training to detect the
objects works pretty well. There’s a bit of
difficulty: if we were to
train on real images and then test in the simulation, it wouldn’t work quite so well. But if we train in
simulation, we’re actually able to do
pretty well in this. So, yes. We do end up feeding
object detections in as part of the network. ADRIANA SCHULZ: [INAUDIBLE]. [APPLAUSE] All right, the next
speaker is Xuan. Xuan is also a PhD
student in GRAIL. She’s working with Steve Seitz. And she’ll be telling us
about a method for simulating the experience of looking
through an open window into some awesome,
historical images. XUAN LUO: Thank you, Adriana,
for the introduction. So I’m going to
talk about slow glass. You’re probably
confused now, but you’ll see what I mean later. So how many people
recognize who he is? Yeah. Thomas Edison with
his giant light bulb, which eventually
changed the world. So this is what a typical
historical image looks like to us. It depicts
a very interesting story. But the image itself is so
pale, not vivid, and just flat. And it hardly occurs
to us that those people lived in the
same world as us, in the same space
that we also share. It’s just from a different time. So in the science fiction
story “Light of Other Days,” Bob Shaw introduced a
very interesting concept called slow glass. So imagine a piece
of glass that’s so thick and compact
that even light will take multiple years to pass. Imagine. So from it, you
will see something that’s from many,
many, many years ago. So imagine if there is a
piece of glass that takes 130 years for light to pass. Then through it, you will see Thomas
Edison sitting in front of you. It’s like a window to the past. And you can see the history
in 3D right in front of you. So slow glass is not
possible physically. But we can do it
computationally. So how do we do that? First, we need to understand
the 3D of the scene. And so we need to gather depth. But from a single viewpoint,
it’s not good enough. We want to see different
perspectives of the thing. People will naturally move
around in front of a window, and we want to simulate
different perspectives of it. And it’s very important
for 3D perception. Surprisingly, stereo cameras
existed even 150 years ago. And many important
historical people and scenes have been captured in stereo. So this is an example
stereo card. If you flip
between them, you will see the 3D
information captured by it. So this is Mark Twain in 3D. Has anyone seen this? Yep. This is a stereoscope. And it was so
popular at that time. And people mass produced this. And there was a company called
Keystone View Company that was the largest company among them. By 1935, it had two million
stereoscopic negatives. And in 1978, it donated
all the negatives to the
California Museum of Photography. So now on the
website, we can see about 45,000 of those
stereo images that have been digitized online. So we can just go
through those images, crop the left and right
views, remove some bad ones, and convert them to 3D,
compute the 3D depth for them. So from that, we can
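The depth computation from a stereo pair rests on the standard triangulation relation; here is a minimal sketch (the focal length, baseline, and disparity values are made-up numbers, not from the actual data set):

```python
# Depth from a rectified stereo pair: Z = f * B / d, where f is the focal
# length in pixels, B the camera baseline, and d the left/right disparity.
def depth_from_disparity(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Triangulate metric depth for one pixel from its disparity."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_px * baseline_m / disparity_px

# A point with larger disparity is closer to the camera.
near = depth_from_disparity(1000.0, 0.07, 20.0)  # 3.5 m
far = depth_from_disparity(1000.0, 0.07, 5.0)    # 14.0 m
```

The photo-consistency issues mentioned in the Q&A make estimating the disparity itself the hard part on these degraded historical images.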
collect a huge data set of historical
people and their depth for those interesting things
and people in the past. But this is not enough. This is just for one
viewpoint, like 2.5D. We want to do
real 3D, where you can move around and see it
from different perspectives. So how should we do that? So this is one viewpoint. If you want to move
to another viewpoint, you will see holes like this,
disocclusions that were not captured. So we can apply a
state-of-the-art inpainting algorithm to these images. But you will see those
inpainting algorithms, they are designed to fuse
information from both sides. So you will see a lot
of blurry boundaries. But what we really want for
this kind of disocclusion inpainting is
that we want to fuse the information from only
the background, but not the foreground. So we can use the known
foreground boundary from the depth that we get to
guide the network to paint only from the background. And also, from the depth, from
the depth data set we have, we can simulate a lot a lot
of disocclusion holes to train our more than most
state-of-the-art now works on these historical images to
then paint in those regions. So from that, we can simulate
these historical scenes viewing 60-degrees of freedom. So this is some gold
mining in Colorado. This is war brides
from the World War I. This is Carlton cotton
machine from those time. This is automobile shaping,
press shaping machines. Thomas Edison in front of
his house with his son. And we can have such– oh, sorry. [MUSIC PLAYING] AR app. Yeah, so typically, we
see those images in flat. But we can have a new
3D window into the past. So with that, open to question. [APPLAUSE] AUDIENCE: So if the pictures
were not black and white, would the depth
destination be better? Does the fact that
it’s black and white the affect that estimation? XUAN LUO: So I met
a lot of troubles because this is black and
white and low quality. So for most algorithm,
if I just apply on them, the sterile algorithm would
assume photo consistency, meaning the same
corresponding points should have the same color. But this one, like
left or right image have different color
because of color degradation and this stuff. And also, the black and
white, it’s less information, so probably has some effect. But I didn’t test this far. AUDIENCE: You
colorized it first, like compositional coloring
before you do the depth– XUAN LUO: So that’s like a
chicken and the egg problem. So if I can colorize
them consistently, then I could do it well. And then if I can colorize
it correct, like– AUDIENCE: [INAUDIBLE]. XUAN LUO: Yeah, yeah, yeah. So correspondence and
colorization is like– sterile consistent colorization
is a chicken and egg problem. AUDIENCE: It’s really cool. XUAN LUO: Thank you. ANDRIANA SCHULZ: [INAUDIBLE]. [APPLAUSE] All right, the
next speaker is JJ, who’s also working with Steve. And you were telling us about
how to model shape and specular reflections using a
combination of physically and learning-based approaches. JEONG JOON PARK: Hi, everyone. My name is JJ Park,
and today, I’m going to introduce my
current and past project about effectively modeling
glossy reflections and geometry using
neural networks. And at first, I’m going
to begin with my project on glossy reflection modeling. So novelty prediction
is an important problem in 3D computer
vision that involves predicting a new view
of a scene that is not captured in your input video. In this diagram, the black
cameras are from our input video, where the red ones are
unseen camera views that we want to synthesize for. And this problem is a
well-studied in computer vision community for decades. But most of the work’s focus
on surfaces that are diffused, which means that glossy
reflections are not modeled. In this work, we focus on
reconstructing the scenes with glossy reflections, such as
the [INAUDIBLE] in the diagram. And we want to generate
realistic novel views. And we focus on modeling
specular highlights only, assuming camera
pose, geometry, and diffuse textures
are all known. The tricky part of predicting
specular reflections is that a same 3D point
appears differently in different viewpoints. A brute force way
to model this would require us to capture the scene
from all possible viewpoints, which is infeasible. So instead, we try to model
the reflection direction, which could be shared by
various viewpoints. And we gather this specular
information from multiple views into a specular
reflectance map (SRM), which holds glossy
information about one specific material in the scene. When the scene contains
more than one material, the specular reflection
of each surface point is described as a linear
combination of multiple SRMs, and we pose the
material segmentation as a recognition task where each
pixel’s weights are estimated from a neural network. And we conduct a
joint optimization of the network and
specular lighting to reproduce the input images. Here are some of the environment
lightings we recovered, which closely matches
the ground truth. And note that reconstructing
such a room environment merely from a video of an object
is a task that even humans can hardly do. And the following video
showed the quality of our novel view prediction. The left one is ground
truth, and the right one is our reconstruction
prediction of a novel view. And here is another video. Note that the specular
reflections are accurately modeled. And although there
are more results, we omit them due to time limit. And now I’ll move on to
the second part, which is about a new
representation of geometry suitable for deep learning. Convolutional
neural networks have been a cornerstone for 2D
computer vision methods, but they grow quickly
in space and time when directly applied
to [INAUDIBLE].. And more compact
representations, such as point clouds, do
not describe surfaces, and triangle meshes
come with unknown number of vertices and topologies. So here, we present
a new representation that is more
efficient, expressive, and fully continuous. Our models produce high quality
continuous surfaces with complex topologies and obtain
state-of-the-art results in shape reconstruction and
completion while having orders of magnitude smaller network
size than voxel-based methods. Our key idea is to
directly regress the continuous signed distance
field using a neural network. So signed distance function
is a volumetric field where the magnitude of a
point represents the distance to the closest surface
and the sign indicates whether the point is inside
or outside of the shape. So we regress this
signed distance field using a fully connected
neural network, and this neural network
takes as input an (x, y, z) point and classifies whether
this point is inside or outside of the surface. And the network is essentially
a binary classifier where its decision boundary
is the surface of this 3D shape itself. And using this
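The signed-distance idea can be illustrated with an analytic stand-in for the learned network (a sphere here; the actual method regresses this function with a neural network over arbitrary shapes):

```python
import math

# Signed distance to a sphere: negative inside, positive outside, zero on
# the surface. The learned network plays this role for arbitrary shapes;
# its decision boundary (the zero level set) is the surface itself.
def sdf_sphere(x, y, z, radius=1.0):
    return math.sqrt(x * x + y * y + z * z) - radius

def is_inside(x, y, z):
    """Binary inside/outside classification from the sign of the SDF."""
    return sdf_sphere(x, y, z) < 0.0

inside = is_inside(0.2, 0.1, 0.0)   # well inside the unit sphere
outside = is_inside(2.0, 0.0, 0.0)  # far outside
```

Extracting the zero level set of the regressed field (for example, with marching cubes) recovers a continuous surface at any resolution.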
new representation and a learning method
called autodecoder, we could learn a compact
and effective embedding space of a data set of
shapes, for example, thousands of chairs. This embedding space can then be
used to reconstruct, retrieve, and complete a shape. For example, given
an input table, we could back propagate
to the latent code to find a table in
the embedding space that best matches the input. And this framework
works even when the input is a partial
observation of the shape, and it is automatically
completed using the priors of the network. Here, we show a visualization
of the inference process. The optimization
tries to find best code that matches
the single depth map observation shown as the
green dots in the middle. Our method performed
significantly better than other methods in shape
completion and reconstruction tasks. Compared to other
voxel-based method, it provides higher
accuracy model while having smaller
network size. And compared to the
state-of-the-art mesh-based method, our network shows
much higher expressive power. For shape completion, given
an input single depth map, our method finds a high
quality, optimal shape for the observation. And compared to
the ground truth, it generates much higher
quality completion results than the state-of-the-art. That’s all I have. Thank you. [APPLAUSE] AUDIENCE: Really cool. I’m curious for the
glossy reflections work, would that be possible to
do without Machine Learning, or is Machine
Learning a key part to actually making the glossy
reflections particularly possible? JEONG JOON PARK: Yeah,
so a lot of people have been trying to solve
this problem for many years. But essentially, they
could solve this problem when there’s a very accurate
settings, lab setups and stuff. But if you want to
do it in the wild, it is necessary to be
robust, which requires learning methods, usually. AUDIENCE: A lot of or maybe
all of the images, the examples you showed were symmetric. Is that symmetry
[INAUDIBLE] encoder, or is it something you’d
put in ahead of time? JEONG JOON PARK: We didn’t try
to manually input those prior. It’s learned from the data,
so because usually, there seem to be symmetry in the data set. Yeah. Yeah. AUDIENCE: Can you kind of
interpret what was learned? Is it like the BRDF? What does it learn? JEONG JOON PARK: So
the network currently learns the segmentation,
which part is what material. And then there’s physical
models that’s attached to it. So and then later
on, the network also learns how to refine
some of the artifacts of the physical model. AUDIENCE: Thank you. [APPLAUSE] ANDRIANA SCHULZ: All right,
our next speaker is Haisen. He’s a postdoc working with me,
and will tell us about his work on using programming
languages for design and fabrication of carpentry. HAISEN ZHAO: Hi, everyone. Thank you for your attention. I’m Haisen Zhao, a
postdoc under Adriana. Suppose you want to
fabricate a design. You must first design
then fabricate it. One can say we can decouple
the design and the fabrication. Then we can optimize the design and
the fabrication separately. And so we can– yeah. On the other hand, there
are two stages also related to each other. For example, the
fabrication defines the space of what we
can physically realize. And the design will affect
the fabrication performance. So you can see here are
two conflicting objectives. So can we generate a
workflow to account for the two conflicting ideas? Yes. Yes. Our main idea is we can see
the design and the fabrication both as programs. The design and the
fabrication can be seen as some operations
with some order. Yeah. Then we want to generate some
instruction, that architecture, something like
that from computer architecture, to decouple the
design, the fabrication states. And here is an end-to-end
example of our system. The user need to input
the carpentry tools to fabricate the design
and set up material library to generate the design. Then user design, you use
our UI to design some model. At the same time, a code, a high
level code can be generated. Then– sorry– my radio
have some problem. Then our system can generate
a multi-objective optimization to generate a set of
instructions from which– then for each instruction
the [INAUDIBLE] can follow our instruction
to fabricate the design. Here is our main
component of our system. We define the high level
HELM as a language to define the design. And the low-level HELM
defines the fabrication space. Then our key component
is a compiler, which can translate the high
level code to low level code. The key challenge
of our compiler is the multi-objective
function, which considers three main
objectives in the manufacturing process. This optimization must generate
a long sequence of instructions fabrication steps and consider
multiple conflicting objective. So to solve this,
optimization problem, can we get some benefit from
[INAUDIBLE] the fabrication process as a program? Yes. We borrow some things
from the PL domain. We use the e-graph
concept from PL. An e-graph is a very
compact representation of lots of equivalence classes. In PL, they use the
E-graph to populate a large equivalence
code of the input code. Then they can use an
optimization method to extract the optimal code
from the large equivalence code. Here is one example
of the E-graph. In this graph, it’s noted
that we recorded that as an e-class,
equivalence classes. It can generate– for
example, here in this note, we can generate one, two,
three, the two, the three parts. And the e-class, it’s a
well-contained multi-e-notes, equivalent notes. Each note can be
fabricated, the same part with a different order. For example, here, we can
generate those three parts with different order from
the [INAUDIBLE] stocks. So can we just
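The e-class idea can be sketched in toy form (the plans and costs below are hypothetical, not from the actual compiler): one equivalence class holds alternative fabrication plans that yield the same parts, and extraction picks the plan minimizing an objective.

```python
# One e-class: alternative plans (e-nodes) that produce the same three parts.
eclass_parts_123 = [
    {"plan": "cut part 1, then 2, then 3 from one stock", "cuts": 3},
    {"plan": "one rip yielding 1+2 together, then separate 3", "cuts": 2},
]

def extract(eclass, objective):
    """Pick the e-node that minimizes the given objective."""
    return min(eclass, key=objective)

best = extract(eclass_parts_123, lambda node: node["cuts"])
```

A real multi-objective extraction would keep a Pareto front rather than a single minimum, which is where the genetic algorithm mentioned later comes in.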
apply this e-graph to our carpentry compiler? The answer is no. The first [INAUDIBLE]
is that in the PL domain, the variable is not linear. But in fabrication, the linear
of a challenge means that the variable can be or
cannot be really usable. In carpentry, the
variable cannot be reused in the following process. And so [INAUDIBLE] lumber cannot
be reused for the following cut, if you have
just the cut cutted. And our challenge
is that we must design some new [INAUDIBLE]
rules to populate this e-graph. And this [INAUDIBLE] must
consider some geometry information. For example, this one, we
can generate the arrow one and arrow two with two
cut from one lumber, or we can generate
the two lumber from one stock with one cut. [INAUDIBLE] We must look at some
geometry information. Another challenge
is that we meet mult-objective
optimization problem. But our original
e-graph, just one, single optimization problem. So how do we solve this? First, we apply a geometric
method to populate the e-graph. There are two main steps. We first [INAUDIBLE]
different parts. OK, maybe I’ll use [INAUDIBLE]
algorithm quickly, OK? E-step will generate
many equivalence choice. Then after built
this e-graph, we apply the genetic
algorithm to extract the optimal instructions
from this e-graph and so on. We define a new
[INAUDIBLE] cross over. Here is one, our example. We can generate a different–
many different instructions, consider three objectives. Here, that’s all. Thank you. [APPLAUSE] AUDIENCE: So I recently had the
experience of putting together some IKEA furniture. So the fabrication, or the
construction, fabrication process there, most of it’s done
by someone with a lot of skill. But then the last few
steps are done by someone with almost no skill. so is there a way to express
that sort of constraint within this that have that
some of the fabrication steps will be done? There are more complex
tools available, or more skilled user, and
then the last few steps not so skilled user? HAISEN ZHAO: As [INAUDIBLE]
we make and extract [INAUDIBLE] to
experiment– there are some very useful experiments
into our optimization. Another is that our system
can generate the instructions. And the nonskilled worker
can follow our interactions to fabricate this, either
to improve their skill. Yeah, thank you. [APPLAUSE] ANDRIANA SCHULZ:
Our next speaker is Luyang, who is also a
PhD student at GRAIL working with Ira, Steve, and Brian. And he will tell
us about his work on creating an
immersive basketball experience from single images. LUYANG ZHU: Thank you. So in this project, I will
try to reconstruct basketball players from single images. And the intuition for
this project’s like– we all know basketball
is a very immersive game. So the closer you
are to the court you have better experience. But actually, it’s time
consuming and expensive for me to drive three hours to
Portland and buy a [INAUDIBLE].. So what I want to do is– sorry. Why there’s no video? I want to reconstruct the
basketball game in VR and AR. So this is a internal demo. They tried to reconstruct
a single frame in 3D. But they put a lot of
camera around the stadium and do a multi-view volumetric
reconstruction method. It’s impossible for me to play
40 cameras around a stadium
a football game from a single YouTube video. But the problem
is they only tried to reconstruct a depth map,
which is partially observed. So if you walk around to
the other side of the court, it will be missing. So what we want to do is, can
we reconstruct full 3D players from a single video so that
when we walk around the court you can see like the full– like every part of the player? So this is very challenge for
us because for the real YouTube video, we don’t have
ground truth data. And compared to soccer,
we want to reconstruct unseen part of the player,
like in the observed part. And for basketball, the
player has much faster motion. And the player
always jumps a lot. It’s not like football–
the player is always standing on the ground– which
pose another challenge for us. And to solve this problem,
first, for the data part, we played a lot of video game
to extract the mesh data. And now we are also focusing on
reconstructing a single player. So here’s a short demo
of what we have done. So the top left is you
input YouTube video. And the top, top
right, and bottom left is ours 3D reconstruction
from that YouTube video. You can see that we
can reconstruct LeBron in different view angles. So let’s see our pipeline,
how we solve this problem. So given a YouTube image, I
first use state-of-the-art human tracking to crop the
player I want to reconstruct, for example, LeBron. And then I [INAUDIBLE] to
predict the 2D pose, 3D pose, and the jump
information of LeBron. And using the 3D pose, I
do some temporal smoothing because [INAUDIBLE] 3D pose
has a lot of jittery results. So if we don’t do temporal
smoothing, it oscillates a lot. And after that, I put
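That temporal smoothing step can be sketched as a centered moving average over the per-frame estimates (a simplified stand-in for whatever filter the actual pipeline uses):

```python
# Centered moving average over a 1D series of per-frame values (e.g., one
# coordinate of a 3D joint); edges use only the available neighbors.
def smooth(series, window=3):
    half = window // 2
    out = []
    for i in range(len(series)):
        lo, hi = max(0, i - half), min(len(series), i + half + 1)
        out.append(sum(series[lo:hi]) / (hi - lo))
    return out

jittery = [0.0, 1.0, 0.0, 1.0, 0.0]   # oscillating raw estimates
smoothed = smooth(jittery)            # much smaller frame-to-frame swings
```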
into a skinning [INAUDIBLE] to get my 3D mesh. This LeBron is put at the
origin of the world coordinates– I want to put him in the
correct place on the court. So I also need to estimate the
camera from the input video. To do this, I tried to estimate
the court line from the image, and I also have a
predefined 3D court line so that I can do some camera
calibration job to estimate the camera parameters. And then using the camera
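Since the court lines lie in a plane, the camera estimation from known court points can be sketched with the standard DLT homography (the point coordinates below are made up for illustration, not from the actual system):

```python
import numpy as np

def estimate_homography(src, dst):
    """DLT: solve for the 3x3 homography (up to scale) mapping src to dst."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, vt = np.linalg.svd(np.array(rows, dtype=float))
    h = vt[-1].reshape(3, 3)      # null-space vector = homography entries
    return h / h[2, 2]

# Four court-plane points (meters) and their image positions (pixels).
court = [(0, 0), (28, 0), (28, 15), (0, 15)]
image = [(100, 400), (900, 420), (850, 80), (150, 60)]
H = estimate_homography(court, image)

p = H @ np.array([0.0, 0.0, 1.0])   # re-project the first court corner
p = p / p[2]
```

With the full camera intrinsics, this plane-to-image mapping can be decomposed into the camera pose used to place the player.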
information, 3D image, and the jumping formation,
I can solve an optimization to put the player in
the correct position. So for the training data,
we use NBA2K19 video game. And there’s also a
graphics debugging software called RenderDoc. So it allows me to get all the
rendering resources we want to. And using this software, I can
get the camera, the triangle mesh, and the textures
from which we can further get a texture mesh, and the
3D joints from the mesh. So basically, this formulates
all the training data for our neural network. And I collect about
10,000 example for training the
skinning net and 200 kilo example for training the
personalized poles net. By the way, we are
hiring students who is good at playing NBA2K
for me to collect data. Yeah. I have two undergrads. Including me, we have played
about 400 hours of NBA2K to do this. And it’s just for LeBron,
for single player. If we want to do 10 players,
we 10 more people to help us do this. Yeah. And for the post net,
it’s very, like follow the traditional
neural network design. It’s just using 3D, a 2D
heat map and 3D location map to predict 2D
pose and 3D pose. And the result show that
will be the state of art. And for the skinny, net,
we tried to predict from 3D pose to mesh. But it’s super hard
to directly do that. So we use a TO embedding
network architecture. So we first try to
reconstruct from mesh to mesh. And then we have
another encoder to get a [INAUDIBLE]
distribution of the pose and minimize the distribution
between mesh and pose. And then I trained them
jointly during testing. I only use the bottom
part of encoder. And our results show that we
still are the state of art. So, finally, I will
show you more results. It’s all about LeBron. Yeah. So I just select two views
to render the images. But if you are in a VR
devices, actually, you can move around the court to
see what happening, all angles. This is just a
random in two view. I can still see there
are some artifacts here. Yeah, so finally, I try– we are moving on from single
player to multiple players. And on a single frame here,
these are like live body demo. [MUSIC PLAYING] So this is our 3D
reconstruction, but it’s all LeBron James. [LAUGHTER] Yeah. Because we are only
doing LeBron James. But we are working on extending
it to multiple players. Yeah, thank you. [APPLAUSE] AUDIENCE: Sometimes you did
the reconstruction frame by frame or estimation. And it was noisy, so you
smoothed over frames. Is there any way to remove
the smoothing earlier in the process,
so maybe you could produce a couple of frames? LUYANG ZHUL: The problem is
now we collect the game data frame by frame. If we want to encode the
temporary information in neural network, we need
to have the video data. But luckily, like the game
company also can tap us. So they are willing to
provide us some real– like they are
rendering pipelines so that we can
extract the real video data to train the network. So that’s why we can
extend to multiple players and on longer videos. Yeah. AUDIENCE: So one
artifact I noticed, he seemed to be kind of
skating around on the floor. Is that because the camera
doesn’t constrain that dimension? When you see him move in a
certain direction, that’s a way that the
camera just couldn’t tell where he was moving? LUYANG ZHUL: Acutally,
the camera is fixed. But the artifacts,
like skating, it’s like we don’t add any
physical constraint, realistic physical system to
make the player move like real. And we are doing
like frame by frame. So we don’t explore any
temporal information. So that’s why I say
if we have video data, we can learn the temporal
physics inside the network to make it better. AUDIENCE: And then you make the
player play better too, right? LUYANG ZHU: Yeah. AUDIENCE: Final of the game. LUYANG ZHU: Yeah, I want to play like LeBron James. AUDIENCE: Cool. [APPLAUSE] ADRIANA SCHULZ:
Our last speaker is Rowan, who’s a PhD
student working with Eli. And he'll talk about his work on building AI models that can perform visual common sense reasoning. ROWAN ZELLERS: OK, I work with Gretchen too. Yeah, no worries. No worries. Yeah, I'm going to be presenting joint work that involves many people. Yeah. So I'm going to be presenting this
work called From Recognition to Cognition: Visual Commonsense Reasoning. And in this, I'm
going to be talking about this new project
called VCR, a large scale challenge for today’s
vision systems because it requires not only
visual recognition but also common sense reasoning. So over the past
couple of years, computer vision has
made tremendous progress towards understanding scenes
at a perceptual level. So, for instance, a
machine can detect many of the objects
in the scene, and can provide high
quality segmentation masks. So all of this is
machine generated, which is pretty amazing. However, these
objects by themselves don’t tell the full story. So, for instance, what
I want to know here isn’t exactly how
many people there are, or how many cups there are,
but rather for these four people, who are– let me just talk about
them for a second. There is person one, who
is on the left, person two, and person four. And they’re just
sitting in a diner. And person three, the server
who’s giving them their food, what I want to know is why
is person four pointing at person one. So this question is
highly challenging, because to understand
this question and to answer correctly,
we need to understand the whole situation. We need to understand that
there are people at a diner, they’ve previously ordered
food, and right now, they’re waiting for their food. They’re probably hungry. And person three is the server. And she’s bringing
them their food. And she has several
plates in her hand. So reasoning about
the situation, and connecting
the dots, we might say that person four is telling
person three that person one ordered the pancakes. Whew. That’s complicated. But given enough time, people
can figure this sort of thing out. And so we can even cast
these in a multiple choice format for easy evaluation. Given four options here,
of which one, exactly one is correct, the goal
is to pick which one. So the other couple
of options here are, in addition to a, which is the correct one: b, he just told a joke (maybe, but probably not); c, he is feeling accusatory towards person one; or d, he is giving person one directions. And all of these
answers are true probably for some
image, but not this one. Given this image,
only a makes sense. Still, answering
correctly isn’t enough. We want machines to
answer correctly, but to do so for
the right reasons. So we make explanation a
critical part of the task. Now, after answering
correctly, a machine is given four
options, and it has to justify its answer
by predicting the best natural language rationale. So here, the four
options are a, person one has the pancakes
in front of him, b, person four is taking
everyone’s order and asked for clarification,
c, person three is looking at the pancakes, and
both she and person two are smiling slightly, or d,
person three is delivering food to the table, and she might
not know whose order is whose. So a, b, and c
are all incorrect, and d is correct here. And so that's it, the VCR task, which requires both
answering and justifying, and it’s done in a
multiple choice format to make it easy for machines. So I’m just going to give
three quick highlights here. So first, we had to create a
big crowdsourcing pipeline in order to build this data set at scale, because these models require a lot of training data. Second, many visual
question answering data sets have these unwanted biases that allow models to do suspiciously well without, in many cases, even looking at the image. And so we have an
approach that helps us avoid that for this task. And third, I have a model
that’s the first stab at approaching this problem. So I’m going to
discuss that right now. So the idea is sort of this. Given a question and
an answer choice, we’re going to need to
do a bunch of steps. First, we’re going
to need to connect the words in the question,
like for instance, person four and person one,
with the relevant parts of the image. And that’s easy for humans,
but not necessarily obvious how to do for machines. Let’s see. And then what we’re
going to need to do is contextualize the question with respect to the answer, and both with respect to the image. And then we're going to need
to reason over this high-level representation to say what's correct. And hopefully in
this case, we’ll say that this answer is correct,
that he’s telling person three, I think, that person one
ordered the pancakes. Cool. So here’s a brief example
of a question from VCR that our model gets correct. So, in this case,
the question is why is person one pointing
a gun at person two? And the model correctly
said that this is because person one and person
three are robbing the bank, and person two is
the bank manager. It didn’t say that
person one wants to kill person two, which is
probably right for some image, but not this one. And so this is just
one of many examples that you can browse on our
website, visualcommonsense.com. And we also have a leaderboard where a bunch of people have already submitted their models. But still, there is this very large gap between machine performance and human performance. So I'm really excited to see progress on this. So in summary, this is
just all about this task of visual common
sense reasoning, which requires machines to not
only recognize visual concepts, but also do common sense
inference over that, and connect that to
knowledge and whatnot. And it combines not
only answering, but also explanation, justifying
why an answer is true. So we have a data set and
a model for this task. But still, much more
needs to be done in order to reach human performance. So that’s it. And I can answer
questions and stuff. [APPLAUSE] Yeah? AUDIENCE: So you
kind of pose this as you want machines to be able to answer these questions and give justifications. And yet the whole task is multiple-choice answering. So I'm wondering, do you think that being good at multiple-choice answering directly translates to being able to produce free-form answers? ROWAN ZELLERS: Yeah. Yeah. That's a good question. So the question,
about whether performance on the multiple-choice task correlates with generative tasks, or even with the stuff we really want to measure. I think the problem is that these generation tasks are a bit more difficult to evaluate. And moreover, we want machines to answer correctly, of course. We want them to be
able to generate stuff, but we also want them to
do so much more than that. We want them really not to
just answer these questions, but to have this
understanding about how the world works and connect
it to vision and language. And so I guess my thought is, once machines start doing really well at this, then maybe we'll have to turn to more challenging things. But right now we're just trying
to get it off the ground. AUDIENCE: What if all four choices are wrong, or irrelevant? ROWAN ZELLERS: Yeah,
that’s a good question. So what about
ambiguities and stuff? And there is some ambiguity. So human performance
is somewhere around 95%, because sometimes people aren't great at telling them apart. But we constructed our
crowdsourcing pipeline to carefully reduce the
instances of ambiguity. Yeah. AUDIENCE: Do you have rationales for the wrong answers as well, so that you can actually understand how the machine is reasoning when it predicts the wrong answer? ROWAN ZELLERS: Oh,
that’s a good question. So what do we do for the wrong– do we do rationales
for the wrong answers? And the answer is actually, no. We thought about that. But the problem is
we really don’t– the explainability aspect
could be kind of interesting. But we thought that
maybe it wouldn’t be as useful to have
machines describe why the wrong answers are right. Yeah. AUDIENCE: And test if your representations are right. ROWAN ZELLERS: Yeah, totally. So the other problem,
the other potential issue is that getting– it might be hard to
collect possible candidates for justifying the wrong
answers, if that makes sense. You would have to ask humans to
give me an explanation for why this wrong thing is true. Yep? AUDIENCE: In a similar
vein, and maybe the answer is very similar. But what about identifying that none of the above pertain or would be accurate? It seems like that training task also might be helpful, to be able to look at the answers and say, I have no confidence in any of these. But, again, I see what you're saying; it's hard, even for a human, to say why none of the answers are correct. ROWAN ZELLERS: Yeah, totally. Yeah, my thought with
that is– so I really like the suggestion of
using none of the above. And we were playing
with that as well. But my thought is we want humans
to do really well on this task, because otherwise,
we're not really measuring much of anything. And picking the best answer is something anyone can do without having to set a threshold for whether an answer is right or wrong, which is how I always approach these exams. AUDIENCE: No, it's OK. ADRIANA SCHULZ: OK. So let's thank all the
speakers one more time. [APPLAUSE] They will all be available here for a few minutes to answer any more questions you have. Hopefully, this gave you some understanding of the breadth of topics that we work on in Grail, everything related to the study of visual content: understanding and reasoning, 3D modeling and reconstruction, VR and AR applications, geometry processing, and fabrication. So if you're interested in
these kinds of topics, stop by. [APPLAUSE] AUDIENCE: Some of
those were great. [INTERPOSING VOICES]
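[Editor's note: an aside on the smoothing mentioned in the first Q&A. The per-frame 3D reconstructions are noisy, so they are smoothed over frames as a post-process. The sketch below is a minimal, hypothetical illustration of that idea, a centered moving average over one per-frame value; the function name and data layout are assumptions, and this is not the speakers' actual pipeline.]

```python
def smooth_poses(poses, window=3):
    """Smooth per-frame estimates with a centered moving average.

    This is a hypothetical sketch of frame-to-frame smoothing, not the
    speakers' actual method.

    poses: list of floats (e.g., one coordinate of one joint per frame).
    window: odd number of frames to average over; the window is
        truncated at the start and end of the sequence.
    """
    half = window // 2
    n = len(poses)
    smoothed = []
    for t in range(n):
        # Average over the frames inside the (possibly truncated) window.
        lo = max(0, t - half)
        hi = min(n, t + half + 1)
        smoothed.append(sum(poses[lo:hi]) / (hi - lo))
    return smoothed
```

A learned temporal model trained on video data, as suggested in the Q&A, could replace this kind of fixed filter.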
