Vision: API and Cloud AutoML (Cloud Next ’18)

Hello, everyone. I wanted to first start
with a quick show of hands. How many people here
have experimented with computer vision? Wow, that’s like almost
half of the room. How many people here
have effectively deployed computer vision
models in production? A significantly smaller number. So hopefully, by
the end of my talk, you’re going to
be as excited as I am to employ these
models in production to solve actual
business problems. My name is Francisco Uribe, and
I’m the Product Lead for AutoML and the computer vision teams. In this talk we will discuss
Google’s computer vision offerings. I’ll first start by providing
a quick introduction into our building
blocks, and then I’ll turn it over to three of
our distinguished customers and partners to
talk more about how they’re using
our technology. I wanted to first
start by sharing the mission of our team. Our job is to enable
developers and partners to build the next generation
of computer vision autonomy solutions. Whether you’re an energy
company seeking to detect rust or denting in a wind
turbine, or whether you’re an automaker seeking to
integrate driver assistance into your vehicles, or a
financial institution seeking to extract text
from scanned documents, we want to provide you with
a rich and high quality set of building blocks for
you to implement and assemble these autonomy solutions. And the way we do that is
by packaging and putting at your fingertips Google’s
state-of-the-art machine learning acceleration hardware,
like Cloud and Edge TPUs, and more than a decade of
state-of-the-art computer vision and machine
perception research. Specifically, our team
is part of the Cloud AI family of products. And we focus on
three main areas. The first one, Vision
API, and the second one, Video Intelligence API, help
you understand your images and videos, respectively,
using the power of Google pre-trained models. And the third one,
AutoML, is a suite of products that helps you build
high-quality machine learning models with no machine
learning expertise required. Let’s first start
with the Vision API. The Vision API helps you
understand your images using Google pre-trained models. You can use this API to
extract value from your images, extract contents, and then
integrate this content into your applications. Vision API has a rich
set of features for you to understand your
images, ranging from content annotation
and OCR all the way to content moderation. But since the last
Google Next, I’m excited to introduce
four new capabilities. The first one,
Handwriting OCR, helps you extract handwritten
text from your documents. The second one,
Object Localizer, helps you extract not
only the presence, but the actual coordinates
of objects in your images. And the third one,
Web Detection, helps you extract entities and
similar images from the web, leveraging the power
of Google Search. The fourth one,
Product Search, helps retailers integrate visual
search into their mobile and web applications. In the next few slides, I’ll
provide three example solutions that customers are assembling
with our technology. In the first one,
customers are combining the power of Vision API
content annotation features with Cloud Pub/Sub and App Engine
to index their images at scale and be able to query them
using their rich metadata. Media corporations are using
this to tag their content and find the right media
assets for the next production or media campaign. Financial institutions
are using this to find text in a large
set of scanned documents. And even retailers are
using this to find products simply using visual features. In the second one, customers
are combining the power of Vision API OCR and
our Natural Language APIs to be able to classify and
extract structured information from their scanned documents. Companies are using this
to digitize their content and enable new workflows and
business processes downstream. And the last one, retailers
are using our APIs to build visual
search applications. And the way they’re doing
that is by first, taking different types of products,
like sneakers, pants, or t-shirts, using our
Object Localizer API, and then sending those crops
to the Product Search API to retrieve the exact SKU of
that product in a customer’s camera. So in a nutshell, you can
assemble all these solutions using our Vision API
with no machine learning expertise required. Now switching gears, I wanted
to talk about the Video Intelligence API. This API helps you annotate
and understand your videos at scale in a short
period of time. You can think of Video
Intelligence API as Vision API but heavily
optimized for videos, meaning we leverage the
temporal component of videos. Like Vision API,
Video Intelligence API has features for content
annotation, like Label Detection, and content
moderation, like Safe Search, but it also has
video-specific features like Shot Change Detection. Media and
entertainment companies are using this to search vast
databases of media footage. Other types of
companies are using this to determine the
right place and time to place contextual ads
on the video screens. And companies are also using
this to extract rich metadata from their videos to enhance the
quality of their recommendation engines. Independent of the
uses of our customers, we’re very excited
about what they are doing with our technology. But this is just the beginning. Within the next few weeks,
we’ll be introducing key new capabilities
that are going to enhance our Video
Intelligence capabilities even further. So please stay tuned. Now, if the Video Intelligence
API or the Vision API do not meet your needs,
because your data belongs to a specific domain,
or because you want us to add additional
support for your own concepts, we introduced AutoML. AutoML is a suite
of products that helps you build high-quality
machine learning models for your own data and for your
own use case with no machine learning expertise required. We all know machine
learning’s hard. Whenever you’re confronted with
a machine learning challenge, you have to determine what
is the right set of data preprocessing techniques to
use and which model you should use. If you are using
deep learning, you need to determine which
architecture to use or which set of
parameters to pick. And after you’re done
training your model, you need to determine what is
the right server infrastructure to serve these models
at scale and meeting whatever latency
requirements you may have. This entire process is
very time consuming. It requires a lot of
money, because training a lot of these models takes a
lot of computational resources, and at the same time it requires
significant machine learning expertise. So that’s the reason
why we introduced AutoML, to help you automate
the most complex, hardest steps when building
a machine learning model. And this way, you can focus
on higher value activities, like collecting the
right data and defining the right set of business
objectives for your problem. And all of this without
having to compromise on machine learning quality. So in a nutshell, AutoML helps
you build high-quality machine learning models
for your own data. And after you train your model,
we deploy this model on an API that’s easy to use that you
can use to run predictions with the rest of your data set. Yesterday, we introduced two new
AutoML products, AutoML Natural Language and AutoML Translate. But the most relevant AutoML
for this particular session is AutoML Vision, which helps
you build high-quality image classification models. So, for example, let’s
imagine that you’re a retailer trying to classify
different types of products into your own set of categories. All you have to do is provide a
series of examples of handbags, shoes, and hats, and AutoML
does the magic behind the scenes to produce a model
that then you can later use to tag the rest
of your product data. Now how does AutoML
work behind the scenes? Behind the scenes, AutoML
uses AI to build AI. Every time you pass
your data to AutoML, an AI agent builds a
high-quality, deep learning model with optimal architecture
and a set of hyper-parameters to solve your particular
business need. Now to test this
technology, we ran AutoML on ImageNet,
arguably the most popular and heavily studied
data science data set. And AutoML managed
to produce new state-of-the-art results
on these data sets. And what’s even
more remarkable is the fact that because these
models are produced by AI, it can optimally trade
off the model size and accuracy of the
model, something that’s crucial to deploy
these models on the Edge. So in a nutshell,
AutoML is a product for citizen data
scientists and developers to build high-quality models and
with a great user experience. And to illustrate the point
about the user experience, I wanted to quickly show
you guys a demo of AutoML. Let’s imagine that we are
all meteorologists trying to classify photos of the sky
into different types of clouds, which I recently
learned is actually a good predictor of
future weather events. So if I’m confronted with this
problem, all I will have to do is collect a few example
images for each one of these different
types of clouds, and then open AutoML and
click on a new data set. Let’s say I want
to create cloud3. And here I’m offered
two different options to upload my data. I can either upload it from
my local drive in a zip file or point to a Google
Cloud Storage bucket. But since I have already
uploaded a cloud data set, I’m going to go back
to the data set section and pick clouds_2 that
has around 1800 images. Somebody on our team
actually liked these photos. And here you can see all
the different steps required to build the model, ranging
from images, all the way to prediction. And the first part, Images, has
a Google Photos-like UI for you to manage your images. Here you can create
your image labels. And we show you your
label distribution, so you can determine how to
properly balance your data set. So after you’re done curating
your training data set, you can move to the
next step, Train. And here we show all
the different models that you have already trained
with their corresponding Precision and Recall values. If you want to
train a new model, all you have to do is click
Train new model, and that’s it. Now if I want to learn more
about the performance of one particular model, what I have to
do is click on See evaluation. Here, we show more advanced
metrics like the Precision and Recall graphs, PR curves. And at the same time, you
can tweak the score threshold to determine the ideal
trade-off between Precision and Recall for your
particular business uses. And scrolling down, you can
see the confusion matrix that shows the labels that
the model is confusing. So, for instance, you can see
that the model is confusing all the cumulus labels
with the cirrus labels around 23.8% of the time. So if you click on that cell,
we show you sample images that the model is
confusing, so you can make a more guided or
educated guess about which additional images
to upload to improve the quality of the model
at the decision boundaries. So after you’re done
curating your data sets, and you have built a model
that you’re happy with, you can go to the Predict tab. And here we show
different options for how to query your model
with a simple REST API script or a Python script. But if you’re in a rush, you
can simply upload an image from your hard drive. Here I’m going to test
uploading an image. And voila, you can see that
the model accurately predicted this is a cumulus label. Now switching back
to the slides. So to wrap up my
session, I wanted to announce that since
we launched AutoML, we have had more
than 18,000 sign-ups. So we are extremely
humbled by our traction. Our customers’ use cases span
multiple verticals, ranging from medical imaging, to
retailers tagging product images, all the way to waste
management corporations using AutoML to sort out
different types of trash, like landfill,
compost, or recycling. But independent of the
use cases, independent of the verticals and the
size of our customers, we’re extremely
happy about what they are doing with our technology. And on that note, I wanted to
introduce Eli to the stage. He’s the Head of Data
Science at Litterati. [APPLAUSE]

ELI THOMAS: Hi, I’m Eli. I’m the data scientist
at Litterati. We are a platform that empowers
people in cleaning the planet. Each year 14 billion pounds
of trash flow into our oceans, four trillion cigarettes
are thrown on the ground. $11 billion is spent
cleaning it up in the US alone.
environment, kills wildlife, poisons our food system, yet
strangely, there’s no data. So we built an app. Using our app, each user can
take a picture of litter, say, a Starbucks or a
Burger King cup, tag it, and then dispose of it. The data that we
collect from each user is their username,
what they picked up. A geotag tells us where, and
a timestamp tells us when. And, in turn, we are
building the global database for litter, the
first of its kind, and ultimately, a
litter-free planet. What we understand is that the
data we collect and its integrity are critical. The data we collect
is used by cities, nonprofit organizations, brands,
and schools across the world. For example, in San
Francisco, our data generates annual tax
revenue of $4 million. But we understand the problem we
are trying to solve is complex. So we partnered up with
Google’s AutoML team. And what it helps us do is make
the process as easy as possible for our users, so that they
can pick up more litter and make our data smarter. So what AutoML does, it
takes the user images, and in the back end, performs
image recognition. This helps us in
our data integrity. This is just a sample of
examples of user images that AutoML has tagged in the
backend with great results. Hands down, for us, at
Litterati, Google’s AutoML has proved to be
powerful, easy to use, with a robust process flow. Moving forward, we want to
take AutoML from the backend and put it in each user app
to improve the user tagging. So the user will
still take a picture. AutoML will tag it in their app. And then our user will help
us improve our litter database by going in and adding
additional tags, say, a local brand of
beer, or cola, or candy. And guess what? Our Litterati
story is spreading. We have some amazing
partners, with United Nations being the most recent. And we’re just getting started. Thank you so much
for letting me share. [APPLAUSE]

FRANCISCO URIBE: Nice job. Thanks a lot, Eli. So now I want to introduce you
to Zain Shah, data scientist at Opendoor, to talk about
their experience using AutoML.

ZAIN SHAH: Thanks, Francisco. So, guys, I’m Zain Shah. I’m with the data
science team at Opendoor. And you may be wondering
what is Opendoor? Well, our mission is
to empower everyone with the freedom to move. What that means is instead
of doing a bunch of repairs on your home, working with
an agent to list your home, having random people show
up at odd hours of the day, instead you just
go to our website, tell us a bit about your
home, and we figure out from our data what we think
it will be able to sell for. And we make an offer to buy
your house for that amount, an instant cash offer. If you like our offer, you
just sell it to us and relax. You don’t have to
do much at all. You just move on with your life. Now this may sound like
a crazy pipe dream, but we’ve actually made
a lot of progress so far. People love what we’re doing. This time last year, we
were in three cities. Now we’re in 10. We’ve acquired 800,000 houses
and put them on the market. We now spend about a
billion dollars a year purchasing homes. And all these
numbers are growing. We’re expanding very quickly. Now what does that
experience look like? Well, you basically
just go to our website. You tell us where you
live, and from there, we use our data to make
more informed decisions. I can show you a bit
of what it looks like. Go to our website. Tell us where you
live, like I said. And through our process,
you basically just give us some information that
we might not automatically have. And you tell us what’s
going on inside your home. From this, we can
dynamically calculate how much we think we’re going
to be able to sell the home for. And if you like
what we offer, you can basically just sell your
home to us for that amount and relax. You can also investigate
what our prediction was, and how we came to that amount. And it’s a very easy experience. Now from the data
science team at Opendoor, the sorts of problems
that I deal with are pretty crucial
to the business. Problems we deal with are,
like, accurately valuing homes totally sight unseen. You just go to our
website and tell us a bit, and we have to
figure out actually what it’s going to sell for. This is important for us to
actually do this at scale. We need to predict how
long it’ll take to sell, and we need to analyze
market dynamics to correctly understand
what’s going on in the market and optimally price the homes. Now, like I said, one of
our primary challenges is valuing someone’s home
totally sight unseen. As you can see here,
these two homes, they might look very
similar on paper. But in person, it’s quite
clear that the home on the left is not very nice. So how do we know this? Well, we can actually figure
out from photos, for example, how nice the home is. There are a bunch of
things that are, maybe not what drives the
value of the home, but definitely are strong
indicators that we’ve isolated. They’re mostly proxy
features, things like if you have stainless
steel appliances. That’s a pretty good indicator. If you have granite
counter-tops, that’s a pretty good indicator. Brand-new cabinets, separate
tub and shower in your bathroom, things like that. Now like you saw in the flow, we
get some data from our sellers. They tell us exactly a lot of
the structured information, like whether they have
granite counter-tops. But for the historical
transactions that we train our models on,
we don’t get that at all. We get the square footage,
number of bathrooms, maybe how many stories the home
has, and where it’s located. But none of this information
that we really need. So that’s where computer
vision comes in. From the historical
transactions, we don’t have the structured
data, but we do have photos. So we can look at the
photos of the home. We can figure out whether it has
an upgraded kitchen, whether it has a nice bathroom, and we
can use that to better inform our decisions. But the problem there
is that’s actually pretty time-consuming
and difficult. So that’s me on the right there. In a past life, I’ve
spent weeks poring over filter visualizations
from convolutional neural networks to understand
what was going on and improve the performance. But it’s a lot of work. And sometimes it doesn’t
even work super well, and it takes way too long. So that’s where AutoML comes in. Basically, all we had to do
in order to test this use case is get a bunch of
labels in a spreadsheet, upload them to
Google Cloud AutoML, and from there, we get a
perfectly– well, very accurate model that predicts exactly
what we want, in this case whether a kitchen has stainless
steel appliances or not with basically zero work. What’s next for us? Well, we want to
get more labels. The more we know about
a home, the better. So this means things like
features about the bathrooms, living rooms, backyards, and
other imagery sources as well. AutoML is very flexible. So we can look at aerial
imagery, different rooms, even Street View images to
understand what a home looks like from the outside. Thanks. [APPLAUSE] FRANCISCO URIBE: Awesome. Thank you, Zain. And now we have Rob
Munro, CEO of Figure Eight to talk about our human
labeling partnership. ROBERT MUNRO: Thank
you, Francisco. So something that all of
these AutoML use cases have in common, which
is different from a lot of the existing,
off-the-shelf APIs in GCP, is that they rely
on human label data. So it’s humans who are going in
and saying this label applies to this image. And then that powers the machine
learning to do this at scale. So Figure Eight is a company
that provides software to label your data
for machine learning. We’re the most
widely used software for GCP customers today. And it was delightful
to find out as we’re organizing this
that Opendoor is actually one of our customers. So those images that you saw
in the previous presentation are ones that our
workforce labeled for them. And you can find this
use case on our website. So if you need to have
humans label data for machine learning, there’s
a number of things that you need to
think carefully about. First of all, how can
you make this process as efficient as possible. How can you have a smart
interface that allows someone to create the right labels
as quickly as possible and remove a lot
of the redundancy? Secondly, you need
to think about who is the right workforce for this. If you have a volume of labels
where you can’t put your data scientists on this
manual task, then you might want to reach out
and use crowd-sourced or professional labelers. We manage a marketplace
of about 100,000 people who regularly annotate
data for all kinds of different information. Finally, you want good
machine learning integration. So you want to be able to
take your human-labeled data, put it into your AutoML models,
and also quickly iterate as you need to add more data. So one particular use
case I’m looking at, which I’ll speak
about here, takes us from the first presentation,
which was about computer vision across the entire world, to the
second presentation, which was about computer
vision for your home. I’m going to go
inside your home, specifically, into your closets,
and talk about your shoes. Who here is familiar with WGSN? Anyone? OK, look around at the people
with their hands up next to you. They are the best dressed
people in this room. I just saw a bunch
of extra hands go up. [LAUGHTER] So WGSN are a company that
predicts fashion trends. One very important part
of making this forecast is to see what is
currently being sold across a very large
number of different retailers’ websites. And as you can
imagine, these trends are much more fine-grained
than just shoes, versus shirts, versus dresses. It’s about very specific
types and stylings within these different
pieces of clothing. So you can see in
this example here, this is something that
a human might look at. Is this a shoe? Yes, it is. What kind of shoe is it? Is it a heel,
or is it a flat shoe? And you can see that
the annotator here, the human creating
this annotation, they can see an example of
what a flat versus a heel is, with a text
description of that inline, allowing them to quickly use
those instructions to make the right decision. Similarly, here, if
they selected a boot, this is a categorical
distinction between a lace-up boot
and a slip-on boot. So this kind of interface
is one of the ways in which you’re able to get
high-quality training data. Because you know
that even if it’s not someone doing the labeling that
you’re interacting directly with, you’re able
to provide them with the right instructions. So once these data
points have all been labeled according
to their categories, you can take this
information, automatically put it into AutoML, and have
this process run at scale. One of the nice
features of AutoML is that it will tell you the
confidence of each prediction, and it has a really good
correlation between confidence and accuracy. So what this means
for someone like WGSN, or someone with a
similar use case, is that they can use AutoML
to automate everything that they know is correct
by machine learning. And then for those
remaining items, because they need to
get 100% accuracy, they can put those back
to humans for review. So minimizing that human
part and optimizing as much as they can, so they can
have these accurate forecasts. So this is what the Figure Eight
platform looks like to do this. I’m not going to go
into all the details here, just a couple of points. In addition to having
configurable interfaces that allow you to set these
instructions correctly, there are a number
of quality controls. So, for example,
you can set which workforces work on this task. We work with a lot
of retail companies. So eBay, which you saw earlier,
is one of our customers, someone that we do
image labeling for. So the same people
who have learned about different
fashions through eBay can also be working on
your particular task. You can automatically give
the task to multiple people, so you get a certain
level of agreement for items that
might be ambiguous. And you can also use
our patented system for embedding test questions. So if there are some items
for which you already know the labeled answers, they
can be hidden along the way, and thereby, you’re
evaluating the accuracy of every single person that’s
working on this platform. We’ll often have
tens of thousands of people working in parallel
for a single client’s use case. It’s this kind of
technology that enables you to scale to
that kind of workforce very seamlessly. The final thing, which
is very important, is making this annotation
as accurate as possible. And this is somewhere where you
can combine machine learning. And this is best demonstrated as
a video rather than me talking. So you can see here
machine-assisted video annotation of two
methods side-by-side, the bottom being the one that
you can use in our platform. So you can imagine,
all right, if you have 30 frames a second times
10 seconds, that’s 300 frames. Then times 20 objects is 6,000
individual objects that you might want to identify. So what we have in
our system is a way that these objects
are automatically tracked between frames, still
allowing that human annotator to review every single
one and only make the update, the correction, or
the deletion when the machine learning has gotten this wrong. So in this particular
use case, you can see we’re about
25 to 30 times faster with no loss of accuracy. And we’ll see this be
up to 100 times faster for a lot of our customers. What that means is that the
human component creating data could be reduced
to 1%, or you could annotate 100 times more data, or
apply 100 times more use cases for that budget. And this could be
either if you’re using the workforce in our
marketplace or the efficiency gains that you’re
getting if you’re bringing your own workforce
to this particular task. So we’re really proud to
announce today the Figure Eight and Google Cloud
partnership for AutoML, so that anyone using AutoML can
take advantage of the Figure Eight annotation platform. There’s four
particular components to this: our customizable
annotation template. So that same
interface that you saw for classifying
types of clothes can be used for any kind of
computer vision task. And you can use both
our inbuilt workflows and JavaScript to customize
that as much as you like to your use case. Assigned consultants,
so if you are new to trying to
get quality training data from a distributed
human workforce, our consultants can
help you with that. Human labeling, so like I said,
access to about 100,000 people in our marketplace, both
crowd-sourced workers and professionals that you can
interact with directly and get paid by the hour. And quality control, the three
different quality control methods that I
spoke about today. So I’d love to
hand this back over to Francisco who can talk
more about this partnership. [APPLAUSE]

FRANCISCO URIBE:
Thanks a lot, Rob. And to wrap up, I wanted to
invite you to learn more about our products at and
intelligence. Second, and a topic that
requires a session on its own, at Google, we are committed
to the responsible use of AI. As a first step, our
team put together a guide to help our
AutoML customers detect and potentially mitigate
the bias in their data sets and models. So if you want to know
more about that guide, you can go to It’s important to note that
this is a live document. And we’re really eager
to hear your feedback, so please stay in touch. And last, but not
least, I wanted to invite you to participate
in a limited time offer. Exclusive for the
members of this audience, we are offering two
hours free of machine learning and furnished
consultation, including 3,000 annotations. If you want to
participate in this offer, you can go to All you have to do is provide
your contact information, say that you came
to this session, and then provide a quick
overview of your use case, and then somebody in our team
will reach out to you shortly. So with that, I
wanted to thank you. [APPLAUSE] [MUSIC PLAYING]
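
As a rough sketch of the review loop Robert Munro describes (automate what the model is confident about, and send the remaining items back to humans), the routing step might look like the following in Python. The `Prediction` class, the `route_predictions` helper, the sample scores, and the 0.9 threshold are all hypothetical and are not part of AutoML or the Figure Eight platform.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Prediction:
    """One model output: an item, its predicted label, and a confidence score."""
    item_id: str
    label: str
    score: float  # confidence in [0, 1], as reported by the model


def route_predictions(
    predictions: List[Prediction], threshold: float = 0.9
) -> Tuple[List[Prediction], List[Prediction]]:
    """Split predictions into auto-accepted items and items for human review.

    The threshold is an assumption for illustration; it should be tuned
    against held-out, human-labeled data.
    """
    auto_accepted = [p for p in predictions if p.score >= threshold]
    needs_review = [p for p in predictions if p.score < threshold]
    return auto_accepted, needs_review


# Hypothetical predictions for three product images.
predictions = [
    Prediction("img-001", "heel", 0.98),
    Prediction("img-002", "flat", 0.61),
    Prediction("img-003", "lace-up boot", 0.93),
]

auto_accepted, needs_review = route_predictions(predictions, threshold=0.9)
print("auto-accepted:", [p.item_id for p in auto_accepted])
print("human review:", [p.item_id for p in needs_review])
```

Raising the threshold sends more items to human review in exchange for fewer automated mistakes, which is the same trade-off as the score threshold shown on the evaluation page in the demo.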
