The next speaker is Professor John Edward Hopcroft. Professor Hopcroft is an American theoretical computer scientist best known for his work on automata, formal languages, and algorithms, and for his textbooks on the theory of computation and algorithms, which are regarded as standards in their fields. He is the IBM Professor of Engineering and Applied Mathematics in Computer Science at Cornell University. He received his master's degree and PhD from Stanford University, worked for three years at Princeton University, and since then has been at Cornell University. In 1986 he received the Turing Award jointly with Robert Tarjan for fundamental achievements in the design and analysis of algorithms. Professor Hopcroft is a member of the National Academy of Sciences and the National Academy of Engineering. In 2010 he and Jeffrey Ullman received the John von Neumann Medal for many seminal contributions to theoretical computer science. His book co-authors include Jeffrey Ullman, Alfred Aho, and others. So please welcome Professor Hopcroft.
Thank you for the introduction. Following Al's comments about what one should work on: one of the things that I tell students is, if you're going to do fundamental research, it ought to be research that is exciting to you. In other words, don't do research just because your faculty adviser tells you to work on it, because that will feel like work; you want to do what excites you, because that will really be fun. The other comment I will make is that the world is changing, and you ought to position yourself for the future. I tell my students one story about my own career. When I was at Princeton, Ed McCluskey asked me to create a course in computer science. At that time there were no departments of computer science, there were no courses, there were no books. Teaching that course made me one of the world's first computer scientists, and I didn't realize at the time what the consequences of that would be. But when our government wanted a senior person in computer science, I was on the shortlist, and when I was fairly young I got a call from the White House saying the president wanted to appoint me to the National Science Foundation, which oversees science funding in the U.S. Now imagine if I had been in high-energy particle physics: I would still be waiting today for the senior faculty ahead of me to retire. But because there were no senior faculty ahead of me, I had opportunities that one would just not expect to have. So the other thing I would tell students is: position yourself for the future. And that goes not just for people; it goes for companies, and it goes for countries. But with that,
I will start my talk. It turns out that I believe we are undergoing an information revolution, and the impact of this revolution is going to be as important as the impact of the Agricultural Revolution or the Industrial Revolution. Machine learning is going to be an important driver of this revolution, and one of the main things heating it up is deep learning. So in this talk I'm going to start by giving you a little bit of background about AI, and then I'm going to talk about research problems in deep learning.

One of the early things in AI was a little
unit called a threshold logic unit. This unit has a number of inputs x1, x2, x3, and each input has a weight. What it does is sum the inputs times the weights, and if the sum is less than some threshold it outputs a zero; otherwise it outputs a one. Now, we've moved from this hard threshold to something a little different, because as we make networks of these gates we want to be able to take the derivative of an error function, and the threshold function is not differentiable. So we sometimes replace the threshold with a sigmoid function, and actually a function that has worked very well more recently is one that gives zero output if the sum is negative and gives the sum as the output if the sum is positive; it's very simple to take the derivative of that.
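To make this concrete, here is a minimal sketch (in Python with NumPy, my choice of illustration, not something from the talk) of the threshold logic unit and the two smooth replacements just described:

```python
import numpy as np

def threshold_unit(x, w, t):
    """Threshold logic unit: sum the inputs times the weights and
    compare the sum against the threshold t."""
    return 1 if np.dot(w, x) >= t else 0

def sigmoid(s):
    """A smooth, differentiable replacement for the hard threshold."""
    return 1.0 / (1.0 + np.exp(-s))

def relu(s):
    """Zero if the weighted sum is negative; the sum itself otherwise."""
    return max(0.0, s)
```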
Now I want to give you the algorithm to train a threshold logic unit. Initially you set the weights to zero, and then you cycle through all the inputs. If an input is misclassified and you want a one output, you add the input to the weight vector; if you want a zero output, you subtract the input from the weight vector. There's just one thing to remember about this, which I'll tell you in a minute. If your data is linearly separable, this algorithm will quickly converge to a set of weights that separates the data. But what if your data is not linearly separable? The one thing I want you to remember is that the weight vector will be a linear sum of patterns, because remember, we set the weight vector to zero, and every time we change the weight vector we either add or subtract an input. So the weight vector will always be a linear sum of the input patterns. (A short sketch of this training loop follows.)
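Here is a minimal sketch of that perceptron training rule, under the usual assumption (mine, not stated in the talk) that the threshold is folded into the weights as a bias column in X:

```python
import numpy as np

def train_perceptron(X, y, epochs=100):
    """Start with zero weights; cycle through the inputs, adding a
    misclassified input when the desired output is 1 and subtracting
    it when the desired output is 0."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for x, label in zip(X, y):              # labels are 0 or 1
            pred = 1 if np.dot(w, x) >= 0 else 0
            if pred != label:
                w += x if label == 1 else -x    # w stays a sum of patterns
                mistakes += 1
        if mistakes == 0:                       # converged: data separated
            break
    return w
```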
If the data is not linearly separable, what you might do is map the data to a higher-dimensional space where it is linearly separable. In this case, what I might do is add a third coordinate: I take each data point and, depending on how far it is from the origin, move it out of the plane the data was originally in. The O's, you can see, are farther from the origin than the X's, so I pull the O's out, and then I can easily put a plane between the O's and the X's. Okay. Now, it may be that the mapping you're going to use, the function f, maps the input to an infinite-dimensional space. But the interesting thing is that you don't need to be able to compute the function f, because you don't need it to run this algorithm; all you need is the inner products of the mapped images. Here is the way the algorithm works in the higher-dimensional space: I take my input ai, it gets mapped to f(ai), and the weight vector is going to be a linear combination of the mappings. Now notice that if I want to multiply the weight vector times the pattern f(aj), I don't need to know the value of f(aj); all I need is the inner product of f(ai) and f(aj). And when I want to update the weight vector, if I'm going to add f(aj) to the weight vector, all I need to do is increase the corresponding coefficient by one. (The sketch below makes this concrete.)
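Here is a minimal sketch of that kernelized update, assuming the matrix K of pairwise inner products f(ai)·f(aj) has already been computed:

```python
import numpy as np

def train_kernel_perceptron(K, y, epochs=100):
    """Perceptron in the mapped space. The weight vector
    w = sum_i alpha[i] * f(a_i) is never formed explicitly;
    K[i, j] holds the inner product of f(a_i) and f(a_j)."""
    n = len(y)
    alpha = np.zeros(n)              # coefficient of each pattern in w
    for _ in range(epochs):
        for j in range(n):
            s = alpha @ K[:, j]      # w . f(a_j), via inner products only
            pred = 1 if s >= 0 else 0
            if pred != y[j]:         # misclassified: add or subtract f(a_j)
                alpha[j] += 1 if y[j] == 1 else -1
    return alpha
```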
Okay, so what that suggests is the concept of what we call a kernel: rather than pick a function f, why don't we pick a matrix of inner products? Can I just select an arbitrary matrix? Not quite: I've got to select a matrix for which there exists a function that would give rise to that matrix, and it turns out that's true if and only if the matrix is positive semi-definite. I'll give you an example of a matrix you could pick: the Gaussian kernel. If I want to know what the ij element of this matrix is, I need to know what the inner product of f(ai) and f(aj) is, and it turns out I simply take the difference between the input ai and the input aj, take the squared length of that vector, multiply it by a negative constant, and raise e to that power. So I never actually had to calculate the function that gives rise to this Gaussian kernel, and in fact this mapping is into an infinite-dimensional space, and there are many kernels like this.
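For concreteness, here is the Gaussian kernel matrix as a short sketch (the constant c is an illustrative parameter):

```python
import numpy as np

def gaussian_kernel_matrix(A, c=1.0):
    """K[i, j] = exp(-c * ||a_i - a_j||^2): a positive semi-definite
    kernel whose feature map f is never computed explicitly."""
    diffs = A[:, None, :] - A[None, :, :]        # all pairwise differences
    return np.exp(-c * np.sum(diffs ** 2, axis=2))
```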
This is the basis of what's called the support vector machine, and the support vector machine is what was driving AI for the last 15 or 20 years. It was very successful, but the thing that accelerated the area of AI is deep learning, so let me talk about that.
What drove this was something called the ImageNet competition. They had 1.2 million images which were labeled, and there were a thousand categories; maybe one was cat, one was car, another airplane. They ran a competition in which you would develop an algorithm, they would give you some training data, you would train your algorithm on it, and then you would look to see how well your algorithm generalized to some test data. Up until about 2012 the error rate was about 25%, and the improvements from one year to another were only a small fraction of a percent. But in 2012 AlexNet came along, and all of a sudden the error rate dropped to 15%. This was really exciting, because it was a major achievement, and at this point people started applying deep learning to many problems in finance and sociology and all kinds of areas, and deep learning was very successful. People don't know why it was so successful, but it was. Then in 2014 GoogLeNet came along and dropped the error rate to 6.6%, and then ResNet came along and dropped it to 3.6%. A trained human being has an error rate of about 5%, so the computer programs are now better than a trained human. I should point out that these networks got deeper and deeper; ResNet is actually a thousand levels deep, so you might ask how they trained it and so forth.
These are interesting issues. But let me quickly point out that what we're talking about here is supervised learning: you put in an image and you try to get the classification of that image correct. You do that by changing the weights at these various levels, and to do that you have to take a derivative; how you do that when the networks get a thousand levels deep is an interesting problem. But something interesting happened: someone came along and said, let's do something different. Let's put in an image and train the network to reproduce the image. Now I don't have to tell you how the image is classified, and what they observed is that the gates in the middle formed a better representation of the image. In fact, someone pointed out that one of these gates would respond if the image was that of a cat. Nobody told the computer which images were cats and which were not, but the computer figured that out, and all of a sudden we realized that maybe we could do unsupervised learning. This is an important area of research today.
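That reproduce-the-input idea is what is usually called an autoencoder. Here is a minimal sketch (layer sizes are illustrative choices of mine, not from the talk):

```python
import torch
import torch.nn as nn

# Train the network to reproduce its own input: no labels are needed,
# and the narrow hidden level is forced to learn a representation.
class Autoencoder(nn.Module):
    def __init__(self, n_pixels=784, n_hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_pixels, n_hidden), nn.ReLU())
        self.decoder = nn.Linear(n_hidden, n_pixels)

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
opt = torch.optim.Adam(model.parameters())
x = torch.rand(32, 784)              # a batch of flattened images
loss = nn.MSELoss()(model(x), x)     # the target is the input itself
opt.zero_grad(); loss.backward(); opt.step()
```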
Now, the actual networks that people use are slightly different, and what network you use depends on whether you're looking at images or at speech; I'm going to focus on images. For images there is something called a convolutional level. What you do is take a small little grid (typically it's five by five, but to get it on the slide I made it three by three), and you slide this little window across the image; for each position as you move it across you have a gate, and then you move it down one row and slide it across again, so you have a number of gates equal to the number of pixels in your image. What this little window does is look for a certain feature; it might look for an edge or a corner or something like that. By the way, the weights associated with each position are the same: if you have a five-by-five window, you only have 25 weights, not a million weights. But you want to find various features, so there are many of these convolutional setups in a particular level. Then, to try to keep the networks a little smaller, there's another level called pooling, where they have a two-by-two window and they move it across, but they move it two units at a time so the windows don't overlap, and they take either the maximum value or the average value. The reason they can do that is that it's not so important exactly where a feature occurs, but rather how the features occur relative to one another, and this seems to work. Putting this together, AlexNet, by the way, had five of these convolutional levels, then three fully connected levels, and then something called softmax. So you put an image in, and they trained it to produce the right classification.
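Here is a minimal sketch of that kind of architecture (channel counts and sizes are illustrative choices of mine, not AlexNet's actual ones; it assumes 32-by-32 input images):

```python
import torch.nn as nn

net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=5, padding=2),  # 5x5 window, weights shared
    nn.ReLU(),                                   #   across all positions
    nn.MaxPool2d(kernel_size=2, stride=2),       # 2x2 pooling, no overlap
    nn.Conv2d(16, 32, kernel_size=5, padding=2),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 100),                  # fully connected levels
    nn.ReLU(),
    nn.Linear(100, 10),
    nn.Softmax(dim=1),                           # class probabilities
)
```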
Now what I'm going to do is talk a little bit about some research, and I'm going to talk about something called the activation vector. You put an image in, and you get a value at each of the gates at a given level, so I have a vector for each image. But it turns out that there are two notions of activation space, because I could put those columns in a matrix. In one case I put in an image and I get a vector representing it; in the other case I look at just one gate across all the images, and what I get is a vector where each coordinate corresponds to an image and tells me what that gate is representing. So let me show you some things that you can do here. The first thing is: if I have an image, it's easy to find the activation vector; I simply put the image into my network and see what the activation vector is. But suppose you gave me an activation vector and wanted to know what image produced it. There are many ways to solve this problem; I'm just going to give you one that's easy to understand. Pick a random image, find out what its activation vector is, and then do gradient descent on the pixels of the random image to move its activation vector toward the target activation vector. What that will do is convert your random image into an image that produces the activation vector. I just wanted to point out that you can go from activation space back to the image, because now I'll show you what you might do with it. (A minimal inversion sketch follows.)
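Here is a sketch of that inversion by gradient descent on the pixels. The name net_up_to_layer is hypothetical; it stands in for the network truncated at the chosen level:

```python
import torch

def invert_activation(net_up_to_layer, target_activation, steps=500, lr=0.1):
    """Start from a random image and do gradient descent on its pixels
    until its activation vector matches the target one."""
    img = torch.rand(1, 3, 32, 32, requires_grad=True)   # random image
    opt = torch.optim.Adam([img], lr=lr)                 # descend on pixels
    for _ in range(steps):
        loss = ((net_up_to_layer(img) - target_activation) ** 2).sum()
        opt.zero_grad(); loss.backward(); opt.step()
    return img.detach()
```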
When I put an image into this network, I might take the activation vector at one level and say that's the content of the image. Then I might take the activations at another level and take their cross-products (in practice, a matrix of products, the Gram matrix) and say that's the style of the image; the reason I take the cross-products is that they tell me how adjacent pixels are related to one another. So, doing that, we took a picture of George Bush, our former president, found the activation vector, and said we're going to use that for the content. But then we took 200 images of older people and said we're going to use the average of their style representation for the style, and then we recreated the image using the content of George Bush but the style of an older person, and this is what we got.
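A sketch of that style computation, assuming activations are arranged as channels by spatial positions (my arrangement, for illustration):

```python
import torch

def gram_matrix(activations):
    """Style of an image at a given level: the matrix of products
    between channels. `activations` has shape (channels, positions)."""
    c, n = activations.shape
    return activations @ activations.T / n

def average_style(activation_list):
    """Average style over a collection, e.g. 200 images of older people."""
    return torch.stack([gram_matrix(a) for a in activation_list]).mean(dim=0)
```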
Now, one of the things that I do is, each year I bring 30 or 40 students from China over to Cornell University. These are students who have just finished their junior year, and each of them is asked to do a research project. One of them did the following: they took a picture of Cornell University and asked, what would Cornell University look like if it were in China? So they took a piece of Asian artwork, and this is what they got as Cornell University if it were located in China. One of the things I want to talk about is this activation space, because it's very high-dimensional, but if you take all of the images of cats, they will form a manifold, a much lower-dimensional manifold, in this activation space, and understanding what this manifold is, I think, is going to be important to understanding why deep learning works. Let me just repeat: thousands of people have applied deep learning in various application areas, and they have been very successful, but nobody seems to understand why deep learning works so well, and that theory still needs to be developed.
So I'll just show you some other things that we can do. Here's a photo of Cornell University, and we took some styles, shown across the top, and asked what Cornell would look like under each of those styles; that's the bottom row. But we had used a pre-trained network, so we asked the question: do you really have to train a deep learning network to do these various things? We tried the experiment again with random weights, and we did just about as well. So one other thing that's important is to ask, for each thing you're doing, does it require the training, or is it the structure of the network that causes it to happen? And here's just a list of some interesting projects: what do individual gates learn? How does what the second-level gates learn differ from the first? And so on.
I guess I should mention something about how what a gate learns evolves over time. One of these juniors that I brought over to Cornell was training a network and watching what happened, and he observed that three gates started to learn the size of the image (he was using black-and-white images). Then, after a little while, two of them gave up and decided that the network didn't need three gates learning the same thing, and those two shifted and started to learn something else. You can see that there's a fundamental research question there: why did these gates give up and start to learn something else? Unfortunately he was only at Cornell for a month and only discovered this towards the end, so he didn't have time to explain why, but there's really exciting research here. To ask whether two gates learn the same thing, what you might do is take the activation vectors for the two gates and compute the covariance; if the covariance is one, then they're learning the same thing.
So you might do the following. If I have two networks, I put the gates of one network going across the columns and the gates of the other going down the rows, and calculate the covariance for each pair. One of the questions we were asking is, if you train a network twice, does it learn the same thing, or does it learn things in entirely different ways? In a particular example that we tried, we found a gate in one network which was very highly correlated with what a gate in the other had learned, but there were only a few gates that we could match up. Then we noticed that there were a couple of gates, three and four, in one network which together learned the same thing that gates two and seven learned in the other. So it looks like they were learning the same subspace; they just had a different coordinate system for it. (A sketch of this pairwise comparison is below.)
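Here is a minimal sketch of that comparison, where acts_a and acts_b (names of my choosing) hold each network's gate activations across the same set of images, with shape (images, gates):

```python
import numpy as np

def gate_correlations(acts_a, acts_b):
    """Entry [i, j] is the correlation between gate i of network A and
    gate j of network B over the same images; a value near 1 means
    the two gates learned the same thing."""
    a = (acts_a - acts_a.mean(0)) / (acts_a.std(0) + 1e-8)
    b = (acts_b - acts_b.mean(0)) / (acts_b.std(0) + 1e-8)
    return a.T @ b / len(a)
```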
I should point out that when you train one of these networks, there are a large number of local minima; at least, we believe they're local minima, but they may not be. In three dimensions, if you're doing gradient descent, you're not going to get stuck at a saddle point, because at a saddle point, where in one direction you go down and in another direction you go up, just numerical error is going to prevent you from staying there, and you'll continue on down. But in a million dimensions that may not be true, because if there are only, say, ten directions that go down, you might not be able to find them, and it might be that these local minima are actually saddle points. So a research question is how you determine, in high dimension, whether you're at a minimum. (In low dimensions one can check the eigenvalues of the Hessian, as sketched below, but that does not scale.)
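Here is a low-dimensional sketch of that check, using a finite-difference Hessian; forming the Hessian this way in a million dimensions is hopeless, which is exactly the research question:

```python
import numpy as np

def hessian(f, x, eps=1e-4):
    """Finite-difference Hessian of f at the point x."""
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei, ej = np.eye(n)[i] * eps, np.eye(n)[j] * eps
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * eps ** 2)
    return H

def is_local_minimum(f, x):
    """All Hessian eigenvalues positive: minimum; mixed signs: saddle."""
    return bool(np.all(np.linalg.eigvalsh(hessian(f, x)) > 0))
```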
But let me just show you something. Let's say I train a network, and on the training data this is the error curve. You'll notice that there are two local minima, both at about the same error. Which one should you pick? Well, you'd like to ask which one is going to generalize better: this is the error for the training data, but how about for real data? I will conjecture that you ought to take the broad minimum rather than the sharp one. The reason is that if you pick your training data statistically from the full set, then the error function for the test data should be roughly the same as for the training data, just slightly different. So let me put up the error function for the real data; that's this dotted line, shifted a little. Notice that when you have a broad minimum and you shift the curve a little, the error doesn't go up very much, whereas when you have a very sharp local minimum and you shift it a little, the error goes up significantly.
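A toy numerical illustration of that conjecture (entirely my own construction, with made-up curvatures):

```python
# Two one-dimensional minima with equal training error but different
# sharpness; the shift mimics the train/test discrepancy.
broad = lambda w: 0.1 * (w - 1.0) ** 2    # broad minimum at w = 1
sharp = lambda w: 10.0 * (w + 1.0) ** 2   # sharp minimum at w = -1
shift = 0.2
print(broad(1.0 + shift))                 # 0.004: barely worse
print(sharp(-1.0 + shift))                # 0.4:   a hundred times worse
```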
So these are just interesting questions that lead to lots of interesting research. Here is another: if you have two tasks and you learned them separately, you might ask what would happen if you learned the two tasks together, and what is common to the two tasks. I can take the two networks and change them just slightly: I pull a few of the gates out and share them, and now when I train, it turns out that the things common to the two tasks are learned by the shared gates, while the other gates learn things specific to the lower task and things specific to the upper task. All I want to do here is show you the exciting kinds of things going on, because there are just millions of questions we can ask and things to explore that we don't really understand. (A sketch of the shared-gates setup follows.)
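A minimal sketch of that shared-gates arrangement (sizes and names are illustrative assumptions of mine):

```python
import torch
import torch.nn as nn

class TwoTaskNet(nn.Module):
    """A few gates are shared between the two task networks; joint
    training pushes what is common to both tasks into those gates."""
    def __init__(self, n_in=100, n_shared=20, n_private=40):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(n_in, n_shared), nn.ReLU())
        self.task_a = nn.Sequential(nn.Linear(n_in + n_shared, n_private),
                                    nn.ReLU(), nn.Linear(n_private, 1))
        self.task_b = nn.Sequential(nn.Linear(n_in + n_shared, n_private),
                                    nn.ReLU(), nn.Linear(n_private, 1))

    def forward(self, x):
        s = self.shared(x)                   # gates common to both tasks
        both = torch.cat([x, s], dim=1)
        return self.task_a(both), self.task_b(both)
```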
To keep on time, I'm going to go quickly here. Something which is very important is the notion of a generative adversarial network. At one time people were trying to create realistic-looking images: you'd like to feed in the word "cat" and have an image of a cat come out, and they were not doing very well. But someone had the idea of saying, why don't we train a network, called the synthetic-image discriminator, which can tell the difference between a real image and a generated image? So I'll take a thousand real images and a thousand generated images and train this network. Then I'll take my image generator, plug it in, and train the generator until it fools the synthetic-image discriminator. At that point I'll train the discriminator a little bit more, and I'll keep alternating these two steps. What they discovered is that pretty soon you get images out which really look like real images. (A minimal sketch of this alternating loop follows.)
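Here is a sketch of that alternation, with tiny stand-in networks and fabricated data; a real GAN would use convolutional architectures and real images:

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(16, 784), nn.Tanh())     # generator
D = nn.Sequential(nn.Linear(784, 1), nn.Sigmoid())   # discriminator
opt_g = torch.optim.Adam(G.parameters())
opt_d = torch.optim.Adam(D.parameters())
bce = nn.BCELoss()

for step in range(1000):
    real = torch.rand(32, 784)                       # stand-in real images
    fake = G(torch.randn(32, 16))
    # train the discriminator to tell real from generated
    d_loss = (bce(D(real), torch.ones(32, 1)) +
              bce(D(fake.detach()), torch.zeros(32, 1)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # train the generator to fool the discriminator
    g_loss = bce(D(fake), torch.ones(32, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```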
You can use this for many things. One thing you might want to do is develop a translator from one language to another. The way we used to do it is we'd find many documents where we had a copy in both languages. But if we didn't have that, what would we do? Well, we might use a discriminator. Let's say I'm going to translate from English to German. I take something which takes an English sentence and just produces a bunch of German words. Then I take a discriminator which tells me the difference between something which is just a bunch of German words and something which is a German sentence. And then I get another translator which takes German back to English, and I train these three things until the output back in English is the same as the input. If you think about it, what this forces to happen is that the first translator has to create a German sentence, and for the final English to come back the same as what you put in, that German sentence had better be a translation of the English sentence. This just suggests the wide range of things you might do with this adversarial discriminator. (A rough sketch of the round-trip losses is below.)
Another thing people are trying to do is compress these networks. The networks are getting so big that if you want to put one on your iPhone there's a problem, and we don't seem to be able to train a little network directly if we train it to get the right classification. So people are now doing the following: they train a big network to work well, then they take its activation vector and ask, could we train the little network to reproduce the activation vector rather than the classification? We're going to see how well people can do there. (A minimal sketch is below.)
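A minimal sketch of that idea; both networks here are illustrative stand-ins:

```python
import torch
import torch.nn as nn

big = nn.Sequential(nn.Linear(784, 512), nn.ReLU(), nn.Linear(512, 100))
small = nn.Sequential(nn.Linear(784, 32), nn.ReLU(), nn.Linear(32, 100))
opt = torch.optim.Adam(small.parameters())

x = torch.rand(64, 784)                  # a batch of images
with torch.no_grad():
    target = big(x)                      # the big net's activation vector
loss = nn.MSELoss()(small(x), target)    # match activations, not labels
opt.zero_grad(); loss.backward(); opt.step()
```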
You've probably heard of fooling: you can take an image of a cat which is correctly classified, change a few pixels, and though you probably can't even tell that the two pictures are different, all of a sudden the deep network says it's an automobile. The reason is that in activation space there is that manifold, and if you move off the manifold, perpendicular to it, you're going to change the classification. Now, this particular case isn't going to cause you a problem: the reason the image is misclassified is that you could quickly tell it was not a real image, because there are a few pixels in it which have no relationship to the adjacent pixels, and if you just filter the image, it gets back its classification as a cat. But people have found that they can actually take real images and fool things very quickly. (A sketch of how such a perturbation is crafted is below.)
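Here is a sketch of one standard way such perturbations are made, the fast-gradient-sign method (my example; the talk does not name a method): nudge every pixel slightly in the direction that increases the loss for the true class.

```python
import torch
import torch.nn.functional as F

def fool(net, image, true_label, eps=0.01):
    """Return an imperceptibly perturbed copy of `image` that pushes it
    off the class manifold, toward a different classification."""
    image = image.clone().requires_grad_(True)
    loss = F.cross_entropy(net(image), true_label)
    loss.backward()
    return (image + eps * image.grad.sign()).detach()
```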
When my daughter was two or three years old, I used to go through the Best Word Book Ever and point at pictures, and one of the pictures I pointed at was a fire engine: a single picture. Later we were out for a walk, there was a fire engine on the street, and she pointed to it and said "fire engine." She learned from one picture. These deep networks are learning from thousands of pictures; how are we going to teach them to learn from one picture? It's possible that my daughter had had a certain kind of practice: she had seen billions of pictures before, and she had learned how to learn from pictures. But just at the end, people always ask me the question: is artificial intelligence real? AI programs do not extract the essence of an object and understand its function or other important aspects; it's merely pattern recognition in a high-dimensional space. So if you trained a deep network to recognize railroad cars and you gave it a picture like this, it would probably say "boxcar," not realizing it's an engine, because it doesn't look like one. But let me just wind up: not all intelligent tasks need AI; some just need computing power. And with that, I will conclude.
Thank you very much, Professor Hopcroft, for your lecture. Thank you very much.