>> Hello everybody,
it's a pleasure to have Tobias with us here,
and he'll be telling us about, I presume,
Interactive Machine Learning and
thinking about [inaudible] user,
and he's done a lot of interesting work.
>> Hello everyone.
Thanks for the great introduction.
And today, I want to talk about how we can improve
Machine Learning beyond the algorithm,
not only by working with the algorithm itself,
but also by improving it through working
with the user interface, as he mentioned.
So, in order to
better understand what I exactly mean by that,
I want to start with an example.
Let's say, we have this fictitious bookstore,
and we want to find out
how good the inventory of our bookstore is.
So, that means we want to find out what is
the average rating that people would
give to books in our inventory.
So, the good people we are,
we go out and we collect data.
And the way we do it is,
we would show users
a book along with a star rating interface and say,
rate this book on a scale from one to five,
how much you would like to read it.
So, we do this for n randomly sampled books,
and after we have collected all the ratings,
in our second step,
we do a very sophisticated inference, namely,
we infer the mean parameter by
simply averaging all the values.
So, we're done. Right?
No, because we have
an angry boss coming in and she's saying,
"Your estimates are way too noisy,
there's too much variance, I
don't know what to do with this."
So, you put your Machine Learning thinking hat on,
and you go back,
and you look at this and you say,
"Well maybe this estimator isn't the best,
let's improve this by using some prior knowledge."
So, we kick this out and let's
assume that our ratings actually come from
a normal distribution that
has a normal prior on the mean.
All right. So, notice that this is
in principle an implicit user model because
it somehow assumes how the data was generated.
And then in the second step,
you would just compute
the map estimate for your mean parameter.
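Just to make the two estimators concrete, here is a small sketch in Python. The prior parameters `mu0`, `tau2`, `sigma2` are made-up illustrative values, not numbers from the talk; the MAP estimate under a normal prior on the mean is the usual precision-weighted shrinkage of the sample average toward the prior mean.

```python
import numpy as np

def naive_mean(ratings):
    """Plain average of the observed ratings."""
    return np.mean(ratings)

def map_mean(ratings, mu0=3.0, tau2=0.25, sigma2=1.0):
    """MAP estimate of the mean, assuming ratings ~ N(mu, sigma2)
    with a N(mu0, tau2) prior on mu.  Shrinks the sample average
    toward the prior mean mu0, reducing variance."""
    n = len(ratings)
    precision = n / sigma2 + 1.0 / tau2
    return (np.sum(ratings) / sigma2 + mu0 / tau2) / precision

ratings = [5, 4, 5, 2, 4]
print(naive_mean(ratings))  # 4.0
print(map_mean(ratings))    # ~3.56, pulled toward the prior mean 3.0
```

With fewer ratings the MAP estimate sits closer to the prior mean; with many ratings it approaches the plain average.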
So, that should help
us reduce the variance in our estimates but,
I'm going to argue here that this is
not the only way to do this.
Actually, if we take a step back
and think about what we did here in
our little toy problem, in the first part,
we actually had people giving
us ratings through an interface,
whereas in the second part we had
this very sophisticated algorithm
to give us the mean rating.
But actually, we've only
worked really with the second part,
and my argument is that we could have
equally worked also with the first part.
And what we could have done is also
swapped out the interface and changed it,
so that next to the book users are about to rate,
we would show a similar book that they've
previously rated, along with the star rating they gave.
So, we had "The Hobbit" here, and we would show,
you gave "Lord of The Rings" three stars.
And people have done that,
and you can see in
user studies that this also reduces
the variance in the rating data that you get.
So, you've achieved essentially
the same or a similar result by working
with the interface instead of
adjusting the algorithm.
Now, what you've just
seen is a very crude interactive system,
you could even argue it's not interactive.
But my point is that,
even when you move to
more complicated interactive systems
such as recommender systems or search engines,
the same kind of view should hold.
So, you have a back-end Machine Learning algorithm,
and you have people that interact
with the algorithm through an interface,
and both are tightly coupled together through the data.
So, data that people generate, but also output
from the ML algorithm, such as predictions.
And in this talk,
I want to highlight,
my and our efforts
to work with the two different components here.
So in the first part, I want to talk about
how we can design better ML algorithms
by taking into account biases from
the users and from the user interfaces.
And then in the second part of my talk,
I want to start from the people and interfaces side and
think about how we can design
better user interfaces for
more effective Machine Learning,
by shaping and designing the
data that the algorithm then takes as input.
All right. So, in this first part,
I want to start with a paper where
we show how to evaluate and train
recommender systems using a new approach
that relies on causal inference.
And as a motivating example,
I want to look at movie recommendation.
In movie recommendation, we have
a population of users, here,
and then a number of movies,
and we want to recommend to each user
those movies that he or she values the most.
In this toy example,
we have two user groups,
romance lovers and horror lovers.
Romance lovers, love romance movies
but hate horror movies, and horror lovers,
love horror movies but hate the romance movies,
and both user groups
are indifferent with respect to drama movies.
Now, the problem is that in practice,
we don't get to observe this full ratings matrix
but we only get to observe a subset.
And it is well known that users are more likely
to give more ratings for things that they like.
So, we would see
much more five star ratings
than one star ratings in the data.
And the fact that
these five star ratings are over represented
is due to what is known as
selection bias and this problem is also sometimes called,
"Data Missing Not At Random (MNAR)."
And I already mentioned that there is one source of
bias that we saw that comes from the users.
So, the user-induced bias comes from the fact that
people are more likely to rate
good items, or from effects like
looking at certain
categories more often, and so on.
But then there's also system-induced bias
that comes from advertising,
giving certain items
a more prominent position on the starting page
means that users are more likely
to click on them, and rate them.
The obvious question is what happens if we just ignore
selection bias and apply
our standard machinery here?
Notice that this is the
most frequent approach in the literature, modulo
a handful of generative approaches
that we will compare against later.
Yeah. What happens now?
When we ignore selection bias,
it turns out that we can be horribly misinformed.
Suppose we want to deploy a new recommender policy here,
Y-hat, and Y-hat actually
turns out to be pretty crummy so,
it recommends mostly one star movies to users.
I've indicated that here as the boxes.
And what happens now,
if we evaluated on
previously collected data is the following, suddenly,
our crummy policy looks pretty good because
the overlap with the five star ratings
is pretty significant so,
we would be misled
and arrive at the wrong conclusion
because of selection bias.
So, that's not good. And a very similar thing
happens when we evaluate predicted ratings,
where you complete the matrix,
like in the Netflix challenge,
instead of doing policy evaluation.
And we have two systems that we are evaluating.
So, the first system ignores
the two groups and just predicts all five.
It makes a large error on the full ratings matrix.
So, you would have a lot
of errors here and I've indicated that in deep red.
Now, the second system is better because
it recognizes that there is this diagonal structure.
However, it makes a small mistake,
small error on the drama movies
by predicting fives and I've indicated that in light red.
What happens now, if we evaluate
these systems again on observed data?
Well, for the first system we make few errors,
since there are not a lot
of one star ratings in the observed data,
and then for the second system,
we were making far more errors because there is
just more of the three star ratings in the log data.
So, this actually makes the second system's performance
look much worse and
we would prefer the first system here.
And again, we came to the wrong conclusion,
and pick the bad system
because we were ignoring selection bias.
So, that should hopefully. Yes?
>> Also, five is not as much different
than three as five is different than.
>> Yeah.
But, I mean, in this example,
if you kind of try to sum it
up then you would still see that this would outweigh,
the number would just outweigh the difference here.
But that's a valid point.
So, the key idea in our paper is to
solve this by connecting
it to the potential outcomes framework from
causal inference. In the potential outcomes framework,
you often think about patients getting treatments,
and for each patient you would only
get to observe the outcome
of the treatment which he or she got assigned to.
And the counterfactual question is then how
a patient would have fared under a different treatment,
different from the one that was prescribed.
And you can think in a similar way
about movie recommendation at least,
in the policy setting where, now,
users get prescribed movies and you want to find out
how users would have enjoyed
movies different from the ones
that they have potentially watched.
And in order to do that,
it turns out that we mainly need to understand
how people were assigned movies or treatments,
and this is also called the assignment mechanism.
To make this more precise in recommendation,
this actually boils down to knowing what the
marginal probability is for
a rating in our ratings matrix to be observed.
And this is also sometimes called the propensity.
And then, we can use
the inverse propensity estimator which
is well known, and I mean,
I think both of you are very familiar with it,
has been used in many other settings,
domain adaptation as well as reinforcement learning.
And the way it works here is so you would sum
over all individual losses and then
re-weight each individual loss by the inverse propensity.
And that gives you an unbiased estimate of the loss.
And this also extends to
other performance measures that you can write as
a sum of individual losses.
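As a sanity check, the IPS estimator is easy to simulate. The sketch below uses hypothetical synthetic data (not the talk's example): high-loss entries are observed more often, so the naive average over observed entries is biased, while re-weighting each observed loss by its inverse propensity recovers the full-matrix average in expectation.

```python
import numpy as np

def ips_estimate(loss, observed, propensity):
    """Unbiased IPS estimate of the average loss over the full
    ratings matrix, using only the observed entries.

    loss       : per-entry loss matrix (e.g. squared error)
    observed   : 0/1 matrix, 1 where a rating was observed
    propensity : marginal probability of each entry being observed
    """
    return np.sum(observed * loss / propensity) / loss.size

rng = np.random.default_rng(0)
loss = rng.uniform(0, 1, size=(200, 200))
# selection bias: high-loss cells are observed four times as often
prop = np.where(loss > 0.5, 0.8, 0.2)
obs = (rng.uniform(size=loss.shape) < prop).astype(float)

true_avg = loss.mean()                 # what we want to know (~0.5)
naive_avg = loss[obs == 1].mean()      # biased upward (~0.65)
ips_avg = ips_estimate(loss, obs, prop)  # close to true_avg
```

The naive estimate over-weights the over-represented cells, exactly the failure mode described for the crummy policy above.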
And on the right, you can see the propensity matrix that
I used to generate
the observed pattern you saw earlier.
So we had a high probability here, a small one here.
And if we just had used
these propensities along with
the IPS estimator back then,
the problems that I mentioned earlier,
coming from selection bias would have gone away.
So really, all we need is this propensity matrix here.
Now, how do we get this?
Well, there's actually two settings,
and in the first setting,
the experimental one,
we know the propensities because they
were under our control.
So we had an ad placement system,
we had something that
stochastically put things in
front of users, and we just recorded these probabilities.
The second setting is a little bit more intricate: in
the observational setting, users self-select.
And in that case, we need to estimate these propensities,
and that corresponds to inferring the parameters of
Bernoulli random variables, namely the ones
in the observation matrix.
Note that this observation matrix is fully observed.
So, it contains a one
whenever a rating was observed, and a zero otherwise.
And to do this estimation,
we can include side information such as
user item features, X if available.
And since this is a standard supervised learning task,
we can use a variety of models
such as logistic regression
or Bernoulli matrix factorization, et cetera.
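Since the observation matrix is fully observed, propensity estimation really is plain supervised learning. A sketch with scikit-learn, using made-up synthetic features in place of real user/item side information:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# hypothetical side information: one feature vector per (user, item) pair
rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 4))
true_logit = X @ np.array([1.0, -0.5, 0.3, 0.0]) - 1.0
# observation indicator: 1 if the rating was observed, 0 otherwise
observed = (rng.uniform(size=5000) < 1 / (1 + np.exp(-true_logit))).astype(int)

# fit P(observed = 1 | features); these predicted probabilities are
# the propensities that get plugged into the IPS estimator
model = LogisticRegression().fit(X, observed)
propensities = model.predict_proba(X)[:, 1]
```

In practice the features would come from the users and items themselves; the point is only that no missing-data machinery is needed for this step.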
So, now that we talked about how to fix evaluation,
we can apply the same ideas to
learning and our idea was simply to couple together ERM,
empirical risk minimization with
the inverse propensity scoring estimator.
So now we would pick the hypothesis,
the model that performs best under the IPS estimator,
instead of under the naive one that just
has an unweighted loss.
To make this more concrete,
the objective below is probably known to all of you.
This is a standard
mean squared error matrix factorization loss
with regularization terms,
and we know how to solve this, and scale this,
and the only thing that changes now when you use
the IPS estimator is that
you get this propensity weight here.
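A minimal sketch of that change, as gradient descent on an IPS-weighted squared-error matrix factorization. The hyperparameters are illustrative and the talk's actual implementation may differ; the only departure from standard MF is the `1/P` weight on each observed residual.

```python
import numpy as np

def ips_mf(R, O, P, k=2, lam=0.1, lr=0.01, epochs=200, seed=0):
    """Propensity-weighted matrix factorization (sketch).

    Minimizes sum over observed cells of (1/P_ui)*(R_ui - U_u.V_i)^2
    plus L2 regularization.  R: ratings, O: 0/1 observation matrix,
    P: propensity matrix.  Setting P to all ones recovers naive MF.
    """
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = 0.1 * rng.normal(size=(n_users, k))
    V = 0.1 * rng.normal(size=(n_items, k))
    for _ in range(epochs):
        E = O * (R - U @ V.T) / P          # propensity-weighted residuals
        U += lr * (E @ V - lam * U)        # gradient step on U
        V += lr * (E.T @ U - lam * V)      # gradient step on V
    return U, V
```

Everything that makes standard MF scalable (sparse updates, alternating least squares, and so on) carries over, since the weight is just a per-entry constant.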
So, really not that much but I think,
conceptually, this is a bigger step.
So, just to make this framework a little bit more clear,
it's very modular in that you first
pick and estimate a propensity model.
And then in a second step you would
use your ERM objective,
together with your estimated propensities.
And this is different from
a generative approach where you often kind of reason
about how the data is missing
and then you have a model that explains
how all that data comes about.
And those two are coupled together via latent
variables, and that makes
it a little bit more sophisticated,
but we'll also compare against them later.
So before I get to the empirical results,
I want to quickly talk about
some theoretical insights that we provide in the paper.
It turns out that there's
this additional tradeoff between bias and
variance that comes from the propensity estimation.
So, if you bound the true error with
the empirical error you get a
very familiar looking bound
except for the colored terms here.
So what you get is this bias term, a penalty
that tells you how far
the estimated propensities
are off from the true propensities.
And then there's also
this variance component that
comes due to the estimated propensities.
So, just to instantiate this:
a naive method would be high in
bias because the propensities are
mismatched, but low in
variance because they are constant.
But our method, with
perfectly estimated propensities,
would have no bias but a higher variance, usually.
And this also shows that it might be
beneficial in some scenarios to trade in
lower variance at the cost of
a small bias by overestimating the propensities.
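This tradeoff is easy to see in simulation. The sketch below uses synthetic numbers (not from the paper) and compares plain IPS against IPS with propensities clipped from below, i.e. deliberately overestimated where they are small:

```python
import numpy as np

rng = np.random.default_rng(2)
loss = rng.uniform(0, 1, size=500)
prop = rng.uniform(0.02, 0.3, size=500)   # some very small propensities
true_avg = loss.mean()

def ips(obs, p):
    """IPS estimate of the average loss from one observed sample."""
    return np.mean(obs * loss / p)

plain, clipped = [], []
for _ in range(2000):
    obs = (rng.uniform(size=500) < prop).astype(float)
    plain.append(ips(obs, prop))                       # unbiased, noisy
    clipped.append(ips(obs, np.maximum(prop, 0.1)))    # biased, smoother

# plain IPS: centered on true_avg but with large spread;
# clipped IPS: shifted (biased) but with visibly smaller spread
```

Clipping shrinks the huge `1/p` weights that dominate the variance, at the price of systematically underestimating the loss on the rarely observed cells.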
Okay. So, to evaluate whether
our method improves performance
we did an empirical study,
on two real world data sets.
The first data set was one that we collected ourselves.
People went shopping for coats,
and also rated them and then that second one is
from Yahoo where people listen to songs and rated them.
And both data sets were special in
that they contained a missing-completely-at-random
test set, where people were
actually more or less forced to rate
a randomly sampled set of items,
different from the ones that they
were browsing and rating themselves earlier.
And so what we did is we trained on
the missing-not-at-random training data and then
evaluated on the missing-completely-at-random test data.
And we compared against
the latest generative approach that we
can find from 2014.
And on both data sets,
the propensities were estimated
via logistic regression using
user and item features for the coat dataset, and
using naive Bayes for the song rating dataset.
So the results show that
our method outperforms both kind of
naive matrix factorization as
well as the more sophisticated generative model
on both losses and on both datasets.
So encouraging. And so,
by the end of this part I
hope that I was able to convince you that
the propensity approach is a nice one because it's modular,
it directly optimizes the target loss and
not just some log likelihood,
there are no latent variables, and
it's usually scalable in the same way
that the original problem was scalable. Yes?
>> So, assuming both these datasets
did not have known propensity scores.
>> Known propensities.
>> So, the actual presentation was not done
using randomization that you controlled.
>> Yeah. So both data sets
were using the observational setting.
>> Just curious if you did any experiment where you
tested how close you can come to the oracle,
if you
actually have a randomized experiment in order to.
>> We have some, we
have some synthetic experiments where we,
I can show these later I think,
they are at the end.
So, we had some synthetic experiments
where we tested how robust this is,
because you actually just care
about the final learning step,
how robust it is to
the estimates of the propensities,
and it turns out they are quite robust.
Okay. So, for the remainder of this part,
I want to talk about a follow up project,
where we apply very similar ideas
to debias Learning-to-Rank.
Just to get everyone back
into the Learning-to-Rank setting,
we work with a query x_i, usually.
Let's say it's winter shoes,
and then we have a ranking algorithm
in production,
and that outputs a ranking,
and then we collect click logs
from that ranking algorithm.
So, people click on B and D, and we would store that.
And then our task is to take all this data.
Have a learning algorithm
that hopefully outputs a better ranker.
The traditional way of doing Learning-to-Rank
is to hire judges that annotate results.
So, you would hire
people and they would go through this ranking
for the query shoes, and then would say,
well, result C is relevant,
F and G are relevant.
So, when you evaluate,
this is also called Full Information Learning-to-Rank.
And it's straightforward to evaluate the new ranker,
because for this query,
you just re-rank the results,
and then you can compute the loss function here.
Everything is known.
However, it's often more convenient and also cheaper to
work with implicit feedback from click logs,
and to still be able to learn, what you do is you assume
a certain user model; in the weakest case,
that a click indicates relevance.
So, here the user clicked on result C,
which means that we assume this result to be relevant.
The other ones have question marks,
and the problem is,
however, that we only have partial information,
and so when we want to evaluate a new ranker,
we can't compute our loss function directly,
because there is missing data and
it is missing not at random,
and. Was that a question?
>> Yeah. Should you assume A and B are not relevant?
>> Yeah, or at least not,
well, that is something you could do,
that some click models do.
But, in an even simpler case you could just say,
I assume only that a click means relevant.
That's one of the simplest click models.
Okay, so if we want to evaluate the new ranker,
we can't compute the loss function
here because there's so many question marks.
And it turns out however,
that we can compute it
in expectation in a similar way as before.
So, just to introduce a little bit of formal notation,
we're working with a loss function delta here.
Given the binary relevance labels,
it tells you how good a ranking is.
And in our paper,
we use the sum of ranks of the relevant documents.
So, if this was the ranking being presented,
then you would add up rank 3, rank 6,
and rank 7, which would give you a loss of 16.
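That loss is straightforward to compute; a small sketch with made-up document IDs:

```python
def sum_of_relevant_ranks(ranking, relevant):
    """Loss from the talk: the sum of the (1-based) ranks at which
    the relevant documents appear.  Lower is better, since a good
    ranking puts relevant documents near the top."""
    return sum(rank for rank, doc in enumerate(ranking, start=1)
               if doc in relevant)

# relevant docs C, F, G appear at ranks 3, 6, 7 -> loss 16
ranking = ["A", "B", "C", "D", "E", "F", "G"]
print(sum_of_relevant_ranks(ranking, {"C", "F", "G"}))  # 16
```

Moving the three relevant documents to the top of the list would give the minimum possible loss of 1 + 2 + 3 = 6.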
Our user model was again
the very simple assumption that a click means relevant,
but that kind of
poses the question: what does no click mean?
Well, we could either make
the assumptions that you made,
but in general we could reason about, well,
either a user did not observe a result,
or the result is not relevant.
It turns out that we
don't need to think about this so much,
because it's all solved by,
again, knowing
the assignment and observation mechanism.
And again, we need to
estimate propensities that indicate how
likely it is for a relevance label to be observed.
So, here we would say that the probability of
observing this relevance label here
is 0.5 and then again,
we can use the IPS estimator to
compute the loss of a new ranking.
But we only sum over the clicked results here,
because of our user model,
because of the assumption, that's
where the assumption comes in. Yes.
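Under this user model, the IPS estimate of the loss for a new ranking sums only over the clicked documents, each weighted by the inverse probability that its relevance label was observed. A sketch with hypothetical documents and propensities:

```python
def ips_rank_loss(ranking, clicked, propensity):
    """IPS estimate of the sum-of-ranks loss for a new ranking.

    Sums only over clicked (hence assumed-relevant) documents and
    re-weights each rank by the inverse probability that the
    document's relevance was observed in the logged ranking."""
    return sum(rank / propensity[doc]
               for rank, doc in enumerate(ranking, start=1)
               if doc in clicked)

# hypothetical example: one click on C, whose relevance label had a
# 0.5 chance of being observed; C sits at rank 3 in the new ranking
new_ranking = ["A", "B", "C", "D"]
print(ips_rank_loss(new_ranking, {"C"}, {"C": 0.5}))  # 6.0
```

In expectation over the observation process, this recovers the full-information loss, which is what makes learning from clicks alone possible.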
>> Do you assume that this probability depends on,
or is independent of, the location where you show it?
>> This- you can plot in
different kind of propensity model here,
what we did is that, it is dependent only on the rank.
So a position bias model.
>> It's like there's only position bias?
>> In this, but you could include,
you could think about other biases.
But we only, I mean,
that is the main part.
>> And in your data there is only one click per query.
>> No. It's multiple.
>> There's multiple, multiple.
>> Multiple clicks per query.
And then, and then.
>> [inaudible] binary now?
>> Yeah. It all kind of
simplifies to that setting.
And then, to get the unbiased risk estimate for
an entire ranker on the entire dataset,
you would just go and average over
all queries in your log.
And so we coded this up
as an SVM-Rank,
so there's a propensity-weighted SVM-Rank which optimizes
the loss you saw earlier on the previous slide,
and we compared it against a standard SVM-Rank,
which didn't use any weights,
and a hand-tuned production algorithm that was in place.
And we tested it on
an academic search engine, arXiv,
that you probably all know.
We estimated the propensities via
a small intervention experiment
where we swapped pairs of results,
and then evaluated whether our
propensity-weighted
SVM-Rank is better by interleaving
its results with those of another method,
and then counting how many clicks each method got.
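Under a pure position-bias model, a swap experiment like this lets you read off the propensities as ratios of click-through rates: the same pool of results is shown sometimes at rank 1 and sometimes at rank k, so any drop in CTR is attributable to position. A sketch with made-up counts (the talk's actual estimation procedure may differ in detail):

```python
def relative_propensities(click_counts, impression_counts):
    """Estimate position-bias propensities, relative to rank 1, from a
    swap experiment.

    click_counts[k] / impression_counts[k] is the click-through rate
    that swapped-in results received while displayed at rank k; under
    the position-bias model, the CTR ratio against rank 1 is p_k / p_1.
    """
    ctr = {k: click_counts[k] / impression_counts[k] for k in click_counts}
    return {k: ctr[k] / ctr[1] for k in ctr}

# hypothetical logged counts from a swap experiment
clicks = {1: 500, 2: 300, 5: 100}
imps = {1: 5000, 2: 5000, 5: 5000}
print(relative_propensities(clicks, imps))  # roughly {1: 1.0, 2: 0.6, 5: 0.2}
```

Only relative propensities are needed, since a constant factor on all the weights does not change which ranker the IPS objective prefers.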
And what you can see here is that our propensity-weighted
SVM-Rank significantly wins against
both the production and the naive method,
while just taking
into account position bias.
So, yeah, that kind of
concludes my first part.
But, I also want to briefly mention
a bunch of observations. Did you have a question?
>> Yes. Did you just assume that people
saw everything above when they click but nothing below?
>> No, that.
We didn't even do that.
We had a really simple Bernoulli model: each position was like
a coin flip, and that was kind of the propensity.
>> Yeah, but I'm curious how strong of a baseline is
that compared to a position-dependent one.
>> How strong of
a baseline that is compared to a position-dependent one.
>> It is position dependent,
because the coin is different at each position.
>> Yeah. Exactly. Yeah.
>> What he's asking about is your baseline.
>> The baseline. You just could have filled
in missing data as Dave was suggesting and
just assume everything above was a true zero.
>> So, I think we talk in that paper also about,
you can use more sophisticated click models,
there's all kinds of
processes that you can put in,
you can make it more sophisticated,
how you estimate these propensities.
But, I think this really just is, I mean,
it's general enough to incorporate
these more sophisticated propensity models.
We just wanted to show that even
by doing the simple adjustment,
you can do much better.
>> [inaudible].
>> Less sophisticated?
>> Yes. We're suggesting
that don't use propensities at all,
and just treat everything above as a zero.
>> Oh, the.
>> It is a better ranker on that paper [inaudible].
>> Okay. That's more like the kind
of pairwise-preference one, which is,
I don't,
yeah, I don't know why we didn't,
I think there were some experiments, I need to,
I need to look that up again.
Yeah, that's a good point. Thanks.
>> I'm also curious whether
in arXiv data you actually see multiple clicks.
I mean, in my own use of arXiv search,
very much I'm looking for one very specific paper.
I find it, I read it,
and I don't know, I think other users
do it in a different manner, but I'm curious.
>> This is, yeah. This is the full text search though.
This is different from the case where
you just want to find the specific one.
Notice that there is this full text search drop-down,
and then you land on the full text search.
And this one is usually more fuzzy,
like the full text search is not as precise, so
you would specify keywords,
and then you would get back a ranking,
and not so much a match in a boolean way.
So, yeah, I mean,
you would see people click multiple times because,
yeah, to find something they were interested in.
Just in general, exploring. All right.
All right, so I want to move on to
the second part where as an intermediate goal,
I want to, yeah,
I want to talk about how we can aim to design interfaces
that allow us to obtain more and better feedback data,
all while not hurting user satisfaction,
and that is because
better data will then allow
us to improve our predictions.
So, really I want to
focus on this outline connection here,
like users and interfaces,
like affecting the data that the Machine Learning gets.
In this first, yes,
so how can we get better and more feedback data?
In this first project that I want to talk about,
we came up with a new interface that
allowed us to collect a new type of feedback signal.
So, I know it's lunchtime soon,
so I want to run a little experiment with you.
So, I want you to stare at this for 10-15 seconds.
Figure out what you want to have for lunch.
All right, who already knows
what he or she wants for lunch?
Whoo. Wow, we have some, yeah, okay.
Well, not too many but some decisive people. I like it.
But in general, making a decision is hard here.
Why is that? Well, there's a large set of options.
You are probably not familiar with all of the inventory,
like what are the options out there?
And you're also often
uncertain about your own preferences.
Did you have pizza yesterday?
Like are you going to go out for Chinese tomorrow?
So, it takes a lot of thinking.
And our starting point here was really
to think about how we can
support users to drive the feedback generation.
And obviously, in the long term,
we want to provide better recommendations.
But for that, we need
the improved feedback data first.
But in the short-term, we can think about
how to reduce cognitive burden,
and I'll come back to that in a second.
So what you've just seen in this example of
picking something to eat is
an instance of what we call
session-based decision making,
where your goal is to choose one option,
and the information need is fixed.
So you wouldn't just do something else.
Examples for that are picking a movie for tonight,
searching for recipe, comparing laptops online,
planning a trip, booking hotel, and so on.
It turns out that a common strategy employed by people
in this session-based decision making
is called consideration set formation.
And basically, it works in two steps.
In the first step you would narrow
a large set of options down to
a smaller set called the consideration set,
and then so I would go through and I would
look at well, that looks good.
Oh, the quiche looks also good.
And then based on that consideration set,
you would then make a decision and so I
would now reason and think well,
I had pizza yesterday for dinner,
how about the quiche?
So decision done.
So this kind of strategy or this kind of insight was
the main inspiration for our interface.
The session-based decision making we studied was
obviously not on food choices but on movie choices.
So below is an interface that resembles that of
many online streaming providers.
So you can scroll, you can kind of filter,
and get more information by clicking on it.
And our idea was to augment that interface
with what we call a Shortlist Component
so users could click on a "Plus" button or
"Drag and Drop" buttons up there to keep track of them.
So this was kind of a list of items that they are
currently considering or are interested in.
And to find out whether or how that shapes kind
of the feedback data that we want,
we ran a user study where we compared
the interface with a Shortlist Component
against the one without one.
And the task that we
gave people on Amazon Mechanical Turk was,
imagine a very good friend of
yours is coming to your place to visit.
After hanging out for a while you
plan to watch a nice movie together.
In this experiment you'll be asked to
select a movie to watch with your friend.
And so, we had 60 people come in,
most of them PhD students,
three-quarters were male, and one-quarter female.
And we randomly assigned them to one of two flights.
In the first flight,
they started with the Shortlist first,
and in the second flight it was last.
And in each flight,
they had to choose a movie eight times.
So each of these corresponds to
a new session, and in
each session they had a fresh set of
1,000 movies to pick from.
And then we collected the feedback data and
also user feedback through surveys.
So, now we can look at,
do shortlists lead to more data?
That was our goal:
can we drive users somehow to give us more data?
like that it's the unique number of movies,
then without the Shortlist people click on
2.7 items, on average 2.8.
But with a Shortlist it's more than
twice the amount of movies.
So you get 5.7 kind of
positive interactions in each session on average.
>> By positive interaction,
you mean like click on details and.
>> Click on the details or shortlists it.
And notice that something
that was shortlisted didn't necessarily
have to be clicked on,
because it might have been a known movie, and
in that case you don't want to
look up more information,
but you would still shortlist
it because you want to keep track of it.
You're like, oh. that's a classic.
>> How is that 1,000 movies sorted?
What is the order when it's displayed to the user?
>> It was recent movies first,
and then by the number of stars.
So by year, and then within each year by the number of stars.
>> Number of stars is the public rating?
>> The IMDb score.
>> I see. Okay.
>> Yes, back there.
>> Two questions. First, in this interface, do
you categorize the movies into different genres?
>> Yeah, you can.
>> And how do you browse 1,000 movies?
>> You can browse by selecting the genre here.
>> And second is,
how is this different from
something that Amazon is doing like I
can add some movie to my watch later or my favorite list.
I don't mind now watching right now,
I just want [inaudible] to that section.
>> That's a good question.
So shortlists are different in that they are
only kind of temporary and really tied to the session.
It's not something persistent that you add to your
watch-later list.
This is really just for the decision making,
for this single task.
I mean, people hack this in all kinds of ways:
they open tabs,
they add stuff to the shopping cart just to
remember it, even
though they don't want to buy it.
But yeah, it is different. That's hopefully clear.
Okay. So we have more than twice the amount of
training data, but does that also help recommendation?
That's what we looked at next. Oh sorry.
>> How is the Shortlist constructed?
>> The Shortlist is something users actually,
users come up with so they add
something that they're interested
in during a session to it.
It's empty when they start and
then it's just like a stack of
books when you go to a store that you are considering.
>> Without the shortlist they don't have the ability to
make the shortlist, is that what you mean?
>> Sorry. >> With or without shortlist.
>> Without they don't have that ability.
They have to remember it themselves,
or write it down,
or do something else. Yes.
>> [inaudible] people can put in the shopping basket
without purchasing in the end, right?
Just put to shopping basket
and then the time to check out,
I decide which one, I want to buy for sure.
>> Right.
Yeah, that's what we call a hacked
way of shortlisting, and that was also
the inspiration to make this
an explicit interface component.
I mean sure you can find ways around it,
but our idea was really to study
how this influences when we give it to people explicitly.
>> Going beyond the movie category,
do you think this could be
useful in other domains or in practice?
>> Oh, I think having
the ability to shortlist stuff and remember
things you're currently considering
applies to many of these one-choice tasks,
where you want to figure out and compare things.
Like, choosing one thing among many, and
whenever you can support that I think
shortlists are a good interface component.
Whether or not it should have this form as
a visual component or something
else, maybe smaller or on the side,
that's I think up to the interface designer.
But I think it is important
to reason about how easy it is to add things to it
and that it really supports users in their task.
You don't want to impose too high a cost: if
adding something to the shopping basket
hides it again, then you need to make
an extra navigational effort
to go to the shopping basket.
So these are all things to consider. Yes?
>> So for the 5.71 figure in your data,
are you counting a one every time a
person adds something to the shortlist?
>> Yes, but I would count the unique items.
So if something was clicked on and shortlisted,
it would only count as one.
>> Okay.
>> Yeah. So it's not just
>> So it's a different comparison
of how many clicks you have in the session,
between the one with the shortlist and the one without.
>> Why would that be?
>> Because you're counting more clicks.
If I'm just adding things to the shortlist,
not clicking anything, in the other case
it would be the same as zero clicks,
and there would be no shortlisting.
But you're counting with the shortlist.
>> I'm counting with
>> With 5.71.
>> Yeah. I'm counting with the shortlist
because that was the whole goal,
to kind of collect a new type of feedback data.
>> Okay. So even if a person doesn't watch anything,
you will still have more data.
>> Yeah.
>> If it's a little bit less [inaudible].
>> I mean, in this case, they were forced in
the end to choose a movie to terminate the session.
So in the end, they had to choose something,
whether they used a shortlist or whether not. Yes.
>> I don't think you quite understood the last question.
It's a little bit of an apples and oranges comparison.
I could just drag
the movie into the shortlist or I could click on it,
look at it, and then move it into the shortlist.
And if I didn't have the shortlist,
there's no equivalent of the first action.
To understand the difference: I might
just move a movie into
the shortlist without looking at it.
Just imagine I had like two levels of interest.
One is like,"Oh, I don't immediately want to
reject that movie," and another is like,
"Oh, I'm actually seriously considering that."
And I might put both in the shortlist.
But if there's no shortlist category,
then I'm only going to
see the second level as a click.
>> Yeah, exactly.
That's the point of the shortlist,
I think, to elicit that type of feedback.
>> But it's not clear that it's doing anything, right?
You could just be buffering the [inaudible]
>> Yeah.
>> I think his next slide will.
>> So I think
the point is to give more data to the algorithm,
so you can do better in the future.
And if you do the tracking in your head,
maybe you accomplish your task one-off,
but there is less training data for the algorithms.
>> Okay, but you're interpreting
the different number of clicks as
meaningful and it's not [inaudible].
>> No, not yet.
>> Okay. >> Not yet.
>> That hopefully answers a bunch of the questions.
And, I'll also talk more about how
users feel about the shortlist.
So in the prediction task,
we took the movie that people finally chose
from each session and held it out,
and our goal was to rank it
to the top of a set of random movies.
The training data were
all displayed movies in the session.
We used a ranking SVM, and as feedback,
we said that items that were examined,
clicked on, or, if available,
shortlisted, should be ranked above skipped items.
We didn't discriminate between
the two types of feedback signal,
and then in the test data,
we embedded the chosen movie
in ninety-nine random movies,
and we measured the mean reciprocal rank,
which is one over the position where that movie occurred.
So 1 over 100 would be worst,
and 1 is the best.
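As a minimal sketch of this evaluation protocol (not the authors' actual code; all function and variable names here are illustrative), the held-out chosen movie can be embedded among 99 random distractors, ranked by a scoring function, and the mean reciprocal rank computed like this:

```python
import random

def reciprocal_rank(scores, chosen_id):
    """Rank candidates by descending score; return 1 / position of the chosen item."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return 1.0 / (ranked.index(chosen_id) + 1)

def evaluate_mrr(held_out_choices, score_fn, candidate_pool,
                 n_distractors=99, seed=0):
    """Mean reciprocal rank over sessions: each held-out chosen movie is
    embedded among n_distractors random movies and ranked by score_fn."""
    rng = random.Random(seed)
    rrs = []
    for chosen in held_out_choices:
        distractors = rng.sample(
            [m for m in candidate_pool if m != chosen], n_distractors)
        scores = {m: score_fn(m) for m in distractors + [chosen]}
        rrs.append(reciprocal_rank(scores, chosen))
    return sum(rrs) / len(rrs)
```

With 100 candidates per session, this yields exactly the range mentioned above: 1/100 if the chosen movie lands at the bottom, 1 if it is ranked first.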
So if you compare against the MRR of a random baseline,
and you learn from the sessions that had no shortlist,
you get a small improvement.
Well, that's not so great, you think.
But once you learn from sessions that had the shortlist,
and so use that kind of feedback data,
the new type of feedback signal that you collected,
you are actually able to increase
your MRR quite significantly.
So why are people willing to put up with this?
Why are they even using the shortlist?
In order to understand this better,
I want to break it down into three smaller questions.
So first, do users appreciate the shortlist interface?
Second, do shortlists increase choice satisfaction?
And third, how do users adapt their strategies?
And yes, users really prefer their shortlists.
We asked them to state a preference between
the two interfaces they interacted with,
and most of them either strongly prefer
or prefer the shortlist
over the non-shortlist interface.
Moreover, people use them in
over 93 percent of all sessions,
even though they could've just skipped the shortlist.
Are people more satisfied with their choices?
That's another question we looked at.
So we asked them,
"Which interface do you think gave you
more satisfaction in terms of your final choice?"
And again,
most people strongly prefer or
prefer the shortlist interface.
So that's another positive benefit of shortlists.
And then, lastly, how did users adapt their strategies?
This was more to understand how
adding a new interface component
also changes the way people behave,
and so what we did is, we asked them
to self-report the strategies that they used.
First good,
which was the most frequent one
when they didn't have a shortlist,
is picking the first good item,
which is sometimes called satisficing in psychology.
Track multiple is the consideration-set strategy.
Track one is keeping track of just the single
best option currently and choosing that in the end,
and other is something else.
Now, when you give them the shortlist,
most people switch to this strategy,
the optimizing one.
So with the shortlist, they really satisfice less,
and they optimize more.
So you can really influence also
by the interface how people interact.
And moreover, in their statements,
people reported that they had lower cognitive load;
it was easier to keep track of things with shortlists.
So, I hope
that I was able
to motivate that shortlists not only help
long term, because we were able to
provide improved recommendations,
but also short term, where we
reduced the cognitive burden on users
and were able to increase
user satisfaction. Is that a question?
>> So shortlist, people are very familiar with, right?
So there's a habit there that
goes into it because
a lot of people are familiar with shortlist.
But did you test this with a completely new UI mechanism?
Not a shortlist, something very different.
>> That's a very good question, and we haven't.
But I'm sure there's many other ways to support users.
I mean, this consideration-set formation
is just one theory of how people make decisions,
or one strategy that people
employ when it comes to making decisions.
But there are other strategies as well,
and I think a good UI should make it possible
to support a multitude of these strategies.
So thinking about, along these lines,
is very interesting. Yeah.
>> Did people take longer
to make their choice with the shortlist?
>> I think yes. We found a slight increase
in session length. Yes.
>> It could be good or bad [inaudible].
>> Could be good or bad, depending on what you're
optimizing for as a systems designer,
but that's why we also looked at perceived satisfaction.
Were they happier with what they did?
Because giving up might not make you happier,
even though you're more effective.
>> So you may
have talked about this already. For the algorithm,
the movies in the shortlist versus
the movies people clicked on
but didn't put on the shortlist,
those signals, do you weight them differently in the
algorithm, or did you find that the
weights influence the performance of the algorithm?
>> We didn't use weights,
but what we did is we ran another experiment.
>> Then shortlist could be two, [inaudible] could be one.
There are even weights to play with there, maybe [inaudible].
>> What we did in the paper
is that we also ran
an experiment where we said
that shortlisted items should be ranked above
items that were just clicked but not shortlisted.
So shortlisted-and-clicked is stronger than just clicked,
and that in turn is stronger than any of the skipped items.
But that didn't improve performance.
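A hedged sketch of how such pairwise training preferences might be generated (illustrative only; the paper's actual features and SVM training are not shown, and the item representation here is assumed):

```python
def pairwise_prefs(session, graded=False):
    """Generate (preferred, over) training pairs for a pairwise ranking
    learner from one session.

    Each item is a dict with 'id', 'clicked', and 'shortlisted' flags;
    displayed items with neither flag count as skipped.

    graded=False: any examined item (clicked or shortlisted) > skipped.
    graded=True:  additionally, shortlisted > merely clicked.
    """
    examined = [i for i in session if i["clicked"] or i["shortlisted"]]
    skipped = [i for i in session if not (i["clicked"] or i["shortlisted"])]
    pairs = [(e["id"], s["id"]) for e in examined for s in skipped]
    if graded:
        short = [i for i in examined if i["shortlisted"]]
        click_only = [i for i in examined if not i["shortlisted"]]
        pairs += [(a["id"], b["id"]) for a in short for b in click_only]
    return pairs
```

A ranking SVM would then be trained to score the first item of each pair above the second; as noted above, the graded variant did not improve performance in their experiments.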
>> Another note, I think,
maybe something interesting that you could experiment with, is
the number of spots available in the shortlist.
Because maybe without a challenge,
people are just thinking, just [inaudible].
But having this box, [inaudible],
people will have a mindset,
"Maybe I need to fill all of them.
The longer the list is, I'm going
to have more and [inaudible] more."
To see whether that can influence.
We give some goal,
and suddenly, people will be fulfilling the goal.
>> Like the big bowls
when you go and get frozen yogurt, and you're like,
"Oh, this is nothing,"
and then you pay $15 for it.
That's a good idea, it's really cool.
Just to move a little bit further,
I want to quickly talk about other ways of
collecting more and better feedback data.
One way was to come up with a new interface.
But another avenue is to keep the existing interface and
just work with the incentives,
changing certain things.
And that's what we explored in
this and the following two projects.
So here, we looked at how we can shape
the feedback data we get,
inspired by Information Foraging theory,
which explains how people
seek information in unknown environments.
I'm not going to go into too much detail about it,
but practically, what that meant is that we
examined how two things
influence users' implicit feedback.
The first was information access cost,
which we mapped to whether you
could click on an item for more information or hover.
Hovering, obviously, is
lower effort because the information just pops up.
Clicking requires you to open a pop-up and
then close the pop-up, so higher cost.
And then, information scent was the second axis,
where we varied the length
of the item descriptions.
So in the weak scent condition there was
nothing shown below the posters, just the posters.
And in the strong scent condition,
you have the number of stars,
the title, and the genres along with the year.
>> [inaudible]
>> What? Scent?
Because it's inspired by how
predators hunt for prey.
Information Foraging theory
really comes from these carnivores:
they smell something,
and then they follow a lead,
and if the scent is strong enough, they follow the cue.
So that's really how this was inspired.
I don't know, it's just the vocabulary they use.
So what we found,
by exploring these interventions, is that
feedback quality, as we measured
it in the paper, was not affected.
But feedback quantity is something
that we were able to shape.
And we were able to increase this significantly
by lowering the information access costs.
So, hovering really made
people more likely to provide feedback,
and we were able
to increase it moderately by weakening scent,
by showing them less information so they had to click.
This is also what you would expect,
but it's another kind of knob that allows you to
just tweak the feedback that you get.
However, notice that here,
in the user survey that we ran,
people preferred the strong scent.
That is because it shows more information upfront.
But from an ML perspective, you
might sometimes prefer showing
less information so people would also give
you the feedback you want,
even though they're just mildly interested.
So the optimal ML and HCI design points
do not have to coincide.
Even though, for the short list, that was the case.
In general, it doesn't have to be that way.
And then, this is
the final project that I want to talk about.
This is this year's WSDM paper,
where we looked at the cost of exploration
from the user's perspective.
Our goal was to
look at how we can collect feedback data from
exploration while maintaining user satisfaction.
Just to familiarize, I mean most of
you are very familiar with this,
but let's quickly talk about the exploration trade-off.
Let's say you have a user that likes Finding Nemo.
Based on that information,
you could exploit that
fully and recommend movies that are very
similar, in order to
maximize short-term user satisfaction.
Or you could try to explore and
cover all possible user interests, so you learn more
about the user in the long run, and
make a diverse set of recommendations.
I mean, there's obviously a trade-off here, and we studied
this trade-off from a user perspective.
And so what we did is, we
created what we call mix-in exploration,
where we mixed items from the exploration policy
into items from the exploitation policy,
which was just a content-based recommender.
And then, we looked at how people behaved,
as well as their satisfaction,
and the feedback signal.
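As an illustrative sketch (assumed, not the authors' implementation; names and the random-placement choice are mine), mixing a few exploration items into an exploitation slate might look like:

```python
import random

def mix_in_slate(exploit_items, explore_items, k, n_explore, rng=None):
    """Build a k-item recommendation slate: take the top k - n_explore
    exploitation recommendations, then insert n_explore exploration items
    at random positions so they aren't confined to the bottom."""
    rng = rng or random.Random()
    slate = list(exploit_items[:k - n_explore])
    for item in explore_items[:n_explore]:
        slate.insert(rng.randrange(len(slate) + 1), item)
    return slate
```

Varying `n_explore` is then the experimental knob: the finding below is that small values are nearly free in terms of user satisfaction, while large values are not.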
So what we found is that,
good news, limited exploration is cheap.
Meaning that if the amount
of exploration is not too large,
the impact on user satisfaction and on feedback
quality and quantity is minimal, not significant.
But once you move beyond a certain point,
you get a non-linear increase in cost.
So if you use some slots for exploration, that's fine.
If you go past a certain point,
you really hurt users.
>> Is this figure like a cartoon illustration or [inaudible]?
>> This is a cartoon illustration.
I should say at this point, if it wasn't clear.
Because it has good and bad
and I don't know, I haven't found it.
>> I thought you were just hiding the
Y axis because it's too complicated [inaudible].
>> Yeah, this is just a sketch.
But for more details,
I'm happy to talk to you.
But this is essentially what we found.
So I think this is also a step towards better
understanding how we can gather data. Yes?
>> But isn't it the case that
the learning performance might not be [inaudible] in
the amount of exploration,
and the only explorer [inaudible] suffer?
[inaudible] So is that sweet spot of not
hurting the users the same
as doing no exploration for learning?
>> But isn't user satisfaction
kind of the objective that you want
to optimize there?
>> Not for long. But I want to maximize long term [inaudible].
>> I'd rather say that this kind of gives you
an upper bound on how much,
at any given point, you should explore.
And then obviously,
the more you stay on
the exploitation side, the better it will be,
once you've acquired enough information.
All right, so that brings us close to the end.
Just want to quickly touch upon
some other work that I've done.
Mainly on how to evaluate systems better.
So I've worked on evaluation of
ranking functions, of word embeddings,
and then how to combine data from
multiple logging policies to evaluate a new policy.
And most importantly, I
want to thank my great collaborators,
all the Cornell professors I worked with,
my MSR mentors, and all the great Cornell students,
both Undergraduate and Graduate,
that I had a chance to work with.
And moving forward, I just want to give you
a brief outlook on what my vision is.
My vision is really to build
highly accurate and useful interactive systems
through holistic design.
And as I've argued,
that requires us to think about
how components of an interactive system work together.
And I've already laid out some examples of how to
do this, but there's plenty of
other exciting questions left.
On the algorithmic side,
you can think about how to learn from or
how to design online experiments
so that you can learn optimally from logged data
and take care of the biases there.
You can think about including other components,
which is a growing area:
fairness, transparency, dependability, and so on.
And then, from the user standpoint,
I like thinking about putting
users in control of their predictions.
Because the shortlist gives
them a way of expressing what they're interested in.
But you could think even further about
what a good mechanism is for users
to specify what they're interested in,
or to give feedback to the system in a more explicit way,
one that maybe also benefits them in the short term.
In general, I think
I've only explored a handful
of incentives for data generation,
but there should be a whole zoo of
possible patterns and mechanisms out there.
And to wrap everything up,
I hope I was able to show that there are
really multiple complementary pathways to
improving Machine Learning beyond the algorithm.
In the first part,
I talked about how to improve
ML in interactive systems
by understanding the biases in the data,
and then we used techniques from
counterfactual inference that allowed us to account for
these biases in learning,
to build more robust ML systems.
And in the second part,
we started by thinking about how people work
and how they interact,
and that allowed us to design interfaces so
that we can shape and collect new feedback signals.
That I think concludes my talk. Thanks.