>> Thanks so much. It's such a pleasure to have
the opportunity to speak here today. So I'm a professor of
statistics at the University of Washington. And
what I wanna tell you about today, is how even though we
now live in an area, in an era, excuse me, of big data,
big science, team science, we need statisticians and
classical statistical training more than ever. And so
we've heard a lot how, there's a Twitter hashtag that we're
all supposed to be using today. So I thought that to
keep with that spirit, I would motivate my talk by a Twitter
post from a few weeks ago. So I asked, within the context of
interdisciplinary data science, what is the number
one thing that rigorous statistical training brings to
the table? And I got a lot of really interesting answers,
which basically organized along three themes. So
the first theme was that the reason that we need statistics
within the context of big data science is because statistics
allows us to evaluate whether the data that was collected is
sufficient to answer the question that was asked. And
a lot of people gave different variants of this answer.
So I thought this was a really interesting answer, and
the fact that it came up so
many times, maybe merits a little bit of discussion. So
really, the idea here is that one of the key concepts behind
all of statistics, is the notion of experimental design.
If you have a question that you want answered,
you don't just start rooting around in the dark trying to
answer it using data. Instead you should sit down and
thoughtfully design an experiment,
that'll allow you to answer that question. So if we could,
we'd carefully design all of our experiments, but
of course we live in the real world, and
often this is impossible for a number of reasons. And
one reason that it's not possible,
is because it might be unethical.
So a lot of my research, as you'll hear,
has to do with applications to biomedical research. So for
example, we can't randomize medical treatments to people.
It would be really great if we could get clear cut answers
about the effect of a medical treatment, by assigning half
of the people to treatment and half the people to control.
But of course we can't do that for
pretty obvious reasons. Another reason that we often
can't design experiments the way that we would want to,
from a statistical perspective is because it might just be
impossible. So
if we want to know the role of genetic makeup
on some disease, or on smoking risk, or on whatever it is.
If we could we would randomize people to genetic makeup, but
of course we can't. We also can't randomize half
of the people in our study to have one socioeconomic status,
and half of the people to have another. So
statistically, we know how to design experiments but
in the real world we often can't. So within the context
of the fact that the data that we get is typically not from
a well-designed experiment, we need to be really careful.
And one thing that we need to do, is we need to be able to
figure out which questions are actually answerable based on
our data. And there can be a gap between the data that we have,
and the questions that we wish we could answer on our data.
But understanding that gap and
making sure we only answer questions we can answer,
is really critical to drawing valid conclusions from our
data. And another thing that we can try to do,
is to carefully expand the set of questions that we
answer using the data that we have, using the ideas from
colleagues and friends from related fields. But
these are things that need to be done with care,
and they all require a deep understanding of statistics.
So the next theme in the answers that I received
is that statistics and a deep understanding of statistics,
gives us a rigorous framework for quantifying uncertainty.
So of course, we all learn about things like p-values and
confidence intervals in our basic statistics classes. So
in a simple setting,
we actually know how to quantify uncertainty and
this is part of our standard statistical toolkit. So
we know that in theory we should just get a data set,
and use our data to perform a single pre-specified analysis.
And if we've done that,
then we know exactly how to quantify uncertainty.
We have things like p-values and confidence intervals. But
once again we live in the real world, and
it's totally unrealistic that we're gonna use a data set
just to answer exactly one pre-specified question.
The fact is, our data are expensive,
our data are limited, and we're gonna try to get as much
as we can out of our data. But to make matters even worse
from a statistical perspective,
not only do we perform a lot of analyses on our data, but
we use the results of one analysis,
to determine what questions we wanna ask next. And
so we actually proceed in this iterative way, where we
don't know in advance all the questions we're gonna ask.
And we actually determine what question we wanna ask next,
based on the previous answer we got. And this becomes very
quickly a very tricky statistical setting,
where it's actually very hard to quantify the uncertainty
associated with our analysis.
It's hard to know how sure we can be of the results that
we've gotten, and
unfortunately this can lead to really serious problems.
We can refer to this as double dipping,
and when we fail to account for double dipping, and
we draw conclusions from our data, things can go really wrong.
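The double-dipping problem described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not anything shown in the talk: the data are pure noise, there are 20 candidate features (a hypothetical number), each observation has known variance 1, and a two-sided 5% z-test is used. The "honest" analysis tests one feature chosen in advance; the "double-dipped" analysis picks the most striking feature after looking at the data and then tests it on the same data.

```python
import random
import statistics


def z_statistic(group_a, group_b):
    """Two-sample z statistic for equal-sized groups with known variance 1."""
    n = len(group_a)
    diff = statistics.fmean(group_a) - statistics.fmean(group_b)
    return diff / (2.0 / n) ** 0.5


def simulate(trials=1000, features=20, n=10, seed=0):
    """Return (honest, double-dipped) false positive rates on pure noise."""
    rng = random.Random(seed)
    threshold = 1.96  # two-sided 5% cutoff for a standard normal statistic
    honest_hits = 0
    double_dip_hits = 0
    for _ in range(trials):
        # Pure noise: no feature truly differs between the two groups.
        z_values = []
        for _ in range(features):
            a = [rng.gauss(0, 1) for _ in range(n)]
            b = [rng.gauss(0, 1) for _ in range(n)]
            z_values.append(z_statistic(a, b))
        # Honest analysis: test one feature chosen before seeing the data.
        if abs(z_values[0]) > threshold:
            honest_hits += 1
        # Double dipping: use the data to pick the most striking feature,
        # then test that same feature on the same data.
        if max(abs(z) for z in z_values) > threshold:
            double_dip_hits += 1
    return honest_hits / trials, double_dip_hits / trials


honest_rate, double_dip_rate = simulate()
print(f"pre-specified test false positive rate: {honest_rate:.2f}")
print(f"double-dipped test false positive rate: {double_dip_rate:.2f}")
```

Under these assumptions the pre-specified test should be wrong about 5% of the time, while the double-dipped test reports a "finding" far more often, even though there is nothing to find.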
Things can go spectacularly wrong in a lot
of different areas, but one place I see it
a lot is within science. In the last few years,
people have been talking a lot about the idea that there is a
reproducibility crisis in science.
Where results that are published in a peer reviewed
article, might turn out not to hold up to further scrutiny,
if someone else collects, you know, a new data set to try and
answer the same question. So the chart here is from Nature
which is one of the premier scientific journals, and
they surveyed 1,600 researchers on whether or
not there's a reproducibility crisis in science.
And 90% of the researchers said,
yes there is a reproducibility crisis.
And a lot of this crisis really stems directly
from the fact that once again we're not looking at a data
set just one or two or three times to answer pre-specified
questions. Instead, we're looking again, and again, and
again, and we're using the answers to our previous
questions to inform our future questions. And
finally, the last answer that I, or the last set of answers
that I got, also had a common theme. And
really the answer that I found very interesting was that
someone said, the reason that we have models is not to fit
the data but to sharpen the questions. So
if we think about that for a second, the reason we need
statistical modelling is not even to understand the data,
but to better understand our questions. And
I think anybody who's tried to fit a statistical model,
understands what this means. So to, for a concrete example,
suppose that we are interested in a model for smoking and
asthma. And I tell you, you know, go out there collect
some data and try to come up with some model for
the association between smoking and asthma.
So okay, you might think about that and
you might come back to me and say,
you wanna fit a logistic regression model.
This is a very standard model in statistics. Here,
Y can be an asthma diagnosis and X is the number of packs
smoked per day. And so this model is saying that basically
the probability that someone has an asthma diagnosis
is just a function of the number of packs per day. And
basically if this coefficient beta is positive, then
the more the person smokes, the higher the probability of
an asthma diagnosis. And this seems like a pretty
reasonable model at first glance, but
sort of the more we think about it, the more we might
begin to question whether this is the right model. So
for example, this model only incorporates asthma and
number of packs smoked per day.
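The slide with the model is not part of this transcript, so here is a minimal sketch of the standard logistic regression form being described, with Y the asthma diagnosis and X the packs smoked per day. The intercept name `beta0`, the slope name `beta1`, and the numeric values are assumptions made for illustration; they are not estimates from any data.

```python
import math


def asthma_probability(packs_per_day, beta0, beta1):
    """P(Y = 1 | X = x) = 1 / (1 + exp(-(beta0 + beta1 * x)))."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * packs_per_day)))


# Hypothetical coefficients, chosen only to show the shape of the model.
beta0, beta1 = -2.0, 0.8

packs = (0, 1, 2, 3)
probabilities = [asthma_probability(x, beta0, beta1) for x in packs]
for x, p in zip(packs, probabilities):
    print(f"{x} packs/day -> probability of asthma diagnosis {p:.2f}")
```

With a positive slope the printed probabilities rise with each additional pack per day, which is the behaviour described for a positive coefficient.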
There might be a lot of other things that we think should be
included in the model like age, physical activity and
socioeconomic status. Because we might,
we might expect those also to be associated with smoking and
asthma. And that seems like something we should probably
incorporate in the model. Or maybe we want a different
model entirely that says that there's an underlying factor,
which maybe we don't have access to, like genetic
make-up, that causes both smoking risk and asthma risk.
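One possible way to write down the two alternatives just described, as a sketch only. Y is the asthma diagnosis and X is packs smoked per day, as in the talk; the symbols \beta_0, \beta_1, \gamma_j, \alpha_j, \delta_j and the unobserved factor G are notation introduced here for illustration, and the linear forms are assumptions rather than the speaker's stated models.

```latex
% Adjusted model: other variables enter alongside packs per day X
\operatorname{logit} P(Y = 1 \mid X, \text{age}, \text{activity}, \text{SES})
  = \beta_0 + \beta_1 X
  + \gamma_1\,\text{age} + \gamma_2\,\text{activity} + \gamma_3\,\text{SES}

% Latent-factor alternative: an unobserved G drives both smoking and asthma
\mathbb{E}[X \mid G] = \delta_0 + \delta_1 G, \qquad
\operatorname{logit} P(Y = 1 \mid G) = \alpha_0 + \alpha_1 G
```

In the second form X does not appear in the equation for Y at all, so any association between smoking and asthma would come entirely through G. Writing the models out makes it clear that these are different questions.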
So when we start trying to write out in mathematical
statistical terms the model that we're
interested in, we quickly realize that our initial
question was too vague. And really, the experience
of statistical modeling causes us to refine these questions,
in a way that we otherwise wouldn't be doing. So
now, I just wanna move on from this to say a few words about
my work, which involves statistical learning with
applications to biomedical research. And so
this is a really fun area to be involved in.
Because the field of biology has just been transformed in
the last 20 years with new technologies, that make it
possible to measure things that we previously thought
were completely unmeasurable. So one example of this has
to do with sequencing the human genome.
So of course the human genome was sequenced almost 20 years
ago now, and in the last 20 years, the technology has just
been completely transformed. So that now it's actually pretty
inexpensive to get a genome sequenced.
So people have hoped for the last 20 years
that when it came to pass that
human genome sequencing would eventually become this cheap,
we would be able to use your individual DNA sequence,
in order to inform our predictions about your health,
our understanding of your disease, our treatment for
you if you get sick and so on. And it turns out that a lot of
this really hasn't come to pass. Basically, the issue is
that we have a lot of data, but the statistical analysis
of this data is really hard. So the point is,
we can sequence your genome,
and it's actually very cheap compared to the cost of many
medical treatments out there. But the problem is we don't
know what to do with it. And to understand why it's hard to
know what to do with this data, I think it helps to just
take a step back to high school or college biology, and
just remember what we know about DNA. So you learned
long ago in your biology class that mom and
dad have DNA, and basically putting aside the details,
mom and dad's DNA combines to give baby's DNA.
But this process is not without error, and
actually a lot of errors are made in this process.
In particular in every generation, there is a whole
batch of new mutations that baby has that mom and dad do
not have, which are shown here as these purple stars.
And these mutations happen all the time. This is
just part of the process, you definitely have some mutations
in your DNA that your parents didn't have. Okay, so
everybody has these, and for
the most part these are probably pretty harmless. But
not always. Some of these de novo mutations,
de novo because they're new and mutations because they
are changes to your DNA. Some of these de novo mutations
probably aren't harmless. Some of them might increase risk of
disease, some of them actually might be completely lethal, so
that someone with this de novo mutation would never have been
born. So the question really is, if we sequence your DNA,
and we can sequence your mom and your dad's DNA, and
we find out which de novo mutations you
have, how can we make sense of those? How do we
know which of those are likely to be harmful and
potentially increasing your risk for disease, or
explaining a disease that you already have? And
which of those are probably not a big deal, and
we don't really need to worry about them. So
it turns out this is a really hard question to answer.
And the reason that it's hard is because there are 9 billion
of these possible de novo mutations. The reason for this
is because your DNA sequence is 3 billion base pairs long.
So an A can switch to a T, a G, or a C, and so
on, at each of these 3 billion positions. So 3 billion times
3 is 9 billion. And figuring out which ones of these
actually matter, versus being harmless is really hard.
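The count above can be written out as a short calculation. This sketch uses only the figures given in the talk and, like the talk, counts single-letter substitutions only; the variable names are made up for illustration.

```python
genome_length = 3_000_000_000   # base pairs, the figure given in the talk
alternatives_per_position = 3   # each letter can change to one of the other three

# Every position contributes one possible mutation per alternative letter.
possible_mutations = genome_length * alternatives_per_position
print(f"{possible_mutations:,} possible single-letter mutations")
```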
So the first thing you might think is, okay,
let's just collect more data. Let's just go out and
let's try sequence everyone, imagine money is no object and
we could just sequence everyone in the world. But
it turns out that's actually not gonna solve the problem.
We could literally collect DNA
sequencing data from everyone in the world, but that won't
actually tell us which of these mutations really matter.
The reason for that,
is because there's 7 billion people on earth, but
there's 9 billion possible de novo mutations.
So you literally might not see some of them. If you don't see
them, you won't know: did you just not see them
by chance? Or did you not see them because actually
the one that you didn't see is completely lethal, so someone
with it would never have been born? Or if there's a de novo
mutation that you do see a whole bunch of times okay,
maybe that sort of suggests that it's not that harmful,
but on the other hand maybe it's actually increasing risk
of disease in those people. But there are many possible
diseases, you're just not gonna have a lot of power to
do that type of association. So it's sort of a crazy
situation where there literally aren't enough people
alive in order to use data to answer this question directly.
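To make the "did you just not see them by chance?" point concrete, here is a rough back-of-the-envelope sketch. It relies on strong simplifying assumptions that are not from the talk: every possible mutation is harmless and equally likely, people are independent, and each person carries exactly one de novo mutation (`mutations_per_person` is a hypothetical parameter). Only the 7 billion and 9 billion figures come from the talk.

```python
import math

possible_mutations = 9_000_000_000  # figure from the talk
people = 7_000_000_000              # figure from the talk
mutations_per_person = 1            # simplifying assumption, not from the talk


def expected_unseen_fraction(n_possible, n_people, per_person):
    """Expected share of possible mutations that nobody carries,
    if each carried mutation is an independent uniform draw."""
    draws = n_people * per_person
    # (1 - 1/n_possible) ** draws, computed stably for a tiny 1/n_possible.
    return math.exp(draws * math.log1p(-1.0 / n_possible))


unseen = expected_unseen_fraction(possible_mutations, people, mutations_per_person)
print(f"share of harmless mutations never observed: {unseen:.0%}")
```

Under these assumptions a large share of perfectly harmless mutations would never be observed in anyone, purely by chance, so the absence of a mutation from the data cannot by itself be read as evidence that it is lethal.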
So we need statistical models to help fill in this gap. And
so a few years ago with some collaborators,
we took an approach where we collected a really large data
set containing 15 million mutations, that for
various reasons we believe are really terrible. Like these
are mutations that you do not wanna have in your DNA.
And then another 15 million mutations that we have reasons
to believe aren't actually that bad. And we collected all
the data that we could for these 30 million mutations, so
things like which chromosome this mutation is on.
Whether it was a DNA letter A that switched to a C,
etc. What is the composition of that region of DNA?
So is it rich in these particular letters
versus those particular letters?
Is that region of DNA very highly conserved,
or does it tend to be very different across
individuals and across species, and so on. So
we collected all the data that we possibly could, and
we actually just fit a pretty simple statistical classifier,
in order to try to classify mutations into whether or
not they're harmless or
harmful. And on the basis of this classifier,
we could actually compute our predictions for the effects of
all 9 billion possible de novo mutations. And again,
we're never, there's no way we could collect enough data to
actually see all 9 billion of these mutations, but
we nonetheless can get predictions for
all 9 billion of them. And we published this in a paper. And
what this is, is a resource that really makes it possible
now for anyone anywhere, a researcher, a doctor, whoever,
who has de novo mutations that they wanna investigate,
to just go to our website, type in the position and
the chromosome number and also the type of mutation, so for
example if an A mutated to a T and so
on, and get our prediction for whether or not that
mutation is probably harmless, or whether or not it's
something that's potentially very devastating that
should be investigated further. So
this is just an example of how even though we have a lot of
data, data alone isn't gonna solve our problem. And we're
gonna need to fill in the gaps through statistical modeling,
in order to be able to not only make use of the data we
have, but also answer scientific questions
that can't be answered by way of data alone. So
that's all that I have to say, and thanks so much for
your attention. >> [APPLAUSE]
>> Wonderful.
Okay, we have time for your, for a couple of questions,
so please fire away. Let me,
you have, you have a- >> So
the thing is that you don't just get mutations at the time
of birth. They also happen over a lifespan. So how do you
think you can prove it? >> Yeah, that's a really great
question. So, like, the question was that
there aren't just mutations that happen at birth;
mutations can actually occur over the course of a lifetime.
And that's definitely true. So de novo mutations,
I'm mentioning those now as a motivation for
something that we might wanna study,
that we'll never have enough data on. You're absolutely
right that mutations don't just happen at birth.
For example in certain types of cancer,
we expect to see a lot of mutations accruing, and things
like that. But these scores that we've developed for
predictions of the effects of all 9 billion possible
variants, can still be useful in that setting. Because even
if a mutation occurs later, we're still gonna wanna know,
is that a mutation that's probably harmful, or
is it probably a pretty harmless mutation? We're gonna
take a question from Facebook or Twitter, yep.
>> Yeah, we've got a question
from Facebook here. So there are a lot of privacy rights
issues when you're working with DNA data, whether it's
personally identifiable. Can you comment on how you ensure,
that people's privacy is respected while you work with
the data? >> Yeah, that's a really good
question. So within the scope of this project, we are not
working with individual patient's genomic data. But
the idea that an individual person's genomic data
has privacy issues associated with it, is really serious and
it's a very serious ethical issue. So for example,
one way that this comes up is I could be like,
okay like here's my genetic data,
I'm just gonna like post it to the internet. It's my data and
I can just make it public, and on the one hand,
it seems like I should have a right to do that. But
on the other hand, if I have some very serious hereditary
condition in my DNA, then I've just revealed information not
only about my parents but also my siblings and potentially,
my second cousins. And so
I've really revealed information about them as well,
which I didn't necessarily have their permission to do. So
issues associated with privacy of DNA are very serious. And
for this reason, human genetic data is not usually like put
on the Web for public access. >> Well, thanks so much,
Daniela, for your wonderful talk.
>> [APPLAUSE]