This video is about how to divide the
data into a training set and a test
set. So for our first example, we are
going to consider the packages data and
now I've read in the data. Look at the
dimensions, make sure it looks good. Okay,
there are 730 observations and 16
characteristics in the dataset.
Let's just look at the first row. OK, so
we have a year, whether it's a weekend,
whether it's the holidays. It seems like
we're measuring in days. If I had the dates, I
would know that these are indeed
measured in order: the first day in the dataset
is January first. So since it's measured
in time order, I want to create two
datasets, a training set and a test set, with a test
set that is made up of the most recent
data. So it should be the latest thirty
percent of the data. So I'm gonna go
ahead and create two lists of row numbers. One
is train and one is test. The first
thing I need is to know how many
observations we have. So I'll just take
the length of one variable, and now I can
see that it's 730, which I already knew,
but now I have it stored in n. So the
training data is going to be the first
part of the data. It's going to go from 1
to 0.7 times n. And what does that mean? It's
going to count one, two, three, all the way
up to 0.7 times n. So let's run that
line of code to create the list train and see
what happens. Let's print train out:
train contains 1, 2, 3, 4, and so on,
all the way up to 511, which is seventy
percent of the 730 observations we have.
Now I want to put everything else into
the test set.
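The step just described might look like this in R. This is a minimal sketch: n is typed in by hand instead of being read from the real data, the name n_train is mine, and round() is an added safeguard (not something said in the video) in case 0.7 * n lands a hair below a whole number in floating point.

```r
# Number of observations. In the video this comes from the length of one
# variable of the data; here it is set by hand as a stand-in.
n <- 730

# Size of the training set: the first 70% of the rows.
# n_train and round() are assumptions added for this sketch.
n_train <- round(0.7 * n)

# Row numbers 1, 2, 3, ..., n_train
train <- 1:n_train

head(train)    # the first few row numbers
length(train)  # how many rows go into training
```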
So I'm gonna use a special command
called "setdiff". This is the difference
between two sets of numbers. The first
set of numbers is 1 to n, and that's all the
observations that I have available:
1, 2, 3, and so on, up to 730. But I want to
take out the second set, train, because I
already used that in one place. So if I
run this line and then print out what
test is, it's all the numbers from 512
to 730. So you can see that
train contains all the numbers from 1 to
511 and test contains 512 to 730.
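A small sketch of the setdiff step, again with n set by hand and round() added as my own safeguard. The checks at the end are not from the video; they only confirm that the two lists of row numbers do not overlap.

```r
n <- 730
train <- 1:round(0.7 * n)   # round() is an assumption added for this sketch

# Everything in 1:n that is not already in train
test <- setdiff(1:n, train)

range(train)                    # smallest and largest training row number
range(test)                     # smallest and largest test row number
length(intersect(train, test))  # number of row numbers shared by both lists
```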
So I've used every observation in
the dataset, but the later ones, the last
ones, are contained inside of test. So the
training data is going to be our
development data. We're going to use that to
create the model. It's going to be the space
we use in the workshop to build a
product. Then the test dataset we're
going to use after the model has been
created.
We're going to use it to see if the
model can stand up in the future. So we
have this data the model has never seen before,
which is going to test its performance.
So am I done?
The answer's no. I only have lists of
row numbers. Now I need to create two
datasets. I'll call one dtrain because
it's the training data created from our
dataset d, and I'll call the other dtest
because it's the testing data created from
our dataset d. The way that I'll do
that is to take all the rows of
d, the original data, such that they're
in the list train. So this is going to
pull every row that's contained within
train and put it in dtrain, our new
dataset. So it's going to take the first
row, the second row, the third row, the fourth row, and
so on, all the way to the 511th row, and put
them into a new dataset called dtrain.
And then
dtest is going to take every row in the
list test, the 512th row, the 513th row, all
the way up to the 730th row, and store
them in dtest. When I run these two
lines of code, I get our two new datasets. The
dimensions of dtrain are 511 rows
with 16 characteristics, and the
dimensions of dtest are 219 rows, the
remaining ones from 512 to 730. So these two
datasets don't contain any of the same
elements. It's like dividing a group of
students into some that are going to go
into one room and some that are going to
go into another. There's no overlap
between them.
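To make the row subsetting concrete, here is a hedged sketch on a stand-in data frame. The real data has 16 columns that are not listed in full here, so the two columns below (day, weekend) are placeholders invented for illustration, and round() is again my own addition.

```r
# Stand-in for the real data: 730 rows in time order.
# The column names and values are placeholders, not the actual dataset.
d <- data.frame(day = 1:730,
                weekend = rep(c(0, 0, 0, 0, 0, 1, 1), length.out = 730))

n <- length(d$day)
train <- 1:round(0.7 * n)
test <- setdiff(1:n, train)

# Row subsetting: keep the rows whose numbers are in train (or test)
# and all columns (nothing after the comma).
dtrain <- d[train, ]
dtest <- d[test, ]

dim(dtrain)
dim(dtest)

# Together the two pieces cover every row of d.
nrow(dtrain) + nrow(dtest) == nrow(d)
```

Because the rows are in time order, every row of dtest comes later than every row of dtrain, which is the point of splitting this way.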
Now what if I use a different dataset?
That was when things were in time order, so we
put the last bit into the test set. What if I
use the scores data instead? So I've got
students in this dataset, remember. Let's
look at what this dataset looks like.
I've got a student ID, hours studied, whether
they slept eight hours, their previous
grade, and their grade on the exam.
Students are not in time order. I'm going
to have to randomly select them,
according to the slides in the previous
video. So instead of taking the first
seventy percent and the last thirty percent,
I'm going to randomly select using the
function sample, but the process is going
to be the same. I'm going to create a
train and a test. So with sample, I want to
sample from 1 to n, so I want to choose from
that list of numbers at random. And how
many numbers do I want to choose? I want
to choose 0.7 times n numbers.
So the first argument here is the
set I should choose from:
that's 1:n. The second argument
here, 0.7 times n, is how many
numbers I want to choose. And then
replace=FALSE is the third argument.
What that means is that after I select
something, I don't want it to ever be
selected again. I'm putting it into a set.
That's it. This is the code that I would always
use if the data is organized by
subject and is not in time order. So
what's going to happen when I do that?
Let's just see. Now train looks like this.
Oops. What's my mistake? Why do I have so many
observations in here? 727. The scores dataset
only has a hundred students in it.
It's because n is still 730. I forgot
to reset n. So let's do that. n is the
length of the first variable, and the first
variable is student ID. When I run this,
now n is 100. So when I rerun the line
for train, it's going to run with n
equal to a hundred instead of the old
value. Now train is a list of randomly
selected students out of 100. So it's
going to take the 57th student, the 53rd
student, the 38th student, the 78th student, and so on, and put them all
into the training data. Now the test set
is created using the same command,
setdiff(1:n, train), because I'm still going to
take the numbers 1 to 100 and
remove everything that's already been
put into the training data. So when I see
what test looks like, it's just the remaining
students. The first student was never
randomly selected, so he's left over to
go into the test set. The second student was
never selected, so he's left over to go
into the test set.
So these are the 30 remaining students
after 70 students were chosen at random.
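The random version of the split can be sketched as follows. The data frame is a made-up stand-in for the scores data (the column names are placeholders), set.seed() is added only so the sketch gives the same draw each time, and round() is my own safeguard. Which students end up in train depends on the seed, so they will not match the ones named in the video.

```r
# Stand-in for the scores data: 100 students, not in time order.
# Column names and values are placeholders for illustration.
set.seed(1)  # added only so the sketch is reproducible
d <- data.frame(id = 1:100, hours = runif(100, 0, 10))

# Recompute n from the current data so a stale value (such as 730
# from an earlier dataset) is not reused by accident.
n <- length(d$id)

# Choose 70% of the row numbers at random, without replacement.
train <- sample(1:n, round(0.7 * n), replace = FALSE)

# The row numbers that were not chosen.
test <- setdiff(1:n, train)

length(train)
length(test)
anyDuplicated(train)  # 0 means no row number was picked twice
```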
So now we have two sets that do not
overlap. 70 of the students are in the training
data and 30 of the students are in the test
data, but I haven't created the datasets
yet. I still need to create them. So dtrain
and dtest, for the new dataset, are
d[train,]
and d[test,]. When I run these two
lines of code, I'll be creating two new
datasets.
Let's check the dimensions of dtrain. It
has 70 students with five characteristics
each. Let's look at the first row of
dtrain. It's the 57th student, student ID is 57, and
these are the characteristics of that student. Now
let's look at the dimensions of dtest.
30 students are in there, and each of their
five characteristics is with them. So we've
created two datasets: one for creating a
model in the workshop and one for testing the
model in a beta test.
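Putting the last steps together, a sketch of building and checking the two datasets might look like this. The five columns mirror the ones described for the scores data, but their names and values are invented here, and set.seed() and round() are my additions. The first row of dtrain is simply whichever student happened to be drawn first.

```r
# Stand-in for the scores data with five columns; all names and
# values are placeholders, not the real dataset.
set.seed(1)  # added only so the sketch is reproducible
d <- data.frame(id = 1:100,
                hours = runif(100, 0, 10),
                slept8 = sample(c(0, 1), 100, replace = TRUE),
                prev_grade = round(runif(100, 50, 100)),
                grade = round(runif(100, 50, 100)))

n <- nrow(d)
train <- sample(1:n, round(0.7 * n), replace = FALSE)
test <- setdiff(1:n, train)

# Pull the selected rows, keeping every column.
dtrain <- d[train, ]
dtest <- d[test, ]

dim(dtrain)   # rows, then columns
dim(dtest)
dtrain[1, ]   # first training row: the student who was drawn first
```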