STAT GR5206 Homework 2 [100 pts]
Due 8:00pm Monday, October 16th on Canvas
Your homework should be submitted on Canvas using RMarkdown. Please submit both a
knitted .pdf file and a raw .Rmd file. (If you are having trouble knitting to .pdf come
to office hours and we’ll try to sort it out, but for the homework, knit to .html and then
convert to .pdf before handing it in). We will not (and cannot) accept any other formats.
Please clearly label the questions in your responses and support your answers by textual
explanations and the code you use to produce the result. Note that you cannot answer the
questions by observing the data in the “Environment” section of RStudio or in Excel – you
must use coded commands.
Goals: regular expressions, character functions in R, and web scraping.
In this assignment, we’re going to scrape the 2017-2018 Brooklyn Nets Regular Season
Schedule (they’re a basketball team from Brooklyn that plays in the NBA). We will take the
regular season schedule from http://www.espn.com/nba/ and reassemble the game listings
in an R data frame for computational use.
To do this, perform the following tasks:
i. Use the readLines() command we studied in class to load the NetsSchedule.html file
into a character vector in R. Call the vector nets1718.
a. How many lines are in the NetsSchedule.html file?
b. What is the total number of characters in the file?
c. What is the maximum number of characters in a single line of the file?
ii. Open NetsSchedule.html as a webpage. This should happen if you simply click on
the file. You should see a table listing all the games scheduled for the 2017-2018 NBA
season. There are a total of 82 regular season games scheduled. Who are they playing
first, and when? Who are they playing last, and when?
iii. Now, open NetsSchedule.html using a text editor. To do this you may need to right-
click on the file and tell your computer to use a text editor to open the file. What
line in the file holds information about the first game of the regular season (date, time,
opponent)? What line provides the date, time, and opponent for the final game? It
may be helpful to use CTRL-F or COMMAND-F here and also work between the file in R
and in the text editor.
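
The counting steps in task (i) can be tried out on a small stand-in file first. The sketch below is only an illustration: the three lines written to the temporary file and the names toy, n_lines, n_chars, and max_chars are made up, and nothing here comes from NetsSchedule.html.

```r
# Write a tiny stand-in file; its contents are invented for illustration.
tmp <- tempfile(fileext = ".html")
writeLines(c("<html>", "<body>Hello</body>", "</html>"), tmp)

# Read the file into a character vector, one element per line.
toy <- readLines(tmp)

n_lines   <- length(toy)      # number of lines in the file
n_chars   <- sum(nchar(toy))  # total characters, not counting newlines
max_chars <- max(nchar(toy))  # length of the longest single line

print(c(lines = n_lines, total = n_chars, longest = max_chars))
```

Note that nchar() works element by element, so sum() and max() are what turn the per-line counts into a file-level answer.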
Using NetsSchedule.html we’d like to extract the following variables: the date, the game
time (ET), the opponent, and whether the game is home or away. Looking at the file in
the text editor, locate each of these variables. For the next part of the homework we use
regular expressions to extract this information.
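
As a reminder of how the pieces fit together, here is a minimal sketch of the grep(), regexpr(), and regmatches() pattern on made-up lines. The toy markup, the dates, and the assumed date format (three-letter weekday, comma, three-letter month, day) are all hypothetical; the real file may be laid out differently, so check it in your text editor.

```r
# Invented lines that loosely mimic a schedule table; not the real file.
toy <- c(
  "<tr><td>Mon, Jan 1</td><td>7:00 PM</td></tr>",
  "<div>navigation</div>",
  "<tr><td>Tue, Jan 2</td><td>7:30 PM</td></tr>"
)

# Assumed date format: e.g. "Mon, Jan 1" (an assumption, not the file's spec).
date_pattern <- "[A-Z][a-z]{2}, [A-Z][a-z]{2} [0-9]{1,2}"

# grep() returns the indices of the lines that contain a match.
game_lines <- grep(date_pattern, toy)

# regexpr() locates the first match in each line; regmatches() extracts it.
# Lines without a match are dropped from the result.
toy_dates <- regmatches(toy, regexpr(date_pattern, toy))

print(game_lines)
print(toy_dates)
```

The same two-step idea (locate with regexpr(), extract with regmatches()) applies to any of the variables; only the pattern changes.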
iv. Write a regular expression that will capture the date of the game. Then, using the
grep() function, find the lines in the file that correspond to the games. Make sure
that grep() finds 82 lines, and that the first and last locations grep() finds match the
first and last games you found in (ii).
v. Using the expression you wrote in (iv) along with the functions regexpr() and regmatches(),
extract the dates from the text file. Store this information in a vector called date
for use below. HINT: We did something like this in class.
vi. Use the same strategy as in (iv) and (v) to create a time vector that stores the time
of the game.
vii. We would now like to gather information about whether the game is home or away.
This information is indicated in the schedule by either an ‘@’ or a ‘vs’ in front of the
opponent. If the Nets are playing ‘@’ their opponent’s court, the game is away. If the
Nets are playing ‘vs’ the opponent, the game is at home.
Capture this information using a regular expression. You may want to use the HTML
code around these values to guide your search. Then extract this information and use
it to create a vector called home which takes the value 1 if the game is played at home
or 0 if it is away.
HINT: In my solution, I use the fact that in each line, the string <li class="game-status">
appears before this information. So my regular expression searches for that string
followed by ‘@’ or that string followed by ‘vs’. After I’ve extracted these strings, I use
gsub() to finally extract just the ‘@’ or the ‘vs’.
viii. Finally, we would like to find the opponent. Capture this information using a
regular expression as well, then extract the values and save them to a vector called
opponent. As before, you may want to use the HTML code around the names to guide
your search.
ix. Construct a data frame of the four variables in the following order: date, time,
opponent, home. Print rows 1 to 10 of the data frame. Does the data match the first 10
games as seen from the web browser?
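
To illustrate the approach described in the hint for (vii) and the assembly step in (ix), here is a rough sketch on two made-up lines. Everything except the game-status class name from the hint is an assumption: the team-name class, the opponents, the dates, and the times are invented placeholders, and the companion vectors are typed by hand rather than scraped.

```r
# Invented lines; only the "game-status" class name comes from the hint.
toy <- c(
  '<li class="game-status">@</li><li class="team-name">Team A</li>',
  '<li class="game-status">vs</li><li class="team-name">Team B</li>'
)

# Match the marker string followed by '@' or 'vs'.
status_pattern <- '<li class="game-status">(@|vs)'
status <- regmatches(toy, regexpr(status_pattern, toy))

# Strip the leading HTML so that only '@' or 'vs' remains.
status <- gsub('<li class="game-status">', "", status, fixed = TRUE)

# 1 for a home game ('vs'), 0 for an away game ('@').
home <- as.numeric(status == "vs")

# Hypothetical companion vectors, typed by hand to complete the example.
date     <- c("Mon, Jan 1", "Tue, Jan 2")
time     <- c("7:00 PM", "7:30 PM")
opponent <- c("Team A", "Team B")

# Assemble the columns in the requested order and show up to 10 rows.
schedule <- data.frame(date, time, opponent, home, stringsAsFactors = FALSE)
print(head(schedule, 10))
```

Passing fixed = TRUE to gsub() treats the marker as a literal string, which avoids having to escape any characters in the HTML.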