Chapter 1 Introduction: Getting Familiar with R and RStudio
1.1 What is R and RStudio?
R is a GNU project and may be thought of as an implementation of the S language (developed at Bell Laboratories by Rick Becker, John Chambers and Allan Wilks). It is a language and environment for statistical computing and graphics. R contains a large number of built in function for classical and modern statistical analysis. One of the main advantages of using this program, particularly for undergraduate students, is that it is free! A core team of statisticians and countless other contributors consistently maintain, update, and improve R and make versions that run well under the most operating systems. The web page for the R Project for statistical computing is located at http://www.r-project.org".
We will never use R directly, rather we will use RStudio. RStudio is an integrated development environment (IDE) that allows you to interact with R more readily. RStudio is also free to use. For our purpose, we will be using RStudio as a graphical user interface for R. Think of R as the engine for a care, whereas RStudio is the dashboard, wheel, etc. that allows one to control the engine (although it does so much more such as create slides, books, web applications, and other things).
1.2 Downloading and installing R
The CSUB virtual computer lab will have R and RStudio installed. If you prefer to install these on your personal computer computer, follow these steps:
Visit the site https://cran.r-project.org/.
Click the link that corresponds to your appropriate operating system. (see Figure 1.1). The directions given are for Windows but the process is fairly similar for Mac or Linux. Chromebook users will have to use the virtual computer lab to access R and RStudio unless you are proficient in using linux.
Select base and download the latest release (the link will state “Download R ‘some version’ for Windows”)
Once the program has downloaded, install R with the default settings.
That’s it!
1.3 Downloading and installing RStudio
Follow the steps:
Visit the site http://www.rstudio.com/products/rstudio/download/. Scroll down the page until you see the column header “RStudio Desktop” (see Figure 1.2).
Click “DOWNLOAD.” Then select the appropriate version depending on your operating system (Figure 1.3).
Once the program has downloaded, install RStudio with the default settings.
That’s it! You can now run Rstudio from your home computer.
1.4 The layout of RStudio
Open RStudio. The first time you open RStudio, you will see three panes (Figure 1.4). A forth pane is hidden by default, but can be opened by clicking the File drop-down menu, then New File, and then R Script (File>New File>R Script). RStudio should look something like what you see below but perhaps with a different color scheme. Please see me during students hours or before/after lecture if you wish to change the color scheme.
A brief description of each pane:
The upper left pane: The R Script text editor or source. This is where commands are typed that will be executed in the script editor. Writing and saving a series of R commands in the script file makes it easy to reuse or modify code at a later time.
Lower left pane: the console pane. Every time you launch RStudio, it will have the same text at the top of the console telling you the version of R that you’re running. Below that information is the R prompt (the symbol
>
). As its name suggests, this prompt is really a request, a request for a command. It is here where RStudio will tell R what to do. This is the most important pane because this is where R actually does stuff and provides output. That is, it is where commands are entered and executed with output printed.Upper right pane contains your environment/history pane.
- In the environment tab you can see which data and values R has in its memory. R’s memory is called an environment.
- The history tab shows what has been typed before.
- Don’t worry about the rest of the tabs.
Bottom right pane is the files/plots/packages/help pane.
- In the Files tab you can browse and select files to open.
- The Plots tab will show any plots that you create.
- The Packages tab shows a list of installed R packages (more on R packages later).
- The Help tab is the output location of any help files called (more on help files later).
1.5 Expressions and Assignments
Commands in R are expression or assignments. At the command prompt >
in the console pane, do the following:
- Type
25-5
and hit enter. (this an example of an expression). - Type
h=25-5
and hit enter (this is an example of an assignment). This can be read as the difference between 25 and 5 is assigned to the object h”. Next typeh
and hit enter. - Type
H<-20
and hit enter. Next typeH
and hit enter.
The result should be as follows
25 - 5
> [1] 20
= 25 - 5
H
H> [1] 20
<- 20
h # The pound sign is used for comments
h > [1] 20
Notice the following from the commands that were executed above:
h
andH
are not the same thing (they are clearly distinct), so R is case sensitive.- You can also see these new objects (
H
andh
) are in your environment tab on the upper right pane. These objects are generally called R objects. - You will get into the habit of saving things in R objects so that we can access them at later time. As long as your RStudio session remains open, RStudio keeps any R objects that you defined in it’s memory.
- Once you end your session, RStudio loses what was in its memory. Next time you start an RStudio session, no R objects will be in its memory. Is this a problem? Later you will see that this is not a problem!
- The assignment operator is
=
or<-
. Either one is acceptable, although I have a preference for<-
so most of my handout/notes will reflect this. - All R object names must start with a letter (uppercase or lowercase). But the rest can be made up of letters, numbers or periods (with no spaces in between).
- The
#
character is used to indicate a comment. Anything to the right of a#
is ignored by R.
Back to the R objects H
and h
. The objects were created and are now stored in R’s memory So if you type H
and hit enter again, then 20 will come up. RStudio can be closed by selecting the \(File>Quit Session\) or you can just enter the command q()
in the console enter. When exiting RStudio you will be asked “Save workspace image to ….” If you select “Don’t Save,” then will you lose everything that is in your memory. For example, if you were to close without saving and then open RStudio again and type H
you will get an error as follows. Saving the workspace is not required nor needed.
1.6 Entering and running R code in RStudio
The easiest way to enter code or commands into RStudio is to type a command in the console pane and press enter. But this method is only efficient for short and simple analysis. The recommended way is to use RStudio’s text editor. Use the built in text editor as follows:
Open an R script file (File>New file>R script).
Type in your R code or commands into this editor. For example, type
h <- 20
.Save the R script file as an .R file. For example, save your file as FirstRsession.R.
You may then execute the command
h <- 20
directly from the text editor without copying and pasting the command. This is done by highlighting the command, and then selecting the “Run” button (figure 5). You should then see the command executed in the console pane.
Important: Always save your R script file periodically so that you may re-run the analysis when necessary (this applies to any assignments or exams).
1.7 A fancy calculator
At its most basic level, R can be viewed as a fancy calculator. While we won’t be using RStudio to do algebra problems, it is a good way to get familiar with RStudio. The basic operations are +
(add), -
(subtract), *
(multiply), /
(divide), and powers with the ^
operator. R follows the order of operations. Type the following on the new line in the editor and select “Run”:
727/163
and you should get the output [1]
4.4601227.
Upon selecting “Run,” the result of the above appears in the console pane, preceded by the command you executed, and prefixed by the number 1 in square brackets [1]
. The [1]
indicates that this is the first (and in this case only) result from the command. Many commands will return multiple values. Try the following one by one, where each is typed on a new line and then selecting “Run” after typing the command in the editor:
25*10
5/2
2 + 2
After running these commands, you should note the following about the R prompt, >
:
- The
>
prompt means that R is content and ready for a new command or input. - Please note that spacing is not an issue with R. For example,
5 / 2
is the same as5/2
. However, using spaces makes the code easier to read and catch mistakes.
> H
Error: object 'H' not found
Don’t worry about saving your workspace as we generally will not need to do this in the course.
1.8 Working with vectors
The function c( )
function combines or concatenates terms together into a vector. Suppose we wish to store the values 1, 0, 2, 0, and 3 in a vector and store this result in an object called X
. To do so, run the following:
<- c( 1 , 0 , 2 , 0 , 3 )
x x
## [1] 1 0 2 0 3
What is the mean of the data stored in x
? We can do this by summing up the values and dividing by 5:
<- 1 + 0 + 2 + 0 + 3
SumOfx /5 SumOfx
## [1] 1.2
We could also use the sum( )
function which will add all the elements in a numerical vector:
sum(x)
## [1] 6
sum(x)/5
## [1] 1.2
Note the above in summation notation is \[ \frac{\sum_{i=1}^5 x_i}{5} \]
Suppose we wanted to compute the sum below:
\[
\sum_{i=1}^5 x_i^2
\]
We can do this in steps before we use the sum( )
function:
^2 x
## [1] 1 0 4 0 9
<- x^2 # assign the squared elements to 'xsq'
xsq
sum( xsq )
## [1] 14
What if we wanted to compute the sum below? \[ \frac{\sum_{i=1}^5 (x_i - \bar{x})^2}{4} \] Like before, we do this in steps:
<- x - sum(x)/5
diffofxANDxbar diffofxANDxbar
## [1] -0.2 -1.2 0.8 -1.2 1.8
<- diffofxANDxbar^2
sqdiffofxANDxbar sqdiffofxANDxbar
## [1] 0.04 1.44 0.64 1.44 3.24
sum( sqdiffofxANDxbar ) / 4
## [1] 1.7
1.9 Data types in R
There are six basic data types in R:
- Numeric
- Integer
- Complex
- Character/string
- Factor
- Logical
We will only deal with numeric, character, factor and logical data types. Numeric data consists of decimal values. For example,
<- 10.355
j j
## [1] 10.355
In R, numeric is the default type for numbers. It stores all numbers as floating-point numbers (numbers with decimals). Character data are character strings (a string of one or more characters). Character data are created by putting the string in quotations. For example,
<- c( "hi", "hello" )
k k
## [1] "hi" "hello"
<- "3.4403" # a string since we put it in quotations!
L # this is not numeric L
## [1] "3.4403"
Logical data can have one of two values: TRUE
or FALSE
. Logical are generally created when there is a comparison between variables.
<- c( TRUE, TRUE, FALSE ) # no quotation marks!
p # this is not character data but rather logical values p
## [1] TRUE TRUE FALSE
Factor data contains a set of numeric codes with character-valued levels. A simply way to create a factor variable is to first define it as character variable and then covert it to a factor variable by applying as.factor( )
to the vector:
<- c( "married", "married", "divorced", "single", "single", "widowed", "married" )
MaritalStatus MaritalStatus
## [1] "married" "married" "divorced" "single" "single" "widowed" "married"
### convert character variable to factor variable:
<- as.factor( MaritalStatus )
MaritalStatus MaritalStatus
## [1] married married divorced single single widowed married
## Levels: divorced married single widowed
Note that data within a vector can only be one kind of data type (i.e., you can’t mix numeric values with character or logical values). Further, you are not expected to be able to create factor variables (or other data types), but rather you are expected to only recognize and distinguish between these data types.
1.11 Using R functions
R has many built-in function. For example, the c( )
function combines, or concatenates terms together
into a vector.
Suppose we wish to assign the values
1 0 2 0 3 1 0 1 2 0
to a vector denoted by `X. In your R script, run the code given below:
X <- c(1, 0, 2, 0, 3, 1, 0, 1, 2, 0)
X
You should have obtained output as follows:
<- c(1, 0, 2, 0, 3, 1, 0, 1, 2, 0)
X
X## [1] 1 0 2 0 3 1 0 1 2 0
Note that this is executed code since we see the command prompt >
before the code (Note: the #
show before >
would not appear in your R console. This is placed in these notes to better distinguish between R code and output. If you wish to bring up X
once more you do not have to retype X
. Instead, use the level keys \(\uparrow\) and \(\downarrow\) to scroll up and
down past commands issued in your current R work session.
Another example of an R built-in function is the sum( )
. This function computes the sumn of all the numbers in a vector. The syntax consist of the function name followed by parentheses to contain the argument(s). The sum( )
function only requires one argument, the data vector. We will make use of R’s many built in functions along the way. Examples of other functions are shown below:
### computing the mean of the numbers stored in X
sum(X) # sample mean
> [1] 10
### compute the length of the vector
length(X)
> [1] 10
# more on functions later.
1.12 Loading a package and installing new packages
All functions in R come from libraries/packages. There are many packages that can be installed to expand the ability of R. By default, R comes with several packages installed (some loaded automatically when you open RStudio to be used and some not). For example, a function called mean()
is available from the base
package, but this packages is automatically loaded when RStudio is opened so we will always have access to this function without loading an packages.
As for those packages that aren’t loaded automatically, we first have to load the packages to use its “tools.” For example,the MASS
library come with R but is not automatically loaded. This package provides a function called boxcox()
(for box-cox transformations). To use boxcox()
, we first have to load this package by running the command require(MASS)
or library(MASS)
.
Remember to think as packages as toolboxes. We will install additional packages called openintro
and mosaic
that will provide useful data sets and functions (“tools”). These packages are specifically designed to make R more accessible.
To install these packages, in the R console type the following command:
install.packages( "openintro" ) # no spaces within quotation marks!
Hit enter, and you will see activity occurring in the R console. This may take a few minutes to complete. Note the following during the installation process:
- If R asks you to choose a CRAN mirror for downloading, select any USA mirror.
- If R asks to use your personal library, answer yes.
- If R prompts you to update any packages, please agree to the updates.
When it is done installing, it should say something like “The downloaded binary packages are in…” right above the R prompt >
. This means the packages installed correctly.
Now install the mosaic
package. In the R console type and run the following command:
install.packages( "mosaic" ) # no spaces within quotation marks!
Again, this may take a few minutes to complete. Note that these packages only have to be installed once on your computer. However, every time a new RStudio session is started, the package will have to be loaded before datasets and/or functions from the package can be used. If a different computer is being used, these packages will have to bed installed again. For now, only these additional packages are required.
1.10 Comments in R
Everything that you type after a
#
sign is assumed to be a comment and is ignored by R. Adding comments to your R script is useful because it will help you recall what your commands or lines of code will do. In the editor, type the following and select “Run”:Note that
y
is now in RStudio’s memory, and we can access its value by typingy
on a new line and selecting “Run.” Now, type the following in your editor on a new line and then select “Run”:Note that this command is ignored because of
#
. Get into the habit of using comments!