Introduction to R
R is a programming language for analytical computing and presentation. It is a GNU project which is similar to the S language and environment which was developed by John Chambers and colleagues at Bell Laboratories (formerly AT&T, now Nokia Bell Labs). R can be considered as a different implementation of S.
"R is an official part of the Free Software Foundation’s GNU project, and the R Foundation has similar goals to other open source software foundations like the Apache Foundation or the GNOME Foundation."
One of R’s strengths is its ease of use combined with great expressive power.
R runs on a wide variety of UNIX/Linux platforms, Windows, and macOS.
R Installation
In order to install R on a Mac or PC, first download the latest setup program from the R Project for Statistical Computing page. Look for and follow the download R link.
R-Project
On the next page, in the 0-Cloud section, select the first link.
R-Mirrors
Depending on your OS, select (download) the appropriate R setup options. The remaining part of this instruction will use the Windows installer.
R-Setup Download Options
Most likely, you will need to follow the "install R for the first time" link. Click here if you want to upgrade your existing R installation.
R-Setup First Time
Finally, click the Download R … link (this option is for the Windows OS).
R-Setup For Win
On the instructor's system, the setup program was downloaded to folder C:/_resource/software/R. Notice that R uses forward slash (/) as a directory separator also on Windows.
Another popular destination is the Windows Downloads folder. Actually, with Google Chrome, we can go directly to this folder by following the download path as shown below.
Go to Downloads folder.
Next, launch (double-click) the setup program.
Run R Setup
The installation process is simple. I would only suggest changing the installation's target folder. When you get to the folder setup form, you may wish to change the installation folder to a simpler target, for example to C:\dev\R36 (no need to make any changes on Mac). In this instruction the installation folder of R (here C:/dev/R36) will be referred to as %rhome%.
R Home
From this point on, just click the Next buttons (except, perhaps, for adding an R shortcut to the Quick Launch bar) and the final Finish button.
R Shortcut + Finish
Additional Installation Procedures
In addition to the powerful functionality provided by the basic R installation, many users and organizations have developed convenient libraries that can be incorporated into our R environment.
For example, you may recall the data structure for ANOVA that you used in Excel. In R, the same data set that is good for Excel must be reorganized (reshaped). Library "reshape2" provides an easy-to-use function called melt(). This function becomes available after the following library has been installed (and then loaded with library(reshape2)):
install.packages("reshape2")
This installation uses the default library location that was created by the R installer. On your instructor's computer, this location (folder) is: %rhome%/library (here C:\dev\R36\library).
If you also installed R within the Anaconda distribution, this location will be different. On the instructor's computer, Anaconda is installed in directory C:/dev/anaconda. Let's refer to it as %ANACONDA_HOME%. Within this installation, the R library's folder is:
%ANACONDA_HOME%/Lib/R/library (here C:/dev/anaconda/Lib/R/library).
The following example shows how to install two libraries ("pckg1" and "pckg2") at this location:
install.packages(c("pckg1", "pckg2"), lib="C:/dev/anaconda/Lib/R/library")
Start R (R GUI)
Find R in the Start list and launch it from there, or use the Quick Launch pop-up to locate and launch R. These shortcuts lead to program R GUI (Rgui.exe) that is located in the bin/x64 subfolder of the R Home folder (here C:/dev/R36). This is a Graphical User Interface version of R. By the way, this instruction assumes that you are using a 64-bit operating system. It is also possible to run R GUI in its 32-bit mode (from folder bin/i386).
Run R
If you prefer to have a shortcut to R GUI on your desktop, open folder C:/dev/R36/bin/x64, copy file Rgui.exe, and paste it as a shortcut on the desktop. If you installed R in a different folder, go to that folder, find subfolder bin/x64 and program Rgui.exe there. Whichever method you use, R will open its GUI (Graphical User Interface) window with R Console filled with messages and ready for work:
New R Console
You may wish to press Ctrl+L to clear the messages and write a few lines of code, for example:
a = 3
b = -2
c = -8
Suppose that the three variables hold the values of the parameters of a quadratic equation (ax^2 + bx + c = 0). One way to solve this equation is to first calculate the discriminant of the equation:
discr = b^2 -4*a*c
discr
[1] 100
Since the value of the discriminant is positive, the equation has two different roots:
sd = sqrt(discr)
twoa = 2 * a
x1 = (-b - sd) / twoa
x2 = (-b + sd) / twoa
x1
[1] -1.333333
x2
[1] 2
The roots (x1, x2) of this equation are (-1.333333, 2). Enter q() to quit R GUI. Since you can reuse the above code to run this application again, you may wish to press No when asked whether to save the workspace.
R GUI Example
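The steps above can be collected into a reusable function. The following is only a sketch: the function name solve_quadratic and its parameter names (qa, qb, qc) are hypothetical, and it assumes that the leading coefficient is not zero. It also handles a negative discriminant by switching to complex numbers.

```r
# Hypothetical helper: roots of qa*x^2 + qb*x + qc = 0 (assumes qa != 0).
solve_quadratic = function(qa, qb, qc) {
  discr = qb^2 - 4 * qa * qc
  # sqrt() of a negative double returns NaN, so use complex arithmetic
  # when the discriminant is negative.
  sd = if (discr < 0) sqrt(as.complex(discr)) else sqrt(discr)
  twoa = 2 * qa
  c((-qb - sd) / twoa, (-qb + sd) / twoa)
}

# Same coefficients as in the example above (a = 3, b = -2, c = -8)
roots = solve_quadratic(3, -2, -8)
roots

# An equation without real roots: x^2 + 1 = 0
complex_roots = solve_quadratic(1, 0, 1)
complex_roots
```

With the first set of coefficients the function returns the same two roots as the step-by-step calculation; the second call returns a pair of complex numbers.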
R [Plain] Console
It is also possible to run [plain] R in a Command (Console) window. Just execute program R.exe from directory %rhome%/bin (here C:/dev/R36/bin):
Run R in Console
R opens an interactive Console that is similar to the R GUI's Console but there are no menu options. The following example shows a simple task done in such a Console. It calculates the monthly loan payment given the principal (present) value of the loan (p), periodic interest rate (r), and the number of payments (n). The loan payment is calculated according to the following formula:
Loan Payment Formula
\[ \text{payment} = \frac{r \cdot p}{1 - (1 + r)^{-n}} \]
Notice that the original input for the rate and term is given on the annual basis.
Monthly Payment in R
R can also be launched from a PowerShell window. In File Explorer, hold down Shift and right-click subfolder bin in folder %rhome% (here C:\dev\R36). From the popup menu, select the option that opens a PowerShell window in this folder.
Run R from Power Shell
Windows opens a PowerShell console. To run R, just enter .\R. An R console opens. It is similar to the Rterm (R-terminal) window shown above. Notice that in PowerShell, the R command is preceded by token .\ which tells Windows to run R from the current directory (.\). Try some R code, for example compare multiplication with exponentiation, and quit R without saving the session. Do not close the PowerShell window.
Run R from Power Shell
It would appear that the plain R consoles are not worth any attention. However, there are circumstances when running R, for example, from a PowerShell window is advantageous. Let's go back to our loan payment example. This time, we will write and save the code in a plain text document, pmt.r. Open a text editor (Notepad) and type the following lines of code.
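arg = commandArgs(TRUE)   # arguments that follow the script name

# The arguments arrive as strings, so convert them to numbers
principal = as.numeric(arg[1])
annual_rate = as.numeric(arg[2])
years = as.numeric(arg[3])
r = annual_rate / 12      # monthly interest rate
n = years * 12            # number of monthly payments
pmt = r * principal / (1 - (1 + r)^-n)
print("Monthly Payment:")
print(round(pmt,2))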
R Script pmt.r
Save this document as pmt.r in a folder dedicated to R, for example in C:\dev\rapp\. Back in the PowerShell window, enter the following command:
./Rscript C:/dev/rapp/pmt.r 100000 0.06 15
Invoke R Script from System's Shell
This command invokes program Rscript.exe, requesting that your code, stored in script C:\dev\rapp\pmt.r, be executed, and passing three command-line arguments (input from the system's shell). The order of the arguments must be consistent with the order in which the arguments are accessed in your R script. As you can see, the first line of the script acknowledges that the script should read one or more arguments from the system's shell. It also instructs R to store the arguments in vector arg. The next lines get the arguments from the vector (arg) into three variables: principal, annual_rate, and years. Since the arguments are originally passed as strings, they must be converted into numbers, here using function as.numeric().
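The same calculation can also be written as a function, which makes it easy to reuse in an interactive session. This is a minimal sketch; the function name monthly_payment is hypothetical, and the argument checks are an assumption about what counts as valid input (all three values positive).

```r
# Hypothetical helper: monthly payment for a fixed-rate loan.
#   principal   - amount borrowed
#   annual_rate - annual interest rate as a fraction (0.06 means 6%)
#   years       - loan term in years
monthly_payment = function(principal, annual_rate, years) {
  # Stop with an error if any input is missing or not positive
  stopifnot(principal > 0, annual_rate > 0, years > 0)
  r = annual_rate / 12   # monthly interest rate
  n = years * 12         # number of monthly payments
  r * principal / (1 - (1 + r)^-n)
}

# Same input as on the command line: 100000 0.06 15
pmt = monthly_payment(100000, 0.06, 15)
round(pmt, 2)
```

When a script is run without all three arguments, as.numeric() returns NA for the missing ones; a check like the one above turns that into an explicit error instead of an NA result.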
So far, our R development environment has the following structure:
File System for R
The paths to specific files, used so far, are:
File         Path
R.exe        C:/dev/R36/bin/
Rscript.exe  C:/dev/R36/bin/
pmt.r        C:/dev/rapp/
We could use absolute paths to execute R code stored in a file by program Rscript.exe. For example, the following command will process script pmt.r by program Rscript.exe from any folder (here from C:\tmp):
cd /tmp
C:/dev/R36/bin/Rscript.exe C:/dev/rapp/pmt.r 100000 0.06 15
Using Absolute Paths
Notice that the first command switches to folder /tmp on the current drive (C:). If you do not have this folder, you can use any other existing folder or you can create this folder with command mkdir /tmp. The second one invokes program Rscript.exe while passing to it R script (pmt.r) and three command-line arguments.
As shown above, the file system is hierarchical. Within the system, programs can refer to their documents using relative references. For example, suppose that our PowerShell's current folder is C:/dev/R36/bin (get there using command cd /dev/R36/bin) and we want to run Rscript.exe from this (current) folder to process script pmt.r (stored in folder C:/dev/rapp). Since navigation from one file to another must be hierarchical, Rscript.exe is unable to find script pmt.r by its name alone. These files reside on different branches of the file tree. In order to get from one folder to another, the navigation pathway must go through a common [top] folder of the two branches (here C:/dev/). In order to get from folder bin to folder rapp, we need to go up two levels (to folder C:/dev) and from this folder down to folder rapp. Using this approach, we can also run our application in the following way:
cd /dev/R36/bin
./Rscript.exe ../../rapp/pmt.r 100000 0.06 15
Using Relative Path
Exercise:
How would you set up your command to have program Rscript.exe execute script pmt.r from folder rapp (passing the same command-line arguments)?
 
General Guidelines
In an R console, typing an expression and pressing Enter executes the expression.
Pressing the Up and Down arrow keys allows you to browse through, access, and edit previously entered expressions (lines).
R's object names are case sensitive. For example, variable x is not the same as variable X.
If you type a letter and press Tab twice, you will see functions or object names that start with this letter.
You can copy/paste single or multiple lines from and to the R console.
Use function help() to get help about an R object or function. For example, to learn more about function sd(), enter:
help(sd)
Assignment Statements
An assignment statement is a fundamental part of every computer program. It is a computer operation that stores a value in a memory location identified by a [named] variable. The following example shows assignment statements from other programming languages:
Assignment Statements
In R, there are three ways to perform an assignment statement:
x <- 5
x
[1] 5
x = 5
x
[1] 5
5 -> x
x
[1] 5
They all say: assign 5 to x, store 5 at x, or set x to 5. This instruction will use the middle way (most of the time)!
Simple R Variables, Objects, and Data Types
character (e.g. 'Mike')
real (floating point, double) numbers (e.g. 0.05)
integer (e.g. 30)
complex (e.g. 1 + 2i)
logical (e.g. FALSE, TRUE)
Important note:
Unlike many other programming languages, R does not have scalar types. Even if a variable is assigned a single value, the variable is treated as a vector (the simplest R data structure) of length one. Such single-valued variables can be used without an index, which is otherwise needed to access individual elements of multi-valued data structures. You can examine the type of a variable using functions class() and typeof().
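A quick way to see that a single value is really a vector of length one is to ask R directly. The following short sketch uses only base R functions; the variable name x is arbitrary.

```r
x = 5

is.vector(x)         # TRUE: a single value is still a vector
length(x)            # 1: the vector has exactly one element
identical(x, x[1])   # TRUE: x and its first element are the same thing
```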
Numeric Variables (double)
There is no declaration of the type for R variables. Variables adopt their type from the values or expressions assigned to them. In the following example, value 5 is a numeric value. The smallest positive (normalized) value of the double type is 2.225074e-308; the largest one is 1.797693e+308.
x = 5
x
[1] 5
class(x)
[1] "numeric"
typeof(x)
[1] "double"
Notice the index notation [1] shown in front of the variable's value. R is telling us that it is a vector. For a single-valued variable, x is the same as x[1].
x[1]
[1] 5
The following examples show a variable (y) being set to an expression (x + 1). It inherits the type from the expression in which both the components are of numeric (double) type:
y = x + 1
y
[1] 6
class(y)
[1] "numeric"
typeof(y)
[1] "double"
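The limits of the double type quoted earlier in this section do not have to be memorized. As a small illustration, they can be read from the built-in .Machine list; the variable names below are chosen only for this example.

```r
dbl_min = .Machine$double.xmin   # smallest positive normalized double (2.225074e-308)
dbl_max = .Machine$double.xmax   # largest finite double (1.797693e+308)
int_max = .Machine$integer.max   # largest integer value (2147483647)

dbl_min
dbl_max
int_max

# Going beyond the largest double produces Inf (infinity)
dbl_max * 10
```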
Numeric Variables (integer)
A single-valued, integer variable:
n = 10L
n
[1] 10
class(n)
[1] "integer"
typeof(n)
[1] "integer"
As you can see, in order to force a single value to be an integer, we add letter L to the number. In the following example, variable m is set to a sequence of whole numbers (4,5,…,8):
m = 4:8
m
[1] 4 5 6 7 8
class(m)
[1] "integer"
typeof(m)
[1] "integer"
m[2]
[1] 5
m[2:4]
[1] 5 6 7
Notice the way the values of vector m can be accessed via an index. Notation [2:4] gets the sub-vector at consecutive index positions 2 through 4.
Numeric Variables (complex)
Complex numbers have serious applications in Math and Physics, particularly in Electronics and Electromagnetism. They are shown here for completeness.
c = 3 + 2i
c
[1] 3+2i
class(c)
[1] "complex"
typeof(c)
[1] "complex"
c = complex(real = 3, imaginary = 2)
c
[1] 3+2i
Also try this:
d = sqrt(-1)
d
[1] NaN       # Not a Number
d = sqrt(as.complex(-1))
d
[1] 0+1i
Character (String) Variables
Consider the following example
first_name = 'Ann'
first_name
[1] "Ann"
first_name[1]
[1] "Ann"
class(first_name)
[1] "character"
Contrary to many other programming languages, the entire string is stored at index position 1. To access individual characters, function substr() can be used. For example:
full_name = 'Ann J Fortran'
full_name
[1] "Ann J Fortran"
first_name = substr(full_name,1,3)
first_name
[1] "Ann"
last_name = substr(full_name,7,13)
last_name
[1] "Fortran"
middle_initial = substr(full_name,5,5)
middle_initial
[1] "J"
Function substr() uses three arguments (string, from_position, to_position).
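Counting character positions by hand works for a fixed string, but it becomes fragile when the names vary in length. As an alternative sketch, the base R function strsplit() can split the string at the spaces; the variable names below are chosen for this example only.

```r
full_name = 'Ann J Fortran'

# strsplit() returns a list (one member per input string),
# so take the first member to get a plain character vector
parts = strsplit(full_name, ' ')[[1]]
parts

first_name = parts[1]
middle_initial = parts[2]
last_name = parts[length(parts)]
last_name

# nchar() gives the total number of characters in the string
nchar(full_name)
```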
Logical Variables
A logical variable can represent one of the two states: TRUE or FALSE.
is_young = TRUE
is_young
[1] TRUE
ann_age = 25
tom_age = 30
ann_is_younger = ann_age < tom_age
ann_is_younger
[1] TRUE
tom_is_younger = tom_age < ann_age
tom_is_younger
[1] FALSE
ann_is_younger = !tom_is_younger
ann_is_younger
[1] TRUE
Notice the exclamation point (!) working as negation operator (not FALSE is TRUE).
Arithmetical Operators
Arithmetical operators are used in numeric expressions.
Operator   Meaning              Example   Result
+          Addition             2+5       7
-          Subtraction          7-4       3
*          Multiplication       3*8       24
/          Division             12/4      3
%/%        Integer Division     9%/%4     2
%%         Division Remainder   9%%4      1
^ or **    Exponentiation       3^4       81
Relational Operators
Relational operators are used to build relational (comparison) expressions.
Operator   Meaning                    Example    Result
<          Less Than                  2 < 5      TRUE
<=         Less Than or Equal To      4 <= 4     TRUE
==         Equal To                   3 == 9/3   TRUE
>          Greater Than               2 > 5      FALSE
>=         Greater Than or Equal To   4 >= 4     TRUE
!=         Not Equal To               2 != 2     FALSE
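One caution about the == operator: doubles are stored in binary with limited precision, so two expressions that are equal on paper may differ slightly in R. The sketch below shows a well-known case and a tolerant comparison based on the base R function all.equal(); the variable names are chosen for this example.

```r
# Exact comparison fails because 0.1 + 0.2 is not stored exactly as 0.3
exact_equal = (0.1 + 0.2 == 0.3)
exact_equal                      # FALSE

# all.equal() compares with a small tolerance;
# isTRUE() turns its answer into a plain TRUE or FALSE
nearly_equal = isTRUE(all.equal(0.1 + 0.2, 0.3))
nearly_equal                     # TRUE
```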
Logical Operators
Logical operators are used to construct logical (TRUE/FALSE) expressions.
Operator   Meaning              Example        Result
!          Negation (NOT)       !TRUE          FALSE
&          Intersection (AND)   TRUE & FALSE   FALSE
|          Alternative (OR)     TRUE | FALSE   TRUE
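Logical operators are most often used to combine relational expressions. As a small illustration (the variable names are chosen for this example), the following lines check whether a number falls inside or outside the range from 1 to 10.

```r
x = 7

in_range = (x >= 1) & (x <= 10)      # both conditions must hold
in_range                             # TRUE

out_of_range = (x < 1) | (x > 10)    # at least one condition must hold
out_of_range                         # FALSE

# Negating one expression gives the other
identical(in_range, !out_of_range)   # TRUE
```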
Structured R Variables, Objects, and Data Types
vector - a sequence of data of the same type, e.g. c(2,5,1,7,4)
list - a list of data of possibly different types, e.g. list(1,"A",c(4,8,5))
array - a multidimensional array, e.g. array(c(1,2,3,4,5,6,7,8),dim = c(2,2,2))
matrix - a two-dimensional array, e.g. matrix(c(1,2,3,4,5,6), nrow=2, ncol=3, byrow=TRUE)
data frame - a table (e.g. mtcars)
Vectors
Vectors are ordered collections of objects (elements or components) of the same type.
Vectors of Integers
A sequence of natural numbers (1,2,...,8).
v = 1:8
v
[1] 1 2 3 4 5 6 7 8
typeof(v)
[1] "integer"
Function seq.int() generates a sequence of numbers from first to not more than last, advanced by increment.
v1 = seq.int(1L,5L,2L)
v1
[1] 1 3 5
v2 = seq.int(2L,6L,2L)
v2
[1] 2 4 6
Operators used with vectors perform the operations on all vector elements:
v
[1] 1 2 3 4 5 6 7 8
sqr_v = v ^ 2
sqr_v
[1] 1 4 9 16 25 36 49 64
Vectors of Doubles [Double Precision Numbers]
A sequence of real numbers (1.0,1.5,2.0,2.5,...,8.5,9.0).
v = seq.int(1,9,0.5)
v
[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0 8.5 9.0
n = length(v)
n
[1] 17
Write an expression to return the difference between consecutive numbers of the vector v.
(Second – First, Third – Second, Fourth – Third, . . ., Last – Second Last.)
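v[2:n]-v[1:(n-1)]
[1] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
The parentheses in 1:(n-1) matter: the colon operator has higher precedence than subtraction, so 1:n-1 would be read as (1:n)-1. The differences are constant because vector v was created as a sequence with a fixed increment (here 0.5).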
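Differences between consecutive elements are needed so often that base R provides function diff() for this purpose. The following sketch repeats the calculation with diff(); the variable name d is chosen for this example.

```r
v = seq.int(1, 9, 0.5)

# diff() returns v[2]-v[1], v[3]-v[2], ..., v[n]-v[n-1]
d = diff(v)
d

length(d)        # 16: one less than the length of v
all(d == 0.5)    # TRUE: every difference equals the increment
```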
Add them up:
v_sum = sum(v)
v_sum
[1] 85
Generate a more descriptive output, using the cat() function.
cat('The sum of \n  '); cat(v); cat('\nis: '); cat(v_sum); cat('\n');
The sum of
  1 1.5 2 2.5 3 3.5 4 4.5 5 5.5 6 6.5 7 7.5 8 8.5 9
is: 85
Using the c() function. Think of it as "combine elements" or "concatenate elements" into a vector.
vec = c(3,5,7,8,9)
vec
[1] 3 5 7 8 9
n = length(vec)
n
[1] 5
sum(vec)/length(vec)
[1] 6.4
Same as:
mean(vec)
[1] 6.4
R coerces vector elements into the most "agreeable" type.
Create a vector of integers:
ints = c(1:5)
ints
[1] 1 2 3 4 5
typeof(ints)
[1] "integer"
Create a vector of doubles, made of explicit integers (1:5) and one implicit double (6):
dbls = c(1:5,6)
dbls
[1] 1 2 3 4 5 6
typeof(dbls)
[1] "double"
Such data type conversion is based on the principle of data integrity that prevents loss of data (or precision). Moving from "integer" to "double" is generally safe but not the other way around.
In many situations we may need to initialize vectors. The following examples show two applications of the rep() function that creates vectors by replicating numbers or other vectors.
vect1 = rep(0,5)
vect1
[1] 0 0 0 0 0
vect2 = rep(c(1,2,3),5)
vect2
[1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3
Use the append() function to append new data to the end of a vector or to insert the data at some inner position. Let's first append new content at the end of vector vect1.
vect1 = append(vect1,rep(1,5))
vect1
[1] 0 0 0 0 0 1 1 1 1 1
Next, insert number 2 before position 1 in vector vect1.
vect1 = append(vect1,2,0)
vect1
[1] 2 0 0 0 0 0 1 1 1 1 1
Vectors of Logical Values
Logical vectors combine logical values FALSE, TRUE.
Let's replicate FALSE seven times.
bool_vec = rep(FALSE, 7)
bool_vec
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Here is another way of doing it:
bool_vec = vector("logical", 7)
bool_vec
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Logical values are usually produced by logical expressions. The following example shows outcomes for three logical (relational) expressions that are stored in a logical vector, vbool.
x = 5
vbool = c(x < 5, x == 5, x > 5)
vbool
[1] FALSE TRUE FALSE
Vectors of Strings
Let's create a vector of names and then sort the vector alphabetically.
fn = c('Ken','Ann','Tom','Eve','Pat','Don','Kim','Sue')
fn
[1] "Ken" "Ann" "Tom" "Eve" "Pat" "Don" "Kim" "Sue"
sorted_fn = sort(fn)
sorted_fn
[1] "Ann" "Don" "Eve" "Ken" "Kim" "Pat" "Sue" "Tom"
What about a descending order?
sorted_desc_fn = sort(fn, decreasing = TRUE)
sorted_desc_fn
[1] "Tom" "Sue" "Pat" "Kim" "Ken" "Eve" "Don" "Ann"
Get lengths of all strings stored in a vector.
fnm = c('Kim','Debb','Chris','Jeanny','Michael')
str_lengths = nchar(fnm)
str_lengths
[1] 3 4 5 6 7
Vectors with Named Elements
Typically, we access data elements from vectors by index. For example:
age = c(22,29,35,50,65)
age
[1] 22 29 35 50 65
Show second, third and last elements:
age[c(2,3,5)]
[1] 29 35 65
Let's name and access vector elements by name.
names(age) = c('Dan','Eve','Fox','Jan','Kam')
age
Dan Eve Fox Jan Kam
22 29 35 50 65
age['Kam']
Kam
65
age['Eve']
Eve
29
age[c('Dan','Jan')]
Dan Jan
22  50
With named elements, vectors can be used as dictionaries (lookup sources).
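To illustrate the dictionary idea a little further, the sketch below runs a few typical lookups against the same named vector. Only base R functions are used; the variable names over_30 and oldest are chosen for this example.

```r
age = c(Dan = 22, Eve = 29, Fox = 35, Jan = 50, Kam = 65)

# Is a given key present?
'Eve' %in% names(age)           # TRUE

# Which names have an age above 30?
over_30 = names(age)[age > 30]
over_30                         # "Fox" "Jan" "Kam"

# Who is the oldest?
oldest = names(which.max(age))
oldest                          # "Kam"

# Get a value without its name attached
unname(age['Kam'])              # 65
```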
Lists
A list may contain objects of different sizes.
List of String Groups (Vectors)
Create a list in which each element contains a vector of strings of a different size.
grp1 = c('Ken','Ann','Tom')
grp2 = c('Eve','Pat')
grp3 = c('Don','Kim','Sue','Ron')
grps = list(grp1,grp2,grp3)
grps
[[1]]
[1] "Ken" "Ann" "Tom"

[[2]]
[1] "Eve" "Pat"

[[3]]
[1] "Don" "Kim" "Sue" "Ron"
Let's find out more about this list.
Get the list's first member:
grps[[1]]
[1] "Ken" "Ann" "Tom"
Get the second name in the list's 3rd member.
grps[[3]][2]
[1] "Kim"
Find out the size of the list.
length(grps)
[1] 3
Get the length of the 1st member (vector).
length(grps[[1]])
[1] 3
Get the length of the 2nd member (vector).
length(grps[[2]])
[1] 2
Get the length of the 3rd member (vector).
length(grps[[3]])
[1] 4
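Instead of asking for the length of each member separately, we can apply function length() to all members at once. This is a short sketch using the base R functions sapply() and lengths(); the variable name sizes is chosen for this example.

```r
grps = list(c('Ken','Ann','Tom'), c('Eve','Pat'), c('Don','Kim','Sue','Ron'))

# sapply() calls length() on every member and collects the results in a vector
sizes = sapply(grps, length)
sizes            # 3 2 4

# lengths() is a shortcut that returns the same result
lengths(grps)    # 3 2 4

# Total number of names in all groups
sum(sizes)       # 9
```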
List of Strings and Integers
A list may contain objects of different types. Here, each list element contains vectors of different types.
name = c('Ken', 'Ann', 'Tom')
age = c(25, 19, 33)
people = list(name, age)
people
[[1]]
[1] "Ken" "Ann" "Tom"

[[2]]
[1] 25 19 33
Let's write a statement about Ken's age, using function cat().
cat(people[[1]][1]); cat(' is '); cat(people[[2]][1]); cat(' years old.'); cat('\n')
Ken is 25 years old.
Function cat() prints its argument as plain text, without the [1] index prefix or quotes. Statement cat('\n') performs a line feed (same as the Enter key does). Notice that this program line contains multiple statements. Typically, we write each statement on a separate line. R statements that appear on the same line must be separated with a semicolon (;).
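Function cat() also accepts several arguments in one call and separates them with a space by default, so the sentence about Ken can be printed with a single statement. The sketch below shows this, together with the base R function sprintf(), which builds the text as a string first; the variable name msg is chosen for this example.

```r
people = list(c('Ken', 'Ann', 'Tom'), c(25, 19, 33))

# One cat() call with several arguments (separated by spaces on output)
cat(people[[1]][1], 'is', people[[2]][1], 'years old.\n')

# sprintf() fills the %s placeholders with the given values
msg = sprintf('%s is %s years old.', people[[1]][2], people[[2]][2])
msg
cat(msg, '\n')
```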
A list may have named elements. Thus it can work as a dictionary, a key-value structure.
Let's redefine the people list. This time, we will store the names as keys.
people = list(Ken=25,Ann=19, Tom=33)
people
$Ken
[1] 25

$Ann
[1] 19

$Tom
[1] 33
How old is Ann?
cat('Ann'); cat(' is '); cat(people$Ann); cat(' years old.'); cat('\n')
Ann is 19 years old.
Notation people$Ann is equivalent to people[[2]][1].
cat('Ann'); cat(' is '); cat(people[[2]][1]); cat(' years old.'); cat('\n')
Ann is 19 years old.
Arrays
An array is a multi-dimensional structure, in which each dimension holds the same number of elements. In a two-dimensional array, each row has the same length and each column has the same length. Because of these properties, arrays are also referred to as rectangular structures. Again, a two-dimensional array looks like a rectangle. Here is an example of a two-dimensional array (2 by 3):
x = array( c(3,5,4,4,3,4), dim=c(2,3) )
x
     [,1] [,2] [,3]
[1,]    3    4    3
[2,]    5    4    4
This array is arranged by columns. The first two values are placed in the first column, the second two values—in the second column, etc.
Array dimensions can also be named.
dn = list( c('r1','r2'), c('c1','c2','c3') )
y = array( c(3,5,4,4,3,4), dim=c(2,3), dimnames=dn )
y
   c1 c2 c3
r1  3  4  3
r2  5  4  4
Accessing arrays by index.
x = array( c(3,5,4,4,3,4), dim=c(2,3) )
x
     [,1] [,2] [,3]
[1,]    3    4    3
[2,]    5    4    4
Get the element from the 2nd row and 1st column.
x[2,1]
[1] 5
Get the elements from the 2nd column (all rows intersecting this column).
x[,2]
[1] 4 4
Get the elements from the 2nd row (all columns intersecting this row).
x[2,]
[1] 5 4 4
Get the elements from columns 2 and 3 (all rows intersecting these columns).
x[,2:3]
     [,1] [,2]
[1,]    4    3
[2,]    4    4
Accessing arrays by name.
y
   c1 c2 c3
r1  3  4  3
r2  5  4  4
y['r1',]
c1 c2 c3
3  4  3
y[,'c2']
r1 r2
4  4
y['r2','c3']
[1] 4
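The examples above use two dimensions, but the same indexing rules extend to more dimensions. As a brief sketch, here is a three-dimensional array (2 by 2 by 2) filled with the numbers 1 through 8; the variable name z is chosen for this example.

```r
z = array(1:8, dim = c(2, 2, 2))

dim(z)          # 2 2 2

# Values fill the first dimension first, then the second, then the third
z[1, 2, 2]      # 7: row 1, column 2 of the second 2-by-2 layer

# Leaving an index empty selects everything along that dimension
z[, , 1]        # the first layer: a 2-by-2 matrix holding 1, 2, 3, 4
```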
Matrices
A matrix is a two-dimensional array. As numeric arrays, matrices are very convenient and very powerful data containers in Data Science. They are based on the concept of matrices as defined in Linear Algebra. Here is an m x n matrix named A, having m rows and n columns.
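\[
A = \begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1j} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2j} & \cdots & a_{2n} \\
\vdots & \vdots &        & \vdots &        & \vdots \\
a_{i1} & a_{i2} & \cdots & a_{ij} & \cdots & a_{in} \\
\vdots & \vdots &        & \vdots &        & \vdots \\
a_{m1} & a_{m2} & \cdots & a_{mj} & \cdots & a_{mn}
\end{pmatrix}
\]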
A matrix, S, having the same number of rows (m) and columns (m) is referred to as a Square matrix.
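\[
S = \begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1i} & \cdots & a_{1m} \\
a_{21} & a_{22} & \cdots & a_{2i} & \cdots & a_{2m} \\
\vdots & \vdots &        & \vdots &        & \vdots \\
a_{i1} & a_{i2} & \cdots & a_{ii} & \cdots & a_{im} \\
\vdots & \vdots &        & \vdots &        & \vdots \\
a_{m1} & a_{m2} & \cdots & a_{mi} & \cdots & a_{mm}
\end{pmatrix}
\]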
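A square matrix, D, with all zeros except for the main diagonal (running from the top-left to the bottom-right corner) is called a Diagonal matrix.
\[
D = \begin{pmatrix}
a_{11} & 0      & 0      & 0 \\
0      & a_{22} & 0      & 0 \\
0      & 0      & a_{ii} & 0 \\
0      & 0      & 0      & a_{mm}
\end{pmatrix}
\]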
A Diagonal matrix, I, with all diagonal values of 1 is referred to as an Identity matrix.
\[
I = \begin{pmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1
\end{pmatrix}
\]
Creating a matrix from a vector.
mx = matrix( c(3,5,4,4,3,4), nrow = 2, ncol = 3 )
mx
     [,1] [,2] [,3]
[1,]    3    4    3
[2,]    5    4    4
Matrix mx is arranged by columns (by default). The first two values of the vector argument, c(3,5,4,4,3,4), are placed in the first column, the next two values in the second column, etc.
Matrix created from Vector by Columns
Try this to arrange data by rows:
my = matrix( c(3,5,4,4,3,4), nrow = 2, ncol = 3, byrow=TRUE )
my
     [,1] [,2] [,3]
[1,]    3    5    4
[2,]    4    3    4
Matrix my is arranged by rows. The first three values are placed in the first row, the next three values in the second row.
Matrix created from Vector by Rows
Matrices can be manipulated mathematically (like in Linear Algebra).
Multiply a matrix by a number:
mx = matrix( c(3,5,4,4,3,4), nrow = 2, ncol = 3 )
mx
     [,1] [,2] [,3]
[1,]    3    4    3
[2,]    5    4    4
m1x = mx * 2
m1x
     [,1] [,2] [,3]
[1,]    6    8    6
[2,]   10    8    8
Each of the matrix elements is multiplied by the number (here 2).
Matrix Multiplied by 2
Add a number to a matrix:
mx = matrix( c(3,5,4,4,3,4), nrow = 2, ncol = 3 )
mx
     [,1] [,2] [,3]
[1,]    3    4    3
[2,]    5    4    4
m2x = mx + 2
m2x
     [,1] [,2] [,3]
[1,]    5    6    5
[2,]    7    6    6
Number 2 is added to each element of the matrix.
Add 2 to Matrix
Add two matrices:
dat1 = c(1,3,4,2,5,2)
mx1 = matrix( dat1, nrow = 2, ncol = 3 )
mx1
     [,1] [,2] [,3]
[1,]    1    4    5
[2,]    3    2    2
dat2 = c(4,1,1,3,1,2)
mx2 = matrix( dat2, nrow = 2, ncol = 3 )
mx2
     [,1] [,2] [,3]
[1,]    4    1    1
[2,]    1    3    2
mx3 = mx1 + mx2
mx3
     [,1] [,2] [,3]
[1,]    5    5    6
[2,]    4    5    4
The matrices are added by elements.
Adding Two Matrices
Multiply two matrices by elements (an element-wise matrix multiplication):
dat1 = c(1,3,4,2,5,2)
mx1 = matrix( dat1, nrow = 2, ncol = 3 )
mx1
     [,1] [,2] [,3]
[1,]    1    4    5
[2,]    3    2    2
dat2 = c(4,1,1,3,1,2)
mx2 = matrix( dat2, nrow = 2, ncol = 3 )
mx2
     [,1] [,2] [,3]
[1,]    4    1    1
[2,]    1    3    2
mx3 = mx1 * mx2
mx3
     [,1] [,2] [,3]
[1,]    4    4    5
[2,]    3    6    4
This type of matrix multiplication is a vector driven (element wise) multiplication. The matrices must be of the same size.
Element-wise Matrix Multiplication
Multiply two matrices as in Linear Algebra (a pure matrix multiplication):
dat1 = c(1,3,4,2,5,2)
mx1 = matrix( dat1, nrow = 2, ncol = 3 )
mx1
     [,1] [,2] [,3]
[1,]    1    4    5
[2,]    3    2    2
dat2 = c(4,1,1,3,1,2)
mx2 = matrix( dat2, nrow = 3, ncol = 2 )
mx2
     [,1] [,2]
[1,]    4    3
[2,]    1    1
[3,]    1    2
mx3 = mx1 %*% mx2
mx3
     [,1] [,2]
[1,]   13   17
[2,]   16   15
The two matrices must be conformable, that is the number of columns in the first matrix must be equal to the number of rows in the second matrix. This operation is performed in Excel by function MMULT().
Matrix Multiplication
The elements of the resulting matrix are cross-products of rows in the first matrix and columns in the second matrix. A cross-product (in this sense, also known as a dot or inner product) of two equally sized vectors is the sum of the products of the elements of the vectors at the same positions. Such an operation is done in Excel using the SumProduct() function. For two vectors, v1 and v2, R can calculate the cross-product as crossprod(v1,v2) or as v1 %*% v2. In either case the outcome is a 1 x 1 matrix. To produce a [plain] number, convert the outcome: as.numeric(crossprod(v1,v2)) or as.numeric(v1 %*% v2).
Calculate an inverse matrix.
vx = c(2,1,4,3,2,1,1,3,3)
mx = matrix( vx, nrow = 3, ncol = 3 )
mx
     [,1] [,2] [,3]
[1,]    2    3    1
[2,]    1    2    3
[3,]    4    1    3
In Linear Algebra, if A is a square (and invertible) matrix then its inverse is denoted by $A^{-1}$. R gets the inverse matrix using the solve() function:
inv_mx = solve(mx)
inv_mx
           [,1]        [,2]        [,3]
[1,]  0.1153846 -0.30769231  0.26923077
[2,]  0.3461538  0.07692308 -0.19230769
[3,] -0.2692308  0.38461538  0.03846154
This operation is performed in Excel by function MINVERSE().
The input matrix must be square. An inverse matrix multiplied by the original matrix will produce an Identity matrix:
mx %*% inv_mx
              [,1] [,2]          [,3]
[1,]             1    0 -7.632783e-17
[2,] -1.110223e-16    1             0
[3,]             0    0             1
Due to rounding imprecision, the above output does not look like a perfect Identity matrix. Here is a rounded result (rounded to 12 digits after the decimal point):
round(mx %*% inv_mx,12)
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1
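Rounding is one way to hide the tiny numerical errors. Another option, sketched below, is to compare the product with an Identity matrix using the base R function all.equal(), which allows for a small tolerance. The variable name is_identity is chosen for this example.

```r
mx = matrix(c(2,1,4,3,2,1,1,3,3), nrow = 3, ncol = 3)
inv_mx = solve(mx)

# diag(3) is the 3 x 3 Identity matrix; all.equal() ignores tiny differences
is_identity = isTRUE(all.equal(mx %*% inv_mx, diag(3)))
is_identity     # TRUE

# A matrix has an inverse only if its determinant is not zero
det(mx)         # 26 (up to rounding)
```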
Solving a Linear Equation System
In general, using a matrix notation, a linear equation system and its solution can be expressed as follows:
\[ A \cdot x = b \]
\[ x = A^{-1} \cdot b \]
For the following equation system:
\[
\begin{aligned}
2x_1 + 3x_2 + 1x_3 &= 630 \\
1x_1 + 2x_2 + 3x_3 &= 550 \\
4x_1 + 1x_2 + 3x_3 &= 600
\end{aligned}
\]
matrix A is defined by the coefficients of the [unknown] variables:
2  3  1
1  2  3
4  1  3
In R:
v = c(2,3,1,1,2,3,4,1,3)
A = matrix( v, nrow = 3, ncol = 3, byrow=T)
A
     [,1] [,2] [,3]
[1,]    2    3    1
[2,]    1    2    3
[3,]    4    1    3
Vector b represents the right-hand side values:
630
550
600
In R:
b = c(630,550,600)
b
[1] 630 550 600
The solution (here x1, x2, x3) is provided as a matrix product of the inverse matrix and the right-hand side vector.
x = solve(A) %*% b
x
     [,1]
[1,]   65
[2,]  145
[3,]   65
Solution: x1 = 65, x2 = 145, x3 = 65. As a check, if you multiply matrix A by vector x (A∙x), you should get the right-hand side vector:
A %*% x
     [,1]
[1,]  630
[2,]  550
[3,]  600
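Function solve() can also take the right-hand side vector as its second argument. In that form it solves the system directly, without building the inverse matrix first, and returns the solution as a plain vector. A minimal sketch with the same system:

```r
A = matrix(c(2,3,1, 1,2,3, 4,1,3), nrow = 3, ncol = 3, byrow = TRUE)
b = c(630, 550, 600)

# solve(A, b) finds x such that A %*% x equals b
x = solve(A, b)
x                              # 65 145 65

# Check: multiplying A by x should reproduce b
as.numeric(A %*% x)            # 630 550 600
```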
Special matrix cases.
Diagonal:
diag(c(4,6,7))
     [,1] [,2] [,3]
[1,]    4    0    0
[2,]    0    6    0
[3,]    0    0    7
Identity:
diag(3)
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1
A column matrix x, to be transposed ($x^T$):
x = matrix(c(4,7,2))
x
     [,1]
[1,]    4
[2,]    7
[3,]    2
Transpose ($x^T$):
xt = t(x)
xt
     [,1] [,2] [,3]
[1,]    4    7    2
Matrix Operations on Vectors!
Consider two vectors, v1 and v2.
v1 = c(4,6,2,1)
v1
[1] 4 6 2 1
v2 = c(1,2,1,1)
v2
[1] 1 2 1 1
Now, calculate a sum of the products of the vector elements (the same as =SumProduct() in Excel).
v1 %*% v2
     [,1]
[1,]   19
Note: In this case, it may be safer to use the crossprod() function:
crossprod(v1,v2)
     [,1]
[1,]   19
Multiply the vectors as matrices (the same as =MMult() in Excel). The second vector must be transposed. Otherwise the matrices (based on the vectors) would not be conformable.
v1 %*% t(v2)
     [,1] [,2] [,3] [,4]
[1,]    4    8    4    4
[2,]    6   12    6    6
[3,]    2    4    2    2
[4,]    1    2    1    1
Note: this matrix is a collection of the products of all possible pairings of the vector values.
  1     2     1     1    (v2)
4*1   4*2   4*1   4*1     4
6*1   6*2   6*1   6*1     6
2*1   2*2   2*1   2*1     2
1*1   1*2   1*1   1*1     1
                        (v1)
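The two vector operations shown in this section also have direct base R equivalents that avoid the matrix operator. The sketch below compares them; the variable names sum_prod and all_pairs are chosen for this example.

```r
v1 = c(4,6,2,1)
v2 = c(1,2,1,1)

# Sum of products as a plain number (element-wise multiply, then add up)
sum_prod = sum(v1 * v2)
sum_prod                          # 19

# outer() builds the table of products of all possible pairings
all_pairs = outer(v1, v2)
all_pairs

# It holds the same values as v1 %*% t(v2)
all(all_pairs == v1 %*% t(v2))    # TRUE
```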
Data Frames
Data frames are table-like structures (similar to Excel tables). They are extremely popular among data analysts. If you have worked with spreadsheet tables, you will find R data frames equally friendly and very powerful.
Data Access:
Data     Excel Table   R Data Frame
pid      B4:B7         hr$pid / hr[,1] / hr[[1]] / hr[1]
fst      C4:C7         hr$fst / hr[,2] / hr[[2]] / hr[2]
salary   D4:D7         hr$salary / hr[,3] / hr[[3]] / hr[3]
row 1    B4:D4         hr[1,]
row 3    B6:D6         hr[3,]
It is also possible to assign names to the ranges. Notice that notation hr[1] gets the entire column 1 as a data frame.
Create a data frame (df_hr), consisting of three columns: person ID (person_id), first name (first_name), and annual salary (annual_salary). One way to create a data frame is to assemble it from columns defined separately as vectors.
person_id = c(1:4)
first_name = c("Kim","Ben","Ann","Dru")
annual_salary = c(75000,64500,45550,92150)
df_hr = data.frame( person_id, first_name, annual_salary)
df_hr
  person_id first_name annual_salary
1         1        Kim         75000
2         2        Ben         64500
3         3        Ann         45550
4         4        Dru         92150
R has done a few interesting things. It used the names of the [input] vectors as the data frame's column names. This can be changed by giving the names explicitly (e.g. as pid, fst, and salary):
hr = data.frame( pid=person_id, fst=first_name, salary=annual_salary)
hr
  pid fst salary
1   1 Kim  75000
2   2 Ben  64500
3   3 Ann  45550
4   4 Dru  92150
Let's try a few ways to access data from data frame hr.
Show the data frame in a grid (spreadsheet):
View(hr)
Dataframe as Grid
Show all data in column 1 as a data frame:
hr[1]
  pid
1   1
2   2
3   3
4   4
Show all data in row 2 (all columns) as a data frame:
hr[2,]
  pid fst salary
2   2 Ben  64500
Show all data in column 1 (all rows) as a vector:
hr[[1]]     # same as hr$pid or hr[,1]
[1] 1 2 3 4
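The difference between the single-bracket and double-bracket notation is easy to check with class(). The sketch below rebuilds the hr frame from the values used above, so that it can be run on its own.

```r
# Rebuild the hr frame from the values used in this section.
hr <- data.frame(pid    = 1:4,
                 fst    = c("Kim", "Ben", "Ann", "Dru"),
                 salary = c(75000, 64500, 45550, 92150))

col_df  <- hr[1]     # single bracket: a one-column data frame
col_vec <- hr[[1]]   # double bracket: the column's underlying vector

print(class(col_df))
print(class(col_vec))
print(identical(hr$pid, hr[, 1]))           # $ and [, 1] return the same vector
print(identical(hr[["salary"]], hr[, 3]))   # a column can also be picked by name
```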
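Show all data in row 2 (all columns) as a [flat] character vector. The second value shows 2 rather than Ben because, in R versions before 4.0.0, data.frame() stores strings as factors (integer codes); this is explained later in this section: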
as.character(hr[2,])
[1] "2"     "2"     "64500"
Let's try a few more ways to access data from data frame hr.
Show the salary value for Ann:
 
hr[hr$fst=='Ann',3] # same as hr[hr$fst=='Ann',]$salary
[1] 45550
Show the person ID and salary value for Ann:
 
hr[hr$fst=='Ann',c(1,3)]
  pid salary
3   3  45550
Print the person ID and salary value for Ann, using the cat() function:
cat(hr[hr$fst=='Ann',c(1,3)])
Error in cat(hr[hr$fst == "Ann", c(1, 3)]) :
  argument 1 (type 'list') cannot be handled by 'cat'
Try this way:
cat(unlist(hr[hr$fst=='Ann',c(1,3)]))
3 45550
Show Ann's salary record:
 
ann_record = hr[hr$fst=='Ann',]
ann_record
  pid fst salary
3   3 Ann  45550
We know that variable ann_record can't be printed by the cat() function unless the variable is flattened (unlisted). Let's see what happens when we do it:
ar = unlist(ann_record)
cat(ar)
3 1 45550
We got an interesting output: pid = 3, fst = 1, and salary = 45550. It all looks reasonable except for fst. It is supposed to be equal to 'Ann' but it shows 1. If we examine column fst, we will see that it does not show quoted strings. It also shows the so-called levels.
hr$fst
[1] Kim Ben Ann Dru
Levels: Ann Ben Dru Kim
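During construction of the frame, R converted the string vector, first_name, to a factor vector. All the string data got mapped to integers for faster processing. This is the default behavior of data.frame() in R versions before 4.0.0. Since R 4.0.0, the default is stringsAsFactors = FALSE, so strings stay as character vectors unless stringsAsFactors = TRUE is given. See the examples for more details.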
Let's find out more about the factors assigned to fst strings.
Show column fst as a numeric vector:
fst_fct = as.numeric(hr$fst)
fst_fct
[1] 4 2 1 3
In a database world, we would say that R indexed the names. Notice that the factor values are assigned to the strings according to the alphabetical order: Kim → 4, Ben → 2, Ann → 1, Dru → 3. It is much easier to process ordered data than random data. It is particularly advantageous when processing large data sets. It is possible (but not advisable) to convert the column to character (string) data.
hr$fst = as.character(hr$fst)
hr$fst
[1] "Kim" "Ben" "Ann" "Dru"
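The mapping between strings, levels, and integer codes can be reproduced without a data frame. The sketch below is a minimal illustration; the names f, codes, restored, hr_fct, and hr_chr are made up for this example. It also requests each behavior explicitly through the stringsAsFactors argument of data.frame(), because the default differs between R versions.

```r
first_name <- c("Kim", "Ben", "Ann", "Dru")

f <- factor(first_name)        # levels are the sorted distinct strings
print(levels(f))

codes <- as.integer(f)         # position of each string within the levels
print(codes)

restored <- levels(f)[codes]   # look the strings up again by their codes
print(restored)

# Ask for factors or for plain strings explicitly (hypothetical frame names).
hr_fct <- data.frame(fst = first_name, stringsAsFactors = TRUE)
hr_chr <- data.frame(fst = first_name, stringsAsFactors = FALSE)
print(class(hr_fct$fst))
print(class(hr_chr$fst))
```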
When displaying the entire data frame, hr, the quotes are not shown.
hr
  pid fst salary
1   1 Kim  75000
2   2 Ben  64500
3   3 Ann  45550
4   4 Dru  92150
As shown above, data frames can be queried. Rows, columns, and their fragments can be selected. Let's try to build a filter that will allow us to select specific rows.
m = nrow(hr)
filter = rep(TRUE, m)
filter
[1] TRUE TRUE TRUE TRUE
hr[filter,]
  pid fst salary
1   1 Kim  75000
2   2 Ben  64500
3   3 Ann  45550
4   4 Dru  92150
Since all the values of variable filter, placed at the row reference, are TRUE, all the rows are displayed. Let's modify the filter, by setting the 2nd and 4th values to FALSE and then repeat the same statement.
filter[c(2,4)] = FALSE
filter
[1]  TRUE FALSE  TRUE FALSE
hr[filter,]
  pid fst salary
1   1 Kim  75000
3   3 Ann  45550
There are many ways to create filters. Let's try a few more examples. First, show the records with salaries below 70,000.
filter = hr$salary < 70000
filter
[1] FALSE  TRUE  TRUE FALSE
hr[filter,]
  pid fst salary
2   2 Ben  64500
3   3 Ann  45550
What about salaries between 60,000 and 80,000?
filter = hr$salary >= 60000 & hr$salary <= 80000
filter
[1]  TRUE  TRUE FALSE FALSE
hr[filter,]
  pid fst salary
1   1 Kim  75000
2   2 Ben  64500
Note: The sqldf library provides powerful functions to query data frames, using SQL.
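Base R also offers the subset() function, which may read more naturally than a separate filter vector because the column names can be used directly in the condition. The sketch below rebuilds hr from the values used above; the result names mid_band, ann_pay, and rows_low are made up for this example.

```r
# Rebuild the hr frame from the values used in this section.
hr <- data.frame(pid    = 1:4,
                 fst    = c("Kim", "Ben", "Ann", "Dru"),
                 salary = c(75000, 64500, 45550, 92150))

# Rows with salaries between 60,000 and 80,000 (all columns).
mid_band <- subset(hr, salary >= 60000 & salary <= 80000)
print(mid_band)

# Ann's person ID and salary only.
ann_pay <- subset(hr, fst == "Ann", select = c(pid, salary))
print(ann_pay)

# which() turns a logical filter into row numbers.
rows_low <- which(hr$salary < 70000)
print(rows_low)
```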
Let’s get the salaries above the average.
avg_salary = mean(hr$salary)
avg_salary
[1] 69300
filter = hr$salary > avg_salary
filter
[1]  TRUE FALSE FALSE  TRUE
hr[filter,]
  pid fst salary
1   1 Kim  75000
4   4 Dru  92150
Finally, let’s show data at the intersection of rows 1, 4 and columns 2, 3.
 
hr[c(1,4),c(2,3)]
  fst salary
1 Kim  75000
4 Dru  92150
There are other ways to create data frames. Suppose that we need to gradually add rows to a data frame. We start with an empty frame and we add rows in a loop. The following investment example illustrates this concept. We invest an initial value (e.g. principal = $1,000) at the beginning of the first year. At the end of the year we add interest to the investment (based on rate = 5%), which gets reinvested at the beginning of the next year. We repeat this procedure for 10 years (term = 10).

principal = 1000
rate = 0.05
term = 10
# Create an empty frame with three columns of type: integer, double, double.
cash_earned = data.frame(year=integer(0), beginning=double(0), ending=double(0))
investment = principal
for (y in 1:term) {
  beg = investment
  investment = (1 + rate) * investment
  # Create a one-row frame with values for columns: year, beginning, ending.
  new_cash = data.frame(year=y, beginning=beg, ending=investment)
  # Bind the original data frame with the new one (by rows).
  cash_earned = rbind(cash_earned, new_cash)
}
cash_earned

   year beginning   ending
1     1  1000.000 1050.000
2     2  1050.000 1102.500
3     3  1102.500 1157.625
4     4  1157.625 1215.506
5     5  1215.506 1276.282
6     6  1276.282 1340.096
7     7  1340.096 1407.100
8     8  1407.100 1477.455
9     9  1477.455 1551.328
10   10  1551.328 1628.895
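For comparison, the same table can be built without a loop, because the balance at the end of year y equals principal * (1 + rate)^y. The sketch below is an alternative to the row-binding approach, not a replacement for it; the name cash_earned_vec is made up for this example.

```r
principal <- 1000
rate <- 0.05
term <- 10

year      <- 1:term
ending    <- principal * (1 + rate)^year         # balance at the end of each year
beginning <- principal * (1 + rate)^(year - 1)   # balance at the start of each year

# Assemble the frame from the three vectors in one step.
cash_earned_vec <- data.frame(year, beginning, ending)
print(cash_earned_vec)
```

Growing a frame with rbind() inside a loop is convenient when each row depends on information that arrives step by step; when a closed formula exists, the vectorized form is usually shorter.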
Data
Built-in Data
R comes with more than programming capabilities. The R installer also adds some data sets to the R environment.
Let's try a built-in dataframe mtcars.
cars = mtcars
head(cars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
attributes(cars)
$names
 [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear" "carb"
$row.names
 [1] "Mazda RX4"           "Mazda RX4 Wag"       "Datsun 710"          "Hornet 4 Drive"      "Hornet Sportabout"   "Valiant"            
 [7] "Duster 360"          "Merc 240D"           "Merc 230"            "Merc 280"            "Merc 280C"           "Merc 450SE"         
[13] "Merc 450SL"          "Merc 450SLC"         "Cadillac Fleetwood"  "Lincoln Continental" "Chrysler Imperial"   "Fiat 128"           
[19] "Honda Civic"         "Toyota Corolla"      "Toyota Corona"       "Dodge Challenger"    "AMC Javelin"         "Camaro Z28"         
[25] "Pontiac Firebird"    "Fiat X1-9"           "Porsche 914-2"       "Lotus Europa"        "Ford Pantera L"      "Ferrari Dino"       
[31] "Maserati Bora"       "Volvo 142E"         
$class
[1] "data.frame"
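The filtering techniques shown earlier for hr apply to built-in frames as well. The sketch below is a small illustration on mtcars; the threshold of 30 miles per gallon and the names efficient and mpg_by_cyl are arbitrary choices made for this example.

```r
cars <- mtcars

print(dim(cars))              # number of rows and columns
print(rownames(cars)[1:3])    # car names are stored as row names

# Cars with fuel economy above 30 mpg, three columns only.
efficient <- cars[cars$mpg > 30, c("mpg", "cyl", "wt")]
print(efficient)

# Average mpg for each number of cylinders.
mpg_by_cyl <- tapply(cars$mpg, cars$cyl, mean)
print(mpg_by_cyl)
```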
Text Data
R can also parse text and load it into a dataframe.
hr = read.table(text='
pid fst salary
1 Kim 75000
2 Ben 64500
3 Ann 45550
4 Dru 92150'
, header = TRUE)
hr
  pid fst salary
1   1 Kim  75000
2   2 Ben  64500
3   3 Ann  45550
4   4 Dru  92150
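The text argument is also accepted by read.csv(), which expects comma-separated values and a header row by default. The sketch below parses the same four records in CSV form and checks the column types; the name hr_csv is made up for this example.

```r
hr_csv <- read.csv(text = "pid,fst,salary
1,Kim,75000
2,Ben,64500
3,Ann,45550
4,Dru,92150", stringsAsFactors = FALSE)

print(hr_csv)
print(sapply(hr_csv, class))   # the class R chose for each column
```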
External Data Sources
Many data science projects in R use data that come from external sources (a local file, a Web document, a database, etc.). R is equipped with many functions that can retrieve data (usually into data frames) from such resources.
Get Data via Clipboard
A popular way is to transfer data via the system's Clipboard. Here is how to copy data from a spreadsheet (Excel) document.
Step 1. Launch R.
Step 2. Prepare an R statement to receive data. Do not press Enter at the end of the statement's line.
x = read.table(file="clipboard", sep ="\t", header=TRUE)
Step 3. Open the Excel dataset (workbook). Try this workbook hr.xlsx.
Step 4. Select the table you want to send to R (in hr.xlsx, select range B3:E12) and press the Copy button (or hit keys Ctrl+C).
Excel Table
Step 5. Switch back to R and press the Enter key (while the cursor is still flashing at the end of the above statement). If the statement has been entered prior to copying the dataset in Excel, just press the UP Arrow key (to recall the statement) and then press Enter.
Show the table.
x
  pid   fst   salary degree
1   1  Kim   75,000    mba 
2   2  Ben   64,500   bsba 
3   3  Ann   45,550   bsba 
4   4  Dru   92,150    mba 
5   5  Ken   95,300    phd 
6   6  Sue   90,230   bsba 
7   7  Tim   50,390    mba 
8   8  Eve   42,960    phd 
9   9  Pat   52,440   bsba
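Notice that the salary values above contain thousands separators (75,000), so R most likely stores that column as text rather than as numbers. One possible cleanup, sketched below on a small stand-in frame (x_demo is a made-up name holding three of the records), is to remove the commas and convert the result.

```r
# A small stand-in for the pasted table; salary arrives as text.
x_demo <- data.frame(pid    = 1:3,
                     fst    = c("Kim", "Ben", "Ann"),
                     salary = c("75,000", "64,500", "45,550"),
                     stringsAsFactors = FALSE)
print(class(x_demo$salary))

# Remove the commas, then convert the text to numbers.
x_demo$salary <- as.numeric(gsub(",", "", as.character(x_demo$salary)))
print(class(x_demo$salary))
print(mean(x_demo$salary))
```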
Get Data From a Web Document
It is very easy to load data into R from Web files. Most frequently, such data sources are stored as so-called Comma Separated Values (CSV) files. Excel refers to this file type as CSV (Comma delimited) (*.csv). To save an Excel worksheet as a CSV file, use the File > Save As command and change the file type to CSV (Comma delimited) (*.csv) via the Save as type: drop-down option.
The following example shows how to load data from URL hr.csv.
x = read.csv('http://biiat.com/dat/hr.csv', header=TRUE)
x
  pid   fst   salary degree
1   1  Kim   75,000    mba 
2   2  Ben   64,500   bsba 
3   3  Ann   45,550   bsba 
4   4  Dru   92,150    mba 
5   5  Ken   95,300    phd 
6   6  Sue   90,230   bsba 
7   7  Tim   50,390    mba 
8   8  Eve   42,960    phd 
9   9  Pat   52,440   bsba
Get Data From a Local Document
Another popular data source is a local text document. The procedure for loading data from a local file into a data frame is very similar to the previous one (shown on the previous slide). All it takes is replacing the Web address with a local file reference. For example, file hr.csv, residing in folder T:\rdev, would be retrieved into a data frame in the following way:
x = read.csv('T:/rdev/hr.csv', header=TRUE)
x
  pid   fst   salary degree
1   1  Kim   75,000    mba 
2   2  Ben   64,500   bsba 
3   3  Ann   45,550   bsba 
4   4  Dru   92,150    mba 
5   5  Ken   95,300    phd 
6   6  Sue   90,230   bsba 
7   7  Tim   50,390    mba 
8   8  Eve   42,960    phd 
9   9  Pat   52,440   bsba
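To try the local-file procedure without having folder T:/rdev, a frame can be written to a temporary file and read back. The sketch below uses tempdir() for the location; the file name hr_demo.csv and the variable names are made up for this example.

```r
# A two-row frame to save (values taken from this section).
hr_out <- data.frame(pid = 1:2, fst = c("Kim", "Ben"), salary = c(75000, 64500))

# Build a path inside the session's temporary folder.
path <- file.path(tempdir(), "hr_demo.csv")

write.csv(hr_out, path, row.names = FALSE)   # save without the row-number column
print(file.exists(path))

x_back <- read.csv(path, header = TRUE, stringsAsFactors = FALSE)
print(x_back)

unlink(path)                                 # remove the temporary file
```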
What about getting a simple sample from a [plain] text document to a vector (txt)? Let’s first download file names.txt from names into folder T:/rdev/. Next read the file, using function scan().
txt = scan('T:/rdev/names.txt', character())
Read 483 items
txt[1:50]     # Show the first 50 values.
[1] "Kim" "Dan" "Pat" "Pat" "Pat" "Kim" "Dan" "Dan" "Dan" "Pat" "Ann" "Kim" "Ann" "Kim" "Kim" "Dan" "Ann"
[18] "Dan" "Ann" "Pat" "Kim" "Pat" "Pat" "Kim" "Kim" "Kim" "Kim" "Dan" "Kim" "Kim" "Ann" "Kim" "Kim" "Ann"
[35] "Dan" "Dan" "Dan" "Kim" "Dan" "Pat" "Dan" "Ann" "Kim" "Dan" "Pat" "Kim" "Dan" "Kim" "Pat" "Pat"
Convert this character vector to a factor vector. Notice the levels (domain).
txt = factor(txt)
txt[1:50]
[1] Kim Dan Pat Pat Pat Kim Dan Dan Dan Pat Ann Kim Ann Kim Kim Dan Ann Dan Ann Pat Kim Pat Pat Kim Kim
[26] Kim Kim Dan Kim Kim Ann Kim Kim Ann Dan Dan Dan Kim Dan Pat Dan Ann Kim Dan Pat Kim Dan Kim Pat Pat
Levels: Ann Dan Kim Pat
Tabulate the sample (count the number of times each name appears in the sample).
frq = table(txt)
frq
txt
Ann Dan Kim Pat 
 65 161 178  79
The class of variable frq is "table" and the type is "integer". It is easy to convert this variable to a data frame:
ft = data.frame(frq)
colnames(ft) = c('fn','frq')
ft
   fn frq
1 Ann  65
2 Dan 161
3 Kim 178
4 Pat  79
Add a relative frequency column. Name it p (for proportion).
n = sum(ft$frq)
prop = ft$frq / n
ft$p = prop
ft
   fn frq         p
1 Ann  65 0.1345756
2 Dan 161 0.3333333
3 Kim 178 0.3685300
4 Pat  79 0.1635611
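The division by the total can also be delegated to the base function prop.table(). Since names.txt may not be at hand, the sketch below rebuilds a sample with the same counts as shown above using rep(); all names ending in _demo are made up for this example.

```r
# Rebuild a sample with the counts shown above (65, 161, 178, 79).
txt_demo <- factor(rep(c("Ann", "Dan", "Kim", "Pat"),
                       times = c(65, 161, 178, 79)))

frq_demo <- table(txt_demo)        # counts per name
p_demo   <- prop.table(frq_demo)   # counts divided by their total
print(round(p_demo, 7))

# Assemble the same three-column frame in one step.
ft_demo <- data.frame(fn  = names(frq_demo),
                      frq = as.vector(frq_demo),
                      p   = as.vector(p_demo))
print(ft_demo)
```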
Show the proportions as a pie chart.
perc = paste(round(100*ft$p,2), "%", sep="")
lbls = paste(ft$fn, perc, sep=', ')
pie(ft$p, labels=lbls, main='Name Frequency')
Pie Chart
Function paste() is used here to format and combine names and proportions into the labels placed on the chart. It works like Excel's function =CONCATENATE().
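Two related base functions are worth knowing: paste0(), which is paste() with sep="", and sprintf(), which formats values according to a template. The sketch below builds the same kind of labels in three ways, using the first two proportions from the table above; the variable names are made up for this example.

```r
fn <- c("Ann", "Dan")
p  <- c(0.1345756, 0.3333333)

# The approach used above: round, append "%", then join with the name.
lbl_paste  <- paste(fn, paste(round(100 * p, 2), "%", sep = ""), sep = ", ")

# paste0() saves typing sep = "".
lbl_paste0 <- paste(fn, paste0(round(100 * p, 2), "%"), sep = ", ")

# sprintf(): %s is a string, %.2f a number with two decimals, %% a literal percent sign.
lbl_fmt    <- sprintf("%s, %.2f%%", fn, 100 * p)

print(lbl_paste)
print(lbl_paste0)
print(lbl_fmt)
```

One difference to keep in mind: round() drops trailing zeros (30 stays 30), while %.2f always prints two decimals (30.00).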
This is it!
Get Data From a Keyboard
The real power of R lies in handling external data sources. In addition to those listed above, R can also accept input from a keyboard. Function readline(prompt = "") does this job. The prompt argument is the text that invites the user to type an input. The following example imitates a simple game: Rolling a Die. The entire code is wrapped up in a [user-defined] function, check_roll(). When the function is called, it is provided with a randomly generated number, selected from the set of numbers (1,2,...,6). The number is generated by function sample(), which takes the set (written as 1:6 in the call below) and randomly selects one number from it. The selection is done according to a uniform distribution. An alternative generator would be: ceiling(runif(1, min=0, max=6)).
check_roll = function(roll) {
  guess = readline('Your guess: ')
  guess = as.integer(guess)
  if (guess == roll)
    cat('Got it!\n')
  else
    cat('Sorry, the roll result is:',roll,'\n')
}
if( interactive() ) check_roll( sample(1:6,1) )
Your guess: 6
Sorry, the roll result is: 4
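Because readline() only works in an interactive session, the comparison logic is easier to try out when the guess is passed in as text. The sketch below is a hypothetical variant, judge_roll(), which also handles input that is not a number from 1 to 6; it then draws 1000 rolls with each of the two generators mentioned above to confirm that both stay within 1 to 6.

```r
# Hypothetical helper: takes the typed text and the roll, returns a message.
judge_roll <- function(guess_text, roll) {
  guess <- suppressWarnings(as.integer(guess_text))   # NA if not a number
  if (is.na(guess) || guess < 1 || guess > 6)
    return("Please enter a whole number from 1 to 6.")
  if (guess == roll) "Got it!" else paste("Sorry, the roll result is:", roll)
}

print(judge_roll("4", 4))
print(judge_roll("6", 4))
print(judge_roll("six", 4))

set.seed(1)                                          # make the draws repeatable
rolls_a <- sample(1:6, 1000, replace = TRUE)         # generator used above
rolls_b <- ceiling(runif(1000, min = 0, max = 6))    # alternative generator
print(range(rolls_a))
print(range(rolls_b))
```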
Copyright © 2019 biiat.com
Contributed by Nicholas Letkowski.