January 27, 2020
Usually you will use read.csv()
read.table()
are generalizations for their data formatsWe can read any data from a website or the desktop
If we use some Census data from the web source,
census <- read.csv("http://myaddress/address2/mydata.csv")
calif_penn <- read.csv("http://www.stat.cmu.edu/~cshalizi/uADA/13/hw/01/calif_penn_2011.csv") head(calif_penn)
## X GEO.id2 STATEFP COUNTYFP TRACTCE POPULATION LATITUDE LONGITUDE ## 1 1 6001400100 6 1 400100 2937 37.85942 -122.2304 ## 2 2 6001400200 6 1 400200 1974 37.84829 -122.2495 ## 3 3 6001400300 6 1 400300 4865 37.84027 -122.2544 ## 4 4 6001400400 6 1 400400 3703 37.84845 -122.2573 ## 5 5 6001400500 6 1 400500 3517 37.84894 -122.2647 ## 6 6 6001400600 6 1 400600 1571 37.84223 -122.2649 ## GEO.display.label Median_house_value ## 1 Census Tract 4001, Alameda County, California NA ## 2 Census Tract 4002, Alameda County, California 909600 ## 3 Census Tract 4003, Alameda County, California 748700 ## 4 Census Tract 4004, Alameda County, California 773600 ## 5 Census Tract 4005, Alameda County, California 579200 ## 6 Census Tract 4006, Alameda County, California 480800 ## Total_units Vacant_units Median_rooms Mean_household_size_owners ## 1 1425 162 6.5 2.02 ## 2 929 37 6.0 2.53 ## 3 2655 134 4.6 2.45 ## 4 1911 68 5.0 2.04 ## 5 1703 71 4.5 2.66 ## 6 781 65 4.8 2.58 ## Mean_household_size_renters Built_2005_or_later Built_2000_to_2004 ## 1 1.59 4.6 9.3 ## 2 1.81 0.0 1.2 ## 3 1.66 0.0 0.0 ## 4 2.19 0.0 0.2 ## 5 1.72 0.0 0.2 ## 6 2.18 0.0 0.0 ## Built_1990s Built_1980s Built_1970s Built_1960s Built_1950s Built_1940s ## 1 50.9 2.5 4.8 1.3 13.9 2.8 ## 2 0.0 1.3 6.1 6.5 1.0 10.8 ## 3 2.3 3.2 5.2 8.3 5.3 7.8 ## 4 1.3 0.0 4.9 4.3 8.0 10.4 ## 5 1.1 1.9 3.7 5.8 6.0 7.5 ## 6 1.2 1.4 1.0 6.5 19.7 17.0 ## Built_1939_or_earlier Bedrooms_0 Bedrooms_1 Bedrooms_2 Bedrooms_3 ## 1 9.9 3.6 5.6 11.9 40.6 ## 2 73.2 3.0 16.4 27.4 34.4 ## 3 68.0 11.5 28.4 29.2 20.4 ## 4 71.1 5.2 27.7 33.7 21.9 ## 5 73.8 4.9 30.2 38.1 19.3 ## 6 53.1 3.5 20.4 40.1 30.7 ## Bedrooms_4 Bedrooms_5_or_more Owners Renters Median_household_income ## 1 31.6 6.7 81.2 18.8 156250 ## 2 17.5 1.2 66.0 34.0 111667 ## 3 7.9 2.7 45.1 54.9 66094 ## 4 7.3 4.2 45.0 55.0 87306 ## 5 5.4 2.1 43.6 56.4 62386 ## 6 4.6 0.8 51.0 49.0 55658 ## Mean_household_income ## 1 237805 ## 2 195229 ## 3 105877 ## 4 106248 ## 5 74604 ## 6 73933
If we want to use data from our desktop, we should include a file path (ex. c:/)
MyData <- read.csv("c:/TheDataIWantToReadIn.csv", header=TRUE)
Remember that you must use the forward slash / or double backslash \ in R!
The Windows format of single backslash does not work
header=TRUE
means the first row contains the names of the variables.We will use built-in dataset in R
R comes with several built-in datasets
Generally used as demo data for playing with R functions
data()
mtcars : Motor Trend Car Road Tests
The data was extracted from the 1974 Motor Trend US magazine
The Data consists of fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models)
cars <- mtcars
head()
returns the first 6 lines of the data
tail()
returns the last 6 lines of the data
head(cars)
## mpg cyl disp hp drat wt qsec vs am gear carb ## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 ## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 ## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 ## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 ## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 ## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
dim()
provides the number of rows and columns of the data
ncol()
returns the number of columns of the data
nrow()
provides the number of rows of the data
dim(cars)
## [1] 32 11
ncol(cars)
## [1] 11
nrow(cars)
## [1] 32
colnames()
provides the column names of the data
rownames()
provides the row names of the data
colnames(cars)
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear" ## [11] "carb"
rownames(cars)
## [1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" ## [4] "Hornet 4 Drive" "Hornet Sportabout" "Valiant" ## [7] "Duster 360" "Merc 240D" "Merc 230" ## [10] "Merc 280" "Merc 280C" "Merc 450SE" ## [13] "Merc 450SL" "Merc 450SLC" "Cadillac Fleetwood" ## [16] "Lincoln Continental" "Chrysler Imperial" "Fiat 128" ## [19] "Honda Civic" "Toyota Corolla" "Toyota Corona" ## [22] "Dodge Challenger" "AMC Javelin" "Camaro Z28" ## [25] "Pontiac Firebird" "Fiat X1-9" "Porsche 914-2" ## [28] "Lotus Europa" "Ford Pantera L" "Ferrari Dino" ## [31] "Maserati Bora" "Volvo 142E"
str()
tells us the type of each column and gives us a preview of the data.
str(cars)
## 'data.frame': 32 obs. of 11 variables: ## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ... ## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ... ## $ disp: num 160 160 108 258 360 ... ## $ hp : num 110 110 93 110 175 105 245 62 95 123 ... ## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ... ## $ wt : num 2.62 2.88 2.32 3.21 3.44 ... ## $ qsec: num 16.5 17 18.6 19.4 17 ... ## $ vs : num 0 0 1 1 0 1 0 1 1 1 ... ## $ am : num 1 1 1 0 0 0 0 0 0 0 ... ## $ gear: num 4 4 4 3 3 3 3 4 4 4 ... ## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
In R, subsetting is done with [
The ,
seperates rows to select from columns to select
In [a, b]
, a
indicates a-th row, b
indicates b-th column
cars[1,1]
## [1] 21
head(cars)
## mpg cyl disp hp drat wt qsec vs am gear carb ## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 ## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 ## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 ## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 ## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 ## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
When one of these is empty, it keeps all values
data[a,]
gives an a-th row with the whole column values
data[,b]
gives a b-th column with the whole row values
cars[1,]
## mpg cyl disp hp drat wt qsec vs am gear carb ## Mazda RX4 21 6 160 110 3.9 2.62 16.46 0 1 4 4
cars[,1]
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 ## [15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 ## [29] 15.8 19.7 15.0 21.4
For column variable, we can use the name of a variable
The $
is used to select a column of a data frame
cars$hp
## [1] 110 110 93 110 175 105 245 62 95 123 123 180 180 180 205 215 230 ## [18] 66 52 65 97 150 150 245 175 66 91 113 264 175 335 109
cars[,4]
## [1] 110 110 93 110 175 105 245 62 95 123 123 180 180 180 205 215 230 ## [18] 66 52 65 97 150 150 245 175 66 91 113 264 175 335 109
cars$hp
## [1] 110 110 93 110 175 105 245 62 95 123 123 180 180 180 205 215 230 ## [18] 66 52 65 97 150 150 245 175 66 91 113 264 175 335 109
cars$Hp
## NULL
We can select the several rows and columns at the same time
Let’s find mpg
(Miles/gallon; column 1) and hp
(Horsepower; column 4)` of the cars
cars[,1,4] # ?
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 ## [15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 ## [29] 15.8 19.7 15.0 21.4
cars[,c(1,4)]
## mpg hp ## Mazda RX4 21.0 110 ## Mazda RX4 Wag 21.0 110 ## Datsun 710 22.8 93 ## Hornet 4 Drive 21.4 110 ## Hornet Sportabout 18.7 175 ## Valiant 18.1 105 ## Duster 360 14.3 245 ## Merc 240D 24.4 62 ## Merc 230 22.8 95 ## Merc 280 19.2 123 ## Merc 280C 17.8 123 ## Merc 450SE 16.4 180 ## Merc 450SL 17.3 180 ## Merc 450SLC 15.2 180 ## Cadillac Fleetwood 10.4 205 ## Lincoln Continental 10.4 215 ## Chrysler Imperial 14.7 230 ## Fiat 128 32.4 66 ## Honda Civic 30.4 52 ## Toyota Corolla 33.9 65 ## Toyota Corona 21.5 97 ## Dodge Challenger 15.5 150 ## AMC Javelin 15.2 150 ## Camaro Z28 13.3 245 ## Pontiac Firebird 19.2 175 ## Fiat X1-9 27.3 66 ## Porsche 914-2 26.0 91 ## Lotus Europa 30.4 113 ## Ford Pantera L 15.8 264 ## Ferrari Dino 19.7 175 ## Maserati Bora 15.0 335 ## Volvo 142E 21.4 109
We can use the names of variables
cars[,c("mpg", "hp")]
## mpg hp ## Mazda RX4 21.0 110 ## Mazda RX4 Wag 21.0 110 ## Datsun 710 22.8 93 ## Hornet 4 Drive 21.4 110 ## Hornet Sportabout 18.7 175 ## Valiant 18.1 105 ## Duster 360 14.3 245 ## Merc 240D 24.4 62 ## Merc 230 22.8 95 ## Merc 280 19.2 123 ## Merc 280C 17.8 123 ## Merc 450SE 16.4 180 ## Merc 450SL 17.3 180 ## Merc 450SLC 15.2 180 ## Cadillac Fleetwood 10.4 205 ## Lincoln Continental 10.4 215 ## Chrysler Imperial 14.7 230 ## Fiat 128 32.4 66 ## Honda Civic 30.4 52 ## Toyota Corolla 33.9 65 ## Toyota Corona 21.5 97 ## Dodge Challenger 15.5 150 ## AMC Javelin 15.2 150 ## Camaro Z28 13.3 245 ## Pontiac Firebird 19.2 175 ## Fiat X1-9 27.3 66 ## Porsche 914-2 26.0 91 ## Lotus Europa 30.4 113 ## Ford Pantera L 15.8 264 ## Ferrari Dino 19.7 175 ## Maserati Bora 15.0 335 ## Volvo 142E 21.4 109
Let’s select cars with V-shaped Engine
vs
contains the type of engine information
cars$vs
## [1] 0 0 1 1 0 1 0 1 1 1 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 1 0 1 0 0 0 1
table(cars$vs)
## ## 0 1 ## 18 14
table()
gives us counts at each levelscars$vs==0
## [1] TRUE TRUE FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE ## [12] TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE ## [23] TRUE TRUE TRUE FALSE TRUE FALSE TRUE TRUE TRUE FALSE
It contains TRUE/FALSE
information of 32 cars,
row
.cars[cars$vs==0, ]
## mpg cyl disp hp drat wt qsec vs am gear carb ## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 ## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 ## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 ## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 ## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 ## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 ## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 ## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 ## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 ## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 ## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 ## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 ## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 ## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 ## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 ## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 ## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 ## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
nrow(cars[cars$vs==0, ])
## [1] 18
Quiz!
Select cars
V-shape Engine
more than 18 mpg
How?
Answer~
cars$vs==0 & cars$mpg > 18
## [1] TRUE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE ## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [23] FALSE FALSE TRUE FALSE TRUE FALSE FALSE TRUE FALSE FALSE
cars[cars$vs==0 & cars$mpg > 18,]
## mpg cyl disp hp drat wt qsec vs am gear carb ## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 ## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 ## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 ## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 ## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 ## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Create a new variable hp_per_wt
about horsepower per weight.
Put new variable within dataset using cars$
cars$hp/cars$wt
## [1] 41.98473 38.26087 40.08621 34.21462 50.87209 30.34682 68.62745 ## [8] 19.43574 30.15873 35.75581 35.75581 44.22604 48.25737 47.61905 ## [15] 39.04762 39.63864 43.03087 30.00000 32.19814 35.42234 39.35091 ## [22] 42.61364 43.66812 63.80208 45.51365 34.10853 42.52336 74.68605 ## [29] 83.28076 63.17690 93.83754 39.20863
cars$hp_per_wt <- cars$hp/cars$wt
head(cars[,8:12])
## vs am gear carb hp_per_wt ## Mazda RX4 0 1 4 4 41.98473 ## Mazda RX4 Wag 0 1 4 4 38.26087 ## Datsun 710 1 1 4 1 40.08621 ## Hornet 4 Drive 1 0 3 1 34.21462 ## Hornet Sportabout 0 0 3 2 50.87209 ## Valiant 1 0 3 1 30.34682
We could also use:
cbind
is used to combine by columns
cars <- cbind(cars, cars$hp/cars$wt) cars[1:3,]
## mpg cyl disp hp drat wt qsec vs am gear carb hp_per_wt ## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 41.98473 ## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 38.26087 ## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 40.08621 ## cars$hp/cars$wt ## Mazda RX4 41.98473 ## Mazda RX4 Wag 38.26087 ## Datsun 710 40.08621
colnames(cars)
## [1] "mpg" "cyl" "disp" ## [4] "hp" "drat" "wt" ## [7] "qsec" "vs" "am" ## [10] "gear" "carb" "hp_per_wt" ## [13] "cars$hp/cars$wt"
colnames(cars)[13] <- "hp_per_wt2"
Using cbind
, we could put the variables at the specific region
cars2 <- cars[1:3,c("hp", "wt")] cbind(cars2, cars2$hp/cars2$wt)
## hp wt cars2$hp/cars2$wt ## Mazda RX4 110 2.620 41.98473 ## Mazda RX4 Wag 110 2.875 38.26087 ## Datsun 710 93 2.320 40.08621
cbind(cars2, cars2$hp/cars2$wt)[,c(3,1,2)]
## cars2$hp/cars2$wt hp wt ## Mazda RX4 41.98473 110 2.620 ## Mazda RX4 Wag 38.26087 110 2.875 ## Datsun 710 40.08621 93 2.320
Let’s add hp_per_wt
next to hp
.
cars3 <- cbind(cars, cars$hp_per_wt)[,c(1:4, 13, 5:12)] colnames(cars3)
## [1] "mpg" "cyl" "disp" "hp" "hp_per_wt2" ## [6] "drat" "wt" "qsec" "vs" "am" ## [11] "gear" "carb" "hp_per_wt"
cars3[1:3,]
## mpg cyl disp hp hp_per_wt2 drat wt qsec vs am gear ## Mazda RX4 21.0 6 160 110 41.98473 3.90 2.620 16.46 0 1 4 ## Mazda RX4 Wag 21.0 6 160 110 38.26087 3.90 2.875 17.02 0 1 4 ## Datsun 710 22.8 4 108 93 40.08621 3.85 2.320 18.61 1 1 4 ## carb hp_per_wt ## Mazda RX4 4 41.98473 ## Mazda RX4 Wag 4 38.26087 ## Datsun 710 1 40.08621
hp_per_wt
seems to be too long
cars[1:3,7:ncol(cars)]
## qsec vs am gear carb hp_per_wt hp_per_wt2 ## Mazda RX4 16.46 0 1 4 4 41.98473 41.98473 ## Mazda RX4 Wag 17.02 0 1 4 4 38.26087 38.26087 ## Datsun 710 18.61 1 1 4 1 40.08621 40.08621
Let’s change this variable shorter
round
rounds the values to specified number of decimal pointscars$hp_per_wt <- round(cars$hp_per_wt, 3) cars[1:3,7:ncol(cars)]
## qsec vs am gear carb hp_per_wt hp_per_wt2 ## Mazda RX4 16.46 0 1 4 4 41.985 41.98473 ## Mazda RX4 Wag 17.02 0 1 4 4 38.261 38.26087 ## Datsun 710 18.61 1 1 4 1 40.086 40.08621
wt
is weight (1000 lbs)
Let’s convert that into weight per lb
cars$wt <- cars$wt * 1000 cars$wt
## [1] 2620 2875 2320 3215 3440 3460 3570 3190 3150 3440 3440 4070 3730 3780 ## [15] 5250 5424 5345 2200 1615 1835 2465 3520 3435 3840 3845 1935 2140 1513 ## [29] 3170 2770 3570 2780
cars[1:3,]
## mpg cyl disp hp drat wt qsec vs am gear carb hp_per_wt ## Mazda RX4 21.0 6 160 110 3.90 2620 16.46 0 1 4 4 41.985 ## Mazda RX4 Wag 21.0 6 160 110 3.90 2875 17.02 0 1 4 4 38.261 ## Datsun 710 22.8 4 108 93 3.85 2320 18.61 1 1 4 1 40.086 ## hp_per_wt2 ## Mazda RX4 41.98473 ## Mazda RX4 Wag 38.26087 ## Datsun 710 40.08621
We would like to standardize the variable qsec
(1/4 mile time)
cars$qsec2 <- (cars$qsec - mean(cars$qsec)) / sd(cars$qsec)
We know that vs
is related to the shape of engine
Let’s change (0, 1) to (“V”, “S”)
How?
cars$vs[cars$vs=="0"]
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
cars$vs[cars$vs=="0"] <- "V" cars$vs[cars$vs=="1"] <- "S" cars$vs
## [1] "V" "V" "S" "S" "V" "S" "V" "S" "S" "S" "S" "V" "V" "V" "V" "V" "V" ## [18] "S" "S" "S" "S" "V" "V" "V" "V" "S" "V" "S" "V" "V" "V" "S"
I don’t think we will need hp_per_wt
, hp_per_wt2
, and qsec2
, so let’s remove those columns
-
tells R to remove this column number
colnames(cars)
## [1] "mpg" "cyl" "disp" "hp" "drat" ## [6] "wt" "qsec" "vs" "am" "gear" ## [11] "carb" "hp_per_wt" "hp_per_wt2" "qsec2"
cars <- cars[,-14]
cars[1:3,]
## mpg cyl disp hp drat wt qsec vs am gear carb ## Mazda RX4 21.0 6 160 110 3.90 2620 -0.7771651 V 1 4 4 ## Mazda RX4 Wag 21.0 6 160 110 3.90 2875 -0.4637808 V 1 4 4 ## Datsun 710 22.8 4 108 93 3.85 2320 0.4260068 S 1 4 1 ## hp_per_wt hp_per_wt2 ## Mazda RX4 41.985 41.98473 ## Mazda RX4 Wag 38.261 38.26087 ## Datsun 710 40.086 40.08621
We can remove multiple variables at the same time
cars <- cars[,-c(12,13)] cars[1:3,]
## mpg cyl disp hp drat wt qsec vs am gear carb ## Mazda RX4 21.0 6 160 110 3.90 2620 -0.7771651 V 1 4 4 ## Mazda RX4 Wag 21.0 6 160 110 3.90 2875 -0.4637808 V 1 4 4 ## Datsun 710 22.8 4 108 93 3.85 2320 0.4260068 S 1 4 1
Let’s sort mpg
of cars
data
cars <- mtcars cars$mpg
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 ## [15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 ## [29] 15.8 19.7 15.0 21.4
sort(cars$mpg)
## [1] 10.4 10.4 13.3 14.3 14.7 15.0 15.2 15.2 15.5 15.8 16.4 17.3 17.8 18.1 ## [15] 18.7 19.2 19.2 19.7 21.0 21.0 21.4 21.4 21.5 22.8 22.8 24.4 26.0 27.3 ## [29] 30.4 30.4 32.4 33.9
By default, sorting is ascending
If you want to sort it into descending order…
sort(cars$mpg, decreasing = TRUE)
## [1] 33.9 32.4 30.4 30.4 27.3 26.0 24.4 22.8 22.8 21.5 21.4 21.4 21.0 21.0 ## [15] 19.7 19.2 19.2 18.7 18.1 17.8 17.3 16.4 15.8 15.5 15.2 15.2 15.0 14.7 ## [29] 14.3 13.3 10.4 10.4
We use order
function to sort a data frame
order
gives the indices of the vector in sorted order
a <- c(100, 10, 1000) order(a)
## [1] 2 1 3
10
is the smallest element in a
, which is the second element, So 2 comes first in the output of order(a)
100
is the second smallest element and the first element in a
, so 1 comes second
We can use the output of order(a)
to reshuffle a
a[order(a)]
## [1] 10 100 1000
Now sort the whole data cars
into ascending order on mpg
head(cars[order(cars$mpg),])
## mpg cyl disp hp drat wt qsec vs am gear carb ## Cadillac Fleetwood 10.4 8 472 205 2.93 5.250 17.98 0 0 3 4 ## Lincoln Continental 10.4 8 460 215 3.00 5.424 17.82 0 0 3 4 ## Camaro Z28 13.3 8 350 245 3.73 3.840 15.41 0 0 3 4 ## Duster 360 14.3 8 360 245 3.21 3.570 15.84 0 0 3 4 ## Chrysler Imperial 14.7 8 440 230 3.23 5.345 17.42 0 0 3 4 ## Maserati Bora 15.0 8 301 335 3.54 3.570 14.60 0 1 5 8
If you want to sort data descending order on mpg
,
- (minus sign)
could be useful
head(cars[order(-cars$mpg),])
## mpg cyl disp hp drat wt qsec vs am gear carb ## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 ## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 ## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 ## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 ## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 ## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Let’s merge two data frames
We can start from two simple datasets
simple.cars <- cars[,c(1,2)] simple.cars2 <- cars[,c(10,11)] head(simple.cars)
## mpg cyl ## Mazda RX4 21.0 6 ## Mazda RX4 Wag 21.0 6 ## Datsun 710 22.8 4 ## Hornet 4 Drive 21.4 6 ## Hornet Sportabout 18.7 8 ## Valiant 18.1 6
head(simple.cars2)
## gear carb ## Mazda RX4 4 4 ## Mazda RX4 Wag 4 4 ## Datsun 710 4 1 ## Hornet 4 Drive 3 1 ## Hornet Sportabout 3 2 ## Valiant 3 1
Two datasets have the same rowname
Merge them by rowname
merge(simple.cars, simple.cars2, by = "row.names")
## Row.names mpg cyl gear carb ## 1 AMC Javelin 15.2 8 3 2 ## 2 Cadillac Fleetwood 10.4 8 3 4 ## 3 Camaro Z28 13.3 8 3 4 ## 4 Chrysler Imperial 14.7 8 3 4 ## 5 Datsun 710 22.8 4 4 1 ## 6 Dodge Challenger 15.5 8 3 2 ## 7 Duster 360 14.3 8 3 4 ## 8 Ferrari Dino 19.7 6 5 6 ## 9 Fiat 128 32.4 4 4 1 ## 10 Fiat X1-9 27.3 4 4 1 ## 11 Ford Pantera L 15.8 8 5 4 ## 12 Honda Civic 30.4 4 4 2 ## 13 Hornet 4 Drive 21.4 6 3 1 ## 14 Hornet Sportabout 18.7 8 3 2 ## 15 Lincoln Continental 10.4 8 3 4 ## 16 Lotus Europa 30.4 4 5 2 ## 17 Maserati Bora 15.0 8 5 8 ## 18 Mazda RX4 21.0 6 4 4 ## 19 Mazda RX4 Wag 21.0 6 4 4 ## 20 Merc 230 22.8 4 4 2 ## 21 Merc 240D 24.4 4 4 2 ## 22 Merc 280 19.2 6 4 4 ## 23 Merc 280C 17.8 6 4 4 ## 24 Merc 450SE 16.4 8 3 3 ## 25 Merc 450SL 17.3 8 3 3 ## 26 Merc 450SLC 15.2 8 3 3 ## 27 Pontiac Firebird 19.2 8 3 2 ## 28 Porsche 914-2 26.0 4 5 2 ## 29 Toyota Corolla 33.9 4 4 1 ## 30 Toyota Corona 21.5 4 3 1 ## 31 Valiant 18.1 6 3 1 ## 32 Volvo 142E 21.4 4 4 2
merge(x, y, by, by.x, by.y)
x
and y
are two data frames
by
, by.x
, by.y
specifies the columns used for merging
What if rownames don’t match?
In usual cases, we need a common column to merge two datasets
simple.cars <- data.frame("id" = rownames(cars), "mpg" = cars$mpg, "hp" = cars$hp) simple.cars2 <- data.frame("car.type" = rownames(cars), "vs" = cars$vs, "am" = cars$am) head(simple.cars)
## id mpg hp ## 1 Mazda RX4 21.0 110 ## 2 Mazda RX4 Wag 21.0 110 ## 3 Datsun 710 22.8 93 ## 4 Hornet 4 Drive 21.4 110 ## 5 Hornet Sportabout 18.7 175 ## 6 Valiant 18.1 105
head(simple.cars2)
## car.type vs am ## 1 Mazda RX4 0 1 ## 2 Mazda RX4 Wag 0 1 ## 3 Datsun 710 1 1 ## 4 Hornet 4 Drive 1 0 ## 5 Hornet Sportabout 0 0 ## 6 Valiant 1 0
merge(x = simple.cars, y = simple.cars2, by.x = "id", by.y = "car.type")
## id mpg hp vs am ## 1 AMC Javelin 15.2 150 0 0 ## 2 Cadillac Fleetwood 10.4 205 0 0 ## 3 Camaro Z28 13.3 245 0 0 ## 4 Chrysler Imperial 14.7 230 0 0 ## 5 Datsun 710 22.8 93 1 1 ## 6 Dodge Challenger 15.5 150 0 0 ## 7 Duster 360 14.3 245 0 0 ## 8 Ferrari Dino 19.7 175 0 1 ## 9 Fiat 128 32.4 66 1 1 ## 10 Fiat X1-9 27.3 66 1 1 ## 11 Ford Pantera L 15.8 264 0 1 ## 12 Honda Civic 30.4 52 1 1 ## 13 Hornet 4 Drive 21.4 110 1 0 ## 14 Hornet Sportabout 18.7 175 0 0 ## 15 Lincoln Continental 10.4 215 0 0 ## 16 Lotus Europa 30.4 113 1 1 ## 17 Maserati Bora 15.0 335 0 1 ## 18 Mazda RX4 21.0 110 0 1 ## 19 Mazda RX4 Wag 21.0 110 0 1 ## 20 Merc 230 22.8 95 1 0 ## 21 Merc 240D 24.4 62 1 0 ## 22 Merc 280 19.2 123 1 0 ## 23 Merc 280C 17.8 123 1 0 ## 24 Merc 450SE 16.4 180 0 0 ## 25 Merc 450SL 17.3 180 0 0 ## 26 Merc 450SLC 15.2 180 0 0 ## 27 Pontiac Firebird 19.2 175 0 0 ## 28 Porsche 914-2 26.0 91 0 1 ## 29 Toyota Corolla 33.9 65 1 1 ## 30 Toyota Corona 21.5 97 1 0 ## 31 Valiant 18.1 105 1 0 ## 32 Volvo 142E 21.4 109 1 1
What if some rows in x do not match rows in y?
There are only two common rows in simple.cars3
and simple.cars4
.
simple.cars3 <- simple.cars[1:5,] simple.cars4 <- simple.cars2[4:8,]
simple.cars3
## id mpg hp ## 1 Mazda RX4 21.0 110 ## 2 Mazda RX4 Wag 21.0 110 ## 3 Datsun 710 22.8 93 ## 4 Hornet 4 Drive 21.4 110 ## 5 Hornet Sportabout 18.7 175
simple.cars4
## car.type vs am ## 4 Hornet 4 Drive 1 0 ## 5 Hornet Sportabout 0 0 ## 6 Valiant 1 0 ## 7 Duster 360 0 0 ## 8 Merc 240D 1 0
Only Hornet
rows(4-5) are common in this case
Then, what happens if we merge these two dataset?
merge(simple.cars3, simple.cars4, by.x = "id", by.y = "car.type")
## id mpg hp vs am ## 1 Hornet 4 Drive 21.4 110 1 0 ## 2 Hornet Sportabout 18.7 175 0 0
If you want to keep all data from both,
merge(simple.cars3, simple.cars4, by.x = "id", by.y = "car.type", all = TRUE)
## id mpg hp vs am ## 1 Datsun 710 22.8 93 NA NA ## 2 Duster 360 NA NA 0 0 ## 3 Hornet 4 Drive 21.4 110 1 0 ## 4 Hornet Sportabout 18.7 175 0 0 ## 5 Mazda RX4 21.0 110 NA NA ## 6 Mazda RX4 Wag 21.0 110 NA NA ## 7 Merc 240D NA NA 1 0 ## 8 Valiant NA NA 1 0
Using all
, we can keep all data and NA fills for missing matches on either side
t()
is used to transpose a matrix or a dataframe
cars[1:3,]
## mpg cyl disp hp drat wt qsec vs am gear carb ## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 ## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 ## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
t(cars)[1:3,]
## Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive Hornet Sportabout ## mpg 21 21 22.8 21.4 18.7 ## cyl 6 6 4.0 6.0 8.0 ## disp 160 160 108.0 258.0 360.0 ## Valiant Duster 360 Merc 240D Merc 230 Merc 280 Merc 280C Merc 450SE ## mpg 18.1 14.3 24.4 22.8 19.2 17.8 16.4 ## cyl 6.0 8.0 4.0 4.0 6.0 6.0 8.0 ## disp 225.0 360.0 146.7 140.8 167.6 167.6 275.8 ## Merc 450SL Merc 450SLC Cadillac Fleetwood Lincoln Continental ## mpg 17.3 15.2 10.4 10.4 ## cyl 8.0 8.0 8.0 8.0 ## disp 275.8 275.8 472.0 460.0 ## Chrysler Imperial Fiat 128 Honda Civic Toyota Corolla Toyota Corona ## mpg 14.7 32.4 30.4 33.9 21.5 ## cyl 8.0 4.0 4.0 4.0 4.0 ## disp 440.0 78.7 75.7 71.1 120.1 ## Dodge Challenger AMC Javelin Camaro Z28 Pontiac Firebird Fiat X1-9 ## mpg 15.5 15.2 13.3 19.2 27.3 ## cyl 8.0 8.0 8.0 8.0 4.0 ## disp 318.0 304.0 350.0 400.0 79.0 ## Porsche 914-2 Lotus Europa Ford Pantera L Ferrari Dino Maserati Bora ## mpg 26.0 30.4 15.8 19.7 15 ## cyl 4.0 4.0 8.0 6.0 8 ## disp 120.3 95.1 351.0 145.0 301 ## Volvo 142E ## mpg 21.4 ## cyl 4.0 ## disp 121.0
reshape
function is used for repeated measurements
We need new dataset : Indometh
Indometh
is data about pharmacokinetics of indometacin
head(Indometh)
## Subject time conc ## 1 1 0.25 1.50 ## 2 1 0.50 0.94 ## 3 1 0.75 0.78 ## 4 1 1.00 0.48 ## 5 1 1.25 0.37 ## 6 1 2.00 0.19
dim(Indometh)
## [1] 66 3
summary(Indometh)
## Subject time conc ## 1:11 Min. :0.250 Min. :0.0500 ## 4:11 1st Qu.:0.750 1st Qu.:0.1100 ## 2:11 Median :2.000 Median :0.3400 ## 5:11 Mean :2.886 Mean :0.5918 ## 6:11 3rd Qu.:5.000 3rd Qu.:0.8325 ## 3:11 Max. :8.000 Max. :2.7200
Subject
: Each of the six subjects were given indometacin
time
: times at which blood samples were drawn (total 11 time points)
conc
: plasma concentrations of indometacin
Indometh[1:15,]
## Subject time conc ## 1 1 0.25 1.50 ## 2 1 0.50 0.94 ## 3 1 0.75 0.78 ## 4 1 1.00 0.48 ## 5 1 1.25 0.37 ## 6 1 2.00 0.19 ## 7 1 3.00 0.12 ## 8 1 4.00 0.11 ## 9 1 5.00 0.08 ## 10 1 6.00 0.07 ## 11 1 8.00 0.05 ## 12 2 0.25 2.03 ## 13 2 0.50 1.63 ## 14 2 0.75 0.71 ## 15 2 1.00 0.70
From reshape
function, we can get “wide” data and “long” data
“wide” means putting repeated measurements in seperate columns
“long” means putting repeated measurements in the same columns
Let’s get “wide” data
wide <- reshape(Indometh, idvar = "Subject", direction = "wide") wide
## Subject conc.0.25 conc.0.5 conc.0.75 conc.1 conc.1.25 conc.2 conc.3 ## 1 1 1.50 0.94 0.78 0.48 0.37 0.19 0.12 ## 12 2 2.03 1.63 0.71 0.70 0.64 0.36 0.32 ## 23 3 2.72 1.49 1.16 0.80 0.80 0.39 0.22 ## 34 4 1.85 1.39 1.02 0.89 0.59 0.40 0.16 ## 45 5 2.05 1.04 0.81 0.39 0.30 0.23 0.13 ## 56 6 2.31 1.44 1.03 0.84 0.64 0.42 0.24 ## conc.4 conc.5 conc.6 conc.8 ## 1 0.11 0.08 0.07 0.05 ## 12 0.20 0.25 0.12 0.08 ## 23 0.12 0.11 0.08 0.08 ## 34 0.11 0.10 0.07 0.07 ## 45 0.11 0.08 0.10 0.06 ## 56 0.17 0.13 0.10 0.09
Let’s go back to “long” data
reshape(wide, direction = "long")
## Subject time conc.0.25 ## 1.0.25 1 0.25 1.50 ## 2.0.25 2 0.25 2.03 ## 3.0.25 3 0.25 2.72 ## 4.0.25 4 0.25 1.85 ## 5.0.25 5 0.25 2.05 ## 6.0.25 6 0.25 2.31 ## 1.0.5 1 0.50 0.94 ## 2.0.5 2 0.50 1.63 ## 3.0.5 3 0.50 1.49 ## 4.0.5 4 0.50 1.39 ## 5.0.5 5 0.50 1.04 ## 6.0.5 6 0.50 1.44 ## 1.0.75 1 0.75 0.78 ## 2.0.75 2 0.75 0.71 ## 3.0.75 3 0.75 1.16 ## 4.0.75 4 0.75 1.02 ## 5.0.75 5 0.75 0.81 ## 6.0.75 6 0.75 1.03 ## 1.1 1 1.00 0.48 ## 2.1 2 1.00 0.70 ## 3.1 3 1.00 0.80 ## 4.1 4 1.00 0.89 ## 5.1 5 1.00 0.39 ## 6.1 6 1.00 0.84 ## 1.1.25 1 1.25 0.37 ## 2.1.25 2 1.25 0.64 ## 3.1.25 3 1.25 0.80 ## 4.1.25 4 1.25 0.59 ## 5.1.25 5 1.25 0.30 ## 6.1.25 6 1.25 0.64 ## 1.2 1 2.00 0.19 ## 2.2 2 2.00 0.36 ## 3.2 3 2.00 0.39 ## 4.2 4 2.00 0.40 ## 5.2 5 2.00 0.23 ## 6.2 6 2.00 0.42 ## 1.3 1 3.00 0.12 ## 2.3 2 3.00 0.32 ## 3.3 3 3.00 0.22 ## 4.3 4 3.00 0.16 ## 5.3 5 3.00 0.13 ## 6.3 6 3.00 0.24 ## 1.4 1 4.00 0.11 ## 2.4 2 4.00 0.20 ## 3.4 3 4.00 0.12 ## 4.4 4 4.00 0.11 ## 5.4 5 4.00 0.11 ## 6.4 6 4.00 0.17 ## 1.5 1 5.00 0.08 ## 2.5 2 5.00 0.25 ## 3.5 3 5.00 0.11 ## 4.5 4 5.00 0.10 ## 5.5 5 5.00 0.08 ## 6.5 6 5.00 0.13 ## 1.6 1 6.00 0.07 ## 2.6 2 6.00 0.12 ## 3.6 3 6.00 0.08 ## 4.6 4 6.00 0.07 ## 5.6 5 6.00 0.10 ## 6.6 6 6.00 0.10 ## 1.8 1 8.00 0.05 ## 2.8 2 8.00 0.08 ## 3.8 3 8.00 0.08 ## 4.8 4 8.00 0.07 ## 5.8 5 8.00 0.06 ## 6.8 6 8.00 0.09
reshape(data, idvar, direction)
idvar
needs to identify multiple records from the same individual
direction
matches to either “wide” or “long”
Let’s find the mean of these variables
cars2 <- cars[,c("mpg", "disp", "hp", "wt", "qsec")] head(cars2)
## mpg disp hp wt qsec ## Mazda RX4 21.0 160 110 2.620 16.46 ## Mazda RX4 Wag 21.0 160 110 2.875 17.02 ## Datsun 710 22.8 108 93 2.320 18.61 ## Hornet 4 Drive 21.4 258 110 3.215 19.44 ## Hornet Sportabout 18.7 360 175 3.440 17.02 ## Valiant 18.1 225 105 3.460 20.22
mpg.mean <- mean(cars2[,1]) disp.mean <- mean(cars2[,2]) hp.mean <- mean(cars2[,3]) wt.mean <- mean(cars2[,4]) qsec.mean <- mean(cars2[,5]) c(mpg.mean, disp.mean, hp.mean, wt.mean, qsec.mean)
## [1] 20.09062 230.72188 146.68750 3.21725 17.84875
We can do this with easy and simple way!
apply(cars2, 2, mean)
## mpg disp hp wt qsec ## 20.09062 230.72188 146.68750 3.21725 17.84875
We can apply several functions
median
, sum
, min
, max
, or even user-defined functionapply(cars2, 2, max)
## mpg disp hp wt qsec ## 33.900 472.000 335.000 5.424 22.900
apply(X, margin, function)
X
: data
margin
: 1 indicates rows, 2 indicates columns
function
: the function to be applied
Version control is a system that records changes to a file or set of files over time. This way you can recall specific versions later.
One easy way to utilize version control software is through Github.
If you don’t already have a github account, proceed to this link and make one:
Let’s look at an example here:
https://github.com/SDeMora/cnn_transcripts
How can I integrate R easily with Github?
Time to introduce ourselves!
Go to the shell (Terminal) and type in our information like this:
Let’s create our first repo!
How can we link it with our machines and R?
Once you’re in RStudio…