class: center, middle, inverse, title-slide # Starting with data frames ## Data analysis and visualization in R
UC Merced ### Andrea Sánchez-Tapia
Rio de Janeiro Botanical Garden - ¡liibre! - RLadies+ ### 2021-03-23 --- ## last time + we setup a project and its file structure + we started using R inside RStudio + we created numerical, character and logical vectors with `c()` + we learned to subset vectors with brackets `[]` and other functions: `length()`, `:`, `seq(from, to, interval)` + vector subset can be done via numeric or logical indexes --- ## data structures in R + __vector__: lineal arrays (one dimension: only length) -- + __matrices__: arrays of vectors of the same type (all numeric or all character, for instance) (two dimensions: width and length) -- + __data frames__: two-dimensional structures ("rectangular") but might be of combined types (i.e., column 1 with names, column 2 with numbers) -- + __factors__: vectors (one-dimensional) representing __categorical variables__ and thus having __levels__ -- + __lists__: literally lists, of objects that can be of any type (a list of data frames, or different objects) -- + __arrays__ are similar to matrices and dataframes but may be three-dimensional ("layered" data frames) --- ## matrices + data have to be of the same type ```r ?matrix matrix(nrow = 4, ncol = 3) ``` ``` ## [,1] [,2] [,3] ## [1,] NA NA NA ## [2,] NA NA NA ## [3,] NA NA NA ## [4,] NA NA NA ``` --- ## matrices + you can also fit vectors with the correct dimensions ```r nums <- 1:12 matrix(data = nums, nrow = 3) matrix(data = nums, nrow = 3, byrow = TRUE) ``` --- ## matrices + naming a matrix: `NULL or a list of length 2 giving the row and column names respectively.` (?) look at the examples! --- ## matrices ```r dim1_names <- c("row1", "row2", "row3") dim2_names <- c("col1", "col2", "col3", "col4") names_matrix <- list(dim_1 = dim1_names, dim_2 = dim2_names) str(names_matrix) m <- matrix(data = nums, nrow = 3, dimnames = names_matrix) dim(m) dimnames(m) ``` -- + you can convert easily between data types `data.frame("m")`, `as.data.frame("m")`, `as.vector(m)`. (the same goes for changes between "numeric", "logical") --- class: inverse, middle, center ## starting with data frames --- ## the survey dataset + Data frames: one row per sampling unit (individual), one column per variable <img src="./figs/columns.png" width="400" style="display: block; margin: auto;" /> --- ## downloading the dataset We are going to download the file to our `./data/raw` sub folder: ```r download.file(url = "https://ndownloader.figshare.com/files/2292169", destfile = "./data/raw/portal_data_joined.csv") ``` --- ## reading files into R Functions to read data are key to any project. for data frames: `read.csv()`, `read.delim()` ```r surveys <- read.csv("./data/raw/portal_data_joined.csv") surveys_check <- read.table(file = "./data/raw/portal_data_joined.csv", sep = ",", header = TRUE) identical(surveys, surveys_check) ``` ``` ## [1] TRUE ``` --- ## reading files into R + Package __readr__ + Package __data.table__ (`data.table::fread()`) when you need to open a large file + Excel spreadsheets: `readxl::read_excel()` + __Graphic interface__ There are __many other ways__ to read data into R, some are specific for the type of data (GIS shapefiles or raster, and specific packages may come with their own reader functions) --- ## inspecting `data.frame` objects ```r str(surveys) dim(surveys) nrow(surveys) ncol(surveys) head(surveys) # 6 rows by default tail(surveys) names(surveys) rownames(surveys) length(surveys) # number of columns summary(surveys) ``` --- ## inspecting `data.frame` objects Based on the output of `str(surveys)`, can you answer the following questions? + What is the class of the object surveys? + How many rows and how many columns are in this object? + What is the type of data of the columns? --- ## indexing and subsetting data frames + a vector has only one dimension, so: + `length()` refers to number of __elements__ + `dim()` + selection between brackets `[]` -- + a data.frame has __two__ dimensions: `dim()`, `ncol()`, `nrow()` selection between brackets `[]` __BUT with the two dimensions separated by a comma__: `[rows, columns]` -- + we'll try to refer to these operations as __selecting columns__ and __filtering rows__ --- ## selecting columns + with numeric indexes and vectors ```r surveys[, 6] surveys[1, ] surveys[ , 13] surveys[4, 13] surveys[1:4, 1:3] ``` --- ## indexing and subsetting data frames + minus sign to __remove__ the indexed column or row ```r # The whole data frame, except the first column surveys[, -1] nrow(surveys) surveys[-(7:34786), ] # Equivalent to head(surveys) ``` --- ## selecting columns by name ```r names(surveys) surveys["species_id"] # Result is a data.frame surveys[["species_id"]] # Result is a vector * surveys[, "species_id"] # Result is a vector * surveys$species_id # Result is a vector ``` + R has several ways to do some things --- ## indexing and subsetting data frames <small> ```r sub <- surveys[1:10,] # first element in the first column of the data frame # first element in the 6th column # first column of the data frame (as a vector) # first column of the data frame (as a dataframe) # first three elements in the 7th column (as a vector) # the 3rd row of the data frame # equivalent to head_surveys <- head(surveys) ``` --- ## indexing and subsetting data frames <small> ```r sub <- surveys[1:10,] # first element in the first column of the data frame sub[1, 1] # first element in the 6th column sub[1, 6] # first column of the data frame (as a vector) sub[, 1] # first column of the data frame (as a dataframe) sub[1] # first three elements in the 7th column (as a vector) sub[1:3, 7] # the 3rd row of the data frame sub[3, ] # equivalent to head_surveys <- head(surveys) head_surveys <- surveys[1:6, ] ``` --- ## challenge <small> + Create a data.frame (`surveys_200`) containing only the data in row 200 of the `surveys` dataset -- + Notice how `nrow()` gave you the number of rows in a data.frame? Use that number to pull out just that last row in the data frame -- + Compare that with what you see as the last row using `tail()` to make sure it’s meeting expectations -- + Pull out that last row using `nrow()` instead of the row number. -- + Create a new data frame (`surveys_last`) from that last row. -- + Use `nrow()` to extract the row that is in the middle of the data frame. Store the content of this row in an object named `surveys_middle`. -- + Combine `nrow()` with the - notation above to reproduce the behavior of `head(surveys)`, keeping just the first through 6th rows of the surveys dataset. </small> --- ## dealing with missing data ```r sub <- surveys[1:10,] #str(sub) sub$hindfoot_length sub$hindfoot_length == NA #it cannot compare! because it's NA #we use is.na: #is.na(sub$hindfoot_length) # yes! returns a logical vector sub$hindfoot[!is.na(sub$hindfoot_length)] ``` --- ## dealing with missing data + in some functions: `na.rm` ```r mean(sub$hindfoot_length) ``` ``` ## [1] NA ``` ```r mean(sub$hindfoot_length, na.rm = T) ``` ``` ## [1] 31.5 ``` --- ## dealing with missing data + Dealing with missing data in dataframes: __filtering rows that have NAs__ ```r non_NA_w <- surveys[!is.na(surveys$weight),] dim(non_NA_w) ``` ``` ## [1] 32283 13 ``` ```r non_NA <- surveys[!is.na(surveys$weight) & !is.na(surveys$hindfoot_length),] dim(non_NA) ``` ``` ## [1] 30738 13 ``` --- ## dealing with NAs ```r #complete.cases(surveys) surveys1 <- surveys[complete.cases(surveys) , ] surveys2 <- na.omit(surveys) dim(surveys1) ``` ``` ## [1] 30738 13 ``` ```r dim(surveys2) ``` ``` ## [1] 30738 13 ``` --- ## write csv objects to disk ```r if (!dir.exists("data/processed")) dir.create("data/processed", recursive = T) write.csv(surveys1, "data/processed/surveys_mod.csv") ``` #### remember you never overwrite your original, raw data! --- ## read the modified csv ```r surveys <- read.csv("data/processed/surveys_mod.csv") str(surveys) ``` --- ## factors + __factors__: vectors (one-dimensional) representing __categorical variables__ and thus having __levels__. ordered (`c(“low”, “medium”, “high”)` or unordered (`c("green", "blue", "red")`) + R < 4.0 had a default behavior `stringsAsFactors = TRUE` so any character column was transformed into a factor ```r `?read.csv()` ?default.stringsAsFactors ``` #### today if we want factors we have to transform the vectors --- ## factors <small> ```r ## Compare the difference between our data read as #`factor` vs `character`. surveys <- read.csv("data/raw/portal_data_joined.csv", stringsAsFactors = FALSE) str(surveys) surveys <- read.csv("data/raw/portal_data_joined.csv", stringsAsFactors = TRUE) str(surveys) ``` --- ## factors Convert the column "plot_type" and "sex" into a factor: ```r surveys$plot_type <- factor(surveys$plot_type) surveys$sex <- factor(surveys$sex) ``` (actually this is a way to create new columns) --- ## working with factors <small> ```r sex <- factor(c("male", "female", "female", "male")) levels(sex) # in alphabetical order! nlevels(sex) sex sex <- factor(sex, levels = c("male", "female")) sex # after re-ordering as.character(sex) ``` --- ## let's make a plot of a factor variable `plot(as.factor(surveys$sex))` .pull-right[let's rename this label] ![](02_Starting_with_data_files/figure-html/plot-1.png)<!-- --> --- ## let's make a plot of a factor variable `plot(sex)` .pull-right[let's rename this label] ![](02_Starting_with_data_files/figure-html/plot2-1.png)<!-- --> --- ## challenge <small> .pull-left[ ![](02_Starting_with_data_files/figure-html/remedy003-1.png)<!-- --> ] .pull-right[ + Rename “F” and “M” to “female” and “male” respectively. + Now that we have renamed the factor level to “undetermined”, can you recreate the barplot such that “undetermined” is last (after “male”)? ] --- ## some basic plotting ```r plot(surveys$hindfoot_length) plot(surveys$weight) plot(sort(surveys$hindfoot_length)) plot(sort(surveys$weight)) ``` --- ## scatterplots + two continuous variables ```r x <- surveys$weight y <- surveys$hindfoot_length plot(x, y) #but you can also avoid creating the vectors plot(surveys$weight, surveys$hindfoot_length) #check the help for parameter order, x, y axis. ``` --- ## boxplots ```r head(surveys$plot_type) levels(surveys$plot_type) ``` # boxplots ```r plot(surveys$weight ~ surveys$plot_type) ``` + check the __tilde__ `~` notation, formula: "as a function of" `plot(a ~ b)` `plot(b, a)` --- ## plotting basics all parameters for plotting are in function `par()` ```r plot(x, y) plot(x, y, xlab = "Hindfoot length", ylab = "Weight", cex = 0.9, pch = 19) plot(x, y, xlab = "Hindfoot length", ylab = "Weight", cex = 0.9, pch = 19, col = "red") plot(x, y, xlab = "Hindfoot length", ylab = "Weight", cex = 0.9, pch = 19, col = surveys$sex) ``` + pch, cex, xlab, ylab, + __saving a plot!__ --- class: center, middle # ¡Thanks! <center> <svg viewBox="0 0 512 512" style="position:relative;display:inline-block;top:.1em;fill:#A70000;height:1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M476 3.2L12.5 270.6c-18.1 10.4-15.8 35.6 2.2 43.2L121 358.4l287.3-253.2c5.5-4.9 13.3 2.6 8.6 8.3L176 407v80.5c0 23.6 28.5 32.9 42.5 15.8L282 426l124.6 52.2c14.2 6 30.4-2.9 33-18.2l72-432C515 7.8 493.3-6.8 476 3.2z"></path></svg> [andreasancheztapia@gmail.com](mailto:andreasancheztapia@gmail.com) <svg viewBox="0 0 512 512" style="position:relative;display:inline-block;top:.1em;fill:#A70000;height:1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M459.37 151.716c.325 4.548.325 9.097.325 13.645 0 138.72-105.583 298.558-298.558 298.558-59.452 0-114.68-17.219-161.137-47.106 8.447.974 16.568 1.299 25.34 1.299 49.055 0 94.213-16.568 130.274-44.832-46.132-.975-84.792-31.188-98.112-72.772 6.498.974 12.995 1.624 19.818 1.624 9.421 0 18.843-1.3 27.614-3.573-48.081-9.747-84.143-51.98-84.143-102.985v-1.299c13.969 7.797 30.214 12.67 47.431 13.319-28.264-18.843-46.781-51.005-46.781-87.391 0-19.492 5.197-37.36 14.294-52.954 51.655 63.675 129.3 105.258 216.365 109.807-1.624-7.797-2.599-15.918-2.599-24.04 0-57.828 46.782-104.934 104.934-104.934 30.213 0 57.502 12.67 76.67 33.137 23.715-4.548 46.456-13.32 66.599-25.34-7.798 24.366-24.366 44.833-46.132 57.827 21.117-2.273 41.584-8.122 60.426-16.243-14.292 20.791-32.161 39.308-52.628 54.253z"></path></svg> [@SanchezTapiaA](https://twitter.com/SanchezTapiaA) <svg viewBox="0 0 496 512" style="position:relative;display:inline-block;top:.1em;fill:#A70000;height:1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"></path></svg><svg viewBox="0 0 512 512" style="position:relative;display:inline-block;top:.1em;fill:#A70000;height:1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M105.2 24.9c-3.1-8.9-15.7-8.9-18.9 0L29.8 199.7h132c-.1 0-56.6-174.8-56.6-174.8zM.9 287.7c-2.6 8 .3 16.9 7.1 22l247.9 184-226.2-294zm160.8-88l94.3 294 94.3-294zm349.4 88l-28.8-88-226.3 294 247.9-184c6.9-5.1 9.7-14 7.2-22zM425.7 24.9c-3.1-8.9-15.7-8.9-18.9 0l-56.6 174.8h132z"></path></svg> <svg viewBox="0 0 512 512" style="position:relative;display:inline-block;top:.1em;fill:#A70000;height:1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M22.2 32A16 16 0 0 0 6 47.8a26.35 26.35 0 0 0 .2 2.8l67.9 412.1a21.77 21.77 0 0 0 21.3 18.2h325.7a16 16 0 0 0 16-13.4L505 50.7a16 16 0 0 0-13.2-18.3 24.58 24.58 0 0 0-2.8-.2L22.2 32zm285.9 297.8h-104l-28.1-147h157.3l-25.2 147z"></path></svg> [andreasancheztapia](http://github.com/andreasancheztapia)