Tidy Data

  • 분석하기 좋은 형태로 정리된 Data
  • Codd’s 3rd normal form
    1. Each variable​ forms a column.
    2. Each observation​ forms a row.  (시행 또는 Trial을 확인한 결과, 또는 관찰값) 
    3. Each type of observational unitforms a table.


The five most common problems with messy datasets

● Column headers are values, not variable names.  (숫자, 연도, 구간, 수/우/미/양/가를  Column 이름으로 쓰면 안된다)  

● Multiple variables are stored in one column.

● Variables are stored in both rows and columns.

● Multiple types of observational units are stored in the same table.

● A single observational unit is stored in multiple tables.


  • Wide form 보다 Long Form이 더 선호됨 
  • Data 정규화(=중복제거)하는 것이  선호되지만 그렇지 않은 경우도 있음
    • Space 절약 및 일관성 측면에서 정규화가 중요하나 추출비용이 증가 


dplyr 소개

● d: data.frame, ply: apply, r

● dplyr: data.frame ­> data.frame

● ‘plyr’ packagedata.frame 특화 버전

참고: The Split­Apply­Combine Strategy for Data Analysis

dplyr의 기본 구조


[R] 데이터 처리의 새로운 강자, dplyr 패키지

dplyr 패키지의 기본이 되는 것은 다음 5개 함수입니다.
함수명내용유사함수
filter()지정한 조건식에 맞는 데이터 추출subset()
select()열의 추출data[, c(“Year”, “Month”)]
mutate()열 추가transform()
arrange()정렬order(), sort()
summarise()집계aggregate()


DPLYR: 데이터 핸들링 쉽게 하기

 특히, data.table 패키지와 함께 사용하면 단순히 data.frame을 사용했을 때보다 속도 이득이 있다. RDBMS에도 dplyr을 이용해서 데이터 접근이 가능하고 추출 또한 쉽다. R에서 데이터를 다룰 때 가장 편한 패키지라는 생각이 든다.



Rmarkdown으로 문서 작성하기

● markdown 문법:

https://github.com/adam­p/markdown­here/wiki/Markdown­Cheatsheet

● R + markdown = Rmarkdown (.Rmd 파일)

보고서를 코딩한다

기존 방식

분석 결과 만들고복사해서 편집기(워드프로세서혹은 엑셀)에 붙여 넣고 편집literate programming

분석 과정과 문서 작성을 한번에 하고결과물을 출력.


Knit와 R



#**********************************
# Lecture 5: handling data set with dplyr
# 2015-03-10
# jangsea.park@gmail.com
#**********************************
# install packages --------------------------------------------------------
# check if ggplot2, dplyr are already installed
installed_packages <- row.names(installed.packages())
if (!('dplyr' %in% installed_packages)) install.packages('dplyr')
# if ram runs short then go to AWS which is cheap!

# load packages -----------------------------------------------------------
library(dplyr)
# read data ---------------------------------------------------------------
# check working directory
getwd()
# check data file
datapath <- '/Users/tommy/Documents/Data Analytics/lecture05/Project 5'
list.files(path=datapath,
pattern='*.csv')
# read data

# Character vectors should NOT be converted to factors.
bike <- read.csv(file='./bike_sharing_hour.csv',
stringsAsFactors=F
)
# check data
head(bike)
View(bike)
str(bike)
class(bike)
# make POSIXlt vector:
ts <- strptime(paste(bike$dteday, bike$hr), '%Y-%m-%d %H')
bike <- cbind(bike, ts)
# Starting dplyr::tbl_df --------------------------------------------------
bike <- tbl_df(bike) # make it friendly with dplyr
bike
class(bike) # Class is changed.

# dplyr::filter -------------- select rows -------------------------------
?dplyr::filter # Return rows with matching conditions
# only spring
filter(bike, season == 1) # if season=1 then..
# temp > 0.5
filter(bike, temp > 0.5)
# spring and fall
filter(bike, season %in% c(1, 3)) # %in% means in between
filter(bike, season == 1 | 3) # | means or
# 1:10 %in% 1:2
# not winter
filter(bike, season != 4) # != means not
# feel hotter
filter(bike, atemp * 50 > temp * 41) # denormalize

# dplyr:select ------------------------------------------------------------
?dplyr::select # select columns vs (filter)
select(bike, c(dteday, holiday, weekday, workingday)) # reduce columns to 4
select(bike, -instant) # - means 'except'

# dplyr::arrange ----------------------------------------------------------
?dplyr::arrange
# sort by cnt : consumes CPU alot
arrange(bike, cnt) # in order sorting
# sort by cnt desc
arrange(bike, desc(cnt))

# dplyr::mutate ------------------- add new cloumn -------------------------
?dplyr::mutate
# make original temp
# make original temp, atemp, hum, windspeed
mutate(bike, o.temp = temp * 41, o.atemp = atemp * 50, o.hum = hum * 100, o.windspeed = windspeed * 67)


# indent!!
mutate(bike,
o.temp = temp * 41,
o.atemp = atemp * 50,
o.hum = hum * 100,
o.windspeed = windspeed * 67)

# dplyr::summarise --------------------------------------------------------
?dplyr::summarise
# sum, mean(average)
summarise(bike, sum(cnt))
summarise(bike, mean(cnt))
# group by season
?dplyr::group_by # Group a tbl by one or more variables.
by_season <- group_by(bike, season)
by_season
summarise(by_season, sum(cnt))
summarise(by_season, mean(cnt))

# pipe line ---------------------------------------------------------------
?dplyr::chain # Chain together multiple operations.
# sort by cnt
# calcuate original temp
# filter original temp > 10
# group by day
# calucate sum, mean of cnt by day
# DIRTY nesting # moethod of ((())) # thinking vs. code
daily_summary <-
summarise(
group_by(
filter(
mutate(
arrange(bike, cnt),
o.temp = temp * 41),
o.temp > 10),
dteday),
sum.cnt = sum(cnt), mean.cnt = mean(cnt))
# NEAT pipeline # pipe operator: pipe to right
daily_summary <- bike %>%
arrange(cnt) %>%
mutate(o.temp = temp * 41) %>%
filter(o.temp > 10) %>%
group_by(dteday) %>%
summarise(sum.cnt = sum(cnt),
mean.cnt = mean(cnt))

# more about dplyr --------------------------------------------------------
browseVignettes(package = "dplyr") # read Vignettes


Posted by Name_null