Tidy Data
- 분석하기 좋은 형태로 정리된 Data
- Codd’s 3rd normal form
1. Each variable forms a column.
2. Each observation forms a row. (시행 또는 Trial을 확인한 결과, 또는 관찰값)
3. Each type of observational unitforms a table.
The five most common problems with messy datasets
● Column headers are values, not variable names. (숫자, 연도, 구간, 수/우/미/양/가를 Column 이름으로 쓰면 안된다)
● Multiple variables are stored in one column.
● Variables are stored in both rows and columns.
● Multiple types of observational units are stored in the same table.
● A single observational unit is stored in multiple tables.
- Wide form 보다 Long Form이 더 선호됨
- Data 정규화(=중복제거)하는 것이 선호되지만 그렇지 않은 경우도 있음
- Space 절약 및 일관성 측면에서 정규화가 중요하나 추출비용이 증가
dplyr 소개
● d: data.frame, ply: apply, r
● dplyr: data.frame > data.frame
● ‘plyr’ package의 data.frame 특화 버전
● 참고: The SplitApplyCombine Strategy for Data Analysis
dplyr의 기본 구조
[R] 데이터 처리의 새로운 강자, dplyr 패키지
함수명 | 내용 | 유사함수 |
---|---|---|
filter() | 지정한 조건식에 맞는 데이터 추출 | subset() |
select() | 열의 추출 | data[, c(“Year”, “Month”)] |
mutate() | 열 추가 | transform() |
arrange() | 정렬 | order(), sort() |
summarise() | 집계 | aggregate() |
DPLYR: 데이터 핸들링 쉽게 하기
Rmarkdown으로 문서 작성하기
● markdown 문법:
https://github.com/adamp/markdownhere/wiki/MarkdownCheatsheet
● R + markdown = Rmarkdown (.Rmd 파일)
“보고서를 코딩한다”
기존 방식
분석 결과 만들고, 복사해서 편집기(워드프로세서, 혹은 엑셀)에 붙여 넣고 편집. literate programming
분석 과정과 문서 작성을 한번에 하고, 결과물을 출력.
#**********************************
# Lecture 5: handling data set with dplyr
# 2015-03-10
# jangsea.park@gmail.com
#**********************************
# install packages --------------------------------------------------------
# check if ggplot2, dplyr are already installed
installed_packages <- row.names(installed.packages())
if (!('dplyr' %in% installed_packages)) install.packages('dplyr')
# if ram runs short then go to AWS which is cheap!
# load packages -----------------------------------------------------------
library(dplyr)
# read data ---------------------------------------------------------------
# check working directory
getwd()
# check data file
datapath <- '/Users/tommy/Documents/Data Analytics/lecture05/Project 5'
list.files(path=datapath,
pattern='*.csv')
# read data
# Character vectors should NOT be converted to factors.
bike <- read.csv(file='./bike_sharing_hour.csv',
stringsAsFactors=F
)
# check data
head(bike)
View(bike)
str(bike)
class(bike)
# make POSIXlt vector:
ts <- strptime(paste(bike$dteday, bike$hr), '%Y-%m-%d %H')
bike <- cbind(bike, ts)
# Starting dplyr::tbl_df --------------------------------------------------
bike <- tbl_df(bike) # make it friendly with dplyr
bike
class(bike) # Class is changed.
# dplyr::filter -------------- select rows -------------------------------
?dplyr::filter # Return rows with matching conditions
# only spring
filter(bike, season == 1) # if season=1 then..
# temp > 0.5
filter(bike, temp > 0.5)
# spring and fall
filter(bike, season %in% c(1, 3)) # %in% means in between
filter(bike, season == 1 | 3) # | means or
# 1:10 %in% 1:2
# not winter
filter(bike, season != 4) # != means not
# feel hotter
filter(bike, atemp * 50 > temp * 41) # denormalize
# dplyr:select ------------------------------------------------------------
?dplyr::select # select columns vs (filter)
select(bike, c(dteday, holiday, weekday, workingday)) # reduce columns to 4
select(bike, -instant) # - means 'except'
# dplyr::arrange ----------------------------------------------------------
?dplyr::arrange
# sort by cnt : consumes CPU alot
arrange(bike, cnt) # in order sorting
# sort by cnt desc
arrange(bike, desc(cnt))
# dplyr::mutate ------------------- add new cloumn -------------------------
?dplyr::mutate
# make original temp
# make original temp, atemp, hum, windspeed
mutate(bike, o.temp = temp * 41, o.atemp = atemp * 50, o.hum = hum * 100, o.windspeed = windspeed * 67)
# indent!!
mutate(bike,
o.temp = temp * 41,
o.atemp = atemp * 50,
o.hum = hum * 100,
o.windspeed = windspeed * 67)
# dplyr::summarise --------------------------------------------------------
?dplyr::summarise
# sum, mean(average)
summarise(bike, sum(cnt))
summarise(bike, mean(cnt))
# group by season
?dplyr::group_by # Group a tbl by one or more variables.
by_season <- group_by(bike, season)
by_season
summarise(by_season, sum(cnt))
summarise(by_season, mean(cnt))
# pipe line ---------------------------------------------------------------
?dplyr::chain # Chain together multiple operations.
# sort by cnt
# calcuate original temp
# filter original temp > 10
# group by day
# calucate sum, mean of cnt by day
# DIRTY nesting # moethod of ((())) # thinking vs. code
daily_summary <-
summarise(
group_by(
filter(
mutate(
arrange(bike, cnt),
o.temp = temp * 41),
o.temp > 10),
dteday),
sum.cnt = sum(cnt), mean.cnt = mean(cnt))
# NEAT pipeline # pipe operator: pipe to right
daily_summary <- bike %>%
arrange(cnt) %>%
mutate(o.temp = temp * 41) %>%
filter(o.temp > 10) %>%
group_by(dteday) %>%
summarise(sum.cnt = sum(cnt),
mean.cnt = mean(cnt))
# more about dplyr --------------------------------------------------------
browseVignettes(package = "dplyr") # read Vignettes
'Fast Camp_DA' 카테고리의 다른 글
Fast Campus: Data Analytics 15일차 (4/15) (0) | 2015.04.15 |
---|---|
Fast Campus: Data Analytics 14일차 (4/12) (0) | 2015.04.15 |
Fast Campus: Data Analytics 4일차 (3/8) (0) | 2015.03.08 |
Fast Campus: Data Analytics 3일차 (3/4) Part II (0) | 2015.03.08 |
Fast Campus: Data Analytics 3일차 (3/4) Part I (0) | 2015.03.04 |