daTa-dRiveN

2015. 10. 26. 14:21

R

Data Analysis and Statistical Inference: Introduction to data

Welcome!

# Load the cdc data frame into the workspace:

load(url("http://assets.datacamp.com/course/dasi/cdc.Rdata")) # load() restores cdc into the workspace; its return value is only the object names

head(cdc)

dim(cdc)

str(cdc)

# The Behavioral Risk Factor Surveillance System (BRFSS) is an annual telephone survey of 350,000 people in the United States.

# As its name implies, the BRFSS is designed to identify risk factors in the adult population and report emerging health trends.

# For example, respondents are asked about their diet and weekly physical activity, their HIV/AIDS status, possible tobacco use, and even their level of healthcare coverage.

Which variables are you working with?

# The cdc data frame is already loaded into the workspace

# Print the names of the variables:

names(cdc)

Taking a peek at your data

# The cdc data frame is already loaded into the workspace

# Print the head and tails of the data frame:

head(cdc)

tail(cdc)

# http://www.cdc.gov/brfss/

# names() returns a vector of variable names in which each name corresponds to a question that was asked in the survey.

# For example, for genhlth, respondents were asked to evaluate their general health from excellent down to poor.

# The exerany variable indicates whether the respondent exercised in the past month (1) or did not (0).

# Likewise, hlthplan indicates whether the respondent had some form of health coverage.

# The smoke100 variable indicates whether the respondent had smoked at least 100 cigarettes in their lifetime.

Let's refresh

# The cdc data frame is already loaded into the workspace.

# View the head or tail of both the height and the genhlth variables:

head(cdc$height)

head(cdc$genhlth)

# Assign your sum here:

sum <- 84941 + 19686

# Assign your multiplication here:

mult <- 73 * 51

Turning info into knowledge - Numerical data

# The cdc data frame is already loaded into the workspace

mean(cdc$weight)

var(cdc$weight)

median(cdc$weight)

summary(cdc$weight)

Turning info into knowledge - Categorical data

# Categorical data: look at absolute or relative frequencies.

# The cdc data frame is already loaded into the workspace.

# Create the frequency table here:

table(cdc$genhlth) # counts per category, returned as a table

plot(cdc$genhlth) # plotting a factor draws a bar chart of the counts

# Create the relative frequency table here:

table(cdc$genhlth) / dim(cdc)[1] # dim() returns the number of rows and columns

table(cdc$genhlth) / nrow(cdc) # nrow() gives the same result

# dim(x)[1] -> number of rows

# dim(x)[2] -> number of columns
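As a cross-check on the division above, base R's prop.table() performs the same normalization. A minimal sketch on a made-up factor (the values are hypothetical, not the cdc data):

```r
# Hypothetical stand-in for cdc$genhlth; not the real survey data
genhlth <- factor(c("good", "good", "fair", "excellent", "good", "fair"))

counts <- table(genhlth)                # absolute frequencies
rel_manual <- counts / length(genhlth)  # divide by the number of observations
rel_builtin <- prop.table(counts)       # same normalization, done by base R

# Both approaches give proportions that sum to 1
print(rel_manual)
print(rel_builtin)
```

Either form works; prop.table() simply avoids having to look up the row count separately.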

Creating your first barplot

# The cdc data frame is already loaded into the workspace.

# Draw the barplot:

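table(cdc$smoke100) # tabulate first

barplot(table(cdc$smoke100)) # bar chart with two bars, one per value (0 and 1)

# table() first collapses the column into counts for 0 and 1.

# Then the plot is drawn from those counts.

barplot(cdc$smoke100)

# Without table(), every row gets its own bar of height 0 or 1, so the bars run together and the plot looks solid black.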

Even prettier: the Mosaic Plot

# The cdc data frame is already loaded into the workspace

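gender_smokers <- table(cdc$gender,cdc$smoke100) # cross-tabulate: gender in rows, smoke100 in columns

# table() accepts several variables, one per axis of the table.

gender_smokers

# Plot the mosaicplot:

mosaicplot(gender_smokers)

# In a mosaic plot the area of each tile reflects the cell count, so proportions can be judged by eye.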

Interlude: How R thinks about data (1)

# The cdc data frame is already loaded into the workspace

head(cdc)

cdc[1337,]

cdc[111,]

# Create the subsets:

height_1337 <- cdc[1337,5]

weight_111 <- cdc[111,6]

# Print the results:

height_1337

weight_111

Interlude (2)

# The cdc data frame is already loaded into the workspace

# Create the subsets:

first8 <- cdc[1:8, 3:5] # select a rectangular block: rows 1 to 8, columns 3 to 5

wt_gen_10_20 <- cdc[10:20 , 6:9]

# Print the subsets:

first8

wt_gen_10_20

Interlude (3)

# The cdc data frame is already loaded into the workspace

# Create the subsets:

resp205 <- cdc[205,]

ht_wt <- cdc[,5:6]

# Print the subsets:

resp205

head(ht_wt)

str(ht_wt)

Interlude (4)

# The cdc data frame is already loaded into the workspace

# Create the subsets:

resp1000_smk <- cdc$smoke100[1000]

first30_ht <- cdc$height[1:30]

# Print the subsets:

resp1000_smk

first30_ht

A little more on subsetting

# The cdc data frame is already loaded into the workspace

str(cdc)

# Create the subsets:

very_good <- subset(cdc, genhlth =="very good") # rows where genhlth is "very good"

age_gt50 <- subset(cdc, age > 50) # rows where age is above 50

# subset(data_frame, condition) -> returns the rows that satisfy the condition (==, >=, ...)

# Print the subsets:

head(very_good)

dim(very_good)[1] # 6,972 of the 20,000 rows

head(age_gt50)

dim(age_gt50)[1] # 6,938 of the 20,000 rows
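The same row selection can also be written with a logical index inside square brackets. A small sketch on a hypothetical four-row data frame (made-up values, not the cdc data) showing that the two forms agree:

```r
# Hypothetical miniature of the cdc data frame (made-up values)
toy <- data.frame(
  genhlth = c("very good", "good", "very good", "poor"),
  age     = c(34, 61, 52, 47)
)

via_subset <- subset(toy, age > 50)  # condition evaluated inside the data frame
via_index  <- toy[toy$age > 50, ]    # same rows via a logical row index

nrow(via_subset)                     # number of matching rows
```

subset() is convenient interactively because column names need no `toy$` prefix; the bracket form is the more explicit equivalent.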

Subset - one last time

# The cdc data frame is already loaded into the workspace

# Create the subset:

under23_and_smoke <- subset(cdc, age < 23 & smoke100 == 1) # careful: a comparison needs ==, not =

# Print the top six rows of the subset:

head(under23_and_smoke)

Visualizing with box plots

# The cdc data frame is already loaded into the workspace.

# Draw the box plot of the respondents heights:

boxplot(cdc$height)

# Print the summary:

summary(cdc$height)

# Min. - 1st Qu. - Median - Mean - 3rd Qu. - Max.

More on box plots

# This notation is new. The ~ operator can be read “versus” or “as a function of”.

# The cdc data frame is already loaded into the workspace.

# Draw the box plot of the weights versus smoking:

boxplot(cdc$weight ~ cdc$smoke100) # smoke100 on the x-axis, one box per group

# A box plot shows the distribution of a numerical variable.

# The ~ operator splits that variable by the levels of a categorical one.
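The formula can also be written with bare column names plus a `data =` argument, which avoids repeating the data frame name. A sketch on hypothetical values (not the cdc data); `plot = FALSE` is used here only so the grouped statistics can be inspected without drawing:

```r
# Hypothetical data: weights for non-smokers (0) and smokers (1)
toy <- data.frame(
  weight   = c(150, 160, 170, 180, 190, 200),
  smoke100 = c(0, 0, 0, 1, 1, 1)
)

# y ~ group: one box per level of smoke100; plot = FALSE returns the
# statistics without opening a graphics device
stats <- boxplot(weight ~ smoke100, data = toy, plot = FALSE)

stats$names       # group labels, one per box
stats$stats[3, ]  # row 3 of the five-number summary holds the medians
```

Dropping `plot = FALSE` draws the same two boxes that the statistics describe.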

One last box plot

# The cdc data frame is already loaded into the workspace.

# Calculate the BMI:

bmi <- cdc$weight /(cdc$height^2) * 703.0

# Draw the box plot:

boxplot(bmi ~ cdc$genhlth) # BMI distribution for each general-health level
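The factor 703 in the BMI line converts imperial units to the usual kg/m² scale. The sketch below assumes weight is recorded in pounds and height in inches (an assumption about the data, not stated in the code above) and compares the result with the metric formula for one hypothetical respondent:

```r
# Assumption: weight is in pounds and height in inches
bmi_imperial <- function(weight_lb, height_in) {
  weight_lb / height_in^2 * 703
}

# The same quantity from metric units, for comparison
bmi_metric <- function(weight_kg, height_m) {
  weight_kg / height_m^2
}

# Hypothetical respondent: 150 lb, 65 in
imperial <- bmi_imperial(150, 65)
metric   <- bmi_metric(150 * 0.45359237, 65 * 0.0254)
c(imperial = imperial, metric = metric)  # nearly identical; 703 is a rounded factor
```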

Histograms

# The cdc data frame and bmi object are already loaded into the workspace.

# Draw a histogram of bmi:

hist(bmi)

# And one with breaks set to 50:

hist(bmi, breaks=50)

# And one with breaks set to 100:

hist(bmi, breaks=100)

# breaks sets the suggested number of bins; hist() may adjust it to get round cut points
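When `breaks` is a single number, hist() treats it as a suggestion rather than an exact bin count. A short sketch on simulated values (hypothetical, not the cdc BMI values) that inspects the bins actually used:

```r
set.seed(1)
x <- rnorm(1000, mean = 26, sd = 4)  # hypothetical BMI-like values

h10 <- hist(x, breaks = 10, plot = FALSE)
h50 <- hist(x, breaks = 50, plot = FALSE)

length(h10$counts)  # bins actually used; not necessarily 10
length(h50$counts)  # not necessarily 50 either

# Every observation lands in exactly one bin
sum(h50$counts) == length(x)
```

To force exact cut points, `breaks` can instead be given as a numeric vector of bin edges.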

Weight vs. Desired Weight

# The cdc data frame is already loaded into the workspace.

# Draw your plot here:

plot(cdc$weight ~ cdc$wtdesire) # to show a relationship, write y ~ x

License: Attribution, NonCommercial

Posted by Name_null

2015. 10. 1. 13:26

Data Infra

Hadoop and SQL-on-Hadoop (A short intro to Hadoop and SQL-on-Hadoop) from JaeHwa Jung

License: Attribution, NonCommercial

Posted by Name_null

2015. 9. 28. 19:58

Machine Learning & Data Mining

Data Mining 08 - Price Modeling from Kwang Woo NAM

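(p6) k-NN

Price estimation with k-NN

k-NN: k-nearest neighbors

Setting a price with k-NN

k: the number of items whose prices are averaged to produce the final estimate

k=1 is too small: the estimate then depends on a single neighbor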

(p8) Determining similarity

Distance measure
Uses Euclidean distance
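The k-NN estimate with Euclidean distance can be sketched in a few lines. The item features, prices, and function names below are hypothetical, chosen only to illustrate averaging over the k nearest items:

```r
# Euclidean distance between two numeric vectors
euclidean <- function(a, b) sqrt(sum((a - b)^2))

# Estimate a price as the mean price of the k nearest items.
# `features` is a numeric matrix (one row per item), `prices` a numeric vector.
knn_estimate <- function(features, prices, query, k = 3) {
  dists <- apply(features, 1, euclidean, b = query)
  nearest <- order(dists)[seq_len(k)]
  mean(prices[nearest])
}

# Hypothetical items: columns are rating and age
features <- rbind(c(90, 5), c(85, 4), c(60, 1), c(95, 6), c(55, 2))
prices   <- c(100, 80, 20, 150, 15)

k1 <- knn_estimate(features, prices, query = c(88, 5), k = 1)  # single nearest item
k3 <- knn_estimate(features, prices, query = c(88, 5), k = 3)  # average of three
c(k1 = k1, k3 = k3)
```

With k = 1 the estimate copies one neighbor's price, so a single unusual item can dominate; a larger k smooths that out at the cost of including less similar items.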

(p10) Item weights

Ways to weight neighbors by distance
Inverse function: num/(dist+const). Falls off too quickly.
Subtraction function: goes to zero.
Gaussian function: the weight is 1 when the distance is 0 and decreases as the distance grows
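The three distance-weighting functions can be compared side by side. This is a sketch: the function names and the default values of `num`, `const`, and `sigma` are illustrative assumptions, not values taken from the slides:

```r
# Parameter defaults below are illustrative assumptions
inverse_weight  <- function(dist, num = 1, const = 0.1) num / (dist + const)
subtract_weight <- function(dist, const = 1) pmax(const - dist, 0)
gaussian_weight <- function(dist, sigma = 10) exp(-dist^2 / (2 * sigma^2))

d <- c(0, 0.5, 1, 5)
weights <- rbind(
  inverse  = inverse_weight(d),   # large at 0, then drops sharply
  subtract = subtract_weight(d),  # reaches exactly 0 once dist >= const
  gaussian = gaussian_weight(d)   # 1 at dist 0, decays smoothly, never 0
)
weights
```

The Gaussian form avoids both problems noted above: it does not collapse as fast as the inverse function and it never assigns a weight of exactly zero.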

(p14) Cross-validation

Splitting the data into a training dataset and a test dataset
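One way to sketch the training/test split is repeated random holdout: set aside part of the data, predict it from the rest, and average the error over several splits. The data, the 20% test fraction, and the number of trials below are all assumptions for illustration:

```r
set.seed(42)

# Hypothetical data: price depends on one feature plus noise
x <- runif(100, 0, 10)
price <- 10 * x + rnorm(100, sd = 2)

# 1-D k-NN estimate: mean price of the k training items closest to `query`
knn_1d <- function(train_x, train_price, query, k) {
  nearest <- order(abs(train_x - query))[seq_len(k)]
  mean(train_price[nearest])
}

# One round: hold out a random fraction as the test set, score on it
holdout_mse <- function(k, test_frac = 0.2) {
  test_idx <- sample(length(x), size = round(test_frac * length(x)))
  preds <- sapply(test_idx, function(i) {
    knn_1d(x[-test_idx], price[-test_idx], x[i], k)
  })
  mean((preds - price[test_idx])^2)
}

# Average the error over several random splits for each candidate k
cross_validate <- function(k, trials = 20) mean(replicate(trials, holdout_mse(k)))

res <- sapply(c(1, 3, 5), cross_validate)
res
```

Comparing the averaged errors across candidate values of k is one way to choose k without scoring the model on data it was fitted to.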

(p17) Heterogeneous variables

Rescaling: rescale ml by 0.1, rescale aisle by 0.0 (a factor of 0.0 removes the variable from the distance)
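Rescaling amounts to multiplying each column by a factor before distances are computed. In this sketch the `rating` column, the item values, and the helper name `rescale` are hypothetical; only the ml and aisle factors come from the note above:

```r
# Multiply each column by its scale factor before measuring distances
rescale <- function(features, scale) sweep(features, 2, scale, `*`)

# Hypothetical items: columns are rating, aisle, bottle size in ml
features <- rbind(c(90, 3, 750), c(85, 17, 375), c(60, 8, 1500))
colnames(features) <- c("rating", "aisle", "ml")

# Assumed factors: keep rating, drop aisle (0.0), shrink ml (0.1)
scaled <- rescale(features, c(1, 0, 0.1))
scaled
dist(scaled)  # Euclidean distances between the rescaled items
```

Without rescaling, the ml column would dominate every distance simply because its numbers are larger.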

License: Attribution, NonCommercial

Posted by Name_null

2015. 9. 27. 23:01

Machine Learning & Data Mining

Data Mining 07 - Advanced Classification Techniques - Kernel Methods and SVM - 01 from Kwang Woo NAM

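(p2) Problems with decision trees

Problems with a decision tree's classification results

Decision boundaries are forced to be horizontal or vertical lines, which makes the result confusing and complex to apply to classification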

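(p4) Basic Linear Classification

The idea of linear classification

Improves on the decision tree's limitation of splitting only horizontally or vertically
Simple method

Find the average of each class and use it as that class's center point, then assign a point to the class whose center point is nearest
Uses the Euclidean distance between the point and each class center point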
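The simple method can be sketched as a nearest-center classifier. The training points, labels, and the function name `classify` are hypothetical:

```r
# Hypothetical training points in two classes
x <- rbind(c(1, 1), c(2, 1), c(1, 2),   # class "a"
           c(6, 6), c(7, 6), c(6, 7))   # class "b"
label <- c("a", "a", "a", "b", "b", "b")

# Center point of each class: the column means of its rows
centers <- t(sapply(split(as.data.frame(x), label), colMeans))

# Assign a point to the class whose center is nearest (Euclidean distance)
classify <- function(point) {
  d <- sqrt(rowSums(sweep(centers, 2, point)^2))
  names(which.min(d))
}

classify(c(2, 2))
classify(c(5, 6))
```

The resulting decision boundary is the straight line halfway between the two centers, which is why it is not restricted to horizontal or vertical cuts.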

(p8) Linear Classifiers

(p13) Classifier Margin

Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.

(p14) Maximum Margin

1. Maximizing the margin is good according to intuition and PAC theory
2. Implies that only support vectors are important; other training examples are ignorable.
3. Empirically it works very very well.

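(p16) Support Vector Machine

The SVM concept

A method for classifying objects that belong to one of two classes
An SVM maximizes the margin to get the best possible generalization

History and strengths of SVM

First presented by Vapnik in 1979, but its performance was recognized only more recently: Vapnik (1995) and Burges (1998)
Finds the optimal hyperplane that separates the given data into two groups as far apart as possible

Reduces generalization error through structural risk minimization rather than the empirical risk minimization used by earlier statistical learning methods
Works effectively in application areas such as pattern recognition and nonlinear motion classification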

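(p18) Support Vector Machine

Comparing a conventional linear classifier with SVM

Generalization ability of a classifier

③ has a larger margin than ②, so ③ generalizes better than ②.
A neural network that starts from initial value ① and finds ② stops there. Why?
SVM finds ③.

Key questions

How can the notion of margin be formalized?
How can the decision hyperplane that maximizes the margin be found?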

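(p20) The SVM concept: the linearly separable case

Given w (the direction of the line),

choose b so that 'the distance from the line to the nearest sample is the same for both classes' (① and ② are lines obtained this way)
The margin is defined as twice the distance from such a line to the nearest sample; the nearest samples are called support vectors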
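For a fixed direction, choosing b and measuring the margin can be worked through numerically. This sketch assumes two-dimensional samples, a hand-picked `w`, and classes that are separable along `w` with the -1 class on the lower side; it only evaluates one candidate line and is not an SVM solver:

```r
# Distance from each sample (row of x) to the line w . x + b = 0
dist_to_line <- function(x, w, b) abs(x %*% w + b) / sqrt(sum(w^2))

# Hypothetical linearly separable samples
x <- rbind(c(1, 1), c(0, 1), c(3, 3), c(4, 4))
y <- c(-1, -1, 1, 1)

w <- c(1, 1)  # assumed direction, fixed in advance

# Choose b so the nearest sample of each class is equally far from the line
s <- as.vector(x %*% w)
b <- -(max(s[y == -1]) + min(s[y == 1])) / 2

d <- as.vector(dist_to_line(x, w, b))
margin <- 2 * min(d)                      # twice the distance to the nearest sample
support <- which(abs(d - min(d)) < 1e-9)  # samples that sit on the margin
list(b = b, margin = margin, support = support)
```

An SVM goes one step further and searches over directions w as well, keeping the line whose margin is largest.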

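(p22) Characteristics of SVM

Achieved a breakthrough with the simple idea of the margin
Properties of SVM

Few user-set parameters.

The kernel type and the parameters of that kernel
The weight C between objective 1 and objective 2 in (5.15)

No way to choose the best kernel automatically; it is selected heuristically by experiment
Excellent generalization ability
Tricky to implement

Using open-source software

SVMlight
LIBSVM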

License: Attribution, NonCommercial

Posted by Name_null

2015. 9. 27. 22:39

Machine Learning & Data Mining

Data Mining 06 - Decision Trees - 01 from Kwang Woo NAM

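(p4) What is a decision tree?

Definition: a quantitative analysis method that charts decision rules as a tree and either classifies the population of interest into several subgroups (classification) or makes predictions (prediction)
Strength: results are expressed as rules of the form 'if condition A and condition B, then outcome group C', so they are easier to understand and apply than other quantitative methods for classification or prediction
Figure source: http://jaek.khu.ac.kr/datamining/684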

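(p7) Decision tree: measuring impurity

Choosing the split attribute of a decision tree

Child nodes are formed by finding which input variable, split in which way, best separates the distribution of the target variable,
and how well the target distribution is separated is measured by purity or impurity

Purity: the degree to which a node contains objects of one particular class.
Impurity: how many different classes the objects in a node are spread across

Choosing the split attribute

Form child nodes so that their purity is higher than the parent node's

For example, a node in which group 0 and group 1 make up 45% and 55% has lower purity (or higher impurity) than a node in which they make up 90% and 10%.

Measuring impurity

P-value of the chi-square statistic
Gini index
Entropy index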

(p8) Decision tree: measuring impurity

Gini index:

An index of impurity; the child nodes are chosen using the predictor and the split that reduce the Gini index the most

Diagram of Gini index values

Impurity is at its maximum when two classes are split 50/50
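Both impurity measures are short enough to compute directly. The sketch below uses the standard definitions, Gini = 1 - Σ p² and entropy = -Σ p log2 p, on class proportions; the function names are illustrative:

```r
# p: vector of class proportions in a node (must sum to 1)
gini <- function(p) 1 - sum(p^2)
entropy <- function(p) {
  p <- p[p > 0]  # treat 0 * log2(0) as 0
  -sum(p * log2(p))
}

gini(c(0.5, 0.5))     # the two-class maximum
gini(c(0.45, 0.55))   # close to the maximum: an impure node
gini(c(0.9, 0.1))     # much lower: a purer node
entropy(c(0.5, 0.5))  # the two-class maximum for entropy
entropy(c(1, 0))      # a perfectly pure node has zero entropy
```

Both measures peak at a 50/50 split and fall to zero for a node containing a single class, which is why either can drive the choice of split.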

(p11) Decision tree: measuring impurity

Measuring impurity with the Gini index and the entropy index

(p12) Decision tree: measuring impurity

Splitting the tree by impurity

(p14) Decision tree: tree learning

CART (Classification and Regression Trees)

Short for Classification And Regression Tree; devised in 1984 by Breiman and his colleagues
A product of machine learning experiments
The most widely used decision tree algorithm

1. create a root node
2. choose the best variable to divide up the data

C4.5

Developed by the Australian researcher J. Ross Quinlan; the initial version, ID3 (Iterative Dichotomizer 3), was developed in 1986
Unlike CART, it allows multiway splits at each node.
For a categorical input variable, the node splits into as many branches as there are categories.
Uses the entropy index as its impurity function. Uses the training data when pruning.

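(p21) Building the tree recursively

Choosing tree nodes by information gain
Information gain

The difference between the current entropy and the weighted average entropy of the two new groups
The algorithm computes the information gain of every attribute and picks the one with the highest gain
Current impurity minus the impurity after splitting into two groups
Split the tree recursively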
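Information gain for a two-way split can be computed directly from its definition. The outcome labels and the two candidate attributes below are hypothetical, and the function names are illustrative:

```r
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Gain from splitting `labels` into two groups by the logical vector `left`
info_gain <- function(labels, left) {
  w <- mean(left)  # share of rows in the left group
  entropy(labels) -
    (w * entropy(labels[left]) + (1 - w) * entropy(labels[!left]))
}

# Hypothetical rows: which attribute separates the outcome better?
outcome <- c("yes", "yes", "yes", "no", "no", "no")
attr_a  <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)  # separates perfectly
attr_b  <- c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE)  # barely informative

gains <- c(a = info_gain(outcome, attr_a), b = info_gain(outcome, attr_b))
names(which.max(gains))  # attribute chosen for the split
```

A tree builder would repeat this choice on each resulting group until no split yields a positive gain.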

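(p29) Pruning the tree

Overfitting

The tree fits the training data too closely: a branch is created even for a tiny decrease in entropy.
One remedy is to stop splitting when entropy does not drop by some minimum amount.

However, a split may reduce entropy only a little while the following split reduces it a lot.
So build the complete tree first, then remove unnecessary nodes.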

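(p47) When to use decision trees

Strengths:

The trained model is easy to understand.
Works with both categorical and numerical data.

Weaknesses:

Inefficient for datasets with many possible outcomes.
With numerical data, it can only create greater-than/less-than decision points.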

License: Attribution, NonCommercial

Posted by Name_null
