Fast Campus: Data Analytics, Day 21 (5/10)

2015. 5. 10. 19:27

Lecture 21 - Clustering

Clustering

● Grouping similar objects (rows, observations) into clusters.

● There is no prior information about which object belongs to which cluster.

● One kind of unsupervised model.

○ Unsupervised model: trying to find hidden structure in unlabeled data.

Computing distances

Distance: a number that measures how far apart two objects (observations, rows) are. Euclidean distance is the most commonly used.
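To make the definition concrete, here is a minimal sketch that computes the Euclidean distance between two rows by hand and then with R's `dist()`. The two observations and their values are made up for illustration.

```r
# Two observations (rows) with two numeric features each; values are illustrative.
x <- rbind(a = c(0, 0), b = c(3, 4))

# Euclidean distance by hand: square root of the summed squared differences.
manual <- sqrt(sum((x["a", ] - x["b", ])^2))

# dist() returns the pairwise distances between the rows of a matrix.
d <- dist(x, method = "euclidean")

manual          # 5
as.numeric(d)   # 5
```

Both values agree because `dist()` uses the Euclidean distance by default.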

k‐means

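1. Decide in advance to build k clusters.

2. Pick k centroids at random (the starting points).

3. Draw the Voronoi diagram of the centroids (assign each observation to its nearest centroid), then take the mean of each cell as the new centroid.

4. Repeat steps 2-3, using the new centroids in place of the random ones, until the within-cluster sum of squares (WCSS) no longer decreases.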
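The steps above can be sketched in a few lines of R. This is only an illustration on made-up data, not the lecture's code: the toy blobs, the seed, the iteration cap of 20, and the names `centroids`, `cluster`, and `wcss_trace` are all assumptions. Assigning each observation to its nearest centroid is the same thing as reading off its Voronoi cell.

```r
set.seed(1)  # arbitrary seed so the run is repeatable

# Toy data: two well-separated 2-D blobs of 20 points each (illustrative values).
x <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2))
k <- 2  # step 1: number of clusters chosen in advance

# Step 2: pick k random observations as the starting centroids.
centroids <- x[sample(nrow(x), k), , drop = FALSE]

wcss_trace <- numeric(0)
for (iter in 1:20) {
  # Step 3a: squared Euclidean distance from every observation to every centroid,
  # then assign each observation to its nearest centroid (its Voronoi cell).
  d2 <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centroids[j, ])^2))
  cluster <- max.col(-d2, ties.method = "first")

  # WCSS for the current centroids and assignment.
  wcss_trace <- c(wcss_trace, sum(d2[cbind(seq_len(nrow(x)), cluster)]))

  # Step 3b: each new centroid is the mean of the observations assigned to it.
  new_centroids <- t(sapply(seq_len(k), function(j) {
    colMeans(x[cluster == j, , drop = FALSE])
  }))

  # Step 4: stop once the centroids no longer move.
  if (all(abs(new_centroids - centroids) < 1e-12)) break
  centroids <- new_centroids
}

# The built-in implementation, for comparison.
fit <- kmeans(x, centers = k, nstart = 5)
```

`wcss_trace` should never increase from one iteration to the next, and its last value can be compared with `fit$tot.withinss`. In practice `kmeans()` is the function to use; the loop is only there to show what happens at each step.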

> ?dist

method: the distance measure to be used.

This must be one of "euclidean", "maximum", "manhattan", "canberra", "binary" or "minkowski".

Any unambiguous substring can be given.

euclidean: sqrt(sum_i (x_i - y_i)^2)

maximum: max_i |x_i - y_i|

manhattan: sum_i |x_i - y_i|

canberra: sum_i |x_i - y_i| / |x_i + y_i| (a weighted version of the Manhattan distance)

binary: Jaccard distance (1 minus the Jaccard index)

minkowski: (sum_i |x_i - y_i|^p)^(1/p)
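As a quick check of the formulas above, the following sketch runs each `method` on the same pair of observations. The two vectors are made up for illustration, and all values are positive so the Canberra denominator is unambiguous.

```r
# Two illustrative observations with three features each.
x <- rbind(a = c(1, 2, 3), b = c(4, 0, 3))

methods <- c("euclidean", "maximum", "manhattan", "canberra", "binary")
d <- sapply(methods, function(m) as.numeric(dist(x, method = m)))
d
# euclidean: sqrt(3^2 + 2^2 + 0^2) = sqrt(13)
# maximum:   max(3, 2, 0)          = 3
# manhattan: 3 + 2 + 0             = 5
# canberra:  3/5 + 2/2 + 0/6       = 1.6
# binary:    1 of the 3 positions that are non-zero in at least one row
#            is non-zero in only one row, so 1/3

# minkowski takes an extra exponent p; p = 2 reproduces the Euclidean distance.
d_mink <- as.numeric(dist(x, method = "minkowski", p = 2))

# Any unambiguous substring of the method name works too.
d_man <- as.numeric(dist(x, method = "man"))
```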

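Hierarchical clustering

1. Compute the distance matrix between all observations and use it as the dissimilarity measure.

2. Merge the most similar observations (or clusters), and repeat until everything is joined into a single tree.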

> ?hclust

method: the agglomeration method to be used.

This should be (an unambiguous abbreviation of) one of "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC).
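A minimal sketch of the two hierarchical clustering steps using `dist()` and `hclust()`. The six one-dimensional observations, the choice of `"average"` linkage, and cutting the tree at two clusters are assumptions made for illustration.

```r
# Six illustrative 1-D observations forming two obvious groups.
x <- matrix(c(1, 2, 3, 10, 11, 12), ncol = 1,
            dimnames = list(paste0("obs", 1:6), "value"))

# Step 1: distance matrix, used as the dissimilarity measure.
d <- dist(x)

# Step 2: repeatedly merge the closest pair; "average" is one of the
# agglomeration methods listed in ?hclust.
hc <- hclust(d, method = "average")

# Cut the tree into two clusters and inspect the membership.
groups <- cutree(hc, k = 2)
groups

# plot(hc) would draw the dendrogram.
```

With data this clearly separated, `obs1` to `obs3` land in one cluster and `obs4` to `obs6` in the other; on real data the result can depend noticeably on the `method` chosen.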

License: Attribution, NonCommercial


Posted by Name_null
