Implementing Your Own k-Nearest Neighbour Algorithm Using Python

Original: [https://www.kdnuggets.com/2016/01/implementing-your-own-knn-using-python.html](https://www.kdnuggets.com/2016/01/implementing-your-own-knn-using-python.html)

By Natasha Latysheva.

Editor's note: Natasha is active in the Cambridge Coding Academy, which is holding a Data Science Bootcamp in Python on 20-21 February 2016, where you can learn state-of-the-art machine learning techniques for real-world problems.

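In machine learning, you will often want to build classifiers that sort things into categories based on a set of associated values. For example, a patient can be given a diagnosis based on data from previous patients. Classification can involve constructing highly non-linear boundaries between classes, as with the red, green and blue classes below: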

![kNN boundaries](../Images/4322b5f9ae66be35375b65bec3b6c05f.png)

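Many algorithms have been developed for automated classification. Common ones include random forests, support vector machines, Naive Bayes classifiers, and many types of neural networks. To get a feel for how classification works, we take a simple classification algorithm, k-Nearest Neighbours (kNN), as an example and build it from scratch in Python 2. If you are just starting out with Python, you can keep things simple by using a mostly imperative style of coding rather than a declarative/functional one with lambda functions and [list comprehensions](http://www.secnetix.de/olli/Python/list_comprehensions.hawk). Here, we will provide an introduction to the latter approach.

kNN classifies new instances by grouping them with the most similar cases. Here, you will use kNN on the popular (if idealized) iris dataset, which consists of flower measurements for three species of iris. Our task is to predict the species label of each flower from its measurements. Since you will be building a predictor from a set of known correct classifications, kNN is a type of supervised machine learning (though, somewhat confusingly, kNN has no explicit training phase; see lazy learning).

The kNN task can be broken down into writing 3 primary functions: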

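1. Calculate the distance between any two points

2. Find the nearest neighbours based on these pairwise distances

3. Take a majority vote on the class label based on the list of nearest neighbours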
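Step 1 can be tried out in isolation. The sketch below assumes Euclidean distance, which is the metric used by the full script at the end of the post. The function name `euclidean_distance` and the sample measurements are made up for illustration.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two equal-length numeric sequences."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Two made-up flowers, each described by four measurements (illustrative values).
flower_a = [5.0, 3.0, 1.0, 0.0]
flower_b = [5.0, 3.0, 4.0, 4.0]

# The first two measurements match; the others differ by 3 and 4,
# so the distance is sqrt(3**2 + 4**2).
print(euclidean_distance(flower_a, flower_b))  # 5.0
```

Any other metric could be swapped in here without changing steps 2 and 3, since they only rely on distances being comparable.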

The steps in the following diagram give a high-level overview of the tasks you will need to accomplish in your code.

![kNN algorithm](../Images/39dbb98036b866c0b324bcb95cf9762c.png)

The algorithm

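Briefly, you will build a script that, for each input that needs classification, searches the entire training set for the k most similar instances. The class labels of those instances are then summarised by majority vote, and the winning label is returned as the prediction for the test case.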
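The flow just described (scan the training set, keep the k closest instances, vote) can be sketched compactly in the functional style mentioned earlier. This is only an illustration under stated assumptions, not the article's own listing: the names `squared_distance` and `knn_predict` and the toy data are hypothetical, and squared Euclidean distance is used because it ranks neighbours in the same order as the true distance.

```python
import heapq
from collections import Counter

def squared_distance(p, q):
    # Squared Euclidean distance: ranks neighbours identically to the true
    # distance, so the square root can be skipped when only order matters.
    return sum((a - b) ** 2 for a, b in zip(p, q))

def knn_predict(training_set, query, k):
    """training_set is a list of (features, label) pairs; returns a label."""
    # Linear scan of the whole training set, keeping the k closest pairs.
    nearest = heapq.nsmallest(
        k, training_set, key=lambda pair: squared_distance(pair[0], query))
    # Majority vote over the labels of those k neighbours.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy, hand-made data (not the iris measurements).
toy_training = [
    ([1.0, 1.0], "a"), ([1.2, 0.8], "a"), ([0.9, 1.1], "a"),
    ([5.0, 5.0], "b"), ([5.1, 4.9], "b"), ([4.8, 5.2], "b"),
]
print(knn_predict(toy_training, [1.1, 1.0], k=3))  # a
print(knn_predict(toy_training, [4.9, 5.0], k=3))  # b
```

Because every prediction scans all training rows, the cost grows with the size of the training set, which is acceptable for a dataset as small as iris.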

The complete code is at the end of the post. For now, let's go through the different parts separately and explain what they do.

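Loading the data and splitting it into training and test sets. To get up and running quickly, you will use some helper functions: although we could download the iris data ourselves and load it with csv.reader, you can also fetch it directly from scikit-learn. You can then do a 60/40 train/test split with the train_test_split function, though you could also randomly assign the rows yourself (see this type of implementation). In machine learning, a train/test split is used to reduce overfitting: training a model on the full dataset tends to make it fit the noise and peculiarities of the data rather than the real underlying trend. Any kind of model tuning (for example, choosing the number of neighbours, k) is done on the training set only; the test set serves as an independent, untouched dataset used to test the performance of the final model.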

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# load the dataset and partition it into training and testing sets
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.4, random_state=1)

# reformat train/test datasets for convenience: lists of (features, label) pairs
train = list(zip(X_train, y_train))
test = list(zip(X_test, y_test))
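The text above notes that the rows could also be assigned at random by hand. A minimal sketch of that idea, using only the standard library, might look like the following. The function name `split_rows`, its parameters and the stand-in rows are assumptions for illustration, and the resulting split will not contain the same rows as one produced by train_test_split, even with the same seed.

```python
import random

def split_rows(rows, test_fraction=0.4, seed=1):
    """Shuffle a copy of rows and cut it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# 150 stand-in rows, matching the number of flowers in the iris dataset.
rows = list(range(150))
train_rows, test_rows = split_rows(rows)
print("%d train rows, %d test rows" % (len(train_rows), len(test_rows)))
# 90 train rows, 60 test rows
```

Shuffling before cutting matters for iris in particular, because the rows as distributed are grouped by species; a cut without shuffling would leave one species out of the training set entirely.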

Here is a short guide to the iris dataset, the data split, and indexing.

![Splitting the iris dataset](../Images/8099def34ac1f68142c2841020df09ca.png)
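The rule that tuning must stay within the training set can be made concrete. The sketch below holds out part of the training rows as a validation set and picks the k with the best validation accuracy, so the test set is never consulted. The helper names (`predict`, `accuracy`, `choose_k`), the candidate values of k and the one-dimensional toy data are all assumptions made for illustration.

```python
from collections import Counter

def predict(train, query, k):
    # train holds (features, label) pairs; rank them by squared Euclidean distance.
    ranked = sorted(
        train, key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], query)))
    return Counter(label for _, label in ranked[:k]).most_common(1)[0][0]

def accuracy(train, held_out, k):
    hits = sum(1 for features, label in held_out
               if predict(train, features, k) == label)
    return hits / float(len(held_out))

def choose_k(train, validation, candidates=(1, 3, 5)):
    # max() returns the first best candidate, so ties favour the smaller k.
    return max(candidates, key=lambda k: accuracy(train, validation, k))

# Toy one-dimensional data; the "high" point at 2.6 plays the role of noise.
fit_rows = [([0.0], "low"), ([1.0], "low"), ([2.0], "low"), ([2.6], "high"),
            ([8.0], "high"), ([9.0], "high"), ([10.0], "high")]
validation_rows = [([2.4], "low"), ([8.5], "high")]

best_k = choose_k(fit_rows, validation_rows)
print(best_k)  # 3
```

With k=1 the noisy point at 2.6 misleads the prediction for 2.4, while k=3 and k=5 both classify the two validation rows correctly, so the smaller of the two is chosen.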

Further reading

A very good overview of kNN can be read here. A more in-depth implementation, including weighting and search trees, can be found here.

Full script

The complete script is as follows:

from collections import Counter
from operator import itemgetter
import math

from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# 1) given two data points, calculate the euclidean distance between them
def get_distance(data1, data2):
    points = zip(data1, data2)
    diffs_squared = [pow(a - b, 2) for (a, b) in points]
    return math.sqrt(sum(diffs_squared))

# 2) given a training set and a test instance, use get_distance to calculate all pairwise distances
def get_neighbours(training_set, test_instance, k):
    distances = [_get_tuple_distance(training_instance, test_instance) for training_instance in training_set]
    # index 1 is the calculated distance between training_instance and test_instance
    sorted_distances = sorted(distances, key=itemgetter(1))
    # extract only training instances
    sorted_training_instances = [pair[0] for pair in sorted_distances]
    # select first k elements
    return sorted_training_instances[:k]

def _get_tuple_distance(training_instance, test_instance):
    # training_instance is a (features, label) pair; index 0 holds the features
    return (training_instance, get_distance(test_instance, training_instance[0]))

# 3) given an array of nearest neighbours for a test case, tally up their classes to vote on test case class
def get_majority_vote(neighbours):
    # index 1 is the class
    classes = [neighbour[1] for neighbour in neighbours]
    count = Counter(classes)
    return count.most_common()[0][0]

# setting up main executable method
def main():

    # load the data and create the training and test sets
    # random_state = 1 is just a seed to permit reproducibility of the train/test split
    iris = load_iris()
    X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.4, random_state=1)

    # reformat train/test datasets for convenience: lists of (features, label) pairs
    train = list(zip(X_train, y_train))
    test = list(zip(X_test, y_test))

    # generate predictions
    predictions = []

    # let's arbitrarily set k equal to 5, meaning that to predict the class of new instances,
    # the 5 nearest neighbours in the training set are consulted
    k = 5

    # for each instance in the test set, get nearest neighbours and majority vote on predicted class
    for x in range(len(X_test)):
        neighbours = get_neighbours(training_set=train, test_instance=test[x][0], k=k)
        majority_vote = get_majority_vote(neighbours)
        predictions.append(majority_vote)
        print('Classifying test instance number ' + str(x) + ': Predicted label=' + str(majority_vote) + ', Actual label=' + str(test[x][1]))

    # summarize performance of the classification
    print('\nThe overall accuracy of the model is: ' + str(accuracy_score(y_test, predictions)) + '\n')
    report = classification_report(y_test, predictions, target_names=iris.target_names)
    print('A detailed classification report: \n\n' + report)

if __name__ == "__main__":
    main()

Want to learn more? Check out our two-day Data Science Bootcamp:

https://cambridgecoding.com/datascience-bootcamp

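Bio: [Natasha Latysheva](http://blog.cambridgecoding.com/author/natlat/) is a computational biology PhD student at the MRC Laboratory of Molecular Biology. Her research focuses on cancer genomics, statistical network analysis, and protein structure. More broadly, her research interests include data-intensive molecular biology, machine learning (especially deep learning), and data science.

Original. Reposted with permission.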
