Classifying search keywords into categories with python sklearn RandomForestClassifier

This post records the results of classifying Google Search Console keywords into categories with python.
In an earlier post, Google Search Console のキーワードを python sklearn LinearSVC でクラス分類してカテゴリ分けする | Monotalk, I classified them with LinearSVC. This time I try RandomForestClassifier.

Prerequisites

Input data
I exported the data to a Google spreadsheet with the Search Analytics for Sheets add-on for Google Sheets.
The input data is that exported spreadsheet converted to CSV.
For how to use the Search Analytics for Sheets add-on, I found
サーチコンソールの詳細データをGoogleスプレッドシートに自動反映させてTableauにインポートする方法 :: 「清水誠」公式サイト
helpful.
Training data
I cut out a subset of the keywords from the input data, pasted them into a separate sheet, and classified about 800 keywords by hand.
That sheet, exported as TSV, is the training data.
Below is an excerpt of the training TSV in table form.

Query	label_name
sonarqube	sonarqube
no module named	python
(1_8.w001) the standalone template_* settings were deprecated in django 1.8	django
//nosonar	sonarqube
404 エラー	404
404エラー	404
404エラーページ	404
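
As a rough illustration of how a two-column TSV like the excerpt above can be split into queries and labels, here is a minimal Python 3 sketch using only the standard library. SAMPLE_TSV and read_training_rows are hypothetical names introduced for this example; they are not part of the program shown later in this post, which reads the sheet through gspread instead.

```python
import csv
import io

# A few rows copied from the excerpt above (tab-separated, header first).
SAMPLE_TSV = (
    "Query\tlabel_name\n"
    "sonarqube\tsonarqube\n"
    "no module named\tpython\n"
    "//nosonar\tsonarqube\n"
    "404 エラー\t404\n"
)


def read_training_rows(tsv_text):
    """Return (query, label) pairs, skipping the header and short or blank rows."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader, None)  # skip the "Query / label_name" header row
    return [(row[0], row[1]) for row in reader if len(row) >= 2]


rows = read_training_rows(SAMPLE_TSV)
queries = [query for query, _ in rows]
labels = [label for _, label in rows]
print(rows)
```

Using the csv module rather than splitting on "\n" and "\t" by hand means blank trailing lines are skipped instead of raising an IndexError.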

Last time I read the TSV file directly, but repeating export and import on every read and write became tedious, so,
following [Python] Google SpreadSheetをAPI経由で読み書きする - YoheiM .NET,
I now use gspread to read from and write to the Google spreadsheet directly through the API.

References

I referred to the following articles.

How to use Grid Search: scikit-learnで最適なパラメータを決めるためにGrid Searchを使う | tatsushim’s blog
How to judge parameters: 不均衡なデータの分類問題について with Python | かものはしの分析ブログ
How to use gensim: k-means法で文書のクラスタリング - Qiita; BoW+SVMで文書分類（１） | developer’s blog
How to use random forests: Lec86_決定木とランダムフォレスト
 R vs Python：データ解析を比較 | プログラミング | POSTD
 Python と R の違い (ランダムフォレスト法) – Python でデータサイエンス

Classification procedure

I organized the processing as follows.
For the classification I use sklearn's RandomForestClassifier.

Determining the parameters

First I determine the parameters for RandomForestClassifier.
The GridSearchCV class runs cross-validation over a parameter grid and reports the combination with the best score,
so I use that.

1. Specify the parameters.
2. Run GridSearch.
3. Print the value of best_estimator_.

Classification

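1. Load the training data from the Google spreadsheet as TSV and extract the keywords.
2. Load the Search Console results from the Google spreadsheet as TSV and extract the keywords.
3. Merge the training keywords and the Search Console keywords, count word occurrences within each keyword, and turn the result into feature vectors. [2]
4. Train a RandomForestClassifier.
5. Classify the TSV, get scores, and for low-confidence rows replace the predicted label with the unknown label.
6. Write the results to the Google spreadsheet.

[2] I use gensim for this. It can reduce dimensionality, so the number of dimensions does not grow as keywords are added.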

Installing the libraries

Install the required libraries.

pip install sklearn numpy scipy pandas gensim

The following versions were installed.

pip list | grep -e sklearn  -e numpy -e scipy -e pandas -e gensim
---------------------------
gensim (2.1.0)
numpy (1.12.1)
pandas (0.20.1)
scipy (0.19.0)
sklearn (0.0)
---------------------------

sudo pip install --upgrade gensim

Implementation

Below is the python program I wrote.

search_console_random_forest_classsifier.py

# -*- coding: utf-8 -*-
from __future__ import print_function

import numpy as np
import search_console_classifier_utils as utils
from gensim import corpora, models
from gensim import matutils
from sklearn.ensemble import RandomForestClassifier


def get_max_label(array, score):
    index = 0
    max_value = 0
    for i in range(len(array)):
        item = array[i]
        if max_value < item:
            max_value = item
            index = i
    return {score[index]: max_value}


def vec2dense(vec, num_terms):
    return list(matutils.corpus2dense([vec], num_terms=num_terms).T[0])


def create_keywords_data():
    keywords = []
    # Collect keywords from the training data
    for line in utils.read_from_learning_tsv(0):
        keywords.append(utils.split_keyword(line))
    # Collect keywords from the Search Console report
    for line in utils.parse_report_tsv():
        keywords.append(utils.split_keyword(line))
    dictionary = corpora.Dictionary(keywords)
    corpus = []
    for text in keywords:
        corpus.append(dictionary.doc2bow(text))
    # Compress to 300 dimensions with LSI
    lsi_model = models.LsiModel(corpus, id2word=dictionary, num_topics=300)
    lsi_corpus = lsi_model[corpus]
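    # LSI vectors have num_topics dimensions, so size the dense vectors to match
    # (len(dictionary) would pad every vector with unused zero columns)
    denses = [vec2dense(lsi_corpus_elem, lsi_model.num_topics) for lsi_corpus_elem in lsi_corpus]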

    # Keep only the training-data part
    learning_denses = denses[:len(utils.read_from_learning_tsv(0))]
    # Labels corresponding to the training data
    learning_labels = np.array(utils.read_from_learning_tsv(1))
    return denses, learning_denses, learning_labels

#########################
# Determining the parameters
# Run GridSearch
def execute_grid_search():
    # Fetch and build the training data and labels
    denses, learning_denses, learning_labels = create_keywords_data()

    # -------------------------------------
    # Feed the training data to RandomForestClassifier and classify
    # --------------------------------
    # Set the GridSearch parameters
    # 1. Specify the parameters.
    parameters = {
        'n_estimators': [5, 10, 20, 30, 50, 100, 300],
        'max_features': [3, 5, 10, 15, 20],
        'random_state': [0],
        'n_jobs': [1],
        'min_samples_split': [3, 5, 10, 15, 20, 25, 30, 40, 50, 100],
        'max_depth': [3, 5, 10, 15, 20, 25, 30, 40, 50, 100]
    }
    # 2. Run `GridSearch`.
    from sklearn.model_selection import GridSearchCV
    clf = GridSearchCV(RandomForestClassifier(), parameters)
    clf.fit(learning_denses, learning_labels)
    # 3. Print the value of `best_estimator_`.
    print("---------------------------")
    print(clf.best_estimator_)
    print("---------------------------")


# Main method
def execute():
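    # Fetch and build the training data and labels
    # ----------------------------------------------------------------------------
    # 3. Merge the training keywords and the Search Console keywords, count word occurrences within each keyword, and turn the result into feature vectors.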
    # ----------------------------
    denses, learning_denses, learning_labels = create_keywords_data()

    # ----------------------------------------------------------------------------
    # 4. Train `RandomForestClassifier`
    # ----------------------------
    # Plain call with default parameters
    # model = RandomForestClassifier()
    # Use the values obtained from GridSearch
    model = RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                                   max_depth=20, max_features=20, max_leaf_nodes=None,
                                   min_impurity_split=1e-07, min_samples_leaf=1,
                                   min_samples_split=5, min_weight_fraction_leaf=0.0,
                                   n_estimators=300, n_jobs=1, oob_score=False, random_state=0,
                                   verbose=0, warm_start=False)

    model.fit(learning_denses, learning_labels)
    classes = model.classes_
    # ----------------------------------------------------------------------------
    # 5. Classify the `TSV`, get scores, and for low-confidence rows replace the predicted label with `unknown`
    # ----------------------------
    scores = model.predict_proba(denses[len(utils.read_from_learning_tsv(0)):])

    labels = []
    for score in scores:
        max_dict = get_max_label(score, classes)
        for k, v in max_dict.items():
            if v > 0.5:
                # Greater than 0.50: assign the label
                labels.append(k)
            else:
                # Otherwise: "unknown"
                labels.append("unknown")

    # Update the spreadsheet
    # ----------------------------------------------------------------------------
    # 6. Write the results to the Google spreadsheet.
    # ----------------------------
    utils.update_labels(labels)


if __name__ == '__main__':
    # execute_grid_search()
    execute()

search_console_classifier_utils.py

Below is the utility python file used by the main program.
It wraps access to the Google spreadsheet in functions.

# -*- coding: utf-8 -*-
import gspread
from __builtin__ import unicode
from oauth2client.service_account import ServiceAccountCredentials

# The built-in set replaces the deprecated sets.Set
stop_words = set(['name', 'not', 'the', 'usr', 'you', 'version', 'this'])

scope = ['https://spreadsheets.google.com/feeds']


# Stop-word check
# Excludes strings of two characters or fewer, and words that came out
# as labels during clustering but carried little meaning
def __check_stop_word(word):
    if word in stop_words:
        return False
    if len(word) <= 2:
        return False
    return True


# Split a keyword into words
def split_keyword(text):
    keywords = text.split(" ")
    return [keyword for keyword in keywords if __check_stop_word(keyword)]


# Remove stop words
def exclude_stop_words(text):
    # Call split_keyword on the text (joining the function object itself raises TypeError)
    return " ".join(split_keyword(text))


# Parse the report TSV
def parse_report_tsv():
    lines = []
    row_count = 1
    credentials = ServiceAccountCredentials.from_json_keyfile_name('your_api_key.json', scope)
    gc = gspread.authorize(credentials)
    # The Google Search Console data is in the "Merge" sheet.
    wks = gc.open("Google Search Console Analyze").worksheet("Merge")
    for line in wks.export(format='tsv').split("\n"):
        if row_count != 1:
            arr = line.split("\t")
            # Extract the keyword column
            lines.append(arr[1])
        row_count += 1
    return lines


# Parse the training-data TSV
def read_from_learning_tsv(index):
    lines = []
    row_count = 1
    credentials = ServiceAccountCredentials.from_json_keyfile_name('your_api_key.json', scope)
    gc = gspread.authorize(credentials)
    # The training data is in the sheet named "LearningData".
    wks = gc.open("Google Search Console Analyze").worksheet("LearningData")
    for line in wks.export(format='tsv').split("\n"):
        if row_count != 1:
            arr = line.split("\t")
            # Extract the column at the given index (0: keyword, 1: label)
            lines.append(arr[index])
        row_count += 1
    return lines


def update_labels(labels):
    # Open the "Merge" sheet again and write the label names into column J
    credentials = ServiceAccountCredentials.from_json_keyfile_name('your_api_key.json', scope)
    gc = gspread.authorize(credentials)
    wks = gc.open("Google Search Console Analyze").worksheet("Merge")
    # Select a range
    labels.insert(0, "Label")
    cell_list = wks.range('J1:J' + str(len(labels)))
    for i in range(len(labels)):
        cell_list[i].value = unicode(labels[i], 'utf-8')
    wks.update_cells(cell_list)

Explanation

Determining the parameters

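This step is run by the execute_grid_search method.
The call is commented out in the main block, so it has to be uncommented before running.
The parameter grid is borrowed from scikit-learnで最適なパラメータを決めるためにGrid Searchを使う | tatsushim’s blog. Even with this grid, the search took a fair amount of time (about 30 minutes).
The recommended values obtained are used in the execute method.
Printing clf.best_estimator_ gives a string like the following, which can be copied and pasted as runnable code.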

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=25, max_features=20, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=3, min_weight_fraction_leaf=0.0,
            n_estimators=300, n_jobs=1, oob_score=False, random_state=0,
            verbose=0, warm_start=False)
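
To get a feel for why the search took around 30 minutes, the size of the grid can be counted directly. The sketch below (Python 3, standard library only) repeats the parameters dictionary from execute_grid_search and multiplies the number of combinations by the number of cross-validation folds. The value of cv_folds is an assumption on my part: the code above does not pass cv, and the default used by GridSearchCV depends on the installed scikit-learn version.

```python
from itertools import product

# Same grid as in execute_grid_search
parameters = {
    "n_estimators": [5, 10, 20, 30, 50, 100, 300],
    "max_features": [3, 5, 10, 15, 20],
    "random_state": [0],
    "n_jobs": [1],
    "min_samples_split": [3, 5, 10, 15, 20, 25, 30, 40, 50, 100],
    "max_depth": [3, 5, 10, 15, 20, 25, 30, 40, 50, 100],
}

# Every combination of one value per parameter
combinations = list(product(*parameters.values()))
n_combinations = len(combinations)

# Assumption: 3-fold cross-validation; adjust to match your scikit-learn version
cv_folds = 3
total_fits = n_combinations * cv_folds

print(n_combinations)  # 3500
print(total_fits)      # 10500
```

Since n_jobs is fixed at 1 in the grid, each of those fits runs on a single core, so trimming the longer lists is the most direct way to shorten the search.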

Classification

The steps are explained below.

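3. Merge the training keywords and the Search Console keywords, count word occurrences within each keyword, and turn the result into feature vectors.
CountVectorizer can do the same thing, but I wanted to see whether the dimensionality could be reduced,
so I switched to gensim. The dimensionality reduction is done with models.LsiModel.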
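
To make the "count word occurrences and vectorize" part of step 3 concrete, here is a minimal pure-Python sketch of a bag-of-words representation. It is not how gensim or CountVectorizer are implemented; build_vocabulary and to_count_vector are hypothetical helpers written for this illustration, the sample queries are made up, and only the stop-word list and the two-character rule are taken from the utility file above.

```python
from collections import Counter

STOP_WORDS = {"name", "not", "the", "usr", "you", "version", "this"}


def split_keyword(text):
    """Split on spaces; drop stop words and words of two characters or fewer."""
    return [w for w in text.split(" ") if w not in STOP_WORDS and len(w) > 2]


def build_vocabulary(tokenized):
    """Assign each distinct word an integer id in first-seen order."""
    vocab = {}
    for words in tokenized:
        for word in words:
            vocab.setdefault(word, len(vocab))
    return vocab


def to_count_vector(words, vocab):
    """Dense bag-of-words vector: one count per vocabulary entry."""
    counts = Counter(words)
    return [counts.get(word, 0) for word in vocab]


# Made-up example queries
queries = ["django template settings", "django 404 page", "sonarqube nosonar"]
tokenized = [split_keyword(q) for q in queries]
vocab = build_vocabulary(tokenized)
vectors = [to_count_vector(words, vocab) for words in tokenized]

print(len(vocab))  # 7
print(vectors[0])  # [1, 1, 1, 0, 0, 0, 0]
```

Each vector is as long as the vocabulary, so the length grows with every new word that appears in the keywords. Projecting onto a fixed number of LSI topics is what keeps the dimensionality constant.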

4. Train a RandomForestClassifier
As with LinearSVC, training only requires calling fit with the training data and labels as input.

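5. Classify the TSV, get scores, and for low-confidence rows replace the predicted label with the unknown label
With LinearSVC, decision_function returned the distance from the decision boundary, which indicated how plausible each label was.
With RandomForestClassifier, predict_proba returns the probability of belonging to each class,
so I use that return value. I take the label with the highest probability; if that probability is greater than 0.5 the label is assigned, and if it is 0.5 or less,
unknown is assigned.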
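
The thresholding rule in step 5 can be sketched on its own, without a trained model. In the Python 3 example below, label_with_threshold is a hypothetical helper, and the probabilities are made-up rows standing in for the output of predict_proba, with columns in the same order as the classes list.

```python
UNKNOWN = "unknown"


def label_with_threshold(probabilities, classes, threshold=0.5):
    """Pick the most probable class per row, or UNKNOWN if it is not above threshold."""
    labels = []
    for row in probabilities:
        # Index of the largest probability in this row
        best = max(range(len(row)), key=lambda i: row[i])
        labels.append(classes[best] if row[best] > threshold else UNKNOWN)
    return labels


classes = ["404", "django", "python", "sonarqube"]
# Made-up probabilities, one row per keyword, one column per class
probabilities = [
    [0.05, 0.80, 0.10, 0.05],  # clearly django
    [0.30, 0.30, 0.25, 0.15],  # no class stands out
    [0.50, 0.20, 0.20, 0.10],  # exactly 0.5 is not "greater than 0.5"
]
labels = label_with_threshold(probabilities, classes)
print(labels)  # ['django', 'unknown', 'unknown']
```

With many classes the highest probability is often well below 0.5 even when the top label is correct, so the threshold is worth adjusting while checking the share of unknown.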

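6. Write the results to the Google spreadsheet
I write to the Google spreadsheet with gspread.
Writing one row at a time was slow, so I write everything at once with update_cells.
After writing to the spreadsheet,
I was able to create a pie chart of click counts by classified keyword label.
[Figure: keyword label names and click counts]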

Comparison with classifying search keywords using LinearSVC

Below are my qualitative observations from the experiment.

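Classification of single-word keywords
This may be due to the different confidence measures (decision_function versus predict_proba) and to the fact that I assigned labels to single words, but single-word keywords appeared to be classified better than with LinearSVC.
Share of classifiable data
Compared with LinearSVC, the proportion of unknown decreased.
That may be because I did not tune the LinearSVC parameters,
so tuning them might change the result.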

Following the LinearSVC classification, I tried classification with RandomForestClassifier.
From a layperson's point of view, it seemed better suited to keyword classification than LinearSVC.

That's all.
