Automatically classifying blog posts with the Google Search Console API and the Hatena Keyword API

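I set up a blog that republishes copies of my GitHub gists, but a blog with nothing but gists felt rather dull, so I decided to implement two scripts: one that uses morphological analysis to link similar posts as related posts, and one that assigns categories automatically.
For the related-posts script, I adapted the script I had previously written for recommending Mezzanine blog posts, described in "Mezzanine の Blog に関連記事のレコメンド表示を組み込んでみる | Monotalk".

This time I implemented the script that assigns categories automatically, and this post describes that implementation.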

Prerequisites
References
Why combine the Google Search Console API and the Hatena Keyword API?
Processing overview
Summary

Prerequisites

The scripts are Django management commands and run in the following environment.

OS

cat /etc/redhat-release
CentOS Linux release 7.4.1708 (Core)

Python version
```
python3.6 -V
Python 3.6.4
```

Django version

python3.6 -m pip list --format=columns | grep Django
Django                     1.10.8

Other libraries used

wagtail                    1.13.1
pandas                     0.22.0
searchconsole              (described below)
puput                      0.9.2.1

References

- "【Python】はてなキーワードAPIを使って特徴語を抽出する - 歩いたら休め"
  It passes the body of a blog post to the Hatena Keyword API as input and extracts keywords from the response.
  I reused this implementation almost unchanged.
- Quickstart: Run a Search Console App in Python | Search Console API | Google Developers
  Describes how to call the Google Search Console API from Python.

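Why combine the Google Search Console API and the Hatena Keyword API?

Each API has the characteristics listed below, and the reason for combining them is that I want the intersection of the two keyword sets.

Characteristics of the data available from the Google Search Console API
- Keywords that people actually used in web searches
  As described in "【SEO上級編】検索キーワードの種類とコンテンツマーケティングの関係 | 本気ファクトリー株式会社", the keywords obtained can be classified as transactional, navigational, or informational queries.
- Performance of each search keyword
  Performance data is available, such as the position at which the site appeared in Google's search results for a given keyword.
Characteristics of the data available from the Hatena Keyword API
- Keywords that people assigned or created
  These are keywords that people created deliberately; they struck me as being similar to tags.
  They probably belong mostly to the informational-query category.
Characteristics of the intersection of the two APIs
I had not verified this by implementation when writing this part, so it is a guess, but I expect the intersection to yield keywords with the following characteristics.
- Nouns that tend to be used as search keywords
- Nouns that characterize the post, once position, clicks, and impressions are taken into account
- Nouns that look natural as tags or categories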
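
The intersection described here can be sketched with plain Python sets. This is a minimal illustration rather than the implementation used later in this post, which does the intersection in SQL; `normalize` and `keyword_intersection` are hypothetical helper names, and the sample words are made up.

```python
def normalize(word):
    """Trim surrounding whitespace and capitalize, so 'django ' and 'Django' compare equal."""
    return word.strip().capitalize()


def keyword_intersection(hatena_words, search_queries):
    """Return the keywords found both in the Hatena keywords and in the search query words."""
    hatena = {normalize(w) for w in hatena_words}
    # A search query may contain several space-separated words; compare word by word.
    searched = {normalize(w) for q in search_queries for w in q.split()}
    return hatena & searched


# Made-up sample values, for illustration only.
hatena_words = ["Django", "python", "Blog"]
search_queries = ["django wagtail tags", "python xmlrpc"]
print(sorted(keyword_intersection(hatena_words, search_queries)))  # ['Django', 'Python']
```

Normalizing both sides the same way matters because the two sources differ in capitalization; the commands below apply the same `strip().capitalize()` treatment before storing words.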

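Processing overview

The processing is implemented in the following steps.

1. Call the Hatena Keyword API with each post as input and store the response.
2. Call the Google Search Console API and store the response.
3. Build the intersection of keywords from the data stored in steps 1 and 2, then narrow it down using conditions such as the score.

Table definitions
I created one table to hold the Hatena Keyword API responses and one to hold the Google Search Console responses.
Both tables reference the post table through an entry foreign key, as shown in the figure below.

Each step is explained in detail below.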

1. Call the Hatena Keyword API with each post as input and store the response.

The processing is run by the following Django command.
It is based almost entirely on the implementation in "【Python】はてなキーワードAPIを使って特徴語を抽出する - 歩いたら休め".

collect_hatena_keywords.py

from __future__ import print_function, unicode_literals

import xmlrpc.client

import six

try:
    import HTMLParser
except ImportError:
    from html.parser import HTMLParser
from django.core.management.base import BaseCommand
from puput.models import EntryPage
from markdown import markdown
from bs4 import BeautifulSoup
from logging import getLogger
from home.models import EntryHatenaKeyword

logger = getLogger(__name__)

class Command(BaseCommand):
    def handle(self, **options):

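        # logging expects a format string followed by its arguments
        logger.info("%s start", __name__)
        # ---------------------------
        # Delete all existing rows
        # ----------------------
        EntryHatenaKeyword.objects.all().delete()

        import time
        for blog_post in EntryPage.objects.all():
            # Wait 1 second between requests so the API is not hit in rapid succession
            time.sleep(1)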
            # blog contents
            html = markdown(blog_post.body)
            text = ''.join(BeautifulSoup(html, "html5lib").findAll(text=True))
            if six.PY2:
                html_parser = HTMLParser.HTMLParser()
            else:
                if six.PY34:
                    import html
                    html_parser = html
                else:
                    html_parser = HTMLParser()
            unescaped_text = html_parser.unescape(text)
            server = xmlrpc.client.ServerProxy("http://d.hatena.ne.jp/xmlrpc")
            res = server.hatena.setKeywordLink({"body": unescaped_text, 'mode': 'lite'})
            word_list = res.get("wordlist")
            for word in word_list:
                word["word"] = word["word"].strip().capitalize()
                EntryHatenaKeyword.objects.create(entry=blog_post, **word)
        logger.info("%s end", __name__)

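Explanation

Data is loaded with DELETE ALL, INSERT ALL
Because there are only a few posts, all rows are deleted and re-inserted on every run. With many posts, you would need to consider something like processing only the differences.
A one-second interval is inserted between requests to avoid hitting the API in rapid succession.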
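
For a larger blog, one way to process only the differences is to keep a hash of each post body from the previous run and skip posts whose hash is unchanged. The following is a minimal sketch of that idea and is not part of the command above; `body_hash`, `posts_needing_refresh`, and the `previous_hashes` mapping are hypothetical names, and where the hashes would be stored is left open.

```python
import hashlib


def body_hash(body):
    """Return a stable fingerprint of a post body."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def posts_needing_refresh(posts, previous_hashes):
    """Return the ids of posts that are new or whose body changed.

    posts: dict mapping post id -> current body text
    previous_hashes: dict mapping post id -> hash stored on the previous run
    """
    changed = []
    for post_id, body in posts.items():
        if previous_hashes.get(post_id) != body_hash(body):
            changed.append(post_id)
    return changed


# Made-up sample data, for illustration only.
previous = {1: body_hash("first post"), 2: body_hash("second post")}
current = {1: "first post", 2: "second post (edited)", 3: "brand new post"}
print(posts_needing_refresh(current, previous))
```

Only the returned posts would then be sent to the API and have their keyword rows replaced.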

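About the stored fields
The fields and the meaning of their values are as follows.

| Field | Description |
| --- | --- |
| entry | ID of the post the keyword was extracted from |
| word | Keyword |
| score | Score |
| refcount | Number of references to the keyword (for example, the number of links from Hatena Blog) |
| cname | Category name of the keyword |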

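Converting Markdown to HTML and extracting the body text
The blog posts are written in Markdown.
To extract the body text from the Markdown, it is first converted to HTML, and the text is then extracted with BeautifulSoup.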

    # blog contents
    html = markdown(blog_post.body)
    text = ''.join(BeautifulSoup(html, "html5lib").findAll(text=True))
    if six.PY2:
        html_parser = HTMLParser.HTMLParser()
    else:
        if six.PY34:
            import html
            html_parser = html
        else:
            html_parser = HTMLParser()
    unescaped_text = html_parser.unescape(text)
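
The text extraction and unescaping part of that snippet can also be sketched with the standard library alone. This is an illustrative alternative, not the code used in the command: `TextExtractor` and `html_to_text` are hypothetical names, the Markdown-to-HTML conversion is assumed to have been done already, and `html.parser` is less forgiving of malformed markup than BeautifulSoup with html5lib.

```python
import html
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML fragment, ignoring the tags."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities such as &amp; in text.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(markup):
    """Return the concatenated text content of an HTML fragment."""
    extractor = TextExtractor()
    extractor.feed(markup)
    extractor.close()
    return "".join(extractor.parts)


print(html_to_text("<h1>Title</h1><p>Tom &amp; Jerry use <code>pip</code></p>"))

# html.unescape decodes entities that are still escaped in plain text.
print(html.unescape("a &lt; b &amp;&amp; c &gt; d"))
```

As with the `''.join(...)` call in the original snippet, text from adjacent elements is joined without a separator.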

2. Call the Google Search Console API and store the response.

I created the following Django command.

collect_search_console.py

from __future__ import print_function, unicode_literals
from logging import getLogger

import pandas as pd
import searchconsole
from django.core.management.base import BaseCommand
from home.models import EntryGoogleSearchConsole
from puput.models import EntryPage

logger = getLogger(__name__)


def split_data_frame_list(df, target_column, separator):
    ''' df = dataframe to split,
    target_column = the column containing the values to split
    separator = the symbol used to perform the split
    returns: a dataframe with each entry for the target column separated, with each element moved into a new row.
    The values in the other columns are duplicated across the newly divided rows.
    '''

    def split_list_to_rows(row, row_accumulator, target_column, separator):
        split_row = row[target_column].split(separator)
        for s in split_row:
            new_row = row.to_dict()
            new_row["splited_" + target_column] = s
            row_accumulator.append(new_row)

    # These lines belong to split_data_frame_list, not to the nested helper
    new_rows = []
    df.apply(split_list_to_rows, axis=1, args=(new_rows, target_column, separator))
    new_df = pd.DataFrame(new_rows)
    return new_df


class Command(BaseCommand):
    def handle(self, **options):
        logger.info("%s start", __name__)
        # --------------------------------------
        # Delete all existing rows
        # -------------------------------
        EntryGoogleSearchConsole.objects.all().delete()

        # --------------------------------------
        # Call the Search Console API
        # -------------------------------
        from django.conf import settings
        account = searchconsole.authenticate(service_account=settings.BASE_DIR + '/client_secrets.json')
        web_property = account['https://your.domain.com/']
        report = web_property.query.range('today', days=-90).dimension('query', 'page').limit(50000).get()
        df = report.to_dataframe()
        df["slug"] = df["page"]
        # Extract the blog post id from the Search Console URL.
        # Remove this part if you do not need it.
        df["slug"] = df["slug"].str.replace("https://your.domain.com/posts/", "")
        df["slug"] = df["slug"].str.replace("https://your.domain.com/", "")
        df["slug"] = df["slug"].str.replace("/", "")
        df = df[df["slug"] != ""]
        df = split_data_frame_list(df, 'query', ' ')
        df["splited_query"] = df["splited_query"].str.strip().str.capitalize()
        # Iterate over the DataFrame rows
        for index, row in df.iterrows():
            entry = EntryPage.objects.filter(gist_id=row["slug"]).first()
            if not entry:
                continue
            # Use a name that does not shadow the built-in dict
            record = row.to_dict()
            del record["slug"]
            EntryGoogleSearchConsole.objects.create(entry=entry, **record)

        logger.info("%s end", __name__)

Explanation

About searchconsole
When I looked for a library that calls the Google Search Console API, I found joshcarty/google-searchconsole: A wrapper for the Google Search Console API.
Its to_dataframe method converts the API response into a pandas DataFrame.
At the time of writing this had not been merged into the main line, but the version on a branch supports connecting to the API with a service account, so I use that version.
Splitting the search queries in the DataFrame on spaces into multiple rows
I wanted to split each search query on spaces into individual words and duplicate the row once per word, so I created a function called split_data_frame_list.
The implementation is borrowed from the following gist.
Efficiently split Pandas Dataframe cells containing lists into multiple rows, duplicating the other column’s values.
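
To make the row duplication concrete, here is the same transformation sketched in plain Python on a list of dicts instead of a DataFrame. `split_rows` is a hypothetical name used only for this illustration, and the sample row is made up.

```python
def split_rows(rows, target_column, separator=" "):
    """Return one copy of each row per piece of row[target_column].

    rows: list of dicts (one dict per DataFrame row)
    Each piece is stored under "splited_<target_column>", the same column
    name that split_data_frame_list produces.
    """
    new_rows = []
    for row in rows:
        for piece in row[target_column].split(separator):
            new_row = dict(row)  # duplicate the other columns
            new_row["splited_" + target_column] = piece
            new_rows.append(new_row)
    return new_rows


# Made-up sample row, for illustration only.
rows = [{"query": "django wagtail", "clicks": 3}]
for row in split_rows(rows, "query"):
    print(row)
```

A two-word query becomes two rows that share the same clicks value and differ only in `splited_query`.
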
Extracting the ID that identifies the blog post from the URL
The lines around df["slug"] = df["slug"].str.replace("https://your.domain.com/posts/", "") perform string replacement to extract the blog post ID.
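
The same extraction can be sketched with `urllib.parse` instead of chained string replacements. This is an assumption-laden illustration, not the code in the command: `slug_from_url` is a hypothetical helper, and the `/posts/` prefix simply mirrors the placeholder URL above. Unlike the original, it keeps any inner slashes in nested paths.

```python
from urllib.parse import urlparse


def slug_from_url(url, prefix="/posts/"):
    """Return the URL path with the prefix and surrounding slashes removed.

    The site root yields an empty string, which the caller can filter out.
    """
    path = urlparse(url).path
    if path.startswith(prefix):
        path = path[len(prefix):]
    return path.strip("/")


print(slug_from_url("https://your.domain.com/posts/abc123/"))
print(repr(slug_from_url("https://your.domain.com/")))
```

Parsing the URL first means the scheme and host do not have to be hard-coded in the replacement strings.
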
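Data is loaded with DELETE ALL, INSERT ALL
Here the data can be fetched with a single API call, so the load is probably not much of a concern.
If the amount of data is huge, it may be better to fetch and store it a little at a time.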

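About the stored fields
The fields and the meaning of their values are as follows.

| Field | Description |
| --- | --- |
| entry | ID of the post the keyword was extracted from |
| clicks | Number of clicks |
| impressions | Number of impressions |
| page | URL that was shown and clicked |
| position | Position in the search results |
| query | Search keyword |
| splited_query | One word of the search keyword after splitting on spaces |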

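3. Build the intersection of keywords from the data of `1.` and `2.`, then narrow it down using conditions such as the score.

From the data stored in steps 1 and 2, SQL is used to narrow down the queries that become tags.
I created the following Django command.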

classify_entry.py

from __future__ import print_function, unicode_literals

from django.core.management.base import BaseCommand
from puput.models import EntryPage
from home.models import EntryGoogleSearchConsole
from home.models import EntryHatenaKeyword
from django.db import connection
from logging import getLogger

logger = getLogger(__name__)


class Command(BaseCommand):
    def handle(self, **options):
        logger.info("%s start", __name__)
        cursor = connection.cursor()
        cursor.execute("""
        select
          p_ep.page_ptr_id,
          h_ehk.word
        from
          puput_entrypage as p_ep
        inner join
          home_entryhatenakeyword as h_ehk on p_ep.page_ptr_id = h_ehk.entry_id
        where
          h_ehk.cname in ('elec','web')
          and
          h_ehk.score >= 25
          and
          h_ehk.word in (select splited_query from home_entrygooglesearchconsole)
        order by p_ep.page_ptr_id
        """)
        rows = cursor.fetchall()
        import collections
        entry_tag_relations = collections.defaultdict(list)
        for k, v in rows:
            entry_tag_relations[k].append(v)

        for k, v in entry_tag_relations.items():
            entry = EntryPage.objects.get(id=k)
            entry.tags.clear()
            entry.tags.add(*v)
            entry.save()

        logger.info("%s end", __name__)

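Explanation
- About the extraction conditions
  The SQL in the Command defines the extraction conditions.
  It extracts rows where the Hatena keyword's cname is elec or web and the score is 25 or higher,
  and where the Hatena keyword's word also appears in the Google Search Console search queries.
  I chose these conditions because my Google Search Console data is small and I want to assign many tags; with more data, it would be better to include impressions, position, and so on in the conditions as well.
- Converting the result tuples into a dictionary
  Each tuple returned by the SQL has two elements: the blog post ID and the word.
  A dictionary keyed by post ID with the words as values was easier to work with in the subsequent processing, so I do the following.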
```
import collections
entry_tag_relations = collections.defaultdict(list)
for k, v in rows:
    entry_tag_relations[k].append(v)
```
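
The note above about adding impressions and position to the extraction conditions can be sketched against an in-memory SQLite database. This is only an illustration under stated assumptions: the tables are reduced to the columns the query needs, the join to puput_entrypage is omitted, the rows are made up, and the thresholds (at least 10 impressions, position 20 or better) are hypothetical values to be tuned.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table home_entryhatenakeyword (
    entry_id integer, word text, score integer, cname text);
create table home_entrygooglesearchconsole (
    entry_id integer, splited_query text, impressions integer, position real);
""")
# Made-up rows, for illustration only.
conn.executemany(
    "insert into home_entryhatenakeyword values (?, ?, ?, ?)",
    [(1, "Django", 40, "web"), (1, "Python", 30, "web"), (2, "Pandas", 35, "web")],
)
conn.executemany(
    "insert into home_entrygooglesearchconsole values (?, ?, ?, ?)",
    [(1, "Django", 120, 4.2), (1, "Python", 3, 55.0), (2, "Pandas", 80, 8.5)],
)

# Same conditions as the command, plus hypothetical impression and position thresholds.
rows = conn.execute("""
select h.entry_id, h.word
from home_entryhatenakeyword as h
where h.cname in ('elec', 'web')
  and h.score >= 25
  and h.word in (
    select g.splited_query
    from home_entrygooglesearchconsole as g
    where g.impressions >= ? and g.position <= ?
  )
order by h.entry_id, h.word
""", (10, 20)).fetchall()
print(rows)  # [(1, 'Django'), (2, 'Pandas')]
```

In this sample, "Python" is dropped because its query word has only 3 impressions and an average position of 55.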

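Summary

I tried classifying blog posts using the Google Search Console API and the Hatena Keyword API.
It relies entirely on APIs and does not feel much like machine learning, but judging from the results the classification works reasonably well.
However, there are cases where a tag that does not match what the post is trying to convey is assigned, and cases where a desired tag is not assigned because the keyword is recent and not yet registered in Hatena Keyword.
It may be worth adding processing that takes the morphological analysis of the text itself into account, that defines keywords to be assigned as tags unconditionally, or conversely that defines keywords to exclude.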
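
The unconditional and excluded keywords mentioned above could be applied as a small post-processing step. The following is a minimal sketch, not something implemented in this post; `adjust_tags`, `always_add`, and `never_add` are hypothetical names, and in this sketch the exclusion list takes precedence over the unconditional list.

```python
def adjust_tags(candidate_tags, always_add=(), never_add=()):
    """Apply an unconditional list and an exclusion list to automatically chosen tags.

    candidate_tags: tags produced by the classification step
    always_add: tags assigned unconditionally
    never_add: tags that must not be assigned
    """
    excluded = set(never_add)
    result = []
    for tag in list(candidate_tags) + list(always_add):
        # Skip excluded tags and avoid duplicates while keeping the order.
        if tag not in excluded and tag not in result:
            result.append(tag)
    return result


# Made-up sample values, for illustration only.
print(adjust_tags(["Django", "Blog"], always_add=["Gist"], never_add=["Blog"]))
```
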
Beyond that, the key question is whether assigning keywords that follow the search context is interesting or not; personally, I think it is good from an SEO standpoint.
That's all.
