Python: Computing the correlation between Google Analytics PageViews and the number of blog posts

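I think this blog is updated fairly frequently.
Setting aside the question of quality, I write roughly 10 posts per month, and PageViews tend to grow in proportion to the number of posts.
I was curious how strongly the number of posts and PageViews are correlated, so I calculated it with Python.
Here are the results.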

Prerequisites

This section lists the background needed for the correlation calculation.

About this blog

I started it in 2015.
I post around 10 articles per month and use Google Analytics and Google Search Console for web analytics.

Installing the required libraries

!pip install --upgrade pip
!pip install pandas
!pip install google2pandas
!pip install oauth2client

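How to use Google2Pandas

I use the google2pandas library to fetch the Google Analytics data.
To use it, you also need to issue a key for a Google Analytics service account.
I have summarized the steps in the following post, if you would like to take a look.
Converting Google Analytics data into a pandas DataFrame with Google2Pandas | Monotalk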

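SQL for getting the monthly post count in Mezzanine

This assumes PostgreSQL. Running the following SQL returns the number of posts per month and the cumulative number of posts.

Connecting to the DB from Django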
```
python3.6 manage.py dbshell     
```

Running the SQL

SELECT
    DATE_TRUNC('month', publish_date) as publish_month,
    COUNT(id) AS "Num of entry",
    SUM(COUNT(id)) OVER(ORDER BY DATE_TRUNC('month', publish_date) ASC) AS "Num of entries up to that month"
FROM
    blog_blogpost AS blogpost
GROUP BY publish_month    
ORDER BY publish_month ASC;

Output

     publish_month      | Num of entry | Num of entries up to that month 
------------------------+--------------+---------------------------------
 2015-04-01 00:00:00+09 |            1 |                               1
 2015-05-01 00:00:00+09 |            1 |                               2
 2015-06-01 00:00:00+09 |            4 |                               6
 2015-09-01 00:00:00+09 |            1 |                               7
 2015-11-01 00:00:00+09 |            3 |                              10
 2015-12-01 00:00:00+09 |            9 |                              19
 2016-01-01 00:00:00+09 |            9 |                              28
 2016-03-01 00:00:00+09 |            3 |                              31
 2016-04-01 00:00:00+09 |            8 |                              39
 2016-05-01 00:00:00+09 |           19 |                              58
 2016-06-01 00:00:00+09 |           14 |                              72
 2016-07-01 00:00:00+09 |            9 |                              81
 2016-08-01 00:00:00+09 |           16 |                              97
 2016-09-01 00:00:00+09 |            6 |                             103
 2016-10-01 00:00:00+09 |           11 |                             114
 2016-11-01 00:00:00+09 |           13 |                             127
 2016-12-01 00:00:00+09 |            7 |                             134
 2017-01-01 00:00:00+09 |           24 |                             158
 2017-02-01 00:00:00+09 |           12 |                             170
 2017-03-01 00:00:00+09 |           11 |                             181
 2017-04-01 00:00:00+09 |           14 |                             195
 2017-05-01 00:00:00+09 |           13 |                             208
 2017-06-01 00:00:00+09 |           18 |                             226
 2017-07-01 00:00:00+09 |           25 |                             251
 2017-08-01 00:00:00+09 |           27 |                             278
 2017-09-01 00:00:00+09 |           18 |                             296
 2017-10-01 00:00:00+09 |           16 |                             312
 2017-11-01 00:00:00+09 |           19 |                             331
 2017-12-01 00:00:00+09 |           16 |                             347
 2018-01-01 00:00:00+09 |           23 |                             370
 2018-02-01 00:00:00+09 |           13 |                             383
 2018-03-01 00:00:00+09 |           12 |                             395
 2018-04-01 00:00:00+09 |           16 |                             411
 2018-05-01 00:00:00+09 |            8 |                             419
 2018-06-01 00:00:00+09 |           10 |                             429
 2018-07-01 00:00:00+09 |            9 |                             438
 2018-08-01 00:00:00+09 |           22 |                             460
 2018-09-01 00:00:00+09 |           12 |                             472
 2018-10-01 00:00:00+09 |            8 |                             480
 2018-11-01 00:00:00+09 |            5 |                             485
 2018-12-01 00:00:00+09 |            7 |                             492
 2019-01-01 00:00:00+09 |            9 |                             501
 2019-02-01 00:00:00+09 |            8 |                             509
 2019-03-01 00:00:00+09 |            2 |                             511
(44 rows)

That is 511 posts in total. I have written quite a lot.
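
For readers who prefer to stay in Python, the same monthly count and running total can be sketched in pandas. This is only a minimal sketch, not the method used in this post: the `posts` DataFrame and its sample dates are hypothetical stand-ins for the `publish_date` column of `blog_blogpost`.

```python
import pandas as pd

# Hypothetical sample standing in for blog_blogpost.publish_date
posts = pd.DataFrame({
    "publish_date": pd.to_datetime([
        "2015-04-10", "2015-05-02", "2015-06-01",
        "2015-06-15", "2015-06-20", "2015-06-30",
    ])
})

# Rough counterpart of DATE_TRUNC('month', publish_date): a YYYYMM string
posts["yearMonth"] = posts["publish_date"].dt.strftime("%Y%m")

# COUNT(id) ... GROUP BY publish_month
monthly = (
    posts.groupby("yearMonth")
    .size()
    .rename("Num of entry")
    .reset_index()
)

# SUM(COUNT(id)) OVER (ORDER BY publish_month): running total
monthly["Num of entries up to that month"] = monthly["Num of entry"].cumsum()

print(monthly)
```

Producing `yearMonth` as a YYYYMM string here would also match the format Google Analytics returns for `ga:yearMonth`, which makes a later join straightforward.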

Getting the Google Analytics data

I use google2pandas.
I previously wrote up how to use google2pandas in "Converting Google Analytics data into a pandas DataFrame with Google2Pandas | Monotalk". Please see that post as well if you are interested.

from google2pandas import *
view_id = '103185238'
query = {
    'reportRequests': [{
        'viewId' : view_id,
        'dateRanges': [{
            'startDate' : '1825daysAgo',
            'endDate'   : 'today'}],
        'dimensions' : [
            {'name' : 'ga:yearMonth'}
        ],
        'metrics'   : [
            {'expression' : 'ga:pageViews'}
        ]
    }]
}
conn = GoogleAnalyticsQueryV4(secrets='./ga_client.json')
df = conn.execute_query(query)

# Output
df['pageViews'] = df['pageViews'].astype(int)
ga_page_views = df.sort_values('yearMonth', ascending=True)
ga_page_views

	yearMonth	pageViews
0	201506	646
1	201507	920
2	201508	314
3	201509	163
4	201510	121
5	201511	970
6	201512	504
7	201601	642
8	201602	251
9	201603	385
10	201604	432
11	201605	664
12	201606	913
13	201607	946
14	201608	1145
15	201609	1267
16	201610	1481
17	201611	2847
18	201612	3734
19	201701	5911
20	201702	11972
21	201703	2390
22	201704	2405
23	201705	3114
24	201706	3955
25	201707	4244
26	201708	4448
27	201709	5644
28	201710	6985
29	201711	8005
30	201712	7844
31	201801	8804
32	201802	10040
33	201803	12893
34	201804	13694
35	201805	16909
36	201806	18984
37	201807	18341
38	201808	19236
39	201809	19010
40	201810	20003
41	201811	20071
42	201812	24221
43	201901	24895
44	201902	24356
45	201903	4878
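
Because the query ends at 'today', the last row (201903) appears to cover only part of the month, which would explain why it is far lower than 201902. If that matters for the analysis, the in-progress month can be dropped before computing anything. The following is a minimal sketch under that assumption; the `today` value is an assumed run date for illustration, and the sample rows are copied from the table above.

```python
import pandas as pd

# A few rows copied from the output above
ga_sample = pd.DataFrame({
    "yearMonth": ["201901", "201902", "201903"],
    "pageViews": [24895, 24356, 4878],
})

# Assumed date on which the query was run (hypothetical, for illustration)
today = pd.Timestamp("2019-03-09")
current_month = today.strftime("%Y%m")

# yearMonth is a zero-padded YYYYMM string, so string comparison is chronological
complete_months = ga_sample[ga_sample["yearMonth"] < current_month]
print(complete_months)
```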

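Computing the correlation

This section computes the correlation between the cumulative number of posts up to each month and that month's PageViews.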

import pandas as pd
# Load the monthly Mezzanine post counts uploaded to a gist
monthly_entries = pd.read_csv('https://gist.githubusercontent.com/kemsakurai/ef86122b072e509e3968d4e5cea3bd5f/raw/35fd7f8c5a16cd2fcf318a811c32a32ef42181b8/Number%2520of%2520blog%2520posts%2520per%2520month.tsv',sep='\t')
# Convert YYYY-MM-DD to the yearMonth format (YYYYMM):
# remove the hyphens and keep the first six characters
monthly_entries['yearMonth'] = monthly_entries['YYYY-MM-DD'].str.replace("-","").str[0:6]
# Inner-join the two DataFrames on their shared yearMonth column
merge_df = pd.merge(monthly_entries, ga_page_views, how='inner')
# Draw a scatter plot of the cumulative post count against PageViews
merge_df.plot(kind='scatter', x='Num of entries up to that month', y='pageViews')

[Figure: 20190309_output_7_1.png, scatter plot of "Num of entries up to that month" (x-axis) against pageViews (y-axis)]

It looks fair to say that PageViews increase as the total number of posts increases.
Next, I draw a time-series graph.

merge_df.plot(x="yearMonth", subplots=True, sharex=True,figsize=(15, 10))

[Figure: 20190309_output_9_1.png, three stacked time-series subplots sharing the yearMonth x-axis]

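PageViews rise and then fall again in 2017 because I introduced a monitoring service.
It appears that the monitoring bot's accesses were being recorded in Google Analytics.
Next, I compute the correlation coefficient with df.corr().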

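# Drop the per-month count so that only the cumulative count is compared
del merge_df['Num of entry']
# Select the numeric columns explicitly: recent pandas versions raise an error
# when corr() is called on a DataFrame that still contains string columns
merge_df[['Num of entries up to that month', 'pageViews']].corr()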

	Num of entries up to that month	pageViews
Num of entries up to that month	1.000000	0.876561
pageViews	0.876561	1.000000

The coefficient is 0.876561, which shows a strong positive correlation.
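
`DataFrame.corr()` computes the Pearson correlation coefficient by default, that is, the covariance of the two columns divided by the product of their standard deviations. As a sanity check, the same value can be reproduced by hand. The following is a minimal sketch that uses only a handful of rows picked from the tables above, so its result will differ from 0.876561.

```python
import numpy as np
import pandas as pd

# Rows for 201506, 201512, 201606, 201612, 201706, 201712, 201806, 201812,
# copied from the SQL output and the Google Analytics output above
sample = pd.DataFrame({
    "Num of entries up to that month": [6, 19, 72, 134, 226, 347, 429, 492],
    "pageViews": [646, 504, 913, 3734, 3955, 7844, 18984, 24221],
})

x = sample["Num of entries up to that month"].to_numpy(dtype=float)
y = sample["pageViews"].to_numpy(dtype=float)

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)),
# where dx and dy are deviations from the respective means
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The pandas result for the same two columns
r_pandas = sample["Num of entries up to that month"].corr(sample["pageViews"])

print(r_manual, r_pandas)
```

Pearson's coefficient measures linear association only, so a single outlier month such as the monitoring-bot spike can shift it noticeably.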

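Computing the correlation with PageViews per device

So far I have computed the correlation for PageViews overall. Next, I look at how the correlation behaves per device.
To get per-device data, add ga:deviceCategory as a dimension when fetching the Google Analytics data.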

from google2pandas import *
view_id = '103185238'
query = {
    'reportRequests': [{
        'viewId' : view_id,
        'dateRanges': [{
            'startDate' : '1825daysAgo',
            'endDate'   : 'today'}],
        'dimensions' : [
            {'name' : 'ga:yearMonth'},
            {'name' : 'ga:deviceCategory'}
        ],
        'metrics'   : [
            {'expression' : 'ga:pageViews'}
        ]
    }]
}
conn = GoogleAnalyticsQueryV4(secrets='./ga_client.json')
df = conn.execute_query(query)

# Output
df['pageViews'] = df['pageViews'].astype(int)
ga_page_views = df.sort_values('yearMonth', ascending=True)
ga_page_views

	yearMonth	deviceCategory	pageViews
0	201506	desktop	624
1	201506	mobile	22
2	201507	desktop	915
3	201507	mobile	5
4	201508	desktop	309
5	201508	mobile	5
6	201509	desktop	160
7	201509	mobile	2
8	201509	tablet	1
9	201510	desktop	117
10	201510	mobile	4
11	201511	desktop	942
12	201511	mobile	28
14	201512	mobile	44
15	201512	tablet	4
13	201512	desktop	456
16	201601	desktop	536
17	201601	mobile	100
18	201601	tablet	6
19	201602	desktop	232
20	201602	mobile	18
21	201602	tablet	1
22	201603	desktop	359
23	201603	mobile	26
24	201604	desktop	389
25	201604	mobile	40
26	201604	tablet	3
29	201605	tablet	6
27	201605	desktop	547
28	201605	mobile	111
...	...	...	...
102	201806	desktop	17387
103	201806	mobile	1359
104	201806	tablet	238
105	201807	desktop	16958
106	201807	mobile	1179
107	201807	tablet	204
108	201808	desktop	17171
109	201808	mobile	1880
110	201808	tablet	185
113	201809	tablet	187
111	201809	desktop	16412
112	201809	mobile	2411
114	201810	desktop	18044
115	201810	mobile	1734
116	201810	tablet	225
117	201811	desktop	18362
118	201811	mobile	1313
119	201811	tablet	396
120	201812	desktop	21959
121	201812	mobile	1776
122	201812	tablet	486
124	201901	mobile	1853
123	201901	desktop	22544
125	201901	tablet	498
127	201902	mobile	1451
126	201902	desktop	22401
128	201902	tablet	504
130	201903	mobile	444
129	201903	desktop	6917
131	201903	tablet	85

132 rows × 3 columns

A deviceCategory column has been added, with the values desktop, tablet, and mobile.

import pandas as pd
# Load the monthly Mezzanine post counts uploaded to a gist
monthly_entries = pd.read_csv('https://gist.githubusercontent.com/kemsakurai/ef86122b072e509e3968d4e5cea3bd5f/raw/35fd7f8c5a16cd2fcf318a811c32a32ef42181b8/Number%2520of%2520blog%2520posts%2520per%2520month.tsv',sep='\t')
# Convert YYYY-MM-DD to the yearMonth format (YYYYMM):
# remove the hyphens and keep the first six characters
monthly_entries['yearMonth'] = monthly_entries['YYYY-MM-DD'].str.replace("-","").str[0:6]
# Inner-join the two DataFrames on their shared yearMonth column
merge_df = pd.merge(monthly_entries, ga_page_views, how='inner')

# Draw a scatter plot of the cumulative post count against PageViews,
# using a different color for each device
colors = {'desktop':'red', 'tablet':'blue', 'mobile':'green'}
merge_df.plot(kind='scatter', x='Num of entries up to that month', y='pageViews',c=merge_df['deviceCategory'].map(colors))

[Figure: 20190309_output_16_1.png, scatter plot of "Num of entries up to that month" (x-axis) against pageViews (y-axis), colored by deviceCategory (desktop: red, tablet: blue, mobile: green)]

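# Group by deviceCategory, select the numeric columns, and output the correlation coefficients
# (string columns are left out because recent pandas versions raise an error on them in corr())
merge_df.groupby('deviceCategory')[['pageViews','Num of entries up to that month']].corr()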

		pageViews	Num of entries up to that month
deviceCategory
desktop	pageViews	1.000000	0.889307
desktop	Num of entries up to that month	0.889307	1.000000
mobile	pageViews	1.000000	0.718562
mobile	Num of entries up to that month	0.718562	1.000000
tablet	pageViews	1.000000	0.775839
tablet	Num of entries up to that month	0.775839	1.000000

The desktop deviceCategory shows the strongest positive correlation, and the mobile deviceCategory shows the weakest.
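
The grouped `corr()` returns a full correlation matrix per device, so each coefficient appears twice. If a single number per device is enough, calling `Series.corr` on each group is one option. The following is a minimal sketch with a few rows taken from the tables above, so the values will not match the full-data results.

```python
import pandas as pd

# (yearMonth, deviceCategory, pageViews, cumulative post count),
# copied from the per-device output and the SQL output above
rows = [
    ("201806", "desktop", 17387, 429), ("201806", "mobile", 1359, 429), ("201806", "tablet", 238, 429),
    ("201809", "desktop", 16412, 472), ("201809", "mobile", 2411, 472), ("201809", "tablet", 187, 472),
    ("201812", "desktop", 21959, 492), ("201812", "mobile", 1776, 492), ("201812", "tablet", 486, 492),
    ("201902", "desktop", 22401, 509), ("201902", "mobile", 1451, 509), ("201902", "tablet", 504, 509),
]
sample = pd.DataFrame(
    rows,
    columns=["yearMonth", "deviceCategory", "pageViews", "Num of entries up to that month"],
)

# One Pearson coefficient per device instead of a 2x2 matrix per device
per_device = pd.Series(
    {
        device: group["pageViews"].corr(group["Num of entries up to that month"])
        for device, group in sample.groupby("deviceCategory")
    },
    name="corr",
)
print(per_device)
```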

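Why the correlation differs between Desktop and Mobile

I think the reasons are as follows.

The absolute number of views from Mobile is small to begin with.
About 90% of this site's traffic comes from Desktop.
Perhaps technical blogs are mostly read during work hours.

Only a limited set of posts receive traffic from Mobile, so the numbers depend heavily on search-ranking changes for those pages.
A few posts have an unusually high share of Mobile traffic.
Because there are so few of them, Mobile PageViews seem to rise and fall whenever the search ranking of those pages changes.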
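
The Desktop share mentioned above can be checked from the per-device data by summing PageViews per `deviceCategory` and dividing by the total. The following is a minimal sketch that uses only the 201812 rows from the per-device table, so the share over the whole period may differ.

```python
import pandas as pd

# 201812 rows copied from the per-device output above
dec_2018 = pd.DataFrame({
    "deviceCategory": ["desktop", "mobile", "tablet"],
    "pageViews": [21959, 1776, 486],
})

# Total PageViews per device, then each device's percentage of the total
totals = dec_2018.groupby("deviceCategory")["pageViews"].sum()
share = (totals / totals.sum() * 100).round(1)
print(share)
```

For these rows Desktop accounts for roughly 90% of PageViews, which is consistent with the figure quoted above.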

That's all.
