Parsing sitemap.xml with Google Apps Script


I wanted to extract the URLs from a sitemap.xml in Google Apps Script, so this post records what I found.
Someone had already published a script that does this, so I wrote my own using theirs as a reference.


Reference

Cem Simsek’s Space — Google Apps Script | XML Sitemap Parser


The scripts

  • getUrlFromSitemaps.gs
    The following function returns the URLs in sitemap.xml as an array.
    function getUrlFromSitemaps() {
      var sitemapUrl = "your_sitemap_url";
      var xmlNS_String = "http://www.sitemaps.org/schemas/sitemap/0.9";
      // If an error occurs, an Exception is thrown and processing stops.
      var xml = UrlFetchApp.fetch(sitemapUrl).getContentText();
      var document = XmlService.parse(xml);
      var xmlProtocol = XmlService.getNamespace(xmlNS_String);
      var urlEntries = document.getRootElement().getChildren('url', xmlProtocol);
      var urlsArray = [];
      for (var urlIndex = 0; urlIndex < urlEntries.length; urlIndex++) {
        urlsArray.push(urlEntries[urlIndex].getChild('loc', xmlProtocol).getText());
      }
      return urlsArray;
    }
    
  • Result
    The return value when sitemapUrl points to an existing sitemap.xml.

    [https://www.monotalk.xyz/, 
    https://www.monotalk.xyz/blog/invisible-markup-on-wicket/, 
    https://www.monotalk.xyz/blog/rest-api-on-wicket7.3.0/, 
    https://www.monotalk.xyz/blog/review-musuinabe/, 
    https://www.monotalk.xyz/blog/form_unable_to_find_component_with_id_on_wicket/, 
    https://www.monotalk.xyz/blog/form_must_be_applied_to_a_tag_with_html_tag_on_wicket/, 
    https://www.monotalk.xyz/blog/execute_redpen_on_wercker/, 
    https://www.monotalk.xyz/blog/404_errorpage_configration_on_wicket_dropwizard/, 
    https://www.monotalk.xyz/blog/Make_sure_you_are_not_calling_WARNING_on_wicket/, 
    https://www.monotalk.xyz/blog/configure-mongodb-and-postgres-on-wercker/, 
    https://www.monotalk.xyz/blog/django-crontab-on-mezzanine/, 
    https://www.monotalk.xyz/blog/using-resource-on-wicket-application/, 
    https://www.monotalk.xyz/blog/Mezzanine-Blog-post-pagination-broken/, 
    https://www.monotalk.xyz/blog/django-request-on-mezzanine/, 
    https://www.monotalk.xyz/blog/jackson-covertvalue-exclude-fields-no-annotation/, 
    https://www.monotalk.xyz/blog/search-resutls-wicket-models/, 
    https://www.monotalk.xyz/blog/send-errormail-on-mezzanine/, 
    https://www.monotalk.xyz/blog/modify-author-on-mezzanine/,
    https://www.monotalk.xyz/about/]
    

  • getJsonFromSitemaps.gs
    The following function returns the elements of sitemap.xml as an array of JSON-style objects.
    Each field is set inside the loop.

    function getJsonFromSitemaps() {
      var sitemapUrl = "your_sitemap_url";
      var xmlNS_String = "http://www.sitemaps.org/schemas/sitemap/0.9";
      // If an error occurs, an Exception is thrown and processing stops.
      var xml = UrlFetchApp.fetch(sitemapUrl).getContentText();
      var document = XmlService.parse(xml);
      var xmlProtocol = XmlService.getNamespace(xmlNS_String);
      var urlEntries = document.getRootElement().getChildren('url', xmlProtocol);
      var elements = [];
      for (var urlIndex = 0; urlIndex < urlEntries.length; urlIndex++) {
        // Set each field on the object
        var element = {};
        element.loc = urlEntries[urlIndex].getChild('loc', xmlProtocol).getText();
        var lastmod = urlEntries[urlIndex].getChild('lastmod', xmlProtocol);
        // lastmod can be missing (null); store an empty string in that case
        element.lastmod = lastmod == null ? "" : lastmod.getText();
        elements.push(element);
      }
      return elements;
    }
    

  • Result
    The return value when sitemapUrl points to an existing sitemap.xml.

    [
      {loc=https://www.monotalk.xyz/, lastmod=}, 
      {loc=https://www.monotalk.xyz/blog/django-nose-test-tips/, lastmod=2017-05-06}, 
      {loc=https://www.monotalk.xyz/blog/pep8wraning-do-not-assign-a-lambda-expression-use-a-def/, lastmod=2017-05-06}, 
      {loc=https://www.monotalk.xyz/blog/WSGIRequest-object-is-not-subscriptable-on-django/, lastmod=2017-07-17}, 
      {loc=https://www.monotalk.xyz/blog/usage-of-modify-display-choice-name-on-django/, lastmod=2017-05-06}, 
      {loc=https://www.monotalk.xyz/blog/Default-FindBugs-Rules-In-Github-Repositories/, lastmod=2017-05-06}, 
      {loc=https://www.monotalk.xyz/blog/usage-of-django-nvd3/, lastmod=2017-05-06}, 
      {loc=https://www.monotalk.xyz/blog/label-invisible-on-NVD3/, lastmod=2017-05-06}, 
      {loc=https://www.monotalk.xyz/blog/Findbugs-filterfile-is-delicate/, lastmod=2017-05-06}, 
      {loc=https://www.monotalk.xyz/blog/understanding-of-aggregate-pipeline/, lastmod=2017-05-06}, 
      {loc=https://www.monotalk.xyz/blog/Default-PMD-Rules-In-Github-Repositories/, lastmod=2017-05-06}, 
      {loc=https://www.monotalk.xyz/blog/EmptyCatchBlock-Warning-on-PMD/, lastmod=2017-05-06}, 
      {loc=https://www.monotalk.xyz/blog/nonencoded-querystring-on-python-requests/, lastmod=2017-05-06}, 
      {loc=https://www.monotalk.xyz/blog/pymongo-errors-WriteError-The-dotted-field/, lastmod=2017-05-06}, 
      {loc=https://www.monotalk.xyz/blog/usage-lastfm-java-client/, lastmod=2017-05-06}, 
      {loc=https://www.monotalk.xyz/blog/Dropwizard-Error-command-has-been-already-used/, lastmod=2017-05-06}, 
      {loc=https://www.monotalk.xyz/blog/error-handling-on-wicket/, lastmod=2017-08-29}, 
      {loc=https://www.monotalk.xyz/blog/usesDeploymentConfig-in-wicket/, lastmod=2017-05-06}, 
      {loc=https://www.monotalk.xyz/blog/control-error-message-by-component-on-wicket/, lastmod=2017-05-06}, 
      {loc=https://www.monotalk.xyz/blog/search-resutls-wicket-forms/, lastmod=2017-05-06}, 
      {loc=https://www.monotalk.xyz/about/, lastmod=}
    ]
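
The function returns plain objects rather than a JSON string, so the result can be processed with ordinary array methods and serialized afterwards. The following is a minimal sketch, assuming entries have the `loc` / `lastmod` shape shown above and that `lastmod` is either an empty string or begins with a `YYYY-MM-DD` date. `filterByLastmod` and the `sample` data are hypothetical names introduced for illustration, not part of the original script.

```javascript
// Hypothetical helper: keep only entries whose lastmod is on or after `since`.
// Assumes lastmod is "" (missing) or starts with a YYYY-MM-DD date.
function filterByLastmod(entries, since) {
  return entries.filter(function (entry) {
    // Entries without lastmod were stored as "" and are skipped here.
    return entry.lastmod !== "" && entry.lastmod.slice(0, 10) >= since;
  });
}

// Sample data in the same shape as the result above.
var sample = [
  { loc: "https://www.monotalk.xyz/", lastmod: "" },
  { loc: "https://www.monotalk.xyz/blog/django-nose-test-tips/", lastmod: "2017-05-06" },
  { loc: "https://www.monotalk.xyz/blog/error-handling-on-wicket/", lastmod: "2017-08-29" }
];

var recent = filterByLastmod(sample, "2017-07-01");

// JSON.stringify turns the array of objects into an actual JSON string.
var json = JSON.stringify(recent);
```

Comparing the date prefix as a string works only because `YYYY-MM-DD` sorts lexicographically. A sitemap that uses another date format would need real date parsing instead.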
    


Additional notes
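
getUrlFromSitemaps and getJsonFromSitemaps repeat the same fetch and parse steps and differ only in the loop body. The following is a minimal sketch of how the loop could be shared between them. `extractEntries` and `toUrls` are hypothetical names, and the sketch assumes `root` behaves like the Element returned by `document.getRootElement()` and `ns` like the Namespace returned by `XmlService.getNamespace(...)`. Unlike the original loop, it skips a `url` entry that has no `loc` child instead of failing on null.

```javascript
// Hypothetical shared helper. `root` is expected to offer getChildren(name, ns),
// and each child getChild(name, ns) / getText(), as XmlService elements do.
function extractEntries(root, ns) {
  var urlEntries = root.getChildren('url', ns);
  var entries = [];
  for (var i = 0; i < urlEntries.length; i++) {
    var loc = urlEntries[i].getChild('loc', ns);
    // Skip <url> entries without <loc> rather than calling getText() on null.
    if (!loc) {
      continue;
    }
    var lastmod = urlEntries[i].getChild('lastmod', ns);
    entries.push({
      loc: loc.getText(),
      // lastmod is optional; store an empty string when it is missing.
      lastmod: lastmod ? lastmod.getText() : ""
    });
  }
  return entries;
}

// The URL-only array is a simple projection of the same entries.
function toUrls(entries) {
  return entries.map(function (entry) {
    return entry.loc;
  });
}
```

Inside either function above, this would presumably be called as `extractEntries(document.getRootElement(), xmlProtocol)`, with `toUrls` applied to the result when only the URLs are needed.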

That's all.
