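Natural language processing on Ecclesiastes: from installing nltk and tokenizing to plotting word frequencies

After writing my previous post, I decided to simply get started: I installed the library and began poking around, and it turns out matplotlib is used even here.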

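1. Install first

2. Preparing and loading the text data

  2-1. Any data will do, really...

  2-2. Now, loading the file

  2-3. Tokenizing words

3. Counting and visualizing tokens

4. The data type of the tokens

5. English words read better horizontally

  5-1. Using matplotlib's barh

  5-2. seaborn works too

6. Summary and ambitions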

1. Install first

With Chapter 10 (natural language processing) of "Python による AI プログラミング入門" in hand, I installed the library exactly as instructed.

www.oreilly.co.jp

pip install nltk

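Depending on your environment, some of the libraries may already be installed. I normally use Anaconda 3 with Jupyter notebook, and some of them already seemed to be installed.

Either way, once the install is complete, the next step is to download the nltk datasets. These appear to be dictionaries and stories that serve as reference material for various analyses. You can download them from the Python shell or from Jupyter notebook.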

import nltk
nltk.download()

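A new window opens; select "ALL". It depends on your connection speed, but the download takes a fair amount of time (more than a minute or two).

Since it was downloading quite a lot, I took a look at the official page and found a list, but it was enormous... I decided to come back to it when needed and skipped it this time.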

www.nltk.org
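
Downloading everything is not strictly required. As a minimal sketch, assuming only the tokenizer models and the English stop-word list are needed for the steps in this post: nltk.download also accepts a package identifier, so the download can be limited to specific resources. The helper name download_minimal_nltk_data and the choice of "punkt" and "stopwords" are assumptions for illustration; depending on the nltk version, other resource names may be required.

```python
def download_minimal_nltk_data(packages=("punkt", "stopwords")):
    """Download only the named NLTK resources instead of "ALL".

    The package names are assumptions for this post's steps
    (tokenizer models and the English stop-word list).
    """
    import nltk  # imported here so the helper can be defined before nltk is loaded

    # nltk.download returns True on success and False on failure
    return {name: nltk.download(name) for name in packages}
```

Calling download_minimal_nltk_data() from the Python shell or a notebook cell fetches just those resources and returns a dict showing which downloads succeeded.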

Incidentally, the dataset includes the following nine texts (apparently called "books").

from nltk.book import *
texts()

This prints the list of texts.

text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

While we are at it, running

sents()

shows the opening sentence of each text.

sent1: Call me Ishmael .
sent2: The family of Dashwood had long been settled in Sussex .
sent3: In the beginning God created the heaven and the earth .
sent4: Fellow - Citizens of the Senate and of the House of Representatives :
sent5: I have a problem with people PMing me to lol JOIN
sent6: SCENE 1 : [ wind ] [ clop clop clop ] KING ARTHUR : Whoa there !
sent7: Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
sent8: 25 SEXY MALE , seeks attrac older single lady , for discreet encounters .
sent9: THE suburb of Saffron Park lay on the sunset side of London , as red and ragged as a cloud of sunset .

That was a digression. One more thing: I also installed the gensim library, which completes the setup for now.

conda install gensim    # or: pip install gensim

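2. Preparing and loading the text data

2-1. Any data will do, really...

I still have no clear picture of what can be done, but first I need some text data to work with. Seeing that text3 in the dataset is the Old Testament's Book of Genesis, I decided to use another book of the Old Testament: Ecclesiastes (known in Japanese as 『コヘレトの言葉』 or 『伝道の書』).

Find something that interests you on the internet and save it as a text file or the like.

2-2. Now, loading the file

As I half expected, you are likely to hit character-encoding errors unless you specify the encoding. I wanted to move on for now, so I did it as follows. I named the text data "ecclesia". Part of the printed data is shown below.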

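# errors="ignore" silently drops any bytes that cannot be decoded as ms932
with open("Ecclesiastes.txt", mode="r", encoding="ms932", errors="ignore") as file:
    ecclesia = file.read()  # read the whole file into one string
    print(ecclesia)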

Unsurprisingly, it is an ordinary text file.

Ecclesiastes
or
The Preacher
1:1 The words of the Preacher, the son of David, king in Jerusalem.
1:2 Vanity of vanities, saith the Preacher, vanity of vanities; all is vanity.
1:3 What profit hath a man of all his labour which he taketh　…
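
The open() call above pins one encoding and ignores decoding errors. An alternative is to try several candidate encodings in order and fall back to ignoring errors only at the end. This is a sketch under assumptions, not what I actually ran: the function name read_text_with_fallback and the candidate list ("utf-8", then "cp932") are hypothetical choices.

```python
def read_text_with_fallback(path, encodings=("utf-8", "cp932")):
    """Return the file's text, decoded with the first encoding that works."""
    with open(path, mode="rb") as file:
        raw = file.read()  # read bytes first, decode afterwards
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # try the next candidate
    # Last resort: drop undecodable bytes, like errors="ignore" in open()
    return raw.decode(encodings[-1], errors="ignore")
```

Usage would be ecclesia = read_text_with_fallback("Ecclesiastes.txt"). Because the file is read in binary mode, line endings are not normalized the way text mode does, which should not affect tokenization.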

2-3. Tokenizing words

To analyze text, it is split into words, commas, periods and so on; the pieces are called tokens and the process is called tokenization. Let's give it a try.

import nltk
from nltk import word_tokenize  # tokenizer function

token = word_tokenize(ecclesia)  # split the text into tokens

stop_words = nltk.corpus.stopwords.words('english')
symbol = ["'", '"', ':', ';', '.', ',', '-', '!', '?', "'s"]
# lowercase every token, then drop stop words and punctuation
excluded = set(stop_words + symbol)
clean_token = [w.lower() for w in token if w.lower() not in excluded]

print(clean_token)

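Tokenization turns commas and periods into tokens of their own, so I exclude them together with English stop words; and since a word at the start of a sentence begins with a capital letter, I convert everything to lowercase. The text should then be split into meaningful words. The code above gave the following result.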

['ecclesiastes', 'preacher', '1:1', 'words', 'preacher', 'son', 'david', 'king', 'jerusalem', '1:2', 'vanity', 'vanities', 'saith', 'preacher', 'vanity', 'vanities', 'vanity', '1:3', 'profit', 'hath', 'man', 'labour', 'taketh', 'sun', '1:4', 'one', 'generation', 'passeth', 'away',　 …

It does seem to have worked.
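
The output still contains verse markers such as '1:1'. If those should not be counted as words, one option is a small regular-expression filter. This is a sketch, assuming the markers always have the form digits, colon, digits; VERSE_MARKER and drop_verse_markers are hypothetical names, not part of nltk.

```python
import re

# chapter:verse markers, e.g. '1:1' or '12:14'
VERSE_MARKER = re.compile(r"\d+:\d+")

def drop_verse_markers(tokens):
    """Return the tokens without chapter:verse markers."""
    return [t for t in tokens if not VERSE_MARKER.fullmatch(t)]

sample = ['ecclesiastes', 'preacher', '1:1', 'words', 'preacher', '1:2', 'vanity']
print(drop_verse_markers(sample))
# ['ecclesiastes', 'preacher', 'words', 'preacher', 'vanity']
```

Applying drop_verse_markers(clean_token) before counting would keep the markers out of the frequency results.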

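3. Counting and visualizing tokens

There are presumably many kinds of analysis, but there is a command for counting how often each token appears, in other words how often a word is used, and it can plot the result too. See the official site for details; here I plotted the 20 most frequent tokens.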

clean_frequency = nltk.FreqDist(w.lower() for w in token if w.lower() not in stop_words + symbol)
clean_frequency.plot(20, cumulative=False)

[Figure: FreqDist plot of the 20 most frequent tokens]

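The horizontal axis lists the tokens from most frequent on the left, and the vertical axis shows their counts. Convenient. Still, while Japanese and Chinese can reasonably be written either vertically or horizontally, I can't deny feeling that English written vertically is a bit much.

But... this somehow (no, obviously) looks like matplotlib, doesn't it? So I decided to look at the data type of the tokens.

4. The data type of the tokens

Checking the type of clean_token gave, as expected,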

<class 'list'>

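so it is a list. Come to think of it, the clean_token printed in "2-3. Tokenizing words" was plainly a list all along.

Once you know it is a list, the rest is just a visualization problem. And so I drifted away from nltk and into visualization.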
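
Because clean_token is an ordinary list of strings, the standard library can count it as well. Here is a minimal sketch using collections.Counter on a short sample list (a few tokens copied from the output shown earlier, not the full text):

```python
from collections import Counter

sample_tokens = ['vanity', 'vanities', 'saith', 'preacher',
                 'vanity', 'vanities', 'vanity']

counts = Counter(sample_tokens)
top2 = counts.most_common(2)  # (token, count) pairs, most frequent first
print(top2)
# [('vanity', 3), ('vanities', 2)]

# split the pairs into labels and values, ready for any plotting library
words = [w for w, _ in top2]
freqs = [n for _, n in top2]
```

The two lists words and freqs are the shape most plotting functions expect: one sequence of labels and one of values.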

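5. English words read better horizontally

5-1. Using matplotlib's barh

As with nltk.FreqDist, I drew the 20 most frequent tokens, this time as a horizontal bar chart. I really do think English words read better horizontally.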

[Figure: horizontal bar chart (matplotlib barh) of the 20 most frequent tokens]

Once I knew I could make a matplotlib bar chart, I couldn't resist tweaking the colors a little too.

import collections

import matplotlib.pyplot as plt
import seaborn as sns

# count tokens, keep the top 20, and reverse so the most frequent is drawn at the top
c = collections.Counter(clean_token)
c20 = c.most_common(20)
c20r = list(reversed(c20))
word = [w for w, _ in c20r]    # y-axis labels
count = [n for _, n in c20r]   # bar lengths

# seaborn is used here only for the plot style
sns.set()
sns.set_style("darkgrid")
fig = plt.figure(figsize=(7, 7))

y_pos = range(len(word))
clrs = ['darkolivegreen', 'limegreen', 'greenyellow', 'darkkhaki']
plt.yticks(y_pos, word, fontweight='bold', rotation=10, fontsize=14)
plt.barh(y_pos, count, color=clrs)
plt.xlabel('\nWord Count', fontweight='bold', color='black', fontsize=14, verticalalignment='center')
plt.ylabel('Word\n', fontweight='bold', color='black', fontsize=14, verticalalignment='center')

plt.show()

5-2. seaborn works too

With seaborn there is countplot, which makes picking the top 20 words even simpler.

[Figure: seaborn countplot of the 20 most frequent tokens]

import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({'Word': clean_token})  # one row per token

fig = plt.figure(figsize=(7, 7))
# order= restricts the plot to the 20 most frequent words, most frequent first
sns.countplot(y="Word", data=df, palette="magma",
              order=df.Word.value_counts().iloc[:20].index)

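That said, whether it is worth expanding code that FreqDist handles in one line to this extent is, once again, a matter of taste...

6. Summary and ambitions

This time I installed nltk and tried tokenization. Then, while counting and visualizing the tokens, I went well off track.

There seems to be plenty to do in English NLP, but first I want to set up an environment that can handle Japanese.

Thank you for reading to the end today as well.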