Luhn's summarization algorithm: the main processing

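This continues from the preprocessing of Luhn's summarization algorithm. There, the article body was split on "。" into a list of sentences and English text was lowercased, giving normalized_sents (called sents below). Here we look at how each of those sentences is scored.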

Splitting a sentence into words

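We will trace how the following sentence is processed. (In English, roughly: "According to Google, GNMT has reached a level where human-level translation is possible in some cases.")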

googleによるとgnmtは一部のケースでは人間レベルの翻訳が可能なレベルに達しているとのこと

nltkjp.word_tokenize() is the method that performs word segmentation (wakachi-gaki, splitting Japanese text into separate words).

scores = []
sent_idx = -1
nltkjp = NLTKJP()
for s in [nltkjp.word_tokenize(sent) for sent in sents]:
    sent_idx += 1

Now that the sentence has been split into words, s contains the following.

>>> print ' '.join(s)
google に よる と gnmt は 一部 の ケース で は 人間 レベル の 翻訳 が 可能 な レベル に 達し て いる と の こと

Locating the important words in a sentence

In the preprocessing of Luhn's summarization algorithm, the 100 most frequent nouns were collected as important_words. Next we find where these occur in the sentence.

    word_idx = []

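    # For each word in the list of important words
    for w in important_words:
        try:
            # Record the index at which the important word appears in the sentence
            word_idx.append(s.index(w))
        except ValueError:  # this sentence does not contain w
            pass

    word_idx.sort()

    # Some sentences may not contain a single important word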
    if len(word_idx) == 0:
        continue

The positions come out as follows.

>>> print word_idx
[0, 4, 6, 8, 11, 12, 14, 16]

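Lining up the tokenized sentence with these positions, indices 0, 4, 6, 8, 11, 12, 14, and 16 correspond to google, gnmt, 一部, ケース, 人間, レベル, 翻訳, and 可能. This also shows that the algorithm above cannot count an important word that appears more than once in a sentence, because s.index(w) returns only the first occurrence. Here "レベル" is at both 12 and 18, but only 12 is counted. I think this is a mistake rather than intentional, so I will fix it later.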
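
The post defers the fix, so the following is only a sketch of one possible repair, not the author's eventual change. It assumes the sentence is already tokenized. `first_positions` mirrors the `s.index(w)` loop above, and `all_positions` (a hypothetical helper name) walks the sentence with `enumerate` so repeated words are kept. The `important_words` list here is an assumed subset chosen to match the indices shown above, and the code uses Python 3 syntax rather than the Python 2 of the post.

```python
def first_positions(tokens, important_words):
    """Mirror the loop above: list.index() finds only the first occurrence."""
    positions = []
    for w in important_words:
        try:
            positions.append(tokens.index(w))
        except ValueError:  # w does not occur in this sentence
            pass
    return sorted(positions)


def all_positions(tokens, important_words):
    """Possible fix: record every position whose token is an important word."""
    important = set(important_words)
    return [i for i, tok in enumerate(tokens) if tok in important]


tokens = ("google に よる と gnmt は 一部 の ケース で は 人間 レベル の "
          "翻訳 が 可能 な レベル に 達し て いる と の こと").split()
# Assumed subset of important_words (illustrative, not the real Top 100).
important_words = ["google", "gnmt", "一部", "ケース", "人間", "レベル", "翻訳", "可能"]

print(first_positions(tokens, important_words))  # [0, 4, 6, 8, 11, 12, 14, 16]
print(all_positions(tokens, important_words))    # [0, 4, 6, 8, 11, 12, 14, 16, 18]
```

With every occurrence recorded, the second "レベル" at index 18 would take part in the clustering step as well, which changes the clusters and therefore the score.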

Clustering by the distance between important words

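If the distance between two adjacent important words in a sentence is less than 3, they are put in the same cluster. The code compares with <, so a gap of exactly 3 (such as between positions 8 and 11 above) starts a new cluster. The original uses a distance of 5.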

    self.CLUSTER_THRESHOLD = 3  # distance between words to consider

    # Using the word indices, compute clusters by applying the
    # maximum-distance threshold to each pair of consecutive words
    clusters = []
    cluster = [word_idx[0]]
    i = 1
    while i < len(word_idx):
        if word_idx[i] - word_idx[i-1] < self.CLUSTER_THRESHOLD:
            cluster.append(word_idx[i])
        else:
            clusters.append(cluster[:])
            cluster = [word_idx[i]]
        i += 1
    clusters.append(cluster)

Running this produced three clusters.

>>> print clusters
[[0], [4, 6, 8], [11, 12, 14, 16]]
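
As a cross-check, the clustering step can be sketched as a standalone function. This is an illustrative rewrite in Python 3 syntax, not the post's code: the name `cluster_indices` and the `threshold` parameter are assumptions standing in for `self.CLUSTER_THRESHOLD`, and the input must be a non-empty sorted list.

```python
def cluster_indices(word_idx, threshold=3):
    """Group sorted positions; a gap of `threshold` or more starts a new cluster."""
    clusters = []
    cluster = [word_idx[0]]
    # Walk over consecutive pairs (previous position, current position).
    for prev, cur in zip(word_idx, word_idx[1:]):
        if cur - prev < threshold:
            cluster.append(cur)
        else:
            clusters.append(cluster)
            cluster = [cur]
    clusters.append(cluster)
    return clusters


word_idx = [0, 4, 6, 8, 11, 12, 14, 16]
print(cluster_indices(word_idx))               # [[0], [4, 6, 8], [11, 12, 14, 16]]
print(cluster_indices(word_idx, threshold=5))  # [[0, 4, 6, 8, 11, 12, 14, 16]]
```

With a threshold of 5, every gap in this sentence (the largest is 4) is below the threshold, so all eight positions fall into a single cluster.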

Computing the cluster scores

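A cluster's score is the square of the number of important words in it, divided by the total number of words spanned from its first to its last important word (last index minus first index, plus 1). The maximum of the cluster scores becomes the score of the sentence.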

    # Compute each cluster's score. The maximum cluster score is the sentence's score
    score_list = []
    max_cluster_score = 0
    for c in clusters:
        swc = len(c)  # significant_words_in_cluster
        twc = c[-1] - c[0] + 1  # total_words_in_cluster
        score = 1.0 * swc*swc / twc
        score_list.append(score)

        if score > max_cluster_score:
            max_cluster_score = score

    scores.append((sent_idx, max_cluster_score))

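The scores of the three clusters and their maximum (the sentence's score) are as follows, rounded to one decimal place (the last cluster scores 16/6 ≈ 2.67).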

>>> print score_list
[1.0, 1.8, 2.7]

>>> print max_cluster_score
2.7
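
The arithmetic behind these numbers can be checked with a short sketch. `cluster_score` is a hypothetical helper name, and the code uses Python 3, where `/` is true division (the post's Python 2 code multiplies by 1.0 for the same effect).

```python
def cluster_score(cluster):
    """Square of the important-word count divided by the cluster's span in words."""
    significant = len(cluster)           # important words in the cluster
    span = cluster[-1] - cluster[0] + 1  # words from first to last, inclusive
    return significant ** 2 / span


clusters = [[0], [4, 6, 8], [11, 12, 14, 16]]
scores = [cluster_score(c) for c in clusters]

# 1*1/1, 3*3/5, and 4*4/6 respectively
print([round(x, 1) for x in scores])  # [1.0, 1.8, 2.7]
print(round(max(scores), 1))          # 2.7
```

The third cluster wins because it packs four important words into a six-word span.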

Summary

Putting it together, scoring a sentence proceeds as follows.

1. googleによるとgnmtは一部のケースでは人間レベルの翻訳が可能なレベルに達しているとのこと
2. google に よる と gnmt は 一部 の ケース で は 人間 レベル の 翻訳 が 可能 な レベル に 達し て いる と の こと
3. [0, 4, 6, 8, 11, 12, 14, 16]
4. [[0], [4, 6, 8], [11, 12, 14, 16]]
5. [1.0, 1.8, 2.7]
6. 2.7

The scoring method is as follows.

def score_sentences(self, sents, important_words):
    """
    Approach based on H.P. Luhn, "The Automatic Creation of Literature Abstracts"
    """
    scores = []
    sent_idx = -1

    nltkjp = NLTKJP()
    for s in [nltkjp.word_tokenize(sent) for sent in sents]:
        sent_idx += 1
        word_idx = []

        # For each word in the list of important words
        for w in important_words:
            try:
                # Record the index at which the important word appears in the sentence
                word_idx.append(s.index(w))
            except ValueError:  # this sentence does not contain w
                pass
        word_idx.sort()

        # Some sentences may not contain a single important word
        if len(word_idx) == 0:
            continue

        # Using the word indices, compute clusters by applying the
        # maximum-distance threshold to each pair of consecutive words
        clusters = []
        cluster = [word_idx[0]]
        i = 1
        while i < len(word_idx):
            if word_idx[i] - word_idx[i-1] < self.CLUSTER_THRESHOLD:
                cluster.append(word_idx[i])
            else:
                clusters.append(cluster[:])
                cluster = [word_idx[i]]
            i += 1
        clusters.append(cluster)

        # Compute each cluster's score. The maximum cluster score is the sentence's score
        score_list = []
        max_cluster_score = 0
        for c in clusters:
            swc = len(c)  # significant_words_in_cluster
            twc = c[-1] - c[0] + 1  # total_words_in_cluster
            score = 1.0 * swc*swc / twc
            score_list.append(score)

            if score > max_cluster_score:
                max_cluster_score = score

        scores.append((sent_idx, max_cluster_score))

    return scores
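
The method above depends on NLTKJP and on `self.CLUSTER_THRESHOLD`, so it cannot be run in isolation. For experimenting, a compact standalone variant might look like the sketch below. It is an assumption-laden rewrite in Python 3 syntax, not the author's code: the `tokenize` and `threshold` parameters are introduced here for illustration, and `str.split` on pre-segmented text stands in for `nltkjp.word_tokenize()`. It keeps the first-occurrence behaviour of `s.index(w)`.

```python
def score_sentences(sents, important_words, tokenize, threshold=3):
    """Return (sentence index, score) pairs; sentences with no important word are skipped."""
    scores = []
    for sent_idx, s in enumerate(tokenize(sent) for sent in sents):
        # First occurrence of each important word, as in the method above.
        word_idx = sorted(s.index(w) for w in important_words if w in s)
        if not word_idx:
            continue

        # A gap of `threshold` or more between neighbours starts a new cluster.
        clusters = [[word_idx[0]]]
        for prev, cur in zip(word_idx, word_idx[1:]):
            if cur - prev < threshold:
                clusters[-1].append(cur)
            else:
                clusters.append([cur])

        # Sentence score = best cluster score (count squared over span).
        best = max(len(c) ** 2 / (c[-1] - c[0] + 1) for c in clusters)
        scores.append((sent_idx, best))
    return scores


# Pre-segmented sample input; the second and third sentences are made up.
sents = [
    "google に よる と gnmt は 一部 の ケース で は 人間 レベル の "
    "翻訳 が 可能 な レベル に 達し て いる と の こと",
    "これ は テスト です",
    "google の 翻訳",
]
important_words = ["google", "gnmt", "一部", "ケース", "人間", "レベル", "翻訳", "可能"]

result = score_sentences(sents, important_words, tokenize=str.split)
print([(i, round(score, 2)) for i, score in result])  # [(0, 2.67), (2, 1.33)]
```

The second sentence contains no important word and is skipped, which is why the returned indices jump from 0 to 2.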