
100 NLP Questions in Depth - Classic NLP - TF-IDF

2024-10-28


📒 Question source: https://medium.com/@milana.shxanukova15/100-nlp-questions-answers-attention-part-bf86e07251de

TF-IDF (the multi-document approach)

In short:

$$ \text{TF-IDF}(t,d) = \frac{TF(t,d)}{DF(t)} = TF(t,d) \cdot IDF(t) $$

where $t$ is a term, $d$ is a document, $TF(t,d)$ is the number of occurrences of $t$ in $d$, and $DF(t)$ is the number of documents that contain $t$; the reciprocal of DF is the inverse document frequency, IDF.

In practice, a few refinements are made to this basic form:

$$ \text{TF-IDF}(t, d, D) = TF(t, d) \cdot IDF(t, D) $$

$$ TF(t, d) = \frac{\text{Number of occurrences of term } t \text{ in document } d}{\text{Total number of terms in document } d} $$

$$ IDF(t, D) = \log_e \left( \frac{\text{Total number of documents in the corpus}}{\text{Number of documents with term } t \text{ in them}} \right) $$

With smoothing, this becomes:

$$ \text{TF-IDF}(t, d) = TF(t, d) \cdot \log\left(\frac{N}{df(t) + 1}\right) $$

where $t$ is a term, $N$ is the total number of documents in the corpus, $df(t)$ is the number of documents containing $t$, and the $\log$ smooths the IDF value so that extreme values do not dominate.
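
As a quick sanity check of the smoothed formula (toy numbers of my own, not from the source): suppose a 6-word document contains a term twice, and 2 of the 4 documents in the corpus contain that term. Then

$$ TF = \frac{2}{6} \approx 0.333, \qquad IDF = \ln\frac{4}{2 + 1} \approx 0.288, \qquad \text{TF-IDF} \approx 0.333 \times 0.288 \approx 0.096 $$

The from-scratch implementation below (source credited in its first comment) follows exactly these formulas.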
# legit - https://www.askpython.com/python/examples/tf-idf-model-from-scratch

import numpy as np
from nltk.tokenize import word_tokenize 
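# nltk.download('punkt')  # word_tokenize needs the punkt model; uncomment on first run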
from typing import List, Dict


def create_counts(texts: List):
    sentences = []
    word_set = []
    
    for sent in texts:
        # keep only alphabetic tokens, lowercased
        x = [i.lower() for i in word_tokenize(sent) if i.isalpha()]
        
        sentences.append(x)
        for word in x:
            if word not in word_set:
                word_set.append(word)
    
    # build a vocabulary of unique words
    word_set = set(word_set)
    total_documents = len(sentences)
 
    # assign a unique index to each word in the vocabulary
    index_dict = {word: i for i, word in enumerate(word_set)}
        
    return sentences, word_set, total_documents, index_dict


def count_dict(sentences: List, word_set: set) -> Dict:
    """
    Counts of words.
    @sentences: the list of sentences 
    @word_set: all words without their unique ids

    return: frequencies of each word
    """
    word_count = {}
    for word in word_set:
        word_count[word] = 0
        for sent in sentences:
            if word in sent:
                word_count[word] += 1
    return word_count


def termfreq(document: List, word: str) -> float:
    """
    Count the term frequency according to the formula
    (number of occurrences of term t in doc d) / (total number of terms in doc d)

    @document: a list of words in the doc
    @word: a unique word

    return: TF value
    """
    occurrence = document.count(word)
    return occurrence / len(document)


def inverse_doc_freq(word: str, total_documents: int, word_count: Dict) -> float:
    """
    Count the IDF according to the smoothed formula
    log (num of docs / (num of docs with term t + 1))

    @word: a unique word
    @total_documents: num of docs in corpus
    @word_count: word frequencies {word: number of docs with word}

    return: IDF value
    """
    # the +1 smoothing also guards against division by zero for unseen words
    word_occurrence = word_count.get(word, 0) + 1
    return np.log(total_documents / word_occurrence)

def tf_idf(sentence: List[str], vector_shape: int, index_dict: Dict, total_documents: int, word_count: Dict) -> np.ndarray:
    """
    Get the sentence tf-idf vector

    @sentence: list of words in a sentence
    @vector_shape: number of unique words in corpus
    @index_dict: ids of words
    @total_documents: num of docs in corpus
    @word_count: word frequencies {word: number of docs with word}

    return: tf-idf vector as np.array
    """
    tf_idf_vec = np.zeros((vector_shape, ))
    for word in sentence:
        tf = termfreq(sentence, word)
        idf = inverse_doc_freq(word, total_documents, word_count)
         
        tf_idf_vec[index_dict[word]] = tf * idf 
    return tf_idf_vec

def create_vectors(texts: List):
    vectors = []
    sentences, word_set, total_documents, index_dict = create_counts(texts)
    vector_shape = len(word_set)
    word_count = count_dict(sentences, word_set)

    for sent in sentences:
        vec = tf_idf(sent, vector_shape, index_dict, total_documents, word_count)
        vectors.append(vec)
    vectors = np.array(vectors)
    return vectors, index_dict

sentences = ['This is the first document.',
             'This document is the second document.',
             'And this is the third one.',
             'Is this the first document?']

vectors, word2id = create_vectors(sentences)
print(vectors.shape)  # (num_sentences, vocabulary size)
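
One way to sanity-check these vectors (a small add-on sketch of mine, reusing vectors, word2id, and sentences from above) is to look at the highest-weighted term in each sentence; rare words such as those appearing in only one document should come out on top:

id2word = {i: w for w, i in word2id.items()}
for row, sent in zip(vectors, sentences):
    top = int(row.argmax())
    print(f'{sent!r:45} -> {id2word[top]} ({row[top]:.3f})')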

What is normalization in TF-IDF?

TF-IDF normalization can be understood within the formula itself. TF stands for the term frequency within a document. In a longer document, the raw TF count tends to be higher simply because the document contains more words.

By normalizing by document length, a word like "cat" that appears 10 times in a 1000-word text is weighted the same as "cat" appearing 5 times in a 500-word text: both give a TF of 0.01.

In sklearn's TF-IDF vectorizer, we can additionally apply normalization after all scores have been computed. If we use L2 normalization, the cosine similarity between two vectors reduces to their dot product.

📝 NOTE

Sklearn: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
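
A minimal sketch of the same pipeline with sklearn's TfidfVectorizer (note its default formula differs slightly from the one above: with smooth_idf=True it uses idf = ln((1+N)/(1+df)) + 1, and L2 normalization is on by default):

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']

vectorizer = TfidfVectorizer()  # norm='l2' is the default
X = vectorizer.fit_transform(corpus)

# with L2-normalized rows, cosine similarity is just the dot product
print((X @ X.T).toarray().round(3))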

Why do you still need to know TF-IDF today, and how is it applied alongside complex models?

In search, TF-IDF has also gradually been superseded by the BM25 algorithm, which works very well for keyword queries (a minimal sketch follows below).
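
For reference, an illustrative BM25 implementation of my own, not from the source article (the k1 and b defaults are the commonly cited values):

import math
from collections import Counter

def bm25_score(query_terms, doc, corpus, k1=1.5, b=0.75):
    """Score one tokenized document against query terms with Okapi BM25."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc)
    score = 0.0
    for t in query_terms:
        df = sum(1 for d in corpus if t in d)
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)  # Lucene-style non-negative idf
        denom = tf[t] + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * tf[t] * (k1 + 1) / denom
    return score

docs = [['this', 'is', 'the', 'first', 'document'],
        ['this', 'document', 'is', 'the', 'second', 'document']]
print(bm25_score(['first', 'document'], docs[0], docs))

Even so, there are good reasons to understand TF-IDF: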

  1. Ubiquity of text data: With the rise of the internet and social media, text has become a primary medium for spreading information. TF-IDF is a classic and efficient tool for measuring how important a word is to a document within a collection, and it is foundational to text mining and NLP.
  2. Simple and efficient: Compared with complex deep learning models, TF-IDF is easy to understand and cheap to compute, which makes it especially well suited to small datasets or resource-constrained settings.
  3. Interpretability: At a time when complex models are often criticized as "black boxes", TF-IDF offers a highly transparent way to extract features that humans can readily inspect and analyze.
  4. As a baseline model: TF-IDF is often used as the baseline for more complex text-processing tasks. When building systems for sentiment analysis, document classification, or recommendation, TF-IDF provides a solid starting point for judging how much a more complex model actually improves things (see the sketch after this list).
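
As a concrete illustration of point 4, a minimal sketch of a TF-IDF classification baseline (the toy texts and labels are invented for illustration; any labeled corpus would do):

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# hypothetical toy data: 1 = positive, 0 = negative
texts = ['great product, loved it', 'terrible, broke in a day',
         'works as expected', 'awful customer service']
labels = [1, 0, 1, 0]

# TF-IDF features + a linear classifier: a standard, transparent baseline
baseline = make_pipeline(TfidfVectorizer(), LogisticRegression())
baseline.fit(texts, labels)
print(baseline.predict(['loved the product']))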