tf-idf

\begin{alignat*}{2} \text{tf-idf}({\color {green} w}, d, D) &= \text{tf}({\color {green} w},d) \cdot \text{idf}({\color {green} w},D)\\ &= \text{N}({\color {green} w},d) \cdot \log \frac {|D|} {|\{ d \in D : {\color {green} w} \in d \}|} \end{alignat*}

$\color{green} w$ - a single word (term)
$d$ - a single document
$D$ - the set of all documents (the corpus)
$\text{tf}(w,d) = \text{N}(w,d)$
- the number of times $w$ occurs in document $d$
- TF - term frequency
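$\text{idf}(w,D)$
- derived from the number of documents in the corpus $D$ that contain $w$
- IDF - inverse document frequency
  - the more documents contain $w$, the lower its weight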
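
The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes documents are already tokenized into lists of words, uses the raw count for tf and the natural logarithm for idf, and the names `tf`, `idf`, `tf_idf`, and `corpus` are hypothetical. Raising an error for a word that appears in no document is also an assumption, since the formula is undefined in that case.

```python
import math
from collections import Counter


def tf(word, doc):
    """Raw count N(w, d): how many times `word` occurs in the tokenized `doc`."""
    return Counter(doc)[word]


def idf(word, docs):
    """log(|D| / number of documents in `docs` that contain `word`)."""
    containing = sum(1 for doc in docs if word in doc)
    if containing == 0:
        # Assumption: the formula divides by zero here, so we refuse to guess.
        raise ValueError(f"{word!r} does not occur in any document")
    return math.log(len(docs) / containing)


def tf_idf(word, doc, docs):
    """tf-idf(w, d, D) = tf(w, d) * idf(w, D)."""
    return tf(word, doc) * idf(word, docs)


# Toy corpus (made up for illustration), already split into tokens.
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "barked"],
    ["the", "cat", "chased", "the", "dog"],
]

for word in ("the", "cat", "mat"):
    print(word, tf_idf(word, corpus[0], corpus))
```

In this toy corpus, "the" occurs in every document, so its idf is $\log(3/3) = 0$ and its tf-idf is 0 even though it is the most frequent word in the first document. "mat" occurs in only one document and gets the highest score of the three.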

Demo