Inverted File Index
Intro¶
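- Term-Document Incidence Matrix:
A matrix with one row per term and one column per document, where each entry records whether the term occurs in the document. Most entries are zero, so it is too sparse to store as a full matrix.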
- Inverted File Index:
An index is a mechanism for locating a given term in a text.
An inverted file contains, for each term, a list of pointers (e.g. page numbers) to all occurrences of that term in the text.
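As a minimal sketch of the idea above, the following builds an inverted file over a tiny in-memory collection. The function name `build_inverted_index`, the sample documents, and the choice of document IDs as the "pointers" are assumptions made for illustration; a real index may also store positions within each document.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of IDs of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        # Naive tokenization: lowercase and split on whitespace.
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Hypothetical sample collection; the document ID is its position in the list.
docs = ["gold silver truck", "shipment of gold", "silver shipment arrived"]
index = build_inverted_index(docs)
print(index["gold"])      # [0, 1]
print(index["shipment"])  # [1, 2]
```

Only the (term, document) pairs that actually occur are stored, which is why this layout avoids the sparsity problem of the full matrix.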
Note
- Word Stemming:
Process a word so that only its stem or root form is left.
- Stop Words:
Some words are so common that almost every document contains them, such as “a”, “the”, and “it”. Indexing them is of little use. They are called stop words, and they can be removed from the original documents before indexing.
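Both preprocessing steps can be sketched in a few lines. The stemmer below is a deliberately naive suffix stripper written for illustration only (it is not the Porter algorithm or any library's stemmer), and `STOP_WORDS`, `naive_stem`, and `preprocess` are hypothetical names.

```python
# The three stop words mentioned in the text; real stop lists are much longer.
STOP_WORDS = {"a", "the", "it"}

def naive_stem(word):
    """Strip one common English suffix (toy rule, not a real stemming algorithm)."""
    for suffix in ("ing", "ed", "es", "s"):
        if (word.endswith(suffix)
                and not word.endswith("ss")          # keep words like "process" intact
                and len(word) - len(suffix) >= 3):   # do not strip very short words
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, drop stop words, and stem what remains."""
    return [naive_stem(w) for w in text.lower().split() if w not in STOP_WORDS]

print(preprocess("The processes processed a processing"))
# ['process', 'process', 'process']
```

After this step, the different surface forms share a single entry in the index, and the stop words take up no space in it.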
Distributed Indexing¶
- Term-partitioned index:
Partition the index by term, e.g. by alphabetical range, so that each node holds the entries for a subset of the terms.
- Document-partitioned index:
Partition the index by document, so that each node holds a complete index over a subset of the files.
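The two partitioning schemes can be contrasted with a small sketch. Everything here is an assumption for illustration: the helper names `term_partition` and `document_partition`, the split by first letter, the round-robin assignment of documents to nodes, and the sample data. Terms are assumed to start with a lowercase ASCII letter.

```python
from collections import defaultdict

def term_partition(index, num_nodes):
    """Split one global index across nodes by the first letter of each term."""
    nodes = [dict() for _ in range(num_nodes)]
    for term, postings in index.items():
        # Map 'a'..'z' onto num_nodes equal alphabetical ranges.
        slot = (ord(term[0]) - ord("a")) * num_nodes // 26
        slot = min(max(slot, 0), num_nodes - 1)  # clamp non-letter first characters
        nodes[slot][term] = postings
    return nodes

def document_partition(docs, num_nodes):
    """Give each node its own local index over a subset of the documents."""
    nodes = [defaultdict(list) for _ in range(num_nodes)]
    for doc_id, text in enumerate(docs):
        local = nodes[doc_id % num_nodes]  # round-robin assignment
        for term in sorted(set(text.lower().split())):
            local[term].append(doc_id)
    return nodes

docs = ["gold silver truck", "shipment of gold", "silver shipment arrived"]
global_index = {"gold": [0, 1], "silver": [0, 2], "truck": [0],
                "shipment": [1, 2], "of": [1], "arrived": [2]}

term_nodes = term_partition(global_index, 2)
doc_nodes = document_partition(docs, 2)
print(sorted(term_nodes[0]))  # ['arrived', 'gold']
print(doc_nodes[1]["gold"])   # [1]
```

In this sketch, a term-partitioned lookup touches only the node that owns the term, while a document-partitioned lookup has to ask every node and merge the partial results.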
Dynamic Indexing¶
Compression¶
Note
Thresholding¶
Sort the query terms by their frequency in ascending order, then search using only some percentage of the original query terms, starting with the least frequent.
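A minimal sketch of this query-side thresholding follows. The function name `threshold_query`, the `keep_fraction` parameter, and the frequency table are hypothetical; the text does not say what percentage to keep.

```python
import math

def threshold_query(query_terms, doc_freq, keep_fraction=0.5):
    """Keep only the rarest fraction of the query terms."""
    # Ascending frequency: the rarest, most selective terms come first.
    ranked = sorted(query_terms, key=lambda t: doc_freq.get(t, 0))
    keep = max(1, math.ceil(len(ranked) * keep_fraction))  # always keep at least one term
    return ranked[:keep]

# Made-up frequencies, for illustration only.
doc_freq = {"computer": 900, "science": 700, "inverted": 12, "index": 150}
kept = threshold_query(["computer", "science", "inverted", "index"], doc_freq)
print(kept)  # ['inverted', 'index']
```

Rare terms narrow the candidate set the most, so dropping the frequent ones saves work while usually changing the result little.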
User Happiness¶
Note