Compression of Indexes with Full Positional Information in Very Large Text Databases

Authors: Gordon Linoff, Craig Stanfill
Title: Compression of Indexes with Full Positional Information in Very Large Text Databases
Book: Proceedings of the Sixteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
Pages: 88-95
Date: June 1993
Publisher: ACM Press
Abstract: This paper describes a combination of compression methods which may be used to reduce the size of inverted indexes for very large text databases. These methods are Prefix Omission, Run-Length Encoding, and a novel family of numeric representations called n-s coding. Using these compression methods on two different text sources (the King James Version of the Bible and a sample of Wall Street Journal stories), the compressed index occupies less than 40% of the size of the original text, even when both stopwords and numbers are included in the index. The decreased time required for I/O can almost fully compensate for the time needed to uncompress the postings. This research is part of an effort to handle very large text databases on the CM-5, a massively parallel MIMD supercomputer.
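Comment: A method for compressing the index file (concordance) that records the positions at which each word occurs in a text database. It uses n-s coding, an Elias-code-like scheme that represents small numbers with fewer bits. It does not look applicable to dynamic text data. Oddly, the original text itself is not compressed.
Category: IR, Compress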
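
The comment above likens n-s coding to an Elias code. This record does not define n-s coding, so the sketch below uses plain Elias gamma coding instead, applied to the gaps between successive word positions, only to illustrate the general idea that small numbers take fewer bits. The function names and the bit-string representation are assumptions made for illustration and are not taken from the paper.

```python
def gamma_encode(n):
    """Elias gamma code of a positive integer, as a string of '0'/'1'."""
    if n < 1:
        raise ValueError("Elias gamma is defined only for integers >= 1")
    binary = bin(n)[2:]                      # binary digits, leading 1 included
    return "0" * (len(binary) - 1) + binary  # unary length prefix + binary value


def gamma_decode_stream(bits):
    """Decode a concatenation of gamma codes back into a list of integers.

    Assumes the stream is well formed (no truncated code at the end).
    """
    values = []
    i = 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":                # count the unary length prefix
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values


def encode_positions(positions):
    """Gap-encode strictly increasing positions (>= 1) with gamma codes."""
    out = []
    prev = 0
    for p in positions:
        out.append(gamma_encode(p - prev))   # store the gap, not the position
        prev = p
    return "".join(out)


def decode_positions(bits):
    """Invert encode_positions by summing the decoded gaps."""
    positions = []
    total = 0
    for gap in gamma_decode_stream(bits):
        total += gap
        positions.append(total)
    return positions


if __name__ == "__main__":
    postings = [3, 4, 10, 11, 200]
    encoded = encode_positions(postings)
    print(encoded, len(encoded), "bits")
    print(decode_positions(encoded))
```

In this sketch a gap of 1 costs a single bit while a gap of 189 costs 15 bits, which is why frequent words with closely spaced occurrences compress well under this kind of code.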

Organization: ACM
Bibtype: InProceedings
Year: 1993
Super: SIGIR93