Author
Gordon Linoff, Craig Stanfil
Title
Compression of Indexes with Full Positional Information in Very Large Text Databases
Book
Proceedings of the Sixteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
Pages
88-95
Date
June 1993
Publisher
ACM Press
Abstract
This paper describes a combination of compression methods which may be used to reduce the size of inverted indexes for very large text databases. These methods are Prefix Omission, Run-Length Encoding, and a novel family of numeric representations called n-s coding. Using these compression methods on two different text sources (the King James Version of the Bible and a sample of Wall Street Journal stories), the compressed index occupies less than 40% of the size of the original text, even when both stopwords and numbers are included in the index. The decreased time required for I/O can almost fully compensate for the time needed to uncompress the postings. This research is part of an effort to handle very large text databases on the CM-5, a massively parallel MIMD supercomputer.
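Comment
A method for compressing the index file (concordance) that records the positions at which words occur in a text database. It uses n-s coding, a scheme resembling Elias coding, so that small numbers are represented with fewer bits. ・It does not seem applicable to dynamic text data. ・Oddly, the original text itself is not compressed.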
Category
IR, Compress
Organization: ACM
Category: IR Compress
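Comment: A method for compressing the index file (concordance)
        that records the positions at which words occur in a text
        database. It uses n-s coding, a scheme resembling Elias
        coding, so that small numbers are represented with fewer bits.
        ・It does not seem applicable to dynamic text data.
        ・Oddly, the original text itself is not compressed.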
Abstract: This paper describes a combination of compression
        methods which may be used to reduce the size of inverted
        indexes for very large text databases. These methods are
        Prefix Omission, Run-Length Encoding, and a novel family
        of numeric representations called n-s coding.  Using
        these compression methods on two different text sources
        (the King James Version of the Bible and a sample of Wall
        Street Journal stories), the compressed index occupies
        less than 40% of the size of the original text, even
        when both stopwords and numbers are included in the
        index. The decreased time required for I/O can almost
        fully compensate for the time needed to uncompress the
        postings. This research is part of an effort to handle
        very large text databases on the CM-5, a massively
        parallel MIMD supercomputer.
Bibtype: InProceedings
Booktitle: Proceedings of the Sixteenth Annual International
        ACM SIGIR Conference on Research and Development in
        Information Retrieval
Month: jun
Pages: 88-95
Author: Gordon Linoff
        Craig Stanfil
Title: Compression of Indexes with Full Positional Information
        in Very Large Text Databases
Year: 1993
Super: SIGIR93
Publisher: ACM Press