Category: String
Comment: ftp://dcc.ufmg.br/pub/research/~nivio/cgrep
        Apparently available as a program called Cgrep.
Abstract: We present a fast compression and decompression scheme for
        natural language texts that allows efficient and flexible
        string matching by searching the compressed text directly. 
        The compression scheme uses a word-based Huffman encoding
        and the coding alphabet is byte-oriented rather than
        bit-oriented. We compress typical English texts to about
        30% of their original size, against 40% and 35% for
        Compress and Gzip, respectively. Compression times are
        close to the times of Compress and approximately half the
        times of Gzip, and decompression times are lower than
        those of Gzip and one third of those of Compress. The
        searching algorithm allows a large number of variations of
        the exact and approximate compressed string matching
        problem, such as phrases, ranges, complements, wild cards
        and arbitrary regular expressions. Separators and
        stopwords can be discarded at search time without
        significantly increasing the cost. The algorithm is based
        on a word-oriented shift-or algorithm and a fast
        Boyer-Moore-type filter. It concomitantly uses the
        vocabulary of the text available as part of the Huffman
        coding data. When searching for simple patterns, our
        experiments show that running our algorithm on a
        compressed text is twice as fast as running Agrep on the
        uncompressed version of the same text. When searching
        complex or approximate patterns, our algorithm is up to 8
        times faster than Agrep. We also mention the impact of our
        technique on inverted files pointing to documents or
        logical blocks, as in Glimpse.
Bibtype: InProceedings
URL: http://www.dcc.uchile.cl/~gnavarro/publ.html
Pages: 298-306
Author: E. Moura
        G. Navarro
        N. Ziviani
        R. Baeza-Yates
Booktitle: Proceedings of the 21st Annual International
        Conference on Research and Development in
        Information Retrieval (SIGIR'98)
Title: Fast Searching on Compressed Text Allowing Errors
Year: 1998
Date: 2003/08/01 04:59:50
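
The abstract mentions a word-oriented shift-or scan run over the
byte-oriented Huffman codes. As a rough illustration only (this is the
classic character-level shift-or algorithm, not the paper's word-oriented
variant and not the Cgrep code), a minimal sketch in Python:

        # Bit-parallel shift-or exact matching (Baeza-Yates/Gonnet).
        # Illustrative sketch; assumes the pattern fits in one machine word.
        def shift_or(pattern, text):
            m = len(pattern)
            # mask[c] has bit i cleared exactly when pattern[i] == c
            mask = {}
            for i, c in enumerate(pattern):
                mask[c] = mask.get(c, ~0) & ~(1 << i)
            state = ~0                              # all ones: no partial match yet
            matches = []
            for j, c in enumerate(text):
                state = (state << 1) | mask.get(c, ~0)
                if state & (1 << (m - 1)) == 0:     # bit m-1 clear: full match
                    matches.append(j - m + 1)       # start position of the match
            return matches

        # Example: shift_or("text", "compressed text") returns [11]

In the paper's word-oriented variant the alphabet symbols are, roughly
speaking, the vocabulary words (each with its own Huffman code) rather than
characters, which is what lets the scan run on the compressed text directly.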