Algorithm-Engineered Compressed Indexes
项目来源
项目主持人
项目受资助机构
项目编号
立项年度
立项时间
项目级别
研究期限
受资助金额
学科
学科代码
基金类别
关键词
参与者
参与机构
1.τλ-Index: A framework for locating rare patterns in repetitive corpora
- 关键词:
- Data mining;Economic and social effects;Indexing (of information);Information retrieval;Natural language processing systems;Query processing;Compressed indexing;Fixed integers;Full-text index;Genome sequences;Integer parameters;Pattern search;Pattern-matching;Rare disease;Rare pattern;Sub-strings
- Tsao, Che-Wei;Deng, Jin-Jie;Chen, Long-Qi;Hon, Wing-Kai;Köppl, Dominik;Sadakane, Kunihiko
- 《Information Systems》
- 2026年
- 139卷
- 期
- 期刊
A substring P in a text T is called rare if it occurs at most τ times, for some (relatively small) fixed integer parameter τ. Rare patterns are useful in many applications, such as in the analysis of rare diseases from genome sequences. Although several full-text indexes have already been developed to support time- and space-efficient full-text retrieval in compressed highly repetitive texts, we face two challenges for rare pattern search. First, a large fraction of the input data is likely to be redundant or not relevant for rare pattern search and thus the index size can be drastically reduced by extracting only information necessary for rare pattern search. Second, indexes with no efficient support for count queries, such as some Lempel–Ziv-based or grammar-based indexes, cannot easily determine whether a query pattern P is rare. To address these two challenges, we propose an index framework to locate rare patterns P of length at most λ. We also present a near-linear time method to identify the parts in T necessary for rare pattern search. Experimental results demonstrate that the τλ-index achieves competitive space–time trade-offs on highly repetitive corpora. In particular, for highly repetitive corpora such as real-world human DNA sequences, the τλ-index requires less space than the three self-indexes compared in our experiments, namely LZ77 (Claude et al., 2016), LPG (Díaz-Domínguez et al., 2021), and the r-index (Gagie et al., 2018). Moreover, for indexes that do not support efficient count queries, our method offers significantly faster performance — often by orders of magnitude — with only a modest space overhead observed in certain corpora. © 2026 Elsevier Ltd.
...2.LZ78 Substring Compression in Compressed Space
- 关键词:
- Lossless data compression; LZ78 factorization; Substring compression;CDAWG; C630 Computational Techniques; Simulation Modeling
- Shibata, Hiroki;Koppl, Dominik
- 《THEORY OF COMPUTING SYSTEMS》
- 2025年
- 70卷
- 1期
- 期刊
The Lempel-Ziv 78 (LZ78) factorization is a well-studied technique for data compression. It and its derivates are used in compression formats such as compress or gif. Although most research focuses on the factorization of plain data, not much research has been conducted on indexing the data for fast LZ78 factorization. Here, we study the LZ78 factorization and its derivates in the substring compression model, where we are allowed to index the data and return the factorization of a substring specified at query time. In that model, we propose an algorithm that works in compressed space, computing the factorization with a logarithmic slowdown compared to the optimal time complexity.
...3.Extended parameterized Burrows–Wheeler transform
- 关键词:
- Backward wave tubes;Data structures;Indexing (of information);Pattern matching;Query processing;Burrows-Wheeler Transform;Circular text;Extended burrow–wheeler transform;Full-text index;Matching query;Matching statistics;Natural extension;Parameterized;Parameterized pattern matching;Pattern-matching
- Osterkamp, Eric M.;Köppl, Dominik
- 《Information Systems》
- 2026年
- 136卷
- 期
- 期刊
The Burrows–Wheeler transform (BWT) lies at the heart of succinct and compressed full-text indexes for pattern matching queries. Notable variants are (a) the extended BWT (eBWT) capable to index multiple circular texts for pattern matching, or (b) the parameterized BWT (pBWT) for parameterized pattern matching. A natural extension is the combination of the virtues of both variants into a new data structure, whose name we coin with extended parameterized BWT (epBWT). We show that the epBWT supports pattern matching in context of parameterized pattern matching on multiple circular texts, within the same complexities as known solutions presented for the pBWT [Kim and Cho, IPL’21] for patterns not longer than the shortest indexed text. Additionally, we show how to compute the epBWT within the same complexities as [Iseri et al., ICALP’24], i.e., in compact space and quasilinear time. As an application, we extend the matching statistics problem to the parameterized pattern matching setting on circular texts. © 2025 Elsevier Ltd
...
