Algorithm-Engineered Compressed Indexes

项目来源

日本学术振兴会基金(JSPS)

项目主持人

Koeppl Dominik

项目受资助机构

山梨大学

立项年度

2025

立项时间

未公开

项目编号

25K21150

项目级别

国家级

研究期限

未知 / 未知

受资助金额

4680000.00日元

学科

情報学基礎論関連

学科代码

未公开

基金类别

若手研究

关键词

Matrix Compression ; Pattern Matching ; Algorithm Engineering ; Substring Compression ; Generalized Pattern

参与者

未公开

参与机构

山梨大学,大学院総合研究部

项目标书摘要:Outline of Research at the Start:The primary goal of this research is to develop efficient,practical construction methods for compressed data structures,which are often limited to theoretical advancements without adequate real-world applications.While many data structures exist,their practical utility is hampered by inefficiencies in their construction and capabilities(e.g.,the types of supported queries).This project will focus on improving the speed and efficiency of constructing these structures,particularly for tasks such as graph traversal,generalized pattern matching,and substring compression。

  • 排序方式:
  • 1
  • /
  • 1.τλ-Index: A framework for locating rare patterns in repetitive corpora

    • 关键词:
    • Data mining;Economic and social effects;Indexing (of information);Information retrieval;Natural language processing systems;Query processing;Compressed indexing;Fixed integers;Full-text index;Genome sequences;Integer parameters;Pattern search;Pattern-matching;Rare disease;Rare pattern;Sub-strings
    • Tsao, Che-Wei;Deng, Jin-Jie;Chen, Long-Qi;Hon, Wing-Kai;Köppl, Dominik;Sadakane, Kunihiko
    • 《Information Systems》
    • 2026年
    • 139卷
    • 期刊

    A substring P in a text T is called rare if it occurs at most τ times, for some (relatively small) fixed integer parameter τ. Rare patterns are useful in many applications, such as in the analysis of rare diseases from genome sequences. Although several full-text indexes have already been developed to support time- and space-efficient full-text retrieval in compressed highly repetitive texts, we face two challenges for rare pattern search. First, a large fraction of the input data is likely to be redundant or not relevant for rare pattern search and thus the index size can be drastically reduced by extracting only information necessary for rare pattern search. Second, indexes with no efficient support for count queries, such as some Lempel–Ziv-based or grammar-based indexes, cannot easily determine whether a query pattern P is rare. To address these two challenges, we propose an index framework to locate rare patterns P of length at most λ. We also present a near-linear time method to identify the parts in T necessary for rare pattern search. Experimental results demonstrate that the τλ-index achieves competitive space–time trade-offs on highly repetitive corpora. In particular, for highly repetitive corpora such as real-world human DNA sequences, the τλ-index requires less space than the three self-indexes compared in our experiments, namely LZ77 (Claude et al., 2016), LPG (Díaz-Domínguez et al., 2021), and the r-index (Gagie et al., 2018). Moreover, for indexes that do not support efficient count queries, our method offers significantly faster performance — often by orders of magnitude — with only a modest space overhead observed in certain corpora. © 2026 Elsevier Ltd.

    ...
  • 2.LZ78 Substring Compression in Compressed Space

    • 关键词:
    • Lossless data compression; LZ78 factorization; Substring compression;CDAWG; C630 Computational Techniques; Simulation Modeling
    • Shibata, Hiroki;Koppl, Dominik
    • 《THEORY OF COMPUTING SYSTEMS》
    • 2025年
    • 70卷
    • 1期
    • 期刊

    The Lempel-Ziv 78 (LZ78) factorization is a well-studied technique for data compression. It and its derivates are used in compression formats such as compress or gif. Although most research focuses on the factorization of plain data, not much research has been conducted on indexing the data for fast LZ78 factorization. Here, we study the LZ78 factorization and its derivates in the substring compression model, where we are allowed to index the data and return the factorization of a substring specified at query time. In that model, we propose an algorithm that works in compressed space, computing the factorization with a logarithmic slowdown compared to the optimal time complexity.

    ...
  • 3.Extended parameterized Burrows–Wheeler transform

    • 关键词:
    • Backward wave tubes;Data structures;Indexing (of information);Pattern matching;Query processing;Burrows-Wheeler Transform;Circular text;Extended burrow–wheeler transform;Full-text index;Matching query;Matching statistics;Natural extension;Parameterized;Parameterized pattern matching;Pattern-matching
    • Osterkamp, Eric M.;Köppl, Dominik
    • 《Information Systems》
    • 2026年
    • 136卷
    • 期刊

    The Burrows–Wheeler transform (BWT) lies at the heart of succinct and compressed full-text indexes for pattern matching queries. Notable variants are (a) the extended BWT (eBWT) capable to index multiple circular texts for pattern matching, or (b) the parameterized BWT (pBWT) for parameterized pattern matching. A natural extension is the combination of the virtues of both variants into a new data structure, whose name we coin with extended parameterized BWT (epBWT). We show that the epBWT supports pattern matching in context of parameterized pattern matching on multiple circular texts, within the same complexities as known solutions presented for the pBWT [Kim and Cho, IPL’21] for patterns not longer than the shortest indexed text. Additionally, we show how to compute the epBWT within the same complexities as [Iseri et al., ICALP’24], i.e., in compact space and quasilinear time. As an application, we extend the matching statistics problem to the parameterized pattern matching setting on circular texts. © 2025 Elsevier Ltd

    ...
  • 4.Efficient Solutions toVariants ofInversion Problems ofRange Minimum Queries

    • 关键词:
    • Query processing;Consistent sequence;Constant delays;Enumeration algorithms;Inversion problems;Minimum value;Range minimum queries;Survey analysis;Unified analysis;Value-based
    • Kobayashi, Souta;Köppl, Dominik;Yoshinaka, Ryo;Shinohara, Ayumi
    • 《51st International Conference on Current Trends in Theory and Practice of Computer Science, SOFSEM 2026》
    • 2026年
    • February 9, 2026 - February 13, 2026
    • Krakow, Poland
    • 会议

    Given a set of results from range minimum queries (RMQs), our task is to construct a sequence that is consistent with the results of the queries. We study two types of RMQs: a value-based RMQ returns the minimum value and an index-based RMQ returns the index of the minimum. While the value-based version has been discussed informally in the context of competitive programming, the index-based version appears to be unexplored. In this paper, we provide a survey and unified analysis of the value-based version, and we propose efficient algorithms for the index-based version. These include algorithms for computing the lexicographically smallest consistent sequence and permutation, as well as an enumeration algorithm that outputs all consistent permutations with constant delay. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2026.

    ...
  • 5.Counting Distinct (Non-)crossing Substrings

    • 关键词:
    • Rhenium compounds;All positions;Constant sizes;Distinct substring;LPF array;Ordered alphabets;Run;String algorithms;Sub-strings
    • Umezaki, Haruki;Shibata, Hiroki;Köppl, Dominik;Nakashima, Yuto;Inenaga, Shunsuke;Bannai, Hideo
    • 《32nd International Symposium on String Processing and Information Retrieval, SPIRE 2025》
    • 2026年
    • September 8, 2025 - September 11, 2025
    • London, United kingdom
    • 会议

    Let w be a string of length n. The problem of counting factors crossing a position - Problem 64 fromthe textbook "125 Problems in Text Algorithms" [Crochemore, Leqroc, and Rytter, 2021], asks to count the number C(w,k) (resp. N(w,k)) of distinct substrings in w that have occurrences containing (resp. not containing) a position k in w. The solutions provided in their textbook compute C(w,k) and N(w,k) in O(n) time for a single positionk in w, and thus a direct application would require O(n2) time for all positionsk=1,…,n in w. Their solution is designed for constant-size alphabets. In this paper, we present new algorithms which compute C(w,k) in O(n) total time for general ordered alphabets, and N(w,k) in O(n) total time for linearly sortable alphabets, for all positions k=1,…,n in w. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2026.

    ...
  • 6.A Survey of the Bijective Burrows-Wheeler Transform

    • 关键词:
    • Backward wave tubes;Compaction;Indexing (of information);Soft computing;Bijective string transformation;Burrows-Wheeler Transform;Compression;Construction algorithms;Index construction;Index construction algorithm;Lyndon words;Repetitiveness measure;String transformation;Text-indexing
    • Bannai, Hideo;Köppl, Dominik;Lipták, Zsuzsanna
    • 《Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday》
    • 2025年
    • July 25, 2025
    • Venice, Italy
    • 会议

    The Bijective BWT (BBWT), conceived by Scott in 2007, later summarized in a preprint by Gil and Scott in 2009 (arXiv 2012), is a variant of the Burrows-Wheeler Transform which is bijective: every string is the BBWT of some string. Indeed, the BBWT of a string is the extended BWT [Mantaci et al., 2007] of the factors of its Lyndon factorization. The BBWT has been receiving increasing interest in recent years. In this paper, we survey existing research on the BBWT, starting with its history and motivation. We then present algorithmic topics including construction algorithms with various complexities and an index on top of the BBWT for pattern matching. We subsequently address some properties of the BBWT as a compressor, discussing robustness to operations such as reversal, edits, rotation, as well as compression power. We close with listing other bijective variants of the BWT and open problems concerning the BBWT. © Hideo Bannai, Dominik Köppl, and Zsuzsanna Lipták.

    ...
  • 7.Enumeration of Ordered Trees with Leaf Restrictions

    • 关键词:
    • Forestry;Time delay;Timing circuits;Trees (mathematics);Constant time delays;Enumeration algorithms;Nonnegative integers;Ordered tree;Rooted trees
    • Kobayashi, Yasuaki;Köppl, Dominik;Matsui, Yasuko;Ono, Hirotaka;Saitoh, Toshiki;Uno, Yushi
    • 《From Strings to Graphs, and Back Again: A Festschrift for Roberto Grossi's 60th Birthday 2025》
    • 2025年
    • July 25, 2025
    • Venice, Italy
    • 会议

    An α-ary tree for a constant α ≥ 2 is a rooted tree in which each node has at most α children. A node having no children is called a leaf. For a given rooted tree and a node v, the number of edges from the root to v is called the depth of v. We call a vector w = (w1, w2, . . ., wd) of nonnegative integers an (α-ary) distribution if there is an α-ary tree T such that the number of leaves at each depth i ∈ [1..d] in T is wi. Although not every vector of nonnegative integers is a distribution, a distribution can be associated with many α-ary trees. In this paper, we present an algorithm to enumerate all α-ary trees for a given distribution. Our algorithm reports the first tree in O(d + ∑di=1 wi) time, and then each subsequent α-ary tree in O(maxdi=1 wi) time by representing each tree as the difference from the previous one. The algorithm can be restricted to computing all trees that are full, i.e., trees whose nodes have exactly α or no children. © Yasuaki Kobayashi, Dominik Köppl, Yasuko Matsui, Hirotaka Ono, Toshiki Saitoh, and Yushi Uno;

    ...
  • 排序方式:
  • 1
  • /