Subtopic Deep Dive

Suffix Arrays
Research Guide

What Are Suffix Arrays?

Suffix arrays are space-efficient integer arrays that store the starting positions of all suffixes of a string in lexicographically sorted order, enabling efficient full-text indexing and pattern matching as an alternative to suffix trees.

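Suffix arrays can be built in O(n log n) time with prefix-doubling methods, or in O(n) time with algorithms such as DC3. They support O(m log n) pattern searches by plain binary search, improving to O(m + log n) when longest-common-prefix (LCP) information is stored alongside, where n is the string length and m is the pattern length. They underpin Burrows-Wheeler transform (BWT) techniques for genomic data compression and alignment. More than 10 of the curated papers apply suffix arrays to sequence analysis, with MUMmer4 (Marçais et al., 2018) garnering 2517 citations.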
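As a minimal sketch of the definition and search described above, the following Python builds a suffix array by naive sorting and finds pattern occurrences by binary search. The function names are illustrative assumptions, not taken from any cited tool, and the naive construction is far slower than the algorithms used in practice.

```python
def build_suffix_array(text):
    # Naive construction (assumption: illustration only): sort suffix start
    # positions by the suffix text. Worst case O(n^2 log n).
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    # Suffixes that start with `pattern` form one contiguous block in `sa`.
    # Two binary searches locate the block: O(m log n) character comparisons.
    m = len(pattern)

    lo, hi = 0, len(sa)
    while lo < hi:  # lower bound: first suffix whose m-prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    hi = len(sa)
    while lo < hi:  # upper bound: first suffix whose m-prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # Return text positions in increasing order for readability.
    return sorted(sa[start:lo])


sa = build_suffix_array("banana")
print(sa)                                   # [5, 3, 1, 0, 4, 2]
print(find_occurrences("banana", sa, "ana"))  # [1, 3]
```

The search touches only O(log n) suffixes, which is why the array alone, without the tree structure, is enough for exact matching.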

15 Curated Papers · 3 Key Challenges

Why It Matters

Suffix arrays enable indexing of terabyte-scale genomic datasets for read alignment and variant calling, as in MUMmer4 (Marçais et al., 2018) which aligns genomes 16 times faster than prior versions. They power compressed indexes for haplotype matching in PBWT (Durbin, 2014), storing 1000 human genomes efficiently. Applications include error correction in LoRDEC (Salmela and Rivals, 2014) and fast short-read searches (Solomon and Kingsford, 2016), reducing memory needs for bioinformatics pipelines.

Key Research Challenges

Parallel Construction

Linear-time suffix array construction algorithms like DC3 are hard to parallelize on multi-core systems for massive strings. BarraCUDA (Klus et al., 2012) accelerates alignment via GPUs, but construction remains a bottleneck. Scaling to petabyte genomic repositories demands new approaches (Cox et al., 2012).

Compressed Indexing

Balancing space and query speed in FM-indexes derived from suffix arrays is difficult for terabyte datasets. PBWT (Durbin, 2014) uses a positional BWT for haplotypes, but compression of general sequence collections lags behind. Lam et al. (2008) achieve local alignment, but query times degrade with extreme compression.
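To make the link between suffix arrays and FM-indexes concrete, here is a small uncompressed sketch: it derives the BWT from a suffix array and counts pattern occurrences by backward search. All names are hypothetical, the occurrence table is stored in full (real FM-indexes sample or compress it, which is where the space/speed trade-off arises), and the text is assumed to end with a unique sentinel `$` that sorts before every other character.

```python
def bwt_from_suffix_array(text, sa):
    # BWT[r] is the character preceding the r-th smallest suffix.
    # For suffix 0, text[-1] wraps around to the sentinel.
    return "".join(text[i - 1] for i in sa)


def build_fm_tables(bwt):
    alphabet = sorted(set(bwt))
    # C[c]: number of characters in the text strictly smaller than c.
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += bwt.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i] (uncompressed).
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return C, occ


def count_occurrences(pattern, C, occ, n):
    # Backward search: narrow the suffix-array interval [lo, hi)
    # one pattern character at a time, from right to left.
    lo, hi = 0, n
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + occ[c][lo]
        hi = C[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo


text = "banana$"
sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive, for illustration
bwt = bwt_from_suffix_array(text, sa)
C, occ = build_fm_tables(bwt)
print(bwt)                                          # annb$aa
print(count_occurrences("ana", C, occ, len(text)))  # 2
```

Counting needs only the BWT-derived tables, not the text or the suffix array itself, which is what lets compressed indexes discard or sample the array.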

k-Mismatch Searches

Efficiently handling mismatches in genomic alignments with suffix arrays is compute-intensive for short reads. kmacs (Leimeister and Morgenstern, 2014) uses average common substrings but scales poorly beyond k=2. Adaptive seeds (Kiełbasa et al., 2011) improve sensitivity yet increase preprocessing.

Essential Papers

1. MUMmer4: A fast and versatile genome alignment system

Guillaume Marçais, Arthur L. Delcher, Adam M. Phillippy et al. · 2018 · PLoS Computational Biology · 2.5K citations

The MUMmer system and the genome sequence aligner nucmer included within it are among the most widely used alignment packages in genomics. Since the last major release of MUMmer version 3 in 2004, ...

2. Adaptive seeds tame genomic sequence comparison

Szymon M. Kiełbasa, Raymond Wan, Kengo Sato et al. · 2011 · Genome Research · 1.4K citations

The main way of analyzing biological sequences is by comparing and aligning them to each other. It remains difficult, however, to compare modern multi-billionbase DNA data sets. The difficulty is c...

3. LoRDEC: accurate and efficient long read error correction

Leena Salmela, Éric Rivals · 2014 · Bioinformatics · 930 citations

Abstract Motivation: PacBio single molecule real-time sequencing is a third-generation sequencing technique producing long reads, with comparatively lower throughput and higher error rate. Errors i...

4. Efficient haplotype matching and storage using the positional Burrows–Wheeler transform (PBWT)

Richard Durbin · 2014 · Bioinformatics · 489 citations

Abstract Motivation: Over the last few years, methods based on suffix arrays using the Burrows–Wheeler Transform have been widely used for DNA sequence read matching and assembly. These provide ver...

5. Fast search of thousands of short-read sequencing experiments

Brad Solomon, Carl Kingsford · 2016 · Nature Biotechnology · 164 citations

6. RAPSearch: a fast protein similarity search tool for short reads

Yuzhen Ye, Jeong‐Hyeon Choi, Haixu Tang · 2011 · BMC Bioinformatics · 148 citations

7. BarraCUDA - a fast short read sequence aligner using graphics processing units

Petr Klus, Simon Lam, Dag Lyberg et al. · 2012 · BMC Research Notes · 145 citations

Reading Guide

Foundational Papers

Start with Adaptive seeds (Kiełbasa et al., 2011) for genomic applications; PBWT (Durbin, 2014) for BWT integration; LoRDEC (Salmela and Rivals, 2014) for error correction contexts.

Recent Advances

MUMmer4 (Marçais et al., 2018) for state-of-the-art alignment; Fast search (Solomon and Kingsford, 2016) for short-read indexing; Large-scale compression (Cox et al., 2012) for BWT databases.

Core Methods

DC3/Skew for O(n) construction; FM-index for compressed queries; LCP arrays for longest common prefixes; BWT permutation for run-length encoding.
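Of the methods listed above, the LCP array is the easiest to show in a few lines. The sketch below follows the well-known linear-time approach of Kasai et al., which is not one of the curated papers and is named here as the technique used; the function name and the convention that `lcp[0]` is 0 are assumptions made for this example.

```python
def lcp_array(text, sa):
    # lcp[r] = length of the longest common prefix of the suffixes at
    # sa[r - 1] and sa[r]; lcp[0] is defined as 0 (assumed convention).
    n = len(text)
    rank = [0] * n
    for r, pos in enumerate(sa):
        rank[pos] = r  # inverse suffix array

    lcp = [0] * n
    h = 0  # current match length, carried over between text positions
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]  # suffix just before suffix i in sorted order
            while i + h < n and j + h < n and text[i + h] == text[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1  # the next suffix shares at least h - 1 characters
        else:
            h = 0
    return lcp


text = "banana"
sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive, for illustration
print(lcp_array(text, sa))  # [0, 1, 3, 0, 0, 2]
```

Because `h` drops by at most one per step, the total work of the inner loop is O(n), which is what makes the LCP array cheap to add once the suffix array exists.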

How PapersFlow Helps You Research Suffix Arrays

Discover & Search

Research Agent uses searchPapers('suffix array genomic alignment') to retrieve MUMmer4 (Marçais et al., 2018), then citationGraph to map 2517 citing works and findSimilarPapers for parallel variants like BarraCUDA. exaSearch uncovers niche implementations in low-citation papers.

Analyze & Verify

Analysis Agent applies readPaperContent on PBWT (Durbin, 2014) to extract suffix array query complexities, verifies claims with verifyResponse (CoVe) against LoRDEC (Salmela and Rivals, 2014), and uses runPythonAnalysis to benchmark construction times with NumPy on sample genomes, with GRADE scoring for empirical validation.

Synthesize & Write

Synthesis Agent detects gaps in parallel construction across papers like Klus et al. (2012) and flags contradictions in space claims between Cox et al. (2012) and Lam et al. (2008). Writing Agent uses latexEditText for algorithm pseudocode, latexSyncCitations for a 10-paper bibliography, latexCompile for the PDF, and exportMermaid for BWT construction diagrams.

Use Cases

"Benchmark suffix array construction time vs suffix tree on 1GB genome"

Research Agent → searchPapers → Analysis Agent → runPythonAnalysis (NumPy timer on DC3 vs Ukkonen implementations from paper code) → matplotlib plot of n log n scaling.

"Write LaTeX section on FM-index from suffix arrays with citations"

Synthesis Agent → gap detection → Writing Agent → latexEditText (insert FM-index equations) → latexSyncCitations (add Durbin 2014, Lam 2008) → latexCompile → PDF with diagram via exportMermaid.

"Find GitHub repos implementing GPU suffix arrays from papers"

Research Agent → searchPapers('BarraCUDA') → Code Discovery (paperExtractUrls → paperFindGithubRepo → githubRepoInspect) → verified CUDA kernels for short-read alignment.

Automated Workflows

Deep Research workflow scans 50+ suffix array papers via searchPapers → citationGraph, producing structured report with construction algorithm taxonomy and citation networks. DeepScan applies 7-step analysis: readPaperContent on MUMmer4 → runPythonAnalysis benchmarks → CoVe verification → GRADE evidence table. Theorizer generates hypotheses on quantum-accelerated suffix sorting from BWT papers like Cox et al. (2012).

Frequently Asked Questions

What is a suffix array?

A suffix array is an integer array of suffix starting positions sorted lexicographically. It enables pattern matching in O(m log n) time by binary search, or O(m + log n) with additional LCP information, without storing the full suffix tree.

What are main construction methods?

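The DC3 algorithm, also known as the skew algorithm, achieves O(n) time by recursively sorting the suffixes at positions not divisible by 3 and then merging in the remaining ones. Simpler prefix-doubling methods run in O(n log n). MUMmer4 (Marçais et al., 2018) uses optimized variants for genomes.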
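DC3 is too long to reproduce here, so the following sketch shows prefix doubling instead, a simpler and asymptotically slower construction method. It is not DC3 and not the method of any cited paper. Because it uses a comparison sort in each round, this version runs in O(n log^2 n); replacing the sort with radix sort would give O(n log n). The function name is an assumption.

```python
def suffix_array_prefix_doubling(text):
    # Sort suffixes by their first k characters, doubling k each round.
    n = len(text)
    if n == 0:
        return []
    rank = [ord(c) for c in text]  # initial rank: the first character
    sa = list(range(n))
    k = 1
    while True:
        # Each suffix is keyed by (rank of first k chars, rank of next k chars);
        # -1 marks "past the end" and sorts before every real rank.
        def key(i, rank=rank, k=k):
            return (rank[i], rank[i + k] if i + k < n else -1)

        sa.sort(key=key)

        # Re-rank: equal keys share a rank, a differing key gets the next one.
        new_rank = [0] * n
        for idx in range(1, n):
            prev, cur = sa[idx - 1], sa[idx]
            new_rank[cur] = new_rank[prev] + (1 if key(prev) != key(cur) else 0)
        rank = new_rank

        if rank[sa[-1]] == n - 1:  # all ranks distinct: order is final
            return sa
        k *= 2


print(suffix_array_prefix_doubling("banana"))  # [5, 3, 1, 0, 4, 2]
```

At most about log2(n) rounds are needed, since after a round with prefix length k the suffixes are ordered by their first 2k characters.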

What are key papers?

MUMmer4 (Marçais et al., 2018, 2517 citations) for alignment; PBWT (Durbin, 2014, 489 citations) for BWT+suffix arrays; Adaptive seeds (Kiełbasa et al., 2011, 1408 citations) for mismatch handling.

What are open problems?

Parallel construction at exabyte scale; quantum-resistant compressed indexes; real-time k-mismatch for streaming metagenomics beyond kmacs (Leimeister and Morgenstern, 2014).
