1. Introduction
Genome sequencing, a key component of precision health, is necessary for early detection of cancer, autism, infectious diseases (such as COVID-19), and genetic diseases. With the innovation in genome sequencing technologies over the past decade, sequencing data is being generated more rapidly and at a lower cost, increasing at a rate that far outpaces Moore’s Law. The total amount of sequencing data has been doubling approximately every seven months.20 The cost to sequence a human genome has dropped from $100 million at the beginning of this century to less than $1,000 nowadays.3
This large volume of sequencing data poses significant computational challenges and requires novel computing solutions that can keep pace. Recent works explore architecture-aware optimizations on commodity hardware, such as leveraging SIMD hardware on CPUs10,24 and thread-level parallelism on GPUs.1,17 Custom accelerators, however, achieve much better performance and are more area- and power-efficient than CPUs and GPUs.2,6,23,25 These accelerators gain orders-of-magnitude speedups over general-purpose hardware, but at a high hardware-design cost. Individual genomics “kernels” (algorithms) do not have a market large enough to justify a custom chip, and designing an accelerator around the particular implementation of a single kernel is risky, since the state-of-the-art implementation may change significantly within a few years. For example, Smith-Waterman (SW), the basic approximate string-matching algorithm, has evolved from the original SW to a banded SW, and further to an adaptive banded SW and a wavefront version. The high cost of designing custom accelerators and the frequent evolution of kernels therefore motivate a generic domain-specific accelerator for genome sequencing analysis.
In commonly used genome sequencing pipelines, dynamic programming (DP) algorithms are pervasive: they appear in read alignment and variant calling for reference-guided alignment, in the layout and polishing steps of de novo assembly, and in abundance estimation for metagenomics classification.22 Just as matrix multiplication lies at the heart of machine learning (ML) applications and motivated the design of tensor processing units (TPUs), DP algorithms are adopted by many genomics kernels and account for large fractions of time in mainstream sequencing pipelines, which creates the opportunity for a dynamic programming accelerator that supports existing and future DP kernels.
Dynamic programming simplifies a complicated problem by breaking it down into sub-problems which can be solved recursively. However, accelerating a general-purpose DP algorithm comes with several challenges. First, DP kernels in common sequencing pipelines have different dependency patterns, including both 1-Dimension and 2-Dimension DP tables. Some kernels have long-range dependencies in the graph structure, where cells in the DP table not only depend on the neighboring cells, but also depend on cells far away. Second, DP kernels have different objective functions which include multiple operators. For instance, approximate string matching, an algorithm applied in DNA, RNA, and protein sequence alignment, has three modes: local (Smith-Waterman), global (Needleman-Wunsch), and semi-global alignment (overlap), as well as three methods for scoring insertions and deletions: linear, affine, and convex. Each mode or method above requires a unique objective function. Third, DP kernels have different precision requirements. It is challenging to support multiple precision arithmetic while neither losing efficiency for low-precision computation nor compromising accuracy for high-precision computation.
To address these challenges, we propose GenDP, a framework of dynamic programming acceleration for genome sequencing analysis, which supports multiple DP kernels. First, we present DPAx, a DP accelerator capable of solving multiple dependency patterns by providing flexible interconnections between processing elements (PEs) in the systolic array. The systolic array helps exploit the wavefront parallelism in the DP table and provides better spatial locality for DP dataflow. DPAx decouples the control and compute instructions in the systolic array. Second, we present DPMap, a graph-partitioning algorithm which maps the data-flow graph of the objective function to compute units in the DPAx accelerator. DPAx supports different objective functions and multiple precision arithmetic by programmable compute units.
We evaluate the GenDP framework on four DP kernels: Banded Smith-Waterman (BSW), Chain, Pairwise Hidden Markov Model (PairHMM), and Partial Order Alignment (POA).22 We also demonstrate the generality of the proposed framework by extending it to other dynamic programming algorithms, such as Dynamic Time Warping (DTW), which is commonly used for speech detection, and the Bellman-Ford (BF) algorithm for shortest-path search in robotic motion-planning tasks. In summary, this paper makes the following contributions:
- We propose GenDP, a general-purpose acceleration framework for dynamic programming algorithms.
- We design DPAx, an accelerator with programmable compute units, specialized dataflow, and flexible interconnections, supporting multiple dependency patterns, objective functions, and multi-precision arithmetic.
- We describe DPMap, a graph-partitioning algorithm, to map the data-flow graph of DP objective functions to the compute units in DPAx.
- GenDP achieves 157.8× throughput per unit area and 15.1× improvement over GPU baselines, and 132.0× throughput per unit area over CPU baselines.
2. Background
Genome sequencing starts with raw data from the sequencer. The raw signals are interpreted to derive reads (short sequences of base pairs), a process called basecalling. Next-generation sequencing (NGS) technologies produce short reads of up to a few hundred base pairs (bp), while third-generation technologies produce much longer reads (on the order of 10,000 bp). After obtaining reads from the raw data, there are two important analysis pipelines: reference-guided assembly and “de novo” assembly (without using a reference genome).
2.1 Common genomics pipelines.
In reference-guided assembly, the sample genome is reconstructed by aligning reads to an existing reference genome. Read alignment can be abstracted to an approximate string-matching problem, where dynamic programming is used to estimate the pairwise similarity between the read and reference sequence.24 After alignments, small variants (mutations) still exist in aligned reads. A Hidden Markov Model (HMM) or ML model is then applied to detect such mutations, in a step known as variant calling.15
If there is no reference genome available for alignment, the genome sequence needs to be constructed with reads from scratch, which is referred to as “de novo” assembly. Reads with overlapping regions can be chained to build an overlap graph and are then further extended into larger contiguous regions. Finally, assembly errors are corrected in a graph-based dynamic programming polishing step.
In addition to the two analysis pipelines above, metagenomics classification is used for real-time pathogen detection and microbial abundance estimation.12 Metagenomics classification aligns input microbial reads to a reference pan-genome (consisting of different species) and then estimates the proportion of different microbes in the sample.
2.2 Dynamic programming.
Dynamic programming simplifies a problem by breaking it down into subproblems. Following the Bellman equation, which describes the objective function, the sub-problems can be solved recursively from the initial conditions. Longest common subsequence (LCS) is a classic DP algorithm that looks for the LCS of two known sequences X and Y. First, looking for the LCS between the prefixes X(1,i) and Y(1,j) can be simplified to looking for the LCSs between X(1,i-1) and Y(1,j), as well as between X(1,i) and Y(1,j-1), which can be further reduced to computing the LCS between X(1,i-1) and Y(1,j-1). If we define L(i,j) to be the length of an LCS between X(1,i) and Y(1,j), the objective function can be represented by Equation 1:

$$
L(i,j) = \begin{cases}
0 & \text{if } i = 0 \text{ or } j = 0,\\
L(i-1,\,j-1) + 1 & \text{if } X(i) = Y(j),\\
\max\bigl(L(i-1,\,j),\, L(i,\,j-1)\bigr) & \text{if } X(i) \neq Y(j).
\end{cases}
\tag{1}
$$
Second, a DP table can be constructed based on the sequences X and Y to memorize the sub-problem results, as shown in Figure 2. L(i,j) is calculated from its upper, left, and diagonal neighbors L(i-1,j), L(i,j-1), and L(i-1,j-1). The first row and first column of the DP table are filled with 0. Based on this initial condition, the cells of the whole DP table can be filled out. Finally, the largest value in the table is the length of the longest common subsequence, and the corresponding subsequence can be found in the trace-back stage, shown as orange blocks in Figure 2.
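For illustration, the recurrence in Equation 1 and the trace-back stage can be written as a short Python sketch (a plain software illustration, not part of the GenDP toolchain; the example sequences are arbitrary):

```python
def lcs(x, y):
    """Fill the LCS DP table of Equation 1 and trace back one LCS."""
    m, n = len(x), len(y)
    # L[i][j] = length of an LCS of x[:i] and y[:j]; row 0 and column 0 stay 0.
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    # Trace back from the bottom-right corner to recover one LCS.
    chars, i, j = [], m, n
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            chars.append(x[i - 1]); i -= 1; j -= 1
        elif L[i - 1][j] >= L[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return L[m][n], "".join(reversed(chars))

length, seq = lcs("GATTACA", "GTCGACGA")
print(length, seq)   # length 5; one valid LCS, e.g. "GTACA"
```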


2.3 DP kernels in genomics pipelines.
We introduce four important and time-consuming DP kernels from commonly used genomics pipelines, as shown in Figure 1. BSW is applied in read alignment, and variants of BSW are also used for RNA and protein alignment. PairHMM is used in post-alignment variant calling. POA is applied in the polishing step of assembly. Chain is used in both alignment and assembly of long-read sequencing, as well as in metagenomics classification. These four kernels account for 31%, 70%, 47%, and 75% of the time in their corresponding sequencing pipeline stages, respectively.22 The details of these algorithms are explained as follows:
Banded Smith-Waterman is the banded version of the Smith-Waterman19 algorithm, which estimates the pairwise similarity between the query and reference sequences. The similarity score for a given DNA sequence is typically computed with affine-gap penalties, identifying short insertions and deletions in pairwise alignments. The objective function is shown in Figure 1a; it computes three matrices H, E, and F, corresponding to three edit types: match, insertion, and deletion. S is the similarity score between the bases X(i) and Y(j), and H(i,j) refers to the similarity score for the substrings X(0,i) and Y(0,j). The banded version of Smith-Waterman allows a maximum of w insertions or deletions, illustrated as the region between black cells in Figure 1a. BSW can be computed using 8-bit or 16-bit integer arithmetic depending on the sequence length.
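As a reference point for this recurrence, a banded, affine-gap Smith-Waterman update can be sketched in Python as follows (a schematic version with illustrative match, mismatch, and gap values; the exact formulation and parameters of Figure 1a and BWA-MEM2 may differ):

```python
def banded_sw(x, y, w, match=2, mismatch=-4, gap_open=4, gap_ext=1):
    """Schematic banded Smith-Waterman with affine gaps.

    H: best score ending at (i, j); E/F: best scores ending in an
    insertion/deletion. Only cells with |i - j| <= w are filled.
    """
    NEG = float("-inf")
    m, n = len(x), len(y)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    E = [[NEG] * (n + 1) for _ in range(m + 1)]
    F = [[NEG] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        lo, hi = max(1, i - w), min(n, i + w)   # stay inside the band
        for j in range(lo, hi + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            E[i][j] = max(H[i][j - 1] - gap_open, E[i][j - 1] - gap_ext)
            F[i][j] = max(H[i - 1][j] - gap_open, F[i - 1][j] - gap_ext)
            H[i][j] = max(0, H[i - 1][j - 1] + s, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best

print(banded_sw("ACGTACGT", "ACGTTACGT", w=3))   # 12 with these toy penalties
```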
Pairwise Hidden Markov Model aligns reads to candidate haplotypes identified by De-Bruijn graph traversal. The most likely haplotype supported by the reads is identified from the pairwise alignment, which is performed by an HMM. The likelihood score is computed by the formula shown in Figure 1b, where M(i,j), I(i,j), and D(i,j) represent the match, insertion, and deletion probabilities for aligning the read substring X(0,i) to the haplotype substring Y(0,j). The weights are the transition and emission parameters of the HMM, and the prior term is the probability of emitting bases X(i) and Y(j). The computation in PairHMM uses floating-point arithmetic.25
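The structure of this recurrence can be illustrated with a schematic forward pass in Python (the transition and emission values below are illustrative placeholders; in the real pipeline they are derived from per-base quality scores):

```python
def pairhmm_forward(read, hap,
                    t_mm=0.9, t_mi=0.05, t_md=0.05,   # illustrative transitions
                    t_im=0.8, t_ii=0.2,
                    t_dm=0.8, t_dd=0.2,
                    p_match=0.99, p_mismatch=0.01):
    """Schematic PairHMM forward pass: likelihood of the read given the haplotype."""
    m, n = len(read), len(hap)
    M = [[0.0] * (n + 1) for _ in range(m + 1)]
    I = [[0.0] * (n + 1) for _ in range(m + 1)]
    D = [[0.0] * (n + 1) for _ in range(m + 1)]
    for j in range(n + 1):          # allow the alignment to start anywhere
        D[0][j] = 1.0 / n
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            prior = p_match if read[i - 1] == hap[j - 1] else p_mismatch
            M[i][j] = prior * (t_mm * M[i - 1][j - 1]
                               + t_im * I[i - 1][j - 1]
                               + t_dm * D[i - 1][j - 1])
            I[i][j] = t_mi * M[i - 1][j] + t_ii * I[i - 1][j]
            D[i][j] = t_md * M[i][j - 1] + t_dd * D[i][j - 1]
    # Sum over all end positions along the haplotype in the last read row.
    return sum(M[m][j] + I[m][j] for j in range(1, n + 1))

print(pairhmm_forward("ACGT", "ACGTT"))
```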
Partial Order Alignment: In the assembly polishing step, multiple read sequences are used to construct a partial-order graph, and the consensus sequence is then generated from the graph. Each unaligned sequence is aligned to the existing graph, as shown in Figure 1c. The nodes in the partial-order graph represent bases in the read sequences, and the edge weights denote how many times each edge appears across different reads. Each cell depends not only on the upper and diagonal cells in the previous row but also on cells in earlier rows whenever an edge in the graph connects that row with the current row. The objective function is similar to that used in BSW.
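A schematic of this graph-aware update is sketched below (the scoring values and toy graph are illustrative; the actual POA recurrence also tracks edge weights and trace-back directions):

```python
# Schematic POA-style update: a cell depends on every graph predecessor of
# its row's node, not just on the immediately preceding row.
def poa_row_update(H, preds, i, query, node_base,
                   match=2, mismatch=-2, gap=-1):
    """Fill row i of the DP table given predecessor rows listed in preds[i]."""
    n = len(query)
    for j in range(1, n + 1):
        s = match if query[j - 1] == node_base else mismatch
        best = H[i][j - 1] + gap                      # gap in the graph
        for p in preds[i]:                            # long-range dependencies
            best = max(best, H[p][j - 1] + s,         # match/mismatch
                             H[p][j] + gap)           # gap in the query
        H[i][j] = max(best, 0)

# Rows 1 and 2 both flow into row 3 (a branch that re-converges in the graph).
query = "GAT"
H = [[0] * (len(query) + 1) for _ in range(4)]
preds = {1: [0], 2: [0], 3: [1, 2]}
for i, base in zip((1, 2, 3), "GCT"):
    poa_row_update(H, preds, i, query, base)
print(H[3])
```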
Chain: Given a set of seed pairs (anchors) shared between a pair of reads, Chain aims to group a set of collinear seed pairs into a single overlap region, as shown in Figure 1d(i). In the 1-Dimension DP table (Figure 1d(ii)), each anchor is compared with the N previous anchors (N=25 by default) to determine its best parent. However, the dependency between neighboring anchors poses difficulties for parallelism. The reordered Chain algorithm8 instead compares each anchor with the N subsequent anchors and updates their scores each time (Figure 1d(iii)).
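The difference between the original and reordered update orders can be seen in the following Python sketch (the gain function here is a hypothetical stand-in for the real chaining score, which also involves anchor weights and a log-scaled gap cost):

```python
def gain(anchors, j, i):
    # Hypothetical score contribution for extending chain j -> i.
    return anchors[i][2] - abs((anchors[i][0] - anchors[j][0])
                               - (anchors[i][1] - anchors[j][1]))

def chain_standard(anchors, N=25):
    score = [a[2] for a in anchors]          # start each chain at its own weight
    for i in range(len(anchors)):
        for j in range(max(0, i - N), i):    # look back at N previous anchors
            score[i] = max(score[i], score[j] + gain(anchors, j, i))
    return score

def chain_reordered(anchors, N=25):
    score = [a[2] for a in anchors]
    for j in range(len(anchors)):            # each anchor pushes updates forward
        for i in range(j + 1, min(len(anchors), j + 1 + N)):
            score[i] = max(score[i], score[j] + gain(anchors, j, i))
    return score

anchors = [(0, 0, 10), (50, 52, 10), (100, 101, 10), (400, 150, 10)]
# Both orders visit the same (j, i) pairs, so the results match here.
assert chain_standard(anchors) == chain_reordered(anchors)
print(chain_reordered(anchors))
```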
Table 1 summarizes the characteristics of four DP kernels above, including the dimension and size of DP tables, dependency patterns and arithmetic precision. BSW and PairHMM are used for short read pipelines, while POA and Chain are used in long read pipelines with larger DP tables. The first three kernels have 2-Dimension DP tables, whereas Chain has a 1-Dimension DP table. These kernels also have different precision requirements.
| Kernels | Dimension | Dependency | Precision |
|---|---|---|---|
| BSW | 2D Table | Last 2 Wavefronts | 8-bit/16-bit Integer |
| PairHMM | 2D Table | Last 2 Wavefronts | Floating-point |
| POA | 2D Table | Graph structure, long-range dependency | 32-bit Integer |
| Chain | 1D Table | Last N Anchors | 32-bit Integer, Floating-point |
3. GenDP Framework
Figure 3 demonstrates the structure of the GenDP framework. For a new DP kernel, we analyze its inter-cell dependency pattern and the intra-cell objective function, as shown in the top and bottom blocks, respectively. First, the inter-cell dependency patterns are determined by the recursion rule in the DP kernel. Based on the dependency pattern, we configure the PE interconnection and generate the control instructions. Second, the intra-cell data-flow graph for the objective function is mapped to compute units based on the DPMap algorithm. The compute instructions are generated based on the mapping results.

3.1 Inter-cell dependency pattern supports.
An overview of the DPAx architecture is described in Figure 4, including 16 integer PE arrays and one floating-point PE array. Each PE array contains four PEs connected as a 1-Dimension systolic array, in which each PE can receive data from the previous PE and pass data to the next one. The interconnections of the 16 integer PE arrays are configured based on the dependency pattern of the DP kernel. The last PE in the PE array can either be connected to the first PE in the next PE array or to the data buffer directly. The 16 integer PE arrays can be concatenated and make up a large systolic array consisting of 64 PEs. Figures 5b and 5d show PE arrays of size 4 and 8 respectively.


For kernels with a 2D DP table, cells within a row are executed in the same PE. For example, rows with different colors in Figure 5(a) are executed in PEs with the corresponding colors in Figure 5(b). Each element of the target sequence is stored statically in a PE, while elements of the query sequence are streamed through the PEs in the same array. Each cell depends on results from three neighbors: the result of the left cell in the same row is stored inside the PE, while the results of the upper and diagonal cells in the previous row are transferred from the previous PE. First-in-first-out (FIFO) buffers connect the last and first PEs in the array, through which the first PE (the first row in the next four-row group) acquires its dependent data from the last PE (the last row in the current four-row group). The PE array size determines how many rows can be executed in parallel.
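To make the row-to-PE mapping concrete, the following small Python sketch (purely illustrative; the sizes are arbitrary) prints which DP cell each PE of a 4-PE array would update at each time step, showing the anti-diagonal wavefront:

```python
# PE p holds row p of the current 4-row group and starts one step after its
# upstream neighbour, so the active cells at each step form a wavefront.
NUM_PES, ROWS, COLS = 4, 4, 6

for step in range(ROWS + COLS - 1):
    active = []
    for pe in range(NUM_PES):
        col = step - pe                      # PE `pe` lags its neighbour by one step
        if 0 <= col < COLS:
            active.append((pe, (pe, col)))   # (PE id, DP cell it updates)
    print(f"step {step}: " + ", ".join(f"PE{p}->cell{c}" for p, c in active))
```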
For kernels with a 1D DP table, cells are mapped to a large PE array as shown in Figure 5(d). When multiple small PE arrays are connected to form a large PE array, only the FIFO in the first PE array is used. For example, in the first time step, cells #1-8 in Figure 5(c) are mapped to PEs as shown in Figure 5(d). Cells #1-8 all depend on cell #0, so the value of cell #0 is loaded from the FIFO to each PE sequentially. During the next time step, cells #2-8 move forward to their neighboring PEs. Cell #9 is sent from the input buffer to the first PE, while cell #1 is moved out from the last PE. Meanwhile, cell #1 is loaded from the FIFO to each PE because cells #2-9 all depend on cell #1.
Long-range dependencies in the graph structure are supported by scratchpad memories (SPM) inside each PE. In a 2D DP table, usually only the result of the left cell is stored in the PE. However, if a cell depends on other cells in the same row but far away, all the candidate dependent cells need to be stored in the PE. In kernels with long-range dependencies, the result for each cell is not only stored in registers for reuse by the next cell but also stored in SPM for potential reuse by later cells.
3.2 Intra-cell objective function mapping.
To support DP objective functions efficiently in the PE, we design a multi-level ALU array in the compute unit. This multi-level ALU array is consistent with common compute patterns in genomics DP kernels and reduces the register access pressure as well. The computations in the objective function are represented by data-flow graphs. DPMap partitions the data-flow graph into multiple subgraphs which can be further mapped to the ALU arrays in compute units. The partition rules will be illustrated in Section 5.
4. DPAx Architecture
An overview of the DPAx architecture is shown in Figure 4 and discussed in the previous section. Figure 6 shows the architecture of the PE array and the PE. The systolic array architecture simplifies control by letting data flow between neighboring PEs. However, the systolic data path alone cannot satisfy the requirements of all DP kernels. For example, POA has long-range dependencies whose pattern is determined by the graph structure, and serving such dependencies requires branch instructions for data movement. Therefore, DPAx decouples the computation and control architecture to provide flexible data movement, similar to decoupled access-execute architectures.18 Meanwhile, the parallelism and the common reduction-tree pattern observed in genomics DP kernels (Section 4.2) motivate the VLIW architecture.

The PE array consists of the input and output data buffers, control instruction buffer, decoder, FIFO buffer, and four PEs. PEs are connected as a systolic array and data can be passed from one PE to the next. The FIFO connects the last and first PEs. The first PE is connected to the input data buffer to receive the input sequences. The last PE stores the results to the output data buffer. The last PE in the PE array is also equipped with a dedicated port connected to the first PE in the next PE array to build a larger PE group. In Figure 6, blue solid and orange dotted lines show the data and control flow respectively.
4.1 Processing element.
Each PE is capable of running a control and a compute thread in parallel. Control and compute instructions are stored in two instruction buffers and decoded separately. Each PE contains a register file (RF) and a SPM that store the short- and long-range dependencies respectively. Load and store ports are connected to the previous and next PEs. The first and last PEs are also connected to the FIFO and input/output data buffers.
Each PE is a two-way VLIW compute-unit array that can execute two independent compute instructions in parallel to exploit instruction-level parallelism (ILP). Every PE contains two 32-bit compute units that execute the VLIW instructions. Each compute unit (CU) can execute operations either on 32-bit operands or, as a SIMD unit, on four concurrent 8-bit operand groups to exploit data-level parallelism (DLP). The SIMD unit improves the performance of low-precision kernels such as BSW, where four DP tables are mapped to the four SIMD lanes. The floating-point PE array and PE architecture are similar to the integer ones but support only 32-bit FP operands.
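The four-lane, 8-bit SIMD mode can be pictured with the following Python sketch (a software illustration of lane packing only; the hardware operates on packed registers directly):

```python
# Four independent 8-bit DP lanes packed into one 32-bit word, updated with a
# single lane-wise max.
def lanes(word):                      # unpack a 32-bit word into 4 bytes
    return [(word >> (8 * k)) & 0xFF for k in range(4)]

def pack(vals):                       # pack 4 bytes into one 32-bit word
    return sum((v & 0xFF) << (8 * k) for k, v in enumerate(vals))

def simd_max(x, y):                   # lane-wise max of two packed words
    return pack([max(a, b) for a, b in zip(lanes(x), lanes(y))])

a, b = pack([10, 200, 30, 4]), pack([90, 20, 30, 250])
print(lanes(simd_max(a, b)))          # [90, 200, 30, 250]
```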
4.2 Compute unit design choice.
We observe that genomics DP kernels have a common reduction tree pattern as shown in Figure 7 (a) and (b). Thus, we propose a reduction tree architecture for the CU. The outputs of the first-level ALUs are used as inputs to the next-level ALUs. The CU also contains a multiplication module for the weight calculation in the chain kernel. Since multiplication increases the length of the CU critical path, we design it as a separate unit from the ALU reduction tree.

Figure 7 (c), (d), and (e) show three possible choices for the ALU reduction tree. Compared to a one-level reduction tree, the two-level design requires fewer register-file accesses and offers a better balance of critical path and area. A three-level reduction tree, in turn, decreases ALU utilization when there are not enough patterns to map to the CU (Section 5).
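A single compute-unit instruction over the two-level tree of Figure 7(d) can be mimicked in Python as follows (operator choices and operands are illustrative, and the hardware four-input ALU need not be built from two-input steps):

```python
import operator

def alu4(op, a, b, c, d):
    # Four-input first-level ALU, modeled here from two-input steps.
    return op(op(a, b), op(c, d))

def compute_unit(op_left, op_right, op_root, operands):
    a, b, c, d, e, f = operands           # six operands per CU instruction
    left = alu4(op_left, a, b, c, d)      # first level: four-input ALU
    right = op_right(e, f)                # first level: two-input ALU
    return op_root(left, right)           # second-level ALU

# An H(i,j)-style update: take the max of four candidates, then compare it
# with the sum of two more operands.
print(compute_unit(max, operator.add, max, (0, 7, 3, 5, 4, 2)))  # max(7, 6) = 7
```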
4.3 Execution model and GenDP ISA.
We adopt the following execution model for GenDP. Instructions are preloaded to the accelerator before starting a DP kernel. Each PE array runs one thread of execution, controlling the data movement between data buffers and PEs, as well as the start of the execution for each PE. Upon receiving the start flag from the PE array, each PE runs two threads of execution: control and compute. The control thread manages data movement between the SPM, register file, and the systolic data flow between neighboring PEs. This thread also controls the start of the compute thread. The compute thread executes a compute instruction by decoding instructions, loading the data from the register file, executing computations in the CU array, and finally writing results back to the register file.
The control ISA, shown in Table 2, is applied in both the PE array and the PEs. Arithmetic instructions manipulate the address registers within the decoders. mv instructions either load immediate values or move data between memory components. The branch instruction enables loop iteration. set instructions control the start of execution of a subsidiary component (e.g., PE arrays control PEs, and PEs control CUs). Figure 8 shows an example of data movement, where two control instructions (one in each PE) are executed: PE[i-1] moves data from address 0x00ff in its register file to its out port, while PE[i] moves data from its in port to address 0x00ff in its scratchpad memory. The delays of the movements in both PEs are included in the critical path, and the movement completes in one cycle. The control instructions are generated manually in this work.
| Type | Instruction | Assembly |
|---|---|---|
| Arithmetic | add | add rd rs1 rs2 |
| Arithmetic | addi | addi rd rs1 #imm |
| Move | li | li [dest addr] #imm |
| Move | mv | mv [dest addr] [src addr] |
| Branch | beq/bne/bge/blt | branch rs1 rs2 offset |
| Other | set | set PC |
| Other | no-op | no operation |
| Other | halt | halt |

The two-way VLIW compute instructions are executed by the two compute units, each instruction slot containing three operations (for the three ALUs of the two-level reduction tree in Figure 7(d)) and six operands (four for the top-left ALU and two for the right one). The operations available in compute instructions are listed in Table 3. Compute instructions are generated by the DPMap algorithm described in Section 5.
| Operation | Action |
|---|---|
| Addition | a + b |
| Subtraction | a − b |
| Multiplication | a × b |
| Carry | carry-out of a + b |
| Borrow | (a < b) ? 1 : 0 |
| Maximum | max(a, b) |
| Minimum | min(a, b) |
| Left-shift 16-bit | a << 16 |
| Right-shift 16-bit | a >> 16 |
| Copy | a |
| Match Score | match/mismatch score of two bases |
| Log2 LUT | log2 via lookup table |
| Log_sum LUT | log-sum via lookup table |
| Comparison | cond ? a : b |
| Comparison | cond ? a : b |
| No-op | Invalid |
| Halt | Stop Computation |
5. DPMap Algorithm
The DP objective function is represented as a data-flow graph (DFG). The DPMap algorithm generates compute instructions by mapping the DFG to the compute units in the PE. In the DFG, a node represents an operator, while an edge shows the dependency between operators. The DFG has a set of nodes V and a set of edges E. For an edge (u, v), the operator in node v takes the result of node u as an operand; we define node u as the parent of node v, and v as the child of u. Figure 9(a) shows the DFG of the BSW kernel.

Figure 9. An example of the DPMap algorithm: (a) the original DFG of BSW; (b) partitioning, in which the edges around the comp nodes are removed and one comp node is replicated into two comp nodes, because its children both have another parent and contain commutative operations; (c) seeding, in which nodes with two parents or multiple children are located and grouped into subgraphs with their parents; (d) refinement, in which the remaining comp nodes are grouped into two subgraphs.
DPMap breaks the entire graph into subgraphs that contain either one multiplication or three ALU nodes (Figure 7(d)). The edges within the subgraphs represent data movements within the compute units (CUs). The edges between subgraphs represent accesses to the register file and are removed by DPMap in three steps. First, partitioning extracts nodes that will be mapped to four-input ALUs and multipliers, because a CU supports at most one such operation. Second, seeding looks for nodes that could be mapped to the second level of the ALU reduction tree. Nodes with more than one parent or more than one child are selected as seeds, and a seed and its parents are mapped to a CU together. After seeding, the remaining nodes have a single parent or a single child. Third, refinement maps every two remaining nodes to the two-level ALU tree in a CU. Figure 9 shows an example of the DPMap algorithm; the four subfigures show the original graph and the three steps of DPMap, respectively. Dashed blocks represent final subgraphs.
Partitioning: Algorithm 1 breaks both the input and output edges connected to nodes of four-input ALUs and multipliers. All parent and child edges of the multiplication nodes are removed (lines 2-4). DPMap also removes the parent edges of four-input operations (line 6). For a four-input node that has two children, we replicate it if the operations of its children are commutative (except Subtraction) in order to decrease register file accesses (lines 8-14). After partitioning, all nodes have at most two parents.
Seeding: In Algorithm 2, we look for nodes that are suitable for the second level of the ALU reduction tree and name them seeds. Nodes that have two parents are located to fit the structure of the ALU reduction tree (line 2). The output edges of seeds are removed because the output of this operator will be stored to the register file (line 3). In addition, since the input operands of a seed's parents must be fetched from the register file, DPMap also removes the input edges of the seed's parent nodes (lines 4-7). Finally, we remove the output edges of all nodes with more than one child, as their outputs have to be stored in the register file (lines 9-11). At this point, all nodes have at most one parent or one child.
Refinement: Algorithm 3 traverses the DFG in reverse order. If the node has a grandparent (line 4), the edge connecting its parent and grandparent is removed to group every two nodes (line 5). In the end, all subgraphs can be mapped to compute units in the PE.
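To make the cutting rules concrete, the following Python sketch applies simplified versions of the partitioning and seeding rules to a toy BSW-like DFG (the node names, opcodes, and graph are made up for illustration; the replication rule and the refinement pass are omitted):

```python
graph = {
    # node: (opcode, parent nodes); parents not listed as keys are inputs.
    "max4": ("max", ["a", "b", "c", "d"]),   # four-input node
    "add1": ("add", ["max4", "s"]),
    "add2": ("add", ["max4", "t"]),
    "max2": ("max", ["add1", "add2"]),       # two parents -> seed
    "out":  ("copy", ["max2"]),
}

def cut_edges(g):
    cut = set()
    children = lambda v: [c for c, (_, ps) in g.items() if v in ps]
    # Partitioning: isolate four-input ops and multiplications.
    for v, (op, parents) in g.items():
        if len(parents) >= 4 or op == "mul":
            cut |= {(p, v) for p in parents}
            cut |= {(v, c) for c in children(v)}
    # Seeding: a node whose two remaining parents are still attached becomes
    # the second-level ALU; cut its outputs and its parents' other inputs.
    for v, (op, parents) in g.items():
        live = [p for p in parents if (p, v) not in cut and p in g]
        if len(live) == 2:
            cut |= {(v, c) for c in children(v)}
            for p in live:
                cut |= {(q, p) for q in g[p][1] if (q, p) not in cut}
    return cut

edges = {(p, v) for v, (_, ps) in graph.items() for p in ps}
cut = cut_edges(graph)
print("register-file edges:", sorted(cut))
print("intra-CU edges:", sorted(edges - cut))
# -> add1, add2, and max2 stay connected and map to one two-level ALU tree;
#    max4 maps to a four-input ALU on its own.
```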
6. Evaluation
We synthesize the DPAx accelerator using the Synopsys Design Compiler in a TSMC 28nm process. DPAx consumes 5.4 mm² of area and 3.6 W of power. We use a cycle-accurate simulator to measure the throughput of the DPAx accelerator on the four DP kernels introduced in Section 2.3. The BSW, PairHMM, and POA simulations produce the same results as the CPU baselines. We use Ramulator11 to generate DRAM configurations and DRAMPower4 to measure power from DRAM access traces. We use an Intel Xeon Platinum 8380 and an NVIDIA A100 to evaluate the CPU and GPU baselines, with die areas estimated to be 600 mm² and 826 mm², respectively. We use the input datasets from GenomicsBench.22
6.1 GenDP performance.
We use throughput per unit area, measured in million cell updates per second per mm² (MCUPS/mm²), as the performance metric. The area and power of the CPU, GenDP, and custom accelerators are scaled to a 7nm process for a fair comparison with the GPU.21 GenDP is expected to run at 2GHz. Figure 10(a) shows comparisons across the four DP benchmarks. Overall, GenDP achieves a 132.0× speedup over the CPU and 157.8× over the GPU. Figure 10(b) shows the comparison between GenDP and the GPU. The large speedup can be attributed to the GenDP ISA, the specialized dataflow, and the on-chip memory hierarchy tailored for dynamic programming.

Both the BWA-MEM2 CPU baseline and GenDP benefit from 8-bit SIMD optimizations for the BSW kernel: with AVX512, BWA-MEM2 has 64 SIMD lanes, while GenDP has four. The PairHMM baseline uses floating point, whereas GenDP applies the pruning-based implementation, which uses logarithms and fixed-point numbers to approximate the computation and reduce complexity. The bottleneck of POA performance on GenDP is memory accesses. First, POA has a graph dependency pattern, which is more complex than the other kernels; the dependency information needs to be loaded from the input data buffer into each PE. Second, the downstream trace-back functions in POA need the move directions on the DP table for each cell, which requires 8-byte outputs to be written from each cell to the output data buffer. Both the input of the dependency information and the output of the move directions consume extra data-movement instructions that limit POA performance on GenDP. In the Chain kernel, both GPU and GenDP performance are penalized by 3.72× for the extra cell computation.
6.2 Comparison with accelerators.
The GenDP framework's goal is to build a versatile dynamic programming accelerator that supports a wide range of genomics applications. It therefore sacrifices some performance for programmability and support of a broader set of kernels. A key research question is how much performance is sacrificed for this generality. Figure 10(c) compares the performance of GenDP with available custom genomics ASIC accelerators: the GenAx6 accelerator for BSW and the pruning-based PairHMM accelerator.25
We observe a geomean slowdown of 2.8×. This can be attributed largely to area overheads and to the custom accelerators' specialized datapaths for cell-score computation, custom dataflow, and custom precision. For example, 37.5% of the registers and 40% of the SRAM are used only by POA and sit idle in the other kernels, because POA is significantly more complex than the other three. A custom dataflow can fix the data-bus width between neighboring PEs and propagate all the data in a single cycle, whereas GenDP needs control instructions to move data between neighboring PEs because of its varied data-movement requirements. An accelerator for one specific kernel can also implement the one appropriate precision to save area; for instance, the pruning-based PairHMM ASIC utilizes 20-bit fixed-point data, which satisfies the compute requirements, but GenDP has no such custom precision choice.
6.3 Generality.
In addition to the four DP kernels within commonly used sequencing pipelines, the GenDP framework also supports other DP algorithms in genomics and beyond. Dynamic time warping (DTW) measures the similarity between two temporal sequences and can be used for Nanopore signal basecalling and speech detection; DTW has a near-range dependency pattern similar to Smith-Waterman. Bellman-Ford (BF), a shortest-path search algorithm, is commonly used in robotic motion-planning applications; BF has a graph-based dependency pattern in which long-range dependencies within a certain distance can be supported efficiently by GenDP, while ultra-long-range dependencies need accesses through DRAM. GenDP supports the objective functions of both DTW and BF. Their performance and instruction comparisons with GPUs are shown in Figure 10(d).
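For reference, the standard DTW recurrence has the same left/upper/diagonal dependency shape as Smith-Waterman, as the short Python sketch below shows (textbook DTW, not GenDP's mapping of it):

```python
def dtw(a, b):
    """Dynamic time warping distance between two numeric sequences."""
    INF = float("inf")
    m, n = len(a), len(b)
    D = [[INF] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Each cell depends on its left, upper, and diagonal neighbours.
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[m][n]

print(dtw([1.0, 2.0, 3.0, 2.0], [1.0, 2.0, 2.0, 3.0, 2.0]))  # 0.0
```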
7. Related Work
Many custom genomics accelerators have been proposed to boost the performance of DP kernels in genomics pipelines, which significantly improve the performance over commodity hardware. However, these accelerators only support a single genomics kernel and must be customized and combined to support different stages of genomics pipelines.2,5,23 This increases both design cost and complexity.
There has been industry interest in DP acceleration. For example, NVIDIA recently announced the dynamic programming instructions (DPX) in the Hopper architecture.9
There are also several works that identify common compute and memory-access patterns across regular and irregular workloads and propose reconfigurable, spatial, and dataflow architectures for these patterns.7,13,14,16 However, these accelerators are mostly optimized for data-parallel or data-intensive applications and are not well suited to dynamic programming kernels.
8. Conclusion
To support general-purpose acceleration of genomics kernels in commonly used sequencing pipelines, this work presented GenDP, a programmable dynamic programming acceleration framework comprising DPAx, a systolic-array-based DP accelerator, and DPMap, a graph-partitioning algorithm that maps DP kernels to the accelerator. DPAx supports multiple dependency patterns through flexible PE interconnections and different DP objective functions through programmable compute units. GenDP is evaluated on four important genomics kernels, achieving 157.8× throughput per unit area and a 15.1× improvement compared to GPU baselines, and 132.0× throughput per unit area over CPU baselines, and is also extended to DP algorithms in broader fields.
Acknowledgment
We thank the anonymous reviewers for their suggestions, which helped improve this paper. This work was supported in part by NSF grants CAREER-1652294, NSF-1908601, and NSF-2403119, and by the Applications Driving Architectures (ADA) Research Center, a JUMP Center co-sponsored by SRC and DARPA.