• The ARMv5TE extensions available in the ARM9E and later cores provide
efficient multiply accumulate operations.
Increase in performance available on different generations of the ARM core.
• The ARM core is not a dedicated DSP. There is no single instruction that
• The key idea is to use block algorithms that calculate several results at
once, and thus require less memory bandwidth, increase performance, and
• The ARM7TDMI has a 32-bit by 8-bit per cycle multiply array with early termination. It
takes four cycles for a 16-bit by 16-bit to 32-bit multiply accumulate. Load instructions
take three cycles and store instructions two cycles for zero-wait-state memory or cache.
• Load instructions are slow, taking three cycles to load a single value. To access memory
efficiently, use the load and store multiple instructions LDM and STM. Load and store
multiples only require a single cycle for each additional word transferred after the first
word. This often means it is more efficient to store 16-bit data values in 32-bit words.
• The multiply instructions use early termination based on the second
operand in the product Rs. For predictable performance use the second
operand to specify constant coefficients or multiples.
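The cycle counts quoted above can be turned into a small cost model. The following C sketch is an illustration, not a timing reference: the function names are hypothetical, the LDM cost assumes the first word costs the same three cycles as a single LDR, and the multiply-accumulate cost assumes two cycles of overhead on top of the multiplier array cycles, which matches the four-cycle figure given for a 16-bit operand.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles spent in a 32-bit by 8-bit multiplier array with early termination.
   The array consumes Rs eight bits at a time and stops once the remaining
   upper bits are all zeros or all ones (a sign extension). */
int mul_array_cycles(int32_t rs)
{
    uint32_t u = (uint32_t)rs;
    uint32_t top24 = u >> 8, top16 = u >> 16, top8 = u >> 24;
    if (top24 == 0 || top24 == 0xFFFFFFu) return 1;  /* Rs fits in 8 bits  */
    if (top16 == 0 || top16 == 0xFFFFu)   return 2;  /* Rs fits in 16 bits */
    if (top8  == 0 || top8  == 0xFFu)     return 3;  /* Rs fits in 24 bits */
    return 4;
}

/* Assumption: a multiply accumulate costs the array cycles plus two. */
int mla_cycles(int32_t rs)
{
    return mul_array_cycles(rs) + 2;
}

/* n separate single-word loads at three cycles each. */
int ldr_cycles(int n)
{
    assert(n >= 0);
    return 3 * n;
}

/* One load multiple of n words: three cycles for the first word (assumed),
   then one cycle for each additional word. */
int ldm_cycles(int n)
{
    assert(n >= 1);
    return 3 + (n - 1);
}
```

Under these assumptions, loading four words costs 12 cycles with separate LDR instructions and 6 cycles with one LDM, and a 16-bit coefficient in Rs gives a four-cycle multiply accumulate regardless of the size of the other operand.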
FIR filter, cont’d
        ADR   r3,c          ; load r3 with base of c
        ADR   r5,x          ; load r5 with base of x
        ; loop body (r0 = i, r1 = N, r2 = running sum and r8 = byte offset
        ; are assumed to be set up before this point)
loop    LDR   r4,[r3,r8]    ; get c[i]
        LDR   r6,[r5,r8]    ; get x[i]
        MUL   r4,r6,r4      ; compute c[i]*x[i]; Rd must differ from Rm before ARMv6
        ADD   r2,r2,r4      ; add into running sum
        ADD   r8,r8,#4      ; add one word offset to array index
        ADD   r0,r0,#1      ; add 1 to i
        CMP   r0,r1         ; exit?
        BLT   loop          ; if i < N, continue
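For comparison, the loop above corresponds roughly to the following C function. This is a sketch: the name fir_dot is hypothetical, the arrays are assumed to hold 32-bit words (matching the word offset of 4 in the assembly), and the products and their sum are assumed to fit in 32 bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum of c[i] * x[i] for i = 0 .. n-1, mirroring the assembly loop. */
int32_t fir_dot(const int32_t *c, const int32_t *x, size_t n)
{
    assert(c != NULL && x != NULL);
    int32_t sum = 0;                    /* r2: running sum          */
    for (size_t i = 0; i < n; i++) {    /* r0: i, r1: N             */
        sum += c[i] * x[i];             /* LDR, LDR, MUL, ADD       */
    }
    return sum;
}
```

One difference is worth noting: the assembly tests the exit condition at the bottom of the loop, so it always executes the body at least once, whereas the C loop does nothing when n is zero.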
• Let’s look at the issue of dynamic range and possible overflow of the
output signal. Suppose that we are using Qn and Qm fixed-point
representations X[t] and C[i] for x_t and c_i, respectively. In other words:
X[t] = round(2^n x_t) and C[i] = round(2^m c_i)
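To make the dynamic range question concrete, here is a minimal C sketch under some assumptions: samples and coefficients are stored as 16-bit integers, Qn means the real value scaled by 2^n and rounded, and the helper names (to_q, fir_q_acc, fir_q_bound, fits_in_32) are hypothetical. Each product of a Qn sample and a Qm coefficient is a Q(n+m) value, and the worst-case accumulator magnitude is the largest sample magnitude times the sum of the absolute coefficient values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convert a real value to Qn, rounding to nearest: round(2^n * v). */
int32_t to_q(double v, int n)
{
    assert(n >= 0 && n < 31);
    double scaled = v * (double)(1 << n);
    return (int32_t)(scaled + (scaled >= 0 ? 0.5 : -0.5));
}

/* Accumulate M products of Qn samples and Qm coefficients.
   The result is a Q(n+m) value, kept in 64 bits so it cannot overflow here. */
int64_t fir_q_acc(const int16_t *X, const int16_t *C, size_t M)
{
    int64_t acc = 0;
    for (size_t i = 0; i < M; i++) {
        acc += (int32_t)X[i] * (int32_t)C[i];
    }
    return acc;
}

/* Worst-case accumulator magnitude: x_max * sum of |C[i]|. */
int64_t fir_q_bound(const int16_t *C, size_t M, int32_t x_max)
{
    int64_t sum_abs = 0;
    for (size_t i = 0; i < M; i++) {
        sum_abs += (C[i] < 0) ? -(int64_t)C[i] : (int64_t)C[i];
    }
    return sum_abs * x_max;
}

/* True if a value of this magnitude is safe in a signed 32-bit accumulator. */
int fits_in_32(int64_t bound)
{
    return bound <= INT32_MAX;
}
```

With two Q15 coefficients of 0.25 each and full-scale 16-bit input, the bound is 2^29, so a 32-bit accumulator is safe. Four coefficients close to 1.0 push the bound past 2^31, and the accumulator would need either more bits or a smaller scaling.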
• The diagram starts with the oldest sample X[t−M+1], since the filter routine
will load samples in increasing order of memory address.
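The oldest-first layout fits naturally with a block algorithm that produces several outputs per pass over the coefficients. The C sketch below is one possible arrangement, not the routine from the slides: it assumes an M-tap filter y_t = c_0 x_t + c_1 x_{t−1} + ... + c_{M−1} x_{t−M+1}, an input array of n_out + M − 1 samples with the oldest at index 0, and a block size of four. The name fir_block4 is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Compute n_out outputs, four at a time.
   x holds n_out + M - 1 samples, oldest first; y[k] is the output whose
   newest input sample is x[k + M - 1]. */
void fir_block4(const int32_t *x, const int32_t *c, size_t M,
                int32_t *y, size_t n_out)
{
    assert(M >= 1 && n_out % 4 == 0);
    for (size_t k = 0; k < n_out; k += 4) {
        int32_t a0 = 0, a1 = 0, a2 = 0, a3 = 0;
        for (size_t i = 0; i < M; i++) {
            int32_t ci = c[i];                    /* one coefficient load ...  */
            const int32_t *p = &x[k + M - 1 - i];
            a0 += ci * p[0];                      /* ... shared by four MACs   */
            a1 += ci * p[1];
            a2 += ci * p[2];
            a3 += ci * p[3];
        }
        y[k] = a0; y[k + 1] = a1; y[k + 2] = a2; y[k + 3] = a3;
    }
}
```

Each coefficient is loaded once and used for four multiply accumulates, and three of the four samples used for one tap are used again for the next tap, so a register-allocated version needs far fewer loads per output than a one-output-at-a-time loop.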
• If you feed in the impulse signal x = (1, 0, 0, 0, . . .), then y_t may oscillate
forever. This is why it has an infinite impulse response. However, for a
stable filter, y_t will decay to zero. We will concentrate on efficient
implementation of this filter.
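A one-pole filter is enough to see both behaviours. The C sketch below uses a hypothetical filter y_t = x_t + a y_{t−1}, which is not taken from the slides, and feeds it the impulse (1, 0, 0, ...).

```c
#include <assert.h>
#include <stddef.h>

/* Write the first n impulse-response values of y_t = x_t + a * y_{t-1}. */
void one_pole_impulse(double a, double *y, size_t n)
{
    assert(y != NULL);
    double prev = 0.0;                      /* y_{t-1}, zero before the impulse */
    for (size_t t = 0; t < n; t++) {
        double x = (t == 0) ? 1.0 : 0.0;    /* impulse input */
        prev = x + a * prev;
        y[t] = prev;
    }
}
```

With a = 0.5 the response is 1, 0.5, 0.25, 0.125, ... and decays to zero, as expected for a stable filter. With a = −1 the response is 1, −1, 1, −1, ... and never dies out.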
IIR filter example
• You can calculate the output signal y_t directly, using the general equation. In
this case the code is similar to the FIR. However, this calculation method
may be numerically unstable.
• It is often more accurate, and more efficient, to factorize the filter into a
series of biquads—an IIR filter with M = L = 2:
• So, now we only have to implement biquads efficiently. On the face of it, to
calculate y_t for a biquad, we need the current sample x_t and four history
elements x_{t−1}, x_{t−2}, y_{t−1}, y_{t−2}.
• However, there is a trick to reduce the number of history or state values
we require from four to two.
We define an intermediate signal s_t by
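The definition of s_t is cut off in this text. The C sketch below assumes the standard direct form II construction, s_t = x_t − a_1 s_{t−1} − a_2 s_{t−2} and y_t = b_0 s_t + b_1 s_{t−1} + b_2 s_{t−2}, and compares it with the four-history version. The sign convention on a_1 and a_2, the type names, and the function names are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { double b0, b1, b2, a1, a2; } biquad_coef;
typedef struct { double x1, x2, y1, y2; } df1_state;   /* four history values */
typedef struct { double s1, s2; } df2_state;           /* two state values    */

/* Direct form I:
   y_t = b0 x_t + b1 x_{t-1} + b2 x_{t-2} - a1 y_{t-1} - a2 y_{t-2}. */
double biquad_df1(const biquad_coef *k, df1_state *h, double x)
{
    assert(k != NULL && h != NULL);
    double y = k->b0 * x + k->b1 * h->x1 + k->b2 * h->x2
             - k->a1 * h->y1 - k->a2 * h->y2;
    h->x2 = h->x1; h->x1 = x;
    h->y2 = h->y1; h->y1 = y;
    return y;
}

/* Direct form II, using the intermediate signal s_t:
   s_t = x_t - a1 s_{t-1} - a2 s_{t-2}
   y_t = b0 s_t + b1 s_{t-1} + b2 s_{t-2}. */
double biquad_df2(const biquad_coef *k, df2_state *h, double x)
{
    assert(k != NULL && h != NULL);
    double s = x - k->a1 * h->s1 - k->a2 * h->s2;
    double y = k->b0 * s + k->b1 * h->s1 + k->b2 * h->s2;
    h->s2 = h->s1; h->s1 = s;
    return y;
}
```

Starting from zero state, both functions produce the same output sequence for the same input, but the second one carries only s_{t−1} and s_{t−2} between samples. In fixed-point arithmetic the intermediate signal can have a larger range than the input or output, so its scaling would need separate attention.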