Unit-Iii Arm Application Development

UNIT-III
ARM APPLICATION DEVELOPMENT

OVERVIEW:
Introduction to DSP on ARM
FIR filter, IIR filter
Discrete Fourier transform
Exception handling
Interrupts, Interrupt handling schemes
Firmware and boot loader
Embedded Operating systems
Integrated Development Environment
STDIO Libraries
Peripheral Interface
Application of ARM Processor
Caches, Memory Protection Units, Memory Management Units
Future ARM Technologies
ARM VS DSP
• Both DSP and ARM Processors are types of microprocessors. A microprocessor
is a silicon chip that contains the central processing unit (CPU) of the device.
• ARM processors are based on the RISC (reduced instruction set computer) design.
RISC microprocessors are typically used as general-purpose processors.
• The DSP processor is another type of microprocessor. DSP stands for digital
signal processing, which is any signal processing performed on a digital
(digitized) signal.
• A DSP processor is a specialized microprocessor that has an architecture
optimized for the operational needs of digital signal processing.
APPLICATIONS:
ARM
ARM processors are widely used in light, portable,
battery-powered devices such as smartphones and tablet computers.
DSP
DSP aims to modify or improve a signal. It works on discrete
representations of the signal, such as discrete-time or
discrete-frequency samples.
The main goal of a DSP processor is to measure, filter, and/or compress
digital or analog signals.
INTRODUCTION TO DSP ON ARM
• Microprocessors now wield enough computational power to process
real-time digitized signals.
• Familiar examples include MP3 audio players, digital cameras, and digital mobile/cellular telephones.
• Processing digitized signals requires high memory bandwidths and fast
multiply accumulate operations.
Traditionally an embedded or portable device would contain two types of processor:
• A microcontroller would handle the user interface, and a separate DSP
processor would manipulate digitized signals such as audio.
• However, now you can often use a single microprocessor to perform both
tasks because of the higher performance and clock frequencies available on
microprocessors today.
• A single-core design can reduce cost and power consumption over a
two-core solution.
• The ARMv5TE extensions available in the ARM9E and later cores provide
efficient multiply accumulate operations.
Increase in performance available on different generations of the ARM core.
• The ARM core is not a dedicated DSP. There is no single instruction that
issues a multiply accumulate and data fetch in parallel. However, by
reusing loaded data you can achieve a respectable DSP performance.
• The key idea is to use block algorithms that calculate several results at
once, and thus require less memory bandwidth, increase performance, and
decrease power consumption compared with calculating single results.

• The ARM also differs from a standard DSP when it comes to precision and
saturation. In general, ARM does not provide operations that saturate
automatically. Saturating versions of operations usually cost additional cycles.
Saturation clips a result to a fixed range to prevent overflow.
• saturate16(x) = x clipped to the range −0x00008000 to +0x00007fff inclusive
• saturate32(x) = x clipped to the range −0x80000000 to +0x7fffffff inclusive
• On the other hand, ARM supports extended-precision 32-bit multiplied by 32-bit
to 64-bit operations very well.
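The clipping rules and the extended-precision multiply described above can be sketched in C as follows. The names saturate16 and saturate32 mirror the notation used here; mul32x32 is a hypothetical helper name introduced only for illustration. This is a minimal sketch, not a tuned implementation.

```c
#include <stdint.h>

/* Clip a 32-bit value to the signed 16-bit range -0x8000..+0x7fff. */
int32_t saturate16(int32_t x)
{
    if (x > INT16_MAX) return INT16_MAX;
    if (x < INT16_MIN) return INT16_MIN;
    return x;
}

/* Clip a 64-bit value to the signed 32-bit range -0x80000000..+0x7fffffff. */
int32_t saturate32(int64_t x)
{
    if (x > INT32_MAX) return INT32_MAX;
    if (x < INT32_MIN) return INT32_MIN;
    return (int32_t)x;
}

/* Hypothetical helper: 32-bit by 32-bit multiply with a full 64-bit result. */
int64_t mul32x32(int32_t a, int32_t b)
{
    return (int64_t)a * (int64_t)b;
}
```

Each saturating function needs compares and conditional moves on top of the arithmetic itself, which is where the extra cycles come from; the 64-bit product needs no such checks.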
Guidelines for Writing DSP Code for ARM
• Design the DSP algorithm so that saturation is not required because
saturation will cost extra cycles. Use extended-precision arithmetic or
additional scaling rather than saturation.
• Design the DSP algorithm to minimize loads and stores. Once you load a
data item, then perform as many operations that use the datum as
possible.
• Write ARM assembly to avoid processor interlocks. The results of load and
multiply instructions are often not available to the next instruction without
adding stall cycles. Sometimes the results will not be available for several
cycles.
• There are 14 registers available for general use on the ARM, r0 to r12 and
r14. Design the DSP algorithm so that the inner loop will require 14
registers or fewer.
DSP on the ARM7TDMI
• The ARM7TDMI has a 32-bit by 8-bit per cycle multiply array with early termination. It
takes four cycles for a 16-bit by 16-bit to 32-bit multiply accumulate. Load instructions
take three cycles and store instructions two cycles for zero-wait-state memory or cache.
Guidelines for Writing DSP Code for the ARM7TDMI
• Load instructions are slow, taking three cycles to load a single value. To access memory
efficiently use load and store multiple instructions LDM and STM. Load and store
multiples only require a single cycle for each additional word transferred after the first
word. This often means it is more efficient to store 16-bit data values in 32-bit words.
• The multiply instructions use early termination based on the second
operand in the product Rs. For predictable performance use the second
operand to specify constant coefficients or multiples.
• Multiply is one cycle faster than multiply accumulate. It is sometimes
useful to split an MLA instruction into separate MUL and ADD instructions.
You can then use a barrel shift with the ADD to perform a scaled
accumulate.
FIR filters
• The finite impulse response (FIR) filter is a basic building block of many
DSP applications.
• You can use a FIR filter to remove unwanted frequency ranges, boost
certain frequencies, or implement special effects.
• We will concentrate on efficient implementation of the filter on the ARM.
The FIR filter is the simplest type of digital filter.
• The filtered sample y_t depends linearly on a fixed, finite number of
unfiltered samples x_t. Let M be the length of the filter. Then for some filter
coefficients c_i:
y_t = c_0 x_t + c_1 x_{t-1} + ... + c_{M-1} x_{t-M+1}
A direct form discrete-time FIR filter of order N.
Example: FIR filter
• C:
    for (i=0, f=0; i<N; i++)
        f = f + c[i]*x[i];
• Assembler
; loop initiation code
MOV r0,#0 ; use r0 for i
MOV r8,#0 ; use separate index for arrays
ADR r2,N ; get address for N
LDR r1,[r2] ; get value of N
MOV r2,#0 ; use r2 for f
ADR r3,c ; load r3 with base of c
ADR r5,x ; load r5 with base of x
; loop body
loop LDR r4,[r3,r8] ; get c[i]
LDR r6,[r5,r8] ; get x[i]
MUL r4,r4,r6 ; compute c[i]*x[i]
ADD r2,r2,r4 ; add into running sum
ADD r8,r8,#4 ; add one word offset to array index
ADD r0,r0,#1 ; add 1 to i
CMP r0,r1 ; exit?
BLT loop ; if i < N, continue
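The loop above computes a single sum of products. A minimal C sketch of the same direct form applied to a whole frame of samples is shown below. The function name fir_direct, the buffer layout (oldest sample first, with M − 1 history samples ahead of the frame), and the 16-bit data with a 32-bit accumulator are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Direct-form FIR over a frame (sketch).
 * x holds n + m - 1 samples, oldest first: the first m - 1 entries are
 * history, so x[t + m - 1] is the "current" sample for output t.
 * y[t] = c[0]*x[t+m-1] + c[1]*x[t+m-2] + ... + c[m-1]*x[t]
 */
void fir_direct(const int16_t *x, const int16_t *c, size_t m,
                int32_t *y, size_t n)
{
    for (size_t t = 0; t < n; t++) {
        int32_t acc = 0;                      /* running sum for this output */
        for (size_t i = 0; i < m; i++)
            acc += (int32_t)c[i] * x[t + m - 1 - i];
        y[t] = acc;
    }
}
```

Every multiply accumulate here needs two loads (one coefficient, one sample), which is the memory traffic that block algorithms aim to reduce.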
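• Let’s look at the issue of dynamic range and possible overflow of the
output signal. Suppose that we are using Qn and Qm fixed-point
representations X[t] and C[i] for x_t and c_i, respectively. In other words,
X[t] = round(2^n x_t) and C[i] = round(2^m c_i), and we accumulate the integer sum
A[t] = C[0]*X[t] + C[1]*X[t-1] + ... + C[M-1]*X[t-M+1].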
• Then A[t] is a Q(n+m) representation of y_t. But how large is A[t]? How
many bits of precision do we need to ensure that A[t] does not overflow?
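As a sketch of the fixed-point bookkeeping, the C fragment below converts real values to Qn format and computes a conservative worst-case bound on the accumulator, using the triangle inequality |A[t]| ≤ max|X| × Σ|C[i]|. The helper names to_q and fir_acc_bound are hypothetical, and the bound is one standard way to size the accumulator rather than the only one.

```c
#include <stddef.h>
#include <stdint.h>

/* Convert a real value to Qn fixed point: round(v * 2^n).
 * Assumes 0 <= n <= 30 and that the scaled value fits in 32 bits. */
int32_t to_q(double v, int n)
{
    double scaled = v * (double)((int64_t)1 << n);
    return (int32_t)(scaled >= 0.0 ? scaled + 0.5 : scaled - 0.5);
}

/* Worst-case |A[t]| for coefficients c[0..m-1] when |X[t]| <= max_abs_x. */
int64_t fir_acc_bound(const int16_t *c, size_t m, int32_t max_abs_x)
{
    int64_t sum_abs_c = 0;
    for (size_t i = 0; i < m; i++)
        sum_abs_c += (c[i] < 0) ? -(int64_t)c[i] : (int64_t)c[i];
    return sum_abs_c * max_abs_x;
}
```

If the bound fits in 31 magnitude bits, a 32-bit accumulator cannot overflow and no saturation is needed; otherwise you can scale the coefficients down or use a 64-bit accumulator.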
Block FIR filters
• We can usually implement filters using integer sums of products, without
the need to check for saturation or overflow:
A[t] = C[0]*X[t] + C[1]*X[t-1] + ... + C[M-1]*X[t-M+1];
• Generally X[t] and C[i] are k-bit integers and A[t] is a 2k-bit integer, where
k = 8, 16, or 32.
• By a long filter, we mean that M is so large that you can’t hold the filter
coefficients in registers. You should optimize short filters such as the previous
example on a case-by-case basis. For these you can hold many coefficients
in registers.
• An R-way block filter implementation calculates the R values
A[t], A[t+1], ..., A[t+R−1]
using a single pass of the data X[t] and coefficients C[i]. This reduces
the number of memory accesses by a factor of R over calculating each result
separately. So R should be as large as possible.
• An R × S block filter is an R-way block filter where we read S data and
coefficient values at a time for each iteration of the inner loop. On each
loop we accumulate R × S products onto the R accumulators.
Typical 4 × 3 block filter implementation.
• Each accumulator on the left is the sum of products of the coefficients on
the right multiplied by the signal value heading each column.
• The diagram starts with the oldest sample X[t−M+1] since the filter routine
will load samples in increasing order of memory address.
• Each inner loop of a 4 × 3 filter accumulates the 12 products in a 4 × 3
parallelogram.
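The block idea can be sketched in C with a simpler 4 × 1 variant: four accumulators, and one coefficient plus one new sample loaded per inner-loop iteration. The function name fir_block4 and the buffer layout (oldest sample first, M − 1 history samples before the frame) are assumptions for illustration, and the loop walks from the newest sample to the oldest for brevity, unlike the increasing-address order of the diagram.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * 4-way block FIR (R = 4, S = 1), a sketch.
 * x holds n + m - 1 samples, oldest first; a[t] receives
 * c[0]*x[t+m-1] + c[1]*x[t+m-2] + ... + c[m-1]*x[t].
 * n must be a multiple of the block size 4.
 */
void fir_block4(const int16_t *x, const int16_t *c, size_t m,
                int32_t *a, size_t n)
{
    assert(n % 4 == 0);
    for (size_t t = 0; t < n; t += 4) {
        size_t base = t + m - 1;              /* index of the current sample */
        int32_t a0 = 0, a1 = 0, a2 = 0, a3 = 0;
        int32_t x1 = x[base + 1], x2 = x[base + 2], x3 = x[base + 3];
        for (size_t i = 0; i < m; i++) {
            int32_t ci = c[i];                /* one coefficient load ...   */
            int32_t x0 = x[base - i];         /* ... and one sample load    */
            a0 += ci * x0;                    /* ... feed four accumulates  */
            a1 += ci * x1;
            a2 += ci * x2;
            a3 += ci * x3;
            x3 = x2; x2 = x1; x1 = x0;        /* slide the sample window    */
        }
        a[t] = a0; a[t + 1] = a1; a[t + 2] = a2; a[t + 3] = a3;
    }
}
```

Compared with computing each output separately, this performs two loads per four multiply accumulates instead of two loads per one, which is the factor-of-R saving described above.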
Writing FIR Filters on the ARM
• If the number of FIR coefficients is small enough, then hold the
coefficients and history samples in registers. Often coefficients are
repeated. This will save on the number of registers you need.
• If the FIR filter length is long, then use a block filter algorithm of
size R × (R − 1) or R × R. Choose the largest R possible given the 14
available general purpose registers on the ARM.
• Ensure that the input arrays are aligned to the access size. This will
be 64-bit when using LDRD. Ensure that the array length is a
multiple of the block size.
• Schedule to avoid all load-use and multiply-use interlocks.
IIR Filters
• An infinite impulse response (IIR) filter is a digital filter that depends
linearly on a finite number of input samples and a finite number of
previous filter outputs. In other words, it combines a FIR filter with
feedback from previous filter outputs. Mathematically, for some
coefficients b_i and a_j:
• If you feed in the impulse signal x = (1, 0, 0, 0, ...), then y_t may oscillate
forever. This is why it has an infinite impulse response. However, for a
stable filter, y_t will decay to zero. We will concentrate on efficient
implementation of this filter.
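Written out, the IIR definition takes the following general form. This assumes the common sign convention in which the feedback terms are subtracted (some texts add them instead), with M feedforward taps beyond b_0 and L feedback taps:

```latex
y_t = \sum_{i=0}^{M} b_i \, x_{t-i} \;-\; \sum_{j=1}^{L} a_j \, y_{t-j}
```

With all a_j = 0 this reduces to a FIR filter of length M + 1.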
IIR filter example
• You can calculate the output signal y_t directly, using the general equation. In
this case the code is similar to the FIR. However, this calculation method
may be numerically unstable.
• It is often more accurate, and more efficient, to factorize the filter into a
series of biquads, each an IIR filter with M = L = 2.
"Biquad" is an abbreviation of "biquadratic", which refers to the fact that in the
z-domain its transfer function is the ratio of two quadratic functions.
• We can implement any IIR filter by repeatedly filtering the data by a
number of biquads. To see this, we use the z-transform. This transform
associates with each signal x_t a polynomial x(z) defined as
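Using the usual definition of the z-transform, and assuming the sign convention in which feedback terms are subtracted, the transform and the transfer function of a general IIR filter can be written as:

```latex
x(z) = \sum_{t} x_t \, z^{-t}
\qquad
y(z) = H(z)\, x(z),
\quad
H(z) = \frac{b_0 + b_1 z^{-1} + \cdots + b_M z^{-M}}
            {1 + a_1 z^{-1} + \cdots + a_L z^{-L}}
```

Filtering by two filters in series multiplies their transfer functions. Since the numerator and denominator with real coefficients each factor into real quadratics (and possibly one linear term), H(z) splits into a product of ratios of quadratics, that is, a series of biquads.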
Biquads
• So, now we only have to implement biquads efficiently. On the face of it, to
calculate y_t for a biquad, we need the current sample x_t and four history
elements x_{t−1}, x_{t−2}, y_{t−1}, y_{t−2}.
• However, there is a trick to reduce the number of history or state values
we require from four to two.
We define an intermediate signal s_t by
• The coefficient b_0 controls the amplitude of the biquad. We can assume
that b_0 = 1 when performing a series of biquads, and use a single multiply
or shift at the end to correct the signal amplitude. So, to summarize, we
have reduced an IIR to filtering by a series of biquads of the form
• For a block IIR, we split the input signal x_t into large frames of N samples.
We make multiple passes over the signal, filtering by as many biquads as
we can hold in registers on each pass.
• Typically for ARMv4 processors we filter by one biquad on each pass; for
ARMv5TE processors, by two biquads.
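A minimal C sketch of the two-state biquad and the block (multi-pass) cascade follows. It assumes b_0 = 1 and the sign convention s_t = x_t − a_1 s_{t−1} − a_2 s_{t−2}, y_t = s_t + b_1 s_{t−1} + b_2 s_{t−2}. The type and function names are hypothetical, and double arithmetic is used for clarity instead of the 16-bit fixed-point format a real ARM implementation would use.

```c
#include <stddef.h>

typedef struct {
    double a1, a2;   /* feedback coefficients (subtracted; assumed convention) */
    double b1, b2;   /* feedforward coefficients, with b0 normalised to 1      */
    double s1, s2;   /* the two state values s[t-1] and s[t-2]                 */
} biquad_t;

/* Filter one frame in place by a single biquad. */
void biquad_frame(biquad_t *q, double *x, size_t n)
{
    for (size_t t = 0; t < n; t++) {
        double s = x[t] - q->a1 * q->s1 - q->a2 * q->s2;   /* feedback part    */
        x[t] = s + q->b1 * q->s1 + q->b2 * q->s2;          /* feedforward part */
        q->s2 = q->s1;                                     /* age the state    */
        q->s1 = s;
    }
}

/* Block IIR: one pass over the frame per biquad in the series. */
void iir_cascade(biquad_t *stages, size_t nstages, double *x, size_t n)
{
    for (size_t k = 0; k < nstages; k++)
        biquad_frame(&stages[k], x, n);
}
```

Because the state lives in the structure, consecutive frames can be filtered by calling iir_cascade repeatedly without resetting anything. Filtering by several biquads within a single pass, as described above for ARMv5TE, would merge the stage loop into the sample loop.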
Implementing 16-bit IIR Filters
• Factorize the IIR into a series of biquads. Choose the data precision so there
can be no overflow during the IIR calculation.
• Use a block IIR algorithm, dividing the signal to be filtered into large
frames.
• On each pass of the sample frame, filter by M biquads. Choose M to be the
largest number of biquads so that you can hold the state and coefficients
in the 14 available registers on the ARM. Ensure that the total number of
biquads is a multiple of M.
• As always, schedule code to avoid load-use and multiply-use interlocks.
The Discrete Fourier Transform
• The Discrete Fourier Transform (DFT) converts a time domain signal x_t to
a frequency domain signal y_k. The associated inverse transform (IDFT)
reconstructs the time domain signal from the frequency domain signal.
• This tool is heavily used in signal analysis and compression.
• It is particularly powerful because there is an algorithm, the Fast Fourier
Transform (FFT), that implements the DFT very efficiently.
• We will look at some efficient ARM implementations of the FFT.
• The DFT acts on a frame of N complex time samples, converting them into
N complex frequency coefficients.
The Fast Fourier Transform
• The idea of the FFT is to break down the transform by factorizing
N. Suppose for example that N = R × S. Split the output into S
blocks of size R and the input into R blocks of size S. In other
words:
