
Because the ARM multiply instruction (MUL) must use pipeline 0, statements (1) and (2) cannot be issued in parallel. The inputs of statement (3) are the outputs of statements (1) and (2), so the three statements must execute one after another. Furthermore, each MUL instruction occupies two cycles, so this multiply-add sequence needs five cycles when running on ARM. In the sub-band synthesis filter, multiply-add is the main operation, and it consumes many cycles each time it runs. NEON can help in this situation. The NEON VMUL instruction completes a vector multiplication in one cycle, which is equivalent to two multiply operations. The multiply-add operation is converted into NEON code:
VMUL D1, D2, D3

D1 to D3 are independent NEON register vectors. D2 contains the values of r2 and r5, while D3 contains the values of r3 and r6. The result is stored in D1, so this single NEON instruction performs two multiplications. Moreover, the NEON VMLA instruction is equivalent to two multiply-add operations. After NEON optimization, both the time per multiply-add operation and the computing time of the module are reduced.

IMDCT is the second most time-consuming module in the MP3 decoder, at about 25 percent of the total. IMDCT operates on 32 frequency sub-bands. Each sub-band contains one long window or three sequential short windows. A long window consists of 18 frequency lines, and a short window consists of six frequency lines. In its standard form, with n = 36 for a long window and n = 12 for a short window, the formula of IMDCT is:

$$x_i = \sum_{k=0}^{n/2-1} X_k \cos\left(\frac{\pi}{2n}\left(2i + 1 + \frac{n}{2}\right)(2k + 1)\right), \quad i = 0, \dots, n-1$$

where X_k are the input frequency lines and x_i are the output time samples.

After algorithm-level optimization, IMDCT is converted into an algorithm that consists mainly of multiply-add operations. As with the sub-band synthesis filter, the NEON VMUL and VMLA instructions can efficiently replace the multiply-add instructions of the ARM code, which substantially reduces the computing time of the IMDCT module.

Common audio decoders such as WMA, AAC and OGG contain a large number of discrete cosine transforms, so the same NEON optimization method applies to them as well. Furthermore, for multimedia processing, the NEON instruction set provides a range of optimized media processing instructions, such as saturating vector operations and vector load/store. Used properly, they make the optimization effect very significant.
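As a rough illustration of the two-lane operations described above, the following C sketch models what VMUL and VMLA compute on one 64-bit D register. The function names `mul2_s32` and `mla2_s32` are hypothetical, and 32-bit integer lanes are an assumption, since the text does not state the element type. When compiled for a NEON target the code uses the `arm_neon.h` intrinsics `vmul_s32` and `vmla_s32`; elsewhere it falls back to scalar C that produces the same values.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* d[i] = a[i] * b[i] for two 32-bit lanes: the effect of one VMUL on
 * D registers. The lane type is an assumption; inputs are expected to
 * be small enough that the products do not overflow. */
void mul2_s32(int32_t d[2], const int32_t a[2], const int32_t b[2])
{
#if HAVE_NEON
    vst1_s32(d, vmul_s32(vld1_s32(a), vld1_s32(b)));
#else
    d[0] = a[0] * b[0];
    d[1] = a[1] * b[1];
#endif
}

/* acc[i] += a[i] * b[i] for two 32-bit lanes: the effect of one VMLA,
 * i.e. two multiply-add operations at once. */
void mla2_s32(int32_t acc[2], const int32_t a[2], const int32_t b[2])
{
#if HAVE_NEON
    vst1_s32(acc, vmla_s32(vld1_s32(acc), vld1_s32(a), vld1_s32(b)));
#else
    acc[0] += a[0] * b[0];
    acc[1] += a[1] * b[1];
#endif
}
```

With the first input holding {r2, r5} and the second holding {r3, r6}, a single call to `mul2_s32` yields {r2*r3, r5*r6}, mirroring the register layout described for D2 and D3.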


Usually an FFT is used to implement the IMDCT/PQMF transforms efficiently. It is important to choose the FFT algorithm that is best suited to NEON. In most audio codecs the transform length is a power of 2, so a radix-2 or a mixed-radix algorithm (with radix-8, radix-4, etc.) can be used. A higher-radix algorithm has the advantage of fewer computational operations, and it also significantly reduces the number of data loads/stores. However, a higher radix means that more data points are required to compute one FFT butterfly. If there are not enough registers to hold all the data points of a butterfly, the data must be pushed onto the stack and popped back later, which offsets the advantages of the higher radix. Since NEON has 32 registers (64 bits each), a higher-radix FFT such as radix-8 or radix-16 is well suited to this platform. Hence a mixed-radix FFT algorithm with the maximum number of radix-16 and radix-8 decimations will be the optimal implementation of the FFT using NEON.

Merge Consecutive Functional Blocks

Two or more consecutive functional blocks can be merged in order to save loads/stores of data values. In most cases the algorithm allows such a merger, but it is often not done because too few registers are available (for storing data and pointers). Since Cortex-A8 has enough registers, it is suggested to merge functional blocks wherever possible and thus reduce the number of loads/stores.
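The merging idea can be sketched in C as follows. This is a minimal illustration, not code from any particular decoder: the two blocks chosen here (windowing followed by overlap-add) and the function names `window_then_overlap` and `window_overlap_merged` are assumptions made for the example. The unmerged version stores an intermediate buffer and reloads it, while the merged version keeps each intermediate value in a register.

```c
#include <stddef.h>

/* Two separate functional blocks: every windowed value is stored to
 * tmp[] by the first loop and loaded again by the second loop. */
void window_then_overlap(const float *x, const float *win,
                         const float *prev, float *tmp,
                         float *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        tmp[i] = x[i] * win[i];        /* block 1: windowing */
    for (size_t i = 0; i < n; ++i)
        out[i] = tmp[i] + prev[i];     /* block 2: overlap-add */
}

/* Merged block: the product never leaves a register, so one store
 * and one load per sample are saved and tmp[] is not needed. */
void window_overlap_merged(const float *x, const float *win,
                           const float *prev, float *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = x[i] * win[i] + prev[i];
}
```

Both functions compute the same output; the difference is only in memory traffic. The merged loop needs more values live at once (input, window coefficient, previous sample and their pointers), which is where a larger register file helps.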