Zynq Documentation

CHAPTER 1
INTRODUCTION TO EMBEDDED SYSTEMS
What is an embedded system:
An embedded system is a computer system with a dedicated function within a larger
electrical system. It is embedded as part of a complete device often including hardware
and mechanical parts. Embedded systems control many devices in common use
today. Ninety-eight percent of all microprocessors are manufactured as components of
embedded systems.
Properties typical of embedded computers, when compared with general-purpose
counterparts, are low power consumption, small size, rugged operating ranges,
and low per-unit cost. This comes at the price of limited processing resources, which
makes them significantly more difficult to program and to interact with. However, by
building intelligence mechanisms on top of the hardware, taking advantage of
existing sensors and the existence of a network of embedded units, one can both
optimally manage available resources at the unit and network levels and provide
augmented functions, well beyond those available. For example, intelligent techniques
can be designed to manage power consumption of embedded systems.
Modern embedded systems are often based on microcontrollers (i.e. CPUs with
integrated memory or peripheral interfaces), but ordinary microprocessors (using external
chips for memory and peripheral interface circuits) are also common, especially in
more complex systems. In either case, the processor(s) used may be types ranging from general
purpose to those specialized in a certain class of computations, or even custom designed for
the application at hand. A common standard class of dedicated processors is the digital
signal processor (DSP).
Since the embedded system is dedicated to specific tasks, design engineers can optimize
it to reduce the size and cost of the product and increase the reliability and performance.
Some embedded systems are mass-produced, benefiting from economies of scale.
Embedded systems range from portable devices such as digital watches, to large
stationary installations like traffic lights and factory controllers, and largely complex systems
like hybrid vehicles, MRI, and avionics. Complexity varies from low, with a
single microcontroller chip, to very high with multiple units, peripherals and networks
mounted inside a large chassis or enclosure.
HISTORY:
One of the very first recognizably modern embedded systems was the Apollo Guidance
Computer, developed by Charles Stark Draper at the MIT Instrumentation Laboratory. At the
project's inception, the Apollo Guidance Computer was considered the riskiest item in the
Apollo project, as it employed the then newly developed monolithic integrated circuits to
reduce the size and weight. An early mass-produced embedded system was the Autonetics
D-17 guidance computer for the Minuteman missile, released in 1961. When the
Minuteman II went into production in 1966, the D-17 was replaced with a new computer
that was the first high-volume use of integrated circuits.
Since these early applications in the 1960s, embedded systems have come down in price
and there has been a dramatic rise in processing power and functionality. An
early microprocessor, the Intel 4004, was designed for calculators and other
small systems but still required external memory and support chips. In 1978 the National
Electrical Manufacturers Association released a "standard" for programmable
microcontrollers, including almost any computer-based controllers, such as single board
computers, numerical, and event-based controllers.
As the cost of microprocessors and microcontrollers fell, it became feasible to replace
expensive knob-based analog components such as potentiometers and variable
capacitors with up/down buttons or knobs read out by a microprocessor, even in consumer
products. By the early 1980s, memory, input and output system components had been
integrated into the same chip as the processor, forming a microcontroller. Microcontrollers
find applications where a general-purpose computer would be too costly.
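The replacement of a potentiometer by up/down buttons can be sketched in a few lines of C. This is only an illustrative sketch: the names `knob_t`, `knob_init`, `knob_poll` and the 0..31 range are hypothetical, and the button states are assumed to be already debounced and passed in by the caller rather than read from real hardware.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define LEVEL_MIN 0u    /* hypothetical lowest setting  */
#define LEVEL_MAX 31u   /* hypothetical highest setting */

/* State of one software "knob" driven by two push buttons. */
typedef struct {
    uint8_t level;      /* current setting, LEVEL_MIN..LEVEL_MAX */
    bool    prev_up;    /* UP button state at the previous poll   */
    bool    prev_down;  /* DOWN button state at the previous poll */
} knob_t;

/* Initialise the knob, clamping the start value into the valid range. */
void knob_init(knob_t *k, uint8_t start)
{
    assert(k);
    k->level = (start > LEVEL_MAX) ? (uint8_t)LEVEL_MAX : start;
    k->prev_up = false;
    k->prev_down = false;
}

/* Call once per polling tick with the current (debounced) button states.
 * The level changes only on a press edge, so holding a button does not
 * repeat, and the value is clamped at both ends of the range. */
uint8_t knob_poll(knob_t *k, bool up, bool down)
{
    assert(k);
    if (up && !k->prev_up && k->level < LEVEL_MAX) {
        k->level++;
    }
    if (down && !k->prev_down && k->level > LEVEL_MIN) {
        k->level--;
    }
    k->prev_up = up;
    k->prev_down = down;
    return k->level;
}
```

In a real product the returned level would then drive, for example, a PWM duty cycle or a digital volume control, which is where the software takes over the role of the analog part.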
A comparatively low-cost microcontroller may be programmed to fulfill the same role as
a large number of separate components. Although in this context an embedded system is
usually more complex than a traditional solution, most of the complexity is contained
within the microcontroller itself. Very few additional components may be needed and
most of the design effort is in the software. Software prototype and test can be quicker
compared with the design and construction of a new circuit not using an embedded
processor.
Characteristics:
Embedded systems are designed to do some specific task, rather than be a
general-purpose computer for multiple tasks. Some also have real-time performance constraints
that must be met, for reasons such as safety and usability; others may have low or no
performance requirements, allowing the system hardware to be simplified to reduce costs.
Embedded systems are not always standalone devices. Many embedded systems consist
of small parts within a larger device that serves a more general purpose. For example,
the Gibson Robot Guitar features an embedded system for tuning the strings, but the overall
purpose of the Robot Guitar is, of course, to play music. Similarly, an embedded system
in an automobile provides a specific function as a subsystem of the car itself.
Figure: e-con Systems eSOM270 and eSOM300 computer-on-modules
The program instructions written for embedded systems are referred to as firmware, and
are stored in read-only memory or flash memory chips. They run with limited computer
hardware resources: little memory, small or non-existent keyboard or screen.
User interface
Figure: Embedded system text user interface using MicroVGA
Embedded systems range from no user interface at all, in systems dedicated only to one
task, to complex graphical user interfaces that resemble modern computer desktop operating
systems. Simple embedded devices use buttons, LEDs, and graphic or character LCDs
(HD44780 LCD, for example) with a simple menu system.
More sophisticated devices which use a graphical screen with touch sensing or
screen-edge buttons provide flexibility while minimizing space used: the meaning of the buttons
can change with the screen, and selection involves the natural behavior of pointing at
what is desired. Handheld systems often have a screen with a "joystick button" for a
pointing device.
Some systems provide a user interface remotely with the help of a serial (e.g. RS-232, USB,
I²C, etc.) or network (e.g. Ethernet) connection. This approach gives several advantages:
it extends the capabilities of the embedded system, avoids the cost of a display,
simplifies the BSP and allows one to build a rich user interface on the PC. A good example of
this is the combination of an embedded web server running on an embedded device
(such as an IP camera or a network router). The user interface is displayed in a web
browser on a PC connected to the device, therefore needing no software to be installed.
Processors in embedded systems

Embedded processors can be broken into two broad categories. Ordinary microprocessors
use separate integrated circuits for memory and peripherals. Microcontrollers have
on-chip peripherals, thus reducing power consumption, size and cost. In contrast to the
personal computer market, many different basic CPU architectures are used, since software
is custom-developed for an application and is not a commodity product installed by the
end user. Both von Neumann and various degrees of Harvard architectures are
used. RISC as well as non-RISC processors are found. Word lengths vary from 4-bit to
64-bit and beyond, although the most typical remain 8/16-bit. Most architectures come in
a large number of different variants and shapes, many of which are also manufactured by
several different companies.
Numerous microcontrollers have been developed for embedded systems use.
General-purpose microprocessors are also used in embedded systems, but generally require more
support circuitry than microcontrollers.
Ready made computer boards
PC/104 and PC/104+ are examples of standards for ready made computer boards intended
for small, low-volume embedded and ruggedized systems, mostly x86-based. These are
often physically small compared to a standard PC, although still quite large compared to
most simple (8/16-bit) embedded systems. They often use DOS, Linux, NetBSD, or an
embedded real-time operating system such as MicroC/OS-II, QNX or VxWorks. Sometimes
these boards use non-x86 processors.
In certain applications, where small size or power efficiency are not primary concerns,
the components used may be compatible with those used in general purpose x86 personal
computers. Boards such as the VIA EPIA range help to bridge the gap by being PC-compatible
but highly integrated, physically smaller or having other attributes that make them
attractive to embedded engineers. The advantage of this approach is that low-cost
commodity components may be used along with the same software development tools
used for general software development. Systems built in this way are still regarded as
embedded since they are integrated into larger devices and fulfill a single role. Examples
of devices that may adopt this approach are ATMs and arcade machines, which contain
code specific to the application.
However, most ready-made embedded systems boards are not PC-centered and do not use
the ISA or PCI buses. When a system-on-a-chip processor is involved, there may be
little benefit to having a standardized bus connecting discrete components, and the
environment for both hardware and software tools may be very different.
One common design style uses a small system module, perhaps the size of a business
card, holding high density BGA chips such as an ARM-based system-on-a-chip processor
and peripherals, external flash memory for storage, and DRAM for runtime memory. The
module vendor will usually provide boot software and make sure there is a selection of
operating systems, usually including Linux and some real time choices. These modules
can be manufactured in high volume, by organizations familiar with their specialized
testing issues, and combined with much lower volume custom mainboards with
application-specific external peripherals.
As implementations of embedded systems have advanced, embedded systems can easily be
implemented with ready-made boards that are based on widely accepted platforms.
These platforms include, but are not limited to, Arduino and Raspberry Pi.
ASIC and FPGA solutions
A common configuration for very-high-volume embedded systems is
the system on a chip (SoC), which contains a complete system consisting of multiple
processors, multipliers, caches and interfaces on a single chip. SoCs can be implemented
as an application-specific integrated circuit (ASIC) or using a field-programmable gate
array (FPGA).
Peripherals
Figure: A close-up of the SMSC LAN91C110 (SMSC 91x) chip, an embedded Ethernet chip
Embedded systems talk with the outside world via peripherals, such as:
Serial Communication Interfaces (SCI): RS-232, RS-422, RS-485, etc.
Synchronous Serial Communication Interface: I2C, SPI, SSC and ESSI (Enhanced
Synchronous Serial Interface)
Universal Serial Bus (USB)
Multi Media Cards (SD cards, Compact Flash, etc.)
Networks: Ethernet, LonWorks, etc.
Fieldbuses: CAN-Bus, LIN-Bus, PROFIBUS, etc.
Timers: PLL(s), Capture/Compare and Time Processing Units
Discrete IO: aka General Purpose Input/Output (GPIO)
Debugging: JTAG, ISP, ICSP, BDM Port, BITP, and DB9 ports
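Many of the peripherals listed above are controlled through memory-mapped registers. The following C sketch shows the usual bit-set, bit-clear and bit-test pattern for a GPIO data register. It is an illustration under stated assumptions: the register is a hypothetical 32-bit register with one bit per pin, and a plain variable (`fake_gpio_data`) stands in for the hardware address so that the code also runs on a host PC. On a real device the pointer would come from the vendor's memory map, and the helper names here are not part of any vendor API.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for a hardware register. On a real target, GPIO_DATA would be
 * a fixed address taken from the device's memory map (an assumption of
 * this sketch; no real address is used here). */
static volatile uint32_t fake_gpio_data;
#define GPIO_DATA (&fake_gpio_data)

/* Drive one pin high or low with a read-modify-write of the register. */
static inline void gpio_write_pin(volatile uint32_t *reg, unsigned pin, int high)
{
    assert(pin < 32u);
    if (high) {
        *reg |= (UINT32_C(1) << pin);    /* set the pin's bit   */
    } else {
        *reg &= ~(UINT32_C(1) << pin);   /* clear the pin's bit */
    }
}

/* Return 1 if the pin's bit is set in the register, 0 otherwise. */
static inline int gpio_read_pin(const volatile uint32_t *reg, unsigned pin)
{
    assert(pin < 32u);
    return (int)((*reg >> pin) & 1u);
}
```

The `volatile` qualifier matters on real hardware: it tells the compiler that every access must actually reach the register and may not be cached or optimized away.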
APPLICATIONS:
Embedded systems are commonly found in consumer, cooking, industrial, automotive,
medical, commercial and military applications.
Telecommunications systems employ numerous embedded systems from telephone
switches for the network to cellphones at the end user. Computer networking uses
dedicated routers and network bridges to route data.
Consumer electronics include MP3 players, mobile phones, video game consoles, digital
cameras, GPS receivers, and printers. Household appliances, such as microwave ovens,
washing machines and dishwashers, include embedded systems to provide flexibility,
efficiency and features. Home automation uses wired and wireless networking that can
be used to control lights, climate, security, audio/visual, surveillance, etc., all of which
use embedded devices for sensing and controlling.
Transportation systems from flight to automobiles increasingly use embedded systems.
New airplanes contain advanced avionics such as inertial guidance
systems and GPS receivers that also have considerable safety requirements. Various
electric motors (brushless DC motors, induction motors and DC motors) use
electric/electronic motor controllers. Automobiles, electric vehicles, and hybrid
vehicles increasingly use embedded systems to maximize efficiency and reduce pollution.
Other automotive safety systems include the anti-lock braking system (ABS), electronic
stability control (ESC/ESP), traction control (TCS) and automatic four-wheel drive.
Medical equipment uses embedded systems for vital signs monitoring, electronic
stethoscopes for amplifying sounds, and various medical imaging (PET, SPECT, CT,
and MRI) for non-invasive internal inspections. Embedded systems within medical
equipment are often powered by industrial computers.
Embedded systems are used in transportation, fire safety, safety and security, medical
applications and life-critical systems, as these systems can be isolated from hacking and
thus be more reliable. For fire safety, the systems can be designed to have greater ability
to handle higher temperatures and continue to operate. In dealing with security, the
embedded systems can be self-sufficient and be able to deal with cut electrical and
communication systems.
A new class of miniature wireless devices called motes are networked wireless sensors.
Wireless sensor networking, WSN, makes use of miniaturization made possible by
advanced IC design to couple full wireless subsystems to sophisticated sensors, enabling
people and companies to measure a myriad of things in the physical world and act on this
information through IT monitoring and control systems. These motes are completely
self-contained, and will typically run off a battery source for years before the batteries need to
be changed or charged.
Embedded Wi-Fi modules provide a simple means of wirelessly enabling any device
which communicates via a serial port.
CHAPTER 2
REALIZATION OF EMBEDDED SYSTEM USING VIVADO DESIGN SUITE
The Vivado Design Suite offers a new approach for ultra-high productivity with
next-generation C/C++ and IP-based design. The new HLx editions include HL System
Edition, HL Design Edition and HL WebPACK Edition. When coupled with the new
UltraFast High-Level Productivity Design Methodology Guide, users can realize a
10-15X productivity gain over traditional approaches.
Unlike traditional RTL-based design, where the majority of the design effort is spent in
the backend of the design process, C and IP-based design allows for reduced development
cycles in verification, implementation and design convergence, so designers can focus on
their differentiated logic. This flow includes:
Rapid generation of the platform connectivity design, along with the necessary software
stack
Rapid differentiated logic development using high-level design. This also enables
superior design reuse capabilities.
Dramatically shortened verification times from high-level languages, compared to RTL.
Using high levels of abstraction, design teams can quickly get overall better or equal
Quality of Results (performance, power, utilization)
UltraFast High-Level Productivity Design Methodology Guide
Traditional design development starts with experienced system architects estimating how
their design will be implemented in a new technology, capturing both the system
connectivity requirements and the value-added differentiated logic in a high-level modeling
format. In turn, RTL designs implement those requirements. RTL design cycles typically
consist of verification and design closure iterations for each block, as well as for the entire
design. As a consequence of this methodology, the platform connectivity design never
stabilizes, as any change in the differentiated logic can cause an IO interface (e.g. DDR
memory, Ethernet, PCIe) to fail timing requirements. Also, RTL verification cycles no
longer allow for exhaustive functional tests prior to hardware bring-up.
The High-Level Design Methodology turns the development effort on its head allowing
designers to spend more time designing the value-add logic, and less time trying to make
it work. This flow provides a 15X reduction in design cycle compared to an RTL design
flow. The main attributes of this high-level methodology are:
Separation of platform development and differentiated logic, allowing designers to
focus on the company's high-value functionality.
Rapid configuration, generation and closure of the platform connectivity, using Vivado
IP Integrator with board awareness, as well as the Vivado IP systems.
C-based simulation for the differentiated logic, decreasing simulation times by orders of
magnitude over traditional RTL simulation.
High-Level synthesis with Vivado HLS and C/C++ libraries, as well as IP Integrator
for rapid implementation and system integration from C to silicon.
HLx speeds the creation, design modification and reuse and complements Xilinx's SDx
family of software-defined environments by providing a methodology for designing
custom-platforms that are software programmable.
C-based Design and Accelerated Reuse:
A typical system starts with a software model of the system. Whether for entertainment,
gaming, communications, or medicine, most products begin as a software model or
prototype. This model is then distributed to the hardware and embedded software teams.
Hardware design teams are tasked to choose an RTL microarchitecture that meets the
system requirement.
The biggest advantage of programmable devices such as FPGAs is the ability to create
custom-hardware which is optimized for any specific application. As a result, the end
product has orders of magnitude better performance per watt than a pure software
program running on a distributed processor-based system.
The Vivado High-Level Synthesis (HLS) compiler provides a programming environment
similar to those available for processor compilers. The main difference is that Vivado
HLS compiles the C code into an optimized RTL microarchitecture, while processor-based
compilers generate assembly code to be executed on a fixed, GHz-rate processor architecture.
System architects, software programmers or hardware engineers can use Vivado HLS to
create custom hardware optimized for throughput, power and latency. This allows for
optimized implementation of high performance, low power, or low cost systems, for any
application including compute, storage, or networking.
Vivado HLS accelerates design implementation and verification by enabling C/C++
specifications to be directly synthesized into VHDL or Verilog RTL, after exploring a
multitude of micro-architectures based on design requirements. Functional simulation can
be performed at the C level, providing orders of magnitude acceleration over VHDL or
Verilog simulation. For example, with a video motion estimation algorithm, the C input to
Vivado HLS executes 10 frames of video data in 10 seconds, while the corresponding RTL
model takes roughly two days to process the same 10 video frames.
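To make the idea of C-level functional simulation concrete, the sketch below shows the kind of small C function that could be handed to a high-level synthesis tool, together with a C test bench that checks it on the host. It is an assumed example, not taken from any Xilinx design: the 4-tap FIR filter, the names `fir` and `fir_testbench`, and the coefficient values are all hypothetical. A tool-specific directive (for instance a pipelining pragma) would normally be added inside the function; it is left out here so the code compiles cleanly with any ordinary C compiler.

```c
#include <assert.h>
#include <stdint.h>

#define N_TAPS 4   /* hypothetical filter length */

/* Candidate top-level function for synthesis: one sample in, one
 * filtered sample out per call. The static delay line keeps its
 * contents between calls, as stored state would in hardware. */
int32_t fir(int16_t sample, const int16_t coeff[N_TAPS])
{
    static int16_t shift_reg[N_TAPS];
    int32_t acc = 0;

    /* Shift the delay line by one position and accumulate the older taps. */
    for (int i = N_TAPS - 1; i > 0; i--) {
        shift_reg[i] = shift_reg[i - 1];
        acc += (int32_t)shift_reg[i] * coeff[i];
    }
    /* Insert the newest sample and add its contribution. */
    shift_reg[0] = sample;
    acc += (int32_t)sample * coeff[0];
    return acc;
}

/* C test bench: a unit impulse must reproduce the coefficients in order,
 * followed by zero once the impulse has left the delay line. */
void fir_testbench(void)
{
    const int16_t coeff[N_TAPS] = {1, 2, 3, 4};
    const int16_t input[N_TAPS + 1] = {1, 0, 0, 0, 0};
    const int32_t expected[N_TAPS + 1] = {1, 2, 3, 4, 0};

    for (int n = 0; n < N_TAPS + 1; n++) {
        assert(fir(input[n], coeff) == expected[n]);
    }
}
```

Because the test bench is plain C, it runs in a fraction of a second on a workstation, which is the source of the simulation speed-up described above. The same test vectors can later be reused to check the generated RTL.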
When coupled with Vivado IP Integrator, Vivado HLS provides designers and system
architects with a faster and more robust way of delivering quality designs.
Vivado HLS provides a faster path to IP creation by:
Abstraction of algorithmic description, data type specification (integer, fixed-point or
floating-point) and interfaces (FIFO, AXI4, AXI4-Lite, AXI4-Stream)
Directives-driven, architecture-aware synthesis to quickly deliver a design that can rival
or beat hand-coded RTL implementations in terms of performance, power and area
utilization
Accelerated verification using C/C++ test bench simulation, automatic VHDL or
Verilog simulation and test bench generation
Multi-language support (C, C++, OpenCL, SystemC) and the broadest language
coverage in the industry
Automatic use of Xilinx on-chip memory hierarchy, Digital
Signal Processing compute elements and floating-point library
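The choice between integer, fixed-point and floating-point data types mentioned in the list above has a direct effect on hardware cost. As a rough illustration of what a fixed-point type means, the following plain C sketch implements a Q1.15 multiply (16-bit values representing numbers in the range -1 to just under 1). This is a generic technique written in standard C; it does not use the arbitrary-precision types that an HLS tool may provide, and the names `q15_t`, `q15_from_double` and `q15_mul` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Q1.15 fixed point: real value = raw / 32768. */
typedef int16_t q15_t;

/* Convert a real number in [-1, 1) to Q1.15 (truncating toward zero). */
q15_t q15_from_double(double x)
{
    assert(x >= -1.0 && x < 1.0);
    return (q15_t)(x * 32768.0);
}

/* Multiply two Q1.15 values with rounding and saturation.
 * The 32-bit product is in Q2.30 format, so it is shifted right by 15.
 * Note: right-shifting a negative signed value is implementation-defined
 * in C; this sketch assumes the common arithmetic-shift behavior. */
q15_t q15_mul(q15_t a, q15_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;
    p += 1 << 14;                 /* add half an LSB to round to nearest */
    p >>= 15;
    if (p > INT16_MAX) {
        p = INT16_MAX;            /* saturate the single overflow case: -1 * -1 */
    }
    return (q15_t)p;
}
```

A 16-bit fixed-point multiply of this kind maps onto a single integer multiplier, whereas a floating-point multiply needs considerably more logic, which is why the data type is worth deciding early.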
Domain Focused Software Libraries
Supported libraries include Math, DSP, Video, and linear algebra libraries for high
performance, low power implementations. In addition, for complex cores such as FFTs and
filters, HLS integrates the optimized LogiCORE IP FFT and FIR Compiler for the
highest quality of results. For domain-specific acceleration, Xilinx Alliance members
also provide libraries for OpenCV, BLAS, machine learning and more.
Reuse of Complete IP Sub-systems:
Xilinx and its Alliance Partners have a rich library of Intellectual Property (IP), to help
get products to market faster. The IP goes through a vigorous test and validation effort to
ensure success the first time. Beyond a simple library of cores, Xilinx provides solutions to
increase productivity.
Xilinx's new LogiCORE IP sub-systems are highly configurable, market-tailored
building blocks that integrate multiple individual IP cores, including data movers,
software drivers, example designs and test benches. Available with the Vivado Design
Suite are new IP subsystems for Ethernet, PCIe, HDMI, video processing, image sensor
processing and OTN development. As an example, the AXI4 PCIe subsystem leverages
multiple IP cores including PCIe, DMA and AXI4 interconnect, and provides the necessary
software stack to be used in a processor system.
All IP sub-systems are based on industry standards: AMBA AXI4 interconnect
protocol, IEEE P1735 encryption and XDC design constraints, enabling interoperability
with user and Xilinx Alliance member packaged IP, in order to accelerate integration.
Integration Automation:
The Vivado Design Suite shatters the RTL design productivity plateau with Vivado IP
Integrator, the industry's first plug-and-play system integration design environment.
Vivado IP Integrator enables rapid platform creation by generating customized
connectivity to the board interfaces. It also enables system assembly of highly concurrent
C/C++ generated functions onto a platform.
Vivado IP Integrator provides a graphical and Tcl-based, correct-by-construction design
development flow. It provides a device and platform aware, interactive environment that
supports intelligent auto-connection of key interfaces, one-click subsystem generation,
real-time DRCs, and interface change propagation, combined with a powerful debug
capability.
Designers work at the interface and not signal level of abstraction when making
connections between functions, greatly increasing productivity. Although IP Integrator
leverages the industry standard AXI4, it also supports other interfaces and users can
define their own custom interfaces for greater flexibility.
With open, industry IP standards, Vivado Design Suite enables third-party vendors to
deliver their IP portfolios to developers, who can now integrate them with Vivado IPI.
Users can also package their own RTL, or C/C++/SystemC and MATLAB/Simulink
algorithms into the IP catalog using Vivado HLS or System Generator for DSP with the
Vivado IP packager.
The Vivado Design Suite accelerates the implementation process by delivering more
turns per day while helping to reduce the number of design iterations needed. Its shared,
scalable data model delivers unrivaled compile times and memory footprint, and enables
early analysis of critical design metrics such as power, timing and resource utilization.
These metrics allow design and tool setting modifications to occur earlier in the design
process, where iterations are faster and the impact on system performance is higher.
Using the High-Level Design Methodology, iterations are pushed even higher, at the
C/C++ level, for even faster and higher impact iterations, dwarfing the impact and the
need of last minute place-and-route closure iterations.
Platform Creation and Reuse:
The Vivado Design Suite is not only device aware; it is also target platform aware,
supporting Zynq SoCs and MPSoCs, along with ASIC-class FPGAs and 3D IC boards
and kits. By being target platform aware, Vivado configures and applies board-specific
design rule checks, which ensures rapid bring up of working systems.
For example, by selecting the Zynq-7000 All Programmable SoC ZC702 Evaluation Kit,
and instantiating a Zynq processing system within IPI, Vivado preconfigures the
processing system with the correct peripherals, drivers, and memory map to support the
board. Platform designers can now more rapidly identify, reuse, and integrate both
software and hardware IP, targeting the dual-core ARM processing system and high-performance
FPGA logic. The user easily specifies the interface between the processing
system and their logic with a series of dialog boxes. Interfaces are automatically
generated, optimized for performance or area, and then users can add their own
algorithms with Vivado HLS or use the Vivado IP catalog to complete their design.
CHAPTER 3
PROCESSOR
ARM:
ARM, originally Acorn RISC Machine, later Advanced RISC Machine, is a family
of reduced instruction set computing (RISC) architectures for computer processors,
configured for various environments. British company ARM Holdings develops the
architecture and licenses it to other companies, who design their own products that
implement one of those architectures, including systems on chips (SoC) and systems on
modules (SoM) that incorporate memory, interfaces, radios, etc. It also designs cores that
implement this instruction set and licenses these designs to a number of companies that
incorporate those core designs into their own products.
A RISC-based computer design approach means processors require fewer transistors than
typical complex instruction set computing (CISC) x86 processors in most personal
computers. This approach reduces costs, heat and power use. These characteristics are
desirable for light, portable, battery-powered devices,
including smartphones, laptops and tablet computers, and other embedded
systems. For supercomputers, which consume large amounts of electricity, ARM could
also be a power-efficient solution.
ARM Holdings periodically releases updates to architectures and core designs. All of
them support a 32-bit address space (only pre-ARMv3 chips, made before ARM Holdings
was formed, as in the original Acorn Archimedes, had smaller) and 32-bit arithmetic;
ARM Holdings' cores use 32-bit fixed-length instructions, but later
versions of the architecture also support a variable-length instruction set that provides
both 32- and 16-bit instructions for improved code density. Some older cores can also
provide hardware execution of Java bytecodes. The ARMv8-A architecture, announced in
October 2011, adds support for a 64-bit address space and 64-bit arithmetic with its new
32-bit fixed-length instruction set.
HISTORY OF ARM:
The British computer manufacturer Acorn Computers first developed the Acorn RISC
Machine architecture (ARM) in the 1980s to use in its personal computers. Its first
ARM-based products were coprocessor modules for the BBC Micro series of computers. After
the successful BBC Micro computer, Acorn Computers considered how to move on from
the relatively simple MOS Technology 6502 processor to address business markets like
the one that was soon dominated by the IBM PC, launched in 1981. The Acorn Business
Computer (ABC) plan required that a number of second processors be made to work with
the BBC Micro platform, but processors such as the Motorola 68000 and National
Semiconductor 32016 were considered unsuitable, and the 6502 was not powerful enough
for a graphics-based user interface.
According to Sophie Wilson, all the tested processors at that time performed about the
same, with about a 4 Mbit/second bandwidth.
After testing all available processors and finding them lacking, Acorn decided it needed a
new architecture. Inspired by white papers on the Berkeley RISC project, Acorn
considered designing its own processor. A visit to the Western Design Center in Phoenix,
where the 6502 was being updated by what was effectively a single-person company,
showed Acorn engineers Steve Furber and Sophie Wilson they did not need massive
resources and state-of-the-art research and development facilities.
Wilson developed the instruction set, writing a simulation of the processor in BBC
BASIC that ran on a BBC Micro with a 6502 second processor. This convinced Acorn
engineers they were on the right track. Wilson approached Acorn's CEO, Hermann
Hauser, and requested more resources. Hauser gave his approval and assembled a small
team to implement Wilson's model in hardware.
Acorn RISC Machine: ARM2
The official Acorn RISC Machine project started in October 1983. They chose VLSI
Technology as the silicon partner, as they were a source of ROMs and custom chips for
Acorn. Wilson and Furber led the design. They implemented it with a similar efficiency
ethos as the 6502. A key design goal was achieving low-latency input/output (interrupt)
handling like the 6502. The 6502's memory access architecture had let developers
produce fast machines without costly direct memory access (DMA) hardware.
The first samples of ARM silicon worked properly when first received and tested on 26
April 1985.
The first ARM application was as a second processor for the BBC Micro, where it helped
in developing simulation software to finish development of the support chips (VIDC,
IOC, MEMC), and sped up the CAD software used in ARM2 development. Wilson
subsequently rewrote BBC BASIC in ARM assembly language. The in-depth knowledge
gained from designing the instruction set enabled the code to be very dense, making
ARM BBC BASIC an extremely good test for any ARM emulator. The original aim of a
principally ARM-based computer was achieved in 1987 with the release of the Acorn
Archimedes. In 1992, Acorn once more won the Queen's Award for Technology for the ARM.
The ARM2 featured a 32-bit data bus, 26-bit address space and 27 32-bit registers.
Eight bits from the program counter register were available for other purposes; the top
six bits (available because of the 26-bit address space) served as status flags, and the
bottom two bits (available because the program counter was always word-aligned) were
used for setting modes. The address bus was extended to 32 bits in the ARM6, but
program code still had to lie within the first 64 MB of memory in 26-bit compatibility
mode, due to the reserved bits for the status flags. The ARM2 had a transistor count of
just 30,000, compared to Motorola's six-year-older 68000 model with around
40,000. Much of this simplicity came from the lack of microcode (which represents about
one-quarter to one-third of the 68000) and from (like most CPUs of the day) not
including any cache. This simplicity enabled low power consumption, yet better
performance than the Intel 80286. A successor, ARM3, was produced with a 4 KB cache,
which further improved performance.
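The sharing of the program counter register between address, flags and mode bits can be made concrete with a few masks. The sketch below packs and unpacks a 32-bit word laid out as described above: six flag bits at the top, a word-aligned 26-bit address in the middle, and two mode bits at the bottom. It is a host-side illustration only; the helper names are hypothetical and the flag bits are treated as one opaque 6-bit field rather than named individually.

```c
#include <assert.h>
#include <stdint.h>

#define R15_MODE_MASK   UINT32_C(0x00000003)  /* bits 1:0, processor mode        */
#define R15_PC_MASK     UINT32_C(0x03FFFFFC)  /* bits 25:2, word-aligned address */
#define R15_FLAGS_SHIFT 26                    /* bits 31:26, status flags        */

/* Combine a word-aligned address below 64 MB, a 6-bit flag field and a
 * 2-bit mode into one 32-bit register value. */
uint32_t r15_pack(uint32_t pc, uint32_t flags, uint32_t mode)
{
    assert((pc & ~R15_PC_MASK) == 0);   /* aligned and within 26 bits */
    assert(flags < 64u);
    assert(mode < 4u);
    return (flags << R15_FLAGS_SHIFT) | pc | mode;
}

/* Extract the individual fields again. */
uint32_t r15_pc(uint32_t r15)    { return r15 & R15_PC_MASK; }
uint32_t r15_flags(uint32_t r15) { return r15 >> R15_FLAGS_SHIFT; }
uint32_t r15_mode(uint32_t r15)  { return r15 & R15_MODE_MASK; }
```

The address and mode fields together cover bits 25 down to 0, that is 2^26 bytes, which is where the 64 MB limit on program code comes from.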
32 BIT ARCHITECTURE OF ARM:
The 32-bit ARM architecture, such as ARMv7-A, is the most widely used architecture in
mobile devices.
Since 1995, the ARM Architecture Reference Manual has been the primary source of
documentation on the ARM processor architecture and instruction set, distinguishing
interfaces that all ARM processors are required to support (such as instruction semantics)
from implementation details that may vary. The architecture has evolved over time, and
version seven of the architecture, ARMv7, defines three architecture "profiles":
A-profile, the "Application" profile, implemented by 32-bit cores in the Cortex-A series
and by some non-ARM cores
R-profile, the "Real-time" profile, implemented by cores in the Cortex-R series
M-profile, the "Microcontroller" profile, implemented by most cores in the Cortex-

M series
Although the architecture profiles were first defined for ARMv7, ARM subsequently
defined the ARMv6-M architecture (used by the Cortex-M0 / M0+ / M1) as a subset of the
ARMv7-M profile with fewer instructions.
CPU modes
Except in the M-profile, the 32-bit ARM architecture specifies several CPU modes,
depending on the implemented architecture features. At any moment in time, the CPU
can be in only one mode, but it can switch modes due to external events (interrupts) or
programmatically.[57]
User mode: The only non-privileged mode.
FIQ mode: A privileged mode that is entered whenever the processor accepts an FIQ
interrupt.
IRQ mode: A privileged mode that is entered whenever the processor accepts an IRQ
interrupt.
Supervisor (svc) mode: A privileged mode entered whenever the CPU is reset or when an
SVC instruction is executed.
Abort mode: A privileged mode that is entered whenever a prefetch abort or data abort
exception occurs.
Undefined mode: A privileged mode that is entered whenever an undefined instruction
exception occurs.
System mode (ARMv4 and above): The only privileged mode that is not entered by an
exception. It can only be entered by executing an instruction that explicitly writes to the
mode bits of the CPSR.
Monitor mode (ARMv6 and ARMv7 Security Extensions, ARMv8 EL3): A monitor
mode is introduced to support TrustZone extension in ARM cores.
Hyp mode (ARMv7 Virtualization Extensions, ARMv8 EL2): A hypervisor mode that
supports Popek and Goldberg virtualization requirements for the non-secure operation of the CPU.
Thread mode (ARMv6-M, ARMv7-M, ARMv8-M): A mode that can be configured as either
privileged or unprivileged. Whether the Main Stack Pointer (MSP) or the Process Stack
Pointer (PSP) is used can also be selected in the CONTROL register, which requires
privileged access. This mode is designed for user tasks in an RTOS environment, but it is
also typically used for the super-loop in bare-metal applications.
Handler mode (ARMv6-M, ARMv7-M, ARMv8-M): A mode dedicated to exception handling
(except reset, which is handled in Thread mode). Handler mode always uses the MSP and
runs at the privileged level.
INSTRUCTION SET
The original (and subsequent) ARM implementation was hardwired without microcode,
like the much simpler 8-bit 6502 processor used in prior Acorn microcomputers.
The 32-bit ARM architecture (and the 64-bit architecture for the most part) includes the
following RISC features:
Load/store architecture
No support for unaligned memory accesses in the original version of the architecture.
ARMv6 and later, except some microcontroller versions, support unaligned accesses for
half-word and single-word load/store instructions with some limitations, such as no
guaranteed atomicity.
Uniform register file of 16 32-bit registers (including the program counter, stack pointer
and the link register).
Fixed instruction width of 32 bits to ease decoding and pipelining, at the cost of
decreased code density. Later, the Thumb instruction set added 16-bit instructions and
increased code density.
Mostly single clock-cycle execution.
To compensate for the simpler design, compared with processors like the Intel 80286
and Motorola 68020, some additional design features were used:
Conditional execution of most instructions reduces branch overhead and compensates for
the lack of a branch predictor.
Arithmetic instructions alter condition codes only when desired.
A 32-bit barrel shifter can be used without performance penalty with most
arithmetic instructions and address calculations.
Has powerful indexed addressing modes.
A link register supports fast leaf function calls.
A simple, but fast, 2-priority-level interrupt subsystem has switched register banks.
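Conditional execution, the first feature in the list above, can be illustrated with a small Python model. It is a sketch rather than a description of any particular core: the helper names are hypothetical, and the mapping from condition mnemonics to flag tests follows the standard ARM condition codes, which this document does not list.

```python
# Sketch: deciding whether a conditionally executed instruction runs.
# Each condition is a test on the (N, Z, C, V) flags.
CONDITIONS = {
    "EQ": lambda n, z, c, v: z,
    "NE": lambda n, z, c, v: not z,
    "CS": lambda n, z, c, v: c,
    "CC": lambda n, z, c, v: not c,
    "MI": lambda n, z, c, v: n,
    "PL": lambda n, z, c, v: not n,
    "VS": lambda n, z, c, v: v,
    "VC": lambda n, z, c, v: not v,
    "HI": lambda n, z, c, v: c and not z,
    "LS": lambda n, z, c, v: (not c) or z,
    "GE": lambda n, z, c, v: n == v,
    "LT": lambda n, z, c, v: n != v,
    "GT": lambda n, z, c, v: (not z) and n == v,
    "LE": lambda n, z, c, v: z or n != v,
    "AL": lambda n, z, c, v: True,
}


def compare(a, b):
    """Return the (N, Z, C, V) flags a 32-bit compare of a with b would set."""
    a &= 0xFFFFFFFF
    b &= 0xFFFFFFFF
    result = (a - b) & 0xFFFFFFFF
    n = bool(result >> 31)
    z = result == 0
    c = a >= b                                  # carry means "no borrow"
    v = bool(((a ^ b) & (a ^ result)) >> 31)    # signed overflow
    return n, z, c, v


def executes(condition, flags):
    """True if an instruction carrying this condition would execute."""
    return bool(CONDITIONS[condition](*flags))
```

After a compare, a short sequence of instructions with opposite conditions (for example GE and LT) can replace a branch around each alternative, which is how the branch overhead mentioned above is avoided.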
Arithmetic instructions
ARM includes integer arithmetic operations for add, subtract, and multiply; some
versions of the architecture also support divide operations.
ARM supports 32-bit x 32-bit multiplies with either a 32-bit result or 64-bit result,
though Cortex-M0 / M0+ / M1 cores don't support 64-bit results. Some ARM cores also
support 16-bit x 16-bit and 32-bit x 16-bit multiplies.
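The difference between a 32-bit and a 64-bit multiply result can be shown with a minimal Python sketch. The function names borrow the ARM mnemonics MUL, UMULL and SMULL for readability; that naming is an assumption made here for illustration, and the code models only the arithmetic, not instruction timing or flag updates.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(value):
    """Interpret the low 32 bits of value as a two's-complement number."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def mul32(a, b):
    """32 x 32 multiply keeping only the low 32 bits of the product."""
    return (a * b) & MASK32


def umull(a, b):
    """Unsigned 32 x 32 multiply, 64-bit result returned as (low, high)."""
    product = (a & MASK32) * (b & MASK32)
    return product & MASK32, product >> 32


def smull(a, b):
    """Signed 32 x 32 multiply, 64-bit result returned as (low, high)."""
    product = (to_signed32(a) * to_signed32(b)) & ((1 << 64) - 1)
    return product & MASK32, product >> 32
```

With operands 0x10000 and 0x10000 the 32-bit form loses the product entirely, while the 64-bit form keeps it in the high word.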
The divide instructions are only included in the following ARM architectures:
ARMv7-M and ARMv7E-M architectures always include divide instructions.
ARMv7-R architecture always includes divide instructions in the Thumb instruction set,
but optionally in its 32-bit instruction set.
ARMv7-A architecture optionally includes the divide instructions. The instructions might
not be implemented, or implemented only in the Thumb instruction set, or implemented
in both the Thumb and ARM instruction sets, or implemented if the Virtualization
Extensions are included.
Registers
Registers R0 through R7 are the same across all CPU modes; they are never banked.
Registers R8 through R12 are the same across all CPU modes except FIQ mode. FIQ
mode has its own distinct R8 through R12 registers.
R13 and R14 are banked across all privileged CPU modes except system mode. That is,
each mode that can be entered because of an exception has its own R13 and R14. These
registers generally contain the stack pointer and the return address from function calls,
respectively.
Aliases:
R13 is also referred to as SP, the Stack Pointer.
R14 is also referred to as LR, the Link Register.
R15 is also referred to as PC, the Program Counter.
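The banking rules above can be summarised as a lookup from (mode, register number) to a physical register. The following Python sketch assumes the seven classic modes only (Monitor and Hyp modes are left out), and the suffixes such as "_svc" are labels chosen here for illustration.

```python
# Sketch of register banking for the seven classic modes (assumption:
# Monitor and Hyp modes are omitted to keep the model small).
MODES = ("User", "FIQ", "IRQ", "Supervisor", "Abort", "Undefined", "System")
BANK_SUFFIX = {"FIQ": "fiq", "IRQ": "irq", "Supervisor": "svc",
               "Abort": "abt", "Undefined": "und"}


def physical_register(mode, number):
    """Return a label for the physical register behind Rn in the given mode."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    if not 0 <= number <= 15:
        raise ValueError("register number must be 0..15")
    if number <= 7 or number == 15:
        return f"R{number}"                      # never banked
    if number <= 12:
        return f"R{number}_fiq" if mode == "FIQ" else f"R{number}"
    if mode in BANK_SUFFIX:
        return f"R{number}_{BANK_SUFFIX[mode]}"  # R13/R14 per exception mode
    return f"R{number}"                          # User and System share


def count_physical_registers():
    """Count the distinct general-purpose registers across all modelled modes."""
    return len({physical_register(m, n) for m in MODES for n in range(16)})
```

Under these assumptions the model yields 31 distinct general-purpose registers, even though any one mode sees only 16 of them.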
The Current Program Status Register (CPSR) has the following 32 bits.
M (bits 0-4) are the processor mode bits.
T (bit 5) is the Thumb state bit.
F (bit 6) is the FIQ disable bit.
I (bit 7) is the IRQ disable bit.
A (bit 8) is the imprecise data abort disable bit.
E (bit 9) is the data endianness bit.
IT (bits 10-15 and 25-26) are the if-then state bits.
GE (bits 16-19) are the greater-than-or-equal-to bits.
DNM (bits 20-23) are the do-not-modify bits.
J (bit 24) is the Java state bit.
Q (bit 27) is the sticky overflow bit.
V (bit 28) is the overflow bit.
C (bit 29) is the carry/borrow/extend bit.
Z (bit 30) is the zero bit.
N (bit 31) is the negative/less than bit.
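The CPSR layout listed above can be decoded with a few shifts and masks. The sketch below is illustrative only: the field table mirrors the list, while the mode-number table and the way the two IT fragments are joined are assumptions taken from the usual ARMv7 encoding and should be checked against the Architecture Reference Manual.

```python
# (least significant bit, width) for each contiguous CPSR field
CPSR_FIELDS = {
    "M": (0, 5), "T": (5, 1), "F": (6, 1), "I": (7, 1), "A": (8, 1),
    "E": (9, 1), "GE": (16, 4), "J": (24, 1), "Q": (27, 1),
    "V": (28, 1), "C": (29, 1), "Z": (30, 1), "N": (31, 1),
}

# Assumed ARMv7 mode encodings (not listed in this document)
MODE_NAMES = {
    0b10000: "User", 0b10001: "FIQ", 0b10010: "IRQ",
    0b10011: "Supervisor", 0b10110: "Monitor", 0b10111: "Abort",
    0b11010: "Hyp", 0b11011: "Undefined", 0b11111: "System",
}


def decode_cpsr(value):
    """Return a dict of CPSR field values plus a readable mode name."""
    fields = {name: (value >> lsb) & ((1 << width) - 1)
              for name, (lsb, width) in CPSR_FIELDS.items()}
    # IT is split: bits 10-15 hold IT[7:2], bits 25-26 hold IT[1:0]
    fields["IT"] = (((value >> 10) & 0x3F) << 2) | ((value >> 25) & 0x3)
    fields["mode"] = MODE_NAMES.get(fields["M"], "reserved")
    return fields
```

For example, decoding the value 0x600001D3 reports Supervisor mode with IRQ, FIQ and imprecise aborts disabled and the Z and C flags set.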
CHAPTER 4
FEATURES OF ARM
Pipelining:
The ARM7 and earlier implementations have a three-stage pipeline; the stages being
fetch, decode and execute. Higher-performance designs, such as the ARM9, have deeper
pipelines: Cortex-A8 has thirteen stages. Additional implementation changes for higher
performance include a faster adder and more extensive branch prediction logic.
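The overlap of fetch, decode and execute can be visualised with a short Python sketch. It assumes an idealised pipeline with no stalls, branches or multi-cycle instructions, which real cores do not guarantee; the function name is hypothetical.

```python
STAGES = ("fetch", "decode", "execute")


def pipeline_schedule(instruction_count, stages=STAGES):
    """Return one dict per clock cycle mapping stage name to instruction index.

    Assumes an ideal pipeline: every instruction spends exactly one cycle
    in each stage and nothing ever stalls.
    """
    total_cycles = instruction_count + len(stages) - 1
    schedule = []
    for cycle in range(total_cycles):
        occupancy = {}
        for depth, stage in enumerate(stages):
            index = cycle - depth
            if 0 <= index < instruction_count:
                occupancy[stage] = index
        schedule.append(occupancy)
    return schedule


if __name__ == "__main__":
    for cycle, occupancy in enumerate(pipeline_schedule(4)):
        print(cycle, occupancy)
```

In this model, n instructions finish in n + stages - 1 cycles, so a deeper pipeline pays a longer fill time in exchange for a shorter clock period per stage.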
Coprocessors:
The ARM architecture (pre-ARMv8) provides a non-intrusive way of extending the
instruction set using "coprocessors" that can be addressed using MCR, MRC, MRRC,
MCRR and similar instructions. The coprocessor space is divided logically into
16 coprocessors with numbers from 0 to 15, coprocessor 15 (cp15) being reserved for
some typical control functions like managing the caches and MMU operation on
processors that have one.
In ARM-based machines, peripheral devices are usually attached to the processor by
mapping their physical registers into ARM memory space, into the coprocessor space, or
by connecting to another device (a bus) that in turn attaches to the processor. Coprocessor
accesses have lower latency, so some peripherals (for example, an XScale interrupt
controller) are accessible in both ways: through memory and through coprocessors.
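To make the MCR and MRC instructions more concrete, the sketch below assembles their 32-bit ARM encodings in Python. The field positions reflect the usual A32 layout as understood here and are an assumption to be verified against the Architecture Reference Manual; the function name and parameter names are hypothetical.

```python
def encode_coprocessor_transfer(to_arm, coproc, opc1, rt, crn, crm, opc2,
                                cond=0b1110):
    """Encode MRC (to_arm=True) or MCR (to_arm=False) as a 32-bit word.

    Assumed layout: cond[31:28] 1110 opc1[23:21] L[20] CRn[19:16]
    Rt[15:12] coproc[11:8] opc2[7:5] 1 CRm[3:0].
    """
    widths = {"cond": (cond, 4), "coproc": (coproc, 4), "opc1": (opc1, 3),
              "rt": (rt, 4), "crn": (crn, 4), "crm": (crm, 4),
              "opc2": (opc2, 3)}
    for name, (value, bits) in widths.items():
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{name} does not fit in {bits} bits")
    return ((cond << 28) | (0b1110 << 24) | (opc1 << 21)
            | (int(bool(to_arm)) << 20) | (crn << 16) | (rt << 12)
            | (coproc << 8) | (opc2 << 5) | (1 << 4) | crm)


if __name__ == "__main__":
    # MRC p15, 0, r0, c0, c0, 0 : read a cp15 identification register into r0
    print(hex(encode_coprocessor_transfer(True, 15, 0, 0, 0, 0, 0)))
```

The coprocessor number occupies a four-bit field, which matches the 16 logical coprocessors described above.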
Debugging:
All modern ARM processors include hardware debugging facilities, allowing software
debuggers to perform operations such as halting, stepping, and breakpointing of code
starting from reset. These facilities are built using JTAG support, though some newer
cores optionally support ARM's own two-wire "SWD" protocol. In ARM7TDMI cores,
the "D" represented JTAG debug support, and the "I" represented presence of an
"EmbeddedICE" debug module. For ARM7 and ARM9 core generations, EmbeddedICE
over JTAG was a de facto debug standard, though not architecturally guaranteed.
The ARMv7 architecture defines basic debug facilities at an architectural level. These
include breakpoints, watchpoints and instruction execution in a "Debug Mode"; similar
facilities were also available with EmbeddedICE. Both "halt mode" and "monitor mode"
debugging are supported. The actual transport mechanism used to access the debug
facilities is not architecturally specified, but implementations generally include JTAG
support.
There is a separate ARM "CoreSight" debug architecture, which is not architecturally
required by ARMv7 processors.
DSP enhancement instructions:
To improve the ARM architecture for digital signal processing and multimedia
applications, DSP instructions were added to the set. These are signified by an "E" in the
name of the ARMv5TE and ARMv5TEJ architectures. E-variants also imply T, D, M and
I.
The new instructions are common in digital signal processor (DSP) architectures. They
include variations on signed multiply-accumulate, saturated add and subtract, and count
leading zeros.
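What saturated arithmetic and count-leading-zeros actually compute can be shown with a minimal Python model. The function names are hypothetical, and the second return value of the saturating helpers stands in for the sticky overflow (Q) flag; this is a sketch of the arithmetic, not of the instructions' full behaviour.

```python
INT32_MIN = -(1 << 31)
INT32_MAX = (1 << 31) - 1


def saturating_add(a, b):
    """Signed 32-bit add that clamps instead of wrapping.

    Returns (result, saturated); the flag models the sticky Q bit.
    """
    total = a + b
    clamped = max(INT32_MIN, min(INT32_MAX, total))
    return clamped, clamped != total


def saturating_sub(a, b):
    """Signed 32-bit subtract that clamps instead of wrapping."""
    total = a - b
    clamped = max(INT32_MIN, min(INT32_MAX, total))
    return clamped, clamped != total


def count_leading_zeros(value):
    """Number of zero bits above the highest set bit of a 32-bit value."""
    value &= 0xFFFFFFFF
    return 32 - value.bit_length()
```

Clamping is what makes these operations useful for audio and video samples: an overflow produces the loudest or brightest representable value rather than wrapping to the opposite sign.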
SIMD extensions for multimedia:
Introduced in the ARMv6 architecture, this was a precursor to Advanced SIMD, also
known as NEON.
Jazelle:
Jazelle DBX (Direct Bytecode eXecution) is a technique that allows Java bytecode to be
executed directly in the ARM architecture as a third execution state (and instruction set)
alongside the existing ARM and Thumb-mode. Support for this state is signified by the
"J" in the ARMv5TEJ architecture, and in ARM9EJ-S and ARM7EJ-S core names.
Support for this state is required starting in ARMv6 (except for the ARMv7-M profile),
though newer cores only include a trivial implementation that provides no hardware
acceleration.
Thumb:
To improve compiled code density, processors since the ARM7TDMI (released in 1994)
have featured the Thumb instruction set, which has its own state. (The "T" in "TDMI"
indicates the Thumb feature.) When in this state, the processor executes the Thumb
instruction set, a compact 16-bit encoding for a subset of the ARM instruction set. Most of
the Thumb instructions are directly mapped to normal ARM instructions. The space saving
comes from making some of the instruction operands implicit and limiting the
number of possibilities compared to the ARM instructions executed in the ARM
instruction set state.
In Thumb, the 16-bit opcodes have less functionality. For example, only branches can be
conditional, and many opcodes are restricted to accessing only half of all of the CPU's
general-purpose registers. The shorter opcodes give improved code density overall, even
though some operations require extra instructions. In situations where the memory port or
bus width is constrained to less than 32 bits, the shorter Thumb opcodes allow increased
performance compared with 32-bit ARM code, as less program code may need to be
loaded into the processor over the constrained memory bandwidth.
Embedded hardware, such as the Game Boy Advance, typically has a small amount of
RAM accessible with a full 32-bit datapath; the majority is accessed via a 16-bit or
narrower secondary datapath. In this situation, it usually makes sense to compile Thumb
code and hand-optimise a few of the most CPU-intensive sections using full 32-bit ARM
instructions, placing these wider instructions into the 32-bit bus accessible memory.
The first processor with a Thumb instruction decoder was the ARM7TDMI. All ARM9
and later families, including XScale, have included a Thumb instruction decoder. The
Thumb instruction set was originally inspired by SuperH's ISA; ARM licensed several
patents from Hitachi.
Thumb-2:
Thumb-2 technology was introduced in the ARM1156 core, announced in 2003. Thumb-2
extends the limited 16-bit instruction set of Thumb with additional 32-bit instructions to
give the instruction set more breadth, thus producing a variable-length instruction set. A
stated aim for Thumb-2 was to achieve code density similar to Thumb with performance
similar to the ARM instruction set on 32-bit memory.
Thumb-2 extends the Thumb instruction set with bit-field manipulation, table branches
and conditional execution. At the same time, the ARM instruction set was extended to
maintain equivalent functionality in both instruction sets. A new "Unified Assembly
Language" (UAL) supports generation of either Thumb or ARM instructions from the
same source code; versions of Thumb seen on ARMv7 processors are essentially as
capable as ARM code (including the ability to write interrupt handlers). This requires a
bit of care, and use of a new "IT" (if-then) instruction, which permits up to four
successive instructions to execute based on a tested condition, or on its inverse. When
compiling into ARM code, this is ignored, but when compiling into Thumb it generates
an actual instruction.
All ARMv7 chips support the Thumb instruction set. All chips in the Cortex-A series,
Cortex-R series, and ARM11 series support both "ARM instruction set state" and "Thumb
instruction set state", while chips in the Cortex-M series support only the Thumb
instruction set.
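Because Thumb-2 mixes 16-bit and 32-bit instructions, a decoder has to determine each instruction's length from its first halfword. The Python sketch below assumes the usual rule that a halfword whose top five bits are 0b11101, 0b11110 or 0b11111 starts a 32-bit instruction; the rule is not stated in this document, and the function names are hypothetical.

```python
def thumb_instruction_length(first_halfword):
    """Return the instruction length in bytes (2 or 4).

    Assumption: top five bits 0b11101, 0b11110 or 0b11111 mark the first
    half of a 32-bit Thumb-2 instruction; anything else is 16-bit.
    """
    top5 = (first_halfword >> 11) & 0x1F
    return 4 if top5 in (0b11101, 0b11110, 0b11111) else 2


def split_instructions(halfwords):
    """Group a sequence of halfwords into tuples, one per instruction."""
    instructions = []
    index = 0
    while index < len(halfwords):
        size = thumb_instruction_length(halfwords[index]) // 2
        if index + size > len(halfwords):
            raise ValueError("stream ends in the middle of an instruction")
        instructions.append(tuple(halfwords[index:index + size]))
        index += size
    return instructions
```

A stream of halfwords can therefore only be decoded from a known instruction boundary, which is one practical consequence of the variable-length encoding.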
Thumb Execution Environment (ThumbEE):
ThumbEE (erroneously called Thumb-2EE in some ARM documentation), marketed
as Jazelle RCT (Runtime Compilation Target), was announced in 2005, first appearing in
the Cortex-A8 processor. ThumbEE is a fourth instruction set state, making small
changes to the Thumb-2 extended instruction set. These changes make the instruction set
particularly suited to code generated at runtime (e.g. by JIT compilation) in
managed Execution Environments. ThumbEE is a target for languages such
as Java, C#, Perl, and Python, and allows JIT compilers to output smaller compiled code
without impacting performance.
New features provided by ThumbEE include automatic null pointer checks on every load
and store instruction, an instruction to perform an array bounds check, and special
instructions that call a handler. In addition, because it utilises Thumb-2 technology,
ThumbEE provides access to registers r8-r15 (where the Jazelle/DBX Java VM state is
held). Handlers are small sections of frequently called code, commonly used to
implement high level languages, such as allocating memory for a new object. These
changes come from repurposing a handful of opcodes, and knowing the core is in the new
ThumbEE state.
On 23 November 2011, ARM Holdings deprecated any use of the ThumbEE instruction
set, and ARMv8 removes support for ThumbEE.
Floating-point (VFP):
VFP (Vector Floating Point) technology is an FPU (Floating-Point Unit) coprocessor
extension to the ARM architecture (implemented differently in ARMv8, where coprocessors
are not defined). It provides low-cost single-precision and double-precision
floating-point computation fully compliant with the ANSI/IEEE Std 754-1985 Standard for
Binary Floating-Point Arithmetic. VFP provides floating-point computation suitable for a
wide spectrum of applications such as PDAs, smartphones, voice compression and
decompression, three-dimensional graphics and digital audio, printers, set-top boxes, and
automotive applications. The VFP architecture was intended to support execution of short
"vector mode" instructions but these operated on each vector element sequentially and
thus did not offer the performance of true single instruction, multiple data (SIMD) vector
parallelism. This vector mode was therefore removed shortly after its introduction, to be
replaced with the much more powerful NEON Advanced SIMD unit.
Some devices such as the ARM Cortex-A8 have a cut-down VFPLite module instead of a
full VFP module, and require roughly ten times more clock cycles per float operation.
Pre-ARMv8 architectures implemented floating-point/SIMD with the coprocessor interface.
Other floating-point and/or SIMD units found in ARM-based processors using the
coprocessor interface include FPA, FPE, iwMMXt, some of which were implemented in
software by trapping but could have been implemented in hardware. They provide some
of the same functionality as VFP but are not opcode-compatible with it.
VFPv1
Obsolete
VFPv2
An optional extension to the ARM instruction set in the ARMv5TE, ARMv5TEJ and
ARMv6 architectures. VFPv2 has 16 64-bit FPU registers.
VFPv3 or VFPv3-D32
Implemented on most Cortex-A8 and A9 ARMv7 processors. It is backwards compatible
with VFPv2, except that it cannot trap floating-point exceptions. VFPv3 has 32 64-bit
FPU registers as standard, adds VCVT instructions to convert between scalar, float and
double, adds immediate mode to VMOV such that constants can be loaded into FPU
registers.
VFPv3-D16
As above, but with only 16 64-bit FPU registers. Implemented on Cortex-R4 and R5
processors and the Tegra 2 (Cortex-A9).
VFPv3-F16
Uncommon; it supports IEEE754-2008 half-precision (16-bit) floating point as a storage
format.
VFPv4 or VFPv4-D32
Implemented on the Cortex-A12 and A15 ARMv7 processors; the Cortex-A7 optionally has
VFPv4-D32 in the case of an FPU with NEON. VFPv4 has 32 64-bit FPU registers as
standard, adds both half-precision support as a storage format and fused multiply-
accumulate instructions to the features of VFPv3.
VFPv4-D16
As above, but it has only 16 64-bit FPU registers. Implemented on Cortex-A5 and A7
processors (in case of an FPU without NEON).
VFPv5-D16-M
Implemented on Cortex-M7 when single and double-precision floating-point core option
exists.
In Debian Linux, and derivatives such as Ubuntu, armhf (ARM hard float) refers to the
ARMv7 architecture including the additional VFPv3-D16 floating-point hardware
extension (and Thumb-2) above. Software packages and cross-compiler tools use the
armhf vs. arm/armel suffixes to differentiate.
Advanced SIMD (NEON):
The Advanced SIMD extension (aka NEON or "MPE" Media Processing Engine) is a
combined 64- and 128-bit SIMD instruction set that provides standardized acceleration
for media and signal processing applications. NEON is included in all Cortex-A8 devices
but is optional in Cortex-A9 devices. NEON can execute MP3 audio decoding on CPUs
running at 10 MHz and can run the GSM adaptive multi-rate (AMR) speech codec at no
more than 13 MHz. It features a comprehensive instruction set, separate register files and
independent execution hardware. NEON supports 8-, 16-, 32- and 64-bit integer and
single-precision (32-bit) floating-point data and SIMD operations for handling audio and
video processing as well as graphics and gaming processing. In NEON, the SIMD
supports up to 16 operations at the same time. The NEON hardware shares the same
floating-point registers as used in VFP. Devices such as the ARM Cortex-A8 and Cortex-
A9 support 128-bit vectors but will execute with 64 bits at a time, whereas newer Cortex-
A15 devices can execute 128 bits at a time.
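The figure of up to 16 simultaneous operations follows from dividing a 128-bit register into 8-bit lanes. The Python sketch below models lane-wise addition with wrap-around; it is an illustration of the SIMD idea with hypothetical function names, not NEON intrinsics or assembly.

```python
def lane_count(lane_bits, register_bits=128):
    """Number of lanes of the given width that fit in one SIMD register."""
    if lane_bits not in (8, 16, 32, 64):
        raise ValueError("lane width must be 8, 16, 32 or 64 bits")
    return register_bits // lane_bits


def vector_add(a, b, lane_bits, register_bits=128):
    """Add two vectors lane by lane, wrapping each lane independently."""
    lanes = lane_count(lane_bits, register_bits)
    if len(a) != lanes or len(b) != lanes:
        raise ValueError(f"expected {lanes} lanes per operand")
    mask = (1 << lane_bits) - 1
    return [(x + y) & mask for x, y in zip(a, b)]
```

One call to vector_add with 8-bit lanes stands for a single instruction performing 16 additions, whereas the same register holds only four 32-bit values.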
Project Ne10 is ARM's first open source project (from its inception). The Ne10 library is a
set of common, useful functions written in both NEON and C (for compatibility). The
library was created to allow developers to use NEON optimisations without learning
NEON but it also serves as a set of highly optimised NEON intrinsic and assembly code
examples for common DSP, arithmetic and image processing routines. The code is
available on GitHub.
Most instructions execute in a single pass: an instruction does not take extra time to
execute, and its execution is not repeated continuously.
Hardware virtualization support
ARM CORE LICENCES:
ARM offers several microprocessor core designs that have been "publicly licensed" 830
times including 249 times for their newer "application processors" (non-microcontroller)
used in such applications as smartphones and tablets. Three of those companies are
known to have a licence for one of ARM's most powerful processor cores, the 64-bit
Cortex-A72 (some including ARM's other 64-bit core, the Cortex-A53), and four have a
licence to their most powerful 32-bit core, the Cortex-A15.
Cores for 32-bit architectures include Cortex-A32, Cortex-A15, Cortex-A12, Cortex-A17,
Cortex-A9, Cortex-A8, Cortex-A7 and Cortex-A5, and older "Classic ARM
Processors", as well as variant architectures for microcontrollers that include these
cores: ARM Cortex-R7, ARM Cortex-R5, ARM Cortex-R4, ARM Cortex-M4, ARM
Cortex-M3, ARM Cortex-M1, ARM Cortex-M0+ and ARM Cortex-M0. For licensing, the
three most popular models are the "Perpetual (Implementation) License", "Term
License" and "Per Use License".
Companies often license these designs from ARM to manufacture and integrate into their
own System on chip (SoC) with other components such as GPUs (sometimes ARM's
Mali) or radio basebands (for mobile phones).
In addition to licenses for their core designs, ARM offers an "architectural license" for
their instruction sets, allowing the licensees to design their own cores that implement one
of those instruction sets. An ARM architectural license is more costly than a regular ARM
core license and also requires the necessary engineering power to design a CPU based on
the instruction set.
Processors believed to be designed independently from ARM
include Apple's (architecture license from March 2008) A6, A6X, A7 and all subsequent
Apple processors (used in iPhone 5, iPad and iPhone
5S), Qualcomm's Snapdragon series (used in smartphones such as the US version of
the Samsung Galaxy S4) and Samsung's Exynos ("Mongoose" M1 cores). There were
around 15 architectural licensees in 2013 including Marvell, Apple,
Qualcomm, Broadcom and some others.
Companies that are current licensees of the 64-bit ARMv8-A core designs
include AMD, Applied Micro (X-Gene), Broadcom, Calxeda, HiSilicon, Rockchip,
Samsung, and STMicroelectronics.
Companies that are current or former licensees of 32-bit ARM core designs include
AMD, Broadcom, Freescale (now NXP Semiconductors), Huawei (HiSilicon
division), IBM, Infineon Technologies (Infineon XMC 32-bit MCU families), Intel (older
"ARM11 MPCore"), LG, Microsemi, NXP Semiconductors, Renesas,
Rockchip, Samsung, STMicroelectronics and Texas Instruments.
In 2013, ARM stated that there are around 15 architectural licensees, but the full list is
not yet public knowledge.
Companies with a 64-bit ARMv8-A architectural license include Applied
Micro, Broadcom, Cavium, Huawei, Nvidia, AMD, Samsung, and Apple.
Companies with a 32-bit ARM architectural license include Broadcom
(ARMv7), Faraday Technology (ARMv4, ARMv5), Marvell Technology
Group, Microsoft, Qualcomm, Intel, and Apple.
CHAPTER 5
AMBA BUS
The ARM Advanced Microcontroller Bus Architecture (AMBA) is an open-standard, on-chip
interconnect specification for the connection and management of functional blocks
in system-on-a-chip (SoC) designs. It facilitates development of multi-processor designs
with large numbers of controllers and peripherals. Since its inception, the scope of
AMBA has, despite its name, gone far beyond microcontroller devices. Today, AMBA is
widely used on a range of ASIC and SoC parts including applications processors used in
modern portable mobile devices like smartphones. AMBA is a registered trademark
of ARM Ltd.
AMBA was introduced by ARM in 1996. The first AMBA buses were Advanced System
Bus (ASB) and Advanced Peripheral Bus (APB). In its second version, AMBA 2 in 1999,
ARM added AMBA High-performance Bus (AHB) that is a single clock-edge protocol.
In 2003, ARM introduced the third generation, AMBA 3, including AXI to reach even
higher performance interconnect and the Advanced Trace Bus (ATB) as part of the
CoreSight on-chip debug and trace solution. In 2010 the AMBA 4 specifications were
introduced starting with AMBA 4 AXI4, then in 2011 extending system wide coherency
with AMBA 4 ACE. In 2013 the AMBA 5 CHI (Coherent Hub Interface) specification
was introduced, with a re-designed high-speed transport layer and features designed to
reduce congestion.
These protocols are today the de facto standard for 32-bit embedded processors because
they are well documented and can be used without royalties.
DESIGN PRINCIPLES:
An important aspect of a SoC is not only which components or blocks it houses, but also
how they interconnect. AMBA is a solution for the blocks to interface with each other.
The objective of the AMBA specification is to:
facilitate right-first-time development of embedded microcontroller products with one or
more CPUs, GPUs or signal processors,
be technology independent, to allow reuse of IP cores, peripheral and system macrocells
across diverse IC processes,
encourage modular system design to improve processor independence, and the
development of reusable peripheral and system IP libraries,
minimize silicon infrastructure while supporting high performance and low power on-chip
communication.
SPECIFICATION OF AMBA BUS:
The AMBA specification defines an on-chip communications standard for designing
high-performance embedded microcontrollers. It is supported by ARM Limited with wide
cross-industry participation.
The AMBA 5 specification defines the following buses/interfaces:
Advanced High-performance Bus (AHB5, AHB-Lite)
Coherent Hub Interface (CHI)
The AMBA 4 specification defines the following buses/interfaces:
AXI Coherency Extensions (ACE) - widely used on the latest ARM Cortex-A processors
including Cortex-A7 and Cortex-A15
AXI Coherency Extensions Lite (ACE-Lite)
Advanced Extensible Interface 4 (AXI4)
Advanced Extensible Interface 4 Lite (AXI4-Lite)
Advanced Extensible Interface 4 Stream (AXI4-Stream v1.0)
Advanced Trace Bus (ATB v1.1)
Advanced Peripheral Bus (APB4 v2.0)
AMBA 3 specification defines four buses/interfaces:
Advanced Extensible Interface (AXI3 or AXI v1.0) - widely used on ARM Cortex-A
processors including Cortex-A9
Advanced High-performance Bus Lite (AHB-Lite v1.0)
Advanced Peripheral Bus (APB3 v1.0)
Advanced Trace Bus (ATB v1.0)
AMBA 2 specification defines three buses/interfaces:
Advanced High-performance Bus (AHB) - widely used on ARM7, ARM9 and ARM
Cortex-M based designs
Advanced System Bus (ASB)
Advanced Peripheral Bus (APB2 or APB)
AMBA specification (First version) defines two buses/interfaces:
Advanced System Bus (ASB)
Advanced Peripheral Bus (APB)
The timing aspects and the voltage levels on the bus are not dictated by the specifications.
TYPES OF BUSES:
AXI Coherency Extensions (ACE and ACE-Lite)
ACE, defined as part of the AMBA 4 specification, extends AXI with additional
signalling introducing system-wide coherency. This system coherency allows multiple
processors to share memory and enables technology like ARM's big.LITTLE processing. The
ACE-Lite protocol enables one-way (also known as I/O) coherency, for example a network
interface that can read from the caches of a fully coherent ACE processor.
Advanced eXtensible Interface (AXI)
AXI, the third generation of AMBA interface defined in the AMBA 3
specification, is targeted at high performance, high clock frequency
system designs and includes features that make it suitable for high
speed sub-micrometer interconnect:
separate address/control and data phases
support for unaligned data transfers using byte strobes
burst based transactions with only start address issued
issuing of multiple outstanding addresses with out of order responses
easy addition of register stages to provide timing closure.
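Two of the features listed above, bursts that carry only a start address and unaligned transfers using byte strobes, can be illustrated together. The Python sketch below derives the address and write-strobe mask of each beat of an incrementing burst. It assumes that the transfer size equals the data bus width and follows the incrementing-burst rules as understood here, so it should be read as an approximation of the protocol rather than a restatement of the specification; the function name is hypothetical.

```python
def incr_burst_beats(start_address, beats, bytes_per_beat=4):
    """Return a list of (address, strobe_mask) for an incrementing burst.

    Assumptions: the transfer size equals the data bus width, and only the
    first beat may be unaligned. Bit i of strobe_mask enables byte lane i.
    """
    if bytes_per_beat <= 0 or bytes_per_beat & (bytes_per_beat - 1):
        raise ValueError("bytes_per_beat must be a power of two")
    if beats < 1:
        raise ValueError("a burst has at least one beat")
    aligned = (start_address // bytes_per_beat) * bytes_per_beat
    full_mask = (1 << bytes_per_beat) - 1
    result = []
    for beat in range(beats):
        if beat == 0:
            offset = start_address - aligned
            # Lanes below the start offset carry no valid data
            result.append((start_address, (full_mask << offset) & full_mask))
        else:
            result.append((aligned + beat * bytes_per_beat, full_mask))
    return result
```

Only the first address would appear on the address channel; the remaining addresses are derived by the receiving side in the same way this function derives them.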
Advanced High-performance Bus (AHB)
AHB is a bus protocol introduced in Advanced Microcontroller Bus Architecture version
2, published by ARM Ltd.
In addition to the features of the previous release, it supports:
large bus widths (64/128 bit).
A simple transaction on the AHB consists of an address phase and a subsequent data
phase (without wait states: only two bus-cycles). Access to the target device is controlled
through a MUX (non-tristate), thereby admitting bus-access to one bus-master at a time.
AHB-Lite is a subset of AHB formally defined in the AMBA 3 standard. This subset
simplifies the design for a bus with a single master.
Advanced Peripheral Bus (APB)
APB is designed for low bandwidth control accesses, for example register interfaces on
system peripherals. This bus has an address and data phase similar to AHB, but a much
reduced, low complexity signal list (for example no bursts).
Applications Of ARM Processor
Used as the processor in mobile phones and telecommunication:
The ARM-powered smartphone is the world's primary compute device, one that over 3 billion
consumers reach for every day.
Ecosystem, unparalleled innovation: the de facto standard architecture for mobile, which
mobile operating systems and applications are built and optimized for.
Path to 5G, next steps in connectivity: the dawn of 5G will bring a continuation of the
always-on, always-connected world and new ways of interacting with it.
Portal, delivering new experiences: be it virtual reality, augmented reality, machine
learning or bots, consumers will experience new technology through their smartphone.
Used in automotive industry:
Scalability, from sensors to cockpit: ARM processors can power the smallest, lowest-cost
ultrasonic parking sensors through to server-class compute hardware that enables
automated driving.
Low energy, minimum heat, maximum range: every watt of electrical power used in a car
contributes to CO2 emissions and reduced fuel economy. In a fully electric car,
lower-power electronics directly increase the driving range.
Trust, the base for robust security: cyber-attacks on cars are a new but disconcerting trend,
compounded by the fact that there are multiple points of possible attack. Robust
electronics security is vital.
Safety, reliable electronics: all of ARM's automotive-applicable IP is available with a
comprehensive functional safety package that accelerates the safety aspects of a complete
chip design.
ADAS (advanced driver-assistance systems):
Modern cars contain many sensors, located at many positions on the outside
surface of the car. These sensors can be thought of as the eyes of the car, giving it a
continuous picture of the outside world. The graphic below illustrates where these sensors
are, and which areas they are supervising.
A view of how ARM is used in ADAS
ADAS systems, which help drivers in various driving conditions, currently provide
assistance functions only, leaving final decision making to the driver.
The driver is still the topmost-ranking decision maker and must be able to
override the electronic assistance in all conditions to prevent failures. The driver is
legally responsible for the driving (e.g. you can still use the brakes to come to a full stop
even if cruise control wants to accelerate). You might also still want to be able to accelerate if
the on-board radar confuses a traffic sign with a pedestrian.
Used in Servers and Networking:
More and more functionality will be integrated onto single SoC devices, and these will
typically be processing multiple traffic types including payload or backhaul traffic,
control traffic and, in some cases, scheduling of users.
Used in wearable devices:
Power efficiency, enabling wearables: with more than 95% of wearables running on
ARM, ARM is the architecture of choice for innovators bringing the newest devices to
market.
Innovation, beyond the wrist: embedding ARM connected intelligence in the fabric of our
lives will enable new wearable form factors.
Always aware, continuous contextual awareness: continuously processing multiple sensor
feeds enables contextual awareness in wearables.
Trust, building trusted wearables: the data that wearables collect is highly personal, and
ARM's TrustZone is the basis for securing your wearable data.
Used in set-top and satellite receiver
Used in home gateway
Used in robotics for developing industrial robots
CHAPTER 6
ZYNQ-7000 AP SOC
ABOUT ZYNQ-7000:
The Zynq-7000 family offers the flexibility and scalability of an FPGA, while providing
performance, power, and ease of use typically associated with ASICs and ASSPs. The
range of devices in the Zynq-7000 family allows designers to target cost-sensitive as well
as high-performance applications from a single platform using industry-standard tools.
While each device in the Zynq-7000 family contains the same PS, the PL and I/O
resources vary between the devices. As a result, the Zynq-7000 and Zynq-7000S SoCs
are able to serve a wide range of applications including:
Automotive driver assistance, driver information, and infotainment
Broadcast camera
Industrial motor control, industrial networking, and machine vision
IP and smart camera
LTE radio and baseband
Medical diagnostics and imaging
Multifunction printers
Video and night vision equipment
The Zynq-7000 architecture enables implementation of custom logic in
the PL and custom software in the PS. It allows for the realization of unique and
differentiated system functions. The integration of the PS with the PL allows levels of
performance that two-chip solutions (e.g., an ASSP with an FPGA) cannot match due to
their limited I/O bandwidth, latency, and power budgets. Xilinx offers a large number of
soft IP cores for the Zynq-7000 family. Stand-alone and Linux device drivers are available for
the peripherals in the PS and the PL. The Vivado Design Suite development
environment enables rapid product development for software, hardware, and systems
engineers. Adoption of the ARM-based PS also brings a broad range of third-party tools
and IP providers in combination with Xilinx's existing PL ecosystem. The inclusion of an
application processor enables high-level operating system support, e.g., Linux. Other
standard operating systems used with the Cortex-A9 processor are also available for the
Zynq-7000 family. The PS and the PL are on separate power domains, enabling the user
of these devices to power down the PL for power management if required. The processors
in the PS always boot first, allowing a software-centric approach for PL configuration. PL
configuration is managed by software running on the CPU, so it boots similarly to an ASSP.
Zynq-7000 AP SoC block diagram
PS (PROCESSING SYSTEM)
The PS comprises the following blocks:
- Application processor unit (APU)
- Memory interfaces
- I/O peripherals (IOP)
- Interconnect
Application Processor Unit (APU)
The key features of the APU include:
- Dual-core or single-core ARM Cortex-A9 MPCores. Features associated with each core
  include:
  - 2.5 DMIPS/MHz
  - Operating frequency range:
    - Z-7007S/Z-7012S/Z-7014S (wire bond): Up to 667 MHz (-1); 766 MHz (-2)
    - Z-7010/Z-7015/Z-7020 (wire bond): Up to 667 MHz (-1); 766 MHz (-2); 866 MHz (-3)
    - Z-7030/Z-7035/Z-7045 (flip-chip): 667 MHz (-1); 800 MHz (-2); 1 GHz (-3)
    - Z-7100 (flip-chip): 667 MHz (-1); 800 MHz (-2)
  - Ability to operate in single processor, symmetric dual processor, and asymmetric
    dual processor modes
  - Single and double precision floating point: up to 2.0 MFLOPS/MHz each
  - NEON media processing engine for SIMD support
  - Thumb-2 support for code compression
  - Level 1 caches (separate instruction and data, 32 KB each)
    - 4-way set-associative
    - Non-blocking data cache with support for up to four outstanding read and write
      misses each
  - Integrated memory management unit (MMU)
  - TrustZone for secure mode operation
- Accelerator coherency port (ACP) interface enabling coherent accesses from PL to CPU
  memory space
- Unified Level 2 cache (512 KB)
  - 8-way set-associative
  - TrustZone enabled for secure operation
- Dual-ported, on-chip RAM (256 KB)
  - Accessible by CPU and programmable logic (PL)
  - Designed for low latency access from the CPU
- 8-channel DMA
  - Supports multiple transfer types: memory-to-memory, memory-to-peripheral,
    peripheral-to-memory, and scatter-gather
  - 64-bit AXI interface, enabling high throughput DMA transfers
  - 4 channels dedicated to PL
  - TrustZone enabled for secure operation
- Dual register access interfaces enforce separation between secure and non-secure
  accesses
- Interrupts and timers
  - General interrupt controller (GIC)
  - Three watchdog timers (WDT) (one per CPU and one system WDT)
  - Two triple timers/counters (TTC)
- CoreSight debug and trace support for Cortex-A9
  - Program trace macrocell (PTM) for instruction and trace
  - Cross trigger interface (CTI) enabling hardware breakpoints and triggers
Memory Interfaces
The memory interface unit includes a dynamic memory
controller and static memory interface modules. The dynamic memory controller supports
DDR3, DDR3L, DDR2, and LPDDR2 memories. The static memory controllers support
a NAND flash interface, a Quad-SPI flash interface, a parallel data bus, and a parallel
NOR flash interface.
Dynamic Memory Interfaces
The multi-protocol DDR memory controller can be configured to provide 16-bit or
32-bit-wide accesses to a 1 GB address space using a single rank configuration of 8-bit,
16-bit or 32-bit DRAM memories. ECC is supported in 16-bit bus access mode. The PS
incorporates both the DDR controller and the associated PHY, including its own set of
dedicated I/Os. Speed of up to 1333 Mb/s for DDR3 is supported. The DDR memory
controller is multi-ported and enables the processing system and the programmable logic
to have shared access to a common memory. The DDR controller features four AXI slave
ports for this purpose:
- One 64-bit port is dedicated for the ARM CPU(s) via the L2 cache controller and can be
  configured for low latency.
- Two 64-bit ports are dedicated for PL access.
- One 64-bit AXI port is shared by all other AXI masters via the central interconnect.
Static Memory Interfaces
The static memory interfaces support external static memories:
- 8-bit SRAM data bus supporting up to 64 MB
- 8-bit parallel NOR flash supporting up to 64 MB
- ONFi 1.0 NAND flash support with 1-bit ECC
- 1-bit SPI, 2-bit SPI, 4-bit SPI (quad-SPI), or two quad-SPI (8-bit) serial NOR flash
I/O Peripherals (IOP)
The IOP unit contains the data
communication peripherals. Key features of the IOP include:
- Two 10/100/1000 tri-mode Ethernet MAC peripherals with IEEE Std 802.3 and IEEE Std
  1588 revision 2.0 support
  - Scatter-gather DMA capability
  - Recognition of 1588 rev. 2 PTP frames
  - Supports an external PHY interface
- Two USB 2.0 OTG peripherals, each supporting up to 12 endpoints
  - Supports high-speed and full-speed modes in Host, Device, and On-The-Go
    configuration
  - Fully USB 2.0 compliant Host and Device IP core
  - Uses 32-bit AHB DMA master and AHB slave interfaces
  - Provides an 8-bit ULPI external PHY interface
  - Intel EHCI compliant USB host controller registers and data structures
- Two full CAN 2.0B compliant CAN bus interface controllers
  - CAN 2.0-B standard as defined by BOSCH GmbH
  - ISO 11898-1
  - An external PHY interface
- Two SD/SDIO 2.0 compliant SD/SDIO controllers with built-in DMA
- Two full-duplex SPI ports with three peripheral chip selects
- Two UARTs
- Two master and slave I2C interfaces
- Up to 118 GPIO bits
Using the TrustZone system, the two Ethernet, two SDIO, and two USB ports (all master
devices) can be configured to be secure or non-secure. The IOP peripherals communicate
with external devices through a shared pool of up to 54 dedicated multiuse I/O (MIO)
pins. Each peripheral can be assigned one of several pre-defined groups of pins, enabling
a flexible assignment of multiple devices simultaneously. Although 54 pins are not enough
for simultaneous use of all the I/O peripherals, most IOP interface signals are available
to the PL, allowing use of standard PL I/O pins when powered up and properly configured.
All MIO pins support 1.8V HSTL and LVCMOS standards as well as 2.5V/3.3V standards.
Interconnect
The APU, memory interface unit, and the IOP are all connected to each other and to the
PL through a multilayered ARM AMBA AXI interconnect. The interconnect is non-blocking
and supports multiple simultaneous master-slave transactions. The interconnect is
designed with latency sensitive masters, such as the ARM CPU, having the shortest paths
to memory, and bandwidth critical masters, such as the potential PL masters, having high
throughput connections to the slaves with which they need to communicate. Traffic through
the interconnect can be regulated through the Quality of Service (QoS) block in the
interconnect. The QoS feature is used to regulate traffic generated by the CPU, DMA
controller, and a combined entity representing the masters in the IOP.
PS EXTERNAL INTERFACING:
The PS external interfaces use dedicated pins that cannot be assigned as PL pins. These
include:
- Clock, reset, boot mode, and voltage reference
- Up to 54 dedicated multiuse I/O (MIO) pins, software-configurable to connect to any of
  the internal I/O peripherals and static memory controllers
- 32-bit or 16-bit DDR2/DDR3/DDR3L/LPDDR2 memories
MIO Overview
The function of the MIO is to multiplex access from the PS peripheral and static memory
interfaces to the PS pins as defined in the configuration registers. There are up to 54
pins available for use by the IOP and static memory interfaces in the PS. Table 4 shows
where the different peripheral pins can be mapped. A block diagram of the MIO module is
shown in Figure 2. If additional I/O pins beyond the 54 are required, it is possible to
route these through the PL to the I/O associated with the PL. This feature is referred to
as extendable multiplexed I/O (EMIO). Port mappings can appear in multiple locations. For
example, there are up to 12 possible port mappings for CAN pins. The PS Configuration
Wizard (PCW) tool should be used for peripheral and static memory pin mapping.
Block diagram of MIO
PS-PL INTERFACE
The PS-PL interface includes:
- AMBA AXI interfaces for primary data communication
  - Two 32-bit AXI master interfaces
  - Two 32-bit AXI slave interfaces
  - Four 64-bit/32-bit configurable, buffered AXI slave interfaces with direct access to
    DDR memory and OCM, referred to as high-performance AXI ports
  - One 64-bit AXI slave interface (ACP port) for coherent access to CPU memory
- DMA, interrupts, events signals
  - Processor event bus for signaling event information to the CPU
  - PL peripheral IP interrupts to the PS GIC
  - Four DMA channel signals for the PL
  - Asynchronous triggering signals
- Extendable multiplexed I/O (EMIO) allows unmapped PS peripherals to access PL I/O
- Clocks and resets
  - Four PS clock outputs to the PL with start/stop control
  - Four PS reset outputs to the PL
- Configuration and miscellaneous
  - Processor configuration access port (PCAP) to support full and partial PL
    configuration, and secured PS boot image decryption and authentication
  - eFUSE and battery-backed RAM signals from the PL to the PS
  - XADC interface
  - JTAG interface
The two highest performance interfaces between the PS and the PL for data transfer are
the high-performance AXI ports and ACP interfaces. The high-performance AXI ports are
used for high throughput data transfer between the PS and the PL. Coherency, if required,
is managed under software control. When hardware coherent access to the CPU memory is
required, the ACP port is to be used.
High-Performance AXI Ports
The high-performance AXI ports provide access from the PL to DDR and OCM in the PS. The
four dedicated AXI memory ports from the PL to the PS are configurable as either 32-bit
or 64-bit interfaces. As shown in Figure 3, these interfaces connect the PL to the memory
interconnect via a FIFO controller. Two of the three output ports go to the DDR memory
controller and the third goes to the dual-ported on-chip memory (OCM).
BLOCK DIAGRAM OF PS-PL MEMORY SUBSYSTEM
Each high-performance AXI port has these characteristics:
- Reduced latency between PL and processing system memory
- 1 KB deep FIFO
- Configurable either as 32- or 64-bit AXI interfaces
- Supports up to a 32 word buffer for read acceptance
- Supports data release control for write accesses to use AXI interconnect bandwidth
  more efficiently
- Supports multiple AXI commands issuing to DDR and OCM
Accelerator Coherency Port (ACP)
The accelerator coherency port (ACP) is a 64-bit AXI slave interface that provides
connectivity between the APU and a potential accelerator function in the PL. The ACP
directly connects the PL to the snoop control unit (SCU) of the ARM Cortex-A9
processors, enabling cache-coherent access to CPU data in the L1 and L2 caches. The
ACP provides a low latency path between the PS and a PL-based accelerator when
compared with a legacy cache flushing and loading scheme.
PL (PROGRAMMABLE LOGIC)
Key PL features include:
- CLB
  - Eight LUTs per CLB for random logic implementation or distributed memory
  - Memory LUTs are configurable as 64x1 or 32x2 bit RAM or shift register (SRL)
  - 16 flip-flops per CLB
  - 2 x 4-bit cascadeable adders for arithmetic functions
- 36 Kb block RAM
  - True dual-port
  - Up to 36 bits wide
  - Configurable as dual 18 Kb block RAMs
- DSP slices
  - 18 x 25 signed multiply
  - 48-bit adder/accumulator
- Programmable I/O blocks
  - Support for common I/O standards including LVCMOS, LVDS, and SSTL
  - 1.2V to 3.3V I/O
  - Built-in programmable I/O delay
- Low-power serial transceivers in selected devices
- An integrated Endpoint/Root Port (can be Root Complex when connected to the PS) block
  for PCI Express in selected devices
- Two 12-bit analog to digital converters (XADC)
  - On-chip voltage and temperature
  - Up to 17 external differential input channels
- PL configuration module
CLBs, Slices, and LUTs
Some key features of the CLB architecture include:
- True 6-input LUTs
- Memory capability within the LUT
- Register and shift register functionality
LUTs can be configured as either one 6-input LUT (64-bit ROMs) with one output, or as two
5-input LUTs (32-bit ROMs) with separate outputs but common addresses or logic inputs.
Each LUT output can optionally be registered in a flip-flop. Four such LUTs and their
eight flip-flops as well as multiplexers and arithmetic carry logic form a slice, and two
slices form a configurable logic block (CLB). Four of the eight flip-flops per slice (one
flip-flop per LUT) can optionally be configured as latches. Between 25% and 50% of all
slices can also use their LUTs as distributed 64-bit RAM or as 32-bit shift registers
(SRL32) or as two SRL16s. Modern synthesis tools take advantage of these highly efficient
logic, arithmetic, and memory features.
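The description of a LUT as a small ROM can be illustrated with a short behavioral sketch. It is only a software model, not vendor code: the names lut6 and lut5_pair are hypothetical, and the choice of which 32-bit half of the table feeds which output in the two-LUT mode is an assumption made for illustration.

```python
def lut6(init, inputs):
    """Model one 6-input LUT as a 64-bit ROM.

    init   : 64-bit integer holding the truth table (bit n = output for input value n)
    inputs : 6-bit integer formed from the six logic inputs
    """
    return (init >> (inputs & 0x3F)) & 1


def lut5_pair(init, inputs):
    """Model the same 64 bits as two 5-input LUTs that share their five inputs.

    Assumption for this sketch: the lower 32 bits drive the first output and the
    upper 32 bits drive the second output.
    """
    addr = inputs & 0x1F
    out_a = (init >> addr) & 1
    out_b = (init >> (32 + addr)) & 1
    return out_a, out_b


def truth_table(func, width):
    """Build the ROM contents for any Boolean function of 'width' inputs."""
    init = 0
    for value in range(1 << width):
        if func(value):
            init |= 1 << value
    return init


# Example: a 6-input odd-parity function stored in a single LUT
PARITY_INIT = truth_table(lambda v: bin(v).count("1") % 2, 6)
```

Calling lut6(PARITY_INIT, x) then returns the parity of x for every 6-bit x, which shows why any function of six inputs fits in one LUT: the logic is simply looked up rather than computed.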
CLOCK MANAGEMENT
Some of the key highlights of the clock management architecture include:
- High-speed buffers and routing for low-skew clock distribution
- Frequency synthesis and phase shifting
- Low-jitter clock generation and jitter filtering
Mixed-Mode Clock Manager and Phase-Locked Loop
The MMCM and PLL share many characteristics. Both can serve as a frequency
synthesizer for a wide range of frequencies and as a jitter filter for incoming clocks. At
the center of both components is a voltage-controlled oscillator (VCO), which speeds up
and slows down depending on the input voltage it receives from the phase frequency
detector (PFD). There are three sets of programmable frequency dividers: D, M, and O.
The pre-divider D (programmable by configuration and afterwards via DRP) reduces the
input frequency and feeds one input of the traditional PLL phase/frequency comparator.
The feedback divider M (programmable by configuration and afterwards via DRP) acts as
a multiplier because it divides the VCO output frequency before feeding the other input
of the phase comparator. D and M must be chosen appropriately to keep the VCO within
its specified frequency range. The VCO has eight equally-spaced output phases (0°, 45°,
90°, 135°, 180°, 225°, 270°, and 315°). Each can be selected to drive one of the output
dividers (six for the PLL, O0 to O5, and seven for the MMCM, O0 to O6), each
programmable by configuration to divide by any integer from 1 to 128. The MMCM and
PLL have three input-jitter filter options: low-bandwidth mode, which has the best jitter
attenuation; high-bandwidth mode, which has the best phase offset; and optimized mode,
which allows the tools to find the best setting.
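The role of the D, M, and O dividers can be summarized numerically. The sketch below follows directly from the description above (D divides the input, M multiplies it at the VCO, O divides the VCO output). The function name and the default VCO limits are placeholders chosen for illustration; the real limits depend on the device and speed grade and should be taken from the data sheet.

```python
def mmcm_frequencies(f_in_mhz, d, m, o, vco_range_mhz=(600.0, 1600.0)):
    """Return (f_vco, f_out) in MHz for the divider settings D, M, and O.

    vco_range_mhz is a placeholder assumption, not a specification value.
    """
    if not 1 <= o <= 128:
        raise ValueError("output divider O must be an integer from 1 to 128")
    f_pfd = f_in_mhz / d          # pre-divider D reduces the input frequency
    f_vco = f_pfd * m             # feedback divider M acts as a multiplier
    low, high = vco_range_mhz
    if not low <= f_vco <= high:
        raise ValueError("D and M put the VCO outside its allowed range")
    f_out = f_vco / o             # each output has its own divider O
    return f_vco, f_out


# Example: 100 MHz input, D = 1, M = 10, O = 8
f_vco, f_out = mmcm_frequencies(100.0, 1, 10, 8)
```

With these example settings the VCO runs at 1000 MHz and the output at 125 MHz. The range check mirrors the requirement that D and M be chosen to keep the VCO within its specified frequency range.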
MMCM Additional Programmable Features
The MMCM can have a fractional counter in either the feedback path (acting as a
multiplier) or in one output path. Fractional counters allow non-integer increments of 1/8
and can thus increase frequency synthesis capabilities by a factor of 8. The MMCM can
also provide fixed or dynamic phase shift in small increments that depend on the VCO
frequency. At 1,600 MHz, the phase-shift timing increment is 11.2 ps.
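The 11.2 ps figure can be reproduced if one phase-shift step is taken to be 1/56 of the VCO period. That factor is not stated in the text above; it is an assumption taken from the 7 series clocking documentation, so the sketch below should be read as a plausibility check rather than a specification.

```python
def phase_step_ps(f_vco_mhz, steps_per_vco_period=56):
    """Phase-shift increment in picoseconds.

    steps_per_vco_period = 56 is an assumption (see the lead-in), not a value
    given in this chapter.
    """
    vco_period_ps = 1e6 / f_vco_mhz      # 1 MHz corresponds to a period of 1e6 ps
    return vco_period_ps / steps_per_vco_period


def fractional_settings(integer_part):
    """List the eight settings a fractional counter adds between N and N + 1."""
    return [integer_part + k / 8 for k in range(8)]


step = phase_step_ps(1600.0)
```

At a VCO frequency of 1,600 MHz this gives a step of about 11.16 ps, which rounds to the 11.2 ps quoted above. fractional_settings(10) lists 10.0, 10.125, ... 10.875, the eight values available where an integer counter offers only one.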
Clock Distribution
Each device in the Zynq-7000 family provides six different types of clock lines (BUFG,
BUFR, BUFIO, BUFH, BUFMR, and the high-performance clock) to address the
different clocking requirements of high fanout, short propagation delay, and extremely
low skew.
Global Clock Lines
In each device, 32 global clock lines have the highest fanout and can reach every flip-flop
clock, clock enable, and set/reset as well as many logic inputs. There are 12 global clock
lines within any clock region driven by the horizontal clock buffers (BUFH). Each BUFH
can be independently enabled/disabled, allowing for clocks to be turned off within a
region, thereby offering fine-grain control over which clock regions consume power.
Global clock lines can be driven by global clock buffers, which can also perform
glitchless clock multiplexing and clock enable functions. Global clocks are often driven
from the CMT, which can completely eliminate the basic clock distribution delay.
Regional Clocks
Regional clocks can drive all clock destinations in their region. A region is defined as any
area that is 50 I/O and 50 CLB high and half the device wide. Each device in the Zynq-
7000 family has between four and fourteen regions. There are four regional clock tracks
in every region. Each regional clock buffer can be driven from any of four clock-
capable input pins, and its frequency can optionally be divided by any integer from 1 to 8.
I/O Clocks
I/O clocks are especially fast and serve only I/O logic and serializer/deserializer (SerDes)
circuits, as described in the I/O Logic section. The SoCs have a direct connection from
the MMCM to the I/O for low-jitter, high-performance interfaces.
Block RAM
Some of the key features of the block RAM include:
- Dual-port 36 Kb block RAM with port widths of up to 72
- Programmable FIFO logic
- Built-in optional error correction circuitry
Each device in the Zynq-7000 family has up to 755 dual-port block RAMs, each
storing 36 Kb. Each block RAM has two completely independent ports that share nothing
but the stored data.
Synchronous Operation
Each memory access, read or write, is controlled by the clock. All inputs, data, address,
clock enables, and write enables are registered. The input address is always clocked,
retaining data until the next operation. An optional output data pipeline register allows
higher clock rates at the cost of an extra cycle of latency. During a write operation, the
data output can reflect either the previously stored data, the newly written data, or can
remain unchanged.
Programmable Data Width
Each port can be configured as 32K × 1, 16K × 2, 8K × 4, 4K × 9 (or 8), 2K × 18 (or 16),
1K × 36 (or 32), or 512 × 72 (or 64). The two ports can have different aspect ratios
without any constraints. Each block RAM can be divided into two completely independent
18 Kb block RAMs that can each be configured to any aspect ratio from 16K × 1 to
512 × 36. Everything described previously for the full 36 Kb block RAM also applies to
each of the smaller 18 Kb block RAMs. Only in simple dual-port (SDP) mode can data widths
of greater than 18 bits (18 Kb RAM) or 36 bits (36 Kb RAM) be accessed. In this mode, one
port is dedicated to read operation, the other to write operation. In SDP mode, one side
(read or write) can be variable, while the other is fixed to 32/36 or 64/72. Both sides
of the dual-port 36 Kb RAM can be of variable width. Two adjacent 36 Kb block RAMs can be
configured as one cascaded 64K × 1 dual-port RAM without any additional logic.
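The aspect ratios listed above all describe the same 36 Kb of storage. A small calculation makes the pattern visible; the function name is hypothetical, and the split into 32 Kb of data plus 4 Kb of parity is inferred from the listed configurations rather than stated in the text.

```python
TOTAL_BITS = 36 * 1024   # full 36 Kb block, parity bits included
DATA_BITS = 32 * 1024    # portion usable when the parity bits are not part of the word


def port_depth(width):
    """Return the number of addressable words for a given port width of a 36 Kb block RAM.

    Widths that are multiples of 9 (9, 18, 36, 72) use the parity bits as data;
    all other widths (1, 2, 4, 8, 16, 32, 64) use only the 32 Kb data portion.
    """
    if width % 9 == 0:
        return TOTAL_BITS // width
    return DATA_BITS // width


# Depth for every width named in the text
depths = {w: port_depth(w) for w in (1, 2, 4, 8, 9, 16, 18, 32, 36, 64, 72)}
```

The result reproduces the configurations given above, for example 32K words at width 1, 4K words at width 8 or 9, and 512 words at width 64 or 72.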
Error Detection and Correction
Each 64-bit-wide block RAM can generate, store, and utilize eight additional Hamming
code bits and perform single-bit error correction and double-bit error detection (ECC)
during the read process. The ECC logic can also be used when writing to or reading from
external 64- to 72-bit-wide memories.
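Single-bit correction with double-bit detection on 64 data bits plus eight check bits can be sketched with a textbook extended Hamming code. The bit layout used inside the block RAM is not given here, so the positions chosen below (check bits at power-of-two positions, overall parity at position 0) are an assumption for illustration only.

```python
CHECK_POSITIONS = (1, 2, 4, 8, 16, 32, 64)


def secded_encode(data_bits):
    """Encode 64 data bits (list of 0/1) into a 72-bit codeword (list of 0/1)."""
    if len(data_bits) != 64:
        raise ValueError("expected 64 data bits")
    code = [0] * 72                      # index 0 holds the overall parity bit
    data = iter(data_bits)
    for pos in range(1, 72):
        if pos & (pos - 1):              # not a power of two: data position
            code[pos] = next(data)
    for p in CHECK_POSITIONS:            # each check bit covers positions with bit p set
        parity = 0
        for pos in range(1, 72):
            if pos & p and pos != p:
                parity ^= code[pos]
        code[p] = parity
    code[0] = sum(code[1:]) & 1          # make the parity of the whole word even
    return code


def secded_decode(code):
    """Return (status, codeword) where status is 'ok', 'corrected', or 'double'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 72):
        if code[pos]:
            syndrome ^= pos              # XOR of the positions of all set bits
    overall = sum(code) & 1
    if syndrome == 0 and overall == 0:
        return "ok", code
    if overall == 1 and syndrome < 72:
        code[syndrome] ^= 1              # syndrome points at the flipped bit
        return "corrected", code
    return "double", code                # detected but not correctable


# Example: encode a pattern, then inject one and two bit errors
word = [(i * 5 + 3) % 7 % 2 for i in range(64)]
clean = secded_encode(word)
one_error = list(clean)
one_error[37] ^= 1
two_errors = list(clean)
two_errors[5] ^= 1
two_errors[60] ^= 1
```

Decoding one_error returns the status "corrected" together with the original codeword, while decoding two_errors returns "double". This matches the behavior described above: one flipped bit is repaired during the read, two flipped bits are only flagged.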
FIFO Controller
The built-in FIFO controller for single-clock (synchronous) or dual-clock (asynchronous
or multirate) operation increments the internal addresses and provides four handshaking
flags: full, empty, almost full, and almost empty. The almost full and almost empty flags
are freely programmable. Similar to the block RAM, the FIFO width and depth are
programmable, but the write and read ports always have identical width. First word fall-
through mode presents the first-written word on the data output even before the first read
operation. After the first word has been read, there is no difference between this mode
and the standard mode.
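The four handshaking flags can be illustrated with a small behavioral model. This is not cycle-accurate and does not model clock domains or first word fall-through. The class name and the exact threshold semantics (almost full when the free space is at or below an offset, almost empty when the fill level is at or below an offset) are assumptions made for this sketch.

```python
from collections import deque


class FifoModel:
    """Behavioral FIFO with full, empty, almost_full, and almost_empty flags."""

    def __init__(self, depth, almost_full_offset, almost_empty_offset):
        self.depth = depth
        self.almost_full_offset = almost_full_offset      # programmable threshold
        self.almost_empty_offset = almost_empty_offset    # programmable threshold
        self._words = deque()

    @property
    def empty(self):
        return len(self._words) == 0

    @property
    def full(self):
        return len(self._words) == self.depth

    @property
    def almost_empty(self):
        return len(self._words) <= self.almost_empty_offset

    @property
    def almost_full(self):
        return len(self._words) >= self.depth - self.almost_full_offset

    def write(self, word):
        """Store a word; return False (and store nothing) if the FIFO is full."""
        if self.full:
            return False
        self._words.append(word)
        return True

    def read(self):
        """Return the oldest word, or None if the FIFO is empty."""
        if self.empty:
            return None
        return self._words.popleft()
```

A writer would typically stop when almost_full asserts rather than waiting for full, which leaves room for words already in flight; that is the reason the two "almost" thresholds are programmable.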
Digital Signal Processing DSP Slice
Some highlights of the DSP functionality include:
- 25 × 18 two's complement multiplier/accumulator high-resolution (48 bit) signal
  processor
- Power saving pre-adder to optimize symmetrical filter applications
- Advanced features: optional pipelining, optional ALU, and dedicated buses for cascading
DSP applications use many binary multipliers and accumulators, best implemented in
dedicated DSP slices. The devices in the Zynq-7000 family have many dedicated, full
custom, low-power DSP slices, combining high speed with small size while retaining system
design flexibility. Each DSP slice fundamentally consists of a dedicated 25 × 18 bit
two's complement multiplier and a 48-bit accumulator, both capable of operating up to
741 MHz. The multiplier can be dynamically bypassed, and two 48-bit inputs can feed a
single-instruction-multiple-data (SIMD) arithmetic unit (dual 24-bit
add/subtract/accumulate or quad 12-bit add/subtract/accumulate), or a logic unit that can
generate any one of ten different logic functions of the two operands. The DSP includes
an additional pre-adder, typically used in symmetrical filters. This pre-adder improves
performance in densely packed designs and reduces the DSP slice count by up to 50%. The
DSP also includes a 48-bit-wide Pattern Detector that can be used for convergent or
symmetric rounding. The pattern detector is also capable of implementing 96-bit-wide
logic functions when used in conjunction with the logic unit. The DSP slice provides
extensive pipelining and extension capabilities that enhance the speed and efficiency of
many applications beyond digital signal processing, such as wide dynamic bus shifters,
memory address generators, wide bus multiplexers, and memory-mapped I/O register files.
The accumulator can also be used as a synchronous up/down counter.
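The claim that the pre-adder can cut the DSP slice count by up to 50% in symmetrical filters follows from simple algebra: when two taps share a coefficient, the two samples can be added first and multiplied once. The sketch below counts multiplications for one output sample. The function names are hypothetical, and integer arithmetic stands in for the fixed-point hardware.

```python
def fir_direct(samples, coeffs):
    """One FIR output computed tap by tap; returns (result, multiplications used)."""
    result = sum(c * x for c, x in zip(coeffs, samples))
    return result, len(coeffs)


def fir_with_preadder(samples, coeffs):
    """Same output for symmetric coefficients, adding mirrored samples before multiplying."""
    n = len(coeffs)
    if any(coeffs[i] != coeffs[n - 1 - i] for i in range(n)):
        raise ValueError("coefficients must be symmetric")
    result = 0
    multiplications = 0
    for i in range(n // 2):
        # pre-adder: the two samples that share coefficient i are summed first
        result += coeffs[i] * (samples[i] + samples[n - 1 - i])
        multiplications += 1
    if n % 2:                               # middle tap of an odd-length filter
        result += coeffs[n // 2] * samples[n // 2]
        multiplications += 1
    return result, multiplications


coeffs = [1, 3, 5, 5, 3, 1]
samples = [2, -1, 4, 0, 7, 3]
direct_result, direct_mults = fir_direct(samples, coeffs)
preadd_result, preadd_mults = fir_with_preadder(samples, coeffs)
```

For this 6-tap example both functions produce the same output value, but the pre-adder version needs three multiplications instead of six. Since each multiplication maps to one DSP slice in a fully parallel filter, that is the 50% saving mentioned above.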
Input/Output
Some highlights of the PL input/output functionality include:
- High-performance SelectIO technology with support for 1866 Mb/s DDR3
- High-frequency decoupling capacitors within the package for enhanced signal integrity
- Digitally Controlled Impedance that can be 3-stated for lowest power, high-speed I/O
  operation
The number of I/O pins varies depending on device and package size. Each I/O is
configurable and can comply with a large number of I/O standards. With the exception of
the supply pins and a few dedicated configuration pins, all other PL pins have the same
I/O capabilities, constrained only by certain banking rules. The SelectIO resources in
Zynq-7000 and Zynq-7000S devices are classed as either High Range (HR) or High
Performance (HP). The HR I/Os offer the widest range of voltage support, from 1.2V to
3.3V. The HP I/Os are optimized for highest performance operation, from 1.2V to 1.8V.
All I/O pins are organized in banks, with 50 pins per bank. Each bank has one common
VCCO output supply, which also powers certain input buffers. Some single-ended input
buffers require an internally generated or an externally applied reference voltage
(VREF). There are two VREF pins per bank (except configuration bank 0). A single bank can
have only one VREF voltage value. The Zynq-7000 family uses a variety of package types to
suit the needs of the user, including small form factor wire-bond packages for lowest
cost; conventional, high performance flip-chip packages; and lidless flip-chip packages
that balance smaller form factor with high performance. In the flip-chip packages, the
silicon device is attached to the package substrate using a high-performance flip-chip
process. Controlled ESR discrete decoupling capacitors are mounted on the package
substrate to optimize signal integrity under simultaneous switching of outputs (SSO)
conditions.
I/O Electrical Characteristics
Single-ended outputs use a conventional CMOS push/pull output structure driving High
towards VCCO or Low towards ground, and can be put into a high-Z state. The system
designer can specify the slew rate and the output strength. The input is always active but
is usually ignored while the output is active. Each pin can optionally have a weak pull-up
or a weak pull-down resistor. Most signal pin pairs can be configured as differential input
pairs or output pairs. Differential input pin pairs can optionally be terminated with a
100 Ω internal resistor. All devices in the Zynq-7000 family support differential standards
beyond LVDS: HT, RSDS, BLVDS, differential SSTL, and differential HSTL. Each of
the I/Os supports memory I/O standards, such as single-ended and differential HSTL as
well as single-ended SSTL and differential SSTL. The SSTL I/O standard can support
data rates of up to 1866 Mb/s for DDR3 interfacing applications.
3-State Digitally Controlled Impedance and Low-Power I/O Features
The 3-state Digitally Controlled Impedance (T_DCI) can control the output drive
impedance (series termination) or can provide parallel termination of an input signal to
VCCO or split (Thevenin) termination to VCCO/2. This allows users to eliminate off-chip
termination for signals using T_DCI. In addition to board space savings, the
termination automatically turns off when in output mode or when 3-stated, saving
considerable power compared to off-chip termination. The I/Os also have low-power
modes for IBUF and IDELAY to provide further power savings, especially when used to
implement memory interfaces.
I/O Logic
Input and Output Delay
All inputs and outputs can be configured as either combinatorial or registered. Double
data rate (DDR) is supported by all inputs and outputs. Any input and some outputs can
be individually delayed by up to 32 increments of 78 ps or 52 ps each. Such delays are
implemented as IDELAY and ODELAY. The number of delay steps can be set by
configuration and can also be incremented or decremented while in use.
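The two tap sizes quoted above, 78 ps and 52 ps, can be reproduced if the tap resolution is taken to be 1 / (2 × 32 × F_REF), with reference clocks of 200 MHz and 300 MHz respectively. Neither the formula nor the reference frequencies appear in the text above; they are assumptions taken from the 7 series SelectIO documentation, and the function names are hypothetical.

```python
def tap_resolution_ps(f_ref_mhz, taps=32):
    """Delay of one tap in picoseconds, assuming resolution = 1 / (2 * taps * F_REF)."""
    return 1e6 / (2 * taps * f_ref_mhz)


def total_delay_ps(increments, f_ref_mhz):
    """Delay added by a given number of increments (0 to 32, per the text above)."""
    if not 0 <= increments <= 32:
        raise ValueError("increments must be between 0 and 32")
    return increments * tap_resolution_ps(f_ref_mhz)


tap_200 = tap_resolution_ps(200.0)   # about 78 ps
tap_300 = tap_resolution_ps(300.0)   # about 52 ps
```

Under these assumptions the full delay range is about 2.5 ns at a 200 MHz reference (32 increments of roughly 78 ps), and a faster reference clock trades range for finer steps.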
ISERDES and OSERDES
Many applications combine high-speed, bit-serial I/O with slower parallel operation
inside the device. This requires a serializer and deserializer (SerDes) inside the I/O
structure. Each I/O pin possesses an 8-bit IOSERDES (ISERDES and OSERDES)
capable of performing serial-to-parallel or parallel-to-serial conversions with
programmable widths of 2, 3, 4, 5, 6, 7, or 8 bits. By cascading two IOSERDES from two
adjacent pins (default from differential I/O), wider width conversions of 10 and 14 bits
can also be supported. The ISERDES has a special oversampling mode capable of
asynchronous data recovery for applications like a 1.25 Gb/s LVDS I/O-based SGMII
interface.
Low-Power Serial Transceivers
Some highlights of the low-power serial transceivers in the Zynq-7000 family include:
- High-performance GTX transceivers capable of up to 12.5 Gb/s line rates with flip-chip
  packages, up to 6.6 Gb/s with lidless flip-chip packages, and GTP transceivers capable
  of up to 6.25 Gb/s with wire-bond packages.
- Low-power mode optimized for chip-to-chip interfaces.
- Advanced transmit pre- and post-emphasis, and receiver linear (CTLE) and decision
  feedback equalization (DFE), including adaptive equalization for additional margin.
Ultra-fast serial data transmission to optical modules, between ICs on the same
PCB, over the backplane, or over longer distances is becoming increasingly popular and
important to enable customer line cards to scale to 200 Gb/s. It requires specialized
dedicated on-chip circuitry and differential I/O capable of coping with the signal integrity
issues at these high data rates. The transceiver counts range from 0 to 16 transceiver
circuits. Each serial transceiver is a combined transmitter and receiver. The various serial
transceivers can use a combination of ring oscillators and LC tank architecture to allow
the ideal blend of flexibility and performance while enabling IP portability across the
family members. Lower data rates can be achieved using logic-based oversampling. The
serial transmitter and receiver are independent circuits that use an advanced PLL
architecture to multiply the reference frequency input by certain programmable numbers
between 4 and 25 to become the bit-serial data clock. Each transceiver has a large number
of user-definable features and parameters. All of these can be defined during device
configuration, and many can also be modified during operation.
Transmitter
The transmitter is fundamentally a parallel-to-serial converter with a conversion ratio of
16, 20, 32, 40, 64, or 80. This allows the designer to trade off datapath width for timing
margin in high-performance designs. These transmitter outputs drive the PC board with a
single-channel differential output signal. TXOUTCLK is the appropriately divided serial
data clock and can be used directly to register the parallel data coming from the internal
logic. The incoming parallel data is fed through an optional FIFO and has additional
hardware support for the 8B/10B, 64B/66B, or 64B/67B encoding schemes to provide a
sufficient number of transitions. The bit-serial output signal drives two package pins with
differential signals. This output signal pair has programmable signal swing as well as
programmable pre- and post-emphasis to compensate for PC board losses and other
interconnect characteristics. For shorter channels, the swing can be reduced to reduce
power consumption.
Receiver
The receiver is fundamentally a serial-to-parallel converter, changing the incoming
bit-serial differential signal into a parallel stream of words, each 16, 20, 32, 40, 64,
or 80 bits. This allows the designer to trade off internal datapath width versus logic
timing margin. The receiver takes the incoming differential data stream, feeds it through
programmable linear and decision feedback equalizers (to compensate for PC board and
other interconnect characteristics), and uses the reference clock input to initiate clock
recognition. There is no need for a separate clock line. The data pattern uses non-return-
to-zero (NRZ) encoding and optionally guarantees sufficient data transitions by using the
selected encoding scheme. Parallel data is then transferred into the PL using the
RXUSRCLK clock. For short channels, the transceivers offer a special low power mode
(LPM) for additional power reduction.
Out-of-Band Signaling
The transceivers provide out-of-band (OOB) signaling, often used to send low-speed
signals from the transmitter to the receiver while high-speed serial data transmission is
not active. This is typically done when the link is in a powered-down state or has not yet
been initialized. This benefits PCI Express and SATA/SAS applications.
Integrated Block for PCI Express Designs
Highlights of the integrated block for PCI Express include: Compliant to the PCI
Express Base Specification 2.1 with Endpoint and Root Port capability Supports Gen1
(2.5 Gb/s) and Gen2 (5 Gb/s) Advanced configuration options, Advanced Error
Reporting (AER), and End-to-End CRC (ECRC) Advanced Error Reporting and ECRC
features All devices with transceivers in the Zynq-7000 family include an integrated
block for PCI Express technology that can be configured as an Endpoint or Root Port,
compliant to the PCI Express Base Specification Revision 2.1. The Root Port can be used
Page 63
to build the basis for a compatible Root Complex, to allow custom communication
between the Zynq-7000 AP SoC and other devices via the PCI Express protocol, and to
attach ASSP Endpoint devices, such as Ethernet Controllers or Fibre Channel HBAs, to
the Zynq-7000 All Programmable SoC. This block is highly configurable to system
design requirements and can operate 1, 2, 4, or 8 lanes at the 2.5 Gb/s and 5.0 Gb/s data
rates. For high-performance applications, advanced buffering techniques of the block
offer a flexible maximum payload size of up to 1,024 bytes. The integrated block
interfaces to the integrated high-speed transceivers for serial connectivity and to block
RAMs for data buffering. Combined, these elements implement the Physical Layer, Data
Link Layer, and Transaction Layer of the PCI Express protocol. Xilinx provides a light-
weight, configurable, easy-to-use LogiCORE IP wrapper that ties the various building
blocks (the integrated block for PCI Express, the transceivers, block RAM, and clocking
resources) into an Endpoint or Root Port solution. The system designer has control over
many configurable parameters: lane width, maximum payload size, PL interface speeds,
reference clock frequency, and base address register decoding and filtering. Xilinx offers
an AXI4 (memory-mapped) wrapper for the integrated block, designed for the Xilinx Platform
Studio/EDK design flow and MicroBlaze processor-based designs.
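To put the lane and rate options above in perspective, the following sketch estimates the usable link bandwidth. It assumes the 8b/10b line encoding used by Gen1 and Gen2 PCI Express (8 data bits per 10 line bits) and ignores packet and protocol overhead, so real throughput will be lower. The function and constant names are illustrative only.

```python
# Line rate per lane in Gb/s for the two supported generations.
LINE_RATE_GBPS = {"gen1": 2.5, "gen2": 5.0}
VALID_LANES = (1, 2, 4, 8)
ENCODING_EFFICIENCY = 8 / 10  # 8b/10b: 8 payload bits per 10 line bits


def link_bandwidth_gbps(generation: str, lanes: int) -> float:
    """Estimate post-encoding bandwidth in Gb/s, per direction."""
    if generation not in LINE_RATE_GBPS:
        raise ValueError(f"unknown generation: {generation}")
    if lanes not in VALID_LANES:
        raise ValueError(f"unsupported lane count: {lanes}")
    return LINE_RATE_GBPS[generation] * ENCODING_EFFICIENCY * lanes


if __name__ == "__main__":
    for gen in LINE_RATE_GBPS:
        for lanes in VALID_LANES:
            bw = link_bandwidth_gbps(gen, lanes)
            print(f"{gen} x{lanes}: {bw:5.1f} Gb/s ({bw / 8:4.2f} GB/s)")
```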
XADC (Analog-to-Digital Converter)
Highlights of the XADC architecture include:
Dual 12-bit 1 MSPS analog-to-digital converters (ADCs)
Up to 17 flexible and user-configurable analog inputs
On-chip or external reference option
On-chip temperature and power supply sensors
Continuous JTAG access to ADC measurements
All devices in the Zynq-7000 family integrate a
flexible analog interface called XADC. When combined with the programmable logic
capability, the XADC can address a broad range of data acquisition and monitoring
requirements. This unique combination of analog and programmable logic is called
Analog Mixed Signal. The XADC contains two 12-bit 1 MSPS ADCs with separate track
and hold amplifiers, an on-chip analog multiplexer (up to 17 external analog input
channels supported), and on-chip thermal and supply sensors. The two ADCs can be
configured to simultaneously sample two external-input analog channels. The track and
hold amplifiers support a range of analog input signal types, including unipolar, bipolar,
and differential. The analog inputs can support signal bandwidths of at least 500 kHz at
sample rates of 1 MSPS. It is possible to support higher analog bandwidths using external
analog multiplexer mode with the dedicated analog input (see UG480, 7 Series FPGAs
XADC Dual 12-Bit 1MSPS Analog-to-Digital Converter User Guide). The XADC
optionally uses an on-chip reference circuit (±1%), thereby eliminating the need for any
external active components for basic on-chip monitoring of temperature and power
supply rails. To achieve the full 12-bit performance of the ADCs, an external 1.25 V
reference IC is recommended. If the XADC is not instantiated in a design, then by default
it digitizes the output of all on-chip sensors. The most recent measurement results
(together with maximum and minimum readings) are stored in dedicated registers for
access at any time via the JTAG interface. User-defined alarm thresholds can
automatically indicate over-temperature events and unacceptable power supply variation.
A user-specified limit (for example, 100°C) can be used to initiate an automatic
power-down.
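The over-temperature check described above can be sketched as follows. The conversion uses the temperature transfer function for the 12-bit sensor code as recalled from UG480 (temperature in °C = code * 503.975 / 4096 - 273.15). Treat the constants, the function names, and the 100°C limit as assumptions to verify against the user guide, not as the device's actual alarm logic.

```python
def xadc_code_to_celsius(code: int) -> float:
    """Convert a 12-bit temperature sensor code to degrees Celsius.

    Constants are assumed from the UG480 transfer function.
    """
    if not 0 <= code <= 0xFFF:
        raise ValueError("code must be a 12-bit value (0..4095)")
    return code * 503.975 / 4096 - 273.15


def over_temperature(code: int, limit_c: float = 100.0) -> bool:
    """Return True when the reading is at or above the user-specified limit."""
    return xadc_code_to_celsius(code) >= limit_c


if __name__ == "__main__":
    for sample in (2423, 3032, 3033):  # example codes, not measured data
        temp = xadc_code_to_celsius(sample)
        print(f"code {sample}: {temp:6.2f} C, alarm={over_temperature(sample)}")
```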
System-Level Functions
Several functions span both the PS and PL and include:
Reset management
Clock management
Device configuration
Hardware and software debug support
Power management
Reset Management
The reset management function provides the ability to reset the entire device or
individual units within it. The PS supports these reset functions and signals:
External and internal power-on reset signal
Warm reset
Watchdog timer reset
User resets to PL
Software, watchdog timer, or JTAG provided resets
Security violation reset (locked down reset)
Clock Management
In the Zynq-7000 family, the PS is equipped with three phase-locked loops (PLLs),
providing flexibility in configuring the clock domains within the PS. There are three
primary clock domains of interest within the PS. These include the APU, the DDR
controller, and the I/O peripherals (IOP). The frequencies of all of these domains can be
configured independently under software control.
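One way to picture the independent clock domains is as a reference clock multiplied by a PLL feedback divider and then divided down per domain. The sketch below models that arithmetic only. The 33.333 MHz reference, the multiplier and divisor values, and all names are assumed example settings, not values taken from this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClockDomain:
    name: str
    pll_multiplier: int  # PLL feedback divider (PLL output = ref * multiplier)
    divisor: int         # output divider applied after the PLL

    def frequency_mhz(self, ref_mhz: float) -> float:
        if self.pll_multiplier <= 0 or self.divisor <= 0:
            raise ValueError("multiplier and divisor must be positive")
        return ref_mhz * self.pll_multiplier / self.divisor


REF_MHZ = 33.333  # assumed reference clock

# One PLL per domain, each with its own (assumed) settings.
DOMAINS = (
    ClockDomain("APU", pll_multiplier=40, divisor=2),
    ClockDomain("DDR", pll_multiplier=32, divisor=2),
    ClockDomain("IOP", pll_multiplier=30, divisor=10),
)

if __name__ == "__main__":
    for domain in DOMAINS:
        print(f"{domain.name}: {domain.frequency_mhz(REF_MHZ):.2f} MHz")
```

Because each domain has its own multiplier and divisor, changing one entry leaves the other two frequencies untouched, which is the flexibility the text describes.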
PS Boot and Device Configuration
Zynq-7000 and Zynq-7000S devices use a multi-stage boot process that supports both a
non-secure and a secure boot. The PS is the master of the boot and configuration process.
For a secure boot, the PL must be powered on to enable the use of the security block
located within the PL, which provides 256-bit AES and SHA decryption/authentication.
Upon reset, the device mode pins are read to determine the primary boot device to be
used: NOR, NAND, Quad-SPI, SD, or JTAG. JTAG can only be used as a non-secure
boot source and is intended for debugging purposes. One of the ARM Cortex-A9 CPUs
executes code out of on-chip ROM and copies the first stage boot loader (FSBL) from the
boot device to the OCM. After copying the FSBL to OCM, the processor executes the
FSBL. Xilinx supplies example FSBLs or users can create their own. The FSBL initiates
the boot of the PS and can load and configure the PL, or configuration of the PL can be
deferred to a later stage. The FSBL typically loads either a user application or an optional
second stage boot loader (SSBL) such as U-Boot. Users obtain the SSBL from Xilinx or a
third party, or they can create their own SSBL. The SSBL continues the boot process by
loading code from any of the primary boot devices or from other sources such as USB,
Ethernet, etc. If the FSBL did not configure the PL, the SSBL can do so, or again, the
configuration can be deferred to a later stage. The static memory interface controller
(NAND, NOR, or Quad-SPI) is configured using default settings. To improve device
configuration speed, these settings can be modified by information provided in the boot
image header. The ROM boot image is not user readable or callable after boot.
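The staged boot flow above can be summarized as an ordered list of steps. The sketch below is only a descriptive model of the sequence in the text, with hypothetical function and parameter names. It does not represent BootROM or FSBL code.

```python
PRIMARY_BOOT_DEVICES = ("NOR", "NAND", "Quad-SPI", "SD", "JTAG")


def boot_sequence(boot_device, secure=False, fsbl_configures_pl=True, use_ssbl=True):
    """Return the boot steps, in order, for the chosen options."""
    if boot_device not in PRIMARY_BOOT_DEVICES:
        raise ValueError(f"unknown boot device: {boot_device}")
    if secure and boot_device == "JTAG":
        # The text states JTAG is a non-secure boot source only.
        raise ValueError("JTAG can only be used as a non-secure boot source")

    steps = [
        f"BootROM: read mode pins, select {boot_device}",
        "BootROM: copy FSBL to OCM",
        "FSBL: initialize PS",
    ]
    if fsbl_configures_pl:
        steps.append("FSBL: configure PL")
    if use_ssbl:
        steps.append("FSBL: load SSBL")
        if not fsbl_configures_pl:
            steps.append("SSBL: configure PL (or defer)")
        steps.append("SSBL: load further code")
    else:
        steps.append("FSBL: load user application")
    return steps


if __name__ == "__main__":
    for step in boot_sequence("SD", fsbl_configures_pl=False):
        print(step)
```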
Hardware and Software Debug Support
The debug system used in the Zynq-7000 family is based on ARM's CoreSight
architecture. It uses ARM CoreSight components including an embedded trace buffer
(ETB), a program trace macrocell (PTM), and an instrumentation trace macrocell (ITM). This
enables instruction trace features as well as hardware breakpoints and triggers. The
programmable logic can be debugged with the integrated logic analyzer.
Debug Ports
Two JTAG ports are available and can be chained together or used separately. When
chained together, a single port is used for ARM processor code downloads and run-time
control operations, PL configuration, and PL debug with the ChipScope Pro embedded
logic analyzer. This enables tools such as the Xilinx Software Development Kit (SDK)
and ChipScope Pro analyzer to share a single download cable from Xilinx. When the
JTAG chain is split, one port is used for PS support, including direct access to the ARM
DAP interface. This CoreSight interface enables the use of ARM-compliant debug and
software development tools such as ARM Development Studio 5 (DS-5). The other JTAG
port can then be used by the Xilinx FPGA tools for access to the PL, including
configuration bitstream downloads and PL debug with the integrated logic analyzer. In
this mode, users can download to and debug the PL in the same manner as a stand-alone
FPGA.
Power Management
The PS and PL reside on different power planes. This enables the PS and PL to be
connected to independent power rails, each with its own dedicated power supply pins. If
PL power-off mode is not needed, the user can tie the PS and PL power rails together.
When the PS is in power-off mode, it holds the PL in a permanent reset condition. The
power control for the PL is accomplished through external pins to the PL. External power
management circuitry can be used to control power. The external power management
circuitry could be controlled by software and the PS GPIO.
Power Modes
These are a few of the power savings modes offered by the Zynq-7000 family:
Programmable Logic Power Off (Sleep)
The PS and PL reside on different power planes and the PS can run with the PL powered off.
For security reasons, the PL cannot be powered on before the PS. The PL requires
reconfiguration after each power-on. The user should take PL configuration time into
consideration when using this power savings mode.
PS Clock Control
The PS can be run at a reduced clock rate down to 30 MHz using the internal PLLs. The clock
rate can be changed dynamically. To change the clock dynamically, the user must unlock the
system control register to access the PS clock control register or the clock generation
control register.
Single Processor Mode
In this mode, the second Cortex-A9 CPU is switched off using clock gating and the first CPU
is kept fully operational.
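The unlock-then-write sequence for PS clock control can be illustrated with a small software model. This is a host-side sketch, not driver code. The register offsets, key values, and divisor bit field are recalled from the Zynq-7000 technical reference manual (UG585) and should be treated as assumptions to verify. The class and function names are hypothetical.

```python
class SlcrModel:
    """Toy model of a write-protected system control register block."""

    LOCK_OFFSET = 0x004     # assumed offset of the lock register
    UNLOCK_OFFSET = 0x008   # assumed offset of the unlock register
    LOCK_KEY = 0x767B       # assumed lock key
    UNLOCK_KEY = 0xDF0D     # assumed unlock key
    ARM_CLK_CTRL = 0x120    # assumed offset of the CPU clock control register

    def __init__(self):
        self.locked = True
        self.regs = {self.ARM_CLK_CTRL: 0}

    def write(self, offset: int, value: int) -> bool:
        """Write a register; return False if the write was blocked."""
        if offset == self.UNLOCK_OFFSET:
            if value == self.UNLOCK_KEY:
                self.locked = False
            return True
        if offset == self.LOCK_OFFSET:
            if value == self.LOCK_KEY:
                self.locked = True
            return True
        if self.locked:
            return False  # protected registers ignore writes while locked
        self.regs[offset] = value
        return True


def set_cpu_divisor(slcr: SlcrModel, divisor: int) -> bool:
    """Unlock, update the divisor field (assumed bits 13:8), then lock again."""
    if not 1 <= divisor <= 63:
        raise ValueError("divisor must fit in 6 bits and be non-zero")
    slcr.write(SlcrModel.UNLOCK_OFFSET, SlcrModel.UNLOCK_KEY)
    current = slcr.regs[SlcrModel.ARM_CLK_CTRL]
    updated = (current & ~(0x3F << 8)) | (divisor << 8)
    ok = slcr.write(SlcrModel.ARM_CLK_CTRL, updated)
    slcr.write(SlcrModel.LOCK_OFFSET, SlcrModel.LOCK_KEY)
    return ok
```

Re-locking after the update keeps stray writes from changing clock settings later, which is the reason for the protection in the first place.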
Memory Map
Devices in the Zynq-7000 family support a 4 GB address space, organized as described
CHAPTER 7
DESIGNING HARDWARE USING XILINX VIVADO:
Step1:
Open Vivado by selecting Start.
Click Create New Project to start the wizard. You will see the Create A New Vivado Project
dialog box. Click Next.
Select the project location and click Next.
Step2:
Select the RTL Project option in the Project Type form, and click Next.
Select VHDL as the Target Language and as the Simulator language in the Add Sources
form.
Step3:
In the Default Part form, use the Parts option and the various drop-down fields of the Filter
section to narrow the selection.
Select Boards and click the Zynq evaluation and development board.
Click Finish to create the Vivado project.
Step4:
Click Add Sources under the Project Manager tasks of the Flow Navigator pane.
Select the Zynq processing system, GPIO, AXI, and BRAM controller blocks.
Step5:
The selected peripherals are shown below.
Now click Run Block Automation.
Step6:
After you click Run Block Automation, the connections are made automatically.
Step7:
The block diagram below shows the connections that were made.
Step8:
After designing the hardware in Vivado, export the design to SDK to validate it.
The output for the validated design is shown below.
The output above shows the read operation.
The output above shows the write operation.
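The write and read checks reported above follow the usual write-then-read-back pattern for a memory-mapped block RAM. The sketch below models that pattern on the host, with a plain Python list standing in for the BRAM. The class, the base address, and the depth are hypothetical, and an actual SDK test would use the board support package's register access routines instead.

```python
class BramModel:
    """Host-side stand-in for a word-addressed block RAM."""

    def __init__(self, base_addr: int, depth_words: int):
        self.base_addr = base_addr
        self.words = [0] * depth_words

    def _index(self, addr: int) -> int:
        offset = addr - self.base_addr
        if offset < 0 or offset % 4 != 0 or offset // 4 >= len(self.words):
            raise ValueError(f"bad address: {addr:#010x}")
        return offset // 4

    def write32(self, addr: int, value: int) -> None:
        self.words[self._index(addr)] = value & 0xFFFFFFFF

    def read32(self, addr: int) -> int:
        return self.words[self._index(addr)]


def memory_test(bram: BramModel, count: int) -> list:
    """Write an address-derived pattern, read it back, return failing addresses."""
    addrs = [bram.base_addr + 4 * i for i in range(count)]
    for addr in addrs:
        bram.write32(addr, addr ^ 0xA5A5A5A5)
    return [a for a in addrs if bram.read32(a) != ((a ^ 0xA5A5A5A5) & 0xFFFFFFFF)]


if __name__ == "__main__":
    bram = BramModel(base_addr=0x40000000, depth_words=16)  # assumed mapping
    failures = memory_test(bram, 16)
    print("memory test passed" if not failures else f"failures at {failures}")
```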
CONCLUSION:
Modeling of an embedded system using an ARM processor for memory transactions has been
implemented using the Vivado Design Suite and SDK.