
Windows 7 Memory Management
Landy Wang
Distinguished Engineer
Microsoft Corporation

Topics
> Working set management
> Fine-grained page locking
> Security
> NUMA
> Non-volatile (flash) memory
> Handling of contiguous/large page memory requests
> High-end servers
> Footprint and performance

Working Set Background


> Optimal usage of system memory - a constant area of investment!
> Working set: comprises all the potentially trimmable virtual addresses for a given process, session or system resource.
> Resources like nonpaged pool, kernel stacks, large pages & AWE regions are excluded (because they are not trimmable).
> Working sets provide an efficient way for the system to make memory available under pressure ... but maintaining them is not free, and care must be exercised during trim candidate selection ... and the subsequent writing of those pages!
> Trimmed pages go to the standby (clean), modified or zero page lists. The modified/mapped writer threads write them in a timely fashion.
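
As a point of reference, a user-mode process can observe how big its own working set currently is through the documented process-status API. The following is a minimal sketch (error handling trimmed); it only reads counters and does not influence trimming:

    /* Minimal sketch: read the current and peak working set of this process.
     * Uses the documented psapi interface; link with psapi.lib. */
    #include <windows.h>
    #include <psapi.h>
    #include <stdio.h>

    int main(void)
    {
        PROCESS_MEMORY_COUNTERS pmc;
        pmc.cb = sizeof(pmc);

        if (GetProcessMemoryInfo(GetCurrentProcess(), &pmc, sizeof(pmc))) {
            printf("Working set:      %lu KB\n", (unsigned long)(pmc.WorkingSetSize / 1024));
            printf("Peak working set: %lu KB\n", (unsigned long)(pmc.PeakWorkingSetSize / 1024));
            printf("Page faults:      %lu\n",    (unsigned long)pmc.PageFaultCount);
        }
        return 0;
    }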

Working Set Aging/Trimming


> Working sets are periodically aged to improve trim decisions
> Which sets and which virtual addresses to trim?
> How much to trim?
> Memory events so applications can (optionally) participate ...
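
The memory events mentioned above are exposed to user mode through the memory resource notification API. A hedged sketch of an application waiting on the low-memory event (the cache-release step is left as a comment, since it is application specific):

    /* Sketch: participate in trimming by listening for the system's
     * low-memory notification and releasing discardable caches. */
    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        /* The kernel signals this object when available memory runs low. */
        HANDLE lowMem = CreateMemoryResourceNotification(LowMemoryResourceNotification);
        if (lowMem == NULL)
            return 1;

        /* Wait up to 60 seconds for the low-memory condition. */
        if (WaitForSingleObject(lowMem, 60 * 1000) == WAIT_OBJECT_0) {
            printf("Low memory signaled - release application caches here.\n");
        }

        CloseHandle(lowMem);
        return 0;
    }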

Working Set General Policies


> When memory is low, how are working sets managed equitably and efficiently so optimal usage is achieved?
> Working sets are ordered based on their age distribution.
> Trim goal is set higher to avoid subsequent additional trimming.
> After the goal is met, other sets continue to be trimmed, but just for their very old pages. This provides fairness, so that one process does not surrender pages while the others surrender none.
> Up to 4 passes may be performed; later passes consider higher percentages of each working set and lower ages (more recently accessed pages) as well.
> When trimming occurs, all sets are also aged so future trims will have optimal (and fair!) candidates.
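
The multi-pass policy above can be pictured roughly as follows. This is purely illustrative pseudocode in C form: the types, helper functions, percentages and age thresholds are all invented for the illustration and are not the kernel's actual code or values.

    /* Illustrative only: invented types, helpers and thresholds that sketch the
     * multi-pass trim policy described above. Not the real implementation. */
    #include <stddef.h>

    typedef struct WORKING_SET WORKING_SET;           /* hypothetical */

    /* Hypothetical helper: trim up to 'percent' of W, touching only pages whose
     * age is at least 'minAge' (0 = youngest bucket, 7 = oldest bucket). */
    extern size_t TrimOldPages(WORKING_SET *W, unsigned percent, unsigned minAge);
    extern void   AgeAllWorkingSets(void);

    size_t TrimUntilGoalMet(WORKING_SET **sets, size_t count, size_t pagesNeeded)
    {
        size_t trimmed = 0;

        /* Up to 4 passes; later passes take a larger share of each set and
         * accept younger (more recently accessed) pages. */
        for (unsigned pass = 0; pass < 4 && trimmed < pagesNeeded; pass++) {
            unsigned percent = 10u * (pass + 1);      /* invented: 10%..40%    */
            unsigned minAge  = 7u - pass;             /* invented: 7 down to 4 */

            for (size_t i = 0; i < count; i++) {
                if (trimmed < pagesNeeded)
                    trimmed += TrimOldPages(sets[i], percent, minAge);
                else
                    /* Goal already met: keep taking only the very oldest pages
                     * so no single process is the only one giving up memory. */
                    trimmed += TrimOldPages(sets[i], percent, 7u);
            }
        }

        AgeAllWorkingSets();   /* keep future trim candidates optimal and fair */
        return trimmed;
    }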

Working Set Improvements


> Expansion to 8 aging values (up from 4)
> Keep exact age distribution counts instead of estimates
> Force self-aging and trimming during rapid expansion
> Don't skip processes due to lock contention, and ensure fair aging by removing pass limits
> Don't ravage small sets, since subsequent hard faults penalize all sets
> Separation of the system cache working set into 3 distinct working sets (system cache, paged pool and driver images) to prevent individual expansion from trimming the others
> Factor in standby list repurposing when making age/trim decisions
> Improved inpage clustering of system addresses
> Result: doubling of performance in memory-constrained systems!

Task Manager's Main Screen

Task Manager Working Set Display

PFN Lock Background


> The PFN (page frame number) array is a virtually contiguous (but can be physically sparse) data structure where each PFN entry describes the state of a physical page of memory.
> Information includes:
  - State (zero, free, standby, modified, modified-no-write, bad, active, etc.)
  - How many page table entries are mapping it
  - How many I/Os are currently in progress
  - The containing frame/PTE
  - The PTE value to restore when the page leaves its last working set or is repurposed
  - NUMA node
  - etc.
> Size is critical ... and how to best manage the information?
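
Purely for illustration, the per-page bookkeeping listed above can be pictured as a structure along the following lines. This is an invented, simplified layout; the real MMPFN entry is internal, heavily bit-packed and laid out quite differently.

    /* Invented, simplified picture of what a PFN entry tracks per physical page. */
    #include <stdint.h>

    typedef enum PAGE_STATE {
        PageZeroed, PageFree, PageStandby, PageModified,
        PageModifiedNoWrite, PageBad, PageActive
    } PAGE_STATE;

    typedef struct PFN_ENTRY {
        PAGE_STATE state;            /* which list/state the physical page is on     */
        uint32_t   shareCount;       /* how many page table entries map this page    */
        uint16_t   ioInProgress;     /* I/Os currently in progress against the page  */
        uint64_t   containingFrame;  /* frame of the page table page / owning PTE    */
        uint64_t   originalPte;      /* PTE value restored when the page leaves its
                                        last working set or is repurposed            */
        uint8_t    numaNode;         /* home NUMA node of the physical page          */
    } PFN_ENTRY;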

PFN Lock : The Problem


> The huge majority of all virtual memory operations were synchronized via a single system-wide PFN lock. Thus even seemingly unrelated operations by threads, even those in different processes, would contend for and serialize at this lock, potentially causing significant performance degradation/spikes.
> Larger numbers of processors and memory sizes intensify the lock pressure. For example, prior to this change SQL Server had an 88% PFN lock contention rate on systems with 128 processors.
> Applications and device drivers seeking higher performance faced significant complexity at best: AWE, large pages, or even complete algorithmic redesigns.

PFN Lock : The Scope


> All page allocation, deallocation, access and state manipulation
> All prefetching, prioritization, access logging and page identification
> All page list manipulation (zero, free, standby, modified, modnowrite, bad)
> All pagefile space allocation/deletion, adding/expansion/contraction
> Page fault management, trimming/theft/replacement, mapped/modified writing, flushing, purging
> All control area, segment, subsection and prototype PTE usage
> Virtual address space deletion/decommit, protection changing, trimming, large pages, etc.
> Process/session creation, duplication, inswap/outswap, deletion
> Kernel stack creation/deletion, inswap/outswap, stealing
> System cache view mapping/unmapping/readahead, protection
> Image validation, ASLR dynamic relocations
> MDL probing/unlocking
> Driver loading, unloading, paging
> User event signaling (low memory, high memory, etc.)
> Dynamic addition/removal of memory plus mirroring/hibernate/resume
> Dynamic kernel virtual address space allocation/deletion/initialization

PFN Lock : The Answer


> In Windows 7, the system-wide PFN lock was replaced with fine-grained locking on an individual page basis.
> This completely eliminated the bottleneck, resulting in much higher scalability. For example, the Usenix memclone microbenchmark is now 15x faster than Windows Server 2008 on 32-processor configurations.
> Fully compatible (on a binary and source level), so all software benefits without any changes. Developers don't need to resort to complex workarounds to achieve the highest performance!

PFN Lock Replacement Hierarchy


> Pool locks
> System VA lock
> Working set expansion list lock
> Individual per-page locks
> Access logging lock
> Page list (free per color, zero per color, standby per priority, modified filesystem/pagefile destined, bad) locks
> Per-pagefile space lock
> Memory event signaling lock
> Per-control area lock
> Dynamic relocation VA (ASLR) assignment lock
> Segment list lock
> Section object pointers lock

Security : ASLR Background


[Diagram: the executable's load address is randomly chosen within +/- 16MB of the image header's preferred load address; DLLs are loaded relative to a randomly chosen image-load bias of up to 16MB; kernel-mode images also use a randomly chosen image-load bias.]

Security : ASLR Background


> Images are relocated dynamically when each image section is created.
> When combined with NX, makes life difficult for hackers!
> Compresses VA space to reduce page table page cost as well as provide a larger contiguous VA range for applications.
> Introduced in Vista; applications (for compatibility) must opt in via /DYNAMICBASE.
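
The /DYNAMICBASE opt-in is recorded in the image's PE header as the IMAGE_DLLCHARACTERISTICS_DYNAMIC_BASE flag. A small sketch that checks whether the running executable was built with the opt-in (it assumes a well-formed PE image, which holds for any loaded module):

    /* Sketch: check whether this executable opted in to ASLR (/DYNAMICBASE). */
    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        BYTE *base = (BYTE *)GetModuleHandle(NULL);               /* EXE image base */
        IMAGE_DOS_HEADER *dos = (IMAGE_DOS_HEADER *)base;
        IMAGE_NT_HEADERS *nt  = (IMAGE_NT_HEADERS *)(base + dos->e_lfanew);

        if (nt->OptionalHeader.DllCharacteristics & IMAGE_DLLCHARACTERISTICS_DYNAMIC_BASE)
            printf("ASLR opt-in present (/DYNAMICBASE).\n");
        else
            printf("No ASLR opt-in - image prefers its fixed base address.\n");

        return 0;
    }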

Security : ASLR Improvements


> Driver randomization increased to 64 possible load addresses for 32-bit drivers and 256 for 64-bit drivers, up from 16 for both.
> Kernel, HAL and session drivers relocated post-Vista RTM. Large session drivers (win32k.sys, for example) are also now relocated.
> Extra effort is also made to relocate user-space images even when system VA space is tight/fragmented, by temporarily using the user address space of the system process.
> The memory cost of ASLR has also been reduced by adding 2x compression for in-memory image relocation tables, which saves at least 11MB of pagable memory on every system.
> Allow execute revocation (for NX opt-in on the fly) post-Vista RTM.

NUMA
> NUMA is the approach preferred by hardware designers to achieve optimum performance.
> Typical far-node cost: clients 1.3-1.7x, servers 1.1x-3x+!
> Windows 7 adds support for 64 NUMA nodes (up from 16).
> Node graph construction so optimal allocations can always be performed automatically without drivers/apps doing heavy lifting.
> Apps can specify node preference on allocation/view/control area/thread/process boundaries.
> Automatic page migration performed by the system!
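
One documented way for an application to state a node preference at allocation time is VirtualAllocExNuma (available since Windows Vista SP1 / Server 2008). A minimal sketch, assuming node 0 exists and with error handling trimmed:

    /* Sketch: commit memory with a preferred NUMA node. The preference is a
     * hint; the system may still use another node under memory pressure. */
    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        ULONG highestNode = 0;
        GetNumaHighestNodeNumber(&highestNode);
        printf("Highest NUMA node: %lu\n", highestNode);

        void *p = VirtualAllocExNuma(GetCurrentProcess(), NULL, 16 * 1024 * 1024,
                                     MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE,
                                     0 /* preferred node */);
        if (p != NULL) {
            ((volatile char *)p)[0] = 1;   /* first touch actually backs the page */
            VirtualFree(p, 0, MEM_RELEASE);
        }
        return 0;
    }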

Integrated NVRAM Support


> NVRAM:
  - Built directly into motherboards
  - In solid state drives
  - In USB sticks
  - As a replacement for main memory
> Windows 7 delivers tight and efficient integration of NVRAM support directly into the core memory management system. Eliminates numerous filter driver drawbacks; some examples:
  - The same disk page can be in memory, in a ReadyBoost cache and pinned in a ReadyDrive disk all at the same time, with each component unaware of the others.
  - Pagefile-backed pages can be consuming space in both ReadyBoost and ReadyDrive caches even though the application (and memory management) had deleted them long ago.

Contiguous/Large Page Memory


> Significant redesign post-Vista RTM to obtain memory efficiently without trimming, issuing I/O or inserting fault delays. In-memory pages are swapped in place. Efficient scanning, including range skipping during the preliminary pass, ensures much higher yield results.
> Applications (e.g., databases) allocating large page regions.
> Hypervisors allocating memory for guest VMs.
> Device drivers making contiguous memory or MmAllocatePagesForMdl* calls.
> Result: reductions in allocation times can be several orders of magnitude! Callers no longer run the risk of disrupting the entire system!
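
On the application side, the large-page path referred to above is reached through VirtualAlloc with MEM_LARGE_PAGES. A hedged sketch follows; note that the caller must hold SeLockMemoryPrivilege, and enabling that privilege is omitted here for brevity:

    /* Sketch: request a large-page region from user mode. The process needs
     * SeLockMemoryPrivilege (enabling it is not shown here). */
    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        SIZE_T minimum = GetLargePageMinimum();   /* 0 if large pages unsupported */
        if (minimum == 0)
            return 1;

        /* The size must be a multiple of the large-page minimum (2 MB on x64). */
        SIZE_T size = 4 * minimum;
        void *p = VirtualAlloc(NULL, size,
                               MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES,
                               PAGE_READWRITE);
        if (p == NULL) {
            printf("Large-page allocation failed: %lu\n", GetLastError());
            return 1;
        }

        printf("Got %lu KB of large pages at %p\n", (unsigned long)(size / 1024), p);
        VirtualFree(p, 0, MEM_RELEASE);
        return 0;
    }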

High-End System Support


> Initial 64-bit nonpaged pool maximum bumped from 40% to 75%.
> Reclaim initial nonpaged pool (up to a 3% RAM boost!).
> Boot time reductions by not depleting executive worker queues for page zeroing (at the expense of boot time forward progress).
> TLB flush reductions (especially valuable for virtualization).
> Cache management improvements to avoid flushing/overflushing.
> Enterprise clustered filesystem support APIs added.
> Software mirroring for major OEMs (also in WS2008).
> Avoid issuing modified page writes until absolutely necessary, post-Vista RTM.

Footprint Analysis
> Vista SP1 memory management code is ~460KB (25% is pagable or INIT). Windows 7 total code growth is 8KB!
> Static data reduced from 41KB to 38KB.
> Multiplicative data structures are even more important! Significant effort went into saving at this level; relocation tables are one example.
> Locality of reference improvements for speed, false sharing elimination and footprint purposes.

Focus Areas
> Footprint / locality of reference
> Memory and I/O prioritization and efficiency
> Parallelism
> Scalability
> Security
> Power consumption
> New technologies - hardware and software

Questions?

© 2009 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.
The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond
to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after
the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
