
Big Data on Little Linux

hard-won lessons managing dozens of

servers processing petabytes of data
Daniel Sterling
Expression Analysis / Q2 Solutions
We’re hiring! http://www.q2labsolutions.com/careers
Daniel Sterling
• HPC System Administrator at a genomic analysis lab
• unabashedly operations – we run servers
• personalized medicine: custom clinical trials and treatments to help
patients beat cancer and keep living – we can win this fight
• What big data means at a genomic analysis lab

• Hardware war stories

• NFS war stories

• Storage and memory war stories

• The future of Linux Storage

What is Big Data?
• “If it fits in RAM, it’s not Big Data” - not quite true

• You still have to store and process even if you can manipulate in RAM

• What’s the storage backend? Do you use SMB, NFS or iSCSI?

• How do you scale processing: Threads or IPC / message passing?

Mapped memory, a database, or shared / distributed memory?
Big Data Processing
• Genomic data usually fits into memory

• Assuming that you buy big servers with lots of memory

• Instruments generate terabytes of lab sample data over a few days

• Each sample we process uses 10s to 100s of GB uncompressed –

that adds up quickly
Big Data Processing
• Per sample, most analysis scripts only use resources on a single node

• Per sample, scale up (bigger server – more CPUs, cores, RAM)

• Intel Xeon “E5v4” server CPUs have up to 22 cores per socket!

AMD has dropped out of the game for now

• For many samples, scale out (more servers)

Big Data Storage
• We use NFS on systems with many big, slow disks
Disk sizes are huge now – 8, 10+ TB disks are generally available

• You can buy expensive “enterprise” systems


• Just run RAID 6 on Linux! XFS filesystem over NFS, how hard can it be?
Everything brok. Halp am not good with comp
I lied to you. This talk is actually
Everything is broken and needs to be
replaced, upgraded or patched
Seriously, upgrade everything
BIOS, Firmware, Kernel, Libraries
Because everything is broken, always
I found the bugs

So you don’t have to

“Do you not know, my son, with how little wisdom the world is governed?”
- Axel Oxenstierna
Problem: The hardware is broken
• Thank goodness, it’s just the hardware. Fairly easy to isolate and fix

• DIMMs fail, sometimes spectacularly, even expensive ECC RAM.

Burn in servers: run any built-in system diagnostics first, then

• Boot to and run memtest for a few cycles to find bad RAM before
your application does and it causes crashes or corrupts data

• User space stress-testing is also effective – we use stress-ng
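A burn-in pass with stress-ng might look like the sketch below (the duration and memory fraction are assumptions; tune them for your hardware and maintenance window):

```shell
# Hammer all CPUs and most of RAM for 24 hours; stress-ng reports
# any worker failures in its exit status and summary output.
stress-ng --cpu 0 --vm 4 --vm-bytes 80% --timeout 24h --metrics-brief
```

`--cpu 0` means "one worker per CPU"; run this after the vendor diagnostics and memtest cycles, not instead of them.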

Solution: Replace bad hardware
• Easy enough to swap out a fan, CPU or DIMM

• Buy name-brand (HP, Dell) and get vendor support or enjoy getting
batches of bad components (especially DIMMs)

• Use redundancy / clustering:

• SLURM for job dispatching to “compute” nodes

• Testing gluster for storage
Problem: The hardware is overheating
• Server has 20+ cores and 100s of GB of RAM

• If you actually use all the cores, the server gets hot

• CPUs throttle when they get hot,

and slow down significantly at 90+ degrees Celsius (in-die temp)

• Emergency shutdown at 105C.

“Why are all the servers turned off?”
Solution: Upgrade the BIOS / firmware
and turn on the fans
• Firmware in latest Dell server generation lets you specify any fan
speed minimum you want

• Latest firmware in older generations will run fans at 100% if you set
BIOS System Profile to “Performance” mode – upgrade your firmware

• Don’t worry about $20 worth of fans in a server with $2,000 CPUs

• Run those fans at 100% 24/7. Maybe get 2U servers instead of 1U

Problem: NFS is broken or slow
• What’s slow or broken: the NFS client or server?
On Linux, sometimes both!
• NFS caching is complicated and confusing and probably broken
• Linux NFS client re-uses one TCP connection for all traffic to a server
• RPC system used for each transaction does not multiplex well with
default options; one app’s heavy reads or writes can drown out all
other filesystem operations, causing delays of 10s of minutes
• Linux NFS server is implemented in the kernel for no good reason.
The Ganesha NFS project is set to fix this
But Linux NFS being awful is
(mostly) not NFS’s fault or the
Linux NFS maintainer’s fault
Problem: The software is broken
• Which software? All of it!
• All software has bugs. Only tested software has fewer bugs

Solution: Upgrade and test your software

• But how do you test the Linux Kernel, where the NFS code lives?

• Not easily, it turns out.

Only tested code works: Linux kernel edition
• There is no automated testing of most of the features in many Linux
subsystems. The Linux automated tests mostly just ensure it boots

• Kernel testing happens when people use features and report bugs

• The fewer people that use a kernel version or feature, the more bugs
it has

• Some kinda important NFS features are buggy

Run supported software: Linux kernel edition
• Run tested software to experience fewer bugs

• For the Linux kernel, that means the most-used versions

• That means Long Term Support (LTS) kernels:

Red Hat Enterprise Kernels
Kernel.org LTS kernels

• Do not run Ubuntu stock kernels

Memory allocation: Linux kernel edition
• Do not run Linux out of kernel-space memory.

• You sort of can’t: There’s an “unwritten rule that GFP_KERNEL

allocations for low-order (<=PAGE_ALLOC_COSTLY_ORDER) never fail”

• The kernel will HANG FOREVER instead of failing to allocate memory.

• That’s the best case; in the worst case, low-memory situations will
trigger bugs and corrupt your data.
Problem: NFS is broken or slow
Because Linux ran out of kernel-space memory
• Symptoms:

• NFS server hangs or inexplicable system slowness,

resolved by a reboot

• “Out of memory” errors, regardless of how much memory is reported

as “free”

• Occasionally, corrupt filesystems

Solution: Upgrade the kernel for driver fixes
And give Linux more memory via tunables
• 10G NIC drivers are notorious for using all the kernel memory in the
NUMA node they’re assigned to, and XFS also loves allocating RAM
• So doing heavy NFS writes to an XFS filesystem on older kernels can
mean slowness, crashes or even corruption
• Upgrade your kernel, and give Linux more memory:
increase the vm.min_free_kbytes tunable, but not too much –
we use 135168 (132 MB). If set too high, Linux will immediately hang
• You can also increase vm.vfs_cache_pressure tunable to e.g. 500
• You can also flush the page cache as a quick and dirty fix:
echo 1 > /proc/sys/vm/drop_caches
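Persisted across reboots, the two tunables above might live in a sysctl drop-in like this (the filename is an example; the values are the ones from this talk, starting points rather than universal answers):

```shell
# /etc/sysctl.d/90-nfs-server.conf  (example filename)
# Reserve ~132 MB for kernel-space allocations; too high hangs the box
vm.min_free_kbytes = 135168
# Reclaim dentry/inode caches more aggressively than the default (100)
vm.vfs_cache_pressure = 500
```

Apply without a reboot via `sysctl --system`.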
Problem: NFS is broken or slow
Because you’re doing heavy reads or writes
• Symptoms:

• NFS client operations run slowly on one server, but not another

• Inexplicably high load on NFS client system, occasional crashes

Solution: Stop using buggy Linux NFS code
• There’s an “optional” tunable you have to set in the Linux NFS client,
because the code that runs when it’s not set doesn’t work.

• tcp_slot_table_entries – sets the max number of concurrent in-flight RPC requests


• The kernel is supposed to autoscale this value, but:

Solution: Stop using buggy Linux NFS code
and trick Linux into opening more TCP connections
• Have to set tcp_slot_table_entries before NFS exports are mounted:
update /etc/modprobe.d/sunrpc.conf
options sunrpc tcp_slot_table_entries=8192

• Another big NFS trick: Use multiple IPs on the server

• With multiple fstab entries (one per IP / directory pair) on the client.

• Works around Linux’s re-use of one TCP connection per server
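The client side of the multiple-IP trick might look like this in fstab (the 10.0.0.x addresses, export paths and mount points are all placeholders for your own):

```shell
# /etc/fstab on the client -- one entry per server IP / directory pair.
# Each distinct server IP gets its own TCP connection:
10.0.0.11:/export/lab1  /mnt/lab1  nfs  defaults  0 0
10.0.0.12:/export/lab2  /mnt/lab2  nfs  defaults  0 0
```

Both IPs live on the same NFS server; heavy IO on one mount no longer starves operations on the other.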

Problem: NFS is slow under heavy write loads
• Heavy writes slow down all filesystem operations on the NFS server
(and for all clients)

• You’ve filled up the page cache with dirty data and Linux is trying to
synchronously flush it

• Every write blocks until Linux flushes enough data

• Concurrent reads will slow the write flushing way down

Solution: Lower your dirty caching tunables
consider constantly syncing
• Turn your dirty cache ratio tunables way down:

vm.dirty_ratio = 5
vm.dirty_background_ratio = 2

• If you care more about latency than throughput (and your users often
do), just constantly run “sync”

while [[ 1 ]] ; do date ; time sync ; sleep 1 ; done
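The dirty-cache ratios above can be made persistent with a sysctl drop-in (filename is an example; 5/2 are the values from this talk):

```shell
# /etc/sysctl.d/90-dirty-cache.conf  (example filename)
vm.dirty_ratio = 5              # block writers once 5% of RAM is dirty
vm.dirty_background_ratio = 2   # start background flushing at 2%
```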

Problem: NFS is broken or slow
Because you’re trying to use NFS as a database
• NFS can’t rely on the local system pagecache to always have the latest
accurate file attribute metadata, and NFS uses async, mostly stateless
RPC for every transaction. This means NFS is robust but slow

• NFS caches file attribute metadata to avoid making constant RPC calls

• If you use NFS to store state, now your application is using a cache it
can’t directly query, control, manipulate or invalidate
So now your application has an invisible, distributed cache – every
server has its own cached state of the NFS metadata!
Solution: Upgrade your kernel
And disable some or all NFS metadata caching
• Obviously, use an actual proper database or other IPC / RPC system in
your application for state. But if you don’t control the application:

• Do not disable NFS caching on older kernels! Linux will crash under
heavy NFS load. Do not use the “noac” mount option! That forces all
NFS IO to synchronous IO, which is exceedingly, unnecessarily slow.

• Upgrade your kernel, then set “lookupcache=pos” to disable negative

directory cache lookups. If that’s not enough, set “actimeo=0” to
disable all NFS attribute caching.
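As a mount command, the escalation above might look like this (server name and paths are placeholders):

```shell
# First try disabling only the negative directory-lookup cache:
mount -t nfs -o lookupcache=pos server:/export /mnt/export
# Only if that is not enough, disable attribute caching entirely
# (correct, but every stat becomes an RPC round-trip):
# mount -t nfs -o lookupcache=pos,actimeo=0 server:/export /mnt/export
```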
Done with NFS
so done

There are other bugs I didn’t cover; upgrade your kernel for NFS fixes
Problem: You used ext4 for big data
• Symptoms:

• You used ext4 for big data

• And now your data is corrupted

• And you can’t run fsck (filesystem check) because you don’t have
enough memory
Solution: Use XFS
• Use XFS (or ZFS on BSD / Solaris, but that’s another talk)

• ext4 was written and is maintained by smart people making good

decisions, but was never meant to be rock-solid or used for big data

• XFS was written from the ground up for huge filesystems, stability,
and rock-solid consistency. XFS will tell you when your storage
subsystem is broken and corrupting data. ext4 will not.

• XFS is still being maintained by the same smart people, 20 years later
Problem: Server is unresponsive
• Symptoms:

• Your application just allocated all available RAM and swap

• If you have any swap space left, the OOM (out of memory) killer may
not kick in unless the app makes one large allocation

• Everything is running exceedingly slowly and Linux is constantly
swapping
Swap and wired memory: Linux kernel edition
• “Wired” aka unevictable aka locked memory cannot be swapped out

• Linux does not lock memory for the UI (e.g. ssh + bash) by default

• Other OSes enforce room in the page file for every byte of memory
applications use, and may lock UI pages into memory

• On other OSes, if you have 16GB of memory, you must have at least a 16GB
page file. Memory allocations fail if there is not enough room on disk to
store everything already in memory.
Problem: Server is unresponsive
Because Linux ran out of RAM and swap
• With a proper memory management system, any or all allocated data
can always be immediately stored in the page file on disk if necessary.

• Under Linux, swap is treated exactly as if it were RAM, so you can run
out of both RAM and swap, and destroy UI interactivity due to latency

• The app or UI may never be able to be fully swapped back in, until
something frees memory or finishes or is killed. Linux will constantly
swap pages in and out to very, very slowly continue work
Solution: Upgrade your kernel
And disable swap
• Older kernels did not work well when swap was completely disabled

• To be fair, this has been fixed for many years

• Just run without any swap. swapoff -a

• Why is swap so broken? Nobody uses it (Google runs with no swap)

Problem: Server is unresponsive
• Symptoms: You do want to use swap, so it’s not disabled

• Your application isn’t using all available memory – system was not
heavily swapping

• But the system stopped working anyway

• Your swap is on a software Linux md (multiple device) RAID 1 device

Solution: Upgrade your kernel
Use swap files instead of swap block devices
• Linux can crash if you’re using md RAID 1 on swap

• This is not an old bug, it affects recent kernels

• Upgrade your kernel, or

• use swap files on local disk filesystems, instead of swap partitions
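Creating a swap file instead of a partition is a few commands (run as root; the 16 GB size and /swapfile path are examples):

```shell
# dd writes real blocks; sparse/hole-y files are rejected by swapon
# on some filesystems, so avoid fallocate on older kernels
dd if=/dev/zero of=/swapfile bs=1M count=16384
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
# make it permanent across reboots:
echo '/swapfile none swap sw 0 0' >> /etc/fstab
```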

Problem: Server is unresponsive
• Symptoms:

• Your application is doing lots of work with large in-memory data

• Your application is not a database that allocates its own huge pages

• You have transparent huge pages fully enabled (the default in newer
kernels / distros)
Transparent huge pages: Linux kernel edition
• The logic for transparent huge pages massively complicates the Linux
memory subsystem

• Allocating huge pages can only work if there’s enough contiguous

memory to allocate a huge page

• So Linux has to constantly compact / defragment memory! Usually in

the background, but sometimes this causes allocation pauses!

• And of course it’s buggy and causes crashes / hangs on older kernels
Solution: Upgrade your kernel
Disable transparent huge pages
• Because of the complexity, you’ll want to run a recent kernel if you’re
using transparent huge pages

• Use explicit huge pages in your app if that makes sense, or

• Just disable it; this fully disables all the memory compaction logic

echo never > /sys/kernel/mm/transparent_hugepage/enabled
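The echo above does not survive a reboot; to make it stick, pass the setting on the kernel command line instead (edit /etc/default/grub, keeping your existing options where the ellipsis is, then regenerate the grub config):

```shell
# /etc/default/grub
GRUB_CMDLINE_LINUX="... transparent_hugepage=never"
# then regenerate: grub2-mkconfig -o /boot/grub2/grub.cfg  (RHEL-style)
# or:              update-grub                             (Debian/Ubuntu)
```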

Problem: Reading huge files thrashes disk
• You’re reading and processing huge files, and have plenty of memory

• But Linux thrashes the disk reading the files anyway

• Writes will push even frequently read data out of the page cache

• Read pages get evicted, even with over 100GB of free RAM, well over
the size of the files you’re reading and expecting to stay in cache
Solution: mmap and lock the pages in RAM
• This is one issue where upgrading the kernel won’t help

• You have to use the same tricks database apps use

• Use shared memory, or mmap the file and lock the pages into
memory in your app or with vmtouch

ulimit -l unlimited && vmtouch -vltm 300G /data/database.*

Problem: You ran out of disk space
and have no idea what to delete
• Running du over filesystems with 10s or 100s of terabytes of data can
take days

Solution: Cron a du report

• Add a daily cron job that grabs a lockfile and runs du
• Be sure to write to a new file, move current to old and new to current
• That way you still have yesterday’s report even if you ran out of space
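A minimal sketch of such a cron job (the report and data paths, the lock scheme, and the rotation names are all assumptions to adapt for your site):

```shell
#!/bin/sh
# Daily du report with rotation; intended to be run from cron.
# REPORT_DIR / DATA_DIR defaults are examples -- override for your site.
REPORT_DIR=${REPORT_DIR:-/tmp/du-report}
DATA_DIR=${DATA_DIR:-/etc}

mkdir -p "$REPORT_DIR"

# Grab a lock so a multi-day du never overlaps the next cron run
exec 9> "$REPORT_DIR/lock"
flock -n 9 || exit 0

# Write to a new file, then rotate current -> old, new -> current,
# so yesterday's report survives even if the disk fills mid-run
du -sk "$DATA_DIR"/* > "$REPORT_DIR/du.new" 2>/dev/null
[ -f "$REPORT_DIR/du.current" ] && mv "$REPORT_DIR/du.current" "$REPORT_DIR/du.old"
mv "$REPORT_DIR/du.new" "$REPORT_DIR/du.current"
```

Schedule it with a crontab entry such as `0 2 * * * root /usr/local/bin/du-report` (path is hypothetical), or drop it in /etc/cron.daily.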
The Future
• Linux by itself (with no special hardware) will eventually be able to
provide a stable, high-performance storage system – in 5-10 years

• For now, buy the best block storage hardware you can (good RAID
card and drive enclosure) and directly attach it to a Linux server, or

• Buy “enterprise” object-level (provides NFS) storage systems from a

major vendor (Dell, EMC / Isilon, NetApp, etc)
The Future: btrfs, Ceph and ZFS, Oh my!
• Ceph can provide good block-level storage now, but CephFS isn’t
ready yet. You may be able to use Ceph with XFS over NFS, but you
will not get vendor support. Ceph is still evolving rapidly.

• In 5-10 years, btrfs, CephFS or ZFS on Linux may make more sense
than specialized storage hardware

• The in-kernel Linux NFS server needs to be put down.

Ganesha NFS will be ready soon (is ready now for some workloads)
For Now: Use Gluster
• Gluster is conceptually very simple. It makes sense and it’s stable

• Gluster takes simple Linux block-level storage, puts XFS on it, and
duplicates data at the file level across multiple filesystems / servers

• To get good Gluster performance, buy good block storage (good RAID
card, fast disks) and use native gluster client tooling or Ganesha NFS

Summary
• Red Hat does more than any company (possibly except IBM) to make
Linux great
• Buy supported hardware and burn it in
• Upgrade your BIOS and firmware, run your system fans at high
• Upgrade your Linux kernel
• Upgrade your Perl
• Use NFS wisely and apply the tricks as necessary
• Don’t use swap
• Do use XFS
• Lower your dirty caching tunables and keep app data in memory
• Wait a few years for Ceph to become more stable
Big Data on Little Linux
or how I learned to stop worrying and love upgrades