Storage Solutions for Securing Data
My Journey: Niyam Bhushan, One of India's Most Passionate OSS Evangelists
CONTENTS January 2018 ISSN-2456-4885

FOR U & ME
30 25 Tricks to Try Out on Thunderbird and SeaMonkey

ADMIN
40 DevOps Series: Deploying Graylog Using Ansible
56 Build Your Own Cloud Storage System Using OSS
A Hands-on Guide on Virtualisation with VirtualBox

DEVELOPERS
82 Machines Learn in Many Different Ways
85 Regular Expressions in Programming Languages: Java for You

COLUMNS
77 CodeSport
80 Exploring Software: Python is Still Special

REGULAR FEATURES
07 FossBytes
20 New Products
104 Tips & Tricks
Ph: (011) 26810602, 26810603; Fax: 26817563
E-mail: info@efy.in
MISSING ISSUES
E-mail: support@efy.in
BACK ISSUES
Kits ‘n’ Spares
New Delhi 110020
Ph: (011) 26371661, 26371662
E-mail: info@kitsnspares.com
NEWSSTAND DISTRIBUTION
Ph: 011-40596600
E-mail: efycirc@efy.in
ADVERTISEMENTS
MUMBAI
Ph: (022) 24950047, 24928520
E-mail: efymum@efy.in
BENGALURU
Ph: (080) 25260394, 25260023
JAPAN
Tandem Inc., Ph: 81-3-3541-4166
E-mail: japan@efy.in
SINGAPORE
Publicitas Singapore Pte Ltd
Ph: +65-6836 2272
E-mail: singapore@efy.in
TAIWAN
J.K. Media, Ph: 886-2-87726780 ext. 10
E-mail: taiwan@efy.in
UNITED STATES
E & Tech Media
Ph: +1 860 536 6677
E-mail: usa@efy.in
On the DVD
Ubuntu Desktop 17.10 (64-bit, Live): Ubuntu comes with everything you need to run your organisation, school, home or enterprise.
Fedora Workstation 27
MX Linux 17
Recommended system requirements: P4, 1GB RAM, DVD-ROM drive.
Note: Any objectionable content is unintended and should be attributed to the nature of Internet data.
DVD team e-mail: cdteam@efy.in
Red Hat OpenShift Container Platform 3.7 released
Red Hat has launched OpenShift Container Platform 3.7, the latest version of Red Hat's enterprise-grade Kubernetes container application platform. As application complexity and cloud incompatibility increase, Red Hat OpenShift Container Platform 3.7 will help IT organisations to build and manage applications that use services from the data centre to the public cloud.
The latest version of the industry's most comprehensive enterprise Kubernetes platform includes native integrations with Amazon Web Services (AWS) service brokers, which enable developers to bind services across AWS and on-premise resources to create modern applications while providing a consistent, open standards-based foundation to drive business evolution.
"We are excited about our collaboration with Red Hat and the general availability of the first AWS service brokers in Red Hat OpenShift. The ability to seamlessly configure and deploy a range of AWS services from within OpenShift will allow our customers to benefit from AWS's rapid pace of innovation, both on-premises and in the cloud," said Matt Yanchyshyn, director, partner solution architecture, Amazon Web Services, Inc.
Red Hat OpenShift Container Platform 3.7 will ship with the OpenShift template broker, which turns any OpenShift template into a discoverable service for application developers using OpenShift. OpenShift templates are lists of OpenShift objects that can be implemented within specific parameters, making it easier for IT organisations to deploy reusable, composite applications comprising microservices.

Building secure container infrastructure with Kata Containers
The OpenStack Foundation has announced a new open source project—Kata Containers, which aims to unite the security advantages of virtual machines (VMs) with the speed and manageability of container technologies. The project is designed to be hardware agnostic and compatible with Open Container Initiative (OCI) specifications, as well as the container runtime interface (CRI) for Kubernetes.
Intel is contributing its open source Intel Clear Containers project and Hyper is contributing its runV technology to initiate the project. Besides Intel and Hyper, 99cloud, AWcloud, Canonical, China Mobile, City Network, CoreOS, Dell/EMC, EasyStack, Fiberhome, Google, Huawei, JD.com, Mirantis, NetApp, Red Hat, SUSE, Tencent, Ucloud, UnitedStack, and ZTE are also supporting the project's launch.
The Kata Containers project will initially comprise six components, which include the agent, runtime, proxy, shim, kernel and packaging of QEMU 2.9. It is designed to be architecture agnostic and to run on multiple hypervisors. Kata Containers offers the ability to run container management tools directly on bare metal.
"The Kata Containers Project is an exciting addition to the OpenStack Foundation family of projects. Lighter, faster, more secure VM technology fits perfectly into the OpenStack Foundation family and aligns well with Canonical's data centre efficiency initiatives. Like Clear Containers and Hyper.sh previously, Kata Container users will find their hypervisor and guests well supported on Ubuntu," said Dustin Kirkland, vice president, product, Canonical.

Fedora 27 released
The Fedora Project, a Red Hat sponsored and community-driven open source collaboration, has announced the general availability of Fedora 27. All editions of Fedora 27 are built from a common set of base packages and, as with all new Fedora releases, these packages have seen numerous tweaks, incremental improvements and new additions. For Fedora 27, this includes the GNU C Library 2.26 and RPM 4.14.
"Building and supporting the next generation of applications remains a critical focus for the Fedora community, showcased in Fedora 27 by our continued support and refinement of system containers and containerised services like Kubernetes and Flannel. More traditional developers and end users will be pleased
To buy an ezine edition, visit www.magzter.com and choose Open Source For You.
FOSSBYTES
It has become increasingly evident that the future of AI needs more than just ethical direction and government oversight. It would be comforting to know that the tech giants are on the same page too. The machines, and the humans who will rely on them, need the biggest companies building AI to take on a fair share of responsibility for the future.

Microsoft launches Azure location based services
Addressing a gathering at Automobility LA 2017 in Los Angeles, California, Sam George, director – Azure IoT, Microsoft, said, "Microsoft is making an effort to solve mobility challenges and bring government bodies, private companies and automotive OEMs together, using Microsoft's intelligent cloud platform."
The new location capabilities will provide cloud developers critical geographical data to power smart cities and Internet of Things (IoT) solutions across industries, including manufacturing, automotive, logistics, urban planning and retail.
TomTom Telematics will be the first official partner for the service, supplying critical location and real-time traffic data, providing Microsoft customers with advanced location and mapping capabilities. Microsoft's Azure location based services will offer enterprise customers location capabilities integrated in the cloud to help any industry improve traffic flow. Microsoft also announced that Azure LBS will be launched in 2018, and will be available globally in more than 30 languages.

Four tech giants using Linux change their open source licensing policies
The GNU Public License version 2 (GPLv2) is arguably the most important open source licence for one reason—Linux uses it. On November 27, 2017, three tech powerhouses that use Linux—Facebook, Google and IBM—as well as the major Linux distributor Red Hat, announced they would extend additional rights to help companies that have made GPLv2 open source licence compliance errors and mistakes.
The GPLv2 and its close relative, the GNU Lesser General Public License (LGPL), are widely used open source software licences. When GPL version 3 (GPLv3) was released, it came with an express termination approach. This termination policy in GPLv3 provided a way for companies to correct licensing errors and mistakes. This approach allows licence compliance enforcement that is consistent with community norms.

FreeNAS 11.1 provides greater performance and cloud integration
FreeNAS 11.1 adds cloud integration and OpenZFS performance improvements, including the ability to prioritise 'resilvering' operations, and preliminary Docker support to the world's most popular software-defined storage operating system. It also adds a cloud sync (data import/export to the cloud) feature, which lets you sync (similar to back up), move (erase from source) or copy (only changed data) data to and from public cloud providers that include Amazon S3 (Simple Storage Services), Backblaze B2 Cloud, Google Cloud and Microsoft Azure.
OpenZFS has noticeable performance improvements for handling multiple snapshots and large files. Resilver Priority has been added to the 'Storage' screen of the graphical user interface, allowing you to configure 'resilvering' at a higher priority at specific times. This helps to mitigate the inherent challenges and risks associated with storage array rebuilds on very large capacity drives.
The latest release includes an updated preview of the beta version of the new administrator graphical user interface, including the ability to select display themes. It can be downloaded from freenas.org/download.

For more news, visit www.opensourceforu.com
DRIVING TECHNOLOGY, INNOVATION & INVESTMENTS
KTPO Whitefield, Bengaluru | Colocated shows | www.IndiaElectronicsWeek.com

IoTshow.in
India's #1 IoT show. At Electronics For You, we strongly believe that India has the potential to become a superpower in the IoT space in the upcoming years. All that's needed are platforms for different stakeholders of the ecosystem to come together. We've been building one such platform: IoTshow.in, an event for the creators, the enablers and the customers of IoT. In February 2018, the third edition of IoTshow.in will bring together a B2B expo, technical and business conferences, the Start-up Zone, demo sessions of innovative products, and more.
Who should attend?
• Creators of IoT solutions: OEMs, design houses, CEOs, CTOs, design engineers, software developers, IT managers, etc
• Enablers of IoT solutions: Systems integrators, solutions providers, distributors, resellers, etc
• Business customers: Enterprises, SMEs, the government, defence establishments, academia, etc
Why you should attend
• Get updates on the latest technology trends that define the IoT landscape
• Get a glimpse of products and solutions that enable the development of better IoT solutions
• Connect with leading IoT brands seeking channel partners and systems integrators
• Connect with leading suppliers/service providers in the electronics, IT and telecom domains who can help you develop better IoT solutions, faster
• Network with the who's who of the IoT world and build connections with industry peers
• Find out about IoT solutions that can help you reduce costs or increase revenues
• Get updates on the latest business trends shaping the demand and supply of IoT solutions

EFY Expo 2018
Is there a show in India that showcases the latest in electronics manufacturing, such as rapid prototyping, rapid production and table top manufacturing? Yes, there is now: EFY Expo 2018. With this show's focus on the areas mentioned, and it being co-located at India Electronics Week, it has emerged as India's leading expo on the latest manufacturing technologies and electronic components.
Who should attend?
• Manufacturers: CEOs, MDs and those involved in firms that manufacture electronics and technology products
• Purchase decision makers: CEOs, purchase managers, production managers and those involved in electronics manufacturing
• Technology decision makers: Design engineers, R&D heads and those involved in electronics manufacturing
• Channel partners: Importers, distributors and resellers of electronic components, tools and equipment
• Investors: Startups, entrepreneurs, investment consultants and others interested in electronics manufacturing
Why you should attend
• Get updates on the latest technology trends in rapid prototyping and production, and in table top manufacturing
• Get connected with new suppliers from across India to improve your supply chain
• Connect with OEMs, principals and brands seeking channel partners and distributors
• Connect with foreign suppliers and principals to represent them in India
• Explore new business ideas and investment opportunities in this sector

LEDAsia.in
Our belief is that the LED bulb is the culmination of various advances in technology. And such a product category and its associated industry cannot grow without focusing on the latest technologies. But, while there are some good B2B shows for LED lighting in India, none has a focus on 'the technology that powers lights'. Thus, the need for LEDAsia.in.
Who should attend?
• Tech decision makers: CEOs, CTOs, R&D and design engineers, and those developing the latest LED-based products
• Purchase decision makers: CEOs, purchase managers and production managers from manufacturing firms that use LEDs
• Channel partners: Importers, distributors and resellers of LEDs and LED lighting products
• Investors: Startups, entrepreneurs and investment consultants interested in this sector
• Enablers: System integrators, lighting consultants and those interested in smarter lighting solutions (thanks to the co-located IoTshow.in)
Why you should attend
• Get updates on the latest technology trends defining the LED and LED lighting sector
• Get a glimpse of the latest components, equipment and tools that help manufacture better lighting products
• Get connected with new suppliers from across India to improve your supply chain
• Connect with OEMs, principals and lighting brands seeking channel partners and systems integrators
• Connect with foreign suppliers and principals to represent them in India
• Explore new business ideas and investment opportunities in the LED and lighting sector
• Get an insider's view of 'IoT + Lighting' solutions that make lighting smarter

T&M India
Test & Measurement India (T&M India) is Asia's leading exposition for test & measurement products and services. Launched in 2012 as a co-located show along with Electronics For You Expo, it has established itself as the must-attend event for users of T&M equipment, and a must-exhibit event for suppliers of T&M products and services. In 2015, T&M India added an important element by launching the T&M Showcase, a platform for showcasing the latest T&M products and technologies. Being a first-of-its-kind event in India, the T&M Showcase was well received by the audience and the exhibitors.
Who should attend?
• Senior technical decision makers from manufacturing, design, R&D and trade channel organisations
• Senior business decision makers from manufacturing, design, R&D and trade channel organisations
• R&D engineers
• Design engineers
• Test & maintenance engineers
• Production engineers
• Academicians
• Defence and defence electronics personnel
Why you should attend
• India's only show focused on T&M for electronics
• Experience the latest T&M solutions first-hand
• Explore trade channel opportunities from Indian and foreign OEMs
• Attend demo sessions of the latest T&M equipment launched in India
• Special passes for defence

The themes
• Profit from IoT • Rapid prototyping and production • Table top manufacturing • LEDs and LED lighting
To get more details on how exhibiting at IEW 2018 can help you achieve your sales and marketing goals,
www.IndiaElectronicsWeek.com
At the start of a brand new year, I looked into the crystal ball to figure out the areas that no technologist can afford to ignore. Here is my rundown of some of the top trends that will define 2018.

Automation and artificial intelligence
Two of the most talked about trends are increasingly utilising open source. Companies including Google, Amazon and Microsoft have released the code for Open Network Automation Platform (ONAP) software frameworks that are designed to help developers build powerful AI applications.
In fact, Gartner says that artificial intelligence is going to widen its net to include data preparation, integration, algorithm selection, training methodology selection and model creation. I can point out many examples right now, such as chatbots, autonomous vehicles and drones, video games, as well as other real-life scenarios such as design, training and visualisation processes.

Open source containers are no longer orphans
DevOps ecosystems are now seeing the widespread adoption of containers like Docker for open source e-commerce development.
Containers are one of the hottest tickets in open source technology. You can imagine them as a lightweight packaging of application software that has all its dependencies bundled for easy portability. This removes a lot of hassles for enterprises, as they cut down on costs and time.
According to 451 Research, the market is expected to grow by more than 250 per cent between 2016 and 2020. Microsoft recently contributed to the mix by launching its Virtual Kubelet connector for Azure, streamlining the whole container management process.

Blockchain finds its footing
As Bitcoin is likely to hit the US$ 20,000 mark, we're all in awe of the blockchain technology behind all the cryptocurrencies. Other industries are expected to follow suit, such as supply chain, healthcare, government services, etc.
The fact that it's not controlled by any single authority and has no single point of failure makes it a very robust, transparent and incorruptible technology. Russia has also become one of the first countries to embrace the technology by piloting its banking industry's first ever payment transaction. Sberbank, Russia's biggest bank by assets, has executed a real-time money transfer over an IBM-built blockchain based on the Hyperledger open source collaborative project.
One more case in point is a consortium comprising more than a dozen food companies and retailers—including Walmart, Nestle and Tyson Foods—dedicated to using blockchain technology to gather better information on the origin and state of food.

IoT-related open source tools/libraries
IoT has already made its presence felt. Various open source tools are now available that are a perfect match for the IoT challenges, such as Arduino, Home Assistant, Zetta, Device Hive and ThingSpeak. Open source has already served as the foundation for IoT's growth till now and will continue to do so.

OpenStack to gain more acceptance
OpenStack has enjoyed tremendous success since the beginning, with its exciting and creative ways to utilise the cloud. But it lags behind when it comes to adoption, partly due to its complex structure and dependence on virtualisation, servers and extensive networking resources.
But new fixes are in the works as several big software development and hosting companies work overtime to resolve the underlying challenges. In fact, OpenStack has now expanded its scope to include containers with the recent launch of the Kata Containers project.
Open source is evolving at a great pace, which presents tremendous opportunities for enterprises to grow bigger and better. Today, the cloud also shares a close bond with open source apps, with the services of various big cloud companies like AWS, Google Cloud and Microsoft Azure being quite open source-friendly. I can think of no better way to say this: open source is poised to be the driver behind various innovations. I'd love to hear your thoughts on other trends that will dominate 2018. Do drop me a line (dinesh@railsfactory.com).

By: Dinesh Kumar
The author is the CEO of Sedin Technologies and the co-founder of RailsFactory. He is a passionate proponent of open source and keenly observes the trends in this space. In this new column, he digs into his experience of servicing over 200 major global clients across the USA, UK, Australia, Canada and India.
Leather headset from Astrum
Leading 'new technology' brand Astrum has unveiled a travel-friendly headset – the HT600. The affordable yet stylish headphones come in a lightweight, compact design with no wires. The headset's twist-folding design allows compact storage, making it easily portable, and it is packed with its own hard case and pouch.

Garmin Vivoactive smartwatch
The Garmin Vivoactive smartwatch supports pocket-friendly payments, and has 15 preloaded sports apps along with inbuilt GPS functionality. With the company's in-house chroma display with LED backlighting, the smartwatch features a 3.04cm (1.2 inch) screen with a 240 x 240 pixel resolution. Its display is protected by Corning Gorilla Glass 3 with a stainless steel bezel, and the case is made of fibre-reinforced polymer. The device offers 11 hours of battery life in GPS mode, company sources claim, and seven days in smartwatch mode.
Compatible with all Android and iOS devices, the Garmin Vivoactive smartwatch is available in black and white colours via selected retail and online stores.
Address: Garmin India, D186, 2nd Floor, Yakult Building, Okhla Industrial Area, Phase 1, New Delhi – 110020; Ph: 09716661666
"My Love Affair with Freedom"
Wearing geeky eyewear, this dimple-chinned man looks content with his life. When asked about his sun sign, he mimes the sun with its rays, but does not reveal his zodiac sign. Yes, this is the creative and very witty Niyam Bhushan, who has kickstarted a revolution in UX design in India through the workshops conducted by his venture DesignRev.in. In a tete-a-tete with Syeda Beenish of OSFY, this industry veteran, who has spent 30 odd years in understanding and sharing the value of open source with the masses, speaks passionately about the essence of open source. Excerpts:
Your definition of open source: Muft and mukt is a state of mind, not software
Favourite book: ‘The Cathedral and the Bazaar’ by Eric S. Raymond
Pastime: Tasting the timeless through meditation
Favourite movie: ‘Snowden’ by Oliver Stone
Dream destination: Bhutan, birthplace of ‘Schumacher Economics’ that
gives a more holistic vision to the open source philosophy
Idol: Osho, a visionary who talked about true freedom and how to exercise
your individual freedom in your society
In 2004, Google introduced its Gmail service with a 1GB mailbox and free POP access. This was at a time when most people had email accounts with their ISP or had free Web mail accounts with Hotmail or Yahoo. Mailbox storage was limited to measly amounts such as 5MB or 10MB. If you did not regularly purge old messages, then your incoming mail would bounce with the dreaded 'Inbox full' error. Hence, it was a standard practice to store email 'offline' using an email client. Each year now, a new generation of young people (mostly students) discovers the Internet, and they start with Web mail straight away. As popular Web mail services integrate online chatting as well, they prefer to use a Web browser rather than a desktop mail client to access email. This is sad, because desktop email clients represent one of those rare Internet technologies that can claim to have achieved perfection. This article will bring readers up to speed on Thunderbird, the most popular FOSS email client.

Why use a desktop email client?
With an email client, you store emails offline. After the email application connects to your mail server and downloads new mail, it instructs the server to delete those messages from your mailbox (unless configured otherwise). This has several advantages.
• If your account gets hacked, the hacker will not get your archived messages. This also limits the fallout on your other accounts, such as those of online banking.
• Web mail providers such as Gmail read your messages to display 'relevant' advertisements. This is creepy, even if it is software-driven.
• Email clients let you read and compose messages offline. A working Net connection is not required. Web mail requires you to log in first.
• Web mail providers such as Gmail automatically tell your contacts whether you are online or if your camera is on. Email clients do not do this.
• Modern Web browsers take many liberties without asking. Chrome, by default, listens to your microphone and uploads conversations to Google servers (for your convenience, of course). Email clients are not like that.
• Searching archived messages is extremely powerful on desktop mail clients. There is no paging of the results.
When popular Web mail providers offer free POP access, why suffer the slowness of the Web?

POP or IMAP access to email
Email clients use two protocols, POP and IMAP, to receive mail. POP is ideal if you want to download and delete mail. IMAP is best if you need access on multiple devices or at different locations. POP is more prevalent than IMAP. For offline storage, POP is the best. Popular Web mail providers provide both POP and IMAP access. Before you can use an email client, you will have to log in to your Web mail provider in a browser, check the settings and activate POP/IMAP access for incoming mail. Email clients use the SMTP protocol for outgoing mail. In Thunderbird/SeaMonkey, you may have to add SMTP server settings separately for each email account.
If you have lots of email already online, then it may not be possible to make your email client create an offline copy in one go. Each time you choose to receive messages, the mail client will download a few hundred of your old messages. After it has downloaded all your old archived messages, the mail client will then settle down to downloading only your newest messages.
The settings for some popular Web mail services are as follows:
Hotmail/Live/Outlook
• POP: pop-mail.outlook.com
• SMTP: smtp-mail.outlook.com
Gmail
• POP: pop.gmail.com
• SMTP: smtp.gmail.com
Yahoo
• POP: pop.mail.yahoo.com
• SMTP: smtp.mail.yahoo.com

Figure 1: Live off the grid with no mail online. To get this Gmail note, you will have to empty the Inbox and Trash, and also delete all archived messages.

The following settings are common for them:
POP
• Connection security/Encryption method: SSL
• Port: 995
SMTP
• Connection security/Encryption method: SSL/TLS/STARTTLS
• Port: 465/587
Some ISPs and hosting providers provide unencrypted mail access. Here, the connection security method will be 'None', and the ports are set to 110 for POP and 25 for SMTP. However, please be aware that most ISPs block Port 25, and many mail servers block mail originating from that port.

Even on a desktop screen, space may be at a premium. Currently, Thunderbird and SeaMonkey do not provide an easy way to customise the date columns. I use this trick in the launcher command to fix it:

export LC_TIME=en_DK.UTF-8 && seamonkey -mail
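These host/port combinations can be sanity-checked from a terminal before configuring the mail client. This quick check is not from the original article, and it assumes the OpenSSL command line tool is installed:

# test the SSL POP and SMTP ports, and STARTTLS on port 587
$ openssl s_client -connect pop.gmail.com:995 -quiet
$ openssl s_client -connect smtp.gmail.com:465 -quiet
$ openssl s_client -connect smtp.gmail.com:587 -starttls smtp -quiet

If the connection succeeds, the server's greeting (+OK for POP, 220 for SMTP) appears; a timeout usually means the ISP is blocking that port.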
Apart from email, Thunderbird can also display content from RSS feeds (as shown in Figure 4) and Usenet forums (as shown in Figure 5).

Figure 4: Thunderbird is also an RSS feed reader

Usenet newsgroups predate the World Wide Web. They are like an online discussion forum organised into several hierarchical groups. Forum participants post messages in the form of an email addressed to a newsgroup (say comp.lang.javascript), and the NNTP client threads the discussions based on the subject line (Google Groups is a Web based interface into the world of Usenet).

Email backup
When you store email offline, the burden of doing regular backups falls on you. You also need to ensure that your computer is not vulnerable to malware such as email viruses. Web mail providers do a good job of eliminating email-borne malware, but malware can still arrive from other sources. Windows computers are particularly vulnerable to malware spread by USB drives and browser toolbars and extensions. In Windows, simply creating a directory named 'autorun.inf' at the root level stops most USB drive infections.
SeaMonkey stores all its data (email messages and accounts, RSS feeds, website user names/passwords/preferences, etc) in the ~/.mozilla/seamonkey directory. For backup, just zip this directory regularly. If you move to a new GNU/Linux system, restore the backed-up directory to your new ~/.mozilla directory.
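As a concrete sketch of that backup routine (not from the original article; the path assumes a default SeaMonkey profile location), a dated archive can be created with standard tools:

# archive the SeaMonkey profile (mail, accounts, passwords, feeds)
$ zip -r seamonkey-backup-$(date +%F).zip ~/.mozilla/seamonkey

Restoring is just a matter of unzipping the archive back into ~/.mozilla on the new system.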
SeaMonkey ChatZilla
Apart from the Firefox-based browser and the Thunderbird-based email client, SeaMonkey also bundles an IRC chat client. IRC is yet another Internet-based communication protocol that does not use the World Wide Web. It is the preferred medium of communication for hackers. Here is a link for starters: irc://chat.freenode.net/.

By: V. Subhash
The author is a writer, illustrator, programmer and FOSS fan. His website is at www.vsubhash.com. You can contact him at tech.writer@outlook.com.
Read the latest stories on security electronics at www.electronicsb2b.com:
• The latest in CCTV cameras: manufacturers can look forward to a bright future
• Video analytics systems: turning a CCTV camera into a proactive tool
• Security cameras evolving with new technologies
• The latest in dome cameras
• The latest in weather-proof and vandal-resistant security cameras
Log on to www.electronicsb2b.com and be in touch with the Electronics B2B Fraternity 24x7
Tutorials
Latest News
Feature Stories
Interviews from the world of open source
www.OpenSourceForU.com
You can also submit your tips, contribute with your ideas or extend your subscription directly from the website.
Remember to follow us on Twitter (@OpenSourceForU) and like us on Facebook (Facebook.com/OpenSourceForU) to get regular updates on open source developments.
Admin How To
A Hands-on Guide on
Virtualisation with VirtualBox
Virtualisation is the process of creating a software-based (or virtual) representation of a resource rather than a physical one. Virtualisation is applicable at the compute, storage or network levels. In this article, we will discuss compute level virtualisation, which is commonly referred to as server virtualisation.
Server virtualisation (henceforth referred to as virtualisation) allows us to run multiple instances of operating systems (OS) simultaneously on a single server. These OSs can be of the same or of different types. For instance, you can run Windows as well as Linux OS on the same server simultaneously. Virtualisation adds a software layer on top of the hardware, which allows users to share physical hardware (memory, CPU, network, storage and so on) with multiple OSs. This virtualisation layer is called the virtual machine manager (VMM) or a hypervisor. There are two types of hypervisors.
Bare metal hypervisors: These are also known as Type-1 hypervisors and are directly installed on hardware. This enables the sharing of hardware resources with a guest OS (henceforth referred to as 'guest') running on top of them. Each guest runs in an isolated environment without interfering with other guests. ESXi, Xen, Hyper-V and KVM are examples of bare metal hypervisors.
Hosted hypervisors: These are also known as Type-2 hypervisors. They cannot be installed directly on hardware. They run as applications and hence require an OS to run them. Similar to bare metal hypervisors, they are able to share physical resources among multiple guests and the physical host on which they are running. VMware Workstation and Oracle VM VirtualBox (hereafter referred to as VirtualBox) are examples of hosted hypervisors.

An introduction to VirtualBox
VirtualBox is cross-platform virtualisation software. It is available on a wide range of platforms like Windows, Linux, Solaris, and so on. It extends the functionality of the existing OS and allows us to run multiple guests simultaneously along with the host's other applications.

VirtualBox terminology
To get a better understanding of VirtualBox, let's get familiar with its terminology.
1) Host OS: This is a physical or virtual machine on which VirtualBox is installed.
2) Virtual machine: This is the virtual environment created to run the guest OS. All its resources, like the CPU, memory, storage, network devices, etc, are virtual.
3) Guest OS: This is the OS running inside VirtualBox. VirtualBox supports a wide range of guests like Windows, Solaris, Linux, Apple, and so on.
4) Guest additions: These are additional software bundles
installed inside a guest to improve its performance and extend its functionality. For instance, these allow us to share folders between the host and guest, and provide drag-and-drop functionality.

Features of VirtualBox
Let us discuss some important features of VirtualBox.
1) Portability: VirtualBox is highly portable. It is available on a wide range of platforms and its functionality remains identical on each of those platforms. It uses the same file and image format for VMs on all platforms. Because of this, a VM created on one platform can be easily migrated to another. In addition, VirtualBox supports the Open Virtualisation Format (OVF), which enables VM import and export functionality.
2) Commodity hardware: VirtualBox can be used on a CPU that doesn't support hardware virtualisation instructions, like Intel's VT-x or AMD-V.
3) Guest additions: As stated earlier, these software bundles are installed inside a guest, and enable advanced features like shared folders, seamless windows and 3D virtualisation.
4) Snapshot: VirtualBox allows the user to take consistent snapshots of the guest. It records the current state of the guest and stores it on disk. It allows the user to go back in time and revert the machine to an older configuration.
5) VM groups: VirtualBox allows the creation of a group of VMs and represents them as a single entity. We can perform various operations on that group, like Start, Stop, Pause, Reset, and so on.

Getting started with VirtualBox

System requirements
VirtualBox runs as an application on the host machine and, for it to work properly, the host must meet the following hardware and software requirements:
1) An Intel or AMD CPU
2) A 64-bit processor with hardware virtualisation is required to run 64-bit guests
3) 1GB of physical memory
4) Windows, OS X, Linux or Solaris host OS

Downloading and installation
To download VirtualBox, visit https://www.virtualbox.org/wiki/Downloads. It provides software packages for Windows, OS X, Linux and Solaris hosts. In this column, I'll be demonstrating VirtualBox on Mint Linux. Refer to the official documentation if you wish to install it on other platforms.
For Debian based Linux, it provides the '.deb' package. Its format is virtualbox-xx_xx-yy-zz.deb, where xx_xx and yy are the version and build number respectively, and zz is the host OS's name and platform. For instance, in the case of a Debian based 64-bit host, the package name is virtualbox-5.2_5.2.0-118431-Ubuntu-xenial_amd64.deb. To begin installation, execute the command given below in a terminal and follow the on-screen instructions:

$ sudo dpkg -i virtualbox-5.2_5.2.0-118431-Ubuntu-xenial_amd64.deb

Using VirtualBox
After successfully installing VirtualBox, let us get our hands dirty by first starting VirtualBox from the desktop environment. It will launch the VirtualBox manager window as shown in Figure 1.

Figure 1: VirtualBox manager

This is the main window from which you can manage your VMs. It allows you to perform various actions on VMs, like Create, Import, Start, Stop, Reset and so on. At this moment, we haven't created any VMs; hence, the left pane is empty. Otherwise, a list of VMs is displayed there.

Creating a new VM
Let us create a new VM from scratch. Follow the instructions given below to create a virtual environment for OS installation.
1) Click the 'New' button on the toolbar.
2) Enter the guest's name, its type and version, and click the 'Next' button to continue.
3) Select the amount of memory to be allocated to the guest and click the 'Next' button.
4) From this window we can provide storage to the VM. It allows us to create a new virtual hard disk or use an existing one.
4a) To create a new virtual hard disk, select the 'Create a virtual hard disk now' option and click the 'Create' button.
4b) Select the VDI disk format and click on 'Continue'.
4c) On this page, we can choose between a storage policy that is either dynamically allocated or a fixed size:
i) As the name suggests, a dynamically allocated disk will grow on demand up to the maximum provided size.
ii) A fixed size allocation will reserve the required storage upfront. If you are concerned about performance, then go with a fixed size allocation.
4d) Click the 'Next' button.
5) Provide the virtual hard disk's name, location and size before clicking on the 'Create' button.
This will show a newly created VM on the left pane, as seen in Figure 2.
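The same VM can also be scaffolded from a terminal with VBoxManage, the CLI introduced later in this article. This is a sketch rather than part of the original walkthrough; the VM name, OS type, memory and disk sizes are illustrative:

# create and register the VM, then allocate 2GB of memory
$ VBoxManage createvm --name "ubuntu-guest" --ostype Ubuntu_64 --register
$ VBoxManage modifyvm "ubuntu-guest" --memory 2048

# create a dynamically allocated 20GB VDI disk and attach it via a SATA controller
$ VBoxManage createmedium disk --filename ubuntu-guest.vdi --size 20480 --format VDI
$ VBoxManage storagectl "ubuntu-guest" --name "SATA" --add sata
$ VBoxManage storageattach "ubuntu-guest" --storagectl "SATA" --port 0 --device 0 --type hdd --medium ubuntu-guest.vdi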
Installing a guest OS
To begin OS installation, we need to attach an ISO image to the VM. Follow the steps given below:
1) Select the newly created VM.
2) Click the 'Settings' button on the toolbar.
3) Select the storage option from the left pane.
4) Select the optical disk drive from the storage devices.
5) Provide the path of the ISO image and click the 'OK' button. Figure 3 depicts the first five steps.
6) Select the VM from the left pane. Click the 'Start' button on the toolbar. Follow the on-screen instructions to complete the OS installation.

Figure 2: Creating a VM
Figure 3: Installing the OS

VM power actions
Let us understand VM power actions in detail.
1) Power On: As the name suggests, this starts the VM in the state in which it was powered off or saved. To start the VM, right-click on it and select the 'Start' option.
2) Pause: In this state, the guest releases the CPU but not the memory. As a result, the contents of the memory are preserved when the VM is resumed. To pause the VM, right-click on it and select the 'Pause' option.
3) Save: This action saves the current VM state and releases the CPU as well as the memory. The saved machine can be started again in the same state. To save the VM, right-click on it and select the 'Close->Save State' option.
4) Shutdown: This is a graceful turn-off operation. In this case, the shutdown signal is sent to the guest. To shut down the VM, right-click on it and select the 'Close->ACPI Shutdown' option.
5) Poweroff: This is a non-graceful turn-off operation. It can cause data loss. To power off the VM, right-click on it and select the 'Close->Poweroff' option.
6) Reset: The Reset option turns the VM off and then on again. It is different from Restart, which is a graceful operation. To reset the VM, right-click on it and select the 'Reset' option.

Figure 4: Starting the VM

Removing the VM
Let us explore the steps we need to take to remove a VM. The remove operation can be broken up into two parts.
1) Unregister VM: This removes the VM from the library, i.e., it will just unregister the VM from VirtualBox so that it won't be visible in VirtualBox Manager. To unregister a VM, right-click on it, select the 'Remove' option and click the 'Remove Only' option. You can re-register this VM by navigating to the 'Machine->Add' option from VirtualBox Manager.
2) Delete VM: This action is used to delete the VM permanently. It will delete the VM's configuration files and virtual hard disks. Once performed, this action cannot be undone. To remove a VM permanently, right-click on it, select the 'Remove' option and click the 'Delete all files' option.
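For reference, and not part of the original text, each of these actions also has a VBoxManage equivalent (the VM name 'ubuntu-guest' is assumed):

$ VBoxManage controlvm "ubuntu-guest" pause             # Pause
$ VBoxManage controlvm "ubuntu-guest" resume            # resume a paused VM
$ VBoxManage controlvm "ubuntu-guest" savestate         # Save
$ VBoxManage controlvm "ubuntu-guest" acpipowerbutton   # graceful Shutdown
$ VBoxManage controlvm "ubuntu-guest" poweroff          # non-graceful Poweroff
$ VBoxManage controlvm "ubuntu-guest" reset             # Reset

# unregister the VM and delete its files (like 'Delete all files')
$ VBoxManage unregistervm "ubuntu-guest" --delete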
VirtualBox — beyond the basics
Beginners will get a fair idea about virtualisation and VirtualBox by referring to the first few sections of this article. However, VirtualBox is a feature-rich product; this section describes its more advanced features.
VirtualBox supports several network modes: Not Attached, NAT, bridged adapters, internal networks and host-only adapters.
Perform the steps given below to view or manipulate the current network settings:
1) Select the VM from the VirtualBox Manager.
2) Click the 'Settings' button on the toolbar.
3) Select the 'Network' option from the left pane.
4) Select the adapter. The current networking mode will be displayed under the 'Attached to' drop-down box.
5) To change the mode, select the required network mode from the drop-down box and click the 'OK' button.
Figure 9 illustrates the above steps.

VirtualBox network modes
Let us discuss each network mode briefly.
1) Not Attached: In this mode, VirtualBox reports to the guest that the network card is installed but not connected. As a result, networking is not possible in this mode. Compared with a physical machine, it is similar to the Ethernet card being present but with no cable connected to it.
2) NAT: This stands for Network Address Translation, and it is the default mode. If you want to access external networks from the guest, then this will serve your purpose. It is similar to a physical system connected to an external network via a router.
3) Bridged adapter: In this mode, VirtualBox connects to one of your installed network cards and exchanges network packets directly, circumventing the host operating system's network stack.
4) Internal: In this mode, communication is allowed between a selected group of VMs only. Communication with the host is not possible.
5) Host only: In this mode, communication is allowed between a selected group of VMs and the host. A physical Ethernet card is not required; instead, a virtual network interface (similar to a loopback interface) is created on the host.

An introduction to VBoxManage
VBoxManage is the command line interface (CLI) of VirtualBox. You can manage VirtualBox from your host via these commands. It supports all the features that are supported by the GUI. It gets installed by default when the VirtualBox package is installed. Let us look at some of its basic commands.

To turn on the VM
VBoxManage provides a simple command to start the VM. It accepts the VM name as an argument.
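The invocation itself is cut off in this extract; a typical form, with the VM name assumed, looks like this:

# start the VM with its normal window
$ VBoxManage startvm "ubuntu-guest"

# or start it headless (no window), which is handy on servers
$ VBoxManage startvm "ubuntu-guest" --type headless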
DevOps Series
Deploying Graylog Using Ansible
This 11th article in the DevOps series is a tutorial on installing the Graylog software using Ansible.

Graylog is free and open source log management software that allows you to store and analyse all your logs from a central location. It requires MongoDB (a document-oriented, NoSQL database) to store meta information and configuration information. The actual log messages are stored in Elasticsearch. It is written in the Java programming language and released under the GNU General Public License (GPL) v3.0.
Access control management is built into the software, and you can create roles and user accounts with different permissions. If you already have an LDAP server, its user accounts can be used with the Graylog software. It also provides a REST API, which allows you to fetch data to build your own dashboards. You can create alerts to take actions based on the log messages, and also forward the log data to other output streams. In this article, we will install the Graylog software and its dependencies using Ansible.

GNU/Linux
An Ubuntu 16.04.3 LTS guest virtual machine (VM) instance will be used to set up Graylog using KVM/QEMU. The host system is a Parabola GNU/Linux-libre x86_64 system. Ansible is installed on the host system using the distribution package manager. The version of Ansible used is:

plugins/modules', u'/usr/share/ansible/plugins/modules']
ansible python module location = /usr/lib/python2.7/site-packages/ansible
executable location = /usr/bin/ansible
python version = 2.7.14 (default, Sep 20 2017, 01:25:59) [GCC 7.2.0]

Add an entry to the /etc/hosts file for the guest 'ubuntu' VM as indicated below:

192.168.122.25 ubuntu

On the host system, let's create a project directory structure to store the Ansible playbooks:

ansible/inventory/kvm/
       /playbooks/configuration/
       /playbooks/admin/

An 'inventory' file is created inside the inventory/kvm folder that contains the following code:

ubuntu ansible_host=192.168.122.25 ansible_connection=ssh ansible_user=ubuntu ansible_password=password
The following extract from the playbook updates settings in the Graylog server configuration file:

- { regexp: 'root_password_sha2 =', replace: 'root_password_sha2 = eabb9bb2efa089223d4f54d55bf2333ebf04a29094bff00753536d7488629399' }
- { regexp: '#web_enable = false', replace: 'web_enable = true' }
- { regexp: '#web_listen_uri = http://127.0.0.1:9000/', replace: "web_listen_uri = http://{{ ansible_default_ipv4.address }}:9000/" }
- { regexp: 'rest_listen_uri = http://127.0.0.1:9000/api/', replace: "rest_listen_uri = http://{{ ansible_default_ipv4.address }}:9000/api/" }

The above playbook can be run using the following command:

$ ansible-playbook -i inventory/kvm/inventory playbooks/configuration/graylog.yml --tags graylog -K

Web interface
You can now open the URL http://192.168.122.25:9000 in a browser on the host system to see the default Graylog login page, as shown in Figure 1. The guest VM is a single node, and hence if you traverse to System -> Nodes, you will see this node's information, as illustrated in Figure 3. You can now test the Graylog installation by adding a data source as input by traversing System -> Input in the Web interface. The 'random HTTP message generator' is used as a local input, as shown in Figure 4.
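The REST API mentioned earlier can also be queried directly. As an illustrative sketch (the credentials and endpoint are assumed to match this setup), cluster information can be fetched with curl:

$ curl -u admin:password http://192.168.122.25:9000/api/cluster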
Big Data is a term used to refer to a huge collection of data that comprises both structured data found in traditional databases and unstructured data like text documents, video and audio. Big Data is not merely data but also a collection of various tools, techniques, frameworks and platforms. Transport data, search data, stock exchange data, social media data, etc, all come under Big Data.
Technically, Big Data refers to a large set of data that can be analysed by means of computational techniques to draw patterns and reveal the common or recurring points that would help to predict the next step—especially human behaviour, like future consumer actions based on an analysis of past purchase patterns.
Big Data is not about the volume of the data, but more about what people use it for. Many organisations like business corporations and educational institutions are using this data to analyse and predict the consequences of certain actions. After collecting the data, it can be used for several functions like:
• Cost reduction
• The development of new products
• Making faster and smarter decisions
• Detecting faults
Today, Big Data is used by almost all sectors, including banking, government, manufacturing, airlines and hospitality.
There are many open source software frameworks for storing and managing data, and Hadoop is one of them. It has a huge capacity to store data, has efficient data processing power and the capability to do countless jobs. It is a Java based programming framework, developed by Apache. There are many organisations using Hadoop—Amazon Web Services, Intel, Cloudera, Microsoft, MapR Technologies, Teradata, etc.

The history of Hadoop
Doug Cutting and Mike Cafarella are two important people in the history of Hadoop. They wanted to invent a way to return Web search results faster by distributing the data over several machines, so that several jobs could be performed at the same time. At that time, they were working on an open source search engine project called Nutch. But, at the same time, the Google search engine project also was in progress. So, Nutch was divided into two parts—one of the parts dealt with the processing of data, which the duo named Hadoop after the toy elephant that belonged to Cutting's son. Hadoop was released as an open source project in 2008 by Yahoo. Today, the Apache Software Foundation maintains the Hadoop ecosystem.

Prerequisites for using Hadoop
Linux based operating systems like Ubuntu or Debian are preferred for setting up Hadoop. Basic knowledge of Linux commands is helpful. Besides, Java plays an important role in the use of Hadoop. But people can use their preferred languages, like Python or Perl, to write the methods or functions.
There are four main libraries in Hadoop.
1. Hadoop Common: This provides utilities used by all other modules in Hadoop.
2. Hadoop MapReduce: This works as a parallel framework for scheduling and processing the data.
3. Hadoop YARN: This is an acronym for Yet Another Resource Negotiator. It is an improved version of MapReduce and is used for processes running over Hadoop.
4. Hadoop Distributed File System (HDFS): This stores data and maintains records over various machines or clusters. It also allows the data to be stored in an accessible format.
HDFS sends data to the server once and uses it as many times as it wants. When a query is raised, the NameNode manages all the DataNode slave nodes that serve the given query. Hadoop MapReduce performs all the jobs assigned sequentially. Instead of MapReduce, Pig Hadoop and Hive Hadoop are used for better performance.
Other packages that can support Hadoop are listed below.
• Apache Oozie: A scheduling system that manages processes taking place in Hadoop
• Apache Pig: A platform to run programs made on Hadoop
• Cloudera Impala: A processing database for Hadoop. Originally it was created by the software organisation Cloudera, but was later released as open source software
• Apache HBase: A non-relational database for Hadoop
• Apache Phoenix: A relational database based on Apache HBase
• Apache Hive: A data warehouse used for summarisation, querying and the analysis of data
• Apache Sqoop: Used to transfer data between Hadoop and structured data sources
• Apache Flume: A tool used to move data to HDFS
• Cassandra: A scalable multi-database system

The importance of Hadoop
Hadoop is capable of storing and processing large amounts of data of various kinds. There is no need to preprocess the data before storing it. Hadoop is highly scalable, as it can store and distribute large data sets over several machines running in parallel. This framework is free and uses cost-efficient methods.
Hadoop is used for:
• Machine learning
• Processing of text documents
• Image processing
• Processing of XML messages
• Web crawling
• Data analysis
• Analysis in the marketing field
• Study of statistical data
Hadoop has limitations too; for instance, it cannot handle a large number of small files efficiently. MapReduce programming is inefficient for jobs involving highly analytical skills. It is a distributed system with low level APIs, and some APIs are not useful to developers.
But there are benefits too. Hadoop has many useful functions like data warehousing, fraud detection and marketing campaign analysis. These are helpful to get useful information from the collected data. Hadoop also has the ability to duplicate data automatically, so multiple copies of data are used as a backup to prevent loss of data.

Frameworks similar to Hadoop
Any discussion on Big Data is never complete without a mention of Hadoop. But like with other technologies, a variety of frameworks that are similar to Hadoop have been developed. Other frameworks used widely are Ceph, Apache Storm, Apache Spark, DataTorrent RTS, Google BigQuery, Samza, Flink and Hydra.
MapReduce requires a lot of time to perform assigned tasks. Spark can fix this issue by doing in-memory processing of data. Flink is another framework that works faster than Hadoop and Spark. Hadoop is not efficient for real-time processing of data. Apache Spark uses stream processing of data, where continuous input and output of data happens. Apache Flink also provides a single runtime for the streaming of data and batch processing.
However, Hadoop is the preferred platform for Big Data analytics because of its scalability, low cost and flexibility. It offers an array of tools that data scientists need. Apache Hadoop with YARN transforms a large set of raw data into a feature matrix which is easily consumed. Hadoop makes machine learning algorithms easier.
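To make the division of labour between HDFS and MapReduce concrete, here is a minimal sketch, not from the article, of how a classic job is launched from the shell on a working cluster (the input file and paths are placeholders):

# copy local data into HDFS
$ hdfs dfs -mkdir -p input
$ hdfs dfs -put /var/log/syslog input/

# run the bundled WordCount example (MapReduce on YARN) and inspect the result
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount input output
$ hdfs dfs -cat output/part-r-00000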
Open source data backup software has become quite popular in recent times. One of the main reasons for this is that users have access to the code, which allows them to tweak the product. Open source tools are now being used in data centre environments because they are low cost and provide flexibility.
Let's take a look at three open source backup software packages that I consider the best. All three provide support for UNIX, Linux, Windows and Mac OS.

Amanda
This is one of the oldest open source backup software packages. It gets its name from the University of Maryland, where it was originally conceived. Amanda stands for the Advanced Maryland Disk Archive.
Amanda is a scheduling, automation and tracking program wrapped around native backup tools like tar (for UNIX/Linux) and zip (for Windows). The database that tracks all backups allows you to restore any file from a previous version of that file that was backed up by Amanda. This reliance on native backup tools comes with advantages and disadvantages. The biggest advantage, of course, is that you will never have a problem reading an Amanda tape on any platform. The formats Amanda uses are easily available on any open-systems platform. The biggest disadvantage is that some of these tools have limitations (e.g., path length) and Amanda will inherit those limitations.
On another level, Amanda is a sophisticated program that has a number of enterprise-level features, like automatically determining when to run your full backups, instead of having you schedule them. It's also the only open source package to have database agents for SQL Server, Exchange, SharePoint and Oracle, as well as the only backup package to have an agent for MySQL and Ingres.
Amanda is now backed by Zmanda, and this company has put its development into overdrive. Just a few months after beginning operations, Zmanda addressed major limitations in the product that had hindered it for years. Since then, it has been responsible for the addition of a lot of functionality, including those database agents.

Figure 1: Selecting files and folders for file system backup

Bacula
Bacula was originally written by Kern Sibbald, who chose a very different path from Amanda by writing a custom backup format designed to overcome the limitations of the native tools. Sibbald's original goal was to write a tool that could take the place of the enterprise tools he saw in the data centre.
Bacula also has scheduling, automation and tracking of all backups, allowing you to easily restore any file (or files) from a previous version. Like Amanda, it also has media management features that allow you to use automated tape libraries and perform disk-to-disk backups.

Figure 2: Bacula admin page

BackupPC
Both Amanda and Bacula feel and behave like conventional backup products. They have support for both disk and tape, scheduled full and incremental backups, and they come in a 'backup format'. BackupPC, on the other hand, is a disk-only backup tool that forever performs incremental backups, and stores those backups in their native format in a snapshot-like tree structure that is available via a GUI. Like Bacula, it's a file-only backup tool, and its incremental nature might be hampered by backing up large database files. However, it's a really interesting alternative for file data. BackupPC's single most imposing feature is that it does file-level de-duplication. If you have a file duplicated anywhere in your environment, it will find that duplicate and replace it with a link to the original file.

Figure 3: BackupPC server status

Which one should you use?
Choosing a data backup tool entirely depends on the purpose. If you want the least proprietary backup format, then go for BackupPC. If database agents are a big driver, you can choose Amanda. Or if you want a product that's designed like a typical commercial backup application, then opt for Bacula. One more important aspect is that both BackupPC and Amanda need a Linux server to control backups, while Bacula has a Windows server to do the same.
All three products are very popular. Which one you choose depends on what you need. The really nice thing about all three tools is that they can be downloaded free of cost. So you can decide which one is better for you after trying out all three.

By: Neetesh Mehrotra
The author works at TCS as a systems engineer, and his areas of interest are Java development and automation testing. For any queries, do contact him at mehrotra.neetesh@gmail.com.
Imagine this scenario: You have 1GB of data that you need to process. The data is stored in a relational database in your desktop computer, which has no problem managing the load. Your company soon starts growing very rapidly, and the data generated grows to 10GB, and then 100GB. You start to reach the limits of what your current desktop computer can handle. So what do you do? You scale up by investing in a larger computer, and you are then alright for a few more months. When your data grows from 1TB to 10TB, and then to 100TB, you are again quickly approaching the limits of that computer. Besides, you are now asked to feed your application with unstructured data coming from sources like Facebook, Twitter, RFID readers, sensors, and so on. Your managers want to derive information from both the relational data and the unstructured data, and they want this information as soon as possible. What should you do?

Hadoop may be the answer. Hadoop is an open source project of the Apache Foundation. It is a framework written in Java, originally developed by Doug Cutting, who named it after his son's toy elephant!

Hadoop uses Google's MapReduce technology as its foundation. It is optimised to handle massive quantities of data, which could be structured, unstructured or semi-structured, using commodity hardware, i.e., relatively inexpensive computers. This massive parallel processing is done with great efficiency. However, handling massive amounts of data is a batch operation, so the response time is not immediate. Importantly, Hadoop replicates its data across different computers, so that if one goes down, the data is processed on one of the replicated computers.

Big Data
Hadoop is used for Big Data. Now what exactly is Big Data? With all the devices available today to collect data, such as RFID readers, microphones, cameras, sensors, and so on, we are seeing an explosion of data being collected worldwide.
[Figure: Hadoop architecture, showing the MapReduce layer (JobTracker) and the HDFS layer (NameNode)]
Big Data is a term used to describe large collections of data (also known as data sets) that may be unstructured, and grow so large and so quickly that it is difficult to manage them with regular database or statistical tools.

In terms of numbers, what are we looking at? How big is Big Data? Well, there are more than 3.2 billion Internet users, and active cell phones have crossed the 7.6 billion mark. There are now more in-use cell phones than there are people on the planet (7.4 billion). Twitter processes 7TB of data every day, and 600TB of data is processed by Facebook daily. Interestingly, about 80 per cent of this data is unstructured. With this massive amount of data, businesses need fast, reliable, deeper data insight. Therefore, Big Data solutions based on Hadoop and other analytics software are becoming more and more relevant.

Open source projects related to Hadoop
Here is a list of some other open source projects related to Hadoop:
• Eclipse is a popular IDE donated by IBM to the open source community.
• Lucene is a text search engine library written in Java.
• HBase is the Hadoop database.
• Hive provides data warehousing tools to extract, transform and load (ETL) data, and to query this data stored in Hadoop files.
• Pig is a high-level language that generates MapReduce code to analyse large data sets.
• Spark is a cluster computing framework.
• ZooKeeper is a centralised configuration service and naming registry for large distributed systems.
• Ambari manages and monitors Hadoop clusters through an intuitive Web UI.
• Avro is a data serialisation system.
• UIMA is the architecture used for the analysis of unstructured data.
• Yarn is a large scale operating system for Big Data applications.
• MapReduce is a software framework for easily writing applications that process vast amounts of data.

Hadoop architecture
Before we examine Hadoop's components and architecture, let's review some of the terms that are used in this discussion. A node is simply a computer. It is typically non-enterprise, commodity hardware that contains data. We can keep adding nodes, such as Node 2, Node 3, and so on. A rack is a collection of 30 or 40 nodes that are physically stored close together and are all connected to the same network switch. A Hadoop cluster (or just a 'cluster' from now on) is a collection of racks.

Now, let's examine Hadoop's architecture; it has two major components:
1. The distributed file system component: The main example of this is the Hadoop distributed file system (HDFS), though other file systems, like IBM Spectrum Scale, are also supported.
2. The MapReduce component: This is a framework for performing calculations on the data in the distributed file system.

HDFS runs on top of the existing file systems on each node in a Hadoop cluster. It is designed to tolerate a high component failure rate through the replication of the data. A file on HDFS is split into multiple blocks, and each is replicated within the Hadoop cluster. A block on HDFS is a blob of data within the underlying file system (see Figure 1).

HDFS stores the application data and the file system metadata separately, on dedicated servers. NameNode and DataNode are the two critical components of the HDFS architecture. Application data is stored on servers referred to as DataNodes, and file system metadata is stored on servers referred to as NameNodes. HDFS replicates a file's contents on multiple DataNodes, based on the replication factor, to ensure the reliability of the data. The NameNode and the DataNodes communicate with each other using TCP based protocols.

The heart of the Hadoop distributed computation platform is the Java-based programming paradigm MapReduce. MapReduce is a special type of directed acyclic graph that can be applied to a wide range of business use cases. The Map…
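The article does not include sample code, but the MapReduce idea is easy to make concrete with Hadoop Streaming, which lets mappers and reducers be written as plain Python scripts that read stdin and write stdout. The classic word-count pair below is a minimal, illustrative sketch; the file names and the local test pipeline are mine, not from the article:

# mapper.py: emit 'word<TAB>1' for every word read from standard input.
import sys

for line in sys.stdin:
    for word in line.split():
        print('%s\t%d' % (word, 1))

# reducer.py: sum the counts per word; Hadoop delivers mapper output sorted by key.
import sys

current, total = None, 0
for line in sys.stdin:
    word, count = line.rsplit('\t', 1)
    if word != current:
        if current is not None:
            print('%s\t%d' % (current, total))
        current, total = word, 0
    total += int(count)
if current is not None:
    print('%s\t%d' % (current, total))

The same pair can be tested without a cluster by piping: cat input.txt | python mapper.py | sort | python reducer.py.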
We have all been observing a sudden surge in the production of data in the recent past, and this will undoubtedly increase in the years ahead. Almost all the applications on our smartphones (like Facebook, Instagram, WhatsApp, Ola, etc) generate data in different forms like text and images, or depend on data to work upon. With around 2.32 billion smartphone users across the globe (as per the latest data from statista.com) having installed multiple applications, it certainly adds up to a really huge amount of data, daily. Apart from this, there are other sources of data as well, like different Web applications, sensors and actuators used in IoT devices, process automation plants, etc. All this creates a really big challenge: to store such massive amounts of data in a manner that can be used as and when needed.

We all know that our businesses cannot get by without storing our data. Sooner or later, even small businesses need space for data storage—for documents, presentations, e-mails, image graphics, audio files, databases, spreadsheets, etc, which act as the lifeblood for most companies. Besides, many organisations also have some confidential information that must not be leaked or accessed by anyone, in which case security becomes one of the most important aspects of any data storage solution. In critical healthcare applications, an organisation cannot afford to run out of memory, so data needs to be monitored each and every second.

Storing different kinds of data and managing that storage is critical to any company's behind-the-scenes success. When we look for a solution that covers all our storage needs, the possibilities seem quite endless, and many of them are likely to consume our precious IT budgets. This is why we cannot afford to overlook open source data storage solutions. Once you dive into the open source world, you will find a huge array of solutions for almost every problem or purpose, which includes storage as well.

Reasons for the growth in the data storage solutions segment
Let's check out some of the reasons for this:
1. Various recent government regulations, like Sarbanes-Oxley, ask businesses to maintain and keep a backup of different types of data which they might have otherwise deleted.
2. Many small businesses have now started archiving e-mail messages, even those dating back five or more years, for various legal reasons.
3. The pervasiveness of spyware and viruses requires backups, and that again requires more storage capacity.
4. There is a growing need to back up and store large media files, such as video, MP3, etc, and make them available to users on a specific network. This again generates demand for large storage solutions.
5. Each newer version of any software application or operating system demands more space and memory than its predecessor, which is another reason driving the demand for large storage solutions.

Different types of storage options
There are different types of storage solutions that can be used based on individual requirements, as listed below.

Flash memory thumb drives: These drives are particularly useful to mobile professionals, since they consume little power, are small enough to even fit on a keychain and have almost no moving parts. You can connect any Flash memory thumb drive to your laptop's Universal Serial Bus (USB) port and back up different files on the system. Some USB thumb drives also provide encryption to protect files in case the drive gets lost or is stolen. Flash memory thumb drives also let us store our Outlook data (like recent e-mails or calendar items), bookmarks from Internet Explorer, and even some desktop applications. That way, you can leave your laptop at home and just plug the USB drive into any borrowed computer to access all your data elsewhere.

External hard drives: An inexpensive and relatively simple way to add more storage is to connect an external hard drive to your computer. External hard disk drives that are directly connected to PCs have several disadvantages. Any file stored only on the drive, but not elsewhere, still needs to be backed up. Also, if you travel somewhere for work and need access to some of the files on an external drive, you will have to take the drive with you or remember to copy the required files to your laptop's internal drive, a USB thumb drive, a CD or any other storage media. Finally, in case of a fire or other catastrophe at your place of business, your data will not be completely protected if it's stored on an external hard drive.

Online storage: There are different services which provide remote storage and backup over the Internet. All such services offer businesses a number of benefits. By backing up your most important files to a highly secure remote server, you are actually protecting the data stored at your place of business. You can also easily share large files with your clients, partners or others by providing them with password-protected access to your online storage service, hence eliminating the need to send those large files by e-mail. And in most cases, you can log into your account from any system using a Web browser, which is a great way to retrieve files when you are away from your PC. Remote storage can be a bit slow, especially during an initial backup session, and is only as fast as your network's access to that storage. For extremely large files, you may require higher speed network access.

Network attached storage: Network attached storage (NAS) provides fast, reliable and simple access to data in any IP networking environment. Such solutions are quite suitable for small or mid-sized businesses that require large volumes of economical storage which can be shared by multiple users over a network. Given that many small businesses lack IT departments, this storage solution is easy to deploy, and can be managed and consolidated centrally. This type of storage solution can be as simple as a single hard drive with an Ethernet port or even built-in Wi-Fi connectivity.

More sophisticated NAS solutions can also provide additional USB as well as FireWire ports, enabling you to connect external hard drives to scale up the overall storage capacity of businesses. A NAS storage solution can also offer print-server capabilities, which let multiple users easily share a single printer. A NAS solution may also include multiple hard drives in a Redundant Array of Independent Disks (RAID) Level 1 array. This storage system contains two or more equivalent hard drives (such as two 250GB drives) in a single network-connected device. Files written to the first (main) drive are automatically written to the second drive as well. This kind of automated redundancy in NAS solutions means that if the first hard drive dies, we will still have access to all the applications and files present on the second drive. Such solutions can also help in offloading files being served by other servers on your network, which increases performance.

A NAS system allows you to consolidate storage, hence increasing efficiency and reducing costs. It simplifies storage administration, data backup and recovery, and also allows for easy scaling to meet growing storage needs.

Choosing the right storage solution
There are a number of storage solutions available in the market, which meet diverse requirements. At times, you could get confused while trying to choose the right one. Let's get rid of that confusion by considering some of the important aspects of a storage solution.

Scalability: This is one of the important factors to be considered while looking for any storage solution. In different distributed storage systems, storage capacity can be added in two ways. The first way involves adding disks…
The European Organisation for Nuclear Research (CERN), a research collaboration of over 20 countries, has a unique problem—it has way more data than it is possible to store! We're talking about petabytes of data per year, where one petabyte equals a million gigabytes. There are entire departments of scientists working on a subject termed DAQ (Data Acquisition and Filtering), simply to filter out 95 per cent of the experiment-generated data and store only the useful 5 per cent. In fact, it has been estimated that data in the digital universe will amount to 40 zettabytes by 2020, which is about 5,000 gigabytes of data per person.

With the recent spate of breaches affecting cloud service providers, setting up a personal data store or even a private cloud becomes an attractive prospect.

Figure 1: Hacked credentials

Data storage infrastructure is broadly classified as object-based, block storage and file systems, each with its own set of features.

Figure 2: Object storage, file systems and block storage [Source: ubuntu.com]

Object-based storage
This construct manages data as objects, instead of treating it as a hierarchy of files or blocks. Each object is associated with a unique identifier and comprises not only the data but also, in some cases, the metadata. This storage pattern seeks to enable capabilities such as application programmable interfaces, data management such as replication at object-scale, etc. It is often used to allow for the retention of massive amounts of data. Examples include the storage of photos, songs and files on a massive scale by Facebook, Spotify and Dropbox, respectively.

Block storage
Data is stored as a sequence of bytes, termed a physical record. This so-called 'block' of data comprises a whole number of records. The process of putting data into blocks is termed blocking, while the reverse is called deblocking. Blocking is widely employed when storing data to certain types of magnetic tape, Flash memory and rotating media.

File systems
These data storage structures follow a hierarchy, which controls how data is stored and retrieved. In the absence of a file system, information would simply be a large body of data with no way to isolate individual pieces of information from the whole. A file system encapsulates the complete set of rules and logic used to manage sets of data. File systems can be used on a variety of storage media, most commonly hard disk drives (HDDs), magnetic tapes and optical discs.

Building open source storage
Software
Network Attached Storage (NAS) provides a stable and widely employed alternative for data storage and sharing across a network. It provides a centralised repository of data that can be accessed by different members within the organisation. Variations include complete software and hardware packages serving as out-of-the-box alternatives. These include software and file systems such as Gluster, Ceph, NAS4Free, FreeNAS, and others. As an example, we will look into the general steps involved in deploying such a system by taking the case of a popular representative of the set.

FreeNAS
With enterprise-grade features, richly supported plugins, and an enterprise-ready ZFS file system, it is easy to see why FreeNAS is one of the most popular operating systems in the market for data storage. Let's take a deeper look at file systems, since they are widely used in setting up storage networks today. Building your own data storage using FreeNAS involves a few simple steps:
1. Download the disk image suitable for your architecture and burn it onto either a USB stick or a CD-ROM, as per your preference.
2. Since you will be booting your new disk or machine with FreeNAS, you will need to open the BIOS settings on booting it, and set the boot preference to USB, so that your system first tries to boot from the USB and, if that is not found, then from other attached media.

Figure 3: Selecting the boot partition [Source: freenas.org]

3. Once you have created the storage media with the required software, you can boot up your system and install FreeNAS in the designated partition.
4. Having set the root password, when you boot into it after installation, you will have the option of using the Web GUI to log into the system. For some users, it might be much more intuitive to use this option as compared to the console-based login.

Figure 4: FreeNAS GUI [Source: freenas.org]

5. Using the GUI or console, you can configure and manage…
Figure 5: Configuring storage options [Source: freenas.org]
Cloonix is a network simulator based on KVM or UML. It is basically a Linux router and host simulation platform. You can simulate a network with multiple reconfigurable VMs in a single PC. The VMs may be different Linux distributions. You can also monitor the network's activities through Wireshark. Cloonix can be installed on Arch, CentOS, Debian, Fedora, OpenSUSE and their derivative distros.

The main features of Cloonix are:
• GUI based NS tool
• KVM based VMs
• VMs and clients are Linux based
• Spice server as the front-end for VMs
• Network activity monitoring through Wireshark

The system requirements are:
• 32/64-bit Linux OS (tested on Ubuntu 16.04, 64-bit)
• Wireshark
• The Cloonix package: http://cloonix.fr/source_stored/cloonix-37-01.tar.gz
• VM images: http://cloonix.fr/bulk_stored/

To set it up, download the Cloonix package and extract it. I am assuming that Cloonix is extracted in the $HOME directory. The directory structure of Cloonix is as follows:

cloonix
├── allclean
├── build
├── cloonix
│   ├── client
│   ├── cloonix_cli
│   ├── cloonix_config
│   ├── cloonix_gui
│   ├── cloonix_net
│   ├── cloonix_ocp
│   ├── cloonix_osh
│   ├── cloonix_scp
│   ├── cloonix_ssh
│   ├── cloonix_zor
│   ├── common
│   ├── id_rsa
│   ├── id_rsa.pub
│   ├── LICENCE
│   └── server
├── doitall
├── install_cloonix
├── install_depends
├── pack
└── README

5 directories, 19 files

To install Cloonix, run the following commands, which will install all the packages required, except Wireshark:

$ cd $HOME/cloonix
$ sudo ./install_depends build

The following command will install and configure Cloonix on your system:

$ sudo ./install_cloonix

The command given below will install Wireshark:

$ sudo apt-get install wireshark

The demo scenarios shipped with Cloonix include the following directories:

├── batman
├── cisco
├── dns
├── dyn_dns
├── eap_802_1x
├── ethereum
├── fwmark2mpls
├── mpls
├── mplsflow
├── netem
├── ntp
├── olsr
├── openvswitch
├── ospf
├── ping
├── strongswan
└── unix2inet

Figure 1: Ping simulation demo

To run any demo, the ping demo for instance, just go to the ping directory and run the script provided there.
Before discussing the need for backup software, some knowledge of the brief history of storage is recommended. In 1953, IBM recognised the importance and immediate application of what it called the 'random access file'. The company then went on to describe this as having high capacity with rapid random access to files. This led to the invention of what subsequently became the hard disk drive. IBM's San Jose, California laboratory invented the HDD. This disk drive created a new level in the computer data hierarchy, then termed random access storage but today known as secondary storage.

The commercial use of hard disk drives began in 1957, with the shipment of an IBM 305 RAMAC system including IBM Model 350 disk storage, for which US Patent No. 3,503,060 was issued on March 24, 1970.

The year 2016 marked the 60th anniversary of the venerable hard disk drive (HDD). Nowadays, new computers are increasingly adopting SSDs (solid-state drives) for main storage, but HDDs still remain the champions of low cost and very high capacity data storage.

The cost per GB of data has come down significantly over the years because of a number of innovations and advanced techniques developed in manufacturing HDDs. The graph in Figure 1 gives a glimpse of this. The general assumption is that this cost will reduce further. Now, since storing data is not at all costly compared to what it was in the 1970s and '80s, why should one take a backup of data when it is so cheap to buy new storage? What are the advantages of having a backup of data?

Today, we are generating a lot of data by using various gadgets like mobiles, tablets, laptops, handheld computers, servers, etc. When we exceed the allowed storage capacity of these devices, we tend to push this data to the cloud or take a backup to avoid any future disastrous events.
Figure 1: Hard drive costs per GB of data (Source: http://www.mkomo.com/cost-per-gigabyte)
Figure 2: Ceph adoption rate (Source: https://sanenthusiast.com/top-5-storage-data-center-tech-predictions-2016/)
Many corporates and enterprise level customers are generating huge volumes of data, and having backups is critical for them.

Backing up data is very important. After taking a backup, we also have to make sure that this data is secure and manageable, and that the data's integrity is not compromised. Keeping these aspects in mind, many open source backup software packages have been developed over the years.

Data backup comes in different flavours, like individual files and folders, whole drives or partitions, or full system backups. Nowadays, we also have the 'smart' method, which automatically backs up files in commonly used locations (syncing), and we have the option of using cloud storage. Backups can be scheduled, running as incremental, differential or full backups, as required.

For organisations and large enterprises that are planning on selecting backup software tools and technologies, this article reviews the best open source tools. Before choosing the best software or tool, users should evaluate the features it provides, with reference to stability and open source community support.

Advanced open source storage software like Ceph, Gluster, ZFS and Lustre can be integrated with some of the popular backup tools like Bareos, Bacula, AMANDA and Clonezilla; each of these is described in detail in the following sections.

Ceph
Ceph is one of the leading choices in open source software for storage and backup. Ceph provides object storage, block storage and file system storage features. It is very popular because of its CRUSH algorithm, which liberates storage clusters from the scalability and performance limitations imposed by centralised data table mapping. Ceph eliminates many tedious tasks for administrators by replicating and rebalancing data within the cluster, and delivers high performance and infinite scalability.

Ceph also has RADOS (reliable autonomic distributed object store), which provides the earlier described object, block and file system storage in a single unified storage cluster. The Ceph RBD backup script, ceph_rbd_bck.sh (v0.1.1), creates the backup solution for Ceph. This script helps in backing up Ceph pools. It was developed keeping in mind the backing up of specified storage pools and not only individual images; it also allows retention of dates and implements a synthetic full backup schedule if needed.

Many organisations are now moving towards large scale object storage and take backups regularly. Ceph is the ultimate solution, as it provides object storage management along with state-of-the-art backup. It also provides integration into private cloud solutions like OpenStack, which helps one in managing backups of data in the cloud.

The Ceph script can also archive data, remove all the old files and purge all snapshots. This triggers the creation of a new, full and initial snapshot.

OpenStack has a built-in Ceph backup driver, which is an intelligent solution for VM volume backup and maintenance. This helps in taking regular and incremental backups of volumes to maintain consistency of data. Along with Ceph backup, one can use a tool called CloudBerry for versatile control over Ceph based backup and recovery mechanisms.

Ceph also has good support from the community and from large organisations, many of which have adopted it for storage and backup management and in turn contribute back to the community. A lot of developments and enhancements are happening on a continuous basis with Ceph. A number of research organisations have predicted that Ceph's adoption rate will increase in the future (see Figure 2). Ceph also has certain cost advantages in comparison with other software products.

More information about the Ceph RBD script can be found at http://obsidiancreeper.com/2017/04/03/Updated-Ceph-Backup/.

Gluster
Red Hat's Gluster is another open source, software-defined scale-out storage and backup solution, also known as RHGS (Red Hat Gluster Storage). It helps in managing unstructured data for physical, virtual and cloud environments. The advantages of Gluster software are its cost effectiveness and highly available storage that does not compromise on scale or performance.
[Figure: Cost difference between Red Hat Gluster Storage and a competitive NAS storage system for 300TB initial procurement ($)]
Gluster's other advantages include:
• Industry standard open source support and data formats
• Low cost of ownership

Bareos (Backup Archiving Recovery Open Sourced)
Bareos offers high data security and reliability along with cross-network open source software for backups. Now being actively developed, it emerged from the Bacula Project in 2010. Bareos supports Linux/UNIX, Mac and Windows based OS platforms, along with both a Web GUI and a CLI.

Clonezilla
Clonezilla is a partition and disk imaging/cloning program. It is similar to many variants available in the market, like Norton Ghost and True Image. It has features like bare metal backup recovery, and supports massive cloning with high efficiency in multi-cluster node environments. Clonezilla comes in two variants—Clonezilla Live and Clonezilla SE (Server Edition). Clonezilla Live is suitable for single machine backup and restore, and Clonezilla SE for massive deployment. The latter can clone many (40 plus) computers simultaneously.

Duplicati
Designed to be used in a cloud computing environment, Duplicati is a client application for creating encrypted, incremental, compressed backups to be stored on a server. It works with public clouds like Amazon, Google Drive and Rackspace, as well as private clouds and networked file servers. Operating systems that it is compatible with include Windows, Linux and Mac OS X.

FOG
Like Clonezilla, FOG is a disk imaging and cloning tool that can aid with both backup and deployment. It's easy to use, supports networks of all sizes, and includes other features like virus scanning, memory testing, disk wiping, disk testing and file recovery. Operating systems compatible with it include Linux and Windows.

References
[1] To know more about the history of HDDs: https://www.pcworld.com/article/127105/article.html
[2] http://clonezilla.org/
[3] https://amanda.zmanda.com/amanda-enterprise-edition.html
[4] http://ceph.com/ceph-storage/
[5] http://www.mkomo.com/cost-per-gigabyte
[6] https://en.wikipedia.org/wiki/History_of_hard_disk_drives

By: Shashidhar Soppin
The author is a senior architect with 16+ years of experience in the IT industry, and has expertise in virtualisation, cloud, Docker, open source, ML, deep learning and OpenStack. He is part of the PES team at Wipro. You can contact him at shashi.soppin@gmail.com.
Network security implementation mainly depends on exploratory data analysis (EDA) and visualisation. EDA provides a mechanism to examine a data set without preconceived assumptions about the data and its behaviour. The behaviour of the Internet and of attackers is dynamic, and EDA is a continuous process to help identify all the phenomena that are cause for an alarm, and to help detect anomalies in access to resources.

Fumbling is a general term for repeated systematic failed attempts by a host to access resources. For example, legitimate users of a service should have a valid email ID or user identification. So if there are numerous attempts by a user from a different location to target the users of this service with different email identifications, then there is a chance that this is an attack from that location. From the data analysis point of view, we say a fumbling condition has happened. This indicates that the user does not have access to that system and is exploring different possibilities to break the security of the target. It is the task of the security personnel to identify the pattern of the attack and the mistakes committed, to differentiate them from innocent errors. Let's now discuss a few examples to identify a fumbling condition.

In a nutshell, fumbling is a type of Internet attack, which is characterised by failing to connect to one location, with a systematic attack from one or more locations. After a brief discussion of this type of network intrusion, let's consider a problem of network data analysis using R, which is a good choice as it provides powerful statistical data analysis tools together with a graphical visualisation opportunity for a better understanding of the data.

Fumbling of the network and services
In case of TCP fumbling, a host fails to reach a target port of a host, whereas in the case of HTTP fumbling, hackers fail to access a target URL. All fumbling is not a network attack, but most suspicious attacks appear as fumbling.
The most common reason for fumbling is lookup failure, which happens mainly due to misaddressing, the movement of the host, or the non-existence of a resource. Other than this, an automated search of destination targets, and scanning of addresses and their ports, are possible causes of fumbling. Sometimes, to search for a target host, automated measures are taken to check whether the target is up and running. These types of failed attempts are generally mistaken for network attacks, though lookup failure happens either due to misconfiguration of DNS, a faulty redirection on the Web server, or email with a wrong URL. Similarly, SMTP communication uses an automated network traffic control scheme for its destination address search.

The most serious cause of fumbling is repeated scanning by attackers. Attackers scan the entire address-port combination matrix either in vertical or in horizontal directions. Generally, attackers explore horizontally, as they are most interested in exploring potential vulnerabilities. Vertical search is basically a defensive approach to identify an attack on an open port address. As an alternative to scanning, at times attackers use a hit-list to explore a vulnerable system. For example, to identify an SSH host, attackers may use a blind scan and then start a password attack.

Identifying fumbling
Identifying malicious fumbling is not a trivial task, as it requires demarcating innocuous fumbling from the malevolent kind. Primarily, the task of assessing failed accesses to a resource is to identify whether the failure is consistent or transient. To explore TCP fumbling, look into all TCP communication flags, payload size and packet count. In TCP communication, the client sends an ACK flag only after receiving the SYN+ACK signal from the server. If there is no ACK after the SYN+ACK from the server, that indicates a fumbling. Another possible way to locate a malicious attack is to count the number of packets in a flow. A legitimate TCP flow requires at least three packets of overhead before it considers transmitting data. Most retries require three to five packets, and TCP flows having five packets or fewer are likely to be fumbles.

Since, during a failed connection, the host sends the same SYN packet options repeatedly, a ratio of packet size to packet number is also a good measure for identifying TCP flow fumbling.

ICMP informs a user about why a connection failed. It is also possible to look into the ICMP response traffic to identify fumbling. If there is a sudden spike in messages originating from a router, then there is a good chance that a target is probing the router's network. A proper forensic investigation can identify a possible attacking host.

Since UDP does not follow a strict communication protocol like TCP, the easiest way to identify UDP fumbling is by exploring network mapping and ICMP traffic.

Identifying service level fumbling is comparatively easier than communication level fumbling, as in most cases exhaustive logs record each access and malfunction. For example, HTTP returns three-digit status codes of the form 4xx for every client-side error. Among the different codes, 404 and 401 are the most common, for unavailability of resources and unauthorised access, respectively. Most 404 errors are innocuous, as they occur due to misconfiguration of the URL or the internal vulnerabilities of different services of the HTTP server. But if it is a 404 scan, then it may be malicious traffic, and there is a chance that attackers are trying to guess the object in order to reach a vulnerable target. Web server authentication is rarely used by modern Web servers, so on discovering any log entry with a 401 error, proper steps should be taken to remove the source from the server.

Another common service level vulnerability comes from the mail service protocol, SMTP. When a host sends a mail to a non-existent address, the server either rejects the mail or bounces it back to the source. Sometimes it also directs the mail to a catch-all account. In all three cases, the routing SMTP server keeps a record of the mail delivery status. But the main hurdle in identifying SMTP fumbling comes from spam. It's hard to differentiate SMTP fumbling from spam, as spammers send mail to every conceivable address. SMTP fumblers also send mails to target addresses to verify whether an address exists, for possible scouting out of the target.

Designing a fumbling identification system
From the above discussion, it is apparent that identifying fumbling is more subjective than objective. Designing a fumbling identification and alarm system requires in-depth knowledge of the network and its traffic pattern. There are several network tools, but here we will cover some basic system utilities so that readers can explore the infinite possibilities of designing network intrusion detection and prevention systems of their own.

In order to separate malicious from innocuous fumbling, the analyst should mark the targets to determine whether the attackers are reaching the goal and exploring the target. This step reduces the bulk of data to a manageable state and makes the task easier. After fixing the target, it is necessary to examine the traffic to study the failure pattern. If it is TCP fumbling, as mentioned earlier, this can be detected by finding traffic without the ACK flag. In case of HTTP scanning, an examination of the HTTP server log table for 404 or 401 errors is done to find the malicious fumbling. Similarly, the SMTP server log helps us to find doubtful emails and identify the attacking hosts.

If scouting happens in a dark space of a network, then the chance of a malicious attack is high. Similarly, if a scanner scans more than one port in a given time frame, the chance of intrusion is high. A malicious attack can be confirmed by examining the conversation between the attacker and the target. Suspicious conversations can be subsequent transfers…
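Before moving to the R-based analysis, here is a rough Python sketch of the TCP fumbling heuristics described above (no ACK flag observed, and five packets or fewer per flow). The flow-table schema is hypothetical and would have to match whatever your flow exporter actually produces:

import pandas as pd

# Each row is one flow record: source, destination, packet count, TCP flags seen.
flows = pd.DataFrame({
    'src':     ['10.0.0.5', '10.0.0.5', '10.0.0.9'],
    'dst':     ['172.16.1.2', '172.16.1.3', '172.16.1.2'],
    'packets': [3, 4, 12],
    'flags':   ['S', 'S', 'SA'],   # an 'A' means an ACK was observed in the flow
})

# Heuristics from the discussion above: no ACK after SYN, five packets or fewer.
suspect = flows[(~flows['flags'].str.contains('A')) & (flows['packets'] <= 5)]

# One source fumbling towards many distinct targets suggests horizontal scanning.
print(suspect.groupby('src')['dst'].nunique().sort_values(ascending=False))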
Similarly, TCP ACK packets can be captured by issuing the command given below. In our case, we can use this to identify ports that are listening; for example, to know about connections of HTTP:

$ netstat -tlp

Data analysis
Once the data set has been read in, the first ten rows can be displayed to have a view of the table structure, as shown in Figure 1, where the dimension, columns and object class are also listed. Now the data is ready for analysis. The R summary command will show the count of elements of each field, whereas the count command will show the frequency distribution of the IP addresses, as shown below:

> summary(u)
 user           ipaddress
 ccadmin :34    172.16.6.252:21

[Figure: Frequency distribution of IP addresses (172.16.11.95, 172.16.4.66, 172.16.5.132, 172.16.5.230, 172.16.6.252, 172.16.7.155)]

By: Dipankar Ray
The author is a member of IEEE and IET, and has more than 20 years of experience in open source versions of UNIX operating systems and Sun Solaris. He is presently working on data analysis and machine learning using a neural network as well as on different statistical tools. He has also jointly authored a textbook called 'MATLAB for Engineering and Science'. He can be reached at dipankarray@ieee.org.
We keep data on portable hard disks, memory cards, USB Flash drives or other such similar media. Ensuring the long term preservation of this data with timely backups is very important. Many times, these memory drives get corrupted because of malicious programs or viruses; so they should be protected by using secure backup and recovery tools.

Popular tools for secured backup and recovery
For secured backup and recovery of data, it is always preferable to use performance-aware software tools and technologies, which can protect the data against any malicious or unauthenticated access. A few free and open source software tools which can be used for secured backup and recovery of data in multiple formats are: AMANDA, Bacula, Bareos, Clonezilla, FOG, Rsync, BURP, Duplicati, BackupPC, Mondo Rescue, GRSync, Areca Backup, etc.

Python as a high performance programming environment
Python is a widely used programming environment for almost every application domain, including Big Data analytics, wireless networks, cloud computing, the Internet of Things (IoT), security tools, parallel computing, machine learning, knowledge discovery, deep learning, NoSQL databases and many others. Python is a free and open source programming language which is equipped with in-built features of system programming, a high level programming environment and network compatibility. In addition, the interfacing of Python can be done with any channel, whether it is live streaming on social media or in real-time via satellite. A number of other programming languages have been developed which have been influenced by Python. These languages include Boo, Cobra, Go, Groovy, Julia, OCaml, Swift, ECMAScript and CoffeeScript. There are other programming environments with the base code and programming paradigm of Python under development.

Python is rich in maintaining the repository of packages for big applications and domains, including image processing, text mining, systems administration, Web scraping, Big Data analysis, database applications, automation tools, networking, video processing, satellite imaging, multimedia and many others.

Python Package Index (PyPI): https://pypi.python.org/pypi
The Python Package Index (PyPI), which is also known as the Cheese Shop, is the repository of Python packages for different software modules and plugins developed as add-ons to Python. Till September 2017, there were more than 117,000 packages for different functionalities and applications in PyPI. This escalated to 123,086 packages by November 30, 2017.

The table in Figure 1 gives the statistics fetched from ModuleCounts.com, which maintains data about modules, plugins and software tools.

Date                Nov-24    Nov-25    Nov-26    Nov-27    Nov-28    Nov-29    Nov-30
Packages in PyPI    122,619   122,669   122,723   122,808   122,918   123,008   123,086

Figure 1: Statistics of modules and packages in PyPI in the last week of November 2017 (Source: http://www.modulecounts.com/)

Python based packages for secured backup and recovery
As Python has assorted tools and packages for diversified applications, security and backup tools with tremendous functionalities are also integrated in PyPI. Descriptions of key Python based tools that offer security and integrity during backup follow.

Rotate-Backups
Rotate-Backups is a simplified command line tool that is used for backup rotation. It has multiple features, including flexible rotation on particular timestamps and schedules.

The installation process is quite simple. Give the following command:

$ pip install rotate-backups

The usage is as follows (the table at the bottom of this page lists the options):

$ rotate-backups [Options]

The rotation approach in Rotate-Backups can be customised as strict rotation (enforcement of the time window) or relaxed rotation (no enforcement of time windows). After installation, there are two files, ~/.rotate-backups.ini and /etc/rotate-backups.ini, which are used by default. This default setting can be changed using the command line option --config.

The timeline and schedules of the backup can be specified in the configuration file as follows:

# /etc/rotate-backups.ini:
[/backups/mylaptop]
hourly = 24
daily = 7
weekly = 4
monthly = 12
yearly = always
ionice = idle

[/backups/myserver]
daily = 7 * 2
weekly = 4 * 2
monthly = 12 * 4
yearly = always
ionice = idle

[/backups/myregion]
daily = 7
weekly = 4
monthly = 2
ionice = idle

[/backups/myxbmc]
daily = 7
weekly = 4
monthly = 2
Option Description
-M, --minutely=COUNT Number of minutely backups to preserve
-H, --hourly=COUNT Number of hourly backups
-d, --daily=COUNT Number of daily backups
-w, --weekly=COUNT Number of weekly backups
-m, --monthly=COUNT Number of monthly backups
-y, --yearly=COUNT Number of yearly backups
-I, --include=PATTERN Only process backups matching the shell pattern
-x, --exclude=PATTERN Do not process backups matching the shell pattern
-j, --parallel Process backups in parallel instead of one at a time
-p, --prefer-recent Preserve the most recent backup in each period instead of the oldest
-r, --relaxed Relaxed rotation (the time window of each rotation scheme is not enforced)
-i, --ionice=CLASS Input-output scheduling and priorities
-c, --config=PATH Configuration path
-u, --use-sudo Enabling the use of ‘sudo’
-n, --dry-run No changes, display the output
-v, --verbose Increase logging verbosity
-q, --quiet Decrease logging verbosity
-h, --help Messages and documentation
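The same rotation can also be driven programmatically. The snippet below follows the usage pattern published in the rotate-backups documentation; treat the exact class and method names as an assumption to verify against the version you install:

# Programmatic rotation (API as per the rotate-backups docs; please verify).
from rotate_backups import RotateBackups

# Mirrors the [/backups/mylaptop] section of the .ini file shown earlier.
rotation_scheme = dict(hourly=24, daily=7, weekly=4, monthly=12, yearly='always')
RotateBackups(rotation_scheme).rotate_backups('/backups/mylaptop')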
$ bakthat backup /home/mylocation/myfile.txt
$ bakthat backup myfile -d glacier

To disable the password prompt, give the following command:

$ bakthat backup myfile --prompt no

To create a backup archive with deduplication using borg, use the following command:

$ borg create -v --stats /path/to/repo::Saturday2 ~/Documents
---------------------------------------------------------
Archive name: MyArchive
Encrypting Partitions
Using LUKS
Sensitive data needs total protection. And there’s no better way of protecting
your sensitive data than by encrypting it. This article is a tutorial on how to
encrypt your laptop or server partitions using LUKS.
Sensitive data on mobile systems such as laptops can get compromised if they get lost, but this risk can be mitigated if the data is encrypted. Red Hat Linux supports partition encryption through the Linux Unified Key Setup (LUKS) on-disk-format technology. Encrypting partitions is easiest during installation, but LUKS can also be configured post installation.

Encryption during installation
When carrying out an interactive installation, tick the Encrypt checkbox while creating the partition to encrypt it. When this option is selected, the system will prompt users for a passphrase to be used for decrypting the partition. The passphrase needs to be manually entered every time the system boots.

When performing automated installations, Kickstart can create encrypted partitions. Use the --encrypted and --passphrase options to encrypt each partition. For example, the following line will encrypt the /home partition:

# part /home --fstype=ext4 --size=10000 --onpart=vda2 --encrypted --passphrase=PASSPHRASE

Note that the passphrase, PASSPHRASE, is stored in the Kickstart profile in plain text, so this profile must be secured. Omitting the --passphrase= option will cause the installer to pause and ask for the passphrase during installation.

Encryption post installation
Listed below are the steps needed to create an encrypted volume:
1. Create either a physical disk partition or a new logical volume.
2. Encrypt the block device and designate a passphrase, by using the following command:

# cryptsetup luksFormat /dev/vdb1
The encrypted device can later be opened, with a name assigned to it, by supplying either the passphrase or a key file that has been added to the partition:

# cryptsetup luksOpen /dev/vdb1 --key-file /root/key.txt name

As shown in the entry of the fstab file, if the device to be mounted is named, then the file system on which the encrypted partition should be permanently mounted is given in the other entries. Also, no passphrase is asked for separately now, as we have supplied the key file, which has already been added to the partition. The partition can now be mounted using the mount -a command, after which the mounted partition can be verified upon reboot by using the df -h command.

Figure 5: Available slots for an encrypted partition are shown

After adding the secondary key, again run the luksDump command to verify whether the key file has been added to Slot3 or not. As shown in Figure 7, the key file has been added to Slot3; Slot2 remains disabled, and Slot3 has been enabled with the key file supplied. Now Slot3 can also be used to decrypt the partition.

Figure 6: Secondary key file key2.txt has been added at Slot3
Figure 7: Slot3 enabled successfully
Figure 9: Decrypting a partition with the passphrase supplied initially
Restoring LUKS headers
For some commonly encountered LUKS issues, LUKS
header backups can mean the difference between a simple
administrative fix and permanently unrecoverable data.
Therefore, administrators of LUKS encrypted volumes should
engage in the good practice of routinely backing up their headers.
In addition, they should be familiar with the procedures for
restoring the headers from backup, should the need arise.
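Routine header backups are easy to automate. The short Python sketch below simply shells out to cryptsetup's luksHeaderBackup command; the device and destination paths are placeholders, and a matching luksHeaderRestore invocation recovers the header when needed:

import datetime
import os
import subprocess

DEVICE = '/dev/vdb1'              # placeholder: your encrypted device
DEST_DIR = '/root/luks-headers'   # placeholder: where header backups are kept

def backup_luks_header(device, dest_dir):
    # Save the LUKS header of `device` to a timestamped backup file.
    if not os.path.isdir(dest_dir):
        os.makedirs(dest_dir, 0o700)
    stamp = datetime.date.today().isoformat()
    name = '%s-%s.header' % (os.path.basename(device), stamp)
    out = os.path.join(dest_dir, name)
    subprocess.check_call(['cryptsetup', 'luksHeaderBackup', device,
                           '--header-backup-file', out])
    return out

print('Header saved to', backup_luks_header(DEVICE, DEST_DIR))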
Based on our readers' requests to take up a real life ML/NLP problem with a sufficiently large data set, we had started on the problem of detecting duplicate questions in community question answering (CQA) forums using the Quora Question Pair Dataset.

Let's first define our task as follows: Given a pair of questions <Q1, Q2>, the task is to identify whether Q2 is a duplicate of Q1, in the sense that, will the informational needs expressed in Q1 satisfy the informational needs of Q2? In simpler terms, we can say that Q1 and Q2 are duplicates from a lay person's perspective if both of them are asking the same thing in different surface forms.

An alternative definition is to consider that Q1 and Q2 are duplicates if the answer to Q1 will also provide the answer to Q2. However, we will not consider the second definition, since we are concerned only with analysing the informational needs expressed in the questions themselves and have no access to answer text. Therefore, let's define our task as a binary classification problem, where one of the two labels (duplicate or non-duplicate) needs to be predicted for each given question pair, with the restriction that only the question text is available for the task and not the answer text.

As I pointed out in last month's column, a number of NLP problems are closely related to duplicate question detection. The general consensus is that duplicate question detection can be solved as a by-product by using these techniques themselves. Detecting semantic text similarity and recognising textual entailment are the closest in nature to duplicate question detection. However, given that the goal of each of these problems is distinct from that of duplicate question detection, they fail to solve the latter problem adequately. Let me illustrate this with a few example question pairs.

Example 1
Q1a: What are the ways of investing in the share market?
Q1b: What are the ways of investing in the share market in India?

One of the state-of-the-art tools available online for detecting semantic text similarity is SEMILAR (http://www.semanticsimilarity.org/). A freely available state-of-the-art tool for entailment recognition is the Excitement Open Platform or EOP (http://hlt-services4.fbk.eu/eop/index.php). SEMILAR gave a semantic similarity score of 0.95 for the above pair, whereas EOP reported it as textual entailment. However, these two questions have different information needs and hence they are not duplicates of each other.

Example 2
Q2a: In which year did McEnroe beat Becker, who went on to become the youngest winner of the Wimbledon finals?
Q2b: In which year did Becker beat McEnroe and go on to become the youngest winner in the finals at Wimbledon?

SEMILAR reported a similarity score of 0.972, and EOP marked this question pair as entailment, indicating that Q2b is entailed from Q2a. Again, these two questions are about two entirely different events, and hence are not duplicates. We hypothesise that humans are quick to see the difference by extracting the relations that are being sought in the two questions. In Q2a, the relational event is <McEnroe (subject), beat (predicate), Becker (object)>, whereas in Q2b, the relational event is <Becker (subject), beat (predicate), McEnroe (object)>, which is a different relation from that in Q2a. By quickly scanning for a relational match/mismatch at the cross-sentence level, humans quickly mark this as non-duplicate,
even though there is considerable textual similarity across the text pair. It is also possible that the entailment system gets confused due to sub-clauses being entailed across the two questions (namely, the clause, "Becker went on to become youngest winner"). This lends weight to our claim that while semantic similarity matching and textual entailment are closely related problems to the duplicate question detection task, they cannot be used as solutions directly for the duplicate detection problem.

There are subtle but important differences in the relations of entities—cross-sentence word level interactions between two sentences—which mark them as non-duplicates when examined by humans. We can hypothesise that humans use these additional checks on top of the coarse grained similarity comparison they do in their minds when they look at these questions in isolation, and then arrive at the decision of whether they are duplicates or not. If we consider the example we discussed in Q2a and Q2b, the fact is that the relation between the entities in Question 2a does not hold good in Question 2b and, hence, if these cross-sentence level semantic relations are checked, it would be possible to determine that this pair is not a duplicate. It is also important to note that not all mismatches are equally important. Let us consider another example.

Example 3
Q3a: Do omega-3 fatty acids, normally available as fish oil supplements, help prevent cancer?
Q3b: Do omega-3 fatty acids help prevent cancer?

Though Q3b does not mention the fact that omega-3 fatty acids are typically available as fish oil supplements, its information needs are satisfied by the answer to Q3a, and hence these two questions are duplicates. From a human perspective, we hypothesise that the word fragment "normally available as fish oil supplements" is not seen as essential to the overall semantic compositional meaning of Q3a; so we can quickly discard this information when we refine the overall representation of the first question when doing a pass over the second question. Also, we can hypothesise that humans use cross-sentence word level interactions to quickly check whether similar information needs are being met in the two questions.

Example 4
Q4a: How old was Becker when he won the first time at Wimbledon?
Q4b: What was Becker's age when he was crowned as the youngest winner at Wimbledon?

Though the surface forms of the two questions are quite dissimilar, humans tend to compare cross-sentence word level interactions such as (<old, age>, <won, crowned>) in the context of the entity in question, namely, Becker, to conclude that these two questions are duplicates. Hence any system which attempts to solve the task of duplicate question detection should not depend blindly on a single aggregated coarse-grained similarity measure to compare the sentences, but instead should consider the following:
• Do relations that exist in the first question hold true for the second question?
• Are there word level interactions across the two questions which cause them to have different informational needs (even if the rest of the question is pretty much identical across the two sentences)?

Now that we have a good idea of the requirements for a reasonable duplicate question detection system, let's look at how we can start implementing this solution. For the sake of simplicity, let us assume that our data set consists of single sentence questions. Our system for duplicate detection first needs to create a representation for each input sentence, and then feed the representations for each of the two questions to a classifier, which will decide whether they are duplicates or not by comparing the representations. The high-level block diagram of such a system is shown in Figure 1.

First, we need to create an input representation for each question sentence. We have a number of choices for this module. As is common in most neural network based approaches, we use word embeddings to create a sentence representation. We can either use pre-trained word embeddings such as Word2Vec embeddings/GloVe embeddings, or we can train our own word embeddings using the training data as our corpus. For each word in a sentence, we look up its corresponding word embedding vector and form the sentence matrix. Thus, each question (sentence) is represented by its sentence matrix (a matrix whose rows represent each word in the sentence, and hence each row is the word-embedding vector for that word). We now need to convert the sentence-embedding matrix into a fixed length input representation vector.

One of the popular ways of representing an input sentence is by creating a sequence-to-sequence representation using recurrent neural networks. Given a sequence of input words (this constitutes the sentence), we now pass this sequence through a recurrent neural network (RNN) and create an output sequence. While the RNN generates an output for each input in the sequence, we are only interested in the final aggregated representation of the input sequence. Hence, we take the output of the last unit of the RNN and use it as our sentence representation. We can use either vanilla RNNs, or gated recurrent units (GRU), or long short term memory (LSTM) units for creating a fixed length representation from a given input sequence. Given that LSTMs have been quite successfully used in many NLP tasks, we decided to use LSTMs to create the fixed length representation of the question.

The last stage output from each of the two LSTMs (one LSTM for each of the two questions) represents the input question representation. We then feed the two representations to a multi-layer perceptron (MLP) classifier. An MLP classifier is nothing but a fully connected multi-layer feed forward neural network. Given that we have a two-class prediction problem, the last stage of the MLP classifier is a two-unit softmax, the output of which gives the probabilities for each of the two output classes. This is shown in the overall block diagram in Figure 1.
Figure 1: Overall block diagram. Input Questions 1 and 2 are each encoded by a sentence representation module (LSTM), and the two representations feed an MLP classifier that outputs the class probabilities
It was in the year 2000 that I had first come across Python in the Linux Journal, a magazine that's no longer published. I read about it in a review titled 'Why Python' by Eric Raymond. I had loved the idea of a language that enforced indentation for obvious reasons. It was a pain to keep requesting colleagues to indent the code. IDEs were primitive then—not even as good as a simple text editor today. However, one of Raymond's statements stayed in my mind: "I was generating working code nearly as fast as I could type." It is hard to explain, but somehow the syntax of Python offers minimal resistance!

The significance of Python even today is underlined by the fact that Uber has just open sourced its AI tool Pyro, which aims at '…deep universal probabilistic programming with Python and PyTorch (https://eng.uber.com/pyro/).' Mozilla's DeepSpeech open source speech recognition model includes pre-built packages for Python (https://goo.gl/nxXz2Y).

Passing a function as a parameter
Years ago, after coding a number of forms, it was obvious that handling user interface forms required the same logic, except for validations. You could code a common validations routine, which used a form identifier to execute the required code. However, as the number of forms increased, it was obviously a messy solution. The ability to pass a function as a parameter in Pascal simplified the code a lot.

So, the fact that Python can do it as well is nothing special. However, examine the simple example that follows. There should be no difficulty in reading the code and understanding its intent.

>>> def add(x,y):
...     return x+y
...
>>> def prod(x,y):
...     return x*y
...
>>> def op(fn,x,y):
...     return fn(x,y)
...
>>> op(add,4,5)
9
>>> op(prod,4,5)
20

All too often, the method required is determined by the data. For example, a form-ID is used to call an appropriate validation method. This, in turn, results in a set of conditional statements which obscure the code. Consider the following illustration:

>>> def op2(fname,x,y):
...     fn = eval(fname)
...     return fn(x,y)
...
>>> op2('add',4,5)
9
>>> op2('prod',4,5)
20

The eval function allows you to convert a string into code. This eliminates the need for the conditional expressions discussed above. Now, consider the following addition:

>>> newfn = """def div(x,y):
...     return x/y"""
>>> exec(newfn)
>>> div(6,2)
3
>>> op(div,6,2)
3
>>> op2('div',6,2)
3
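As an editorial aside (not from the original column): eval and exec will execute whatever string they are handed, so when the set of permitted operations is known in advance, a dictionary mapping names to functions gives the same data-driven dispatch without evaluating arbitrary strings. The op3 name and the dispatch table below are this example's own, reusing the add and prod functions defined above:

>>> dispatch = {'add': add, 'prod': prod}   # explicit whitelist of callables
>>> def op3(fname, x, y):
...     return dispatch[fname](x, y)
...
>>> op3('add', 4, 5)
9
>>> op3('prod', 4, 5)
20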
Machines Learn in Many Different Ways
This article gives the reader a bird's eye view of machine learning models, and solves a use case through SFrames and Python.
'Data is the new oil'—and this is not an empty expression doing the rounds within the tech industry. Nowadays, the strength of a company is also measured by the amount of data it has. Facebook and Google offer their services free in lieu of the vast amount of data they get from their users. These companies analyse the data to extract useful information. For instance, Amazon keeps suggesting products based on your buying trends, and Facebook always suggests friends and posts in which you might be interested. Data in the raw form is like crude oil—you need to refine crude oil to make petrol and diesel. Similarly, you need to process data to get useful insights, and this is where machine learning comes in handy.

Machine learning has different models such as regression, classification, clustering and similarity, matrix factorisation, deep learning, etc. In this article, I will briefly describe these models and also solve a use case using Python.

Linear regression: Linear regression is studied as a model to understand the relationship between input and output numerical values. The representation is a linear equation that combines a specific set of input values (x), the solution to which is the predicted output for that set of input values. It helps in estimating the values of the coefficients used in the representation from the data that we have available. For example, in a simple regression problem (a single x and a single y), the form of the model is:

y = B0 + B1*x

Using this model, the price of a house can be predicted based on the data available on nearby homes. (A short Python sketch of estimating B0 and B1 follows this overview.)

Classification model: The classification model helps identify the sentiments of a particular post. For example, a user review can be classified as positive or negative based on the words used in the comments. Given one or more inputs, a classification model will try to predict the value of one or more outcomes. Outcomes are labels that can be applied to a data set. Emails can be categorised as spam or not, based on these models.

Clustering and similarity: This model helps when we are trying to find similar objects. For example, if I am interested in reading articles about football, this model will search for documents with certain high-priority words and suggest articles about football. It will also find articles on Messi or Ronaldo, as they are involved with football. TF-IDF (term frequency - inverse document frequency) is used to evaluate this model.

Deep learning: This is also known as deep structured learning or hierarchical learning. It is used for product recommendations and image comparison based on pixels.
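As promised above, here is a small illustration of the linear regression model: the two coefficients can be estimated by ordinary least squares in a few lines of plain Python. The numbers are made-up house sizes and prices, used only to show the mechanics; this is not code from the article's use case.

# Hypothetical data: house sizes (sq. ft) and prices (in thousands).
xs = [1100, 1400, 1800, 2100]
ys = [199, 245, 310, 360]

n = len(xs)
mean_x = sum(xs) / float(n)
mean_y = sum(ys) / float(n)

# B1 = covariance(x, y) / variance(x); B0 = mean(y) - B1 * mean(x)
b1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
     sum((x - mean_x) ** 2 for x in xs)
b0 = mean_y - b1 * mean_x

print(b0 + b1 * 1600)   # predicted price of a 1600 sq. ft house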
import graphlab
people = graphlab.SFrame('people_wiki.gl/')
people.head()
obama = people[people['name'] == 'Barack Obama']

2. Now, sort the word counts for the Obama article. To turn the dictionary of word counts into a table, give the…
3. To sort the word counts to show the most common words at the top, type:

obama_word_count_table.head()
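The command that builds obama_word_count_table is cut off in the listing above. For readers following along, GraphLab Create's documented count_words() and stack() calls can produce such a table; the snippet below is a reconstruction along those lines (the column names are assumptions), not necessarily the article's original code:

# Reconstructed sketch using GraphLab Create's documented API.
obama['word_count'] = graphlab.text_analytics.count_words(obama['text'])
obama_word_count_table = obama[['word_count']].stack(
    'word_count', new_column_name=['word', 'count'])
# Step 3 then sorts the table so the most common words come first:
obama_word_count_table = obama_word_count_table.sort('count', ascending=False)
obama_word_count_table.head()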
Java is an object-oriented general-purpose programming language. Java applications are initially compiled to bytecode, which can then be run on a Java virtual machine (JVM), independent of the underlying computer architecture. According to Wikipedia, "A Java virtual machine is an abstract computing machine that enables a computer to run a Java program." Don't get confused with this complicated definition—just imagine that the JVM acts as software capable of running Java bytecode. The JVM acts as an interpreter for Java bytecode. This is the reason why Java is often called a compiled and interpreted language. The development of Java—initially called Oak—was begun in 1991 by James Gosling, Mike Sheridan and Patrick Naughton. The first public implementation of Java was released as Java 1.0 in 1996 by Sun Microsystems. Currently, Oracle Corporation owns Sun Microsystems. Unlike many other programming languages, Java has a mascot called Duke (shown in Figure 1).

Figure 1: Duke – the mascot of Java

As with previous articles in this series, I really wanted to begin with a brief discussion about the history of Java by describing the different platforms and versions of Java. But here I am at a loss. The availability of a large number of Java platforms and the complicated version numbering scheme followed by Sun Microsystems makes such a discussion difficult. For example, in order to explain terms like Java 2, Java SE, Core Java, JDK, Java EE, etc, in detail, a series of articles might be required. Such a discussion about the history of Java might be a worthy pursuit for another time but definitely not for this article. So, all I am going to do is explain a few key points regarding the various Java implementations.

First of all, Java Card, Java ME (Micro Edition), Java SE (Standard Edition) and Java EE (Enterprise Edition) are all different Java platforms that target different classes of devices and application domains. For example, Java SE is customised for general-purpose use on desktop PCs, servers and similar devices. Another important question that requires an answer is, 'What is the difference between Java SE and Java 2?' Books like 'Learn Java 2 in 48 Hours' or 'Learn Java SE in Two Days' can confuse beginners a lot while making a choice. In a nutshell, there is no difference between the two. All this confusion arises due to the complicated naming convention followed by Sun Microsystems.

The December 1998 release of Java was called Java 2, and the version name J2SE 1.2 was given to JDK 1.2 to distinguish it from the other platforms of Java. Again, J2SE 1.5 (JDK 1.5) was renamed J2SE 5.0 and later Java SE 5, citing the maturity of J2SE over the years as the reason for this name change. The latest version of Java is Java SE 9, which was released in September 2017. But actually, when you say Java 9, you mean JDK 1.9. So, keep in mind that Java SE was formerly known as Java 2 Platform, Standard Edition or J2SE.

The Java Development Kit (JDK) is an implementation of one of the Java platforms, Standard Edition, Enterprise Edition or Micro Edition, in the form of a binary product. The JDK includes the JVM and a few other tools like the compiler (javac), debugger (jdb), applet viewer, etc, which are required for the development of Java applications and applets. The latest version of the JDK is JDK 9.0.1, released in October 2017. OpenJDK is a free and open source implementation of Java SE. The OpenJDK implementation is licensed under the GNU General Public License (GNU GPL). The Java Class Library (JCL) is a set of dynamically loadable libraries that Java applications can call at run time. JCL contains a number of packages, and each of them contains a number of classes to provide various functionalities. Some of the packages in JCL include java.lang, java.io, java.net, java.util, etc.

Now a Java class file called HelloWorld.class containing the Java bytecode is created in the directory. The JVM can be invoked to execute this class file containing bytecode with the command:

java HelloWorld

The message 'Hello World' is displayed on the terminal. Figure 2 shows the execution and output of the Java program HelloWorld.java. The program contains a special method named main( ), the starting point of this program, which will be identified and executed by the JVM. Remember that a method in an object oriented programming paradigm is nothing but a function in a procedural programming paradigm. The main( ) method contains the following line of code, which prints the message 'Hello World' on the terminal:

System.out.println("Hello World");

The program HelloWorld.java and all the other programs discussed in this article can be downloaded from opensourceforu.com/article_source_code/January18javaforyou.zip.
pattern matches the whole string. In this case, the string 'Open Source' is just a substring of the string 'Magazine Open Source For You' and, since there is no match, the method matches( ) returns False, and the if statement displays the message 'No Match Found' on the terminal. If you replace the lines of code:

'Pattern pat = Pattern.compile("Open Source");'
'if(mat.matches( ))'

…in Regex1.java with the line of code:

Regex3.java will display the message 'Match from 10 to 20' on the terminal. This is due to the fact that the substring 'Open Source' appears from the 10th character to the 20th character in the string 'Magazine Open Source For You'. The method find( ) also returns True in case of a match and False if there is no match. The method find( ) can be used repeatedly to find all the matching substrings present in a string. Consider the program Regex4.java shown below.

Figure 4: Output of Regex3.java and Regex4.java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Regex5
{
	public static void main(String args[])
	{
		Pattern pat = Pattern.compile("S.*r");
		String str = "Sachin Tendulkar Hits a Sixer";
		Matcher mat = pat.matcher(str);
		int i=1;
		while(mat.find( ))
		{
			System.out.println("Matched String " + i + " : " + mat.group( ));
			i++;
		}
	}
}

On execution, the program Regex5.java displays the message 'Matched String 1 : Sachin Tendulkar Hits a Sixer' on the terminal. What is the reason for matching the whole string? Because the pattern 'S.*r' searches for a string starting with S, followed by zero or more occurrences of any character, and finally ending with an r. Since the pattern '.*' results in a greedy match, the whole string is matched.

Now replace the line of code:

'Pattern pat = Pattern.compile("S.*r");'

…in Regex5.java with the line:

'Pattern pat = Pattern.compile("S.*?r");'

…to get Regex6.java. What will be the output of Regex6.java? Since this is the last article of this series on regular expressions, I request you to try your best to find the answer before proceeding any further. Figure 5 shows the output of Regex5.java and Regex6.java. But what is the reason for the output shown by Regex6.java? Again, I request you to ponder over the problem for some time and find out the answer. If you don't get the answer, download the file Regex6.java from the link shown earlier; in that file I have given the explanation as a comment.

Figure 5: Output of Regex5.java and Regex6.java
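For comparison with the Python instalment of this series, the same greedy-versus-lazy contrast can be reproduced in a Python shell. This is an editorial illustration mirroring Regex5 and Regex6, and, fair warning, it hints at the answer to the exercise above:

>>> import re
>>> s = 'Sachin Tendulkar Hits a Sixer'
>>> re.findall('S.*r', s)     # greedy, like Regex5.java
['Sachin Tendulkar Hits a Sixer']
>>> re.findall('S.*?r', s)    # lazy, like Regex6.java
['Sachin Tendulkar', 'Sixer']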
So, with that example, let us wind up our discussion about regular expressions in Java. Java is a very powerful programming language and the effective use of regular expressions will make it even more powerful. The basic stuff discussed here will definitely kick-start your journey towards the efficient use of regular expressions in Java. And now it is time to say farewell.

In this series, we have discussed regular expression processing in six different programming languages. Four of these—Python, Perl, PHP and Java—use a regular expression style called PCRE (Perl Compatible Regular Expressions). The other two programming languages we discussed in this series, C++ and JavaScript, use a style known as the ECMAScript regular expression style. The articles in this series were never intended to describe the complexities of intricate regular expressions in detail. Instead, I tried to focus on the different flavours of regular expressions and how they can be used in various programming languages. Any decent textbook on regular expressions will give a language-agnostic discussion of regular expressions, but we were more concerned with the actual execution of regular expressions in programming languages.

Before concluding this series, I would like to go over the important takeaways. First, always remember the fact that there are many different regular expression flavours. The differences between many of them are subtle, yet they can cause havoc if used indiscreetly. Second, the style of regular expression used in a programming language depends on the flavour of regular expression implemented by the language's regular expression engine. Due to this reason, a single programming language may support multiple regular expression styles with the help of different regular expression engines and library functions. Third, the way different languages support regular expressions is different. In some languages, the support for regular expressions is part of the language core; an example of such a language is Perl. In some other languages, regular expressions are supported with the help of library functions; C++ is a programming language in which regular expressions are implemented using library functions. Due to this, all the versions and standards of some programming languages may not support the use of regular expressions. For example, in C++, the support for regular expressions starts with the C++11 standard. For the same reason, the different versions of a particular programming language itself might support different regular expression styles. You must be very careful about these important points while developing programs using regular expressions, to avoid dangerous pitfalls.

So, finally, we are at the end of a long journey of learning regular expressions. But an even longer and far more exciting journey of practising and developing regular expressions lies ahead. Good luck!

By: Deepu Benson
The author is a free software enthusiast whose area of interest is theoretical computer science. He maintains a technical blog at www.computingforbeginners.blogspot.in and can be reached at deepumb@hotmail.com.
Explore Data Using R
As of August 2017, Twitter had 328 million active users, with 500 million tweets
being sent every day. Let’s look at how the open source R programming
language can be used to analyse the tremendous amount of data created by
this very popular social media tool.
Social networking websites are ideal sources of Big Data, which has many applications in the real world. These sites contain both structured and unstructured data, and are perfect platforms for data mining and subsequent knowledge discovery from the source. Twitter is a popular source of text data for data mining. Huge volumes of Twitter data contain many varieties of topics, which can be analysed to study the trends of different current subjects, like market economics or a wide variety of social issues. Accessing Twitter data is easy, as open APIs are available to transfer and arrange data in JSON and ATOM formats.

In this article, we will look at an R programming implementation for Twitter data analysis and visualisation. This will give readers an idea of how to use R to analyse Big Data. As a micro blogging network for the exchange and sharing of short public messages, Twitter provides a rich repository of different hyperlinks, multimedia and hashtags, depicting the contemporary social scenario in a geolocation. From the originating tweets and the responses to them, as well as the retweets by other users, it is possible to implement opinion mining over a subject of interest in a geopolitical location. By analysing the favourite counts and the information about the popularity of users from their followers' count, it is also possible to make a weighted statistical analysis of the data.

Start your exploration
Exploring Twitter data using R requires some preparation. First, you need to have a Twitter account. Using that account, register an application in your Twitter account from the https://apps.twitter.com/ site. The registration process requires basic personal information and produces four keys for connectivity between the R application and the Twitter application. For example, an application myapptwitterR1 may be created, as shown in Figure 1. In turn, this will create your application settings, as shown in Figure 2.

A consumer key, a consumer secret, an access token and an access token secret combination forms the final authentication, using the setup_twitter_oauth() function:

>setup_twitter_oauth(consumerKey, consumerSecret, AccessToken, AccessTokenSecret)

It is also necessary to create an object to save the authentication for future use. This is done by OAuthFactory$new(), as follows:

credential<- OAuthFactory$new(consumerKey, consumerSecret, requestURL, accessURL, authURL)
Figures 1 and 2: Application settings of myapptwitterR1, showing the consumer key (API key), consumer secret (API secret), access level (read and write), owner (dpnkray), owner ID (1371497528), and the access token and access token secret

Here, requestURL, accessURL and authURL are available from the application settings at https://apps.twitter.com/.
Connect to Twitter
This exercise requires R to have a few packages for calling all Twitter related functions. Here is an R script to start the Twitter data analysis task. To access the Twitter data through the just created application myapptwitterR, one needs to call the twitteR, ROAuth and httr packages.

>setwd('d:\\r\\twitter')
>install.packages("twitteR")
>install.packages("ROAuth")
>install.packages("httr")

>library("twitteR")
>library("ROAuth")
>library("httr")

To test this on the MS Windows platform, load Curl into the current workspace, as follows:

>download.file(url="http://curl.haxx.se/ca/cacert.pem", destfile="cacert.pem")

Before the final connectivity to the Twitter application, save all the necessary key values to suitable variables:

>consumerKey='HTgXiD3kqncGM93bxlBczTfhR'
>consumerSecret='djgP2zhAWKbGAgiEd4R6DXujipXRq1aTSdoD9yaHSA8q97G8Oe'
>requestURL='https://api.twitter.com/oauth/request_token'
>accessURL='https://api.twitter.com/oauth/access_token'
>authURL='https://api.twitter.com/oauth/authorize'

With these preparations, one can now create the required connectivity object:

>cred<- OAuthFactory$new(consumerKey, consumerSecret, requestURL, accessURL, authURL)
>cred$handshake(cainfo="cacert.pem")

Authentication to a Twitter application is done by the function setup_twitter_oauth() with the stored key values as:

>setup_twitter_oauth(consumerKey, consumerSecret, AccessToken, AccessTokenSecret)

With all this done successfully, we are ready to access Twitter data. As an example of data analysis, let us consider the simple problem of opinion mining.

Data analysis
To demonstrate how data analysis is done, let's get some data from Twitter. The twitteR package provides the function searchTwitter() to retrieve tweets based on the keywords searched for. Twitter organises tweets using hashtags. With the help of a hashtag, you can expose your message to an audience interested in only some specific subject. If the hashtag is a popular keyword related to your business, it can act to increase your brand's awareness levels. The use of popular hashtags helps one to get noticed. Analysis of hashtag appearances in tweets or Instagram can reveal different trends of what the people are thinking about the hashtag keyword. So this can be a good starting point to decide your business strategy.

To demonstrate hashtag analysis using R, here, we have picked the number one hashtag keyword #love for the study. Other than this search keyword, the searchTwitter() function also requires the maximum number of tweets that the function call will return. For this discussion, let us consider the maximum number as 500. Depending upon the speed of your Internet connection and the traffic on the Twitter server, you will get the responses within a few minutes, as an R list class object.
>tweetList<- searchTwitter("#love",n=500)
>mode(tweetList)
[1] "list"
>length(tweetList)
[1] 500

In R, an object list is a compound data structure and contains all types of R objects, including itself. For further analysis, it is necessary to investigate its structure. Since it is an object of 500 list items, the structure of the first item is sufficient to understand the schema of the set of records.

>str(head(tweetList,1))
List of 1
$ :Reference class 'status' [package "twitteR"] with 20 fields
..$ text : chr "https://t.co/L8dGustBQX #SavOne #LLOVE #GotItWrong #JCole #Drake #Love #F4F #follow #follow4follow #Repost #followback"
..$ favorited : logi FALSE
..$ favoriteCount : num 0
..$ replyToSN : chr(0)
..$ created : POSIXct[1:1], format: "2017-10-04 06:11:03"
..$ truncated : logi FALSE
..$ replyToSID : chr(0)
..$ id : chr "915459228004892672"
..$ replyToUID : chr(0)
..$ statusSource : chr "<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>"
..$ screenName : chr "Lezzardman"
..$ retweetCount : num 0
..$ isRetweet : logi FALSE
..$ retweeted : logi FALSE
..$ longitude : chr(0)
..$ latitude : chr(0)
..$ location : chr "Bay Area, CA, #CLGWORLDWIDE <ed><U+00A0><U+00BD><ed><U+00B2><U+00AF>"
..$ language : chr "en"
..$ profileImageURL: chr "http://pbs.twimg.com/profile_images/444325116407603200/XmZ92DvB_normal.jpeg"
..$ urls :'data.frame': 1 obs. of 5 variables:
.. ..$ url : chr "https://t.co/L8dGustBQX"
.. ..$ expanded_url: chr "http://cdbaby.com/cd/savone"
.. ..$ display_url : chr "cdbaby.com/cd/savone"
.. ..$ start_index : num 0
.. ..$ stop_index : num 23
..and 59 methods, of which 45 are possibly relevant:
.. getCreated, getFavoriteCount, getFavorited, getId, getIsRetweet, getLanguage, getLatitude, getLocation, getLongitude, getProfileImageURL,
.. getReplyToSID, getReplyToSN, getReplyToUID, getRetweetCount, getRetweeted, getRetweeters, getRetweets, getScreenName, getStatusSource,
.. getText, getTruncated, getUrls, initialize, setCreated, setFavoriteCount, setFavorited, setId, setIsRetweet, setLanguage, setLatitude,
.. setLocation, setLongitude, setProfileImageURL, setReplyToSID, setReplyToSN, setReplyToUID, setRetweetCount, setRetweeted, setScreenName,
.. setStatusSource, setText, setTruncated, setUrls, toDataFrame, toDataFrame#twitterObj
>

The structure shows that there are 20 fields in each list item, and the fields contain information and data related to the tweets.

Since the data frame is the most efficient structure for processing records, it is now necessary to convert each list item to a data frame and bind these row-by-row into a single frame. This can be done in an elegant way using the do.call() function call, as shown here:

loveDF<- do.call("rbind", lapply(tweetList, as.data.frame))

Function lapply() will first convert each list item to a data frame, then do.call() will bind these, one by one. Now we have a set of records with 19 fields (one less than the list!) in a regular format, ready for analysis. Here, we shall mainly consider the 'created' field to study the distribution pattern of the arrival of tweets.
Figure 3: Histogram of ordered created time tag (x axis: created-time; y axis: frequency)
Figure 4: Cumulative frequency distribution (x axis: cumulative-time-interval; y axis: frequency)
>length(head(loveDF,1))
[1] 19
>str(head(loveDF,1))
'data.frame' : 1 obs. of 19 variables:
$ text : chr "https://t.co/L8dGustBQX #SavOne #LLOVE #GotItWrong #JCole #Drake #Love #F4F #follow #follow4follow #Repost #followback"
$ favorited : logi FALSE
$ favoriteCount : num 0
$ replyToSN : chr NA
$ created : POSIXct, format: "2017-10-04 06:11:03"
$ truncated : logi FALSE
$ replyToSID : chr NA
$ id : chr "915459228004892672"
$ replyToUID : chr NA
$ statusSource : chr "<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>"
$ screenName : chr "Lezzardman"
$ retweetCount : num 0
$ isRetweet : logi FALSE
$ retweeted : logi FALSE
$ longitude : chr NA
$ latitude : chr NA
$ location : chr "Bay Area, CA, #CLGWORLDWIDE <ed><U+00A0><U+00BD><ed><U+00B2><U+00AF>"
$ language : chr "en"
$ profileImageURL: chr "http://pbs.twimg.com/profile_images/444325116407603200/XmZ92DvB_normal.jpeg"
>

The fifth column field is 'created'; we shall try to explore the different statistical characteristics of this field.

>attach(loveDF) # attach the frame for further processing
>head(loveDF['created'],2) # first 2 record set items for demo
created
1 2017-10-04 06:11:03
2 2017-10-04 06:10:55

Twitter follows the Coordinated Universal Time tag as the time-stamp to record the tweet's time of creation. This helps to maintain a normalised time frame for all records, and it becomes easy to draw a frequency histogram of the 'created' time tag (Figure 2).

>hist(created, breaks=15, freq=TRUE, main="Histogram of created time tag")

Figure 2: Histogram of created time tag

If we want to study the pattern of how the word 'love' appears in the data set, we can take the differences of consecutive time elements of the vector 'created'. The R function diff() can do this. It returns iterated lagged differences of the elements of an integer vector. In this case, we need the lag and iteration variables as one. To have a time series from the 'created' vector, it first needs to be converted to an integer; here, we have done it before creating the series, as follows:

>detach(loveDF)
>sortloveDF<-loveDF[order(as.integer(created)),]
>attach(sortloveDF)
>hist(as.integer(abs(diff(created))))

This distribution shows that the majority of tweets in this group come within the first few seconds and a much smaller number of tweets arrive in subsequent time intervals (Figure 3). From the distribution, it's apparent that the arrival time distribution follows a Poisson distribution pattern, and it is now possible to model the number of times an event occurs in a given time interval.

Let's check the cumulative distribution pattern, and the number of tweets arriving within a time interval. For this we have to write a short R function to get the cumulative values within each interval. Here is the demo script and the graph plot (Figure 4):

countarrival<- function(created)
{
i=1
s <- seq(1,15,1)
for(t in seq(1,15,1))
Continuous integration is a practice that requires developers to integrate code into a shared repository such as GitHub, GitLab, SVN, etc, at regular intervals. This concept was meant to avoid the hassle of finding problems later in the build life cycle. Continuous integration requires developers to have frequent builds. The common practice is that whenever a code commit occurs, a build should be triggered. However, sometimes the build process is also scheduled in a way that too many builds are avoided. Jenkins is one of the most popular continuous integration tools.

Jenkins was known as a continuous integration server earlier. However, the Jenkins 2.0 announcement made it clear that, going forward, the focus would not only be on continuous integration but on continuous delivery too. Hence, 'automation server' is the term used more often after Jenkins 2.0 was released. It was initially developed by Kohsuke Kawaguchi in 2004, and is an automation server that helps to speed up different DevOps implementation practices such as continuous integration, continuous testing, continuous delivery, continuous deployment, continuous notifications, and orchestration using a build pipeline or Pipeline as a Code.

Jenkins helps to manage different application lifecycle management activities. Users can map continuous integration with build, unit test execution and static code analysis; continuous testing with functional testing, load testing and security testing; continuous delivery and deployment with automated deployment into different environments; and so on. Jenkins provides easier ways to configure DevOps practices.

The Jenkins package has two release lines:
• LTS (long term support): Releases are selected every 12 weeks from the stream of regular releases, ensuring a stable release.
• Weekly: A new release is available every week to fix bugs and provide features to the community.

LTS and weekly releases are available in different flavours such as .war files (Jenkins is written in Java), native packages for the operating systems, installers and Docker containers.

The current LTS version is Jenkins 2.73.3. This version comes with a very useful option, called Deploy to Azure. Yes, we can deploy Jenkins to the Microsoft Azure public cloud within minutes. Of course, you need a Microsoft Azure subscription to utilise this option. Jenkins can be installed and used in Docker, FreeBSD, Gentoo, Mac OS X, OpenBSD, openSUSE, Red Hat/Fedora/CentOS, Ubuntu/Debian and Windows.

The features of Jenkins are:
• Support for SCM tools such as Git, Subversion, StarTeam, CVS, AccuRev, etc.
• Extensible architecture using plugins: The plugins available are for Android development, iOS development, .NET development, Ruby development, library plugins, source code management, build tools, build triggers, build notifiers, build reports, UI plugins, authentication and user management, etc.
• The 'Pipeline as a Code' feature, which uses a domain-specific language (DSL) to create a pipeline to manage the application's lifecycle.
• The master-agent architecture, which supports distributed builds.

To install Jenkins, the minimum hardware requirements are 256MB of RAM and 1GB of drive space. The recommended hardware configuration for a small team is 1GB+ of RAM and 50GB+ of drive space. You need to have Java 8 - Java Runtime Environment (JRE) or a Java Development Kit (JDK).

The easiest way to run Jenkins is to download and run its latest stable WAR file version. Download the jenkins.war file, go to that directory and execute the following command:

java -jar jenkins.war

Configuration
To install plugins, go to Jenkins Dashboard > Manage Jenkins > Manage Plugins. Verify the Updates, Available and Installed tabs. For the HTTP proxy configuration, go to the Advanced tab. To manually upload plugins, go to Jenkins Dashboard > Manage Jenkins > Manage Plugins > Advanced > Upload Plugin.

To configure security, go to Jenkins Dashboard > Manage Jenkins > Configure Global Security. You can configure authentication using Active Directory, Jenkins' own user database, and LDAP. You can also configure authorisation using Matrix-based security or the project-based Matrix authorisation strategy.

To configure environment variables (such as ANDROID_HOME), tool locations, SonarQube servers, Jenkins location, Quality Gates - Sonarqube, e-mail notification, and so on, go to Jenkins Dashboard > Manage Jenkins > Configure System. To configure Git, JDK, Gradle, and so on, go to Jenkins Dashboard > Manage Jenkins > Global Tool Configuration.

Creating a pipeline for Android applications
We have the following prerequisites:
• A sample Android application on GitHub, GitLab, SVN or a file system.
• The Gradle installation package, downloaded or configured to install automatically from the Jenkins Dashboard.
• The Android SDK.
• Plugins installed in Jenkins, such as the Gradle plugin, the Android Lint plugin, the Build Pipeline plugin, etc.

Now, let's look at how to create a pipeline using the Build Pipeline plugin so we can achieve the following tasks:
• Perform code analysis for Android application code using Android Lint.
• Create an APK file.
• Create a pipeline so that the code analysis is performed first and, on its successful implementation, another build job is executed to create an APK file.

Now, let's perform each step in sequence.

Configuring Git, Java and Gradle: To execute the build pipeline, it is necessary to take code from a shared repository. As shown below, Git is configured first; the same configuration applies to any version control system that is to be set up. The path in Jenkins is Home > Global Tool Configuration > Version control / Git.

In an Android project, the main part is Gradle, which is used to build our source code and download all the necessary dependencies required for the project. In the name field, users can enter Gradle with its version for better readability. The next field is Gradle Home, which is the same as the environment variable in our system. Copy your path to Gradle and paste it here. There is one more option, 'Install automatically', which installs Gradle's latest version if the user does not have it.

Figure 2: Gradle installation

Configuring the ANDROID_HOME environment variable: The next step is to configure the SDK for the Android project, which contains all the platform tools and other tools as well. Here, the user has to provide the path to the SDK, which is present in the system. The path in Jenkins is Home > Configuration > SDK.

Figure 3: ANDROID_HOME environment variable

Creating a Freestyle project to perform Lint analysis for the Android application: The basic setup is ready, so let's start our project. The first step is to enter a proper name (AndroidApp-CA) for your project. Then select 'Freestyle project' under Category, and click on OK. Your project file structure is ready to use.

The user can customise all the configuration steps to keep things neat and clean. As shown in Figure 4, in the general configuration, the 'Discard old build' option discards all your old builds and keeps the number of builds at whatever the user wants. The path in Jenkins is Home > #your_project# > General Setting.

Figure 4: Source code management

In the last step, we configure Git as version control to pull the latest code for the Build Pipeline. Select the Git option and provide the repository's URL and its credentials. Users can also mention from which branch they want to take the code; as shown in Figure 5, the 'Master' branch is applied here. Then click the Apply and Save buttons to save all your configuration steps.

Figure 5: Lint configuration

The next step is to add Gradle to the build, as well as add Lint to do static code analysis. Lint is a tool that performs code analysis for Android applications, just as SonarQube does for Java applications. To add the Lint task to the configuration, the user has to write Lint options in the
build.gradle file in the Android project.

The Android Lint plugin offers a feature to examine the XML output produced by the Android Lint tool and provides the results on the build page for analysis. It does not run Lint itself; the Lint results in XML format must be generated and available in the workspace.

Creating a Freestyle project to build the APK file for the Android application: After completing the analysis of the code, the next step is to build the Android project and create an APK file.

Creating a Freestyle project with the name AndroidApp-APK: In the build actions, select Invoke Gradle script. Archive the build artefacts such as JAR, WAR, APK or IPA files so that they can be downloaded later. Click on Save.

Figure 8: Archive Artifact
Figure 9: Downstream job

After executing all the jobs, the pipeline can be pictorially represented by using the Build Pipeline plugin. After installing that plugin, users have to give the start, middle and end points to show the build jobs in that sequence. They can configure upstream and downstream jobs to build the pipeline.

To show all the build jobs, click on the '+' sign at the top right hand side of the screen. Select the build pipeline view on the screen that comes up after clicking on this sign. The configuration of a build pipeline view can be decided by the user, as per requirements. Select AndroidApp-CA as the initial job. There are multiple options like the Trigger Option, Display Option, Pipeline Flow, etc.

As configured earlier, the pipeline starts by clicking on the Run button and is refreshed periodically. Upstream and downstream, the job execution will take place as per the configuration. After completing all the processes, you can see the visualisation shown in Figure 11. Green indicates the successful execution of a pipeline, whereas red indicates an unsuccessful build.

By: Bhagyashri Jain
The author is a systems engineer and loves Android development. She likes to read and share daily news on her blog at http://bjlittlethings.wordpress.com.
Demystifying Blockchains
A blockchain is a continuously growing list of records, called blocks, which are linked and secured using cryptography to ensure data security.

Data security is of paramount importance to corporations. Enterprises need to establish high levels of trust and offer guarantees on the security of the data being shared with them while interacting with other enterprises. The major concern of any enterprise about data security is data integrity. What many in the enterprise domain worry about is, "Is my data accurate?" Data integrity ensures that the data is accurate, untampered with and consistent across the life cycle of any transaction. Enterprises share data like invoices, orders, etc. The integrity of this data is the pillar on which their businesses are built.

Blockchain
A blockchain is a distributed public ledger of transactions that no person or company owns or controls. Instead, every user can access the entire blockchain, and every transaction from any account to any other account, as it is recorded in a secure and verifiable form using algorithms of cryptography. In short, a blockchain ensures data integrity.

A blockchain provides data integrity due to its unique and significant features. Some of these are listed below.
Timeless validation for a transaction: Each transaction in a blockchain has a signature digest attached to it which depends on all the previous transactions, without an expiration date. Due to this, each transaction can be validated at any point in time by anyone, without the risk of the data being altered or tampered with.
Highly scalable and portable: A blockchain is a decentralised ledger distributed across the globe, and it ensures very high availability and resilience against disaster.
Tamper-proof: A blockchain uses asymmetric or elliptic curve cryptography under the hood. Besides, each transaction gets added to the blockchain only after validation, and each transaction also depends on the previous transaction.

Demystifying blockchains
A blockchain, in itself, is a distributed ledger and an interconnected chain of individual blocks of data, where each block can be a transaction, or a group of transactions. In order to explain the concepts of the blockchain, let's look at a code example in JavaScript. The link to the GitHub repository can be found at https://github.com/abhiit89/AngCoins. So do check the GitHub repo and go through the 'README', as it contains the instructions on how to run the code locally.

Block: A block in a blockchain is a combination of the transaction data along with the hash of the previous block. For example:

class Block {
	constructor(blockId, dateTimeStamp, transactionData, previousTransactionHash) {
		this.blockId = blockId;
		this.dateTimeStamp = dateTimeStamp;
		this.transactionData = transactionData;
		this.previousTransactionHash = previousTransactionHash;
		this.nonce = 0; // counter used later when mining; assumed from the repo
		this.currentTransactionHash = this.calculateBlockDigest();
	}
}

The definition of the block, inside a blockchain, is presented in the above example. It consists of the data (which includes blockId, dateTimeStamp, transactionData, previousTransactionHash and nonce), the hash of the data (currentTransactionHash) and the hash of the previous transaction data.

Genesis block: A genesis block is the first block to be created at the beginning of the blockchain. For example:

new Block(0, new Date().getTime().valueOf(), 'First Block', '0');

Adding a block to the blockchain
In order to add blocks or transactions to the blockchain, we have to create a new block with a set of transactions, and add it to the blockchain as explained in the code example below:

addNewTransactionBlockToTransactionChain(currentBlock) {
	currentBlock.previousTransactionHash = this.returnLatestBlock().currentTransactionHash;
	currentBlock.currentTransactionHash = currentBlock.calculateBlockDigest();
	this.transactionChain.push(currentBlock);
}
In the above code example, we calculate the hash of the previous transaction and the hash of the current transaction before pushing the new block to the blockchain. We also validate the new block before adding it to the blockchain, using the method described below.

Validating the blockchain
Each block needs to be validated before it gets added to the blockchain. The validation we used in our implementation is described below:

isBlockChainValid() {
	for (let blockCount = 1; blockCount < this.transactionChain.length; blockCount++) {
		const currentBlockInBlockChain = this.transactionChain[blockCount];
		const previousBlockInBlockChain = this.transactionChain[blockCount - 1];
		// The stored hash must match a freshly computed digest of the block.
		if (currentBlockInBlockChain.currentTransactionHash !== currentBlockInBlockChain.calculateBlockDigest()) {
			return false;
		}
		// Each block must point at the hash of the block before it.
		if (currentBlockInBlockChain.previousTransactionHash !== previousBlockInBlockChain.currentTransactionHash) {
			return false;
		}
	}
	return true;
}

In this implementation, there are a lot of features missing as of now, like validation of the funds, a rollback feature in case the newly added block corrupts the blockchain, etc. If anyone is interested in tackling fund validation, the rollback or any other issue they find, please go to my GitHub repository, create an issue and the fix for it, and send me a pull request, or just fork the repository and use it in whichever way this code suits your requirements.

A point to be noted here is that in this implementation, there are numerous ways to tamper with the blockchain. One way is to tamper with the data alone. The implementation for that is done in the branch https://github.com/abhiit89/AngCoins/tree/tampering_data. Another way is to not only change the data but also update the hash. Even then, the current implementation can invalidate it. The code for it is available in the branch https://github.com/abhiit89/AngCoins/tree/tampering_data_with_updated_hash.

Proof of work
With the current implementation, it is still possible that someone can spam the blockchain by changing the data in one block and updating the hash in all the following blocks in the blockchain. In order to prevent that, the concept of 'proof of work' suggests a difficulty or condition that each block that is generated has to meet before getting added to the blockchain. This difficulty prevents very frequent generation of blocks, as the hashing algorithm used to generate the block is not under the control of the person creating the block. In this way, it becomes a game of hit and miss to try to generate the block that meets the required conditions.

For our implementation, we have set the difficult task that each block generated must have two '00' at the beginning of the hash, in order to be added to the blockchain. For example, we can modify the function that adds a new block to include the difficult task, as given below:

addNewTransactionBlockToTransactionChain(currentBlock) {
	currentBlock.previousTransactionHash = this.returnLatestBlock().currentTransactionHash;
	currentBlock.mineNewBlock(this.difficulty);
	this.transactionChain.push(currentBlock);
}

This calls the mining function (which validates the difficult conditions):

mineNewBlock(difficulty) {
	while(this.currentTransactionHash.substring(0, difficulty) !== Array(difficulty + 1).join('0')) {
		this.nonce++;
		this.currentTransactionHash = this.calculateBlockDigest();
	}
	console.log('New Block Mined --> ' + this.currentTransactionHash);
}

The complete code for this implementation can be seen in the branch https://github.com/abhiit89/AngCoins/tree/block_chain_mining.
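The same mining loop is easy to express in other languages. Here is a compact Python rendering of the idea, an editorial illustration using the standard hashlib module rather than code from the AngCoins repository; the difficulty is the number of leading zeros required in the hash:

import hashlib

def mine_new_block(transaction_data, previous_hash, difficulty=2):
    # Increment a nonce until the block digest starts with
    # 'difficulty' zeros, mirroring mineNewBlock() above.
    nonce = 0
    while True:
        digest = hashlib.sha256(
            ('%s%s%d' % (transaction_data, previous_hash, nonce)).encode()
        ).hexdigest()
        if digest.startswith('0' * difficulty):
            return nonce, digest
        nonce += 1

nonce, digest = mine_new_block('First Block', '0')
print(nonce, digest)   # the digest will begin with '00'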
Blockchain providers
Blockchain technology, with its unprecedented way of managing trust and data and of executing procedures, can transform businesses. Here are some open source blockchain platforms.
R is an open source programming language and environment for data analysis and visualisation, and is widely used by statisticians and analysts. It is a GNU package written mostly in C, Fortran and R itself.

Installing R
Installing R is very easy. Navigate the browser to www.r-project.org and click on CRAN in the Download section (Figure 1). This will open the CRAN mirrors. Select the appropriate mirror and it will take you to the Download section, as shown in Figure 2. Grab the version which is appropriate for your system and install R. After the installation, you can see the R icon on the menu/desktop, as seen in Figure 3.

You can start using R by double-clicking on the icon, but there is a better way available. You can install R Studio, which is an IDE (integrated development environment); this makes things very easy. It's a free and open source integrated environment for R. Download R Studio from https://www.rstudio.com/products/rstudio/. Use the open source edition, which is free to use. Once installed, open R Studio by double-clicking on its icon, which will look like what's shown in Figure 4.

The default screen of R Studio is divided into three sections, as shown in Figure 5. The section marked '1' is the main console window, where we will execute the R commands. Section 2 shows the environment and history. The former will show all the available variables for the console and their values, while 'history' stores all the commands' history. Section 3 shows the file explorer, help viewer and the tab for visualisation. Clicking on the Packages tab in Section 3 will list all the packages available in R Studio, as shown in Figure 6.

Figure 6: Packages in R Studio

Using R is very straightforward. On the console area, type '2 + 2' and you will get '4' as the output. Refer to Figure 7. The R console supports all the basic math operations, so one can think of it as a calculator. You can try to do more calculations on the console.

Figure 7: Using the console in R Studio

Creating a variable is very straightforward too. To assign '2' to the variable 'x', use the following different ways:

> x <- 2
OR
> x = 2
OR
> assign("x",2)
OR
> x <- y <- 2

One can see that there is no concept of data type declaration. The data type is assumed according to the value assigned to the variable. As we assign the value, we can also see the Environment panel display the variable and value, as shown in Figure 8. The rm command is used to remove a variable.

R supports basic data types; to find the type of data in a variable, use the class function, as shown below:

> x <- 2
> class(x)
[1] "numeric"

The basic data types are numeric, character, date and logical. The following code shows how to use various data types:

> x<-"data"
> class(x)
[1] "character"
> nchar(x)
[1] 4
> d<-as.Date("2017-12-01")
> d
[1] "2017-12-01"
> class(d)
[1] "Date"
> b<-TRUE
> class(b)
[1] "logical"

Apart from basic data types, R supports data structures or objects like vectors, lists, arrays, matrices and data frames. These are the key objects or data structures in R.

A vector stores data of the same type. It can be thought of as a standard array in most of the programming languages. A 'c' function is used to create a vector ('c' stands for 'combine'). The following code snippet shows the creation of a vector:

> v <- c(10,20,30,40)
> v
[1] 10 20 30 40

The most interesting thing about a vector is that any operation applied on it will be applied to its individual elements. For example, 'v + 10' will increase the value of each element of the vector by 10.

> v + 10
[1] 20 30 40 50
> a<-1:5
> b<-21:25
> a+b
[1] 22 24 26 28 30
To convert a JPG image to BMP, you can give the following command:

convert image.jpg image.bmp

The tool can also be used to resize an image, for which the syntax is shown below:

convert [nameofimage.jpg] -resize [dimensions] [newnameofimage.jpg]

For example, to convert an image to a size of 800 x 600, the command would be as follows:

convert image.jpg -resize 800x600 image-resized.jpg

$sudo apt-get install linux-tools-common linux-tools-generic

The above command will install the 'perf' tool on Ubuntu or a similar operating system.

$perf list

The above command gives the list of all the information that can be got by running 'perf'. For example, to analyse the performance of a C program, if you want to know the number of cache misses, the command is as follows:

$perf stat -e cache-misses ./a.out

If you want to count more than one event at a time, give the following command:

$perf stat -e cache-misses,cache-references ./a.out

—Gunasekar Duraisamy, dg.gunasekar@gmail.com

Create a QR code from the command line
A QR code (abbreviated from Quick Response code) is a type of matrix bar code (or two-dimensional bar code) first designed for the automotive industry. There are many online websites that help you create a QR code of your choice. There is also a method to generate QR codes for a string or URL using the Linux command line.

This will generate a public key in /home/a/.ssh/id_rsa.pub. On host B as User b, create the ~/.ssh directory (if it is not already present), as follows:

a@A:~> ssh b@B mkdir -p .ssh
b@B's password:

Finally, append User a's new public key to b@B:.ssh/authorized_keys and enter User b's password one last time:

a@A:~> cat .ssh/id_rsa.pub | ssh b@B 'cat >> .ssh/authorized_keys'
b@B's password:

Replace all occurrences of a string with a new line
Often, we might need to replace all occurrences of a string with a new line in a file. We can use the 'sed' command for this:

$sed 's/\@@/\n/g' file1.txt > file2.txt

The above command replaces the string '@@' in 'file1.txt' with a new line character and writes the modified lines to 'file2.txt'. sed is a very powerful tool; you can read its manual for more details.

—Nagaraju Dhulipalla, nagarajunice@gmail.com

Git: Know about modified files in a changeset
Running the plain old 'git log' spews out a whole lot of details about each commit. How about extracting just the names of the files (with their paths relative to the root of the Git repository)? Here is a handy command for that:

git log -m -1 --name-only --pretty="format:" HEAD

Changing HEAD to a different SHA1 commit ID will fetch the names of the files only. This can come in handy while tooling the CI environment.

Note: This will return empty on merge commits.

—Ramanathan M, rus.cahimb@gmail.com
On the DVD (recommended system requirements: P4, 1GB RAM, DVD-ROM drive)
Ubuntu: …the future of Ubuntu. You can try it live from the bundled DVD.
Fedora Workstation 27: …for developers and makers of all kinds. It comes with a sleek user interface and the complete open source toolbox.
MX Linux 17: MX Linux is a cooperative venture between the antiX and former MEPIS communities, which uses the best tools and talent from…