Counters 0.1.0 Development Environment Tutorial
For CentOS 6.3 x86_64 Workstation
By Alex Kozlov and Chris Poulin
Testing by Daniel Rule and Ken Krugler
January 31 2013
Contact: Chris Poulin: Chris@patternsandpredictions.net
Table of Contents
1 Introduction
4 Provisioning
4.1
4.2
4.3 Development Workstation
4.4 HBase Cluster
4.5 Network
5 Conventions
6.1 Log into the Gnome Desktop on h13.demo.dev with the poulin account.
6.2 Browse to Cloudera Manager at http://h12.demo.dev:7180/ and log in with the admin account using
6.3 Within Cloudera Manager, Hosts -> Add Host This will start the Add Hosts Wizard. Follow the instructions of the wizard to add h13.demo.dev to the cluster but do not add any roles to h13.demo.dev, as we do not want the development host to participate as part of the cluster. When complete, h13.demo.dev will show up in the Hosts list.
6.4 Within Cloudera Manager, Services -> HBase (hbase1) and Actions -> Download Client Configuration and
6.5 Within Cloudera Manager, Services -> MapReduce (mapreduce1) and Actions -> Download Client
6.6 Open a Terminal (All shell commands going forward will be executed in Terminal)
6.7
6.8
6.9 Edit /usr/bin/mvn2
6.16 Click OK on workspace launcher defaults, both now and when seen through the remainder of this tutorial.
6.17 Install M2E extension for Eclipse Juno
6.18 Close Eclipse
6.19 Obtain a copy of bcounts-0.1.0-SNAPSHOT-project.tar.bz2 from Cloudera and save to /home/poulin/
6.20 Prepare bcounts-0.1.0-SNAPSHOT for execution from Eclipse and CLI
6.21 Check M2_REPO classpath variable set in Eclipse
6.22 In Eclipse: File->Import...->(expand) General -> (highlight) Existing Projects into Workspace -> (click) Next -> and specify root directory /home/poulin/bcounts-0.1.0-SNAPSHOT (hit enter if pasted)
6.23 In Eclipse: Window->Preferences->Java->Code Style->Formatter
6.24 Click Import and navigate to /home/poulin/bcounts-0.1.0-SNAPSHOT/eclipse_formatter_apache.xml Then Click Apply and then Click OK and then close Eclipse.
6.25 Create schema for Bayesian Counters examples in HBase
6.26 Load Iris data into HBase via CLI
6.27 Load Iris data into HBase via Eclipse
6.28 View Iris data and schema in HBase
6.29 Perform NB inference on the Iris dataset Note: NB inference must be executed within 300 seconds of loading iris data into hbase, or modify the 300 in the following steps to a larger number of seconds while testing
6.30 Perform clique scoring with random projections
6.31 Create small delta of the ad.data file
6.32 Load Ad data into HBase via Eclipse
6.33 Perform NB inference on the Ad dataset Run->Run Configurations...
6.34 Create bag of words file from configuration file
6.35 Edit /home/poulin/bcounts-0.1.0-SNAPSHOT/bin/sp_schema.py Change from: if len(sys.argv)<3 or sys.argv[1] is None:
6.36 Create an XML Configuration file derived from a bag-of-words file
6.37 Convert testing files into header-less files for storing in HDFS
6.38 Generate a 'scored_' file in current directory
6.39 Create small delta of sp-training-file
6.40 Load small sample of SP into HBase via Eclipse
6.41 Perform NB inference on the SP dataset Run->Run Configurations...

2013 Patterns and Predictions (Poulin Holdings, LLC) All Rights Reserved. Confidential. Reproduction or redistribution
1 Introduction
Bayesian counters (B-counts) is a framework for online, near-real-time model building and prediction. It can be used to identify correlations in the data, and as a library used to respond to unusual or rare events. The underlying technology for B-counts is HBase, a highly scalable and fault-tolerant key-value map storage engine. The solution can scale to thousands of nodes and billions of features. Finally, the initial prediction algorithm is Naive Bayes (NB). The framework is currently being extended to incorporate Nearest Neighbors (NN) and general Bayesian Network (BN) learning algorithms.
2 The Audience
The steps in this tutorial are highly detailed and aim for optimal repeatability at the time of this writing. However, the audience must have Linux literacy, whether by experience, formal training or education, and have a strong understanding of computer and network security. Finally, this tutorial does not cover statistical analysis aspects of the solution.
3 The Goal
Preparing a development environment is usually a complex task but leads to powerful results and strong capabilities. This tutorial will attempt to make this task as painless and repeatable as possible.
4 Provisioning
4.1
4.2
4.3 Development Workstation
The Workstation should be provisioned with the following:
1. CentOS Linux 6.3 x86_64 Workstation
2. jdk-6u31-linux-x64-rpm or a newer version of jdk-6. Note: Do not use jdk-6u18
3. 16GB RAM & 2 Cores minimum hardware allocation
4. The hostname of the workstation for this tutorial is expected to be permanently assigned as h13.demo.dev
5. h13.demo.dev should be configured to use a certified copy of both the CentOS and EPEL repos
6. The workstation will need an example account created called poulin with permissions:
   a. Account poulin must be able to sudo with root level credentials on h13.demo.dev
   b. Account poulin must be able to log into the Gnome desktop of h13.demo.dev, either directly (in the case of a local bare metal, VMware or VirtualBox installation) or via a VNC client over an SSH tunnel if the workstation is hosted on AWS or another trusted remote cloud/VPS, dedicated hosting service or datacenter.
4.4 HBase Cluster
1. Navigate to https://ccp.cloudera.com/display/DOC/Documentation
2. Download and archive a copy of all documents under:
   a. Cloudera Manager 4.1 Enterprise Edition Documentation
   b. Cloudera Manager 4.1 Free Edition Documentation
3. Follow the steps outlined in CM-4.1-free-installation-guide.pdf to provision a pseudo-distributed cluster in the same trusted network as the Development Workstation h13.demo.dev. The cluster must consist of:
   a. A single host with the assigned hostname h12.demo.dev
   b. The single host should be installed with CM4.1 as well as HBase and all dependent roles via the CM4.1 UI.
4.5 Network
Both the HBase Cluster and the Workstation should be on the same trusted network, with:
- All external in-bound ports blocked for connections into the trusted network, except for SSH from other trusted locations only.
- The workstation must be able to connect out to the internet on HTTP, HTTPS, FTP and SFTP.
- No ports should be blocked between the HBase Cluster and the Workstation within the trusted network.
- Both h13.demo.dev and h12.demo.dev must have a permanent static IP address and hostname.
- The IP address on both h13.demo.dev and h12.demo.dev must support reverse lookup to hostname.
- The date and time on both h13.demo.dev and h12.demo.dev must be permanently in sync.
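
The reverse lookup requirement above can be checked before installing anything. The following is a minimal Python 3 sketch, not part of the bcounts distribution: the function name check_reverse_lookup is hypothetical, and it only uses the standard library socket module. The resolver functions are passed as parameters so the logic can be exercised without a network.

```python
import socket


def check_reverse_lookup(hostname, resolve=socket.gethostbyname,
                         reverse=socket.gethostbyaddr):
    """Return True if hostname -> IP -> hostname round-trips.

    `resolve` and `reverse` default to the standard library resolvers and
    can be replaced with stubs for offline testing.
    """
    try:
        ip_address = resolve(hostname)
        # gethostbyaddr returns (primary_name, alias_list, ip_list)
        primary_name, aliases, _ = reverse(ip_address)
    except OSError:
        # socket.gaierror and socket.herror are subclasses of OSError
        return False
    names = {primary_name.lower()} | {alias.lower() for alias in aliases}
    return hostname.lower() in names
```

Calling check_reverse_lookup("h12.demo.dev") and check_reverse_lookup("h13.demo.dev") from either host should return True when forward and reverse DNS (or /etc/hosts) agree.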
5 Conventions
Single line boxes delimit commands to execute in the Terminal
# example
Dashed line boxes delimit some or all of the contents of a text file
# example
Double line boxes delimit some or all standard output
# example
Wave line boxes delimit hbase shell
# example
Candy cane boxes delimit overview of logic
# example
6.1 Log into the Gnome Desktop on h13.demo.dev with the poulin account.
Note: Gnome Desktop comes with CentOS Linux 6.3 x86_64 Desktop install
6.2 Browse to Cloudera Manager at http://h12.demo.dev:7180/ and log in with the admin account using
6.3 Within Cloudera Manager, Hosts -> Add Host
This will start the Add Hosts Wizard. Follow the instructions of the wizard to add h13.demo.dev to the cluster, but do not add any roles to h13.demo.dev, as we do not want the development host to participate as part of the cluster. When complete, h13.demo.dev will show up in the Hosts list.
6.4 Within Cloudera Manager, Services -> HBase (hbase1) and Actions -> Download Client Configuration and
6.5 Within Cloudera Manager, Services -> MapReduce (mapreduce1) and Actions -> Download Client
6.6 Open a Terminal (All shell commands going forward will be executed in Terminal)
6.7
6.8
=> "abcde"
=> 1633837924
=> 168496141
=> 1633837924
=> 1633837924
=> 1321038671000
6.9 Edit /usr/bin/mvn2
#!/bin/bash
MAVEN_HOME=/usr/share/apache-maven-2.2.1
M2_HOME=$MAVEN_HOME
PATH=$MAVEN_HOME/bin:$PATH
export MAVEN_HOME
export M2_HOME
export PATH
/usr/share/apache-maven-2.2.1/bin/mvn "$@"
configured correctly.
cd /tmp/
curl -O http://mirrors.kernel.org/gnu/src-highlite/source-highlight-3.1.7.tar.gz.sig
curl -O http://mirrors.kernel.org/gnu/src-highlite/source-highlight-3.1.7.tar.gz
# Reminder: Confirm signature is OK before running or installing anything.
# Reminder: If mirror downloads fail, locate an alternate mirror
#
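
The signature reminder above can be scripted. This is a minimal Python 3 sketch, assuming gpg is installed and the signer's public key has already been imported into the keyring; the helper names build_verify_command and signature_is_valid are hypothetical. It relies only on the standard gpg usage of passing the detached signature file followed by the signed file to --verify.

```python
import subprocess


def build_verify_command(archive_path, signature_path=None):
    """Build the `gpg --verify` argument list for a detached signature."""
    if signature_path is None:
        # Assumption: the signature sits next to the archive with a .sig suffix,
        # as with the two files downloaded above.
        signature_path = archive_path + ".sig"
    # gpg expects the signature file first, then the signed data file.
    return ["gpg", "--verify", signature_path, archive_path]


def signature_is_valid(archive_path, signature_path=None):
    """Return True when gpg exits with status 0 (good signature)."""
    command = build_verify_command(archive_path, signature_path)
    completed = subprocess.run(command, stdout=subprocess.PIPE,
                               stderr=subprocess.PIPE)
    return completed.returncode == 0
```

For example, signature_is_valid("/tmp/source-highlight-3.1.7.tar.gz") should be True before the archive is unpacked or built. A False result can also mean the public key is missing, so the gpg output is worth reading by hand in that case.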
/usr/lib/hadoop-0.20-mapreduce/hadoop-core.jar /usr/lib/hadoop/hadoop-core.jar
6.13 Install Eclipse Juno (newer releases will probably work in future, but Juno gives optimal repeatability)
Navigate to http://www.eclipse.org/downloads/?osType=linux
su -l
mv /home/poulin/Desktop/eclipse-jee-juno-SR1-linux-gtk-x86_64.tar.gz /usr/lib/
chown root:root /usr/lib/eclipse-jee-juno-SR1-linux-gtk-x86_64.tar.gz
cd /usr/lib/
tar -xvf ./eclipse-jee-juno-SR1-linux-gtk-x86_64.tar.gz
ln -s /usr/lib/eclipse/eclipse /usr/bin/eclipse
rm -fr /usr/lib/eclipse-jee-juno-SR1-linux-gtk-x86_64.tar.gz
6.16 Click OK on workspace launcher defaults, both now and when seen through the remainder of this tutorial.
Help -> Eclipse Marketplace -> Search tab -> Find (Maven Integration for Eclipse) (enter key)
Check all boxes under Confirm Selected Features if they are not already checked, and Click Next.
If you accept the Eclipse Foundation Software User Agreement, Check the acceptance and Click Finish
cd /home/poulin/
tar jxf bcounts-0.1.0-SNAPSHOT-project.tar.bz2
cd /home/poulin/bcounts-0.1.0-SNAPSHOT/
mkdir /home/poulin/bcounts-0.1.0-SNAPSHOT/conf.sac.old
mv /home/poulin/bcounts-0.1.0-SNAPSHOT/conf/hbase-site.xml /home/poulin/bcounts-0.1.0-SNAPSHOT/conf.sac.old/
mv /home/poulin/bcounts-0.1.0-SNAPSHOT/conf/core-site.xml /home/poulin/bcounts-0.1.0-SNAPSHOT/conf.sac.old/
mv /home/poulin/bcounts-0.1.0-SNAPSHOT/conf/mapred-site.xml /home/poulin/bcounts-0.1.0-SNAPSHOT/conf.sac.old/
mv /home/poulin/bcounts-0.1.0-SNAPSHOT/conf/hdfs-site.xml /home/poulin/bcounts-0.1.0-SNAPSHOT/conf.sac.old/
cp /home/poulin/Desktop/*-clientconfig.zip /home/poulin/bcounts-0.1.0-SNAPSHOT/
unzip hbase1-clientconfig.zip
unzip mapreduce1-clientconfig.zip
cp ./hbase-conf/hbase-site.xml /home/poulin/bcounts-0.1.0-SNAPSHOT/conf/
cp ./hadoop-conf/core-site.xml /home/poulin/bcounts-0.1.0-SNAPSHOT/conf/
cp ./hadoop-conf/mapred-site.xml /home/poulin/bcounts-0.1.0-SNAPSHOT/conf/
cp ./hadoop-conf/hdfs-site.xml /home/poulin/bcounts-0.1.0-SNAPSHOT/conf/
rm -fr ./mapreduce1-clientconfig.zip
rm -fr ./hbase1-clientconfig.zip
rm -fr ./hbase-conf
rm -fr ./hadoop-conf
cp /home/poulin/bcounts-0.1.0-SNAPSHOT/conf/mapred-site.xml /home/poulin/bcounts-0.1.0-SNAPSHOT/src/main/resources/
cp /home/poulin/bcounts-0.1.0-SNAPSHOT/conf/hbase-site.xml /home/poulin/bcounts-0.1.0-SNAPSHOT/src/main/resources/
cp /home/poulin/bcounts-0.1.0-SNAPSHOT/conf/core-site.xml /home/poulin/bcounts-0.1.0-SNAPSHOT/src/main/resources/
cp /home/poulin/bcounts-0.1.0-SNAPSHOT/conf/hdfs-site.xml /home/poulin/bcounts-0.1.0-SNAPSHOT/src/main/resources/
mkdir /home/poulin/bcounts-0.1.0-SNAPSHOT/lib.old
mv /home/poulin/bcounts-0.1.0-SNAPSHOT/lib/* /home/poulin/bcounts-0.1.0-SNAPSHOT/lib.old/
cp /usr/lib/hadoop/hadoop-common-2.0.0-cdh4.0.1.jar /home/poulin/bcounts-0.1.0-SNAPSHOT/lib/
# First build with Maven2
wget https://builds.apache.org/job/mrunit-trunk/ws/target/mrunit-1.0.0-SNAPSHOT-hadoop1.jar
rm -fr /home/poulin/.m2
# Install mrunit-1.0.0 into ~/.m2
mvn2
# if the following hangs for more than 5 minutes without output, ctrl-c and then re-run it
mvn3 -DskipTests install
# run without -DskipTests switch
mvn3 install
# make maven project loadable into eclipse
mvn3 -Declipse.workspace=/home/poulin/workspace eclipse:configure-workspace
mvn3 -DdownloadSources=true -DdownloadJavadocs=true eclipse:clean eclipse:eclipse
mvn3 dependency:build-classpath
su poulin
eclipse
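
The configuration-copy commands in the preparation step above place the same four client configuration files into both conf/ and src/main/resources/. As an illustration only, here is a minimal Python 3 sketch of that copy logic; the function name install_client_configs is hypothetical, and it assumes the client configuration archives have already been unzipped into directories such as ./hbase-conf and ./hadoop-conf.

```python
import shutil
from pathlib import Path

# The four site files handled by the cp commands in this tutorial.
SITE_FILES = ("hbase-site.xml", "core-site.xml", "mapred-site.xml", "hdfs-site.xml")


def install_client_configs(source_dirs, project_dir):
    """Copy each site file found in source_dirs into conf/ and src/main/resources/.

    Returns the sorted list of file names that were copied.
    """
    project = Path(project_dir)
    targets = [project / "conf", project / "src" / "main" / "resources"]
    for target in targets:
        target.mkdir(parents=True, exist_ok=True)
    copied = []
    for source_dir in source_dirs:
        for name in SITE_FILES:
            candidate = Path(source_dir) / name
            if candidate.is_file():
                for target in targets:
                    shutil.copy2(candidate, target / name)
                copied.append(name)
    return sorted(set(copied))
```

A call such as install_client_configs(["./hbase-conf", "./hadoop-conf"], "/home/poulin/bcounts-0.1.0-SNAPSHOT") would mirror the cp commands. It does not back up the original files to conf.sac.old, so that step still has to be done first.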
Once M2_REPO is observed, Click Cancel back to the parent Eclipse window
6.22 In Eclipse: File->Import...->(expand) General -> (highlight) Existing Projects into Workspace -> (click) Next -> and specify root directory /home/poulin/bcounts-0.1.0-SNAPSHOT (hit enter if pasted)
Ensure that bcounts is checked in the Projects box and Click Finish
Load Iris data into HBase
The iris data loaded into HBase is rectangular and newline delimited in the format:
<count-delta>,<count-delta>,<count-delta>,<classifier><newline>
During the load, the counts in HBase are incremented.
The human-readable meaning and schema of iris.data can be found in the Iris section of
bayesiancounters-site.xml, which was added to a CLASSPATH in prior steps.
A production pipeline would repeat this Iris load at a regular interval of deltas,
or bind the UI directly to the HBase calls used by the loader code.
The loader logic can be studied in Eclipse by modifying the following section of
this tutorial:
Change from: Run->Run Configurations
Change to: Run->Debug Configurations
Then check the box next to Stop in main and step through the code.
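
To make the load format above concrete, here is a minimal Python 3 sketch of parsing delta lines and incrementing counters. It is not the bcounts loader: an in-memory Counter stands in for the HBase increments, the function names are hypothetical, and keying each counter by (classifier, column index) is an assumption made purely for illustration.

```python
from collections import Counter


def parse_delta_line(line):
    """Split '<count-delta>,...,<classifier>' into (deltas, classifier)."""
    fields = [field.strip() for field in line.strip().split(",")]
    *delta_fields, classifier = fields
    return [float(value) for value in delta_fields], classifier


def apply_deltas(counts, lines):
    """Increment counts[(classifier, column_index)] by each delta."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        deltas, classifier = parse_delta_line(line)
        for column_index, delta in enumerate(deltas):
            counts[(classifier, column_index)] += delta
    return counts
```

Because each load only adds deltas to the existing counts, repeating the load at a regular interval accumulates totals rather than replacing them, which matches the production pipeline described above.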
su poulin
eclipse
Run->Run Configurations...
Name: IrisLoad
Main->Project: bcounts
Click on Apply, then Click on Run and then view the Console tab of the parent window
su poulin
eclipse
Run->Run Configurations...
Name: IrisInference
Main->Project: bcounts
Click on Apply, then Click on Run and then view the Console tab of the parent window
NB inference on the Iris dataset
Opens connection to the iris table
Loads iris classifications from bayesiancounters-site.xml into local memory
Moves columns in HBase between tiers, e.g. T5MIN, T30MIN, etc. while computing scores
and tracking parent and child counts
The logic is derived from naive Bayes classifier theory
The resulting scores, counts and probabilities are displayed to standard output
The probabilities output by scoring can be used directly for more complex decision-making algorithms based on benefit/loss analysis.
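
For readers unfamiliar with how naive Bayes turns counts into probabilities, the following Python 3 sketch shows the textbook calculation. It is not the bcounts implementation: the function name, the dictionary layout, and the Laplace smoothing parameter alpha are all assumptions for illustration, and the time tiers are ignored. The names parent_count (rows per class) and child_count (rows per class with a given feature value) are one plausible reading of the parent and child counts mentioned above.

```python
import math


def naive_bayes_posteriors(class_counts, feature_counts, observed, alpha=1.0):
    """Return {class: P(class | observed)} using Laplace-smoothed counts.

    class_counts:   {class: number of training rows}
    feature_counts: {(class, feature_index, value): co-occurrence count}
    observed:       list of feature values for the row being scored
    """
    total = sum(class_counts.values())
    # Distinct values per feature, needed for the smoothing denominator.
    values_per_feature = {}
    for (_, index, value) in feature_counts:
        values_per_feature.setdefault(index, set()).add(value)

    log_scores = {}
    for label, parent_count in class_counts.items():
        log_score = math.log(parent_count / total)  # log prior
        for index, value in enumerate(observed):
            child_count = feature_counts.get((label, index, value), 0)
            distinct = len(values_per_feature.get(index, ())) or 1
            log_score += math.log(
                (child_count + alpha) / (parent_count + alpha * distinct))
        log_scores[label] = log_score

    # Normalise in log space to avoid underflow with many features.
    peak = max(log_scores.values())
    weights = {label: math.exp(score - peak) for label, score in log_scores.items()}
    norm = sum(weights.values())
    return {label: weight / norm for label, weight in weights.items()}
```

The returned probabilities sum to 1, so they can be multiplied by per-class benefit or loss values when a decision rule more elaborate than picking the most likely class is needed.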
Run->Run Configurations...
Name: CliqueRandom
Main->Project: bcounts
Click on Apply, then Click on Run and then view the Console tab of the parent window
Clique scoring can be used to perform variable importance analysis or for emerging trend identification.
Run->Run Configurations...
Name: AdLoad
Main->Project: bcounts
Click on Apply, then Click on Run and then view the Console tab of the parent window
Name: AdInference
Main->Project: bcounts
Click on Apply, then Click on Run and then view the Console tab of the parent window
Close eclipse
Note: These results are from bcounts on 2 lines of the input data only. We recommend using a small or medium sized cluster for processing the entire ad.data file. See the Cloudera Manager Documentation for cluster size specifications.
worker
working
wreckage
xvi
yates
young
SP_increase
<value>SP_increase</value>
</property>
<property>
<name>bayesiancounters.dataset.sp.col.valueset.647</name>
<value>-100, -40, 10, 40, 100</value>
</property>
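
The generated property entries can be read back for a quick sanity check. This is a minimal Python 3 sketch using only the standard library; the function name read_valuesets is hypothetical, and wrapping the properties in a <configuration> root element is an assumption based on the usual Hadoop-style site file layout rather than something shown in this excerpt.

```python
import xml.etree.ElementTree as ET

# Sample built from the property shown above; the <configuration> root is assumed.
SAMPLE = """<configuration>
  <property>
    <name>bayesiancounters.dataset.sp.col.valueset.647</name>
    <value>-100, -40, 10, 40, 100</value>
  </property>
</configuration>"""


def read_valuesets(xml_text, prefix="bayesiancounters.dataset.sp.col.valueset."):
    """Map column index -> list of integers for every matching property."""
    valuesets = {}
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        value = prop.findtext("value") or ""
        if name.startswith(prefix):
            column = int(name[len(prefix):])
            valuesets[column] = [int(item) for item in value.split(",")]
    return valuesets
```

With the sample above, read_valuesets(SAMPLE) returns a dictionary with the single key 647 mapped to the five integers from the value element.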
6.37 Convert testing files into header-less files for storing in HDFS
su poulin
cd /home/poulin/bcounts-0.1.0-SNAPSHOT/
python273 ./bin/sp_training.py /tmp/bag-of-words \
./examples/data/training_19_2004-18_2005.dat > /tmp/sp-training-file
tail -c 32 /tmp/sp-training-file
0,0,0,0,0,0,0,0,0,0,0,0,0,0,7.2
0,1,0,0,0,0,0,0,2,2,2,1,1,3,6,0
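
Step 6.37 refers to header-less files. As an illustration of that idea only (this is not what the sp_training.py script necessarily does), the Python 3 sketch below drops leading header rows from a delimited text file. The function names are hypothetical, and the default of exactly one header row is an assumption.

```python
def strip_header(lines, header_rows=1):
    """Yield every line after the first `header_rows` lines."""
    for index, line in enumerate(lines):
        if index >= header_rows:
            yield line


def write_headerless(source_path, target_path, header_rows=1):
    """Copy source_path to target_path without its leading header rows."""
    with open(source_path) as source, open(target_path, "w") as target:
        target.writelines(strip_header(source, header_rows))
```

Files split across HDFS blocks are easier to process when every line is a data row, which is a common reason for removing headers before upload.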
su poulin
eclipse
Run->Run Configurations...
Name: SpLoad
Main->Project: bcounts
Click on Apply, then Click on Run and then view the Console tab of the parent window
Name: SpInference
Main->Project: bcounts
Click on Apply, then Click on Run and then view the Console tab of the parent window
Close eclipse
Note: These results are from bcounts on 2 lines of the input data only. We recommend using a small or medium sized cluster for processing the entire ad.data file. See the Cloudera Manager Documentation for cluster size specifications.
In this example, B-counts can be used for predicting the effect of news on stock market movements (US Patent No. 7,516,050).
11. GNU Source-highlight 3.1.7 - http://www.gnu.org/software/src-highlite/source-highlight.html
12. Python v2.7.3 documentation - http://docs.python.org/release/2.7.3/
13. Ruby in Twenty Minutes - http://www.ruby-lang.org/en/documentation/quickstart/