Pulse

HP Vertica Analytic Database


Software Version: 7.1.x

Document Release Date: 7/21/2016

Legal Notices
Warranty
The only warranties for HP products and services are set forth in the express warranty statements accompanying such products and services. Nothing herein should be
construed as constituting an additional warranty. HP shall not be liable for technical or editorial errors or omissions contained herein.
The information contained herein is subject to change without notice.

Restricted Rights Legend


Confidential computer software. Valid license from HP required for possession, use or copying. Consistent with FAR 12.211 and 12.212, Commercial Computer
Software, Computer Software Documentation, and Technical Data for Commercial Items are licensed to the U.S. Government under vendor's standard commercial
license.

Copyright Notice
Copyright 2006 - 2015 Hewlett-Packard Development Company, L.P.

Trademark Notices
Adobe is a trademark of Adobe Systems Incorporated.
Microsoft and Windows are U.S. registered trademarks of Microsoft Corporation.
UNIX is a registered trademark of The Open Group.

Contents

Pulse Virtual Machine Quick Start
About the HP Vertica Pulse Package
Installing or Upgrading HP Vertica Pulse
    HP Vertica Pulse Package Version Requirements
    Installation Overview
    Installing Java on HP Vertica Hosts
        Setting the JavaBinaryForUDx Configuration Parameter
    Installing or Upgrading the HP Vertica Pulse Package on Your Host
        Install or Upgrade the Pulse Package
        Running the Pulse Install Script
    Tuning the jvm Resource Pool for HP Vertica Pulse
        Configuring the jvm Resource Pool for your System
    Assign Users to the pulse_users Role and Allow Access to Pulse Functions
    Uninstalling HP Vertica Pulse and Pulse Packages
        Uninstall HP Vertica Pulse on Your Hosts
        Uninstall Pulse Packages
Using Pulse
    Dictionaries and Mappings
        Loading Dictionaries and Mappings into Pulse
        Dictionary and Mapping Labels
        Normalization Map Effects on Results
        Creating Tables for Custom Dictionaries and Mappings
    Determining Sentiment
    Tuning Pulse
        Improving Automatic Attribute Discovery
        Determining How Pulse Scores Sentiment
        Improving Sentiment Scores
    Bulk Loading Word Lists from Text Files
        Bulk Loading User Dictionary Lists
        Bulk Loading the Normalization Map
Multilingual Pulse
    Spanish Pulse
    Multilingual Examples
Pulse Cookbook
    Batch Analyzing Data as It Is Loaded
    Analyzing Comments for a Company or Product
    Determining Popular Topics
    Determining Prolific Authors
    Analyzing the Sentiment of Specific Authors
    Finding Associated Attributes
    Using Pulse as an Aid in Competitive Analysis
Pulse Function Reference
    CommentAttributes
    ExtractSentence
    GetAllDictionarySetLabels
    GetAllDictionaryWords
    GetAllLoadedDictionaries
    GetAllMappingWords
    GetAllSentences
    GetLoadedDictionary
    GetLoadedMapping
    GetSentenceCount
    GetStorage
    LoadDictionary
    LoadMapping
    PartsOfSpeech
    SentimentAnalysis
    SetDefaultLanguage
    UnloadLabeledDictionary
    UnloadLabeledDictionarySet
    UnloadLabeledMapping
We appreciate your feedback!

Pulse Virtual Machine Quick Start


These Quick Start instructions detail the minimal steps for installing and using Pulse with the HP
Vertica Virtual Machine Image. Consult the complete documentation for detailed steps on installing
Pulse on your own platform.
Downloading and Installing Pulse
1. Go to http://my.vertica.com/ and sign in. Then, click the Download tab.
2. Scroll down to the section "Download HP Vertica 7.1 Virtual Machines" and click the download
link for your VM environment. These instructions assume you are installing the VMDK version (VMware Server 2.0 and Workstation 7.0).
3. After the download completes, unzip the file.
4. Double-click the .vmx file in vmsrvr_64/Vertica 7.1.x x64 for VMware. The VM starts in
your VMware application.
5. You are automatically logged in as dbadmin. The password for this user (and for root) is
'password'.
6. In the VM, select Applications > Accessories > Terminal to open a terminal.
7. In the terminal, type admintools to start the administration tools.
8. You are prompted for a license when admintools starts for the first time. To use the community
edition license, simply click OK. You are then prompted to accept the EULA. Accept the EULA
then exit admintools.
9. As dbadmin, using vsql on any node in the cluster, set the JavaBinaryForUDx configuration
parameter (use which java to determine your Java location):
vsql -t -c "ALTER DATABASE mydb SET JavaBinaryForUDx = '/usr/bin/java';"

10. Copy the HP Vertica Pulse install package to the VM then, as root, install the Pulse Package:
rpm -Uvh /path/to/vertica-pulse.x86_64.xxx.rpm

Note: Install HP Vertica Pulse on a single node only. All Pulse functions are available on
all nodes. However, the installation SQL scripts and user-dictionary loading script are only
available on the node on which you install the Pulse package.

11. As dbadmin, run the Pulse install script on the node on which you installed the Pulse Package:
vsql -f /opt/vertica/packages/pulse/ddl/install.sql

Using Pulse
1. Run a sentiment function:
select sentimentanalysis('Cookies are sweet.') OVER(PARTITION BEST);
 sentence | attribute | sentiment_score
----------+-----------+-----------------
        1 | cookies   |               1
(1 row)

Note: By default, HP Vertica Pulse analyzes English text. However, you can also specify the
language of the text being analyzed as an argument to the sentimentanalysis() function. For
example:
select sentimentanalysis('Cookies are sweet.', 'english') OVER(PARTITION BEST);

English and Spanish are the supported languages.

About the HP Vertica Pulse Package


HP Vertica Pulse provides a suite of functions that allow you to analyze and extract the sentiment
from English and Spanish language text directly from your HP Vertica database.
HP Vertica Pulse features include:
Attribute-based sentiment scoring - Pulse scores the sentiment of attributes in a sentence.
Attributes are generally nouns and are automatically discovered by Pulse. Pulse typically scores
sentiment on a scale from -1 (negative sentiment) to +1 (positive sentiment). A sentiment of 0 is
considered neutral. Scoring individual attributes in a sentence instead of scoring the sentence as
a whole provides a more granular analysis for the text. For example, consider the sentence "The
quick brown fox jumped over the lazy dog." It would be difficult to score the sentiment on the
sentence as a whole, but if you score on the attributes of fox and dog, you could say the
sentiment on the fox was positive (the fox is quick), and the sentiment on the dog is negative
(the dog is lazy).

Tuning to your domain - Pulse provides functionality to recognize attributes that are specific
to your domain. For example, you can add the name of your product or company to a 'white_list'
so that it is discovered by Pulse.

Tuning of how sentiment is scored - Pulse includes user-dictionaries of words that are used
to help score sentiment. You can alter these user-dictionaries to fine tune the way your text is
analyzed.

Filtering of attributes you are not interested in - Pulse supports a special 'stop words'
user-dictionary to indicate attributes that should not be analyzed. Alternatively, you can choose to
score sentiment only on attributes defined in your white_list.

Synonym mappings - Pulse provides customizable mappings so that you can map synonyms
to a base word, and then normalize the analysis for the synonyms to the base word. For
example, you can map Hewlett Packard to HP.

HP Vertica Pulse requires that Java and the HP Vertica Java Support Package are installed on all
nodes in the HP Vertica cluster.
Depending on the version of Pulse, it may support only one language (English or Spanish) or
multiple languages (English and Spanish). Multilingual versions of Pulse analyze each text
row (for example, a tweet) in one of three languages: the language passed as an argument with the
text, the language that the user specifies as a parameter, or the default language. See
Multilingual Pulse for details.

Installing or Upgrading HP Vertica Pulse


The HP Vertica Pulse Package requires that Java be installed prior to installing HP Vertica Pulse.
HP Vertica Pulse Package Version Requirements
Installation Overview
Installing Java on HP Vertica Hosts
Installing or Upgrading the HP Vertica Pulse Package on Your Host
Tuning the jvm Resource Pool for HP Vertica Pulse
Assign Users to the pulse_users Role and Allow Access to Pulse Functions
Uninstalling HP Vertica Pulse and Pulse Packages

HP Vertica Pulse Package Version Requirements


Your server must be running version 7.1.x or later to run Pulse. Pulse must be installed on an HP
Vertica node.
You can download the HP Vertica server package and the Pulse package from the HP Vertica Marketplace.

Installation Overview
1. Verify that your HP Vertica server version matches your HP Vertica Pulse version.
2. Install Java on all hosts and set the JavaBinaryForUDx configuration parameter to
your Java binary location. For example, using vsql: ALTER DATABASE mydb SET
JavaBinaryForUDx = '/usr/bin/java';
3. Install the HP Vertica Pulse package on a single node in the cluster. The process is the same for
installation or upgrade. You need only install it on a single node, but note that the SQL scripts
used to install and uninstall the Pulse functions, and the SQL script that creates the pulse schema
and the user-dictionary tables, are only available from the node on which you installed the
Pulse package. The Pulse functions, once installed, are available on all nodes, regardless of
whether the package is installed on the node to which you are connecting.
4. Modify the jvm resource pool so that Pulse performs optimally on your system hardware.

Installing Java on HP Vertica Hosts


You must install a Java Virtual Machine (JVM) on every host in your HP Vertica cluster in order to
run Pulse. Pulse requires a 64-bit Java Standard Edition 6 or 7 (Java version 1.6 or 1.7) runtime.
Both the Oracle JDK and OpenJDK are supported. You can choose to install either the Java Runtime
Environment (JRE) or Java Development Kit (JDK), since the JDK also includes the JRE. See the
Java Standard Edition (SE) Download Page to download an Oracle installation package for your
Linux platform, or use your platform's packaging tool (such as yum or apt-get) to get a Java 1.6 or
1.7 compatible version of OpenJDK.
Once you have installed a JVM on each host, ensure that the java command is in the search path
and calls the correct JVM by running the command:
java -version

This command should print something similar to:


java version "1.6.0_37"
Java(TM) SE Runtime Environment (build 1.6.0_37-b06)
Java HotSpot(TM) 64-Bit Server VM (build 20.12-b01, mixed mode)
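
The version check above can also be scripted. The following is a minimal sketch, not part of the Pulse package: java_supported_for_pulse is a hypothetical helper name, and the patterns it looks for (a version string beginning with 1.6 or 1.7, and the text "64-Bit") are assumptions based on the sample output shown above.

```shell
#!/usr/bin/env bash
# Sketch only: java_supported_for_pulse is a hypothetical helper, not a Pulse tool.
# It reads the text printed by `java -version` on stdin and succeeds when that
# text reports a 1.6 or 1.7 runtime and a 64-bit VM.
java_supported_for_pulse() {
    local out
    out=$(cat)
    printf '%s\n' "$out" | grep -Eq 'version "1\.[67]\.' || return 1
    printf '%s\n' "$out" | grep -q '64-Bit' || return 1
}

# Check the sample output shown above.
sample='java version "1.6.0_37"
Java(TM) SE Runtime Environment (build 1.6.0_37-b06)
Java HotSpot(TM) 64-Bit Server VM (build 20.12-b01, mixed mode)'
if printf '%s\n' "$sample" | java_supported_for_pulse; then
    echo "supported"
else
    echo "not supported"
fi
```

On a host, java -version writes to standard error, so redirect it before piping: java -version 2>&1 | java_supported_for_pulse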

Setting the JavaBinaryForUDx Configuration Parameter


The JavaBinaryForUDx configuration parameter tells HP Vertica where to look for the JRE to
execute Java UDFs. After you have installed the JRE on all of the nodes in your cluster, you need
to set this parameter to the absolute path of the Java executable. You can use the symbolic link
that some Java installers create (for example /usr/bin/java). If the Java executable is in your shell
search path, you can get the path of the Java executable by running the following command from
the Linux command line shell:
$ which java
/usr/bin/java

If the java command is not in the shell search path, use the path to the Java executable in the
directory where you installed the JRE. For example, if you installed the JRE in /usr/java/default
(which is where the installation package supplied by Oracle installs the Java 1.6 JRE), the Java
executable is /usr/java/default/bin/java.
You set the configuration parameter by executing the following statement as a database superuser:
ALTER DATABASE mydb SET JavaBinaryForUDx = '/usr/bin/java';

See ALTER DATABASE for more information on setting configuration parameters.


To view the current setting of the configuration parameter, query the CONFIGURATION_PARAMETERS
system table:
=> \x
Expanded display is on.
=> SELECT * FROM CONFIGURATION_PARAMETERS WHERE parameter_name = 'JavaBinaryForUDx';
-[ RECORD 1 ]-----------------+----------------------------------------------------------
node_name                     | ALL
parameter_name                | JavaBinaryForUDx
current_value                 | /usr/bin/java
default_value                 |
change_under_support_guidance | f
change_requires_restart       | f
description                   | Path to the java binary for executing UDx written in Java

Once you have set the configuration parameter, HP Vertica will be able to find the Java executable
on each node in your cluster in order to execute Java UDFs.
Note: Since the location of the Java executable is set by a single configuration parameter for the
entire cluster, you must ensure that the path to the Java executable is the same across all of the
nodes in the cluster.

Installing or Upgrading the HP Vertica Pulse Package on Your Host

After you install a JVM on all of the nodes in your cluster, you must install the Pulse Package on a
single node. If upgrading, install the new package on the same host on which you previously
installed the package. Pulse installation or upgrade is a two-step process:
1. Install or update the RPM or DEB package for Pulse.
2. Run the included SQL scripts to install or update the Pulse functions and create the user
dictionaries.
The Pulse install process installs the functions and schema required for sentiment analysis. You
need only install it on a single node. However, be aware that the following SQL scripts are only
available from the node on which you installed the Pulse package:

SQL scripts used to install and uninstall the Pulse functions

SQL script that populates and loads the dictionaries

You can access Pulse functions on all nodes, regardless of whether the package is installed on the
node to which you are connecting.

Install or Upgrade the Pulse Package


When you upgrade or reinstall Pulse, it automatically uses port 5433 for vsql. If you are using a
different port, configure it using the command export VSQL_PORT=<port_number>.
1. Copy the RPM or DEB package to the node where you want to install or upgrade Pulse. If you
are upgrading Pulse then copy the new package to the same node where you previously
installed the Pulse package. The version of HP Vertica Pulse must match the version of the
HP Vertica server. For example, if your HP Vertica server is version 7.1.0, then the HP
Vertica Pulse version must also be 7.1.0.
If you are upgrading Pulse, you can find the currently-installed version number of Pulse with the
command:
select lib_version, lib_sdk_version from user_libraries where lib_name =
'SentimentLib';

2. Log into the host and install the package.


For Red Hat, use:

sudo rpm -Uvh /path-to-package/vertica-pulse.x86_64.xxx.rpm

For Debian, use:

sudo dpkg -i /path-to-package/vertica-pulse.x86_64.xxx.deb

The Pulse Package is installed to /opt/vertica/packages/pulse.


After you install the package, you must run the appropriate SQL scripts to install or upgrade the
Pulse functions and install the dictionary tables. HP Vertica automatically reloads any labeled
user-defined dictionaries.

Running the Pulse Install Script


Run the install script to install or upgrade the Pulse functions and schema for the dictionaries and
mappings required for sentiment analysis. You must run the install script once on the node on which
you installed the package. After you run the install script, then all nodes can use the Pulse
functions.
Important! Before running the install script, you must set the JavaBinaryForUDx configuration
parameter. Otherwise, the install script fails to install the Pulse functions. See Installing Java
on HP Vertica Hosts.
To run the install script:
1. As the dbadmin user, on the node on which you installed the Pulse RPM/DEB, run the
install.sh script:
bash /opt/vertica/packages/pulse/install.sh

Note: You must run the install script for installs or upgrades.

2. The script installs/upgrades the Pulse functions:

CREATE LIBRARY
CREATE TRANSFORM FUNCTION
CREATE TRANSFORM FUNCTION
CREATE TRANSFORM FUNCTION
CREATE TRANSFORM FUNCTION
CREATE TRANSFORM FUNCTION
CREATE TRANSFORM FUNCTION
CREATE TRANSFORM FUNCTION
CREATE TRANSFORM FUNCTION
CREATE TRANSFORM FUNCTION
etc...

3. If this is a fresh installation, modify the jvm resource pool to match your system
hardware (see Tuning the jvm Resource Pool for HP Vertica Pulse).

Tuning the jvm Resource Pool for HP Vertica Pulse


Note: You must modify the jvm resource pool to match the capabilities of your
hardware so that HP Vertica Pulse has adequate resources to perform queries. If a cluster
does not have sufficient resources to run an HP Vertica Pulse query, then such a query can fail
with an Out Of Memory (OOM) exception.
HP Vertica Pulse runs as a Java UDx (User Defined eXtension) and uses the jvm resource pool to
define the resources available to run HP Vertica Pulse queries.
HP Vertica starts a Java Virtual Machine (JVM) when you perform an HP Vertica Pulse query. The
session from which you issue the query reserves resources for the JVM (across all nodes in the
cluster) and it releases the resources when the session ends. You can also explicitly close the JVM
attached to the session by using the command SELECT release_jvm_memory();.
The most critical resource pool settings that affect HP Vertica Pulse are MAXMEMORYSIZE and
PLANNEDCONCURRENCY.
MAXMEMORYSIZE defines the amount of RAM that a JVM can use. By default
MAXMEMORYSIZE is set to either 10% of system memory or 2GB, whichever is smaller.

PLANNEDCONCURRENCY defines how many JVMs are allowed to run across the cluster
and how many Pulse sessions you are able to run cluster-wide. By default,
PLANNEDCONCURRENCY is set to AUTO, which is the lower of either the number of cores
on the node, or memory / 2GB, but it is never automatically set to less than 4.

The amount of memory that each JVM is allocated is determined by MAXMEMORYSIZE /
PLANNEDCONCURRENCY. For example, suppose MAXMEMORYSIZE is set to 8G and
PLANNEDCONCURRENCY is set to 2. In this case, a maximum of 2 sessions can run HP
Vertica Pulse queries and the session JVMs have a maximum memory size of 4GB.
Tip: The basic thing to remember is that PLANNEDCONCURRENCY controls the number of
sessions across the entire cluster that can run the sentimentAnalysis() function. If set to 1,
then only a single session can run Pulse functions. No other sessions are able to run Pulse or
Java UDx functions until the session currently running Pulse functions is closed.
While resource pool settings are based on the resources of a node, they apply across the entire
cluster. A session with an HP Vertica Pulse query reserves the same resources for its JVM on all
nodes in the cluster. Therefore, it doesn't matter if the cluster contains 3 nodes or 30 nodes; each
node reserves, for example, 4GB of the node's memory for the JVM used by the HP Vertica Pulse
session, and PLANNEDCONCURRENCY limits the number of sessions that can run Pulse
cluster-wide. If PLANNEDCONCURRENCY is 1, then only 1 vsql session (or client connection) in
the entire cluster can run Pulse.
You can display the current resource pool settings for the jvm resource pool with the following
command:
select name, MAXMEMORYSIZE, PLANNEDCONCURRENCY from V_CATALOG.RESOURCE_POOLS
where name = 'jvm';

Configuring the jvm Resource Pool for your System


Do not use the default jvm resource pool settings for HP Vertica Pulse. You must configure the jvm
resource pool to match your hardware and workload requirements. Specifically, specify
PLANNEDCONCURRENCY and MAXMEMORYSIZE to match your hardware.
You may need to experiment to find the optimal settings for your hardware and your specific
workloads. As a best practice, allow:
At least 2GB of memory per session for HP Vertica Pulse

At least 25% of the memory available for general HP Vertica overhead. Essentially,
MAXMEMORYSIZE must never exceed 75% of total system memory.

Note: If you are running a lot of queries not in the context of HP Vertica Pulse, then you should
allow for more memory to be available outside of the jvm resource pool.
To configure your system for HP Vertica Pulse:
Determine the number of cores on a node. Your PLANNEDCONCURRENCY setting cannot
exceed this value. For example, you can run the following from a shell to determine cores:
cat /proc/cpuinfo | egrep "core id|physical id" | tr -d "\n" |
sed s/physical/\\nphysical/g | grep -v ^$ | sort | uniq | wc -l

Determine the amount of memory in GB on a node. Your MAXMEMORYSIZE cannot exceed
75% of the total system memory. For example, you can run the following from a shell to
determine the total system memory in GB for any particular node:
awk '/MemTotal/ {printf "%f GB\n", $2/1024/1024}' /proc/meminfo

Use the formula MAXMEMORYSIZE / PLANNEDCONCURRENCY to determine how much
memory each HP Vertica Pulse JVM receives. For example, you can use (.75 * Total
System Memory) / PLANNEDCONCURRENCY if you plan to use most of your RAM for HP Vertica
Pulse. The outcome of the formula must be 2 (which denotes GB) or greater. For example, if you
have 8GB of total system memory, and your estimated PLANNEDCONCURRENCY is 3, then
the formula results in (.75 * 8) / 3 = 2, which is acceptable. However, if you have the same
amount of memory and PLANNEDCONCURRENCY is set to 4, then the result of the formula is 1.5,
which is below the recommended minimum of 2GB. You can either add more RAM to the system or
reduce PLANNEDCONCURRENCY to get the resulting number up to 2.

Finally, alter the jvm resource pool. For example, if each node in the cluster has 16GB of
memory and you decide to use up to 75% of the total system memory (0.75 * 16GB = 12GB)
for HP Vertica Pulse, then you can set the resource pool as follows:
ALTER RESOURCE POOL jvm MAXMEMORYSIZE '12G' PLANNEDCONCURRENCY 3;

Note: For evaluation purposes on systems with lower memory, set MAXMEMORYSIZE to
75% and PLANNEDCONCURRENCY to 1: ALTER RESOURCE POOL jvm MAXMEMORYSIZE
'75%' PLANNEDCONCURRENCY 1; While these settings are unsupported, they do allow you to
run simple HP Vertica Pulse queries. You may experience Out Of Memory exceptions and
slow performance.
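
The sizing rules in this section can be checked with a short script before you alter the pool. The following is a minimal sketch under stated assumptions: per_jvm_gb and check_jvm_pool are hypothetical helper names (they are not Pulse or HP Vertica tools), all memory values are given in GB, and the 8-core node in the usage lines is an invented example.

```shell
#!/usr/bin/env bash
# Sketch only: per_jvm_gb and check_jvm_pool are hypothetical helpers, not Pulse tools.

# Memory each session JVM receives: MAXMEMORYSIZE / PLANNEDCONCURRENCY (both in GB).
per_jvm_gb() {
    awk -v max="$1" -v pc="$2" 'BEGIN { printf "%.2f\n", max / pc }'
}

# Arguments: total system memory (GB), MAXMEMORYSIZE (GB), PLANNEDCONCURRENCY, cores.
# Prints OK, or the first guideline from this section that the settings violate.
check_jvm_pool() {
    awk -v total="$1" -v max="$2" -v pc="$3" -v cores="$4" 'BEGIN {
        if (max + 0 > 0.75 * total) { print "FAIL: MAXMEMORYSIZE exceeds 75% of system memory"; exit 1 }
        if (pc + 0 > cores + 0)     { print "FAIL: PLANNEDCONCURRENCY exceeds the core count"; exit 1 }
        if (max / pc < 2)           { print "FAIL: less than 2GB per session JVM"; exit 1 }
        print "OK"
    }'
}

per_jvm_gb 12 3                   # 16GB node, 12GB pool, 3 sessions
check_jvm_pool 16 12 3 8          # assumes an 8-core node
check_jvm_pool 8 6 4 8 || true    # 8GB node, 4 sessions: 1.5GB each, below the minimum
```

The first call reproduces the 16GB example (12GB / 3 sessions), and the last call reproduces the 8GB case in which a PLANNEDCONCURRENCY of 4 leaves each JVM below the 2GB minimum.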
For additional details, see:
ALTER RESOURCE POOL

Managing Workloads

Java UDx Resource Management

Assign Users to the pulse_users Role and Allow Access to Pulse Functions

When you install Pulse, the install script creates a pulse schema, which contains the
user-dictionary and mapping lists used by Pulse. Initially, only administrators can read or edit
tables in the pulse schema. To give non-administrator database users access to the pulse schema,
you assign the user to the 'pulse_users' role, which has all privileges for the pulse schema. The
role is created automatically when you install Pulse.
Note: The default dbadmin user has access to the pulse schema by default. You do not need
to add the pulse_users role to the dbadmin account.

Granting Users Access to the Pulse Schema


To grant non-administrator users access to the tables in the pulse schema:
1. As the dbadmin, if the user does not exist, create the user with the command: create user
username identified by 'password';
2. As the dbadmin, if the user does not have access to functions in the public schema, then grant
execute privileges with the command: GRANT execute ON ALL FUNCTIONS IN SCHEMA
public TO username;
Note: By default, the Pulse functions are created in the public schema.

3. As the dbadmin, grant the pulse_users role to the new user with the command: grant
pulse_users to username;
4. As the user to whom you granted the pulse_users role, set the user's role to pulse_users with
the command: set role pulse_users;

Note: The user must run the set role command in each session to read or edit tables in the
pulse schema.
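
When you grant access to several users, it can help to generate the statements from steps 2 and 3 rather than type them. The following is a minimal sketch: pulse_grant_sql is a hypothetical helper name, the user name alice is an invented example, and the helper only prints SQL for you to review and run in vsql as dbadmin. It does not create the user.

```shell
#!/usr/bin/env bash
# Sketch only: pulse_grant_sql is a hypothetical helper, not part of Pulse.
# It prints the GRANT statements from steps 2 and 3 for one existing user.
pulse_grant_sql() {
    local user="$1"
    # Accept only simple identifiers so that odd input cannot change the SQL.
    case "$user" in
        ''|*[!A-Za-z0-9_]*) echo "invalid user name: $user" >&2; return 1 ;;
    esac
    printf 'GRANT EXECUTE ON ALL FUNCTIONS IN SCHEMA public TO %s;\n' "$user"
    printf 'GRANT pulse_users TO %s;\n' "$user"
}

pulse_grant_sql alice
```

Each user still has to run set role pulse_users; in their own sessions, as described in step 4.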

Uninstalling HP Vertica Pulse and Pulse Packages


Uninstalling HP Vertica Pulse on hosts and uninstalling Pulse packages require different
procedures.

Uninstall HP Vertica Pulse on Your Hosts


As the dbadmin, run the uninstall script from the node on which you installed the Pulse package:
bash /opt/vertica/packages/pulse/uninstall.sh

The uninstall script removes all Pulse functions, but does not remove the pulse schema containing
the user-dictionary and mapping tables.
To remove all Pulse dictionaries and mappings, including custom dictionaries, include the -r
parameter:
bash /opt/vertica/packages/pulse/uninstall.sh -r

Uninstall Pulse Packages


To uninstall the Pulse package, on the nodes that have the Pulse package installed, use the
appropriate command for your package.
For RPM packages:

sudo rpm -e vertica-pulse

For DEB packages:

sudo dpkg --remove vertica-pulse

The Pulse schema and associated user-dictionary and mapping tables remain in the database. To
remove the Pulse schema and its associated tables, run the following command:
DROP SCHEMA pulse CASCADE;

Using Pulse
Dictionaries and Mappings
Determining Sentiment
Tuning Pulse
Bulk Loading Word Lists from Text Files

Dictionaries and Mappings


Pulse uses a proprietary system-dictionary to help score sentiment. The system dictionary is not
visible or modifiable. However, you can alter the default way in which sentiment is scored by
modifying user dictionaries. The user dictionaries provide flexibility so that you can tune sentiment
scoring for your specific domain. However, you do not have to modify user dictionaries if Pulse is
scoring your data appropriately.
User dictionaries and a normalization map for each supported language reside in tables inside the
pulse schema. There is one table per dictionary/map for each language. The table name has the
language abbreviation as a suffix. For example, English tables have the suffix "_en" and Spanish
tables have the suffix "_es". By default, the user dictionaries and normalization map are empty.
You can modify these tables at any time to tune Pulse to your specific needs. Your changes affect
sentiment scoring only after you run the load functions (see LoadDictionary() and LoadMapping())
to load the values from the tables into memory. You can see the contents of the tables with simple
queries such as: select * from pulse.pos_words_en; and
select * from pulse.pos_words_es;
Users can apply dictionaries on a per-user basis. Any number of Pulse users can concurrently
apply different sets of dictionaries without conflicts and without disrupting other users' sessions.
Each user can have one dictionary of each type loaded at any given time. If a user does not specify
a dictionary of a given type, Pulse uses the default dictionary for that type.
Note: Loading a user-dictionary or loading a normalization map overwrites the values in
memory with the values from the specified table. You cannot append to the user dictionaries or the
normalization map in memory.
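
The table naming convention can be expressed as a small helper. The following is a minimal sketch: pulse_table is a hypothetical helper name, and the language names english and spanish are assumptions chosen to mirror the 'english' argument shown in the Quick Start.

```shell
#!/usr/bin/env bash
# Sketch only: pulse_table is a hypothetical helper, not part of Pulse.
# It maps a dictionary or map name and a language to its table in the pulse schema.
pulse_table() {
    local list="$1" lang="$2" suffix
    case "$lang" in
        english) suffix="en" ;;
        spanish) suffix="es" ;;
        *) echo "unsupported language: $lang" >&2; return 1 ;;
    esac
    printf 'pulse.%s_%s\n' "$list" "$suffix"
}

pulse_table pos_words english
pulse_table normalization spanish
```

For example, assuming vsql can connect to your database, you could inspect a table with: vsql -c "select * from $(pulse_table pos_words english);"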
The following dictionary table names provide examples of the English user dictionaries with
descriptions:

Dictionary Table Name

Description

white_list_en

Words that are always marked as an attribute. This list augments
the built-in Pulse attribute discovery process. Add words that you
always want scored to the white_list user dictionary. For
example, such words can include nouns, phrases or
business-dependent attributes that are not auto-discovered by Pulse.
This list is typically modified to increase the accuracy of
sentiment scoring for your domain.
Consider the term "Alice in Wonderland". Pulse automatically
marks "Alice" as an attribute. However, you can add "Alice in
Wonderland" to the white_list and Pulse then uses "Alice in
Wonderland" as the attribute instead of just "Alice".
Note: If your white_list contains phrases that are subsets of
other phrases in the white list, then the shorter phrase is not
matched if the text being analyzed matches the superset
phrase. For example, if both "Honest Al" and "Honest Al Car
Emporium" are in the white_list, then the latter phrase is
identified as an attribute in the text "Honest Al Car Emporium
is not honest.", not the shorter "Honest Al" white_list phrase.

stop_words_en

Words that are never marked as an attribute. Add words that you
do not want scored to the stop_words user dictionary. Use this to
filter out attributes that are not of interest to your analysis. This list
is typically modified to increase the accuracy of sentiment scoring
for your domain.
Note: If a word appears in both stop_words and white_list,
then the white_list word takes precedence. The
word appears in results even though it is in the stop_words
dictionary.

pos_words_en

Positive words that can be any type of word or phrase. Words in
this list are more likely to carry a positive polarity in general.
You can also add exact phrases, such as idioms, to this list.
Examples: adroit, resolve, strong, hit the nail on the head.

neg_words_en

Negative words that can be any type of word or phrase that have a
negative connotation. Words in this list are deemed more likely to
carry a negative polarity in general.
You can also add exact phrases, such as idioms, to this list.
Examples: abhorrent, butcher, racist, wrath, flash in the pan.

neutral_words_en

Words that indicate a neutral connotation. Words in this list are
scored with a sentiment of 0, meaning not positive or negative.

The following table describes the mapping table within Pulse.
Mapping Table Name

Description

Example

normalization_en

A list of word pairs used to map like terms (synonyms). You can use this
to correct common misspellings and map them to the correct spelling. This
list is frequently modified and is empty by default.

base/synonym:
"hp"/ "hewlettpackard"
"hp"/ "Hewlett-Packard"
"Obama"/ "President Obama"
"Obama"/ "Barack Obama"

For Pulse versions that support Spanish, the same set of dictionaries with the suffix "_es" is
present in the Pulse schema.

Loading Dictionaries and Mappings into Pulse


You need to load the dictionaries, the normalization map, or both, if you have made changes to the
pulse schema tables. After the changes are loaded, Pulse stores them in memory, across all
sessions in the cluster. Because Pulse automatically loads the dictionaries and mapping at startup,
you do not need to reload them after a database restart or system reboot.
• To load an individual user-dictionary into memory, use the LoadDictionary() function with the
  appropriate parameter and word list.
• To load the normalization mapping into memory, use the LoadMapping() function with the
  normalization map.


For ease of use, Pulse ships with a script to automatically load into memory all of the required user
dictionaries and the normalization mapping. You can run the script from within vsql with the
following command:
\i /opt/vertica/packages/pulse/ddl/loadUserDictionaries.sql

Note: This script only exists on the node on which you installed the Pulse RPM/DEB package.
Manually Loading Dictionaries and the Normalization Map
If you want to manually load certain user dictionaries or mappings from the pulse schema tables,
then run the commands shown below. The example adds a word to the pos_words dictionary and a
case-sensitive word to the white_list dictionary, and then loads the white_list dictionary. See
LoadDictionary() for valid values for the listName parameter and for multilingual version loading.
Note: The following examples use the English dictionaries. For Spanish, replace "_en" with "_es".
First, add a word to the pos_words dictionary:
=> insert into pulse.pos_words_en values('SuperDuper');
=> commit;

By default, added words are not case sensitive. "ERROR" produces the same results as "error".
You can, however, specify a case setting for a word using the $Case parameter. For example, to
identify "Apple", rather than "apple", you would add the following:
=> insert into pulse.white_list_en values('$Case(Apple)');
=> commit;

Then, load the updated dictionary into Pulse:


select LoadDictionary(standard USING PARAMETERS
listName='white_list_en') over()
from pulse.white_list_en;

If you change the normalization map, then you can load the new normalization values with the
following command:
select LoadMapping(standard_base, standard_synonym USING PARAMETERS
mapName='normalization') over()
from pulse.normalization_en;


After loading, HP Vertica returns a success message and the number of rows (words or word pairs)
loaded.

Dictionary and Mapping Labels


You can apply a label to any user-defined dictionary or mapping when you load that object. Labels
enable you to perform sentiment analysis against a predetermined set of dictionaries and mappings
without having to specify a list of dictionaries. For example, you might have a set of dictionaries
labeled "music" and a set labeled "movies." The default user dictionaries automatically have a label
of "default."
A single dictionary or mapping can have multiple labels. For example, you might label a white list of
artists as both "painters" and "renaissance." You could load the dictionary by loading either label. A
label can only apply to one dictionary of each type. For example, you cannot have two stop words
dictionaries that share the same label. If you apply a label to multiple dictionaries of the same type,
Pulse uses the most recently applied label.
You can view the labels associated with your current dictionaries using the
GetAllLoadedDictionaries() function. You can also view the label associated with your current
mapping using the GetLoadedMapping() function.
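The labeling rules above can be pictured with a small model. The following Python sketch is illustrative only and is not part of Pulse: the LabelRegistry class, its method names, and the table names are hypothetical, and the sketch assumes that "most recently applied" means the dictionary that was loaded last under a given label and dictionary type.

```python
class LabelRegistry:
    """Toy model of dictionary labels: one dictionary per (label, type)."""

    def __init__(self):
        # Maps (label, dictionary_type) -> name of the dictionary table.
        self._by_label_and_type = {}

    def apply_label(self, label, dictionary_type, table_name):
        # Applying a label again for the same type replaces the earlier entry.
        self._by_label_and_type[(label, dictionary_type)] = table_name

    def dictionaries_for(self, label):
        # Return {dictionary_type: table_name} for every type under this label.
        return {
            dict_type: table
            for (lbl, dict_type), table in self._by_label_and_type.items()
            if lbl == label
        }


registry = LabelRegistry()
# One white list carries two labels.
registry.apply_label("painters", "white_list", "artists_white_list")
registry.apply_label("renaissance", "white_list", "artists_white_list")
# Two stop-word dictionaries share a label; the later one replaces the earlier.
registry.apply_label("painters", "stop_words", "stop_words_a")
registry.apply_label("painters", "stop_words", "stop_words_b")

print(registry.dictionaries_for("painters"))
```

In this model, asking for the "painters" label yields one white list and one stop-words dictionary, which mirrors the rule that a label applies to only one dictionary of each type.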

Normalization Map Effects on Results


Before any of the sentiment analysis functions are run on the text, the normalization map is
applied. When a sentiment analysis function is run, Pulse replaces the synonym with the base
word. The result of the sentiment analysis function displays the mapped words and not the original
text. For example, Pulse maps both "Hewlett Packard" and "Hewlett-Packard" (with a hyphen) to
HP in the results when the normalization map is populated with those terms:
Before Mapping
SELECT SentimentAnalysis('Hewlett-Packard was founded in 1939.
Hewlett Packard was started in a garage in Palo Alto California')
OVER(PARTITION BEST);
 sentence |      attribute       | sentiment_score
----------+----------------------+-----------------
        1 | hewlett-packard      |               0
        2 | hewlett packard      |               0
        2 | garage               |               0
        2 | palo alto california |               0
(4 rows)

Insert Normalization Values and Load Map


INSERT INTO pulse.normalization_en VALUES('HP', 'Hewlett-Packard');
INSERT INTO pulse.normalization_en VALUES('HP', 'Hewlett Packard');
commit;
SELECT LoadMapping(standard_base, standard_synonym
USING PARAMETERS mapName='normalization') OVER()
FROM pulse.normalization_en;

After Mapping
The mapping operation replaces the attributes with their counterparts from the normalization list and
displays the base terms:
SELECT SentimentAnalysis('Hewlett-Packard was founded in 1939.
Hewlett Packard was started in a garage in Palo Alto California')
OVER(PARTITION BEST);
 sentence |      attribute       | sentiment_score
----------+----------------------+-----------------
        1 | hp                   |               0
        2 | hp                   |               0
        2 | garage               |               0
        2 | palo alto california |               0
(4 rows)

The CommentAttributes() function also uses the normalization map and displays the base terms
instead of the original text:
SELECT CommentAttributes('Hewlett-Packard was founded in 1939.
Hewlett Packard was started in a garage in Palo Alto California')
OVER(PARTITION BEST);
 sentence |      attribute
----------+----------------------
        1 | hp
        2 | hp
        2 | garage
        2 | palo alto california
(4 rows)
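The effect of the normalization map on the reported attributes can be approximated outside the database. The following Python sketch is a simplified model, not Pulse's implementation: the function names are hypothetical, it maps already-extracted attributes rather than raw text, and it assumes lowercase matching because the results above are shown in lowercase.

```python
def build_synonym_lookup(pairs):
    """Build a lowercase synonym -> base lookup from (base, synonym) pairs."""
    return {synonym.lower(): base.lower() for base, synonym in pairs}


def normalize_attribute(attribute, lookup):
    """Return the base term for a known synonym, else the attribute unchanged."""
    key = attribute.lower()
    return lookup.get(key, key)


# Same pairs as the INSERT statements above: (base, synonym).
pairs = [("HP", "Hewlett-Packard"), ("HP", "Hewlett Packard")]
lookup = build_synonym_lookup(pairs)

attributes = ["hewlett-packard", "hewlett packard", "garage", "palo alto california"]
normalized = [normalize_attribute(a, lookup) for a in attributes]
print(normalized)
```

Both spellings collapse to the base term "hp", while attributes without an entry in the map pass through unchanged.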

Creating Tables for Custom Dictionaries and Mappings


The HP Vertica Pulse package includes all the necessary user dictionary and mapping tables, but
you can create your own tables to store additional user dictionaries or mappings. For example:
CREATE TABLE my_positive_words(word VARCHAR(64));

The following example shows how to create a table, add some terms to it, and then load the table as
a normalization map:
CREATE TABLE myNormalization(base VARCHAR(64), synonym VARCHAR(64));
INSERT INTO myNormalization VALUES('hp','Hewlett Packard');
INSERT INTO myNormalization VALUES('hp','Hewlett-Packard');
commit;
SELECT LoadMapping(base, synonym USING PARAMETERS
mapName='normalization') OVER() FROM myNormalization;

After loading, HP Vertica returns a success message from each node in the cluster.


Determining Sentiment
You determine sentiment by using the SentimentAnalysis() function on text.
The SentimentAnalysis() function first extracts the attributes (typically nouns) from the sentence,
and then applies a sentiment score to each attribute. Scores can be one of the following:
• 1 - Positive Sentiment
• 0 - Neutral Sentiment
• -1 - Negative Sentiment

This provides a more granular analysis than just determining the sentiment for the sentence as a
whole. Consider the following quote from Abraham Lincoln: "Force is all-conquering, but its
victories are short-lived." If you were to score the sentiment of the sentence as a whole by
averaging the sentiment of its parts, then the sentiment is neutral.
=> select avg(t1.sentiment_score) as 'Average Sentiment' from (
select sentimentAnalysis('Force is all-conquering, but its victories are short-lived.')
over (PARTITION BEST)
) as t1;
 Average Sentiment
-------------------
                 0

If you score the individual attributes of the sentence, then you can obtain a much more precise
analysis of the sentiment than if you were trying to assign a single score to the entire sentence. For
example:
=> select sentimentAnalysis('Force is all-conquering, but its victories are short-lived.')
over (PARTITION BEST);
 sentence | attribute | sentiment_score
----------+-----------+-----------------
        1 | force     |               1
        1 | victories |              -1

"Force" is scored with positive sentiment because it is "all-conquering". "Victories" is scored with
negative sentiment because it is "short-lived".
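The arithmetic behind the neutral average is easy to check by hand. The following Python sketch simply averages the two attribute scores from the example above; the variable names are illustrative and nothing here calls Pulse.

```python
from statistics import mean

# (attribute, sentiment_score) pairs as returned in the example above.
attribute_scores = [("force", 1), ("victories", -1)]

# Averaging hides the opposing scores: (1 + -1) / 2 evaluates to zero.
average = mean(score for _, score in attribute_scores)
print(average)

# Keeping the scores per attribute preserves the detail.
per_attribute = dict(attribute_scores)
print(per_attribute)
```

The average reports a neutral sentence, while the per-attribute view still shows one positive and one negative attribute.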
Note: HP Vertica Pulse does not recognize personal pronouns (I, you, we, he, she, it, etc.) as
attributes.


SentimentAnalysis() also extracts the sentiment from multiple sentences and returns the
sentence in which attributes are found:
=> SELECT SentimentAnalysis('Force is all-conquering, but its victories are short-lived.
Every good boy deserves fudge.') OVER(PARTITION BEST);
 sentence | attribute | sentiment_score
----------+-----------+-----------------
        1 | force     |               1
        1 | victories |              -1
        2 | boy       |               1
        2 | fudge     |               1
(4 rows)

"Boy" is scored with positive sentiment because he is good. Fudge is scored with positive
sentiment because it is something that is deserved.
Note: The sentence detector considers a period to mark the end of a sentence. Some abbreviations
that use a period, such as Dr. or Mr., cause the sentence detector to end the sentence at the
abbreviation.
The SentimentAnalysis function also identifies attributes with neutral sentiment (a sentiment score
of zero). For example:
SELECT SentimentAnalysis('Roses are red. Violets are blue.') OVER(PARTITION BEST);
 sentence | attribute | sentiment_score
----------+-----------+-----------------
        1 | roses     |               0
        2 | violets   |               0
(2 rows)

Both roses and violets receive neutral sentiment because neither being red nor blue is considered
positive or negative in this context.
See the Pulse Cookbook for more examples of determining sentiment.


Tuning Pulse
Pulse contains built-in dictionaries that help to determine the sentiment of sentences. These
dictionaries are not directly readable. However, you can modify the Pulse dictionary tables to
improve automatic attribute discovery and provide more accurate results for sentiment scoring
based on your specific data sets. The dictionary tables are available in the Pulse schema. Any
words you add to these dictionaries take precedence over the built-in dictionaries.

Improving Automatic Attribute Discovery


Pulse identifies nouns in sentences and marks them as attributes. However, there are two
dictionaries and one mapping that you can modify to improve automatic attribute discovery. These
are:
• white_list - a list of words on which you want to score sentiment, but that are not
  auto-discovered by Pulse. Typically these are product or company names, or special words in the
  domain of the data you are analyzing. You can also add noun phrases to the white_list.
• stop_words - a list of words on which you do not want to score sentiment, but that may appear
  frequently in your data set. stop_words is a way to filter out attributes.
• normalization - a map of base words and synonyms that allows you to normalize words for easy
  comparison. For example, you can normalize "Hewlett Packard" to "HP", then count the number
  of times "HP" appears as an attribute in your data. Any text that contains "HP" or "Hewlett
  Packard" is counted towards the total.

Determining How Pulse Scores Sentiment


When tuning Pulse it is important to understand why Pulse may not be scoring a particular attribute
the way you want it to be scored. For example, consider the sentence "The quick brown fox jumped
over the lazy dog." By default, Pulse scores the fox as positive and the dog as negative. If you want
to better understand how the words in the sentence affect the attributes, then you can use the
relatedwords parameter to see which words are affecting the score. For example:
select SentimentAnalysis('The quick brown fox jumped over the lazy dog.'
USING PARAMETERS relatedwords=true) OVER(PARTITION BEST);

 sentence | attribute | sentiment_score | related_word_1 | related_word_2 | related_word_3
----------+-----------+-----------------+----------------+----------------+----------------
        1 | fox       |               1 | quick          | lazy           |
        1 | dog       |              -1 | lazy           |                |
(2 rows)

The output shows that "quick" and "lazy" affected the scoring of the "fox" attribute, and that "lazy"
affected the scoring of the "dog" attribute. "Quick" (positive) is weighted more heavily than "lazy"
(negative) when scoring "fox" because the word "quick" is closer to the attribute "fox" in the
sentence; as a result, "fox" is scored positively. "Lazy" (negative) is the only related word
used to score the sentiment for "dog". If you don't agree with the scoring, you can change
how these related words affect the score by adding them to the appropriate user-dictionary, as
described in "Improving Sentiment Scores".

Improving Sentiment Scores


Pulse scores sentiment on attributes (nouns) in sentences using Natural Language Processing
(NLP) algorithms and rules. Pulse attempts to identify the parts of speech in a sentence (for
example, verbs, nouns/attributes, and adjectives), and then scores the attributes based on
which system-dictionaries the parts of speech appear in (positive, negative, or neutral), where
those parts of speech appear in relation to the attributes, and other contextual information. Note
that Pulse does not identify personal pronouns (he, you, we, she, etc.) as attributes.
Pulse provides a PartsOfSpeech function so that you can verify which parts of speech are being
identified in a sentence.
Sentiment Scoring and the Precedence of Pulse User-Dictionaries
The negative, positive, and neutral user-dictionaries adjust the score of an attribute based on which
dictionary the words in the sentence appear in. Note that user-dictionaries take precedence over the
internal dictionaries that Pulse uses for analyzing text, so that you can override the default polarity
of an opinion word or phrase by inserting it in the appropriate user-dictionary table.
Pulse also supports using phrases in the pos_words, neg_words and neutral_words dictionaries.
Phrases, such as idioms ("hit the nail on the head."), can be added to the user dictionaries. Phrases
of two or more words support "fuzzy" matching. For example, the phrase "solve problem" also
matches "solves problems".
Pulse uses an order of precedence to determine which user dictionary is used to modify the default
score. The order of precedence of the user dictionary that Pulse uses to score attributes is as
follows:
1. Phrases or strings that occur in the "neutral_words" dictionary
2. Phrases or strings that occur in the "neg_words" dictionary
3. Phrases or strings that occur in the "pos_words" dictionary
4. Single words appearing in the "neutral_words" dictionary
5. Single words appearing in the "neg_words" dictionary
6. Single words appearing in the "pos_words" dictionary

Note: If a word is present in both stop_words and white_list, then the white_list word
takes precedence. The word is present in results even though it exists in stop_words.
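The six-level order of precedence above can be expressed as a sort key. The following Python sketch is an illustrative model only, not how Pulse is implemented: the function names are hypothetical, it assumes a "phrase" is any entry of two or more words, and the sample entries are made-up user-dictionary contents.

```python
# Precedence from the numbered list above: a lower rank wins.
DICTIONARY_RANK = {"neutral_words": 0, "neg_words": 1, "pos_words": 2}


def precedence_rank(entry, dictionary):
    """Rank a matched entry: phrases (two or more words) outrank single words."""
    is_single_word = len(entry.split()) == 1
    # False sorts before True, so phrases come first; then the dictionary order.
    return (is_single_word, DICTIONARY_RANK[dictionary])


def winning_match(matches):
    """Pick the (entry, dictionary) match with the highest precedence."""
    return min(matches, key=lambda match: precedence_rank(*match))


# Hypothetical matches for "Joe solves problems": a pos_words phrase and a
# neg_words single word. The phrase wins (level 3 beats level 5).
matches = [("problem", "neg_words"), ("solve problem", "pos_words")]
print(winning_match(matches))
```

Within the same length class, the dictionary order decides: a single word that appears in both "neutral_words" and "pos_words" resolves to "neutral_words" in this model.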
Consider the sentence "Fudge is good". It contains three parts: a noun (fudge), a verb (is), and an
adjective (good). When you analyze the sentence using Pulse, it identifies "fudge" as an attribute,
because it is a noun, and then assigns "fudge" a positive sentiment:
select sentimentAnalysis('Fudge is good') OVER(PARTITION BEST);
 sentence | attribute | sentiment_score
----------+-----------+-----------------
        1 | fudge     |               1

The number of words matched against a dictionary also has an impact on which dictionaries take
precedence. Phrases or word combinations in the user-dictionary lists take precedence over single
words. For example, the positive phrase "solve problem" causes a positive score on the text "Joe
solves problems", even though "problem" is a negative word. Because phrases have precedence over
single words, a positive score is applied to Joe.
SELECT SentimentAnalysis('Joe solves problems.') OVER(PARTITION BEST);
 sentence | attribute | sentiment_score
----------+-----------+-----------------
        1 | joe       |               1
(1 row)

SELECT SentimentAnalysis('Joe is a problem.') OVER(PARTITION BEST);
 sentence | attribute | sentiment_score
----------+-----------+-----------------
        1 | joe       |              -1
        1 | problem   |               0
(2 rows)

Tuning Example
You can modify any of the user-dictionaries to improve the accuracy of sentiment scores. The two
basic dictionaries, "neg_words" and "pos_words", are typically the easiest to modify to get good
results. Words in these two dictionaries can be any part of speech (verb, adjective, etc.). If you find
a word that is causing an attribute to be scored positively or negatively, but it should be scored as
neutral, then you can add that word to the "neutral_words_en" dictionary to cause it to be scored 0.
Consider the sentence "The product delivers simplicity.":
select sentimentAnalysis('The product delivers simplicity.') over(PARTITION BEST);
 sentence | attribute  | sentiment_score
----------+------------+-----------------
        1 | product    |               0
        1 | simplicity |               0
(2 rows)

If you want "product" to be scored positively in this sentence, then you must add "deliver simplicity"
to the pos_words user-dictionary. "deliver simplicity" also matches "delivers simplicity" due to the
"fuzzy" matching feature of phrases in the dictionaries. If you add "simplicity" by itself to the
"pos_words" dictionary, then simplicity in any context is considered positive, which may not be the
result you want to achieve across your entire domain. The following example adds "deliver simplicity"
to the pos_words user-dictionary for the English language:
insert into pulse.pos_words_en values ('deliver simplicity');
commit;
-- you must reload the dictionaries for the changes to be effective
\i /opt/vertica/packages/pulse/ddl/loadUserDictionaries.sql
select sentimentAnalysis('The product delivers simplicity.') over(PARTITION BEST);
 sentence | attribute  | sentiment_score
----------+------------+-----------------
        1 | product    |               1
(1 row)

Note that "simplicity" is not positive if it is not paired with "deliver":


select sentimentAnalysis('The product provides simplicity.') over(PARTITION BEST);
 sentence | attribute  | sentiment_score
----------+------------+-----------------
        1 | product    |               0
        1 | simplicity |               0
(2 rows)

If you want "simplicity" to always be positive, add it to the "pos_words" list. This example replaces
"deliver" with "provides":
insert into pulse.pos_words_en values ('simplicity');
commit;
\i /opt/vertica/packages/pulse/ddl/loadUserDictionaries.sql
select sentimentAnalysis('The product provides simplicity.') over(PARTITION BEST);
 sentence | attribute  | sentiment_score
----------+------------+-----------------
        1 | product    |               1
        1 | simplicity |               0
(2 rows)

Note that the sentiment score for the attribute (noun) "simplicity" is not affected by having the word
"simplicity" in a Pulse user-dictionary, since it has been identified as an attribute.
Additional Tuning Examples
The following table provides additional examples for tuning Pulse:

Text: New product smashes kickstarter target in a day!
Attribute: New Product
Score: Default: -1; After Tuning: 1
Tuning Steps: "Smash" is scored negatively by default. Add "smash target" to "pos_words".

Text: Get a sneak peek of the new movie.
Attribute: Movie
Score: Default: -1; After Tuning: 1
Tuning Steps: "sneak" is scored negatively by default. Add "sneak peek" to "pos_words".

Text: Google was able to spot trends in flu outbreaks in the United States using the collection and
analysis of big data.
Attribute: Google
Score: Default: -1; After Tuning: 1
Tuning Steps: "outbreak" is scored negatively by default. Add "spot trend" to "pos_words".

Text: Five health tips that will knock your socks off!
Attribute: health tips
Score: Default: -1; After Tuning: 1
Tuning Steps: "knock" is scored negatively by default. Add "knock your socks off" to "pos_words".

If you have many words or base/synonyms to add to user-dictionaries, then you can bulk load the
lists from text files. See Bulk Loading Word Lists from Text Files.

Bulk Loading Word Lists from Text Files


If you have many words that you need to add to the user-dictionary or normalization mapping, then it
may be easier to create the word lists in a text file and load the lists using the COPY command.


Bulk Loading User Dictionary Lists


To bulk load user-dictionary lists into the pulse schema, first create a text file with the list of words
to add, one word per line, for each of the user-dictionaries. See Dictionaries and Mappings for a list
of the user-dictionaries and normalization map. Optionally name each text file to match the name of
the corresponding user-dictionary. Place these text files in the /home/dbadmin directory.
Then, in vsql, use one or more of the following commands to load the respective text file into the
pulse schema. These commands assume that you are using the English version of Pulse, that you are
loading the built-in user dictionary tables in the pulse schema, and that the text files are named
the same as the user-dictionaries.
copy pulse.neg_words_en(standard) from '/home/dbadmin/neg_words.txt';
copy pulse.neutral_words_en(standard) from '/home/dbadmin/neutral_words.txt';
copy pulse.pos_words_en(standard) from '/home/dbadmin/pos_words.txt';
copy pulse.stop_words_en(standard) from '/home/dbadmin/stop_words.txt';
copy pulse.white_list_en(standard) from '/home/dbadmin/white_list.txt';

Bulk Loading the Normalization Map


You can load normalization terms into the pulse schema similarly to loading user-dictionaries.
However, instead of one word per line, use the convention of one pair of words per line, separated
by a comma. For example, to map the different forms of Hewlett-Packard to HP, create a text file in
/home/dbadmin named normalization.txt with the following content:
hp, hewlett packard
hp, hewlett-packard
Then, in vsql, use the following command to load the normalization into the pulse schema.
copy pulse.normalization_en (standard_base, standard_synonym) from
'/home/dbadmin/normalization.txt' delimiter ',';

When you have finished loading the text files, run the loadUserDictionaries.sql script to update
the new terms in memory:
vsql -f /opt/vertica/packages/pulse/ddl/loadUserDictionaries.sql
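
If your word lists are produced by another tool, you may want to generate the text files programmatically before running COPY. The following Python sketch writes files in the layout described above (one word or phrase per line for dictionaries, one comma-separated base and synonym pair per line for the normalization map). The function names and the output directory are hypothetical, and the sketch writes the pairs without a space after the comma; this guide does not state whether COPY keeps such a space as part of the value.

```python
import csv
from pathlib import Path


def write_word_list(path, words):
    """Write one word or phrase per line, as expected for a dictionary file."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        for word in words:
            handle.write(f"{word}\n")


def write_normalization_map(path, pairs):
    """Write one base,synonym pair per line, separated by a comma."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter=",", lineterminator="\n")
        writer.writerows(pairs)


if __name__ == "__main__":
    # Hypothetical staging directory; copy the files to /home/dbadmin afterwards.
    out_dir = Path("pulse_word_lists")
    out_dir.mkdir(exist_ok=True)
    write_word_list(out_dir / "pos_words.txt", ["deliver simplicity", "sneak peek"])
    write_normalization_map(
        out_dir / "normalization.txt",
        [("hp", "hewlett packard"), ("hp", "hewlett-packard")],
    )
```

After the files are in place on the database host, load them with the COPY commands shown earlier and then run the loadUserDictionaries.sql script.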


Multilingual Pulse
This section describes the multilingual features of Pulse and briefly explains how to use the
sentimentAnalysis() function with the supported languages.
Pulse can analyze text in different languages. Currently English and Spanish are supported. You
can specify the language that is analyzed in three ways:
• Provide the language as an argument: if a language is specified in the document record, then
  you can pass it as an argument so that it is used for analyzing the text. This is particularly
  useful when a dataset contains texts in different languages. If the language in a record is not a
  supported one, then it is ignored.
• Provide the language as a parameter: if no language value is specified for a document
  record, Pulse uses the value specified for the language parameter in the query.
  Note: If you provide the language parameter more than once, then the last value specified is
  used.
• Do not provide an argument or parameter and use the default language. If the language is neither
  specified in the record nor by the user, then Pulse defaults to English unless you have changed
  the default language. To change the default language, use the SetDefaultLanguage function.

Note: If you provide both an argument and a parameter, then the argument is used as the
language. If the argument is not valid, then the parameter is used. If neither the argument nor the
parameter is valid, then the default language is used.
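
The fallback order described above (argument, then parameter, then default) can be summarized in a few lines. The following Python sketch is a model of the documented rule and not Pulse code: the function name is hypothetical, "valid" is assumed to mean one of the two supported language names used in this guide's examples, and matching is assumed to be exact.

```python
SUPPORTED_LANGUAGES = {"english", "spanish"}


def resolve_language(argument=None, parameter=None, default="english"):
    """Return the language to analyze with: argument, else parameter, else default."""
    for candidate in (argument, parameter):
        # An unsupported or missing value is ignored and the next source is tried.
        if candidate in SUPPORTED_LANGUAGES:
            return candidate
    return default


# Argument and parameter both given: the argument wins.
print(resolve_language(argument="english", parameter="spanish"))
# No language in the record: the parameter is used.
print(resolve_language(argument=None, parameter="spanish"))
# Unsupported argument and no parameter: the default is used.
print(resolve_language(argument="klingon"))
```

The three calls resolve to English, Spanish, and the default language respectively.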

Note: Accents are removed from characters in attributes. Additionally, a "u" with a dieresis is
converted to a plain "u" and an "n" with a diacritical tilde is replaced with a plain "n".
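
If you post-process attributes outside the database and need to match them against your own Spanish word lists, you can apply a similar folding to your side of the comparison. The following Python sketch approximates the behavior described in the note using Unicode decomposition; it is an assumption about an equivalent transformation, not the routine Pulse uses, and the function name is hypothetical.

```python
import unicodedata


def strip_accents(text):
    """Remove combining marks (accent, dieresis, tilde) from each character."""
    # NFD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("rápido"))    # acute accent removed
print(strip_accents("pingüino"))  # dieresis removed
print(strip_accents("señor"))     # tilde removed
```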
Functions that use language as a parameter, as an argument, or both:
• CommentAttributes
• ExtractSentence
• GetAllSentences
• GetSentenceCount
• PartsOfSpeech
• SentimentAnalysis

Other functions can use the language only as a parameter (if not provided, the function uses the
default language):
• GetLoadedDictionary
• GetLoadedMapping
• LoadDictionary
• LoadMapping
• GetAllDictionaryWords
• GetAllMappingWords

In This Section
• Spanish Pulse
• Multilingual Examples

Spanish Pulse
The only visible difference between the English and Spanish versions is in the table names for the
user dictionaries. Make your dictionary and mapping modifications in the following tables:
• white_list_es
• stop_words_es
• pos_words_es
• neg_words_es
• neutral_words_es
• normalization_es


Consider the text "El producto provee simplicidad" (the product provides simplicity). If the word
'simplicidad' (simplicity) should be positive, it has to be loaded into the pos_words dictionary for
Spanish as follows:
select sentimentanalysis('El producto provee simplicidad') OVER(PARTITION BEST);
 sentence |  attribute  | sentiment_score
----------+-------------+-----------------
        1 | producto    |               0
        1 | simplicidad |               0
(2 rows)

insert into pulse.pos_words_es values('simplicidad');
 OUTPUT
--------
      1
(1 row)

select LoadDictionary(standard USING PARAMETERS listName='pos_words') over() from
pulse.pos_words_es;
 Success
---------
 t
(1 row)

select sentimentanalysis('El producto provee simplicidad') OVER(PARTITION AUTO);
 sentence |  attribute  | sentiment_score
----------+-------------+-----------------
        1 | producto    |               1
        1 | simplicidad |               0
(2 rows)

Multilingual Examples
Language as an Argument
select sentimentanalysis('Cookies are sweet.', 'english') OVER(PARTITION BEST);
 sentence | attribute | sentiment_score
----------+-----------+-----------------
        1 | cookies   |               1
(1 row)

select sentimentanalysis('Las galletas son dulces','spanish') OVER(PARTITION BEST);
 sentence | attribute | sentiment_score
----------+-----------+-----------------
        1 | galletas  |               1
(1 row)

The following example shows how to analyze tweets from a table where each tweet record contains
the language of the tweet in addition to the text.

create table myTweets (text varchar(300), language varchar(15));
insert into myTweets values ('Wired reviews Amazon''s tiny-screen Kindle Fire: Web browsing sucks, emotionally draining, makes reading a chore', 'english');
insert into myTweets values ('Cookies are sweet', 'english');
insert into myTweets values ('Why does my iPhone have 6 GB of corrupted space I can''t use? That is obnoxious.', 'english');
insert into myTweets values ('Las galletas son dulces', 'spanish');
insert into myTweets values ('el iPhone es el celular mas popular', 'spanish');

select sentimentanalysis(text,language) OVER(PARTITION BEST) from MyTweets;
 sentence |   attribute    | sentiment_score
----------+----------------+-----------------
        1 | reviews amazon |              -1
        1 | kindle fire    |              -1
        1 | web            |              -1
        1 | chore          |              -1
        1 | cookies        |               1
        1 | iphone         |              -1
        1 | gb             |              -1
        1 | space          |              -1
        1 | galletas       |               1
        1 | iphone         |               1
        1 | celular        |               1
(11 rows)

Language as a Parameter
select sentimentanalysis('Las galletas son dulces' using PARAMETERS language='spanish')
OVER(PARTITION BEST);
 sentence | attribute | sentiment_score
----------+-----------+-----------------
        1 | galletas  |               1
(1 row)

select sentimentanalysis('Cookies are sweet' using PARAMETERS language='english')
OVER(PARTITION BEST);
 sentence | attribute | sentiment_score
----------+-----------+-----------------
        1 | cookies   |               1
(1 row)

Although it is possible to specify the language as a parameter for a specific text given in a query,
using the language argument is more appropriate. The language parameter is intended for
queries that analyze a set of texts (from a table) written in the same language. Pulse does not
automatically detect the language of a text, so it cannot use the language parameter to skip texts
written in other languages. Instead, Pulse applies the language specified by the parameter to every
text from the table; consequently, the sentiment scores for texts in other languages may be incorrect.
The following example shows a query that analyzes tweets from a table where the tweets do not
have a language value stored in the table, but are all in the same language.
create table myTweets (text varchar(300));
insert into myTweets values ('Las galletas son dulces');
insert into myTweets values ('el iphone es el celular mas popular');
insert into myTweets values ('el zorro rapido brinco sobre el perro flojo');
select sentimentanalysis(text using PARAMETERS language='spanish') OVER(PARTITION BEST)
from MyTweets;
 sentence |   attribute    | sentiment_score
----------+----------------+-----------------
        1 | galletas       |               1
        1 | iphone         |               1
        1 | celular        |               1
        1 | zorro          |               1
        1 | perro          |              -1
(5 rows)

The following example shows a query that analyzes tweets from a table with tweets in different
languages. The Spanish tweets do not have the language value. In a single query you can specify
both an argument and parameter. The argument has precedence over the parameter setting. In this
case the parameter is only used when a tweet doesn't provide a language value.
create table myTweets (doc_id int, text varchar(300), language varchar(15));
insert into myTweets values (1, 'Vertica is the best company', 'english');
insert into myTweets values (2, 'Cookies are sweet', 'english');
insert into myTweets values (3, 'The quick brown fox jumped over the lazy dog', 'english');
insert into myTweets values (4, 'Las galletas son dulces');
insert into myTweets values (5, 'el iphone es el celular mas popular');
-- Partition by doc_id, the column defined in the table above.
select doc_id, sentimentanalysis(text,language using PARAMETERS language='spanish')
OVER(PARTITION BY doc_id, text) from MyTweets;
 doc_id | sentence | attribute | sentiment_score
--------+----------+-----------+-----------------
      1 |        1 | vertica   |               1
      1 |        1 | company   |               1
      2 |        1 | cookies   |               1
      3 |        1 | fox       |               1
      3 |        1 | dog       |              -1
      4 |        1 | galletas  |               1
      5 |        1 | iphone    |               1
      5 |        1 | celular   |               1
(8 rows)

Using the Default Language


select sentimentanalysis('Cookies are sweet') OVER(PARTITION BEST);
 sentence | attribute | sentiment_score
----------+-----------+-----------------
        1 | cookies   |               1
(1 row)


Pulse Cookbook
This section contains the following recipes for using Pulse:
• Batch Analyzing Data as It Is Loaded
• Analyzing Comments for a Company or Product
• Determining Popular Topics
• Determining Prolific Authors
• Analyzing the Sentiment of Specific Authors
• Finding Associated Attributes
• Using Pulse as an Aid in Competitive Analysis

Batch Analyzing Data as It Is Loaded


If you are constantly loading data that needs to be analyzed with Pulse, then you should run the
sentimentAnalysis() function in batches on the newly loaded data. You can store the sentiment
scores in a separate table and associate the rows in the scored table with the original table by
joining on IDs between the tables. Running sentimentAnalysis() as the data is loaded and storing
the results is more efficient than running sentimentAnalysis() during interactive sessions because
sentimentAnalysis() can take a few seconds to return results.
For example, suppose that you are using the Social Media Connector (available in the Data
Ingest section of the HP Vertica Marketplace) to retrieve Twitter tweets and load them into HP
Vertica. In this case, you can create shell scripts and a cron job to automatically run
sentimentAnalysis() on the text of the tweets. Then you can store the resulting scores in a table for
quick retrieval later on.
Complete the following steps as the dbadmin user to run sentimentAnalysis() on your Twitter data.
This task also sets up the system to run sentimentAnalysis() on new Twitter data every 2 minutes.
1. Create a table to hold the tweets (for example, named tweets) with the following structure:
create table tweets(
id int,
created_at timestamptz,
"user.name" varchar(144),
"user.screen_name" varchar(144),
text varchar(500),
"retweeted_status.retweet_count" int,
"retweeted_status.id" int,
"retweeted_status.favorite_count" int,
"user.location" varchar(144),
"coordinates.coordinates.0" float,
"coordinates.coordinates.1" float,
lang varchar(5)
);

The columns are based on the data returned by Twitter's streaming API. The fields are defined
in the Twitter Field Guide at https://dev.twitter.com/docs/platform-objects/tweets.
Note that the columns with quoted names, such as "user.name" and "user.screen_name", are sub-fields
within a larger field. For example, the "users" field is described here:
https://dev.twitter.com/docs/platform-objects/users.
At a minimum, you must have columns for id, text, and "user.screen_name".
2. Create a table to hold the sentiment scores (for example, named tweet_sentiment). Then load
it with the scores from your existing tweets. Make sure no new tweets are loaded until this step
completes.
Replace the column names in the following example with the column names from your Twitter
table. The example uses the column names used by the Social Media Connector:

create table tweet_sentiment as
(select id, "user.screen_name",
SentimentAnalysis(text using parameters filterlinks=true,
filterusermentions=true)
over (partition by id, "user.screen_name", text)
from tweets where lang='en' order by attribute);
-- The following table defines which data has been analyzed
create table dt_start as (select max(created_at) dt from tweets);
commit;

Note: If you have a large number of tweets, this command can take a long time to run.
However, it is important to score your existing data before you start scoring newly loaded
data.

3. Create a SQL script to update the tweet_sentiment table with data from newly loaded tweets.
Save it in the home folder of the HP Vertica database admin user. For example, this path could
be /home/dbadmin/tweet_update.sql.
Replace the column names with the column names from your twitter table. The following
example uses the column names used by the HP Vertica Social Media Connector:
\i /opt/vertica/packages/pulse/ddl/loadUserDictionaries.sql
drop table if exists dt_end;
create table dt_end as (select max(created_at) dt from tweets);
-- run sentiment
insert into tweet_sentiment
(select id, "user.screen_name",
SentimentAnalysis(text using parameters filterlinks=true,
filterusermentions=true)
over (partition by id, "user.screen_name", text)
from tweets where lang='en' and
tweets.created_at > (select dt from dt_start) and
tweets.created_at <= (select dt from dt_end)
order by attribute);
-- copy date end into new start date
drop table if exists dt_start;
create table dt_start as (select dt from dt_end);
-- free up jvm resource pool memory used by this script
select release_jvm_memory();

4. Create a shell script named tweet_update.sh that is run from a cron job. This shell script runs
the tweet_update.sql script and logs the results to the file tweet_update.log. Save the
tweet_update.sh script in the home folder of the HP Vertica database admin user. For example, this
path could be /home/dbadmin/tweet_update.sh. Make the script executable so that cron can run it.
Replace the dbadmin, password, and databasename values with the values for your system.
/opt/vertica/bin/vsql -U dbadmin -w password -d databasename -f \
/home/dbadmin/tweet_update.sql > tweet_update.log

5. Create a cron job to run the script every two minutes. Use the command crontab -e to create
the cron job.
*/2 * * * * /home/dbadmin/tweet_update.sh

The script runs every two minutes. Any new tweets that have been loaded in that two-minute
window are analyzed and the results are added to the tweet_sentiment table. You can then join the
tweets and tweet_sentiment tables on their id columns in your queries.
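
The watermark pattern used by tweet_update.sql (score only the rows newer than dt_start and no newer than dt_end, then advance dt_start) can be tried outside of HP Vertica. The following is a minimal Python sketch, assuming the standard-library sqlite3 module as a stand-in database and a simplified table layout; score_text() is a hypothetical placeholder for sentimentAnalysis() and does not reflect how Pulse scores text.

```python
import sqlite3

def score_text(text):
    # Hypothetical placeholder for sentimentAnalysis(); not the Pulse algorithm.
    return 1 if "great" in text.lower() else 0

def run_batch(conn):
    """Score tweets in the window (dt_start, dt_end] and advance the watermark."""
    cur = conn.cursor()
    dt_start = cur.execute("SELECT dt FROM dt_start").fetchone()[0]
    dt_end = cur.execute("SELECT MAX(created_at) FROM tweets").fetchone()[0]
    if dt_end is None:  # no tweets loaded yet
        return 0
    rows = cur.execute(
        "SELECT id, text FROM tweets WHERE created_at > ? AND created_at <= ?",
        (dt_start, dt_end),
    ).fetchall()
    cur.executemany(
        "INSERT INTO tweet_sentiment (id, sentiment_score) VALUES (?, ?)",
        [(tweet_id, score_text(text)) for tweet_id, text in rows],
    )
    # Copy the end of this window into the start of the next one.
    cur.execute("UPDATE dt_start SET dt = ?", (dt_end,))
    conn.commit()
    return len(rows)

def demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE tweets (id INTEGER, created_at TEXT, text TEXT);
        CREATE TABLE tweet_sentiment (id INTEGER, sentiment_score INTEGER);
        CREATE TABLE dt_start (dt TEXT);
        INSERT INTO tweets VALUES (1, '2014-01-01 10:00:00', 'Owl-2 flies great');
        INSERT INTO dt_start VALUES ('2014-01-01 10:00:00');
    """)
    first = run_batch(conn)   # tweet 1 sits at the watermark: already scored
    conn.execute(
        "INSERT INTO tweets VALUES (2, '2014-01-01 10:01:00', 'Owl-2 is great')")
    second = run_batch(conn)  # only tweet 2 falls inside the new window
    third = run_batch(conn)   # nothing new was loaded, so the window is empty
    return first, second, third
```

Calling demo() returns the number of tweets scored in each of the three runs. Because the lower bound is exclusive and the upper bound is inclusive, a tweet is scored exactly once no matter how often the job runs.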

Analyzing Comments for a Company or Product


Pulse allows you to analyze comments (such as tweets) for a particular company or product.
For example, imagine that the fictional company Pytell Corp has just released a new product called
Owl-2. You want to analyze the sentiment of both the company and the product.
You've collected tweets about several companies and products into your
database. However, for this analysis you only want to target tweets that have to do with Pytell
Corp and/or Owl-2.
The dataset for this example is below:
create table tweets_sample(id int, author varchar(50), text varchar(400));
insert into tweets_sample values(400900, 'DramaBugs',
'Pytell Corp has horrible customer support. On Hold 2 hours!');
insert into tweets_sample values(401200, 'Gemball',
'Owl-2 doesn''t fly!');
insert into tweets_sample values(403070, 'Postta',
'Pytell finally released Owl-2!');
insert into tweets_sample values(480920, 'Instana',
'Unboxing Owl-2 after work today! Stay Tuned!');
insert into tweets_sample values(434500, 'Dailydant',
'Owl-2 flies great! I like it!');
insert into tweets_sample values(450670, 'HelpfulBen',
'Owl-2 keeps crashing into things!');
insert into tweets_sample values(402092, 'Championtips',
'Owl-2 has solved our rodent infestation!');
insert into tweets_sample values(434950, 'Editone',
'Pytell fail? Reports of Owl-2 crashing through windows.');
insert into tweets_sample values(413956, 'CzarLatest',
'Pytell Corp''s Owl-2 just released!');
insert into tweets_sample values(459988, 'CelticMiss', 'I like Ponies!');
insert into tweets_sample values(403511, 'BuffDrama',
'I am afraid of small spiders.');
commit;

1. Run SentimentAnalysis to get an idea of how Pulse is analyzing the data:


SELECT author, SentimentAnalysis(text) OVER(PARTITION BY author, text)
FROM tweets_sample ORDER BY attribute;
 author       | sentence | attribute          | sentiment_score
--------------+----------+--------------------+-----------------
 DramaBugs    |        1 | customer support   |              -1
 Championtips |        1 | owl-2              |               0
 HelpfulBen   |        1 | owl-2              |              -1
 Instana      |        1 | owl-2              |               1
 CzarLatest   |        1 | owl-2              |               0
 Gemball      |        1 | owl-2              |               0
 Postta       |        1 | owl-2              |               0
 Dailydant    |        1 | owl-2              |               1
 Editone      |        2 | owl-2              |              -1
 CelticMiss   |        1 | ponies             |               1
 Editone      |        2 | reports            |              -1
 Championtips |        1 | rodent infestation |               0
 BuffDrama    |        1 | spiders            |              -1
 Instana      |        2 | tuned              |               0
 Editone      |        1 | Pytell             |              -1
 Postta       |        1 | Pytell             |               0
 CzarLatest   |        1 | Pytell corp        |               0
 DramaBugs    |        1 | Pytell corp        |              -1
 Editone      |        2 | windows            |              -1
 Instana      |        1 | work today         |               0
(20 rows)


2. There are some attributes listed (ponies!) that do not apply to the analysis that you are doing.
You can focus your analysis by adding whitelist entries and filtering on the whitelist. Insert
whitelist entries for the company and product name into the standard whitelist:
INSERT INTO pulse.white_list_en VALUES ('Pytell Corp');
INSERT INTO pulse.white_list_en VALUES ('owl-2');
commit;

Reload the whitelist into Pulse. Loading a user-dictionary or mapping overwrites the existing
user-dictionary or mapping:
SELECT LoadDictionary(standard USING PARAMETERS listName='white_list') OVER() FROM
pulse.white_list_en;

3. Also, note that Pulse is not identifying all variations on the company name. The results contain
two attributes for the company name ('Pytell' and 'Pytell corp'). You can normalize these values
by using a normalization mapping. Add the synonyms to the standard normalization mapping:
insert into pulse.normalization_en values('Pytell', 'Pytell Corp');
commit;

4. Reload the normalization mapping to load the new values into Pulse:
SELECT LoadMapping(standard_base, standard_synonym USING PARAMETERS
mapName='normalization') OVER() FROM pulse.normalization_en;

5. Run the query again to see how the normalization affects the results.
Note that 'pytell corp' has been normalized to 'pytell': Pulse now identifies the
synonyms and maps them to the base term.
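
The effect of a normalization mapping on attribute counts can be illustrated with a short Python sketch. It is not Pulse code: the dictionary mirrors the single row added to pulse.normalization_en, normalize() and count_normalized() are hypothetical helpers, and lower-casing before the lookup is an assumption made for the sketch.

```python
from collections import Counter

# Synonym -> base term, mirroring the row added to pulse.normalization_en.
# Lower-casing before the lookup is an assumption made for this sketch.
MAPPING = {"pytell corp": "pytell"}

def normalize(attribute, mapping=MAPPING):
    """Return the base term for a synonym, or the attribute itself."""
    key = attribute.lower()
    return mapping.get(key, key)

def count_normalized(attributes, mapping=MAPPING):
    """Count attributes after folding synonyms into their base term."""
    return Counter(normalize(a, mapping) for a in attributes)

# Company and product attributes in the proportions seen in the first query.
counts = count_normalized(["Pytell", "Pytell", "Pytell corp", "Pytell corp", "owl-2"])
```

After normalization, the two spellings of the company name are counted together under one attribute, which is what makes per-company aggregates meaningful.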

Determining Popular Topics


The next examples in this cookbook use a table with the following structure:
create table tweets(
id int,
created_at timestamptz,
"user.name" varchar(144),
"user.screen_name" varchar(144),
text varchar(500),
"retweeted_status.retweet_count" int,
"retweeted_status.id" int,
"retweeted_status.favorite_count" int,
"user.location" varchar(144),
"coordinates.coordinates.0" float,
"coordinates.coordinates.1" float,
lang varchar(5)
);

The columns are based on the data returned by Twitter's streaming API. The fields are defined in
the Twitter Field Guide at https://dev.twitter.com/docs/platform-objects/tweets.
Note that the columns with quoted names, such as "user.name" and "user.screen_name", are sub-fields within
a larger field. For example, the "users" field is described here:
https://dev.twitter.com/docs/platform-objects/users.
The example queries provided work with any Twitter data that follows the above table structure.

Determining Popular Topics


The Pulse attribute discovery feature allows you to easily find popular topics in a data set. Use the
CommentAttributes() function to extract the attributes from rows of text and count the number of
times the attribute occurs.
For example, using a dataset of 30,000 tweets that matched a keyword of "D11" collected during
the D11 tech conference in 2013, you could get a count of the attributes discovered by Pulse to
determine popular topics:
SELECT t.attribute, count(*) FROM(SELECT CommentAttributes(text)
OVER(PARTITION BEST) FROM tweets) as t
GROUP BY t.attribute ORDER BY count(*) DESC LIMIT 10;
 attribute   | count
-------------+-------
 http        |  3631
 d11         |  3281
 rt          |  2453
 encryption  |  2356
 usb         |  2121
 aes-256     |  1859
 smartphones |  1843
 rt @hp      |  1788
 world       |  1609
 ceo         |  1520
(10 rows)


If the dataset contains tweets in both English and Spanish, then (using the Pulse multilingual
version) each tweet can be analyzed according to its language by specifying the language as an
argument to the CommentAttributes() function. If the language of a specific tweet is not supported,
then that tweet is ignored by the function. For example:
SELECT t.attribute, count(*) FROM(SELECT CommentAttributes(text,lang)
OVER(PARTITION BEST) FROM tweets) as t
GROUP BY t.attribute ORDER BY count(*) DESC LIMIT 10;

Notice that the top attribute is "http". This is due to the large number of links in tweets. You can
ignore links by using the filterlinks parameter of CommentAttributes():
SELECT t.attribute, count(*) FROM
(SELECT CommentAttributes(text USING PARAMETERS filterlinks=true)
OVER(PARTITION BEST) FROM tweets) as t
GROUP BY t.attribute ORDER BY count(*) DESC LIMIT 10;
 attribute   | count
-------------+-------
 d11         |  4757
 rt          |  2397
 encryption  |  2356
 usb         |  2121
 aes-256     |  1871
 smartphones |  1829
 rt @hp      |  1788
 world       |  1611
 ceo         |  1542
 interview   |  1346
(10 rows)

The attribute "http" is now gone from the list, but we still have "rt" (for retweet) on the list and it is
not helpful in this context. You can omit terms such as "rt" by adding them to the stop_words list
and reloading the stop_words user-dictionary:
INSERT INTO pulse.stop_words_en VALUES('rt');
commit;
SELECT LoadDictionary(standard USING PARAMETERS
listName='stop_words') OVER() FROM pulse.stop_words_en;

When you rerun the query you get more accurate results for the popular topics in the data set:

SELECT t.attribute, count(*) FROM
(SELECT CommentAttributes(text USING PARAMETERS filterlinks=true)
OVER(PARTITION BEST) FROM tweets) as t
GROUP BY t.attribute
ORDER BY count(*) DESC LIMIT 10;
 attribute  | count
------------+-------
 d11        |  4757
 encryption |  2356
 perfume    |  2121
 usb        |  1871
 aes-256    |  1829
 rt @hp     |  1788
 world      |  1611
 ceo        |  1542
 interview  |  1346
 cloud      |  1306
(10 rows)

You can further refine the list to topics that contain specific attributes by adding the attributes in
which you are interested to the white_list, and then filtering with the whitelistonly parameter:
SELECT t.attribute, count(*) FROM
(SELECT CommentAttributes(text USING PARAMETERS filterlinks=true,
whitelistonly=true) OVER(PARTITION BEST) FROM tweets) as t
GROUP BY t.attribute
ORDER BY count(*) DESC LIMIT 10;

Determining The Sentiment of Popular Topics


In addition to finding popular, or most discussed, topics in your data set, you can also easily get an
average sentiment for the topics.
The following example uses a dataset of 10,000 tweets containing the hashtag #sports.
SELECT * from
(SELECT attribute, count(attribute) AS
cnt, AVG(sentiment_score) FROM (select
SentimentAnalysis(text USING PARAMETERS
filterlinks=true) OVER(PARTITION BEST) from tweets)
AS t1 GROUP BY attribute ORDER BY
AVG(sentiment_score) desc) AS t2
WHERE t2.cnt > 500
LIMIT 5;

The result shows the 5 attributes with the highest average sentiment among attributes that have
more than 500 occurrences:
 attribute  | cnt  |        avg
------------+------+-------------------
 football   |  817 | 0.290085679314565
 game       |  638 | 0.134796238244514
 baseball   | 1558 | 0.128369704749679
 basketball |  776 | 0.114690721649485
 hockey     | 2610 | 0.113409961685824
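
The aggregation step of the query above (count each attribute, keep only those above a threshold, and rank by average sentiment) can be tried on already-scored rows without Pulse. This is a minimal Python sketch, assuming the standard-library sqlite3 module as a stand-in database; the scored table, the top_attributes() helper, and the sample rows are hypothetical. It expresses the threshold with HAVING, which is a restructuring of the nested query rather than the form used in this guide.

```python
import sqlite3

def top_attributes(conn, min_count, limit):
    """Return (attribute, cnt, avg_score) rows, highest average sentiment first."""
    return conn.execute(
        """
        SELECT attribute, COUNT(*) AS cnt, AVG(sentiment_score) AS avg_score
        FROM scored
        GROUP BY attribute
        HAVING COUNT(*) > ?
        ORDER BY avg_score DESC
        LIMIT ?
        """,
        (min_count, limit),
    ).fetchall()

def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE scored (attribute TEXT, sentiment_score INTEGER)")
    rows = (
        [("football", 1)] * 3 + [("football", 0)]   # 4 rows, average 0.75
        + [("hockey", 1)] + [("hockey", -1)] * 2    # 3 rows, average -1/3
        + [("curling", 1)]                          # 1 row, below the threshold
    )
    conn.executemany("INSERT INTO scored VALUES (?, ?)", rows)
    return top_attributes(conn, min_count=2, limit=5)
```

In the sample data, the single highly positive "curling" row is excluded by the count threshold, which is the point of filtering on the number of occurrences before ranking by average.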

Determining Prolific Authors


You can identify prolific authors of your textual data without using any of the Pulse functions. For
example, using the same dataset as the examples in Determining Popular Topics, you can easily
determine how many tweets each author posted:
select "user.name", count(*) as post_count from tweets group by
"user.name" order by count(*) DESC limit 10;
 user.name            | post_count
----------------------+------------
 Nick Cicero          |        182
 Networked Society    |        171
 AllThingsD           |        137
 Stephanie~           |        117
 Jennifer Ives        |        105
 Claudia-ElasticMinds |        101
 Needful Things       |         96
 Poptart Tech         |         85
 Patrick Bertrand     |         84
 Alessandro Piol      |         81
(10 rows)

Analyzing the Sentiment of Specific Authors


You can use the white_list feature of SentimentAnalysis() to filter the attributes so only the
white_list terms are returned. You can combine the white_list with a query for a list of specific
authors to narrow down the results to a specific subset of authors.
Using the same tweets_sample table from Analyzing Comments for a Company or Product, add the
following sample tweets:
INSERT INTO tweets_sample VALUES('123', 'bcook',
'The hyperdrive is a great machine.');
INSERT INTO tweets_sample VALUES('124', 'sprock',
'The hyperdrive is a pinnacle of technology.');
INSERT INTO tweets_sample VALUES('125', 'tgates',
'What is a hyperdrive?');
INSERT INTO tweets_sample VALUES('126', 'bcook', 'Roses are red.');
INSERT INTO tweets_sample VALUES('127', 'sprock',
'Energy equals mass times the speed of light squared.');
INSERT INTO tweets_sample VALUES('128', 'tgates', 'Violets are blue.');
commit;

Create an authors table to hold the names of the authors whose sentiment you want to analyze:
CREATE TABLE authors (name VARCHAR, screenname VARCHAR);

Then insert the following authors:


INSERT INTO authors VALUES('Brian Cook','bcook');
INSERT INTO authors VALUES('Tom Gates', 'tgates');
INSERT INTO authors VALUES('Jim Sprock', 'sprock');
commit;

Add the word 'hyperdrive' to your existing white_list and reload the white_list user-dictionary:
INSERT INTO pulse.white_list_en VALUES('hyperdrive');
SELECT LoadDictionary(standard USING PARAMETERS
listName='white_list') OVER() FROM pulse.white_list_en;

Then, you can run a query that filters on authors and the white_list and provides you with a
sentiment score and the content of the analyzed text:
SELECT t1.id, t1.author, t1.attribute, t1.sentiment_score, t2.text from (SELECT id,
author, SentimentAnalysis(text USING PARAMETERS
whitelistonly=true) OVER (PARTITION BY id, author) FROM tweets_sample
WHERE author IN (SELECT screenname FROM authors)) AS t1 JOIN (SELECT id,
text FROM tweets_sample) AS t2 ON t1.id = t2.id ;

 id  | author | attribute  | sentiment_score | text
-----+--------+------------+-----------------+------------------------------
 123 | bcook  | hyperdrive |               1 | The hyperdrive is a great
 124 | sprock | hyperdrive |               1 | The hyperdrive is a pinnacle
 125 | tgates | hyperdrive |               0 | What is a hyperdrive?
(3 rows)

Finding Associated Attributes


Once you've analyzed your tweets and stored them in a table (see Batch Analyzing Data as It Is
Loaded) you can use the analyzed data to make quick comparisons, such as finding attributes most
associated with another attribute.
For example, if your primary attribute is 'microsoft', you may want to determine which other
attributes are used most often with the word 'microsoft' in the same tweet. This can be
accomplished with the following SQL:
select t1.attribute, count(*), avg(t1.sentiment_score) from tweet_sentiment t1,
tweet_sentiment t2 where t1.id=t2.id and not t1.attribute=t2.attribute and
t2.attribute = 'microsoft' group by t1.attribute order by count desc limit 5;

We get the following results from a data set of 25,000 PC Manufacturer tweets:
 attribute         | count |        avg
-------------------+-------+--------------------
 windows phone     |    81 | 0.0238095238095238
 power data center |    77 |   0.58974358974359
 wind project      |    77 |                  0
 investment        |    73 |                  0
 windows           |    57 |  0.175438596491228

The query allows you to gain additional insight into the scope of an attribute and may help you
determine why a certain attribute is scored a certain way.
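
The self-join above can be tried on a small data set without HP Vertica. This is a minimal Python sketch, assuming the standard-library sqlite3 module as a stand-in database; the associated() helper and the sample rows are hypothetical, and the join is written with an explicit JOIN clause, which is equivalent to the comma-style join in the query above.

```python
import sqlite3

QUERY = """
    SELECT t1.attribute, COUNT(*) AS cnt, AVG(t1.sentiment_score) AS avg_score
    FROM tweet_sentiment t1
    JOIN tweet_sentiment t2 ON t1.id = t2.id
    WHERE t2.attribute = ? AND t1.attribute <> t2.attribute
    GROUP BY t1.attribute
    ORDER BY cnt DESC, t1.attribute
    LIMIT ?
"""

def associated(conn, target, limit=5):
    """Attributes that share a tweet with `target`, most frequent first."""
    return conn.execute(QUERY, (target, limit)).fetchall()

def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tweet_sentiment "
        "(id INTEGER, attribute TEXT, sentiment_score INTEGER)")
    conn.executemany("INSERT INTO tweet_sentiment VALUES (?, ?, ?)", [
        (1, "microsoft", 0), (1, "windows", 1),
        (2, "microsoft", 1), (2, "windows", 0), (2, "investment", 0),
        (3, "windows", -1),  # tweet 3 never mentions 'microsoft', so it is ignored
    ])
    return associated(conn, "microsoft")
```

Note that the average is computed over the associated attribute's own scores (t1), restricted to tweets that also contain the target attribute.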

Using Pulse as an Aid in Competitive Analysis


This topic details how you can use Pulse to conduct basic competitive analysis for products or
brands. Pulse makes basic competitive analysis simple through its whitelist feature. By
using the whitelist, you can analyze only the tweets that pertain to the brands or products
that you are evaluating.
For example, say you want to analyze the sentiment of major food brands to determine how the
brands compare to each other and which words people associate (positively and negatively) with
the brands. Your workflow for this analysis with Twitter and HP Vertica Pulse could be as
follows:
1. Start collecting tweets based on the brands or products that you are following. For example,
you can use the Social Media Connector (available on the HP Vertica Marketplace) to collect tweets
matching keywords.
2. Create a white_list that contains the same keywords as the tweets that you are
collecting. The whitelist allows you to later group and filter the collected tweets. For example:
insert into pulse.white_list_en values ('productA');
insert into pulse.white_list_en values ('productB');
insert into pulse.white_list_en values ('productC');
\i /opt/vertica/packages/pulse/ddl/loadUserDictionaries.sql

3. Batch load the tweets, and be sure to specify whitelistonly=true and relatedwords=true in
the sentimentAnalysis() function. This creates a table with the sentiment score for your
whitelisted attributes. Note that this should be done in batches for large data sets. For smaller
data sets (depending on your hardware) you can try to analyze all the tweets at once. For
example:
create table tweet_sentiment as
(select id, "user.screen_name",
SentimentAnalysis(text using parameters filterlinks=true,
filterusermentions=true, relatedwords=true,
filterretweets=true, whitelistonly=true)
over (partition by id, "user.screen_name", text)
from tweets where lang='en' order by attribute );

4. Verify that your tweet_sentiment table contains only your whitelist attributes. The following
query should only return the brands/products that you have whitelisted. For example:
=> select distinct(attribute) from tweet_sentiment;
 attribute
-----------
 ProductA
 ProductB
 ProductC
(3 rows)

5. You can get a basic idea of which product or brand is being talked about the most by seeing
how many instances of each attribute appear in your data set:
=> select attribute, count(*) from tweet_sentiment group by (attribute) order by
count(*) desc;
 attribute | count
-----------+-------
 ProductA  |   701
 ProductB  |   192
 ProductC  |    52
(3 rows)

You can see that ProductA is the most talked about of the three products being analyzed over the
time frame in which the tweets were collected.
6. Determine the average sentiment scores of the tweets you have collected:
=> select attribute, avg(sentiment_score) as score from tweet_sentiment group by
(attribute) order by score DESC;
 attribute |        score
-----------+---------------------
 ProductC  |   0.192307692307692
 ProductB  | -0.0729166666666667
 ProductA  |  -0.122681883024251
(3 rows)

From this basic analysis, you can see that ProductC has the most positive sentiment from the
three brands being analyzed over the time period when the tweets were collected, and
ProductA has the lowest sentiment.
7. You can also determine which words or phrases are associated with each attribute in their
positive and negative contexts. For example, to see the list of words that are most associated
with positive sentiment for ProductC, you can look at the related words fields and add up the
occurrences of words associated with positive sentiment:
=> select count(*), related_word_1 from tweet_sentiment where attribute = 'ProductC'
and sentiment_score > 0 group by related_word_1 order by count DESC;
 count | related_word_1
-------+----------------
    11 | delicious
     2 | love
     1 | best
     1 | bless
     1 | good
     1 | work
(6 rows)

You can also do the same for negative sentiment:
=> select count(*), related_word_1 from tweet_sentiment where attribute = 'ProductC'
and sentiment_score < 0 group by related_word_1 order by count DESC;
 count | related_word_1
-------+----------------
     1 | working
     1 | dragging
     1 | bad
     1 | doomed
     1 | loud
     1 | stressful
     1 | damn
(7 rows)

8. Finally, Pulse makes it easy to see other attributes associated with your target attributes to
help you better understand the context in which people are discussing the brands or products
that you are analyzing.
a. Create another sentiment table from your data, but this time omit the whitelistonly and
relatedwords parameters:
create table tweet2_sentiment as
(select id, "user.screen_name",
SentimentAnalysis(text using parameters filterlinks=true,
filterusermentions=true, filterretweets=true)
over (partition by id, "user.screen_name", text)
from tweets where lang='en' order by attribute );

b. Next, query the tweets that contain your target attribute and find all the other attributes
associated with those tweets. Display a count of the top 5 attributes (not including the
target attribute):
=> select count(attribute), attribute from tweet2_sentiment where id in (select
id from tweet_sentiment where attribute = 'ProductC') and attribute <> 'ProductC'
group by (attribute) order by count(attribute) DESC limit 5;
 count | attribute
-------+-----------
    13 | bbq
    11 | state
    11 | sandwich
    11 | steak
     3 | ProductB
(5 rows)

As you can see, a few basic queries can tell you the general sentiment differences between
multiple brands or products. You can also determine which words are contributing to the sentiment
of each product/brand that you are analyzing and which other attributes people are talking about
when they mention the brand or product(s) that you are analyzing.
You could further refine these queries by breaking out different geographic locations or times of day.
To do so, join the IDs of the tweet_sentiment table back to the main tweets table and filter by
location or time.
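
Joining the scores back to the tweets table to break sentiment out by location could look like the following. This is a minimal Python sketch, assuming the standard-library sqlite3 module as a stand-in database and a reduced tweets table that keeps only the id and "user.location" columns; the avg_by_location() helper and the sample rows are hypothetical.

```python
import sqlite3

def avg_by_location(conn, attribute):
    """Average sentiment for one attribute, broken out by the author's location."""
    return conn.execute(
        """
        SELECT t."user.location", COUNT(*) AS cnt, AVG(s.sentiment_score) AS score
        FROM tweet_sentiment s
        JOIN tweets t ON t.id = s.id
        WHERE s.attribute = ?
        GROUP BY t."user.location"
        ORDER BY t."user.location"
        """,
        (attribute,),
    ).fetchall()

def demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE tweets (id INTEGER, "user.location" TEXT);
        CREATE TABLE tweet_sentiment
            (id INTEGER, attribute TEXT, sentiment_score INTEGER);
        INSERT INTO tweets VALUES (1, 'Boston'), (2, 'Boston'), (3, 'Austin');
        INSERT INTO tweet_sentiment VALUES
            (1, 'ProductC', 1), (2, 'ProductC', 0),
            (3, 'ProductC', -1), (3, 'ProductA', 1);
    """)
    return avg_by_location(conn, "ProductC")
```

The same join works for a time-of-day breakdown by grouping on an expression over created_at instead of the location column.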

Pulse Function Reference


CommentAttributes
ExtractSentence
GetAllDictionarySetLabels
GetAllDictionaryWords
GetAllLoadedDictionaries
GetAllMappingWords
GetAllSentences
GetLoadedDictionary
GetLoadedMapping
GetSentenceCount
GetStorage
LoadDictionary
LoadMapping
PartsOfSpeech
SentimentAnalysis
SetDefaultLanguage
UnloadLabeledDictionary
UnloadLabeledDictionarySet
UnloadLabeledMapping

CommentAttributes
Retrieves the attributes (nouns) from a given piece of text.

Syntax
CommentAttributes(text [, language] [USING PARAMETERS
[whitelistonly = boolean ]
[, filterlinks = boolean ]
[, filterusermentions = boolean ]
[, filterhashtags = boolean ]
[, filterpunctuation = boolean ]
[, filterretweets = boolean ]
[, adjustcasing = boolean ]
[, language = string ]
])

Parameters
Argument            Description
text                The text from which to extract the attributes.
language            The language:
                    • 'english' or 'en'
                    • 'spanish' or 'es'
whitelistonly       Optional. Default false. When set to true, only attributes
                    defined in the white_list user-dictionary are returned.
filterlinks         Optional. Default false. When set to true, links are not set
                    as attributes.
filterusermentions  Optional. Default false. When set to true, Twitter
                    usernames (@username) are not set as attributes.
filterhashtags      Optional. Default false. When set to true, removes the
                    following from tweets:
                    • hashtag symbols - For example, #pizza becomes pizza.
                    • @mentions - For example, HP Vertica would remove
                      @NewYorkCity from a tweet.
                    • Link URLs
filterpunctuation   Optional. Default true. Filters any punctuation that occurs at
                    the beginning of an attribute other than @ and #.
filterretweets      Optional. Default false. Filters out the characters "RT"
                    from re-tweets in attributes.
adjustcasing        Optional. Default false. When set to true, all letters in
                    the sentence are converted to upper-case before sentence
                    detection. After sentence detection all letters are converted
                    to lower-case. This option is helpful if the original data is all
                    in lower-case and Pulse is incorrectly identifying parts of
                    speech in the sentence.

Notes
• The text argument is limited to 65,000 bytes.
• This function must be used with the over() clause. Use OVER(PARTITION BEST) for the
  best performance if the query does not require specific columns in the over() clause.
• language can be specified as an argument and/or as a parameter. The argument value
  supersedes the parameter value.

Examples
select CommentAttributes('The quick brown fox jumped over the lazy dog. All good boys deserve fudge.') OVER(PARTITION BEST);
 sentence | attribute
----------+-----------
        1 | fox
        1 | dog
        2 | boys
        2 | fudge
(4 rows)
select commentattributes('the quick brown fox jumped over the lazy dog. All good boys deserve fudge', 'english') over();
 sentence | attribute
----------+-----------
        1 | fox
        1 | dog
        2 | boys
        2 | fudge
(4 rows)
select commentattributes('the quick brown fox jumped over the lazy dog. All good boys deserve fudge' using parameters language='english') over();
 sentence | attribute
----------+-----------
        1 | fox
        1 | dog
        2 | boys
        2 | fudge
(4 rows)
select commentattributes('el zorro rapido brinco sobre el perro flojo. Todos los chicos buenos merecen un premio', 'spanish') over();
 sentence | attribute
----------+-----------
        1 | zorro
        1 | perro
        2 | chicos
        2 | premio
(4 rows)
select commentattributes('el zorro rapido brinco sobre el perro flojo. Todos los chicos buenos merecen un premio' using PARAMETERS language='spanish') over();
 sentence | attribute
----------+-----------
        1 | zorro
        1 | perro
        2 | chicos
        2 | premio
(4 rows)

Filtering User-mentions
SELECT CommentAttributes('@user is always late. He kept me waiting 20 minutes last weekend.'
USING PARAMETERS filterusermentions=true) OVER(PARTITION BEST);
 sentence | attribute
----------+-----------
        2 | weekend
(1 row)

See Also
• SentimentAnalysis()

ExtractSentence
Returns the specified sentence from a body of text.

Syntax
ExtractSentence(text, sentence [, language] [USING PARAMETERS
[filterlinks = boolean ]
[, filterusermentions = boolean ]
[, filterhashtags = boolean ]
[, adjustcasing = boolean ]
[, language = string ]
])

Parameters
Argument            Description
text                The text containing the sentence to extract.
language            The language:
                    • 'english' or 'en'
                    • 'spanish' or 'es'
sentence            Integer value. The number of the sentence in the text.
filterlinks         Optional. Default false. When set to true, sentences that are only links
                    are skipped over and ignored. Any links in a sentence are not included in
                    the extracted sentence.
filterusermentions  Optional. Default false. When set to true, sentences that are only Twitter
                    user mentions (@username) are skipped over and ignored. Any user
                    mentions in a sentence are not included in the extracted sentence.
filterhashtags      Optional. Default false. When set to true, sentences that are only Twitter
                    hashtags (#hashtag) are skipped over and ignored. Any hashtags in a
                    sentence are not included in the extracted sentence.
adjustcasing        Optional. Default false. When set to true, all letters in the sentence
                    are converted to upper-case before sentence detection. After sentence
                    detection all letters are converted to lower-case. This option is helpful if
                    the original data is all in lower-case and Pulse is incorrectly identifying
                    parts of speech in the sentence.

Notes
• The text argument is limited to 65,000 bytes.
• This function must be used with the over() clause. Use OVER(PARTITION BEST) for the
  best performance if the query does not require specific columns in the over() clause.
• language can be specified as an argument and/or as a parameter. The argument value
  supersedes the parameter value.

Examples
select ExtractSentence('The quick brown fox jumped. Every good boy deserves fudge', 2)
OVER(PARTITION BEST);
            sentence
--------------------------------
 Every good boy deserves fudge.
(1 row)
select extractSentence('the quick brown fox jumped over the lazy dog. All good boys deserve fudge', 2, 'english') over();
          sentence
-----------------------------
 All good boys deserve fudge
(1 row)
select extractSentence('the quick brown fox jumped over the lazy dog. All good boys deserve fudge', 2 using parameters language='english') over();
          sentence
-----------------------------
 All good boys deserve fudge
(1 row)
select extractSentence('el zorro rapido brinco sobre el perro flojo. Todos los chicos buenos merecen un premio', 2, 'spanish') over();
                 sentence
-------------------------------------------
 Todos los chicos buenos merecen un premio
(1 row)
select extractSentence('el zorro rapido brinco sobre el perro flojo. Todos los chicos buenos merecen un premio', 2 using parameters language='spanish') over();
                 sentence
-------------------------------------------
 Todos los chicos buenos merecen un premio
(1 row)

Filtering Links
SELECT ExtractSentence('HP - http://hp.com is a useful website. I
like HP.', 1 USING PARAMETERS filterlinks=true) OVER(PARTITION BEST);
sentence
---------------------------hp - is a useful website.
(1 row)
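
When sentences are pulled out programmatically, a client typically calls GetSentenceCount() first and then issues one ExtractSentence() call per sentence number. The Python sketch below only builds the statement text for that loop. The helper names are hypothetical, sentence numbering is taken to start at 1 as in the examples above, and the quoting assumes standard SQL string literals in which a single quote is escaped by doubling it.

```python
from typing import List


def sql_literal(text: str) -> str:
    """Quote text as a SQL string literal, doubling embedded single quotes."""
    return "'" + text.replace("'", "''") + "'"


def extract_sentence_queries(text: str, sentence_count: int,
                             language: str = "english") -> List[str]:
    """Build one ExtractSentence() statement per sentence number, starting at 1.

    sentence_count is expected to come from a prior GetSentenceCount() call.
    """
    literal = sql_literal(text)
    lang = sql_literal(language)
    return [
        f"SELECT ExtractSentence({literal}, {n}, {lang}) OVER(PARTITION BEST);"
        for n in range(1, sentence_count + 1)
    ]
```

String concatenation is used here only to keep the sketch short. In application code, the parameter binding offered by the database driver is the safer choice.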

See Also
- GetSentenceCount()
- GetAllSentences()

GetAllDictionarySetLabels
Lists all the dictionary labels that are loaded into the current Pulse session. This function shows
you which labels are currently in use. You can load only one dictionary of each type in a single
session.

Syntax
SELECT GetAllDictionarySetLabels() over();

Examples
SELECT GetAllDictionarySetLabels() OVER();
 label
--------------
 default
 sports_teams
(2 rows)

GetAllDictionaryWords
Lists all dictionary words that are currently loaded into Pulse. This function can help you determine
which user-defined words in a sentence might be affecting the sentiment score of an attribute.

Syntax
SELECT GetAllDictionaryWords([USING PARAMETERS [language='language'] [, label='label']])
OVER();

Parameters
Argument            Description
language            The language of the dictionary:
                    - 'english' or 'en'
                    - 'spanish' or 'es'
label               The label of the dictionaries that you want to list. If you do not
                    provide a label, Pulse uses the default dictionaries.

Examples
SELECT GetAllDictionaryWords() OVER();
 dictionary | word
------------+-----------
 neg_words  | ratchet
 neg_words  | squirelly

select GetAllDictionaryWords(using parameters language='english') over();
 dictionary   | word
--------------+------------
 pos_words_en | simplicity
(1 row)

select GetAllDictionaryWords(using parameters label='music') over();
 dictionary    | word
---------------+-----------
 white_list_en | classical
 white_list_en | popular
 white_list_en | rock
(3 rows)

See Also
- GetAllMappingWords()

GetAllLoadedDictionaries
Lists all the dictionaries and dictionary labels that are loaded into the current Pulse session. This
function shows you which dictionaries are determining the sentiment score of an attribute. Only one
dictionary of each type can be loaded in a single session.

Syntax
SELECT GetAllLoadedDictionaries() over();

Examples
SELECT GetAllLoadedDictionaries() OVER();
 dictionary       | label
------------------+---------
 neg_words_en     | default
 stop_words_es    | default
 neutral_words_es | default
 white_list_en    | default
 normalization_en | default
 pos_words_es     | default
 neg_words_es     | default
 pos_words_en     | default
 white_list_es    | default
 neutral_words_en | default
 stop_words_en    | default
 normalization_es | default
(12 rows)

GetAllMappingWords
Lists all user-defined bases and synonyms that are currently loaded into Pulse. This function helps
you determine which user-defined mappings in a sentence might be affecting the sentiment score of
an attribute.

Syntax
SELECT GetAllMappingWords([USING PARAMETERS [language='language'] [, label='label']])
OVER();

Parameters
Argument            Description
language            The language of the dictionary:
                    - 'english' or 'en'
                    - 'spanish' or 'es'
label               The label of the mappings that you want to list. If you do not
                    provide a label, Pulse uses the default dictionaries.

Examples
SELECT GetAllMappingWords() OVER() limit 10;
 mapping       | key         | value
---------------+-------------+-----------------
 normalization | hp          | hewlett packard
 normalization | hp          | hewlett-packard
 normalization | companycorp | company-corp
 normalization | companycorp | companycorps
 normalization | companycorp | companycorp's
 normalization | producthd   | product hd
 normalization | producthd   | product-hd
 normalization | companycorp | company corp
(8 rows)

select getAllMappingWords(using parameters language='english') over();
 mapping          | key | value
------------------+-----+-----------------
 normalization_en | hp  | hewlett-packard
 normalization_en | hp  | hewlett Packard
(2 rows)

select getAllMappingWords(using parameters language='spanish') over();
 mapping          | key     | value
------------------+---------+----------------
 normalization_es | hidalgo | miguel hidalgo
(1 row)

See Also
- GetAllDictionaryWords()

GetAllSentences
Extracts a row for each sentence in a body of text. This ability is useful if you need to
programmatically get each sentence in a piece of text.

Syntax
GetAllSentences(text [, language] [USING PARAMETERS
[filterlinks = boolean ]
[, filterusermentions = boolean ]
[, filterhashtags = boolean ]
[, adjustcasing = boolean ]
[, language = string ]
])

Parameters
Argument            Description
text                The text from which to get the sentences.
language            The language:
                    - 'english' or 'en'
                    - 'spanish' or 'es'
filterlinks         Optional. Default false. When set to true, sentences that are only links
                    are skipped over and ignored. Any links in a sentence are not included in
                    the extracted sentence.
filterusermentions  Optional. Default false. When set to true, sentences that are only Twitter
                    user mentions (@username) are skipped over and ignored. Any user mentions
                    in a sentence are not included in the extracted sentence.
filterhashtags      Optional. Default false. When set to true, sentences that are only Twitter
                    hashtags (#hashtag) are skipped over and ignored. Any hashtags in a
                    sentence are not included in the extracted sentence.
adjustcasing        Optional. Default false. When set to true, all letters in the sentence
                    are converted to upper-case before sentence detection. After sentence
                    detection, all letters are converted to lower-case. This option is helpful
                    if the original data is all in lower-case and Pulse is incorrectly
                    identifying parts of speech in the sentence.

Notes
- The text argument is limited to 65,000 bytes.
- This function must be used with the OVER() clause. Use OVER(PARTITION BEST) for the
  best performance if the query does not require specific columns in the OVER() clause.
- language can be specified as an argument, as a parameter, or both. When both are given,
  the argument value supersedes the parameter value.

Examples
SELECT GetAllSentences('The quick brown fox jumped over the lazy dog. Every good boy deserves fudge')
OVER(PARTITION BEST);
 sentence
-----------------------------------------------
 The quick brown fox jumped over the lazy dog.
 Every good boy deserves fudge.
(2 rows)

select getAllSentences('the quick brown fox jumped over the lazy dog. All good boys deserve fudge',
'english') over();
 sentence_index | sentence_text
----------------+-----------------------------------------------
 1              | the quick brown fox jumped over the lazy dog.
 2              | All good boys deserve fudge
(2 rows)

select getAllSentences('the quick brown fox jumped over the lazy dog. All good boys deserve fudge'
using parameters language='english') over();
 sentence_index | sentence_text
----------------+-----------------------------------------------
 1              | the quick brown fox jumped over the lazy dog.
 2              | All good boys deserve fudge
(2 rows)

select getAllSentences('el zorro rapido brinco sobre el perro flojo. Todos los chicos buenos merecen un premio',
'spanish') over();
 sentence_index | sentence_text
----------------+----------------------------------------------
 1              | el zorro rapido brinco sobre el perro flojo.
 2              | Todos los chicos buenos merecen un premio
(2 rows)

select getAllSentences('el zorro rapido brinco sobre el perro flojo. Todos los chicos buenos merecen un premio'
using parameters language='spanish') over();
 sentence_index | sentence_text
----------------+----------------------------------------------
 1              | el zorro rapido brinco sobre el perro flojo.
 2              | Todos los chicos buenos merecen un premio
(2 rows)

Filtering User-mentions
SELECT GetAllSentences('@user is always late. He kept me waiting 20 minutes last time.'
USING PARAMETERS filterusermentions=true)
OVER(PARTITION BEST);
 sentence
------------------------------------------
 is always late.
 he kept me waiting 20 minutes last time.
(2 rows)

See Also
- GetSentenceCount()
- ExtractSentence()

GetLoadedDictionary
Lists the currently loaded words for the specified user-dictionary.

Syntax
SELECT GetLoadedDictionary(user-dictionary [USING PARAMETERS [language = string]
[, label='label']]) OVER();

Parameters
Argument            Description
user-dictionary     The user-dictionary list to retrieve. Valid values:
                    - pos_words
                    - neg_words
                    - neutral_words
                    - stop_words
                    - white_list
                    See Dictionaries and Mappings for details on each type.
language            The language of the dictionary:
                    - 'english' or 'en'
                    - 'spanish' or 'es'
label               The label of the dictionaries that you want to list. If you do not
                    provide a label, Pulse uses the default dictionaries.

Usage Considerations
- If the user-dictionary is not loaded, then nothing is returned.
- You must use the OVER() clause with this function.

Examples
Note: This example is from a three-node cluster, so three copies of the words are returned.
SELECT GetLoadedDictionary('pos_words') OVER();
 word
-------------------------
 :-)
 adequate
 admire
 admiringly
 adore
 adoringly
 adulation
 adventuresome
 advocated
 affable
 affably
 affordable
 affordably
 afordable
 all-around
 alluringly
 amazement
 ameliorate
 ample
 amusing
--More--

select getLoadedDictionary('pos_words' using PARAMETERS language='english') over();
 word
------------
 simplicity
(1 row)

select getLoadedDictionary('pos_words' using PARAMETERS language='spanish') over();
 word
-------------
 simplicidad
(1 row)

See Also
- LoadDictionary()
- GetLoadedMapping()

GetLoadedMapping
Lists the currently loaded words for the specified user-defined mapping.

Syntax
SELECT GetLoadedMapping('normalization' [using PARAMETERS language = string]) OVER();

Parameters
Argument            Description
mapping             The mapping list to retrieve. Currently the only mapping supported is:
                    normalization
language            The language of the dictionary:
                    - 'english' or 'en'
                    - 'spanish' or 'es'
label               The label to which you want to load the specified mapping. If you do
                    not include a label, Pulse loads the default UDDs.

Usage Considerations
- This function must be used with the OVER() clause.
- If the mapping is not loaded with LoadMapping(), then nothing is returned.

Examples
SELECT GetLoadedMapping('normalization') OVER();
 key | value
-----+-----------------
 hp  | hewlett packard
(1 row)

select getLoadedMapping('normalization' using PARAMETERS language='english') over();
 key | value
-----+-----------------
 hp  | hewlett-packard
 hp  | hewlett packard
(2 rows)

select getLoadedMapping('normalization' using PARAMETERS language='spanish') over();
 key     | value
---------+----------------
 hidalgo | miguel hidalgo
(1 row)

Note: By default, the normalization list is empty.

See Also
- LoadMapping()
- GetLoadedDictionary()

GetSentenceCount
Returns the number of sentences in a body of text. You can use this function to count the number of
sentences in a long piece of text. It is also useful if you are programmatically using the
"ExtractSentence" function and need to know the number of sentences in a piece of text.

Syntax
select GetSentenceCount(text [, language] [USING PARAMETERS
[filterlinks = boolean ]
[, filterusermentions = boolean ]
[, filterhashtags = boolean ]
[, adjustcasing = boolean ]
[, language = string ]
])

Parameters
Argument            Description
text                The text from which to extract the number of sentences. Currently
                    English and Spanish language text are supported for analysis.
language            The language:
                    - 'english' or 'en'
                    - 'spanish' or 'es'
filterlinks         Optional. Default false. When set to true, sentences that are only links
                    are not counted as a sentence.
filterusermentions  Optional. Default false. When set to true, sentences that are only Twitter
                    user mentions (@username) are not counted as a sentence.
filterhashtags      Optional. Default false. When set to true, sentences that are only Twitter
                    hashtags (#hashtag) are not counted as a sentence.
adjustcasing        Optional. Default false. When set to true, all letters in the sentence
                    are converted to upper-case before sentence detection. After sentence
                    detection, all letters are converted to lower-case. This option is helpful
                    if the original data is all in lower-case and Pulse is incorrectly
                    identifying parts of speech in the sentence.

Notes
- The text argument is limited to 65,000 bytes.
- This function must be used with the OVER() clause. Use OVER(PARTITION BEST) for the
  best performance if the query does not require specific columns in the OVER() clause.
- language can be specified as an argument, as a parameter, or both. When both are given,
  the argument value supersedes the parameter value.

Examples
SELECT GetSentenceCount('The quick brown fox jumped over the lazy dog. Every good boy deserves fudge')
OVER(PARTITION BEST);
 sentence_count
----------------
 2
(1 row)

SELECT getsentencecount('http://hp.com. @hp. http://hp.com is great!') OVER(PARTITION BEST);
 sentence_count
----------------
 3
(1 row)

select getsentencecount('el zorro rapido brinco sobre el perro flojo. Todos los chicos buenos merecen un premio'
using PARAMETERS language='spanish') over();
 sentence_count
----------------
 2
(1 row)

select getsentencecount('el zorro rapido brinco sobre el perro flojo. Todos los chicos buenos merecen un premio',
'spanish') over();
 sentence_count
----------------
 2
(1 row)

select getsentencecount('the quick brown fox jumped over the lazy dog. All good boys deserve fudge'
using parameters language='english') over();
 sentence_count
----------------
 2
(1 row)

select getsentencecount('the quick brown fox jumped over the lazy dog. All good boys deserve fudge',
'english') over();
 sentence_count
----------------
 2
(1 row)

Filtering Links and User Mentions
SELECT GetSentenceCount('http://hp.com. @hp. http://hp.com is great!' USING PARAMETERS
filterlinks=true, filterusermentions=true) OVER(PARTITION BEST);
 sentence_count
----------------
 1
(1 row)
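
The effect of the filter parameters on the count can be approximated with a short Python model. It is a sketch under stated assumptions, not the Pulse implementation: the input is a list of sentences that have already been split, the regular expressions for links, user mentions, and hashtags are guesses at the token shapes, and a sentence made up of a mix of filtered token types is treated as filtered.

```python
import re
from typing import Iterable

# Assumed token shapes; the patterns Pulse actually uses are not documented here.
LINK = re.compile(r"^https?://\S+$")
MENTION = re.compile(r"^@\w+$")
HASHTAG = re.compile(r"^#\w+$")


def count_sentences(sentences: Iterable[str],
                    filterlinks: bool = False,
                    filterusermentions: bool = False,
                    filterhashtags: bool = False) -> int:
    """Count sentences, skipping those that consist only of filtered tokens."""
    patterns = []
    if filterlinks:
        patterns.append(LINK)
    if filterusermentions:
        patterns.append(MENTION)
    if filterhashtags:
        patterns.append(HASHTAG)

    count = 0
    for sentence in sentences:
        # Drop surrounding punctuation so "http://hp.com." still reads as a link.
        tokens = [t.strip(".!?,;:") for t in sentence.split()]
        tokens = [t for t in tokens if t]
        only_filtered = bool(tokens) and bool(patterns) and all(
            any(p.match(t) for p in patterns) for t in tokens
        )
        if not only_filtered:
            count += 1
    return count
```

Applied to the three sentences of the example above, the model gives 3 with no filters and 1 with filterlinks and filterusermentions both set, which agrees with the documented results.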

See Also
- GetAllSentences()
- ExtractSentence()

GetStorage
Lists the currently loaded user-dictionaries and user-defined mapping.

Syntax
SELECT GetStorage([USING PARAMETERS label='label']) OVER();

Parameters
Argument            Description
label               The label of the dictionaries and mapping names that you want to list.
                    If you do not provide a label, Pulse uses the default dictionaries.

Usage Considerations
- This function must be used with the OVER() clause.

Examples
SELECT GetStorage() OVER();
 key
------------------
 neg_words_en
 neutral_words_en
 pos_words_en
 stop_words_en
 white_list_en
 normalization_en
 neg_words_es
 neutral_words_es
 pos_words_es
 stop_words_es
 white_list_es
 normalization_es
(12 rows)

See Also
- LoadDictionary()
- LoadMapping()
- GetLoadedDictionary()
- GetLoadedMapping()

LoadDictionary
Loads words from a Pulse user-defined dictionary into memory for use by SentimentAnalysis() and
other Pulse functions. User-defined dictionary lists are lists of words that are assigned to a
specific list type, such as pos_words or neg_words.

Syntax
SELECT LoadDictionary(word USING PARAMETERS listName='listname'[, language='lang'] [,
label='label']) OVER() FROM table

Parameters
Argument            Description
word                A column of words to assign to a user-dictionary list. The column
                    name must match the value of word.
listName            The user-dictionary list into which to load the values from word.
                    Valid values:
                    - pos_words
                    - neg_words
                    - neutral_words
                    - stop_words
                    - white_list
                    See Dictionaries and Mappings for details on each list type.
language            The language of the dictionary:
                    - 'english' or 'en'
                    - 'spanish' or 'es'
label               The label that you want to assign to the dictionary.
table               Load values from the specified table.

Usage Considerations
- This function must be used with the OVER() clause.
- Whenever you change any user-dictionary or the normalization map, all user-dictionaries
  and mappings must be loaded again (using LoadDictionary() and LoadMapping()) for the
  changes to take effect.
- Dictionaries and mappings are loaded on a per-client basis. Loaded dictionaries can vary
  from session to session.
- If you load a user-dictionary with an incorrect listName, then the result of
  LoadDictionary() is false and the user-dictionary is not loaded.
- LoadDictionary() does not append to user-dictionary lists. It overwrites them. If you load
  a user-dictionary more than once with the same list name, then only the most recent
  user-dictionary is loaded for that list name.
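
The overwrite behavior in the last item can be modeled in a few lines of Python. This is an illustrative sketch only: the class SessionDictionaries is hypothetical, and keying the store by list name, language, and label (with 'default' as the label when none is given) is an assumption about how Pulse tells loaded lists apart.

```python
from typing import Dict, Iterable, List, Tuple


class SessionDictionaries:
    """Toy model of per-session dictionary storage with overwrite-on-load."""

    def __init__(self) -> None:
        # Key: (list name, language, label). This key shape is an assumption.
        self._lists: Dict[Tuple[str, str, str], List[str]] = {}

    def load(self, words: Iterable[str], list_name: str,
             language: str = "english", label: str = "default") -> None:
        # A second load with the same key replaces the earlier words
        # instead of appending to them.
        self._lists[(list_name, language, label)] = list(words)

    def loaded(self, list_name: str, language: str = "english",
               label: str = "default") -> List[str]:
        # Nothing is returned for a list that was never loaded.
        return list(self._lists.get((list_name, language, label), []))
```

In this model, loading neg_words twice leaves only the words from the second load, so a client that wants to extend a list has to load the combined set of words in one call.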

Examples
select LoadDictionary(standard USING PARAMETERS listName='neg_words_en')
OVER() from pulse.neg_words_en;
select LoadDictionary(standard USING PARAMETERS listName='pos_words_en')
OVER() from pulse.pos_words_en;
select LoadDictionary(standard USING PARAMETERS listName='pos_words_en', language='english')
OVER() from pulse.pos_words_en;
select LoadDictionary(standard USING PARAMETERS listName='pos_words_es', language='spanish')
OVER() from pulse.pos_words_es;
select LoadDictionary(standard USING PARAMETERS listName='neg_words', label='custom_negatives')
OVER() from pulse.neg_words_en;

See Also
- LoadMapping()
- GetLoadedDictionary()
- GetStorage()

LoadMapping
Loads a Pulse user-mapping into memory for use by sentimentAnalysis() and other Pulse functions.
Maps are lists of synonyms of one or more words that map to another word. Using maps allows you
to analyze text that pertains to the same subject or concept but may use slightly different
terminology.
For example, you can map both "Hewlett Packard" and "Hewlett-Packard" (with hyphen) to "HP".
Pulse substitutes the base word for the mapped words when it runs its analysis.
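
The substitution can be pictured with the following Python sketch. It is not how Pulse is implemented: the function apply_mapping is hypothetical, and the whole-phrase, case-insensitive matching is an assumption made for the illustration. Each row pairs a base word with one word to map, in the same order as the base and wordToMap columns passed to LoadMapping().

```python
import re
from typing import Iterable, Tuple


def apply_mapping(text: str, rows: Iterable[Tuple[str, str]]) -> str:
    """Replace each mapped word or phrase in text with its base word."""
    # Substitute longer phrases first so a short synonym cannot clip a longer one.
    ordered = sorted(rows, key=lambda row: len(row[1]), reverse=True)
    for base, word_to_map in ordered:
        pattern = re.compile(
            r"(?<!\w)" + re.escape(word_to_map) + r"(?!\w)",
            re.IGNORECASE,
        )
        # A function replacement keeps any backslashes in base literal.
        text = pattern.sub(lambda _match, b=base: b, text)
    return text
```

With the rows ('hp', 'hewlett packard') and ('hp', 'hewlett-packard'), both spellings in a sentence collapse to hp before scoring, so sentiment for the two variants is attributed to a single attribute.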

Syntax
SELECT LoadMapping(base, wordToMap USING PARAMETERS mapName='mapName' [, language='lang']
[, label='label']) OVER() FROM table

Parameters
Argument            Description
base                A column of base words to assign to a mapped word. The column name
                    must match the value of base.
wordToMap           A column of words to map to the base word in the same row. The column
                    name must match the value of wordToMap.
mapName             The mapping to load the words into. Valid values:
                    - irregular_verbs: list of conjugations of verbs and their bases.
                    - normalization: list of synonyms and their base word.
language            The language of the dictionary:
                    - 'english' or 'en'
                    - 'spanish' or 'es'
label               The label of the mapping that you want to load. If you do not
                    provide a label, Pulse uses the default mapping.
table               Load values from the specified table.

Usage Considerations
- This function must be used with the OVER() clause.
- Whenever you update any user-dictionary or the normalization map, all user-dictionaries
  and mappings must be loaded again (using LoadDictionary() and LoadMapping()) for the
  changes to take effect.
- After loading, HP Vertica returns a success message from each node in the cluster.
- Dictionaries and mappings are loaded across all client sessions and remain in memory even
  if the database is stopped and started.
- If you load a mapping with an incorrect mapName, then the result of LoadMapping() is false
  and the map is not loaded.
- LoadMapping() does not append to maps. It overwrites them. If you load a map more than
  once with the same mapName, then only the most recent mapping is loaded for that mapName.

Examples
select LoadMapping(standard_base, standard_synonym USING PARAMETERS
mapName='normalization') over() from pulse.normalization_en;
select LoadMapping(standard_base, standard_synonym USING PARAMETERS
mapName='normalization', language='english') over() from pulse.normalization_en;
select LoadMapping(standard_base, standard_synonym USING PARAMETERS
mapName='normalization', language='spanish') over() from pulse.normalization_es;

See Also
- LoadDictionary()
- GetLoadedMapping()
- GetStorage()

PartsOfSpeech
Tags the words in one or more sentences with their part-of-speech classification, using Penn
Treebank part-of-speech tags.

Syntax
SELECT PartsOfSpeech('sentences' [, language] [USING PARAMETERS [language='lang']
[, adjustcasing=boolean]])
OVER(PARTITION BEST);

Parameters
Argument            Description
sentences           One or more sentences to be tagged with parts of speech markup.
language            The language:
                    - 'english' or 'en'
                    - 'spanish' or 'es'
adjustcasing        Optional. Default false. When set to true, all letters in the sentence
                    are converted to upper-case before sentence detection. After sentence
                    detection, all letters are converted to lower-case. This option is helpful
                    if the original data is all in lower-case and Pulse is incorrectly
                    identifying parts of speech in the sentence.

Notes
- This function returns a part-of-speech markup for each word. For English, the markup used
  is the Penn Treebank Project part-of-speech tags. For Spanish, the Parole Reduced Tagset
  is used.
- This function must be used with the OVER() clause. Use OVER(PARTITION BEST) for the
  best performance if the query does not require specific columns in the OVER() clause.

Examples
select partsOfSpeech('The quick brown fox jumped over the lazy dog.') OVER(PARTITION BEST);
 sentence | token  | part_of_speech
----------+--------+----------------
 1        | the    | DT
 1        | quick  | JJ
 1        | brown  | JJ
 1        | fox    | NN
 1        | jumped | VBD
 1        | over   | IN
 1        | the    | DT
 1        | lazy   | JJ
 1        | dog    | NN
 1        | .      | .
(10 rows)

select partsOfSpeech('Every good boy deserves fudge.') OVER(PARTITION BEST);
 sentence | token    | part_of_speech
----------+----------+----------------
 1        | every    | DT
 1        | good     | JJ
 1        | boy      | NN
 1        | deserves | VBZ
 1        | fudge    | NN
 1        | .        | .
(6 rows)

select partsOfSpeech('The quick brown fox jumped over the lazy dog.', 'english')
OVER(PARTITION BEST);
 sentence | token  | part_of_speech
----------+--------+----------------
 1        | the    | DT
 1        | quick  | JJ
 1        | brown  | JJ
 1        | fox    | NN
 1        | jumped | VBD
 1        | over   | IN
 1        | the    | DT
 1        | lazy   | JJ
 1        | dog    | NN
 1        | .      | .
(10 rows)

select partsofSpeech('El zorro rapido brinco sobre el perro flojo','spanish') over();
 sentence | token  | part_of_speech
----------+--------+----------------
 1        | El     | DA
 1        | zorro  | NC
 1        | rapido | AQ
 1        | brinco | AQ
 1        | sobre  | SP
 1        | el     | DA
 1        | perro  | NC
 1        | flojo  | AQ
(8 rows)

See Also
- SentimentAnalysis()

SentimentAnalysis
Provides a sentiment score for each attribute (noun) in a given body of text. Positive sentiment
receives a positive integer score and negative sentiment receives a negative integer score. A score
of 0 indicates that the sentiment for the attribute is neutral.

Syntax
SentimentAnalysis(text [,language] [USING PARAMETERS
[whitelistonly = boolean ]
[, filterlinks = boolean ]
[, filterusermentions = boolean ]
[, filterhashtags = boolean ]
[, filterpunctuation = boolean ]
[, filterretweets = boolean ]
[, relatedwords = boolean ]
[, adjustcasing = boolean ]
[, language = string ]
[, label='label']
])

Parameters
Argument            Description
text                The text to analyze.
whitelistonly       Optional. Default false. When set to true, only attributes defined in
                    the whitelist user-dictionary are scored.
filterlinks         Optional. Default false. When set to true, links are not included as
                    attributes.
filterusermentions  Optional. Default false. When set to true, Twitter user mentions
                    (@username) are not included as attributes.
filterhashtags      Optional. Default false. When set to true, Twitter hashtags (#hashtag)
                    are not included as attributes.
filterpunctuation   Optional. Default true. Filters any punctuation that occurs at the
                    beginning of an attribute other than @ and #.
filterretweets      Optional. Default false. Filters out the characters "RT" from re-tweets
                    in attributes.
relatedwords        Optional. Default false. When set to true, provides up to three words
                    from the sentence used to help determine the sentiment of the attribute.
adjustcasing        Optional. Default false. When set to true, all letters in the text are
                    converted to uppercase before sentence detection. After performing
                    sentence detection, HP Vertica converts all letters to lowercase. This
                    option can help you in cases where the original data is all in lowercase
                    letters and Pulse is incorrectly identifying sentence boundaries.
language            The language:
                    - 'english' or 'en'
                    - 'spanish' or 'es'
label               The label of the dictionaries that you want to use for sentiment
                    analysis. If you do not include a label, Pulse uses the default
                    dictionaries.

Usage Considerations
- The text argument is limited to 65,000 bytes.
- This function must be used with the OVER() clause. Use OVER(PARTITION BEST) for the best
  performance if the query does not require specific columns in the OVER() clause. Any valid
  PARTITION BY clause is acceptable. However, only a PARTITION BY clause that matches
  the segmentation clause of the table's projection provides optimum performance. You can
  improve performance by segmenting on the columns in the PARTITION BY clause.
- language can be specified as an argument, as a parameter, or both. When both are given,
  the argument value supersedes the parameter value.
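
Because SentimentAnalysis() accepts many optional parameters, client code often assembles the USING PARAMETERS list from a set of options. The Python sketch below shows one way to do that for a literal text value. The helper names are hypothetical, only parameter names from the Syntax section are meant to be passed in, and the quoting assumes standard SQL string literals in which a single quote is escaped by doubling it.

```python
from typing import Mapping, Optional, Union


def sql_literal(text: str) -> str:
    """Quote text as a SQL string literal, doubling embedded single quotes."""
    return "'" + text.replace("'", "''") + "'"


def sentiment_analysis_query(
    text: str,
    parameters: Optional[Mapping[str, Union[bool, str]]] = None,
) -> str:
    """Build a SentimentAnalysis() statement with an optional parameter list."""
    rendered = []
    for name, value in (parameters or {}).items():
        if isinstance(value, bool):
            # Boolean parameters are written as bare true/false.
            rendered.append(f"{name}={'true' if value else 'false'}")
        else:
            # String parameters such as language and label are quoted.
            rendered.append(f"{name}={sql_literal(value)}")
    using = f" USING PARAMETERS {', '.join(rendered)}" if rendered else ""
    return f"SELECT SentimentAnalysis({sql_literal(text)}{using}) OVER(PARTITION BEST);"
```

The builder does not validate parameter names, so a misspelled name is passed through unchanged. In application code, binding the text through the database driver is preferable to embedding it in the statement.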

Examples
select SentimentAnalysis('The quick brown fox jumped over the lazy dog.') OVER(PARTITION BEST);
 sentence | attribute | sentiment_score
----------+-----------+-----------------
 1        | fox       | 1
 1        | dog       | -1
(2 rows)

select SentimentAnalysis('The quick brown fox jumped over the lazy dog.'
USING PARAMETERS relatedwords=true) OVER(PARTITION BEST);
 sentence | attribute | sentiment_score | related_word_1 | related_word_2 | related_word_3
----------+-----------+-----------------+----------------+----------------+----------------
 1        | fox       | 1               | quick          | lazy           |
 1        | dog       | -1              | lazy           |                |
(2 rows)

select SentimentAnalysis('The quick brown fox jumped over the lazy dog.', 'english')
OVER(PARTITION BEST);
 sentence | attribute | sentiment_score
----------+-----------+-----------------
 1        | fox       | 1
 1        | dog       | -1
(2 rows)

select SentimentAnalysis('The quick brown fox jumped over the lazy dog.'
using PARAMETERS language='english') OVER(PARTITION BEST);
 sentence | attribute | sentiment_score
----------+-----------+-----------------
 1        | fox       | 1
 1        | dog       | -1
(2 rows)

select SentimentAnalysis('El zorro rapido brinco sobre el perro flojo.', 'spanish')
OVER(PARTITION BEST);
 sentence | attribute | sentiment_score
----------+-----------+-----------------
 1        | zorro     | 1
 1        | perro     | -1
(2 rows)

select SentimentAnalysis('El zorro rapido brinco sobre el perro flojo.'
using PARAMETERS language='spanish') OVER(PARTITION BEST);
 sentence | attribute | sentiment_score
----------+-----------+-----------------
 1        | zorro     | 1
 1        | perro     | -1
(2 rows)

Getting Twitter User-mention Sentiment
select SentimentAnalysis('@company is great!') OVER(PARTITION BEST);
 sentence | attribute | sentiment_score
----------+-----------+-----------------
 1        | @company  | 1
(1 row)

Filtering Twitter User-mention Sentiment
select SentimentAnalysis('@company is great!' USING PARAMETERS
filterusermentions=true) OVER(PARTITION BEST);
 sentence | attribute | sentiment_score
----------+-----------+-----------------
(0 rows)

See Also
- LoadDictionary()
- LoadMapping()
- ExtractSentence()
- GetSentenceCount()
- GetAllSentences()
- CommentAttributes()

SetDefaultLanguage
Sets the new default language to use for Pulse functions if no language is specified in a Pulse
function call.

Syntax
SetDefaultLanguage(language)

Parameters
Argument            Description
language            The language:
                    - 'english' or 'en'
                    - 'spanish' or 'es'

Notes
- This function must be used with the OVER() clause.
- The default language immediately after installation is English.
- The language that is set when using this function is the default language across all
  sessions and is persistent across database restarts.

Examples
=> select setDefaultLanguage('es') over();
 Success
---------
 t
(1 row)

See Also
- SentimentAnalysis()

UnloadLabeledDictionary
Unloads a specific dictionary from a Pulse session. The dictionary continues to exist and a user
can later reload the dictionary, if needed.
You cannot unload a default dictionary, but you can replace it by loading a custom user-defined
dictionary.

Syntax
SELECT unloadLabeledDictionary(USING PARAMETERS listname='listname'[, language='lang'] [,
label='label']) over();

Parameters
Argument            Description
listName            The type of the dictionary that you want to unload. listName must be
                    one of:
                    - pos_words
                    - neg_words
                    - neutral_words
                    - stop_words
                    - white_list
                    See Dictionaries and Mappings for details on each list type.
language            The language:
                    - 'english' or 'en'
                    - 'spanish' or 'es'
label               The label of the dictionary that you want to unload.

Examples
select unloadLabeledDictionary(USING PARAMETERS listname='neg_words',
label='custom_negatives') OVER();
 success
---------
 t
(1 row)

See Also
- UnloadLabeledDictionarySet()

UnloadLabeledDictionarySet
Unloads all user-defined dictionaries with a particular label from a Pulse session. The dictionaries
continue to exist, and a user can later reload the dictionaries, if needed.
You cannot unload a default dictionary, but you can replace it by loading a custom user-defined
dictionary.

Syntax
SELECT unloadLabeledDictionarySet(USING PARAMETERS label='labelName') over();

Parameters
Argument

Description

label

The label of the dictionary set that you want to unload.

Examples
select unloadLabeledDictionarySet(USING PARAMETERS label='custom_negatives') OVER();
 success
---------
 t
(1 row)

See Also
- UnloadLabeledDictionary()

UnloadLabeledMapping
Unloads a specific mapping from a Pulse session. The mapping continues to exist, and a user can
later reload it, if needed.

Syntax
SELECT unloadLabeledMapping(USING PARAMETERS mapName='normalization' [, language='lang']
[, label='label']) over();

Parameters
Argument            Description
mapName             The name of the mapping from which you are unloading the dictionary.
language            The language:
                    - 'english' or 'en'
                    - 'spanish' or 'es'
label               The label of the mapping that you want to unload.

Examples
select unloadLabeledMapping(USING PARAMETERS mapName='normalization',
label='custom_mapping') OVER();
 success
---------
 t
(1 row)

We appreciate your feedback!


If you have comments about this document, you can contact the documentation team by email. If
an email client is configured on this system, click the link above and an email window opens with
the following information in the subject line:
Feedback on Pulse (Vertica Analytic Database 7.1.x)
Just add your feedback to the email and click send.
If no email client is available, copy the information above to a new message in a web mail client,
and send your feedback to vertica-docfeedback@hp.com.
