Legal Notices
Warranty
The only warranties for HP products and services are set forth in the express warranty statements accompanying such products and services. Nothing herein should be
construed as constituting an additional warranty. HP shall not be liable for technical or editorial errors or omissions contained herein.
The information contained herein is subject to change without notice.
Copyright Notice
Copyright 2006 - 2015 Hewlett-Packard Development Company, L.P.
Trademark Notices
Adobe is a trademark of Adobe Systems Incorporated.
Microsoft and Windows are U.S. registered trademarks of Microsoft Corporation.
UNIX is a registered trademark of The Open Group.
Contents
Installation Overview
Assign Users to the pulse_users Role and Allow Access to Pulse Functions
Using Pulse
Dictionaries and Mappings
Determining Sentiment
Tuning Pulse
Multilingual Pulse
Spanish Pulse
Multilingual Examples
Pulse Cookbook
CommentAttributes
ExtractSentence
GetAllDictionarySetLabels
GetAllDictionaryWords
GetAllLoadedDictionaries
GetAllMappingWords
GetAllSentences
GetLoadedDictionary
GetLoadedMapping
GetSentenceCount
GetStorage
LoadDictionary
LoadMapping
PartsOfSpeech
SentimentAnalysis
SetDefaultLanguage
UnloadLabeledDictionary
UnloadLabeledDictionarySet
UnloadLabeledMapping
Pulse
Pulse Virtual Machine Quick Start
10. Copy the HP Vertica Pulse install package to the VM then, as root, install the Pulse Package:
rpm -Uvh /path/to/vertica-pulse.x86_64.xxx.rpm
Note: Only install HP Vertica Pulse on a single node. All Pulse functions are available on
all nodes. However, the installation SQL scripts and user-dictionary loading script are only
available on the node on which you install the Pulse package.
11. As dbadmin, run the Pulse install script on the node on which you installed the Pulse Package:
vsql -f /opt/vertica/packages/pulse/ddl/install.sql
Using Pulse
1. Run a sentiment function:
select sentimentanalysis('Cookies are sweet.') OVER(PARTITION BEST);
 sentence | attribute | sentiment_score
----------+-----------+-----------------
        1 | cookies   |               1
(1 row)
Note: By default, HP Vertica Pulse analyzes English text. However, you can also specify the
language of the text being analyzed as an argument to the sentimentanalysis() function. For
example:
select sentimentanalysis('Cookies are sweet.', 'english') OVER(PARTITION BEST);
About the HP Vertica Pulse Package
Attribute-based sentiment scoring - Pulse scores the sentiment of attributes in a sentence.
Attributes are generally nouns and are automatically discovered by Pulse. Pulse typically scores
sentiment on a scale from -1 (negative sentiment) to +1 (positive sentiment). A sentiment of 0 is
considered neutral. Scoring individual attributes in a sentence, instead of scoring the sentence as
a whole, provides a more granular analysis of the text. For example, consider the sentence "The
quick brown fox jumped over the lazy dog." It would be difficult to score the sentiment of the
sentence as a whole, but if you score the attributes fox and dog, you can say that the
sentiment on the fox is positive (the fox is quick) and the sentiment on the dog is negative
(the dog is lazy).
Tuning to your domain - Pulse provides functionality to recognize attributes that are specific
to your domain. For example, you can add the name of your product or company to a 'white_list'
so that it is discovered by Pulse.
Tuning of how sentiment is scored - Pulse includes user-dictionaries of words that are used
to help score sentiment. You can alter these user-dictionaries to fine tune the way your text is
analyzed.
Filtering of attributes you are not interested in - Pulse supports a special 'stop words'
user-dictionary to indicate attributes that should not be analyzed. Alternatively, you can choose to
score sentiment only on attributes defined in your white_list.
Synonym mappings - Pulse provides customizable mappings so that you can map synonyms
to a base word, and then normalize the analysis for the synonyms to the base word. For
example, you can map Hewlett Packard to HP.
HP Vertica Pulse requires that Java and the HP Vertica Java Support Package are installed on all
nodes in the HP Vertica cluster.
Depending on the version of Pulse, it may support only one language (English or Spanish) or
multiple languages (English and Spanish). For multilingual versions, Pulse can analyze each text
row (for example, a tweet) in the language passed as an argument with the text, the language
specified by the user as a parameter, or the default language. See Multilingual Pulse for details.
Installing or Upgrading HP Vertica Pulse
Installation Overview
1. Verify that your HP Vertica server version matches your HP Vertica Pulse version.
2. Install Java on all hosts and set the JavaBinaryForUDx Vertica configuration parameter to
your Java binary location. For example, using vsql: ALTER DATABASE mydb SET
JavaBinaryForUDx = '/usr/bin/java';
3. Install the HP Vertica Pulse package on a single node in the cluster. The process is the same for
installation or upgrade. You need only install it on a single node, but note that the SQL scripts
used to install and uninstall the Pulse functions, and the SQL script that creates the pulse schema
and the user-dictionary tables, are only available from the node on which you installed the
Pulse package. Once installed, the Pulse functions are available on all nodes, regardless of whether
the package is installed on the node to which you are connecting.
4. Modify the jvm resource pool so that Pulse performs optimally on your system hardware.
If the java command is not in the shell search path, use the path to the Java executable in the
directory where you installed the JRE. For example, if you installed the JRE in /usr/java/default
(which is where the installation package supplied by Oracle installs the Java 1.6 JRE), the Java
executable is /usr/java/default/bin/java.
You set the configuration parameter by executing the following statement as a database superuser:
ALTER DATABASE mydb SET JavaBinaryForUDx = '/usr/bin/java';
Once you have set the configuration parameter, HP Vertica will be able to find the Java executable
on each node in your cluster in order to execute Java UDFs.
Note: Since the location of the Java executable is set by a single configuration parameter for the
entire cluster, you must ensure that the path to the Java executable is the same across all of the
nodes in the cluster.
You can access Pulse functions on all nodes, regardless of whether the package is installed on the
node to which you are connecting.
Note: You must run the install script for installs or upgrades.
CREATE TRANSFORM FUNCTION
CREATE TRANSFORM FUNCTION
CREATE TRANSFORM FUNCTION
CREATE TRANSFORM FUNCTION
CREATE TRANSFORM FUNCTION
CREATE TRANSFORM FUNCTION
CREATE TRANSFORM FUNCTION
CREATE TRANSFORM FUNCTION
CREATE TRANSFORM FUNCTION
etc...
3. If this is a fresh installation, then Modify the jvm Resource Pool to match your system
hardware.
MAXMEMORYSIZE defines the amount of RAM that a JVM can use. By default,
MAXMEMORYSIZE is set to either 10% of system memory or 2GB, whichever is smaller.
PLANNEDCONCURRENCY defines how many JVMs are allowed to run across the cluster
and how many Pulse sessions you are able to run cluster-wide. By default,
PLANNEDCONCURRENCY is set to AUTO, which is the lower of either the number of cores
on the node or memory / 2GB, but it is never automatically set to less than 4.
session and PLANNEDCONCURRENCY limits the number of sessions that can run Pulse
cluster-wide. If PLANNEDCONCURRENCY is 1, then only 1 vsql session (or client connection) in
the entire cluster can run Pulse.
You can display the current resource pool settings for the jvm resource pool with the following
command:
select name, MAXMEMORYSIZE, PLANNEDCONCURRENCY from V_CATALOG.RESOURCE_POOLS
where name = 'jvm';
Leave at least 25% of the memory available for general HP Vertica overhead. Essentially,
MAXMEMORYSIZE must never exceed 75% of total system memory.
Note: If you are running a lot of queries not in the context of HP Vertica Pulse, then you should
allow for more memory to be available outside of the jvm resource pool.
To configure your system for HP Vertica Pulse:
Finally, alter the jvm resource pool. For example, if each node in the cluster has 16GB of
memory and you decide to use up to 75% of the total system memory (0.75 * 16GB = 12GB)
for HP Vertica Pulse, then you can set the resource pool as follows:
ALTER RESOURCE POOL jvm MAXMEMORYSIZE '12G' PLANNEDCONCURRENCY 3;
Note: For evaluation purposes on systems with lower memory, set MAXMEMORYSIZE to
75% and PLANNEDCONCURRENCY to 1: ALTER RESOURCE POOL jvm MAXMEMORYSIZE
'75%' PLANNEDCONCURRENCY 1; While these settings are unsupported, they do allow you to
run simple HP Vertica Pulse queries. You may experience Out Of Memory exceptions and
slow performance.
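The sizing arithmetic in this section can be sketched in a few lines of Python. This is only an illustration and is not part of Pulse: the helper name jvm_pool_statement and its parameters are hypothetical, the 75% ceiling is the limit stated above, and PLANNEDCONCURRENCY is passed in rather than derived.

```python
def jvm_pool_statement(node_memory_gb, fraction=0.75, planned_concurrency=3):
    """Build an ALTER RESOURCE POOL statement for the jvm pool (hypothetical helper)."""
    # This guide states that MAXMEMORYSIZE must never exceed 75% of system memory.
    if not 0 < fraction <= 0.75:
        raise ValueError("fraction must be greater than 0 and at most 0.75")
    # Round down to whole gigabytes, for example 0.75 * 16GB = 12GB.
    max_memory_gb = int(node_memory_gb * fraction)
    return (
        f"ALTER RESOURCE POOL jvm MAXMEMORYSIZE '{max_memory_gb}G' "
        f"PLANNEDCONCURRENCY {planned_concurrency};"
    )


statement = jvm_pool_statement(16)
print(statement)
```

For a node with 16GB of memory, the sketch produces the same statement as the example in this section. Review the generated statement before running it in vsql.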
For additional details, see:
Managing Workloads
3. As the dbadmin, grant the pulse_users role to the new user with the command:
grant pulse_users to username;
4. As the user to which you granted the pulse_users role, set the user's role to pulse_users with the
command: set role pulse_users;
Note: The user must run the set role command in each session to read or edit tables in the
pulse schema.
The uninstall script removes all Pulse functions, but does not remove the pulse schema containing
the user-dictionary and mapping tables.
To remove all Pulse dictionaries and mappings, including custom dictionaries, include the -r
parameter:
bash /opt/vertica/packages/pulse/uninstall.sh -r
The Pulse schema and associated user-dictionary and mapping tables remain in the database. To
remove the Pulse schema and its associated tables, run the following command:
DROP SCHEMA pulse CASCADE;
Using Pulse
Dictionaries and Mappings
Description
white_list_en
stop_words_en
Words that are never marked as an attribute. Add words that you
do not want scored to the stop_words user dictionary. Use this to
filter out attributes that are not of interest to your analysis. This list
is typically modified to increase the accuracy of sentiment scoring
for your domain.
Note: If a word appears in both stop_words and white_list, then the white_list word
takes precedence. The word appears in results even though it is in the stop_words
dictionary.
pos_words_en
neg_words_en
Negative words can be any type of word or phrase that has a
negative connotation. Words in this list are deemed more likely to
carry a negative polarity in general.
You can also add exact phrases, such as idioms, to this list.
Examples: abhorrent, butcher, racist, wrath, flash in the pan.
neutral_words_en
The following table describes the mapping tables within Pulse.
Mapping Table Name
Description
Example
normalization_en
base/synonym:
"hp"/ "hewlettpackard"
"hp"/ "Hewlett-Packard"
"Obama"/ "President Obama"
For Pulse versions that support Spanish, the same set of dictionaries with the suffix "_es" is
present in the Pulse schema.
To load an individual user-dictionary into memory, use the LoadDictionary() function with the
appropriate parameter and word list.
To load the normalization mapping into memory, use the LoadMapping() function with the
normalization map.
For ease of use, Pulse ships with a script to automatically load into memory all of the required user
dictionaries and the normalization mapping. You can run the script from within vsql with the
following command:
\i /opt/vertica/packages/pulse/ddl/loadUserDictionaries.sql
Note: This script only exists on the node on which you installed the Pulse RPM/DEB package.
Manually Loading Dictionaries and the Normalization Map
If you want to manually load certain user dictionaries or mappings from the pulse schema tables,
then run the following command. This example loads the pos_words dictionary. See LoadDictionary()
for valid values for the listName parameter and for multilingual version loading.
Note: The following examples use the English dictionaries. For Spanish, replace "_en" with "_es".
First, add a word to the pos_words dictionary:
=> insert into pulse.pos_words_en values('SuperDuper');
=> commit;
By default, added words are not case sensitive. "ERROR" produces the same results as "error".
You can, however, specify a case setting for a word using the $Case parameter. For example, to
identify "Apple", rather than "apple", you would add the following:
=> insert into pulse.white_list_en values('$Case(Apple)');
=> commit;
If you change the normalization map, then you can load the new normalization values with the
following command:
select LoadMapping(standard_base, standard_synonym USING PARAMETERS
mapName='normalization') over()
from pulse.normalization_en;
After loading, HP Vertica returns a success message and the number of rows (words or word pairs)
loaded.
After Mapping
The mapping operation replaces the attributes with their counterparts from the normalization list and
displays the base terms:
SELECT SentimentAnalysis('Hewlett-Packard was founded in 1939.
Hewlett Packard was started in a garage in Palo Alto California')
OVER(PARTITION BEST);
 sentence |      attribute       | sentiment_score
----------+----------------------+-----------------
        1 | hp                   |               0
        2 | hp                   |               0
        2 | garage               |               0
        2 | palo alto california |               0
(4 rows)
The CommentAttributes() function also uses the normalization map and displays the base terms
instead of the original text:
SELECT CommentAttributes('Hewlett-Packard was founded in 1939.
Hewlett Packard was started in a garage in Palo Alto California')
OVER(PARTITION BEST);
 sentence |      attribute
----------+----------------------
        1 | hp
        2 | hp
        2 | garage
        2 | palo alto california
(4 rows)
The following example shows how to create a table, add some terms to it, and then load the table as
a normalization map:
CREATE TABLE myNormalization(base VARCHAR(64), synonym VARCHAR(64));
After loading, HP Vertica returns a success message from each node in the cluster.
Determining Sentiment
You determine sentiment by using the SentimentAnalysis() function on text.
The SentimentAnalysis() function first extracts the attributes (typically nouns) from the sentence,
and then applies a sentiment score to each attribute. Scores can be one of the following:
1 - Positive Sentiment
0 - Neutral Sentiment
-1 - Negative Sentiment
This provides a more granular analysis than just determining the sentiment for the sentence as a
whole. Consider the following quote from Abraham Lincoln: "Force is all-conquering, but its
victories are short-lived." If you were to score the sentiment of the sentence as a whole by
averaging the sentiment of its parts, then the sentiment would be neutral.
=> select avg(t1.sentiment_score) as 'Average Sentiment' from (
select sentimentAnalysis('Force is all-conquering, but its victories are short-lived.')
over (PARTITION BEST)
) as t1;
 Average Sentiment
-------------------
                 0
If you score the individual attributes of the sentence, then you can obtain a much more precise
analysis of the sentiment than if you were trying to assign a single score to the entire sentence. For
example:
=> select sentimentAnalysis('Force is all-conquering, but its victories are short-lived.')
over (PARTITION BEST);
 sentence | attribute | sentiment_score
----------+-----------+-----------------
        1 | force     |               1
        1 | victories |              -1
"Force" is scored with positive sentiment because it is "all-conquering". "Victories" is scored with
negative sentiment because it is "short-lived".
Note: HP Vertica Pulse does not recognize personal pronouns (I, you, we, he, she, it, etc.) as
attributes.
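The difference between the whole-sentence average and the per-attribute scores can also be checked outside the database. The following Python sketch is an illustration only: it assumes the rows returned by SentimentAnalysis() in the example above have already been fetched into a list of tuples, and the variable names are hypothetical.

```python
from statistics import mean

# Rows copied from the example output: (sentence, attribute, sentiment_score).
rows = [(1, "force", 1), (1, "victories", -1)]

# Averaging hides the detail: the positive and negative scores cancel out.
average_sentiment = mean(score for _, _, score in rows)

# Keeping one score per attribute preserves the detail.
by_attribute = {attribute: score for _, attribute, score in rows}

print(average_sentiment)
print(by_attribute)
```

The average alone suggests a neutral sentence, while the per-attribute view shows one positive and one negative attribute.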
SentimentAnalysis() also extracts the sentiment from multiple sentences and returns the
number of the sentence in which each attribute is found:
=> SELECT SentimentAnalysis('Force is all-conquering, but its victories are short-lived.
Every good boy deserves fudge.') OVER(PARTITION BEST);
 sentence | attribute | sentiment_score
----------+-----------+-----------------
        1 | force     |               1
        1 | victories |              -1
        2 | boy       |               1
        2 | fudge     |               1
(4 rows)
"Boy" is scored with positive sentiment because he is good. Fudge is scored with positive
sentiment because it is something that is deserved.
Note: The sentence detector considers a period to mark the end of a sentence. Some abbreviations
that use a period, such as Dr. or Mr., cause the sentence detector to end the sentence at the
abbreviation.
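To see why abbreviations matter, consider a deliberately naive splitter that treats every period as the end of a sentence. This Python sketch is not the Pulse sentence detector; the function name naive_sentences and the sample text are hypothetical, and the code only models the behavior described in the note above.

```python
import re


def naive_sentences(text):
    """Split text after every period that is followed by whitespace."""
    parts = re.split(r"(?<=\.)\s+", text.strip())
    return [part for part in parts if part]


sentences = naive_sentences("Dr. Smith likes fudge. Every good boy deserves fudge.")
for number, sentence in enumerate(sentences, start=1):
    print(number, sentence)
```

The abbreviation "Dr." becomes a sentence of its own, so the two sentences in the sample text are reported as three.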
The SentimentAnalysis function also identifies attributes with neutral sentiment (a sentiment score
of zero). For example:
SELECT SentimentAnalysis('Roses are red. Violets are blue.') OVER(PARTITION BEST);
 sentence | attribute | sentiment_score
----------+-----------+-----------------
        1 | roses     |               0
        2 | violets   |               0
(2 rows)
Both roses and violets receive neutral sentiment because neither being red nor blue is considered
positive or negative in this context.
See the Pulse Cookbook for more examples of determining sentiment.
Tuning Pulse
Pulse contains built-in dictionaries that help to determine the sentiment of sentences. These
dictionaries are not directly readable. However, you can modify the Pulse dictionary tables to
improve automatic attribute discovery and provide more accurate results for sentiment scoring
based on your specific data sets. The dictionary tables are available in the Pulse schema. Any
words you add to these dictionaries take precedence over the built-in dictionaries.
white_list - a list of words on which you want to score sentiment, but are not auto-discovered by
Pulse. Typically these are product or company names, or special words in the domain of the
data you are analyzing. You can also add noun phrases to the white_list.
stop_words - a list of words on which you do not want to score sentiment, but may appear
frequently in your data set. stop_words is basically a way to filter out attributes.
normalization - a map of base words and synonyms that allows you to normalize words for easy
comparison. For example, you can normalize "Hewlett Packard" to "HP", then count the number
of times "HP" appears as an attribute in your data. Any text that contains "HP" or "Hewlett
Packard" is counted towards the total.
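The effect of the normalization map on counting can be sketched in Python. This is a simplified model, not the Pulse implementation: the synonym pairs and the attribute list below are hypothetical stand-ins for rows in the normalization table and for attributes returned by Pulse.

```python
from collections import Counter

# Hypothetical synonym -> base pairs, mirroring rows of a normalization table.
normalization = {"hewlett packard": "hp", "hewlett-packard": "hp"}


def normalize(attribute):
    """Map an attribute to its base word, or return it unchanged in lower case."""
    key = attribute.lower()
    return normalization.get(key, key)


attributes = ["HP", "Hewlett Packard", "garage", "Hewlett-Packard"]
counts = Counter(normalize(attribute) for attribute in attributes)
print(counts["hp"], counts["garage"])
```

All three spellings are counted under the single base word "hp", which is what makes the totals comparable.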
 1 | fox |  1 | quick | lazy
 1 | dog | -1 | lazy  |
(2 rows)
The output shows that "quick" and "lazy" impacted the scoring of the "fox" attribute, and that "lazy"
affected the scoring of the "dog" attribute. "Quick" (positive) is weighted more heavily than "lazy"
(negative) when scoring "fox" because the word "quick" is closer to the attribute "fox" in the
sentence, and the result is that "fox" is scored positively. "Lazy" (negative) is the only related word
used to score the sentiment for "dog". If you don't agree with the scoring, you can change
how these related words affect the score by adding them to the appropriate user-dictionary, as
described in "Improving Sentiment Scores".
Note: If a word is present in both stop_words and white_list, then the white_list word
takes precedence. The word is present in results even though it exists in stop_words.
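The precedence rule in the note above can be modeled as a small filter. The Python sketch below is an illustration under stated assumptions: the function name keep_attribute and the sample word sets are hypothetical, and real attribute discovery in Pulse involves more than these two lists.

```python
def keep_attribute(word, white_list, stop_words):
    """Return True if the word should still appear as an attribute in results."""
    if word in white_list:
        # white_list takes precedence, even when the word is also a stop word.
        return True
    return word not in stop_words


white_list = {"vertica", "fudge"}
stop_words = {"fudge", "thing"}
candidates = ["vertica", "fudge", "thing", "boy"]
kept = [word for word in candidates if keep_attribute(word, white_list, stop_words)]
print(kept)
```

In this model "fudge" survives because it is in both lists, while "thing" is filtered out because it is only a stop word.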
Consider the sentence "Fudge is good". It contains three parts: a noun (fudge), a verb (is), and an
adjective (good). When you analyze the sentence using Pulse, it identifies "fudge" as an attribute,
because it is a proper noun, and then assigns "fudge" a positive sentiment:
select sentimentAnalysis('Fudge is good') OVER(PARTITION BEST);
 sentence | attribute | sentiment_score
----------+-----------+-----------------
        1 | fudge     |               1
The number of words matched against a dictionary also has an impact on which dictionaries take
precedence. For example, phrases or word combinations in the user-dictionary lists take
precedence over single words. For example, the positive phrase "solve problem" causes a positive
score on the text "Joe solves problems", even though "problem" is a negative word. Since phrases
have precedence over single words, a positive score is applied to Joe.
SELECT SentimentAnalysis('Joe solves problems.') OVER(PARTITION BEST);
 sentence | attribute | sentiment_score
----------+-----------+-----------------
        1 | joe       |               1
(1 row)
Tuning Example
You can modify any of the user-dictionaries to improve the accuracy of sentiment scores. The two
basic dictionaries, "neg_words" and "pos_words", are typically the easiest to modify to get good
results. Words in these two dictionaries can be any part of speech (verb, adjective, etc.). If you find
a word that is causing an attribute to be scored positively or negatively, but it should be scored as
neutral, then you can add that word to the "neutral_words_en" dictionary to cause it to be scored 0.
Consider the sentence "The product delivers simplicity.":
select sentimentAnalysis('The product delivers simplicity.') over(PARTITION BEST);
 sentence | attribute  | sentiment_score
----------+------------+-----------------
        1 | product    |               0
        1 | simplicity |               0
(2 rows)
If you want "product" to be scored positively in this sentence, then you must add "deliver simplicity"
to the pos_words user-dictionary. "deliver simplicity" also matches "delivers simplicity" due to the
"fuzzy" matching feature of phrases in the dictionaries. If you add "simplicity" by itself to the
"pos_words" dictionary, then simplicity in any context is considered positive, which may not be the
result you want to achieve across your entire domain. The following example adds "deliver simplicity" to
the pos_words user-dictionary for the English language:
insert into pulse.pos_words_en values ('deliver simplicity');
commit;
-- you must reload the dictionaries for the changes to be effective
\i /opt/vertica/packages/pulse/ddl/loadUserDictionaries.sql
select sentimentAnalysis('The product delivers simplicity.') over(PARTITION BEST);
 sentence | attribute  | sentiment_score
----------+------------+-----------------
        1 | product    |               1
(1 row)
If you want "simplicity" to always be positive, add it to the "pos_words" list. This example replaces
"deliver" with "provides":
insert into pulse.pos_words_en values ('simplicity');
commit;
\i /opt/vertica/packages/pulse/ddl/loadUserDictionaries.sql
select sentimentAnalysis('The product provides simplicity.') over(PARTITION BEST);
 sentence | attribute  | sentiment_score
----------+------------+-----------------
        1 | product    |               1
        1 | simplicity |               0
(2 rows)
Note that the sentiment score for the attribute (noun) "simplicity" is not affected by having the word
"simplicity" in a Pulse user-dictionary, since it has been identified as an attribute.
Additional Tuning Examples
The following table provides additional examples for tuning Pulse:
Text | Attribute | Score | Tuning Steps
New product smashes kickstarter target in a day! | New Product | Default: -1, After tuning: 1 | ...
Get a sneak ... | Movie | Default: -1, After tuning: 1 | ...
... able to spot trends in flu outbreaks in the United States using the collection and analysis of big data. | ... | Default: -1, After tuning: 1 | ...
Five health tips ... | health tips | Default: -1, After tuning: 1 | ... words".
If you have many words or base/synonyms to add to user-dictionaries, then you can bulk load the
lists from text files. See Bulk Loading Word Lists from Text Files.
When you have finished loading the text files, run the loadUserDictionaries.sql script to update
the new terms in memory:
vsql -f /opt/vertica/packages/pulse/ddl/loadUserDictionaries.sql
Multilingual Pulse
This section describes the multilingual features of Pulse and briefly explains how to use
the SentimentAnalysis() function with the supported languages.
Pulse can analyze text in different languages. Currently English and Spanish are supported. You
can specify the language that is analyzed in three ways:
Provide the language as an argument: if a language is specified in the document record, then
it can be used for analyzing the text by passing it as an argument. This is particularly useful when a
dataset contains texts in different languages. If the language in a record is not a supported one,
then it is ignored.
Provide the language as a parameter: if no language value is specified for a document
record, Pulse uses the value of the language parameter in the query.
Note: If you provide the language parameter more than once, then the last value specified is
used.
Do not provide an argument or parameter, and use the default language: if the language is neither
specified in the record nor by the user, then Pulse defaults to English unless you have changed
the default language. To change the default language, use the SetDefaultLanguage function.
Note: If you provide both an argument and a parameter, then the argument is used as the
language. If the argument is not valid, then the parameter is used. If neither the argument nor the
parameter is valid, then the default language is used.
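The precedence described above (argument first, then parameter, then default) can be written as a short Python sketch. It only models the rule and is not Pulse code: the function name resolve_language is hypothetical, and the set of valid values is assumed from the language names used in the examples in this guide.

```python
# Language names as used in the examples in this guide (an assumption).
SUPPORTED = {"english", "spanish"}


def resolve_language(argument=None, parameter=None, default="english"):
    """Pick the language: valid argument, else valid parameter, else default."""
    for candidate in (argument, parameter):
        if candidate in SUPPORTED:
            return candidate
    return default


print(resolve_language("english", "spanish"))  # the argument is valid and wins
print(resolve_language("klingon", "spanish"))  # invalid argument, parameter is used
print(resolve_language())                      # nothing provided, default is used
```

A record with a missing or unsupported language value therefore falls back to the parameter, and only then to the default language.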
Note: Accents are removed from characters in attributes. Additionally, a "u" with a dieresis is
converted to a plain "u" and an "n" with a tilde is replaced with a plain "n".
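If you need to compare your own word lists with the attributes that Pulse returns, a similar accent-stripping step can be approximated in Python with the standard unicodedata module. This is an approximation of the behavior described in the note, not the code Pulse uses, and the function name strip_accents is hypothetical.

```python
import unicodedata


def strip_accents(text):
    """Remove combining marks such as accents, the dieresis, and the tilde."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


for word in ["rápido", "pingüino", "niño"]:
    print(word, "->", strip_accents(word))
```

Applying the same function to both sides of a comparison keeps accented and unaccented spellings aligned.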
Functions that use language as a parameter and/or as an argument:
CommentAttributes
ExtractSentence
GetAllSentences
GetSentenceCount
PartsOfSpeech
SentimentAnalysis
Other functions can use the language only as a parameter (if not provided, the function uses the
default language):
GetLoadedDictionary
GetLoadedMapping
LoadDictionary
LoadMapping
GetAllDictionaryWords
GetAllMappingWords
Spanish Pulse
The only visible difference between the English and Spanish versions is in the table names for the
user dictionaries. The modifications for dictionaries/mappings must be done in the following tables:
white_list_es
stop_words_es
pos_words_es
neg_words_es
neutral_words_es
normalization_es
Consider the text "El producto provee simplicidad" (the product provides simplicity). If the word
'simplicidad' (simplicity) should be positive, it has to be loaded into the pos_words dictionary for
Spanish as follows:
select sentimentanalysis('El producto provee simplicidad') OVER(PARTITION BEST);
 sentence |  attribute  | sentiment_score
----------+-------------+-----------------
        1 | producto    |               0
        1 | simplicidad |               0
(2 rows)
insert into pulse.pos_words_es values('simplicidad');
 OUTPUT
--------
      1
(1 row)
select LoadDictionary(standard USING PARAMETERS listName='pos_words') over() from
pulse.pos_words_es;
 Success
---------
 t
(1 row)
select sentimentanalysis('El producto provee simplicidad') OVER(PARTITION AUTO);
 sentence |  attribute  | sentiment_score
----------+-------------+-----------------
        1 | producto    |               1
        1 | simplicidad |               0
(2 rows)
Multilingual Examples
Language as an Argument
select sentimentanalysis('Cookies are sweet.', 'english') OVER(PARTITION BEST);
 sentence | attribute | sentiment_score
----------+-----------+-----------------
        1 | cookies   |               1
(1 row)
The following example shows how to analyze tweets from a table where each tweet record contains
the language of the tweet in addition to the text.
 sentence |   attribute    | sentiment_score
----------+----------------+-----------------
        1 | reviews amazon |              -1
        1 | kindle fire    |              -1
        1 | web            |              -1
        1 | chore          |              -1
        1 | cookies        |               1
        1 | iphone         |              -1
        1 | gb             |              -1
        1 | space          |              -1
        1 | galletas       |               1
        1 | iphone         |               1
        1 | celular        |               1
(11 rows)
Language as a Parameter
select sentimentanalysis('Las galletas son dulces' using PARAMETERS language='spanish')
OVER(PARTITION BEST);
 sentence | attribute | sentiment_score
----------+-----------+-----------------
        1 | galletas  |               1
(1 row)
select sentimentanalysis('Cookies are sweet' using PARAMETERS language='english') OVER
(PARTITION BEST);
 sentence | attribute | sentiment_score
----------+-----------+-----------------
        1 | cookies   |               1
(1 row)
Although it is possible to specify the language as a parameter for a specific text given in a query,
using the language argument is more appropriate. The language parameter is intended for
queries that analyze a set of texts (from a table) written in the same language. Because Pulse
does not automatically detect the language, it uses the language specified as the parameter to
analyze every text from the table. Consequently, the sentiment scores for texts in other languages
may be incorrect.
The following example shows a query that analyzes tweets from a table where the tweets do not
have a language value stored in the table, but are all in the same language.
create table myTweets (text varchar(300));
insert into myTweets values ('Las galletas son dulces');
insert into myTweets values ('el iphone es el celular mas popular');
insert into myTweets values ('el zorro rapido brinco sobre el perro flojo');
select sentimentanalysis(text using PARAMETERS language='spanish') OVER(PARTITION BEST)
from MyTweets;
 sentence |   attribute    | sentiment_score
----------+----------------+-----------------
        1 | galletas       |               1
        1 | iphone         |               1
        1 | celular        |               1
        1 | zorro          |               1
        1 | perro          |              -1
(5 rows)
The following example shows a query that analyzes tweets from a table with tweets in different
languages. The Spanish tweets do not have the language value. In a single query you can specify
both an argument and parameter. The argument has precedence over the parameter setting. In this
case the parameter is only used when a tweet doesn't provide a language value.
create table myTweets (doc_id int, text varchar(300), language varchar(15));
insert into myTweets values (1, 'Vertica is the best company', 'english');
insert into myTweets values (2, 'Cookies are sweet', 'english');
insert into myTweets values (3, 'The quick brown fox jumped over the lazy dog',
'english');
insert into myTweets values (4, 'Las galletas son dulces');
insert into myTweets values (5, 'el iphone es el celular mas popular');
select doc_id, sentimentanalysis(text,language using PARAMETERS language='spanish') OVER
(PARTITION BY id, text) from MyTweets;
 doc_id | sentence | attribute | sentiment_score
--------+----------+-----------+-----------------
      1 |        1 | vertica   |               1
      1 |        1 | company   |               1
      2 |        1 | cookies   |               1
      3 |        1 | fox       |               1
      3 |        1 | dog       |              -1
      4 |        1 | galletas  |               1
      5 |        1 | iphone    |               1
      5 |        1 | celular   |               1
(8 rows)
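The precedence rule described above (argument first, then parameter) can be modeled outside the database. The following Python sketch is only an illustration and not part of Pulse: the function name resolve_language, the fallback default of 'english', and the choice to return None for an unsupported language are all assumptions.

```python
from typing import Optional

# Language names and abbreviations listed in the function reference.
SUPPORTED = {"english": "english", "en": "english",
             "spanish": "spanish", "es": "spanish"}


def resolve_language(argument: Optional[str],
                     parameter: Optional[str],
                     default: str = "english") -> Optional[str]:
    """Pick the language for one row.

    The per-row argument wins over the query-wide parameter; the parameter
    is used only when the row has no language value. The default value and
    the None result for unsupported languages are assumptions of this sketch.
    """
    for candidate in (argument, parameter, default):
        if candidate is not None:
            return SUPPORTED.get(candidate.lower())
    return None


# Rows 1-3 carry 'english'; rows 4-5 have no language value.
rows = [(1, "english"), (4, None)]
resolved = [(doc_id, resolve_language(lang, "spanish")) for doc_id, lang in rows]
```

With the example table, rows that store 'english' keep it, and rows without a language value fall back to the 'spanish' parameter.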
Pulse Cookbook
This section contains the following recipes for using Pulse:
Batch Analyzing Data as It Is Loaded
"user.screen_name" varchar(144),
text varchar(500),
"retweeted_status.retweet_count" int,
"retweeted_status.id" int,
"retweeted_status.favorite_count" int,
"user.location" varchar(144),
"coordinates.coordinates.0" float,
"coordinates.coordinates.1" float,
lang varchar(5)
);
The columns are based on the data returned by Twitter's streaming API. The fields are defined
in the Twitter Field Guide at https://dev.twitter.com/docs/platform-objects/tweets.
Note that the columns with quoted names, such as "user.name" and "user.screen_name", are
sub-fields within a larger field. For example, the "users" field is described here:
https://dev.twitter.com/docs/platform-objects/users.
You must have at least the id, text, and "user.screen_name" columns.
2. Create a table to hold the sentiment scores (for example, tweet_sentiment). Then load
it with the scores from your existing tweets. Make sure no new tweets are loaded until this step
completes.
Replace the column names in the following example with the column names from your Twitter
table. The example uses the column names used by the Social Media Connector:
Note: If you have a large number of tweets, this command can take a long time to run.
However, it is important to score your existing data before you start scoring newly loaded
data.
3. Create a SQL script to update the tweet_sentiment table with data from newly loaded tweets.
Save it in the home folder of the HP Vertica database admin user. For example, this path could
be /home/dbadmin/tweet_update.sql.
Replace the column names with the column names from your twitter table. The following
example uses the column names used by the HP Vertica Social Media Connector:
\i /opt/vertica/packages/pulse/ddl/loadUserDictionaries.sql
drop table if exists dt_end;
create table dt_end as (select max(created_at) dt from tweets);
-- run sentiment
insert into tweet_sentiment
(select id, "user.screen_name",
SentimentAnalysis(text using parameters filterlinks=true,
filterusermentions=true)
over (partition by id, "user.screen_name", text)
from tweets where lang='en' and
tweets.created_at > (select dt from dt_start) and
tweets.created_at <= (select dt from dt_end)
order by attribute);
-- copy date end into new start date
drop table if exists dt_start;
create table dt_start as (select dt from dt_end);
-- free up jvm resource pool memory used by this script
select release_jvm_memory();
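The dt_start and dt_end tables in the script above act as a moving time window: each run scores only tweets with dt_start < created_at <= dt_end, then copies dt_end into dt_start. The following Python sketch models that bookkeeping with an in-memory list. The function name run_once and the dictionary layout are hypothetical and are not part of Pulse or HP Vertica.

```python
from datetime import datetime


def run_once(tweets, dt_start):
    """Model one scheduled run of the update script.

    Returns (tweets to score, new dt_start). dt_end is the newest
    created_at value currently loaded, as in the dt_end table.
    """
    if not tweets:
        return [], dt_start
    dt_end = max(t["created_at"] for t in tweets)
    # Same window as the SQL: created_at > dt_start and created_at <= dt_end.
    batch = [t for t in tweets if dt_start < t["created_at"] <= dt_end]
    # The end of this window becomes the start of the next one.
    return batch, dt_end


# Usage: tweet 1 was scored earlier, so the window starts at its timestamp.
tweets = [
    {"id": 1, "created_at": datetime(2015, 1, 1, 12, 0)},
    {"id": 2, "created_at": datetime(2015, 1, 1, 12, 1)},
]
batch, dt_start = run_once(tweets, datetime(2015, 1, 1, 12, 0))
```

Because the lower bound is exclusive and the upper bound is inclusive, a tweet is scored in exactly one run, provided no tweet is loaded later with an older created_at value.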
4. Create a shell script named tweet_update.sh that is run from a cron job. This shell script runs
the tweet_update.sql script and logs the results to the file tweet_update.log. Save the
tweet_update.sh script in the home folder of the HP Vertica database admin user. For example, this
path could be /home/dbadmin/tweet_update.sh.
Replace the dbadmin, password, and databasename values with the values for your system.
/opt/vertica/bin/vsql -U dbadmin -w password -d databasename -f /home/dbadmin/tweet_update.sql > tweet_update.log
5. Create a cron job to run the script every two minutes. Use the command crontab -e to create
the cron job.
*/2 * * * * /home/dbadmin/tweet_update.sh
The script runs every two minutes. Any new tweets that have been loaded in that two-minute
window are analyzed and the results are added to the tweet_sentiment table. You can join the
tweets and tweet_sentiment tables on their id columns to combine query results.
 HelpfulBen   | 1 | owl-2              | -1
 Instana      | 1 | owl-2              |  1
 CzarLatest   | 1 | owl-2              |  0
 Gemball      | 1 | owl-2              |  0
 Postta       | 1 | owl-2              |  0
 Dailydant    | 1 | owl-2              |  1
 Editone      | 2 | owl-2              | -1
 CelticMiss   | 1 | ponies             |  1
 Editone      | 2 | reports            | -1
 Championtips | 1 | rodent infestation |  0
 BuffDrama    | 1 | spiders            | -1
 Instana      | 2 | tuned              |  0
 Editone      | 1 | Pytell             | -1
 Postta       | 1 | Pytell             |  0
 CzarLatest   | 1 | Pytell corp        |  0
 DramaBugs    | 1 | Pytell corp        | -1
 Editone      | 2 | windows            | -1
 Instana      | 1 | work today         |  0
(20 rows)
2. There are some attributes listed (ponies!) that do not apply to the analysis that you are doing.
You can focus your analysis by adding whitelist entries and filtering on the whitelist. Insert
whitelist entries for the company and product name into the standard whitelist:
INSERT INTO pulse.white_list_en VALUES ('Pytell Corp');
INSERT INTO pulse.white_list_en VALUES ('owl-2');
commit;
Reload the whitelist into Pulse. Loading a user-dictionary or mapping overwrites the existing
user-dictionary or mapping:
SELECT LoadDictionary(standard USING PARAMETERS listName='white_list') OVER() FROM
pulse.white_list_en;
3. Also, note that Pulse is not identifying all variations of the company name. The results contain
separate attributes for the same name ('Pytell', 'Pytell corp'). You can normalize these values
by using a normalization mapping. Add the synonyms to the standard normalization mapping:
insert into pulse.normalization_en values('Pytell', 'Pytell Corp');
commit;
4. Reload the normalization mapping to load the new values into Pulse:
SELECT LoadMapping(standard_base, standard_synonym USING PARAMETERS
5. Run the query again to see how the normalization affects the results.
Note that 'pytell corp' has been normalized to 'pytell', and Pulse is correctly identifying the
synonyms and mapping them to the base term.
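The effect of a normalization mapping can be pictured as a lookup from each synonym to its base term. The following Python sketch is a simplified model, not Pulse's implementation. The helper name build_normalizer is hypothetical, and the case-insensitive matching is an assumption based on the lowercase values in the note above.

```python
def build_normalizer(pairs):
    """Build a normalize() function from (base, synonym) pairs.

    The pair order follows the pulse.normalization_en insert shown
    earlier: base term first, synonym second.
    """
    lookup = {synonym.lower(): base.lower() for base, synonym in pairs}

    def normalize(attribute):
        key = attribute.lower()
        # Attributes without a mapping are returned unchanged (lowercased).
        return lookup.get(key, key)

    return normalize


normalize = build_normalizer([("Pytell", "Pytell Corp")])
attributes = [normalize(a) for a in ["Pytell", "Pytell corp", "owl-2"]]
```

After normalization both spellings count toward the same attribute, which is why the rerun query reports a single base term.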
The columns are based on the data returned by Twitter's streaming API. The fields are defined in
the Twitter Field Guide at https://dev.twitter.com/docs/platform-objects/tweets.
Note that the columns with quoted names, such as "user.name" and "user.screen_name", are
sub-fields within a larger field. For example, the "users" field is described here:
https://dev.twitter.com/docs/platform-objects/users.
The example queries provided work with any Twitter data that follows the above table structure.
 encryption  | 2356
 usb         | 2121
 aes-256     | 1859
 smartphones | 1843
 rt @hp      | 1788
 world       | 1609
 ceo         | 1520
(10 rows)
If the dataset contains tweets in both English and Spanish, then (using the multilingual version
of Pulse) each tweet can be analyzed according to its language by passing the language as an
argument to the CommentAttributes() function. If the language of a specific tweet is not supported,
then that tweet is ignored by the function. For example:
SELECT t.attribute, count(*) FROM(SELECT CommentAttributes(text,lang)
OVER(PARTITION BEST) FROM tweets) as t
GROUP BY t.attribute ORDER BY count(*) DESC LIMIT 10;
Notice that the top attribute is "http". This is due to the large number of links in tweets. You can
ignore links by using the filterlinks parameter of CommentAttributes():
SELECT t.attribute, count(*) FROM
(SELECT CommentAttributes(text USING PARAMETERS filterlinks=true)
OVER(PARTITION BEST) FROM tweets) as t
GROUP BY t.attribute ORDER BY count(*) DESC LIMIT 10;
  attribute  | count
-------------+-------
 d11         |  4757
 rt          |  2397
 encryption  |  2356
 usb         |  2121
 aes-256     |  1871
 smartphones |  1829
 rt @hp      |  1788
 world       |  1611
 ceo         |  1542
 interview   |  1346
(10 rows)
The attribute "http" is now gone from the list, but we still have "rt" (for retweet) on the list and it is
not helpful in this context. You can omit terms such as "rt" by adding them to the stop_words list
and reloading the stop_words user-dictionary:
INSERT INTO pulse.stop_words_en VALUES('rt');
commit;
SELECT LoadDictionary(standard USING PARAMETERS
listName='stop_words') OVER() FROM pulse.stop_words_en;
When you rerun the query you get more accurate results for the popular topics in the data set:
You can further refine the list to topics that contain specific attributes by adding the attributes in
which you are interested to the white_list, and then filtering with the whitelistonly parameter:
SELECT t.attribute, count(*) FROM
(SELECT CommentAttributes(text USING PARAMETERS filterlinks=true,
whitelistonly=true) OVER(PARTITION BEST) FROM tweets) as t
GROUP BY t.attribute
ORDER BY count(*) DESC LIMIT 10;
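The combination of whitelist filtering and the GROUP BY ... ORDER BY count(*) DESC LIMIT 10 query can be modeled in a few lines. The following Python sketch only illustrates the counting logic. The function name top_attributes is hypothetical, and the case-insensitive comparison is an assumption of the sketch.

```python
from collections import Counter


def top_attributes(attributes, whitelist=None, limit=10):
    """Count attributes and return the most frequent ones.

    When a whitelist is given, only whitelisted attributes are counted,
    which mirrors the intent of whitelistonly=true.
    """
    normalized = [a.lower() for a in attributes]
    if whitelist is not None:
        allowed = {w.lower() for w in whitelist}
        normalized = [a for a in normalized if a in allowed]
    return Counter(normalized).most_common(limit)


sample = ["usb", "rt", "usb", "encryption", "http"]
result = top_attributes(sample, whitelist=["usb", "encryption"])
```

Terms such as "rt" and "http" drop out because they are not on the whitelist, so no separate stop-word step is needed for this particular query.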
The result shows the 5 attributes with the highest average sentiment among attributes that have
500 or more occurrences:
 attribute  | cnt  |        avg
------------+------+-------------------
 football   |  817 | 0.290085679314565
 game       |  638 | 0.134796238244514
 baseball   | 1558 | 0.128369704749679
 basketball |  776 | 0.114690721649485
 hockey     | 2610 | 0.113409961685824
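The query that produced this result is not reproduced here. As a rough illustration of the computation the result describes (count and average sentiment per attribute, keeping only attributes with enough occurrences), the following Python sketch can help. The function name top_average_sentiment and its input format are hypothetical.

```python
from collections import defaultdict


def top_average_sentiment(rows, min_count=500, limit=5):
    """Rank attributes by average sentiment score.

    rows is an iterable of (attribute, sentiment_score) pairs. Attributes
    with fewer than min_count occurrences are dropped before ranking.
    """
    totals = defaultdict(lambda: [0, 0])  # attribute -> [count, score sum]
    for attribute, score in rows:
        totals[attribute][0] += 1
        totals[attribute][1] += score
    ranked = [(attribute, count, total / count)
              for attribute, (count, total) in totals.items()
              if count >= min_count]
    ranked.sort(key=lambda item: item[2], reverse=True)
    return ranked[:limit]


rows = [("football", 1), ("football", 0), ("game", -1), ("game", 1), ("rare", 1)]
ranking = top_average_sentiment(rows, min_count=2)
```

The minimum-count filter matters: without it, an attribute that appears once with a score of 1 would outrank every frequently mentioned attribute.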
Create an authors table to hold the names of the authors whose sentiment you want to analyze:
CREATE TABLE authors (name VARCHAR, screenname VARCHAR);
Add the word 'hyperdrive' to your existing white_list and reload the white_list user-dictionary:
INSERT INTO pulse.white_list_en VALUES('hyperdrive');
SELECT LoadDictionary(standard USING PARAMETERS
listName='white_list') OVER() FROM pulse.white_list_en;
Then, you can run a query that filters on authors and the white_list and provides you with a
sentiment score and the content of the analyzed text:
SELECT t1.id, t1.author, t1.attribute, t1.sentiment_score, t2.text from (SELECT id,
author, SentimentAnalysis(text USING PARAMETERS
whitelistonly=true) OVER (PARTITION BY id, author) FROM tweets_sample
WHERE author IN (SELECT screenname FROM authors)) AS t1 JOIN (SELECT id,
text FROM tweets_sample) AS t2 ON t1.id = t2.id ;
We get the following results from a data set of 25,000 PC Manufacturer tweets:
     attribute     | count |        avg
-------------------+-------+--------------------
 windows phone     |    81 | 0.0238095238095238
 power data center |    77 |   0.58974358974359
 wind project      |    77 |                  0
 investment        |    73 |                  0
 windows           |    57 |  0.175438596491228
The query allows you to gain additional insight into the scope of an attribute and may aid in
determining why a certain attribute is scored a certain way.
4. Verify that your tweet_sentiment table contains only your whitelist attributes. The following
query should only return the brands/products that you have whitelisted. For example:
=> select distinct(attribute) from tweet_sentiment;
 attribute
-----------
 ProductA
 ProductB
 ProductC
(3 rows)
5. You can get a basic idea of which product or brand is being talked about the most by seeing
how many instances of each attribute appear in your data set:
=> select attribute, count(*) from tweet_sentiment group by (attribute) order by
count(*) desc;
 attribute | count
-----------+-------
 ProductA  |   701
 ProductB  |   192
 ProductC  |    52
(3 rows)
You can see that ProductA is the most talked about of the three products being analyzed over the
time frame in which the tweets were collected.
6. Determine the average sentiment scores of the tweets you have collected:
=> select attribute, avg(sentiment_score) as score from tweet_sentiment group by
(attribute) order by score DESC;
 attribute |        score
-----------+---------------------
 ProductC  |   0.192307692307692
 ProductB  | -0.0729166666666667
 ProductA  |  -0.122681883024251
(3 rows)
From this basic analysis, you can see that ProductC has the most positive sentiment from the
three brands being analyzed over the time period when the tweets were collected, and
ProductA has the lowest sentiment.
7. You can also determine which words or phrases are associated with each attribute in their
positive and negative contexts. For example, to see the list of words that are most associated
with positive sentiment for ProductC, you can look at the related words fields and add up the
occurrences of words associated with positive sentiment:
=> select count(*), related_word_1 from tweet_sentiment where attribute = 'ProductC'
and sentiment_score > 0 group by related_word_1 order by count DESC;
 count | related_word_1
-------+----------------
    11 | delicious
     2 | love
     1 | best
     1 | bless
     1 | good
     1 | work
(6 rows)
8. Finally, Pulse makes it easy to see other attributes associated with your target attributes to
help you better understand the context in which people are discussing the brands or products
that you are analyzing.
a. Create another sentiment table from your data, but this time omit the whitelistonly and
relatedwords parameters:
create table tweet2_sentiment as
(select id, "user.screen_name",
SentimentAnalysis(text using parameters filterlinks=true,
filterusermentions=true, filterretweets=true)
over (partition by id, "user.screen_name", text)
from tweets where lang='en' order by attribute );
b. Next, query the tweets that contain your target attribute and find all the other attributes
associated with those tweets. Display a count of the top 5 attributes (not including the
target attribute):
=> select count(attribute), attribute from tweet2_sentiment where id in (select
id from tweet_sentiment where attribute = 'ProductC') and attribute <> 'ProductC'
group by (attribute) order by count(attribute) DESC limit 5;
 count | attribute
-------+-----------
    13 | bbq
    11 | state
    11 | sandwich
    11 | steak
     3 | ProductB
(5 rows)
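The co-occurrence query in this step can be read as two passes: first collect the ids of tweets that mention the target attribute, then count every other attribute found in those tweets. The following Python sketch models that logic on a single in-memory list. The function name co_occurring is hypothetical, and unlike the SQL it uses one data set for both passes instead of the tweet_sentiment and tweet2_sentiment tables.

```python
from collections import Counter


def co_occurring(rows, target, limit=5):
    """Return the attributes that most often share a tweet with target.

    rows is an iterable of (tweet_id, attribute) pairs.
    """
    rows = list(rows)
    # Pass 1: ids of tweets that mention the target attribute.
    target_ids = {tweet_id for tweet_id, attribute in rows if attribute == target}
    # Pass 2: count the other attributes in those tweets.
    counts = Counter(attribute for tweet_id, attribute in rows
                     if tweet_id in target_ids and attribute != target)
    return counts.most_common(limit)


rows = [(1, "ProductC"), (1, "bbq"), (2, "ProductC"),
        (2, "bbq"), (2, "steak"), (3, "bbq")]
related = co_occurring(rows, "ProductC")
```

Tweet 3 mentions "bbq" but not the target, so it does not contribute to the count.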
As you can see, a few basic queries can tell you the general sentiment differences between
multiple brands or products. You can also determine which words are contributing to the sentiment
of each product/brand that you are analyzing and which other attributes people are talking about
when they mention the brand or product(s) that you are analyzing.
You could further refine these queries by breaking out different geographic locations or times of day
by joining the IDs of the tweet_sentiment table back to the main tweets table and filtering by
location or time.
Pulse Function Reference
CommentAttributes
ExtractSentence
GetAllDictionarySetLabels
GetAllDictionaryWords
GetAllLoadedDictionaries
GetAllMappingWords
GetAllSentences
GetLoadedDictionary
GetLoadedMapping
GetSentenceCount
GetStorage
LoadDictionary
LoadMapping
PartsOfSpeech
SentimentAnalysis
SetDefaultLanguage
UnloadLabeledDictionary
UnloadLabeledDictionarySet
UnloadLabeledMapping
CommentAttributes
Retrieves the attributes (nouns) from a given piece of text.
Syntax
CommentAttributes(text [, language] [USING PARAMETERS
[whitelistonly = boolean ]
[, filterlinks = boolean ]
[, filterusermentions = boolean ]
[, filterhashtags = boolean ]
[, filterpunctuation = boolean ]
[, filterretweets = boolean ]
[, adjustcasing = boolean ]
[, language = string ]
])
Parameters
Argument             Description
text
language             The language: 'english' or 'en', 'spanish' or 'es'
whitelistonly
filterlinks          Optional. Default false. When set to true, links are not set as attributes.
filterusermentions
filterhashtags
filterpunctuation    Link URLs
filterretweets
adjustcasing
Notes
- This function must be used with the over() clause. Use with OVER(PARTITION BEST) for the
best performance if the query does not require specific columns in the over() clause.
- language can be specified as an argument and/or as a parameter; the argument value
supersedes the parameter value.
Examples
select CommentAttributes('The quick brown fox jumped over the lazy dog. All good boys deserve fudge.')
OVER(PARTITION BEST);
 sentence | attribute
----------+-----------
        1 | fox
        1 | dog
        2 | boys
        2 | fudge
(4 rows)
select commentattributes('the quick brown fox jumped over the lazy dog. All good boys deserve fudge',
'english') over();
 sentence | attribute
----------+-----------
        1 | fox
        1 | dog
        2 | boys
        2 | fudge
(4 rows)
select commentattributes('the quick brown fox jumped over the lazy dog. All good boys deserve fudge'
using parameters language='english') over();
 sentence | attribute
----------+-----------
        1 | fox
        1 | dog
        2 | boys
        2 | fudge
(4 rows)
select commentattributes('el zorro rapido brinco sobre el perro flojo. Todos los chicos buenos merecen un premio',
'spanish') over();
 sentence | attribute
----------+-----------
        1 | zorro
        1 | perro
        2 | chicos
        2 | premio
(4 rows)
select commentattributes('el zorro rapido brinco sobre el perro flojo. Todos los chicos buenos merecen un premio'
using PARAMETERS language='spanish') over();
 sentence | attribute
----------+-----------
        1 | zorro
        1 | perro
        2 | chicos
        2 | premio
(4 rows)
Filtering User-mentions
SELECT CommentAttributes('@user is always late. He kept me waiting 20 minutes last weekend.'
USING PARAMETERS filterusermentions=true) OVER(PARTITION BEST);
 sentence | attribute
----------+-----------
        2 | weekend
(1 row)
See Also
- SentimentAnalysis()
ExtractSentence
Returns the specified sentence from a body of text.
Syntax
ExtractSentence(text, sentence [, language] [USING PARAMETERS
[filterlinks = boolean ]
[, filterusermentions = boolean ]
[, filterhashtags = boolean ]
[, adjustcasing = boolean ]
[, language = string ]
])
Parameters
Argument             Description
text                 The text containing the sentence to extract.
language             The language: 'english' or 'en', 'spanish' or 'es'
sentence
filterlinks          Optional. Default false. When set to true, sentences that are only links
                     are skipped over and ignored. Any links in a sentence are not included in
                     the extracted sentence.
filterusermentions   Optional. Default false. When set to true, sentences that are only Twitter
                     user mentions (@username) are skipped over and ignored. Any user mentions
                     in a sentence are not included in the extracted sentence.
filterhashtags       Optional. Default false. When set to true, sentences that are only Twitter
                     hashtags (#hashtag) are skipped over and ignored. Any hashtags in a
                     sentence are not included in the extracted sentence.
adjustcasing         Optional. Defaults to false. When set to true, all letters in the sentence
                     are converted to upper-case before sentence detection. After sentence
                     detection all letters are converted to lower-case. This option is helpful if
                     the original data is all in lower-case and Pulse is incorrectly identifying
                     parts of speech in the sentence.
Notes
- This function must be used with the over() clause. Use with OVER(PARTITION BEST) for the
best performance if the query does not require specific columns in the over() clause.
- language can be specified as an argument and/or as a parameter; the argument value
supersedes the parameter value.
Examples
select ExtractSentence('The quick brown fox jumped. Every good boy deserves fudge', 2)
OVER(PARTITION BEST);
            sentence
--------------------------------
 Every good boy deserves fudge.
(1 row)
select extractSentence('the quick brown fox jumped over the lazy dog. All good boys deserve fudge',
2, 'english') over();
          sentence
-----------------------------
 All good boys deserve fudge
(1 row)
select extractSentence('the quick brown fox jumped over the lazy dog. All good boys deserve fudge',
2 using parameters language='english') over();
          sentence
-----------------------------
 All good boys deserve fudge
(1 row)
select extractSentence('el zorro rapido brinco sobre el perro flojo. Todos los chicos buenos merecen un premio',
2, 'spanish') over();
                 sentence
-------------------------------------------
 Todos los chicos buenos merecen un premio
(1 row)
select extractSentence('el zorro rapido brinco sobre el perro flojo. Todos los chicos buenos merecen un premio',
2 using parameters language='spanish') over();
                 sentence
-------------------------------------------
 Todos los chicos buenos merecen un premio
(1 row)
Filtering Links
SELECT ExtractSentence('HP - http://hp.com is a useful website. I like HP.', 1
USING PARAMETERS filterlinks=true) OVER(PARTITION BEST);
          sentence
----------------------------
 hp - is a useful website.
(1 row)
See Also
- GetSentenceCount()
- GetAllSentences()
GetAllDictionarySetLabels
Lists all the dictionary labels that are loaded into the current Pulse session. This function shows
you which labels are currently in use. You can load only one dictionary of each type in a single
session.
Syntax
SELECT GetAllDictionarySetLabels() over();
Examples
SELECT GetAllDictionarySetLabels() OVER();
    label
--------------
 default
 sports_teams
(2 rows)
GetAllDictionaryWords
Lists all dictionary words that are currently loaded into Pulse. This function can help you determine
which user-defined words in a sentence might be affecting the sentiment score of an attribute.
Syntax
SELECT GetAllDictionaryWords([USING PARAMETERS [language='language'] [, label='label']]) OVER();
Parameters
Argument             Description
language             'english' or 'en', 'spanish' or 'es'
label
Examples
SELECT GetAllDictionaryWords() OVER();
 dictionary |   word
------------+-----------
 neg_words  | ratchet
 neg_words  | squirelly
select GetAllDictionaryWords(using parameters language='english') over();
  dictionary  |    word
--------------+------------
 pos_words_en | simplicity
(1 row)
select GetAllDictionaryWords(using parameters label='music') over();
  dictionary   |   word
---------------+-----------
 white_list_en | classical
 white_list_en | popular
 white_list_en | rock
(3 rows)
See Also
- GetAllMappingWords()
GetAllLoadedDictionaries
Lists all the dictionaries and dictionary labels that are loaded into the current Pulse session. This
function shows you which dictionaries are determining the sentiment score of an attribute. Only one
dictionary of each type can be loaded in a single session.
Syntax
SELECT GetAllLoadedDictionaries() over();
Examples
SELECT GetAllLoadedDictionaries() OVER();
    dictionary    |  label
------------------+---------
 neg_words_en     | default
 stop_words_es    | default
 neutral_words_es | default
 white_list_en    | default
 normalization_en | default
 pos_words_es     | default
 neg_words_es     | default
 pos_words_en     | default
 white_list_es    | default
 neutral_words_en | default
 stop_words_en    | default
 normalization_es | default
(12 rows)
GetAllMappingWords
Lists all user-defined bases and synonyms that are currently loaded into Pulse. This function helps
you determine which user-defined mappings in a sentence might be affecting the sentiment score of
an attribute.
Syntax
SELECT GetAllMappingWords([USING PARAMETERS language='language'][, label='label']) OVER();
Parameters
Argument             Description
language             'english' or 'en', 'spanish' or 'es'
label                The label of the mappings that you want to list. If you do not
                     provide a label, Pulse uses the default dictionaries.
Examples
SELECT GetAllMappingWords() OVER() limit 10;
    mapping    |     key     |      value
---------------+-------------+-----------------
 normalization | hp          | hewlett packard
 normalization | hp          | hewlett-packard
 normalization | companycorp | company-corp
 normalization | companycorp | companycorps
 normalization | companycorp | companycorp's
 normalization | producthd   | product hd
 normalization | producthd   | product-hd
 normalization | companycorp | company corp
(8 rows)
select getAllMappingWords(using parameters language='english') over();
     mapping      | key |      value
------------------+-----+-----------------
 normalization_en | hp  | hewlett-packard
 normalization_en | hp  | hewlett Packard
(2 rows)
select getAllMappingWords(using parameters language='spanish') over();
     mapping      |   key   |     value
------------------+---------+----------------
 normalization_es | hidalgo | miguel hidalgo
(1 row)
See Also
- GetAllDictionaryWords()
GetAllSentences
Extracts a row for each sentence in a body of text. This ability is useful if you need to
programmatically get each sentence in a piece of text.
Syntax
GetAllSentences(text [, language] [USING PARAMETERS
[filterlinks = boolean ]
[, filterusermentions = boolean ]
[, filterhashtags = boolean ]
[, adjustcasing = boolean ]
[, language = string ]
])
Parameters
Argument             Description
text
language             The language: 'english' or 'en', 'spanish' or 'es'
filterlinks
filterusermentions
filterhashtags
adjustcasing
Notes
- This function must be used with the over() clause. Use with OVER(PARTITION BEST) for the
best performance if the query does not require specific columns in the over() clause.
- language can be specified as an argument and/or as a parameter; the argument value
supersedes the parameter value.
Examples
SELECT GetAllSentences('The quick brown fox jumped over the lazy dog. Every good boy deserves fudge')
OVER(PARTITION BEST);
                   sentence
-----------------------------------------------
 The quick brown fox jumped over the lazy dog.
 Every good boy deserves fudge.
(2 rows)
select getAllSentences('the quick brown fox jumped over the lazy dog. All good boys deserve fudge',
'english') over();
 sentence_index |                 sentence_text
----------------+-----------------------------------------------
              1 | the quick brown fox jumped over the lazy dog.
              2 | All good boys deserve fudge
(2 rows)
select getAllSentences('the quick brown fox jumped over the lazy dog. All good boys deserve fudge'
using parameters language='english') over();
 sentence_index |                 sentence_text
----------------+-----------------------------------------------
              1 | the quick brown fox jumped over the lazy dog.
              2 | All good boys deserve fudge
(2 rows)
select getAllSentences('el zorro rapido brinco sobre el perro flojo. Todos los chicos buenos merecen un premio',
'spanish') over();
 sentence_index |                sentence_text
----------------+----------------------------------------------
              1 | el zorro rapido brinco sobre el perro flojo.
              2 | Todos los chicos buenos merecen un premio
(2 rows)
select getAllSentences('el zorro rapido brinco sobre el perro flojo. Todos los chicos buenos merecen un premio'
using parameters language='spanish') over();
 sentence_index |                sentence_text
----------------+----------------------------------------------
              1 | el zorro rapido brinco sobre el perro flojo.
              2 | Todos los chicos buenos merecen un premio
(2 rows)
Filtering User-mentions
SELECT GetAllSentences('@user is always late. He kept me waiting 20 minutes last time.'
USING PARAMETERS filterusermentions=true)
OVER(PARTITION BEST);
                 sentence
------------------------------------------
 is always late.
 he kept me waiting 20 minutes last time.
(2 rows)
See Also
- GetSentenceCount()
- ExtractSentence()
GetLoadedDictionary
Lists the currently loaded words for the specified user-dictionary.
Syntax
SELECT GetLoadedDictionary(user-dictionary
label='label']) OVER();
Parameters
Argument
Description
user-dictionary
pos_words
neg_words
neutral_words
stop_words
white_list
label
'english' or 'en'
'spanish' or 'es'
The label of the dictionaries that you want to list. If you do not provide
a label, Pulse uses the default dictionaries.
Usage Considerations
Examples
Note: This example is from a three node cluster, so three copies of the words are returned.
SELECT GetLoadedDictionary('pos_words') OVER();
 word
------------------------
 :-)
adequate
admire
admiringly
adore
adoringly
adulation
adventuresome
advocated
affable
affably
affordable
affordably
afordable
all-around
alluringly
amazement
ameliorate
ample
amusing
--More--
See Also
- LoadDictionary()
- GetLoadedMapping()
GetLoadedMapping
Lists the currently loaded words for the specified user-defined mapping.
Syntax
SELECT GetLoadedMapping('normalization' [using PARAMETERS language = string]) OVER();
Parameters
Argument             Description
mapping              The mapping list to retrieve. Currently the only mapping supported is:
                     normalization
language             'english' or 'en', 'spanish' or 'es'
label                The label to which you want to load the specified mapping. If you do
                     not include a label, Pulse loads the default UDDs.
Usage Considerations
Examples
SELECT GetLoadedMapping('normalization') OVER();
 key |      value
-----+-----------------
 hp  | hewlett packard
(1 row)
-----+-----------------
 hp  | hewlett-packard
 hp  | hewlett packard
(2 rows)
select getLoadedMapping('normalization' using PARAMETERS language='spanish') over();
   key   |     value
---------+----------------
 hidalgo | miguel hidalgo
(1 row)
See Also
- LoadMapping()
- GetLoadedDictionary()
GetSentenceCount
Returns the number of sentences in a body of text. You can use this function to count the number of
sentences in a long piece of text. It is also useful if you are programmatically using the
"ExtractSentence" function and need to know the number of sentences in a piece of text.
Syntax
select GetSentenceCount(text [, language] [USING PARAMETERS
[filterlinks = boolean ]
[, filterusermentions = boolean ]
[, filterhashtags = boolean ]
[, adjustcasing = boolean ]
[, language = string ]
])
Parameters
Argument             Description
text                 The text from which to extract the number of sentences. Currently
                     English and Spanish language text are supported for analysis.
language             The language: 'english' or 'en', 'spanish' or 'es'
filterlinks          Optional. Default false. When set to true, sentences that are only links
                     are not counted as a sentence.
filterusermentions   Optional. Default false. When set to true, sentences that are only Twitter
                     user mentions (@username) are not counted as a sentence.
filterhashtags       Optional. Default false. When set to true, sentences that are only Twitter
                     hashtags (#hashtag) are not counted as a sentence.
adjustcasing         Optional. Defaults to false. When set to true, all letters in the sentence
                     are converted to upper-case before sentence detection. After sentence
                     detection all letters are converted to lower-case. This option is helpful if
                     the original data is all in lower-case and Pulse is incorrectly identifying
                     parts of speech in the sentence.
Notes
- This function must be used with the over() clause. Use with OVER(PARTITION BEST) for the
best performance if the query does not require specific columns in the over() clause.
- language can be specified as an argument and/or as a parameter; the argument value
supersedes the parameter value.
Examples
SELECT GetSentenceCount('The quick brown fox jumped over the lazy dog. Every good boy deserves fudge')
OVER(PARTITION BEST);
 sentence_count
----------------
              2
(1 row)
SELECT getsentencecount('http://hp.com. @hp. http://hp.com is great!') OVER(PARTITION BEST);
 sentence_count
----------------
              3
(1 row)
select getsentencecount('el zorro rapido brinco sobre el perro flojo. Todos los chicos buenos merecen un premio'
using PARAMETERS language='spanish') over();
 sentence_count
----------------
              2
(1 row)
select getsentencecount('el zorro rapido brinco sobre el perro flojo. Todos los chicos buenos merecen un premio',
'spanish') over();
 sentence_count
----------------
              2
(1 row)
select getsentencecount('the quick brown fox jumped over the lazy dog. All good boys deserve fudge'
using parameters language='english') over();
 sentence_count
----------------
              2
(1 row)
select getsentencecount('the quick brown fox jumped over the lazy dog. All good boys deserve fudge',
'english') over();
 sentence_count
----------------
              2
(1 row)
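The description of this function mentions using the sentence count to drive programmatic calls to ExtractSentence. The following Python sketch models that loop with stand-in functions. The three helpers are hypothetical, and the regular-expression splitter is a naive placeholder, not the sentence detection that Pulse uses. The sketch only shows the 1-based indexing pattern.

```python
import re


def get_all_sentences(text):
    """Naive stand-in: split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def get_sentence_count(text):
    """Number of sentences found by the stand-in splitter."""
    return len(get_all_sentences(text))


def extract_sentence(text, index):
    """Return sentence number index (1-based), or None if out of range."""
    sentences = get_all_sentences(text)
    if 1 <= index <= len(sentences):
        return sentences[index - 1]
    return None


text = "The quick brown fox jumped over the lazy dog. Every good boy deserves fudge"
# Use the count as the loop bound, then fetch each sentence by its index.
collected = [extract_sentence(text, i)
             for i in range(1, get_sentence_count(text) + 1)]
```

Sentence numbers start at 1, which matches the examples in which the second sentence is requested with the value 2.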
See Also
- GetAllSentences()
- ExtractSentence()
GetStorage
Lists the currently loaded user-dictionaries and user-defined mapping.
Syntax
SELECT GetStorage([using PARAMETERS label='label']) OVER();
Parameters
Argument    Description
label       The label of the dictionaries and mappings that you want to list.
            If you do not provide a label, Pulse uses the default dictionaries.
Usage Considerations
Examples
SELECT GetStorage() OVER();
       key
------------------
 neg_words_en
 neutral_words_en
 pos_words_en
 stop_words_en
 white_list_en
 normalization_en
 neg_words_es
 neutral_words_es
 pos_words_es
 stop_words_es
 white_list_es
 normalization_es
(12 rows)
See Also
• LoadDictionary()
• LoadMapping()
• GetLoadedDictionary()
• GetLoadedMapping()
LoadDictionary
Loads words from a Pulse user-defined dictionary into memory for use by SentimentAnalysis() and
other Pulse functions. A user-defined dictionary is a set of words that is assigned to a specific
list, such as pos_words or neg_words.
Syntax
SELECT LoadDictionary(word USING PARAMETERS listName='listname'[, language='lang'] [,
label='label']) OVER() FROM table
Parameters
Argument    Description
word
listName    The user-dictionary list into which to load the values from word.
            Valid values:
            • pos_words
            • neg_words
            • neutral_words
            • stop_words
            • white_list
language    The language:
            • 'english' or 'en'
            • 'spanish' or 'es'
label
table
Usage Considerations
• Whenever you change any user-dictionary or the normalization map, you must reload all
  user-dictionaries and mappings (using LoadDictionary() and LoadMapping()) for the
  changes to take effect.
• Dictionaries and mappings are loaded on a per-client basis. Loaded dictionaries can vary
  from session to session.
• If you load a user-dictionary with an incorrect listName, the result of LoadDictionary()
  is false and the user-dictionary is not loaded.
• LoadDictionary() does not append to user-dictionary lists. It overwrites them. If you load
  a user-dictionary more than once with the same list name, only the most recently loaded
  user-dictionary is kept for that list name.
Examples
select LoadDictionary(standard USING PARAMETERS listName='neg_words_en')
  OVER() from pulse.neg_words_en;

select LoadDictionary(standard USING PARAMETERS listName='pos_words_en')
  OVER() from pulse.pos_words_en;

select LoadDictionary(standard USING PARAMETERS listName='pos_words_en', language='english')
  OVER() from pulse.pos_words_en;

select LoadDictionary(standard USING PARAMETERS listName='pos_words_es', language='spanish')
  OVER() from pulse.pos_words_es;

select LoadDictionary(standard USING PARAMETERS listName='neg_words', label='custom_negatives')
  OVER() from pulse.neg_words_en;
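
The load-and-verify cycle described in the usage considerations might look like the sketch below. The table public.custom_neg_words, its VARCHAR(64) column type, and the sample words are hypothetical and chosen only for illustration; the column name standard mirrors the examples above. That passing label to SentimentAnalysis() selects the labeled dictionaries is an assumption based on the parameter's name, not something this section states.

```sql
-- Hypothetical staging table for custom negative words (not part of Pulse).
CREATE TABLE public.custom_neg_words (standard VARCHAR(64));
INSERT INTO public.custom_neg_words VALUES ('sluggish');
INSERT INTO public.custom_neg_words VALUES ('overpriced');
COMMIT;

-- Load the words into the neg_words list under a custom label.
-- Re-running this statement overwrites the list; it does not append to it.
SELECT LoadDictionary(standard USING PARAMETERS listName='neg_words',
                      label='custom_negatives')
  OVER() FROM public.custom_neg_words;

-- List what is currently loaded for that label.
SELECT GetStorage(USING PARAMETERS label='custom_negatives') OVER();

-- Assumption: label directs the analysis to the labeled dictionaries.
SELECT SentimentAnalysis('The laptop is sluggish.'
                         USING PARAMETERS label='custom_negatives')
  OVER(PARTITION BEST);
```

Because dictionaries are loaded per client, the load step has to be repeated in each session that needs the custom list.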
See Also
• LoadMapping()
• GetLoadedDictionary()
• GetStorage()
LoadMapping
Loads a Pulse user-mapping into memory for use by sentimentAnalysis() and other Pulse functions.
Maps are lists of synonyms of one or more words that map to another word. Using maps allows you
to analyze text that pertains to the same subject or concept but may use slightly different
terminology.
For example, you can map both "Hewlett Packard" and "Hewlett-Packard" (with hyphen) to HP.
Pulse substitutes the core word for the mapped words when it runs its analysis.
Syntax
SELECT LoadMapping(base, wordToMap USING PARAMETERS mapName='mapName' [, language='lang']
[, label='label']) OVER() FROM table
Parameters
Argument    Description
base
wordToMap
mapName
language    The language:
            • 'english' or 'en'
            • 'spanish' or 'es'
label       The label of the mapping that you want to load. If you do not
            provide a label, Pulse uses the default mapping.
table
Usage Considerations
• Whenever you change any user-dictionary or the normalization map, you must reload all
  user-dictionaries and mappings (using LoadDictionary() and LoadMapping()) for the
  changes to take effect.
• After loading, HP Vertica returns a success message from each node in the cluster.
• Dictionaries and mappings are loaded across all client sessions and remain in memory even
  if the database is stopped and started.
• If you load a mapping with an incorrect mapName, the result of LoadMapping() is false and
  the map is not loaded.
• LoadMapping() does not append to maps. It overwrites them. If you load a map more than
  once with the same mapName, only the most recently loaded mapping is kept for that mapName.
Examples
select LoadMapping(standard_base, standard_synonym USING PARAMETERS
mapName='normalization') over() from pulse.normalization_en;
select LoadMapping(standard_base, standard_synonym USING PARAMETERS
mapName='normalization', language='english') over() from pulse.normalization_en;
select LoadMapping(standard_base, standard_synonym USING PARAMETERS
mapName='normalization', language='spanish') over() from pulse.normalization_es;
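
The "Hewlett Packard" mapping described in the introduction to this function could be loaded as sketched below. The table public.custom_normalization, its column types, and the label custom_mapping are hypothetical; the column names mirror the standard_base and standard_synonym columns used in the examples above.

```sql
-- Hypothetical staging table: one row per synonym that maps to a core word.
CREATE TABLE public.custom_normalization (
    standard_base    VARCHAR(64),  -- the core word (passed as base)
    standard_synonym VARCHAR(64)   -- a word or phrase mapped to it (wordToMap)
);
INSERT INTO public.custom_normalization VALUES ('HP', 'Hewlett Packard');
INSERT INTO public.custom_normalization VALUES ('HP', 'Hewlett-Packard');
COMMIT;

-- Load the mapping under a custom label. Loading again with the same
-- mapName overwrites the earlier mapping rather than appending to it.
SELECT LoadMapping(standard_base, standard_synonym
                   USING PARAMETERS mapName='normalization',
                                    label='custom_mapping')
  OVER() FROM public.custom_normalization;

-- List what is currently loaded for that label.
SELECT GetStorage(USING PARAMETERS label='custom_mapping') OVER();
```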
See Also
• LoadDictionary()
• GetLoadedMapping()
• GetStorage()
PartsOfSpeech
Tags the words in one or more sentences with their part-of-speech classification, using Penn
Treebank part-of-speech tags.
Syntax
SELECT PartsOfSpeech('sentences'[, language='lang'] [USING PARAMETERS [language='lang']
[, adjustcasing=boolean]])
OVER(PARTITION BEST);
Parameters
Argument        Description
sentences
language        The language:
                • 'english' or 'en'
                • 'spanish' or 'es'
adjustcasing
Notes
• This function returns a part-of-speech markup for each word. For English, the markup
  is the Penn Treebank Project part-of-speech tag set; for Spanish, the Parole Reduced
  Tagset is used.
• This function must be used with the OVER() clause. Use OVER(PARTITION BEST) for the
  best performance if the query does not require specific columns in the OVER() clause.
Examples
select partsOfSpeech('The quick brown fox jumped over the lazy dog.') OVER(PARTITION BEST);
 sentence | token  | part_of_speech
----------+--------+----------------
        1 | the    | DT
        1 | quick  | JJ
        1 | brown  | JJ
        1 | fox    | NN
        1 | jumped | VBD
        1 | over   | IN
        1 | the    | DT
        1 | lazy   | JJ
        1 | dog    | NN
        1 | .      | .
(10 rows)
See Also
• SentimentAnalysis()
SentimentAnalysis
Provides a sentiment score for each attribute (noun) in a given body of text. Positive sentiment
receives a positive integer score and negative sentiment receives a negative integer score. A score
of 0 indicates that the sentiment for the attribute is neutral.
Syntax
SentimentAnalysis(text [,language] [USING PARAMETERS
[whitelistonly = boolean ]
[, filterlinks = boolean ]
[, filterusermentions = boolean ]
[, filterhashtags = boolean ]
[, filterpunctuation = boolean ]
[, filterretweets = boolean ]
[, relatedwords = boolean ]
[, adjustcasing = boolean ]
[, language = string ]
[, label='label']
])
Parameters
Argument              Description
text
whitelistonly
filterlinks
filterusermentions
filterhashtags
filterpunctuation
filterretweets
relatedwords
adjustcasing
language              The language:
                      • 'english' or 'en'
                      • 'spanish' or 'es'
label
Usage Considerations
• This function must be used with the OVER() clause. Use OVER(PARTITION BEST) for the best
  performance if the query does not require specific columns in the OVER() clause. Any valid
  PARTITION BY clause is acceptable. However, only a PARTITION BY clause that matches
  the segmentation clause of the table's projection provides optimum performance. You can
  improve performance by segmenting on the columns in the PARTITION BY clause.
• language can be specified as an argument, as a parameter, or both. If both are
  specified, the argument value supersedes the parameter value.
Examples
select SentimentAnalysis('The quick brown fox jumped over the lazy dog.') OVER(PARTITION BEST);
 sentence | attribute | sentiment_score
----------+-----------+-----------------
        1 | fox       |               1
        1 | dog       |              -1
(2 rows)

select SentimentAnalysis('The quick brown fox jumped over the lazy dog.'
  USING PARAMETERS relatedwords=true) OVER(PARTITION BEST);
 sentence | attribute | sentiment_score | related_word_1 | related_word_2 | related_word_3
----------+-----------+-----------------+----------------+----------------+----------------
        1 | fox       |               1 | quick          | lazy           |
        1 | dog       |              -1 | lazy           |                |
(2 rows)

select SentimentAnalysis('The quick brown fox jumped over the lazy dog.', 'english')
  OVER(PARTITION BEST);
 sentence | attribute | sentiment_score
----------+-----------+-----------------
        1 | fox       |               1
        1 | dog       |              -1
(2 rows)

select SentimentAnalysis('The quick brown fox jumped over the lazy dog.'
  using PARAMETERS language='english') OVER(PARTITION BEST);
 sentence | attribute | sentiment_score
----------+-----------+-----------------
        1 | fox       |               1
        1 | dog       |              -1
(2 rows)
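
The examples above score a literal string. A query over stored text with an explicit PARTITION BY clause might look like the sketch below. The table public.tweets and its columns id and tweet_text are hypothetical. The three filter parameters come from the syntax above, but this section does not describe their exact effect, so read the comments as what the names suggest rather than as documented behavior.

```sql
-- Hypothetical table: public.tweets(id INTEGER, tweet_text VARCHAR(280)).
-- Partitioning by id keeps each tweet's results identifiable in the output.
-- As noted in the usage considerations, performance is best when the
-- PARTITION BY columns match the segmentation of the table's projection.
SELECT id,
       SentimentAnalysis(tweet_text
                         USING PARAMETERS filterlinks=true,         -- presumably ignores URLs
                                          filterusermentions=true,  -- presumably ignores @mentions
                                          filterhashtags=true)      -- presumably ignores #hashtags
         OVER(PARTITION BY id)
FROM public.tweets;
```

If per-tweet identification is not needed, OVER(PARTITION BEST) remains the simpler choice.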
See Also
• LoadDictionary()
• LoadMapping()
• ExtractSentence()
• GetSentenceCount()
• GetAllSentences()
• CommentAttributes()
SetDefaultLanguage
Sets the default language that Pulse functions use when a function call does not specify a
language.
Syntax
SELECT SetDefaultLanguage(language) OVER();
Parameters
Argument    Description
language    The language:
            • 'english' or 'en'
            • 'spanish' or 'es'
Notes
• The language that you set with this function is the default language across all sessions
  and is persistent across database restarts.
Examples
=> select setDefaultLanguage('es') over();
 Success
---------
 t
(1 row)
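
Because the default persists across sessions and restarts, a temporary change is worth undoing explicitly. The sketch below assumes English was the default before the change; the Spanish sentence is taken from the GetSentenceCount examples.

```sql
-- Switch the default language to Spanish.
SELECT SetDefaultLanguage('es') OVER();

-- No language argument or parameter here, so the default set above applies.
SELECT GetSentenceCount(
         'el zorro rapido brinco sobre el perro flojo. Todos los chicos buenos merecen un premio')
  OVER();

-- Restore the previous default (assumed here to be English), since the
-- setting affects every session and survives database restarts.
SELECT SetDefaultLanguage('en') OVER();
```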
See Also
• SentimentAnalysis()
UnloadLabeledDictionary
Unloads a specific dictionary from a Pulse session. The dictionary continues to exist and a user
can later reload the dictionary, if needed.
You cannot unload a default dictionary, but you can replace it by loading a custom user-defined
dictionary.
Syntax
SELECT unloadLabeledDictionary(USING PARAMETERS listname='listname'[, language='lang'] [,
label='label']) over();
Parameters
Argument    Description
listName    The type of the dictionary that you want to unload. listName must be
            one of:
            • pos_words
            • neg_words
            • neutral_words
            • stop_words
            • white_list
language    The language:
            • 'english' or 'en'
            • 'spanish' or 'es'
label
Examples
select unloadLabeledDictionary(USING PARAMETERS listname='neg_words',
  label='custom_negatives') OVER();
 success
---------
 t
(1 row)
See Also
• UnloadLabeledDictionarySet()
UnloadLabeledDictionarySet
Unloads all user-defined dictionaries with a particular label from a Pulse session. The dictionaries
continue to exist, and a user can later reload the dictionaries, if needed.
You cannot unload a default dictionary, but you can replace it by loading a custom user-defined
dictionary.
Syntax
SELECT unloadLabeledDictionarySet(USING PARAMETERS label='labelName') over();
Parameters
Argument    Description
label
Examples
select unloadLabeledDictionarySet(USING PARAMETERS label='custom_negatives') OVER();
 success
---------
 t
(1 row)
See Also
• UnloadLabeledDictionary()
UnloadLabeledMapping
Unloads a specific mapping from a Pulse session. The mapping continues to exist, and a user can
later reload it, if needed.
Syntax
SELECT unloadLabeledMapping(USING PARAMETERS mapName='normalization' [, language='lang']
[, label='label']) over();
Parameters
Argument    Description
mapName     The name of the mapping that you want to unload.
language    The language:
            • 'english' or 'en'
            • 'spanish' or 'es'
label
Examples
select unloadLabeledMapping(USING PARAMETERS mapName='normalization',
  label='custom_mapping') OVER();
 success
---------
 t
(1 row)