Cloudera Search Installation Guide

Cloudera, Inc.
220 Portage Avenue
Palo Alto, CA 94306
info@cloudera.com
US: 1-888-789-1488
Intl: 1-650-362-0488
www.cloudera.com
Important Notice
© 2010-2013 Cloudera, Inc. All rights reserved.
Cloudera, the Cloudera logo, Cloudera Impala, Impala, and any other product or service names or
slogans contained in this document, except as otherwise disclaimed, are trademarks of Cloudera and its
suppliers or licensors, and may not be copied, imitated or used, in whole or in part, without the prior
written permission of Cloudera or the applicable trademark holder.
Hadoop and the Hadoop elephant logo are trademarks of the Apache Software Foundation. All other
trademarks, registered trademarks, product names and company names or logos mentioned in this
document are the property of their respective owners. Reference to any products, services, processes or
other information, by trade name, trademark, manufacturer, supplier or otherwise does not constitute
or imply endorsement, sponsorship or recommendation thereof by us.
Complying with all applicable copyright laws is the responsibility of the user. Without limiting the rights
under copyright, no part of this document may be reproduced, stored in or introduced into a retrieval
system, or transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or
otherwise), or for any purpose, without the express written permission of Cloudera.
Cloudera may have patents, patent applications, trademarks, copyrights, or other intellectual property
rights covering subject matter in this document. Except as expressly provided in any written license
agreement from Cloudera, the furnishing of this document does not give you any license to these
patents, trademarks, copyrights, or other intellectual property.
The information in this document is subject to change without notice. Cloudera shall not be liable for
any damages resulting from technical errors or omissions which may be present in this document, or
from use of this document.
Version: Cloudera Search Beta, version 0.9.0
Date: June 4, 2013
Contents
ABOUT THIS GUIDE
GUIDELINES FOR DEPLOYING CLOUDERA SEARCH
    THE IMPORTANCE OF USE CASE DEFINITION
CLOUDERA SEARCH REQUIREMENTS
    CDH REQUIREMENT
    OPERATING SYSTEMS
    JDK
    PORTS USED BY CLOUDERA SEARCH
        Ports Used by Cloudera Search
INSTALLING CLOUDERA SEARCH
    CHOOSING WHERE TO DEPLOY THE CLOUDERA SEARCH PROCESSES
    CLOUDERA SEARCH INSTALLATION APPROACHES
        Installing from Packages
    BEFORE YOU BEGIN INSTALLING CLOUDERA SEARCH MANUALLY
    INSTALLING SOLR PACKAGES
    DEPLOYING CLOUDERA SEARCH IN SOLRCLOUD MODE
        Installing and Starting ZooKeeper Server
        Initializing Solr for SolrCloud Mode
        Configuring Solr for use with HDFS
        Creating the /solr Directory in HDFS
        Initializing ZooKeeper Namespace
        Starting Solr in SolrCloud Mode
        Administering Solr with the solrctl Tool
        Runtime Solr Configuration
        Creating your first Solr Collection
        Adding Another Collection with Replication
    INSTALLING FLUME SOLR SINK FOR USE WITH CLOUDERA SEARCH
    INSTALLING MAPREDUCE TOOLS FOR USE WITH CLOUDERA SEARCH
UPGRADING CLOUDERA SEARCH
    CONTENTS
    UPGRADING CLOUD SEARCH FROM SOLRCLOUD MODE
    UPGRADING CLOUD SEARCH FROM NON-SOLRCLOUD MODE
INSTALLING AND USING HUE WITH CLOUDERA SEARCH
    IMPORTING COLLECTIONS
    USER UI
        Customization UI
        Deploying Hue Search
        Updating Hue Search
        Hue Search Twitter Demo

About this Guide


This guide explains how to install Cloudera Search powered by Solr. This guide also explains how to
install, start, and use supporting tools and services such as the ZooKeeper Server, MapReduce tools for
use with Cloudera Search, and Flume Solr Sink.
Cloudera Search documentation also includes:
Cloudera Search User Guide

Guidelines for Deploying Cloudera Search


This section outlines some of the items and choices that you should consider when deploying Cloudera
Search. Use the following information as a guide to help you form and implement solutions for
your particular use cases, rather than as a list of firm recommendations. Note that there is a tradeoff
between effort and results. Until you have an example application, most of this remains
theoretical, since one can't necessarily predict which factors will be most important until use cases and
data are better understood.

The importance of use case definition


It is important to define the use cases as early as possible. The same Solr index can have drastically
different hardware requirements (usually memory) depending upon the queries that are performed.
For example, the memory requirements for faceting vary depending upon the number of unique terms
in the field being faceted upon. Suppose you want to use faceting on a field that has 10 unique values.
Since only 10 counting buckets are required no matter how many documents are in the index,
memory overhead is almost non-existent in this example. But suppose the very same index has unique
timestamps for every entry and you want to facet on that field with a *:*-type query. This would
require one counting bucket per document in the index. If there are 500 million documents, then
faceting across 10 such fields would increase the RAM requirements significantly.
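
To make the contrast concrete, the arithmetic can be sketched in shell. This is only a back-of-the-envelope estimate: the 4-byte bucket size is an assumption chosen for illustration, the function name is hypothetical, and actual memory use depends on the Solr version and field type.

```shell
#!/bin/sh
# Rough estimate of facet counting memory:
#   buckets per field x number of fields x bytes per bucket.
# ASSUMPTION: 4 bytes per counting bucket (illustrative, not a Solr guarantee).
facet_bytes() {
    buckets_per_field=$1
    fields=$2
    bytes_per_bucket=$3
    echo $(( buckets_per_field * fields * bytes_per_bucket ))
}

# Low-cardinality case: one field with 10 unique values.
echo "10-value field: $(facet_bytes 10 1 4) bytes"

# Unique-timestamp case: one bucket per document, 500 million documents, 10 fields.
total=$(facet_bytes 500000000 10 4)
echo "timestamp fields: $total bytes"
echo "timestamp fields: $(( total / 1024 / 1024 / 1024 )) GiB (rounded down)"
```

Under these assumptions the low-cardinality case needs tens of bytes, while the unique-timestamp case needs tens of gigabytes.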
For this reason, use cases and some characterizations of the data must be known before you can
estimate the hardware requirements. The important parameters to consider are:
• Number of documents. For Cloudera Search, it's almost always the case that sharding is
  required.
• Approximate word count for each potential field.
• What information is stored in the Solr index (that is, returned with the search results) and what
  is only for searching.
• Foreign language support.
    o How many different languages appear in your data?
    o What percentage of documents are in each language?
    o Is language-specific searching to be supported? The issue is whether accent folding and
      storing the text in a single field is sufficient.
    o What language families are going to be searched? You can, for instance, combine all
      Western European languages into a single field, but combining English and Chinese into
      a single field doesn't work well. Note also that accents sometimes alter the meaning of a
      word, and accent folding loses that distinction.
• Faceting requirements.
    o Be wary of faceting on fields that have many unique terms (for example, timestamps or
      free-text fields). Faceting on a field with more than 10,000 unique values is usually
      not a useful thing to do. Make sure that any requirement to facet on such fields is
      necessary.
    o Types of facets. You can facet on queries as well as field values. Faceting on queries is
      often useful for dates (for example, "in the last day", "in the last week", and so on).
      Using a bare NOW (Solr Date Math) is almost always inefficient, because its value changes
      with every request and so defeats caching; rounding it (for example, NOW/DAY) avoids this.
      Facet-by-query is not expensive memory-wise, since the number of counting buckets is
      limited by the number of queries specified, no matter how many unique values are in the
      underlying field.
• Sorting requirements.
    o Sorting requires one int per document (maxDoc) and takes up significant memory.
      Additionally, sorting on strings requires storing each unique string value.

• Will there be an advanced search capability? If so, what does that look like? Can Cloudera
  Search count on users being more motivated than e-commerce users? There are significant design
  decisions that need to be made depending on how motivated the users are. That is:
    o Can users be expected to take some time to learn about the system? Advanced
      screens are usually intimidating to e-commerce users but may be the best choice when
      users can be expected to take some time to learn them.
    o How long can your users be patient? Data mining can mean that users can wait multiple
      seconds for search results. Of course, you don't want users to wait any longer than
      necessary, but there's another set of design decisions related to reasonable response
      times.
    o How many simultaneous users?
• Update requirements. An update in Solr refers both to adding new documents and changing
  existing documents.
    o Loading new documents.
        ▪ Bulk. Are there use cases where the index has to be rebuilt from scratch? Or will
          there be an initial load?
        ▪ Incremental. What is the rate of new documents coming into the system?
    o Updating documents. Can you characterize the expected number of modifications to
      existing documents?
    o How much latency is acceptable between the time a document is added to Solr
      and the time it is available for search?

• Security requirements. Solr has no built-in security options. In Solr, document-level security is
  usually best accomplished by indexing some kind of authorization token(s) along with the
  document. The number of authorization tokens applied to a document is largely irrelevant;
  thousands are reasonable, although such large numbers are usually a nightmare to administer.
  The number of authorization tokens associated with a particular user should be much smaller;
  100 or so is a good straw-man upper limit. The reason for this is that security at this level is
  usually enforced by appending an fq clause to the query, and putting thousands of tokens in an
  fq clause is expensive.
    o There exists a post filter (also known as a "no-cache" filter) that can help with access
      schemes that can't use the first option. These filters are not cached and are applied only
      after all the less-expensive filters are applied.
    o If grouping or faceting isn't required to accurately reflect the true document counts, some
      shortcuts can be taken. For example, ACL filtering is notoriously expensive in some
      systems, sometimes requiring database access. If accurate faceting is required, you
      cannot stop processing partway through the list and still reflect accurate facets.
• Required query rate, usually measured in queries per second (QPS).
    o Note that you must size the machines to give a reasonable response rate for a single
      user. It's possible to put so much strain on a machine that the target hardware cannot
      satisfy even a few users, in which case re-sharding is necessary.
    o Absent needing to re-shard, increasing Solr's QPS rate is usually a matter of adding more
      replicas to each shard.
    o A very large number of shards can show the "laggard" issue. As the number of shards
      increases, the probability that one of them will be anomalously slow increases. The QPS
      rate will generally fall, though very slowly, as the number of shards gets into the hundreds.
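
The fq-based enforcement described under Security requirements can be sketched as follows. This is a minimal illustration, not a prescribed scheme: the field name acl_tokens, the token values, and the helper name build_acl_fq are hypothetical, and tokens are assumed to contain no spaces or query-syntax characters.

```shell
#!/bin/sh
# Build a filter-query (fq) value that restricts results to documents carrying
# at least one of the user's authorization tokens.
# ASSUMPTION: tokens are indexed in a field named "acl_tokens" (hypothetical).
build_acl_fq() {
    field=$1
    shift
    clause=""
    for token in "$@"; do
        if [ -n "$clause" ]; then
            clause="$clause OR "
        fi
        clause="$clause$token"
    done
    printf '%s:(%s)\n' "$field" "$clause"
}

fq=$(build_acl_fq acl_tokens group_sales group_eu)
echo "$fq"

# Possible use against a running server (collection1 is the collection
# created later in this guide):
#   curl -G 'http://localhost:8983/solr/collection1/select' \
#        --data-urlencode 'q=*:*' --data-urlencode "fq=$fq"
```

Because every token for the user ends up in the clause, the cost of the filter grows with the number of tokens, which is why a small per-user token count is preferable.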

Cloudera Search Requirements


CDH Requirement
Cloudera Search requires CDH 4.3 or later. For more information, see CDH4 Documentation.

Operating Systems
Cloudera Search provides packages for Red Hat-compatible, SLES, Ubuntu, and Debian systems as
described below.

Operating System                                    Version                                     Packages

Red Hat compatible
  Red Hat Enterprise Linux (RHEL)                   5.7                                         64-bit
                                                    6.2                                         64-bit, 32-bit
  CentOS                                            5.7                                         64-bit
                                                    6.2                                         64-bit, 32-bit
  Oracle Linux with Unbreakable Enterprise Kernel   5.6                                         64-bit

SLES
  SUSE Linux Enterprise Server (SLES)               11 with Service Pack 1 or later             64-bit

Ubuntu/Debian
  Ubuntu                                            Lucid (10.04) - Long-Term Support (LTS)     64-bit
                                                    Precise (12.04) - Long-Term Support (LTS)   64-bit
  Debian                                            Squeeze (6.03)                              64-bit

Notes
• For production environments, 64-bit packages are recommended. Except as noted above,
  Cloudera Search provides only 64-bit packages.
• Cloudera has received reports that our RPMs work well on Fedora, but we have not tested
  this.
• If you are using an operating system that is not supported by Cloudera's packages, you can
  also download source tarballs from Downloads.

JDK
Cloudera Search requires Oracle JDK 1.6. Cloudera recommends version 1.6.0_31. The minimum
supported version is 1.6.0_8. See Java Development Kit Installation for more information.

Ports Used by Cloudera Search


Cloudera Search uses the ports listed in the table below. Before you deploy Cloudera Search, make sure
these ports are open on each system. The table reflects the current default settings, which are defined
in the Solr defaults file, /etc/default/solr.

Ports Used by Cloudera Search

Component         Service              Port   Protocol   Access Requirement   Comment

Cloudera Search   Solr search/update   8983   http       External             All Solr-specific actions, update/query.
                                                                              Defined in /etc/default/solr.
CDH               Cloudera CDH admin   8984   http       Internal             CDH Administrative use.

Installing Cloudera Search


• Review Cloudera Search Requirements before getting started.
• Install Cloudera's repository: before using the instructions in this guide to install or upgrade
  Cloudera Search from packages, install Cloudera's yum, zypper/YaST, or apt repository, and
  install or upgrade CDH4 and make sure it is functioning correctly. For instructions, see CDH4
  Installation and the instructions for Upgrading from CDH3 to CDH4 or Upgrading from an Earlier
  CDH4 Release.

Note
Non-SolrCloud mode has been deprecated and is no longer supported.

Cloudera Search provides the following packages:

Package Name     Description

solr             Solr/SolrCloud
solr-server      Platform-specific service script for starting, stopping, or restarting Solr.
solr-doc         Cloudera Search documentation.
solr-mapreduce   Tools to index documents using MapReduce.
flume-ng-solr    Flume Solr Sink.
search           Examples, Contrib, and Utility code and data.

Choosing where to Deploy the Cloudera Search Processes


You can collocate a Cloudera Search server (solr-server package) with a Hadoop TaskTracker (MRv1) and
a DataNode. When collocating with TaskTrackers, be sure that the resources of the machine are not
oversubscribed. It's safest to start with a small number of MapReduce slots and increase them gradually.
For instructions describing how and where to install solr-mapreduce, see Installing MapReduce Tools for
use with Cloudera Search. For information about flume-ng-solr, see Installing Flume Solr Sink for use
with Cloudera Search. For information about the search package, see the Using Cloudera Search section
in the Cloudera Search Tutorial topic in the Cloudera Search User Guide.

Cloudera Search Installation Approaches


Cloudera Search currently supports installation using either packages or Cloudera Manager. For
more information, see:
• The Cloudera Manager Installation Guide for information on installing Search using Cloudera
  Manager.
• Ways To Install CDH4 for details on installing using packages. This page also describes how to
  install CDH4 using Cloudera Manager.

Installing from Packages


To install and deploy Cloudera Search, follow the directions in the following sections and be sure to
review the Guidelines for Deploying Cloudera Search and the Cloudera Search Tutorial in the Cloudera
Search User Guide.

Before You Begin Installing Cloudera Search Manually


Review the requirements described in Cloudera Search Requirements. The installation instructions
assume that the sudo command is configured on the hosts where you are installing Cloudera Search. If
sudo is not configured, use the root user (superuser) to configure Cloudera Search.

Important
Running services: When starting, stopping, and restarting CDH components, always use the
service(8) command rather than running /etc/init.d scripts directly. This is
important because service sets the current working directory to the root directory (/)
and removes environment variables except LANG and TERM. This creates a predictable
environment in which to administer the service. If you use /etc/init.d scripts directly,
any environment variables continue to be applied, potentially producing unexpected
results. If you install CDH from packages, service is installed as part of the Linux Standard
Base (LSB).

Installing Solr Packages


For Cloudera Internal Dev Use Only
Cloudera Manager can be used to install and manage the CDH cluster (HDFS, MapReduce, Flume, and so
on); however, be sure to specify the CDH 4.2 repo. Other repos (4.1, 4.3, and so on) will not work
properly. During Cloudera Manager installation, use a custom repo. If you are using CloudCat, be sure
the CDH cluster is using CDH version 4.2.
To get access to the nightly build, run the following (adjust for the version of your RHEL/CentOS):

$ curl http://repos.jenkins.sf.cloudera.com/solr-beta-nightly/redhat/5/x86_64/search/cloudera-search.repo | sudo tee /etc/yum.repos.d/cloudera-search.repo
$ sudo yum clean all

Before you start


If the Cloudera Search server is already running, stop it before continuing:

$ sudo service solr-server stop

To install Cloudera Search on Red Hat-compatible systems:

$ sudo yum install solr-server

To install Cloudera Search on Ubuntu and Debian systems:

$ sudo apt-get install solr-server

To install Cloudera Search on SLES systems:

$ sudo zypper install solr-server

See also Deploying Cloudera Search in SolrCloud Mode.

To list the installed files on Red Hat and SLES systems:

$ rpm -ql solr-server solr

To list the installed files on Ubuntu and Debian systems:

$ dpkg -L solr-server solr

You can see that the Cloudera Search packages have been configured to conform to the Linux Filesystem
Hierarchy Standard. (To learn more, run man hier.)
You are now ready to enable the server daemons you want to use with Hadoop. You can also enable
Java-based client access by adding the JAR files in /usr/lib/solr/ and /usr/lib/solr/lib/ to
your Java class path.

Deploying Cloudera Search in SolrCloud Mode


SolrCloud allows you to partition your data set into multiple indexes and processes while simplifying the
management via ZooKeeper. In essence, you run a cluster of coordinating Solr servers rather than a
single Solr server.

Before you start


This section assumes that you have already completed the process of Installing Solr Packages. You
are now about to distribute the processes across multiple hosts; see Choosing where to Deploy the
Cloudera Search Processes.

Installing and Starting ZooKeeper Server


SolrCloud mode uses a ZooKeeper Service as a highly available, central location for cluster management.
For a small cluster, running a ZooKeeper node collocated with the NameNode is recommended. For
larger clusters, contact Cloudera Support for configuration help.

Install and start the ZooKeeper service by running the commands shown in the "Installing the ZooKeeper
Server Package and Starting ZooKeeper on a Single Server" section of Installing the ZooKeeper Packages.

Initializing Solr for SolrCloud Mode


Once your ZooKeeper Service is running, configure each Solr node with the ZooKeeper Quorum
address. To do so, edit the following property in /etc/default/solr so that it contains the address
of the ZooKeeper service. Do this on every Solr Server host:

SOLR_ZK_ENSEMBLE=<zkhost1>:2181,<zkhost2>:2181,<zkhost3>:2181/solr
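
When the list of ZooKeeper hosts is long or generated by a provisioning tool, the value can be assembled with a small helper. The sketch below is illustrative only: the function name and hostnames are hypothetical, and it assumes that every ZooKeeper server uses client port 2181 and the /solr namespace shown above.

```shell
#!/bin/sh
# Build a SOLR_ZK_ENSEMBLE value from a list of ZooKeeper hostnames.
# ASSUMPTIONS: client port 2181 on every host and the /solr namespace,
# matching the example in this guide.
zk_ensemble() {
    result=""
    for host in "$@"; do
        if [ -n "$result" ]; then
            result="$result,"
        fi
        result="$result$host:2181"
    done
    printf '%s/solr\n' "$result"
}

# Hypothetical hostnames, for illustration only.
echo "SOLR_ZK_ENSEMBLE=$(zk_ensemble zk1.example.com zk2.example.com zk3.example.com)"
```

The printed line can then be copied into /etc/default/solr on each Solr Server host.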

Configuring Solr for use with HDFS


To set up Solr for use with your established HDFS service, perform the following configurations:
1. Configure the HDFS URI for Solr to use as a backing store in /etc/default/solr. Edit the
following property to configure the location of Solr index data in HDFS. Do this on every Solr
Server host:

SOLR_HDFS_HOME=hdfs://namenodehost:8020/solr

Be sure to replace namenodehost with the hostname of your HDFS NameNode (as specified by
fs.default.name or fs.defaultFS in your conf/core-site.xml file); you may also need
to change the port number from the default (8020). On an HA-enabled cluster, you will need to
ensure that the HDFS URI you use reflects the designated nameservice utilized by your cluster.
This value should be reflected in fs.default.name; instead of a hostname, you would see
hdfs://nameservice1 or something similar.

2. In some cases, such as for configuring Solr to work with HDFS High Availability (HA), you may
want to configure Solr's HDFS client. You can do this by setting the HDFS configuration directory
in /etc/default/solr. Locate the appropriate HDFS configuration directory on each node,
and edit the following property with the absolute path to this directory. Do this on every Solr
Server host:

SOLR_HDFS_CONFIG=/etc/hadoop/conf

Be sure to replace the path with the correct directory containing the proper HDFS configuration
files, core-site.xml and hdfs-site.xml.
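
Because the same properties must be set on every Solr Server host, it can help to script the edit. The following is a minimal sketch, not part of Cloudera Search: set_default is a hypothetical helper, oldhost is a placeholder, and the demonstration runs against a scratch file rather than the real /etc/default/solr.

```shell
#!/bin/sh
# set_default FILE KEY VALUE
# Remove any existing KEY=... line from FILE and append KEY=VALUE at the end.
set_default() {
    file=$1
    key=$2
    value=$3
    tmp="$file.tmp.$$"
    touch "$file"
    # Copy every line except an existing assignment of KEY.
    # grep exits non-zero when no lines remain; that is harmless here.
    grep -v "^$key=" "$file" > "$tmp"
    printf '%s=%s\n' "$key" "$value" >> "$tmp"
    # Write back in place so the file keeps its owner and mode.
    cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Demonstration on a scratch file.
scratch=$(mktemp)
printf 'SOLR_HDFS_HOME=hdfs://oldhost:8020/solr\n' > "$scratch"
set_default "$scratch" SOLR_HDFS_HOME hdfs://namenodehost:8020/solr
set_default "$scratch" SOLR_HDFS_CONFIG /etc/hadoop/conf
cat "$scratch"
rm -f "$scratch"
```

On a real host the helper would need write access to /etc/default/solr (for example, when run as root), and the Solr server must be restarted for the change to take effect.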

Configuring Solr for use with Secure HDFS


For information on setting up a secure CDH cluster, see the CDH4 Security Guide. In addition to the
above steps for Configuring Solr for use with HDFS, you will need to perform the following additional
steps if security is enabled:
1. Create the Kerberos Principals and Keytab Files
For every node in your cluster:
a. Create the solr principal using either kadmin or kadmin.local (see Create and Deploy the
Kerberos Principals and Keytab Files for information on which to use).

kadmin: addprinc -randkey solr/fully.qualified.domain.name@YOUR-REALM.COM

b. Create the solr keytab:

kadmin: xst -norandkey -k solr.keytab solr/fully.qualified.domain.name

2. Deploy the Kerberos Keytab Files
On every node in your cluster:
a. Copy or move the keytab files to a directory that Solr can access, such as /etc/solr/conf.

$ sudo mv solr.keytab /etc/solr/conf/

b. Make sure that the solr.keytab file is readable only by the solr user:

$ sudo chown solr:hadoop /etc/solr/conf/solr.keytab
$ sudo chmod 400 /etc/solr/conf/solr.keytab

3. Add Kerberos-related settings to /etc/default/solr on every node in your cluster, substituting
appropriate values:

SOLR_KERBEROS_ENABLED=true
SOLR_KERBEROS_KEYTAB=/etc/solr/conf/solr.keytab
SOLR_KERBEROS_PRINCIPAL=solr/fully.qualified.domain.name@YOUR-REALM.COM
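
After deploying the keytab, it is worth confirming that its permissions are as restrictive as intended. The check below is a sketch under stated assumptions: keytab_mode_ok is a hypothetical helper, it relies on GNU coreutils stat, and the demonstration uses a scratch file in place of /etc/solr/conf/solr.keytab.

```shell
#!/bin/sh
# keytab_mode_ok FILE: succeed only if FILE has mode 400 (owner read-only).
# ASSUMPTION: GNU coreutils stat, which supports -c '%a' (octal mode).
keytab_mode_ok() {
    [ "$(stat -c '%a' "$1")" = "400" ]
}

# Demonstration on a scratch file; on a real node, pass
# /etc/solr/conf/solr.keytab instead.
scratch=$(mktemp)
chmod 400 "$scratch"
if keytab_mode_ok "$scratch"; then
    echo "mode ok"
else
    echo "mode too permissive"
fi
rm -f "$scratch"
```

On a node with the MIT Kerberos client tools installed, klist -kt /etc/solr/conf/solr.keytab can additionally be used to list the principals stored in the keytab.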

Creating the /solr Directory in HDFS


Before starting the Cloudera Search server, you need to create the /solr directory in HDFS. The
Cloudera Search master runs as solr:solr so it does not have the required permissions to create a
top-level directory.
To create the /solr directory in HDFS:

$ sudo -u hdfs hadoop fs -mkdir /solr
$ sudo -u hdfs hadoop fs -chown solr /solr

Initializing ZooKeeper Namespace


Before starting the Cloudera Search server, you need to create the solr namespace in ZooKeeper:

$ solrctl init

WARNING
solrctl init also takes a --force option. Running solrctl init --force clears the Solr data in
ZooKeeper and interferes with any running nodes. If you want to clear Solr data from ZooKeeper to
start over, be sure to stop the cluster first.

Starting Solr in SolrCloud Mode


To start the cluster, start Solr Server on each node:

$ sudo service solr-server restart

After you have started the Cloudera Search Server, the Solr server should be up and running. You can
verify that all daemons are running using the jps tool from the Oracle JDK, which you can obtain from
the Java SE Downloads page. If you are running a pseudo-distributed HDFS installation and a Solr search
installation on one machine, jps will show the following output:

$ sudo jps -lm


31407 sun.tools.jps.Jps -lm
31236 org.apache.catalina.startup.Bootstrap start
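
The same check can be scripted, for example as part of a post-install smoke test. The sketch below is illustrative: solr_container_running is a hypothetical helper, and it assumes that Solr's servlet container is the only process on the host whose main class is org.apache.catalina.startup.Bootstrap, as in the sample output above.

```shell
#!/bin/sh
# Read `jps -lm` output on stdin; succeed if a servlet container process
# (main class org.apache.catalina.startup.Bootstrap) is listed.
# ASSUMPTION: no other service on the host runs under the same main class.
solr_container_running() {
    grep -q 'org\.apache\.catalina\.startup\.Bootstrap start'
}

# Demonstration with the sample output from this guide.
# On a real host:  sudo jps -lm | solr_container_running
sample='31407 sun.tools.jps.Jps -lm
31236 org.apache.catalina.startup.Bootstrap start'
if printf '%s\n' "$sample" | solr_container_running; then
    echo "Solr servlet container is running"
else
    echo "Solr servlet container not found"
fi
```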

Administering Solr with the solrctl Tool


Cloudera Search comes with a command-line utility, solrctl, for performing administrative operations
on configuration bundles and Solr collections. To display help information about solrctl, run solrctl
with --help as the only option on the command line.

$ solrctl --help

usage: /usr/bin/solrctl [options] command [command-arg] [command [command-arg]] ...

Options:
    --solr solr_uri
    --zk   zk_ensemble
    --help
    --quiet

Commands:
    init        [--force]

    instancedir [--generate path]
                [--create name path]
                [--update name path]
                [--get name path]
                [--delete name]
                [--list]

    collection  [--create name -s <numShards>
                              [-c <collection.configName>]
                              [-r <replicationFactor>]
                              [-m <maxShardsPerNode>]
                              [-n <createNodeSet>]]
                [--delete name]
                [--reload name]
                [--stat name]
                [--deletedocs name]
                [--list]

    core        [--create name [-p name=value]...]
                [--reload name]
                [--unload name]
                [--status name]

Runtime Solr Configuration


To start using Solr for indexing data, you must configure a collection to hold the index. At a
minimum, a configuration for a collection requires two files: solrconfig.xml and schema.xml (plus
whatever helper files may be referenced from these two). The solrconfig.xml file contains all of the
Solr settings for a given collection, and the schema.xml file specifies the schema that Solr uses
when indexing documents. For more details on how to configure the schema for your data set, see
http://wiki.apache.org/solr/SchemaXml.

WARNING
If Cloudera Manager is managing the cluster, the --zk option must be specified appropriately.

solrctl --zk <zkhost1>:2181,<zkhost2>:2181,<zkhost3>:2181/solr ...

Configuration files for a collection are managed as part of the instance directory. To generate a skeleton
of the instance directory run:

$ solrctl instancedir --generate $HOME/solr_configs

You can customize it by directly editing the solrconfig.xml and schema.xml files that have been
created in $HOME/solr_configs/conf.
These configuration files are compatible with the standard Solr tutorial example documents.
Once you are satisfied with the configuration, you can make it available for Solr to use via the following
command that will upload the content of the entire instance directory to ZooKeeper:

$ solrctl instancedir --create collection1 $HOME/solr_configs

You may also use the solrctl tool to verify that your instance directory uploaded successfully and is
available via ZooKeeper:

$ solrctl instancedir --list

This should return a list of instance directory names; in this case, "collection1".

Important
If you are familiar with Apache Solr, you may be tempted to configure a collection directly in solr
home: /var/lib/solr. While this is possible, it is discouraged and the use of solrctl is
recommended instead.

Creating your first Solr Collection


By default, the Solr server comes up with no collections. Create your first collection using the
instancedir that you provided to Solr in the previous steps, by using the same collection name. Here,
numOfShards is the number of SolrCloud shards you want to partition the collection across; it cannot
exceed the total number of Solr servers in your SolrCloud cluster:

$ solrctl collection --create collection1 -s <numOfShards>
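
A wrapper script can guard against requesting more shards than there are servers before calling solrctl. This is a minimal sketch: shards_fit is a hypothetical helper, and the shard and server counts are example values that you would replace with your own.

```shell
#!/bin/sh
# shards_fit NUM_SHARDS NUM_SERVERS
# Succeed if the shard count does not exceed the number of Solr servers.
shards_fit() {
    [ "$1" -le "$2" ]
}

# Hypothetical values: 3 shards requested on a 4-server cluster.
num_shards=3
num_servers=4
if shards_fit "$num_shards" "$num_servers"; then
    echo "ok: solrctl collection --create collection1 -s $num_shards"
else
    echo "error: $num_shards shards exceed $num_servers Solr servers" >&2
fi
```

The script only prints the command it would run; replacing the echo with the actual solrctl call is left to the operator.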

You should be able to navigate to
http://localhost:8983/solr/collection1/select?q=*%3A*&wt=json&indent=true and verify that the
collection is active. You should also be able to observe the topology of your SolrCloud by navigating to:
http://localhost:8983/solr/#/~cloud

Adding Another Collection with Replication


To support scaling for query load, create a second collection with replication. Having multiple servers
with replicated collections distributes the request load for each shard. Create a one-shard collection
with a replication factor of two. Your cluster must have at least two running servers to support this
configuration, so ensure Cloudera Search is installed on at least two servers before continuing with this
process. A replication factor of two causes two copies of the index files to be stored in two different
locations.
1. Generate the config files for the collection:

$ solrctl instancedir --generate $HOME/solr_configs2

2. Upload the instance directory to ZooKeeper:

$ solrctl instancedir --create collection2 $HOME/solr_configs2

3. Create the second collection:

$ solrctl collection --create collection2 -s 1 -r 2

4. Verify the collection is live and that your one shard is being served by 2 nodes:
http://localhost:8983/solr/#/~cloud
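The relationship between shards, replication factor, and servers described above can be sketched as a small calculation. This is an illustration only: the function name replica_plan is hypothetical, the inputs are assumed to be integers, and the rule that the cluster needs at least as many running servers as the replication factor is an assumption generalized from the two-server example in this section.

```shell
#!/bin/sh
# Hypothetical helper: report how many index copies a collection layout needs.
# Assumption (generalized from the two-server example above): the cluster needs
# at least as many running servers as the replication factor.
replica_plan() {
    shards="$1"; replicas="$2"; servers="$3"
    if [ "$servers" -lt "$replicas" ]; then
        echo "need at least $replicas servers, have $servers" >&2
        return 1
    fi
    echo "index copies to place: $((shards * replicas))"
}

# One shard with a replication factor of two on a two-server cluster:
replica_plan 1 2 2
# prints: index copies to place: 2
```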

Installing Flume Solr Sink for use with Cloudera Search


The Flume Solr Sink provides a flexible, scalable, fault-tolerant, transactional, near-real-time (NRT)
system for processing a continuous stream of records into live search indexes. Latency from the time
data arrives to the time it appears in search query results is tunable and is on the order of seconds.
To install the Flume Solr Sink on Red Hat-compatible systems:

$ sudo yum install flume-ng-solr

To install the Flume Solr Sink on Ubuntu and Debian systems:

$ sudo apt-get install flume-ng-solr


To install the Flume Solr Sink on SLES systems:

$ sudo zypper install flume-ng-solr

For information on using the Flume Solr Sink, see the Flume Near Real-Time Indexing Reference in the
Cloudera Search User Guide.

Installing MapReduce Tools for use with Cloudera Search


Cloudera Search provides the ability to batch index documents using MapReduce jobs. Install the
solr-mapreduce package on nodes where you want to submit a batch indexing job.

To install solr-mapreduce on Red Hat-compatible systems:

$ sudo yum install solr-mapreduce

To install solr-mapreduce on Ubuntu and Debian systems:

$ sudo apt-get install solr-mapreduce

To install solr-mapreduce on SLES systems:

$ sudo zypper install solr-mapreduce

For information on using MapReduce to batch index documents, see the MapReduce Batch Indexing
Reference in the Cloudera Search User Guide.
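The per-distribution install commands in this guide differ only in the package manager. The following is a minimal sketch that prints the matching command for a package such as flume-ng-solr or solr-mapreduce. The function names install_cmd and detect_manager are hypothetical, and the detection order (first of yum, apt-get, zypper found on the PATH) is a heuristic assumed for this sketch, not a Cloudera recommendation.

```shell
#!/bin/sh
# Hypothetical helper: print the install command for a package, given the
# package manager name. Only the three managers named in this guide are handled.
install_cmd() {
    manager="$1"; package="$2"
    case "$manager" in
        yum)     echo "sudo yum install $package" ;;
        apt-get) echo "sudo apt-get install $package" ;;
        zypper)  echo "sudo zypper install $package" ;;
        *)       echo "unsupported package manager: $manager" >&2; return 1 ;;
    esac
}

# Heuristic: report the first of the three managers found on the PATH.
detect_manager() {
    for m in yum apt-get zypper; do
        if command -v "$m" >/dev/null 2>&1; then
            echo "$m"
            return 0
        fi
    done
    return 1
}

install_cmd yum solr-mapreduce
# prints: sudo yum install solr-mapreduce
```

To combine the two functions on a target host, you could call install_cmd "$(detect_manager)" flume-ng-solr and review the printed command before running it.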

Upgrading Cloudera Search

Contents
Upgrading Cloudera Search from SolrCloud mode
Upgrading Cloudera Search from Non-SolrCloud mode
Upgrading Cloudera Search involves stopping Cloudera Search services, using your operating system's
package management tool to upgrade Cloudera Search to the latest version, and then restarting
Cloudera Search services.

Requirements
Before attempting any upgrades it is extremely important to make backup copies of the following
configuration files:


/etc/default/solr

/var/lib/solr/solr.xml

All collection configurations


Make sure you make a copy on every node that is part of the SolrCloud.
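The backup of the two named files can be scripted. The following is a minimal sketch: the function name backup_configs and the backup directory in the commented example are hypothetical, file arguments are assumed to be absolute paths, and collection configurations are not covered here and must be saved separately. Run it on every node that is part of the SolrCloud.

```shell
#!/bin/sh
# Hypothetical helper: copy configuration files into a backup directory,
# recreating each file's full (absolute) path underneath it.
backup_configs() {
    dest="$1"; shift
    for f in "$@"; do
        if [ -e "$f" ]; then
            mkdir -p "$dest$(dirname "$f")" &&
            cp -p "$f" "$dest$f" || return 1
        else
            echo "skipping missing file: $f" >&2
        fi
    done
}

# Example call, left as a comment so the sketch has no side effects:
#   backup_configs "$HOME/solr-backup-$(date +%Y%m%d)" \
#       /etc/default/solr /var/lib/solr/solr.xml
```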

Upgrading Cloudera Search from SolrCloud mode


If you already have a SolrCloud configuration deployed, do the following:
1. Stop the Solr server:

$ sudo service solr-server stop

2. Upgrade the packages by following the instructions in the "Installing Cloudera Search" section
of the Installing and Using Cloudera Search guide. Do NOT run yum update.

3. Start the Solr server:

$ sudo service solr-server start

Upgrading Cloudera Search from Non-SolrCloud mode


Non-SolrCloud is now fully deprecated and should be avoided. If you previously configured Cloudera
Search in a Non-SolrCloud mode, you need to move your configuration by using the following steps
before you install any new packages.
While your previous Non-SolrCloud deployment is running, do the following:
1. List all existing instancedirs:

$ solradmin instancedir --list

2. Make a copy of each core configuration:

$ solradmin instancedir --get <NAME> $HOME/<NAME>

3. Stop the Solr server:

$ sudo service solr-server stop


4. Upgrade the packages by following the instructions in the "Installing Cloudera Search" section
of the Installing and Using Cloudera Search guide. Do NOT run yum update.

5. Enable SolrCloud mode by specifying SOLR_ZK_ENSEMBLE in /etc/default/solr.


6. Initialize the SolrCloud state.

$ solrctl init

7. Upload ALL the existing configuration into the SolrCloud.

Requirements
It is extremely important NOT to start the upgraded Solr server before completing
this step.

8. For every configuration that you saved, do the following:

$ solradmin instancedir --create <NAME> $HOME/<NAME>

9. Start the Solr server:

$ sudo service solr-server start
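When many instance directories exist, the save command in step 2 and the upload command in step 8 are repeated once per name. The following is a minimal sketch that prints those commands for a list of names so they can be reviewed before being run. The function name migration_commands and its phase argument are hypothetical. The solradmin invocations are copied from the steps above, and the names collection1 and collection2 are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: print the save or restore command for each instance
# directory name passed as an argument. Nothing is executed.
migration_commands() {
    phase="$1"; shift   # "save" (before the upgrade) or "restore" (after solrctl init)
    for name in "$@"; do
        case "$phase" in
            save)    echo "solradmin instancedir --get $name \$HOME/$name" ;;
            restore) echo "solradmin instancedir --create $name \$HOME/$name" ;;
            *)       echo "unknown phase: $phase" >&2; return 1 ;;
        esac
    done
}

migration_commands save collection1 collection2
```

The names would normally come from the listing in step 1. Calling the function with restore and the same names prints the matching upload commands for step 8.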

Installing and Using Hue with Cloudera Search


Hue includes a Search application that provides a customizable search UI.

Importing Collections
The following screenshot is an example of the collection import feature within Hue.


Generally, only collections should be imported. Importing cores is rarely useful because it enables
querying only a single shard of the index. See A little about SolrCores and Collections for more information.

User UI
The following screenshot is an example of the appearance of the Search application that is integrated
with the Hue user interface.

Customization UI
The following screenshot is an example of the appearance of the Search application customization
interface provided in Hue.

Currently, only superusers can access this view.


Deploying Hue Search


You must install and configure Hue before you can use Search with Hue.
1. Follow the instructions for Installing Hue.
2. Use one of the following commands to install Search applications on the Hue machine:

For package installation on RHEL systems:

$ sudo yum install hue-search

For package installation on SLES systems:

$ sudo zypper install hue-search

For package installation on Ubuntu or Debian systems:

$ sudo apt-get install hue-search

For installation using tarballs:

$ cd /usr/share/hue
$ sudo tar -xzvf hue-search-####.tar.gz
$ sudo /usr/share/hue/tools/app_reg/app_reg.py --install /usr/share/hue/apps/search

3. Update the URL for the Solr Server.


In a Cloudera Manager-managed environment:
a. Connect to Cloudera Manager.
b. Select the Hue service.
c. Click Configuration > View and Edit.
d. Search for the word "safety".
e. Add information about your Solr host to Hue Server (Base) / Advanced. For example, if
your hostname is SOLR_HOST, you might add the following:

[search]
## URL of the Solr Server
solr_url=http://SOLR_HOST:8983/solr

In an environment without Cloudera Manager:


Specify the Solr URL in /etc/hue/hue.ini. For example, to use localhost as your Solr host,
you would add the following:

[search]
# URL of the Solr Server, replace 'localhost' if Solr is
# running on another host
solr_url=http://localhost:8983/solr/

4. (Optional) To view files on HDFS, ensure the correct webhdfs_url is included in hue.ini and
WebHDFS is properly configured as described in Configuring CDH Components for Hue.
5. Restart Hue:

$ sudo /etc/init.d/hue restart

6. Open http://hue-host.com:8888/search/ in your browser.
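Before restarting Hue, it can help to confirm that the solr_url setting from step 3 is in the [search] section of the configuration file. The following is a minimal sketch: the function name solr_url_from_ini is hypothetical, and it assumes solr_url appears directly under [search] before any nested subsection.

```shell
#!/bin/sh
# Hypothetical check: print the solr_url value from the [search] section
# of a hue.ini-style file. Any other section header ends the [search] scope.
solr_url_from_ini() {
    awk '
        /^[ \t]*\[/ { in_search = ($0 ~ /^[ \t]*\[search\][ \t]*$/) }
        in_search && /^[ \t]*solr_url[ \t]*=/ {
            sub(/^[^=]*=[ \t]*/, "")
            print
            exit
        }
    ' "$1"
}

# Example call, left as a comment because the file may not exist on this host:
#   solr_url_from_ini /etc/hue/hue.ini
```

An empty result suggests that the setting is missing or was placed in a different section.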

Updating Hue Search


Updating Hue Search involves installing updates and restarting the Hue service.
1. On the Hue machine, update Hue Search:

$ cd /usr/share/hue
$ sudo tar -xzvf hue-search-####.tar.gz
$ sudo /usr/share/hue/tools/app_reg/app_reg.py --install /usr/share/hue/apps/search

2. Restart Hue:

$ sudo /etc/init.d/hue restart

Hue Search Twitter Demo


The demo uses processes similar to those described in the Running Queries section of the Cloudera Search
Tutorial in the Cloudera Search User Guide. The demo illustrates the following features:

Only regular Solr APIs are used.


Show facets such as fields, ranges, or dates; sort by time in seconds.

Result snippet editor and preview, function for downloading, extra css/js, labels, and field
picking assist.
Show multi-collections.
Show highlighting of search term.


Show facet ordering.

Auto complete handler using /suggest.
