
BIAM 560 – Predictive Analytics

LAB WEEK 7

NEURAL NETWORKS, NOSQL, AND REPORTING


Dritan Papazisi

Submitted to Professor: Dr. Michael Mullas

Sunday, June 23, 2019

Table of Contents
Scenario
Explanation SQL, NoSQL, and NewSQL
  Where RDBMS fall short?
  What NoSQL brings to the table?
  Data models in NoSQL.
    1. Key-Value (K-V) Stores
    2. Document Stores
    3. Column-Oriented Stores
    4. Graph Databases
Preview of Some NoSQL Solutions
  Redis
  CouchDB™
  MongoDB
Why MongoDB?
Lift Chart: Compare and contrast DT and SVM.
ROC charts: Compare and contrast DT and SVM.
Literature Sources

Scenario
Throughout this course we have examined many different ways to manage and analyze data. This week, our challenge is to explore NoSQL. Your
CEO has decided that using big data is the wave of the future and is especially interested in the topic of NoSQL and how it works. Before you can
explain NoSQL and NoSQL data management options to the CEO, it is necessary to do some further research. The CEO has indicated that his
advisors believe that MongoDB is a viable option for managing large amounts of data. Your task is to research NoSQL and discover how it can
benefit the organization. You are also curious about MongoDB; what it is and how it works. Please build a two-page report on your NoSQL
discussion and recommendation of MongoDB. Explain the pros and cons of working with MongoDB, and indicate why it might be the best option
for the CEO. The report should include the following.

Discussion of NoSQL and its benefits

Discussion of MongoDB

Discussion of use of an Internet application like MongoDB.

While doing your research, you encounter information on the following topics that you feel should also be included in your report to the CEO. An
additional two pages are needed to explain the following.

Lift charts: Compare and contrast DT and SVM.

ROC charts: Compare and contrast DT and SVM.

You will need to put your findings into a Word document and submit the file.

Explanation SQL, NoSQL, and NewSQL
“SQL” is used both as the name of a language and as a type of database. SQL the language is a structured query language designed for
managing data in relational database management systems (RDBMS). Relational database management systems are often called SQL databases
since they use the SQL language. For over 30 years, relational database technology, based on a model that organizes data in the form of
tables, columns, and rows, has been the gold standard. Since the mid-1980s, SQL has been unquestionably the standard for querying and
managing RDBMS data sets. SQL systems have proven themselves a good option for scaling databases vertically, regardless of their size.
However, in the era of cloud computing, with unprecedented data volumes, massive workloads for Web services, the need to store new types
of data, and the gathering of data for BI, the need for different database systems has arisen.

Where RDBMS fall short?


The biggest problem developers face when using relational databases is the object-relational impedance mismatch. SQL queries are not
well suited to the object-oriented data structures used in most applications today. A closely related issue is storing or retrieving
an object with all of its relevant data. Some application operations require multiple and/or very complex queries. In that case, the
complexity of data mapping and query generation grows too much and becomes difficult to maintain on the application side. Some of these
problems may be tempered by various object-relational mapping (ORM) frameworks, but working around most of the performance and data
access complexity issues still requires a lot of development effort.

Another set of problems that relational databases struggle with is related to an exponentially increasing amount of data. The direct
consequence is the so-called big data problem. This problem arises when standard SQL query operations do not perform acceptably,
especially when transactions are involved. In these situations developers found that scaling out by adding more servers raised further
configuration and maintainability issues, which ultimately proved costly without improving performance (Krisciunas, 2014).

Performance is a driving force in the selection of the proper platform. A contemporary business intelligence infrastructure features
capabilities and tools to manage and analyze large quantities and different types of data from multiple sources. Easy-to-use query and
reporting tools for casual business users and more sophisticated analytical toolsets for power users are included
(Laudon & Laudon, 2015).

What NoSQL brings to the table?


NoSQL databases were created in response to the limitations of traditional relational database technology. When compared against
relational databases, NoSQL databases are generally more scalable and can provide superior performance, and their data model addresses
several shortcomings of the relational model. Modern apps enable use cases that demand stream ingestion, real-time processing, and data
storage, while running on a distributed, cloud-native infrastructure. Traditional messaging tool providers are trying hard to “bolt on”
ACID transactions and add statefulness to their streams, but their inherent “data pipeline” architecture limits them from offering these
must-have capabilities. The newer NoSQL systems, by contrast, use different data models and are not forced to do something they were not
designed for. It is of the utmost importance to understand and correctly use the data model when choosing the proper NoSQL solution
(Laudon & Laudon, 2015).

Many NoSQL solutions compromise consistency in favor of availability, scalability, and partition tolerance. On the other hand, some
NoSQL solutions allow you to specify what level of consistency should be applied to a particular operation, and some even fully support
ACID transactions.

As the enclosed charts indicate, the main advantages of NoSQL include:

1. the ability to handle large volumes of structured, semi-structured, and unstructured data,
2. support for agile sprints, quick iteration, and frequent code pushes,
3. object-oriented programming that is easy to use and flexible,
4. an efficient, scale-out architecture instead of an expensive, monolithic architecture,
5. open-source availability for many NoSQL databases, which means a relatively low-cost way of developing, implementing, and sharing
software.

These are some of the main reasons why fast-growing companies are quickly embracing “NoSQL” non-relational database technologies.

Non-relational database management systems use a more flexible data model and are designed for managing large data sets across
many distributed machines and for easily scaling up or down. They have been very useful for accelerating simple queries against large volumes
of structured and unstructured data, including Web, social media, graphics, and other forms of data that are difficult to analyze with traditional
SQL-based tools.

Data models in NoSQL.


Today there are many different NoSQL databases, each with its own technical features and behavior. Oracle NoSQL Database is
one example, as is Amazon’s SimpleDB, one of the Amazon Web Services that run in the cloud. SimpleDB provides a simple Web services
interface to create and store multiple data sets, query data easily, and return the results. There is no need to pre-define a formal database
structure or change that definition if new data are added later. This makes it a very attractive choice for new ventures whose chances
of success are uncertain. For example, MetLife decided to employ the MongoDB open source NoSQL database to quickly integrate
disparate data and deliver a consolidated view of the customer. MetLife’s database brings together data from more than 70 separate
administrative systems, claims systems, and other data sources, including semi-structured and unstructured data, such as images of health
records and death certificates. The NoSQL database is able to ingest structured, semi-structured, and unstructured information without requiring
tedious, expensive, and time-consuming database mapping (Henschen, 2013). In general, however, NoSQL databases can be grouped into the
following categories:

1. Key-Value (K-V) Stores


Technically, a key-value store is just a distributed, persistent associative array. The key is a unique identifier for a value, which can
be any data the application needs to store. This model is the fastest way to get data by a known key, but it lacks the flexibility of more
advanced querying. It may be used for data sharing between application instances, such as a distributed cache, or to store user session
data. In this model, as in the document store model, transactional consistency is rarely needed because most operations are atomic by
definition.
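As a minimal sketch of the associative-array idea, assuming only the Python standard library, a session store can be modeled as a mapping from session ID to an arbitrary value. The class name `SessionStore` and its methods are hypothetical illustrations and not the API of any particular product; a real key-value store would also be distributed and persistent.

```python
from typing import Any, Dict, Optional


class SessionStore:
    """Toy in-memory key-value store (hypothetical; illustration only)."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}

    def put(self, key: str, value: Any) -> None:
        # The value is opaque to the store and is replaced as a whole.
        self._data[key] = value

    def get(self, key: str) -> Optional[Any]:
        # Lookup is by exact key only; there is no query on the value's contents.
        return self._data.get(key)

    def delete(self, key: str) -> bool:
        # Report whether the key existed before removal.
        if key in self._data:
            del self._data[key]
            return True
        return False


store = SessionStore()
store.put("session:42", {"user": "alice", "cart": ["sku-1", "sku-2"]})
session = store.get("session:42")   # the full stored value
missing = store.get("session:99")   # None, because the key is unknown
```

The sketch also shows the limitation noted above: finding every session that belongs to a given user would require scanning all values, because only the key is indexed.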

2. Document Stores
A document store is a data model for storing semi-structured document object data and metadata. The JSON format is normally used to represent
such objects. Documents can be queried by their properties in a similar manner to relational databases but aren’t required to adhere to the
strict structure of a database table. Additionally, only parts of the object may be requested or updated. Document stores are used for aggregate
objects that have no shared complex data between them and to quickly search or filter by some object properties.
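The following is a rough sketch, assuming plain Python dictionaries stand in for JSON documents, of how documents without a fixed schema can be filtered by property and how only part of each object can be returned. The helper `find` and the sample data are invented for illustration and do not correspond to a specific database API.

```python
from typing import Any, Dict, Iterable, List

Document = Dict[str, Any]


def find(collection: Iterable[Document], criteria: Document, fields: List[str]) -> List[Document]:
    """Return the requested fields of every document that matches all criteria."""
    results = []
    for doc in collection:
        # A document lacking a field simply does not match; no schema is enforced.
        if all(doc.get(key) == value for key, value in criteria.items()):
            results.append({f: doc[f] for f in fields if f in doc})
    return results


customers = [
    {"_id": 1, "name": "Ana", "city": "Chicago", "phones": ["555-0100"]},
    {"_id": 2, "name": "Ben", "city": "Boston"},  # no phones field
    {"_id": 3, "name": "Cy", "city": "Chicago", "interests": ["golf"]},
]

chicago_names = find(customers, {"city": "Chicago"}, ["name"])
# chicago_names == [{"name": "Ana"}, {"name": "Cy"}]
```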

3. Column-Oriented Stores
A more advanced K-V store data model is a column family. These are used for organizing data based on individual columns where actual data is
used as a key to refer to whole data collections. It is similar to a relational database index; however, a column family may be an arbitrary
collection of columns. There are more complex aggregation structures like super columns and super column families to allow access to the data
by several keys. This approach is used for very large scalable databases to greatly reduce time for searching data. It is rarely used outside of
enterprise level applications.
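One way to picture a column family, offered here as a simplified sketch rather than the layout of any specific product, is a nested map from row key to column family to column. The row keys, family names, and the `read_family` helper below are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Any, Dict

# row key -> column family -> column name -> value (illustrative structure)
Table = Dict[str, Dict[str, Dict[str, Any]]]

table: Table = defaultdict(lambda: defaultdict(dict))

table["user:1"]["profile"]["name"] = "Ana"
table["user:1"]["profile"]["city"] = "Chicago"
table["user:1"]["activity"]["last_login"] = "2019-06-20"
table["user:2"]["profile"]["name"] = "Ben"  # no city column, no activity family


def read_family(t: Table, row_key: str, family: str) -> Dict[str, Any]:
    """Fetch one column family of one row without touching the other families."""
    return dict(t.get(row_key, {}).get(family, {}))


profile_1 = read_family(table, "user:1", "profile")
activity_2 = read_family(table, "user:2", "activity")  # empty: the row never stored it
```

Each row carries only the columns it actually uses, which reflects the point above that a column family may be an arbitrary collection of columns.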

4. Graph Databases
As the name implies, this data model allows objects to link and be linked by several other objects thus constructing a graph structure. Links
usually have additional properties to describe the relation between objects. Graph databases map more directly to object-oriented programming
models and are faster for highly associative data sets and graph queries. Furthermore, they typically support ACID transaction properties in the
same way as most RDBMS.
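A small sketch of the graph model, assuming an in-memory adjacency list in which every link carries a property dictionary, is shown below. The node names, the `relation` property, and the `reachable` function are made up for the example; a real graph database would expose its own query language.

```python
from collections import deque
from typing import Dict, List, Tuple

# node -> list of (neighbor, link properties); data is invented for illustration
Graph = Dict[str, List[Tuple[str, Dict[str, str]]]]

graph: Graph = {
    "alice": [("bob", {"relation": "friend"}), ("acme", {"relation": "works_at"})],
    "bob": [("carol", {"relation": "friend"})],
    "carol": [("acme", {"relation": "works_at"})],
    "acme": [],
}


def reachable(g: Graph, start: str, relation: str) -> List[str]:
    """Breadth-first traversal that follows only links with the given relation."""
    seen = {start}
    order: List[str] = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor, props in g.get(node, []):
            if props.get("relation") == relation and neighbor not in seen:
                seen.add(neighbor)
                order.append(neighbor)
                queue.append(neighbor)
    return order


friends_of_friends = reachable(graph, "alice", "friend")
# friends_of_friends == ["bob", "carol"]
```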

Preview of Some NoSQL Solutions


Some of the more successful NoSQL solutions that can potentially be integrated with Microsoft Azure Cloud are:

Redis
Redis is an open source, BSD-licensed, advanced key-value store. It is often referred to as a data structure server since keys can contain
strings, hashes, lists, sets, and sorted sets. Redis allows a user to set an expiration time for key-value pairs and requires all stored
data to fit into the server’s RAM. It is well suited for use as a distributed caching and session service. Data storage in RAM allows very
fast read/write operations. Furthermore, data is persisted to disk and, in the case of a server restart, can be restored to RAM for quick
access.

The approximate memory usage figures provided by the Redis developers are:

• An empty instance uses about 1 MB of memory
• 1 million keys with string values use about 100 MB of memory
• 1 million keys with hash values, each representing an object with 5 fields, use about 200 MB of memory

Various useful atomic operations are supported, such as increment and decrement for integer values.
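The memory figures above can be turned into a back-of-the-envelope capacity estimate. The sketch below assumes that memory grows linearly with the number of keys, which is a simplifying assumption and not something the figures guarantee; the function name `estimate_mb` is hypothetical.

```python
# Approximate figures quoted above, in MB per one million keys.
MB_PER_MILLION_STRING_KEYS = 100
MB_PER_MILLION_HASH_KEYS = 200


def estimate_mb(key_count: int, mb_per_million: float) -> float:
    """Linear extrapolation of memory use (an assumption, not a guarantee)."""
    return key_count / 1_000_000 * mb_per_million


# 5 million cached strings plus 250,000 five-field session objects.
strings_mb = estimate_mb(5_000_000, MB_PER_MILLION_STRING_KEYS)  # 500.0
hashes_mb = estimate_mb(250_000, MB_PER_MILLION_HASH_KEYS)       # 50.0
total_mb = strings_mb + hashes_mb                                # 550.0
```

Because all data must fit in RAM, an estimate like this is a quick check on whether a planned cache fits on a given server.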

CouchDB™
Apache CouchDB is a document store mainly targeted at mobile devices, with offline mode support. It uses JSON for document storage and
REST for its API. Field values are restricted to standard JSON types. CouchDB provides ACID transaction semantics, meaning that it can
handle a high volume of concurrent readers and writers without conflict. CouchDB also guarantees eventual consistency in order to provide
both availability and partition tolerance. Aggregation in CouchDB is done using a specialized view model similar to a map-reduce system,
and views are continuously updated and processed in parallel. CouchDB is a strong candidate for use on mobile devices and in
client-side-focused web browser applications.
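To illustrate the map-reduce style of aggregation mentioned above, here is a plain-Python analogy. It is not CouchDB's actual view API; the functions `map_order_totals` and `build_view` and the sample documents are assumptions made for the example.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Iterable, Iterator, List, Tuple

Document = Dict[str, Any]


def map_order_totals(doc: Document) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (customer, total) pair for order documents only."""
    if doc.get("type") == "order":
        yield doc["customer"], doc["total"]


def build_view(
    docs: Iterable[Document],
    map_fn: Callable[[Document], Iterator[Tuple[str, int]]],
    reduce_fn: Callable[[List[int]], int],
) -> Dict[str, int]:
    """Group the emitted values by key, then reduce each group to one value."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    return {key: reduce_fn(values) for key, values in grouped.items()}


docs = [
    {"type": "order", "customer": "ana", "total": 20},
    {"type": "order", "customer": "ben", "total": 15},
    {"type": "profile", "customer": "ana"},  # ignored by the map step
    {"type": "order", "customer": "ana", "total": 30},
]

totals = build_view(docs, map_order_totals, sum)
# totals == {"ana": 50, "ben": 15}
```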

MongoDB
MongoDB is a document store designed for high performance, high availability, and automatic scaling. Documents are saved in BSON
format (binary JSON), and field values, aside from the usual JSON types, can include other documents, arrays, and arrays of documents.
Every field can be indexed and queried. MongoDB uses a write lock, which blocks all other operations, including reads.

Also, MongoDB supports dynamic consistency, where each write operation can specify the guaranteed level of success for that operation. When
inserts, updates, and deletes have a weak write concern, write operations return quickly. In some failure cases, write operations issued
with weak write concerns may not persist. With stronger write concerns, clients wait after sending a write operation for MongoDB to
confirm the write operation. Additional notable features include:

• Geospatial indexing allowing location-based queries
• GridFS for very large file support

MongoDB can be used as primary storage for CMS content. Companies choose MongoDB for developing modern applications because it offers
the advantages of relational databases along with the innovations of NoSQL.
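The trade-off between weak and strong write concerns described earlier in this section can be pictured with a toy model. This is a deliberately simplified sketch and not MongoDB's driver API: the function `write_acknowledged`, the list of per-replica outcomes, and the meaning given to `w` here are all assumptions made for illustration.

```python
from typing import List


def write_acknowledged(replica_ok: List[bool], w: int) -> bool:
    """Client-side view of a write: True if at least `w` replicas confirmed it.

    w = 0 models a fire-and-forget write that returns without waiting.
    """
    if w == 0:
        return True  # returns immediately; the write may still have been lost
    return sum(replica_ok) >= w


replicas = [True, False, False]    # only one of three replicas stored the write
majority = len(replicas) // 2 + 1  # 2 of 3

weak = write_acknowledged(replicas, 0)           # True, although the data is at risk
single = write_acknowledged(replicas, 1)         # True, one replica confirmed
strong = write_acknowledged(replicas, majority)  # False, the client learns of the problem
```

In this model the weak setting never reports a failure, which mirrors the point that weakly acknowledged writes return quickly but may not persist.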

Why MongoDB?
MongoDB is designed to meet the demands of modern apps with a technology foundation that enables the developer and the end user through:

Best Way To Work With Data


• Easy: Work with data in a natural, intuitive way, while providing ACID guarantees to ensure data integrity
• Fast: Get great performance without a lot of work
• Flexible: Adapt and make changes quickly
• Versatile: Supports a wide variety of data and queries

Intelligently Put Data Where You Want It


• Availability: Deliver globally resilient platforms through sophisticated replication and failover
• Scalability: Grow horizontally through native sharding
• Workload Isolation: Run operational and analytical workloads in the same cluster
• Locality: Place data on specific devices and in specific geographies for governance, class of service, and low-latency access

Freedom To Run Anywhere


• Portability: Database that runs the same everywhere
• Cloud Agnostic: Leverage benefits of multi-cloud strategy with no lock-in
• Global coverage: 50+ regions across the major providers

The document model approach also simplifies query development and optimization. There’s no need to write complex code to manipulate text
and values into SQL and work with multiple tables. The figure below illustrates the difference between using the MongoDB query language and SQL
to insert a single user record, where users have multiple properties including name, all of their addresses, phone numbers, interests, and more.
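A rough sketch of the same contrast, using only the Python standard library, is given here. It does not use MongoDB itself: a nested dictionary serialized with `json` stands in for the document, and an in-memory `sqlite3` database stands in for the relational side. The table and field names are invented for the example.

```python
import json
import sqlite3

user = {
    "name": "Ana Example",
    "addresses": [
        {"type": "home", "city": "Chicago"},
        {"type": "work", "city": "Boston"},
    ],
    "interests": ["golf", "chess"],
}

# Document style: the whole record is one nested object, stored and read back as a unit.
stored_document = json.dumps(user)
restored = json.loads(stored_document)

# Relational style: the same record is split across tables linked by a foreign key.
# (Interests would need yet another table and are omitted here.)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE addresses (user_id INTEGER REFERENCES users(id), type TEXT, city TEXT)"
)
cur = conn.execute("INSERT INTO users (name) VALUES (?)", (user["name"],))
user_id = cur.lastrowid
conn.executemany(
    "INSERT INTO addresses (user_id, type, city) VALUES (?, ?, ?)",
    [(user_id, a["type"], a["city"]) for a in user["addresses"]],
)

# Reassembling the record requires a join across the tables.
rows = conn.execute(
    "SELECT u.name, a.city FROM users u JOIN addresses a ON a.user_id = u.id ORDER BY a.city"
).fetchall()
```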

Lift Chart: Compare and contrast DT and SVM.

The lift chart gives us a quick way of ranking the cases in the test (evaluation) sample according to their predicted probabilities of
success. Cases with large probabilities of success are listed first. Next to them we print the actual results; we assume that we know these, as we
are evaluating what would have happened if we had used this rule. We see that the cases we predict as successes (cases with probabilities of 0.5 or
larger) are in fact actual successes. The lift curve graphs the cumulative number of successes (after having sorted the cases according to their
predicted values in decreasing order) against the number of cases. The reference line expresses the performance of the naïve model, that is,
the number of successes expected when cases are picked without using the model.
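The construction of the lift curve can be sketched in a few lines of Python. The probabilities and outcomes below are invented toy data, and the function names `cumulative_lift` and `naive_baseline` are illustrative assumptions; the reference line is computed here as the overall success rate times the number of cases selected.

```python
from typing import List, Sequence


def cumulative_lift(probabilities: Sequence[float], actual: Sequence[int]) -> List[int]:
    """Cumulative successes after sorting cases by predicted probability, descending."""
    order = sorted(range(len(probabilities)), key=lambda i: probabilities[i], reverse=True)
    curve: List[int] = []
    running = 0
    for i in order:
        running += actual[i]
        curve.append(running)
    return curve


def naive_baseline(actual: Sequence[int]) -> List[float]:
    """Reference line: expected successes when cases are picked without a model."""
    rate = sum(actual) / len(actual)
    return [rate * (k + 1) for k in range(len(actual))]


probs = [0.9, 0.2, 0.8, 0.4, 0.7, 0.1]  # predicted probabilities of success (toy data)
actual = [1, 0, 1, 0, 0, 1]             # observed outcomes for the same cases

curve = cumulative_lift(probs, actual)  # [1, 2, 2, 2, 2, 3]
reference = naive_baseline(actual)      # [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
```

The further the curve rises above the reference line in the first few cases, the more the model helps in picking successes early.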

Decision trees and random forests are actually extremely good classifiers. Although SVMs (Support Vector Machines) are seen as more
complex, that does not mean they will perform better. The paper "An Empirical Comparison of Supervised Learning Algorithms" by Caruana
and Niculescu-Mizil (2006) compared ten different binary classifiers (SVM, neural networks, KNN, logistic regression, naive Bayes, random
forests, decision trees, bagged decision trees, boosted decision trees, and boosted stumps) on eleven different data sets and compared the
results on eight different performance metrics. They found that boosted decision trees came in first, with random forests second and
bagged decision trees third.

As Zaniewicz and Jaroszewicz (2013) put it, uplift modeling is a branch of machine learning which aims to predict the difference between
the class variable behavior in the treatment and control groups. Objects in the treatment group have been subject to some action, while
objects in the control group have not. By including the control group, it is possible to build a model which predicts the causal effect of
the action for a given individual. Their paper presents a variant of Support Vector Machines designed specifically for uplift modeling. The
SVM optimization task is reformulated to explicitly model the difference in class behavior between two datasets. The model predicts
whether a given object will have a positive, neutral, or negative response to a given action, and by tuning a parameter of the model the
analyst is able to influence the relative proportion of neutral predictions and thus the sensitivity of the model.

Traditional classification methods predict the conditional class probability distribution in a given dataset. Based on those predictions,
an action is often taken on the classified individuals. This approach is, however, usually incorrect, especially in the case of marketing
campaigns or controlled medical trials. Standard classification methods are only able to model what happens after the action has been
taken, not what happens because of the action. The reason is that such models do not take into account what would have happened had the
action not been taken.
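The distinction can be made concrete with a small numeric sketch. It computes a group-level uplift as the difference in response rates between a treatment and a control group; this is only the basic quantity that uplift modeling targets, not the uplift SVM of Zaniewicz and Jaroszewicz, and the outcome data are invented.

```python
from typing import Sequence


def response_rate(outcomes: Sequence[int]) -> float:
    """Share of successes (1) among the observed outcomes."""
    return sum(outcomes) / len(outcomes)


treated = [1, 1, 0, 1]  # outcomes of customers who received the campaign (toy data)
control = [1, 0, 0, 1]  # outcomes of customers who did not

rate_after_action = response_rate(treated)     # 0.75, what a standard classifier sees
rate_without_action = response_rate(control)   # 0.5, what would have happened anyway
uplift = rate_after_action - rate_without_action  # 0.25, attributable to the action
```

A model trained on the treated group alone would report a 75% success rate, although only 25 percentage points of it can be credited to the action.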

ROC charts: Compare and contrast DT and SVM.
The term ROC stands for Receiver Operating Characteristic. ROC curves were first employed in the study of discriminator systems for the
detection of radio signals in the presence of noise in the 1940s, following the attack on Pearl Harbor. The initial research was motivated by the
desire to determine how the US RADAR "receiver operators" had missed the Japanese aircraft.

ROC curves are frequently used to show graphically the connection/trade-off between sensitivity and specificity in a binary classification
setting. In addition, the area under the ROC curve gives an idea of the benefit of using the model in question.

ROC curves are used, in machine learning and elsewhere, to choose the most appropriate model for predicting the data. The best cut-off has
the highest true positive rate together with the lowest false positive rate. The area under an ROC curve is a measure of the usefulness of
a model in general: a greater area means the model represents reality more adequately.

Now ROC curves are frequently used to show the connection between clinical sensitivity and specificity for every possible cut-off for a test or a
combination of tests. In addition, the area under the ROC curve gives an idea about the benefit of using the test(s) in question.
                                         Process
Model                            True Outcome                      False Outcome
True Prediction for Outcome      True Positive Prediction (TP)     False Positive Prediction (FP)
False Prediction for Outcome     False Negative Prediction (FN)    True Negative Prediction (TN)

An ROC curve shows the relationship between model sensitivity and specificity for every possible cut-off. The ROC curve is a graph with:

The x-axis showing 1 – specificity (= false positive fraction = FP/(FP+TN))
The y-axis showing sensitivity (= true positive fraction = TP/(TP+FN))

Thus, every point on the ROC curve represents a chosen cut-off even though you cannot see this cut-off. What you can see is the true
positive fraction and the false positive fraction that you will get when you choose this cut-off.
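The definitions above translate directly into code. The sketch below, on invented scores and labels, computes one (false positive fraction, true positive fraction) point per cut-off and approximates the area under the curve with the trapezoidal rule. The function names `roc_points` and `auc` are illustrative, and the code assumes both classes are present in the labels.

```python
from typing import List, Sequence, Tuple


def roc_points(scores: Sequence[float], labels: Sequence[int]) -> List[Tuple[float, float]]:
    """Return (FP/(FP+TN), TP/(TP+FN)) pairs, one per cut-off, strictest first."""
    positives = sum(labels)
    negatives = len(labels) - positives  # assumes both classes occur
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(scores), reverse=True):
        # Every case scoring at or above the cut-off is predicted positive.
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points: Sequence[Tuple[float, float]]) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


scores = [0.9, 0.8, 0.7, 0.4, 0.2, 0.1]  # model scores (toy data)
labels = [1, 1, 0, 1, 0, 0]              # true outcomes

points = roc_points(scores, labels)
area = auc(points)  # 8/9, roughly 0.889, for this toy data
```

Comparing two classifiers such as a decision tree and an SVM then amounts to computing these points for each model's scores on the same test set and comparing the curves or their areas.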

An example of ROC curves comparing the classification performance of five machine learning classifiers, along with the R script used, is
presented by Heuristic Andrew (2009). In that comparison, SVM appears to have the lowest false positive rates.

Literature Sources
Caruana, R., Niculescu-Mizil, A. (2006): An Empirical Comparison of Supervised Learning Algorithms, Proceedings of the 23rd International
Conference on Machine Learning, Pittsburgh, PA, file:///C:/Users/tpapazisi/Downloads/caruana.icml06.pdf

Henschen, D. (2013): MetLife Uses NoSQL for Customer Service Breakthrough, Information Week.

heuristicandrew (2009): Compare performance of machine learning classifiers in R,
https://heuristically.wordpress.com/2009/12/23/compare-performance-machine-learning-classifiers-r/

Krisciunas, A. (2014): Benefits of NoSQL, https://www.devbridge.com/articles/benefits-of-nosql/

Lander, J. P. R for Everyone: Advanced Analytics and Graphics. [VitalSource Bookshelf]. Retrieved from
https://online.vitalsource.com/#/books/9781323582657/

Laudon, K. C., Laudon, J. P. (2015): Management Information Systems: Managing the Digital Firm, 15th Edition. Retrieved from
vbk://9781323187944

MongoDB (2018): A MongoDB White Paper: Top 5 Considerations When Evaluating NoSQL Databases,
https://webassets.mongodb.com/_com_assets/collateral/10gen_Top_5_NoSQL_Considerations.pdf?_ga=2.45945828.1554317026.1561312189-1896449855.1561312189

MongoDB (2018): A MongoDB White Paper: MongoDB Architecture Guide MongoDB 4.0,
file:///C:/Users/tpapazisi/Downloads/MongoDB_Architecture_Guide.pdf

Zaniewicz, L., Jaroszewicz, S. (2013): Support Vector Machines for Uplift Modeling, http://www.ipipan.waw.pl/~sj/pdf/upsvm.pdf
