Distributed Machine Learning With Pyspark Abdelaziz Testas Full Chapter

Distributed Machine Learning with
PySpark Abdelaziz Testas

Visit to download the full and correct content document:
https://ebookmass.com/product/distributed-machine-learning-with-pyspark-abdelaziz-t
estas/
Distributed Machine
Learning with PySpark
Migrating Effortlessly from Pandas
and Scikit-Learn
Abdelaziz Testas
Distributed Machine Learning with PySpark: Migrating

Effortlessly from Pandas
and Scikit-Learn
Abdelaziz Testas
Fremont, CA, USA
ISBN-13 (pbk): 978-1-4842-9750-6
ISBN-13 (electronic): 978-1-4842-9751-3
https://doi.org/10.1007/978-1-4842-9751-3
Copyright © 2023 by Abdelaziz Testas
This work is subject to copyright. All rights are reserved by the

Publisher, whether the whole or part of the material is concerned,
specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other
physical way, and transmission or information storage and retrieval,
electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
Trademarked names, logos, and images may appear in this book.

Rather than use a trademark symbol with every occurrence of a
trademarked name, logo, or image we use the names, logos, and
images only in an editorial fashion and to the benefit of the
trademark owner, with no intention of infringement of the
trademark.
The use in this publication of trade names, trademarks, service

marks, and similar terms, even if they are not identified as such, is
not to be taken as an expression of opinion as to whether or not
they are subject to proprietary rights.
While the advice and information in this book are believed to be true
and accurate at the date of publication, neither the authors nor the
editors nor the publisher can accept any legal responsibility for any
errors or omissions that may be made. The publisher makes no
warranty, express or implied, with respect to the material contained
herein.
Managing Director, Apress Media LLC: Welmoed Spahr
Acquisitions Editor: Celestin Suresh John
Development Editor: Laura Berendson
Coordinating Editor: Gryffin Winkler
Cover designed by eStudioCalamar
Cover image by Aakash Dhage on Unsplash (www.unsplash.com)
Distributed to the book trade worldwide by Apress Media, LLC, 1

New York Plaza, New York, NY 10004, U.S.A. Phone 1-800-
SPRINGER, fax (201) 348-4505, e-mail orders-ny@springer-
sbm.com, or visit www.springeronline.com. Apress Media, LLC is a
California LLC and the sole member (owner) is Springer Science +
Business Media Finance Inc (SSBM Finance Inc). SSBM Finance Inc is
a Delaware corporation.
For information on translations, please e-mail
booktranslations@springernature.com; for reprint, paperback, or
audio rights, please e-mail bookpermissions@springernature.com.
Apress titles may be purchased in bulk for academic, corporate, or

promotional use. eBook versions and licenses are also available for
most titles. For more information, reference our Print and eBook
Bulk Sales web page at http://www.apress.com/bulk-sales.
Any source code or other supplementary material referenced by the

author in this book is available to readers on GitHub
(https://github.com/Apress). For more detailed information, please
visit https://www.apress.com/gp/services/source-code.
Paper in this product is recyclable
I dedicate this book to my family and colleagues
for support and encouragement.
Table of Contents
About the Author

��
��
��
�� xiii About the Technical Reviewer
��
��
�� xv Acknowledgments
��
��
��
�� xvii Introduction
��
��
��
�� xix
Chapter 1: An Easy Transition

��
��
�� 1
PySpark and Pandas Integration

��
��
��
�� 2
Similarity in Syntax
��
��
��
�� 4
Loading Data
��
��
��
��
�� 5
Selecting Columns
��
��
��
�� 6
Aggregating Data
��
��
��
�� 9
Filtering Data
��
��
��
��
�� 11
Joining Data
��
��
��
��
�� 14
Saving Data
��
��
��
��
�� 17
Modeling Steps
��
��
��
��
18
Pipelines
��
��
��
��
�� 20
Summary��
��
��
��
�� 23
Chapter 2: Selecting Algorithms

��
��
�� 25
The Dataset
��
��
��
��
�� 26
Selecting Algorithms with Cross-Validation

��
��
�� 34
Scikit-Learn
��
��
��
��
36
PySpark
��
��
��
��
�� 42
Bringing It All Together

��
��
��
�� 47
Table of ConTenTs
Scikit-Learn
��
��
��
��
�� 47
PySpark
��
��
��
��
�� 49
Summary��
��
��
��
�� 51
Chapter 3: Multiple Linear Regression with Pandas, Scikit-Learn,
and PySpark
��
��
��
�� 53
The Dataset
��
��
��
��
�� 54
Multiple Linear Regression

��
��
��
�� 60
Multiple Linear Regression with Scikit-Learn

��
��
�� 63
Multiple Linear Regression with PySpark

��
��
�� 67
Summary��
��
��
��
�� 73
Chapter 4: Decision Tree Regression with Pandas, Scikit-Learn, and

PySpark �� 75
The Dataset
��
��
��
��
�� 76
Decision Tree Regression

��
��
��
�� 83
Decision Tree Regression with Scikit-Learn

��
��
�� 85
Decision Tree Regression with PySpark

��
��
�� 100

��
��
��
�� 109
Scikit-Learn
��
��
��
��
�� 109
PySpark
��
��
��
��
�� 111
Summary��
��
��
��
�� 113
Chapter 5: Random Forest Regression with Pandas, Scikit-Learn,
and PySpark
��
��
��
�� 115
The Dataset
��
��
��
��
�� 116
Random Forest Regression

��
��
��
�� 120
Random Forest with Scikit-Learn

��
��
�� 122
Random Forest with PySpark

��
��
��
� 129

��
��
��
�� 137
Scikit-Learn
��
��
��
��
�� 137
PySpark
��
��
��
��
�� 139
Summary��
��
��
��
�� 142
vi
Table of ConTenTs
Chapter 6: Gradient-Boosted Tree Regression with Pandas, Scikit-

Learn,
and PySpark
��
��
��
�� 143
The Dataset
��
��
��
��
�� 144
Gradient-Boosted Tree (GBT) Regression

��
��
�� 148
GBT with Scikit-Learn

��
��
��
�� 150
GBT with PySpark

��
��
��
�� 157

��
��
��
�� 166
Scikit-Learn
��
��
��
�� 167
PySpark
��
��
��
��
�� 169
Summary��
��
��
��
�� 171
Chapter 7: Logistic Regression with Pandas, Scikit-Learn, and

PySpark �� 173
The Dataset
��
��
��
��
�� 174
Logistic Regression
��
��
��
�� 180
Logistic Regression with Scikit-Learn

��
��
�� 183
Logistic Regression with PySpark

��
��
��
192
Putting It All Together

��
��
��
�� 204
Scikit-Learn
��
��
��
�� 204
PySpark
��
��
��
��
�� 208
Summary��
��
��
��
�� 212
Chapter 8: Decision Tree Classification with Pandas, Scikit-Learn,
and PySpark
��
��
��
�� 213
The Dataset
��
��
��
��
�� 214
Decision Tree Classification

��
��
��
�� 221
Scikit-Learn and PySpark Similarities

��
��
�� 222
Decision Tree Classification with Scikit-Learn

��
��
�� 224
Decision Tree Classification with PySpark

��
��
�� 230

��
��
��
�� 235
Scikit-Learn
��
��
��
�� 235
PySpark
��
��
��
��
�� 238
Summary��
��
��
��
�� 241
vii
Table of ConTenTs
Chapter 9: Random Forest Classification with Scikit-Learn and

PySpark �� 243
Random Forest Classification

��
��
��
�� 244
Scikit-Learn and PySpark Similarities for Random Forests

��
�� 244
Random Forests with Scikit-Learn

��
��
�� 245
Random Forests with PySpark

��
��
��
248

��
��
��
�� 253
Scikit-Learn
��
��
��
�� 253
PySpark
��
��
��
��
�� 255
Summary��
��
��
��
�� 258
Chapter 10: Support Vector Machine Classification with Pandas,
Scikit-Learn, and PySpark

��
��
�� 259
The Dataset
��
��
��
��
�� 260
Support Vector Machine Classification

��
��
�� 264
Linear SVM with Scikit-Learn

��
��
��
� 264
Linear SVM with PySpark

��
��
��
�� 269
��
��
��
�� 276
Scikit-Learn
��
��
��
�� 277
PySpark
��
��
��
��
�� 278
Summary��
��
��
��
�� 280
Chapter 11: Naive Bayes Classification with Pandas, Scikit-Learn,
and PySpark
��
��
��
�� 281
The Dataset
��
��
��
��
�� 282
Naive Bayes Classification

��
��
��
�� 287
Naive Bayes with Scikit-Learn

��
��
��
�� 287
Naive Bayes with PySpark

��
��
��
�� 290

��
��
��
�� 293
Scikit-Learn
��
��
��
�� 294
PySpark
��
��
��
��
�� 295
Summary��
��
��
��
�� 297
viii
Table of ConTenTs
Chapter 12: Neural Network Classification with Pandas, Scikit-Learn,
and PySpark
��
��
��
�� 299
The Dataset
��
��
��
��
�� 300
MLP Classification
��
��
��
�� 307
MLP Classification with Scikit-Learn

��
��
�� 310
MLP Classification with PySpark

��
��
�� 314

��
��
��
�� 324
Scikit-Learn
��
��
��
�� 324
PySpark
��
��
��
��
�� 325
Summary��
��
��
��
�� 327
Chapter 13: Recommender Systems with Pandas, Surprise, and

PySpark �� 329
The Dataset
��
��
��
��
�� 331
Building a Recommender System

��
��
��
339
Recommender System with Surprise

��
��
�� 339
Recommender System with PySpark

��
��
�� 345

��
��
��
�� 350
Surprise
��
��
��
��
�� 351
PySpark
��
��
��
��
�� 352
Summary��
��
��
��
�� 353
Chapter 14: Natural Language Processing with Pandas, Scikit-Learn,
and PySpark
��
��
��
�� 355
The Dataset
��
��
��
��
�� 356
Cleaning, Tokenization, and Vectorization

��
��
�� 362
Naive Bayes
Classification��
��
��
�� 371
Naive Bayes with Scikit-Learn

��
��
��
373
Naive Bayes with PySpark

��
��
��
�� 378

��
��
��
�� 389
Scikit-Learn
��
��
��
�� 389
PySpark
��
��
��
��
�� 391
Summary��
��
��
��
�� 394
ix
Table of ConTenTs
Chapter 15: k-Means Clustering with Pandas, Scikit-Learn, and

PySpark �� 395
The Dataset
��
��
��
��
�� 396
Machine Learning with k-Means

��
��
��
� 400
k-Means Clustering with Scikit-Learn

��
��
�� 400
k-Means Clustering with PySpark

��
��
�� 408

��
��
��
�� 412
Scikit-Learn
��
��
��
�� 412
PySpark
��
��
��
��
�� 414
Summary��
��
��
��
�� 416
Chapter 16: Hyperparameter Tuning with Scikit-Learn and PySpark

�� 417
Examples of Hyperparameters
��
��
��
�� 417
Tuning the Parameters of a Random Forest

��
��
�� 419
Hyperparameter Tuning in Scikit-Learn
��
��
�� 420
Hyperparameter Tuning in PySpark

��
��
�� 426

��
��
��
�� 436
Scikit-Learn
��
��
��
�� 436
PySpark
��
��
��
��
�� 438
Summary��
��
��
��
�� 440
Chapter 17: Pipelines with Scikit-Learn and PySpark

��
�� 441
The Significance of Pipelines

��
��
��
�� 442
Pipelines with Scikit-

Learn��
��
��
�� 443
Pipelines with PySpark

��
��
��
�� 450

��
��
��
�� 458
Scikit-Learn
��
��
��
�� 458
PySpark
��
��
��
��
�� 459
Summary��
��
��
��
�� 461
Table of ConTenTs
Chapter 18: Deploying Models in Production with Scikit-Learn and

PySpark �� 463
Steps in Model Deployment

��
��
��
�� 465
Deploying a Multilayer Perceptron (MLP)

��
��
�� 466
Deployment with Scikit-Learn

��
��
��
466
PySpark
��
��
��
��
�� 470

��
��
��
�� 478
Scikit-Learn
��
��
��
�� 479
PySpark
��
��
��
��
�� 480
Summary��
��
��
��
�� 482
Index
��
��
��
�� 483
xi
About the Author
Abdelaziz Testas, PhD, is a data scientist with over a decade of

experience in data analysis and machine learning,
specializing in the use of standard Python libraries and Spark
distributed computing. He holds a PhD in Economics from
the University of Leeds and a master’s degree in Finance
from the University of Glasgow. He has completed several
certificates in computer science and data science.
In the last ten years, the author worked for Nielsen
in Fremont, California, as a lead data scientist, focusing
on improving the company’s audience measurement
by planning, initiating, and executing end-to-end data

science projects and methodology work, and drove advanced
solutions into Nielsen’s digital ad and content rating products by
leveraging subject matter expertise in media measurement and data
science. The author is passionate about helping others improve their
machine learning skills and workflows and is excited to share his
knowledge and experience with a wider audience through this book.
xiii
About the Technical Reviewer
Bharath Kumar Bolla has over 12 years of experience and
is a senior data scientist at Salesforce, Hyderabad. Bharath
obtained an MS in Data Science from the University of
Arizona, USA. He also had a master’s in Life Sciences from
Mississippi State University, USA. Bharath worked as a
research scientist for around seven years at the University
of Georgia, Emory University, and Eurofins LLC. At Verizon,
Bharath led a team to build a “Smart Pricing” solution, and

at Happiest Minds, he worked on AI-based digital marketing
products. Along with his day-to-day responsibilities, he is a mentor

and an active researcher with more than 20 publications in
conferences and journals. Bharath received the “40 Under 40 Data
Scientists 2021” award from Analytics India Magazine for his
accomplishments.
xv
Acknowledgments
I would like to express my gratitude to all those who have directly or

indirectly contributed to the creation and publication of this book.
First and foremost, I am deeply thankful to my family for their
unwavering support and encouragement throughout this journey.
I would like to acknowledge the invaluable assistance of Apress in

bringing this book to fruition. Specifically, I wish to extend my
heartfelt thanks to Celestin John, the Acquisitions Editor for AI and
Machine Learning, for his guidance in shaping the main themes of
this work and providing continuous feedback during the initial stages
of the book. I am also grateful to Nirmal Selvaraj for his dedication
and support as my main point of contact throughout the
development cycle of the book. Additionally, I would like to express
my appreciation to Laura Berendson for her advisory role and
invaluable contributions as our Development Editor.
Lastly, I extend my gratitude to the technical reviewer Bharath

Kumar Bolla who diligently reviewed the manuscript and provided
valuable suggestions for improvement.
Your meticulousness and expertise have significantly enhanced the

quality and clarity of the final product.
To all those who have contributed, whether mentioned individually

or not, your contributions have played an integral part in making this
book a reality. Thank you for your support, feedback, and
commitment to this project.
xvii
Introduction
In recent years, the amount of data generated and collected by

companies and organizations has grown exponentially. As a result,
data scientists have been pushed to process and analyze large
amounts of data, and traditional single-node computing tools such
as Pandas and Scikit-Learn have become inadequate. In response,
many data scientists have turned to distributed computing
frameworks such as Apache Spark, with its Python-based interface,
PySpark.
PySpark has several advantages over single-node computing,

including the ability to handle large volumes of data and the
potential for significantly faster data processing times. Furthermore,
because PySpark is built on top of Spark, a widely used distributed
computing framework, it also offers a broader set of tools for data
processing and machine learning.
While transitioning from Pandas and Scikit-Learn to PySpark may

seem daunting, the transition can be relatively straightforward.
Pandas/Scikit-Learn and PySpark offer similar APIs, which means
that many data scientists can easily transition from one to the other.
In this context, this book will explore the benefits of using PySpark
over traditional single-node computing tools and provide guidance
for data scientists who are considering transitioning to PySpark.
In this book, we aim to provide a comprehensive overview of the

main machine learning algorithms with a particular focus on
regression and classification. These are fundamental techniques that
form the backbone of many practical applications of machine
learning. We will cover popular methods such as linear and logistic
regression, decision trees, random forests, gradient-boosted trees,
support vector machines, Naive Bayes, and neural networks. We will
also discuss how these algorithms can be applied to real-world
problems, such as predicting house prices, and the likelihood of
diabetes as well as classifying handwritten digits or the species of an
Iris flower and predicting whether a tumor is benign or malignant.
Whether you are a beginner or an experienced practitioner, this book
is designed to help you understand the core concepts of machine
learning and develop the skills needed to apply these methods in
practice.
xix
InTroduCTIon
This book spans 18 chapters and covers multiple topics. The first
two chapters examine why migration from Pandas and Scikit- Learn
to PySpark can be a seamless process, and address the challenges
of selecting an algorithm. Chapters 3–6 build, train, and evaluate
some popular regression models, namely, multiple linear regression,
decision trees, random forests, and gradient-boosted trees, and use
them to deal with some real-world tasks such as predicting house
prices. Chapters 7–12 deal with classification issues by building,
training, and evaluating widely used algorithms such as logistic
regression, decision trees, random forests, support vector machines,
Naive Bayes, and neural networks. In Chapters 13–15, we examine
three additional types of algorithms, namely, recommender systems,
natural language processing, and clustering with k-means. In the
final three chapters, we deal with hyperparameter tuning, pipelines,
and deploying models into production.
xx
CHAPTER 1
An Easy Transition
One of the key factors in making the transition from Pandas and
Scikit-Learn to PySpark relatively easy is the similarity in
functionality. This similarity will become evident after reading this
chapter and executing the code described herein.
One of the easiest ways to test the code is by signing up for an

online Databricks Community Edition account and creating a
workspace. Databricks provides detailed documentation on how to
create a cluster, upload data, and create a notebook.
Additionally, Spark can also be installed locally through the pip install
pyspark command.
Another option is Google Colab. PySpark is preinstalled on Colab by

default; otherwise, it can be installed using the !pip install pyspark
command in a Colab notebook. This command will install PySpark
and its dependencies in the Colab environment. While both provide
Jupyter-like notebooks, one advantage of Databricks is that Colab is
a single-core instance, whereas Databricks provides multi-node
clusters for parallel processing. This feature makes Databricks better
suited for handling larger datasets and more complex computational
tasks in a collaborative team environment.
Although Pandas and Scikit-Learn are tools primarily designed for

small data processing and analysis, while PySpark is a big data
processing framework, PySpark offers functionality similar to Pandas
and Scikit-Learn. This includes DataFrame operations and machine
learning algorithms. The presence of these familiar functionalities in
PySpark facilitates a smoother transition for data scientists
accustomed to working with Pandas and Scikit-Learn.
In this chapter, we examine in greater depth the factors that

contribute to the ease of transition from these small data tools
(Pandas and Scikit-Learn) to PySpark. More specifically, we focus on
PySpark and Pandas integration and the similarity in syntax between
PySpark, on the one hand, and Pandas and Scikit-Learn, on the
other.
© Abdelaziz Testas 2023
A. Testas, Distributed Machine Learning with PySpark,

https://doi.org/10.1007/978-1-4842-9751-3_1
ChApteR 1 An eASy tRAnSition
PySpark and Pandas Integration
PySpark is well integrated with Pandas. Both the DataFrame and

MLlib (Machine Learning Library) designs in PySpark were inspired
by Pandas (and Scikit-Learn) concepts. The DataFrame concept in
PySpark is similar to that of Pandas in that the data is stored in rows
and columns similar to relational database tables and excel sheets.
Consequently, it is easy to convert a PySpark DataFrame to a Pandas

DataFrame and vice-versa, making the integration between the two
packages straightforward.
A PySpark DataFrame can be converted to a Pandas DataFrame by

using toPandas() method, while a Pandas DataFrame can be
converted to a PySpark DataFrame using the createDataFrame()
method.
Let’s start with an example of how to convert a Pandas DataFrame

to a PySpark DataFrame using the PySpark createDataFrame()
method:
Step 1: Import Pandas
[In]: import pandas as pd
Step 2: Create a Pandas DataFrame

[In]: data = {'Country': ['United States', 'Brazil', 'Russia'], 'River':
['Mississippi', 'Amazon', 'Volga']}
[In]: pandas_df = pd.DataFrame(data)
Step 3: Convert Pandas DataFrame to PySpark DataFrame
[In]: from pyspark.sql import SparkSession
[In]: spark =
SparkSession.builder.appName("BigRivers").getOrCreate()
[In]: pyspark_df = spark.createDataFrame(pandas_df)
Step 4: Display the content of the PySpark DataFrame
[In]: pyspark_df.show()
[Out]:
Country
River
United States
Mississippi
Brazil
Amazon
Russia
Volga
2
Notice that before converting the Pandas DataFrame to the PySpark

DataFrame using the PySpark createDataFrame() method in step 3,
we needed to create a Spark Session named spark. There are two
lines of code to achieve this. In the first line (from pyspark.sql
import SparkSession), we imported the SparkSession class from the
pyspark.sql module. In the second line, we created a new instance
of the Spark Session named spark (spark =
SparkSession.builder.appName(“BigRivers”).getOrCreate()). The
following is what each part of this line of code does:
• SparkSession.builder: This starts the process of configuring and
building a new Spark Session.
• appName(“BigRivers”): The appName method sets a name for the
Spark application.
• getOrCreate(): This method either retrieves an existing Spark

Session if one exists or creates a new one if none is found. It
ensures that only one Spark Session is active at a time.
This Spark Session concept is fundamental in Apache Spark as it is

an entry point to interacting with its features and APIs in a
structured manner. It provides an environment for managing
configurations, creating Spark DataFrame and Dataset objects,
executing Spark operations, and coordinating tasks across a cluster.
This is particularly important if we are running the code in an

environment like Google Colab. When using Databricks notebooks,
however, the Spark Session is already created and initialized, so we
can directly use the spark object to interact with Spark features, run
queries, create DataFrames, and use various Spark libraries. We can
now do the reverse: Create a PySpark DataFrame and then convert
it to a Pandas DataFrame using the toPandas() method:
Step 1: Import Spark Session
[In]: from pyspark.sql import SparkSession
Step 2: Create a Spark Session
[In]: spark =
SparkSession.builder.appName("BigRivers").getOrCreate() Step 3:
Create a PySpark DataFrame
[In]: data = [('United States', 'Mississippi'), ('Brazil', 'Amazon'),
('Russia', 'Volga')]
[In]: spark_df = spark.createDataFrame(data, ["Country", "River"])

3
Step 4: Convert Spark DataFrame to a Pandas DataFrame
[In]: pandas_df = spark_df.toPandas()
Step 5: Display the data
[In]: print(pandas_df)
[Out]:
Country
River
United States
Mississippi
1
Brazil
Amazon
Russia
Volga
Notice that Pandas displays an index column in the output, providing

a unique identifier for each row, whereas PySpark does not explicitly
show an index column in its DataFrame output.
As we have seen in this section, it is easy to toggle between Pandas

and PySpark, thanks to the close integration between the two
libraries.
Similarity in Syntax
Another key factor that contributes to the smooth transition from

small data tools (Pandas and Scikit-Learn) to big data with PySpark
is the familiarity of syntax. PySpark shares a similar syntax with
Pandas and Scikit-Learn in many cases. For example, square
brackets ([]) can be used on Databricks or Google Colab to select
columns directly from a PySpark DataFrame, just like in Pandas.
Similarly, PySpark provides a DataFrame API that resembles Pandas,
and Spark MLlib (Machine Learning Library) includes
implementations of various machine learning algorithms found in
Scikit-Learn.
In this section, we present examples of how PySpark code is similar

to Pandas and Scikit-Learn, facilitating an easy transition for data
scientists between these tools.
Another random document with
no related content on Scribd:
PLEASE READ THIS BEFORE YOU DISTRIBUTE OR USE THIS WORK
To protect the Project Gutenberg™ mission of promoting the

free distribution of electronic works, by using or distributing this
work (or any other work associated in any way with the phrase
“Project Gutenberg”), you agree to comply with all the terms of
the Full Project Gutenberg™ License available with this file or
online at www.gutenberg.org/license.
Section 1. General Terms of Use and

Redistributing Project Gutenberg™
electronic works
1.A. By reading or using any part of this Project Gutenberg™
electronic work, you indicate that you have read, understand,
agree to and accept all the terms of this license and intellectual
property (trademark/copyright) agreement. If you do not agree to
abide by all the terms of this agreement, you must cease using
and return or destroy all copies of Project Gutenberg™
electronic works in your possession. If you paid a fee for
obtaining a copy of or access to a Project Gutenberg™
electronic work and you do not agree to be bound by the terms
of this agreement, you may obtain a refund from the person or
entity to whom you paid the fee as set forth in paragraph 1.E.8.
1.B. “Project Gutenberg” is a registered trademark. It may only

be used on or associated in any way with an electronic work by
people who agree to be bound by the terms of this agreement.
There are a few things that you can do with most Project
Gutenberg™ electronic works even without complying with the
full terms of this agreement. See paragraph 1.C below. There
are a lot of things you can do with Project Gutenberg™
electronic works if you follow the terms of this agreement and
help preserve free future access to Project Gutenberg™
electronic works. See paragraph 1.E below.
1.C. The Project Gutenberg Literary Archive Foundation (“the
Foundation” or PGLAF), owns a compilation copyright in the
collection of Project Gutenberg™ electronic works. Nearly all the
individual works in the collection are in the public domain in the
United States. If an individual work is unprotected by copyright
law in the United States and you are located in the United
States, we do not claim a right to prevent you from copying,
distributing, performing, displaying or creating derivative works
based on the work as long as all references to Project
Gutenberg are removed. Of course, we hope that you will
support the Project Gutenberg™ mission of promoting free
access to electronic works by freely sharing Project
Gutenberg™ works in compliance with the terms of this
agreement for keeping the Project Gutenberg™ name
associated with the work. You can easily comply with the terms
of this agreement by keeping this work in the same format with
its attached full Project Gutenberg™ License when you share it
without charge with others.
1.D. The copyright laws of the place where you are located also
govern what you can do with this work. Copyright laws in most
countries are in a constant state of change. If you are outside
the United States, check the laws of your country in addition to
the terms of this agreement before downloading, copying,
displaying, performing, distributing or creating derivative works
based on this work or any other Project Gutenberg™ work. The
Foundation makes no representations concerning the copyright
status of any work in any country other than the United States.
1.E. Unless you have removed all references to Project

Gutenberg:
1.E.1. The following sentence, with active links to, or other

immediate access to, the full Project Gutenberg™ License must
appear prominently whenever any copy of a Project
Gutenberg™ work (any work on which the phrase “Project
Gutenberg” appears, or with which the phrase “Project
Gutenberg” is associated) is accessed, displayed, performed,
viewed, copied or distributed:
This eBook is for the use of anyone anywhere in the United

States and most other parts of the world at no cost and with
almost no restrictions whatsoever. You may copy it, give it
away or re-use it under the terms of the Project Gutenberg
License included with this eBook or online at
www.gutenberg.org. If you are not located in the United
States, you will have to check the laws of the country where
you are located before using this eBook.
1.E.2. If an individual Project Gutenberg™ electronic work is

derived from texts not protected by U.S. copyright law (does not
contain a notice indicating that it is posted with permission of the
copyright holder), the work can be copied and distributed to
anyone in the United States without paying any fees or charges.
If you are redistributing or providing access to a work with the
phrase “Project Gutenberg” associated with or appearing on the
work, you must comply either with the requirements of
paragraphs 1.E.1 through 1.E.7 or obtain permission for the use
of the work and the Project Gutenberg™ trademark as set forth
in paragraphs 1.E.8 or 1.E.9.
1.E.3. If an individual Project Gutenberg™ electronic work is

posted with the permission of the copyright holder, your use and
distribution must comply with both paragraphs 1.E.1 through
1.E.7 and any additional terms imposed by the copyright holder.
Additional terms will be linked to the Project Gutenberg™
License for all works posted with the permission of the copyright
holder found at the beginning of this work.
1.E.4. Do not unlink or detach or remove the full Project

Gutenberg™ License terms from this work, or any files
containing a part of this work or any other work associated with
Project Gutenberg™.
1.E.5. Do not copy, display, perform, distribute or redistribute
this electronic work, or any part of this electronic work, without
prominently displaying the sentence set forth in paragraph 1.E.1
with active links or immediate access to the full terms of the
Project Gutenberg™ License.
1.E.6. You may convert to and distribute this work in any binary,
compressed, marked up, nonproprietary or proprietary form,
including any word processing or hypertext form. However, if
you provide access to or distribute copies of a Project
Gutenberg™ work in a format other than “Plain Vanilla ASCII” or
other format used in the official version posted on the official
Project Gutenberg™ website (www.gutenberg.org), you must, at
no additional cost, fee or expense to the user, provide a copy, a
means of exporting a copy, or a means of obtaining a copy upon
request, of the work in its original “Plain Vanilla ASCII” or other
form. Any alternate format must include the full Project
Gutenberg™ License as specified in paragraph 1.E.1.
1.E.7. Do not charge a fee for access to, viewing, displaying,

performing, copying or distributing any Project Gutenberg™
works unless you comply with paragraph 1.E.8 or 1.E.9.
1.E.8. You may charge a reasonable fee for copies of or

providing access to or distributing Project Gutenberg™
electronic works provided that:
• You pay a royalty fee of 20% of the gross profits you derive from
the use of Project Gutenberg™ works calculated using the
method you already use to calculate your applicable taxes. The
fee is owed to the owner of the Project Gutenberg™ trademark,
but he has agreed to donate royalties under this paragraph to
the Project Gutenberg Literary Archive Foundation. Royalty
payments must be paid within 60 days following each date on
which you prepare (or are legally required to prepare) your
periodic tax returns. Royalty payments should be clearly marked
as such and sent to the Project Gutenberg Literary Archive
Foundation at the address specified in Section 4, “Information
about donations to the Project Gutenberg Literary Archive
Foundation.”
• You provide a full refund of any money paid by a user who

notifies you in writing (or by e-mail) within 30 days of receipt that
s/he does not agree to the terms of the full Project Gutenberg™
License. You must require such a user to return or destroy all
copies of the works possessed in a physical medium and
discontinue all use of and all access to other copies of Project
Gutenberg™ works.
• You provide, in accordance with paragraph 1.F.3, a full refund of

any money paid for a work or a replacement copy, if a defect in
the electronic work is discovered and reported to you within 90
days of receipt of the work.
• You comply with all other terms of this agreement for free
distribution of Project Gutenberg™ works.
1.E.9. If you wish to charge a fee or distribute a Project

Gutenberg™ electronic work or group of works on different
terms than are set forth in this agreement, you must obtain
permission in writing from the Project Gutenberg Literary
Archive Foundation, the manager of the Project Gutenberg™
trademark. Contact the Foundation as set forth in Section 3
below.
1.F.
1.F.1. Project Gutenberg volunteers and employees expend

considerable effort to identify, do copyright research on,
transcribe and proofread works not protected by U.S. copyright
law in creating the Project Gutenberg™ collection. Despite
these efforts, Project Gutenberg™ electronic works, and the
medium on which they may be stored, may contain “Defects,”
such as, but not limited to, incomplete, inaccurate or corrupt
data, transcription errors, a copyright or other intellectual
property infringement, a defective or damaged disk or other
medium, a computer virus, or computer codes that damage or
cannot be read by your equipment.
1.F.2. LIMITED WARRANTY, DISCLAIMER OF DAMAGES -

Except for the “Right of Replacement or Refund” described in
paragraph 1.F.3, the Project Gutenberg Literary Archive
Foundation, the owner of the Project Gutenberg™ trademark,
and any other party distributing a Project Gutenberg™ electronic
work under this agreement, disclaim all liability to you for
damages, costs and expenses, including legal fees. YOU
AGREE THAT YOU HAVE NO REMEDIES FOR NEGLIGENCE,
STRICT LIABILITY, BREACH OF WARRANTY OR BREACH
OF CONTRACT EXCEPT THOSE PROVIDED IN PARAGRAPH
1.F.3. YOU AGREE THAT THE FOUNDATION, THE
TRADEMARK OWNER, AND ANY DISTRIBUTOR UNDER
THIS AGREEMENT WILL NOT BE LIABLE TO YOU FOR
ACTUAL, DIRECT, INDIRECT, CONSEQUENTIAL, PUNITIVE
OR INCIDENTAL DAMAGES EVEN IF YOU GIVE NOTICE OF
THE POSSIBILITY OF SUCH DAMAGE.
1.F.3. LIMITED RIGHT OF REPLACEMENT OR REFUND - If

you discover a defect in this electronic work within 90 days of
receiving it, you can receive a refund of the money (if any) you
paid for it by sending a written explanation to the person you
received the work from. If you received the work on a physical
medium, you must return the medium with your written
explanation. The person or entity that provided you with the
defective work may elect to provide a replacement copy in lieu
of a refund. If you received the work electronically, the person or
entity providing it to you may choose to give you a second
opportunity to receive the work electronically in lieu of a refund.
If the second copy is also defective, you may demand a refund
in writing without further opportunities to fix the problem.
1.F.4. Except for the limited right of replacement or refund set

forth in paragraph 1.F.3, this work is provided to you ‘AS-IS’,
WITH NO OTHER WARRANTIES OF ANY KIND, EXPRESS
OR IMPLIED, INCLUDING BUT NOT LIMITED TO
WARRANTIES OF MERCHANTABILITY OR FITNESS FOR
ANY PURPOSE.
1.F.5. Some states do not allow disclaimers of certain implied

warranties or the exclusion or limitation of certain types of
damages. If any disclaimer or limitation set forth in this
agreement violates the law of the state applicable to this
agreement, the agreement shall be interpreted to make the
maximum disclaimer or limitation permitted by the applicable
state law. The invalidity or unenforceability of any provision of
this agreement shall not void the remaining provisions.
1.F.6. INDEMNITY - You agree to indemnify and hold the

Foundation, the trademark owner, any agent or employee of the
Foundation, anyone providing copies of Project Gutenberg™
electronic works in accordance with this agreement, and any
volunteers associated with the production, promotion and
distribution of Project Gutenberg™ electronic works, harmless
from all liability, costs and expenses, including legal fees, that
arise directly or indirectly from any of the following which you do
or cause to occur: (a) distribution of this or any Project
Gutenberg™ work, (b) alteration, modification, or additions or
deletions to any Project Gutenberg™ work, and (c) any Defect
you cause.
Section 2. Information about the Mission of

Project Gutenberg™
Project Gutenberg™ is synonymous with the free distribution of
electronic works in formats readable by the widest variety of
computers including obsolete, old, middle-aged and new
computers. It exists because of the efforts of hundreds of
volunteers and donations from people in all walks of life.
Volunteers and financial support to provide volunteers with the

assistance they need are critical to reaching Project
Gutenberg™’s goals and ensuring that the Project Gutenberg™
collection will remain freely available for generations to come. In
2001, the Project Gutenberg Literary Archive Foundation was
created to provide a secure and permanent future for Project
Gutenberg™ and future generations. To learn more about the
Project Gutenberg Literary Archive Foundation and how your
efforts and donations can help, see Sections 3 and 4 and the
Foundation information page at www.gutenberg.org.
Section 3. Information about the Project

Gutenberg Literary Archive Foundation
The Project Gutenberg Literary Archive Foundation is a non-
profit 501(c)(3) educational corporation organized under the
laws of the state of Mississippi and granted tax exempt status by
the Internal Revenue Service. The Foundation’s EIN or federal
tax identification number is 64-6221541. Contributions to the
Project Gutenberg Literary Archive Foundation are tax
deductible to the full extent permitted by U.S. federal laws and
your state’s laws.
The Foundation’s business office is located at 809 North 1500

West, Salt Lake City, UT 84116, (801) 596-1887. Email contact
links and up to date contact information can be found at the
Foundation’s website and official page at
www.gutenberg.org/contact
Section 4. Information about Donations to

the Project Gutenberg Literary Archive
Foundation
Project Gutenberg™ depends upon and cannot survive without
widespread public support and donations to carry out its mission
of increasing the number of public domain and licensed works
that can be freely distributed in machine-readable form
accessible by the widest array of equipment including outdated
equipment. Many small donations ($1 to $5,000) are particularly
important to maintaining tax exempt status with the IRS.
The Foundation is committed to complying with the laws

regulating charities and charitable donations in all 50 states of
the United States. Compliance requirements are not uniform
and it takes a considerable effort, much paperwork and many
fees to meet and keep up with these requirements. We do not
solicit donations in locations where we have not received written
confirmation of compliance. To SEND DONATIONS or
determine the status of compliance for any particular state visit
www.gutenberg.org/donate.
While we cannot and do not solicit contributions from states

where we have not met the solicitation requirements, we know
of no prohibition against accepting unsolicited donations from
donors in such states who approach us with offers to donate.
International donations are gratefully accepted, but we cannot

make any statements concerning tax treatment of donations
received from outside the United States. U.S. laws alone swamp
our small staff.
Please check the Project Gutenberg web pages for current

donation methods and addresses. Donations are accepted in a
number of other ways including checks, online payments and
credit card donations. To donate, please visit:
www.gutenberg.org/donate.
Section 5. General Information About Project

Gutenberg™ electronic works
Professor Michael S. Hart was the originator of the Project
Gutenberg™ concept of a library of electronic works that could
be freely shared with anyone. For forty years, he produced and
distributed Project Gutenberg™ eBooks with only a loose
network of volunteer support.
Project Gutenberg™ eBooks are often created from several

printed editions, all of which are confirmed as not protected by
copyright in the U.S. unless a copyright notice is included. Thus,
we do not necessarily keep eBooks in compliance with any
particular paper edition.
Most people start at our website which has the main PG search
facility: www.gutenberg.org.
This website includes information about Project Gutenberg™,

including how to make donations to the Project Gutenberg
Literary Archive Foundation, how to help produce our new
eBooks, and how to subscribe to our email newsletter to hear
about new eBooks.

Distributed Machine Learning With Pyspark Abdelaziz Testas Full Chapter

Загружено:

Сведения о документе

Оригинальное название

Авторское право

Доступные форматы

Поделиться этим документом

Поделиться или встроить документ

Параметры публикации

Этот документ был вам полезен?

Это неприемлемый материал?

Авторское право:

Доступные форматы

Distributed Machine Learning With Pyspark Abdelaziz Testas Full Chapter

Загружено:

Авторское право:

Доступные форматы

Distributed Machine Learning with

PySpark Abdelaziz Testas

Learning with PySpark

Migrating Effortlessly from Pandas

Distributed Machine Learning with PySpark: Migrating

Fremont, CA, USA

ISBN-13 (pbk): 978-1-4842-9750-6

ISBN-13 (electronic): 978-1-4842-9751-3

Copyright © 2023 by Abdelaziz Testas

This work is subject to copyright. All rights are reserved by the

Trademarked names, logos, and images may appear in this book.

The use in this publication of trade names, trademarks, service

Managing Director, Apress Media LLC: Welmoed Spahr

Acquisitions Editor: Celestin Suresh John

Development Editor: Laura Berendson

Coordinating Editor: Gryffin Winkler

Cover designed by eStudioCalamar

Cover image by Aakash Dhage on Unsplash (www.unsplash.com)

Distributed to the book trade worldwide by Apress Media, LLC, 1

Apress titles may be purchased in bulk for academic, corporate, or

Any source code or other supplementary material referenced by the

Paper in this product is recyclable

I dedicate this book to my family and colleagues

for support and encouragement.

About the Author

Chapter 1: An Easy Transition

PySpark and Pandas Integration

Chapter 2: Selecting Algorithms

Selecting Algorithms with Cross-Validation

Bringing It All Together

Chapter 3: Multiple Linear Regression with Pandas, Scikit-Learn,

Multiple Linear Regression

Multiple Linear Regression with Scikit-Learn

Multiple Linear Regression with PySpark

Chapter 4: Decision Tree Regression with Pandas, Scikit-Learn, and

Decision Tree Regression

Decision Tree Regression with Scikit-Learn

Decision Tree Regression with PySpark

Bringing It All Together

Chapter 5: Random Forest Regression with Pandas, Scikit-Learn,

Random Forest Regression

Random Forest with Scikit-Learn

Random Forest with PySpark

Bringing It All Together

Chapter 6: Gradient-Boosted Tree Regression with Pandas, Scikit-

Gradient-Boosted Tree (GBT) Regression

GBT with Scikit-Learn

GBT with PySpark

Bringing It All Together

Chapter 7: Logistic Regression with Pandas, Scikit-Learn, and

Logistic Regression with Scikit-Learn

Logistic Regression with PySpark

Putting It All Together

Chapter 8: Decision Tree Classification with Pandas, Scikit-Learn,

Decision Tree Classification

Scikit-Learn and PySpark Similarities

Decision Tree Classification with Scikit-Learn

Decision Tree Classification with PySpark

Bringing It All Together

Chapter 9: Random Forest Classification with Scikit-Learn and

Random Forest Classification

Scikit-Learn and PySpark Similarities for Random Forests

Random Forests with Scikit-Learn