Are You Still Using Pandas For Big Data - by Roman Orac - Towards Data Science

11/22/2020 Are you still using Pandas for big data?
| by Roman Orac | Towards Data Science
Are you still using Pandas for big data?

Pandas doesn’t have multiprocessing support and it is slow with bigger datasets.
There is a better tool that puts those CPU cores to work!
Roman Orac Apr 27 · 5 min read
https://towardsdatascience.com/are-you-still-using-pandas-for-big-data-12788018ba1a
Photo by Chris Curry on Unsplash
Pandas is one of the best tools when it comes to Exploratory Data Analysis. But this
doesn't mean that it is the best tool available for every task — like big data processing.
I’ve spent so much time waiting for pandas to read a bunch of files or to aggregate them
and calculate features.
Recently, I took the time to find a better tool, which made me update my data
processing pipeline. I use this tool for heavy data processing, like reading multiple
files with 10 GB of data, applying filters to them and doing aggregations. When I am done
with the heavy processing, I save the result to a smaller “pandas friendly” CSV file and
continue with Exploratory Data Analysis in pandas.
Download the Jupyter Notebook to follow examples.
I write extensively about Data Analysis with pandas. Take a look at my series:
Pandas Data Analysis Series

A curated list of pandas articles from Tips & Tricks, How NOT to
guides to Tips related to Big Data analysis.
medium.com
Here are a few links that might interest you:
- Your First Machine Learning Model in the Cloud
- Intro to Machine Learning
- Intro to Programming
- Data Science for Business Leaders
- AI for Healthcare
- Autonomous Systems
- Learn SQL
- Free skill tests for Data Scientists & Machine Learning Engineers
Disclosure: Bear in mind that some of the links above are affiliate links and if you go
through them to make a purchase I will earn a commission. Keep in mind that I link Udacity
programs and my tutorials because of their quality and not because of the commission I
receive from your purchases. The decision is yours, and whether or not you decide to buy
something is completely up to you.
Meet Dask
Dask logo from dask.org
Dask provides advanced parallelism for analytics, enabling performance at scale for the
tools you love. This includes numpy, pandas and sklearn. It is open-source and freely
available. It uses existing Python APIs and data structures to make it easy to switch
between Dask-powered equivalents.
Dask makes simple things easy and complex things possible
Pandas vs Dask
I could go on and on describing Dask, because it has so many features, but instead, let's
look at a practical example. In my work, I usually get a bunch of files that I need to
analyze. Let’s simulate my workday and create 10 files with 100K entries each (each file
is 196 MB).
from sklearn.datasets import make_classification
import pandas as pd

for i in range(1, 11):
    print('Generating trainset %d' % i)
    x, y = make_classification(n_samples=100_000, n_features=100)
    df = pd.DataFrame(data=x)
    df['y'] = y
    df.to_csv('trainset_%d.csv' % i, index=False)
Now, let’s read those files with pandas and measure time. Pandas doesn’t have native
glob support so we need to read files in a loop.
%%time
import glob

df_list = []
for filename in glob.glob('trainset_*.csv'):
    df_ = pd.read_csv(filename)
    df_list.append(df_)
df = pd.concat(df_list)
df.shape
It took pandas 16 seconds to read the files.
CPU times: user 14.6 s, sys: 1.29 s, total: 15.9 s
Wall time: 16 s
Now, imagine if those files were 100 times bigger. You couldn’t even read them
with pandas.
Meme created with imgflip
Dask can process data that doesn’t fit into memory by breaking it into blocks and
specifying task chains. Let’s measure how long Dask needs to load those files.
import dask.dataframe as dd
%%time
df = dd.read_csv('trainset_*.csv')
CPU times: user 154 ms, sys: 58.6 ms, total: 212 ms
Wall time: 212 ms
Dask needed only 212 ms of wall time! How is that even possible? Well, it is not. Dask
uses lazy (delayed) execution: it only computes results when they are needed. With
read_csv we only define the execution graph, which Dask can then optimize before running
the tasks. Let’s repeat the experiment and force the computation. Also notice that Dask’s
read_csv function accepts a glob pattern natively.
%%time
df = dd.read_csv('trainset_*.csv').compute()
CPU times: user 39.5 s, sys: 5.3 s, total: 44.8 s
Wall time: 8.21 s
The compute function forces Dask to return the result. Dask read the files about twice as
fast as pandas (8.21 s versus 16 s of wall time).
Dask natively scales Python

Pandas vs Dask CPU usage
Does Dask use all of the cores you paid for? Let’s compare CPU usage between pandas
and Dask when reading files — the code is the same as above.
CPU usage with pandas when reading files
CPU usage with Dask when reading files
In the screen recordings above, the difference in multi-core CPU usage between pandas
and Dask when reading the files is obvious.
What is happening behind the scenes?

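A Dask DataFrame is composed of multiple pandas DataFrames (partitions), which are split
along the index. When we execute read_csv with Dask, each file is split into blocks and
several workers (threads under the default scheduler) read those blocks in parallel, so
even a single file can be read by more than one worker.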
We can even visualize the execution graph.
exec_graph = dd.read_csv('trainset_*.csv')
exec_graph.visualize()
Dask execution graph when reading multiple files.
Shortcomings of Dask
You might be thinking: if Dask is so great, why not ditch pandas altogether? Well, it is
not that simple. Only certain functions from pandas are ported to Dask. Some operations
are hard to parallelize, like sorting values and setting an index on an unsorted column.
Dask is not a silver bullet: it is recommended only for datasets that don’t fit in main
memory. As Dask is built on top of pandas, operations that are slow in pandas stay slow
in Dask. As I mentioned before, Dask is a useful tool in the data processing pipeline,
but it doesn’t replace other libraries.
Dask is recommended only for datasets that don’t fit in the main memory
How to install Dask
To install Dask simply run:
python -m pip install "dask[complete]"
This will install the whole Dask library.
Conclusion
I’ve only scratched the surface of the Dask library in this blog post. If you would like
to dive deeper, check out the Dask tutorials and Dask’s DataFrame documentation. Interested
in which DataFrame functions are supported in Dask? Check the DataFrame API.
How to process a DataFrame with billions of rows in seconds

Yet another Python library for Data Analysis that You Should Know
About — and no, I am not talking about Spark or Dask.
towardsdatascience.com
Before you go
I am building an online business focused on Data Science. I tweet about how I’m doing
it. Follow me there to join me on my journey.
Photo by Courtney Hedger on Unsplash
Thanks to Ludovic Benistant.
Python Data Science Programming Data Analytics