
11/22/2020 Are you still using Pandas for big data?

| by Roman Orac | Towards Data Science


Are you still using Pandas for big data?


Pandas doesn’t have multiprocessing support and it is slow with bigger datasets.
There is a better tool that puts those CPU cores to work!

Roman Orac Apr 27 · 5 min read

https://towardsdatascience.com/are-you-still-using-pandas-for-big-data-12788018ba1a

Photo by Chris Curry on Unsplash

Pandas is one of the best tools when it comes to Exploratory Data Analysis. But this
doesn't mean that it is the best tool available for every task — like big data processing.
I’ve spent so much time waiting for pandas to read a bunch of files or to aggregate them
and calculate features.

Recently, I took the time to look for a better tool, and what I found made me update my data
processing pipeline. I use this tool for heavy data processing, such as reading multiple
files with 10 GB of data, applying filters to them, and doing aggregations. When I am done
with the heavy processing, I save the result to a smaller “pandas friendly” CSV file and
continue with Exploratory Data Analysis in pandas.

Download the Jupyter Notebook to follow examples.

I write extensively about Data Analysis with pandas. Take a look at my series:

Pandas Data Analysis Series


A curated list of pandas articles from Tips & Tricks, How NOT to
guides to Tips related to Big Data analysis.
medium.com

Here are a few links that might interest you:

- Your First Machine Learning Model in the Cloud

- Intro to Machine Learning

- Intro to Programming

- Data Science for Business Leaders

- AI for Healthcare

- Autonomous Systems

- Learn SQL

- Free skill tests for Data Scientists & Machine Learning Engineers

Disclosure: Bear in mind that some of the links above are affiliate links and if you go
through them to make a purchase I will earn a commission. Keep in mind that I link Udacity
programs and my tutorials because of their quality and not because of the commission I
receive from your purchases. The decision is yours, and whether or not you decide to buy
something is completely up to you.

Meet Dask


Dask logo from dask.org

Dask provides advanced parallelism for analytics, enabling performance at scale for the
tools you love. This includes NumPy, pandas and scikit-learn. It is open-source and freely
available. It uses existing Python APIs and data structures to make it easy to switch
between these libraries and their Dask-powered equivalents.

Dask makes simple things easy and complex things possible

Pandas vs Dask
I could go on and on describing Dask, because it has so many features, but instead, let's
look at a practical example. In my work, I usually get a bunch of files that I need to
analyze. Let’s simulate my workday and create 10 files with 100K entries each (each file is
about 196 MB).

from sklearn.datasets import make_classification
import pandas as pd

# Generate 10 CSV files, each with 100K rows, 100 feature columns and a label column.
for i in range(1, 11):
    print('Generating trainset %d' % i)
    x, y = make_classification(n_samples=100_000, n_features=100)
    df = pd.DataFrame(data=x)
    df['y'] = y
    df.to_csv('trainset_%d.csv' % i, index=False)

Now, let’s read those files with pandas and measure time. Pandas doesn’t have native
glob support so we need to read files in a loop.

%%time

import glob

# Read every generated file into its own DataFrame, then concatenate them.
df_list = []
for filename in glob.glob('trainset_*.csv'):
    df_ = pd.read_csv(filename)
    df_list.append(df_)

df = pd.concat(df_list)
df.shape

It took pandas 16 seconds to read the files.

CPU times: user 14.6 s, sys: 1.29 s, total: 15.9 s
Wall time: 16 s

Now imagine those files were 100 times bigger: you couldn’t even fit them into
memory with pandas.

Meme created with imgflip

Dask can process data that doesn’t fit into memory by breaking it into blocks and
specifying task chains. Let’s measure how long Dask needs to load those files.

import dask.dataframe as dd

%%time
df = dd.read_csv('trainset_*.csv')


CPU times: user 154 ms, sys: 58.6 ms, total: 212 ms
Wall time: 212 ms

Dask needed 212 ms! How is that even possible? Well, it is not. Dask uses a delayed
execution paradigm: it only calculates things when it needs them. So far we have only defined
the execution graph, which Dask can then use to optimize the execution of the tasks. Let’s
repeat the experiment. Also notice that Dask’s read_csv function takes a glob pattern natively.

%%time

df = dd.read_csv('trainset_*.csv').compute()

CPU times: user 39.5 s, sys: 5.3 s, total: 44.8 s
Wall time: 8.21 s

The compute function forces Dask to return the result. Dask read the files about twice as
fast as pandas (8.21 s versus 16 s of wall time).
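To see the delayed execution and the role of compute in isolation, here is a minimal sketch that uses dask.delayed instead of read_csv. The helper functions load and total and the calls list are made up for this illustration; they are not part of Dask or of the notebook.

```python
from dask import delayed

calls = []  # records which functions have actually run (illustration only)

def load(i):
    # Hypothetical stand-in for reading one file.
    calls.append(('load', i))
    return list(range(i * 10, i * 10 + 10))

def total(chunks):
    # Hypothetical stand-in for an aggregation over all files.
    calls.append(('total', len(chunks)))
    return sum(sum(chunk) for chunk in chunks)

# Building the graph: nothing is executed yet.
chunks = [delayed(load)(i) for i in range(3)]
result = delayed(total)(chunks)
calls_before = list(calls)  # still empty at this point

# compute() walks the graph and runs the tasks.
value = result.compute()

print(calls_before)  # no function has run before compute()
print(value)         # sum of the numbers 0..29
```

Until compute is called, only the graph exists, which is why the first timing of dd.read_csv looked impossibly fast.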

Dask natively scales Python

Pandas vs Dask CPU usage
Does Dask use all of the cores you paid for? Let’s compare CPU usage between pandas
and Dask when reading files — the code is the same as above.


CPU usage with pandas when reading files

CPU usage with Dask when reading files

In the screen recordings above, the difference in CPU utilization between pandas
and Dask when reading files is obvious.

What is happening behind the scenes?


Dask’s DataFrame is composed of multiple pandas DataFrames, which are split by index.
When we execute read_csv with Dask, each file is split into blocks, so several workers
(threads, with the default scheduler) can read a single file in parallel.
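The idea of one logical DataFrame made of several pandas DataFrames can be sketched without Dask at all. The following toy example is an assumption-laden illustration, not Dask's implementation: it splits a pandas DataFrame by index into partitions, processes each partition on a thread pool, and combines the partial results. The names pdf, partitions and partial_sum are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd

# One frame standing in for data that would normally be spread across files.
pdf = pd.DataFrame({'value': np.arange(1_000)})

# Split by index into four pandas DataFrames (the "partitions").
n_parts = 4
bounds = np.linspace(0, len(pdf), n_parts + 1, dtype=int)
partitions = [pdf.iloc[lo:hi] for lo, hi in zip(bounds[:-1], bounds[1:])]

def partial_sum(part):
    # Work that can run independently on each partition.
    return part['value'].sum()

# Run the per-partition work on a thread pool, then combine the pieces.
with ThreadPoolExecutor(max_workers=n_parts) as pool:
    partial_sums = list(pool.map(partial_sum, partitions))

total = sum(partial_sums)
print(len(partitions), total)
```

Dask adds the parts this sketch leaves out, such as building the task graph, scheduling, and handling partitions that do not fit in memory at the same time.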

We can even visualize the execution graph.

exec_graph = dd.read_csv('trainset_*.csv')
exec_graph.visualize()


Dask execution graph when reading multiple files.

Shortcomings of Dask
You might be thinking: if Dask is so great, why not ditch pandas altogether? Well, it is
not that simple. Only certain functions from pandas are ported to Dask. Some of them
are hard to parallelize, like sorting values and setting indexes on unsorted columns.
Dask is not a silver bullet: its usage is recommended only for datasets that don’t
fit in the main memory. As Dask is built on top of pandas, operations that were slow in
pandas stay slow in Dask. Like I mentioned before, Dask is a useful tool in the data
pipeline process, but it doesn’t replace other libraries.

Dask is recommended only for datasets that don’t fit in the main memory

How to install Dask
To install Dask simply run:

python -m pip install "dask[complete]"

This will install the whole Dask library.

Conclusion
I’ve only scratched the surface of the Dask library in this blog post. If you would like to
dive deeper, check out the Dask tutorials and Dask’s DataFrame documentation. Interested
in which DataFrame functions are supported in Dask? Check the DataFrame API.

How to process a DataFrame with billions of rows in seconds


Yet another Python library for Data Analysis that You Should Know
About — and no, I am not talking about Spark or Dask.
towardsdatascience.com

Before you go
I am building an online business focused on Data Science. I tweet about how I’m doing
it. Follow me there to join me on my journey.

Photo by Courtney Hedger on Unsplash


Thanks to Ludovic Benistant. 

Python Data Science Programming Data Analytics
