https://towardsdatascience.com/are-you-still-using-pandas-for-big-data-12788018ba1a 1/10
11/22/2020 Are you still using Pandas for big data? | by Roman Orac | Towards Data Science
Pandas is one of the best tools when it comes to Exploratory Data Analysis. But this
doesn't mean that it is the best tool available for every task — like big data processing.
I’ve spent so much time waiting for pandas to read a bunch of files or to aggregate them
and calculate features.
Recently, I took the time to look for a better tool and found one, which made me update
my data processing pipeline. I use this tool for heavy data processing, such as reading
multiple files with 10 gigs of data, applying filters to them and doing aggregations.
When I am done with the heavy processing, I save the result to a smaller "pandas
friendly" CSV file and continue with Exploratory Data Analysis in pandas.
I write extensively about Data Analysis with pandas. Take a look at my series:
- Intro to Programming
- AI for Healthcare
- Autonomous Systems
- Learn SQL
- Free skill tests for Data Scientists & Machine Learning Engineers
Disclosure: Bear in mind that some of the links above are affiliate links and if you go
through them to make a purchase I will earn a commission. Keep in mind that I link Udacity
programs and my tutorials because of their quality and not because of the commission I
receive from your purchases. The decision is yours, and whether or not you decide to buy
something is completely up to you.
Meet Dask
Dask provides advanced parallelism for analytics, enabling performance at scale for the
tools you love. This includes NumPy, pandas and scikit-learn. It is open-source and freely
available. It uses existing Python APIs and data structures, which makes it easy to switch
from these libraries to their Dask-powered equivalents.
Now, let’s read those files with pandas and measure time. Pandas doesn’t have native
glob support so we need to read files in a loop.
%%time
import glob
import pandas as pd

# Read every matching CSV file, then concatenate them into one DataFrame.
df_list = []
for filename in glob.glob('trainset_*.csv'):
    df_ = pd.read_csv(filename)
    df_list.append(df_)
df = pd.concat(df_list)
df.shape
Now, imagine those files were 100 times bigger. You couldn't even read them with pandas,
because the data would no longer fit into memory.
Dask can process data that doesn’t fit into memory by breaking it into blocks and
specifying task chains. Let’s measure how long Dask needs to load those files.
import dask.dataframe as dd
%%time
df = dd.read_csv('trainset_*.csv')
CPU times: user 154 ms, sys: 58.6 ms, total: 212 ms
Wall time: 212 ms
Dask needed only 212 ms of wall time! How is that even possible? Well, it is not: the
data has not actually been loaded yet. Dask uses a delayed (lazy) execution paradigm, so
it only calculates things when it needs them. We define the execution graph so Dask can
then optimize the execution of the tasks. Let's repeat the experiment, and also notice
that Dask's read_csv function accepts a glob pattern natively.
%%time
df = dd.read_csv('trainset_*.csv').compute()
The compute function forces Dask to execute the graph and return the result. Dask read
the files about twice as fast as pandas.
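To make the idea of blocks, task chains and delayed execution more concrete, here is a toy sketch in plain Python. It is not how Dask is implemented: the Lazy class and the load_block, block_sum and combine helpers are hypothetical names invented for this illustration, and the "files" are just small lists of numbers.

```python
class Lazy:
    """A toy deferred computation: stores a function and its inputs."""

    def __init__(self, func, *args):
        self.func = func
        self.args = args

    def compute(self):
        # Resolve any Lazy inputs first, then run this task.
        resolved = [a.compute() if isinstance(a, Lazy) else a for a in self.args]
        return self.func(*resolved)


calls = []  # records which tasks actually ran, in order


def load_block(block_id):
    calls.append(f"load {block_id}")
    return list(range(block_id * 3, block_id * 3 + 3))  # pretend file contents


def block_sum(rows):
    calls.append("sum")
    return sum(rows)


def combine(*partials):
    calls.append("combine")
    return sum(partials)


# Building the graph is cheap: nothing is loaded yet.
partials = [Lazy(block_sum, Lazy(load_block, i)) for i in range(3)]
total = Lazy(combine, *partials)
calls_before = list(calls)

# Only compute() walks the graph and does the work, one block at a time.
result = total.compute()
print(calls_before, result, calls)
```

Defining the graph takes almost no time because no block is touched until compute is called, which mirrors why the first timing above looked impossibly fast.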
The screen recordings above make the difference in multiprocessing between pandas and
Dask obvious when reading files.
exec_graph = dd.read_csv('trainset_*.csv')
exec_graph.visualize()
Shortcomings of Dask
You might be thinking: if Dask is so great, why not ditch pandas altogether? Well, it is
not that simple. Only certain pandas functions are ported to Dask. Some operations are
hard to parallelize, such as sorting values and setting an index on an unsorted column.
Dask is not a silver bullet: using it is recommended only for datasets that don't fit in
main memory. As Dask is built on top of pandas, operations that were slow in pandas stay
slow in Dask. As I mentioned before, Dask is a useful tool in the data pipeline, but it
doesn't replace other libraries.
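To see why sorting is harder to parallelize than something like a sum, consider this small plain-Python sketch. It only illustrates the general problem and makes no claim about how Dask sorts internally; the blocks list is made-up data standing in for partitions of a larger dataset.

```python
import heapq

# Made-up data: three blocks standing in for partitions of a big dataset.
blocks = [[7, 1, 9], [4, 8, 2], [6, 3, 5]]

# A sum parallelizes trivially: each block yields one number, and the
# partial results are combined with another sum.
total = sum(sum(block) for block in blocks)

# Sorting each block independently is not enough: concatenating the
# sorted blocks does not give a globally sorted sequence.
sorted_blocks = [sorted(block) for block in blocks]
naive = [x for block in sorted_blocks for x in block]

# An extra merge step that looks across all blocks is required.
merged = list(heapq.merge(*sorted_blocks))

print(total, naive, merged)
```

The sum needs only one small number per block, while the sort needs a step that compares values across all blocks, which is where the extra cost comes from.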
Conclusion
I've only scratched the surface of the Dask library in this blog post. If you would like
to dive deeper, check the amazing Dask tutorials and Dask's DataFrame documentation.
Interested in which DataFrame functions are supported in Dask? Check the DataFrame API.
Before you go
I am building an online business focused on Data Science. I tweet about how I’m doing
it. Follow me there to join me on my journey.