
DATA SCIENCE GENIUS

YOUR ESSENTIAL
DATA SCIENCE TOOLBOX

Here are the top tools I use every day as a professional data scientist, processing files
ranging from tens of thousands of rows to TB-scale logs. I've done it all on an 11-inch
MacBook Air with 8 GB of RAM! I'm limiting this list to tools you can get for free and use
on a single machine that you own.

PYTHON 2.7
I use the Anaconda installation of Python 2.7. It's free and simple to install on Mac, Windows,
and Linux. Find it here: https://www.continuum.io/downloads. Follow the instructions there to
install it.

Anaconda will install lots of packages for you. The ones I use every day are:

Jupyter - For displaying/running notebooks (99% of my work is done inside Jupyter
notebooks)

Pandas - My #1 tool for data analysis. I use it for cleaning and for first-pass visualization
and summarization of all small to medium datasets. It's also my go-to for opening and
saving CSV files.
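
For example, a first pass on a CSV might look like this (the file names here are just
placeholders):

import pandas as pd

df = pd.read_csv('my_data.csv')   # placeholder file name

# Quick first look: size, column types, summary statistics
print(df.shape)
print(df.dtypes)
print(df.describe())

# Save a cleaned copy back out to CSV
df.dropna().to_csv('my_data_clean.csv', index=False)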

NumPy - Called the fundamental package for scientific computing with Python.
It allows Matlab-like array and matrix computation and manipulation, and tons more.
It's super fast and very well documented (http://www.numpy.org/).
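
A tiny taste of the Matlab-like feel:

import numpy as np

A = np.arange(9).reshape(3, 3)     # 3x3 matrix
x = np.array([1.0, 2.0, 3.0])      # vector

print(A * 2)            # element-wise scaling
print(A.dot(x))         # matrix-vector product
print(A.T)              # transpose
print(A.mean(axis=0))   # column means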

Matplotlib - The de facto plotting library for Python work. (http://matplotlib.org/)
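
A minimal plot looks like this (with %matplotlib inline in a notebook, the plt.show() call is
optional):

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x))
plt.xlabel('x')
plt.ylabel('sin(x)')
plt.show()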

Bokeh - Makes (Jupyter Notebook) plots interactive. I don't use this all the time, but
it's invaluable when you need to do visual investigation.
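
A minimal sketch of an interactive notebook plot (the data here is made up):

from bokeh.plotting import figure, show, output_notebook

output_notebook()   # render Bokeh output inside the notebook

p = figure(title='Interactive line plot',
           tools='pan,wheel_zoom,box_zoom,reset')
p.line([1, 2, 3, 4, 5], [6, 7, 2, 4, 5], line_width=2)
show(p)             # pan and zoom right in the cell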

I use these a little less often, but find them essential:

Beautifulsoup4 - Text processing and screen scraping
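
A quick sketch of pulling links out of some made-up HTML:

from bs4 import BeautifulSoup

html = '<html><body><a href="http://example.com">a link</a></body></html>'
soup = BeautifulSoup(html, 'html.parser')

# Pull out every link and its text
for a in soup.find_all('a'):
    print(a['href'], a.get_text())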

nltk - Natural Language Toolkit. To learn about this tool, read the free book here:
http://www.nltk.org/book/
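
A minimal tokenizing example (the punkt models need a one-time download; exact resource
names can vary by nltk version):

import nltk
nltk.download('punkt')   # one-time download of the tokenizer models

from nltk.tokenize import sent_tokenize, word_tokenize

text = "NLTK makes tokenizing text easy. It also does tagging, parsing and more."
print(sent_tokenize(text))
print(word_tokenize(text))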

This one is not installed with Anaconda, but it's essential for natural language processing
when things get big (it does out-of-core processing):

Gensim - Once you've got Anaconda installed, install Gensim like this:
> conda install gensim
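
A minimal sketch of the usual gensim workflow on a toy corpus (in real use you'd stream
documents from disk rather than keep them in a list):

from gensim import corpora, models

texts = [['human', 'computer', 'interaction'],
         ['survey', 'of', 'user', 'computer', 'systems'],
         ['graph', 'of', 'trees']]

dictionary = corpora.Dictionary(texts)                  # word <-> id mapping
corpus = [dictionary.doc2bow(text) for text in texts]   # bag-of-words vectors

tfidf = models.TfidfModel(corpus)   # train a TF-IDF model
for doc in tfidf[corpus]:
    print(doc)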

For straight Python coding:

Sublime Text - I think everyone who does any text processing should use Sublime
Text. You can use it for free (and get nagged sometimes when you save your work) or
buy a license. Get it here: https://www.sublimetext.com/

PyCharm CE - For bigger projects, I'll sometimes use PyCharm. It's free and does
everything I've ever needed from a full-featured IDE:
https://www.jetbrains.com/pycharm/

When stuff gets really big:

mawk - This is a blazingly fast version of awk. I've used it to process terabytes of log
files. The syntax takes a bit of getting used to, but when you can't get the answer you
need any other way, or you have to have a faster solution for a big dataset, give it a
try. Get it here: http://invisible-island.net/mawk/
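
For example, summing one column of a huge log might look like this (the field number and
file name are just placeholders; mawk uses standard awk syntax):

> mawk '{ bytes += $10 } END { print bytes }' access.log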

Command line:

For Macs, I recommend and use iTerm as a terminal replacement: https://www.iterm2.com/

Start using Python:

Ready to kick off a Jupyter Notebook and run some Python jobs? To start a Jupyter
Notebook from the command line, type:

> jupyter notebook

When that starts up, type this list of imports into the first cell. It's the same one that I
use to start nearly every analysis I do; I copied it directly from my last notebook:

# Python 2.7: make / always do true (floating-point) division
from __future__ import division

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display

# Render matplotlib figures inline in the notebook
%matplotlib inline

# Show up to 200 rows of a DataFrame before truncating
pd.set_option('display.max_rows', 200)
plt.style.use('ggplot')
sns.set_style('dark')

# Quick-and-dirty breakpoints inside the notebook
from IPython.core.debugger import Tracer
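
That last import gives you quick breakpoints inside a notebook. A minimal sketch (note that
Tracer is deprecated in newer IPython in favor of IPython.core.debugger.set_trace):

def inspect_me(x):
    Tracer()()        # drop into the interactive debugger at this line
    return x * 2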

Copyright Data Science Genius 2016
