YOUR ESSENTIAL
DATA SCIENCE TOOLBOX
Here are the top tools I use every day as a professional data
scientist, processing files from tens of thousands of rows to
TB-scale logs. I've done it all on an 11-inch MacBook Air with 8 GB of
RAM! I'm limiting this list to tools you can get for free and use on a
single machine that you own.
PYTHON 2.7
Python 2.7. I use the Anaconda installation. It's free and simple to install on Mac, Windows,
and Linux. Find it here: https://www.continuum.io/downloads, and follow the
instructions to install it.
Anaconda will install lots of packages for you. The ones I use every day are:
Pandas - My #1 tool for data analysis. I use it for cleaning, first visualizations, and
summaries of all small to medium datasets. This is also my go-to for opening and
saving CSV files.
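As a minimal sketch of that typical pandas first pass (the column names and data here are invented): load a CSV, fill a missing value, summarize, and save the result back out.

```python
# A made-up CSV, loaded through StringIO so the sketch is self-contained.
from io import StringIO

import pandas as pd

csv_text = u"city,temp\nOslo,4.5\nCairo,\nLima,18.0\n"
df = pd.read_csv(StringIO(csv_text))               # opening a CSV
df["temp"] = df["temp"].fillna(df["temp"].mean())  # simple cleaning step
print(df.describe())                               # quick numeric summary
df.to_csv("cleaned.csv", index=False)              # saving a CSV back out
```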
NumPy - Called the fundamental package for scientific computing with Python.
It allows Matlab-like array and matrix computations and manipulation, and tons more.
It's super fast and very well documented (http://www.numpy.org/).
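A quick taste of those Matlab-like array operations (again with made-up data): elementwise math and a matrix product, with no explicit Python loops.

```python
import numpy as np

a = np.arange(6).reshape(2, 3)  # 2x3 matrix [[0, 1, 2], [3, 4, 5]]
b = a * 10                      # elementwise: multiplies every cell
m = np.dot(a, a.T)              # 2x2 matrix product of a and its transpose
print(b)
print(m)                        # [[ 5 14] [14 50]]
```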
Bokeh - Makes (Jupyter Notebook) plots interactive. I don't use this all the time, but
it's invaluable when you need to do visual investigation.
nltk - Natural Language Toolkit. To learn about this tool, read the free book here:
http://www.nltk.org/book/
This one is not installed with Anaconda, but it is essential for natural language
processing when things get big (it does out-of-core processing):
Gensim - Once you've got Anaconda installed, install Gensim like this:
> conda install gensim
Sublime Text - I think everyone who does any text processing should use Sublime
Text. You can use it for free (and get nagged sometimes when you save your work) or
buy a license. Get it here: https://www.sublimetext.com/
PyCharm CE - For bigger projects, I'll sometimes use PyCharm. It's free and does
everything I've ever needed from a full-featured IDE.
Get it here: https://www.jetbrains.com/pycharm/
mawk - This is a blazingly fast version of awk. I've used it to process terabytes of log
files. The syntax takes a bit of getting used to, but when you can't get the answer you
need any other way, or you have to have a faster solution for a big dataset, give it a
try. Get it here: http://invisible-island.net/mawk/
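To give a feel for the syntax, here is a sketch of an awk one-liner that sums response bytes per HTTP status code. The log layout (status in column 1, bytes in column 2) and the sample data are made up; mawk runs the same program as awk, typically much faster on big files.

```shell
# Sum bytes per HTTP status from a two-column log (status, bytes).
# On a real TB-scale file you would run something like:
#   mawk '{ b[$1] += $2 } END { for (s in b) print s, b[s] }' access.log
printf '200\t512\n200\t1024\n404\t0\n' \
  | awk '{ b[$1] += $2 } END { for (s in b) print s, b[s] }'
```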
Command line:
Ready to kick off a Jupyter Notebook and run some Python jobs? To start a Jupyter
Notebook from the command line, type:
> jupyter notebook
When that starts up, in the first cell, type in this list of imports. It's the same one that I
use to start nearly every analysis I do. I copied it directly from my last notebook:
import pandas as pd
import numpy as np
from __future__ import division
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display
%matplotlib inline
pd.set_option('display.max_rows', 200)
plt.style.use('ggplot')
sns.set_style('dark')
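With that boilerplate in place, a first look at a dataset is only a few more lines. The DataFrame below is a made-up stand-in for whatever you just loaded:

```python
import pandas as pd

# Tiny invented dataset standing in for a freshly loaded CSV.
df = pd.DataFrame({"group": list("aabb"), "value": [1.0, 2.0, 3.0, 5.0]})
print(df.head())                             # peek at the first rows
means = df.groupby("group")["value"].mean()  # per-group averages
print(means)                                 # a -> 1.5, b -> 4.0
```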