Вы находитесь на странице: 1из 14

Name: Shahima Khan Batch: A3

Roll No.: I071 Date: 11.07.20

Experiment 1

Aim:

Exploratory Data Analysis (EDA) of the IRIS dataset using Python and Data Preprocessing using
WEKA

Theory:

In this practical you will use pandas in python to explore the IRIS dataset. Python is a simple
high-level and an open-source language used for general-purpose programming. It has many open-
source libraries and Pandas is one of them. Pandas is a powerful, fast, flexible open-source library used
for data analysis and manipulations of data frames/datasets. Pandas can be used to read and write data
in a dataset of different formats like CSV(comma separated values), txt, xls(Microsoft Excel) etc. We
will use it for simple data exploration work in this course.

Data exploration refers to the preliminary investigation of data in order to better understand
its specific characteristics. There are two key motivations for data exploration:

1. To help users select the appropriate preprocessing and data analysis technique used.
2. To make use of humans’ abilities to recognize patterns in the data.

Summary statistics are quantities, such as the mean and standard deviation, that capture
various characteristics of a potentially large set of values with a single number or a small set
of numbers.

Iris dataset

In this practical, we will use the Iris sample data, which contains information on 150 Iris
flowers, 50 each from one of three Iris species: Setosa, Versicolour, and Virginica. Each flower
is characterized by five attributes:
• sepal length in centimeters
• sepal width in centimeters
• petal length in centimeters
• petal width in centimeters
• class (Setosa, Versicolour, Virginica)

In this part, you will learn how to:


• Load a CSV data file into a Pandas DataFrame object.
• Compute various summary statistics from the DataFrame.

To execute the sample program shown here, make sure you have installed the Pandas library.
First, you need to download the Iris dataset from the UCI machine learning repository.

Using Wine dataset from the UCI repository

Getting Started with Pandas:


Displaying up the top rows of the dataset with their columns

Displaying the number of rows randomly.

Displaying the number of columns and names of the columns.


Displaying the shape of the dataset.

Display the whole dataset

Slicing the rows.


Displaying only specific columns.

Filtering:Displaying the specific rows using “iloc” function


For each quantitative attribute, calculate its average, standard deviation,
minimum, and maximum values.
Summary for all the attributes.
Histogram

Sepalengtcm
Sepalwidthcm

Petallengthcm
Petalwidthcm

Boxplot
Sepal length vs Sepal width

Petal length vs petal width


Joint plot
Sepal length and sepal width

Petal length vs petal width

Вам также может понравиться