csv

Hello folks....!!!!

Today let us look at a very important package 'csv'.

What is csv ?

CSV stands for Comma Separated Values (and sometimes Character Separated Values).
CSV files are widely used in machine learning and deep learning projects.
They often serve as the datasets for classification, clustering and prediction problems.
Opened in a spreadsheet program, they look just like Excel spreadsheets, but they are plain text files saved with the extension '.csv'.
The values in these files are separated by a delimiting character, most commonly a comma (',') or a semicolon (';').
Each value in a row corresponds to a column.

Standard sites

There are some standard sites from which you can download csv files for processing.

UCI repository : https://archive.ics.uci.edu/ml/index.php
Kaggle : https://www.kaggle.com/
Indian government's open datasets : https://data.gov.in/
Quandl : https://www.quandl.com/search
Google's public datasets : https://cloud.google.com/bigquery/public-data/

Some sites may ask you to create an account with them. You can create one with your mail ID on the above websites; they are trustworthy.

How to download a csv file ?

You can download a csv file by following the steps given below and store it in any location on your PC. I will show you an example using the UCI repository.

1. Open the link https://archive.ics.uci.edu/ml/index.php. You will see a window like the one given below.

2. Now click on the dataset you want to download. I am downloading the iris dataset here.

3. You will get a window like the one given below. If you want details such as the number of attributes and the name and explanation of each attribute, click on Data set description and it will be downloaded. Open it with Notepad, WordPad or any other editor to view it.

4. To download the dataset, click on Data Folder. A window like the one given below will appear.

The index link downloads a file with information on how the dataset was created. bezdekIris.data and iris.data are the links to download the dataset itself. The iris.names link downloads a file that describes the dataset.

5. Click on iris.data. The file will be downloaded. Open it with Notepad or any other editor to view the contents. The downloaded file will look like the one given below.

Explanation :

This dataset has 5 attributes. Consider the first row, which can be read as:

sepal length in cm : 5.1
sepal width in cm : 3.5
petal length in cm : 1.4
petal width in cm : 0.2
class : Iris-setosa

This means that a flower with the above properties (sepal length and width, petal length and width) belongs to the class Iris-setosa.
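
If you want to see that mapping in code, here is a small sketch. It pairs the attribute names listed above with the values of the first row. The names FIELD_NAMES and record are made up for this illustration, and the row is typed in as a string so the snippet runs without the downloaded file.

```python
# Attribute names in the same order as the columns of iris.data.
# FIELD_NAMES is a name chosen for this sketch, not something the file defines.
FIELD_NAMES = [
    "sepal length in cm",
    "sepal width in cm",
    "petal length in cm",
    "petal width in cm",
    "class",
]

# The first row of the file, typed in as a plain string.
first_line = "5.1,3.5,1.4,0.2,Iris-setosa"

# Split on the comma, then pair each value with its attribute name.
values = first_line.split(",")
record = dict(zip(FIELD_NAMES, values))

for name, value in record.items():
    print(f"{name} : {value}")
```

Splitting on ',' is good enough for this simple row. For real files the csv reader is the safer choice, because it also handles quoted values that contain commas.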

Reading csv files :

The csv files can be read using the reader provided by the csv package. First import the package, then open the file in read mode. Look at the code snippet and its output given below. The dataset considered is the iris dataset downloaded earlier and stored in C:\

The reader() function returns a reader object that you can iterate over. It can be imagined as a list of lists: each row in the file comes back as a separate list of strings. Thus, using a for loop, each row is printed out here.
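
Here is a minimal sketch of the reading steps described above. To keep it runnable on any machine, it first writes two sample rows to a temporary folder (that part is an assumption made only for this example). With the real download, you would skip that step and set path to wherever you saved iris.data.

```python
import csv
import os
import tempfile

# Assumption for this sketch: create a small stand-in file in a temporary
# folder. With the real download, set `path` to your saved iris.data instead.
sample = "5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa\n"
path = os.path.join(tempfile.mkdtemp(), "iris.data")
with open(path, "w", newline="") as f:
    f.write(sample)

rows = []
# Open in read mode; newline="" is what the csv documentation recommends.
with open(path, "r", newline="") as f:
    reader = csv.reader(f)
    for row in reader:      # each row is a list of strings
        print(row)
        rows.append(row)
```

Note that every value comes back as a string, so '5.1' has to be converted with float() before doing any arithmetic. Also, if a file contains a blank line, the reader returns an empty list for that line.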

Custom delimiters :

Some files do not use a comma (',') to separate the values. The separator may be a tab, a '|' or a ';'. The reader() can be customized to read the file based on its delimiter: set the optional parameter 'delimiter' to the character the file uses.

Consider a dataset where the values are separated by a '|' (pipe symbol):

To read this csv, we can use the code snippet given below.
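
As a rough sketch of this, the snippet below reads pipe-separated data with delimiter='|'. The sample rows are made up for illustration and kept in memory with io.StringIO, so no file is needed. With a real file you would pass the opened file object to reader() instead.

```python
import csv
import io

# Hypothetical pipe-separated data, held in memory in place of a file.
data = io.StringIO("name|age|city\nAsha|30|Chennai\nRavi|25|Pune\n")

# Tell the reader that '|' separates the values instead of ','.
reader = csv.reader(data, delimiter="|")
rows = list(reader)

for row in rows:
    print(row)
```

For a tab-separated file, the same call works with delimiter='\t'.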

In some csv files you can notice a space after the delimiter (before each value). To remove that extra space, pass the optional parameter skipinitialspace. It takes a boolean value, and the default is False. If it is set to True, the space immediately after the delimiter won't be considered part of the value. For example, if the csv is as given below:

You can write skipinitialspace=True in the call to reader().
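
To see the difference, here is a small sketch that reads the same data twice, once with the default setting and once with skipinitialspace=True. The sample data is made up for illustration and kept in memory with io.StringIO.

```python
import csv
import io

# Hypothetical data with a space after each comma.
data_text = "name, age, city\nAsha, 30, Chennai\n"

# Default behaviour: the space stays attached to the value.
default_rows = list(csv.reader(io.StringIO(data_text)))

# With skipinitialspace=True the space right after the delimiter is dropped.
trimmed_rows = list(csv.reader(io.StringIO(data_text), skipinitialspace=True))

print(default_rows[1])   # values keep their leading space
print(trimmed_rows[1])   # leading space removed
```

Only the space right after the delimiter is skipped. A space at the end of a value is left as it is.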
