Hello folks!
Today let us look at a very important Python package, 'csv'.
What is CSV?
- CSV stands for Comma Separated Values (and sometimes Character Separated Values).
- csv files are important for almost any machine learning or deep learning project.
- These csv files will serve as the datasets for any kind of classification/clustering/prediction problems.
- The files open like Excel spreadsheets, but they are plain text files saved with the extension '.csv'.
- The values in these files are separated by a delimiting character, most often a comma (',') or a semicolon (';').
- Each value in a row corresponds to a column.
Standard sites
There are some standard sites from which you can download csv files for
processing.
- UCI repository : https://archive.ics.uci.edu/ml/index.php
- Kaggle : https://www.kaggle.com/
- Indian government's open datasets : https://data.gov.in/
- Quandl : https://www.quandl.com/search
- Google's public datasets : https://cloud.google.com/bigquery/public-data/
Some sites may ask you to create an account with them. You can create one
with your mail ID on the above websites; they are trustworthy.
How to download a csv file?
You can download a csv file by following the steps given below and store it in
any location on your PC. I will show you an example using the UCI repository.
1. Open the link https://archive.ics.uci.edu/ml/index.php. You will see a window like the one given
below.
2. Now click on the data set you want to download. I am
downloading the iris dataset here.
3. You will get a window like the one given below. If
you want details such as the number of attributes and the name and explanation of
each attribute, click on Data set description and it will be
downloaded. Open it with Notepad, WordPad, or any other editor to view it.
4. To download the dataset, click on Data Folder. A
window like the one given below will appear.
The index link points to a downloadable file with information on how the
dataset was created. bezdekIris.data and iris.data are the links to
download the dataset. iris.names also points to a downloadable file, which
describes the dataset.
5. Click on iris.data. The file will be downloaded.
Open it with Notepad or any other editor to view the
contents. The downloaded file will look like
the one given below.
Explanation:
This dataset has 5 attributes. Now consider the first row; it can
be read as:
- sepal length in cm: 5.1
- sepal width in cm: 3.5
- petal length in cm: 1.4
- petal width in cm: 0.2
- class: Iris-setosa
This means that a flower with the above measurements (sepal length and width,
petal length and width) belongs to the class Iris-setosa.
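The mapping between the first row and the attribute names can be sketched in a few lines of Python. This is only an illustration: the row is typed in as a string rather than read from the file, and the variable names (attribute_names, first_row, record) are made up for this example. The iris.data file itself has no header row, so the names come from the attribute list above.

```python
# Attribute names taken from the dataset description (the file has no header row)
attribute_names = [
    "sepal length in cm",
    "sepal width in cm",
    "petal length in cm",
    "petal width in cm",
    "class",
]

# First row of the dataset, written out by hand for this illustration
first_row = "5.1,3.5,1.4,0.2,Iris-setosa"

# Split the row on the comma and pair each value with its attribute name
values = first_row.split(",")
record = dict(zip(attribute_names, values))

for name, value in record.items():
    print(f"{name}: {value}")
```

Note that every value is still a string at this point; the measurements would need float() before any arithmetic.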
Reading csv files:
The csv files can be read using the reader provided by the csv package.
First import the package, then open the file in read mode. Look at the code
snippet and its output given below. The dataset considered is the iris
dataset downloaded before and stored in C:\
reader() returns a reader object, which is an iterable. It can
be imagined as a list of lists. Each row in the file is returned as a
separate list of strings. Thus, using a for loop, each row is printed out here.
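Here is a minimal sketch of reading a file with reader(). So that it runs anywhere, it first writes three rows in the iris.data format to a temporary file; the file name iris_sample.data and the variable names are assumptions made for this example, not part of the original post.

```python
import csv
import os
import tempfile

# Stand-in for the downloaded file: three rows in the iris.data format
sample = (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "4.7,3.2,1.3,0.2,Iris-setosa\n"
)
path = os.path.join(tempfile.gettempdir(), "iris_sample.data")
with open(path, "w", newline="") as f:
    f.write(sample)

rows = []
# newline="" is what the csv documentation recommends when opening csv files
with open(path, "r", newline="") as csv_file:
    reader = csv.reader(csv_file)
    for row in reader:
        rows.append(row)  # each row is a list of strings
        print(row)
```

To read the real download instead, point path at wherever you saved it, for example a raw string such as r"C:\iris.data" (assuming that is the name and location you used). If the file contains blank lines, reader() yields an empty list for each of them, so a check like `if row:` inside the loop can skip them.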
Custom delimiters:
Some files do not use a comma (',') as the character that separates the
values. It may be a tab, a '|', or a ';'. The reader() can
be customized to read the file based on its delimiter. To specify the
delimiter, set the value of the optional parameter
'delimiter'.
Consider a dataset where the values are separated by a '|' (pipe symbol):
To read this csv, we can use the code snippet given below.
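A small sketch of reading pipe-separated data is given here. The data is made up for this example and is held in a string (wrapped in io.StringIO) instead of a file, so the snippet runs on its own; with a real file you would pass the opened file object to reader() in the same way.

```python
import csv
import io

# Hypothetical pipe-separated data, used only for illustration
data = "name|age|city\nAsha|31|Chennai\nRavi|28|Pune\n"

# delimiter tells reader() which single character separates the values
reader = csv.reader(io.StringIO(data), delimiter="|")
rows = list(reader)

for row in rows:
    print(row)
```

For tab-separated files, the same call works with delimiter="\t".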
In some csv files there is a space after the delimiter (before each value). To remove that extra space, pass the optional parameter skipinitialspace. It takes a boolean value, and the default is False. If it is set to True, whitespace immediately following the delimiter is ignored. For example, if the csv is as given below:
You can write skipinitialspace=True in the call to reader().
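The effect of skipinitialspace can be seen by reading the same data twice, once without the parameter and once with it. The data here is made up for this example and kept in a string so that the snippet runs on its own.

```python
import csv
import io

# Hypothetical data with a space after each comma
data = "name, age, city\nAsha, 31, Chennai\n"

# Default behaviour: the space after the delimiter stays in the value
plain = list(csv.reader(io.StringIO(data)))

# With skipinitialspace=True the space after the delimiter is dropped
trimmed = list(csv.reader(io.StringIO(data), skipinitialspace=True))

print(plain[1])
print(trimmed[1])
```

Only the space right after the delimiter is removed; spaces at the end of a value are left as they are.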