RECOGNIZING HANDWRITTEN DIGITS

RECOGNIZING HANDWRITTEN DIGITS WITH SCIKIT-LEARN

Hypothesis to be tested: The scikit-learn library ships several datasets that are useful for testing data analysis and prediction problems, and the Digits dataset is one of them. Some scientists claim that a classifier trained on it predicts the digit accurately 95% of the time.

Libraries: scikit-learn, matplotlib

INSTALL LIBRARIES

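Data analysis can be applied to images and sound as well as to ordinary tables of numbers. In this project we are going to analyse images. As in our other data analysis projects, we use Jupyter Notebook as the editor. First install the libraries with pip install scikit-learn matplotlib (the package name on PyPI is scikit-learn, not sklearn). Then import the modules, create the classifier and load the dataset: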

from sklearn import datasets,svm
svc = svm.SVC(gamma=0.001, C=100.)
digits = datasets.load_digits()

To get information about the digits dataset, run the following command and read the description it prints:

print(digits.DESCR)

Running the command above prints the description:

.. _digits_dataset:

Optical recognition of handwritten digits dataset
--------------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 5620
    :Number of Attributes: 64
    :Attribute Information: 8x8 image of integer pixels in the range 0..16.
    :Missing Attribute Values: None
    :Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)
    :Date: July; 1998

This is a copy of the test set of the UCI ML hand-written digits datasets
https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits

The data set contains images of hand-written digits: 10 classes where
each class refers to a digit.

Preprocessing programs made available by NIST were used to extract
normalized bitmaps of handwritten digits from a preprinted form. From a
total of 43 people, 30 contributed to the training set and different 13
to the test set. 32x32 bitmaps are divided into nonoverlapping blocks of
4x4 and the number of on pixels are counted in each block. This generates
an input matrix of 8x8 where each element is an integer in the range
0..16. This reduces dimensionality and gives invariance to small
distortions.

For info on NIST preprocessing routines, see M. D. Garris, J. L. Blue, G.
T. Candela, D. L. Dimmick, J. Geist, P. J. Grother, S. A. Janet, and C.
L. Wilson, NIST Form-Based Handprint Recognition System, NISTIR 5469,
1994.

.. topic:: References

  - C. Kaynak (1995) Methods of Combining Multiple Classifiers and Their
    Applications to Handwritten Digit Recognition, MSc Thesis, Institute of
    Graduate Studies in Science and Engineering, Bogazici University.
  - E. Alpaydin, C. Kaynak (1998) Cascading Classifiers, Kybernetika.
  - Ken Tang and Ponnuthurai N. Suganthan and Xi Yao and A. Kai Qin.
    Linear dimensionalityreduction using relevance weighted LDA. School of
    Electrical and Electronic Engineering Nanyang Technological University.
    2005.
  - Claudio Gentile. A New Approximate Maximal Margin Classification
    Algorithm. NIPS. 2000.
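The description text comes from the original UCI dataset, so it is worth checking what the loaded object actually contains. The short sketch below only uses attributes of the object returned by load_digits() (data, images, target). In recent scikit-learn releases it should report 1,797 samples, which matches the indices used later in this post.

```python
from sklearn import datasets

digits = datasets.load_digits()

# One row of 64 pixel values per digit
print(digits.data.shape)
# The same digits, kept as 8x8 arrays
print(digits.images.shape)
# One label per digit
print(digits.target.shape)
# The distinct labels present in the dataset
labels = sorted(set(digits.target.tolist()))
print(labels)
```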

2D REPRESENTATION OF IMAGE

A computer stores everything as numbers. Here each image is stored as a 2D array. Running the following command shows the 2D array representation of one of the images.

digits.images[0]

You will get an array like this:

array([[ 0., 0., 5., 13., 9., 1., 0., 0.], 
    [ 0., 0., 13., 15., 10., 15., 5., 0.], 
    [ 0., 3., 15., 2., 0., 11., 8., 0.], 
    [ 0., 4., 12., 0., 0., 8., 8., 0.], 
    [ 0., 5., 8., 0., 0., 9., 8., 0.], 
    [ 0., 4., 11., 0., 1., 12., 7., 0.], 
    [ 0., 2., 14., 5., 10., 12., 0., 0.], 
    [ 0., 0., 6., 13., 10., 0., 0., 0.]])
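The classifier is not trained on these 8x8 arrays directly but on digits.data, where every image is flattened into a row of 64 values. A small sketch to confirm that the two views hold the same numbers (the variable name flat is only for illustration):

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()

# Flatten the first 8x8 image into a vector of 64 pixel values
flat = digits.images[0].reshape(-1)
print(flat.shape)

# The flattened image should be identical to the first row of digits.data
same = np.array_equal(flat, digits.data[0])
print(same)
```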

PLOTTING 2D ARRAY ON GRAPH

The above array can be drawn on a graph to get the image:

import matplotlib.pyplot as plt
%matplotlib inline
plt.imshow(digits.images[0], cmap=plt.cm.gray_r, interpolation='nearest')

Try out the following code to plot more images on a graph:

import matplotlib.pyplot as plt
%matplotlib inline
plt.subplot(321)
plt.imshow(digits.images[1791], cmap=plt.cm.gray_r, interpolation='nearest')
plt.subplot(322)
plt.imshow(digits.images[1792], cmap=plt.cm.gray_r, interpolation='nearest')
plt.subplot(323)
plt.imshow(digits.images[1793], cmap=plt.cm.gray_r, interpolation='nearest')
plt.subplot(324)
plt.imshow(digits.images[1794], cmap=plt.cm.gray_r, interpolation='nearest')
plt.subplot(325)
plt.imshow(digits.images[1795], cmap=plt.cm.gray_r, interpolation='nearest')
plt.subplot(326)
plt.imshow(digits.images[1796], cmap=plt.cm.gray_r, interpolation='nearest')
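The six subplot calls above can also be written as a loop. This is only one possible sketch: the figure size and the titles showing each image's index and true label are additions of ours, not part of the original notebook. In Jupyter, the %matplotlib inline magic from the earlier cell still applies.

```python
import matplotlib.pyplot as plt
from sklearn import datasets

digits = datasets.load_digits()

# A 3x2 grid of axes, one per image (figsize is an arbitrary choice)
fig, axes = plt.subplots(3, 2, figsize=(4, 6))

# The last six images of the dataset, indices 1791 to 1796
for ax, idx in zip(axes.ravel(), range(1791, 1797)):
    ax.imshow(digits.images[idx], cmap=plt.cm.gray_r, interpolation="nearest")
    ax.set_title(f"index {idx}, label {digits.target[idx]}")
    ax.axis("off")

fig.tight_layout()
plt.show()
```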

TRAIN THE MODEL

svc.fit(digits.data[1:1790], digits.target[1:1790])

PREDICTION 

print(svc.predict(digits.data[1791:]))
print(digits.target[1791:])

Output:
[4 9 0 8 9 8] 
[4 9 0 8 9 8]
Here, after training the model on 1,789 records (indices 1 to 1789), all six held-out digits are predicted correctly, so we get 100% accuracy on this small sample.
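Instead of comparing the two printed arrays by eye, the accuracy can be computed as a number. A minimal sketch using accuracy_score from sklearn.metrics with the same training and test ranges as above (the names predicted, expected and acc are ours):

```python
from sklearn import datasets, svm
from sklearn.metrics import accuracy_score

digits = datasets.load_digits()
svc = svm.SVC(gamma=0.001, C=100.)

# Same training range as in the post
svc.fit(digits.data[1:1790], digits.target[1:1790])

# Same held-out range as in the post
predicted = svc.predict(digits.data[1791:])
expected = digits.target[1791:]

# Fraction of held-out digits that were predicted correctly
acc = accuracy_score(expected, predicted)
print(f"{acc:.0%} accuracy on {len(expected)} digits")
```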

LET'S TRY SOME OTHER RANGES FOR TRAINING THE DATA

svc.fit(digits.data[500:1000], digits.target[500:1000])
s1 = svc.predict(digits.data[1786:1796])
s2 = digits.target[1786:1796]
print(s1 == s2)

Output:
[False True True False False True True True True True]
Here, by training the model with 500 records, we get 70% accuracy (7 of 10 test digits correct).

svc.fit(digits.data[1:40], digits.target[1:40])
s1 = svc.predict(digits.data[1701:1711])
s2 = digits.target[1701:1711]
print(s1 == s2)

Output:
[ True True True True True True True True True True]
Here, by training the model with only 39 records (indices 1 to 39), we get 100% accuracy on these 10 test digits.

svc.fit(digits.data[40:1140], digits.target[40:1140])
s1 = svc.predict(digits.data[1600:1610])
s2 = digits.target[1600:1610]
print(s1 == s2)

Output:
[ True True False True True False True True True True]
Here, by training the model with 1,100 records, we get 80% accuracy (8 of 10 test digits correct).
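The experiments above all repeat the same fit, predict and compare steps, so they can be wrapped in a small function. The helper range_accuracy below is hypothetical (it is not part of scikit-learn or of the original notebook). It creates a fresh SVC each time and uses the classifier's score method, which returns the mean accuracy on the given test data.

```python
from sklearn import datasets, svm

digits = datasets.load_digits()

def range_accuracy(train, test):
    """Hypothetical helper: fit on one slice of the data, score on another."""
    clf = svm.SVC(gamma=0.001, C=100.)
    clf.fit(digits.data[train], digits.target[train])
    return clf.score(digits.data[test], digits.target[test])

# (training slice, test slice) pairs taken from the experiments in this post
experiments = [
    (slice(1, 1790), slice(1791, None)),
    (slice(500, 1000), slice(1786, 1796)),
    (slice(1, 40), slice(1701, 1711)),
    (slice(40, 1140), slice(1600, 1610)),
]

results = []
for train, test in experiments:
    n_train = len(digits.target[train])
    n_test = len(digits.target[test])
    acc = range_accuracy(train, test)
    results.append((n_train, n_test, acc))
    print(f"train={n_train:5d}  test={n_test:3d}  accuracy={acc:.0%}")
```

With test sets of only 6 to 10 digits, a single wrong prediction moves the accuracy by 10 points or more, so the printed percentages should be read as rough indications.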

CONCLUSION

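From the above analysis, accuracy on these small test samples ranged from 70% to 100% depending on the training range, which is broadly in line with the claim that the given algorithm predicts the digits accurately about 95% of the time. To get closer to 100% accuracy, we need to train the model with a larger number of records.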
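One way to check the 95% hypothesis on more than a handful of test digits is cross-validation, where every record is used for testing exactly once. The sketch below uses cross_val_score from sklearn.model_selection with the same SVC settings as in this post. The choice of 5 folds is our assumption and was not part of the original analysis.

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

digits = datasets.load_digits()
svc = svm.SVC(gamma=0.001, C=100.)

# Train on 4/5 of the data and test on the remaining 1/5, five times over
scores = cross_val_score(svc, digits.data, digits.target, cv=5)

print("accuracy per fold:", scores)
print(f"mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```

The mean accuracy printed here can be compared directly against the 95% claim.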

GitHub repository for the same:
Jupyter notebook

I am thankful to the mentors at https://lnkd.in/gPMHXgu for providing awesome problem statements and giving many of us a coding internship experience. Thank you, www.suvenconsultants.com
