Character Recognition using AlexNet

July 4, 2020

When AlexNet (paper) won the ImageNet Large Scale Visual Recognition Challenge in 2012, it sent a shock wave across the computer vision research community. Even though neural networks had been around for decades, it was AlexNet that established the deep convolutional neural network (CNN) as a widely recognised solution to many computer vision problems.

There are now many CNN architectures that are more sophisticated and more powerful. In many cases, however, a “simple” AlexNet can still be very effective. In this post I’m going to use the AlexNet architecture for the task of character recognition.

The dataset

The dataset is obtained from the Chars74K dataset, which contains about 74K images of 62 classes (0-9, A-Z, a-z). It includes characters cropped from natural images (7,705), hand-drawn characters (3,410) and characters synthesised from computer fonts (62,992). For this post I’m going to use only the synthesised font characters, restricted to digits (0-9) and uppercase letters (A-Z). The reduced dataset can be downloaded here. It has 36,576 images of 36 classes.

Random samples from the dataset are shown below.

Figure 1. Random samples from the 74K dataset.

AlexNet Architecture

The AlexNet architecture has eight layers: five convolutional layers and three fully connected layers. The first convolutional layer has 96 kernels of size 11×11 with a stride of 4. The second convolutional layer has 256 kernels of size 5×5. The third and fourth convolutional layers have 384 kernels of size 3×3, and the fifth convolutional layer has 256 kernels of size 3×3. The first two fully connected layers have 4,096 neurons each, while the last one outputs the class scores. Every layer except the output is followed by a ReLU activation, and max pooling with a 3×3 window and a stride of 2 is applied after the first, second and fifth convolutional layers.

The input image in the original AlexNet paper is 224×224 (width × height). The 74K dataset, however, has images of size 128×128, so I stick with the width and height of the 74K dataset.

The AlexNet-like architecture for the 74K dataset is illustrated in Fig. 2. Please note that the input image size is different from that of the original paper.

Figure 2. AlexNet architecture for character recognition.
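
Because the input is 128×128 rather than 224×224, the feature-map sizes differ from those in the paper. Below is a small sketch of my own (not part of the original post) that traces the spatial size through the network with standard convolution arithmetic, assuming the padding choices used in the implementation that follows.

# Spatial size after a conv/pool layer:
# 'valid' gives floor((n - k) / s) + 1; 'same' gives ceil(n / s).
def out_size(n, k, s, padding='valid'):
    if padding == 'same':
        return -(-n // s)
    return (n - k) // s + 1

n = out_size(128, 11, 4)   # conv1 (valid, stride 4) -> 30
n = out_size(n, 3, 2)      # pool1                   -> 14
                           # conv2 ('same', stride 1) keeps 14
n = out_size(n, 3, 2)      # pool2                   -> 6
                           # conv3-5 ('same', stride 1) keep 6
n = out_size(n, 3, 2)      # pool5                   -> 2
print(n * n * 256)         # flattened size: 2 * 2 * 256 = 1024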

Implementation

The model can be implemented in Keras (with a TensorFlow backend) as follows:

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D
from keras.layers import Activation, Flatten, Dropout, Dense

def model(num_classes, input_shape):
    model = Sequential()

    # 1st Convolutional Layer
    model.add(Conv2D(filters=96, input_shape=input_shape, kernel_size=(11,11), strides=(4,4), padding='valid'))
    model.add(Activation('relu'))
    # Max Pooling
    model.add(MaxPooling2D(pool_size=(3,3), strides=(2,2), padding='valid'))

    # 2nd Convolutional Layer
    model.add(Conv2D(filters=256, kernel_size=(5,5), strides=(1,1), padding='same'))
    model.add(Activation('relu'))
    # Max Pooling
    model.add(MaxPooling2D(pool_size=(3,3), strides=(2,2), padding='valid'))

    # 3rd Convolutional Layer
    model.add(Conv2D(filters=384, kernel_size=(3,3), strides=(1,1), padding='same'))
    model.add(Activation('relu'))

    # 4th Convolutional Layer
    model.add(Conv2D(filters=384, kernel_size=(3,3), strides=(1,1), padding='same'))
    model.add(Activation('relu'))

    # 5th Convolutional Layer
    model.add(Conv2D(filters=256, kernel_size=(3,3), strides=(1,1), padding='same'))
    model.add(Activation('relu'))
    # Max Pooling
    model.add(MaxPooling2D(pool_size=(3,3), strides=(2,2), padding='valid'))

    # Passing it to a Fully Connected layer
    model.add(Flatten())
    # 1st Fully Connected Layer
    model.add(Dense(4096))
    model.add(Activation('relu'))
    # Add Dropout to prevent overfitting
    model.add(Dropout(0.5))

    # 2nd Fully Connected Layer
    model.add(Dense(4096))
    model.add(Activation('relu'))
    # Add Dropout to prevent overfitting
    model.add(Dropout(0.5))

    # Output Layer
    model.add(Dense(num_classes))
    model.add(Activation('softmax'))

    return model
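
With the function in place, a quick sanity check is to build the model and print its summary; the flattened size of 1,024 computed earlier should show up just before the first dense layer. The class count and input shape below are simply the values used later in this post:

# Build the model for 36 classes and 128x128 RGB inputs, then inspect the layers
alexnet = model(36, (128, 128, 3))
alexnet.summary()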

Now that we have the model ready, we can set up the data and the hyperparameters to train it.

First we define some constants for the program: DATASET_PATH (where to locate the dataset), MODEL_PATH (where to save the model after training), BATCH_SIZE (batch size), EPOCHS (number of epochs), and TARGET_WIDTH, TARGET_HEIGHT and TARGET_DEPTH (the width, height and depth of the input image respectively).

from keras.preprocessing.image import ImageDataGenerator
from keras.optimizers import Adam
from keras.callbacks import ReduceLROnPlateau
import os

# Define constants
DATASET_PATH = './English/Fnt/'
MODEL_PATH = '.'
BATCH_SIZE = 128
EPOCHS = 20
TARGET_WIDTH = 128
TARGET_HEIGHT = 128
TARGET_DEPTH = 3

Next we split the data into two parts: 80% for training and 20% for validation. This is done via Keras’ ImageDataGenerator:

# Set up the data generator to flow data from disk
print("[INFO] Setting up Data Generator...")
data_gen = ImageDataGenerator(validation_split=0.2, rescale=1./255)

train_generator = data_gen.flow_from_directory(
    DATASET_PATH,
    subset='training',
    # Keras expects target_size as (height, width); both are 128 here
    target_size=(TARGET_HEIGHT, TARGET_WIDTH),
    batch_size=BATCH_SIZE
)

val_generator = data_gen.flow_from_directory(
    DATASET_PATH,
    subset='validation',
    target_size=(TARGET_HEIGHT, TARGET_WIDTH),
    batch_size=BATCH_SIZE
)
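
One thing worth checking: flow_from_directory assigns class indices in alphabetical order of the class subdirectory names, and the prediction script later in this post relies on a hard-coded label list that must match this order. A quick way to verify the mapping (assuming the Chars74K folder layout of Sample001, Sample002 and so on):

# Print the folder-name -> class-index mapping produced by the generator
print(train_generator.class_indices)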

We build the AlexNet model using the model function defined earlier, and compile it:

# Build model
print("[INFO] Compiling model...")
alexnet = model(train_generator.num_classes, (TARGET_WIDTH, TARGET_HEIGHT, TARGET_DEPTH))

# Compile the model
alexnet.compile(optimizer=Adam(lr=0.001), loss='categorical_crossentropy', metrics=['accuracy'])

Before we train the network, we define a learning rate decay callback. ReduceLROnPlateau monitors the training loss, and if no improvement is seen for 2 epochs it multiplies the learning rate by a factor of 0.2, down to a floor of min_lr. Note that min_lr must be smaller than the initial learning rate (0.001 for Adam here) for the callback to have any effect, so I set the floor below that:

# Set the learning rate decay; the floor must sit below Adam's initial rate of 0.001
reduce_lr = ReduceLROnPlateau(monitor='loss', factor=0.2, patience=2, min_lr=1e-5)
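
For a concrete feel of the schedule, each plateau multiplies the learning rate by 0.2 until the floor is reached. A tiny illustration of my own (not from the original post):

# Starting from Adam's initial learning rate of 0.001
lr = 0.001
for plateau in range(1, 5):
    lr = max(lr * 0.2, 1e-5)
    print('after plateau %d: lr = %g' % (plateau, lr))
# after plateau 1: lr = 0.0002
# after plateau 2: lr = 4e-05
# after plateau 3: lr = 1e-05  (floor reached)
# after plateau 4: lr = 1e-05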

Finally we can train the model and save it to disk:

# Train the network
print("[INFO] Training network ...")
H = alexnet.fit_generator(
    train_generator,
    validation_data=val_generator,
    steps_per_epoch=train_generator.samples // BATCH_SIZE,
    validation_steps=val_generator.samples // BATCH_SIZE,
    epochs=EPOCHS, verbose=1, callbacks=[reduce_lr])

# Save the model to disk
print("[INFO] Serializing network...")
alexnet.save(os.path.join(MODEL_PATH, "trained_model"))

print("[INFO] Done!")

This model achieved 97.29% accuracy on the training set and 95.46% accuracy on the validation set. Considering that many characters are difficult even for humans to distinguish, such as 1 and I, 2 and Z, 0 and O, 5 and S, this is quite an impressive result.

Figure 3. Model training.

After training, we can try using the trained model to predict handwritten characters that it has not seen. Save the following code in a file called predict.py:

import argparse
import numpy as np
from keras.models import load_model
from keras.preprocessing.image import img_to_array
import cv2

# Construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-i", "--image", required=True, help="path to input image")
args = vars(ap.parse_args())

labels = [
    '0','1','2','3','4','5','6','7','8','9','A','B','C','D','E','F','G',
    'H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z'
    ]

# Define constants
TARGET_WIDTH = 128
TARGET_HEIGHT = 128
MODEL_PATH = './trained_model'

# Load the image
original_image = cv2.imread(args["image"])
# Preprocessing the image
image = cv2.resize(original_image, (TARGET_WIDTH, TARGET_HEIGHT))
image = image.astype("float") / 255.0
image = img_to_array(image)
image = np.expand_dims(image, axis=0)

# Load the trained convolutional neural network
print("[INFO] Loading my model...")
model = load_model(MODEL_PATH, compile=False)

# Classify the input image, then find the index of the class with the *largest* probability
print("[INFO] Classifying image...")
prob = model.predict(image)[0]
idx = np.argmax(prob)

# Display original image
cv2.imshow("Original Image", original_image)
cv2.waitKey(0)

# Display the predicted image
cv2.putText(original_image, 'Character is ' + labels[idx], 
    (10, 100), 
    cv2.FONT_HERSHEY_SIMPLEX, 
    2,
    (255,0,255),
    2)
cv2.imshow("Recognised Image", original_image)
cv2.waitKey(0)

To test it on an image, just run:

python predict.py --image test1.png

Bravo, it recognised the character correctly:

Figure 4. Model prediction

Full source code

https://github.com/minhthangdang/CharactersRecognition
