Commit b23ef83: Added CRNN comparison (1 parent 401fedb)

1 file changed: README.md (+52, -0)

@@ -48,6 +48,58 @@ model details should be easily identifiable in the code.

The default training mechanism uses the ADAM optimizer with learning rate decay.

## Differences from CRNN

### Deeper early convolutions

The original CRNN uses a single 3x3 convolution in each of the first two conv/pool stages, while this network uses a pair of stacked 3x3 convolutions. This change increases the theoretical receptive field of the early stages of the network.

As a tradeoff, we omit the computationally expensive 2x2x512 final convolutional layer of CRNN. In its place, this network vertically max-pools over the remaining three rows of features, collapsing them to a single 512-dimensional feature vector at each horizontal location.

The combination of these changes preserves the theoretical receptive field size of the final CNN layer, but reduces the number of convolution parameters to be learned by 15%.
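
For concreteness, here is a minimal sketch of one paired-convolution stage and the vertical max-pool collapse described above, assuming Keras-style layers and channels-last feature maps; the filter counts and shapes are illustrative rather than the repository's actual layer definitions.

```python
import tensorflow as tf

def conv_pair(filters):
    """Two stacked 3x3 convolutions where CRNN uses a single 3x3."""
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(filters, 3, padding='same', activation='relu'),
        tf.keras.layers.Conv2D(filters, 3, padding='same', activation='relu'),
    ])

# Suppose the earlier conv/pool stages leave three rows of 512-dimensional
# features: shape [batch, 3, width, 512] (channels last).
features = tf.zeros([1, 3, 25, 512])

# Vertical max pool in place of CRNN's final 2x2x512 convolution:
collapse = tf.keras.layers.MaxPool2D(pool_size=(3, 1))
sequence = tf.squeeze(collapse(features), axis=1)  # [1, 25, 512]: one 512-d
                                                   # vector per horizontal step
```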

### Padding

Another important difference is the lack of zero-padding in the first convolutional layer; zero-padding there can cause spurious strong filter responses around the border. By trimming the first convolution to valid regions, this model erodes the outermost pixel of values from the filter response maps (reducing the height from 32 to 30 and the width by two pixels).

This approach seems preferable to requiring the network to learn to ignore strong Conv1 responses near the image edge (presumably by weakening the power of filters in subsequent convolutional layers).
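
As a concrete illustration of the border erosion (the 64 filters and single grayscale input channel below are assumptions for the example, not necessarily the model's actual values):

```python
import tensorflow as tf

# With 'valid' padding, the 3x3 kernel is applied only where it fits entirely
# inside the image, eroding one pixel from each border of the response map.
first_conv = tf.keras.layers.Conv2D(64, 3, padding='valid', activation='relu')

image = tf.zeros([1, 32, 100, 1])  # [batch, height, width, channels]
response = first_conv(image)
print(response.shape)              # (1, 30, 98, 64): height 32 -> 30, width -2
```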

### Batch normalization

We include batch normalization after each pair of convolutions (i.e., after layers 2, 4, 6, and 8 as numbered above). The CRNN does not include batch normalization after its first two convolutional stages. Our model therefore requires more computation per iteration, with the aim of decreasing the number of training iterations required to reach convergence.
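
A sketch of how one conv/conv/batch-norm block could be expressed follows; placing batch normalization after the second convolution's activation is an assumption here, and the model's code should be consulted for the actual arrangement.

```python
import tensorflow as tf

def conv_pair_bn(x, filters):
    # Two 3x3 convolutions followed by one batch-normalization layer,
    # i.e., BN after layers 2, 4, 6, and 8 in the numbering used above.
    x = tf.keras.layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
    x = tf.keras.layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
    return tf.keras.layers.BatchNormalization()(x)

x = tf.zeros([1, 15, 50, 64])
x = conv_pair_bn(x, 128)  # -> [1, 15, 50, 128]
```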

### Subsampling/stride

The first two pooling stages of CRNN downsample the feature maps with a stride of two in both spatial dimensions. This model instead preserves sequence length by downsampling horizontally only in the first pooling stage; subsequent pooling stages reduce the vertical dimension alone.

Because the output feature map must have at least one timeslice per character predicted, overzealous downsampling can make it impossible to represent/predict sequences of very compact or narrow characters. Reducing the horizontal downsampling allows this model to recognize words in narrow fonts.

This increase in horizontal resolution does mean the LSTMs must capture more information. Hence this model uses 512 hidden units, rather than the 256 used by the CRNN. We found this larger number to be necessary for good performance.
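
As a rough way to see why the extra horizontal resolution matters, the sketch below estimates the number of CNN output timeslices for a given image width, assuming the two-pixel border erosion from the valid first convolution and a single horizontal stride of two as described in this section.

```python
def output_timeslices(image_width: int) -> int:
    """Approximate output sequence length for a given input image width."""
    width = image_width - 2  # border erosion from the 'valid' first convolution
    return width // 2        # the single horizontally strided pooling stage

# CTC decoding needs at least one timeslice per predicted character (more when
# adjacent characters repeat), so a 40-pixel-wide crop of a narrow word still
# yields 19 timeslices here, while downsampling the width by four as in CRNN
# would leave only about 10.
print(output_timeslices(40))  # 19
```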

# Training

To completely train the model, you will need to download the mjsynth
