model details should be easily identifiable in the code.

The default training mechanism uses the ADAM optimizer with learning
rate decay.
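
For reference, a minimal sketch of this kind of training setup using the TensorFlow
Keras API (not the repository's actual training code; the schedule values below are
illustrative placeholders):

```python
import tensorflow as tf

# Exponential learning-rate decay; the initial rate, decay interval, and
# decay factor here are placeholders, not the project's tuned values.
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-4,
    decay_steps=2**16,
    decay_rate=0.9,
    staircase=True)

# ADAM optimizer driven by the decaying learning rate.
optimizer = tf.keras.optimizers.Adam(learning_rate=schedule)
```
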
## Differences from CRNN

### Deeper early convolutions

The original CRNN uses a single 3x3 convolution in the first two conv/pool
stages, while this network uses a paired sequence of 3x3 kernels. This change
increases the theoretical receptive field of early stages of the network.

As a tradeoff, we omit the computationally expensive 2x2x512 final
convolutional layer of CRNN. In its place, this network vertically max pools
over the remaining three rows of features to collapse to a single
512-dimensional feature vector at each horizontal location.

The combination of these changes preserves the theoretical receptive field size
of the final CNN layer, but reduces the number of convolution parameters to be
learned by 15%.
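
To make these two changes concrete, here is a hedged Keras-style sketch of a paired
3x3 convolution block and of the vertical max pooling that stands in for CRNN's final
2x2x512 convolution; the layer widths and tensor shapes are illustrative assumptions,
not the exact configuration used in this code:

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_pair(x, filters):
    # Two stacked 3x3 convolutions where CRNN uses a single 3x3 kernel,
    # growing the receptive field while keeping each kernel small.
    x = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
    return layers.Conv2D(filters, 3, padding='same', activation='relu')(x)

# Example: a batch of 32-pixel-high grayscale line images (width is arbitrary).
images = tf.random.normal([8, 32, 100, 1])
early = conv_pair(images, 64)              # shape [8, 32, 100, 64]

# Hypothetical tail of the CNN: a feature map three rows high and 512 deep.
tail = tf.random.normal([8, 3, 13, 512])

# Instead of CRNN's 2x2x512 convolution, max pool vertically over the three
# remaining rows, leaving one 512-d feature vector per horizontal position.
collapsed = tf.reduce_max(tail, axis=1)    # shape [8, 13, 512]
```
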
### Padding

Another important difference is the lack of zero-padding in the first
convolutional layer, since such padding can cause spurious strong filter
responses around the border. By trimming the first convolution to valid
regions, this model erodes the outermost pixel of values from the filter
response maps (reducing the height from 32 to 30 and the width by two pixels).

This approach seems preferable to requiring the network to learn to ignore
strong Conv1 responses near the image edge (presumably by weakening the power
of filters in subsequent convolutional layers).
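
As a quick shape check of what dropping the zero-padding does, a small illustrative
snippet (the image width of 100 here is arbitrary):

```python
import tensorflow as tf

# A batch of 32-pixel-high grayscale images.
images = tf.random.normal([4, 32, 100, 1])

# 'valid' padding: the 3x3 kernel is applied only where it fully fits,
# eroding one pixel from every border of the response maps.
conv1 = tf.keras.layers.Conv2D(64, 3, padding='valid', activation='relu')
print(conv1(images).shape)   # (4, 30, 98, 64): height 32 -> 30, width 100 -> 98
```
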
### Batch normalization

We include batch normalization after each pair of convolutions (i.e., after
layers 2, 4, 6, and 8 as numbered above). The CRNN does not include batch
normalization after its first two convolutional stages. Our model therefore
requires greater computation, with an eye toward decreasing the number of
training iterations required to reach convergence.
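
A hedged sketch of one such block (the exact ordering of batch normalization,
non-linearities, and pooling in the actual code may differ):

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_conv_bn(x, filters):
    # A pair of 3x3 convolutions followed by batch normalization; a block of
    # this shape would appear after layers 2, 4, 6, and 8 as numbered above.
    x = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
    x = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
    return layers.BatchNormalization()(x)

x = tf.random.normal([4, 30, 98, 64])
x = conv_conv_bn(x, 64)   # same spatial shape, batch-normalized output
```
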
### Subsampling/stride

The first two pooling stages of CRNN downsample the feature maps with a stride
of two in both spatial dimensions. This model instead preserves sequence length
by downsampling horizontally only in the first pooling stage; subsequent
pooling stages downsample vertically only.

Because the output feature map must have at least one timeslice per character
predicted, overzealous downsampling can make it impossible to represent/predict
sequences of very compact or narrow characters. Reducing the horizontal
downsampling allows this model to recognize words in narrow fonts.
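
To make the stride difference concrete, a hedged sketch of the two pooling styles;
the kernel sizes, strides, and feature-map shapes are illustrative, not lifted from
the code:

```python
import tensorflow as tf
from tensorflow.keras import layers

x = tf.random.normal([1, 30, 98, 64])   # a feature map after the early convolutions

# CRNN-style pooling: stride 2 in both dimensions halves the sequence length.
both_dims = layers.MaxPool2D(pool_size=2, strides=(2, 2))(x)
print(both_dims.shape)      # (1, 15, 49, 64)

# Vertical-only stride: height is halved, but the horizontal (time) axis,
# which supplies the CTC timeslices, is essentially preserved.
vertical_only = layers.MaxPool2D(pool_size=2, strides=(2, 1))(both_dims)
print(vertical_only.shape)  # (1, 7, 48, 64)
```

Each column of the final feature map becomes one CTC timeslice, so a wider feature
map leaves room for more (and narrower) characters.
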
This increase in horizontal resolution does mean the LSTMs must capture more
information. Hence this model uses 512 hidden units, rather than the 256 used
by the CRNN. We found this larger number to be necessary for good performance.
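
For illustration, the recurrent stage over the resulting column features might be
sketched as follows (a single bidirectional LSTM layer is assumed here purely for
illustration):

```python
import tensorflow as tf
from tensorflow.keras import layers

# One 512-d feature vector per horizontal position, from the CNN above.
columns = tf.random.normal([8, 48, 512])   # [batch, timeslices, features]

# Bidirectional LSTM with 512 hidden units (vs. the 256 used by CRNN),
# returning an output at every timeslice for the CTC layer that follows.
rnn = layers.Bidirectional(layers.LSTM(512, return_sequences=True))
print(rnn(columns).shape)   # (8, 48, 1024): forward and backward outputs concatenated
```
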
# Training

To completely train the model, you will need to download the mjsynth