<table>
<thead>
<tr>
<th>Title</th>
<th>Efficient Mini-batch Training on Memristor Neural Network Integrating Gradient Calculation and Weight Update</th>
</tr>
</thead>
<tbody>
<tr>
<td>Author(s)</td>
<td>YAMAMORI, Satoshi; HIROMOTO, Masayuki; SATO, Takashi</td>
</tr>
<tr>
<td>Citation</td>
<td>IEICE Transactions of Fundamentals on Electronics, Communications and Computer Sciences (2018), E101-A(7): 1092-1100</td>
</tr>
<tr>
<td>Issue Date</td>
<td>2018-07-01</td>
</tr>
<tr>
<td>URL</td>
<td><a href="http://hdl.handle.net/2433/242229">http://hdl.handle.net/2433/242229</a></td>
</tr>
<tr>
<td>Rights</td>
<td>© 2018 The Institute of Electronics, Information and Communication Engineers 許諾条件に基づいて掲載しています。</td>
</tr>
<tr>
<td>Type</td>
<td>Journal Article</td>
</tr>
<tr>
<td>Textversion</td>
<td>publisher</td>
</tr>
</tbody>
</table>

Kyoto University
Efficient Mini-Batch Training on Memristor Neural Network Integrating Gradient Calculation and Weight Update

Satoshi YAMAMORI, Student Member, Masayuki HIROMOTO, and Takashi SATO, Members

SUMMARY We propose an efficient training method for memristor neural networks. The proposed method is suitable for the mini-batch-based training, which is a common technique for various neural networks. By integrating the two processes of gradient calculation in the backpropagation algorithm and weight update in the write operation to the memristors, the proposed method accelerates the training process and also eliminates the external computing resources required in the existing method, such as multipliers and memories. Through numerical experiments, we demonstrated that the proposed method achieves twice faster convergence of the training process than the existing method, while retaining the same level of the accuracy for the classification results.

key words: memristor, neural network, mini-batch training, stochastic gradient descent

1. Introduction

Recent progress in neural networks is remarkable. The performance of the neural networks have become comparable, or have outperformed the human’s ability in various fields such as image recognition, speech recognition, and automatic translation [1]–[4]. This progress owes greatly to the progress of computer technologies along with the Moore’s law. However, recent processors, such as CPUs and GPUs, are facing a serious obstacle — the computation in state-of-the-art neural network architectures demand large amount of memory access for input data and network parameters.

A memristor neural network (MNN) [5], [6] is one of the emerging computing technologies to resolve the above problem. A memristor [7]–[9], which is also called as a ReRAM, is a passive electric element whose resistance is changed by the current passing through the device, and can be used as a low-power non-volatile memory [6], [10]. By employing a crossbar array structure, the memristors are also capable of performing multiply-and-accumulate (MAC) operations [5], [11]–[13]. This enables so-called in-memory calculation, which has a possibility to overcome the Von Neumann bottleneck of the modern processors. There are various kinds of MNNs proposed [13]–[16], most of which are reported to realize much better energy efficiency than the conventional digital processing.

The training of the MNNs is realized by applying high voltages to the memristors to update conductance according to the parameters of the neural networks. There are two approaches to train the MNN: ex-situ and in-situ methods [17]. In the ex-situ method, the parameters, or weights, of the neural networks are calculated separately from the MNN by executing a training algorithm on ordinary digital computers. This method is so simple that it is used in various existing works [11], [17]. However, it is difficult to accurately write the weights to the memristors due to device variation. Since the error in the weights may deteriorate the classification performance of the neural networks, additional adjustment of the memristor resistance has been necessary. On the other hand, the in-situ method performs network training on the MNN itself. Although this method requires additional peripheral circuits dedicated for training, this method has advantages in terms of robustness against the device variation and capability to realize on-line training.

Kataeva et al. [18] proposed an in-situ training method for the MNN, in which the backpropagation algorithm is efficiently combined with the weight update process on the MNN. This method realized a good training result on the MNN having variations of the memristor devices, as reported by Prezioso et al. [13]. However, this method is not suitable for applying to the algorithms that use a mini-batch technique. The mini-batch-based training is known as fast and robust, and it is used in combination with various optimization algorithms such as a stochastic gradient decent (SGD) method [19]. The reason is that the Kataeva’s method requires not only a large number of update iterations in proportion to the size of the MNN, and also additional computation resources, such as multipliers and memories, are required to calculate and store the intermediate data.

In this paper, we propose weight dividing update (WDU) method, an efficient in-situ training method for MNNs with mini-batch technique. The proposed method divides the weight update into discrete processes for each sample in a mini-batch. As a result, we can combine two processes into one: gradient calculation in the backpropagation algorithm, and weight update in the write operation of the memristor. This integration can eliminate the external multipliers and memories that have been indispensable for the existing method, and also can shorten the training process by parallel execution of the weight update for all the memristors on the crossbar array.

The contribution of this paper is summarized as follows.

• A weight dividing update (WDU) method for efficient mini-batch training on memristor neural networks

Copyright © 2018 The Institute of Electronics, Information and Communication Engineers
Algorithm 1 Stochastic gradient decent (SGD)

1: Choose an initial weight matrix \( W \)
2: for \( i = 1, \ldots, N_{\text{epochs}} \) do
3: \hspace{1em} Randomly sample a mini-batch \( \mathcal{B} \) from the training set
4: \hspace{1em} Calculate gradient \( \Delta W \) on the mini batch \( \mathcal{B} \)
5: \hspace{1em} Update weight: \( W \leftarrow W + \Delta W \)
6: end for

(MNNs) is newly proposed.

- WDU eliminates the external multipliers and reduces memory usage compared to the existing method [18].
- WDU can accelerate the training process by parallel weight update of the memristors.
- The training accuracy is evaluated through circuit simulation. The result shows the proposed WDU achieves \( M/K \) times faster convergence of the training compared to the existing method [18], while keeping the same classification accuracy. Here, \( M \) and \( K \) are a network size and a mini-batch size, respectively. The mini-batch is a subset of the training dataset, from which the gradient is calculated for SGD to update weights of the neural network. The mini-batch size \( K \) is the number of the data elements in the mini-batch.

The remainder of the paper is organized as follows. First, the fundamentals of conventional MNNs are briefly reviewed in Sect. 2. Then the proposed WDU method is described in Sect. 3, and the performance evaluation through circuit simulation is conducted Sect. 4. Finally, the paper is concluded in Sect. 5.

2. Memristor Neural Network

2.1 Neural Network

A neural network is one of the brain-inspired mathematical models that have an ability to approximate functions through learning on a large training data [19]. The neural network has a hierarchical network structure that consists of simple units called neurons. Figure 1 shows a neuron model and an example of a multi-layer neural network.

The output of the neuron is given by

\[
y = f(Wz),
\]

where \( z, W, \) and \( f(\cdot) \) are an input vector, a weight matrix, and an activation function, respectively.

In training the neural networks, stochastic gradient descent (SGD) [19] is widely used. Algorithm 1 shows a pseudocode of the SGD, in which gradient calculation (Line 4) and weight update (Line 5) are iteratively executed for each epoch until the iteration reaches a predefined number of epochs, \( N_{\text{epochs}} \). The weight gradient \( \Delta W \) is obtained by the backpropagation method as follows:

\[
e^I(k) = t(k) - y(k),
\]

\[
e^l(k) = \left( (W^{l+1})^T e^{l+1}(k) \right) \odot f'(W^l z^l(k)),
\]

where \( z, \) \( W, \) and \( f(\cdot) \) are an input vector, a weight matrix, and an activation function, respectively.

\[
\Delta^l(k) = e^l(k) \left( z^l \right)^T(k),
\]

\[
\Delta W^l = \eta \frac{1}{K} \sum_{k=1}^{K} \Delta^l(k).
\]

Here we assume an \( L \)-layer neural network, and \( z^l, W^l, \) and \( e^l \) are the input, weights, and errors for the \( l \)-th layer, respectively. \( y \) is the output of the final layer, i.e., the result of the forward propagation of the neural network. \( t \) is target values of the output, and the error at the final layer is calculated by Eq. (2). Note that the augment \( k \) for each variable indicates that those values correspond the \( k \)-th training sample in a mini-batch \( \mathcal{B} \) whose size is \( K \). The errors at the \( l \)-th layer can be obtained from the errors at the \( (l+1) \)-th layer as shown in Eq. (3). The operator symbols \( \odot, T, \) and \( ' \) mean Hadamard product, matrix transpose, and derivative, respectively. By repeating this process, errors for all the layers are obtained sequentially from the last layer to the first layer. This is why this method is called backpropagation. After having all the errors \( e^l \), the gradient of the weights to be updated is calculated by Eqs. (4) and (5), where \( \eta \) is a constant variable called a learning rate.

2.2 Memristor Neural Network

2.2.1 Memristor Device

A memristor is a passive two-terminal component theoretically predicted by Chua in 1970s [7] and found by Strukov et al. in 2008 [9]. Figure 2(a) shows a typical structure of the memristor, which consists of a metal oxide layer in between two metal layers. The conductance of the memristor changes depending on the current passing through the device. The 1-V characteristic of the memristor is shown in Fig. 2(b), in which the hysteresis loop is found. The memristor works in two operation modes: read and write. In the read mode, a
The physical abstraction model and the I-V characteristic of the memristor device.

![Physical model and I-V characteristic](image)

**Fig. 2** The physical abstraction model and the I-V characteristic of the memristor device.

The physical abstraction model and the I-V characteristic of the memristor device.

![Physical model and I-V characteristic](image)

**Fig. 2** The physical abstraction model and the I-V characteristic of the memristor device.

The physical abstraction model and the I-V characteristic of the memristor device.

![Physical model and I-V characteristic](image)

**Fig. 2** The physical abstraction model and the I-V characteristic of the memristor device.

The physical abstraction model and the I-V characteristic of the memristor device.

![Physical model and I-V characteristic](image)

**Fig. 2** The physical abstraction model and the I-V characteristic of the memristor device.

The physical abstraction model and the I-V characteristic of the memristor device.

![Physical model and I-V characteristic](image)

**Fig. 2** The physical abstraction model and the I-V characteristic of the memristor device.

The physical abstraction model and the I-V characteristic of the memristor device.

![Physical model and I-V characteristic](image)

**Fig. 2** The physical abstraction model and the I-V characteristic of the memristor device.

The physical abstraction model and the I-V characteristic of the memristor device.

![Physical model and I-V characteristic](image)

**Fig. 2** The physical abstraction model and the I-V characteristic of the memristor device.

The physical abstraction model and the I-V characteristic of the memristor device.

![Physical model and I-V characteristic](image)

**Fig. 2** The physical abstraction model and the I-V characteristic of the memristor device.

The physical abstraction model and the I-V characteristic of the memristor device.

![Physical model and I-V characteristic](image)

**Fig. 2** The physical abstraction model and the I-V characteristic of the memristor device.

The physical abstraction model and the I-V characteristic of the memristor device.

![Physical model and I-V characteristic](image)

**Fig. 2** The physical abstraction model and the I-V characteristic of the memristor device.

The physical abstraction model and the I-V characteristic of the memristor device.

![Physical model and I-V characteristic](image)

**Fig. 2** The physical abstraction model and the I-V characteristic of the memristor device.

The physical abstraction model and the I-V characteristic of the memristor device.

![Physical model and I-V characteristic](image)

**Fig. 2** The physical abstraction model and the I-V characteristic of the memristor device.

The physical abstraction model and the I-V characteristic of the memristor device.

![Physical model and I-V characteristic](image)

**Fig. 2** The physical abstraction model and the I-V characteristic of the memristor device.

The physical abstraction model and the I-V characteristic of the memristor device.

![Physical model and I-V characteristic](image)

**Fig. 2** The physical abstraction model and the I-V characteristic of the memristor device.

The physical abstraction model and the I-V characteristic of the memristor device.

![Physical model and I-V characteristic](image)

**Fig. 2** The physical abstraction model and the I-V characteristic of the memristor device.

The physical abstraction model and the I-V characteristic of the memristor device.

![Physical model and I-V characteristic](image)

**Fig. 2** The physical abstraction model and the I-V characteristic of the memristor device.

The physical abstraction model and the I-V characteristic of the memristor device.

![Physical model and I-V characteristic](image)

**Fig. 2** The physical abstraction model and the I-V characteristic of the memristor device.

The physical abstraction model and the I-V characteristic of the memristor device.

![Physical model and I-V characteristic](image)

**Fig. 2** The physical abstraction model and the I-V characteristic of the memristor device.

The physical abstraction model and the I-V characteristic of the memristor device.

![Physical model and I-V characteristic](image)

**Fig. 2** The physical abstraction model and the I-V characteristic of the memristor device.

The physical abstraction model and the I-V characteristic of the memristor device.

![Physical model and I-V characteristic](image)

**Fig. 2** The physical abstraction model and the I-V characteristic of the memristor device.

The physical abstraction model and the I-V characteristic of the memristor device.

![Physical model and I-V characteristic](image)

**Fig. 2** The physical abstraction model and the I-V characteristic of the memristor device.

The physical abstraction model and the I-V characteristic of the memristor device.

![Physical model and I-V characteristic](image)

**Fig. 2** The physical abstraction model and the I-V characteristic of the memristor device.

The physical abstraction model and the I-V characteristic of the memristor device.

![Physical model and I-V characteristic](image)

**Fig. 2** The physical abstraction model and the I-V characteristic of the memristor device.

The physical abstraction model and the I-V characteristic of the memristor device.

![Physical model and I-V characteristic](image)

**Fig. 2** The physical abstraction model and the I-V characteristic of the memristor device.

The physical abstraction model and the I-V characteristic of the memristor device.

![Physical model and I-V characteristic](image)

**Fig. 2** The physical abstraction model and the I-V characteristic of the memristor device.
weight update method on the crossbar array are detailed in the next section.

2.2.3 Variable-Amplitude Scheme

In this section, we describe the "variable-amplitude" scheme proposed by Kataeva et al. [18] for in-situ training of the neural networks on the memristor crossbar array. From Eq. (7), the conductance change $\Delta G$ is determined by the period $\Delta T$ and the amplitude $V_a$ of the voltage application. In the variable-amplitude scheme, the write voltage $V_a$ is changed to control the conductance, while the period $\Delta T$ is kept constant. The advantage of this method is that it can substitute multiplication of two variables with subtraction of two voltages applied to the memristor. For example, assume that we want to update the conductance by $\Delta G \propto XY$, where $X$ and $Y$ are two variables whose product is proportional to $\Delta G$ and satisfying $|X|, |Y| > 1$. Let the two voltages $V_x$ and $V_y$ be

$$V_x = \text{Sign}(XY) \ln |X|,$$
$$V_y = -\text{Sign}(XY) \ln |Y|.\quad (12)$$

When these two voltages are applied to the two terminals of the memristor, its conductance changes by

$$\Delta G \propto \exp(V_x) - \exp(V_y) = \exp(V_x - V_y) = \exp(\text{Sign}(XY) \ln |XY|) = XY.\quad (13)$$

This means that the multiplication of $X$ and $Y$ can be realized just by applying corresponding voltages $V_x$ and $V_y$.

This property of the variable-amplitude scheme enables effective in-situ training on the memristor crossbar array when not using a mini-batch, i.e., $K = 1$. In this case, the conductance of the memristor at the intersection of the $i$-th row and the $j$-th column should be updated by $\Delta G_{ij} = \eta \Delta W_{ij}$ according to Eq. (5). This update is realized in a fully parallel manner on the crossbar array by applying voltages $V_j \propto \ln |z_j|$ to the $j$-th row and $V_i \propto \ln |e_i|$ to the $i$-th column respectively as shown in Fig. 4. Note that $V_i$ and $V_j$ must be chosen to satisfy the V/2 scheme condition in Eq. (11). Since the signs of the two values $z_j$ and $e_i$ have four combinations of $(+, +)$, $(+, -)$, $(-, -)$, and $(-, +)$, the update process must be separately executed in four phases corresponding to such sign combinations, which is detailed in [18]. Thus the number of the voltage application for each weight update becomes $N_{va} = 4$. The advantage of this method is that it does not require external circuits to calculate the multiplication of $z_j \times e_i$, but requires $N + M$ memories to store $z_j$ and $e_i$ during the backpropagation process.

In the case of the mini-batch training with $K > 1$, however, the above method does not work efficiently. From Eq. (5), the conductance update of the memristor should be $\Delta G_{ij} = \Delta W_{ij} = \eta \frac{1}{K} \sum_{k=1}^{K} e_i(k) z_j(k)$, which cannot be decomposed into a simple multiplication of two variables, such as $e_i(k)$ and $z_j(k)$. Therefore, we must calculate $\Delta W_{ij}$ off the crossbar, and apply voltages $V_j \propto \ln |\Delta W_{ij}|/C$ to the $j$-th row and $V_i \propto \ln (C)$ to the $i$-th column, respectively. Note that $C > 0$ is a constant such that $V_i$ and $V_j$ satisfy the V/2 scheme condition in Eq. (11). Because the values of $V_j$ are different for each $i$-th column, the update process requires $M$ iterations as shown in Alg. 2 and Fig. 5. Each iteration is executed in two phases since $C$ is a positive constant and only the sign of $\Delta W_{ij}$ must be considered. Thus, the total number of the voltage application becomes $N_{va} = 2M$. This means that the time required for the weight update proportionally increases as the size of the crossbar becomes larger. In addition, unlike the case of $K = 1$, this method requires $KMN$ multiplications and $MN$ memories for the external circuit to calculate and store $\Delta W_{ij}$, which may deteriorate overall efficiency of the system that is based on the MNN.

3. Weight Dividing Update

In this section, we propose a novel weight update method called "weight dividing update (WDU),” which realizes efficient mini-batch training on the memristor crossbar array. As described in the previous section, the existing method [18] does not work efficiently for the mini-batch training since the multiplication and the summation for the $K$ samples in Eqs. (4) and (5) must be calculated in the external system. In contrast, our proposed method performs this calculation
In iteration, all the memristors are updated in parallel. Note that inside the crossbar array in addition to the weight update, the voltage application of WDU is \( N_{va} = 4K \). Algorithm 3 shows the procedure of the weight update by the proposed method. First, the input data \( z_j(k) \) and the corresponding errors \( e_j(k) \) are calculated through forward and backward propagation for each training sample \( k \) in the mini-batch. Then the weight update is executed for each \( k \), without calculating the summation of \( K \) samples in Eq. (5). In other words, the weight \( \Delta W \) is divided into \( K \) values of \( z_j(k) \cdot e_j(k) \) and they are sequentially updated through \( K \) iterations. This method brings another great advantage that the multiplications of \( z_j(k) \) and \( e_j(k) \) can be realized on the crossbar array in a fully parallel manner, as in the case of the existing method [18] with \( K = 1 \). Figure 6 illustrates the update flow of the proposed method, in which, for each iteration, all the memristors are updated in parallel. Note that each update is identical to that of the existing method [18] with \( K = 1 \), and it requires four phases due to the sign combination of the values. Thus the total number of the voltage application is \( N_{va} = 4K \) for the proposed WDU.

Table 1 summarizes the comparison of the existing methods and the proposed method WDU.

\[
\text{CPE} = C_{twd} \times D + C_{bwd} \times D + C_{upd} \times D/K,
\]

where \( D \) is a total number of the training data, and \( K \) is a mini-batch size. \( C_{twd} \) and \( C_{bwd} \) are the numbers of required clocks to execute the forward and backward propagation for a single training data, respectively. In general, they are constant values. \( C_{upd} \) is a number of required clocks for weight update, i.e., the number of voltage application, \( N_{va} \). From this definition, the required clocks per data (CPE/D) can be calculated as shown in Table 1. The order of the CPE/D for the proposed method is \( O(1) \), which is independent of the size of the crossbar array. Another advantage of WDU is smaller footprint of external memories than the existing method, when the crossbar size \( M \) is sufficiently larger than the batch size \( K \). In addition, WDU requires no multipliers for all \( M \) and \( K \), while existing method requires \( KMN \) multiplies when \( K > 1 \).
method becomes larger than that of the existing method since each voltage source is typically realized by a digital-to-analog converter (DAC). However, as the size of the crossbar \((M \text{ and } N)\) becomes larger, the area of the DACs becomes relatively smaller than that of the crossbar array since the former grows linearly to \(M \text{ and } N\), while the latter grows in the order of \(O(MN)\). Therefore, the increase of the circuit area is remedied when we assume to use a sufficiently large crossbar array as in our research target. In addition, the additional DACs required for the proposed method does not deteriorate the operation frequency, since all the DACs operate in parallel and thus their conversion time is equal to the single DAC.

The training results of the MNNs by WDU and the existing method are basically identical since both methods are based on the same equation to determine the update weight as Eq. (5). However, this claim is true under the assumption that the conductance of the memristor \(G\) in Eq. (7) is constant. In reality, the conductance \(G\) changes gradually during the WDU process, which may cause difference between the final results after training of the two methods. In this work, however, we assume that the conductance change is negligibly small and does not affect the training process. This assumption is validated through the experiments in the next section, which shows that the training results by the proposed method achieved almost equal accuracy to the existing method.

4. Experiments

In this section, through circuit simulations using a memristor device model, we demonstrate that the proposed WDU method can achieve equal accuracy with the conventional method in training MNNs.

4.1 Memristor Model

In the following experiments, we use a memristor model by Chen et al. [21], which is a behavioral model written in Verilog-A language and for analog circuit simulation using SPICE. The model parameters in this model are used, which are based on the measurements on a HfO\(_x\)-based ReRAM. The I-V characteristic of the memristor model follows Eq. (7) in the region where \(\delta_G\) is small. The write voltage to change the conductance of the memristor by \(\delta_G\) can be determined as

\[
V_w = A \ln(\kappa \delta_G) + B, \tag{19}
\]

where \(A = 0.03864\), \(B = 2.030\), and \(\kappa = 0.05\) are the constants calculated from the model parameters for the voltage application time of \(\Delta T = 3.5 \text{ ns}\). Note that the signs of \(V_w\) and \(\delta_G\) do not match if \(\delta_G < \exp(-B/A)\kappa < 2.1 \times 10^{-22}\), but this case rarely occurs in a practical use.

On the other hand, if \(\delta_G\) takes a large value, Eq. (7) does not hold, and the variable-amplitude method is not applicable to train the MNN. Therefore, we must use the memristor device under the write voltage \(V_w\) such that \(\delta_G\) is maintained small.

In order to determine an adequate \(V_w\), we first conduct a preliminary experiment for a single memristor by SPICE simulation. Given a memristor having a certain initial conductance \(G_{\text{ini}}\), a write voltage \(V_w\), which is determined by Eq. (19) to increase the conductance by \(\delta_G\), is applied, and the error \(e = G - G_{\text{ini}}(1 + \delta_G)\) is observed, where \(G\) is a new conductance after the application of the write voltage. Figure 7(a) shows conductance error \(e\) for various combinations of \(G_{\text{ini}}\) (x-axis) and \(\delta_G\) (y-axis). Note that \(\delta_G\) is presented as a relative percentage against \(G_{\text{ini}}\). From the result, the error is suppressed fewer than 10% if the conductance change \(\delta_G\) is less than 10%, whereas the intolerable errors are found in the region with large \(\delta_G\). Thus the training parameters are determined so that \(\delta_G\) becomes less than 10% of the initial conductance.

4.2 Simulation Setups

For a benchmark to evaluate the memristor neural network, a binary classification problem called “circle” [22] is utilized. The classification task is to separate the two-dimension input vectors \((x_1, x_2)\) into two classes. An example input vectors are shown in Fig. 7(b), in which the red and blue present the members of the two classes. All the input vectors are distributed in the range of \([-1.0, 1.0]\). Among 200 samples in the dataset, we utilize 150 samples for training and 50 samples for testing, respectively.

The configuration of the neural network to be implemented on the memristor crossbar array is a two-layer fully connected network with a 2-\(M\) first layer and \(M\)-1 second layer. The numbers of the input and output channels are one and two, respectively. The total number of the memristors to be used is \((2+1) \times M + (M+1) \times 1 \times 2 = 8M + 2\). Note that “+1” is for the bias term, and “\(x2\)” is for the pair of memristors to represent positive and negative weights as described in Sect. 2.2.2. The initial conductances of the memristors \(G_{\text{ini}}\) are determined by changing the model parameter \(g\), a gap, which is a length of high-impedance region in a memristor device. For each memristor, \(g\) is sampled from a normal distribution with the mean of 1.37 nm and the relative variance of 0.01. For the activation functions, a hyperbolic tangent

![Fig. 7](image-url)
the simulation circuit and its behavior. The three modes rate to

parameters to normalize the input, and a sigmoid function

forward propagation, back propagation, and weight update
eled. In a practical use, the peripheral circuits are realized
controlled voltage sources (CCVS), activation functions, and
the crossbar array, and the peripheral circuits such as current
memristor model described in Sect. 4.1 is used for the
network.

The classification errors of the existing method [18] and
the proposed method are compared with different sets of the
parameters $K$ (mini-batch size) and $M$ (crossbar array size).
Table 3 shows the classification errors on the test set by the

existing and the proposed methods. The result shows that the
proposed method achieves comparable or lower error rates
compared to the existing method. The minimum error is
6.0%, which is almost equal to the software classification
results. Thus the proposed method successfully performs
the sufficient level of the neural network training.

Figure 9 shows the learning curves for $(K, M)$ = (8, 16), (8, 32), (16, 16), (16, 32), which represent time change of the classification error for the test set as a function of training epochs. These results indicate that the proposed method achieves the equivalent training of the neural networks to the existing method. Figure 10 shows the same
learning curves in Fig. 9 but the x-axis is changed to the elapsed time. The elapsed time is calculated by using CPE/D in Table 4 and the clock period. From Eq. (18), CPE/D for the existing and the proposed methods are \((2 \times M/K + 2)\) and 6, respectively. Here, we designed \(C_{\text{rmwd}} = C_{\text{rmwd}} = 1\) for both methods and \(C_{\text{rd}} = 2M\) and 4 for the existing and the proposed methods, respectively. In this experiment, the same clock period of 3.5 ns are used for both methods since their peripheral circuits are almost the same and the maximum delay of both methods is equal. The result when \((K, M) = (8, 32)\), which satisfies the condition of \(K < M/2\) described in Sect. 3, shows the twice faster convergence of the proposed method than the existing method. For example, the error rate of the proposed method becomes 0.1 at 50 µs, while that of the existing method reaches the same error rate after spending 100 µs.

5. Conclusion

This paper proposed an efficient mini-batch-based training method for MNNs. In the proposed method, we integrate the two essential processes for the training, i.e., the gradient calculation and the weight update, into one process. As a result, the required number of voltage applications to the memristor crossbar array becomes proportional to the size of the mini-batch, whereas that of the existing method is proportional to the size of the crossbar array. The proposed method is suitable when the large crossbar arrays are used. In addition, the external multipliers and memories required by the existing method are greatly reduced by the proposed method. Through the experiments by circuit simulation, it is shown that the proposed method achieves twice faster training process than the existing method when relatively large networks such as \((K, M) = (8, 32)\) are used, while retaining the same level of the accuracy for the classification results.

For the future work, we will consider the design of the peripheral circuits that are suitable for training of neural networks considering the analog characteristics of memristors. Acceleration of the circuit simulation is another future work so that we can accurately evaluate much larger size of neural networks.

Acknowledgments

This work was partially supported by JSPS KAKENHI Grant No. 26730027 and 17H01713. This work was also supported by VLSI Design and Education Center (VDBC), the University of Tokyo in collaboration with Synopsys, Inc.

References


2006.


Satoshi Yamamori received B.E. degree in Electrical and Electronic Engineering from Kyoto University in 2017. He is a master course student at Department of Systems Science, Kyoto University. He is a student member of the Institute of Electronics, Information and Communication Engineers (IEICE).

Masayuki Hiromoto received B.E. degree in Electrical and Electronic Engineering and M.Sc. and Ph.D. degrees in Communications and Computer Engineering from Kyoto University in 2006, 2007, and 2009 respectively. He was a JSPS research fellow from 2009 to 2010, and with Panasonic Corp. from 2010 to 2013. In 2013, he joined the Graduate School of Informatics, Kyoto University, where he is currently a senior lecturer. His research interests include VLSI design methodology, image processing, and pattern recognition. He is a member of IEEE, IEICE, and IPSJ.

Takashi Sato received B.E. and M.E. degrees from Waseda University, Tokyo, Japan, and a Ph.D. degree from Kyoto University, Kyoto, Japan. He was with Hitachi, Ltd., Tokyo, Japan, from 1991 to 2003, with Renesas Technology Corp., Tokyo, Japan, from 2003 to 2006, and with the Tokyo Institute of Technology, Yokohama, Japan. In 2009, he joined the Graduate School of Informatics, Kyoto University, Kyoto, Japan, where he is currently a professor. He was a visiting industrial fellow at the University of California, Berkeley, from 1998 to 1999. His research interests include CAD for nanometer-scale LSI design, fabrication-aware design methodology, and performance optimization for variation tolerance. Dr. Sato is a member of the IEEE, ACM, and IEICE. He received the Beatrice Winner Award at ISSCC 2000 and the Best Paper Award at ISQED 2003.