Interpreting a new generation of speech recognition systems

Today, Science and Technology News unveils a new generation of speech recognition systems for its readers.

As is well known, ever since Microsoft Research first used Deep Neural Networks (DNNs) to achieve significant gains on large-scale speech recognition tasks in 2011, DNNs have received more and more attention in the field and have gradually become the standard for speech recognition systems. However, further research has shown that although the DNN has strong classification ability, its ability to capture contextual timing information is weak, so it is not well suited to modeling signals with long-term temporal correlation. Speech is a complex time-varying signal with strong correlation between frames, reflected mainly in coarticulation: the sounds we are currently producing are usually influenced by several neighboring words. In other words, there is long-term correlation between speech frames.


Figure 1: Schematic diagram of DNN and RNN

Compared with the feedforward DNN, the Recurrent Neural Network (RNN) adds a feedback connection to the hidden layer: the input to the RNN's hidden layer at the current time step includes the hidden layer's output from the previous time step. Through this recurrent feedback connection, the RNN can see information from all earlier time steps, which gives it a memory function, as shown in Figure 1. These properties make the RNN well suited to modeling temporal signals. In speech recognition, the RNN has emerged in recent years as a deep learning framework to replace the DNN, and the introduction of Long Short-Term Memory (LSTM) units solved the vanishing-gradient problem of the simple RNN, allowing the RNN framework to be put into practical use in speech recognition and to surpass the DNN. It has been adopted in some of the industry's most advanced speech systems.
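To make the recurrence concrete, here is a minimal NumPy sketch of a simple (Elman-style) RNN hidden layer; the dimensions and names are illustrative assumptions, and the code is not part of the system described in this article.

```python
import numpy as np

def simple_rnn_forward(x_seq, W_xh, W_hh, b_h):
    """Run a simple (Elman-style) RNN over a sequence of feature frames.

    x_seq : (T, input_dim)   sequence of speech feature frames
    W_xh  : (input_dim, hidden_dim)  input-to-hidden weights
    W_hh  : (hidden_dim, hidden_dim) hidden-to-hidden (feedback) weights
    b_h   : (hidden_dim,)            hidden bias
    """
    T = x_seq.shape[0]
    hidden_dim = W_hh.shape[0]
    h = np.zeros(hidden_dim)          # hidden state starts empty
    outputs = []
    for t in range(T):
        # The current hidden state depends on the current frame AND the
        # previous hidden state -- this feedback is what gives the RNN memory.
        h = np.tanh(x_seq[t] @ W_xh + h @ W_hh + b_h)
        outputs.append(h)
    return np.stack(outputs)          # (T, hidden_dim)

# Toy usage with hypothetical sizes: 50 frames of 40-dim features, 128 hidden units.
rng = np.random.default_rng(0)
frames = rng.standard_normal((50, 40))
H = simple_rnn_forward(frames,
                       rng.standard_normal((40, 128)) * 0.1,
                       rng.standard_normal((128, 128)) * 0.1,
                       np.zeros(128))
```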

In addition, researchers have made further improvements on top of the RNN. Figure 2 shows the mainstream RNN acoustic model framework for speech recognition, which consists mainly of two parts: a deep bidirectional LSTM RNN and a CTC (Connectionist Temporal Classification) output layer. When judging the current speech frame, a bidirectional RNN can use not only historical speech information but also future speech information, allowing more accurate decisions; CTC makes effective "end-to-end" training possible without frame-level annotation.
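As a rough illustration of this framework, the PyTorch sketch below stacks a bidirectional LSTM under a CTC output layer; the layer sizes, label count, and blank index are assumptions chosen for the example, not the configuration of the system described here.

```python
import torch
import torch.nn as nn

class BiLSTMCTCAcousticModel(nn.Module):
    """Deep bidirectional LSTM acoustic model with a CTC output layer."""

    def __init__(self, feat_dim=40, hidden=320, layers=3, num_labels=100):
        super().__init__()
        # Bidirectional: each frame sees both past and future context.
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=layers,
                            bidirectional=True, batch_first=True)
        # num_labels includes the CTC "blank" symbol, here placed at index 0.
        self.proj = nn.Linear(2 * hidden, num_labels)

    def forward(self, feats):                   # feats: (batch, T, feat_dim)
        out, _ = self.lstm(feats)
        return self.proj(out).log_softmax(-1)   # (batch, T, num_labels)

model = BiLSTMCTCAcousticModel()
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

feats = torch.randn(2, 200, 40)                # 2 utterances, 200 frames each
log_probs = model(feats).transpose(0, 1)       # CTCLoss expects (T, batch, labels)
targets = torch.randint(1, 100, (2, 30))       # label sequences, no frame alignment
loss = ctc(log_probs, targets,
           input_lengths=torch.tensor([200, 200]),
           target_lengths=torch.tensor([30, 30]))
loss.backward()
```

Because CTC marginalizes over all possible alignments, only the label sequence of each utterance is needed, which is what allows the "end-to-end" training without frame-level annotation mentioned above.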


Figure 2: Mainstream acoustic model framework based on LSTM RNN

At present, many academic and industrial institutions around the world have mastered the RNN model and are conducting research on one or more of the technical points above. Each of these points generally yields good results when studied in isolation, but integrating them raises several problems. First, the gain from combining multiple techniques is smaller than the sum of the gains from the individual techniques. Second, a conventional bidirectional RNN in principle needs to see the end of the utterance (that is, all future information) before it can exploit future context, so it is only suitable for offline tasks; for online tasks that require an immediate response (such as a voice input method), it introduces a hard delay of 3-5 s, which is unacceptable. Furthermore, because the RNN fits contextual correlations more aggressively, it is more prone to overfitting than the DNN and can introduce additional anomalous recognition errors due to local non-robustness in the training data. Finally, since the RNN has a more complex structure than the DNN, training RNN models on massive data poses greater challenges.

In view of these problems, iFLYTEK has developed a new framework called the Feedforward Sequential Memory Network (FSMN). Within this framework, the technical points above can be integrated well, and the improvements from each can be stacked. It is worth noting that the FSMN structure we propose uses a non-recurrent feedforward architecture, yet achieves the same effect as a bidirectional LSTM RNN with a delay of only 180 ms. Let us take a closer look at its composition.


Figure 3: Schematic diagram of FSMN structure


Figure 4: Schematic diagram of the timing of the FSMN hidden-layer memory block (with one frame of left and right context)

Figure 3 shows the structure of the FSMN. Compared with the traditional DNN, we add a module called a "memory block" alongside the hidden layer, which stores the historical and future information useful for judging the current speech frame. Figure 4 shows the timing structure with which the bidirectional FSMN encodes information in the memory block (in practice, the lengths of the historical and future context to be remembered can be tuned by hand for the task at hand).
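To make the memory-block idea concrete, here is a minimal PyTorch sketch of one bidirectional FSMN-style hidden layer. The tap counts, layer sizes, and the choice of per-unit tap coefficients are illustrative assumptions; this is a sketch of the general idea, not iFLYTEK's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FSMNMemoryLayer(nn.Module):
    """One FSMN-style hidden layer: a feedforward hidden transform plus a
    "memory block" that mixes a fixed window of past and future hidden
    activations with learnable tap coefficients (no recurrence)."""

    def __init__(self, in_dim=40, hidden=512, n_past=10, n_future=10):
        super().__init__()
        self.linear = nn.Linear(in_dim, hidden)
        self.n_past, self.n_future = n_past, n_future
        # One coefficient vector per tap (n_past history taps, the current
        # frame, n_future lookahead taps) -- these play the role of the
        # connection weights between hidden layer and memory block in Figure 4.
        self.taps = nn.Parameter(torch.empty(n_past + n_future + 1, hidden))
        nn.init.normal_(self.taps, std=0.05)

    def forward(self, x):                          # x: (batch, T, in_dim)
        h = torch.relu(self.linear(x))             # (batch, T, hidden)
        # Pad the time axis so every frame has a full past/future window.
        padded = F.pad(h, (0, 0, self.n_past, self.n_future))
        mem = torch.zeros_like(h)
        for k in range(self.n_past + self.n_future + 1):
            # Shifted copy of the hidden sequence, scaled by its tap weights;
            # k = 0 is the oldest past frame, k = n_past the current frame.
            mem = mem + self.taps[k] * padded[:, k:k + h.size(1), :]
        # The memory output summarizes a finite context window and is passed
        # on together with h; in this sketch we simply add the two.
        return h + mem

layer = FSMNMemoryLayer()
frames = torch.randn(4, 300, 40)     # 4 utterances, 300 feature frames each
out = layer(frames)                  # (4, 300, 512)
```

Because the lookahead is capped at n_future frames, the layer never needs to wait for the end of the utterance; stacking several such layers under an output layer (optionally trained with CTC) gives an architecture of the kind sketched in Figure 3.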

As Figure 4 shows, unlike the RNN's traditional recurrent feedback, the memory function of the FSMN memory block is implemented with a feedforward structure. This feedforward structure has two major advantages. First, when the bidirectional FSMN remembers future information, it is not constrained, as a conventional bidirectional RNN is, to wait for the end of the speech input before judging the current frame; it only needs to wait for a finite number of future speech frames. As mentioned above, our bidirectional FSMN matches the effect of a bidirectional RNN with a delay of only 180 ms. Second, as noted earlier, in a conventional simple RNN the gradient is propagated back one time step at a time during training, so it decays exponentially and vanishes; as a result, an RNN that in theory has unboundedly long memory can in practice remember very little. The FSMN, by contrast, is a memory network based on a feedforward, time-unfolded structure: during training, the gradient flows back to each time step along the connection weights between the memory block and the hidden layer shown in Figure 4. These connection weights determine how much the input at each time step influences the judgment of the current speech frame, and the attenuation the gradient undergoes at each step is constant and trainable. The FSMN therefore solves the RNN's vanishing-gradient problem in a simpler way, while retaining LSTM-like long-term memory capacity.
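The vanishing-gradient argument can be illustrated with a toy scalar comparison (the numbers are made up for illustration and are not from the article): in a simple linear RNN, the influence of a frame k steps in the past passes through the recurrent weight k times, whereas in an FSMN memory block each past frame is scaled by a single learned tap coefficient.

```python
import numpy as np

w_recurrent = 0.5                      # recurrent weight of a toy scalar RNN
fsmn_taps = np.full(20, 0.5)           # one learned coefficient per FSMN tap

for k in (1, 5, 10, 20):
    rnn_influence = w_recurrent ** k   # gradient passes through w, k times
    fsmn_influence = fsmn_taps[k - 1]  # gradient passes through one tap only
    print(f"{k:2d} steps back:  RNN {rnn_influence:.6f}   FSMN {fsmn_influence:.2f}")

# The RNN influence decays exponentially with distance (0.5, 0.03, 0.001, ...),
# while each FSMN tap keeps a constant, trainable weight on its frame.
```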

In addition, in terms of training efficiency and stability, because the FSMN is entirely feedforward, it avoids the wasted computation that RNN training incurs from zero-padding sequences within a mini-batch, and the feedforward structure offers a higher degree of parallelism, making fuller use of GPU compute. Looking at the distribution of the tap coefficients in the memory block of a converged bidirectional FSMN model, we observe that the weights are largest around the current frame and decay gradually toward both sides, which matches expectations. Furthermore, the FSMN can be combined with the CTC criterion to achieve "end-to-end" modeling in speech recognition.

Finally, combined with other technical points, iFLYTEK's FSMN-based speech recognition framework achieves a 40% performance improvement over the industry's best speech recognition systems, and together with our multi-GPU parallel acceleration technology, a model can be trained to convergence on ten thousand hours of training data in one day. Building on the FSMN framework, we will carry out further research, such as deeper combinations of the DNN with memory blocks, increasing the complexity of the memory block to strengthen its memory function, and deeper integration of the FSMN with CNNs and other structures. With the continuous improvement of these core technologies, iFLYTEK's speech recognition system will keep challenging new peaks!
