Automatic lipreading using convolutional neural networks and orthogonal moments

Vol. 12, No. 1, pp. 90–100 (2025)
Received: January 07, 2024
Revised: January 10, 2025
Accepted: January 13, 2025

Ait Khayi Y., El Ogri O., El-Mekkaoui J., Benslimane M., Hjouji A.  Automatic lipreading using convolutional neural networks and orthogonal moments.  Mathematical Modeling and Computing. Vol. 12, No. 1, pp. 90–100 (2025)

1 TI Laboratory, EST, Sidi Mohamed Ben Abdellah University, Fez, Morocco
2 TI Laboratory, EST, Sidi Mohamed Ben Abdellah University, Fez, Morocco; CED-ST, STIC, Laboratory of Information, Signals, Automation and Cognitivism LISAC, Dhar El Mahrez Faculty of Science, Sidi Mohamed Ben Abdellah-Fez University, Fez, Morocco
3 TI Laboratory, EST, Sidi Mohamed Ben Abdellah University, Fez, Morocco
4 TI Laboratory, EST, Sidi Mohamed Ben Abdellah University, Fez, Morocco
5 Sidi Mohamed Ben Abdellah University, Fez, Morocco

Understanding speech from the visual interpretation of a speaker's lip movements alone has recently become one of the most challenging computer vision tasks.  In the present paper, we propose a new approach, Optimized Quaternion Meixner Moments Convolutional Neural Networks (OQMMCNN), to build a lipreading system based only on video images.  The approach uses Quaternion Meixner Moments (QMMs) as filters in the Convolutional Neural Network (CNN) architecture.  In addition, we use the Grey Wolf Optimizer (GWO) to optimize the local parameters of the QMM filters and thus ensure high classification accuracy.  We show that this method effectively reduces both the high dimensionality of the video images and the training time.  The approach is evaluated on a public dataset and compared with methods from the literature that rely on complex models and deep architectures.
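To make the optimization step concrete, the following is a minimal sketch (in Python) of the Grey Wolf Optimizer loop of Mirjalili et al. [19] tuning two filter parameters, standing in for the local parameters of a QMM filter (e.g., the Meixner parameters beta and c). The fitness function here is a toy placeholder; in the paper's pipeline the objective would instead be the classification accuracy of the CNN built with the optimized QMM filters. All names, bounds, and values below are illustrative assumptions, not taken from the paper.

import numpy as np

def fitness(params):
    # Placeholder objective (assumption): in the real pipeline this would be
    # the validation accuracy of the OQMMCNN trained with QMM filters
    # generated from `params` (e.g., the Meixner parameters beta and c).
    beta, c = params
    return -((beta - 0.4) ** 2 + (c - 0.6) ** 2)  # toy optimum at (0.4, 0.6)

def gwo(fitness_fn, dim=2, n_wolves=10, n_iter=50, lb=0.01, ub=0.99, seed=0):
    rng = np.random.default_rng(seed)
    wolves = rng.uniform(lb, ub, size=(n_wolves, dim))   # random initial pack
    scores = np.array([fitness_fn(w) for w in wolves])

    for t in range(n_iter):
        # Alpha, beta, delta = three best wolves (maximization).
        order = np.argsort(scores)[::-1]
        alpha, beta_w, delta = wolves[order[:3]]

        a = 2 - 2 * t / n_iter  # control parameter decreases linearly 2 -> 0
        for i in range(n_wolves):
            X = np.zeros(dim)
            for leader in (alpha, beta_w, delta):
                r1, r2 = rng.random(dim), rng.random(dim)
                A = 2 * a * r1 - a
                C = 2 * r2
                D = np.abs(C * leader - wolves[i])   # distance to the leader
                X += leader - A * D                  # candidate guided by leader
            wolves[i] = np.clip(X / 3.0, lb, ub)     # average of the three guides
            scores[i] = fitness_fn(wolves[i])

    best = int(np.argmax(scores))
    return wolves[best], scores[best]

best_params, best_score = gwo(fitness)
print("best (beta, c):", best_params, "fitness:", best_score)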

  1. Fernandez-Lopez A., Sukno F. M.  Survey on automatic lip-reading in the era of deep learning.  Image and Vision Computing.  78, 53–72 (2018).
  2. Hao M., Mamut M., Yadikar N., Aysa A., Ubul K.  A survey of research on lipreading technology.  IEEE Access.  8, 204518–204544 (2020).
  3. Chen X., Du J., Zhang H.  Lipreading with DenseNet and resBi-LSTM.  Signal, Image and Video Processing.  14, 981–989 (2020).
  4. Fenghour S., Chen D., Guo K., Xiao P.  Lip reading sentences using deep learning with only visual cues.  IEEE Access.  8, 215516–215530 (2020).
  5. Rashkevych Yu., Peleshko D., Pelekh I., Izonin I. V.  Speech signal marking on the base of local magnitude and invariant segmentation.  Mathematical Modeling and Computing.  1 (2), 234–244 (2014).
  6. Ma S., Wang S., Lin X.  A transformer-based model for sentence-level Chinese Mandarin lipreading.  2020 IEEE Fifth International Conference on Data Science in Cyberspace (DSC). 78–81 (2020).
  7. Fisher C. G.  Confusions among visually perceived consonants.  Journal of Speech and Hearing Research.  11 (4), 796–804 (1968).
  8. Hilder S., Harvey R., Theobald B.-J.  Comparison of human and machine-based lip-reading.  AVSP 2009 – International Conference on Audio-Visual Speech Processing, University of East Anglia (2009).
  9. Matthews I., Cootes T. F., Bangham J. A., Cox S., Harvey R.  Extraction of visual features for lipreading.  IEEE Transactions on Pattern Analysis and Machine Intelligence.  24 (2), 198–213 (2002).
  10. Cox S., Harvey R., Lan Y., Newman J., Theobald B.-J.  The challenge of multispeaker lip-reading.  International Conference on Auditory-Visual Speech Processing (AVSP). (2008).
  11. Lee B., Hasegawa-Johnson M., Goudeseune C., Kamdar S., Borys S., Liu M., Huang T.  AVICAR: Audio-visual speech corpus in a car environment.  Eighth International Conference on Spoken Language Processing. 2489–2492 (2004).
  12. Hazen T. J., Saenko K., La C.-H., Glass J. R.  A segment-based audio-visual speech recognizer: Data collection, development, and initial experiments.  Proceedings of the 6th International Conference on Multimodal Interfaces.  235–242 (2004).
  13. Patterson E. K., Gurbuz S., Tufekci Z., Gowdy J. N.  CUAVE: A new audio-visual database for multimodal human-computer interface research.  2002 IEEE International Conference on Acoustics, Speech, and Signal Processing.  II, 2017–2020 (2002).
  14. Cooke M., Barker J., Cunningham S., Shao X.  An audio-visual corpus for speech perception and automatic speech recognition.  The Journal of the Acoustical Society of America.  120 (5), 2421–2424 (2006).
  15. Petridis S., Pantic M.  Deep complementary bottleneck features for visual speech recognition.  2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2304–2308 (2016).
  16. Saitoh T., Zhou Z., Zhao G., Pietikäinen M.  Concatenated frame image based CNN for visual speech recognition.  Computer Vision – ACCV 2016 Workshops.  277–289 (2017).
  17. Mesbah A., Berrahou A., Hammouchi H., Berbia H., Qjidaa H., Daoudi M.  Lip reading with Hahn convolutional neural networks.  Image and Vision Computing.  88, 76–83 (2019).
  18. Kim M., Yeo J. H., Ro Y. M.  Distinguishing homophenes using multi-head visual-audio memory for lip reading.  Proceedings of the AAAI Conference on Artificial Intelligence.  36 (1), 1174–1182 (2022).
  19. Mirjalili S., Mirjalili S. M., Lewis A.  Grey Wolf Optimizer.  Advances in Engineering Software.  69, 46–61 (2014).
  20. Lewis A. C.  William Rowan Hamilton, Lectures on quaternions (1853).  Landmark Writings in Western Mathematics 1640–1940. 460–469 (2005).
  21. Sayyouri M., Hmimid A., Qjidaa H.  A fast computation of novel set of Meixner invariant moments for image analysis.  Circuits, Systems, and Signal Processing.  34,  875–900 (2015).
  22. Sadeeq H., Abdulazeez A. M.  Hardware implementation of firefly optimization algorithm using FPGAs.  2018 International Conference on Advanced Science and Engineering (ICOASE).  30–35 (2018).
  23. Sadeeq H. T., Abdulazeez A. M.  Giant trevally optimizer (GTO): A novel metaheuristic algorithm for global optimization and challenging engineering problems.  IEEE Access.  10, 121615–121640 (2022).
  24. Naserbegi A., Aghaie M.  Exergy optimization of nuclear-solar dual proposed power plant based on GWO algorithm.  Progress in Nuclear Energy.  140, 103925 (2021).
  25. Naserbegi A., Aghaie M., Zolfaghari A.  Implementation of Grey Wolf Optimization (GWO) algorithm to multi-objective loading pattern optimization of a PWR reactor.  Annals of Nuclear Energy.  148, 107703 (2020).
  26. Gachkevich M., Gachkevich O., Torskyy A., Dmytruk V.  Mathematical models and methods of optimization of technological heating regimes of the piecewise homogeneous glass shell.  State-of-the-art investigations.  Mathematical Modeling and Computing.  2 (2), 140–153 (2015).
  27. Raskin L., Sira O., Sagaydachny D.  Multi-criteria optimization in terms of fuzzy criteria definitions.  Mathematical Modeling and Computing.  5 (2), 207–220 (2018).
  28. Zhao G., Barnard M., Pietikäinen M.  Lipreading with local spatiotemporal descriptors.  IEEE Transactions on Multimedia.  11 (7), 1254–1265 (2009).