Abstract
We address the problem of predicting a person's position in future frames of a video stream and present in-depth experimental studies applying both traditional and state-of-the-art (SOTA) blocks to this task. An original architecture, KeyFNet, and its modifications based on transformer blocks are presented; the model predicts coordinates in a video stream 30, 60, 90, and 120 frames ahead with high accuracy. The novelty lies in a combined algorithm built from multiple FNet blocks, in which the fast Fourier transform serves as the attention mechanism that mixes the concatenated coordinates of key points. Experiments on Human3.6M and on our own real-world data confirmed the effectiveness of the proposed FNet-based approach compared to the traditional LSTM-based one. The proposed algorithm matches the accuracy of advanced models while outperforming them in speed and consuming fewer computational resources, and can therefore be applied in collaborative robotic solutions.
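As a rough illustration of the idea described above (not the authors' implementation), the core of an FNet block replaces self-attention with a parameter-free 2D Fourier transform over the sequence and feature dimensions, keeping only the real part, followed by a standard feed-forward sublayer with residual connections and layer normalization. The sketch below, in NumPy, assumes a toy input of 30 frames, each encoded as 17 keypoints with 2 coordinates (34 features); all weights and dimensions are hypothetical.

```python
import numpy as np

def fnet_mixing(x):
    # FNet token mixing: 2D FFT over sequence and feature axes, keep the real part.
    return np.fft.fft2(x).real

def feed_forward(x, w1, b1, w2, b2):
    # Position-wise feed-forward sublayer with ReLU.
    h = np.maximum(x @ w1 + b1, 0.0)
    return h @ w2 + b2

def layer_norm(x, eps=1e-6):
    # Normalize each frame's feature vector to zero mean, unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def fnet_block(x, w1, b1, w2, b2):
    # One FNet encoder block: Fourier mixing + FFN, each with residual + LayerNorm.
    x = layer_norm(x + fnet_mixing(x))
    x = layer_norm(x + feed_forward(x, w1, b1, w2, b2))
    return x

# Toy example: 30 observed frames, 17 keypoints x 2 coordinates = 34 features.
rng = np.random.default_rng(0)
seq = rng.standard_normal((30, 34))
w1 = rng.standard_normal((34, 64)) * 0.1; b1 = np.zeros(64)
w2 = rng.standard_normal((64, 34)) * 0.1; b2 = np.zeros(34)
out = fnet_block(seq, w1, b1, w2, b2)
print(out.shape)  # (30, 34)
```

In the full architecture described in the abstract, several such blocks would be stacked and followed by a regression head that maps the mixed sequence to keypoint coordinates 30–120 frames ahead; because the Fourier mixing has no learned attention weights, it is cheaper than standard self-attention, which is the source of the reported speed advantage.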
![Fig. 1](https://cdn.statically.io/img/media.springernature.com/m312/springer-static/image/art%3A10.1134%2FS1064562423701624/MediaObjects/11472_2024_9691_Fig1_HTML.png)
![Fig. 2](https://cdn.statically.io/img/media.springernature.com/m312/springer-static/image/art%3A10.1134%2FS1064562423701624/MediaObjects/11472_2024_9691_Fig2_HTML.png)
![Fig. 3](https://cdn.statically.io/img/media.springernature.com/m312/springer-static/image/art%3A10.1134%2FS1064562423701624/MediaObjects/11472_2024_9691_Fig3_HTML.png)
![Fig. 4](https://cdn.statically.io/img/media.springernature.com/m312/springer-static/image/art%3A10.1134%2FS1064562423701624/MediaObjects/11472_2024_9691_Fig4_HTML.png)
![Fig. 5](https://cdn.statically.io/img/media.springernature.com/m312/springer-static/image/art%3A10.1134%2FS1064562423701624/MediaObjects/11472_2024_9691_Fig5_HTML.png)
Funding
This work was supported by the Russian Science Foundation, project no. 22-71-10093, https://rscf.ru/en/project/22-71-10093/.
Ethics declarations
The authors of this work declare that they have no conflicts of interest.
Additional information
Publisher's Note.
Pleiades Publishing remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Zhiganov, S.V., Ivanov, Y.S. & Grabar, D.M. Investigation of Neural Network Algorithms for Human Movement Prediction Based on LSTM and Transformers. Dokl. Math. 108 (Suppl 2), S484–S493 (2023). https://doi.org/10.1134/S1064562423701624