Proceedings of the International Conference on Advancements in Computing Technologies and Artificial Intelligence (COMPUTATIA-2025)

NeuroVidX: Text-To-Video Diffusion Models with an Expert Transformer

Authors
Shruti Sawant1,*, Sejal Pandit1, Megha Chatur1, Aditya Shinde1, Ganesh Dangat1
1Department of Computer Science and Engineering, Dr. Babasaheb Ambedkar University, Lonere, India
*Corresponding author. Email: shrutisawant162003@gmail.com
Corresponding Author
Shruti Sawant
Available Online 19 April 2025.
DOI
10.2991/978-94-6463-700-7_3
Keywords
Diffusion Models; Expert Transformer; 3D Variational Autoencoder (VAE); Video Generation Models; Deep Fusion
Abstract

This research proposes NeuroVidX, a large-scale text-to-video generation model based on a diffusion transformer that can generate 10-second continuous videos aligned with a text prompt, at a frame rate of 16 fps and a resolution of 768×1360 pixels. Previous video generation models often suffered from limited motion and short durations, and generating videos with coherent, text-driven narratives remains difficult. We propose several designs to address these issues. First, we propose a 3D Variational Autoencoder (VAE) that compresses videos along both spatial and temporal dimensions to improve the compression rate and video fidelity. Second, to improve text-video alignment, we propose an expert transformer with expert adaptive LayerNorm to facilitate deep fusion between the two modalities. Third, by employing progressive training and a multi-resolution frame-packing technique, NeuroVidX produces coherent, long-duration videos of varying shapes, characterized by significant motion. In addition, we develop an effective text-video data processing pipeline that includes various data preprocessing strategies and a video captioning method, which contributes substantially to generation quality and semantic alignment. Results show that NeuroVidX achieves state-of-the-art performance across multiple machine metrics and human evaluations. The model weights of both the 3D Causal VAE and the video captioning model are made publicly available.
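To make the "expert adaptive LayerNorm" idea from the abstract concrete, the following is a minimal, hypothetical PyTorch-style sketch of one way text and video tokens sharing a single transformer sequence could each receive their own adaptive LayerNorm modulation from a conditioning (e.g., diffusion timestep) embedding. All module names, dimensions, and the token layout here are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class ExpertAdaLN(nn.Module):
    """Sketch: separate adaptive LayerNorm 'experts' for text and video tokens."""

    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        # Shared normalization without learned affine parameters;
        # scale/shift come from the modality-specific modulation heads below.
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.text_mod = nn.Linear(cond_dim, 2 * dim)   # predicts (scale, shift) for text tokens
        self.video_mod = nn.Linear(cond_dim, 2 * dim)  # predicts (scale, shift) for video tokens

    def forward(self, x: torch.Tensor, cond: torch.Tensor, n_text: int) -> torch.Tensor:
        # x: (batch, n_text + n_video, dim) concatenated text + video token sequence
        # cond: (batch, cond_dim) conditioning embedding (e.g., diffusion timestep)
        t_scale, t_shift = self.text_mod(cond).chunk(2, dim=-1)
        v_scale, v_shift = self.video_mod(cond).chunk(2, dim=-1)
        h = self.norm(x)
        text = h[:, :n_text] * (1 + t_scale.unsqueeze(1)) + t_shift.unsqueeze(1)
        video = h[:, n_text:] * (1 + v_scale.unsqueeze(1)) + v_shift.unsqueeze(1)
        return torch.cat([text, video], dim=1)

# Usage example with assumed sizes: 16 text tokens fused with 1024 video patch tokens.
layer = ExpertAdaLN(dim=512, cond_dim=256)
x = torch.randn(2, 16 + 1024, 512)
cond = torch.randn(2, 256)
out = layer(x, cond, n_text=16)
print(out.shape)  # torch.Size([2, 1040, 512])

The design intent sketched here is that both modalities stay in one attention stream (deep fusion) while each is normalized and modulated on its own terms, which is one plausible reading of how an "expert" LayerNorm improves text-video alignment.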

Copyright
© 2025 The Author(s)
Open Access
This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (http://creativecommons.org/licenses/by-nc/4.0/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

Volume Title
Proceedings of the International Conference on Advancements in Computing Technologies and Artificial Intelligence (COMPUTATIA-2025)
Series
Advances in Intelligent Systems Research
Publication Date
19 April 2025
ISBN
978-94-6463-700-7
ISSN
1951-6851
DOI
10.2991/978-94-6463-700-7_3

Cite this article

TY  - CONF
AU  - Shruti Sawant
AU  - Sejal Pandit
AU  - Megha Chatur
AU  - Aditya Shinde
AU  - Ganesh Dangat
PY  - 2025
DA  - 2025/04/19
TI  - NeuroVidX: Text-To-Video Diffusion Models with an Expert Transformer
BT  - Proceedings of the International Conference on Advancements in Computing Technologies and Artificial Intelligence (COMPUTATIA-2025)
PB  - Atlantis Press
SP  - 17
EP  - 28
SN  - 1951-6851
UR  - https://doi.org/10.2991/978-94-6463-700-7_3
DO  - 10.2991/978-94-6463-700-7_3
ID  - Sawant2025
ER  -