Zhixi Cai

Monash University, Melbourne, Australia.


📘 I’m currently a Research Fellow (Post-Doctoral) at the VL4AI Lab, Faculty of IT, Monash University, supervised by Dr. Hamid Rezatofighi. My core research area is video understanding and reasoning using neuro-symbolic methods and large language models.

💡 We believe the next wave of AI will be agentic and verifiable systems that plan, act, and explain their decisions by composing neural perception with symbolic representations and logic. Progress will come from combining neural and symbolic components with built-in verification. Our aim is AI that is safe by default, auditable, and robust.

🎓 I obtained my PhD in artificial intelligence from Monash University, supervised by A/Prof. Kalin Stefanov, A/Prof. Abhinav Dhall and Dr. Munawar Hayat. My thesis, Content-Driven Multimodal Deepfake Generation and Temporal Localization, focuses mainly on deepfakes and video understanding.

🔬 I have published papers at CVPR, ECCV, ICCV, ACM MM, and other venues, and received two best paper awards during my PhD. Please refer to the publication page for more details.

🔎 I serve as an Associate Editor (Area Chair) for IROS, and I have been invited to review for CVPR, ICCV, ECCV, ACM MM, ICRA, TPAMI, TMM, TAFFC, and more.

🖥️ I enjoy programming and implementing cool ideas. In my spare time I have developed several interesting open-source applications and libraries. Please refer to the projects page for more details.

🛠️ I also love discovering and fine-tuning the tools I work with, both software and physical.

news

Jan 27, 2026 A paper was accepted by ICLR 2026.
Nov 08, 2025 A paper was accepted by AAAI 2026.
Oct 13, 2025 Hosted a tutorial (Multimodal Deepfake Generation and Detection: Challenges, Methods, and Future Directions) at ICMI 2025.
Jul 07, 2025 A paper was accepted by RA-L.
Jun 26, 2025 Two papers were accepted by ICCV 2025.

selected publications

  1. MATA: A Trainable Hierarchical Automaton System for Multi-Agent Visual Reasoning
    In International Conference on Learning Representations (ICLR), 2026
  2. NAVER: A Neuro-Symbolic Compositional Automaton for Visual Grounding with Explicit Logic Reasoning
    In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025
  3. NEUSIS: A Compositional Neuro-Symbolic Framework for Autonomous Perception, Reasoning, and Planning in Complex UAV Search Missions
    Zhixi Cai*, Cristian Rojas Cardenas*, Kevin Leo*, Chenyuan Zhang*, Kal Backman*, Hanbing Li*, Boying Li, Mahsa Ghorbanali, Stavya Datta, Lizhen Qu, and 7 more authors
    IEEE Robotics and Automation Letters, 2025
  4. AV-Deepfake1M: A Large-Scale LLM-Driven Audio-Visual Deepfake Dataset
    Zhixi Cai, Shreya Ghosh, Aman Pankaj Adatia, Munawar Hayat, Abhinav Dhall, Tom Gedeon, and Kalin Stefanov
    In Proceedings of the 32nd ACM International Conference on Multimedia, 2024
  5. MARLIN: Masked Autoencoder for facial video Representation LearnINg
    In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023
  6. Glitch in the Matrix: A Large Scale Benchmark for Content Driven Audio-Visual Forgery Detection and Localization
    Zhixi Cai†, Shreya Ghosh, Abhinav Dhall, Tom Gedeon, Kalin Stefanov, and Munawar Hayat
    Computer Vision and Image Understanding, 2023
  7. Do You Really Mean That? Content Driven Audio-Visual Deepfake Dataset and Multimodal Method for Temporal Forgery Localization
    Zhixi Cai, Kalin Stefanov, Abhinav Dhall, and Munawar Hayat
    In 2022 International Conference on Digital Image Computing: Techniques and Applications (DICTA), 2022