Zhixi Cai

📘 I’m currently a Research Fellow (Post-Doctoral) at VL4AI Lab, Faculty of IT, Monash University, supervised by Dr. Hamid Rezatofighi. My core research area is video understanding and reasoning using neurosymbolic methods and large language models.

💡 We believe the next wave of AI will be agentic and verifiable systems that plan, act, and explain their decisions by composing neural perception with symbolic representations and logic. Progress will come from combining neural and symbolic methods with built-in verification. Our aim is AI that is safe-by-default, auditable, and robust.

🎓 I obtained my PhD in artificial intelligence from Monash University, supervised by A/Prof. Kalin Stefanov, A/Prof. Abhinav Dhall and Dr. Munawar Hayat. My thesis, Content-Driven Multimodal Deepfake Generation and Temporal Localization, focuses mainly on deepfakes and video understanding.

🔬 I have published papers at CVPR, ECCV, ICCV, ACM MM, and other venues, and received two best paper awards during my PhD. Please refer to the publication page for more details.

🔎 I serve as an Associate Editor (Area Chair) for IROS. I have also been invited to review for CVPR, ICCV, ECCV, ACM MM, ICRA, TPAMI, TMM, TAFFC, and more.

🖥️ I enjoy programming and implementing cool ideas. I have developed several interesting open-source applications and libraries in my spare time. Please refer to the projects page for more details.

🛠️ I also love discovering and fine-tuning the tools I use, both software and physical.

news

Jan 27, 2026	A paper is accepted by ICLR 2026.
Nov 08, 2025	A paper is accepted by AAAI 2026.
Oct 13, 2025	Hosted a tutorial (Multimodal Deepfake Generation and Detection: Challenges, Methods, and Future Directions) at ICMI 2025.
Jul 07, 2025	A paper is accepted by RA-L.
Jun 26, 2025	Two papers are accepted by ICCV 2025.

selected publications

ICLR

MATA: A Trainable Hierarchical Automaton System for Multi-Agent Visual Reasoning

Zhixi Cai, Fucai Ke, Kevin Leo, Sukai Huang, Maria Garcia de la Banda, Peter J Stuckey, and Hamid Rezatofighi

In International Conference on Learning Representations, 2026

Recent vision-language models have strong perceptual ability, but their implicit reasoning is hard to explain and easily generates hallucinations on complex queries. Compositional methods improve interpretability, but most rely on a single agent or hand-crafted pipeline and cannot decide when to collaborate across complementary agents or compete among overlapping ones. We introduce MATA (Multi-Agent hierarchical Trainable Automaton), a multi-agent system presented as a hierarchical finite-state automaton for visual reasoning whose top-level transitions are chosen by a trainable hyper agent. Each agent corresponds to a state in the hyper automaton, and runs a small rule-based sub-automaton for reliable micro-control. All agents read and write a shared memory, yielding a transparent execution history. To supervise the hyper agent’s transition policy, we build transition-trajectory trees and transform them into memory-to-next-state pairs, forming the MATA-SFT-90K dataset for supervised finetuning (SFT). The finetuned LLM, serving as the transition policy, understands the query and the capabilities of the agents, and can efficiently choose the optimal agent to solve the task. Across multiple visual reasoning benchmarks, MATA achieves state-of-the-art results compared with monolithic and compositional baselines. The code and dataset are available at https://github.com/ControlNet/MATA.
ICCV

NAVER: A Neuro-Symbolic Compositional Automaton for Visual Grounding with Explicit Logic Reasoning

Zhixi Cai, Fucai Ke, Simindokht Jahangard, Maria Garcia de la Banda, Reza Haffari, Peter J. Stuckey, and Hamid Rezatofighi

In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025

Visual Grounding (VG) tasks, such as referring expression detection and segmentation, are important for linking visual entities to context, especially in complex reasoning tasks that require detailed query interpretation. This paper explores VG beyond basic perception, highlighting challenges for methods that require reasoning like human cognition. Recent advances in Large Language Models (LLMs) and Vision-Language Models (VLMs) have improved abilities for visual comprehension, contextual understanding, and reasoning. These methods are mainly split into end-to-end and compositional methods, with the latter offering more flexibility. Compositional approaches that integrate LLMs and foundation models show promising performance but still struggle with complex reasoning over language-based logical representations. To address these limitations, we propose NAVER, a compositional visual grounding method that integrates explicit probabilistic logic reasoning within a finite-state automaton, equipped with a self-correcting mechanism. This design improves robustness and interpretability in inference through explicit logic reasoning. Our results show that NAVER achieves SoTA performance compared to recent end-to-end and compositional baselines.
RA-L

NEUSIS: A Compositional Neuro-Symbolic Framework for Autonomous Perception, Reasoning, and Planning in Complex UAV Search Missions

Zhixi Cai^*‡, Cristian Rojas Cardenas^*, Kevin Leo^*, Chenyuan Zhang^*, Kal Backman^*, Hanbing Li^*, Boying Li, Mahsa Ghorbanali, Stavya Datta, Lizhen Qu, and 7 more authors

IEEE Robotics and Automation Letters, 2025

This paper addresses the problem of autonomous UAV search missions, where a UAV must locate specific Entities of Interest (EOIs) within a time limit, based on brief descriptions in large, hazard-prone environments with keep-out zones. The UAV must perceive, reason, and make decisions with limited and uncertain information. We propose NEUSIS, a compositional neuro-symbolic system designed for interpretable UAV search and navigation in realistic scenarios. NEUSIS integrates neuro-symbolic visual perception, reasoning, and grounding (GRiD) to process raw sensory inputs, maintains a probabilistic world model for environment representation, and uses a hierarchical planning component (SNaC) for efficient path planning. Experimental results from simulated urban search missions using AirSim and Unreal Engine show that NEUSIS outperforms a state-of-the-art (SOTA) vision-language model and a SOTA search planning model in success rate, search efficiency, and 3D localization. These results demonstrate the effectiveness of our compositional neuro-symbolic approach in handling complex, real-world scenarios, making it a promising solution for autonomous UAV systems in search missions.
ACM-MM

AV-Deepfake1M: A Large-Scale LLM-Driven Audio-Visual Deepfake Dataset

Zhixi Cai, Shreya Ghosh, Aman Pankaj Adatia, Munawar Hayat, Abhinav Dhall, Tom Gedeon, and Kalin Stefanov

In Proceedings of the 32nd ACM International Conference on Multimedia, 2024

The detection and localization of highly realistic deepfake audio-visual content are challenging even for the most advanced state-of-the-art methods. While most of the research efforts in this domain are focused on detecting high-quality deepfake images and videos, only a few works address the problem of the localization of small segments of audio-visual manipulations embedded in real videos. In this research, we emulate the process of such content generation and propose the AV-Deepfake1M dataset. The dataset contains content-driven (i) video manipulations, (ii) audio manipulations, and (iii) audio-visual manipulations for more than 2K subjects resulting in a total of more than 1M videos. The paper provides a thorough description of the proposed data generation pipeline accompanied by a rigorous analysis of the quality of the generated data. The comprehensive benchmark of the proposed dataset utilizing state-of-the-art deepfake detection and localization methods indicates a significant drop in performance compared to previous datasets. The proposed dataset will play a vital role in building the next-generation deepfake localization methods. The dataset and associated code are available at https://github.com/ControlNet/AV-Deepfake1M.
CVPR

MARLIN: Masked Autoencoder for facial video Representation LearnINg

Zhixi Cai, Shreya Ghosh, Kalin Stefanov, Abhinav Dhall, Jianfei Cai, Hamid Rezatofighi, Reza Haffari, and Munawar Hayat

In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

This paper proposes a self-supervised approach to learn universal facial representations from videos, that can transfer across a variety of facial analysis tasks such as Facial Attribute Recognition (FAR), Facial Expression Recognition (FER), DeepFake Detection (DFD), and Lip Synchronization (LS). Our proposed framework, named MARLIN, is a facial video masked autoencoder, that learns highly robust and generic facial embeddings from abundantly available non-annotated web crawled facial videos. As a challenging auxiliary task, MARLIN reconstructs the spatio-temporal details of the face from the densely masked facial regions which mainly include eyes, nose, mouth, lips, and skin to capture local and global aspects that in turn help in encoding generic and transferable features. Through a variety of experiments on diverse downstream tasks, we demonstrate MARLIN to be an excellent facial video encoder as well as feature extractor, that performs consistently well across a variety of downstream tasks including FAR (1.13% gain over supervised benchmark), FER (2.64% gain over unsupervised benchmark), DFD (1.86% gain over unsupervised benchmark), LS (29.36% gain for Frechet Inception Distance), and even in low data regime. Our code and models are available at https://github.com/ControlNet/MARLIN.
CVIU

Glitch in the Matrix: A Large Scale Benchmark for Content Driven Audio-Visual Forgery Detection and Localization

Zhixi Cai^†, Shreya Ghosh, Abhinav Dhall, Tom Gedeon, Kalin Stefanov, and Munawar Hayat

Computer Vision and Image Understanding, 2023

Most deepfake detection methods focus on detecting spatial and/or spatio-temporal changes in facial attributes and are centered around the binary classification task of detecting whether a video is real or fake. This is because available benchmark datasets contain mostly visual-only modifications present in the entirety of the video. However, a sophisticated deepfake may include small segments of audio or audio-visual manipulations that can completely change the meaning of the video content. To address this gap, we propose and benchmark a new dataset, Localized Audio Visual DeepFake (LAV-DF), consisting of strategic content-driven audio, visual and audio-visual manipulations. The proposed baseline method, Boundary Aware Temporal Forgery Detection (BA-TFD), is a 3D Convolutional Neural Network-based architecture which effectively captures multimodal manipulations. We further improve the baseline method (i.e. BA-TFD+) by replacing the backbone with a Multiscale Vision Transformer and guiding the training process with contrastive, frame classification, boundary matching and multimodal boundary matching loss functions. The quantitative analysis demonstrates the superiority of BA-TFD+ on temporal forgery localization and deepfake detection tasks using several benchmark datasets including our newly proposed dataset. The dataset, models and code are available at https://github.com/ControlNet/LAV-DF.
DICTA

Do You Really Mean That? Content Driven Audio-Visual Deepfake Dataset and Multimodal Method for Temporal Forgery Localization

Zhixi Cai, Kalin Stefanov, Abhinav Dhall, and Munawar Hayat

In 2022 International Conference on Digital Image Computing: Techniques and Applications (DICTA), 2022

Due to its high societal impact, deepfake detection is getting active attention in the computer vision community. Most deepfake detection methods rely on identity, facial attributes, and adversarial perturbation-based spatio-temporal modifications at the whole video or random locations while keeping the meaning of the content intact. However, a sophisticated deepfake may contain only a small segment of video/audio manipulation, through which the meaning of the content can be, for example, completely inverted from a sentiment perspective. We introduce a content-driven audio-visual deepfake dataset, termed Localized Audio Visual DeepFake (LAV-DF), explicitly designed for the task of learning temporal forgery localization. Specifically, the content-driven audio-visual manipulations are performed strategically to change the sentiment polarity of the whole video. Our baseline method for benchmarking the proposed dataset is a 3DCNN model, termed Boundary Aware Temporal Forgery Detection (BA-TFD), which is guided via contrastive, boundary matching, and frame classification loss functions. Our extensive quantitative and qualitative analysis demonstrates the proposed method’s strong performance for temporal forgery localization and deepfake detection tasks.