cv
General Information
Full Name | Zhixi Cai |
Birth Year | 1997 |
Languages | Chinese, English |
Education
-
2021/03 - 2024/07 Doctor of Philosophy
Monash University, Australia - Supervised by Dr. Kalin Stefanov, A/Prof. Abhinav Dhall, and Dr. Munawar Hayat.
- Thesis topic: Content-Based Deepfake Generation and Temporal Localization, in the area of computer vision and deep learning.
-
2019/03 - 2020/12 Master of Data Science
Monash University, Australia - Coursework master's degree covering data science, machine learning, and IT.
- WAM: 86, GPA: 3.88 / 4
-
2014/09 - 2018/07 Bachelor of Accounting (CPA Canada)
Ningbo University, China - A coursework degree in which all units are taught in English using Canadian textbooks.
- WAM: 82.51, GPA: 3.20 / 4
Research & Employment Experience
-
2024/03 - Now DARPA Assured Neuro Symbolic Learning and Reasoning
Postdoc Research Fellow (Level B) - Supervised by Dr. Hamid Rezatofighi and A/Prof. Yuan-Fang Li
- Research Outcome
- Fucai Ke*, Zhixi Cai*, Simindokht Jahangard*, Teresa Wang, Pari Delir Haghighi, Hamid Rezatofighi. HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning. In ECCV 2024. *Equally contributed. [Paper]
- Zhixi Cai*, Cristian Rojas Cardenas*, Kevin Leo*, Chenyuan Zhang*, Kal Backman*, Hanbing Li*, Boying Li, Mahsa Ghorbanali, Stavya Datta, Lizhen Qu, Julian Gutierrez Santiago, Alexey Ignatiev, Yuan-Fang Li†, Mor Vered†, Peter J Stuckey†, Maria Banda†, and Hamid Rezatofighi†. NEUSIS: A Compositional Neuro-Symbolic Framework for Autonomous Perception, Reasoning, and Planning in Complex UAV Search Missions. Submitted to ICRA 2025. *Equally contributed. †Equally contributed. [Paper]
- Boying Li, Zhixi Cai, Yuan-Fang Li, Ian Reid, and Hamid Rezatofighi. Hi-SLAM: Scaling-up Semantics in SLAM with a Hierarchically Categorical Gaussian Splatting. Submitted to ICRA 2025. [Paper]
-
2021/03 - 2024/07 Content-Driven Multimodal Deepfake Generation and Temporal Localization
PhD Student - This is the PhD thesis project supervised by Dr. Kalin Stefanov, A/Prof. Abhinav Dhall, and Dr. Munawar Hayat.
- Deepfake is a type of forgery that utilizes advancements in deep learning and computer vision. While it has potential, it also poses serious risks. Content-driven deepfakes, a new type of deepfake in which only part of the video is fake, have received less attention. The goal of this PhD project is to develop methods for generating high-quality content-driven deepfakes from visual, audio, and linguistic modalities. The project also focuses on developing a high-performance deepfake detection method to counter this threat. The aim is to better prepare the research community and society for this new threat.
- Research Outcome
- Zhixi Cai, Kalin Stefanov, Abhinav Dhall, and Munawar Hayat. Do You Really Mean That? Content Driven Audio-Visual Deepfake Dataset and Multimodal Method for Temporal Forgery Localization. In DICTA 2022. [Best Contribution to Science Award] [Paper] [Code]
- Zhixi Cai, Shreya Ghosh, Abhinav Dhall, Tom Gedeon, Kalin Stefanov, and Munawar Hayat. Glitch in the Matrix: A Large Scale Benchmark for Content Driven Audio-Visual Forgery Detection and Localization. In CVIU. [Paper] [Code]
- Zhixi Cai, Shreya Ghosh, Aman Pankaj Adatia, Munawar Hayat, Abhinav Dhall, and Kalin Stefanov. AV-Deepfake1M: A Large-Scale LLM-Driven Audio-Visual Deepfake Dataset. In ACM MM 2024. [Best Student Paper] [Paper] [Code]
- Zhixi Cai, Abhinav Dhall, Shreya Ghosh, Munawar Hayat, Dimitrios Kollias, Kalin Stefanov, and Usman Tariq. 1M-Deepfakes Detection Challenge. In ACM MM 2024. [Paper] [Webpage]
- Shreya Ghosh, Zhixi Cai, Abhinav Dhall, Dimitrios Kollias, Roland Goecke, and Tom Gedeon. MRAC Track 1: 2nd Workshop on Multimodal, Generative and Responsible Affective Computing. In ACM MM Workshop 2024. [Paper] [Webpage]
-
2022/03 - 2023/07 DARPA Computational Cultural Understanding
Research Assistant - Supervised by Dr. Hamid Rezatofighi and Dr. Munawar Hayat
- The CCU program develops language technologies for Department of Defense operators to improve cross-cultural understanding, situational awareness, and interactional effectiveness. It requires minimal training data and aims to assist negotiators and analysts in the field.
- Subteam2: Culture-Aware Emotion Recognition
- As a team member, I mainly worked on the emotion recognition task, which uses multimodal information (visual, audio, and linguistic) to infer emotion, valence, and arousal. I focused on two tasks:
- Researching and developing a better approach for the multimodal emotion, valence, and arousal prediction task.
- Deploying the approach as a Docker container to the DARPA WebSocket-based system.
- Research Outcome
- Zhixi Cai, Shreya Ghosh, Kalin Stefanov, Abhinav Dhall, Jianfei Cai, Hamid Rezatofighi, Reza Haffari, and Munawar Hayat. MARLIN: Masked Autoencoder for facial video Representation LearnINg. In CVPR 2023. [Paper] [Code]
- Shreya Ghosh*, Zhixi Cai*, Parul Gupta, Garima Sharma, Abhinav Dhall, Munawar Hayat, and Tom Gedeon. Emolysis: A Multimodal Open-sourced Group Emotion Analysis and Visualization Toolkit. In ACII Demo 2024. *Equally contributed. [Paper] [Code]
- Tassadaq Hussain, Islam Nassar, Zhixi Cai, Hamid Rezatofighi, Munawar Hayat, and Nicholas Cummins. A Multi-Label Speech Emotion Recognition for Cross Cultural Communication. In UKSPEECH 2023.
- Simindokht Jahangard, Zhixi Cai, Shiki Wen, and Hamid Rezatofighi. JRDB-Social: A Multifaceted Robotic Dataset for Understanding of Context and Dynamics of Human Interactions Within Social Groups. In CVPR 2024. [Paper]
-
2023/06 - 2023/12 EEG-Guided Explainable Deepfake Audio Detection
Research Assistant - Supervised by Dr. Mahsa Salehi, Dr. Kalin Stefanov and Dr. Ehsan Shareghi
- This project analyzes the correlation between EEG and deepfake audio, with the goal of using that correlation to detect deepfake audio.
-
2023/08 - 2023/12 Australian Sign Language Recognition and Generation
Research Assistant - Supervised by Dr. Kalin Stefanov
- This project covers sign language recognition and generation. The goal is to use deep learning techniques to recognize sign language from video.
-
2019/12 - 2020/12 Data Generation Techniques for Automatic Facial Expression Recognition
Master Student - This is the master's thesis project supervised by A/Prof. Abhinav Dhall and Dr. Kalin Stefanov.
- This project introduces a method for generating new face images with controlled expression using generative neural networks. The method can improve expression recognition by generating data to train classifiers. The project aims to generate facial data for future research and applications, addressing the issue of performance bias towards certain cultures and ethnicities.
Teaching Experience
-
2024/04 - 2024/06 ITO5221 - Intelligent image and video analysis (2024 TP3)
OUC & OLA (Head Teaching Assistant) - Monash University
-
2023/09 - 2023/10 ITO5221 - Intelligent image and video analysis (2023 TP5)
OUC & OLA (Head Teaching Assistant) - Monash University
-
2022/03 - 2022/07 FIT5221 - Intelligent image and video analysis (2022 Semester 1)
Head Teaching Assistant - Monash University - Student Satisfaction 93%
-
2023/02 ITO5221 - Intelligent image and video analysis (2024)
Course Designer - Monash University
-
2022/12 ITO5221 - Intelligent image and video analysis (2023)
Course Designer - Monash University
-
2022/03 ITO5221 - Intelligent image and video analysis (2022)
Course Designer - Monash University
Awards
-
2024 ACM Multimedia 2024 Best Student Paper Award
Association for Computing Machinery (ACM)
-
2022 DICTA 2022 DSTG Best Contribution to Science Award
Australian Defence Science and Technology Group (DSTG)
Skill List
- Programming languages: Python, R, JavaScript, TypeScript, Java, Kotlin, Scala, Bash, C#, C
- Tools: PyTorch, TensorFlow, Keras, LaTeX, Slurm, SQL, Spark, sklearn, skimage, pandas, Markdown, Git, matplotlib, D3.js, Vue.js, React, Svelte, seaborn, ggplot, ...
- Software: Jupyter, JetBrains IDEs, Linux, Visual Studio, Microsoft Office, Docker, VirtualBox, SQL Developer, Tableau
Project Experience
-
2022/10 Deepfake Demo
- A deepfake demo used for Monash Open Day and the John Monash Science School immersion day. The web-based demo is built on Vue3 and PyTorch and shows deepfake generation (face reenactment based on Wav2Lip), voice cloning (based on SV2TTS), and deepfake detection (based on EfficientViT). The demo is packaged as several composed Docker images on Docker Hub. [Code]
-
2023/02 - Now Multi-Tangled (Mita)
- A web application for monitoring the runtime status of multiple machines. The application consists of a server (based on the Scala Play framework), a web UI frontend (based on Vue3), and a client (based on Python FastAPI). The server is packaged as a Docker image on Docker Hub. Users can use the client library in their Python code to sync runtime status to the server, and can observe the status of multiple clients in the web UI. [Demo] [Code]
-
2022/06 - 2022/09 Food Recognition System for an Android App
- Research Assistant. Responsible for building a food photo recognition system for an Android app. To overcome the shortage of suitable off-the-shelf datasets, extra data was scraped online. The model was built with TensorFlow and deployed with the TensorFlow Lite framework.
-
2021/08 - 2021/09 LightIoC
- A lightweight Inversion of Control (IoC) tool based on Dependency Injection (DI) for Scala. The project is published on Maven Central. [Code]
-
2021/05 - 2021/07 Tencent Advertisement Algorithm Competition
- An algorithm competition for predicting tags of advertisement videos. Working with multimodal data including video, audio, text, and images, we applied deep learning techniques including NeXtVLAD, ResNet, BERT, Transformer, and other attention mechanisms to fuse the multimodal information and predict labels. As group leader, I was mainly responsible for model design, programming, and experiments. The result ranked in the top 4% of all competitors. [Code]
-
2020/06 Kaggle: Predicting gender by Twitter text
- Applied natural language processing techniques for preprocessing, modelling, and predicting the author's gender from a Twitter text dataset. The result reached 85.1% accuracy and ranked in the top 6%. [Code]
-
2020/05 Forecast the area burned by the Australian forest fires
- Used R for data mining on a forest fire dataset, including exploratory data analysis, data preprocessing, and machine learning modelling and prediction. [Code]
-
2020/04 Kaggle: MNIST hand-written digits prediction
- Used convolutional neural networks to predict the labels of hand-written digits from the MNIST dataset. The result reached 99.6% accuracy and ranked in the top 11%.
-
2019/11 - Now War Thunder Data Project
- A web-based visualization application that uses Python and Selenium crawlers to collect time-series statistical data for the multiplayer online game War Thunder. The data is aggregated and organized, and the interactive website is built with D3.js and TypeScript. It currently has 316K visits. [Website] [Code]
-
2019/11 Real estate data integration and data reshaping
- Collected education, transportation, security, and real estate data using web crawlers and other methods, integrated the data, and then improved modelling performance through data reshaping. [Code]
-
2019/10 Weather condition prediction with Spark
- Used Apache Spark for data preprocessing and machine learning on a weather dataset. Spark SQL is used for data wrangling and featurization, and Spark ML for modelling. [Code]
-
2019/10 Food delivery data wrangling and modelling
- Used Python and Pandas to clean graph data and structured data. The process includes removing outliers, fixing data errors, and predicting the delivery fee with simple modelling. [Code]
Other Experience
-
2021/12 International School in Artificial Intelligence and its Applications in Computer Science (ISAAC)
- Participated in ISAAC as one of the very few students selected from around the world.
-
2021/11 ACML Asian Machine Learning School (OAMLS)
- Participated in the ACML Asian Machine Learning School as one of the very few students selected for the program from top universities in Asia.