cv | Zhixi Cai

General Information

Full Name	Zhixi Cai
Birth Year	1997
Languages	Chinese, English

Education

2021/03 - 2024/02
Doctor of Philosophy

Monash University, Australia
- Supervisors are Dr. Munawar Hayat, A/Prof. Abhinav Dhall, and Dr. Kalin Stefanov.
- The Ph.D. topic is Content-Based Deepfake Generation and Temporal Localization, in the area of computer vision and deep learning.
2019/03 - 2020/12
Master of Data Science

Monash University, Australia
- Master's coursework in data science, machine learning, and IT.
- WAM: 86, GPA: 3.88 / 4
2014/09 - 2018/07
Bachelor of Accounting (CPA Canada)

Ningbo University, China
- A coursework degree in which all units are taught in English with Canadian textbooks.
- WAM: 82.51, GPA: 3.20 / 4

Research Experience

2024/01 - Now
DARPA Assured Neuro Symbolic Learning and Reasoning

Postdoc Research Fellow (Level B)
- Supervised by Dr. Hamid Rezatofighi and A/Prof. Yuan-Fang Li
- Research Outcome
  - Fucai Ke*, Zhixi Cai*, Simindokht Jahangard*, Teresa Wang, Pari Delir Haghighi, Hamid Rezatofighi. HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning. Submitted to ECCV 2024. *Equally contributed. [Paper]
2021/03 - 2024/02
Content-Driven Multimodal Deepfake Generation and Temporal Localization

PhD Student
- This is the PhD thesis project supervised by Dr. Munawar Hayat, A/Prof. Abhinav Dhall, and Dr. Kalin Stefanov.
- Deepfake is a type of forgery that utilizes advancements in deep learning and computer vision. While it has potential, it also poses serious risks. Content-driven deepfakes, a new type of deepfake where only part of the video is fake, have received less attention. The goal of this PhD project is to develop methods for generating high-quality content-driven deepfakes from visual, audio, and linguistic modalities. The project also focuses on developing a high-performance deepfake detection method to counter this threat. The aim is to better prepare the research community and society for this new threat.
- Research Outcome
  - Zhixi Cai, Kalin Stefanov, Abhinav Dhall, and Munawar Hayat. Do You Really Mean That? Content Driven Audio-Visual Deepfake Dataset and Multimodal Method for Temporal Forgery Localization. In DICTA 2022. [Best Contribution to Science Award] [Paper] [Code]
  - Zhixi Cai, Shreya Ghosh, Abhinav Dhall, Tom Gedeon, Kalin Stefanov, and Munawar Hayat. Glitch in the Matrix: A Large Scale Benchmark for Content Driven Audio-Visual Forgery Detection and Localization. In CVIU. [Paper] [Code]
  - Zhixi Cai, Shreya Ghosh, Aman Pankaj Adatia, Munawar Hayat, Abhinav Dhall, Kalin Stefanov. AV-Deepfake1M: A Large-Scale LLM-Driven Audio-Visual Deepfake Dataset. Submitted to MM 2024. [Paper] [Code]
2022/03 - 2023/07
DARPA Computational Cultural Understanding

Research Assistant
- Supervised by Dr. Hamid Rezatofighi and Dr. Munawar Hayat
- The CCU program develops language technologies for Department of Defense operators to improve cross-cultural understanding, situational awareness, and interactional effectiveness. It requires minimal training data and aims to assist negotiators and analysts in the field.
- Subteam2: Culture-Aware Emotion Recognition
- As a team member, I mainly work on the emotion recognition task, which uses multimodal information (visual, audio, and linguistic) to infer emotion, valence, and arousal. I focus on two tasks:
  - Researching and developing a better approach for the multimodal emotion, valence, and arousal prediction task.
  - Deploying the approach as a Docker container to the DARPA WebSocket-based system.
- Research Outcome
  - Zhixi Cai, Shreya Ghosh, Kalin Stefanov, Abhinav Dhall, Jianfei Cai, Hamid Rezatofighi, Reza Haffari, and Munawar Hayat. MARLIN: Masked Autoencoder for facial video Representation LearnINg. In CVPR 2023. [Paper] [Code]
  - Shreya Ghosh*, Zhixi Cai*, Parul Gupta, Garima Sharma, Abhinav Dhall, Munawar Hayat, and Tom Gedeon. Emolysis: A Multimodal Open-sourced Group Emotion Analysis and Visualization Toolkit. Submitted to IUI 2024. *Equally contributed. [Paper] [Code]
  - Tassadaq Hussain, Islam Nassar, Zhixi Cai, Hamid Rezatofighi, Munawar Hayat, and Nicholas Cummins. A Multi-Label Speech Emotion Recognition for Cross Cultural Communication. In UKSPEECH 2023.
  - Simindokht Jahangard, Zhixi Cai, Shiki Wen, and Hamid Rezatofighi. JRDB-Social: A Multifaceted Robotic Dataset for Understanding of Context and Dynamics of Human Interactions Within Social Groups. In CVPR 2024. [Paper]
2023/06 - Now
EEG-Guided Explainable Deepfake Audio Detection

Research Assistant
- Supervised by Dr. Mahsa Salehi, Dr. Kalin Stefanov and Dr. Ehsan Shareghi
- This project analyzes the correlation between EEG and deepfake audio. The goal is to find this correlation and use it to detect deepfake audio.
2023/08 - Now
Australian Sign Language Recognition and Generation

Research Assistant
- Supervised by Dr. Kalin Stefanov
- This project is about sign language recognition and generation. The goal is to use deep learning techniques to recognize sign language from video.
2019/12 - 2020/12
Data Generation Techniques for Automatic Facial Expression Recognition

Master Student
- This is the master's thesis project supervised by A/Prof. Abhinav Dhall and Dr. Kalin Stefanov.
- This project introduces a method for generating new face images with controlled expression using generative neural networks. The method can improve expression recognition by generating data to train classifiers. The project aims to generate facial data for future research and applications, addressing the issue of performance bias towards certain cultures and ethnicities.

Teaching Experience

2023/09 - 2023/10

ITO5221 - Intelligent image and video analysis (2023 TP5)

OUC & OLA (Head Teaching Assistant) - Monash University
2022/03 - 2022/07

FIT5221 - Intelligent image and video analysis (2022 Semester 1)

Head Teaching Assistant - Monash University - Student Satisfaction 93%
2023/02

ITO5221 - Intelligent image and video analysis (2024)

Course Designer - Monash University
2022/12

ITO5221 - Intelligent image and video analysis (2023)

Course Designer - Monash University
2022/03

ITO5221 - Intelligent image and video analysis (2022)

Course Designer - Monash University

Skill List

Programming language: Python, R, JavaScript, TypeScript, Java, Kotlin, Scala, Bash, C#, C
Tool: PyTorch, TensorFlow, Keras, LaTeX, Slurm, SQL, Spark, sklearn, skimage, pandas, Markdown, Git, matplotlib, D3.js, Vue.js, React, Svelte, seaborn, ggplot, ...
Software: Jupyter, JetBrains IDEs, Linux, Visual Studio, Microsoft Office, Docker, VirtualBox, SQLDeveloper, Tableau

Project Experience

2023/04
Website for REACT2023 workshop
- A website for the REACT2023 workshop, built with the Hugo framework and deployed on GitHub Pages. [Website] [Code]
2022/10
Deepfake Demo
- A deepfake demo used for the Monash Open Day and the John Monash Science School immersion day. The demo is built with Vue3 and PyTorch. This web-based demo shows deepfake generation (face reenactment based on Wav2Lip), voice cloning (based on SV2TTS), and deepfake detection (based on EfficientViT). The demo is packaged as several composed Docker images on DockerHub. [Code]
2023/02 - Now
Multi-Tangled (Mita)
- A web application for monitoring the runtime status of multiple machines. The application is composed of a server (based on the Scala Play framework), a web UI frontend (based on Vue3), and a client (based on Python FastAPI). The server is packaged as a Docker image on DockerHub. Users can use the client library in their Python code to sync the runtime status to the server, and can observe the status of multiple clients on the web UI. [Demo] [Code]
2022/06 - 2022/09
Food Recognition System on Android APP
- Research Assistant. Responsible for building a food photo recognition system for an Android app. To overcome the shortage of suitable off-the-shelf datasets, extra data was scraped online. The model is built with TensorFlow and deployed with the TensorFlow Lite framework.
2021/08 - 2021/09
LightIoC
- A lightweight Inversion of Control (IoC) tool based on Dependency Injection (DI) for Scala. The project is published on Maven Central. [Code]
2021/05 - 2021/07
Tencent Advertisement Algorithm Competition
- An algorithm competition for predicting tags of advertisement videos. Working from multimodal data including video, audio, text, and images, we applied deep learning techniques including NeXtVLAD, ResNet, BERT, Transformer, and other types of attention to fuse the multimodal information and predict labels. As the group leader, I mainly designed the model structure, programmed, and ran experiments. The result ranked in the top 4% of all competitors. [Code]
2020/11
COVID-19 Data Visualization
- Using HTML/CSS/JavaScript and the D3.js library to visualize COVID-19 data interactively. This narrative visualization combines data science and web development techniques. The webpage has been deployed on GitHub. [Demo] [Code]
2021/07 - Now
Neural Engine Kompanion (tensorneko)
- A PyTorch utility library for easier preprocessing, building, training, and evaluation of PyTorch models. The library is published on PyPI and is downloaded hundreds of times per month. [Code]
2020/06
Kaggle: Predicting gender by Twitter text
- Applying natural language processing techniques to preprocess, model, and predict an author's gender from a Twitter text dataset. The accuracy is 85.1%, ranked in the top 6%. [Code]
2020/05
Forecast the area burned by the Australian forest fires
- Using R for data mining on a forest fire dataset, including exploratory data analysis, data preprocessing, and machine learning modelling and prediction. [Code]
2020/04
Kaggle: MNIST hand-written digits prediction
- Using convolutional neural networks to predict the labels of hand-written digits from the MNIST dataset. The accuracy is 99.6%, ranked in the top 11%.
2019/11 - Now
War Thunder Data Project
- A web-based visualization application that uses Python and Selenium crawlers to collect time-series statistical data for the multiplayer online game War Thunder. The data is aggregated and organized, and the interactive website is built with D3.js and TypeScript. It currently has 316K visits. [Website] [Code]
2019/11
Real estate data integration and data reshaping
- Collecting education, transportation, security, and real estate data with a web crawler and other methods, integrating them, and then improving the modelling performance through data reshaping. [Code]
2019/10
Weather condition prediction with Spark
- Using Apache Spark for data preprocessing and machine learning on a weather dataset. In this project, SparkSQL is used for data wrangling and featurization, and SparkML is used for modelling. [Code]
2019/10
Food delivery data wrangling and modelling
- Using Python and Pandas to clean graph data and structured data. The process includes removing outliers, fixing data errors, and predicting the delivery fee with simple modelling. [Code]

Other Experience

2021/12
International School in Artificial Intelligence and its Applications in Computer Science (ISAAC)
- Participated in ISAAC as one of the very few students accepted from around the world.
2021/11
ACML Asian Machine Learning School (OAMLS)
- Participated in the ACML Asian Machine Learning School as one of the very few students accepted to the program from top universities in Asia.
