The human mind is constantly trying to anticipate the future, often subconsciously. Machines struggle to do the same, and even artificial intelligence (AI) systems find it difficult to predict what comes next. Google is working on that problem: a team of its researchers has proposed VideoBERT, a self-supervised system that learns to make predictions from unlabeled videos.
VideoBERT’s goal is to discover high-level audio and visual semantic features that correspond to actions and events unfolding over time. The Google researchers wrote in a blog post, “Speech tends to be temporally aligned with the visual signals, and can be extracted by using off-the-shelf automatic speech recognition (ASR) systems, and thus provides a natural source of self-supervision.”
VideoBERT builds on Google’s Bidirectional Encoder Representations from Transformers (BERT) to learn representations of video. BERT is Google’s state-of-the-art model for natural language representations.
To apply BERT to video, the researchers combine image frames with the sentence outputs of speech recognition and convert the video frames into visual tokens, each covering 1.5 seconds of video. The visual tokens are then concatenated with the word tokens, and the VideoBERT model is trained to fill in tokens that are missing from the combined sequence.
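The token pipeline described above can be illustrated with a minimal Python sketch. It is not Google's implementation: the function names, the special markers (`[CLS]`, `[>]`, `[SEP]`, `[MASK]`), the visual token IDs such as `v01`, and the masking probability are all assumptions chosen for illustration. The sketch only shows the idea of joining word tokens with visual tokens and hiding some of them so that a model can be trained to fill them back in.

```python
import random

# All marker strings below are illustrative assumptions, not taken from the article.
CLS, JOIN, SEP, MASK = "[CLS]", "[>]", "[SEP]", "[MASK]"
SPECIAL = {CLS, JOIN, SEP}


def build_sequence(word_tokens, visual_tokens):
    """Concatenate word tokens and visual tokens into one sequence."""
    return [CLS] + list(word_tokens) + [JOIN] + list(visual_tokens) + [SEP]


def mask_tokens(sequence, mask_prob=0.15, seed=0):
    """Hide a random subset of non-special tokens.

    Returns the masked sequence and a dict mapping each masked position
    to the original token, which is what a model would learn to predict.
    """
    rng = random.Random(seed)
    masked, targets = [], {}
    for position, token in enumerate(sequence):
        if token not in SPECIAL and rng.random() < mask_prob:
            masked.append(MASK)
            targets[position] = token
        else:
            masked.append(token)
    return masked, targets


# Hypothetical example: words from speech recognition plus made-up visual token IDs.
words = ["add", "flour", "and", "cocoa", "powder"]
visuals = ["v01", "v44", "v08", "v72"]

sequence = build_sequence(words, visuals)
masked, targets = mask_tokens(sequence, mask_prob=0.3, seed=0)
print(masked)
print(targets)
```

A real system would feed the masked sequence to a Transformer and score its predictions against `targets`; here the two structures are simply printed so the shape of the training data is visible.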
The researchers trained VideoBERT on one million instructional videos covering cooking, vehicle repair and gardening. They then evaluated the model's outputs to check how well it had learned.
According to the blog post, the results showed that VideoBERT could successfully predict that a cup of flour and cocoa powder may become a brownie or a cupcake after being baked in an oven. The post also notes that VideoBERT frequently misses fine-grained visual information, such as small objects and hard-to-notice motions.
“Our results demonstrate the power of the BERT model for learning visual-linguistic and visual representations from unlabeled videos. We find that our models are not only useful for zero-shot action classification and recipe generation, but the learned temporal representations also transfer well to various downstream tasks, such as action anticipation,” the researchers concluded.