Twelve Labs receives $12 million for AI that can comprehend video context

Posted On: December 5, 2022

As a trained data scientist, Jae Lee could never understand why video, which has become such a massive part of our lives thanks to the proliferation of sites like TikTok, Vimeo, and YouTube, remained so technically challenging to search because of the lack of context understanding. Searching video by title, description, and tags has always been easy enough, requiring nothing more than a basic search algorithm. But technology has yet to allow for searching within videos for specific moments and scenes, especially when those moments and scenes were not labelled in an obvious way.

Lee and a group of friends in tech built a cloud service for video search and understanding to address this issue. It became Twelve Labs, which has raised $17 million in venture funding to date, $12 million of which came from a seed extension round that closed today. Radical Ventures led the extension, with participation from Index Ventures, WndrCo, Spring Ventures, and Weights & Biases CEO Lukas Biewald, Lee said.

“The objective of Twelve Labs is to help developers create programmes that can see, listen, and comprehend the world like humans do by offering them the most sophisticated video understanding infrastructure,” Lee stated.

Twelve Labs, which is presently in closed beta, uses AI to try to extract “rich information” from videos, such as movement and activities, objects and people, sound, on-screen text, and speech, and to determine the relationships between them. The platform converts these elements into mathematical representations called “vectors” and forms “temporal connections” between frames, enabling applications like video scene search.
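To make the idea of vector-based scene search concrete, here is a minimal sketch in Python. It is not Twelve Labs' implementation: the three-dimensional vectors, the scene timestamps, and the function names `cosine_similarity` and `search_scenes` are all hypothetical, chosen only to show how a query vector might be compared against per-scene vectors. A real system would presumably produce much larger vectors from a trained model.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (norm_a * norm_b)


def search_scenes(query_vector, scene_index, top_k=1):
    """Rank scenes by similarity to the query; return the top_k matches.

    Each result is a (score, start_seconds, end_seconds) tuple.
    """
    scored = [
        (cosine_similarity(query_vector, vector), start, end)
        for (start, end, vector) in scene_index
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]


# Hypothetical index: (start_seconds, end_seconds, scene vector).
# The numbers are made up for illustration only.
SCENE_INDEX = [
    (0.0, 12.5, [0.9, 0.1, 0.0]),
    (12.5, 40.0, [0.1, 0.8, 0.3]),
    (40.0, 65.0, [0.0, 0.2, 0.9]),
]

# Hypothetical vector for a text query, assumed to live in the same space.
QUERY = [0.2, 0.9, 0.2]

best = search_scenes(QUERY, SCENE_INDEX, top_k=1)
score, start, end = best[0]
print(f"Best match: {start}s to {end}s (similarity {score:.2f})")
```

The design choice worth noting is that the search never looks at titles or tags: once the query and every scene are expressed as vectors in the same space, finding a moment reduces to a nearest-neighbour comparison.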

“As a part of realising the company’s objective to help developers construct intelligent video apps, the Twelve Labs team is creating ‘foundation models’ for multimodal video understanding,” Lee added. “Developers will be able to access these models via a set of APIs,” he said, covering not only semantic search but also tasks like long-form video “chapterization,” summary generation, and video question answering.

Google takes a similar approach to video understanding with its MUM AI system, which the company uses to power video recommendations across Google Search and YouTube by picking out subjects in videos (e.g., “acrylic painting materials”) based on the audio, text, and visual content. The technologies are comparable, but Twelve Labs is one of the first vendors to bring such a system to market; Google has kept MUM internal and has declined to make it available through a public-facing API.

Google, Microsoft, and Amazon all provide services (e.g., Google Cloud Video AI, Azure Video Indexer, and AWS Rekognition) that can identify scenes, characters, and activities in videos and extract detailed information from each frame. French computer vision startup Reminiz claims to be able to index any video format and tag both archived and real-time transmissions. But Lee contends that Twelve Labs is sufficiently differentiated, in part because its platform lets users fine-tune the AI to particular kinds of video material.

“What we observed is that narrow AI technologies created to detect specific problems demonstrate excellent accuracy in their ideal circumstances in a controlled environment, but don’t scale so well to chaotic real-world data,” Lee said. “They function more like a rule-based system and can’t generalise well when unexpected conditions arise. We also regard this as a constraint rooted in a lack of context awareness.” Understanding context, Lee argued, is what allows people to generalise across seemingly different situations in the real world, and it is where Twelve Labs excels.

Lee claims that Twelve Labs’ technology can be used for more than just search, citing examples such as ad insertion and content moderation. It can, for instance, judge whether a video showing a knife is violent or instructional. He also says it can be used for media analytics, for real-time feedback, and to generate highlight reels from videos automatically.

A little over a year after its start (March 2021), Twelve Labs has paying clients (Lee wouldn’t say how many) and a multiyear partnership with Oracle to train AI models using Oracle’s cloud infrastructure. Looking forward, the firm plans to invest in building out its tech and expanding its team. (Lee declined to say how many people work at Twelve Labs, although according to LinkedIn, there are about eighteen.)

Despite the immense value that large models can deliver, it is impractical for most firms to train, manage, and maintain these models in-house. Lee argued that any business could benefit from the Twelve Labs platform’s advanced video understanding features with just a handful of simple API calls. Multimodal video understanding is where AI is headed, he said, and Twelve Labs is in a prime position to advance the field further in 2023.

Catherine A. Leal

Subtly charming pop culture geek. Amateur analyst. Freelance tv buff. Coffee lover
