This is a submission for the Cloudflare AI Challenge.
What I Built
For this challenge I built a Video Analysis Tool that uses Cloudflare AI models to analyze frames from a video file. It synthesizes information from individual frames into descriptions, summaries, and embeddings, giving a deeper understanding of the visual data. It can be used for various purposes, including security surveillance and data analysis.
These frames are first captured and then sent through an API gateway for further analysis by the models.
The analysis involves the following steps:
- Get a description of each frame using the Image-to-Text model.
- Get the embeddings of all frames using the Cloudflare Text Embeddings model.
- Fetch a summary of the description using the Cloudflare Summarization model.
- Store this data in the database.
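The steps above could be sketched roughly as follows. This is a minimal illustration and not the code from the repositories: `RunModel`, `FrameRecord`, and `analyzeFrame` are hypothetical names, the runner is passed in so the sketch does not depend on a live Worker, and the input and output field names (`image`, `prompt`, `description`, `text`, `data`, `input_text`, `summary`) are assumptions about how these Workers AI models are called. It also assumes the text embedded for a frame is its description, since the embeddings model takes text.

```typescript
// Shape of a model runner. In a Cloudflare Worker this could be
// (model, inputs) => env.AI.run(model, inputs); that wiring is an assumption.
type RunModel = (model: string, inputs: Record<string, unknown>) => Promise<any>;

// Hypothetical record stored per frame.
interface FrameRecord {
  frameIndex: number;
  description: string;
  summary: string;
  embedding: number[];
}

// Model IDs as listed in this post.
const IMAGE_TO_TEXT = "@cf/unum/uform-gen2-qwen-500m";
const EMBEDDINGS = "@cf/baai/bge-base-en-v1.5";
const SUMMARIZATION = "@cf/facebook/bart-large-cnn";

async function analyzeFrame(
  run: RunModel,
  frameIndex: number,
  image: Uint8Array
): Promise<FrameRecord> {
  // 1. Describe the frame (assumed input and output field names).
  const caption = await run(IMAGE_TO_TEXT, {
    image: Array.from(image),
    prompt: "Describe what is happening in this video frame.",
  });
  const description: string = caption.description;

  // 2 and 3. Embed and summarize the description; the two calls are
  // independent of each other, so they can run concurrently.
  const [embedded, summarized] = await Promise.all([
    run(EMBEDDINGS, { text: [description] }),
    run(SUMMARIZATION, { input_text: description }),
  ]);

  // 4. Return one record; writing it to the database is left to the caller.
  return {
    frameIndex,
    description,
    summary: summarized.summary,
    embedding: embedded.data[0],
  };
}
```

Passing the runner in as a parameter keeps the model calls swappable, so the same function can be exercised with a stub before it is connected to the real models.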
A user can chat with an AI in the context of the analyzed video. The frames' vector embeddings are then used as context data for the AI Text Generation model.
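One common way to turn stored embeddings into chat context is to rank frames by cosine similarity to the embedded question and pass the best matches to the text generation model. The sketch below assumes that approach; the post does not say how the context is actually selected. `StoredFrame`, `cosineSimilarity`, `topFrames`, and `buildMessages` are hypothetical names, and the `role`/`content` message format is an assumption about the chat input.

```typescript
// Hypothetical shape of a frame as read back from the database.
interface StoredFrame {
  frameIndex: number;
  description: string;
  embedding: number[];
}

// Cosine similarity between two vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Embedding lengths differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid dividing by zero
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the k frames whose embeddings are closest to the question embedding.
function topFrames(query: number[], frames: StoredFrame[], k: number): StoredFrame[] {
  return frames
    .map((frame) => ({ frame, score: cosineSimilarity(query, frame.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((scored) => scored.frame);
}

// Build chat messages that ground the answer in the selected frames.
function buildMessages(question: string, context: StoredFrame[]) {
  const contextText = context
    .map((f) => `Frame ${f.frameIndex}: ${f.description}`)
    .join("\n");
  return [
    {
      role: "system",
      content: `Answer using only these video frame descriptions:\n${contextText}`,
    },
    { role: "user", content: question },
  ];
}
```

In use, the question would be embedded with the same embeddings model as the frames, and the result of `buildMessages` would be sent to the text generation model.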
Features
- Video Upload: Users can upload videos from their local machine for analysis.
- Frame Analysis: The tool analyzes individual frames of the video to extract and synthesize key information.
- Scene Analysis: Analyzes scenes to identify different environments or settings in the video.
- Data Visualization: Provides visualizations of the analysis results for easier interpretation.
Demo
The repositories for this can be found here:
https://github.com/ezecodes/serverless-c3
https://github.com/ezecodes/simple-sockets
My Code
Journey
Initially, I envisioned it as a threat detection tool for CCTV AI surveillance, aiming to enhance security systems. However, as the project evolved, I realized its potential to go beyond security applications and become a versatile video analysis tool.
One of the major challenges I faced was integrating and fine-tuning various ML models to analyze video frames effectively. Understanding and implementing these models required a solid grasp of basic ML concepts, which I had to learn along the way. This learning curve was steep but incredibly rewarding.
As I continue to improve the software, I aim to broaden its scope to encompass various domains. For instance, I envision the tool being used to analyze health video scans, aiding in medical diagnostics and research. This expansion into new domains presents both technical and conceptual challenges, but I am excited about the possibilities it offers.
Multiple Models and/or Triple Task Types
This project uses multiple models and task types:
- Image-to-Text: @cf/unum/uform-gen2-qwen-500m
- Text Embeddings: @cf/baai/bge-base-en-v1.5
- Summarization: @cf/facebook/bart-large-cnn
- Text Generation
Team member(s): https://dev.to/ezecodes