ClipTraining

Semantic video search infrastructure powered by AI

Overview

ClipTraining is an AI-powered search infrastructure that transforms video libraries into a queryable knowledge base.

Users can search in natural language and retrieve the exact moment in a video (minute and second) where the answer appears.


The Problem

Video content is inherently difficult to search:

  • Keyword search offers no semantic understanding of spoken content
  • Metadata is incomplete or inconsistent
  • Users must manually scrub through videos
  • Scaling search across thousands of clips is costly

The Solution

I designed and implemented an end-to-end pipeline that:

  • Transcribes videos with timestamps
  • Converts content into vector embeddings
  • Enables semantic search via natural language
  • Returns precise video segments

System Overview

Video → timestamped transcription (ASR) → chunking → embeddings → vector database (pgvector) → search API

Key Contributions

Transcription & Time Alignment

  • Evaluated multiple ASR providers (Whisper, AssemblyAI, GCP, AWS)
  • Generated accurate timestamped transcripts
  • Improved alignment for precise segment retrieval
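
As a minimal sketch of the transcription step, the open-source whisper package returns segments with start/end times in seconds; the model size and file path below are illustrative, not the production configuration:

```python
# Sketch: timestamped transcription with the open-source Whisper package.
# Model size and file path are illustrative assumptions.
import whisper

model = whisper.load_model("base")
result = model.transcribe("clips/teams_meetings_tutorial.mp4")

# Each segment carries start/end times in seconds, which drive segment retrieval.
for segment in result["segments"]:
    print(f"[{segment['start']:7.2f}s -> {segment['end']:7.2f}s] {segment['text'].strip()}")
```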

Metadata Enrichment

  • Identified gaps in existing metadata
  • Generated contextual metadata using LLMs
  • Improved search relevance and discoverability
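
A minimal sketch of the enrichment step, assuming the OpenAI Python SDK; the model name, prompt, and metadata fields are illustrative assumptions rather than the project's actual configuration:

```python
# Sketch: generating contextual metadata for a transcript with an LLM.
# Model name, prompt, and output fields are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def enrich(transcript_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[
            {"role": "system",
             "content": "From this training-video transcript, produce search "
                        "metadata: topic, product or feature, and user intent."},
            {"role": "user", "content": transcript_text},
        ],
    )
    return response.choices[0].message.content
```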

Chunking Strategy (A/B Testing)

  • Tested multiple chunk sizes and overlaps
  • Evaluated impact on retrieval accuracy and latency
  • Selected optimal configuration: 12s window / 3s overlap
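
The winning configuration reduces to a simple sliding window. The sketch below assumes the transcript is a time-sorted list of (start_sec, end_sec, text) word tuples; that data shape is an assumption, not the project's internal format:

```python
# Sketch: sliding-window chunking with the selected 12 s window / 3 s overlap.
# Assumes `words` is a time-sorted list of (start_sec, end_sec, text) tuples.

def chunk_words(words, window=12.0, overlap=3.0):
    """Yield (start, end, text) chunks; consecutive windows share `overlap` seconds."""
    if not words:
        return
    step = window - overlap          # 9 s hop between window starts
    t = words[0][0]
    last_end = words[-1][1]
    while t < last_end:
        in_window = [w for w in words if t <= w[0] < t + window]
        if in_window:
            yield (in_window[0][0], in_window[-1][1],
                   " ".join(w[2] for w in in_window))
        t += step
```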

Embeddings & Retrieval

  • Benchmarked OpenAI vs Gemini embeddings
  • Weighed semantic accuracy against retrieval performance
  • Enabled similarity-based search beyond keywords
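
A sketch of similarity-based retrieval with OpenAI embeddings and cosine similarity; the model name is a placeholder (the project benchmarked OpenAI against Gemini), and in production the ranking happens in the vector database rather than in memory:

```python
# Sketch: embedding chunks and ranking by cosine similarity.
# The model name is a placeholder; the project compared OpenAI and Gemini models.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in response.data])

def top_k(query: str, chunks: list[str], k: int = 3) -> list[tuple[float, str]]:
    vectors = embed(chunks)
    q = embed([query])[0]
    # Cosine similarity: dot product over the product of L2 norms.
    sims = (vectors @ q) / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q))
    return sorted(zip(sims.tolist(), chunks), reverse=True)[:k]
```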

Vector Database Architecture

  • Evaluated Pinecone, FAISS, ChromaDB, and pgvector
  • Implemented scalable storage and retrieval layer
  • Designed for fast similarity search across large datasets
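
A sketch of the pgvector retrieval layer using psycopg (v3); the table name, schema, and embedding dimension are illustrative assumptions:

```python
# Sketch: a pgvector-backed similarity search with psycopg (v3).
# Table name, schema, and embedding dimension are illustrative assumptions.
import psycopg

DDL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS clip_chunks (
    id        bigserial PRIMARY KEY,
    video_id  text NOT NULL,
    start_sec real NOT NULL,
    end_sec   real NOT NULL,
    content   text NOT NULL,
    embedding vector(1536)  -- must match the embedding model's dimension
);
"""

def search(conn: psycopg.Connection, query_vec: list[float], k: int = 5):
    # <=> is pgvector's cosine-distance operator; lower distance = more similar.
    return conn.execute(
        """
        SELECT video_id, start_sec, end_sec, content
        FROM clip_chunks
        ORDER BY embedding <=> %s::vector
        LIMIT %s
        """,
        (str(query_vec), k),
    ).fetchall()
```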

Incremental Indexing

  • Avoided full reprocessing of ~20k video clips
  • Implemented hashing strategy on VTT files
  • Re-indexed only modified content
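
A sketch of the hashing strategy, assuming one VTT file per clip and a JSON manifest of previously indexed digests (both assumptions for illustration):

```python
# Sketch: skip unchanged clips by hashing each VTT transcript file.
# Manifest path and directory layout are illustrative assumptions.
import hashlib
import json
from pathlib import Path

MANIFEST = Path("index_manifest.json")

def changed_vtts(vtt_dir: Path) -> list[Path]:
    """Return only the VTT files whose content hash differs from the manifest."""
    seen = json.loads(MANIFEST.read_text()) if MANIFEST.exists() else {}
    stale = []
    for vtt in sorted(vtt_dir.glob("*.vtt")):
        digest = hashlib.sha256(vtt.read_bytes()).hexdigest()
        if seen.get(vtt.name) != digest:
            stale.append(vtt)        # new or modified: needs re-indexing
            seen[vtt.name] = digest
    MANIFEST.write_text(json.dumps(seen, indent=2))
    return stale
```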

Search API (Production Deployment)

  • Built FastAPI service deployed on Azure Functions
  • Endpoints:
    • /search
    • /index
    • /health
  • Integrated with PostgreSQL (pgvector) and Azure infrastructure
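
A sketch of the search endpoint's shape; the response model and the run_vector_search helper are hypothetical stand-ins for the production code:

```python
# Sketch: the shape of the search API.
# SearchHit and run_vector_search are hypothetical stand-ins.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class SearchHit(BaseModel):
    video_id: str
    start_sec: float
    end_sec: float
    snippet: str

def run_vector_search(q: str, k: int) -> list[dict]:
    # Stub: in production this embeds `q` and queries pgvector (sketched above).
    return []

@app.get("/health")
def health() -> dict:
    return {"status": "ok"}

@app.get("/search", response_model=list[SearchHit])
def search(q: str, k: int = 5):
    # Embed the query, run the similarity search, return timestamped hits.
    return run_vector_search(q, k)
```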

Cost Optimization

  • Modeled API costs for transcription, embeddings, and metadata
  • Identified low per-video cost (~$0.04 at scale)
  • Balanced accuracy vs compute efficiency
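
A back-of-the-envelope version of the cost model; every rate below is a placeholder assumption chosen only to show how the per-video figure is composed, not actual vendor pricing:

```python
# Sketch: back-of-the-envelope per-video cost model.
# Every rate here is a placeholder assumption, not actual vendor pricing.
MINUTES_PER_VIDEO  = 5        # assumed average clip length
ASR_PER_MINUTE     = 0.006    # $/min transcription (assumed)
TOKENS_PER_MINUTE  = 200      # spoken-word token rate (assumed)
EMBED_PER_TOKEN    = 0.02e-6  # $/token embedding (assumed)
METADATA_PER_VIDEO = 0.005    # one LLM enrichment call (assumed)

asr_cost   = MINUTES_PER_VIDEO * ASR_PER_MINUTE
embed_cost = MINUTES_PER_VIDEO * TOKENS_PER_MINUTE * EMBED_PER_TOKEN
total      = asr_cost + embed_cost + METADATA_PER_VIDEO
print(f"~${total:.3f} per video")  # lands in the few-cent range at these rates
```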

Example Output

Query: “How do I share my screen in Teams?”

Result:

  • Video: Teams Meetings Tutorial
  • Timestamp: 02:34
  • Segment: “Click the share button in the meeting controls…”

Outcome

  • Production-ready semantic video search system
  • Scales to ~20k video clips
  • Enables conversational and API-based search experiences
  • Ready for frontend integration and enterprise use

Key Learnings

  • Chunking strategy has a major impact on retrieval quality
  • Metadata + semantic search (hybrid) significantly improves precision
  • Cost modeling is critical for scaling AI systems
  • Incremental pipelines are essential for real-world deployments