Tutorial · March 2026 · 8 min read

Using Learning Videos as a Knowledge Base for AI Tutors

How Video Understanding AI turns any video into a searchable, interactive knowledge source

Tags: Video AI · RAG

AI tutors, video AI, and RAG for e-learning: many education providers have invested significant time and money in explainer and instructional videos over recent years. The problem: that knowledge is locked inside the video files. Video Understanding AI changes this by turning classic learning videos into a searchable, interactive knowledge base for AI tutors.

Demo: Full workflow from Twelve Labs to the Alphabees AI tutor

The Problem with Classic Learning Videos

Many organizations have built extensive libraries of explainer videos, screencast tutorials, and recorded lectures over the years. This is valuable knowledge — but in a form that modern AI systems can barely use.

Videos are mostly a one-way street. Knowledge is locked inside the video and can't easily be searched or integrated into AI tutors.

The concrete limitations of classic learning videos:

  • Content is rigid and difficult to update
  • Knowledge cannot be searched
  • No modular reuse across different learning modules
  • Practically inaccessible for AI tutors and RAG systems
  • Learners must consume the entire video linearly

Education providers that want to deploy AI learning companions, in particular, need knowledge in a searchable, structured form, not locked inside a video file.

What Is Video Understanding AI?

Video Understanding AI is a new class of AI models specifically designed to comprehend the content of videos. Unlike classic transcription (which only converts speech to text), Video Understanding AI analyzes the video as a whole:

  • Spoken text (audio transcription)
  • Visual content: scenes, objects, and graphics
  • Relationships between image and spoken content
  • Temporal structure and transitions

The result is a far richer understanding of video content than a pure audio transcription ever provides.

Definition: Video Understanding Foundation Models are pre-trained AI models that semantically understand videos — similar to how language models (LLMs) understand text. Providers like Twelve Labs specialize in this technology.

The Workflow: Step by Step

Here is the complete workflow for converting a learning video into an AI tutor knowledge base — demonstrated using an explainer video about how fossils form.

Step 1: Source a royalty-free video

Use royalty-free video sources or your own productions. Platforms like Pexels, Pixabay or Wikimedia Commons work well for testing. Important: keep license information ready and credit the author where required.

Step 2: Upload the video to Twelve Labs

Create a free account at Twelve Labs and upload the video into a new project. Twelve Labs processes the video and builds an internal semantic index; depending on the video's length, this takes anywhere from a few seconds to a few minutes.
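For larger video libraries, the upload can also be scripted against the Twelve Labs API. The sketch below only constructs the request description without sending it; the endpoint path, API version, header name, and field names (`index_id`, `video_url`) are assumptions modeled on the API at the time of writing and should be verified against the current Twelve Labs API reference.

```python
import json

# NOTE: endpoint path, version, and field names are assumptions;
# check the current Twelve Labs API reference before relying on them.
API_BASE = "https://api.twelvelabs.io/v1.2"

def build_index_task(index_id: str, video_url: str) -> dict:
    """Describe an indexing task for a video hosted at a public URL."""
    return {
        "url": f"{API_BASE}/tasks",
        "headers": {
            "x-api-key": "YOUR_API_KEY",  # placeholder, not a real key
            "Content-Type": "application/json",
        },
        "body": json.dumps({"index_id": index_id, "video_url": video_url}),
    }

task = build_index_task("idx_fossils", "https://example.com/fossils.mp4")
print(task["url"])
```

Sending the request (with any HTTP client) then starts the same processing that an upload via the web console triggers.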

Tip

Twelve Labs offers different models, including Marengo for search and retrieval tasks and Pegasus for generative tasks such as script generation. For this workflow, we use Pegasus.

Step 3: Generate a complete script with timestamps

Once processed, you can give the AI direct instructions. For an AI tutor knowledge base, the following prompt works well:

"Generate a complete script of the video with all relevant information and timestamps."

Twelve Labs then returns a structured, detailed script — not just a spoken-word transcript, but a fully coherent content document with context and timestamps.
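If you script this step via the API instead of the console, the generation call can be sketched as below. Again, the endpoint name and field names are assumptions based on the Twelve Labs API at the time of writing; the sketch only builds the request locally, and you should check the names against the current API reference.

```python
import json

API_BASE = "https://api.twelvelabs.io/v1.2"  # version path is an assumption
PROMPT = ("Generate a complete script of the video with all relevant "
          "information and timestamps.")

def build_generate_request(video_id: str) -> dict:
    """Describe a Pegasus text-generation call for an indexed video."""
    return {
        "url": f"{API_BASE}/generate",  # endpoint name is an assumption
        "headers": {
            "x-api-key": "YOUR_API_KEY",  # placeholder, not a real key
            "Content-Type": "application/json",
        },
        "body": json.dumps({"video_id": video_id, "prompt": PROMPT}),
    }

req = build_generate_request("vid_abc123")
print(req["url"])
```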

Step 4: Export the script as a PDF

Copy the generated script and save it as a PDF file. This document now contains the entire knowledge of the video in a structured, machine-readable form.

Step 5: Import into the Alphabees AI tutor

In the Alphabees AI tutor admin portal:

  1. Create a new knowledge base (e.g. "Fossils — Learning Video March 2026")
  2. Create a new folder
  3. Upload the PDF script

The AI tutor processes the document automatically and builds an internal vector database (RAG) based on the content.
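Conceptually, that import step splits the script into timestamped passages before indexing them. A minimal sketch of such chunking, assuming each passage starts with a bracketed [HH:MM:SS] timestamp (the exact format of a generated script may differ, so adjust the pattern as needed):

```python
import re

# Assumed script shape: each passage starts with a [HH:MM:SS] timestamp,
# e.g. "[00:02:15] Sediment gradually buries the remains."
TS_LINE = re.compile(r"\[(\d{2}:\d{2}:\d{2})\]\s*(.+)")

def split_script(script: str) -> list:
    """Split a generated script into timestamped passages for RAG indexing."""
    chunks = []
    for line in script.splitlines():
        match = TS_LINE.match(line.strip())
        if match:
            chunks.append({"timestamp": match.group(1),
                           "text": match.group(2)})
    return chunks

demo = """[00:00:05] An organism dies and sinks to the seabed.
[00:02:15] Sediment gradually buries the remains."""

chunks = split_script(demo)
print(len(chunks))
```

Keeping the timestamp with each passage is what lets the tutor later cite the exact position in the video.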

Step 6: Query the AI tutor

The AI tutor can now answer questions about this video. Ask for example: "How do fossils form?" — the tutor responds precisely based on the video content, including relevant timestamps as references.

What Is RAG — and Why Does It Matter?

RAG (Retrieval-Augmented Generation) is the technical foundation that enables an AI tutor to respond not from general training knowledge but from the institution's own course content.

The process, simplified:

  1. The learner's question is converted into a semantic search vector
  2. The system searches the vector database for relevant text passages
  3. The most relevant passages are passed to the language model as context
  4. The language model formulates a precise answer based on those sources

Through the video-to-script workflow, video content becomes accessible to RAG systems for the first time — without needing to analyze the video in real time.
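The four steps above can be illustrated with a deliberately simplified sketch: production systems use neural embedding models, but a bag-of-words cosine similarity is enough to show the retrieval mechanics (steps 1 to 3).

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': bag-of-words term counts. Real RAG systems use a
    neural embedding model; this only illustrates the retrieval step."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Passages as they might come out of an imported video script
passages = [
    "[00:00:05] An organism dies and is buried in sediment.",
    "[00:04:10] Minerals slowly replace the organic material.",
    "[00:07:30] Erosion later exposes the fossil at the surface.",
]

query = "how is the organism buried in sediment"
q = embed(query)

# Embed the question, score all passages, pick the best-matching context
ranked = sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)
print(ranked[0])  # the top passage is handed to the LLM as context
```

In step 4, the top-ranked passages (not just one) are placed into the language model's prompt so the answer stays grounded in the source material.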

Practical Use Cases

Once a video has been imported as a knowledge base, several scenarios open up for education providers:

  • Moodle integration: The AI tutor is embedded directly into a Moodle course. Learners can ask questions about course content — including video content — without watching the full video
  • Website integration: An AI tutor on a course landing page answers prospective participants' questions based on existing learning materials and videos
  • On-demand knowledge retrieval: Learners search for specific content rather than linearly consuming a 45-minute video
  • Exercise generation: The AI tutor automatically generates practice questions based on the video script
  • Multilingual content: The script can be translated, enabling videos to serve as a knowledge base in multiple languages

Video Understanding AI vs. Classic Transcription

A common objection: "Couldn't I just use a transcription tool like Whisper?"

True — for purely audio-based content. The key difference lies in visual understanding:

  • Classic transcription: Converts spoken audio to text. What is shown on screen is lost
  • Video Understanding AI: Additionally analyzes the visual content — diagrams, animations, demonstrated processes, captions — and incorporates these into the generated knowledge

For explainer videos, screencasts, and presentation recordings, this difference delivers the decisive quality improvement: the script is not just a transcript, but a comprehensive content document.

Outlook: Video RAG Directly in the Alphabees Portal

The workflow shown here currently works as a manual process: analyze video → export script → upload to knowledge base. This takes a few minutes and is manageable for individual videos.

Long-term, a direct integration into the Alphabees Portal is planned: videos would be uploaded directly or linked by URL, the analysis would run automatically in the background, and the knowledge would immediately be available in the knowledge base.

If you'd like to use videos directly in your Alphabees Portal as a knowledge base, get in touch with us; the level of interest will determine how quickly we prioritize this integration.

Tools in This Workflow

Twelve Labs

Video Understanding Foundation Models — semantic video analysis and script generation

Alphabees AI Tutor

AI tutors for Moodle and learning platforms — with your own RAG knowledge base

Frequently Asked Questions (FAQ)

What is Video Understanding AI?

Video Understanding AI is a technology that comprehends video content holistically — not just the spoken audio, but also the visual content, scenes, objects, and relationships within the footage. Providers like Twelve Labs develop so-called Video Understanding Foundation Models for this purpose.

How can I use videos as a knowledge base for an AI tutor?

With Video Understanding AI (e.g. Twelve Labs), a video is automatically analyzed and converted into a structured script with timestamps. This script is then imported as a PDF into an AI tutor's knowledge base (e.g. Alphabees). The AI tutor can then answer questions about the video content without learners having to watch the full video.

What is RAG and why is it relevant for AI tutors?

RAG (Retrieval-Augmented Generation) is a method in which relevant knowledge content is retrieved from a database and passed to a language model as context before it generates a response. For AI tutors, this means the model responds not from general training knowledge but precisely on the basis of the institution's own course materials, including imported video scripts.

What is the difference between Video Understanding AI and classic transcription?

Classic transcription converts only spoken audio into text. Video Understanding AI additionally analyzes the visual content itself: scenes, displayed objects, and visual information. For screencasts, presentations, and instructional videos with visual elements, this difference is decisive for the quality of the resulting knowledge base.

Can this workflow be used directly in Moodle?

Yes. The Alphabees AI tutor can be integrated directly into Moodle. Once a video has been imported as a script-based knowledge base, that knowledge is available in the Moodle course via the AI tutor as an interactive, queryable resource. Learners can ask questions about video content without having to consume the video linearly.

Which video formats are supported?

Twelve Labs supports common video formats including MP4, MOV, AVI, and more. Maximum video length and file size depend on the chosen pricing plan. For most explainer videos and recorded lectures, there are no practical limitations.

Want to turn your existing learning videos into an AI tutor knowledge base? Try the Alphabees AI tutor for free and experience how your video library becomes interactive.

For a direct video RAG integration without manual workflow steps, reach out to us — we're building this feature for interested partners.