How to Create a VTT File: Best VTT Creator, Generator, and Editor Tools (2026)

What Is a VTT Creator?

A VTT creator is any tool or method that produces a WebVTT (.vtt) caption file — the timed text format used by HTML5 video, YouTube, Vimeo, and most web-based video players. The right approach depends entirely on what you are starting from:

Audio or video with speech → AI transcription tool (fastest, most accurate)
A plain text script you want to time → Online VTT generator
An existing SRT subtitle file → SRT-to-VTT converter or a two-minute text editor fix
A VTT file that needs timing or text edits → VTT editor software
Automated batch processing → Command-line tools (FFmpeg, Whisper)

This guide covers all five approaches with enough detail to help you pick the right VTT creator for your specific situation.

Quick Comparison: VTT Creation Methods

Method	Best For	Speed	Technical Skill
AI transcription tool	Audio/video with speech	Under 1 minute	None
Online VTT generator	Plain text → timed VTT	5–30 minutes	Low
SRT-to-VTT converter	Existing SRT files	Under 2 minutes	Low
VTT editor software	Fine-tuning timing and text	Slow (manual)	Medium
Command-line (FFmpeg)	Batch/automated pipelines	Fast (per file)	High
Manual text editor	Very short clips or corrections	Very slow	Low

Method 1: AI Transcription Tool — The Fastest VTT Creator for Audio and Video

If your starting point is an audio or video file containing speech, an AI transcription tool is by far the fastest way to create a VTT file. It eliminates every manual step: watching and re-watching the clip, typing out what was said, calculating when each caption should appear, and formatting the timestamps correctly.

Captain Transcribe accepts MP3, AAC, FLAC, WAV, M4A, MP4, MOV, MKV, and all other common audio and video formats. The complete workflow takes under a minute:

Upload your file — Drag your audio or video file onto the upload area. If you have a video, upload it directly without extracting the audio first — the service handles that automatically.
Select the spoken language — Choose from 29+ supported languages. This is the most important setting for accuracy: the wrong language produces garbled output regardless of audio quality.
Choose a caption style — Three options control how the VTT captions are segmented:
- Standard — Full-sentence cues. Ideal for YouTube, Vimeo, e-learning platforms, and traditional web video.
- Short — Two to four words per cue. Designed for TikTok, Instagram Reels, and vertical video formats.
- Karaoke — Word-level timing that highlights each word as it is spoken. Used for music videos or lyric-style content.
Download the VTT file — Click the VTT download button. The file includes the mandatory WEBVTT header, period-separated millisecond timestamps, and UTF-8 encoding — all requirements that browsers enforce before displaying captions.

From the same transcription job you can also download an SRT file or plain text transcript without processing the audio again.

For a detailed walkthrough of this process for each audio format — including tips for FLAC, AAC, MP4, and low-bitrate MP3 — see our guide on converting audio and video to VTT.
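
To make the difference between caption styles concrete, the sketch below groups word-level timings into Short-style cues of a few words each. It is an illustration only, not Captain Transcribe's implementation: the (start_ms, end_ms, text) word tuples, the short_cues function, and the max_words parameter are assumed names.

```python
from typing import List, Tuple

# Hypothetical input shape: one (start_ms, end_ms, text) tuple per spoken word.
Word = Tuple[int, int, str]


def to_timestamp(ms: int) -> str:
    """Format integer milliseconds as a WebVTT timestamp (HH:MM:SS.mmm)."""
    hours, rest = divmod(ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{millis:03d}"


def short_cues(words: List[Word], max_words: int = 3) -> str:
    """Group consecutive words into cues of at most max_words words."""
    blocks = ["WEBVTT"]
    for i in range(0, len(words), max_words):
        group = words[i:i + max_words]
        start, end = group[0][0], group[-1][1]  # first word start, last word end
        text = " ".join(word[2] for word in group)
        blocks.append(f"{to_timestamp(start)} --> {to_timestamp(end)}\n{text}")
    # Cues are separated by one blank line; the file ends with a newline.
    return "\n\n".join(blocks) + "\n"
```

Setting max_words to 1 gives one cue per word, which is a rough stand-in for word-level timing; a larger value moves toward sentence-length cues.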

Method 2: Online VTT Generator — Create VTT Captions from a Text Script

If you already have a written script or transcript and need to turn it into a timed VTT caption file, an online VTT generator is the right tool. These tools are distinct from transcription services: they do not process audio at all. Instead, they help you assign timestamps to existing text.

How online VTT generators typically work:

You paste or type your caption text into a web interface, usually one caption cue per line or per block.
You either manually type in the timestamps for each cue, or you use a media player integrated into the tool to set timing by pressing a key as you listen along.
The tool formats the entries into a valid VTT file and lets you download it.
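
The tap-to-time workflow above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not how any particular generator works: vtt_from_taps, gap_ms, and last_cue_ms are hypothetical names, and each tap is taken as the start of a cue, with the cue ending shortly before the next tap.

```python
from typing import List


def to_timestamp(ms: int) -> str:
    """Format integer milliseconds as a WebVTT timestamp (HH:MM:SS.mmm)."""
    hours, rest = divmod(ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{millis:03d}"


def vtt_from_taps(lines: List[str], tap_ms: List[int],
                  gap_ms: int = 100, last_cue_ms: int = 3000) -> str:
    """Build a VTT file from caption lines and one tap time (in ms) per line."""
    if len(lines) != len(tap_ms):
        raise ValueError("need exactly one tap time per caption line")
    if any(later <= earlier for earlier, later in zip(tap_ms, tap_ms[1:])):
        raise ValueError("tap times must be strictly increasing")

    blocks = ["WEBVTT"]
    for i, (text, start) in enumerate(zip(lines, tap_ms)):
        if i + 1 < len(tap_ms):
            # End just before the next cue starts, but never before this one starts.
            end = max(start + 1, tap_ms[i + 1] - gap_ms)
        else:
            # The last cue has no following tap, so give it a fixed duration.
            end = start + last_cue_ms
        blocks.append(f"{to_timestamp(start)} --> {to_timestamp(end)}\n{text.strip()}")
    return "\n\n".join(blocks) + "\n"
```

Working in integer milliseconds avoids floating-point rounding surprises when the timestamps are formatted.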

This approach is well suited for:

Scripted videos where you wrote the content before recording — the timing work is still manual, but you skip the transcription step
Translating existing VTT captions into another language — paste the translated text, re-time as needed, export new VTT
Simple explainer animations or screen recordings where there are only a handful of caption cues

For content longer than two or three minutes, timing a script by hand is usually slower than running an AI transcription tool, even though the text already exists, because setting timestamps for every cue is tedious and error-prone. Transcribing the finished recording, even when you scripted it, usually gives faster and more accurate timing.

Method 3: Convert SRT to VTT — The Two-Minute Approach

If you already have an SRT subtitle file, converting it to VTT takes about two minutes in any plain text editor. The two formats share the same logical structure (cue numbers, timestamps, and caption text), so a basic conversion needs only two syntactic changes followed by a careful save:

Add the WEBVTT header: Open your SRT file in a text editor (Notepad, VS Code, TextEdit in plain text mode). Insert WEBVTT as the very first line, then add one blank line below it. This header is mandatory: without it, browsers silently ignore the entire file.
Replace commas with periods in timestamps: SRT timestamps use commas as the millisecond separator (00:00:01,500); VTT requires periods (00:00:01.500). Use Find & Replace (Ctrl+H / Cmd+H) to swap every comma in a timestamp for a period, and scope the replacement to timestamp lines so that commas in caption text are left alone. In an editor with regular-expression search, such as VS Code, searching for (\d{2}:\d{2}:\d{2}),(\d{3}) and replacing with $1.$2 does this safely.
Save as UTF-8 with a .vtt extension: Choose UTF-8 encoding explicitly when saving. On Windows Notepad, select "All Files" as the file type and type the filename ending in .vtt.

That is the complete conversion. The cue numbers that SRT requires are optional in VTT, so you can leave them in place — they become valid VTT cue identifiers. For a deeper look at why these two formats differ and when to use each one, see our full SRT vs VTT comparison.
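
The same two changes can be scripted. The sketch below is one minimal way to do it in Python, assuming a well-formed SRT input; srt_to_vtt is a hypothetical helper name, and it deliberately ignores SRT extras such as font tags.

```python
import re

# Matches an SRT timing line such as "00:00:01,500 --> 00:00:03,000".
# Group 5 keeps anything that follows the end timestamp on the same line.
SRT_TIMING = re.compile(
    r"^(\d{2,}:\d{2}:\d{2}),(\d{3}) --> (\d{2,}:\d{2}:\d{2}),(\d{3})(.*)$"
)


def srt_to_vtt(srt_text: str) -> str:
    """Convert SRT text to WebVTT text: add the header, fix the separators."""
    # Drop a leading byte-order mark and normalize Windows line endings.
    text = srt_text.lstrip("\ufeff").replace("\r\n", "\n")
    converted = []
    for line in text.split("\n"):
        match = SRT_TIMING.match(line)
        if match:
            # Only timing lines are rewritten; commas in caption text survive.
            line = f"{match[1]}.{match[2]} --> {match[3]}.{match[4]}{match[5]}"
        converted.append(line)
    body = "\n".join(converted).strip("\n")
    return "WEBVTT\n\n" + body + "\n"
```

A typical use is to read the .srt file with encoding="utf-8", pass the text through srt_to_vtt, and write the result to a .vtt file with the same encoding. The SRT cue numbers are left in place, where they act as VTT cue identifiers.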

If you need to convert many SRT files to VTT in bulk, the command-line approach described below handles this more efficiently than a text editor.

Method 4: VTT Editor Software — For Visual Timing Adjustments

A VTT editor is a desktop application that displays your subtitle file alongside the video on a visual timeline, letting you drag and resize cue boundaries rather than typing timestamp values. These tools are most useful when:

An AI-generated VTT file has timing issues that need visual correction (a caption appears half a second early, or text overlaps a key scene)
You are translating captions and the translated text has different timing needs than the original
You are working on content where precise frame-accurate timing matters — live events, hearing accessibility captions, broadcast media

The three most widely used free VTT editors are:

Subtitle Edit (Windows, free, open-source)

Subtitle Edit is the most feature-complete free subtitle editor available. It opens and exports VTT, SRT, and dozens of other formats. Its waveform display lets you visually align cue boundaries to the audio waveform rather than guessing by eye. Key features: OCR import, spell check, frame rate conversion, and auto-synchronization using speech recognition. Available at subtitleedit.net. If you edit VTT files regularly on Windows, Subtitle Edit is the tool to start with.

Aegisub (Windows, Mac, Linux, free, open-source)

Aegisub was originally built for Advanced SubStation Alpha (ASS) subtitles, a format common in anime fansubs. It has a powerful waveform editor, karaoke timing mode, and excellent visual positioning tools. Check that WebVTT appears among the import and export formats in your build; if it does not, time the captions in Aegisub, export SRT, and convert the result to VTT. Aegisub is heavier than Subtitle Edit for basic VTT tasks, but its karaoke timing feature makes it a common choice for word-by-word lyric-style captions. Available at aegisub.org.

Jubler (Windows, Mac, Linux, free, open-source)

Jubler is a cross-platform subtitle editor with a simpler interface than Aegisub. It handles VTT import/export, waveform display, and spell checking. If you need a lightweight VTT editor that runs on Mac or Linux without the complexity of Aegisub, Jubler is a solid choice. Available at jubler.org.

Browser-Based VTT Editors

Several online tools let you paste a VTT file, edit cue text and timestamps in a form interface, and re-download the file — no desktop software required. These are useful for quick text corrections or minor timing tweaks on a file you already have. Search for "online VTT editor" to find current options; the quality and features vary widely, so test a small file before relying on any for production work.

Method 5: Command-Line VTT Creation — For Developers and Batch Workflows

If you are processing many files automatically or need to generate VTT as part of a larger pipeline, command-line tools are more practical than any GUI application.

FFmpeg: Convert Embedded Subtitle Tracks to VTT

FFmpeg is the standard open-source media processing tool. If a video file (MKV, MP4, or similar) contains an embedded subtitle track, FFmpeg can extract it as a VTT file in one command:

ffmpeg -i input.mkv -map 0:s:0 output.vtt

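This extracts the first subtitle stream (s:0) from input.mkv and writes it as output.vtt. If the embedded track is SRT, FFmpeg converts it to VTT format automatically; image-based tracks such as PGS or DVD subtitles cannot be converted to text this way. To list the streams in a file before extracting, run ffprobe input.mkv (ffmpeg -i input.mkv prints the same stream list, followed by a harmless error about the missing output file).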
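
When a file has several subtitle tracks, it helps to see which s:N index belongs to which track before extracting. The sketch below wraps ffprobe's JSON output for that purpose. The function names and the returned dictionary keys are assumptions made for illustration, and the bitmap codec list covers only a few common cases.

```python
import json
import subprocess
from typing import Any, Dict, List

# Image-based subtitle codecs that cannot be written out as text VTT.
BITMAP_CODECS = {"hdmv_pgs_subtitle", "dvd_subtitle", "dvb_subtitle"}


def ffprobe_command(path: str) -> List[str]:
    """Build an ffprobe call that reports only subtitle streams, as JSON."""
    return [
        "ffprobe", "-v", "error",
        "-select_streams", "s",
        "-show_entries", "stream=index,codec_name:stream_tags=language",
        "-of", "json", path,
    ]


def parse_subtitle_streams(ffprobe_json: str) -> List[Dict[str, Any]]:
    """Turn ffprobe JSON into one summary dictionary per subtitle stream."""
    streams = json.loads(ffprobe_json).get("streams", [])
    return [
        {
            # Position among subtitle streams is the N in "-map 0:s:N".
            "map": f"0:s:{position}",
            "codec": stream.get("codec_name", "unknown"),
            "language": stream.get("tags", {}).get("language", "und"),
            # Unknown codecs are treated as text here; verify before relying on it.
            "text_based": stream.get("codec_name") not in BITMAP_CODECS,
        }
        for position, stream in enumerate(streams)
    ]


def list_subtitle_streams(path: str) -> List[Dict[str, Any]]:
    """Run ffprobe on a media file (ffprobe must be installed and on PATH)."""
    result = subprocess.run(
        ffprobe_command(path), capture_output=True, text=True, check=True
    )
    return parse_subtitle_streams(result.stdout)
```

Calling list_subtitle_streams("input.mkv") returns a list whose "map" values can be passed straight to the -map option of the ffmpeg command above.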

FFmpeg can also convert standalone text-based subtitle files to VTT in bulk with a shell loop:

for f in *.srt; do ffmpeg -i "$f" "${f%.srt}.vtt"; done

This converts every .srt file in the current directory to a matching .vtt file. FFmpeg handles the WEBVTT header and timestamp conversion automatically.
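
For a pipeline that runs repeatedly, it can be useful to skip files that were already converted. The sketch below is one way to do that from Python. plan_conversions and convert_all are hypothetical helper names; the script assumes ffmpeg is installed and on PATH.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List, Tuple


def plan_conversions(folder: str) -> List[Tuple[Path, Path]]:
    """Pair each .srt file with its .vtt target, skipping existing targets."""
    pairs = []
    for source in sorted(Path(folder).glob("*.srt")):
        target = source.with_suffix(".vtt")
        if not target.exists():
            pairs.append((source, target))
    return pairs


def convert_all(folder: str) -> int:
    """Convert every pending .srt file with ffmpeg; return how many were run."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg not found on PATH")
    pairs = plan_conversions(folder)
    for source, target in pairs:
        # check=True stops the batch on the first file ffmpeg cannot convert.
        subprocess.run(
            ["ffmpeg", "-loglevel", "error", "-i", str(source), str(target)],
            check=True,
        )
    return len(pairs)
```

Separating the planning step from the ffmpeg call makes it easy to print or log the pending work before anything is written.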

OpenAI Whisper: Generate VTT Directly from Audio

If you are self-hosting OpenAI's Whisper model for batch transcription, it natively outputs VTT with the --output_format vtt flag:

whisper audio.mp3 --language en --output_format vtt --output_dir ./output

With --output_format vtt, Whisper writes only a .vtt file to the specified directory; leave the flag at its default (all) to get the .txt, .srt, and other output formats alongside it. Whisper is free and has no usage limits, but it needs GPU hardware for practical batch processing speeds. For teams with existing GPU infrastructure, this is the zero-cost path to VTT generation at scale.

How to Write a VTT File Manually in a Text Editor

For very short clips — a 30-second intro, a handful of title cards, a brief instructional video — you can write a VTT file from scratch in any plain text editor. The structure is simple:

WEBVTT

1
00:00:00.500 --> 00:00:03.200
Welcome to this tutorial.

2
00:00:03.600 --> 00:00:07.100
In the next five minutes, we will cover
everything you need to get started.

3
00:00:07.500 --> 00:00:11.800
Let us begin with the basics.

Three rules to follow every time:

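First line must start with WEBVTT: no leading space, no alternative spelling. The WebVTT specification permits a byte-order mark (BOM) before the header, but saving without one avoids problems with stricter tools. This header tells the browser it is reading a valid WebVTT file.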
Timestamps use periods, not commas — 00:00:01.500 is correct; 00:00:01,500 is SRT syntax and will break a VTT file.
Blank line between every cue — Omitting the blank line causes the parser to merge cues or skip them.

Save the file with UTF-8 encoding and a .vtt extension. On Windows Notepad, select "All Files" in the Save dialog and type the filename ending in .vtt. On Mac TextEdit, switch to plain text mode first (Format → Make Plain Text) before saving.

Manual creation is practical for up to about 10 cues. Beyond that, the time spent watching, pausing, typing, and correcting exceeds the time an AI tool would take by a large margin.
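
The three rules above can be checked mechanically before you upload a hand-written file. The sketch below is a rough lint, not a full WebVTT parser: check_vtt and its messages are hypothetical, and it only looks at the header, the timing lines, and the blank line before each cue.

```python
import re
from typing import List

TIMESTAMP = r"(?:\d{2,}:)?[0-5]\d:[0-5]\d\.\d{3}"  # hours are optional in VTT
TIMING_LINE = re.compile(rf"^({TIMESTAMP}) --> ({TIMESTAMP})(?:[ \t].*)?$")
SRT_STYLE = re.compile(r"\d{2}:\d{2},\d{3}")  # comma before the milliseconds


def check_vtt(text: str) -> List[str]:
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    lines = text.lstrip("\ufeff").replace("\r\n", "\n").split("\n")

    first = lines[0]
    if not (first == "WEBVTT" or first.startswith(("WEBVTT ", "WEBVTT\t"))):
        problems.append("line 1: file must start with WEBVTT")

    for i, line in enumerate(lines):
        if "-->" not in line:
            continue  # only timing lines are inspected
        if SRT_STYLE.search(line):
            problems.append(f"line {i + 1}: comma in timestamp, use a period")
        elif not TIMING_LINE.match(line):
            problems.append(f"line {i + 1}: malformed timing line")
        # A timing line follows either a blank line, or a cue identifier
        # that itself follows a blank line.
        previous = lines[i - 1] if i >= 1 else ""
        before_previous = lines[i - 2] if i >= 2 else "<start of file>"
        if previous and before_previous:
            problems.append(f"line {i + 1}: missing blank line before this cue")
    return problems
```

Running check_vtt on the example file above returns an empty list. A file that passes this check can still fail in a player for other reasons, such as encoding or server configuration.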

VTT Captions: When to Use VTT Instead of SRT

The choice between VTT captions and SRT captions comes down to where the video will be displayed:

Use VTT for: HTML5 video on web pages (the <track> element requires VTT), Vimeo uploads (VTT is Vimeo's preferred format), web video players like Video.js and Plyr, and e-learning platforms (Moodle, Canvas, Coursera).
Use SRT for: TikTok, CapCut, Premiere Pro, DaVinci Resolve, and any desktop video editor — SRT has wider compatibility with editing software. Also use SRT when you are unsure which format a platform accepts, since SRT support is nearly universal.
Generate both: Tools like Captain Transcribe export SRT and VTT from the same transcription job. Keeping both files from a single processing run costs nothing and gives you the right format for every platform without re-transcribing.

VTT's advantage over SRT in web contexts is not just compatibility — it also supports CSS styling via the ::cue selector, cue positioning settings (line, position, align), inline voice tags (<v Speaker Name>), and file-level metadata. For accessibility-focused captions on a public-facing website, these features matter.
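
As an illustration of the voice tag and cue settings mentioned above, the sketch below formats a single cue that uses both. format_cue is a hypothetical helper and does no validation; it simply shows where each piece goes in the cue.

```python
from typing import Dict, Optional


def format_cue(start: str, end: str, text: str,
               voice: Optional[str] = None,
               settings: Optional[Dict[str, str]] = None) -> str:
    """Format one WebVTT cue with an optional voice tag and cue settings."""
    timing = f"{start} --> {end}"
    if settings:
        # Cue settings follow the end timestamp as space-separated name:value pairs.
        timing += " " + " ".join(f"{name}:{value}" for name, value in settings.items())
    # A voice tag wraps the caption text and names the speaker.
    body = f"<v {voice}>{text}</v>" if voice else text
    return f"{timing}\n{body}"


cue = format_cue(
    "00:00:01.000", "00:00:04.000", "Welcome to this tutorial.",
    voice="Narrator",
    settings={"line": "10%", "align": "center"},
)
print(cue)
```

On a page that loads this cue through a track element, a stylesheet rule such as ::cue(v[voice="Narrator"]) can then target that speaker's captions, although how much styling is honored varies between browsers.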

Common VTT Creation Mistakes

Mistake	What Happens	Fix
Missing WEBVTT header	Browser silently ignores the file — no captions appear	Make WEBVTT the very first line; no leading space or BOM
Comma in timestamps (00:01,500)	Invalid timestamps — cues not parsed	Replace with periods (00:01.500); use Find & Replace
Saved with wrong encoding	Accented characters display as garbage	Always save as UTF-8; check encoding in Save As dialog
No blank line between cues	Parser merges adjacent cues or skips them	Add at least one blank line after each cue block
Wrong MIME type on server	Some browsers and players refuse to load the track file	Configure server: Content-Type: text/vtt for .vtt files
Saving as .txt instead of .vtt	Video player cannot recognize the file type	Rename with .vtt extension; on Windows, disable "Hide extensions"
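
For local testing, the MIME type fix from the table can be reproduced with Python's built-in web server. This is a development sketch only, not a production configuration: VTTAwareHandler and serve are hypothetical names, and a real deployment would set the type in the web server or CDN instead.

```python
from functools import partial
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


class VTTAwareHandler(SimpleHTTPRequestHandler):
    """Static file handler that always serves .vtt files as text/vtt."""

    # Extend the inherited extension map rather than replacing it.
    extensions_map = {
        **SimpleHTTPRequestHandler.extensions_map,
        ".vtt": "text/vtt",
    }


def serve(directory: str = ".", port: int = 8000) -> None:
    """Serve a folder on localhost until interrupted (Ctrl+C)."""
    handler = partial(VTTAwareHandler, directory=directory)
    with ThreadingHTTPServer(("127.0.0.1", port), handler) as httpd:
        httpd.serve_forever()
```

Calling serve("public") and opening http://127.0.0.1:8000 lets you confirm in the browser's network panel that the caption file arrives with Content-Type: text/vtt.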

Key Takeaways

Match the VTT creation method to your starting material: AI transcription for audio/video, SRT converter for existing subtitles, VTT editor software for timing adjustments, FFmpeg for batch/automated workflows.
AI transcription is almost always faster than manual creation — even for short clips, the time saved on timing and formatting outweighs the upload step.
WEBVTT header on line 1 is non-negotiable — without it, browsers silently reject the file with no error message.
Periods in timestamps, never commas: this syntax difference from SRT is the most common VTT creation error.
Generate both VTT and SRT from one transcription — Captain Transcribe exports both formats from a single job, eliminating the need to convert between them later.
Use VTT for web video; SRT for editing software — VTT is native to HTML5 and web players; SRT has broader support in desktop video editors.

This article was drafted with AI assistance and reviewed by The Captain before publication.