A Telegram-triggered n8n workflow that turns a reference image plus a short caption into a branded, 9:16 AI-generated video — published straight to X (Twitter) via Buffer. Image generation by Google Gemini 2.5 Flash Image ("NanoBanana"), video generation by Google Veo 3.1, script writing by GPT-4.1-mini. End-to-end run time: under 5 minutes.
Three sequential steps, triggered by a single Telegram message carrying a photo + caption. The entire pipeline runs in one execution — user message in, video published out.
OpenAI Vision analyses the reference photo. A GPT-4.1-mini agent writes a UGC-style image prompt. Google Gemini 2.5 Flash Image ("NanoBanana") generates a fresh, casual-looking image. The result is posted back to you in Telegram so you can approve before the video kicks off.
A second AI agent drafts a structured video script (prompt, caption, title, hashtags). Google Veo 3.1 renders a 9:16, 8-second video using the NanoBanana image as its first frame. A polling loop keeps checking the long-running operation until the video is ready.
The finished video is hosted temporarily on Telegram to obtain a public URL, then published to X (Twitter) through Buffer's GraphQL API with the AI-written caption and hashtags. A final Telegram message confirms success with a link to the post in Buffer.
It's the fastest way for a non-technical marketer to trigger a complex pipeline. No dashboard, no login — just send a picture to the bot with a caption describing the video you want. The bot replies with the image, then a few minutes later with the finished video and a "published" confirmation.
Each run costs roughly $1.70 with veo-3.1-fast-generate-preview — the model this workflow ships with — versus roughly $3.30 with the standard Veo 3.1 model; video generation dominates the cost either way. Set a daily spend cap in Google AI Studio before giving the bot to students.
This is what the three flows look like after you import the JSON. The orange sticky on the left is the in-canvas README; the three grey bands group the nodes of each step.
Detailed setup for each credential is in the central Credentials reference; a quick summary follows below.
Create via @BotFather: /newbot, pick a name, copy the HTTP API token. Send the bot a /start message from your own account so it can chat with you.
At aistudio.google.com/apikey, create an API key with billing enabled. The same key covers Gemini 2.5 Flash Image (NanoBanana) and Veo 3.1. Veo is paid-tier only — confirm billing is active before running.
The same OpenAI key from Workflow 01 works. Used here for GPT-4o Vision (analysing the reference image) and GPT-4.1-mini (both agents).
Reuse the Buffer access token from Workflow 01. You'll also need your X channel ID (copy it from the channel's URL in Buffer Publish).
Must be publicly reachable so Telegram can POST to the trigger's webhook. n8n Cloud works out of the box; self-hosted needs a public URL (ngrok, Cloudflare Tunnel, or a deployed instance).
Unlike Workflow 01, this one does not touch YouTube, Sheets or Blogger. No Google Cloud OAuth client required.
Five steps from zero to a published video.
Open the Set: Bot Token (Placeholder) node. Replace the five REPLACE_WITH_* placeholders: Telegram bot token, Gemini API key, Buffer access token, Buffer channel ID, Buffer organization ID.
Each Telegram node and each OpenAI node wants a credential. Pick your Telegram API and OpenAI API credentials (create them from the panel if you haven't yet).
Click Active (top-right). The Telegram trigger is a webhook — it only fires while the workflow is active.
Open your bot in Telegram, send a reference image with a caption like "a woman running on a beach at sunset". Watch the canvas light up.
Avoid prompts that describe music, dialogue, voiceover, sfx, or sound — Veo's audio RAI filter is aggressive and will block the generation. The Prepare Veo Request Body code node already strips these words defensively, but keep your captions clean too.
Detailed walkthrough of every functional node in the canvas. Four sticky notes are documentation only and not covered here.
The workflow's entry point. Fires whenever your bot receives a message. The trigger exposes message.photo[] (array of progressively larger thumbnails), message.caption (text), and message.chat.id (used later to reply to the same chat).
The single source of truth for every ID and token used downstream: YOUR_BOT_TOKEN, gemini_api_key, buffer_access_token, buffer_channel_id_x, buffer_organization_id, plus CAPTION (auto-populated from the Telegram message). Edit this one node to retarget the workflow; every downstream reference updates automatically.
Telegram's photo array returns file_ids, not URLs. This node calls getFile with the largest thumbnail's file_id to get a file_path we can assemble into a downloadable URL.
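The getFile round-trip can be sketched as plain JavaScript (the same logic an n8n Code node would run). The response shape follows the Telegram Bot API: `getFile` returns `{ ok, result: { file_path } }`, and `message.photo[]` is sorted from smallest to largest thumbnail.

```javascript
// Build the public download URL from a Telegram getFile response.
// Assumes a valid bot token; the URL pattern is from the Bot API docs.
function buildDownloadUrl(botToken, getFileResponse) {
  const filePath = getFileResponse.result.file_path;
  return `https://api.telegram.org/file/bot${botToken}/${filePath}`;
}

// Pick the largest thumbnail: Telegram sorts message.photo[] smallest-first.
function largestPhotoFileId(photos) {
  return photos[photos.length - 1].file_id;
}
```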
Sends the resolved Telegram image URL to GPT-4o Vision with a prompt that asks for YAML output only, describing the subject (product or character), colour scheme, fonts or outfit, and a short visual description. YAML is easier for the next agent to consume than free-form prose.
An agent that combines the user's caption and the YAML image analysis into a single UGC-style image prompt (≤120 words). The system message enforces style rules: casual tone, handheld framing, preserve product text exactly, no copyrighted character names. Wired to two sub-nodes: LLM: OpenAI Chat (gpt-4.1-mini) and LLM: Structured Output Parser (JSON schema with a single image_prompt field).
Calls Google's Gemini 2.5 Flash Image model (informally nicknamed NanoBanana) with the UGC image prompt and responseModalities: ["IMAGE"]. Returns the generated image as base64 in candidates[0].content.parts[*].inlineData.data. Auth is via the x-goog-api-key header, reading the key from Vars.
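A minimal sketch of the request body this node sends. The exact placement of `responseModalities` (here under `generationConfig`) is an assumption based on the public generateContent shape — verify against the node's actual body after import.

```javascript
// Sketch of the Gemini generateContent body for image output.
// Field placement of responseModalities is an assumption; check the node.
function buildGeminiImageRequest(imagePrompt) {
  return {
    contents: [{ parts: [{ text: imagePrompt }] }],
    generationConfig: { responseModalities: ["IMAGE"] },
  };
}
```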
A short breather so Gemini's response is fully available to the next code node. Not strictly required on stable connections but cheap insurance.
Extracts the base64 image from Gemini's response and converts it into an n8n binary output field. Handles both inlineData and inline_data key names defensively. The binary is reused by the Telegram photo sender (A9) and by Veo as the reference image (B4).
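The defensive extraction described above can be sketched as follows — it walks the response parts and accepts either the camelCase or snake_case key:

```javascript
// Pull the base64 image out of a Gemini response, tolerating both
// inlineData (camelCase) and inline_data (snake_case) part keys.
function extractBase64Image(response) {
  const parts = response?.candidates?.[0]?.content?.parts ?? [];
  for (const part of parts) {
    const inline = part.inlineData ?? part.inline_data;
    if (inline?.data) return inline.data;
  }
  throw new Error("No image data found in Gemini response");
}
```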
Posts the generated image back to the original chat. Serves two purposes: it gives the user an approval moment, and it uploads the image to Telegram's CDN so we can refer to it by file ID later.
Same trick as A3, but for the just-sent NanoBanana photo. The resulting URL will be fed into Flow B if you need to reference the image by URL rather than base64.
Stores a large JSON schema (json_master) describing the full anatomy of a cinematic video prompt: description, style, camera, lighting, environment, elements, subject, motion, VFX, audio, ending, text, format, keywords. The downstream agent uses this as inspiration for what to write, even though the final output is a simpler 4-field JSON.
The main script-writing agent. Reads the original user caption and the YAML image analysis, and returns a strict JSON with prompt (100-150 word natural-language video description), caption (1-2 sentence social post), title (3-8 words), and hashtags (5-10 tags). The system message explicitly forbids mentioning music, SFX or dialogue to avoid triggering Veo's audio filter. Wired to three sub-nodes: OpenAI Chat Model (gpt-4.1-mini), Think (reasoning tool), and Structured Output Parser.
Normalises the agent's output into a predictable shape regardless of whether LangChain returns the new {output: {…}} format or the older raw OpenAI format. Also normalises hashtags: splits on whitespace/commas, removes duplicates, prepends # if missing, joins into hashtags_string.
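The hashtag normalisation step can be sketched like this (accepting either an array or a single string from the agent):

```javascript
// Normalise hashtags: split on whitespace/commas, prepend "#" where
// missing, dedupe, and join into a single hashtags_string.
function normalizeHashtags(raw) {
  const tags = (Array.isArray(raw) ? raw.join(" ") : String(raw))
    .split(/[\s,]+/)
    .filter(Boolean)
    .map((t) => (t.startsWith("#") ? t : "#" + t));
  return [...new Set(tags)].join(" ");
}
```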
Appends Veo-friendly technical cues to the raw prompt: "consistent character throughout, photorealistic quality, professional cinematography, 8 seconds duration, 9:16 aspect ratio, 24fps". Writes the result to veo_prompt while keeping all other fields via includeOtherFields.
Defensive cleanup — strips any remaining audio-related words (music, soundtrack, voiceover, dialogue, speaking, singing, sfx…) that could trigger Veo's RAI filter. Validates the prompt length (≥10 chars) and wraps the clean prompt in the Veo request schema (duration: 8, aspect_ratio: "9:16").
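A sketch of that cleanup, mirroring the Prepare Veo Request Body node. The word list here is illustrative — the actual node may cover more terms:

```javascript
// Scrub audio-related words that can trip Veo's RAI filter, then
// validate and wrap the prompt in the request schema.
const AUDIO_WORDS = /\b(music|soundtrack|voiceover|dialogue|speaking|singing|sfx)\b/gi;

function prepareVeoPrompt(rawPrompt) {
  const clean = rawPrompt.replace(AUDIO_WORDS, "").replace(/\s{2,}/g, " ").trim();
  if (clean.length < 10) throw new Error("Prompt too short after audio-word cleanup");
  return { prompt: clean, duration: 8, aspect_ratio: "9:16" };
}
```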
Kicks off Google Veo 3.1 Fast (image-to-video). Sends the clean prompt + the NanoBanana image (as base64) as the first-frame reference. Parameters: aspectRatio: "9:16", durationSeconds: 8, personGeneration: "allow_adult". Returns an operation name — the video is generated asynchronously and must be polled.
Gives Veo room to work between polls. Combined with the If-loop below, this produces a polling interval of ~15-20 seconds until the operation completes.
Polls the operation from B6 by name. Response contains done (boolean) and, when done, a generateVideoResponse block with generatedSamples[0].video.uri — the download URL for the finished clip.
Branches on done. True → Download Video (B10). False → back to Wait (B7). The loop typically runs 8-16 times for an 8-second clip.
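The Wait → Poll → If loop (B7–B9) amounts to this pattern, where `checkOperation` stands in for the HTTP node that GETs the operation by name (the retry cap is an illustrative safeguard, not part of the workflow):

```javascript
// Poll a long-running operation until done, sleeping between attempts.
async function pollUntilDone(checkOperation, intervalMs = 15000, maxAttempts = 40) {
  for (let i = 0; i < maxAttempts; i++) {
    const op = await checkOperation();
    if (op.done) return op;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error("Veo operation did not complete in time");
}
```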
Downloads the MP4. The URL is built on the fly by appending &key= + the Gemini API key to the URI from B8. A defensive IIFE also throws a human-readable error if Veo blocked the generation (raiMediaFilteredReasons).
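The two pieces of that node — appending the API key to the URI and surfacing a readable RAI error — can be sketched as:

```javascript
// Append the API key to Veo's download URI (handles either separator).
function buildVeoDownloadUrl(videoUri, geminiApiKey) {
  const sep = videoUri.includes("?") ? "&" : "?";
  return `${videoUri}${sep}key=${geminiApiKey}`;
}

// Throw a human-readable error when Veo blocked the generation.
// The response path mirrors the raiMediaFilteredReasons field named above.
function assertNotFiltered(operation) {
  const reasons = operation?.response?.generateVideoResponse?.raiMediaFilteredReasons;
  if (reasons?.length) throw new Error("Veo blocked the generation: " + reasons.join("; "));
}
```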
Posts the MP4 back to the same Telegram chat. As with the photo in Flow A, this also gives us a Telegram-hosted URL we'll need for Buffer.
Retrieves the file_path of the just-sent Telegram video so we can build a public URL: https://api.telegram.org/file/bot{TOKEN}/{file_path}. This URL is what Buffer will download the video from.
Calls Buffer's createPost GraphQL mutation with the X channel ID, the AI-written caption + hashtags (concatenated), mode: "shareNow", and the Telegram video URL in assets.videos[]. Returns post.id and status.
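A sketch of how that request body is assembled. The mutation signature and input field names here are assumptions — check them against Buffer's GraphQL schema; only the variables wiring (channel ID, caption + hashtags, shareNow mode, video asset URL) mirrors what the node does:

```javascript
// Assemble a createPost GraphQL body. Mutation/input field names are
// illustrative assumptions, not Buffer's confirmed schema.
function buildBufferPostBody(channelId, caption, hashtagsString, videoUrl) {
  return {
    query: `mutation CreatePost($input: CreatePostInput!) {
      createPost(input: $input) { post { id status } }
    }`,
    variables: {
      input: {
        channelId,
        text: `${caption}\n\n${hashtagsString}`,
        mode: "shareNow",
        assets: { videos: [{ url: videoUrl }] },
      },
    },
  };
}
```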
Final confirmation back to the originator's chat: title, caption, hashtags, Buffer post status, and a direct link to the post on Buffer's dashboard.
Each is a single-node change.
In the Veo Generation node, change veo-3.1-fast-generate-preview to veo-3.1-generate-preview. ~2x slower, ~2x the cost, noticeably better motion coherence for fast action.
Switch aspectRatio to 16:9 in both the Veo Generation node and the Optimize Prompt for Veo node (the prompt text mentions the aspect ratio explicitly to guide composition).
Duplicate the Buffer: Publish Video node, swap the channelId variable for each platform's channel ID, wire all copies after Get Video File URL.
Insert a Wait for Webhook + Telegram message after Send Video to Telegram asking the user to reply approve or reject. Only fire Buffer on approval.
Replace the NanoBanana HTTP Request with a call to OpenAI's GPT-image-1, Black Forest Labs FLUX, or Stability AI. The rest of the workflow is agnostic — Veo only needs a base64 image to condition on.
Veo 3.1 can generate audio natively. Relax the audio-stripping regex in Prepare Veo Request Body and add generateAudio: true to the request body's parameters. Expect ~1.5x render time.