Huxiu (虎嗅) · 2026-03-28

General-purpose LLMs can't write good storyboard scripts — so one product manager fine‑tuned a model

Summary

It has been reported that a Chinese product manager writing on Huxiu tried GPT‑4, Anthropic’s Claude and Alibaba’s Qwen (通义千问) for months before deciding to fine‑tune a model for storyboard (分镜) scripts. The punchline is simple: large general‑purpose models are excellent generalists, but they struggle with the granular, shot‑by‑shot language required by film and image pipelines. So the author built a small, task‑specific dataset and trained on top of an existing base model to teach it “how to say” storyboard‑level descriptions.

Why general models fall short

A good storyboard is not just story; it is directing with words: when to use an establishing wide shot, when to cut to a close-up for emotion, and how to describe posture, lighting, costume and framing so that an image engine can render them. General LLMs tend to split the narrative evenly into mid-range shots, stay literary, and sometimes scramble required fields. The result is output that reads nicely as prose but breaks downstream image tools such as Stable Diffusion or Midjourney, whose prompts need explicit fields (shot number, shot size, scene description, camera angle, visual details, narration). Why not just keep prompting harder? Prompt engineering reportedly gets you most of the way but hits a ceiling; closing the remaining 20–30% requires adapting the model itself.
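The explicit fields listed above can be sketched as a structured record that both humans and image pipelines can consume. A minimal sketch in Python; the field names and the prompt-flattening format are illustrative assumptions, not the author's actual schema:

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class Shot:
    """One storyboard entry with the explicit fields downstream tools expect."""
    shot_number: int
    shot_size: str        # e.g. "wide", "medium", "close-up"
    scene_description: str
    camera_angle: str
    visual_details: str
    narration: str

    def to_prompt(self) -> str:
        # Flatten the structured fields into a single image-generation prompt.
        return (f"{self.shot_size} shot, {self.camera_angle}: "
                f"{self.scene_description}, {self.visual_details}")

shot = Shot(1, "wide", "dawn over a fishing village", "high angle",
            "mist on the water, warm rim light", "The village wakes slowly.")
print(json.dumps(asdict(shot), ensure_ascii=False))  # machine-readable record
print(shot.to_prompt())                              # prompt for an image engine
```

Keeping the fields structured, rather than buried in prose, is what lets a script be validated and fed to an image engine automatically.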

Fine‑tuning: practical trade‑offs and model choice

Fine‑tuning here means continuing training on a pre‑trained model with a few hundred to a few thousand curated examples, so the model internalizes the precise format and rhythm of professional storyboards. That differs from pre‑training (costly, requiring massive compute) and from prompt engineering (cheap but limited). Open‑source bases give control, lower iteration cost and full visibility into hyperparameters; closed APIs are easier to start with but can be expensive and opaque, and per‑token fees with limited transparency reportedly make them ill‑suited to repeated experiments. For Chinese‑language creative tasks, Chinese‑native models like Qwen (通义千问) have a head start; LLaMA's ecosystem is broad, but its English-heavy training means extra adaptation work.
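The curated examples described above would typically be packaged as supervised pairs before training. A minimal sketch that writes synopsis-to-storyboard pairs as instruction-style JSONL, a format many open-source fine-tuning stacks accept; the record keys and the sample pair are illustrative assumptions, not the author's actual data:

```python
import json

# Curated training pairs: story synopsis in, field-structured storyboard out.
# A real dataset would hold a few hundred to a few thousand such pairs.
examples = [
    {
        "synopsis": "A courier races across a rainy city at night.",
        "storyboard": (
            "Shot 1 | wide | rain-slicked avenue, neon reflections | high angle\n"
            "Shot 2 | close-up | courier's eyes under a dripping hood | eye level"
        ),
    },
]

def to_jsonl(pairs, path):
    """Write instruction-style records (key names here are illustrative)."""
    with open(path, "w", encoding="utf-8") as f:
        for p in pairs:
            record = {
                "instruction": "Write a shot-by-shot storyboard script "
                               "for this synopsis.",
                "input": p["synopsis"],
                "output": p["storyboard"],
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

to_jsonl(examples, "storyboard_sft.jsonl")
```

Holding the instruction text constant across records keeps the model's attention on the input/output mapping, which is where the storyboard "grammar" lives.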

Implications

This is a practical lesson for creatives and product teams in China and beyond: if you want AI to plug into a visual production pipeline, you may have to teach the model your professional grammar. Access to GPUs, open‑source tooling and local model ecosystems matters, and so does geopolitics, since export controls on advanced chips and cloud dependencies shape where and how teams can run fine‑tuning at scale. The author promises deeper dives into model selection, data construction and evaluation next, because in creative AI, the format is the product.
