Module 3 - Lesson 2: Media Prompts Across Providers
Comprehensive guide to working with images across OpenAI, Anthropic, and Google Gemini.
Published: 1/20/2026
Media Prompts: Multi-Provider Image Analysis
This lesson demonstrates how to work with images across all three AI providers. You'll see how each provider structures image input differently, while achieving the same goal: analyzing visual content with AI.
Lesson Structure
Each sub-lesson covers a specific provider's approach to image analysis. Follow them in order to understand the differences and similarities in multimodal AI development.
Media Prompt Overview
Common Elements Across All Providers
Despite different APIs, all three providers share core concepts:
- Image Encoding - Convert images to base64 format
- Multimodal Messages - Combine text and images in prompts
- System Instructions - Guide the AI's analysis approach
- Structured Responses - Extract text from AI responses
- Token Usage - Track API costs and limits
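In these examples, all three providers receive the image as base64 text plus a MIME type. The encoding step can be sketched as follows, assuming Node.js. The `encodeImage` and `loadImage` helpers and the extension-to-MIME table are illustrative, not taken from the lesson's source files, and each provider documents its own list of supported formats.

```typescript
import { readFileSync } from "node:fs";
import { extname } from "node:path";

// Illustrative extension-to-MIME table; check each provider's docs for the
// formats it actually accepts.
const MIME_BY_EXTENSION: Record<string, string> = {
  ".jpg": "image/jpeg",
  ".jpeg": "image/jpeg",
  ".png": "image/png",
  ".gif": "image/gif",
  ".webp": "image/webp",
};

interface EncodedImage {
  mimeType: string; // e.g. "image/jpeg"
  base64: string; // raw base64, without a "data:" prefix
}

// Encode raw bytes; the MIME type is inferred from the file name's extension.
function encodeImage(bytes: Uint8Array, fileName: string): EncodedImage {
  const ext = extname(fileName).toLowerCase();
  const mimeType = MIME_BY_EXTENSION[ext];
  if (!mimeType) {
    throw new Error(`Unsupported image extension: ${ext || "(none)"}`);
  }
  return { mimeType, base64: Buffer.from(bytes).toString("base64") };
}

// Read an image from disk and encode it, e.g. loadImage("src/assets/lizard.jpg").
function loadImage(path: string): EncodedImage {
  return encodeImage(readFileSync(path), path);
}
```

The resulting `{ mimeType, base64 }` pair carries everything each provider's image structure needs.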
Our Example Task
All three implementations analyze the same image with the same prompt:
System Instruction:
"You are a naturalist. Provide detailed information about the flora and fauna in the image."
User Prompt:
"Analyze the image and describe the species present, their behaviors, and any notable ecological interactions. The photo was taken in the Galápagos Islands."
Image: A lizard photo from src/assets/lizard.jpg
This consistency lets you directly compare implementation approaches.
Provider-Specific Implementations
Lesson 2a: OpenAI Media Prompts
Analyzing images with OpenAI's Vision API.
- Use `openai.responses.create()` with multimodal input
- Structure `input_text` and `input_image` items in the content array
- Handle the `output_text` response format
- Understand OpenAI's base64 data URL approach
- Key Feature: Direct `image_url` string format
- Model Used: `gpt-4o-mini`
- Code: `src/openai/media-prompt.ts`
API Structure Highlights:
```typescript
{ type: "input_image", image_url: `data:image/jpeg;base64,${base64Image}` }
```
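To see where that image item sits in a complete request, here is a sketch of a request body for the lesson's example task. The `buildOpenAIRequest` helper and the local types are illustrative assumptions. The official `openai` package ships its own, more complete request types, so you may need to adapt this shape to them.

```typescript
// Illustrative local types covering only the fields used here.
type OpenAIInputContent =
  | { type: "input_text"; text: string }
  | { type: "input_image"; image_url: string; detail: "low" | "high" | "auto" };

interface OpenAIResponsesRequest {
  model: string;
  input: Array<
    | { role: "system"; content: string }
    | { role: "user"; content: OpenAIInputContent[] }
  >;
}

// Build the body you would pass to openai.responses.create(...).
function buildOpenAIRequest(
  systemText: string,
  prompt: string,
  base64Image: string,
  mimeType = "image/jpeg",
): OpenAIResponsesRequest {
  return {
    model: "gpt-4o-mini",
    input: [
      // System guidance travels as a message in the input array
      { role: "system", content: systemText },
      {
        role: "user",
        content: [
          { type: "input_text", text: prompt },
          {
            type: "input_image",
            // MIME type and base64 payload are combined into one data URL
            image_url: `data:${mimeType};base64,${base64Image}`,
            detail: "auto",
          },
        ],
      },
    ],
  };
}
```

After the call, the answer text is available as `response.output_text`.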
Lesson 2b: Anthropic Media Prompts
Analyzing images with Anthropic's Claude Vision API.
- Use `anthropic.messages.create()` with image content blocks
- Structure a nested `source` object for images
- Specify explicit `media_type` and `data` properties
- Extract text from the `content` array in the response
- Key Feature: Structured `source` object with metadata
- Model Used: `claude-haiku-4-5`
- Code: `src/anthropic/media-prompt.ts`
API Structure Highlights:
```typescript
{ type: "image", source: { type: "base64", media_type: "image/jpeg", data: base64Image } }
```
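A sketch of the full Messages API request body for the same task follows. The `buildAnthropicRequest` helper, the local types, and the `max_tokens` value of 1024 are illustrative assumptions; the `@anthropic-ai/sdk` package provides its own types. Two differences from OpenAI stand out: the system prompt is a top-level `system` field, and `max_tokens` is required. Placing the image block before the text follows Anthropic's vision guidance.

```typescript
// Illustrative local types covering only the fields used here.
type AnthropicContentBlock =
  | { type: "text"; text: string }
  | {
      type: "image";
      source: { type: "base64"; media_type: string; data: string };
    };

interface AnthropicMessagesRequest {
  model: string;
  max_tokens: number; // the Messages API requires an output-token cap
  system: string; // system prompt is top-level, not a message
  messages: Array<{ role: "user"; content: AnthropicContentBlock[] }>;
}

// Build the body you would pass to anthropic.messages.create(...).
function buildAnthropicRequest(
  systemText: string,
  prompt: string,
  base64Image: string,
  mediaType = "image/jpeg",
): AnthropicMessagesRequest {
  return {
    model: "claude-haiku-4-5",
    max_tokens: 1024, // assumed budget; tune for your use case
    system: systemText,
    messages: [
      {
        role: "user",
        content: [
          // Image block first, then the text question
          {
            type: "image",
            source: { type: "base64", media_type: mediaType, data: base64Image },
          },
          { type: "text", text: prompt },
        ],
      },
    ],
  };
}
```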
Lesson 2c: Gemini Media Prompts
Analyzing images with Google Gemini's Vision API.
- Use `gemini.models.generateContent()` with a `parts` array
- Structure an `inlineData` object for images
- Use `systemInstruction` in `config` for role guidance
- Access the simple `.text` response property
- Key Feature: `inlineData` with `mimeType` in parts
- Model Used: `gemini-3-flash-preview`
- Code: `src/gemini/media-prompt.ts`
API Structure Highlights:
```typescript
{ inlineData: { mimeType: "image/jpeg", data: base64Image } }
```
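Here is the equivalent Gemini request as a sketch. The `buildGeminiRequest` helper and the local types are illustrative; the `@google/genai` package defines its own, broader parameter types. Images and text are sibling entries in `parts`, and the system guidance goes in `config`.

```typescript
// Illustrative local types covering only the fields used here.
type GeminiPart =
  | { text: string }
  | { inlineData: { mimeType: string; data: string } };

interface GeminiGenerateContentParams {
  model: string;
  contents: Array<{ role: "user"; parts: GeminiPart[] }>;
  config: { systemInstruction: string };
}

// Build the params you would pass to gemini.models.generateContent(...).
function buildGeminiRequest(
  systemText: string,
  prompt: string,
  base64Image: string,
  mimeType = "image/jpeg",
): GeminiGenerateContentParams {
  return {
    model: "gemini-3-flash-preview",
    contents: [
      {
        role: "user",
        parts: [
          // Image and text are separate parts of one user turn
          { inlineData: { mimeType, data: base64Image } },
          { text: prompt },
        ],
      },
    ],
    // System guidance lives in config, not in contents
    config: { systemInstruction: systemText },
  };
}
```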
Side-by-Side Comparison
Request Structure
| Provider | Method | Image Field | MIME Type |
|---|---|---|---|
| OpenAI | responses.create() | image_url (data URL string) | Embedded in the data URL prefix |
| Anthropic | messages.create() | source.data (inside the source object) | Separate source.media_type field |
| Gemini | generateContent() | inlineData.data (inside the inlineData object) | Separate inlineData.mimeType field |
Response Structure
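| Provider | Access Pattern | Notes |
|---|---|---|
| OpenAI | response.output_text | String (SDK convenience property that joins the text output) |
| Anthropic | response.content[0].text | content is an array of blocks; [0] assumes the first block is text |
| Gemini | response.text | String getter (can be undefined when no text is returned) |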
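Because the Anthropic pattern indexes into an array, a defensive version joins every text block instead of trusting `content[0]`. The sketch below normalizes all three shapes into one string. The `extractText` helper and the minimal response types are illustrative assumptions that model only the fields read here.

```typescript
// Minimal response shapes: only the fields this helper reads.
interface OpenAIResponseLike {
  output_text: string;
}
interface AnthropicResponseLike {
  content: Array<{ type: string; text?: string }>;
}
interface GeminiResponseLike {
  text?: string;
}

type ProviderResponse =
  | { provider: "openai"; response: OpenAIResponseLike }
  | { provider: "anthropic"; response: AnthropicResponseLike }
  | { provider: "gemini"; response: GeminiResponseLike };

function extractText(r: ProviderResponse): string {
  switch (r.provider) {
    case "openai":
      return r.response.output_text;
    case "anthropic":
      // Join every text block instead of assuming content[0] is text
      return r.response.content
        .filter((block) => block.type === "text")
        .map((block) => block.text ?? "")
        .join("");
    case "gemini":
      // .text may be undefined, so fall back to an empty string
      return r.response.text ?? "";
  }
}
```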
System Instructions
| Provider | Location | Parameter Name |
|---|---|---|
| OpenAI | In input array (or top-level) | role: "system" (or instructions) |
| Anthropic | Top-level | system |
| Gemini | In config | systemInstruction |
When to Use Each Provider
OpenAI Vision (GPT-4o-mini)
Best For:
- Quick image analysis
- Integration with existing OpenAI workflows
- When you need `detail` parameter control (`low`/`high`/`auto`)
Strengths:
- Simple, direct data URL format
- Fast response times
- Good balance of quality and cost
Anthropic Vision (Claude Haiku 4.5)
Best For:
- Detailed image analysis
- Safety-critical applications
- Long-form visual descriptions
Strengths:
- Excellent reasoning about images
- Strong safety and content filtering
- Detailed, nuanced responses
Gemini Vision (Gemini 3 Flash)
Best For:
- High-volume image processing
- Cost-conscious applications
- Google ecosystem integration
Strengths:
- Very fast processing
- Cost-effective
- Simple response structure
Best Practices for Image Prompts
1. Clear Instructions
```typescript
// Good - specific, actionable
"Identify the animal species and describe its distinctive features.";

// Bad - vague
"Tell me about this.";
```
2. Provide Context
```typescript
// Good - context helps
"Analyze this medical scan for abnormalities. This is a chest X-ray.";

// Bad - no context
"What do you see?";
```
3. Structure Your Request
```typescript
// Good - organized
"Please: 1) Identify all objects, 2) Describe spatial relationships, 3) Note any text";

// Bad - unstructured
"Tell me everything about this image";
```
4. Optimize Image Size
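```typescript
import sharp from "sharp";

// Good - reasonable size: fit within 1024x1024 (aspect ratio preserved),
// never upscale smaller images, and re-encode as JPEG at quality 80
async function optimizeImage(buffer: Buffer): Promise<Buffer> {
  return sharp(buffer)
    .resize(1024, 1024, { fit: "inside", withoutEnlargement: true })
    .jpeg({ quality: 80 })
    .toBuffer();
}

// Bad - unnecessarily large
// Sending 10MB images when 1MB would work
```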
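Size matters twice, because base64 encodes every 3 bytes as 4 characters: a 1,000,000-byte image becomes 1,333,336 characters of request text. A minimal pre-flight check, assuming you choose your own size budget, is sketched below. `base64Length` and `encodeWithinBudget` are hypothetical helpers, and `maxEncodedBytes` is not a documented provider limit, so check each provider's docs for the real ones.

```typescript
// Base64 emits 4 characters per 3 input bytes, padded to a multiple of 4.
function base64Length(byteLength: number): number {
  return 4 * Math.ceil(byteLength / 3);
}

// Hypothetical guard: reject images whose encoded form would exceed your
// own budget before making any API call, otherwise return the base64 text.
function encodeWithinBudget(bytes: Uint8Array, maxEncodedBytes: number): string {
  const encodedSize = base64Length(bytes.length);
  if (encodedSize > maxEncodedBytes) {
    throw new Error(
      `Encoded image is ${encodedSize} bytes; budget is ${maxEncodedBytes}. Resize it first.`,
    );
  }
  return Buffer.from(bytes).toString("base64");
}
```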
Common Patterns
Multi-Provider Fallback
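```typescript
// Per-provider wrappers around the calls from Lessons 2a to 2c
// (signatures assumed here for illustration).
declare function analyzeWithOpenAI(imageBase64: string, prompt: string): Promise<string>;
declare function analyzeWithAnthropic(imageBase64: string, prompt: string): Promise<string>;
declare function analyzeWithGemini(imageBase64: string, prompt: string): Promise<string>;

async function analyzeImage(imageBase64: string, prompt: string): Promise<string> {
  try {
    // Try the primary provider first
    return await analyzeWithOpenAI(imageBase64, prompt);
  } catch (openaiError) {
    console.warn("OpenAI failed, trying Anthropic...", openaiError);
    try {
      return await analyzeWithAnthropic(imageBase64, prompt);
    } catch (anthropicError) {
      console.warn("Anthropic failed, trying Gemini...", anthropicError);
      // Last resort: a Gemini error propagates to the caller
      return await analyzeWithGemini(imageBase64, prompt);
    }
  }
}
```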
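The nested try/catch grows one level per provider. One way to flatten it, sketched here as an assumption rather than the lesson's code, is to loop over an ordered list of analyzers and collect every failure in an `AggregateError` (Node.js 15 or later). `Analyzer` and `analyzeWithFallback` are hypothetical names.

```typescript
// Hypothetical: every provider wrapper shares this signature.
type Analyzer = (imageBase64: string, prompt: string) => Promise<string>;

// Try each named analyzer in order; return the first success, or throw an
// AggregateError that carries every provider's error.
async function analyzeWithFallback(
  analyzers: Array<[name: string, analyze: Analyzer]>,
  imageBase64: string,
  prompt: string,
): Promise<string> {
  const errors: unknown[] = [];
  for (const [name, analyze] of analyzers) {
    try {
      return await analyze(imageBase64, prompt);
    } catch (error) {
      console.warn(`${name} failed; trying the next provider...`);
      errors.push(error);
    }
  }
  throw new AggregateError(errors, "All providers failed");
}
```

Adding or reordering providers then means editing a list rather than nesting another `try`.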
Provider Selection Based on Task
```typescript
type Provider = "openai" | "anthropic" | "gemini";

function selectProvider(taskType: string): Provider {
  switch (taskType) {
    case "detailed-analysis":
      return "anthropic"; // Claude's reasoning
    case "quick-scan":
      return "openai"; // GPT-4o-mini speed
    case "high-volume":
      return "gemini"; // Cost efficiency
    default:
      return "openai";
  }
}
```
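Task-based selection and fallback can be combined: the provider chosen by `selectProvider()` goes first, and the others follow as fallbacks. The `providerOrder` helper below is a hypothetical sketch of that ordering step only.

```typescript
type Provider = "openai" | "anthropic" | "gemini";

const ALL_PROVIDERS: Provider[] = ["openai", "anthropic", "gemini"];

// Put the preferred provider first and keep the rest in their default
// order as fallbacks.
function providerOrder(preferred: Provider): Provider[] {
  return [preferred, ...ALL_PROVIDERS.filter((p) => p !== preferred)];
}
```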
Code Checklist
Before running the media prompt examples:
- API Keys Set: Check `.env` for `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, `GOOGLE_API_KEY`
- Image Available: Confirm `src/assets/lizard.jpg` exists
- Dependencies Installed: Run `pnpm install` in the project root
- Type Check Passes: Ensure there are no type errors with `tsc --noEmit`
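The first two checklist items can be automated with a small script. This sketch assumes `.env` has already been loaded into `process.env` (for example by the dotenv package); the `preflight` helper is hypothetical, and the file-existence check is injectable so the logic can be tested without touching disk.

```typescript
import { existsSync } from "node:fs";

const REQUIRED_ENV = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GOOGLE_API_KEY"];

// Return human-readable problems; an empty list means the examples can run.
function preflight(
  env: Record<string, string | undefined>,
  imagePath: string,
  fileExists: (path: string) => boolean = existsSync,
): string[] {
  const problems = REQUIRED_ENV.filter((key) => !env[key]).map((key) => `Missing ${key}`);
  if (!fileExists(imagePath)) {
    problems.push(`Image not found: ${imagePath}`);
  }
  return problems;
}
```

Calling `preflight(process.env, "src/assets/lizard.jpg")` before the scripts and printing any returned problems surfaces setup issues before any API call is made.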
Running the Examples
```bash
# OpenAI example
npx tsx src/openai/media-prompt.ts

# Anthropic example
npx tsx src/anthropic/media-prompt.ts

# Gemini example
npx tsx src/gemini/media-prompt.ts
```
Each script will:
- Load the lizard image
- Convert to base64
- Send multimodal prompt
- Display AI analysis
- Show token usage
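Each provider reports token usage under different field names: OpenAI's Responses API uses `response.usage` (`input_tokens`, `output_tokens`, `total_tokens`), Anthropic uses `message.usage` (`input_tokens`, `output_tokens`, with no total), and Gemini uses `response.usageMetadata` (`promptTokenCount`, `candidatesTokenCount`, `totalTokenCount`). A sketch that normalizes them into the `{ input, output, total }` shape printed by the scripts is shown below; the helper names and minimal types are illustrative.

```typescript
// Minimal usage shapes: only the fields these helpers read.
interface OpenAIUsage {
  input_tokens: number;
  output_tokens: number;
  total_tokens: number;
}
interface AnthropicUsage {
  input_tokens: number;
  output_tokens: number;
}
interface GeminiUsage {
  promptTokenCount?: number;
  candidatesTokenCount?: number;
  totalTokenCount?: number;
}

interface TokenUsage {
  input: number;
  output: number;
  total: number;
}

function fromOpenAI(u: OpenAIUsage): TokenUsage {
  return { input: u.input_tokens, output: u.output_tokens, total: u.total_tokens };
}

function fromAnthropic(u: AnthropicUsage): TokenUsage {
  // No total is reported, so compute it
  return { input: u.input_tokens, output: u.output_tokens, total: u.input_tokens + u.output_tokens };
}

function fromGemini(u: GeminiUsage): TokenUsage {
  const input = u.promptTokenCount ?? 0;
  const output = u.candidatesTokenCount ?? 0;
  // Prefer the reported total; it can include tokens not counted above
  return { input, output, total: u.totalTokenCount ?? input + output };
}
```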
Expected Output
All three providers should return a similar analysis, though exact wording and token counts vary by provider and run:
```text
Media Prompt Success!
AI Response: This image shows a marine iguana (Amblyrhynchus cristatus),
an endemic species to the Galápagos Islands. The iguana displays the
characteristic dark coloration and robust body adapted for marine life...
Tokens used: { input: 1245, output: 189, total: 1434 }
```
Key Takeaways
- Different Structures, Same Goal - Each provider has unique API design
- Base64 Encoding - The inline image format all three providers accept in these examples
- Cost vs Quality - Balance response quality with token costs
- Error Handling - Implement fallbacks for production reliability
- Provider Strengths - Match provider to task requirements
Navigation
- Previous: Lesson 1: Module 3 Overview
- Next: Lesson 2a: OpenAI Media Prompts
- Module Index: AI SDK Essentials
Ready to dive into the code? Start with OpenAI in Lesson 2a!