Module 3 - Lesson 2: Media Prompts Across Providers

Comprehensive guide to working with images across OpenAI, Anthropic, and Google Gemini.

Published: 1/20/2026

Media Prompts: Multi-Provider Image Analysis

This lesson demonstrates how to work with images across all three AI providers. You'll see how each provider structures image input differently, while achieving the same goal: analyzing visual content with AI.

Lesson Structure

Each sub-lesson covers a specific provider's approach to image analysis. Follow them in order to understand the differences and similarities in multimodal AI development.


📸 Media Prompt Overview

Common Elements Across All Providers

Despite different APIs, all three providers share core concepts:

  1. Image Encoding - Convert images to base64 format
  2. Multimodal Messages - Combine text and images in prompts
  3. System Instructions - Guide the AI's analysis approach
  4. Structured Responses - Extract text from AI responses
  5. Token Usage - Track API costs and limits
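The first concept, image encoding, can be sketched without any SDK. The helper names below (encodeBase64, toDataUrl) are hypothetical and not taken from the lesson code. The sketch assumes the btoa global, which browsers and recent Node.js versions provide; in Node you could equally read the file into a Buffer and call toString("base64").

```typescript
// Hypothetical helpers; the names are illustrative, not from the lesson code.

/** Encode raw image bytes as a base64 string. */
function encodeBase64(bytes: Uint8Array): string {
  let binary = "";
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return btoa(binary);
}

/** Wrap base64 data in a data URL (the form used for OpenAI's image_url). */
function toDataUrl(base64: string, mimeType: string = "image/jpeg"): string {
  return `data:${mimeType};base64,${base64}`;
}

// The first three bytes of every JPEG file: FF D8 FF
const sample = new Uint8Array([0xff, 0xd8, 0xff]);
const sampleBase64 = encodeBase64(sample);
console.log(toDataUrl(sampleBase64)); // data:image/jpeg;base64,/9j/
```

Anthropic and Gemini take the bare base64 string, so only the OpenAI example needs the data URL wrapper.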

Our Example Task

All three implementations analyze the same image with the same prompt:

System Instruction:

"You are a naturalist. Provide detailed information about the flora and fauna in the image."

User Prompt:

"Analyze the image and describe the species present, their behaviors, and any notable ecological interactions. The photo was taken in the Galápagos Islands."

Image: A lizard photo from src/assets/lizard.jpg

This consistency lets you directly compare implementation approaches.


📚 Provider-Specific Implementations

Lesson 2a: OpenAI Media Prompts

Analyzing images with OpenAI's vision-capable models through the Responses API.

  • Use openai.responses.create() with multimodal input
  • Structure input_text and input_image in content array
  • Handle output_text response format
  • Understand OpenAI's base64 data URL approach
  • Key Feature: Direct image_url string format
  • Model Used: gpt-4o-mini
  • Code: src/openai/media-prompt.ts

API Structure Highlights:

{
  type: "input_image",
  image_url: `data:image/jpeg;base64,${base64Image}`
}
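To show where that image part sits in a full request, here is a minimal sketch that only builds the request body and never calls the API. The helper name buildOpenAIRequest and the local types are assumptions made for illustration; the field names follow the structure described in this lesson, with the detail field set to "auto".

```typescript
// Sketch only: builds a request body, does not call the API.
// buildOpenAIRequest and these types are hypothetical, for illustration.
type OpenAIContentPart =
  | { type: "input_text"; text: string }
  | { type: "input_image"; image_url: string; detail: "low" | "high" | "auto" };

interface OpenAIRequest {
  model: string;
  input: Array<
    | { role: "system"; content: string }
    | { role: "user"; content: OpenAIContentPart[] }
  >;
}

function buildOpenAIRequest(
  base64Image: string,
  systemInstruction: string,
  prompt: string,
): OpenAIRequest {
  return {
    model: "gpt-4o-mini",
    input: [
      // The system instruction travels inside the input array
      { role: "system", content: systemInstruction },
      {
        role: "user",
        content: [
          { type: "input_text", text: prompt },
          {
            type: "input_image",
            // MIME type and data are combined into one data URL string
            image_url: `data:image/jpeg;base64,${base64Image}`,
            detail: "auto",
          },
        ],
      },
    ],
  };
}

// "/9j/4AAQ" is a truncated placeholder, not a real image
const openaiRequest = buildOpenAIRequest(
  "/9j/4AAQ",
  "You are a naturalist.",
  "Describe the species present.",
);
console.log(JSON.stringify(openaiRequest, null, 2));
```

With the official SDK, an object of this shape would be what you pass to openai.responses.create().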

Lesson 2b: Anthropic Media Prompts

Analyzing images with Anthropic's Claude models through the Messages API.

  • Use anthropic.messages.create() with image content blocks
  • Structure nested source object for images
  • Specify explicit media_type and data properties
  • Extract text from content array response
  • Key Feature: Structured source object with metadata
  • Model Used: claude-haiku-4-5
  • Code: src/anthropic/media-prompt.ts

API Structure Highlights:

{
  type: "image",
  source: {
    type: "base64",
    media_type: "image/jpeg",
    data: base64Image
  }
}
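Here is how that image block might fit into a complete Messages request. This is a sketch that only assembles the request body; buildAnthropicRequest and the local types are hypothetical names, and the max_tokens value is an arbitrary choice.

```typescript
// Sketch only: builds a request body, does not call the API.
// buildAnthropicRequest and these types are hypothetical, for illustration.
type AnthropicBlock =
  | { type: "text"; text: string }
  | {
      type: "image";
      source: {
        type: "base64";
        media_type: "image/jpeg" | "image/png" | "image/gif" | "image/webp";
        data: string;
      };
    };

interface AnthropicRequest {
  model: string;
  max_tokens: number;
  system: string;
  messages: Array<{ role: "user"; content: AnthropicBlock[] }>;
}

function buildAnthropicRequest(
  base64Image: string,
  systemInstruction: string,
  prompt: string,
): AnthropicRequest {
  return {
    model: "claude-haiku-4-5",
    max_tokens: 1024, // required field; 1024 is an arbitrary value for this sketch
    system: systemInstruction, // top-level parameter, not a message
    messages: [
      {
        role: "user",
        content: [
          {
            type: "image",
            // Bare base64 string here: no "data:" URL prefix
            source: { type: "base64", media_type: "image/jpeg", data: base64Image },
          },
          { type: "text", text: prompt },
        ],
      },
    ],
  };
}

// "/9j/4AAQ" is a truncated placeholder, not a real image
const anthropicRequest = buildAnthropicRequest(
  "/9j/4AAQ",
  "You are a naturalist.",
  "Describe the species present.",
);
console.log(JSON.stringify(anthropicRequest, null, 2));
```

An object of this shape is what anthropic.messages.create() would receive.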

Lesson 2c: Gemini Media Prompts

Analyzing images with Google's Gemini models through the generateContent method.

  • Use gemini.models.generateContent() with parts array
  • Structure inlineData object for images
  • Use systemInstruction in config for role guidance
  • Access simple .text response property
  • Key Feature: inlineData with mimeType in parts
  • Model Used: gemini-3-flash-preview
  • Code: src/gemini/media-prompt.ts

API Structure Highlights:

{
  inlineData: {
    mimeType: "image/jpeg",
    data: base64Image
  }
}
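The same exercise for Gemini: a sketch that assembles the request body without calling the API. The helper name buildGeminiRequest and the local types are assumptions for illustration; the field names follow the structure described in this lesson.

```typescript
// Sketch only: builds a request body, does not call the API.
// buildGeminiRequest and these types are hypothetical, for illustration.
type GeminiPart =
  | { text: string }
  | { inlineData: { mimeType: string; data: string } };

interface GeminiRequest {
  model: string;
  contents: Array<{ role: "user"; parts: GeminiPart[] }>;
  config: { systemInstruction: string };
}

function buildGeminiRequest(
  base64Image: string,
  systemInstruction: string,
  prompt: string,
): GeminiRequest {
  return {
    model: "gemini-3-flash-preview",
    contents: [
      {
        role: "user",
        parts: [
          { text: prompt },
          // Bare base64 string plus an explicit MIME type
          { inlineData: { mimeType: "image/jpeg", data: base64Image } },
        ],
      },
    ],
    // The system instruction lives in config, not in contents
    config: { systemInstruction },
  };
}

// "/9j/4AAQ" is a truncated placeholder, not a real image
const geminiRequest = buildGeminiRequest(
  "/9j/4AAQ",
  "You are a naturalist.",
  "Describe the species present.",
);
console.log(JSON.stringify(geminiRequest, null, 2));
```

An object of this shape is what gemini.models.generateContent() would receive.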

🔄 Side-by-Side Comparison

Request Structure

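| Provider  | Method             | Image Key                | Metadata        |
|-----------|--------------------|--------------------------|-----------------|
| OpenAI    | responses.create() | image_url (string)       | Inline in URL   |
| Anthropic | messages.create()  | source.data (object)     | Separate fields |
| Gemini    | generateContent()  | inlineData.data (object) | mimeType        |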

Response Structure

| Provider  | Access Pattern           | Format          |
|-----------|--------------------------|-----------------|
| OpenAI    | response.output_text     | String          |
| Anthropic | response.content[0].text | Array of blocks |
| Gemini    | response.text            | String          |
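If you want one code path for all three response shapes, a small adapter is one option. The sketch below uses simplified stand-in types rather than the SDKs' real response types, and the name extractText is hypothetical. For Anthropic it joins every text block instead of reading only content[0], which is a defensive choice rather than a requirement.

```typescript
// Simplified stand-ins for the three response shapes; not the SDKs' real types.
type ProviderResponse =
  | { provider: "openai"; response: { output_text: string } }
  | {
      provider: "anthropic";
      response: { content: Array<{ type: string; text?: string }> };
    }
  | { provider: "gemini"; response: { text?: string } };

/** Return the plain text of a response, whichever provider produced it. */
function extractText(r: ProviderResponse): string {
  switch (r.provider) {
    case "openai":
      return r.response.output_text;
    case "anthropic":
      // Keep only text blocks and join them in order
      return r.response.content
        .filter((block) => block.type === "text")
        .map((block) => block.text ?? "")
        .join("");
    case "gemini":
      return r.response.text ?? "";
  }
}

const fromAnthropic = extractText({
  provider: "anthropic",
  response: { content: [{ type: "text", text: "A marine iguana." }] },
});
console.log(fromAnthropic); // A marine iguana.
```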

System Instructions

ProviderLocationParameter Name
OpenAIIn input arrayrole: "system"
AnthropicTop-levelsystem
GeminiIn configsystemInstruction

🎯 When to Use Each Provider

OpenAI Vision (GPT-4o-mini)

Best For:

  • Quick image analysis
  • Integration with existing OpenAI workflows
  • When you need detail parameter control (low/high/auto)

Strengths:

  • Simple, direct data URL format
  • Fast response times
  • Good balance of quality and cost

Anthropic Vision (Claude Haiku 4.5)

Best For:

  • Detailed image analysis
  • Safety-critical applications
  • Long-form visual descriptions

Strengths:

  • Excellent reasoning about images
  • Strong safety and content filtering
  • Detailed, nuanced responses

Gemini Vision (Gemini 3 Flash)

Best For:

  • High-volume image processing
  • Cost-conscious applications
  • Google ecosystem integration

Strengths:

  • Very fast processing
  • Cost-effective
  • Simple response structure

💡 Best Practices for Image Prompts

1. Clear Instructions

// ✅ Good - Specific, actionable
"Identify the animal species and describe its distinctive features.";

// ❌ Bad - Vague
"Tell me about this.";

2. Provide Context

// ✅ Good - Context helps
"Analyze this medical scan for abnormalities. This is a chest X-ray.";

// ❌ Bad - No context
"What do you see?";

3. Structure Your Request

// ✅ Good - Organized
"Please: 1) Identify all objects, 2) Describe spatial relationships, 3) Note any text";

// ❌ Bad - Unstructured
"Tell me everything about this image";

4. Optimize Image Size

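// ✅ Good - Reasonable size
// Requires the sharp package; `buffer` holds the original image bytes
import sharp from "sharp";

const image = await sharp(buffer)
  // Fit within 1024x1024, keep aspect ratio, never upscale smaller images
  .resize(1024, 1024, { fit: "inside", withoutEnlargement: true })
  .jpeg({ quality: 80 }) // Re-encode as JPEG at quality 80
  .toBuffer();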

// ❌ Bad - Unnecessarily large
// Sending 10MB images when 1MB would work
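Image size matters a little more than the file size suggests, because base64 turns every 3 bytes into 4 characters. The arithmetic can be sketched as follows; base64Length is a hypothetical helper name.

```typescript
/** Length in characters of the padded base64 encoding of `byteLength` bytes. */
function base64Length(byteLength: number): number {
  return 4 * Math.ceil(byteLength / 3);
}

const oneMB = 1024 * 1024;
console.log(base64Length(oneMB)); // 1398104
console.log(base64Length(10 * oneMB)); // 13981016
```

So a 10MB image becomes roughly 13.3MB of request payload, while the 1MB version stays near 1.3MB.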

🔧 Common Patterns

Multi-Provider Fallback

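// Assumes analyzeWithOpenAI, analyzeWithAnthropic, and analyzeWithGemini are
// defined elsewhere and each take (imageBase64, prompt).
async function analyzeImage(imageBase64: string, prompt: string) {
  try {
    // Try the primary provider first
    return await analyzeWithOpenAI(imageBase64, prompt);
  } catch (openaiError) {
    console.warn("OpenAI failed, trying Anthropic...", openaiError);
    try {
      return await analyzeWithAnthropic(imageBase64, prompt);
    } catch (anthropicError) {
      console.warn("Anthropic failed, trying Gemini...", anthropicError);
      // Last resort: if Gemini also fails, its error propagates to the caller
      return await analyzeWithGemini(imageBase64, prompt);
    }
  }
}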
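The nested try/catch grows awkward once you add a fourth provider or want to reorder them. One alternative, sketched here with hypothetical names (Analyzer, analyzeWithFallback) and stub analyzers in place of real API calls, is to loop over an ordered list and collect the failures.

```typescript
// Hypothetical types and names, for illustration only.
type Analyzer = (imageBase64: string, prompt: string) => Promise<string>;

interface NamedAnalyzer {
  name: string;
  run: Analyzer;
}

/** Try each analyzer in order and return the first successful result. */
async function analyzeWithFallback(
  analyzers: NamedAnalyzer[],
  imageBase64: string,
  prompt: string,
): Promise<{ provider: string; text: string }> {
  const failures: string[] = [];
  for (const { name, run } of analyzers) {
    try {
      return { provider: name, text: await run(imageBase64, prompt) };
    } catch (error) {
      const message = error instanceof Error ? error.message : String(error);
      failures.push(`${name}: ${message}`);
    }
  }
  // Every provider failed: report all of the reasons together
  throw new Error(`All providers failed (${failures.join("; ")})`);
}

// Stub analyzers standing in for the real provider calls
const stubs: NamedAnalyzer[] = [
  {
    name: "openai",
    run: async () => {
      throw new Error("rate limited");
    },
  },
  { name: "anthropic", run: async () => "A marine iguana on volcanic rock." },
];

analyzeWithFallback(stubs, "<base64>", "Describe the image").then((result) => {
  console.log(result.provider); // anthropic
});
```

Returning the provider name alongside the text makes it easier to see in logs which provider actually answered.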

Provider Selection Based on Task

type Provider = "openai" | "anthropic" | "gemini";

function selectProvider(taskType: string): Provider {
  switch (taskType) {
    case "detailed-analysis":
      return "anthropic"; // Claude's reasoning
    case "quick-scan":
      return "openai"; // GPT-4o-mini speed
    case "high-volume":
      return "gemini"; // Cost efficiency
    default:
      return "openai"; // Fall back to the primary provider
  }
}

📋 Code Checklist

Before running the media prompt examples:

  • ✅ API Keys Set: Check .env for OPENAI_API_KEY, ANTHROPIC_API_KEY, GOOGLE_API_KEY
  • ✅ Image Available: Confirm src/assets/lizard.jpg exists
  • ✅ Dependencies Installed: Run pnpm install in project root
  • ✅ TypeScript Compiled: Ensure no type errors with tsc --noEmit
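The API key check is easy to automate. The sketch below is one way to do it; findMissingKeys is a hypothetical helper, and it takes the environment as a plain object so it can be tried without a real .env file.

```typescript
const REQUIRED_KEYS = [
  "OPENAI_API_KEY",
  "ANTHROPIC_API_KEY",
  "GOOGLE_API_KEY",
] as const;

/** Return the names of required keys that are unset or blank. */
function findMissingKeys(env: Record<string, string | undefined>): string[] {
  return REQUIRED_KEYS.filter((key) => (env[key] ?? "").trim() === "");
}

// Example environment with one key missing and one left blank
const missing = findMissingKeys({
  OPENAI_API_KEY: "sk-example",
  ANTHROPIC_API_KEY: "   ",
});
console.log(missing); // [ 'ANTHROPIC_API_KEY', 'GOOGLE_API_KEY' ]
```

In a Node script you would pass process.env once your .env file has been loaded.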

🚀 Running the Examples

# OpenAI example
npx tsx src/openai/media-prompt.ts

# Anthropic example
npx tsx src/anthropic/media-prompt.ts

# Gemini example
npx tsx src/gemini/media-prompt.ts

Each script will:

  1. Load the lizard image
  2. Convert to base64
  3. Send multimodal prompt
  4. Display AI analysis
  5. Show token usage

📊 Expected Output

All three providers should return a similar analysis, though the exact wording and token counts will differ between providers and runs:

✅ Media Prompt Success!
AI Response: This image shows a marine iguana (Amblyrhynchus cristatus),
an endemic species to the Galápagos Islands. The iguana displays the
characteristic dark coloration and robust body adapted for marine life...

Tokens used: { input: 1245, output: 189, total: 1434 }
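The { input, output, total } shape above is a normalized view: each provider reports usage under its own field names. A sketch of one way to normalize them follows. The types are simplified stand-ins for the SDK responses, normalizeUsage is a hypothetical name, and the Anthropic total is computed by addition on the assumption that the response carries only input and output counts.

```typescript
interface TokenUsage {
  input: number;
  output: number;
  total: number;
}

// Simplified stand-ins for each provider's usage fields; not the SDKs' real types.
type RawUsage =
  | {
      provider: "openai";
      usage: { input_tokens: number; output_tokens: number; total_tokens: number };
    }
  | { provider: "anthropic"; usage: { input_tokens: number; output_tokens: number } }
  | {
      provider: "gemini";
      usageMetadata: {
        promptTokenCount?: number;
        candidatesTokenCount?: number;
        totalTokenCount?: number;
      };
    };

function normalizeUsage(raw: RawUsage): TokenUsage {
  switch (raw.provider) {
    case "openai":
      return {
        input: raw.usage.input_tokens,
        output: raw.usage.output_tokens,
        total: raw.usage.total_tokens,
      };
    case "anthropic": {
      const { input_tokens, output_tokens } = raw.usage;
      // No total field assumed here, so add the two counts
      return { input: input_tokens, output: output_tokens, total: input_tokens + output_tokens };
    }
    case "gemini": {
      const input = raw.usageMetadata.promptTokenCount ?? 0;
      const output = raw.usageMetadata.candidatesTokenCount ?? 0;
      // Prefer the reported total; fall back to the sum if it is absent
      return { input, output, total: raw.usageMetadata.totalTokenCount ?? (input + output) };
    }
  }
}

const usage = normalizeUsage({
  provider: "anthropic",
  usage: { input_tokens: 1245, output_tokens: 189 },
});
console.log("Tokens used:", usage); // Tokens used: { input: 1245, output: 189, total: 1434 }
```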

🎓 Key Takeaways

  1. Different Structures, Same Goal - Each provider has unique API design
  2. Base64 Encoding - A format all three providers accept for inline image data
  3. Cost vs Quality - Balance response quality with token costs
  4. Error Handling - Implement fallbacks for production reliability
  5. Provider Strengths - Match provider to task requirements

Navigation

Ready to dive into the code? Start with OpenAI in Lesson 2a!