Module 3, Lesson 1: Working with Media Across AI Providers
Introduction to Module 3 - Learning how to work with images across OpenAI, Anthropic, and Google Gemini APIs.
Published: 2/18/2026
Welcome to Module 3!
You've mastered text prompts across OpenAI, Anthropic, and Google Gemini. Now it's time to unlock multimodal AI by working with images.
Why Multimodal AI?
Real-World Applications
- Visual Analysis
  - Medical image interpretation
  - Product quality inspection
  - Document analysis (screenshots, diagrams, charts)
- Content Creation
  - Image captioning and description
  - Accessibility (alt text generation)
  - Visual content moderation
- Enhanced Understanding
  - Extract text from images (OCR)
  - Understand visual context
  - Combine text and visual reasoning
- User Experience
  - Chat with images
  - Visual search
  - Multimodal assistants
Industry Adoption
Modern AI applications increasingly combine text and images:
// Production multimodal example - Travel app
const landmarks = await analyzeImage({
  image: userPhoto,
  prompt: "Identify landmarks and suggest nearby attractions",
  provider: "openai", // or "anthropic" or "gemini"
});
Module 3 Overview
What You'll Learn
In this module, you'll:
- ✅ Understand how each provider handles image input
- ✅ Work with base64-encoded images
- ✅ Compare image processing across OpenAI, Anthropic, and Gemini
- ✅ Learn provider-specific image features
- ✅ Build multimodal applications
Course Structure
Module 3 (Media)
├── Lesson 1 (This lesson) - Module Overview
└── Lesson 2 - Media Prompts Across Providers
    ├── 2a - OpenAI Image Analysis
    ├── 2b - Anthropic Image Analysis
    └── 2c - Gemini Image Analysis
Working with Images
Common Workflow
All three providers follow a similar pattern:
- Load Image - Read the image file
- Encode - Convert to base64 format
- Structure Request - Format according to provider API
- Send & Receive - Get AI analysis
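The first two steps are the same for every provider, so they can be sketched once. This is a minimal sketch, not the course's implementation: the function names encodeBytesAsBase64 and loadImageAsBase64 are hypothetical, and it assumes a Node.js runtime where Buffer and node:fs are available.

```typescript
import { readFileSync } from "node:fs";

// Step 2 (Encode): turn raw image bytes into a base64 string.
// Hypothetical helper name, chosen for this sketch only.
function encodeBytesAsBase64(bytes: Uint8Array): string {
  return Buffer.from(bytes).toString("base64");
}

// Step 1 + 2 (Load, then Encode): read the file from disk and encode it.
// Throws if the file cannot be read.
function loadImageAsBase64(filePath: string): string {
  const bytes = readFileSync(filePath); // Buffer holding the raw file contents
  return encodeBytesAsBase64(bytes);
}
```

The resulting string is what the later snippets refer to as base64Image, for example loadImageAsBase64("src/assets/lizard.jpg").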
Code You'll Use
This module's code lives in the module-3-media branch of cwk-ai-playground:
src/
├── openai/
│   └── media-prompt.ts
├── anthropic/
│   └── media-prompt.ts
└── gemini/
    └── media-prompt.ts
Each file demonstrates the provider-specific approach to image analysis.
Provider Comparison: Image Handling
OpenAI Vision
// OpenAI approach - Travel landmark identification
const response = await openai.responses.create({
  model: "gpt-4o-mini",
  input: [
    {
      role: "user",
      content: [
        {
          type: "input_text",
          text: "What landmark is this and what can you tell me about it?",
        },
        {
          type: "input_image",
          image_url: `data:image/jpeg;base64,${base64Image}`,
        },
      ],
    },
  ],
});
Key Features:
- Uses input_image type in content array
- Supports detail parameter (auto, low, high)
- Direct base64 data URL format
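The detail parameter sits on the same object as image_url. The sketch below shows one way to build that content part. The helper buildOpenAIImagePart and the local types are hypothetical and written for illustration; they are not the SDK's own type definitions.

```typescript
// Local, illustrative types (assumptions, not imported from the OpenAI SDK).
type ImageDetail = "auto" | "low" | "high";

interface InputImagePart {
  type: "input_image";
  image_url: string;
  detail: ImageDetail;
}

// Builds the image entry for the content array, wrapping the base64 string
// in a data URL and attaching the requested detail level.
function buildOpenAIImagePart(
  base64Image: string,
  mimeType: string = "image/jpeg",
  detail: ImageDetail = "auto",
): InputImagePart {
  return {
    type: "input_image",
    image_url: `data:${mimeType};base64,${base64Image}`,
    detail,
  };
}
```

The returned object can take the place of the input_image entry in the request above. Choosing "low" generally reduces token usage at the cost of fine visual detail, so it suits coarse questions better than reading small text.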
Anthropic Vision
// Anthropic approach - Travel destination guide
const response = await anthropic.messages.create({
  model: "claude-haiku-4-5",
  max_tokens: 1000,
  messages: [
    {
      role: "user",
      content: [
        {
          type: "text",
          text: "Describe this travel destination and recommend activities",
        },
        {
          type: "image",
          source: {
            type: "base64",
            media_type: "image/jpeg",
            data: base64Image,
          },
        },
      ],
    },
  ],
});
Key Features:
- Separate source object structure
- Explicit media_type specification
- Response is array of content blocks
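Because the response content is an array of blocks rather than a single string, the text has to be collected from it. A minimal sketch follows, assuming each text block has type "text" and a text field. ContentBlockLike and extractText are hypothetical names for this example, not SDK exports.

```typescript
// Loose, illustrative shape for a response content block (an assumption;
// the SDK ships its own, stricter types).
interface ContentBlockLike {
  type: string;
  text?: string;
}

// Collects the text of every text block and joins the pieces with newlines.
// Blocks of any other type are skipped.
function extractText(content: ContentBlockLike[]): string {
  return content
    .filter((block) => block.type === "text" && typeof block.text === "string")
    .map((block) => block.text as string)
    .join("\n");
}
```

With the request above, this would be called as extractText(response.content).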
Google Gemini Vision
// Gemini approach - Travel photo analysis
const response = await gemini.models.generateContent({
  model: "gemini-3-flash-preview",
  contents: {
    role: "user",
    parts: [
      { text: "Identify this location and provide travel tips" },
      {
        inlineData: {
          mimeType: "image/jpeg",
          data: base64Image,
        },
      },
    ],
  },
});
Key Features:
- Uses inlineData with parts array
- mimeType specification
- Simple response structure with .text
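Seen side by side, the three providers differ mainly in how they wrap the same base64 string. The sketch below puts the three image shapes from the snippets above into one function. buildImagePart and the Provider type are hypothetical names, and the return type is deliberately loose rather than taken from any SDK.

```typescript
type Provider = "openai" | "anthropic" | "gemini";

// Returns the image portion of a request in the shape each provider's
// snippet in this lesson uses. Only the wrapping differs between providers.
function buildImagePart(
  provider: Provider,
  base64Image: string,
  mimeType: string,
): Record<string, unknown> {
  switch (provider) {
    case "openai":
      // Data URL inside an input_image content part
      return {
        type: "input_image",
        image_url: `data:${mimeType};base64,${base64Image}`,
      };
    case "anthropic":
      // Separate source object with an explicit media_type
      return {
        type: "image",
        source: { type: "base64", media_type: mimeType, data: base64Image },
      };
    case "gemini":
      // inlineData entry inside the parts array
      return { inlineData: { mimeType, data: base64Image } };
  }
}
```

A dispatcher like this is one possible starting point for a provider-agnostic helper, though the surrounding request (model name, message wrapper, token limits) still has to be built per provider.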
Supported Image Formats
All three providers support common image formats:
- ✅ JPEG (.jpg, .jpeg)
- ✅ PNG (.png)
- ✅ WebP (.webp)
- ✅ GIF (.gif, non-animated)
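Since Anthropic and Gemini both expect an explicit MIME type, it helps to derive it from the file extension instead of hard-coding "image/jpeg". A minimal sketch, covering only the formats listed above; mimeTypeForImage is a hypothetical helper, and it trusts the extension rather than inspecting the file contents.

```typescript
import { extname } from "node:path";

// Extension-to-MIME lookup for the formats listed in this lesson.
const MIME_TYPES: Record<string, string> = {
  ".jpg": "image/jpeg",
  ".jpeg": "image/jpeg",
  ".png": "image/png",
  ".webp": "image/webp",
  ".gif": "image/gif",
};

// Returns the MIME type for an image path, or throws for an extension
// that is not in the table above. Matching is case-insensitive.
function mimeTypeForImage(filePath: string): string {
  const ext = extname(filePath).toLowerCase();
  const mimeType = MIME_TYPES[ext];
  if (!mimeType) {
    throw new Error(`Unsupported image extension: "${ext}"`);
  }
  return mimeType;
}
```

For the sample asset, mimeTypeForImage("src/assets/lizard.jpg") gives "image/jpeg".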
File Size Limits:
- OpenAI: 20MB max
- Anthropic: 5MB max per image
- Gemini: ~20MB max
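Checking the size before sending avoids a wasted round trip. The sketch below uses the limits listed above as plain constants. It assumes 1 MB means 1024 × 1024 bytes and that the limit applies to the raw file size; whether a provider measures raw bytes or the encoded payload is something to confirm in its documentation. All names here are hypothetical.

```typescript
type Provider = "openai" | "anthropic" | "gemini";

const MB = 1024 * 1024; // assumption: binary megabytes

// Limits copied from the list in this lesson; treat them as approximate.
const MAX_IMAGE_BYTES: Record<Provider, number> = {
  openai: 20 * MB,
  anthropic: 5 * MB,
  gemini: 20 * MB,
};

// Length of the padded base64 string for a given number of raw bytes.
// Base64 emits 4 characters for every 3 bytes, so payloads grow by about a third.
function base64Length(byteLength: number): number {
  return Math.ceil(byteLength / 3) * 4;
}

// True when the raw image size is within the listed limit for the provider.
function fitsProviderLimit(provider: Provider, byteLength: number): boolean {
  return byteLength <= MAX_IMAGE_BYTES[provider];
}
```

The base64Length helper is included because an image close to a limit may exceed it once encoded, if the limit turns out to apply to the encoded request.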
What You'll Build
By the end of this module, you'll have built image analysis applications with all three providers, and you'll understand how their approaches differ and when each one fits your needs.
Example Project: Travel Photo Analyzer
The example builds a travel assistant that analyzes landmark photos to:
- Identify famous landmarks and destinations
- Provide historical and cultural context
- Recommend nearby attractions and activities
- Demonstrate provider-specific image handling
- Show proper error handling for media operations
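For media operations, the most common failure happens before any API call: the image file is missing, unreadable, or empty. One way to handle that is to return a result object instead of throwing, as in the sketch below. This is an illustration under assumptions, not the repository's code; safeLoadImage and LoadResult are hypothetical names.

```typescript
import { readFileSync } from "node:fs";

// Either the encoded image or a readable error message.
type LoadResult =
  | { ok: true; base64: string }
  | { ok: false; error: string };

// Reads and encodes an image, converting file system errors and empty
// files into a failed result rather than an exception.
function safeLoadImage(filePath: string): LoadResult {
  try {
    const bytes = readFileSync(filePath);
    if (bytes.length === 0) {
      return { ok: false, error: `Image file is empty: ${filePath}` };
    }
    return { ok: true, base64: bytes.toString("base64") };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { ok: false, error: `Could not read image: ${message}` };
  }
}
```

The caller checks result.ok before building a request, which keeps a missing file from turning into a confusing provider-side error. Errors from the API call itself (authentication, rate limits, oversized payloads) need their own handling around the request.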
Getting Started
All code for this module is in the module-3-media branch of the cwk-ai-playground repository. Make sure you have:
- ✅ API keys for OpenAI, Anthropic, and Google Gemini
- ✅ The lizard.jpg image in src/assets/ (wildlife from the Galápagos Islands)
- ✅ TypeScript environment set up
Ready to work with images? Let's dive into the provider-specific implementations in Lesson 2!