Lesson 1: Module 3 Media Overview

Introduction to Module 3 - Learning how to work with images across OpenAI, Anthropic, and Google Gemini APIs.

Welcome to Module 3!

You've mastered text prompts across OpenAI, Anthropic, and Google Gemini. Now it's time to unlock multimodal AI by working with images.

Why Multimodal AI?

Real-World Applications

Visual Analysis
- Medical image interpretation
- Product quality inspection
- Document analysis (screenshots, diagrams, charts)

Content Creation
- Image captioning and description
- Accessibility (alt text generation)
- Visual content moderation

Enhanced Understanding
- Extract text from images (OCR)
- Understand visual context
- Combine text and visual reasoning

User Experience
- Chat with images
- Visual search
- Multimodal assistants

Industry Adoption

Modern AI applications increasingly combine text and images:

// Illustrative multimodal call for a travel app.
// analyzeImage is a placeholder wrapper, not a function from any provider SDK.
const landmarks = await analyzeImage({
  image: userPhoto,
  prompt: "Identify landmarks and suggest nearby attractions",
  provider: "openai", // or "anthropic" or "gemini"
});

Module 3 Overview

What You'll Learn

In this module, you'll:

✅ Understand how each provider handles image input
✅ Work with base64-encoded images
✅ Compare image processing across OpenAI, Anthropic, and Gemini
✅ Learn provider-specific image features
✅ Build multimodal applications

Course Structure

Module 3 (Media)
├── Lesson 1 (This lesson) - Module Overview
└── Lesson 2 - Media Prompts Across Providers
    ├── 2a - OpenAI Image Analysis
    ├── 2b - Anthropic Image Analysis
    └── 2c - Gemini Image Analysis

Working with Images

Common Workflow

All three providers follow a similar pattern:

1. Load Image - Read the image file
2. Encode - Convert to base64 format
3. Structure Request - Format according to the provider's API
4. Send & Receive - Get the AI analysis
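The first two steps can be sketched in a few lines of Node-flavoured TypeScript. This is a minimal sketch rather than the code from the course repository: the helper names `encodeImageBytes` and `loadImage` and the `MIME_TYPES` table are assumptions introduced here for illustration.

```typescript
import { readFileSync } from "node:fs";
import { extname } from "node:path";

// Hypothetical lookup table: file extension -> MIME type.
const MIME_TYPES: Record<string, string> = {
  ".jpg": "image/jpeg",
  ".jpeg": "image/jpeg",
  ".png": "image/png",
  ".webp": "image/webp",
  ".gif": "image/gif",
};

interface EncodedImage {
  mimeType: string; // e.g. "image/jpeg"
  base64: string;   // raw base64 payload (the form used in the Anthropic and Gemini requests)
  dataUrl: string;  // data URL (the form used in the OpenAI request)
}

// Step 2: encode raw bytes as base64 and derive the MIME type from the file name.
function encodeImageBytes(bytes: Uint8Array, fileName: string): EncodedImage {
  const mimeType = MIME_TYPES[extname(fileName).toLowerCase()];
  if (!mimeType) {
    throw new Error(`Unsupported image extension: ${fileName}`);
  }
  const base64 = Buffer.from(bytes).toString("base64");
  return { mimeType, base64, dataUrl: `data:${mimeType};base64,${base64}` };
}

// Step 1 + 2: read the file from disk, then encode it.
function loadImage(path: string): EncodedImage {
  return encodeImageBytes(readFileSync(path), path);
}
```

Under these assumptions, `loadImage("src/assets/lizard.jpg")` would yield a `base64` string for providers that take the raw payload and a `dataUrl` for providers that take a data URL.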

Code You'll Use

This module's code lives in the module-3-media branch of cwk-ai-playground:

src/
├── openai/
│   └── media-prompt.ts
├── anthropic/
│   └── media-prompt.ts
└── gemini/
    └── media-prompt.ts

Each file demonstrates the provider-specific approach to image analysis.

Provider Comparison: Image Handling

OpenAI Vision

// OpenAI approach - Travel landmark identification
const response = await openai.responses.create({
  model: "gpt-4o-mini",
  input: [
    {
      role: "user",
      content: [
        {
          type: "input_text",
          text: "What landmark is this and what can you tell me about it?",
        },
        {
          type: "input_image",
          image_url: `data:image/jpeg;base64,${base64Image}`,
        },
      ],
    },
  ],
});

Key Features:

- Uses the input_image type in the content array
- Supports a detail parameter (auto, low, high)
- Takes the image as a base64 data URL
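The detail parameter sits on the image part itself, next to image_url. The sketch below builds such a part without calling the API; `buildOpenAIImagePart` and the local `InputImagePart` type are hypothetical names that mirror only the fields used in this lesson, not the SDK's own type definitions.

```typescript
type ImageDetail = "auto" | "low" | "high";

// Local stand-in for the shape of an input_image content part (assumption:
// only the fields shown in this lesson are modelled).
interface InputImagePart {
  type: "input_image";
  image_url: string;
  detail: ImageDetail;
}

// Build the image part for the content array, defaulting to detail "auto".
function buildOpenAIImagePart(
  base64Image: string,
  detail: ImageDetail = "auto",
  mimeType: string = "image/jpeg",
): InputImagePart {
  return {
    type: "input_image",
    image_url: `data:${mimeType};base64,${base64Image}`,
    detail,
  };
}
```

As a rough guide, "low" trades visual fidelity for fewer image tokens, while "high" is the better fit when small text or fine detail in the photo matters.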

Anthropic Vision

// Anthropic approach - Travel destination guide
const response = await anthropic.messages.create({
  model: "claude-haiku-4-5",
  max_tokens: 1000,
  messages: [
    {
      role: "user",
      content: [
        {
          type: "text",
          text: "Describe this travel destination and recommend activities",
        },
        {
          type: "image",
          source: {
            type: "base64",
            media_type: "image/jpeg",
            data: base64Image,
          },
        },
      ],
    },
  ],
});

Key Features:

- Wraps the image in a separate source object
- Requires an explicit media_type
- Returns the response as an array of content blocks
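Because the response content is an array of blocks rather than a single string, the text has to be collected from the blocks of type "text". A minimal sketch, assuming a simplified local block type (`ContentBlockLike`) and a hypothetical helper name (`extractText`):

```typescript
// Simplified stand-in for a response content block (assumption: only the
// fields needed to pull out text are modelled here).
interface ContentBlockLike {
  type: string;
  text?: string;
}

// Join the text of every "text" block, skipping any other block types.
function extractText(content: ContentBlockLike[]): string {
  return content
    .filter((block) => block.type === "text" && typeof block.text === "string")
    .map((block) => block.text as string)
    .join("\n");
}
```

With the request above, the call would look like `extractText(response.content)`.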

Google Gemini Vision

// Gemini approach - Travel photo analysis
const response = await gemini.models.generateContent({
  model: "gemini-3-flash-preview",
  contents: {
    role: "user",
    parts: [
      { text: "Identify this location and provide travel tips" },
      {
        inlineData: {
          mimeType: "image/jpeg",
          data: base64Image,
        },
      },
    ],
  },
});

Key Features:

- Uses inlineData inside the parts array
- Requires a mimeType for the image
- Exposes the result through a simple .text property on the response
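The parts array and the .text read can be isolated into two small helpers. This is a sketch under stated assumptions: `buildGeminiParts`, `readGeminiText`, and the local part types are hypothetical names that model only the fields shown in the request above, and the fallback covers the case where no text comes back.

```typescript
// Local types mirroring only the fields used in this lesson's request
// (assumption: not the SDK's own type definitions).
interface TextPart {
  text: string;
}
interface InlineImagePart {
  inlineData: { mimeType: string; data: string };
}
type GeminiPart = TextPart | InlineImagePart;

// Build the parts array: the prompt first, then the inline base64 image.
function buildGeminiParts(
  prompt: string,
  base64Image: string,
  mimeType: string = "image/jpeg",
): GeminiPart[] {
  return [{ text: prompt }, { inlineData: { mimeType, data: base64Image } }];
}

// Read the response text, falling back when the response carries none.
function readGeminiText(response: { text?: string }, fallback: string = ""): string {
  return response.text ?? fallback;
}
```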

Supported Image Formats

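The providers share support for the most common image formats:

✅ JPEG (.jpg, .jpeg)
✅ PNG (.png)
✅ WebP (.webp)
✅ GIF (.gif, non-animated) on OpenAI and Anthropic; Gemini's documented image formats are PNG, JPEG, WebP, HEIC, and HEIF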

File Size Limits:

- OpenAI: 20MB max
- Anthropic: 5MB max per image
- Gemini: ~20MB max
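A pre-flight size check can catch oversized images before a request is sent. The sketch below hard-codes the limits listed above and makes two assumptions worth flagging: it reads "MB" as 1024 × 1024 bytes, and it compares the decoded image size (not the longer base64 string) against the limit. The names `MAX_IMAGE_BYTES`, `base64DecodedSize`, and `fitsProviderLimit` are hypothetical.

```typescript
type Provider = "openai" | "anthropic" | "gemini";

// Limits copied from the list above; treat them as a snapshot, not a guarantee.
const MAX_IMAGE_BYTES: Record<Provider, number> = {
  openai: 20 * 1024 * 1024,
  anthropic: 5 * 1024 * 1024,
  gemini: 20 * 1024 * 1024,
};

// Decoded byte length of a padded base64 string with no whitespace:
// every 4 characters carry 3 bytes, minus one byte per "=" of padding.
function base64DecodedSize(base64: string): number {
  const padding = base64.endsWith("==") ? 2 : base64.endsWith("=") ? 1 : 0;
  return (base64.length / 4) * 3 - padding;
}

// True when the decoded image is within the provider's listed limit.
function fitsProviderLimit(base64: string, provider: Provider): boolean {
  return base64DecodedSize(base64) <= MAX_IMAGE_BYTES[provider];
}
```

An image that fails the check for one provider can be resized or recompressed, or routed to a provider with a higher limit.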

What You'll Build

By the end of this module, you'll have created image analysis applications across all three providers, understanding the nuances of each approach and when to use which provider based on your needs.

Example Project: Travel Photo Analyzer

The example builds a travel assistant that analyzes landmark photos to:

- Identify famous landmarks and destinations
- Provide historical and cultural context
- Recommend nearby attractions and activities
- Demonstrate provider-specific image handling
- Show proper error handling for media operations

Getting Started

All code for this module is in the module-3-media branch of the cwk-ai-playground repository. Make sure you have:

✅ API keys for OpenAI, Anthropic, and Google Gemini
✅ The lizard.jpg image in src/assets/ (wildlife from the Galápagos Islands)
✅ TypeScript environment set up

Navigation

Next: Lesson 2: Media Prompts
Module Index: AI SDK Essentials

Ready to work with images? Let's dive into the provider-specific implementations in Lesson 2!
