Module 3, Lesson 1: Working with Media Across AI Providers

Introduction to Module 3 - Learning how to work with images across OpenAI, Anthropic, and Google Gemini APIs.

Published: 2/18/2026

Welcome to Module 3!

You've mastered text prompts across OpenAI, Anthropic, and Google Gemini. Now it's time to unlock multimodal AI by working with images.

Why Multimodal AI?

Real-World Applications

  1. Visual Analysis

    • Medical image interpretation
    • Product quality inspection
    • Document analysis (screenshots, diagrams, charts)
  2. Content Creation

    • Image captioning and description
    • Accessibility (alt text generation)
    • Visual content moderation
  3. Enhanced Understanding

    • Extract text from images (OCR)
    • Understand visual context
    • Combine text and visual reasoning
  4. User Experience

    • Chat with images
    • Visual search
    • Multimodal assistants

Industry Adoption

Modern AI applications increasingly combine text and images:

// Illustrative travel-app call. analyzeImage is a hypothetical wrapper, not a provider SDK function.
const landmarks = await analyzeImage({
  image: userPhoto,
  prompt: "Identify landmarks and suggest nearby attractions",
  provider: "openai", // or "anthropic" or "gemini"
});

Module 3 Overview

What You'll Learn

In this module, you'll:

  • ✅ Understand how each provider handles image input
  • ✅ Work with base64-encoded images
  • ✅ Compare image processing across OpenAI, Anthropic, and Gemini
  • ✅ Learn provider-specific image features
  • ✅ Build multimodal applications

Course Structure

Module 3 (Media)
├── Lesson 1 (This lesson) - Module Overview
└── Lesson 2 - Media Prompts Across Providers
    ├── 2a - OpenAI Image Analysis
    ├── 2b - Anthropic Image Analysis
    └── 2c - Gemini Image Analysis

Working with Images

Common Workflow

All three providers follow a similar pattern:

  1. Load Image - Read the image file
  2. Encode - Convert to base64 format
  3. Structure Request - Format according to provider API
  4. Send & Receive - Get AI analysis
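
Steps 1 and 2 of the workflow above can be sketched as follows. This is a minimal sketch, not the code from the course repository: encodeImage, loadImageAsBase64, and the extension-to-media-type table are hypothetical names introduced here, and the sketch assumes a Node.js runtime.

```typescript
import { readFileSync } from "node:fs";
import { extname } from "node:path";

// Hypothetical helper types for this sketch; not part of any provider SDK.
type ImageMediaType = "image/jpeg" | "image/png" | "image/webp" | "image/gif";

interface EncodedImage {
  mediaType: ImageMediaType;
  base64: string;
}

// Map file extensions to the media types listed in this lesson.
const MEDIA_TYPES: Record<string, ImageMediaType> = {
  ".jpg": "image/jpeg",
  ".jpeg": "image/jpeg",
  ".png": "image/png",
  ".webp": "image/webp",
  ".gif": "image/gif",
};

// Step 2: pick the media type from the file name and base64-encode the bytes.
function encodeImage(bytes: Uint8Array, fileName: string): EncodedImage {
  const mediaType = MEDIA_TYPES[extname(fileName).toLowerCase()];
  if (!mediaType) {
    throw new Error(`Unsupported image extension: ${fileName}`);
  }
  return { mediaType, base64: Buffer.from(bytes).toString("base64") };
}

// Step 1: read the file from disk, then hand the bytes to encodeImage.
function loadImageAsBase64(filePath: string): EncodedImage {
  return encodeImage(readFileSync(filePath), filePath);
}
```

Keeping the media type next to the encoded data is a convenience: every provider shown later in this lesson needs both values, just in different places in the request.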

Code You'll Use

This module's code lives in the module-3-media branch of cwk-ai-playground:

src/
├── openai/
│   └── media-prompt.ts
├── anthropic/
│   └── media-prompt.ts
└── gemini/
    └── media-prompt.ts

Each file demonstrates the provider-specific approach to image analysis.

Provider Comparison: Image Handling

OpenAI Vision

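// OpenAI approach - Travel landmark identification
import { readFileSync } from "node:fs";
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
const base64Image = readFileSync("src/assets/lizard.jpg").toString("base64");

const response = await openai.responses.create({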
  model: "gpt-4o-mini",
  input: [
    {
      role: "user",
      content: [
        {
          type: "input_text",
          text: "What landmark is this and what can you tell me about it?",
        },
        {
          type: "input_image",
          image_url: `data:image/jpeg;base64,${base64Image}`,
        },
      ],
    },
  ],
});

Key Features:

  • Uses the input_image type in the content array
  • Supports a detail parameter (auto, low, high)
  • Takes the image as a base64 data URL in image_url
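
The detail parameter sits next to image_url on the input_image part. The sketch below shows one way to build that part. buildInputImage and the local types are hypothetical names for illustration, and the shape mirrors the request in the example above rather than importing the SDK's own types.

```typescript
// Local types for this sketch; the OpenAI SDK ships its own richer definitions.
type ImageDetail = "auto" | "low" | "high";

interface InputImagePart {
  type: "input_image";
  image_url: string;
  detail: ImageDetail;
}

// Build an input_image content part from raw base64 data.
// detail defaults to "auto", which lets the API choose the resolution.
function buildInputImage(
  base64Image: string,
  mediaType: string,
  detail: ImageDetail = "auto",
): InputImagePart {
  return {
    type: "input_image",
    image_url: `data:${mediaType};base64,${base64Image}`,
    detail,
  };
}
```

As a rule of thumb, low keeps token usage down for simple questions about an image, while high is worth trying when the answer depends on small details such as fine print.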

Anthropic Vision

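// Anthropic approach - Travel destination guide
import { readFileSync } from "node:fs";
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment
const base64Image = readFileSync("src/assets/lizard.jpg").toString("base64");

const response = await anthropic.messages.create({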
  model: "claude-haiku-4-5",
  max_tokens: 1000,
  messages: [
    {
      role: "user",
      content: [
        {
          type: "text",
          text: "Describe this travel destination and recommend activities",
        },
        {
          type: "image",
          source: {
            type: "base64",
            media_type: "image/jpeg",
            data: base64Image,
          },
        },
      ],
    },
  ],
});

Key Features:

  • Wraps the image in a separate source object
  • Requires an explicit media_type
  • Returns response.content as an array of content blocks
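
Because the response content is an array of blocks rather than a single string, reading the answer takes one extra step. A minimal sketch, assuming only that text blocks have type "text" and a text field: ContentBlock and collectText are hypothetical names defined here, not SDK exports.

```typescript
// Minimal local type for this sketch; the real SDK exports richer block types.
interface ContentBlock {
  type: string;
  text?: string;
}

// Join the text of every text block, skipping any other block types.
function collectText(content: ContentBlock[]): string {
  return content
    .filter((block) => block.type === "text" && typeof block.text === "string")
    .map((block) => block.text as string)
    .join("\n");
}
```

With the request above, the call would look like collectText(response.content).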

Google Gemini Vision

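// Gemini approach - Travel photo analysis
import { readFileSync } from "node:fs";
import { GoogleGenAI } from "@google/genai";

// The GEMINI_API_KEY variable name is an assumption; use whatever your setup defines.
const gemini = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const base64Image = readFileSync("src/assets/lizard.jpg").toString("base64");

const response = await gemini.models.generateContent({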
  model: "gemini-3-flash-preview",
  contents: {
    role: "user",
    parts: [
      { text: "Identify this location and provide travel tips" },
      {
        inlineData: {
          mimeType: "image/jpeg",
          data: base64Image,
        },
      },
    ],
  },
});

Key Features:

  • Uses inlineData inside the parts array
  • Requires an explicit mimeType
  • Exposes the answer directly as response.text
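
One practical difference between the providers: OpenAI takes a data URL, while Gemini wants the media type and the raw base64 string as separate fields. The sketch below converts one form to the other. dataUrlToInlineData and InlineDataPart are hypothetical names used for illustration only.

```typescript
// Local type for this sketch, matching the inlineData shape in the example above.
interface InlineDataPart {
  inlineData: { mimeType: string; data: string };
}

// Split "data:<mimeType>;base64,<data>" into the two fields Gemini expects.
function dataUrlToInlineData(dataUrl: string): InlineDataPart {
  const prefix = "data:";
  const marker = ";base64,";
  const markerIndex = dataUrl.indexOf(marker);
  if (!dataUrl.startsWith(prefix) || markerIndex === -1) {
    throw new Error("Expected a base64 data URL");
  }
  const mimeType = dataUrl.slice(prefix.length, markerIndex);
  const data = dataUrl.slice(markerIndex + marker.length);
  if (mimeType === "" || data === "") {
    throw new Error("Data URL is missing its media type or payload");
  }
  return { inlineData: { mimeType, data } };
}
```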

Supported Image Formats

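OpenAI and Anthropic accept the four common web formats below. Gemini's documentation lists JPEG, PNG, and WebP (plus HEIC and HEIF) but not GIF:

  • ✅ JPEG (.jpg, .jpeg)
  • ✅ PNG (.png)
  • ✅ WebP (.webp)
  • ✅ GIF (.gif, non-animated; OpenAI and Anthropic only)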

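File Size Limits (base64 encoding adds roughly one third to the payload size, so leave headroom):

  • OpenAI: 20MB max
  • Anthropic: 5MB max per image
  • Gemini: ~20MB max for an inline request (larger files go through the Files API)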
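
Checking the size before sending saves a failed request. The sketch below uses the limits quoted above and makes two assumptions that you should confirm against each provider's documentation: that 1MB means 1024 × 1024 bytes, and that the limit is compared against the raw file size rather than the encoded payload. All names are hypothetical.

```typescript
type Provider = "openai" | "anthropic" | "gemini";

// Limits as quoted in this lesson, converted to bytes (assumes 1MB = 1024 * 1024 bytes).
const MAX_IMAGE_BYTES: Record<Provider, number> = {
  openai: 20 * 1024 * 1024,
  anthropic: 5 * 1024 * 1024,
  gemini: 20 * 1024 * 1024,
};

// Base64 emits 4 characters for every 3 input bytes, rounded up to a full group.
function base64Length(byteLength: number): number {
  return Math.ceil(byteLength / 3) * 4;
}

// Assumption: the limit applies to the raw file size, not the encoded size.
function fitsProviderLimit(byteLength: number, provider: Provider): boolean {
  return byteLength <= MAX_IMAGE_BYTES[provider];
}
```

If a provider turns out to measure the encoded payload instead, pass base64Length(byteLength) to fitsProviderLimit for a more conservative check.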

What You'll Build

By the end of this module, you'll have built image analysis applications for all three providers. You'll understand how their request and response formats differ and which provider fits a given use case.

Example Project: Travel Photo Analyzer

The example builds a travel assistant that analyzes landmark photos to:

  • Identify famous landmarks and destinations
  • Provide historical and cultural context
  • Recommend nearby attractions and activities
  • Demonstrate provider-specific image handling
  • Show proper error handling for media operations

Getting Started

All code for this module is in the module-3-media branch of the cwk-ai-playground repository. Make sure you have:

  • ✅ API keys for OpenAI, Anthropic, and Google Gemini
  • ✅ The lizard.jpg image in src/assets/ (wildlife from the Galápagos Islands)
  • ✅ TypeScript environment set up
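
A quick way to confirm the first item on the checklist is to look for the keys before running any example. This is a sketch under an assumption: the variable names below are common defaults for the three SDKs, but the course repository may use different ones. missingApiKeys is a hypothetical helper.

```typescript
// Assumed environment variable names; adjust them to match your own setup.
const REQUIRED_KEYS = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GEMINI_API_KEY"] as const;

// Return the names of required keys that are unset or blank in the given environment.
function missingApiKeys(env: Record<string, string | undefined>): string[] {
  return REQUIRED_KEYS.filter((name) => !env[name]?.trim());
}

// Usage in a Node.js script:
//   const missing = missingApiKeys(process.env);
//   if (missing.length > 0) console.error(`Missing keys: ${missing.join(", ")}`);
```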

Navigation

Ready to work with images? Let's dive into the provider-specific implementations in Lesson 2!