Lesson 2a: OpenAI Media | Module 3

Lesson 2a: Media Prompts with OpenAI

Learn how to analyze images using OpenAI's Vision API. This lesson demonstrates how to combine text and images in a single prompt to get AI-powered visual analysis.

What It Does

Sends an image along with text instructions to OpenAI's vision model and receives detailed analysis. Perfect for building applications that need to understand visual content.

Key Features

Multimodal Input: Combine text prompts with images
Base64 Encoding: Send images as encoded data URLs
System Instructions: Guide the AI's analysis approach
Simple Response: Direct access via output_text

Code Example

The complete code is in src/openai/media-prompt.ts:

import OpenAI from "openai";
import dotenv from "dotenv";
import fs from "fs";

// Load environment variables
dotenv.config();

// Create the OpenAI client (reads OPENAI_API_KEY from the environment by default)
const openai = new OpenAI();

// Async function with proper return type
async function mediaPrompt(): Promise<void> {
  try {
    console.log("Testing OpenAI connection...");
    // Read image file and convert to base64
    const base64Image = fs.readFileSync("./src/assets/lizard.jpg", "base64");

    // Make API call - response is automatically typed!
    // Using a system prompt along with user prompt
    const response = await openai.responses.create({
      model: "gpt-5-nano",
      input: [
        {
          role: "system",
          content:
            "You are a naturalist. Provide detailed information about the flora and fauna in the image.",
        },
        {
          role: "user",
          content: [
            {
              type: "input_text",
              text: "Analyze the image and describe the species present, their behaviors, and any notable ecological interactions. The photo was taken in the Galápagos Islands.",
            },
            {
              type: "input_image",
              image_url: `data:image/jpeg;base64,${base64Image}`,
              detail: "auto",
            },
          ],
        },
      ],
    });

    // TypeScript knows the structure of response
    console.log("✅ Media Prompt Success!");
    console.log("AI Response:", response.output_text);
    console.log("Tokens used:");
    console.dir(response.usage, { depth: null });
  } catch (error) {
    // Proper error handling with type guards
    if (error instanceof OpenAI.APIError) {
      console.log("❌ API Error:", error.status, error.message);
    } else if (error instanceof Error) {
      console.log("❌ Error:", error.message);
    } else {
      console.log("❌ Unknown error occurred");
    }
  }
}

// Run the test
mediaPrompt().catch((error) => {
  console.error("Error:", error);
});

Run It

pnpm tsx src/openai/media-prompt.ts

Expected Output

Testing OpenAI connection...
✅ Media Prompt Success!
AI Response: This image shows a marine iguana (Amblyrhynchus cristatus),
an iconic species endemic to the Galápagos Islands. These remarkable reptiles
are the world's only marine lizards, having evolved unique adaptations for
life in the coastal waters...

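Tokens used:
{
  input_tokens: 1245,
  output_tokens: 189,
  total_tokens: 1434
}

(The usage object also contains input_tokens_details and output_tokens_details, omitted here for brevity.)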

Key Concepts

1. Image Encoding with Base64

The image_url field accepts a base64-encoded data URL (a regular HTTPS URL to a hosted image also works):

// Read and encode the image
const base64Image = fs.readFileSync("./src/assets/lizard.jpg", "base64");

// Format as data URL
const imageUrl = `data:image/jpeg;base64,${base64Image}`;

Why base64?

Works with any image you can read into memory (local file, buffer, downloaded bytes)
No need for external hosting
Keeps everything in one API call
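
As a minimal sketch of the encoding step, the helpers below wrap in-memory bytes in a data URL and estimate the encoded size. The names toDataUrl and base64Length are illustrative and not part of the OpenAI SDK; the size estimate relies only on the fact that base64 turns every 3 bytes into 4 characters.

```typescript
// Hypothetical helpers for illustration; not part of the OpenAI SDK.

/** Wrap raw image bytes in a data URL with the given MIME type. */
function toDataUrl(bytes: Buffer, mimeType: string): string {
  return `data:${mimeType};base64,${bytes.toString("base64")}`;
}

/** Length of the base64 text for `byteCount` bytes (3 bytes -> 4 chars, padded). */
function base64Length(byteCount: number): number {
  return 4 * Math.ceil(byteCount / 3);
}

const sample = Buffer.from("hi");
console.log(toDataUrl(sample, "image/png")); // data:image/png;base64,aGk=
console.log(base64Length(3 * 1024 * 1024)); // 4194304
```

The trade-off is size: a 3 MB file becomes roughly 4 MB of text in the request body, which is worth keeping in mind when working near upload limits.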

2. Multimodal Content Array

OpenAI's vision API uses a content array to mix text and images:

content: [
  {
    type: "input_text",
    text: "Your text prompt here",
  },
  {
    type: "input_image",
    image_url: `data:image/jpeg;base64,${base64Image}`,
    detail: "auto", // Optional: "auto", "low", or "high"
  },
];

Content order can matter: placing the text before the image gives the model context for what to look for.
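
To make the text-first ordering explicit, a small builder can assemble the array for you. This is only a sketch: buildContent and the local types are hypothetical names that mirror the shapes used in this lesson, not types exported by the SDK.

```typescript
// Local types for illustration; they mirror the shapes used in this lesson.
type Detail = "auto" | "low" | "high";
type InputText = { type: "input_text"; text: string };
type InputImage = { type: "input_image"; image_url: string; detail: Detail };
type ContentPart = InputText | InputImage;

/** Build a content array with the text prompt first, followed by each image. */
function buildContent(
  text: string,
  imageUrls: string[],
  detail: Detail = "auto",
): ContentPart[] {
  const images = imageUrls.map(
    (image_url): InputImage => ({ type: "input_image", image_url, detail }),
  );
  return [{ type: "input_text", text }, ...images];
}

// Placeholder data URLs; real ones would carry the encoded image bytes.
const content = buildContent("Compare these two images:", [
  "data:image/jpeg;base64,AAAA",
  "data:image/jpeg;base64,BBBB",
]);
console.log(content.map((part) => part.type));
```

The returned array can be passed as the content of a user message. Keeping the ordering in one function means every call site sends the instructions before the images.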

3. Detail Parameter

Control image processing quality:

{
  type: "input_image",
  image_url: dataUrl,
  detail: "auto"  // Recommended - balances quality and cost
}

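Detail Level	Use Case	Approx. Token Cost
low	Simple recognition, icons	~85 tokens (fixed)
auto	Balanced (default); the model picks the level	Varies
high	Fine details, text in images	765 tokens for a 1024×1024 image; more for larger images

These figures follow the tile-based accounting documented for GPT-4o-class models. Other models may count image tokens differently, so treat them as rough guides.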
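
To see where figures like 85 and 765 come from, here is a sketch of the tile-based estimate. It assumes the accounting documented for GPT-4o-class models (85 base tokens plus 170 per 512px tile) and assumes images are only scaled down, never up. The function name is hypothetical, and the model used in this lesson may count image tokens differently.

```typescript
// Assumption: tile-based accounting as documented for GPT-4o-class models
// (85 base tokens + 170 per 512px tile). Other models may count differently.
const BASE_TOKENS = 85;
const TOKENS_PER_TILE = 170;

function estimateImageTokens(
  width: number,
  height: number,
  detail: "low" | "high",
): number {
  if (detail === "low") return BASE_TOKENS;

  // Step 1: shrink to fit inside a 2048 x 2048 square (never upscale).
  const fit = Math.min(1, 2048 / Math.max(width, height));
  let w = width * fit;
  let h = height * fit;

  // Step 2: shrink so the shortest side is at most 768px.
  const shrink = Math.min(1, 768 / Math.min(w, h));
  w *= shrink;
  h *= shrink;

  // Step 3: count the 512px tiles needed to cover the result.
  const tiles = Math.ceil(w / 512) * Math.ceil(h / 512);
  return BASE_TOKENS + TOKENS_PER_TILE * tiles;
}

console.log(estimateImageTokens(1024, 1024, "high")); // 765
console.log(estimateImageTokens(2048, 4096, "high")); // 1105
console.log(estimateImageTokens(1024, 1024, "low")); // 85
```

The response's usage data remains the authoritative count; an estimate like this is mainly useful for budgeting before a request is sent.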

4. System Instructions

Guide the AI's analysis approach:

{
  role: "system",
  content: "You are a naturalist. Provide detailed information about the flora and fauna in the image."
}

Best Practices:

Set expertise level (naturalist, historian, architect)
Define output format (bullet points, paragraphs)
Specify focus areas (identify species, describe composition)
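
The three practices above can be combined into one system message. The builder below is a sketch; buildSystemPrompt and its option names are hypothetical and only illustrate one way to keep prompts consistent across calls.

```typescript
// Hypothetical helper; the option names are illustrative.
interface SystemPromptOptions {
  expertise: string; // e.g. "naturalist", "historian", "architect"
  focus: string[]; // what the analysis should concentrate on
  format?: string; // optional output format, e.g. "bullet points"
}

function buildSystemPrompt({ expertise, focus, format }: SystemPromptOptions): string {
  const parts = [`You are a ${expertise}.`, `Focus on: ${focus.join(", ")}.`];
  if (format) parts.push(`Format the answer as ${format}.`);
  return parts.join(" ");
}

const systemMessage = {
  role: "system" as const,
  content: buildSystemPrompt({
    expertise: "naturalist",
    focus: ["species identification", "ecological interactions"],
    format: "bullet points",
  }),
};
console.log(systemMessage.content);
```

The resulting object has the same shape as the system entry in the input array, so it can be dropped in as the first message.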

Image Requirements

Supported Formats

✅ JPEG (.jpg, .jpeg)
✅ PNG (.png)
✅ WebP (.webp)
✅ GIF (.gif, non-animated)
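
Because the data URL prefix must match the actual format, it can help to derive the MIME type from the file name instead of hard-coding image/jpeg. The lookup below is a sketch covering only the formats listed above; mimeTypeFor is a hypothetical helper, and it cannot tell whether a GIF is animated.

```typescript
// Extensions from the supported-formats list, mapped to their MIME types.
const MIME_TYPES: Record<string, string> = {
  ".jpg": "image/jpeg",
  ".jpeg": "image/jpeg",
  ".png": "image/png",
  ".webp": "image/webp",
  ".gif": "image/gif",
};

/** Hypothetical helper: return the MIME type for a file name, or throw. */
function mimeTypeFor(fileName: string): string {
  const dot = fileName.lastIndexOf(".");
  const ext = dot === -1 ? "" : fileName.slice(dot).toLowerCase();
  const mime = MIME_TYPES[ext];
  if (!mime) {
    throw new Error(`Unsupported image format: "${ext || fileName}"`);
  }
  return mime;
}

console.log(mimeTypeFor("lizard.JPG")); // image/jpeg
console.log(mimeTypeFor("diagram.png")); // image/png
```

Checking the extension catches obvious mistakes early, though a renamed file can still carry the wrong bytes.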

Size Limits

Max file size: 20MB
Recommended: Under 5MB for faster processing
Resolution: Automatically resized if too large

Optimization Tips

// Using sharp for optimization
import sharp from "sharp";

const optimizedImage = await sharp("original.jpg")
  .resize(1024, 1024, { fit: "inside" })
  .jpeg({ quality: 80 })
  .toBuffer();

const base64Image = optimizedImage.toString("base64");

Response Structure

Accessing the Response

// Simple direct access
const analysis = response.output_text;

// With token tracking (usage is optional in the SDK types, hence the ?.)
console.log("Input tokens:", response.usage?.input_tokens);
console.log("Output tokens:", response.usage?.output_tokens);
console.log("Total tokens:", response.usage?.total_tokens);

Token Costs

Vision prompts consume more tokens because the image itself is converted to tokens:

Text-only prompt: ~10 tokens
Image (auto detail): ~1000+ tokens
Total: ~1400+ tokens for this example (1245 input + 189 output)

Cost breakdown (GPT-5-nano pricing):

Input: 1245 tokens × $0.XX/1M = $0.00XX
Output: 189 tokens × $0.XX/1M = $0.00XX
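
The rates above are left as placeholders because pricing changes over time. The arithmetic itself is simple, and the sketch below shows it with the rates passed in by the caller. The function name and the demo rates are made up for illustration; the input_tokens and output_tokens field names match the usage object returned by the Responses API.

```typescript
// The per-million-token rates are supplied by the caller; look up current
// pricing for your model before relying on the result.
interface Usage {
  input_tokens: number;
  output_tokens: number;
}
interface Rates {
  inputPerMillion: number; // USD per 1M input tokens
  outputPerMillion: number; // USD per 1M output tokens
}

function estimateCostUsd(usage: Usage, rates: Rates): number {
  const inputCost = (usage.input_tokens / 1_000_000) * rates.inputPerMillion;
  const outputCost = (usage.output_tokens / 1_000_000) * rates.outputPerMillion;
  return inputCost + outputCost;
}

// Token counts from this lesson's example; the rates are NOT real prices.
const cost = estimateCostUsd(
  { input_tokens: 1245, output_tokens: 189 },
  { inputPerMillion: 1, outputPerMillion: 2 },
);
console.log(cost.toFixed(6)); // 0.001623
```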

Error Handling

Common Errors

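// `request` stands for the same payload passed to responses.create in the full example
try {
  const response = await openai.responses.create(request);
  console.log(response.output_text);
} catch (error) {
  if (error instanceof OpenAI.APIError) {
    switch (error.status) {
      case 400:
        console.log("Bad request (often an invalid image format or encoding)");
        break;
      case 401:
        console.log("Invalid API key");
        break;
      case 413:
        console.log("Request payload too large (image over the size limit)");
        break;
      case 429:
        console.log("Rate limit exceeded");
        break;
      default:
        console.log("API Error:", error.status, error.message);
    }
  } else {
    // Not an API error (for example, a file system failure); let it propagate
    throw error;
  }
}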
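
Of the errors above, only rate limits (and transient server errors) are worth retrying; a bad image or API key will fail the same way every time. The sketch below shows one way to encode that decision with exponential backoff. The function names are hypothetical, and the delays are arbitrary starting points. The official SDK also has its own maxRetries client option, so a wrapper like this is mainly useful when you want custom behavior.

```typescript
/** Statuses worth retrying: rate limits and transient server errors. */
function isRetryable(status: number | undefined): boolean {
  return status === 429 || (status !== undefined && status >= 500);
}

/** Exponential backoff: 500ms, 1000ms, 2000ms, ... capped at 8 seconds. */
function backoffDelayMs(attempt: number): number {
  return Math.min(500 * 2 ** attempt, 8000);
}

/** Run `fn`, retrying retryable failures up to `maxRetries` times. */
async function withRetry<T>(fn: () => Promise<T>, maxRetries = 3): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (error) {
      // API errors carry an HTTP status; other errors have none and are rethrown.
      const status = (error as { status?: number } | null)?.status;
      if (attempt >= maxRetries || !isRetryable(status)) throw error;
      await new Promise((resolve) => setTimeout(resolve, backoffDelayMs(attempt)));
    }
  }
}

// Usage (assuming `openai` and a request payload as in the main example):
// const response = await withRetry(() => openai.responses.create(request));
```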

Validation Before Sending

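// Path from the main example; fs is imported as shown there
const imagePath = "./src/assets/lizard.jpg";

// Check file exists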
if (!fs.existsSync(imagePath)) {
  throw new Error("Image file not found");
}

// Check file size
const stats = fs.statSync(imagePath);
if (stats.size > 20 * 1024 * 1024) {
  throw new Error("Image too large (max 20MB)");
}

// Read and encode
const base64Image = fs.readFileSync(imagePath, "base64");

Practice Exercises

Try modifying the code:

1. Change the Analysis Focus

{
  role: "system",
  content: "You are a travel photographer. Describe the composition, lighting, and artistic elements of this image."
}

2. Add Multiple Images

content: [
  { type: "input_text", text: "Compare these two images:" },
  { type: "input_image", image_url: `data:image/jpeg;base64,${image1}` },
  { type: "input_image", image_url: `data:image/jpeg;base64,${image2}` },
];

3. Extract Specific Information

{
  type: "input_text",
  text: "List only: 1) Species name, 2) Habitat type, 3) Notable features"
}

4. Request Structured Output

{
  type: "input_text",
  text: "Respond in JSON format with keys: species, habitat, behaviors, conservation_status"
}
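
Asking for JSON in the prompt does not guarantee valid JSON in the reply, so it is worth parsing defensively. The sketch below checks for the four keys named in the prompt and tolerates a Markdown code fence around the JSON. parseSpeciesReport is a hypothetical helper, and the sample reply is made up for the demo.

```typescript
// Keys requested in the prompt above.
const REQUIRED_KEYS = ["species", "habitat", "behaviors", "conservation_status"] as const;
type SpeciesReport = Record<(typeof REQUIRED_KEYS)[number], unknown>;

/** Hypothetical helper: parse the model's reply and verify the expected keys. */
function parseSpeciesReport(text: string): SpeciesReport {
  // Models sometimes wrap JSON in a Markdown code fence; strip it if present.
  const cleaned = text
    .trim()
    .replace(/^`{3}(?:json)?\s*/i, "")
    .replace(/\s*`{3}$/, "");

  const parsed: unknown = JSON.parse(cleaned); // throws SyntaxError on invalid JSON
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("Expected a JSON object");
  }
  for (const key of REQUIRED_KEYS) {
    if (!(key in parsed)) throw new Error(`Missing key: ${key}`);
  }
  return parsed as SpeciesReport;
}

// Made-up sample reply standing in for response.output_text.
const sampleReply =
  '{"species":"Marine iguana","habitat":"Rocky coast","behaviors":["basking"],"conservation_status":"Vulnerable"}';
console.log(parseSpeciesReport(sampleReply).species); // Marine iguana
```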

Comparison: Text-Only vs Vision

Text-Only Prompt (Module 1)

const response = await openai.responses.create({
  model: "gpt-5-nano",
  input: "Tell me about marine iguanas",
});
// Input: ~10 tokens (output length varies with the answer)

Vision Prompt (This Lesson)

const response = await openai.responses.create({
  model: "gpt-5-nano",
  input: [
    { role: "system", content: "You are a naturalist" },
    {
      role: "user",
      content: [
        { type: "input_text", text: "Analyze this image" },
        { type: "input_image", image_url: imageData },
      ],
    },
  ],
});
// Tokens: ~1400+ total

Key Difference: Vision adds ~1000+ tokens for image processing.

Use Cases

1. Travel Photo Analysis

content: [
  {
    type: "input_text",
    text: "Identify this landmark and provide interesting facts for tourists",
  },
  {
    type: "input_image",
    image_url: landmarkPhoto,
  },
];

2. Wildlife Identification

content: [
  {
    type: "input_text",
    text: "Identify the species and provide conservation status",
  },
  {
    type: "input_image",
    image_url: animalPhoto,
  },
];

3. Document Analysis

content: [
  {
    type: "input_text",
    text: "Extract all text and data from this document",
  },
  {
    type: "input_image",
    image_url: documentScan,
    detail: "high", // Use high detail for text extraction
  },
];

Key Takeaways

✅ OpenAI uses input_image type with base64 data URLs
✅ Mix text and images in content arrays
✅ detail parameter controls quality vs cost tradeoff
✅ System prompts guide the analysis approach
✅ Vision prompts use significantly more tokens than text-only prompts
✅ Response access is simple with output_text

Next Steps

Now you've seen OpenAI's approach to image analysis. Let's compare with Anthropic's vision API!

Next: Lesson 2b - Anthropic Media Prompts →

Quick Reference

Minimal Working Example

import OpenAI from "openai";
import fs from "fs";

const openai = new OpenAI();

const base64Image = fs.readFileSync("image.jpg", "base64");

const response = await openai.responses.create({
  model: "gpt-5-nano",
  input: [
    {
      role: "user",
      content: [
        { type: "input_text", text: "What's in this image?" },
        {
          type: "input_image",
          image_url: `data:image/jpeg;base64,${base64Image}`,
        },
      ],
    },
  ],
});

console.log(response.output_text);

Common Pitfalls

❌ Forgetting data:image/jpeg;base64, prefix
❌ Using raw file path instead of base64
❌ Images over 20MB without compression
❌ Not handling API errors for invalid images
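
The first two pitfalls can be caught before a request is sent. The check below is a sketch with a hypothetical name (isImageDataUrl); it verifies the prefix and the base64 alphabet only, not that the bytes decode to a valid image.

```typescript
// Hypothetical pre-flight check for the first two pitfalls.
const DATA_URL_PATTERN =
  /^data:image\/(jpeg|png|webp|gif);base64,[A-Za-z0-9+/]+={0,2}$/;

function isImageDataUrl(value: string): boolean {
  return DATA_URL_PATTERN.test(value);
}

console.log(isImageDataUrl("data:image/jpeg;base64,aGk=")); // true
console.log(isImageDataUrl("aGk=")); // false (missing prefix)
console.log(isImageDataUrl("./src/assets/lizard.jpg")); // false (raw file path)
```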

Completed Lesson 2a! You can now analyze images with OpenAI's Vision API. 🎉
