Lesson 2c Gemini Media | Module 3

Lesson 2c: Media Prompts with Google Gemini

Learn how to analyze images with Google Gemini's multimodal API through the @google/genai SDK. This lesson demonstrates Gemini's streamlined approach: images are passed as inlineData parts, and the reply text is read directly from response.text.

What It Does

Sends an image together with text instructions to Gemini, then prints the model's analysis along with token usage. The Flash model used here is tuned for speed and cost efficiency in multimodal tasks.

Key Differences from OpenAI & Anthropic

inlineData Structure: Images use inlineData with mimeType
Parts Array: Content is organized as parts array
Simple Response: Direct access via response.text
System in Config: System instruction is in config object

Code Example

The complete code is in src/gemini/media-prompt.ts:

import { GoogleGenAI, ApiError } from "@google/genai";
import dotenv from "dotenv";
import fs from "fs";

// Load environment variables
dotenv.config();

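// Create the Gemini client. With an empty options object, the SDK is expected
// to read the API key from the environment (GEMINI_API_KEY), loaded above by dotenv.
const gemini = new GoogleGenAI({});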

// Async function with proper return type
async function mediaPrompt(): Promise<void> {
  try {
    console.log("Testing Gemini connection...");
    // Read image file and convert to base64
    const base64Image = fs.readFileSync("./src/assets/lizard.jpg", "base64");

    // Make the API call; the SDK provides TypeScript types for the response.
    // The user prompt and the image go in `contents` (note the explicit role),
    // while the system prompt goes in `config.systemInstruction`.
    const response = await gemini.models.generateContent({
      model: "gemini-3-flash-preview",
      contents: {
        role: "user",
        parts: [
          {
            text: "Analyze the image and describe the species present, their behaviors, and any notable ecological interactions. The photo was taken in the Galápagos Islands.",
          },
          {
            inlineData: {
              mimeType: "image/jpeg",
              data: base64Image,
            },
          },
        ],
      },
      config: {
        systemInstruction:
          "You are a naturalist. Provide detailed information about the flora and fauna in the image.",
      },
    });

    console.log("✅ Media Prompt Success!");
    // show response usage
    console.log("Tokens used:");
    console.dir(response.usageMetadata, { depth: null });

    // Check if we got a response
    if (!response.text || response.text.length === 0) {
      throw new Error("No content in response");
    }

    // TypeScript knows the structure of response
    console.log("AI Response:", response.text);
  } catch (error) {
    // Proper error handling with type guards
    if (error instanceof ApiError) {
      console.log("❌ API Error:", error.status, error.message);
    } else if (error instanceof Error) {
      console.log("❌ Error:", error.message);
    } else {
      console.log("❌ Unknown error occurred");
    }
  }
}

// Run the test
mediaPrompt().catch((error) => {
  console.error("Error:", error);
});

Run It

pnpm tsx src/gemini/media-prompt.ts

Expected Output

Testing Gemini connection...
✅ Media Prompt Success!
Tokens used:
{
  promptTokenCount: 1156,
  candidatesTokenCount: 198,
  totalTokenCount: 1354
}
AI Response: This image shows a marine iguana, scientifically known as
Amblyrhynchus cristatus. These unique reptiles are endemic to the Galápagos
Islands and are the only lizards in the world that have adapted to a marine
lifestyle...

Key Concepts

1. InlineData Structure

Gemini uses inlineData for embedded images:

{
  inlineData: {
    mimeType: "image/jpeg",  // MIME type
    data: base64Image        // Raw base64 (no prefix)
  }
}

Simple and clean - just specify MIME type and data.
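
Because OpenAI expects a data URL while Gemini expects raw base64, it is easy to pass the wrong form when porting code. The sketch below shows one way to accept either form and produce an inlineData part. The names toInlineDataPart and InlineDataPart are illustrative assumptions, not part of the @google/genai SDK.

```typescript
// Hypothetical helper (not part of the SDK): normalizes a raw base64 string
// or a data URL into the inlineData shape shown above.
interface InlineDataPart {
  inlineData: { mimeType: string; data: string };
}

function toInlineDataPart(
  input: string,
  fallbackMimeType = "image/jpeg"
): InlineDataPart {
  // Matches "data:<mime>;base64,<payload>"
  const match = /^data:([^;,]+);base64,([\s\S]*)$/.exec(input);
  if (match) {
    // Data URL: take the MIME type from the prefix and keep only the payload
    return { inlineData: { mimeType: match[1], data: match[2] } };
  }
  // Already raw base64: use the caller-supplied MIME type
  return { inlineData: { mimeType: fallbackMimeType, data: input } };
}
```

With this in place, the same image string can be reused across providers without remembering which one wants the prefix.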

2. Parts Array

Content is organized as a parts array:

contents: {
  role: "user",
  parts: [
    { text: "Your prompt here" },
    { inlineData: { mimeType: "image/jpeg", data: base64Image } }
  ]
}

Flexible: Mix text and images in any order.
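
To make the parts layout concrete, here is a minimal sketch that assembles a user message from one prompt and any number of images. The Part and ImageInput types and the buildUserContent function are assumptions made for illustration; the SDK ships its own types for the same shapes.

```typescript
// Illustrative local types mirroring the shapes used in this lesson
type Part =
  | { text: string }
  | { inlineData: { mimeType: string; data: string } };

interface ImageInput {
  mimeType: string;
  data: string; // raw base64, no data URL prefix
}

// Hypothetical helper: the text prompt first, then one inlineData part per image
function buildUserContent(
  prompt: string,
  images: ImageInput[]
): { role: "user"; parts: Part[] } {
  const parts: Part[] = [{ text: prompt }];
  for (const image of images) {
    parts.push({ inlineData: { mimeType: image.mimeType, data: image.data } });
  }
  return { role: "user", parts };
}
```

The returned object has the same shape as the contents value passed to generateContent in the Code Example.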

3. MIME Types

Specify the correct MIME type:

File Format	MIME Type
JPEG	`image/jpeg`
PNG	`image/png`
WebP	`image/webp`
GIF	`image/gif`

import path from "path";

// Helper function: map a file extension to its MIME type (falls back to JPEG)
const getMimeType = (filepath: string): string => {
  const ext = path.extname(filepath).toLowerCase();
  const mimeTypes: Record<string, string> = {
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".webp": "image/webp",
    ".gif": "image/gif",
  };
  return mimeTypes[ext] || "image/jpeg";
};
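
File extensions can be wrong (a PNG saved as .jpg, for example), so an alternative is to inspect the first bytes of the file. The sketch below checks the well-known signatures for the four formats listed above. The function name sniffImageMimeType is an assumption for illustration.

```typescript
// Hypothetical helper: detect the image format from its leading "magic" bytes.
// Returns undefined when none of the four supported signatures match.
function sniffImageMimeType(bytes: Uint8Array): string | undefined {
  const startsWith = (signature: number[], offset = 0): boolean =>
    signature.every((value, i) => bytes[offset + i] === value);

  if (startsWith([0xff, 0xd8, 0xff])) return "image/jpeg";
  if (startsWith([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a])) return "image/png";
  if (startsWith([0x47, 0x49, 0x46, 0x38])) return "image/gif"; // "GIF8"
  // WebP files start with "RIFF", followed by a 4-byte size, then "WEBP"
  if (startsWith([0x52, 0x49, 0x46, 0x46]) && startsWith([0x57, 0x45, 0x42, 0x50], 8)) {
    return "image/webp";
  }
  return undefined;
}
```

A Node Buffer is a Uint8Array, so the result of fs.readFileSync(filepath) can be passed in directly.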

4. System Instructions in Config

System instruction is separate from content:

const response = await gemini.models.generateContent({
  model: "gemini-3-flash-preview",
  contents: {
    /* user content */
  },
  config: {
    systemInstruction: "You are a naturalist...",
  },
});

Clean separation between user content and system configuration.

5. Simple Response Access

Direct text access with response.text:

// No filtering needed!
console.log(response.text);

// Token metadata
console.log(response.usageMetadata);

Most straightforward of all three providers.

Three-Way Comparison

OpenAI

{
  type: "input_image",
  image_url: `data:image/jpeg;base64,${base64}`,
  detail: "auto"
}
// Response: response.output_text

Anthropic

{
  type: "image",
  source: {
    type: "base64",
    media_type: "image/jpeg",
    data: base64
  }
}
// Response: response.content.filter(...).map(...).join()

Gemini (Simplest!)

{
  inlineData: {
    mimeType: "image/jpeg",
    data: base64
  }
}
// Response: response.text
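
The three shapes above differ only in how they wrap the same two values (MIME type and base64 data), so they can be produced from one function. This is a sketch; the Provider type and toProviderImagePart are hypothetical names, and the field names are copied from the snippets in this section.

```typescript
type Provider = "openai" | "anthropic" | "gemini";

// Hypothetical helper: wrap one image in the request shape each provider expects
function toProviderImagePart(
  provider: Provider,
  mimeType: string,
  base64: string
): Record<string, unknown> {
  switch (provider) {
    case "openai":
      // OpenAI takes a data URL
      return {
        type: "input_image",
        image_url: `data:${mimeType};base64,${base64}`,
        detail: "auto",
      };
    case "anthropic":
      // Anthropic takes raw base64 inside a typed source object
      return {
        type: "image",
        source: { type: "base64", media_type: mimeType, data: base64 },
      };
    case "gemini":
      // Gemini takes raw base64 inside inlineData
      return { inlineData: { mimeType, data: base64 } };
  }
}
```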

Image Requirements

Supported Formats

✅ JPEG (.jpg, .jpeg)
✅ PNG (.png)
✅ WebP (.webp)
✅ GIF (.gif)

Size Limits

Max file size: ~20MB (base64 encoding adds about one third to the on-disk size, so leave headroom)
Recommended: Under 5MB for optimal speed
Resolution: Automatically optimized
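
Since the image travels as base64, the payload is larger than the file on disk. The sketch below computes the encoded size and checks it against a limit. It assumes the ~20MB figure applies to the encoded payload, which is worth confirming against the current Gemini documentation; base64Length and fitsInlineLimit are illustrative names.

```typescript
// Base64 encodes every 3 input bytes as 4 output characters (with padding)
function base64Length(byteLength: number): number {
  return 4 * Math.ceil(byteLength / 3);
}

// Assumption: the limit is compared against the encoded size, not the file size
function fitsInlineLimit(
  byteLength: number,
  limitBytes = 20 * 1024 * 1024
): boolean {
  return base64Length(byteLength) <= limitBytes;
}
```

Under that assumption a 15MB file encodes to exactly 20MB, so files much above 15MB are better resized first.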

Optimization Example

import sharp from "sharp";

// Optimize for Gemini
const optimizedImage = await sharp("large-image.jpg")
  .resize(2048, 2048, { fit: "inside" })
  .jpeg({ quality: 85 })
  .toBuffer();

const base64Image = optimizedImage.toString("base64");

Token Usage and Costs

Understanding Gemini's Token Metadata

// usageMetadata is optional on the response type, so use optional chaining
console.log("Prompt tokens:", response.usageMetadata?.promptTokenCount);
console.log("Response tokens:", response.usageMetadata?.candidatesTokenCount);
console.log("Total tokens:", response.usageMetadata?.totalTokenCount);

Token Breakdown:

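Text prompt: ~40 tokens
Image processing: ~1100+ tokens
System instruction: ~15 tokens
Prompt subtotal: ~1156 tokens (promptTokenCount)
Response: ~198 tokens (candidatesTokenCount)
Total: ~1350+ tokens (totalTokenCount)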

Cost Calculation (Gemini 3 Flash)

Input: 1156 tokens × $0.XX/1M = $0.00XX
Output: 198 tokens × $0.XX/1M = $0.00XX
Total: Very cost-effective!

Gemini Flash is optimized for speed and cost efficiency.
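
The calculation above can be written as a small function. Prices change and are left as placeholders in this lesson, so the sketch takes the per-million-token rates as parameters rather than hard-coding them. UsageLike, Rates, and estimateCostUsd are hypothetical names; the token field names match usageMetadata.

```typescript
// Subset of usageMetadata that the estimate needs
interface UsageLike {
  promptTokenCount?: number;
  candidatesTokenCount?: number;
}

// Caller-supplied prices in USD per one million tokens (look these up yourself)
interface Rates {
  inputPerMillionUsd: number;
  outputPerMillionUsd: number;
}

function estimateCostUsd(usage: UsageLike, rates: Rates): number {
  const inputTokens = usage.promptTokenCount ?? 0;
  const outputTokens = usage.candidatesTokenCount ?? 0;
  return (
    (inputTokens / 1e6) * rates.inputPerMillionUsd +
    (outputTokens / 1e6) * rates.outputPerMillionUsd
  );
}
```

Passing response.usageMetadata ?? {} keeps the call safe when the metadata is missing.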

Configuration Options

Basic Configuration

config: {
  systemInstruction: "You are a helpful assistant",
  temperature: 0.7,
  maxOutputTokens: 1000,
  topP: 0.95,
  topK: 40
}

Temperature Control

// Precise, factual (travel guides, nature docs)
config: {
  temperature: 0.2,
}

// Balanced (general analysis)
config: {
  temperature: 0.7,
}

// Creative (artistic descriptions)
config: {
  temperature: 1.2,
}
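
One way to keep these choices consistent across an application is a small preset table. This is only a sketch using the three temperature values from this section; Task, TaskConfig, and configForTask are names invented for the example.

```typescript
type Task = "factual" | "balanced" | "creative";

interface TaskConfig {
  temperature: number;
  systemInstruction?: string;
}

// Temperature values taken from the three examples in this section
const TEMPERATURE_BY_TASK: Record<Task, number> = {
  factual: 0.2,
  balanced: 0.7,
  creative: 1.2,
};

// Hypothetical helper: build a config fragment for a given kind of task
function configForTask(task: Task, systemInstruction?: string): TaskConfig {
  const config: TaskConfig = { temperature: TEMPERATURE_BY_TASK[task] };
  if (systemInstruction) {
    config.systemInstruction = systemInstruction;
  }
  return config;
}
```

The result can be passed as the config value of a generateContent call, for example config: configForTask("factual", "You are a naturalist.").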

Token Limits

// Short responses
config: {
  maxOutputTokens: 500,
}

// Medium responses
config: {
  maxOutputTokens: 1000,
}

// Long analyses
config: {
  maxOutputTokens: 2000,
}

Error Handling

Common Errors

try {
  const response = await gemini.models.generateContent({
    /* model, contents, config */
  });
} catch (error) {
  if (error instanceof ApiError) {
    switch (error.status) {
      case 400:
        console.log("Invalid request - check image format");
        break;
      case 401:
      case 403:
        console.log("Invalid API key or missing permissions");
        break;
      case 413:
        console.log("Image too large");
        break;
      case 429:
        console.log("Rate limit exceeded");
        break;
      case 503:
        console.log("Service temporarily unavailable");
        break;
      default:
        console.log("API Error:", error.message);
    }
  } else {
    // Not an API error (e.g. a missing file): rethrow instead of swallowing it
    throw error;
  }
}
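
Of the statuses above, 429 and 503 are usually transient, so a retry with exponential backoff is a common response. The sketch below is a generic wrapper, not an SDK feature: isRetryableStatus, backoffDelayMs, and withRetry are hypothetical helpers, and the delay values are arbitrary defaults. It relies only on the thrown error exposing a numeric status, as ApiError does in the code above.

```typescript
// Treat rate limiting and temporary unavailability as retryable
function isRetryableStatus(status: number): boolean {
  return status === 429 || status === 503;
}

// Exponential backoff: base, 2x base, 4x base, ... capped at capMs
function backoffDelayMs(attempt: number, baseMs: number, capMs: number): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Run fn, retrying retryable failures up to maxAttempts total attempts
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 4,
  baseMs = 500,
  capMs = 8000
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (error) {
      const status =
        typeof error === "object" && error !== null && "status" in error
          ? (error as { status: unknown }).status
          : undefined;
      const retryable = typeof status === "number" && isRetryableStatus(status);
      // Give up on non-retryable errors or when attempts are exhausted
      if (!retryable || attempt >= maxAttempts - 1) throw error;
      const delay = backoffDelayMs(attempt, baseMs, capMs);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

A call would then look like withRetry(() => gemini.models.generateContent(request)), where request stands for the object you would normally pass.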

Validation Best Practices

// Validate before sending
const validateImage = (imagePath: string) => {
  // 1. File exists?
  if (!fs.existsSync(imagePath)) {
    throw new Error("Image not found");
  }

  // 2. Size check
  const stats = fs.statSync(imagePath);
  if (stats.size > 20 * 1024 * 1024) {
    throw new Error("Image too large (max 20MB)");
  }

  // 3. Valid format?
  const ext = path.extname(imagePath).toLowerCase();
  const validExts = [".jpg", ".jpeg", ".png", ".webp", ".gif"];
  if (!validExts.includes(ext)) {
    throw new Error(`Unsupported format: ${ext}`);
  }

  return true;
};

Practice Exercises

1. Multi-Image Comparison

parts: [
  { text: "Compare these travel destinations:" },
  { inlineData: { mimeType: "image/jpeg", data: image1Base64 } },
  { text: "vs" },
  { inlineData: { mimeType: "image/jpeg", data: image2Base64 } },
];

2. Structured Output Request

parts: [
  {
    text: `Analyze this wildlife image and respond in JSON format:
{
  "species": "scientific name",
  "habitat": "description",
  "behaviors": ["list", "of", "behaviors"],
  "threats": ["conservation", "concerns"]
}`,
  },
  { inlineData: { mimeType: "image/jpeg", data: imageData } },
];
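
When JSON is requested through the prompt alone, models often wrap the reply in a Markdown code fence, which makes JSON.parse fail on the raw text. The sketch below strips one surrounding fence before parsing. extractJson is a hypothetical helper, and the returned value still needs validation against the expected fields.

```typescript
// Three backticks, built at runtime so this snippet can itself sit inside a fence
const FENCE = "`".repeat(3);

// Hypothetical helper: parse JSON from a reply that may be wrapped in a code fence
function extractJson(text: string): unknown {
  let body = text.trim();
  if (body.startsWith(FENCE)) {
    // Drop the opening line (fence plus optional language tag) and the closing fence
    const firstNewline = body.indexOf("\n");
    const lastFence = body.lastIndexOf(FENCE);
    if (firstNewline !== -1 && lastFence > firstNewline) {
      body = body.slice(firstNewline + 1, lastFence).trim();
    }
  }
  // Throws a SyntaxError if the remaining text is not valid JSON
  return JSON.parse(body);
}
```

Typical use is extractJson(response.text ?? "") inside a try/catch, so that malformed replies can be retried or reported.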

3. Detailed Travel Analysis

config: {
  systemInstruction: "You are an enthusiastic travel guide"
},
contents: {
  role: "user",
  parts: [
    {
      text: "Identify this landmark and provide: 1) Historical significance, 2) Best visiting times, 3) Nearby attractions"
    },
    { inlineData: { mimeType: "image/jpeg", data: landmarkImage } }
  ]
}

4. Safety Assessment

config: {
  systemInstruction: "You are a safety inspector. Be thorough but not alarmist."
},
contents: {
  role: "user",
  parts: [
    { text: "Identify any safety concerns in this environment" },
    { inlineData: { mimeType: "image/jpeg", data: sceneImage } }
  ]
}

Gemini's Strengths

1. Speed

Gemini Flash is optimized for fast responses:

Typical response times (approximate; they vary with model, image size, and network conditions):
Text-only: ~1-2 seconds
With image: ~2-4 seconds

2. Cost Efficiency

Great balance of quality and cost:

// High-volume applications
// (analyzeWithGemini is a placeholder for your own wrapper around generateContent)
for (const image of imagesBatch) {
  const analysis = await analyzeWithGemini(image);
  // Cost per image: ~$0.001-0.002
}

3. Simple API Design

Cleanest response access:

// OpenAI
const openaiText = openaiResponse.output_text;

// Anthropic
const anthropicText = anthropicResponse.content
  .filter((b) => b.type === "text")
  .map((b) => b.text)
  .join("\n");

// Gemini (simplest!)
const geminiText = geminiResponse.text;

Use Cases

1. Travel App - Quick Landmark ID

const identifyLandmark = async (imageBase64: string) => {
  const response = await gemini.models.generateContent({
    model: "gemini-3-flash-preview",
    contents: {
      role: "user",
      parts: [
        { text: "Identify this landmark in 2-3 sentences" },
        { inlineData: { mimeType: "image/jpeg", data: imageBase64 } },
      ],
    },
    config: { maxOutputTokens: 200 },
  });

  return response.text;
};

2. Wildlife Spotter App

const identifyAnimal = async (photoBase64: string, location: string) => {
  const response = await gemini.models.generateContent({
    model: "gemini-3-flash-preview",
    contents: {
      role: "user",
      parts: [
        {
          text: `Identify the animal species. Location: ${location}. Include conservation status.`,
        },
        { inlineData: { mimeType: "image/jpeg", data: photoBase64 } },
      ],
    },
    config: {
      systemInstruction:
        "You are a wildlife expert. Be concise but informative.",
    },
  });

  return response.text;
};

3. Accessibility Alt Text Generator

const generateAltText = async (imageBase64: string) => {
  const response = await gemini.models.generateContent({
    model: "gemini-3-flash-preview",
    contents: {
      role: "user",
      parts: [
        {
          text: "Create a concise, descriptive alt text for screen readers (max 125 characters)",
        },
        { inlineData: { mimeType: "image/jpeg", data: imageBase64 } },
      ],
    },
    config: { maxOutputTokens: 100 },
  });

  // response.text can be undefined, so fall back to an empty string
  return (response.text ?? "").slice(0, 125);
};
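
A hard slice at 125 characters can cut a word in half. As an alternative, the sketch below collapses whitespace and truncates at the last word boundary that fits, appending an ellipsis. truncateAltText is a hypothetical helper, and the 125-character default simply mirrors the limit used above.

```typescript
// Hypothetical helper: shorten text to maxLength without splitting a word
function truncateAltText(text: string, maxLength = 125): string {
  // Collapse newlines and repeated spaces so the limit applies to visible text
  const clean = text.replace(/\s+/g, " ").trim();
  if (clean.length <= maxLength) return clean;

  // Reserve one character for the ellipsis
  const cut = clean.slice(0, maxLength - 1);
  const lastSpace = cut.lastIndexOf(" ");
  // If there is no space at all, fall back to a hard cut
  const base = lastSpace > 0 ? cut.slice(0, lastSpace) : cut;
  return base.trimEnd() + "…";
}
```

In generateAltText, the last line could then return truncateAltText(response.text ?? "") instead of slicing.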

Provider Selection Guide

Choose Gemini When:

✅ Speed matters - Real-time applications
✅ High volume - Processing many images
✅ Cost-conscious - Budget-friendly option
✅ Simple integration - Easy API design
✅ Google ecosystem - Already using GCP/Firebase

Choose OpenAI When:

✅ Detail control needed (detail parameter)
✅ Existing OpenAI integration
✅ GPT-4 level quality required

Choose Anthropic When:

✅ Detailed reasoning needed
✅ Safety-critical applications
✅ Long, nuanced analysis required

Key Takeaways

✅ Gemini uses inlineData with mimeType
✅ Content organized as parts array
✅ System instruction in separate config object
✅ Simplest response access with .text
✅ Excellent speed/cost balance
✅ Token metadata via usageMetadata
✅ Clean, straightforward API design

Module Complete!

Congratulations! You've now mastered image analysis across all three major AI providers:

OpenAI - Flexible, detail control
Anthropic - Detailed, thoughtful analysis
Gemini - Fast, cost-effective

What's Next?

You can now:

✅ Choose the right provider for each task
✅ Implement multi-provider fallback systems
✅ Build production multimodal applications
✅ Optimize for cost vs quality tradeoffs

Quick Reference

Minimal Working Example

import { GoogleGenAI } from "@google/genai";
import fs from "fs";

const gemini = new GoogleGenAI({});
const base64Image = fs.readFileSync("image.jpg", "base64");

const response = await gemini.models.generateContent({
  model: "gemini-3-flash-preview",
  contents: {
    role: "user",
    parts: [
      { text: "What's in this image?" },
      { inlineData: { mimeType: "image/jpeg", data: base64Image } },
    ],
  },
});

console.log(response.text);

Common Pitfalls

❌ Wrong mimeType for file format
❌ Forgetting raw base64 (no data URL prefix)
❌ Not handling ApiError properly
❌ Accessing response.output instead of response.text

Completed Module 3! You're now a multimodal AI expert! 🎉🖼️
