[GDE] Building a Digital Docent: Master Agentic Vision with Gemini 3

#ai #agenticvision #tutorial #gemini
Connie Leung


The Pain Point: The Invisible World of the "Granular" Masterpiece
When a postcard miniaturizes a traditional Chinese painting, the reduced scale obscures details. Consider a standard postcard depicting workers and horses pulling wagons in a crowded street during the Qing dynasty. Adults and children walk on a busy bridge while shops on both sides sell goods to pedestrians. For aging viewers, these granular details—a few millimeters wide—become an inaccessible blur.

Chinese art's "density" becomes a barrier, turning detailed scenes into indistinct shapes. When the eyes no longer resolve fine brushstrokes, the artwork loses its narrative.

The Solution: Agentic Vision as a Digital Docent
Gemini 3's Agentic Vision bridges the gap between the postcard's physical limitations and the original art's vast detail. It surpasses basic image understanding using a Think-Act-Observe loop to look where aging eyes struggle.

Given a command—like "Zoom in on the postcard to find the figures on the bridge"—the model becomes an active investigator:

  • Think: It scans the painting's complex composition, identifying the requested detail's coordinates amidst mountains, trees, buildings, and people.

  • Act: It autonomously writes and executes Python code to "crop and enhance," isolating the area with high accuracy.

  • Observe: It reviews the cropped image, reasons about what it sees, and answers the prompt before presenting the results to the user.

The Result: Improved Visual Accessibility
Agentic Vision enables users with low vision to resolve fine details. Users can "walk through" a Chinese painting, moving from broad landscape strokes to fine details of a face or bird.

This tool does more than zoom; it makes art accessible by clarifying details, regardless of postcard size.

This post walks through the Colab notebook to inspect artwork postcards, answer visual questions, and identify the manufacturer.


Demo Overview

Let's look at how we can implement this 'Digital Docent' using Python and the Gemini SDK.

Run the Colab notebook to inspect the postcard front for visual answers, and the back for metadata like the title, dimensions, and manufacturer.

The public agentic_visions GitHub folder hosts the raw images used in this demo.

I load these URLs and use the Gemini 3 Flash preview model, the code execution tool, and a text prompt for image cropping and data extraction.

The model receives a prompt to zoom into the postcard and count the horses, mules, and donkeys in three groups. The model uses the code execution tool to find the answer. I parse the JSON response and iterate through the response's parts array to display text and images.

Although the prompt is a single sentence, the model repeats the think, act, and observe steps until it derives an answer.
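
To make the loop concrete, here is a conceptual sketch of what happens on Google's servers during that single call. This is pseudocode only: none of these function names are real SDK APIs; they simply illustrate the Think-Act-Observe cycle described above.

def agentic_vision_loop(image, prompt):
    # Conceptual pseudocode: the real loop runs server-side inside one
    # generate_content call, and these helper names are illustrative.
    context = [image, prompt]
    while True:
        thought = think(context)             # Think: locate the region of interest
        if thought.has_answer:
            return thought.answer            # loop ends once an answer is derived
        code = write_crop_code(thought)      # Act: emit Python to crop and enhance
        crops = run_in_sandbox(code, image)  # execute in the secure sandbox
        context.append(crops)                # Observe: feed the crops back in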


Prerequisites

To run the Colab notebook, ensure you have:

  • A valid personal Gmail account to sign in to Google Colab
  • Vertex AI in Express Mode: I use Gemini via Vertex AI due to regional availability (Hong Kong), but these features function identically in the public Gemini API. To use this for free, create a Google Cloud project, enable the Vertex AI API, and opt into Express Mode using your Gmail account.
  • Google Cloud API key: Generate an API key in the Google Cloud Console. Configure this key by following the authentication guide.

Setup

Clone the notebook

Save a copy of the Colab notebook to Google Drive to run the test cases.

Client Secrets

Add a GOOGLE_CLOUD_API_KEY secret variable in the Colab notebook by clicking the Secrets (key icon) tab in the left sidebar.

GOOGLE_CLOUD_API_KEY=<GOOGLE CLOUD API KEY>

In the notebook, import userdata to get the Vertex AI Gemini API Key.

from google.colab import userdata

cloud_api_key = userdata.get('GOOGLE_CLOUD_API_KEY')

Code Execution

Install and Import Libraries

The notebook installs Python libraries and imports classes.

%pip install google-genai
%pip install matplotlib
%pip install pydantic

from google import genai
from google.genai import types
from matplotlib import pyplot as plt
from pydantic import BaseModel, Field
import requests
from PIL import Image
from io import BytesIO
from typing import Union

Test Images

  1. The front of the postcard: The National Palace Museum in Taipei exhibits Up the River. (Image: Up the River, front)

I prompt the model to zoom in on the bottom of the postcard to count horses.

  2. The back of the postcard: It contains metadata like the title and dimensions. (Image: Up the River, back)

The prompt asks the model to zoom into the label to extract the title, dimensions, manufacturing date, manufacturing location, manufacturer, phone number, address, serial number, and price.


Define the Pydantic Models

I define Pydantic models to store the structured results.

ArtworkDiscovery stores the code execution's answer and reasoning.

class ArtworkDiscovery(BaseModel):
    answer: str = Field(..., description="The answer to the prompt")
    reasoning: str = Field(..., description="The reasoning behind the answer")

PostcardLabel stores the postcard's back label information.

class PostcardLabel(BaseModel):
    title: str = Field(..., description="The title of the artwork on the postcard")
    dimensions: str = Field(..., description="The dimensions of the postcard")
    manufacturing_date: str = Field(..., description="The manufacturing date of the postcard")
    manufacturing_location: str = Field(..., description="The manufacturing location of the postcard")
    brand: str = Field(..., description="The brand of the postcard")
    manufacturer: str = Field(..., description="The manufacturer of the postcard")
    phone_number: str = Field(..., description="The phone number of the postcard manufacturer")
    address: str = Field(..., description="The address of the postcard manufacturer")
    serial_number: str = Field(..., description="The serial number of the postcard")
    price: float = Field(..., description="The price of the postcard")
    reasoning: str = Field(..., description="The reasoning behind the answer")

The VisionTestCase class defines the image URL, text prompt, and response model type. The response model uses either the ArtworkDiscovery or PostcardLabel class.

class VisionTestCase(BaseModel):
    image_url: str = Field(..., description="The URL of the image to be analyzed")
    prompt: str = Field(..., description="The prompt to be answered based on the image")
    response_model: Union[type[ArtworkDiscovery], type[PostcardLabel]]  = Field(..., description="The Pydantic model that defines the expected response format")

Prompt Engineering

Agentic Vision struggles with counting. When I ask it to count horses or houses on the postcards, the answers deviate by +/- 5 from my own count. To capture the model's logic, I add a reasoning field to the Pydantic models. The field induces Chain-of-Thought (CoT) processing, which significantly improves arithmetic and logic tasks like counting. The sketch below shows the schema the field produces.
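
As a quick sanity check, you can print the JSON schema this field adds; it is the same schema later passed as response_json_schema. A minimal sketch using the ArtworkDiscovery model defined above:

import json

# The "reasoning" property appears alongside "answer" in the schema and is
# listed under "required", which nudges the model to write out its logic.
print(json.dumps(ArtworkDiscovery.model_json_schema(), indent=2))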

Create a Gemini client

I use the client to call the Gemini 3 Flash Preview model, which executes code on the provided images, finds the requested object's coordinates, reviews the cropped regions, and returns the answer with its reasoning.

def create_vertexai_client():
    cloud_api_key = userdata.get('GOOGLE_CLOUD_API_KEY')
    if not cloud_api_key:
        raise ValueError("GOOGLE_CLOUD_API_KEY not found in Colab secrets")

    # Configure the client with your API key
    client = genai.Client(
        vertexai=True,
        api_key=cloud_api_key,
    )

    return client

client = create_vertexai_client()

The Agentic Loop in Action

The curate_artwork_postcard function calls generate_content, prompting the model to crop images within an agentic loop to derive the final answer.

Note that the loop occurs server-side within the generate_content call. The Python client waits for the result while the model iterates. The snippet after this paragraph shows one way to inspect those iterations.
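
A small sketch, assuming the response structure shown later in print_parts: each executable_code part that comes back corresponds to one Act step the model performed during the server-side loop, so counting them reveals how many iterations occurred.

def count_act_steps(response: types.GenerateContentResponse) -> int:
    # Each executable_code part is one Act step of the Think-Act-Observe loop
    parts = response.candidates[0].content.parts
    return sum(1 for part in parts if part.executable_code is not None)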

Zoom into the Postcards

curate_artwork_postcard sends the image URL and prompt to Gemini 3 Flash Preview for a JSON answer. The configuration includes a tools array enabling code execution, along with settings for high thinking and media resolution.

tools=[types.Tool(code_execution=types.ToolCodeExecution())]
config=types.GenerateContentConfig(
    response_mime_type="application/json",
    response_json_schema=test_case.response_model.model_json_schema(),
    thinking_config=types.ThinkingConfig(
        thinking_level=types.ThinkingLevel.HIGH
    ),
    media_resolution=types.MediaResolution.MEDIA_RESOLUTION_HIGH,
    tools=tools, 
)

def curate_artwork_postcard(test_case: VisionTestCase) -> types.GenerateContentResponse:
    response = client.models.generate_content(
        model="gemini-3-flash-preview",
        contents=[
            types.Content(
                role="user",
                parts=[
                    types.Part.from_uri(file_uri=test_case.image_url),
                    types.Part(text=test_case.prompt),
                ]
            )
        ],
        config=config
    )

    return response

Display the Results

The print_parts function receives a response and iterates through the response.candidates[0].content.parts list to print the JSON object as a string, the executable code, the executable code result, and the cropped image.

def print_parts(response: types.GenerateContentResponse):
    for part in response.candidates[0].content.parts:
        if part.text is not None and part.text.strip():
            print("part.text -> ", part.text.strip())
        if part.executable_code is not None:
            print("part.executable_code -> ", part.executable_code)
        if part.code_execution_result is not None:
            print("part.code_execution_result -> ", part.code_execution_result)
        image = part.as_image()
        if image is not None:
            # display() is a standard function in Jupyter/Colab notebooks
            display(Image.open(BytesIO(image.image_bytes)))

print_artwork_result calls curate_artwork_postcard. If the model populates response.parsed, the function passes it to model_validate to obtain the model instance. Otherwise, it calls model_validate_json on the cleaned response.text. print_artwork_result then calls print_parts to display the response parts and prints the fields of the result object.

clean_json_string is a helper that strips markdown code fences from raw text responses; it serves as a fallback when structured parsing fails.

def clean_json_string(raw_string):
    # Strip the markdown code fences (```json ... ```) if present
    clean_str = raw_string.strip()
    if clean_str.startswith("```json"):
        clean_str = clean_str[7:]
    if clean_str.endswith("```"):
        clean_str = clean_str[:-3]
    return clean_str.strip()

def print_artwork_result(test_case: VisionTestCase):
    response = curate_artwork_postcard(test_case=test_case)

    if response.parsed:
        result = test_case.response_model.model_validate(response.parsed) 
    else:
        result = test_case.response_model.model_validate_json(
            clean_json_string(response.text)
        )

    print_parts(response=response)
    if isinstance(result, ArtworkDiscovery):
        print("Final Answer: ", result.answer, "\nReasoning: ", result.reasoning)
    elif isinstance(result, PostcardLabel):
        print("Title: ", result.title,
              "\nDimensions: ", result.dimensions, 
              "\nManufacturing Date: ", result.manufacturing_date, 
              "\nManufacturing Location: ", result.manufacturing_location, 
              "\nBrand: ", result.brand, 
              "\nManufacturer: ", result.manufacturer, 
              "\nPhone Number: ", result.phone_number, 
              "\nAddress: ", result.address, 
              "\nSerial Number: ", result.serial_number, 
              "\nPrice: ", result.price, 
              "\nReasoning: ", result.reasoning
        )

def print_test_cases(heading: str, cases: list[VisionTestCase]):
    print(heading)
    for test_case in cases:
        print_artwork_result(test_case=test_case)

Construct the Test Cases

I use show_the_postcard_front_and_back to construct the test image URLs and display the images. load_image_from_url uses requests and PIL to fetch an image from a URL and matplotlib to display it.

def load_image_from_url(url: str):
    try:
        response = requests.get(url=url)
        # Raise requests.HTTPError for 4xx/5xx responses
        response.raise_for_status()
        img = Image.open(BytesIO(response.content))
        plt.imshow(img)
        plt.axis('off')
        plt.show()
    except requests.HTTPError as e:
        status = e.response.status_code
        if status == 404:
            print("Error: URL not found (404).")
        else:
            print(f"HTTP Error: {status}")
    except Exception as e:
        print(f"An error occurred: {e}")

def show_the_postcard_front_and_back(front: str, back: str):
    agent_vision_base_url = "https://raw.githubusercontent.com/railsstudent/colab_images/refs/heads/main/agentic_visions"
    front_url = f"{agent_vision_base_url}/{front}"
    back_url = f"{agent_vision_base_url}/{back}"

    load_image_from_url(url=front_url)
    load_image_from_url(url=back_url)

    # Return the URLs so the test cases can reuse them
    return front_url, back_url

front_url, back_url = show_the_postcard_front_and_back(front="up-the-river-front.jpg", back="up-the-river-back.jpg")

I ask the model to count animals at the bottom of up-the-river-front.jpg.

up_the_river_test_cases = [
    VisionTestCase(
        image_url=front_url,
        prompt="Zoom to bottom of the postcard to find the number of horses/donkeys/mules near the wagons",
        response_model=ArtworkDiscovery
    ),
]

print_test_cases(heading="Up the River test cases", cases=up_the_river_test_cases)

The following outputs appear:

executable_code contains the generated Python code that loads the image, crops regions (for example, bottom_view = get_crop(img, [730, 80, 920, 950])), and saves them as PNG files (for example, bottom_view.png). code_execution_result shows OUTCOME_OK with a null output. The model, not the developer, writes the get_crop helper.

When code execution is enabled, the model automatically downloads the URL into its local sandbox, renaming the file to f_https___raw.githubusercontent.com_railsstudent_colab_images...up_the_river_front.jpg. The sandbox then runs the generated code against the downloaded asset to find the answer.

Note: Libraries like cv2 and PIL are pre-installed in the Gemini model's secure execution sandbox.
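
The generated code uses boxes normalized to a 0-1000 grid in [ymin, xmin, ymax, xmax] order. Below is a small worked example of the conversion the model's get_crop helper performs; the 2500x1875 size is borrowed from the back image's output purely for illustration, since the front image's actual size is not printed.

# Worked example of the 0-1000 normalized box convention ([ymin, xmin, ymax, xmax]).
# The image size here is illustrative, not the front postcard's real size.
width, height = 2500, 1875
ymin, xmin, ymax, xmax = 730, 80, 920, 950   # the bottom_view box

left = xmin * width / 1000     # 200.0
top = ymin * height / 1000     # 1368.75
right = xmax * width / 1000    # 2375.0
bottom = ymax * height / 1000  # 1725.0
print((left, top, right, bottom))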

part.executable_code ->  code='import PIL.Image\nimport PIL.ImageDraw\n\n# Load the image to get dimensions\nimg = PIL.Image.open(\'f_https___raw.githubusercontent.com_railsstudent_colab_images_refs_heads_main_agentic_visions_up_the_river_front.jpg\')\nwidth, height = img.size\n\n# Identify the bottom area with wagons and animals\n# Wagon 1 area: bottom left/center\n# Wagon 2 area: bottom right/center\n# Far right animals\n# Let\'s crop the bottom third of the image first to see clearly\nbottom_strip = [700, 100, 950, 900] # [ymin, xmin, ymax, xmax] in normalized coordinates\n\n# Better crop: focusing on the animal teams\n# Wagon team 1 (left): approx [750, 100, 900, 400]\n# Wagon team 2 (center): approx [750, 400, 900, 700]\n# Far right: [780, 720, 880, 820]\n\ndef get_crop(img, box_norm):\n    ymin, xmin, ymax, xmax = box_norm\n    left = xmin * width / 1000\n    top = ymin * height / 1000\n    right = xmax * width / 1000\n    bottom = ymax * height / 1000\n    return img.crop((left, top, right, bottom))\n\nbottom_view = get_crop(img, [730, 80, 920, 950])\nbottom_view.save(\'bottom_view.png\')\n\n# Detailed crops for counting\nteam1_crop = get_crop(img, [750, 100, 900, 400])\nteam1_crop.save(\'team1.png\')\n\nteam2_crop = get_crop(img, [720, 400, 900, 720])\nteam2_crop.save(\'team2.png\')\n\nright_animals = get_crop(img, [780, 720, 880, 820])\nright_animals.save(\'right_animals.png\')\n\n# Output objects for verification\n# [{box_2d: [750, 150, 880, 360], label: "animal team 1"},\n#  {box_2d: [750, 400, 885, 545], label: "animal team 2"},\n#  {box_2d: [795, 730, 861, 805], label: "right animals"}]\n' language=<Language.PYTHON: 'PYTHON'>
part.code_execution_result ->  outcome=<Outcome.OUTCOME_OK: 'OUTCOME_OK'> output=None

(Images: the cropped regions saved by the generated code)

part.text displays the answer and reasoning. The model discovers 24 animals whereas I found 23. The reasoning field describes the process leading to the answer.

part.text ->  {
 "answer": "24",
 "reasoning": "By zooming into the bottom of the postcard, a large wagon can be seen being pulled by multiple teams of horses, donkeys, or mules. There are two main teams in front of the wagon, each consisting of 8 animals (arranged in pairs). Behind the wagon, there is another group of 6 animals (3 pairs) that appear to be pushing or following closely. Additionally, there is a separate pair of 2 animals to the right of the wagon. Adding these groups together (8 + 8 + 6 + 2) results in a total of 24 animals near the wagon."
}
Final Answer:  24 
Reasoning:  By zooming into the bottom of the postcard, a large wagon can be seen being pulled by multiple teams of horses, donkeys, or mules. There are two main teams in front of the wagon, each consisting of 8 animals (arranged in pairs). Behind the wagon, there is another group of 6 animals (3 pairs) that appear to be pushing or following closely. Additionally, there is a separate pair of 2 animals to the right of the wagon. Adding these groups together (8 + 8 + 6 + 2) results in a total of 24 animals near the wagon.

I then ask the model to extract information from the postcard's back label.

The make_postcard_label_testcase function creates a test case that prompts the model to extract information from back_url.

def make_postcard_label_testcase(back_url: str) -> VisionTestCase:
    return VisionTestCase(
        image_url=back_url,
        prompt=(
            "Zoom to the label and find:\n"
            "1) the title of the artwork\n"
            "2) the dimensions\n"
            "3) the manufacturing date of the postcard\n"
            "4) the manufacturing location of the postcard\n"
            "5) the brand of the postcard\n"
            "6) the manufacturer of the postcard\n"
            "7) the phone number of the postcard manufacturer\n"
            "8) the address of the postcard manufacturer\n"
            "9) the serial number of the postcard\n"
            "10) the price of the postcard\n"
        ),
        response_model=PostcardLabel
    )
print_test_cases(heading="Up the River test cases", 
    cases=[
        make_postcard_label_testcase(back_url=back_url),
    ]
)

The generated code crops the label and price areas, saving them to label_crop.jpg and price_crop.jpg, and prints the original image size. code_execution_result shows OUTCOME_OK with the output Original image size: 2500x1875.

part.executable_code ->  code="import cv2\nimport PIL.Image\n\nimg = cv2.imread('f_https___raw.githubusercontent.com_railsstudent_colab_images_refs_heads_main_agentic_visions_up_the_river_back.jpg')\nheight, width, _ = img.shape\n\n# Crop the left label area\n# Roughly [100, 50, 850, 400] in normalized coordinates\nlabel_crop = img[int(0.1*height):int(0.85*height), int(0.05*width):int(0.4*width)]\ncv2.imwrite('label_crop.jpg', label_crop)\n\n# Crop the price area at the bottom left\nprice_crop = img[int(0.65*height):int(0.85*height), int(0.15*width):int(0.35*width)]\ncv2.imwrite('price_crop.jpg', price_crop)\n\n# Output for visual confirmation\nprint(f'Original image size: {width}x{height}')\n" language=<Language.PYTHON: 'PYTHON'>
part.code_execution_result ->  outcome=<Outcome.OUTCOME_OK: 'OUTCOME_OK'> output='Original image size: 2500x1875\n'

(Images: the two cropped label regions, Label 1 and Label 2)

The model stores the JSON object in part.text and displays the title, dimensions, manufacturing date, manufacturing location, brand, manufacturer, phone number, address, serial number, and price. The model renders English and Traditional Chinese characters.

part.text ->  {
  "title": "(Qing Court Version of) Up the River During Qingming",
  "dimensions": "W14.8 x H10 x D0.1 cm",
  "manufacturing_date": "2025.07.01",
  "manufacturing_location": "Taiwan",
  "brand": "臻印藝術",
  "manufacturer": "興台彩色印刷股份有限公司",
  "phone_number": "+886-(4)-2287-1181",
  "address": "台中市南區忠孝路 64 號",
  "serial_number": "2928833300961",
  "price": 47,
  "reasoning": "The information was extracted from the product label on the back of the postcard. The title is derived from the description mentioning the '(Qing Court Version of) Up the River During Qingming'. Dimensions, manufacturing date (2025.07.01), location (Taiwan), brand (臻印藝術), manufacturer (興台彩色印刷股份有限公司), phone number, and address were all clearly printed on the label. The serial number and price ($47) were identified from the barcode sticker at the bottom."
}

Because the result uses the PostcardLabel type, I access its fields and display their values.

Title:  (Qing Court Version of) Up the River During Qingming 
Dimensions:  W14.8 x H10 x D0.1 cm 
Manufacturing Date:  2025.07.01 
Manufacturing Location:  Taiwan 
Brand:  臻印藝術 
Manufacturer:  興台彩色印刷股份有限公司 
Phone Number:  +886-(4)-2287-1181 
Address:  台中市南區忠孝路 64 號 
Serial Number:  2928833300961 
Price:  47.0 
Reasoning:  The information was extracted from the product label on the back of the postcard. The title is derived from the description mentioning the '(Qing Court Version of) Up the River During Qingming'. Dimensions, manufacturing date (2025.07.01), location (Taiwan), brand (臻印藝術), manufacturer (興台彩色印刷股份有限公司), phone number, and address were all clearly printed on the label. The serial number and price ($47) were identified from the barcode sticker at the bottom.
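
One detail worth noting: the JSON contains "price": 47 (an integer), yet the notebook prints 47.0, because the PostcardLabel price field is declared as float and Pydantic coerces the value during validation. A minimal, self-contained illustration (PriceOnly is a hypothetical model, not part of the notebook):

from pydantic import BaseModel

class PriceOnly(BaseModel):
    # Hypothetical minimal model used only to demonstrate the coercion
    price: float

print(PriceOnly.model_validate({"price": 47}).price)  # prints 47.0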

This concludes the Gemini 3 Agentic Vision code walkthrough.


Conclusion

Agentic Vision uses Code Execution to generate Python code for precise, high-resolution postcard cropping. The code_execution_result outputs in this workflow demonstrate how Gemini 3 surpasses standard object detection, using the model's new vision capability to transform a low-legibility postcard area into a series of legible cropped images.

Technical Takeaways:

  • Active Investigation: The agent does not rely on a single look at the image; it identifies areas of interest and interacts with the data to extract hidden details.
  • Deterministic Accuracy: Code execution extracts details more precisely than standard bounding boxes.
  • User Experience (UX): This workflow demonstrates how generative AI can act as a bridge between dense physical media and specific end-user resolution requirements.

What will you build when your application can act on what it sees? Start experimenting with the Gemini API and build agents that observe actively rather than passively.


Resources