Live Testing for Claude Connectors and ChatGPT Apps


Abe Wheeler


The sunpeak simulator tests cover a lot. They replicate the ChatGPT and Claude runtimes, run display mode transitions, test themes, and validate tool invocations without any paid accounts or AI credits. For most development work, they're enough.

But simulators don't catch everything. Real ChatGPT wraps your app in a nested iframe sandbox. The MCP protocol goes through ChatGPT's actual connection layer. Resource loading happens over a real network with production builds. There's a gap between "works in the simulator" and "works in ChatGPT," and the only way to close it is to test against the real thing.

sunpeak 0.16.23 adds live testing: automated Playwright tests that run against real ChatGPT. You write the same kind of assertions you write for simulator tests, and sunpeak handles authentication, MCP server refresh, host-specific message formatting, and iframe traversal.

TL;DR: Run pnpm test:live with a tunnel active. sunpeak imports your browser session, starts the dev server, refreshes the MCP connection, and runs your tests/live/*.spec.ts files in parallel against real ChatGPT. You write assertions against the app iframe. Everything else is automated.

What Live Tests Actually Do

A live test opens a real ChatGPT session in a browser, types a message that triggers your MCP tool, waits for ChatGPT to call it, and then asserts against the rendered app inside the host's iframe.

Here's a complete live test for an albums resource:

import { test, expect } from 'sunpeak/test';

test('albums tool renders photo grid', async ({ live }) => {
  const app = await live.invoke('show-albums');

  await expect(app.getByText('Summer Slice')).toBeVisible({ timeout: 15_000 });
  await expect(app.locator('img').first()).toBeVisible();

  // Switch to dark mode without re-invoking the tool
  await live.setColorScheme('dark', app);
  await expect(app.getByText('Summer Slice')).toBeVisible();
});

live.invoke('show-albums') starts a new chat, sends /{appName} show-albums to ChatGPT, waits for the LLM response to finish streaming, waits for the app iframe to render, and returns a Playwright FrameLocator pointed at your app's content. From there, it's standard Playwright assertions.

The { timeout: 15_000 } on the first assertion accounts for the LLM response time. ChatGPT needs to process your message, decide to call the tool, receive the result, and render the iframe. In practice this takes 5 to 10 seconds, so 15 seconds leaves some headroom.

Prerequisites

You need three things:

A ChatGPT account with MCP/Apps support (Plus or higher)
A tunnel tool like ngrok or Cloudflare Tunnel
Your MCP server connected in ChatGPT (Settings > Apps > Create, enter your tunnel URL with /mcp path)

You do not need to install anything extra in your sunpeak project. Live test infrastructure ships with sunpeak starting at v0.16.23. New projects scaffolded with sunpeak new include example live test specs and the Playwright config.

Running Live Tests

Open two terminals:

# Terminal 1: Start a tunnel
ngrok http 8000

# Terminal 2: Run live tests
pnpm test:live

On first run, sunpeak imports your ChatGPT session from your browser. It checks Chrome, Arc, Brave, and Edge automatically. If no valid session is found, it opens a browser window and waits for you to log in. The session is saved to tests/live/.auth/chatgpt.json and reused for 24 hours.
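The 24-hour reuse window can be pictured as a simple age check on the saved session file. The sketch below is an assumption about the general shape of such a check, not sunpeak's actual implementation; `isSessionFresh` and `SESSION_MAX_AGE_MS` are hypothetical names.

```typescript
// Hypothetical sketch: decide whether a saved session is still reusable.
// sunpeak's real logic may differ (for example, it may also validate cookies).
const SESSION_MAX_AGE_MS = 24 * 60 * 60 * 1000; // 24 hours, per the text above

function isSessionFresh(
  savedAtMs: number,
  nowMs: number,
  maxAgeMs: number = SESSION_MAX_AGE_MS
): boolean {
  const ageMs = nowMs - savedAtMs;
  // A negative age means clock skew or a bad timestamp; treat it as stale.
  return ageMs >= 0 && ageMs < maxAgeMs;
}
```

In a Node script, savedAtMs could come from the mtimeMs of tests/live/.auth/chatgpt.json as reported by fs.statSync, and nowMs from Date.now().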

After authentication, sunpeak:

Starts sunpeak dev --prod-resources (production resource builds)
Navigates to ChatGPT Settings > Apps, finds your MCP server, and clicks Refresh
Runs all tests/live/*.spec.ts files fully in parallel, each in its own chat window

The MCP refresh happens once in globalSetup, before any test workers start. Individual workers never refresh the connection themselves, which would be slow and flaky.

The Fixture API

All live tests import from sunpeak/test:

import { test, expect } from 'sunpeak/test';

The test function provides a live fixture with:

Method	What it does
`invoke(prompt)`	Starts a new chat, sends the prompt with host-specific formatting, waits for the app iframe, returns a `FrameLocator`
`sendMessage(text)`	Sends a message in the current chat with `/{appName}` prefix
`sendRawMessage(text)`	Sends a message without any prefix
`startNewChat()`	Opens a fresh conversation
`waitForAppIframe()`	Waits for the MCP app iframe and returns a `FrameLocator`
`setColorScheme(scheme, appFrame?)`	Switches to `'light'` or `'dark'` via `page.emulateMedia()`
`page`	Raw Playwright `Page` object

Most tests only need invoke and setColorScheme. The invoke method handles the full flow: new chat, message formatting (ChatGPT requires /{appName} before your prompt), waiting for streaming to finish, waiting for the nested iframe to render, and returning a locator into your app's content.
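To make the relationship between invoke and the lower-level methods concrete, here is a rough sketch of how the flow could be assembled from them. It is an approximation based on the table above, not sunpeak's source: `LiveLike` and `invokeSketch` are hypothetical names, the return types are assumptions, and `AppFrame` stands in for Playwright's FrameLocator so the sketch needs no imports.

```typescript
// Minimal structural type covering only the fixture methods used here.
// Return types are assumed; the table does not spell them out.
interface LiveLike<AppFrame> {
  startNewChat(): Promise<void>;
  sendMessage(text: string): Promise<void>;
  waitForAppIframe(): Promise<AppFrame>;
}

// Hypothetical helper: an approximation of what invoke(prompt) does,
// built from the lower-level methods listed in the table.
async function invokeSketch<AppFrame>(
  live: LiveLike<AppFrame>,
  prompt: string
): Promise<AppFrame> {
  await live.startNewChat(); // fresh conversation
  await live.sendMessage(prompt); // the fixture adds the /{appName} prefix
  return live.waitForAppIframe(); // resolves once the app iframe is available
}
```

The lower-level methods are mostly useful for multi-turn tests: call invoke once, then use sendMessage for a follow-up prompt and waitForAppIframe to get a locator for the newly rendered app.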

Theme Testing Without Re-Invocation

Sending a second message to trigger a new tool call is slow and burns credits. setColorScheme avoids that by switching the browser's prefers-color-scheme via Playwright's page.emulateMedia(). ChatGPT propagates the change into the iframe, and your app re-renders with the new theme.

test('ticket card text stays readable in dark mode', async ({ live }) => {
  const app = await live.invoke('show-ticket');

  const title = app.getByText('Search results not loading on mobile');
  await expect(title).toBeVisible({ timeout: 15_000 });

  // Verify status badge and assignee are visible in light mode
  await expect(app.getByText('in progress')).toBeVisible();
  await expect(app.getByText('Sarah Chen')).toBeVisible();

  // Switch to dark mode — common bugs: text blends into background,
  // borders disappear, badge colors lose contrast
  await live.setColorScheme('dark', app);

  // Same elements should still be visible with the new theme applied
  await expect(title).toBeVisible();
  await expect(app.getByText('in progress')).toBeVisible();
  await expect(app.getByText('Sarah Chen')).toBeVisible();

  // Badge should keep a non-transparent background in dark mode
  const badge = app.locator('span:has-text("high")');
  const badgeBg = await badge.evaluate(
    (el) => window.getComputedStyle(el).backgroundColor
  );
  expect(badgeBg).not.toBe('rgba(0, 0, 0, 0)');
});

Passing the app frame as the second argument makes setColorScheme wait until the app's <html> element has data-theme="dark". That confirms the theme propagated through the iframe boundary before your assertions run.
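The badge check in the test above only rules out a transparent background. If you want a stricter signal that text has not blended into its background, one option is to compare two computed colors with the WCAG contrast ratio. The helpers below are a sketch and not part of sunpeak; `parseRgb`, `relativeLuminance`, and `contrastRatio` are hypothetical names, and the parser assumes the `rgb(...)` or `rgba(...)` strings that getComputedStyle commonly returns, ignoring alpha.

```typescript
type Rgb = { r: number; g: number; b: number };

// Parse "rgb(r, g, b)" or "rgba(r, g, b, a)". Alpha is ignored, and any
// other color syntax returns null.
function parseRgb(color: string): Rgb | null {
  const m = /^rgba?\(\s*(\d+),\s*(\d+),\s*(\d+)/.exec(color.trim());
  return m ? { r: Number(m[1]), g: Number(m[2]), b: Number(m[3]) } : null;
}

// WCAG 2.x relative luminance of an sRGB color.
function relativeLuminance({ r, g, b }: Rgb): number {
  const channel = (v: number): number => {
    const s = v / 255;
    return s <= 0.03928 ? s / 12.92 : Math.pow((s + 0.055) / 1.055, 2.4);
  };
  return 0.2126 * channel(r) + 0.7152 * channel(g) + 0.0722 * channel(b);
}

// WCAG contrast ratio: 1 for identical colors, 21 for black on white.
function contrastRatio(a: Rgb, b: Rgb): number {
  const la = relativeLuminance(a);
  const lb = relativeLuminance(b);
  return (Math.max(la, lb) + 0.05) / (Math.min(la, lb) + 0.05);
}
```

In a live test, you could read color and backgroundColor with evaluate as the example above does, parse both, and assert that the ratio stays above 4.5, the WCAG AA threshold for normal text.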

A Full Example

Here's a live test for a review card resource. It invokes the tool, checks the rendered content, verifies a button interaction triggers a state transition, and confirms the card re-themes correctly in dark mode:

import { test, expect } from 'sunpeak/test';

test('review card renders and handles approval flow', async ({ live }) => {
  const app = await live.invoke('review-diff');

  // Verify the card rendered with the right content
  const title = app.locator('h1').first();
  await expect(title).toBeVisible({ timeout: 15_000 });
  await expect(title).toHaveText('Refactor Authentication Module');

  // Action buttons present
  const applyButton = app.getByRole('button', { name: 'Apply Changes' });
  await expect(applyButton).toBeVisible();

  // Theme switch: card should stay readable in dark mode
  await live.setColorScheme('dark', app);
  await expect(title).toBeVisible();
  await expect(applyButton).toBeVisible();

  // Click Apply Changes — UI transitions to accepted state
  await applyButton.click();
  await expect(applyButton).not.toBeVisible({ timeout: 5_000 });
  await expect(
    app.locator('text=Applying changes...').first()
  ).toBeVisible({ timeout: 5_000 });
});

This catches real issues that simulator tests can miss: the iframe sandbox blocking a script load, a theme change not propagating through the nested iframe boundary, or a button click failing because of host-specific event handling.

The Playwright Config

The live test config is a single function call:

// tests/live/playwright.config.ts
import { defineLiveConfig } from 'sunpeak/test/config';

export default defineLiveConfig();

This generates a full Playwright config with:

globalSetup pointing to sunpeak's auth and MCP refresh flow
headless: false because chatgpt.com blocks headless browsers
Anti-bot browser arguments and a real Chrome user agent
2-minute timeout per test (LLM responses can be slow)
1 retry per test (LLM responses are non-deterministic)
Fully parallel execution (each test gets its own chat)
Automatic dev server with --prod-resources on a dynamically allocated port
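For readers who know Playwright's config format, the defaults listed above correspond roughly to the plain object below. This is a hand-written approximation, not the actual output of defineLiveConfig. The field names are standard Playwright options, but the globalSetup path is a placeholder, and the anti-bot arguments, user agent, and dev server settings are left out because their exact values are not given here.

```typescript
// Rough approximation of the defaults described in the list above.
// Not sunpeak's real config: the globalSetup path is a made-up placeholder.
const liveConfigSketch = {
  globalSetup: './hypothetical-global-setup.ts', // auth + MCP refresh
  timeout: 2 * 60 * 1000, // 2-minute timeout per test
  retries: 1, // one retry, since LLM responses are non-deterministic
  fullyParallel: true, // each test runs in its own chat
  use: {
    headless: false, // chatgpt.com blocks headless browsers
  },
};
```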

You can pass options to customize the environment:

import { defineLiveConfig } from 'sunpeak/test/config';

export default defineLiveConfig({
  colorScheme: 'dark',
  viewport: { width: 1440, height: 900 },
  locale: 'fr-FR',
  timezoneId: 'Europe/Paris',
  geolocation: { latitude: 48.8566, longitude: 2.3522 },
  permissions: ['geolocation'],
});

How It Relates to Simulator Tests

Live tests don't replace simulator tests. They complement them.

	Simulator (`pnpm test:e2e`)	Live (`pnpm test:live`)
Runs against	Local simulator	Real ChatGPT
Speed	Seconds	10-30 seconds per test
Cost	Free	Requires ChatGPT Plus
CI/CD	Yes	Not recommended (needs auth)
Catches	Component logic, display modes, themes, cross-host layout	Real MCP connection, LLM tool invocation, iframe sandbox, production resource loading

Use simulator tests for development and CI/CD. Use live tests before shipping, after major changes, or when debugging issues that only reproduce in the real host.

The Testing Pyramid for Claude Connectors

A Claude Connector built with sunpeak now has three test tiers:

Unit tests (pnpm test): Vitest with jsdom. Fast, and they test component logic in isolation.
Simulator e2e tests (pnpm test:e2e): Playwright against the local ChatGPT and Claude simulator. They test display modes and themes, and run in CI/CD.
Live tests (pnpm test:live): Playwright against real ChatGPT (with Claude coming soon). They test real MCP protocol behavior and iframe rendering.

Each tier catches different classes of bugs. Unit tests catch logic errors. Simulator tests catch rendering and layout issues across hosts and display modes. Live tests catch protocol and sandbox issues that only show up in the real host environment.

All three are pre-configured when you run sunpeak new. You don't need to set up Vitest, Playwright, or any test infrastructure yourself.

Host-Agnostic Architecture

The live test infrastructure is designed to support multiple hosts. The live fixture resolves the correct host page object based on the Playwright project name. All host-specific DOM interaction (selectors, login flow, settings navigation, iframe nesting) lives in per-host page objects that sunpeak maintains.
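One way to picture this is a registry keyed by project name. The sketch below is illustrative only: `HostPageObject`, `hostRegistry`, and `resolveHost` are hypothetical names, and it assumes the Playwright project name equals the host id. sunpeak's real page objects also cover selectors, login, settings navigation, and iframe nesting.

```typescript
// Hypothetical shape of a per-host page object, reduced to prompt formatting.
interface HostPageObject {
  readonly host: string;
  formatPrompt(appName: string, prompt: string): string;
}

const hostRegistry: Record<string, HostPageObject> = {
  chatgpt: {
    host: 'chatgpt',
    // ChatGPT expects /{appName} before the prompt.
    formatPrompt: (appName, prompt) => `/${appName} ${prompt}`,
  },
};

// Resolve the page object from a Playwright project name
// (assumed here to equal the host id).
function resolveHost(projectName: string): HostPageObject {
  const host = hostRegistry[projectName];
  if (!host) {
    throw new Error(`No live-test host registered for project "${projectName}"`);
  }
  return host;
}
```

With this shape, supporting another host means adding one registry entry, and nothing in the test files needs to change.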

Your test code is host-agnostic:

import { test, expect } from 'sunpeak/test';

test('my resource renders', async ({ live }) => {
  const app = await live.invoke('show me something');
  await expect(app.locator('h1')).toBeVisible();
});

This same test will run against any host that sunpeak supports. Today that's ChatGPT. When Claude live testing ships, add it with one line:

// tests/live/playwright.config.ts
import { defineLiveConfig } from 'sunpeak/test/config';
export default defineLiveConfig({ hosts: ['chatgpt', 'claude'] });

No changes to your test files.

Getting Started

If you have an existing sunpeak project, update to v0.16.23 or later:

pnpm add sunpeak@latest && sunpeak upgrade

Create tests/live/playwright.config.ts:

import { defineLiveConfig } from 'sunpeak/test/config';
export default defineLiveConfig();

Add the test script to package.json:

{
  "scripts": {
    "test:live": "playwright test --config tests/live/playwright.config.ts"
  }
}

Write your first live test in tests/live/your-resource.spec.ts:

import { test, expect } from 'sunpeak/test';

test('my tool renders correctly in ChatGPT', async ({ live }) => {
  const app = await live.invoke('your prompt here');
  await expect(app.locator('your-selector')).toBeVisible({ timeout: 15_000 });
});

Start a tunnel, run pnpm test:live, and watch Playwright drive a real ChatGPT session.

New projects created with sunpeak new include all of this out of the box, with example live tests for every starter resource.