Cross-Model Testing

opentine's universal model API makes it easy to run the same prompt across different providers and compare quality, cost, and speed. Every model adapter implements the same Model protocol, so swapping providers is a one-line change.
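The actual protocol definition isn't shown in this guide; the sketch below illustrates the idea with Python's typing.Protocol, and the complete() method name is an invented stand-in for whatever the real interface exposes:

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class Model(Protocol):
    """Hypothetical sketch of the shared adapter interface."""
    async def complete(self, messages: list[dict]) -> dict: ...

class FakeAdapter:
    # Any class with a matching method satisfies the protocol --
    # no inheritance from Model is needed.
    async def complete(self, messages):
        return {"role": "assistant", "content": "ok"}

print(isinstance(FakeAdapter(), Model))  # True
```

Because the check is structural, Anthropic, OpenAI, Google, and Ollama adapters are interchangeable anywhere a Model is expected.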

What It Does

  • Runs the same prompt and tools across Anthropic, OpenAI, Google, and Ollama
  • Saves each run as a separate .tine file for comparison
  • Reports cost, duration, and step count for each model
  • Uses tine diff to compare how models approached the task differently

Prerequisites

Terminal
# Set API keys for the models you want to test
export ANTHROPIC_API_KEY="sk-ant-..."
export OPENAI_API_KEY="sk-..."
export GOOGLE_API_KEY="AIza..."

# For Ollama, ensure the server is running locally
# ollama serve

# Optional: search provider key
export TAVILY_API_KEY="tvly-..."

You only need API keys for the models you want to test. Skip any provider you don't have access to.
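One way to skip missing providers automatically is to gate each model on its environment variable before building the models dict. The env-var names are the real ones from above; the helper itself is illustrative:

```python
import os

# Map each model name to the env var that gates it (Ollama needs no key).
REQUIRED_KEYS = {
    "claude": "ANTHROPIC_API_KEY",
    "gpt4o": "OPENAI_API_KEY",
    "gemini": "GOOGLE_API_KEY",
    "llama": None,
}

def available_models(env=os.environ) -> list[str]:
    """Return names of models whose required key is set (or that need none)."""
    return [name for name, key in REQUIRED_KEYS.items()
            if key is None or env.get(key)]

print(available_models({"OPENAI_API_KEY": "sk-..."}))  # ['gpt4o', 'llama']
```

Filter the models dict in the full example against this list and runs for unconfigured providers are skipped rather than failing at request time.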

Full Example

cross_model.py
import asyncio
from opentine import Agent
from opentine.models.anthropic import Anthropic
from opentine.models.openai import OpenAI
from opentine.models.google import Google
from opentine.models.ollama import Ollama
from opentine.tools.web import search, fetch

models = {
    "claude":  Anthropic("claude-sonnet-4-20250514"),
    "gpt4o":   OpenAI("gpt-4o"),
    "gemini":  Google("gemini-2.5-flash"),
    "llama":   Ollama("llama3.3"),
}

prompt = "Research the current state of solid-state batteries and summarize the key players."

async def run_model(name, model):
    agent = Agent(
        model=model,
        tools=[search, fetch],
        system="You are a research assistant. Be thorough but concise.",
        max_steps=20,
    )
    run = await agent.run(prompt)
    run.save(f"comparison_{name}.tine")
    return name, run

async def main():
    tasks = [run_model(name, model) for name, model in models.items()]
    results = await asyncio.gather(*tasks)

    print(f"{'Model':<12} {'Status':<12} {'Steps':<8} {'Cost':>10} {'Duration':>10}")
    print("-" * 56)
    for name, run in results:
        cost = f"${run.total_cost:.4f}"
        print(
            f"{name:<12} {run.status:<12} {len(run.steps()):<8} "
            f"{cost:>10} {run.total_duration:>8.1f}s"
        )

asyncio.run(main())

All four models run concurrently via asyncio.gather. Each produces an independent run tree saved to its own .tine file.

Example Output

Terminal
Model        Status       Steps          Cost   Duration
--------------------------------------------------------
claude       completed    12          $0.0341     18.7s
gpt4o        completed    14          $0.0289     22.1s
gemini       completed    10          $0.0156     15.3s
llama        completed    16          $0.0000     45.8s

Ollama runs are free (local model) but typically slower. Cloud models vary in cost and speed. The step count reveals how efficiently each model approaches the task.
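Step count and cost can be combined into a rough per-step efficiency figure. The numbers below are the ones from the example output; the metric itself is just one reasonable choice, not anything opentine computes for you:

```python
# (name, steps, cost_usd, duration_s) taken from the example output above
runs = [
    ("claude", 12, 0.0341, 18.7),
    ("gpt4o", 14, 0.0289, 22.1),
    ("gemini", 10, 0.0156, 15.3),
    ("llama", 16, 0.0000, 45.8),
]

def cost_per_step(cost: float, steps: int) -> float:
    """Average cost of a single agent step; 0.0 for a run with no steps."""
    return cost / steps if steps else 0.0

for name, steps, cost, duration in runs:
    print(f"{name:<8} ${cost_per_step(cost, steps):.4f}/step  "
          f"{duration / steps:.1f}s/step")
```

A model that takes more steps but costs less per step may still come out cheaper overall, so look at both columns together.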

Comparing Runs

Use tine diff to see exactly where two models diverged in their approach:

Terminal
# Compare any two runs side by side
tine diff comparison_claude.tine comparison_gpt4o.tine
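With four runs there are six pairs to compare. itertools.combinations can generate every pairwise tine diff invocation, using the comparison_<name>.tine file names from the full example:

```python
from itertools import combinations

names = ["claude", "gpt4o", "gemini", "llama"]

# One tine diff command per unordered pair of runs.
for a, b in combinations(names, 2):
    print(f"tine diff comparison_{a}.tine comparison_{b}.tine")
```

Pipe the output to a shell, or eyeball the list and run only the pairs you care about.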

Testing Fewer Models

You don't need all four providers. Test just the models you're evaluating:

two_models.py
import asyncio
from opentine import Agent
from opentine.models.anthropic import Anthropic
from opentine.models.openai import OpenAI
from opentine.tools.web import search, fetch

prompt = "Research the current state of solid-state batteries"

async def main():
    # Test just two models
    models = [
        ("claude", Anthropic("claude-sonnet-4-20250514")),
        ("gpt4o", OpenAI("gpt-4o")),
    ]

    for name, model in models:
        agent = Agent(
            model=model,
            tools=[search, fetch],
            max_steps=20,
        )
        run = await agent.run(prompt)
        run.save(f"test_{name}.tine")
        print(f"{name}: {run.status} in {run.total_duration:.1f}s (${run.total_cost:.4f})")

asyncio.run(main())

Fork-Based Comparison

For a more controlled comparison, run the initial steps with one model, then fork and resume with a different model. Both runs share the same starting context, so you're comparing only the synthesis and decision-making.

fork_compare.py
from opentine import Agent
from opentine.models.anthropic import Anthropic
from opentine.models.openai import OpenAI
from opentine.tools.web import search, fetch

# Start with a Claude run
agent = Agent(
    model=Anthropic("claude-sonnet-4-20250514"),
    tools=[search, fetch],
)
run = agent.run_sync("Research quantum computing advances")
run.save("base_run.tine")

# Fork from step 5 (after search) and resume with GPT-4o
forked = run.fork(from_step_id="step_5_id")
gpt_agent = Agent(
    model=OpenAI("gpt-4o"),
    tools=[search, fetch],
)
gpt_run = gpt_agent.resume(forked)
gpt_run.save("gpt_from_step5.tine")

# Both runs share steps 1-5, but diverge in synthesis
print(f"Claude cost: ${run.total_cost:.4f}")
print(f"GPT-4o cost: ${gpt_run.total_cost:.4f}")

This approach isolates the comparison to specific steps. The search results are identical — only the model's interpretation and synthesis differ.
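Given the two runs' step IDs, the divergence point is simply the length of their shared prefix. The helper below is illustrative (it takes plain lists, since how step IDs are read off a run isn't shown here), and the IDs are made up:

```python
def shared_prefix_len(a: list[str], b: list[str]) -> int:
    """Number of leading steps two runs have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# Hypothetical step-ID lists for the base run and the GPT-4o fork.
base = ["s1", "s2", "s3", "s4", "s5", "s6_claude"]
fork = ["s1", "s2", "s3", "s4", "s5", "s6_gpt"]
print(shared_prefix_len(base, fork))  # 5
```

Everything past that index is model-specific synthesis, which is exactly the region tine diff highlights.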

Next Steps