Cross-Model Testing

opentine's universal model API makes it easy to run the same prompt across different providers and compare quality, cost, and speed. Every model adapter implements the same Model protocol, so swapping providers is a one-line change.
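
To illustrate why a shared protocol makes the swap a one-line change, here is a minimal sketch built on Python's `typing.Protocol`. The `Model` interface, its `complete` method, and both toy adapters are hypothetical stand-ins written for this illustration; they are not opentine's actual protocol definition.

```python
from typing import Protocol


class Model(Protocol):
    """Hypothetical shared interface (not opentine's real protocol)."""

    name: str

    def complete(self, prompt: str) -> str: ...


class EchoModel:
    """Toy adapter: returns the prompt unchanged."""

    def __init__(self, name: str) -> None:
        self.name = name

    def complete(self, prompt: str) -> str:
        return prompt


class ShoutModel:
    """Toy adapter: returns the prompt in upper case."""

    def __init__(self, name: str) -> None:
        self.name = name

    def complete(self, prompt: str) -> str:
        return prompt.upper()


def ask(model: Model, prompt: str) -> str:
    # The caller depends only on the shared interface, never on a concrete adapter.
    return f"{model.name}: {model.complete(prompt)}"


model: Model = EchoModel("echo")  # swapping adapters means changing only this line
print(ask(model, "hello"))
```

Because `ask` only relies on the interface, replacing `EchoModel("echo")` with `ShoutModel("shout")` requires no other change.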

What It Does

  • Runs the same prompt and tools across Anthropic, OpenAI, Google, and Ollama
  • Saves each run as a separate .tine file for comparison
  • Reports cost, duration, and step count for each model
  • Uses tine diff to compare how models approached the task differently

Prerequisites

Terminal
# Set API keys for the models you want to test
export ANTHROPIC_API_KEY="sk-ant-..."
export OPENAI_API_KEY="sk-..."
export GOOGLE_API_KEY="AIza..."

# For Ollama, ensure the server is running locally
# ollama serve

# Optional: search provider key
export TAVILY_API_KEY="tvly-..."

You only need API keys for the models you want to test. Skip any provider you don't have access to.
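
One way to skip providers automatically is to check which keys are set before building the model list. The sketch below uses only the standard library; the `available_providers` helper is hypothetical, and the labels mirror the ones used in the full example. It does not verify that a key is valid or that the Ollama server is actually running.

```python
import os

# Environment variable each cloud provider needs, taken from the exports above.
# Ollama runs locally and needs no key, so it maps to None.
REQUIRED_KEYS = {
    "claude": "ANTHROPIC_API_KEY",
    "gpt4o": "OPENAI_API_KEY",
    "gemini": "GOOGLE_API_KEY",
    "llama": None,
}


def available_providers(env=None):
    """Return the provider labels whose API key is set (hypothetical helper)."""
    env = os.environ if env is None else env
    return [
        label
        for label, key in REQUIRED_KEYS.items()
        if key is None or env.get(key)
    ]


print(available_providers())
```

The returned labels can then be used to filter the dictionary of models before any agent is created.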

Full Example

cross_model.py
import asyncio
from opentine import Agent
from opentine.models.anthropic import Anthropic
from opentine.models.openai import OpenAI
from opentine.models.google import Google
from opentine.models.ollama import Ollama
from opentine.tools.search import search
from opentine.tools.web import fetch

# One adapter per provider; remove any entry you don't have access to.
models = {
    "claude":  Anthropic("claude-sonnet-4-20250514"),
    "gpt4o":   OpenAI("gpt-4o"),
    "gemini":  Google("gemini-2.5-flash"),
    "llama":   Ollama("llama3.3"),
}

prompt = "Research the current state of solid-state batteries and summarize the key players."

async def run_model(name, model):
    # Same tools, system prompt, and step limit for every model.
    agent = Agent(
        model=model,
        tools=[search, fetch],
        system="You are a research assistant. Be thorough but concise.",
        max_steps=20,
    )
    run = await agent.run(prompt)
    run.save(f"comparison_{name}.tine")
    return name, run

async def main():
    tasks = [run_model(name, model) for name, model in models.items()]
    results = await asyncio.gather(*tasks)

    print(f"{'Model':<12} {'Status':<12} {'Steps':<8} {'Cost':>10} {'Duration':>10}")
    print("-" * 56)
    for name, run in results:
        # Format cost and duration first so both columns line up with the header.
        cost = f"${run.total_cost:.4f}"
        duration = f"{run.total_duration:.1f}s"
        print(
            f"{name:<12} {run.status.value:<12} {len(run.steps):<8} "
            f"{cost:>10} {duration:>10}"
        )

asyncio.run(main())

All four models run concurrently via asyncio.gather. Each produces an independent run graph saved to its own .tine file.
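
By default, `asyncio.gather` propagates the first exception to the caller, so a single failing provider (for example, a missing API key) would hide the results of the others. The sketch below shows the standard `return_exceptions=True` option using a stand-in coroutine; `fake_run` and `run_all` are hypothetical names for this illustration and do not call opentine.

```python
import asyncio


async def fake_run(name, delay, fail=False):
    """Stand-in for run_model(); sleeps instead of calling a provider."""
    await asyncio.sleep(delay)
    if fail:
        raise RuntimeError(f"{name}: provider unavailable")
    return name, f"result from {name}"


async def run_all(specs):
    # return_exceptions=True makes gather return raised exceptions as values,
    # so one failing provider does not discard the other results.
    outcomes = await asyncio.gather(
        *(fake_run(*spec) for spec in specs), return_exceptions=True
    )
    succeeded = [o for o in outcomes if not isinstance(o, BaseException)]
    failed = [o for o in outcomes if isinstance(o, BaseException)]
    return succeeded, failed


if __name__ == "__main__":
    specs = [("claude", 0.01), ("gpt4o", 0.01, True), ("llama", 0.02)]
    ok, errors = asyncio.run(run_all(specs))
    print([name for name, _ in ok])
    print([str(e) for e in errors])
```

The same pattern could be applied to the gather call in the full example, reporting failed providers separately from the comparison table.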

Example Output

Terminal
Model        Status       Steps          Cost   Duration
--------------------------------------------------------
claude       completed    12          $0.0341      18.7s
gpt4o        completed    14          $0.0289      22.1s
gemini       completed    10          $0.0156      15.3s
llama        completed    16          $0.0000      45.8s

Ollama runs incur no API cost because the model runs locally, but they are typically slower. Cloud models vary in cost and speed. The step count gives a rough indication of how efficiently each model approached the task.
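
Raw totals can be easier to compare once normalized per step. The following sketch uses the figures from the example output above, hard-coded for illustration; with real runs the same values would come from each run's step count, total cost, and total duration. The helper names are hypothetical.

```python
# Figures copied from the example output above:
# (steps, cost in USD, duration in seconds).
results = {
    "claude": (12, 0.0341, 18.7),
    "gpt4o": (14, 0.0289, 22.1),
    "gemini": (10, 0.0156, 15.3),
    "llama": (16, 0.0000, 45.8),
}


def per_step(results):
    """Return {name: (cost per step, seconds per step)}."""
    return {
        name: (cost / steps, duration / steps)
        for name, (steps, cost, duration) in results.items()
    }


def fastest(results):
    """Return the label of the run with the shortest total duration."""
    return min(results, key=lambda name: results[name][2])


for name, (cost, seconds) in per_step(results).items():
    print(f"{name:<8} ${cost:.5f}/step  {seconds:.2f}s/step")
print("fastest:", fastest(results))
```

Per-step figures are only a rough guide, since steps can differ widely in how much work they do.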

Comparing Runs

Use tine diff to see exactly where two models diverged in their approach:

Terminal
# Compare any two runs side by side
tine diff comparison_claude.tine comparison_gpt4o.tine

# See where the models diverged in their approach
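
With four saved runs there are six possible pairs to compare. The sketch below only builds the command lines, using the two-file `tine diff` form shown above and the file names from the full example; it does not invoke the CLI. The `diff_commands` helper is a hypothetical name.

```python
from itertools import combinations

names = ["claude", "gpt4o", "gemini", "llama"]


def diff_commands(names):
    """Build one `tine diff` command line per unordered pair of saved runs."""
    return [
        f"tine diff comparison_{a}.tine comparison_{b}.tine"
        for a, b in combinations(names, 2)
    ]


for command in diff_commands(names):
    print(command)
```

The printed lines can be pasted into a terminal or written to a shell script.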

Testing Fewer Models

You don't need all four providers. Test just the models you're evaluating:

two_models.py
import asyncio
from opentine import Agent
from opentine.models.anthropic import Anthropic
from opentine.models.openai import OpenAI
from opentine.tools.search import search
from opentine.tools.web import fetch

prompt = "Research the current state of solid-state batteries"

async def main():
    # Test just two models
    models = [
        ("claude", Anthropic("claude-sonnet-4-20250514")),
        ("gpt4o", OpenAI("gpt-4o")),
    ]

    # Runs execute one after another here, not concurrently.
    for name, model in models:
        agent = Agent(
            model=model,
            tools=[search, fetch],
            max_steps=20,
        )
        run = await agent.run(prompt)
        run.save(f"test_{name}.tine")
        print(f"{name}: {run.status.value} in {run.total_duration:.1f}s (${run.total_cost:.4f})")

asyncio.run(main())
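
Note that this variant awaits each run inside a loop, so the models run one after another, whereas the full example runs them concurrently. A small sketch with stand-in coroutines shows the difference in wall-clock time; `fake_run`, `sequential`, `concurrent`, and `timed` are hypothetical names and nothing here calls opentine.

```python
import asyncio
import time


async def fake_run(delay):
    """Stand-in for agent.run(); sleeps instead of calling a provider."""
    await asyncio.sleep(delay)
    return delay


async def sequential(delays):
    # Each run starts only after the previous one has finished.
    return [await fake_run(d) for d in delays]


async def concurrent(delays):
    # All runs start together; total time is roughly the slowest run.
    return await asyncio.gather(*(fake_run(d) for d in delays))


def timed(coro_fn, delays):
    """Return the wall-clock seconds taken by coro_fn(delays)."""
    start = time.perf_counter()
    asyncio.run(coro_fn(delays))
    return time.perf_counter() - start


if __name__ == "__main__":
    delays = [0.2, 0.2]
    print(f"sequential: {timed(sequential, delays):.2f}s")
    print(f"concurrent: {timed(concurrent, delays):.2f}s")
```

Sequential runs take longer overall but keep each run's output and any rate limits easier to reason about; which trade-off is preferable depends on the evaluation.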

Fork-Based Comparison

For a more controlled comparison, run the initial steps with one model, then fork and resume with a different model. Both runs share the same starting context, so you're comparing only the synthesis and decision-making.

fork_compare.py
from opentine import Run, Agent
from opentine.models.anthropic import Anthropic
from opentine.models.openai import OpenAI
from opentine.tools.search import search
from opentine.tools.web import fetch

# Start with a Claude run
agent = Agent(
    model=Anthropic("claude-sonnet-4-20250514"),
    tools=[search, fetch],
)
run = agent.run_sync("Research quantum computing advances")
run.save("base_run.tine")

# Fork at the fifth step (index 4), after the search, and resume with GPT-4o.
# Inspect the saved run first to confirm which step ends the search phase.
fork_point = run.steps[4].id
forked = run.fork(from_step_id=fork_point)
gpt_agent = Agent(
    model=OpenAI("gpt-4o"),
    tools=[search, fetch],
)
gpt_run = gpt_agent.resume_sync(
    forked,
    prompt="Continue from the shared search context and synthesize the findings.",
)
gpt_run.save("gpt_from_step5.tine")

# Both runs share steps 1-5 but diverge in synthesis.
# Compare the cost of the two synthesis approaches.
print(f"Claude cost: ${run.total_cost:.4f}")
print(f"GPT-4o cost: ${gpt_run.total_cost:.4f}")

This approach isolates the comparison to the steps after the fork point. The search results in the shared steps are identical; only each model's interpretation and synthesis differ.
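
Conceptually, a fork copies the run's history up to the chosen step and lets a new branch continue from there. The toy model below sketches that idea with plain dataclasses; `Step` and `ToyRun` are hypothetical and say nothing about how opentine stores or forks runs internally.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Step:
    id: str
    output: str


@dataclass
class ToyRun:
    """Hypothetical, simplified run: an ordered list of steps."""

    steps: list = field(default_factory=list)

    def fork(self, from_step_id):
        # Copy every step up to and including the fork point into a new run.
        for index, step in enumerate(self.steps):
            if step.id == from_step_id:
                return ToyRun(steps=list(self.steps[: index + 1]))
        raise KeyError(f"unknown step id: {from_step_id}")


# Seven-step base run; fork at the fifth step, as in the example above.
base = ToyRun([Step(f"s{i}", f"claude output {i}") for i in range(1, 8)])
forked = base.fork(from_step_id=base.steps[4].id)
forked.steps.append(Step("s6-gpt", "gpt-4o synthesis"))

print(len(base.steps), len(forked.steps))
print(forked.steps[:5] == base.steps[:5])
```

The shared prefix stays identical in both runs, while everything after the fork point belongs to one branch only.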

Next Steps