Cross-Model Testing
opentine's universal model API makes it easy to run the same prompt across different providers and compare quality, cost, and speed. Every model adapter implements the same Model protocol, so swapping providers is a one-line change.
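Why a one-line swap works: a minimal, hypothetical sketch of structural typing with `typing.Protocol`. The `Model`, `FakeClaude`, and `FakeGPT` names here are illustrative stand-ins, not opentine's actual classes.

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class Model(Protocol):
    """Anything with a matching complete() method satisfies the protocol."""
    def complete(self, prompt: str) -> str: ...

# Two hypothetical adapters with the same call shape
class FakeClaude:
    def complete(self, prompt: str) -> str:
        return f"[claude] {prompt}"

class FakeGPT:
    def complete(self, prompt: str) -> str:
        return f"[gpt] {prompt}"

def run(model: Model, prompt: str) -> str:
    # The caller never needs to know which provider it got
    return model.complete(prompt)

print(run(FakeClaude(), "hello"))  # swap in FakeGPT() with one line
```

Because the protocol is structural, no adapter has to inherit from a shared base class; it only has to expose the same method signature.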
What It Does
- Runs the same prompt and tools across Anthropic, OpenAI, Google, and Ollama
- Saves each run as a separate .tine file for comparison
- Reports cost, duration, and step count for each model
- Uses tine diff to compare how models approached the task differently
Prerequisites
```bash
# Set API keys for the models you want to test
export ANTHROPIC_API_KEY="sk-ant-..."
export OPENAI_API_KEY="sk-..."
export GOOGLE_API_KEY="AIza..."

# For Ollama, ensure the server is running locally
# ollama serve

# Optional: search provider key
export TAVILY_API_KEY="tvly-..."
```

You only need API keys for the models you want to test. Skip any provider you don't have access to.
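Since you may not have every key, one stdlib-only way to keep the comparison script from failing on unconfigured providers is to filter by environment variable before building the model table. This is a sketch (only the variable names above come from the docs); Ollama is omitted because it needs no key.

```python
import os

# Map each provider name to the environment variable it needs
REQUIRED_KEYS = {
    "claude": "ANTHROPIC_API_KEY",
    "gpt4o": "OPENAI_API_KEY",
    "gemini": "GOOGLE_API_KEY",
}

def available_providers(env=os.environ):
    """Return the provider names whose API keys are set."""
    return [name for name, var in REQUIRED_KEYS.items() if env.get(var)]

print(available_providers({"ANTHROPIC_API_KEY": "sk-ant-..."}))  # → ['claude']
```

You could then build the `models` dict from this list and skip the rest.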
Full Example
```python
import asyncio
from opentine import Agent
from opentine.models.anthropic import Anthropic
from opentine.models.openai import OpenAI
from opentine.models.google import Google
from opentine.models.ollama import Ollama
from opentine.tools.web import search, fetch

models = {
    "claude": Anthropic("claude-sonnet-4-20250514"),
    "gpt4o": OpenAI("gpt-4o"),
    "gemini": Google("gemini-2.5-flash"),
    "llama": Ollama("llama3.3"),
}

prompt = "Research the current state of solid-state batteries and summarize the key players."

async def run_model(name, model):
    agent = Agent(
        model=model,
        tools=[search, fetch],
        system="You are a research assistant. Be thorough but concise.",
        max_steps=20,
    )
    run = await agent.run(prompt)
    run.save(f"comparison_{name}.tine")
    return name, run

async def main():
    tasks = [run_model(name, model) for name, model in models.items()]
    results = await asyncio.gather(*tasks)

    print(f"{'Model':<12} {'Status':<12} {'Steps':<8} {'Cost':>10} {'Duration':>10}")
    print("-" * 56)
    for name, run in results:
        cost = f"${run.total_cost:.4f}"
        duration = f"{run.total_duration:.1f}s"
        print(f"{name:<12} {run.status:<12} {len(run.steps()):<8} {cost:>10} {duration:>10}")

asyncio.run(main())
```
All four models run concurrently via asyncio.gather. Each produces an independent run tree saved to its own .tine file.
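The fan-out pattern itself is plain asyncio. A self-contained sketch, with a stub coroutine standing in for `agent.run()`, shows that total wall time tracks the slowest task rather than the sum of all tasks:

```python
import asyncio
import time

async def fake_run(name: str, seconds: float) -> tuple[str, float]:
    """Stand-in for an agent run: just sleeps for its 'duration'."""
    await asyncio.sleep(seconds)
    return name, seconds

async def main() -> float:
    start = time.monotonic()
    # All three "runs" execute concurrently
    await asyncio.gather(
        fake_run("claude", 0.3),
        fake_run("gpt4o", 0.2),
        fake_run("gemini", 0.1),
    )
    # Elapsed time is close to max(0.3, 0.2, 0.1), not the 0.6s sum
    return time.monotonic() - start

elapsed = asyncio.run(main())
print(f"{elapsed:.1f}s")
```

The same property holds for the real comparison script: adding a provider costs roughly nothing in wall time as long as it isn't the slowest one.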
Example Output
```
Model        Status       Steps         Cost   Duration
--------------------------------------------------------
claude       completed    12          $0.0341      18.7s
gpt4o        completed    14          $0.0289      22.1s
gemini       completed    10          $0.0156      15.3s
llama        completed    16          $0.0000      45.8s
```

Ollama runs are free (local model) but typically slower. Cloud models vary in cost and speed. The step count reveals how efficiently each model approaches the task.
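Steps and cost can also be combined into a rough cost-per-step efficiency figure. Purely illustrative arithmetic over the example numbers above:

```python
# (steps, total cost in USD) from the example output above
runs = {
    "claude": (12, 0.0341),
    "gpt4o": (14, 0.0289),
    "gemini": (10, 0.0156),
    "llama": (16, 0.0000),
}

for name, (steps, cost) in runs.items():
    # Average cost of each agent step for this model
    print(f"{name:<8} ${cost / steps:.5f}/step")
```

On these figures, gemini is the cheapest per step among the cloud models, though per-step cost says nothing about answer quality on its own.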
Comparing Runs
Use tine diff to see exactly where two models diverged in their approach:
```bash
# Compare any two runs side by side
tine diff comparison_claude.tine comparison_gpt4o.tine

# See where the models diverged in their approach
```

Testing Fewer Models
You don't need all four providers. Test just the models you're evaluating:
```python
import asyncio
from opentine import Agent
from opentine.models.anthropic import Anthropic
from opentine.models.openai import OpenAI
from opentine.tools.web import search, fetch

prompt = "Research the current state of solid-state batteries"

async def main():
    # Test just two models
    models = [
        ("claude", Anthropic("claude-sonnet-4-20250514")),
        ("gpt4o", OpenAI("gpt-4o")),
    ]

    for name, model in models:
        agent = Agent(
            model=model,
            tools=[search, fetch],
            max_steps=20,
        )
        run = await agent.run(prompt)
        run.save(f"test_{name}.tine")
        print(f"{name}: {run.status} in {run.total_duration:.1f}s (${run.total_cost:.4f})")

asyncio.run(main())
```
Fork-Based Comparison
For a more controlled comparison, run the initial steps with one model, then fork and resume with a different model. Both runs share the same starting context, so you're comparing only the synthesis and decision-making.
```python
from opentine import Agent
from opentine.models.anthropic import Anthropic
from opentine.models.openai import OpenAI
from opentine.tools.web import search, fetch

# Start with a Claude run
agent = Agent(
    model=Anthropic("claude-sonnet-4-20250514"),
    tools=[search, fetch],
)
run = agent.run_sync("Research quantum computing advances")
run.save("base_run.tine")

# Fork from step 5 (after search) and resume with GPT-4o
forked = run.fork(from_step_id="step_5_id")
gpt_agent = Agent(
    model=OpenAI("gpt-4o"),
    tools=[search, fetch],
)
gpt_run = gpt_agent.resume(forked)
gpt_run.save("gpt_from_step5.tine")

# Both runs share steps 1-5, but diverge in synthesis
# Compare the different synthesis approaches
print(f"Claude cost: ${run.total_cost:.4f}")
print(f"GPT-4o cost: ${gpt_run.total_cost:.4f}")
```
This approach isolates the comparison to specific steps. The search results are identical — only the model's interpretation and synthesis differ.
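The shared-prefix idea behind fork-based comparison can be sketched without the library: given two step lists that start identically, find the index where they first diverge. This hypothetical helper is not the tine diff implementation, just the concept.

```python
def divergence_point(a: list[str], b: list[str]) -> int:
    """Index of the first step at which two runs differ."""
    i = 0
    while i < len(a) and i < len(b) and a[i] == b[i]:
        i += 1
    return i

# Two runs that share a forked prefix, then synthesize differently
claude_steps = ["search", "fetch", "fetch", "summarize:A"]
gpt_steps = ["search", "fetch", "fetch", "summarize:B"]

print(divergence_point(claude_steps, gpt_steps))  # → 3
```

Everything before the divergence point is held constant by the fork, so any difference after it is attributable to the second model.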
Next Steps
- Model Adapters overview — all supported providers and configuration
- Forked Debug pattern — use forking for iterative debugging
- Replay & Resume — replay saved runs deterministically