# Using Claude Code as Your Lab Partner: A Framework for Running Experiments

I needed to benchmark my laptop for local AI workloads — how fast can it transcribe audio? What LLMs can it run? Rather than spend a weekend hacking at it manually, I paired with Claude Code and treated it as a lab partner. Out of that came a repeatable framework for running any kind of structured experiment with an AI assistant.
This post walks through that framework, using the AI benchmarks as a concrete example. The benchmarks themselves aren't the point. The process is.
## The Problem with Ad-Hoc Experiments
We've all been there. You want to test something — maybe compare three database configurations, or figure out which image processing library is fastest, or evaluate different deployment strategies. You start hacking. You run some commands. You get some numbers. You forget what you changed between runs. Three hours later you have a terminal full of output and no clear answer.
Experiments need structure. But structure takes effort, and when you're exploring, the last thing you want to do is build scaffolding before you know what you're building.
This is where an AI coding assistant changes the game. It can set up structure while you focus on decisions. It can write benchmark scripts, track results, and maintain documentation — all in real time, all while you stay in the driver's seat.
## The Framework
### 1. Start with Goals, Not Code
Before touching any tools or writing any scripts, we created two files:
**`goals.md`** — What are we trying to learn? What questions do we want to answer? What metrics matter?
This isn't a formality. Writing down your goals forces you to think about what "done" looks like. In our case, the goals were:
- What's the fastest transcription setup for this GPU?
- What LLMs can run on 4GB VRAM, and how fast?
- What are the optimal settings for each?
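As a concrete sketch, the resulting `goals.md` might look something like this. This is a paraphrase built from the questions above — the "Metrics" section is my guess at plausible choices, not the session's verbatim file:

```markdown
# Goals

Benchmark this machine (GTX 1650, 4GB VRAM) for local AI workloads.

## Questions
1. What's the fastest transcription setup for this GPU?
2. What LLMs can run on 4GB VRAM, and how fast?
3. What are the optimal settings for each?

## Metrics (illustrative)
- Realtime (RT) factor for transcription
- Tokens/sec for text generation
- Peak VRAM usage per configuration
```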
**`guide.md`** — How will we organize our work? What's the directory structure? What does the workflow look like?
The guide established conventions early: where results go, how experiments are numbered, what tables look like. This is the kind of thing that feels unnecessary at the start but saves you when you're eight experiments deep and can't remember which JSON file goes with which test.
The key insight: I described what I wanted in plain English. Claude Code created the files. I didn't write markdown tables or think about directory structures. I said "I want to test transcription and LLM text generation" and reviewed what it produced. The AI handles the formatting; you handle the thinking.
### 2. Create an Experiment Tracker
Before running anything, we created a tracker file (`transcription_experiments.md`) that listed:
- **What runtimes to test** — faster-whisper, whisper.cpp, openai-whisper
- **What models to test** — with expected VRAM requirements
- **The experiments themselves** — E1 through E5, each with a clear scope
- **Checkboxes** — unchecked at first, checked as we completed each one
This is your experiment's table of contents. It tells you what you've done, what's left, and what the results were. When we finished an experiment, Claude Code updated the checkboxes and added the key findings inline:
```markdown
### E1 — Runtime comparison (small model) ✅
- [x] faster-whisper + small (GPU) — **20.71x RT (int8), 7.78x (fp16)**
- [x] whisper.cpp + small (GPU) — **12.28x RT** (needs FORCE_MMQ build flag)
- [x] openai whisper + small (GPU) — **8.18x RT** (fp32 only, fp16 NaN on GTX 1650)
- **Winner: faster-whisper cuda/int8**
```
At any point, you can open this one file and know exactly where the project stands. No digging through terminal history or trying to remember what you tested last Tuesday.
### 3. Structure Your Experiments as a Funnel
We didn't test everything against everything. That's combinatorial explosion. Instead, each experiment narrowed the field for the next one:
- **E1: Runtime comparison** → Pick the fastest runtime
- **E2: Model sweep** → Using that runtime, pick the best models
- **E3: Quantization** → Using that model, pick the best precision
- **E4: Parameter tuning** → Using that config, optimize the settings
- **E5: Edge cases / scaling** → Stress-test the winning configuration
Each experiment has a single variable. E1 varies the runtime. E2 varies the model. E3 varies the quantization. This makes results interpretable — you know why something is faster, not just that it's faster.
This funnel structure is the most transferable part of the framework. Whether you're benchmarking databases, comparing ML frameworks, or evaluating cloud providers, the pattern is the same: start broad, narrow down, then optimize.
Don't over-plan the funnel upfront, either. Our E5 (context length testing) didn't exist in the original plan — it emerged from E4's findings about 7B models. The framework supports evolution; it doesn't demand a perfect plan from the start.
### 4. Write Scripts, Not One-Liners
For each experiment, Claude Code wrote a proper Python or bash script. Not a one-liner in the terminal — a script with:
- A consistent test file/prompt across all runs
- Timing instrumentation
- VRAM measurement
- Results saved to JSON for later analysis
- A summary printed to stdout
Scripts are reproducible. When I want to run the same test on a different machine, I run the same script. When someone else wants to reproduce my work, they can. At the end of our session, we had Claude Code generate a `reproduce.md` — a step-by-step guide to run everything from scratch. This is trivial for it to produce (it just did all the steps) but incredibly valuable later.
For example, our faster-whisper benchmark script (`bench_faster_whisper.py`) tested three configurations (cuda/float16, cuda/int8, cpu/int8), measured load time, transcription time, and VRAM usage for each, and saved structured JSON results. Writing this by hand would have taken 20 minutes. Claude Code wrote it in seconds based on "write a benchmark script for faster-whisper comparing GPU and CPU."
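The actual script isn't reproduced here, but the skeleton of such a benchmark is small. A minimal sketch — the `measure_vram_mb` helper and the `time.sleep` stand-in for the real transcription call are illustrative, not code from the repo:

```python
import json
import subprocess
import time

def measure_vram_mb():
    """Current GPU memory use via nvidia-smi; None when no NVIDIA GPU is present."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.used",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        )
        return int(out.stdout.split()[0])
    except (OSError, subprocess.CalledProcessError, ValueError, IndexError):
        return None

def bench(config_name, workload):
    """Time one configuration and return a structured result record."""
    start = time.perf_counter()
    workload()
    return {
        "config": config_name,
        "seconds": round(time.perf_counter() - start, 3),
        "vram_mb": measure_vram_mb(),
    }

# Stand-ins for model.transcribe(...) under each configuration
configs = ["cuda/float16", "cuda/int8", "cpu/int8"]
results = [bench(c, lambda: time.sleep(0.01)) for c in configs]

with open("bench_results.json", "w") as f:
    json.dump(results, f, indent=2)   # raw data for later analysis

for r in results:
    print(f"{r['config']}: {r['seconds']}s")   # summary to stdout
```

Every item on the checklist above maps to a few lines: consistent input, timing, VRAM, JSON, stdout summary. The structure is boring by design — boring is reproducible.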
### 5. Record Results in Markdown + JSON
Every experiment produced two outputs:
- `e<N>_results.md` — Human-readable writeup with tables, findings, and recommendations
- `e<N>_*.json` — Raw data for programmatic analysis
The markdown file is what you read. The JSON is what you'd feed into a comparison tool or plotting script if you wanted to go deeper. Each results file followed a consistent structure:
```markdown
# E1 — Runtime Comparison Results
**Date:** 2026-02-15
**Model:** Whisper small (244M params)
**Test file:** [consistent across experiments]
**Settings:** beam_size=5, language=en

## Results
[table]

## Key Findings
[numbered list of insights]

## Setup Notes
[gotchas, workarounds, things that broke]

## Recommendation
[what to use going forward]
```
The "Setup Notes" section turned out to be surprisingly valuable. Every environment has gotchas — silent GPU fallback, NaN errors with certain precision modes, libraries that need manual path configuration. Documenting them inline with results means you (or someone reproducing your work) won't hit the same wall twice.
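For concreteness, a single record in the JSON companion file might look like the following. The field names and the VRAM value are illustrative placeholders (only the RT figure comes from the results quoted above); the repo's actual schema may differ:

```json
{
  "experiment": "E1",
  "runtime": "faster-whisper",
  "model": "small",
  "device": "cuda",
  "compute_type": "int8",
  "rt_factor": 20.71,
  "vram_mb": 1100
}
```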
### 6. You Decide, the AI Executes
Here's what Claude Code actually did during our session:
- **Environment setup:** Created a Python virtualenv, installed packages, figured out CUDA library paths
- **Debugging:** When GPU offload silently failed, it diagnosed the issue, rebuilt the tools with the right flags, and re-ran the tests
- **Script writing:** Wrote all benchmark scripts — timing loops, VRAM measurement, JSON output
- **Model downloads:** Downloaded GGUF models from HuggingFace, handled 404s by finding alternative repos
- **Results analysis:** Calculated RT factors, wrote comparison tables, identified winners
- **Documentation:** Updated experiment trackers, wrote results files, maintained the `CLAUDE.md`
Here's what I did:
- **Made decisions:** Which experiments to run, in what order, when to move on
- **Reviewed results:** Looked at the numbers, decided if they made sense
- **Course-corrected:** Caught that we were running without proper GPU support and stopped to fix the environment before continuing
- **Set priorities:** "Skip koboldcpp, the winner is clear"
This division of labor is the key. The AI is fast at execution. You're fast at judgment. Don't let it run on autopilot — stay engaged with the results. When things broke (and they broke a lot), the failures often produced the most valuable findings. A library crashing on your GPU is just as useful to know as one that runs at 20x realtime.
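A note on the numbers: the RT factor used throughout is just audio duration divided by wall-clock processing time. A one-line helper makes the arithmetic explicit — the 60-second clip here is an illustrative input, not the session's actual test file:

```python
def rt_factor(audio_seconds, processing_seconds):
    """Realtime factor: seconds of audio processed per second of wall-clock time."""
    return audio_seconds / processing_seconds

# A 60 s clip transcribed in 2.9 s runs at roughly 20.7x realtime,
# in the same ballpark as the faster-whisper cuda/int8 result above.
print(round(rt_factor(60.0, 2.9), 2))
```

Anything above 1.0 is faster than realtime; a value of 20 means an hour of audio transcribes in about three minutes.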
### 7. Build the `CLAUDE.md` as You Go
A `CLAUDE.md` file captures what a future session needs to know. We built ours at the end, but it drew from everything we learned:
- How to set up the environment
- How to build tools with the right flags
- What the best configurations are
- Where to find results
This file means the next time I (or anyone) opens this project with Claude Code, it doesn't start from zero. It knows the project structure, the conventions, and the hard-won knowledge from the experiment session.
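A heavily condensed sketch of what such a file can capture — the setup lines are generic examples, though the FORCE_MMQ gotcha comes straight from the E1 notes above:

```markdown
# CLAUDE.md

## Setup
- `python -m venv .venv && pip install -r requirements.txt`
- whisper.cpp must be built with FORCE_MMQ for GPU offload on this card

## Best known configs
- Transcription: faster-whisper, cuda/int8 (~20x RT on the GTX 1650)

## Where things live
- `scripts/` — one benchmark script per experiment
- `results/` — markdown writeup + raw JSON per experiment
```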
## Adapting the Framework
The pattern generalizes beyond benchmarking. For example, database performance testing:
- `goals.md` → "Find the fastest DB config for our read-heavy workload"
- `experiments.md` →
  - E1: Compare engines (Postgres vs MySQL vs SQLite)
  - E2: Index strategies on the winner
  - E3: Connection pooling settings
  - E4: Query optimization
  - E5: Concurrency/load testing
The same funnel applies to API evaluation, infrastructure cost optimization, ML model selection — anything where you're narrowing down options through structured comparison. The general pattern is:
- `goals.md` — What questions are you answering? What does success look like?
- `guide.md` — Directory structure, conventions, workflow
- `experiments.md` — Funnel-shaped experiment list with checkboxes
- `scripts/` — One script per experiment, saves raw JSON
- `results/` — Markdown writeup + JSON data per experiment
- `CLAUDE.md` — What future sessions need to know
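Spinning up that skeleton is a one-liner you can run (or ask the assistant to run) before E1 — the names are just this framework's conventions, nothing depends on them:

```shell
# Scaffold the experiment layout: tracker files plus script/result directories
mkdir -p scripts results
touch goals.md guide.md experiments.md CLAUDE.md
```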
Start the conversation with: "I want to evaluate [X]. Here are my constraints: [Y]. Create a goals file and experiment tracker, then let's start with E1."
## What This Isn't
This isn't about replacing your expertise with AI. You still need to know what questions to ask, whether results make sense, and when to dig deeper. Claude Code can't tell you that 20x realtime transcription is "good enough" for your use case — only you know that.
What this is about is having a capable lab partner who handles the tedious parts — setup, scripting, documentation, debugging — so you can focus on the interesting parts: asking questions, interpreting results, and making decisions.
All code, scripts, and results from this benchmarking session are available in the llmexp repository. The full session benchmarked audio transcription (faster-whisper, whisper.cpp, openai-whisper, moonshine) and LLM text generation (llama.cpp, ollama) on an NVIDIA GTX 1650 with 4GB VRAM.
## The Starting Prompt
For the curious, here's the exact message that kicked off the entire session:
> ok. this is a folder where i want to run and doctument local llm experiments. the goal is to bench mark the current system on how well it can run llms and ai models. to start out with create a goals.md file outling this. I mainly want to test two things, 1. locall llm text generation (what ever models fit this gpu) and 2. audio transcription. I want to tract the best settings, best way to setup the models, best utils to use etc. then create a guide.md file to keep us on track and organised as we run experiments.
