Skip to content

Latest commit

 

History

History
 
 

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

README.md

Goose Benchmark Scripts

This directory contains scripts for running and analyzing Goose benchmarks.

run-benchmarks.sh

This script runs Goose benchmarks across multiple provider:model pairs and analyzes the results.

Prerequisites

  • Goose CLI must be built or installed
  • jq command-line tool for JSON processing (optional, but recommended for result analysis)

Usage

./scripts/run-benchmarks.sh [options]

Options

  • -p, --provider-models: Comma-separated list of provider:model pairs (e.g., 'openai:gpt-4o,anthropic:claude-sonnet-4')
  • -s, --suites: Comma-separated list of benchmark suites to run (e.g., 'core,small_models')
  • -o, --output-dir: Directory to store benchmark results (default: './benchmark-results')
  • -d, --debug: Use debug build instead of release build
  • -h, --help: Show help message

Examples

# Run with release build (default)
./scripts/run-benchmarks.sh --provider-models 'openai:gpt-4o,anthropic:claude-sonnet-4' --suites 'core,small_models'

# Run with debug build
./scripts/run-benchmarks.sh --provider-models 'openai:gpt-4o' --suites 'core' --debug

How It Works

The script:

  1. Parses the provider:model pairs and benchmark suites
  2. Determines whether to use the debug or release binary
  3. For each provider:model pair:
    • Sets the GOOSE_PROVIDER and GOOSE_MODEL environment variables
    • Runs the benchmark with the specified suites
    • Analyzes the results for failures
  4. Generates a summary of all benchmark runs

Output

The script creates the following files in the output directory:

  • summary.md: A summary of all benchmark results
  • {provider}-{model}.json: Raw JSON output from each benchmark run
  • {provider}-{model}-analysis.txt: Analysis of each benchmark run

Exit Codes

  • 0: All benchmarks completed successfully
  • 1: One or more benchmarks failed

parse-benchmark-results.sh

This script analyzes a single benchmark JSON result file and identifies any failures.

Usage

./scripts/parse-benchmark-results.sh path/to/benchmark-results.json

Output

The script outputs an analysis of the benchmark results to stdout, including:

  • Basic information about the benchmark run
  • Results for each evaluation in each suite
  • Summary of passed and failed metrics

Exit Codes

  • 0: All metrics passed successfully
  • 1: One or more metrics failed