Name
bench-postprocess-scripts
provider-error-proxy
test-subrecipes-examples
README.md
check-no-native-tls.sh
check-openapi-schema.sh
clean-gh-pages.sh
diagnostics-viewer.py
goose-db-helper.sh
parse-benchmark-results.sh
pre-release.sh
run-benchmarks.sh
test_compaction.sh
test_lead_worker.sh
test_mcp.sh
test_providers.sh
test_providers_code_exec.sh
test_providers_lib.sh
test_subrecipes.sh
test_web.sh

Goose Benchmark Scripts

This directory contains scripts for running and analyzing Goose benchmarks.

run-benchmarks.sh

This script runs Goose benchmarks across multiple provider:model pairs and analyzes the results.

Prerequisites

Goose CLI must be built or installed
jq command-line tool for JSON processing (optional, but recommended for result analysis)
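
A quick way to check the optional prerequisite before a run is sketched below. The helper name `have_cmd` and the messages are illustrative assumptions, not part of the scripts in this directory.

```shell
# Hypothetical helper: succeeds if the named command is available on PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# jq is optional, so a missing jq only produces a warning.
if have_cmd jq; then
  echo "jq found"
else
  echo "jq not found: result analysis may be limited" >&2
fi
```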

Usage

./scripts/run-benchmarks.sh [options]

Options

-p, --provider-models: Comma-separated list of provider:model pairs (e.g., 'openai:gpt-4o,anthropic:claude-sonnet-4')
-s, --suites: Comma-separated list of benchmark suites to run (e.g., 'core,small_models')
-o, --output-dir: Directory to store benchmark results (default: './benchmark-results')
-d, --debug: Use debug build instead of release build
-h, --help: Show help message

Examples

# Run with release build (default)
./scripts/run-benchmarks.sh --provider-models 'openai:gpt-4o,anthropic:claude-sonnet-4' --suites 'core,small_models'

# Run with debug build
./scripts/run-benchmarks.sh --provider-models 'openai:gpt-4o' --suites 'core' --debug

How It Works

The script:

1. Parses the provider:model pairs and benchmark suites
2. Determines whether to use the debug or release binary
3. For each provider:model pair:
   - Sets the GOOSE_PROVIDER and GOOSE_MODEL environment variables
   - Runs the benchmark with the specified suites
   - Analyzes the results for failures
4. Generates a summary of all benchmark runs
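
The per-pair loop described above could look roughly like the following. This is a minimal sketch, not the script's actual code: `run_all` and `run_one_benchmark` are hypothetical names, and the stand-in function only echoes what it would run instead of invoking Goose.

```shell
# Hypothetical stand-in for the real benchmark invocation.
run_one_benchmark() {
  echo "provider=$GOOSE_PROVIDER model=$GOOSE_MODEL suites=$1"
}

# Runs in a subshell (note the parentheses) so the IFS and globbing
# changes do not leak into the caller.
run_all() (
  pairs_csv=$1
  suites=$2
  set -f      # disable globbing while splitting on commas
  IFS=','
  for pair in $pairs_csv; do
    # Text before the first colon is the provider; the rest is the model.
    GOOSE_PROVIDER=${pair%%:*}
    GOOSE_MODEL=${pair#*:}
    export GOOSE_PROVIDER GOOSE_MODEL
    run_one_benchmark "$suites"
  done
)

run_all 'openai:gpt-4o,anthropic:claude-sonnet-4' 'core,small_models'
# prints:
# provider=openai model=gpt-4o suites=core,small_models
# provider=anthropic model=claude-sonnet-4 suites=core,small_models
```

Splitting on the first colon only means a model name that itself contains a colon stays intact.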

Output

The script creates the following files in the output directory:

summary.md: A summary of all benchmark results
{provider}-{model}.json: Raw JSON output from each benchmark run
{provider}-{model}-analysis.txt: Analysis of each benchmark run
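
To locate the files for a given pair, the naming pattern above can be expanded as in this sketch. The function name `result_paths` is an assumption, and the sketch does not account for any sanitizing the script may apply to model names (for example, names containing slashes).

```shell
# Hypothetical helper: print the expected result and analysis paths
# for one provider:model pair, following the naming pattern above.
result_paths() {
  output_dir=$1
  pair=$2
  provider=${pair%%:*}
  model=${pair#*:}
  printf '%s\n' \
    "$output_dir/$provider-$model.json" \
    "$output_dir/$provider-$model-analysis.txt"
}

result_paths ./benchmark-results openai:gpt-4o
# prints:
# ./benchmark-results/openai-gpt-4o.json
# ./benchmark-results/openai-gpt-4o-analysis.txt
```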

Exit Codes

0: All benchmarks completed successfully
1: One or more benchmarks failed
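
Because the exit code distinguishes success from failure, a CI step can gate on it. The wrapper below is a sketch under that assumption; `report_status` is a hypothetical name and wraps whatever command it is given.

```shell
# Hypothetical wrapper: run a command and report based on its exit code.
report_status() {
  status=0
  "$@" || status=$?
  if [ "$status" -eq 0 ]; then
    echo "all benchmarks completed successfully"
  else
    echo "one or more benchmarks failed (exit code $status)" >&2
  fi
  return "$status"
}

# Example call, using only options documented above:
# report_status ./scripts/run-benchmarks.sh --provider-models 'openai:gpt-4o' --suites 'core'
```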

parse-benchmark-results.sh

This script analyzes a single benchmark JSON result file and identifies any failures.

Usage

./scripts/parse-benchmark-results.sh path/to/benchmark-results.json

Output

The script outputs an analysis of the benchmark results to stdout, including:

Basic information about the benchmark run
Results for each evaluation in each suite
Summary of passed and failed metrics

Exit Codes

0: All metrics passed successfully
1: One or more metrics failed
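
Since the parser takes one file at a time, checking a whole results directory means looping over it. The following is a sketch only: `check_results` is a hypothetical helper that takes the parser path as its first argument and prints how many files had failing metrics.

```shell
# Hypothetical helper: run a parser over every .json file in a directory
# and print the number of files for which it exited non-zero.
check_results() {
  parser=$1
  results_dir=$2
  failed=0
  for f in "$results_dir"/*.json; do
    [ -e "$f" ] || continue   # skip the literal pattern when nothing matches
    if ! "$parser" "$f" >/dev/null; then
      echo "metrics failed: $f" >&2
      failed=$((failed + 1))
    fi
  done
  echo "$failed"
}

# Example call:
# check_results ./scripts/parse-benchmark-results.sh ./benchmark-results
```

The parser's stdout is discarded here to keep the count easy to read; drop the redirect to see the full analysis for each file.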
