Skip to content

Commit dc385bd

Browse files
committed
chore: update seeder for testing:
* Add a flag for skipping verification for smoke tests (`--no-verify`) * Lazily load `openai` dependency as to not require credentials * Only use browser during verification * Add a flag for skipping report generation (`--no-report`) * Return an error exit code if any verification fails (unless `--no-verify`) * Connect a single charm gen to a parallel integration test to ensure seeder runs properly. No verification is done. * Align shared usage of utils between seeder/jumble-integration
1 parent e2f45e8 commit dc385bd

20 files changed

+545
-480
lines changed

.github/workflows/deno.yml

+48
Original file line numberDiff line numberDiff line change
@@ -367,6 +367,54 @@ jobs:
367367
FRONTEND_URL=http://localhost:8000/ \
368368
deno task integration
369369
370+
seeder-integration-test:
371+
name: "Seeder Integration Tests"
372+
runs-on: ubuntu-latest
373+
needs: ["build-binaries"]
374+
environment: ci
375+
services:
376+
redis:
377+
image: redis
378+
ports:
379+
- 6379:6379
380+
options: >-
381+
--health-cmd "redis-cli ping"
382+
--health-interval 10s
383+
--health-timeout 5s
384+
--health-retries 5
385+
steps:
386+
- name: 📥 Checkout repository
387+
uses: actions/checkout@v4
388+
389+
- name: 🦕 Setup Deno 2.2.2
390+
uses: denoland/setup-deno@v2
391+
with:
392+
deno-version: "2.2.2"
393+
394+
- name: 📦 Cache Deno dependencies
395+
uses: actions/cache@v3
396+
with:
397+
path: |
398+
~/.deno
399+
~/.cache/deno
400+
key: ${{ runner.os }}-deno-${{ hashFiles('**/deno.json') }}
401+
402+
- name: 📥 Download built binaries
403+
uses: actions/download-artifact@v4
404+
405+
- name: 🚀 Start Toolshed server for testing
406+
run: |
407+
chmod +x ./common-binaries/toolshed
408+
CTTS_AI_LLM_ANTHROPIC_API_KEY=fake \
409+
CACHE_DIR=${GITHUB_WORKSPACE}/jumble/integration/cache \
410+
./common-binaries/toolshed &
411+
412+
- name: 🧪 Run seeder integration tests
413+
working-directory: seeder
414+
run: |
415+
TOOLSHED_API_URL=http://localhost:8000/ \
416+
deno task start --tag smol --no-verify --no-report --name ${{ github.run_id }}
417+
370418
# Automatic deployment to staging (toolshed)
371419
deploy-toolshed:
372420
name: "Deploy to Toolshed (Staging)"

.github/workflows/evals.yml

+14-7
Original file line numberDiff line numberDiff line change
@@ -106,7 +106,7 @@ jobs:
106106
key: llm-cache-${{ github.run_id }}
107107
restore-keys: llm-cache
108108

109-
- name: ✏️ List LLM Cache Load
109+
- name: ✏️ List LLM Cache Load
110110
run: |
111111
mkdir -p ./llm_cache_dir/llm-api-cache
112112
ls -1 ./llm_cache_dir/llm-api-cache > ./cache-log-pre
@@ -138,17 +138,24 @@ jobs:
138138
FLAGS="--tag 10x10"
139139
fi
140140
141-
TOOLSHED_API_URL=http://localhost:8000/ \
142-
OPENAI_API_KEY=${{ secrets.CTTS_AI_LLM_OPENAI_API_KEY }} \
143-
deno task start --name ${{ env.target_sha }} $FLAGS
141+
# Wrap in an if/else, we don't want the job to fail yet
142+
# until we can confidently expect passing results.
143+
if
144+
TOOLSHED_API_URL=http://localhost:8000/ \
145+
OPENAI_API_KEY=${{ secrets.CTTS_AI_LLM_OPENAI_API_KEY }} \
146+
deno task start --name ${{ env.target_sha }} $FLAGS ; then
147+
echo "All charms have been successfully verified!"
148+
else
149+
echo "Some charms have failed verification."
150+
fi
144151
145-
- name: ✏️ List LLM Cache Save
152+
- name: ✏️ List LLM Cache Save
146153
run: |
147154
mkdir -p ./llm_cache_dir/llm-api-cache
148155
ls -1 ./llm_cache_dir/llm-api-cache > ./cache-log-post
149156
echo "$(cat ./cache-log-post | wc -l) items found in cache."
150157
cat ./cache-log-post
151-
158+
152159
- name: 📦 LLM Cache Status
153160
run: |
154161
if diff ./cache-log-pre ./cache-log-post > ./cache-analysis-diff ; then
@@ -158,7 +165,7 @@ jobs:
158165
ADDITIONS=$(cat ./cache-analysis-diff | wc -l)
159166
echo "LLM cache has added $((ADDITIONS / 2)) entries."
160167
fi
161-
168+
162169
- name: 📤 Upload reports
163170
uses: actions/upload-artifact@v4
164171
with:

deno.jsonc

+1-1
Original file line numberDiff line numberDiff line change
@@ -112,4 +112,4 @@
112112
"zod-to-json-schema": "npm:zod-to-json-schema@^3.24.1",
113113
"zod": "npm:zod@^3.24.1"
114114
}
115-
}
115+
}
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,16 @@
1+
{
2+
"model": "anthropic:claude-3-7-sonnet-20250219",
3+
"system": "\nYou are creating a simple MVP based on the user's goal, using an existing data schema. Focus on the simplest implementation that works with the provided schema.\n\n<task>\nGiven a user's feature request and an existing data schema, you will:\n1. Create a short title (2-5 words) that names the artifact\n2. Create a one-sentence description in the format \"A <artifact> to <goal>\"\n3. Create a concise specification (3-5 sentences max) that works with the existing schema\n4. Generate a brief implementation plan (3 steps max)\n5. Design a minimal JSON schema that represents the core data model\n</task>\n\n<output_structure>\nYour response must be structured as follows:\n\n<thinking>Freeform reasoning for you to consider the solution.</thinking>\n\n<title>\n[Short title for the artifact, 2-5 words]\n</title>\n\n<description>\n[One-sentence description in the format \"A <artifact> to <goal>\"]\n</description>\n\n<spec>\n[Concise specification that captures only the essential requirements]\n</spec>\n\n<plan>\n[Brief 3-step implementation plan using the existing schema]\n</plan>\n\n<result_schema>\n[Minimal JSON Schema in valid JSON format that represents data created by the artifact]\n</result_schema>\n</output_structure>\n\n<schema_guidelines>\n1. Keep it minimal:\n - Include only essential fields (5-7 properties max)\n - Focus on the core functionality\n - If user requested complex features, simplify for this first version\n\n2. Each property should have:\n - A descriptive \"title\" field\n - A brief \"description\" field\n - A sensible default value where appropriate\n\n3. Example of a simple schema:\n```json\n{\n \"type\": \"object\",\n \"title\": \"Note List\",\n \"description\": \"A simple note list for the user\",\n \"properties\": {\n \"notes\": {\n \"type\": \"array\",\n \"title\": \"Notes\",\n \"description\": \"List of user notes\",\n \"default\": [],\n \"items\": {\n \"type\": \"object\",\n \"properties\": {\n \"title\": {\n \"type\": \"string\",\n \"title\": \"Title\",\n \"description\": \"Title of the note\",\n \"default\": \"New Note\"\n },\n \"content\": {\n \"type\": \"string\",\n \"title\": \"Content\",\n \"description\": \"Content of the note\"\n },\n \"created\": {\n \"type\": \"string\",\n \"format\": \"date-time\",\n \"title\": \"Created Date\",\n \"description\": \"When the note was created\"\n }\n },\n \"required\": [\"title\", \"content\"]\n }\n }\n },\n \"required\": [\"title\", \"content\"]\n}\n```\n\nGUIDELINES:\n- Aim for the simplest possible solution that works with the existing schema\n- The specification should take into account the existing schema structure\n- Focus on what can be achieved quickly with the existing data model\n- Avoid suggesting modifications to the schema if possible\n</schema_guidelines>\n\n<high_level>\nIMPORTANT:\n- Focus on the simplest working version\n- Aim for fewer fields rather than more\n- But still capture all the important state the user is creating\n- The user can always iterate and improve the solution later\n</high_level>\n",
4+
"stream": false,
5+
"messages": [
6+
{
7+
"role": "user",
8+
"content": "\n<user_input>\n\n<goal>2048 game</goal>\n\n<description>Create a 🎮 2048 game charm with a grid-based UI, sliding tile mechanics, and score tracking. The charm will need a game board, tile merging logic, and win/lose conditions. Could add features like 🔄 undo moves, 📊 high score tracking, or even 🎨 custom tile themes.</description>\n\n</user_input>\n\n<existing_schema>\n```json\n{\n \"type\": \"object\",\n \"properties\": {}\n}\n```\n</existing_schema>\n\nBased on this goal and the existing schema, please provide a title, description, any additional schema, detailed specification, and implementation plan.\n"
9+
},
10+
{
11+
"role": "assistant",
12+
"content": "<thinking>\nThe user wants to create a 2048 game charm. The 2048 game is a sliding tile puzzle where players combine tiles with the same numbers to create tiles with larger numbers, with the goal of reaching a tile with the value 2048.\n\nFor the core functionality, I need to design a schema that represents:\n1. The game board (a 4x4 grid typically)\n2. The current score\n3. Game state (in progress, won, lost)\n4. Current highest tile value\n\nThe existing schema is empty, so I have freedom to design it appropriately.\n\nFor the MVP, I'll focus on:\n- Game board representation\n- Current score\n- Game state\n- Highest tile\n\nI'll leave out the additional features like undo moves, high score tracking, and custom themes for future iterations.\n</thinking>\n\n<title>\n2048 Game Board\n</title>\n\n<description>\nA game charm to play the classic 2048 sliding tile puzzle.\n</description>\n\n<spec>\nThe 2048 game will feature a 4x4 grid where numbered tiles can be slid in four directions (up, down, left, right). When two tiles with the same number collide, they merge into one tile with the sum of their values. After each move, a new tile with value 2 or 4 appears in a random empty cell. The game ends when the board is full and no more moves are possible, or when a tile with value 2048 is created.\n</spec>\n\n<plan>\n1. Initialize the game board with two random tiles (value 2 or 4) in random positions\n2. Implement the core sliding and merging mechanics for all four directions\n3. Add game state tracking for score, win/lose conditions, and highest tile achieved\n</plan>\n\n<result_schema>\n{\n \"type\": \"object\",\n \"title\": \"2048 Game\",\n \"description\": \"Data model for a 2048 game\",\n \"properties\": {\n \"board\": {\n \"type\": \"array\",\n \"title\": \"Game Board\",\n \"description\": \"4x4 grid representing the game board\",\n \"default\": [\n [0, 0, 0, 0],\n [0, 0, 0, 0],\n [0, 0, 0, 0],\n [0, 0, 0, 0]\n ],\n \"items\": {\n \"type\": \"array\",\n \"items\": {\n \"type\": \"integer\",\n \"minimum\": 0\n }\n }\n },\n \"score\": {\n \"type\": \"integer\",\n \"title\": \"Current Score\",\n \"description\": \"Player's current score in the game\",\n \"default\": 0,\n \"minimum\": 0\n },\n \"highestTile\": {\n \"type\": \"integer\",\n \"title\": \"Highest Tile\",\n \"description\": \"Value of the highest tile on the board\",\n \"default\": 0,\n \"minimum\": 0\n },\n \"gameState\": {\n \"type\": \"string\",\n \"title\": \"Game State\",\n \"description\": \"Current state of the game\",\n \"enum\": [\"in_progress\", \"won\", \"lost\"],\n \"default\": \"in_progress\"\n },\n \"moves\": {\n \"type\": \"integer\",\n \"title\": \"Moves Count\",\n \"description\": \"Number of moves made in the current game\",\n \"default\": 0,\n \"minimum\": 0\n }\n },\n \"required\": [\"board\", \"score\", \"gameState\"]\n}\n</result_schema>"
13+
}
14+
],
15+
"maxTokens": 64000
16+
}

0 commit comments

Comments
 (0)