microsoft/skills-for-fabric/.github/skills/skill-test · commit 339c328

---
name: skill-test
description: >
  Manage the skills-for-fabric evaluation framework: add eval plans for new or
  existing skills, list available tests and their results, generate eval
  datasets, review metrics, and check test coverage. Directs test execution to
  the tests/ folder. Triggers: "add tests", "add evals", "list tests",
  "show tests", "show eval results", "run tests", "generate eval data",
  "eval metrics", "test coverage", "missing tests".
---

Skill Test — skills-for-fabric Evaluation Framework

Manage the end-to-end evaluation framework for skills-for-fabric. This skill routes requests to the correct workflow based on user intent — adding tests, listing tests, running tests, viewing results, generating data, or checking coverage.

When to Use

  • When a contributor wants to add evaluation test cases for a new or existing skill
  • When someone asks to see what tests exist or what results look like
  • When a user wants to run the test suite
  • When reviewing eval metrics or checking which skills lack test coverage

Intent Routing

Parse the user request and route to the appropriate workflow:

| User Intent | Trigger Phrases | Action |
|---|---|---|
| Add evals | "add tests", "add evals", "add evals for missing skills", "create eval plan" | Workflow: Add Evals |
| List tests | "list tests", "list evals", "show me the list of tests", "what tests exist", "show eval plans" | Workflow: List Tests |
| Run tests | "run tests", "run evals", "execute tests", "run the eval suite" | Workflow: Run Tests |
| View results | "show eval results", "test results", "eval results", "executive summary" | Workflow: View Results |
| Generate data | "generate eval data", "generate test data", "create eval datasets" | Workflow: Generate Data |
| View metrics | "eval metrics", "test metrics", "what metrics", "how are tests scored" | Workflow: View Metrics |
| Check coverage | "test coverage", "which skills have tests", "missing tests", "skills without evals" | Workflow: Check Coverage |
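The routing table above can be sketched as a simple first-match keyword router. This is an illustrative sketch only, not the framework's actual dispatch logic; the phrases and workflow names are taken from the table, and match order matters (e.g. "add evals for missing skills" must hit Add Evals before Check Coverage).

```python
# Minimal first-match intent router (illustrative sketch only).
# Phrases and workflow names come from the routing table above.
# Routes are checked in order, so "add evals ..." wins over "missing tests".
ROUTES = [
    ("Add Evals", ["add tests", "add evals", "create eval plan"]),
    ("List Tests", ["list tests", "list evals", "what tests exist", "show eval plans"]),
    ("Run Tests", ["run tests", "run evals", "execute tests", "run the eval suite"]),
    ("View Results", ["eval results", "test results", "executive summary"]),
    ("Generate Data", ["generate eval data", "generate test data", "create eval datasets"]),
    ("View Metrics", ["eval metrics", "test metrics", "how are tests scored"]),
    ("Check Coverage", ["test coverage", "which skills have tests", "missing tests", "skills without evals"]),
]

def route(request: str) -> str:
    """Return the workflow for the first trigger phrase found in the request."""
    text = request.lower()
    for workflow, phrases in ROUTES:
        if any(p in text for p in phrases):
            return f"Workflow: {workflow}"
    return "Ask the user to clarify their intent"
```

For example, `route("run the eval suite")` resolves to `"Workflow: Run Tests"`.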

Workflow: Add Evals

Follow the instructions in tests/full-eval-tests/README.md § "Adding Evals for New Skills".

Automated Path (Recommended)

Give the agent the prompt:

Add evals for the missing skills

The agent will:

  1. Detect missing skills by comparing installed skills against existing eval plans in tests/full-eval-tests/plan/03-individual-skills/
  2. Generate individual eval plans (plan/03-individual-skills/eval-<skill-name>.md) with 10–12 test cases
  3. Generate combined eval plans (plan/04-combined-skills/eval-<skill>-authoring-plus-consumption.md)
  4. Create golden data in tests/full-eval-tests/evalsets/expected-results/
  5. Update tracking files: plan/00-overview.md, README.md, plan/04-combined-skills/eval-full-pipeline.md

Manual Path

To add evals for a specific skill <new-skill>:

  1. Create tests/full-eval-tests/plan/03-individual-skills/eval-<new-skill>.md using the template in the README
  2. Each test case needs a Case ID (with a unique prefix), a Prompt, an Expected result, and Pass criteria; end the plan with at least one negative/ambiguous test case
  3. If the skill has an authoring+consumption pair, create tests/full-eval-tests/plan/04-combined-skills/eval-<new-skill>-authoring-plus-consumption.md
  4. Add golden data to tests/full-eval-tests/evalsets/expected-results/
  5. Update plan/00-overview.md, README.md directory tree, and plan/04-combined-skills/eval-full-pipeline.md

Eval Plan Template

Use the template from tests/full-eval-tests/README.md § "Eval Plan Template". Every eval plan must include:

  • Skill overview (name, category, R/W, purpose)
  • Pre-requisites
  • Numbered test cases (XX-01 through XX-10+) with Prompt / Expected / Pass criteria
  • At least one negative/ambiguous test case as the last case
  • Write Operations table (if the skill writes data)
  • Expected Token Range
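A skeleton showing how those required sections fit together. The authoritative template is in tests/full-eval-tests/README.md § "Eval Plan Template"; the section wording and placeholders below are illustrative, not the exact template text.

```markdown
# Eval Plan: <new-skill>

## Skill Overview
- Name: <new-skill>
- Category: <category>
- R/W: <read | write | read-write>
- Purpose: <one-line purpose>

## Pre-requisites
- <workspace, capacity, or data requirements>

## Test Cases

### XX-01: <short title>
- Prompt: <user prompt>
- Expected: <expected behavior/output>
- Pass criteria: <objective check>

<!-- ... cases XX-02 through XX-10+ ... -->

### XX-11: Negative/ambiguous case (always last)
- Prompt: <ambiguous or out-of-scope prompt>
- Expected: <skill not invoked, or clarification requested>
- Pass criteria: <objective check>

## Write Operations (only if the skill writes data)
| Operation | Target | Cleanup |
|---|---|---|

## Expected Token Range
<min–max tokens per eval prompt>
```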

Workflow: List Tests

Show the user what eval plans and test cases exist.

Individual Skill Evals

List files in tests/full-eval-tests/plan/03-individual-skills/:

ls tests/full-eval-tests/plan/03-individual-skills/

Combined Skill Evals

List files in tests/full-eval-tests/plan/04-combined-skills/:

ls tests/full-eval-tests/plan/04-combined-skills/

Quick Tests (tests.json)

Show the test cases defined in tests/tests.json — these are the prompt-based tests run by the test runner.

Recommended Execution Order

| Order | Eval Plan | Reason |
|---|---|---|
| 1 | eval-check-updates.md | Verify skills are installed |
| 2 | eval-spark-authoring.md | Create lakehouses and load data |
| 3 | eval-sqldw-authoring.md | Create warehouse tables and load data |
| 4 | eval-eventhouse-authoring.md | Create Eventhouse tables and ingest data |
| 5 | eval-spark-consumption.md | Read back lakehouse data |
| 6 | eval-sqldw-consumption.md | Read back warehouse data |
| 7 | eval-eventhouse-consumption.md | Read back Eventhouse data |
| 8 | eval-medallion.md | End-to-end medallion pipeline |

Workflow: Run Tests

⛔ DO NOT execute tests from this skill. The agent must NEVER run copilot, run-full-tests.ps1, or any eval prompt directly. Instead, tell the user the exact commands to run manually.

When the user asks to run tests, respond only with instructions. Do not execute any commands. Tell the user:

  1. Open a terminal and navigate to the tests/ directory at the repository root:

    cd tests
    
  2. Run the full test suite:

    .\run-full-tests.ps1
    
  3. To specify an output directory:

    .\run-full-tests.ps1 -TestFolder C:\temp\eval-run-01
    

Important

  • The agent must NEVER run tests itself — only provide the user with instructions
  • Tests must be run by the user from inside the tests/ folder
  • The script copies the eval framework to a working folder and launches copilot there

Workflow: View Results

Show the user existing evaluation results.

Detailed Results

Read tests/full-eval-tests/eval-results.md — contains per-skill, per-test-case pass/fail with notes, consistency test results, failure analysis, and skip reasons.

Executive Summary

Read tests/full-eval-tests/executive-summary.md — contains the high-level summary: overall pass rate, results by skill, data consistency scores, failure analysis, and recommendations.

Key Metrics from Latest Run

| Metric | Value |
|---|---|
| Overall pass rate | 94.7% (54/57 executed) |
| Write/Read consistency | 100% (5/5 exact matches) |
| Total test cases | 74 |
| Skipped | 17 |

Workflow: Generate Data

Generate synthetic evaluation datasets using the specifications in tests/full-eval-tests/plan/01-data-generation.md.

Using the Generation Script

python tests/full-eval-tests/evalsets/data-generation/generate.py

Datasets

| Dataset | Rows | Format | Used By |
|---|---|---|---|
| sales_transactions | 100 / 1K / 10K | CSV | SQL DW, Spark |
| customers | 100 | CSV | Join testing |
| products | 50 | CSV | Join testing |
| sensor_readings | 500 | JSON | Spark semi-structured |
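The actual generator is tests/full-eval-tests/evalsets/data-generation/generate.py. As a hedged sketch of what producing one of these datasets might look like (column names, value ranges, and the function name are assumptions, not the real schema):

```python
# Illustrative sketch of synthetic dataset generation -- NOT the real
# generate.py. Column names and value ranges are assumptions.
import csv
import random

def generate_customers(path: str, rows: int = 100, seed: int = 42) -> None:
    """Write a deterministic customers CSV of `rows` data rows plus a header."""
    rng = random.Random(seed)  # fixed seed -> reproducible golden results
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["customer_id", "name", "region"])
        for i in range(1, rows + 1):
            writer.writerow([i, f"Customer {i}", rng.choice(["NA", "EU", "APAC"])])

generate_customers("customers.csv")
```

The fixed seed is the important design point: golden results in evalsets/expected-results/ only stay valid if regeneration is deterministic.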

Golden Results

Pre-computed expected results are in tests/full-eval-tests/evalsets/expected-results/ and are used to verify consistency.


Workflow: View Metrics

Explain the evaluation metrics defined in tests/full-eval-tests/plan/02-metrics.md.

| Metric | Definition |
|---|---|
| Success Rate | passed / total × 100 — whether the skill executed correctly |
| Token Usage | Input + output tokens consumed per eval prompt |
| Read/Write Consistency | Data written by the authoring skill must be exactly retrievable by the consumption skill |

Grading

| Grade | Criteria |
|---|---|
| PASS | Skill invoked correctly, output matches expected |
| FAIL_INVOCATION | Wrong skill invoked or not invoked |
| FAIL_EXECUTION | Skill invoked but errored |
| FAIL_RESULT | Skill completed but output mismatches |

Pass Thresholds

| Metric | Threshold |
|---|---|
| Success Rate | ≥ 90% per skill |
| Token Usage | Within 2× of baseline |
| Read/Write Consistency | 100% exact match |
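These thresholds can be checked mechanically. A minimal sketch using the figures reported for the latest run (54/57 passed, 5/5 consistency matches); the function name is illustrative, and the token figures are made-up placeholders, not real measurements:

```python
# Sketch: check run metrics against the documented pass thresholds.
def check_run(passed: int, executed: int, consistency_matches: int,
              consistency_total: int, tokens: int, baseline_tokens: int) -> dict:
    """Return a pass/fail flag per metric, per the thresholds table."""
    success_rate = passed / executed * 100  # Success Rate = passed / total x 100
    return {
        "success_rate": success_rate >= 90.0,                     # >= 90%
        "consistency": consistency_matches == consistency_total,  # 100% exact
        "tokens": tokens <= 2 * baseline_tokens,                  # within 2x
    }

# Latest-run figures (token numbers are placeholders for illustration):
result = check_run(54, 57, 5, 5, tokens=1800, baseline_tokens=1000)
```

Note 54/57 ≈ 94.7%, matching the overall pass rate reported in View Results; the ≥ 90% threshold is documented per skill, so a real checker would apply it per eval plan, not just overall.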

Workflow: Check Coverage

Compare installed skills against existing eval plans to identify gaps.

Steps

  1. List all skills from the marketplace/plugin:

    check-updates, spark-authoring-cli, spark-consumption-cli, sqldw-authoring-cli,
    sqldw-consumption-cli, eventhouse-authoring-cli, eventhouse-consumption-cli, e2e-medallion-architecture
    
  2. List existing individual eval plans:

    ls tests/full-eval-tests/plan/03-individual-skills/
    
  3. Compare and report which skills have eval coverage and which are missing.

  4. For missing skills, suggest running the Add Evals workflow.
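The comparison in steps 1–3 amounts to a set difference. A hedged sketch, assuming the eval-&lt;skill-name&gt;.md filename convention from the Add Evals workflow; note that plan names may drop the -cli suffix (the execution-order table lists eval-spark-authoring.md for spark-authoring-cli), so names are normalized before comparing:

```python
# Sketch: report installed skills that lack an individual eval plan.
from pathlib import Path

# Skill list from step 1 of the Check Coverage workflow.
INSTALLED_SKILLS = {
    "check-updates", "spark-authoring-cli", "spark-consumption-cli",
    "sqldw-authoring-cli", "sqldw-consumption-cli",
    "eventhouse-authoring-cli", "eventhouse-consumption-cli",
    "e2e-medallion-architecture",
}

def missing_eval_plans(plan_dir: str) -> set[str]:
    """Skills with no matching eval-<skill-name>.md under plan_dir."""
    covered = {p.stem.removeprefix("eval-") for p in Path(plan_dir).glob("eval-*.md")}
    # Plan filenames may drop the -cli suffix (eval-spark-authoring.md
    # covers spark-authoring-cli), so compare normalized names.
    return {s for s in INSTALLED_SKILLS if s.removesuffix("-cli") not in covered}
```

For example, `missing_eval_plans("tests/full-eval-tests/plan/03-individual-skills/")` returns the skills to feed into the Add Evals workflow; some plans use shortened names (e.g. eval-medallion.md), so a real check may need an explicit skill-to-plan mapping.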


Must

  • NEVER execute tests, eval prompts, or the test runner script — only provide instructions for the user to run manually
  • Always route "run tests" to the tests/ folder — tell the user to navigate there and run run-full-tests.ps1
  • Follow the eval plan template when creating new eval plans — every test case needs Case ID, Prompt, Expected, Pass criteria
  • Include at least one negative/ambiguous test in every new eval plan
  • Update tracking files when adding evals — plan/00-overview.md, README.md, eval-full-pipeline.md
  • Reference golden data in evalsets/expected-results/ for consistency tests

Prefer

  • Automated eval generation ("add evals for missing skills") over manual creation
  • Reading existing eval plans as templates before creating new ones
  • Running the full suite rather than individual tests for comprehensive coverage
  • Checking the executive summary before diving into detailed results

Avoid

  • Running eval prompts outside the tests/ directory
  • Executing tests, copilot commands, or the run-full-tests.ps1 script — the agent must only tell the user how to run them
  • Creating eval plans without a negative/ambiguous test case
  • Duplicating content already in the README or plan documents — reference them instead
  • Modifying golden result files unless the underlying data generation rules change
  • Skipping the tracking file updates when adding new evals

Examples

Adding Tests

User: "Add tests for the powerbi-consumption-cli skill"

Agent: Creates tests/full-eval-tests/plan/03-individual-skills/eval-powerbi-consumption.md with 10–12 test cases following the template, adds golden data to evalsets/expected-results/, and updates tracking files.

Listing Tests

User: "What tests exist?"

Agent: Lists all eval plans in plan/03-individual-skills/ and plan/04-combined-skills/, plus the quick tests in tests/tests.json, with the recommended execution order.

Running Tests

User: "Run the tests"

Agent:

Tests must be run from the tests/ folder. Navigate there and execute:

cd tests
.\run-full-tests.ps1

Viewing Results

User: "Show me the eval results"

Agent: Reads tests/full-eval-tests/eval-results.md and presents the summary table, highlighting pass rates and any failures.