Duration: ~60 minutes
Audience: AI Tester Batch 1X — QA professionals with 2–10 years of experience
Pre-requisites: Basic programming awareness (any language)
What You'll Build Towards: MCP servers, LLM evaluations with DeepEval, Pydantic validation
- Why Python for AI Testing?
- Setting Up Your Environment
- Python Basics — The Fast Track
- Data Structures That Matter for AI Testing
- Functions and Type Hints
- Working with Modules and Imports
- Classes and Objects — Just Enough for AI Frameworks
- File I/O and JSON Handling
- Virtual Environments and pip
- Connecting the Dots — Real AI Testing Patterns
- Summary Cheat Sheet
Before we write a single line of code, let's talk about why Python is non-negotiable in the AI testing world.
Every major AI/ML framework is Python-first:
| Tool/Framework | What It Does | Why You'll Use It |
|---|---|---|
| DeepEval | LLM evaluation & testing | Write test cases for LLM outputs |
| Pydantic | Data validation | Validate structured outputs from LLMs |
| FastMCP | Build MCP servers | Expose tools for Claude/LLMs to call |
| LangChain | LLM orchestration | Chain prompts, build RAG pipelines |
| CrewAI | Multi-agent systems | Build AI agent teams for QA workflows |
| pytest | Test framework | Run all your AI test suites |
One-Liner: If you can write Python, you can test AI. If you can't, you're limited to no-code tools forever.
As a QA professional, you already understand test design, edge cases, and validation logic. Python is simply the language that lets you apply those skills to AI systems.
# Check if Python is already installed
python3 --version
# If not, download from https://www.python.org/downloads/
# Recommended: Python 3.10 or higher (required by DeepEval and FastMCP)

Create a file called hello_tester.py:
# hello_tester.py
print("Hello, AI Tester! Your Python journey starts now.")

Run it:
python3 hello_tester.py

Output:
Hello, AI Tester! Your Python journey starts now.
Tip: Unlike Java or C#, Python doesn't need a main method, a class wrapper, or semicolons. You write code, you run it. That's it.
Python is dynamically typed — you don't declare types, Python figures it out.
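Dynamic typing in one small example: the built-in type() shows what a variable currently holds, and the same name can be rebound to a different type later.

```python
# The same name can hold values of different types over time.
answer = 42
print(type(answer))    # <class 'int'>
answer = "forty-two"   # rebound to a string — no declaration needed
print(type(answer))    # <class 'str'>
```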
# Strings — you'll use these A LOT for prompts and LLM outputs
test_prompt = "Check if this API returns valid JSON"
model_name = "gpt-4"
expected_output = 'The response should contain a "status" field'
# Numbers
max_retries = 3 # int
temperature = 0.7 # float (you'll see this in every LLM config)
confidence_threshold = 0.85 # float
# Booleans
is_hallucination = False
test_passed = True
# None — represents "no value" (like null in Java/JavaScript)
api_key = None

When you're testing LLMs, you're constantly working with strings — prompts, responses, extracted text.
# f-strings — the modern way to build strings (Python 3.6+)
model = "claude-sonnet-4-20250514"
task = "summarize bug reports"
prompt = f"You are {model}. Your task is to {task}."
print(prompt)
# Output: You are claude-sonnet-4-20250514. Your task is to summarize bug reports.
# Multi-line strings — perfect for prompt templates
system_prompt = """
You are a Senior QA Engineer AI assistant.
Your job is to:
1. Analyze test results
2. Identify flaky tests
3. Suggest root causes
Always respond in JSON format.
"""
# Common string methods you'll use daily
llm_response = " The test PASSED with 95% confidence. "
llm_response.strip() # Remove whitespace: "The test PASSED with 95% confidence."
llm_response.lower() # Lowercase: " the test passed with 95% confidence. "
llm_response.upper() # Uppercase: " THE TEST PASSED WITH 95% CONFIDENCE. "
"PASSED" in llm_response # Check if substring exists: True
"FAILED" in llm_response # False
llm_response.replace("PASSED", "SUCCEEDED") # Replace text
llm_response.split() # Split into words: ['The', 'test', 'PASSED', ...]

# Comparison — used in assertions and test conditions
severity = "P0"
score = 0.92
severity == "P0" # True (equal)
severity != "P1" # True (not equal)
score > 0.8 # True
score >= 0.92 # True
score < 1.0 # True
# Logical operators — combine conditions
is_critical = severity == "P0" and score > 0.9 # True (both must be true)
needs_review = severity == "P0" or score < 0.5 # True (at least one true)
is_stable = not is_critical # False (negation)

# Basic if/elif/else
test_score = 0.73
if test_score >= 0.9:
verdict = "PASS — High confidence"
elif test_score >= 0.7:
verdict = "PASS — Marginal, needs review"
elif test_score >= 0.5:
verdict = "WARN — Low confidence"
else:
verdict = "FAIL — Below threshold"
print(verdict) # Output: PASS — Marginal, needs review
# Real-world pattern: Classifying LLM output quality
def classify_response(hallucination_score, relevance_score):
"""Classify an LLM response based on evaluation metrics."""
if hallucination_score > 0.5:
return "REJECTED — Hallucination detected"
elif relevance_score < 0.3:
return "REJECTED — Off-topic response"
elif relevance_score < 0.7:
return "NEEDS_REVIEW — Partially relevant"
else:
return "ACCEPTED — Good quality"

Warning: Python uses indentation (4 spaces) instead of curly braces {}. This is not optional — wrong indentation = broken code.
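A quick illustration of what that means in practice (the score and log entries here are made up):

```python
# Everything indented under the if belongs to it; the unindented
# line afterwards always runs, whatever the condition was.
score = 0.95
log = []
if score >= 0.9:
    log.append("High confidence")   # inside the if-block
    log.append("No review needed")  # same indent level: also inside
log.append("Evaluation complete")   # back at column 0: always runs
print(log)
# ['High confidence', 'No review needed', 'Evaluation complete']
```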
# for loop — iterate over a collection
test_results = ["PASS", "FAIL", "PASS", "PASS", "FAIL"]
fail_count = 0
for result in test_results:
if result == "FAIL":
fail_count += 1
print(f"Failed: {fail_count} out of {len(test_results)}")
# Output: Failed: 2 out of 5
# for loop with range — when you need the index
for i in range(5):
print(f"Running test case {i + 1}...")
# enumerate — get both index AND value (very Pythonic)
models = ["gpt-4", "claude-sonnet", "gemini-pro"]
for index, model in enumerate(models):
print(f" Model {index + 1}: {model}")
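A companion to enumerate is zip, which pairs two sequences element by element. That is handy for comparing expected vs actual results (illustrative data):

```python
# zip stops at the shorter of the two lists and yields pairs.
expected = ["PASS", "PASS", "FAIL"]
actual = ["PASS", "FAIL", "FAIL"]
for exp, act in zip(expected, actual):
    marker = "match" if exp == act else "MISMATCH"
    print(f"  {marker}: expected={exp} actual={act}")
```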
# while loop — repeat until a condition is met
retries = 0
max_retries = 3
success = False
while retries < max_retries and not success:
print(f" Attempt {retries + 1}...")
# Simulate: pretend attempt 3 succeeds
if retries == 2:
success = True
retries += 1
print(f" Success: {success} after {retries} attempts")

List comprehensions are a pattern you'll see everywhere in AI/ML codebases: a one-liner way to create lists.
# Traditional loop
scores = [0.9, 0.3, 0.85, 0.45, 0.72]
passing_scores = []
for s in scores:
if s >= 0.7:
passing_scores.append(s)
# Same thing as a list comprehension — one line!
passing_scores = [s for s in scores if s >= 0.7]
print(passing_scores) # [0.9, 0.85, 0.72]
# Transform AND filter
labels = [f"PASS ({s})" if s >= 0.7 else f"FAIL ({s})" for s in scores]
print(labels)
# ['PASS (0.9)', 'FAIL (0.3)', 'PASS (0.85)', 'FAIL (0.45)', 'PASS (0.72)']
# Real use case: Extract test names from a list of test objects
test_cases = [
{"name": "test_login", "status": "passed"},
{"name": "test_checkout", "status": "failed"},
{"name": "test_search", "status": "passed"},
]
failed_tests = [t["name"] for t in test_cases if t["status"] == "failed"]
print(failed_tests) # ['test_checkout']

# Lists are your go-to for sequences of things
test_steps = ["login", "navigate", "click_button", "verify"]
# Access by index (0-based)
test_steps[0] # "login"
test_steps[-1] # "verify" (last element)
# Modify
test_steps.append("screenshot") # Add to end
test_steps.insert(2, "wait_for_load") # Insert at position 2
test_steps.remove("click_button") # Remove by value
popped = test_steps.pop() # Remove & return last item
# Slicing — extract sublists
first_two = test_steps[:2] # First 2 elements
last_two = test_steps[-2:] # Last 2 elements
middle = test_steps[1:3] # Index 1 and 2

Dictionaries are critical for AI testing because every API response, every LLM config, and every evaluation result is a dictionary (or JSON, which becomes a dictionary in Python).
# A test configuration — looks exactly like JSON!
llm_config = {
"model": "claude-sonnet-4-20250514",
"temperature": 0.7,
"max_tokens": 1024,
"system_prompt": "You are a QA assistant.",
"tags": ["testing", "automation"]
}
# Access values
llm_config["model"] # "claude-sonnet-4-20250514"
llm_config.get("timeout", 30) # 30 (returns default if key missing — safer!)
# Modify
llm_config["temperature"] = 0.5 # Update
llm_config["top_p"] = 0.9 # Add new key
del llm_config["tags"] # Delete key
# Check if key exists
if "model" in llm_config:
print(f"Using model: {llm_config['model']}")
# Loop through key-value pairs
for key, value in llm_config.items():
print(f" {key}: {value}")
# Nested dictionaries — very common in API responses
eval_result = {
"test_name": "hallucination_check",
"metrics": {
"hallucination_score": 0.12,
"relevance_score": 0.89,
"coherence_score": 0.95
},
"verdict": "PASS"
}
# Access nested values
hall_score = eval_result["metrics"]["hallucination_score"]
print(f"Hallucination score: {hall_score}") # 0.12

# Tuples are like lists but cannot be changed after creation
# Use for fixed data — coordinates, return values, constant configs
severity_levels = ("P0", "P1", "P2", "P3", "P4")
# You CAN read them
severity_levels[0] # "P0"
# You CANNOT modify them
# severity_levels[0] = "Critical" # ERROR!
# Common pattern: returning multiple values from a function
def evaluate_response(response_text):
score = 0.87
is_valid = True
return score, is_valid # Returns a tuple
score, is_valid = evaluate_response("test response")
print(f"Score: {score}, Valid: {is_valid}")

# Sets automatically remove duplicates
failed_modules = {"auth", "payments", "auth", "search", "payments"}
print(failed_modules) # {"auth", "payments", "search"}
# Set operations — great for comparing test results
yesterday_failures = {"auth", "payments", "search"}
today_failures = {"auth", "checkout", "search"}
# What's new today?
new_failures = today_failures - yesterday_failures
print(f"New failures: {new_failures}") # {"checkout"}
# What's consistently failing?
persistent = yesterday_failures & today_failures
print(f"Persistent: {persistent}") # {"auth", "search"}
# All unique failures across both days
all_failures = yesterday_failures | today_failures
print(f"All: {all_failures}") # {"auth", "payments", "search", "checkout"}

def run_test(test_name, expected, actual):
"""
Compare expected vs actual result and return a verdict.
This docstring is important — tools like DeepEval use them.
"""
if expected == actual:
print(f" ✅ {test_name}: PASSED")
return True
else:
print(f" ❌ {test_name}: FAILED (expected '{expected}', got '{actual}')")
return False
# Call it
run_test("status_code", 200, 200) # ✅ PASSED
run_test("response_body", "ok", "error") # ❌ FAILED

def call_llm(prompt, model="claude-sonnet-4-20250514", temperature=0.7, max_tokens=1024):
"""Simulate calling an LLM with configurable parameters."""
print(f" Calling {model} (temp={temperature}, max_tokens={max_tokens})")
print(f" Prompt: {prompt[:50]}...")
return f"Response from {model}"
# Different ways to call this function
call_llm("Summarize this bug report") # Uses all defaults
call_llm("Summarize this", model="gpt-4") # Override just the model
call_llm("Summarize this", temperature=0.0) # Override just temperature
call_llm("Summarize this", max_tokens=2048, temperature=0.3) # Override multiple

Type hints don't enforce types at runtime, but they're essential for Pydantic, FastMCP, and DeepEval because these frameworks read your type hints to validate data.
# Basic type hints
def calculate_pass_rate(passed: int, total: int) -> float:
"""Calculate the pass rate as a percentage."""
if total == 0:
return 0.0
return (passed / total) * 100
result: float = calculate_pass_rate(8, 10)
print(f"Pass rate: {result}%") # Pass rate: 80.0%
# Type hints with complex types
from typing import Any
def evaluate_test(
test_name: str,
score: float,
threshold: float = 0.7,
tags: list[str] | None = None # Python 3.10+ syntax
) -> dict[str, Any]:
"""Evaluate a single test and return structured result."""
passed = score >= threshold
return {
"test_name": test_name,
"score": score,
"passed": passed,
"tags": tags or []
}
# Usage
result = evaluate_test("hallucination_check", 0.92, tags=["critical", "llm"])
print(result)

Why This Matters: When you build MCP tools with FastMCP, the framework reads your function's type hints to auto-generate the tool schema that Claude uses. Bad type hints = Claude can't use your tool properly.
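A minimal sketch of how a framework can read those hints at runtime, using the standard library's get_type_hints (score_test is a made-up function for illustration):

```python
from typing import get_type_hints

def score_test(name: str, score: float) -> bool:
    """Hypothetical check used only to demonstrate hint introspection."""
    return score >= 0.7

# Frameworks inspect annotations like this to build schemas.
hints = get_type_hints(score_test)
print(hints)
# {'name': <class 'str'>, 'score': <class 'float'>, 'return': <class 'bool'>}
```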
# Lambda = anonymous one-line function
# Useful for sorting and quick transformations
test_results = [
{"name": "test_login", "duration": 2.5},
{"name": "test_search", "duration": 0.8},
{"name": "test_checkout", "duration": 5.1},
]
# Sort by duration (fastest first)
sorted_results = sorted(test_results, key=lambda x: x["duration"])
for r in sorted_results:
print(f" {r['name']}: {r['duration']}s")
# Filter with lambda
long_tests = list(filter(lambda x: x["duration"] > 2.0, test_results))
print(f"Slow tests: {[t['name'] for t in long_tests]}")

This is one of the most important sections for AI testing. Every framework, every library, every tool you'll use requires importing modules.
A module is simply a .py file that contains Python code (functions, classes, variables). When you import a module, you're loading that code into your program.
my_project/
├── utils.py ← This is a module
├── test_runner.py ← This is also a module
└── config.py ← And so is this
# ----- Style 1: Import the entire module -----
import json
data = json.loads('{"status": "pass"}') # Must prefix with module name
json.dumps(data, indent=2)
# ----- Style 2: Import specific items from a module -----
from json import loads, dumps
data = loads('{"status": "pass"}') # No prefix needed
dumps(data, indent=2)
# ----- Style 3: Import with an alias -----
import numpy as np # Convention: numpy → np
import pandas as pd # Convention: pandas → pd
# You'll see these in EVERY data science / ML codebase
scores = np.array([0.8, 0.9, 0.75])
mean_score = np.mean(scores)
# ----- Style 4: Import specific items with alias -----
from pydantic import BaseModel as BM
# ----- Style 5: Import everything (AVOID THIS) -----
from json import * # Imports EVERYTHING — pollutes namespace, hard to debug

Best Practice: Use Style 1 or Style 2. Style 3 only for well-known conventions (np, pd). Never use Style 5.
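A short sketch of why Style 5 bites: a star import fills your namespace with names you never see, and a later definition can shadow one of them silently (the shadowing loads below is deliberate):

```python
from json import *  # Style 5: silently pulls in loads, dumps, and more

def loads(s):
    # Accidentally shadows json.loads — Python gives no warning.
    return "not JSON at all"

print(loads('{"status": "pass"}'))  # the real parser is now unreachable here
```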
Python ships with a rich standard library. Here are the modules you'll use constantly in AI testing:
# ----- json — Every API response, every LLM output -----
import json
# Parse JSON string into Python dictionary
api_response = '{"model": "claude", "score": 0.95, "passed": true}'
data = json.loads(api_response) # String → Dictionary
print(data["model"]) # "claude"
print(type(data)) # <class 'dict'>
# Convert Python dictionary back to JSON string
result = {"test": "hallucination", "score": 0.12}
json_string = json.dumps(result, indent=2) # Dictionary → Pretty JSON string
print(json_string)
# ----- os — File paths, environment variables -----
import os
# Read API keys from environment (NEVER hardcode secrets!)
api_key = os.environ.get("OPENAI_API_KEY", "not-set")
groq_key = os.getenv("GROQ_API_KEY") # Same thing, shorter syntax
# File path operations
project_root = os.getcwd() # Current directory
config_path = os.path.join(project_root, "config", "settings.json")
file_exists = os.path.exists(config_path)
# List files in a directory
test_files = os.listdir("tests/")
# ----- datetime — Timestamps for test reports -----
from datetime import datetime, timedelta
now = datetime.now()
print(f"Test run started: {now.strftime('%Y-%m-%d %H:%M:%S')}")
one_hour_ago = now - timedelta(hours=1)
print(f"Previous run: {one_hour_ago.strftime('%Y-%m-%d %H:%M:%S')}")
# ----- time — Delays and performance measurement -----
import time
start = time.time()
# ... do something ...
time.sleep(1) # Wait 1 second
elapsed = time.time() - start
print(f"Operation took: {elapsed:.2f} seconds")
# ----- pathlib — Modern file path handling (preferred over os.path) -----
from pathlib import Path
project = Path("my_project")
config_file = project / "config" / "settings.json" # Slash operator builds paths!
print(config_file) # my_project/config/settings.json
print(config_file.suffix) # .json
print(config_file.stem) # settings
# ----- typing — Type hints for complex structures -----
from typing import Any, Callable, Optional, Union
def process_result(
data: dict[str, Any],
callback: Optional[Callable] = None
) -> Union[str, None]:
"""Process and optionally transform a result."""
if callback:
return callback(data)
return None
# ----- dataclasses — Quick data containers -----
from dataclasses import dataclass
@dataclass
class TestResult:
name: str
score: float
passed: bool
duration_seconds: float = 0.0
result = TestResult(name="hallucination_check", score=0.92, passed=True)
print(result) # TestResult(name='hallucination_check', score=0.92, passed=True, duration_seconds=0.0)

# Install packages using pip
pip install pydantic
pip install deepeval
pip install fastmcp
pip install requests
pip install python-dotenv

# ----- requests — Make HTTP calls to APIs -----
import requests
response = requests.get("https://api.example.com/tests")
print(response.status_code) # 200
data = response.json() # Automatically parses JSON
# POST request (like calling an LLM API)
payload = {
"model": "claude-sonnet-4-20250514",
"messages": [{"role": "user", "content": "Hello!"}]
}
response = requests.post(
"https://api.anthropic.com/v1/messages",
json=payload,
headers={"x-api-key": os.getenv("ANTHROPIC_API_KEY")}
)
# ----- dotenv — Load environment variables from .env file -----
from dotenv import load_dotenv
import os
load_dotenv() # Loads variables from .env file in project root
api_key = os.getenv("ANTHROPIC_API_KEY")
# ----- pydantic — Data validation (PREVIEW: you'll go deep on this!) -----
from pydantic import BaseModel, Field
class BugReport(BaseModel):
title: str
severity: str = Field(..., pattern="^P[0-4]$") # Must be P0-P4
description: str
is_reproducible: bool = True
# This works
bug = BugReport(title="Login fails", severity="P0", description="Login button unresponsive")
print(bug.model_dump()) # Converts to dictionary
# This raises a validation error!
try:
bad_bug = BugReport(title="Test", severity="HIGH", description="Desc")
except Exception as e:
print(f"Validation error: {e}")This is where it gets practical. In a real AI testing project, you'll organize code across multiple files.
Project Structure:
ai_test_project/
├── config.py ← Configuration and constants
├── llm_client.py ← LLM API wrapper
├── evaluators.py ← Evaluation functions
├── test_runner.py ← Main test runner
└── utils/
├── __init__.py ← Makes this folder a package
├── scoring.py ← Scoring utilities
└── formatting.py ← Output formatting
config.py:
# config.py — Project-wide configuration
MODEL_NAME = "claude-sonnet-4-20250514"
TEMPERATURE = 0.7
MAX_TOKENS = 1024
PASS_THRESHOLD = 0.7
SEVERITY_LEVELS = {
"P0": "Blocker",
"P1": "Critical",
"P2": "Major",
"P3": "Minor",
"P4": "Trivial"
}
API_ENDPOINTS = {
"anthropic": "https://api.anthropic.com/v1/messages",
"groq": "https://api.groq.com/openai/v1/chat/completions"
}

utils/scoring.py:
# utils/scoring.py — Reusable scoring functions
def calculate_pass_rate(results: list[dict]) -> float:
"""Calculate percentage of passing tests."""
if not results:
return 0.0
passed = sum(1 for r in results if r.get("passed", False))
return (passed / len(results)) * 100
def classify_severity(score: float) -> str:
"""Map a numeric score to severity level."""
if score >= 0.9:
return "P0"
elif score >= 0.7:
return "P1"
elif score >= 0.5:
return "P2"
else:
return "P3"
def normalize_score(raw_score: float, min_val: float = 0, max_val: float = 1) -> float:
"""Normalize a score to 0-1 range."""
if max_val == min_val:
return 0.0
return (raw_score - min_val) / (max_val - min_val)

utils/__init__.py:
# utils/__init__.py — Controls what gets imported with "from utils import ..."
from .scoring import calculate_pass_rate, classify_severity
from .formatting import format_report # if it exists

test_runner.py — Bringing It All Together:
# test_runner.py — Main entry point
# Import from our own modules
from config import MODEL_NAME, PASS_THRESHOLD, SEVERITY_LEVELS
from utils.scoring import calculate_pass_rate, classify_severity
# Import third-party packages
import json
from datetime import datetime
def main():
print(f"🚀 Test Runner Started — {datetime.now().strftime('%Y-%m-%d %H:%M')}")
print(f" Model: {MODEL_NAME}")
print(f" Pass Threshold: {PASS_THRESHOLD}")
print()
# Simulated test results
results = [
{"name": "hallucination_check", "score": 0.95, "passed": True},
{"name": "relevance_check", "score": 0.42, "passed": False},
{"name": "coherence_check", "score": 0.88, "passed": True},
{"name": "toxicity_check", "score": 0.97, "passed": True},
{"name": "format_check", "score": 0.31, "passed": False},
]
# Use our imported functions
pass_rate = calculate_pass_rate(results)
print(f"📊 Pass Rate: {pass_rate}%")
print()
# Classify each result
for r in results:
severity = classify_severity(r["score"])
status = "✅" if r["passed"] else "❌"
label = SEVERITY_LEVELS.get(severity, "Unknown")
print(f" {status} {r['name']}: {r['score']} ({severity} - {label})")
# Export results
report = {
"timestamp": datetime.now().isoformat(),
"model": MODEL_NAME,
"pass_rate": pass_rate,
"results": results
}
with open("test_report.json", "w") as f:
json.dump(report, f, indent=2)
print(f"\n📄 Report saved to test_report.json")
# This is the entry point pattern
if __name__ == "__main__":
main()

# This block runs ONLY when you execute the file directly
# It does NOT run when the file is imported as a module
# Direct: python3 test_runner.py → __name__ is "__main__" → runs
# Import: from test_runner import main → __name__ is "test_runner" → skips
if __name__ == "__main__":
main()

Why This Matters: Almost every Python file in AI frameworks uses this pattern. It lets a file be both a runnable script AND an importable module.
# ----- Absolute imports (recommended) -----
from utils.scoring import calculate_pass_rate
from config import MODEL_NAME
# ----- Relative imports (inside packages only) -----
# From utils/scoring.py, importing from utils/formatting.py:
from .formatting import format_report # . = same package
from ..config import MODEL_NAME # .. = parent package

# Common Error 1: ModuleNotFoundError
# import deepeval → ModuleNotFoundError
# Fix: pip install deepeval
# Common Error 2: ImportError — wrong name
# from pydantic import base_model → ImportError
# Fix: from pydantic import BaseModel (case-sensitive!)
# Common Error 3: Circular imports
# file_a.py imports from file_b.py, which imports from file_a.py
# Fix: restructure your code or use lazy imports
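A sketch of the lazy-import fix: move the import inside the function so it runs at call time instead of at module load (here json stands in for the module that would otherwise complete the cycle):

```python
def summarize_results(results: list) -> str:
    # Lazy import: resolved only when the function is called,
    # after both modules have finished loading.
    import json
    return json.dumps({"total": len(results)})

print(summarize_results([{"passed": True}]))  # {"total": 1}
```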
# Debugging tip: Check where a module lives
import json
print(json.__file__) # Shows the file path of the module
# Check what's available in a module
import os
print(dir(os)) # Lists all functions and attributes

You don't need to master OOP to start with AI testing, but you need to read and understand classes because every framework uses them.
class TestCase:
"""Represents a single test case for LLM evaluation."""
def __init__(self, name: str, prompt: str, expected: str):
"""Constructor — runs when you create a new TestCase."""
self.name = name # Instance attribute
self.prompt = prompt
self.expected = expected
self.actual = None # Will be set after running
self.passed = False
def run(self, llm_response: str):
"""Run the test case against an LLM response."""
self.actual = llm_response
self.passed = self.expected.lower() in llm_response.lower()
return self.passed
def report(self) -> str:
"""Generate a human-readable report."""
status = "✅ PASS" if self.passed else "❌ FAIL"
return f"{status} | {self.name} | Expected: {self.expected} | Got: {self.actual}"
# Create instances (objects)
tc1 = TestCase("json_format", "Respond in JSON", "json")
tc2 = TestCase("polite_tone", "Be polite", "please")
# Run tests
tc1.run('{"result": "Here is the JSON output"}')
tc2.run("Here is the result you wanted")
# Print reports
print(tc1.report()) # ✅ PASS | json_format ...
print(tc2.report()) # ❌ FAIL | polite_tone ...

from pydantic import BaseModel
# Your class INHERITS from BaseModel
# This gives it automatic validation, serialization, etc.
class LLMTestConfig(BaseModel):
model: str
temperature: float = 0.7
max_tokens: int = 1024
system_prompt: str = "You are a helpful assistant."
# Pydantic auto-validates when you create an instance
config = LLMTestConfig(model="claude-sonnet-4-20250514")
print(config.model_dump()) # {'model': 'claude-sonnet-4-20250514', 'temperature': 0.7, ...}
# Validation error if you pass wrong type
try:
bad_config = LLMTestConfig(model="gpt-4", temperature="hot") # "hot" is not a float!
except Exception as e:
print(f"Validation failed: {e}")

You'll see decorators everywhere in FastMCP, pytest, and other frameworks. Think of them as "add-ons" that wrap a function with extra behavior.
# ----- Built-in decorators -----
class TestSuite:
@staticmethod
def version():
return "1.0.0"
@classmethod
def create_default(cls):
return cls()
# ----- The decorator pattern you'll use with FastMCP -----
# (Conceptual preview — you'll implement this in MCP module)
from fastmcp import FastMCP
mcp = FastMCP("QA Tools")
@mcp.tool() # ← This decorator registers the function as an MCP tool
def analyze_bug(title: str, description: str) -> str:
"""Analyze a bug report and suggest severity."""
# Your logic here
return f"Bug '{title}' analyzed: Severity P1"
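Under the hood, a decorator is just a function that takes a function and returns a wrapped version. A minimal hand-written sketch (timed and run_smoke_suite are illustrative, not from any framework):

```python
import time
from functools import wraps

def timed(func):
    """Takes a function, returns a version of it with timing added."""
    @wraps(func)  # keeps the original function's name and docstring
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        print(f"{func.__name__} took {time.time() - start:.3f}s")
        return result
    return wrapper

@timed  # same as writing: run_smoke_suite = timed(run_smoke_suite)
def run_smoke_suite():
    time.sleep(0.05)  # pretend to run some tests
    return "5 passed, 0 failed"

print(run_smoke_suite())
```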
# ----- pytest decorators -----
import pytest
@pytest.mark.parametrize("input,expected", [
("PASS", True),
("FAIL", False),
("ERROR", False),
])
def test_status_check(input, expected):
assert (input == "PASS") == expected

# ----- Writing a file -----
with open("test_output.txt", "w") as f:
f.write("Test Results\n")
f.write("============\n")
f.write("Test 1: PASSED\n")
f.write("Test 2: FAILED\n")
# ----- Reading a file -----
with open("test_output.txt", "r") as f:
content = f.read() # Read entire file as string
print(content)
# Read line by line (memory-efficient for large files)
with open("test_output.txt", "r") as f:
for line in f:
print(line.strip())
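For contrast, here is roughly what those with blocks are doing for you, written out manually (a sketch using a throwaway manual_output.txt):

```python
# Manual version: you must remember the try/finally yourself.
f = open("manual_output.txt", "w")
try:
    f.write("Test 1: PASSED\n")
finally:
    f.close()  # runs even if write() raises

# The with version performs the same try/finally automatically.
with open("manual_output.txt") as f:
    print(f.read())  # Test 1: PASSED
```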
# ----- The "with" statement -----
# Automatically closes the file when the block ends
# ALWAYS use "with" — never manually call open() and close() yourself

import json
# ----- Write JSON -----
test_config = {
"model": "claude-sonnet-4-20250514",
"tests": [
{"name": "hallucination", "threshold": 0.8},
{"name": "relevance", "threshold": 0.7},
],
"metadata": {
"author": "QA Team",
"version": "1.0"
}
}
with open("config.json", "w") as f:
json.dump(test_config, f, indent=2)
# ----- Read JSON -----
with open("config.json", "r") as f:
loaded_config = json.load(f)
print(loaded_config["model"]) # claude-sonnet-4-20250514
print(loaded_config["tests"][0]) # {'name': 'hallucination', 'threshold': 0.8}
# ----- String ↔ JSON -----
json_string = json.dumps(test_config, indent=2) # Dict → String
parsed_dict = json.loads(json_string) # String → Dict

Different projects need different package versions. Virtual environments keep them isolated.
# Create a virtual environment
python3 -m venv ai_testing_env
# Activate it
# macOS/Linux:
source ai_testing_env/bin/activate
# Windows:
ai_testing_env\Scripts\activate
# Your terminal prompt changes to show the active env:
# (ai_testing_env) $
# Install packages inside the virtual environment
pip install pydantic deepeval fastmcp requests python-dotenv
# Save your dependencies
pip freeze > requirements.txt
# Later, recreate the environment from requirements
pip install -r requirements.txt
# Deactivate when done
deactivate

# requirements.txt
pydantic>=2.0
deepeval>=1.0
fastmcp>=0.1
requests>=2.28
python-dotenv>=1.0
pytest>=7.0
Tip: Always create a requirements.txt for your AI testing projects. When you share code with your team or deploy to CI/CD, this file ensures everyone has the same packages.
Now let's see how everything fits together with patterns you'll actually use.
# This is a preview of what you'll build in the DeepEval module
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import HallucinationMetric
def test_no_hallucination():
"""Test that the LLM doesn't hallucinate facts."""
test_case = LLMTestCase(
input="What is the capital of France?",
actual_output="The capital of France is Paris, founded in 250 BC.",
context=["Paris is the capital and most populous city of France."]
)
metric = HallucinationMetric(threshold=0.5)
assert_test(test_case, [metric])

# Validate that an LLM returns properly structured data
from pydantic import BaseModel, Field
from typing import Optional
import json
class TestPlanOutput(BaseModel):
"""Expected structure of an LLM-generated test plan."""
feature_name: str
test_cases: list[str] = Field(min_length=1)
priority: str = Field(pattern="^(P0|P1|P2|P3|P4)$")
estimated_hours: float = Field(gt=0, le=100)
automation_possible: bool
notes: Optional[str] = None
# Simulate LLM response
llm_response = '''
{
"feature_name": "User Login",
"test_cases": [
"Verify login with valid credentials",
"Verify login with invalid password",
"Verify account lockout after 5 attempts"
],
"priority": "P0",
"estimated_hours": 4.5,
"automation_possible": true,
"notes": "Requires test accounts in staging"
}
'''
# Parse and validate
try:
parsed = json.loads(llm_response)
plan = TestPlanOutput(**parsed)
print(f"✅ Valid test plan for: {plan.feature_name}")
print(f" Test cases: {len(plan.test_cases)}")
print(f" Priority: {plan.priority}")
except json.JSONDecodeError as e:
print(f"❌ LLM returned invalid JSON: {e}")
except Exception as e:
print(f"❌ Validation failed: {e}")

# This is what an MCP server looks like — you'll build these!
from fastmcp import FastMCP
mcp = FastMCP("QA Assistant Tools")
@mcp.tool()
def get_test_status(test_suite: str, environment: str = "staging") -> dict:
"""
Get the current status of a test suite.
Args:
test_suite: Name of the test suite (e.g., "regression", "smoke")
environment: Target environment (default: staging)
Returns:
Dictionary with test status details
"""
# In reality, this would query your test management system
return {
"suite": test_suite,
"environment": environment,
"total": 150,
"passed": 142,
"failed": 5,
"skipped": 3,
"pass_rate": "94.7%",
"last_run": "2025-01-15T10:30:00Z"
}
@mcp.tool()
def analyze_flaky_tests(days: int = 7) -> list[dict]:
"""
Identify flaky tests from the past N days.
Args:
days: Number of days to analyze (default: 7)
Returns:
List of flaky tests with their flip rates
"""
return [
{"test": "test_payment_flow", "flip_rate": 0.34, "last_flake": "2025-01-14"},
{"test": "test_search_results", "flip_rate": 0.21, "last_flake": "2025-01-13"},
]
# Run the MCP server
if __name__ == "__main__":
mcp.run()

Many AI libraries use async Python. Here's a quick primer:
import asyncio
async def call_llm(prompt: str) -> str:
"""Simulate an async LLM call."""
print(f" Sending: {prompt[:40]}...")
await asyncio.sleep(1) # Simulates network delay
return f"Response to: {prompt[:20]}"
async def run_parallel_tests():
"""Run multiple LLM calls in parallel — much faster!"""
prompts = [
"Test case 1: Check login flow",
"Test case 2: Verify search results",
"Test case 3: Validate checkout process",
]
# Run all calls in parallel (not one-by-one!)
results = await asyncio.gather(*[call_llm(p) for p in prompts])
for r in results:
print(f" Got: {r}")
# Run it
asyncio.run(run_parallel_tests())

| Concept | Syntax | Example |
|---|---|---|
| Variable | `name = value` | `model = "claude"` |
| f-string | `f"text {var}"` | `f"Score: {0.95}"` |
| List | `[a, b, c]` | `["P0", "P1", "P2"]` |
| Dictionary | `{"key": value}` | `{"model": "gpt-4"}` |
| Function | `def name(params):` | `def test(x): return x > 0.7` |
| Type hint | `param: type` | `score: float = 0.7` |
| List comp | `[x for x in list if cond]` | `[s for s in scores if s > 0.7]` |
| Import | `from mod import func` | `from json import loads` |
| Class | `class Name:` | `class TestCase:` |
| Decorator | `@decorator` | `@mcp.tool()` |
| With | `with open(f) as x:` | `with open("data.json") as f:` |
| if/elif/else | `if cond: / elif: / else:` | `if score > 0.9: "PASS"` |
# Standard Library — always available
import json # Parse/create JSON
import os # Environment vars, file paths
import time # Delays, timestamps
from pathlib import Path # Modern file paths
from datetime import datetime # Dates and times
from typing import Optional # Type hints
from dataclasses import dataclass # Quick data classes
# Third-Party — install with pip
from pydantic import BaseModel # Data validation
from deepeval import assert_test # LLM testing
from fastmcp import FastMCP # MCP servers
import requests # HTTP calls
from dotenv import load_dotenv # Load .env files
import pytest # Testing framework

| Module | What You'll Learn | Python Concepts Used |
|---|---|---|
| Pydantic Deep Dive | Validate LLM outputs, build schemas | Classes, type hints, decorators |
| DeepEval | Test LLMs for hallucination, relevance, toxicity | Functions, imports, pytest, async |
| MCP Servers | Build tools that Claude can call | Decorators, type hints, modules, async |
| LangChain / CrewAI | Build AI agent teams | Classes, inheritance, config files |
| RAG Pipelines | Retrieval-augmented generation | File I/O, dictionaries, imports |
Final Thought: You don't need to memorize everything in this tutorial. Bookmark it, refer back to it, and most importantly — write code. Every pattern here will become second nature once you start building your first MCP server, your first DeepEval test suite, or your first Pydantic schema.
Welcome to AI Testing. Let's build something.
Python Foundations for AI Testers — The Testing Academy, AI Tester Batch 1X