Duration: ~60 minutes
Audience: AI Tester Batch 1X — QA professionals with 2–10 years of experience
Pre-requisites: Basic programming awareness (any language)
What You'll Build Towards: MCP servers, LLM evaluations with DeepEval, Pydantic validation
- Why Python for AI Testing?
- Setting Up Your Environment
- Python Basics — The Fast Track
- Data Structures That Matter for AI Testing
- Functions and Type Hints
- Working with Modules and Imports
- Classes and Objects — Just Enough for AI Frameworks
- File I/O and JSON Handling
- Virtual Environments and pip
- Connecting the Dots — Real AI Testing Patterns
- Summary Cheat Sheet
Before we write a single line of code, let's talk about why Python is non-negotiable in the AI testing world.
Every major AI/ML framework is Python-first:
| Tool/Framework | What It Does | Why You'll Use It |
|---|---|---|
| DeepEval | LLM evaluation & testing | Write test cases for LLM outputs |
| Pydantic | Data validation | Validate structured outputs from LLMs |
| FastMCP | Build MCP servers | Expose tools for Claude/LLMs to call |
| LangChain | LLM orchestration | Chain prompts, build RAG pipelines |
| CrewAI | Multi-agent systems | Build AI agent teams for QA workflows |
| pytest | Test framework | Run all your AI test suites |
One-Liner: If you can write Python, you can test AI. If you can't, you're limited to no-code tools forever.
As a QA professional, you already understand test design, edge cases, and validation logic. Python is simply the language that lets you apply those skills to AI systems.
# Check if Python is already installed
python3 --version
# If not, download from https://www.python.org/downloads/
# Recommended: Python 3.10 or higher (required by DeepEval and FastMCP)

Create a file called hello_tester.py:
# hello_tester.py
print("Hello, AI Tester! Your Python journey starts now.")

Run it:
python3 hello_tester.py

Output:
Hello, AI Tester! Your Python journey starts now.
Tip: Unlike Java or C#, Python doesn't need a main method, a class wrapper, or semicolons. You write code, you run it. That's it.
Python is dynamically typed — you don't declare types, Python figures it out.
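Dynamic typing in one small example: the built-in type() shows what a variable currently holds, and the same name can be rebound to a different type later.

```python
# The same name can hold values of different types over time.
answer = 42
print(type(answer))    # <class 'int'>
answer = "forty-two"   # rebound to a string — no declaration needed
print(type(answer))    # <class 'str'>
```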
# Strings — you'll use these A LOT for prompts and LLM outputs
test_prompt = "Check if this API returns valid JSON"
model_name = "gpt-4"
expected_output = 'The response should contain a "status" field'
# Numbers
max_retries = 3 # int
temperature = 0.7 # float (you'll see this in every LLM config)
confidence_threshold = 0.85 # float
# Booleans
is_hallucination = False
test_passed = True
# None — represents "no value" (like null in Java/JavaScript)
api_key = None

When you're testing LLMs, you're constantly working with strings — prompts, responses, extracted text.
# f-strings — the modern way to build strings (Python 3.6+)
model = "claude-sonnet-4-20250514"
task = "summarize bug reports"
prompt = f"You are {model}. Your task is to {task}."
print(prompt)
# Output: You are claude-sonnet-4-20250514. Your task is to summarize bug reports.
# Multi-line strings — perfect for prompt templates
system_prompt = """
You are a Senior QA Engineer AI assistant.
Your job is to:
1. Analyze test results
2. Identify flaky tests
3. Suggest root causes
Always respond in JSON format.
"""
# Common string methods you'll use daily
llm_response = " The test PASSED with 95% confidence. "
llm_response.strip() # Remove whitespace: "The test PASSED with 95% confidence."
llm_response.lower() # Lowercase: " the test passed with 95% confidence. "
llm_response.upper() # Uppercase: " THE TEST PASSED WITH 95% CONFIDENCE. "
"PASSED" in llm_response # Check if substring exists: True
"FAILED" in llm_response # False
llm_response.replace("PASSED", "SUCCEEDED") # Replace text
llm_response.split() # Split into words: ['The', 'test', 'PASSED', ...]

# Comparison — used in assertions and test conditions
severity = "P0"
score = 0.92
severity == "P0" # True (equal)
severity != "P1" # True (not equal)
score > 0.8 # True
score >= 0.92 # True
score < 1.0 # True
# Logical operators — combine conditions
is_critical = severity == "P0" and score > 0.9 # True (both must be true)
needs_review = severity == "P0" or score < 0.5 # True (at least one true)
is_stable = not is_critical # False (negation)

# Basic if/elif/else
test_score = 0.73
if test_score >= 0.9:
verdict = "PASS — High confidence"
elif test_score >= 0.7:
verdict = "PASS — Marginal, needs review"
elif test_score >= 0.5:
verdict = "WARN — Low confidence"
else:
verdict = "FAIL — Below threshold"
print(verdict) # Output: PASS — Marginal, needs review
# Real-world pattern: Classifying LLM output quality
def classify_response(hallucination_score, relevance_score):
"""Classify an LLM response based on evaluation metrics."""
if hallucination_score > 0.5:
return "REJECTED — Hallucination detected"
elif relevance_score < 0.3:
return "REJECTED — Off-topic response"
elif relevance_score < 0.7:
return "NEEDS_REVIEW — Partially relevant"
else:
return "ACCEPTED — Good quality"

Warning: Python uses indentation (4 spaces) instead of curly braces {}. This is not optional — wrong indentation = broken code.
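A quick illustration of what that means in practice (the score and log entries here are made up):

```python
# Everything indented under the if belongs to it; the unindented
# line afterwards always runs, whatever the condition was.
score = 0.95
log = []
if score >= 0.9:
    log.append("High confidence")   # inside the if-block
    log.append("No review needed")  # same indent level: also inside
log.append("Evaluation complete")   # back at column 0: always runs
print(log)
# ['High confidence', 'No review needed', 'Evaluation complete']
```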
# for loop — iterate over a collection
test_results = ["PASS", "FAIL", "PASS", "PASS", "FAIL"]
fail_count = 0
for result in test_results:
if result == "FAIL":
fail_count += 1
print(f"Failed: {fail_count} out of {len(test_results)}")
# Output: Failed: 2 out of 5
# for loop with range — when you need the index
for i in range(5):
print(f"Running test case {i + 1}...")
# enumerate — get both index AND value (very Pythonic)
models = ["gpt-4", "claude-sonnet", "gemini-pro"]
for index, model in enumerate(models):
print(f" Model {index + 1}: {model}")
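A companion to enumerate is zip, which pairs two sequences element by element. That is handy for comparing expected vs actual results (illustrative data):

```python
# zip stops at the shorter of the two lists and yields pairs.
expected = ["PASS", "PASS", "FAIL"]
actual = ["PASS", "FAIL", "FAIL"]
for exp, act in zip(expected, actual):
    marker = "match" if exp == act else "MISMATCH"
    print(f"  {marker}: expected={exp} actual={act}")
```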
# while loop — repeat until a condition is met
retries = 0
max_retries = 3
success = False
while retries < max_retries and not success:
print(f" Attempt {retries + 1}...")
# Simulate: pretend attempt 3 succeeds
if retries == 2:
success = True
retries += 1
print(f" Success: {success} after {retries} attempts")

List comprehensions are a pattern you'll see everywhere in AI/ML codebases: a one-liner way to create lists.
# Traditional loop
scores = [0.9, 0.3, 0.85, 0.45, 0.72]
passing_scores = []
for s in scores:
if s >= 0.7:
passing_scores.append(s)
# Same thing as a list comprehension — one line!
passing_scores = [s for s in scores if s >= 0.7]
print(passing_scores) # [0.9, 0.85, 0.72]
# Transform AND filter
labels = [f"PASS ({s})" if s >= 0.7 else f"FAIL ({s})" for s in scores]
print(labels)
# ['PASS (0.9)', 'FAIL (0.3)', 'PASS (0.85)', 'FAIL (0.45)', 'PASS (0.72)']
# Real use case: Extract test names from a list of test objects
test_cases = [
{"name": "test_login", "status": "passed"},
{"name": "test_checkout", "status": "failed"},
{"name": "test_search", "status": "passed"},
]
failed_tests = [t["name"] for t in test_cases if t["status"] == "failed"]
print(failed_tests) # ['test_checkout']

# Lists are your go-to for sequences of things
test_steps = ["login", "navigate", "click_button", "verify"]
# Access by index (0-based)
test_steps[0] # "login"
test_steps[-1] # "verify" (last element)
# Modify
test_steps.append("screenshot") # Add to end
test_steps.insert(2, "wait_for_load") # Insert at position 2
test_steps.remove("click_button") # Remove by value
popped = test_steps.pop() # Remove & return last item
# Slicing — extract sublists
first_two = test_steps[:2] # First 2 elements
last_two = test_steps[-2:] # Last 2 elements
middle = test_steps[1:3] # Index 1 and 2

Dictionaries are critical for AI testing because every API response, every LLM config, and every evaluation result is a dictionary (or JSON, which becomes a dictionary in Python).
# A test configuration — looks exactly like JSON!
llm_config = {
"model": "claude-sonnet-4-20250514",
"temperature": 0.7,
"max_tokens": 1024,
"system_prompt": "You are a QA assistant.",
"tags": ["testing", "automation"]
}
# Access values
llm_config["model"] # "claude-sonnet-4-20250514"
llm_config.get("timeout", 30) # 30 (returns default if key missing — safer!)
# Modify
llm_config["temperature"] = 0.5 # Update
llm_config["top_p"] = 0.9 # Add new key
del llm_config["tags"] # Delete key
# Check if key exists
if "model" in llm_config:
print(f"Using model: {llm_config['model']}")
# Loop through key-value pairs
for key, value in llm_config.items():
print(f" {key}: {value}")
# Nested dictionaries — very common in API responses
eval_result = {
"test_name": "hallucination_check",
"metrics": {
"hallucination_score": 0.12,
"relevance_score": 0.89,
"coherence_score": 0.95
},
"verdict": "PASS"
}
# Access nested values
hall_score = eval_result["metrics"]["hallucination_score"]
print(f"Hallucination score: {hall_score}") # 0.12

# Tuples are like lists but cannot be changed after creation
# Use for fixed data — coordinates, return values, constant configs
severity_levels = ("P0", "P1", "P2", "P3", "P4")
# You CAN read them
severity_levels[0] # "P0"
# You CANNOT modify them
# severity_levels[0] = "Critical" # ERROR!
# Common pattern: returning multiple values from a function
def evaluate_response(response_text):
score = 0.87
is_valid = True
return score, is_valid # Returns a tuple
score, is_valid = evaluate_response("test response")
print(f"Score: {score}, Valid: {is_valid}")

# Sets automatically remove duplicates
failed_modules = {"auth", "payments", "auth", "search", "payments"}
print(failed_modules) # {"auth", "payments", "search"}
# Set operations — great for comparing test results
yesterday_failures = {"auth", "payments", "search"}
today_failures = {"auth", "checkout", "search"}
# What's new today?
new_failures = today_failures - yesterday_failures
print(f"New failures: {new_failures}") # {"checkout"}
# What's consistently failing?
persistent = yesterday_failures & today_failures
print(f"Persistent: {persistent}") # {"auth", "search"}
# All unique failures across both days
all_failures = yesterday_failures | today_failures
print(f"All: {all_failures}") # {"auth", "payments", "search", "checkout"}

def run_test(test_name, expected, actual):
"""
Compare expected vs actual result and return a verdict.
This docstring is important — tools like DeepEval use them.
"""
if expected == actual:
print(f" ✅ {test_name}: PASSED")
return True
else:
print(f" ❌ {test_name}: FAILED (expected '{expected}', got '{actual}')")
return False
# Call it
run_test("status_code", 200, 200) # ✅ PASSED
run_test("response_body", "ok", "error") # ❌ FAILED

def call_llm(prompt, model="claude-sonnet-4-20250514", temperature=0.7, max_tokens=1024):
"""Simulate calling an LLM with configurable parameters."""
print(f" Calling {model} (temp={temperature}, max_tokens={max_tokens})")
print(f" Prompt: {prompt[:50]}...")
return f"Response from {model}"
# Different ways to call this function
call_llm("Summarize this bug report") # Uses all defaults
call_llm("Summarize this", model="gpt-4") # Override just the model
call_llm("Summarize this", temperature=0.0) # Override just temperature
call_llm("Summarize this", max_tokens=2048, temperature=0.3) # Override multiple

Type hints don't enforce types at runtime, but they're essential for Pydantic, FastMCP, and DeepEval because these frameworks read your type hints to validate data.
# Basic type hints
def calculate_pass_rate(passed: int, total: int) -> float:
"""Calculate the pass rate as a percentage."""
if total == 0:
return 0.0
return (passed / total) * 100
result: float = calculate_pass_rate(8, 10)
print(f"Pass rate: {result}%") # Pass rate: 80.0%
# Type hints with complex types
from typing import Any
def evaluate_test(
test_name: str,
score: float,
threshold: float = 0.7,
tags: list[str] | None = None # Python 3.10+ syntax
) -> dict[str, Any]:
"""Evaluate a single test and return structured result."""
passed = score >= threshold
return {
"test_name": test_name,
"score": score,
"passed": passed,
"tags": tags or []
}
# Usage
result = evaluate_test("hallucination_check", 0.92, tags=["critical", "llm"])
print(result)

Why This Matters: When you build MCP tools with FastMCP, the framework reads your function's type hints to auto-generate the tool schema that Claude uses. Bad type hints = Claude can't use your tool properly.
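A minimal sketch of how a framework can read those hints at runtime, using the standard library's get_type_hints (score_test is a made-up function for illustration):

```python
from typing import get_type_hints

def score_test(name: str, score: float) -> bool:
    """Hypothetical check used only to demonstrate hint introspection."""
    return score >= 0.7

# Frameworks inspect annotations like this to build schemas.
hints = get_type_hints(score_test)
print(hints)
# {'name': <class 'str'>, 'score': <class 'float'>, 'return': <class 'bool'>}
```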
# Lambda = anonymous one-line function
# Useful for sorting and quick transformations
test_results = [
{"name": "test_login", "duration": 2.5},
{"name": "test_search", "duration": 0.8},
{"name": "test_checkout", "duration": 5.1},
]
# Sort by duration (fastest first)
sorted_results = sorted(test_results, key=lambda x: x["duration"])
for r in sorted_results:
print(f" {r['name']}: {r['duration']}s")
# Filter with lambda
long_tests = list(filter(lambda x: x["duration"] > 2.0, test_results))
print(f"Slow tests: {[t['name'] for t in long_tests]}")

This is one of the most important sections for AI testing. Every framework, every library, every tool you'll use requires importing modules.
A module is simply a .py file that contains Python code (functions, classes, variables). When you import a module, you're loading that code into your program.
my_project/
├── utils.py ← This is a module
├── test_runner.py ← This is also a module
└── config.py ← And so is this
# ----- Style 1: Import the entire module -----
import json
data = json.loads('{"status": "pass"}') # Must prefix with module name
json.dumps(data, indent=2)
# ----- Style 2: Import specific items from a module -----
from json import loads, dumps
data = loads('{"status": "pass"}') # No prefix needed
dumps(data, indent=2)
# ----- Style 3: Import with an alias -----
import numpy as np # Convention: numpy → np
import pandas as pd # Convention: pandas → pd
# You'll see these in EVERY data science / ML codebase
scores = np.array([0.8, 0.9, 0.75])
mean_score = np.mean(scores)
# ----- Style 4: Import specific items with alias -----
from pydantic import BaseModel as BM
# ----- Style 5: Import everything (AVOID THIS) -----
from json import * # Imports EVERYTHING — pollutes namespace, hard to debug

Best Practice: Use Style 1 or Style 2. Style 3 only for well-known conventions (np, pd). Never use Style 5.
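A short sketch of why Style 5 bites: a star import fills your namespace with names you never see, and a later definition can shadow one of them silently (the shadowing loads below is deliberate):

```python
from json import *  # Style 5: silently pulls in loads, dumps, and more

def loads(s):
    # Accidentally shadows json.loads — Python gives no warning.
    return "not JSON at all"

print(loads('{"status": "pass"}'))  # the real parser is now unreachable here
```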
Python ships with a rich standard library. Here are the modules you'll use constantly in AI testing:
# ----- json — Every API response, every LLM output -----
import json
# Parse JSON string into Python dictionary
api_response = '{"model": "claude", "score": 0.95, "passed": true}'
data = json.loads(api_response) # String → Dictionary
print(data["model"]) # "claude"
print(type(data)) # <class 'dict'>
# Convert Python dictionary back to JSON string
result = {"test": "hallucination", "score": 0.12}
json_string = json.dumps(result, indent=2) # Dictionary → Pretty JSON string
print(json_string)
# ----- os — File paths, environment variables -----
import os
# Read API keys from environment (NEVER hardcode secrets!)
api_key = os.environ.get("OPENAI_API_KEY", "not-set")
groq_key = os.getenv("GROQ_API_KEY") # Same thing, shorter syntax
# File path operations
project_root = os.getcwd() # Current directory
config_path = os.path.join(project_root, "config", "settings.json")
file_exists = os.path.exists(config_path)
# List files in a directory
test_files = os.listdir("tests/")
# ----- datetime — Timestamps for test reports -----
from datetime import datetime, timedelta
now = datetime.now()
print(f"Test run started: {now.strftime('%Y-%m-%d %H:%M:%S')}")
one_hour_ago = now - timedelta(hours=1)
print(f"Previous run: {one_hour_ago.strftime('%Y-%m-%d %H:%M:%S')}")
# ----- time — Delays and performance measurement -----
import time
start = time.time()
# ... do something ...
time.sleep(1) # Wait 1 second
elapsed = time.time() - start
print(f"Operation took: {elapsed:.2f} seconds")
# ----- pathlib — Modern file path handling (preferred over os.path) -----
from pathlib import Path
project = Path("my_project")
config_file = project / "config" / "settings.json" # Slash operator builds paths!
print(config_file) # my_project/config/settings.json
print(config_file.suffix) # .json
print(config_file.stem) # settings
# ----- typing — Type hints for complex structures -----
from typing import Any, Callable, Optional, Union
def process_result(
data: dict[str, Any],
callback: Optional[Callable] = None
) -> Union[str, None]:
"""Process and optionally transform a result."""
if callback:
return callback(data)
return None
# ----- dataclasses — Quick data containers -----
from dataclasses import dataclass
@dataclass
class TestResult:
name: str
score: float
passed: bool
duration_seconds: float = 0.0
result = TestResult(name="hallucination_check", score=0.92, passed=True)
print(result) # TestResult(name='hallucination_check', score=0.92, passed=True, duration_seconds=0.0)

# Install packages using pip
pip install pydantic
pip install deepeval
pip install fastmcp
pip install requests
pip install python-dotenv

# ----- requests — Make HTTP calls to APIs -----
import requests
response = requests.get("https://api.example.com/tests")
print(response.status_code) # 200
data = response.json() # Automatically parses JSON
# POST request (like calling an LLM API)
payload = {
"model": "claude-sonnet-4-20250514",
"messages": [{"role": "user", "content": "Hello!"}]
}
response = requests.post(
"https://api.anthropic.com/v1/messages",
json=payload,
headers={"x-api-key": os.getenv("ANTHROPIC_API_KEY")}
)
# ----- dotenv — Load environment variables from .env file -----
from dotenv import load_dotenv
import os
load_dotenv() # Loads variables from .env file in project root
api_key = os.getenv("ANTHROPIC_API_KEY")
# ----- pydantic — Data validation (PREVIEW: you'll go deep on this!) -----
from pydantic import BaseModel, Field
class BugReport(BaseModel):
title: str
severity: str = Field(..., pattern="^P[0-4]$") # Must be P0-P4
description: str
is_reproducible: bool = True
# This works
bug = BugReport(title="Login fails", severity="P0", description="Login button unresponsive")
print(bug.model_dump()) # Converts to dictionary
# This raises a validation error!
try:
bad_bug = BugReport(title="Test", severity="HIGH", description="Desc")
except Exception as e:
print(f"Validation error: {e}")This is where it gets practical. In a real AI testing project, you'll organize code across multiple files.
Project Structure:
ai_test_project/
├── config.py ← Configuration and constants
├── llm_client.py ← LLM API wrapper
├── evaluators.py ← Evaluation functions
├── test_runner.py ← Main test runner
└── utils/
├── __init__.py ← Makes this folder a package
├── scoring.py ← Scoring utilities
└── formatting.py ← Output formatting
config.py:
# config.py — Project-wide configuration
MODEL_NAME = "claude-sonnet-4-20250514"
TEMPERATURE = 0.7
MAX_TOKENS = 1024
PASS_THRESHOLD = 0.7
SEVERITY_LEVELS = {
"P0": "Blocker",
"P1": "Critical",
"P2": "Major",
"P3": "Minor",
"P4": "Trivial"
}
API_ENDPOINTS = {
"anthropic": "https://api.anthropic.com/v1/messages",
"groq": "https://api.groq.com/openai/v1/chat/completions"
}

utils/scoring.py:
# utils/scoring.py — Reusable scoring functions
def calculate_pass_rate(results: list[dict]) -> float:
"""Calculate percentage of passing tests."""
if not results:
return 0.0
passed = sum(1 for r in results if r.get("passed", False))
return (passed / len(results)) * 100
def classify_severity(score: float) -> str:
"""Map a numeric score to severity level."""
if score >= 0.9:
return "P0"
elif score >= 0.7:
return "P1"
elif score >= 0.5:
return "P2"
else:
return "P3"
def normalize_score(raw_score: float, min_val: float = 0, max_val: float = 1) -> float:
"""Normalize a score to 0-1 range."""
if max_val == min_val:
return 0.0
return (raw_score - min_val) / (max_val - min_val)

utils/__init__.py:
# utils/__init__.py — Controls what gets imported with "from utils import ..."
from .scoring import calculate_pass_rate, classify_severity
from .formatting import format_report # if it exists

test_runner.py — Bringing It All Together:
# test_runner.py — Main entry point
# Import from our own modules
from config import MODEL_NAME, PASS_THRESHOLD, SEVERITY_LEVELS
from utils.scoring import calculate_pass_rate, classify_severity
# Import third-party packages
import json
from datetime import datetime
def main():
print(f"🚀 Test Runner Started — {datetime.now().strftime('%Y-%m-%d %H:%M')}")
print(f" Model: {MODEL_NAME}")
print(f" Pass Threshold: {PASS_THRESHOLD}")
print()
# Simulated test results
results = [
{"name": "hallucination_check", "score": 0.95, "passed": True},
{"name": "relevance_check", "score": 0.42, "passed": False},
{"name": "coherence_check", "score": 0.88, "passed": True},
{"name": "toxicity_check", "score": 0.97, "passed": True},
{"name": "format_check", "score": 0.31, "passed": False},
]
# Use our imported functions
pass_rate = calculate_pass_rate(results)
print(f"📊 Pass Rate: {pass_rate}%")
print()
# Classify each result
for r in results:
severity = classify_severity(r["score"])
status = "✅" if r["passed"] else "❌"
label = SEVERITY_LEVELS.get(severity, "Unknown")
print(f" {status} {r['name']}: {r['score']} ({severity} - {label})")
# Export results
report = {
"timestamp": datetime.now().isoformat(),
"model": MODEL_NAME,
"pass_rate": pass_rate,
"results": results
}
with open("test_report.json", "w") as f:
json.dump(report, f, indent=2)
print(f"\n📄 Report saved to test_report.json")
# This is the entry point pattern
if __name__ == "__main__":
main()

# This block runs ONLY when you execute the file directly
# It does NOT run when the file is imported as a module
# Direct: python3 test_runner.py → __name__ is "__main__" → runs
# Import: from test_runner import main → __name__ is "test_runner" → skips
if __name__ == "__main__":
main()

Why This Matters: Almost every Python file in AI frameworks uses this pattern. It lets a file be both a runnable script AND an importable module.
# ----- Absolute imports (recommended) -----
from utils.scoring import calculate_pass_rate
from config import MODEL_NAME
# ----- Relative imports (inside packages only) -----
# From utils/scoring.py, importing from utils/formatting.py:
from .formatting import format_report # . = same package
from ..config import MODEL_NAME # .. = parent package

# Common Error 1: ModuleNotFoundError
# import deepeval → ModuleNotFoundError
# Fix: pip install deepeval
# Common Error 2: ImportError — wrong name
# from pydantic import base_model → ImportError
# Fix: from pydantic import BaseModel (case-sensitive!)
# Common Error 3: Circular imports
# file_a.py imports from file_b.py, which imports from file_a.py
# Fix: restructure your code or use lazy imports
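A sketch of the lazy-import fix: move the import inside the function so it runs at call time instead of at module load (here json stands in for the module that would otherwise complete the cycle):

```python
def summarize_results(results: list) -> str:
    # Lazy import: resolved only when the function is called,
    # after both modules have finished loading.
    import json
    return json.dumps({"total": len(results)})

print(summarize_results([{"passed": True}]))  # {"total": 1}
```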
# Debugging tip: Check where a module lives
import json
print(json.__file__) # Shows the file path of the module
# Check what's available in a module
import os
print(dir(os)) # Lists all functions and attributes

You don't need to master OOP to start with AI testing, but you need to read and understand classes because every framework uses them.
class TestCase:
"""Represents a single test case for LLM evaluation."""
def __init__(self, name: str, prompt: str, expected: str):
"""Constructor — runs when you create a new TestCase."""
self.name = name # Instance attribute
self.prompt = prompt
self.expected = expected
self.actual = None # Will be set after running
self.passed = False
def run(self, llm_response: str):
"""Run the test case against an LLM response."""
self.actual = llm_response
self.passed = self.expected.lower() in llm_response.lower()
return self.passed
def report(self) -> str:
"""Generate a human-readable report."""
status = "✅ PASS" if self.passed else "❌ FAIL"
return f"{status} | {self.name} | Expected: {self.expected} | Got: {self.actual}"
# Create instances (objects)
tc1 = TestCase("json_format", "Respond in JSON", "json")
tc2 = TestCase("polite_tone", "Be polite", "please")
# Run tests
tc1.run('{"result": "Here is the JSON output"}')
tc2.run("Here is the result you wanted")
# Print reports
print(tc1.report()) # ✅ PASS | json_format ...
print(tc2.report()) # ❌ FAIL | polite_tone ...

from pydantic import BaseModel
# Your class INHERITS from BaseModel
# This gives it automatic validation, serialization, etc.
class LLMTestConfig(BaseModel):
model: str
temperature: float = 0.7
max_tokens: int = 1024
system_prompt: str = "You are a helpful assistant."
# Pydantic auto-validates when you create an instance
config = LLMTestConfig(model="claude-sonnet-4-20250514")
print(config.model_dump()) # {'model': 'claude-sonnet-4-20250514', 'temperature': 0.7, ...}
# Validation error if you pass wrong type
try:
bad_config = LLMTestConfig(model="gpt-4", temperature="hot") # "hot" is not a float!
except Exception as e:
print(f"Validation failed: {e}")

You'll see decorators everywhere in FastMCP, pytest, and other frameworks. Think of them as "add-ons" that wrap a function with extra behavior.
# ----- Built-in decorators -----
class TestSuite:
@staticmethod
def version():
return "1.0.0"
@classmethod
def create_default(cls):
return cls()
# ----- The decorator pattern you'll use with FastMCP -----
# (Conceptual preview — you'll implement this in MCP module)
from fastmcp import FastMCP
mcp = FastMCP("QA Tools")
@mcp.tool() # ← This decorator registers the function as an MCP tool
def analyze_bug(title: str, description: str) -> str:
"""Analyze a bug report and suggest severity."""
# Your logic here
return f"Bug '{title}' analyzed: Severity P1"
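Under the hood, a decorator is just a function that takes a function and returns a wrapped version. A minimal hand-written sketch (timed and run_smoke_suite are illustrative, not from any framework):

```python
import time
from functools import wraps

def timed(func):
    """Takes a function, returns a version of it with timing added."""
    @wraps(func)  # keeps the original function's name and docstring
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        print(f"{func.__name__} took {time.time() - start:.3f}s")
        return result
    return wrapper

@timed  # same as writing: run_smoke_suite = timed(run_smoke_suite)
def run_smoke_suite():
    time.sleep(0.05)  # pretend to run some tests
    return "5 passed, 0 failed"

print(run_smoke_suite())
```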
# ----- pytest decorators -----
import pytest
@pytest.mark.parametrize("input,expected", [
("PASS", True),
("FAIL", False),
("ERROR", False),
])
def test_status_check(input, expected):
assert (input == "PASS") == expected

# ----- Writing a file -----
with open("test_output.txt", "w") as f:
f.write("Test Results\n")
f.write("============\n")
f.write("Test 1: PASSED\n")
f.write("Test 2: FAILED\n")
# ----- Reading a file -----
with open("test_output.txt", "r") as f:
content = f.read() # Read entire file as string
print(content)
# Read line by line (memory-efficient for large files)
with open("test_output.txt", "r") as f:
for line in f:
print(line.strip())
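For contrast, here is roughly what those with blocks are doing for you, written out manually (a sketch using a throwaway manual_output.txt):

```python
# Manual version: you must remember the try/finally yourself.
f = open("manual_output.txt", "w")
try:
    f.write("Test 1: PASSED\n")
finally:
    f.close()  # runs even if write() raises

# The with version performs the same try/finally automatically.
with open("manual_output.txt") as f:
    print(f.read())  # Test 1: PASSED
```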
# ----- The "with" statement -----
# Automatically closes the file when the block ends
# ALWAYS use "with" — never manually call open() and close() yourself

import json
# ----- Write JSON -----
test_config = {
"model": "claude-sonnet-4-20250514",
"tests": [
{"name": "hallucination", "threshold": 0.8},
{"name": "relevance", "threshold": 0.7},
],
"metadata": {
"author": "QA Team",
"version": "1.0"
}
}
with open("config.json", "w") as f:
json.dump(test_config, f, indent=2)
# ----- Read JSON -----
with open("config.json", "r") as f:
loaded_config = json.load(f)
print(loaded_config["model"]) # claude-sonnet-4-20250514
print(loaded_config["tests"][0]) # {'name': 'hallucination', 'threshold': 0.8}
# ----- String ↔ JSON -----
json_string = json.dumps(test_config, indent=2) # Dict → String
parsed_dict = json.loads(json_string) # String → Dict

Different projects need different package versions. Virtual environments keep them isolated.
# Create a virtual environment
python3 -m venv ai_testing_env
# Activate it
# macOS/Linux:
source ai_testing_env/bin/activate
# Windows:
ai_testing_env\Scripts\activate
# Your terminal prompt changes to show the active env:
# (ai_testing_env) $
# Install packages inside the virtual environment
pip install pydantic deepeval fastmcp requests python-dotenv
# Save your dependencies
pip freeze > requirements.txt
# Later, recreate the environment from requirements
pip install -r requirements.txt
# Deactivate when done
deactivate

# requirements.txt
pydantic>=2.0
deepeval>=1.0
fastmcp>=0.1
requests>=2.28
python-dotenv>=1.0
pytest>=7.0
Tip: Always create a requirements.txt for your AI testing projects. When you share code with your team or deploy to CI/CD, this file ensures everyone has the same packages.
Now let's see how everything fits together with patterns you'll actually use.
# This is a preview of what you'll build in the DeepEval module
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import HallucinationMetric
def test_no_hallucination():
"""Test that the LLM doesn't hallucinate facts."""
test_case = LLMTestCase(
input="What is the capital of France?",
actual_output="The capital of France is Paris, founded in 250 BC.",
context=["Paris is the capital and most populous city of France."]
)
metric = HallucinationMetric(threshold=0.5)
assert_test(test_case, [metric])

# Validate that an LLM returns properly structured data
from pydantic import BaseModel, Field
from typing import Optional
import json
class TestPlanOutput(BaseModel):
"""Expected structure of an LLM-generated test plan."""
feature_name: str
test_cases: list[str] = Field(min_length=1)
priority: str = Field(pattern="^(P0|P1|P2|P3|P4)$")
estimated_hours: float = Field(gt=0, le=100)
automation_possible: bool
notes: Optional[str] = None
# Simulate LLM response
llm_response = '''
{
"feature_name": "User Login",
"test_cases": [
"Verify login with valid credentials",
"Verify login with invalid password",
"Verify account lockout after 5 attempts"
],
"priority": "P0",
"estimated_hours": 4.5,
"automation_possible": true,
"notes": "Requires test accounts in staging"
}
'''
# Parse and validate
try:
parsed = json.loads(llm_response)
plan = TestPlanOutput(**parsed)
print(f"✅ Valid test plan for: {plan.feature_name}")
print(f" Test cases: {len(plan.test_cases)}")
print(f" Priority: {plan.priority}")
except json.JSONDecodeError as e:
print(f"❌ LLM returned invalid JSON: {e}")
except Exception as e:
print(f"❌ Validation failed: {e}")

# This is what an MCP server looks like — you'll build these!
from fastmcp import FastMCP
mcp = FastMCP("QA Assistant Tools")
@mcp.tool()
def get_test_status(test_suite: str, environment: str = "staging") -> dict:
"""
Get the current status of a test suite.
Args:
test_suite: Name of the test suite (e.g., "regression", "smoke")
environment: Target environment (default: staging)
Returns:
Dictionary with test status details
"""
# In reality, this would query your test management system
return {
"suite": test_suite,
"environment": environment,
"total": 150,
"passed": 142,
"failed": 5,
"skipped": 3,
"pass_rate": "94.7%",
"last_run": "2025-01-15T10:30:00Z"
}
@mcp.tool()
def analyze_flaky_tests(days: int = 7) -> list[dict]:
"""
Identify flaky tests from the past N days.
Args:
days: Number of days to analyze (default: 7)
Returns:
List of flaky tests with their flip rates
"""
return [
{"test": "test_payment_flow", "flip_rate": 0.34, "last_flake": "2025-01-14"},
{"test": "test_search_results", "flip_rate": 0.21, "last_flake": "2025-01-13"},
]
# Run the MCP server
if __name__ == "__main__":
mcp.run()

Many AI libraries use async Python. Here's a quick primer:
import asyncio
async def call_llm(prompt: str) -> str:
"""Simulate an async LLM call."""
print(f" Sending: {prompt[:40]}...")
await asyncio.sleep(1) # Simulates network delay
return f"Response to: {prompt[:20]}"
async def run_parallel_tests():
"""Run multiple LLM calls in parallel — much faster!"""
prompts = [
"Test case 1: Check login flow",
"Test case 2: Verify search results",
"Test case 3: Validate checkout process",
]
# Run all calls in parallel (not one-by-one!)
results = await asyncio.gather(*[call_llm(p) for p in prompts])
for r in results:
print(f" Got: {r}")
# Run it
asyncio.run(run_parallel_tests())

| Concept | Syntax | Example |
|---|---|---|
| Variable | `name = value` | `model = "claude"` |
| f-string | `f"text {var}"` | `f"Score: {0.95}"` |
| List | `[a, b, c]` | `["P0", "P1", "P2"]` |
| Dictionary | `{"key": value}` | `{"model": "gpt-4"}` |
| Function | `def name(params):` | `def test(x): return x > 0.7` |
| Type hint | `param: type` | `score: float = 0.7` |
| List comp | `[x for x in list if cond]` | `[s for s in scores if s > 0.7]` |
| Import | `from mod import func` | `from json import loads` |
| Class | `class Name:` | `class TestCase:` |
| Decorator | `@decorator` | `@mcp.tool()` |
| With | `with open(f) as x:` | `with open("data.json") as f:` |
| if/elif/else | `if cond: / elif: / else:` | `if score > 0.9: "PASS"` |
# Standard Library — always available
import json # Parse/create JSON
import os # Environment vars, file paths
import time # Delays, timestamps
from pathlib import Path # Modern file paths
from datetime import datetime # Dates and times
from typing import Optional # Type hints
from dataclasses import dataclass # Quick data classes
# Third-Party — install with pip
from pydantic import BaseModel # Data validation
from deepeval import assert_test # LLM testing
from fastmcp import FastMCP # MCP servers
import requests # HTTP calls
from dotenv import load_dotenv # Load .env files
import pytest # Testing framework

| Module | What You'll Learn | Python Concepts Used |
|---|---|---|
| Pydantic Deep Dive | Validate LLM outputs, build schemas | Classes, type hints, decorators |
| DeepEval | Test LLMs for hallucination, relevance, toxicity | Functions, imports, pytest, async |
| MCP Servers | Build tools that Claude can call | Decorators, type hints, modules, async |
| LangChain / CrewAI | Build AI agent teams | Classes, inheritance, config files |
| RAG Pipelines | Retrieval-augmented generation | File I/O, dictionaries, imports |
Final Thought: You don't need to memorize everything in this tutorial. Bookmark it, refer back to it, and most importantly — write code. Every pattern here will become second nature once you start building your first MCP server, your first DeepEval test suite, or your first Pydantic schema.
Welcome to AI Testing. Let's build something.
Python Foundations for AI Testers — The Testing Academy, AI Tester Batch 1X