BrianZbr/SPEC.md

FlatMonitor - Implementation-Ready MVP Spec

1. Project Overview

A lightweight synthetic monitoring tool that performs HTTP checks and generates a static HTML dashboard. Because the output is static, the dashboard loads quickly and can be hosted on any simple web server or object store (Nginx, S3, etc.) without a live database or API.

2. Project Structure

app/
  main.py          # Orchestrator: manages queues and the main loop
  config/          # YAML loader and validation logic
  scheduler.py     # Job producer: tracks timing and pushes to job_queue
  runner.py        # Worker logic: performs HTTP checks and classifies results
  storage.py       # Single-writer: appends Results handed over by the main loop to CSV
  aggregator.py    # Logic: forward-fills buckets and determines UP/DOWN states
  renderer.py      # Output: Jinja2 templates to static HTML
  models.py        # Shared Pydantic/Dataclass schemas
templates/
  base.html        # Shared layout
  index.html       # Global dashboard
  site.html        # Detailed site view
data/              # CSV storage (/live and /archive)
public/            # Final generated HTML output

3. Core Data Models

DomainConfig

id: Unique identifier.
role: core (essential) or supplementary (optional).
url: The endpoint to check.
interval_seconds: Frequency of checks (default: 300).
expect:
- http_status: Expected code (default: 200).
- body_contains: Optional string to validate content.
bot_protection_string: String to identify bot-block pages (e.g., "Cloudflare").

Result

timestamp: ISO UTC string.
site_id: Group identifier.
domain_id: Individual monitor ID.
domain_status: UP, DOWN, BOT_DETECTED, or TIMEOUT.
http_status: Integer (or null on timeout).
latency_ms: Integer (or null on failure).
failure_type: Detailed error message if applicable.
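
The two models above could be expressed as plain dataclasses. This is a minimal sketch rather than the project's actual models.py: the `Status` and `Expect` names and the use of `Optional` for nullable fields are assumptions, while the field names and defaults come from the lists above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Status(str, Enum):
    """Values allowed in Result.domain_status (the enum name is an assumption)."""
    UP = "UP"
    DOWN = "DOWN"
    BOT_DETECTED = "BOT_DETECTED"
    TIMEOUT = "TIMEOUT"


@dataclass(frozen=True)
class Expect:
    http_status: int = 200               # expected status code
    body_contains: Optional[str] = None  # optional content check


@dataclass(frozen=True)
class DomainConfig:
    id: str
    role: str                            # "core" or "supplementary"
    url: str
    interval_seconds: int = 300
    expect: Expect = field(default_factory=Expect)
    # Assumption: treated as optional here; the spec does not say.
    bot_protection_string: Optional[str] = None


@dataclass(frozen=True)
class Result:
    timestamp: str                       # ISO UTC string
    site_id: str
    domain_id: str
    domain_status: Status
    http_status: Optional[int] = None    # None on timeout
    latency_ms: Optional[int] = None     # None on failure
    failure_type: Optional[str] = None
```

Frozen dataclasses are used here so that objects passed between threads through the queues cannot be mutated after creation; Pydantic models would work equally well if validation is wanted.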

4. Concurrency & Flow (The Queue Pattern)

To ensure thread safety and prevent CSV file corruption:

Job Queue: The Scheduler pushes DomainConfig objects here when they are due.
Worker Pool: 5–10 concurrent threads (default: 10) pull from the Job Queue, run the check in runner.py, and push the completed Result to the Results Queue.
Single Writer: The Main Loop (Main Thread) is the only component that pulls from the Results Queue and passes data to storage.py.
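
The three roles above can be sketched with the standard library's `queue` and `threading` modules. This is an illustration under assumptions: `run_check` is a hypothetical stand-in for runner.py, and the `STOP` sentinel and the `start_workers` and `drain_results` helpers are names invented for this example.

```python
import queue
import threading

job_queue = queue.Queue()      # Scheduler -> workers
results_queue = queue.Queue()  # workers -> main thread
STOP = object()                # sentinel telling a worker to exit (assumption)


def run_check(job):
    """Hypothetical stand-in for runner.py; returns a fake result."""
    return {"domain_id": job, "domain_status": "UP"}


def worker():
    # Workers never touch files: they only move data between the two queues.
    while True:
        job = job_queue.get()
        if job is STOP:
            break
        results_queue.put(run_check(job))


def start_workers(count=10):
    threads = [threading.Thread(target=worker, daemon=True) for _ in range(count)]
    for t in threads:
        t.start()
    return threads


def drain_results():
    """Call from the main thread only: it is the single consumer of results_queue."""
    drained = []
    while True:
        try:
            drained.append(results_queue.get_nowait())
        except queue.Empty:
            return drained
```

`queue.Queue` handles its own locking, so the only shared state that needs care is the CSV files, and those are touched by one thread.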

5. Runner Logic (runner.py)

The runner must classify results using the following priority:

Timeout: If request duration exceeds timeout (default 20s), status is TIMEOUT.
Bot Check: If bot_protection_string is found in the response body, status is BOT_DETECTED.
Status Check: If http_status != expect.http_status, status is DOWN.
Content Check: If expect.body_contains is set and is not found in the response body, status is DOWN.
Success: Otherwise, status is UP.
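
The priority order above maps naturally onto a chain of early returns. The sketch below is deliberately independent of any HTTP library: the `classify` function name and its parameters are assumptions, and the caller is expected to supply whether the request timed out along with the status code and body it received.

```python
from typing import Optional


def classify(
    timed_out: bool,
    http_status: Optional[int],
    body: str,
    expected_status: int = 200,
    body_contains: Optional[str] = None,
    bot_protection_string: Optional[str] = None,
) -> str:
    """Return the domain status, checking conditions in priority order."""
    # 1. Timeout wins over everything else.
    if timed_out:
        return "TIMEOUT"
    # 2. A bot-block page is reported even if the status code is also wrong.
    if bot_protection_string and bot_protection_string in body:
        return "BOT_DETECTED"
    # 3. Unexpected status code.
    if http_status != expected_status:
        return "DOWN"
    # 4. Content check applies only when body_contains is configured.
    if body_contains and body_contains not in body:
        return "DOWN"
    # 5. Nothing failed.
    return "UP"
```

For example, a 403 response whose body contains the configured bot string is classified as BOT_DETECTED rather than DOWN, because the bot check runs before the status check.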

6. Storage & Rotation (storage.py)

Path: /data/live/{site}/{domain}.log.
Format: Simple CSV (append-only).
Atomic Writes: Because the Main Thread is the only consumer of the Results Queue and therefore the only writer, appends never interleave and standard file appends are safe.
Rotation: Every hour, move files from /data/live to /data/archive/{YYYY-MM-DD}/.
Retention: Delete archives older than N days (configurable).
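
A possible shape for the append, rotation, and retention steps is sketched below using only the standard library. The function names, the `data_root` parameter, the column order, and the time prefix added to archived file names (so that several hourly rotations on the same day do not overwrite each other) are all assumptions, not part of the spec.

```python
import csv
import shutil
from datetime import datetime, timezone
from pathlib import Path

# Column order is an assumption; it follows the Result field list.
FIELDS = ["timestamp", "site_id", "domain_id", "domain_status",
          "http_status", "latency_ms", "failure_type"]


def append_csv(data_root: Path, result: dict) -> Path:
    """Append one result row to live/{site}/{domain}.log."""
    path = data_root / "live" / result["site_id"] / f"{result['domain_id']}.log"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", newline="") as fh:
        # csv.writer writes None as an empty field.
        csv.writer(fh).writerow([result.get(k) for k in FIELDS])
    return path


def rotate(data_root: Path, now: datetime) -> None:
    """Move every live file into archive/{YYYY-MM-DD}/{site}/."""
    day_dir = data_root / "archive" / now.strftime("%Y-%m-%d")
    for src in list((data_root / "live").glob("*/*.log")):
        # Assumption: prefix with HHMMSS so hourly rotations do not collide.
        dest = day_dir / src.parent.name / f"{now.strftime('%H%M%S')}-{src.name}"
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(src), str(dest))


def prune(data_root: Path, now: datetime, retention_days: int) -> None:
    """Delete archive day-folders older than retention_days."""
    archive = data_root / "archive"
    if not archive.exists():
        return
    for day_dir in list(archive.iterdir()):
        try:
            day = datetime.strptime(day_dir.name, "%Y-%m-%d").replace(tzinfo=timezone.utc)
        except ValueError:
            continue  # ignore anything that is not a date folder
        if (now - day).days > retention_days:
            shutil.rmtree(day_dir)
```

All three functions are meant to be called from the main thread, which keeps the single-writer guarantee intact during rotation as well.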

7. Aggregator & State (aggregator.py)

The Forward-Fill Rule

To prevent "unknown" gaps in the UI for long-interval monitors (e.g., 5-minute intervals in 1-minute buckets):

For any 1-minute bucket, use the state of the most recent check.
Constraint: Only use the most recent check if it occurred within $2 \times \texttt{interval\_seconds}$.
If no check exists within that window, mark the bucket as UNKNOWN.
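
The rule can be sketched as a lookup over time-sorted checks. The `forward_fill` name and its parameters are assumptions, and so is one detail the rule leaves open: here the age of a check is measured from the start of the bucket, and a check counts for a bucket if it happened before the bucket ends.

```python
from bisect import bisect_left


def forward_fill(checks, bucket_starts, interval_seconds, bucket_seconds=60):
    """Assign a state to each bucket.

    checks: list of (epoch_seconds, status) tuples, sorted by time.
    bucket_starts: epoch seconds marking the start of each bucket.
    """
    times = [t for t, _ in checks]
    max_age = 2 * interval_seconds
    states = []
    for start in bucket_starts:
        # Index of the latest check that happened before this bucket ends.
        idx = bisect_left(times, start + bucket_seconds) - 1
        if idx >= 0 and start - times[idx] <= max_age:
            states.append(checks[idx][1])
        else:
            states.append("UNKNOWN")
    return states
```

With a 300-second interval, a check at t=300 keeps filling buckets up to the one starting at t=900; the bucket starting at t=960 falls outside the 600-second window and becomes UNKNOWN.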

Site Health Logic

DOWN: Any core domain is DOWN or TIMEOUT.
DEGRADED: All core domains are UP, but at least one supplementary domain is DOWN or BOT_DETECTED.
UP: All domains (core and supplementary) are UP.
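
One way to encode these rules is shown below. The `site_health` function and its input shape are assumptions. The rules above do not cover every combination (for example a core domain that is BOT_DETECTED, or a supplementary domain that is TIMEOUT); this sketch treats every such leftover case as DEGRADED, which is a guess and not something the spec states.

```python
def site_health(domains):
    """Compute site health from a list of (role, status) pairs."""
    core = [status for role, status in domains if role == "core"]
    supplementary = [status for role, status in domains if role == "supplementary"]

    # DOWN: any core domain is DOWN or TIMEOUT.
    if any(status in ("DOWN", "TIMEOUT") for status in core):
        return "DOWN"
    # UP: every domain, core and supplementary, is UP.
    if all(status == "UP" for status in core + supplementary):
        return "UP"
    # DEGRADED: core is fine but a supplementary domain is not.
    # Assumption: cases the spec leaves open also land here.
    return "DEGRADED"
```

The DOWN check runs first so that a failing core domain always dominates, regardless of what the supplementary domains report.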

8. Renderer (renderer.py)

Throttle: The renderer should only trigger if new_data is present AND at least 30 seconds have passed since the last build.
Timeline Rendering: Render 240 spans (4 hours) of 1-minute buckets as small color-coded blocks:
- Green
- Red
- Orange/Yellow
- Gray
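
The throttle and the 240-bucket window could be sketched as follows. `RebuildThrottle` and `timeline_bucket_starts` are hypothetical names; the injectable `clock` parameter exists only to make the throttle testable and is not part of the spec.

```python
import time


class RebuildThrottle:
    """Allow a rebuild only when there is new data and 30 s have elapsed."""

    def __init__(self, min_interval=30.0, clock=time.monotonic):
        self.min_interval = min_interval
        self.clock = clock
        self.last_build = None  # no build has happened yet

    def should_build(self, new_data: bool) -> bool:
        if not new_data:
            return False
        now = self.clock()
        if self.last_build is not None and now - self.last_build < self.min_interval:
            return False
        self.last_build = now
        return True


def timeline_bucket_starts(now_epoch: int, buckets=240, bucket_seconds=60):
    """Start times of the most recent buckets, oldest first."""
    current = now_epoch - now_epoch % bucket_seconds  # align to bucket boundary
    return [current - i * bucket_seconds for i in range(buckets - 1, -1, -1)]
```

`time.monotonic` is used instead of wall-clock time so that system clock adjustments cannot stall or double-trigger a rebuild.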

9. Main Loop (main.py)

import time

# Kept outside the loop so a rebuild deferred by the throttle is not forgotten
new_data = False

while True:
    # 1. Schedule checks
    scheduler.tick(job_queue)

    # 2. Process results (single-threaded writing)
    while not results_queue.empty():
        res = results_queue.get()
        storage.append_csv(res)
        new_data = True

    # 3. Aggregation & rendering (throttled)
    if new_data and time_to_rebuild():
        aggregator.process_recent_data()
        renderer.build_static_site()
        new_data = False

    time.sleep(1)
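
The `scheduler.tick(job_queue)` call in the loop above might be backed by something like the following. This is a simplified sketch: the `Scheduler` class, its use of `(domain_id, interval_seconds)` pairs instead of full DomainConfig objects, and the injectable `clock` are all assumptions made to keep the example short.

```python
import queue
import time


class Scheduler:
    """Push a job onto the queue whenever a domain's interval has elapsed."""

    def __init__(self, configs, clock=time.monotonic):
        # configs: iterable of (domain_id, interval_seconds) pairs (simplified)
        self.configs = list(configs)
        self.clock = clock
        # Everything is due immediately on the first tick.
        self.next_due = {domain_id: float("-inf") for domain_id, _ in self.configs}

    def tick(self, job_queue) -> int:
        """Enqueue all due jobs and return how many were pushed."""
        now = self.clock()
        pushed = 0
        for domain_id, interval in self.configs:
            if now >= self.next_due[domain_id]:
                job_queue.put(domain_id)
                self.next_due[domain_id] = now + interval
                pushed += 1
        return pushed
```

Because `tick` only compares timestamps and enqueues, it returns quickly and does not delay the result-draining step that follows it in the loop.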

10. Default Settings

Check Interval: 300s.
HTTP Timeout: 20s.
History Window: 4 hours.
Bucket Size: 1 minute.
Worker Pool Size: 10 threads.