A lightweight, high-performance synthetic monitoring tool that performs HTTP checks and generates a static HTML dashboard. The "Static" approach ensures the dashboard is fast and can be hosted via any simple web server (Nginx, S3, etc.) without a live database or API.
```
app/
  main.py          # Orchestrator: manages queues and the main loop
  config/          # YAML loader and validation logic
  scheduler.py     # Job producer: tracks timing and pushes to job_queue
  runner.py        # Worker logic: performs HTTP checks and classifies results
  storage.py       # Single writer: pulls from results_queue to CSV
  aggregator.py    # Logic: forward-fills buckets and determines UP/DOWN states
  renderer.py      # Output: Jinja2 templates to static HTML
  models.py        # Shared Pydantic/dataclass schemas
  templates/
    base.html      # Shared layout
    index.html     # Global dashboard
    site.html      # Detailed site view
data/              # CSV storage (/live and /archive)
public/            # Final generated HTML output
```
- id: Unique identifier.
- role: `core` (essential) or `supplementary` (optional).
- url: The endpoint to check.
- interval_seconds: Frequency of checks (default: 300).
- expect:
  - http_status: Expected code (default: 200).
  - body_contains: Optional string to validate content.
- bot_protection_string: String to identify bot-block pages (e.g., "Cloudflare").
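A config entry combining these fields might look like the following. Only the per-domain fields are from the schema above; the `sites`/`domains` nesting and the example values are assumptions about how the YAML loader groups monitors:

```yaml
sites:
  - site_id: example
    domains:
      - id: homepage
        role: core
        url: https://example.com/
        interval_seconds: 300
        expect:
          http_status: 200
          body_contains: "Welcome"
        bot_protection_string: "Cloudflare"
      - id: status-api
        role: supplementary
        url: https://example.com/api/health
```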
- timestamp: ISO UTC string.
- site_id: Group identifier.
- domain_id: Individual monitor ID.
- domain_status: `UP`, `DOWN`, `BOT_DETECTED`, or `TIMEOUT`.
- http_status: Integer (or null on timeout).
- latency_ms: Integer (or null on failure).
- failure_type: Detailed error message if applicable.
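A minimal sketch of this record as a dataclass in `models.py`, assuming `Optional` types for the nullable fields (the actual project may use Pydantic instead):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Result:
    timestamp: str                 # ISO UTC string, e.g. "2024-01-01T12:00:00Z"
    site_id: str                   # group identifier
    domain_id: str                 # individual monitor ID
    domain_status: str             # "UP", "DOWN", "BOT_DETECTED", or "TIMEOUT"
    http_status: Optional[int]     # None on timeout
    latency_ms: Optional[int]      # None on failure
    failure_type: Optional[str] = None  # detailed error message if applicable

# Example record for a successful check:
r = Result("2024-01-01T12:00:00Z", "example", "homepage", "UP", 200, 143)
```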
To ensure thread safety and prevent CSV file corruption:
- Job Queue: The Scheduler pushes `DomainConfig` objects here when they are due.
- Worker Pool: 5–10 concurrent threads pull from the Job Queue, execute `runner.py`, and push a completed `Result` to the Results Queue.
- Single Writer: The Main Loop (Main Thread) is the only component that pulls from the Results Queue and passes data to `storage.py`.
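The queue wiring can be sketched with the standard library alone. The `check` callable stands in for `runner.py`'s check function (its real name and signature are assumptions); the key property is that workers only move items between queues and never touch the CSV files:

```python
import queue
import threading

job_queue: queue.Queue = queue.Queue()
results_queue: queue.Queue = queue.Queue()

def worker(check):
    # Workers pull a due DomainConfig, run the check, and push the Result.
    # They never write to disk, so no file locking is needed.
    while True:
        domain_cfg = job_queue.get()
        results_queue.put(check(domain_cfg))
        job_queue.task_done()

def start_pool(check, size: int = 10):
    # Daemon threads die with the main loop; size matches the default pool.
    for _ in range(size):
        threading.Thread(target=worker, args=(check,), daemon=True).start()
```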
The runner must classify results using the following priority:
- Timeout: If request duration exceeds `timeout` (default 20s), status is `TIMEOUT`.
- Bot Check: If `bot_protection_string` is found in the response body, status is `BOT_DETECTED`.
- Status Check: If `http_status` != `expect.http_status`, status is `DOWN`.
- Content Check: If `expect.body_contains` is not in the response, status is `DOWN`.
- Success: Otherwise, status is `UP`.
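This priority chain can be sketched as a pure function over an already-completed request. The signature is an assumption: the real `runner.py` works on live HTTP responses, but separating classification from I/O makes the ordering easy to test:

```python
def classify(cfg: dict, http_status, body: str, elapsed_s: float,
             timeout_s: float = 20.0) -> str:
    """Apply the priority order: TIMEOUT > BOT_DETECTED > status > content > UP."""
    # 1. Timeout wins over everything else (no response to inspect).
    if http_status is None or elapsed_s > timeout_s:
        return "TIMEOUT"
    # 2. A bot-block page may return 200, so check it before the status code.
    bot = cfg.get("bot_protection_string")
    if bot and bot in body:
        return "BOT_DETECTED"
    # 3. Wrong status code.
    expect = cfg.get("expect", {})
    if http_status != expect.get("http_status", 200):
        return "DOWN"
    # 4. Expected content missing from the body.
    needle = expect.get("body_contains")
    if needle and needle not in body:
        return "DOWN"
    # 5. All checks passed.
    return "UP"
```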
- Path: `/data/live/{site}/{domain}.log`.
- Format: Simple CSV (append-only).
- Atomic Writes: Because only the Main Thread handles the Results Queue, standard file appends are safe.
- Rotation: Every hour, move files from `/live` to `/archive/{YYYY-MM-DD}/`.
- Retention: Delete archives older than N days (configurable).
To prevent "unknown" gaps in the UI for long-interval monitors (e.g., 5-minute intervals in 1-minute buckets):
- For any 1-minute bucket, use the state of the most recent check.
- Constraint: Only use the most recent check if it occurred within `2 × interval_seconds`.
- If no check exists within that window, mark the bucket as `UNKNOWN`.
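The forward-fill rule can be sketched as follows (function name and the `(timestamp, status)` tuple shape are assumptions about the aggregator's internals):

```python
from datetime import datetime, timedelta

def fill_buckets(checks, start: datetime, minutes: int, interval_s: int):
    """Forward-fill 1-minute buckets from sparse checks.

    `checks` is a time-sorted list of (timestamp, status) tuples. A bucket
    inherits the most recent preceding check's status only if that check is
    at most 2 * interval_s old; otherwise the bucket is "UNKNOWN".
    """
    out = []
    for i in range(minutes):
        bucket_end = start + timedelta(minutes=i + 1)
        status = "UNKNOWN"
        # Walk backwards to find the most recent check before the bucket's end.
        for ts, s in reversed(checks):
            if ts <= bucket_end:
                if (bucket_end - ts).total_seconds() <= 2 * interval_s:
                    status = s
                break
        out.append(status)
    return out
```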
- DOWN: Any `core` domain is `DOWN` or `TIMEOUT`.
- DEGRADED: All `core` are `UP`, but any `supplementary` domain is `DOWN` or `BOT_DETECTED`.
- UP: All domains (core and supplementary) are `UP`.
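A sketch of the rollup (the `(role, status)` mapping is an assumed shape; note the rules above don't cover every combination, e.g. a `core` domain returning `BOT_DETECTED`, so this sketch falls back to `DEGRADED` for unspecified mixes):

```python
def site_status(domain_states: dict) -> str:
    """Roll up per-domain states; values are (role, status) tuples."""
    core = [s for role, s in domain_states.values() if role == "core"]
    supp = [s for role, s in domain_states.values() if role == "supplementary"]
    if any(s in ("DOWN", "TIMEOUT") for s in core):
        return "DOWN"
    if all(s == "UP" for s in core) and any(s in ("DOWN", "BOT_DETECTED") for s in supp):
        return "DEGRADED"
    if all(s == "UP" for s in core + supp):
        return "UP"
    return "DEGRADED"  # fallback for combinations the spec leaves open
```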
- Throttle: The renderer should only trigger if `new_data` is present AND at least 30 seconds have passed since the last build.
- Timeline Rendering: Render 240 spans (4 hours) of 1-minute buckets as small color-coded blocks:
  - `<span class="up"></span>` (Green)
  - `<span class="down"></span>` (Red)
  - `<span class="bot"></span>` (Orange/Yellow)
  - `<span class="unknown"></span>` (Gray)
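The `time_to_rebuild()` helper the main loop calls could be as small as this sketch (a module-level timestamp guarded by a 30-second gap; `time.monotonic` avoids wall-clock jumps):

```python
import time
from typing import Optional

_last_build: Optional[float] = None

def time_to_rebuild(min_gap_s: float = 30.0) -> bool:
    """Return True at most once per `min_gap_s` seconds.

    Pairs with the `new_data` flag in the main loop: a rebuild happens
    only when both conditions hold.
    """
    global _last_build
    now = time.monotonic()
    if _last_build is None or now - _last_build >= min_gap_s:
        _last_build = now
        return True
    return False
```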
```python
while True:
    # 1. Schedule checks
    scheduler.tick(job_queue)

    # 2. Process results (single-threaded writing)
    new_data = False
    while not results_queue.empty():
        res = results_queue.get()
        storage.append_csv(res)
        new_data = True

    # 3. Aggregation & rendering (throttled)
    if new_data and time_to_rebuild():
        aggregator.process_recent_data()
        renderer.build_static_site()
        new_data = False

    time.sleep(1)
```

- Check Interval: 300s.
- HTTP Timeout: 20s.
- History Window: 4 hours.
- Bucket Size: 1 minute.
- Worker Pool Size: 10 threads.
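These defaults could live in a top-level section of the YAML config; the key names below are assumptions, not part of the spec:

```yaml
defaults:
  interval_seconds: 300
  http_timeout_seconds: 20
  history_window_hours: 4
  bucket_minutes: 1
  worker_pool_size: 10
```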