Magellan simulates a team of experienced testers working autonomously on your WordPress plugin or theme. It was designed by seasoned QA engineers to translate real exploratory testing techniques — SBTM, PQIP, heuristics like SFDPOT and FEW HICCUPPS — into a team of AI agents that run without human supervision. You point it at a plugin, trigger the run, and wait for the report.
Unlike test automation, Magellan is non-deterministic: agents put themselves in the shoes of real users and explore the system the way a human tester would — following their curiosity, chasing anomalies, adapting to what they find. No test scripts to write, no selectors to maintain. It's not another automation project; it's closer to having a team of experienced testers on staff.
| Plugin / Theme | Mode | Problems | Report |
|---|---|---|---|
| WooCommerce (products) | Black box | 2 major · 1 minor | report |
| Desktop Mode v0.7.1 | Black box | 6 major · 6 minor | report |
| Desktop Mode v0.7.1 | Gray box (source inspected) | 8 major · 4 minor | report |
| MailPoet v5.24.0 | Gray box | 1 critical · 10 major · 9 minor | report |
| Jetpack v15.7.1 | Gray box | 7 major · 14 minor | report |
| Heim theme v1.3.1 | Gray box | 2 critical · 7 major · 9 minor | report |
Black box = agents explore the running site with no prior source code inspection, the same way an external tester would.
Gray box = agents inspect the source code first, then test — higher recall on architectural bugs like missing capability checks, lifecycle cleanup gaps, and scale-sensitive patterns.
These findings look promising but are unvalidated: they need someone with proper context on these plugins to confirm them — especially the Desktop Mode ones.
A downloadable product can be published with no download files. A merchant creates a "Downloadable" product, leaves the file upload section empty, and clicks Publish. WooCommerce accepts it silently — no warning, no validation error. Customers who buy the product get a purchase confirmation but nothing to download. Reproduced independently by the Verifier on a fresh site.
An External/Affiliate product can be published with no product URL. Same pattern: merchant selects "External/Affiliate" type, leaves the URL field blank, publishes. No error shown. On the frontend, the "Buy" button simply doesn't appear — the product is live but completely non-functional for customers.
Both of these are merchant-facing UX gaps where the cost of the mistake is a real customer complaint. Whether they're deliberate design decisions or genuine oversights is worth a second look by someone on the WooCommerce team.
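For illustration, the kind of type-aware publish validation that would close both gaps looks roughly like this, sketched in Python rather than WooCommerce's PHP, with hypothetical field names:

```python
# Hedged sketch in Python (WooCommerce itself is PHP). The field names
# download_files and external_url are hypothetical stand-ins.
REQUIRED_BY_TYPE = {
    "downloadable": ["download_files"],  # needs at least one file
    "external": ["external_url"],        # needs a product URL
}

def publish_blockers(product: dict) -> list[str]:
    """Fields that should block publishing for this product type."""
    required = REQUIRED_BY_TYPE.get(product.get("type", ""), [])
    return [field for field in required if not product.get(field)]

# publish_blockers({"type": "external", "external_url": ""})
# -> ["external_url"]  (show a validation error instead of publishing silently)
```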
No way to exit Desktop Mode from within the shell. Once a user enables Desktop Mode and enters the shell, there is no visible "Exit", "Disable", or "Switch to standard admin" option — not in the dock, not in OS Settings, not in the user menu. The only escape is knowing to visit /desktop-mode/ again or to look for an admin bar toggle that isn't prominently labeled. Confirmed by both the black box run and the gray box run, and independently reproduced by the Verifier.
No on-screen affordance to activate Desktop Mode. After installing and activating the plugin, a fresh admin sees nothing — no welcome notice, no dashboard widget, no admin bar button. The feature is entirely hidden unless the admin already knows the /desktop-mode/ URL or reads the plugin documentation. First-time users have no discovery path.
All JS-rendered UI strings are untranslated on non-English sites. The dock buttons, window labels, OS Settings strings, and toast notifications all stay in English regardless of the site language. The translation file exists on disk, but WordPress can never find it because the filename is built on the wrong text domain (wp-desktop instead of desktop-mode). This affects every locale and was confirmed independently in both runs.
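WordPress looks translations up by filenames keyed to the plugin's text domain, so a file named under the wrong domain is silently never loaded. A hypothetical diagnostic sketch:

```python
# Hedged sketch: flag translation files that can never load because the
# filename doesn't start with the plugin's text domain. The helper and
# the paths in the example are hypothetical.
from pathlib import Path

def orphaned_translations(languages_dir: str, text_domain: str) -> list[str]:
    return [
        p.name
        for p in Path(languages_dir).iterdir()
        if p.suffix in (".mo", ".json")  # PHP and JS translation files
        and not p.name.startswith(f"{text_domain}-")
    ]

# orphaned_translations("desktop-mode/languages", "desktop-mode")
# would flag e.g. "wp-desktop-fr_FR.mo": present on disk, never loaded.
```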
The session REST endpoint returns 401 for authenticated admins. GET /wp-json/desktop-mode/v1/session — the endpoint that restores a user's window layout — returns "Sorry, you are not allowed to do that" even when called by a logged-in admin. If this reproduces in production, every page reload would fail to restore the user's window positions and active desktop. Found in the black box run.
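One way to probe this from outside the browser is a direct authenticated request, here with a WordPress Application Password (URL and credentials are placeholders; note this exercises a different auth path than the cookie-plus-nonce flow the shell uses, so behavior may differ):

```python
# Hedged reproduction sketch. Site URL and credentials are placeholders;
# an Application Password authenticates differently than the browser's
# cookie + nonce flow, so results may differ from the reported 401.
import requests

resp = requests.get(
    "https://example.test/wp-json/desktop-mode/v1/session",
    auth=("admin", "xxxx xxxx xxxx xxxx xxxx xxxx"),  # Application Password
    timeout=10,
)
print(resp.status_code, resp.text)
# reported behavior: 401, "Sorry, you are not allowed to do that."
```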
The Recycle Bin silently stops emptying after 200 items. Clicking "Empty bin" on a bin holding more than 200 trashed items triggers a single query for the first 200 posts, deletes them, and reports success — leaving everything beyond 200 untouched. The user sees a success message and a non-empty bin. Confirmed by the gray box run and independently reproduced by the Verifier with 250 trashed items.
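The shape of the bug is a single capped query treated as the whole job; the fix is to keep querying until the bin is drained. A sketch with hypothetical helpers:

```python
# Hedged sketch of the pattern, not Desktop Mode's actual code;
# fetch_trashed() and delete() are hypothetical helpers.
BATCH = 200

def empty_bin_buggy(fetch_trashed, delete):
    for post_id in fetch_trashed(limit=BATCH):   # only the first 200
        delete(post_id)
    return "success"                             # items 201+ still trashed

def empty_bin_fixed(fetch_trashed, delete):
    while batch := fetch_trashed(limit=BATCH):   # loop until drained
        for post_id in batch:
            delete(post_id)
    return "success"
```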
These findings are promising but need validation from someone who knows these codebases. Some may be environment-specific (the Studio + SQLite setup has produced false positives before — that's exactly why the Verification phase exists). The Desktop Mode findings in particular would benefit from a review by someone on that team.
Verification (Phase 5.5) — Until now, every bug the agents filed went straight into the final report. The problem: headless browsers running on Studio's SQLite-backed sites can produce false positives that look identical to real bugs. Now, after the main wave finishes, a dedicated Verifier agent spins up a clean MySQL-backed site, walks through each Critical/Major bug step by step, and stamps it reproduced ✓ or refuted ✗. The final report only surfaces confirmed findings. The first real-world run against Desktop Mode v0.7.1 caught 2 false positives out of 10 — both would have reached a developer's inbox without this.
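In outline, the phase's control flow looks something like the sketch below. All names are hypothetical, and the real Verifier is an agent walking a UI, not a fixed script:

```python
# Hedged sketch of the Verification phase's control flow. All names are
# hypothetical; it also assumes minor findings skip verification.
def verification_phase(bugs, spin_up_mysql_site, reproduce):
    site = spin_up_mysql_site()                  # clean MySQL-backed site
    confirmed, refuted = [], []
    for bug in bugs:
        if bug["severity"] not in ("critical", "major"):
            confirmed.append(bug)                # assumption: minors pass through
        elif reproduce(site, bug["steps"]):      # walk the repro steps
            confirmed.append(bug)                # stamped reproduced
        else:
            refuted.append(bug)                  # stamped refuted, kept out
    return confirmed, refuted                    # report surfaces confirmed only
```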
Three new lenses — Lenses are cross-cutting testing personas that agents can wear on top of their normal exploration. A security-minded tester looks for different things than a usability-minded one, even on the same plugin. Three new lenses shipped: Security (CSRF, XSS, broken access control, SQL injection, capability check gaps), Usability (feedback gaps, error recovery, discoverability, first-use confusion), and Compatibility (responsiveness, theme conflicts, classic vs block editor, RTL layout). Each one auto-activates when static analysis detects signals that make it relevant — no manual configuration needed. Six accessibility checks were also woven into the baseline skill that every tester runs.
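Conceptually, auto-activation is a mapping from static-analysis signals to lenses. The signal names and the mapping below are illustrative guesses, not Magellan's actual rules:

```python
# Hedged sketch: signal -> lens auto-activation. Signal names and the
# mapping are illustrative, not Magellan's actual detection rules.
LENS_TRIGGERS = {
    "security":      {"handles_forms", "custom_rest_routes", "raw_sql"},
    "usability":     {"admin_screens", "multi_step_flows"},
    "compatibility": {"frontend_templates", "block_editor_integration"},
}

def active_lenses(detected: set[str]) -> list[str]:
    # Any overlapping signal activates the lens; no manual configuration.
    return [lens for lens, triggers in LENS_TRIGGERS.items() if triggers & detected]

# active_lenses({"custom_rest_routes", "admin_screens"})
# -> ["security", "usability"]
```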
Actionbook is now the default driver — The browser automation layer that agents use to control Chrome. The previous default (playwright-cli-headless) spun up a fresh browser process per charter, which caused session collisions when multiple charters ran concurrently. Actionbook uses a single shared daemon with named, isolated sessions — one per charter — so concurrent waves no longer step on each other, and sessions start faster since the daemon is already warm.
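Reduced to a sketch, the isolation model is one warm daemon holding named per-charter sessions. This is not Actionbook's real API, just the shape of the idea:

```python
# Hedged sketch of the isolation model, not Actionbook's real API: one
# warm daemon holds many named sessions, one per charter.
import threading

class Daemon:
    def __init__(self):
        self._sessions: dict[str, dict] = {}
        self._lock = threading.Lock()

    def session(self, name: str) -> dict:
        # Created on first use, reused afterwards. Each session keeps its
        # own cookies/storage, so concurrent charters never share state.
        with self._lock:
            return self._sessions.setdefault(
                name, {"cookies": {}, "storage": {}, "tabs": []}
            )

daemon = Daemon()  # started once, already warm when charters begin
for charter_id in ("CH-01", "CH-02", "CH-03"):
    session = daemon.session(f"charter-{charter_id}")  # isolated per charter
```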
OTel observability — Every run now emits structured OpenTelemetry spans as it executes: one per phase (recon, charter generation, delegation, aggregation, etc.) and one per charter. These are written to otel-spans.jsonl locally by default. If you wire up an OTLP endpoint in config/otel.json, they flow to whatever backend you have (Grafana, Jaeger, Honeycomb). Useful for tracking run durations, spotting slow phases, and correlating cost against coverage over time.
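For a sense of what that instrumentation looks like, here is a sketch using the OpenTelemetry Python SDK. The span names, attribute keys, and JSONL formatting are assumptions, not Magellan's actual code:

```python
# Hedged sketch with the OpenTelemetry Python SDK. Span names, attribute
# keys, and the JSONL formatting are assumptions about the instrumentation.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

out = open("otel-spans.jsonl", "a")
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(
    ConsoleSpanExporter(out=out, formatter=lambda s: s.to_json(indent=None) + "\n")
))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("magellan")

with tracer.start_as_current_span("phase.delegation"):        # one span per phase
    for charter_id in ("CH-01", "CH-02"):
        with tracer.start_as_current_span(
            "charter", attributes={"charter.id": charter_id}  # one span per charter
        ):
            pass  # ... run the charter ...
```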
~2,200 lines removed from agent context — Every line an agent loads costs tokens, and tokens cost money and add latency. A full audit stripped out operator-facing explanations, rationale prose, and worked examples that were written for humans reading the repo — not for agents running a charter. The tester-mindset skill (the core QA knowledge base every tester loads) was also split into a small core file plus 6 focused sub-skills that only load when the charter calls for them. A charter testing a settings form loads the forms sub-skill; one probing a bulk-delete operation loads the destructive-op sub-skill. Focused charters load roughly 40% less context per session.
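Mechanically, the split amounts to a charter-to-sub-skill lookup; the file names below are invented for illustration:

```python
# Hedged sketch of conditional sub-skill loading; file names are invented.
CORE = "skills/tester-mindset/core.md"
SUB_SKILLS = {
    "forms":          "skills/tester-mindset/forms.md",
    "destructive-op": "skills/tester-mindset/destructive-op.md",
    "scale":          "skills/tester-mindset/scale.md",
}

def context_for(charter_tags: set[str]) -> list[str]:
    # Every tester loads the small core; sub-skills load only on demand.
    return [CORE] + [p for tag, p in SUB_SKILLS.items() if tag in charter_tags]

# context_for({"forms"})
# -> ["skills/tester-mindset/core.md", "skills/tester-mindset/forms.md"]
```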
Planner & product improvement signals — A few fixes based on patterns observed across pilots. Breadth-coverage charters (the ones that make sure no feature goes completely untested) were sometimes being skipped when the planner rated them below the auto-dispatch threshold; that's now fixed, and they always make it into the wave. The PQIP classification rules were tightened so that ambiguous observations land in Questions (the explicit "needs human judgment" bucket) rather than being filed as low-confidence Problems that inflate bug counts. The charter-sizing rule that prevents agents from biting off more than they can finish was also generalized so it applies to all charter types, not just scale-sensitive ones.
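As a sketch, the breadth-coverage fix reduces to one extra clause in the dispatch gate (the names and threshold value here are illustrative):

```python
# Hedged sketch of the dispatch gate; names and threshold are illustrative.
def should_dispatch(charter_kind: str, planner_score: float,
                    threshold: float = 0.6) -> bool:
    # Breadth-coverage charters bypass the score gate entirely, so no
    # feature goes completely untested.
    return charter_kind == "breadth-coverage" or planner_score >= threshold
```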