Date: 2026-04-03
Prepared for: Customer Success
Collection: org_42caca5c-da0e-4c91-9bf3-d546266fd2e6_discovery-v1
Collection ID: c85d4618-b581-4413-a855-a4739125e705
Total chunks in Qdrant: 131,896
| # | Source | URL | Type | Status | Chunks in Qdrant | Notes |
|---|---|---|---|---|---|---|
| 1 | Discovery.org (ID section) | https://discovery.org/id | WordPress (subdirectory install) | Not ingested | 11 | Separate WP install with only 2 posts + 44 pages. Minimal unique content — most redirects to main site. Low priority. |
| 2 | Discovery.org (Culture) | https://www.discovery.org/c/intelligent-design/ | WordPress (category on main site) | Ingested | 33,077 (full site) | This is category #71 on the main discovery.org WP site. All discovery.org content is ingested nightly via automated cron, including articles in this category. |
| 3 | ScienceAndCulture.com | https://scienceandculture.com/ | WordPress | Ingested | 80,769 | Largest site by volume (15,070 posts). Automated nightly ingestion active. |
| 4 | Bio-Complexity.org | https://bio-complexity.org/ | Open Journal Systems (OJS) | Ingested (via PDF upload) | ~12,574 (est.) | Not a WordPress site — runs Open Journal Systems. All 54 journal articles + affiliate books ingested as PDFs via Reducto parsing. No additional web content to ingest. |
| 5 | IntelligentDesign.org | https://intelligentdesign.org/ | WordPress | Ingested | 248 | Small site (77 posts, 26 pages). ~90% content overlap with discovery.org. Automated nightly ingestion active. |
| 6 | IDTheFuture.com | https://idthefuture.com/ | WordPress | Ingested | 4,303 | Podcast site with 2,710 episodes stored as posts. Automated nightly ingestion active. |
| 7 | 60+ Discovery Institute Affiliate Books | PDF files provided | PDF → Reducto | Ingested | 13,505 (non-WP chunks) | 78 unique PDFs ingested via S3 upload + Reducto.ai parsing. Includes Bio-Complexity journal articles and affiliate books. |
- Status: Not ingested (not configured as a WordPress site in the collection)
- Investigation: This is a separate WordPress installation at the `/id` subdirectory with its own REST API at `/id/wp-json/wp/v2/`
- Content: Only 2 posts, 44 pages, 120 `gsm_block` entries (UI fragments, not content), 0 video-series
- Recommendation: Low value, since most content is navigational pages. The 2 actual posts could be ingested if needed, but the 11 chunks already present (from URL overlap) likely cover them. No action required unless the customer specifically requests it.
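The post and page counts above can be spot-checked against the subdirectory's REST API. The following is a minimal sketch, not the tooling used for this audit: `count_items` and its `opener` parameter are hypothetical names, and it assumes the endpoint is publicly readable and returns the standard WordPress `X-WP-Total` response header on collection requests.

```python
import urllib.request


def count_items(base_url, rest_base="posts", opener=urllib.request.urlopen):
    """Return the item total WordPress reports for one REST collection.

    base_url  -- site root, e.g. "https://discovery.org/id" (assumed reachable)
    rest_base -- REST route name such as "posts" or "pages"; custom post
                 types may expose a different route name (assumption)
    opener    -- injectable for testing; defaults to urllib's urlopen
    """
    # per_page=1 keeps the response small; the total is read from the header.
    url = f"{base_url.rstrip('/')}/wp-json/wp/v2/{rest_base}?per_page=1"
    with opener(url) as resp:
        return int(resp.headers["X-WP-Total"])
```

For example, `count_items("https://discovery.org/id", "pages")` would request `/id/wp-json/wp/v2/pages?per_page=1`. Some hosts reject the default `urllib` user agent, in which case a custom `opener` that sets request headers would be needed.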
- Status: Fully ingested as part of the main discovery.org WordPress site
- How it works: This URL is a category filter (category #71) on the main discovery.org site. All discovery.org content — including articles, posts, pages, videos, and books — is ingested nightly via the WordPress cron scheduler. Content in the "Intelligent Design" category is included automatically.
- Chunks: 33,077 total for all discovery.org content
- Post types ingested: `a` (articles), `posts`, `pages`, `v` (video), `b` (books)
- Status: Fully ingested with automated nightly cron
- Chunks: 80,769
- Post types: `posts`, `pages`
- Notes: Largest contributor to the collection by far (15,070 posts). Incremental ingestion runs nightly using `modified_after` to pick up new/updated content.
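To illustrate the incremental step, the sketch below builds the kind of request a nightly run could issue. It is an assumption-laden example, not the production ingester: `incremental_url` is a hypothetical helper, and the timestamp is sent as a naive UTC ISO 8601 string, which may need adjusting to match how the target site interprets `modified_after`. The `modified_after`, `orderby`, `order`, `per_page`, and `page` query parameters are standard WordPress REST API collection parameters.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode


def incremental_url(site, post_type, last_run, page=1):
    """Build a REST URL listing items modified since the previous run."""
    params = {
        # Only items changed after the last successful run.
        "modified_after": last_run.astimezone(timezone.utc).strftime(
            "%Y-%m-%dT%H:%M:%S"
        ),
        # Oldest changes first, so paging stays stable while fetching.
        "orderby": "modified",
        "order": "asc",
        "per_page": 100,  # the REST API caps page size at 100
        "page": page,
    }
    return f"{site.rstrip('/')}/wp-json/wp/v2/{post_type}?{urlencode(params)}"


if __name__ == "__main__":
    last_run = datetime(2026, 4, 2, 3, 0, 0, tzinfo=timezone.utc)
    print(incremental_url("https://scienceandculture.com", "posts", last_run))
```

A caller would keep incrementing `page` until the response is empty or the `X-WP-TotalPages` header is reached, then store the run's start time as the next `last_run`.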
- Status: Fully ingested via PDF upload
- Platform: Open Journal Systems (OJS), not WordPress — cannot use WP API ingestion
- Content: 54 peer-reviewed journal articles covering intelligent design biology research
- Ingestion method: PDFs uploaded to S3 → parsed by Reducto.ai → chunked and embedded
- Notes: No additional web-only content beyond what's in the PDFs. The site is a thin JS-rendered wrapper around the journal articles.
- Status: Fully ingested with automated nightly cron
- Chunks: 248
- Post types: `posts`, `pages`
- Notes: Small site (77 posts, 26 pages). ~90% content overlap with discovery.org due to WordPress Distributor syndication plugin. Both copies are ingested.
- Status: Fully ingested with automated nightly cron
- Chunks: 4,303
- Post types: `posts`, `pages`
- Notes: Podcast/media site. Episodes are stored as WordPress `posts` (2,710 items), not the `podcast` custom post type. ~80% overlap with discovery.org's `podcast` type via syndication.
- Status: Fully ingested
- Chunks: 13,505 (all non-WordPress content in the collection)
- Unique PDFs: 78 files
- Ingestion method: S3 upload → Reducto.ai PDF parsing → chunking → Qdrant
- Includes: Bio-Complexity journal articles + standalone books/publications
All 4 ingested WordPress sites run on automated nightly cron schedules:
- discovery.org — nightly incremental ingestion
- scienceandculture.com — nightly incremental ingestion
- intelligentdesign.org — nightly incremental ingestion
- idthefuture.com — nightly incremental ingestion
New/modified posts are automatically detected and ingested each night. A collection-level observability system tracks expected vs actual post counts per site and emits Sentry alerts for any discrepancies (implemented in PR #80, pending merge).
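The expected-vs-actual comparison can be pictured with a short sketch. This is not the code in PR #80: `SiteCount`, `find_discrepancies`, and `report` are hypothetical names, the `tolerance` parameter is an added assumption, and the alert is a plain callback so the example runs without external dependencies. In a real deployment the callback could be something like `sentry_sdk.capture_message`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteCount:
    site: str
    expected: int  # e.g. total reported by the site's REST API (assumption)
    actual: int    # e.g. distinct posts found in the collection (assumption)


def find_discrepancies(counts, tolerance=0):
    """Return the sites whose counts differ by more than `tolerance`."""
    return [c for c in counts if abs(c.expected - c.actual) > tolerance]


def report(counts, alert=print, tolerance=0):
    """Send one alert per mismatched site; return the number of alerts."""
    mismatched = find_discrepancies(counts, tolerance)
    for c in mismatched:
        alert(f"{c.site}: expected {c.expected} posts, found {c.actual}")
    return len(mismatched)


if __name__ == "__main__":
    # Illustrative numbers only; the second "actual" value is made up.
    report([
        SiteCount("scienceandculture.com", expected=15070, actual=15070),
        SiteCount("idthefuture.com", expected=2710, actual=2700),
    ])
```

Keeping the comparison separate from the alerting makes the threshold logic easy to unit-test and lets the alert channel change without touching the check.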
| Item | Priority | Status |
|---|---|---|
| Discovery.org /id section ingestion | Low | Not needed — minimal unique content (2 posts) |
| Content deduplication across syndicated sites | Medium | Known ~80-90% overlap between discovery.org and satellite sites. Currently both copies are ingested. |
| Nightly run observability | Done | PR #80 adds collection-level expected/actual tracking with Sentry alerts |
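
If the deduplication item is picked up, one possible approach is to fingerprint normalized post text and keep a single copy per fingerprint. The sketch below only illustrates that idea under stated assumptions: `content_fingerprint` and `dedupe` are hypothetical helpers, exact-match hashing would miss syndicated copies that were edited after distribution, and nothing here reflects how the collection is currently built.

```python
import hashlib
import re


def content_fingerprint(text):
    """Hash text after collapsing whitespace and lowercasing it."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedupe(docs):
    """Keep the first URL seen for each distinct fingerprint.

    docs -- iterable of (url, text) pairs, ordered so that the preferred
            source (for example the main site) comes first (assumption)
    """
    first_seen = {}
    for url, text in docs:
        # setdefault leaves the earlier URL in place for repeated content.
        first_seen.setdefault(content_fingerprint(text), url)
    return list(first_seen.values())
```

Ordering the input so the main site's documents come first would make the satellite-site copies the ones that get dropped.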
Generated by engineering team, 2026-04-03