@jordotech
Created April 3, 2026 15:11

Discovery Institute — Data Scope Status Report

Date: 2026-04-03
Prepared for: Customer Success
Collection: org_42caca5c-da0e-4c91-9bf3-d546266fd2e6_discovery-v1
Collection ID: c85d4618-b581-4413-a855-a4739125e705
Total chunks in Qdrant: 131,896


Status Summary

| # | Source | URL | Type | Status | Chunks in Qdrant | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Discovery.org (ID section) | https://discovery.org/id | WordPress (subdirectory install) | Not ingested | 11 | Separate WP install with only 2 posts + 44 pages. Minimal unique content; most of it redirects to the main site. Low priority. |
| 2 | Discovery.org (Culture) | https://www.discovery.org/c/intelligent-design/ | WordPress (category on main site) | Ingested | 33,077 (full site) | This is category #71 on the main discovery.org WP site. All discovery.org content is ingested nightly via automated cron, including articles in this category. |
| 3 | ScienceAndCulture.com | https://scienceandculture.com/ | WordPress | Ingested | 80,769 | Largest site by volume (15,070 posts). Automated nightly ingestion active. |
| 4 | Bio-Complexity.org | https://bio-complexity.org/ | Open Journal Systems (OJS) | Ingested (via PDF upload) | ~12,574 (est.) | Not a WordPress site; it runs Open Journal Systems. All 54 journal articles + affiliate books ingested as PDFs via Reducto parsing. No additional web content to ingest. |
| 5 | IntelligentDesign.org | https://intelligentdesign.org/ | WordPress | Ingested | 248 | Small site (77 posts, 26 pages). ~90% content overlap with discovery.org. Automated nightly ingestion active. |
| 6 | IDTheFuture.com | https://idthefuture.com/ | WordPress | Ingested | 4,303 | Podcast site with 2,710 episodes stored as posts. Automated nightly ingestion active. |
| 7 | 60+ Discovery Institute Affiliate Books | PDF files provided | PDF → Reducto | Ingested | 13,505 (non-WP chunks) | 78 unique PDFs ingested via S3 upload + Reducto.ai parsing. Includes Bio-Complexity journal articles and affiliate books. |
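
As a sanity check on the table, the per-source counts can be summed and compared with the collection total. The sketch below uses only figures from this report; the dictionary keys are illustrative labels. It assumes the 11 chunks listed for the /id section are already counted inside the discovery.org figure, and it treats the estimated Bio-Complexity count as part of the 13,505 non-WordPress chunks, as row 7 indicates.

```python
# Chunk counts copied from the status table in this report.
wordpress_chunks = {
    "discovery.org": 33_077,
    "scienceandculture.com": 80_769,
    "intelligentdesign.org": 248,
    "idthefuture.com": 4_303,
}
pdf_chunks = 13_505        # all non-WordPress content, incl. Bio-Complexity
reported_total = 131_896   # "Total chunks in Qdrant" from the report header

wordpress_total = sum(wordpress_chunks.values())
computed_total = wordpress_total + pdf_chunks
difference = computed_total - reported_total

print(f"WordPress chunks: {wordpress_total:,}")
print(f"Computed total:   {computed_total:,}")
print(f"Reported total:   {reported_total:,}")
print(f"Difference:       {difference:+,}")
```

The per-source figures sum to 131,902, six more than the reported 131,896. A gap this small would be consistent with the counts being sampled at slightly different times, but that is a guess; the report does not explain it.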

Detailed Notes

1. Discovery.org (ID section) — https://discovery.org/id

  • Status: Not ingested (not configured as a WordPress site in the collection)
  • Investigation: This is a separate WordPress installation at the /id subdirectory with its own REST API at /id/wp-json/wp/v2/
  • Content: Only 2 posts, 44 pages, 120 gsm_block entries (UI fragments, not content), 0 video-series
  • Recommendation: Low value, since most content is navigational pages. The 2 actual posts could be ingested if needed, but the 11 chunks already present (from URL overlap) likely cover them. No action required unless the customer specifically requests it.
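
Post and page counts like the ones above can be read without downloading any content, because WordPress REST collection responses report the total number of items in the `X-WP-Total` header. The following is a minimal sketch of that inventory check; the helper names `collection_url` and `total_items` are hypothetical, and the header values in the demo are hand-written to match this report rather than fetched live.

```python
from urllib.parse import urlencode, urljoin


def collection_url(api_root: str, rest_base: str, per_page: int = 1) -> str:
    """Build a collection URL such as .../wp/v2/posts?per_page=1."""
    base = urljoin(api_root.rstrip("/") + "/", rest_base)
    return base + "?" + urlencode({"per_page": per_page})


def total_items(headers: dict) -> int:
    """Read the item count WordPress reports in the X-WP-Total header."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {key.lower(): value for key, value in headers.items()}
    return int(lowered["x-wp-total"])


# REST root of the separate /id install described above.
api_root = "https://discovery.org/id/wp-json/wp/v2/"
print(collection_url(api_root, "posts"))

# Hand-written headers standing in for a real HTTP response.
print(total_items({"X-WP-Total": "2", "X-WP-TotalPages": "2"}))
```

In practice the headers would come from an HTTP response to that URL, for example `requests.get(url).headers`. Custom post types such as gsm_block are only reachable this way if the site exposes them over REST, which this report does not confirm.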

2. Discovery.org (Culture / Intelligent Design) — https://www.discovery.org/c/intelligent-design/

  • Status: Fully ingested as part of the main discovery.org WordPress site
  • How it works: This URL is a category filter (category #71) on the main discovery.org site. All discovery.org content — including articles, posts, pages, videos, and books — is ingested nightly via the WordPress cron scheduler. Content in the "Intelligent Design" category is included automatically.
  • Chunks: 33,077 total for all discovery.org content
  • Post types ingested: a (articles), posts, pages, v (video), b (books)

3. ScienceAndCulture.com

  • Status: Fully ingested with automated nightly cron
  • Chunks: 80,769
  • Post types: posts, pages
  • Notes: Largest contributor to the collection by far (15,070 posts). Incremental ingestion runs nightly using modified_after to pick up new/updated content.
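
The incremental run described above can be sketched as a paginated query against the WordPress REST API, which accepts a `modified_after` filter on post collections. This is an illustration under assumptions, not the production ingester: the function names, the injected `fetch_page` callable, and the page size are hypothetical, and timezone handling is left out.

```python
from datetime import datetime
from typing import Callable, Iterator, List, Tuple
from urllib.parse import urlencode

# fetch_page takes a URL and returns (items, total_pages); total_pages would
# normally be read from the X-WP-TotalPages response header.
FetchPage = Callable[[str], Tuple[List[dict], int]]


def incremental_url(site: str, rest_base: str, since: datetime,
                    page: int, per_page: int = 100) -> str:
    """Build a query for items modified after `since`, oldest change first."""
    params = {
        "modified_after": since.strftime("%Y-%m-%dT%H:%M:%S"),
        "orderby": "modified",
        "order": "asc",
        "per_page": per_page,
        "page": page,
    }
    return f"{site.rstrip('/')}/wp-json/wp/v2/{rest_base}?{urlencode(params)}"


def fetch_modified(site: str, rest_base: str, since: datetime,
                   fetch_page: FetchPage, per_page: int = 100) -> Iterator[dict]:
    """Yield every item modified after `since`, one page at a time."""
    page, total_pages = 1, 1
    while page <= total_pages:
        url = incremental_url(site, rest_base, since, page, per_page)
        items, total_pages = fetch_page(url)
        yield from items
        page += 1


def fake_fetch(url: str) -> Tuple[List[dict], int]:
    """Stand-in for an HTTP call so the sketch runs offline."""
    return [{"id": 1}, {"id": 2}], 1


since = datetime(2026, 4, 2)
print(incremental_url("https://scienceandculture.com", "posts", since, page=1))
print(list(fetch_modified("https://scienceandculture.com", "posts", since, fake_fetch)))
```

Driving the loop from the reported page count, rather than stopping on a short page, avoids requesting a page past the end of the collection, which WordPress rejects with an error.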

4. Bio-Complexity.org

  • Status: Fully ingested via PDF upload
  • Platform: Open Journal Systems (OJS), not WordPress — cannot use WP API ingestion
  • Content: 54 peer-reviewed journal articles covering intelligent design biology research
  • Ingestion method: PDFs uploaded to S3 → parsed by Reducto.ai → chunked and embedded
  • Notes: No additional web-only content beyond what's in the PDFs. The site is a thin JS-rendered wrapper around the journal articles.
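
The report does not say how parsed PDF text is split before embedding. As an illustration of the "chunked" step only, here is a generic fixed-size window with overlap; the function name and the default sizes are assumptions, and the Reducto and Qdrant calls are deliberately left out.

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list:
    """Split text into windows of `size` characters that overlap by `overlap`."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        # Stop once a window reaches the end, so no redundant tail is emitted.
        if start + size >= len(text):
            break
    return chunks


print(chunk_text("abcdefghij", size=4, overlap=1))
```

The overlap keeps a sentence that straddles a boundary present in both neighbouring chunks, at the cost of storing some text twice.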

5. IntelligentDesign.org

  • Status: Fully ingested with automated nightly cron
  • Chunks: 248
  • Post types: posts, pages
  • Notes: Small site (77 posts, 26 pages). ~90% of its content overlaps with discovery.org because of the WordPress Distributor syndication plugin. Both copies are ingested.
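
Because both copies of syndicated content are currently ingested, one way to measure the overlap is to fingerprint each document's normalised text and group URLs that share a fingerprint. This is a sketch of one possible approach, not something the pipeline is known to do; the function names and the example URLs and texts are made up.

```python
import hashlib
import re
from collections import defaultdict


def content_fingerprint(text: str) -> str:
    """Hash the text after collapsing whitespace and lowercasing."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def group_duplicates(docs: list) -> list:
    """Return lists of URLs whose normalised text is identical."""
    by_hash = defaultdict(list)
    for doc in docs:
        by_hash[content_fingerprint(doc["text"])].append(doc["url"])
    return [urls for urls in by_hash.values() if len(urls) > 1]


# Hypothetical documents; the URLs and texts are placeholders.
docs = [
    {"url": "https://www.discovery.org/a/example/", "text": "Same  article body."},
    {"url": "https://intelligentdesign.org/example/", "text": "same article body. "},
    {"url": "https://idthefuture.com/example-episode/", "text": "A different episode."},
]
print(group_duplicates(docs))
```

Exact hashing only catches copies whose body text is identical after normalisation. Syndicated copies that differ in bylines or boilerplate would need a near-duplicate technique instead.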

6. IDTheFuture.com

  • Status: Fully ingested with automated nightly cron
  • Chunks: 4,303
  • Post types: posts, pages
  • Notes: Podcast/media site. Episodes are stored as regular WordPress posts (2,710 items) rather than as the podcast custom post type. About 80% of the content overlaps with discovery.org's podcast post type via syndication.

7. 60+ Discovery Institute Affiliate Books (PDFs)

  • Status: Fully ingested
  • Chunks: 13,505 (all non-WordPress content in the collection)
  • Unique PDFs: 78 files
  • Ingestion method: S3 upload → Reducto.ai PDF parsing → chunking → Qdrant
  • Includes: Bio-Complexity journal articles + standalone books/publications

Automated Ingestion Schedule

All 4 WordPress sites run on automated nightly cron schedules:

  • discovery.org — nightly incremental ingestion
  • scienceandculture.com — nightly incremental ingestion
  • intelligentdesign.org — nightly incremental ingestion
  • idthefuture.com — nightly incremental ingestion

New and modified posts are automatically detected and ingested each night. A collection-level observability system tracks expected versus actual post counts per site and emits Sentry alerts for any discrepancy (implemented in PR #80, pending merge).
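
The expected-versus-actual check can be pictured as follows. This is a minimal sketch, not the code in PR #80: the `SiteCount` type, the function names, and the tolerance parameter are hypothetical, and the alert is passed in as a callable so the example runs without Sentry installed.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SiteCount:
    site: str
    expected: int  # e.g. post count reported by the WordPress API
    actual: int    # e.g. distinct posts found in the collection


def find_discrepancies(counts: List[SiteCount], tolerance: int = 0) -> List[SiteCount]:
    """Return the sites whose counts differ by more than `tolerance`."""
    return [c for c in counts if abs(c.expected - c.actual) > tolerance]


def report(counts: List[SiteCount], alert: Callable[[str], None],
           tolerance: int = 0) -> int:
    """Send one alert per mismatching site and return how many were sent."""
    mismatches = find_discrepancies(counts, tolerance)
    for c in mismatches:
        alert(f"{c.site}: expected {c.expected} posts, found {c.actual}")
    return len(mismatches)


counts = [
    SiteCount("scienceandculture.com", expected=15_070, actual=15_070),
    SiteCount("intelligentdesign.org", expected=77, actual=75),  # made-up gap
]
report(counts, alert=print)
```

In a deployment that uses the Sentry Python SDK, `sentry_sdk.capture_message` could be passed as the `alert` callable.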


Open Items

| Item | Priority | Status |
| --- | --- | --- |
| Discovery.org /id section ingestion | Low | Not needed: minimal unique content (2 posts) |
| Content deduplication across syndicated sites | Medium | Known ~80-90% overlap between discovery.org and satellite sites. Currently both copies are ingested. |
| Nightly run observability | Done | PR #80 adds collection-level expected/actual tracking with Sentry alerts |

Generated by engineering team, 2026-04-03
