jordotech · April 3, 2026 15:11
diff --git a/discovery-data-scope-status.md b/discovery-data-scope-status.md
| # | Source | URL | Type | Status | Chunks in Qdrant | Notes |
|---|--------|-----|------|--------|------------------|-------|
| 1 | Discovery.org (ID section) | https://discovery.org/id | WordPress (subdirectory install) | Not ingested | 11 | Separate WP install with only 2 posts + 44 pages. Minimal unique content; most of it redirects to the main site. Low priority. |
| 2 | Discovery.org (Culture) | https://www.discovery.org/c/intelligent-design/ | WordPress (category on main site) | Ingested | 33,077 (full site) | This is category #71 on the main discovery.org WP site. All discovery.org content is ingested nightly by an automated cron job, including articles in this category. |
| 3 | ScienceAndCulture.com | https://scienceandculture.com/ | WordPress | Ingested | 80,769 | Largest site by volume (15,070 posts). Automated nightly ingestion active. |
| 4 | Bio-Complexity.org | https://bio-complexity.org/ | Open Journal Systems (OJS) | Ingested (via PDF upload) | ~12,574 (est.) | Not a WordPress site; it runs Open Journal Systems. All 54 journal articles + affiliate books ingested as PDFs via Reducto parsing. No additional web content to ingest. |
| 5 | IntelligentDesign.org | https://intelligentdesign.org/ | WordPress | Ingested | 248 | Small site (77 posts, 26 pages). ~90% content overlap with discovery.org. Automated nightly ingestion active. |
| 6 | IDTheFuture.com | https://idthefuture.com/ | WordPress | Ingested | 4,303 | Podcast site with 2,710 episodes stored as posts. Automated nightly ingestion active. |
| 7 | 60+ Discovery Institute Affiliate Books | PDF files provided | PDF → Reducto | Ingested | 13,505 (non-WP chunks) | 78 unique PDFs ingested via S3 upload + Reducto.ai parsing. Includes Bio-Complexity journal articles and affiliate books. |
| Item | Priority | Status |
|------|----------|--------|
| Discovery.org /id section ingestion | Low | Not needed: minimal unique content (2 posts) |
| Content deduplication across syndicated sites | Medium | Known ~80-90% overlap between discovery.org and satellite sites. Currently both copies are ingested. |
| Nightly run observability | Done | PR #80 adds collection-level expected/actual tracking with Sentry alerts |
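
For the open deduplication item, one possible starting point is exact-match fingerprinting of chunk text. The sketch below is illustrative only and does not describe the current pipeline: the `site`, `text`, and `also_on` field names and both helper functions are hypothetical, and it works on plain dictionaries rather than on Qdrant. It only catches copies that are identical after whitespace and case normalization, so syndicated articles with edited text would need a near-duplicate technique instead.

```python
import hashlib
import re
from typing import Iterable


def content_fingerprint(text: str) -> str:
    """Return a SHA-256 hash of the text after collapsing whitespace and lowercasing."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedupe_chunks(chunks: Iterable[dict]) -> list[dict]:
    """Keep the first chunk seen per fingerprint and note which other sites carried it.

    Each input chunk is assumed (hypothetically) to have "site" and "text" keys.
    """
    kept: dict[str, dict] = {}
    for chunk in chunks:
        fp = content_fingerprint(chunk["text"])
        if fp in kept:
            # Duplicate text: record the extra source instead of storing a second copy.
            kept[fp]["also_on"].append(chunk["site"])
        else:
            kept[fp] = {**chunk, "also_on": []}
    return list(kept.values())


# Made-up sample data: the first two entries differ only in case and spacing.
sample = [
    {"site": "discovery.org", "text": "Design  in nature."},
    {"site": "intelligentdesign.org", "text": "design in nature."},
    {"site": "idthefuture.com", "text": "A podcast episode."},
]
unique = dedupe_chunks(sample)
```

Processing discovery.org first makes it the canonical copy, with satellite sites listed under `also_on`; whether that ordering is the right policy is a decision for the pipeline owners.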