Low Memory Handling in Canvas LMS

This document describes how Canvas behaves when the host or container is under memory pressure, what mechanisms exist to bound memory usage, and the configuration recommended for a production deployment.

TL;DR

Canvas does not gracefully recover from out-of-memory conditions. It does not rescue NoMemoryError anywhere in the Ruby code. Instead, the application takes a defensive containment approach:

  • Risky operations are wrapped in a per-process rlimit so a runaway block fails fast instead of taking the host down.
  • Background-job workers can be configured to self-recycle on RSS growth or job count, returning memory to the OS between jobs.
  • The web tier relies on the container orchestrator (Kubernetes / Docker / systemd) to OOM-kill and restart processes as needed.

If memory is exhausted, the in-flight request or job fails. The process or worker is then either recycled by Canvas itself (jobs) or by the orchestrator (web).

Mechanisms

1. MemoryLimit.apply — preemptive process-level cap

File: lib/memory_limit.rb

MemoryLimit.apply(2.gigabytes) do
  # work that should not be allowed unbounded memory
end

Implementation: calls Process.setrlimit(:DATA, allowed, max) around the block and restores the prior limit on exit. The cap applies to the entire process, not just the block or thread, for the duration of the block. If the OS rejects the limit (Errno::EINVAL), the block runs without the cap and a warning is logged.

When the cap is hit, Ruby raises NoMemoryError. Canvas does not rescue it, so the request or job fails. The point of the guard is to prevent a single runaway operation from consuming all process memory, not to recover.
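
For illustration, the guard pattern looks roughly like this. It is a simplified sketch, not the Canvas source, and the with_memory_limit name is hypothetical:

def with_memory_limit(bytes)
  # Remember the current soft/hard limits so they can be restored afterwards
  soft, hard = Process.getrlimit(:DATA)
  begin
    Process.setrlimit(:DATA, bytes, hard)
  rescue Errno::EINVAL
    # The OS rejected the limit: warn and run the block uncapped
    Rails.logger.warn("setrlimit rejected; running without a memory cap")
    return yield
  end
  begin
    yield # a NoMemoryError raised here propagates to the caller
  ensure
    Process.setrlimit(:DATA, soft, hard)
  end
end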

Current callers:

| File | Why it is guarded |
| --- | --- |
| app/services/file_text_extraction_service.rb:38 | PDF / Office document text extraction |
| app/services/rubric_llm_service.rb | LLM rubric generation |
| app/controllers/services_api_controller.rb | Services API endpoints |
| app/controllers/application_controller.rb | Selected controller actions |
| lib/cuty_capt.rb | Headless screenshot / capture |

2. Background-job worker recycling

Files: config/delayed_jobs.yml.example, config/initializers/delayed_job.rb

Canvas uses inst-jobs, a fork of Delayed::Job. Two settings control worker recycling:

| Setting | Purpose |
| --- | --- |
| worker_max_memory_usage | The worker exits cleanly between jobs if its RSS exceeds the byte threshold; the pool respawns a new worker. |
| worker_max_job_count | The worker exits after processing N jobs, regardless of memory. |

Both are commented out in the example file. Workers are never killed mid-job by these mechanisms — they only check at job boundaries.
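
Conceptually, the boundary check amounts to something like this sketch (an illustration of the idea, not the inst-jobs source; the method name is hypothetical):

def recycle_if_over_limit(max_bytes)
  return unless max_bytes
  # RSS in bytes: field 2 of /proc/self/statm is resident pages (4 KiB each on most Linux systems)
  rss_bytes = File.read("/proc/self/statm").split[1].to_i * 4096
  # Exit with a clean status between jobs; the pool respawns a fresh worker
  exit 0 if rss_bytes > max_bytes
end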

The :perform lifecycle callback samples memory before and after each job and emits a [STAT] log line:

[STAT] <start_kb> <end_kb> <delta_kb> <user_cpu> <system_cpu>

This gives you per-job memory deltas in the logs but does not, by itself, trigger any action.
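
If you want to act on these lines, a small log scan can flag leaky jobs. A hypothetical example, assuming the five-field format above:

# Print [STAT] lines whose per-job memory delta exceeds a threshold
THRESHOLD_KB = 100_000

ARGF.each_line do |line|
  next unless line.include?("[STAT]")
  # Fields after the tag: start_kb end_kb delta_kb user_cpu system_cpu
  delta_kb = line.split("[STAT]").last.split[2].to_i
  puts line if delta_kb > THRESHOLD_KB
end

Run it against your job logs (for example, ruby stat_scan.rb delayed_job.log; the filename is an example).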

Delayed::Settings.max_attempts is set to 1, so a job that crashes its worker mid-execution (for example, via an OS OOM-kill) is dropped rather than retried; keep this in mind when sizing memory.

3. Canvas.sample_memory — RSS measurement

File: lib/base/canvas.rb:200

On Linux, reads /proc/<pid>/statm and converts pages to KB. Falls back to ps -o rss= on other Unixes. Used by the job lifecycle logger and by tests. Does not include swapped-out memory.
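
A simplified sketch of the technique, assuming 4 KiB pages:

def sample_memory(pid = Process.pid)
  # Field 2 of /proc/<pid>/statm is the resident set size in pages
  File.read("/proc/#{pid}/statm").split[1].to_i * 4 # KB, assuming 4 KiB pages
rescue Errno::ENOENT
  # Non-Linux fallback: ps reports RSS in KB directly
  `ps -o rss= -p #{pid}`.to_i
end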

4. Web tier (Puma)

File: config/puma.rb

threads 0, 1
if ENV["RAILS_ENV"] == "production"
  preload_app! false
  worker_boot_timeout 240
end

There is no puma_worker_killer or unicorn-worker-killer gem in the Gemfile, and no application-level memory monitor. Memory pressure on the web tier is delegated entirely to the container orchestrator's OOM handling.

5. Health checks

File: lib/health_checks.rb

Component readiness and liveness checks are timeout-based, not memory-based. There is no liveness probe that fails when RSS or available memory crosses a threshold. If you want Kubernetes / your orchestrator to recycle a hot pod before it OOMs, you must configure that externally.
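
If you want one anyway, a minimal memory-aware liveness endpoint could look like this (a hypothetical controller and threshold, not part of Canvas):

class MemoryHealthController < ApplicationController
  RSS_LIMIT_KB = 3_500_000 # example threshold; size it below the pod memory limit

  def live
    # Same /proc-based sampling technique as Canvas.sample_memory (Linux, 4 KiB pages)
    rss_kb = File.read("/proc/self/statm").split[1].to_i * 4
    head(rss_kb > RSS_LIMIT_KB ? :service_unavailable : :ok)
  end
end

Pointing a livenessProbe at an endpoint like this lets the orchestrator recycle a pod before the kernel OOM-killer fires mid-request.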

6. Frontend

No application-level handling for QuotaExceededError, navigator.deviceMemory, or performance.memory was found in ui/ or packages/. Browser behavior under memory pressure (tab discard, page crash) is whatever the browser does by default.

What happens under low memory

| Tier | Behavior |
| --- | --- |
| Web request inside MemoryLimit.apply | NoMemoryError raised; not rescued; the request returns 500. |
| Web request outside the guard | The whole process may be OOM-killed by the OS / orchestrator; the pod restarts. In-flight requests on that worker fail. |
| Job worker with worker_max_memory_usage set | Worker exits cleanly between jobs; the pool respawns it. No job loss. |
| Job worker without recycling configured | OS OOM-kills mid-job; the job is not retried (max_attempts = 1). |
| Browser tab | No app-level handling; the tab may be discarded or crash. |

Recommended configuration

Background jobs (config/delayed_jobs.yml)

Enable both recycling controls. Without them, a single leaky job can drag a worker into mid-job OOM, which silently drops work.

production:
  workers:
    - queue: canvas_queue
      workers: 2
      max_priority: 10
    - queue: canvas_queue
      workers: 4

  # Recycle a worker after processing this many jobs. Returns
  # fragmented heap memory to the OS. Tune to your job mix; 20-100
  # is a reasonable starting range.
  worker_max_job_count: 50

  # Recycle a worker if its RSS exceeds this many bytes between
  # jobs. 1 GiB shown; size to (pod_memory_limit / workers_per_pod)
  # with headroom for the parent pool process.
  worker_max_memory_usage: 1073741824

Sizing guideline: set worker_max_memory_usage to about 70–80% of the per-worker share of your container's memory limit. For example, a 6 GiB pod running 4 workers gives each worker a share of roughly 1.5 GiB, so a cap around 1.1–1.2 GiB is a reasonable starting point. The threshold needs enough headroom that a worker can still allocate to finish its current job after crossing the line, because recycling happens between jobs, not immediately.

Web tier

Canvas does not ship a per-worker memory ceiling for Puma. Options, in order of preference:

  1. Set a Kubernetes / container memory limit sized to your peak working set with headroom. Combine with livenessProbe and readinessProbe pointing at the existing health endpoints. The orchestrator will OOM-kill and restart, and load balancing will route around the restart.
  2. Add puma_worker_killer if you want application-level recycling on memory growth without relying on the orchestrator. It is not currently in the Gemfile and would be a new dependency, so evaluate it carefully; a configuration sketch follows this list.
  3. Wrap additional risky controller paths in MemoryLimit.apply. This is the cheapest mitigation for a known offender (large file processing, big exports, expensive report generation) and matches the pattern Canvas already uses.
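
For option 2, a typical puma_worker_killer setup looks like the following. Values are illustrative, and the gem requires Puma cluster mode (multiple workers):

# config/puma.rb additions (illustrative; this gem is not in the Canvas Gemfile)
before_fork do
  require "puma_worker_killer"
  PumaWorkerKiller.config do |config|
    config.ram           = 4096 # total memory budget for Puma, in MB
    config.frequency     = 20   # seconds between checks
    config.percent_usage = 0.90 # cull the largest worker above 90% of ram
  end
  PumaWorkerKiller.start
end

Note that percent_usage is relative to config.ram, so set ram to the pod's actual memory limit.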

Operational

  • Alert on the [STAT] log lines or scrape RSS from /proc to spot leaky jobs early.
  • Treat any NoMemoryError in error reporting (Sentry / equivalent) as a capacity signal, not a code bug — it usually means a guard fired.
  • If you raise MemoryLimit.apply caps to make a feature work, prefer fixing the underlying allocation pattern. The cap is a containment device, not a budget.

Gaps worth knowing about

  • No rescue NoMemoryError anywhere — there is no graceful degradation path, only fail-fast and recycle.
  • No memory-aware health check. A pod can be near OOM and still report healthy.
  • The example delayed_jobs.yml ships with both recycling controls commented out, so a deployment that copies the example without editing it has no protection on the job tier.
  • max_attempts = 1 plus mid-job OOM means lost work. Recycling is the primary defense.
  • No frontend handling for browser memory pressure.