@jmchilton
Created March 5, 2026 16:29
Subcollection Mapping & DCE Modeling: Problem, Goal & Implementation Plan

Implementation Plan: Subcollection Mapping & DCE Modeling

Phase 1: Schema Modeling (parameters.py)

1a. Add map_over_type to BatchDataInstance and BatchDataInstanceInternal

Currently BatchDataInstance (line 534) and BatchDataInstanceInternal (line 883) are simple {src, id} models. Add map_over_type: Optional[str] = None to both. Use Optional[str] — consistent with how collection_type is modeled elsewhere.

This is the core request-layer gap — map_over_type is how clients express subcollection mapping intent in batch values, but the schema doesn't model it.

Files: lib/galaxy/tool_util_models/parameters.py
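The intended shape of the change can be sketched as follows. This is a plain-dataclass stand-in — the real models in parameters.py are Pydantic — and the `src` literal values shown are illustrative, not a claim about the exact union members on dev:

```python
from dataclasses import dataclass
from typing import Literal, Optional

@dataclass
class BatchDataInstance:
    """External request layer: encoded string IDs."""
    src: Literal["hda", "ldda", "hdca"]
    id: str
    map_over_type: Optional[str] = None  # e.g. "paired" or "list:paired"

@dataclass
class BatchDataInstanceInternal:
    """Internal layer: decoded integer IDs."""
    src: Literal["hda", "ldda", "hdca"]
    id: int
    map_over_type: Optional[str] = None

# Because map_over_type defaults to None, existing payloads without it stay valid
value = BatchDataInstance(src="hdca", id="abcdabcd", map_over_type="paired")
```

Defaulting to `None` keeps the addition backward compatible: batch values that never mention `map_over_type` validate exactly as before.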

1b. Add DCE to Internal Representations Only

DCE is backend-produced during batch expansion — it does NOT belong in the external request layer.

Add DataRequestInternalDce with src: Literal["dce"], id: StrictInt (if not already present).

Add "dce" to internal-only types:

  • DataRequestInternalDereferencedT union — add DatasetCollectionElementReference (already exists at parameters.py:1067) to cover job_internal DCE refs produced by subcollection mapping expansion
  • Verify MultiDataInstanceInternal and MultiDataInstanceInternalDereferenced unions include DataRequestInternalDce
  • Do NOT add DataRequestDce to the external _DataRequest union or BatchDataInstance.src
  • Do NOT add "dce" to BatchDataInstanceInternal.src — batch expansion happens after request_internal, so DCE never appears in Batch values at that layer

Files: lib/galaxy/tool_util_models/parameters.py
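A minimal sketch of the internal-only DCE type and its union membership, again using dataclasses as stand-ins for the Pydantic models (the real union has more members, e.g. ldda, and the real `id` is `StrictInt`; `parse_internal_data_request` is a hypothetical helper standing in for Pydantic's discriminated-union dispatch):

```python
from dataclasses import dataclass
from typing import Literal, Union

@dataclass
class DataRequestInternalHda:
    src: Literal["hda"]
    id: int

@dataclass
class DataRequestInternalDce:
    src: Literal["dce"]
    id: int

# "dce" joins internal-only unions such as DataRequestInternalDereferencedT;
# the external _DataRequest union is deliberately left untouched.
DataRequestInternalDereferencedT = Union[DataRequestInternalHda, DataRequestInternalDce]

def parse_internal_data_request(value: dict) -> DataRequestInternalDereferencedT:
    # Dispatch on the src discriminator, as a discriminated union would
    if value["src"] == "dce":
        return DataRequestInternalDce(src="dce", id=value["id"])
    if value["src"] == "hda":
        return DataRequestInternalHda(src="hda", id=value["id"])
    raise ValueError(f"src not in this sketch's union: {value['src']}")
```

The key point the sketch encodes: `DataRequestInternalDce` exists only on the internal side, so an external payload with `src: "dce"` has no model to validate against and is rejected at the request layer.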

1c. Verify Conversion Functions Handle DCE

The encode() and decode() functions in convert.py work with generic src_dict format. Verify they handle src: "dce" in internal representations without special-casing. The dereference() function may need DCE handling if a dereference step encounters stored DCE refs.

Fix runtimeify in convert.py (line 548) — currently hardcodes DataRequestInternalHda(**value), which breaks on DCE src dicts. Needs to dispatch on src and handle DCE → dataset resolution.

Files: lib/galaxy/tool_util/parameters/convert.py
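The dispatch runtimeify needs might look like the following sketch. `runtimeify_src_dict` and `resolve_dce_to_dataset` are hypothetical names, not the actual convert.py API; the point is only the shape of the `src`-based branching that replaces the hardcoded constructor:

```python
def runtimeify_src_dict(value: dict, resolve_dce_to_dataset) -> dict:
    src = value["src"]
    if src == "dce":
        # A stored DCE ref must be resolved to the dataset (or child
        # collection) it wraps before the tool can consume it
        return resolve_dce_to_dataset(value["id"])
    if src in ("hda", "ldda"):
        # Previously the only case — effectively DataRequestInternalHda(**value)
        return {"src": src, "id": value["id"]}
    raise ValueError(f"Unhandled src type: {src}")
```

Anything that currently assumes every src dict is an HDA breaks the moment DCE refs flow through the internal layers, which is exactly what subcollection-mapping expansion introduces.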

1d. Run Existing Unit Tests (Sanity Check)

PYTHONPATH=lib python -m pytest test/unit/tool_util/test_parameter_specification.py -x --timeout=60

Existing tests should still pass — we're only adding new fields/types, not changing existing validation.


Phase 2: Parameter Specification Tests (parameter_specification.yml)

2a. Add map_over_type Batch Specs to gx_data (Request Layer)

Add test cases to gx_data entry. These validate the client-facing schema:

# request_valid additions — map_over_type on batch values:
- parameter: {__class__: "Batch", values: [{src: hdca, id: abcdabcd, map_over_type: paired}]}
- parameter: {__class__: "Batch", values: [{src: hdca, id: abcdabcd, map_over_type: "list:paired"}]}
# map_over_type: null should also be valid (no subcollection mapping)
- parameter: {__class__: "Batch", values: [{src: hdca, id: abcdabcd, map_over_type: null}]}

# landing_request_valid additions — landing pages can pre-fill batch params with map_over_type:
- parameter: {__class__: "Batch", values: [{src: hdca, id: abcdabcd, map_over_type: paired}]}

# request_invalid additions — dce should NOT be valid in external request:
- parameter: {__class__: "Batch", values: [{src: dce, id: abcdabcd}]}
- parameter: {src: dce, id: abcdabcd}

2b. Add Internal Batch Specs to gx_data

These validate post-decode representations where map_over_type carries through:

# request_internal_valid additions:
- parameter: {__class__: "Batch", values: [{src: hdca, id: 5, map_over_type: paired}]}

# request_internal_dereferenced_valid additions:
- parameter: {__class__: "Batch", values: [{src: hdca, id: 5, map_over_type: paired}]}

DCE does NOT belong in Batch values at request_internal — batch expansion hasn't happened yet, and reruns reconstruct HDCA refs via build_for_rerun.

2c. Add DCE to job_internal Specs for gx_data

After expansion, individual job params contain DCE refs (not wrapped in Batch — Batch is expanded away by this layer). Subcollection mapping over gx_data produces {"src": "dce", "id": <int>} via to_decoded_json — each expanded job gets a DCE representing one subcollection element whose child_collection contains the datasets the tool will process.

# job_internal_valid additions — subcollection mapping produces DCE refs:
- parameter: {src: dce, id: 5}

# job_internal_invalid — DCE with encoded ID should fail:
- parameter: {src: dce, id: abcdabcd}

The current job_internal schema for gx_data only allows src: "hda" or src: "ldda" (DataRequestInternalDereferencedT). Must add DatasetCollectionElementReference to the union.

2d. Run Specification Tests (Red→Green)

PYTHONPATH=lib python -m pytest test/unit/tool_util/test_parameter_specification.py -x --timeout=60

Write specs first (red), then fix any model issues (green).

Files: test/unit/tool_util/parameter_specification.yml


Phase 3: Async Expansion Fix (meta.py)

3a. Add DCE Support to __expand_collection_parameter_async

Currently (meta.py:472) the async path rejects src != "hdca". Change to accept "dce" and resolve DatasetCollectionElement → child collection, matching the sync path.

This matters for job reruns where stored job state contains DCE refs from a previous expansion.

if src not in ("hdca", "dce"):
    raise exceptions.ToolMetaParameterException(...)
if src == "dce":
    # Stored DCE ref (e.g. from a prior expansion on job rerun):
    # map over the element's child collection, matching the sync path
    item = app.model.context.get(DatasetCollectionElement, item_id)
    collection = item.child_collection
else:
    item = app.model.context.get(HistoryDatasetCollectionAssociation, item_id)
    collection = item.collection

Files: lib/galaxy/tools/parameters/meta.py


Phase 4: API Execution Tests (test_tool_execute.py)

4a. Refactor test_map_over_with_nested_paired_output_format_actions to Fluent API

The existing test_map_over_with_nested_paired_output_format_actions uses a manual dict. Refactor it to use the tool_input_format fixture (runs 3x: flat, nested, request) so it gains request-format coverage with map_over_type.

The request-format callback produces {__class__: "Batch", values: [{src: "hdca", id: ..., map_over_type: "paired"}]}. Need to check if DescribeToolInputs supports this or if we need to extend the fluent API.

4b. Add Simple Subcollection Mapping Test (cat1 over list:paired)

Migrate test_simple_subcollection_mapping from test_tools.py to test_tool_execute.py with request format coverage:

@requires_tool_id("cat1")
def test_simple_subcollection_mapping(
    target_history: TargetHistory,
    required_tool: RequiredTool,
    tool_input_format: DescribeToolInputs,
):
    hdca = target_history.with_example_list_of_pairs()
    # legacy/nested: {"f1": {"batch": True, "values": [{"src": "hdca", "map_over_type": "paired", "id": hdca_id}]}}
    # request: {"f1": {"__class__": "Batch", "values": [{"src": "hdca", "id": hdca_id, "map_over_type": "paired"}]}}
    ...

4c. Add paired_or_unpaired Subcollection Mapping with Request Format

Refactor existing test_map_over_paired_or_unpaired_with_list_paired to use tool_input_format fixture so it covers all 3 input formats including request.

4d. Check Fluent API Support

Review DescribeToolInputs in populators.py to see if .when.request() callbacks can produce batch inputs with map_over_type. If not, extend the fluent API. May need a helper like:

def batch_with_map_over(hdca, map_over_type):
    return {"__class__": "Batch", "values": [{**hdca.src_dict, "map_over_type": map_over_type}]}

Phase 5: Run Full Test Suite

5a. Unit Tests

PYTHONPATH=lib python -m pytest test/unit/tool_util/test_parameter_specification.py -x

5b. API Tests (new tests only, quick check)

./run_tests.sh -api lib/galaxy_test/api/test_tool_execute.py -k "subcollection or dce or map_over"

5c. API Tests (full tool execute suite, regression)

./run_tests.sh -api lib/galaxy_test/api/test_tool_execute.py

Implementation Order

| Step | Phase | Description | Test First? |
|------|-------|-------------|-------------|
| 1 | 2a-2b | Write parameter specification tests for map_over_type (expect failures) | Yes (red) |
| 2 | 1a | Add map_over_type to BatchDataInstance/BatchDataInstanceInternal | Green |
| 3 | 2d | Verify spec tests pass | Green check |
| 4 | 1b-1c | Add DCE to internal representations, fix runtimeify in convert.py | Implementation |
| 5 | 2c | Write job_internal spec tests for DCE (red→green) | Red→Green |
| 6 | 4a-4d | Write API execution tests (expect failures for request format) | Yes (red) |
| 7 | 3a | Fix async expansion for DCE | Green |
| 8 | 4d | Extend fluent API if needed | Green |
| 9 | 5a-5c | Full test runs | Regression |

Subcollection Mapping & DCE Modeling: Problem & Goal

Context

PR #21842 (guerler's /api/jobs modernization) exposed gaps in the structured tool state modeling around subcollection mapping (map_over_type). These features work in the legacy tool execution path but were not fully modeled or tested in the new request schema system.

Key Concept: Where DCE Lives in the Pipeline

Understanding the execution pipeline is critical to scoping this work correctly:

Client Request (request layer)
  → {"input": {"__class__": "Batch", "values": [{src: "hdca", id: "abc", map_over_type: "paired"}]}}

ID Decode (request → request_internal)
  → {"input": {"__class__": "Batch", "values": [{src: "hdca", id: 5, map_over_type: "paired"}]}}

Batch Expansion (meta.py — __expand_collection_parameter)
  → Splits list:paired HDCA into paired DatasetCollectionElement objects
  → to_decoded_json() serializes these as {"src": "dce", "id": <int>}

Job Internal (job_internal layer)
  → Parameters stored with src:"dce" refs pointing to specific subcollection elements

DCE (src: "dce") is backend-produced, not client-sent. The client sends src: "hdca" with map_over_type to express subcollection mapping intent. The backend's batch expansion in meta.py produces DCE references as an internal artifact of splitting collections into subcollection elements. These DCE refs are then stored in job parameters for tracking and reruns.
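The expansion step above can be sketched end to end. All names here are illustrative, not the actual meta.py API, and the sketch assumes the HDCA's elements have already been split at the depth `map_over_type` requests (the real code derives that split from the collection type):

```python
def expand_batch_value(batch_value: dict, hdca_elements_by_id: dict) -> list:
    # Simplified sketch of batch expansion: split an HDCA into one job input
    # per subcollection element, serializing each element as a DCE ref —
    # the {"src": "dce", "id": ...} form to_decoded_json() produces.
    assert batch_value["src"] == "hdca"
    elements = hdca_elements_by_id[batch_value["id"]]
    return [{"src": "dce", "id": element_id} for element_id in elements]

# A list:paired HDCA with two paired elements expands into two jobs,
# each receiving one DCE ref:
jobs = expand_batch_value(
    {"src": "hdca", "id": 5, "map_over_type": "paired"},
    {5: [11, 12]},
)
# jobs == [{"src": "dce", "id": 11}, {"src": "dce", "id": 12}]
```

This is why DCE only needs modeling on the internal side: the client-facing value going in is an HDA/HDCA ref plus `map_over_type`, and the DCE refs coming out exist purely as stored job state.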

guerler confirmed this: "The client always resolves dce to hda before submission. The only time dce appears is for sub collection elements in batch or map over scenarios, which are now handled by collection expansion in meta.py."

Problem Statement

1. map_over_type Not Modeled in Request Schema

When a user maps a list:paired collection over a tool expecting a single dataset input, the API receives:

{"input": {"__class__": "Batch", "values": [{"src": "hdca", "id": "abc123", "map_over_type": "paired"}]}}

Currently, map_over_type is only present as a legacy attribute on LegacyRequestModelAttributes (parameters.py:389) with exclude=True and SkipJsonSchema. This means:

  • It is silently stripped during validation — it works by accident, not by design
  • The BatchRequest.values list uses BatchDataInstance which has no map_over_type field on dev (guerler added it in the PR)
  • There are zero parameter specification tests for map_over_type in parameter_specification.yml
  • The schema doesn't communicate to clients how subcollection mapping should be requested

2. DCE Not Modeled in Post-Expansion Representations

After batch expansion produces {"src": "dce", "id": <int>} references, these end up stored in job parameters. But the job_internal schema layer has no src: "dce" option — it only knows hda, ldda, hdca. This means:

  • Stored job state containing DCE refs can't be validated against the job_internal model
  • The request_internal and request_internal_dereferenced layers also lack DCE for the same reason — re-expansion of stored job state passes through these layers

Important: DCE does not belong in the external request layer. Clients never send it. guerler's PR added DataRequestDce to the external _DataRequest union, but that's unnecessary for the request model — DCE only needs to exist in internal/post-expansion representations.

3. No API-Level Test Coverage for Subcollection Mapping via Request Format

Existing subcollection mapping tests in test_tool_execute.py use only legacy/nested input formats, not the "request" format (__class__: "Batch"). Tests that exist:

| Test | File | Input Format |
|------|------|--------------|
| test_map_over_with_nested_paired_output_format_actions | test_tool_execute.py:182 | legacy only (manual dict) |
| test_map_over_paired_or_unpaired_with_list_paired | test_tool_execute.py:505 | legacy only |
| test_map_over_paired_or_unpaired_with_list | test_tool_execute.py:517 | legacy only |
| test_paired_input_map_over_nested_collections | test_tools.py:2479 | legacy only |
| test_simple_subcollection_mapping | test_tools.py:3023 | legacy only |
| test_can_map_over_dce_on_non_multiple_data_param | test_tools.py:2637 | legacy only |

None test the request format with __class__: "Batch" and map_over_type, because the schema doesn't model it yet.

4. Sync/Async Expansion Mismatch for DCE

The synchronous __expand_collection_parameter (meta.py:419) handles both src: "hdca" and src: "dce", but the async __expand_collection_parameter_async (meta.py:469) only handles src: "hdca" on dev. This matters for job reruns — when stored job state with DCE refs gets re-expanded through the async path, it fails. guerler fixed this in PR but there are no tests ensuring parity.

Goal

Model subcollection mapping correctly across schema layers, with full test coverage, independent of guerler's PR:

  1. Model map_over_type properly on BatchDataInstance / BatchDataInstanceInternal — the request-layer gap
  2. Add DCE to internal representations (BatchDataInstanceInternal, job_internal layer) — where expansion output lives
  3. Add parameter specification tests for batch values with map_over_type (request layer) and DCE src (internal layers)
  4. Add/migrate API execution tests in test_tool_execute.py covering subcollection mapping with request-format inputs
  5. Fix async expansion to handle DCE references (for job rerun scenarios)

Scope

In Scope

  • map_over_type on BatchDataInstance (request layer) and BatchDataInstanceInternal (internal layer)
  • DCE support in BatchDataInstanceInternal and post-expansion representations (request_internal, request_internal_dereferenced, job_internal)
  • Parameter specification tests in parameter_specification.yml
  • API execution tests in test_tool_execute.py using the request input format
  • Async expansion fix in meta.py for DCE
  • Conversion handling in convert.py if needed for DCE encode/decode through internal layers

Out of Scope

  • Adding DCE to the external request layer (_DataRequest union) — clients don't send it
  • Client-side form changes (guerler's PR territory)
  • Legacy format deprecation
  • Full migration of all subcollection tests from test_tools.py to test_tool_execute.py

Success Criteria

  • map_over_type is a first-class field on batch value models, validated by schema
  • DCE is accepted in internal/post-expansion representations where it naturally appears
  • DCE is explicitly not part of the external request model (clients never send it)
  • Parameter specification YAML covers batch+map_over_type (request) and batch+dce (internal) scenarios
  • At least 3 new API tests in test_tool_execute.py exercise subcollection mapping with request-format inputs
  • Async expansion handles DCE without error
  • All existing tests continue to pass