name: oracle-schema-diagram
description: Use this skill whenever the user provides Oracle SQL (SELECT, stored procedures, views, or anything with FROM/JOIN/WHERE) and wants to understand the database structure, generate a schema-extraction query, identify table relationships, or produce a DBML database diagram for dbdiagram.io. Triggers include: "generate schema SQL", "extract table columns", "database diagram", "ER diagram", "DBML", "dbdiagram.io", or any time Oracle SQL is pasted alongside a request to see columns, relationships, or structure. Also triggers when the user uploads a CSV from the extraction query and wants a diagram built from it. Also triggers when the user has already received a DBML diagram earlier in the conversation and returns with a revised CSV, clarifications, or corrections — treat this as an iteration on the existing analysis, not a fresh start. Use this skill even when the user does not say "DBML" explicitly — if they want a database diagram from Oracle SQL, default to this skill.

Oracle Schema Diagram Skill (DBML output)

This skill takes one or more Oracle SQL statements from the user and walks through a multi-phase workflow that ends with a downloadable DBML file (for https://dbdiagram.io/) plus a findings report.

The workflow is intentionally phased because the middle step requires the user to run a query against their own Oracle database — the skill cannot do that itself.

The full workflow at a glance

  1. Phase 1 — Analyze SQL: Parse the user's SQL statements, extract every referenced table, and emit a ready-to-run Oracle SQL query that pulls columns + FK metadata for those tables.
  2. Phase 2 — Wait for CSV: The user runs that query in their Oracle client and uploads the resulting CSV back here.
  3. Phase 3 — Merge relationships: Combine FK relations from the CSV with JOIN / WHERE relations inferred from the original SQL. Ask the user for clarification on anything ambiguous.
  4. Phase 4 — Prune and emit DBML: Keep only columns actually referenced in the original SQL (plus PKs / FKs needed for relationships), and write a DBML file.
  5. Phase 5 — Findings report: Summarize noteworthy observations about the schema and relationships both in chat and as a downloadable .md file.
  6. Phase 6 — Iterate (only if the user comes back with corrections): The findings often prompt the user to refine things — a fresh CSV covering additional tables, clarifications about specific relationships, instructions to drop or add something. Re-run just the parts of the pipeline that are affected and deliver updated outputs.

Move through phases 1–5 in order. Do not try to skip ahead to the DBML before the CSV is uploaded — the CSV is the only reliable source of column names and data types. Phase 6 only applies when the user returns after seeing the outputs.


Phase 1 — Analyze the SQL and generate the extraction query

When the user pastes one or more Oracle SQL statements, do the following.

1a. Extract every table reference

Walk through each statement and collect tables from:

  • FROM <table> [alias]
  • JOIN <table> [alias] (inner, left, right, full, cross — all of them)
  • UPDATE <table>
  • INSERT INTO <table>
  • MERGE INTO <table>
  • DELETE FROM <table>
  • Subqueries and CTEs (WITH cte AS (SELECT ... FROM <table>))
  • Tables referenced inside stored procedure / function bodies

Normalize everything to uppercase, deduplicate, and ignore Oracle built-ins like DUAL, ALL_*, DBA_*, USER_*, V$*, GV$*. Also skip CTE names — they are not physical tables, even though they appear in FROM clauses.

If a table is schema-qualified (HR.EMPLOYEES), keep track of the schema separately and use just the table name in the query's IN (...) list (since ALL_TAB_COLUMNS filters on table name and optionally owner).
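
If you script this step, a rough regex pass is usually enough for well-formed SQL. A minimal sketch (cte_names is assumed to have been collected from the WITH clauses beforehand; a real SQL parser would handle comments and quoted identifiers better):

import re

# INTO covers INSERT INTO and MERGE INTO; FROM covers DELETE FROM.
TABLE_REF = re.compile(
    r'\b(?:FROM|JOIN|UPDATE|INTO)\s+([A-Za-z_][\w$#]*(?:\.[A-Za-z_][\w$#]*)?)',
    re.IGNORECASE)
ORACLE_BUILTIN = re.compile(r'^(DUAL$|ALL_|DBA_|USER_|V\$|GV\$)')

def extract_tables(sql, cte_names):
    # Normalize to uppercase, drop schema qualifiers, skip CTEs and built-ins.
    tables = set()
    for ref in TABLE_REF.findall(sql):
        name = ref.upper().split('.')[-1]
        if name not in cte_names and not ORACLE_BUILTIN.match(name):
            tables.add(name)
    return tables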

1b. Generate the extraction SQL

Use this template, replacing <TABLES_LIST> with the comma-separated, quoted, uppercased table names you extracted:

SELECT 
    c.table_name,
    c.column_name,
    c.data_type,
    c.data_length,
    c.nullable,
    c.column_id,
    cc.constraint_type,
    cc.r_table_name AS referenced_table,
    cc.r_column_name AS referenced_column
FROM all_tab_columns c
LEFT JOIN (
    SELECT 
        ac.table_name,
        acc.column_name,
        ac.constraint_type,
        ac.r_owner,
        acr.table_name AS r_table_name,
        accr.column_name AS r_column_name
    FROM all_constraints ac
    JOIN all_cons_columns acc ON ac.constraint_name = acc.constraint_name
    LEFT JOIN all_constraints acr ON ac.r_constraint_name = acr.constraint_name
    LEFT JOIN all_cons_columns accr ON acr.constraint_name = accr.constraint_name
        AND acc.position = accr.position -- keeps composite-FK columns aligned
) cc ON c.table_name = cc.table_name AND c.column_name = cc.column_name
WHERE c.table_name IN (<TABLES_LIST>)
ORDER BY c.table_name, c.column_id;

Example substitution for ORDERS, ORDER_ITEMS, CUSTOMERS:

WHERE c.table_name IN ('ORDERS', 'ORDER_ITEMS', 'CUSTOMERS')
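
If you build the list programmatically, something like this keeps the quoting consistent (a sketch; tables is the set from step 1a, and SQL_TEMPLATE is a hypothetical string holding the query above):

tables_list = ", ".join(f"'{t}'" for t in sorted(tables))
extraction_sql = SQL_TEMPLATE.replace("<TABLES_LIST>", tables_list)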

1c. Present the SQL and pause for the CSV

Show the generated SQL in a single clear code block. Tell the user something like:

"I found N tables in your SQL: TABLE_A, TABLE_B, TABLE_C, …

Run the query below in Oracle SQL Developer (or any Oracle client) and export the results as CSV, then upload the CSV here. I'll take it from there."

Also briefly list the tables found so the user can sanity-check before running anything. Do not proceed further until the CSV arrives.


Phase 2 — Waiting for the CSV

There is nothing active to do during Phase 2 — the user runs the query themselves. When they return with a CSV, proceed to Phase 3.

If what they upload is obviously not the extraction-query output (wrong columns, zero rows, text file with error messages), point out what's wrong and ask them to re-export. The expected columns are: TABLE_NAME, COLUMN_NAME, DATA_TYPE, DATA_LENGTH, NULLABLE, COLUMN_ID, CONSTRAINT_TYPE, REFERENCED_TABLE, REFERENCED_COLUMN.
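
A quick check of the header row catches most bad uploads before any real parsing. A sketch (column order can vary between Oracle clients, so this tests membership rather than position):

import csv

EXPECTED = {"TABLE_NAME", "COLUMN_NAME", "DATA_TYPE", "DATA_LENGTH", "NULLABLE",
            "COLUMN_ID", "CONSTRAINT_TYPE", "REFERENCED_TABLE", "REFERENCED_COLUMN"}

def looks_like_extraction_csv(path):
    # Read only the header row and verify all expected columns are present.
    with open(path, newline="") as f:
        header = {h.strip().upper() for h in next(csv.reader(f), [])}
    return EXPECTED.issubset(header)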


Phase 3 — Parse CSV, merge relationships, ask for clarifications

3a. Parse the CSV

Read the CSV (the file is under /mnt/user-data/uploads/). Use Python with csv or pandas — whichever is easier. Build a structure like:

{
    "ORDERS": {
        "columns": [
            {"name": "ORDER_ID", "data_type": "NUMBER", "length": 22, "nullable": "N",
             "pk": True, "fk_to": None},
            {"name": "CUSTOMER_ID", "data_type": "NUMBER", "length": 22, "nullable": "Y",
             "pk": False, "fk_to": ("CUSTOMERS", "CUSTOMER_ID")},
            ...
        ]
    },
    ...
}

A column can appear on multiple CSV rows (once per constraint it participates in). Merge them: pk=True if any row shows CONSTRAINT_TYPE='P', fk_to=(ref_table, ref_col) if any row shows CONSTRAINT_TYPE='R'.
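
The merge itself is mechanical. A sketch of the row-collapsing logic with the csv module (header names match the extraction query's output):

import csv
from collections import defaultdict

def load_schema(path):
    # The CSV has one row per (column, constraint); collapse to one entry per column.
    tables = defaultdict(dict)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            col = tables[row["TABLE_NAME"]].setdefault(row["COLUMN_NAME"], {
                "data_type": row["DATA_TYPE"], "length": row["DATA_LENGTH"],
                "nullable": row["NULLABLE"], "column_id": int(row["COLUMN_ID"]),
                "pk": False, "fk_to": None})
            if row["CONSTRAINT_TYPE"] == "P":
                col["pk"] = True
            elif row["CONSTRAINT_TYPE"] == "R":
                col["fk_to"] = (row["REFERENCED_TABLE"], row["REFERENCED_COLUMN"])
    return dict(tables)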

3b. Extract columns actually used in the original SQL

For each table, figure out which columns are explicitly referenced. Look at:

  • SELECT clause — alias.column, table.column, bare column names
  • JOIN ON — both sides of the equality
  • WHERE, HAVING — any column reference
  • GROUP BY, ORDER BY
  • INSERT / UPDATE column lists

Resolve aliases (e.g. o.order_id → ORDERS.ORDER_ID) using the alias map from FROM/JOIN clauses. If the SQL uses SELECT * or SELECT t.*, treat all columns of that table as used.

Build { TABLE_NAME: set_of_used_column_names }.
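
A regex-based sketch of the alias resolution (it only catches qualified references like o.order_id; bare column names need the per-table column lists from the CSV to resolve):

import re

STOP_WORDS = r'ON|WHERE|USING|SET|LEFT|RIGHT|INNER|FULL|CROSS|JOIN|GROUP|ORDER'

def build_alias_map(sql):
    # alias -> TABLE, from "FROM table alias" / "JOIN table alias" pairs.
    pattern = rf'\b(?:FROM|JOIN)\s+([\w$#.]+)\s+(?!(?:{STOP_WORDS})\b)(\w+)'
    return {alias.upper(): table.upper().split('.')[-1]
            for table, alias in re.findall(pattern, sql, re.IGNORECASE)}

def used_columns(sql, alias_map):
    # Map every qualified reference (x.col) back to its physical table.
    used = {}
    for qualifier, column in re.findall(r'\b(\w+)\.(\w+)', sql):
        table = alias_map.get(qualifier.upper(), qualifier.upper())
        used.setdefault(table, set()).add(column.upper())
    return used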

3c. Collect relationships from three sources

Source A — Foreign keys from CSV (CONSTRAINT_TYPE = 'R'):

{from_table, from_column, to_table, to_column, source: "FK"}

Source B — JOIN ON conditions from the original SQL:

-- JOIN order_items oi ON o.order_id = oi.order_id
--   → {ORDERS.ORDER_ID → ORDER_ITEMS.ORDER_ID, source: "JOIN"}

Source C — WHERE equality between two table columns:

-- WHERE e.dept_id = d.id
--   → {EMPLOYEES.DEPT_ID → DEPARTMENTS.ID, source: "WHERE"}

Sources B and C catch implicit relationships that aren't declared as FKs in the database — common in older Oracle systems where FKs were never formalized. The whole point of pulling both the FK metadata and parsing the SQL is to catch these.

Deduplicate: if the same column pair appears in multiple sources, keep one entry but record all sources (e.g. "FK+JOIN").
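
A sketch of the deduplication, keyed on the column pair (WHERE-derived pairs have no inherent direction, so consider normalizing the key ordering before comparing):

def merge_relationships(rels):
    # Collapse duplicate column pairs, concatenating sources, e.g. "FK+JOIN".
    merged = {}
    for r in rels:
        key = (r["from_table"], r["from_column"], r["to_table"], r["to_column"])
        if key in merged:
            if r["source"] not in merged[key]["source"].split("+"):
                merged[key]["source"] += "+" + r["source"]
        else:
            merged[key] = dict(r)
    return list(merged.values())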

3d. Ask the user for clarification when needed

Genuinely ambiguous situations deserve a question, not a guess. Ask when:

  • Direction is unclear. For FK relations the direction is obvious (child → parent). For JOIN/WHERE-derived relations you may not know which table "owns" the key. Ask if it matters for the diagram.
  • Non-obvious join columns. e.g. ON a.code = b.type_code where the columns have different names and no FK backs them up. Ask whether this is really a relationship or just a filter coincidence.
  • Tables appearing in SQL but missing from CSV. Ask whether the table was renamed, in a different schema, or intentionally omitted.
  • Composite keys. If a JOIN uses multiple column pairs, ask whether to represent it as a composite FK or just the primary column pair.

Keep clarification questions to a short, numbered list — no more than 3–4 at a time. Use concise wording. If there's nothing ambiguous, don't invent questions — just proceed.


Phase 4 — Prune columns and emit DBML

4a. Prune columns

For each table in the final diagram, keep only columns that are:

  • Referenced explicitly in the original SQL (from Phase 3b), OR
  • Part of a relationship being drawn (the FK column or the referenced PK column), OR
  • The table's primary key (even if not used in SQL — this makes the diagram readable)

Drop every other column. This pruning step is the point of the skill: the diagram should show what the SQL actually touches, not the full 80-column kitchen-sink table.

If after pruning a table has zero columns, still emit it with just its PK — never emit an empty table block.
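
A sketch of the keep-set computation, assuming the structures built in Phase 3 (tables from the CSV, used from 3b, rels from 3c):

def prune(tables, used, rels):
    # Keep set = SQL-referenced columns + relationship endpoints; PKs always survive.
    keep = {t: set(cols) for t, cols in used.items()}
    for r in rels:
        keep.setdefault(r["from_table"], set()).add(r["from_column"])
        keep.setdefault(r["to_table"], set()).add(r["to_column"])
    return {t: {c: info for c, info in cols.items()
                if c in keep.get(t, set()) or info["pk"]}
            for t, cols in tables.items()}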

4b. Generate the DBML

Format the output as valid DBML (the language used by dbdiagram.io). Key rules:

Table block:

Table ORDERS {
  order_id number [pk]
  customer_id number [ref: > CUSTOMERS.customer_id]
  status varchar2
  order_date date
}

Syntax specifics:

  • Lowercase column names are conventional in DBML, and keeping table names uppercase matches Oracle. Either casing works — just pick one convention and stay consistent.
  • Map Oracle types to DBML-friendly types. DBML accepts any type name, so NUMBER, VARCHAR2, DATE, TIMESTAMP, CLOB are all fine as-is. For VARCHAR2(200) you can use varchar2(200) — include length if it's meaningful.
  • Use [pk] for primary keys, [not null] when NULLABLE='N', and inline [ref: > OTHER_TABLE.col] for single-column FK relationships.
  • For multi-column references, or for JOIN/WHERE-derived relationships where you want to annotate the source, use standalone Ref: lines at the bottom of the file. That keeps the table bodies clean.
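
A sketch of a table-block emitter following these rules (inline refs for single-column FKs only; anything multi-column or annotated goes on standalone Ref lines as described above):

def emit_table(name, cols):
    # Columns in Oracle's declared order; [pk] already implies not-null in DBML.
    lines = [f"Table {name} {{"]
    for col, info in sorted(cols.items(), key=lambda kv: kv[1]["column_id"]):
        attrs = []
        if info["pk"]:
            attrs.append("pk")
        elif info["nullable"] == "N":
            attrs.append("not null")
        if info["fk_to"]:
            ref_table, ref_col = info["fk_to"]
            attrs.append(f"ref: > {ref_table}.{ref_col.lower()}")
        suffix = f" [{', '.join(attrs)}]" if attrs else ""
        lines.append(f"  {col.lower()} {info['data_type'].lower()}{suffix}")
    lines.append("}")
    return "\n".join(lines)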

Relationship syntax:

  • > = many-to-one (child.col → parent.col)
  • < = one-to-many
  • - = one-to-one

Default to > for FK-style relations pointing from the table holding the FK column to the referenced table.

Annotating relationship sources: Put the source (FK, JOIN, WHERE, or combinations) in a comment on the Ref line:

Ref: ORDER_ITEMS.order_id > ORDERS.order_id // source: FK+JOIN
Ref: EMPLOYEES.dept_id > DEPARTMENTS.id // source: WHERE (inferred from SQL, no FK defined)
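
Standalone Ref lines with the provenance comment can be emitted the same way (a sketch, using the merged relationship records from Phase 3c):

def emit_ref(r):
    # ">" = many-to-one: the table holding the FK column points at the parent.
    return (f"Ref: {r['from_table']}.{r['from_column'].lower()} > "
            f"{r['to_table']}.{r['to_column'].lower()} // source: {r['source']}")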

File header: Start the DBML file with a comment block summarizing what it is, the source SQL, and a legend:

// Database Diagram generated from Oracle SQL analysis
// Tables: N | Relationships: M (FK: x, JOIN-derived: y, WHERE-derived: z)
//
// Legend:
//   // source: FK             — declared foreign key in database
//   // source: JOIN           — inferred from JOIN clause in SQL (no FK defined)
//   // source: WHERE          — inferred from WHERE equality in SQL
//   // source: FK+JOIN        — declared FK, also used in SQL JOIN
//
// Paste this file into https://dbdiagram.io/ to render the diagram.

4c. Save and present the file

Save the DBML to /mnt/user-data/outputs/schema_diagram.dbml (or a descriptive name like orders_schema.dbml if you can tell from the SQL's domain).

Use present_files to give the user a download link. In the chat message, briefly say what they'll see when they paste it into dbdiagram.io (number of tables, whether inferred relationships dominate, etc.) — but keep it short.


Phase 5 — Findings report

After the DBML is ready, generate a findings report. The goal is to surface things the user might not notice by eyeballing the diagram — the kind of observations a human DBA would flag after looking at the schema for a minute.

What to look for

Not every finding applies to every schema — only include findings that are actually interesting for this particular set of SQL + CSV. Good categories to scan for:

  • Undeclared relationships. Tables linked in SQL via JOIN/WHERE but with no FK in the database. These are fragile — they aren't enforced by the DB and can break silently.
  • Central / hub tables. Tables participating in many relationships; these are likely the entities the queries revolve around.
  • Orphan tables. Tables in the SQL with no relationships to anything else in the diagram. Sometimes intentional (lookup tables), sometimes suspicious.
  • Nullable foreign keys. NULLABLE='Y' on an FK column usually means the relationship is optional — worth calling out.
  • Composite keys used in joins. Multi-column joins hint at composite primary keys or lack thereof.
  • Self-references. A table joining to itself (hierarchy, tree structure) is always worth noting.
  • Columns heavily filtered in WHERE. A column used in many WHERE clauses may be a performance concern if it isn't indexed; the extraction query can't tell whether an index exists, so flag it as an open question.
  • Wide tables vs. narrow use. A table with 50 columns where the SQL only touches 3 — worth noting that the domain is narrow.
  • Naming inconsistencies. e.g. ORDERS.CUSTOMER_ID joins to CUSTOMERS.ID — inconsistent naming makes joins harder to spot.
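
Most of these checks are judgment calls, but a few are mechanical. Hub-table detection, for example, is just an endpoint count over the merged relationships (sketch):

from collections import Counter

def hub_tables(rels, top=3):
    # Tables touching the most relationships are likely the diagram's hubs.
    counts = Counter()
    for r in rels:
        counts[r["from_table"]] += 1
        counts[r["to_table"]] += 1
    return counts.most_common(top)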

Output format

Show the findings in chat, grouped under clear headings. Then save the same content to /mnt/user-data/outputs/schema_findings.md.

Suggested markdown structure:

# Schema Findings

## Summary
- **Tables analyzed:** N
- **Relationships found:** M (K declared as FK, L inferred from SQL)
- **Undeclared relationships:** L — see below

## Key observations
1. **<Short title>.** <One or two sentences explaining the observation and why it matters.>
2. ...

## Undeclared relationships (no FK in database)
| From | To | Source | Note |
|------|------|--------|------|
| EMPLOYEES.DEPT_ID | DEPARTMENTS.ID | WHERE clause | Not enforced at DB level |

## Hub tables
...

## Potential data-integrity concerns
...

Keep findings concrete and short. Bullet points over paragraphs. Skip any section that has nothing interesting to report.

Present both files together at the end using present_files (DBML file first, then the findings markdown).


Phase 6 — Iterate on the diagram

Often the most useful output of Phase 5 is that it shows the user what's missing or wrong. They look at the diagram, read the findings, and come back with one or more of:

  • A new CSV — usually a re-run of the extraction query with additional tables included (tables the user realized were relevant after seeing the findings), or with a different schema owner filter.
  • Clarifications in chat — answers to the questions you asked in Phase 3d, or spontaneous corrections like "the relationship between X and Y is wrong, it should be via column Z" or "drop TABLE_FOO, it's deprecated" or "treat the MANAGER_ID link as a formal FK even though the DB doesn't declare it."
  • Both at once.

Recognizing this moment matters. Signals that a message is a Phase-6 iteration rather than a fresh request:

  • The conversation already contains a DBML file and findings report from an earlier turn.
  • The user references things by name that only appear in the prior outputs ("the findings said...", "in the diagram...", "those three undeclared relationships").
  • They upload a CSV without providing new SQL.
  • They provide corrections or directives without any fresh SQL.

When any of these apply, treat it as iteration — do not ask for the original SQL again, and do not regenerate the extraction query from Phase 1 (unless the user explicitly added new tables that aren't in the old SQL).

6a. Decide what changed

Before running anything, identify what's actually different from the previous iteration:

  • New CSV only → column/FK data changed, but the SQL-derived relationships (JOIN/WHERE) are unchanged. Re-parse the CSV, re-merge relationships, re-prune, regenerate outputs.
  • Clarifications only → no new data, just directives. Apply them to the existing merged state and regenerate outputs.
  • Both → apply the CSV update first, then apply clarifications on top.

State briefly what you understood changed before producing new outputs. One or two sentences. This gives the user a chance to correct you if you misread the update.

6b. Apply clarifications as explicit directives

User clarifications come in a few common shapes. Apply them precisely — clarifications are not suggestions, they are corrections:

| Clarification shape | What to do |
|---|---|
| "The X→Y relationship is wrong, it should be X.a → Y.b" | Remove the old relationship, add the corrected one. Note the source as USER in the comment. |
| "Treat X→Y as an FK even though it's not declared" | Keep the relationship but upgrade the source annotation to FK (per user) or similar — the schema hasn't changed, but the user's intent is that this should be considered a real relationship. |
| "Drop table X from the diagram" | Remove that table and any relationships it participates in. |
| "Add a relationship X.a → Y.b" | Add it with source USER. |
| "Include column X from table Y even though the SQL doesn't use it" | Add the column to the keep set for that table. |
| "This column is actually the PK" / "this isn't really a PK" | Override the PK flag from the CSV for that column. |
| Answer to a Phase 3d question you asked earlier | Resolve the ambiguity using the answer and continue. |

Introduce a new source annotation USER (or FK+USER, JOIN+USER etc.) for anything the user added or promoted. This keeps the provenance trail honest — you can always tell later which relationships came from the DB, which came from SQL inference, and which came from user judgment.

New tables added in iteration. When the CSV introduces a table that wasn't there before, there's no original SQL that references it — so Phase 3b's "used columns" logic produces an empty set. For these tables, default to including: the PK, any columns referenced in user-added relationships, and any columns the user explicitly mentions. If the table ends up with only a PK and one FK column, that's fine — the diagram stays lean. If the user seems to expect more, ask which columns from the new table they want shown.

Extend the legend in the DBML header accordingly:

//   // source: USER          — added or corrected by user
//   // source: FK+USER       — declared FK, confirmed/adjusted by user

6c. Regenerate both outputs

Rerun Phase 4 (DBML) and Phase 5 (findings) with the updated state. Write to the same filenames as before (schema_diagram.dbml and schema_findings.md, or whatever descriptive name you chose originally) — overwriting the previous files so the user isn't left with stale copies sitting next to fresh ones.

In the findings report, add a short "Changes from previous iteration" section at the top — what was added, what was removed, what was corrected. Keep it to a handful of bullets. This makes it easy for the user to diff their understanding:

## Changes from previous iteration
- Added table `AUDIT_LOG` (4 columns) with relationship to `EMPLOYEES.EMP_ID`.
- Corrected `DEPARTMENTS.LOCATION_CODE → LOCATIONS.LOC_CODE` — per user, this is the canonical relationship (was previously flagged as WHERE-inferred only).
- Dropped deprecated table `LEGACY_EMP_BACKUP`.

Then present the updated files. Be explicit in chat that these supersede the earlier versions.

6d. When to loop again

Multiple iterations are fine. The user may refine several times before the diagram matches their mental model. Each iteration is cheap — just re-run 6a through 6c. The only thing that would send you back to Phase 1 is the user pasting genuinely new SQL that references tables not previously analyzed; in that case, generate a fresh extraction query for the new tables only (not the ones already in the CSV) and continue from there.


| Situation | Handling |
|---|---|
| Only one SQL statement, only one table | Still works — just produce a single-table DBML with no relationships. |
| CSV has tables not in the SQL | They were pulled in for context but aren't needed — omit them from the DBML. |
| Table in SQL missing from CSV | Warn the user. Ask whether to skip the table or wait for a re-extraction. |
| SELECT * on a table | Include all columns in the "used" set. |
| Self-join (e.manager_id = m.employee_id) | Emit a self-referential Ref in DBML. |
| CTE / subquery aliases | Treat as virtual — don't put them in the DBML as tables. |
| Column used with different case in SQL vs CSV | Normalize to uppercase for matching. Output DBML in whatever casing you chose. |
| No relationships at all | Still emit the DBML. Note in findings that the tables appear to be independent. |
| User returns with just "the diagram is wrong" with no specifics | Ask what specifically is wrong — don't guess. Point them at what you'd need (a corrected relationship, a table to drop, a new CSV). |
| User returns with a CSV that has fewer tables than before | They probably intentionally narrowed the scope. Drop the missing tables from the diagram. Mention this in the "Changes from previous iteration" section. |
| User contradicts something the DB clearly says (e.g. says "X is not a PK" when CSV has CONSTRAINT_TYPE='P') | Apply the user's override but mention the discrepancy in findings — they may want to know the DB and their mental model disagree. |

Output checklist

Before handing back the final DBML + findings, verify:

  • Every table referenced in the original SQL appears in the DBML (or is flagged in findings as missing from CSV).
  • Every included table has only the columns used in SQL + its PK + any relationship columns.
  • All FK relationships from the CSV are represented as Ref: lines or inline refs.
  • All JOIN-derived relationships are represented, annotated with // source: JOIN (or FK+JOIN).
  • All WHERE-derived relationships are represented, annotated with // source: WHERE.
  • The DBML validates when pasted into https://dbdiagram.io/ (no syntax errors in types, refs, or braces).
  • The findings file exists, is in /mnt/user-data/outputs/, and is also summarized in the chat.
  • Both files are presented via present_files.

Additional checks for Phase 6 iterations:

  • You overwrote the previous DBML and findings files rather than creating parallel versions with new names.
  • Every user clarification was applied — none silently ignored.
  • User-added or user-corrected relationships carry a USER annotation in the DBML.
  • The findings report has a "Changes from previous iteration" section at the top listing what changed.
  • You told the user in chat that the new files supersede the earlier ones.