== Property Lookup Performance, and what we can do about it

We often get questions like: "My query is too slow, what can I do?"

If the query looks like this:

----
load csv from "salaries.csv" as row
match (p:Person) where p.id = row.id
set p.salary = toFloat(row.salary)
----

the best guess is:

____
You forgot to create a constraint or index on that label and property combination!

----
create constraint on (p:Person) assert p.id is unique;
// or
create index on :Person(id);
----
____

Currently, when Neo4j is asked to do a property lookup on a non-indexed property, it has to do a full scan over all nodes with that label and compare the property with the provided value in a filter operation.
And it does that for every single check, so if you have a CSV with 1M lines, that's 1M x (full scan + filter).

Let's look at some numbers on an artificial dataset.
We'll only perform reads (i.e. the lookups), so that write and transaction overhead doesn't skew the times.

We create 1M people with an `id` property:

----
UNWIND range(1,1000000) AS id CREATE (:Person {id: id, age: id % 100});

+-------------------+
| No data returned. |
+-------------------+
Nodes created: 1000000
Properties set: 1000000
Labels added: 1000000
10723 ms
----

=== Then we try to look up 500k of those ids ...

Just generating the ids is cheap:

----
unwind range(1,1000000,2) as id return count(*);

+----------+
| count(*) |
+----------+
| 500000   |
+----------+
1 row
68 ms
----

... but looking them up among the existing people is not:

----
UNWIND range(1,1000000,2) AS id MATCH (:Person {id: id}) RETURN count(*);

... didn't finish after several minutes ...
----

No luck there. Re-running this with a much smaller number of lookups (100) at least shows us the query plan and the associated costs:

----
profile UNWIND range(1,1000000,10000) AS id MATCH (:Person {id: id}) RETURN count(*);

+----------+
| count(*) |
+----------+
| 100      |
+----------+
1 row
72957 ms

Compiler CYPHER 3.0
Planner COST
Runtime INTERPRETED

+--------------------+----------------+-----------+-----------+----------------+-------------------+
| Operator           | Estimated Rows | Rows      | DB Hits   | Variables      | Other             |
+--------------------+----------------+-----------+-----------+----------------+-------------------+
| +ProduceResults    |           1000 |         1 |         0 | count(*)       | count(*)          |
| |                  +----------------+-----------+-----------+----------------+-------------------+
| +EagerAggregation  |           1000 |         1 |         0 | count(*)       |                   |
| |                  +----------------+-----------+-----------+----------------+-------------------+
| +Filter            |        1000000 |       100 | 100000000 | anon[44], id   | anon[44].id == id |
| |                  +----------------+-----------+-----------+----------------+-------------------+
| +Apply             |        1000000 | 100000000 |         0 | id -- anon[44] |                   |
| |\                 +----------------+-----------+-----------+----------------+-------------------+
| | +NodeByLabelScan |       10000000 | 100000000 | 100000100 | anon[44]       | :Person           |
| |                  +----------------+-----------+-----------+----------------+-------------------+
| +Unwind            |             10 |       100 |         0 | id             |                   |
| |                  +----------------+-----------+-----------+----------------+-------------------+
| +EmptyRow          |              1 |         1 |         0 |                |                   |
+--------------------+----------------+-----------+-----------+----------------+-------------------+

Total database accesses: 200000100
----

Each of the 100 lookups runs a full `NodeByLabelScan` over all 1M people followed by a filter on the property, so 100M rows flow through the plan and we end up with 200M database accesses.

Now let's add the unique constraint. The neo4j-shell command `schema await` waits for the backing index to come online, `schema sample -a` re-samples it (each lookup query below was run twice to warm the caches; the second timing is shown):

----
create constraint on (p:Person) assert p.id is unique;
// Unique constraints added: 1
// 8617 ms

schema await
schema sample -a
----

----
unwind range(1,1000000,2) as id match (:Person {id: id}) return count(*);

+----------+
| count(*) |
+----------+
| 500000   |
+----------+
1 row
7450 ms
----

With the unique constraint in place, the 500k lookups take about 7.5 seconds.
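To verify that the constraint is actually being used, it helps to look at the plan of a single lookup. A quick sketch (not from the original session; the exact operator name may vary by Neo4j version):

----
explain match (p:Person {id: 42}) return p;
// should now show a NodeUniqueIndexSeek
// instead of NodeByLabelScan + Filter
----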
A plain (non-unique) index performs about the same:

----
drop constraint on (p:Person) assert p.id is unique;
create index on :Person(id);

schema await
schema sample -a
----

----
unwind range(1,1000000,2) as id match (:Person {id: id}) return count(*);

+----------+
| count(*) |
+----------+
| 500000   |
+----------+
1 row
7132 ms
----

Can we do better without any index at all? Let's drop it and build our own in-memory lookup structure with APOC's map functions instead:

----
drop index on :Person(id);
----

First variant: collect all people into `[id, node]` pairs, turn those into one big map with `apoc.map.fromPairs`, and run the 500k lookups against that map:

----
match (p:Person)
with collect([toString(p.id), p]) as pairs
call apoc.map.fromPairs(pairs) yield value as index
unwind range(1,1000000,2) as id
with index[toString(id)] as n
return count(*);

+----------+
| count(*) |
+----------+
| 500000   |
+----------+
1 row
4687 ms
----

`apoc.map.groupBy` builds the same kind of map directly from the nodes, keyed by a property, and is a bit faster:

----
match (p:Person)
with collect(p) as people
call apoc.map.groupBy(people,'id') yield value as index
unwind range(1,1000000,2) as id
with index[toString(id)] as n
return count(*);

+----------+
| count(*) |
+----------+
| 500000   |
+----------+
1 row
3115 ms
----

Pre-filtering the people with an `IN` predicate before building the map helps further:

----
with range(1,1000000,2) as ids
match (p:Person) where p.id IN ids
with collect(p) as people
call apoc.map.groupBy(people,'id') yield value as index
unwind range(1,1000000,2) as id
with index[toString(id)] as n
return count(*);

+----------+
| count(*) |
+----------+
| 500000   |
+----------+
1 row
2344 ms
----

We can also invert the approach: build the map from the input rows instead of the nodes, then match only the people whose id appears among its keys. Map keys are always strings, so we either convert them back to numbers with `toInt`:

----
with [id IN range(1,1000000,2) | {id: id}] as rows
call apoc.map.groupBy(rows,'id') yield value as rowById
with rowById, [id IN keys(rowById) | toInt(id)] as ids
match (p:Person) where p.id IN ids
with rowById[toString(p.id)] as row
// do something with p and row
return count(*);
----

... or keep the ids as strings throughout:

----
with [id IN range(1,1000000,2) | {id: toString(id)}] as rows
call apoc.map.groupBy(rows,'id') yield value as rowById
with rowById, keys(rowById) as ids
match (p:Person) where p.id IN ids
with rowById[p.id] as row
// do something with p and row
return count(*);

+----------+
| count(*) |
+----------+
| 500000   |
+----------+
1 row
4746 ms
----

... or skip the intermediate list and use `keys(rowById)` directly in the predicate:

----
with [id IN range(1,1000000,2) | {id: toString(id)}] as rows
call apoc.map.groupBy(rows,'id') yield value as rowById
match (p:Person) where p.id IN keys(rowById)
with rowById[p.id] as row
// do something with p and row
return count(*);
----

To check whether the planner treats a literal `range()` inside the `IN` predicate differently from a list passed in via `WITH`, we can profile both variants (here on a non-existent property `id2`):

----
profile with range(1,1000) as ids match (p:Person) where p.id2 IN ids return count(*);

profile match (p:Person) where p.id2 IN range(1,1000) return count(*);
----

Building the map alone, without touching the graph, already accounts for a good chunk of the time:

----
with [id IN range(1,1000000,2) | {id: toString(id)}] as rows
call apoc.map.groupBy(rows,'id') yield value as rowById
return count(*);
// 1300 ms
----

Which explains why the plain `IN` predicate, with no map building at all, turns out to be the fastest option: a single label scan with one filter over the whole id list:

----
with range(1,1000000,2) as ids
match (p:Person) where p.id IN ids
return count(*);

+----------+
| count(*) |
+----------+
| 500000   |
+----------+
1 row
1631 ms
----

Collecting the ids first works just as well:

----
unwind range(1,1000000,2) as id
with collect(id) as ids
match (p:Person) where p.id IN ids
return count(*);

+----------+
| count(*) |
+----------+
| 500000   |
+----------+
1 row
1660 ms
----

If we additionally need to pair each matched person with its input row, we can combine the `IN` pre-filter with the `fromPairs` lookup map:

----
with range(1,1000000,2) as ids
match (p:Person) where p.id IN ids
with collect([toString(p.id), p]) as pairs
call apoc.map.fromPairs(pairs) yield value as index
unwind range(1,1000000,2) as id
with index[toString(id)] as n
return count(*);

+----------+
| count(*) |
+----------+
| 500000   |
+----------+
1 row
3563 ms
----
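Before applying these patterns to the original import, here is a minimal sketch (on literal data, not from the original session) of what the two APOC calls actually produce; note the string keys, which is why the `toString(...)` conversions above are needed:

----
// turn a list of [key, value] pairs into a map
call apoc.map.fromPairs([['1','Alice'],['3','Bob']]) yield value
return value;
// {`1`: "Alice", `3`: "Bob"}

// key a list of maps (or nodes) by one of their properties
call apoc.map.groupBy([{id:1, name:'Alice'}, {id:3, name:'Bob'}], 'id') yield value
return value;
// {`1`: {id: 1, name: "Alice"}, `3`: {id: 3, name: "Bob"}}
----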
=== Back to the original import

Armed with this, we can rewrite the original query:

----
load csv from "salaries.csv" as row
match (p:Person) where p.id = row.id
set p.salary = toFloat(row.salary)
----

into a version that does a single `IN`-filtered label scan and pairs every person with its row in memory:

----
load csv from "salaries.csv" as row
with collect(distinct row.id) as ids, collect(row) as rows
match (p:Person) where p.id IN ids
with collect(p) as people, rows  // this aggregation is probably the only issue
unwind rows as row
with head([p IN people WHERE p.id = row.id]) as p, row  // and perhaps this "lookup"
set p.salary = toFloat(row.salary);
----

Or, more compactly, using `apoc.map.groupBy` to index the rows by id:

----
load csv from "salaries.csv" as row
with collect(row) as rows
call apoc.map.groupBy(rows,'id') yield value as rowById
match (p:Person) where p.id IN keys(rowById)
set p.salary = toFloat(rowById[toString(p.id)].salary);
----
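The same pattern works when the rows arrive as a query parameter from a driver instead of a CSV file. A minimal sketch, assuming a hypothetical `{rows}` parameter holding maps like `{id: "1", salary: "55000"}`:

----
with {rows} as rows
call apoc.map.groupBy(rows,'id') yield value as rowById
match (p:Person) where p.id IN keys(rowById)
// note: map keys are strings; if p.id is numeric,
// convert the keys back with toInt as shown earlier
set p.salary = toFloat(rowById[toString(p.id)].salary);
----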