Hugo Palmer hugopalmer

Load a Large CSV into Vertica

Here's an efficient way to load a large CSV into Vertica: split the file into multiple pieces, then load those pieces in parallel.

Note that this only makes sense if your Vertica cluster has a single node. On a multi-node cluster, there are more efficient ways of doing this.

For this example, the large CSV file will be called large_file.csv. If your file is under 1 GB, it probably makes sense to load it with a single COPY command instead.
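The split-then-parallel-load idea described above can be sketched roughly as follows. This is a minimal illustration, not a tested production loader: the table name `my_table`, the chunk size, the `chunks` output directory, and the worker count are all hypothetical, and the sketch assumes the CSV has no header row and no quoted fields containing newlines. It shells out to `vsql -c` with a `COPY ... FROM LOCAL` statement; add whatever connection flags your setup requires.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def split_csv(src, out_dir, lines_per_chunk):
    """Split src into files of at most lines_per_chunk lines each.

    Assumes no header row and no embedded newlines inside quoted fields.
    Returns the list of chunk paths in order.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunks = []
    out = None
    # newline="" keeps the original line endings untouched.
    with open(src, "r", newline="") as f:
        for i, line in enumerate(f):
            if i % lines_per_chunk == 0:
                if out is not None:
                    out.close()
                path = out_dir / f"chunk_{len(chunks):04d}.csv"
                out = open(path, "w", newline="")
                chunks.append(path)
            out.write(line)
    if out is not None:
        out.close()
    return chunks


def copy_statement(table, chunk):
    """Build a COPY statement that reads one chunk from the client machine."""
    return f"COPY {table} FROM LOCAL '{chunk}' DELIMITER ','"


def load_chunk(table, chunk):
    """Run one COPY through vsql; connection flags are omitted here (assumption)."""
    return subprocess.run(["vsql", "-c", copy_statement(table, chunk)], check=True)


def load_in_parallel(table, chunks, workers=4):
    """Load all chunks concurrently, raising if any vsql call fails."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(lambda c: load_chunk(table, c), chunks))
```

A typical call would be `parts = split_csv("large_file.csv", "chunks", 1_000_000)` followed by `load_in_parallel("my_table", parts)`. Threads are enough here because each worker only waits on an external `vsql` process; the right number of workers depends on the node's cores and disk, so treat the default of 4 as a starting guess.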

	CREATE TEMPORARY FUNCTION

	-- In this function, we're going to be working on arrays of values.
	-- We're also going to define a set of helper functions 'inside' kMeans.

	-- heavily borrowing from https://github.com/NathanEpstein/clusters --

	kMeans(x ARRAY<FLOAT64>, -- ESR1 gene expression
	y ARRAY<FLOAT64>, -- EGFR gene expression
	iterations FLOAT64, -- the number of iterations