How To Load JSON, Avro, Parquet, Or Anything Else in PostgreSQL Easily and Reliably

March 7, 2024 By Nicolas Even

PostgreSQL is a great database, but the options for loading data files are so few and poor that finding a good way to do that is often a headache. It is all the more frustrating that other databases support dozens of input formats natively. I’ve come up with a solution that’s easy, reliable, and requires no extra tools.

PostgreSQL supports three input formats: CSV, TEXT (a tsv-like format), and BINARY. The first two formats aren’t standardized, making it hard to convert data to the right format, and even to remember what the right format is1. The BINARY format is great, but there are no tools that I know of that take advantage of it.

The Solution: NDJSON as a 1-column csv file

JSON, or more precisely newline-delimited JSON (NDJSON) is a clearly defined format, has useful, albeit few, datatypes, and, most importantly, most data formats can be converted to it with existing tools.

To import an NDJSON file we’ll proceed in two steps:

  1. Load the NDJSON file in a temporary table as a JSONB column
  2. Convert the data to the format we want

For the rest of the article, the commands will be based on this file:

{ "num": 1, "arr": [1, 2], "txt": "abcd", "dt": 1709754503162 }
{ "num": 2, "arr": [3,4], "txt": "abcd", "dt": 1709754507812 }

Step 1: Load The File

First, create the staging table:

CREATE table staging(data jsonb);

Then, and here’s where the magic happens, load the NDJSON file:

cat data.ndjson | psql -c "copy temp from stdin (format csv, delimiter E'\1', quote E'\2')"

We use the CSV import mode and set the escape and quote characters to control characters (which are disallowed in JSON) so the file is imported as a single-column CSV file.

Note: You may be tempted to use the TEXT format, but this won’t work. This format uses backslash sequences that are incompatible with the JSON backslash sequences: PostgreSQL will replace \u with u, therefore corrupting JSON escape sequences such as \uEDED.

Step 2: Convert The Data

PostgreSQL has many JSON functions, so how you do it is a matter of taste.

The easiest way may be to use array-style brackets:

create table target as
to_timestamp(data['dt'] / 1000.0) dt
from staging;

What about other file formats?

Just convert them to JSON!

For example with Apache AVRO:

avro-tools tojson file.avro | psql -c "copy temp from stdin (format csv, delimiter E'\1', quote E'\2')"

  1. How do you quote an array? How do you format a date? ↩︎