Detect
In an ideal world, all data is strongly typed, and there’s no ambiguity. Take a DB source:

```shell
$ sq add postgres://sakila@localhost/sakila -p <<< $PASSWD
```

The driver type is known (`postgres`), each of the tables in the DB has a distinct name (e.g. `actor`), and each column in the table has a distinct name (`actor_id`) and type (`INTEGER`), as well as other attributes such as nullability.
However, this doesn’t hold for all sources. In particular, it’s not the case for some file-based sources such as CSV. Let’s say we add a CSV file:

```shell
$ sq add ./actor.csv
```

That CSV file may contain the same values as the Postgres `actor` table, but being an untyped text-based format, the CSV data doesn’t have all the goodness of the Postgres table. `sq` uses various “detectors” to bridge that gap.
Driver type
When adding a file-based source, you can use the `--driver` flag (e.g. `--driver=csv`) to explicitly specify the file’s driver type. However, you can omit that flag, and `sq` will generally figure out the appropriate driver. This mechanism works fairly reliably, but you can always fall back to explicitly supplying the driver type.

```shell
# Explicit
$ sq add --driver=sqlite3 ./sakila.db
@sakila  sqlite3  sakila.db

# Auto-detected
$ sq add ./sakila.db
@sakila  sqlite3  sakila.db
```
`sq` has driver-type detection for SQLite, Excel, the three JSON variants (JSON, JSONA, JSONL), and CSV/TSV.
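To illustrate the general idea of driver-type detection, here’s a minimal Go sketch based on file signatures (`sq` itself is written in Go). The SQLite and ZIP magic bytes below are real file signatures, but the logic is a simplification for illustration, not `sq`’s actual detector.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
)

// detectDriver inspects the first bytes of a file and guesses a driver
// type from well-known file signatures. Simplified for illustration.
func detectDriver(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	header := make([]byte, 16)
	n, _ := f.Read(header)
	header = header[:n]

	switch {
	case bytes.HasPrefix(header, []byte("SQLite format 3\x00")):
		// Every SQLite database file begins with this 16-byte magic.
		return "sqlite3", nil
	case bytes.HasPrefix(header, []byte("PK")):
		// .xlsx files are ZIP archives; a real detector would also
		// examine the archive contents.
		return "xlsx", nil
	default:
		// A real detector would sample the content to distinguish
		// JSON/JSONA/JSONL from CSV/TSV; we punt here.
		return "csv", nil
	}
}

func main() {
	driver, err := detectDriver("./sakila.db")
	if err != nil {
		panic(err)
	}
	fmt.Println(driver) // sqlite3
}
```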
Header row
When adding a CSV or Excel source, the source datafile doesn’t explicitly state whether the first row of data is a header row. This is important to determine, so that the header row isn’t treated as a data row. Take two distinct CSV files, `actor_header.csv`:

```csv
actor_id,first_name,last_name,last_update
1,PENELOPE,GUINESS,2020-02-15T06:59:28Z
2,NICK,WAHLBERG,2020-02-15T06:59:28Z
```

and `actor_no_header.csv`:

```csv
1,PENELOPE,GUINESS,2020-02-15T06:59:28Z
2,NICK,WAHLBERG,2020-02-15T06:59:28Z
```
For the latter case, `sq` automatically assigns generated column names `A`, `B`, `C`. (Note that the generated column names can be configured.)
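For illustration, here’s one way spreadsheet-style column names (`A`, `B`, …, `Z`, `AA`, `AB`, …) can be generated in Go. This is a sketch of the naming scheme only; `sq`’s actual generator and its configuration mechanism may differ.

```go
package main

import "fmt"

// genColName returns a spreadsheet-style column name for a zero-based
// column index: 0 -> "A", 25 -> "Z", 26 -> "AA", 27 -> "AB", etc.
// (bijective base-26 numbering).
func genColName(i int) string {
	name := ""
	for i >= 0 {
		name = string(rune('A'+i%26)) + name
		i = i/26 - 1
	}
	return name
}

func main() {
	for i := 0; i < 4; i++ {
		fmt.Print(genColName(i), " ") // A B C D
	}
}
```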
For the first case, the column names are present in the first row of the file, and `sq` is generally able to figure this out. It does this by running an algorithm on a sample of the data: if the first row of data appears to be different from the other rows, `sq` marks that first row as a header row, and thus can determine the correct column names.

The header-detection algorithm is relatively naive, and can be defeated if the number of sample rows is small, or if the header row and data rows are all the same data kind (e.g. everything is a string). If that happens, you can explicitly tell `sq` that a header row is present (or not):

```shell
# Explicitly specify that a header row exists
$ sq add --ingest.header ./actor_header.csv

# Explicitly specify no header row
$ sq add --ingest.header=false ./actor_no_header.csv
```
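To make the heuristic concrete, here’s a toy Go sketch of this style of detection: if some column parses as numeric in every sampled data row but not in the first row, the first row is probably a header. As described above, all-text columns defeat this approach. This illustrates the idea only; it is not `sq`’s actual algorithm.

```go
package main

import (
	"fmt"
	"strconv"
)

// isNumeric reports whether s parses as an int or float.
func isNumeric(s string) bool {
	if _, err := strconv.ParseInt(s, 10, 64); err == nil {
		return true
	}
	_, err := strconv.ParseFloat(s, 64)
	return err == nil
}

// looksLikeHeader guesses whether the first row of sample is a header:
// true if some column is numeric in every data row but non-numeric in
// the first row.
func looksLikeHeader(sample [][]string) bool {
	if len(sample) < 2 {
		return false
	}
	for col := range sample[0] {
		dataNumeric := true
		for _, row := range sample[1:] {
			if !isNumeric(row[col]) {
				dataNumeric = false
				break
			}
		}
		if dataNumeric && !isNumeric(sample[0][col]) {
			return true
		}
	}
	return false
}

func main() {
	sample := [][]string{
		{"actor_id", "first_name"},
		{"1", "PENELOPE"},
		{"2", "NICK"},
	}
	fmt.Println(looksLikeHeader(sample)) // true
}
```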
If `sq` fails to detect the header row correctly in your data, but it seems like it should be able to, please open an issue, and attach a copy of your data (sanitized if necessary).

Column kind
When ingesting untyped data, `sq` determines a lowest-common-denominator “kind” for each column, e.g. `int`, `float`, `datetime`, etc. If a kind cannot be determined, the column typically is treated as `text`. You can think of `sq`’s “kind” mechanism as an internal, generalized, intermediate column type.
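Conceptually, you can picture kind as a small enum that every backend-specific type maps into. Here’s an illustrative Go declaration; this is a sketch, not `sq`’s actual type.

```go
package kind

// Kind is a generalized, intermediate column type that backend-specific
// types (e.g. Postgres int4, SQLite INTEGER) both map into.
// Illustrative sketch only.
type Kind int

const (
	Unknown Kind = iota
	Null
	Text
	Int
	Float
	Decimal
	Bool
	Bytes
	Datetime
	Date
	Time
)
```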
Let’s go back to the CSV example:

```csv
actor_id,first_name,last_name,last_update
1,PENELOPE,GUINESS,2020-02-15T06:59:28Z
2,NICK,WAHLBERG,2020-02-15T06:59:28Z
```
And inspect that source:

```shell
$ sq inspect -v @actor.data
NAME  TYPE   ROWS  COLS  NAME         TYPE      PK
data  table  200   4     actor_id     INTEGER
                         first_name   TEXT
                         last_name    TEXT
                         last_update  DATETIME
```

Note the `TYPE` value for each column. This is the type of the column in the ingest DB table that `sq` ingests the CSV data into. In this case, the ingest DB is actually a SQLite DB, and thus the `TYPE` is SQLite data type `INTEGER`.
For the equivalent Postgres source, note the different `TYPE` value:

```shell
$ sq inspect @sakila/pg12.actor -v
NAME   TYPE   ROWS  COLS  NAME         TYPE       PK
actor  table  200   4     actor_id     int4       pk
                          first_name   varchar
                          last_name    varchar
                          last_update  timestamp
```

In the CSV case, the `actor_id` column ultimately maps to SQLite `INTEGER`, while in Postgres, the type is `int4`. Because `sq` supports multiple backend DB implementations, there is an internal `sq` representation of data types: kind.

If we run `inspect` again on the sources, but this time output in YAML, we can see the kind for each column.
```yaml
# sq inspect --yaml @actor.data
name: data
table_type: table
table_type_db: table
row_count: 200
columns:
  - name: actor_id
    position: 0
    primary_key: false
    base_type: INTEGER
    column_type: INTEGER
    kind: int
    nullable: true
  - name: first_name
    position: 1
    primary_key: false
    base_type: TEXT
    column_type: TEXT
    kind: text
    nullable: true
# [Truncated for brevity]
```

```yaml
# sq inspect --yaml @sakila/pg12.actor
name: actor
name_fq: sakila.public.actor
table_type: table
table_type_db: BASE TABLE
row_count: 200
size: 73728
columns:
  - name: actor_id
    position: 1
    primary_key: true
    base_type: int4
    column_type: integer
    kind: int
    nullable: false
    default_value: nextval('actor_actor_id_seq'::regclass)
  - name: first_name
    position: 2
    primary_key: false
    base_type: varchar
    column_type: character varying
    kind: text
    nullable: false
# [Truncated for brevity]
```
Note that although the `column_type` values differ, both sources have the same `kind` for each column.
Kinds
`sq` defines a kind for the most common data types.
Kind | Example |
---|---|
text | seven |
int | 7 |
float | 7.12344556 |
decimal | 7.12 |
bool | true |
datetime | 1977-11-07T17:59:28Z |
date | 1977-11-07 |
time | 17:59:28 |
bytes | aHV6emFoCg== |
There are also the special kinds `null` and `unknown`, but the end-user is typically not exposed to them.

Kind detection
When `sq` ingests untyped data, it samples multiple values from each column. As `sq` iterates over those values, it’s able to eliminate various kinds. For example, if the sampled value is 1977-11-07, the kind cannot be `int`, or `float`, or `bool`, etc. By a process of elimination, `sq` ends up with a column kind. If there’s ambiguity in the sample data, the determined kind will be `text`.

In practice, `sq` runs the sample data through a series of parsing functions. If a parsing function fails, that kind is eliminated. For some kinds there are multiple parsing functions to try. In particular, there are many date and time formats that `sq` can parse.
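Here’s a condensed Go sketch of that elimination process, using only a handful of kinds and a single parsing function per kind. `sq`’s real detector covers more kinds and many more parsing formats (notably the date/time layouts listed below), so treat this as an illustration of the technique, not the actual implementation.

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// parsers maps each kind to a function reporting whether a value parses
// as that kind. A real detector would try multiple layouts per kind.
var parsers = map[string]func(string) bool{
	"int":   func(s string) bool { _, err := strconv.ParseInt(s, 10, 64); return err == nil },
	"float": func(s string) bool { _, err := strconv.ParseFloat(s, 64); return err == nil },
	"bool":  func(s string) bool { _, err := strconv.ParseBool(s); return err == nil },
	"date":  func(s string) bool { _, err := time.Parse("2006-01-02", s); return err == nil },
}

// detectKind eliminates any kind whose parser fails on a sample value.
// If no kind survives, it falls back to "text".
func detectKind(samples []string) string {
	candidates := map[string]bool{"int": true, "float": true, "bool": true, "date": true}
	for _, s := range samples {
		for kind := range candidates {
			if !parsers[kind](s) {
				delete(candidates, kind) // eliminated
			}
		}
	}
	// Prefer the most specific surviving kind; ordering is simplified here.
	for _, kind := range []string{"int", "bool", "date", "float"} {
		if candidates[kind] {
			return kind
		}
	}
	return "text"
}

func main() {
	fmt.Println(detectKind([]string{"1977-11-07", "2020-02-15"})) // date
	fmt.Println(detectKind([]string{"1", "2", "3"}))              // int
	fmt.Println(detectKind([]string{"seven", "7"}))               // text
}
```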
Date/time formats
Listed below are the various `datetime`, `date`, and `time` formats that `sq` can detect. Note that some formats have unlisted variants to accept timezones, or various degrees of precision: e.g. RFC3339 can accept second precision (e.g. 2006-01-02T15:04:05), all the way down to nanosecond precision (2006-01-02T15:04:05.999999999Z).
Kind | Example | Note |
---|---|---|
datetime | 2006-01-02T15:04:05Z | RFC3339 |
datetime | Mon Jan 2 15:04:05 2006 | ANSI C |
datetime | Mon Jan 2 15:04:05 MST 2006 | Unix date |
datetime | Mon Jan 02 15:04:05 -0700 2006 | Ruby date |
datetime | 02 Jan 06 15:04 MST | RFC822 |
datetime | Monday, 02-Jan-06 15:04:05 MST | RFC850 |
datetime | Mon, 02 Jan 2006 15:04:05 MST | RFC1123 |
datetime | Monday, January 2, 2006 | Excel long date |
datetime | 2006-01-02 15:04:05 | |
datetime | 01/2/06 15:04:05 | |
date | 2006-01-02 | |
date | 02 Jan 2006 | |
date | 2006/01/02 | |
date | 01-02-06 | This is month-day-year. Try to avoid this format due to month-first vs day-first confusion. |
date | 01-02-2006 | Also month-day-year: try to avoid. |
date | 02-Jan-2006 | |
date | 2-Jan-2006 | |
date | 2-Jan-06 | |
date | Jan 2, 2006 | |
date | Monday, January 2, 2006 | |
date | Mon, January 2, 2006 | |
date | Mon, Jan 2, 2006 | |
date | January 2, 2006 | |
date | 2/Jan/06 | |
time | 15:04:05 | |
time | 15:04 | |
time | 3:04PM | |
time | 3:04 PM | |
time | 3:04pm | |
time | 3:04 pm | |
If a `datetime` format does not specify a timezone/offset, the value is ingested as UTC.
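For example, using Go’s standard time package, a layout without a zone can be parsed relative to UTC explicitly. This illustrates the UTC-default behavior described above; it is not `sq`’s actual ingest code.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// The layout has no timezone/offset, so parse relative to UTC,
	// mirroring how such values are ingested as UTC.
	t, err := time.ParseInLocation("2006-01-02 15:04:05", "2020-02-15 06:59:28", time.UTC)
	if err != nil {
		panic(err)
	}
	fmt.Println(t) // 2020-02-15 06:59:28 +0000 UTC
}
```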