Detect

In an ideal world, all data is strongly typed, and there’s no ambiguity. Take a DB source:

$ sq add postgres://sakila@localhost/sakila -p <<< $PASSWD

The driver type is known (postgres), each of the tables in the DB has a distinct name (e.g. actor), and each column in the table has a distinct name (actor_id) and type (INTEGER), as well as other attributes such as nullability.

However, this doesn’t hold for all sources. In particular, it’s not the case for some file-based sources such as CSV. Let’s say we add a CSV file:

$ sq add ./actor.csv

That CSV file may contain the same values as the Postgres actor table, but because CSV is an untyped, text-based format, the data doesn't come with the metadata (driver type, column names, column types) that the Postgres table provides. sq uses various “detectors” to bridge that gap.

Driver type

When adding a file-based source, you can use the --driver flag (e.g. --driver=csv) to explicitly specify the file’s driver type. However, you can omit that flag, and sq will generally figure out the appropriate driver. This mechanism works fairly reliably, but you can always fall back to explicitly supplying the driver type.

# Explicit
$ sq add --driver=sqlite3 ./sakila.db
@sakila  sqlite3  sakila.db

# Auto-detected
$ sq add ./sakila.db
@sakila  sqlite3  sakila.db

sq has driver-type detection for SQLite, Excel, the three JSON variants (JSON, JSONA, JSONL), and CSV/TSV.
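
To illustrate the general idea, here's a minimal Go sketch of content-based detection. It is not sq's actual implementation; it only shows how a detector might sniff a file's leading bytes before falling back to more expensive analysis:

package main

import (
	"bytes"
	"fmt"
	"os"
)

// detectDriver sniffs the leading bytes of a file to guess a driver
// type. Illustrative sketch only, not sq's implementation.
func detectDriver(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	buf := make([]byte, 16)
	n, _ := f.Read(buf)
	buf = buf[:n]

	trimmed := bytes.TrimLeft(buf, " \t\r\n")
	switch {
	case bytes.HasPrefix(buf, []byte("SQLite format 3\x00")):
		return "sqlite3", nil // SQLite's well-known 16-byte header
	case bytes.HasPrefix(buf, []byte("PK\x03\x04")):
		// .xlsx is a ZIP archive; a real detector must also inspect
		// the archive contents, since any ZIP file starts this way.
		return "xlsx", nil
	case bytes.HasPrefix(trimmed, []byte("{")), bytes.HasPrefix(trimmed, []byte("[")):
		return "json", nil // likely one of the JSON variants
	default:
		return "csv", nil // fall back to delimiter analysis (not shown)
	}
}

func main() {
	driver, err := detectDriver("sakila.db")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(driver)
}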

Header row

When adding a CSV or Excel source, the source datafile doesn’t explicitly state whether the first row of data is a header row. This is important to determine, so that the header row isn’t treated as a data row. Take two distinct CSV files, actor_header.csv:

actor_id,first_name,last_name,last_update
1,PENELOPE,GUINESS,2020-02-15T06:59:28Z
2,NICK,WAHLBERG,2020-02-15T06:59:28Z

and actor_no_header.csv:

1,PENELOPE,GUINESS,2020-02-15T06:59:28Z
2,NICK,WAHLBERG,2020-02-15T06:59:28Z

For the second file, sq automatically assigns generated column names: A, B, C, D. (Note that the generated column names can be configured.)

For the first file, the column names are present in the first row of the file, and sq is generally able to figure this out. It runs an algorithm on a sample of the data: if the first row appears to be different from the other rows, sq marks that row as a header row, and thus can determine the correct column names.

The header-detection algorithm is relatively naive, and can be defeated if the number of sample rows is small, or if the header row and data rows are all the same data kind (e.g. everything is a string). If that happens, you can explicitly tell sq that a header row is present (or not):

# Explicitly specify that a header row exists
$ sq add --ingest.header ./actor_header.csv

# Explicitly specify no header row
$ sq add --ingest.header=false ./actor_no_header.csv
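
To make the heuristic concrete, here is a simplified Go sketch of this style of header detection. It is not sq's actual algorithm, and the kindOf classifier is a crude stand-in. The idea: if a column's data rows agree on a kind but the first row disagrees, the first row is probably a header.

package main

import (
	"fmt"
	"strconv"
)

// kindOf is a crude cell classifier, used only for this sketch.
func kindOf(s string) string {
	if _, err := strconv.Atoi(s); err == nil {
		return "int"
	}
	if _, err := strconv.ParseFloat(s, 64); err == nil {
		return "float"
	}
	if _, err := strconv.ParseBool(s); err == nil {
		return "bool"
	}
	return "text"
}

// hasHeader is a simplified sketch of header detection, not sq's
// actual algorithm: if any column's first value has a different kind
// than that column's otherwise-consistent sampled values, assume the
// first row is a header.
func hasHeader(sample [][]string) bool {
	if len(sample) < 2 {
		return false // too few rows to compare
	}
	for col := range sample[0] {
		first := kindOf(sample[0][col])
		rest := kindOf(sample[1][col])
		consistent := true
		for _, row := range sample[2:] {
			if kindOf(row[col]) != rest {
				consistent = false
				break
			}
		}
		// The data rows agree on a kind, but row 0 differs: header.
		if consistent && first != rest {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(hasHeader([][]string{
		{"actor_id", "first_name"},
		{"1", "PENELOPE"},
		{"2", "NICK"},
	})) // true: "actor_id" is text, but the rest of its column is int
}

The sketch also exhibits the failure mode described above: in a column that is text throughout (e.g. first_name), the header value is indistinguishable from a data value.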

Column kind

When ingesting untyped data, sq determines a lowest-common-denominator “kind” for each column, e.g. int, float, datetime, etc. If a kind cannot be determined, the column is typically treated as text. You can think of sq’s “kind” mechanism as an internal, generalized, intermediate column type.

Let’s go back to the CSV example:

actor_id,first_name,last_name,last_update
1,PENELOPE,GUINESS,2020-02-15T06:59:28Z
2,NICK,WAHLBERG,2020-02-15T06:59:28Z

And inspect that source:

$ sq inspect -v @actor.data
NAME  TYPE   ROWS  COLS  NAME         TYPE      PK
data  table  200   4     actor_id     INTEGER
                         first_name   TEXT
                         last_name    TEXT
                         last_update  DATETIME

Note the TYPE value for each column. This is the column's type in the ingest DB table that sq ingests the CSV data into. In this case, the ingest DB is actually a SQLite DB, and thus TYPE is the SQLite data type INTEGER.

For the equivalent Postgres source, note the different TYPE value:

$ sq inspect @sakila/pg12.actor -v
NAME   TYPE   ROWS  COLS  NAME         TYPE       PK
actor  table  200   4     actor_id     int4       pk
                          first_name   varchar
                          last_name    varchar
                          last_update  timestamp

In the CSV case, the actor_id column ultimately maps to SQLite INTEGER, while in Postgres, the type is int4. Because sq supports multiple backend DB implementations, there is an internal sq representation of data types: kind.

If we run inspect again on the sources, but this time outputting YAML, we can see the kind for each column.

$ sq inspect --yaml @actor.data
name: data
table_type: table
table_type_db: table
row_count: 200
columns:
  - name: actor_id
    position: 0
    primary_key: false
    base_type: INTEGER
    column_type: INTEGER
    kind: int
    nullable: true
  - name: first_name
    position: 1
    primary_key: false
    base_type: TEXT
    column_type: TEXT
    kind: text
    nullable: true
# [Truncated for brevity]
$ sq inspect --yaml @sakila/pg12.actor
name: actor
name_fq: sakila.public.actor
table_type: table
table_type_db: BASE TABLE
row_count: 200
size: 73728
columns:
- name: actor_id
  position: 1
  primary_key: true
  base_type: int4
  column_type: integer
  kind: int
  nullable: false
  default_value: nextval('actor_actor_id_seq'::regclass)
- name: first_name
  position: 2
  primary_key: false
  base_type: varchar
  column_type: character varying
  kind: text
  nullable: false
# [Truncated for brevity]

Note that although the column_type values differ, both sources have the same kind for each column.

Kinds

sq defines a kind for the most common data types.

Kind      Example
text      seven
int       7
float     7.12344556
decimal   7.12
bool      true
datetime  1977-11-07T17:59:28Z
date      1977-11-07
time      17:59:28
bytes     aHV6emFoCg==
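
This set of kinds is naturally modeled as an enumeration. Here's a hypothetical Go sketch (the names are illustrative; this is not sq's actual source):

package kind

// Kind models sq's internal, intermediate column type. Hypothetical
// sketch; the constant names simply mirror the table above.
type Kind int

const (
	Unknown Kind = iota
	Text
	Int
	Float
	Decimal
	Bool
	Datetime
	Date
	Time
	Bytes
)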

Kind detection

When sq ingests untyped data, it samples multiple values from each column. As sq iterates over those values, it’s able to eliminate various kinds. For example, if the sampled value is 1977-11-07, the kind cannot be int, or float, or bool, etc. By a process of elimination, sq ends up with a column kind. If there’s ambiguity in the sample data, the determined kind will be text.

In practice, sq runs the sample data through a series of parsing functions. If a parsing function fails, that kind is eliminated. For some kinds there are multiple parsing functions to try. In particular, there are many date and time formats that sq can parse.
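
Here's a minimal Go sketch of this elimination approach. It is illustrative only: the kinds are plain strings, and the parse tests are simplified stand-ins for sq's real detector.

package main

import (
	"fmt"
	"strconv"
	"time"
)

// parsers maps each candidate kind to a parse test. A value that fails
// the test eliminates that kind.
var parsers = map[string]func(string) bool{
	"int":      func(s string) bool { _, err := strconv.ParseInt(s, 10, 64); return err == nil },
	"float":    func(s string) bool { _, err := strconv.ParseFloat(s, 64); return err == nil },
	"bool":     func(s string) bool { _, err := strconv.ParseBool(s); return err == nil },
	"datetime": func(s string) bool { _, err := time.Parse(time.RFC3339, s); return err == nil },
	"date":     func(s string) bool { _, err := time.Parse("2006-01-02", s); return err == nil },
}

// detectKind eliminates kinds as samples fail to parse, falling back
// to text when no specific kind survives.
func detectKind(samples []string) string {
	alive := map[string]bool{"int": true, "float": true, "bool": true, "datetime": true, "date": true}
	for _, s := range samples {
		for kind := range alive {
			if !parsers[kind](s) {
				delete(alive, kind)
			}
		}
	}
	// Prefer the most specific surviving kind: every int also parses
	// as a float, so int must win when both survive.
	for _, kind := range []string{"int", "bool", "datetime", "date", "float"} {
		if alive[kind] {
			return kind
		}
	}
	return "text"
}

func main() {
	fmt.Println(detectKind([]string{"1", "2", "3"}))              // int
	fmt.Println(detectKind([]string{"1977-11-07", "2020-02-15"})) // date
	fmt.Println(detectKind([]string{"seven", "8"}))               // text
}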

Date/time formats

Listed below are the various datetime, date, and time formats that sq can detect. Note that some formats have unlisted variants that accept timezones or varying degrees of precision: e.g. RFC3339 can accept second precision (2006-01-02T15:04:05), all the way down to nanosecond precision (2006-01-02T15:04:05.999999999Z).

Kind      Example                         Note
datetime  2006-01-02T15:04:05Z            RFC3339
datetime  Mon Jan 2 15:04:05 2006         ANSI C
datetime  Mon Jan  2 15:04:05 MST 2006    Unix date
datetime  Mon Jan 02 15:04:05 -0700 2006  Ruby date
datetime  02 Jan 06 15:04 MST             RFC822
datetime  Monday, 02-Jan-06 15:04:05 MST  RFC850
datetime  Mon, 02 Jan 2006 15:04:05 MST   RFC1123
datetime  Monday, January 2, 2006         Excel long date
datetime  2006-01-02 15:04:05
datetime  01/2/06 15:04:05
date      2006-01-02
date      02 Jan 2006
date      2006/01/02
date      01-02-06                        This is month-day-year. Try to avoid this format due to month-first vs day-first confusion.
date      01-02-2006                      Also month-day-year: try to avoid.
date      02-Jan-2006
date      2-Jan-2006
date      2-Jan-06
date      Jan 2, 2006
date      Monday, January 2, 2006
date      Mon, January 2, 2006
date      Mon, Jan 2, 2006
date      January 2, 2006
date      2/Jan/06
time      15:04:05
time      15:04
time      3:04PM
time      3:04 PM
time      3:04pm
time      3:04 pm
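
Incidentally, the examples above are written in Go's reference-time layout form (Go's time package expresses layouts using the reference time Mon Jan 2 15:04:05 MST 2006). A small sketch of trying multiple layouts in sequence, using an illustrative subset of the table:

package main

import (
	"fmt"
	"time"
)

// layouts is an illustrative subset of the formats in the table above,
// not sq's full list.
var layouts = []string{
	time.RFC3339,          // 2006-01-02T15:04:05Z07:00
	"2006-01-02 15:04:05",
	"2006-01-02",
	"02 Jan 2006",
	"Jan 2, 2006",
	"15:04:05",
	"3:04PM",
}

// parseAny tries each layout in turn. A detector can treat "parses
// under some known layout" as a single parse test for the kind.
func parseAny(s string) (time.Time, bool) {
	for _, layout := range layouts {
		if t, err := time.Parse(layout, s); err == nil {
			return t, true
		}
	}
	return time.Time{}, false
}

func main() {
	t, ok := parseAny("1977-11-07")
	fmt.Println(t, ok) // 1977-11-07 00:00:00 +0000 UTC true
}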