@@ -8,28 +8,44 @@ the query, the engine uses the most efficient store.
 This is one of the many features that makes CrateDB very fast when reading
 and aggregating data, but it has an impact on storage size.

-We are going to
-use [Yellow taxi trip - January 2024](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page)
-which has 2_964_624 rows.
+We are going to use [Yellow taxi trip - January 2024](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page) which has 2_964_624 rows.

-| VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge | Airport_fee |
-| ----------| ----------------------| -----------------------| -----------------| ---------------| ------------| --------------------| --------------| --------------| --------------| -------------| -------| ---------| ------------| --------------| -----------------------| --------------| ----------------------| -------------|
-| 2 | 1704073016000 | 1704074392000 | 4 | 6.88 | 1 | "N" | 170 | 231 | 1 | 32.4 | 1 | 0.5 | 7.48 | 0 | 1 | 44.88 | 2.5 | 0 |
-| 1 | 1704071008000 | 1704072649000 | 0 | 4.1 | 1 | "N" | 148 | 233 | 2 | 22.6 | 3.5 | 0.5 | 0 | 0 | 1 | 27.6 | 2.5 | 0 |
-| 1 | 1704071126000 | 1704071510000 | 2 | 1 | 1 | "N" | 140 | 141 | 1 | 7.9 | 3.5 | 0.5 | 2.55 | 0 | 1 | 15.45 | 2.5 | 0 |
-| 2 | 1704072696000 | 1704073070000 | 1 | 1.03 | 1 | "N" | 262 | 75 | 1 | 8.6 | 1 | 0.5 | 2.72 | 0 | 1 | 16.32 | 2.5 | 0 |
-| 2 | 1704074134000 | 1704074399000 | 1 | 1.08 | 1 | "N" | 249 | 68 | 1 | 7.2 | 1 | 0.5 | 2.44 | 0 | 1 | 14.64 | 2.5 | 0 |
+This is the schema:

-The taxi dataset takes:
+```sql
+CREATE TABLE IF NOT EXISTS "doc"."taxi" (
+   "VendorID" BIGINT,
+   "tpep_pickup_datetime" TIMESTAMP WITHOUT TIME ZONE,
+   "tpep_dropoff_datetime" TIMESTAMP WITHOUT TIME ZONE,
+   "passenger_count" BIGINT,
+   "trip_distance" REAL,
+   "RatecodeID" BIGINT,
+   "store_and_fwd_flag" TEXT,
+   "PULocationID" BIGINT,
+   "DOLocationID" BIGINT,
+   "payment_type" BIGINT,
+   "fare_amount" REAL,
+   "extra" REAL,
+   "mta_tax" REAL,
+   "tip_amount" REAL,
+   "tolls_amount" REAL,
+   "improvement_surcharge" REAL,
+   "total_amount" REAL,
+   "congestion_surcharge" REAL,
+   "Airport_fee" REAL
+)
+```
+
+It takes:

-- ~48MiB in Parquet (very optimized for storage)
-- ~342MiB in CSV
-- ~1.2GiB in JSON
-- ~510MiB in PostgreSQL 16.1 (Debian 16.1-1.pgdg120+1)
-- ~775MiB in CrateDB 5.9.3 (3 nodes, default settings)
-- ~431MiB in CrateDB 5.10.9 (3 nodes, default settings)
+- ~`48MiB` in Parquet
+- ~`342MiB` in CSV
+- ~`1.2GiB` in JSON
+- ~`510MiB` in PostgreSQL 16.1 (Debian 16.1-1.pgdg120+1)
+- ~`775MiB` in CrateDB 5.9.3 (3 nodes, default settings)

-We will dive deeper to really understand what is going on.
+At first sight, it might look like data takes more space in CrateDB than in PostgreSQL,
+but the reality is the opposite; we need to dive deeper to really understand what is going on.

 :::{note}
 In version 5.10 storage usage was improved, some users report up to 70% of storage reduction,
@@ -156,10 +172,10 @@ CREATE TABLE taxi_noindex
 )
 ```

-The index can only be disabled when the table is created, if the table already exists,
-it will have to be re-created.
+The index can only be disabled when the table is created; if the table already exists and it cannot
+be deleted, it will have to be re-created.

-One of the ways of re-creating a table is by `renaming`, for example:
+One of the ways of re-creating a table is by renaming it, for example:

 1. Rename table `taxi` (with INDEX) to `taxi_deleteme` with:

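For readers of this diff, the rename-and-re-create approach described in the hunk above could look roughly like the sketch below. The hunk only shows the first step (renaming `taxi` to `taxi_deleteme`); the remaining statements are an assumption about how such a workflow is typically completed, not the text of the page. The table is reduced to two columns for brevity, and the choice of `store_and_fwd_flag` as the column with `INDEX OFF` is purely illustrative.

```sql
-- 1. Move the existing (indexed) table out of the way.
ALTER TABLE "doc"."taxi" RENAME TO "taxi_deleteme";

-- 2. Re-create "taxi" with the index disabled on the chosen column.
--    Assumption: only two columns are shown; a real table would list all of them.
CREATE TABLE "doc"."taxi" (
   "VendorID" BIGINT,
   "store_and_fwd_flag" TEXT INDEX OFF
);

-- 3. Copy the data from the renamed table into the new one.
INSERT INTO "doc"."taxi" ("VendorID", "store_and_fwd_flag")
SELECT "VendorID", "store_and_fwd_flag"
FROM "doc"."taxi_deleteme";

-- 4. After checking the row counts match, drop the old table.
DROP TABLE "doc"."taxi_deleteme";
```

Keeping the old table until the copy has been verified means the original data stays available if anything goes wrong, at the cost of temporarily storing the dataset twice.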