Tutorial: Hello NYC
Prerequisites: Installed TimescaleDB
By now you should already have: installed TimescaleDB, experienced creating hypertables, and tried inserting and querying data (i.e., it's just SQL).
But a database is only as interesting as the insights it allows you to derive from your data.
For this tutorial, we've put together a sample data set from real-life New York City taxicab data (courtesy of the NYC Taxi and Limousine Commission).
TIP: For simplicity we'll assume that TimescaleDB is installed on a PostgreSQL server at localhost on the default port, and that a user postgres exists with full superuser access. If your setup is different, please modify the examples accordingly.
1. Download and Load Data
Let's start by downloading the dataset. In the interest of (downloading) time and space (on your machine), we'll only grab data for the month of January 2016.
This dataset contains two files:
1. nyc_data.sql - A SQL file that will set up the necessary tables
2. nyc_data_rides.csv - A CSV file with the ride data
First, create a database, e.g., nyc_data, with the extension:
CREATE DATABASE nyc_data;
\c nyc_data
CREATE EXTENSION IF NOT EXISTS timescaledb CASCADE;
Now, download the file nyc_data.tar.gz.
Then, follow these steps:
# (1) unzip the archive
tar -xvzf nyc_data.tar.gz
# (2) import the table schemas
psql -U postgres -d nyc_data -h localhost < nyc_data.sql
# (3) import data
psql -U postgres -d nyc_data -h localhost -c "\COPY rides FROM nyc_data_rides.csv CSV"
The data is now ready for you to use.
# To access your database
psql -U postgres -h localhost -d nyc_data
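The schema file also registers rides as a hypertable, which is what makes the time-based queries below fast. The exact statements in nyc_data.sql may differ, but conceptually the key step looks like this (create_hypertable is the TimescaleDB call that partitions a plain table by time):

```sql
-- Assumes a plain "rides" table has already been created; this
-- converts it into a hypertable partitioned on pickup_datetime.
SELECT create_hypertable('rides', 'pickup_datetime');
```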
2. Run Some Queries
Let's see what tables we have:
\dt
List of relations
Schema | Name | Type | Owner
--------+---------------+-------+----------
public | payment_types | table | postgres
public | rates | table | postgres
public | rides | table | postgres
(3 rows)
Most of our data is in the "rides" table. Let's take a closer look:
\d rides
Table "public.rides"
Column | Type | Modifiers
-----------------------+-----------------------------+-----------
vendor_id | text |
pickup_datetime | timestamp without time zone | not null
dropoff_datetime | timestamp without time zone | not null
passenger_count | numeric |
trip_distance | numeric |
pickup_longitude | numeric |
pickup_latitude | numeric |
rate_code | integer |
dropoff_longitude | numeric |
dropoff_latitude | numeric |
payment_type | integer |
fare_amount | numeric |
extra | numeric |
mta_tax | numeric |
tip_amount | numeric |
tolls_amount | numeric |
improvement_surcharge | numeric |
total_amount | numeric |
Indexes:
"rides_passenger_count_pickup_datetime_idx" btree (passenger_count, pickup_datetime DESC)
"rides_pickup_datetime_vendor_id_idx" btree (pickup_datetime DESC, vendor_id)
"rides_rate_code_pickup_datetime_idx" btree (rate_code, pickup_datetime DESC)
"rides_vendor_id_pickup_datetime_idx" btree (vendor_id, pickup_datetime DESC)
Let's run a query that TimescaleDB handles better than vanilla PostgreSQL:
-- Average fare amount of rides with 2+ passengers by day
SELECT date_trunc('day', pickup_datetime) as day, avg(fare_amount)
FROM rides
WHERE passenger_count > 1 AND pickup_datetime < '2016-01-08'
GROUP BY day ORDER BY day;
day | avg
--------------------+---------------------
2016-01-01 00:00:00 | 13.3990821679715529
2016-01-02 00:00:00 | 13.0224687415181399
2016-01-03 00:00:00 | 13.5382068607068607
2016-01-04 00:00:00 | 12.9618895561740149
2016-01-05 00:00:00 | 12.6614611935518309
2016-01-06 00:00:00 | 12.5775245695086098
2016-01-07 00:00:00 | 12.5868802584437019
(7 rows)
Some queries will execute over 20x faster on TimescaleDB than on vanilla PostgreSQL. Here's one example:
-- Total number of rides by day for first 5 days
SELECT date_trunc('day', pickup_datetime) as day, COUNT(*) FROM rides
GROUP BY day ORDER BY day
LIMIT 5;
day | count
--------------------+--------
2016-01-01 00:00:00 | 345037
2016-01-02 00:00:00 | 312831
2016-01-03 00:00:00 | 302878
2016-01-04 00:00:00 | 316171
2016-01-05 00:00:00 | 343251
(5 rows)
Now let's run a query beyond what vanilla PostgreSQL supports:
-- Number of rides by 5 minute intervals
-- (using the TimescaleDB "time_bucket" function)
SELECT time_bucket('5 minute', pickup_datetime) as five_min, count(*)
FROM rides
WHERE pickup_datetime < '2016-01-01 02:00'
GROUP BY five_min ORDER BY five_min;
five_min | count
---------------------+-------
2016-01-01 00:00:00 | 703
2016-01-01 00:05:00 | 1482
2016-01-01 00:10:00 | 1959
2016-01-01 00:15:00 | 2200
2016-01-01 00:20:00 | 2285
2016-01-01 00:25:00 | 2291
2016-01-01 00:30:00 | 2349
2016-01-01 00:35:00 | 2328
2016-01-01 00:40:00 | 2440
2016-01-01 00:45:00 | 2372
2016-01-01 00:50:00 | 2388
2016-01-01 00:55:00 | 2473
2016-01-01 01:00:00 | 2395
2016-01-01 01:05:00 | 2510
2016-01-01 01:10:00 | 2412
2016-01-01 01:15:00 | 2482
2016-01-01 01:20:00 | 2428
2016-01-01 01:25:00 | 2433
2016-01-01 01:30:00 | 2337
2016-01-01 01:35:00 | 2366
2016-01-01 01:40:00 | 2325
2016-01-01 01:45:00 | 2257
2016-01-01 01:50:00 | 2316
2016-01-01 01:55:00 | 2250
(24 rows)
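For fixed-length intervals like '5 minute', time_bucket is roughly equivalent to floor-division on the epoch timestamp. As an illustration (not TimescaleDB's actual implementation), the same bucketing can be sketched in vanilla PostgreSQL:

```sql
-- Roughly equivalent 5-minute bucketing using plain PostgreSQL
-- arithmetic: floor the epoch seconds to a multiple of 300.
SELECT to_timestamp(floor(extract(epoch FROM pickup_datetime) / 300) * 300)
         AT TIME ZONE 'UTC' AS five_min,
       count(*)
FROM rides
WHERE pickup_datetime < '2016-01-01 02:00'
GROUP BY five_min ORDER BY five_min;
```

time_bucket wraps this arithmetic in a single readable call and also supports arbitrary interval widths.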
3. Run Some Fancier Queries
Let's see what else is going on in the dataset.
-- Analyze rides by rate type
SELECT rate_code, COUNT(vendor_id) as num_trips FROM rides
WHERE pickup_datetime < '2016-01-08'
GROUP BY rate_code ORDER BY rate_code;
rate_code | num_trips
-----------+-----------
1 | 2266401
2 | 54832
3 | 4126
4 | 967
5 | 7193
6 | 17
99 | 42
(7 rows)
Unfortunately rate_code doesn't really tell us what these groups represent, and it doesn't look like there is any other info on rates in the rides table.
But it turns out that there is a separate rates table, and fortunately for us, TimescaleDB supports JOINs between tables: (In other words: with TimescaleDB you won't need to denormalize your data.)
-- Mapping of rate_code to text description
SELECT * FROM rates;
rate_code | description
-----------+-----------------------
1 | standard rate
2 | JFK
3 | Newark
4 | Nassau or Westchester
5 | negotiated fare
6 | group ride
(6 rows)
-- Join rides with rates to get more information on rate_code
SELECT rates.description, COUNT(vendor_id) as num_trips FROM rides
JOIN rates on rides.rate_code = rates.rate_code
WHERE pickup_datetime < '2016-01-08'
GROUP BY rates.description ORDER BY rates.description;
description | num_trips
-----------------------+-----------
JFK | 54832
Nassau or Westchester | 967
Newark | 4126
group ride | 17
negotiated fare | 7193
standard rate | 2266401
(6 rows)
Now we have something that is human readable. In particular, two of these rate types correspond to local airports (JFK, Newark). Let's take a closer look at those two:
-- Analysis of all JFK and EWR rides in Jan 2016
SELECT rates.description, COUNT(vendor_id) as num_trips,
AVG(dropoff_datetime - pickup_datetime) as avg_trip_duration, AVG(total_amount) as avg_total,
AVG(tip_amount) as avg_tip, MIN(trip_distance) as min_distance, AVG(trip_distance) as avg_distance, MAX(trip_distance) as max_distance,
AVG(passenger_count) as avg_passengers
FROM rides
JOIN rates on rides.rate_code = rates.rate_code
WHERE rides.rate_code in (2,3) AND pickup_datetime < '2016-02-01'
GROUP BY rates.description ORDER BY rates.description;
description | num_trips | avg_trip_duration | avg_total | avg_tip | min_distance | avg_distance | max_distance | avg_passengers
-------------+-----------+-------------------+---------------------+--------------------+--------------+---------------------+--------------+--------------------
JFK | 225019 | 00:45:46.822517 | 64.3278115181384683 | 7.3334228220728027 | 0.00 | 17.2602816651038357 | 221.00 | 1.7333869584346211
Newark | 16822 | 00:35:16.157472 | 86.4633688027582927 | 9.5461657353465700 | 0.00 | 16.2706122934252764 | 177.23 | 1.7435501129473309
(2 rows)
Now this is interesting:
- JFK is more than 10x more popular than Newark (assuming both airports have a similar number of flights per day).
- JFK is about 25% cheaper (most likely because of NJ tunnel and highway tolls).
- Newark trips however are 22% (10 min) shorter.
- Each airport is about 17 miles on average from the trip origination point.
- Each has on average ~1.74 passengers/trip, indicating some homogeneity between the two rate types.
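These percentages follow directly from the query output above; as a quick sanity check of the arithmetic (figures copied from the result table):

```python
# Figures taken from the JFK/Newark query output above.
jfk_trips, ewr_trips = 225019, 16822
jfk_secs = 45 * 60 + 46            # 00:45:46 average JFK trip duration
ewr_secs = 35 * 60 + 16            # 00:35:16 average Newark trip duration
jfk_total, ewr_total = 64.33, 86.46  # average total fare, dollars

print(jfk_trips / ewr_trips)       # ~13.4x more JFK trips
print(1 - ewr_secs / jfk_secs)     # Newark trips ~23% shorter (~10.5 min)
print(1 - jfk_total / ewr_total)   # JFK ~26% cheaper than Newark
```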
Here's an interesting insight:
If you need to book a flight out of NYC, think you will be in a rush, and don't mind paying a little extra (e.g., if you are splitting the fare), then you may want to consider flying out of Newark over JFK.
4. What's Happening Behind the Scenes
What's nice about TimescaleDB is that it lets you run these types of queries on large datasets without having to worry about partitioning / chunking, etc.
When you write data to the database, it gets automatically partitioned across space and time. Then, when you query your data, the database lets you run those queries against a hypertable: the abstraction of a single continuous table across all space and time intervals.
In TimescaleDB, hypertables can live alongside normal PostgreSQL tables.
In our example above, rides is a hypertable, while rates is a normal PostgreSQL table.
Let's peek behind the curtain using the native PostgreSQL EXPLAIN command, and see TimescaleDB at work:
-- Peek behind the scenes
EXPLAIN SELECT * FROM rides;
QUERY PLAN
-----------------------------------------------------------------------------------
Append (cost=0.00..258876.51 rows=3591152 width=472)
-> Seq Scan on rides (cost=0.00..0.00 rows=1 width=472)
-> Seq Scan on _hyper_1_1_chunk (cost=0.00..235846.56 rows=3253056 width=472)
-> Seq Scan on _hyper_1_2_chunk (cost=0.00..1043.31 rows=34831 width=113)
-> Seq Scan on _hyper_1_3_chunk (cost=0.00..21905.44 rows=302144 width=472)
-> Seq Scan on _hyper_1_4_chunk (cost=0.00..81.20 rows=1120 width=472)
(6 rows)
This shows that the hypertable rides is split across four chunks (_hyper_1_1_chunk, _hyper_1_2_chunk, _hyper_1_3_chunk, _hyper_1_4_chunk).
TIP: This is the schema as of TimescaleDB 0.1.0. Older versions of the database had a slightly different internal schema.
We can even query one of these chunks directly, accessing them via the private schema _timescaledb_internal:
SELECT COUNT(*) FROM _timescaledb_internal._hyper_1_2_chunk;
count
-------
34831
(1 row)
Feel free to play with the internal TimescaleDB tables. One of the advantages of TimescaleDB is that it is fully transparent: you can always peek behind the curtain and see what's going on backstage.
(To see all internal schemas, use command \dn.)
\dn
List of schemas
Name | Owner
-----------------------+----------
_timescaledb_cache | postgres
_timescaledb_catalog | postgres
_timescaledb_internal | postgres
public | postgres
(4 rows)
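For instance, the catalog schema records every hypertable the extension manages. A query along these lines lists them (the internal column names vary across TimescaleDB versions, so inspect the table with \d first):

```sql
-- List hypertables known to the internal catalog
-- (schema details are version-dependent).
SELECT schema_name, table_name FROM _timescaledb_catalog.hypertable;
```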