指南:你好NYC

必要条件: ​安装TimescaleDB​

By now you should already have: installed TimescaleDB, experienced creating hypertables, and tried inserting and querying data (i.e., it's just SQL).

现在你已经安装了TimescaleDB,创建过hypertables,并且尝试插入并查询数据(只是SQL)。

But a database is only as interesting as the insights it allows you to derive from your data.

但是一个数据库的有趣之处仅仅来源于数据。

For this tutorial, we've put together a sample data set from real-life New York City taxicab data (​courtesy of the NYC Taxi and Limousine Commission​).

在本教程中,我们收集了NYC的出租车数据(出租车豪华轿车委员会提供)作为样本数据。

TIP:For simplicity we'll assume that TimescaleDB is installed on a PostgreSQL server at ​localhost​ on the default port, and that a user ​postgres​ exists with full superuser access. If your setup is different, please modify the examples accordingly.

提示:为简单起见,我们假定timescaledb安装在​本地PostgreSQL服务器​默认端口,我们的用户postgres拥有绝对的超级用户访问权限。如果不一致,请据此修改您的设置。

1. 下载并加载数据

Let's start by downloading the dataset. In the interest of (downloading) time and space (on your machine), we'll only grab data for the month of January 2016.

先下载数据集。我们只采集2016年1月份的时间(下载)和空间(在你的机器上)数据。

This dataset contains two files:

​1. nyc_data.sql​ - A SQL file that will set up the necessary tables

  1. ​nyc_data_rides.csv​ - A CSV file with the ride data

该数据集包含两个文件:

  1. nyc_data.sql​ - 创建所需表的SQL文件
  2. ​nyc_data_rides.csv​ - 包含ride数据的CSV文件

First, create a database, e.g., ​nyc_data​ with the extension:

首先创建一个数据库,例如nyc_data​

CREATE DATABASE nyc_data;
\c nyc_data
CREATE EXTENSION IF NOT EXISTS timescaledb CASCADE;

Now, download the file ​​nyc_data.tar.gz​​.

下载nyc_data.tar.gz文件​​.

Then, follow these steps:

操作以下步骤:

# (1) unzip the archive
tar -xvzf nyc_data.tar.gz

# (2) import the table schemas
psql -U postgres -d nyc_data -h localhost < nyc_data.sql

# (3) import data
psql -U postgres -d nyc_data -h localhost -c "\COPY rides FROM nyc_data_rides.csv CSV"

The data is now ready for you to use.

数据已经可以使用了。

# To access your database
psql -U postgres -h localhost -d nyc_data

2. 运行查询语句

Let's see what tables we have:

看看我们都有哪些表:


\dt

           List of relations
 Schema |     Name      | Type  |  Owner
--------+---------------+-------+----------
 public | payment_types | table | postgres
 public | rates         | table | postgres
 public | rides         | table | postgres
(3 rows)

Most of our data is in the "rides" table. Let's take a closer look:

大多数数据都在“ride”表中。我们来看看“ride”表

\d rides

                      Table "public.rides"
        Column         |            Type             | Modifiers
-----------------------+-----------------------------+-----------
 vendor_id             | text                        |
 pickup_datetime       | timestamp without time zone | not null
 dropoff_datetime      | timestamp without time zone | not null
 passenger_count       | numeric                     |
 trip_distance         | numeric                     |
 pickup_longitude      | numeric                     |
 pickup_latitude       | numeric                     |
 rate_code             | integer                     |
 dropoff_longitude     | numeric                     |
 dropoff_latitude      | numeric                     |
 payment_type          | integer                     |
 fare_amount           | numeric                     |
 extra                 | numeric                     |
 mta_tax               | numeric                     |
 tip_amount            | numeric                     |
 tolls_amount          | numeric                     |
 improvement_surcharge | numeric                     |
 total_amount          | numeric                     |
Indexes:
    "rides_passenger_count_pickup_datetime_idx" btree (passenger_count, pickup_datetime DESC)
    "rides_pickup_datetime_vendor_id_idx" btree (pickup_datetime DESC, vendor_id)
    "rides_rate_code_pickup_datetime_idx" btree (rate_code, pickup_datetime DESC)
    "rides_vendor_id_pickup_datetime_idx" btree (vendor_id, pickup_datetime DESC)

Let's run a query that TimescaleDB handles better than vanilla PostgreSQL:


-- Average fare amount of rides with 2+ passengers by day
SELECT date_trunc('day', pickup_datetime) as day, avg(fare_amount)
  FROM rides
  WHERE passenger_count > 1 AND pickup_datetime < '2016-01-08'
  GROUP BY day ORDER BY day;

        day         |         avg
--------------------+---------------------
2016-01-01 00:00:00 | 13.3990821679715529
2016-01-02 00:00:00 | 13.0224687415181399
2016-01-03 00:00:00 | 13.5382068607068607
2016-01-04 00:00:00 | 12.9618895561740149
2016-01-05 00:00:00 | 12.6614611935518309
2016-01-06 00:00:00 | 12.5775245695086098
2016-01-07 00:00:00 | 12.5868802584437019
(7 rows)

Some queries will execute ​​over 20x​​ faster on TimescaleDB than on vanilla PostgreSQL. Here's one example:

-- Total number of rides by day for first 5 days
SELECT date_trunc('day', pickup_datetime) as day, COUNT(*) FROM rides
  GROUP BY day ORDER BY day
  LIMIT 5;

        day         | count
--------------------+--------
2016-01-01 00:00:00 | 345037
2016-01-02 00:00:00 | 312831
2016-01-03 00:00:00 | 302878
2016-01-04 00:00:00 | 316171
2016-01-05 00:00:00 | 343251
(5 rows)

Now let's run a query ​beyond​ what vanilla PostgreSQL supports:

我们现在来运行一个vanilla PostgreSQL不支持的查询语句:

-- Number of rides by 5 minute intervals
--   (using the TimescaleDB "time_bucket" function)
SELECT time_bucket('5 minute', pickup_datetime) as five_min, count(*)
  FROM rides
  WHERE pickup_datetime < '2016-01-01 02:00'
  GROUP BY five_min ORDER BY five_min;

      five_min       | count
---------------------+-------
 2016-01-01 00:00:00 |   703
 2016-01-01 00:05:00 |  1482
 2016-01-01 00:10:00 |  1959
 2016-01-01 00:15:00 |  2200
 2016-01-01 00:20:00 |  2285
 2016-01-01 00:25:00 |  2291
 2016-01-01 00:30:00 |  2349
 2016-01-01 00:35:00 |  2328
 2016-01-01 00:40:00 |  2440
 2016-01-01 00:45:00 |  2372
 2016-01-01 00:50:00 |  2388
 2016-01-01 00:55:00 |  2473
 2016-01-01 01:00:00 |  2395
 2016-01-01 01:05:00 |  2510
 2016-01-01 01:10:00 |  2412
 2016-01-01 01:15:00 |  2482
 2016-01-01 01:20:00 |  2428
 2016-01-01 01:25:00 |  2433
 2016-01-01 01:30:00 |  2337
 2016-01-01 01:35:00 |  2366
 2016-01-01 01:40:00 |  2325
 2016-01-01 01:45:00 |  2257
 2016-01-01 01:50:00 |  2316
 2016-01-01 01:55:00 |  2250
(24 rows)

3. Run Some Fancier Queries

Let's see what else is going on in the dataset.

我们来看看数据库中还运行了其他什么操作。

-- Analyze rides by rate type
SELECT rate_code, COUNT(vendor_id) as num_trips FROM rides
  WHERE pickup_datetime < '2016-01-08'
  GROUP BY rate_code ORDER BY rate_code;

 rate_code | num_trips
-----------+-----------
         1 |   2266401
         2 |     54832
         3 |      4126
         4 |       967
         5 |      7193
         6 |        17
        99 |        42
(7 rows)

Unfortunately ​rate_code​ doesn't really tell us what these groups represent, and it doesn't look like there is any other info on rates in the ​rides​ table.

But it turns out that there is a separate ​rates​ table, and fortunately for us, TimescaleDB supports JOINs between tables: ​(In other words: with TimescaleDB you won't need to denormalize your data.)​

-- Mapping of rate_code to text description
SELECT * FROM rates;

 rate_code |      description
-----------+-----------------------
         1 | standard rate
         2 | JFK
         3 | Newark
         4 | Nassau or Westchester
         5 | negotiated fare
         6 | group ride
(6 rows)

-- Join rides with rates to get more information on rate_code
SELECT rates.description, COUNT(vendor_id) as num_trips FROM rides
  JOIN rates on rides.rate_code = rates.rate_code
  WHERE pickup_datetime < '2016-01-08'
  GROUP BY rates.description ORDER BY rates.description;

      description      | num_trips
-----------------------+-----------
 JFK                   |     54832
 Nassau or Westchester |       967
 Newark                |      4126
 group ride            |        17
 negotiated fare       |      7193
 standard rate         |   2266401
(6 rows)

Now we have something that is human readable. In particular, two of these rate types correspond to local airports (JFK, Newark). Let's take a closer look at those two:

-- Analysis of all JFK and EWR rides in Jan 2016
SELECT rates.description, COUNT(vendor_id) as num_trips,
    AVG(dropoff_datetime - pickup_datetime) as avg_trip_duration, AVG(total_amount) as avg_total,
    AVG(tip_amount) as avg_tip, MIN(trip_distance) as min_distance, AVG(trip_distance) as avg_distance, MAX(trip_distance) as max_distance,
    AVG(passenger_count) as avg_passengers
  FROM rides
  JOIN rates on rides.rate_code = rates.rate_code
  WHERE rides.rate_code in (2,3) AND pickup_datetime < '2016-02-01'
  GROUP BY rates.description ORDER BY rates.description;

 description | num_trips | avg_trip_duration |      avg_total      |      avg_tip       | min_distance |    avg_distance     | max_distance |   avg_passengers
-------------+-----------+-------------------+---------------------+--------------------+--------------+---------------------+--------------+--------------------
 JFK         |    225019 | 00:45:46.822517   | 64.3278115181384683 | 7.3334228220728027 |         0.00 | 17.2602816651038357 |       221.00 | 1.7333869584346211
 Newark      |     16822 | 00:35:16.157472   | 86.4633688027582927 | 9.5461657353465700 |         0.00 | 16.2706122934252764 |       177.23 | 1.7435501129473309
(2 rows)

Now this is interesting:

  • JFK is more than 10x more popular than Newark (assuming both airports have a similar number of flights per day).
  • JFK is about 25% cheaper (most likely because of NJ tunnel and highway tolls).
  • Newark trips however are 22% (10 min) shorter.
  • Each airport is about 17 miles on average from the trip origination point.
  • Each have on average ~1.74 passengers/trip, indicating some homogeneity between the two rate types.

Here's an interesting insight:

If you need to book a flight out of NYC, think you will be in a rush, and don't mind paying a little extra (e.g., if you are splitting the fare), then you may want to consider flying out of Newark over JFK.

4. What's Happening Behind the Scenes

What's nice about TimescaleDB is that it lets you run these types of queries on large datasets without having to worry about partitioning / chunking, etc.

When you write data to the database, it gets automatically partitioned across space and time. Then, when you query your data, the database lets you run those queries against a ​hypertable​: the abstraction of a single continuous table across all space and time intervals.

In TimescaleDB, hypertables can live alongside normal PostgreSQL tables.

In our example above, ​rides​ is a hypertable, while ​rates​ is a normal PostgreSQL table.

Let's peek behind the curtain using the native PostgreSQL EXPLAIN​ command, and see TimescaleDB at work:

-- Peek behind the scenes
EXPLAIN SELECT * FROM rides;

                                    QUERY PLAN
-----------------------------------------------------------------------------------
 Append  (cost=0.00..258876.51 rows=3591152 width=472)
   ->  Seq Scan on rides  (cost=0.00..0.00 rows=1 width=472)
   ->  Seq Scan on _hyper_1_1_chunk  (cost=0.00..235846.56 rows=3253056 width=472)
   ->  Seq Scan on _hyper_1_2_chunk  (cost=0.00..1043.31 rows=34831 width=113)
   ->  Seq Scan on _hyper_1_3_chunk  (cost=0.00..21905.44 rows=302144 width=472)
   ->  Seq Scan on _hyper_1_4_chunk  (cost=0.00..81.20 rows=1120 width=472)
(6 rows)

This shows that the hypertable ​rides​ is split across four chunks (​_hyper_1_1_chunk​, ​_hyper_1_2_chunk​, ​_hyper_1_3_chunk​, ​_hyper_1_4_chunk​).

TIP:This is the schema as of TimescaleDB 0.1.0. Older versions of the database had a slightly different internal schema.

We can even query one of these chunks directly, accessing them via the private schema ​_timescaledb_internal​:

SELECT COUNT(*) FROM _timescaledb_internal._hyper_1_2_chunk;

 count
-------
 34831
(1 row)

Feel free to play with the internal TimescaleDB tables. One of the advantages of TimescaleDB is that it is fully transparent: you can always peek behind the curtain and see what's going on backstage.

(To see all internal schemas, use command ​\dn​.)

\dn

         List of schemas
         Name          |  Owner
-----------------------+----------
 _timescaledb_cache    | postgres
 _timescaledb_catalog  | postgres
 _timescaledb_internal | postgres
 public                | postgres
(4 rows)

results matching ""

    No results matching ""