数据模型

TimescaleDB utilizes a "wide-table" data model, which is quite common in the world of relational databases. This makes Timescale somewhat different than most other time-series databases, which typically use a "narrow-table" model.

大多数时序数据库使用“窄表”,而TimescaleDB使用的是关系型数据库中常见的“宽表”数据模型。

Here we discuss why we chose the wide-table model, and how we recommend using it for time-series data, using an Internet of Things (IoT) example.

现在我们来用一个物联网(IoT)的例子来讨论一下为什么我们选用宽表,并且推荐处理时序数据时使用宽表。

Imagine a distributed group of 1,000 IoT devices designed to collect environmental data at various intervals. This data could include:

试想一下1000台分布式IoT设备用来在不同时间段收集环境数据。这些数据包括:

  • ​Identifiers:​ ​device_id​, ​timestamp​
  • ​Metadata:​ ​location_id​, ​dev_type​, ​firmware_version​, ​customer_id​
  • ​Device metrics:​ ​cpu_1m_avg​, ​free_mem​, ​used_mem​, ​net_rssi​, ​net_loss​, ​battery​
  • Sensor metrics:​ ​temperature​, ​humidity​, ​pressure​, ​CO​, ​NO2​, ​PM10​
  • 标识符:device_id​, timestamp(时间戳)
  • 元数据:location_id​, ​dev_type​, ​firmware_version​, ​customer_id​
  • 设备度量:温度,湿度,压力, ​CO​, ​NO2​, ​PM10​

For example, your incoming data may look like this:

输入的数据如下所示:

timestamp device_id cpu_1m_avg free_mem temperature location_id dev_type
2017-01-01 01:02:00 abc123 80 500MB 72 335 field
2017-01-01 01:02:23 def456 90 400MB 64 335 roof
2017-01-01 01:02:30 ghi789 120 0MB 56 77 roof
2017-01-01 01:03:12 abc123 80 500MB 72 335 field
2017-01-01 01:03:35 def456 95 350MB 64 335 roof
2017-01-01 01:03:42 ghi789 100 100MB 56 77 roof

Now, let's look at various ways to model this data.

现在,我们来看看不同的数据建模方式。

窄表模型

Most time-series databases would represent this data in the following way:

  • Represent each metric as a separate entity (e.g., represent ​cpu_1m_avg​ and ​free_mem​ as two different things)
  • Store a sequence of "time", "value" pairs for that metric
  • Represent the metadata values as a "tag-set" associated with that metric/tag-set combination

大多数时序数据库使用以下方式展示数据:

  • 将每个度量当做单独的实体分开展示。
  • 使用“时间”序列和“值”对的方式储存度量值。
  • 将元数据值表示为与度量/标注集组合相关联的“标注集”。

In this model, each metric/tag-set combination is considered an individual "time series" containing a sequence of time/value pairs.

这种数据模型中,每一个度量/标注集组合都可以被看作是单独的包含时间/值对序列。

Using our example above, this approach would result in 9 different "time series", each of which is defined by a unique set of tags.

以上面的例子为例,这种方法将产生9种不同的“时间序列”,每个时间序列都由一组唯一的标记集定义。

1. {name:  cpu_1m_avg,  device_id: abc123,  location_id: 335,  dev_type: field}
2. {name:  cpu_1m_avg,  device_id: def456,  location_id: 335,  dev_type: roof}
3. {name:  cpu_1m_avg,  device_id: ghi789,  location_id:  77,  dev_type: roof}
4. {name:    free_mem,  device_id: abc123,  location_id: 335,  dev_type: field}
5. {name:    free_mem,  device_id: def456,  location_id: 335,  dev_type: roof}
6. {name:    free_mem,  device_id: ghi789,  location_id:  77,  dev_type: roof}
7. {name: temperature,  device_id: abc123,  location_id: 335,  dev_type: field}
8. {name: temperature,  device_id: def456,  location_id: 335,  dev_type: roof}
9. {name: temperature,  device_id: ghi789,  location_id:  77,  dev_type: roof}

The number of such time series scales with the cross-product of the cardinality of each tag, i.e., (# names) x (# device ids) x (# location ids) x (device types).

And each of these "time series" then has its own set of time/value sequences.

每一个“时间序列”都有各自的时间/值序列集。

Now, this approach may make sense if you collect each of your metrics independently, with little to no metadata.

现在,如果你使用这种方式单独收集几乎没有元数据的度量会比较有作用。

But in general, we believe that this approach is limiting. It loses the inherent structure in the data, making it harder to ask a variety of useful questions. For example:

  • What was the state of the system when ​free_mem​ went to 0?
  • How does ​cpu_1m_avg​ correlate with ​free_mem​?
  • What is the average ​temperature​ by ​location_id​?

但是总的来说,这种方式是有所限制的。使用这种方式收集的数据失去了固有的数据结构,这样处理一些有用的问题就变的比较困难了。比如:

  • free_men变为0时,系统状态是什么?
  • 怎样收集与free_men相关的​cpu_1m_avg
  • 每个location_id的平均温度是多少?

We also find this approach cognitively confusing. Are we really collecting 9 different time-series, or just one collection of data with a variety of metadata and metrics readings?

我们还发现这种方法会造成认知混乱。因为我们不能确认我们是真的收集了9个不同的时间序列还是收集了不同元数据和度量值的数据集合。

宽表模型

In contrast, TimescaleDB uses a wide-table model, which reflects the inherent structure in the data.

相反的是,TimescaleDB使用的是能够反映数据固有结构d额宽表模型。

Our wide-table model actually looks exactly the same as the initial data stream:

我们的宽表模型实际上看起来与原始数据流是一样的。

timestamp device_id cpu_1m_avg free_mem temperature location_id dev_type
2017-01-01 01:02:00 abc123 80 500MB 72 42 field
2017-01-01 01:02:23 def456 90 400MB 64 42 roof
2017-01-01 01:02:30 ghi789 120 0MB 56 77 roof
2017-01-01 01:03:12 abc123 80 500MB 72 42 field
2017-01-01 01:03:35 def456 95 350MB 64 42 roof
2017-01-01 01:03:42 ghi789 100 100MB 56 77 roof

Here, each row is a new reading, with a set of measurements and metadata at a given time. This allows us to preserve relationships within the data, and ask more interesting or exploratory questions than before.

如上所示,每一行是特定时间内新的元数据和测量值。这使我们能够保存数据中的关系,并提出比以前更有趣或更具探索性的问题。

Of course, this is not a new format: it's what one would commonly find within a relational database. Which is also why we find this format more intuitive.

当时,这并不是一种新的数据格式,这是关系型数据库中常见的数据模式。所以这种数据格式看起来更直观。

关系型数据的联接

TimescaleDB's data model also has another similarity with relational databases: it supports JOINs. Specifically, one can store additional metadata in a secondary table, and then utilize that data at query time.

TimescalDB数据模型与关系型数据库的另一个相似之处是:也支持联接。具体来说,可以在辅助表中存储额外的元数据,然后在查询时利用这些数据。

In our example, one could have a separate locations table, mapping ​location_id​ to additional metadata for that location. For example:

在我们的例子中,可以创建一个单独的位置表,将location_id​映射到该位置对应的额外的元数据。例如:

location_id name latitude longitude zip_code region
42 Grand Central Terminal 40.7527° N 73.9772° W 10017 NYC
77 Lobby 7 42.3593° N 71.0935° W 02139 Massachusetts

Then at query time, by joining our two tables, one could ask questions like: what is the average ​free_mem​ of our devices in ​zip_code​ 10017?

Without joins, one would need to denormalize their data and store all metadata with each measurement row. This creates data bloat, and makes data management more difficult.

不使用联接的话,数据必反饭规范化,并且需要在每一个测量行储存所有的元数据。这样会导致数据膨胀,增加管理数据的难度。

With joins, one can store metadata independently, and update mappings more easily.

使用数据联接可以独立储存元数据,并能轻松更新映射。

For example, if we wanted to update our "region" for ​location_id​ 77 (e.g., from "Massachusetts" to "Boston"), we can make this change without having to go back and overwrite historical data.

例如,如果我们想更新location_id为77的区域地址(比如:从“马塞诸塞州”更改为“波士顿”),我们不需要返回去重写历史数据。

results matching ""

    No results matching ""