Striding Data in the Database
Jan 7, 2018 · 697 words · 4 minutes read
Sequential data such as text from a book, recorded speech, or financial time series is one of the most common types of structured data. Fitting models on sequence data typically involves constructing small, overlapping sequences by moving a window across the entire dataset.
Each sequence is a small example of the larger picture. However because the data is continuous, it is often stored as a long (or wide) array.
For example the series from 1 to 12 would be stored like this:
1
2
3
4
5
6
7
8
9
10
11
12
After preparing the data for a model, the data might look like this:
1, 2, 3, 4, 5
2, 3, 4, 5, 6
3, 4, 5, 6, 7
4, 5, 6, 7, 8
5, 6, 7, 8, 9
6, 7, 8, 9, 10
7, 8, 9, 10, 11
8, 9, 10, 11, 12
Therefore one of the first steps in fitting such models is to transform the data from a single continuous series to many shorter sequences. There many tools for striding the data, including NumPy stride tricks. In this example each sequence has a length of 5 and moves forward 1 step.
It turns out you can stride your data in the database. Postgres will let you do this using array_agg
and a window function with a frame (i.e. ROWS
) clause.
Example
Suppose you’re fitting a RNN or LSTM on stock market data. Each record contains information such as the date and closing price. The data is stored in a table, stock_prices
:
SELECT date, price
FROM stock_prices
WHERE ticker='aapl'
ORDER BY date
LIMIT 15;
Date | Price |
---|---|
1998-01-02 | 3.95098 |
1998-01-05 | 3.8902 |
1998-01-06 | 4.60502 |
1998-01-07 | 4.24032 |
1998-01-08 | 4.39107 |
1998-01-09 | 4.43726 |
1998-01-12 | 4.46886 |
1998-01-13 | 4.76306 |
1998-01-14 | 4.78738 |
1998-01-15 | 4.66581 |
1998-01-16 | 4.55883 |
1998-01-20 | 4.61962 |
1998-01-21 | 4.59044 |
1998-01-22 | 4.69499 |
1998-01-23 | 4.74118 |
This data is free historical stock data from QuantQuote containing daily resolution data for all S&P 500 ticker symbols between 1998 and July 31, 2013. Scripts to ingest this data into a Postgres database is available on GitHub.
Because the data is a time series we can use array_agg
with a window statement to construct subsequences.
SELECT
date,
array_to_json(array_agg(price))
OVER(
ORDER BY date
ROWS BETWEEN 5 PRECEDING AND CURRENT ROW
) AS sequences
FROM stock_prices
WHERE ticker='aapl'
ORDER BY date
LIMIT 15;
Date | Sequences |
---|---|
1998-01-02 | [3.95098] |
1998-01-05 | [3.95098,3.8902] |
1998-01-06 | [3.95098,3.8902,4.60502] |
1998-01-07 | [3.95098,3.8902,4.60502,4.24032] |
1998-01-08 | [3.95098,3.8902,4.60502,4.24032,4.39107] |
1998-01-09 | [3.95098,3.8902,4.60502,4.24032,4.39107,4.43726] |
1998-01-12 | [3.8902,4.60502,4.24032,4.39107,4.43726,4.46886] |
1998-01-13 | [4.60502,4.24032,4.39107,4.43726,4.46886,4.76306] |
1998-01-14 | [4.24032,4.39107,4.43726,4.46886,4.76306,4.78738] |
1998-01-15 | [4.39107,4.43726,4.46886,4.76306,4.78738,4.66581] |
1998-01-16 | [4.43726,4.46886,4.76306,4.78738,4.66581,4.55883] |
1998-01-20 | [4.46886,4.76306,4.78738,4.66581,4.55883,4.61962] |
1998-01-21 | [4.76306,4.78738,4.66581,4.55883,4.61962,4.59044] |
1998-01-22 | [4.78738,4.66581,4.55883,4.61962,4.59044,4.69499] |
1998-01-23 | [4.66581,4.55883,4.61962,4.59044,4.69499,4.74118] |
Note that the frame clause, ROWS BETWEEN 5 PRECEDING AND CURRENT ROW
, returns up to 6 observations per sequence because BETWEEN
is inclusive. For the first 5 rows there are fewer than 6 records because there aren’t preceeding rows.
If you want each sequence to contain the same number of elements you can pad them by inserting extra rows at the beginning.
CREATE TEMP TABLE temp_tbl AS (
SELECT
date :: TIMESTAMP,
0.0 AS price
FROM generate_series('1997-12-26' :: TIMESTAMP,
'1998-01-01' :: TIMESTAMP,
'1 day' :: INTERVAL) AS date
);
SELECT *
FROM (
SELECT
date :: DATE,
array_to_json(
array_agg(price)
OVER (
ORDER BY date
ROWS BETWEEN 5 PRECEDING AND CURRENT ROW
)
) AS sequences
FROM (
SELECT
date,
price
FROM sp500.daily
WHERE ticker = 'aapl'
) AS tbl
FULL OUTER JOIN temp_tbl USING (date, close)
) AS tbl
WHERE date >= '1998-01-02'
ORDER BY date
LIMIT 10;
Date | Sequences |
---|---|
1998-01-02 | [0.0,0.0,0.0,0.0,0.0,3.95098] |
1998-01-05 | [0.0,0.0,0.0,0.0,3.95098,3.8902] |
1998-01-06 | [0.0,0.0,0.0,3.95098,3.8902,4.60502] |
1998-01-07 | [0.0,0.0,3.95098,3.8902,4.60502,4.24032] |
1998-01-08 | [0.0,3.95098,3.8902,4.60502,4.24032,4.39107] |
1998-01-09 | [3.95098,3.8902,4.60502,4.24032,4.39107,4.43726] |
1998-01-12 | [3.8902,4.60502,4.24032,4.39107,4.43726,4.46886] |
1998-01-13 | [4.60502,4.24032,4.39107,4.43726,4.46886,4.76306] |
1998-01-14 | [4.24032,4.39107,4.43726,4.46886,4.76306,4.78738] |
1998-01-15 | [4.39107,4.43726,4.46886,4.76306,4.78738,4.66581] |
Now the first sequences are left-padded with 0 so that every sequence has the same length.
Final thoughts
So there you have it: striding data in the database is as simple as array_agg
with a frame clause. If you like, you can insert a few rows at the beginning (or end) of the table to pad the data so that each sequence has the same length.
Doing this in the database means you can run the same query from any language and get the same result. So if you prototype in R but use Python for production you can reuse the same SQL code — no need to translate dplyr to Pandas.