Striding Data in the Database

Jan 7, 2018 · 697 words · 4 minutes read data processing • databases • finance • postgres

Sequential data such as text from a book, recorded speech, or financial time series is one of the most common types of structured data. Fitting models on sequence data typically involves constructing small, overlapping sequences by moving a window across the entire dataset.

Each sequence is a small example of the larger picture. However because the data is continuous, it is often stored as a long (or wide) array.

For example the series from 1 to 12 would be stored like this:

After preparing the data for a model, the data might look like this:

1, 2, 3, 4, 5
2, 3, 4, 5, 6
3, 4, 5, 6, 7
4, 5, 6, 7, 8
5, 6, 7, 8, 9
6, 7, 8, 9, 10
7, 8, 9, 10, 11
8, 9, 10, 11, 12

Therefore one of the first steps in fitting such models is to transform the data from a single continuous series to many shorter sequences. There many tools for striding the data, including NumPy stride tricks. In this example each sequence has a length of 5 and moves forward 1 step.

It turns out you can stride your data in the database. Postgres will let you do this using array_agg and a window function with a frame (i.e. ROWS) clause.

Example

Suppose you’re fitting a RNN or LSTM on stock market data. Each record contains information such as the date and closing price. The data is stored in a table, stock_prices:

SELECT date, price
FROM stock_prices
WHERE ticker='aapl'
ORDER BY date
LIMIT 15;

Date	Price
1998-01-02	3.95098
1998-01-05	3.8902
1998-01-06	4.60502
1998-01-07	4.24032
1998-01-08	4.39107
1998-01-09	4.43726
1998-01-12	4.46886
1998-01-13	4.76306
1998-01-14	4.78738
1998-01-15	4.66581
1998-01-16	4.55883
1998-01-20	4.61962
1998-01-21	4.59044
1998-01-22	4.69499
1998-01-23	4.74118

This data is free historical stock data from QuantQuote containing daily resolution data for all S&P 500 ticker symbols between 1998 and July 31, 2013. Scripts to ingest this data into a Postgres database is available on GitHub.

Because the data is a time series we can use array_agg with a window statement to construct subsequences.

SELECT
  date,
  array_to_json(array_agg(price))
    OVER(
      ORDER BY date
      ROWS BETWEEN 5 PRECEDING AND CURRENT ROW
    ) AS sequences
FROM stock_prices
WHERE ticker='aapl'
ORDER BY date
LIMIT 15;

Date	Sequences
1998-01-02	[3.95098]
1998-01-05	[3.95098,3.8902]
1998-01-06	[3.95098,3.8902,4.60502]
1998-01-07	[3.95098,3.8902,4.60502,4.24032]
1998-01-08	[3.95098,3.8902,4.60502,4.24032,4.39107]
1998-01-09	[3.95098,3.8902,4.60502,4.24032,4.39107,4.43726]
1998-01-12	[3.8902,4.60502,4.24032,4.39107,4.43726,4.46886]
1998-01-13	[4.60502,4.24032,4.39107,4.43726,4.46886,4.76306]
1998-01-14	[4.24032,4.39107,4.43726,4.46886,4.76306,4.78738]
1998-01-15	[4.39107,4.43726,4.46886,4.76306,4.78738,4.66581]
1998-01-16	[4.43726,4.46886,4.76306,4.78738,4.66581,4.55883]
1998-01-20	[4.46886,4.76306,4.78738,4.66581,4.55883,4.61962]
1998-01-21	[4.76306,4.78738,4.66581,4.55883,4.61962,4.59044]
1998-01-22	[4.78738,4.66581,4.55883,4.61962,4.59044,4.69499]
1998-01-23	[4.66581,4.55883,4.61962,4.59044,4.69499,4.74118]

Note that the frame clause, ROWS BETWEEN 5 PRECEDING AND CURRENT ROW, returns up to 6 observations per sequence because BETWEEN is inclusive. For the first 5 rows there are fewer than 6 records because there aren’t preceeding rows.

If you want each sequence to contain the same number of elements you can pad them by inserting extra rows at the beginning.

CREATE TEMP TABLE temp_tbl AS (
  SELECT
    date :: TIMESTAMP,
    0.0 AS price
  FROM generate_series('1997-12-26' :: TIMESTAMP,
                       '1998-01-01' :: TIMESTAMP,
                       '1 day' :: INTERVAL) AS date
);
SELECT *
FROM (
  SELECT
    date :: DATE,
    array_to_json(
      array_agg(price)
        OVER (
          ORDER BY date
          ROWS BETWEEN 5 PRECEDING AND CURRENT ROW
        )
    ) AS sequences
  FROM (
    SELECT
      date,
      price
    FROM sp500.daily
    WHERE ticker = 'aapl'
    ) AS tbl
  FULL OUTER JOIN temp_tbl USING (date, close)
) AS tbl
WHERE date >= '1998-01-02'
ORDER BY date
LIMIT 10;

Date	Sequences
1998-01-02	[0.0,0.0,0.0,0.0,0.0,3.95098]
1998-01-05	[0.0,0.0,0.0,0.0,3.95098,3.8902]
1998-01-06	[0.0,0.0,0.0,3.95098,3.8902,4.60502]
1998-01-07	[0.0,0.0,3.95098,3.8902,4.60502,4.24032]
1998-01-08	[0.0,3.95098,3.8902,4.60502,4.24032,4.39107]
1998-01-09	[3.95098,3.8902,4.60502,4.24032,4.39107,4.43726]
1998-01-12	[3.8902,4.60502,4.24032,4.39107,4.43726,4.46886]
1998-01-13	[4.60502,4.24032,4.39107,4.43726,4.46886,4.76306]
1998-01-14	[4.24032,4.39107,4.43726,4.46886,4.76306,4.78738]
1998-01-15	[4.39107,4.43726,4.46886,4.76306,4.78738,4.66581]

Now the first sequences are left-padded with 0 so that every sequence has the same length.

Final thoughts

So there you have it: striding data in the database is as simple as array_agg with a frame clause. If you like, you can insert a few rows at the beginning (or end) of the table to pad the data so that each sequence has the same length.

Doing this in the database means you can run the same query from any language and get the same result. So if you prototype in R but use Python for production you can reuse the same SQL code — no need to translate dplyr to Pandas.