Tableau - Incremental Extracts


Introduction

Instead of Extracting all the data every time you can just add new rows of data. Doing this will make your extracts run faster and sooner. 

 

Requirements

In order to do an incremental extract you need to

  1. Not aggregate the extract (This adds a group and if you want to extract your can still do it in customSQL)
  2. Have a date, datetime, or number to use as the identifier of new rows.

Instructions

 

What happens when Tableau refreshes an incremental extract?

  1. It looks for the max identifier in the extract
  2. adds a line to the where clause so only records greater the identifier come back

NOTE: Tableau will round seconds to three decimal points. If you have a max datetime of 1-10-28 12:05:2.2345 and then a new row with 1-10-28 12:05:2.2348 is added to the database it wil not be included in your incremental extract

NOTE: Making changes to your datasource (including sql) will trigger Tableau to re-generate the extract.

How does that change when I am using CustomSQL?

This works great if you drag and drop your tables and join them in Tableau. Sometimes you have to use CustomSQL so how can we build the SQL to accomidate incremental extracts.

How CustomSQL works in Tableau

The short answer is it wraps the SQL you paste in with what it needs to create the extract or render the viz.

SELECT
    the fields we need --for live queries
    fields in the sql that are not hidden in Tableau --for extracts
FROM
    --start your sql
    SELECT 
        stuff
    FROM
        THINGS
    WHERE
        1=1
    --end your sql
WHERE
    stuff in the filter pane --live
    datasource filters --live
    extract filters --extract
    incremental_column > MAX(incremental_column_in_extract) --incremental extracts
GROUP BY 
    every dimesion --when extracts are aggregated or if it is a live query

How to optimize your SQL for incremental extracts

This is fine for the first run, but your sql might return 3 years of data ever time. Every subsquent time it will return it all just to have Tableau say "Thanks, but I only need these new rows."

WHERE 
    (trunc(sysdate) = TO_DATE('January 11, 2019, 12:00 A.M.','Month dd, YYYY, HH:MI A.M.') --Set to today when I create the extract today it will pull everything
    OR 
    date_col = sysdate-1) --rows from yesterday

 

Common Issues

Understand how your database works. If a row is changed it might not get loaded if the date is the same.

In this case new transactions adjust the prior one. We could just sum the dollars and get 10 and incrementally extract based on the transaction date.

account transaction-date dollars type
tom 1/1/00 5 charge
tom 1/2/00 -5 adjustment
tom 1/2/00 10 charge

 

In this case transactions are updated

Using the transaction date does not work because the row changes but not the date one day the row might look like:

account transaction-date dollars
tom 1/1/00 5

then the next it might be changed and look like

account transaction-date dollars
tom 1/1/00 10

An incremental based on the transaction-date wouldn't pick up the change so. Two ways around this:

  1. there is an update date field. This will end up with duplicate rows you will need to use a LOD in Tableau to get the row with the Max updated record
account transaction-date dollars updated
tom 1/1/00 5 1/1/00
tom 1/1/00 10 1/2/00
  1. do a incremental and a full extract. The server will let you schedule both and I might do incrementals each day and then a full refresh on the weekend In this case some drift can happen and tom shows as $15 but the full refresh will clean things back up.

 

Cool Hacks: Snapshots!

  1. Use CustomSQL
  2. include the sysdate as a columm
  3. increment the extract on the sysdate

WARNING this will exponentially expand your extract but in a transactional system with no history you can answer some cool questions like how many tickets were open on a day? Also useful for snapshoting metrics for a period. Then you can compare This month to the last easily in Tableau.

In this simple example the sql returns the total amount of charges, payments, and adjustments for yesterday. Running it daily will overtime create a nice reporting table. 3 rows per day (charge/payment/adjustment).

select 
    t.POST_DATE
    ,tt.name
    ,sysdate --sysdate is here so we can use it for our incremental counter
    ,SUM(t.amount)
    ,count(distinct t.account_ID)
from TRANSACTIONS t
    left outer join TRANSACTION_TYPE tt on t.tran_type=tt.tran_type
WHERE t.Post_Date = trunc(sysdate)-1
group by 
    t.post_date
    ,tt.name