How to… Shape Data
As discussed in the previous post 'How to...Plan your Prep', the first step of Data Preparation is understanding how the original dataset is structured, and quickly followed by, understanding what structure the dataset needs to be for analysis. This post looks at exactly these factors so when you look at future datasets, you can quickly determine what steps you want to take to shape the data for analysis.
For this post, we will look at a typical input dataset that has been formed within Excel by building a Pivot Table. This example uses Ice Cream sales.
What to look at for incoming datasets?
When assessing an incoming dataset, it's important to identify both Dimensions of the data as well as Measures. If the Dimensions are all in individual columns then you can move on to Measures without having to think about any structure changes. The same assessment needs to be made for Measures.
Let's use the Ice Cream sales example to assess the Input shape of the data. The dataset has been coloured to highlight the structure found:
Dark Blue - Header for Dimension column
Light Blue - Dimension value
Dark Green - Header for Measure column
Light Green - Measure value
By drawing out the structure of the input dataset, it becomes clear what alterations need to be made once you have an understanding on how data should be structured for analysis.
What shape is best for analysis in Tableau?
When loading data in to Tableau Desktop, the software takes the first row of data as the headers for the columns and all subsequent rows as the data points for those headers. Here are the key aspects to consider when structuring data for Tableau:
A single column for data field - these will form the data fields that are then dragged & dropped in Tableau
Is it a Dimension or Measure? Tableau will divide all data fields in to Dimensions (aspect to split the data up by) and Measures (the data fields to analyse).
A single data type for each data field - a data field in Tableau (and most other tools) require a single data type. For measures, if they are not numeric, they they will not be present in the list of measures.
There are a few aspects that do not matter as well:
Order of columns - Tableau Desktop and Server will absorb a dataset and order the fields shown in the Data Pane in Alphabetical Order. Therefore, there is no need to order the columns
Geographic Roles - currently a geographic role can not be set and then carried over in to Tableau Desktop. The data field can be set as a String but the Geographic Role will need to be allocated in Desktop.
In this example, we can see Category is in the correct state. The Header for the Dimension is at the top of the column that contains all the relevant values.
The Measure Header is in the correct location, but is it necessary? The Measures are listed under each individual month. To analyse data over time in Tableau, one column containing all of the different dates would be more preferable and easier to use. Therefore, in this case, the Dates listed as headers at the moment will need to be pivoted.
The Measures that are named in the Measure Column would be much easier to analyse if they were individual columns. Forming one column for Sales and one for Profit would enable these two columns to be Measures when analysing the data in Tableau.
Changing Dataset structures in Prep
Of the different types of steps available within Prep, there are three that are key to changing the structure of the input datasets:
Pivot
The Pivot step is the most important when changing the data structure. There are two types of Pivot:
Columns to Rows - taking multiple columns and converting them in to additional rows of data. The column header is made into a new dimensional column that will contain all other column headers that are involved in the Pivot. The icon for the 'Columns to Rows' Pivot is the dark purple icon shown above.
Rows to Columns - this is the reverse of the Columns to Rows pivot. In this instance, rows of data are converted in additional columns within the dataset. The set-up of the 'Rows to Columns' pivot requires a selection of the column that will become the Headers of the new data fields. The set-up also requires a selection of the data field that will act as the values for the new data fields. In case there are multiple values forced into the same cell, the form of aggregation has to be made.
Aggregate
Whilst aggregations change the number of rows, the Aggregation step within Prep also can change the structure of the data. The only data fields that continue on in the data flow from an Aggregate step are any fields included as a 'Group By' field or an 'Aggregation'.
Join
Joins are designed to add additional columns to the original dataset from an extra source. Depending on the Join Type and Join Conditions set, the resulting data set can be very different in terms of the number of data fields, as well as number of rows.
Union
This step can create a different data structure as Unioning mismatched column headers will create a wider dataset. Without merging the mismatched fields, then a large number of nulls will be present in any fields that are not contained in both datasets.
Applying the techniques to the Example
Using the techniques available in Prep to restructure data, let's apply these techniques to the Ice Cream example.
Step A: Pivot - Columns to Rows
Creating a single column to contain dates are a key step to take. By taking this step, 'Pivot1 Names' is created as a column to hold the former column headers that are selected within the 'Columns to Rows' step.
Step B: Pivot - Rows to Columns
To create a column for each measure, the 'Measure' column needs to be converted into Headers for the new column with the relevant value added underneath for each combination of 'Category' and Date.
Restructuring data is key skill to master and practice as the dataset above is much easier to analyse. Longer and thinner is not always the intention. A few key elements to aim for:
A single data field for each dimension, ideally containing a single data type (like a string or date)
A single column for a type of date rather than each week / month / year being a separate column
A column per measure
The closest this data structure is matched, the easier it will be to analyse the data set in question flexibly.