exploring and remodeling table fields

Sometimes tables are messy not just in their data items, but also in the fields that define the table columns.[1] Various techniques help to deal with such “second order” messiness. Sorting table fields alphabetically or evaluating them with more powerful text similarity measures help to identify inadvertently duplicated fields. Sorting table fields by the number of non-null items in the fields would tell you the relative data size of different fields. Of course, you would also like to be able to merge and split fields just as easily as you select fields to include in a sub-table of interest.

Transforming a table into row number/field/item triples can be helpful for exploring and remodeling table fields. With such a transformation, fields become new data items. Excluding triples with null items and faceting the fields shows the count of items in each field. Merging and splitting fields become operations of selecting relevant triples and renaming the field/data item. When finished such meta-cleaning, you might want to transform the triples back into a remodeled table. Since field merges can create a field containing more than one item, the remodeled table ideally would support multiple items per cell. [2]

The triple transform is an instance of more general table reshaping operations. A long transformation replaces a set of fields with a key field containing those field names and a new field containing the keyed values. A wide transformation reverses the long transformation. It expands a key field into separate key-instance fields populated with the corresponding keyed values. Non-varying fields are expanded or contracted as necessary. Table reshaping operations change both the column and row dimensions of a table. The reshape command in Stata implements this class of table operations. The Stata reshape documentation shows examples of these transformations. SDDL / STT provides table reshaping using spreadsheet-based markup.

The awesome new data tool Google Refine can do table reshaping. Here’s the Google Refine procedure for transforming a table into row/field/item triples:

Create a row-number data column called “row” at the beginning of your table, if you don’t already have such a column. That column will be identical to the (non-editable) row number column on the Google Refine table presentation frame.
From the column drop-down menu in the first column after the row number column, select “Transpose” / “Cells across columns into rows…” from the column drop down menu. Select the field below the row field in the “from column” and the last field in the “to column”. Call the column name “field”, check “prepend the column name”, and separate the column name and cell value with “:” (or some other value not included in item text). Then click “Transpose”. The table is then elongated, but it needs some further modifications.
From the “row” column drop-down menu, select “Edit cells” / “Fill down”. That fills in the row numbers for all the records.
In the field column, select “Edit column” / “Split into several columns…” , and enter the separator (“:”) used in step 2. Clicking ok completes transforming the table into triples.

To transform the triples back into the wide table:

In the drop-down menu for the rightmost column (field2), select “Transpose” / “Cells across columns into rows”. For the number of rows to be transpose, enter the number n of fields in the original table (not including the row number field).
Rename the new “field 2 n” fields based on the field names in the previous column (field 1).
In any of the former “field 2 n” fields, select from the drop-down menu “Facet” / “Text facet”. In the left panel select “(blank)”. You should now see all the blank “field 2 n” rows.
In the drop-down menu for “all”, select “Edit rows” / “Remove all matching rows”.
Click on “Reset All” and the table is back to its original form.[3]

This triple transform unfortunately isn’t as helpful as it could be. With this triple transform, you can then facet the (former) table fields and understand what fields exist in the table. But any changes that you make just in the field column won’t have any effect in the reverse transform. A function that would implement the reverse transform with changes in the field column (and automatically create the corresponding field names) seems like it would be straightforward to implement.[4] A robust reverse triple transform would make a great data tool even more useful.[5]

* * * * *

Notes:

[1] Throughout this post, table means a data collection with one dimension of categorization (a record list), e.g. a typical relational database table that organizes fields into records.

[2] If it doesn’t, the table cleaner has to ensure before a field merge that no multi-item cells will be created.

[3] Similar steps can implement table reshaping on just a subset of columns.

[4] The possibility of having multiple items in one row-field cell would have to be addressed. One solution would be to ask for a delimiter before the transform, and use that delimiter to form multi-item cells.

[5] Collecting human-generated unstructured and structured data using SDDL / STT tends to result in messy, wide, sparse tables.

Leave a Reply Cancel reply