Even in this age of big data, most persons collect data in spreadsheets. Two challenges are common with spreadsheet data, particularly spreadsheet data collected from a variety of sources. First, you need to understand what numbers you have. That means both the definition of a specific number and the presence or absence of particular numbers. Second, you need to combine unstructured data and data tables of various forms into an encompassing data structure that you can flexibly query and re-organize. Here are some ideas and data tools to address these challenges.
Suppose you are collecting unstructured data from a variety of sources. For example, a Wall Street Journal news article states:
Nielsen, at the request of The Wall Street Journal, analyzed cellphone bills of 60,000 mobile subscribers and found adults made and received an average of 188 mobile phone calls a month in the 2010 period, down 25% from the same period three years earlier.
To collect that data, you might paste the string “adults made and received an average of 188 mobile phone calls a month in the 2010 period” in a spreadsheet cell. But that’s data you can’t use in spreadsheet calculations. Ok, stick the number 188 in the cell to the right of that data string so you can do calculations on the relevant number. Even better, stick the number in the first cell and the description string in the second cell, so that the long, left-justified description string lines up close to the corresponding number. If you’re really serious, stick the source url in another column to the right so that you can check if you’ve missed something. This sort of procedure gives you a spreadsheet that looks something like this. This data format may be good enough for some private-sector work, or even some government work. If so, fine.
But collecting numbers and ad hoc data description strings has some weaknesses. In the example above, you haven’t fully captured the data description. That data concerns “mobile subscribers.” Mobile subscribers may or may not be different from mobile prepaid service users. Moreover, suppose you had a hundred numbers like that above. To find a particular statistic of interest, you would have to start reading through the description strings until you found one similar to the statistic that you sought. That’s tedious and time-consuming.
If you describe spreadsheet data in a more structured way, you can understand it better and process effectively larger amounts of it. A common approach would be to set up a spreadsheet table and stick numbers into it. For example, suppose you want to collect wireless service data by company. You place in a row your data categories (fields): company, customer type, number of customers 3Q 2010, etc. Then you start putting data into the relevant columns below. If you have a lot of categories and the data you find isn’t generally organized in the order of your categories, populating the table will be a tedious and time-consuming task. Moreover, suppose that you are collecting the data by company. You have to repeatedly enter the company name in the company column. That sort of annoyance can easily be multiplied if you are collecting data with additional categorical organizations such as date and business segments. If these slowly varying categories are spread across a large table, the pain is even more intense.
Here’s some good news: that supercomputer on your desk can work as a tabulating machine. You just need to describe your data with the Spreadsheet Data Description Language (SDDL) and then tabulate it with STT (Spreadsheet Tabulating Tool, or Simple Tabulating Tool, or SDDL Tabulating Tool). SDDL and STT provide fine new acronyms for you to use without requiring you to learn much or do much different with the beloved spreadsheet programs that you’ve been using since Visicalc came out. To use SDDL, you put a category in one cell, and in the next cell to the right, an item for that category. You can add additional categories and items aligned below, in any order most convenient. When you stick an item into a category that already contains an item, you’ve implicitly tabulated a record and started a new record with empty categories. STT is a spreadsheet macro that adds to a table on a separate sheet the categories (if necessary) and items contained in a block of SDDL. SDDL is so simple that you’ve probably been setting up a lot of spreadsheet data in SDDL, without even knowing that you were using SDDL.
Here are some SDDL / STT examples using Google Docs spreadsheets. So that you can see the SDDL and the tabulation side-by-side, I’ve copied the tabulations from the tabulated sheet to the SDDL examples sheet. Examples 1.1-1.2 show placing items into categories, and how sticking an item into an occupied category generates a tabulated record. This technique is more efficient than hand-tabulating data for a wide, sparse table.
You can easily combine hand tabulating and SDDL / STT. If you start with a tabulation, running STT on a block of SDDL adds any needed categories and adds all items to their categories. See Example 2 in the SDDL example spreadsheet.
SDDL uses prefixes for pinning and unpinning items in categories. If you precede a category name with the character “*”, the associated item is pinned in that category across records. So if you are working on tabulating a group of records for a company, you can pin the company name in the company category for that record group. That company name will then populate each record in that record group. When you insert a new name in the company category, you decide whether to pin that new name. See Example 3.
You can create SDDL using all the editing, linking, and calculating capabilities of the spreadsheet. So, for example, if you have a figure somewhere in a spreadsheet, you can reference that figure and do a calculation using it to produce a figure in an SDDL cell. Similarly you can copy and paste categories to lessen the work of entering SDDL.
SDDL stack and twoway formats provide an easy way to create groups of similar records. A stack is a table of items with one dimension of categories (a record list). A twoway format describes a table with two dimensions of categories. Both stacks and twoways are commonly called tables, but technically these forms differ significantly.
In SDDL, a stack is one or more rows in a block where more than one item follows the category that starts each row. The row-starting categories, called 1cats, must not be duplicated within one stack. STT tabulates one record for each stack column. Each tabulated record contains all the items sitting in the current record before the stack occurred. To visually distinguish stacks, and for some minor cases of missing values, the stack category can be prefixed with a 1. See Examples 5.1 and 5.2.
While a stack implies a record for each data column, the twoway format typically creates a record for each item placed in a rectangular grid. The twoway format begins with the twoway directive “%twoway”, with “precat” categories following in that directive’s row. Below the twoway directive row are a stack of “1cat” rows and a set of “2cat” rows. The record for grid item at position (r,c) is created from items in the “precat” section of row r, the items in the stack above the grid in column c, and the grid item. The grid item goes into the “2cat” category that begins the grid row. See example 6. Once you understand how to read the relevant categories (precats, 1cats, and 2cats), a twoway SDDL form is just a well-described, two-way table.
A key use for the twoway is to re-organize and aggregate various data tables. Financial publications typically include pivot tables (two-way tables) constructed from a master data table (a one-way table / record list). The SDDL twoway format allows you to reverse pivot the table to the extent possible (unfortunately, you can’t un-sum a cell total). For an example of an SDDL table-to-list conversion using real data, see the first data table here (from an AT&T 3Q2010 financial spreadsheet) and the first SDDL twoway here. A Google search shows that many others struggle to convert (two-way) tables to (one-way) record lists. With SDDL / STT, you can do that job easily and flexibly. Moreover, you can automatically aggregate the tables that you convert into one, big master tabulation.
Go ahead and experiment with SDDL / STT and consider whether it might be useful to you. Here’s an example of using SDDL / STT to aggregate both unstructured and structured data from AT&T’s and Verizon’s third-quarter, 2010, financial reports (wireless segment). Here’s a full definition of SDDL and the source code for the STT script. Want to customize the script to your liking? Got an idea for improving it? Go for it! If you make an improvement to the script, please make it available to everyone. That’s not a legal requirement; it’s just a polite request.
Only a few persons have big data. Ultimately, it’s not the size of your data, but what you do with it that really matters.
* * * * *
 Tricks exist to fill down blank cells. But you still have to locate the right columns for the first-changed items and filling down.
 STT is currently freely available as a Google Apps Script. I hope that public-spirited programmers will port it to Microsoft Excel, OpenOffice, Zoho Spreadsheets, and any other applications where it would be useful, and that they will make the resulting ports freely available.
 The tabulations appear in a separate sheet from the SDDL. That sheet by default is named “tabulated”.
 My design philosophy for STT’s processing of SDDL is “STT will do the best it can with what it’s given”. If you duplicate 1cats in a stack, STT will break up the stack into sub-stacks to overcome the duplication. The STT log file records such situations.
 An SDDL stack is the same as the transpose of a record list, where the first row contains field names. The SDDL directive “gettab” adds a record list to an STT tabulation (which itself is a record list).
 The twoway can handle different categories placed as 2cats. See the SDDL / STT documentation.