Which is better: one big table, or two or more smaller tables? The organization of the data sources, the number of smaller tables, the extent of the relationships between the smaller tables, and economies in table processing all affect the balance of advantage. But cheaper storage, cheaper computing power, and fancier data tools probably favor the unified table. At the limit of costless storage, costless processing, and tools that make huge masses of data transparent, you can handle a component of the data as easily as you can handle all the data. Hence in those circumstances, using one big table is the dominant strategy.[*]
Unified tables are likely to be badly structured from a traditional data modeling perspective. With n disjoint components, the unified table has the form of a block-diagonal matrix of tables, where the diagonal blocks are the disjoint components and the off-diagonal blocks are empty matrices. It’s a huge waste of space. But for the magnitudes of data that humans generate and curate by hand, storage costs are so small as to be irrelevant. Organization, in contrast, is always a burden to action. The simpler the organization, the greater the possibilities for decentralized, easily initiated action.
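A minimal sketch, with made-up rows and column names, of that block-diagonal shape: merge two disjoint tables (here, lists of dicts sharing no keys) into one table whose schema is the union of all columns. Every cell a row's source table never defined comes out empty.

```python
# Two disjoint "tables" with no columns in common (all figures hypothetical).
subscribers = [{"quarter": "3Q10", "postpaid_adds": 2_600_000}]
devices = [{"device": "integrated", "activations": 8_000_000}]

def unify(*tables):
    """One big table: every row gets every column, in first-seen order;
    cells a row's source table never defined come out as None."""
    columns = []
    for table in tables:
        for row in table:
            for key in row:
                if key not in columns:
                    columns.append(key)
    return [{col: row.get(col) for col in columns}
            for table in tables for row in table]

unified = unify(subscribers, devices)
# unified[0] keeps its subscriber fields and carries None for the
# device fields; unified[1] is the mirror image.
```

The `None` cells are exactly the off-diagonal blocks: space spent so that any component can be queried through the one table.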
Consider collecting data from company reports to investors. Such data appear within text of reports, in tables embedded within text, and (sometimes) in spreadsheet files posted with presentations. Here are some textual data from AT&T’s 3Q 2010 report:
More than 8 million postpaid integrated devices were activated in the third quarter, the most quarterly activations ever. More than 80 percent of postpaid sales were integrated devices.
These data don’t have a nice, regular, tabular form. If you combine them with data from the accompanying spreadsheets, the resulting table isn’t pretty. It gets even more badly structured when you add human-generated data from additional companies.
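One way to see the mess: record each hand-collected fact as a (company, period, metric, value) row, then pivot into one wide table. The two AT&T values below come from the excerpt above; the metric names and the second company's figure are hypothetical, made up for illustration.

```python
# Hand-collected facts: (company, period, metric, value).
# The AT&T figures are from the 3Q 2010 excerpt ("more than 8 million"
# activations, "more than 80 percent" integrated share); OtherCo and
# all metric names are hypothetical.
facts = [
    ("AT&T", "3Q10", "integrated_device_activations", 8_000_000),
    ("AT&T", "3Q10", "integrated_share_of_postpaid_sales", 0.80),
    ("OtherCo", "3Q10", "smartphone_net_adds", 1_500_000),
]

# Pivot to one wide table: columns are the union of every metric any
# company reports, so most cells are empty.
metrics = sorted({m for _, _, m, _ in facts})
wide = {}
for company, period, metric, value in facts:
    row = wide.setdefault((company, period), {m: None for m in metrics})
    row[metric] = value
```

Each additional company adds columns that every other company leaves empty, which is how the table grows badly structured.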
Humans typically generate idiosyncratic data presentations. More powerful data tools allow persons to create a greater number and variety of idiosyncratic data presentations from well-structured, well-defined datasets. One might hope that norms of credibility evolve to encourage data presenters to release the underlying, machine-queryable dataset along with the idiosyncratic human-generated presentation. But you can think of many reasons why that often won’t happen.
Broadly collecting and organizing human-generated data tends to produce badly structured tables. No two persons generate exactly the same categories and items of data. The data that persons present change over time. The result is a wide variety of small data items and tables. Combining that data into one badly structured table makes for more efficient querying and analysis. As painful as this situation might be for thoughtful data modelers, badly structured tables have a bright future.
* * * * *
[*] Of course the real world is finite. A method with marginal cost that increases linearly with job size pushes against a finite world much sooner than a method with constant marginal cost. The above thought experiment is meant to offer insight, not a proof of a real-world universal law.