gift accounting tools

My friend Bill is feeling stressed out with holiday gift-giving.  Bill works long hours as an accountant for a major performance rights organization (a copyright collective), so he hasn’t been able to devote the time he should to choosing gifts.  He’s also got a rather complicated family situation. His first marriage, to Priscilla, ended in divorce after three children: Elizabeth (now married with children), John, and Mary.  His subsequent marriage, to Cathy, also ended in divorce.  With Cathy, he had two children, Rachel and Seth. Bill is presently in a long-term relationship with Meena, and they have one daughter, Tara.  Bill remains close with his only sibling, Sophia, and tries not to forget about his parents on their birthdays, anniversary, and Christmas. (Bill’s parents are Catholic, as was Priscilla, but Cathy is Jewish and Meena is Hindu, so Bill also has to watch for Hanukkah and Bali Padyami.)  As you might imagine, gift-giving for Bill is complicated, especially since he’s quite concerned about possible gift imbalances.

Bill has resolved to be a better gift-giver in the coming new year. Bill’s a guy who’s been pushing numbers around spreadsheets for 50-60 hours a week for more than two decades, so of course he started with a spreadsheet.  He set up a list of dates, persons, gifts, and gift costs for the gifts he gave in 2010. He created another list of persons, occasions, gifts, and estimated costs for the gifts he received in 2010.  But what about the occasions on which he forgot to give gifts in 2010?  He needs to compile those dates, too.  To make appropriate compensations in 2011, he needs to calculate his 2010 gift cost balance for each person.  But family life is much more complicated than that.  He’s got to make sure that he’s not unduly favoring relationships from his second failed marriage relative to the first, even though he and Priscilla remain quite bitter (he has a close, warm relationship with his daughter Elizabeth, whose mother is Priscilla).  Similarly, if he gave an ex-in-law a nice gift in 2010, he wants to make sure he gave a nicer (more expensive) gift to Meena’s parents.
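To make that arithmetic concrete, here is a minimal sketch of the per-person 2010 gift cost balance in Python with pandas.  The names, costs, and column labels are made up for illustration, not Bill’s actual data.

```python
import pandas as pd

# Made-up 2010 gift records; Bill's real data lives in his spreadsheet.
given = pd.DataFrame({
    "person": ["Elizabeth", "Tara", "Sophia"],
    "cost":   [40.0, 60.0, 25.0],
})
received = pd.DataFrame({
    "person":   ["Elizabeth", "Sophia"],
    "est_cost": [55.0, 25.0],
})

# 2010 gift cost balance per person: cost of gifts given minus
# estimated cost of gifts received.
balance = (given.groupby("person")["cost"].sum()
           .sub(received.groupby("person")["est_cost"].sum(), fill_value=0)
           .rename("balance_2010"))
print(balance.sort_values())
```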

After spending the last two Saturday nights (annoying Meena) working on a gift data spreadsheet, Bill sought my advice.  His fundamental gift problem, I told him, is that he hasn’t kept up with the development of new data tools. “I’ve already got a lot of gift data in a spreadsheet,” he complained.  “I always use spreadsheets.”  “I don’t need anything else.” “I don’t have time to learn anything new.”  “I just want to figure out the right gifts, and be done with this business.”  “Will you help me with this spreadsheet?”

I’ve known Bill for a long time, and he’s always been nice to me. I still remember when, two decades ago, his daughter Mary came over to our house for Susie’s big birthday party.  We and all the kids sang Happy Birthday and a lot of other happy songs. We did that without purchasing any rights. Bill never reported us.  Bill’s a swell guy at heart.

Here’s how I helped Bill.  First, I used SDDL / STT to combine two tables he had set up and some other data into a big data list.  Then I pulled that data list into Needle. Using Needle’s Thread query language, I wove across the disparate data records to create some relevant, compact gift data tables.  By sorting rows and creating, filtering, and hiding columns, Bill had made in his spreadsheet a gift planning list similar to this.  But he couldn’t create in a spreadsheet a list with persons sorted by first gift dates in the coming year, with each person’s gift dates grouped together and sorted for shrewd gift planning. To achieve his resolution of doing better in 2011, Bill needs to study his gift exchanges for 2010. In his spreadsheet he had set up a person gift-balance table like this. But what about family, what about relationships?  They can’t endure without good gift-data calculations and analysis.  So I swiftly created this gift exchange review by family group, and a similar review by relationship type (Needle really is sharp).  Using SDDL / STT and his beloved spreadsheet program, Bill can easily record gifts as they occur in the coming year, without having to scroll around a big master table. He can then easily bring the additional data into Needle and do real-time gift exchange tracking and analysis.
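I can’t reproduce Needle’s Thread queries here, but as a rough illustration in Python with pandas (made-up persons, occasions, and dates), the list Bill couldn’t build in his spreadsheet, with persons sorted by their first 2011 gift date and each person’s gift dates grouped together, could be sketched like this:

```python
import pandas as pd

# Made-up 2011 gift occasions compiled from the combined data list.
plan = pd.DataFrame({
    "person":   ["Tara", "Elizabeth", "Tara", "Sophia"],
    "occasion": ["birthday", "birthday", "Christmas", "Christmas"],
    "date": pd.to_datetime(["2011-03-02", "2011-06-14",
                            "2011-12-25", "2011-12-25"]),
})

# Sort persons by their first gift date in 2011,
# keeping each person's gift dates grouped together.
first_date = plan.groupby("person")["date"].transform("min")
planning_list = (plan.assign(first_gift_date=first_date)
                     .sort_values(["first_gift_date", "person", "date"]))
print(planning_list)
```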

Everyone wants to be a cheerful giver.  Fine new data tools serve that purpose and ensure that you’ll warm hearts on all the gift-demanding days in your full calendar.

challenges to mass media generate interest in FCC

Consider by publication year the normalized frequency of “FCC” (U.S. Federal Communications Commission) in the text of a large corpus of books published in the U.S. from 1934 to 2008.[1] While you might think that the communications industry keeps getting more complicated and more contentious, the normalized frequency of FCC references is lower in the 2000s than it was in the late 1970s and early 1980s.  While you might think that the breakup of AT&T in 1984 was a huge policy issue, references to the FCC were falling from the beginning of the 1980s through the end of the decade. References to the FCC started to increase rapidly about 1970 and peaked about 1978. What made the 1970s such interesting years in communications policy?

While more book computing is needed for a definitive answer, I conjecture that cable television’s challenge to broadcast television drove increasing book attention to the FCC in the 1970s.  By the mid-1960s, cable television was threatening to bring video competition to local television broadcasters.  In orders in 1965 and 1966, the FCC declared its authority over cable television and established “must carry” regulations requiring cable systems to carry local broadcast stations.[2] In 1972, the FCC issued a major Cable Television Report and Order to “open up cable’s potential to serve the public without at the same time undermining the foundation of the existing over-the-air broadcast structure.”[3]  Since existing over-the-air broadcasters were central media in the mass public sphere, the FCC’s goal was both politically understandable and quite difficult. The FCC issued extensive, detailed regulations to support that goal. These regulations provided rich fodder for authorial discussion of free speech, media concentration, and political battles both small and large.

When communications regulation has greater effects on a common mass medium (television), it will be discussed more in elite mass media (books).

*  *  *  *  *

[1] The U.S. Congress established the FCC in the Communications Act of 1934.  The last year of data in the Google Ngram Viewer is 2008.  In the Google Ngram Viewer, the American English corpus refers to a large set of books published in the U.S.  I use a minimal amount of smoothing (1 — meaning a three-year moving average) to filter some high-frequency noise while retaining reasonable year resolution.
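For readers who want to reproduce that smoothing outside the Ngram Viewer, a smoothing value of 1 is just a centered three-year moving average; here is a minimal pandas sketch, with invented frequency values standing in for the real Ngram data.

```python
import pandas as pd

# Invented normalized yearly frequencies of "FCC";
# real values come from the Google Ngram Viewer.
freq = pd.Series([1.2e-6, 1.5e-6, 1.9e-6, 1.7e-6, 1.4e-6],
                 index=range(1974, 1979))

# Smoothing of 1 in the Ngram Viewer: a centered three-year moving average.
smoothed = freq.rolling(window=3, center=True, min_periods=1).mean()
print(smoothed)
```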

[2] FCC, First Report and Order, Dockets 14895, 15233, 38 FCC 683 (1965); FCC, Second Report and Order, Dockets 14895, 15233, 15971, 2 FCC 2d. 729 (1966).

[3] FCC, 36th Annual Report, Fiscal Year Ending June 30, 1970 (1971) p. 33, cited in William Emmons, “Public Policy and the Evolution of Cable Television: 1950-1990,” Business and Economic History, 2nd Series, v. 22 (1992) pp. 182-191.

badly structured tables have a bright future

Which is better: one big table, or two or more smaller tables?  The organization of the data sources, the number of smaller tables, the extent of the relationships between the smaller tables, and economies in table processing all affect the balance of advantage.  But cheaper storage, cheaper computing power, and fancier data tools probably favor the unified table.  At the limit of costless storage, costless processing, and tools that make huge masses of data transparent, you can handle a component of the data as easily as you can handle all the data.  Hence in those circumstances, using one big table is the dominant strategy.[*]

Unified tables are likely to be badly structured from a traditional data modeling perspective.  With n disjoint components, the unified table has the form of a diagonal matrix of tables, where the diagonal elements are the disjoint components and the off-diagonal elements are empty matrices.  It’s a huge waste of space.  But for the magnitudes of data that humans generate and curate by hand, storage costs are so small as to be irrelevant. Organization, in contrast, is always a burden to action.  The simpler the organization, the greater the possibilities for decentralized, easily initiated action.
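As a small illustration, assuming Python with pandas and two made-up record lists that share no fields, the unified table shows exactly that pattern: each record is empty in the other component’s fields.

```python
import pandas as pd

# Two disjoint components with no shared fields (made-up examples).
books = pd.DataFrame({"title": ["A", "B"], "pages": [200, 310]})
gifts = pd.DataFrame({"person": ["Sophia"], "cost": [25.0]})

# One unified table: the "off-diagonal" cells are simply empty.
unified = pd.concat([books, gifts], ignore_index=True, sort=False)
print(unified)
#   title  pages  person  cost
# 0     A  200.0     NaN   NaN
# 1     B  310.0     NaN   NaN
# 2   NaN    NaN  Sophia  25.0
```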

Consider collecting data from company reports to investors.  Such data appear within the text of reports, in tables embedded within text, and (sometimes) in spreadsheet files posted with presentations.  Here are some textual data from AT&T’s 3Q 2010 report:

More than 8 million postpaid integrated devices were activated in the third quarter, the most quarterly activations ever. More than 80 percent of postpaid sales were integrated devices.

These data don’t have a nice, regular, tabular form.  If you combine that data with data from the accompanying spreadsheets, the resulting table isn’t pretty.  It gets even more badly structured when you add human-generated data from additional companies.
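A rough sketch of what that combination looks like, using pandas and treating the two figures quoted above as one record, plus a hypothetical record from another company that discloses different metrics:

```python
import pandas as pd

# The two figures quoted above, recorded as lower bounds ("more than ...").
att = pd.DataFrame([{
    "company": "AT&T", "quarter": "3Q 2010",
    "postpaid_integrated_activations": 8_000_000,
    "integrated_share_of_postpaid_sales": 0.80,
}])

# A hypothetical record from another company with different disclosures.
other = pd.DataFrame([{
    "company": "ExampleCo", "quarter": "3Q 2010",
    "net_subscriber_additions": 350_000,
}])

# The combined table is wide and sparse: badly structured, but easy to query.
combined = pd.concat([att, other], ignore_index=True, sort=False)
print(combined)
```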

Humans typically generate idiosyncratic data presentations.  More powerful data tools allow persons to create a greater number and variety of idiosyncratic data presentations from well-structured, well-defined datasets. One might hope that norms of credibility evolve to encourage data presenters to release the underlying, machine-queryable dataset along with the idiosyncratic human-generated presentation.  But you can think of many reasons why that often won’t happen.

Broadly collecting and organizing human-generated data tends to produce badly structured tables.  No two persons generate exactly the same categories and items of data.  The data that persons present change over time. The result is a wide variety of small data items and tables. Combining that data into one badly structured table makes for more efficient querying and analysis. As painful as this situation might be for thoughtful data modelers, badly structured tables have a bright future.

*  *  *  *  *

[*] Of course the real world is finite.  A method with marginal cost that increases linearly with job size pushes against a finite world much sooner than a method with constant marginal cost. The above thought experiment is meant to offer insight, not a proof of a real-world universal law.

exploring and remodeling table fields

Sometimes tables are messy not just in their data items, but also in the fields that define the table columns.[1]  Various techniques help to deal with such “second order” messiness.  Sorting table fields alphabetically or evaluating them with more powerful text similarity measures helps to identify inadvertently duplicated fields. Sorting table fields by the number of non-null items in the fields would tell you the relative data size of different fields.  Of course, you would also like to be able to merge and split fields just as easily as you select fields to include in a sub-table of interest.
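These field-level checks are simple to script.  Here is a minimal pandas sketch on a made-up table with a near-duplicate field:

```python
import pandas as pd

# Made-up table with a near-duplicate field ("Cost" vs. "cost ($)").
df = pd.DataFrame({
    "Cost":     [10, None, 25],
    "cost ($)": [None, 15, None],
    "person":   ["Ann", "Bo", "Cy"],
})

# Alphabetical field list: near-duplicates end up next to each other.
print(sorted(df.columns, key=str.lower))

# Number of non-null items per field, largest first.
print(df.notna().sum().sort_values(ascending=False))
```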

Transforming a table into row number/field/item triples can be helpful for exploring and remodeling table fields.  With such a transformation, fields become new data items.  Excluding triples with null items and faceting the fields shows the count of items in each field.  Merging and splitting fields become operations of selecting relevant triples and renaming the field data item. When you have finished such meta-cleaning, you might want to transform the triples back into a remodeled table.  Since field merges can create a field containing more than one item, the remodeled table ideally would support multiple items per cell.[2]
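Here is one way to sketch that triple transform in pandas, reusing the made-up table above: melt into row/field/item triples, drop null items, facet the field column, and merge a duplicated field by renaming it.

```python
import pandas as pd

# The same made-up table as above.
df = pd.DataFrame({
    "Cost":     [10, None, 25],
    "cost ($)": [None, 15, None],
    "person":   ["Ann", "Bo", "Cy"],
})

# Transform into row number / field / item triples, excluding null items.
triples = (df.rename_axis("row").reset_index()
             .melt(id_vars="row", var_name="field", value_name="item")
             .dropna(subset=["item"]))

# "Faceting" the fields: count of items in each field.
print(triples["field"].value_counts())

# Merging two fields is just renaming the field data item in the triples.
triples["field"] = triples["field"].replace({"cost ($)": "Cost"})
```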

The triple transform is an instance of more general table reshaping operations.  A long transformation replaces a set of fields with a key field containing those field names and a new field containing the keyed values.  A wide transformation reverses the long transformation.  It expands a key field into separate key-instance fields populated with the corresponding keyed values.  Non-varying fields are expanded or contracted as necessary.  Table reshaping operations change both the column and row dimensions of a table.  The reshape command in Stata implements this class of table operations.  The Stata reshape documentation shows examples of these transformations. SDDL / STT provides table reshaping using spreadsheet-based markup.
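In pandas terms (the post’s examples are Stata and SDDL / STT, so take this as an analogue with hypothetical field names), the long and wide transformations are a melt and a pivot; the non-varying id field is carried along as-is.

```python
import pandas as pd

# Hypothetical wide table: one row per id, one income field per year.
wide = pd.DataFrame({"id": [1, 2], "inc1990": [300, 400], "inc1991": [320, 410]})

# Long transformation: the income fields collapse into a key field ("year")
# plus a field containing the keyed values ("income").
long_form = wide.melt(id_vars="id", var_name="year", value_name="income")

# Wide transformation: reverse the long transformation.
wide_again = long_form.pivot(index="id", columns="year", values="income").reset_index()
print(long_form)
print(wide_again)
```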

The awesome new data tool Google Refine can do table reshaping.  Here’s the Google Refine procedure for transforming a table into row/field/item triples:

  1. Create a row-number data column called “row” at the beginning of your table, if you don’t already have such a column.  That column will be identical to the (non-editable) row number column on the Google Refine table presentation frame.
  2. From the column drop-down menu in the first column after the row number column, select “Transpose” / “Cells across columns into rows…”.  Select the field below the row field in the “from column” and the last field in the “to column”. Call the column name “field”, check “prepend the column name”, and separate the column name and cell value with “:” (or some other value not included in item text).  Then click “Transpose”.  The table is then elongated, but it needs some further modifications.
  3. From the “row” column drop-down menu, select “Edit cells” / “Fill down”.  That fills in the row numbers for all the records.
  4. In the field column, select “Edit column” / “Split into several columns…”, and enter the separator (“:”) used in step 2. Clicking OK completes transforming the table into triples.

To transform the triples back into the wide table:

  1. In the drop-down menu for the rightmost column (field2), select “Transpose” / “Cells across columns into rows”.  For the number of rows to be transposed, enter the number n of fields in the original table (not including the row number field).
  2. Rename the new “field 2 n” fields based on the field names in the previous column (field 1).
  3. In any of the former “field 2 n” fields, select from the drop-down menu “Facet” / “Text facet”.  In the left panel select “(blank)”. You should now see all the blank “field 2 n” rows.
  4. In the drop-down menu for “all”, select “Edit rows” / “Remove all matching rows”.
  5. Click on “Reset All” and the table is back to its original form.[3]

This triple transform unfortunately isn’t as helpful as it could be.  With this triple transform, you can facet the (former) table fields and understand what fields exist in the table.  But any changes that you make just in the field column won’t have any effect in the reverse transform. A function that implemented the reverse transform while respecting changes in the field column (and automatically created the corresponding field names) seems straightforward to implement.[4] A robust reverse triple transform would make a great data tool even more useful.[5]
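Outside Google Refine, such a reverse triple transform is indeed only a few lines.  Here is a pandas sketch that honors edits made in the field column and joins multi-item cells with a delimiter, as note [4] suggests; the example triples are made up, and column order in the rebuilt table may differ from the original.

```python
import pandas as pd

def triples_to_table(triples, delimiter="; "):
    """Rebuild a table from row/field/item triples.

    Edits made in the field column (e.g. merged field names) carry through;
    multiple items landing in one row-field cell are joined with the delimiter.
    """
    return (triples.pivot_table(index="row", columns="field", values="item",
                                aggfunc=lambda items: delimiter.join(map(str, items)))
                   .reset_index())

# Tiny example: two cost fields were already merged by renaming in the field column.
triples = pd.DataFrame({
    "row":   [0, 0, 1, 1],
    "field": ["person", "Cost", "person", "Cost"],
    "item":  ["Ann", 10, "Bo", 15],
})
print(triples_to_table(triples))
```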

*  *  *  *  *

Notes:

[1] Throughout this post, table means a data collection with one dimension of categorization (a record list), e.g. a typical relational database table that organizes fields into records.

[2] If it doesn’t, the table cleaner has to ensure before a field merge that no multi-item cells will be created.

[3] Similar steps can implement table reshaping on just a subset of columns.

[4] The possibility of having multiple items in one row-field cell would have to be addressed.  One solution would be to ask for a delimiter before the transform, and use that delimiter to form multi-item cells.

[5] Collecting human-generated unstructured and structured data using SDDL / STT tends to result in messy, wide, sparse tables.