Tuesday, May 13, 2008

A Pragmatic Take On Our Use Of Data

In the course of our careers, most of us have heard someone utter the misbegotten statement "The data speaks for itself." This is doubly irritating. First, we collect data because we want to find answers to particular questions, and we use it as evidence to support particular claims about those questions. It is the claims that matter. On its own, data could be used to support a variety of claims. Drawing in part from the following survey data: 6 American moms prefer Jif peanut butter, 3 prefer Skippy, and 2 prefer Nutella hazelnut spread, we could claim that American moms prefer Jif, or that Nutella is making inroads into the American market, or that our survey does not reflect what is known about the preferences of the population as a whole. Data is information. It is authors that speak. Second, the statement is grammatically incorrect. Data is the plural form of datum. Data wouldn't "speaks"; they would "speak." A well-formed (but still misbegotten) statement, then, would say "The data speak for themselves."

I am more irritated by the former. In fact, I am not bothered by the grammar of "The data speaks for itself" at all. It is true that, in one sense, data is the plural of datum, but I don't believe that this is the way that most academics, or most people in general, use the term. When we talk about data, we are usually interested in its value for making statistical inferences. Making statistical inferences requires us to draw a sample large enough that we can presume that, within a finite degree of uncertainty, what we found in the sample is likely to be representative of some larger population. The key to such a "large enough" sample is that one more or one fewer datum is unlikely to significantly change the inferences we make. We're not interested in data here so much as we are in a dataset. The difference is a subtle one, and I acknowledge that we don't always use data in this context, and that this is a very simplistic discussion of data and statistical inference. And that I could be wrong. But I think the discussion is good enough for what I am concerned with here. A "large enough" dataset speaks as one voice, and hence using singular grammar is most appropriate.
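For readers who like to see the arithmetic, here is a rough sketch of the "one datum more or less" idea in Python. It takes the peanut butter counts from the survey example above and compares them with a hypothetical survey that has the same proportions but 100 times as many responses. That larger survey is purely my assumption for illustration, as are the function names. The sketch only measures how far the Jif share moves when a single Jif response is dropped; it is not a formal test of statistical significance.

```python
def share(counts, choice):
    """Fraction of all responses that picked `choice`."""
    return counts[choice] / sum(counts.values())


def shift_if_one_dropped(counts, choice):
    """Absolute change in the share of `choice` when one of its responses is removed."""
    reduced = dict(counts)
    reduced[choice] -= 1
    return abs(share(counts, choice) - share(reduced, choice))


# Counts from the survey example in the post (11 responses in total).
small = {"Jif": 6, "Skippy": 3, "Nutella": 2}

# Hypothetical survey (an assumption): same proportions, 100 times larger.
large = {brand: n * 100 for brand, n in small.items()}

small_shift = shift_if_one_dropped(small, "Jif")
large_shift = shift_if_one_dropped(large, "Jif")

print(f"11 responses:   Jif share moves by {small_shift:.4f}")
print(f"1100 responses: Jif share moves by {large_shift:.4f}")
```

With 11 responses, losing one Jif answer takes Jif from 6 of 11 to 5 of 10, so the claim that a majority prefers Jif no longer holds. With 1,100 responses the same loss barely registers, which is the sense in which a large dataset "speaks as one voice."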

Here are the goods: language is both full of grammatical rules and wonderfully inventive. In the case of data, we have two rules: data is the plural form of datum, and data is the shortened form of dataset. Pragmatically speaking, each rule is appropriate to a given community of speakers in a given context. Natural language is pretty good at recognizing which rules are appropriate to which context, and at inventing new rules when the context demands it. This comes, however, with a tolerance for things like ambiguity, error, and evolutionary change that technical discourses might find unacceptable.

I should add that part of the reason I wrote this post is that I think that, at least for the areas I am currently interested in (urban design, professional practice), we planning academics are too often the dowdy grammarian correcting unruly students. Research often demands that we define our terms, so that we can measure, question, and test with precision, but we shouldn't expect professional and lay communities to define the same terms in the same way. Further, I would argue that once in a while the colloquial meaning of a term generated within a professional or lay community has a greater logic than the formal/technical meaning generated within our own academic world, regardless of whether such meanings are explicit or overt.
