
Writing a data management plan: data transformations
In survey data collected from questionnaires, multiple choice and other kinds of responses are usually coded as numbers instead of character strings. This simple type of transformation makes data entry easier when paper responses are typed in, and avoids inconsistencies (such as typos) in data values.
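As a minimal sketch of this kind of coding, the example below maps text answers to numeric codes using a small codebook. The codebook, the five-point scale and the missing-value code -9 are assumptions made up for illustration, not a standard.

```python
# Hypothetical codebook for one multiple-choice question (illustrative only).
CODEBOOK = {
    "strongly disagree": 1,
    "disagree": 2,
    "neutral": 3,
    "agree": 4,
    "strongly agree": 5,
}
MISSING = -9  # assumed convention for missing or unrecognised answers


def encode_response(text):
    """Return the numeric code for a text answer, or MISSING if it is not in the codebook."""
    key = text.strip().lower()  # ignore differences in case and surrounding spaces
    return CODEBOOK.get(key, MISSING)


responses = ["Agree", " neutral", "Strongly Agree", "agre"]  # last value is a typo
codes = [encode_response(r) for r in responses]
print(codes)  # [4, 3, 5, -9]
```

The mistyped answer is flagged with the missing-value code rather than silently becoming a new category, which is one way that coding keeps data values consistent.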
Other types of qualitative data (interview transcripts for example) can be transformed into quantitative data by applying textual coding and categorisation techniques. Such variables (created by a human thought process rather than computed) are described as categorical or nominal, and a certain range of statistical techniques can be applied to them, albeit not as many as to 'real' numbers. (The 'levels of measurement' are nominal, ordinal, interval and ratio.)
Another reason for data transformation may be to visualise the data effectively. A simple example is converting data that have a numerator and a denominator from ratios to percentages, so that they can be displayed on a bar chart or pie chart.
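The conversion from ratios to percentages might look like the sketch below. The site names and counts are invented for illustration, and returning None for a zero denominator is an assumed convention.

```python
def to_percentage(numerator, denominator, ndigits=1):
    """Convert a numerator/denominator pair to a percentage, rounded to ndigits places."""
    if denominator == 0:
        return None  # assumed convention: no percentage can be computed
    return round(100 * numerator / denominator, ndigits)


# Invented example data: (responses received, questionnaires sent) per site.
counts = {"Site A": (45, 60), "Site B": (30, 120)}
percentages = {site: to_percentage(n, d) for site, (n, d) in counts.items()}
print(percentages)  # {'Site A': 75.0, 'Site B': 25.0}
```

Percentages put both sites on the same scale, so they can be compared on one chart even though the denominators differ.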
When preparing data for display (visualisation), questions of scale and granularity arise. For example, should a line chart plot daily occurrences, or be smoothed (averaged) to show points by week or month? The answer depends on what is worth showing in the data. Anomalies may be smoothed over at a higher level; is this desirable, to eliminate noise, or deceptive (where anomalies may be revealing)?
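The effect of granularity can be illustrated with a small sketch that averages daily counts into weekly points. The daily values are invented, and a simple non-overlapping weekly mean is only one of several possible smoothing choices.

```python
def weekly_means(daily_counts, days_per_week=7):
    """Average consecutive, non-overlapping blocks of daily counts."""
    means = []
    for start in range(0, len(daily_counts), days_per_week):
        block = daily_counts[start:start + days_per_week]
        means.append(sum(block) / len(block))
    return means


# Invented daily occurrences for two weeks; day 7 contains an anomaly (42).
daily = [3, 4, 2, 5, 3, 4, 42, 3, 2, 4, 3, 5, 4, 7]
weekly = weekly_means(daily)
print(max(daily))  # 42
print(weekly)      # [9.0, 4.0]
```

At the weekly level the spike of 42 shows up only as a somewhat higher average of 9.0, which is exactly the trade-off between removing noise and hiding a potentially revealing anomaly.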
A number of techniques may be used to transform confidential or sensitive data so that they may be shared with other researchers. These include:
- aggregation: "the combination of related categories, usually within a common branch of a hierarchy, to provide information at a broader level to that at which detailed observations are taken." (OECD definition) Geographical data are also often aggregated to a higher unit, where information is deemed sensitive or revealing (like postcode unit to postcode sector).
- anonymisation: cases are stripped of revealing identifiers such as name and address. Pseudonymisation is a common technique for protecting identities in qualitative data.
- perturbation: a deliberate distortion is introduced at the level of tabular data cells. Population Census data are sometimes released with perturbations as a trade-off for geographical detail.
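The three techniques above can be sketched in a few lines. This is a toy illustration under stated assumptions, not a recognised disclosure-control method: the names and postcodes are invented, the "Participant N" labels are a hypothetical naming scheme, and the random shift of up to 2 per cell is an arbitrary choice.

```python
import random
from collections import Counter


def postcode_sector(postcode_unit):
    """Aggregate a UK postcode unit to its sector: outward code plus first digit of the inward code."""
    outward, inward = postcode_unit.upper().split()
    return f"{outward} {inward[0]}"


def pseudonymise(names, prefix="Participant"):
    """Map each distinct name to a stable label, in order of first appearance."""
    mapping = {}
    for name in names:
        if name not in mapping:
            mapping[name] = f"{prefix} {len(mapping) + 1}"
    return mapping


def perturb(cell_counts, max_shift=2, seed=0):
    """Add a small random shift to each table cell, never going below zero."""
    rng = random.Random(seed)  # seeded so that the example is reproducible
    return [max(0, c + rng.randint(-max_shift, max_shift)) for c in cell_counts]


# Invented records: (name, postcode unit). "ZZ" is not a real postcode area.
records = [("Ann Lee", "ZZ1 2AB"), ("Raj Patel", "ZZ1 2CD"), ("Ann Lee", "ZZ9 8EF")]

sector_counts = Counter(postcode_sector(pc) for _, pc in records)
mapping = pseudonymise(name for name, _ in records)
perturbed = perturb([12, 0, 7, 31])

print(dict(sector_counts))  # {'ZZ1 2': 2, 'ZZ9 8': 1}
print(mapping)              # {'Ann Lee': 'Participant 1', 'Raj Patel': 'Participant 2'}
```

Note that the pseudonym mapping itself identifies the participants, so it would need to be stored separately from the shared data.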
You have now been introduced to file formats and transformation.