The Data Corner
Data Standardization: A Tedious Necessity
By Marlene Hanken, Data Analyst, Science

Welcome to The Data Corner, a place where this resident data nerd addresses some of the most common data issues and data-related topics facing the agriculture industry today!
This issue’s topic is data standardization. What is data standardization, anyway? Simply put, it is a set of processes and standards that allow data sets from various sources and formats to be conformed into a common format and source for analysis and interpretation: the foundation for collaboration. For those familiar with spreadsheets, the concept means taking two spreadsheets with differing column headings and unifying the data into a single spreadsheet in which no two fields contain redundant information.

Let’s use the example of two different growers sharing commodity yields for the year. Perhaps both have fields containing the commodities and yield volumes, but one has planting information in its other columns and the other has harvest information. Does the resulting spreadsheet contain all columns from both (planting and harvest information)? Or does it contain only a subset, namely the fields the two have in common? This is a common decision made by data managers (the person or entity combining the spreadsheets) and is an important part of data standardization.

But why is data standardization so important? When data isn’t standardized, the resulting analytics are like a blurry photo: obscured findings at best and completely incomprehensible analytics at worst. A common example is the naming of commodities. Different operations may record the same commodity in similar ways, but unless the names are identical, character for character, a computer cannot recognize them as the same. Even a leading or trailing space makes a difference (e.g., “romaine lettuce” vs. “ romaine lettuce” vs. “romaine lettuce ”). Without a process to reconcile those three values, the resulting analytics (for example, a count of how many commodities are being reported) would return three rather than one. Sound tedious? It is! That’s what makes data standardization and the attention of good data managers so important!

The previous example seems simple and straightforward, but let’s take another common example: one company reports a commodity as “romaine lettuce” and another as “lettuce, romaine.” Both commodity names are valid and identifiable by other industry participants and even consumers. However, just as in the first example, our count of commodities would indicate two when, in fact, they are the same and should be represented as one commodity. So, which one does a data manager use? Herein lies the first obstacle in data standardization: the
[Figure: the same commodity recorded three ways: “Romaine Lettuce,” “Lettuce, Romaine,” and “Romaine.”]
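
For readers who want to see the commodity-naming problem in action, here is a minimal sketch in Python. It is an illustration only, not a description of any particular system: the sample values, the `CANONICAL` lookup table, and the `standardize` function are all hypothetical, and the choice of “romaine lettuce” as the preferred name is an assumption made purely for the example.

```python
# Commodity names as they might arrive from different operations (sample values).
reported = [
    "romaine lettuce",
    " romaine lettuce",   # leading space
    "romaine lettuce ",   # trailing space
    "lettuce, romaine",   # same commodity, different word order
]

# Hypothetical lookup table: which name is "canonical" is a decision
# the data manager has to make; this choice is assumed for illustration.
CANONICAL = {"lettuce, romaine": "romaine lettuce"}


def clean_whitespace(name: str) -> str:
    """Trim the ends, collapse repeated spaces, and lowercase the name."""
    return " ".join(name.split()).lower()


def standardize(name: str) -> str:
    """Clean whitespace, then map known variants to the canonical name."""
    cleaned = clean_whitespace(name)
    return CANONICAL.get(cleaned, cleaned)


raw_count = len(set(reported))
whitespace_count = len({clean_whitespace(n) for n in reported})
standardized_count = len({standardize(n) for n in reported})

print(raw_count, whitespace_count, standardized_count)  # prints: 4 2 1
```

Trimming whitespace alone resolves the space-padded variants but still leaves two names. Only the lookup table, which encodes a human decision, brings the count down to one.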
SEPTEMBER | OCTOBER 2022
Western Grower & Shipper | www.wga.com