FEATURE STORY
SUSTAINABLE DATA MANAGEMENT: WHAT IS IT AND WHY DOES IT MATTER?
By Gustavo Reyes, Food Safety Manager
In today’s world, data is the driver of progress and innovation. From tracking soil health and assessing water quality to guiding regulatory decisions and following microbiological trends, data helps improve efficiency and decision making. I have recently been polishing my modeling skills by reading a book on modeling techniques by Max Kuhn and Julia Silge. It reminded me that the quality of a model depends directly on the quality of the data used to train it. This realization made me think about data management with a little twist: How can we make data management sustainable to ensure the longevity of our data and, ultimately, our insights?

In exploring this question, I found myself delving into what the principles of sustainable data management would be. Much as organizations approach environmental sustainability, the idea is to approach data management with practices that ensure the data's longevity, utility and stewardship. So, what would be the pillars of sustainable data management?

1. Data Quality and Integrity: Sustainable data management begins with ensuring data quality at every stage, from collection to analysis. This means focusing on several key aspects:
a. Accuracy: Data should represent what we are trying to capture. In food safety testing data, this means that sampling and testing are performed as intended. In monitoring programs such as capturing weather data, this means that equipment is correctly calibrated. Poor accuracy could introduce bias that would go unaccounted for when the data is analyzed and used in the future.
b. Completeness: Missing data can be a challenge, and sustainable practices involve designing collection programs to minimize missing values and frequently verifying data for completeness. Incomplete data could render observations of little value in future analyses.
c. Consistency: Uniformity among datasets (within your company or with outside companies) allows the data to be cohesive.
For example, ensuring that the same variable names, formats and units are used across data sources reduces friction when integrating datasets.

2. Effective Storage and Access: As datasets grow, the way data is stored can affect its longevity and utility. Many organizations still rely on ad hoc file systems, where data is stored in random folders on individual computers or external drives. While this might seem convenient in the short term, it creates numerous problems in the long run: files can be lost to a hardware failure, redundancy and duplication increase, and collaborating and tracking changes become nearly impossible. To address these challenges, you can take the following steps:
a. Centralized data repository: This allows for automatic organization and retrieval. You can think of it as a file that exists in one shared location. Multiple people can access it, and the system tracks changes so that earlier versions can be recovered.
b. Cloud-based storage: This not only offers the scalability to handle growing datasets but also helps reduce reliance on physical infrastructure, which can be expensive and difficult to maintain. Cloud services also support integration with tools for data analysis, visualization and modeling.
c. Consistency in organization: This includes an established naming convention that includes dates, a folder hierarchy for easy access and, where possible, guides or metadata that accompany the data.

3. Future-Proofing Through Documentation: Documentation serves as the “instruction manual” for your datasets, providing essential information about what the data represents, how it was collected and how it can be used. Without thorough documentation, users (including your future self) may struggle to understand key aspects of the data. Here are a few considerations:
a. Data collection: Document the source, methodology and timing.
b. Data cleanup: Document error corrections, transformations and any additional changes made to reach the final format.
c. Use and purpose of the data: Will this data be used for trending analyses, creation of plots or aggregation into other data?
d. Version history: A version history ensures that changes to the dataset over time are well documented. This prevents confusion about which version to use and helps identify how the data evolved.

Sustainable data management is about prioritizing quality and adopting sustainable practices so that we not only enhance the value of our insights but also contribute to a culture of respect for the data-driven systems that shape our world.
Sustainable data management is a journey we all must embark on and commit to if we want to maximize what we learn from our data.
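
For readers who want to see what some of these practices could look like in code, the completeness and consistency checks from the first pillar and the dated naming convention from the second can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions, not a prescribed method: the field names (sample_id, collected_on, water_temp_c, water_temp_f), the choice of Celsius as the shared unit and the file-name pattern are all hypothetical examples chosen for illustration.

```python
from datetime import date

# Hypothetical schema (an assumption for this sketch): every record should
# carry these fields, with temperature stored in degrees Celsius.
REQUIRED_FIELDS = ("sample_id", "collected_on", "water_temp_c")


def missing_fields(record: dict) -> list:
    """Return the required fields that are absent or empty in one record."""
    return [f for f in REQUIRED_FIELDS if record.get(f) in (None, "")]


def completeness(records: list) -> float:
    """Fraction of records that have every required field filled in."""
    if not records:
        return 0.0
    complete = sum(1 for r in records if not missing_fields(r))
    return complete / len(records)


def standardize(record: dict) -> dict:
    """Consistency step: replace a Fahrenheit field with the shared Celsius one."""
    out = dict(record)  # copy so the original record is left untouched
    if "water_temp_f" in out:
        temp_f = out.pop("water_temp_f")
        out["water_temp_c"] = (
            None if temp_f is None else round((temp_f - 32) * 5 / 9, 1)
        )
    return out


def dated_filename(dataset: str, collected: date, version: int) -> str:
    """Build a file name that starts with an ISO date and ends with a version."""
    return f"{collected.isoformat()}_{dataset}_v{version:02d}.csv"


# Two made-up records from different sources, one of them incomplete.
records = [
    {"sample_id": "A1", "collected_on": "2025-01-15", "water_temp_f": 68.0},
    {"sample_id": "A2", "collected_on": "", "water_temp_c": 19.5},
]
cleaned = [standardize(r) for r in records]
print(completeness(cleaned))
print(missing_fields(cleaned[1]))
print(dated_filename("water_quality", date(2025, 1, 15), 2))
```

Running checks like these each time new data arrives, rather than at analysis time, is one way to catch gaps and mismatched units while the people who collected the data can still correct them.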
23 Western Grower & Shipper | www.wga.com January | February 2025