3 Things We Learned (the Hard Way) About Data Quality
October 04, 2019 By BlueAlly
The definition of data quality is not straightforward. It is better understood as a qualitative assessment of how well data can serve its purpose in a specific context. There are many definitions of what data quality is, or can be, and most of them rest on a multitude of factors that are critical to that contextual purpose.
Over the last six years we’ve built a top-tier big data platform for our manufacturing business. Here are the three key lessons we learned about data quality for analytics and its impact on business:
Video Transcript
Hi, my name is JuneAn Lanigan and I am the global leader of enterprise data management at Western Digital. Over the last six years, we have been building out a scalable big data platform and ecosystem, and I’m really excited to share with you some of the things that we have done to build a best-in-class platform, but also some of the learnings that we’ve had.
Impact of Data Quality on Business Needs
In thinking about building out a platform that truly scales, you have to think about the needs of the business today and the current technologies, but you also have to start looking at the future – what the business will need as it evolves in its analytics capabilities – and [you need to] really place some bets on technology. But one of the fundamental things that we’ve learned is the importance of data quality.
I run enterprise data management. I certainly understand quality, but “quality” has many, many dimensions to it, and so we really have to look at how we are performing on multiple dimensions – the completeness of the data, the accuracy of the data, the latency of the data, and certainly the cost of the data. So, I’d like to share some of the things that we’ve learned [about data quality].
Data Completeness
One of the critical factors to consider is data completeness, and I think it makes sense that as you move data – in our case from different factories across the network to a cloud environment – you have to make sure that whatever data you picked up is the data you actually inserted into the tables in the cloud. That’s an expectation, and it’s a reasonable expectation, because we’re supporting analytics in this big data environment and those analytics have far-reaching value to the business, to the bottom line.
So, if we have the opportunity to take out millions of dollars of cost based on our analysis, we have to make sure that the data from which those decisions are made is accurate. What we found, though, is that we weren’t meeting the quality standards we needed to. We benchmarked our own data and tried to set a threshold of 85 percent, but even that isn’t good enough for the business when they’re banking on that data to make their decisions.
And, so what we did was implement our Western Digital ActiveScale™ in our Las Vegas data centers so that we could actually bring our data directly from the factory to our on-premises location before we brought that data to the cloud. What we were able to accomplish is consistent five 9’s data quality across all of our various pipelines. It used to be that I’d have to report to the business daily as to how we were doing against our benchmarks. I still look at those metrics, but they don’t have to anymore. They can trust the data.
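To ground that in something concrete: the check described above boils down to comparing what was picked up at the factory with what actually landed in the cloud tables, per pipeline, against a target such as five 9’s. The short Python sketch below illustrates that idea only; the pipeline names, record counts, and data structures are hypothetical, not Western Digital’s actual implementation.

```python
# Minimal sketch of a data-completeness check: compare the record counts
# captured when files are picked up at the factory against the record counts
# actually landed in the cloud tables, per pipeline.
# All names and figures here are hypothetical illustrations.

from dataclasses import dataclass


@dataclass
class PipelineLoad:
    pipeline: str
    records_picked_up: int   # counted at the factory edge
    records_landed: int      # counted after insert into the cloud tables


def completeness(load: PipelineLoad) -> float:
    """Fraction of picked-up records that made it into the cloud tables."""
    if load.records_picked_up == 0:
        return 1.0
    return load.records_landed / load.records_picked_up


# "Five 9's" target: at least 99.999% of records must land.
TARGET = 0.99999

loads = [
    PipelineLoad("fab-a-test-results", 1_250_000, 1_249_992),
    PipelineLoad("fab-b-equipment-logs", 830_000, 830_000),
]

for load in loads:
    pct = completeness(load)
    status = "OK" if pct >= TARGET else "ALERT"
    print(f"{load.pipeline}: {pct:.5%} complete [{status}]")
```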
Data Accuracy
The next point to consider is data accuracy, and this is where I like to say beauty is in the eye of the beholder, because everybody uses the data differently. So, if you are a product program manager, you’ll define the data based on your perspective and what you’re trying to accomplish with those parameters, but a test engineer or manufacturing operations may be using that data differently. And, therefore, we found the key is data governance – to be able to identify who owns the data, really understand who uses that data and for what purposes, and come together as teams to make sure that we’re aligned on those definitions and how the data is going to be used. The key here is ownership: not only telling someone “you’re the owner,” but providing the capabilities and the transparency into the data through dashboards, so that owners can monitor whether they’re meeting the expectations for that data now.
The data governance group on my team is small, and they’re really teaching people how to be a data owner and how to be a data steward. We used to have a backlog of issues around data accuracy; today, because the teams are knowledgeable about how to address issues, as an issue comes in they form as a team, [and] they are able to resolve the issue and close the ticket. So, I really think that one of the things to focus on in data accuracy is governance – who owns that data and who’s accountable for it.
Data Latency
Then, the last point I think we should consider is data latency. What that means is: how old can the data be before somebody uses it? In the beginning of our journey, six years ago, people were really using their current data sources – those silos of data within their own manufacturing location. So, the requirements that we were getting were more like 24-48 hours, because they were using our platform in parallel to do their analysis, but they were making their business decisions based on the environments that they had at that time. As they came to understand the capabilities of the big data platform and the value it could bring to their analytics, the latency requirement got smaller and smaller for each of the pipelines. So, we’ve had to evolve our environment to go from 24-48 hours to more like 15 minutes to four hours. And, in order to do that, we had to really think about how we were processing the data.
So, again, we implemented ActiveScale in our Las Vegas data center so that we could quickly move the data from the manufacturing site onto our on-premises environment in Las Vegas, and then use the S3 object store to send that data to Amazon. In AWS™ we fan out the data, and we learned a lot about our pipelines. We had to re-engineer several of them, but that was what was necessary for us to meet that service level agreement on latency with our consumers.
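To make the shape of that hop concrete: ActiveScale exposes an S3-compatible interface, so the pattern described here amounts to landing an object on the on-premises store and then forwarding it to Amazon S3, where the AWS pipelines fan it out. The sketch below shows that pattern with boto3; the endpoint URL, credentials, bucket names, and object keys are placeholders rather than real systems.

```python
# Rough sketch of the hop described above: land a file on an S3-compatible
# on-premises object store, then forward it to Amazon S3 for fan-out in AWS.
# Endpoint URLs, credentials, buckets, and keys are hypothetical placeholders.

import boto3

# On-premises, S3-compatible object store (hypothetical endpoint).
on_prem = boto3.client(
    "s3",
    endpoint_url="https://activescale.example.internal",
    aws_access_key_id="ON_PREM_KEY",
    aws_secret_access_key="ON_PREM_SECRET",
)

# Public cloud S3, where the downstream AWS pipelines fan the data out.
cloud = boto3.client("s3")

local_file = "factory_batch_0001.parquet"
key = "fab-a/test-results/factory_batch_0001.parquet"

# Step 1: quick hop from the factory to the on-prem object store.
on_prem.upload_file(local_file, "landing-zone", key)

# Step 2: forward the same object to Amazon S3 for processing in AWS.
cloud.upload_file(local_file, "bigdata-ingest", key)
```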
Now, as we’re evolving more and more in our capabilities and maturity around analytics, the demands are getting even more intense, and now they want real-time and near real-time. So how do you accomplish that in a cloud environment? That’s very difficult, so what we’ve decided to do, as we have evolved our platform, is implement ActiveScale directly in each of our manufacturing locations, so that we can actually accommodate those requirements for streaming data and be able to insert analytics into the stream of data directly on the manufacturing shop floor.
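What “inserting analytics into the stream” can look like in miniature is sketched below. This is purely illustrative: a real shop-floor deployment would consume from a message broker or the equipment interface, whereas here a simple generator stands in for the stream and the analytic is a basic rolling out-of-spec check.

```python
# Illustrative sketch only: apply an analytic to events as they arrive,
# rather than after the data has been batched and landed elsewhere.
# The stream, field names, and thresholds are all hypothetical.

import random
from collections import deque
from typing import Iterator


def shop_floor_stream(n: int = 200) -> Iterator[dict]:
    """Simulated stream of test measurements from a manufacturing tool."""
    for i in range(n):
        yield {"unit_id": i, "measurement": random.gauss(mu=10.0, sigma=0.5)}


def run_inline_analytic(stream: Iterator[dict], window: int = 25, limit: float = 1.5) -> None:
    """Flag units whose measurement deviates from the rolling mean by more than `limit`."""
    recent = deque(maxlen=window)
    for event in stream:
        value = event["measurement"]
        if len(recent) == recent.maxlen:
            rolling_mean = sum(recent) / len(recent)
            if abs(value - rolling_mean) > limit:
                # In production this could trigger a hold or an alert on the floor.
                print(f"unit {event['unit_id']}: {value:.2f} out of spec "
                      f"(rolling mean {rolling_mean:.2f})")
        recent.append(value)


run_inline_analytic(shop_floor_stream())
```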
Futureproofing
So, those (data completeness, accuracy and latency) are really the key things to consider. Going back to what I’ve talked about [previously], it comes down to futureproofing your environment: really anticipating [future business] needs and figuring out how your architecture, your tools and technologies, and your processes can support that scalable evolution.