Chalmers St – Consulting

The Hard Work of Data Collection

I wish I could find the slides from the project I completed to become a certified Green Belt. I remember having a hard time pulling together data to support my work. It took over a year because I stubbornly searched for a ready-made data set that I could analyze. This was in 2001, so I no longer remember exactly what the project was about. I do remember an important lesson from the experience: we will rarely find the exact data that we need. More often, we will have to go capture and create the data set for our analysis.


A lot has changed in the operations world since 2001. Today most companies run on one or more enterprise resource planning (ERP) systems that collect huge amounts of data. Some of it is stored and easy to retrieve, some of it not so much. These days when we kick off an improvement project, the first task is to investigate the information captured in these systems. We need to understand how the information is organized, the quality of the data, and the different ways we can retrieve it to build a picture of the operations or the specific function we want to improve. Large systems like this can provide years of data on order types, fulfillment cycle times, and scrap amounts. It can feel like an endless array of information. Large data sources allow us to see all of the variation in output, and we can use reporting tools to create real-time dashboards. This is really great, but just as I learned from my first Green Belt project, this path comes with its own drawbacks.


Large data sources are often a collection of data tables that do not clearly or easily connect. A lot of time can be wasted trying to understand the input source and figuring out how to correctly tie tables together to create a clean picture. This takes some skill and experience, and when it is done incorrectly it leads to misinterpretations. Worse, if the source of the data is not clean, we could end up tabulating and analyzing incorrect information. Data tables without the practical context of how the data was gathered are not enough. There is always a need to get into the operations environment to understand how the data is entered.
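To make the risk of tying tables together incorrectly concrete, here is a minimal sketch in Python. The two tables, their column names, and the numbers are all made up for illustration; they are not from any client system. It shows one common mistake: joining an order-level table to a shipment-level table and then summing an order-level quantity, which counts an order once for every shipment it had.

```python
from collections import defaultdict

# Hypothetical order-level table: one row per order.
orders = [
    {"order_id": 1, "units_ordered": 100},
    {"order_id": 2, "units_ordered": 50},
]

# Hypothetical shipment-level table: order 1 shipped in two parts.
shipments = [
    {"order_id": 1, "units_shipped": 60},
    {"order_id": 1, "units_shipped": 40},
    {"order_id": 2, "units_shipped": 50},
]


def naive_join(orders, shipments):
    """Join every shipment row to its order row (one output row per shipment)."""
    by_id = {o["order_id"]: o for o in orders}
    return [{**by_id[s["order_id"]], **s} for s in shipments]


def aggregate_then_join(orders, shipments):
    """Roll shipments up to one row per order before joining."""
    shipped = defaultdict(int)
    for s in shipments:
        shipped[s["order_id"]] += s["units_shipped"]
    return [{**o, "units_shipped": shipped[o["order_id"]]} for o in orders]


# Summing units_ordered after the naive join counts order 1 twice.
naive_total = sum(r["units_ordered"] for r in naive_join(orders, shipments))
correct_total = sum(r["units_ordered"] for r in aggregate_then_join(orders, shipments))
true_total = sum(o["units_ordered"] for o in orders)

print(naive_total, correct_total, true_total)  # 250 150 150
```

The inflated total looks perfectly plausible on a dashboard, which is why it helps to check a joined result against a number you trust from the source table before analyzing it.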


The other problem with large data sources is that while we can see all of the variation, we do not really know much about its causes. For example, the data does not tell us that demand dropped for a week due to a large snowstorm, that we missed a bunch of shipments one day because of a system outage, or that scrap spiked one month because of a supplier issue. This is a big problem because understanding the source of variation is the whole purpose of an improvement project.
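One way to close part of this gap is to keep a simple event log next to the system data and look up any unusual point in it. The sketch below is an illustration only: the scrap rates, the event log entry, and the two-standard-deviation screen are assumptions chosen to keep the example short, not a substitute for a proper control chart.

```python
from statistics import mean, stdev

# Hypothetical monthly scrap rates (%) pulled from a system report.
scrap_rate = {
    "2023-01": 2.1, "2023-02": 1.9, "2023-03": 2.0, "2023-04": 2.2,
    "2023-05": 6.5, "2023-06": 2.0, "2023-07": 1.8, "2023-08": 2.1,
}

# Hypothetical event log kept by people close to the process.
event_log = {"2023-05": "Supplier lot with out-of-spec material"}


def flag_unusual(series, threshold=2.0):
    """Return keys whose value is more than `threshold` sample
    standard deviations away from the mean of the series."""
    values = list(series.values())
    center, spread = mean(values), stdev(values)
    return [k for k, v in series.items() if abs(v - center) > threshold * spread]


# Pair each flagged month with a recorded cause, if anyone wrote one down.
for month in flag_unusual(scrap_rate):
    cause = event_log.get(month, "no recorded cause, go ask the floor")
    print(month, scrap_rate[month], cause)
```

The system data can only tell us that May was unusual. The explanation comes from the log, and someone has to write that log.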


I achieved my Green Belt Certification in 2002 because I buckled down and did the hard work of data collection. We have to accept that the data collected automatically through large systems is often not the view we need to understand and solve the problem. When this happens, we have to develop a method for collecting the data manually. I like to think of this as two different approaches: controlled collection and uncontrolled collection.


When I say “uncontrolled collection” I do not actually mean that the collection itself is uncontrolled. That would be bad. I mean that we leave the process alone and observe it in its natural, untampered state. We set up a tracking sheet or we observe a day or two of regular operations. We collect specific data points and make notes about what we saw. These days I do not actually stand and watch processes for too long. When we want an hour or more of observation, we typically record the work with a camera. This approach is powerful because we collect exactly the data we want, and we can note in real time the things we observe that cause variation. It also has some drawbacks. The data sets are much smaller, so we will assuredly miss some types of variation. It is harder to collect enough data to create multi-month trends. It is also time consuming and can be a bit intrusive. Recently, a client employee asked if he could stop a data collection effort because it was consuming too much of his time. We had successfully validated an improvement, so we stopped the collection. The downside is that we lost the full picture of the savings from our work.


Anyone who has ever performed a designed experiment (DOE) or a gauge R&R has experience in controlled data collection. In this case, we are controlling the process of interest. We can study and understand the impact of specific variations because we create a controlled environment where we purposefully vary the factors of interest. These can be machine settings, supplier materials, operator experience, etc. We change each of these in a controlled way and measure the impact on the output. This type of collection is very powerful because it provides much stronger evidence of causation than any other data collection method. One drawback is that the data sets are much smaller. These studies are also usually fairly costly to run. Not only is more thought and planning required to create the controlled environment, but you also have to figure out how to control the variations that will be inserted into the process.
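As a small illustration of purposefully varying factors, here is a sketch of a two-level full factorial design with three factors and a main-effect calculation. The factor names echo the examples above, and the response values are made up so the arithmetic is easy to follow. A real study would also randomize the run order and usually include replicates.

```python
from itertools import product

# Hypothetical factors, each run at a low (-1) and a high (+1) level.
factors = ["machine_setting", "supplier_material", "operator_experience"]

# Full factorial design: every combination of levels, 2**3 = 8 runs.
design = list(product([-1, 1], repeat=len(factors)))

# Made-up responses, one per run, in the same order as `design`.
response = [10, 12, 11, 13, 14, 16, 15, 17]


def main_effect(design, response, column):
    """Mean response at the high level minus mean response at the low level."""
    high = [y for run, y in zip(design, response) if run[column] == 1]
    low = [y for run, y in zip(design, response) if run[column] == -1]
    return sum(high) / len(high) - sum(low) / len(low)


effects = {name: main_effect(design, response, i) for i, name in enumerate(factors)}
print(effects)
# {'machine_setting': 4.0, 'supplier_material': 1.0, 'operator_experience': 2.0}
```

Because every factor is set deliberately and balanced against the others, a difference in the averages can be attributed to that factor in a way that observational data from a large system rarely allows.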


To conclude, it is more important to know exactly what you are collecting and how it was collected than to have large data sources. This is not to say that large data sources are not valuable. They are highly valuable, and modern applications are only going to increase the value of data. But I think it will continue to be true that small samples and experimental methods are important tools in our data collection and analysis toolbox.