Quantifying the Discovery

We can agree that in the process of discovery, observation alone is not enough. To characterize a process we need hard data. Observation can be a form of data, but it is qualitative and open to subjective bias. Quantified data creates a much more powerful understanding of process performance.

When I was first working on my Green Belt, I remember feeling frustrated that there wasn’t a database to collect the data I needed for my project. Looking back, it seems silly that I was so perplexed by this problem. I am reminded of it every time I find myself coaching a junior engineer. I will ask a question like, “How often does the machine stop due to a jam in the feeder?” They respond, “We don’t collect that information.” In my mind I think, “Yes, I know. If a report on this existed, then I would not need to ask the question!” Of course, I recognize that it is my job to coach, so I explain to the junior engineer that they need to create a log sheet and start collecting data.

This comes up pretty often, so it is worth devoting an article to the steps for gathering data. Here we go:

  1. Identify the event of interest.
  2. Determine how the data will be analyzed (Pareto chart, histogram, box plot, trend chart).
  3. Consider the comparisons you wish to make.
  4. Define the measurement process.
  5. Identify who is responsible for data collection.
  6. Determine how much data is needed.
  7. Define the time period of data collection.

I realize that each of these seven steps could be its own article. Let me know if that would be valuable and I’ll make it happen this year. Collecting data is no easy task. It is daunting, and that is why I think people tend to reach for existing data sets. Settling for existing data sets is a missed opportunity. Manually collected data is valuable: the knowledge you gain through manual collection gives you context for the data in action. There is less mystery behind each data point because you know what was going on when it was collected. Here are some additional details on each of the seven steps:

Identify the event of interest – 

It is important to recognize that there are endless pieces of information that can be collected in a process. Focus matters in an improvement project: you need to concentrate on the vital data and define the collection clearly. If you want to understand machine downtime, then you have to define the downtime event. Is it that the machine has slowed, or that it has come to a complete stop? Is the machine down if we have no work for it, or are we only interested in mechanical failures? What if it is down for preventive maintenance? Does that count? What if you have work for the machine, but there is no operator available? All of this needs to be defined, and defined from the perspective of the problem you wish to solve, or you will not get the information you need. These definitions matter. Data gathering takes time; you need to be focused or you will waste time and money.
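
One way to make these definitions stick is to bake them into the log sheet itself. Below is a minimal sketch in Python of what a downtime log entry might look like; the category names and fields are illustrative assumptions, not a standard, and should come out of the definition exercise above.

    from dataclasses import dataclass
    from datetime import datetime

    # Illustrative downtime categories; agree on these BEFORE collecting.
    # Planned PM and "no operator" are logged separately so they can be
    # included or excluded later, depending on the problem being solved.
    DOWNTIME_CATEGORIES = {
        "MECH_FAILURE",   # machine stopped by a mechanical fault
        "FEEDER_JAM",     # jam in the feeder
        "NO_WORK",        # machine idle because no work is available
        "NO_OPERATOR",    # work available, but no operator
        "PLANNED_PM",     # scheduled preventive maintenance
    }

    @dataclass
    class DowntimeEvent:
        machine_id: str
        category: str      # must be one of DOWNTIME_CATEGORIES
        start: datetime    # moment the machine comes to a complete stop
        end: datetime      # moment it resumes producing good parts

        def duration_minutes(self) -> float:
            return (self.end - self.start).total_seconds() / 60.0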

Determine how the data will be analyzed –

I always struggled with this step because I would tell myself that I couldn’t know exactly what I wanted to do with the data until I had it in hand. The problem with this way of thinking is that if you don’t consider, for example, that you want to show a trend chart, then you may fail to collect the data in a way that lets you plot it. Even a rough idea of how the data will be used can help you avoid missing something crucial when you analyze the information.
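
To make that concrete, here is a rough sketch of how a downtime log might be turned into a Pareto chart using pandas and matplotlib. The file name and column names are assumptions carried over from the log-entry sketch above.

    import pandas as pd
    import matplotlib.pyplot as plt

    # Assumed columns: "category" and "duration_minutes" (hypothetical file).
    df = pd.read_csv("downtime_log.csv")

    # Total downtime by category, largest first: the heart of a Pareto chart.
    totals = (df.groupby("category")["duration_minutes"]
                .sum()
                .sort_values(ascending=False))
    cumulative_pct = totals.cumsum() / totals.sum() * 100

    fig, ax = plt.subplots()
    totals.plot(kind="bar", ax=ax)
    ax.set_ylabel("Downtime (minutes)")
    ax2 = ax.twinx()                      # second axis for the cumulative line
    cumulative_pct.plot(ax=ax2, color="red", marker="o")
    ax2.set_ylabel("Cumulative %")
    ax.set_title("Downtime Pareto")
    plt.tight_layout()
    plt.show()

Notice that the chart only works because each event was logged with a category and a duration. Decide on the chart first, and the columns follow.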

Consider the comparisons you wish to make –

Like the previous step, if you fail to think about how you might use the data, then you won’t collect the attribute metadata that allows you to make comparisons. These may be between different operators, shifts, times of day, etc. All of these may be important factors in revealing the performance differences needed to explore cause and effect. Failing to capture this information will prevent you from making the comparison and allow the cause to stay hidden in a mess of information.
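
A small sketch of the payoff: if every log entry also records the shift and the operator, the comparison is nearly a one-liner later. The column names here are again hypothetical.

    import pandas as pd

    # Assumed to include "shift" and "operator" columns on every row.
    df = pd.read_csv("downtime_log.csv")

    # Compare count, total, and average event duration by shift and operator.
    summary = (df.groupby(["shift", "operator"])["duration_minutes"]
                 .agg(["count", "sum", "mean"])
                 .sort_values("sum", ascending=False))
    print(summary)

If those two extra columns were never captured, no amount of analysis can recover them after the fact.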

Define the measurement process –

This is especially important if someone else will perform the data collection. If you are capturing the setup time of a process, you need to define the triggering event that starts the timer running. What is the point at which the process is complete? Is it when the last part comes off the line, or when the pallet of goods is moved over to the inspection area? What device are you using to measure, for example, a tape measure or a pair of calipers? What resolution is required for a good measurement? If you or the person helping you collect the data fails to capture this consistently, you will mistake measurement variation for process variation and likely start chasing phantom root causes. I bet you have all seen these mistakes in practice. They are the result of poor documentation and training on the measurement process.
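
On resolution, a common metrology rule of thumb, often called the 10-to-1 rule, says the device should resolve no coarser than one tenth of the tolerance or variation you are trying to see. Here is a minimal sketch of that check, with made-up numbers for illustration:

    def resolution_adequate(device_resolution: float, tolerance_width: float) -> bool:
        # 10-to-1 rule of thumb: the device should resolve at least
        # ten increments across the tolerance being measured.
        return device_resolution <= tolerance_width / 10.0

    # Hypothetical part with a 1.0 mm wide tolerance (+/- 0.5 mm):
    print(resolution_adequate(0.1, 1.0))  # calipers reading to 0.1 mm -> True
    print(resolution_adequate(1.0, 1.0))  # tape measure reading to 1 mm -> False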

Identify who is responsible for data collection –

Once you have the process, you need to determine who will be trained. This is important both for knowing who needs training and for creating ownership of the data collection. Believe it or not, I have seen instances where an engineer put a log sheet on the floor and explained what was needed and how to collect the information, only to find a day later that no one was entering data. Or that the first shift entered their data, but no one on the second shift did. How was this overlooked? “Well, I thought the supervisor would have assigned someone.” No one defined who was responsible! It was just assumed that everyone knew. Assumptions are another good topic for a future article. For now, just know that assumptions always get us into trouble.

Determine how much data is needed –

This one has an easy answer… 30 data points. Okay, there is more to it than that; 30 was simply the answer in all of my statistics books. In reality, it is a difficult question that is often put to Black Belts and statisticians. There are formulas you can use to answer it, but I think it is better to understand the concepts. First, any data is better than no data, so do not waste time overthinking and just start collecting. Second, the inherent variability in the process drives the amount of data needed: more variation means more samples to characterize it, and less variation means fewer. The criticality of the event of interest will also drive the sample size. If you are looking for a needle in a haystack, you need lots of hay to sift through; if not, a small number of points will do the trick. Finally, the impact of time on the data will drive both the sampling method and the sample size. The more important it is to see change over time, the more data will be needed.
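
For readers who do want a formula, the classic starting point for estimating a mean is n = (z · σ / E)², where σ is a rough estimate of the process standard deviation and E is the margin of error you can live with. A minimal sketch, with hypothetical numbers:

    from math import ceil

    def sample_size_for_mean(sigma: float, margin_of_error: float,
                             z: float = 1.96) -> int:
        # n = (z * sigma / E)^2, rounded up; z = 1.96 for ~95% confidence.
        return ceil((z * sigma / margin_of_error) ** 2)

    # Hypothetical: setup times vary with a std dev of about 4 minutes,
    # and we want the mean pinned down to within +/- 1.5 minutes.
    print(sample_size_for_mean(sigma=4.0, margin_of_error=1.5))  # -> 28

Not far from the textbook 30, which is part of why that number keeps showing up.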

Define the time period of data collection –

As I just mentioned, this is related to the amount of data. If you have to collect a lot of data on a process that does not repeat very often, then you will be collecting for a long period of time. Likewise, if you are looking for an infrequent event or trying to characterize variability, you will need to collect over a long duration. As in the previous step, it really depends on the event you are searching for. If you do not need to understand the impact of time on the process, and the process repeats frequently, a few hours or a day of data may be enough.
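
As a back-of-the-envelope sketch, the collection window falls straight out of the sample size and the event frequency. Both numbers below are hypothetical:

    from math import ceil

    samples_needed = 28    # e.g., from the sample-size estimate above
    events_per_day = 3     # hypothetical: feeder jams observed per day

    days_of_collection = ceil(samples_needed / events_per_day)
    print(f"Plan to collect for roughly {days_of_collection} days")  # -> 10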

Discovery takes many forms. The most powerful discovery happens when you can translate your observations into quantified data that can be summarized and visually displayed. There are many powerful tools for data tabulation, organization, and visualization these days, but they all still require good collection, and the fundamental activities of data collection are the same as they have always been. Getting the collection right is crucial to the process of discovery.