On Probability…

In today’s post, I am exploring the nature of probability. Is probability an intrinsic feature of events that evolves over time, or is it something else entirely? My view is that probability is best understood as a measure of an observer’s uncertainty that can change as new information becomes available, rather than as a property that events themselves possess.

Probability is not an intrinsic property of events that evolves over time. It is a measure of an observer’s uncertainty that changes as the observer gains new information.

This insight becomes clear when we consider what happens before and after an event of interest occurs. You might assign a 35% probability that your favorite team will win their championship match in 2025 based on their roster, coaching staff, recent performance, and other factors. When your team does indeed win the championship in 2025, you no longer speak of a 35% chance afterward. You know they won, so your uncertainty about whether your team would capture the 2025 title is gone. The event itself has not changed. What has changed is simply your information about it.

This example reveals something fascinating. The event does not have a probability that flows through time. Your favorite team winning the 2025 championship does not possess an inherent “35% chance property” that somehow transforms into a “100% chance property” when they claim victory. Rather, probability expresses your epistemic state. It expresses what you know and do not know about the event. As your knowledge updates, so does the probability you assign.

Before the season, the probability of 35% captured your uncertainty given incomplete information about how this specific championship race would unfold. After they win, your uncertainty about whether your team won the 2025 championship disappears because you have complete information about this particular outcome. The players were competing and making decisions throughout the season, but your knowledge of the final result was incomplete and then became complete. Probability tracks this change in knowledge, not a change in the event itself.

Your favorite team winning the 2025 championship is a singular, unrepeatable event. This singularity principle applies to every event, whether it is the outcome of a coin toss or whether you miss a train. Even when we consider the 2026 championship, that represents a completely separate event requiring its own probability assessment. You might again assign some probability to your team winning in 2026, but this concerns a different season with different players, different opponents, and different circumstances. The fact that your team won in 2025 provides information that might influence your assessment of their 2026 chances, but each championship stands as a distinct event with its own associated uncertainty.

Different philosophical schools interpret probability in various ways. Frequentists focus on long-run patterns, while others emphasize physical propensities in systems. I adopt the Bayesian perspective here, which treats probability as quantifying an observer’s degree of belief about uncertain outcomes. This framework excels at handling partial information and belief updating as new evidence arrives.

The Bayesian approach formalizes how rational observers should revise their beliefs. You start with a prior probability based on available information. When new evidence arrives, Bayes’ theorem shows how to calculate an updated posterior probability, which then serves as the prior for the next update. Certainty represents probability at its extremes (belief of 1 or 0), but most real-world knowledge involves intermediate probabilities reflecting justified but incomplete information.
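To make the updating mechanics concrete, here is a minimal sketch in Python. The 0.35 prior mirrors the championship example; the likelihoods for a hypothetical piece of mid-season evidence are invented purely for illustration.

```python
# A minimal Bayesian update: P(win | evidence) from P(win) and assumed likelihoods.
# The 0.35 prior mirrors the championship example; the likelihoods for the
# hypothetical evidence ("team wins a key mid-season match") are assumptions.

prior_win = 0.35
p_evidence_given_win = 0.70   # assumed: eventual champions tend to win key matches
p_evidence_given_loss = 0.40  # assumed: non-champions still win some key matches

# Bayes' theorem: posterior = likelihood * prior / evidence
evidence = (p_evidence_given_win * prior_win
            + p_evidence_given_loss * (1 - prior_win))
posterior_win = p_evidence_given_win * prior_win / evidence

print(round(posterior_win, 3))  # about 0.485 with these assumed numbers
```

The posterior then becomes the prior for the next piece of evidence, exactly as described above.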

Let us return to the championship example with this framework in mind. Your initial 35% probability assignment reflects partial knowledge about the 2025 season that remains open to revision. When your favorite team wins the championship, your belief updates to certainty: probability 1. This transition represents a shift in your epistemic state, not a change in some objective property of the championship outcome. The probability assigned to the event changes only because your information changes.

Your team winning the 2025 championship might influence how you assess their chances for future seasons, but each championship represents a separate event. The 2026 championship is not the same event as the 2025 championship because it involves different circumstances, different player development, different opponents, and different strategic decisions that create their own uncertainty. Your experience from the 2025 season provides information for assessing future championship races, but the probability you assign to the 2026 contest addresses a distinct event with its own epistemic challenges.

Once an event’s outcome becomes known, assigning forward-looking probabilities to that specific completed event loses predictive meaning. However, probabilities retain important roles in other contexts. We use explanatory probabilities to reason about hidden causes of observed effects, and counterfactual probabilities to explore alternative scenarios for learning and decision-making. These applications all involve managing uncertainty about things we do not fully know.

Some philosophers argue for objective chances embedded in physical reality, claiming that the world itself has genuine probabilistic features. Even these can be understood through a Bayesian lens as rational betting odds conditioned on our best current knowledge about physical laws and initial conditions. From this epistemic perspective, probability fundamentally reflects our relationship to knowledge and uncertainty, not immutable features of external events.

Understanding probability as observer-dependent rather than event-dependent has practical implications. It explains why different people can reasonably assign different probabilities to the same event because they possess different information. It clarifies why probabilities can seem to “change” as we learn more: our knowledge evolves while events themselves follow deterministic or genuinely random processes. Most importantly, it positions probability as a dynamic tool for rational reasoning under uncertainty rather than a mysterious property that events carry through time.

Finally, it is important to recognize that while our beliefs may remain probabilistic, our decisions in the real world must ultimately resolve into binary choices. We decide to carry an umbrella or not, to take the highway or not, to treat a patient or not. Practical action demands that we collapse our probabilistic beliefs into definitive commitments. This reinforces that probability serves as a bridge between uncertainty and action, not as a property that events carry through time.

Final Words:

This epistemic view of probability transforms how we think about uncertainty and prediction. Rather than searching for probabilities “out there” in the world, we recognize them as tools for managing our own knowledge and ignorance.

As Pierre-Simon Laplace eloquently put it: “Probability theory is nothing but common sense reduced to calculation.”

Once we embrace probability as a measure of what we know rather than what events are, we can use it more effectively as the rational tool it was always meant to be.

Always keep learning…

Relationship Between Process Capability Index and Sigma:

Recently, I wrote about the process capability index and tolerance interval. In today’s post, I am writing about the relationship between process capability index and sigma. The sigma number here relates to how many standard deviations the process window can hold.

A +/- 3 sigma window contains 99.73% of the normal probability density curve. This is also traditionally notated as the “process window”. The number of sigmas is also the z-score. When the process window is compared against the specification window, we can assess the process capability. When the process window is much narrower than the specification window and is fully contained within it, we say that the process is highly capable. When the process window is larger than the specification window, we say that the process is not capable. How much of the process window is enclosed within the specification window is captured by the process capability index. The most common process capability indices are Cpk and Ppk. Here, we will consider Ppk.

Ppk is the minimum of two values:

Ppk = minimum of (USL – µ)/(3σ) and (µ – LSL)/(3σ)

Here µ is the mean, σ is the standard deviation, LSL is the Lower Specification Limit, and USL is the Upper Specification Limit. We are splitting the process window into two here, and accounting for how centered the process is. If the process window is not centered within the specification window, we penalize the index by choosing the minimum of the two.

For convenience, let’s assume the equation below (taking the upper half and assuming a centered process):

Ppk = (USL – µ)/(3σ)

If we multiply both sides by 3, the equation becomes:

3 x Ppk = (USL – µ)/σ

The value on the right side can be expressed as – how many standard deviations are contained in the split process window? This is also the Sigma value or the z-score.

For example, if the Ppk is 1.00, then the z-score is 3.00. This means that the process window and the specification window coincide exactly. This corresponds to 99.73% of the curve. Please note that I am assuming that the process is perfectly centered here. Refer to this post for additional details on calculations for unilateral and bilateral capabilities.

In other words, Sigma (the z-score) = 3 x Ppk.

This relationship allows us to estimate the %-conforming (% under the curve) by just knowing the process capability index value. A keen reader may also notice the similarity to tolerance interval calculations. If we go back to the idea that Sigma is the number of standard deviations that the split process window can accommodate, then we can replace Sigma with the k1 and k2 factors used in the unilateral and bilateral tolerance interval calculations.
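As a quick illustration of this relationship, here is a short Python sketch (my own illustration, not part of the original post) that converts a Ppk value into an estimated %-conforming, assuming a perfectly centered, normally distributed process with a bilateral specification.

```python
from scipy.stats import norm

def percent_conforming(ppk: float) -> float:
    """Estimated % conforming for a centered process with a bilateral
    specification, assuming normality: z = 3 * Ppk on each side."""
    z = 3.0 * ppk
    return 100.0 * (2.0 * norm.cdf(z) - 1.0)

for ppk in (1.00, 1.33, 1.67):
    print(ppk, round(percent_conforming(ppk), 4))
# Ppk = 1.00 -> about 99.73%, matching the +/- 3 sigma figure in the post
```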

A word of caution here is about the switcheroo that happened. The calculations we are doing are based on the normal probability distribution curve, and not the actual process probability distribution curve. The accuracy of our inferences will depend on how closely the actual process distribution matches the beautiful symmetric normal curve.

Always keep on learning…

Ppk, Capability Index and Tolerance Interval Relation:

In today’s post, I am looking at the relationship between the capability index (Cpk or Ppk) and tolerance intervals. The capability index is tied to the specification limits, and tying it to the tolerance interval lets us make the confidence/reliability statement that the tolerance interval calculation provides.

Consider the scenario below:

A quality engineer is tasked with assessing the capability of a sealing process. The requirement the engineer is familiar with is that the process capability index, Ppk, must be greater than or equal to 1.33. The engineer typically uses 30 as the sample size.

But what does this really tell us about the process? Is 1.33 expected to be the population parameter? If so, does testing 30 samples provide us with this information? The capability index calculated from 30 samples is only the statistic and not the parameter.

We can utilize the tolerance interval calculation approach here and calculate the one-sided k-factor for a sample size of 30. Let us assume that we want to find the tolerance interval that will cover 99.9% of the population with 95% confidence. NIST provides a handy reference for calculating this, and we can use an Excel spreadsheet to do it for us. We see that the one-sided k-factor calculated is 4.006.
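For readers who prefer code to a spreadsheet, the one-sided k-factor can also be computed from the noncentral t distribution, which is the approach behind the NIST reference. The sketch below is my own illustration; small differences from the 4.006 quoted above can arise depending on whether an exact or approximate formula is used.

```python
from math import sqrt
from scipy.stats import norm, nct

def k1_one_sided(n: int, coverage: float, confidence: float) -> float:
    """One-sided tolerance interval k-factor via the noncentral t distribution."""
    delta = norm.ppf(coverage) * sqrt(n)          # noncentrality parameter
    return nct.ppf(confidence, df=n - 1, nc=delta) / sqrt(n)

k1 = k1_one_sided(n=30, coverage=0.999, confidence=0.95)
print(round(k1, 3))          # close to the 4.006 quoted in the post
print(round(k1 / 3, 2))      # required Ppk, close to 1.34
```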

The relationship between the required Ppk and the one-sided k-factor is as follows:

Ppk(required) = k1/3

Similarly for a bilateral specification, the relationship between the required Ppk and the two-sided k-factor is:

Ppk(required) = k2/3

In our example, the required Ppk is 1.34. In other words, if we utilize a sample size of 30 and show that the calculated Ppk is 1.34 or above, we can make the following statement:

With 95% confidence, at least 99.9% of the population is conforming to the specifications. In other words, with 95% confidence, we can claim at least 99.9% reliability.

This approach is also utilized for variable sampling plans. However, please do note that a bilateral specification also requires an additional condition to be met for variable sampling plans.

I have attached a spreadsheet that allows the reader to perform these calculations easily. I welcome your thoughts. Please note that the spreadsheet is provided as-is with no guarantees.

Final words:

I will finish with the history of the process capability indices from a great article by Roope M. Turunen and Gregory H. Watson. [1]

The concept of process capability originated in the same Bell Labs group where Walter A. Shewhart developed SPC. Bonnie B. Small led the editing team for the Western Electric Statistical Quality Control Handbook, but the contributor of the process capability concept is not identified. The handbook proposes two methods by which to calculate process capability: first, “as a distribution having a certain center, shape and spread,” and second, “as a percentage outside some specified limit.” These methods were combined to create a ratio of observed variation relative to standard deviation, which is expressed as a percentage. The handbook does not call the ratio an index; this terminology was introduced by two Japanese quality specialists in their 1956 conference paper delivered to the Japanese Society for Quality Control (JSQC). M. Kato and T. Otsu modified Bell Labs’ use of percentage and converted it to an index, and proposed using that as a Cp index to measure machine process capability. Subsequently, in a 1967 JSQC conference paper, T. Ishiyama proposed Cpb as a measurement index of bias in nonsymmetric distributions. This later was changed to Cpk, where “k” refers to the Japanese term katayori, which means “offset” or “bias.”

Always keep on learning…

My last post was All Communication is Miscommunication:

[1] Analyzing the capability of lean processes by Roope M. Turunen and Gregory H. Watson (Quality Progress Feb 2021)

Utilizing Stress/Strength Analysis to Reduce Sample Size:

Art by NightCafe

In today’s post, I am looking at some practical suggestions for reducing sample sizes for attribute testing. A sample is chosen to represent a population. The sample size should be sufficient to represent the population parameters such as the mean, standard deviation etc. Here, we are looking at attribute testing, where a test results in either a pass or a fail. The common way to select an appropriate sample size using reliability and confidence level is based on the success run theorem. The often-used sample sizes are shown below. The assumptions for using the binomial distribution hold true here.

The formula for the Success Run Theorem is given as:

n = ln(1 – C)/ ln(R), where n is the sample size, ln is the natural logarithm, C is the confidence level and R is reliability.
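Here is a minimal Python sketch of the formula; the reliability/confidence pairs are my own picks of commonly used combinations, not a table from the post.

```python
import math

def success_run_n(confidence: float, reliability: float) -> int:
    """Zero-failure sample size from the success run theorem:
    n = ln(1 - C) / ln(R), rounded up to the next whole sample."""
    return math.ceil(math.log(1.0 - confidence) / math.log(reliability))

for c, r in [(0.90, 0.90), (0.95, 0.90), (0.95, 0.95), (0.95, 0.99), (0.99, 0.99)]:
    print(f"confidence {c:.0%}, reliability {r:.0%}: n = {success_run_n(c, r)}")
# 90/90 -> 22, 95/90 -> 29, 95/95 -> 59, 95/99 -> 299, 99/99 -> 459
```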

Selecting a sample size must be based on the risk involved. The specific combination of reliability and confidence level should be tied to the risk involved. Testing for higher risk profile attributes requires larger sample sizes. For example, for a high-risk attribute, one can test 299 samples and, if there were no rejects found, claim that at 95% confidence the product lot is at least 99% conforming or that the process that produced the product is at least 99% reliable.

Often, due to several constraints such as material availability, resource constraints, unforeseen circumstances etc., one may not be able to utilize the required sample sizes. I am proposing here that we can utilize the stress/strength relationship to appropriately justify the use of a smaller sample size while at the same time not compromising on the desired reliability/confidence level combination.

A common depiction of a stress/strength relationship is shown below for a product. We can see that as long as the stress distribution does not overlap with the strength distribution, the product should function with no issues. The space between the two distributions is referred to as the margin of safety. Often, the product manufacturer defines the normal operating parameters based on this. The specifications for the product are also based on this and some value of margin of safety is incorporated in the specifications.

For example, let’s say that the maximum force that the glue joint of a medical device would see during normal use is 0.50 pound-force, and the specification is set as 1.5 pound-force to account for a margin of safety. It is estimated that a maximum of 1% can likely fail at 1.5 pound-force. This refers to 99% reliability. As part of design verification, we could test 299 samples at 1.5 pound-force and, if we do not have any failures, claim that the process is at least 99% reliable at a 95% confidence level. If the glue joint is tested at 0.50 pound-force, we should expect no product to fail. This is, after all, the reason to include the margin of safety.

Following this logic, if we increase the testing stress, we will also increase the likelihood for failures. For example, by increasing the stress five-fold (7.5 pound-force), we are also increasing the likelihood of failure by five-fold (5%) or more. Therefore, if we test 60 parts (one-fifth of 299 from the original study) at 7.5 pound-force and see no failures, this would equate to 99% reliability at 95% confidence at 1.5 pound-force. We can claim at least 99% reliability of performance at 95% confidence level during normal use of product. We were able to reduce the sample size needed to demonstrate the required 99% reliability at 95% confidence level by increasing the stress test condition.

Similarly, if we are to test the glue joint at 3 pound-force (two-fold), we will need 150 samples (half of 299 from the original study) with no failures to claim the same 99% reliability at 95% confidence level during the normal use of product. The rule of thumb is that when aiming for a testing margin of safety of ‘x,’ we can reduce the sample size by a factor of ‘1/x’ while maintaining the same level of reliability and confidence. The exact number can be found by using the success run theorem. In our example, we estimate at least 95% reliability based on the 5% failures while using 5X stress test conditions, when compared to the original 1% failures. Using the equation ln(1-C)/ln(R), where C = 0.95 and R = 0.95, this equates to 59 samples. Similarly for 2X stress conditions, we estimate 2% failures, and here R = 0.98. Using C = 0.95 in the equation, we get the sample size required as 149.

If we had started with a 95% reliability (5% failures utmost) and 95% confidence at the 1X stress conditions, and we go to 2X stress conditions, then we need to calculate the reduced sample size based on 10% failures (2 x 5%). This means that the reliability is estimated to be 90% at 2X stress conditions. Using 0.95 for confidence and 0.90 reliability, this equates to a reduced sample size of 29.
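The arithmetic above can be checked with a short sketch. It assumes, as the post does, that the expected failure fraction scales with the stress multiple; that scaling is an engineering judgment call about the specific product, not a statistical fact.

```python
import math

def zero_failure_n(confidence: float, reliability: float) -> int:
    # Success run theorem: n = ln(1 - C) / ln(R), rounded up.
    return math.ceil(math.log(1.0 - confidence) / math.log(reliability))

confidence = 0.95
failure_rate_at_spec = 0.01     # 1% expected failures at the 1X (1.5 lbf) condition

for stress_multiple in (1, 2, 5):
    # Assumption from the post: failure likelihood scales with the stress multiple.
    stressed_reliability = 1.0 - stress_multiple * failure_rate_at_spec
    n = zero_failure_n(confidence, stressed_reliability)
    print(f"{stress_multiple}X stress: test {n} samples with zero failures")
# 1X -> 299, 2X -> 149, 5X -> 59, matching the numbers worked out above
```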

A good resource to follow up on this is Dr. Wayne Taylor’s book, “Statistical Procedures for the Medical Device Industry”. Dr. Taylor notes that:

An attribute stress test results in a pass/fail result. However, the unit is exposed to higher stresses than are typical under normal conditions. As a result, the stress test is expected to produce more failures than will occur under normal conditions. This allows the number of units tested to be reduced. Stress testing requires identifying the appropriate stressor, including time, temperature, force, humidity and voltage. Examples of stress tests include dropping a product from a higher height, exposing a product to more cycles and exposing a product to a wider range of operating conditions.

Many test methods contained in standards are in fact stress tests designed to provide a safety margin. For example, the ASTM packaging standards provide for conditioning units by repeated temperature/humidity cycles and dropping of units from heights that are more extreme and at intervals that are more frequent than most products would typically see during shipping. As a result, it is common practice to test smaller sample sizes. The ASTM packaging conditioning tests are shown… to be five-times stress tests.

It should be apparent that if the product is failing at the elevated stress level, we cannot claim the margin of safety we were going for. We need to clearly understand how the product will be used in the field and what the normal performance conditions are. We need a good understanding of the safety margins involved. With this approach, if we are able to improve the product design to maximize the safety margins for the specific attributes, we can then utilize a smaller sample size than what is noted in the table above.

Always keep on learning. In case you are interested, my last post was Deriving the Success Run Theorem:

Note:

1) It is common to depict a distribution using +/- 3 standard deviations (σ). This is a practical way to visualize a distribution.

2) The most prevalent representation of a distribution often resembles a symmetrical bell curve. However, this is a simplified sketch and not intended to accurately represent the true data distribution, which may exhibit various distribution shapes with varying degrees of fit.

Deriving the Success Run Theorem:

Art by NightCafe

In today’s post, I am explaining how to derive the Success Run Theorem using some basic assumptions. The Success Run Theorem is one of the most common statistical rationales for sample sizes used for attribute data. It goes in the form of:

Having zero failures out of 22 samples, we can be 90% confident that the process is at least 90% reliable (or at least 90% of the population is conforming).

Or

Having zero failures out of 59 samples, we can be 95% confident that the process is at least 95% reliable (or at least 95% of the population is conforming).

The formula for the Success Run Theorem is given as:

n = ln(1 – C)/ ln(R), where n is the sample size, ln is the natural logarithm, C is the confidence level and R is reliability.

The derivation is fairly straightforward, and we can use the multiplication rule of probability to do so. Let’s assume that we have a lot of infinite size and we are testing random samples out of the lot. The infinite size of the lot ensures independence of the samples. If the lot were finite and small, then the probability of finding good (conforming) or bad (nonconforming) parts would change from sample to sample if we are not replacing the tested sample back into the lot.

Let’s assume that q is the conforming rate (probability of finding a good part).

Let us calculate the probability of finding 22 conforming products in a row. In other words, we are testing 22 random samples and we want to find out the probability of finding 22 good parts. This is also the probability of NOT finding any bad product in the 22 random samples. For ease of explanation, let us assume that q = 0.9 or 90%. This rate of conforming product can also be notated as the reliability, R.

Using the multiplication rule of probability:

p(22 conforming products in a row) = 0.9 x 0.9 x 0.9 …… x 0.9 = 0.9^22

            ≈ 0.10

            = 10%

Since we accept the lot when we find zero rejects in the 22 samples, this is also the probability of accepting the lot.

The complement of this is the probability of NOT finding 22 conforming products in a row, or the probability of finding at least one nonconforming product in the 22 samples. This is also the probability of rejecting the lot.

p(rejecting the lot) = 1 – p(22 conforming products in a row)

            = 1 – 0.10 = 0.90

            = 90%

This can also be stated as the CONFIDENCE that, if the lot passes our inspection (i.e., we found zero rejects), the lot is at least 90% conforming.

In other words, C = 1 – R^n

Or R^n = 1 – C

Taking logarithms of both sides,

n * ln(R) = ln(1 – C)

Or n = ln(1 – C)/ln(R)

Using the example, if we tested 22 samples from a lot, and there were zero rejects then we can with 90% confidence say that the lot is at least 90% conforming. This is also a form of LTPD sampling in Acceptance Sampling. We can get the same results using an OC Curve.
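The derivation can be verified numerically; the sketch below (my own illustration) reproduces the 22-sample example in both directions.

```python
import math

reliability = 0.90                       # conforming rate q (= R)
n = 22

# Forward direction: probability of 22 conforming parts in a row, and the confidence.
p_all_conforming = reliability ** n      # about 0.0985, i.e. roughly 10%
confidence = 1.0 - p_all_conforming      # about 0.90

# Reverse direction: solve n = ln(1 - C) / ln(R) for a desired claim.
n_for_90_90 = math.ceil(math.log(1 - 0.90) / math.log(0.90))   # 22
n_for_95_95 = math.ceil(math.log(1 - 0.95) / math.log(0.95))   # 59

print(round(p_all_conforming, 4), round(confidence, 4), n_for_90_90, n_for_95_95)
```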

Using a similar approach, we can derive a one-sided nonparametric tolerance interval. If we test 22 samples, then we can say with 90% confidence level that at least 90% of the population is above the smallest value of the samples tested.

Any statistic we calculate should reflect our lack of knowledge of the parameter of the population. The use of a confidence/reliability statement is one such way of doing it. I am calling this the epistemic humility dictum:

Any statistical statement we make should reflect our lack of knowledge of the “true” value/nature of the parameter we are interested in.

Always keep on learning. In case you missed it, my last post was An Existentialist’s View of Complexity:

OC Curve and Reliability/Confidence Sample Sizes:

“Reliability” as dreamt by Dream by WOMBO

In today’s post, I am looking at a topic in Statistics. I have had a lot of feedback on one of my earlier posts on OC curves and how one can use it to generate a reliability/confidence statement based on sample size, n and rejects, c. I provided an Excel spreadsheet that calculates the reliability/confidence based on sample size and rejects. I have been asked how we can utilize Minitab to generate the same results. So, this post is mostly geared towards giving an overview of using OC curves to generate reliability/confidence values and using Minitab to do the same.

The basic premise is that a Type B OC curve can be drawn for the samples tested, n, and rejects found, c. On the OC curve, the line represents various combinations of reliability and confidence. The OC curve is a plot of percent nonconforming against probability of acceptance. The lower the percent nonconforming, the higher the probability of acceptance. The probability can be calculated using the binomial, hypergeometric or Poisson distributions. Binomial OC curves are called “Type B” OC curves and do not utilize the lot size, generally represented as N. Hypergeometric OC curves utilize lot sizes and are called “Type A” OC curves. When the ratio n/N is small and n >= 15, the binomial distribution closely matches the hypergeometric distribution. Therefore, the Type B OC curve is used quite often.

The most commonly used standard for attribute sampling plans is MIL-STD-105E. The sampling plans in MIL-STD-105E are identical to the ANSI/ASQ Z1.4 standard plans. The sampling plans provided as part of the tables do utilize lot sizes. These sampling plans were “tweaked” to include lot sizes because there was a push for including the economic considerations of accepting a large lot that may contain rejects. The sample sizes for larger lots were made larger due to this. The OC curves shown in the standards, however, are Type B OC curves that do not use lot sizes. The hypergeometric distribution considers the fact that the samples tested are not replaced. Each test sample removed will impact the subsequent testing since the number of remaining units is now smaller. However, as noted above, when the ratio n/N is small, the issue of not replacing samples is not a concern. For the binomial distribution, the lot size is not considered since the samples are assumed to be taken from a lot of infinite size.

With this background, let’s look at a Type B OC curve. The OC curve is a plot of % Nonconforming against Probability of Acceptance. The lower the % Nonconforming, the higher the Probability of Acceptance. The OC curve shown is for n = 59 with 0 rejects, calculated using the binomial distribution.

The producer’s risk is the risk of good product getting rejected. The acceptance quality limit (AQL) is generally defined as the percent of defectives that the plan will accept 95 percent of the time (i.e., in the long run). Lots that are at or better than the AQL will be accepted 95 percent of the time (in the long run). If the lot fails, we can say with 95-percent confidence that the lot quality level is worse than the AQL. Likewise, we can say that a lot at the AQL that is acceptable has a 5-percent chance of being rejected. In the example, the AQL is 0.09 percent.

The consumer’s risk, on the other hand, is the risk of accepting bad product. The lot tolerance percent defective (LTPD) is generally defined as percent of defective product that the plan will reject 90 percent of the time (in the long run). We can say that a lot at or worse than the LTPD will be rejected 90 percent of the time (in the long run). If the lot passes, we can say with 90-percent confidence that the lot quality is better than the LTPD (i.e., the percent nonconforming is less than the LTPD value). We could also say that a lot at the LTPD that is defective has a 10-percent chance of being accepted.

The vertical axis (y axis) of the OC curve goes from 0 percent to 100 percent probability of acceptance. Alternatively, we can say that the y axis corresponds to 100 percent to 0 percent probability of rejection, which is the probability of rejecting the lot. Let’s call this confidence. The horizontal axis (x axis) of the OC curve goes from 0 percent to 100 percent for percent nonconforming. Alternatively, we can say that the x axis corresponds to 100 percent to 0 percent for percent conforming. Let’s call this reliability.

We can easily invert the y axis so that it aligns with a 0 to 100-percent confidence level. In addition, we can also invert the x axis so that it aligns with a 0 to 100-percent reliability level. This is shown below.

The OC Curve line is a combination of reliability and confidence values. Therefore, for any sample size and rejects combination, we can find the required combination of reliability and confidence values. If we know the sample size and rejects, then we can find the confidence value for any reliability value or vice-versa. Let us look at a problem to detail this further:

In the wonderful book Acceptance Sampling in Quality Control by Edward Schilling and Dean Neubauer, the authors discuss a problem that would be of interest here. They posed:

consider an example given by Mann et al. rephrased as follows: Suppose that n = 20 and the observed number of failures is x = 1. What is the reliability π of the units sampled with 90% confidence? Here π is unknown and γ is to be .90. 

One of the solutions given was to find the reliability or the confidence desired directly from the OC curve.

They gave the following relation:

π = 1 – p, where π is the reliability and p is the nonconforming rate.

γ = 1 – Pa, where γ is the confidence and Pa is the probability of acceptance.

This is the same relation that was explained above.

In my spreadsheet, when we enter the values as shown below, we see that the reliability value is 81.91% based on LTPD value of 18.10%. This is the same result documented in the book.
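The same answer can be reproduced without a spreadsheet or Minitab. The sketch below (my own illustration) finds the percent nonconforming at which the probability of acceptance drops to 1 – confidence for n = 20 and c = 1, using the binomial Type B OC curve.

```python
from scipy.stats import binom
from scipy.optimize import brentq

n, c = 20, 1                  # sample size and acceptance number from the example
confidence = 0.90
target_pa = 1.0 - confidence  # probability of acceptance at the LTPD point

# Find p such that P(accept) = P(X <= c | n, p) equals the target Pa.
ltpd = brentq(lambda p: binom.cdf(c, n, p) - target_pa, 1e-9, 1 - 1e-9)
reliability = 1.0 - ltpd

print(round(100 * ltpd, 2), round(100 * reliability, 2))
# roughly 18.1% nonconforming (LTPD) and 81.9% reliability, in line with the book's result
```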

We can use Minitab to get the same result. However, it will be slightly backwards. As I noted above, drawing the OC curve requires only two inputs – the sample size and the number of rejects allowed or acceptance number. Once the OC curve is drawn, we can then look at the different reliability and confidence combinations. We can also calculate the confidence, if we provide the reliability. The reliability is also 1 – p. In Minitab, we can input the sample size, number of rejects and p, and the software will provide us the Pa. For the purpose of reliability and confidence, the p value will be the LTPD value and the confidence value will be 1 – Pa.

I am using Minitab 18 here. Go to Acceptance Sampling by Attributes as shown below:

Choose “Compare User Defined Sampling Plans” from the dropdown and enter the different values as shown. Please note that the acceptance number is the maximum number of rejects allowed. Here we are entering the LTPD value because we know the value to be 18.10. In the spreadsheet, we have to enter the confidence level at which we want to calculate the reliability, while in Minitab we have to enter the LTPD value (1 – reliability) to calculate the confidence. In the example below, we are going to show that entering the LTPD as 18.10 will yield a Pa of 0.10 and thus a confidence of 0.90 or 90%.

Minitab yields the following result:

One can use the combination of sample size, acceptance number and required LTPD value to calculate the confidence value. The spreadsheet is available here. I will finish with one of the oldest statistical quotes, attributed to the famous sixteenth-century Spanish writer Miguel de Cervantes Saavedra, that is apt here:

“The proof of the pudding is in the eating. By a small sample we may judge of the whole piece.”

Stay safe and always keep on learning…

In case you missed it, my last post was Second Order Variety:

AQL/RQL/LTPD/OC Curve/Reliability and Confidence:


It has been a while since I have posted about statistics. In today’s post, I am sharing a spreadsheet that generates an OC Curve based on your sample size and the number of rejects. I get asked a lot about a way to calculate sample sizes based on reliability and confidence levels. I have written several posts before. Check this post and this post for additional details.

The spreadsheet is hopefully straightforward to use. The user has to enter data in the required yellow cells.


A good rule of thumb is to use a 95% confidence level, which also corresponds to a 0.05 alpha. The spreadsheet will plot two curves. One is the standard OC curve, and the other is an inverse OC curve. The inverse OC curve has the probability of rejection on the Y-axis and % Conforming on the X-axis. These correspond to the Confidence level and Reliability, respectively.


I will discuss the OC curve and how we can get a statement that corresponds to a Reliability/Confidence level from the OC curve.

The OC Curve is a plot of % Nonconforming against Probability of Acceptance. The lower the % Nonconforming, the higher the Probability of Acceptance. The probability can be calculated using the Binomial, Hypergeometric or Poisson distributions. The OC Curve shown is for n = 59 with 0 rejects, calculated using the Binomial Distribution.


The Producer’s risk is the risk of good product getting rejected. The Acceptance Quality Limit (AQL) is generally defined as the percent defectives that the plan will accept 95% of the time (in the long run). Lots that are at or better than the AQL will be accepted 95% of the time (in the long run). If the lot fails, we can say with 95% confidence that the lot quality level is worse than the AQL. Likewise, we can say that a lot at the AQL that is acceptable has a 5% chance of being rejected. In the example, the AQL is 0.09%.


The Consumer’s risk, on the other hand, is the risk of accepting bad product. The Lot Tolerance Percent Defective (LTPD) is generally defined as percent defective that the plan will reject 90% of the time (in the long run). We can say that a lot at or worse than the LTPD will be rejected 90% of the time (in the long run). If the lot passes, we can say with 90% confidence that the lot quality is better than the LTPD (% nonconforming is less than the LTPD value). We could also say that a lot at the LTPD that is defective has a 10% chance of being accepted.

The vertical axis (Y-axis) of the OC Curve goes from 0% to 100% Probability of Acceptance. Alternatively, we can say that the Y-axis corresponds to 100% to 0% Probability of Rejection. Let’s call this Confidence.

The horizontal axis (X-axis) of the OC Curve goes from 0% to 100% for % Nonconforming. Alternatively, we can say that the X-axis corresponds to 100% to 0% for % Conforming. Let’s call this Reliability.


We can easily invert the Y-axis so that it aligns with a 0 to 100% confidence level. In addition, we can also invert the X-axis so that it aligns with a 0 to 100% reliability level. This is shown below.


What we can see is that, for a given sample size and defects, the more reliability we try to claim, the less confidence we can assume. For example, in the extreme case, 100% reliability lines up with 0% confidence.
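The trade-off is easy to tabulate directly from the binomial distribution. This sketch (my own illustration, for n = 59 and c = 0 as in the curve above) prints the confidence available at a few claimed reliability levels.

```python
from scipy.stats import binom

n, c = 59, 0   # sample size and acceptance number used for the OC curve above

for reliability in (0.90, 0.95, 0.97, 0.99, 1.00):
    p_nonconforming = 1.0 - reliability
    pa = binom.cdf(c, n, p_nonconforming)   # probability of acceptance
    confidence = 1.0 - pa                   # probability of rejection
    print(f"reliability {reliability:.0%}: confidence {confidence:.1%}")
# Claiming higher reliability leaves less confidence; at 100% reliability the confidence is 0%.
```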

I welcome the reader to play around with the spreadsheet. I am very much interested in your feedback and questions. The spreadsheet is available here.

In case you missed it, my last post was Nature of Order for Conceptual Models:

MTTF Reliability, Cricket and Baseball:


I originally hail from India, which means that I was eating, drinking and sleeping Cricket for at least a good part of my childhood. Growing up, I used to “get sick” and stay home when the one TV channel that we had broadcast Cricket matches. One thing I never truly understood then was how the batting average was calculated in Cricket. The formula is straightforward:

Batting average = Total Number of Runs Scored/ Total Number of Outs

Here “out” indicates that the batsman had to stop his play because he was unable to keep his wicket. In Baseball terms, this would be similar to a strikeout or a catch where the player has to leave the field. The part that I could not understand was when the Cricket batsman did not get out. The runs he scored were added to the numerator, but no change was made to the denominator. I could not see this as a true indicator of the player’s batting average.

When I started learning about Reliability Engineering, I finally understood why the batting average calculation was bothering me. The way the batting average in Cricket is calculated is very similar to the MTTF (Mean Time To Failure) calculation. MTTF is calculated as follows:

MTTF = Total time on testing/Number of failures

For a simple example, if we were testing 10 motors for 100 hours and three of them failed at 50, 60 and 70 hours respectively, we can calculate the MTTF as 293.33 hours. The problem with this is that the data is right-censored. This means that we still have samples where the failure has not occurred and we stopped the testing. This is similar to the case where we do not include the number of innings where the batsman did not get out. A key concept to grasp here is that the MTTF or the MTBF (Mean Time Between Failures) metric is not for a single unit. There is more to this than just saying that on average a motor is going to last 293.33 hours.

When we do reliability calculations, we should be aware of whether censored data is being used and use appropriate survival analysis to make a “reliability specific statement” – we can expect that 95% of the motor population will survive x hours. Another good approach is to calculate the lower bound confidence interval for the MTBF. A good resource is https://www.itl.nist.gov/div898/handbook/apr/section4/apr451.htm.
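Here is a short sketch of the motor example. The lower confidence bound uses the chi-square method from the NIST handbook page linked above, under the assumptions of exponentially distributed lifetimes and a time-terminated test; treat it as an illustration rather than a full survival analysis.

```python
from scipy.stats import chi2

# Ten motors on test for up to 100 hours; three failed at 50, 60 and 70 hours.
failure_times = [50, 60, 70]
survivors = 7
test_duration = 100

total_time_on_test = sum(failure_times) + survivors * test_duration   # 880 hours
r = len(failure_times)
mttf = total_time_on_test / r                                          # about 293.3 hours

# 95% lower confidence bound on the MTBF (exponential model, time-terminated test),
# per the NIST handbook: 2T / chi-square(0.95, 2r + 2).
mtbf_lower_95 = 2 * total_time_on_test / chi2.ppf(0.95, 2 * (r + 1))

print(round(mttf, 1), round(mtbf_lower_95, 1))   # roughly 293.3 and 113.5 hours
```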

Ty Cobb, Don Bradman and Sachin Tendulkar:

We can compare the batting averages in Cricket to Baseball. My understanding is that the batting average in Baseball is calculated as follows:

Batting Average = Number of Hits/Number of At Bats

Here the hit can be in the form of singles, home runs etc. Apparently, this statistic was initially brought up by the English statistician Henry Chadwick. Chadwick was a keen Cricket fan.

I want to now look at the greats of Baseball and Cricket, and look at a different approach to their batting capabilities. I have chosen Ty Cobb, Don Bradman and Sachin Tendulkar for my analyses. Ty Cobb has the highest batting average in American Baseball. Don Bradman, an Australian Cricketer often called the best Cricket player ever, has the highest batting average in Test Cricket. Sachin Tendulkar, an Indian Cricketer and one of the best Cricket players of recent times, has scored the most runs in Test Cricket. The batting averages of the three players are shown below:

averages

As we discussed in the last post regarding calculating reliability with a Bayesian approach, we can make reliability statements in place of batting averages. Based on 4191 hits in 11420 at bats, we could make a statement that – with 95% confidence, Ty Cobb is 36% likely to make a hit in the next at bat. We can apply the Baseball batting average concept to Cricket. In Cricket, hitting fifty runs is a sign of a good batsman. Bradman has hit fifty or more runs on 56 occasions in 80 innings (70%). Similarly, Tendulkar has hit fifty or more runs on 125 occasions in 329 innings (38%).

We could state that with 95% confidence, Bradman was 61% likely to score fifty or more runs in the next inning. Similarly, Sachin was 34% likely to score fifty runs or more in the next inning at 95% confidence level.
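These figures can be reproduced with a one-sided lower credible bound on a binomial proportion, assuming a uniform Beta(1, 1) prior as in the earlier Bayesian post; the sketch below is my own illustration.

```python
from scipy.stats import beta

def lower_bound_95(successes: int, trials: int) -> float:
    """95% lower credible bound on a binomial proportion, uniform Beta(1, 1) prior."""
    return beta.ppf(0.05, successes + 1, trials - successes + 1)

print(round(lower_bound_95(4191, 11420), 2))   # Cobb: chance of a hit, about 0.36
print(round(lower_bound_95(56, 80), 2))        # Bradman: fifty or more runs, about 0.61
print(round(lower_bound_95(125, 329), 2))      # Tendulkar: fifty or more runs, about 0.34
```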

Final Words:

As we discussed earlier, similar to MTTF, the batting average is not a good estimate for a single inning. It is an attempt at a point estimate of reliability, but we need additional information along with it. It should not be looked at as a single metric in isolation. We cannot expect that Don Bradman would score 99.94 runs per innings. In fact, in the very last match that Bradman played, all he had to do was score 4 runs to achieve the immaculate batting average of 100. He had been out only 69 times, and he just needed four measly runs to complete 7000 runs; even if he got out in that inning, he would have achieved the spectacular batting average of 100. He was one of the best players ever. His highest score was 334. This is called a “triple century” in Cricket, and it is a rare achievement. As indicated earlier, he was 61% likely to have scored fifty runs or more in the next inning. In fact, Bradman had scored more than four runs 69 times in 79 innings.

bradman last

Everyone expected Bradman to cross the 100 mark easily. As fate would have it, Bradman scored zero runs as he was bowled (the batsman misses and the ball hits the wicket) by the English bowler Eric Hollies on the second ball he faced. He had hit 635 fours in his career. A four is where the batsman scores four runs by hitting the ball so that it rolls over the boundary of the field. All Bradman needed was one four to achieve the “100”. Bradman proved that to be human is to be fallible. He still remains the best that ever was, and his record is far from broken. At this time, the second best batting average stands at 61.87.

Always keep on learning…

In case you missed it, my last post was Reliability/Sample Size Calculation Based on Bayesian Inference:

Reliability/Sample Size Calculation Based on Bayesian Inference:


I have written about sample size calculations many times before. One of the most common questions a statistician is asked is “how many samples do I need – is a sample size of 30 appropriate?” The appropriate answer to such a question is always – “it depends!”

In today’s post, I have attached a spreadsheet that calculates the reliability based on Bayesian inference. Ideally, one would want to have some confidence that the widgets being produced are x% reliable, or in other words, that it is x% probable that a widget would function as intended. There is the ubiquitous 90/90 or 95/95 confidence/reliability sample size table that is used for this purpose.

90-95

In Bayesian inference, we do not assume that the parameter (the value that we are calculating, like reliability) is fixed. In the non-Bayesian (Frequentist) world, the parameter is assumed to be fixed, and we need to take many samples of data to make an inference regarding the parameter. For example, we may flip a coin 100 times and calculate the number of heads to determine the probability of heads with the coin (if we believe it is a loaded coin). In the non-Bayesian world, we may calculate confidence intervals. The confidence interval does not provide a lot of practical value. My favorite explanation for the confidence interval is the analogy of an archer. Let’s say that the archer shot an arrow and it hit the bulls-eye. We can draw a 3” circle around the hit and call that our confidence interval based on the first shot. Now let’s assume that the archer shot 99 more arrows, and they all missed the bulls-eye. For each shot, we drew a 3” circle around the hit, resulting in 100 circles. A 95% confidence interval simply means that 95 of the circles drawn contain the bulls-eye. In other words, if we repeated the study a lot of times, 95% of the confidence intervals calculated will contain the true parameter that we are after. This would indicate that the one study we did may or may not contain the true parameter. Compared to this, in the Bayesian world, we calculate the credible interval. This practically means that we can be 95% confident that the parameter is inside the 95% credible interval we calculated.

In the Bayesian world, we can have a prior belief and make an inference based on our prior belief. However, if your prior belief is very conservative, the Bayesian inference might make a slightly liberal inference. Similarly, if your prior belief is very liberal, the inference made will be slightly conservative. As the sample size goes up, the impact of this prior belief is minimized. A common method in Bayesian inference is to use the uninformed prior. This means that we are assuming equal likelihood for all the events. For a binomial distribution, we can use the beta distribution to model our prior belief. We will use (1, 1) to assume the uninformed prior. This is shown below:

uniform prior

For example, if we use 59 widgets as our samples and all of them met the inspection criteria, then we can calculate the 95% lower bound credible interval as 95.13%. This is assuming the (1, 1) beta values. Now let’s say that we are very confident of the process because we have historical data. Now we can assume a stronger prior belief with the beta values as (22,1). The new prior plot is shown below:

22-1 prior

Based on this, if we had 0 rejects for the 59 samples, then the 95% lower bound credible interval is 96.37%. A slightly higher reliability is estimated based on the strong prior.

We can also calculate a very conservative case of (1, 22) where we assume very low reliability to begin with. This is shown below:

1-22 Prior

Now when we have 0 rejects with 59 samples, we are pleasantly surprised because we were expecting our reliability to be around 8-10%. The newly calculated 95% lower bound credible interval is 64.9%.
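The three lower bounds above can be reproduced with a few lines of Python. This is a sketch of the same beta-binomial calculation the spreadsheet performs, assuming the stated beta priors.

```python
from scipy.stats import beta

def lower_credible_bound(a_prior, b_prior, passes, failures, level=0.95):
    """One-sided lower credible bound on reliability for a beta-binomial model."""
    return beta.ppf(1 - level, a_prior + passes, b_prior + failures)

n, failures = 59, 0
for a, b in [(1, 1), (22, 1), (1, 22)]:
    bound = lower_credible_bound(a, b, n - failures, failures)
    print(f"Beta({a}, {b}) prior: 95% lower bound = {bound:.2%}")
# (1, 1) -> about 95.13%, (22, 1) -> about 96.37%, (1, 22) -> about 64.9%
```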

I have created a spreadsheet that you can play around with. Enter the data in the yellow cells. For a stronger prior (liberal), enter a higher a_prior value. Similarly, for a conservative prior, enter a higher b_prior value. If you are unsure, retain the (1, 1) values to have a uniform prior. The spreadsheet also calculates the maximum expected rejects per million value.

You can download the spreadsheet here.

I will finish with my favorite confidence interval joke.

“Excuse me, professor. Why do we always calculate 95% confidence interval and not a 94% or 96% interval?”, asked the student.

“Shut up,” explained the professor.

Always keep on learning…

In case you missed it, my last post was Mismatched Complexity and KISS:

Rules of 3 and 5:

rules of thumb

It has been a while since I have blogged about statistics. So in today’s post, I will be looking at rules of 3 and 5. These are heuristics or rules of thumb that can help us out. They are associated with sample sizes.

Rule of 3:

Let’s assume that you are looking at a binomial event (pass or fail). You took 30 samples and tested them to see how many passes or failures you get. The results yielded no failures. Then, based on the rule of 3, you can state that at a 95% confidence level, the upper bound for the failure rate is 3/30 = 10%, or the reliability is at least 90%. The rule is written as:

p = 3/n

where p is the upper bound of the failure rate, and n is the sample size.

Thus, if you used 300 samples, then you could state with 95% confidence that the process is at least 99% reliable based on p = 3/300 = 1%. Another way to express this is to say that with 95% confidence fewer than 1 in 100 units will fail under the same conditions.

This rule can be derived using the binomial distribution. The 95% confidence comes from the alpha value of 0.05. The value calculated from the rule of three formula becomes more accurate with a sample size of 20 or more.
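The sketch below (my own illustration) compares the rule-of-3 upper bound with the exact 95% upper bound for zero observed failures, 1 – 0.05^(1/n), for a few sample sizes.

```python
# Compare the rule-of-3 upper bound (3/n) with the exact 95% binomial upper bound
# for zero observed failures: p_upper = 1 - 0.05**(1/n).

for n in (10, 20, 30, 100, 300):
    rule_of_three = 3 / n
    exact = 1 - 0.05 ** (1 / n)
    print(f"n = {n:>3}: rule of 3 = {rule_of_three:.2%}, exact = {exact:.2%}")
# For n = 30 the rule gives 10.0% versus an exact 9.5%; the agreement improves as n grows.
```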

Rule of 5:

I came across the rule of 5 from Douglas Hubbard’s informative book “How to Measure Anything” [1]. Hubbard states the Rule of 5 as:

There is a 93.75% chance that the median of a population is between the smallest and largest values in any random sample of five from that population.

This is a really neat heuristic because you can actually tell a lot from a sample size of 5! The median is the 50th percentile value of a population, the point where half of the population is above it and half of the population is below it. Hubbard points out that the probability of picking a value above or below the median is 50% – the same as a coin toss. Thus, we can calculate that the probability of getting 5 heads in a row is 0.5^5 or 3.125%. This would be the same for getting 5 tails in a row. The probability of not getting all heads or all tails is then (100 – (3.125 + 3.125)) or 93.75%. Thus, we can state that the chance of at least one of the five values being above the median and at least one being below it is 93.75%.
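A quick simulation (my own sketch, using an arbitrary skewed population) confirms the 93.75% figure.

```python
import random

random.seed(1)

# An arbitrary skewed population; the rule of 5 does not depend on the shape.
population = [random.expovariate(1.0) for _ in range(20_000)]
true_median = sorted(population)[len(population) // 2]   # (approximate) population median

trials = 50_000
hits = sum(
    min(sample) < true_median < max(sample)
    for sample in (random.sample(population, 5) for _ in range(trials))
)
print(hits / trials)   # close to 0.9375
```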

Final words:

The reader has to keep in mind that both of the rules require the use of randomly selected samples. The Rule of 3 is a version of Bayes’ Success Run Theorem and Wilk’s One-sided Tolerance calculation. I invite the reader to check out my posts that shed more light on this: 1) Relationship between AQL/RQL and Reliability/Confidence, 2) Reliability/Confidence Level Calculator (with c = 0, 1….., n) and 3) Wilk’s One-sided Tolerance Spreadsheet.

When we are utilizing random samples to represent a population, we are calculating a statistic – a representative value of the parameter. A statistic is an estimate of the parameter, the true value from the population. The larger the sample size used, the better the statistic can represent the parameter, and the better your estimate.

I will finish with a story based on chance and probability:

It was the finals, and an undergraduate psychology major was totally hung over from the previous night. He was somewhat relieved to find that the exam was a true/false test. He had taken a basic stats course and did remember his professor once performing a coin flipping experiment. In a moment of clarity, he decided to flip a coin he had in his pocket to get the answer for each question. The psychology professor watched the student the entire two hours as he was flipping the coin…writing the answer…flipping the coin….writing the answer, on and on. At the end of the two hours, everyone else had left the room except for this one student. The professor walks up to his desk and angrily interrupts the student, saying: “Listen, it is obvious that you did not study for this exam since you didn’t even open the question booklet. If you are just flipping a coin for your answer, why is it taking you so long?”

The stunned student looks up at the professor and replies bitterly (as he is still flipping the coin): “Shhh! I am checking my answers!”

Always keep on learning…

In case you missed it, my last post was Kenjutsu, Ohno and Polanyi:

[1] How to Measure Anything: Finding the Value of Intangibles in Business, Douglas W. Hubbard.