Statistically Squashing Bugs
Ah, the dreaded intermittent bug. For no apparent reason something is misbehaving. The logs offer nothing but dead ends. "There's no such thing as an intermittent bug, only unknown boundary conditions" doesn't seem to apply to the situation. The only fix seems to be changing pieces one by one until something new happens. But then something even worse happens: a change makes the bug "disappear", yet there seems to be no link between the change and any mechanism that could affect the bug. If the "fix" doesn't seem connected to any root cause, how do we know the bug is really gone?
I found myself in this situation as part of an ongoing project to write custom firmware for the Elegoo Tumbller self-balancing robot.
The Tumbller uses an Arduino Nano clone with an MPU6050 IMU. The MPU6050 features a neat on-board Digital Motion Processor (DMP) which takes care of fusing the gyroscope and accelerometer readings into a usable pose. However, the Elegoo developers opted to hand-tune a Kalman filter with raw gyroscope and accelerometer readings; I wanted to use the DMP.
Fortunately, there exists an easy-to-use open-source driver to read from the DMP.
Unfortunately, there were periodic spikes in the DMP output which threw a wrench in tuning a controller. The following figure shows pitch (derived from the DMP quaternion output) for a stationary robot:
Every so often the DMP output would erroneously jump before settling back to steady-state within a few hundred milliseconds:
So what to do? Swapping in a new MPU6050 (twice) and a new Arduino did nothing to eliminate the spikes, hinting that there was a bug lurking in the custom firmware.
To get a handle on the issue, I collected 500 samples of the DMP spikes and measured the time between consecutive occurrences. The hypothesis was that the time deltas between events would roughly follow an Exponential distribution, which models the time between events in a Poisson process. A histogram of the time delta measurements is below:
The histogram looks remarkably similar to an Exponential distribution! Fitting a curve was straightforward with scipy.stats.expon.fit.
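As a rough sketch of what the fit boils down to: if the location of the Exponential distribution is pinned at zero, the maximum-likelihood scale is simply the mean of the time deltas, and the rate is its reciprocal. The snippet below uses only the standard library. The spike timestamps are synthetic stand-ins generated from a known rate (not the measured data), and the helper name `fit_exponential_rate` is made up for illustration.

```python
import random
import statistics


def fit_exponential_rate(deltas_ms):
    """Maximum-likelihood rate (events per ms) for an Exponential model
    whose location is fixed at zero: rate = 1 / mean(deltas)."""
    if not deltas_ms:
        raise ValueError("need at least one time delta")
    scale = statistics.fmean(deltas_ms)  # MLE of the scale is the sample mean
    return 1.0 / scale


# ASSUMPTION: synthetic spike timestamps standing in for the real log.
# 501 timestamps give 500 deltas between consecutive spikes.
rng = random.Random(0)
true_rate = 4.65e-5  # events per millisecond, used only to generate fake data
spike_times_ms = []
t = 0.0
for _ in range(501):
    t += rng.expovariate(true_rate)
    spike_times_ms.append(t)

# Time between consecutive spikes
deltas_ms = [b - a for a, b in zip(spike_times_ms, spike_times_ms[1:])]

rate = fit_exponential_rate(deltas_ms)
print(f"fitted rate: {rate:.3e} per ms, mean gap: {1.0 / rate / 1000:.1f} s")
```

With a few hundred deltas the fitted rate should land within a few percent of the rate used to generate the data, which is a quick sanity check on the approach before trusting it on real measurements.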
With an Exponential distribution model in hand, it is possible to determine how long we need to sample without observing the DMP output drift to say with confidence that the bug is fixed. The Exponential distribution models how much time elapses in between events but we need to know how long to wait for an event to occur. The Poisson distribution models how many events occur in a given time window. Fortunately, there is a simple relationship between the two distributions that allows calculating the probability that an event will occur within a certain time period.
$$P(T > t) = e^{-\lambda t}$$ gives the probability that no occurrences happen within \(t\) milliseconds, so the probability of at least one occurrence within \(t\) is \(1 - e^{-\lambda t}\). In this case, the \(\lambda\) parameter for the Exponential distribution is \(4.65\text{e-}5\) per millisecond (1/scale from scipy.stats.expon.fit). For example, after starting the system the probability of observing a DMP spike within 15 seconds (15000 ms) is ~50%. Running for 64 seconds gives a probability of ~95%, and coincidentally running for 99 seconds gives a probability of ~99%.
In other words, if a fix is made to the firmware and the system runs for ~99 seconds without a DMP spike, there would have been only a ~1% chance of such a quiet run had the bug still been present, so we can be roughly 99% confident that it is fixed.
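The wait times quoted above can be reproduced by inverting the survival function: solving \(1 - e^{-\lambda t} = p\) for \(t\) gives \(t = -\ln(1 - p) / \lambda\). A minimal sketch, using the rate from the fit; the function names are illustrative only.

```python
import math

RATE_PER_MS = 4.65e-5  # lambda from the article (1/scale of the fit)


def prob_spike_within(t_ms, rate=RATE_PER_MS):
    """Probability of at least one spike within t_ms if the bug is present."""
    return 1.0 - math.exp(-rate * t_ms)


def wait_for_confidence(confidence, rate=RATE_PER_MS):
    """Spike-free run time (ms) needed to reach the given confidence level."""
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return -math.log(1.0 - confidence) / rate


for confidence in (0.50, 0.95, 0.99):
    seconds = wait_for_confidence(confidence) / 1000
    print(f"{confidence:.0%} confidence needs a {seconds:.1f} s spike-free run")
```

This yields roughly 15, 64, and 99 seconds for the three confidence levels, matching the figures in the text.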
Armed with the knowledge of how long to test a change, it is possible to begin shotgun debugging. Fortunately, a change that doesn't fix the issue reveals itself within ~15 seconds half of the time, so the debug-iteration loop stays reasonably fast.
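To put a number on the iteration rate, here is a small sketch of how long a trial of an ineffective change lasts on average, assuming each trial stops at the first spike or at a 99-second timeout (the timeout is an assumption taken from the ~99% figure above). For an Exponential waiting time the mean of the truncated trial length is \((1 - e^{-\lambda c}) / \lambda\) for a timeout \(c\).

```python
import math

RATE_PER_MS = 4.65e-5  # lambda from the article
TIMEOUT_MS = 99_000    # ASSUMPTION: give up and call it fixed after ~99 s


def expected_trial_ms(rate, timeout_ms):
    """Mean duration of a trial that ends at the first spike or at the
    timeout, whichever comes first, when the bug is still present."""
    return (1.0 - math.exp(-rate * timeout_ms)) / rate


def median_detection_ms(rate):
    """Time by which half of the failing trials have already shown a spike."""
    return math.log(2.0) / rate


mean_s = expected_trial_ms(RATE_PER_MS, TIMEOUT_MS) / 1000
median_s = median_detection_ms(RATE_PER_MS) / 1000
print(f"mean failing trial: {mean_s:.1f} s, median: {median_s:.1f} s")
```

Under these assumptions a change that does not help costs about 21 seconds on average, while a change that does help always costs the full timeout.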
In this case, the issue was caused by resetting the FIFO buffer, waiting for a packet, then reading from the reset FIFO buffer.
...
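// Clear the FIFO on every call (this turned out to be the culprit)
mpu.resetFIFO();
// Bail out if a complete DMP packet is not available yet
if (!mpu.dmpPacketAvailable()) {
    return -1;
}
// Read one 42-byte DMP packet and extract the orientation quaternion
mpu.getFIFOBytes(fifoBuffer, 42);
mpu.dmpGetQuaternion(&orientation, fifoBuffer);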
...
InvenSense's documentation is very sparse, so it is difficult to piece together why clearing the DMP's FIFO buffer causes intermittent output spikes, but removing the FIFO reset eliminated them:
...
// mpu.resetFIFO();
if (!mpu.dmpPacketAvailable()) {
    return -1;
}
mpu.getFIFOBytes(fifoBuffer, 42);
mpu.dmpGetQuaternion(&orientation, fifoBuffer);
...
With no apparent link between the fix and the bug symptoms, the "proof" that the bug is resolved is in the statistics.