Clocks, resets and wild goose chases

There’s a problem that I believe costs design companies billions of dollars a year, whether they’re in hardware, software or FPGA design. The problem is hard to control, difficult to monitor and impossible to predict. The problem is bad design practice. Some people call everything “bugs”, but I prefer to call this problem poor design. It is what happens when you pull those late-nighters and you’re too tired to see all the loose ends you’re leaving in your code, or when you don’t have time to test every use case because the product release is yesterday.

Logic is the art of going wrong with confidence. -Joseph Wood Krutch

Here is the typical design and debugging sequence:

  1. Write code
  2. Test code: WORKS!
  3. Write more code
  4. Test code: WORKS!
  5. Write more code
  6. Test code: DOESN’T WORK!
  7. Check what was changed in step 5: FOUND BUG!
  8. Fix the stupid mistake
  9. Test code: WORKS!

The sequence above illustrates a point I want to make: when something stops working, we as designers typically blame the most recent changes we made to a design. But sometimes the real bug is not in the most recent changes; sometimes it was made earlier. Now you might be saying that doesn’t make sense: the code “worked” before, so it’s logical that the bug must be in your most recent changes. Wrong. These are the kind of bugs I’m talking about, the nasty bugs that lead us on wild goose chases. The bugs that slip through the cracks because they give us the impression that they don’t exist. The bugs that can allow a design to work by coincidence. Think about that. Did you ever find yourself saying “it can’t be the ..bleep.. because it always worked fine”, the stupid assumption that cost you a week? Or what about “I’m sure the ..bleep.. is not working”, the assumption that just cost you two days testing something that works perfectly fine?

This week I was given the task of debugging a problem with an FPGA design that didn’t make sense to anyone, including myself. In hindsight, the problem was so weird that it had to be one of these nasty wild-goose-chase problems, and the worst thing is that I think I sensed that, but I still went in with the “most recent changes” bias.

Think about this: you have a fairly straightforward design that works, and you never had a problem with it. You tested the design thoroughly and you fixed all the bugs. You are ready to come out of the development phase and go to the release phase, so you remove your debugging tools from the design (i.e. Chipscope): DOESN’T WORK!

What the?? You only removed the debug tools. You didn’t change the design. Like I said, I was biased from the start, so I went into this task with my sights set on what exactly removing Chipscope would do to the design. Well, that’s easy: it changes the placement of your logic and primitives, so surely this is a timing problem caused by poor placement. Bad assumption. Have you got an idea? It’s probably also a bad assumption. So I spent a good amount of time comparing the placement of the working design with that of the non-working design. Not surprisingly, and to my great satisfaction, I found that a lot of things had moved! Doesn’t it feel nice when we find evidence to support our beliefs, even when those beliefs are false? Admittedly, the placement didn’t appear to be different enough to cause the problem, but I was stuck on my theory and I had to have it disproved. After a few rounds of changing the placement and testing, and a good many hours, I had to give up on my theory. Beliefs shattered, ego dented, I was forced to look at the big picture, and now I could actually do something useful.

Now I don’t want to go into details because that’s not the point of my post, but after looking deeper, I noticed a flaw in the design: we were not managing our clocks and resets correctly. This is a topic that I could write several posts about and I can’t stress enough how important it is. We had a configurable external clock which, after reset, defaulted to a certain frequency. We then configured the clock to another frequency, but we were forgetting to reset the core. When we were asking ourselves “why doesn’t it work anymore?” it just so happened that the right question we should have been asking was “why DID it work before?”.
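For the curious, here is a minimal sketch of the ordering mistake, written as a toy Python model rather than real HDL. Everything in it is hypothetical: the ToyCore class, the function names and the frequencies are made up for illustration, and it says nothing about why our particular core misbehaved. It only shows the one rule we were breaking: if the clock changes, the core gets a reset afterwards.

```python
from dataclasses import dataclass, field

DEFAULT_HZ = 100_000_000  # hypothetical frequency the clock defaults to after reset


@dataclass
class ToyCore:
    """Toy stand-in for a clocked core; not a model of any real IP."""
    clock_hz: int = DEFAULT_HZ
    hz_at_last_reset: int = DEFAULT_HZ
    log: list = field(default_factory=list)

    def set_clock(self, hz: int) -> None:
        # Reconfigure the external clock feeding the core.
        self.clock_hz = hz
        self.log.append(f"clock={hz}")

    def reset(self) -> None:
        # Record which clock frequency the core was reset under.
        self.hz_at_last_reset = self.clock_hz
        self.log.append("reset")

    def reset_since_clock_change(self) -> bool:
        # True only if the clock has not changed since the last reset.
        return self.hz_at_last_reset == self.clock_hz


def bring_up_buggy(core: ToyCore, hz: int) -> None:
    # What we were doing: reset first, then change the clock, and stop there.
    core.reset()
    core.set_clock(hz)


def bring_up_fixed(core: ToyCore, hz: int) -> None:
    # What we should have done: change the clock, then reset the core.
    core.set_clock(hz)
    core.reset()


if __name__ == "__main__":
    buggy, fixed = ToyCore(), ToyCore()
    bring_up_buggy(buggy, 125_000_000)
    bring_up_fixed(fixed, 125_000_000)
    print("buggy sequence OK?", buggy.reset_since_clock_change())
    print("fixed sequence OK?", fixed.reset_since_clock_change())
```

A check like reset_since_clock_change() is the kind of thing worth asserting in a testbench, because a design that skips the reset can still appear to work, which is exactly how this one fooled us.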

The point I want to make here is not so much about the clocks and resets but the fact that, when we are faced with strange behavior, we very often have to look at more than just the most recent changes we made to a design. As logical human beings, we have a tendency to focus on what we just changed, but sometimes we have to open our minds to the possibility that our design was lucky to get as far as it did. Sometimes we don’t realize that our design is hanging on by a thread and that it shouldn’t actually be working at all. If that’s hard to understand, this great quote should make it clearer:

Logic: The art of thinking and reasoning in strict accordance with the limitations and incapacities of the human misunderstanding. -Ambrose Bierce