The answer always seems obvious once you find it. One of our clients had one of their servers start going down nearly every day. It quickly escalated to about twice a day.
We looked at recent commits for anything that could have caused a problem and found nothing. We verified that the code on the server that went down matched the code on the other servers in the cluster and that all of the server settings were the same. When we took down the offending server, another server started going down instead. Since the failure was not tied to one machine, we knew we were likely facing a code problem.
The site in question is one we inherited from the previous vendor, so we had no instinctive sense about where the problem might be. Fortunately, we have a good process for tracking down these sorts of problems.