Thursday, April 26, 2007

Server stability problems again yesterday

We had another incident with a higher-than-normal error rate on the site yesterday. Apparently a bug in .NET triggered the problem, and one front-end server was at times behaving unpredictably. For example, some links were written in an incorrect format, producing errors when users clicked on them. Once discovered, the problem was quickly fixed by clearing the .NET cache and restarting the IIS server on the affected front-end.

We will soon introduce better monitoring for the websites. Until now we have used the standard Group IS monitoring tool, but it focuses only on slow server response. The page view error percentage seems to be a much better indicator of the general health of the site, so we plan to track that error rate continuously and generate alerts from it.
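
As a rough illustration of the idea, tracking the error percentage over recent page views could look something like the sketch below. It is not the monitoring we are deploying: the class name, the window size, and the alert threshold are all made-up values for the example, and what counts as an "error" page view is left to the caller.

```python
from collections import deque


class ErrorRateMonitor:
    """Tracks the error percentage over the most recent page views.

    Hypothetical sketch: window_size and threshold_pct are illustrative
    defaults, not values taken from any real monitoring setup.
    """

    def __init__(self, window_size=1000, threshold_pct=2.0):
        # deque with maxlen drops the oldest entry once the window is full
        self.window = deque(maxlen=window_size)
        self.threshold_pct = threshold_pct

    def record(self, is_error):
        """Record one page view; is_error is True if it produced an error."""
        self.window.append(bool(is_error))

    def error_pct(self):
        """Return the error percentage over the current window."""
        if not self.window:
            return 0.0
        return 100.0 * sum(self.window) / len(self.window)

    def should_alert(self):
        """True when the error percentage exceeds the threshold."""
        return self.error_pct() > self.threshold_pct


if __name__ == "__main__":
    monitor = ErrorRateMonitor(window_size=10, threshold_pct=20.0)
    for failed in [False] * 8 + [True] * 2:
        monitor.record(failed)
    print(monitor.error_pct(), monitor.should_alert())
```

A sliding window is used here so that one bad server shows up quickly in the percentage instead of being averaged away over a whole day of traffic.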

