The load average: A (very) short explanation

Every server admin is familiar with the situation: your server’s load is flaky, but you don’t see any processes eating your CPU capacity, at least not constantly. What is going on? Probably the first thing you do is Google. And yes, you will find lots of articles explaining what the number means. They will all tell you that it represents the average number of processes in the run queue over a certain period of time.

You will learn that whether the number is healthy depends on the number of cores in your system. For every core the load should be 1 or lower in order to be considered healthy. So when you have a quad-core processor with a load of 3, that does not necessarily pose a problem. I am explicitly putting the word necessarily in the previous sentence because this is where many articles make a mistake. Many resources on the web tell you there is nothing to worry about as long as your load average is lower than the number of cores in your server. I disagree with this statement.
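The per-core rule of thumb is easy to put into numbers. The following is a minimal sketch in Python; the function names load_per_core and current_load_per_core are made up for this illustration, and the second one assumes a Unix-like system where os.getloadavg() is available.

```python
import os


def load_per_core(load_average, cores):
    """Return the load average divided by the number of cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load_average / cores


def current_load_per_core():
    """Read the 1-minute load average and normalise it (Unix only)."""
    one_minute, _five, _fifteen = os.getloadavg()
    cores = os.cpu_count() or 1  # cpu_count() may return None
    return load_per_core(one_minute, cores)


# The quad-core example from the text: a load of 3 on 4 cores.
print(load_per_core(3.0, 4))  # 0.75
```

A result of 1.0 or lower matches the common rule of thumb, but as argued here, it does not prove the server is healthy.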

As you might know, a process can be in a number of different states. One of those states is I/O wait. This means that the process is waiting for access to an I/O device (for example the hard drive). On Linux, a process in this state (uninterruptible sleep, shown as D in top and ps) is counted in the load average just like a process in the run queue, so it adds 1 to the load of your server. As you can see, in this situation the number of cores in the server is irrelevant: even a load of 2 can mean your quad-core server is slowing down because its I/O devices are too slow.
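To see how many processes are currently in each state, you can read the state letter from /proc/<pid>/stat. This is a rough sketch that assumes a Linux system with /proc mounted; parse_state and count_states are hypothetical helper names. It counts processes rather than individual threads, so the result only approximates what the kernel feeds into the load average.

```python
import glob
from collections import Counter


def parse_state(stat_line):
    """Extract the one-letter state from a /proc/<pid>/stat line.

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so we split after the last ')'.
    """
    after_comm = stat_line[stat_line.rindex(")") + 1:]
    return after_comm.split()[0]


def count_states():
    """Count processes per state by scanning /proc (Linux only)."""
    states = Counter()
    for path in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(path) as handle:
                states[parse_state(handle.read())] += 1
        except (OSError, ValueError, IndexError):
            continue  # the process exited or the line was unreadable
    return states


# A made-up stat line (truncated) with an awkward command name.
sample = "1234 (my (odd) name) D 1 1234 1234 0 -1 4194304"
print(parse_state(sample))  # D
```

Calling count_states() on a live server returns a counter keyed by state letter; a noticeable number of D entries next to few R entries points towards I/O rather than CPU.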

You can recognize the previous example by watching the output of the top command. When you see a high load but low CPU usage, you can almost certainly say that I/O is what keeps the load high.
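The same check can be scripted instead of eyeballed in top. The sketch below assumes Linux, where the first line of /proc/stat holds aggregate CPU counters in the order user, nice, system, idle, iowait, irq, softirq, steal. The helper names parse_cpu_line and iowait_share are invented for this example, and the two sample lines are made-up numbers, not measurements.

```python
def parse_cpu_line(line):
    """Turn the aggregate 'cpu' line of /proc/stat into a list of ints."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in fields[1:]]


def iowait_share(before, after):
    """Fraction of CPU time spent in iowait between two samples.

    Only the first eight counters are summed; the guest counters that
    may follow are already included in user and nice.
    """
    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas[:8])
    return deltas[4] / total if total > 0 else 0.0


# Two made-up samples, as if taken a short time apart.
first = parse_cpu_line("cpu 100 0 50 800 50 0 0 0")
second = parse_cpu_line("cpu 150 0 100 900 250 0 0 0")
print(iowait_share(first, second))  # 0.5
```

On a real system you would read the first line of /proc/stat twice, a second or so apart, and pass both to iowait_share. Treat the result as a rough indicator only: a high share together with a high load and little user or system time fits the I/O-bound picture described above.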