From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S265487AbUHMOCf (ORCPT ); Fri, 13 Aug 2004 10:02:35 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S265499AbUHMOCf (ORCPT ); Fri, 13 Aug 2004 10:02:35 -0400 Received: from gw-oleane.hubxpress.net ([81.80.52.129]:26326 "EHLO yoda.hubxpress.net") by vger.kernel.org with ESMTP id S265487AbUHMOCa (ORCPT ); Fri, 13 Aug 2004 10:02:30 -0400 From: "Sylvain COUTANT" To: Subject: High CPU usage (up to server hang) under heavy I/O load Date: Fri, 13 Aug 2004 16:01:35 +0200 Organization: ILLICOM MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit X-Mailer: Microsoft Office Outlook, Build 11.0.5510 X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2800.1441 Thread-Index: AcSBPdykdNbyY+XFR4iSOX7E4jeWtgAABWeA Message-Id: <20040813140229.4F48B2FC2C@illicom.com> Sender: linux-kernel-owner@vger.kernel.org X-Mailing-List: linux-kernel@vger.kernel.org Gurus, I have a problem with one server (DELL, 1 TB RAID5 + RAID0, Bi-Xeon, 8 GB RAM) which, sometimes, goes mad when the I/O pressure gets too high. We use this server as a VMWare server and as a backup server (200 GB are dedicated to the backup part). We have run full hardware diags and checked every software that runs on the system. We have been able to reproduce the problem once without having launched the VMWare server (so I believe this software is not responsible for the problem). We have tested kernels 2.4.22 and 2.4.26. The server is running under Debian Woody. The exact problem is the unix load sometimes goes _very high_ under I/O activity (mainly disk writes seem to cause this). Even having just two or three tar+gz processes running can cause the 15 minutes load average to get as high as 20 or 30 (where we would expect it being between 2 and 4). The load can go to so high value that we can't do anything anymore on the server during some delay (between a few seconds and a few days). Our Friday night backups often hang the server until we unplug the power to reboot on monday morning. From what we have seen, it can happen that kswapd and kupdated eats up to 70% of one CPU each. Otherwise, it's very hard to tell what happens exactly because when the server is slowing down we have no possible monitoring, no log, no alerts, no automatic reboot (must power off/on to reboot), no console alert, ... Just, that it's still pingable !! We didn't found a way to reproduce this behaviour with a specific test case. Sometimes the server will be fine during a few days then slow down and hang. Sometimes it will hang just a few hours after having been booted. I have spent hours searching for information and found our problem was very common in early 2.4.x kernels (virtual memory management) between 2000 and 2002 on servers with large RAM. Only recent information I found was some patches related to kernel hanging or something like, but symptoms described was never exactly the same as ours. I tried to play a little with "/proc/sys/vm/*" settings (mainly bdflush) but I didn't found any major improvement (perhaps just because I didn't put in the good values). What I was trying to do was reducing the amount of memory the kernel could use for dirty buffers, thus having more regular flush to disks. Hopefully, some people here could be of help ;-)) Regards, Sylvain.