From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S262956AbVD2UUG (ORCPT ); Fri, 29 Apr 2005 16:20:06 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S262950AbVD2URu (ORCPT ); Fri, 29 Apr 2005 16:17:50 -0400 Received: from hg-msq-srv.levigo.de ([62.206.214.5]:31239 "EHLO mail.levigo.de") by vger.kernel.org with ESMTP id S262949AbVD2UQy (ORCPT ); Fri, 29 Apr 2005 16:16:54 -0400 Message-ID: <4272962B.9000502@levigo.de> Date: Fri, 29 Apr 2005 22:16:43 +0200 From: =?ISO-8859-1?Q?J=F6rg_Henne?= Organization: levigo software gmbh User-Agent: Mozilla Thunderbird 1.0 (Windows/20041206) X-Accept-Language: de, de-at, en, en-us MIME-Version: 1.0 To: linux-kernel@vger.kernel.org Subject: Spurious disk change detections interferes with disk access Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org X-Mailing-List: linux-kernel@vger.kernel.org Hi all, using a CompactFlash based system, I am experiencing a very weird behaviour I just don't understand. The system is based on - 2.4.28 - devfs - a CompactFlash IDE drive attached to a real IDE controller (in contrast to a PC-Card adapter) - the root filesystem resides on the CF While booting, there are irregular but frequent errors which manifest in messages like hda1: bad access: block= count= which are followed by error messages from the filesystem and finally errors in the boot process itself. I added some more debug output to drivers/ide/ide-io.c and found that while the block offset was well within the nominal bounds of the device, the message was actually caused by the fact that while the error occurred, the partition table was all zeroed-out, i.e. drive->part[minor&PARTN_MASK].nr_sects == 0. The reason why the drive's partition table is zeroed-out temporarily seems to be that there is an ide_revalidate_disk() going on at that time, which in turn is explained by messages like VFS: Disk change detected on device 03:00 /dev/ide/host0/bus0/target0/lun0: p1 p2 which precede the "bad access" message. As I see it, the core of the problem is: - something accesses the devfs directory associated with the drive (ls /dev/ide/host0/bus0/target0/lun0 will do) - defvs runs a check_media_change which ends up at ide-disk.c:idedisk_media_change() which is implemented like this: /* if removable, always assume it was changed */ return drive->removable; - devfs decides that a revalidate is appropriate since the media hash changed, which trashes the partition table for a brief moment - processes accessing the drive at that time see lots of errors Is it really true, that for removable devices, a simple directory access on the devfs can interfere with accesses to the drive? Sounds like a great start for DOSing the system. I see two possible solutions for the problem: - A lock is put in place which guards access to the drive while the partition table is re-built. Seems fairly hacky to me. - The media change detection mechanism is improved, so that it can detect when the media REALLY changed. Thanks in advance for any input on this. Joerg Henne