Possible data corruption sata_sil24?

linux-ide.vger.kernel.org archive mirror

* Possible data corruption sata_sil24?
@ 2007-07-06  1:24 David Shaw
  2007-07-10 15:28 ` Tejun Heo
  0 siblings, 1 reply; 8+ messages in thread
From: David Shaw @ 2007-07-06  1:24 UTC
  To: linux-ide

Hi everyone,

I'm having a problem with data corruption using devmapper on a SATA
disk using sata_sil24.  I've done some work tracking it down, and
hopefully you folks can point me further in the right direction.

The kernel I'm using is 2.6.21-1.3228.fc7 (i.e. Fedora 7).  LVM2 is
lvm2-2.02.24-1.fc7.  dmsetup and libdevmapper come from
device-mapper-1.02.17-7.fc7.

The original setup that showed the problem is this:

I started with two 500GB SATA drives (the interface card uses a Silicon
Image 3124 chipset), /dev/sdd and /dev/sde.  I partitioned each into two
250GB chunks (250*1000*1000*1000, not 250*1024*1024*1024), and set up
two RAID 1 sets such that /dev/md0 is /dev/sdd1+/dev/sde1 and /dev/md2
is /dev/sdd2+/dev/sde2.  I then created a volume group ("storage") on
top of /dev/md0 and /dev/md1.  Finally, I allocated two logical
volumes on top of that: "one" is -L300GB and "two" is -L100GB.
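
As a rough check on these sizes, the sketch below does the unit
arithmetic.  It assumes that LVM reads the -L suffix as binary units
(GiB) and it ignores RAID and LVM metadata overhead; to_gib is a helper
name made up for illustration.

```shell
#!/bin/sh
# to_gib BYTES: print BYTES converted to whole GiB (2^30 bytes), rounded down.
to_gib() {
    echo $(( $1 / (1024 * 1024 * 1024) ))
}

raid_set_bytes=$((250 * 1000 * 1000 * 1000))   # one 250 GB RAID 1 set
vg_bytes=$((2 * raid_set_bytes))               # both sets, overhead ignored
lv_one_bytes=$((300 * 1024 * 1024 * 1024))     # "one", assuming -L300GB means 300 GiB

echo "one RAID set: about $(to_gib "$raid_set_bytes") GiB"
echo "volume group: about $(to_gib "$vg_bytes") GiB"

# "one" is larger than a single RAID set, so it cannot live on just one of them.
if [ $((lv_one_bytes > raid_set_bytes)) -eq 1 ]; then
    echo "logical volume \"one\" must span both RAID sets"
fi
```

Under those assumptions a single RAID set holds roughly 232 GiB, so a
300 GiB volume would have to be split across both sets.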

I ran mke2fs -j -m0 on /dev/storage/one and /dev/storage/two, and
everything seemed fine, but after copying data to the two volumes,
both failed fsck in dramatic fashion (dozens of errors indicating
severe filesystem corruption).

I'll skip all the steps I tried when reducing this down to a simple
reproducible test case, but the end result is this:

1) Take a 500GB disk (as before, it's SATA plugged into a card using
   the sata_sil24 driver)

2) echo "0 482344960 linear 8:32 0" | dmsetup create one
   echo "0 209715200 linear 8:32 482345000" | dmsetup create two

3) mke2fs -j -m0 /dev/mapper/one
   mke2fs -j -m0 /dev/mapper/two
   mount /dev/mapper/one /one
   mount /dev/mapper/two /two

4) cd /one ; \
   for i in `seq 0 3`; do dd if=/dev/zero bs=4K count=1M of=$i; done ; \
   cd ; \
   umount /one

   cd /two ; \
   for i in `seq 0 3`; do dd if=/dev/zero bs=4K count=1M of=$i; done ; \
   cd ; \
   umount /two

5) fsck -f /dev/mapper/one
   fsck -f /dev/mapper/two

Both filesystems return many errors on fsck.  This is very repeatable.
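
Because step 4 writes only zeros, the file contents can also be checked
directly, independent of fsck.  The following is a minimal sketch, not
part of the test case above: verify_zero_file is a made-up helper name,
and it assumes a cmp that supports -n (GNU and BSD cmp do).

```shell
#!/bin/sh
# verify_zero_file FILE
# Succeeds only if FILE exists and every byte in it is zero, as expected for
# a file written with "dd if=/dev/zero".  Returns 2 if FILE is not a regular file.
verify_zero_file() {
    [ -f "$1" ] || return 2
    # Arithmetic expansion strips any padding some wc implementations print.
    size=$(( $(wc -c < "$1") ))
    # Compare the first $size bytes of FILE against the same number of zero bytes.
    cmp -s -n "$size" "$1" /dev/zero
}

# Hypothetical usage after remounting /dev/mapper/one on /one:
#   for i in 0 1 2 3; do
#       verify_zero_file "/one/$i" || echo "file $i is not all zeros"
#   done
```

A failure here would point at damaged file data, while a clean result
alongside fsck errors would suggest the damage is limited to filesystem
metadata.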

Note that this simplified reproduction case uses only the device
mapper: RAID is not involved, nor is LVM.  "dmsetup table" says:

two: 0 209715200 linear 8:32 482345000
one: 0 482344960 linear 8:32 0
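
To sanity-check a table like this one, the sketch below converts each
linear mapping into the sector range it occupies on the backing device
and flags ranges that overlap.  It assumes the usual "name: start
length linear device offset" layout with 512-byte sectors;
check_linear_overlap is a made-up helper name.

```shell
#!/bin/sh
# check_linear_overlap: reads "dmsetup table" output on stdin, prints the
# backing sector range and size of every linear target, and exits non-zero
# if two ranges on the same backing device overlap.
check_linear_overlap() {
    awk '
    BEGIN { bad = 0 }
    $4 == "linear" {
        n++
        nm = $1; sub(/:$/, "", nm)          # drop the trailing colon from the name
        name[n] = nm; dev[n] = $5
        first[n] = $6 + 0                   # first backing sector
        last[n] = $6 + $3 - 1               # last backing sector
        printf "%s %s sectors %d-%d (%.1f GiB)\n",
               name[n], dev[n], first[n], last[n], $3 * 512 / 2^30
    }
    END {
        for (i = 1; i <= n; i++)
            for (j = i + 1; j <= n; j++)
                if (dev[i] == dev[j] && first[i] <= last[j] && first[j] <= last[i]) {
                    printf "OVERLAP: %s %s\n", name[i], name[j]
                    bad = 1
                }
        exit bad
    }'
}

# The table quoted above, fed in as a here-document:
check_linear_overlap <<'EOF'
two: 0 209715200 linear 8:32 482345000
one: 0 482344960 linear 8:32 0
EOF
```

For the two mappings above this reports 230.0 GiB and 100.0 GiB with no
overlap: "one" ends at sector 482344959 and "two" starts at 482345000.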

Just to be sure, I have run memtest86+ on the machine and badblocks on
the disk.  Both came up clean.  Partitioning the disk and running mke2fs
on the partitions directly (i.e. no device-mapper) works fine.  The
corruption happens only when the device-mapper is involved.  There
is nothing of interest logged in /var/log/messages or dmesg (I see the
usual messages around 'mount', but that's it).

Any suggestions?  Many thanks,

David

* Re: Possible data corruption sata_sil24?
  2007-07-06  1:24 Possible data corruption sata_sil24? David Shaw
@ 2007-07-10 15:28 ` Tejun Heo
  2007-07-13  1:42   ` David Shaw
  0 siblings, 1 reply; 8+ messages in thread
From: Tejun Heo @ 2007-07-10 15:28 UTC
  To: David Shaw; +Cc: linux-ide

Hello,

David Shaw wrote:
> I'm having a problem with data corruption using devmapper on a SATA
> disk using sata_sil24.  I've done some work tracking it down, and
> hopefully you folks can point me further in the right direction.
> 
> The kernel I'm using is 2.6.21-1.3228.fc7 (i.e. Fedora 7).  LVM2 is
> lvm2-2.02.24-1.fc7.  The dmsetup and libdevmapper is
> device-mapper-1.02.17-7.fc7.
> 
> The original setup that showed the problem is this:
> 
> Starting with two 500GB SATA drives (interface card uses a Silicon
> 3124 chipset), /dev/sdd and /dev/sde.  I partitioned each into two
> 250GB chunks (250*1000*1000*1000, not 250*1024*1024*1024), and set up
> two RAID 1 sets such that /dev/md0 is /dev/sdd1+/dev/sde1 and /dev/md2
> is /dev/sdd2+/dev/sde2.  I then created a volume group ("storage") on
> top of /dev/md0 and /dev/md1.  Finally, I allocated two logical
> volumes on top of that: "one" is -L300GB and "two" is -L100GB.

-ETOOMANYCOMPONENTS.  If it's data corruption with sata_sil24, it's
highly likely that you're gonna be able to regenerate the problem
without using raw /dev/sdd and /dev/sde devices.  Please try that.

> Note that this simplified reproduction case uses only the device
> mapper: RAID is not involved, nor is LVM.  "dmsetup table" says:
> 
> two: 0 209715200 linear 8:32 482345000
> one: 0 482344960 linear 8:32 0

Can you try this on another controller so that we can tell whether the
problem lies with dm or ata?

-- 
tejun

* Re: Possible data corruption sata_sil24?
  2007-07-10 15:28 ` Tejun Heo
@ 2007-07-13  1:42   ` David Shaw
  2007-07-13  7:34     ` Tejun Heo
  0 siblings, 1 reply; 8+ messages in thread
From: David Shaw @ 2007-07-13  1:42 UTC
  To: linux-ide; +Cc: Tejun Heo

On Wed, Jul 11, 2007 at 12:28:32AM +0900, Tejun Heo wrote:
> Hello,
> 
> David Shaw wrote:
> > I'm having a problem with data corruption using devmapper on a SATA
> > disk using sata_sil24.  I've done some work tracking it down, and
> > hopefully you folks can point me further in the right direction.
> > 
> > The kernel I'm using is 2.6.21-1.3228.fc7 (i.e. Fedora 7).  LVM2 is
> > lvm2-2.02.24-1.fc7.  The dmsetup and libdevmapper is
> > device-mapper-1.02.17-7.fc7.
> > 
> > The original setup that showed the problem is this:
> > 
> > Starting with two 500GB SATA drives (interface card uses a Silicon
> > 3124 chipset), /dev/sdd and /dev/sde.  I partitioned each into two
> > 250GB chunks (250*1000*1000*1000, not 250*1024*1024*1024), and set up
> > two RAID 1 sets such that /dev/md0 is /dev/sdd1+/dev/sde1 and /dev/md2
> > is /dev/sdd2+/dev/sde2.  I then created a volume group ("storage") on
> > top of /dev/md0 and /dev/md1.  Finally, I allocated two logical
> > volumes on top of that: "one" is -L300GB and "two" is -L100GB.
> 
> -ETOOMANYCOMPNONETS.  If it's data corruption with sata_sil24, it's
> highly likely that you're gonna be able to regenerate the problem
> without using raw /dev/sdd and /dev/sde devices.  Please try that.

It fails whether I use a raw /dev/sdd or partition it into one large
/dev/sdd1, or partition into multiple partitions.  sata_sil24 seems to
work by itself, as does dm, but as soon as I mix sata_sil24+dm, I get
corruption.

> > Note that this simplified reproduction case uses only the device
> > mapper: RAID is not involved, nor is LVM.  "dmsetup table" says:
> > 
> > two: 0 209715200 linear 8:32 482345000
> > one: 0 482344960 linear 8:32 0
> 
> Can you try this on another controller so that we can tell whether the
> problem lies with dm or ata?

I have tried this on a different machine (also Fedora 7) that has an
Intel ICH5 controller on the motherboard (ata_piix driver).  This
works correctly, and there is no corruption either with or without dm.

David

* Re: Possible data corruption sata_sil24?
  2007-07-13  1:42   ` David Shaw
@ 2007-07-13  7:34     ` Tejun Heo
  2007-07-13 11:59       ` David Shaw
  0 siblings, 1 reply; 8+ messages in thread
From: Tejun Heo @ 2007-07-13  7:34 UTC
  To: linux-ide, Tejun Heo

Hello,

David Shaw wrote:
>>> Starting with two 500GB SATA drives (interface card uses a Silicon
>>> 3124 chipset), /dev/sdd and /dev/sde.  I partitioned each into two
>>> 250GB chunks (250*1000*1000*1000, not 250*1024*1024*1024), and set up
>>> two RAID 1 sets such that /dev/md0 is /dev/sdd1+/dev/sde1 and /dev/md2
>>> is /dev/sdd2+/dev/sde2.  I then created a volume group ("storage") on
>>> top of /dev/md0 and /dev/md1.  Finally, I allocated two logical
>>> volumes on top of that: "one" is -L300GB and "two" is -L100GB.
>> -ETOOMANYCOMPNONETS.  If it's data corruption with sata_sil24, it's
>> highly likely that you're gonna be able to regenerate the problem
>> without using raw /dev/sdd and /dev/sde devices.  Please try that.

Oops, sorry, s/without/with/

> It fails whether I use a raw /dev/sdd or partition it into one large
> /dev/sdd1, or partition into multiple partitions.  sata_sil24 seems to
> work by itself, as does dm, but as soon as I mix sata_sil24+dm, I get
> corruption.

Hmmmm.... Can you reproduce the corruption by accessing both devices
simultaneously without using dm?  Considering ich5 does fine, it looks
like a hardware and/or driver problem and I really wanna rule out dm.

Thanks.

-- 
tejun

* Re: Possible data corruption sata_sil24?
  2007-07-13  7:34     ` Tejun Heo
@ 2007-07-13 11:59       ` David Shaw
  2007-07-18  8:53         ` Tejun Heo
  0 siblings, 1 reply; 8+ messages in thread
From: David Shaw @ 2007-07-13 11:59 UTC
  To: linux-ide

On Fri, Jul 13, 2007 at 04:34:06PM +0900, Tejun Heo wrote:

> > It fails whether I use a raw /dev/sdd or partition it into one large
> > /dev/sdd1, or partition into multiple partitions.  sata_sil24 seems to
> > work by itself, as does dm, but as soon as I mix sata_sil24+dm, I get
> > corruption.
> 
> Hmmmm.... Can you reproduce the corruption by accessing both devices
> simultaneously without using dm?  Considering ich5 does fine, it looks
> like hardware and/or driver problem and I really wanna rule out dm.

I think I wasn't clear enough before.  The corruption happens when I
use dm to create two dm mappings that both reside on the same real
device.  Using two different devices, or two different partitions on
the same physical device works properly.  ich5 does fine with these 3
tests, but sata_sil24 fails:

 * /dev/sdd, create 2 dm linear mappings on it, mke2fs and use those
   dm "devices" == corruption

 * Partition /dev/sdd into /dev/sdd1 and /dev/sdd2, mke2fs and use
   those partitions == no corruption

 * Partition /dev/sdd into /dev/sdd1 and /dev/sdd2, create 2 dm linear
   mappings on /dev/sdd1, mke2fs and use those dm "devices" ==
   corruption
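
A minimal sketch of a simultaneous-access test that leaves dm out
entirely: it writes zeros to two targets at the same time.  This is an
illustration, not something that was run in this thread;
concurrent_zero_fill and the /mnt/a and /mnt/b paths are made up.

```shell
#!/bin/sh
# concurrent_zero_fill TARGET_A TARGET_B BLOCKS
# Writes BLOCKS 4 KiB blocks of zeros to both targets at the same time and
# returns non-zero if either dd fails.  The targets are whatever paths the
# caller passes; nothing here touches the device-mapper.
concurrent_zero_fill() {
    dd if=/dev/zero of="$1" bs=4096 count="$3" 2>/dev/null &
    pid_a=$!
    dd if=/dev/zero of="$2" bs=4096 count="$3" 2>/dev/null &
    pid_b=$!
    rc=0
    wait "$pid_a" || rc=1
    wait "$pid_b" || rc=1
    sync    # ask the kernel to flush the written data out to the disks
    return "$rc"
}

# Hypothetical usage, with /mnt/a and /mnt/b standing in for filesystems
# made directly on /dev/sdd1 and /dev/sdd2 (4 GiB per file, as in step 4):
#   concurrent_zero_fill /mnt/a/0 /mnt/b/0 1048576
```

Unmounting and running fsck -f on both partitions afterwards would show
whether parallel writes alone are enough to trigger the corruption.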

David

* Re: Possible data corruption sata_sil24?
  2007-07-13 11:59       ` David Shaw
@ 2007-07-18  8:53         ` Tejun Heo
  2007-07-18 12:31           ` David Shaw
  0 siblings, 1 reply; 8+ messages in thread
From: Tejun Heo @ 2007-07-18  8:53 UTC
  To: dshaw; +Cc: linux-ide, linux-raid

David Shaw wrote:
>>> It fails whether I use a raw /dev/sdd or partition it into one large
>>> /dev/sdd1, or partition into multiple partitions.  sata_sil24 seems to
>>> work by itself, as does dm, but as soon as I mix sata_sil24+dm, I get
>>> corruption.
>> Hmmmm.... Can you reproduce the corruption by accessing both devices
>> simultaneously without using dm?  Considering ich5 does fine, it looks
>> like hardware and/or driver problem and I really wanna rule out dm.
> 
> I think I wasn't clear enough before.  The corruption happens when I
> use dm to create two dm mappings that both reside on the same real
> device.  Using two different devices, or two different partitions on
> the same physical device works properly.  ich5 does fine with these 3
> tests, but sata_sil24 fails:
> 
>  * /dev/sdd, create 2 dm linear mappings on it, mke2fs and use those
>    dm "devices" == corruption
> 
>  * Partition /dev/sdd into /dev/sdd1 and /dev/sdd2, mke2fs and use
>    those partitions == no corruption
> 
>  * Partition /dev/sdd into /dev/sdd1 and /dev/sdd2, create 2 dm linear
>    mappings on /dev/sdd1, mke2fs and use those dm "devices" ==
>    corruption

I'm not sure whether this is a problem with sata_sil24 or with the dm
layer.  Cc'ing linux-raid for help.  How much memory do you have?  One
big difference between ata_piix and sata_sil24 is that sil24 can handle
64-bit DMA.  Maybe DMA mapping or something else interacts weirdly with
dm there?

Thanks.

-- 
tejun

* Re: Possible data corruption sata_sil24?
  2007-07-18  8:53         ` Tejun Heo
@ 2007-07-18 12:31           ` David Shaw
  2007-07-19  8:03             ` Tejun Heo
  0 siblings, 1 reply; 8+ messages in thread
From: David Shaw @ 2007-07-18 12:31 UTC
  To: Tejun Heo; +Cc: linux-ide, linux-raid

On Wed, Jul 18, 2007 at 05:53:39PM +0900, Tejun Heo wrote:
> David Shaw wrote:
> >>> It fails whether I use a raw /dev/sdd or partition it into one large
> >>> /dev/sdd1, or partition into multiple partitions.  sata_sil24 seems to
> >>> work by itself, as does dm, but as soon as I mix sata_sil24+dm, I get
> >>> corruption.
> >> Hmmmm.... Can you reproduce the corruption by accessing both devices
> >> simultaneously without using dm?  Considering ich5 does fine, it looks
> >> like hardware and/or driver problem and I really wanna rule out dm.
> > 
> > I think I wasn't clear enough before.  The corruption happens when I
> > use dm to create two dm mappings that both reside on the same real
> > device.  Using two different devices, or two different partitions on
> > the same physical device works properly.  ich5 does fine with these 3
> > tests, but sata_sil24 fails:
> > 
> >  * /dev/sdd, create 2 dm linear mappings on it, mke2fs and use those
> >    dm "devices" == corruption
> > 
> >  * Partition /dev/sdd into /dev/sdd1 and /dev/sdd2, mke2fs and use
> >    those partitions == no corruption
> > 
> >  * Partition /dev/sdd into /dev/sdd1 and /dev/sdd2, create 2 dm linear
> >    mappings on /dev/sdd1, mke2fs and use those dm "devices" ==
> >    corruption
> 
> I'm not sure whether this is problem of sata_sil24 or dm layer.  Cc'ing
> linux-raid for help.  How much memory do you have?  One big difference
> between ata_piix and sata_sil24 is that sil24 can handle 64bit DMA.
> Maybe dma mapping or something interacts weirdly with dm there?

The machine has 640 megs of RAM.  FWIW, I tried this with 512 megs of
RAM with the same results.  Running Memtest86+ shows the memory is
good.

David

* Re: Possible data corruption sata_sil24?
  2007-07-18 12:31           ` David Shaw
@ 2007-07-19  8:03             ` Tejun Heo
  0 siblings, 0 replies; 8+ messages in thread
From: Tejun Heo @ 2007-07-19  8:03 UTC
  To: Tejun Heo, linux-ide, linux-raid

David Shaw wrote:
>> I'm not sure whether this is problem of sata_sil24 or dm layer.  Cc'ing
>> linux-raid for help.  How much memory do you have?  One big difference
>> between ata_piix and sata_sil24 is that sil24 can handle 64bit DMA.
>> Maybe dma mapping or something interacts weirdly with dm there?
> 
> The machine has 640 megs of RAM.  FWIW, I tried this with 512 megs of
> RAM with the same results.  Running Memtest86+ shows the memory is
> good.

Hmmm... I see, so no DMA to the wrong address problem then.  Let's see
whether dm people can help us out.
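
The reasoning is simple arithmetic: with well under 4 GiB of RAM, page
addresses should fit in 32 bits, so the 64-bit DMA path is unlikely to
matter.  A rough sketch of that check, assuming a machine without
unusual physical memory remapping (below_4gib is a made-up helper name):

```shell
#!/bin/sh
# below_4gib MEM_KIB
# Succeeds if MEM_KIB kibibytes of RAM is less than 4 GiB.
below_4gib() {
    [ "$1" -lt $((4 * 1024 * 1024)) ]
}

# On Linux, MemTotal in /proc/meminfo gives the usable RAM in kB.
mem_kib=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo 2>/dev/null)
if [ -n "$mem_kib" ] && below_4gib "$mem_kib"; then
    echo "RAM below 4 GiB: 64-bit DMA addresses should not come into play"
fi
```

For the 640 MB machine described above, below_4gib $((640 * 1024))
succeeds.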

Thanks.

-- 
tejun
