* Megaraid corruption?
@ 2004-12-09 2:17 Troy Klein
2004-12-09 14:14 ` Tom Coughlan
0 siblings, 1 reply; 8+ messages in thread
From: Troy Klein @ 2004-12-09 2:17 UTC (permalink / raw)
To: linux-scsi
I am attempting to find out why I am having data corruption problems on
my megaraid2 driver and not on megaraid. The system is a 32-bit Xeon
with a LSI 150-6 SATA controller running Redhat EL 3.0AS. When I am
running megaraid I can write data, read data, and modify data and get NO
corruption. With megaraid2 I get data corruption with tar, rsync, and
scp.
Can anyone help?
Troy Perplexed
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: Megaraid corruption?
2004-12-09 2:17 Troy Klein
@ 2004-12-09 14:14 ` Tom Coughlan
2004-12-09 15:03 ` Troy Klein
0 siblings, 1 reply; 8+ messages in thread
From: Tom Coughlan @ 2004-12-09 14:14 UTC (permalink / raw)
To: res1uo0c; +Cc: linux-scsi
On Wed, 2004-12-08 at 21:17, Troy Klein wrote:
> I am attempting to find out why I am having data corruption problems on
> my megaraid2 driver and not on megaraid. The system is a 32-bit Xeon
> with a LSI 150-6 SATA controller running Redhat EL 3.0AS. When I am
> running megaraid I can write data, read data, and modify data and get NO
> corruption. With megaraid2 I get data corruption with tar, rsync, and
> scp.
Are there messages in /var/log/messages that correspond to the timeframe
of the corruption?
* RE: Megaraid corruption?
@ 2004-12-09 14:18 Mukker, Atul
2004-12-09 14:54 ` Troy Klein
0 siblings, 1 reply; 8+ messages in thread
From: Mukker, Atul @ 2004-12-09 14:18 UTC (permalink / raw)
To: 'res1uo0c@verizon.net',
'linux-scsi@vger.kernel.org'
We would need more details before speculating about the cause. What kind of
corruption is happening: mismatched data? How many bytes? Are there kernel
messages when the corruption is seen?
Thanks
-Atul Mukker
> -----Original Message-----
> From: Troy Klein [mailto:res1uo0c@verizon.net]
> Sent: Wednesday, December 08, 2004 9:18 PM
> To: linux-scsi@vger.kernel.org
> Subject: Megaraid corruption?
>
> I am attempting to find out why I am having data corruption
> problems on my megaraid2 driver and not on megaraid. The
> system is a 32-bit Xeon with a LSI 150-6 SATA controller
> running Redhat EL 3.0AS. When I am running megaraid I can
> write data, read data, and modify data and get NO corruption.
> With megaraid2 I get data corruption with tar, rsync, and scp.
>
> Can anyone help?
>
> Troy Perplexed
* RE: Megaraid corruption?
2004-12-09 14:18 Megaraid corruption? Mukker, Atul
@ 2004-12-09 14:54 ` Troy Klein
0 siblings, 0 replies; 8+ messages in thread
From: Troy Klein @ 2004-12-09 14:54 UTC (permalink / raw)
To: Mukker, Atul; +Cc: 'linux-scsi@vger.kernel.org'
description:
Recently during the migration of users to our two new servers, people
started noticing that files were being corrupted. This happened on both
servers (one of which also experienced a major filesystem corruption).
I removed the users from one server and have been using this as a
testbed to try and pinpoint the problem.
I was using "rsync" to transfer files, and thought this might be the
problem. I upgraded the software, but the problem still occurred. I
then used "tar" to try and copy a user's home dir to another dir on the
server.
This had problems also, so that eliminated rsync as the cause. I then
used rsync to copy files locally on the same machine. The corruption
still occurred, so this eliminates the network itself as a source of
errors.
Some Background:
My "test case" has been a user home dir with about 12GB of data. I use
"rsync" to copy the files, and then use "diff" to compare the two dirs
for differences. The server has two Xeon 2.4 GHz processors, 4GB RAM,
and 3 Maxtor 250GB SATA disks in a RAID 5 configuration behind a 6-port
LSI Logic MegaRAID card. I have RedHat Enterprise Advanced Server 3
installed as the OS, with ext3 filesystems on all partitions.
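A minimal sketch of the rsync-then-diff test case described above, assuming the two directory trees are both reachable as local paths. The function names `md5_of` and `compare_trees` are hypothetical and chosen only for illustration; they are not part of rsync, diff, or any megaraid tooling.

```python
import hashlib
import os


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_trees(src_root, dst_root):
    """Return relative paths whose copy is missing or has a different MD5."""
    bad = []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        for name in filenames:
            src = os.path.join(dirpath, name)
            rel = os.path.relpath(src, src_root)
            dst = os.path.join(dst_root, rel)
            # A copy counts as bad if it is absent or its checksum differs.
            if not os.path.isfile(dst) or md5_of(src) != md5_of(dst):
                bad.append(rel)
    return sorted(bad)
```

Comparing checksums rather than sizes matters here, because a copy with null-filled chunks would keep the same size as the original.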
First I upgraded the kernel to version 2.4.21-20.ELsmp which contained
the recent megaraid2 driver (version 2.10.6). This did not help. I
suspected possible bad RAM. I ran memtest86 for over 100 hours without
a single error being reported.
During the deletion of some files, I ran across errors like this:
  critical hardware error
  aborting-43441 cmd=2a <c=0 t=0 l=0>
  waiting for 24 commands to flush: iter: 30000
The RAID utility (megamgr) did not report any errors. I upgraded the
RAID card's firmware to the latest version (713G). This did not have
any effect. The corruption takes place whether I copy files between
dirs on the same partition or across different partitions.
The odd thing about the corrupted files was that the copy had the exact
same size as the original, but a different md5 checksum. Upon closer
inspection, I found that chunks of the copied file contained null bytes.
This allowed the size to remain the same, but caused the different
checksum. Some of the files were definitely corrupted on the disk (i.e.,
these null bytes were written to the disk media) but others were not. I
used the "debugfs" utility to walk through the filesystem directly.
When I used this to dump the "corrupted" file, I often found that
the dumped copy matched the original file exactly. This means that the
bits were written correctly to disk, but that normal access to the file
via tools like "cat" was returning the wrong info.
Below are my original notes on the file corruption problem. I have done
a few other tests since then and wanted to pass on the results to you:
1) Problem still occurs even when copying data between different
partitions in the RAID array.
2) I tried remounting the ext3 filesystem as ext2. Problem still
remained.
3) Tried disabling hyperthreading and NPTL. No effect.
4) Disabled as much caching on the RAID card as I could. No effect.
5) Reformatted one of the partitions as jfs instead of ext3, and then
tried copying data to this newly formatted partition. The problem still
occurred.
However, I found that if I unmounted the filesystem and then remounted
it, some of these files would now appear to be OK (they matched the
originals). But then others that had matched before would now appear to
be corrupt.
The megaserv.log file for the RAID card showed media errors on port #2.
I replaced that disk and then rebuilt the RAID array. This stopped the
I/O errors, but another test showed that the file corruption problem
remained.
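The symptom reported in this message (same size, different checksum, chunks of null bytes in the copy) can be located with a short script. This is only a sketch under assumptions: the names `null_runs` and `corrupted_null_runs` and the 512-byte threshold are hypothetical, and a legitimate file may contain long null runs of its own, which is why the second function checks the original at the same offsets.

```python
def null_runs(path, min_length=512, chunk_size=1 << 20):
    """Return (offset, length) for each run of NUL bytes of at least min_length."""
    runs = []
    run_start = None  # file offset where the current NUL run began, if any
    offset = 0        # file offset of the first byte of the current chunk
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            for i, byte in enumerate(chunk):
                if byte == 0:
                    if run_start is None:
                        run_start = offset + i
                elif run_start is not None:
                    length = offset + i - run_start
                    if length >= min_length:
                        runs.append((run_start, length))
                    run_start = None
            offset += len(chunk)
    # A run may extend to the end of the file.
    if run_start is not None and offset - run_start >= min_length:
        runs.append((run_start, offset - run_start))
    return runs


def corrupted_null_runs(original, copy, min_length=512):
    """Return NUL runs in the copy where the original holds non-NUL data."""
    suspicious = []
    with open(original, "rb") as handle:
        for start, length in null_runs(copy, min_length):
            handle.seek(start)
            # Stripping NULs leaves something only if real data was there.
            if handle.read(length).strip(b"\x00"):
                suspicious.append((start, length))
    return suspicious
```

The byte-by-byte loop is slow on a 12GB tree; it is written for clarity rather than speed.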
On Thu, 2004-12-09 at 09:18 -0500, Mukker, Atul wrote:
> We would need more details before speculating for the cause. What kind of
> corruption is happening, mismatch data? How many bytes? Kernel messages when
> the corruption is seen?
>
> Thanks
> -Atul Mukker
>
> > -----Original Message-----
> > From: Troy Klein [mailto:res1uo0c@verizon.net]
> > Sent: Wednesday, December 08, 2004 9:18 PM
> > To: linux-scsi@vger.kernel.org
> > Subject: Megaraid corruption?
> >
> > I am attempting to find out why I am having data corruption
> > problems on my megaraid2 driver and not on megaraid. The
> > system is a 32-bit Xeon with a LSI 150-6 SATA controller
> > running Redhat EL 3.0AS. When I am running megaraid I can
> > write data, read data, and modify data and get NO corruption.
> > With megaraid2 I get data corruption with tar, rsync, and scp.
> >
> > Can anyone help?
> >
> > Troy Perplexed
* Re: Megaraid corruption?
2004-12-09 14:14 ` Tom Coughlan
@ 2004-12-09 15:03 ` Troy Klein
2004-12-09 17:26 ` Russell Johnson
0 siblings, 1 reply; 8+ messages in thread
From: Troy Klein @ 2004-12-09 15:03 UTC (permalink / raw)
To: Tom Coughlan; +Cc: linux-scsi
No, that is the thing: I am getting no entries in the logs.
I do have some questions as to the differences between megaraid version
1.18f capabilities and megaraid2 2.00.9 capabilities. Is the megaraid2
capable of addressing faster drive buses? What are the major
differences and major features added to version 2?
The following is a more detailed description and some background:
My "test case" has been a user home dir with about 12GB of data. I use
"rsync" to copy the files, and then use "diff" to compare the two dirs
for differences. The server has two Xeon 2.4 GHz processors, 4GB RAM,
and 3 Maxtor 250GB SATA disks in a RAID 5 configuration behind a 6-port
LSI Logic MegaRAID card. I have RedHat Enterprise Advanced Server 3
installed as the OS, with ext3 filesystems on all partitions.
First I upgraded the kernel to version 2.4.21-20.ELsmp which contained
the recent megaraid2 driver (version 2.10.6). This did not help. I
suspected possible bad RAM. I ran memtest86 for over 100 hours without
a single error being reported.
During the deletion of some files, I ran across errors like this:
  critical hardware error
  aborting-43441 cmd=2a <c=0 t=0 l=0>
  waiting for 24 commands to flush: iter: 30000
The RAID utility (megamgr) did not report any errors. I upgraded the
RAID card's firmware to the latest version (713G). This did not have
any effect. The corruption takes place whether I copy files between
dirs on the same partition or across different partitions.
The odd thing about the corrupted files was that the copy had the exact
same size as the original, but a different md5 checksum. Upon closer
inspection, I found that chunks of the copied file contained null bytes.
This allowed the size to remain the same, but caused the different
checksum. Some of the files were definitely corrupted on the disk (i.e.,
these null bytes were written to the disk media) but others were not. I
used the "debugfs" utility to walk through the filesystem directly.
When I used this to dump the "corrupted" file, I often found that
the dumped copy matched the original file exactly. This means that the
bits were written correctly to disk, but that normal access to the file
via tools like "cat" was returning the wrong info.
Below are my original notes on the file corruption problem. I have done
a few other tests since then and wanted to pass on the results to you:
1) Problem still occurs even when copying data between different
partitions in the RAID array.
2) I tried remounting the ext3 filesystem as ext2. Problem still
remained.
3) Tried disabling hyperthreading and NPTL. No effect.
4) Disabled as much caching on the RAID card as I could. No effect.
5) Reformatted one of the partitions as jfs instead of ext3, and then
tried copying data to this newly formatted partition. The problem still
occurred.
However, I found that if I unmounted the filesystem and then remounted
it, some of these files would now appear to be OK (they matched the
originals). But then others that had matched before would now appear to
be corrupt.
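One way to record the remount behaviour described above is to checksum the same set of files before and after the unmount/remount and list the ones whose digest changed. The sketch below assumes the remount itself is done by hand between the two snapshots; `file_md5`, `snapshot`, and `changed_between` are hypothetical helper names, not existing tools.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file as seen through normal reads."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(paths):
    """Map each path to its current MD5 digest."""
    return {path: file_md5(path) for path in paths}


def changed_between(before, after):
    """Return paths present in both snapshots whose digest differs."""
    return sorted(p for p in before if p in after and before[p] != after[p])
```

If files on disk were not modified between the two snapshots, any path reported by `changed_between` returned different data for the same on-disk content, which matches the observation that debugfs dumps were often correct while normal reads were not.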
The megaserv.log file for the RAID card showed media errors on port #2.
I replaced that disk and then rebuilt the RAID array. This stopped the
I/O errors, but another test showed that the file corruption problem
remained.
Troy
On Thu, 2004-12-09 at 09:14 -0500, Tom Coughlan wrote:
> On Wed, 2004-12-08 at 21:17, Troy Klein wrote:
> > I am attempting to find out why I am having data corruption problems on
> > my megaraid2 driver and not on megaraid. The system is a 32-bit Xeon
> > with a LSI 150-6 SATA controller running Redhat EL 3.0AS. When I am
> > running megaraid I can write data, read data, and modify data and get NO
> > corruption. With megaraid2 I get data corruption with tar, rsync, and
> > scp.
>
> Are there messages in /var/log/messages that correspond to the timeframe
> of the corruption?
>
* RE: Megaraid corruption?
2004-12-09 15:03 ` Troy Klein
@ 2004-12-09 17:26 ` Russell Johnson
2004-12-09 17:34 ` Troy Klein
0 siblings, 1 reply; 8+ messages in thread
From: Russell Johnson @ 2004-12-09 17:26 UTC (permalink / raw)
To: res1uo0c, 'Tom Coughlan'; +Cc: linux-scsi
There was a recent patch to the megaraid2 driver that fixed corruption while
some ioctl's were executing during file operations. You mentioned using
megamgr. If that is running in the background while you are performing the
file copies, it could be causing the bug that was fixed in the most recent
megaraid2 driver.
Try disabling megamgr and run your test...
Or try updating megaraid2 to the latest version. I'm using
megaraid cmm: 2.20.2.0 (Release Date: Thu Aug 19 09:58:33 EDT 2004)
megaraid: 2.20.4.0 (Release Date: Mon Sep 27 22:15:07 EDT)
> -----Original Message-----
> From: linux-scsi-owner@vger.kernel.org
> [mailto:linux-scsi-owner@vger.kernel.org] On Behalf Of Troy Klein
> Sent: Thursday, December 09, 2004 8:04 AM
> To: Tom Coughlan
> Cc: linux-scsi@vger.kernel.org
> Subject: Re: Megaraid corruption?
>
>
> No that is the thing I am getting no entry in logs.
>
> I do have some questions as to the differences between
> megaraid version 1.18f capabilities and megaraid2 2.00.9
> capabilities. Is the megaraid2 capable of addressing faster
> drive buses? What are the major differences and major
> features added to version 2?
>
> The following is a more detailed description and some background:
>
> My "test case" has been a user home dir with about 12GB of
> data. I use
> "rsync" to copy the files, and then use "diff" to compare the
> two dirs for differences. The server has two Xeon 2.4 GHz
> processors, 4GB RAM, and 3 Maxtor 250GB SATA disks in a RAID
> 5 configuration behind a 6-port LSI Logic MegaRAID card. I
> have RedHat Enterprise Advanced Server 3 installed as the OS,
> with ext3 filesystems on all partitions.
>
> First I upgraded the kernel to version 2.4.21-20.ELsmp which
> contained the recent megaraid2 driver (version 2.10.6). This
> did not help. I suspected possible bad RAM. I ran memtest86
> for over 100 hours without a single error being reported.
>
> During the deletion of some files, I ran across errors like this:
>   critical hardware error
>   aborting-43441 cmd=2a <c=0 t=0 l=0>
>   waiting for 24 commands to flush: iter: 30000
>
> The RAID utility (megamgr) did not report any errors. I
> upgraded the RAID card's firmware to the latest version
> (713G). This did not have any effect. The corruption takes
> place whether I copy files between dirs on the same partition
> or across different partitions. The odd thing about the
> corrupted files was that the copy had the exact same size as
> the original, but a different md5 checksum. Upon closer
> inspection, I found that chunks of the copied file contained
> null bytes.
>
> This allowed the size to remain the same, but caused the
> different checksum. Some of the files were definitely
> corrupted on the disk (ie - these null bytes were written to
> the disk media) but others were not. I used the "debugfs"
> utility to walk through the filesystem directly. When I would
> use this to dump the "corrupted" file, I often found that
> this dumped copy matched the original file exactly. This
> means that the bits were written correctly to disk, but that
> normal access to the file via tools like "cat" was returning
> the wrong info. Below are my original notes on the file
> corruption problem. I have done a few other tests since then
> and wanted to pass on the results to you:
>
> 1) Problem still occurs even when copying data between different
> partitions in the RAID array.
>
> 2) I tried remounting the ext3 filesystem as ext2. Problem still
> remained.
>
> 3) Tried Disabling hyperthreading and nptl. No effect.
>
> 4) Disabled as much caching on the RAID card as I could. No effect.
>
> 5) Reformatted one of the partitions as jfs instead of ext3, and then
> tried copying data to this newly formatted partition. The
> problem still occurred.
>
> However, I found that if I unmounted the filesystem and then
> remounted it, some of these files would now appear to be OK
> (they matched the
> originals). But then others that had matched before would
> now appear to be corrupt.
>
> The megaserv.log file for the RAID card showed media errors
> on port #2. I replaced that disk and then rebuilt the RAID
> array. This stopped the I/O errors, but another test showed
> that the file corruption problem remained.
>
> Troy
>
> On Thu, 2004-12-09 at 09:14 -0500, Tom Coughlan wrote:
> > On Wed, 2004-12-08 at 21:17, Troy Klein wrote:
> > > I am attempting to find out why I am having data corruption problems
> > > on my megaraid2 driver and not on megaraid. The system is a 32-bit
> > > Xeon with a LSI 150-6 SATA controller running Redhat EL 3.0AS. When
> > > I am running megaraid I can write data, read data, and modify data
> > > and get NO corruption. With megaraid2 I get data corruption with
> > > tar, rsync, and scp.
> >
> > Are there messages in /var/log/messages that correspond to the
> > timeframe of the corruption?
> >
* RE: Megaraid corruption?
2004-12-09 17:26 ` Russell Johnson
@ 2004-12-09 17:34 ` Troy Klein
2004-12-09 18:26 ` Russell Johnson
0 siblings, 1 reply; 8+ messages in thread
From: Troy Klein @ 2004-12-09 17:34 UTC (permalink / raw)
To: Russell Johnson; +Cc: 'Tom Coughlan', linux-scsi
What kernel version are you running? I am running 2.4.21-20.ELsmp
(RedHat EL).
Does that megaraid driver compile with 2.4 kernels, and if so, how is that
done? The source appears to be for 2.6.
Troy
On Thu, 2004-12-09 at 10:26 -0700, Russell Johnson wrote:
> There was a recent patch to the megaraid2 driver that fixed corruption while
> some ioctl's were executing during file operations. You mentioned using
> megamgr. If that is running in the background while you are performing the
> file copies, it could be causing the bug that was fixed in the most recent
> megaraid2 driver.
>
> Try disabling megamgr and run your test...
>
> Or try updating megaraid2 to the latest version. I'm using
> megaraid cmm: 2.20.2.0 (Release Date: Thu Aug 19 09:58:33 EDT 2004)
> megaraid: 2.20.4.0 (Release Date: Mon Sep 27 22:15:07 EDT)
>
* RE: Megaraid corruption?
2004-12-09 17:34 ` Troy Klein
@ 2004-12-09 18:26 ` Russell Johnson
0 siblings, 0 replies; 8+ messages in thread
From: Russell Johnson @ 2004-12-09 18:26 UTC (permalink / raw)
To: res1uo0c; +Cc: 'Tom Coughlan', linux-scsi
I'm running 2.6.7.
Did you try running a test without megamgr running? The bug that I had
didn't show up if nothing was issuing ioctls to the driver. So by
disabling megamgr you would possibly be getting rid of the source of the
data corruption, if your bug is the same as mine was.
Once you figure that out you can determine if you want to update to the 2.6
series or not.
Russ
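Before repeating the copy test, it may help to confirm that nothing named megamgr is still running. The following is a rough sketch that scans /proc for a matching program name; `find_processes` is a hypothetical helper, and matching on the basename of the first command-line argument is an assumption that can miss renamed or wrapped processes.

```python
import os


def find_processes(name, proc_root="/proc"):
    """Return PIDs whose command line starts with a program called `name`."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as handle:
                # cmdline is NUL-separated; the first field is argv[0].
                argv0 = handle.read().split(b"\x00", 1)[0]
        except OSError:
            continue  # process exited or is not readable
        if os.path.basename(argv0.decode(errors="replace")) == name:
            pids.append(int(entry))
    return sorted(pids)
```

Calling `find_processes("megamgr")` and getting an empty list is a reasonable precondition for the test run, though it does not rule out other programs issuing ioctls to the driver.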
> -----Original Message-----
> From: Troy Klein [mailto:res1uo0c@verizon.net]
> Sent: Thursday, December 09, 2004 10:35 AM
> To: Russell Johnson
> Cc: 'Tom Coughlan'; linux-scsi@vger.kernel.org
> Subject: RE: Megaraid corruption?
>
>
> What kernel version are you running? I am running
> 2.4.21-20.ELsmp (RedHat EL).
>
> Does the megaraid compile with 2.4 kernels and if so how is
> that done the source appears to be for 2.6?
>
> Troy
>
> On Thu, 2004-12-09 at 10:26 -0700, Russell Johnson wrote:
> > There was a recent patch to the megaraid2 driver that fixed corruption
> > while some ioctl's were executing during file operations. You
> > mentioned using megamgr. If that is running in the background while you
> > are performing the file copies, it could be causing the bug that was
> > fixed in the most recent megaraid2 driver.
> >
> > Try disabling megamgr and run your test...
> >
> > Or try updating megaraid2 to the latest version. I'm using
> > megaraid cmm: 2.20.2.0 (Release Date: Thu Aug 19 09:58:33 EDT 2004)
> > megaraid: 2.20.4.0 (Release Date: Mon Sep 27 22:15:07 EDT)
> >