* Interesting possible XFS crash condition
@ 2010-10-20 6:13 Shawn Usry
2010-10-20 8:12 ` Michael Monnerie
2010-10-20 9:54 ` Emmanuel Florac
0 siblings, 2 replies; 7+ messages in thread
From: Shawn Usry @ 2010-10-20 6:13 UTC
To: xfs
Hi List -
First off, thanks for the great filesystem. Thus far it's been an
excellent performer for my needs both professionally and personally.
I have a situation/environment that is producing a kernel crash, that
may be XFS related. A colleague suggested I post to this list as there
may be some interest in reproducing it.
Environment (current):
Fedora core 13 (kernel 2.6.34.7-56.fc13.i686)
xfsprogs-3.1.1-7.fc13.i686
RAID5 Controller: 3ware 9550-SXU-8LP 8-port sata controller, 64-bit PCI-X.
XFS filesystem in question is on a RAID 5 array on this controller, made
up of 4 identical disks, 1.5TB each, 64k stripe (block device = /dev/sdb)
The setup:
ORIGINALLY, this was a 3-disk RAID5. Created the XFS filesystem with:
--> mkfs -t xfs /dev/sdb
All was well in use to this point.
Next, I ADDED a 4th disk to the array, and expanded the array in place,
an operation supported by this RAID device.
New usable size = 4.5T
Once completed, I grew the XFS filesystem with xfs_growfs to expand
into the full size of the new array.
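A sketch of the grow step described above, for anyone trying to reproduce it. The mount point /mnt/array is an assumption (the post never names it), and the script only prints the commands unless DRY_RUN=0 is set:

```shell
#!/bin/sh
# Dry-run sketch of growing XFS after the array expansion.
# /mnt/array is a hypothetical mount point; /dev/sdb is from the post.
DRY_RUN=${DRY_RUN:-1}

run() {
    # Print the command in dry-run mode, execute it otherwise.
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

grow_xfs() {
    dev=$1
    mnt=$2
    run blockdev --getsize64 "$dev"   # confirm the kernel sees the new size
    run xfs_growfs "$mnt"             # grow the mounted filesystem to fill the device
    run xfs_info "$mnt"               # show the resulting geometry
}

grow_xfs /dev/sdb /mnt/array
```

xfs_growfs works on a mounted filesystem, which is why the sketch passes the mount point rather than the block device to it.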
Again, all was well, for about a week of normal use - fairly heavy
copy/read/write operations on a regular basis.
Then, without any changes or warning (that I was aware of at least), the
machine started crashing/kernel panic anytime I accessed (read/write)
MOST of the files in the filesystem. Some files could be accessed
without a problem. In general though any kind of high I/O (copying a
file (not moving) to the same device, copying to another block
device/disk, reading it across the network, etc) now causes the
condition, observed by access occurring normally for the first few MB
(this seems to vary in value) and then the system locking up completely.
Most of the time, the system becomes unresponsive and must be rebooted
to gain access again. In some cases though, system access will return,
on a limited / choppy basis and messages like "card reset" will appear
in the message log.
The latter statement and observations lead me to believe that perhaps
this was simply a yucky controller that was failing under heavy I/O.
However, several other tests/observations leave me wondering if it may
be a corrupt filesystem in some way, that is not being detected by
xfs_repair.
Tests / Observations:
1. Mounted, or Unmounted, I can "dd" the block device array (/dev/sdb)
all day long without a problem:
--> dd if=/dev/sdb of=/dev/null bs=(varied tests) result: end to end
no problem
--> dd if=/dev/sdb of=/tmp/test.file bs=(varies) result: no problem
(as long as test.file space permits..)
2. I can CREATE arbitrary NEW files on the filesystem, and copy/read them
OFF the device (disk-to-otherdisk, disk-to-samedisk, copy across the
network, etc.), read them, delete them - NO CRASH.
--> dd if=/dev/zero of=/myblkdevice/test.file bs=1M count=1024 (create
an arbitrary 1GB file). All normal.
3. Copying / Reading existing files (at least, that existed at the time
I grew the array) seems to trigger the system crash. Copying/reading
said NEW files (i.e., #2 above) does NOT trigger the crash.
4. Copying EXISTING files from other servers / locations on the
network, or other disks, to the device triggers the crash (i.e., would
be a NEW file being copied to the array, but not created ON the array).
5. Unmounted, xfs_repair -n /dev/sdb ---> finds no issues
6. Unmounted, xfs_repair /dev/sdb ---> finds no issues, performs no
changes.
Other Notes:
1. I did recently learn of the create-time and mount-time options
sunit/swidth for optimizing performance. Setting these had no effect
on this issue.
2. SOME files behave perfectly normally. I can copy them, read them, etc.
without a problem. For the most part, though, MOST files and almost all
file operations seem to trigger the crash.
3. Limited information is shown in what I've been able to capture from the
kernel crash. Nothing really specific or repeatable (different message
each time) - some instances of the terms "atomic" and "xfs" - other times
"irq" related.
4. In general the crash seems to happen when I either:
a. Attempt to do any reads of files larger than 100 MB or so
(small, single operations don't seem to have an effect, but strings of
small operations (unzipping a dir of files, for example) do).
b. Attempt to move or copy any data to the filesystem that didn't
ORIGINATE on the filesystem.
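For note 1 above, the sunit/swidth values that match this array can be worked out with a little arithmetic. This is only a sketch under two assumptions not stated in the post: that "64k stripe" is the per-disk chunk size, and that one disk's worth of each stripe holds parity. The mount options take both values in 512-byte sectors.

```shell
#!/bin/sh
# Derive XFS stripe hints (in 512-byte sectors) for a RAID5 array.
# Assumption: chunk size is per disk, and RAID5 leaves (disks - 1)
# data disks per stripe.
stripe_hints() {
    chunk_kib=$1      # per-disk stripe (chunk) size in KiB
    total_disks=$2    # number of disks in the RAID5 set
    data_disks=$((total_disks - 1))
    sunit=$((chunk_kib * 1024 / 512))
    swidth=$((sunit * data_disks))
    echo "sunit=$sunit,swidth=$swidth"
}

stripe_hints 64 4   # the expanded 4-disk array: sunit=128,swidth=384
stripe_hints 64 3   # the original 3-disk array: sunit=128,swidth=256
```

The stripe width changes when the fourth disk is added, so hints that suited the 3-disk layout no longer describe the expanded one. These are performance hints only, which is consistent with them having no effect on the crash.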
Questions:
1. Is it possible that my raid-expansion on the 3ware board brought on
some kind of corruption? Might not xfs_repair detect this if so?
2. Are there any thoughts / patches / commands / debug options I might
try to resolve this?
3. Is this more likely a problem with the 3ware controller + XFS
combination?
The only recourse I've thought of is to completely wipe the array and
start from scratch with a fresh 4-disk array, and XFS filesystem
creation, then copy data back to it.
I can't leave this device in place in an unusable state very long - I
just thought this list might be interested in the conditions. Any
suggestions or thoughts would be greatly appreciated. Resolving this
would save me a good deal of time.
Shawn
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
* Re: Interesting possible XFS crash condition
2010-10-20 6:13 Interesting possible XFS crash condition Shawn Usry
@ 2010-10-20 8:12 ` Michael Monnerie
2010-10-20 19:01 ` Shawn Usry
2010-10-20 9:54 ` Emmanuel Florac
1 sibling, 1 reply; 7+ messages in thread
From: Michael Monnerie @ 2010-10-20 8:12 UTC
To: xfs; +Cc: Shawn Usry
On Wednesday, 20 October 2010 Shawn Usry wrote:
> Limited information shown in what I've been able to capture in the
> kernel crash. Nothing really specific or repeatable (different
> message each time) - some instances to the term "atomic" and "xfs"
> - other times "irq" related
I'm not a dev, but I'd say a kernel crash dump would be very helpful.
Can't you at least take pictures of the messages?
I've never read about such XFS errors, maybe you should
1) xfs_metadump
2) xfs_mdrestore (into a file)
3) mount that file
and try to access files there. If this also crashes, it will really be
XFS related.
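The three steps above could look roughly like this. It is a sketch, not a tested recipe: the dump, image, and mount paths are made up for illustration, and the script only prints the commands unless DRY_RUN=0 is set. xfs_metadump should be run with the filesystem unmounted or mounted read-only.

```shell
#!/bin/sh
# Dry-run sketch of the metadump test. /dev/sdb is from the original
# post; every other path here is hypothetical.
DRY_RUN=${DRY_RUN:-1}

run() {
    # Print the command in dry-run mode, execute it otherwise.
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

metadump_test() {
    dev=$1; dump=$2; img=$3; mnt=$4
    run xfs_metadump "$dev" "$dump"    # 1) copy metadata only into a dump file
    run xfs_mdrestore "$dump" "$img"   # 2) restore the dump into an image file
    run mkdir -p "$mnt"
    run mount -o loop "$img" "$mnt"    # 3) mount the image via a loop device
}

metadump_test /dev/sdb /tmp/sdb.metadump /tmp/sdb.img /mnt/mdtest
```

Only metadata is copied, so the restored image has the directory tree and file sizes but not the file contents. If the original filesystem is still mounted on the same machine, the image may also need the nouuid mount option.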
Also, can you try putting the hard disks onto another system, possibly
with changing the RAID controller? It might be a hardware error.
--
Best regards,
Michael Monnerie, Ing. BSc
it-management Internet Services
http://proteger.at [pronounced: Prot-e-schee]
Tel: 0660 / 415 65 31
****** Radio interview on the topic of spam ******
http://www.it-podcast.at/archiv.html#podcast-100716
// We currently have two houses for sale:
// http://zmi.at/langegg/
// http://zmi.at/haus2009/
* Re: Interesting possible XFS crash condition
2010-10-20 8:12 ` Michael Monnerie
@ 2010-10-20 19:01 ` Shawn Usry
2010-10-20 21:27 ` Emmanuel Florac
0 siblings, 1 reply; 7+ messages in thread
From: Shawn Usry @ 2010-10-20 19:01 UTC
Cc: xfs
On 10/20/2010 3:12 AM, Michael Monnerie wrote:
> On Wednesday, 20 October 2010 Shawn Usry wrote:
>> Limited information shown in what I've been able to capture in the
>> kernel crash. Nothing really specific or repeatable (different
>> message each time) - some instances to the term "atomic" and "xfs"
>> - other times "irq" related
> I'm not a dev, but I'd say a kernel crash dump would be very helpful.
> Can't you at least take pictures of the messages?
>
> I've never read about such XFS errors, maybe you should
> 1) xfs_metadump
> 2) xfs_mdrestore (into a file)
> 3) mount that file
> and try to access files there. If this also crashes, it will really be
> XFS related.
>
> Also, can you try putting the hard disks onto another system, possibly
> with changing the RAID controller? It might be a hardware error.
>
Thanks for the suggestions / comments guys.
@Emmanuel - I've run a verify on the unit several times, some purposely,
some that started automatically when the system rebooted after a crash.
All have completed without a problem. I even forcibly removed a disk
and re-added it to the array to force a rebuild. This completed without
error, or any messages other than start/completed in dmesg.
@Michael - I can try to capture some of the kernel dump, but getting
this info is often sketchy - most often, no dump is ever produced, even
on the console screen. Even using netconsole to redirect console output,
with kernel debugging set, there is often little if any information.
What data is sometimes produced is rarely the same (seemingly)
information - but I'll try to capture what I can on several repeat
offenses.
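For reference, a netconsole setup of the kind mentioned above might be assembled like this. It is a sketch only: every address, port, MAC, and the interface name are placeholders rather than values from this system, and the script prints the commands instead of running them.

```shell
#!/bin/sh
# Build the netconsole module parameter string:
#   netconsole=<src-port>@<src-ip>/<dev>,<dst-port>@<dst-ip>/<dst-mac>
# All values passed below are placeholders for illustration.
netconsole_param() {
    src_port=$1; src_ip=$2; dev=$3; dst_port=$4; dst_ip=$5; dst_mac=$6
    echo "netconsole=${src_port}@${src_ip}/${dev},${dst_port}@${dst_ip}/${dst_mac}"
}

param=$(netconsole_param 6665 192.168.1.10 eth0 6666 192.168.1.20 00:11:22:33:44:55)

echo "On the crashing machine (as root):"
echo "  dmesg -n 8                      # send all kernel messages to the console"
echo "  modprobe netconsole $param"
echo "On the receiving machine:"
echo "  nc -u -l 6666                   # flag syntax varies between netcat flavours"
```

Raising the console log level first matters because netconsole only forwards what would have reached the console.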
I gave the xfs_metadump/xfs_mdrestore procedure a run and this produced
no problems. I could access the filesystem and files just fine - of
course they are all basically empty files so I couldn't really do any
real work with them, but I could traverse the filesystem and copy/move
files just fine. If there are any other detailed tests I could try
there, please let me know.
On hardware swapping - I'll have to find an MB with a 64-bit pci slot in
it. Otherwise, I sadly don't have a second controller to work with.
A couple of other notes:
1. I thought this might be driver-related (3w-xxxx) but I've tried
several versions of the driver, by using different distributions (CentOS
5, Fedora 13), with the same results. To note, the array was originally
created, and expanded, under CentOS 5.5. I reinstalled the OS with
Fedora 13, hoping that newer code might resolve the issue. Same results.
2. I did upgrade the firmware on the controller to a newer version
AFTER the issue appeared, hoping this would resolve it. Same results.
At this point I'm leaning toward faulty hardware somewhere.
* Re: Interesting possible XFS crash condition
2010-10-20 19:01 ` Shawn Usry
@ 2010-10-20 21:27 ` Emmanuel Florac
2010-10-21 4:45 ` Shawn Usry
0 siblings, 1 reply; 7+ messages in thread
From: Emmanuel Florac @ 2010-10-20 21:27 UTC
To: Shawn Usry; +Cc: xfs
On Wed, 20 Oct 2010 14:01:00 -0500 you wrote:
> 2. I did upgrade the firmware on the controller to a newer version
> AFTER the issue appeared, hoping this would resolve it. Same results.
>
> At this point I'm leaning toward faulty hardware somewhere.
Another possibility is a memory problem, possibly making the machine
crash under heavy load; if most RAM is used as filesystem cache, this
may lead to apparently XFS-related errors. You could try running
memtest86+ on the system for a couple of hours.
--
------------------------------------------------------------------------
Emmanuel Florac | Direction technique
| Intellique
| <eflorac@intellique.com>
| +33 1 78 94 84 02
------------------------------------------------------------------------
* Re: Interesting possible XFS crash condition
2010-10-20 21:27 ` Emmanuel Florac
@ 2010-10-21 4:45 ` Shawn Usry
2010-10-21 6:06 ` Dave Chinner
0 siblings, 1 reply; 7+ messages in thread
From: Shawn Usry @ 2010-10-21 4:45 UTC
To: xfs
On 10/20/2010 4:27 PM, Emmanuel Florac wrote:
> On Wed, 20 Oct 2010 14:01:00 -0500 you wrote:
>
>> 2. I did upgrade the firmware on the controller to a newer version
>> AFTER the issue appeared, hoping this would resolve it. Same results.
>>
>> At this point I'm leaning toward faulty hardware somewhere.
> Another possibility is a memory problem, possibly making the machine
> crash under heavy load; if most RAM is used as filesystem caching, this
> maybe may lead to apparently xfs related errors. You could try running
> memtest86+ on the system for a couple of hours.
>
Great minds think alike :) Actually I have indeed already run a
memtest86+ on the system which completed without a problem. Loading up
read/writes on other single disks on the system also doesn't produce a
problem. Just this one filesystem ;) It's a stumper!
* Re: Interesting possible XFS crash condition
2010-10-21 4:45 ` Shawn Usry
@ 2010-10-21 6:06 ` Dave Chinner
0 siblings, 0 replies; 7+ messages in thread
From: Dave Chinner @ 2010-10-21 6:06 UTC
To: Shawn Usry; +Cc: xfs
On Wed, Oct 20, 2010 at 11:45:03PM -0500, Shawn Usry wrote:
>
> On 10/20/2010 4:27 PM, Emmanuel Florac wrote:
> >On Wed, 20 Oct 2010 14:01:00 -0500 you wrote:
> >
> >>2. I did upgrade the firmware on the controller to a newer version
> >>AFTER the issue appeared, hoping this would resolve it. Same results.
> >>
> >>At this point I'm leaning toward faulty hardware somewhere.
> >Another possibility is a memory problem, possibly making the machine
> >crash under heavy load; if most RAM is used as filesystem caching, this
> >maybe may lead to apparently xfs related errors. You could try running
> >memtest86+ on the system for a couple of hours.
> >
> Great minds think alike :) Actually I have indeed already run a
> memtest86+ on the system which completed without a problem.
> Loading up read/writes on other single disks on the system also
> doesn't produce a problem. Just this one filesystem ;) It's a
> stumper!
Having some idea of the log messages generated by the crash - even
if it is screenshots via hand-held camera - might help us narrow
down the cause.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: Interesting possible XFS crash condition
2010-10-20 6:13 Interesting possible XFS crash condition Shawn Usry
2010-10-20 8:12 ` Michael Monnerie
@ 2010-10-20 9:54 ` Emmanuel Florac
1 sibling, 0 replies; 7+ messages in thread
From: Emmanuel Florac @ 2010-10-20 9:54 UTC
To: Shawn Usry; +Cc: xfs
On Wed, 20 Oct 2010 01:13:19 -0500
Shawn Usry <shawn@dolphinlogic.com> wrote:
> The latter statement and observations lead me to believe that perhaps
> this was simply a yucky controller that was failing under heavy
> I/O.
I've set up quite a lot of those RAID cards (about 100) and there is a
significant failure rate on these (much higher than the newer 9650). I
have seen both bad controller RAM and CPU overheating several times.
Try unmounting the filesystem and start a RAID verify:
tw_cli /cX/uY start verify
This will generate high I/O. Check dmesg for controller errors. Try
remounting after a couple of hours of verification. If the controller
is fried, it will most probably fail, but shouldn't crash the system.
--
------------------------------------------------------------------------
Emmanuel Florac | Direction technique
| Intellique
| <eflorac@intellique.com>
| +33 1 78 94 84 02
------------------------------------------------------------------------