public inbox for linux-xfs@vger.kernel.org
* Interesting possible XFS crash condition
@ 2010-10-20  6:13 Shawn Usry
  2010-10-20  8:12 ` Michael Monnerie
  2010-10-20  9:54 ` Emmanuel Florac
  0 siblings, 2 replies; 7+ messages in thread
From: Shawn Usry @ 2010-10-20  6:13 UTC (permalink / raw)
  To: xfs

  Hi List -

First off, thanks for the great filesystem.  Thus far it's been an 
excellent performer for my needs both professionally and personally.

I have a situation/environment that is producing a kernel crash, that 
may be XFS related.   A colleague suggested I post to this list as there 
may be some interest in reproducing it.


Environment (current):

Fedora core 13 (kernel 2.6.34.7-56.fc13.i686)
xfsprogs-3.1.1-7.fc13.i686

RAID5 Controller:  3ware 9550-SXU-8LP 8-port sata controller, 64-bit PCI-X.
XFS filesystem in question is on a RAID 5 array on this controller, made 
up of 4 identical disks, 1.5TB each, 64k stripe (block device = /dev/sdb)

The setup:
ORIGINALLY, this was a 3-disk RAID5.  Created the XFS filesystem with:
   --> mkfs -t xfs /dev/sdb

All was well in use to this point.

Next, I ADDED a 4th disk to the array, and expanded the array in place; 
an operation supported by this RAID device.
New usable size = 4.5T
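
As a quick sanity check on that figure, here is a minimal sketch of the
RAID5 capacity arithmetic.  The function name is made up for
illustration, and it assumes equal-sized disks with one disk's worth of
space lost to parity:

```shell
# Hypothetical helper: usable RAID5 capacity in GB for N equal disks.
# One disk's worth of space holds parity, so usable = (N - 1) * size.
raid5_usable_gb() {
    disks=$1
    disk_gb=$2
    echo $(( ($disks - 1) * $disk_gb ))
}

raid5_usable_gb 3 1500   # original 3-disk array  -> 3000
raid5_usable_gb 4 1500   # after adding disk 4    -> 4500
```

The second result matches the 4.5T reported by the controller.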

Once completed, I grew the XFS filesystem with xfs_growfs to expand 
into the full size of the new array.

Again, all was well, for about a week of normal use - fairly heavy 
copy/read/write operations on a regular basis.

Then, without any changes or warning (that I was aware of at least), the 
machine started crashing/kernel panic anytime I accessed (read/write) 
MOST of the files in the filesystem.   Some files could be accessed 
without a problem.  In general though any kind of high I/O (copying a 
file (not moving) to the same device, copying to another block 
device/disk, reading it across the network, etc) now causes the 
condition, observed by access occurring normally for the first few MB 
(this seems to vary in value) and then the system locking up completely.

Most of the time, the system becomes unresponsive and must be rebooted 
to gain access again.   In some cases though, system access will return, 
on a limited / choppy basis and messages like "card reset" will appear 
in the message log.

The latter statement and observations lead me to believe that perhaps 
this was simply a yucky controller that was failing under heavy I/O.   
However, several other tests/observations leave me wondering if it may 
be a corrupt filesystem in some way, that is not being detected by 
xfs_repair.

Tests / Observations:

1.  Mounted, or Unmounted, I can "dd" the block device array (/dev/sdb)  
all day long without a problem:
--> dd if=/dev/sdb of=/dev/null bs=(varied tests)    result:  end to end 
no problem
--> dd if=/dev/sdb of=/tmp/test.file bs=(varies)  result:   no problem 
(as long as test.file space permits..)

2.  I can CREATE arbitrary NEW files onto the filesystem, and copy them 
/ read them OFF the device, such as disk-to-otherdisk, 
disk-to-samedisk, copy across the network, etc., read them, delete them 
- NO CRASH.
--> dd if=/dev/zero of=/myblkdevice/test.file bs=1M count=1024 (create 
an arbitrary 1GB file).  All normal.

3.  Copying / Reading existing files (at least, that existed at the time 
I grew the array) seems to trigger the system crash.  Copying/reading 
said NEW files (i.e., #2 above) does NOT trigger the crash.

4.  Copying EXISTING files from other servers / locations on the 
network, or other disks,  to the device triggers the crash (i.e., would 
be a NEW file being copied to the array, but not created ON the array).

5.  Unmounted, xfs_repair -n /dev/sdb ---> finds no issues

6.  Unmounted, xfs_repair /dev/sdb ---> finds no issues, performs no 
changes.
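
Tests 2 and 3 could perhaps be scripted so that each file name is
logged before the file is read; after a lockup, the last logged name
would then point at the file being read at the time.  This is only a
sketch: the function name and the log path are hypothetical, the log
should live on a disk other than the array, and file names containing
newlines are not handled.

```shell
# Hypothetical helper: read every file under a directory, logging each
# name *before* reading it so the log survives a hard lockup.
read_all_files() {
    dir=$1
    log=$2
    find "$dir" -type f | while IFS= read -r f; do
        echo "$f" >> "$log"
        sync                                  # flush the log line to disk first
        dd if="$f" of=/dev/null bs=1M 2>/dev/null
    done
}

# Usage (paths are examples only):
#   read_all_files /myblkdevice /root/read-test.log
```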

Other Notes:
1.  I did recently learn of the create-time and mount-time options 
sunit/swidth for optimizing performance.   Setting these had no effect 
on this issue.

2.  SOME files behave perfectly normally. I can copy them, read them, etc. 
without a problem.  But for the MOST part, MOST files, and almost all file 
operations, seem to trigger the crash.

3.  Limited information is shown in what I've been able to capture of the 
kernel crash.  Nothing really specific or repeatable (different message 
each time) - some instances of the terms "atomic" and "xfs" - other times 
"irq" related.

4.  In general the crash seems to happen when I either:
    a.  Attempt to do any reads of files larger than 100 MB or so 
(small, single operations don't seem to have an effect, but strings of 
small operations (unzipping a dir of files, for example) does).
    b.  Attempt to move or copy any data to the filesystem that didn't 
ORIGINATE on the filesystem.
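
Regarding note 1 (sunit/swidth): a minimal sketch of how the values
could be derived for this array.  It assumes the "64k stripe" is the
per-disk chunk size and that a 4-disk RAID5 has 3 data disks; the
function name is hypothetical.  The mount options are expressed in
512-byte sectors, while mkfs.xfs accepts su (bytes, with suffix) and sw
(number of data disks).  These settings only affect allocation
alignment and performance.

```shell
# Hypothetical helper: print XFS stripe geometry for a RAID5 array.
#   $1 = per-disk chunk size in KiB (assumed to be the "64k stripe")
#   $2 = total number of disks in the RAID5 set
xfs_stripe_geometry() {
    chunk_kib=$1
    disks=$2
    data_disks=$(( $disks - 1 ))              # one disk's worth is parity
    sunit=$(( $chunk_kib * 1024 / 512 ))      # stripe unit in 512-byte sectors
    swidth=$(( $sunit * $data_disks ))        # full stripe in 512-byte sectors
    echo "mkfs.xfs -d su=${chunk_kib}k,sw=${data_disks}"
    echo "mount -o sunit=${sunit},swidth=${swidth}"
}

xfs_stripe_geometry 64 4
# mkfs.xfs -d su=64k,sw=3
# mount -o sunit=128,swidth=384
```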


Questions:
1.  Is it possible that my raid-expansion on the 3ware board brought on 
some kind of corruption?   Might not xfs_repair detect this if so?

2.  Are there any thoughts / patches / commands / debug options I might 
try to resolve this?

3.  Is this more likely a problem with the 3ware controller + XFS 
combination?

The only recourse I've thought of is to completely wipe the array and 
start from scratch with a fresh 4-disk array, and XFS filesystem 
creation, then copy data back to it.

I can't leave this device in place in an unusable state very long - I 
just thought this list might be interested in the conditions.   Any 
suggestions or thoughts would be greatly appreciated.  Resolving this 
would save me a good deal of time.

Shawn

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

* Re: Interesting possible XFS crash condition
  2010-10-20  6:13 Interesting possible XFS crash condition Shawn Usry
@ 2010-10-20  8:12 ` Michael Monnerie
  2010-10-20 19:01   ` Shawn Usry
  2010-10-20  9:54 ` Emmanuel Florac
  1 sibling, 1 reply; 7+ messages in thread
From: Michael Monnerie @ 2010-10-20  8:12 UTC (permalink / raw)
  To: xfs; +Cc: Shawn Usry


On Wednesday, 20 October 2010 Shawn Usry wrote:
> Limited information shown in what I've been able to capture in the 
> kernel crash.  Nothing really specific or repeatable (different
> message  each time) - some instances to the term "atomic" and "xfs"
> - other times "irq" related

I'm not a dev, but I'd say a kernel crash dump would be very helpful. 
Can't you at least take pictures of the messages?

I've never read about such XFS errors, maybe you should 
1) xfs_metadump
2) xfs_mdrestore (into a file)
3) mount that file
and try to access files there. If this also crashes, it will really be 
XFS related.
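
The three steps above might look roughly like the sketch below.  All
paths are placeholders, and the `run` wrapper and DRY_RUN variable are
made up for illustration: by default the commands are only printed, not
executed.  xfs_metadump should be run against an unmounted (or
read-only mounted) filesystem.

```shell
# Print the command when DRY_RUN=1 (the default here), otherwise run it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical wrapper around the three steps; arguments are placeholders.
metadump_test() {
    dev=$1; dump=$2; img=$3; mnt=$4
    run xfs_metadump "$dev" "$dump"      # 1) copy metadata only, no file data
    run xfs_mdrestore "$dump" "$img"     # 2) restore into a (sparse) image file
    run mount -o loop "$img" "$mnt"      # 3) loop-mount the restored image
}

metadump_test /dev/sdb /tmp/sdb.metadump /tmp/sdb.img /mnt/mdtest
```

Setting DRY_RUN=0 (as root) would actually execute the commands.  Since
the image contains metadata only, file contents read back from it are
not the real data.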

Also, can you try putting the hard disks onto another system, possibly 
with changing the RAID controller? It might be a hardware error.

-- 
Kind regards,
Michael Monnerie, Ing. BSc

it-management Internet Services
http://proteger.at [pronounced: Prot-e-schee]
Tel: 0660 / 415 65 31

****** Radio interview on the topic of spam ******
http://www.it-podcast.at/archiv.html#podcast-100716

// We currently have two houses for sale:
// http://zmi.at/langegg/
// http://zmi.at/haus2009/

* Re: Interesting possible XFS crash condition
  2010-10-20  6:13 Interesting possible XFS crash condition Shawn Usry
  2010-10-20  8:12 ` Michael Monnerie
@ 2010-10-20  9:54 ` Emmanuel Florac
  1 sibling, 0 replies; 7+ messages in thread
From: Emmanuel Florac @ 2010-10-20  9:54 UTC (permalink / raw)
  To: Shawn Usry; +Cc: xfs

On Wed, 20 Oct 2010 01:13:19 -0500
Shawn Usry <shawn@dolphinlogic.com> wrote:

> The latter statement and observations lead me to believe that perhaps 
> this was simply a yucky controller that was failing under heavy
> I/O.   

I've set up quite a lot of those RAID cards (about 100) and there is a
significant failure rate on these (much higher than the newer 9650). I
had both cases of bad controller RAM and CPU overheating several times.

Try unmounting the filesystem and start a RAID verify:

tw_cli /cX/uY start verify

This will generate high IO. Check dmesg for controller errors. Try
remounting after a couple of hours of verification. If the controller
is fried, it will most probably fail, but shouldn't crash the system.
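
For the dmesg check, a small filter along these lines might help pick
out the relevant lines while the verify runs.  The function name and
the keyword list are assumptions for illustration (the 3ware driver
prefix, the "card reset" message mentioned earlier in the thread,
generic I/O errors, and XFS):

```shell
# Hypothetical filter: keep only kernel log lines that mention the
# 3ware driver, a card reset, an I/O error, or XFS (case-insensitive).
controller_errors() {
    grep -iE '3w-|card reset|i/o error|xfs'
}

# Usage:
#   dmesg | controller_errors
```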

-- 
------------------------------------------------------------------------
Emmanuel Florac     |   Direction technique
                    |   Intellique
                    |	<eflorac@intellique.com>
                    |   +33 1 78 94 84 02
------------------------------------------------------------------------

* Re: Interesting possible XFS crash condition
  2010-10-20  8:12 ` Michael Monnerie
@ 2010-10-20 19:01   ` Shawn Usry
  2010-10-20 21:27     ` Emmanuel Florac
  0 siblings, 1 reply; 7+ messages in thread
From: Shawn Usry @ 2010-10-20 19:01 UTC (permalink / raw)
  Cc: xfs



On 10/20/2010 3:12 AM, Michael Monnerie wrote:
> On Wednesday, 20 October 2010 Shawn Usry wrote:
>> Limited information shown in what I've been able to capture in the
>> kernel crash.  Nothing really specific or repeatable (different
>> message  each time) - some instances to the term "atomic" and "xfs"
>> - other times "irq" related
> I'm not a dev, but I'd say a kernel crash dump would be very helpful.
> Can't you at least take pictures of the messages?
>
> I've never read about such XFS errors, maybe you should
> 1) xfs_metadump
> 2) xfs_mdrestore (into a file)
> 3) mount that file
> and try to access files there. If this also crashes, it will really be
> XFS related.
>
> Also, can you try putting the hard disks onto another system, possibly
> with changing the RAID controller? It might be a hardware error.
>
Thanks for the suggestions / comments guys.

@Emmanuel - I've run a verify on the unit several times, some purposely,
some that start automatically after the system reboots after a crash.
All have completed without a problem.   I even forcibly removed a disk
and re-added it to the array to force a rebuild.  This completed
without error, or any messages other than start/completed in dmesg.

@Michael -  I can try to capture some of the kernel dump - but getting
this info is often sketchy - most often, no dump is ever produced, even
to the console screen.  Even using netconsole to redirect console output,
with kernel debugging set, there is often little if any information.
What data is sometimes produced is rarely the same (seemingly)
information - but I'll try to capture what I can on several repeat
offenses.

I gave the xfs_metadump/xfs_mdrestore procedure a run and this produced
no problems.  I could access the filesystem and files just fine - of
course they are all basically empty files so I couldn't really do any
real work with them, but I could traverse the filesystem and copy/move
files just fine.  If there are any other detailed tests I could try
there please let me know.

On hardware swapping - I'll have to find an MB with a 64-bit pci slot in 
it.  Otherwise, I sadly don't have a second controller to work with.

A couple of other notes:

1.  I thought this might be driver-related (3w-xxxx) but I've tried
several versions of the driver, by using different distributions (CentOS
5, Fedora 13) with the same results.  To note, the array was originally
created, and expanded, under CentOS 5.5.   I reinstalled the OS to
Fedora 13, hoping that newer code might resolve the issue.  Same results.

2.  I did upgrade the firmware on the controller to a newer version 
AFTER the issue appeared, hoping this would resolve it.  Same results.

At this point I'm leaning toward faulty hardware somewhere.


* Re: Interesting possible XFS crash condition
  2010-10-20 19:01   ` Shawn Usry
@ 2010-10-20 21:27     ` Emmanuel Florac
  2010-10-21  4:45       ` Shawn Usry
  0 siblings, 1 reply; 7+ messages in thread
From: Emmanuel Florac @ 2010-10-20 21:27 UTC (permalink / raw)
  To: Shawn Usry; +Cc: xfs

On Wed, 20 Oct 2010 14:01:00 -0500 you wrote:

> 2.  I did upgrade the firmware on the controller to a newer version 
> AFTER the issue appeared, hoping this would resolve it.  Same results.
> 
> At this point I'm leaning toward faulty hardware somewhere.

Another possibility is a memory problem, possibly making the machine
crash under heavy load; if most RAM is used as filesystem cache, this
may lead to apparently XFS-related errors. You could try running
memtest86+ on the system for a couple of hours.

-- 
------------------------------------------------------------------------
Emmanuel Florac     |   Direction technique
                    |   Intellique
                    |	<eflorac@intellique.com>
                    |   +33 1 78 94 84 02
------------------------------------------------------------------------

* Re: Interesting possible XFS crash condition
  2010-10-20 21:27     ` Emmanuel Florac
@ 2010-10-21  4:45       ` Shawn Usry
  2010-10-21  6:06         ` Dave Chinner
  0 siblings, 1 reply; 7+ messages in thread
From: Shawn Usry @ 2010-10-21  4:45 UTC (permalink / raw)
  To: xfs


On 10/20/2010 4:27 PM, Emmanuel Florac wrote:
> On Wed, 20 Oct 2010 14:01:00 -0500 you wrote:
>
>> 2.  I did upgrade the firmware on the controller to a newer version
>> AFTER the issue appeared, hoping this would resolve it.  Same results.
>>
>> At this point I'm leaning toward faulty hardware somewhere.
> Another possibility is a memory problem, possibly making the machine
> crash under heavy load; if most RAM is used as filesystem cache, this
> may lead to apparently XFS-related errors. You could try running
> memtest86+ on the system for a couple of hours.
>
Great minds think alike :)  Actually I have indeed already run a 
memtest86+ on the system which completed without a problem.   Loading up 
read/writes on other single disks on the system also doesn't produce a 
problem.  Just this one filesystem ;)     It's a stumper!

* Re: Interesting possible XFS crash condition
  2010-10-21  4:45       ` Shawn Usry
@ 2010-10-21  6:06         ` Dave Chinner
  0 siblings, 0 replies; 7+ messages in thread
From: Dave Chinner @ 2010-10-21  6:06 UTC (permalink / raw)
  To: Shawn Usry; +Cc: xfs

On Wed, Oct 20, 2010 at 11:45:03PM -0500, Shawn Usry wrote:
> 
> On 10/20/2010 4:27 PM, Emmanuel Florac wrote:
> >On Wed, 20 Oct 2010 14:01:00 -0500 you wrote:
> >
> >>2.  I did upgrade the firmware on the controller to a newer version
> >>AFTER the issue appeared, hoping this would resolve it.  Same results.
> >>
> >>At this point I'm leaning toward faulty hardware somewhere.
> >Another possibility is a memory problem, possibly making the machine
> >crash under heavy load; if most RAM is used as filesystem cache, this
> >may lead to apparently XFS-related errors. You could try running
> >memtest86+ on the system for a couple of hours.
> >
> Great minds think alike :)  Actually I have indeed already run a
> memtest86+ on the system which completed without a problem.
> Loading up read/writes on other single disks on the system also
> doesn't produce a problem.  Just this one filesystem ;)     It's a
> stumper!

Having some idea of the log messages generated by the crash - even
if it is screen shots via hand-held camera - might help us narrow
down the cause.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
