* XFS on CoRAID errors with SMB
@ 2011-11-28 13:55 Jon Marshall
2011-11-28 14:46 ` Joe Landman
0 siblings, 1 reply; 4+ messages in thread
From: Jon Marshall @ 2011-11-28 13:55 UTC (permalink / raw)
To: xfs; +Cc: Rory Campbell-Lange, support
Hi,
We have recently experienced what appear to be XFS filesystem errors on
a samba share. The actual filesystem resides on a network attached
storage device, a Coraid. The attached server locked up totally, and we
were forced to hard reset it.
I have the following trace from the kernel logs:
[6128798.051868] smbd: page allocation failure. order:4, mode:0xc0d0
[6128798.051872] Pid: 16908, comm: smbd Not tainted 2.6.32-5-amd64 #1
[6128798.051874] Call Trace:
[6128798.051882] [<ffffffff810ba5d6>] ? __alloc_pages_nodemask+0x592/0x5f4
[6128798.051885] [<ffffffff810b959c>] ? __get_free_pages+0x9/0x46
[6128798.051889] [<ffffffff810e7ea1>] ? __kmalloc+0x3f/0x141
[6128798.051893] [<ffffffff8110672c>] ? getxattr+0x89/0x117
[6128798.051896] [<ffffffff810e5b65>] ? virt_to_head_page+0x9/0x2a
[6128798.051899] [<ffffffff810f9bc4>] ? user_path_at+0x52/0x79
[6128798.051919] [<ffffffffa0297b17>] ? xfs_xattr_put_listent+0x0/0xe5 [xfs]
[6128798.051922] [<ffffffff810e5b65>] ? virt_to_head_page+0x9/0x2a
[6128798.051925] [<ffffffff8118ddcb>] ? _atomic_dec_and_lock+0x33/0x50
[6128798.051928] [<ffffffff811068b4>] ? sys_getxattr+0x45/0x60
[6128798.051931] [<ffffffff81010b42>] ? system_call_fastpath+0x16/0x1b
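For context on the trace above: an order:4 allocation asks the kernel for a contiguous run of 2^4 pages. A quick sketch of what that means in bytes, plus the standard (Linux-specific) way to inspect per-order fragmentation:

```shell
# An order:4 failure means no contiguous run of 2^4 = 16 pages was free.
# With 4 KiB pages, that is a single 64 KiB block:
order4_bytes=$((4096 * (1 << 4)))
echo "$order4_bytes"   # 65536

# Per-zone counts of free blocks at each order (first count = order 0):
if [ -r /proc/buddyinfo ]; then cat /proc/buddyinfo; fi
```

If the higher-order columns of /proc/buddyinfo sit at or near zero while lower-order counts stay high, memory is fragmented rather than exhausted outright.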
smbd threw these errors for about 15 minutes, then sshd started
throwing errors, and shortly afterwards the system became unresponsive.
Has anyone had experience of similar issues, either with XFS
on a CoRAID device or with XFS-backed SMB shares?
Thanks
Jon
--
Jon Marshall
Technical Officer
jon@campbell-lange.net
Campbell-Lange Workshop
www.campbell-lange.net
0207 6311 555
3 Tottenham Street London W1T 2AF
Registered in England No. 04551928
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
* Re: XFS on CoRAID errors with SMB
2011-11-28 13:55 XFS on CoRAID errors with SMB Jon Marshall
@ 2011-11-28 14:46 ` Joe Landman
2011-11-28 15:26 ` Jon Marshall
0 siblings, 1 reply; 4+ messages in thread
From: Joe Landman @ 2011-11-28 14:46 UTC (permalink / raw)
To: xfs
On 11/28/2011 08:55 AM, Jon Marshall wrote:
> Hi,
>
> We have recently experienced what appear to be XFS filesystem errors on
> a samba share. The actual filesystem resides on a network attached
> storage device, a Coraid. The attached server locked up totally, and we
> were forced to hard reset it.
This is (from our past experience working with these units and the AoE
system) more likely the AoE driver crashing, or something on the
underlying network failing; from there, the file system eventually dies.
This isn't an XFS problem per se; XFS is sort of an unwilling participant
in a slow-motion crash.
> I have the following trace from the kernel logs:
>
> [6128798.051868] smbd: page allocation failure. order:4, mode:0xc0d0
> [6128798.051872] Pid: 16908, comm: smbd Not tainted 2.6.32-5-amd64 #1
> [6128798.051874] Call Trace:
> [6128798.051882] [<ffffffff810ba5d6>] ? __alloc_pages_nodemask+0x592/0x5f4
> [6128798.051885] [<ffffffff810b959c>] ? __get_free_pages+0x9/0x46
> [6128798.051889] [<ffffffff810e7ea1>] ? __kmalloc+0x3f/0x141
Note the failed kmalloc: something ran you out of memory. What
we've run into in the past with this has been a driver memory leak
(usually older-model e1000 or similar drivers).
[...]
> smbd seems to throw these errors for about 15 minutes, then sshd starts
> throwing errors and shortly after the system became unresponsive.
>
> Just wondering if anyone had any experience of similar results, with XFS
> on a CoRAID device or XFS SMB shares?
This is what you see when the AoE stack collapses due to a crash of one
of the lower block layers. XFS can't run if it can't allocate memory for
itself. smbd dies when the underlying filesystem goes away. sshd
probably becomes unresponsive in part due to all the I/Os queuing up that
the scheduler can't do anything with. Before sshd stops working, user
load winds up past 5x the number of CPUs, then past 10x, then ...
Once you see this happening, it's time to kill the upper-level stacks if
possible, and unmount the file system as rapidly as possible. If you
can't kill the stuff above it, a 'umount -l' is your friend. You *may*
be able to regain enough control for a non-crash-based reboot. Even
with this, I'd recommend remounting / sync before forcing a reboot:
mount -o remount,sync /
to preserve the integrity of the OS drive.
Then reboot (or, if the user load is too high and a reboot command will
just hang, hopefully you have IPMI on your unit so you can do an
'ipmitool -I open chassis power cycle' hard bounce).
>
> Thanks
> Jon
>
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: landman@scalableinformatics.com
web : http://scalableinformatics.com
http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615
* Re: XFS on CoRAID errors with SMB
2011-11-28 14:46 ` Joe Landman
@ 2011-11-28 15:26 ` Jon Marshall
2011-11-28 15:36 ` Joe Landman
0 siblings, 1 reply; 4+ messages in thread
From: Jon Marshall @ 2011-11-28 15:26 UTC (permalink / raw)
To: Joe Landman; +Cc: xfs
Hi Joe,
Thanks for the rapid response.
Is this something that has been reported often in relation to AoE? Is
there any chance you could point us towards some more
background on the issue? I am checking the AoE mailing list, but if you
know of something specific, that would be very helpful.
I am also looking into the ethernet drivers we have in place on the
system in question.
Again, thanks for the quick and informative response.
Jon
On Mon, Nov 28, 2011 at 09:46:22AM -0500, Joe Landman wrote:
> On 11/28/2011 08:55 AM, Jon Marshall wrote:
> >Hi,
> >
> >We have recently experienced what appear to be XFS filesystem errors on
> >a samba share. The actual filesystem resides on a network attached
> >storage device, a Coraid. The attached server locked up totally, and we
> >were forced to hard reset it.
>
> This is (from our past experience working with these units and the
> AoE system), more likely the AoE driver crashing (or something on
> the underlying network failing). From there, the file system
> eventually dies.
>
> This isn't an xfs problem per se, xfs is sort of an unwilling
> participant in a slow motion crash.
>
> >I have the following trace from the kernel logs:
> >
> >[6128798.051868] smbd: page allocation failure. order:4, mode:0xc0d0
> >[6128798.051872] Pid: 16908, comm: smbd Not tainted 2.6.32-5-amd64 #1
> >[6128798.051874] Call Trace:
> >[6128798.051882] [<ffffffff810ba5d6>] ? __alloc_pages_nodemask+0x592/0x5f4
> >[6128798.051885] [<ffffffff810b959c>] ? __get_free_pages+0x9/0x46
> >[6128798.051889] [<ffffffff810e7ea1>] ? __kmalloc+0x3f/0x141
>
> If you note the failed kmalloc, something ran you out of memory.
> What we've run into in the past with this has been a driver memory
> leak (usually older model e1000 or similar drivers)
>
> [...]
>
> >smbd seems to throw these errors for about 15 minutes, then sshd starts
> >throwing errors and shortly after the system became unresponsive.
> >
> >Just wondering if anyone had any experience of similar results, with XFS
> >on a CoRAID device or XFS SMB shares?
>
> This is what you see when the AoE stack collapses due to a crash of
> one of the lower block rungs. XFS can't run if it can't allocate
> memory for itself. smbd dies when the underlying filesystem goes
> away. sshd probably gets unresponsive in part, due to all the IOs
> queuing up that the scheduler can't do anything with. Before sshd
> stops working, user load winds up past 5x number of CPUs, then past
> 10x, then ...
>
> Once you see this happening, it's time to kill the upper level stacks
> if possible, and unmount the file system as rapidly as possible. If
> you can't kill the stuff above it, a 'umount -l ' is your friend.
> You *may* be able to regain enough control for a non-crash based
> reboot. Even with this, I'd recommend changing / to sync before
> forcing a reboot
>
> mount -o remount,sync /
>
> to preserve the integrity of the OS drive.
>
> Then reboot (or if the user load is too high, and a reboot command
> will just hang ... hopefully you have IPMI on your unit so you can do
> an 'ipmitool -I open chassis power cycle' hard bounce)
>
> >
> >Thanks
> >Jon
> >
>
>
> --
> Joseph Landman, Ph.D
> Founder and CEO
> Scalable Informatics Inc.
> email: landman@scalableinformatics.com
> web : http://scalableinformatics.com
> http://scalableinformatics.com/sicluster
> phone: +1 734 786 8423 x121
> fax : +1 866 888 3112
> cell : +1 734 612 4615
>
--
Jon Marshall
Technical Officer
jon@campbell-lange.net
Campbell-Lange Workshop
www.campbell-lange.net
0207 6311 555
3 Tottenham Street London W1T 2AF
Registered in England No. 04551928
* Re: XFS on CoRAID errors with SMB
2011-11-28 15:26 ` Jon Marshall
@ 2011-11-28 15:36 ` Joe Landman
0 siblings, 0 replies; 4+ messages in thread
From: Joe Landman @ 2011-11-28 15:36 UTC (permalink / raw)
To: Jon Marshall; +Cc: xfs
On 11/28/2011 10:26 AM, Jon Marshall wrote:
> Hi Joe,
>
> Thanks for the rapid response.
>
> Is this something that has been reported often in relation to AoE? Is
We've experienced it in the past when we supported our customers with
Coraid gear. Most of that is gone now, so we haven't seen much AoE
work of late (the last two years or so).
That said, the AoE stack depends critically upon the network stack, and
somewhere between AoE and the network stack (or possibly something else)
you ran out of memory for kernel use. In our experience the cause is
usually a leaky network driver; e1000 and similar Intel drivers shipped
with default RHEL5/CentOS5 are highly problematic. AoE could also be
leaking itself (early versions were pretty bad in this regard, though I
haven't looked at the driver in the last few years, so hopefully it has
improved).
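A leak of this kind usually shows up as unreclaimable slab growth. A minimal sketch of checking for it; the figures below are sample data standing in for /proc/meminfo, so the function of the pipeline is testable anywhere:

```shell
# Extract the unreclaimable-slab figure from meminfo-style input; a value
# that only ever grows over time points at a kernel-side leak (driver, AoE, ...).
sunreclaim_kb=$(awk '/^SUnreclaim:/ { print $2 }' <<'EOF'
Slab:             240000 kB
SReclaimable:      80000 kB
SUnreclaim:       160000 kB
EOF
)
echo "SUnreclaim: ${sunreclaim_kb} kB"
# On a live system: awk '/^SUnreclaim:/ { print $2 }' /proc/meminfo
```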
The XFS connection to this (to stay relevant to this group) is that XFS
is fine atop this stack, as long as the other layers don't go away. If
you can detect problems like this in advance, you might be able to issue
an xfs_freeze and preserve the integrity of the underlying filesystem
(obviating the need for an xfs_repair). The hard part is making an
accurate prediction, but if your drivers are grabbing memory and not
releasing it, or you have a runaway memory-consuming process, you could
potentially predict the onset.
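That kind of pre-emptive step could look roughly like this. The mount point, threshold, and `freeze_if_low_memory` helper are illustrative; `xfs_freeze -f` / `-u` are the real freeze/thaw flags, and the meminfo path is parameterised so the logic can be exercised against sample data:

```shell
#!/bin/sh
# Sketch: pre-emptively freeze XFS when free memory drops below a threshold.
freeze_if_low_memory() {
    mnt=$1; min_free_kb=$2; meminfo=${3:-/proc/meminfo}
    free_kb=$(awk '/^MemFree:/ { print $2 }' "$meminfo")
    if [ "$free_kb" -lt "$min_free_kb" ]; then
        xfs_freeze -f "$mnt"    # quiesce XFS and flush its log to disk
        return 0                # thaw later with: xfs_freeze -u "$mnt"
    fi
    return 1
}

# Example: freeze_if_low_memory /mnt/coraid 65536
```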
> there any chance you could point us in the direction of some more
> background on the issue? I am checking the AoE mailing list, but if you know
> of something specific that would be very helpful.
Not really; we aren't doing much with AoE anymore. This may or may not
be an AoE issue per se. Likely AoE crashed, and the reason for the
crash is very probably the same reason XFS did: it ran out of memory.
If AoE is the culprit, you might find some imprint of this in the logs,
though in our experience it is usually a runaway network driver. Since
AoE carries its block devices over raw Ethernet frames, it doesn't take
long for a leaky driver to crash such a system under load.
>
> I am also looking into the ethernet drivers we have in place on the
> system in question.
>
> Again, thanks for the quick and informative response.
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: landman@scalableinformatics.com
web : http://scalableinformatics.com
http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615