XFS hangs with XFS: possible memory allocation deadlock in kmem_alloc

public inbox for linux-xfs@vger.kernel.org

* XFS hangs with XFS: possible memory allocation deadlock in kmem_alloc
From: Michael Meier @ 2015-03-07  7:51 UTC
  To: xfs

We've recently upgraded the OS on one of our servers, and since then
have been experiencing frequent stalls of the XFS filesystem on it.
Other filesystems on the machine seem to still respond fine while XFS
hangs. The stalls sometimes last for around 30 minutes, during which all
attempts to access that filesystem hang completely - after that, the
filesystem suddenly responds instantly again, as if there had never been
any problem. The dmesg is full of these messages while it stalls:
 XFS: possible memory allocation deadlock in kmem_alloc (mode:0x8250)
These also occur from time to time without the filesystem stalling (or
at least it's not noticeable) - the messages appear about once every two
hours, the stalls about once a day.
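
For reference, the mode value in the message is the GFP flag mask handed
to the page allocator. Below is a minimal sketch for decoding it. It
assumes the ___GFP_* bit values of include/linux/gfp.h as found in 3.x-era
kernels; the table is written from memory and these values change between
kernel versions, so it should be checked against the source of the kernel
that is actually running. decode_gfp is a hypothetical helper name.

```python
# Assumed bit values (3.x-era include/linux/gfp.h, from memory).
# Verify against the running kernel's source before relying on them.
GFP_BITS = {
    0x0001: "__GFP_DMA",
    0x0002: "__GFP_HIGHMEM",
    0x0004: "__GFP_DMA32",
    0x0008: "__GFP_MOVABLE",
    0x0010: "__GFP_WAIT",
    0x0020: "__GFP_HIGH",
    0x0040: "__GFP_IO",
    0x0080: "__GFP_FS",
    0x0100: "__GFP_COLD",
    0x0200: "__GFP_NOWARN",
    0x0400: "__GFP_REPEAT",
    0x0800: "__GFP_NOFAIL",
    0x1000: "__GFP_NORETRY",
    0x2000: "__GFP_MEMALLOC",
    0x4000: "__GFP_COMP",
    0x8000: "__GFP_ZERO",
}


def decode_gfp(mode):
    """Return the names of the flag bits set in a GFP mode mask."""
    names = [name for bit, name in sorted(GFP_BITS.items()) if mode & bit]
    unknown = mode & ~sum(GFP_BITS)  # bits this table does not cover
    if unknown:
        names.append(hex(unknown))
    return names


print(decode_gfp(0x8250))
```

Under that assumption, 0x8250 is __GFP_WAIT | __GFP_IO | __GFP_NOWARN |
__GFP_ZERO: a zeroing allocation that may sleep and start I/O but, with
__GFP_FS clear, must not recurse into the filesystem.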

Google did point me to some reports of these messages occurring at the
end of 2013, but the kernels in question should all have had the fixes
proposed back then - although one message back then suggested there were
more places where this problem could occur that were not fixed yet.

Kernels used were:
- Ubuntu 3.13.0-44  - shows stalls, although according to
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1382333 it has the fix
- Ubuntu 3.16.0-31  - shows stalls
- Ubuntu 3.2.0-various - no stalls in more than 1 year
We can actually still boot the machine with the 3.2.0 kernel, and it
will run absolutely fine, but as that kernel will not be supported
forever, I do not consider that a permanent solution.

The machine should not be low on memory, the disk array is far from its
limits, and the I/O load is mostly reads with very few writes, as
this is a public FTP server.

I have tried to collect some information, available at
https://grid.rrze.uni-erlangen.de/~unrz191/syslog-with-xfs-hangs.log

Regards,
-- 
Michael Meier, Zentrale Systeme
Friedrich-Alexander-Universitaet Erlangen-Nuernberg
Regionales Rechenzentrum Erlangen
Martensstrasse 1, 91058 Erlangen, Germany
Tel.: +49 9131 85-28973, Fax: +49 9131 302941
michael.meier@fau.de
www.rrze.fau.de

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

* Re: XFS hangs with XFS: possible memory allocation deadlock in kmem_alloc
From: Brian Foster @ 2015-03-07 14:07 UTC
  To: Michael Meier; +Cc: xfs

On Sat, Mar 07, 2015 at 08:51:50AM +0100, Michael Meier wrote:
> We've recently upgraded the OS on one of our servers, and since then
> have been experiencing frequent stalls of the XFS filesystem on it.
> Other filesystems on the machine seem to still respond fine while XFS
> hangs. The stalls sometimes last for around 30 minutes, during which all
> attempts to access that filesystem hang completely - after that, the
> filesystem suddenly responds instantly again, as if there had never been
> any problem. The dmesg is full of these messages while it stalls:
>  XFS: possible memory allocation deadlock in kmem_alloc (mode:0x8250)
> These also occur from time to time without the filesystem stalling (or
> at least it's not noticeable) - the messages appear about once in two
> hours, the stalls about once a day.
> 
> Google did point me to some reports of these messages occurring at the
> end of 2013, but the kernels in question should all have had the fixes
> proposed back then - although one message back then suggested there were
> more places where this problem could occur that were not fixed yet.
> 
> Kernels used were:
> - Ubuntu 3.13.0-44  - shows stalls, according to
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1382333 has the fix
> - Ubuntu 3.16.0-31  - shows stalls
> - Ubuntu 3.2.0-various - no stalls in more than 1 year
> We can actually still boot the machine with the 3.2.0 kernel, and it
> will run absolutely fine, but as that kernel will not be supported
> forever, I do not consider that a permanent solution.
> 
> The machine should not be low on memory, the disk array far from its
> limits, and the I/O-load is mostly reads with very little writes, as
> this is a public FTP server.
> 
> I have tried to collect some information, available at
> https://grid.rrze.uni-erlangen.de/~unrz191/syslog-with-xfs-hangs.log
> 

Thanks for the data. Some notes from the backtraces in the first
instance:

- xfsaild is down in xlog_cil_force_lsn()->flush_work(). So it's trying
  to push the log, but the workqueue worker is already running.
- The workqueue worker is here:

	[298163.482697] Workqueue: xfs-cil/dm-0 xlog_cil_push_work [xfs]

... and it appears to be blocked on the ctx lock. This means either a
transaction is completing or somebody else is pushing the cil.
- Writeback and one or two other transactions are backed up waiting on
  the ctx lock.
- rsync is running a transaction completion (i.e., holding ctx lock) and
  blocked on memory allocation:

rsync           D ffff88103f893440     0 44446  43197 0x00000000
 ffff8809e4f7b9f0 0000000000000086 ffff880801a15bb0 ffff8809e4f7bfd8
 0000000000013440 0000000000013440 ffff8810146428c0 ffff881013dd8000
 ffff8809e4f7ba20 00000001046f17ea ffff881013dd8000 000000000000d158
Call Trace:
 [<ffffffff817675c9>] schedule+0x29/0x70
 [<ffffffff817668e5>] schedule_timeout+0x165/0x2a0
 [<ffffffff8107a420>] ? ftrace_raw_event_tick_stop+0xc0/0xc0
 [<ffffffff81767c9b>] io_schedule_timeout+0x9b/0xf0
 [<ffffffff81180403>] congestion_wait+0x73/0x100
 [<ffffffff810b4d10>] ? prepare_to_wait_event+0x100/0x100
 [<ffffffffc01deaac>] kmem_alloc+0x6c/0xf0 [xfs]
 [<ffffffffc022399f>] xfs_log_commit_cil+0x34f/0x470 [xfs]
 [<ffffffffc01de37c>] xfs_trans_commit+0x11c/0x230 [xfs]
 [<ffffffffc0212c81>] xfs_rename+0x601/0x670 [xfs]
 [<ffffffffc01d41c2>] xfs_vn_rename+0x82/0x90 [xfs]
 [<ffffffff811e34de>] vfs_rename+0x56e/0x740
 [<ffffffff811e4383>] SYSC_renameat2+0x483/0x530
 [<ffffffff811d6451>] ? __sb_end_write+0x31/0x60
 [<ffffffff811f208f>] ? mnt_drop_write+0x1f/0x30
 [<ffffffff811f34b4>] ? mntput+0x24/0x40
 [<ffffffff811ea6ac>] ? dput+0x4c/0x180
 [<ffffffff811f34b4>] ? mntput+0x24/0x40
 [<ffffffff811dd39e>] ? path_put+0x1e/0x30
 [<ffffffff811e561e>] SyS_rename+0x1e/0x20
 [<ffffffff8176b66d>] system_call_fastpath+0x1a/0x1f

... so that appears to hold everything else up. 
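
The chain above can be read mechanically from the trace. The following is a
small sketch, not an existing tool: it pulls the function names out of
"Call Trace" lines and skips frames marked with "?", which are unreliable
stack leftovers. The regular expression and the helper name
reliable_frames are assumptions made for illustration; the sample lines
are an excerpt of the rsync trace above.

```python
import re

# Matches frames such as " [<ffffffffc01deaac>] kmem_alloc+0x6c/0xf0 [xfs]".
# Group 1 is set when the frame carries the "?" (unreliable) marker.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\?\s+)?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def reliable_frames(trace_text):
    """Return function names of the reliable frames, innermost first."""
    frames = []
    for line in trace_text.splitlines():
        m = FRAME_RE.search(line)
        if m and not m.group(1):
            frames.append(m.group(2))
    return frames


TRACE = """\
 [<ffffffff817675c9>] schedule+0x29/0x70
 [<ffffffff817668e5>] schedule_timeout+0x165/0x2a0
 [<ffffffff8107a420>] ? ftrace_raw_event_tick_stop+0xc0/0xc0
 [<ffffffff81767c9b>] io_schedule_timeout+0x9b/0xf0
 [<ffffffff81180403>] congestion_wait+0x73/0x100
 [<ffffffffc01deaac>] kmem_alloc+0x6c/0xf0 [xfs]
 [<ffffffffc022399f>] xfs_log_commit_cil+0x34f/0x470 [xfs]
 [<ffffffffc01de37c>] xfs_trans_commit+0x11c/0x230 [xfs]
"""

frames = reliable_frames(TRACE)
print(" <- ".join(frames))
```

Read left to right, the task is asleep in congestion_wait(), called from
kmem_alloc(), called from xfs_log_commit_cil() during the transaction
commit.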

This looks potentially related to the ongoing transaction context memory
allocation discussion, as this code implements a tight retry loop with
time-based task waits and "no fail" allocations. This is also the source
of the "possible memory allocation deadlock" warning.
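
To make the retry behaviour concrete, here is a user-space model of such a
loop. It is a sketch and not the kernel code: the warning interval of 100
retries and the short timed wait between attempts are assumptions based on
my reading of kmem_alloc() in kernels of that era, and alloc_nofail and
make_flaky_allocator are hypothetical names.

```python
import time


def alloc_nofail(try_alloc, warn, warn_every=100, backoff_s=0.0):
    """Retry try_alloc() until it succeeds, warning every warn_every failures.

    The caller never sees a failure, so it stays blocked (together with
    any lock it holds) for as long as the allocator makes no progress.
    """
    retries = 0
    while True:
        ptr = try_alloc()
        if ptr is not None:
            return ptr, retries
        retries += 1
        if retries % warn_every == 0:
            warn("possible memory allocation deadlock (retries=%d)" % retries)
        time.sleep(backoff_s)  # stands in for the kernel's timed congestion wait


def make_flaky_allocator(failures):
    """Hypothetical allocator that fails `failures` times, then succeeds."""
    state = {"left": failures}

    def try_alloc():
        if state["left"] > 0:
            state["left"] -= 1
            return None
        return bytearray(16)

    return try_alloc


warnings = []
buf, retries = alloc_nofail(make_flaky_allocator(250), warnings.append)
print(retries, len(warnings))
```

In this model each logged warning stands for a full batch of failed
attempts, so a single message in dmesg already means the caller has been
retrying for a while.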

Dave might be able to comment a bit further on that. I'm not totally
clear on the mm interaction here and if/what a workaround might be. It
might be a good idea to grab the meminfo data when the stall is actually
in effect.
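
A minimal sketch for grabbing that data, assuming the usual "Key: value kB"
layout of /proc/meminfo. The helper names parse_meminfo and snapshot are
hypothetical, and the sample text is made up for illustration; it is not
data from the affected machine.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: int} (kB for most fields)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Key: value" line
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def snapshot(path="/proc/meminfo"):
    """Read and parse the live file; call this while the stall is ongoing."""
    with open(path) as f:
        return parse_meminfo(f.read())


# Made-up sample values, only to show the shape of the result.
SAMPLE = """\
MemTotal:       65942332 kB
MemFree:          412708 kB
Dirty:              1024 kB
Slab:            9876543 kB
"""

info = parse_meminfo(SAMPLE)
print(info["MemFree"], info["Slab"])
```

Calling snapshot() from a loop that also records a timestamp, and keeping
the output from before, during and after a stall, would give something to
compare.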

Considering this is a large memory box (64g), I wonder if some vm tuning
might help mitigate this behavior..? For example, increase
/proc/sys/vm/min_free_kbytes in hopes of allowing more memory for these
allocations when under pressure, or tune down the
dirty_ratio/dirty_background_ratio thresholds to more aggressively get
data onto disk..?
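
Before changing anything it helps to record the current values. A short
sketch that reads the tunables named above from /proc/sys/vm; the helper
name read_vm_tunables and its base parameter are hypothetical and exist
only so the function can be pointed at a different directory.

```python
import os

TUNABLES = ("min_free_kbytes", "dirty_ratio", "dirty_background_ratio")


def read_vm_tunables(base="/proc/sys/vm", names=TUNABLES):
    """Return {name: int or None} for each tunable found under base."""
    values = {}
    for name in names:
        try:
            with open(os.path.join(base, name)) as f:
                values[name] = int(f.read().split()[0])
        except (OSError, ValueError, IndexError):
            values[name] = None  # missing or unreadable on this system
    return values


for name, value in read_vm_tunables().items():
    print(name, value)
```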

Brian

* Re: XFS hangs with XFS: possible memory allocation deadlock in kmem_alloc
From: Michael Meier @ 2015-03-07 19:14 UTC
  To: xfs

On 03/07/2015 03:07 PM, Brian Foster wrote:
> Thanks for the data. Some notes from the backtraces in the first
> instance:

Thank you for the quick reply.
I'm not sure if the first instance is the most representative: It was
very short - only one message was logged and then everything was fine
again. The later one starting at 00:48 in the logs however was long
enough to make our nagios complain.

> Considering this is a large memory box (64g), I wonder if some vm tuning
> might help mitigate this behavior..? For example, increase
> /proc/sys/vm/min_free_kbytes in hopes of allowing more memory for these
> allocations when under pressure, or tune down the
> dirty_ratio/dirty_background_ratio thresholds to more aggressively get
> data onto disk..?

That idea had occurred to me too, but at least
vm.min_free_kbytes=4000000
vm.vfs_cache_pressure=200
did not prevent the problem from occurring.

Regards,
-- 
Michael Meier, Zentrale Systeme
Friedrich-Alexander-Universitaet Erlangen-Nuernberg
Regionales Rechenzentrum Erlangen
Martensstrasse 1, 91058 Erlangen, Germany
Tel.: +49 9131 85-28973, Fax: +49 9131 302941
michael.meier@fau.de
www.rrze.fau.de

* Re: XFS hangs with XFS: possible memory allocation deadlock in kmem_alloc
From: Dave Chinner @ 2015-03-09 11:52 UTC
  To: Brian Foster; +Cc: Michael Meier, xfs

On Sat, Mar 07, 2015 at 09:07:21AM -0500, Brian Foster wrote:
> On Sat, Mar 07, 2015 at 08:51:50AM +0100, Michael Meier wrote:
> > We've recently upgraded the OS on one of our servers, and since then
> > have been experiencing frequent stalls of the XFS filesystem on it.
> > Other filesystems on the machine seem to still respond fine while XFS
> > hangs. The stalls sometimes last for around 30 minutes, during which all
> > attempts to access that filesystem hang completely - after that, the
> > filesystem suddenly responds instantly again, as if there had never been
> > any problem. The dmesg is full of these messages while it stalls:
> >  XFS: possible memory allocation deadlock in kmem_alloc (mode:0x8250)
> > These also occur from time to time without the filesystem stalling (or
> > at least it's not noticeable) - the messages appear about once in two
> > hours, the stalls about once a day.
> > 
> > Google did point me to some reports of these messages occurring at the
> > end of 2013, but the kernels in question should all have had the fixes
> > proposed back then - although one message back then suggested there were
> > more places where this problem could occur that were not fixed yet.
> > 
> > Kernels used were:
> > - Ubuntu 3.13.0-44  - shows stalls, according to
> > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1382333 has the fix
> > - Ubuntu 3.16.0-31  - shows stalls
> > - Ubuntu 3.2.0-various - no stalls in more than 1 year
> > We can actually still boot the machine with the 3.2.0 kernel, and it
> > will run absolutely fine, but as that kernel will not be supported
> > forever, I do not consider that a permanent solution.
> > 
> > The machine should not be low on memory, the disk array far from its
> > limits, and the I/O-load is mostly reads with very little writes, as
> > this is a public FTP server.
> > 
> > I have tried to collect some information, available at
> > https://grid.rrze.uni-erlangen.de/~unrz191/syslog-with-xfs-hangs.log
> > 
> 
> Thanks for the data. Some notes from the backtraces in the first
> instance:
> 
> - xfsaild is down in xlog_cil_force_lsn()->flush_work(). So it's trying
>   to push the log, but the workqueue worker is already running.
> - The workqueue worker is here:
> 
> 	[298163.482697] Workqueue: xfs-cil/dm-0 xlog_cil_push_work [xfs]
> 
> ... and it appears to be blocked on the ctx lock. This means either a
> transaction is completing or somebody else is pushing the cil.
> - Writeback and one or two other transactions are backed up waiting on
>   the ctx lock.
> - rsync is running a transaction completion (e.g., holding ctx lock) and
>   blocked on memory allocation:

Yup, that's pretty much it. I suspect that we can do better here; I
think we might be able to hoist the item formatting and memory
allocation outside the ctx lock - I'll need to do a little more than
have a quick browse of the code to determine if it's safe as we are
replacing log vectors when we are doing the allocation.
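
The general shape of that idea, as a user-space sketch only. This is not
XFS code and it ignores the log vector replacement caveat above, which is
what the real change would have to get right. ctx_lock, committed and the
function names are all hypothetical stand-ins.

```python
import threading

ctx_lock = threading.Lock()  # stands in for the CIL context lock
committed = []               # stands in for the list of committed items
held_during_alloc = []       # records whether the lock was held at alloc time


def probing_alloc(n):
    """Allocate n bytes and note whether ctx_lock is held right now."""
    held_during_alloc.append(ctx_lock.locked())
    return bytearray(n)


def commit_alloc_under_lock(item, alloc):
    """Current shape: a slow allocation stalls every other lock user."""
    with ctx_lock:
        buf = alloc(len(item))
        buf[:] = item
        committed.append(buf)


def commit_alloc_hoisted(item, alloc):
    """Hoisted shape: allocate and format first, then lock only to insert."""
    buf = alloc(len(item))  # may block, but no lock is held yet
    buf[:] = item
    with ctx_lock:
        committed.append(buf)


commit_alloc_under_lock(b"rename", probing_alloc)
commit_alloc_hoisted(b"rename", probing_alloc)
print(held_during_alloc, len(committed))
```

With the hoisted variant, a task stuck retrying its allocation no longer
holds the lock that the CIL push worker and other committers are waiting
for.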

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
