* XFS status update for May 2012
@ 2012-06-18 12:08 Christoph Hellwig
  2012-06-18 18:25 ` Andreas Dilger
  0 siblings, 1 reply; 9+ messages in thread
From: Christoph Hellwig @ 2012-06-18 12:08 UTC (permalink / raw)
  To: xfs; +Cc: linux-fsdevel, linux-kernel

May saw the release of Linux 3.4, including a decent sized XFS update.
Remarkable XFS features in Linux 3.4 include moving over all metadata
updates to use transactions, the addition of a work queue for the
low-level allocator code to avoid stack overflows due to extreme stack
use in the Linux VM/VFS call chain, better xattr operation tracing,
fixes for a long-standing but hard to hit deadlock when using the XFS
real time subvolume, and big improvements in disk quota scalability.

The final diffstat for XFS in Linux 3.4 is:

 61 files changed, 1692 insertions(+), 2356 deletions(-)

In the meantime the merge window for Linux 3.5 opened, and another large
update has been merged into Linus' tree.  Interesting changes in Linux
3.5-rc1 include improved error handling on buffer write failures,
a drastic reduction of locking overhead when doing high-IOPS direct I/O,
removal of the old xfsbufd daemon in favor of writing most run-time
metadata from the xfsaild daemon, deferral of CIL pushes to decouple
user space metadata I/O from log writeback, and last but not least the
addition of the SEEK_DATA/SEEK_HOLE lseek arguments that allow user space
programs to deal with sparse files efficiently.  Traffic on the mailing
list was a bit quiet in May, mostly focusing on a wide range of bug
fixes, with few new features.
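
For those who have not played with the new interface yet, here is a
minimal sketch of walking the data extents of a sparse file from user
space (this assumes glibc already exposes SEEK_DATA/SEEK_HOLE under
_GNU_SOURCE; if it does not, the values are 3 and 4):

#define _GNU_SOURCE
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	off_t	data = 0, hole;
	int	fd;

	if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0)
		return 1;
	/* SEEK_DATA seeks to the next data region at or after the
	   offset, SEEK_HOLE to the hole that ends it */
	while ((data = lseek(fd, data, SEEK_DATA)) >= 0) {
		hole = lseek(fd, data, SEEK_HOLE);
		printf("data: %lld..%lld\n",
		       (long long)data, (long long)hole - 1);
		data = hole;
	}
	close(fd);
	return 0;
}

lseek() returns -1 with ENXIO once the offset is at or past EOF, which
is what terminates the loop.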

On the user space side, xfs_repair saw a few bug fixes posted to the
list that haven't made it into the repository yet, while xfstests saw
its usual stream of minor bug fixes.


* Re: XFS status update for May 2012
  2012-06-18 12:08 XFS status update for May 2012 Christoph Hellwig
@ 2012-06-18 18:25 ` Andreas Dilger
  2012-06-18 18:43   ` Ben Myers
                     ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Andreas Dilger @ 2012-06-18 18:25 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs, linux-fsdevel@vger.kernel.org

On 2012-06-18, at 6:08 AM, Christoph Hellwig wrote:
> May saw the release of Linux 3.4, including a decent sized XFS update.
> Remarkable XFS features in Linux 3.4 include moving over all metadata
> updates to use transactions, the addition of a work queue for the
> low-level allocator code to avoid stack overflows due to extreme stack
> use in the Linux VM/VFS call chain,

This is essentially a workaround for too-small stacks in the kernel,
which we've had to do at times as well, by doing work in a separate
thread (with a new stack) and waiting for the results?  This is a
generic problem that any reasonably-complex filesystem will have when
running under memory pressure on a complex storage stack (e.g. LVM +
iSCSI), but causes unnecessary context switching.
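
For concreteness, the pattern I mean is roughly the following.  This
is a minimal sketch of the workqueue variant, not the actual XFS code,
and the function and workqueue names are made up:

#include <linux/kernel.h>
#include <linux/workqueue.h>
#include <linux/completion.h>

/* assumed to be created elsewhere with alloc_workqueue() and
 * WQ_MEM_RECLAIM so it can make progress under memory pressure */
extern struct workqueue_struct *alloc_wq;

struct alloc_ctx {
	struct work_struct	work;
	struct completion	done;
	int			result;
	/* plus whatever in/out arguments the real allocation needs */
};

int do_deep_allocation(struct alloc_ctx *ctx);	/* hypothetical deep-stack work */

static void deep_alloc_worker(struct work_struct *work)
{
	struct alloc_ctx *ctx = container_of(work, struct alloc_ctx, work);

	/* this runs on the kworker's fresh stack, not the caller's */
	ctx->result = do_deep_allocation(ctx);
	complete(&ctx->done);
}

int deep_alloc_on_fresh_stack(struct alloc_ctx *ctx)
{
	INIT_WORK_ONSTACK(&ctx->work, deep_alloc_worker);
	init_completion(&ctx->done);
	queue_work(alloc_wq, &ctx->work);
	/* the caller sleeps here with its deep stack parked */
	wait_for_completion(&ctx->done);
	return ctx->result;
}

Every call pays two context switches, which is exactly the overhead
I am concerned about.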

Any thoughts on a better way to handle this, or will there continue
to be a 4kB stack limit and hack around this with repeated kmalloc
on callpaths for any struct over a few tens of bytes, implementing
memory pools all over the place, and "forking" over to other threads
to continue the stack consumption for another 4kB to work around
the small stack limit?

Cheers, Andreas

* Re: XFS status update for May 2012
  2012-06-18 18:25 ` Andreas Dilger
@ 2012-06-18 18:43   ` Ben Myers
  2012-06-18 20:36     ` Andreas Dilger
  2012-06-18 21:11   ` Eric Sandeen
  2012-06-19  1:11   ` Dave Chinner
  2 siblings, 1 reply; 9+ messages in thread
From: Ben Myers @ 2012-06-18 18:43 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Christoph Hellwig, linux-fsdevel@vger.kernel.org, xfs

Hey Andreas,

On Mon, Jun 18, 2012 at 12:25:37PM -0600, Andreas Dilger wrote:
> On 2012-06-18, at 6:08 AM, Christoph Hellwig wrote:
> > May saw the release of Linux 3.4, including a decent sized XFS update.
> > Remarkable XFS features in Linux 3.4 include moving over all metadata
> > updates to use transactions, the addition of a work queue for the
> > low-level allocator code to avoid stack overflows due to extreme stack
> > use in the Linux VM/VFS call chain,
> 
> This is essentially a workaround for too-small stacks in the kernel,
> which we've had to do at times as well, by doing work in a separate
> thread (with a new stack) and waiting for the results?  This is a
> generic problem that any reasonably-complex filesystem will have when
> running under memory pressure on a complex storage stack (e.g. LVM +
> iSCSI), but causes unnecessary context switching.
>
> Any thoughts on a better way to handle this, or will there continue
> to be a 4kB stack limit and hack around this with repeated kmalloc
> on callpaths for any struct over a few tens of bytes, implementing
> memory pools all over the place, and "forking" over to other threads
> to continue the stack consumption for another 4kB to work around
> the small stack limit?

FWIW, I think your characterization of the problem as a 'workaround for
too-small stacks in the kernel' is about right.  I don't think any of the XFS
folk were very happy about having to do this, but in the near term it doesn't
seem that we have a good alternative.  I'm glad to see that there are others
with the same pain, so maybe we can build some support for upping the stack
limit.

Regards,
	Ben



* Re: XFS status update for May 2012
  2012-06-18 18:43   ` Ben Myers
@ 2012-06-18 20:36     ` Andreas Dilger
  2012-06-19  1:20       ` Dave Chinner
  0 siblings, 1 reply; 9+ messages in thread
From: Andreas Dilger @ 2012-06-18 20:36 UTC (permalink / raw)
  To: Ben Myers
  Cc: Christoph Hellwig, linux-fsdevel@vger.kernel.org, xfs, LKML

On 2012-06-18, at 12:43 PM, Ben Myers wrote:
> On Mon, Jun 18, 2012 at 12:25:37PM -0600, Andreas Dilger wrote:
>> On 2012-06-18, at 6:08 AM, Christoph Hellwig wrote:
>>> May saw the release of Linux 3.4, including a decent sized XFS update.
>>> Remarkable XFS features in Linux 3.4 include moving over all metadata
>>> updates to use transactions, the addition of a work queue for the
>>> low-level allocator code to avoid stack overflows due to extreme stack
>>> use in the Linux VM/VFS call chain,
>> 
>> This is essentially a workaround for too-small stacks in the kernel,
>> which we've had to do at times as well, by doing work in a separate
>> thread (with a new stack) and waiting for the results?  This is a
>> generic problem that any reasonably-complex filesystem will have when
>> running under memory pressure on a complex storage stack (e.g. LVM +
>> iSCSI), but causes unnecessary context switching.
>> 
>> Any thoughts on a better way to handle this, or will there continue
>> to be a 4kB stack limit and hack around this with repeated kmalloc
>> on callpaths for any struct over a few tens of bytes, implementing
>> memory pools all over the place, and "forking" over to other threads
>> to continue the stack consumption for another 4kB to work around
>> the small stack limit?
> 
> FWIW, I think your characterization of the problem as a 'workaround for
> too-small stacks in the kernel' is about right.  I don't think any of
> the XFS folk were very happy about having to do this, but in the near
> term it doesn't seem that we have a good alternative.  I'm glad to see
> that there are others with the same pain, so maybe we can build some
> support for upping the stack limit.

Is this problem mostly hit in XFS with dedicated service threads like
kNFSd and similar, or is it a problem with any user thread perhaps
entering the filesystem for memory reclaim inside an already-deep
stack?

For dedicated service threads I was wondering about allocating larger
stacks for just those processes (16kB would be safe), and then doing
something special at thread startup to use this larger stack.  If
the problem is for any potential thread, then the solution would be
much more complex in all likelihood.

Cheers, Andreas


* Re: XFS status update for May 2012
  2012-06-18 18:25 ` Andreas Dilger
  2012-06-18 18:43   ` Ben Myers
@ 2012-06-18 21:11   ` Eric Sandeen
  2012-06-18 21:16     ` Eric Sandeen
  2012-06-19  1:27     ` Dave Chinner
  2012-06-19  1:11   ` Dave Chinner
  2 siblings, 2 replies; 9+ messages in thread
From: Eric Sandeen @ 2012-06-18 21:11 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Christoph Hellwig, linux-fsdevel@vger.kernel.org, xfs

On 6/18/12 1:25 PM, Andreas Dilger wrote:
> On 2012-06-18, at 6:08 AM, Christoph Hellwig wrote:
>> May saw the release of Linux 3.4, including a decent sized XFS update.
>> Remarkable XFS features in Linux 3.4 include moving over all metadata
>> updates to use transactions, the addition of a work queue for the
>> low-level allocator code to avoid stack overflows due to extreme stack
>> use in the Linux VM/VFS call chain,
> 
> This is essentially a workaround for too-small stacks in the kernel,
> which we've had to do at times as well, by doing work in a separate
> thread (with a new stack) and waiting for the results?  This is a
> generic problem that any reasonably-complex filesystem will have when
> running under memory pressure on a complex storage stack (e.g. LVM +
> iSCSI), but causes unnecessary context switching.
> 
> Any thoughts on a better way to handle this, or will there continue
> to be a 4kB stack limit and hack around this with repeated kmalloc

well, 8k on x86_64 (not 4k) right?   But still...

Maybe it's still a partial hack but it's more generic - should we have
IRQ stacks like x86 has?  (I think I'm right that that only exists
on x86 / 32-bit) - is there any downside to that?

We could still get into trouble I'm sure but usually we seem to see
these stack overflows when we take an IRQ while already deep-ish
in the stack.

-Eric

> on callpaths for any struct over a few tens of bytes, implementing
> memory pools all over the place, and "forking" over to other threads
> to continue the stack consumption for another 4kB to work around
> the small stack limit?
> 
> Cheers, Andreas

* Re: XFS status update for May 2012
  2012-06-18 21:11   ` Eric Sandeen
@ 2012-06-18 21:16     ` Eric Sandeen
  2012-06-19  1:27     ` Dave Chinner
  1 sibling, 0 replies; 9+ messages in thread
From: Eric Sandeen @ 2012-06-18 21:16 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Christoph Hellwig, linux-fsdevel@vger.kernel.org, xfs

On 6/18/12 4:11 PM, Eric Sandeen wrote:
> On 6/18/12 1:25 PM, Andreas Dilger wrote:
>> On 2012-06-18, at 6:08 AM, Christoph Hellwig wrote:
>>> May saw the release of Linux 3.4, including a decent sized XFS update.
>>> Remarkable XFS features in Linux 3.4 include moving over all metadata
>>> updates to use transactions, the addition of a work queue for the
>>> low-level allocator code to avoid stack overflows due to extreme stack
>>> use in the Linux VM/VFS call chain,
>>
>> This is essentially a workaround for too-small stacks in the kernel,
>> which we've had to do at times as well, by doing work in a separate
>> thread (with a new stack) and waiting for the results?  This is a
>> generic problem that any reasonably-complex filesystem will have when
>> running under memory pressure on a complex storage stack (e.g. LVM +
>> iSCSI), but causes unnecessary context switching.
>>
>> Any thoughts on a better way to handle this, or will there continue
>> to be a 4kB stack limit and hack around this with repeated kmalloc
> 
> well, 8k on x86_64 (not 4k) right?   But still...
> 
> Maybe it's still a partial hack but it's more generic - should we have
> IRQ stacks like x86 has?  (I think I'm right that that only exists
> on x86 / 32-bit) - is there any downside to that?

Maybe I'm wrong about that, and we already have IRQ stacks on x86_64 -
at least based on the kernel documentation?

-Eric


* Re: XFS status update for May 2012
  2012-06-18 18:25 ` Andreas Dilger
  2012-06-18 18:43   ` Ben Myers
  2012-06-18 21:11   ` Eric Sandeen
@ 2012-06-19  1:11   ` Dave Chinner
  2 siblings, 0 replies; 9+ messages in thread
From: Dave Chinner @ 2012-06-19  1:11 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Christoph Hellwig, xfs, linux-fsdevel@vger.kernel.org

On Mon, Jun 18, 2012 at 12:25:37PM -0600, Andreas Dilger wrote:
> On 2012-06-18, at 6:08 AM, Christoph Hellwig wrote:
> > May saw the release of Linux 3.4, including a decent sized XFS update.
> > Remarkable XFS features in Linux 3.4 include moving over all metadata
> > updates to use transactions, the addition of a work queue for the
> > low-level allocator code to avoid stack overflows due to extreme stack
> > use in the Linux VM/VFS call chain,
> 
> This is essentially a workaround for too-small stacks in the kernel,
> which we've had to do at times as well, by doing work in a separate
> thread (with a new stack) and waiting for the results?  This is a
> generic problem that any reasonably-complex filesystem will have when
> running under memory pressure on a complex storage stack (e.g. LVM +
> iSCSI), but causes unnecessary context switching.

I've seen no performance issues from the context switching.  The
overhead is so small as to be unmeasurable in most cases, because a
typical allocation already requires context switches for contended
locks and metadata IO....

> Any thoughts on a better way to handle this, or will there continue
> to be a 4kB stack limit

We were blowing 8k stacks on x86-64 with alarming ease. Even the
flusher threads were overflowing.
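
(For anyone who wants to watch this on their own machine, the ftrace
stack tracer is the easy way to catch worst-case stack usage.  A
sketch, assuming CONFIG_STACK_TRACER is enabled:

  echo 1 > /proc/sys/kernel/stack_tracer_enabled
  cat /sys/kernel/debug/tracing/stack_max_size	# deepest stack seen, in bytes
  cat /sys/kernel/debug/tracing/stack_trace	# the call chain that hit it

with the workload run in between.)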

> and hack around this with repeated kmalloc
> on callpaths for any struct over a few tens of bytes, implementing
> memory pools all over the place, and "forking" over to other threads
> to continue the stack consumption for another 4kB to work around
> the small stack limit?

I mentioned that we needed to consider 16k stacks at last year's
Kernel Summit and the response was along the lines of "you've got to
be kidding - fix your broken filesystem".  That's the perception you
have to change, and I don't feel like having a 4k stacks battle
again...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: XFS status update for May 2012
  2012-06-18 20:36     ` Andreas Dilger
@ 2012-06-19  1:20       ` Dave Chinner
  0 siblings, 0 replies; 9+ messages in thread
From: Dave Chinner @ 2012-06-19  1:20 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Ben Myers, Christoph Hellwig, linux-fsdevel@vger.kernel.org,
	xfs, LKML

On Mon, Jun 18, 2012 at 02:36:27PM -0600, Andreas Dilger wrote:
> On 2012-06-18, at 12:43 PM, Ben Myers wrote:
> > On Mon, Jun 18, 2012 at 12:25:37PM -0600, Andreas Dilger wrote:
> >> On 2012-06-18, at 6:08 AM, Christoph Hellwig wrote:
> >>> May saw the release of Linux 3.4, including a decent sized XFS update.
> >>> Remarkable XFS features in Linux 3.4 include moving over all metadata
> >>> updates to use transactions, the addition of a work queue for the
> >>> low-level allocator code to avoid stack overflows due to extreme stack
> >>> use in the Linux VM/VFS call chain,
> >> 
> >> This is essentially a workaround for too-small stacks in the kernel,
> >> which we've had to do at times as well, by doing work in a separate
> >> thread (with a new stack) and waiting for the results?  This is a
> >> generic problem that any reasonably-complex filesystem will have when
> >> running under memory pressure on a complex storage stack (e.g. LVM +
> >> iSCSI), but causes unnecessary context switching.
> >> 
> >> Any thoughts on a better way to handle this, or will there continue
> >> to be a 4kB stack limit and hack around this with repeated kmalloc
> >> on callpaths for any struct over a few tens of bytes, implementing
> >> memory pools all over the place, and "forking" over to other threads
> >> to continue the stack consumption for another 4kB to work around
> >> the small stack limit?
> > 
> > FWIW, I think your characterization of the problem as a 'workaround for
> > too-small stacks in the kernel' is about right.  I don't think any of
> > the XFS folk were very happy about having to do this, but in the near
> > term it doesn't seem that we have a good alternative.  I'm glad to see
> > that there are others with the same pain, so maybe we can build some
> > support for upping the stack limit.
> 
> Is this problem mostly hit in XFS with dedicated service threads like
> kNFSd and similar, or is it a problem with any user thread perhaps
> entering the filesystem for memory reclaim inside an already-deep
> stack?

When you have the flusher thread using 2-2.5k of stack before it
enters the filesystem, DM and MD below the filesystem using 1-1.5k
of stack, and the scsi driver doing a mempool allocation taking 3k
of stack, there's basically nothing left for the filesystem.
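
Do the sums: on an 8k stack that is 8192 - 2560 - 1536 - 3072, i.e.
roughly 1k left for the filesystem, while the worst case XFS usage I
mention below needs about 4k on its own.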

We took this action because the flusher thread (i.e. the thread with
the lowest top level stack usage) was blowing the stack during
delayed allocation.

> For dedicated service threads I was wondering about allocating larger
> stacks for just those processes (16kB would be safe), and then doing
> something special at thread startup to use this larger stack.  If
> the problem is for any potential thread, then the solution would be
> much more complex in all likelihood.

Anything that does a filemap_fdatawrite() call is susceptible to a
stack overrun.  I haven't seen an O_SYNC write(2) call overrun the
stack yet, but it was only a matter of time.  I certainly have seen
the same write call from an NFSD overrun the stack.  It's lucky we
have the IO-less throttling now, otherwise any thread that enters
balance_dirty_pages() would be a candidate for a stack overrun....

IOWs, the only solution that would fix the problem was to split
allocations onto a different stack, so that we have the approximately
4k of stack space needed for the worst case XFS stack usage (a double
btree split requiring metadata IO) and still have enough space left
for the DM/MD/SCSI stack underneath it...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: XFS status update for May 2012
  2012-06-18 21:11   ` Eric Sandeen
  2012-06-18 21:16     ` Eric Sandeen
@ 2012-06-19  1:27     ` Dave Chinner
  1 sibling, 0 replies; 9+ messages in thread
From: Dave Chinner @ 2012-06-19  1:27 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Andreas Dilger, Christoph Hellwig,
	linux-fsdevel@vger.kernel.org, xfs

On Mon, Jun 18, 2012 at 04:11:52PM -0500, Eric Sandeen wrote:
> On 6/18/12 1:25 PM, Andreas Dilger wrote:
> > On 2012-06-18, at 6:08 AM, Christoph Hellwig wrote:
> >> May saw the release of Linux 3.4, including a decent sized XFS update.
> >> Remarkable XFS features in Linux 3.4 include moving over all metadata
> >> updates to use transactions, the addition of a work queue for the
> >> low-level allocator code to avoid stack overflows due to extreme stack
> >> use in the Linux VM/VFS call chain,
> > 
> > This is essentially a workaround for too-small stacks in the kernel,
> > which we've had to do at times as well, by doing work in a separate
> > thread (with a new stack) and waiting for the results?  This is a
> > generic problem that any reasonably-complex filesystem will have when
> > running under memory pressure on a complex storage stack (e.g. LVM +
> > iSCSI), but causes unnecessary context switching.
> > 
> > Any thoughts on a better way to handle this, or will there continue
> > to be a 4kB stack limit and hack around this with repeated kmalloc
> 
> well, 8k on x86_64 (not 4k) right?   But still...
> 
> Maybe it's still a partial hack but it's more generic - should we have
> IRQ stacks like x86 has?  (I think I'm right that that only exists
> on x86 / 32-bit) - is there any downside to that?

We already have IRQ stacks for x86-64 - the stack unwinder knows
about them, so when you get a stack trace from the interrupt stack it
walks back across to the thread stack at the appropriate point...

See dump_trace() in arch/x86/kernel/dumpstack_64.c

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

