Stalls during writeback for mmaped I/O on XFS in 3.0

linux-fsdevel.vger.kernel.org archive mirror

* Stalls during writeback for mmaped I/O on XFS in 3.0
@ 2011-09-15 14:47 Shawn Bohrer
  2011-09-15 14:55 ` Christoph Hellwig
  0 siblings, 1 reply; 7+ messages in thread
From: Shawn Bohrer @ 2011-09-15 14:47 UTC
  To: Darrick J. Wong; +Cc: xfs, linux-fsdevel, linux-kernel

I've got a workload that is latency sensitive that writes data to a
memory mapped file on XFS.  With the 3.0 kernel I'm seeing stalls of
up to 100ms that occur during writeback that we did not see with older
kernels.  I've traced the stalls and it looks like they are blocking
on wait_on_page_writeback() introduced in
d76ee18a8551e33ad7dbd55cac38bc7b094f3abb "fs: block_page_mkwrite
should wait for writeback to finish"
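
For reference, a minimal sketch of how stalls like these could be
observed from user space.  It maps a file with MAP_SHARED, stores one
byte per page, and records the slowest single store.  The function
names and the path passed in are illustrative assumptions, not the
actual workload.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Monotonic clock reading in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/*
 * Map `len` bytes of `path` (created or resized as needed), store one
 * byte per page for `rounds` passes, and report the slowest store.
 * Returns 0 on success, -1 if the file could not be mapped.
 */
static int worst_store_latency_ns(const char *path, size_t len, int rounds,
                                  uint64_t *worst)
{
    assert(path != NULL && worst != NULL && len > 0 && rounds > 0);

    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;

    void *p = MAP_FAILED;
    if (ftruncate(fd, (off_t)len) == 0)
        p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    if (p == MAP_FAILED)
        return -1;

    volatile unsigned char *map = p;
    *worst = 0;
    for (int r = 0; r < rounds; r++) {
        for (size_t off = 0; off < len; off += page) {
            uint64_t t0 = now_ns();
            /* The first store to a clean page takes a write fault; that
             * fault is where a wait for writeback would show up. */
            map[off] = (unsigned char)r;
            uint64_t dt = now_ns() - t0;
            if (dt > *worst)
                *worst = dt;
        }
    }
    munmap(p, len);
    return 0;
}
```

Run over a long period while background writeback is active, the
worst-case figure is the interesting one; the average hides the stalls.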

Reading the commit description doesn't really explain to me why this
change was needed.  Can someone explain what "This is needed to
stabilize pages during writeback for those two filesystems." means in
the context of that commit?  Is this a problem for older kernels as
well?  Should this have been backported to the stable kernels? What
are the downsides of reverting this commit?

Assuming this change is required, are there any alternative solutions
to avoid these stalls with mmaped I/O on XFS?

Thanks,
Shawn

---------------------------------------------------------------
This email, along with any attachments, is confidential. If you 
believe you received this message in error, please contact the 
sender immediately and delete all copies of the message.  
Thank you.

* Re: Stalls during writeback for mmaped I/O on XFS in 3.0
  2011-09-15 14:47 Stalls during writeback for mmaped I/O on XFS in 3.0 Shawn Bohrer
@ 2011-09-15 14:55 ` Christoph Hellwig
  2011-09-15 15:47   ` Shawn Bohrer
  0 siblings, 1 reply; 7+ messages in thread
From: Christoph Hellwig @ 2011-09-15 14:55 UTC
  To: Shawn Bohrer; +Cc: Darrick J. Wong, linux-fsdevel, linux-kernel, xfs

On Thu, Sep 15, 2011 at 09:47:55AM -0500, Shawn Bohrer wrote:
> I've got a workload that is latency sensitive that writes data to a
> memory mapped file on XFS.  With the 3.0 kernel I'm seeing stalls of
> up to 100ms that occur during writeback that we did not see with older
> kernels.  I've traced the stalls and it looks like they are blocking
> on wait_on_page_writeback() introduced in
> d76ee18a8551e33ad7dbd55cac38bc7b094f3abb "fs: block_page_mkwrite
> should wait for writeback to finish"
> 
> Reading the commit description doesn't really explain to me why this
> change was needed.

It is there to avoid pages being modified while they are under
writeback, which defeats various checksumming schemes like DIF/DIX, the
iSCSI CRCs, or even just the RAID parity calculations.  All of these
either failed before, or had to work around it by copying all data
that was written.

If you don't use any of these you can remove the call and things
will work like they did before.

* Re: Stalls during writeback for mmaped I/O on XFS in 3.0
  2011-09-15 14:55 ` Christoph Hellwig
@ 2011-09-15 15:47   ` Shawn Bohrer
  2011-09-16  0:25     ` Darrick J. Wong
  0 siblings, 1 reply; 7+ messages in thread
From: Shawn Bohrer @ 2011-09-15 15:47 UTC
  To: Christoph Hellwig; +Cc: Darrick J. Wong, linux-fsdevel, linux-kernel, xfs

Thanks Christoph,

On Thu, Sep 15, 2011 at 10:55:57AM -0400, Christoph Hellwig wrote:
> On Thu, Sep 15, 2011 at 09:47:55AM -0500, Shawn Bohrer wrote:
> > I've got a workload that is latency sensitive that writes data to a
> > memory mapped file on XFS.  With the 3.0 kernel I'm seeing stalls of
> > up to 100ms that occur during writeback that we did not see with older
> > kernels.  I've traced the stalls and it looks like they are blocking
> > on wait_on_page_writeback() introduced in
> > d76ee18a8551e33ad7dbd55cac38bc7b094f3abb "fs: block_page_mkwrite
> > should wait for writeback to finish"
> > 
> > Reading the commit description doesn't really explain to me why this
> > change was needed.
> 
> It is there to avoid pages being modified while they are under
> writeback, which defeats various checksumming schemes like DIF/DIX, the
> iSCSI CRCs, or even just the RAID parity calculations.  All of these
> either failed before, or had to work around it by copying all data
> that was written.

I'm assuming you mean software RAID here?  We do have a hardware RAID
controller.  Also for anything that was working around this issue
before by copying the data, are those workarounds still in place?

> If you don't use any of these you can remove the call and things
> will work like they did before.

I may do this for now.

In the longer term is there any chance this could be made better?  I'm
not an expert here so my suggestions may be naive.  Could a mechanism
be made to check if the page needs to be checksummed and only block in
that case?  Or perhaps some mount option, madvise() flag or other hint
from user-mode to disable this, or hint that I'm going to be touching
that page again soon?

Thanks,
Shawn

* Re: Stalls during writeback for mmaped I/O on XFS in 3.0
  2011-09-15 15:47   ` Shawn Bohrer
@ 2011-09-16  0:25     ` Darrick J. Wong
  2011-09-16 16:32       ` Shawn Bohrer
  0 siblings, 1 reply; 7+ messages in thread
From: Darrick J. Wong @ 2011-09-16  0:25 UTC
  To: Shawn Bohrer; +Cc: Christoph Hellwig, linux-fsdevel, linux-kernel, xfs

On Thu, Sep 15, 2011 at 10:47:48AM -0500, Shawn Bohrer wrote:
> Thanks Christoph,
> 
> On Thu, Sep 15, 2011 at 10:55:57AM -0400, Christoph Hellwig wrote:
> > On Thu, Sep 15, 2011 at 09:47:55AM -0500, Shawn Bohrer wrote:
> > > I've got a workload that is latency sensitive that writes data to a
> > > memory mapped file on XFS.  With the 3.0 kernel I'm seeing stalls of
> > > up to 100ms that occur during writeback that we did not see with older
> > > kernels.  I've traced the stalls and it looks like they are blocking
> > > on wait_on_page_writeback() introduced in
> > > d76ee18a8551e33ad7dbd55cac38bc7b094f3abb "fs: block_page_mkwrite
> > > should wait for writeback to finish"
> > > 
> > > Reading the commit description doesn't really explain to me why this
> > > change was needed.
> > 
> > It is there to avoid pages being modified while they are under
> > writeback, which defeats various checksumming schemes like DIF/DIX, the
> > iSCSI CRCs, or even just the RAID parity calculations.  All of these
> > either failed before, or had to work around it by copying all data
> > that was written.
> 
> I'm assuming you mean software RAID here?  We do have a hardware RAID

Yes.

> controller.  Also for anything that was working around this issue
> before by copying the data, are those workarounds still in place?

I suspect iscsi and md-raid5 are still making shadow copies of data blocks
before writing them out.  However, there was no previous workaround for DIF/DIX
errors -- this ("*_page_mkwrite should wait...") patch series _is_ the fix for
DIF/DIX.  I recall that we rejected the shadow buffer approach for DIF/DIX
because allocating new pages is expensive if we do it for each disk write in
anticipation of future page writes...

> > If you don't use any of these you can remove the call and things
> > will work like they did before.
> 
> I may do this for now.
> 
> In the longer term is there any chance this could be made better?  I'm
> not an expert here so my suggestions may be naive.  Could a mechanism
> be made to check if the page needs to be checksummed and only block in

...however, one could replace that wait_on_page_writeback with some sort of
call that would duplicate the page, update each program's page table to point
to the new page, and then somehow reap the page that's under IO when the IO
completes.  That might also be complicated to implement, I don't know.  If
there aren't any free pages, then this scheme (and the one I mentioned in the
previous paragraph) will block a thread while the system tries to reclaim some
pages.

I think we also talked about a block device flag to signal that the device
requires stable page writes, which would let us turn off the waits on devices
that don't care.  That at least could defer this discussion until you encounter
one of these devices that wants stable page writes.
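
As a rough sketch of that flag idea (plain user-space C, not kernel
code): the write-fault path would wait only when the backing device
says it needs stable pages.  The struct and function names here are
hypothetical stand-ins, not existing kernel interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-device capability; not an existing kernel structure. */
struct toy_bdev {
    bool requires_stable_pages;   /* e.g. DIF/DIX, iSCSI digests, md parity */
};

/* Hypothetical stand-in for the page state the fault handler looks at. */
struct toy_page {
    bool under_writeback;
};

/*
 * Decide whether a write fault on `page` has to wait for writeback.
 * Devices that do not ask for stable pages never make the writer wait.
 */
static bool must_wait_for_writeback(const struct toy_bdev *dev,
                                    const struct toy_page *page)
{
    assert(dev != NULL && page != NULL);
    return dev->requires_stable_pages && page->under_writeback;
}
```

With this shape, a plain disk behind a hardware RAID controller would
leave the flag clear and keep the pre-3.0 behaviour.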

I'm curious, is this program writing to the mmap region while another program
is trying to fsync/fdatasync/sync dirty pages to disk?  Is that how you noticed
the jittery latency?  We'd figured that not many programs would notice the
latency unless there was something that was causing a lot of dirty page writes
concurrent to something else dirtying a lot of pages.  Clearly we failed in
your case.  Sorry. :/

That said, imagine if we revert to the pre-3.0 mechanism (or add that flag): if
we start transferring page A to the disk for writing and your program comes in
and changes A to A' before that transfer completes, then the disk will see a
data blob that is partly A and partly A', and the proportions of A/A' are
ill-defined.  I agree that ~100ms latency is not good, however. :(
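
To make the A/A' problem concrete, here is a small user-space
simulation.  A checksum is computed when the write is submitted, the
page is optionally modified, and only then is the data "transferred".
The toy checksum merely stands in for a DIF/DIX guard tag or an iSCSI
CRC; all names are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_BYTES 4096

/* Toy polynomial checksum; a stand-in for a real integrity tag. */
static uint32_t toy_checksum(const unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum = sum * 31u + buf[i];
    return sum;
}

/*
 * Simulate one write of `page`.  The checksum is taken at submit time
 * and the contents are copied out afterwards.  If `modify_in_flight`
 * is nonzero, one byte changes between those two steps, as an mmap
 * writer could do when nothing makes it wait for writeback.
 * Returns 1 if the receiving side's checksum matches, 0 otherwise.
 */
static int write_verifies(unsigned char *page, int modify_in_flight)
{
    unsigned char on_wire[PAGE_BYTES];
    uint32_t submitted = toy_checksum(page, PAGE_BYTES);

    if (modify_in_flight)
        page[100] ^= 0xff;          /* A becomes A' mid-transfer */

    memcpy(on_wire, page, PAGE_BYTES);
    return toy_checksum(on_wire, PAGE_BYTES) == submitted;
}
```

Without the in-flight change the check passes; with it the device would
see data that no longer matches the tag it was handed.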

What are your program's mmap write latency requirements?

> that case?  Or perhaps some mount option, madvise() flag or other hint
> from user-mode to disable this, or hint that I'm going to be touching
> that page again soon?

--D

* Re: Stalls during writeback for mmaped I/O on XFS in 3.0
  2011-09-16  0:25     ` Darrick J. Wong
@ 2011-09-16 16:32       ` Shawn Bohrer
  2011-09-20 16:30         ` Christoph Hellwig
  0 siblings, 1 reply; 7+ messages in thread
From: Shawn Bohrer @ 2011-09-16 16:32 UTC
  To: Darrick J. Wong; +Cc: Christoph Hellwig, linux-fsdevel, linux-kernel, xfs

On Thu, Sep 15, 2011 at 05:25:57PM -0700, Darrick J. Wong wrote:
> On Thu, Sep 15, 2011 at 10:47:48AM -0500, Shawn Bohrer wrote:
> > Thanks Christoph,
> > 
> > On Thu, Sep 15, 2011 at 10:55:57AM -0400, Christoph Hellwig wrote:
> > > On Thu, Sep 15, 2011 at 09:47:55AM -0500, Shawn Bohrer wrote:
> > > > I've got a workload that is latency sensitive that writes data to a
> > > > memory mapped file on XFS.  With the 3.0 kernel I'm seeing stalls of
> > > > up to 100ms that occur during writeback that we did not see with older
> > > > kernels.  I've traced the stalls and it looks like they are blocking
> > > > on wait_on_page_writeback() introduced in
> > > > d76ee18a8551e33ad7dbd55cac38bc7b094f3abb "fs: block_page_mkwrite
> > > > should wait for writeback to finish"
> > > > 
> > > > Reading the commit description doesn't really explain to me why this
> > > > change was needed.
> > > 
> > > It is there to avoid pages being modified while they are under
> > > writeback, which defeats various checksumming schemes like DIF/DIX, the
> > > iSCSI CRCs, or even just the RAID parity calculations.  All of these
> > > either failed before, or had to work around it by copying all data
> > > that was written.
> > 
> > I'm assuming you mean software RAID here?  We do have a hardware RAID
> 
> Yes.
> 
> > controller.  Also for anything that was working around this issue
> > before by copying the data, are those workarounds still in place?
> 
> I suspect iscsi and md-raid5 are still making shadow copies of data blocks
> before writing them out.  However, there was no previous workaround for DIF/DIX
> errors -- this ("*_page_mkwrite should wait...") patch series _is_ the fix for
> DIF/DIX.  I recall that we rejected the shadow buffer approach for DIF/DIX
> because allocating new pages is expensive if we do it for each disk write in
> anticipation of future page writes...

So for the most part it sounds like this change is needed for DIF/DIX.
Could we only enable the wait_on_page_writeback() if
CONFIG_BLK_DEV_INTEGRITY is set?  Does it make sense to tie these
together?

> > > If you don't use any of these you can remove the call and things
> > > will work like they did before.
> > 
> > I may do this for now.
> > 
> > In the longer term is there any chance this could be made better?  I'm
> > not an expert here so my suggestions may be naive.  Could a mechanism
> > be made to check if the page needs to be checksummed and only block in
> 
> ...however, one could replace that wait_on_page_writeback with some sort of
> call that would duplicate the page, update each program's page table to point
> to the new page, and then somehow reap the page that's under IO when the IO
> completes.  That might also be complicated to implement, I don't know.  If
> there aren't any free pages, then this scheme (and the one I mentioned in the
> previous paragraph) will block a thread while the system tries to reclaim some
> pages.
> 
> I think we also talked about a block device flag to signal that the device
> requires stable page writes, which would let us turn off the waits on devices
> that don't care.  That at least could defer this discussion until you encounter
> one of these devices that wants stable page writes.

I would be in favor of something like this.

> I'm curious, is this program writing to the mmap region while another program
> is trying to fsync/fdatasync/sync dirty pages to disk?  Is that how you noticed
> the jittery latency?  We'd figured that not many programs would notice the
> latency unless there was something that was causing a lot of dirty page writes
> concurrent to something else dirtying a lot of pages.  Clearly we failed in
> your case.  Sorry. :/

The other thread in this case is the [flush-8:0] daemon writing back
the pages.  So in our case you could see the spikes every time it wakes
up to write back dirty pages.  While we can control this to some
extent with vm.dirty_writeback_centisecs and vm.dirty_expire_centisecs
it is essentially impossible to ensure the writeback doesn't coincide
with us writing to the page again.
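
For anyone wanting to check what these two knobs are set to, a small
sketch: both are exposed as files under /proc/sys/vm/ and hold a single
integer in centiseconds.  The helper names below are illustrative.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Read one integer from a file such as
 * /proc/sys/vm/dirty_writeback_centisecs or
 * /proc/sys/vm/dirty_expire_centisecs.
 * Returns 0 on success, -1 if the file is missing or unparsable.
 */
static int read_long_from_file(const char *path, long *out)
{
    assert(path != NULL && out != NULL);
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    int ok = fscanf(f, "%ld", out) == 1;
    fclose(f);
    return ok ? 0 : -1;
}

/* The sysctls are in hundredths of a second; convert to milliseconds. */
static long centisecs_to_ms(long centisecs)
{
    return centisecs * 10;
}
```

Tuning these only moves the flusher wakeups around; it does not stop a
store from landing on a page that is being written back at that moment.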

> That said, imagine if we revert to the pre-3.0 mechanism (or add that flag): if
> we start transferring page A to the disk for writing and your program comes in
> and changes A to A' before that transfer completes, then the disk will see a
> data blob that is partly A and partly A', and the proportions of A/A' are
> ill-defined.  I agree that ~100ms latency is not good, however. :(

In our use case I don't _think_ we care too much about the part A part
A' problem.  For the most part if we cared about not getting a mix we
would fsync/msync the changes.

> What are your program's mmap write latency requirements?

In these use cases I'm in the "as fast as possible" business.  We
don't have hard latency requirements, but we generally don't want to
see things get worse.

Thanks,
Shawn

* Re: Stalls during writeback for mmaped I/O on XFS in 3.0
  2011-09-16 16:32       ` Shawn Bohrer
@ 2011-09-20 16:30         ` Christoph Hellwig
  2011-09-20 18:42           ` Shawn Bohrer
  0 siblings, 1 reply; 7+ messages in thread
From: Christoph Hellwig @ 2011-09-20 16:30 UTC
  To: Shawn Bohrer
  Cc: Darrick J. Wong, Christoph Hellwig, linux-fsdevel, linux-kernel,
	xfs

On Fri, Sep 16, 2011 at 11:32:32AM -0500, Shawn Bohrer wrote:
> So for the most part it sounds like this change is needed for DIF/DIX.
> Could we only enable the wait_on_page_writeback() if
> CONFIG_BLK_DEV_INTEGRITY is set?  Does it make sense to tie these
> together?

It will also allow for huge efficiency gains on software raid.  There
have been some Lustre patches for that.

> The other thread in this case is the [flush-8:0] daemon writing back
> the pages.  So in our case you could see the spikes every time it wakes
> up to write back dirty pages.  While we can control this to some
> extent with vm.dirty_writeback_centisecs and vm.dirty_expire_centisecs
> it is essentially impossible to ensure the writeback doesn't coincide
> with us writing to the page again.

Can you explain your use case in more detail?  Right now, for example,
an mlock removes the page from the LRU list and thus stops VM
writeback.  If such an interface would be useful for you, we could
offer an fadvise call that stops writeback entirely and requires you
to force it when you want it.

* Re: Stalls during writeback for mmaped I/O on XFS in 3.0
  2011-09-20 16:30         ` Christoph Hellwig
@ 2011-09-20 18:42           ` Shawn Bohrer
  0 siblings, 0 replies; 7+ messages in thread
From: Shawn Bohrer @ 2011-09-20 18:42 UTC
  To: Christoph Hellwig; +Cc: Darrick J. Wong, linux-fsdevel, linux-kernel, xfs

On Tue, Sep 20, 2011 at 12:30:34PM -0400, Christoph Hellwig wrote:
> On Fri, Sep 16, 2011 at 11:32:32AM -0500, Shawn Bohrer wrote:
> > So for the most part it sounds like this change is needed for DIF/DIX.
> > Could we only enable the wait_on_page_writeback() if
> > CONFIG_BLK_DEV_INTEGRITY is set?  Does it make sense to tie these
> > together?
> 
> It will also allow for huge efficiency gains on software raid.  There
> have been some Lustre patches for that.
> 
> > The other thread in this case is the [flush-8:0] daemon writing back
> > the pages.  So in our case you could see the spikes every time it wakes
> > up to write back dirty pages.  While we can control this to some
> > extent with vm.dirty_writeback_centisecs and vm.dirty_expire_centisecs
> > it is essentially impossible to ensure the writeback doesn't coincide
> > with us writing to the page again.
> 
> Can you explain your use case in more detail?  Right now

In one case we have an app that receives multicast data from a socket
and appends it to one of many memory mapped files.  Once it writes the
data to the end of the file it updates the header to record the new
size.  Since we update the header page frequently we are very likely
to encounter a stall here as the header page gets flushed in the
background.  We also have reader processes that check the file header
to find the new data since the last time they checked.  A stall in
updating the header means the readers do not get the latest data.
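
Roughly, the append-then-publish pattern looks like the sketch below.
The header layout, the names, and the use of C11 atomics are
assumptions made for illustration; the real file format is different.
The store to the size field is the write that can fault and wait when
the header page is under writeback.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout: header at the start of the mapping, payload after. */
struct log_header {
    _Atomic uint64_t size;          /* bytes of valid payload */
};

/*
 * Writer: copy the payload first, then publish the new size with
 * release ordering so readers never see a size covering unwritten data.
 * Returns 0 on success, -1 if the data does not fit.
 */
static int log_append(void *map, size_t map_len, const void *data, size_t len)
{
    assert(map != NULL && data != NULL);
    if (map_len < sizeof(struct log_header))
        return -1;

    struct log_header *hdr = map;
    unsigned char *payload = (unsigned char *)map + sizeof *hdr;
    size_t capacity = map_len - sizeof *hdr;
    uint64_t old = atomic_load_explicit(&hdr->size, memory_order_relaxed);

    if (old > capacity || len > capacity - old)
        return -1;

    memcpy(payload + old, data, len);
    /* Header update: touches the frequently dirtied header page. */
    atomic_store_explicit(&hdr->size, old + len, memory_order_release);
    return 0;
}

/*
 * Reader: return how many bytes appeared since `*seen`, point `*start`
 * at the first new byte, and advance `*seen`.
 */
static size_t log_poll(void *map, uint64_t *seen, const unsigned char **start)
{
    assert(map != NULL && seen != NULL && start != NULL);
    struct log_header *hdr = map;
    uint64_t size = atomic_load_explicit(&hdr->size, memory_order_acquire);
    size_t fresh = (size_t)(size - *seen);

    *start = (const unsigned char *)map + sizeof *hdr + *seen;
    *seen = size;
    return fresh;
}
```

In the real setup `map` would come from mmap() with MAP_SHARED on the
data file, with the writer and readers in separate processes.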

It is also possible that as we append the data to the file a partially
filled page at the end of the file could get flushed in the background
causing a stall since we append in chunks smaller than 4K.  This is
less likely though because we have tuned our
vm.dirty_writeback_centisecs and vm.dirty_expire_centisecs so that we
normally completely fill a page before the OS flushes it.

> for example a mlock removes the page from the lru list and thus stops
> VM writeback.  If such an interface would be useful for you we could
> offer an fadvise call that stops writeback entirely, and requires you
> to force it when you want it.

For the case I described above I'm not sure this would help because
we don't know the incoming rate of data so even if we force the sync
it could still cause a stall.

I do have a second application that is also suffering from these
stalls and I believe we could avoid the stalls by using fadvise to
disable writeback for a portion of the file and manually sync it
ourselves.  So this could potentially solve one of my problems.
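
The "manually sync it ourselves" half can already be sketched with
msync() on a sub-range of the mapping.  The fadvise hint to disable
writeback is only a proposal in this thread and is not shown.  The
helper names and the path are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Create or resize `path` to `len` bytes and map it shared; NULL on failure. */
static void *map_shared_file(const char *path, size_t len)
{
    assert(path != NULL && len > 0);
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return NULL;

    void *p = MAP_FAILED;
    if (ftruncate(fd, (off_t)len) == 0)
        p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    return p == MAP_FAILED ? NULL : p;
}

/*
 * Synchronously write back [offset, offset + len) of a shared mapping.
 * msync() wants a page-aligned start address, so round the start down
 * and grow the length to match.  Returns 0 on success, -1 on failure.
 */
static int sync_mapped_range(void *base, size_t offset, size_t len)
{
    assert(base != NULL && len > 0);
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t start = offset & ~(page - 1);
    size_t span = (offset - start) + len;

    return msync((unsigned char *)base + start, span, MS_SYNC);
}
```

Because MS_SYNC returns only after the write completes, the application
picks the moment at which it pays for the flush.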

Thanks,
Shawn