* NFS page states & writeback
@ 2011-03-25  1:28 Jan Kara
  2011-03-25  4:47 ` Dave Chinner
  2011-03-25  7:00 ` Wu Fengguang
  0 siblings, 2 replies; 15+ messages in thread
From: Jan Kara @ 2011-03-25  1:28 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-nfs, Andrew Morton, Christoph Hellwig, Wu Fengguang,
	Dave Chinner

  Hi,

  while working on changes to balance_dirty_pages() I was investigating why
NFS writeback is *so* bumpy when I do not call writeback_inodes_wb() from
balance_dirty_pages(). Take a single dd writing to NFS. What I can
see is that we quickly accumulate dirty pages up to the limit - ~700 MB on
that machine. So the flusher thread starts working and in an instant all
these ~700 MB transition from the Dirty state to the Writeback state. Then,
as the server acks writes, Writeback pages slowly change to Unstable pages
(at a 100 MB/s rate, let's say) and then at one moment (when the commit to
the server happens) all pages transition from Unstable to Clean state - and
the cycle begins from the start.

The reason for this behavior seems to be a flaw in the logic in
over_bground_thresh() which checks:
global_page_state(NR_FILE_DIRTY) +
      global_page_state(NR_UNSTABLE_NFS) > background_thresh
So at the moment all pages are turned Writeback, the flusher thread goes to
sleep and doesn't do any background writeback until we have accumulated
enough Unstable pages to get over background_thresh. But NFS needs to have
->write_inode() called so that it can send commit requests to the server.
So effectively we end up sending a commit only when background_thresh
Unstable pages have accumulated, which creates the bumpiness. Previously this
wasn't a problem because balance_dirty_pages() ended up calling
->write_inode() often enough for NFS to send commit requests reasonably often.
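
For reference, the whole check looks roughly like this (paraphrasing
fs/fs-writeback.c of this era, not a verbatim quote):

        static inline bool over_bground_thresh(void)
        {
                unsigned long background_thresh, dirty_thresh;

                global_dirty_limits(&background_thresh, &dirty_thresh);

                /* Note: NR_WRITEBACK is not counted, so once everything
                 * has left the Dirty state the flusher sees no work. */
                return (global_page_state(NR_FILE_DIRTY) +
                        global_page_state(NR_UNSTABLE_NFS) > background_thresh);
        }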

Now I wouldn't write such a long email about this if I knew how to cleanly
fix the check ;-). One way to "fix" the check would be to add Writeback
pages there:
NR_FILE_DIRTY + NR_WRITEBACK + NR_UNSTABLE_NFS > background_thresh

This would work in the sense that it would keep the flusher thread working but
a) for normal filesystems it would be working even if there's potentially
nothing to do (or it is not necessary to do anything)
b) NFS is picky about when it sends commit requests (the inode has to have
more Unstable pages than Writeback pages if I'm reading the code in
nfs_commit_unstable_pages() right - see the sketch below), so the flusher
thread may be working but nothing really happens until enough Unstable
pages accumulate.
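
The heuristic I mean is roughly the following (paraphrasing
nfs_commit_unstable_pages() in fs/nfs/write.c; details may differ):

        if (wbc->sync_mode == WB_SYNC_NONE) {
                /* Don't commit yet if this is a non-blocking flush and
                 * there are still a lot of outstanding writes (npages)
                 * compared to pages ready to commit (ncommit). */
                if (nfsi->ncommit <= (nfsi->npages >> 1))
                        goto out_mark_dirty;

                /* don't wait for the COMMIT response */
                flags = 0;
        }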

A check which kind of works but looks a bit hacky and is not perfect when
there are multiple files is:
NR_FILE_DIRTY + NR_UNSTABLE_NFS > background_thresh ||
NR_UNSTABLE_NFS > NR_WRITEBACK (to match what NFS does)

Any better idea for a fix?

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: NFS page states & writeback
  2011-03-25  1:28 NFS page states & writeback Jan Kara
@ 2011-03-25  4:47 ` Dave Chinner
  2011-03-25  7:11   ` Wu Fengguang
  2011-03-25 22:24   ` Jan Kara
  2011-03-25  7:00 ` Wu Fengguang
  1 sibling, 2 replies; 15+ messages in thread
From: Dave Chinner @ 2011-03-25  4:47 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, linux-nfs, Andrew Morton, Christoph Hellwig,
	Wu Fengguang

On Fri, Mar 25, 2011 at 02:28:03AM +0100, Jan Kara wrote:
>   Hi,
> 
>   while working on changes to balance_dirty_pages() I was investigating why
> NFS writeback is *so* bumpy when I do not call writeback_inodes_wb() from
> balance_dirty_pages(). Take a single dd writing to NFS. What I can
> see is that we quickly accumulate dirty pages upto limit - ~700 MB on that
> machine. So flusher thread starts working and in an instant all these ~700
> MB transition from Dirty state to Writeback state. Then, as server acks
> writes, Writeback pages slowly change to Unstable pages (at 100 MB/s rate
> let's say) and then at one moment (commit to server happens) all pages
> transition from Unstable to Clean state - the cycle begins from the start.
> 
> The reason for this behavior seems to be a flaw in the logic in
> over_bground_thresh() which checks:
> global_page_state(NR_FILE_DIRTY) +
>       global_page_state(NR_UNSTABLE_NFS) > background_thresh
> So at the moment all pages are turned Writeback, flusher thread goes to
> sleep and doesn't do any background writeback, until we have accumulated
> enough Stable pages to get over background_thresh. But NFS needs to have
> ->write_inode() called so that it can sent commit requests to the server.
> So effectively we end up sending commit only when background_thresh Unstable
> pages have accumulated which creates the bumpyness. Previously this wasn't
> a problem because balance_dirty_pages() ended up calling ->write_inode()
> often enough for NFS to send commit requests reasonably often.
> 
> Now I wouldn't write so long email about this if I knew how to cleanly fix
> the check ;-). One way to "fix" the check would be to add there Writeback
> pages:
> NR_FILE_DIRTY + NR_WRITEBACK + NR_UNSTABLE_NFS > background_thresh
> 
> This would work in the sense that it would keep flusher thread working but
> a) for normal filesystems it would be working even if there's potentially
> nothing to do (or it is not necessary to do anything)
> b) NFS is picky when it sends commit requests (inode has to have more
> Stable pages than Writeback pages if I'm reading the code in
> nfs_commit_unstable_pages() right) so flusher thread may be working but
> nothing really happens until enough stable pages accumulate.
> 
> A check which kind of works but looks a bit hacky and is not perfect when
> there are multiple files is:
> NR_FILE_DIRTY + NR_UNSTABLE_NFS > background_thresh ||
> NR_UNSTABLE_NFS > NR_WRITEBACK (to match what NFS does)
> 
> Any better idea for a fix?

Have NFS account its writeback pages as NR_UNSTABLE_NFS pages as well?
i.e. rather than incrementing NR_UNSTABLE_NFS at the
writeback->unstable transition, account it at the dirty->writeback
transition....
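
In code that would be something like the following (very rough sketch, not a
patch; the real function in fs/nfs/write.c also does per-bdi congestion
accounting, which I'm leaving out):

        static int nfs_set_page_writeback(struct page *page)
        {
                int ret = test_set_page_writeback(page);

                /* Count the page as "unstable" as soon as it enters
                 * writeback instead of waiting for the UNSTABLE reply. */
                if (!ret)
                        inc_zone_page_state(page, NR_UNSTABLE_NFS);
                return ret;
        }

nfs_mark_request_commit() would then have to stop bumping the counter, and
writes that come back FILE_SYNC would need to drop it again.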

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: NFS page states & writeback
  2011-03-25  1:28 NFS page states & writeback Jan Kara
  2011-03-25  4:47 ` Dave Chinner
@ 2011-03-25  7:00 ` Wu Fengguang
  2011-03-25  9:39   ` Dave Chinner
  1 sibling, 1 reply; 15+ messages in thread
From: Wu Fengguang @ 2011-03-25  7:00 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel@vger.kernel.org, linux-nfs@vger.kernel.org,
	Andrew Morton, Christoph Hellwig, Dave Chinner

Hi Jan,

On Fri, Mar 25, 2011 at 09:28:03AM +0800, Jan Kara wrote:
>   Hi,
> 
>   while working on changes to balance_dirty_pages() I was investigating why
> NFS writeback is *so* bumpy when I do not call writeback_inodes_wb() from
> balance_dirty_pages(). Take a single dd writing to NFS. What I can
> see is that we quickly accumulate dirty pages upto limit - ~700 MB on that
> machine. So flusher thread starts working and in an instant all these ~700
> MB transition from Dirty state to Writeback state. Then, as server acks

That can be fixed by the following patch:

        [PATCH 09/27] nfs: writeback pages wait queue
        https://lkml.org/lkml/2011/3/3/79

> writes, Writeback pages slowly change to Unstable pages (at 100 MB/s rate
> let's say) and then at one moment (commit to server happens) all pages
> transition from Unstable to Clean state - the cycle begins from the start.
> 
> The reason for this behavior seems to be a flaw in the logic in
> over_bground_thresh() which checks:
> global_page_state(NR_FILE_DIRTY) +
>       global_page_state(NR_UNSTABLE_NFS) > background_thresh
> So at the moment all pages are turned Writeback, flusher thread goes to
> sleep and doesn't do any background writeback, until we have accumulated
> enough Stable pages to get over background_thresh. But NFS needs to have
> ->write_inode() called so that it can sent commit requests to the server.
> So effectively we end up sending commit only when background_thresh Unstable
> pages have accumulated which creates the bumpyness. Previously this wasn't
> a problem because balance_dirty_pages() ended up calling ->write_inode()
> often enough for NFS to send commit requests reasonably often.
> 
> Now I wouldn't write so long email about this if I knew how to cleanly fix
> the check ;-). One way to "fix" the check would be to add there Writeback
> pages:
> NR_FILE_DIRTY + NR_WRITEBACK + NR_UNSTABLE_NFS > background_thresh
> 
> This would work in the sense that it would keep flusher thread working but
> a) for normal filesystems it would be working even if there's potentially
> nothing to do (or it is not necessary to do anything)
> b) NFS is picky when it sends commit requests (inode has to have more
> Stable pages than Writeback pages if I'm reading the code in
> nfs_commit_unstable_pages() right) so flusher thread may be working but
> nothing really happens until enough stable pages accumulate.
> 
> A check which kind of works but looks a bit hacky and is not perfect when
> there are multiple files is:
> NR_FILE_DIRTY + NR_UNSTABLE_NFS > background_thresh ||
> NR_UNSTABLE_NFS > NR_WRITEBACK (to match what NFS does)

There is another patch in the series "[PATCH 12/27] nfs: lower
writeback threshold proportionally to dirty threshold" that tries to
limit the NFS write queue size. For the system default 10%/20%
background/dirty ratios, it has the nice effect of keeping

        nr_writeback < 5%

So when the system has exceeded the dirty limit, the background flusher
won't quit, because

        nr_dirty + nr_unstable > 10%

Thanks,
Fengguang


* Re: NFS page states & writeback
  2011-03-25  4:47 ` Dave Chinner
@ 2011-03-25  7:11   ` Wu Fengguang
  2011-03-25 22:24   ` Jan Kara
  1 sibling, 0 replies; 15+ messages in thread
From: Wu Fengguang @ 2011-03-25  7:11 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, linux-fsdevel@vger.kernel.org,
	linux-nfs@vger.kernel.org, Andrew Morton,
	Christoph Hellwig

On Fri, Mar 25, 2011 at 12:47:54PM +0800, Dave Chinner wrote:
> On Fri, Mar 25, 2011 at 02:28:03AM +0100, Jan Kara wrote:
> >   Hi,
> > 
> >   while working on changes to balance_dirty_pages() I was investigating why
> > NFS writeback is *so* bumpy when I do not call writeback_inodes_wb() from
> > balance_dirty_pages(). Take a single dd writing to NFS. What I can
> > see is that we quickly accumulate dirty pages upto limit - ~700 MB on that
> > machine. So flusher thread starts working and in an instant all these ~700
> > MB transition from Dirty state to Writeback state. Then, as server acks
> > writes, Writeback pages slowly change to Unstable pages (at 100 MB/s rate
> > let's say) and then at one moment (commit to server happens) all pages
> > transition from Unstable to Clean state - the cycle begins from the start.
> > 
> > The reason for this behavior seems to be a flaw in the logic in
> > over_bground_thresh() which checks:
> > global_page_state(NR_FILE_DIRTY) +
> >       global_page_state(NR_UNSTABLE_NFS) > background_thresh
> > So at the moment all pages are turned Writeback, flusher thread goes to
> > sleep and doesn't do any background writeback, until we have accumulated
> > enough Stable pages to get over background_thresh. But NFS needs to have
> > ->write_inode() called so that it can sent commit requests to the server.
> > So effectively we end up sending commit only when background_thresh Unstable
> > pages have accumulated which creates the bumpyness. Previously this wasn't
> > a problem because balance_dirty_pages() ended up calling ->write_inode()
> > often enough for NFS to send commit requests reasonably often.
> > 
> > Now I wouldn't write so long email about this if I knew how to cleanly fix
> > the check ;-). One way to "fix" the check would be to add there Writeback
> > pages:
> > NR_FILE_DIRTY + NR_WRITEBACK + NR_UNSTABLE_NFS > background_thresh
> > 
> > This would work in the sense that it would keep flusher thread working but
> > a) for normal filesystems it would be working even if there's potentially
> > nothing to do (or it is not necessary to do anything)
> > b) NFS is picky when it sends commit requests (inode has to have more
> > Stable pages than Writeback pages if I'm reading the code in
> > nfs_commit_unstable_pages() right) so flusher thread may be working but
> > nothing really happens until enough stable pages accumulate.
> > 
> > A check which kind of works but looks a bit hacky and is not perfect when
> > there are multiple files is:
> > NR_FILE_DIRTY + NR_UNSTABLE_NFS > background_thresh ||
> > NR_UNSTABLE_NFS > NR_WRITEBACK (to match what NFS does)
> > 
> > Any better idea for a fix?
> 
> Have NFS account for it's writeback pages to also be accounted as
> NR_UNSTABLE_NFS pages? i.e. rather than incrementing NR_UNSTABLE_NFS
> at the writeback->unstable transition, account it at the
> dirty->writeback transition....

This increases the opportunity for the NFS flusher to busy-loop. Maybe
that's not a big problem as long as we add some sleep in the loop.

writeback: sleep for 10ms when nothing is written
http://linux.derkeiler.com/Mailing-Lists/Kernel/2010-12/msg06391.html

Thanks,
Fengguang


* Re: NFS page states & writeback
  2011-03-25  7:00 ` Wu Fengguang
@ 2011-03-25  9:39   ` Dave Chinner
  2011-03-25 14:22     ` Wu Fengguang
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2011-03-25  9:39 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, linux-fsdevel@vger.kernel.org,
	linux-nfs@vger.kernel.org, Andrew Morton,
	Christoph Hellwig

On Fri, Mar 25, 2011 at 03:00:54PM +0800, Wu Fengguang wrote:
> Hi Jan,
> 
> On Fri, Mar 25, 2011 at 09:28:03AM +0800, Jan Kara wrote:
> >   Hi,
> > 
> >   while working on changes to balance_dirty_pages() I was investigating why
> > NFS writeback is *so* bumpy when I do not call writeback_inodes_wb() from
> > balance_dirty_pages(). Take a single dd writing to NFS. What I can
> > see is that we quickly accumulate dirty pages upto limit - ~700 MB on that
> > machine. So flusher thread starts working and in an instant all these ~700
> > MB transition from Dirty state to Writeback state. Then, as server acks
> 
> That can be fixed by the following patch:
> 
>         [PATCH 09/27] nfs: writeback pages wait queue
>         https://lkml.org/lkml/2011/3/3/79

I don't think this is a good definition of write congestion for an
NFS (or any other network fs) client. Firstly, writeback congestion
is really dependent on the size of the network send window
remaining. That is, if you've filled the socket buffer with writes
and would block trying to queue more pages on the socket, then you
are congested. i.e. the measure of congestion is the rate at which
write requests can be sent to the server and processed by the server.

Secondly, the big problem that causes the lumpiness is that we only
send commits when we reach a large threshold of unstable pages.
Because most servers tend to cache large writes in RAM,
the server might have a long commit latency because it has to write
hundreds of MB of data to disk to complete the commit.

IOWs, the client sends the commit only when it really needs the
pages to be cleaned, and then we have the latency of the server
write before it responds that they are clean. Hence commits can take
a long time to complete and mark pages clean on the client side.

A solution that IRIX used for this problem was the concept of a
background commit. While doing writeback on an inode, if it sent
more than a certain threshold of data (typically in the range
of 0.5-2s worth of data) to the server without a commit being
issued, it would send an _asynchronous_ commit with the current dirty
range to the server. That way the server starts writing the data
before it hits dirty thresholds (i.e. prevents GBs of dirty data
being cached on the server so commit latency is kept low).

When the background commit completes the NFS client can then convert
pages in the commit range to clean. Hence we keep the number of
unstable pages under control without needing to wait for a certain
number of unstable pages to build up before commits are triggered.
This allows the process of writing dirty pages to clean
unstable pages at roughly the same rate as the write rate without
needing any magic thresholds to be configured....
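
A minimal sketch of how that could look on Linux - the since_commit counter,
nfs_write_bandwidth() and nfs_commit_async() are made-up names for
illustration, not existing interfaces:

        /* Called from the NFS write completion path: once roughly a
         * second's worth of data has gone out since the last commit,
         * fire an asynchronous COMMIT for the range written so far. */
        static void nfs_maybe_background_commit(struct inode *inode,
                                                unsigned long bytes)
        {
                struct nfs_inode *nfsi = NFS_I(inode);

                nfsi->since_commit += bytes;
                if (nfsi->since_commit < nfs_write_bandwidth(inode))
                        return;

                nfsi->since_commit = 0;
                nfs_commit_async(inode);
        }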

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: NFS page states & writeback
  2011-03-25  9:39   ` Dave Chinner
@ 2011-03-25 14:22     ` Wu Fengguang
  2011-03-25 14:32       ` Wu Fengguang
                         ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Wu Fengguang @ 2011-03-25 14:22 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, linux-fsdevel@vger.kernel.org,
	linux-nfs@vger.kernel.org, Andrew Morton,
	Christoph Hellwig

On Fri, Mar 25, 2011 at 05:39:57PM +0800, Dave Chinner wrote:
> On Fri, Mar 25, 2011 at 03:00:54PM +0800, Wu Fengguang wrote:
> > Hi Jan,
> > 
> > On Fri, Mar 25, 2011 at 09:28:03AM +0800, Jan Kara wrote:
> > >   Hi,
> > > 
> > >   while working on changes to balance_dirty_pages() I was investigating why
> > > NFS writeback is *so* bumpy when I do not call writeback_inodes_wb() from
> > > balance_dirty_pages(). Take a single dd writing to NFS. What I can
> > > see is that we quickly accumulate dirty pages upto limit - ~700 MB on that
> > > machine. So flusher thread starts working and in an instant all these ~700
> > > MB transition from Dirty state to Writeback state. Then, as server acks
> > 
> > That can be fixed by the following patch:
> > 
> >         [PATCH 09/27] nfs: writeback pages wait queue
> >         https://lkml.org/lkml/2011/3/3/79
> 
> I don't think this is a good definition of write congestion for a
> NFS (or any other network fs) client. Firstly, writeback congestion
> is really dependent on the size of the network send window
> remaining. That is, if you've filled the socket buffer with writes
> and would block trying to queue more pages on the socket, then you
> are congested. i.e. the measure of congestion is the rate at which
> write request can be sent to the server and processed by the server.

You are right. The wait queue fullness does reflect the congestion in a
typical setup because the queue size is typically much larger than the
network pipeline. If that happens not to be the case, I don't worry much,
because the patch's main goal is to avoid

- NFS client side nr_dirty being constantly exhausted

- very bursty network IO (I literally see it), such as 1Gbps for 1
  second followed by completely idle for 10 seconds. Ideally if the
  server disk can only do 10MB/s then there should be a steady 10MB/s
  network stream.

It just happens to inherit the old *congestion* names, and the upper
layer now actually hardly cares about the congestion state.

> Secondly, the big problem that causes the lumpiness is that we only
> send commits when we reach at large threshold of unstable pages.
> Because most servers tend to cache large writes in RAM,
> the server might have a long commit latency because it has to write
> hundred of MB of data to disk to complete the commit.
> 
> IOWs, the client sends the commit only when it really needs the
> pages the be cleaned, and then we have the latency of the server
> write before it responds that they are clean. Hence commits can take
> a long time to complete and mark pages clean on the client side.
 
That's the point. That's why I added the following patches to limit the
NFS commit size:

        [PATCH 10/27] nfs: limit the commit size to reduce fluctuations
        [PATCH 11/27] nfs: limit the commit range

> A solution that IRIX used for this problem was the concept of a
> background commit. While doing writeback on an inode, if it sent
> more than than a certain threshold of data (typically in the range
> of 0.5-2s worth of data) to the server without a commit being
> issued, it would send an _asynchronous_ commit with the current dirty
> range to the server. That way the server starts writing the data
> before it hits dirty thresholds (i.e. prevents GBs of dirty data
> being cached on the server so commit lantecy is kept low).
> 
> When the background commit completes the NFS client can then convert
> pages in the commit range to clean. Hence we keep the number of
> unstable pages under control without needing to wait for a certain
> number of unstable pages to build up before commits are triggered.
> This allows the process of writing dirty pages to clean
> unstable pages at roughly the same rate as the write rate without
> needing any magic thresholds to be configured....

That's a good approach. In Linux, by limiting the commit size, the NFS
flusher should roughly achieve the same effect.

However, there is another problem. Look at the graph below. Even though
the commits are sent to the NFS server in relatively small sizes and evenly
distributed in time (the green points), the commit COMPLETION events
from the server are observed to be pretty bumpy over time (the blue
points sitting on the red lines). This may not be easily fixable, so
we still have to live with bumpy NFS commit completions...

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/NFS/nfs-1dd-1M-8p-2945M-20%25-2.6.38-rc6-dt6+-2011-02-22-21-09/nfs-commit.png

Thanks,
Fengguang


* Re: NFS page states & writeback
  2011-03-25 14:22     ` Wu Fengguang
@ 2011-03-25 14:32       ` Wu Fengguang
  2011-03-25 18:26       ` Jan Kara
  2011-03-25 22:55       ` Dave Chinner
  2 siblings, 0 replies; 15+ messages in thread
From: Wu Fengguang @ 2011-03-25 14:32 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, linux-fsdevel@vger.kernel.org,
	linux-nfs@vger.kernel.org, Andrew Morton, Christoph Hellwig

> However there is another problem. Look at the below graph. Even though
> the commits are sent to NFS server in relatively small size and evenly
> distributed in time (the green points), the commit COMPLETION events
> from the server are observed to be pretty bumpy over time (the blue
> points sitting on the red lines). This may not be easily fixable.. So
> we still have to live with bumpy NFS commit completions...
> 
> http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/NFS/nfs-1dd-1M-8p-2945M-20%25-2.6.38-rc6-dt6+-2011-02-22-21-09/nfs-commit.png

The bumpy commit completions result in large fluctuations in estimated
write bandwidth (the red line):

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/NFS/nfs-1dd-1M-8p-2945M-20%25-2.6.38-rc6-dt6+-2011-02-22-21-09/balance_dirty_pages-bandwidth.png

and a fluctuating number of dirty pages (the red line):

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/NFS/nfs-1dd-1M-8p-2945M-20%25-2.6.38-rc6-dt6+-2011-02-22-21-09/balance_dirty_pages-pages.png

And you can see the v6 patches still manage to keep the pause times
(the red points) under 100ms (hands down!):

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/NFS/nfs-1dd-1M-8p-2945M-20%25-2.6.38-rc6-dt6+-2011-02-22-21-09/balance_dirty_pages-pause.png

Thanks,
Fengguang


* Re: NFS page states & writeback
  2011-03-25 14:22     ` Wu Fengguang
  2011-03-25 14:32       ` Wu Fengguang
@ 2011-03-25 18:26       ` Jan Kara
  2011-03-25 22:55       ` Dave Chinner
  2 siblings, 0 replies; 15+ messages in thread
From: Jan Kara @ 2011-03-25 18:26 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Dave Chinner, Jan Kara,
	linux-fsdevel@vger.kernel.org,
	linux-nfs@vger.kernel.org, Andrew Morton,
	Christoph Hellwig

On Fri 25-03-11 22:22:53, Wu Fengguang wrote:
> On Fri, Mar 25, 2011 at 05:39:57PM +0800, Dave Chinner wrote:
> > On Fri, Mar 25, 2011 at 03:00:54PM +0800, Wu Fengguang wrote:
> > > Hi Jan,
> > > 
> > > On Fri, Mar 25, 2011 at 09:28:03AM +0800, Jan Kara wrote:
> > > >   Hi,
> > > > 
> > > >   while working on changes to balance_dirty_pages() I was investigating why
> > > > NFS writeback is *so* bumpy when I do not call writeback_inodes_wb() from
> > > > balance_dirty_pages(). Take a single dd writing to NFS. What I can
> > > > see is that we quickly accumulate dirty pages upto limit - ~700 MB on that
> > > > machine. So flusher thread starts working and in an instant all these ~700
> > > > MB transition from Dirty state to Writeback state. Then, as server acks
> > > 
> > > That can be fixed by the following patch:
> > > 
> > >         [PATCH 09/27] nfs: writeback pages wait queue
> > >         https://lkml.org/lkml/2011/3/3/79
> > 
> > I don't think this is a good definition of write congestion for a
> > NFS (or any other network fs) client. Firstly, writeback congestion
> > is really dependent on the size of the network send window
> > remaining. That is, if you've filled the socket buffer with writes
> > and would block trying to queue more pages on the socket, then you
> > are congested. i.e. the measure of congestion is the rate at which
> > write request can be sent to the server and processed by the server.
> 
> You are right. The wait queue fullness does reflect the congestion in
> typical setup because the queue size is typically much larger than the
> network pipeline. If happens to not be the case, I don't bother much
> because the patch's main goal is to avoid
> 
> - NFS client side nr_dirty being constantly exhausted
> 
> - very bursty network IO (I literally see it), such as 1Gbps for 1
>   second followed by completely idle for 10 seconds. Ideally if the
>   server disk can only do 10MB/s then there should be a fluent 10MB/s
>   network stream.
> 
> It just happens to inherit the old *congestion* names, and the upper
> layer now actually hardly care about the congestion state.
  Fengguang, I believe that limiting the amount of NFS writeback pages is the
wrong direction. I don't think that the rapid transition of Dirty pages to
Writeback pages is a problem. Limiting Writeback pages only papers over
the problem in the writeback code. So let the NFS client write what we tell
it to write as fast as the network and the server can take it. I agree that
bursty sending of pages might cause problems with network congestion etc.
but that's not the problem we are trying to address now.

> > Secondly, the big problem that causes the lumpiness is that we only
> > send commits when we reach at large threshold of unstable pages.
> > Because most servers tend to cache large writes in RAM,
> > the server might have a long commit latency because it has to write
> > hundred of MB of data to disk to complete the commit.
> > 
> > IOWs, the client sends the commit only when it really needs the
> > pages the be cleaned, and then we have the latency of the server
> > write before it responds that they are clean. Hence commits can take
> > a long time to complete and mark pages clean on the client side.
>  
> That's the point. That's why I add the following patches to limit the
> NFS commit size:
> 
>         [PATCH 10/27] nfs: limit the commit size to reduce fluctuations
>         [PATCH 11/27] nfs: limit the commit range
  I agree we need something along the lines of your first patch to push
pages out of the Unstable state more aggressively. I was testing a patch
yesterday which was more conservative and just changed the check to:

               if (nfsi->ncommit <= (nfsi->npages >> 4) ||
                   (!exceeded && nfsi->ncommit <= (nfsi->npages >> 1)))
                        goto out_mark_dirty;
 
Where 'exceeded' is bdi->dirty_exceeded. So we are more aggressive only if
we really need the pages back (since each commit forces the server to do
an fsync, it is good to avoid it if we don't really need it). And it worked
nicely in reducing the bumps as well.

I'm not entirely convinced the nfsi->ncommit <= (nfsi->npages >> 4) part of
the check is the best thing to do; maybe we could have a fixed limit (like
64 MB) there as well. So the condition could look like:

               if ((exceeded && nfsi->ncommit <= (nfsi->npages >> 4) &&
                    nfsi->ncommit < 16384) ||
                   (!exceeded && nfsi->ncommit <= (nfsi->npages >> 1)))
                        goto out_mark_dirty;

What do you think?

I'm not sure about the second patch - I didn't test a load where it would
help. It would possibly help only if several clients shared a file? How
likely is that? Or do I misread your patch?

> > A solution that IRIX used for this problem was the concept of a
> > background commit. While doing writeback on an inode, if it sent
> > more than than a certain threshold of data (typically in the range
> > of 0.5-2s worth of data) to the server without a commit being
> > issued, it would send an _asynchronous_ commit with the current dirty
> > range to the server. That way the server starts writing the data
> > before it hits dirty thresholds (i.e. prevents GBs of dirty data
> > being cached on the server so commit lantecy is kept low).
> > 
> > When the background commit completes the NFS client can then convert
> > pages in the commit range to clean. Hence we keep the number of
> > unstable pages under control without needing to wait for a certain
> > number of unstable pages to build up before commits are triggered.
> > This allows the process of writing dirty pages to clean
> > unstable pages at roughly the same rate as the write rate without
> > needing any magic thresholds to be configured....
> 
> That's a good approach. In linux, by limiting the commit size, the NFS
> flusher should roughly achieve the same effect.
> 
> However there is another problem. Look at the below graph. Even though
> the commits are sent to NFS server in relatively small size and evenly
> distributed in time (the green points), the commit COMPLETION events
> from the server are observed to be pretty bumpy over time (the blue
> points sitting on the red lines). This may not be easily fixable.. So
> we still have to live with bumpy NFS commit completions...
> 
> http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/NFS/nfs-1dd-1M-8p-2945M-20%25-2.6.38-rc6-dt6+-2011-02-22-21-09/nfs-commit.png
  Well, looking at your graph - and it also pretty well matches my tests - the
completions come in ~50 MB batches. It's not perfect but far better than
the 400+ MB batches I was seeing previously - compare e.g.
http://beta.suse.com/private/jack/balance_dirty_pages-v2/graphs/dd-test-nfs_8_1024-2011-02-28-21:52:05-mm.png
and
http://beta.suse.com/private/jack/balance_dirty_pages-v2/graphs/dd-test-nfs_8_1024-2011-03-25-01:02:43-mm.png

With these smaller bumps, pauses in balance_dirty_pages() stay below 500 ms
even with 8 dds banging the NFS share with my patchset (most of them are
~30 ms in fact):
http://beta.suse.com/private/jack/balance_dirty_pages-v2/graphs/dd-test-nfs_8_1024-2011-03-25-01:02:43-mm-procs.png

So I'd say that fixing the background limit checking logic plus a small
tweak to the commit check should be enough to make NFS reasonably well
behaved...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: NFS page states & writeback
  2011-03-25  4:47 ` Dave Chinner
  2011-03-25  7:11   ` Wu Fengguang
@ 2011-03-25 22:24   ` Jan Kara
       [not found]     ` <20110325222458.GB26932-+0h/O2h83AeN3ZZ/Hiejyg@public.gmane.org>
  1 sibling, 1 reply; 15+ messages in thread
From: Jan Kara @ 2011-03-25 22:24 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, linux-fsdevel@vger.kernel.org,
	linux-nfs@vger.kernel.org, Andrew Morton,
	Christoph Hellwig, Wu Fengguang

On Fri 25-03-11 15:47:54, Dave Chinner wrote:
> On Fri, Mar 25, 2011 at 02:28:03AM +0100, Jan Kara wrote:
> >   while working on changes to balance_dirty_pages() I was investigating why
> > NFS writeback is *so* bumpy when I do not call writeback_inodes_wb() from
> > balance_dirty_pages(). Take a single dd writing to NFS. What I can
> > see is that we quickly accumulate dirty pages upto limit - ~700 MB on that
> > machine. So flusher thread starts working and in an instant all these ~700
> > MB transition from Dirty state to Writeback state. Then, as server acks
> > writes, Writeback pages slowly change to Unstable pages (at 100 MB/s rate
> > let's say) and then at one moment (commit to server happens) all pages
> > transition from Unstable to Clean state - the cycle begins from the start.
> > 
> > The reason for this behavior seems to be a flaw in the logic in
> > over_bground_thresh() which checks:
> > global_page_state(NR_FILE_DIRTY) +
> >       global_page_state(NR_UNSTABLE_NFS) > background_thresh
> > So at the moment all pages are turned Writeback, flusher thread goes to
> > sleep and doesn't do any background writeback, until we have accumulated
> > enough Stable pages to get over background_thresh. But NFS needs to have
> > ->write_inode() called so that it can sent commit requests to the server.
> > So effectively we end up sending commit only when background_thresh Unstable
> > pages have accumulated which creates the bumpyness. Previously this wasn't
> > a problem because balance_dirty_pages() ended up calling ->write_inode()
> > often enough for NFS to send commit requests reasonably often.
> > 
> > Now I wouldn't write so long email about this if I knew how to cleanly fix
> > the check ;-). One way to "fix" the check would be to add there Writeback
> > pages:
> > NR_FILE_DIRTY + NR_WRITEBACK + NR_UNSTABLE_NFS > background_thresh
> > 
> > This would work in the sense that it would keep flusher thread working but
> > a) for normal filesystems it would be working even if there's potentially
> > nothing to do (or it is not necessary to do anything)
> > b) NFS is picky when it sends commit requests (inode has to have more
> > Stable pages than Writeback pages if I'm reading the code in
> > nfs_commit_unstable_pages() right) so flusher thread may be working but
> > nothing really happens until enough stable pages accumulate.
> > 
> > A check which kind of works but looks a bit hacky and is not perfect when
> > there are multiple files is:
> > NR_FILE_DIRTY + NR_UNSTABLE_NFS > background_thresh ||
> > NR_UNSTABLE_NFS > NR_WRITEBACK (to match what NFS does)
> > 
> > Any better idea for a fix?
> 
> Have NFS account for it's writeback pages to also be accounted as
> NR_UNSTABLE_NFS pages? i.e. rather than incrementing NR_UNSTABLE_NFS
> at the writeback->unstable transition, account it at the
> dirty->writeback transition....
Thanks for the idea. I was thinking about this but I'm not sure accounting
writeback pages as unstable, as you propose, is what we want to do. It would
make the flusher thread realize that there is more writeback needed, but in
fact we would have to wait until pages leave the writeback state to be
able to make any progress. It could be mitigated by inserting a delay in the
writeback loop as Fengguang proposes, but it still seems a bit hacky.

So I think we could stop doing writeback when we have done the transition
of pages from the Dirty to the Writeback state. We should only make sure
someone kicks the background writeback again when there are enough unstable
pages to be worth a commit. This tends to happen from balance_dirty_pages()
or, in the worst case, when the flusher thread wakes up to check for old
inodes to flush. But we can also kick the flusher thread from NFS when we
transition enough pages, which would seem rather robust to me. I'll try to
write a patch in this direction.
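
Roughly what I have in mind (untested sketch; the threshold and the exact
hook in the NFS write completion path are still to be decided):

        /* After pages have moved to the to-commit (unstable) list: */
        if (nfsi->ncommit > UNSTABLE_KICK_THRESH)       /* threshold TBD */
                bdi_start_background_writeback(
                                inode->i_mapping->backing_dev_info);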

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: NFS page states & writeback
  2011-03-25 14:22     ` Wu Fengguang
  2011-03-25 14:32       ` Wu Fengguang
  2011-03-25 18:26       ` Jan Kara
@ 2011-03-25 22:55       ` Dave Chinner
  2011-03-25 23:24         ` Jan Kara
  2 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2011-03-25 22:55 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, linux-fsdevel@vger.kernel.org,
	linux-nfs@vger.kernel.org, Andrew Morton, Christoph Hellwig

On Fri, Mar 25, 2011 at 10:22:53PM +0800, Wu Fengguang wrote:
> On Fri, Mar 25, 2011 at 05:39:57PM +0800, Dave Chinner wrote:
> > On Fri, Mar 25, 2011 at 03:00:54PM +0800, Wu Fengguang wrote:
> > > Hi Jan,
> > > 
> > > On Fri, Mar 25, 2011 at 09:28:03AM +0800, Jan Kara wrote:
> > > >   Hi,
> > > > 
> > > >   while working on changes to balance_dirty_pages() I was investigating why
> > > > NFS writeback is *so* bumpy when I do not call writeback_inodes_wb() from
> > > > balance_dirty_pages(). Take a single dd writing to NFS. What I can
> > > > see is that we quickly accumulate dirty pages upto limit - ~700 MB on that
> > > > machine. So flusher thread starts working and in an instant all these ~700
> > > > MB transition from Dirty state to Writeback state. Then, as server acks
> > > 
> > > That can be fixed by the following patch:
> > > 
> > >         [PATCH 09/27] nfs: writeback pages wait queue
> > >         https://lkml.org/lkml/2011/3/3/79
> > 
> > I don't think this is a good definition of write congestion for a
> > NFS (or any other network fs) client. Firstly, writeback congestion
> > is really dependent on the size of the network send window
> > remaining. That is, if you've filled the socket buffer with writes
> > and would block trying to queue more pages on the socket, then you
> > are congested. i.e. the measure of congestion is the rate at which
> > write request can be sent to the server and processed by the server.
> 
> You are right. The wait queue fullness does reflect the congestion in
> typical setup because the queue size is typically much larger than the
> network pipeline. If happens to not be the case, I don't bother much
> because the patch's main goal is to avoid
> 
> - NFS client side nr_dirty being constantly exhausted
> 
> - very bursty network IO (I literally see it), such as 1Gbps for 1
>   second followed by completely idle for 10 seconds. Ideally if the
>   server disk can only do 10MB/s then there should be a fluent 10MB/s
>   network stream.

Sure - it uses the network while doing writes, then blocks waiting for
the commit to be processed by the server. That's a very common
"foreground commit" NFS client write pattern. i.e. the NFS client
can either write data across the wire, or send a commit across the
wire, but not do both at once.

> It just happens to inherit the old *congestion* names, and the upper
> layer now actually hardly care about the congestion state.
> 
> > Secondly, the big problem that causes the lumpiness is that we only
> > send commits when we reach at large threshold of unstable pages.
> > Because most servers tend to cache large writes in RAM,
> > the server might have a long commit latency because it has to write
> > hundred of MB of data to disk to complete the commit.
> > 
> > IOWs, the client sends the commit only when it really needs the
> > pages the be cleaned, and then we have the latency of the server
> > write before it responds that they are clean. Hence commits can take
> > a long time to complete and mark pages clean on the client side.
>  
> That's the point. That's why I add the following patches to limit the
> NFS commit size:
> 
>         [PATCH 10/27] nfs: limit the commit size to reduce fluctuations
>         [PATCH 11/27] nfs: limit the commit range

They don't solve the exclusion problem that is the root cause of the
burstiness. They do reduce the impact of it, but only in cases where
the server isn't that busy...

> > A solution that IRIX used for this problem was the concept of a
> > background commit. While doing writeback on an inode, if it sent
> > more than than a certain threshold of data (typically in the range
> > of 0.5-2s worth of data) to the server without a commit being
> > issued, it would send an _asynchronous_ commit with the current dirty
> > range to the server. That way the server starts writing the data
> > before it hits dirty thresholds (i.e. prevents GBs of dirty data
> > being cached on the server so commit lantecy is kept low).
> > 
> > When the background commit completes the NFS client can then convert
> > pages in the commit range to clean. Hence we keep the number of
> > unstable pages under control without needing to wait for a certain
> > number of unstable pages to build up before commits are triggered.
> > This allows the process of writing dirty pages to clean
> > unstable pages at roughly the same rate as the write rate without
> > needing any magic thresholds to be configured....
> 
> That's a good approach. In linux, by limiting the commit size, the NFS
> flusher should roughly achieve the same effect.

Not really. It's still threshold-triggered, it's still synchronous,
and hence will still have problems with commit latency on slow or
very busy servers. That is, it may work OK when you are the only
client writing to the server, but when 1500 other clients are also
writing to the server it won't have the desired effect.

> However there is another problem. Look at the below graph. Even though
> the commits are sent to NFS server in relatively small size and evenly
> distributed in time (the green points), the commit COMPLETION events
> from the server are observed to be pretty bumpy over time (the blue
> points sitting on the red lines). This may not be easily fixable.. So
> we still have to live with bumpy NFS commit completions...

Right. The load on the server will ultimately determine the commit
latency, and that can _never_ be controlled by the client. We just
have to live with it and design the writeback path to prevent
commits from blocking writes in as many situations as possible.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: NFS page states & writeback
       [not found]     ` <20110325222458.GB26932-+0h/O2h83AeN3ZZ/Hiejyg@public.gmane.org>
@ 2011-03-25 23:04       ` Dave Chinner
  0 siblings, 0 replies; 15+ messages in thread
From: Dave Chinner @ 2011-03-25 23:04 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel@vger.kernel.org,
	linux-nfs@vger.kernel.org, Andrew Morton,
	Christoph Hellwig, Wu Fengguang

On Fri, Mar 25, 2011 at 11:24:58PM +0100, Jan Kara wrote:
> So I think we could stop doing writeback when we have done the transition
> of pages from Dirty to Writeback state. We should only make sure someone
> kicks the background writeback again when there are enough unstable pages
> to be worth a commit. This tends to happen from balance_dirty_pages() or
> in the worst case when flusher thread awakes to check for old inodes to
> flush. But we can also kick the flusher thread from NFS when we transition
> enough pages which would seem rather robust to me. I'll try to write a
> patch in this direction.

I think triggering the background writeback commits from NFS write
completion is a good way to proceed. A write request completion gives
us a guaranteed context in which to determine whether we need to
issue a commit or not, and it should mostly avoid the need for polling
or some external event to trigger a threshold check.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: NFS page states & writeback
  2011-03-25 22:55       ` Dave Chinner
@ 2011-03-25 23:24         ` Jan Kara
  2011-03-26  1:18           ` Dave Chinner
  0 siblings, 1 reply; 15+ messages in thread
From: Jan Kara @ 2011-03-25 23:24 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Wu Fengguang, Jan Kara, linux-fsdevel@vger.kernel.org,
	linux-nfs@vger.kernel.org, Andrew Morton, Christoph Hellwig

On Sat 26-03-11 09:55:58, Dave Chinner wrote:
> On Fri, Mar 25, 2011 at 10:22:53PM +0800, Wu Fengguang wrote:
> > It just happens to inherit the old *congestion* names, and the upper
> > layer now actually hardly care about the congestion state.
> > 
> > > Secondly, the big problem that causes the lumpiness is that we only
> > > send commits when we reach at large threshold of unstable pages.
> > > Because most servers tend to cache large writes in RAM,
> > > the server might have a long commit latency because it has to write
> > > hundred of MB of data to disk to complete the commit.
> > > 
> > > IOWs, the client sends the commit only when it really needs the
> > > pages the be cleaned, and then we have the latency of the server
> > > write before it responds that they are clean. Hence commits can take
> > > a long time to complete and mark pages clean on the client side.
> >  
> > That's the point. That's why I add the following patches to limit the
> > NFS commit size:
> > 
> >         [PATCH 10/27] nfs: limit the commit size to reduce fluctuations
> >         [PATCH 11/27] nfs: limit the commit range
> 
> They don't solve the exclusion problem that is the root cause of the
> burstiness. They do reduce the impact of it, but only in cases where
> the server isn't that busy...
  Well, at least the first patch results in sending commits earlier, for
smaller amounts of data, so that is in principle what we want, isn't it?

Maybe we could make the NFS client trigger the commit on its own when enough
unstable pages accumulate (and not depend on the flusher thread to call
->write_inode) to make things more fluent. But that's about it, and IRIX
did something like that, if I understood your explanation correctly.

> > > A solution that IRIX used for this problem was the concept of a
> > > background commit. While doing writeback on an inode, if it sent
> > > more than than a certain threshold of data (typically in the range
> > > of 0.5-2s worth of data) to the server without a commit being
> > > issued, it would send an _asynchronous_ commit with the current dirty
> > > range to the server. That way the server starts writing the data
> > > before it hits dirty thresholds (i.e. prevents GBs of dirty data
> > > being cached on the server so commit lantecy is kept low).
> > > 
> > > When the background commit completes the NFS client can then convert
> > > pages in the commit range to clean. Hence we keep the number of
> > > unstable pages under control without needing to wait for a certain
> > > number of unstable pages to build up before commits are triggered.
> > > This allows the process of writing dirty pages to clean
> > > unstable pages at roughly the same rate as the write rate without
> > > needing any magic thresholds to be configured....
> > 
> > That's a good approach. In linux, by limiting the commit size, the NFS
> > flusher should roughly achieve the same effect.
> 
> Not really. It's still threshold triggered,  it's still synchronous
> and hence will still have problems with commit latency on slow or
> very busy servers. That is, it may work ok when you are the only
> client writing to the server, but when 1500 other clients are also
> writing to the server it won't have the desired effect.
  It isn't synchronous. We don't wait for the commit in WB_SYNC_NONE mode if
I'm reading the code right. It's only synchronous in the sense that pages
are really clean only after the commit finishes, but that's not the problem
you are pointing to, I believe.

> > However there is another problem. Look at the below graph. Even though
> > the commits are sent to NFS server in relatively small size and evenly
> > distributed in time (the green points), the commit COMPLETION events
> > from the server are observed to be pretty bumpy over time (the blue
> > points sitting on the red lines). This may not be easily fixable.. So
> > we still have to live with bumpy NFS commit completions...
> 
> Right. The load on the server will ultimately determine the commit
> latency, and that can _never_ be controlled by the client. We just
> have to live with it and design the writeback path to prevent
> commits from blocking writes in as many situations as possible.
  The question is how hard we should try. Here I believe Fengguang's
patches can offer more than my approach because he throttles processes
based on estimated bandwidth, so occasional hiccups of the server are more
"smoothed out". If we send commits early enough, hiccups matter less, but
still it's just a matter of how big they are...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: NFS page states & writeback
  2011-03-25 23:24         ` Jan Kara
@ 2011-03-26  1:18           ` Dave Chinner
  2011-03-27 15:26             ` Trond Myklebust
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2011-03-26  1:18 UTC (permalink / raw)
  To: Jan Kara
  Cc: Wu Fengguang, linux-fsdevel@vger.kernel.org,
	linux-nfs@vger.kernel.org, Andrew Morton, Christoph Hellwig

On Sat, Mar 26, 2011 at 12:24:40AM +0100, Jan Kara wrote:
> On Sat 26-03-11 09:55:58, Dave Chinner wrote:
> > On Fri, Mar 25, 2011 at 10:22:53PM +0800, Wu Fengguang wrote:
> > > It just happens to inherit the old *congestion* names, and the upper
> > > layer now actually hardly care about the congestion state.
> > > 
> > > > Secondly, the big problem that causes the lumpiness is that we only
> > > > send commits when we reach at large threshold of unstable pages.
> > > > Because most servers tend to cache large writes in RAM,
> > > > the server might have a long commit latency because it has to write
> > > > hundred of MB of data to disk to complete the commit.
> > > > 
> > > > IOWs, the client sends the commit only when it really needs the
> > > > pages the be cleaned, and then we have the latency of the server
> > > > write before it responds that they are clean. Hence commits can take
> > > > a long time to complete and mark pages clean on the client side.
> > >  
> > > That's the point. That's why I add the following patches to limit the
> > > NFS commit size:
> > > 
> > >         [PATCH 10/27] nfs: limit the commit size to reduce fluctuations
> > >         [PATCH 11/27] nfs: limit the commit range
> > 
> > They don't solve the exclusion problem that is the root cause of the
> > burstiness. They do reduce the impact of it, but only in cases where
> > the server isn't that busy...
>   Well, at least the first patch results in sending commits earlier for
> smaller amounts of data so that is principially what we want, isn't it?
> 
> Maybe we could make NFS client trigger the commit on it's own when enough
> stable pages accumulate (and not depend on flusher thread to call
> ->write_inode) to make things more fluent. But that's about it and Irix
> did something like that if I understood your explanation correctly.

Effectively - waiting until a threshold is reached is too late to
prevent stalls.

> 
> > > > A solution that IRIX used for this problem was the concept of a
> > > > background commit. While doing writeback on an inode, if it sent
> > > > more than than a certain threshold of data (typically in the range
> > > > of 0.5-2s worth of data) to the server without a commit being
> > > > issued, it would send an _asynchronous_ commit with the current dirty
> > > > range to the server. That way the server starts writing the data
> > > > before it hits dirty thresholds (i.e. prevents GBs of dirty data
> > > > being cached on the server so commit lantecy is kept low).
> > > > 
> > > > When the background commit completes the NFS client can then convert
> > > > pages in the commit range to clean. Hence we keep the number of
> > > > unstable pages under control without needing to wait for a certain
> > > > number of unstable pages to build up before commits are triggered.
> > > > This allows the process of writing dirty pages to clean
> > > > unstable pages at roughly the same rate as the write rate without
> > > > needing any magic thresholds to be configured....
> > > 
> > > That's a good approach. In linux, by limiting the commit size, the NFS
> > > flusher should roughly achieve the same effect.
> > 
> > Not really. It's still threshold triggered,  it's still synchronous
> > and hence will still have problems with commit latency on slow or
> > very busy servers. That is, it may work ok when you are the only
> > client writing to the server, but when 1500 other clients are also
> > writing to the server it won't have the desired effect.
>   It isn't synchronous. We don't wait for commit in WB_SYNC_NONE mode if
> I'm reading the code right.  It's only synchronous in the sense that pages
> are really clean only after the commit finishes but that's not the problem
> you are pointing to I believe.

Yeah, it's changed since last time I looked closely at the NFS
writeback path. So that's not so much the problem anymore.

> 
> > > However there is another problem. Look at the below graph. Even though
> > > the commits are sent to NFS server in relatively small size and evenly
> > > distributed in time (the green points), the commit COMPLETION events
> > > from the server are observed to be pretty bumpy over time (the blue
> > > points sitting on the red lines). This may not be easily fixable.. So
> > > we still have to live with bumpy NFS commit completions...
> > 
> > Right. The load on the server will ultimately determine the commit
> > latency, and that can _never_ be controlled by the client. We just
> > have to live with it and design the writeback path to prevent
> > commits from blocking writes in as many situations as possible.
>   The question is how hard should we try. Here I believe Fengguang's
> patches can offer more then my approach because he throttles processes
> based on estimated bandwidth so occasional hiccups of the server are more
> "smoothed out".

Right, but the problem Fengguang mentioned was that the estimated
bandwidth was badly affected by uneven commit latency. My point is
that it is something we have no control over.

> If we send commits early enough, hiccups matter less but
> still it's just a matter of how big they are...

Yes - though this only reduces the variance the client sees in
steady state operation.  Realistically, we don't care if one commit
takes 2s for 100MB and the next takes 0.2s for the next 100MB as
long as we've been able to send 50MB/s of writes over the wire
consistently. IOWs, what we need to care about is getting the data
to the server as quickly as possible and decoupling that from the
commit operation.  i.e. we need to maximise and smooth the rate at
which we send dirty pages to the server, not the rate at which we
convert unstable pages to stable. If the server can't handle the
write rate we send it, if will slow downteh rate at which it
processes writes and we get congestion feedback that way (i.e. via
the network channel).

Essentially what I'm trying to say is that I don't think
unstable->clean operations (i.e. the commit) should affect or
control the estimated bandwidth of the channel. A commit is an
operation that can be tuned to optimise throughput, but because of
its variance it's not really an operation that can be used to
directly measure and control that throughput.

It is also worth remembering that some NFS servers return STABLE as
the state of the data in their write response. This transitions the
pages directly from writeback to clean, so there is no unstable
state or need for a commit operation. Hence the bandwidth estimation
in these cases is directly related to the network/protocol
throughput. If we can run background commit operations triggered by
write responses, then we have the same bandwidth estimation
behaviour for writes regardless of whether they return as STABLE or
UNSTABLE on the server...
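
Something along these lines, as a hypothetical sketch only - neither
helper below (unstable_pages(), queue_async_commit()) exists in the
client today, and the 16MB batch size is an arbitrary number picked
for illustration:

struct inode;

enum write_verf { WRITE_UNSTABLE, WRITE_DATA_SYNC, WRITE_FILE_SYNC };

extern unsigned long unstable_pages(struct inode *inode);  /* hypothetical */
extern void queue_async_commit(struct inode *inode);       /* hypothetical */

#define COMMIT_BATCH_PAGES 4096UL   /* ~16MB of 4k pages */

/* called when a WRITE reply arrives from the server */
static void write_reply_done(struct inode *inode, enum write_verf how)
{
    if (how == WRITE_FILE_SYNC)
        return;   /* server already made the data stable */

    /* data is only Unstable: once a modest batch has built up, send
     * a COMMIT in the background without waiting for it to finish */
    if (unstable_pages(inode) >= COMMIT_BATCH_PAGES)
        queue_async_commit(inode);
}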

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: NFS page states & writeback
  2011-03-26  1:18           ` Dave Chinner
@ 2011-03-27 15:26             ` Trond Myklebust
       [not found]               ` <1301239601.22136.23.camel-SyLVLa/KEI9HwK5hSS5vWB2eb7JE58TQ@public.gmane.org>
  0 siblings, 1 reply; 15+ messages in thread
From: Trond Myklebust @ 2011-03-27 15:26 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Wu Fengguang, linux-fsdevel@vger.kernel.org,
	linux-nfs@vger.kernel.org, Andrew Morton, Christoph Hellwig

On Sat, 2011-03-26 at 12:18 +1100, Dave Chinner wrote:
> Yes - though this only reduces the variance the client sees in
> steady state operation.  Realistically, we don't care if one commit
> takes 2s for 100MB and the next takes 0.2s for the next 100MB as
> long as we've been able to send 50MB/s of writes over the wire
> consistently. IOWs, what we need to care about is getting the data
> to the server as quickly as possible and decoupling that from the
> commit operation.  i.e. we need to maximise and smooth the rate at
> which we send dirty pages to the server, not the rate at which we
> convert unstable pages to stable. If the server can't handle the
> write rate we send it, it will slow down the rate at which it
> processes writes and we get congestion feedback that way (i.e. via
> the network channel).
> 
> Essentially what I'm trying to say is that I don't think
> unstable->clean operations (i.e. the commit) should affect or
> control  the estimated bandwidth of the channel. A commit is an
> operation that can be tuned to optimise throughput, but because of
> its variance it's not really an operation that can be used to
> directly measure and control that throughput.

Agreed. However as I have said before, most of the problem here is that
the Linux server is assuming that it should cache the data maximally as
if this were a local process.

Once the NFS client starts flushing data to the server, it is because
the client no longer wants to cache, but rather wants to see the data
put onto stable storage as quickly as possible.
At that point, the server should be focusing on doing the same. It should
not be setting the low water mark at 20% of total memory before starting
writeback, because that means that the COMMIT may have to wait for
several GB of data to hit the platter.
If the water mark were set at, say, 100MB or so, then writeback would be
much smoother...
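
Roughly how that threshold is derived on the server - a simplified
paraphrase of the mm/page-writeback.c logic, not the exact code: the
vm.dirty_background_bytes knob, when non-zero, overrides the
percentage-based vm.dirty_background_ratio, so pinning it at ~100MB
would make background writeback start long before gigabytes of dirty
data pile up.

#define PAGE_SIZE_BYTES 4096UL

static unsigned long background_thresh_pages(unsigned long dirty_background_bytes,
                                             unsigned long dirty_background_ratio,
                                             unsigned long dirtyable_pages)
{
    if (dirty_background_bytes)
        return dirty_background_bytes / PAGE_SIZE_BYTES;

    /* default: a percentage of dirtyable memory, which on a
     * big-memory server is easily several GB */
    return dirtyable_pages * dirty_background_ratio / 100;
}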

> It is also worth remembering that some NFS servers return STABLE as
> the state of the data in their write response. This transitions the
> pages directly from writeback to clean, so there is no unstable
> state or need for a commit operation. Hence the bandwidth estimation
> in these cases is directly related to the network/protocol
> throughput. If we can run background commit operations triggered by
> write responses, then we have the same bandwidth estimation
> behaviour for writes regardless of whether they return as STABLE or
> UNSTABLE on the server...

If the server were doing its job of acting as a glorified disk instead
of trying to act as a caching device, then most of that data should
already be on disk before the client sends the COMMIT.

Trond
-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: NFS page states & writeback
       [not found]               ` <1301239601.22136.23.camel-SyLVLa/KEI9HwK5hSS5vWB2eb7JE58TQ@public.gmane.org>
@ 2011-03-28  0:23                 ` Dave Chinner
  0 siblings, 0 replies; 15+ messages in thread
From: Dave Chinner @ 2011-03-28  0:23 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Jan Kara, Wu Fengguang,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Andrew Morton,
	Christoph Hellwig

On Sun, Mar 27, 2011 at 05:26:41PM +0200, Trond Myklebust wrote:
> On Sat, 2011-03-26 at 12:18 +1100, Dave Chinner wrote:
> > Yes - though this only reduces the variance the client sees in
> > steady state operation.  Realistically, we don't care if one commit
> > takes 2s for 100MB and the next takes 0.2s for the next 100MB as
> > long as we've been able to send 50MB/s of writes over the wire
> > consistently. IOWs, what we need to care about is getting the data
> > to the server as quickly as possible and decoupling that from the
> > commit operation.  i.e. we need to maximise and smooth the rate at
> > which we send dirty pages to the server, not the rate at which we
> > convert unstable pages to stable. If the server can't handle the
> > write rate we send it, it will slow down the rate at which it
> > processes writes and we get congestion feedback that way (i.e. via
> > the network channel).
> > 
> > Essentially what I'm trying to say is that I don't think
> > unstable->clean operations (i.e. the commit) should affect or
> > control  the estimated bandwidth of the channel. A commit is an
> > operation that can be tuned to optimise throughput, but because of
> > its variance it's not really an operation that can be used to
> > directly measure and control that throughput.
> 
> Agreed. However as I have said before, most of the problem here is that
> the Linux server is assuming that it should cache the data maximally as
> if this were a local process.

Yes, that's part of the problem, but it is not limited to Linux NFS
servers. Pretty much any general purpose machine that acts as a NFS
server has this problem to some extent.

> Once the NFS client starts flushing data to the server, it is because
> the client no longer wants to cache, but rather wants to see the data
> put onto stable storage as quickly as possible.

*nod*

> At that point, the server should be focusing on doing the same. It should
> not be setting the low water mark at 20% of total memory before starting
> writeback, because that means that the COMMIT may have to wait for
> several GB of data to hit the platter.
> If the water mark were set at, say, 100MB or so, then writeback would be
> much smoother...

Right, but that's not as easy to do at the NFS server as it sounds.
Besides:

> If the server were doing its job of acting as a glorified disk instead
> of trying to act as a caching device, then most of that data should
> already be on disk before the client sends the COMMIT.

We can't make this assumption about the NFS server's behaviour.
Yes, if the server is optimal, the COMMIT will either be a no-op or
not necessary in the first place. However, there are many servers
out there that are not optimal (and never will be) and so we are
left with attempting to optimise flushing to stable storage from the
client side...

Cheers,

Dave.
-- 
Dave Chinner
david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2011-03-28  0:23 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-03-25  1:28 NFS page states & writeback Jan Kara
2011-03-25  4:47 ` Dave Chinner
2011-03-25  7:11   ` Wu Fengguang
2011-03-25 22:24   ` Jan Kara
     [not found]     ` <20110325222458.GB26932-+0h/O2h83AeN3ZZ/Hiejyg@public.gmane.org>
2011-03-25 23:04       ` Dave Chinner
2011-03-25  7:00 ` Wu Fengguang
2011-03-25  9:39   ` Dave Chinner
2011-03-25 14:22     ` Wu Fengguang
2011-03-25 14:32       ` Wu Fengguang
2011-03-25 18:26       ` Jan Kara
2011-03-25 22:55       ` Dave Chinner
2011-03-25 23:24         ` Jan Kara
2011-03-26  1:18           ` Dave Chinner
2011-03-27 15:26             ` Trond Myklebust
     [not found]               ` <1301239601.22136.23.camel-SyLVLa/KEI9HwK5hSS5vWB2eb7JE58TQ@public.gmane.org>
2011-03-28  0:23                 ` Dave Chinner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).