stable page writes: wait_on_page_writeback and packet signing

linux-fsdevel.vger.kernel.org archive mirror

* stable page writes: wait_on_page_writeback and packet signing
@ 2011-03-09 19:44 Steve French
       [not found] ` <AANLkTinFx9KGKDWSdUvFSvT4S6f9QjBzX=6Uo17oO89+-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 25+ messages in thread
From: Steve French @ 2011-03-09 19:44 UTC (permalink / raw)
  To: linux-cifs, linux-fsdevel
  Cc: Mingming Cao, Jeff Layton

Following up on the discussion about how to avoid the copy into a
temporary buffer when a file system has to sign a page (or list of
pages) that is going to be passed in an iovec to be written to the
network or disk, I noticed that a few file systems do issue
wait_on_page_writeback (nfs in nfs_writepages, for example).
Apparently something similar is being investigated for ext4, for disk
adapters that do CRC checks on data being sent down to the disk.  In
the cifs case it looks like cifs_writepages already does:

if (wbc->sync_mode != WB_SYNC_NONE)
        wait_on_page_writeback(page);

(before sending the list of pages to CIFSSMBWrite2 in fs/cifs/file.c)
and does the end_page_writeback if the write to the server succeeds.
The problem is that when packet signing is enabled we default to
issuing the older CIFSSMBWrite (which will allocate a temporary
buffer, and copy the pages being written into it to make sure the data
being written is stable while calculating the CRC on the packet, which
hurts performance).  It seems like we can simply move the equivalent
of the following check:

if (pTcon->ses->server->sec_mode &
    (SECMODE_SIGN_REQUIRED | SECMODE_SIGN_ENABLED))

and add it to the existing check for WB_SYNC_NONE:

if (wbc->sync_mode != WB_SYNC_NONE)
        wait_on_page_writeback(page);

We would have to call end_page_writeback earlier though (once the
network I/O has been successfully submitted and sent, not when the
server response is received) to avoid holding up page writes for too
long, but only for the case of WB_SYNC_NONE with signing
enabled/required.
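
As a rough illustration of the combined check proposed above, here is a
user-space sketch (not kernel code). The enum, the flag values and the
function names are hypothetical stand-ins for wbc->sync_mode and the
SECMODE_SIGN_* bits; only the boolean logic is taken from the text.

```c
#include <stdbool.h>

/* Hypothetical stand-ins; the values are illustrative, not the kernel's. */
enum sync_mode { SYNC_NONE = 0, SYNC_ALL = 1 };
#define SIGN_ENABLED  0x1u
#define SIGN_REQUIRED 0x2u

/* True if the session has signing enabled or required. */
bool signing_active(unsigned int sec_mode)
{
    return (sec_mode & (SIGN_REQUIRED | SIGN_ENABLED)) != 0;
}

/*
 * Wait for in-flight writeback either because the caller asked for a
 * synchronous flush or because the page must stay stable for signing.
 */
bool must_wait_for_writeback(enum sync_mode mode, unsigned int sec_mode)
{
    return mode != SYNC_NONE || signing_active(sec_mode);
}

/*
 * End writeback early (after the send rather than after the server
 * reply) only for the asynchronous case with signing active.
 */
bool may_end_writeback_after_send(enum sync_mode mode, unsigned int sec_mode)
{
    return mode == SYNC_NONE && signing_active(sec_mode);
}
```

With these definitions, an asynchronous flush on an unsigned session is
the only combination that neither waits nor ends writeback early.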

Have alternative approaches, other than using wait_on_page_writeback,
been considered for solving the stable page write problem in similar
cases (since only about 1 out of 5 linux file systems uses this call
today)?

-- 
Thanks,

Steve

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: stable page writes: wait_on_page_writeback and packet signing
       [not found] ` <AANLkTinFx9KGKDWSdUvFSvT4S6f9QjBzX=6Uo17oO89+-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2011-03-09 21:51   ` Dave Chinner
  2011-03-09 21:58     ` Chris Mason
                       ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: Dave Chinner @ 2011-03-09 21:51 UTC (permalink / raw)
  To: Steve French
  Cc: linux-cifs, linux-fsdevel, Mingming Cao,
	Jeff Layton

On Wed, Mar 09, 2011 at 01:44:24PM -0600, Steve French wrote:
> Following up on the discussion about how to avoid the copy into a
> temporary buffer for the case when a file system has to sign a page
> (or list of pages) that is going to be passed in an iovec to be
> written to the network or disk, I noticed that a few file systems do
> issue wait_on_page_writeback (nfs in nfs_writepages for example).
> Apparently some areas are being investigated to add something similar
> for ext4 for disk adapters that do crc checks on data being sent down
> to the disk.   In the cifs case it looks like cifs_writepages already
> does:
> 
> if (wbc->sync_mode != WB_SYNC_NONE)
>                                 wait_on_page_writeback(page);
> 

cifs_writepages() has a roll-your-own write_cache_pages()
implementation in it, which is why it needs this.

> (before sending the list of pages to CIFSSMBWrite2 in fs/cifs/file.c)
> and does the end_page_writeback if the write to the server succeeds.
> The problem is that when packet signing is enabled we default to
> issuing the older CIFSSMBWrite (which will allocate a temporary
> buffer, and copy the pages being written into it to make sure the data
> being written is stable while calculating the CRC on the packet, which
> hurts performance).  It seems like we can simply move the equivalent
> of the following check:
> 
> if (pTcon->ses->server->sec_mode &
>     (SECMODE_SIGN_REQUIRED | SECMODE_SIGN_ENABLED))
> 
> to add to the existing check for WB_SYNC
> if (wbc->sync_mode != WB_SYNC_NONE)
>                                 wait_on_page_writeback(page);
> 
> We would have to add an end_page_writeback earlier though (after the
> network i/o is successfully submitted to the network and sent, not
> when the server response is received) to avoid holding up page writes
> overly long - but only for the case where WB_SYNC_NONE and signing
> enabled/required.

Sounds like a case for the same dirty page lifecycle as NFS: clean
-> dirty -> writeback -> unstable -> clean. i.e. the page is
unstable after the issuing of the IO until the response from the
server so the page can't be reclaimed while the IO is still in
progress at the server...

> Have alternative approaches, other than using wait_on_page_writeback,
> been considered for solving the stable page write problem in similar
> cases (since only about 1 out of 5 linux file systems uses this call
> today).

I think that is incorrect. write_cache_pages() does:

    lock_page(page);
.....
    if (PageWriteback(page)) {
            if (wbc->sync_mode != WB_SYNC_NONE)
                    wait_on_page_writeback(page);
            else
                    goto continue_unlock;
    }

    BUG_ON(PageWriteback(page));
    if (!clear_page_dirty_for_io(page))
            goto continue_unlock;

    trace_wbc_writepage(wbc, mapping->backing_dev_info);
    ret = (*writepage)(page, wbc, data);

so every filesystem using the generic_writepages code already does
this check and wait before .writepage is called. Hence only the
filesystems that do not use generic_writepages() or
mpage_writepages() need a specific check, and that means most
filesystems are actually waiting on writeback pages correctly.
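
The per-page decision in the excerpt above can be modelled in user space
roughly as follows. This is a sketch under stated assumptions, not the
kernel implementation: struct page_model, struct outcome and
model_write_one_page() are hypothetical names, waiting is modelled as
the writeback flag simply clearing, and ->writepage is assumed to put
the page back under writeback.

```c
#include <stdbool.h>

enum sync_mode { SYNC_NONE, SYNC_ALL };  /* stand-ins for WB_SYNC_* */

struct page_model {
    bool dirty;
    bool writeback;
};

struct outcome {
    bool waited;  /* we blocked on earlier writeback */
    bool wrote;   /* we handed the page to ->writepage */
};

struct outcome model_write_one_page(struct page_model *pg, enum sync_mode mode)
{
    struct outcome out = { false, false };

    if (pg->writeback) {
        if (mode == SYNC_NONE)
            return out;            /* like "goto continue_unlock" */
        pg->writeback = false;     /* models wait_on_page_writeback() */
        out.waited = true;
    }

    if (!pg->dirty)
        return out;                /* models clear_page_dirty_for_io() == 0 */

    pg->dirty = false;
    pg->writeback = true;          /* assumption: ->writepage starts writeback */
    out.wrote = true;
    return out;
}
```

In this model a page that is already under writeback is skipped for
SYNC_NONE and waited on otherwise, which is the behaviour the excerpt
describes for the generic path.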


Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: stable page writes: wait_on_page_writeback and packet signing
  2011-03-09 21:51   ` Dave Chinner
@ 2011-03-09 21:58     ` Chris Mason
  2011-03-09 22:13       ` Steve French
  2011-03-09 23:46       ` Dave Chinner
  2011-03-09 22:01     ` Steve French
  2011-03-09 23:45     ` Jeff Layton
  2 siblings, 2 replies; 25+ messages in thread
From: Chris Mason @ 2011-03-09 21:58 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Steve French, linux-cifs, linux-fsdevel, Mingming Cao,
	Jeff Layton

Excerpts from Dave Chinner's message of 2011-03-09 16:51:48 -0500:
> On Wed, Mar 09, 2011 at 01:44:24PM -0600, Steve French wrote:
> > Have alternative approaches, other than using wait_on_page_writeback,
> > been considered for solving the stable page write problem in similar
> > cases (since only about 1 out of 5 linux file systems uses this call
> > today).
> 
> I think that is incorrect. write_cache_pages() does:
> 
>     lock_page(page);
> .....
>     if (PageWriteback(page)) {
>             if (wbc->sync_mode != WB_SYNC_NONE)
>                     wait_on_page_writeback(page);
>             else
>                     goto continue_unlock;
>     }
>
>     BUG_ON(PageWriteback(page));
>     if (!clear_page_dirty_for_io(page))
>             goto continue_unlock;
>
>     trace_wbc_writepage(wbc, mapping->backing_dev_info);
>     ret = (*writepage)(page, wbc, data);
> 
> so every filesystem using the generic_writepages code already does
> this check and wait before .writepage is called. Hence only the
> filesystems that do not use generic_writepages() or
> mpage_writepages() need a specific check, and that means most
> filesystems are actually waiting on writeback pages correctly.

But checking here just means we don't start writeback on a page that is
already under writeback, which is a good idea but not really related to
stable pages?

Stable pages means we don't let mmap'd pages or file_write muck around
with the pages while they are in writeback, so we need to wait in
file_write and page_mkwrite.
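
To illustrate why a page must not change between signing and sending,
here is a small user-space sketch. The FNV-1a hash is only a toy
stand-in for the real packet signature or device CRC, and both function
names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy 32-bit FNV-1a hash, standing in for a real signature or CRC. */
uint32_t toy_checksum(const unsigned char *buf, size_t len)
{
    uint32_t h = 2166136261u;

    for (size_t i = 0; i < len; i++) {
        h ^= buf[i];
        h *= 16777619u;
    }
    return h;
}

/*
 * Returns true if the checksum computed at signing time still matches
 * the page contents at send time, i.e. the page stayed stable.
 */
bool signature_still_valid(const unsigned char *page, size_t len,
                           uint32_t signed_sum)
{
    return toy_checksum(page, len) == signed_sum;
}
```

If a writer modifies even one byte of the page after toy_checksum() has
been computed, signature_still_valid() reports a mismatch, which is the
failure that waiting in the write and page_mkwrite paths is meant to
prevent.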

-chris

* Re: stable page writes: wait_on_page_writeback and packet signing
  2011-03-09 21:51   ` Dave Chinner
  2011-03-09 21:58     ` Chris Mason
@ 2011-03-09 22:01     ` Steve French
  2011-03-09 23:54       ` Jeff Layton
       [not found]       ` <AANLkTinDmqah6pQnHugoVxh-gDq+6+MDMuh-TyVAQ7LP-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2011-03-09 23:45     ` Jeff Layton
  2 siblings, 2 replies; 25+ messages in thread
From: Steve French @ 2011-03-09 22:01 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-cifs, linux-fsdevel, Mingming Cao, Jeff Layton

On Wed, Mar 9, 2011 at 3:51 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Wed, Mar 09, 2011 at 01:44:24PM -0600, Steve French wrote:
>> Following up on the discussion about how to avoid the copy into a
>> temporary buffer for the case when a file system has to sign a page
>> (or list of pages) that is going to be passed in an iovec to be
>> written to the network or disk, I noticed that a few file systems do
>> issue wait_on_page_writeback (nfs in nfs_writepages for example).
>> Apparently some areas are being investigated to add something similar
>> for ext4 for disk adapters that do crc checks on data being sent down
>> to the disk.   In the cifs case it looks like cifs_writepages already
>> does:
>>
>> if (wbc->sync_mode != WB_SYNC_NONE)
>>                                 wait_on_page_writeback(page);

<snip>

> Sounds like a case for the same dirty page lifecycle as NFS: clean
> -> dirty -> writeback -> unstable -> clean. i.e. the page is
> unstable after the issuing of the IO until the response from the
> server so the page can't be reclaimed while the IO is still in
> progress at the server...

Except we don't need to wait that long with the page locked, i.e. for
a response from the cifs server (such as Samba, Windows or NetApp); we
just need to wait for the data to get on the wire.  Waiting for the
server response would take 10 or 100 times longer.  In any case we
can't resend the same request to the server (the signature changes on
a resend, since the sequence number is incremented on every
request/response, so we have to recalculate the checksum anyway), and
cifs requests can't get lost (as they can with nfs over udp).  Keeping
a page locked for 10 milliseconds seems like a bad idea, but it is a
little more complicated to implement (for the cifs case) ending page
writeback (for the WB_SYNC_NONE case) as quickly as reasonably
possible so we don't kill performance.
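
The point about resends can be sketched in user space as follows. This
is only an illustration: toy_sign() is a hypothetical function that
mixes a sequence number into a toy FNV-1a hash, and it is not the
signing algorithm cifs actually uses.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Toy signature: hash the 32-bit sequence number (low byte first),
 * then the payload. Because the sequence number is part of the input,
 * the same payload signs differently on every request.
 */
uint32_t toy_sign(uint32_t seq, const unsigned char *payload, size_t len)
{
    uint32_t h = 2166136261u;

    for (int i = 0; i < 4; i++) {
        h ^= (seq >> (8 * i)) & 0xffu;
        h *= 16777619u;
    }
    for (size_t i = 0; i < len; i++) {
        h ^= payload[i];
        h *= 16777619u;
    }
    return h;
}
```

Signing the same payload with two different sequence numbers gives two
different values here, so a resend cannot reuse the earlier signature.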


>> Have alternative approaches, other than using wait_on_page_writeback,
>> been considered for solving the stable page write problem in similar
>> cases (since only about 1 out of 5 linux file systems uses this call
>> today).
>
> I think that is incorrect. write_cache_pages() does:
>
>     lock_page(page);
> .....
>     if (PageWriteback(page)) {
>             if (wbc->sync_mode != WB_SYNC_NONE)
>                     wait_on_page_writeback(page);

aaah - right. that makes sense.



-- 
Thanks,

Steve

* Re: stable page writes: wait_on_page_writeback and packet signing
  2011-03-09 21:58     ` Chris Mason
@ 2011-03-09 22:13       ` Steve French
       [not found]         ` <AANLkTikK8MOm-m9XsOA4YGRe=E9bJTDh4iEYXtZumNmv-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2011-03-09 23:46       ` Dave Chinner
  1 sibling, 1 reply; 25+ messages in thread
From: Steve French @ 2011-03-09 22:13 UTC (permalink / raw)
  To: Chris Mason
  Cc: Dave Chinner, linux-cifs, linux-fsdevel, Mingming Cao,
	Jeff Layton

On Wed, Mar 9, 2011 at 3:58 PM, Chris Mason <chris.mason-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote:
> Excerpts from Dave Chinner's message of 2011-03-09 16:51:48 -0500:
>> On Wed, Mar 09, 2011 at 01:44:24PM -0600, Steve French wrote:
>> > Have alternative approaches, other than using wait_on_page_writeback,
>> > been considered for solving the stable page write problem in similar
>> > cases (since only about 1 out of 5 linux file systems uses this call
>> > today).
>>
>> I think that is incorrect. write_cache_pages() does:
>>
>>     lock_page(page);
>> .....
>>     if (PageWriteback(page)) {
>>             if (wbc->sync_mode != WB_SYNC_NONE)
>>                     wait_on_page_writeback(page);
>>             else
>>                     goto continue_unlock;
>>     }
>>
>>     BUG_ON(PageWriteback(page));
>>     if (!clear_page_dirty_for_io(page))
>>             goto continue_unlock;
>>
>>     trace_wbc_writepage(wbc, mapping->backing_dev_info);
>>     ret = (*writepage)(page, wbc, data);
>>
>> so every filesystem using the generic_writepages code already does
>> this check and wait before .writepage is called. Hence only the
>> filesystems that do not use generic_writepages() or
>> mpage_writepages() need a specific check, and that means most
>> filesystems are actually waiting on writeback pages correctly.
>
> But checking here just means we don't start writeback on a page that is
> writeback, which is a good idea but not really related to stable pages?
>
> stable pages means we don't let mmap'd pages or file_write muck around
> with the pages while they are in writeback, so we need to wait in
> file_write and page_mkwrite.

Isn't the file_write case covered by the i_mutex, as
Documentation/filesystems/Locking implies (for write_begin/write_end)?


-- 
Thanks,

Steve

* Re: stable page writes: wait_on_page_writeback and packet signing
  2011-03-09 21:51   ` Dave Chinner
  2011-03-09 21:58     ` Chris Mason
  2011-03-09 22:01     ` Steve French
@ 2011-03-09 23:45     ` Jeff Layton
  2011-03-10  2:12       ` Jeff Layton
  2 siblings, 1 reply; 25+ messages in thread
From: Jeff Layton @ 2011-03-09 23:45 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Steve French, linux-cifs, linux-fsdevel, Mingming Cao

On Thu, 10 Mar 2011 08:51:48 +1100
Dave Chinner <david@fromorbit.com> wrote:
> 
> Sounds like a case for the same dirty page lifecycle as NFS: clean
> -> dirty -> writeback -> unstable -> clean. i.e. the page is
> unstable after the issuing of the IO until the response from the
> server so the page can't be reclaimed while the IO is still in
> progress at the server...
> 

It's a little more complicated than that for NFS. Unstable pages are
ones that have had successful writes but that have not been committed
yet. Once an NFS COMMIT call completes, the page is marked clean and can
be freed by the VM.

Actual writeback in NFS is pretty similar to other filesystems -- the
page is only under writeback until the WRITE response is received. It
just doesn't clear the dirty bit until a COMMIT response is received.

That said, an unstable write model for CIFS is not a bad idea. Just
substitute an SMB_COM_FLUSH for an NFS COMMIT call...

-- 
Jeff Layton <jlayton@redhat.com>

* Re: stable page writes: wait_on_page_writeback and packet signing
  2011-03-09 21:58     ` Chris Mason
  2011-03-09 22:13       ` Steve French
@ 2011-03-09 23:46       ` Dave Chinner
  1 sibling, 0 replies; 25+ messages in thread
From: Dave Chinner @ 2011-03-09 23:46 UTC (permalink / raw)
  To: Chris Mason
  Cc: Steve French, linux-cifs, linux-fsdevel, Mingming Cao,
	Jeff Layton

On Wed, Mar 09, 2011 at 04:58:19PM -0500, Chris Mason wrote:
> Excerpts from Dave Chinner's message of 2011-03-09 16:51:48 -0500:
> > On Wed, Mar 09, 2011 at 01:44:24PM -0600, Steve French wrote:
> > > Have alternative approaches, other than using wait_on_page_writeback,
> > > been considered for solving the stable page write problem in similar
> > > cases (since only about 1 out of 5 linux file systems uses this call
> > > today).
> > 
> > I think that is incorrect. write_cache_pages() does:
> > 
> >     lock_page(page);
> > .....
> >     if (PageWriteback(page)) {
> >             if (wbc->sync_mode != WB_SYNC_NONE)
> >                     wait_on_page_writeback(page);
> >             else
> >                     goto continue_unlock;
> >     }
> >
> >     BUG_ON(PageWriteback(page));
> >     if (!clear_page_dirty_for_io(page))
> >             goto continue_unlock;
> >
> >     trace_wbc_writepage(wbc, mapping->backing_dev_info);
> >     ret = (*writepage)(page, wbc, data);
> > 
> > so every filesystem using the generic_writepages code already does
> > this check and wait before .writepage is called. Hence only the
> > filesystems that do not use generic_writepages() or
> > mpage_writepages() need a specific check, and that means most
> > filesystems are actually waiting on writeback pages correctly.
> 
> But checking here just means we don't start writeback on a page that is
> writeback, which is a good idea but not really related to stable pages?

True - but the context of the original question was w.r.t. use of
wait_on_page_writeback in .writepage[s], which is what I assumed
(based on a quick cscope lookup) the "1 out of 5" was referring
to....

> stable pages means we don't let mmap'd pages or file_write muck around
> with the pages while they are in writeback, so we need to wait in
> file_write and page_mkwrite.

.... as I think it's far fewer than "1 in 5 linux filesystems" that
actually implement these waits to ensure pages stay stable once
under writeback, i.e. only BTRFS does them, IIRC.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: stable page writes: wait_on_page_writeback and packet signing
  2011-03-09 22:01     ` Steve French
@ 2011-03-09 23:54       ` Jeff Layton
       [not found]         ` <20110309185427.7858c29b-4QP7MXygkU+dMjc06nkz3ljfA9RmPOcC@public.gmane.org>
       [not found]       ` <AANLkTinDmqah6pQnHugoVxh-gDq+6+MDMuh-TyVAQ7LP-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 1 reply; 25+ messages in thread
From: Jeff Layton @ 2011-03-09 23:54 UTC (permalink / raw)
  To: Steve French; +Cc: Dave Chinner, linux-cifs, linux-fsdevel, Mingming Cao

On Wed, 9 Mar 2011 16:01:30 -0600
Steve French <smfrench@gmail.com> wrote:

> 
> Except we don't need to wait that long with the page locked
> ie for a response from the cifs server (such as Samba or Windows
> or NetApp), just need to wait for it to get on the wire.
> Waiting for us to get the server response would
> take 10 or 100 times longer.   In any case we can't resend
> the same request to the server (the signature changes on the
> resend since the sequence number is incremented on every
> request/response so we have to recalc the checksum anyway) and
> cifs requests can't get lost (as with nfs over udp).  Keeping
> a page locked for 10milliseconds seems like a bad idea - but
> it is a little more complicated to implement (for the cifs case)
> so that we end page writeback (for the non-WB_SYNC)
> as quickly as reasonably possible so we don't kill perf.
> 

The problem here is that the socket layer doesn't have a mechanism
to notify us of a TCP ACK. So, we have to wait for the next-best thing
-- a response from the server.

-- 
Jeff Layton <jlayton@redhat.com>

* Re: stable page writes: wait_on_page_writeback and packet signing
       [not found]         ` <20110309185427.7858c29b-4QP7MXygkU+dMjc06nkz3ljfA9RmPOcC@public.gmane.org>
@ 2011-03-10  0:33           ` Steve French
       [not found]             ` <AANLkTi=pXHjE6tNMm0_nO=Cn3nGH8oZ6Xhm1STh8x1Xe-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 25+ messages in thread
From: Steve French @ 2011-03-10  0:33 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Dave Chinner, linux-cifs, linux-fsdevel,
	Mingming Cao

On Wed, Mar 9, 2011 at 5:54 PM, Jeff Layton <jlayton-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> On Wed, 9 Mar 2011 16:01:30 -0600
> Steve French <smfrench-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>
>>
>> Except we don't need to wait that long with the page locked
>> ie for a response from the cifs server (such as Samba or Windows
>> or NetApp), just need to wait for it to get on the wire.
>> Waiting for us to get the server response would
>> take 10 or 100 times longer.   In any case we can't resend
>> the same request to the server (the signature changes on the
>> resend since the sequence number is incremented on every
>> request/response so we have to recalc the checksum anyway) and
>> cifs requests can't get lost (as with nfs over udp).  Keeping
>> a page locked for 10milliseconds seems like a bad idea - but
>> it is a little more complicated to implement (for the cifs case)
>> so that we end page writeback (for the non-WB_SYNC)
>> as quickly as reasonably possible so we don't kill perf.
>>
>
> The problem here is that the socket layer doesn't have a mechanism
> to notify us of a TCP ACK. So, we have to wait for the next-best thing
> -- a response from the server.

But ... we can stop writeback as soon as kernel_sendmsg returns. Once
we return from kernel_sendmsg the buffers can (and often will) be
freed, so we know those pages could not still be in use by tcp (below
cifs) at that point.  We can minimize the delay further by making sure
we set TCP_NODELAY on the socket (we probably ought to make that the
default instead of an option).





-- 
Thanks,

Steve

* Re: stable page writes: wait_on_page_writeback and packet signing
       [not found]             ` <AANLkTi=pXHjE6tNMm0_nO=Cn3nGH8oZ6Xhm1STh8x1Xe-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2011-03-10  1:30               ` Jeff Layton
  2011-03-10 13:53                 ` Steve French
       [not found]                 ` <20110309203044.4fd0498e-4QP7MXygkU+dMjc06nkz3ljfA9RmPOcC@public.gmane.org>
  0 siblings, 2 replies; 25+ messages in thread
From: Jeff Layton @ 2011-03-10  1:30 UTC (permalink / raw)
  To: Steve French
  Cc: Dave Chinner, linux-cifs, linux-fsdevel,
	Mingming Cao

On Wed, 9 Mar 2011 18:33:20 -0600
Steve French <smfrench-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:

> On Wed, Mar 9, 2011 at 5:54 PM, Jeff Layton <jlayton-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > On Wed, 9 Mar 2011 16:01:30 -0600
> > Steve French <smfrench-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >
> >>
> >> Except we don't need to wait that long with the page locked
> >> ie for a response from the cifs server (such as Samba or Windows
> >> or NetApp), just need to wait for it to get on the wire.
> >> Waiting for us to get the server response would
> >> take 10 or 100 times longer.   In any case we can't resend
> >> the same request to the server (the signature changes on the
> >> resend since the sequence number is incremented on every
> >> request/response so we have to recalc the checksum anyway) and
> >> cifs requests can't get lost (as with nfs over udp).  Keeping
> >> a page locked for 10milliseconds seems like a bad idea - but
> >> it is a little more complicated to implement (for the cifs case)
> >> so that we end page writeback (for the non-WB_SYNC)
> >> as quickly as reasonably possible so we don't kill perf.
> >>
> >
> > The problem here is that the socket layer doesn't have a mechanism
> > to notify us of a TCP ACK. So, we have to wait for the next-best thing
> > -- a response from the server.
> 
> But ... we can stop writeback as soon as kernel_sendmsg returns - once
> we return from kernel_sendmsg the buffers can (and often will) be
> freed so we know those pages could not still be used by tcp  (below
> cifs) once kernel_sendmsg returns.  We can minimize the delay further
> by making sure we set TCP_NODELAY on the socket (we probably ought to
> make that the default instead of an option).
> 

That's not correct. A return from kernel_sendmsg just means that the
data has been buffered up, not that it has been sent and acked. We
shouldn't use that as an indicator to mean that the pages no longer
need to be stable.
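
The "buffered, not yet delivered" behaviour can be observed from user
space with a local socket pair. This is only an analogy: it uses
send() on an AF_UNIX stream socket rather than kernel_sendmsg() on TCP,
and send_completes_before_peer_reads() is a hypothetical helper name.
It shows that a successful send only means the bytes were queued, and
says nothing about TCP acknowledgements.

```c
#include <stdbool.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Returns true if send() reported success before the peer had read
 * anything, and the peer could then still read the queued bytes.
 */
bool send_completes_before_peer_reads(void)
{
    int sv[2];
    const char msg[] = "data";
    char buf[sizeof(msg)];
    bool ok = false;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return false;

    /* send() returns once the bytes are queued; nobody has read yet. */
    ssize_t sent = send(sv[0], msg, sizeof(msg), 0);
    if (sent == (ssize_t)sizeof(msg)) {
        /* The peer reads only now: the data sat in the socket buffer. */
        ssize_t got = recv(sv[1], buf, sizeof(buf), 0);
        ok = (got == sent) && memcmp(buf, msg, sizeof(msg)) == 0;
    }

    close(sv[0]);
    close(sv[1]);
    return ok;
}
```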

-- 
Jeff Layton <jlayton@redhat.com>

* Re: stable page writes: wait_on_page_writeback and packet signing
       [not found]       ` <AANLkTinDmqah6pQnHugoVxh-gDq+6+MDMuh-TyVAQ7LP-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2011-03-10  1:41         ` Trond Myklebust
       [not found]           ` <1299721264.2976.3.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
  0 siblings, 1 reply; 25+ messages in thread
From: Trond Myklebust @ 2011-03-10  1:41 UTC (permalink / raw)
  To: Steve French
  Cc: Dave Chinner, linux-cifs, linux-fsdevel,
	Mingming Cao, Jeff Layton

On Wed, 2011-03-09 at 16:01 -0600, Steve French wrote: 
> On Wed, Mar 9, 2011 at 3:51 PM, Dave Chinner <david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org> wrote:
> > On Wed, Mar 09, 2011 at 01:44:24PM -0600, Steve French wrote:
> >> Following up on the discussion about how to avoid the copy into a
> >> temporary buffer for the case when a file system has to sign a page
> >> (or list of pages) that is going to be passed in an iovec to be
> >> written to the network or disk, I noticed that a few file systems do
> >> issue wait_on_page_writeback (nfs in nfs_writepages for example).
> >> Apparently some areas are being investigated to add something similar
> >> for ext4 for disk adapters that do crc checks on data being sent down
> >> to the disk.   In the cifs case it looks like cifs_writepages already
> >> does:
> >>
> >> if (wbc->sync_mode != WB_SYNC_NONE)
> >>                                 wait_on_page_writeback(page);
> 
> <snip>
> 
> > Sounds like a case for the same dirty page lifecycle as NFS: clean
> > -> dirty -> writeback -> unstable -> clean. i.e. the page is
> > unstable after the issuing of the IO until the response from the
> > server so the page can't be reclaimed while the IO is still in
> > progress at the server...
> 
> Except we don't need to wait that long with the page locked
> ie for a response from the cifs server (such as Samba or Windows
> or NetApp), just need to wait for it to get on the wire.
> Waiting for us to get the server response would
> take 10 or 100 times longer.   In any case we can't resend
> the same request to the server (the signature changes on the
> resend since the sequence number is incremented on every
> request/response so we have to recalc the checksum anyway) and
> cifs requests can't get lost (as with nfs over udp).  Keeping
> a page locked for 10milliseconds seems like a bad idea - but
> it is a little more complicated to implement (for the cifs case)
> so that we end page writeback (for the non-WB_SYNC)
> as quickly as reasonably possible so we don't kill perf.

So what if the server crashes, or you get some other transient error?

The NFS unstable write mechanism is there in order to deal with
imperfect servers that occasionally crash and lose cached data. If all
we had to deal with was perfect situations where all WRITE requests
succeed, then life would be much simpler...

Trond
-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org
www.netapp.com

* Re: stable page writes: wait_on_page_writeback and packet signing
  2011-03-09 23:45     ` Jeff Layton
@ 2011-03-10  2:12       ` Jeff Layton
  0 siblings, 0 replies; 25+ messages in thread
From: Jeff Layton @ 2011-03-10  2:12 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Steve French, linux-cifs, linux-fsdevel, Mingming Cao

On Wed, 9 Mar 2011 18:45:42 -0500
Jeff Layton <jlayton@redhat.com> wrote:

> On Thu, 10 Mar 2011 08:51:48 +1100
> Dave Chinner <david@fromorbit.com> wrote:
> > 
> > Sounds like a case for the same dirty page lifecycle as NFS: clean
> > -> dirty -> writeback -> unstable -> clean. i.e. the page is
> > unstable after the issuing of the IO until the response from the
> > server so the page can't be reclaimed while the IO is still in
> > progress at the server...
> > 
> 
> It's a little more complicated than that for NFS. Unstable pages are
> ones that have had successful writes but that have not been committed
> yet. Once a NFS COMMIT call completes, the page is marked clean and can
> be freed by the VM.
> 
> Actual writeback in NFS is pretty similar to other filesystems -- the
> page is only under writeback until the WRITE response is received. It
> just doesn't clear the dirty bit until a COMMIT response is received.
> 

Sorry, that's incorrect... NFS does clear the dirty bit after writeback,
but doesn't mark the inode clean until all pages have been committed.

Either way though, that really has little to do with keeping the pages
stable while sending them, but more to do with the fact that the server
can buffer up writes and then later crash and lose them. A successful
COMMIT means that the writes got committed to stable storage.

FWIW, CIFS is currently vulnerable to that problem too, so an unstable
write model isn't a bad idea, IMO.
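
The simplified lifecycle discussed in this thread (clean -> dirty ->
writeback -> unstable -> clean) can be written down as a small state
machine. This is a sketch of the idea only: the state and event names
are hypothetical, it ignores the per-inode bookkeeping mentioned above,
and the transition back to dirty when the server loses data is an
assumption about how a client would recover.

```c
enum page_state { ST_CLEAN, ST_DIRTY, ST_WRITEBACK, ST_UNSTABLE };

enum page_event {
    EV_MODIFY,            /* data written into the page */
    EV_START_WRITE,       /* WRITE request issued */
    EV_WRITE_OK,          /* server acknowledged the WRITE */
    EV_COMMIT_OK,         /* server confirmed data is on stable storage */
    EV_SERVER_LOST_DATA   /* server crashed before committing */
};

/* Returns the next state; events that do not apply leave it unchanged. */
enum page_state next_state(enum page_state s, enum page_event e)
{
    switch (s) {
    case ST_CLEAN:
        return e == EV_MODIFY ? ST_DIRTY : s;
    case ST_DIRTY:
        return e == EV_START_WRITE ? ST_WRITEBACK : s;
    case ST_WRITEBACK:
        if (e == EV_WRITE_OK)
            return ST_UNSTABLE;
        if (e == EV_SERVER_LOST_DATA)
            return ST_DIRTY;      /* assumption: must be written again */
        return s;
    case ST_UNSTABLE:
        if (e == EV_COMMIT_OK)
            return ST_CLEAN;
        if (e == EV_SERVER_LOST_DATA)
            return ST_DIRTY;      /* assumption: must be written again */
        return s;
    }
    return s;
}
```

For CIFS, the suggestion earlier in the thread would map EV_COMMIT_OK
to a successful SMB_COM_FLUSH.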

-- 
Jeff Layton <jlayton@redhat.com>

* Re: stable page writes: wait_on_page_writeback and packet signing
       [not found]           ` <1299721264.2976.3.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
@ 2011-03-10  7:34             ` Christoph Hellwig
  2011-03-10 13:44             ` Steve French
  1 sibling, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2011-03-10  7:34 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Steve French, Dave Chinner, linux-cifs,
	linux-fsdevel, Mingming Cao, Jeff Layton

On Wed, Mar 09, 2011 at 08:41:03PM -0500, Trond Myklebust wrote:
> So what if the server crashes, or you get some other transient error?

I don't think the CIFS world cares.  E.g. Samba in its default
configuration never even bothers to do an fsync on the underlying server,
which to me implies data integrity is not a concern for usual CIFS
setups.

* Re: stable page writes: wait_on_page_writeback and packet signing
       [not found]         ` <AANLkTikK8MOm-m9XsOA4YGRe=E9bJTDh4iEYXtZumNmv-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2011-03-10 12:26           ` Chris Mason
  2011-03-10 13:16             ` Jeff Layton
  0 siblings, 1 reply; 25+ messages in thread
From: Chris Mason @ 2011-03-10 12:26 UTC (permalink / raw)
  To: Steve French
  Cc: Dave Chinner, linux-cifs, linux-fsdevel, Mingming Cao,
	Jeff Layton

Excerpts from Steve French's message of 2011-03-09 17:13:06 -0500:
> On Wed, Mar 9, 2011 at 3:58 PM, Chris Mason <chris.mason@oracle.com> wrote:
> > Excerpts from Dave Chinner's message of 2011-03-09 16:51:48 -0500:
> >> On Wed, Mar 09, 2011 at 01:44:24PM -0600, Steve French wrote:
> >> > Have alternative approaches, other than using wait_on_page_writeback,
> >> > been considered for solving the stable page write problem in similar
> >> > cases (since only about 1 out of 5 linux file systems uses this call
> >> > today).
> >>
> >> I think that is incorrect. write_cache_pages() does:
> >>
> >>  929                         lock_page(page);
> >> .....
> >>  950                         if (PageWriteback(page)) {
> >>  951                                 if (wbc->sync_mode != WB_SYNC_NONE)
> >>  952                                         wait_on_page_writeback(page);
> >>  953                                 else
> >>  954                                         goto continue_unlock;
> >>  955                         }
> >>  956
> >>  957                         BUG_ON(PageWriteback(page));
> >>  958                         if (!clear_page_dirty_for_io(page))
> >>  959                                 goto continue_unlock;
> >>  960
> >>  961                         trace_wbc_writepage(wbc, mapping->backing_dev_info);
> >>  962                         ret = (*writepage)(page, wbc, data);
> >>
> >> so every filesystem using the generic_writepages code already does
> >> this check and wait before .writepage is called. Hence only the
> >> filesystems that do not use generic_writepages() or
> >> mpage_writepages() need a specific check, and that means most
> >> filesystems are actually waiting on writeback pages correctly.
> >
> > But checking here just means we don't start writeback on a page that is
> > writeback, which is a good idea but not really related to stable pages?
> >
> > stable pages means we don't let mmap'd pages or file_write muck around
> > with the pages while they are in writeback, so we need to wait in
> > file_write and page_mkwrite.
> 
> Isn't the file_write case covered by the i_mutex as
> Documentation/filesystems/Locking implies (for write_begin/write_end).
> 

Does cifs take i_mutex before writepage?  The disk based filesystems
don't.  So, i_mutex protects file_write from other procs jumping into
file_write, but it doesn't protect writeback from file_write jumping in
and changing the pages while they are being sent to storage (or over the
wire).

Basically the model needs to be:

file_write:
	lock the page
	wait on page writeback

	< new writeback cannot start because of the page lock >
	copy_from_user
	unlock the page

We also use page_mkwrite to get notified when userland wants to change
some page it has given to mmap.  That needs to wait on page writeback as
well.
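
The ordering in the model above can be sketched in user space as follows. This is a toy, not kernel code: struct toy_page and the toy_* functions are hypothetical names, and because nothing here can sleep, the places where the kernel would block in lock_page() or wait_on_page_writeback() return -1 instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 64

/* Hypothetical stand-in for struct page; not the kernel type. */
struct toy_page {
	bool locked;     /* models the page lock */
	bool writeback;  /* models the writeback flag */
	bool dirty;
	char data[TOY_PAGE_SIZE];
};

/*
 * Models file_write on one page.  Returns 0 on success, or -1 where the
 * kernel would sleep (page locked or under writeback) or the range is bad.
 */
int toy_file_write(struct toy_page *pg, size_t off, const char *buf, size_t len)
{
	assert(pg != NULL && buf != NULL);
	if (off > TOY_PAGE_SIZE || len > TOY_PAGE_SIZE - off)
		return -1;
	if (pg->locked)
		return -1;              /* lock_page() would sleep here */
	pg->locked = true;              /* lock the page */
	if (pg->writeback) {            /* wait on page writeback */
		pg->locked = false;
		return -1;
	}
	/* new writeback cannot start: toy_start_writeback() checks ->locked */
	memcpy(pg->data + off, buf, len);   /* copy_from_user */
	pg->dirty = true;
	pg->locked = false;             /* unlock the page */
	return 0;
}

/* Models the writeback side, which also needs the page lock to start. */
int toy_start_writeback(struct toy_page *pg)
{
	assert(pg != NULL);
	if (pg->locked || !pg->dirty)
		return -1;
	pg->locked = true;
	pg->dirty = false;      /* corresponds to clear_page_dirty_for_io */
	pg->writeback = true;   /* page is now being written */
	pg->locked = false;     /* lock dropped once writeback is set */
	return 0;
}

/* Models I/O completion. */
void toy_end_writeback(struct toy_page *pg)
{
	assert(pg != NULL);
	pg->writeback = false;
}
```

With this ordering a write that arrives while the page is under writeback is refused (in the kernel: delayed) until toy_end_writeback() has run, so the data being sent never changes underneath the I/O.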

-chris

* Re: stable page writes: wait_on_page_writeback and packet signing
  2011-03-10 12:26           ` Chris Mason
@ 2011-03-10 13:16             ` Jeff Layton
       [not found]               ` <20110310081638.0f8275d4-xSBYVWDuneFaJnirhKH9O4GKTjYczspe@public.gmane.org>
  0 siblings, 1 reply; 25+ messages in thread
From: Jeff Layton @ 2011-03-10 13:16 UTC (permalink / raw)
  To: Chris Mason
  Cc: Steve French, Dave Chinner, linux-cifs, linux-fsdevel,
	Mingming Cao

On Thu, 10 Mar 2011 04:26:31 -0800 (PST)
Chris Mason <chris.mason@oracle.com> wrote:

> Excerpts from Steve French's message of 2011-03-09 17:13:06 -0500:
> > On Wed, Mar 9, 2011 at 3:58 PM, Chris Mason <chris.mason@oracle.com> wrote:
> > > Excerpts from Dave Chinner's message of 2011-03-09 16:51:48 -0500:
> > >> On Wed, Mar 09, 2011 at 01:44:24PM -0600, Steve French wrote:
> > >> > Have alternative approaches, other than using wait_on_page_writeback,
> > >> > been considered for solving the stable page write problem in similar
> > >> > cases (since only about 1 out of 5 linux file systems uses this call
> > >> > today).
> > >>
> > >> I think that is incorrect. write_cache_pages() does:
> > >>
> > >>  929                         lock_page(page);
> > >> .....
> > >>  950                         if (PageWriteback(page)) {
> > >>  951                                 if (wbc->sync_mode != WB_SYNC_NONE)
> > >>  952                                         wait_on_page_writeback(page);
> > >>  953                                 else
> > >>  954                                         goto continue_unlock;
> > >>  955                         }
> > >>  956
> > >>  957                         BUG_ON(PageWriteback(page));
> > >>  958                         if (!clear_page_dirty_for_io(page))
> > >>  959                                 goto continue_unlock;
> > >>  960
> > >>  961                         trace_wbc_writepage(wbc, mapping->backing_dev_info);
> > >>  962                         ret = (*writepage)(page, wbc, data);
> > >>
> > >> so every filesystem using the generic_writepages code already does
> > >> this check and wait before .writepage is called. Hence only the
> > >> filesystems that do not use generic_writepages() or
> > >> mpage_writepages() need a specific check, and that means most
> > >> filesystems are actually waiting on writeback pages correctly.
> > >
> > > But checking here just means we don't start writeback on a page that is
> > > writeback, which is a good idea but not really related to stable pages?
> > >
> > > stable pages means we don't let mmap'd pages or file_write muck around
> > > with the pages while they are in writeback, so we need to wait in
> > > file_write and page_mkwrite.
> > 
> > Isn't the file_write case covered by the i_mutex as
> > Documentation/filesystems/Locking implies (for write_begin/write_end).
> > 
> 
> Does cifs take i_mutex before writepage? The disk based filesystems
> don't.  So, i_mutex protects file_write from other procs jumping into
> file_write, but it doesn't protect writeback from file_write jumping in
> and changing the pages while they are being sent to storage (or over the
> wire).
> 
> Basically the model needs to be:
> 
> file_write:
> 	lock the page
> 	wait on page writeback
> 
> 	< new writeback cannot start because of the page lock >
> 	copy_from_user
> 	unlock the page
> 
> We also use page_mkwrite to get notified when userland wants to change
> some page it has given to mmap.  That needs to wait on page writeback as
> well.
> 

No, cifs doesn't take the i_mutex in writepage, but the page is locked.
cifs_write_begin calls grab_cache_page_write_begin, which returns a
locked page and it's not unlocked until cifs_write_end.

So I'm not sure I understand the potential race here. A normal
write_begin/end file write will block on the page lock, and the page is
locked during any writeback (either via writepage or writepages).

The only real "danger" is from processes that have the page mmapped as
they don't care about the page lock at all. A page_mkwrite routine that
does a wait_on_page_writeback should prevent that however.

-- 
Jeff Layton <jlayton@redhat.com>
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

* Re: stable page writes: wait_on_page_writeback and packet signing
       [not found]               ` <20110310081638.0f8275d4-xSBYVWDuneFaJnirhKH9O4GKTjYczspe@public.gmane.org>
@ 2011-03-10 13:32                 ` Chris Mason
  2011-03-10 13:47                   ` Jeff Layton
  0 siblings, 1 reply; 25+ messages in thread
From: Chris Mason @ 2011-03-10 13:32 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Steve French, Dave Chinner, linux-cifs, linux-fsdevel,
	Mingming Cao

Excerpts from Jeff Layton's message of 2011-03-10 08:16:38 -0500:
> On Thu, 10 Mar 2011 04:26:31 -0800 (PST)
> Chris Mason <chris.mason@oracle.com> wrote:
> 
> > Excerpts from Steve French's message of 2011-03-09 17:13:06 -0500:
> > > On Wed, Mar 9, 2011 at 3:58 PM, Chris Mason <chris.mason@oracle.com> wrote:
> > > > Excerpts from Dave Chinner's message of 2011-03-09 16:51:48 -0500:
> > > >> On Wed, Mar 09, 2011 at 01:44:24PM -0600, Steve French wrote:
> > > >> > Have alternative approaches, other than using wait_on_page_writeback,
> > > >> > been considered for solving the stable page write problem in similar
> > > >> > cases (since only about 1 out of 5 linux file systems uses this call
> > > >> > today).
> > > >>
> > > >> I think that is incorrect. write_cache_pages() does:
> > > >>
> > > >> Â 929 Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  lock_page(page);
> > > >> .....
> > > >> Â 950 Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  if (PageWriteback(page)) {
> > > >> Â 951 Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  if (wbc->sync_mode != WB_SYNC_NONE)
> > > >> Â 952 Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  wait_on_page_writeback(page);
> > > >> Â 953 Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  else
> > > >> Â 954 Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  goto continue_unlock;
> > > >> Â 955 Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  }
> > > >> Â 956
> > > >> Â 957 Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  BUG_ON(PageWriteback(page));
> > > >> Â 958 Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  if (!clear_page_dirty_for_io(page))
> > > >> Â 959 Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  goto continue_unlock;
> > > >> Â 960
> > > >> Â 961 Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  trace_wbc_writepage(wbc, mapping->backing_dev_info);
> > > >> Â 962 Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  ret = (*writepage)(page, wbc, data);
> > > >>
> > > >> so every filesystem using the generic_writepages code already does
> > > >> this check and wait before .writepage is called. Hence only the
> > > >> filesystems that do not use generic_writepages() or
> > > >> mpage_writepages() need a specific check, and that means most
> > > >> filesystems are actually waiting on writeback pages correctly.
> > > >
> > > > But checking here just means we don't start writeback on a page that is
> > > > writeback, which is a good idea but not really related to stable pages?
> > > >
> > > > stable pages means we don't let mmap'd pages or file_write muck around
> > > > with the pages while they are in writeback, so we need to wait in
> > > > file_write and page_mkwrite.
> > > 
> > > Isn't the file_write case covered by the i_mutex as
> > > Documentation/filesystems/Locking implies (for write_begin/write_end).
> > > 
> > 
> > Does cifs take i_mutex before writepage? The disk based filesystems
> > don't.  So, i_mutex protects file_write from other procs jumping into
> > file_write, but it doesn't protect writeback from file_write jumping in
> > and changing the pages while they are being sent to storage (or over the
> > wire).
> > 
> > Basically the model needs to be:
> > 
> > file_write:
> >     lock the page
> >     wait on page writeback
> > 
> >     < new writeback cannot start because of the page lock >
> >     copy_from_user
> >     unlock the page
> > 
> > We also use page_mkwrite to get notified when userland wants to change
> > some page it has given to mmap.  That needs to wait on page writeback as
> > well.
> > 
> 
> No, cifs doesn't take the i_mutex in writepage, but the page is locked.
> cifs_write_begin calls grab_cache_page_write_begin, which returns a
> locked page and it's not unlocked until cifs_write_end.

Ah ok, so you've got the page locked the whole time it is being sent
over the wire?  The disk based filesystems split it and drop the page
lock once the page is set writeback, which is why we need the extra
waits.

So in your case you should just need a page_mkwrite that locks the page.

-chris

* Re: stable page writes: wait_on_page_writeback and packet signing
       [not found]           ` <1299721264.2976.3.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
  2011-03-10  7:34             ` Christoph Hellwig
@ 2011-03-10 13:44             ` Steve French
  1 sibling, 0 replies; 25+ messages in thread
From: Steve French @ 2011-03-10 13:44 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Dave Chinner, linux-cifs, linux-fsdevel,
	Mingming Cao, Jeff Layton

On Wed, Mar 9, 2011 at 7:41 PM, Trond Myklebust
<Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org> wrote:
> On Wed, 2011-03-09 at 16:01 -0600, Steve French wrote:
>> On Wed, Mar 9, 2011 at 3:51 PM, Dave Chinner <david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org> wrote:
>> > On Wed, Mar 09, 2011 at 01:44:24PM -0600, Steve French wrote:
>> >> Following up on the discussion about how to avoid the copy into a
>> >> temporary buffer for the case when a file system has to sign a page
>> >> (or list of pages) that is going to be passed in an iovec to be
>> >> written to the network or disk, I noticed that a few file systems do
>> >> issue wait_on_page_writeback (nfs in nfs_writepages for example).
>> >> Apparently some areas are being investigated to add something similar
>> >> for ext4 for disk adapters that do crc checks on data being sent down
>> >> to the disk.   In the cifs case it looks like cifs_writepages already
>> >> does:
>> >>
>> >> if (wbc->sync_mode != WB_SYNC_NONE)
>> >>                                 wait_on_page_writeback(page);
>>
>> <snip>
>>
>> > Sounds like a case for the same dirty page lifecycle as NFS: clean
>> > -> dirty -> writeback -> unstable -> clean. i.e. the page is
>> > unstable after the issuing of the IO until the response from the
>> > server so the page can't be reclaimed while the IO is still in
>> > progress at the server...
>>
>> Except we don't need to wait that long with the page locked
>> ie for a response from the cifs server (such as Samba or Windows
>> or NetApp), just need to wait for it to get on the wire.
>> Waiting for us to get the server response would
>> take 10 or 100 times longer.   In any case we can't resend
>> the same request to the server (the signature changes on the
>> resend since the sequence number is incremented on every
>> request/response so we have to recalc the checksum anyway) and
>> cifs requests can't get lost (as with nfs over udp).  Keeping
>> a page locked for 10milliseconds seems like a bad idea - but
>> it is a little more complicated to implement (for the cifs case)
>> so that we end page writeback (for the non-WB_SYNC)
>> as quickly as reasonably possible so we don't kill perf.
>
> So what if the server crashes, or you get some other transient error?
>
> The NFS unstable write mechanism is there in order to deal with
> imperfect servers that occasionally crash and lose cached data. If all
> we had to deal with was perfect situations where all WRITE requests
> succeed, then life would be much simpler...

If the server crashes, we have to send a new request with a different
sequence number, so the signature changes.  It therefore doesn't
matter if the page was mmapped and modified in the meantime: we have
to recalculate the CRC anyway.
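
To illustrate why a resend always needs a fresh signature, here is a toy signing function that covers both the sequence number and the payload. It is a sketch under loud assumptions: FNV-1a is a non-cryptographic stand-in for the real signing hash, there is no session key, and toy_sign is a hypothetical name, not a cifs function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* FNV-1a, 64-bit: a stand-in only, not the hash the protocol uses. */
static uint64_t toy_fnv1a(uint64_t h, const void *buf, size_t len)
{
	const unsigned char *p = buf;
	size_t i;

	for (i = 0; i < len; i++) {
		h ^= p[i];
		h *= 0x100000001b3ULL;   /* FNV prime */
	}
	return h;
}

/* Sign (sequence number, payload); the sequence number is hashed first. */
uint64_t toy_sign(uint32_t seq, const void *payload, size_t len)
{
	unsigned char seq_bytes[4];
	uint64_t h = 0xcbf29ce484222325ULL;   /* FNV offset basis */

	assert(payload != NULL || len == 0);

	/* serialise the sequence number in a fixed byte order */
	seq_bytes[0] = (unsigned char)(seq & 0xff);
	seq_bytes[1] = (unsigned char)((seq >> 8) & 0xff);
	seq_bytes[2] = (unsigned char)((seq >> 16) & 0xff);
	seq_bytes[3] = (unsigned char)((seq >> 24) & 0xff);

	h = toy_fnv1a(h, seq_bytes, sizeof(seq_bytes));
	return toy_fnv1a(h, payload, len);
}
```

Because the sequence number feeds the hash, the same page contents sent as request N and again as request N+1 produce different signatures, so a signature computed for the first attempt cannot be reused for the retry.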



-- 
Thanks,

Steve

* Re: stable page writes: wait_on_page_writeback and packet signing
  2011-03-10 13:32                 ` Chris Mason
@ 2011-03-10 13:47                   ` Jeff Layton
       [not found]                     ` <20110310084724.658fe5d7-xSBYVWDuneFaJnirhKH9O4GKTjYczspe@public.gmane.org>
  0 siblings, 1 reply; 25+ messages in thread
From: Jeff Layton @ 2011-03-10 13:47 UTC (permalink / raw)
  To: Chris Mason
  Cc: Steve French, Dave Chinner, linux-cifs, linux-fsdevel,
	Mingming Cao

On Thu, 10 Mar 2011 08:32:09 -0500
Chris Mason <chris.mason@oracle.com> wrote:

> Excerpts from Jeff Layton's message of 2011-03-10 08:16:38 -0500:
> > On Thu, 10 Mar 2011 04:26:31 -0800 (PST)
> > Chris Mason <chris.mason@oracle.com> wrote:
> > 
> > > Excerpts from Steve French's message of 2011-03-09 17:13:06 -0500:
> > > > On Wed, Mar 9, 2011 at 3:58 PM, Chris Mason <chris.mason@oracle.com> wrote:
> > > > > Excerpts from Dave Chinner's message of 2011-03-09 16:51:48 -0500:
> > > > >> On Wed, Mar 09, 2011 at 01:44:24PM -0600, Steve French wrote:
> > > > >> > Have alternative approaches, other than using wait_on_page_writeback,
> > > > >> > been considered for solving the stable page write problem in similar
> > > > >> > cases (since only about 1 out of 5 linux file systems uses this call
> > > > >> > today).
> > > > >>
> > > > >> I think that is incorrect. write_cache_pages() does:
> > > > >>
> > > > >>  929                         lock_page(page);
> > > > >> .....
> > > > >>  950                         if (PageWriteback(page)) {
> > > > >>  951                                 if (wbc->sync_mode != WB_SYNC_NONE)
> > > > >>  952                                         wait_on_page_writeback(page);
> > > > >>  953                                 else
> > > > >>  954                                         goto continue_unlock;
> > > > >>  955                         }
> > > > >>  956
> > > > >>  957                         BUG_ON(PageWriteback(page));
> > > > >>  958                         if (!clear_page_dirty_for_io(page))
> > > > >>  959                                 goto continue_unlock;
> > > > >>  960
> > > > >>  961                         trace_wbc_writepage(wbc, mapping->backing_dev_info);
> > > > >>  962                         ret = (*writepage)(page, wbc, data);
> > > > >>
> > > > >> so every filesystem using the generic_writepages code already does
> > > > >> this check and wait before .writepage is called. Hence only the
> > > > >> filesystems that do not use generic_writepages() or
> > > > >> mpage_writepages() need a specific check, and that means most
> > > > >> filesystems are actually waiting on writeback pages correctly.
> > > > >
> > > > > But checking here just means we don't start writeback on a page that is
> > > > > writeback, which is a good idea but not really related to stable pages?
> > > > >
> > > > > stable pages means we don't let mmap'd pages or file_write muck around
> > > > > with the pages while they are in writeback, so we need to wait in
> > > > > file_write and page_mkwrite.
> > > > 
> > > > Isn't the file_write case covered by the i_mutex as
> > > > Documentation/filesystems/Locking implies (for write_begin/write_end).
> > > > 
> > > 
> > > Does cifs take i_mutex before writepage? The disk based filesystems
> > > don't.  So, i_mutex protects file_write from other procs jumping into
> > > file_write, but it doesn't protect writeback from file_write jumping in
> > > and changing the pages while they are being sent to storage (or over the
> > > wire).
> > > 
> > > Basically the model needs to be:
> > > 
> > > file_write:
> > >     lock the page
> > >     wait on page writeback
> > > 
> > >     < new writeback cannot start because of the page lock >
> > >     copy_from_user
> > >     unlock the page
> > > 
> > > We also use page_mkwrite to get notified when userland wants to change
> > > some page it has given to mmap.  That needs to wait on page writeback as
> > > well.
> > > 
> > 
> > No, cifs doesn't take the i_mutex in writepage, but the page is locked.
> > cifs_write_begin calls grab_cache_page_write_begin, which returns a
> > locked page and it's not unlocked until cifs_write_end.
> 
> Ah ok, so you've got the page locked the whole time it is being sent
> over the wire?  The disk based filesystems split it and drop the page
> lock once the page is set writeback, which is why we need the extra
> waits.
> 
> So in your case you should just need a page_mkwrite that locks the page.
> 

Right, or do a wait_on_page_writeback. I think that may have a little
less overhead since we won't need to unlock it and it may mean less
serialization if there are other contenders for the page lock.

-- 
Jeff Layton <jlayton@redhat.com>

* Re: stable page writes: wait_on_page_writeback and packet signing
  2011-03-10  1:30               ` Jeff Layton
@ 2011-03-10 13:53                 ` Steve French
       [not found]                 ` <20110309203044.4fd0498e-4QP7MXygkU+dMjc06nkz3ljfA9RmPOcC@public.gmane.org>
  1 sibling, 0 replies; 25+ messages in thread
From: Steve French @ 2011-03-10 13:53 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Dave Chinner, linux-cifs, linux-fsdevel, Mingming Cao

On Wed, Mar 9, 2011 at 7:30 PM, Jeff Layton <jlayton@redhat.com> wrote:
> On Wed, 9 Mar 2011 18:33:20 -0600
> Steve French <smfrench@gmail.com> wrote:
>
>> On Wed, Mar 9, 2011 at 5:54 PM, Jeff Layton <jlayton@redhat.com> wrote:
>> > On Wed, 9 Mar 2011 16:01:30 -0600
>> > Steve French <smfrench@gmail.com> wrote:
>> >
>> >>
>> >> Except we don't need to wait that long with the page locked
>> >> ie for a response from the cifs server (such as Samba or Windows
>> >> or NetApp), just need to wait for it to get on the wire.
>> >> Waiting for us to get the server response would
>> >> take 10 or 100 times longer.   In any case we can't resend
>> >> the same request to the server (the signature changes on the
>> >> resend since the sequence number is incremented on every
>> >> request/response so we have to recalc the checksum anyway) and
>> >> cifs requests can't get lost (as with nfs over udp).  Keeping
>> >> a page locked for 10milliseconds seems like a bad idea - but
>> >> it is a little more complicated to implement (for the cifs case)
>> >> so that we end page writeback (for the non-WB_SYNC)
>> >> as quickly as reasonably possible so we don't kill perf.
>> >>
>> >
>> > The problem here is that the socket layer doesn't have a mechanism
>> > to notify us of a TCP ACK. So, we have to wait for the next-best thing
>> > -- a response from the server.
>>
>> But ... we can stop writeback as soon as kernel_sendmsg returns - once
>> we return from kernel_sendmsg the buffers can (and often will) be
>> freed so we know those pages could not still be used by tcp  (below
>> cifs) once kernel_sendmsg returns.  We can minimize the delay further
>> by making sure we set TCP_NODELAY on the socket (we probably ought to
>> make that the default instead of an option).
>>
>
> That's not correct. A return from kernel_sendmsg just means that the
> data has been buffered up, not that it has been sent and acked. We
> shouldn't use that as an indicator to mean that the pages no longer
> need to be stable.

We can't free the page from the cache until the server responds that
it has written the data (otherwise, if the server crashes, we have no
way to resend the dirty page), but I don't see any reason to block
redirtying a page as long as we don't break the signing mechanism.  We
can allow writes to a page once its data has been buffered (i.e. once
kernel_sendmsg has returned).  Although cifs has stricter guarantees
than nfs in some cases (stricter than open to close), we could hang on
to pages longer, as nfs does, until the server returns the equivalent
of fsync.



-- 
Thanks,

Steve

* Re: stable page writes: wait_on_page_writeback and packet signing
       [not found]                     ` <20110310084724.658fe5d7-xSBYVWDuneFaJnirhKH9O4GKTjYczspe@public.gmane.org>
@ 2011-03-10 13:58                       ` Chris Mason
  2011-03-11 12:11                         ` Jeff Layton
  0 siblings, 1 reply; 25+ messages in thread
From: Chris Mason @ 2011-03-10 13:58 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Steve French, Dave Chinner, linux-cifs, linux-fsdevel,
	Mingming Cao

Excerpts from Jeff Layton's message of 2011-03-10 08:47:24 -0500:
> > > > Does cifs take i_mutex before writepage? The disk based filesystems
> > > > don't.  So, i_mutex protects file_write from other procs jumping into
> > > > file_write, but it doesn't protect writeback from file_write jumping in
> > > > and changing the pages while they are being sent to storage (or over the
> > > > wire).
> > > > 
> > > > Basically the model needs to be:
> > > > 
> > > > file_write:
> > > >     lock the page
> > > >     wait on page writeback
> > > > 
> > > >     < new writeback cannot start because of the page lock >
> > > >     copy_from_user
> > > >     unlock the page
> > > > 
> > > > We also use page_mkwrite to get notified when userland wants to change
> > > > some page it has given to mmap.  That needs to wait on page writeback as
> > > > well.
> > > > 
> > > 
> > > No, cifs doesn't take the i_mutex in writepage, but the page is locked.
> > > cifs_write_begin calls grab_cache_page_write_begin, which returns a
> > > locked page and it's not unlocked until cifs_write_end.
> > 
> > Ah ok, so you've got the page locked the whole time it is being sent
> > over the wire?  The disk based filesystems split it and drop the page
> > lock once the page is set writeback, which is why we need the extra
> > waits.
> > 
> > So in your case you should just need a page_mkwrite that locks the page.
> > 
> 
> Right, or do a wait_on_page_writeback. I think that may have a little
> less overhead since we won't need to unlock it and it may mean less
> serialization if there are other contenders for the page lock.
> 

I think you'll need the page lock too, otherwise you aren't protected
against new IO starting.  page_mkwrite really works together with 
clear_page_dirty_for_io(), and I don't think you get proper
synchronization without the page lock.

You also need the page lock to make sure the page really is still in
your mapping and that truncate won't race in and take the page away.

Basically, if you add a wait_on_page_writeback() to block_page_mkwrite,
you should have what you need, +/- the prepare_write/commit_write calls,
which we use to fill any holes for the pages that are about to change.

-chris

* Re: stable page writes: wait_on_page_writeback and packet signing
       [not found]                 ` <20110309203044.4fd0498e-4QP7MXygkU+dMjc06nkz3ljfA9RmPOcC@public.gmane.org>
@ 2011-03-11 11:53                   ` Jeff Layton
  0 siblings, 0 replies; 25+ messages in thread
From: Jeff Layton @ 2011-03-11 11:53 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Steve French, Dave Chinner, linux-cifs,
	linux-fsdevel, Mingming Cao

On Wed, 9 Mar 2011 20:30:44 -0500
Jeff Layton <jlayton@redhat.com> wrote:

> > 
> > But ... we can stop writeback as soon as kernel_sendmsg returns - once
> > we return from kernel_sendmsg the buffers can (and often will) be
> > freed so we know those pages could not still be used by tcp  (below
> > cifs) once kernel_sendmsg returns.  We can minimize the delay further
> > by making sure we set TCP_NODELAY on the socket (we probably ought to
> > make that the default instead of an option).
> > 
> 
> That's not correct. A return from kernel_sendmsg just means that the
> data has been buffered up, not that it has been sent and acked. We
> shouldn't use that as an indicator to mean that the pages no longer
> need to be stable.
> 

Correction...

Steve is correct that we have no need to keep the page stable after
kernel_sendmsg returns. That codepath copies the data from the iovec
to the send buffer. I got kernel_sendmsg mixed up with kernel_sendpage.
NFS uses both, depending on how the xdr buffer is laid out.

Eventually we should probably consider moving cifs to use ->sendpage
for writes instead, which would allow us to do sends directly out of
the pagecache. To do that though, we would need to keep the pages
stable until the write is complete, or at least until it has been
ACK'ed (if we can work out a way to do that).
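
The difference between the two send paths can be shown with a toy socket in user space. This is a sketch only: struct toy_sock and the toy_* functions are hypothetical and merely model "copy into a send buffer" versus "keep a reference to the caller's page"; they are not the kernel socket API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_SNDBUF 64

/* Hypothetical toy socket; none of these names are kernel APIs. */
struct toy_sock {
	char sndbuf[TOY_SNDBUF];  /* private copy (sendmsg-style path) */
	size_t sndbuf_len;
	const char *page_ref;     /* borrowed pointer (sendpage-style path) */
	size_t page_len;
};

/* Copy semantics: the caller may modify data as soon as this returns. */
int toy_send_copy(struct toy_sock *sk, const char *data, size_t len)
{
	assert(sk != NULL && data != NULL);
	if (len > TOY_SNDBUF)
		return -1;
	memcpy(sk->sndbuf, data, len);
	sk->sndbuf_len = len;
	return 0;
}

/* Reference semantics: the page must stay stable until transmission ends. */
void toy_send_ref(struct toy_sock *sk, const char *page, size_t len)
{
	assert(sk != NULL && page != NULL);
	sk->page_ref = page;
	sk->page_len = len;
}

/* Byte that would go on the wire at transmit time, for each path. */
char toy_wire_byte_copy(const struct toy_sock *sk, size_t i)
{
	assert(i < sk->sndbuf_len);
	return sk->sndbuf[i];
}

char toy_wire_byte_ref(const struct toy_sock *sk, size_t i)
{
	assert(i < sk->page_len);
	return sk->page_ref[i];
}
```

If the page is modified after both calls, the copied path still transmits the bytes that were signed, while the referenced path transmits the new bytes, which is why the zero-copy path needs stable pages for longer.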

-- 
Jeff Layton <jlayton@redhat.com>

* Re: stable page writes: wait_on_page_writeback and packet signing
  2011-03-10 13:58                       ` Chris Mason
@ 2011-03-11 12:11                         ` Jeff Layton
       [not found]                           ` <20110311071143.01b407b6-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
  0 siblings, 1 reply; 25+ messages in thread
From: Jeff Layton @ 2011-03-11 12:11 UTC (permalink / raw)
  To: Chris Mason
  Cc: Steve French, Dave Chinner, linux-cifs, linux-fsdevel,
	Mingming Cao

On Thu, 10 Mar 2011 08:58:04 -0500
Chris Mason <chris.mason@oracle.com> wrote:

> 
> I think you'll need the page lock too, otherwise you aren't protected
> against new IO starting.  page_mkwrite really works together with 
> clear_page_dirty_for_io(), and I don't think you get proper
> synchronization without the page lock.
> 

I'm trying to work this out in my head and I'm having a hard time...

If we fix cifs_writepages to set_page_writeback before calling
clear_page_dirty_for_io, then do we really need the page lock here?

> You also need the page lock to make sure the page really is still in
> your mapping and that truncate won't race in and take the page away.
> 

This I'm a little less clear on. Why is this a concern only for
read-only pages and not for writable ones which won't pass through
page_mkwrite?

The reason I'm reluctant to take the page lock here is that I've been
toying with the idea of having page_mkwrite copy the page to a new one
when it's under writeback. Basically, have page_mkwrite:

1) allocate a new page (if that fails, just wait_on_page_writeback)
2) copy the old page data to the new one
3) replace the old page in the pagecache with the new one
4) shoot down any PTE's that point to the old page (via unmap_mapping_range)
5) return an error from page_mkwrite that tells the caller that the page
   needs to be refaulted in

I think that would allow us to have stable pages for the actual write,
but without blocking processes that have the pages mmapped for an
arbitrary period. If I have to take the page lock however, then that
sort of blows that whole idea out of the water.

I haven't worked through all of the details for this (and I'm sure
handling the locking for this will be tricky). Maybe it's a dumb
idea, but I think it's worth investigating.
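
A rough user-space sketch of steps 1-3 and 5 above, to make the idea concrete. Everything here is hypothetical: the "page cache" is a single pointer slot, the return codes are invented for illustration, and step 4 (shooting down PTEs) has no user-space equivalent and is only marked with a comment. Locking is not modelled at all.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define TOY_PAGE_SIZE 64

/* Invented return codes; not the kernel's fault codes. */
enum toy_fault {
	TOY_FAULT_OK,     /* page may be written in place */
	TOY_FAULT_RETRY,  /* page was replaced, caller must refault */
	TOY_FAULT_WAIT    /* allocation failed, fall back to waiting */
};

/* Hypothetical stand-in for a page cache page. */
struct toy_page {
	bool writeback;
	char data[TOY_PAGE_SIZE];
};

/*
 * slot models the page cache entry for one file offset.  The old page is
 * left untouched so the in-flight write keeps sending stable data; freeing
 * it is the job of whoever completes that write.
 */
enum toy_fault toy_page_mkwrite(struct toy_page **slot)
{
	struct toy_page *old, *new_page;

	assert(slot != NULL && *slot != NULL);
	old = *slot;
	if (!old->writeback)
		return TOY_FAULT_OK;

	new_page = malloc(sizeof(*new_page));          /* step 1 */
	if (new_page == NULL)
		return TOY_FAULT_WAIT;                 /* would wait on writeback */

	memcpy(new_page->data, old->data, TOY_PAGE_SIZE);   /* step 2 */
	new_page->writeback = false;
	*slot = new_page;                              /* step 3 */
	/* step 4 (unmapping PTEs that point at the old page) is not modelled */
	return TOY_FAULT_RETRY;                        /* step 5 */
}
```

After a TOY_FAULT_RETRY the writer ends up on the new page, so its modifications cannot reach the copy that is still being sent.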

-- 
Jeff Layton <jlayton@redhat.com>

* Re: stable page writes: wait_on_page_writeback and packet signing
       [not found]                           ` <20110311071143.01b407b6-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
@ 2011-03-11 12:56                             ` Chris Mason
  2011-03-11 13:42                               ` Jeff Layton
  0 siblings, 1 reply; 25+ messages in thread
From: Chris Mason @ 2011-03-11 12:56 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Steve French, Dave Chinner, linux-cifs, linux-fsdevel,
	Mingming Cao

Excerpts from Jeff Layton's message of 2011-03-11 07:11:43 -0500:
> On Thu, 10 Mar 2011 08:58:04 -0500
> Chris Mason <chris.mason@oracle.com> wrote:
> 
> > 
> > I think you'll need the page lock too, otherwise you aren't protected
> > against new IO starting.  page_mkwrite really works together with 
> > clear_page_dirty_for_io(), and I don't think you get proper
> > synchronization without the page lock.
> > 
> 
> I'm trying to work this out in my head and I'm having a hard time...
> 
> If we fix cifs_writepages to set_page_writeback before calling
> clear_page_dirty_for_io, then do we really need the page lock here?

clear_page_dirty_for_io is called by write_cache_pages before setting
the page writeback.  This way we avoid transient setting of page
writeback when it wasn't really dirty.  It doesn't mean it won't work
the other way around, but PageWriteback usually means 'I'm being
written' not 'Maybe I'm being written'.
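
As a rough illustration of that ordering, here is a userspace model in
C. It is a sketch only: wb_page, wb_clear_dirty_for_io and
wb_start_writeback are hypothetical stand-ins, not the kernel
functions, and no locking is modelled. The point it shows is that the
writeback flag is set only after the dirty bit was actually found set
and cleared, so a clean page never carries the flag even briefly.

```c
#include <assert.h>
#include <stdbool.h>

struct wb_page {
	bool dirty;      /* stands in for PageDirty */
	bool writeback;  /* stands in for PageWriteback */
};

/* Model of the test-and-clear done before I/O is started. */
bool wb_clear_dirty_for_io(struct wb_page *page)
{
	bool was_dirty = page->dirty;

	page->dirty = false;
	return was_dirty;
}

/* Returns true if writeback was started for this page. */
bool wb_start_writeback(struct wb_page *page)
{
	if (!wb_clear_dirty_for_io(page))
		return false;       /* page was clean, writeback is never set */

	page->writeback = true;     /* now it really means "being written" */
	return true;
}
```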

> 
> > You also need the page lock to make sure the page really is still in
> > your mapping and that truncate won't race in and take the page away.
> > 
> 
> This I'm a little less clear on. Why is this a concern only for
> read-only pages and not for writable ones which won't pass through
> page_mkwrite?

We want to make sure that we're not racing with truncate.   For us that
means we don't want to insert blocks to fill a hole in the middle of
truncate doing away with that range in the file.

This may or may not be a concern for cifs, but truncate is going to lock
every page, so we need the page lock to really synchronize with it.

> 
> The reason I'm reluctant to take the page lock here is that I've been
> toying with the idea of having page_mkwrite copy the page to a new one
> when it's under writeback. Basically, have page_mkwrite:
> 
> 1) allocate a new page (if that fails, just wait_on_page_writeback)
> 2) copy the old page data to the new one
> 3) replace the old page in the pagecache with the new one
> 4) shoot down any PTE's that point to the old page (via unmap_mapping_range)
> 5) return an error from page_mkwrite that tells the caller that the page
>    needs to be refaulted in
> 
> I think that would allow us to have stable pages for the actual write,
> but without blocking processes that have the pages mmapped for an
> arbitrary period. If I have to take the page lock however, then that
> sort of blows that whole idea out of the water.
> 
> I haven't worked through all of the details for this (and I'm sure
> handling the locking for this will be tricky). Maybe it's a dumb
> idea, but I think it's worth investigating.
> 

Would it be easier to send a bounce buffer over the wire instead of
the page cache page?

In general we haven't seen a big performance problem from waiting on
writeback and locking the page in page_mkwrite().  Writable mmaps and
high performance expectations don't often go together.

-chris


* Re: stable page writes: wait_on_page_writeback and packet signing
  2011-03-11 12:56                             ` Chris Mason
@ 2011-03-11 13:42                               ` Jeff Layton
       [not found]                                 ` <20110311084221.4ac6bd11-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
  0 siblings, 1 reply; 25+ messages in thread
From: Jeff Layton @ 2011-03-11 13:42 UTC (permalink / raw)
  To: Chris Mason
  Cc: Steve French, Dave Chinner, linux-cifs, linux-fsdevel,
	Mingming Cao

On Fri, 11 Mar 2011 07:56:14 -0500
Chris Mason <chris.mason@oracle.com> wrote:

> Excerpts from Jeff Layton's message of 2011-03-11 07:11:43 -0500:
> > On Thu, 10 Mar 2011 08:58:04 -0500
> > Chris Mason <chris.mason@oracle.com> wrote:
> > 
> > > 
> > > I think you'll need the page lock too, otherwise you aren't protected
> > > against new IO starting.  page_mkwrite really works together with 
> > > clear_page_dirty_for_io(), and I don't think you get proper
> > > synchronization without the page lock.
> > > 
> > 
> > I'm trying to work this out in my head and I'm having a hard time...
> > 
> > If we fix cifs_writepages to set_page_writeback before calling
> > clear_page_dirty_for_io, then do we really need the page lock here?
> 
> clear_page_dirty_for_io is called by write_cache_pages before setting
> the page writeback.  This way we avoid transient setting of page
> writeback when it wasn't really dirty.  It doesn't mean it won't work
> the other way around, but PageWriteback usually means 'I'm being
> written' not 'Maybe I'm being written'.
> 

Ok that makes sense. cifs does it the same way currently.

> > 
> > > You also need the page lock to make sure the page really is still in
> > > your mapping and that truncate won't race in and take the page away.
> > > 
> > 
> > This I'm a little less clear on. Why is this a concern only for
> > read-only pages and not for writable ones which won't pass through
> > page_mkwrite?
> 
> We want to make sure that we're not racing with truncate.   For us that
> means we don't want to insert blocks to fill a hole in the middle of
> truncate doing away with that range in the file.
> 
> This may or may not be a concern for cifs, but truncate is going to lock
> every page, so we need the page lock to really synchronize with it.
> 

Hmm...ok. I'll need to ponder this a bit more, but the comment above
btrfs_page_mkwrite makes this a bit more clear. Still, it seems like
page_mkwrite ought to not have to worry about this. IOW...

Why doesn't the page fault handler fix that up? And why is this not an
issue for filesystems that don't even implement page_mkwrite?

> > 
> > The reason I'm reluctant to take the page lock here is that I've been
> > toying with the idea of having page_mkwrite copy the page to a new one
> > when it's under writeback. Basically, have page_mkwrite:
> > 
> > 1) allocate a new page (if that fails, just wait_on_page_writeback)
> > 2) copy the old page data to the new one
> > 3) replace the old page in the pagecache with the new one
> > 4) shoot down any PTE's that point to the old page (via unmap_mapping_range)
> > 5) return an error from page_mkwrite that tells the caller that the page
> >    needs to be refaulted in
> > 
> > I think that would allow us to have stable pages for the actual write,
> > but without blocking processes that have the pages mmapped for an
> > arbitrary period. If I have to take the page lock however, then that
> > sort of blows that whole idea out of the water.
> > 
> > I haven't worked through all of the details for this (and I'm sure
> > handling the locking for this will be tricky). Maybe it's a dumb
> > idea, but I think it's worth investigating.
> > 
> 
> Would it be easier to send a bounce buffer over the wire instead of
> the page cache page?
> 
> In general we haven't seen a big performance problem from waiting on
> writeback and locking the page in page_mkwrite().  Writable mmaps and
> high performance expectations don't often go together.
> 

That would definitely be easier. I'm just always a bit leery of doing
any sort of memory allocation while in writeback. The scheme I
described would only do them when someone writes to a mmapped page, and
it can just fall back to blocking if that fails.

My main worry is not so much with performance, but rather with making
sure that we're not blocking processes that are trying to write to
mmaps indefinitely if the server goes away. Blocking them only until
kernel_sendmsg returns makes this a bit less of an issue, but it can
still happen if the socket buffer is full.

In principle, the above scheme would mostly avoid unnecessary memory
allocations, and should mostly prevent mmapped processes from
blocking indefinitely. It is rather complicated though, which might
make it too "fiddly" to really be workable.

-- 
Jeff Layton <jlayton@redhat.com>


* Re: stable page writes: wait_on_page_writeback and packet signing
       [not found]                                 ` <20110311084221.4ac6bd11-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
@ 2011-03-11 16:00                                   ` Chris Mason
  0 siblings, 0 replies; 25+ messages in thread
From: Chris Mason @ 2011-03-11 16:00 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Steve French, Dave Chinner, linux-cifs, linux-fsdevel,
	Mingming Cao

Excerpts from Jeff Layton's message of 2011-03-11 08:42:21 -0500:
> On Fri, 11 Mar 2011 07:56:14 -0500
> > > This I'm a little less clear on. Why is this a concern only for
> > > read-only pages and not for writable ones which won't pass through
> > > page_mkwrite?
> > 
> > We want to make sure that we're not racing with truncate.   For us that
> > means we don't want to insert blocks to fill a hole in the middle of
> > truncate doing away with that range in the file.
> > 
> > This may or may not be a concern for cifs, but truncate is going to lock
> > every page, so we need the page lock to really synchronize with it.
> > 
> 
> Hmm...ok. I'll need to ponder this a bit more, but the comment above
> btrfs_page_mkwrite makes this a bit more clear. Still, it seems like
> page_mkwrite ought to not have to worry about this. IOW...
> 
> Why doesn't the page fault handler fix that up? And why is this not an
> issue for filesystems that don't even implement page_mkwrite?

The page fault handler can deal with pages being dirty or clean or
inside or outside of i_size.  But the filesystem is using page_mkwrite
to allocate blocks in the file if the write is filling a hole, or in the
btrfs case to satisfy COW.

So the allocation code needs to synchronize with truncate while the page
handling code itself is already safe.

> 
> > > 
> > > The reason I'm reluctant to take the page lock here is that I've been
> > > toying with the idea of having page_mkwrite copy the page to a new one
> > > when it's under writeback. Basically, have page_mkwrite:
> > > 
> > > 1) allocate a new page (if that fails, just wait_on_page_writeback)
> > > 2) copy the old page data to the new one
> > > 3) replace the old page in the pagecache with the new one
> > > 4) shoot down any PTE's that point to the old page (via unmap_mapping_range)
> > > 5) return an error from page_mkwrite that tells the caller that the page
> > >    needs to be refaulted in
> > > 
> > > I think that would allow us to have stable pages for the actual write,
> > > but without blocking processes that have the pages mmapped for an
> > > arbitrary period. If I have to take the page lock however, then that
> > > sort of blows that whole idea out of the water.
> > > 
> > > I haven't worked through all of the details for this (and I'm sure
> > > handling the locking for this will be tricky). Maybe it's a dumb
> > > idea, but I think it's worth investigating.
> > > 
> > 
> > Would it be easier to send a bounce buffer over the wire instead of
> > the page cache page?
> > 
> > In general we haven't seen a big performance problem from waiting on
> > writeback and locking the page in page_mkwrite().  Writable mmaps and
> > high performance expectations don't often go together.
> > 
> 
> That would definitely be easier. I'm just always a bit leery of doing
> any sort of memory allocation while in writeback. The scheme I
> described would only do them when someone writes to a mmapped page, and
> it can just fall back to blocking if that fails.
> 
> My main worry is not so much with performance, but rather with making
> sure that we're not blocking processes that are trying to write to
> mmaps indefinitely if the server goes away. Blocking them only until
> kernel_sendmsg returns makes this a bit less of an issue, but it can
> still happen if the socket buffer is full.

Don't we already have the risk of blocking them in
balance_dirty_pages for file_write?  I'm not sure how mmap is different.
At any rate, an interruptible wait_on_page_writeback(), or
wait_on_page_writeback_timeout() might be easier?
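
A bounded wait of that kind could look roughly like the userspace model
below. This is a sketch under loud assumptions: io_page, io_tick and
io_wait_writeback_timeout are hypothetical names, time is simulated
with a tick counter instead of a real sleep, and a page whose
ticks_left is 0 while still under writeback models a server that never
answers. It is not a claim about any existing kernel helper.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct io_page {
	bool writeback;          /* stands in for PageWriteback */
	unsigned int ticks_left; /* simulated time until the I/O completes */
};

/* Advance simulated time by one tick and complete the I/O if it is due. */
void io_tick(struct io_page *page)
{
	if (page->writeback && page->ticks_left > 0 && --page->ticks_left == 0)
		page->writeback = false;
}

/*
 * Wait for writeback to finish, but give up after max_ticks.
 * Returns 0 once the page is stable, -ETIMEDOUT otherwise.
 */
int io_wait_writeback_timeout(struct io_page *page, unsigned int max_ticks)
{
	unsigned int waited;

	for (waited = 0; page->writeback; waited++) {
		if (waited == max_ticks)
			return -ETIMEDOUT;
		io_tick(page);
	}
	return 0;
}
```

With this shape the faulting process gets control back after a bounded
delay and the caller decides what to do on -ETIMEDOUT, for example
return an error or retry.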

> 
> In principle, the above scheme would mostly avoid unnecessary memory
> allocations, and should mostly prevent mmapped processes from
> blocking indefinitely. It is rather complicated though, which might
> make it too "fiddly" to really be workable.
> 

In general I'd try to avoid being special in mmap writes.  It's one of
those things that doesn't get exercised that much in the common test
suites, and so you'll have mysterious bug reports that very few people
see.

(Don't let my grumpiness with mmap stop you from doing cool things
though ;)

-chris

