* flush and EIO errors when writepages fails
  [not found] ` <20080620091542.09edb43f@tupile.poochiereds.net>
@ 2008-06-20 16:19 ` Steve French (smfltc)
  2008-06-20 16:34 ` Jeff Layton
  0 siblings, 1 reply; 17+ messages in thread
From: Steve French (smfltc) @ 2008-06-20 16:19 UTC (permalink / raw)
To: Jeff Layton; +Cc: Shirish S Pargaonkar, shaggy, linux-fsdevel

If flush fails to write all dirty pages (due to an I/O error on the
server, the server's disk, or the networking stack), today the error
(EIO) is marked in the inode and returned on close. I think cifs_flush
(which is called before close by the VFS) should also retry the
filemap_fdatawrite at least once (perhaps after sleeping a second or so)
before giving up, and perhaps retry more if mounted hard. Thoughts?

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: flush and EIO errors when writepages fails
  2008-06-20 16:19 ` flush and EIO errors when writepages fails Steve French (smfltc)
@ 2008-06-20 16:34 ` Jeff Layton
  2008-06-20 16:41 ` Steve French (smfltc)
  0 siblings, 1 reply; 17+ messages in thread
From: Jeff Layton @ 2008-06-20 16:34 UTC (permalink / raw)
To: Steve French (smfltc); +Cc: Shirish S Pargaonkar, shaggy, linux-fsdevel

On Fri, 20 Jun 2008 11:19:19 -0500
"Steve French (smfltc)" <smfltc@us.ibm.com> wrote:

> If flush fails to write all dirty pages (due to an I/O error on the
> server, the server's disk, or the networking stack), today the error
> (EIO) is marked in the inode and returned on close. I think cifs_flush
> (which is called before close by the VFS) should also retry the
> filemap_fdatawrite at least once (perhaps after sleeping a second or
> so) before giving up, and perhaps retry more if mounted hard. Thoughts?
>

A couple of thoughts...

Retrying is only likely to be helpful if the server isn't responding. We
could consider doing a better job there somehow.

...and...

Suppose we have a bunch of dirty pages for an inode. We call cifs_flush,
which calls filemap_fdatawrite (which eventually calls cifs_writepages)
and attempts to write all of the pages out. They all fail and then get
discarded. Then we call filemap_fdatawrite again. Now there are no more
dirty pages and this returns success. But the data was tossed out on the
first filemap_fdatawrite call, so the success here really isn't a
success...

If you want to be more aggressive about handling errors when writing
out pages, then most of the changes will need to be made at the
cifs_writepages level, not so much in cifs_flush.

-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: flush and EIO errors when writepages fails
  2008-06-20 16:34 ` Jeff Layton
@ 2008-06-20 16:41 ` Steve French (smfltc)
  2008-06-20 17:12 ` Jeff Layton
  0 siblings, 1 reply; 17+ messages in thread
From: Steve French (smfltc) @ 2008-06-20 16:41 UTC (permalink / raw)
To: Jeff Layton; +Cc: Shirish S Pargaonkar, shaggy, linux-fsdevel

Jeff Layton wrote:
> On Fri, 20 Jun 2008 11:19:19 -0500
> "Steve French (smfltc)" <smfltc@us.ibm.com> wrote:
>
>> If flush fails to write all dirty pages (due to an I/O error on the
>> server, the server's disk, or the networking stack), today the error
>> (EIO) is marked in the inode and returned on close. I think
>> cifs_flush (which is called before close by the VFS) should also
>> retry the filemap_fdatawrite at least once (perhaps after sleeping a
>> second or so) before giving up, and perhaps retry more if mounted
>> hard. Thoughts?
>>
> A couple of thoughts...
>
> Retrying is only likely to be helpful if the server isn't responding.
> We could consider doing a better job there somehow.
>
The particular problem case I am thinking of at the moment, and which I
hope retry would help, is the one in which memory pressure prevents the
TCP/IP stack or the underlying (perhaps badly written) network adapter
driver from getting the SMB write packet out to the wire at all.

> If you want to be more aggressive about handling errors when writing
> out pages, then most of the changes will need to be made at the
> cifs_writepages level, not so much in cifs_flush.
>
flush is our "last chance" effort to write the file data - once flush
and close are called, the file handle is gone, so we can no longer write
the file data after that point.

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: flush and EIO errors when writepages fails
  2008-06-20 16:41 ` Steve French (smfltc)
@ 2008-06-20 17:12 ` Jeff Layton
  0 siblings, 0 replies; 17+ messages in thread
From: Jeff Layton @ 2008-06-20 17:12 UTC (permalink / raw)
To: Steve French (smfltc); +Cc: Shirish S Pargaonkar, shaggy, linux-fsdevel

On Fri, 20 Jun 2008 11:41:26 -0500
"Steve French (smfltc)" <smfltc@us.ibm.com> wrote:

> Jeff Layton wrote:
> > On Fri, 20 Jun 2008 11:19:19 -0500
> > "Steve French (smfltc)" <smfltc@us.ibm.com> wrote:
> >
> >> If flush fails to write all dirty pages (due to an I/O error on the
> >> server, the server's disk, or the networking stack), today the error
> >> (EIO) is marked in the inode and returned on close. I think
> >> cifs_flush (which is called before close by the VFS) should also
> >> retry the filemap_fdatawrite at least once (perhaps after sleeping a
> >> second or so) before giving up, and perhaps retry more if mounted
> >> hard. Thoughts?
> >>
> > A couple of thoughts...
> >
> > Retrying is only likely to be helpful if the server isn't responding.
> > We could consider doing a better job there somehow.
> >
> The particular problem case I am thinking of at the moment, and which I
> hope retry would help, is the one in which memory pressure prevents the
> TCP/IP stack or the underlying (perhaps badly written) network adapter
> driver from getting the SMB write packet out to the wire at all.

I'll buy that...though I'm not sure how often this is really an issue. I
don't think I've seen any cases of memory allocations failing and
preventing CIFS from writing out pages. I'm also not convinced that we'd
see a lot of success from retrying in these situations.

I have seen problems with NFS/RPC hitting deadlocks when trying to do
sleeping allocations in a write. rpciod tries to do a __GFP_WAIT
allocation, and due to memory pressure ends up trying to write out NFS
pages -- deadlock. A lot of these are now fixed in mainline, but we
still have some less traveled codepaths in NFS/RPC that could deadlock
this way.

We're probably being a bit too optimistic with how we do memory
allocations in CIFS in places, so we may be subject to similar
deadlocks. These types of problems concern me more than retrying failed
page writeouts...

> > If you want to be more aggressive about handling errors when writing
> > out pages, then most of the changes will need to be made at the
> > cifs_writepages level, not so much in cifs_flush.
> >
> flush is our "last chance" effort to write the file data - once flush
> and close are called, the file handle is gone, so we can no longer
> write the file data after that point.

Right, but with the current implementation, once filemap_fdatawrite
returns, any pages that that run touched are either written out or
discarded. Retrying the filemap_fdatawrite from cifs_flush won't help.
This could be redesigned, but we're probably better off fleshing out
cifs_writepages to handle this based on some sort of condition (maybe
the WB_SYNC_* flags or something).

-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: flush and EIO errors when writepages fails
@ 2008-06-20 22:34 Steve French
2008-06-21 7:05 ` Evgeniy Polyakov
0 siblings, 1 reply; 17+ messages in thread
From: Steve French @ 2008-06-20 22:34 UTC (permalink / raw)
To: linux-fsdevel; +Cc: Shirish Pargaonkar, Dave Kleikamp, Jody French
> Right, but with the current implementation, once filemap_fdatawrite
> returns, any pages that that run touched are either written out or
> discarded.
That could explain some problems if true. When writepages fails, we
mark the pages as in error (PG_error flag?) and presumably they are
still dirty. Why in the world would anyone free the pages just
because we failed the first time and need to write them again later?
Do you know where (presumably in /mm) pages could be freed while they
are still dirty (it is hard to find where the PG_error flag is
checked, etc.)?
--
Thanks,
Steve
^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: flush and EIO errors when writepages fails
  2008-06-20 22:34 Steve French
@ 2008-06-21  7:05 ` Evgeniy Polyakov
  2008-06-21 12:27 ` Jeff Layton
  0 siblings, 1 reply; 17+ messages in thread
From: Evgeniy Polyakov @ 2008-06-21 7:05 UTC (permalink / raw)
To: Steve French
Cc: linux-fsdevel, Shirish Pargaonkar, Dave Kleikamp, Jody French

Hi.

On Fri, Jun 20, 2008 at 05:34:21PM -0500, Steve French (smfrench@gmail.com) wrote:
> > Right, but with the current implementation, once filemap_fdatawrite
> > returns, any pages that that run touched are either written out or
> > discarded.

Depending on the writepages() implementation, that is not always the case.

> That could explain some problems if true. When writepages fails, we
> mark the pages as in error (PG_error flag?) and presumably they are
> still dirty. Why in the world would anyone free the pages just
> because we failed the first time and need to write them again later?
> Do you know where (presumably in /mm) pages could be freed while they
> are still dirty (it is hard to find where the PG_error flag is
> checked, etc.)?

You can clear the writeback bit but leave/set the dirty bit in the
completion callback for a given request. You can manually insert the
page into the radix tree with the dirty tag. You can also lock the page
and not allow it to be unlocked until you have resent your data.

So, there are plenty of possibilities to break page accounting in your
own writepages() method :)

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: flush and EIO errors when writepages fails
  2008-06-21  7:05 ` Evgeniy Polyakov
@ 2008-06-21 12:27 ` Jeff Layton
  2008-06-21 13:19 ` Evgeniy Polyakov
  0 siblings, 1 reply; 17+ messages in thread
From: Jeff Layton @ 2008-06-21 12:27 UTC (permalink / raw)
To: Steve French
Cc: Evgeniy Polyakov, linux-fsdevel, Shirish Pargaonkar, Dave Kleikamp,
	Jody French

On Sat, 21 Jun 2008 11:05:56 +0400
Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:

> Hi.
>
> On Fri, Jun 20, 2008 at 05:34:21PM -0500, Steve French (smfrench@gmail.com) wrote:
> > > Right, but with the current implementation, once filemap_fdatawrite
> > > returns, any pages that that run touched are either written out or
> > > discarded.
>
> Depending on the writepages() implementation, that is not always the
> case.
>

Right. I meant with the current cifs_writepages() implementation. It
looks like when the write fails we walk the pagevec and do:

	if (rc)
		SetPageError(page);
	kunmap(page);
	unlock_page(page);
	end_page_writeback(page);
	page_cache_release(page);

So I'm not certain that the data is actually discarded. I'm no expert
in page accounting, but I'm pretty sure that we won't attempt to write
it out again from CIFS.

> > That could explain some problems if true. When writepages fails, we
> > mark the pages as in error (PG_error flag?) and presumably they are
> > still dirty. Why in the world would anyone free the pages just
> > because we failed the first time and need to write them again later?
> > Do you know where (presumably in /mm) pages could be freed while
> > they are still dirty (it is hard to find where the PG_error flag is
> > checked, etc.)?
>
> You can clear the writeback bit but leave/set the dirty bit in the
> completion callback for a given request. You can manually insert the
> page into the radix tree with the dirty tag. You can also lock the
> page and not allow it to be unlocked until you have resent your data.
>
> So, there are plenty of possibilities to break page accounting in
> your own writepages() method :)
>

I still think that if you get a hard error back from the server then
you're not likely to have much success on a second attempt to write out
the pages. Returning an error to userspace as soon as you realize that
it didn't work seems reasonable to me. A non-responding server, on the
other hand, may be a place that's recoverable.

Either way, if we really want to do a second attempt to write out the
pagevec, then adding some code to cifs_writepages that sleeps for a bit
and implements this seems like the thing to do. I'm not convinced that
it will actually make much difference, but it seems unlikely to hurt
anything.

-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: flush and EIO errors when writepages fails
  2008-06-21 12:27 ` Jeff Layton
@ 2008-06-21 13:19 ` Evgeniy Polyakov
  2008-06-21 14:21 ` Jody French
  0 siblings, 1 reply; 17+ messages in thread
From: Evgeniy Polyakov @ 2008-06-21 13:19 UTC (permalink / raw)
To: Jeff Layton
Cc: Steve French, linux-fsdevel, Shirish Pargaonkar, Dave Kleikamp,
	Jody French

On Sat, Jun 21, 2008 at 08:27:42AM -0400, Jeff Layton (jlayton@redhat.com) wrote:
> Right. I meant with the current cifs_writepages() implementation. It
> looks like when the write fails we walk the pagevec and do:
>
> 	if (rc)
> 		SetPageError(page);
> 	kunmap(page);
> 	unlock_page(page);
> 	end_page_writeback(page);
> 	page_cache_release(page);
>
> So I'm not certain that the data is actually discarded. I'm no expert
> in page accounting, but I'm pretty sure that we won't attempt to write
> it out again from CIFS.

Yes, the current CIFS implementation does not allow much in the way of
failover recovery. The data is still in the cache; only the page is
marked as being in error, so nothing prevents it from being redirtied
by a subsequent write.

> I still think that if you get a hard error back from the server then
> you're not likely to have much success on a second attempt to write out
> the pages. Returning an error to userspace as soon as you realize that
> it didn't work seems reasonable to me. A non-responding server, on the
> other hand, may be a place that's recoverable.
>
> Either way, if we really want to do a second attempt to write out the
> pagevec, then adding some code to cifs_writepages that sleeps for a bit
> and implements this seems like the thing to do. I'm not convinced that
> it will actually make much difference, but it seems unlikely to hurt
> anything.

If the server returns a serious error then there is no way out except
to discard the data with an error, but if the server does not respond,
or responds with EBUSY or that kind of error, then a subsequent write
can succeed and at least should not do any harm. As a simple case, it
is possible to sleep a bit and resend in writepages(), but it is also
possible just to return from the callback and allow the VFS to call it
again (frequently that will happen very soon though) with the same
pages (or even a larger chunk).

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: flush and EIO errors when writepages fails
  2008-06-21 13:19 ` Evgeniy Polyakov
@ 2008-06-21 14:21 ` Jody French
  2008-06-21 14:42 ` Evgeniy Polyakov
  2008-06-23 15:39 ` Dave Kleikamp
  0 siblings, 2 replies; 17+ messages in thread
From: Jody French @ 2008-06-21 14:21 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Jeff Layton, Steve French, linux-fsdevel, Shirish Pargaonkar,
	Dave Kleikamp

Evgeniy Polyakov wrote:
>> Either way, if we really want to do a second attempt to write out the
>> pagevec, then adding some code to cifs_writepages that sleeps for a
>> bit and implements this seems like the thing to do. I'm not convinced
>> that it will actually make much difference, but it seems unlikely to
>> hurt anything.
>>
>
> If the server returns a serious error then there is no way out except
> to discard the data with an error, but if the server does not respond,
> or responds with EBUSY or that kind of error, then a subsequent write
> can succeed and at least should not do any harm. As a simple case, it
> is possible to sleep a bit and resend in writepages(), but it is also
> possible just to return from the callback and allow the VFS to call it
> again (frequently that will happen very soon though) with the same
> pages (or even a larger chunk).
>
In the particular case we are looking at, the network stack (perhaps due
to a temporary glitch in the network adapter or routing infrastructure,
or to temporary memory pressure) is returning EAGAIN for more than 15
seconds on the TCP send of the Write request, but the server itself has
not crashed (subsequent parts of the file, written via later writepages
requests, are eventually written out). Eventually we give up in
writepages and return EIO on the next fsync or flush/close - but if we
could make one more attempt in flush to write all dirty pages, including
the ones that we timed out on, that would help.

In addition, if readpage is about to do a partial page read into a dirty
page that we were unable to write out, we would like to try once more
before corrupting the data.

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: flush and EIO errors when writepages fails
  2008-06-21 14:21 ` Jody French
@ 2008-06-21 14:42 ` Evgeniy Polyakov
  2008-06-21 16:15 ` Steve French
  2008-06-23 15:39 ` Dave Kleikamp
  1 sibling, 1 reply; 17+ messages in thread
From: Evgeniy Polyakov @ 2008-06-21 14:42 UTC (permalink / raw)
To: Jody French
Cc: Jeff Layton, Steve French, linux-fsdevel, Shirish Pargaonkar,
	Dave Kleikamp

On Sat, Jun 21, 2008 at 09:21:42AM -0500, Jody French (jfrench@austin.rr.com) wrote:
> In the particular case we are looking at, the network stack (perhaps
> due to a temporary glitch in the network adapter or routing
> infrastructure, or to temporary memory pressure) is returning EAGAIN
> for more than 15 seconds on the TCP send of the Write request, but the
> server itself has not crashed (subsequent parts of the file, written
> via later writepages requests, are eventually written out). Eventually
> we give up in writepages and return EIO on the next fsync or
> flush/close - but if we could make one more attempt in flush to write
> all dirty pages, including the ones that we timed out on, that would
> help. In addition, if readpage is about to do a partial page read into
> a dirty page that we were unable to write out, we would like to try
> once more before corrupting the data.

If you do not unlock and release the page, nothing can corrupt it, but
I'm not sure that, with a 15 second timeout, writepage+flush will have
a long enough interval to exceed it. In flush you can switch to
nonblocking mode and set the socket timeout to 30 seconds, for example,
and if even that fails, then discard the data. EAGAIN likely means a
problem on the server, which I referred to as non-serious, and likely
it will resolve in a few moments, so your trick with write+flush can
work, but only with a long enough timeout.

Actually you can always perform a similar trick with the socket
timeout, but beware of problems with umount or sync, which can then
take too long to complete, so there should be some flag to indicate
when you do and do not want it. A similar scheme was implemented in
POHMELFS.

Returning an error, I think, is the last thing to do, and whatever
retry mechanism you decide to implement is worth the effort.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: flush and EIO errors when writepages fails
  2008-06-21 14:42 ` Evgeniy Polyakov
@ 2008-06-21 16:15 ` Steve French
  2008-06-21 16:28 ` Evgeniy Polyakov
  0 siblings, 1 reply; 17+ messages in thread
From: Steve French @ 2008-06-21 16:15 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Jeff Layton, linux-fsdevel, Shirish Pargaonkar, Dave Kleikamp

On Sat, Jun 21, 2008 at 9:42 AM, Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> If you do not unlock and release the page, nothing can corrupt it, but
> I'm not sure that, with a 15 second timeout, writepage+flush will have
> a long enough interval to exceed it. In flush you can switch to
> nonblocking mode and set the socket timeout to 30 seconds, for example,
> and if even that fails, then discard the data. EAGAIN likely means a
> problem on the server, which I referred to as non-serious, and likely
> it will resolve in a few moments, so your trick with write+flush can
> work, but only with a long enough timeout.
>
> Actually you can always perform a similar trick with the socket
> timeout, but beware of problems with umount or sync, which can then
> take too long to complete, so there should be some flag to indicate
> when you do and do not want it. A similar scheme was implemented in
> POHMELFS.
>
> Returning an error, I think, is the last thing to do, and whatever
> retry mechanism you decide to implement is worth the effort.

Any thoughts on how to simulate this (ie repeated EAGAINs on the
socket) so we can reproduce this kind of scenario without lots of
stress (including a large local file copy to increase memory pressure,
while the remote file is being written out)?

--
Thanks,

Steve

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: flush and EIO errors when writepages fails
  2008-06-21 16:15 ` Steve French
@ 2008-06-21 16:28 ` Evgeniy Polyakov
  2008-06-21 17:02 ` Steve French
  0 siblings, 1 reply; 17+ messages in thread
From: Evgeniy Polyakov @ 2008-06-21 16:28 UTC (permalink / raw)
To: Steve French
Cc: Jeff Layton, linux-fsdevel, Shirish Pargaonkar, Dave Kleikamp

On Sat, Jun 21, 2008 at 11:15:00AM -0500, Steve French (smfrench@gmail.com) wrote:
> On Sat, Jun 21, 2008 at 9:42 AM, Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> > If you do not unlock and release the page, nothing can corrupt it,
> > but I'm not sure that, with a 15 second timeout, writepage+flush
> > will have a long enough interval to exceed it. In flush you can
> > switch to nonblocking mode and set the socket timeout to 30 seconds,
> > for example, and if

switch to blocking with a 30 second timeout, of course.

> Any thoughts on how to simulate this (ie repeated EAGAINs on the
> socket) so we can reproduce this kind of scenario without lots of
> stress (including a large local file copy to increase memory pressure,
> while the remote file is being written out)?

Depending on how deep you want to dig into the socket code :)

The simplest way is just to provide the MSG_DONTWAIT flag and set the
socket buffer small enough (either via sysctl or a socket option). The
trick with a small socket buffer will help you get EAGAIN (note that it
is only returned for a nonblocking socket; otherwise the send will
sleep for up to sock->sk_sndtimeo/sk_rcvtimeo jiffies and then return
the number of bytes transferred) quite soon under any kind of load
(well, you can set it to a really miserable value, but that limit will
not be followed strictly :).

You can also hack into socket code like tcp_sendmsg()/tcp_sendpage()
(the latter is not used in CIFS though) and unconditionally return
errors like ENOMEM and EAGAIN there (or bind that to some socket
option, the time of day, or a random value).

Those were the simplest, imho, but you can also set up a network
scheduler (netem is the best for this kind of test) to
drop/reorder/corrupt the data flow, which will also result in protocol
troubles, which will lead to delays in sending and the possibility of
delayed writing.

In POHMELFS I emulated similar behaviour simply by injecting an error
into the sending socket codepath, setting the socket non-blocking, and
killing the server.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: flush and EIO errors when writepages fails
  2008-06-21 16:28 ` Evgeniy Polyakov
@ 2008-06-21 17:02 ` Steve French
  2008-06-21 17:26 ` Evgeniy Polyakov
  0 siblings, 1 reply; 17+ messages in thread
From: Steve French @ 2008-06-21 17:02 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Jeff Layton, linux-fsdevel, Shirish Pargaonkar, Dave Kleikamp

On Sat, Jun 21, 2008 at 11:28 AM, Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> You can also hack into socket code like tcp_sendmsg()/tcp_sendpage()
> (the latter is not used in CIFS though)

Your point about tcp_sendpage reminds me of something I have been
wondering about for a while. Since SunRPC switched to kernel_sendpage,
I have been wondering whether that is better or worse than
kernel_sendmsg. It looks like in some cases sendpage simply falls back
to calling sendmsg with one iovec (instead of calling tcp_sendpage,
which calls do_tcp_sendpages), which would end up being slower than
calling sendmsg with a larger iovec as we do in smb_send2 in the write
path.

For the write case, in which we are writing (aligned) pages out of the
page cache to the socket, would sendpage be any faster than sendmsg?
(I wish there were a send-multiple-pages call where I could pass the
whole list of pages.) How should a piece of kernel code check whether
sendpage is supported/faster, and when should it use kernel_sendpage
versus sendmsg with the pages in the iovec?

--
Thanks,

Steve

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: flush and EIO errors when writepages fails
  2008-06-21 17:02 ` Steve French
@ 2008-06-21 17:26 ` Evgeniy Polyakov
  2008-06-21 17:37 ` Evgeniy Polyakov
  0 siblings, 1 reply; 17+ messages in thread
From: Evgeniy Polyakov @ 2008-06-21 17:26 UTC (permalink / raw)
To: Steve French
Cc: Jeff Layton, linux-fsdevel, Shirish Pargaonkar, Dave Kleikamp

On Sat, Jun 21, 2008 at 12:02:31PM -0500, Steve French (smfrench@gmail.com) wrote:
> Your point about tcp_sendpage reminds me of something I have been
> wondering about for a while. Since SunRPC switched to kernel_sendpage,
> I have been wondering whether that is better or worse than
> kernel_sendmsg. It looks like in some cases sendpage simply falls back
> to calling sendmsg with one iovec (instead of calling tcp_sendpage,
> which calls do_tcp_sendpages), which would end up being slower than
> calling sendmsg with a larger iovec as we do in smb_send2 in the write
> path.

This happens for hardware which does not support hardware checksumming
and scatter-gather. sendpage() fundamentally requires them, since
sending is lockless and checksumming happens at the very end of the
transmit path: around the time the hardware DMAs the data onto the
wire. Software checksumming could end up with a broken checksum if the
data were changed in flight. Also note that kernel_sendpage() returning
does not mean that the data was really sent, so modifying the page
afterwards can lead to a corrupted protocol. It is also forbidden to
sendpage() slab pages.

> For the write case, in which we are writing (aligned) pages out of the
> page cache to the socket, would sendpage be any faster than sendmsg?
> (I wish there were a send-multiple-pages call where I could pass the
> whole list of pages.) How should a piece of kernel code check whether
> sendpage is supported/faster, and when should it use kernel_sendpage
> versus sendmsg with the pages in the iovec?

You can simply check sk->sk_route_caps; it has to have the NETIF_F_SG
and NETIF_F_ALL_CSUM bits set to support sendpage().

sendpage() is generally faster, since it does not perform a data copy
(or checksumming, though depending on how it is called, e.g. from
userspace, that may not be the main factor). With jumbo frames it
provides a more noticeable win, but for smaller MTUs it is frequently
not that big (well, in POHMELFS I did not get any better numbers from
switching to sendpage() instead of sendmsg(), neither in CPU
utilization nor in performance, but I ran 1500-MTU tests either on
quite fast machines with a GigE link or over a very slow (3 Mbyte/s)
link). sendpage() also performs fewer allocations, and they are
smaller than those for sendmsg().

I'm actually surprised that in bulk transfer per-page sending is slower
than lots of pages in one go. Of course it should be faster, but the
difference should be very small, since it is only a matter of grabbing
the socket lock, which for bulk data sending should not be an issue at
all. At least in my tests I never saw a difference, and easily achieved
the wire limit even for per-page sending with a data copy (i.e.
sendmsg()).

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: flush and EIO errors when writepages fails
  2008-06-21 17:26 ` Evgeniy Polyakov
@ 2008-06-21 17:37 ` Evgeniy Polyakov
  0 siblings, 0 replies; 17+ messages in thread
From: Evgeniy Polyakov @ 2008-06-21 17:37 UTC (permalink / raw)
To: Steve French
Cc: Jeff Layton, linux-fsdevel, Shirish Pargaonkar, Dave Kleikamp

On Sat, Jun 21, 2008 at 09:26:39PM +0400, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote:
> I'm actually surprised that in bulk transfer per-page sending is slower
> than lots of pages in one go. Of course it should be faster, but

.... slower :)

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: flush and EIO errors when writepages fails
  2008-06-21 14:21 ` Jody French
  2008-06-21 14:42 ` Evgeniy Polyakov
@ 2008-06-23 15:39 ` Dave Kleikamp
  2008-06-23 18:05 ` Dave Kleikamp
  1 sibling, 1 reply; 17+ messages in thread
From: Dave Kleikamp @ 2008-06-23 15:39 UTC (permalink / raw)
To: Steve French
Cc: Evgeniy Polyakov, Jeff Layton, Steve French, linux-fsdevel,
	Shirish Pargaonkar

On Sat, 2008-06-21 at 09:21 -0500, Jody French wrote:
> Evgeniy Polyakov wrote:
> >> Either way, if we really want to do a second attempt to write out
> >> the pagevec, then adding some code to cifs_writepages that sleeps
> >> for a bit and implements this seems like the thing to do. I'm not
> >> convinced that it will actually make much difference, but it seems
> >> unlikely to hurt anything.
> >>
> >
> > If the server returns a serious error then there is no way out except
> > to discard the data with an error, but if the server does not
> > respond, or responds with EBUSY or that kind of error, then a
> > subsequent write can succeed and at least should not do any harm. As
> > a simple case, it is possible to sleep a bit and resend in
> > writepages(), but it is also possible just to return from the
> > callback and allow the VFS to call it again (frequently that will
> > happen very soon though) with the same pages (or even a larger
> > chunk).
> >
> In the particular case we are looking at, the network stack (perhaps
> due to a temporary glitch in the network adapter or routing
> infrastructure, or to temporary memory pressure) is returning EAGAIN
> for more than 15 seconds on the TCP send of the Write request, but the
> server itself has not crashed (subsequent parts of the file, written
> via later writepages requests, are eventually written out). Eventually
> we give up in writepages and return EIO on the next fsync or
> flush/close

If you are getting EAGAIN, you shouldn't give up with an error. It
would be better to call redirty_page_for_writeback() and let the page
stay dirty a bit longer.

Shaggy
-- 
David Kleikamp
IBM Linux Technology Center

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: flush and EIO errors when writepages fails
  2008-06-23 15:39 ` Dave Kleikamp
@ 2008-06-23 18:05 ` Dave Kleikamp
  0 siblings, 0 replies; 17+ messages in thread
From: Dave Kleikamp @ 2008-06-23 18:05 UTC (permalink / raw)
To: Steve French
Cc: Evgeniy Polyakov, Jeff Layton, Steve French, linux-fsdevel,
	Shirish Pargaonkar

On Mon, 2008-06-23 at 15:39 +0000, Dave Kleikamp wrote:
> If you are getting EAGAIN, you shouldn't give up with an error. It
> would be better to call redirty_page_for_writeback() and let the page
> stay dirty a bit longer.

er, redirty_page_for_writepage(), not ...writeback()

> Shaggy
-- 
David Kleikamp
IBM Linux Technology Center

^ permalink raw reply	[flat|nested] 17+ messages in thread
end of thread, other threads:[~2008-06-23 18:05 UTC | newest]
Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <20080620073150.2bc9988e@tupile.poochiereds.net>
[not found] ` <OFE8C66E61.981E25D1-ON8725746E.0045A92A-8625746E.004718C0@us.ibm.com>
[not found] ` <20080620091542.09edb43f@tupile.poochiereds.net>
2008-06-20 16:19 ` flush and EIO errors when writepages fails Steve French (smfltc)
2008-06-20 16:34 ` Jeff Layton
2008-06-20 16:41 ` Steve French (smfltc)
2008-06-20 17:12 ` Jeff Layton
2008-06-20 22:34 Steve French
2008-06-21 7:05 ` Evgeniy Polyakov
2008-06-21 12:27 ` Jeff Layton
2008-06-21 13:19 ` Evgeniy Polyakov
2008-06-21 14:21 ` Jody French
2008-06-21 14:42 ` Evgeniy Polyakov
2008-06-21 16:15 ` Steve French
2008-06-21 16:28 ` Evgeniy Polyakov
2008-06-21 17:02 ` Steve French
2008-06-21 17:26 ` Evgeniy Polyakov
2008-06-21 17:37 ` Evgeniy Polyakov
2008-06-23 15:39 ` Dave Kleikamp
2008-06-23 18:05 ` Dave Kleikamp