* flush and EIO errors when writepages fails
  [not found] ` <20080620091542.09edb43f@tupile.poochiereds.net>
@ 2008-06-20 16:19 ` Steve French (smfltc)
  2008-06-20 16:34 ` Jeff Layton
  0 siblings, 1 reply; 17+ messages in thread
From: Steve French (smfltc) @ 2008-06-20 16:19 UTC (permalink / raw)
To: Jeff Layton; +Cc: Shirish S Pargaonkar, shaggy, linux-fsdevel

If flush fails to write all dirty pages (due to an I/O error on the
server, the server's disk, or the networking stack), today the error
(EIO) is marked in the inode and returned on close. I think cifs_flush
(which is called before close by the VFS) should also retry the
filemap_fdatawrite at least once (perhaps after sleeping a second or so)
before giving up, and perhaps retry more if mounted hard. Thoughts?

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: flush and EIO errors when writepages fails
  2008-06-20 16:19 ` flush and EIO errors when writepages fails Steve French (smfltc)
@ 2008-06-20 16:34 ` Jeff Layton
  2008-06-20 16:41 ` Steve French (smfltc)
  0 siblings, 1 reply; 17+ messages in thread
From: Jeff Layton @ 2008-06-20 16:34 UTC (permalink / raw)
To: Steve French (smfltc); +Cc: Shirish S Pargaonkar, shaggy, linux-fsdevel

On Fri, 20 Jun 2008 11:19:19 -0500
"Steve French (smfltc)" <smfltc@us.ibm.com> wrote:

> If flush fails to write all dirty pages (due to an I/O error on the
> server, the server's disk, or the networking stack), today the error
> (EIO) is marked in the inode and returned on close. I think cifs_flush
> (which is called before close by the VFS) should also retry the
> filemap_fdatawrite at least once (perhaps after sleeping a second or
> so) before giving up, and perhaps retry more if mounted hard. Thoughts?
>

A couple of thoughts...

Retrying is only likely to be helpful if the server isn't responding. We
could consider doing a better job there somehow.

...and...

Suppose we have a bunch of dirty pages for an inode. We call cifs_flush,
which calls filemap_fdatawrite (which eventually calls cifs_writepages)
and attempts to write all of the pages out. They all fail and then get
discarded. Then we call filemap_fdatawrite again. Now there are no more
dirty pages and this returns success. But the data was tossed out on the
first filemap_fdatawrite call, so the success here really isn't a
success...

If you want to be more aggressive about handling errors when writing
out pages, then most of the changes will need to be made at the
cifs_writepages level, not so much in cifs_flush.

-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: flush and EIO errors when writepages fails
  2008-06-20 16:34 ` Jeff Layton
@ 2008-06-20 16:41 ` Steve French (smfltc)
  2008-06-20 17:12 ` Jeff Layton
  0 siblings, 1 reply; 17+ messages in thread
From: Steve French (smfltc) @ 2008-06-20 16:41 UTC (permalink / raw)
To: Jeff Layton; +Cc: Shirish S Pargaonkar, shaggy, linux-fsdevel

Jeff Layton wrote:
> On Fri, 20 Jun 2008 11:19:19 -0500
> "Steve French (smfltc)" <smfltc@us.ibm.com> wrote:
>
>> If flush fails to write all dirty pages (due to an I/O error on the
>> server, the server's disk, or the networking stack), today the error
>> (EIO) is marked in the inode and returned on close. I think
>> cifs_flush (which is called before close by the VFS) should also
>> retry the filemap_fdatawrite at least once (perhaps after sleeping a
>> second or so) before giving up, and perhaps retry more if mounted
>> hard. Thoughts?
>>
> A couple of thoughts...
>
> Retrying is only likely to be helpful if the server isn't responding.
> We could consider doing a better job there somehow.
>
The particular problem case I am thinking of at the moment, and which I
hope retry would help, is the one in which memory pressure prevents the
TCP/IP stack or the underlying (perhaps badly written) network adapter
driver from getting the SMB write packet out to the wire at all.

> If you want to be more aggressive about handling errors when writing
> out pages, then most of the changes will need to be made at the
> cifs_writepages level, not so much in cifs_flush.
>
flush is our "last chance" effort to write the file data - once flush
and close are called, the file handle is gone, so we can no longer write
the file data after that point.

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: flush and EIO errors when writepages fails
  2008-06-20 16:41 ` Steve French (smfltc)
@ 2008-06-20 17:12 ` Jeff Layton
  0 siblings, 0 replies; 17+ messages in thread
From: Jeff Layton @ 2008-06-20 17:12 UTC (permalink / raw)
To: Steve French (smfltc); +Cc: Shirish S Pargaonkar, shaggy, linux-fsdevel

On Fri, 20 Jun 2008 11:41:26 -0500
"Steve French (smfltc)" <smfltc@us.ibm.com> wrote:

> Jeff Layton wrote:
> > On Fri, 20 Jun 2008 11:19:19 -0500
> > "Steve French (smfltc)" <smfltc@us.ibm.com> wrote:
> >
> >> If flush fails to write all dirty pages (due to an I/O error on the
> >> server, the server's disk, or the networking stack), today the error
> >> (EIO) is marked in the inode and returned on close. I think
> >> cifs_flush (which is called before close by the VFS) should also
> >> retry the filemap_fdatawrite at least once (perhaps after sleeping a
> >> second or so) before giving up, and perhaps retry more if mounted
> >> hard. Thoughts?
> >>
> > A couple of thoughts...
> >
> > Retrying is only likely to be helpful if the server isn't responding.
> > We could consider doing a better job there somehow.
> >
> The particular problem case I am thinking of at the moment, and which I
> hope retry would help, is the one in which memory pressure prevents the
> TCP/IP stack or the underlying (perhaps badly written) network adapter
> driver from getting the SMB write packet out to the wire at all.

I'll buy that...though I'm not sure how often this is really an issue. I
don't think I've seen any cases of memory allocations failing and
preventing CIFS from writing out pages. I'm also not convinced that we'd
see a lot of success from retrying in these situations.

I have seen problems with NFS/RPC hitting deadlocks when trying to do
sleeping allocations in a write. rpciod tries to do a __GFP_WAIT
allocation, and due to memory pressure ends up trying to write out NFS
pages -- deadlock. A lot of these are now fixed in mainline, but we
still have some less traveled codepaths in NFS/RPC that could deadlock
this way.

We're probably being a bit too optimistic with how we do memory
allocations in CIFS in places, so we may be subject to similar
deadlocks. These types of problems concern me more than retrying failed
page writeouts...

> > If you want to be more aggressive about handling errors when writing
> > out pages, then most of the changes will need to be made at the
> > cifs_writepages level, not so much in cifs_flush.
> >
> flush is our "last chance" effort to write the file data - once flush
> and close are called, the file handle is gone, so we can no longer
> write the file data after that point.

Right, but with the current implementation, once filemap_fdatawrite
returns, any pages that that run touched are either written out or
discarded. Retrying the filemap_fdatawrite from cifs_flush won't help.
This could be redesigned, but we're probably better off fleshing out
cifs_writepages to handle this based on some sort of condition (maybe
the WB_SYNC_* flags or something).

-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: flush and EIO errors when writepages fails
@ 2008-06-20 22:34 Steve French
2008-06-21 7:05 ` Evgeniy Polyakov
0 siblings, 1 reply; 17+ messages in thread
From: Steve French @ 2008-06-20 22:34 UTC (permalink / raw)
To: linux-fsdevel; +Cc: Shirish Pargaonkar, Dave Kleikamp, Jody French
> Right, but with the current implementation, once filemap_fdatawrite
> returns, any pages that that run touched are either written out or
> discarded.
That could explain some problems if true. When writepages fails, we
mark the pages as in error (PG_error flag?) and presumably they are
still dirty. Why in the world would anyone free the pages just
because we failed the first time and need to write them again later?
Do you know where (presumably in /mm) pages could be freed while they
are still dirty (it is hard to find where the PG_error flag is
checked, etc.)?
--
Thanks,
Steve
^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: flush and EIO errors when writepages fails
  2008-06-20 22:34 Steve French
@ 2008-06-21  7:05 ` Evgeniy Polyakov
  2008-06-21 12:27 ` Jeff Layton
  0 siblings, 1 reply; 17+ messages in thread
From: Evgeniy Polyakov @ 2008-06-21 7:05 UTC (permalink / raw)
To: Steve French
Cc: linux-fsdevel, Shirish Pargaonkar, Dave Kleikamp, Jody French

Hi.

On Fri, Jun 20, 2008 at 05:34:21PM -0500, Steve French (smfrench@gmail.com) wrote:
> > Right, but with the current implementation, once filemap_fdatawrite
> > returns, any pages that that run touched are either written out or
> > discarded.

Depending on the writepages() implementation, that is not always the case.

> That could explain some problems if true. When writepages fails, we
> mark the pages as in error (PG_error flag?) and presumably they are
> still dirty. Why in the world would anyone free the pages just
> because we failed the first time and need to write them again later?
> Do you know where (presumably in /mm) pages could be freed while they
> are still dirty (it is hard to find where the PG_error flag is
> checked, etc.)?

You can clear the writeback bit but leave/set the dirty bit in the
completion callback for a given request. You can manually insert the
page into the radix tree with the dirty tag. You can also lock the page
and not allow it to be unlocked until you have resent your data.

So, there are plenty of possibilities to break page accounting in your
own writepages() method :)

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: flush and EIO errors when writepages fails
  2008-06-21  7:05 ` Evgeniy Polyakov
@ 2008-06-21 12:27 ` Jeff Layton
  2008-06-21 13:19 ` Evgeniy Polyakov
  0 siblings, 1 reply; 17+ messages in thread
From: Jeff Layton @ 2008-06-21 12:27 UTC (permalink / raw)
To: Steve French
Cc: Evgeniy Polyakov, linux-fsdevel, Shirish Pargaonkar, Dave Kleikamp,
	Jody French

On Sat, 21 Jun 2008 11:05:56 +0400
Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:

> Hi.
>
> On Fri, Jun 20, 2008 at 05:34:21PM -0500, Steve French (smfrench@gmail.com) wrote:
> > > Right, but with the current implementation, once filemap_fdatawrite
> > > returns, any pages that that run touched are either written out or
> > > discarded.
>
> Depending on the writepages() implementation, that is not always the
> case.
>

Right. I meant with the current cifs_writepages() implementation. It
looks like when the write fails we walk the pagevec and do:

	if (rc)
		SetPageError(page);
	kunmap(page);
	unlock_page(page);
	end_page_writeback(page);
	page_cache_release(page);

So I'm not certain that the data is actually discarded. I'm no expert
in page accounting, but I'm pretty sure that we won't attempt to write
it out again from CIFS.

> > That could explain some problems if true. When writepages fails, we
> > mark the pages as in error (PG_error flag?) and presumably they are
> > still dirty. Why in the world would anyone free the pages just
> > because we failed the first time and need to write them again later?
> > Do you know where (presumably in /mm) pages could be freed while
> > they are still dirty (it is hard to find where the PG_error flag is
> > checked, etc.)?
>
> You can clear the writeback bit but leave/set the dirty bit in the
> completion callback for a given request. You can manually insert the
> page into the radix tree with the dirty tag. You can also lock the
> page and not allow it to be unlocked until you have resent your data.
>
> So, there are plenty of possibilities to break page accounting in
> your own writepages() method :)
>

I still think that if you get a hard error back from the server then
you're not likely to have much success on a second attempt to write out
the pages. Returning an error to userspace as soon as you realize that
it didn't work seems reasonable to me. A non-responding server, on the
other hand, may be a place that's recoverable.

Either way, if we really want to do a second attempt to write out the
pagevec, then adding some code to cifs_writepages that sleeps for a bit
and implements this seems like the thing to do. I'm not convinced that
it will actually make much difference, but it seems unlikely to hurt
anything.

-- 
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: flush and EIO errors when writepages fails
  2008-06-21 12:27 ` Jeff Layton
@ 2008-06-21 13:19 ` Evgeniy Polyakov
  2008-06-21 14:21 ` Jody French
  0 siblings, 1 reply; 17+ messages in thread
From: Evgeniy Polyakov @ 2008-06-21 13:19 UTC (permalink / raw)
To: Jeff Layton
Cc: Steve French, linux-fsdevel, Shirish Pargaonkar, Dave Kleikamp,
	Jody French

On Sat, Jun 21, 2008 at 08:27:42AM -0400, Jeff Layton (jlayton@redhat.com) wrote:
> Right. I meant with the current cifs_writepages() implementation. It
> looks like when the write fails we walk the pagevec and do:
>
> 	if (rc)
> 		SetPageError(page);
> 	kunmap(page);
> 	unlock_page(page);
> 	end_page_writeback(page);
> 	page_cache_release(page);
>
> So I'm not certain that the data is actually discarded. I'm no expert
> in page accounting, but I'm pretty sure that we won't attempt to write
> it out again from CIFS.

Yes, the current CIFS implementation does not allow much in the way of
failover recovery. The data is still in the cache; only the page is
marked as being in error, so nothing prevents it from being redirtied
by a subsequent write.

> I still think that if you get a hard error back from the server then
> you're not likely to have much success on a second attempt to write out
> the pages. Returning an error to userspace as soon as you realize that
> it didn't work seems reasonable to me. A non-responding server, on the
> other hand, may be a place that's recoverable.
>
> Either way, if we really want to do a second attempt to write out the
> pagevec, then adding some code to cifs_writepages that sleeps for a bit
> and implements this seems like the thing to do. I'm not convinced that
> it will actually make much difference, but it seems unlikely to hurt
> anything.

If the server returns a serious error then there is no way out except
to discard the data with an error, but if the server does not respond,
or responds with EBUSY or that kind of error, then a subsequent write
can succeed and at least should not do any harm. As a simple case, it
is possible to sleep a bit and resend in writepages(), but it is also
possible just to return from the callback and allow the VFS to call it
again (frequently that will happen very soon though) with the same
pages (or even a larger chunk).

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: flush and EIO errors when writepages fails
  2008-06-21 13:19 ` Evgeniy Polyakov
@ 2008-06-21 14:21 ` Jody French
  2008-06-21 14:42 ` Evgeniy Polyakov
  2008-06-23 15:39 ` Dave Kleikamp
  0 siblings, 2 replies; 17+ messages in thread
From: Jody French @ 2008-06-21 14:21 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Jeff Layton, Steve French, linux-fsdevel, Shirish Pargaonkar,
	Dave Kleikamp

Evgeniy Polyakov wrote:
>> Either way, if we really want to do a second attempt to write out the
>> pagevec, then adding some code to cifs_writepages that sleeps for a
>> bit and implements this seems like the thing to do. I'm not convinced
>> that it will actually make much difference, but it seems unlikely to
>> hurt anything.
>>
>
> If the server returns a serious error then there is no way out except
> to discard the data with an error, but if the server does not respond,
> or responds with EBUSY or that kind of error, then a subsequent write
> can succeed and at least should not do any harm. As a simple case, it
> is possible to sleep a bit and resend in writepages(), but it is also
> possible just to return from the callback and allow the VFS to call it
> again (frequently that will happen very soon though) with the same
> pages (or even a larger chunk).
>
In the particular case we are looking at, the network stack (perhaps due
to a temporary glitch in the network adapter or routing infrastructure,
or to temporary memory pressure) is returning EAGAIN for more than 15
seconds on the TCP send of the Write request, but the server itself has
not crashed (subsequent parts of the file, written via later writepages
requests, are eventually written out). Eventually we give up in
writepages and return EIO on the next fsync or flush/close - but if we
could make one more attempt in flush to write all dirty pages, including
the ones that we timed out on, that would help.

In addition, if readpage is about to do a partial page read into a dirty
page that we were unable to write out, we would like to try once more
before corrupting the data.

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: flush and EIO errors when writepages fails
  2008-06-21 14:21 ` Jody French
@ 2008-06-21 14:42 ` Evgeniy Polyakov
  2008-06-21 16:15 ` Steve French
  2008-06-23 15:39 ` Dave Kleikamp
  1 sibling, 1 reply; 17+ messages in thread
From: Evgeniy Polyakov @ 2008-06-21 14:42 UTC (permalink / raw)
To: Jody French
Cc: Jeff Layton, Steve French, linux-fsdevel, Shirish Pargaonkar,
	Dave Kleikamp

On Sat, Jun 21, 2008 at 09:21:42AM -0500, Jody French (jfrench@austin.rr.com) wrote:
> In the particular case we are looking at, the network stack (perhaps
> due to a temporary glitch in the network adapter or routing
> infrastructure, or to temporary memory pressure) is returning EAGAIN
> for more than 15 seconds on the TCP send of the Write request, but the
> server itself has not crashed (subsequent parts of the file, written
> via later writepages requests, are eventually written out). Eventually
> we give up in writepages and return EIO on the next fsync or
> flush/close - but if we could make one more attempt in flush to write
> all dirty pages, including the ones that we timed out on, that would
> help. In addition, if readpage is about to do a partial page read into
> a dirty page that we were unable to write out, we would like to try
> once more before corrupting the data.

If you do not unlock and release the page, nothing can corrupt it, but
I'm not sure that, with a 15 second timeout, writepage+flush will have
a long enough interval to exceed it. In flush you can switch to
nonblocking mode and set the socket timeout to 30 seconds, for example,
and if even that fails, then discard the data. EAGAIN likely means a
problem on the server, which I referred to as non-serious, and likely
it will resolve in a few moments, so your trick with write+flush can
work, but only with a long enough timeout.

Actually you can always perform a similar trick with the socket
timeout, but beware of problems with umount or sync, which can then
take too long to complete, so there should be some flag to indicate
when you do and do not want it. A similar scheme was implemented in
POHMELFS.

Returning an error, I think, is the last thing to do, and whatever
retry mechanism you decide to implement is worth the effort.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: flush and EIO errors when writepages fails
  2008-06-21 14:42 ` Evgeniy Polyakov
@ 2008-06-21 16:15 ` Steve French
  2008-06-21 16:28 ` Evgeniy Polyakov
  0 siblings, 1 reply; 17+ messages in thread
From: Steve French @ 2008-06-21 16:15 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Jeff Layton, linux-fsdevel, Shirish Pargaonkar, Dave Kleikamp

On Sat, Jun 21, 2008 at 9:42 AM, Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> If you do not unlock and release the page, nothing can corrupt it, but
> I'm not sure that, with a 15 second timeout, writepage+flush will have
> a long enough interval to exceed it. In flush you can switch to
> nonblocking mode and set the socket timeout to 30 seconds, for example,
> and if even that fails, then discard the data. EAGAIN likely means a
> problem on the server, which I referred to as non-serious, and likely
> it will resolve in a few moments, so your trick with write+flush can
> work, but only with a long enough timeout.
>
> Actually you can always perform a similar trick with the socket
> timeout, but beware of problems with umount or sync, which can then
> take too long to complete, so there should be some flag to indicate
> when you do and do not want it. A similar scheme was implemented in
> POHMELFS.
>
> Returning an error, I think, is the last thing to do, and whatever
> retry mechanism you decide to implement is worth the effort.

Any thoughts on how to simulate this (ie repeated EAGAINs on the
socket) so we can reproduce this kind of scenario without lots of
stress (including a large local file copy to increase memory pressure,
while the remote file is being written out)?

--
Thanks,

Steve

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: flush and EIO errors when writepages fails
  2008-06-21 16:15 ` Steve French
@ 2008-06-21 16:28 ` Evgeniy Polyakov
  2008-06-21 17:02 ` Steve French
  0 siblings, 1 reply; 17+ messages in thread
From: Evgeniy Polyakov @ 2008-06-21 16:28 UTC (permalink / raw)
To: Steve French
Cc: Jeff Layton, linux-fsdevel, Shirish Pargaonkar, Dave Kleikamp

On Sat, Jun 21, 2008 at 11:15:00AM -0500, Steve French (smfrench@gmail.com) wrote:
> On Sat, Jun 21, 2008 at 9:42 AM, Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> > If you do not unlock and release the page, nothing can corrupt it,
> > but I'm not sure that, with a 15 second timeout, writepage+flush
> > will have a long enough interval to exceed it. In flush you can
> > switch to nonblocking mode and set the socket timeout to 30 seconds,
> > for example, and if

switch to blocking with a 30 second timeout, of course.

> Any thoughts on how to simulate this (ie repeated EAGAINs on the
> socket) so we can reproduce this kind of scenario without lots of
> stress (including a large local file copy to increase memory pressure,
> while the remote file is being written out)?

Depending on how deep you want to dig into the socket code :)

The simplest way is just to provide the MSG_DONTWAIT flag and set the
socket buffer small enough (either via sysctl or a socket option). The
trick with a small socket buffer will help you get EAGAIN (note that it
is only returned for a nonblocking socket; otherwise the send will
sleep for up to sock->sk_sndtimeo/sk_rcvtimeo jiffies and then return
the number of bytes transferred) quite soon under any kind of load
(well, you can set it to a really miserable value, but that limit will
not be followed strictly :).

You can also hack into socket code like tcp_sendmsg()/tcp_sendpage()
(the latter is not used in CIFS though) and unconditionally return
errors like ENOMEM and EAGAIN there (or bind that to some socket
option, the time of day, or a random value).

Those were the simplest, imho, but you can also set up a network
scheduler (netem is the best for this kind of test) to
drop/reorder/corrupt the data flow, which will also result in protocol
troubles, which will lead to delays in sending and the possibility of
delayed writing.

In POHMELFS I emulated similar behaviour simply by injecting an error
into the sending socket codepath, setting the socket non-blocking, and
killing the server.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: flush and EIO errors when writepages fails
  2008-06-21 16:28 ` Evgeniy Polyakov
@ 2008-06-21 17:02 ` Steve French
  2008-06-21 17:26 ` Evgeniy Polyakov
  0 siblings, 1 reply; 17+ messages in thread
From: Steve French @ 2008-06-21 17:02 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Jeff Layton, linux-fsdevel, Shirish Pargaonkar, Dave Kleikamp

On Sat, Jun 21, 2008 at 11:28 AM, Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> You can also hack into socket code like tcp_sendmsg()/tcp_sendpage()
> (the latter is not used in CIFS though)

Your point about tcp_sendpage reminds me of something I have been
wondering about for a while. Since SunRPC switched to kernel_sendpage,
I have been wondering whether that is better or worse than
kernel_sendmsg. It looks like in some cases sendpage simply falls back
to calling sendmsg with one iovec (instead of calling tcp_sendpage,
which calls do_tcp_sendpages), which would end up being slower than
calling sendmsg with a larger iovec as we do in smb_send2 in the write
path.

For the write case, in which we are writing (aligned) pages out of the
page cache to the socket, would sendpage be any faster than sendmsg?
(I wish there were a send-multiple-pages call where I could pass the
whole list of pages.) How should a piece of kernel code check whether
sendpage is supported/faster, and when should it use kernel_sendpage
versus sendmsg with the pages in the iovec?

--
Thanks,

Steve

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: flush and EIO errors when writepages fails
  2008-06-21 17:02 ` Steve French
@ 2008-06-21 17:26 ` Evgeniy Polyakov
  2008-06-21 17:37 ` Evgeniy Polyakov
  0 siblings, 1 reply; 17+ messages in thread
From: Evgeniy Polyakov @ 2008-06-21 17:26 UTC (permalink / raw)
To: Steve French
Cc: Jeff Layton, linux-fsdevel, Shirish Pargaonkar, Dave Kleikamp

On Sat, Jun 21, 2008 at 12:02:31PM -0500, Steve French (smfrench@gmail.com) wrote:
> Your point about tcp_sendpage reminds me of something I have been
> wondering about for a while. Since SunRPC switched to kernel_sendpage,
> I have been wondering whether that is better or worse than
> kernel_sendmsg. It looks like in some cases sendpage simply falls back
> to calling sendmsg with one iovec (instead of calling tcp_sendpage,
> which calls do_tcp_sendpages), which would end up being slower than
> calling sendmsg with a larger iovec as we do in smb_send2 in the write
> path.

This happens for hardware which does not support hardware checksumming
and scatter-gather. sendpage() fundamentally requires them, since
sending is lockless and checksumming happens at the very end of the
transmit path: around the time the hardware DMAs the data onto the
wire. Software checksumming could end up with a broken checksum if the
data were changed in flight. Also note that kernel_sendpage() returning
does not mean that the data was really sent, so modifying the page
afterwards can lead to a corrupted protocol. It is also forbidden to
sendpage() slab pages.

> For the write case, in which we are writing (aligned) pages out of the
> page cache to the socket, would sendpage be any faster than sendmsg?
> (I wish there were a send-multiple-pages call where I could pass the
> whole list of pages.) How should a piece of kernel code check whether
> sendpage is supported/faster, and when should it use kernel_sendpage
> versus sendmsg with the pages in the iovec?

You can simply check sk->sk_route_caps; it has to have the NETIF_F_SG
and NETIF_F_ALL_CSUM bits set to support sendpage().

sendpage() is generally faster, since it does not perform a data copy
(or checksumming, though depending on how it is called, e.g. from
userspace, that may not be the main factor). With jumbo frames it
provides a more noticeable win, but for smaller MTUs it is frequently
not that big (well, in POHMELFS I did not get any better numbers from
switching to sendpage() instead of sendmsg(), neither in CPU
utilization nor in performance, but I ran 1500-MTU tests either on
quite fast machines with a GigE link or over a very slow (3 Mbyte/s)
link). sendpage() also performs fewer allocations, and they are
smaller than those for sendmsg().

I'm actually surprised that in bulk transfer per-page sending is slower
than lots of pages in one go. Of course it should be faster, but the
difference should be very small, since it is only a matter of grabbing
the socket lock, which for bulk data sending should not be an issue at
all. At least in my tests I never saw a difference, and easily achieved
the wire limit even for per-page sending with a data copy (i.e.
sendmsg()).

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: flush and EIO errors when writepages fails
  2008-06-21 17:26 ` Evgeniy Polyakov
@ 2008-06-21 17:37 ` Evgeniy Polyakov
  0 siblings, 0 replies; 17+ messages in thread
From: Evgeniy Polyakov @ 2008-06-21 17:37 UTC (permalink / raw)
To: Steve French
Cc: Jeff Layton, linux-fsdevel, Shirish Pargaonkar, Dave Kleikamp

On Sat, Jun 21, 2008 at 09:26:39PM +0400, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote:
> I'm actually surprised that in bulk transfer per-page sending is slower
> than lots of pages in one go. Of course it should be faster, but

.... slower :)

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: flush and EIO errors when writepages fails
  2008-06-21 14:21 ` Jody French
  2008-06-21 14:42 ` Evgeniy Polyakov
@ 2008-06-23 15:39 ` Dave Kleikamp
  2008-06-23 18:05 ` Dave Kleikamp
  1 sibling, 1 reply; 17+ messages in thread
From: Dave Kleikamp @ 2008-06-23 15:39 UTC (permalink / raw)
To: Steve French
Cc: Evgeniy Polyakov, Jeff Layton, Steve French, linux-fsdevel,
	Shirish Pargaonkar

On Sat, 2008-06-21 at 09:21 -0500, Jody French wrote:
> Evgeniy Polyakov wrote:
> >> Either way, if we really want to do a second attempt to write out
> >> the pagevec, then adding some code to cifs_writepages that sleeps
> >> for a bit and implements this seems like the thing to do. I'm not
> >> convinced that it will actually make much difference, but it seems
> >> unlikely to hurt anything.
> >>
> >
> > If the server returns a serious error then there is no way out except
> > to discard the data with an error, but if the server does not
> > respond, or responds with EBUSY or that kind of error, then a
> > subsequent write can succeed and at least should not do any harm. As
> > a simple case, it is possible to sleep a bit and resend in
> > writepages(), but it is also possible just to return from the
> > callback and allow the VFS to call it again (frequently that will
> > happen very soon though) with the same pages (or even a larger
> > chunk).
> >
> In the particular case we are looking at, the network stack (perhaps
> due to a temporary glitch in the network adapter or routing
> infrastructure, or to temporary memory pressure) is returning EAGAIN
> for more than 15 seconds on the TCP send of the Write request, but the
> server itself has not crashed (subsequent parts of the file, written
> via later writepages requests, are eventually written out). Eventually
> we give up in writepages and return EIO on the next fsync or
> flush/close

If you are getting EAGAIN, you shouldn't give up with an error. It
would be better to call redirty_page_for_writeback() and let the page
stay dirty a bit longer.

Shaggy
-- 
David Kleikamp
IBM Linux Technology Center

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: flush and EIO errors when writepages fails
  2008-06-23 15:39 ` Dave Kleikamp
@ 2008-06-23 18:05 ` Dave Kleikamp
  0 siblings, 0 replies; 17+ messages in thread
From: Dave Kleikamp @ 2008-06-23 18:05 UTC (permalink / raw)
To: Steve French
Cc: Evgeniy Polyakov, Jeff Layton, Steve French, linux-fsdevel,
	Shirish Pargaonkar

On Mon, 2008-06-23 at 15:39 +0000, Dave Kleikamp wrote:
> If you are getting EAGAIN, you shouldn't give up with an error. It
> would be better to call redirty_page_for_writeback() and let the page
> stay dirty a bit longer.

er, redirty_page_for_writepage(), not ...writeback()

> Shaggy
-- 
David Kleikamp
IBM Linux Technology Center

^ permalink raw reply	[flat|nested] 17+ messages in thread
end of thread, other threads:[~2008-06-23 18:05 UTC | newest]
Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <20080620073150.2bc9988e@tupile.poochiereds.net>
[not found] ` <OFE8C66E61.981E25D1-ON8725746E.0045A92A-8625746E.004718C0@us.ibm.com>
[not found] ` <20080620091542.09edb43f@tupile.poochiereds.net>
2008-06-20 16:19 ` flush and EIO errors when writepages fails Steve French (smfltc)
2008-06-20 16:34 ` Jeff Layton
2008-06-20 16:41 ` Steve French (smfltc)
2008-06-20 17:12 ` Jeff Layton
2008-06-20 22:34 Steve French
2008-06-21 7:05 ` Evgeniy Polyakov
2008-06-21 12:27 ` Jeff Layton
2008-06-21 13:19 ` Evgeniy Polyakov
2008-06-21 14:21 ` Jody French
2008-06-21 14:42 ` Evgeniy Polyakov
2008-06-21 16:15 ` Steve French
2008-06-21 16:28 ` Evgeniy Polyakov
2008-06-21 17:02 ` Steve French
2008-06-21 17:26 ` Evgeniy Polyakov
2008-06-21 17:37 ` Evgeniy Polyakov
2008-06-23 15:39 ` Dave Kleikamp
2008-06-23 18:05 ` Dave Kleikamp