* NFSERR_NOSPC nfs-client bug
@ 2008-03-18 1:21 Ray Ferguson
From: Ray Ferguson @ 2008-03-18 1:21 UTC
To: linux-nfs
I've discovered a bug in the linux nfs client. Specifically it ignores
NFSERR_NOSPC messages (code 28) from an NFS server and happily continues
pounding it with data.
This has some rather unfortunate consequences on linux nfs servers, because
it exhausts their resources. In 2.4, all cpus peg at 100% usage under the
system category. In 2.6, at least one core gets pegged at 100% iowait, but
this still triggers cascading load issues.
So far I've tested:
  Opensuse-10.3 = Linux 2.6.22 (client bug confirmed)
  RHAS4         = 2.6.9        (client bug confirmed)
  RHAS3         = 2.4.21       (No Bug: Pre-nfs4)
  Solaris 9     =              (No Bug)
This can be reproduced by creating a small filesystem and exporting it via
nfs. Then mount it with a buggy client and "cat /dev/zero > /nfs-share/foo"
The expected behavior is for the client to fail the write with an error
informing you that the filesystem is out of space. Instead, the client keeps
sending data and the server's kernel takes a beating.
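
For anyone who wants something more instrumented than "cat /dev/zero", here is
a minimal sketch of the same reproduction in Python. It is not part of the
original report: the helper name write_until_error and its parameters are
made up for illustration. It writes zeros and records whether the first error
shows up on write(), fsync() or close().

```python
import errno
import os


def write_until_error(path, chunk_size=1 << 20, max_bytes=1 << 30):
    """Write zeros to `path` and report where the first error surfaces.

    Hypothetical helper, not from the original report.
    Returns (stage, errno_name, bytes_accepted), where stage is "write",
    "fsync" or "close", or "none" if max_bytes went through cleanly.
    """
    chunk = b"\0" * chunk_size
    accepted = 0
    stage, err = "none", None
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        while accepted < max_bytes:
            stage = "write"
            # os.write may accept fewer bytes than requested; count what it took.
            accepted += os.write(fd, chunk)
        stage = "fsync"
        os.fsync(fd)
        stage = "none"
    except OSError as e:
        err = e.errno
    finally:
        try:
            os.close(fd)
        except OSError as e:
            # Only blame close() if nothing failed earlier.
            if err is None:
                stage, err = "close", e.errno
    name = errno.errorcode.get(err, str(err)) if err is not None else None
    return stage, name, accepted
```

Run against the small export described above, e.g.
write_until_error("/nfs-share/foo"), the interesting part is which stage comes
back and how many bytes the client accepted before anything was reported.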
I've checked the wire and confirmed that the server is sending the NOSPC
message back to the client. Most of my testing has been nfs3 though I did
some brief testing w/ nfs2 (bug still present). I have kernel sysrq debug
data and packet captures if anyone is interested.
If this is not the correct place to report this, I would be grateful if anyone
could redirect me.
Thank you for your help.
-
Ray Ferguson

* Re: NFSERR_NOSPC nfs-client bug
@ 2008-03-18 1:43 Greg Banks
From: Greg Banks @ 2008-03-18 1:43 UTC
To: nfs-Uh4cUGhLB8SgSpxsJD1C4w; +Cc: linux-nfs

Ray Ferguson wrote:
> I've discovered a bug in the linux nfs client. Specifically it ignores
> NFSERR_NOSPC messages (code 28) from an NFS server and happily continues
> pounding it with data.

It doesn't ignore ENOSPC, it reports it on close(). Of course this is
often several gigabytes of lost data too late. Sensible clients (e.g.
Irix) store that error on the inode and report it on the next call to
write().

--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
The cake is *not* a lie.
I don't speak for SGI.

* Re: NFSERR_NOSPC nfs-client bug
@ 2008-03-18 1:50 Ray Ferguson
From: Ray Ferguson @ 2008-03-18 1:50 UTC
To: linux-nfs

On Monday 17 March 2008 20:43, Greg Banks wrote:
> It doesn't ignore ENOSPC, it reports it on close(). Of course this is
> often several gigabytes of lost data too late.

In our case, the thrashing it gave our Linux NFS cluster was severe enough to
take it out of commission. The lost data from the file transfer that
triggered the event was the least of our worries.

-
Ray Ferguson

* Re: NFSERR_NOSPC nfs-client bug
@ 2008-03-18 2:08 Greg Banks
From: Greg Banks @ 2008-03-18 2:08 UTC
To: nfs-Uh4cUGhLB8SgSpxsJD1C4w; +Cc: linux-nfs

Ray Ferguson wrote:
> On Monday 17 March 2008 20:43, Greg Banks wrote:
>> It doesn't ignore ENOSPC, it reports it on close(). Of course this is
>> often several gigabytes of lost data too late.
>
> In our case, the thrashing it gave our Linux NFS cluster was severe enough to
> take it out of commission. The lost data from the file transfer that
> triggered the event was the least of our worries.

Agreed, it sucks.

--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
The cake is *not* a lie.
I don't speak for SGI.

* Re: NFSERR_NOSPC nfs-client bug
@ 2008-03-18 3:09 Trond Myklebust
From: Trond Myklebust @ 2008-03-18 3:09 UTC
To: Greg Banks; +Cc: nfs-Uh4cUGhLB8SgSpxsJD1C4w, linux-nfs

On Tue, 2008-03-18 at 13:08 +1100, Greg Banks wrote:
> Ray Ferguson wrote:
>> On Monday 17 March 2008 20:43, Greg Banks wrote:
>>> It doesn't ignore ENOSPC, it reports it on close(). Of course this is
>>> often several gigabytes of lost data too late.
>>
>> In our case, the thrashing it gave our Linux NFS cluster was severe enough to
>> take it out of commission. The lost data from the file transfer that
>> triggered the event was the least of our worries.
>
> Agreed, it sucks.

Try a more recent kernel: 2.6.24 and more recent will report these
errors more promptly.

Trond

* Re: NFSERR_NOSPC nfs-client bug
@ 2008-03-18 4:03 Greg Banks
From: Greg Banks @ 2008-03-18 4:03 UTC
To: Trond Myklebust; +Cc: nfs-Uh4cUGhLB8SgSpxsJD1C4w, linux-nfs

Trond Myklebust wrote:
> On Tue, 2008-03-18 at 13:08 +1100, Greg Banks wrote:
>> Ray Ferguson wrote:
>>> On Monday 17 March 2008 20:43, Greg Banks wrote:
>>>> It doesn't ignore ENOSPC, it reports it on close(). Of course this is
>>>> often several gigabytes of lost data too late.
>>>
>>> In our case, the thrashing it gave our Linux NFS cluster was severe enough to
>>> take it out of commission. The lost data from the file transfer that
>>> triggered the event was the least of our worries.
>>
>> Agreed, it sucks.
>
> Try a more recent kernel: 2.6.24 and more recent will report these
> errors more promptly.

So that would be

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=7b159fc18d417980f57aef64cab3417ee6af70f8

?

Is my reading right, that a solitary transient ENOSPC is still reported
at close(), but a sequence of two or more ENOSPC is reported in the next
write() call?

--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
The cake is *not* a lie.
I don't speak for SGI.

* Re: NFSERR_NOSPC nfs-client bug
@ 2008-03-18 6:08 NeilBrown
From: NeilBrown @ 2008-03-18 6:08 UTC
To: Greg Banks; +Cc: Trond Myklebust, nfs-Uh4cUGhLB8SgSpxsJD1C4w, linux-nfs

On Tue, March 18, 2008 3:03 pm, Greg Banks wrote:
> So that would be
>
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=7b159fc18d417980f57aef64cab3417ee6af70f8
>
> ?
>
> Is my reading right, that a solitary transient ENOSPC is still reported
> at close(), but a sequence of two or more ENOSPC is reported in the next
> write() call?

I don't think so.

After an error report is received by the client, the next
write/fsync/close will report the error (even if that write didn't
actually have an error). Also, further writes will be attempted
synchronously, so the correct error is reported, until a write succeeds.
At this point we go back to async writes.

So a solitary transient ENOSPC will be reported against a subsequent
write.

When a write fails, the page remains DIRTY, so a flush will be attempted
on every subsequent 'sync', including those triggered by a write while
the error flag is set. So if you keep writing, you will keep getting an
error until all dirty pages have been safely written to the server.

If you give up and close the file, you won't be able to tell just by
looking at the error codes which pages were successfully written and
which aren't. But it would seem unwise to expect to be able to do that
in any case.

NeilBrown
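
The behaviour described in this last message can be sketched as a toy state
machine. This is only a model for reasoning about which call sees the error;
it is not the kernel code, and every name in it (ToyClient, background_flush,
server_free_pages) is invented for illustration. Pages are just counted, and
the "server" is a free-space counter.

```python
import errno


class ToyClient:
    """Toy model (an assumption, not kernel code) of the error reporting above."""

    def __init__(self, server_free_pages):
        self.free = server_free_pages  # space left on the pretend server
        self.dirty = 0                 # pages cached but not yet on the server
        self.pending_error = None      # error seen by async writeback, unreported
        self.sync_mode = False         # True after an error, until a write succeeds

    def _flush(self):
        # Push dirty pages; a page that fails stays dirty.
        while self.dirty:
            if self.free == 0:
                return errno.ENOSPC
            self.free -= 1
            self.dirty -= 1
        return None

    def background_flush(self):
        # Async writeback: errors are recorded, nobody is told yet.
        err = self._flush()
        if err is not None:
            self.pending_error = err
            self.sync_mode = True

    def write(self):
        self.dirty += 1
        if not self.sync_mode:
            return None  # async: accepted into the cache, no error visible
        err = self._flush()  # error flag set: write synchronously
        reported = err if err is not None else self.pending_error
        self.pending_error = None
        if err is None:
            self.sync_mode = False  # a write succeeded, back to async
        return reported

    def close(self):
        err = self._flush()
        reported = err if err is not None else self.pending_error
        self.pending_error = None
        return reported
```

In this model a solitary transient ENOSPC (space is freed before the next
write) is still reported once, on the following write(), and a persistent one
is reported on every write() until the dirty pages reach the server.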