* Long sleep with i_mutex in xfs_flush_device(), affects NFS service
@ 2006-09-26 18:51 Stephane Doyon
2006-09-26 19:06 ` [NFS] " Trond Myklebust
2006-09-27 11:33 ` Shailendra Tripathi
0 siblings, 2 replies; 17+ messages in thread
From: Stephane Doyon @ 2006-09-26 18:51 UTC (permalink / raw)
To: xfs, nfs
Hi,
I'm seeing an unpleasant behavior when an XFS file system becomes full,
particularly when accessed over NFS. Both XFS and the Linux NFS client
appear to be contributing to the problem.
When the file system becomes nearly full, we eventually call down to
xfs_flush_device(), which sleeps for 0.5 seconds, waiting for xfssyncd
to do some work.
xfs_flush_space() does
	xfs_iunlock(ip, XFS_ILOCK_EXCL);
before calling xfs_flush_device(), but i_mutex is still held, at least
when we're being called from under xfs_write(). That seems like a
fairly long time to hold a mutex. And I wonder whether it's really
necessary to keep going through that again and again for every new
request after we've hit ENOSPC.
In particular this can cause a pileup when several threads are writing
concurrently to the same file. Some specialized apps might do that, and
nfsd threads do it all the time.
To reproduce locally, on a full file system:
#!/bin/sh
for i in `seq 30`; do
dd if=/dev/zero of=f bs=1 count=1 &
done
wait
Timing that, it takes almost exactly 15 s: 30 writers serialized on
i_mutex at 0.5 s each.
The Linux NFS client typically sends bunches of 16 requests, and so if
the client is writing a single file, some NFS requests are therefore
delayed by up to 8 seconds (16 requests at 0.5 s each), which is kind
of long for NFS.
What's worse, when my Linux NFS client writes out a file's pages, it
does not react immediately on receiving an ENOSPC error. It will
remember the error and report it later on close(), but it still goes
ahead and issues write requests for each page of the file. So even if
there isn't a pileup on the i_mutex on the server, the NFS client
still waits 0.5 s for each 32K (typically) request. So on an NFS
client on a gigabit network, on an already full filesystem, if I open
and write a 10M file and close() it, it takes 2m40.083s for it to
issue all the requests, get an ENOSPC for each, and finally have my
close() call return ENOSPC. That can stretch to several hours for
gigabyte-sized files, which is how I noticed the problem.
I'm not too familiar with the NFS client code, but would it not be
possible for it to give up when it encounters ENOSPC? Or is there some
reason why this wouldn't be desirable?
The rough workaround I have come up with for the problem is to have
xfs_flush_space() skip calling xfs_flush_device() if we are within
2 seconds of having returned ENOSPC. I have verified that this
workaround is effective, but I imagine there might be a cleaner
solution.
Thanks
^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [NFS] Long sleep with i_mutex in xfs_flush_device(), affects NFS service
  2006-09-26 18:51 Long sleep with i_mutex in xfs_flush_device(), affects NFS service Stephane Doyon
@ 2006-09-26 19:06 ` Trond Myklebust
  2006-09-26 20:05   ` Stephane Doyon
  2006-09-27 11:33 ` Shailendra Tripathi
  1 sibling, 1 reply; 17+ messages in thread
From: Trond Myklebust @ 2006-09-26 19:06 UTC (permalink / raw)
To: Stephane Doyon; +Cc: xfs, nfs

On Tue, 2006-09-26 at 14:51 -0400, Stephane Doyon wrote:
> Hi,
>
> I'm seeing an unpleasant behavior when an XFS file system becomes
> full, particularly when accessed over NFS. Both XFS and the Linux NFS
> client appear to be contributing to the problem.
>
> When the file system becomes nearly full, we eventually call down to
> xfs_flush_device(), which sleeps for 0.5 seconds, waiting for
> xfssyncd to do some work.
>
> xfs_flush_space() does
> 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> before calling xfs_flush_device(), but i_mutex is still held, at
> least when we're being called from under xfs_write(). It seems like a
> fairly long time to hold a mutex. And I wonder whether it's really
> necessary to keep going through that again and again for every new
> request after we've hit ENOSPC.
>
> In particular this can cause a pileup when several threads are
> writing concurrently to the same file. Some specialized apps might do
> that, and nfsd threads do it all the time.
>
> To reproduce locally, on a full file system:
> #!/bin/sh
> for i in `seq 30`; do
> 	dd if=/dev/zero of=f bs=1 count=1 &
> done
> wait
> Timing that, it takes almost exactly 15 s.
>
> The Linux NFS client typically sends bunches of 16 requests, and so
> if the client is writing a single file, some NFS requests are
> therefore delayed by up to 8 seconds, which is kind of long for NFS.

Why?
The file is still open, and so the standard close-to-open rules state
that you are not guaranteed that the cache will be flushed unless the
VM happens to want to reclaim memory.

> What's worse, when my Linux NFS client writes out a file's pages, it
> does not react immediately on receiving an ENOSPC error. It will
> remember the error and report it later on close(), but it still goes
> ahead and issues write requests for each page of the file. So even if
> there isn't a pileup on the i_mutex on the server, the NFS client
> still waits 0.5 s for each 32K (typically) request. So on an NFS
> client on a gigabit network, on an already full filesystem, if I open
> and write a 10M file and close() it, it takes 2m40.083s for it to
> issue all the requests, get an ENOSPC for each, and finally have my
> close() call return ENOSPC. That can stretch to several hours for
> gigabyte-sized files, which is how I noticed the problem.
>
> I'm not too familiar with the NFS client code, but would it not be
> possible for it to give up when it encounters ENOSPC? Or is there
> some reason why this wouldn't be desirable?

How would it then detect that you have fixed the problem on the
server?

Cheers,
Trond

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: [NFS] Long sleep with i_mutex in xfs_flush_device(), affects NFS service
  2006-09-26 19:06 ` [NFS] " Trond Myklebust
@ 2006-09-26 20:05   ` Stephane Doyon
  2006-09-26 20:29     ` Trond Myklebust
  0 siblings, 1 reply; 17+ messages in thread
From: Stephane Doyon @ 2006-09-26 20:05 UTC (permalink / raw)
To: Trond Myklebust; +Cc: xfs, nfs

On Tue, 26 Sep 2006, Trond Myklebust wrote:
[...]
>> When the file system becomes nearly full, we eventually call down to
>> xfs_flush_device(), which sleeps for 0.5 seconds, waiting for
>> xfssyncd to do some work.
>>
>> xfs_flush_space() does
>> 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
>> before calling xfs_flush_device(), but i_mutex is still held, at
>> least when we're being called from under xfs_write(). It seems like
>> a fairly long time to hold a mutex. And I wonder whether it's really
>> necessary to keep going through that again and again for every new
>> request after we've hit ENOSPC.
>>
>> In particular this can cause a pileup when several threads are
>> writing concurrently to the same file. Some specialized apps might
>> do that, and nfsd threads do it all the time.
[...]
>> The Linux NFS client typically sends bunches of 16 requests, and so
>> if the client is writing a single file, some NFS requests are
>> therefore delayed by up to 8 seconds, which is kind of long for NFS.
>
> Why? The file is still open, and so the standard close-to-open rules
> state that you are not guaranteed that the cache will be flushed
> unless the VM happens to want to reclaim memory.

I mean there will be a delay on the server, in responding to the
requests. Sorry for the confusion. When the NFS client does flush its
cache, each request will take an extra 0.5 s to execute on the server,
and the i_mutex will prevent their parallel execution on the server.

>> What's worse, when my Linux NFS client writes out a file's pages, it
>> does not react immediately on receiving an ENOSPC error.
>> It will remember the error and report it later on close(), but it
>> still goes ahead and issues write requests for each page of the
>> file. So even if there isn't a pileup on the i_mutex on the server,
>> the NFS client still waits 0.5 s for each 32K (typically) request.
>> So on an NFS client on a gigabit network, on an already full
>> filesystem, if I open and write a 10M file and close() it, it takes
>> 2m40.083s for it to issue all the requests, get an ENOSPC for each,
>> and finally have my close() call return ENOSPC. That can stretch to
>> several hours for gigabyte-sized files, which is how I noticed the
>> problem.
>>
>> I'm not too familiar with the NFS client code, but would it not be
>> possible for it to give up when it encounters ENOSPC? Or is there
>> some reason why this wouldn't be desirable?
>
> How would it then detect that you have fixed the problem on the
> server?

I suppose it has to try again at some point. Yet when flushing a file,
if even one write request gets an error response like ENOSPC, we know
some part of the data has not been written on the server, and close()
will return the appropriate error to the program on the client. If a
single write error is enough to cause close() to return an error, why
bother sending all the other write requests for that file? If we get
an error while flushing, couldn't that one flushing operation bail out
early?

As I said I'm not too familiar with the code, but AFAICT nfs_wb_all()
will keep flushing everything, and afterwards nfs_file_flush() will
check ctx->error. Perhaps ctx->error could be checked at some lower
level, maybe in nfs_sync_inode_wait...

I suppose it's not technically wrong to try to flush all the pages of
the file, but if the server file system is full then it will be at its
worst. Also if you happened to be on a slower link and have a big
cache to flush, you're waiting around for very little gain.

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: [NFS] Long sleep with i_mutex in xfs_flush_device(), affects NFS service
  2006-09-26 20:05 ` Stephane Doyon
@ 2006-09-26 20:29   ` Trond Myklebust
  0 siblings, 0 replies; 17+ messages in thread
From: Trond Myklebust @ 2006-09-26 20:29 UTC (permalink / raw)
To: Stephane Doyon; +Cc: xfs, nfs

On Tue, 2006-09-26 at 16:05 -0400, Stephane Doyon wrote:
> I suppose it's not technically wrong to try to flush all the pages of
> the file, but if the server file system is full then it will be at
> its worst. Also if you happened to be on a slower link and have a big
> cache to flush, you're waiting around for very little gain.

That all assumes that nobody fixes the problem on the server. If
somebody notices, and actually removes an unused file, then you may be
happy that the kernel preserved the last 80% of the apache log file
that was being written out.

ENOSPC is a transient error: that is why the current behaviour exists.

Cheers,
Trond

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: Long sleep with i_mutex in xfs_flush_device(), affects NFS service
  2006-09-26 18:51 Long sleep with i_mutex in xfs_flush_device(), affects NFS service Stephane Doyon
  2006-09-26 19:06 ` [NFS] " Trond Myklebust
@ 2006-09-27 11:33 ` Shailendra Tripathi
  2006-10-02 14:45   ` Stephane Doyon
  1 sibling, 1 reply; 17+ messages in thread
From: Shailendra Tripathi @ 2006-09-27 11:33 UTC (permalink / raw)
To: Stephane Doyon; +Cc: xfs, nfs

Hi Stephane,

> When the file system becomes nearly full, we eventually call down to
> xfs_flush_device(), which sleeps for 0.5 seconds, waiting for
> xfssyncd to do some work.
> xfs_flush_space() does
> 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> before calling xfs_flush_device(), but i_mutex is still held, at
> least when we're being called from under xfs_write().

1. I agree that the delay of 500 ms is not a deterministic wait.

2. xfs_flush_device is a big operation. It has to flush all the dirty
pages possibly in the cache on the device. Depending upon the device,
it might take a significant amount of time. In view of that, 500 ms is
not that unreasonable. Also, perhaps you would never want more than
one request to be queued for device flush.

3. The hope is that after one big flush operation, it would be able to
free up resources which are in a transient state (over-reservation of
blocks, delalloc, pending removes, ...). The whole operation is
intended to make sure that ENOSPC is not returned unless really
required.

4. This wait could be made deterministic by waiting for the syncer
thread to complete when the device flush is triggered.

> It seems like a fairly long time to hold a mutex. And I wonder
> whether it's really

It might not be that good even if it doesn't. This can return a
premature ENOSPC, or it can queue many xfs_flush_device requests
(which can make your system dead(-slow) anyway).

> necessary to keep going through that again and again for every new
> request after we've hit ENOSPC.
>
> In particular this can cause a pileup when several threads are
> writing concurrently to the same file. Some specialized apps might do
> that, and nfsd threads do it all the time.
>
> To reproduce locally, on a full file system:
> #!/bin/sh
> for i in `seq 30`; do
> 	dd if=/dev/zero of=f bs=1 count=1 &
> done
> wait
> Timing that, it takes almost exactly 15 s.
>
> The Linux NFS client typically sends bunches of 16 requests, and so
> if the client is writing a single file, some NFS requests are
> therefore delayed by up to 8 seconds, which is kind of long for NFS.
>
> What's worse, when my Linux NFS client writes out a file's pages, it
> does not react immediately on receiving an ENOSPC error. It will
> remember the error and report it later on close(), but it still goes
> ahead and issues write requests for each page of the file. So even if
> there isn't a pileup on the i_mutex on the server, the NFS client
> still waits 0.5 s for each 32K (typically) request. So on an NFS
> client on a gigabit network, on an already full filesystem, if I open
> and write a 10M file and close() it, it takes 2m40.083s for it to
> issue all the requests, get an ENOSPC for each, and finally have my
> close() call return ENOSPC. That can stretch to several hours for
> gigabyte-sized files, which is how I noticed the problem.
>
> I'm not too familiar with the NFS client code, but would it not be
> possible for it to give up when it encounters ENOSPC? Or is there
> some reason why this wouldn't be desirable?
>
> The rough workaround I have come up with for the problem is to have
> xfs_flush_space() skip calling xfs_flush_device() if we are within
> 2 seconds of having returned ENOSPC. I have verified that this
> workaround is effective, but I imagine there might be a cleaner
> solution.

The fix would not be a good idea for standalone use of XFS.
	if (nimaps == 0) {
		if (xfs_flush_space(ip, &fsynced, &ioflag))
			return XFS_ERROR(ENOSPC);
		error = 0;
		goto retry;
	}

xfs_flush_space:
	case 2:
		xfs_iunlock(ip, XFS_ILOCK_EXCL);
		xfs_flush_device(ip);
		xfs_ilock(ip, XFS_ILOCK_EXCL);
		*fsynced = 3;
		return 0;
	}
	return 1;

Let's say that you don't enqueue it for another 2 seconds. Then, in
the next retry it would return 1 and, hence, the outer if condition
would return ENOSPC. Please note that for standalone XFS, the
application or client mostly doesn't retry and, hence, it might return
a premature ENOSPC. You didn't notice this because, as you said, the
nfs client will retry in case of ENOSPC.

Assuming that you don't return *fsynced = 3 (instead *fsynced = 2),
the code path will loop (because of retry) and the CPU itself would
become busy for no good job.

You might experiment by adding a deterministic wait. When you enqueue,
set some flag. All others who come in between just get enqueued. Once
the device flush is over, wake up all. If the flush could free enough
resources, threads will proceed ahead and return. Otherwise, another
flush would be enqueued to flush what might have come since the last
flush.

> Thanks

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: Long sleep with i_mutex in xfs_flush_device(), affects NFS service
  2006-09-27 11:33 ` Shailendra Tripathi
@ 2006-10-02 14:45   ` Stephane Doyon
  2006-10-02 22:30     ` David Chinner
  0 siblings, 1 reply; 17+ messages in thread
From: Stephane Doyon @ 2006-10-02 14:45 UTC (permalink / raw)
To: Shailendra Tripathi; +Cc: xfs

On Wed, 27 Sep 2006, Shailendra Tripathi wrote:

> Hi Stephane,
>> When the file system becomes nearly full, we eventually call down to
>> xfs_flush_device(), which sleeps for 0.5 seconds, waiting for
>> xfssyncd to do some work. xfs_flush_space() does
>> 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
>> before calling xfs_flush_device(), but i_mutex is still held, at
>> least when we're being called from under xfs_write().
>
> 1. I agree that the delay of 500 ms is not a deterministic wait.
>
> 2. xfs_flush_device is a big operation. It has to flush all the dirty
> pages possibly in the cache on the device. Depending upon the device,
> it might take a significant amount of time. In view of that, 500 ms
> is not that unreasonable. Also, perhaps you would never want more
> than one request to be queued for device flush.
>
> 3. The hope is that after one big flush operation, it would be able
> to free up resources which are in a transient state (over-reservation
> of blocks, delalloc, pending removes, ...). The whole operation is
> intended to make sure that ENOSPC is not returned unless really
> required.

Yes I had surmised as much. That last part is still a little vague to
me... But my two points were:

- It's a long time to hold a mutex. The code bothers to drop the
  xfs_ilock, so I'm wondering whether the i_mutex had been forgotten?

- Once we've actually hit ENOSPC, do we need to try again? Isn't it
  possible to tell when resources have actually been freed?

> 4. This wait could be made deterministic by waiting for the syncer
> thread to complete when the device flush is triggered.
I remember that some time ago, there wasn't any xfs_syncd, and the
flushing operation was performed by the task wanting the free space.
And it would cause deadlocks. So I presume we would have to be careful
if we wanted to wait on sync.

>> The rough workaround I have come up with for the problem is to have
>> xfs_flush_space() skip calling xfs_flush_device() if we are within
>> 2 seconds of having returned ENOSPC. I have verified that this
>> workaround is effective, but I imagine there might be a cleaner
>> solution.
>
> The fix would not be a good idea for standalone use of XFS.
>
> 	if (nimaps == 0) {
> 		if (xfs_flush_space(ip, &fsynced, &ioflag))
> 			return XFS_ERROR(ENOSPC);
> 		error = 0;
> 		goto retry;
> 	}
>
> xfs_flush_space:
> 	case 2:
> 		xfs_iunlock(ip, XFS_ILOCK_EXCL);
> 		xfs_flush_device(ip);
> 		xfs_ilock(ip, XFS_ILOCK_EXCL);
> 		*fsynced = 3;
> 		return 0;
> 	}
> 	return 1;
>
> Let's say that you don't enqueue it for another 2 seconds. Then, in
> the next retry it would return 1 and, hence, the outer if condition
> would return ENOSPC. Please note that for standalone XFS, the
> application or client mostly doesn't retry and, hence, it might
> return a premature ENOSPC.
>
> You didn't notice this because, as you said, the nfs client will
> retry in case of ENOSPC.

I'm not entirely sure I follow your explanation. The *fsynced variable
is local to the xfs_iomap_write_delay() caller, so each call will go
through the three steps in xfs_flush_space(). What my workaround does
is, if we've done the xfs_flush_device() thing and still hit ENOSPC
within the last two seconds, and we've just tried again the first two
xfs_flush_space() steps, then we skip the third step and return
ENOSPC. So yes the file system might not be exactly entirely full
anymore, which is why I say it's a rough workaround, but it seems to
me the discrepancy shouldn't be very big either.
Whatever free space might have been missed would have had to be freed
after the last ENOSPC return, and must be such that only another
xfs_flush_device() call will make it available.

It seems to me ENOSPC has never been something very exact anyway: df
(statfs) often still shows a few remaining free blocks even on a full
file system. Apps can't really calculate how many blocks will be
needed for inodes, btrees and directories, so the number of remaining
data blocks is an approximation. I am not entirely sure that what
xfs_flush_device_work() does is quite deterministic, and as you said
the wait period is arbitrary. And I don't particularly care to get
every single last byte out of my file system, as long as there are no
flagrant inconsistencies such as rm -fr not freeing up some space.

> Assuming that you don't return *fsynced = 3 (instead *fsynced = 2),
> the code path will loop (because of retry) and the CPU itself would
> become busy for no good job.

Indeed.

> You might experiment by adding a deterministic wait. When you
> enqueue, set some flag. All others who come in between just get
> enqueued. Once the device flush is over, wake up all. If the flush
> could free enough resources, threads will proceed ahead and return.
> Otherwise, another flush would be enqueued to flush what might have
> come since the last flush.

But how do you know whether you need to flush again, or whether your
file system is really full this time? And there's still the issue with
the i_mutex.

Perhaps there's a way to evaluate how much resources are "in transient
state" as you put it. Otherwise, we could set a flag when ENOSPC is
returned, and have that flag cleared at appropriate places in the code
where blocks are actually freed. I keep running into various deadlocks
related to full file systems, so I'm wary of clever solutions :-).

[Dropped nfs@lists.sourceforge.net from Cc, as this discussion is
quite specific to xfs.]

^ permalink raw reply	[flat|nested] 17+ messages in thread
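The single-flusher, wake-everyone scheme being discussed can be
sketched in userspace with a condition variable. All names here are
illustrative; the real thing would have to live in xfssyncd and cope
with the locking and deadlock issues mentioned above:

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Sketch: only one thread performs the device flush; any other thread
 * that hits ENOSPC while a flush is in progress just waits for that
 * flush to finish instead of queueing another one. */
static pthread_mutex_t flush_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  flush_done = PTHREAD_COND_INITIALIZER;
static bool flush_in_progress;
static int  flush_count;	/* for demonstration only */

/* Stand-in for the expensive xfs_flush_device() work. */
static void do_device_flush(void)
{
	flush_count++;
}

void flush_and_wait(void)
{
	pthread_mutex_lock(&flush_lock);
	if (flush_in_progress) {
		/* Another thread is already flushing: wait for its
		 * result rather than piling on a second flush. */
		while (flush_in_progress)
			pthread_cond_wait(&flush_done, &flush_lock);
	} else {
		flush_in_progress = true;
		pthread_mutex_unlock(&flush_lock);
		do_device_flush();	/* the one expensive flush */
		pthread_mutex_lock(&flush_lock);
		flush_in_progress = false;
		pthread_cond_broadcast(&flush_done);
	}
	pthread_mutex_unlock(&flush_lock);
}
```

After being woken, each caller would retry its allocation; if that
still fails, the file system really is full as of the last flush.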
* Re: Long sleep with i_mutex in xfs_flush_device(), affects NFS service
  2006-10-02 14:45 ` Stephane Doyon
@ 2006-10-02 22:30   ` David Chinner
  2006-10-03 13:39     ` several messages Stephane Doyon
  0 siblings, 1 reply; 17+ messages in thread
From: David Chinner @ 2006-10-02 22:30 UTC (permalink / raw)
To: Stephane Doyon; +Cc: Shailendra Tripathi, xfs

On Mon, Oct 02, 2006 at 10:45:12AM -0400, Stephane Doyon wrote:
> On Wed, 27 Sep 2006, Shailendra Tripathi wrote:
>
>> Hi Stephane,
>>> When the file system becomes nearly full, we eventually call down
>>> to xfs_flush_device(), which sleeps for 0.5 seconds, waiting for
>>> xfssyncd to do some work. xfs_flush_space() does
>>> 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
>>> before calling xfs_flush_device(), but i_mutex is still held, at
>>> least when we're being called from under xfs_write().
>>
>> 1. I agree that the delay of 500 ms is not a deterministic wait.

AFAICT, it was never intended to be. It's not deterministic, and the
wait is really only there to ensure that the synchronous log force
catches all the operations that may have recently occurred so they can
be unpinned and flushed. For example, an extent that has been
truncated and freed cannot be reused until the transaction that it was
freed in has actually been committed to disk.....

>> 2. xfs_flush_device is a big operation. It has to flush all the
>> dirty pages possibly in the cache on the device. Depending upon the
>> device, it might take a significant amount of time. In view of that,
>> 500 ms is not that unreasonable. Also, perhaps you would never want
>> more than one request to be queued for device flush.
>> 3. The hope is that after one big flush operation, it would be able
>> to free up resources which are in a transient state
>> (over-reservation of blocks, delalloc, pending removes, ...). The
>> whole operation is intended to make sure that ENOSPC is not returned
>> unless really required.
>
> Yes I had surmised as much.
> That last part is still a little vague to me... But my two points
> were:
>
> - It's a long time to hold a mutex. The code bothers to drop the
>   xfs_ilock, so I'm wondering whether the i_mutex had been forgotten?

This deep in the XFS allocation functions, we cannot tell if we hold
the i_mutex or not, and it plays no part in determining if we have
space or not. Hence we don't touch it here.

> - Once we've actually hit ENOSPC, do we need to try again? Isn't it
>   possible to tell when resources have actually been freed?

Given that the only way to determine if space was made available is to
query every AG in the exact same way an allocation does, it makes
sense to try the allocation again to determine if space was made
available....

>> 4. This wait could be made deterministic by waiting for the syncer
>> thread to complete when the device flush is triggered.
>
> I remember that some time ago, there wasn't any xfs_syncd, and the
> flushing operation was performed by the task wanting the free space.
> And it would cause deadlocks. So I presume we would have to be
> careful if we wanted to wait on sync.

*nod* Last thing we want is more deadlocks. This code is already
convoluted enough without adding yet more special cases to it....

>>> The rough workaround I have come up with for the problem is to have
>>> xfs_flush_space() skip calling xfs_flush_device() if we are within
>>> 2 seconds of having returned ENOSPC. I have verified that this
>>> workaround is effective, but I imagine there might be a cleaner
>>> solution.
>>
>> The fix would not be a good idea for standalone use of XFS.

I doubt it's a good idea for an NFS server, either. Remember that XFS,
like most filesystems, trades off speed for correctness as we approach
ENOSPC. Many parts of XFS slow down as we approach ENOSPC, and this is
just one example of where we need to be correct, not fast.
> It seems to me ENOSPC has never been something very exact anyway: df
> (statfs) often still shows a few remaining free blocks even on a full
> file system. Apps can't really calculate how many blocks will be
> needed for inodes, btrees and directories, so the number of remaining
> data blocks is an approximation.

It's not supposed to be an approximation - the number reported by df
should be taking all this into account because it's coming directly
from how much space XFS thinks it has available.

>> You might experiment by adding a deterministic wait. When you
>> enqueue, set some flag. All others who come in between just get
>> enqueued. Once the device flush is over, wake up all. If the flush
>> could free enough resources, threads will proceed ahead and return.
>> Otherwise, another flush would be enqueued to flush what might have
>> come since the last flush.
>
> But how do you know whether you need to flush again, or whether your
> file system is really full this time? And there's still the issue
> with the i_mutex.
>
> Perhaps there's a way to evaluate how much resources are "in
> transient state" as you put it.

I doubt there's any way of doing this without introducing non-enospc
performance regressions and extra memory usage.

> Otherwise, we could set a flag when ENOSPC is returned, and have that
> flag cleared at appropriate places in the code where blocks are
> actually freed. I keep running into various deadlocks related to full
> file systems, so I'm wary of clever solutions :-).

IMO, this is a non-problem. You're talking about optimising a
relatively rare corner case where correctness is more important than
speed and your test case is highly artificial. AFAIC, if you are
running at ENOSPC then you get what performance is appropriate for
correctness and if you are continually running at ENOSPC, then buy
some more disks.....

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: several messages
  2006-10-02 22:30 ` David Chinner
@ 2006-10-03 13:39   ` several messages Stephane Doyon
  2006-10-03 16:40     ` Trond Myklebust
  2006-10-05  8:30     ` David Chinner
  0 siblings, 2 replies; 17+ messages in thread
From: Stephane Doyon @ 2006-10-03 13:39 UTC (permalink / raw)
To: Trond Myklebust, David Chinner; +Cc: xfs, nfs, Shailendra Tripathi

Sorry for insisting, but it seems to me there's still a problem in
need of fixing: when writing a 5GB file over NFS to an XFS file system
and hitting ENOSPC, it takes on the order of 22 hours before my
application gets an error, whereas it would normally take about
2 minutes if the file system did not become full.

Perhaps I was being a bit too "constructive" and drowned my point in
explanations and proposed workarounds... You are telling me that
neither NFS nor XFS is doing anything wrong, and I can understand your
points of view, but surely that behavior isn't considered acceptable?

On Tue, 26 Sep 2006, Trond Myklebust wrote:

> On Tue, 2006-09-26 at 16:05 -0400, Stephane Doyon wrote:
>> I suppose it's not technically wrong to try to flush all the pages
>> of the file, but if the server file system is full then it will be
>> at its worst. Also if you happened to be on a slower link and have a
>> big cache to flush, you're waiting around for very little gain.
>
> That all assumes that nobody fixes the problem on the server. If
> somebody notices, and actually removes an unused file, then you may
> be happy that the kernel preserved the last 80% of the apache log
> file that was being written out.
>
> ENOSPC is a transient error: that is why the current behaviour
> exists.

On Tue, 3 Oct 2006, David Chinner wrote:

> This deep in the XFS allocation functions, we cannot tell if we hold
> the i_mutex or not, and it plays no part in determining if we have
> space or not. Hence we don't touch it here.

> I doubt it's a good idea for an NFS server, either.
[...]
> Remember that XFS, like most filesystems, trades off speed for
> correctness as we approach ENOSPC. Many parts of XFS slow down as we
> approach ENOSPC, and this is just one example of where we need to be
> correct, not fast.
[...]
> IMO, this is a non-problem. You're talking about optimising a
> relatively rare corner case where correctness is more important than
> speed and your test case is highly artificial. AFAIC, if you are
> running at ENOSPC then you get what performance is appropriate for
> correctness and if you are continually running at ENOSPC, then buy
> some more disks.....

My recipe to reproduce the problem locally is admittedly somewhat
artificial, but the problematic usage definitely isn't: simply an app
on an NFS client that happens to fill up a file system. There must be
some way to handle this better.

Thanks

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: several messages
  2006-10-03 13:39 ` several messages Stephane Doyon
@ 2006-10-03 16:40   ` Trond Myklebust
  2006-10-05 15:39     ` Stephane Doyon
  1 sibling, 1 reply; 17+ messages in thread
From: Trond Myklebust @ 2006-10-03 16:40 UTC (permalink / raw)
To: Stephane Doyon; +Cc: David Chinner, xfs, nfs, Shailendra Tripathi

On Tue, 2006-10-03 at 09:39 -0400, Stephane Doyon wrote:
> Sorry for insisting, but it seems to me there's still a problem in
> need of fixing: when writing a 5GB file over NFS to an XFS file
> system and hitting ENOSPC, it takes on the order of 22 hours before
> my application gets an error, whereas it would normally take about
> 2 minutes if the file system did not become full.
>
> Perhaps I was being a bit too "constructive" and drowned my point in
> explanations and proposed workarounds... You are telling me that
> neither NFS nor XFS is doing anything wrong, and I can understand
> your points of view, but surely that behavior isn't considered
> acceptable?

Sure it is. You are allowing the kernel to cache 5GB, and that means
you only get the error message when close() completes.

If you want faster error reporting, there are modes like O_SYNC,
O_DIRECT, that will attempt to flush the data more quickly. In
addition, you can force flushing using fsync(). Finally, you can tweak
the VM into flushing more often using /proc/sys/vm.

Cheers,
Trond

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: several messages
  2006-10-03 16:40 ` Trond Myklebust
@ 2006-10-05 15:39   ` Stephane Doyon
  2006-10-06  0:33     ` David Chinner
  0 siblings, 1 reply; 17+ messages in thread
From: Stephane Doyon @ 2006-10-05 15:39 UTC (permalink / raw)
To: Trond Myklebust; +Cc: David Chinner, xfs, nfs, Shailendra Tripathi

On Tue, 3 Oct 2006, Trond Myklebust wrote:

> On Tue, 2006-10-03 at 09:39 -0400, Stephane Doyon wrote:
>> Sorry for insisting, but it seems to me there's still a problem in
>> need of fixing: when writing a 5GB file over NFS to an XFS file
>> system and hitting ENOSPC, it takes on the order of 22 hours before
>> my application gets an error, whereas it would normally take about
>> 2 minutes if the file system did not become full.
>>
>> Perhaps I was being a bit too "constructive" and drowned my point in
>> explanations and proposed workarounds... You are telling me that
>> neither NFS nor XFS is doing anything wrong, and I can understand
>> your points of view, but surely that behavior isn't considered
>> acceptable?
>
> Sure it is.

If you say so :-).

> You are allowing the kernel to cache 5GB, and that means you only get
> the error message when close() completes.

But it's not actually caching the entire 5GB at once... I guess you're
saying that doesn't matter...?

> If you want faster error reporting, there are modes like O_SYNC,
> O_DIRECT, that will attempt to flush the data more quickly. In
> addition, you can force flushing using fsync().

What if the program is a standard utility like cp?

> Finally, you can tweak the VM into flushing more often using
> /proc/sys/vm.

It doesn't look to me like a question of degrees about how early to
flush. Actually my client can't possibly be caching all of 5GB, it
doesn't have the RAM or swap for that. Tracing it more carefully, it
appears dirty data starts being flushed after a few hundred MBs. No
error is returned on the subsequent writes, only on the final close().
I see some of the write() calls are delayed, presumably when the machine reaches the dirty threshold. So I don't see how the vm settings can help in this case. I hadn't realized that the issue isn't just with the final flush on close(). It's actually been flushing all along, delaying some of the subsequent write()s, getting NOSPC errors but not reporting them until the end. I understand that since my application did not request any syncing, the system cannot guarantee to report errors until cached data has been flushed. But some data has indeed been flushed with an error; can't this be reported earlier than on close? Would it be incorrect for a subsequent write to return the error that occurred while flushing data from previous writes? Then the app could decide whether to continue and retry or not. But I guess I can see how that might get convoluted. Thanks for your patience, ^ permalink raw reply [flat|nested] 17+ messages in thread
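[Editorial aside: short of changing the client, an application can bound
how much unreported data it has in flight by calling fsync() every N
bytes, so a writeback ENOSPC surfaces after at most N bytes instead of
only at close(). A hedged sketch, not code from the thread; the function
name and chunk size are invented for illustration.]

```c
#include <fcntl.h>
#include <unistd.h>

/* Write `len` bytes, fsync()ing every `chunk` bytes so async writeback
 * errors (e.g. ENOSPC) surface early instead of only at close(). */
int write_bounded(const char *path, const char *buf, size_t len, size_t chunk)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    size_t done = 0, since_sync = 0;
    while (done < len) {
        size_t n = (len - done < chunk) ? len - done : chunk;
        if (write(fd, buf + done, n) != (ssize_t)n)
            goto fail;
        done += n;
        since_sync += n;
        if (since_sync >= chunk) {
            if (fsync(fd) < 0)   /* a flush error is reported here, */
                goto fail;       /* at most `chunk` bytes late */
            since_sync = 0;
        }
    }
    {
        int err = fsync(fd);     /* final flush, then check close() too */
        if (close(fd) < 0 || err < 0)
            return -1;
    }
    return 0;
fail:
    close(fd);
    return -1;
}
```

This does not help with a stock utility like cp, which is exactly the
objection raised above, but it shows what "bounding the cache" looks like
from the application side.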
* Re: several messages
  2006-10-05 15:39               ` Stephane Doyon
@ 2006-10-06  0:33                 ` David Chinner
  2006-10-06 13:25                   ` Stephane Doyon
  0 siblings, 1 reply; 17+ messages in thread
From: David Chinner @ 2006-10-06 0:33 UTC (permalink / raw)
  To: Stephane Doyon
  Cc: Trond Myklebust, David Chinner, xfs, nfs, Shailendra Tripathi

On Thu, Oct 05, 2006 at 11:39:45AM -0400, Stephane Doyon wrote:
>
> I hadn't realized that the issue isn't just with the final flush on
> close(). It's actually been flushing all along, delaying some of the
> subsequent write()s, getting NOSPC errors but not reporting them until
> the end.

Other NFS clients will report an ENOSPC on the next write() or close()
if the error is reported during async writeback. The clients that
typically do this throw away any unwritten data as well, on the basis
that the application was returned an error ASAP and it is now Somebody
Else's Problem (i.e. the application needs to handle it from there).

> I understand that since my application did not request any syncing, the
> system cannot guarantee to report errors until cached data has been
> flushed. But some data has indeed been flushed with an error; can't this
> be reported earlier than on close?

It could, but...

> Would it be incorrect for a subsequent write to return the error that
> occurred while flushing data from previous writes? Then the app could
> decide whether to continue and retry or not. But I guess I can see how
> that might get convoluted.

.... there are many entertaining hoops to jump through to do this
reliably.

FWIW, these are simply two different approaches to handling ENOSPC (and
other server) errors. Mostly it comes down to how the people who
implemented the NFS client think it's best to handle the errors in the
scenarios that they most care about.

For example: when you have large amounts of cached data, expedient error
reporting and tossing unwritten data leads to much faster error recovery
than trying to write every piece of data (hence the Irix use of this
method). OTOH, when you really want as much of the data as possible to
get to the server (e.g. log files), regardless of whether you lose some,
then you try to write every bit of data before telling the application.

There's no clear right or wrong approach here - both have their
advantages and disadvantages for different workloads. If it weren't for
the sub-optimal behaviour of XFS in this case, you probably wouldn't
have even cared about this....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: several messages
  2006-10-06  0:33                 ` David Chinner
@ 2006-10-06 13:25                   ` Stephane Doyon
  0 siblings, 0 replies; 17+ messages in thread
From: Stephane Doyon @ 2006-10-06 13:25 UTC (permalink / raw)
  To: David Chinner; +Cc: Trond Myklebust, xfs, nfs, Shailendra Tripathi

On Fri, 6 Oct 2006, David Chinner wrote:

> On Thu, Oct 05, 2006 at 11:39:45AM -0400, Stephane Doyon wrote:
>>
>> I hadn't realized that the issue isn't just with the final flush on
>> close(). It's actually been flushing all along, delaying some of the
>> subsequent write()s, getting NOSPC errors but not reporting them until
>> the end.
>
> Other NFS clients will report an ENOSPC on the next write() or close()
> if the error is reported during async writeback. The clients that
> typically do this throw away any unwritten data as well, on the basis
> that the application was returned an error ASAP and it is now Somebody
> Else's Problem (i.e. the application needs to handle it from there).

Well, the client wouldn't necessarily have to throw away cached data. It
could conceivably be made to return ENOSPC on some subsequent write. It
would need to throw away the data for that write, but not necessarily
destroy its cache. It could then clear the error condition and allow the
application to keep trying if it wants to...

>> Would it be incorrect for a subsequent write to return the error that
>> occurred while flushing data from previous writes? Then the app could
>> decide whether to continue and retry or not. But I guess I can see how
>> that might get convoluted.
>
> .... there are many entertaining hoops to jump through to do this
> reliably.

I imagine there would be...

> For example: when you have large amounts of cached data, expedient
> error reporting and tossing unwritten data leads to much faster
> error recovery than trying to write every piece of data (hence the
> Irix use of this method).

In my case, I didn't think I was caching that much data though, only a
few hundred MB, and I wouldn't have minded so much if an error had been
returned after that much. The way it's implemented, though, I can write
an unbounded amount of data through that cache and not be told of the
problem until I close or fsync. It may not be technically wrong, but
given the outrageous delay I saw in my particular situation, it felt
pretty suboptimal.

> There's no clear right or wrong approach here - both have their
> advantages and disadvantages for different workloads. If it
> weren't for the sub-optimal behaviour of XFS in this case, you
> probably wouldn't have even cared about this....

Indeed not! In fact, changing the client is not practical for me; what I
need is a fix for the XFS behavior. I just thought it was also worth
reporting what I perceived to be an issue with the NFS client.

Thanks

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: several messages
  2006-10-03 13:39           ` several messages Stephane Doyon
  2006-10-03 16:40             ` Trond Myklebust
@ 2006-10-05  8:30             ` David Chinner
  2006-10-05 16:33               ` Stephane Doyon
  1 sibling, 1 reply; 17+ messages in thread
From: David Chinner @ 2006-10-05 8:30 UTC (permalink / raw)
  To: Stephane Doyon
  Cc: Trond Myklebust, David Chinner, xfs, nfs, Shailendra Tripathi

On Tue, Oct 03, 2006 at 09:39:55AM -0400, Stephane Doyon wrote:
> Sorry for insisting, but it seems to me there's still a problem in need
> of fixing: when writing a 5 GB file over NFS to an XFS file system and
> hitting ENOSPC, it takes on the order of 22 hours before my application
> gets an error, whereas it would normally take about 2 minutes if the
> file system did not become full.
>
> Perhaps I was being a bit too "constructive" and drowned my point in
> explanations and proposed workarounds... You are telling me that neither
> NFS nor XFS is doing anything wrong, and I can understand your points of
> view, but surely that behavior isn't considered acceptable?

I agree that this is a little extreme, and I can't recall seeing
anything like this before, but I can see how it may happen if the NFS
client continues to try to write every dirty page after getting an
ENOSPC and each one of those writes has to wait for 500 ms.

However, you did not mention what kernel version you are running. One
recent bug (introduced by a fix for deadlocks at ENOSPC) could allow
oversubscription of free space to occur in XFS, resulting in the write
being allowed to proceed (i.e. sufficient space for the data blocks) but
then failing the allocation because there weren't enough blocks put
aside for potential btree splits that occur during allocation. If the
linux client is using sync writes on retry, then this would trigger a
500 ms sleep on every write. That's the right sort of ballpark for the
slowness you were seeing - 5 GB / 32 KB * 0.5 s = ~22 hours....

This got fixed in 2.6.18-rc6 - can you retry with a 2.6.18 server and
see if your problem goes away?

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 17+ messages in thread
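[Editorial aside: the back-of-envelope figure above checks out. A purely
illustrative arithmetic helper (the function name is invented):]

```c
/* Back-of-envelope: data size in GB, pushed out in request_kb-sized NFS
 * requests, with each request eating one sleep_s delay (the 500 ms sleep
 * in xfs_flush_device()). Returns the total stall time in hours. */
double enospc_stall_hours(double gigabytes, double request_kb, double sleep_s)
{
    double requests = gigabytes * 1024.0 * 1024.0 / request_kb;
    return requests * sleep_s / 3600.0;
}
```

For 5 GB in 32 KB requests with a 0.5 s sleep each, this gives 163,840
requests and roughly 22.8 hours - matching the ~22 hour estimate.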
* Re: several messages
  2006-10-05  8:30             ` David Chinner
@ 2006-10-05 16:33               ` Stephane Doyon
  2006-10-05 23:29                 ` David Chinner
  0 siblings, 1 reply; 17+ messages in thread
From: Stephane Doyon @ 2006-10-05 16:33 UTC (permalink / raw)
  To: David Chinner; +Cc: Trond Myklebust, xfs, nfs, Shailendra Tripathi

On Thu, 5 Oct 2006, David Chinner wrote:

> On Tue, Oct 03, 2006 at 09:39:55AM -0400, Stephane Doyon wrote:
>> Sorry for insisting, but it seems to me there's still a problem in need
>> of fixing: when writing a 5 GB file over NFS to an XFS file system and
>> hitting ENOSPC, it takes on the order of 22 hours before my application
>> gets an error, whereas it would normally take about 2 minutes if the
>> file system did not become full.
>>
>> Perhaps I was being a bit too "constructive" and drowned my point in
>> explanations and proposed workarounds... You are telling me that
>> neither NFS nor XFS is doing anything wrong, and I can understand your
>> points of view, but surely that behavior isn't considered acceptable?
>
> I agree that this is a little extreme, and I can't recall seeing
> anything like this before, but I can see how it may happen if the
> NFS client continues to try to write every dirty page after getting
> an ENOSPC and each one of those writes has to wait for 500 ms.
>
> However, you did not mention what kernel version you are running.
> One recent bug (introduced by a fix for deadlocks at ENOSPC) could
> allow oversubscription of free space to occur in XFS, resulting in

I do have that fix in my kernel. (I'm the one who pointed you to the
patch that introduced that particular problem.)

> the write being allowed to proceed (i.e. sufficient space for the
> data blocks) but then failing the allocation because there weren't
> enough blocks put aside for potential btree splits that occur during
> allocation. If the linux client is using sync writes on retry, then

The writes from nfsd shouldn't be sync. Technically it's not even
retrying, just plowing on...

> this would trigger a 500 ms sleep on every write. That's the right
> sort of ballpark for the slowness you were seeing - 5 GB / 32 KB * 0.5 s
> = ~22 hours....
>
> This got fixed in 2.6.18-rc6 -

You mean commit 4be536debe3f7b0c, right? (Actually -rc7, I believe...) I
do have that one in my kernel. My kernel is 2.6.17 plus assorted XFS
fixes.

> can you retry with a 2.6.18 server
> and see if your problem goes away?

Unfortunately it will be several days before I have a chance to do that.

The backtrace looked like this:

  ...
  nfsd_write
  nfsd_vfs_write
  vfs_writev
  do_readv_writev
  xfs_file_writev
  xfs_write
  generic_file_buffered_write
  xfs_get_blocks
  __xfs_get_blocks
  xfs_bmap
  xfs_iomap
  xfs_iomap_write_delay
  xfs_flush_space
  xfs_flush_device
  schedule_timeout_uninterruptible

with a 500 ms sleep in xfs_flush_device().

Thanks

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: several messages
  2006-10-05 16:33               ` Stephane Doyon
@ 2006-10-05 23:29                 ` David Chinner
  2006-10-06 13:03                   ` Stephane Doyon
  0 siblings, 1 reply; 17+ messages in thread
From: David Chinner @ 2006-10-05 23:29 UTC (permalink / raw)
  To: Stephane Doyon
  Cc: David Chinner, Trond Myklebust, xfs, nfs, Shailendra Tripathi

On Thu, Oct 05, 2006 at 12:33:05PM -0400, Stephane Doyon wrote:
> retrying, just plowing on...
>
>> this would trigger a 500 ms sleep on every write. That's the right
>> sort of ballpark for the slowness you were seeing - 5 GB / 32 KB * 0.5 s
>> = ~22 hours....
>>
>> This got fixed in 2.6.18-rc6 -
>
> You mean commit 4be536debe3f7b0c, right? (Actually -rc7, I believe...) I
> do have that one in my kernel. My kernel is 2.6.17 plus assorted XFS
> fixes.
>
>> can you retry with a 2.6.18 server
>> and see if your problem goes away?
>
> Unfortunately it will be several days before I have a chance to do that.
>
> The backtrace looked like this:
>
>   ...
>   nfsd_write
>   nfsd_vfs_write
>   vfs_writev
>   do_readv_writev
>   xfs_file_writev
>   xfs_write
>   generic_file_buffered_write
>   xfs_get_blocks
>   __xfs_get_blocks
>   xfs_bmap
>   xfs_iomap
>   xfs_iomap_write_delay
>   xfs_flush_space
>   xfs_flush_device
>   schedule_timeout_uninterruptible

Ahhh, this gets hit on the ->prepare_write path
(xfs_iomap_write_delay()), not the allocate path
(xfs_iomap_write_allocate()). Sorry - I got myself (and probably
everyone else) confused there, which is why I suspected sync writes -
they trigger the allocate path in the write call. I don't think 2.6.18
will change anything.

FWIW, I don't think we can avoid this sleep when we first hit ENOSPC
conditions, but perhaps once we are certain of the ENOSPC status we can
tag the filesystem with this state (say an xfs_mount flag) and only
clear that tag when something is freed. We could then use the tag to
avoid continually trying extremely hard to allocate space when we know
there is none available....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 17+ messages in thread
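[Editorial aside: a hypothetical userspace sketch of the mount-flag idea
described above, with every identifier invented for illustration; the
real change would live in the kernel's xfs_iomap/xfs_mount code, and the
patch posted later in this thread takes a similar approach. The point is
the state machine: tag on a confirmed ENOSPC, fail fast while tagged,
clear the tag when space is freed.]

```c
#define FS_FULL 0x1u   /* hypothetical stand-in for an xfs_mount flag */

struct fs_state {
    unsigned flags;
    long free_blocks;
};

/* Write path: instead of sleeping 500 ms on every attempt once the
 * filesystem is known to be full, fail fast while the tag is set. */
int try_alloc(struct fs_state *fs, long want)
{
    if (fs->flags & FS_FULL)
        return -1;                /* fail fast: we already know it's full */
    if (fs->free_blocks < want) {
        fs->flags |= FS_FULL;     /* tag on first confirmed ENOSPC */
        return -1;
    }
    fs->free_blocks -= want;
    return 0;
}

/* Free path: freeing space clears the tag so writes are retried. */
void on_free(struct fs_state *fs, long freed)
{
    fs->free_blocks += freed;
    fs->flags &= ~FS_FULL;
}
```

In this toy version the tag makes every allocation fail until something
is freed; a real implementation would still want the one initial flush
attempt that the discussion above says cannot be avoided.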
* Re: several messages
  2006-10-05 23:29                 ` David Chinner
@ 2006-10-06 13:03                   ` Stephane Doyon
  0 siblings, 0 replies; 17+ messages in thread
From: Stephane Doyon @ 2006-10-06 13:03 UTC (permalink / raw)
  To: David Chinner; +Cc: Trond Myklebust, xfs, nfs, Shailendra Tripathi

On Fri, 6 Oct 2006, David Chinner wrote:

>> The backtrace looked like this:
>>
>>   ...
>>   nfsd_write
>>   nfsd_vfs_write
>>   vfs_writev
>>   do_readv_writev
>>   xfs_file_writev
>>   xfs_write
>>   generic_file_buffered_write
>>   xfs_get_blocks
>>   __xfs_get_blocks
>>   xfs_bmap
>>   xfs_iomap
>>   xfs_iomap_write_delay
>>   xfs_flush_space
>>   xfs_flush_device
>>   schedule_timeout_uninterruptible
>
> Ahhh, this gets hit on the ->prepare_write path
> (xfs_iomap_write_delay()),

Yes.

> not the allocate path (xfs_iomap_write_allocate()). Sorry - I got myself
> (and probably everyone else) confused there, which is why I suspected
> sync writes - they trigger the allocate path in the write call. I don't
> think 2.6.18 will change anything.
>
> FWIW, I don't think we can avoid this sleep when we first hit ENOSPC
> conditions, but perhaps once we are certain of the ENOSPC status
> we can tag the filesystem with this state (say an xfs_mount flag)
> and only clear that tag when something is freed. We could then
> use the tag to avoid continually trying extremely hard to allocate
> space when we know there is none available....

Yes! That's what I was trying to suggest :-). Thank you.

Is that hard to do?

^ permalink raw reply	[flat|nested] 17+ messages in thread
[parent not found: <9E397A467F4DB34884A1FD0D5D27CF43018903F96E@msxaoa4.twosigma.com>]
* Re: several messages
       [not found] <9E397A467F4DB34884A1FD0D5D27CF43018903F96E@msxaoa4.twosigma.com>
@ 2008-06-12 16:54 ` Benjamin L. Shi
  0 siblings, 0 replies; 17+ messages in thread
From: Benjamin L. Shi @ 2008-06-12 16:54 UTC (permalink / raw)
  To: xfs

Index: fs/xfs/xfs_iomap.c
===================================================================
RCS file: /src/linux/2.6.18/fs/xfs/xfs_iomap.c,v
retrieving revision 1.1.1.1
retrieving revision 1.2
diff -u -r1.1.1.1 -r1.2
--- fs/xfs/xfs_iomap.c	29 Sep 2006 13:45:19 -0000	1.1.1.1
+++ fs/xfs/xfs_iomap.c	12 Jun 2008 15:59:10 -0000	1.2
@@ -706,11 +706,24 @@
 	 * then we must have run out of space - flush delalloc, and retry..
 	 */
 	if (nimaps == 0) {
+		if ((mp->m_flags & XFS_MOUNT_FULL) != 0) {
+			if (mp->m_sb.sb_fdblocks < 500) {
+				// printk("full again %llu\n",
+				// 	mp->m_sb.sb_fdblocks);
+				return XFS_ERROR(ENOSPC);
+			} else {
+				// printk("not full again %llu\n",
+				// 	mp->m_sb.sb_fdblocks);
+				mp->m_flags &= ~XFS_MOUNT_FULL;
+			}
+		}
 		xfs_iomap_enter_trace(XFS_IOMAP_WRITE_NOSPACE,
 				io, offset, count);
-		if (xfs_flush_space(ip, &fsynced, &ioflag))
+		if (xfs_flush_space(ip, &fsynced, &ioflag)) {
+			mp->m_flags |= XFS_MOUNT_FULL;
+			//printk("set full %llu\n", mp->m_sb.sb_fdblocks);
 			return XFS_ERROR(ENOSPC);
-
+		}
 		error = 0;
 		goto retry;
 	}

Index: fs/xfs/xfs_mount.h
===================================================================
RCS file: /src/linux/2.6.18/fs/xfs/xfs_mount.h,v
retrieving revision 1.1.1.1
retrieving revision 1.2
diff -u -r1.1.1.1 -r1.2
--- fs/xfs/xfs_mount.h	29 Sep 2006 13:45:19 -0000	1.1.1.1
+++ fs/xfs/xfs_mount.h	12 Jun 2008 15:59:10 -0000	1.2
@@ -459,6 +459,7 @@
 						 * I/O size in stat() */
 #define XFS_MOUNT_NO_PERCPU_SB	(1ULL << 23)	/* don't use per-cpu
 						 * superblock counters */
+#define XFS_MOUNT_FULL		(1ULL << 24)	/*

>
> On Fri, 6 Oct 2006, David Chinner wrote:
>
>>> The backtrace looked like this:
>>>
>>> ... nfsd_write nfsd_vfs_write vfs_writev do_readv_writev
>>> xfs_file_writev
>>> xfs_write generic_file_buffered_write xfs_get_blocks __xfs_get_blocks
>>> xfs_bmap xfs_iomap xfs_iomap_write_delay xfs_flush_space
>>> xfs_flush_device
>>> schedule_timeout_uninterruptible.
>>
>> Ahhh, this gets hit on the ->prepare_write path
>> (xfs_iomap_write_delay()),
>
> Yes.
>
>> not the allocate path (xfs_iomap_write_allocate()). Sorry - I got myself
>> (and probably everyone else) confused there, which is why I suspected
>> sync writes - they trigger the allocate path in the write call. I don't
>> think 2.6.18 will change anything.
>>
>> FWIW, I don't think we can avoid this sleep when we first hit ENOSPC
>> conditions, but perhaps once we are certain of the ENOSPC status
>> we can tag the filesystem with this state (say an xfs_mount flag)
>> and only clear that tag when something is freed. We could then
>> use the tag to avoid continually trying extremely hard to allocate
>> space when we know there is none available....
>
> Yes! That's what I was trying to suggest :-). Thank you.
>
> Is that hard to do?
>

^ permalink raw reply	[flat|nested] 17+ messages in thread
end of thread, other threads:[~2008-06-12 16:53 UTC | newest]
Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-09-26 18:51 Long sleep with i_mutex in xfs_flush_device(), affects NFS service Stephane Doyon
2006-09-26 19:06 ` [NFS] " Trond Myklebust
2006-09-26 20:05 ` Stephane Doyon
2006-09-26 20:29 ` Trond Myklebust
2006-09-27 11:33 ` Shailendra Tripathi
2006-10-02 14:45 ` Stephane Doyon
2006-10-02 22:30 ` David Chinner
2006-10-03 13:39 ` several messages Stephane Doyon
2006-10-03 16:40 ` Trond Myklebust
2006-10-05 15:39 ` Stephane Doyon
2006-10-06 0:33 ` David Chinner
2006-10-06 13:25 ` Stephane Doyon
2006-10-05 8:30 ` David Chinner
2006-10-05 16:33 ` Stephane Doyon
2006-10-05 23:29 ` David Chinner
2006-10-06 13:03 ` Stephane Doyon
[not found] <9E397A467F4DB34884A1FD0D5D27CF43018903F96E@msxaoa4.twosigma.com>
2008-06-12 16:54 ` Benjamin L. Shi