* Long sleep with i_mutex in xfs_flush_device(), affects NFS service
@ 2006-09-26 18:51 Stephane Doyon
2006-09-26 19:06 ` [NFS] " Trond Myklebust
2006-09-27 11:33 ` Shailendra Tripathi
0 siblings, 2 replies; 17+ messages in thread
From: Stephane Doyon @ 2006-09-26 18:51 UTC (permalink / raw)
To: xfs, nfs
Hi,
I'm seeing an unpleasant behavior when an XFS file system becomes full,
particularly when accessed over NFS. Both XFS and the Linux NFS client
appear to be contributing to the problem.
When the file system becomes nearly full, we eventually call down to
xfs_flush_device(), which sleeps for 0.5 seconds, waiting for xfssyncd
to do some work.
xfs_flush_space() does
	xfs_iunlock(ip, XFS_ILOCK_EXCL);
before calling xfs_flush_device(), but i_mutex is still held, at least
when we're being called from under xfs_write(). That seems like a
fairly long time to hold a mutex. And I wonder whether it's really
necessary to keep going through that again and again for every new
request after we've hit ENOSPC.
In particular this can cause a pileup when several threads are writing
concurrently to the same file. Some specialized apps might do that, and
nfsd threads do it all the time.
To reproduce locally, on a full file system:
#!/bin/sh
for i in `seq 30`; do
dd if=/dev/zero of=f bs=1 count=1 &
done
wait
Timing that, it takes almost exactly 15 s: 30 writers serialized on
i_mutex at 0.5 s each.
The Linux NFS client typically sends bunches of 16 requests, and so if
the client is writing a single file, some NFS requests are therefore
delayed by up to 8 seconds (16 requests at 0.5 s each), which is kind
of long for NFS.
What's worse, when my Linux NFS client writes out a file's pages, it
does not react immediately on receiving an ENOSPC error. It will
remember the error and report it later on close(), but it still goes
ahead and issues write requests for each page of the file. So even if
there isn't a pileup on the i_mutex on the server, the NFS client
still waits 0.5 s for each 32K (typically) request. So on an NFS
client on a gigabit network, on an already full filesystem, if I open
and write a 10M file and close() it, it takes 2m40.083s for it to
issue all the requests, get an ENOSPC for each, and finally have my
close() call return ENOSPC. That can stretch to several hours for
gigabyte-sized files, which is how I noticed the problem.
I'm not too familiar with the NFS client code, but would it not be
possible for it to give up when it encounters ENOSPC? Or is there some
reason why this wouldn't be desirable?
The rough workaround I have come up with for the problem is to have
xfs_flush_space() skip calling xfs_flush_device() if we are within
2 seconds of having returned ENOSPC. I have verified that this
workaround is effective, but I imagine there might be a cleaner
solution.
Thanks
^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [NFS] Long sleep with i_mutex in xfs_flush_device(), affects NFS service
  2006-09-26 18:51 Long sleep with i_mutex in xfs_flush_device(), affects NFS service Stephane Doyon
@ 2006-09-26 19:06 ` Trond Myklebust
  2006-09-26 20:05   ` Stephane Doyon
  2006-09-27 11:33 ` Shailendra Tripathi
  1 sibling, 1 reply; 17+ messages in thread
From: Trond Myklebust @ 2006-09-26 19:06 UTC (permalink / raw)
To: Stephane Doyon; +Cc: xfs, nfs

On Tue, 2006-09-26 at 14:51 -0400, Stephane Doyon wrote:
> Hi,
>
> I'm seeing an unpleasant behavior when an XFS file system becomes
> full, particularly when accessed over NFS. Both XFS and the Linux NFS
> client appear to be contributing to the problem.
>
> When the file system becomes nearly full, we eventually call down to
> xfs_flush_device(), which sleeps for 0.5 seconds, waiting for
> xfssyncd to do some work.
>
> xfs_flush_space() does
> 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> before calling xfs_flush_device(), but i_mutex is still held, at
> least when we're being called from under xfs_write(). It seems like a
> fairly long time to hold a mutex. And I wonder whether it's really
> necessary to keep going through that again and again for every new
> request after we've hit ENOSPC.
>
> In particular this can cause a pileup when several threads are
> writing concurrently to the same file. Some specialized apps might do
> that, and nfsd threads do it all the time.
>
> To reproduce locally, on a full file system:
> #!/bin/sh
> for i in `seq 30`; do
> 	dd if=/dev/zero of=f bs=1 count=1 &
> done
> wait
> Timing that, it takes almost exactly 15 s.
>
> The Linux NFS client typically sends bunches of 16 requests, and so
> if the client is writing a single file, some NFS requests are
> therefore delayed by up to 8 seconds, which is kind of long for NFS.

Why?
The file is still open, and so the standard close-to-open rules state
that you are not guaranteed that the cache will be flushed unless the
VM happens to want to reclaim memory.

> What's worse, when my Linux NFS client writes out a file's pages, it
> does not react immediately on receiving an ENOSPC error. It will
> remember the error and report it later on close(), but it still goes
> ahead and issues write requests for each page of the file. So even if
> there isn't a pileup on the i_mutex on the server, the NFS client
> still waits 0.5 s for each 32K (typically) request. So on an NFS
> client on a gigabit network, on an already full filesystem, if I open
> and write a 10M file and close() it, it takes 2m40.083s for it to
> issue all the requests, get an ENOSPC for each, and finally have my
> close() call return ENOSPC. That can stretch to several hours for
> gigabyte-sized files, which is how I noticed the problem.
>
> I'm not too familiar with the NFS client code, but would it not be
> possible for it to give up when it encounters ENOSPC? Or is there
> some reason why this wouldn't be desirable?

How would it then detect that you have fixed the problem on the
server?

Cheers,
Trond

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: [NFS] Long sleep with i_mutex in xfs_flush_device(), affects NFS service
  2006-09-26 19:06 ` [NFS] " Trond Myklebust
@ 2006-09-26 20:05   ` Stephane Doyon
  2006-09-26 20:29     ` Trond Myklebust
  0 siblings, 1 reply; 17+ messages in thread
From: Stephane Doyon @ 2006-09-26 20:05 UTC (permalink / raw)
To: Trond Myklebust; +Cc: xfs, nfs

On Tue, 26 Sep 2006, Trond Myklebust wrote:
[...]
>> When the file system becomes nearly full, we eventually call down to
>> xfs_flush_device(), which sleeps for 0.5 seconds, waiting for
>> xfssyncd to do some work.
>>
>> xfs_flush_space() does
>> 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
>> before calling xfs_flush_device(), but i_mutex is still held, at
>> least when we're being called from under xfs_write(). It seems like
>> a fairly long time to hold a mutex. And I wonder whether it's really
>> necessary to keep going through that again and again for every new
>> request after we've hit ENOSPC.
>>
>> In particular this can cause a pileup when several threads are
>> writing concurrently to the same file. Some specialized apps might
>> do that, and nfsd threads do it all the time.
[...]
>> The Linux NFS client typically sends bunches of 16 requests, and so
>> if the client is writing a single file, some NFS requests are
>> therefore delayed by up to 8 seconds, which is kind of long for NFS.
>
> Why? The file is still open, and so the standard close-to-open rules
> state that you are not guaranteed that the cache will be flushed
> unless the VM happens to want to reclaim memory.

I mean there will be a delay on the server, in responding to the
requests. Sorry for the confusion. When the NFS client does flush its
cache, each request will take an extra 0.5 s to execute on the server,
and the i_mutex will prevent their parallel execution on the server.

>> What's worse, when my Linux NFS client writes out a file's pages, it
>> does not react immediately on receiving an ENOSPC error.
>> It will remember the error and report it later on close(), but it
>> still goes ahead and issues write requests for each page of the
>> file. So even if there isn't a pileup on the i_mutex on the server,
>> the NFS client still waits 0.5 s for each 32K (typically) request.
>> So on an NFS client on a gigabit network, on an already full
>> filesystem, if I open and write a 10M file and close() it, it takes
>> 2m40.083s for it to issue all the requests, get an ENOSPC for each,
>> and finally have my close() call return ENOSPC. That can stretch to
>> several hours for gigabyte-sized files, which is how I noticed the
>> problem.
>>
>> I'm not too familiar with the NFS client code, but would it not be
>> possible for it to give up when it encounters ENOSPC? Or is there
>> some reason why this wouldn't be desirable?
>
> How would it then detect that you have fixed the problem on the
> server?

I suppose it has to try again at some point. Yet when flushing a file,
if even one write request gets an error response like ENOSPC, we know
some part of the data has not been written on the server, and close()
will return the appropriate error to the program on the client. If a
single write error is enough to cause close() to return an error, why
bother sending all the other write requests for that file? If we get
an error while flushing, couldn't that one flushing operation bail out
early?

As I said I'm not too familiar with the code, but AFAICT nfs_wb_all()
will keep flushing everything, and afterwards nfs_file_flush() will
check ctx->error. Perhaps ctx->error could be checked at some lower
level, maybe in nfs_sync_inode_wait...

I suppose it's not technically wrong to try to flush all the pages of
the file, but if the server file system is full then it will be at its
worst. Also if you happened to be on a slower link and have a big
cache to flush, you're waiting around for very little gain.

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: [NFS] Long sleep with i_mutex in xfs_flush_device(), affects NFS service
  2006-09-26 20:05 ` Stephane Doyon
@ 2006-09-26 20:29   ` Trond Myklebust
  0 siblings, 0 replies; 17+ messages in thread
From: Trond Myklebust @ 2006-09-26 20:29 UTC (permalink / raw)
To: Stephane Doyon; +Cc: xfs, nfs

On Tue, 2006-09-26 at 16:05 -0400, Stephane Doyon wrote:
> I suppose it's not technically wrong to try to flush all the pages of
> the file, but if the server file system is full then it will be at
> its worst. Also if you happened to be on a slower link and have a big
> cache to flush, you're waiting around for very little gain.

That all assumes that nobody fixes the problem on the server. If
somebody notices, and actually removes an unused file, then you may be
happy that the kernel preserved the last 80% of the apache log file
that was being written out.

ENOSPC is a transient error: that is why the current behaviour exists.

Cheers,
Trond

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: Long sleep with i_mutex in xfs_flush_device(), affects NFS service
  2006-09-26 18:51 Long sleep with i_mutex in xfs_flush_device(), affects NFS service Stephane Doyon
  2006-09-26 19:06 ` [NFS] " Trond Myklebust
@ 2006-09-27 11:33 ` Shailendra Tripathi
  2006-10-02 14:45   ` Stephane Doyon
  1 sibling, 1 reply; 17+ messages in thread
From: Shailendra Tripathi @ 2006-09-27 11:33 UTC (permalink / raw)
To: Stephane Doyon; +Cc: xfs, nfs

Hi Stephane,

> When the file system becomes nearly full, we eventually call down to
> xfs_flush_device(), which sleeps for 0.5 seconds, waiting for
> xfssyncd to do some work.
> xfs_flush_space() does
> 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> before calling xfs_flush_device(), but i_mutex is still held, at
> least when we're being called from under xfs_write().

1. I agree that the delay of 500 ms is not a deterministic wait.

2. xfs_flush_device is a big operation. It has to flush all the dirty
pages possibly in the cache on the device. Depending upon the device,
it might take a significant amount of time. In view of that, 500 ms is
not that unreasonable. Also, perhaps you would never want more than
one request to be queued for device flush.

3. The hope is that after one big flush operation, it would be able to
free up resources which are in a transient state (over-reservation of
blocks, delalloc, pending removes, ...). The whole operation is
intended to make sure that ENOSPC is not returned unless really
required.

4. This wait could be made deterministic by waiting for the syncer
thread to complete when the device flush is triggered.

> It seems like a fairly long time to hold a mutex. And I wonder
> whether it's really

It might not be that good even if it doesn't. This can return a
premature ENOSPC, or it can queue many xfs_flush_device requests
(which can make your system dead(-slow) anyway).

> necessary to keep going through that again and again for every new
> request after we've hit ENOSPC.
>
> In particular this can cause a pileup when several threads are
> writing concurrently to the same file. Some specialized apps might do
> that, and nfsd threads do it all the time.
>
> To reproduce locally, on a full file system:
> #!/bin/sh
> for i in `seq 30`; do
> 	dd if=/dev/zero of=f bs=1 count=1 &
> done
> wait
> Timing that, it takes almost exactly 15 s.
>
> The Linux NFS client typically sends bunches of 16 requests, and so
> if the client is writing a single file, some NFS requests are
> therefore delayed by up to 8 seconds, which is kind of long for NFS.
>
> What's worse, when my Linux NFS client writes out a file's pages, it
> does not react immediately on receiving an ENOSPC error. It will
> remember the error and report it later on close(), but it still goes
> ahead and issues write requests for each page of the file. So even if
> there isn't a pileup on the i_mutex on the server, the NFS client
> still waits 0.5 s for each 32K (typically) request. So on an NFS
> client on a gigabit network, on an already full filesystem, if I open
> and write a 10M file and close() it, it takes 2m40.083s for it to
> issue all the requests, get an ENOSPC for each, and finally have my
> close() call return ENOSPC. That can stretch to several hours for
> gigabyte-sized files, which is how I noticed the problem.
>
> I'm not too familiar with the NFS client code, but would it not be
> possible for it to give up when it encounters ENOSPC? Or is there
> some reason why this wouldn't be desirable?
>
> The rough workaround I have come up with for the problem is to have
> xfs_flush_space() skip calling xfs_flush_device() if we are within
> 2 seconds of having returned ENOSPC. I have verified that this
> workaround is effective, but I imagine there might be a cleaner
> solution.

The fix would not be a good idea for standalone use of XFS.
	if (nimaps == 0) {
		if (xfs_flush_space(ip, &fsynced, &ioflag))
			return XFS_ERROR(ENOSPC);
		error = 0;
		goto retry;
	}

xfs_flush_space:
	case 2:
		xfs_iunlock(ip, XFS_ILOCK_EXCL);
		xfs_flush_device(ip);
		xfs_ilock(ip, XFS_ILOCK_EXCL);
		*fsynced = 3;
		return 0;
	}
	return 1;

Let's say that you don't enqueue it for another 2 seconds. Then, in
the next retry it would return 1 and, hence, the outer if condition
would return ENOSPC. Please note that for standalone XFS, the
application or client mostly doesn't retry and, hence, it might return
a premature ENOSPC. You didn't notice this because, as you said, the
nfs client will retry in case of ENOSPC.

Assuming that you don't return *fsynced = 3 (instead *fsynced = 2),
the code path will loop (because of retry) and the CPU itself would
become busy for no good job.

You might experiment by adding a deterministic wait. When you enqueue,
set some flag. All others who come in between just get enqueued. Once
the device flush is over, wake up all. If the flush could free enough
resources, threads will proceed ahead and return. Otherwise, another
flush would be enqueued to flush what might have come since the last
flush.

> Thanks

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: Long sleep with i_mutex in xfs_flush_device(), affects NFS service
  2006-09-27 11:33 ` Shailendra Tripathi
@ 2006-10-02 14:45   ` Stephane Doyon
  2006-10-02 22:30     ` David Chinner
  0 siblings, 1 reply; 17+ messages in thread
From: Stephane Doyon @ 2006-10-02 14:45 UTC (permalink / raw)
To: Shailendra Tripathi; +Cc: xfs

On Wed, 27 Sep 2006, Shailendra Tripathi wrote:

> Hi Stephane,
>> When the file system becomes nearly full, we eventually call down to
>> xfs_flush_device(), which sleeps for 0.5 seconds, waiting for
>> xfssyncd to do some work. xfs_flush_space() does
>> 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
>> before calling xfs_flush_device(), but i_mutex is still held, at
>> least when we're being called from under xfs_write().
>
> 1. I agree that the delay of 500 ms is not a deterministic wait.
>
> 2. xfs_flush_device is a big operation. It has to flush all the dirty
> pages possibly in the cache on the device. Depending upon the device,
> it might take a significant amount of time. In view of that, 500 ms
> is not that unreasonable. Also, perhaps you would never want more
> than one request to be queued for device flush.
>
> 3. The hope is that after one big flush operation, it would be able
> to free up resources which are in a transient state (over-reservation
> of blocks, delalloc, pending removes, ...). The whole operation is
> intended to make sure that ENOSPC is not returned unless really
> required.

Yes I had surmised as much. That last part is still a little vague to
me... But my two points were:

- It's a long time to hold a mutex. The code bothers to drop the
  xfs_ilock, so I'm wondering whether the i_mutex had been forgotten?

- Once we've actually hit ENOSPC, do we need to try again? Isn't it
  possible to tell when resources have actually been freed?

> 4. This wait could be made deterministic by waiting for the syncer
> thread to complete when the device flush is triggered.
I remember that some time ago, there wasn't any xfs_syncd, and the
flushing operation was performed by the task wanting the free space.
And it would cause deadlocks. So I presume we would have to be careful
if we wanted to wait on sync.

>> The rough workaround I have come up with for the problem is to have
>> xfs_flush_space() skip calling xfs_flush_device() if we are within
>> 2 seconds of having returned ENOSPC. I have verified that this
>> workaround is effective, but I imagine there might be a cleaner
>> solution.
>
> The fix would not be a good idea for standalone use of XFS.
>
> 	if (nimaps == 0) {
> 		if (xfs_flush_space(ip, &fsynced, &ioflag))
> 			return XFS_ERROR(ENOSPC);
> 		error = 0;
> 		goto retry;
> 	}
>
> xfs_flush_space:
> 	case 2:
> 		xfs_iunlock(ip, XFS_ILOCK_EXCL);
> 		xfs_flush_device(ip);
> 		xfs_ilock(ip, XFS_ILOCK_EXCL);
> 		*fsynced = 3;
> 		return 0;
> 	}
> 	return 1;
>
> Let's say that you don't enqueue it for another 2 seconds. Then, in
> the next retry it would return 1 and, hence, the outer if condition
> would return ENOSPC. Please note that for standalone XFS, the
> application or client mostly doesn't retry and, hence, it might
> return a premature ENOSPC.
>
> You didn't notice this because, as you said, the nfs client will
> retry in case of ENOSPC.

I'm not entirely sure I follow your explanation. The *fsynced variable
is local to the xfs_iomap_write_delay() caller, so each call will go
through the three steps in xfs_flush_space(). What my workaround does
is, if we've done the xfs_flush_device() thing and still hit ENOSPC
within the last two seconds, and we've just tried again the first two
xfs_flush_space() steps, then we skip the third step and return
ENOSPC. So yes the file system might not be exactly entirely full
anymore, which is why I say it's a rough workaround, but it seems to
me the discrepancy shouldn't be very big either.
Whatever free space might have been missed would have had to be freed
after the last ENOSPC return, and must be such that only another
xfs_flush_device() call will make it available.

It seems to me ENOSPC has never been something very exact anyway: df
(statfs) often still shows a few remaining free blocks even on a full
file system. Apps can't really calculate how many blocks will be
needed for inodes, btrees and directories, so the number of remaining
data blocks is an approximation. I am not entirely sure that what
xfs_flush_device_work() does is quite deterministic, and as you said
the wait period is arbitrary. And I don't particularly care to get
every single last byte out of my file system, as long as there are no
flagrant inconsistencies such as rm -fr not freeing up some space.

> Assuming that you don't return *fsynced = 3 (instead *fsynced = 2),
> the code path will loop (because of retry) and the CPU itself would
> become busy for no good job.

Indeed.

> You might experiment by adding a deterministic wait. When you
> enqueue, set some flag. All others who come in between just get
> enqueued. Once the device flush is over, wake up all. If the flush
> could free enough resources, threads will proceed ahead and return.
> Otherwise, another flush would be enqueued to flush what might have
> come since the last flush.

But how do you know whether you need to flush again, or whether your
file system is really full this time? And there's still the issue with
the i_mutex.

Perhaps there's a way to evaluate how much resources are "in transient
state" as you put it. Otherwise, we could set a flag when ENOSPC is
returned, and have that flag cleared at appropriate places in the code
where blocks are actually freed. I keep running into various deadlocks
related to full file systems, so I'm wary of clever solutions :-).

[Dropped nfs@lists.sourceforge.net from Cc, as this discussion is
quite specific to xfs.]

^ permalink raw reply	[flat|nested] 17+ messages in thread
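The single-flusher, wake-everyone scheme being discussed can be
sketched in userspace with a condition variable. All names here are
illustrative; the real thing would have to live in xfssyncd and cope
with the locking and deadlock issues mentioned above:

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Sketch: only one thread performs the device flush; any other thread
 * that hits ENOSPC while a flush is in progress just waits for that
 * flush to finish instead of queueing another one. */
static pthread_mutex_t flush_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  flush_done = PTHREAD_COND_INITIALIZER;
static bool flush_in_progress;
static int  flush_count;	/* for demonstration only */

/* Stand-in for the expensive xfs_flush_device() work. */
static void do_device_flush(void)
{
	flush_count++;
}

void flush_and_wait(void)
{
	pthread_mutex_lock(&flush_lock);
	if (flush_in_progress) {
		/* Another thread is already flushing: wait for its
		 * result rather than piling on a second flush. */
		while (flush_in_progress)
			pthread_cond_wait(&flush_done, &flush_lock);
	} else {
		flush_in_progress = true;
		pthread_mutex_unlock(&flush_lock);
		do_device_flush();	/* the one expensive flush */
		pthread_mutex_lock(&flush_lock);
		flush_in_progress = false;
		pthread_cond_broadcast(&flush_done);
	}
	pthread_mutex_unlock(&flush_lock);
}
```

After being woken, each caller would retry its allocation; if that
still fails, the file system really is full as of the last flush.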
* Re: Long sleep with i_mutex in xfs_flush_device(), affects NFS service
  2006-10-02 14:45 ` Stephane Doyon
@ 2006-10-02 22:30   ` David Chinner
  2006-10-03 13:39     ` several messages Stephane Doyon
  0 siblings, 1 reply; 17+ messages in thread
From: David Chinner @ 2006-10-02 22:30 UTC (permalink / raw)
To: Stephane Doyon; +Cc: Shailendra Tripathi, xfs

On Mon, Oct 02, 2006 at 10:45:12AM -0400, Stephane Doyon wrote:
> On Wed, 27 Sep 2006, Shailendra Tripathi wrote:
>
>> Hi Stephane,
>>> When the file system becomes nearly full, we eventually call down
>>> to xfs_flush_device(), which sleeps for 0.5 seconds, waiting for
>>> xfssyncd to do some work. xfs_flush_space() does
>>> 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
>>> before calling xfs_flush_device(), but i_mutex is still held, at
>>> least when we're being called from under xfs_write().
>>
>> 1. I agree that the delay of 500 ms is not a deterministic wait.

AFAICT, it was never intended to be. It's not deterministic, and the
wait is really only there to ensure that the synchronous log force
catches all the operations that may have recently occurred so they can
be unpinned and flushed. For example, an extent that has been
truncated and freed cannot be reused until the transaction that it was
freed in has actually been committed to disk.....

>> 2. xfs_flush_device is a big operation. It has to flush all the
>> dirty pages possibly in the cache on the device. Depending upon the
>> device, it might take a significant amount of time. In view of that,
>> 500 ms is not that unreasonable. Also, perhaps you would never want
>> more than one request to be queued for device flush.
>> 3. The hope is that after one big flush operation, it would be able
>> to free up resources which are in a transient state
>> (over-reservation of blocks, delalloc, pending removes, ...). The
>> whole operation is intended to make sure that ENOSPC is not returned
>> unless really required.
>
> Yes I had surmised as much.
> That last part is still a little vague to me... But my two points
> were:
>
> - It's a long time to hold a mutex. The code bothers to drop the
>   xfs_ilock, so I'm wondering whether the i_mutex had been forgotten?

This deep in the XFS allocation functions, we cannot tell if we hold
the i_mutex or not, and it plays no part in determining if we have
space or not. Hence we don't touch it here.

> - Once we've actually hit ENOSPC, do we need to try again? Isn't it
>   possible to tell when resources have actually been freed?

Given that the only way to determine if space was made available is to
query every AG in the exact same way an allocation does, it makes
sense to try the allocation again to determine if space was made
available....

>> 4. This wait could be made deterministic by waiting for the syncer
>> thread to complete when the device flush is triggered.
>
> I remember that some time ago, there wasn't any xfs_syncd, and the
> flushing operation was performed by the task wanting the free space.
> And it would cause deadlocks. So I presume we would have to be
> careful if we wanted to wait on sync.

*nod* Last thing we want is more deadlocks. This code is already
convoluted enough without adding yet more special cases to it....

>>> The rough workaround I have come up with for the problem is to have
>>> xfs_flush_space() skip calling xfs_flush_device() if we are within
>>> 2 seconds of having returned ENOSPC. I have verified that this
>>> workaround is effective, but I imagine there might be a cleaner
>>> solution.
>>
>> The fix would not be a good idea for standalone use of XFS.

I doubt it's a good idea for an NFS server, either. Remember that XFS,
like most filesystems, trades off speed for correctness as we approach
ENOSPC. Many parts of XFS slow down as we approach ENOSPC, and this is
just one example of where we need to be correct, not fast.
> It seems to me ENOSPC has never been something very exact anyway: df
> (statfs) often still shows a few remaining free blocks even on a full
> file system. Apps can't really calculate how many blocks will be
> needed for inodes, btrees and directories, so the number of remaining
> data blocks is an approximation.

It's not supposed to be an approximation - the number reported by df
should be taking all this into account because it's coming directly
from how much space XFS thinks it has available.

>> You might experiment by adding a deterministic wait. When you
>> enqueue, set some flag. All others who come in between just get
>> enqueued. Once the device flush is over, wake up all. If the flush
>> could free enough resources, threads will proceed ahead and return.
>> Otherwise, another flush would be enqueued to flush what might have
>> come since the last flush.
>
> But how do you know whether you need to flush again, or whether your
> file system is really full this time? And there's still the issue
> with the i_mutex.
>
> Perhaps there's a way to evaluate how much resources are "in
> transient state" as you put it.

I doubt there's any way of doing this without introducing non-enospc
performance regressions and extra memory usage.

> Otherwise, we could set a flag when ENOSPC is returned, and have that
> flag cleared at appropriate places in the code where blocks are
> actually freed. I keep running into various deadlocks related to full
> file systems, so I'm wary of clever solutions :-).

IMO, this is a non-problem. You're talking about optimising a
relatively rare corner case where correctness is more important than
speed and your test case is highly artificial. AFAIC, if you are
running at ENOSPC then you get what performance is appropriate for
correctness and if you are continually running at ENOSPC, then buy
some more disks.....

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: several messages
  2006-10-02 22:30 ` David Chinner
@ 2006-10-03 13:39   ` several messages Stephane Doyon
  2006-10-03 16:40     ` Trond Myklebust
  2006-10-05  8:30     ` David Chinner
  0 siblings, 2 replies; 17+ messages in thread
From: Stephane Doyon @ 2006-10-03 13:39 UTC (permalink / raw)
To: Trond Myklebust, David Chinner; +Cc: xfs, nfs, Shailendra Tripathi

Sorry for insisting, but it seems to me there's still a problem in
need of fixing: when writing a 5GB file over NFS to an XFS file system
and hitting ENOSPC, it takes on the order of 22 hours before my
application gets an error, whereas it would normally take about
2 minutes if the file system did not become full.

Perhaps I was being a bit too "constructive" and drowned my point in
explanations and proposed workarounds... You are telling me that
neither NFS nor XFS is doing anything wrong, and I can understand your
points of view, but surely that behavior isn't considered acceptable?

On Tue, 26 Sep 2006, Trond Myklebust wrote:

> On Tue, 2006-09-26 at 16:05 -0400, Stephane Doyon wrote:
>> I suppose it's not technically wrong to try to flush all the pages
>> of the file, but if the server file system is full then it will be
>> at its worst. Also if you happened to be on a slower link and have a
>> big cache to flush, you're waiting around for very little gain.
>
> That all assumes that nobody fixes the problem on the server. If
> somebody notices, and actually removes an unused file, then you may
> be happy that the kernel preserved the last 80% of the apache log
> file that was being written out.
>
> ENOSPC is a transient error: that is why the current behaviour
> exists.

On Tue, 3 Oct 2006, David Chinner wrote:

> This deep in the XFS allocation functions, we cannot tell if we hold
> the i_mutex or not, and it plays no part in determining if we have
> space or not. Hence we don't touch it here.

> I doubt it's a good idea for an NFS server, either.
[...]
> Remember that XFS, like most filesystems, trades off speed for
> correctness as we approach ENOSPC. Many parts of XFS slow down as we
> approach ENOSPC, and this is just one example of where we need to be
> correct, not fast.
[...]
> IMO, this is a non-problem. You're talking about optimising a
> relatively rare corner case where correctness is more important than
> speed and your test case is highly artificial. AFAIC, if you are
> running at ENOSPC then you get what performance is appropriate for
> correctness and if you are continually running at ENOSPC, then buy
> some more disks.....

My recipe to reproduce the problem locally is admittedly somewhat
artificial, but the problematic usage definitely isn't: simply an app
on an NFS client that happens to fill up a file system. There must be
some way to handle this better.

Thanks

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: several messages
  2006-10-03 13:39 ` several messages Stephane Doyon
@ 2006-10-03 16:40   ` Trond Myklebust
  2006-10-05 15:39     ` Stephane Doyon
  1 sibling, 1 reply; 17+ messages in thread
From: Trond Myklebust @ 2006-10-03 16:40 UTC (permalink / raw)
To: Stephane Doyon; +Cc: David Chinner, xfs, nfs, Shailendra Tripathi

On Tue, 2006-10-03 at 09:39 -0400, Stephane Doyon wrote:
> Sorry for insisting, but it seems to me there's still a problem in
> need of fixing: when writing a 5GB file over NFS to an XFS file
> system and hitting ENOSPC, it takes on the order of 22 hours before
> my application gets an error, whereas it would normally take about
> 2 minutes if the file system did not become full.
>
> Perhaps I was being a bit too "constructive" and drowned my point in
> explanations and proposed workarounds... You are telling me that
> neither NFS nor XFS is doing anything wrong, and I can understand
> your points of view, but surely that behavior isn't considered
> acceptable?

Sure it is. You are allowing the kernel to cache 5GB, and that means
you only get the error message when close() completes.

If you want faster error reporting, there are modes like O_SYNC,
O_DIRECT, that will attempt to flush the data more quickly. In
addition, you can force flushing using fsync(). Finally, you can tweak
the VM into flushing more often using /proc/sys/vm.

Cheers,
Trond

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: several messages
  2006-10-03 16:40 ` Trond Myklebust
@ 2006-10-05 15:39   ` Stephane Doyon
  2006-10-06  0:33     ` David Chinner
  0 siblings, 1 reply; 17+ messages in thread
From: Stephane Doyon @ 2006-10-05 15:39 UTC (permalink / raw)
To: Trond Myklebust; +Cc: David Chinner, xfs, nfs, Shailendra Tripathi

On Tue, 3 Oct 2006, Trond Myklebust wrote:

> On Tue, 2006-10-03 at 09:39 -0400, Stephane Doyon wrote:
>> Sorry for insisting, but it seems to me there's still a problem in
>> need of fixing: when writing a 5GB file over NFS to an XFS file
>> system and hitting ENOSPC, it takes on the order of 22 hours before
>> my application gets an error, whereas it would normally take about
>> 2 minutes if the file system did not become full.
>>
>> Perhaps I was being a bit too "constructive" and drowned my point in
>> explanations and proposed workarounds... You are telling me that
>> neither NFS nor XFS is doing anything wrong, and I can understand
>> your points of view, but surely that behavior isn't considered
>> acceptable?
>
> Sure it is.

If you say so :-).

> You are allowing the kernel to cache 5GB, and that means you only get
> the error message when close() completes.

But it's not actually caching the entire 5GB at once... I guess you're
saying that doesn't matter...?

> If you want faster error reporting, there are modes like O_SYNC,
> O_DIRECT, that will attempt to flush the data more quickly. In
> addition, you can force flushing using fsync().

What if the program is a standard utility like cp?

> Finally, you can tweak the VM into flushing more often using
> /proc/sys/vm.

It doesn't look to me like a question of degrees about how early to
flush. Actually my client can't possibly be caching all of 5GB, it
doesn't have the RAM or swap for that. Tracing it more carefully, it
appears dirty data starts being flushed after a few hundred MBs. No
error is returned on the subsequent writes, only on the final close().
I see some of the write() calls are delayed, presumably when the machine reaches the dirty threshold. So I don't see how the vm settings can help in this case. I hadn't realized that the issue isn't just with the final flush on close(). It's actually been flushing all along, delaying some of the subsequent write()s, getting NOSPC errors but not reporting them until the end. I understand that since my application did not request any syncing, the system cannot guarantee to report errors until cached data has been flushed. But some data has indeed been flushed with an error; can't this be reported earlier than on close? Would it be incorrect for a subsequent write to return the error that occurred while flushing data from previous writes? Then the app could decide whether to continue and retry or not. But I guess I can see how that might get convoluted. Thanks for your patience, ^ permalink raw reply [flat|nested] 17+ messages in thread
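[Editorial aside: short of changing the client, an application can bound
how much unreported data it has in flight by calling fsync() every N
bytes, so a writeback ENOSPC surfaces after at most N bytes instead of
only at close(). A hedged sketch, not code from the thread; the function
name and chunk size are invented for illustration.]

```c
#include <fcntl.h>
#include <unistd.h>

/* Write `len` bytes, fsync()ing every `chunk` bytes so async writeback
 * errors (e.g. ENOSPC) surface early instead of only at close(). */
int write_bounded(const char *path, const char *buf, size_t len, size_t chunk)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    size_t done = 0, since_sync = 0;
    while (done < len) {
        size_t n = (len - done < chunk) ? len - done : chunk;
        if (write(fd, buf + done, n) != (ssize_t)n)
            goto fail;
        done += n;
        since_sync += n;
        if (since_sync >= chunk) {
            if (fsync(fd) < 0)   /* a flush error is reported here, */
                goto fail;       /* at most `chunk` bytes late */
            since_sync = 0;
        }
    }
    {
        int err = fsync(fd);     /* final flush, then check close() too */
        if (close(fd) < 0 || err < 0)
            return -1;
    }
    return 0;
fail:
    close(fd);
    return -1;
}
```

This does not help with a stock utility like cp, which is exactly the
objection raised above, but it shows what "bounding the cache" looks like
from the application side.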
* Re: several messages
  2006-10-05 15:39               ` Stephane Doyon
@ 2006-10-06  0:33                 ` David Chinner
  2006-10-06 13:25                   ` Stephane Doyon
  0 siblings, 1 reply; 17+ messages in thread
From: David Chinner @ 2006-10-06 0:33 UTC (permalink / raw)
  To: Stephane Doyon
  Cc: Trond Myklebust, David Chinner, xfs, nfs, Shailendra Tripathi

On Thu, Oct 05, 2006 at 11:39:45AM -0400, Stephane Doyon wrote:
>
> I hadn't realized that the issue isn't just with the final flush on
> close(). It's actually been flushing all along, delaying some of the
> subsequent write()s, getting NOSPC errors but not reporting them until
> the end.

Other NFS clients will report an ENOSPC on the next write() or close()
if the error is reported during async writeback. The clients that
typically do this throw away any unwritten data as well, on the basis
that the application was returned an error ASAP and it is now Somebody
Else's Problem (i.e. the application needs to handle it from there).

> I understand that since my application did not request any syncing, the
> system cannot guarantee to report errors until cached data has been
> flushed. But some data has indeed been flushed with an error; can't this
> be reported earlier than on close?

It could, but...

> Would it be incorrect for a subsequent write to return the error that
> occurred while flushing data from previous writes? Then the app could
> decide whether to continue and retry or not. But I guess I can see how
> that might get convoluted.

.... there are many entertaining hoops to jump through to do this
reliably.

FWIW, these are simply two different approaches to handling ENOSPC (and
other server) errors. Mostly it comes down to how the people who
implemented the NFS client think it's best to handle the errors in the
scenarios that they most care about.

For example: when you have large amounts of cached data, expedient error
reporting and tossing unwritten data leads to much faster error recovery
than trying to write every piece of data (hence the Irix use of this
method). OTOH, when you really want as much of the data as possible to
get to the server (e.g. log files), regardless of whether you lose some,
then you try to write every bit of data before telling the application.

There's no clear right or wrong approach here - both have their
advantages and disadvantages for different workloads. If it weren't for
the sub-optimal behaviour of XFS in this case, you probably wouldn't
have even cared about this....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: several messages
  2006-10-06  0:33                 ` David Chinner
@ 2006-10-06 13:25                   ` Stephane Doyon
  0 siblings, 0 replies; 17+ messages in thread
From: Stephane Doyon @ 2006-10-06 13:25 UTC (permalink / raw)
  To: David Chinner; +Cc: Trond Myklebust, xfs, nfs, Shailendra Tripathi

On Fri, 6 Oct 2006, David Chinner wrote:

> On Thu, Oct 05, 2006 at 11:39:45AM -0400, Stephane Doyon wrote:
>>
>> I hadn't realized that the issue isn't just with the final flush on
>> close(). It's actually been flushing all along, delaying some of the
>> subsequent write()s, getting NOSPC errors but not reporting them until
>> the end.
>
> Other NFS clients will report an ENOSPC on the next write() or close()
> if the error is reported during async writeback. The clients that
> typically do this throw away any unwritten data as well, on the basis
> that the application was returned an error ASAP and it is now Somebody
> Else's Problem (i.e. the application needs to handle it from there).

Well, the client wouldn't necessarily have to throw away cached data. It
could conceivably be made to return ENOSPC on some subsequent write. It
would need to throw away the data for that write, but not necessarily
destroy its cache. It could then clear the error condition and allow the
application to keep trying if it wants to...

>> Would it be incorrect for a subsequent write to return the error that
>> occurred while flushing data from previous writes? Then the app could
>> decide whether to continue and retry or not. But I guess I can see how
>> that might get convoluted.
>
> .... there are many entertaining hoops to jump through to do this
> reliably.

I imagine there would be...

> For example: when you have large amounts of cached data, expedient
> error reporting and tossing unwritten data leads to much faster
> error recovery than trying to write every piece of data (hence the
> Irix use of this method).

In my case, I didn't think I was caching that much data though, only a
few hundred MB, and I wouldn't have minded so much if an error had been
returned after that much. The way it's implemented, though, I can write
an unbounded amount of data through that cache and not be told of the
problem until I close or fsync. It may not be technically wrong, but
given the outrageous delay I saw in my particular situation, it felt
pretty suboptimal.

> There's no clear right or wrong approach here - both have their
> advantages and disadvantages for different workloads. If it
> weren't for the sub-optimal behaviour of XFS in this case, you
> probably wouldn't have even cared about this....

Indeed not! In fact, changing the client is not practical for me; what I
need is a fix for the XFS behavior. I just thought it was also worth
reporting what I perceived to be an issue with the NFS client.

Thanks

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: several messages
  2006-10-03 13:39           ` several messages Stephane Doyon
  2006-10-03 16:40             ` Trond Myklebust
@ 2006-10-05  8:30             ` David Chinner
  2006-10-05 16:33               ` Stephane Doyon
  1 sibling, 1 reply; 17+ messages in thread
From: David Chinner @ 2006-10-05 8:30 UTC (permalink / raw)
  To: Stephane Doyon
  Cc: Trond Myklebust, David Chinner, xfs, nfs, Shailendra Tripathi

On Tue, Oct 03, 2006 at 09:39:55AM -0400, Stephane Doyon wrote:
> Sorry for insisting, but it seems to me there's still a problem in need
> of fixing: when writing a 5 GB file over NFS to an XFS file system and
> hitting ENOSPC, it takes on the order of 22 hours before my application
> gets an error, whereas it would normally take about 2 minutes if the
> file system did not become full.
>
> Perhaps I was being a bit too "constructive" and drowned my point in
> explanations and proposed workarounds... You are telling me that neither
> NFS nor XFS is doing anything wrong, and I can understand your points of
> view, but surely that behavior isn't considered acceptable?

I agree that this is a little extreme, and I can't recall seeing
anything like this before, but I can see how it may happen if the NFS
client continues to try to write every dirty page after getting an
ENOSPC and each one of those writes has to wait for 500 ms.

However, you did not mention what kernel version you are running. One
recent bug (introduced by a fix for deadlocks at ENOSPC) could allow
oversubscription of free space to occur in XFS, resulting in the write
being allowed to proceed (i.e. sufficient space for the data blocks) but
then failing the allocation because there weren't enough blocks put
aside for potential btree splits that occur during allocation. If the
linux client is using sync writes on retry, then this would trigger a
500 ms sleep on every write. That's the right sort of ballpark for the
slowness you were seeing - 5 GB / 32 KB * 0.5 s = ~22 hours....

This got fixed in 2.6.18-rc6 - can you retry with a 2.6.18 server and
see if your problem goes away?

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 17+ messages in thread
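[Editorial aside: the back-of-envelope figure above checks out. A purely
illustrative arithmetic helper (the function name is invented):]

```c
/* Back-of-envelope: data size in GB, pushed out in request_kb-sized NFS
 * requests, with each request eating one sleep_s delay (the 500 ms sleep
 * in xfs_flush_device()). Returns the total stall time in hours. */
double enospc_stall_hours(double gigabytes, double request_kb, double sleep_s)
{
    double requests = gigabytes * 1024.0 * 1024.0 / request_kb;
    return requests * sleep_s / 3600.0;
}
```

For 5 GB in 32 KB requests with a 0.5 s sleep each, this gives 163,840
requests and roughly 22.8 hours - matching the ~22 hour estimate.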
* Re: several messages
  2006-10-05  8:30             ` David Chinner
@ 2006-10-05 16:33               ` Stephane Doyon
  2006-10-05 23:29                 ` David Chinner
  0 siblings, 1 reply; 17+ messages in thread
From: Stephane Doyon @ 2006-10-05 16:33 UTC (permalink / raw)
  To: David Chinner; +Cc: Trond Myklebust, xfs, nfs, Shailendra Tripathi

On Thu, 5 Oct 2006, David Chinner wrote:

> On Tue, Oct 03, 2006 at 09:39:55AM -0400, Stephane Doyon wrote:
>> Sorry for insisting, but it seems to me there's still a problem in need
>> of fixing: when writing a 5 GB file over NFS to an XFS file system and
>> hitting ENOSPC, it takes on the order of 22 hours before my application
>> gets an error, whereas it would normally take about 2 minutes if the
>> file system did not become full.
>>
>> Perhaps I was being a bit too "constructive" and drowned my point in
>> explanations and proposed workarounds... You are telling me that
>> neither NFS nor XFS is doing anything wrong, and I can understand your
>> points of view, but surely that behavior isn't considered acceptable?
>
> I agree that this is a little extreme, and I can't recall seeing
> anything like this before, but I can see how it may happen if the
> NFS client continues to try to write every dirty page after getting
> an ENOSPC and each one of those writes has to wait for 500 ms.
>
> However, you did not mention what kernel version you are running.
> One recent bug (introduced by a fix for deadlocks at ENOSPC) could
> allow oversubscription of free space to occur in XFS, resulting in

I do have that fix in my kernel. (I'm the one who pointed you to the
patch that introduced that particular problem.)

> the write being allowed to proceed (i.e. sufficient space for the
> data blocks) but then failing the allocation because there weren't
> enough blocks put aside for potential btree splits that occur during
> allocation. If the linux client is using sync writes on retry, then

The writes from nfsd shouldn't be sync. Technically it's not even
retrying, just plowing on...

> this would trigger a 500 ms sleep on every write. That's the right
> sort of ballpark for the slowness you were seeing - 5 GB / 32 KB * 0.5 s
> = ~22 hours....
>
> This got fixed in 2.6.18-rc6 -

You mean commit 4be536debe3f7b0c, right? (Actually -rc7, I believe...) I
do have that one in my kernel. My kernel is 2.6.17 plus assorted XFS
fixes.

> can you retry with a 2.6.18 server
> and see if your problem goes away?

Unfortunately it will be several days before I have a chance to do that.

The backtrace looked like this:

  ...
  nfsd_write
  nfsd_vfs_write
  vfs_writev
  do_readv_writev
  xfs_file_writev
  xfs_write
  generic_file_buffered_write
  xfs_get_blocks
  __xfs_get_blocks
  xfs_bmap
  xfs_iomap
  xfs_iomap_write_delay
  xfs_flush_space
  xfs_flush_device
  schedule_timeout_uninterruptible

with a 500 ms sleep in xfs_flush_device().

Thanks

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: several messages
  2006-10-05 16:33               ` Stephane Doyon
@ 2006-10-05 23:29                 ` David Chinner
  2006-10-06 13:03                   ` Stephane Doyon
  0 siblings, 1 reply; 17+ messages in thread
From: David Chinner @ 2006-10-05 23:29 UTC (permalink / raw)
  To: Stephane Doyon
  Cc: David Chinner, Trond Myklebust, xfs, nfs, Shailendra Tripathi

On Thu, Oct 05, 2006 at 12:33:05PM -0400, Stephane Doyon wrote:
> retrying, just plowing on...
>
>> this would trigger a 500 ms sleep on every write. That's the right
>> sort of ballpark for the slowness you were seeing - 5 GB / 32 KB * 0.5 s
>> = ~22 hours....
>>
>> This got fixed in 2.6.18-rc6 -
>
> You mean commit 4be536debe3f7b0c, right? (Actually -rc7, I believe...) I
> do have that one in my kernel. My kernel is 2.6.17 plus assorted XFS
> fixes.
>
>> can you retry with a 2.6.18 server
>> and see if your problem goes away?
>
> Unfortunately it will be several days before I have a chance to do that.
>
> The backtrace looked like this:
>
>   ...
>   nfsd_write
>   nfsd_vfs_write
>   vfs_writev
>   do_readv_writev
>   xfs_file_writev
>   xfs_write
>   generic_file_buffered_write
>   xfs_get_blocks
>   __xfs_get_blocks
>   xfs_bmap
>   xfs_iomap
>   xfs_iomap_write_delay
>   xfs_flush_space
>   xfs_flush_device
>   schedule_timeout_uninterruptible

Ahhh, this gets hit on the ->prepare_write path
(xfs_iomap_write_delay()), not the allocate path
(xfs_iomap_write_allocate()). Sorry - I got myself (and probably
everyone else) confused there, which is why I suspected sync writes -
they trigger the allocate path in the write call. I don't think 2.6.18
will change anything.

FWIW, I don't think we can avoid this sleep when we first hit ENOSPC
conditions, but perhaps once we are certain of the ENOSPC status we can
tag the filesystem with this state (say an xfs_mount flag) and only
clear that tag when something is freed. We could then use the tag to
avoid continually trying extremely hard to allocate space when we know
there is none available....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 17+ messages in thread
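[Editorial aside: a hypothetical userspace sketch of the mount-flag idea
described above, with every identifier invented for illustration; the
real change would live in the kernel's xfs_iomap/xfs_mount code, and the
patch posted later in this thread takes a similar approach. The point is
the state machine: tag on a confirmed ENOSPC, fail fast while tagged,
clear the tag when space is freed.]

```c
#define FS_FULL 0x1u   /* hypothetical stand-in for an xfs_mount flag */

struct fs_state {
    unsigned flags;
    long free_blocks;
};

/* Write path: instead of sleeping 500 ms on every attempt once the
 * filesystem is known to be full, fail fast while the tag is set. */
int try_alloc(struct fs_state *fs, long want)
{
    if (fs->flags & FS_FULL)
        return -1;                /* fail fast: we already know it's full */
    if (fs->free_blocks < want) {
        fs->flags |= FS_FULL;     /* tag on first confirmed ENOSPC */
        return -1;
    }
    fs->free_blocks -= want;
    return 0;
}

/* Free path: freeing space clears the tag so writes are retried. */
void on_free(struct fs_state *fs, long freed)
{
    fs->free_blocks += freed;
    fs->flags &= ~FS_FULL;
}
```

In this toy version the tag makes every allocation fail until something
is freed; a real implementation would still want the one initial flush
attempt that the discussion above says cannot be avoided.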
* Re: several messages
  2006-10-05 23:29                 ` David Chinner
@ 2006-10-06 13:03                   ` Stephane Doyon
  0 siblings, 0 replies; 17+ messages in thread
From: Stephane Doyon @ 2006-10-06 13:03 UTC (permalink / raw)
  To: David Chinner; +Cc: Trond Myklebust, xfs, nfs, Shailendra Tripathi

On Fri, 6 Oct 2006, David Chinner wrote:

>> The backtrace looked like this:
>>
>>   ...
>>   nfsd_write
>>   nfsd_vfs_write
>>   vfs_writev
>>   do_readv_writev
>>   xfs_file_writev
>>   xfs_write
>>   generic_file_buffered_write
>>   xfs_get_blocks
>>   __xfs_get_blocks
>>   xfs_bmap
>>   xfs_iomap
>>   xfs_iomap_write_delay
>>   xfs_flush_space
>>   xfs_flush_device
>>   schedule_timeout_uninterruptible
>
> Ahhh, this gets hit on the ->prepare_write path
> (xfs_iomap_write_delay()),

Yes.

> not the allocate path (xfs_iomap_write_allocate()). Sorry - I got myself
> (and probably everyone else) confused there, which is why I suspected
> sync writes - they trigger the allocate path in the write call. I don't
> think 2.6.18 will change anything.
>
> FWIW, I don't think we can avoid this sleep when we first hit ENOSPC
> conditions, but perhaps once we are certain of the ENOSPC status
> we can tag the filesystem with this state (say an xfs_mount flag)
> and only clear that tag when something is freed. We could then
> use the tag to avoid continually trying extremely hard to allocate
> space when we know there is none available....

Yes! That's what I was trying to suggest :-). Thank you.

Is that hard to do?

^ permalink raw reply	[flat|nested] 17+ messages in thread
[parent not found: <9E397A467F4DB34884A1FD0D5D27CF43018903F96E@msxaoa4.twosigma.com>]
* Re: several messages
       [not found] <9E397A467F4DB34884A1FD0D5D27CF43018903F96E@msxaoa4.twosigma.com>
@ 2008-06-12 16:54 ` Benjamin L. Shi
  0 siblings, 0 replies; 17+ messages in thread
From: Benjamin L. Shi @ 2008-06-12 16:54 UTC (permalink / raw)
  To: xfs

Index: fs/xfs/xfs_iomap.c
===================================================================
RCS file: /src/linux/2.6.18/fs/xfs/xfs_iomap.c,v
retrieving revision 1.1.1.1
retrieving revision 1.2
diff -u -r1.1.1.1 -r1.2
--- fs/xfs/xfs_iomap.c	29 Sep 2006 13:45:19 -0000	1.1.1.1
+++ fs/xfs/xfs_iomap.c	12 Jun 2008 15:59:10 -0000	1.2
@@ -706,11 +706,24 @@
 	 * then we must have run out of space - flush delalloc, and retry..
 	 */
 	if (nimaps == 0) {
+		if ((mp->m_flags & XFS_MOUNT_FULL) != 0) {
+			if (mp->m_sb.sb_fdblocks < 500) {
+				// printk("full again %llu\n",
+				// 	mp->m_sb.sb_fdblocks);
+				return XFS_ERROR(ENOSPC);
+			} else {
+				// printk("not full again %llu\n",
+				// 	mp->m_sb.sb_fdblocks);
+				mp->m_flags &= ~XFS_MOUNT_FULL;
+			}
+		}
 		xfs_iomap_enter_trace(XFS_IOMAP_WRITE_NOSPACE,
 				io, offset, count);
-		if (xfs_flush_space(ip, &fsynced, &ioflag))
+		if (xfs_flush_space(ip, &fsynced, &ioflag)) {
+			mp->m_flags |= XFS_MOUNT_FULL;
+			//printk("set full %llu\n", mp->m_sb.sb_fdblocks);
 			return XFS_ERROR(ENOSPC);
-
+		}
 		error = 0;
 		goto retry;
 	}

Index: fs/xfs/xfs_mount.h
===================================================================
RCS file: /src/linux/2.6.18/fs/xfs/xfs_mount.h,v
retrieving revision 1.1.1.1
retrieving revision 1.2
diff -u -r1.1.1.1 -r1.2
--- fs/xfs/xfs_mount.h	29 Sep 2006 13:45:19 -0000	1.1.1.1
+++ fs/xfs/xfs_mount.h	12 Jun 2008 15:59:10 -0000	1.2
@@ -459,6 +459,7 @@
 						 * I/O size in stat() */
 #define XFS_MOUNT_NO_PERCPU_SB	(1ULL << 23)	/* don't use per-cpu
 						 * superblock counters */
+#define XFS_MOUNT_FULL		(1ULL << 24)	/*

>
> On Fri, 6 Oct 2006, David Chinner wrote:
>
>>> The backtrace looked like this:
>>>
>>> ... nfsd_write nfsd_vfs_write vfs_writev do_readv_writev
>>> xfs_file_writev
>>> xfs_write generic_file_buffered_write xfs_get_blocks __xfs_get_blocks
>>> xfs_bmap xfs_iomap xfs_iomap_write_delay xfs_flush_space
>>> xfs_flush_device
>>> schedule_timeout_uninterruptible.
>>
>> Ahhh, this gets hit on the ->prepare_write path
>> (xfs_iomap_write_delay()),
>
> Yes.
>
>> not the allocate path (xfs_iomap_write_allocate()). Sorry - I got myself
>> (and probably everyone else) confused there, which is why I suspected
>> sync writes - they trigger the allocate path in the write call. I don't
>> think 2.6.18 will change anything.
>>
>> FWIW, I don't think we can avoid this sleep when we first hit ENOSPC
>> conditions, but perhaps once we are certain of the ENOSPC status
>> we can tag the filesystem with this state (say an xfs_mount flag)
>> and only clear that tag when something is freed. We could then
>> use the tag to avoid continually trying extremely hard to allocate
>> space when we know there is none available....
>
> Yes! That's what I was trying to suggest :-). Thank you.
>
> Is that hard to do?
>

^ permalink raw reply	[flat|nested] 17+ messages in thread
end of thread, other threads:[~2008-06-12 16:53 UTC | newest]
Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-09-26 18:51 Long sleep with i_mutex in xfs_flush_device(), affects NFS service Stephane Doyon
2006-09-26 19:06 ` [NFS] " Trond Myklebust
2006-09-26 20:05 ` Stephane Doyon
2006-09-26 20:29 ` Trond Myklebust
2006-09-27 11:33 ` Shailendra Tripathi
2006-10-02 14:45 ` Stephane Doyon
2006-10-02 22:30 ` David Chinner
2006-10-03 13:39 ` several messages Stephane Doyon
2006-10-03 16:40 ` Trond Myklebust
2006-10-05 15:39 ` Stephane Doyon
2006-10-06 0:33 ` David Chinner
2006-10-06 13:25 ` Stephane Doyon
2006-10-05 8:30 ` David Chinner
2006-10-05 16:33 ` Stephane Doyon
2006-10-05 23:29 ` David Chinner
2006-10-06 13:03 ` Stephane Doyon
[not found] <9E397A467F4DB34884A1FD0D5D27CF43018903F96E@msxaoa4.twosigma.com>
2008-06-12 16:54 ` Benjamin L. Shi