Running out of reserved data blocks

public inbox for linux-xfs@vger.kernel.org

* Running out of reserved data blocks
@ 2008-09-26  5:31 Lachlan McIlroy
  2008-09-26  7:08 ` Dave Chinner
  0 siblings, 1 reply; 7+ messages in thread
From: Lachlan McIlroy @ 2008-09-26  5:31 UTC
  To: xfs-dev, xfs-oss

A while back I posted a patch to re-dirty pages on I/O error to handle errors from
xfs_trans_reserve() that was failing with ENOSPC when trying to convert delayed
allocations.  I'm now seeing xfs_trans_reserve() fail when converting unwritten
extents and in that case we silently ignore the error and leave the extent as
unwritten which effectively causes data corruption.  I can also get failures when
trying to unreserve disk space.

I've tried increasing the size of the reserved data blocks pool but that only
delays the inevitable.  Increasing the size to 65536 blocks seems to avoid failures
but that's getting to be a lot of disk space.

All of these ENOSPC errors should be transient and if we retried the operation - or
waited for the reserved pool to refill - we could proceed with the transaction.  I
was thinking about adding a retry loop in xfs_trans_reserve() so if XFS_TRANS_RESERVE
is set and we fail to get space we just keep trying.  It's not very elegant but saves
having to address the ENOSPC failure in many code paths.
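
A rough sketch of the retry idea, for illustration only. It is not the patch
under discussion: reserve_with_retry(), struct fake_pool and fake_reserve() are
made-up names standing in for xfs_trans_reserve() and the reserved pool, the
negative errno convention is an assumption, and the retry count is bounded here
so the sketch always terminates, whereas the idea above would keep trying.

```c
#include <errno.h>

/* Hypothetical stand-in for the reserved pool: fails a fixed number of times. */
struct fake_pool {
	int failures_left;
};

/* Hypothetical stand-in for a reservation attempt that can hit ENOSPC. */
static int fake_reserve(void *ctx)
{
	struct fake_pool *pool = ctx;

	if (pool->failures_left > 0) {
		pool->failures_left--;
		return -ENOSPC;
	}
	return 0;
}

/*
 * Retry a reservation while it fails with ENOSPC, up to max_tries attempts.
 * Any other result (success or a different error) is returned immediately.
 */
static int reserve_with_retry(int (*try_reserve)(void *), void *ctx,
			      int max_tries)
{
	int error = -ENOSPC;
	int i;

	for (i = 0; i < max_tries; i++) {
		error = try_reserve(ctx);
		if (error != -ENOSPC)
			break;
		/* A real implementation would sleep or wait for a wakeup here. */
	}
	return error;
}
```

With a pool that fails twice, reserve_with_retry(fake_reserve, &pool, 5)
succeeds on the third attempt. If the space never comes back, the caller still
sees -ENOSPC once max_tries attempts are used up.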

Does anyone have any other suggestions?

Lachlan


* Re: Running out of reserved data blocks
  2008-09-26  5:31 Running out of reserved data blocks Lachlan McIlroy
@ 2008-09-26  7:08 ` Dave Chinner
  2008-09-26  7:45   ` Lachlan McIlroy
  0 siblings, 1 reply; 7+ messages in thread
From: Dave Chinner @ 2008-09-26  7:08 UTC
  To: Lachlan McIlroy; +Cc: xfs-dev, xfs-oss

On Fri, Sep 26, 2008 at 03:31:23PM +1000, Lachlan McIlroy wrote:
> A while back I posted a patch to re-dirty pages on I/O error to handle errors from
> xfs_trans_reserve() that was failing with ENOSPC when trying to convert delayed
> allocations.  I'm now seeing xfs_trans_reserve() fail when converting unwritten
> extents and in that case we silently ignore the error and leave the extent as
> unwritten which effectively causes data corruption.  I can also get failures when
> trying to unreserve disk space.

Is this problem being seen in the real world, or just in artificial
test workloads?

What the reserve pool is supposed to do is provide sufficient blocks
to allow dirty data to be flushed, xattrs to be added, etc in the
immediate period after the ENOSPC occurs so that none of the
existing operations we've committed to fail.  The reserve pool is
not meant to be an endless source of space that allows the system to
continue operating permanently at ENOSPC.

If you start new operations like writing into unwritten extents once
you are already at ENOSPC, then you can consume the entire of the
reserve pool. There is nothing we can do to prevent that from
occurring, except by doing something like partially freezing the
filesystem (i.e. just the data write() level, not the transaction
level) until the ENOSPC condition goes away....

> I've tried increasing the size of the reserved data blocks pool
> but that only delays the inevitable.  Increasing the size to 65536
> blocks seems to avoid failures but that's getting to be a lot of
> disk space.

You're worried about reserving 20c worth of disk space and 10s of
time to change the config vs hours of engineering and test time
to come up with a different solution that may or may not be
as effective?

Reserving a bit of extra space is a cheap, cost effective solution
to the problem.

> All of these ENOSPC errors should be transient and if we retried
> the operation - or waited for the reserved pool to refill - we
> could proceed with the transaction.  I was thinking about adding a
> retry loop in xfs_trans_reserve() so if XFS_TRANS_RESERVE is set
> and we fail to get space we just keep trying.

ENOSPC is not a transient condition unless you do something to
free up space. If the system is unattended, then ENOSPC can
persist for a long time. This is effectively silently livelocking
the system until the ENOSPC is cleared. That will have an effect on
operations on other filesystems, too. e.g. pdflush gets stuck
in one of these loops...

Either increase the size of the reserved pool so your reserve pool
doesn't empty, or identify and prevent whatever I/O is depleting
the reserve pool at ENOSPC....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Running out of reserved data blocks
  2008-09-26  7:08 ` Dave Chinner
@ 2008-09-26  7:45   ` Lachlan McIlroy
  2008-09-26  8:48     ` Dave Chinner
  0 siblings, 1 reply; 7+ messages in thread
From: Lachlan McIlroy @ 2008-09-26  7:45 UTC
  To: Lachlan McIlroy, xfs-dev, xfs-oss

Dave Chinner wrote:
> On Fri, Sep 26, 2008 at 03:31:23PM +1000, Lachlan McIlroy wrote:
>> A while back I posted a patch to re-dirty pages on I/O error to handle errors from
>> xfs_trans_reserve() that was failing with ENOSPC when trying to convert delayed
>> allocations.  I'm now seeing xfs_trans_reserve() fail when converting unwritten
>> extents and in that case we silently ignore the error and leave the extent as
>> unwritten which effectively causes data corruption.  I can also get failures when
>> trying to unreserve disk space.
> 
> Is this problem being seen in the real world, or just in artificial
> test workloads?
Customer escalations.

> 
> What the reserve pool is supposed to do is provide sufficient blocks
> to allow dirty data to be flushed, xattrs to be added, etc in the
> immediate period after the ENOSPC occurs so that none of the
> existing operations we've committed to fail.  The reserve pool is
> not meant to be an endless source of space that allows the system to
> continue operating permanently at ENOSPC.
I'm well aware of what the reserve pool is there for.

> 
> If you start new operations like writing into unwritten extents once
> you are already at ENOSPC, then you can consume the entire of the
> reserve pool. There is nothing we can do to prevent that from
> occurring, except by doing something like partially freezing the
> filesystem (i.e. just the data write() level, not the transaction
> level) until the ENOSPC condition goes away....
Yes we could eat into the reserve pool with btree split/newroot
allocations.  Same with delayed allocations.  That's yet another
problem where we need to account for potential btree space before
creating delayed allocations or unwritten extents.

> 
>> I've tried increasing the size of the reserved data blocks pool
>> but that only delays the inevitable.  Increasing the size to 65536
>> blocks seems to avoid failures but that's getting to be a lot of
>> disk space.
> 
> You're worried about reserving 20c worth of disk space and 10s of
> time to change the config vs hours of engineering and test time
> to come up with a different solution that may or may not be
> as effective?
65536 x 16KB blocks is 1GB of disk space - people will notice that go
missing.
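
The arithmetic, as a quick sketch. reserve_pool_bytes() is a hypothetical
helper, not an XFS function, and 16KB is the block size assumed in the figure
above.

```c
#include <stdint.h>

/* Bytes set aside by a reserved pool of `blocks` filesystem blocks. */
static uint64_t reserve_pool_bytes(uint64_t blocks, uint64_t block_size)
{
	return blocks * block_size;
}
```

reserve_pool_bytes(65536, 16 * 1024) gives 1073741824 bytes (1GB); the same
block count at a 4KB block size would come to 256MB.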

> 
> Reserving a bit of extra space is a cheap, cost effective solution
> to the problem.
We already plan to bump the pool size to help reduce the likelihood of
this problem occurring but it's not a solution - it's a workaround.
It's only a matter of time before we hit this issue again.

> 
>> All of these ENOSPC errors should be transient and if we retried
>> the operation - or waited for the reserved pool to refill - we
>> could proceed with the transaction.  I was thinking about adding a
>> retry loop in xfs_trans_reserve() so if XFS_TRANS_RESERVE is set
>> and we fail to get space we just keep trying.
> 
> ENOSPC is not a transient condition unless you do something to
> free up space. If the system is unattended, then ENOSPC can
> persist for a long time. This is effectively silently livelocking
> the system until the ENOSPC is cleared. That will have an effect on
> operations on other filesystems, too. e.g. pdflush gets stuck
> in one of these loops...
The loop would be just like the code that waits for logspace to
become available (which we do right after allocating this space).
But you're right that we may have permanently consumed all of the
reserved space and then we cannot proceed - although I haven't
actually seen that case in testing.

> 
> Either increase the size of the reserved pool so your reserve pool
> doesn't empty, or identify and prevent whatever I/O is depleting
> the reserve pool at ENOSPC....
The reserve pool is not depleting permanently - there are many threads
each in a transaction and each has 4 blocks from the reserved pool.
After the transaction completes the 4 blocks are returned to the pool.
If another thread starts a transaction while the entire pool is tied
up then that transaction will fail - if it comes back later it could
work.
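
A toy model of that behaviour, as a sketch only. struct resv_pool, pool_take()
and pool_give() are invented for illustration and do not reflect how XFS
accounts for the reserved pool internally; the negative errno return values are
an assumption of the sketch.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical reserved pool: just a count of blocks currently available. */
struct resv_pool {
	uint64_t avail;
};

/* A transaction takes nblocks from the pool, or fails with -ENOSPC. */
static int pool_take(struct resv_pool *pool, uint64_t nblocks)
{
	if (pool->avail < nblocks)
		return -ENOSPC;
	pool->avail -= nblocks;
	return 0;
}

/* A completed transaction hands its blocks back to the pool. */
static void pool_give(struct resv_pool *pool, uint64_t nblocks)
{
	pool->avail += nblocks;
}
```

With an 8 block pool, two transactions taking 4 blocks each succeed and a third
gets -ENOSPC. Once either of the first two completes and calls pool_give(), the
third can take its 4 blocks, which is the transient failure described above.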


* Re: Running out of reserved data blocks
  2008-09-26  7:45   ` Lachlan McIlroy
@ 2008-09-26  8:48     ` Dave Chinner
  2008-09-29  2:44       ` Lachlan McIlroy
  0 siblings, 1 reply; 7+ messages in thread
From: Dave Chinner @ 2008-09-26  8:48 UTC
  To: Lachlan McIlroy; +Cc: xfs-dev, xfs-oss

On Fri, Sep 26, 2008 at 05:45:10PM +1000, Lachlan McIlroy wrote:
> Dave Chinner wrote:
>> On Fri, Sep 26, 2008 at 03:31:23PM +1000, Lachlan McIlroy wrote:
>>> A while back I posted a patch to re-dirty pages on I/O error to handle errors from
>>> xfs_trans_reserve() that was failing with ENOSPC when trying to convert delayed
>>> allocations.  I'm now seeing xfs_trans_reserve() fail when converting unwritten
>>> extents and in that case we silently ignore the error and leave the extent as
>>> unwritten which effectively causes data corruption.  I can also get failures when
>>> trying to unreserve disk space.
>>
>> Is this problem being seen in the real world, or just in artificial
>> test workloads?
> Customer escalations.

And the main cause is what? Direct I/O into unwritten extents?

>> If you start new operations like writing into unwritten extents once
>> you are already at ENOSPC, then you can consume the entire of the
>> reserve pool. There is nothing we can do to prevent that from
>> occurring, except by doing something like partially freezing the
>> filesystem (i.e. just the data write() level, not the transaction
>> level) until the ENOSPC condition goes away....
> Yes we could eat into the reserve pool with btree split/newroot
> allocations.  Same with delayed allocations. That's yet another
> problem where we need to account for potential btree space before
> creating delayed allocations or unwritten extents.

It's the same problem - allocation can cause consumption of
blocks in the BMBT tree. At ENOSPC, it's not the allocbt that
is being split or consuming blocks...

Metadata block allocation due to delayed data allocation is bound by
memory size and dirty page limits - once we get to ENOSPC, there
will be no more pages accepted for delayed allocation - the app will
get an ENOSPC up front. The reserved pool needs to be large enough
to handle all the allocations that this dirty data can trigger.
Easily solved by bumping the tunable for large memory systems.

FWIW, determining the number of blocks to reserve for delayed
allocation during delayed allocation is not worth the complexity.
You don't know how many extents the data will end up in, you don't
know what order the pages might get written in so you could have
worst case sparse page writeout before the holes are filled (i.e.
have tree growth and then have it shrink), etc. Even reserving
enough blocks for a full btree split per dirtied inode is not
sufficient, as allocation may trigger multiple full tree splits.
Basically the reservations will get so large that they will cause
applications to get premature ENOSPC errors when the writes could
have succeeded without problems.

That's why this problem has not been solved in the past - it's too
damn complex to enumerate correctly, and in almost all cases the
default sized reserved pool is sufficient to prevent data loss
problems.

For the unwritten extent conversion case, though, we need to
prevent new writes (after ENOSPC occurs) from draining the
reserved pool. That means we either have to return an ENOSPC
to the application, or we freeze the writes into preallocated
space when we are at ENOSPC and the reserve pool is getting
depleted. This needs to be done up-front, not in the I/O completion
where it is too late to handle the fact that the reserve pool is
too depleted to do the conversion.....

That seems simple enough to do without having to add any code
to the back end I/O path or the transaction subsystem....

>>> I've tried increasing the size of the reserved data blocks pool
>>> but that only delays the inevitable.  Increasing the size to 65536
>>> blocks seems to avoid failures but that's getting to be a lot of
>>> disk space.
>>
>> You're worried about reserving 20c worth of disk space and 10s of
>> time to change the config vs hours of engineering and test time
>> to come up with a different solution that may or may not be
>> as effective?
> 65536 x 16KB blocks is 1GB of disk space - people will notice that go
> missing.

Still small change for typical SGI customers.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Running out of reserved data blocks
  2008-09-26  8:48     ` Dave Chinner
@ 2008-09-29  2:44       ` Lachlan McIlroy
  2008-09-29  6:51         ` Dave Chinner
  0 siblings, 1 reply; 7+ messages in thread
From: Lachlan McIlroy @ 2008-09-29  2:44 UTC
  To: Lachlan McIlroy, xfs-dev, xfs-oss

Dave Chinner wrote:
> On Fri, Sep 26, 2008 at 05:45:10PM +1000, Lachlan McIlroy wrote:
>> Dave Chinner wrote:
>>> On Fri, Sep 26, 2008 at 03:31:23PM +1000, Lachlan McIlroy wrote:
>>>> A while back I posted a patch to re-dirty pages on I/O error to handle errors from
>>>> xfs_trans_reserve() that was failing with ENOSPC when trying to convert delayed
>>>> allocations.  I'm now seeing xfs_trans_reserve() fail when converting unwritten
>>>> extents and in that case we silently ignore the error and leave the extent as
>>>> unwritten which effectively causes data corruption.  I can also get failures when
>>>> trying to unreserve disk space.
>>> Is this problem being seen in the real world, or just in artificial
>>> test workloads?
>> Customer escalations.
> 
> And the main cause is what? Direct I/O into unwritten extents?
The system is so busy that it's overwhelming the bandwidth of the log and
many threads have taken a slice of the reserved pool and are waiting for
log space.  Extent conversions start failing and we invalidate the page
but fail to remove the delayed allocation.  Then a direct I/O read runs
into a delayed allocation where it does not expect one and hits a BUG_ON().

> 
>>> If you start new operations like writing into unwritten extents once
>>> you are already at ENOSPC, then you can consume the entire of the
>>> reserve pool. There is nothing we can do to prevent that from
>>> occurring, except by doing something like partially freezing the
>>> filesystem (i.e. just the data write() level, not the transaction
>>> level) until the ENOSPC condition goes away....
>> Yes we could eat into the reserve pool with btree split/newroot
>> allocations.  Same with delayed allocations. That's yet another
>> problem where we need to account for potential btree space before
>> creating delayed allocations or unwritten extents.
> 
> It's the same problem - allocation can cause consumption of
> blocks in the BMBT tree. At ENOSPC, it's not the allocbt that
> is being split or consuming blocks...
> 
> Metadata block allocation due to delayed data allocation is bound by
> memory size and dirty page limits - once we get to ENOSPC, there
> will be no more pages accepted for delayed allocation - the app will
> get an ENOSPC up front. The reserved pool needs to be large enough
> to handle all the allocations that this dirty data can trigger.
> Easily solved by bumping the tunable for large memory systems.
Yep, but how high do we bump it?  I've run some pathological workloads
that can deplete a reserved pool of 16384 blocks.  While our customers
may never run this workload their systems are much bigger than what I
have at my disposal.  When the reserved pool depletes I would rather
have the system degrade performance than cause data corruption or panic.
In other words if we can find a solution to safely handle a depleted
reserved pool (even if it means taking a performance hit) rather than
hope it never happens I would like to hear about it.

> 
> FWIW, determining the number of blocks to reserve for delayed
> allocation during delayed allocation is not worth the complexity.
> You don't know how many extents the data will end up in, you don't
> know what order the pages might get written in so you could have
> worst case sparse page writeout before the holes are filled (i.e.
> have tree growth and then have it shrink), etc. Even reserving
> enough blocks for a full btree split per dirtied inode is not
> sufficient, as allocation may trigger multiple full tree splits.
> Basically the reservations will get so large that they will cause
> applications to get premature ENOSPC errors when the writes could
> have succeeded without problems.
I totally agree.  If only there was some way to know what was going
to happen to the btree during writeout we could account for the space.
We could disable delayed allocations when the filesystem is near ENOSPC
and do immediate allocations like in direct I/O.

> 
> That's why this problem has not been solved in the past - it's too
> damn complex to enumerate correctly, and in almost all cases the
> default sized reserved pool is sufficient to prevent data loss
> problems.
Yeah - in almost all cases...

> 
> For the unwritten extent conversion case, though, we need to
> prevent new writes (after ENOSPC occurs) from draining the
> reserved pool. That means we either have to return an ENOSPC
> to the application, or we freeze the writes into preallocated
> space when we are at ENOSPC and the reserve pool is getting
> depleted. This needs to be done up-front, not in the I/O completion
> where it is too late to handle the fact that the reserve pool is
> too depleted to do the conversion.....
> 
> That seems simple enough to do without having to add any code
> to the back end I/O path or the transaction subsystem....
Sounds reasonable.  We could report ENOSPC for an extent conversion that
could work - ie if a split is not needed and users might get a little
confused with ENOSPC if they know they preallocated the space.  But it's
better than data corruption.

What if we were to do the unwritten extent conversion up front?  Could
we delay the transaction commit until after the I/O or will that mean
holding an ilock across the I/O?

> 
>>>> I've tried increasing the size of the reserved data blocks pool
>>>> but that only delays the inevitable.  Increasing the size to 65536
>>>> blocks seems to avoid failures but that's getting to be a lot of
>>>> disk space.
>>> You're worried about reserving 20c worth of disk space and 10s of
>>> time to change the config vs hours of engineering and test time
>>> to come up with a different solution that may or may not be
>>> as effective?
>> 65536 x 16KB blocks is 1GB of disk space - people will notice that go
>> missing.
> 
> Still small change for typical SGI customers.
> 
> Cheers,
> 
> Dave.


* Re: Running out of reserved data blocks
  2008-09-29  2:44       ` Lachlan McIlroy
@ 2008-09-29  6:51         ` Dave Chinner
  2008-09-29  8:40           ` Lachlan McIlroy
  0 siblings, 1 reply; 7+ messages in thread
From: Dave Chinner @ 2008-09-29  6:51 UTC
  To: Lachlan McIlroy; +Cc: xfs-dev, xfs-oss

On Mon, Sep 29, 2008 at 12:44:10PM +1000, Lachlan McIlroy wrote:
> Dave Chinner wrote:
>> On Fri, Sep 26, 2008 at 05:45:10PM +1000, Lachlan McIlroy wrote:
>>> Dave Chinner wrote:
>>>> On Fri, Sep 26, 2008 at 03:31:23PM +1000, Lachlan McIlroy
>>>> wrote:
>>>>> A while back I posted a patch to re-dirty pages on I/O error
>>>>> to handle errors from xfs_trans_reserve() that was failing
>>>>> with ENOSPC when trying to convert delayed allocations.  I'm
>>>>> now seeing xfs_trans_reserve() fail when converting unwritten
>>>>> extents and in that case we silently ignore the error and
>>>>> leave the extent as unwritten which effectively causes data
>>>>> corruption.  I can also get failures when trying to unreserve
>>>>> disk space.
>>>> Is this problem being seen in the real world, or just in
>>>> artificial test workloads?
>>> Customer escalations.
>>
>> And the main cause is what? Direct I/O into unwritten extents?
> The system is so busy that it's overwhelming the bandwidth of the
> log and many threads have taken a slice of the reserved pool and
> are waiting for log space. 

By that I assume you mean that there are lots of threads waiting
in xlog_state_get_iclog_space()?

> Extent conversions start failing and we invalidate the page but
> fail to remove the delayed allocation.

IIUC, you are trying to say that delayed allocation is failing with
ENOSPC in xfs_iomap_write_allocate(), and everything goes downhill
from there?

Perhaps we should propagate the "BMAPI_TRYLOCK" flag into
xfs_iomap_write_allocate() and convert ENOSPC errors from
xfs_trans_reserve() into EAGAIN for non-blocking writeback. That
means any sort of synchronous write will propagate an error, but
async writeback (like pdflush) will simply treat the condition the
same as inode lock contention.
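
The error mapping could look roughly like this. It is a sketch under stated
assumptions, not the actual change: map_reserve_error() is a hypothetical
helper, the `nonblocking` argument stands in for whatever the BMAPI_TRYLOCK
flag would indicate, and negative errno values are used purely for
illustration.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical helper: turn an ENOSPC from the transaction reservation into
 * EAGAIN when the caller is non-blocking (async) writeback, so it is handled
 * like lock contention and retried later. Synchronous callers, and all other
 * errors, see the original value.
 */
static int map_reserve_error(int error, bool nonblocking)
{
	if (error == -ENOSPC && nonblocking)
		return -EAGAIN;
	return error;
}
```

So map_reserve_error(-ENOSPC, true) yields -EAGAIN for async writeback, while
map_reserve_error(-ENOSPC, false) still reports -ENOSPC to a synchronous
caller.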

Hence issuing something like a fsync() or sync(1) will cause ENOSPC 
errors to be triggered on delalloc in this situation, but async
writeback won't. In the case of a direct I/O read, it should get an
ENOSPC error reported back instead of.....

> Then a direct I/O read
> runs into a delayed allocation where it does not expect one and
> hits a BUG_ON().

.... doing that.

>>>> If you start new operations like writing into unwritten extents once
>>>> you are already at ENOSPC, then you can consume the entire of the
>>>> reserve pool. There is nothing we can do to prevent that from
>>>> occurring, except by doing something like partially freezing the
>>>> filesystem (i.e. just the data write() level, not the transaction
>>>> level) until the ENOSPC condition goes away....
>>> Yes we could eat into the reserve pool with btree split/newroot
>>> allocations.  Same with delayed allocations. That's yet another
>>> problem where we need to account for potential btree space before
>>> creating delayed allocations or unwritten extents.
>>
>> It's the same problem - allocation can cause consumption of
>> blocks in the BMBT tree. At ENOSPC, it's not the allocbt that
>> is being split or consuming blocks...
>>
>> Metadata block allocation due to delayed data allocation is bound by
>> memory size and dirty page limits - once we get to ENOSPC, there
>> will be no more pages accepted for delayed allocation - the app will
>> get an ENOSPC up front. The reserved pool needs to be large enough
>> to handle all the allocations that this dirty data can trigger.
>> Easily solved by bumping the tunable for large memory systems.
> Yep, but how high do we bump it?

Not sure. It sounds like the problem is the number of transactions
that can be in flight at once, each taking their 4-8 blocks of
reservation out of the pool, and then blocking for some period of
time waiting for iclog space to be able to commit the transaction.

Given that the most I've previously seen is ~1500 transactions
blocked waiting for iclog space, I'd say that gives a rough
indication of how deep the reservation pool could be bumped to....

> I've run some pathological workloads
> that can deplete a reserved pool of 16384 blocks.  While our customers
> may never run this workload their systems are much bigger than what I
> have at my disposal.  When the reserved pool depletes I would rather
> have the system degrade performance than cause data corruption or panic.

Right, so would I. The problem is how to do it without introducing
excessive complexity.

> In other words if we can find a solution to safely handle a depleted
> reserved pool (even if it means taking a performance hit) rather than
> hope it never happens I would like to hear about it.

I think what I mentioned above might prevent the common case of the
problem you are seeing. It doesn't fix the "depletion by a thousand
unwritten extent conversions" problem, but it should prevent the
silent trashing of delalloc data due to temporary reserve pool
depletion....

>> FWIW, determining the number of blocks to reserve for delayed
>> allocation during delayed allocation is not worth the complexity.
>> You don't know how many extents the data will end up in, you don't
>> know what order the pages might get written in so you could have
>> worst case sparse page writeout before the holes are filled (i.e.
>> have tree growth and then have it shrink), etc. Even reserving
>> enough blocks for a full btree split per dirtied inode is not
>> sufficient, as allocation may trigger multiple full tree splits.
>> Basically the reservations will get so large that they will cause
>> applications to get premature ENOSPC errors when the writes could
>> have succeeded without problems.
> I totally agree.  If only there was some way to know what was going
> to happen to the btree during writeout we could account for the space.
> We could disable delayed allocations when the filesystem is near ENOSPC
> and do immediate allocations like in direct I/O.

Define "near ENOSPC" ;)

[ Well, I already have once. ;) The incore per-cpu superblock
counters fall back to updating the global superblock when the
filesystem approaches ENOSPC (per-cpu threshold, so scales with the
size of machine), but you'd effectively need to flush all the
delalloc data at this point as well if you were to switch off
delalloc.... ]

I guess just switching to sync writes when we near ENOSPC would
do what you are suggesting...

>> For the unwritten extent conversion case, though, we need to
>> prevent new writes (after ENOSPC occurs) from draining the
>> reserved pool. That means we either have to return an ENOSPC
>> to the application, or we freeze the writes into preallocated
>> space when we are at ENOSPC and the reserve pool is getting
>> depleted. This needs to be done up-front, not in the I/O completion
>> where it is too late to handle the fact that the reserve pool is
>> too depleted to do the conversion.....
>>
>> That seems simple enough to do without having to add any code
>> to the back end I/O path or the transaction subsystem....
> Sounds reasonable.  We could report ENOSPC for an extent conversion that
> could work - ie if a split is not needed and users might get a little
> confused with ENOSPC if they know they preallocated the space.  But it's
> better than data corruption.

Certainly is ;)

> What if we were to do the unwritten extent conversion up front?

Crash after conversion but before the data I/O is issued then
results in stale data exposure. Not a good failure mode.

> Could
> we delay the transaction commit until after the I/O or will that mean
> holding an ilock across the I/O?

Right - that's not allowed. To do something like this, it would need
to be a two-phase transaction commit. That is, we do all the work
up front before the I/O, then commit that transaction as "pending".
Then on I/O completion we commit a subsequent "I/O done" transaction
that is paired with the conversion/allocation. Then in recovery, we
only do the conversion if we see the I/O done transaction as well.

Realistically, we should do this for all allocations to close the
allocate-crash-expose-stale-data hole that exists. The model for
this is the Extent Free Intent/Extent Free Done (EFI/EFD)
transaction pair and their linked log items used when freeing
extents.
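
A minimal sketch of the recovery rule for such a pending/done pair. The record
types, struct log_rec and conversion_is_recoverable() are hypothetical and do
not correspond to real XFS log item formats; they only model the pairing
described above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical log record types for the two-phase scheme. */
enum rec_type {
	REC_CONV_PENDING,	/* conversion/allocation committed as "pending" */
	REC_IO_DONE		/* data I/O for that conversion has completed */
};

struct log_rec {
	enum rec_type	type;
	uint64_t	id;	/* ties a done record to its pending record */
};

/*
 * Recovery replays conversion `id` only if the log holds its pending record
 * followed by a matching I/O done record. A pending record on its own means
 * the crash happened before the data reached disk, so it is not replayed.
 */
static bool conversion_is_recoverable(const struct log_rec *recs, size_t nrecs,
				      uint64_t id)
{
	bool pending = false;
	size_t i;

	for (i = 0; i < nrecs; i++) {
		if (recs[i].id != id)
			continue;
		if (recs[i].type == REC_CONV_PENDING)
			pending = true;
		else if (recs[i].type == REC_IO_DONE && pending)
			return true;
	}
	return false;
}
```

For a log containing pending(1), pending(2), done(1), only conversion 1 would
be replayed; conversion 2 is skipped, so its blocks are never exposed with
stale contents.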

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Running out of reserved data blocks
  2008-09-29  6:51         ` Dave Chinner
@ 2008-09-29  8:40           ` Lachlan McIlroy
  0 siblings, 0 replies; 7+ messages in thread
From: Lachlan McIlroy @ 2008-09-29  8:40 UTC
  To: Lachlan McIlroy, xfs-dev, xfs-oss

Dave Chinner wrote:
> On Mon, Sep 29, 2008 at 12:44:10PM +1000, Lachlan McIlroy wrote:
>> Dave Chinner wrote:
>>> On Fri, Sep 26, 2008 at 05:45:10PM +1000, Lachlan McIlroy wrote:
>>>> Dave Chinner wrote:
>>>>> On Fri, Sep 26, 2008 at 03:31:23PM +1000, Lachlan McIlroy
>>>>> wrote:
>>>>>> A while back I posted a patch to re-dirty pages on I/O error
>>>>>> to handle errors from xfs_trans_reserve() that was failing
>>>>>> with ENOSPC when trying to convert delayed allocations.  I'm
>>>>>> now seeing xfs_trans_reserve() fail when converting unwritten
>>>>>> extents and in that case we silently ignore the error and
>>>>>> leave the extent as unwritten which effectively causes data
>>>>>> corruption.  I can also get failures when trying to unreserve
>>>>>> disk space.
>>>>> Is this problem being seen in the real world, or just in
>>>>> artificial test workloads?
>>>> Customer escalations.
>>> And the main cause is what? Direct I/O into unwritten extents?
>> The system is so busy that it's overwhelming the bandwidth of the
>> log and many threads have taken a slice of the reserved pool and
>> are waiting for log space. 
> 
> By that I assume you mean that there are lots of threads waiting
> in xlog_state_get_iclog_space()?
No, I think they were in xlog_grant_log_space().

> 
>> Extent conversions start failing and we invalidate the page but
>> fail to remove the delayed allocation.
> 
> IIUC, you are trying to say that delayed allocation is failing with
> ENOSPC in xfs_iomap_write_allocate(), and everything goes downhill
> from there?
Yes.  Exactly.

> 
> Perhaps we should propagate the "BMAPI_TRYLOCK" flag into
> xfs_iomap_write_allocate() and convert ENOSPC errors from
> xfs_trans_reserve() into EAGAIN for non-blocking writeback. That
> means any sort of synchronous write will propagate an error, but
> async writeback (like pdflush) will simply treat the condition the
> same as inode lock contention.
That sounds like a small change on top of the patch I sent out
earlier.  I'll add it in and re-post the patch.

> 
> Hence issuing something like a fsync() or sync(1) will cause ENOSPC 
> errors to be triggered on delalloc in this situation, but async
> writeback won't. In the case of a direct I/O read, it should get an
> ENOSPC error reported back instead of.....
> 
>> Then a direct I/O read
>> runs into a delayed allocation where it does not expect one and
>> hits a BUG_ON().
> 
> .... doing that.
> 
>>>>> If you start new operations like writing into unwritten extents once
>>>>> you are already at ENOSPC, then you can consume the entire of the
>>>>> reserve pool. There is nothing we can do to prevent that from
>>>>> occurring, except by doing something like partially freezing the
>>>>> filesystem (i.e. just the data write() level, not the transaction
>>>>> level) until the ENOSPC condition goes away....
>>>> Yes we could eat into the reserve pool with btree split/newroot
>>>> allocations.  Same with delayed allocations. That's yet another
>>>> problem where we need to account for potential btree space before
>>>> creating delayed allocations or unwritten extents.
>>> It's the same problem - allocation can cause consumption of
>>> blocks in the BMBT tree. At ENOSPC, it's not the allocbt that
>>> is being split or consuming blocks...
>>>
>>> Metadata block allocation due to delayed data allocation is bound by
>>> memory size and dirty page limits - once we get to ENOSPC, there
>>> will be no more pages accepted for delayed allocation - the app will
>>> get an ENOSPC up front. The reserved pool needs to be large enough
>>> to handle all the allocations that this dirty data can trigger.
>>> Easily solved by bumping the tunable for large memory systems.
>> Yep, but how high do we bump it?
> 
> Not sure. It sounds like the problem is the number of transactions
> that can be in flight at once, each taking their 4-8 blocks of
> reservation out of the pool, and then blocking for some period of
> time waiting for iclog space to be able to commit the transaction.
> 
> Given that the most I've previously seen is ~1500 transactions
> blocked waiting for iclog space, I'd say that gives a rough
> indication of how deep the reservation pool could be bumped to....
Hmmm, okay.  1500 * 8 = 12,000 blocks.  Might as well round that up
to 16K.  I'll post a patch to increase the default pool size.
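
The sizing estimate above can be written out as a small sketch. The
function names are hypothetical. The inputs (about 1500 blocked
transactions, up to 8 blocks each) are the figures quoted in this thread,
and rounding up to a power of two is just the "round that up to 16K"
step.

```c
#include <stdint.h>

/* Round v up to the next power of two; returns 1 for v <= 1. */
uint64_t sketch_roundup_pow2(uint64_t v)
{
	uint64_t p = 1;

	while (p < v)
		p <<= 1;
	return p;
}

/*
 * Estimate a reserve pool size: worst-case number of transactions
 * blocked waiting for iclog space, times the blocks each one holds,
 * rounded up to a power of two.
 */
uint64_t sketch_pool_blocks(uint64_t max_blocked_trans,
			    uint64_t blocks_per_trans)
{
	return sketch_roundup_pow2(max_blocked_trans * blocks_per_trans);
}
```

Calling sketch_pool_blocks(1500, 8) gives 12000 rounded up to 16384
blocks, i.e. 16K.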

> 
>> I've run some pathological workloads
>> that can deplete a reserved pool of 16384 blocks.  While our customers
>> may never run this workload their systems are much bigger than what I
>> have at my disposal.  When the reserved pool depletes I would rather
>> have the system degrade performance than cause data corruption or panic.
> 
> Right, so would I. The problem is how to do it without introducing
> excessive complexity.
> 
>> In other words if we can find a solution to safely handle a depleted
>> reserved pool (even if it means taking a performance hit) rather than
>> hope it never happens I would like to hear about it.
> 
> I think what I mentioned above might prevent the common case of the
> problem you are seeing. It doesn't fix the "depletion by a thousand
> unwritten extent conversion" problem, but it should prevent the
> silent trashing of delalloc data due to temporary reserve pool
> depletion....
> 
>>> FWIW, determining the number of blocks to reserve for delayed
>>> allocation during delayed allocation is not worth the complexity.
>>> You don't know how many extents the data will end up in, you don't
>>> know what order the pages might get written in so you could have
>>> worst case sparse page writeout before the holes are filled (i.e.
>>> have tree growth and then have it shrink), etc. Even reserving
>>> enough blocks for a full btree split per dirtied inode is not
>>> sufficient, as allocation may trigger multiple full tree splits.
>>> Basically the reservations will get so large that they will cause
>>> applications to get premature ENOSPC errors when the writes could
>>> have succeeded without problems.
>> I totally agree.  If only there was some way to know what was going
>> to happen to the btree during writeout we could account for the space.
>> We could disable delayed allocations when the filesystem is near ENOSPC
>> and do immediate allocations like in direct I/O.
> 
> Define "near ENOSPC" ;)
I thought you might ask that.  I don't have an answer.

> 
> [ Well, I already have once. ;) The incore per-cpu superblock
> counters fall back to updating the global superblock when the
> filesystem approaches ENOSPC (per-cpu threshold, so scales with the
> size of machine), but you'd effectively need to flush all the
> delalloc data at this point as well if you were to switch off
> delalloc.... ]
> 
> I guess just switching to sync writes when we near ENOSPC would
> do what you are suggesting...
Yeah that might work too since if the extent conversions fail we
can return an error from the write().

> 
>>> For the unwritten extent conversion case, though, we need to
>>> prevent new writes (after ENOSPC occurs) from draining the
>>> reserved pool. That means we either have to return an ENOSPC
>>> to the application, or we freeze the writes into preallocated
>>> space when we are at ENOSPC and the reserve pool is getting
>>> depleted. This needs to be done up-front, not in the I/O completion
>>> where it is too late to handle the fact that the reserve pool
>>> is too depleted to do the conversion.....
>>>
>>> That seems simple enough to do without having to add any code
>>> to the back end I/O path or the transaction subsystem....
>> Sounds reasonable.  We could end up reporting ENOSPC for an extent
>> conversion that would have worked (i.e. if a split is not needed), and
>> users might get a little confused by ENOSPC if they know they
>> preallocated the space.  But it's better than data corruption.
> 
> Certainly is ;)
> 
>> What if we were to do the unwritten extent conversion up front?
> 
> Crash after conversion but before the data I/O is issued then
> results in stale data exposure. Not a good failure mode.
Didn't we have a plan to fix this for the delalloc extent conversion
case?

> 
>> Could
>> we delay the transaction commit until after the I/O or will that mean
>> holding an ilock across the I/O?
> 
> Right - that's not allowed. To do something like this, it would need
> to be a two-phase transaction commit. That is, we do all the work
> up front before the I/O, then commit that transaction as "pending".
> Then on I/O completion we commit a subsequent "I/O done" transaction
> that is paired with the conversion/allocation. Then in recovery, we
> only do the conversion if we see the I/O done transaction as well.
> 
> Realistically, we should do this for all allocations to close the
> allocate-crash-expose-stale-data hole that exists. The model for
> this is the Extent Free Intent/Extent Free Done (EFI/EFD)
> transaction pair and their linked log items used when freeing
> extents.
Sounds like a plan.  Would this be an easy change to make?
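
A rough sketch of the recovery rule described above, under stated
assumptions: the record types, the struct and the function names are all
hypothetical, and a flat array stands in for the log. It models only the
pairing rule (replay a pending conversion only if its "I/O done" record
is also in the log), not real XFS log items.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record types for the two halves of the pair. */
enum sketch_rec_type { SKETCH_PENDING, SKETCH_IO_DONE };

struct sketch_log_rec {
	enum sketch_rec_type type;
	uint64_t id;	/* links a pending record to its done record */
};

/* Return true if the log holds an "I/O done" record with this id. */
static bool sketch_has_done(const struct sketch_log_rec *log, size_t n,
			    uint64_t id)
{
	for (size_t i = 0; i < n; i++)
		if (log[i].type == SKETCH_IO_DONE && log[i].id == id)
			return true;
	return false;
}

/*
 * Count the pending conversions that recovery would replay.  A pending
 * record with no matching done record is skipped, so a crash before the
 * data I/O completes never exposes the converted extent.
 */
size_t sketch_count_replayable(const struct sketch_log_rec *log, size_t n)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++)
		if (log[i].type == SKETCH_PENDING &&
		    sketch_has_done(log, n, log[i].id))
			count++;
	return count;
}
```

For a log holding pending records 1 and 2 but a done record only for 1,
just one conversion is replayed.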


end of thread, other threads:[~2008-09-29  8:30 UTC | newest]

Thread overview: 7+ messages
2008-09-26  5:31 Running out of reserved data blocks Lachlan McIlroy
2008-09-26  7:08 ` Dave Chinner
2008-09-26  7:45   ` Lachlan McIlroy
2008-09-26  8:48     ` Dave Chinner
2008-09-29  2:44       ` Lachlan McIlroy
2008-09-29  6:51         ` Dave Chinner
2008-09-29  8:40           ` Lachlan McIlroy
