Re: Running out of reserved data blocks

public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed

From: Lachlan McIlroy <lachlan@sgi.com>
To: Lachlan McIlroy <lachlan@sgi.com>, xfs-dev <xfs-dev@sgi.com>,
	xfs-oss <xfs@oss.sgi.com>
Subject: Re: Running out of reserved data blocks
Date: Mon, 29 Sep 2008 18:40:52 +1000	[thread overview]
Message-ID: <48E09494.6070709@sgi.com> (raw)
In-Reply-To: <20080929065152.GB16064@disturbed>

Dave Chinner wrote:
> On Mon, Sep 29, 2008 at 12:44:10PM +1000, Lachlan McIlroy wrote:
>> Dave Chinner wrote:
>>> On Fri, Sep 26, 2008 at 05:45:10PM +1000, Lachlan McIlroy wrote:
>>>> Dave Chinner wrote:
>>>>> On Fri, Sep 26, 2008 at 03:31:23PM +1000, Lachlan McIlroy
>>>>> wrote:
>>>>>> A while back I posted a patch to re-dirty pages on I/O error
>>>>>> to handle errors from xfs_trans_reserve() that was failing
>>>>>> with ENOSPC when trying to convert delayed allocations.  I'm
>>>>>> now seeing xfs_trans_reserve() fail when converting unwritten
>>>>>> extents and in that case we silently ignore the error and
>>>>>> leave the extent as unwritten which effectively causes data
>>>>>> corruption.  I can also get failures when trying to unreserve
>>>>>> disk space.
>>>>> Is this problem being seen in the real world, or just in
>>>>> artificial test workloads?
>>>> Customer escalations.
>>> And the main cause is what? Direct I/O into unwritten extents?
>> The system is so busy that it's overwhelming the bandwidth of the
>> log and many threads have taken a slice of the reserved pool and
>> are waiting for log space. 
> 
> By that I assume you mean that there are lots of threads waiting
> in xlog_state_get_iclog_space()?
No, I think they were in xlog_grant_log_space().

> 
>> Extent conversions start failing and we invalidate the page but
>> fail to remove the delayed allocation.
> 
> IIUC, you are trying to say that delayed allocation is failing with
> ENOSPC in xfs_iomap_write_allocate(), and everything goes downhill
> from there?
Yes.  Exactly.

> 
> Perhaps we shuld propagate the "BMAPI_TRYLOCK" flag into
> xfs_iomap_write_allocate() and convert ENOSPC errors from
> xfs_trans_reserve() into EAGAIN for non-blocking writeback. That
> means any sort of synchronous write will propagate an error, but
> async writeback (like pdflush) will simply treat the condition the
> same as inode lock contention.
That sounds like a small change on top of the patch I sent out
earlier.  I'll add it in and re-post the patch.

> 
> Hence issuing something like a fsync() or sync(1) will cause ENOSPC 
> errors to be triggered on delalloc in this situation, but async
> writeback won't. In the case of a direct I/O read, it should get an
> ENOSPC error reported back instead of.....
> 
>> Then a direct I/O read
>> runs into a delayed allocation where it does not expect one and
>> hits a BUG_ON().
> 
> .... doing that.
> 
>>>>> If you start new operations like writing into unwritten extents once
>>>>> you are already at ENOSPC, then you can consume the entire of the
>>>>> reserve pool. There is nothing we can do to prevent that from
>>>>> occurring, except by doing something like partially freezing the
>>>>> filesystem (i.e. just the data write() level, not the transaction
>>>>> level) until the ENOSPC condition goes away....
>>>> Yes we could eat into the reserve pool with btree split/newroot
>>>> allocations.  Same with delayed allocations. That's yet another
>>>> problem where we need to account for potential btree space before
>>>> creating delayed allocations or unwritten extents.
>>> It's the same problem - allocation can cause consumption of
>>> blocks in the BMBT tree. At ENOSPC, it's not the allocbt that
>>> is being split or consuming blocks...
>>>
>>> Metadata block allocation due to delayed data allocation is bound by
>>> memory size and dirty page limits - once we get to ENOSPC, there
>>> will be no more pages accepted for delayed allocation - the app will
>>> get an ENOSPC up front. The reserved pool needs to be larger enough
>>> to handle all the allocations that this dirty data can trigger.
>>> Easily solved by bumping the tunable for large mmory systems.
>> Yep, but how high do we bump it?
> 
> Not sure. It sounds like the problem is the number of transactions
> that can be in flight at once, each taking their 4-8 blocks of
> reservation out of the pool, and then blocking for some period of
> time waiting for iclog space to be able to commit the transaction.
> 
> Given that the most I've previously seen is ~1500 transactions
> blocked waiting for iclog space, I'd say that gives a rough
> indication of how deep the reservation pool could be bumped to....
Hmmm, okay.  1500 * 8 = 12,000 blocks.  Might as well round that up
to 16K.  I'll post a patch to increase the default pool size.

> 
>> I've run some pathological workloads
>> that can deplete a reserved pool of 16384 blocks.  While our customers
>> may never run this workload their systems are much bigger than what I
>> have at my disposal.  When the reserved pool depletes I would rather
>> have the system degrade performance than cause data corruption or panic.
> 
> Right, so would I. The problem is how to do it without introducing
> excessive complexity.
> 
>> In other words if we can find a solution to safely handle a depleted
>> reserved pool (even if it means taking a performance hit) rather than
>> hope it never happens I would like to hear about it.
> 
> I think what I mentioned above might prevent the common case of the
> problem you are seeing. It doesn't fix the "depletion by a thousand
> unwritten extent conversion" problem, but it should prevent the
> silent trashing of delalloc data due to temporary reserve pool
> depletion....
> 
>>> FWIW, determining the number of blocks to reserve for delayed
>>> allocation during delayed allocation is not worth the complexity.
>>> You don't know how many extents the data will end up in, you don't
>>> know what order the pages might get written in so you could have
>>> worst case sparse page writeout before the holes are filled (i.e.
>>> have tree growth and then have it shrink), etc. Even reserving
>>> enough blocks for a full btree split per dirtied inode is not
>>> sufficient, as allocation may trigger multiple full tree splits.
>>> Basically the reservations will get so large that they will cause
>>> applications to get premature ENOSPC errors when the writes could
>>> have succeeded without problems.
>> I totally agree.  If only there was some way to know what was going
>> to happen to the btree during writeout we could account for the space.
>> We could disable delayed allocations when the filesystem is near ENOSPC
>> and do immediate allocations like in direct I/O.
> 
> Define "near ENOSPC" ;)
I thought you might ask that.  I don't have an answer.

> 
> [ Well, I already have once. ;) The incore per-cpu superblock
> counters fall back to updating the global superblock when the
> filesystem approaches ENOSPC (per-cpu threshold, so scales with the
> size of machine), but you'd effectively need to flush all the
> delalloc data at this point as well if you were to switch of
> delalloc.... ]
> 
> I guess just switching to sync writes when we near ENOSPC would
> do what you are suggesting...
Yeah that might work too since if the extent conversions fail we
can return an error from the write().

> 
>>> For the unwritten extent conversion case, though, we need to
>>> prevent new writes (after ENOSPC occurs) from draining the
>>> reserved pool. That means we either have to return an ENOSPC
>>> to the application, or we freeze the writes into preallocated
>>> space when we are at ENOSPC and the reserve pool is getting
>>> depleted. This needs to be done up-front, not in the I/O completion
>>> where it is too late to handle the fact that the reserve pool
>>> too depleted to do the conversion.....
>>>
>>> That seems simple enough to do without having to add any code
>>> to the back end I/O path or the transaction subsystem....
>> Sounds reasonable.  We could report ENOSPC for an extent conversion that
>> could work - ie if a split is not needed and users might get a little
>> confused with ENOSPC if they know they preallocated the space.  But it's
>> better than data corruption.
> 
> Certainly is ;)
> 
>> What if we were to do the unwritten extent conversion up front?
> 
> Crash after conversion but before the data I/O is issued then
> results in stale data exposure. Not a good failure mode.
Didn't we have a plan to fix this for the delalloc extent conversion
case?

> 
>> Could
>> we delay the transaction commit until after the I/O or will that mean
>> holding an ilock across the I/O?
> 
> Right - that's not allowed. To do something like this, it would need
> to be a two-phase transaction commit. That is, we do all the work
> up front before the I/O, then commit that transaction as "pending".
> Then on I/O completion we commit a subsequent "I/O done" transaction
> that is paired with the coversion/allocation. Then in recovery, we
> only do the conversion if we see the I/O done transaction as well.
> 
> Realistically, we should do this for all allocations to close the
> allocate-crash-expose-stale-data hole that exists. The model for
> this is the Extent Free Intent/Extent Free Done (EFI/EFD)
> transaction pair and their linked log items used when freeing
> extents.
Sounds like a plan.  Would this be an easy change to make?

     prev parent reply	other threads:[~2008-09-29  8:30 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2008-09-26  5:31 Running out of reserved data blocks Lachlan McIlroy
2008-09-26  7:08 ` Dave Chinner
2008-09-26  7:45   ` Lachlan McIlroy
2008-09-26  8:48     ` Dave Chinner
2008-09-29  2:44       ` Lachlan McIlroy
2008-09-29  6:51         ` Dave Chinner
2008-09-29  8:40           ` Lachlan McIlroy [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=48E09494.6070709@sgi.com \
    --to=lachlan@sgi.com \
    --cc=xfs-dev@sgi.com \
    --cc=xfs@oss.sgi.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox