From: Lachlan McIlroy
Reply-To: lachlan@sgi.com
Date: Mon, 29 Sep 2008 18:40:52 +1000
Subject: Re: Running out of reserved data blocks
To: Lachlan McIlroy, xfs-dev, xfs-oss

Dave Chinner wrote:
> On Mon, Sep 29, 2008 at 12:44:10PM +1000, Lachlan McIlroy wrote:
>> Dave Chinner wrote:
>>> On Fri, Sep 26, 2008 at 05:45:10PM +1000, Lachlan McIlroy wrote:
>>>> Dave Chinner wrote:
>>>>> On Fri, Sep 26, 2008 at 03:31:23PM +1000, Lachlan McIlroy
>>>>> wrote:
>>>>>> A while back I posted a patch to re-dirty pages on I/O error
>>>>>> to handle errors from xfs_trans_reserve() that was failing
>>>>>> with ENOSPC when trying to convert delayed allocations.  I'm
>>>>>> now seeing xfs_trans_reserve() fail when converting unwritten
>>>>>> extents, and in that case we silently ignore the error and
>>>>>> leave the extent as unwritten, which effectively causes data
>>>>>> corruption.  I can also get failures when trying to unreserve
>>>>>> disk space.
>>>>> Is this problem being seen in the real world, or just in
>>>>> artificial test workloads?
>>>> Customer escalations.
>>> And the main cause is what?  Direct I/O into unwritten extents?
>> The system is so busy that it's overwhelming the bandwidth of the
>> log, and many threads have taken a slice of the reserved pool and
>> are waiting for log space.
>
> By that I assume you mean that there are lots of threads waiting
> in xlog_state_get_iclog_space()?
No, I think they were in xlog_grant_log_space().

>> Extent conversions start failing and we invalidate the page but
>> fail to remove the delayed allocation.
>
> IIUC, you are trying to say that delayed allocation is failing with
> ENOSPC in xfs_iomap_write_allocate(), and everything goes downhill
> from there?
Yes.  Exactly.

> Perhaps we should propagate the "BMAPI_TRYLOCK" flag into
> xfs_iomap_write_allocate() and convert ENOSPC errors from
> xfs_trans_reserve() into EAGAIN for non-blocking writeback.  That
> means any sort of synchronous write will propagate an error, but
> async writeback (like pdflush) will simply treat the condition the
> same as inode lock contention.
That sounds like a small change on top of the patch I sent out
earlier.  I'll add it in and re-post the patch.
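Roughly this shape, I think (sketch only, untested - it assumes the
BMAPI_* flags get plumbed down into xfs_iomap_write_allocate(), which
is part of the change):

	error = xfs_trans_reserve(tp, nres, XFS_WRITE_LOG_RES(mp),
			0, XFS_TRANS_PERM_LOG_RES,
			XFS_WRITE_LOG_COUNT);
	if (error) {
		xfs_trans_cancel(tp, 0);
		/*
		 * For non-blocking writeback (pdflush), turn ENOSPC
		 * into EAGAIN so the page stays dirty and writeback
		 * retries later, the same way we treat inode lock
		 * contention.  Sync writeback still sees the real
		 * ENOSPC.
		 */
		if (error == ENOSPC && (flags & BMAPI_TRYLOCK))
			return XFS_ERROR(EAGAIN);
		return XFS_ERROR(error);
	}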
> Hence issuing something like a fsync() or sync(1) will cause ENOSPC
> errors to be triggered on delalloc in this situation, but async
> writeback won't.  In the case of a direct I/O read, it should get
> an ENOSPC error reported back instead of.....
>
>> Then a direct I/O read runs into a delayed allocation where it
>> does not expect one and hits a BUG_ON().
>
> .... doing that.
>
>>>>> If you start new operations like writing into unwritten extents
>>>>> once you are already at ENOSPC, then you can consume the entire
>>>>> reserve pool.  There is nothing we can do to prevent that from
>>>>> occurring, except by doing something like partially freezing the
>>>>> filesystem (i.e. just the data write() level, not the
>>>>> transaction level) until the ENOSPC condition goes away....
>>>> Yes, we could eat into the reserve pool with btree split/newroot
>>>> allocations.  Same with delayed allocations.  That's yet another
>>>> problem where we need to account for potential btree space before
>>>> creating delayed allocations or unwritten extents.
>>> It's the same problem - allocation can cause consumption of
>>> blocks in the BMBT.  At ENOSPC, it's not the allocbt that
>>> is being split or consuming blocks...
>>>
>>> Metadata block allocation due to delayed data allocation is bound
>>> by memory size and dirty page limits - once we get to ENOSPC,
>>> there will be no more pages accepted for delayed allocation - the
>>> app will get an ENOSPC up front.  The reserved pool needs to be
>>> large enough to handle all the allocations that this dirty data
>>> can trigger.  Easily solved by bumping the tunable for large
>>> memory systems.
>> Yep, but how high do we bump it?
>
> Not sure.  It sounds like the problem is the number of transactions
> that can be in flight at once, each taking their 4-8 blocks of
> reservation out of the pool and then blocking for some period of
> time waiting for iclog space to be able to commit the transaction.
>
> Given that the most I've previously seen is ~1500 transactions
> blocked waiting for iclog space, I'd say that gives a rough
> indication of how deep the reservation pool could be bumped to....
Hmmm, okay.  1500 * 8 = 12,000 blocks.  Might as well round that up
to 16K.  I'll post a patch to increase the default pool size.
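For anyone who wants to see where their pool is at (or bump it for
testing), the resblks ioctls already expose it.  Quick untested
sketch, assuming the xfsprogs headers are installed:

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <xfs/xfs.h>	/* xfs_fsop_resblks_t, XFS_IOC_*_RESBLKS */

int
main(int argc, char **argv)
{
	xfs_fsop_resblks_t	res;
	int			fd;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <file-on-xfs> [new-size]\n",
			argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0 || ioctl(fd, XFS_IOC_GET_RESBLKS, &res) < 0) {
		perror(argv[1]);
		return 1;
	}
	printf("reserved %llu blocks, %llu still available\n",
		(unsigned long long)res.resblks,
		(unsigned long long)res.resblks_avail);
	if (argc > 2) {		/* needs root; e.g. 16384 */
		res.resblks = strtoull(argv[2], NULL, 0);
		if (ioctl(fd, XFS_IOC_SET_RESBLKS, &res) < 0)
			perror("XFS_IOC_SET_RESBLKS");
	}
	close(fd);
	return 0;
}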
>> I've run some pathological workloads that can deplete a reserved
>> pool of 16384 blocks.  While our customers may never run this
>> workload, their systems are much bigger than what I have at my
>> disposal.  When the reserved pool depletes I would rather have the
>> system degrade performance than cause data corruption or panic.
>
> Right, so would I.  The problem is how to do it without introducing
> excessive complexity.
>
>> In other words, if we can find a solution to safely handle a
>> depleted reserved pool (even if it means taking a performance hit)
>> rather than hope it never happens, I would like to hear about it.
>
> I think what I mentioned above might prevent the common case of the
> problem you are seeing.  It doesn't fix the "depletion by a thousand
> unwritten extent conversions" problem, but it should prevent the
> silent trashing of delalloc data due to temporary reserve pool
> depletion....
>
>>> FWIW, determining the number of blocks to reserve for delayed
>>> allocation during delayed allocation is not worth the complexity.
>>> You don't know how many extents the data will end up in, you don't
>>> know what order the pages might get written in, so you could have
>>> worst case sparse page writeout before the holes are filled (i.e.
>>> have tree growth and then have it shrink), etc.  Even reserving
>>> enough blocks for a full btree split per dirtied inode is not
>>> sufficient, as allocation may trigger multiple full tree splits.
>>> Basically the reservations will get so large that they will cause
>>> applications to get premature ENOSPC errors when the writes could
>>> have succeeded without problems.
>> I totally agree.  If only there was some way to know what was going
>> to happen to the btree during writeout, we could account for the
>> space.  We could disable delayed allocations when the filesystem is
>> near ENOSPC and do immediate allocations like in direct I/O.
>
> Define "near ENOSPC" ;)
I thought you might ask that.  I don't have an answer.

> [ Well, I already have once. ;)  The incore per-cpu superblock
> counters fall back to updating the global superblock when the
> filesystem approaches ENOSPC (per-cpu threshold, so it scales with
> the size of the machine), but you'd effectively need to flush all
> the delalloc data at this point as well if you were to switch off
> delalloc.... ]
>
> I guess just switching to sync writes when we near ENOSPC would
> do what you are suggesting...
Yeah, that might work too, since if the extent conversions fail we
can return an error from the write().

>>> For the unwritten extent conversion case, though, we need to
>>> prevent new writes (after ENOSPC occurs) from draining the
>>> reserved pool.  That means we either have to return an ENOSPC
>>> to the application, or we freeze the writes into preallocated
>>> space when we are at ENOSPC and the reserve pool is getting
>>> depleted.  This needs to be done up front, not in the I/O
>>> completion where it is too late to handle the fact that the
>>> reserve pool is too depleted to do the conversion.....
>>>
>>> That seems simple enough to do without having to add any code
>>> to the back end I/O path or the transaction subsystem....
>> Sounds reasonable.  We could report ENOSPC for an extent conversion
>> that could work - i.e. if a split is not needed - and users might
>> get a little confused by ENOSPC if they know they preallocated the
>> space.  But it's better than data corruption.
>
> Certainly is ;)
>
>> What if we were to do the unwritten extent conversion up front?
>
> Crash after conversion but before the data I/O is issued then
> results in stale data exposure.  Not a good failure mode.
Didn't we have a plan to fix this for the delalloc extent conversion
case?

>> Could we delay the transaction commit until after the I/O, or will
>> that mean holding an ilock across the I/O?
>
> Right - that's not allowed.  To do something like this, it would
> need to be a two-phase transaction commit.  That is, we do all the
> work up front before the I/O, then commit that transaction as
> "pending".  Then on I/O completion we commit a subsequent "I/O done"
> transaction that is paired with the conversion/allocation.  Then in
> recovery, we only do the conversion if we see the I/O done
> transaction as well.
>
> Realistically, we should do this for all allocations to close the
> allocate-crash-expose-stale-data hole that exists.  The model for
> this is the Extent Free Intent/Extent Free Done (EFI/EFD)
> transaction pair and their linked log items used when freeing
> extents.
Sounds like a plan.  Would this be an easy change to make?
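Just to check I follow the recovery side, something like this?  (All
names here are invented - none of these items exist yet - it just
mirrors how EFIs get matched with EFDs during recovery.)

	/*
	 * Hypothetical intent/done pair for unwritten extent
	 * conversion.  Phase 1, before the data I/O: commit the
	 * conversion transaction with an "intent" log item attached.
	 * Phase 2, at I/O completion: commit a paired "done" item
	 * carrying the intent's ID.
	 */
	STATIC void
	xlog_recover_do_conversion_intent(
		xfs_conv_intent_t	*intent,	/* invented */
		xfs_conv_done_t		*done)		/* invented */
	{
		if (done && done->cd_intent_id == intent->ci_id) {
			/* The data made it to disk - replay the
			 * conversion. */
			xfs_replay_unwritten_conversion(intent);
		} else {
			/*
			 * No matching done item, so the data I/O never
			 * completed - leave the extent unwritten and
			 * never expose stale blocks.
			 */
			xfs_cancel_conversion_intent(intent);
		}
	}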