From: Dave Chinner <david@fromorbit.com>
To: Christoph Hellwig <hch@lst.de>
Cc: rpeterso@redhat.com, linux-fsdevel@vger.kernel.org, xfs@oss.sgi.com
Subject: Re: iomap infrastructure and multipage writes V5
Date: Mon, 1 Aug 2016 10:16:08 +1000 [thread overview]
Message-ID: <20160801001608.GH16044@dastard> (raw)
In-Reply-To: <20160731191900.GA29301@lst.de>
On Sun, Jul 31, 2016 at 09:19:00PM +0200, Christoph Hellwig wrote:
> Another quiet weekend trying to debug this, and only minor progress.
>
> The biggest different in traces of the old vs new code is that we manage
> to allocate much bigger delalloc reservations at a time in xfs_bmapi_delay
> -> xfs_bmapi_reserve_delalloc. The old code always went for a single FSB,
> which also meant allocating an indlen of 7 FSBs. With the iomap code
> we always allocate at least 4FSB (aka a page), and sometimes 8 or 12.
> All of these still need 7 FSBs for the worst case indirect blocks. So
> what happens here is that in an ENOSPC case we manage to allocate more
> actual delalloc blocks before hitting ENOSPC - notwithstanding that the
> old case would immediately release them a little later in
> xfs_bmap_add_extent_hole_delay after merging the delalloc extents.
>
> On the writeback side I don't see to many changes either. We'll
> eventually run out of blocks when allocating the transaction in
> xfs_iomap_write_allocate because the reserved pool is too small.
Yup, that's exactly what generic/224 is testing - it sets the
reserve pool to 4 blocks so it does get exhausted very quickly
and then it exposes the underlying ENOSPC issue. Most users won't
ever see reserve pool exhaustion, which is why I didn't worry too
much about solving this before merging.
> The
> only real difference to before is that under the ENOSPC / out of memory
> case we have allocated between 4 to 12 times more blocks, so we have
> to clean up 4 to 12 times as much while write_cache_pages continues
> iterating over these dirty delalloc blocks. For me this happens
> ~6 times as much as before, but I still don't manage to hit an
> endless loop.
Ok, I'd kind of got that far myself, but then never really got much
further than that - I suspected some kind of "split a delalloc
extent too many times, run out of reservation" type of issue, but
couldn't isolate such a problem in any of the traces.
> Now after spending this much time I've started wondering why we even
> reserve blocks in xfs_iomap_write_allocate - after all we've reserved
> space for the actual data blocks and the indlen worst case in
> xfs_bmapi_reserve_delalloc. And in fact a little hack to drop that
> reservation seems to solve both the root cause (depleted reserved pool)
> and the cleanup mess. I just haven't spend enought time to convince
> myself that it's actually safe, and in fact looking at the allocator
> makes me thing it only works by accident currently despite generally
> postive test results.
Hmmm, interesting. I didn't think about that. I have been looking at
this exact code as a result of rmap ENOSPC problems, and now that
you mention this, I can't see why we'd need a block reservation here
for delalloc conversion, either. Time for more <reverb on>
Adventures in Code Archeology! <reverb off>
/me digs
First stop - just after we removed the behaviour layer. Only caller
of xfs_iomap_write_allocate was:
writepage
xfs_map_block(BMAPI_ALLOCATE) // only for delalloc
xfs_iomap
xfs_iomap_write_allocate
Which is essentially the same single caller we have now, just with
much less indirection.
Looking at the code before the behaviour layer removal, there was
also an "xfs_iocore" abstraction, which abstracted inode locking,
block mapping and allocation and a few other miscellaneous IO
functions. This was so CXFS server could plug into the XFS IO path
and intercept allocation requests on the CXFS client side. This
leads me to think that the CXFS server could call
xfs_iomap_write_allocate() directly. Whether or not the server kept
the delalloc reservation or not, I'm not sure.
So, let's go back to before this abstraction was in place. Takes us
back to before the linux port was started, back to pure Irix
code....
.... and there's no block reservation done for delalloc conversion.
Ok, so here's the commit that introduced the block reservation for
delalloc conversion:
commit 6706e96e324a2fa9636e93facfd4b7fbbf5b85f8
Author: Glen Overby <overby@sgi.com>
Date: Tue Mar 4 20:15:43 2003 +0000
Add error reporting calls in error paths that return EFSCORRUPTED
Yup, hidden deep inside the commit that added the
XFS_CORRUPTION_ERROR and XFS_ERROR_REPORT macros for better error
reporting was this unexplained, uncommented hunk:
@@ -562,9 +563,19 @@ xfs_iomap_write_allocate(
nimaps = 0;
while (nimaps == 0) {
tp = xfs_trans_alloc(mp, XFS_TRANS_STRAT_WRITE);
- error = xfs_trans_reserve(tp, 0, XFS_WRITE_LOG_RES(mp),
+ nres = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
+ error = xfs_trans_reserve(tp, nres,
+ XFS_WRITE_LOG_RES(mp),
0, XFS_TRANS_PERM_LOG_RES,
XFS_WRITE_LOG_COUNT);
+
+ if (error == ENOSPC) {
+ error = xfs_trans_reserve(tp, 0,
+ XFS_WRITE_LOG_RES(mp),
+ 0,
+ XFS_TRANS_PERM_LOG_RES,
+ XFS_WRITE_LOG_COUNT);
+ }
if (error) {
xfs_trans_cancel(tp, 0);
return XFS_ERROR(error);
It's completely out of place compared to the rest of the patch which
didn't change any code logic or algorithms - it only added error
reporting macros. Hence THIS looks like it may have been an
accidental/unintended change in the commit.
The ENOSPC check here went away in 2007 when I expanded the reserve
block pool and added XFS_TRANS_RESERVE to this function to allow it
dip into the reserve pool (commit bdebc6a4 "Prevent ENOSPC from
aborting transactions that need to succeed"). I didn't pick up on
the history back then, I bet I was just focussed on fixing the
ENOSPC issue....
So, essentially, by emptying the reserve block pool, we've opened
this code back up to whatever underlying ENOSPC issue it had prior
to bdebc6a4. And looking back at 6706e96e, I can only guess that the
block reservation was added for a CXFS use case, because XFS still
only called this from a single place - delalloc conversion.
Christoph - it does look like you've found the problem - I agree
with your analysis that the delalloc already reserves space for the
bmbt blocks in the indlen reservation, and that adding another
reservation for bmbt blocks at transaction allocation makes no
obvious sense. The code history doesn't explain it - it only raises
more questions as to why this was done - it may even have been an
accidental change in the first place...
> Here is the quick patch if anyone wants to chime in:
>
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 620fc91..67c317f 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -717,7 +717,7 @@ xfs_iomap_write_allocate(
>
> nimaps = 0;
> while (nimaps == 0) {
> - nres = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
> + nres = 0; // XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
>
> error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, nres,
> 0, XFS_TRANS_RESERVE, &tp);
Let me go test it, see what comes up.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
next prev parent reply other threads:[~2016-08-01 0:16 UTC|newest]
Thread overview: 34+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-06-01 14:44 iomap infrastructure and multipage writes V5 Christoph Hellwig
2016-06-01 14:44 ` [PATCH 01/14] fs: move struct iomap from exportfs.h to a separate header Christoph Hellwig
2016-06-01 14:44 ` [PATCH 02/14] fs: introduce iomap infrastructure Christoph Hellwig
2016-06-16 16:12 ` Jan Kara
2016-06-17 12:01 ` Christoph Hellwig
2016-06-20 2:29 ` Dave Chinner
2016-06-20 12:22 ` Christoph Hellwig
2016-06-01 14:44 ` [PATCH 03/14] fs: support DAX based iomap zeroing Christoph Hellwig
2016-06-01 14:44 ` [PATCH 04/14] xfs: make xfs_bmbt_to_iomap available outside of xfs_pnfs.c Christoph Hellwig
2016-06-01 14:44 ` [PATCH 05/14] xfs: reorder zeroing and flushing sequence in truncate Christoph Hellwig
2016-06-01 14:44 ` [PATCH 06/14] xfs: implement iomap based buffered write path Christoph Hellwig
2016-06-01 14:44 ` [PATCH 07/14] xfs: remove buffered write support from __xfs_get_blocks Christoph Hellwig
2016-06-01 14:44 ` [PATCH 08/14] fs: iomap based fiemap implementation Christoph Hellwig
2016-08-18 13:18 ` Andreas Grünbacher
2016-06-01 14:44 ` [PATCH 09/14] xfs: use iomap " Christoph Hellwig
2016-06-01 14:44 ` [PATCH 10/14] xfs: use iomap infrastructure for DAX zeroing Christoph Hellwig
2016-06-01 14:44 ` [PATCH 11/14] xfs: handle 64-bit length in xfs_iozero Christoph Hellwig
2016-06-01 14:44 ` [PATCH 12/14] xfs: use xfs_zero_range in xfs_zero_eof Christoph Hellwig
2016-06-01 14:44 ` [PATCH 13/14] xfs: split xfs_free_file_space in manageable pieces Christoph Hellwig
2016-06-01 14:44 ` [PATCH 14/14] xfs: kill xfs_zero_remaining_bytes Christoph Hellwig
2016-06-01 14:46 ` iomap infrastructure and multipage writes V5 Christoph Hellwig
2016-06-28 0:26 ` Dave Chinner
2016-06-28 13:28 ` Christoph Hellwig
2016-06-28 13:38 ` Christoph Hellwig
2016-06-30 17:22 ` Christoph Hellwig
2016-06-30 23:16 ` Dave Chinner
2016-07-18 11:14 ` Dave Chinner
2016-07-18 11:18 ` Dave Chinner
2016-07-31 19:19 ` Christoph Hellwig
2016-08-01 0:16 ` Dave Chinner [this message]
2016-08-02 23:42 ` Dave Chinner
2017-02-13 22:31 ` Eric Sandeen
2017-02-14 6:10 ` Christoph Hellwig
2016-07-19 3:50 ` Christoph Hellwig
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20160801001608.GH16044@dastard \
--to=david@fromorbit.com \
--cc=hch@lst.de \
--cc=linux-fsdevel@vger.kernel.org \
--cc=rpeterso@redhat.com \
--cc=xfs@oss.sgi.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).