From: Lachlan McIlroy <lachlan@sgi.com>
To: David Chinner <dgc@sgi.com>
Cc: xfs-dev <xfs-dev@sgi.com>, xfs-oss <xfs@oss.sgi.com>
Subject: Re: [PATCH] Revalidate the btree cursor after an insert
Date: Thu, 20 Mar 2008 18:13:18 +1100 [thread overview]
Message-ID: <47E20E8E.6070209@sgi.com> (raw)
In-Reply-To: <20080320044813.GY95344431@sgi.com>
Looks good Dave - and thanks for explaining the finer details.
David Chinner wrote:
> Ensure a btree insert returns a valid cursor.
>
> When writing into preallocated regions there is a case
> where XFS can oops or hang doing the unwritten extent conversion
> on I/O completion. It turns out thathte problem is related to
> the btree cursor being invalid.
>
> When we do an insert into the tree, we may need to split blocks
> in the tree. When we only split at the leaf level (i.e. level 0),
> everything works just fine. However, if we have a multi-level
> split in the btreee, the cursor passed to the insert function is
> no longer valid once the insert is complete.
>
> The leaf level split is handled correctly because all the operations
> at level 0 are done using the original cursor, hence it is updated
> correctly. However, when we need to update the next level up the
> tree, we don't use that cursor - we use a cloned cursor that points
> to the index in the next level up where we need to do the insert.
>
> Hence if we need to split a second level, the changes to the tree
> are reflected in the cloned cursor and not the original cursor.
> This clone-and-move-up-a-level-on-split behaviour recurses all
> the way to the top of the tree.
>
> The complexity here is that these cloned cursors do not point to
> the original index that was inserted - they point to the newly
> allocated block (the right block) and the original cursor pointer
> to that level may still point to the left block. Hence, without deep
> examination of the cloned cursor and buffers, we cannot update the
> original cursor with the new path from the cloned cursor.
>
> In these cases the original cursor could be pointing to the wrong
> block(s) and hence a subsequent modification to the tree using that
> cursor will lead to corruption of the tree.
>
> The crash case occurs whenteh tree changes height - we insert a new
> level in the tree, and the cursor does not have a buffer in it's path
> for that level. Hence any attempt to walk back up the cursor to the
> root block will result in a null pointer dereference.
>
> To make matters even more complex, the BMAP BT is rooted in an inode,
> so we can have a change of height in the btree *without a root split*.
> That is, if the root block in the inode is full when we split a
> leaf node, we cannot fit the pointer to the new block in the root,
> so we allocate a new block, migrate all the ptrs out of the inode
> into the new block and point the inode root block at the newly
> allocated block. This changes the height of the tree without a
> root split having occurred and hence invalidates the path in the
> original cursor.
>
> The patch below prevents xfs_bmbt_insert() from returning with an
> invalid cursor by detecting the cases that invalidate the original
> cursor and refresh it by do a lookup into the btree for the original
> index we were inserting at.
>
> Note that the INOBT, AGFBNO and AGFCNT btree implementations also have
> this bug, but the cursor is currently always destroyed or revalidated
> after an insert for those trees. Hence this patch only address the
> problem in the BMBT code.
>
> Signed-off-by: Dave Chinner <dgc@sgi.com>
> ---
> fs/xfs/xfs_bmap_btree.c | 38 ++++++++++++++++++++++++++++++++++++--
> 1 file changed, 36 insertions(+), 2 deletions(-)
>
> Index: 2.6.x-xfs-new/fs/xfs/xfs_bmap_btree.c
> ===================================================================
> --- 2.6.x-xfs-new.orig/fs/xfs/xfs_bmap_btree.c 2008-03-14 11:33:48.000000000 +1100
> +++ 2.6.x-xfs-new/fs/xfs/xfs_bmap_btree.c 2008-03-20 15:45:19.351601515 +1100
> @@ -2027,6 +2027,24 @@ xfs_bmbt_increment(
>
> /*
> * Insert the current record at the point referenced by cur.
> + *
> + * A multi-level split of the tree on insert will invalidate the original
> + * cursor. It appears, however, that some callers assume that the cursor is
> + * always valid. Hence if we do a multi-level split we need to revalidate the
> + * cursor.
> + *
> + * When a split occurs, we will see a new cursor returned. Use that as a
> + * trigger to determine if we need to revalidate the original cursor. If we get
> + * a split, then use the original irec to lookup up the path of the record we
> + * just inserted.
> + *
> + * Note that the fact that the btree root is in the inode means that we can
> + * have the level of the tree change without a "split" occurring at the root
> + * level. What happens is that the root is migrated to an allocated block and
> + * the inode root is pointed to it. This means a single split can change the
> + * level of the tree (level 2 -> level 3) and invalidate the old cursor. Hence
> + * the level change should be accounted as a split so as to correctly trigger a
> + * revalidation of the old cursor.
> */
> int /* error */
> xfs_bmbt_insert(
> @@ -2039,11 +2057,14 @@ xfs_bmbt_insert(
> xfs_fsblock_t nbno;
> xfs_btree_cur_t *ncur;
> xfs_bmbt_rec_t nrec;
> + xfs_bmbt_irec_t oirec; /* original irec */
> xfs_btree_cur_t *pcur;
> + int splits = 0;
>
> XFS_BMBT_TRACE_CURSOR(cur, ENTRY);
> level = 0;
> nbno = NULLFSBLOCK;
> + oirec = cur->bc_rec.b;
> xfs_bmbt_disk_set_all(&nrec, &cur->bc_rec.b);
> ncur = NULL;
> pcur = cur;
> @@ -2052,11 +2073,13 @@ xfs_bmbt_insert(
> &i))) {
> if (pcur != cur)
> xfs_btree_del_cursor(pcur, XFS_BTREE_ERROR);
> - XFS_BMBT_TRACE_CURSOR(cur, ERROR);
> - return error;
> + goto error0;
> }
> XFS_WANT_CORRUPTED_GOTO(i == 1, error0);
> if (pcur != cur && (ncur || nbno == NULLFSBLOCK)) {
> + /* allocating a new root is effectively a split */
> + if (cur->bc_nlevels != pcur->bc_nlevels)
> + splits++;
> cur->bc_nlevels = pcur->bc_nlevels;
> cur->bc_private.b.allocated +=
> pcur->bc_private.b.allocated;
> @@ -2070,10 +2093,21 @@ xfs_bmbt_insert(
> xfs_btree_del_cursor(pcur, XFS_BTREE_NOERROR);
> }
> if (ncur) {
> + splits++;
> pcur = ncur;
> ncur = NULL;
> }
> } while (nbno != NULLFSBLOCK);
> +
> + if (splits > 1) {
> + /* revalidate the old cursor as we had a multi-level split */
> + error = xfs_bmbt_lookup_eq(cur, oirec.br_startoff,
> + oirec.br_startblock, oirec.br_blockcount, &i);
> + if (error)
> + goto error0;
> + ASSERT(i == 1);
> + }
> +
> XFS_BMBT_TRACE_CURSOR(cur, EXIT);
> *stat = i;
> return 0;
>
prev parent reply other threads:[~2008-03-20 7:06 UTC|newest]
Thread overview: 2+ messages / expand[flat|nested] mbox.gz Atom feed top
2008-03-20 4:48 [PATCH] Revalidate the btree cursor after an insert David Chinner
2008-03-20 7:13 ` Lachlan McIlroy [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=47E20E8E.6070209@sgi.com \
--to=lachlan@sgi.com \
--cc=dgc@sgi.com \
--cc=xfs-dev@sgi.com \
--cc=xfs@oss.sgi.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox