From: Dave Chinner <david@fromorbit.com>
To: "Darrick J. Wong" <djwong@kernel.org>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH 3/5] xfs: track cow/shared record domains explicitly in xfs_refcount_irec
Date: Wed, 26 Oct 2022 11:40:29 +1100 [thread overview]
Message-ID: <20221026004029.GK3600936@dread.disaster.area> (raw)
In-Reply-To: <166664720593.2690245.18182994137797499438.stgit@magnolia>
On Mon, Oct 24, 2022 at 02:33:25PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> Just prior to committing the reflink code into upstream, the xfs
> maintainer at the time requested that I find a way to shard the refcount
> records into two domains -- one for records tracking shared extents, and
> a second for tracking CoW staging extents. The idea here was to
> minimize mount time CoW reclamation by pushing all the CoW records to
> the right edge of the keyspace, and it was accomplished by setting the
> upper bit in rc_startblock. We don't allow AGs to have more than 2^31
> blocks, so the bit was free.
>
> Unfortunately, this was a very late addition to the codebase, so most of
> the refcount record processing code still treats rc_startblock as a u32
> and pays no attention to whether or not the upper bit (the cow flag) is
> set. This is a weakness is theoretically exploitable, since we're not
> fully validating the incoming metadata records.
>
> Fuzzing demonstrates practical exploits of this weakness. If the cow
> flag of a node block key record is corrupted, a lookup operation can go
> to the wrong record block and start returning records from the wrong
> cow/shared domain. This causes the math to go all wrong (since cow
> domain is still implicit in the upper bit of rc_startblock) and we can
> crash the kernel by tricking xfs into jumping into a nonexistent AG and
> tripping over xfs_perag_get(mp, <nonexistent AG>) returning NULL.
>
> To fix this, start tracking the domain as an explicit part of struct
> xfs_refcount_irec, adjust all refcount functions to check the domain
> of a returned record, and alter the function definitions to accept them
> where necessary.
>
> Found by fuzzing keys[2].cowflag = add in xfs/464.
>
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
> fs/xfs/libxfs/xfs_refcount.c | 221 ++++++++++++++++++++++++------------
> fs/xfs/libxfs/xfs_refcount.h | 9 +
> fs/xfs/libxfs/xfs_refcount_btree.c | 26 ++++
> fs/xfs/libxfs/xfs_types.h | 10 ++
> fs/xfs/scrub/refcount.c | 22 ++--
> fs/xfs/xfs_trace.h | 48 ++++++--
> 6 files changed, 235 insertions(+), 101 deletions(-)
My first thought was that this is complex enough that it needs to be
split up. Reading the very first hunk:
>
>
> diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c
> index 64b910caafaa..fa75e785652b 100644
> --- a/fs/xfs/libxfs/xfs_refcount.c
> +++ b/fs/xfs/libxfs/xfs_refcount.c
> @@ -46,13 +46,16 @@ STATIC int __xfs_refcount_cow_free(struct xfs_btree_cur *rcur,
> int
> xfs_refcount_lookup_le(
> struct xfs_btree_cur *cur,
> + enum xfs_rcext_domain domain,
> xfs_agblock_t bno,
> int *stat)
> {
> - trace_xfs_refcount_lookup(cur->bc_mp, cur->bc_ag.pag->pag_agno, bno,
> + trace_xfs_refcount_lookup(cur->bc_mp, cur->bc_ag.pag->pag_agno,
> + bno | (domain == XFS_RCDOM_COW ? XFS_REFC_COW_START : 0),
> XFS_LOOKUP_LE);
> cur->bc_rec.rc.rc_startblock = bno;
> cur->bc_rec.rc.rc_blockcount = 0;
> + cur->bc_rec.rc.rc_domain = domain;
> return xfs_btree_lookup(cur, XFS_LOOKUP_LE, stat);
I found that I had to go looking through the patch to find what the
definitions of xfs_rcext_domain, XFS_RCDOM_COW and rc_domain
actually were, because without knowing what the actual new
structures being introduced actually were this didn't make a whole
lot of sense.
Hence I think I'd like this to be broken up into a patch that
introduces the rc_domain and the helpers that build/split the
domain/startblock information in the irec, and a followup patch that
modifies the implementation to use the new domain interfaces to make
it a bit easier to see where the significant changes in the code
actually are.
I also suspect that trace_xfs_refcount_lookup() should be passed the
domain directly rather than encoding it at every call site of the
tracepoint, or it code encoded in a helper such as
xfs_refcount_domain_to_block()...
For consistency, calling the enums XFS_REFC_DOMAIN_{COW/SHARED}
would be more consistent with the naming used in other refcount
btree constants.
> }
>
> @@ -63,13 +66,16 @@ xfs_refcount_lookup_le(
> int
> xfs_refcount_lookup_ge(
> struct xfs_btree_cur *cur,
> + enum xfs_rcext_domain domain,
> xfs_agblock_t bno,
> int *stat)
> {
> - trace_xfs_refcount_lookup(cur->bc_mp, cur->bc_ag.pag->pag_agno, bno,
> + trace_xfs_refcount_lookup(cur->bc_mp, cur->bc_ag.pag->pag_agno,
> + bno | (domain == XFS_RCDOM_COW ? XFS_REFC_COW_START : 0),
> XFS_LOOKUP_GE);
> cur->bc_rec.rc.rc_startblock = bno;
> cur->bc_rec.rc.rc_blockcount = 0;
> + cur->bc_rec.rc.rc_domain = domain;
> return xfs_btree_lookup(cur, XFS_LOOKUP_GE, stat);
> }
>
> @@ -80,13 +86,16 @@ xfs_refcount_lookup_ge(
> int
> xfs_refcount_lookup_eq(
> struct xfs_btree_cur *cur,
> + enum xfs_rcext_domain domain,
> xfs_agblock_t bno,
> int *stat)
> {
> - trace_xfs_refcount_lookup(cur->bc_mp, cur->bc_ag.pag->pag_agno, bno,
> + trace_xfs_refcount_lookup(cur->bc_mp, cur->bc_ag.pag->pag_agno,
> + bno | (domain == XFS_RCDOM_COW ? XFS_REFC_COW_START : 0),
> XFS_LOOKUP_LE);
> cur->bc_rec.rc.rc_startblock = bno;
> cur->bc_rec.rc.rc_blockcount = 0;
> + cur->bc_rec.rc.rc_domain = domain;
> return xfs_btree_lookup(cur, XFS_LOOKUP_EQ, stat);
> }
>
> @@ -96,7 +105,17 @@ xfs_refcount_btrec_to_irec(
> const union xfs_btree_rec *rec,
> struct xfs_refcount_irec *irec)
> {
> - irec->rc_startblock = be32_to_cpu(rec->refc.rc_startblock);
> + __u32 start;
> +
> + start = be32_to_cpu(rec->refc.rc_startblock);
> + if (start & XFS_REFC_COW_START) {
> + start &= ~XFS_REFC_COW_START;
> + irec->rc_domain = XFS_RCDOM_COW;
> + } else {
> + irec->rc_domain = XFS_RCDOM_SHARED;
> + }
xfs_refcount_block_to_domain()?
> +
> + irec->rc_startblock = start;
> irec->rc_blockcount = be32_to_cpu(rec->refc.rc_blockcount);
> irec->rc_refcount = be32_to_cpu(rec->refc.rc_refcount);
> }
> @@ -114,7 +133,6 @@ xfs_refcount_get_rec(
> struct xfs_perag *pag = cur->bc_ag.pag;
> union xfs_btree_rec *rec;
> int error;
> - xfs_agblock_t realstart;
>
> error = xfs_btree_get_rec(cur, &rec, stat);
> if (error || !*stat)
> @@ -124,22 +142,18 @@ xfs_refcount_get_rec(
> if (irec->rc_blockcount == 0 || irec->rc_blockcount > MAXREFCEXTLEN)
> goto out_bad_rec;
>
> - /* handle special COW-staging state */
> - realstart = irec->rc_startblock;
> - if (realstart & XFS_REFC_COW_START) {
> - if (irec->rc_refcount != 1)
> - goto out_bad_rec;
> - realstart &= ~XFS_REFC_COW_START;
> - } else if (irec->rc_refcount < 2) {
> + /* handle special COW-staging domain */
> + if (irec->rc_domain == XFS_RCDOM_COW && irec->rc_refcount != 1)
> + goto out_bad_rec;
> + if (irec->rc_domain == XFS_RCDOM_SHARED && irec->rc_refcount < 2)
> goto out_bad_rec;
> - }
>
> /* check for valid extent range, including overflow */
> - if (!xfs_verify_agbno(pag, realstart))
> + if (!xfs_verify_agbno(pag, irec->rc_startblock))
> goto out_bad_rec;
> - if (realstart > realstart + irec->rc_blockcount)
> + if (irec->rc_startblock > irec->rc_startblock + irec->rc_blockcount)
> goto out_bad_rec;
> - if (!xfs_verify_agbno(pag, realstart + irec->rc_blockcount - 1))
> + if (!xfs_verify_agbno(pag, irec->rc_startblock + irec->rc_blockcount - 1))
> goto out_bad_rec;
>
> if (irec->rc_refcount == 0 || irec->rc_refcount > MAXREFCOUNT)
> @@ -169,12 +183,19 @@ xfs_refcount_update(
> struct xfs_refcount_irec *irec)
> {
> union xfs_btree_rec rec;
> + __u32 start;
> int error;
>
> trace_xfs_refcount_update(cur->bc_mp, cur->bc_ag.pag->pag_agno, irec);
> - rec.refc.rc_startblock = cpu_to_be32(irec->rc_startblock);
> +
> + start = irec->rc_startblock & ~XFS_REFC_COW_START;
> + if (irec->rc_domain == XFS_RCDOM_COW)
> + start |= XFS_REFC_COW_START;
Yeah, this definitely looks like we want a
xfs_refcount_domain_to_block() helper...
> +
> + rec.refc.rc_startblock = cpu_to_be32(start);
> rec.refc.rc_blockcount = cpu_to_be32(irec->rc_blockcount);
> rec.refc.rc_refcount = cpu_to_be32(irec->rc_refcount);
> +
> error = xfs_btree_update(cur, &rec);
> if (error)
> trace_xfs_refcount_update_error(cur->bc_mp,
> @@ -196,9 +217,12 @@ xfs_refcount_insert(
> int error;
>
> trace_xfs_refcount_insert(cur->bc_mp, cur->bc_ag.pag->pag_agno, irec);
> +
> cur->bc_rec.rc.rc_startblock = irec->rc_startblock;
> cur->bc_rec.rc.rc_blockcount = irec->rc_blockcount;
> cur->bc_rec.rc.rc_refcount = irec->rc_refcount;
> + cur->bc_rec.rc.rc_domain = irec->rc_domain;
Does a struct copy make more sense here now?
[.....]
> @@ -600,8 +636,6 @@ xfs_refcount_merge_right_extent(
> return error;
> }
>
> -#define XFS_FIND_RCEXT_SHARED 1
> -#define XFS_FIND_RCEXT_COW 2
> /*
> * Find the left extent and the one after it (cleft). This function assumes
> * that we've already split any extent crossing agbno.
> @@ -611,16 +645,16 @@ xfs_refcount_find_left_extents(
> struct xfs_btree_cur *cur,
> struct xfs_refcount_irec *left,
> struct xfs_refcount_irec *cleft,
> + enum xfs_rcext_domain domain,
> xfs_agblock_t agbno,
> - xfs_extlen_t aglen,
> - int flags)
> + xfs_extlen_t aglen)
> {
> struct xfs_refcount_irec tmp;
> int error;
> int found_rec;
>
> left->rc_startblock = cleft->rc_startblock = NULLAGBLOCK;
> - error = xfs_refcount_lookup_le(cur, agbno - 1, &found_rec);
> + error = xfs_refcount_lookup_le(cur, domain, agbno - 1, &found_rec);
> if (error)
> goto out_error;
> if (!found_rec)
> @@ -634,11 +668,13 @@ xfs_refcount_find_left_extents(
> goto out_error;
> }
>
> + if (tmp.rc_domain != domain)
> + return 0;
> if (xfs_refc_next(&tmp) != agbno)
> return 0;
> - if ((flags & XFS_FIND_RCEXT_SHARED) && tmp.rc_refcount < 2)
> + if (domain == XFS_RCDOM_SHARED && tmp.rc_refcount < 2)
> return 0;
> - if ((flags & XFS_FIND_RCEXT_COW) && tmp.rc_refcount > 1)
> + if (domain == XFS_RCDOM_COW && tmp.rc_refcount > 1)
> return 0;
Hmmm - this pattern is repeated in a couple of places. Perhaps a
helper is in order?
[....]
> diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c
> index 316c1ec0c3c2..b0818063aa20 100644
> --- a/fs/xfs/libxfs/xfs_refcount_btree.c
> +++ b/fs/xfs/libxfs/xfs_refcount_btree.c
> @@ -155,12 +155,31 @@ xfs_refcountbt_init_high_key_from_rec(
> key->refc.rc_startblock = cpu_to_be32(x);
> }
>
> +static inline __u32
> +xfs_refcountbt_encode_startblock(
> + struct xfs_btree_cur *cur)
> +{
> + __u32 start;
> +
> + /*
> + * low level btree operations need to handle the generic btree range
> + * query functions (which set rc_domain == -1U), so we check that the
> + * domain is /not/ shared.
> + */
> + start = cur->bc_rec.rc.rc_startblock & ~XFS_REFC_COW_START;
> + if (cur->bc_rec.rc.rc_domain != XFS_RCDOM_SHARED)
> + start |= XFS_REFC_COW_START;
> + return start;
> +}
Oh, there is a xfs_refcount_domain_to_block() helper already - it
just needs to be passed the agbno + domain, not a btree cursor :)
[....]
> diff --git a/fs/xfs/scrub/refcount.c b/fs/xfs/scrub/refcount.c
> index 8ab55e791ea8..9cf4be9cbb89 100644
> --- a/fs/xfs/scrub/refcount.c
> +++ b/fs/xfs/scrub/refcount.c
> @@ -334,20 +334,19 @@ xchk_refcountbt_rec(
> struct xfs_refcount_irec irec;
> xfs_agblock_t *cow_blocks = bs->private;
> struct xfs_perag *pag = bs->cur->bc_ag.pag;
> - bool has_cowflag;
>
> xfs_refcount_btrec_to_irec(rec, &irec);
>
> /* Only CoW records can have refcount == 1. */
> - has_cowflag = (irec.rc_startblock & XFS_REFC_COW_START);
> - if ((irec.rc_refcount == 1 && !has_cowflag) ||
> - (irec.rc_refcount != 1 && has_cowflag))
> + if (irec.rc_domain == XFS_RCDOM_SHARED && irec.rc_refcount == 1)
> xchk_btree_set_corrupt(bs->sc, bs->cur, 0);
> - if (has_cowflag)
> + if (irec.rc_domain == XFS_RCDOM_COW) {
> + if (irec.rc_refcount != 1)
> + xchk_btree_set_corrupt(bs->sc, bs->cur, 0);
> (*cow_blocks) += irec.rc_blockcount;
> + }
Hmmm, that looks like another of those shared/cow domain + refcount
checks like I pointed out above...
>
> /* Check the extent. */
> - irec.rc_startblock &= ~XFS_REFC_COW_START;
> if (irec.rc_startblock + irec.rc_blockcount <= irec.rc_startblock ||
> !xfs_verify_agbno(pag, irec.rc_startblock) ||
> !xfs_verify_agbno(pag, irec.rc_startblock + irec.rc_blockcount - 1))
> @@ -420,7 +419,6 @@ xchk_xref_is_cow_staging(
> xfs_extlen_t len)
> {
> struct xfs_refcount_irec rc;
> - bool has_cowflag;
> int has_refcount;
> int error;
>
> @@ -428,8 +426,8 @@ xchk_xref_is_cow_staging(
> return;
>
> /* Find the CoW staging extent. */
> - error = xfs_refcount_lookup_le(sc->sa.refc_cur,
> - agbno + XFS_REFC_COW_START, &has_refcount);
> + error = xfs_refcount_lookup_le(sc->sa.refc_cur, XFS_RCDOM_COW, agbno,
> + &has_refcount);
> if (!xchk_should_check_xref(sc, &error, &sc->sa.refc_cur))
> return;
> if (!has_refcount) {
> @@ -446,8 +444,7 @@ xchk_xref_is_cow_staging(
> }
>
> /* CoW flag must be set, refcount must be 1. */
> - has_cowflag = (rc.rc_startblock & XFS_REFC_COW_START);
> - if (!has_cowflag || rc.rc_refcount != 1)
> + if (rc.rc_domain != XFS_RCDOM_COW || rc.rc_refcount != 1)
> xchk_btree_xref_set_corrupt(sc, sc->sa.refc_cur, 0);
That looks like the same validation logic check as the above hunk
in xchk_refcountbt_rec()...
That's just a first pass - I haven't really concentrated on
correctness that much yet, the patch is really doing too much for me
to get a good picture of everything...
-Dave.
--
Dave Chinner
david@fromorbit.com
next prev parent reply other threads:[~2022-10-26 0:40 UTC|newest]
Thread overview: 13+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-10-24 21:33 [PATCHSET 0/5] xfs: improve runtime refcountbt corruption detection Darrick J. Wong
2022-10-24 21:33 ` [PATCH 1/5] xfs: move _irec structs to xfs_types.h Darrick J. Wong
2022-10-25 23:56 ` Dave Chinner
2022-10-24 21:33 ` [PATCH 2/5] xfs: refactor refcount record usage in xchk_refcountbt_rec Darrick J. Wong
2022-10-25 23:59 ` Dave Chinner
2022-10-24 21:33 ` [PATCH 3/5] xfs: track cow/shared record domains explicitly in xfs_refcount_irec Darrick J. Wong
2022-10-26 0:40 ` Dave Chinner [this message]
2022-10-26 22:06 ` Darrick J. Wong
2022-10-24 21:33 ` [PATCH 4/5] xfs: rename XFS_REFC_COW_START to _COWFLAG Darrick J. Wong
2022-10-26 0:41 ` Dave Chinner
2022-10-24 21:33 ` [PATCH 5/5] xfs: check deferred refcount op continuation parameters Darrick J. Wong
2022-10-26 0:46 ` Dave Chinner
2022-10-26 1:28 ` Darrick J. Wong
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20221026004029.GK3600936@dread.disaster.area \
--to=david@fromorbit.com \
--cc=djwong@kernel.org \
--cc=linux-xfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox