* [PATCH 0/3] xfs: fix inode use-after-free during log recovery
@ 2020-09-17 3:29 Darrick J. Wong
2020-09-17 3:29 ` [PATCH 1/3] xfs: clean up bmap intent item recovery checking Darrick J. Wong
` (2 more replies)
0 siblings, 3 replies; 29+ messages in thread
From: Darrick J. Wong @ 2020-09-17 3:29 UTC (permalink / raw)
To: darrick.wong; +Cc: linux-xfs, david
Hi all,
In this series, I try to fix a use-after-free that I discovered during
development of the dfops freezer, where BUI recovery releases the inode
even if it requeues itself. If the inode gets reclaimed, the fs
corrupts memory and explodes. The fix is to make the dfops freezer take
over ownership of the inodes if there's any more work to be done.
If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.
This is an extraordinary way to destroy everything. Enjoy!
Comments and questions are, as always, welcome.
--D
kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=fix-bmap-intent-recovery
---
fs/xfs/libxfs/xfs_defer.c | 57 ++++++++++++++++++++++-
fs/xfs/libxfs/xfs_defer.h | 21 ++++++++-
fs/xfs/libxfs/xfs_log_recover.h | 14 +++++-
fs/xfs/xfs_bmap_item.c | 95 +++++++++++++++------------------------
fs/xfs/xfs_icache.c | 41 +++++++++++++++++
fs/xfs/xfs_log_recover.c | 32 +++++++++++--
fs/xfs/xfs_trans.h | 6 --
7 files changed, 191 insertions(+), 75 deletions(-)
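The lifetime bug the cover letter describes is easiest to see in miniature. Below is a user-space model of the problem and of the ownership-transfer fix, not kernel code; every identifier in it is invented for illustration, with analogues of ihold()/irele() standing in for the real reference counting.

/* Minimal user-space model of the bug and the fix (illustrative only). */
#include <stdio.h>
#include <stdlib.h>

struct inode {
	int	refcount;
	long	ino;
};

static struct inode *iget(long ino)
{
	struct inode *ip = malloc(sizeof(*ip));

	ip->refcount = 1;
	ip->ino = ino;
	return ip;
}

static void irele(struct inode *ip)
{
	if (--ip->refcount == 0)
		free(ip);			/* "reclaim" tears it down */
}

/* Deferred work that outlives the transaction that queued it. */
struct deferred_work {
	struct inode	*ip;
};

int main(void)
{
	struct inode		*ip = iget(128);
	struct deferred_work	dw = { .ip = ip };

	/*
	 * The buggy pattern dropped the only reference at commit time
	 * while dw.ip still pointed at the inode; any later use of dw.ip
	 * was then a use-after-free.  The fix has the capture structure
	 * take its own reference (like ihold()), so recovery can drop
	 * its reference safely and the work item releases the inode only
	 * when the work completes.
	 */
	ip->refcount++;			/* capture owns a reference now */
	irele(ip);			/* recovery's reference goes away */

	printf("deferred work still holds inode %ld\n", dw.ip->ino);
	irele(dw.ip);			/* work finished; inode freed */
	return 0;
}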
^ permalink raw reply [flat|nested] 29+ messages in thread* [PATCH 1/3] xfs: clean up bmap intent item recovery checking 2020-09-17 3:29 [PATCH 0/3] xfs: fix inode use-after-free during log recovery Darrick J. Wong @ 2020-09-17 3:29 ` Darrick J. Wong 2020-09-17 5:09 ` Dave Chinner 2020-09-23 7:16 ` Christoph Hellwig 2020-09-17 3:29 ` [PATCH 2/3] xfs: clean up xfs_bui_item_recover iget/trans_alloc/ilock ordering Darrick J. Wong 2020-09-17 3:29 ` [PATCH 3/3] xfs: fix an incore inode UAF in xfs_bui_recover Darrick J. Wong 2 siblings, 2 replies; 29+ messages in thread From: Darrick J. Wong @ 2020-09-17 3:29 UTC (permalink / raw) To: darrick.wong; +Cc: linux-xfs, david From: Darrick J. Wong <darrick.wong@oracle.com> The bmap intent item checking code in xfs_bui_item_recover is spread all over the function. We should check the recovered log item at the top before we allocate any resources or do anything else, so do that. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> --- fs/xfs/xfs_bmap_item.c | 57 ++++++++++++++++-------------------------------- 1 file changed, 19 insertions(+), 38 deletions(-) diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c index 0a0904a7650a..877afe76d76a 100644 --- a/fs/xfs/xfs_bmap_item.c +++ b/fs/xfs/xfs_bmap_item.c @@ -437,17 +437,13 @@ xfs_bui_item_recover( xfs_fsblock_t inode_fsb; xfs_filblks_t count; xfs_exntst_t state; - enum xfs_bmap_intent_type type; - bool op_ok; unsigned int bui_type; int whichfork; int error = 0; /* Only one mapping operation per BUI... */ - if (buip->bui_format.bui_nextents != XFS_BUI_MAX_FAST_EXTENTS) { - xfs_bui_release(buip); - return -EFSCORRUPTED; - } + if (buip->bui_format.bui_nextents != XFS_BUI_MAX_FAST_EXTENTS) + goto garbage; /* * First check the validity of the extent described by the @@ -458,29 +454,26 @@ xfs_bui_item_recover( XFS_FSB_TO_DADDR(mp, bmap->me_startblock)); inode_fsb = XFS_BB_TO_FSB(mp, XFS_FSB_TO_DADDR(mp, XFS_INO_TO_FSB(mp, bmap->me_owner))); - switch (bmap->me_flags & XFS_BMAP_EXTENT_TYPE_MASK) { + state = (bmap->me_flags & XFS_BMAP_EXTENT_UNWRITTEN) ? + XFS_EXT_UNWRITTEN : XFS_EXT_NORM; + whichfork = (bmap->me_flags & XFS_BMAP_EXTENT_ATTR_FORK) ? + XFS_ATTR_FORK : XFS_DATA_FORK; + bui_type = bmap->me_flags & XFS_BMAP_EXTENT_TYPE_MASK; + switch (bui_type) { case XFS_BMAP_MAP: case XFS_BMAP_UNMAP: - op_ok = true; break; default: - op_ok = false; - break; + goto garbage; } - if (!op_ok || startblock_fsb == 0 || + if (startblock_fsb == 0 || bmap->me_len == 0 || inode_fsb == 0 || startblock_fsb >= mp->m_sb.sb_dblocks || bmap->me_len >= mp->m_sb.sb_agblocks || inode_fsb >= mp->m_sb.sb_dblocks || - (bmap->me_flags & ~XFS_BMAP_EXTENT_FLAGS)) { - /* - * This will pull the BUI from the AIL and - * free the memory associated with it. - */ - xfs_bui_release(buip); - return -EFSCORRUPTED; - } + (bmap->me_flags & ~XFS_BMAP_EXTENT_FLAGS)) + goto garbage; error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK), 0, 0, &tp); @@ -501,32 +494,17 @@ xfs_bui_item_recover( if (VFS_I(ip)->i_nlink == 0) xfs_iflags_set(ip, XFS_IRECOVERY); - /* Process deferred bmap item. */ - state = (bmap->me_flags & XFS_BMAP_EXTENT_UNWRITTEN) ? - XFS_EXT_UNWRITTEN : XFS_EXT_NORM; - whichfork = (bmap->me_flags & XFS_BMAP_EXTENT_ATTR_FORK) ? 
- XFS_ATTR_FORK : XFS_DATA_FORK; - bui_type = bmap->me_flags & XFS_BMAP_EXTENT_TYPE_MASK; - switch (bui_type) { - case XFS_BMAP_MAP: - case XFS_BMAP_UNMAP: - type = bui_type; - break; - default: - XFS_ERROR_REPORT(__func__, XFS_ERRLEVEL_LOW, mp); - error = -EFSCORRUPTED; - goto err_inode; - } xfs_trans_ijoin(tp, ip, 0); count = bmap->me_len; - error = xfs_trans_log_finish_bmap_update(tp, budp, type, ip, whichfork, - bmap->me_startoff, bmap->me_startblock, &count, state); + error = xfs_trans_log_finish_bmap_update(tp, budp, bui_type, ip, + whichfork, bmap->me_startoff, bmap->me_startblock, + &count, state); if (error) goto err_inode; if (count > 0) { - ASSERT(type == XFS_BMAP_UNMAP); + ASSERT(bui_type == XFS_BMAP_UNMAP); irec.br_startblock = bmap->me_startblock; irec.br_blockcount = count; irec.br_startoff = bmap->me_startoff; @@ -546,6 +524,9 @@ xfs_bui_item_recover( xfs_irele(ip); } return error; +garbage: + xfs_bui_release(buip); + return -EFSCORRUPTED; } STATIC bool ^ permalink raw reply related [flat|nested] 29+ messages in thread
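The shape of the cleanup in patch 1/3 generalizes beyond this one function: validate the recovered log item completely before acquiring any resources, and funnel every validation failure through a single exit that releases the intent item. Here is a compact, self-contained sketch of that structure; the names mirror the patch but this is not the literal kernel function, and EFSCORRUPTED is defined locally as a stand-in for the kernel errno.

#define EFSCORRUPTED	117	/* local stand-in for the kernel errno */

struct bui_item {
	unsigned int	nextents;
	unsigned int	flags;
};

static void bui_release(struct bui_item *buip)
{
	(void)buip;	/* the real helper frees the intent item */
}

static int bui_item_recover(struct bui_item *buip)
{
	/* Check everything up front, before allocating anything. */
	if (buip->nextents != 1)	/* one mapping op per intent */
		goto garbage;
	if (buip->flags & ~0xfU)	/* unknown flag bits are fatal */
		goto garbage;

	/*
	 * Only now allocate the transaction, grab the inode and do the
	 * work; error paths from here unwind those resources instead.
	 */
	return 0;

garbage:
	bui_release(buip);
	return -EFSCORRUPTED;
}

int main(void)
{
	struct bui_item bad = { .nextents = 2, .flags = 0 };

	return bui_item_recover(&bad) ? 1 : 0;
}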
* Re: [PATCH 1/3] xfs: clean up bmap intent item recovery checking
  2020-09-17 3:29 ` [PATCH 1/3] xfs: clean up bmap intent item recovery checking Darrick J. Wong
@ 2020-09-17 5:09 ` Dave Chinner
  2020-09-23 7:16 ` Christoph Hellwig
  1 sibling, 0 replies; 29+ messages in thread
From: Dave Chinner @ 2020-09-17 5:09 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Sep 16, 2020 at 08:29:30PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
>
> The bmap intent item checking code in xfs_bui_item_recover is spread all
> over the function. We should check the recovered log item at the top
> before we allocate any resources or do anything else, so do that.
>
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/xfs_bmap_item.c | 57 ++++++++++++++++--------------------------
>  1 file changed, 19 insertions(+), 38 deletions(-)

Looks good - I have a very similar patch here....

Reviewed-by: Dave Chinner <dchinner@redhat.com>

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: [PATCH 1/3] xfs: clean up bmap intent item recovery checking 2020-09-17 3:29 ` [PATCH 1/3] xfs: clean up bmap intent item recovery checking Darrick J. Wong 2020-09-17 5:09 ` Dave Chinner @ 2020-09-23 7:16 ` Christoph Hellwig 2020-09-23 15:22 ` Darrick J. Wong 1 sibling, 1 reply; 29+ messages in thread From: Christoph Hellwig @ 2020-09-23 7:16 UTC (permalink / raw) To: Darrick J. Wong; +Cc: linux-xfs, david On Wed, Sep 16, 2020 at 08:29:30PM -0700, Darrick J. Wong wrote: > From: Darrick J. Wong <darrick.wong@oracle.com> > > The bmap intent item checking code in xfs_bui_item_recover is spread all > over the function. We should check the recovered log item at the top > before we allocate any resources or do anything else, so do that. > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> > --- > fs/xfs/xfs_bmap_item.c | 57 ++++++++++++++++-------------------------------- > 1 file changed, 19 insertions(+), 38 deletions(-) > > > diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c > index 0a0904a7650a..877afe76d76a 100644 > --- a/fs/xfs/xfs_bmap_item.c > +++ b/fs/xfs/xfs_bmap_item.c > @@ -437,17 +437,13 @@ xfs_bui_item_recover( > xfs_fsblock_t inode_fsb; > xfs_filblks_t count; > xfs_exntst_t state; > - enum xfs_bmap_intent_type type; > - bool op_ok; > unsigned int bui_type; > int whichfork; > int error = 0; > > /* Only one mapping operation per BUI... */ > - if (buip->bui_format.bui_nextents != XFS_BUI_MAX_FAST_EXTENTS) { > - xfs_bui_release(buip); > - return -EFSCORRUPTED; > - } > + if (buip->bui_format.bui_nextents != XFS_BUI_MAX_FAST_EXTENTS) > + goto garbage; We don't really need the xfs_bui_release any more, and can stick to plain "return -EFSCORRUPTED" instead of the goto, but I suspect the previous patch has taken care of that and you've rebased already? Otherwise looks good: Reviewed-by: Christoph Hellwig <hch@lst.de> ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 1/3] xfs: clean up bmap intent item recovery checking 2020-09-23 7:16 ` Christoph Hellwig @ 2020-09-23 15:22 ` Darrick J. Wong 0 siblings, 0 replies; 29+ messages in thread From: Darrick J. Wong @ 2020-09-23 15:22 UTC (permalink / raw) To: Christoph Hellwig; +Cc: linux-xfs, david On Wed, Sep 23, 2020 at 08:16:14AM +0100, Christoph Hellwig wrote: > On Wed, Sep 16, 2020 at 08:29:30PM -0700, Darrick J. Wong wrote: > > From: Darrick J. Wong <darrick.wong@oracle.com> > > > > The bmap intent item checking code in xfs_bui_item_recover is spread all > > over the function. We should check the recovered log item at the top > > before we allocate any resources or do anything else, so do that. > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> > > --- > > fs/xfs/xfs_bmap_item.c | 57 ++++++++++++++++-------------------------------- > > 1 file changed, 19 insertions(+), 38 deletions(-) > > > > > > diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c > > index 0a0904a7650a..877afe76d76a 100644 > > --- a/fs/xfs/xfs_bmap_item.c > > +++ b/fs/xfs/xfs_bmap_item.c > > @@ -437,17 +437,13 @@ xfs_bui_item_recover( > > xfs_fsblock_t inode_fsb; > > xfs_filblks_t count; > > xfs_exntst_t state; > > - enum xfs_bmap_intent_type type; > > - bool op_ok; > > unsigned int bui_type; > > int whichfork; > > int error = 0; > > > > /* Only one mapping operation per BUI... */ > > - if (buip->bui_format.bui_nextents != XFS_BUI_MAX_FAST_EXTENTS) { > > - xfs_bui_release(buip); > > - return -EFSCORRUPTED; > > - } > > + if (buip->bui_format.bui_nextents != XFS_BUI_MAX_FAST_EXTENTS) > > + goto garbage; > > We don't really need the xfs_bui_release any more, and can stick to > plain "return -EFSCORRUPTED" instead of the goto, but I suspect the > previous patch has taken care of that and you've rebased already? Yep. --D > Otherwise looks good: > > Reviewed-by: Christoph Hellwig <hch@lst.de> ^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH 2/3] xfs: clean up xfs_bui_item_recover iget/trans_alloc/ilock ordering 2020-09-17 3:29 [PATCH 0/3] xfs: fix inode use-after-free during log recovery Darrick J. Wong 2020-09-17 3:29 ` [PATCH 1/3] xfs: clean up bmap intent item recovery checking Darrick J. Wong @ 2020-09-17 3:29 ` Darrick J. Wong 2020-09-17 5:13 ` Dave Chinner 2020-09-17 7:08 ` [PATCH v2 " Darrick J. Wong 2020-09-17 3:29 ` [PATCH 3/3] xfs: fix an incore inode UAF in xfs_bui_recover Darrick J. Wong 2 siblings, 2 replies; 29+ messages in thread From: Darrick J. Wong @ 2020-09-17 3:29 UTC (permalink / raw) To: darrick.wong; +Cc: linux-xfs, david From: Darrick J. Wong <darrick.wong@oracle.com> In most places in XFS, we have a specific order in which we gather resources: grab the inode, allocate a transaction, then lock the inode. xfs_bui_item_recover doesn't do it in that order, so fix it to be more consistent. This also makes the error bailout code a bit less weird. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> --- fs/xfs/xfs_bmap_item.c | 40 +++++++++++++++++++++------------------- 1 file changed, 21 insertions(+), 19 deletions(-) diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c index 877afe76d76a..6f589f04f358 100644 --- a/fs/xfs/xfs_bmap_item.c +++ b/fs/xfs/xfs_bmap_item.c @@ -475,25 +475,26 @@ xfs_bui_item_recover( (bmap->me_flags & ~XFS_BMAP_EXTENT_FLAGS)) goto garbage; - error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, - XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK), 0, 0, &tp); - if (error) - return error; - - budp = xfs_trans_get_bud(tp, buip); - /* Grab the inode. */ - error = xfs_iget(mp, tp, bmap->me_owner, 0, XFS_ILOCK_EXCL, &ip); + error = xfs_iget(mp, NULL, bmap->me_owner, 0, 0, &ip); if (error) - goto err_inode; + return error; error = xfs_qm_dqattach(ip); if (error) - goto err_inode; + goto err_rele; if (VFS_I(ip)->i_nlink == 0) xfs_iflags_set(ip, XFS_IRECOVERY); + /* Allocate transaction and do the work. */ + error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, + XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK), 0, 0, &tp); + if (error) + goto err_rele; + + budp = xfs_trans_get_bud(tp, buip); + xfs_ilock(ip, XFS_ILOCK_EXCL); xfs_trans_ijoin(tp, ip, 0); count = bmap->me_len; @@ -501,7 +502,7 @@ xfs_bui_item_recover( whichfork, bmap->me_startoff, bmap->me_startblock, &count, state); if (error) - goto err_inode; + goto err_cancel; if (count > 0) { ASSERT(bui_type == XFS_BMAP_UNMAP); @@ -512,18 +513,19 @@ xfs_bui_item_recover( xfs_bmap_unmap_extent(tp, ip, &irec); } + /* Commit transaction, which frees tp. */ error = xlog_recover_trans_commit(tp, dfcp); + if (error) + goto err_unlock; + return 0; + +err_cancel: + xfs_trans_cancel(tp); +err_unlock: xfs_iunlock(ip, XFS_ILOCK_EXCL); +err_rele: xfs_irele(ip); return error; - -err_inode: - xfs_trans_cancel(tp); - if (ip) { - xfs_iunlock(ip, XFS_ILOCK_EXCL); - xfs_irele(ip); - } - return error; garbage: xfs_bui_release(buip); return -EFSCORRUPTED; ^ permalink raw reply related [flat|nested] 29+ messages in thread
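The ordering this patch establishes, iget then trans_alloc then ilock, also dictates the unwind order on failure, which is where the straightened-out goto ladder comes from. A self-contained sketch of that acquire-in-order, release-in-reverse idiom follows; the helpers are empty stubs standing in for the XFS calls, and what happens to the inode after a successful commit is exactly what the later patches in this series refine.

static int grab_inode(void)		{ return 0; }	/* ~ xfs_iget */
static int alloc_transaction(void)	{ return 0; }	/* ~ xfs_trans_alloc */
static void lock_inode(void)		{ }		/* ~ xfs_ilock */
static void unlock_inode(void)		{ }		/* ~ xfs_iunlock */
static void release_inode(void)		{ }		/* ~ xfs_irele */
static void cancel_transaction(void)	{ }		/* ~ xfs_trans_cancel */
static int do_work(void)		{ return 0; }
static int commit(void)			{ return 0; }	/* ~ commit helper */

static int recover_one(void)
{
	int error;

	error = grab_inode();		/* 1: take an inode reference */
	if (error)
		return error;
	error = alloc_transaction();	/* 2: reserve log space */
	if (error)
		goto err_rele;
	lock_inode();			/* 3: take ILOCK_EXCL */

	error = do_work();
	if (error)
		goto err_cancel;
	return commit();		/* success path */

err_cancel:				/* unwind in reverse order */
	cancel_transaction();
	unlock_inode();
err_rele:
	release_inode();
	return error;
}

int main(void)
{
	return recover_one();
}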
* Re: [PATCH 2/3] xfs: clean up xfs_bui_item_recover iget/trans_alloc/ilock ordering
  2020-09-17 3:29 ` [PATCH 2/3] xfs: clean up xfs_bui_item_recover iget/trans_alloc/ilock ordering Darrick J. Wong
@ 2020-09-17 5:13 ` Dave Chinner
  2020-09-17 6:47 ` Darrick J. Wong
  0 siblings, 1 reply; 29+ messages in thread
From: Dave Chinner @ 2020-09-17 5:13 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Sep 16, 2020 at 08:29:36PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
>
> In most places in XFS, we have a specific order in which we gather
> resources: grab the inode, allocate a transaction, then lock the inode.
> xfs_bui_item_recover doesn't do it in that order, so fix it to be more
> consistent. This also makes the error bailout code a bit less weird.
>
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/xfs_bmap_item.c | 40 +++++++++++++++++++++-------------------
>  1 file changed, 21 insertions(+), 19 deletions(-)

This probably needs to go before the xfs_qm_dqattach() fix, or the
dqattach fix needs to come after this....

>
>
> diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
> index 877afe76d76a..6f589f04f358 100644
> --- a/fs/xfs/xfs_bmap_item.c
> +++ b/fs/xfs/xfs_bmap_item.c
> @@ -475,25 +475,26 @@ xfs_bui_item_recover(
>  	    (bmap->me_flags & ~XFS_BMAP_EXTENT_FLAGS))
>  		goto garbage;
>
> -	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate,
> -			XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK), 0, 0, &tp);
> -	if (error)
> -		return error;
> -
> -	budp = xfs_trans_get_bud(tp, buip);
> -
>  	/* Grab the inode. */
> -	error = xfs_iget(mp, tp, bmap->me_owner, 0, XFS_ILOCK_EXCL, &ip);
> +	error = xfs_iget(mp, NULL, bmap->me_owner, 0, 0, &ip);
>  	if (error)
> -		goto err_inode;
> +		return error;
>
>  	error = xfs_qm_dqattach(ip);
>  	if (error)
> -		goto err_inode;
> +		goto err_rele;
>
>  	if (VFS_I(ip)->i_nlink == 0)
>  		xfs_iflags_set(ip, XFS_IRECOVERY);
>
> +	/* Allocate transaction and do the work. */
> +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate,
> +			XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK), 0, 0, &tp);
> +	if (error)
> +		goto err_rele;

Hmmmm - don't all the error cases before we call xfs_trans_get_bud()
need to release the bui?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: [PATCH 2/3] xfs: clean up xfs_bui_item_recover iget/trans_alloc/ilock ordering 2020-09-17 5:13 ` Dave Chinner @ 2020-09-17 6:47 ` Darrick J. Wong 0 siblings, 0 replies; 29+ messages in thread From: Darrick J. Wong @ 2020-09-17 6:47 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs On Thu, Sep 17, 2020 at 03:13:33PM +1000, Dave Chinner wrote: > On Wed, Sep 16, 2020 at 08:29:36PM -0700, Darrick J. Wong wrote: > > From: Darrick J. Wong <darrick.wong@oracle.com> > > > > In most places in XFS, we have a specific order in which we gather > > resources: grab the inode, allocate a transaction, then lock the inode. > > xfs_bui_item_recover doesn't do it in that order, so fix it to be more > > consistent. This also makes the error bailout code a bit less weird. > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> > > --- > > fs/xfs/xfs_bmap_item.c | 40 +++++++++++++++++++++------------------- > > 1 file changed, 21 insertions(+), 19 deletions(-) > > This probably needs to go before the xfs_qm_dqattach() fix, or > the dqattach fix need to come after this.... <nod> I'll fix the previous patch. > > > > diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c > > index 877afe76d76a..6f589f04f358 100644 > > --- a/fs/xfs/xfs_bmap_item.c > > +++ b/fs/xfs/xfs_bmap_item.c > > @@ -475,25 +475,26 @@ xfs_bui_item_recover( > > (bmap->me_flags & ~XFS_BMAP_EXTENT_FLAGS)) > > goto garbage; > > > > - error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, > > - XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK), 0, 0, &tp); > > - if (error) > > - return error; > > - > > - budp = xfs_trans_get_bud(tp, buip); > > - > > /* Grab the inode. */ > > - error = xfs_iget(mp, tp, bmap->me_owner, 0, XFS_ILOCK_EXCL, &ip); > > + error = xfs_iget(mp, NULL, bmap->me_owner, 0, 0, &ip); > > if (error) > > - goto err_inode; > > + return error; > > > > error = xfs_qm_dqattach(ip); > > if (error) > > - goto err_inode; > > + goto err_rele; > > > > if (VFS_I(ip)->i_nlink == 0) > > xfs_iflags_set(ip, XFS_IRECOVERY); > > > > + /* Allocate transaction and do the work. */ > > + error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, > > + XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK), 0, 0, &tp); > > + if (error) > > + goto err_rele; > > Hmmmm - don't all the error cased before we call xfs_trans_get_bud() > need to release the bui? Yes, I think so. Come to think of it, the other intent items seem like they have the same bug. --D > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com ^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH v2 2/3] xfs: clean up xfs_bui_item_recover iget/trans_alloc/ilock ordering 2020-09-17 3:29 ` [PATCH 2/3] xfs: clean up xfs_bui_item_recover iget/trans_alloc/ilock ordering Darrick J. Wong 2020-09-17 5:13 ` Dave Chinner @ 2020-09-17 7:08 ` Darrick J. Wong 2020-09-17 8:19 ` Dave Chinner 2020-09-23 7:17 ` Christoph Hellwig 1 sibling, 2 replies; 29+ messages in thread From: Darrick J. Wong @ 2020-09-17 7:08 UTC (permalink / raw) To: linux-xfs, david From: Darrick J. Wong <darrick.wong@oracle.com> In most places in XFS, we have a specific order in which we gather resources: grab the inode, allocate a transaction, then lock the inode. xfs_bui_item_recover doesn't do it in that order, so fix it to be more consistent. This also makes the error bailout code a bit less weird. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> --- v2: fix the error-out paths to free the BUI if the BUD hasn't been created --- fs/xfs/xfs_bmap_item.c | 49 +++++++++++++++++++++++++----------------------- 1 file changed, 26 insertions(+), 23 deletions(-) diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c index 4e5aa29f75b7..f088cfd495bd 100644 --- a/fs/xfs/xfs_bmap_item.c +++ b/fs/xfs/xfs_bmap_item.c @@ -432,7 +432,7 @@ xfs_bui_item_recover( struct xfs_inode *ip = NULL; struct xfs_mount *mp = lip->li_mountp; struct xfs_map_extent *bmap; - struct xfs_bud_log_item *budp; + struct xfs_bud_log_item *budp = NULL; xfs_fsblock_t startblock_fsb; xfs_fsblock_t inode_fsb; xfs_filblks_t count; @@ -475,27 +475,26 @@ xfs_bui_item_recover( (bmap->me_flags & ~XFS_BMAP_EXTENT_FLAGS)) goto garbage; - error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, - XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK), 0, 0, &tp); - if (error) { - xfs_bui_release(buip); - return error; - } - - budp = xfs_trans_get_bud(tp, buip); - /* Grab the inode. */ - error = xfs_iget(mp, tp, bmap->me_owner, 0, XFS_ILOCK_EXCL, &ip); + error = xfs_iget(mp, NULL, bmap->me_owner, 0, 0, &ip); if (error) - goto err_inode; + goto err_bui; - error = xfs_qm_dqattach_locked(ip, false); + error = xfs_qm_dqattach(ip); if (error) - goto err_inode; + goto err_rele; if (VFS_I(ip)->i_nlink == 0) xfs_iflags_set(ip, XFS_IRECOVERY); + /* Allocate transaction and do the work. */ + error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, + XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK), 0, 0, &tp); + if (error) + goto err_rele; + + budp = xfs_trans_get_bud(tp, buip); + xfs_ilock(ip, XFS_ILOCK_EXCL); xfs_trans_ijoin(tp, ip, 0); count = bmap->me_len; @@ -503,7 +502,7 @@ xfs_bui_item_recover( whichfork, bmap->me_startoff, bmap->me_startblock, &count, state); if (error) - goto err_inode; + goto err_cancel; if (count > 0) { ASSERT(bui_type == XFS_BMAP_UNMAP); @@ -514,17 +513,21 @@ xfs_bui_item_recover( xfs_bmap_unmap_extent(tp, ip, &irec); } + /* Commit transaction, which frees tp. */ error = xlog_recover_trans_commit(tp, dfcp); + if (error) + goto err_unlock; + return 0; + +err_cancel: + xfs_trans_cancel(tp); +err_unlock: xfs_iunlock(ip, XFS_ILOCK_EXCL); +err_rele: xfs_irele(ip); - return error; - -err_inode: - xfs_trans_cancel(tp); - if (ip) { - xfs_iunlock(ip, XFS_ILOCK_EXCL); - xfs_irele(ip); - } +err_bui: + if (!budp) + xfs_bui_release(buip); return error; garbage: xfs_bui_release(buip); ^ permalink raw reply related [flat|nested] 29+ messages in thread
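The budp = NULL initialization and the err_bui label in this revision encode an ownership rule worth stating on its own: until the done item (BUD) is attached to the transaction, error paths still own the intent item (BUI) and must release it themselves; afterwards, cancelling the transaction takes care of the pair. A generic sketch of that rule, with invented names rather than the XFS API:

#include <stddef.h>

struct intent { int released; };
struct done_item { struct intent *intent; };

static void intent_release(struct intent *i)
{
	i->released = 1;	/* the real helper frees the intent */
}

static int acquire_resources(void)
{
	return 0;		/* inode, dquots, transaction, ... */
}

/* After this, cancelling the transaction releases the intent/done pair. */
static struct done_item *attach_done_item(struct intent *i)
{
	static struct done_item d;

	d.intent = i;
	return &d;
}

static int recover(struct intent *buip)
{
	struct done_item *budp = NULL;
	int error;

	error = acquire_resources();	/* failures here still own the BUI */
	if (error)
		goto err_bui;

	budp = attach_done_item(buip);	/* ~ xfs_trans_get_bud */
	return 0;

err_bui:
	if (!budp)			/* no done item yet: release it */
		intent_release(buip);
	return error;
}

int main(void)
{
	struct intent bui = { 0 };

	return recover(&bui);
}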
* Re: [PATCH v2 2/3] xfs: clean up xfs_bui_item_recover iget/trans_alloc/ilock ordering 2020-09-17 7:08 ` [PATCH v2 " Darrick J. Wong @ 2020-09-17 8:19 ` Dave Chinner 2020-09-23 7:17 ` Christoph Hellwig 1 sibling, 0 replies; 29+ messages in thread From: Dave Chinner @ 2020-09-17 8:19 UTC (permalink / raw) To: Darrick J. Wong; +Cc: linux-xfs On Thu, Sep 17, 2020 at 12:08:02AM -0700, Darrick J. Wong wrote: > From: Darrick J. Wong <darrick.wong@oracle.com> > > In most places in XFS, we have a specific order in which we gather > resources: grab the inode, allocate a transaction, then lock the inode. > xfs_bui_item_recover doesn't do it in that order, so fix it to be more > consistent. This also makes the error bailout code a bit less weird. > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> > --- > v2: fix the error-out paths to free the BUI if the BUD hasn't been > created > --- > fs/xfs/xfs_bmap_item.c | 49 +++++++++++++++++++++++++----------------------- > 1 file changed, 26 insertions(+), 23 deletions(-) > > diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c > index 4e5aa29f75b7..f088cfd495bd 100644 > --- a/fs/xfs/xfs_bmap_item.c > +++ b/fs/xfs/xfs_bmap_item.c > @@ -432,7 +432,7 @@ xfs_bui_item_recover( > struct xfs_inode *ip = NULL; > struct xfs_mount *mp = lip->li_mountp; > struct xfs_map_extent *bmap; > - struct xfs_bud_log_item *budp; > + struct xfs_bud_log_item *budp = NULL; > xfs_fsblock_t startblock_fsb; > xfs_fsblock_t inode_fsb; > xfs_filblks_t count; > @@ -475,27 +475,26 @@ xfs_bui_item_recover( > (bmap->me_flags & ~XFS_BMAP_EXTENT_FLAGS)) > goto garbage; > > - error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, > - XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK), 0, 0, &tp); > - if (error) { > - xfs_bui_release(buip); > - return error; > - } > - > - budp = xfs_trans_get_bud(tp, buip); > - > /* Grab the inode. */ > - error = xfs_iget(mp, tp, bmap->me_owner, 0, XFS_ILOCK_EXCL, &ip); > + error = xfs_iget(mp, NULL, bmap->me_owner, 0, 0, &ip); > if (error) > - goto err_inode; > + goto err_bui; err_bui could be combined with the garbage error return if error is initialised to -EFSCORRUPTED. Otherwise it looks OK. Reviewed-by: Dave Chinner <dchinner@redhat.com> Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v2 2/3] xfs: clean up xfs_bui_item_recover iget/trans_alloc/ilock ordering
  2020-09-17 7:08 ` [PATCH v2 " Darrick J. Wong
  2020-09-17 8:19 ` Dave Chinner
@ 2020-09-23 7:17 ` Christoph Hellwig
  1 sibling, 0 replies; 29+ messages in thread
From: Christoph Hellwig @ 2020-09-23 7:17 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, david

On Thu, Sep 17, 2020 at 12:08:02AM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
>
> In most places in XFS, we have a specific order in which we gather
> resources: grab the inode, allocate a transaction, then lock the inode.
> xfs_bui_item_recover doesn't do it in that order, so fix it to be more
> consistent. This also makes the error bailout code a bit less weird.
>
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 29+ messages in thread
* [PATCH 3/3] xfs: fix an incore inode UAF in xfs_bui_recover 2020-09-17 3:29 [PATCH 0/3] xfs: fix inode use-after-free during log recovery Darrick J. Wong 2020-09-17 3:29 ` [PATCH 1/3] xfs: clean up bmap intent item recovery checking Darrick J. Wong 2020-09-17 3:29 ` [PATCH 2/3] xfs: clean up xfs_bui_item_recover iget/trans_alloc/ilock ordering Darrick J. Wong @ 2020-09-17 3:29 ` Darrick J. Wong 2020-09-23 7:20 ` Christoph Hellwig 2020-09-24 6:06 ` [PATCH v2 " Darrick J. Wong 2 siblings, 2 replies; 29+ messages in thread From: Darrick J. Wong @ 2020-09-17 3:29 UTC (permalink / raw) To: darrick.wong; +Cc: linux-xfs, david From: Darrick J. Wong <darrick.wong@oracle.com> In xfs_bui_item_recover, there exists a use-after-free bug with regards to the inode that is involved in the bmap replay operation. If the mapping operation does not complete, we call xfs_bmap_unmap_extent to create a deferred op to finish the unmapping work, and we retain a pointer to the incore inode. Unfortunately, the very next thing we do is commit the transaction and drop the inode. If reclaim tears down the inode before we try to finish the defer ops, we dereference garbage and blow up. Therefore, create a way to join inodes to the defer ops freezer so that we can maintain the xfs_inode reference until we're done with the inode. Note: This imposes the requirement that there be enough memory to keep every incore inode in memory throughout recovery. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> --- fs/xfs/libxfs/xfs_defer.c | 57 +++++++++++++++++++++++++++++++++++++-- fs/xfs/libxfs/xfs_defer.h | 21 ++++++++++++++ fs/xfs/libxfs/xfs_log_recover.h | 14 ++++++++-- fs/xfs/xfs_bmap_item.c | 8 +---- fs/xfs/xfs_icache.c | 41 ++++++++++++++++++++++++++++ fs/xfs/xfs_log_recover.c | 32 ++++++++++++++++++---- fs/xfs/xfs_trans.h | 6 ---- 7 files changed, 156 insertions(+), 23 deletions(-) diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c index edfb3b9856c8..97523b394932 100644 --- a/fs/xfs/libxfs/xfs_defer.c +++ b/fs/xfs/libxfs/xfs_defer.c @@ -16,6 +16,7 @@ #include "xfs_inode.h" #include "xfs_inode_item.h" #include "xfs_trace.h" +#include "xfs_icache.h" /* * Deferred Operations in XFS @@ -555,6 +556,18 @@ xfs_defer_move( xfs_defer_reset(stp); } +/* Unlock the inodes attached to this dfops capture device. */ +static void +xfs_defer_capture_iunlock( + struct xfs_defer_capture *dfc) +{ + unsigned int i; + + for (i = 0; i < XFS_DEFER_OPS_NR_INODES && dfc->dfc_inodes[i]; i++) + xfs_iunlock(dfc->dfc_inodes[i], XFS_ILOCK_EXCL); + dfc->dfc_ilocked = false; +} + /* * Prepare a chain of fresh deferred ops work items to be completed later. Log * recovery requires the ability to put off until later the actual finishing @@ -568,14 +581,21 @@ xfs_defer_move( * transaction is committed. * * Note that the capture state is passed up to the caller and must be freed - * even if the transaction commit returns error. + * even if the transaction commit returns error. If inodes were passed in and + * a state capture structure was returned, the inodes are now owned by the + * state capture structure and the caller must not touch the inodes. + * + * If no structure is returned, the caller still owns the inodes. 
*/ int xfs_defer_capture( struct xfs_trans *tp, - struct xfs_defer_capture **dfcp) + struct xfs_defer_capture **dfcp, + struct xfs_inode *ip1, + struct xfs_inode *ip2) { struct xfs_defer_capture *dfc = NULL; + int error; if (!list_empty(&tp->t_dfops)) { dfc = kmem_zalloc(sizeof(*dfc), KM_NOFS); @@ -594,10 +614,31 @@ xfs_defer_capture( dfc->dfc_tres.tr_logcount = tp->t_log_count; dfc->dfc_tres.tr_logflags = XFS_TRANS_PERM_LOG_RES; xfs_defer_reset(tp); + + /* + * Transfer responsibility for unlocking and releasing the + * inodes to the capture structure. + */ + if (!ip1) + ip1 = ip2; + dfc->dfc_ilocked = true; + dfc->dfc_inodes[0] = ip1; + if (ip2 != ip1) + dfc->dfc_inodes[1] = ip2; } *dfcp = dfc; - return xfs_trans_commit(tp); + error = xfs_trans_commit(tp); + if (error) + return error; + + /* + * Now that we've committed the transaction, it's safe to unlock the + * inodes that were passed in if we've taken over their ownership. + */ + if (dfc) + xfs_defer_capture_iunlock(dfc); + return 0; } /* Attach a chain of captured deferred ops to a new transaction. */ @@ -609,6 +650,13 @@ xfs_defer_continue( ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES); ASSERT(!(tp->t_flags & XFS_TRANS_DIRTY)); + /* Lock and join the inodes to the new transaction. */ + xfs_defer_continue_inodes(dfc, tp); + if (dfc->dfc_inodes[0]) + xfs_trans_ijoin(tp, dfc->dfc_inodes[0], 0); + if (dfc->dfc_inodes[1]) + xfs_trans_ijoin(tp, dfc->dfc_inodes[1], 0); + /* Move captured dfops chain and state to the transaction. */ list_splice_init(&dfc->dfc_dfops, &tp->t_dfops); tp->t_flags |= dfc->dfc_tpflags; @@ -624,5 +672,8 @@ xfs_defer_capture_free( struct xfs_defer_capture *dfc) { xfs_defer_cancel_list(mp, &dfc->dfc_dfops); + if (dfc->dfc_ilocked) + xfs_defer_capture_iunlock(dfc); + xfs_defer_capture_irele(dfc); kmem_free(dfc); } diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h index a788855aef69..4663a9026545 100644 --- a/fs/xfs/libxfs/xfs_defer.h +++ b/fs/xfs/libxfs/xfs_defer.h @@ -10,6 +10,12 @@ struct xfs_btree_cur; struct xfs_defer_op_type; struct xfs_defer_capture; +/* + * Deferred operation item relogging limits. + */ +#define XFS_DEFER_OPS_NR_INODES 2 /* join up to two inodes */ +#define XFS_DEFER_OPS_NR_BUFS 2 /* join up to two buffers */ + /* * Header for deferred operation list. */ @@ -78,15 +84,28 @@ struct xfs_defer_capture { unsigned int dfc_tpflags; unsigned int dfc_blkres; struct xfs_trans_res dfc_tres; + + /* + * Inodes to hold when we want to finish the deferred work items. + * Always set the first element before setting the second. + */ + bool dfc_ilocked; + struct xfs_inode *dfc_inodes[XFS_DEFER_OPS_NR_INODES]; }; /* * Functions to capture a chain of deferred operations and continue them later. * This doesn't normally happen except log recovery. */ -int xfs_defer_capture(struct xfs_trans *tp, struct xfs_defer_capture **dfcp); +int xfs_defer_capture(struct xfs_trans *tp, struct xfs_defer_capture **dfcp, + struct xfs_inode *ip1, struct xfs_inode *ip2); void xfs_defer_continue(struct xfs_defer_capture *dfc, struct xfs_trans *tp); void xfs_defer_capture_free(struct xfs_mount *mp, struct xfs_defer_capture *dfc); +/* These functions must be provided by the xfs implementation. 
*/ +void xfs_defer_continue_inodes(struct xfs_defer_capture *dfc, + struct xfs_trans *tp); +void xfs_defer_capture_irele(struct xfs_defer_capture *dfc); + #endif /* __XFS_DEFER_H__ */ diff --git a/fs/xfs/libxfs/xfs_log_recover.h b/fs/xfs/libxfs/xfs_log_recover.h index c3563c5c033c..e8aba7c6e851 100644 --- a/fs/xfs/libxfs/xfs_log_recover.h +++ b/fs/xfs/libxfs/xfs_log_recover.h @@ -127,7 +127,17 @@ void xlog_recover_iodone(struct xfs_buf *bp); void xlog_recover_release_intent(struct xlog *log, unsigned short intent_type, uint64_t intent_id); -int xlog_recover_trans_commit(struct xfs_trans *tp, - struct xfs_defer_capture **dfcp); +int xlog_recover_trans_commit_inodes(struct xfs_trans *tp, + struct xfs_defer_capture **dfcp, struct xfs_inode *ip1, + struct xfs_inode *ip2); + +static inline int +xlog_recover_trans_commit( + struct xfs_trans *tp, + struct xfs_defer_capture **dfcp) +{ + return xlog_recover_trans_commit_inodes(tp, dfcp, NULL, NULL); +} + #endif /* __XFS_LOG_RECOVER_H__ */ diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c index 6f589f04f358..8461285a9dd9 100644 --- a/fs/xfs/xfs_bmap_item.c +++ b/fs/xfs/xfs_bmap_item.c @@ -513,15 +513,11 @@ xfs_bui_item_recover( xfs_bmap_unmap_extent(tp, ip, &irec); } - /* Commit transaction, which frees tp. */ - error = xlog_recover_trans_commit(tp, dfcp); - if (error) - goto err_unlock; - return 0; + /* Commit transaction, which frees the transaction and the inode. */ + return xlog_recover_trans_commit_inodes(tp, dfcp, ip, NULL); err_cancel: xfs_trans_cancel(tp); -err_unlock: xfs_iunlock(ip, XFS_ILOCK_EXCL); err_rele: xfs_irele(ip); diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c index 101028ebb571..5b1202cac4ea 100644 --- a/fs/xfs/xfs_icache.c +++ b/fs/xfs/xfs_icache.c @@ -12,6 +12,7 @@ #include "xfs_sb.h" #include "xfs_mount.h" #include "xfs_inode.h" +#include "xfs_defer.h" #include "xfs_trans.h" #include "xfs_trans_priv.h" #include "xfs_inode_item.h" @@ -1692,3 +1693,43 @@ xfs_start_block_reaping( xfs_queue_eofblocks(mp); xfs_queue_cowblocks(mp); } + +/* + * Prepare the inodes to participate in further log intent item recovery. + * For now, that means attaching dquots and locking them, since libxfs doesn't + * know how to do that. + */ +void +xfs_defer_continue_inodes( + struct xfs_defer_capture *dfc, + struct xfs_trans *tp) +{ + int i; + int error; + + for (i = 0; i < XFS_DEFER_OPS_NR_INODES && dfc->dfc_inodes[i]; i++) { + error = xfs_qm_dqattach(dfc->dfc_inodes[i]); + if (error) + tp->t_mountp->m_qflags &= ~XFS_ALL_QUOTA_CHKD; + } + + if (dfc->dfc_inodes[1]) + xfs_lock_two_inodes(dfc->dfc_inodes[0], XFS_ILOCK_EXCL, + dfc->dfc_inodes[1], XFS_ILOCK_EXCL); + else if (dfc->dfc_inodes[0]) + xfs_ilock(dfc->dfc_inodes[0], XFS_ILOCK_EXCL); + dfc->dfc_ilocked = true; +} + +/* Release all the inodes attached to this dfops capture device. */ +void +xfs_defer_capture_irele( + struct xfs_defer_capture *dfc) +{ + unsigned int i; + + for (i = 0; i < XFS_DEFER_OPS_NR_INODES && dfc->dfc_inodes[i]; i++) { + xfs_irele(dfc->dfc_inodes[i]); + dfc->dfc_inodes[i] = NULL; + } +} diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c index f5748c8f5157..9ac2726d42b4 100644 --- a/fs/xfs/xfs_log_recover.c +++ b/fs/xfs/xfs_log_recover.c @@ -1791,16 +1791,38 @@ xlog_recover_release_intent( spin_unlock(&ailp->ail_lock); } +static inline void +xlog_recover_irele( + struct xfs_inode *ip) +{ + xfs_iunlock(ip, XFS_ILOCK_EXCL); + xfs_irele(ip); +} + /* - * Freeze any deferred ops and commit the transaction. 
This is the last step - * needed to finish a log intent item that we recovered from the log. + * Freeze any deferred ops, commit the transaction, and deal with the inodes. + * This is the last step needed to finish a log intent item that we recovered + * from the log. If we captured deferred ops, the inodes are attached to it + * and must not be touched. If not, we have to unlock and free them ourselves. */ int -xlog_recover_trans_commit( +xlog_recover_trans_commit_inodes( struct xfs_trans *tp, - struct xfs_defer_capture **dfcp) + struct xfs_defer_capture **dfcp, + struct xfs_inode *ip1, + struct xfs_inode *ip2) { - return xfs_defer_capture(tp, dfcp); + int error; + + error = xfs_defer_capture(tp, dfcp, ip1, ip2); + if (*dfcp) + return error; + + if (ip2 && ip2 != ip1) + xlog_recover_irele(ip2); + if (ip1) + xlog_recover_irele(ip1); + return error; } /****************************************************************************** diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h index c88240961aa1..995c1513693c 100644 --- a/fs/xfs/xfs_trans.h +++ b/fs/xfs/xfs_trans.h @@ -97,12 +97,6 @@ void xfs_log_item_init(struct xfs_mount *mp, struct xfs_log_item *item, #define XFS_ITEM_LOCKED 2 #define XFS_ITEM_FLUSHING 3 -/* - * Deferred operation item relogging limits. - */ -#define XFS_DEFER_OPS_NR_INODES 2 /* join up to two inodes */ -#define XFS_DEFER_OPS_NR_BUFS 2 /* join up to two buffers */ - /* * This is the structure maintained for every active transaction. */ ^ permalink raw reply related [flat|nested] 29+ messages in thread
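The heart of the mechanism in this patch is the pair of inode slots on the capture structure plus the dfc_ilocked flag tracking whether the capture currently holds the ILOCKs. Here is a stand-alone user-space model of that state machine; the structure and helper mirror the patch in shape only and are not the kernel types.

#include <stdbool.h>
#include <stdio.h>

#define DEFER_OPS_NR_INODES	2	/* join up to two inodes */

struct inode { long ino; };

struct defer_capture {
	bool		ilocked;
	struct inode	*inodes[DEFER_OPS_NR_INODES];
};

static void ilock(struct inode *ip)   { printf("lock %ld\n", ip->ino); }
static void iunlock(struct inode *ip) { printf("unlock %ld\n", ip->ino); }

/*
 * Mirrors the shape of xfs_defer_capture_iunlock in the patch: drop
 * every ILOCK the capture holds and record that it holds none.
 */
static void capture_iunlock(struct defer_capture *dfc)
{
	unsigned int i;

	for (i = 0; i < DEFER_OPS_NR_INODES && dfc->inodes[i]; i++)
		iunlock(dfc->inodes[i]);
	dfc->ilocked = false;
}

int main(void)
{
	struct inode a = { 128 }, b = { 129 };
	struct defer_capture dfc = { .inodes = { &a, &b } };

	ilock(&a);			/* taken by the recovery function */
	ilock(&b);
	dfc.ilocked = true;		/* capture takes over the locks */

	capture_iunlock(&dfc);		/* after the commit succeeds */
	return 0;
}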
* Re: [PATCH 3/3] xfs: fix an incore inode UAF in xfs_bui_recover
  2020-09-17 3:29 ` [PATCH 3/3] xfs: fix an incore inode UAF in xfs_bui_recover Darrick J. Wong
@ 2020-09-23 7:20 ` Christoph Hellwig
  2020-09-23 15:55 ` Darrick J. Wong
  0 siblings, 1 reply; 29+ messages in thread
From: Christoph Hellwig @ 2020-09-23 7:20 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, david

On Wed, Sep 16, 2020 at 08:29:42PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
>
> In xfs_bui_item_recover, there exists a use-after-free bug with regards
> to the inode that is involved in the bmap replay operation. If the
> mapping operation does not complete, we call xfs_bmap_unmap_extent to
> create a deferred op to finish the unmapping work, and we retain a
> pointer to the incore inode.
>
> Unfortunately, the very next thing we do is commit the transaction and
> drop the inode. If reclaim tears down the inode before we try to finish
> the defer ops, we dereference garbage and blow up. Therefore, create a
> way to join inodes to the defer ops freezer so that we can maintain the
> xfs_inode reference until we're done with the inode.
>
> Note: This imposes the requirement that there be enough memory to keep
> every incore inode in memory throughout recovery.

As in every inode that gets recovered, not every inode in the system.
I think the commit log could use a very slight tweak here.

Didn't we think of just storing the inode number for recovery, or
did this turn out too complicated? (I'm pretty sure we discussed this
in detail before, but my memory gets foggy).

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: [PATCH 3/3] xfs: fix an incore inode UAF in xfs_bui_recover
  2020-09-23 7:20 ` Christoph Hellwig
@ 2020-09-23 15:55 ` Darrick J. Wong
  0 siblings, 0 replies; 29+ messages in thread
From: Darrick J. Wong @ 2020-09-23 15:55 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, david

On Wed, Sep 23, 2020 at 08:20:15AM +0100, Christoph Hellwig wrote:
> On Wed, Sep 16, 2020 at 08:29:42PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> >
> > In xfs_bui_item_recover, there exists a use-after-free bug with regards
> > to the inode that is involved in the bmap replay operation. If the
> > mapping operation does not complete, we call xfs_bmap_unmap_extent to
> > create a deferred op to finish the unmapping work, and we retain a
> > pointer to the incore inode.
> >
> > Unfortunately, the very next thing we do is commit the transaction and
> > drop the inode. If reclaim tears down the inode before we try to finish
> > the defer ops, we dereference garbage and blow up. Therefore, create a
> > way to join inodes to the defer ops freezer so that we can maintain the
> > xfs_inode reference until we're done with the inode.
> >
> > Note: This imposes the requirement that there be enough memory to keep
> > every incore inode in memory throughout recovery.
>
> As in every inode that gets recovered, not every inode in the system.
> I think the commit log could use a very slight tweak here.
>
> Didn't we think of just storing the inode number for recovery, or
> did this turn out too complicated? (I'm pretty sure we discussed this
> in detail before, but my memory gets foggy).

Initially I did just store the inode numbers, but that made the code
more clunky due to needing more _iget and _irele calls, and a bunch of
error handling for that.

Dave suggested on irc that I should retain the reference to the incore
inode to simplify the initial patch, and if we run into ENOMEM then we
can fix it later.

I wasn't 100% convinced of that, but Dave or Brian or someone (memory
foggy, don't remember who) countered that the system recovering the fs
is usually the system that crashed in the first place, and it certainly
had enough RAM to hold the inodes.

I think the link you're looking for is[1].

--D

[1] https://lore.kernel.org/linux-xfs/158864123329.184729.14504239314355330619.stgit@magnolia/

^ permalink raw reply	[flat|nested] 29+ messages in thread
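To make the trade-off Darrick describes concrete: the alternative would capture only the inode number and re-acquire the inode each time the deferred work resumes, which pushes a lookup failure path into every continuation. A hedged sketch of that road not taken, with invented names rather than the XFS API:

#include <stddef.h>

struct inode { long ino; };

/*
 * Stand-in for an inode lookup: in the kernel this would be xfs_iget,
 * which can fail under memory pressure or if the inode turns out to be
 * unallocated or corrupt.
 */
static int inode_lookup(long ino, struct inode **ipp)
{
	static struct inode cached;

	cached.ino = ino;
	*ipp = &cached;
	return 0;		/* or -ENOMEM / -EFSCORRUPTED for real */
}

struct capture_by_ino {
	long	ino;		/* number only; no reference is held */
};

static int continue_by_ino(struct capture_by_ino *c)
{
	struct inode *ip;
	int error;

	error = inode_lookup(c->ino, &ip);	/* extra failure path... */
	if (error)
		return error;			/* ...for every resume */

	/* relock, rejoin the transaction, finish the deferred work */
	return 0;
}

int main(void)
{
	struct capture_by_ino c = { .ino = 128 };

	return continue_by_ino(&c);
}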
* [PATCH v2 3/3] xfs: fix an incore inode UAF in xfs_bui_recover 2020-09-17 3:29 ` [PATCH 3/3] xfs: fix an incore inode UAF in xfs_bui_recover Darrick J. Wong 2020-09-23 7:20 ` Christoph Hellwig @ 2020-09-24 6:06 ` Darrick J. Wong 1 sibling, 0 replies; 29+ messages in thread From: Darrick J. Wong @ 2020-09-24 6:06 UTC (permalink / raw) To: linux-xfs, david, Christoph Hellwig From: Darrick J. Wong <darrick.wong@oracle.com> In xfs_bui_item_recover, there exists a use-after-free bug with regards to the inode that is involved in the bmap replay operation. If the mapping operation does not complete, we call xfs_bmap_unmap_extent to create a deferred op to finish the unmapping work, and we retain a pointer to the incore inode. Unfortunately, the very next thing we do is commit the transaction and drop the inode. If reclaim tears down the inode before we try to finish the defer ops, we dereference garbage and blow up. Therefore, create a way to join inodes to the defer ops freezer so that we can maintain the xfs_inode reference until we're done with the inode. Note: This imposes the requirement that there be enough memory to keep every incore inode in memory throughout recovery. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> --- v2: rebase on v2 of the previous patchset, and make it more clear that xfs_defer_capture takes over ownership of the inodes --- fs/xfs/libxfs/xfs_defer.c | 57 ++++++++++++++++++++++++++++++++++++++- fs/xfs/libxfs/xfs_defer.h | 21 ++++++++++++++ fs/xfs/libxfs/xfs_log_recover.h | 11 ++++++-- fs/xfs/xfs_bmap_item.c | 8 +---- fs/xfs/xfs_icache.c | 41 ++++++++++++++++++++++++++++ fs/xfs/xfs_log_recover.c | 33 ++++++++++++++++++++--- fs/xfs/xfs_trans.h | 6 ---- 7 files changed, 156 insertions(+), 21 deletions(-) diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c index b693d2c42c27..a99252271b24 100644 --- a/fs/xfs/libxfs/xfs_defer.c +++ b/fs/xfs/libxfs/xfs_defer.c @@ -16,6 +16,7 @@ #include "xfs_inode.h" #include "xfs_inode_item.h" #include "xfs_trace.h" +#include "xfs_icache.h" /* * Deferred Operations in XFS @@ -555,6 +556,18 @@ xfs_defer_move( xfs_defer_reset(stp); } +/* Unlock the inodes attached to this dfops capture device. */ +static void +xfs_defer_capture_iunlock( + struct xfs_defer_capture *dfc) +{ + unsigned int i; + + for (i = 0; i < XFS_DEFER_OPS_NR_INODES && dfc->dfc_inodes[i]; i++) + xfs_iunlock(dfc->dfc_inodes[i], XFS_ILOCK_EXCL); + dfc->dfc_ilocked = false; +} + /* * Prepare a chain of fresh deferred ops work items to be completed later. Log * recovery requires the ability to put off until later the actual finishing @@ -568,17 +581,26 @@ xfs_defer_move( * transaction is committed. * * Note that the capture state is passed up to the caller and must be freed - * even if the transaction commit returns error. + * even if the transaction commit returns error. If inodes were passed in and + * a state capture structure was returned, the inodes are now owned by the + * state capture structure and the caller must not touch the inodes. + * + * If *ipp1 or *ipp2 remain set, the caller still owns the inodes. */ int xfs_defer_capture( struct xfs_trans *tp, - struct list_head *capture_list) + struct list_head *capture_list, + struct xfs_inode **ipp1, + struct xfs_inode **ipp2) { struct xfs_defer_capture *dfc; struct xfs_mount *mp = tp->t_mountp; int error; + /* Do not pass in an ipp2 without an ipp1. 
*/ + ASSERT(ipp1 || !ipp2); + if (list_empty(&tp->t_dfops)) return xfs_trans_commit(tp); @@ -600,6 +622,21 @@ xfs_defer_capture( dfc->dfc_tres.tr_logflags = XFS_TRANS_PERM_LOG_RES; xfs_defer_reset(tp); + /* + * Transfer responsibility for unlocking and releasing the inodes to + * the capture structure. + */ + dfc->dfc_ilocked = true; + if (ipp1) { + dfc->dfc_inodes[0] = *ipp1; + *ipp1 = NULL; + } + if (ipp2) { + if (*ipp2 != dfc->dfc_inodes[0]) + dfc->dfc_inodes[1] = *ipp2; + *ipp2 = NULL; + } + /* * Commit the transaction. If that fails, clean up the defer ops and * the dfc that we just created. Otherwise, add the dfc to the list. @@ -610,6 +647,12 @@ xfs_defer_capture( return error; } + /* + * Now that we've committed the transaction, it's safe to unlock the + * inodes that were passed in if we've taken over their ownership. + */ + xfs_defer_capture_iunlock(dfc); + list_add_tail(&dfc->dfc_list, capture_list); return 0; } @@ -623,6 +666,13 @@ xfs_defer_continue( ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES); ASSERT(!(tp->t_flags & XFS_TRANS_DIRTY)); + /* Lock and join the inodes to the new transaction. */ + xfs_defer_continue_inodes(dfc, tp); + if (dfc->dfc_inodes[0]) + xfs_trans_ijoin(tp, dfc->dfc_inodes[0], 0); + if (dfc->dfc_inodes[1]) + xfs_trans_ijoin(tp, dfc->dfc_inodes[1], 0); + /* Move captured dfops chain and state to the transaction. */ list_splice_init(&dfc->dfc_dfops, &tp->t_dfops); tp->t_flags |= dfc->dfc_tpflags; @@ -638,5 +688,8 @@ xfs_defer_capture_free( struct xfs_defer_capture *dfc) { xfs_defer_cancel_list(mp, &dfc->dfc_dfops); + if (dfc->dfc_ilocked) + xfs_defer_capture_iunlock(dfc); + xfs_defer_capture_irele(dfc); kmem_free(dfc); } diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h index 254d48e6e4dc..c52c57d35af6 100644 --- a/fs/xfs/libxfs/xfs_defer.h +++ b/fs/xfs/libxfs/xfs_defer.h @@ -10,6 +10,12 @@ struct xfs_btree_cur; struct xfs_defer_op_type; struct xfs_defer_capture; +/* + * Deferred operation item relogging limits. + */ +#define XFS_DEFER_OPS_NR_INODES 2 /* join up to two inodes */ +#define XFS_DEFER_OPS_NR_BUFS 2 /* join up to two buffers */ + /* * Header for deferred operation list. */ @@ -78,15 +84,28 @@ struct xfs_defer_capture { unsigned int dfc_tpflags; unsigned int dfc_blkres; struct xfs_trans_res dfc_tres; + + /* + * Inodes to hold when we want to finish the deferred work items. + * Always set the first element before setting the second. + */ + bool dfc_ilocked; + struct xfs_inode *dfc_inodes[XFS_DEFER_OPS_NR_INODES]; }; /* * Functions to capture a chain of deferred operations and continue them later. * This doesn't normally happen except log recovery. */ -int xfs_defer_capture(struct xfs_trans *tp, struct list_head *capture_list); +int xfs_defer_capture(struct xfs_trans *tp, struct list_head *capture_list, + struct xfs_inode **ipp1, struct xfs_inode **ipp2); void xfs_defer_continue(struct xfs_defer_capture *dfc, struct xfs_trans *tp); void xfs_defer_capture_free(struct xfs_mount *mp, struct xfs_defer_capture *dfc); +/* These functions must be provided by the xfs implementation. 
*/ +void xfs_defer_continue_inodes(struct xfs_defer_capture *dfc, + struct xfs_trans *tp); +void xfs_defer_capture_irele(struct xfs_defer_capture *dfc); + #endif /* __XFS_DEFER_H__ */ diff --git a/fs/xfs/libxfs/xfs_log_recover.h b/fs/xfs/libxfs/xfs_log_recover.h index 4f752096f7c7..8cf5e23c0aef 100644 --- a/fs/xfs/libxfs/xfs_log_recover.h +++ b/fs/xfs/libxfs/xfs_log_recover.h @@ -125,7 +125,14 @@ void xlog_recover_iodone(struct xfs_buf *bp); void xlog_recover_release_intent(struct xlog *log, unsigned short intent_type, uint64_t intent_id); -int xlog_recover_trans_commit(struct xfs_trans *tp, - struct list_head *capture_list); +int xlog_recover_trans_commit_inodes(struct xfs_trans *tp, + struct list_head *capture_list, struct xfs_inode *ip1, + struct xfs_inode *ip2); + +static inline int +xlog_recover_trans_commit(struct xfs_trans *tp, struct list_head *capture_list) +{ + return xlog_recover_trans_commit_inodes(tp, capture_list, NULL, NULL); +} #endif /* __XFS_LOG_RECOVER_H__ */ diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c index b6d3a5766148..fa52bfd66ce1 100644 --- a/fs/xfs/xfs_bmap_item.c +++ b/fs/xfs/xfs_bmap_item.c @@ -513,15 +513,11 @@ xfs_bui_item_recover( xfs_bmap_unmap_extent(tp, ip, &irec); } - /* Commit transaction, which frees tp. */ - error = xlog_recover_trans_commit(tp, capture_list); - if (error) - goto err_unlock; - return 0; + /* Commit transaction, which frees the transaction and the inode. */ + return xlog_recover_trans_commit_inodes(tp, capture_list, ip, NULL); err_cancel: xfs_trans_cancel(tp); -err_unlock: xfs_iunlock(ip, XFS_ILOCK_EXCL); err_rele: xfs_irele(ip); diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c index 101028ebb571..5b1202cac4ea 100644 --- a/fs/xfs/xfs_icache.c +++ b/fs/xfs/xfs_icache.c @@ -12,6 +12,7 @@ #include "xfs_sb.h" #include "xfs_mount.h" #include "xfs_inode.h" +#include "xfs_defer.h" #include "xfs_trans.h" #include "xfs_trans_priv.h" #include "xfs_inode_item.h" @@ -1692,3 +1693,43 @@ xfs_start_block_reaping( xfs_queue_eofblocks(mp); xfs_queue_cowblocks(mp); } + +/* + * Prepare the inodes to participate in further log intent item recovery. + * For now, that means attaching dquots and locking them, since libxfs doesn't + * know how to do that. + */ +void +xfs_defer_continue_inodes( + struct xfs_defer_capture *dfc, + struct xfs_trans *tp) +{ + int i; + int error; + + for (i = 0; i < XFS_DEFER_OPS_NR_INODES && dfc->dfc_inodes[i]; i++) { + error = xfs_qm_dqattach(dfc->dfc_inodes[i]); + if (error) + tp->t_mountp->m_qflags &= ~XFS_ALL_QUOTA_CHKD; + } + + if (dfc->dfc_inodes[1]) + xfs_lock_two_inodes(dfc->dfc_inodes[0], XFS_ILOCK_EXCL, + dfc->dfc_inodes[1], XFS_ILOCK_EXCL); + else if (dfc->dfc_inodes[0]) + xfs_ilock(dfc->dfc_inodes[0], XFS_ILOCK_EXCL); + dfc->dfc_ilocked = true; +} + +/* Release all the inodes attached to this dfops capture device. */ +void +xfs_defer_capture_irele( + struct xfs_defer_capture *dfc) +{ + unsigned int i; + + for (i = 0; i < XFS_DEFER_OPS_NR_INODES && dfc->dfc_inodes[i]; i++) { + xfs_irele(dfc->dfc_inodes[i]); + dfc->dfc_inodes[i] = NULL; + } +} diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c index ab9825ab14d5..37ba7a105b59 100644 --- a/fs/xfs/xfs_log_recover.c +++ b/fs/xfs/xfs_log_recover.c @@ -1791,17 +1791,42 @@ xlog_recover_release_intent( spin_unlock(&ailp->ail_lock); } +static inline void +xlog_recover_irele( + struct xfs_inode *ip) +{ + xfs_iunlock(ip, XFS_ILOCK_EXCL); + xfs_irele(ip); +} + /* * Freeze any deferred ops and commit the transaction. 
This is the last step * needed to finish a log intent item that we recovered from the log, and will - * take care of releasing all the relevant resources. + * take care of releasing all the relevant resources. If we captured deferred + * ops, the inodes are attached to it and must not be touched. If not, we have + * to unlock and free them ourselves. */ int -xlog_recover_trans_commit( +xlog_recover_trans_commit_inodes( struct xfs_trans *tp, - struct list_head *capture_list) + struct list_head *capture_list, + struct xfs_inode *ip1, + struct xfs_inode *ip2) { - return xfs_defer_capture(tp, capture_list); + int error; + + error = xfs_defer_capture(tp, capture_list, &ip1, &ip2); + + /* + * If we still have references to the inodes, either the capture + * process did nothing or it failed; either way, we still own the + * inodes, so unlock and release them. + */ + if (ip2 && ip2 != ip1) + xlog_recover_irele(ip2); + if (ip1) + xlog_recover_irele(ip1); + return error; } /****************************************************************************** diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h index b68272666ce1..e27df685d3cd 100644 --- a/fs/xfs/xfs_trans.h +++ b/fs/xfs/xfs_trans.h @@ -96,12 +96,6 @@ void xfs_log_item_init(struct xfs_mount *mp, struct xfs_log_item *item, #define XFS_ITEM_LOCKED 2 #define XFS_ITEM_FLUSHING 3 -/* - * Deferred operation item relogging limits. - */ -#define XFS_DEFER_OPS_NR_INODES 2 /* join up to two inodes */ -#define XFS_DEFER_OPS_NR_BUFS 2 /* join up to two buffers */ - /* * This is the structure maintained for every active transaction. */ ^ permalink raw reply related [flat|nested] 29+ messages in thread
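One detail of xfs_defer_continue_inodes above deserves a note: when the captured chain resumes with two inodes, both ILOCKs are retaken through xfs_lock_two_inodes, which acquires them in a stable order so that two tasks locking the same pair cannot ABBA-deadlock. A minimal user-space model of that ordering discipline, assuming ordering by inode number purely for illustration:

#include <stdio.h>

struct inode { long ino; };

static void ilock(struct inode *ip)
{
	printf("lock inode %ld\n", ip->ino);
}

/*
 * Take both locks in a stable order (lowest inode number first) so two
 * tasks locking the same pair can never deadlock.  The real
 * xfs_lock_two_inodes handles more lock classes; only the ordering idea
 * is shown here.
 */
static void lock_two_inodes(struct inode *a, struct inode *b)
{
	if (a->ino > b->ino) {
		struct inode *tmp = a;

		a = b;
		b = tmp;
	}
	ilock(a);
	ilock(b);
}

int main(void)
{
	struct inode x = { 200 }, y = { 100 };

	lock_two_inodes(&x, &y);	/* always locks 100, then 200 */
	return 0;
}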
* [PATCH v4 0/3] xfs: fix inode use-after-free during log recovery @ 2020-10-05 18:20 Darrick J. Wong 2020-10-05 18:20 ` [PATCH 3/3] xfs: fix an incore inode UAF in xfs_bui_recover Darrick J. Wong 0 siblings, 1 reply; 29+ messages in thread From: Darrick J. Wong @ 2020-10-05 18:20 UTC (permalink / raw) To: darrick.wong Cc: Dave Chinner, Brian Foster, Christoph Hellwig, linux-xfs, david, hch Hi all, In this second series, I try to fix a use-after-free that I discovered during development of the dfops freezer, where BUI recovery releases the inode even if it requeues itself. If the inode gets reclaimed, the fs corrupts memory and explodes. The fix is to make the dfops capture struct take over ownership of the inodes if there's any more work to be done. This is a bit clunky, but it's a simpler mechanism than saving inode pointers and inode numbers and introducing tagged structures so that we can distinguish one from the other. v2: rebase atop the new defer capture code v3: only capture one inode, move as much of the defer capture code to xfs_defer.c as we can v4: make defer capture ihold the inode, and the caller still gets to iunlock and irele it If you're going to start using this mess, you probably ought to just pull from my git trees, which are linked below. This is an extraordinary way to destroy everything. Enjoy! Comments and questions are, as always, welcome. --D kernel git tree: https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=fix-bmap-intent-recovery-5.10 --- fs/xfs/libxfs/xfs_defer.c | 43 ++++++++++++++++++++--- fs/xfs/libxfs/xfs_defer.h | 11 +++++- fs/xfs/xfs_bmap_item.c | 84 ++++++++++++++++++++------------------------ fs/xfs/xfs_extfree_item.c | 2 + fs/xfs/xfs_log_recover.c | 7 +++- fs/xfs/xfs_refcount_item.c | 2 + fs/xfs/xfs_rmap_item.c | 2 + 7 files changed, 95 insertions(+), 56 deletions(-) ^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH 3/3] xfs: fix an incore inode UAF in xfs_bui_recover 2020-10-05 18:20 [PATCH v4 0/3] xfs: fix inode use-after-free during log recovery Darrick J. Wong @ 2020-10-05 18:20 ` Darrick J. Wong 2020-10-06 6:24 ` Christoph Hellwig 0 siblings, 1 reply; 29+ messages in thread From: Darrick J. Wong @ 2020-10-05 18:20 UTC (permalink / raw) To: darrick.wong; +Cc: Brian Foster, linux-xfs, david, hch From: Darrick J. Wong <darrick.wong@oracle.com> In xfs_bui_item_recover, there exists a use-after-free bug with regards to the inode that is involved in the bmap replay operation. If the mapping operation does not complete, we call xfs_bmap_unmap_extent to create a deferred op to finish the unmapping work, and we retain a pointer to the incore inode. Unfortunately, the very next thing we do is commit the transaction and drop the inode. If reclaim tears down the inode before we try to finish the defer ops, we dereference garbage and blow up. Therefore, create a way to join inodes to the defer ops freezer so that we can maintain the xfs_inode reference until we're done with the inode. Note: This imposes the requirement that there be enough memory to keep every incore inode in memory throughout recovery. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> --- fs/xfs/libxfs/xfs_defer.c | 43 ++++++++++++++++++++++++++++++++++++++----- fs/xfs/libxfs/xfs_defer.h | 11 +++++++++-- fs/xfs/xfs_bmap_item.c | 7 +++++-- fs/xfs/xfs_extfree_item.c | 2 +- fs/xfs/xfs_log_recover.c | 7 ++++++- fs/xfs/xfs_refcount_item.c | 2 +- fs/xfs/xfs_rmap_item.c | 2 +- 7 files changed, 61 insertions(+), 13 deletions(-) diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c index e19dc1ced7e6..4d2583758f3d 100644 --- a/fs/xfs/libxfs/xfs_defer.c +++ b/fs/xfs/libxfs/xfs_defer.c @@ -16,6 +16,7 @@ #include "xfs_inode.h" #include "xfs_inode_item.h" #include "xfs_trace.h" +#include "xfs_icache.h" /* * Deferred Operations in XFS @@ -553,10 +554,14 @@ xfs_defer_move( * deferred ops state is transferred to the capture structure and the * transaction is then ready for the caller to commit it. If there are no * intent items to capture, this function returns NULL. + * + * If capture_ip is not NULL, the capture structure will obtain an extra + * reference to the inode. */ static struct xfs_defer_capture * xfs_defer_ops_capture( - struct xfs_trans *tp) + struct xfs_trans *tp, + struct xfs_inode *capture_ip) { struct xfs_defer_capture *dfc; @@ -582,6 +587,15 @@ xfs_defer_ops_capture( /* Preserve the log reservation size. */ dfc->dfc_logres = tp->t_log_res; + /* + * Grab an extra reference to this inode and attach it to the capture + * structure. + */ + if (capture_ip) { + ihold(VFS_I(capture_ip)); + dfc->dfc_capture_ip = capture_ip; + } + return dfc; } @@ -592,24 +606,33 @@ xfs_defer_ops_release( struct xfs_defer_capture *dfc) { xfs_defer_cancel_list(mp, &dfc->dfc_dfops); + if (dfc->dfc_capture_ip) + xfs_irele(dfc->dfc_capture_ip); kmem_free(dfc); } /* * Capture any deferred ops and commit the transaction. This is the last step - * needed to finish a log intent item that we recovered from the log. + * needed to finish a log intent item that we recovered from the log. If any + * of the deferred ops operate on an inode, the caller must pass in that inode + * so that the reference can be transferred to the capture structure. The + * caller must hold ILOCK_EXCL on the inode, and must unlock it before calling + * xfs_defer_ops_continue. 
*/ int xfs_defer_ops_capture_and_commit( struct xfs_trans *tp, + struct xfs_inode *capture_ip, struct list_head *capture_list) { struct xfs_mount *mp = tp->t_mountp; struct xfs_defer_capture *dfc; int error; + ASSERT(!capture_ip || xfs_isilocked(capture_ip, XFS_ILOCK_EXCL)); + /* If we don't capture anything, commit transaction and exit. */ - dfc = xfs_defer_ops_capture(tp); + dfc = xfs_defer_ops_capture(tp, capture_ip); if (!dfc) return xfs_trans_commit(tp); @@ -626,16 +649,26 @@ xfs_defer_ops_capture_and_commit( /* * Attach a chain of captured deferred ops to a new transaction and free the - * capture structure. + * capture structure. If an inode was captured, it will be passed back to the + * caller with ILOCK_EXCL held and joined to the transaction with lockflags==0. + * The caller now owns the inode reference. */ void xfs_defer_ops_continue( struct xfs_defer_capture *dfc, - struct xfs_trans *tp) + struct xfs_trans *tp, + struct xfs_inode **captured_ipp) { ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES); ASSERT(!(tp->t_flags & XFS_TRANS_DIRTY)); + /* Lock and join the captured inode to the new transaction. */ + if (dfc->dfc_capture_ip) { + xfs_ilock(dfc->dfc_capture_ip, XFS_ILOCK_EXCL); + xfs_trans_ijoin(tp, dfc->dfc_capture_ip, 0); + } + *captured_ipp = dfc->dfc_capture_ip; + /* Move captured dfops chain and state to the transaction. */ list_splice_init(&dfc->dfc_dfops, &tp->t_dfops); tp->t_flags |= dfc->dfc_tpflags; diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h index 6cde6f0713f7..05472f71fffe 100644 --- a/fs/xfs/libxfs/xfs_defer.h +++ b/fs/xfs/libxfs/xfs_defer.h @@ -82,6 +82,12 @@ struct xfs_defer_capture { /* Log reservation saved from the transaction. */ unsigned int dfc_logres; + + /* + * An inode reference that must be maintained to complete the deferred + * work. + */ + struct xfs_inode *dfc_capture_ip; }; /* @@ -89,8 +95,9 @@ struct xfs_defer_capture { * This doesn't normally happen except log recovery. */ int xfs_defer_ops_capture_and_commit(struct xfs_trans *tp, - struct list_head *capture_list); -void xfs_defer_ops_continue(struct xfs_defer_capture *d, struct xfs_trans *tp); + struct xfs_inode *capture_ip, struct list_head *capture_list); +void xfs_defer_ops_continue(struct xfs_defer_capture *d, struct xfs_trans *tp, + struct xfs_inode **captured_ipp); void xfs_defer_ops_release(struct xfs_mount *mp, struct xfs_defer_capture *d); #endif /* __XFS_DEFER_H__ */ diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c index 852411568d14..4570da07eb06 100644 --- a/fs/xfs/xfs_bmap_item.c +++ b/fs/xfs/xfs_bmap_item.c @@ -513,8 +513,11 @@ xfs_bui_item_recover( xfs_bmap_unmap_extent(tp, ip, &irec); } - /* Commit transaction, which frees the transaction. */ - error = xfs_defer_ops_capture_and_commit(tp, capture_list); + /* + * Commit transaction, which frees the transaction and saves the inode + * for later replay activities. 
+ */ + error = xfs_defer_ops_capture_and_commit(tp, ip, capture_list); if (error) goto err_unlock; diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c index 17d36fe5cfd0..3920542f5736 100644 --- a/fs/xfs/xfs_extfree_item.c +++ b/fs/xfs/xfs_extfree_item.c @@ -627,7 +627,7 @@ xfs_efi_item_recover( } - return xfs_defer_ops_capture_and_commit(tp, capture_list); + return xfs_defer_ops_capture_and_commit(tp, NULL, capture_list); abort_error: xfs_trans_cancel(tp); diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c index 001e1585ddc6..a8289adc1b29 100644 --- a/fs/xfs/xfs_log_recover.c +++ b/fs/xfs/xfs_log_recover.c @@ -2439,6 +2439,7 @@ xlog_finish_defer_ops( { struct xfs_defer_capture *dfc, *next; struct xfs_trans *tp; + struct xfs_inode *ip; int error = 0; list_for_each_entry_safe(dfc, next, capture_list, dfc_list) { @@ -2464,9 +2465,13 @@ xlog_finish_defer_ops( * from recovering a single intent item. */ list_del_init(&dfc->dfc_list); - xfs_defer_ops_continue(dfc, tp); + xfs_defer_ops_continue(dfc, tp, &ip); error = xfs_trans_commit(tp); + if (ip) { + xfs_iunlock(ip, XFS_ILOCK_EXCL); + xfs_irele(ip); + } if (error) return error; } diff --git a/fs/xfs/xfs_refcount_item.c b/fs/xfs/xfs_refcount_item.c index 0478374add64..ad895b48f365 100644 --- a/fs/xfs/xfs_refcount_item.c +++ b/fs/xfs/xfs_refcount_item.c @@ -544,7 +544,7 @@ xfs_cui_item_recover( } xfs_refcount_finish_one_cleanup(tp, rcur, error); - return xfs_defer_ops_capture_and_commit(tp, capture_list); + return xfs_defer_ops_capture_and_commit(tp, NULL, capture_list); abort_error: xfs_refcount_finish_one_cleanup(tp, rcur, error); diff --git a/fs/xfs/xfs_rmap_item.c b/fs/xfs/xfs_rmap_item.c index 0d8fa707f079..1163f32c3e62 100644 --- a/fs/xfs/xfs_rmap_item.c +++ b/fs/xfs/xfs_rmap_item.c @@ -567,7 +567,7 @@ xfs_rui_item_recover( } xfs_rmap_finish_one_cleanup(tp, rcur, error); - return xfs_defer_ops_capture_and_commit(tp, capture_list); + return xfs_defer_ops_capture_and_commit(tp, NULL, capture_list); abort_error: xfs_rmap_finish_one_cleanup(tp, rcur, error); ^ permalink raw reply related [flat|nested] 29+ messages in thread
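One point of the locking contract above is easy to miss: the ILOCK is never held across the gap between capture and continue. The recovering caller unlocks after the commit, the lock stays free while the remaining intents are recovered, and xfs_defer_ops_continue re-acquires it for the finishing transaction. A hypothetical sketch of that sequence, with a pthread mutex standing in for ILOCK_EXCL (user-space model only, assumed behavior):

#include <pthread.h>
#include <stdio.h>

/* A pthread mutex stands in for ILOCK_EXCL; hypothetical model only. */
static pthread_mutex_t ilock = PTHREAD_MUTEX_INITIALIZER;

int main(void)
{
	/* Intent recovery: iget + xfs_ilock, then build the transaction. */
	pthread_mutex_lock(&ilock);
	/* xfs_defer_ops_capture_and_commit() runs here: it takes an inode
	 * reference for the capture but deliberately not the lock. */
	pthread_mutex_unlock(&ilock);	/* caller unlocks before moving on */

	/* The lock is free during the gap, so the rest of the recovered
	 * intents can be processed without anything pinned. */
	if (pthread_mutex_trylock(&ilock) == 0) {
		printf("ILOCK free between capture and continue\n");
		pthread_mutex_unlock(&ilock);
	}

	/* Later, xlog_finish_defer_ops: re-lock, finish, unlock, release. */
	pthread_mutex_lock(&ilock);	/* what xfs_defer_ops_continue does */
	pthread_mutex_unlock(&ilock);	/* the caller's final xfs_iunlock */
	return 0;
}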
* Re: [PATCH 3/3] xfs: fix an incore inode UAF in xfs_bui_recover 2020-10-05 18:20 ` [PATCH 3/3] xfs: fix an incore inode UAF in xfs_bui_recover Darrick J. Wong @ 2020-10-06 6:24 ` Christoph Hellwig 0 siblings, 0 replies; 29+ messages in thread From: Christoph Hellwig @ 2020-10-06 6:24 UTC (permalink / raw) To: Darrick J. Wong; +Cc: Brian Foster, linux-xfs, david, hch On Mon, Oct 05, 2020 at 11:20:50AM -0700, Darrick J. Wong wrote: > From: Darrick J. Wong <darrick.wong@oracle.com> > > In xfs_bui_item_recover, there exists a use-after-free bug with regards > to the inode that is involved in the bmap replay operation. If the > mapping operation does not complete, we call xfs_bmap_unmap_extent to > create a deferred op to finish the unmapping work, and we retain a > pointer to the incore inode. > > Unfortunately, the very next thing we do is commit the transaction and > drop the inode. If reclaim tears down the inode before we try to finish > the defer ops, we dereference garbage and blow up. Therefore, create a > way to join inodes to the defer ops freezer so that we can maintain the > xfs_inode reference until we're done with the inode. > > Note: This imposes the requirement that there be enough memory to keep > every incore inode in memory throughout recovery. > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> > Reviewed-by: Brian Foster <bfoster@redhat.com> Looks good, Reviewed-by: Christoph Hellwig <hch@lst.de> ^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH v3 0/3] xfs: fix inode use-after-free during log recovery @ 2020-09-29 17:43 Darrick J. Wong 2020-09-29 17:44 ` [PATCH 3/3] xfs: fix an incore inode UAF in xfs_bui_recover Darrick J. Wong 0 siblings, 1 reply; 29+ messages in thread From: Darrick J. Wong @ 2020-09-29 17:43 UTC (permalink / raw) To: darrick.wong; +Cc: Dave Chinner, Christoph Hellwig, linux-xfs, david, hch Hi all, In this second series, I try to fix a use-after-free that I discovered during development of the dfops freezer, where BUI recovery releases the inode even if it requeues itself. If the inode gets reclaimed, the fs corrupts memory and explodes. The fix is to make the dfops capture struct take over ownership of the inodes if there's any more work to be done. This is a bit clunky, but it's a simpler mechanism than saving inode pointers and inode numbers and introducing tagged structures so that we can distinguish one from the other. v2: rebase atop the new defer capture code v3: only capture one inode, move as much of the defer capture code to xfs_defer.c as we can If you're going to start using this mess, you probably ought to just pull from my git trees, which are linked below. This is an extraordinary way to destroy everything. Enjoy! Comments and questions are, as always, welcome. --D kernel git tree: https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=fix-bmap-intent-recovery-5.10 --- fs/xfs/libxfs/xfs_defer.c | 55 +++++++++++++++++++++++++++---- fs/xfs/libxfs/xfs_defer.h | 11 +++++- fs/xfs/xfs_bmap_item.c | 78 +++++++++++++++++--------------------------- fs/xfs/xfs_extfree_item.c | 2 + fs/xfs/xfs_log_recover.c | 7 +++- fs/xfs/xfs_refcount_item.c | 2 + fs/xfs/xfs_rmap_item.c | 2 + 7 files changed, 96 insertions(+), 61 deletions(-) ^ permalink raw reply [flat|nested] 29+ messages in thread
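The "tagged structures" alternative that this cover letter rejects is easy to picture. A hypothetical sketch of the bookkeeping it would require (assumed names; nothing like this exists in the series):

#include <stdint.h>
#include <stdio.h>

struct model_inode {			/* stand-in for struct xfs_inode */
	uint64_t i_ino;
};

enum capture_res_type {
	CAPTURE_RES_INODE_PTR,		/* incore pointer, reference held */
	CAPTURE_RES_INODE_NUM,		/* bare number, must iget at thaw */
};

struct capture_res {
	enum capture_res_type type;
	union {
		struct model_inode *ip;
		uint64_t ino;
	};
};

/* Every consumer has to switch on the tag. */
static uint64_t capture_res_ino(const struct capture_res *res)
{
	switch (res->type) {
	case CAPTURE_RES_INODE_PTR:
		return res->ip->i_ino;
	case CAPTURE_RES_INODE_NUM:
		return res->ino;
	}
	return 0;
}

int main(void)
{
	struct model_inode inode = { .i_ino = 133 };
	struct capture_res a = { .type = CAPTURE_RES_INODE_PTR, .ip = &inode };
	struct capture_res b = { .type = CAPTURE_RES_INODE_NUM, .ino = 134 };

	printf("%llu %llu\n", (unsigned long long)capture_res_ino(&a),
	       (unsigned long long)capture_res_ino(&b));
	return 0;
}

Every consumer would have to handle both arms, and the inode-number arm would presumably still need an iget and its error handling at continue time, so a single referenced inode pointer is the simpler contract.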
* [PATCH 3/3] xfs: fix an incore inode UAF in xfs_bui_recover 2020-09-29 17:43 [PATCH v3 0/3] xfs: fix inode use-after-free during log recovery Darrick J. Wong @ 2020-09-29 17:44 ` Darrick J. Wong 0 siblings, 0 replies; 29+ messages in thread From: Darrick J. Wong @ 2020-09-29 17:44 UTC (permalink / raw) To: darrick.wong; +Cc: linux-xfs, david, hch From: Darrick J. Wong <darrick.wong@oracle.com> In xfs_bui_item_recover, there exists a use-after-free bug with regards to the inode that is involved in the bmap replay operation. If the mapping operation does not complete, we call xfs_bmap_unmap_extent to create a deferred op to finish the unmapping work, and we retain a pointer to the incore inode. Unfortunately, the very next thing we do is commit the transaction and drop the inode. If reclaim tears down the inode before we try to finish the defer ops, we dereference garbage and blow up. Therefore, create a way to join inodes to the defer ops freezer so that we can maintain the xfs_inode reference until we're done with the inode. Note: This imposes the requirement that there be enough memory to keep every incore inode in memory throughout recovery. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> --- fs/xfs/libxfs/xfs_defer.c | 55 ++++++++++++++++++++++++++++++++++++++------ fs/xfs/libxfs/xfs_defer.h | 11 +++++++-- fs/xfs/xfs_bmap_item.c | 8 ++---- fs/xfs/xfs_extfree_item.c | 2 +- fs/xfs/xfs_log_recover.c | 7 +++++- fs/xfs/xfs_refcount_item.c | 2 +- fs/xfs/xfs_rmap_item.c | 2 +- 7 files changed, 67 insertions(+), 20 deletions(-) diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c index 4caaf5527403..c466a3177acc 100644 --- a/fs/xfs/libxfs/xfs_defer.c +++ b/fs/xfs/libxfs/xfs_defer.c @@ -16,6 +16,7 @@ #include "xfs_inode.h" #include "xfs_inode_item.h" #include "xfs_trace.h" +#include "xfs_icache.h" /* * Deferred Operations in XFS @@ -553,10 +554,14 @@ xfs_defer_move( * deferred ops state is transferred to the capture structure and the * transaction is then ready for the caller to commit it. If there are no * intent items to capture, this function returns NULL. + * + * If inodes are passed in and this function returns a capture structure, the + * inodes are now owned by the capture structure. */ static struct xfs_defer_capture * xfs_defer_ops_capture( - struct xfs_trans *tp) + struct xfs_trans *tp, + struct xfs_inode *ip) { struct xfs_defer_capture *dfc; @@ -588,6 +593,12 @@ xfs_defer_ops_capture( dfc->dfc_tres.tr_logcount = 1; dfc->dfc_tres.tr_logflags = XFS_TRANS_PERM_LOG_RES; + /* + * Transfer responsibility for unlocking and releasing the inodes to + * the capture structure. + */ + dfc->dfc_ip = ip; + return dfc; } @@ -598,29 +609,49 @@ xfs_defer_ops_release( struct xfs_defer_capture *dfc) { xfs_defer_cancel_list(mp, &dfc->dfc_dfops); + if (dfc->dfc_ip) + xfs_irele(dfc->dfc_ip); kmem_free(dfc); } /* * Capture any deferred ops and commit the transaction. This is the last step - * needed to finish a log intent item that we recovered from the log. + * needed to finish a log intent item that we recovered from the log. If any + * of the deferred ops operate on an inode, the caller must pass in that inode + * so that the reference can be transferred to the capture structure. The + * caller must hold ILOCK_EXCL on the inode, and must not touch the inode after + * this call returns. 
*/ int xfs_defer_ops_capture_and_commit( struct xfs_trans *tp, + struct xfs_inode *ip, struct list_head *capture_list) { struct xfs_mount *mp = tp->t_mountp; struct xfs_defer_capture *dfc; int error; + ASSERT(ip == NULL || xfs_isilocked(ip, XFS_ILOCK_EXCL)); + /* If we don't capture anything, commit transaction and exit. */ - dfc = xfs_defer_ops_capture(tp); - if (!dfc) - return xfs_trans_commit(tp); + dfc = xfs_defer_ops_capture(tp, ip); + if (!dfc) { + error = xfs_trans_commit(tp); + if (ip) { + xfs_iunlock(ip, XFS_ILOCK_EXCL); + xfs_irele(ip); + } + return error; + } - /* Commit the transaction and add the capture structure to the list. */ + /* + * Commit the transaction and add the capture structure to the list. + * Once that's done, we can unlock the inode. + */ error = xfs_trans_commit(tp); + if (ip) + xfs_iunlock(ip, XFS_ILOCK_EXCL); if (error) { xfs_defer_ops_release(mp, dfc); return error; @@ -632,16 +663,24 @@ xfs_defer_ops_capture_and_commit( /* * Attach a chain of captured deferred ops to a new transaction and free the - * capture structure. + * capture structure. A captured inode will be passed back to the caller. */ void xfs_defer_ops_continue( struct xfs_defer_capture *dfc, - struct xfs_trans *tp) + struct xfs_trans *tp, + struct xfs_inode **ipp) { ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES); ASSERT(!(tp->t_flags & XFS_TRANS_DIRTY)); + /* Lock and join the captured inode to the new transaction. */ + if (dfc->dfc_ip) { + xfs_ilock(dfc->dfc_ip, XFS_ILOCK_EXCL); + xfs_trans_ijoin(tp, dfc->dfc_ip, 0); + } + *ipp = dfc->dfc_ip; + /* Move captured dfops chain and state to the transaction. */ list_splice_init(&dfc->dfc_dfops, &tp->t_dfops); tp->t_flags |= dfc->dfc_tpflags; diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h index c447c79bbe74..3aaf702d4445 100644 --- a/fs/xfs/libxfs/xfs_defer.h +++ b/fs/xfs/libxfs/xfs_defer.h @@ -77,15 +77,22 @@ struct xfs_defer_capture { unsigned int dfc_tpflags; unsigned int dfc_blkres; struct xfs_trans_res dfc_tres; + + /* + * An inode reference that must be maintained to complete the deferred + * work. + */ + struct xfs_inode *dfc_ip; }; /* * Functions to capture a chain of deferred operations and continue them later. * This doesn't normally happen except log recovery. */ -int xfs_defer_ops_capture_and_commit(struct xfs_trans *tp, +int xfs_defer_ops_capture_and_commit(struct xfs_trans *tp, struct xfs_inode *ip, struct list_head *capture_list); -void xfs_defer_ops_continue(struct xfs_defer_capture *d, struct xfs_trans *tp); +void xfs_defer_ops_continue(struct xfs_defer_capture *d, struct xfs_trans *tp, + struct xfs_inode **ipp); void xfs_defer_ops_release(struct xfs_mount *mp, struct xfs_defer_capture *d); #endif /* __XFS_DEFER_H__ */ diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c index 1c9cb5a04bb5..0ffbc75bafe1 100644 --- a/fs/xfs/xfs_bmap_item.c +++ b/fs/xfs/xfs_bmap_item.c @@ -513,15 +513,11 @@ xfs_bui_item_recover( xfs_bmap_unmap_extent(tp, ip, &irec); } - /* Commit transaction, which frees tp. */ - error = xfs_defer_ops_capture_and_commit(tp, capture_list); - if (error) - goto err_unlock; - return 0; + /* Commit transaction, which frees the transaction and the inode. 
*/ + return xfs_defer_ops_capture_and_commit(tp, ip, capture_list); err_cancel: xfs_trans_cancel(tp); -err_unlock: xfs_iunlock(ip, XFS_ILOCK_EXCL); err_rele: xfs_irele(ip); diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c index 17d36fe5cfd0..3920542f5736 100644 --- a/fs/xfs/xfs_extfree_item.c +++ b/fs/xfs/xfs_extfree_item.c @@ -627,7 +627,7 @@ xfs_efi_item_recover( } - return xfs_defer_ops_capture_and_commit(tp, capture_list); + return xfs_defer_ops_capture_and_commit(tp, NULL, capture_list); abort_error: xfs_trans_cancel(tp); diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c index 46e750279634..90ad2bfdfa48 100644 --- a/fs/xfs/xfs_log_recover.c +++ b/fs/xfs/xfs_log_recover.c @@ -2439,6 +2439,7 @@ xlog_finish_defer_ops( { struct xfs_defer_capture *dfc, *next; struct xfs_trans *tp; + struct xfs_inode *ip; int error = 0; list_for_each_entry_safe(dfc, next, capture_list, dfc_list) { @@ -2449,9 +2450,13 @@ xlog_finish_defer_ops( /* Transfer all collected dfops to this transaction. */ list_del_init(&dfc->dfc_list); - xfs_defer_ops_continue(dfc, tp); + xfs_defer_ops_continue(dfc, tp, &ip); error = xfs_trans_commit(tp); + if (ip) { + xfs_iunlock(ip, XFS_ILOCK_EXCL); + xfs_irele(ip); + } if (error) return error; } diff --git a/fs/xfs/xfs_refcount_item.c b/fs/xfs/xfs_refcount_item.c index 0478374add64..ad895b48f365 100644 --- a/fs/xfs/xfs_refcount_item.c +++ b/fs/xfs/xfs_refcount_item.c @@ -544,7 +544,7 @@ xfs_cui_item_recover( } xfs_refcount_finish_one_cleanup(tp, rcur, error); - return xfs_defer_ops_capture_and_commit(tp, capture_list); + return xfs_defer_ops_capture_and_commit(tp, NULL, capture_list); abort_error: xfs_refcount_finish_one_cleanup(tp, rcur, error); diff --git a/fs/xfs/xfs_rmap_item.c b/fs/xfs/xfs_rmap_item.c index 0d8fa707f079..1163f32c3e62 100644 --- a/fs/xfs/xfs_rmap_item.c +++ b/fs/xfs/xfs_rmap_item.c @@ -567,7 +567,7 @@ xfs_rui_item_recover( } xfs_rmap_finish_one_cleanup(tp, rcur, error); - return xfs_defer_ops_capture_and_commit(tp, capture_list); + return xfs_defer_ops_capture_and_commit(tp, NULL, capture_list); abort_error: xfs_rmap_finish_one_cleanup(tp, rcur, error); ^ permalink raw reply related [flat|nested] 29+ messages in thread
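Note the ownership contrast with the v4 posting earlier in this thread: in the v3 scheme the caller's reference is transferred outright rather than supplemented. xfs_defer_ops_capture stores the pointer without taking a new reference, so exactly one release happens regardless of which path runs. A user-space model of that transfer rule (hypothetical model_* names, not the kernel code):

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

struct model_inode {
	int refcount;
};

struct model_capture {
	struct model_inode *ip;		/* like dfc_ip */
};

static void model_irele(struct model_inode *ip)
{
	if (--ip->refcount == 0) {
		printf("inode freed\n");
		free(ip);
	}
}

/* v3: the caller's reference moves into the capture; no new reference
 * is taken, so the caller must not touch the inode once this returns. */
static struct model_capture *model_capture_and_commit(struct model_inode *ip)
{
	struct model_capture *dfc = malloc(sizeof(*dfc));

	dfc->ip = ip;			/* ownership transferred */
	return dfc;
}

static void model_continue(struct model_capture *dfc, struct model_inode **ipp)
{
	*ipp = dfc->ip;			/* the one reference passes on */
	free(dfc);
}

int main(void)
{
	struct model_inode *ip = malloc(sizeof(*ip));
	struct model_capture *dfc;

	ip->refcount = 1;			/* from intent recovery */
	dfc = model_capture_and_commit(ip);	/* caller's ref moves here */
	assert(ip->refcount == 1);		/* no ihold in the v3 scheme */
	model_continue(dfc, &ip);
	model_irele(ip);			/* exactly one release */
	return 0;
}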
* [PATCH v2 0/3] xfs: fix inode use-after-free during log recovery @ 2020-09-27 23:41 Darrick J. Wong 2020-09-27 23:41 ` [PATCH 3/3] xfs: fix an incore inode UAF in xfs_bui_recover Darrick J. Wong 0 siblings, 1 reply; 29+ messages in thread From: Darrick J. Wong @ 2020-09-27 23:41 UTC (permalink / raw) To: darrick.wong; +Cc: Dave Chinner, Christoph Hellwig, linux-xfs, david, hch Hi all, In this second series, I try to fix a use-after-free that I discovered during development of the dfops freezer, where BUI recovery releases the inode even if it requeues itself. If the inode gets reclaimed, the fs corrupts memory and explodes. The fix is to make the dfops capture struct take over ownership of the inodes if there's any more work to be done. This is a bit clunky, but it's a simpler mechanism than saving inode pointers and inode numbers and introducing tagged structures so that we can distinguish one from the other. v2: rebase atop the new defer capture code If you're going to start using this mess, you probably ought to just pull from my git trees, which are linked below. This is an extraordinary way to destroy everything. Enjoy! Comments and questions are, as always, welcome. --D kernel git tree: https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=fix-bmap-intent-recovery-5.10 --- fs/xfs/libxfs/xfs_defer.c | 45 ++++++++++++++++++++++- fs/xfs/libxfs/xfs_defer.h | 22 +++++++++++ fs/xfs/libxfs/xfs_log_recover.h | 11 +++++- fs/xfs/xfs_bmap_item.c | 78 ++++++++++++++++----------------------- fs/xfs/xfs_icache.c | 41 +++++++++++++++++++++ fs/xfs/xfs_log_recover.c | 35 +++++++++++++++--- fs/xfs/xfs_trans.h | 6 --- 7 files changed, 175 insertions(+), 63 deletions(-) ^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH 3/3] xfs: fix an incore inode UAF in xfs_bui_recover 2020-09-27 23:41 [PATCH v2 0/3] xfs: fix inode use-after-free during log recovery Darrick J. Wong @ 2020-09-27 23:41 ` Darrick J. Wong 2020-09-28 6:10 ` Dave Chinner 0 siblings, 1 reply; 29+ messages in thread From: Darrick J. Wong @ 2020-09-27 23:41 UTC (permalink / raw) To: darrick.wong; +Cc: linux-xfs, david, hch From: Darrick J. Wong <darrick.wong@oracle.com> In xfs_bui_item_recover, there exists a use-after-free bug with regards to the inode that is involved in the bmap replay operation. If the mapping operation does not complete, we call xfs_bmap_unmap_extent to create a deferred op to finish the unmapping work, and we retain a pointer to the incore inode. Unfortunately, the very next thing we do is commit the transaction and drop the inode. If reclaim tears down the inode before we try to finish the defer ops, we dereference garbage and blow up. Therefore, create a way to join inodes to the defer ops freezer so that we can maintain the xfs_inode reference until we're done with the inode. Note: This imposes the requirement that there be enough memory to keep every incore inode in memory throughout recovery. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> --- fs/xfs/libxfs/xfs_defer.c | 45 ++++++++++++++++++++++++++++++++++++++- fs/xfs/libxfs/xfs_defer.h | 22 ++++++++++++++++++- fs/xfs/libxfs/xfs_log_recover.h | 11 ++++++++-- fs/xfs/xfs_bmap_item.c | 8 ++----- fs/xfs/xfs_icache.c | 41 ++++++++++++++++++++++++++++++++++++ fs/xfs/xfs_log_recover.c | 35 +++++++++++++++++++++++++----- fs/xfs/xfs_trans.h | 6 ----- 7 files changed, 146 insertions(+), 22 deletions(-) diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c index c53443252389..5a9436d83833 100644 --- a/fs/xfs/libxfs/xfs_defer.c +++ b/fs/xfs/libxfs/xfs_defer.c @@ -16,6 +16,7 @@ #include "xfs_inode.h" #include "xfs_inode_item.h" #include "xfs_trace.h" +#include "xfs_icache.h" /* * Deferred Operations in XFS @@ -541,6 +542,18 @@ xfs_defer_move( stp->t_flags &= ~XFS_TRANS_LOWMODE; } +/* Unlock the inodes attached to this dfops capture device. */ +void +xfs_defer_capture_iunlock( + struct xfs_defer_capture *dfc) +{ + unsigned int i; + + for (i = 0; i < XFS_DEFER_OPS_NR_INODES && dfc->dfc_inodes[i]; i++) + xfs_iunlock(dfc->dfc_inodes[i], XFS_ILOCK_EXCL); + dfc->dfc_ilocked = false; +} + /* * Prepare a chain of fresh deferred ops work items to be completed later. Log * recovery requires the ability to put off until later the actual finishing @@ -553,13 +566,23 @@ xfs_defer_move( * deferred ops state is transferred to the capture structure and the * transaction is then ready for the caller to commit it. If there are no * intent items to capture, this function returns NULL. + * + * If inodes are passed in and this function returns a capture structure, the + * inodes are now owned by the capture structure. If the transaction commit + * succeeds, the caller must call xfs_defer_capture_iunlock to unlock the + * inodes before moving on to more recovery work. */ struct xfs_defer_capture * xfs_defer_capture( - struct xfs_trans *tp) + struct xfs_trans *tp, + struct xfs_inode *ip1, + struct xfs_inode *ip2) { struct xfs_defer_capture *dfc; + /* Do not pass in an ipp2 without an ipp1. 
*/ + ASSERT(ip1 || !ip2); + if (list_empty(&tp->t_dfops)) return NULL; @@ -582,6 +605,16 @@ xfs_defer_capture( dfc->dfc_tres.tr_logcount = tp->t_log_count; dfc->dfc_tres.tr_logflags = XFS_TRANS_PERM_LOG_RES; + /* + * Transfer responsibility for unlocking and releasing the inodes to + * the capture structure. + */ + dfc->dfc_ilocked = true; + if (ip1) + dfc->dfc_inodes[0] = ip1; + if (ip2 && ip2 != ip1) + dfc->dfc_inodes[1] = ip2; + return dfc; } @@ -594,6 +627,13 @@ xfs_defer_continue( ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES); ASSERT(!(tp->t_flags & XFS_TRANS_DIRTY)); + /* Lock and join the inodes to the new transaction. */ + xfs_defer_continue_inodes(dfc, tp); + if (dfc->dfc_inodes[0]) + xfs_trans_ijoin(tp, dfc->dfc_inodes[0], 0); + if (dfc->dfc_inodes[1]) + xfs_trans_ijoin(tp, dfc->dfc_inodes[1], 0); + /* Move captured dfops chain and state to the transaction. */ list_splice_init(&dfc->dfc_dfops, &tp->t_dfops); tp->t_flags |= dfc->dfc_tpflags; @@ -609,5 +649,8 @@ xfs_defer_capture_free( struct xfs_defer_capture *dfc) { xfs_defer_cancel_list(mp, &dfc->dfc_dfops); + if (dfc->dfc_ilocked) + xfs_defer_capture_iunlock(dfc); + xfs_defer_capture_irele(dfc); kmem_free(dfc); } diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h index 063276c063b6..2a4c63573e3d 100644 --- a/fs/xfs/libxfs/xfs_defer.h +++ b/fs/xfs/libxfs/xfs_defer.h @@ -10,6 +10,12 @@ struct xfs_btree_cur; struct xfs_defer_op_type; struct xfs_defer_capture; +/* + * Deferred operation item relogging limits. + */ +#define XFS_DEFER_OPS_NR_INODES 2 /* join up to two inodes */ +#define XFS_DEFER_OPS_NR_BUFS 2 /* join up to two buffers */ + /* * Header for deferred operation list. */ @@ -78,15 +84,29 @@ struct xfs_defer_capture { unsigned int dfc_tpflags; unsigned int dfc_blkres; struct xfs_trans_res dfc_tres; + + /* + * Inodes to hold when we want to finish the deferred work items. + * Always set the first element before setting the second. + */ + bool dfc_ilocked; + struct xfs_inode *dfc_inodes[XFS_DEFER_OPS_NR_INODES]; }; /* * Functions to capture a chain of deferred operations and continue them later. * This doesn't normally happen except log recovery. */ -struct xfs_defer_capture *xfs_defer_capture(struct xfs_trans *tp); +struct xfs_defer_capture *xfs_defer_capture(struct xfs_trans *tp, + struct xfs_inode *ip1, struct xfs_inode *ip2); void xfs_defer_continue(struct xfs_defer_capture *dfc, struct xfs_trans *tp); +void xfs_defer_capture_iunlock(struct xfs_defer_capture *dfc); void xfs_defer_capture_free(struct xfs_mount *mp, struct xfs_defer_capture *dfc); +/* These functions must be provided outside of libxfs. 
*/ +void xfs_defer_continue_inodes(struct xfs_defer_capture *dfc, + struct xfs_trans *tp); +void xfs_defer_capture_irele(struct xfs_defer_capture *dfc); + #endif /* __XFS_DEFER_H__ */ diff --git a/fs/xfs/libxfs/xfs_log_recover.h b/fs/xfs/libxfs/xfs_log_recover.h index 8ad44b4195e8..1c87106a3a25 100644 --- a/fs/xfs/libxfs/xfs_log_recover.h +++ b/fs/xfs/libxfs/xfs_log_recover.h @@ -124,7 +124,14 @@ bool xlog_is_buffer_cancelled(struct xlog *log, xfs_daddr_t blkno, uint len); void xlog_recover_release_intent(struct xlog *log, unsigned short intent_type, uint64_t intent_id); -int xlog_recover_trans_commit(struct xfs_trans *tp, - struct list_head *capture_list); +int xlog_recover_trans_commit_inodes(struct xfs_trans *tp, + struct list_head *capture_list, struct xfs_inode *ip1, + struct xfs_inode *ip2); + +static inline int +xlog_recover_trans_commit(struct xfs_trans *tp, struct list_head *capture_list) +{ + return xlog_recover_trans_commit_inodes(tp, capture_list, NULL, NULL); +} #endif /* __XFS_LOG_RECOVER_H__ */ diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c index b6d3a5766148..fa52bfd66ce1 100644 --- a/fs/xfs/xfs_bmap_item.c +++ b/fs/xfs/xfs_bmap_item.c @@ -513,15 +513,11 @@ xfs_bui_item_recover( xfs_bmap_unmap_extent(tp, ip, &irec); } - /* Commit transaction, which frees tp. */ - error = xlog_recover_trans_commit(tp, capture_list); - if (error) - goto err_unlock; - return 0; + /* Commit transaction, which frees the transaction and the inode. */ + return xlog_recover_trans_commit_inodes(tp, capture_list, ip, NULL); err_cancel: xfs_trans_cancel(tp); -err_unlock: xfs_iunlock(ip, XFS_ILOCK_EXCL); err_rele: xfs_irele(ip); diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c index deb99300d171..c7f65e16534f 100644 --- a/fs/xfs/xfs_icache.c +++ b/fs/xfs/xfs_icache.c @@ -12,6 +12,7 @@ #include "xfs_sb.h" #include "xfs_mount.h" #include "xfs_inode.h" +#include "xfs_defer.h" #include "xfs_trans.h" #include "xfs_trans_priv.h" #include "xfs_inode_item.h" @@ -1689,3 +1690,43 @@ xfs_start_block_reaping( xfs_queue_eofblocks(mp); xfs_queue_cowblocks(mp); } + +/* + * Prepare the inodes to participate in further log intent item recovery. + * For now, that means attaching dquots and locking them, since libxfs doesn't + * know how to do that. + */ +void +xfs_defer_continue_inodes( + struct xfs_defer_capture *dfc, + struct xfs_trans *tp) +{ + int i; + int error; + + for (i = 0; i < XFS_DEFER_OPS_NR_INODES && dfc->dfc_inodes[i]; i++) { + error = xfs_qm_dqattach(dfc->dfc_inodes[i]); + if (error) + tp->t_mountp->m_qflags &= ~XFS_ALL_QUOTA_CHKD; + } + + if (dfc->dfc_inodes[1]) + xfs_lock_two_inodes(dfc->dfc_inodes[0], XFS_ILOCK_EXCL, + dfc->dfc_inodes[1], XFS_ILOCK_EXCL); + else if (dfc->dfc_inodes[0]) + xfs_ilock(dfc->dfc_inodes[0], XFS_ILOCK_EXCL); + dfc->dfc_ilocked = true; +} + +/* Release all the inodes attached to this dfops capture device. 
*/ +void +xfs_defer_capture_irele( + struct xfs_defer_capture *dfc) +{ + unsigned int i; + + for (i = 0; i < XFS_DEFER_OPS_NR_INODES && dfc->dfc_inodes[i]; i++) { + xfs_irele(dfc->dfc_inodes[i]); + dfc->dfc_inodes[i] = NULL; + } +} diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c index 0d899ab7df2e..1463c3097240 100644 --- a/fs/xfs/xfs_log_recover.c +++ b/fs/xfs/xfs_log_recover.c @@ -1755,23 +1755,43 @@ xlog_recover_release_intent( spin_unlock(&ailp->ail_lock); } +static inline void +xlog_recover_irele( + struct xfs_inode *ip) +{ + xfs_iunlock(ip, XFS_ILOCK_EXCL); + xfs_irele(ip); +} + /* * Capture any deferred ops and commit the transaction. This is the last step * needed to finish a log intent item that we recovered from the log, and will - * take care of releasing all the relevant resources. + * take care of releasing all the relevant resources. If we captured deferred + * ops, the inodes are attached to it and must not be touched. If not, we have + * to unlock and free them ourselves. */ int -xlog_recover_trans_commit( +xlog_recover_trans_commit_inodes( struct xfs_trans *tp, - struct list_head *capture_list) + struct list_head *capture_list, + struct xfs_inode *ip1, + struct xfs_inode *ip2) { struct xfs_mount *mp = tp->t_mountp; - struct xfs_defer_capture *dfc = xfs_defer_capture(tp); + struct xfs_defer_capture *dfc = xfs_defer_capture(tp, ip1, ip2); int error; /* If we don't capture anything, commit tp and exit. */ - if (!dfc) - return xfs_trans_commit(tp); + if (!dfc) { + error = xfs_trans_commit(tp); + + /* We still own the inodes, so unlock and release them. */ + if (ip2 && ip2 != ip1) + xlog_recover_irele(ip2); + if (ip1) + xlog_recover_irele(ip1); + return error; + } /* * Commit the transaction. If that fails, clean up the defer ops and @@ -1783,6 +1803,9 @@ xlog_recover_trans_commit( return error; } + /* Unlock the captured inodes so that we can move on with recovery. */ + xfs_defer_capture_iunlock(dfc); + list_add_tail(&dfc->dfc_list, capture_list); return 0; } diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h index e3875a92a541..2e76e8c16e91 100644 --- a/fs/xfs/xfs_trans.h +++ b/fs/xfs/xfs_trans.h @@ -112,12 +112,6 @@ void xfs_log_item_init(struct xfs_mount *mp, struct xfs_log_item *item, #define XFS_ITEM_LOCKED 2 #define XFS_ITEM_FLUSHING 3 -/* - * Deferred operation item relogging limits. - */ -#define XFS_DEFER_OPS_NR_INODES 2 /* join up to two inodes */ -#define XFS_DEFER_OPS_NR_BUFS 2 /* join up to two buffers */ - /* * This is the structure maintained for every active transaction. */ ^ permalink raw reply related [flat|nested] 29+ messages in thread
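This two-inode version leans on xfs_lock_two_inodes, whose job is to impose one global acquisition order on inode lock pairs so that two tasks locking (A, B) and (B, A) cannot deadlock against each other. A hypothetical user-space sketch of that ordering rule, with pthread mutexes standing in for ILOCK_EXCL and ascending inode number as the canonical order, which is the order XFS uses:

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

struct model_inode {
	uint64_t i_ino;
	pthread_mutex_t ilock;		/* stands in for ILOCK_EXCL */
};

/*
 * Sketch of the lock-two-inodes idea: always acquire the pair in a
 * fixed order (lower inode number first) so that callers passing the
 * same pair in either order take the locks identically.
 */
static void lock_two_inodes(struct model_inode *a, struct model_inode *b)
{
	if (a->i_ino > b->i_ino) {
		struct model_inode *tmp = a;
		a = b;
		b = tmp;
	}
	pthread_mutex_lock(&a->ilock);
	pthread_mutex_lock(&b->ilock);
}

static void unlock_two_inodes(struct model_inode *a, struct model_inode *b)
{
	pthread_mutex_unlock(&a->ilock);
	pthread_mutex_unlock(&b->ilock);
}

int main(void)
{
	static struct model_inode x = { 128, PTHREAD_MUTEX_INITIALIZER };
	static struct model_inode y = { 129, PTHREAD_MUTEX_INITIALIZER };

	/* Both argument orders acquire the locks in the same sequence. */
	lock_two_inodes(&y, &x);
	unlock_two_inodes(&x, &y);
	printf("pair locked and unlocked in canonical order\n");
	return 0;
}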
* Re: [PATCH 3/3] xfs: fix an incore inode UAF in xfs_bui_recover 2020-09-27 23:41 ` [PATCH 3/3] xfs: fix an incore inode UAF in xfs_bui_recover Darrick J. Wong @ 2020-09-28 6:10 ` Dave Chinner 2020-09-28 17:02 ` Darrick J. Wong 0 siblings, 1 reply; 29+ messages in thread From: Dave Chinner @ 2020-09-28 6:10 UTC (permalink / raw) To: Darrick J. Wong; +Cc: linux-xfs, hch On Sun, Sep 27, 2020 at 04:41:56PM -0700, Darrick J. Wong wrote: > From: Darrick J. Wong <darrick.wong@oracle.com> > > In xfs_bui_item_recover, there exists a use-after-free bug with regards > to the inode that is involved in the bmap replay operation. If the > mapping operation does not complete, we call xfs_bmap_unmap_extent to > create a deferred op to finish the unmapping work, and we retain a > pointer to the incore inode. > > Unfortunately, the very next thing we do is commit the transaction and > drop the inode. If reclaim tears down the inode before we try to finish > the defer ops, we dereference garbage and blow up. Therefore, create a > way to join inodes to the defer ops freezer so that we can maintain the > xfs_inode reference until we're done with the inode. Honest first reaction now I understand what the capture stuff is doing: Ewww! Gross! We only need to store a single inode, so the whole "2 inodes for symmetry with defer_ops" greatly overcomplicates the code. This could be *much* simpler. > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > index deb99300d171..c7f65e16534f 100644 > --- a/fs/xfs/xfs_icache.c > +++ b/fs/xfs/xfs_icache.c > @@ -12,6 +12,7 @@ > #include "xfs_sb.h" > #include "xfs_mount.h" > #include "xfs_inode.h" > +#include "xfs_defer.h" > #include "xfs_trans.h" > #include "xfs_trans_priv.h" > #include "xfs_inode_item.h" > @@ -1689,3 +1690,43 @@ xfs_start_block_reaping( > xfs_queue_eofblocks(mp); > xfs_queue_cowblocks(mp); > } > + > +/* > + * Prepare the inodes to participate in further log intent item recovery. > + * For now, that means attaching dquots and locking them, since libxfs doesn't > + * know how to do that. > + */ > +void > +xfs_defer_continue_inodes( > + struct xfs_defer_capture *dfc, > + struct xfs_trans *tp) > +{ > + int i; > + int error; > + > + for (i = 0; i < XFS_DEFER_OPS_NR_INODES && dfc->dfc_inodes[i]; i++) { > + error = xfs_qm_dqattach(dfc->dfc_inodes[i]); > + if (error) > + tp->t_mountp->m_qflags &= ~XFS_ALL_QUOTA_CHKD; > + } > + > + if (dfc->dfc_inodes[1]) > + xfs_lock_two_inodes(dfc->dfc_inodes[0], XFS_ILOCK_EXCL, > + dfc->dfc_inodes[1], XFS_ILOCK_EXCL); > + else if (dfc->dfc_inodes[0]) > + xfs_ilock(dfc->dfc_inodes[0], XFS_ILOCK_EXCL); > + dfc->dfc_ilocked = true; > +} > + > +/* Release all the inodes attached to this dfops capture device. */ > +void > +xfs_defer_capture_irele( > + struct xfs_defer_capture *dfc) > +{ > + unsigned int i; > + > + for (i = 0; i < XFS_DEFER_OPS_NR_INODES && dfc->dfc_inodes[i]; i++) { > + xfs_irele(dfc->dfc_inodes[i]); > + dfc->dfc_inodes[i] = NULL; > + } > +} None of this belongs in xfs_icache.c. The function namespace tells me where it should be... > diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c > index 0d899ab7df2e..1463c3097240 100644 > --- a/fs/xfs/xfs_log_recover.c > +++ b/fs/xfs/xfs_log_recover.c > @@ -1755,23 +1755,43 @@ xlog_recover_release_intent( > spin_unlock(&ailp->ail_lock); > } > > +static inline void > +xlog_recover_irele( > + struct xfs_inode *ip) > +{ > + xfs_iunlock(ip, XFS_ILOCK_EXCL); > + xfs_irele(ip); > +} Just open code it, please. 
> int > -xlog_recover_trans_commit( > +xlog_recover_trans_commit_inodes( > struct xfs_trans *tp, > - struct list_head *capture_list) > + struct list_head *capture_list, > + struct xfs_inode *ip1, > + struct xfs_inode *ip2) So are these inodes supposed to be locked, referenced and/or ??? > { > struct xfs_mount *mp = tp->t_mountp; > - struct xfs_defer_capture *dfc = xfs_defer_capture(tp); > + struct xfs_defer_capture *dfc = xfs_defer_capture(tp, ip1, ip2); > int error; That's the second time putting this logic up in the declaration list has made me wonder where something in this function is initialised. Please move it into the code so that it is obvious. > > /* If we don't capture anything, commit tp and exit. */ > - if (!dfc) > - return xfs_trans_commit(tp); > + if (!dfc) { i.e. before this line. dfc = xfs_defer_capture(tp, ip1, ip2); if (!dfc) { > + error = xfs_trans_commit(tp); > + > + /* We still own the inodes, so unlock and release them. */ > + if (ip2 && ip2 != ip1) > + xlog_recover_irele(ip2); > + if (ip1) > + xlog_recover_irele(ip1); > + return error; > + } Not a fan of the unnecessary complexity of this. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 3/3] xfs: fix an incore inode UAF in xfs_bui_recover 2020-09-28 6:10 ` Dave Chinner @ 2020-09-28 17:02 ` Darrick J. Wong 0 siblings, 0 replies; 29+ messages in thread From: Darrick J. Wong @ 2020-09-28 17:02 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs, hch On Mon, Sep 28, 2020 at 04:10:46PM +1000, Dave Chinner wrote: > On Sun, Sep 27, 2020 at 04:41:56PM -0700, Darrick J. Wong wrote: > > From: Darrick J. Wong <darrick.wong@oracle.com> > > > > In xfs_bui_item_recover, there exists a use-after-free bug with regards > > to the inode that is involved in the bmap replay operation. If the > > mapping operation does not complete, we call xfs_bmap_unmap_extent to > > create a deferred op to finish the unmapping work, and we retain a > > pointer to the incore inode. > > > > Unfortunately, the very next thing we do is commit the transaction and > > drop the inode. If reclaim tears down the inode before we try to finish > > the defer ops, we dereference garbage and blow up. Therefore, create a > > way to join inodes to the defer ops freezer so that we can maintain the > > xfs_inode reference until we're done with the inode. > > Honest first reaction now I understand what the capture stuff is > doing: Ewww! Gross! Yes, the whole thing is gross. Honestly, I wish I could go back in time to 2016 to warn myself that we would need a way to reassemble entire runtime transactions + dfops chains so that we could avoid all this. > We only need to store a single inode, so the whole "2 inodes for > symmetry with defer_ops" greatly overcomplicates the code. This > could be *much* simpler. Indeed, see my comment at the very end. > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > > index deb99300d171..c7f65e16534f 100644 > > --- a/fs/xfs/xfs_icache.c > > +++ b/fs/xfs/xfs_icache.c > > @@ -12,6 +12,7 @@ > > #include "xfs_sb.h" > > #include "xfs_mount.h" > > #include "xfs_inode.h" > > +#include "xfs_defer.h" > > #include "xfs_trans.h" > > #include "xfs_trans_priv.h" > > #include "xfs_inode_item.h" > > @@ -1689,3 +1690,43 @@ xfs_start_block_reaping( > > xfs_queue_eofblocks(mp); > > xfs_queue_cowblocks(mp); > > } > > + > > +/* > > + * Prepare the inodes to participate in further log intent item recovery. > > + * For now, that means attaching dquots and locking them, since libxfs doesn't > > + * know how to do that. > > + */ > > +void > > +xfs_defer_continue_inodes( > > + struct xfs_defer_capture *dfc, > > + struct xfs_trans *tp) > > +{ > > + int i; > > + int error; > > + > > + for (i = 0; i < XFS_DEFER_OPS_NR_INODES && dfc->dfc_inodes[i]; i++) { > > + error = xfs_qm_dqattach(dfc->dfc_inodes[i]); > > + if (error) > > + tp->t_mountp->m_qflags &= ~XFS_ALL_QUOTA_CHKD; > > + } > > + > > + if (dfc->dfc_inodes[1]) > > + xfs_lock_two_inodes(dfc->dfc_inodes[0], XFS_ILOCK_EXCL, > > + dfc->dfc_inodes[1], XFS_ILOCK_EXCL); > > + else if (dfc->dfc_inodes[0]) > > + xfs_ilock(dfc->dfc_inodes[0], XFS_ILOCK_EXCL); > > + dfc->dfc_ilocked = true; > > +} > > + > > +/* Release all the inodes attached to this dfops capture device. */ > > +void > > +xfs_defer_capture_irele( > > + struct xfs_defer_capture *dfc) > > +{ > > + unsigned int i; > > + > > + for (i = 0; i < XFS_DEFER_OPS_NR_INODES && dfc->dfc_inodes[i]; i++) { > > + xfs_irele(dfc->dfc_inodes[i]); > > + dfc->dfc_inodes[i] = NULL; > > + } > > +} > > None of this belongs in xfs_icache.c. The function namespace tells > me where it should be... Agreed. 
Originally this couldn't really be in libxfs because xfs_iget has a different method signature in userspace, but now that we're just storing the inode pointers directly, there's no need to split this anymore. > > diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c > > index 0d899ab7df2e..1463c3097240 100644 > > --- a/fs/xfs/xfs_log_recover.c > > +++ b/fs/xfs/xfs_log_recover.c > > @@ -1755,23 +1755,43 @@ xlog_recover_release_intent( > > spin_unlock(&ailp->ail_lock); > > } > > > > +static inline void > > +xlog_recover_irele( > > + struct xfs_inode *ip) > > +{ > > + xfs_iunlock(ip, XFS_ILOCK_EXCL); > > + xfs_irele(ip); > > +} > > Just open code it, please. > > > int > > -xlog_recover_trans_commit( > > +xlog_recover_trans_commit_inodes( > > struct xfs_trans *tp, > > - struct list_head *capture_list) > > + struct list_head *capture_list, > > + struct xfs_inode *ip1, > > + struct xfs_inode *ip2) > > So are these inodes supposed to be locked, referenced and/or ??? ILOCK'd and referenced. > > { > > struct xfs_mount *mp = tp->t_mountp; > > - struct xfs_defer_capture *dfc = xfs_defer_capture(tp); > > + struct xfs_defer_capture *dfc = xfs_defer_capture(tp, ip1, ip2); > > int error; > > That's the second time putting this logic up in the declaration list > has made me wonder where something in this function is initilaised. > Please move it into the code so that it is obvious. > > > > > /* If we don't capture anything, commit tp and exit. */ > > - if (!dfc) > > - return xfs_trans_commit(tp); > > + if (!dfc) { > > i.e. before this line. > > dfc = xfs_defer_capture(tp, ip1, ip2); > if (!dfc) { Ok. > > > + error = xfs_trans_commit(tp); > > + > > + /* We still own the inodes, so unlock and release them. */ > > + if (ip2 && ip2 != ip1) > > + xlog_recover_irele(ip2); > > + if (ip1) > > + xlog_recover_irele(ip1); > > + return error; > > + } > > Not a fan of the unnecessary complexity of this. Yeah, I got ahead of myself -- for atomic extent swapping we'll need to be able to capture two inodes, so I went straight for the end goal. I'll rip it out to simplify things for now, but this all will come back in some form... --D > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com ^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH v2 0/3] xfs: fix inode use-after-free during log recovery @ 2020-05-05 1:13 Darrick J. Wong 2020-05-05 1:13 ` [PATCH 3/3] xfs: fix an incore inode UAF in xfs_bui_recover Darrick J. Wong 0 siblings, 1 reply; 29+ messages in thread From: Darrick J. Wong @ 2020-05-05 1:13 UTC (permalink / raw) To: darrick.wong; +Cc: linux-xfs Hi all, Fix a use-after-free during log recovery of deferred operations by creating explicit freeze and thaw mechanisms for deferred ops that were created while processing intent items that were recovered from the log. While we're at it, fix all the bogosity around how we gather up log intents during recovery and actually commit them to the filesystem. If you're going to start using this mess, you probably ought to just pull from my git trees, which are linked below. This is an extraordinary way to destroy everything. Enjoy! Comments and questions are, as always, welcome. --D kernel git tree: https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=fix-log-recovery xfsprogs git tree: https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=fix-log-recovery ^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH 3/3] xfs: fix an incore inode UAF in xfs_bui_recover 2020-05-05 1:13 [PATCH v2 0/3] xfs: fix inode use-after-free during log recovery Darrick J. Wong @ 2020-05-05 1:13 ` Darrick J. Wong 2020-05-05 14:11 ` Brian Foster 0 siblings, 1 reply; 29+ messages in thread From: Darrick J. Wong @ 2020-05-05 1:13 UTC (permalink / raw) To: darrick.wong; +Cc: linux-xfs From: Darrick J. Wong <darrick.wong@oracle.com> In xfs_bui_item_recover, there exists a use-after-free bug with regards to the inode that is involved in the bmap replay operation. If the mapping operation does not complete, we call xfs_bmap_unmap_extent to create a deferred op to finish the unmapping work, and we retain a pointer to the incore inode. Unfortunately, the very next thing we do is commit the transaction and drop the inode. If reclaim tears down the inode before we try to finish the defer ops, we dereference garbage and blow up. Therefore, create a way to join inodes to the defer ops freezer so that we can maintain the xfs_inode reference until we're done with the inode. Note: This imposes the requirement that there be enough memory to keep every incore inode in memory throughout recovery. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> --- fs/xfs/libxfs/xfs_defer.c | 50 +++++++++++++++++++++++++++++++++++++++++++++ fs/xfs/libxfs/xfs_defer.h | 10 +++++++++ fs/xfs/xfs_bmap_item.c | 7 ++++-- fs/xfs/xfs_icache.c | 19 +++++++++++++++++ 4 files changed, 83 insertions(+), 3 deletions(-) diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c index ea4d28851bbd..72933fdafcb2 100644 --- a/fs/xfs/libxfs/xfs_defer.c +++ b/fs/xfs/libxfs/xfs_defer.c @@ -16,6 +16,7 @@ #include "xfs_inode.h" #include "xfs_inode_item.h" #include "xfs_trace.h" +#include "xfs_icache.h" /* * Deferred Operations in XFS @@ -583,8 +584,19 @@ xfs_defer_thaw( struct xfs_defer_freezer *dff, struct xfs_trans *tp) { + int i; + ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES); + /* Re-acquire the inode locks. */ + for (i = 0; i < XFS_DEFER_FREEZER_INODES; i++) { + if (!dff->dff_inodes[i]) + break; + + dff->dff_ilocks[i] = XFS_ILOCK_EXCL; + xfs_ilock(dff->dff_inodes[i], dff->dff_ilocks[i]); + } + /* Add the dfops items to the transaction. */ list_splice_init(&dff->dff_dfops, &tp->t_dfops); tp->t_flags |= dff->dff_tpflags; @@ -597,5 +609,43 @@ xfs_defer_freeezer_finish( struct xfs_defer_freezer *dff) { xfs_defer_cancel_list(mp, &dff->dff_dfops); + xfs_defer_freezer_irele(dff); kmem_free(dff); } + +/* + * Attach an inode to this deferred ops freezer. Callers must hold ILOCK_EXCL, + * which will be dropped and reacquired when we're ready to thaw the frozen + * deferred ops. + */ +int +xfs_defer_freezer_ijoin( + struct xfs_defer_freezer *dff, + struct xfs_inode *ip) +{ + unsigned int i; + + ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL)); + + for (i = 0; i < XFS_DEFER_FREEZER_INODES; i++) { + if (dff->dff_inodes[i] == ip) + goto out; + if (dff->dff_inodes[i] == NULL) + break; + } + + if (i == XFS_DEFER_FREEZER_INODES) { + ASSERT(0); + return -EFSCORRUPTED; + } + + /* + * Attach this inode to the freezer and drop its ILOCK because we + * assume the caller will need to allocate a transaction. 
+ */ + dff->dff_inodes[i] = ip; + dff->dff_ilocks[i] = 0; +out: + xfs_iunlock(ip, XFS_ILOCK_EXCL); + return 0; +} diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h index 7ae05e10d750..0052a0313283 100644 --- a/fs/xfs/libxfs/xfs_defer.h +++ b/fs/xfs/libxfs/xfs_defer.h @@ -76,6 +76,11 @@ struct xfs_defer_freezer { /* Deferred ops state saved from the transaction. */ struct list_head dff_dfops; unsigned int dff_tpflags; + + /* Inodes to hold when we want to finish the deferred work items. */ +#define XFS_DEFER_FREEZER_INODES 2 + unsigned int dff_ilocks[XFS_DEFER_FREEZER_INODES]; + struct xfs_inode *dff_inodes[XFS_DEFER_FREEZER_INODES]; }; /* Functions to freeze a chain of deferred operations for later. */ @@ -83,5 +88,10 @@ int xfs_defer_freeze(struct xfs_trans *tp, struct xfs_defer_freezer **dffp); void xfs_defer_thaw(struct xfs_defer_freezer *dff, struct xfs_trans *tp); void xfs_defer_freeezer_finish(struct xfs_mount *mp, struct xfs_defer_freezer *dff); +int xfs_defer_freezer_ijoin(struct xfs_defer_freezer *dff, + struct xfs_inode *ip); + +/* These functions must be provided by the xfs implementation. */ +void xfs_defer_freezer_irele(struct xfs_defer_freezer *dff); #endif /* __XFS_DEFER_H__ */ diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c index c733bdeeeb9b..bbce191d8fcd 100644 --- a/fs/xfs/xfs_bmap_item.c +++ b/fs/xfs/xfs_bmap_item.c @@ -530,12 +530,13 @@ xfs_bui_item_recover( } error = xlog_recover_trans_commit(tp, dffp); - xfs_iunlock(ip, XFS_ILOCK_EXCL); - xfs_irele(ip); - return error; + if (error) + goto err_rele; + return xfs_defer_freezer_ijoin(*dffp, ip); err_inode: xfs_trans_cancel(tp); +err_rele: if (ip) { xfs_iunlock(ip, XFS_ILOCK_EXCL); xfs_irele(ip); diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c index 17a0b86fe701..b96ddf5ff334 100644 --- a/fs/xfs/xfs_icache.c +++ b/fs/xfs/xfs_icache.c @@ -12,6 +12,7 @@ #include "xfs_sb.h" #include "xfs_mount.h" #include "xfs_inode.h" +#include "xfs_defer.h" #include "xfs_trans.h" #include "xfs_trans_priv.h" #include "xfs_inode_item.h" @@ -1847,3 +1848,21 @@ xfs_start_block_reaping( xfs_queue_eofblocks(mp); xfs_queue_cowblocks(mp); } + +/* Release all the inode resources attached to this freezer. */ +void +xfs_defer_freezer_irele( + struct xfs_defer_freezer *dff) +{ + unsigned int i; + + for (i = 0; i < XFS_DEFER_FREEZER_INODES; i++) { + if (!dff->dff_inodes[i]) + break; + + if (dff->dff_ilocks[i]) + xfs_iunlock(dff->dff_inodes[i], dff->dff_ilocks[i]); + xfs_irele(dff->dff_inodes[i]); + dff->dff_inodes[i] = NULL; + } +} ^ permalink raw reply related [flat|nested] 29+ messages in thread
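The core move of the freezer shows up in xfs_defer_thaw above: the pending dfops chain is moved wholesale, frozen off the committing transaction into the dff and then spliced onto a fresh transaction by list_splice_init. A user-space model of that detach-and-splice step, using a minimal singly linked list in place of the kernel's intrusive list_head (hypothetical names throughout):

#include <stddef.h>
#include <stdio.h>

/* Minimal stand-ins: a singly linked dfops chain with a tail pointer. */
struct dfops_item {
	const char *name;
	struct dfops_item *next;
};

struct dfops_list {
	struct dfops_item *head, *tail;
};

/* "Freeze": detach the whole pending chain from the old transaction. */
static struct dfops_list defer_freeze(struct dfops_list *tp_dfops)
{
	struct dfops_list frozen = *tp_dfops;

	tp_dfops->head = tp_dfops->tail = NULL;	/* like list_splice_init */
	return frozen;
}

/* "Thaw": splice the frozen chain onto a brand-new transaction. */
static void defer_thaw(struct dfops_list *frozen, struct dfops_list *tp_dfops)
{
	if (!frozen->head)
		return;
	if (tp_dfops->tail)
		tp_dfops->tail->next = frozen->head;
	else
		tp_dfops->head = frozen->head;
	tp_dfops->tail = frozen->tail;
	frozen->head = frozen->tail = NULL;
}

int main(void)
{
	struct dfops_item unmap = { "bmap unmap", NULL };
	struct dfops_list old_tp = { &unmap, &unmap };
	struct dfops_list new_tp = { NULL, NULL };
	struct dfops_list frozen;

	frozen = defer_freeze(&old_tp);	/* old transaction can now commit */
	defer_thaw(&frozen, &new_tp);	/* new transaction finishes the work */

	for (struct dfops_item *i = new_tp.head; i; i = i->next)
		printf("pending: %s\n", i->name);
	return 0;
}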
* Re: [PATCH 3/3] xfs: fix an incore inode UAF in xfs_bui_recover 2020-05-05 1:13 ` [PATCH 3/3] xfs: fix an incore inode UAF in xfs_bui_recover Darrick J. Wong @ 2020-05-05 14:11 ` Brian Foster 2020-05-06 0:34 ` Darrick J. Wong 0 siblings, 1 reply; 29+ messages in thread From: Brian Foster @ 2020-05-05 14:11 UTC (permalink / raw) To: Darrick J. Wong; +Cc: linux-xfs On Mon, May 04, 2020 at 06:13:53PM -0700, Darrick J. Wong wrote: > From: Darrick J. Wong <darrick.wong@oracle.com> > > In xfs_bui_item_recover, there exists a use-after-free bug with regards > to the inode that is involved in the bmap replay operation. If the > mapping operation does not complete, we call xfs_bmap_unmap_extent to > create a deferred op to finish the unmapping work, and we retain a > pointer to the incore inode. > > Unfortunately, the very next thing we do is commit the transaction and > drop the inode. If reclaim tears down the inode before we try to finish > the defer ops, we dereference garbage and blow up. Therefore, create a > way to join inodes to the defer ops freezer so that we can maintain the > xfs_inode reference until we're done with the inode. > > Note: This imposes the requirement that there be enough memory to keep > every incore inode in memory throughout recovery. > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> > --- Maybe I'm missing something, but I thought the discussion on the previous version[1] landed on an approach where the intent would hold a reference to the inode. Wouldn't that break the dependency on the dfops freeze/thaw mechanism? Brian [1] https://lore.kernel.org/linux-xfs/20200429235818.GX6742@magnolia/ > fs/xfs/libxfs/xfs_defer.c | 50 +++++++++++++++++++++++++++++++++++++++++++++ > fs/xfs/libxfs/xfs_defer.h | 10 +++++++++ > fs/xfs/xfs_bmap_item.c | 7 ++++-- > fs/xfs/xfs_icache.c | 19 +++++++++++++++++ > 4 files changed, 83 insertions(+), 3 deletions(-) > > > diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c > index ea4d28851bbd..72933fdafcb2 100644 > --- a/fs/xfs/libxfs/xfs_defer.c > +++ b/fs/xfs/libxfs/xfs_defer.c > @@ -16,6 +16,7 @@ > #include "xfs_inode.h" > #include "xfs_inode_item.h" > #include "xfs_trace.h" > +#include "xfs_icache.h" > > /* > * Deferred Operations in XFS > @@ -583,8 +584,19 @@ xfs_defer_thaw( > struct xfs_defer_freezer *dff, > struct xfs_trans *tp) > { > + int i; > + > ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES); > > + /* Re-acquire the inode locks. */ > + for (i = 0; i < XFS_DEFER_FREEZER_INODES; i++) { > + if (!dff->dff_inodes[i]) > + break; > + > + dff->dff_ilocks[i] = XFS_ILOCK_EXCL; > + xfs_ilock(dff->dff_inodes[i], dff->dff_ilocks[i]); > + } > + > /* Add the dfops items to the transaction. */ > list_splice_init(&dff->dff_dfops, &tp->t_dfops); > tp->t_flags |= dff->dff_tpflags; > @@ -597,5 +609,43 @@ xfs_defer_freeezer_finish( > struct xfs_defer_freezer *dff) > { > xfs_defer_cancel_list(mp, &dff->dff_dfops); > + xfs_defer_freezer_irele(dff); > kmem_free(dff); > } > + > +/* > + * Attach an inode to this deferred ops freezer. Callers must hold ILOCK_EXCL, > + * which will be dropped and reacquired when we're ready to thaw the frozen > + * deferred ops. 
> + */ > +int > +xfs_defer_freezer_ijoin( > + struct xfs_defer_freezer *dff, > + struct xfs_inode *ip) > +{ > + unsigned int i; > + > + ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL)); > + > + for (i = 0; i < XFS_DEFER_FREEZER_INODES; i++) { > + if (dff->dff_inodes[i] == ip) > + goto out; > + if (dff->dff_inodes[i] == NULL) > + break; > + } > + > + if (i == XFS_DEFER_FREEZER_INODES) { > + ASSERT(0); > + return -EFSCORRUPTED; > + } > + > + /* > + * Attach this inode to the freezer and drop its ILOCK because we > + * assume the caller will need to allocate a transaction. > + */ > + dff->dff_inodes[i] = ip; > + dff->dff_ilocks[i] = 0; > +out: > + xfs_iunlock(ip, XFS_ILOCK_EXCL); > + return 0; > +} > diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h > index 7ae05e10d750..0052a0313283 100644 > --- a/fs/xfs/libxfs/xfs_defer.h > +++ b/fs/xfs/libxfs/xfs_defer.h > @@ -76,6 +76,11 @@ struct xfs_defer_freezer { > /* Deferred ops state saved from the transaction. */ > struct list_head dff_dfops; > unsigned int dff_tpflags; > + > + /* Inodes to hold when we want to finish the deferred work items. */ > +#define XFS_DEFER_FREEZER_INODES 2 > + unsigned int dff_ilocks[XFS_DEFER_FREEZER_INODES]; > + struct xfs_inode *dff_inodes[XFS_DEFER_FREEZER_INODES]; > }; > > /* Functions to freeze a chain of deferred operations for later. */ > @@ -83,5 +88,10 @@ int xfs_defer_freeze(struct xfs_trans *tp, struct xfs_defer_freezer **dffp); > void xfs_defer_thaw(struct xfs_defer_freezer *dff, struct xfs_trans *tp); > void xfs_defer_freeezer_finish(struct xfs_mount *mp, > struct xfs_defer_freezer *dff); > +int xfs_defer_freezer_ijoin(struct xfs_defer_freezer *dff, > + struct xfs_inode *ip); > + > +/* These functions must be provided by the xfs implementation. */ > +void xfs_defer_freezer_irele(struct xfs_defer_freezer *dff); > > #endif /* __XFS_DEFER_H__ */ > diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c > index c733bdeeeb9b..bbce191d8fcd 100644 > --- a/fs/xfs/xfs_bmap_item.c > +++ b/fs/xfs/xfs_bmap_item.c > @@ -530,12 +530,13 @@ xfs_bui_item_recover( > } > > error = xlog_recover_trans_commit(tp, dffp); > - xfs_iunlock(ip, XFS_ILOCK_EXCL); > - xfs_irele(ip); > - return error; > + if (error) > + goto err_rele; > + return xfs_defer_freezer_ijoin(*dffp, ip); > > err_inode: > xfs_trans_cancel(tp); > +err_rele: > if (ip) { > xfs_iunlock(ip, XFS_ILOCK_EXCL); > xfs_irele(ip); > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > index 17a0b86fe701..b96ddf5ff334 100644 > --- a/fs/xfs/xfs_icache.c > +++ b/fs/xfs/xfs_icache.c > @@ -12,6 +12,7 @@ > #include "xfs_sb.h" > #include "xfs_mount.h" > #include "xfs_inode.h" > +#include "xfs_defer.h" > #include "xfs_trans.h" > #include "xfs_trans_priv.h" > #include "xfs_inode_item.h" > @@ -1847,3 +1848,21 @@ xfs_start_block_reaping( > xfs_queue_eofblocks(mp); > xfs_queue_cowblocks(mp); > } > + > +/* Release all the inode resources attached to this freezer. */ > +void > +xfs_defer_freezer_irele( > + struct xfs_defer_freezer *dff) > +{ > + unsigned int i; > + > + for (i = 0; i < XFS_DEFER_FREEZER_INODES; i++) { > + if (!dff->dff_inodes[i]) > + break; > + > + if (dff->dff_ilocks[i]) > + xfs_iunlock(dff->dff_inodes[i], dff->dff_ilocks[i]); > + xfs_irele(dff->dff_inodes[i]); > + dff->dff_inodes[i] = NULL; > + } > +} > ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 3/3] xfs: fix an incore inode UAF in xfs_bui_recover 2020-05-05 14:11 ` Brian Foster @ 2020-05-06 0:34 ` Darrick J. Wong 2020-05-06 13:56 ` Brian Foster 0 siblings, 1 reply; 29+ messages in thread From: Darrick J. Wong @ 2020-05-06 0:34 UTC (permalink / raw) To: Brian Foster; +Cc: linux-xfs On Tue, May 05, 2020 at 10:11:38AM -0400, Brian Foster wrote: > On Mon, May 04, 2020 at 06:13:53PM -0700, Darrick J. Wong wrote: > > From: Darrick J. Wong <darrick.wong@oracle.com> > > > > In xfs_bui_item_recover, there exists a use-after-free bug with regards > > to the inode that is involved in the bmap replay operation. If the > > mapping operation does not complete, we call xfs_bmap_unmap_extent to > > create a deferred op to finish the unmapping work, and we retain a > > pointer to the incore inode. > > > > Unfortunately, the very next thing we do is commit the transaction and > > drop the inode. If reclaim tears down the inode before we try to finish > > the defer ops, we dereference garbage and blow up. Therefore, create a > > way to join inodes to the defer ops freezer so that we can maintain the > > xfs_inode reference until we're done with the inode. > > > > Note: This imposes the requirement that there be enough memory to keep > > every incore inode in memory throughout recovery. > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> > > --- > > Maybe I'm missing something, but I thought the discussion on the > previous version[1] landed on an approach where the intent would hold a > reference to the inode. Wouldn't that break the dependency on the dfops > freeze/thaw mechanism? We did, but as I started pondering how to make this work in practice, I soon realized that it's not as simple as "don't irele the inode": (Part 1) It would be easy enough to add a flag to struct xfs_bmap_intent that means "when you're done, unlock and irele the inode". You'd have to add this capability to any log item that deals with inodes, but only deferred bmap items need this. Log recovery looks like this: ---------------- Transaction 0 BUI1 unmap file1, offset1, startblock1, length == 200000 ---------------- <end> What happens if the first bunmapi call doesn't actually unmap everything? We'll queue yet another bmap intent item (call it BUI2) to finish the work. This means that we can't just drop the inode when we're done processing BUI1; we have to propagate the "unlock and irele" flag to BUI2. Ok, so now there's this chaining behavior that wasn't there before. Then I thought about it some more, and wondered what would happen if the log contained two unfinished bmap items for the same inode? If this happened, you'd deadlock because you've not released the ILOCK for the first unfinished bmap item when you want to acquire the ILOCK for the second unfinished bmap item. You'd have to drop the ILOCK (but retain the inode reference) between the two bmap items. I think this is currently impossible because there aren't any transactions that queue multiple bmap intents per transaction, nobody else holds the ilock, and the existing rmap implementation of the swapext ioctl doesn't allow swapping a file with itself. However... (Part 2) The second thing I thought about is, what if there's a lot of work after transaction 0? 
Say the recovery stream looks like:

----------------
Transaction 0
BUI1 unmap file1, offset1, startblock1, length == 200000
----------------
<10000 more intents applied to other things>
----------------
<end>

This is a problem, because we must drop the ILOCK after completing the BUI1 work so that we can process the 10000 more intents before we can get to BUI2. We cannot let the tail of the log get pinned on the locked inode, so we must release the ILOCK after we're done with BUI1 and re-acquire it when we're ready to deal with BUI2.

This means I need /two/ flags in struct xfs_bmap_intent -- one to say that we need to irele the inode at the end of processing, and another to say that the bmap item needs to grab the ILOCK. If there's going to be a BUI2 as a result of recovering BUI1, the first flag must be propagated to BUI2, but the second flag must /not/ be propagated.

A potential counterargument is that we crammed all 10000 intents into the log without pinning the log, so maybe we can hold the ILOCK. However....

(Part 3)

Next, I considered how the swapext log item in the atomic file update patchset would interact with log recovery. A couple of notes about the swapext log items:

1. They are careful not to create the kinds of bunmap operations that would cause the creation of /more/ BUIs.

2. Every time we log one extent's worth of unmap/remap operations (using deferred bmap log items, naturally) we log a done item for the original swapext intent item and immediately log another swapext intent for the work remaining.

I came up with the following sequence of transactions to recover:

----------------
Transaction 0
SXI0 file1, offset1, file2, offset2, length == 10
----------------
Transaction 1
BUI1 unmap file1, offset1, startblock1, length == 2
BUI2 unmap file2, offset2, startblock2, length == 2
BUI3 map file1, offset1, startblock2, length == 2
BUI4 map file2, offset2, startblock1, length == 2
SXD done (SXI0)
SXI5 file1, offset1 + 2, file2, offset2 + 2, length == 8
---------------
<end of log>

Recovery in this case will end up with BUI[1-4] and SXI5 that still need processing. Each of the 5 intent items gets its own recovery transaction and dfops freezer chain. BUI1, BUI3, and SXI5 will each require ilock'd access to file1, which means that each of them will have to have the ability to drop the ILOCK but retain the reference. The same argument applies to file2 and BUI2, BUI4, and SXI5. Note also that in our brave new world, file1 and file2 can be the same.

For the swapext log item type this inode management is particularly acute because we're certain to have new SXIs created as a result of recovering an SXI.

(End)

In conclusion, we need a mechanism to drop both the inode lock and the inode reference. Each log item type that works with inodes will have to set aside 2 flags somewhere in the incore extent structure, and provide more or less the same code to manage that state. Right now that's just the bmap items, but then the swapext items and the deferred xattr items will have that same requirement when they get added.

If we do that, the inode reference and ilock management diverges even further from regular operations. Regular callers are expected to iget and ilock the inode and maintain that all the way to the end of xfs_trans_commit. Log recovery, on the other hand, will add a bunch of state handling to some of the log items so that we can drop the ILOCK after the first item, and then drop the inode reference after finishing the last possible user of that inode in the revived dfops chain.
On the other hand, keeping the inode management in the defer freezer code means that I keep all that complexity and state management out of the ->finish_item code. When we go to finish the new incore intents that get created during recovery, the inode reference and ILOCK rules are the same as anywhere else -- the caller is required to grab the inode, the transaction, and the ilock (in that order). To the extent that recovering log intent items is /not/ the same as running a normal transaction, all that weirdness is concentrated in xfs_log_recover.c and a single function in xfs_defer.c. The desired behavior is the same across all the log intent item types, so I decided in favor of keeping the centralised behavior and (trying to) contain the wacky. We don't need all that right away, but we will. Note that I did make it so that we retain the reference always, so there's no longer a need for irele/igrab cycles, which makes capturing the dfops chain less error prone. --D > Brian > > [1] https://lore.kernel.org/linux-xfs/20200429235818.GX6742@magnolia/ > > > fs/xfs/libxfs/xfs_defer.c | 50 +++++++++++++++++++++++++++++++++++++++++++++ > > fs/xfs/libxfs/xfs_defer.h | 10 +++++++++ > > fs/xfs/xfs_bmap_item.c | 7 ++++-- > > fs/xfs/xfs_icache.c | 19 +++++++++++++++++ > > 4 files changed, 83 insertions(+), 3 deletions(-) > > > > > > diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c > > index ea4d28851bbd..72933fdafcb2 100644 > > --- a/fs/xfs/libxfs/xfs_defer.c > > +++ b/fs/xfs/libxfs/xfs_defer.c > > @@ -16,6 +16,7 @@ > > #include "xfs_inode.h" > > #include "xfs_inode_item.h" > > #include "xfs_trace.h" > > +#include "xfs_icache.h" > > > > /* > > * Deferred Operations in XFS > > @@ -583,8 +584,19 @@ xfs_defer_thaw( > > struct xfs_defer_freezer *dff, > > struct xfs_trans *tp) > > { > > + int i; > > + > > ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES); > > > > + /* Re-acquire the inode locks. */ > > + for (i = 0; i < XFS_DEFER_FREEZER_INODES; i++) { > > + if (!dff->dff_inodes[i]) > > + break; > > + > > + dff->dff_ilocks[i] = XFS_ILOCK_EXCL; > > + xfs_ilock(dff->dff_inodes[i], dff->dff_ilocks[i]); > > + } > > + > > /* Add the dfops items to the transaction. */ > > list_splice_init(&dff->dff_dfops, &tp->t_dfops); > > tp->t_flags |= dff->dff_tpflags; > > @@ -597,5 +609,43 @@ xfs_defer_freeezer_finish( > > struct xfs_defer_freezer *dff) > > { > > xfs_defer_cancel_list(mp, &dff->dff_dfops); > > + xfs_defer_freezer_irele(dff); > > kmem_free(dff); > > } > > + > > +/* > > + * Attach an inode to this deferred ops freezer. Callers must hold ILOCK_EXCL, > > + * which will be dropped and reacquired when we're ready to thaw the frozen > > + * deferred ops. > > + */ > > +int > > +xfs_defer_freezer_ijoin( > > + struct xfs_defer_freezer *dff, > > + struct xfs_inode *ip) > > +{ > > + unsigned int i; > > + > > + ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL)); > > + > > + for (i = 0; i < XFS_DEFER_FREEZER_INODES; i++) { > > + if (dff->dff_inodes[i] == ip) > > + goto out; > > + if (dff->dff_inodes[i] == NULL) > > + break; > > + } > > + > > + if (i == XFS_DEFER_FREEZER_INODES) { > > + ASSERT(0); > > + return -EFSCORRUPTED; > > + } > > + > > + /* > > + * Attach this inode to the freezer and drop its ILOCK because we > > + * assume the caller will need to allocate a transaction. 
> > + */ > > + dff->dff_inodes[i] = ip; > > + dff->dff_ilocks[i] = 0; > > +out: > > + xfs_iunlock(ip, XFS_ILOCK_EXCL); > > + return 0; > > +} > > diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h > > index 7ae05e10d750..0052a0313283 100644 > > --- a/fs/xfs/libxfs/xfs_defer.h > > +++ b/fs/xfs/libxfs/xfs_defer.h > > @@ -76,6 +76,11 @@ struct xfs_defer_freezer { > > /* Deferred ops state saved from the transaction. */ > > struct list_head dff_dfops; > > unsigned int dff_tpflags; > > + > > + /* Inodes to hold when we want to finish the deferred work items. */ > > +#define XFS_DEFER_FREEZER_INODES 2 > > + unsigned int dff_ilocks[XFS_DEFER_FREEZER_INODES]; > > + struct xfs_inode *dff_inodes[XFS_DEFER_FREEZER_INODES]; > > }; > > > > /* Functions to freeze a chain of deferred operations for later. */ > > @@ -83,5 +88,10 @@ int xfs_defer_freeze(struct xfs_trans *tp, struct xfs_defer_freezer **dffp); > > void xfs_defer_thaw(struct xfs_defer_freezer *dff, struct xfs_trans *tp); > > void xfs_defer_freeezer_finish(struct xfs_mount *mp, > > struct xfs_defer_freezer *dff); > > +int xfs_defer_freezer_ijoin(struct xfs_defer_freezer *dff, > > + struct xfs_inode *ip); > > + > > +/* These functions must be provided by the xfs implementation. */ > > +void xfs_defer_freezer_irele(struct xfs_defer_freezer *dff); > > > > #endif /* __XFS_DEFER_H__ */ > > diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c > > index c733bdeeeb9b..bbce191d8fcd 100644 > > --- a/fs/xfs/xfs_bmap_item.c > > +++ b/fs/xfs/xfs_bmap_item.c > > @@ -530,12 +530,13 @@ xfs_bui_item_recover( > > } > > > > error = xlog_recover_trans_commit(tp, dffp); > > - xfs_iunlock(ip, XFS_ILOCK_EXCL); > > - xfs_irele(ip); > > - return error; > > + if (error) > > + goto err_rele; > > + return xfs_defer_freezer_ijoin(*dffp, ip); > > > > err_inode: > > xfs_trans_cancel(tp); > > +err_rele: > > if (ip) { > > xfs_iunlock(ip, XFS_ILOCK_EXCL); > > xfs_irele(ip); > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > > index 17a0b86fe701..b96ddf5ff334 100644 > > --- a/fs/xfs/xfs_icache.c > > +++ b/fs/xfs/xfs_icache.c > > @@ -12,6 +12,7 @@ > > #include "xfs_sb.h" > > #include "xfs_mount.h" > > #include "xfs_inode.h" > > +#include "xfs_defer.h" > > #include "xfs_trans.h" > > #include "xfs_trans_priv.h" > > #include "xfs_inode_item.h" > > @@ -1847,3 +1848,21 @@ xfs_start_block_reaping( > > xfs_queue_eofblocks(mp); > > xfs_queue_cowblocks(mp); > > } > > + > > +/* Release all the inode resources attached to this freezer. */ > > +void > > +xfs_defer_freezer_irele( > > + struct xfs_defer_freezer *dff) > > +{ > > + unsigned int i; > > + > > + for (i = 0; i < XFS_DEFER_FREEZER_INODES; i++) { > > + if (!dff->dff_inodes[i]) > > + break; > > + > > + if (dff->dff_ilocks[i]) > > + xfs_iunlock(dff->dff_inodes[i], dff->dff_ilocks[i]); > > + xfs_irele(dff->dff_inodes[i]); > > + dff->dff_inodes[i] = NULL; > > + } > > +} > > > ^ permalink raw reply [flat|nested] 29+ messages in thread
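To help visualize the alternative that the message above works through and rejects, here is a rough sketch of the two hypothetical xfs_bmap_intent flags from Parts 1 and 2. Neither the flags nor a bi_flags field exist in the actual structure; the sketch only encodes the propagation rules stated above (the irele duty follows the requeued work, the take-the-ILOCK duty does not):

/* Hypothetical flags -- not present in the real xfs_bmap_intent. */
#define XFS_BMAP_INTENT_IRELE	(1U << 0)	/* irele inode when the chain ends */
#define XFS_BMAP_INTENT_ILOCK	(1U << 1)	/* this item must grab the ILOCK */

/* Requeue leftover work from a recovered bmap intent (BUI1 -> BUI2). */
static void
xfs_bmap_intent_requeue(
	struct xfs_bmap_intent	*bui1,	/* recovered item */
	struct xfs_bmap_intent	*bui2)	/* leftover work */
{
	/* The unlock-and-irele duty propagates down the chain... */
	bui2->bi_flags |= (bui1->bi_flags & XFS_BMAP_INTENT_IRELE);
	bui1->bi_flags &= ~XFS_BMAP_INTENT_IRELE;

	/* ...but the ILOCK-grabbing flag must not be propagated. */
	bui2->bi_flags &= ~XFS_BMAP_INTENT_ILOCK;
}

Spelling out even this much makes the objection concrete: every inode-bearing log item type would need the same two flags and the same requeue bookkeeping, which is exactly the divergence from regular transaction rules that the freezer approach avoids.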
* Re: [PATCH 3/3] xfs: fix an incore inode UAF in xfs_bui_recover 2020-05-06 0:34 ` Darrick J. Wong @ 2020-05-06 13:56 ` Brian Foster 2020-05-06 17:01 ` Darrick J. Wong 0 siblings, 1 reply; 29+ messages in thread From: Brian Foster @ 2020-05-06 13:56 UTC (permalink / raw) To: Darrick J. Wong; +Cc: linux-xfs On Tue, May 05, 2020 at 05:34:23PM -0700, Darrick J. Wong wrote: > On Tue, May 05, 2020 at 10:11:38AM -0400, Brian Foster wrote: > > On Mon, May 04, 2020 at 06:13:53PM -0700, Darrick J. Wong wrote: > > > From: Darrick J. Wong <darrick.wong@oracle.com> > > > > > > In xfs_bui_item_recover, there exists a use-after-free bug with regards > > > to the inode that is involved in the bmap replay operation. If the > > > mapping operation does not complete, we call xfs_bmap_unmap_extent to > > > create a deferred op to finish the unmapping work, and we retain a > > > pointer to the incore inode. > > > > > > Unfortunately, the very next thing we do is commit the transaction and > > > drop the inode. If reclaim tears down the inode before we try to finish > > > the defer ops, we dereference garbage and blow up. Therefore, create a > > > way to join inodes to the defer ops freezer so that we can maintain the > > > xfs_inode reference until we're done with the inode. > > > > > > Note: This imposes the requirement that there be enough memory to keep > > > every incore inode in memory throughout recovery. > > > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> > > > --- > > > > Maybe I'm missing something, but I thought the discussion on the > > previous version[1] landed on an approach where the intent would hold a > > reference to the inode. Wouldn't that break the dependency on the dfops > > freeze/thaw mechanism? > > We did, but as I started pondering how to make this work in practice, > I soon realized that it's not as simple as "don't irele the inode": > > (Part 1) > > It would be easy enough to add a flag to struct xfs_bmap_intent that > means "when you're done, unlock and irele the inode". You'd have to add > this capability to any log item that deals with inodes, but only > deferred bmap items need this. > > Log recovery looks like this: > > ---------------- > Transaction 0 > BUI1 unmap file1, offset1, startblock1, length == 200000 > ---------------- > <end> > > What happens if the first bunmapi call doesn't actually unmap > everything? We'll queue yet another bmap intent item (call it BUI2) to > finish the work. This means that we can't just drop the inode when > we're done processing BUI1; we have to propagate the "unlock and irele" > flag to BUI2. Ok, so now there's this chaining behavior that wasn't > there before. > > Then I thought about it some more, and wondered what would happen if the > log contained two unfinished bmap items for the same inode? If this > happened, you'd deadlock because you've not released the ILOCK for the > first unfinished bmap item when you want to acquire the ILOCK for the > second unfinished bmap item. > Hm, Ok.. so IIUC the lock is implicitly held across these map/unmap dfops at runtime, but that's not necessarily the case during recovery due to the dfops chaining behavior. To me, this raises the question of why we need to reorder/split these particular intents during recovery if the original operation would have completed under a single inode lock cycle. The comments in xfs_log_recover.c don't really explain it. 
From going back to commit 509955823cc9c ("xfs: log recovery should replay deferred ops in order"), it looks like the issue is that ordering gets screwed up for operations that commit multiple intents per-transaction and completion of those intents creates new child intents. At runtime, we'd process the intents in the current transaction first and then get to the child intents in the order they were created. If we didn't have the recovery time dfops shuffling implemented in this patch, recovery would process the first intent in the log, then all of the child intents of that one before getting to the next intent in the log that presumably would have been second at runtime. Am I following that correctly? I think the rest makes sense from the perspective of not wanting to plumb the conditional locking complexity through the dfops/intent processing for every intent type that eventually needs it, as opposed to containing in the log recovery code (via dfops magic). I need to think about all that some more, but I'd like to make sure I understand why we have this recovery requirement in the first place. Brian > You'd have to drop the ILOCK (but retain the inode reference) between > the two bmap items. I think this is currently impossible because there > aren't any transactions that queue multiple bmap intents per > transaction, nobody else holds the ilock, and the existing rmap > implementation of the swapext ioctl doesn't allow swapping a file with > itself. However... > > (Part 2) > > The second thing I thought about is, what if there's a lot of work after > transaction 0? Say the recovery stream looks like: > > ---------------- > Transaction 0 > BUI1 unmap file1, offset1, startblock1, length == 200000 > ---------------- > <10000 more intents applied to other things> > ---------------- > <end> > > This is a problem, because we must drop the ILOCK after completing the > BUI1 work so that we can process the 10000 more intents before we can > get to BUI2. We cannot let the tail of the log get pinned on the locked > inode, so we must release the ILOCK after we're done with BUI1 and > re-acquire it when we're ready to deal with BUI2. > > This means I need /two/ flags in struct xfs_bmap_intent -- one to say > that we need to irele the inode at the end of processing, and another to > say that the bmap item needs to grab the ILOCK. If there's going to be > a BUI2 as a result of recovering BUI1, the first flag must be propagated > to BUI2, but the second flag must /not/ be propagated. > > A potential counterargument is that we crammed all 10000 intents into > the log without pinning the log, so maybe we can hold the ILOCK. > However.... > > (Part 3) > > Next, I considered how the swapext log item in the atomic file update > patchset would interact with log recovery. A couple of notes about the > swapext log items: > > 1. They are careful not to create the kinds of bunmap operations that > would cause the creation of /more/ BUIs. > > 2. Every time we log one extent's worth of unmap/remap operations (using > deferred bmap log items, naturally) we log an done item for the original > swapext intent item and immediately log another swapext intent for the > work remaining. 
> > I came up with the following sequence of transactions to recover: > > ---------------- > Transaction 0 > SXI0 file1, offset1, file2, offset2, length == 10 > ---------------- > Transaction 1 > BUI1 unmap file1, offset1, startblock1, length == 2 > BUI2 unmap file2, offset2, startblock2, length == 2 > BUI3 map file1, offset1, startblock2, length == 2 > BUI4 map file2, offset2, startblock1, length == 2 > SXD done (SXI0) > SXI5 file1, offset1 + 2, file2, offset2 + 2, length == 8 > --------------- > <end of log> > > Recovery in this case will end up with BUI[1-4] and SXI5 that still need > processing. Each of the 5 intent items gets its own recovery > transaction and dfops freezer chain. BUI1, BUI3, and SXI5 will each > require ilock'd access to file1, which means that each of them will have > to have the ability to drop the ILOCK but retain the reference. The > same argument applies to file2 and BUI2, BUI4, and SXI5. Note also that > in our brave new world, file1 and file2 can be the same. > > For the swapext log item type this inode management is particularly > acute because we're certain to have new SXIs created as a result of > recovering an SXI. > > (End) > > In conclusion, we need a mechanism to drop both the inode lock and the > inode reference. Each log item type that works with inodes will have to > set aside 2 flags somewhere in the incore extent structure, and provide > more or less the same code to manage that state. Right now that's just > the bmap items, but then the swapext items and the deferred xattr items > will have that same requirement when they get added. > > If we do that, the inode reference and ilock management diverges even > further from regular operations. Regular callers are expected to iget > and ilock the inode and maintain that all the way to the end of > xfs_trans_commit. Log recovery, on the other hand, will add a bunch of > state handling to some of the log items so that we can drop the ILOCK > after the first item, and then drop the inode reference after finishing > the last possible user of that inode in the revived dfops chain. > > On the other hand, keeping the inode management in the defer freezer > code means that I keep all that complexity and state management out of > the ->finish_item code. When we go to finish the new incore intents > that get created during recovery, the inode reference and ILOCK rules > are the same as anywhere else -- the caller is required to grab the > inode, the transaction, and the ilock (in that order). > > To the extent that recovering log intent items is /not/ the same as > running a normal transaction, all that weirdness is concentrated in > xfs_log_recover.c and a single function in xfs_defer.c. The desired > behavior is the same across all the log intent item types, so I > decided in favor of keeping the centralised behavior and (trying to) > contain the wacky. We don't need all that right away, but we will. > > Note that I did make it so that we retain the reference always, so > there's no longer a need for irele/igrab cycles, which makes capturing > the dfops chain less error prone. 
> > --D > > > Brian > > > > [1] https://lore.kernel.org/linux-xfs/20200429235818.GX6742@magnolia/ > > > > > fs/xfs/libxfs/xfs_defer.c | 50 +++++++++++++++++++++++++++++++++++++++++++++ > > > fs/xfs/libxfs/xfs_defer.h | 10 +++++++++ > > > fs/xfs/xfs_bmap_item.c | 7 ++++-- > > > fs/xfs/xfs_icache.c | 19 +++++++++++++++++ > > > 4 files changed, 83 insertions(+), 3 deletions(-) > > > > > > > > > diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c > > > index ea4d28851bbd..72933fdafcb2 100644 > > > --- a/fs/xfs/libxfs/xfs_defer.c > > > +++ b/fs/xfs/libxfs/xfs_defer.c > > > @@ -16,6 +16,7 @@ > > > #include "xfs_inode.h" > > > #include "xfs_inode_item.h" > > > #include "xfs_trace.h" > > > +#include "xfs_icache.h" > > > > > > /* > > > * Deferred Operations in XFS > > > @@ -583,8 +584,19 @@ xfs_defer_thaw( > > > struct xfs_defer_freezer *dff, > > > struct xfs_trans *tp) > > > { > > > + int i; > > > + > > > ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES); > > > > > > + /* Re-acquire the inode locks. */ > > > + for (i = 0; i < XFS_DEFER_FREEZER_INODES; i++) { > > > + if (!dff->dff_inodes[i]) > > > + break; > > > + > > > + dff->dff_ilocks[i] = XFS_ILOCK_EXCL; > > > + xfs_ilock(dff->dff_inodes[i], dff->dff_ilocks[i]); > > > + } > > > + > > > /* Add the dfops items to the transaction. */ > > > list_splice_init(&dff->dff_dfops, &tp->t_dfops); > > > tp->t_flags |= dff->dff_tpflags; > > > @@ -597,5 +609,43 @@ xfs_defer_freeezer_finish( > > > struct xfs_defer_freezer *dff) > > > { > > > xfs_defer_cancel_list(mp, &dff->dff_dfops); > > > + xfs_defer_freezer_irele(dff); > > > kmem_free(dff); > > > } > > > + > > > +/* > > > + * Attach an inode to this deferred ops freezer. Callers must hold ILOCK_EXCL, > > > + * which will be dropped and reacquired when we're ready to thaw the frozen > > > + * deferred ops. > > > + */ > > > +int > > > +xfs_defer_freezer_ijoin( > > > + struct xfs_defer_freezer *dff, > > > + struct xfs_inode *ip) > > > +{ > > > + unsigned int i; > > > + > > > + ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL)); > > > + > > > + for (i = 0; i < XFS_DEFER_FREEZER_INODES; i++) { > > > + if (dff->dff_inodes[i] == ip) > > > + goto out; > > > + if (dff->dff_inodes[i] == NULL) > > > + break; > > > + } > > > + > > > + if (i == XFS_DEFER_FREEZER_INODES) { > > > + ASSERT(0); > > > + return -EFSCORRUPTED; > > > + } > > > + > > > + /* > > > + * Attach this inode to the freezer and drop its ILOCK because we > > > + * assume the caller will need to allocate a transaction. > > > + */ > > > + dff->dff_inodes[i] = ip; > > > + dff->dff_ilocks[i] = 0; > > > +out: > > > + xfs_iunlock(ip, XFS_ILOCK_EXCL); > > > + return 0; > > > +} > > > diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h > > > index 7ae05e10d750..0052a0313283 100644 > > > --- a/fs/xfs/libxfs/xfs_defer.h > > > +++ b/fs/xfs/libxfs/xfs_defer.h > > > @@ -76,6 +76,11 @@ struct xfs_defer_freezer { > > > /* Deferred ops state saved from the transaction. */ > > > struct list_head dff_dfops; > > > unsigned int dff_tpflags; > > > + > > > + /* Inodes to hold when we want to finish the deferred work items. */ > > > +#define XFS_DEFER_FREEZER_INODES 2 > > > + unsigned int dff_ilocks[XFS_DEFER_FREEZER_INODES]; > > > + struct xfs_inode *dff_inodes[XFS_DEFER_FREEZER_INODES]; > > > }; > > > > > > /* Functions to freeze a chain of deferred operations for later. 
*/ > > > @@ -83,5 +88,10 @@ int xfs_defer_freeze(struct xfs_trans *tp, struct xfs_defer_freezer **dffp); > > > void xfs_defer_thaw(struct xfs_defer_freezer *dff, struct xfs_trans *tp); > > > void xfs_defer_freeezer_finish(struct xfs_mount *mp, > > > struct xfs_defer_freezer *dff); > > > +int xfs_defer_freezer_ijoin(struct xfs_defer_freezer *dff, > > > + struct xfs_inode *ip); > > > + > > > +/* These functions must be provided by the xfs implementation. */ > > > +void xfs_defer_freezer_irele(struct xfs_defer_freezer *dff); > > > > > > #endif /* __XFS_DEFER_H__ */ > > > diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c > > > index c733bdeeeb9b..bbce191d8fcd 100644 > > > --- a/fs/xfs/xfs_bmap_item.c > > > +++ b/fs/xfs/xfs_bmap_item.c > > > @@ -530,12 +530,13 @@ xfs_bui_item_recover( > > > } > > > > > > error = xlog_recover_trans_commit(tp, dffp); > > > - xfs_iunlock(ip, XFS_ILOCK_EXCL); > > > - xfs_irele(ip); > > > - return error; > > > + if (error) > > > + goto err_rele; > > > + return xfs_defer_freezer_ijoin(*dffp, ip); > > > > > > err_inode: > > > xfs_trans_cancel(tp); > > > +err_rele: > > > if (ip) { > > > xfs_iunlock(ip, XFS_ILOCK_EXCL); > > > xfs_irele(ip); > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > > > index 17a0b86fe701..b96ddf5ff334 100644 > > > --- a/fs/xfs/xfs_icache.c > > > +++ b/fs/xfs/xfs_icache.c > > > @@ -12,6 +12,7 @@ > > > #include "xfs_sb.h" > > > #include "xfs_mount.h" > > > #include "xfs_inode.h" > > > +#include "xfs_defer.h" > > > #include "xfs_trans.h" > > > #include "xfs_trans_priv.h" > > > #include "xfs_inode_item.h" > > > @@ -1847,3 +1848,21 @@ xfs_start_block_reaping( > > > xfs_queue_eofblocks(mp); > > > xfs_queue_cowblocks(mp); > > > } > > > + > > > +/* Release all the inode resources attached to this freezer. */ > > > +void > > > +xfs_defer_freezer_irele( > > > + struct xfs_defer_freezer *dff) > > > +{ > > > + unsigned int i; > > > + > > > + for (i = 0; i < XFS_DEFER_FREEZER_INODES; i++) { > > > + if (!dff->dff_inodes[i]) > > > + break; > > > + > > > + if (dff->dff_ilocks[i]) > > > + xfs_iunlock(dff->dff_inodes[i], dff->dff_ilocks[i]); > > > + xfs_irele(dff->dff_inodes[i]); > > > + dff->dff_inodes[i] = NULL; > > > + } > > > +} > > > > > > ^ permalink raw reply [flat|nested] 29+ messages in thread
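Restating the ordering hazard from the exchange above in code form may help. The fragment below is purely illustrative -- recovered_intents, li_list, alloc_recovery_trans, and replay_intent are made-up names, not the real xfs_log_recover.c loop:

/*
 * A naive recovery loop finishes each recovered intent *and* all of
 * the child intents it creates before moving on to the next intent
 * that was physically present in the log:
 */
list_for_each_entry(lip, &recovered_intents, li_list) {
	struct xfs_trans	*tp;

	tp = alloc_recovery_trans(mp);		/* hypothetical */
	replay_intent(tp, lip);			/* hypothetical */
	xfs_trans_commit(tp);	/* finishes this intent's children now,
				 * ahead of later logged intents */
}

At runtime, by contrast, every intent committed in the same transaction finishes before any of their children, so recovery has to defer the child work -- which is what the dfops freezer captures -- to reproduce that order.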
* Re: [PATCH 3/3] xfs: fix an incore inode UAF in xfs_bui_recover 2020-05-06 13:56 ` Brian Foster @ 2020-05-06 17:01 ` Darrick J. Wong 2020-05-07 9:53 ` Brian Foster 0 siblings, 1 reply; 29+ messages in thread From: Darrick J. Wong @ 2020-05-06 17:01 UTC (permalink / raw) To: Brian Foster; +Cc: linux-xfs On Wed, May 06, 2020 at 09:56:56AM -0400, Brian Foster wrote: > On Tue, May 05, 2020 at 05:34:23PM -0700, Darrick J. Wong wrote: > > On Tue, May 05, 2020 at 10:11:38AM -0400, Brian Foster wrote: > > > On Mon, May 04, 2020 at 06:13:53PM -0700, Darrick J. Wong wrote: > > > > From: Darrick J. Wong <darrick.wong@oracle.com> > > > > > > > > In xfs_bui_item_recover, there exists a use-after-free bug with regards > > > > to the inode that is involved in the bmap replay operation. If the > > > > mapping operation does not complete, we call xfs_bmap_unmap_extent to > > > > create a deferred op to finish the unmapping work, and we retain a > > > > pointer to the incore inode. > > > > > > > > Unfortunately, the very next thing we do is commit the transaction and > > > > drop the inode. If reclaim tears down the inode before we try to finish > > > > the defer ops, we dereference garbage and blow up. Therefore, create a > > > > way to join inodes to the defer ops freezer so that we can maintain the > > > > xfs_inode reference until we're done with the inode. > > > > > > > > Note: This imposes the requirement that there be enough memory to keep > > > > every incore inode in memory throughout recovery. > > > > > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> > > > > --- > > > > > > Maybe I'm missing something, but I thought the discussion on the > > > previous version[1] landed on an approach where the intent would hold a > > > reference to the inode. Wouldn't that break the dependency on the dfops > > > freeze/thaw mechanism? > > > > We did, but as I started pondering how to make this work in practice, > > I soon realized that it's not as simple as "don't irele the inode": > > > > (Part 1) > > > > It would be easy enough to add a flag to struct xfs_bmap_intent that > > means "when you're done, unlock and irele the inode". You'd have to add > > this capability to any log item that deals with inodes, but only > > deferred bmap items need this. > > > > Log recovery looks like this: > > > > ---------------- > > Transaction 0 > > BUI1 unmap file1, offset1, startblock1, length == 200000 > > ---------------- > > <end> > > > > What happens if the first bunmapi call doesn't actually unmap > > everything? We'll queue yet another bmap intent item (call it BUI2) to > > finish the work. This means that we can't just drop the inode when > > we're done processing BUI1; we have to propagate the "unlock and irele" > > flag to BUI2. Ok, so now there's this chaining behavior that wasn't > > there before. > > > > Then I thought about it some more, and wondered what would happen if the > > log contained two unfinished bmap items for the same inode? If this > > happened, you'd deadlock because you've not released the ILOCK for the > > first unfinished bmap item when you want to acquire the ILOCK for the > > second unfinished bmap item. > > > > Hm, Ok.. so IIUC the lock is implicitly held across these map/unmap > dfops at runtime, but that's not necessarily the case during recovery > due to the dfops chaining behavior. To me, this raises the question of > why we need to reorder/split these particular intents during recovery if > the original operation would have completed under a single inode lock > cycle. 
Log recovery /did/ function that way (no reordering) until 509955823cc9c, when it was discovered that we don't have a good way to reconstruct the dfops chain from the list of recovered log intent items. > The comments in xfs_log_recover.c don't really explain it. From going > back to commit 509955823cc9c ("xfs: log recovery should replay deferred > ops in order"), it looks like the issue is that ordering gets screwed up > for operations that commit multiple intents per-transaction and > completion of those intents creates new child intents. At runtime, we'd > process the intents in the current transaction first and then get to the > child intents in the order they were created. If we didn't have the > recovery time dfops shuffling implemented in this patch, recovery would > process the first intent in the log, then all of the child intents of > that one before getting to the next intent in the log that presumably > would have been second at runtime. Am I following that correctly? Correct. If we could figure out how to rebuild the dfops chain (and thus the ordering requirements), we would be able to eliminate this freezer stuff entirely. > I think the rest makes sense from the perspective of not wanting to > plumb the conditional locking complexity through the dfops/intent > processing for every intent type that eventually needs it, as opposed to > containing in the log recovery code (via dfops magic). I need to think > about all that some more, but I'd like to make sure I understand why we > have this recovery requirement in the first place. <nod> It took me a few hours to figure out why that patch was correct the first time I saw it. It took me a few hours again(!) when I was trying to write this series. :/ --D > Brian > > > You'd have to drop the ILOCK (but retain the inode reference) between > > the two bmap items. I think this is currently impossible because there > > aren't any transactions that queue multiple bmap intents per > > transaction, nobody else holds the ilock, and the existing rmap > > implementation of the swapext ioctl doesn't allow swapping a file with > > itself. However... > > > > (Part 2) > > > > The second thing I thought about is, what if there's a lot of work after > > transaction 0? Say the recovery stream looks like: > > > > ---------------- > > Transaction 0 > > BUI1 unmap file1, offset1, startblock1, length == 200000 > > ---------------- > > <10000 more intents applied to other things> > > ---------------- > > <end> > > > > This is a problem, because we must drop the ILOCK after completing the > > BUI1 work so that we can process the 10000 more intents before we can > > get to BUI2. We cannot let the tail of the log get pinned on the locked > > inode, so we must release the ILOCK after we're done with BUI1 and > > re-acquire it when we're ready to deal with BUI2. > > > > This means I need /two/ flags in struct xfs_bmap_intent -- one to say > > that we need to irele the inode at the end of processing, and another to > > say that the bmap item needs to grab the ILOCK. If there's going to be > > a BUI2 as a result of recovering BUI1, the first flag must be propagated > > to BUI2, but the second flag must /not/ be propagated. > > > > A potential counterargument is that we crammed all 10000 intents into > > the log without pinning the log, so maybe we can hold the ILOCK. > > However.... > > > > (Part 3) > > > > Next, I considered how the swapext log item in the atomic file update > > patchset would interact with log recovery. 
A couple of notes about the > > swapext log items: > > > > 1. They are careful not to create the kinds of bunmap operations that > > would cause the creation of /more/ BUIs. > > > > 2. Every time we log one extent's worth of unmap/remap operations (using > > deferred bmap log items, naturally) we log an done item for the original > > swapext intent item and immediately log another swapext intent for the > > work remaining. > > > > I came up with the following sequence of transactions to recover: > > > > ---------------- > > Transaction 0 > > SXI0 file1, offset1, file2, offset2, length == 10 > > ---------------- > > Transaction 1 > > BUI1 unmap file1, offset1, startblock1, length == 2 > > BUI2 unmap file2, offset2, startblock2, length == 2 > > BUI3 map file1, offset1, startblock2, length == 2 > > BUI4 map file2, offset2, startblock1, length == 2 > > SXD done (SXI0) > > SXI5 file1, offset1 + 2, file2, offset2 + 2, length == 8 > > --------------- > > <end of log> > > > > Recovery in this case will end up with BUI[1-4] and SXI5 that still need > > processing. Each of the 5 intent items gets its own recovery > > transaction and dfops freezer chain. BUI1, BUI3, and SXI5 will each > > require ilock'd access to file1, which means that each of them will have > > to have the ability to drop the ILOCK but retain the reference. The > > same argument applies to file2 and BUI2, BUI4, and SXI5. Note also that > > in our brave new world, file1 and file2 can be the same. > > > > For the swapext log item type this inode management is particularly > > acute because we're certain to have new SXIs created as a result of > > recovering an SXI. > > > > (End) > > > > In conclusion, we need a mechanism to drop both the inode lock and the > > inode reference. Each log item type that works with inodes will have to > > set aside 2 flags somewhere in the incore extent structure, and provide > > more or less the same code to manage that state. Right now that's just > > the bmap items, but then the swapext items and the deferred xattr items > > will have that same requirement when they get added. > > > > If we do that, the inode reference and ilock management diverges even > > further from regular operations. Regular callers are expected to iget > > and ilock the inode and maintain that all the way to the end of > > xfs_trans_commit. Log recovery, on the other hand, will add a bunch of > > state handling to some of the log items so that we can drop the ILOCK > > after the first item, and then drop the inode reference after finishing > > the last possible user of that inode in the revived dfops chain. > > > > On the other hand, keeping the inode management in the defer freezer > > code means that I keep all that complexity and state management out of > > the ->finish_item code. When we go to finish the new incore intents > > that get created during recovery, the inode reference and ILOCK rules > > are the same as anywhere else -- the caller is required to grab the > > inode, the transaction, and the ilock (in that order). > > > > To the extent that recovering log intent items is /not/ the same as > > running a normal transaction, all that weirdness is concentrated in > > xfs_log_recover.c and a single function in xfs_defer.c. The desired > > behavior is the same across all the log intent item types, so I > > decided in favor of keeping the centralised behavior and (trying to) > > contain the wacky. We don't need all that right away, but we will. 
> > > > Note that I did make it so that we retain the reference always, so > > there's no longer a need for irele/igrab cycles, which makes capturing > > the dfops chain less error prone. > > > > --D > > > > > Brian > > > > > > [1] https://lore.kernel.org/linux-xfs/20200429235818.GX6742@magnolia/ > > > > > > > fs/xfs/libxfs/xfs_defer.c | 50 +++++++++++++++++++++++++++++++++++++++++++++ > > > > fs/xfs/libxfs/xfs_defer.h | 10 +++++++++ > > > > fs/xfs/xfs_bmap_item.c | 7 ++++-- > > > > fs/xfs/xfs_icache.c | 19 +++++++++++++++++ > > > > 4 files changed, 83 insertions(+), 3 deletions(-) > > > > > > > > > > > > diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c > > > > index ea4d28851bbd..72933fdafcb2 100644 > > > > --- a/fs/xfs/libxfs/xfs_defer.c > > > > +++ b/fs/xfs/libxfs/xfs_defer.c > > > > @@ -16,6 +16,7 @@ > > > > #include "xfs_inode.h" > > > > #include "xfs_inode_item.h" > > > > #include "xfs_trace.h" > > > > +#include "xfs_icache.h" > > > > > > > > /* > > > > * Deferred Operations in XFS > > > > @@ -583,8 +584,19 @@ xfs_defer_thaw( > > > > struct xfs_defer_freezer *dff, > > > > struct xfs_trans *tp) > > > > { > > > > + int i; > > > > + > > > > ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES); > > > > > > > > + /* Re-acquire the inode locks. */ > > > > + for (i = 0; i < XFS_DEFER_FREEZER_INODES; i++) { > > > > + if (!dff->dff_inodes[i]) > > > > + break; > > > > + > > > > + dff->dff_ilocks[i] = XFS_ILOCK_EXCL; > > > > + xfs_ilock(dff->dff_inodes[i], dff->dff_ilocks[i]); > > > > + } > > > > + > > > > /* Add the dfops items to the transaction. */ > > > > list_splice_init(&dff->dff_dfops, &tp->t_dfops); > > > > tp->t_flags |= dff->dff_tpflags; > > > > @@ -597,5 +609,43 @@ xfs_defer_freeezer_finish( > > > > struct xfs_defer_freezer *dff) > > > > { > > > > xfs_defer_cancel_list(mp, &dff->dff_dfops); > > > > + xfs_defer_freezer_irele(dff); > > > > kmem_free(dff); > > > > } > > > > + > > > > +/* > > > > + * Attach an inode to this deferred ops freezer. Callers must hold ILOCK_EXCL, > > > > + * which will be dropped and reacquired when we're ready to thaw the frozen > > > > + * deferred ops. > > > > + */ > > > > +int > > > > +xfs_defer_freezer_ijoin( > > > > + struct xfs_defer_freezer *dff, > > > > + struct xfs_inode *ip) > > > > +{ > > > > + unsigned int i; > > > > + > > > > + ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL)); > > > > + > > > > + for (i = 0; i < XFS_DEFER_FREEZER_INODES; i++) { > > > > + if (dff->dff_inodes[i] == ip) > > > > + goto out; > > > > + if (dff->dff_inodes[i] == NULL) > > > > + break; > > > > + } > > > > + > > > > + if (i == XFS_DEFER_FREEZER_INODES) { > > > > + ASSERT(0); > > > > + return -EFSCORRUPTED; > > > > + } > > > > + > > > > + /* > > > > + * Attach this inode to the freezer and drop its ILOCK because we > > > > + * assume the caller will need to allocate a transaction. > > > > + */ > > > > + dff->dff_inodes[i] = ip; > > > > + dff->dff_ilocks[i] = 0; > > > > +out: > > > > + xfs_iunlock(ip, XFS_ILOCK_EXCL); > > > > + return 0; > > > > +} > > > > diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h > > > > index 7ae05e10d750..0052a0313283 100644 > > > > --- a/fs/xfs/libxfs/xfs_defer.h > > > > +++ b/fs/xfs/libxfs/xfs_defer.h > > > > @@ -76,6 +76,11 @@ struct xfs_defer_freezer { > > > > /* Deferred ops state saved from the transaction. */ > > > > struct list_head dff_dfops; > > > > unsigned int dff_tpflags; > > > > + > > > > + /* Inodes to hold when we want to finish the deferred work items. 
*/ > > > > +#define XFS_DEFER_FREEZER_INODES 2 > > > > + unsigned int dff_ilocks[XFS_DEFER_FREEZER_INODES]; > > > > + struct xfs_inode *dff_inodes[XFS_DEFER_FREEZER_INODES]; > > > > }; > > > > > > > > /* Functions to freeze a chain of deferred operations for later. */ > > > > @@ -83,5 +88,10 @@ int xfs_defer_freeze(struct xfs_trans *tp, struct xfs_defer_freezer **dffp); > > > > void xfs_defer_thaw(struct xfs_defer_freezer *dff, struct xfs_trans *tp); > > > > void xfs_defer_freeezer_finish(struct xfs_mount *mp, > > > > struct xfs_defer_freezer *dff); > > > > +int xfs_defer_freezer_ijoin(struct xfs_defer_freezer *dff, > > > > + struct xfs_inode *ip); > > > > + > > > > +/* These functions must be provided by the xfs implementation. */ > > > > +void xfs_defer_freezer_irele(struct xfs_defer_freezer *dff); > > > > > > > > #endif /* __XFS_DEFER_H__ */ > > > > diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c > > > > index c733bdeeeb9b..bbce191d8fcd 100644 > > > > --- a/fs/xfs/xfs_bmap_item.c > > > > +++ b/fs/xfs/xfs_bmap_item.c > > > > @@ -530,12 +530,13 @@ xfs_bui_item_recover( > > > > } > > > > > > > > error = xlog_recover_trans_commit(tp, dffp); > > > > - xfs_iunlock(ip, XFS_ILOCK_EXCL); > > > > - xfs_irele(ip); > > > > - return error; > > > > + if (error) > > > > + goto err_rele; > > > > + return xfs_defer_freezer_ijoin(*dffp, ip); > > > > > > > > err_inode: > > > > xfs_trans_cancel(tp); > > > > +err_rele: > > > > if (ip) { > > > > xfs_iunlock(ip, XFS_ILOCK_EXCL); > > > > xfs_irele(ip); > > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > > > > index 17a0b86fe701..b96ddf5ff334 100644 > > > > --- a/fs/xfs/xfs_icache.c > > > > +++ b/fs/xfs/xfs_icache.c > > > > @@ -12,6 +12,7 @@ > > > > #include "xfs_sb.h" > > > > #include "xfs_mount.h" > > > > #include "xfs_inode.h" > > > > +#include "xfs_defer.h" > > > > #include "xfs_trans.h" > > > > #include "xfs_trans_priv.h" > > > > #include "xfs_inode_item.h" > > > > @@ -1847,3 +1848,21 @@ xfs_start_block_reaping( > > > > xfs_queue_eofblocks(mp); > > > > xfs_queue_cowblocks(mp); > > > > } > > > > + > > > > +/* Release all the inode resources attached to this freezer. */ > > > > +void > > > > +xfs_defer_freezer_irele( > > > > + struct xfs_defer_freezer *dff) > > > > +{ > > > > + unsigned int i; > > > > + > > > > + for (i = 0; i < XFS_DEFER_FREEZER_INODES; i++) { > > > > + if (!dff->dff_inodes[i]) > > > > + break; > > > > + > > > > + if (dff->dff_ilocks[i]) > > > > + xfs_iunlock(dff->dff_inodes[i], dff->dff_ilocks[i]); > > > > + xfs_irele(dff->dff_inodes[i]); > > > > + dff->dff_inodes[i] = NULL; > > > > + } > > > > +} > > > > > > > > > > ^ permalink raw reply [flat|nested] 29+ messages in thread
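Put as a sketch, the two-pass shape that the freezer gives recovery looks roughly like the following. The list handling (recovered_intents, li_list, dff_list, freezers) and the helper names are invented for illustration; only xfs_defer_thaw comes from the patch quoted earlier in the thread:

/* Pass 1: replay recovered intents in log order; any child work is
 * captured in a freezer instead of being finished immediately. */
list_for_each_entry(lip, &recovered_intents, li_list) {
	struct xfs_defer_freezer	*dff = NULL;

	error = replay_one_intent(mp, lip, &dff);	/* hypothetical */
	if (!error && dff)
		list_add_tail(&dff->dff_list, &freezers);
}

/* Pass 2: only now thaw each frozen chain into a fresh transaction,
 * which relocks the captured inodes and finishes the child work in
 * the order the freezers were created. */
list_for_each_entry_safe(dff, next, &freezers, dff_list) {
	struct xfs_trans	*tp;

	tp = alloc_recovery_trans(mp);		/* hypothetical */
	xfs_defer_thaw(dff, tp);		/* re-acquires dff_ilocks */
	error = xfs_trans_commit(tp);		/* runs the thawed dfops */
	/* freezer cleanup (irele of dff_inodes) elided */
}

If the dfops chain could instead be reconstructed directly from the recovered intents, as discussed above, pass 2 -- and the freezer itself -- would disappear.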
* Re: [PATCH 3/3] xfs: fix an incore inode UAF in xfs_bui_recover 2020-05-06 17:01 ` Darrick J. Wong @ 2020-05-07 9:53 ` Brian Foster 2020-05-07 15:09 ` Darrick J. Wong 0 siblings, 1 reply; 29+ messages in thread From: Brian Foster @ 2020-05-07 9:53 UTC (permalink / raw) To: Darrick J. Wong; +Cc: linux-xfs On Wed, May 06, 2020 at 10:01:50AM -0700, Darrick J. Wong wrote: > On Wed, May 06, 2020 at 09:56:56AM -0400, Brian Foster wrote: > > On Tue, May 05, 2020 at 05:34:23PM -0700, Darrick J. Wong wrote: > > > On Tue, May 05, 2020 at 10:11:38AM -0400, Brian Foster wrote: > > > > On Mon, May 04, 2020 at 06:13:53PM -0700, Darrick J. Wong wrote: > > > > > From: Darrick J. Wong <darrick.wong@oracle.com> > > > > > > > > > > In xfs_bui_item_recover, there exists a use-after-free bug with regards > > > > > to the inode that is involved in the bmap replay operation. If the > > > > > mapping operation does not complete, we call xfs_bmap_unmap_extent to > > > > > create a deferred op to finish the unmapping work, and we retain a > > > > > pointer to the incore inode. > > > > > > > > > > Unfortunately, the very next thing we do is commit the transaction and > > > > > drop the inode. If reclaim tears down the inode before we try to finish > > > > > the defer ops, we dereference garbage and blow up. Therefore, create a > > > > > way to join inodes to the defer ops freezer so that we can maintain the > > > > > xfs_inode reference until we're done with the inode. > > > > > > > > > > Note: This imposes the requirement that there be enough memory to keep > > > > > every incore inode in memory throughout recovery. > > > > > > > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> > > > > > --- > > > > > > > > Maybe I'm missing something, but I thought the discussion on the > > > > previous version[1] landed on an approach where the intent would hold a > > > > reference to the inode. Wouldn't that break the dependency on the dfops > > > > freeze/thaw mechanism? > > > > > > We did, but as I started pondering how to make this work in practice, > > > I soon realized that it's not as simple as "don't irele the inode": > > > > > > (Part 1) > > > > > > It would be easy enough to add a flag to struct xfs_bmap_intent that > > > means "when you're done, unlock and irele the inode". You'd have to add > > > this capability to any log item that deals with inodes, but only > > > deferred bmap items need this. > > > > > > Log recovery looks like this: > > > > > > ---------------- > > > Transaction 0 > > > BUI1 unmap file1, offset1, startblock1, length == 200000 > > > ---------------- > > > <end> > > > > > > What happens if the first bunmapi call doesn't actually unmap > > > everything? We'll queue yet another bmap intent item (call it BUI2) to > > > finish the work. This means that we can't just drop the inode when > > > we're done processing BUI1; we have to propagate the "unlock and irele" > > > flag to BUI2. Ok, so now there's this chaining behavior that wasn't > > > there before. > > > > > > Then I thought about it some more, and wondered what would happen if the > > > log contained two unfinished bmap items for the same inode? If this > > > happened, you'd deadlock because you've not released the ILOCK for the > > > first unfinished bmap item when you want to acquire the ILOCK for the > > > second unfinished bmap item. > > > > > > > Hm, Ok.. so IIUC the lock is implicitly held across these map/unmap > > dfops at runtime, but that's not necessarily the case during recovery > > due to the dfops chaining behavior. 
To me, this raises the question of > > why we need to reorder/split these particular intents during recovery if > > the original operation would have completed under a single inode lock > > cycle. > > Log recovery /did/ function that way (no reordering) until > 509955823cc9c, when it was discovered that we don't have a good way to > reconstruct the dfops chain from the list of recovered log intent items. > > > The comments in xfs_log_recover.c don't really explain it. From going > > back to commit 509955823cc9c ("xfs: log recovery should replay deferred > > ops in order"), it looks like the issue is that ordering gets screwed up > > for operations that commit multiple intents per-transaction and > > completion of those intents creates new child intents. At runtime, we'd > > process the intents in the current transaction first and then get to the > > child intents in the order they were created. If we didn't have the > > recovery time dfops shuffling implemented in this patch, recovery would > > process the first intent in the log, then all of the child intents of > > that one before getting to the next intent in the log that presumably > > would have been second at runtime. Am I following that correctly? > > Correct. If we could figure out how to rebuild the dfops chain (and > thus the ordering requirements), we would be able to eliminate this > freezer stuff entirely. > Ok, thanks. I was wondering the same, but it's not clear how to do that with just the intent information in the log and big checkpoint transactions. Perhaps we could if we had a transactional id that was common across intents in a single transaction, or otherwise unique intent ids that would allow an intent to refer to the next in a chain at the time the intent was logged (such that recovery could learn to construct and process chains instead of just individual intents). I'm curious if we're going to want/need to solve that issue somehow or another more directly eventually as we grow more intents and perhaps come up with new and more complex intent-based operations (with more complex recovery scenarios). > > I think the rest makes sense from the perspective of not wanting to > > plumb the conditional locking complexity through the dfops/intent > > processing for every intent type that eventually needs it, as opposed to > > containing in the log recovery code (via dfops magic). I need to think > > about all that some more, but I'd like to make sure I understand why we > > have this recovery requirement in the first place. > > <nod> It took me a few hours to figure out why that patch was correct > the first time I saw it. It took me a few hours again(!) when I was > trying to write this series. :/ > If I follow correctly, the rmap swapext sequence makes for a good example because the bmap completions generate the rmap intents. A simplified (i.e. single extent remap) runtime transactional flow for that looks something like the following: t1: unmap intent, map intent t2: unmap done, map intent, runmap intent t3: map done, runmap intent, rmap intent t4: runmap done, rmap intent ... So at runtime these operations complete in the obvious linear order. Intents drop off the front as done items are logged and new intents are tacked on the end. If we crash between t2 and t3, however, we see t2 in the log and would end up completing the map intent and the subsequently generated rmap intent (that pops up in t3 at runtime) before we process the runmap intent that's still in the log (and was originally in the dfops queue for t2). 
I _think_ that's essentially the scenario described in 509955823cc9c, but it's not clear to me if it's the same runtime example.. Brian > --D > > > Brian > > > > > You'd have to drop the ILOCK (but retain the inode reference) between > > > the two bmap items. I think this is currently impossible because there > > > aren't any transactions that queue multiple bmap intents per > > > transaction, nobody else holds the ilock, and the existing rmap > > > implementation of the swapext ioctl doesn't allow swapping a file with > > > itself. However... > > > > > > (Part 2) > > > > > > The second thing I thought about is, what if there's a lot of work after > > > transaction 0? Say the recovery stream looks like: > > > > > > ---------------- > > > Transaction 0 > > > BUI1 unmap file1, offset1, startblock1, length == 200000 > > > ---------------- > > > <10000 more intents applied to other things> > > > ---------------- > > > <end> > > > > > > This is a problem, because we must drop the ILOCK after completing the > > > BUI1 work so that we can process the 10000 more intents before we can > > > get to BUI2. We cannot let the tail of the log get pinned on the locked > > > inode, so we must release the ILOCK after we're done with BUI1 and > > > re-acquire it when we're ready to deal with BUI2. > > > > > > This means I need /two/ flags in struct xfs_bmap_intent -- one to say > > > that we need to irele the inode at the end of processing, and another to > > > say that the bmap item needs to grab the ILOCK. If there's going to be > > > a BUI2 as a result of recovering BUI1, the first flag must be propagated > > > to BUI2, but the second flag must /not/ be propagated. > > > > > > A potential counterargument is that we crammed all 10000 intents into > > > the log without pinning the log, so maybe we can hold the ILOCK. > > > However.... > > > > > > (Part 3) > > > > > > Next, I considered how the swapext log item in the atomic file update > > > patchset would interact with log recovery. A couple of notes about the > > > swapext log items: > > > > > > 1. They are careful not to create the kinds of bunmap operations that > > > would cause the creation of /more/ BUIs. > > > > > > 2. Every time we log one extent's worth of unmap/remap operations (using > > > deferred bmap log items, naturally) we log an done item for the original > > > swapext intent item and immediately log another swapext intent for the > > > work remaining. > > > > > > I came up with the following sequence of transactions to recover: > > > > > > ---------------- > > > Transaction 0 > > > SXI0 file1, offset1, file2, offset2, length == 10 > > > ---------------- > > > Transaction 1 > > > BUI1 unmap file1, offset1, startblock1, length == 2 > > > BUI2 unmap file2, offset2, startblock2, length == 2 > > > BUI3 map file1, offset1, startblock2, length == 2 > > > BUI4 map file2, offset2, startblock1, length == 2 > > > SXD done (SXI0) > > > SXI5 file1, offset1 + 2, file2, offset2 + 2, length == 8 > > > --------------- > > > <end of log> > > > > > > Recovery in this case will end up with BUI[1-4] and SXI5 that still need > > > processing. Each of the 5 intent items gets its own recovery > > > transaction and dfops freezer chain. BUI1, BUI3, and SXI5 will each > > > require ilock'd access to file1, which means that each of them will have > > > to have the ability to drop the ILOCK but retain the reference. The > > > same argument applies to file2 and BUI2, BUI4, and SXI5. Note also that > > > in our brave new world, file1 and file2 can be the same. 
> > > > > > For the swapext log item type this inode management is particularly > > > acute because we're certain to have new SXIs created as a result of > > > recovering an SXI. > > > > > > (End) > > > > > > In conclusion, we need a mechanism to drop both the inode lock and the > > > inode reference. Each log item type that works with inodes will have to > > > set aside 2 flags somewhere in the incore extent structure, and provide > > > more or less the same code to manage that state. Right now that's just > > > the bmap items, but then the swapext items and the deferred xattr items > > > will have that same requirement when they get added. > > > > > > If we do that, the inode reference and ilock management diverges even > > > further from regular operations. Regular callers are expected to iget > > > and ilock the inode and maintain that all the way to the end of > > > xfs_trans_commit. Log recovery, on the other hand, will add a bunch of > > > state handling to some of the log items so that we can drop the ILOCK > > > after the first item, and then drop the inode reference after finishing > > > the last possible user of that inode in the revived dfops chain. > > > > > > On the other hand, keeping the inode management in the defer freezer > > > code means that I keep all that complexity and state management out of > > > the ->finish_item code. When we go to finish the new incore intents > > > that get created during recovery, the inode reference and ILOCK rules > > > are the same as anywhere else -- the caller is required to grab the > > > inode, the transaction, and the ilock (in that order). > > > > > > To the extent that recovering log intent items is /not/ the same as > > > running a normal transaction, all that weirdness is concentrated in > > > xfs_log_recover.c and a single function in xfs_defer.c. The desired > > > behavior is the same across all the log intent item types, so I > > > decided in favor of keeping the centralised behavior and (trying to) > > > contain the wacky. We don't need all that right away, but we will. > > > > > > Note that I did make it so that we retain the reference always, so > > > there's no longer a need for irele/igrab cycles, which makes capturing > > > the dfops chain less error prone. > > > > > > --D > > > > > > > Brian > > > > > > > > [1] https://lore.kernel.org/linux-xfs/20200429235818.GX6742@magnolia/ > > > > > > > > > fs/xfs/libxfs/xfs_defer.c | 50 +++++++++++++++++++++++++++++++++++++++++++++ > > > > > fs/xfs/libxfs/xfs_defer.h | 10 +++++++++ > > > > > fs/xfs/xfs_bmap_item.c | 7 ++++-- > > > > > fs/xfs/xfs_icache.c | 19 +++++++++++++++++ > > > > > 4 files changed, 83 insertions(+), 3 deletions(-) > > > > > > > > > > > > > > > diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c > > > > > index ea4d28851bbd..72933fdafcb2 100644 > > > > > --- a/fs/xfs/libxfs/xfs_defer.c > > > > > +++ b/fs/xfs/libxfs/xfs_defer.c > > > > > @@ -16,6 +16,7 @@ > > > > > #include "xfs_inode.h" > > > > > #include "xfs_inode_item.h" > > > > > #include "xfs_trace.h" > > > > > +#include "xfs_icache.h" > > > > > > > > > > /* > > > > > * Deferred Operations in XFS > > > > > @@ -583,8 +584,19 @@ xfs_defer_thaw( > > > > > struct xfs_defer_freezer *dff, > > > > > struct xfs_trans *tp) > > > > > { > > > > > + int i; > > > > > + > > > > > ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES); > > > > > > > > > > + /* Re-acquire the inode locks. 
*/ > > > > > + for (i = 0; i < XFS_DEFER_FREEZER_INODES; i++) { > > > > > + if (!dff->dff_inodes[i]) > > > > > + break; > > > > > + > > > > > + dff->dff_ilocks[i] = XFS_ILOCK_EXCL; > > > > > + xfs_ilock(dff->dff_inodes[i], dff->dff_ilocks[i]); > > > > > + } > > > > > + > > > > > /* Add the dfops items to the transaction. */ > > > > > list_splice_init(&dff->dff_dfops, &tp->t_dfops); > > > > > tp->t_flags |= dff->dff_tpflags; > > > > > @@ -597,5 +609,43 @@ xfs_defer_freeezer_finish( > > > > > struct xfs_defer_freezer *dff) > > > > > { > > > > > xfs_defer_cancel_list(mp, &dff->dff_dfops); > > > > > + xfs_defer_freezer_irele(dff); > > > > > kmem_free(dff); > > > > > } > > > > > + > > > > > +/* > > > > > + * Attach an inode to this deferred ops freezer. Callers must hold ILOCK_EXCL, > > > > > + * which will be dropped and reacquired when we're ready to thaw the frozen > > > > > + * deferred ops. > > > > > + */ > > > > > +int > > > > > +xfs_defer_freezer_ijoin( > > > > > + struct xfs_defer_freezer *dff, > > > > > + struct xfs_inode *ip) > > > > > +{ > > > > > + unsigned int i; > > > > > + > > > > > + ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL)); > > > > > + > > > > > + for (i = 0; i < XFS_DEFER_FREEZER_INODES; i++) { > > > > > + if (dff->dff_inodes[i] == ip) > > > > > + goto out; > > > > > + if (dff->dff_inodes[i] == NULL) > > > > > + break; > > > > > + } > > > > > + > > > > > + if (i == XFS_DEFER_FREEZER_INODES) { > > > > > + ASSERT(0); > > > > > + return -EFSCORRUPTED; > > > > > + } > > > > > + > > > > > + /* > > > > > + * Attach this inode to the freezer and drop its ILOCK because we > > > > > + * assume the caller will need to allocate a transaction. > > > > > + */ > > > > > + dff->dff_inodes[i] = ip; > > > > > + dff->dff_ilocks[i] = 0; > > > > > +out: > > > > > + xfs_iunlock(ip, XFS_ILOCK_EXCL); > > > > > + return 0; > > > > > +} > > > > > diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h > > > > > index 7ae05e10d750..0052a0313283 100644 > > > > > --- a/fs/xfs/libxfs/xfs_defer.h > > > > > +++ b/fs/xfs/libxfs/xfs_defer.h > > > > > @@ -76,6 +76,11 @@ struct xfs_defer_freezer { > > > > > /* Deferred ops state saved from the transaction. */ > > > > > struct list_head dff_dfops; > > > > > unsigned int dff_tpflags; > > > > > + > > > > > + /* Inodes to hold when we want to finish the deferred work items. */ > > > > > +#define XFS_DEFER_FREEZER_INODES 2 > > > > > + unsigned int dff_ilocks[XFS_DEFER_FREEZER_INODES]; > > > > > + struct xfs_inode *dff_inodes[XFS_DEFER_FREEZER_INODES]; > > > > > }; > > > > > > > > > > /* Functions to freeze a chain of deferred operations for later. */ > > > > > @@ -83,5 +88,10 @@ int xfs_defer_freeze(struct xfs_trans *tp, struct xfs_defer_freezer **dffp); > > > > > void xfs_defer_thaw(struct xfs_defer_freezer *dff, struct xfs_trans *tp); > > > > > void xfs_defer_freeezer_finish(struct xfs_mount *mp, > > > > > struct xfs_defer_freezer *dff); > > > > > +int xfs_defer_freezer_ijoin(struct xfs_defer_freezer *dff, > > > > > + struct xfs_inode *ip); > > > > > + > > > > > +/* These functions must be provided by the xfs implementation. 
*/ > > > > > +void xfs_defer_freezer_irele(struct xfs_defer_freezer *dff); > > > > > > > > > > #endif /* __XFS_DEFER_H__ */ > > > > > diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c > > > > > index c733bdeeeb9b..bbce191d8fcd 100644 > > > > > --- a/fs/xfs/xfs_bmap_item.c > > > > > +++ b/fs/xfs/xfs_bmap_item.c > > > > > @@ -530,12 +530,13 @@ xfs_bui_item_recover( > > > > > } > > > > > > > > > > error = xlog_recover_trans_commit(tp, dffp); > > > > > - xfs_iunlock(ip, XFS_ILOCK_EXCL); > > > > > - xfs_irele(ip); > > > > > - return error; > > > > > + if (error) > > > > > + goto err_rele; > > > > > + return xfs_defer_freezer_ijoin(*dffp, ip); > > > > > > > > > > err_inode: > > > > > xfs_trans_cancel(tp); > > > > > +err_rele: > > > > > if (ip) { > > > > > xfs_iunlock(ip, XFS_ILOCK_EXCL); > > > > > xfs_irele(ip); > > > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > > > > > index 17a0b86fe701..b96ddf5ff334 100644 > > > > > --- a/fs/xfs/xfs_icache.c > > > > > +++ b/fs/xfs/xfs_icache.c > > > > > @@ -12,6 +12,7 @@ > > > > > #include "xfs_sb.h" > > > > > #include "xfs_mount.h" > > > > > #include "xfs_inode.h" > > > > > +#include "xfs_defer.h" > > > > > #include "xfs_trans.h" > > > > > #include "xfs_trans_priv.h" > > > > > #include "xfs_inode_item.h" > > > > > @@ -1847,3 +1848,21 @@ xfs_start_block_reaping( > > > > > xfs_queue_eofblocks(mp); > > > > > xfs_queue_cowblocks(mp); > > > > > } > > > > > + > > > > > +/* Release all the inode resources attached to this freezer. */ > > > > > +void > > > > > +xfs_defer_freezer_irele( > > > > > + struct xfs_defer_freezer *dff) > > > > > +{ > > > > > + unsigned int i; > > > > > + > > > > > + for (i = 0; i < XFS_DEFER_FREEZER_INODES; i++) { > > > > > + if (!dff->dff_inodes[i]) > > > > > + break; > > > > > + > > > > > + if (dff->dff_ilocks[i]) > > > > > + xfs_iunlock(dff->dff_inodes[i], dff->dff_ilocks[i]); > > > > > + xfs_irele(dff->dff_inodes[i]); > > > > > + dff->dff_inodes[i] = NULL; > > > > > + } > > > > > +} > > > > > > > > > > > > > > > ^ permalink raw reply [flat|nested] 29+ messages in thread
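The inode bookkeeping in the patch above is compact but stateful: ijoin records the inode and drops its ILOCK, thaw re-takes every recorded lock, and irele unlocks and releases whatever is left. As a reading aid, here is a freestanding userspace sketch of that state machine with the XFS internals reduced to flags; the names loosely mirror the patch, but none of this is kernel code.

/*
 * Freestanding model of the defer-freezer inode lifecycle shown in the
 * patch above.  Locks and references are plain ints so the state
 * machine can be followed (and run) in isolation.
 */
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define FREEZER_INODES	2
#define ILOCK_EXCL	1u		/* stand-in for XFS_ILOCK_EXCL */

struct inode {
	int		i_locked;	/* stand-in for the real ILOCK */
	int		i_refcount;	/* stand-in for the inode reference */
};

struct freezer {
	unsigned int	ilocks[FREEZER_INODES];
	struct inode	*inodes[FREEZER_INODES];
};

/* ijoin: caller holds the ILOCK; record the inode, then drop the lock. */
static int freezer_ijoin(struct freezer *dff, struct inode *ip)
{
	unsigned int i;

	assert(ip->i_locked);
	for (i = 0; i < FREEZER_INODES; i++) {
		if (dff->inodes[i] == ip)
			goto out;
		if (dff->inodes[i] == NULL)
			break;
	}
	if (i == FREEZER_INODES)
		return -1;	/* freezer full: -EFSCORRUPTED upstream */
	dff->inodes[i] = ip;
	dff->ilocks[i] = 0;	/* 0 == freezer does not hold the lock */
out:
	ip->i_locked = 0;	/* xfs_iunlock(ip, XFS_ILOCK_EXCL) */
	return 0;
}

/* thaw: re-take every recorded ILOCK before the dfops chain runs again. */
static void freezer_thaw(struct freezer *dff)
{
	unsigned int i;

	for (i = 0; i < FREEZER_INODES && dff->inodes[i]; i++) {
		dff->ilocks[i] = ILOCK_EXCL;
		dff->inodes[i]->i_locked = 1;	/* xfs_ilock(...) */
	}
}

/* irele: unlock anything still locked, then drop the references. */
static void freezer_irele(struct freezer *dff)
{
	unsigned int i;

	for (i = 0; i < FREEZER_INODES && dff->inodes[i]; i++) {
		if (dff->ilocks[i])
			dff->inodes[i]->i_locked = 0;	/* xfs_iunlock */
		dff->inodes[i]->i_refcount--;		/* xfs_irele */
		dff->inodes[i] = NULL;
	}
}

int main(void)
{
	struct inode ip = { .i_locked = 1, .i_refcount = 1 };
	struct freezer dff = { { 0 }, { NULL } };

	if (freezer_ijoin(&dff, &ip))	/* intent recovered; ILOCK dropped */
		return 1;
	freezer_thaw(&dff);		/* about to finish the dfops chain */
	freezer_irele(&dff);		/* chain done: unlock and release */
	printf("locked=%d refcount=%d\n", ip.i_locked, ip.i_refcount);
	return 0;
}

Note how dff_ilocks[] doubles as a held/not-held marker, which is what lets irele run safely both after a thaw and on the straight teardown path.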
* Re: [PATCH 3/3] xfs: fix an incore inode UAF in xfs_bui_recover 2020-05-07 9:53 ` Brian Foster @ 2020-05-07 15:09 ` Darrick J. Wong 2020-05-07 16:58 ` Brian Foster 0 siblings, 1 reply; 29+ messages in thread From: Darrick J. Wong @ 2020-05-07 15:09 UTC (permalink / raw) To: Brian Foster; +Cc: linux-xfs On Thu, May 07, 2020 at 05:53:06AM -0400, Brian Foster wrote: > On Wed, May 06, 2020 at 10:01:50AM -0700, Darrick J. Wong wrote: > > On Wed, May 06, 2020 at 09:56:56AM -0400, Brian Foster wrote: > > > On Tue, May 05, 2020 at 05:34:23PM -0700, Darrick J. Wong wrote: > > > > On Tue, May 05, 2020 at 10:11:38AM -0400, Brian Foster wrote: > > > > > On Mon, May 04, 2020 at 06:13:53PM -0700, Darrick J. Wong wrote: > > > > > > From: Darrick J. Wong <darrick.wong@oracle.com> > > > > > > > > > > > > In xfs_bui_item_recover, there exists a use-after-free bug with regards > > > > > > to the inode that is involved in the bmap replay operation. If the > > > > > > mapping operation does not complete, we call xfs_bmap_unmap_extent to > > > > > > create a deferred op to finish the unmapping work, and we retain a > > > > > > pointer to the incore inode. > > > > > > > > > > > > Unfortunately, the very next thing we do is commit the transaction and > > > > > > drop the inode. If reclaim tears down the inode before we try to finish > > > > > > the defer ops, we dereference garbage and blow up. Therefore, create a > > > > > > way to join inodes to the defer ops freezer so that we can maintain the > > > > > > xfs_inode reference until we're done with the inode. > > > > > > > > > > > > Note: This imposes the requirement that there be enough memory to keep > > > > > > every incore inode in memory throughout recovery. > > > > > > > > > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> > > > > > > --- > > > > > > > > > > Maybe I'm missing something, but I thought the discussion on the > > > > > previous version[1] landed on an approach where the intent would hold a > > > > > reference to the inode. Wouldn't that break the dependency on the dfops > > > > > freeze/thaw mechanism? > > > > > > > > We did, but as I started pondering how to make this work in practice, > > > > I soon realized that it's not as simple as "don't irele the inode": > > > > > > > > (Part 1) > > > > > > > > It would be easy enough to add a flag to struct xfs_bmap_intent that > > > > means "when you're done, unlock and irele the inode". You'd have to add > > > > this capability to any log item that deals with inodes, but only > > > > deferred bmap items need this. > > > > > > > > Log recovery looks like this: > > > > > > > > ---------------- > > > > Transaction 0 > > > > BUI1 unmap file1, offset1, startblock1, length == 200000 > > > > ---------------- > > > > <end> > > > > > > > > What happens if the first bunmapi call doesn't actually unmap > > > > everything? We'll queue yet another bmap intent item (call it BUI2) to > > > > finish the work. This means that we can't just drop the inode when > > > > we're done processing BUI1; we have to propagate the "unlock and irele" > > > > flag to BUI2. Ok, so now there's this chaining behavior that wasn't > > > > there before. > > > > > > > > Then I thought about it some more, and wondered what would happen if the > > > > log contained two unfinished bmap items for the same inode? If this > > > > happened, you'd deadlock because you've not released the ILOCK for the > > > > first unfinished bmap item when you want to acquire the ILOCK for the > > > > second unfinished bmap item. 
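For concreteness, the chaining rule sketched above would need something like the following; the flag name is invented for illustration and does not exist in struct xfs_bmap_intent. Part 2 of the discussion adds a second, non-propagating flag, which is where the per-item bookkeeping starts to sprawl.

/*
 * Invented for illustration -- no such flag exists in
 * struct xfs_bmap_intent.
 */
#define XFS_BMAP_INTENT_IRELE	(1u << 0)	/* unlock + irele when done */

/*
 * If recovering BUI1 leaves work over, the requeued BUI2 must inherit
 * ownership of the inode reference, otherwise nobody ever drops it.
 */
static inline unsigned int
bui_requeue_flags(unsigned int parent_flags)
{
	return parent_flags & XFS_BMAP_INTENT_IRELE;
}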
> > > > > > > > > > Hm, Ok.. so IIUC the lock is implicitly held across these map/unmap > > > dfops at runtime, but that's not necessarily the case during recovery > > > due to the dfops chaining behavior. To me, this raises the question of > > > why we need to reorder/split these particular intents during recovery if > > > the original operation would have completed under a single inode lock > > > cycle. > > > > Log recovery /did/ function that way (no reordering) until > > 509955823cc9c, when it was discovered that we don't have a good way to > > reconstruct the dfops chain from the list of recovered log intent items. > > > > > The comments in xfs_log_recover.c don't really explain it. From going > > > back to commit 509955823cc9c ("xfs: log recovery should replay deferred > > > ops in order"), it looks like the issue is that ordering gets screwed up > > > for operations that commit multiple intents per-transaction and > > > completion of those intents creates new child intents. At runtime, we'd > > > process the intents in the current transaction first and then get to the > > > child intents in the order they were created. If we didn't have the > > > recovery time dfops shuffling implemented in this patch, recovery would > > > process the first intent in the log, then all of the child intents of > > > that one before getting to the next intent in the log that presumably > > > would have been second at runtime. Am I following that correctly? > > > > Correct. If we could figure out how to rebuild the dfops chain (and > > thus the ordering requirements), we would be able to eliminate this > > freezer stuff entirely. > > > > Ok, thanks. I was wondering the same, but it's not clear how to do that > with just the intent information in the log and big checkpoint > transactions. Perhaps we could if we had a transactional id that was > common across intents in a single transaction, or otherwise unique > intent ids that would allow an intent to refer to the next in a chain at > the time the intent was logged (such that recovery could learn to > construct and process chains instead of just individual intents). I'm > curious if we're going to want/need to solve that issue somehow or > another more directly eventually as we grow more intents and perhaps > come up with new and more complex intent-based operations (with more > complex recovery scenarios). <nod> Too bad that's a log format change. :( > > > I think the rest makes sense from the perspective of not wanting to > > > plumb the conditional locking complexity through the dfops/intent > > > processing for every intent type that eventually needs it, as opposed to > > > containing in the log recovery code (via dfops magic). I need to think > > > about all that some more, but I'd like to make sure I understand why we > > > have this recovery requirement in the first place. > > > > <nod> It took me a few hours to figure out why that patch was correct > > the first time I saw it. It took me a few hours again(!) when I was > > trying to write this series. :/ > > > > If I follow correctly, the rmap swapext sequence makes for a good > example because the bmap completions generate the rmap intents. A > simplified (i.e. 
single extent remap) runtime transactional flow for > that looks something like the following: > > t1: unmap intent, map intent > t2: unmap done, map intent, runmap intent > t3: map done, runmap intent, rmap intent > t4: runmap done, rmap intent Work items that aren't touched in a particular transaction do not get relogged, so this sequence is: t1: unmap(0) intent, map(1) intent t2: unmap(0) done, runmap(2) intent t3: map(1) done, rmap(3) intent t4: runmap(2) done t5: rmap(3) done (Maybe I need to take another look at the autorelogging patchset.) > ... > > So at runtime these operations complete in the obvious linear order. > Intents drop off the front as done items are logged and new intents are > tacked on the end. If we crash between t2 and t3, however, we see t2 in > the log and would end up completing the map intent and the subsequently > generated rmap intent (that pops up in t3 at runtime) before we process > the runmap intent that's still in the log (and was originally in the > dfops queue for t2). I _think_ that's essentially the scenario described > in 509955823cc9c, Yes... > but it's not clear to me if it's the same runtime example.. ...and you arrived at it with a different scenario. :) --D > Brian > > > --D > > > > > Brian > > > > > > > You'd have to drop the ILOCK (but retain the inode reference) between > > > > the two bmap items. I think this is currently impossible because there > > > > aren't any transactions that queue multiple bmap intents per > > > > transaction, nobody else holds the ilock, and the existing rmap > > > > implementation of the swapext ioctl doesn't allow swapping a file with > > > > itself. However... > > > > > > > > (Part 2) > > > > > > > > The second thing I thought about is, what if there's a lot of work after > > > > transaction 0? Say the recovery stream looks like: > > > > > > > > ---------------- > > > > Transaction 0 > > > > BUI1 unmap file1, offset1, startblock1, length == 200000 > > > > ---------------- > > > > <10000 more intents applied to other things> > > > > ---------------- > > > > <end> > > > > > > > > This is a problem, because we must drop the ILOCK after completing the > > > > BUI1 work so that we can process the 10000 more intents before we can > > > > get to BUI2. We cannot let the tail of the log get pinned on the locked > > > > inode, so we must release the ILOCK after we're done with BUI1 and > > > > re-acquire it when we're ready to deal with BUI2. > > > > > > > > This means I need /two/ flags in struct xfs_bmap_intent -- one to say > > > > that we need to irele the inode at the end of processing, and another to > > > > say that the bmap item needs to grab the ILOCK. If there's going to be > > > > a BUI2 as a result of recovering BUI1, the first flag must be propagated > > > > to BUI2, but the second flag must /not/ be propagated. > > > > > > > > A potential counterargument is that we crammed all 10000 intents into > > > > the log without pinning the log, so maybe we can hold the ILOCK. > > > > However.... > > > > > > > > (Part 3) > > > > > > > > Next, I considered how the swapext log item in the atomic file update > > > > patchset would interact with log recovery. A couple of notes about the > > > > swapext log items: > > > > > > > > 1. They are careful not to create the kinds of bunmap operations that > > > > would cause the creation of /more/ BUIs. > > > > > > > > 2. 
Every time we log one extent's worth of unmap/remap operations (using > > > > deferred bmap log items, naturally) we log an done item for the original > > > > swapext intent item and immediately log another swapext intent for the > > > > work remaining. > > > > > > > > I came up with the following sequence of transactions to recover: > > > > > > > > ---------------- > > > > Transaction 0 > > > > SXI0 file1, offset1, file2, offset2, length == 10 > > > > ---------------- > > > > Transaction 1 > > > > BUI1 unmap file1, offset1, startblock1, length == 2 > > > > BUI2 unmap file2, offset2, startblock2, length == 2 > > > > BUI3 map file1, offset1, startblock2, length == 2 > > > > BUI4 map file2, offset2, startblock1, length == 2 > > > > SXD done (SXI0) > > > > SXI5 file1, offset1 + 2, file2, offset2 + 2, length == 8 > > > > --------------- > > > > <end of log> > > > > > > > > Recovery in this case will end up with BUI[1-4] and SXI5 that still need > > > > processing. Each of the 5 intent items gets its own recovery > > > > transaction and dfops freezer chain. BUI1, BUI3, and SXI5 will each > > > > require ilock'd access to file1, which means that each of them will have > > > > to have the ability to drop the ILOCK but retain the reference. The > > > > same argument applies to file2 and BUI2, BUI4, and SXI5. Note also that > > > > in our brave new world, file1 and file2 can be the same. > > > > > > > > For the swapext log item type this inode management is particularly > > > > acute because we're certain to have new SXIs created as a result of > > > > recovering an SXI. > > > > > > > > (End) > > > > > > > > In conclusion, we need a mechanism to drop both the inode lock and the > > > > inode reference. Each log item type that works with inodes will have to > > > > set aside 2 flags somewhere in the incore extent structure, and provide > > > > more or less the same code to manage that state. Right now that's just > > > > the bmap items, but then the swapext items and the deferred xattr items > > > > will have that same requirement when they get added. > > > > > > > > If we do that, the inode reference and ilock management diverges even > > > > further from regular operations. Regular callers are expected to iget > > > > and ilock the inode and maintain that all the way to the end of > > > > xfs_trans_commit. Log recovery, on the other hand, will add a bunch of > > > > state handling to some of the log items so that we can drop the ILOCK > > > > after the first item, and then drop the inode reference after finishing > > > > the last possible user of that inode in the revived dfops chain. > > > > > > > > On the other hand, keeping the inode management in the defer freezer > > > > code means that I keep all that complexity and state management out of > > > > the ->finish_item code. When we go to finish the new incore intents > > > > that get created during recovery, the inode reference and ILOCK rules > > > > are the same as anywhere else -- the caller is required to grab the > > > > inode, the transaction, and the ilock (in that order). > > > > > > > > To the extent that recovering log intent items is /not/ the same as > > > > running a normal transaction, all that weirdness is concentrated in > > > > xfs_log_recover.c and a single function in xfs_defer.c. The desired > > > > behavior is the same across all the log intent item types, so I > > > > decided in favor of keeping the centralised behavior and (trying to) > > > > contain the wacky. We don't need all that right away, but we will. 
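As a rough picture of where that centralised behavior lands, one plausible driver for a thawed chain, assembled only from the functions shown in the patch earlier in this thread, might be the following; the real sequencing lives in xfs_log_recover.c and may well differ in detail.

/*
 * Sketch of a recovery-side driver for one frozen dfops chain.  The
 * reservation is illustrative; "freeezer" is spelled as in the posted
 * patch.
 */
static int
replay_frozen_dfops(
	struct xfs_mount		*mp,
	struct xfs_defer_freezer	*dff)
{
	struct xfs_trans		*tp;
	int				error;

	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
	if (error)
		return error;

	xfs_defer_thaw(dff, tp);	/* re-takes the ILOCKs, splices dfops */
	error = xfs_trans_commit(tp);	/* finishing the chain may requeue */

	/* unlocks and releases whatever inodes the freezer still holds */
	xfs_defer_freeezer_finish(mp, dff);
	return error;
}

The point of this shape is that the ILOCK re-acquisition and the final irele are both owned by the freezer, not by the individual intent items.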
> > > > > > > > Note that I did make it so that we retain the reference always, so > > > > there's no longer a need for irele/igrab cycles, which makes capturing > > > > the dfops chain less error prone. > > > > > > > > --D > > > > > > > > > Brian > > > > > > > > > > [1] https://lore.kernel.org/linux-xfs/20200429235818.GX6742@magnolia/ > > > > > > > > > > > fs/xfs/libxfs/xfs_defer.c | 50 +++++++++++++++++++++++++++++++++++++++++++++ > > > > > > fs/xfs/libxfs/xfs_defer.h | 10 +++++++++ > > > > > > fs/xfs/xfs_bmap_item.c | 7 ++++-- > > > > > > fs/xfs/xfs_icache.c | 19 +++++++++++++++++ > > > > > > 4 files changed, 83 insertions(+), 3 deletions(-) > > > > > > > > > > > > > > > > > > diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c > > > > > > index ea4d28851bbd..72933fdafcb2 100644 > > > > > > --- a/fs/xfs/libxfs/xfs_defer.c > > > > > > +++ b/fs/xfs/libxfs/xfs_defer.c > > > > > > @@ -16,6 +16,7 @@ > > > > > > #include "xfs_inode.h" > > > > > > #include "xfs_inode_item.h" > > > > > > #include "xfs_trace.h" > > > > > > +#include "xfs_icache.h" > > > > > > > > > > > > /* > > > > > > * Deferred Operations in XFS > > > > > > @@ -583,8 +584,19 @@ xfs_defer_thaw( > > > > > > struct xfs_defer_freezer *dff, > > > > > > struct xfs_trans *tp) > > > > > > { > > > > > > + int i; > > > > > > + > > > > > > ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES); > > > > > > > > > > > > + /* Re-acquire the inode locks. */ > > > > > > + for (i = 0; i < XFS_DEFER_FREEZER_INODES; i++) { > > > > > > + if (!dff->dff_inodes[i]) > > > > > > + break; > > > > > > + > > > > > > + dff->dff_ilocks[i] = XFS_ILOCK_EXCL; > > > > > > + xfs_ilock(dff->dff_inodes[i], dff->dff_ilocks[i]); > > > > > > + } > > > > > > + > > > > > > /* Add the dfops items to the transaction. */ > > > > > > list_splice_init(&dff->dff_dfops, &tp->t_dfops); > > > > > > tp->t_flags |= dff->dff_tpflags; > > > > > > @@ -597,5 +609,43 @@ xfs_defer_freeezer_finish( > > > > > > struct xfs_defer_freezer *dff) > > > > > > { > > > > > > xfs_defer_cancel_list(mp, &dff->dff_dfops); > > > > > > + xfs_defer_freezer_irele(dff); > > > > > > kmem_free(dff); > > > > > > } > > > > > > + > > > > > > +/* > > > > > > + * Attach an inode to this deferred ops freezer. Callers must hold ILOCK_EXCL, > > > > > > + * which will be dropped and reacquired when we're ready to thaw the frozen > > > > > > + * deferred ops. > > > > > > + */ > > > > > > +int > > > > > > +xfs_defer_freezer_ijoin( > > > > > > + struct xfs_defer_freezer *dff, > > > > > > + struct xfs_inode *ip) > > > > > > +{ > > > > > > + unsigned int i; > > > > > > + > > > > > > + ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL)); > > > > > > + > > > > > > + for (i = 0; i < XFS_DEFER_FREEZER_INODES; i++) { > > > > > > + if (dff->dff_inodes[i] == ip) > > > > > > + goto out; > > > > > > + if (dff->dff_inodes[i] == NULL) > > > > > > + break; > > > > > > + } > > > > > > + > > > > > > + if (i == XFS_DEFER_FREEZER_INODES) { > > > > > > + ASSERT(0); > > > > > > + return -EFSCORRUPTED; > > > > > > + } > > > > > > + > > > > > > + /* > > > > > > + * Attach this inode to the freezer and drop its ILOCK because we > > > > > > + * assume the caller will need to allocate a transaction. 
> > > > > > + */ > > > > > > + dff->dff_inodes[i] = ip; > > > > > > + dff->dff_ilocks[i] = 0; > > > > > > +out: > > > > > > + xfs_iunlock(ip, XFS_ILOCK_EXCL); > > > > > > + return 0; > > > > > > +} > > > > > > diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h > > > > > > index 7ae05e10d750..0052a0313283 100644 > > > > > > --- a/fs/xfs/libxfs/xfs_defer.h > > > > > > +++ b/fs/xfs/libxfs/xfs_defer.h > > > > > > @@ -76,6 +76,11 @@ struct xfs_defer_freezer { > > > > > > /* Deferred ops state saved from the transaction. */ > > > > > > struct list_head dff_dfops; > > > > > > unsigned int dff_tpflags; > > > > > > + > > > > > > + /* Inodes to hold when we want to finish the deferred work items. */ > > > > > > +#define XFS_DEFER_FREEZER_INODES 2 > > > > > > + unsigned int dff_ilocks[XFS_DEFER_FREEZER_INODES]; > > > > > > + struct xfs_inode *dff_inodes[XFS_DEFER_FREEZER_INODES]; > > > > > > }; > > > > > > > > > > > > /* Functions to freeze a chain of deferred operations for later. */ > > > > > > @@ -83,5 +88,10 @@ int xfs_defer_freeze(struct xfs_trans *tp, struct xfs_defer_freezer **dffp); > > > > > > void xfs_defer_thaw(struct xfs_defer_freezer *dff, struct xfs_trans *tp); > > > > > > void xfs_defer_freeezer_finish(struct xfs_mount *mp, > > > > > > struct xfs_defer_freezer *dff); > > > > > > +int xfs_defer_freezer_ijoin(struct xfs_defer_freezer *dff, > > > > > > + struct xfs_inode *ip); > > > > > > + > > > > > > +/* These functions must be provided by the xfs implementation. */ > > > > > > +void xfs_defer_freezer_irele(struct xfs_defer_freezer *dff); > > > > > > > > > > > > #endif /* __XFS_DEFER_H__ */ > > > > > > diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c > > > > > > index c733bdeeeb9b..bbce191d8fcd 100644 > > > > > > --- a/fs/xfs/xfs_bmap_item.c > > > > > > +++ b/fs/xfs/xfs_bmap_item.c > > > > > > @@ -530,12 +530,13 @@ xfs_bui_item_recover( > > > > > > } > > > > > > > > > > > > error = xlog_recover_trans_commit(tp, dffp); > > > > > > - xfs_iunlock(ip, XFS_ILOCK_EXCL); > > > > > > - xfs_irele(ip); > > > > > > - return error; > > > > > > + if (error) > > > > > > + goto err_rele; > > > > > > + return xfs_defer_freezer_ijoin(*dffp, ip); > > > > > > > > > > > > err_inode: > > > > > > xfs_trans_cancel(tp); > > > > > > +err_rele: > > > > > > if (ip) { > > > > > > xfs_iunlock(ip, XFS_ILOCK_EXCL); > > > > > > xfs_irele(ip); > > > > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > > > > > > index 17a0b86fe701..b96ddf5ff334 100644 > > > > > > --- a/fs/xfs/xfs_icache.c > > > > > > +++ b/fs/xfs/xfs_icache.c > > > > > > @@ -12,6 +12,7 @@ > > > > > > #include "xfs_sb.h" > > > > > > #include "xfs_mount.h" > > > > > > #include "xfs_inode.h" > > > > > > +#include "xfs_defer.h" > > > > > > #include "xfs_trans.h" > > > > > > #include "xfs_trans_priv.h" > > > > > > #include "xfs_inode_item.h" > > > > > > @@ -1847,3 +1848,21 @@ xfs_start_block_reaping( > > > > > > xfs_queue_eofblocks(mp); > > > > > > xfs_queue_cowblocks(mp); > > > > > > } > > > > > > + > > > > > > +/* Release all the inode resources attached to this freezer. 
*/ > > > > > > +void > > > > > > +xfs_defer_freezer_irele( > > > > > > + struct xfs_defer_freezer *dff) > > > > > > +{ > > > > > > + unsigned int i; > > > > > > + > > > > > > + for (i = 0; i < XFS_DEFER_FREEZER_INODES; i++) { > > > > > > + if (!dff->dff_inodes[i]) > > > > > > + break; > > > > > > + > > > > > > + if (dff->dff_ilocks[i]) > > > > > > + xfs_iunlock(dff->dff_inodes[i], dff->dff_ilocks[i]); > > > > > > + xfs_irele(dff->dff_inodes[i]); > > > > > > + dff->dff_inodes[i] = NULL; > > > > > > + } > > > > > > +} > > > > > > > > > > > > > > > > > > > > > ^ permalink raw reply [flat|nested] 29+ messages in thread
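The ownership handoff encoded by the xfs_bui_item_recover hunk in the patch quoted above is easy to miss in diff form, so here it is restated as a condensed sketch (labels renamed, earlier failure paths elided; not literal kernel source): on success the freezer inherits the inode reference, on failure the recovery function still owns it.

	error = xlog_recover_trans_commit(tp, dffp);
	if (error)
		goto out_rele;
	/* success: the freezer now owns the inode reference */
	return xfs_defer_freezer_ijoin(*dffp, ip);

out_cancel:	/* reached only from earlier, elided failures */
	xfs_trans_cancel(tp);
out_rele:
	/* failure: we still own the reference, so unlock and release it */
	xfs_iunlock(ip, XFS_ILOCK_EXCL);
	xfs_irele(ip);
	return error;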
* Re: [PATCH 3/3] xfs: fix an incore inode UAF in xfs_bui_recover 2020-05-07 15:09 ` Darrick J. Wong @ 2020-05-07 16:58 ` Brian Foster 0 siblings, 0 replies; 29+ messages in thread From: Brian Foster @ 2020-05-07 16:58 UTC (permalink / raw) To: Darrick J. Wong; +Cc: linux-xfs On Thu, May 07, 2020 at 08:09:56AM -0700, Darrick J. Wong wrote: > On Thu, May 07, 2020 at 05:53:06AM -0400, Brian Foster wrote: > > On Wed, May 06, 2020 at 10:01:50AM -0700, Darrick J. Wong wrote: > > > On Wed, May 06, 2020 at 09:56:56AM -0400, Brian Foster wrote: > > > > On Tue, May 05, 2020 at 05:34:23PM -0700, Darrick J. Wong wrote: > > > > > On Tue, May 05, 2020 at 10:11:38AM -0400, Brian Foster wrote: > > > > > > On Mon, May 04, 2020 at 06:13:53PM -0700, Darrick J. Wong wrote: > > > > > > > From: Darrick J. Wong <darrick.wong@oracle.com> > > > > > > > > > > > > > > In xfs_bui_item_recover, there exists a use-after-free bug with regards > > > > > > > to the inode that is involved in the bmap replay operation. If the > > > > > > > mapping operation does not complete, we call xfs_bmap_unmap_extent to > > > > > > > create a deferred op to finish the unmapping work, and we retain a > > > > > > > pointer to the incore inode. > > > > > > > > > > > > > > Unfortunately, the very next thing we do is commit the transaction and > > > > > > > drop the inode. If reclaim tears down the inode before we try to finish > > > > > > > the defer ops, we dereference garbage and blow up. Therefore, create a > > > > > > > way to join inodes to the defer ops freezer so that we can maintain the > > > > > > > xfs_inode reference until we're done with the inode. > > > > > > > > > > > > > > Note: This imposes the requirement that there be enough memory to keep > > > > > > > every incore inode in memory throughout recovery. > > > > > > > > > > > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> > > > > > > > --- > > > > > > > > > > > > Maybe I'm missing something, but I thought the discussion on the > > > > > > previous version[1] landed on an approach where the intent would hold a > > > > > > reference to the inode. Wouldn't that break the dependency on the dfops > > > > > > freeze/thaw mechanism? > > > > > > > > > > We did, but as I started pondering how to make this work in practice, > > > > > I soon realized that it's not as simple as "don't irele the inode": > > > > > > > > > > (Part 1) > > > > > > > > > > It would be easy enough to add a flag to struct xfs_bmap_intent that > > > > > means "when you're done, unlock and irele the inode". You'd have to add > > > > > this capability to any log item that deals with inodes, but only > > > > > deferred bmap items need this. > > > > > > > > > > Log recovery looks like this: > > > > > > > > > > ---------------- > > > > > Transaction 0 > > > > > BUI1 unmap file1, offset1, startblock1, length == 200000 > > > > > ---------------- > > > > > <end> > > > > > > > > > > What happens if the first bunmapi call doesn't actually unmap > > > > > everything? We'll queue yet another bmap intent item (call it BUI2) to > > > > > finish the work. This means that we can't just drop the inode when > > > > > we're done processing BUI1; we have to propagate the "unlock and irele" > > > > > flag to BUI2. Ok, so now there's this chaining behavior that wasn't > > > > > there before. > > > > > > > > > > Then I thought about it some more, and wondered what would happen if the > > > > > log contained two unfinished bmap items for the same inode? 
If this > > > > > happened, you'd deadlock because you've not released the ILOCK for the > > > > > first unfinished bmap item when you want to acquire the ILOCK for the > > > > > second unfinished bmap item. > > > > > > > > > > > > > Hm, Ok.. so IIUC the lock is implicitly held across these map/unmap > > > > dfops at runtime, but that's not necessarily the case during recovery > > > > due to the dfops chaining behavior. To me, this raises the question of > > > > why we need to reorder/split these particular intents during recovery if > > > > the original operation would have completed under a single inode lock > > > > cycle. > > > > > > Log recovery /did/ function that way (no reordering) until > > > 509955823cc9c, when it was discovered that we don't have a good way to > > > reconstruct the dfops chain from the list of recovered log intent items. > > > > > > > The comments in xfs_log_recover.c don't really explain it. From going > > > > back to commit 509955823cc9c ("xfs: log recovery should replay deferred > > > > ops in order"), it looks like the issue is that ordering gets screwed up > > > > for operations that commit multiple intents per-transaction and > > > > completion of those intents creates new child intents. At runtime, we'd > > > > process the intents in the current transaction first and then get to the > > > > child intents in the order they were created. If we didn't have the > > > > recovery time dfops shuffling implemented in this patch, recovery would > > > > process the first intent in the log, then all of the child intents of > > > > that one before getting to the next intent in the log that presumably > > > > would have been second at runtime. Am I following that correctly? > > > > > > Correct. If we could figure out how to rebuild the dfops chain (and > > > thus the ordering requirements), we would be able to eliminate this > > > freezer stuff entirely. > > > > > > > Ok, thanks. I was wondering the same, but it's not clear how to do that > > with just the intent information in the log and big checkpoint > > transactions. Perhaps we could if we had a transactional id that was > > common across intents in a single transaction, or otherwise unique > > intent ids that would allow an intent to refer to the next in a chain at > > the time the intent was logged (such that recovery could learn to > > construct and process chains instead of just individual intents). I'm > > curious if we're going to want/need to solve that issue somehow or > > another more directly eventually as we grow more intents and perhaps > > come up with new and more complex intent-based operations (with more > > complex recovery scenarios). > > <nod> Too bad that's a log format change. :( > Yeah. It might be worth considering if we have an opportunity to break format in the future, but not useful for the current code. > > > > I think the rest makes sense from the perspective of not wanting to > > > > plumb the conditional locking complexity through the dfops/intent > > > > processing for every intent type that eventually needs it, as opposed to > > > > containing in the log recovery code (via dfops magic). I need to think > > > > about all that some more, but I'd like to make sure I understand why we > > > > have this recovery requirement in the first place. > > > > > > <nod> It took me a few hours to figure out why that patch was correct > > > the first time I saw it. It took me a few hours again(!) when I was > > > trying to write this series. 
:/ > > > > > > > If I follow correctly, the rmap swapext sequence makes for a good > > example because the bmap completions generate the rmap intents. A > > simplified (i.e. single extent remap) runtime transactional flow for > > that looks something like the following: > > > > t1: unmap intent, map intent > > t2: unmap done, map intent, runmap intent > > t3: map done, runmap intent, rmap intent > > t4: runmap done, rmap intent > > Work items that aren't touched in a particular transaction do not get > relogged, so this sequence is: > > t1: unmap(0) intent, map(1) intent > t2: unmap(0) done, runmap(2) intent > t3: map(1) done, rmap(3) intent > t4: runmap(2) done > t5: rmap(3) done > > (Maybe I need to take another look at the autorelogging patchset.) > Ah, I see. Thanks. Brian > > ... > > > > So at runtime these operations complete in the obvious linear order. > > Intents drop off the front as done items are logged and new intents are > > tacked on the end. If we crash between t2 and t3, however, we see t2 in > > the log and would end up completing the map intent and the subsequently > > generated rmap intent (that pops up in t3 at runtime) before we process > > the runmap intent that's still in the log (and was originally in the > > dfops queue for t2). I _think_ that's essentially the scenario described > > in 509955823cc9c, > > Yes... > > > but it's not clear to me if it's the same runtime example.. > > ...and you arrived at it with a different scenario. :) > > --D > > > Brian > > > > > --D > > > > > > > Brian > > > > > > > > > You'd have to drop the ILOCK (but retain the inode reference) between > > > > > the two bmap items. I think this is currently impossible because there > > > > > aren't any transactions that queue multiple bmap intents per > > > > > transaction, nobody else holds the ilock, and the existing rmap > > > > > implementation of the swapext ioctl doesn't allow swapping a file with > > > > > itself. However... > > > > > > > > > > (Part 2) > > > > > > > > > > The second thing I thought about is, what if there's a lot of work after > > > > > transaction 0? Say the recovery stream looks like: > > > > > > > > > > ---------------- > > > > > Transaction 0 > > > > > BUI1 unmap file1, offset1, startblock1, length == 200000 > > > > > ---------------- > > > > > <10000 more intents applied to other things> > > > > > ---------------- > > > > > <end> > > > > > > > > > > This is a problem, because we must drop the ILOCK after completing the > > > > > BUI1 work so that we can process the 10000 more intents before we can > > > > > get to BUI2. We cannot let the tail of the log get pinned on the locked > > > > > inode, so we must release the ILOCK after we're done with BUI1 and > > > > > re-acquire it when we're ready to deal with BUI2. > > > > > > > > > > This means I need /two/ flags in struct xfs_bmap_intent -- one to say > > > > > that we need to irele the inode at the end of processing, and another to > > > > > say that the bmap item needs to grab the ILOCK. If there's going to be > > > > > a BUI2 as a result of recovering BUI1, the first flag must be propagated > > > > > to BUI2, but the second flag must /not/ be propagated. > > > > > > > > > > A potential counterargument is that we crammed all 10000 intents into > > > > > the log without pinning the log, so maybe we can hold the ILOCK. > > > > > However.... > > > > > > > > > > (Part 3) > > > > > > > > > > Next, I considered how the swapext log item in the atomic file update > > > > > patchset would interact with log recovery. 
A couple of notes about the > > > > > swapext log items: > > > > > > > > > > 1. They are careful not to create the kinds of bunmap operations that > > > > > would cause the creation of /more/ BUIs. > > > > > > > > > > 2. Every time we log one extent's worth of unmap/remap operations (using > > > > > deferred bmap log items, naturally) we log an done item for the original > > > > > swapext intent item and immediately log another swapext intent for the > > > > > work remaining. > > > > > > > > > > I came up with the following sequence of transactions to recover: > > > > > > > > > > ---------------- > > > > > Transaction 0 > > > > > SXI0 file1, offset1, file2, offset2, length == 10 > > > > > ---------------- > > > > > Transaction 1 > > > > > BUI1 unmap file1, offset1, startblock1, length == 2 > > > > > BUI2 unmap file2, offset2, startblock2, length == 2 > > > > > BUI3 map file1, offset1, startblock2, length == 2 > > > > > BUI4 map file2, offset2, startblock1, length == 2 > > > > > SXD done (SXI0) > > > > > SXI5 file1, offset1 + 2, file2, offset2 + 2, length == 8 > > > > > --------------- > > > > > <end of log> > > > > > > > > > > Recovery in this case will end up with BUI[1-4] and SXI5 that still need > > > > > processing. Each of the 5 intent items gets its own recovery > > > > > transaction and dfops freezer chain. BUI1, BUI3, and SXI5 will each > > > > > require ilock'd access to file1, which means that each of them will have > > > > > to have the ability to drop the ILOCK but retain the reference. The > > > > > same argument applies to file2 and BUI2, BUI4, and SXI5. Note also that > > > > > in our brave new world, file1 and file2 can be the same. > > > > > > > > > > For the swapext log item type this inode management is particularly > > > > > acute because we're certain to have new SXIs created as a result of > > > > > recovering an SXI. > > > > > > > > > > (End) > > > > > > > > > > In conclusion, we need a mechanism to drop both the inode lock and the > > > > > inode reference. Each log item type that works with inodes will have to > > > > > set aside 2 flags somewhere in the incore extent structure, and provide > > > > > more or less the same code to manage that state. Right now that's just > > > > > the bmap items, but then the swapext items and the deferred xattr items > > > > > will have that same requirement when they get added. > > > > > > > > > > If we do that, the inode reference and ilock management diverges even > > > > > further from regular operations. Regular callers are expected to iget > > > > > and ilock the inode and maintain that all the way to the end of > > > > > xfs_trans_commit. Log recovery, on the other hand, will add a bunch of > > > > > state handling to some of the log items so that we can drop the ILOCK > > > > > after the first item, and then drop the inode reference after finishing > > > > > the last possible user of that inode in the revived dfops chain. > > > > > > > > > > On the other hand, keeping the inode management in the defer freezer > > > > > code means that I keep all that complexity and state management out of > > > > > the ->finish_item code. When we go to finish the new incore intents > > > > > that get created during recovery, the inode reference and ILOCK rules > > > > > are the same as anywhere else -- the caller is required to grab the > > > > > inode, the transaction, and the ilock (in that order). 
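That ordering rule, inode first, then transaction, then ILOCK, is the same one patch 2/3 of this series enforces in xfs_bui_item_recover. In sketch form (reservation values are illustrative and error handling is trimmed to the shape that matters):

	error = xfs_iget(mp, NULL, bmap->me_owner, 0, 0, &ip);	/* 1: reference */
	if (error)
		return error;

	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate,
			XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK), 0, 0, &tp);
	if (error)
		goto out_rele;					/* 2: transaction */

	xfs_ilock(ip, XFS_ILOCK_EXCL);				/* 3: ILOCK last */
	xfs_trans_ijoin(tp, ip, 0);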
> > > > > > > > > > To the extent that recovering log intent items is /not/ the same as > > > > > running a normal transaction, all that weirdness is concentrated in > > > > > xfs_log_recover.c and a single function in xfs_defer.c. The desired > > > > > behavior is the same across all the log intent item types, so I > > > > > decided in favor of keeping the centralised behavior and (trying to) > > > > > contain the wacky. We don't need all that right away, but we will. > > > > > > > > > > Note that I did make it so that we retain the reference always, so > > > > > there's no longer a need for irele/igrab cycles, which makes capturing > > > > > the dfops chain less error prone. > > > > > > > > > > --D > > > > > > > > > > > Brian > > > > > > > > > > > > [1] https://lore.kernel.org/linux-xfs/20200429235818.GX6742@magnolia/ > > > > > > > > > > > > > fs/xfs/libxfs/xfs_defer.c | 50 +++++++++++++++++++++++++++++++++++++++++++++ > > > > > > > fs/xfs/libxfs/xfs_defer.h | 10 +++++++++ > > > > > > > fs/xfs/xfs_bmap_item.c | 7 ++++-- > > > > > > > fs/xfs/xfs_icache.c | 19 +++++++++++++++++ > > > > > > > 4 files changed, 83 insertions(+), 3 deletions(-) > > > > > > > > > > > > > > > > > > > > > diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c > > > > > > > index ea4d28851bbd..72933fdafcb2 100644 > > > > > > > --- a/fs/xfs/libxfs/xfs_defer.c > > > > > > > +++ b/fs/xfs/libxfs/xfs_defer.c > > > > > > > @@ -16,6 +16,7 @@ > > > > > > > #include "xfs_inode.h" > > > > > > > #include "xfs_inode_item.h" > > > > > > > #include "xfs_trace.h" > > > > > > > +#include "xfs_icache.h" > > > > > > > > > > > > > > /* > > > > > > > * Deferred Operations in XFS > > > > > > > @@ -583,8 +584,19 @@ xfs_defer_thaw( > > > > > > > struct xfs_defer_freezer *dff, > > > > > > > struct xfs_trans *tp) > > > > > > > { > > > > > > > + int i; > > > > > > > + > > > > > > > ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES); > > > > > > > > > > > > > > + /* Re-acquire the inode locks. */ > > > > > > > + for (i = 0; i < XFS_DEFER_FREEZER_INODES; i++) { > > > > > > > + if (!dff->dff_inodes[i]) > > > > > > > + break; > > > > > > > + > > > > > > > + dff->dff_ilocks[i] = XFS_ILOCK_EXCL; > > > > > > > + xfs_ilock(dff->dff_inodes[i], dff->dff_ilocks[i]); > > > > > > > + } > > > > > > > + > > > > > > > /* Add the dfops items to the transaction. */ > > > > > > > list_splice_init(&dff->dff_dfops, &tp->t_dfops); > > > > > > > tp->t_flags |= dff->dff_tpflags; > > > > > > > @@ -597,5 +609,43 @@ xfs_defer_freeezer_finish( > > > > > > > struct xfs_defer_freezer *dff) > > > > > > > { > > > > > > > xfs_defer_cancel_list(mp, &dff->dff_dfops); > > > > > > > + xfs_defer_freezer_irele(dff); > > > > > > > kmem_free(dff); > > > > > > > } > > > > > > > + > > > > > > > +/* > > > > > > > + * Attach an inode to this deferred ops freezer. Callers must hold ILOCK_EXCL, > > > > > > > + * which will be dropped and reacquired when we're ready to thaw the frozen > > > > > > > + * deferred ops. 
> > > > > > > + */ > > > > > > > +int > > > > > > > +xfs_defer_freezer_ijoin( > > > > > > > + struct xfs_defer_freezer *dff, > > > > > > > + struct xfs_inode *ip) > > > > > > > +{ > > > > > > > + unsigned int i; > > > > > > > + > > > > > > > + ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL)); > > > > > > > + > > > > > > > + for (i = 0; i < XFS_DEFER_FREEZER_INODES; i++) { > > > > > > > + if (dff->dff_inodes[i] == ip) > > > > > > > + goto out; > > > > > > > + if (dff->dff_inodes[i] == NULL) > > > > > > > + break; > > > > > > > + } > > > > > > > + > > > > > > > + if (i == XFS_DEFER_FREEZER_INODES) { > > > > > > > + ASSERT(0); > > > > > > > + return -EFSCORRUPTED; > > > > > > > + } > > > > > > > + > > > > > > > + /* > > > > > > > + * Attach this inode to the freezer and drop its ILOCK because we > > > > > > > + * assume the caller will need to allocate a transaction. > > > > > > > + */ > > > > > > > + dff->dff_inodes[i] = ip; > > > > > > > + dff->dff_ilocks[i] = 0; > > > > > > > +out: > > > > > > > + xfs_iunlock(ip, XFS_ILOCK_EXCL); > > > > > > > + return 0; > > > > > > > +} > > > > > > > diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h > > > > > > > index 7ae05e10d750..0052a0313283 100644 > > > > > > > --- a/fs/xfs/libxfs/xfs_defer.h > > > > > > > +++ b/fs/xfs/libxfs/xfs_defer.h > > > > > > > @@ -76,6 +76,11 @@ struct xfs_defer_freezer { > > > > > > > /* Deferred ops state saved from the transaction. */ > > > > > > > struct list_head dff_dfops; > > > > > > > unsigned int dff_tpflags; > > > > > > > + > > > > > > > + /* Inodes to hold when we want to finish the deferred work items. */ > > > > > > > +#define XFS_DEFER_FREEZER_INODES 2 > > > > > > > + unsigned int dff_ilocks[XFS_DEFER_FREEZER_INODES]; > > > > > > > + struct xfs_inode *dff_inodes[XFS_DEFER_FREEZER_INODES]; > > > > > > > }; > > > > > > > > > > > > > > /* Functions to freeze a chain of deferred operations for later. */ > > > > > > > @@ -83,5 +88,10 @@ int xfs_defer_freeze(struct xfs_trans *tp, struct xfs_defer_freezer **dffp); > > > > > > > void xfs_defer_thaw(struct xfs_defer_freezer *dff, struct xfs_trans *tp); > > > > > > > void xfs_defer_freeezer_finish(struct xfs_mount *mp, > > > > > > > struct xfs_defer_freezer *dff); > > > > > > > +int xfs_defer_freezer_ijoin(struct xfs_defer_freezer *dff, > > > > > > > + struct xfs_inode *ip); > > > > > > > + > > > > > > > +/* These functions must be provided by the xfs implementation. 
*/ > > > > > > > +void xfs_defer_freezer_irele(struct xfs_defer_freezer *dff); > > > > > > > > > > > > > > #endif /* __XFS_DEFER_H__ */ > > > > > > > diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c > > > > > > > index c733bdeeeb9b..bbce191d8fcd 100644 > > > > > > > --- a/fs/xfs/xfs_bmap_item.c > > > > > > > +++ b/fs/xfs/xfs_bmap_item.c > > > > > > > @@ -530,12 +530,13 @@ xfs_bui_item_recover( > > > > > > > } > > > > > > > > > > > > > > error = xlog_recover_trans_commit(tp, dffp); > > > > > > > - xfs_iunlock(ip, XFS_ILOCK_EXCL); > > > > > > > - xfs_irele(ip); > > > > > > > - return error; > > > > > > > + if (error) > > > > > > > + goto err_rele; > > > > > > > + return xfs_defer_freezer_ijoin(*dffp, ip); > > > > > > > > > > > > > > err_inode: > > > > > > > xfs_trans_cancel(tp); > > > > > > > +err_rele: > > > > > > > if (ip) { > > > > > > > xfs_iunlock(ip, XFS_ILOCK_EXCL); > > > > > > > xfs_irele(ip); > > > > > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > > > > > > > index 17a0b86fe701..b96ddf5ff334 100644 > > > > > > > --- a/fs/xfs/xfs_icache.c > > > > > > > +++ b/fs/xfs/xfs_icache.c > > > > > > > @@ -12,6 +12,7 @@ > > > > > > > #include "xfs_sb.h" > > > > > > > #include "xfs_mount.h" > > > > > > > #include "xfs_inode.h" > > > > > > > +#include "xfs_defer.h" > > > > > > > #include "xfs_trans.h" > > > > > > > #include "xfs_trans_priv.h" > > > > > > > #include "xfs_inode_item.h" > > > > > > > @@ -1847,3 +1848,21 @@ xfs_start_block_reaping( > > > > > > > xfs_queue_eofblocks(mp); > > > > > > > xfs_queue_cowblocks(mp); > > > > > > > } > > > > > > > + > > > > > > > +/* Release all the inode resources attached to this freezer. */ > > > > > > > +void > > > > > > > +xfs_defer_freezer_irele( > > > > > > > + struct xfs_defer_freezer *dff) > > > > > > > +{ > > > > > > > + unsigned int i; > > > > > > > + > > > > > > > + for (i = 0; i < XFS_DEFER_FREEZER_INODES; i++) { > > > > > > > + if (!dff->dff_inodes[i]) > > > > > > > + break; > > > > > > > + > > > > > > > + if (dff->dff_ilocks[i]) > > > > > > > + xfs_iunlock(dff->dff_inodes[i], dff->dff_ilocks[i]); > > > > > > > + xfs_irele(dff->dff_inodes[i]); > > > > > > > + dff->dff_inodes[i] = NULL; > > > > > > > + } > > > > > > > +} > > > > > > > > > > > > > > > > > > > > > > > > > > > > ^ permalink raw reply [flat|nested] 29+ messages in thread
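As a footnote to the log-format tangent in the exchange above: the chain linkage Brian describes might look something like this if a format bump ever made room for it. This is purely hypothetical; no such structure exists in the XFS log format.

/*
 * Hypothetical on-disk addition: stamp every logged intent with a
 * chain id so recovery could rebuild the original dfops ordering
 * directly instead of inferring it from log order.
 */
struct xfs_intent_chain {
	uint64_t	ic_chain_id;	/* shared by intents logged in one tp */
	uint64_t	ic_next_id;	/* next intent in the chain, or 0 */
};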
end of thread, other threads: [~2020-10-06 6:24 UTC | newest]

Thread overview: 29+ messages

2020-09-17 3:29 [PATCH 0/3] xfs: fix inode use-after-free during log recovery Darrick J. Wong
2020-09-17 3:29 ` [PATCH 1/3] xfs: clean up bmap intent item recovery checking Darrick J. Wong
2020-09-17 5:09 ` Dave Chinner
2020-09-23 7:16 ` Christoph Hellwig
2020-09-23 15:22 ` Darrick J. Wong
2020-09-17 3:29 ` [PATCH 2/3] xfs: clean up xfs_bui_item_recover iget/trans_alloc/ilock ordering Darrick J. Wong
2020-09-17 5:13 ` Dave Chinner
2020-09-17 6:47 ` Darrick J. Wong
2020-09-17 7:08 ` [PATCH v2 " Darrick J. Wong
2020-09-17 8:19 ` Dave Chinner
2020-09-23 7:17 ` Christoph Hellwig
2020-09-17 3:29 ` [PATCH 3/3] xfs: fix an incore inode UAF in xfs_bui_recover Darrick J. Wong
2020-09-23 7:20 ` Christoph Hellwig
2020-09-23 15:55 ` Darrick J. Wong
2020-09-24 6:06 ` [PATCH v2 " Darrick J. Wong

-- strict thread matches above, loose matches on Subject: below --

2020-10-05 18:20 [PATCH v4 0/3] xfs: fix inode use-after-free during log recovery Darrick J. Wong
2020-10-05 18:20 ` [PATCH 3/3] xfs: fix an incore inode UAF in xfs_bui_recover Darrick J. Wong
2020-10-06 6:24 ` Christoph Hellwig
2020-09-29 17:43 [PATCH v3 0/3] xfs: fix inode use-after-free during log recovery Darrick J. Wong
2020-09-29 17:44 ` [PATCH 3/3] xfs: fix an incore inode UAF in xfs_bui_recover Darrick J. Wong
2020-09-27 23:41 [PATCH v2 0/3] xfs: fix inode use-after-free during log recovery Darrick J. Wong
2020-09-27 23:41 ` [PATCH 3/3] xfs: fix an incore inode UAF in xfs_bui_recover Darrick J. Wong
2020-09-28 6:10 ` Dave Chinner
2020-09-28 17:02 ` Darrick J. Wong
2020-05-05 1:13 [PATCH v2 0/3] xfs: fix inode use-after-free during log recovery Darrick J. Wong
2020-05-05 1:13 ` [PATCH 3/3] xfs: fix an incore inode UAF in xfs_bui_recover Darrick J. Wong
2020-05-05 14:11 ` Brian Foster
2020-05-06 0:34 ` Darrick J. Wong
2020-05-06 13:56 ` Brian Foster
2020-05-06 17:01 ` Darrick J. Wong
2020-05-07 9:53 ` Brian Foster
2020-05-07 15:09 ` Darrick J. Wong
2020-05-07 16:58 ` Brian Foster