public inbox for linux-xfs@vger.kernel.org
* [PATCH] XFS: Use KM_NOFS for incore inode extent tree allocation
@ 2008-07-21  4:52 Dave Chinner
  2008-07-21  5:58 ` Dave Chinner
  2008-07-21  7:52 ` Christoph Hellwig
  0 siblings, 2 replies; 4+ messages in thread
From: Dave Chinner @ 2008-07-21  4:52 UTC (permalink / raw)
  To: xfs; +Cc: Dave Chinner

If we allow incore extent tree allocations to recurse into the
filesystem under memory pressure, new delayed allocations through
xfs_iomap_write_delay() can deadlock on themselves if memory reclaim
tries to write back dirty pages from that inode.

It will deadlock in xfs_iomap_write_allocate() trying to take the
ilock we already hold. This can also show up as complex ABBA
deadlocks when multiple threads are triggering memory reclaim while
trying to allocate extents.

The main cause of this is the fact that delayed allocation is
not done in a transaction, so KM_NOFS is not automatically
added to the allocations to prevent this recursion.

Mark all allocations done for the incore inode extent tree as
KM_NOFS to ensure they never recurse back into the filesystem.

Signed-off-by: Dave Chinner <david@fromorbit.com>
---
 fs/xfs/xfs_inode.c |   16 +++++++++-------
 1 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index bedc661..20b6f87 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -3707,7 +3707,8 @@ xfs_iext_add_indirect_multi(
 	 * (all extents past */
 	if (nex2) {
 		byte_diff = nex2 * sizeof(xfs_bmbt_rec_t);
-		nex2_ep = (xfs_bmbt_rec_t *) kmem_alloc(byte_diff, KM_SLEEP);
+		nex2_ep = (xfs_bmbt_rec_t *) kmem_alloc(byte_diff,
+							KM_SLEEP|KM_NOFS);
 		memmove(nex2_ep, &erp->er_extbuf[idx], byte_diff);
 		erp->er_extcount -= nex2;
 		xfs_iext_irec_update_extoffs(ifp, erp_idx + 1, -nex2);
@@ -4008,7 +4009,7 @@ xfs_iext_realloc_direct(
 				kmem_realloc(ifp->if_u1.if_extents,
 						rnew_size,
 						ifp->if_real_bytes,
-						KM_SLEEP);
+						KM_SLEEP|KM_NOFS);
 		}
 		if (rnew_size > ifp->if_real_bytes) {
 			memset(&ifp->if_u1.if_extents[ifp->if_bytes /
@@ -4067,7 +4068,7 @@ xfs_iext_inline_to_direct(
 	xfs_ifork_t	*ifp,		/* inode fork pointer */
 	int		new_size)	/* number of extents in file */
 {
-	ifp->if_u1.if_extents = kmem_alloc(new_size, KM_SLEEP);
+	ifp->if_u1.if_extents = kmem_alloc(new_size, KM_SLEEP|KM_NOFS);
 	memset(ifp->if_u1.if_extents, 0, new_size);
 	if (ifp->if_bytes) {
 		memcpy(ifp->if_u1.if_extents, ifp->if_u2.if_inline_ext,
@@ -4099,7 +4100,7 @@ xfs_iext_realloc_indirect(
 	} else {
 		ifp->if_u1.if_ext_irec = (xfs_ext_irec_t *)
 			kmem_realloc(ifp->if_u1.if_ext_irec,
-				new_size, size, KM_SLEEP);
+				new_size, size, KM_SLEEP|KM_NOFS);
 	}
 }
 
@@ -4342,10 +4343,11 @@ xfs_iext_irec_init(
 	ASSERT(nextents <= XFS_LINEAR_EXTS);
 
 	erp = (xfs_ext_irec_t *)
-		kmem_alloc(sizeof(xfs_ext_irec_t), KM_SLEEP);
+		kmem_alloc(sizeof(xfs_ext_irec_t), KM_SLEEP|KM_NOFS);
 
 	if (nextents == 0) {
-		ifp->if_u1.if_extents = kmem_alloc(XFS_IEXT_BUFSZ, KM_SLEEP);
+		ifp->if_u1.if_extents = kmem_alloc(XFS_IEXT_BUFSZ,
+							KM_SLEEP|KM_NOFS);
 	} else if (!ifp->if_real_bytes) {
 		xfs_iext_inline_to_direct(ifp, XFS_IEXT_BUFSZ);
 	} else if (ifp->if_real_bytes < XFS_IEXT_BUFSZ) {
@@ -4393,7 +4395,7 @@ xfs_iext_irec_new(
 
 	/* Initialize new extent record */
 	erp = ifp->if_u1.if_ext_irec;
-	erp[erp_idx].er_extbuf = kmem_alloc(XFS_IEXT_BUFSZ, KM_SLEEP);
+	erp[erp_idx].er_extbuf = kmem_alloc(XFS_IEXT_BUFSZ, KM_SLEEP|KM_NOFS);
 	ifp->if_real_bytes = nlists * XFS_IEXT_BUFSZ;
 	memset(erp[erp_idx].er_extbuf, 0, XFS_IEXT_BUFSZ);
 	erp[erp_idx].er_extcount = 0;
-- 
1.5.6


* Re: [PATCH] XFS: Use KM_NOFS for incore inode extent tree allocation
  2008-07-21  4:52 [PATCH] XFS: Use KM_NOFS for incore inode extent tree allocation Dave Chinner
@ 2008-07-21  5:58 ` Dave Chinner
  2008-07-21  7:52 ` Christoph Hellwig
  1 sibling, 0 replies; 4+ messages in thread
From: Dave Chinner @ 2008-07-21  5:58 UTC (permalink / raw)
  To: xfs

On Mon, Jul 21, 2008 at 02:52:39PM +1000, Dave Chinner wrote:
> If we allow incore extent tree allocations to recurse into the
> filesystem under memory pressure, new delayed allocations through
> xfs_iomap_write_delay() can deadlock on themselves if memory reclaim
> tries to write back dirty pages from that inode.
> 
> It will deadlock in xfs_iomap_write_allocate() trying to take the
> ilock we already hold. This can also show up as complex ABBA
> deadlocks when multiple threads are triggering memory reclaim while
> trying to allocate extents.
> 
> The main cause of this is the fact that delayed allocation is
> not done in a transaction, so KM_NOFS is not automatically
> added to the allocations to prevent this recursion.
> 
> Mark all allocations done for the incore inode extent tree as
> KM_NOFS to ensure they never recurse back into the filesystem.

BTW, if you are wondering what this fixes, it's this hang:

http://oss.sgi.com/archives/xfs/2008-07/msg00091.html

And the stack traces look like:

> Call Trace:
> [<8048190c>] schedule+0x810/0x97c
> [<80483240>] __down_read+0xc4/0xec
> [<8013d860>] down_read+0x10/0x1c
> [<802cad44>] xfs_ilock+0x8c/0xa4
> [<802cac88>] xfs_ilock_map_shared+0x38/0x4c
> [<802d27f8>] xfs_iomap+0xd8/0x4dc
> [<802fe90c>] xfs_bmap+0x30/0x3c
> [<802f3cfc>] xfs_map_blocks+0x50/0x84
> [<802f52a4>] xfs_page_state_convert+0x56c/0x840
> [<802f565c>] xfs_vm_writepage+0xe4/0x140
> [<80153cf4>] pageout+0x150/0x1e8
> [<80154144>] shrink_page_list+0x2b8/0x504
> [<8015455c>] shrink_inactive_list+0xc0/0x304
> [<80154da8>] shrink_zone+0x100/0x148
> [<80154e6c>] shrink_zones+0x7c/0xac
> [<80154f94>] try_to_free_pages+0xf8/0x200
> [<8014f24c>] __alloc_pages+0x1a4/0x300
> [<80168a18>] kmem_getpages+0x58/0x138
> [<80169b1c>] cache_grow+0xd4/0x1c4
> [<80169db0>] cache_alloc_refill+0x1a4/0x210
> [<8016a2a0>] __kmalloc+0x98/0xc8
> [<802f3644>] kmem_alloc+0x94/0x130
> [<802d10d0>] xfs_iext_irec_new+0xb0/0x11c
> [<802d0134>] xfs_iext_add+0x1fc/0x254
> [<802cfedc>] xfs_iext_insert+0x34/0x90
> [<802a70c4>] xfs_bmap_add_extent_hole_delay+0x5dc/0x6fc
> [<802a3f0c>] xfs_bmap_add_extent+0x204/0x4e4
> [<802ace5c>] xfs_bmapi+0xa98/0x13e4
> [<802d3dc8>] xfs_iomap_write_delay+0x36c/0x4b8
> [<802d2aa0>] xfs_iomap+0x380/0x4dc
> [<802fe90c>] xfs_bmap+0x30/0x3c
> [<802f58b8>] __xfs_get_blocks+0xb0/0x300
> [<802f5b30>] xfs_get_blocks+0x28/0x34
> [<801718e0>] __block_prepare_write+0x208/0x548
> [<8017267c>] block_prepare_write+0x34/0x64
> [<802f5d6c>] xfs_vm_prepare_write+0x24/0x30
> [<8014bdf0>] generic_file_buffered_write+0x280/0x650
> [<802fe518>] xfs_write+0x768/0xaac
> [<802f8c80>] xfs_file_aio_write+0x88/0x94
> [<8016d8d4>] do_sync_write+0xcc/0x124
> [<8016d9e4>] vfs_write+0xb8/0x1a0
> [<8016dd10>] sys_pwrite64+0x6c/0xa8
> [<8010c180>] stack_done+0x20/0x3c

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
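
Note: the recursion in the trace above (kmem_alloc entering reclaim,
reclaim calling xfs_vm_writepage, writepage wanting the ilock that the
writer already holds) can be sketched as a small userspace model. This
is only an illustration under stated assumptions, not kernel code: all
model_* names, the MODEL_* flag values and the boolean "lock" are
hypothetical stand-ins, and a real deadlock is represented by a return
value instead of an actual hang.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values for this model only; not the kernel's. */
#define MODEL_SLEEP 0x1u
#define MODEL_NOFS  0x4u

/* Minimal stand-in for an inode: one lock and one "dirty pages" bit. */
struct model_inode {
	bool ilock_held;	/* stands in for the inode's ilock */
	bool has_dirty_pages;	/* reclaim would like to write these back */
};

enum model_result { MODEL_OK, MODEL_DEADLOCK };

/* Stand-in for writeback from reclaim: it needs the ilock. */
static enum model_result model_writepage(struct model_inode *ip)
{
	if (ip->ilock_held)
		return MODEL_DEADLOCK;	/* would block on a lock we hold */
	ip->has_dirty_pages = false;
	return MODEL_OK;
}

/* Stand-in for an allocation that has to reclaim memory first. */
static enum model_result model_alloc_under_pressure(struct model_inode *ip,
						    unsigned int flags)
{
	/* Reclaim may call back into the filesystem only without NOFS. */
	if (!(flags & MODEL_NOFS) && ip->has_dirty_pages)
		return model_writepage(ip);
	return MODEL_OK;	/* filesystem writeback was skipped */
}

/* Stand-in for the delayed allocation path: lock, then allocate. */
static enum model_result model_write_delay(struct model_inode *ip,
					   unsigned int flags)
{
	enum model_result res;

	assert(!ip->ilock_held);
	ip->ilock_held = true;
	res = model_alloc_under_pressure(ip, flags);
	ip->ilock_held = false;
	return res;
}
```

In this model, calling model_write_delay() on an inode with dirty pages
reports MODEL_DEADLOCK with MODEL_SLEEP alone and MODEL_OK once
MODEL_NOFS is added, which is the behaviour change the patch is after.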


* Re: [PATCH] XFS: Use KM_NOFS for incore inode extent tree allocation
  2008-07-21  4:52 [PATCH] XFS: Use KM_NOFS for incore inode extent tree allocation Dave Chinner
  2008-07-21  5:58 ` Dave Chinner
@ 2008-07-21  7:52 ` Christoph Hellwig
  2008-07-21 10:59   ` Dave Chinner
  1 sibling, 1 reply; 4+ messages in thread
From: Christoph Hellwig @ 2008-07-21  7:52 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Mon, Jul 21, 2008 at 02:52:39PM +1000, Dave Chinner wrote:
> If we allow incore extent tree allocations to recurse into the
> filesystem under memory pressure, new delayed allocations through
> xfs_iomap_write_delay() can deadlock on themselves if memory reclaim
> tries to write back dirty pages from that inode.
> 
> It will deadlock in xfs_iomap_write_allocate() trying to take the
> ilock we already hold. This can also show up as complex ABBA
> deadlocks when multiple threads are triggering memory reclaim while
> trying to allocate extents.
> 
> The main cause of this is the fact that delayed allocation is
> not done in a transaction, so KM_NOFS is not automatically
> added to the allocations to prevent this recursion.
> 
> Mark all allocations done for the incore inode extent tree as
> KM_NOFS to ensure they never recurse back into the filesystem.

Looks good.  Note that KM_NOFS alone already means an allocation
that can't fail, so there is no need to OR it with KM_SLEEP.

And long term we should look into allowing these to fail;
allocations that aren't allowed to fail but can't recurse back into
the fs still have a chance to deadlock.
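
Note: the point about KM_NOFS not needing KM_SLEEP can be illustrated
with a small userspace model of the flag handling. This is a sketch
under assumptions, not the XFS implementation: it assumes that a request
sleeps unless a no-sleep flag is given, and retries until it succeeds
unless a no-sleep or may-fail flag is given. The MODEL_KM_* names and
values, struct model_gfp and the model_* helpers are hypothetical;
MODEL_KM_NOSLEEP and MODEL_KM_MAYFAIL are not mentioned in this thread
and are included only to make the model complete.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical values for this sketch; not taken from the XFS headers. */
#define MODEL_KM_SLEEP   0x1u
#define MODEL_KM_NOSLEEP 0x2u
#define MODEL_KM_NOFS    0x4u
#define MODEL_KM_MAYFAIL 0x8u

/* The three properties of an allocation request that matter here. */
struct model_gfp {
	bool may_sleep;		/* allocator may block waiting for memory */
	bool may_enter_fs;	/* reclaim may call into the filesystem */
	bool retry_forever;	/* caller never sees a failed allocation */
};

/* Translate request flags into allocation behaviour (assumed rules). */
static struct model_gfp model_flags_convert(unsigned int flags)
{
	struct model_gfp g;

	g.may_sleep = !(flags & MODEL_KM_NOSLEEP);
	g.may_enter_fs = g.may_sleep && !(flags & MODEL_KM_NOFS);
	g.retry_forever = !(flags & (MODEL_KM_NOSLEEP | MODEL_KM_MAYFAIL));
	return g;
}

/* True when two requests behave identically in this model. */
static bool model_same(struct model_gfp a, struct model_gfp b)
{
	return a.may_sleep == b.may_sleep &&
	       a.may_enter_fs == b.may_enter_fs &&
	       a.retry_forever == b.retry_forever;
}
```

Under these assumed rules, model_flags_convert(MODEL_KM_NOFS) and
model_flags_convert(MODEL_KM_SLEEP | MODEL_KM_NOFS) compare equal with
model_same(): both sleep, both stay out of the filesystem, and both
retry until the allocation succeeds. The sleep flag is implied, so
OR-ing it in changes nothing.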


* Re: [PATCH] XFS: Use KM_NOFS for incore inode extent tree allocation
  2008-07-21  7:52 ` Christoph Hellwig
@ 2008-07-21 10:59   ` Dave Chinner
  0 siblings, 0 replies; 4+ messages in thread
From: Dave Chinner @ 2008-07-21 10:59 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Mon, Jul 21, 2008 at 03:52:35AM -0400, Christoph Hellwig wrote:
> On Mon, Jul 21, 2008 at 02:52:39PM +1000, Dave Chinner wrote:
> > If we allow incore extent tree allocations to recurse into the
> > filesystem under memory pressure, new delayed allocations through
> > xfs_iomap_write_delay() can deadlock on themselves if memory reclaim
> > tries to write back dirty pages from that inode.
> > 
> > It will deadlock in xfs_iomap_write_allocate() trying to take the
> > ilock we already hold. This can also show up as complex ABBA
> > deadlocks when multiple threads are triggering memory reclaim while
> > trying to allocate extents.
> > 
> > The main cause of this is the fact that delayed allocation is
> > not done in a transaction, so KM_NOFS is not automatically
> > added to the allocations to prevent this recursion.
> > 
> > Mark all allocations done for the incore inode extent tree as
> > KM_NOFS to ensure they never recurse back into the filesystem.
> 
> Looks good.  Note that KM_NOFS alone already means an allocation
> that can't fail, so there is no need to OR it with KM_SLEEP.

Right. I'll update the patch and resend it.

> And long term we should look into allowing these to fail;
> allocations that aren't allowed to fail but can't recurse back into
> the fs still have a chance to deadlock.

We need dirty transaction rollback capabilities before we can do that.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

