* XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
@ 2013-08-21 15:24 Josef 'Jeff' Sipek
2013-08-22 2:25 ` Dave Chinner
0 siblings, 1 reply; 11+ messages in thread
From: Josef 'Jeff' Sipek @ 2013-08-21 15:24 UTC (permalink / raw)
To: xfs
We've started experimenting with larger directory block sizes to avoid
directory fragmentation. Everything seems to work fine, except that the log
is spammed with these lovely debug messages:
XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
From looking at the code, it looks like each of those messages (there
are thousands) equates to 100 trips through the loop. My guess is that the
larger blocks require multi-page allocations which are harder to satisfy.
This is with a 3.10 kernel.
The hardware is something like (I can find out the exact config if you want):
32 cores
128 GB RAM
LSI 9271-8i RAID (one big RAID-60 with 36 disks, partitioned)
As I hinted at earlier, we end up with pretty big directories. We can
semi-reliably trigger this when we run rsync on the data between two
(identical) hosts over 10GbitE.
# xfs_info /dev/sda9
meta-data=/dev/sda9              isize=256    agcount=6, agsize=268435455 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=1454213211, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=65536  ascii-ci=0
log      =internal               bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
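For readers reproducing this setup: the 65536 in the naming line above is the directory block size, which is fixed when the filesystem is created and cannot be changed afterwards. A hypothetical invocation (the device name is a placeholder, illustrative only):

```
# illustrative only -- this creates a new filesystem and destroys
# any data on the target device
mkfs.xfs -n size=65536 /dev/sdX
```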
/proc/slabinfo: https://www.copy.com/s/1x1yZFjYO2EI/slab.txt
sysrq m output: https://www.copy.com/s/mYfMYfJJl2EB/sysrq-m.txt
While I realize that the message isn't bad, it does mean that the system is
having a hard time allocating memory. This could potentially lead to bad
performance, or even an actual deadlock. Do you have any suggestions?
Thanks,
Jeff.
--
The reasonable man adapts himself to the world; the unreasonable one
persists in trying to adapt the world to himself. Therefore all progress
depends on the unreasonable man.
- George Bernard Shaw
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
  2013-08-21 15:24 XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250) Josef 'Jeff' Sipek
@ 2013-08-22  2:25 ` Dave Chinner
  2013-08-22 15:07   ` Josef 'Jeff' Sipek
  0 siblings, 1 reply; 11+ messages in thread
From: Dave Chinner @ 2013-08-22 2:25 UTC (permalink / raw)
  To: Josef 'Jeff' Sipek; +Cc: xfs

On Wed, Aug 21, 2013 at 11:24:58AM -0400, Josef 'Jeff' Sipek wrote:
> We've started experimenting with larger directory block sizes to avoid
> directory fragmentation. Everything seems to work fine, except that the log
> is spammed with these lovely debug messages:
>
> XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
>
> From looking at the code, it looks like each of those messages (there
> are thousands) equates to 100 trips through the loop. My guess is that the
> larger blocks require multi-page allocations which are harder to satisfy.
> This is with a 3.10 kernel.

No, larger blocks simply require more single pages. The buffer cache
does not require multi-page allocation at all. So, mode = 0x250,
which means ___GFP_NOWARN | ___GFP_IO | ___GFP_WAIT, which is also
known as a GFP_NOFS allocation context.

So, it's entirely possible that your memory is full of cached
filesystem data and metadata, and the allocation that needs more
can't reclaim them.

> The hardware is something like (I can find out the exact config if you want):
>
> 32 cores
> 128 GB RAM
> LSI 9271-8i RAID (one big RAID-60 with 36 disks, partitioned)
>
> As I hinted at earlier, we end up with pretty big directories. We can
> semi-reliably trigger this when we run rsync on the data between two
> (identical) hosts over 10GbitE.
>
> # xfs_info /dev/sda9
> meta-data=/dev/sda9              isize=256    agcount=6, agsize=268435455 blks
>          =                       sectsz=512   attr=2
> data     =                       bsize=4096   blocks=1454213211, imaxpct=5
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=65536  ascii-ci=0
> log      =internal               bsize=4096   blocks=521728, version=2
>          =                       sectsz=512   sunit=0 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
>
> /proc/slabinfo: https://www.copy.com/s/1x1yZFjYO2EI/slab.txt

Hmmm. You're using filestreams. That's unusual.

The only major slab cache is the buffer_head slab, with ~12 million
active bufferheads. So, that means you've got at least 47-48GB of
data in the page cache...

And there's only ~35000 xfs_buf items in the slab, so the metadata
cache isn't very big, and reclaim from that isn't a problem, nor the
inode caches, as there are only 130,000 cached inodes.

> sysrq m output: https://www.copy.com/s/mYfMYfJJl2EB/sysrq-m.txt

27764401 total pagecache pages

which indicates that you've got close to 110GB of pages in the page
cache. Hmmm, and 24-25GB of dirty pages in memory.

You know, I'd be suspecting a memory reclaim problem here to do with
having large amounts of dirty memory in the page cache. I don't
think the underlying cause is going to be the filesystem code, as
the warning should never be emitted if memory reclaim is making
progress. Perhaps you could try lowering all the dirty memory
thresholds to see if that allows memory reclaim to make more
progress because there are fewer dirty pages in memory...

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
  2013-08-22  2:25 ` Dave Chinner
@ 2013-08-22 15:07   ` Josef 'Jeff' Sipek
  0 siblings, 0 replies; 11+ messages in thread
From: Josef 'Jeff' Sipek @ 2013-08-22 15:07 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Thu, Aug 22, 2013 at 12:25:44PM +1000, Dave Chinner wrote:
> On Wed, Aug 21, 2013 at 11:24:58AM -0400, Josef 'Jeff' Sipek wrote:
> > We've started experimenting with larger directory block sizes to avoid
> > directory fragmentation. Everything seems to work fine, except that the log
> > is spammed with these lovely debug messages:
> >
> > XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
> >
> > From looking at the code, it looks like each of those messages (there
> > are thousands) equates to 100 trips through the loop. My guess is that the
> > larger blocks require multi-page allocations which are harder to satisfy.
> > This is with a 3.10 kernel.
>
> No, larger blocks simply require more single pages. The buffer cache
> does not require multi-page allocation at all. So, mode = 0x250,
> which means ___GFP_NOWARN | ___GFP_IO | ___GFP_WAIT, which is also
> known as a GFP_NOFS allocation context.

Doh! Not sure why I didn't remember the fact that directories are no
different from regular files...

...

> > /proc/slabinfo: https://www.copy.com/s/1x1yZFjYO2EI/slab.txt
>
> Hmmm. You're using filestreams. That's unusual.

Right. I keep forgetting about that.

> > sysrq m output: https://www.copy.com/s/mYfMYfJJl2EB/sysrq-m.txt
>
> 27764401 total pagecache pages
>
> which indicates that you've got close to 110GB of pages in the page
> cache. Hmmm, and 24-25GB of dirty pages in memory.
>
> You know, I'd be suspecting a memory reclaim problem here to do with
> having large amounts of dirty memory in the page cache. I don't
> think the underlying cause is going to be the filesystem code, as
> the warning should never be emitted if memory reclaim is making
> progress.
> Perhaps you could try lowering all the dirty memory
> thresholds to see if that allows memory reclaim to make more
> progress because there are fewer dirty pages in memory...

Yep. This makes perfect sense. Amusingly enough, we don't read much of
the data, so really the pagecache is supposed to buffer the writes
because I/O is slow. We'll play with the dirty memory thresholds and
see if that helps.

Thanks!

Jeff.

--
All parts should go together without forcing. You must remember that the
parts you are reassembling were disassembled by you. Therefore, if you
can't get them together again, there must be a reason. By all means, do
not use a hammer.
		— IBM Manual, 1925

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 11+ messages in thread
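The dirty-memory thresholds referred to here are the vm.dirty_* sysctls. A sketch of lowered settings, for readers wanting to try the same experiment (the specific values are illustrative assumptions, not numbers from this thread):

```
# /etc/sysctl.d/99-dirty.conf -- illustrative values only
vm.dirty_background_ratio = 5     # start background writeback earlier
vm.dirty_ratio = 10               # throttle dirtying processes sooner
```

Applied with `sysctl -p /etc/sysctl.d/99-dirty.conf` (or set at runtime via `sysctl -w`); lower ratios bound the amount of dirty page cache, which is what Dave is suggesting reclaim was choking on.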
* XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
@ 2013-11-28  9:13 Emmanuel Lacour
  2013-11-28 10:05 ` Dave Chinner
  0 siblings, 1 reply; 11+ messages in thread
From: Emmanuel Lacour @ 2013-11-28 9:13 UTC (permalink / raw)
  To: xfs

Dear XFS users,

I run a Ceph cluster using XFS on Debian wheezy servers and Linux 3.10
(Debian backports). I see the following line in our logs:

XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)

Does this reveal a problem in my setup, or may I ignore it? If it's a
problem, can someone give me a hint on solving it?

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
  2013-11-28  9:13 Emmanuel Lacour
@ 2013-11-28 10:05 ` Dave Chinner
  2013-12-03  9:53   ` Emmanuel Lacour
  0 siblings, 1 reply; 11+ messages in thread
From: Dave Chinner @ 2013-11-28 10:05 UTC (permalink / raw)
  To: Emmanuel Lacour; +Cc: xfs

On Thu, Nov 28, 2013 at 10:13:22AM +0100, Emmanuel Lacour wrote:
>
> Dear XFS users,
>
> I run a Ceph cluster using XFS on Debian wheezy servers and Linux 3.10
> (debian backports). I see the following line in our logs:
>
> XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
>
> does this reveal a problem in my setup or may I ignore it? If it's a
> problem, can someone give me any hint on solving this?

It might be, but you need to provide more information for us to be
able to make any intelligent comment on the message. Start here:

http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
  2013-11-28 10:05 ` Dave Chinner
@ 2013-12-03  9:53   ` Emmanuel Lacour
  2013-12-03 12:50     ` Dave Chinner
  0 siblings, 1 reply; 11+ messages in thread
From: Emmanuel Lacour @ 2013-12-03 9:53 UTC (permalink / raw)
  To: xfs

On Thu, Nov 28, 2013 at 09:05:21PM +1100, Dave Chinner wrote:
> On Thu, Nov 28, 2013 at 10:13:22AM +0100, Emmanuel Lacour wrote:
> >
> > Dear XFS users,
> >
> > I run a Ceph cluster using XFS on Debian wheezy servers and Linux 3.10
> > (debian backports). I see the following line in our logs:
> >
> > XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
> >
> > does this reveal a problem in my setup or may I ignore it? If it's a
> > problem, can someone give me any hint on solving this?
>
> It might be, but you need to provide more information for us to be
> able to make any intelligent comment on the message. Start here:
>
> http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F

The problem continued and crashed my Ceph cluster again, so here is all
the information the FAQ asks for:

http://people.easter-eggs.org/~manu/xfs.log

This may be related to a friend's problem here:

http://tracker.ceph.com/issues/6386

Thanks for any help on solving this!

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
  2013-12-03  9:53 ` Emmanuel Lacour
@ 2013-12-03 12:50   ` Dave Chinner
  2013-12-03 16:28     ` Yann Dupont
                       ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Dave Chinner @ 2013-12-03 12:50 UTC (permalink / raw)
  To: Emmanuel Lacour; +Cc: xfs

On Tue, Dec 03, 2013 at 10:53:58AM +0100, Emmanuel Lacour wrote:
> On Thu, Nov 28, 2013 at 09:05:21PM +1100, Dave Chinner wrote:
> > On Thu, Nov 28, 2013 at 10:13:22AM +0100, Emmanuel Lacour wrote:
> > >
> > > Dear XFS users,
> > >
> > > I run a Ceph cluster using XFS on Debian wheezy servers and Linux 3.10
> > > (debian backports). I see the following line in our logs:
> > >
> > > XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
> > >
> > > does this reveal a problem in my setup or may I ignore it? If it's a
> > > problem, can someone give me any hint on solving this?
> >
> > It might be, but you need to provide more information for us to be
> > able to make any intelligent comment on the message. Start here:
> >
> > http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
>
> The problem continued and crashed my ceph cluster again, so here is all
> the information the FAQ asks for:
>
> http://people.easter-eggs.org/~manu/xfs.log

OK, 32GB RAM, no obvious shortage, no dirty or writeback data. 2TB
SATA drives, 32 AGs, the only unusual setting is the 64k directory
block size.

Yup, there's your problem:

[4583991.478469] ceph-osd        D ffff88047fc93f40     0 22951      1 0x00000004
[4583991.478471]  ffff88046d241140 0000000000000082 ffffffff81047e75 ffff88046f949800
[4583991.478475]  0000000000013f40 ffff88039eb0bfd8 ffff88039eb0bfd8 ffff88046d241140
[4583991.478479]  0000000000000000 00000001444d68bd ffff88046d241140 0000000000000005
[4583991.478483] Call Trace:
[4583991.478487]  [<ffffffff81047e75>] ? internal_add_timer+0xd/0x28
[4583991.478491]  [<ffffffff8138e34a>] ? schedule_timeout+0xeb/0x123
[4583991.478494]  [<ffffffff81047e63>] ? ftrace_raw_event_timer_class+0x9d/0x9d
[4583991.478498]  [<ffffffff8138edb6>] ? io_schedule_timeout+0x60/0x86
[4583991.478502]  [<ffffffff810d85ad>] ? congestion_wait+0x70/0xdb
[4583991.478505]  [<ffffffff8105858f>] ? abort_exclusive_wait+0x79/0x79
[4583991.478518]  [<ffffffffa056e3f9>] ? kmem_alloc+0x65/0x6f [xfs]
[4583991.478535]  [<ffffffffa0592c4a>] ? xfs_dir2_block_to_sf+0x5b/0x1fb [xfs]
[4583991.478550]  [<ffffffffa0592be0>] ? xfs_dir2_block_sfsize+0x15b/0x16a [xfs]
[4583991.478566]  [<ffffffffa058bf9a>] ? xfs_dir2_block_removename+0x1c7/0x208 [xfs]
[4583991.478581]  [<ffffffffa058ab4a>] ? xfs_dir_removename+0xda/0x114 [xfs]
[4583991.478594]  [<ffffffffa056a55c>] ? xfs_rename+0x428/0x554 [xfs]
[4583991.478606]  [<ffffffffa0567321>] ? xfs_vn_rename+0x5e/0x65 [xfs]
[4583991.478610]  [<ffffffff8111677b>] ? vfs_rename+0x224/0x35f
[4583991.478614]  [<ffffffff81113d0b>] ? lookup_dcache+0x22/0x95
[4583991.478618]  [<ffffffff81116a7e>] ? SYSC_renameat+0x1c8/0x257
[4583991.478622]  [<ffffffff810fb0fd>] ? __cache_free.isra.45+0x178/0x187
[4583991.478625]  [<ffffffff81117eb1>] ? SyS_mkdirat+0x2e/0xce
[4583991.478629]  [<ffffffff8100d56a>] ? do_notify_resume+0x53/0x68
[4583991.478633]  [<ffffffff81395429>] ? system_call_fastpath+0x16/0x1b

It'll be stuck on this:

	hdr = kmem_alloc(mp->m_dirblksize, KM_SLEEP);

which is trying to allocate a contiguous 64k buffer to copy the
directory contents into before freeing the block and then formatting
them into the inode. The failure will be caused by memory
fragmentation, and the only way around it is to avoid the contiguous
allocation of that size.

Which, I think, is pretty easy to do. Yup, barely smoke tested patch
below that demonstrates the fix. Beware - patch may eat babies and
ask for more. Use it at your own risk!

I'll post it for review once it's had some testing and I know it
doesn't corrupt directories all over the place.

> This may be related to a friend's problem here:
>
> http://tracker.ceph.com/issues/6386

Doesn't look related, unless the OOM killer is being triggered
somehow...

Hmmmm - there's also a good chance that the transaction commit code
has this same contiguous allocation problem, given that it has to
allocate enough space to log an entire directory buffer. Good guess -
there's another thread stuck on exactly that:

[4583991.476833] ceph-osd        D ffff88047fc33f40     0 11072      1 0x00000004
[4583991.476836]  ffff88038b32a040 0000000000000082 ffffffff81047e75 ffff88046f946040
[4583991.476840]  0000000000013f40 ffff88048ea11fd8 ffff88048ea11fd8 ffff88038b32a040
[4583991.476844]  0000000000000000 00000001444d68be ffff88038b32a040 0000000000000005
[4583991.476848] Call Trace:
[4583991.476852]  [<ffffffff81047e75>] ? internal_add_timer+0xd/0x28
[4583991.476855]  [<ffffffff8138e34a>] ? schedule_timeout+0xeb/0x123
[4583991.476859]  [<ffffffff81047e63>] ? ftrace_raw_event_timer_class+0x9d/0x9d
[4583991.476862]  [<ffffffff8138edb6>] ? io_schedule_timeout+0x60/0x86
[4583991.476867]  [<ffffffff810d85ad>] ? congestion_wait+0x70/0xdb
[4583991.476870]  [<ffffffff8105858f>] ? abort_exclusive_wait+0x79/0x79
[4583991.476883]  [<ffffffffa056e3f9>] ? kmem_alloc+0x65/0x6f [xfs]
[4583991.476899]  [<ffffffffa05a77a6>] ? xfs_log_commit_cil+0xe8/0x3d1 [xfs]
[4583991.476904]  [<ffffffff810748ab>] ? current_kernel_time+0x9/0x30
[4583991.476909]  [<ffffffff81041942>] ? current_fs_time+0x27/0x2d
[4583991.476925]  [<ffffffffa05a3b7b>] ? xfs_trans_commit+0x62/0x1cf [xfs]
[4583991.476939]  [<ffffffffa056d3ad>] ? xfs_create+0x41e/0x54f [xfs]
[4583991.476943]  [<ffffffff81114574>] ? lookup_fast+0x3d/0x215
[4583991.476954]  [<ffffffffa056cf29>] ? xfs_lookup+0x88/0xee [xfs]
[4583991.476966]  [<ffffffffa0567428>] ? xfs_vn_mknod+0xb7/0x162 [xfs]
[4583991.476970]  [<ffffffff81115fda>] ? vfs_create+0x62/0x8b
[4583991.476974]  [<ffffffff81113d0b>] ? lookup_dcache+0x22/0x95
[4583991.476978]  [<ffffffff81117179>] ? do_last+0x595/0xa16
[4583991.476982]  [<ffffffff811176be>] ? path_openat+0xc4/0x335
[4583991.476985]  [<ffffffff81117bda>] ? do_filp_open+0x2a/0x6e
[4583991.476989]  [<ffffffff81120b62>] ? __alloc_fd+0xd0/0xe1
[4583991.476993]  [<ffffffff8110b684>] ? do_sys_open+0x5c/0xe0
[4583991.476996]  [<ffffffff81395429>] ? system_call_fastpath+0x16/0x1b

That one isn't so easy to fix, unfortunately.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

xfs: xfs_dir2_block_to_sf temp buffer allocation fails

From: Dave Chinner <dchinner@redhat.com>

If we are using a large directory block size, and memory becomes
fragmented, we can get memory allocation failures trying to
kmem_alloc(64k) for a temporary buffer. However, there is no need
for a directory-buffer-sized allocation, as the end result ends up
in the inode literal area. This is, at most, slightly less than 2k
of space, and hence we don't need an allocation larger than that
for a temporary buffer.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_dir2_sf.c | 58 ++++++++++++++++++++++++++++++----------------------
 1 file changed, 34 insertions(+), 24 deletions(-)

diff --git a/fs/xfs/xfs_dir2_sf.c b/fs/xfs/xfs_dir2_sf.c
index aafc6e4..3725fb1 100644
--- a/fs/xfs/xfs_dir2_sf.c
+++ b/fs/xfs/xfs_dir2_sf.c
@@ -170,6 +170,7 @@ xfs_dir2_block_to_sf(
 	char			*ptr;		/* current data pointer */
 	xfs_dir2_sf_entry_t	*sfep;		/* shortform entry */
 	xfs_dir2_sf_hdr_t	*sfp;		/* shortform directory header */
+	xfs_dir2_sf_hdr_t	*dst;		/* temporary data buffer */
 
 	trace_xfs_dir2_block_to_sf(args);
 
@@ -177,35 +178,20 @@ xfs_dir2_block_to_sf(
 	mp = dp->i_mount;
 
 	/*
-	 * Make a copy of the block data, so we can shrink the inode
-	 * and add local data.
+	 * allocate a temporary destination buffer the size of the inode
+	 * to format the data into. Once we have formatted the data, we
+	 * can free the block and copy the formatted data into the inode literal
+	 * area.
 	 */
-	hdr = kmem_alloc(mp->m_dirblksize, KM_SLEEP);
-	memcpy(hdr, bp->b_addr, mp->m_dirblksize);
-	logflags = XFS_ILOG_CORE;
-	if ((error = xfs_dir2_shrink_inode(args, mp->m_dirdatablk, bp))) {
-		ASSERT(error != ENOSPC);
-		goto out;
-	}
+	dst = kmem_alloc(mp->m_sb.sb_inodesize, KM_SLEEP);
+	hdr = bp->b_addr;
 
 	/*
-	 * The buffer is now unconditionally gone, whether
-	 * xfs_dir2_shrink_inode worked or not.
-	 *
-	 * Convert the inode to local format.
-	 */
-	dp->i_df.if_flags &= ~XFS_IFEXTENTS;
-	dp->i_df.if_flags |= XFS_IFINLINE;
-	dp->i_d.di_format = XFS_DINODE_FMT_LOCAL;
-	ASSERT(dp->i_df.if_bytes == 0);
-	xfs_idata_realloc(dp, size, XFS_DATA_FORK);
-	logflags |= XFS_ILOG_DDATA;
-	/*
 	 * Copy the header into the newly allocate local space.
 	 */
-	sfp = (xfs_dir2_sf_hdr_t *)dp->i_df.if_u1.if_data;
+	sfp = (xfs_dir2_sf_hdr_t *)dst;
 	memcpy(sfp, sfhp, xfs_dir2_sf_hdr_size(sfhp->i8count));
-	dp->i_d.di_size = size;
+
 	/*
 	 * Set up to loop over the block's entries.
 	 */
@@ -258,10 +244,34 @@ xfs_dir2_block_to_sf(
 		ptr += dp->d_ops->data_entsize(dep->namelen);
 	}
 	ASSERT((char *)sfep - (char *)sfp == size);
+
+	/* now we are done with the block, we can shrink the inode */
+	logflags = XFS_ILOG_CORE;
+	error = xfs_dir2_shrink_inode(args, mp->m_dirdatablk, bp);
+	if (error) {
+		ASSERT(error != ENOSPC);
+		goto out;
+	}
+
+	/*
+	 * The buffer is now unconditionally gone, whether
+	 * xfs_dir2_shrink_inode worked or not.
+	 *
+	 * Convert the inode to local format and copy the data in.
+	 */
+	dp->i_df.if_flags &= ~XFS_IFEXTENTS;
+	dp->i_df.if_flags |= XFS_IFINLINE;
+	dp->i_d.di_format = XFS_DINODE_FMT_LOCAL;
+	ASSERT(dp->i_df.if_bytes == 0);
+	xfs_idata_realloc(dp, size, XFS_DATA_FORK);
+
+	logflags |= XFS_ILOG_DDATA;
+	memcpy(dp->i_df.if_u1.if_data, dst, size);
+	dp->i_d.di_size = size;
 	xfs_dir2_sf_check(args);
 out:
 	xfs_trans_log_inode(args->trans, dp, logflags);
-	kmem_free(hdr);
+	kmem_free(dst);
 	return error;
 }

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 11+ messages in thread
* Re: XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
  2013-12-03 12:50 ` Dave Chinner
@ 2013-12-03 16:28   ` Yann Dupont
  2013-12-09  9:47   ` Emmanuel Lacour
  2013-12-11 20:22   ` Ben Myers
  2 siblings, 0 replies; 11+ messages in thread
From: Yann Dupont @ 2013-12-03 16:28 UTC (permalink / raw)
  To: xfs

On 03/12/2013 13:50, Dave Chinner wrote:
> On Tue, Dec 03, 2013 at 10:53:58AM +0100, Emmanuel Lacour wrote:
> OK, 32GB RAM, no obvious shortage, no dirty or writeback data. 2TB
> SATA drives, 32 AGs, the only unusual setting is the 64k directory
> block size.
> Yup, there's your problem:

I can confirm I have also seen this from time to time, in the very same
context (Ceph OSD, XFS volume with 64k directory blocks).

I never took the time to fully report the problem because it was very
sporadic, and I suspected a specific option in my hand-made kernels, in
which the problem seems to occur more often. With the 'standard' Debian
3.10.2 (from testing) I have never seen the problem (56 days uptime).

cf. http://tracker.ceph.com/issues/6301 - the report isn't complete
there... sometimes I just get the deadlock, but without an oops. I can
try to dig deeper in my old logs to be sure it's really the same.

Cheers,

--
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
  2013-12-03 12:50 ` Dave Chinner
  2013-12-03 16:28   ` Yann Dupont
@ 2013-12-09  9:47   ` Emmanuel Lacour
  2013-12-11 20:22   ` Ben Myers
  2 siblings, 0 replies; 11+ messages in thread
From: Emmanuel Lacour @ 2013-12-09 9:47 UTC (permalink / raw)
  To: xfs

On Tue, Dec 03, 2013 at 11:50:57PM +1100, Dave Chinner wrote:

Thanks very much for your quick and detailed answer!

> OK, 32GB RAM, no obvious shortage, no dirty or writeback data.
> 2TB SATA drives, 32 AGs, the only unusual setting is the 64k
> directory block size.

Yes, the 64k came from too quick a read of some advice; I don't think
it's of any help on a Ceph cluster, but I'm not an FS guru. Is there a
way to lower it at runtime?

> Yup, there's your problem:
> [...]
> Which, I think, is pretty easy to do. Yup, barely smoke tested patch
> below that demonstrates the fix. Beware - patch may eat babies and
> ask for more. Use it at your own risk!

Unfortunately I cannot test this patch because:

- it's a production cluster and it's currently hard for me to reboot
  nodes (not enough nodes ;))
- just after hitting this problem I saw a kernel 3.11 available on
  Debian backports and decided to upgrade the whole cluster. Since this
  upgrade, there have been no problems anymore... I cross my fingers ;)

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
  2013-12-03 12:50 ` Dave Chinner
  2013-12-03 16:28   ` Yann Dupont
  2013-12-09  9:47   ` Emmanuel Lacour
@ 2013-12-11 20:22   ` Ben Myers
  2013-12-11 23:53     ` Dave Chinner
  2 siblings, 1 reply; 11+ messages in thread
From: Ben Myers @ 2013-12-11 20:22 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Emmanuel Lacour, xfs

On Tue, Dec 03, 2013 at 11:50:57PM +1100, Dave Chinner wrote:
> On Tue, Dec 03, 2013 at 10:53:58AM +0100, Emmanuel Lacour wrote:
> > On Thu, Nov 28, 2013 at 09:05:21PM +1100, Dave Chinner wrote:
> > > On Thu, Nov 28, 2013 at 10:13:22AM +0100, Emmanuel Lacour wrote:
> > > >
> > > > Dear XFS users,
> > > >
> > > > I run a Ceph cluster using XFS on Debian wheezy servers and Linux 3.10
> > > > (debian backports). I see the following line in our logs:
> > > >
> > > > XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
> > > >
> > > > does this reveal a problem in my setup or may I ignore it? If it's a
> > > > problem, can someone give me any hint on solving this?
> > >
> > > It might be, but you need to provide more information for us to be
> > > able to make any intelligent comment on the message. Start here:
> > >
> > > http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
> >
> > The problem continued and crashed my ceph cluster again, so here is all
> > the information the FAQ asks for:
> >
> > http://people.easter-eggs.org/~manu/xfs.log
>
> OK, 32GB RAM, no obvious shortage, no dirty or writeback data.
> 2TB SATA drives, 32 AGs, the only unusual setting is the 64k
> directory block size.
>
> Yup, there's your problem:
>
> [4583991.478469] ceph-osd        D ffff88047fc93f40     0 22951      1 0x00000004
> [4583991.478471]  ffff88046d241140 0000000000000082 ffffffff81047e75 ffff88046f949800
> [4583991.478475]  0000000000013f40 ffff88039eb0bfd8 ffff88039eb0bfd8 ffff88046d241140
> [4583991.478479]  0000000000000000 00000001444d68bd ffff88046d241140 0000000000000005
> [4583991.478483] Call Trace:
> [4583991.478487]  [<ffffffff81047e75>] ? internal_add_timer+0xd/0x28
> [4583991.478491]  [<ffffffff8138e34a>] ? schedule_timeout+0xeb/0x123
> [4583991.478494]  [<ffffffff81047e63>] ? ftrace_raw_event_timer_class+0x9d/0x9d
> [4583991.478498]  [<ffffffff8138edb6>] ? io_schedule_timeout+0x60/0x86
> [4583991.478502]  [<ffffffff810d85ad>] ? congestion_wait+0x70/0xdb
> [4583991.478505]  [<ffffffff8105858f>] ? abort_exclusive_wait+0x79/0x79
> [4583991.478518]  [<ffffffffa056e3f9>] ? kmem_alloc+0x65/0x6f [xfs]
> [4583991.478535]  [<ffffffffa0592c4a>] ? xfs_dir2_block_to_sf+0x5b/0x1fb [xfs]
> [4583991.478550]  [<ffffffffa0592be0>] ? xfs_dir2_block_sfsize+0x15b/0x16a [xfs]
> [4583991.478566]  [<ffffffffa058bf9a>] ? xfs_dir2_block_removename+0x1c7/0x208 [xfs]
> [4583991.478581]  [<ffffffffa058ab4a>] ? xfs_dir_removename+0xda/0x114 [xfs]
> [4583991.478594]  [<ffffffffa056a55c>] ? xfs_rename+0x428/0x554 [xfs]
> [4583991.478606]  [<ffffffffa0567321>] ? xfs_vn_rename+0x5e/0x65 [xfs]
> [4583991.478610]  [<ffffffff8111677b>] ? vfs_rename+0x224/0x35f
> [4583991.478614]  [<ffffffff81113d0b>] ? lookup_dcache+0x22/0x95
> [4583991.478618]  [<ffffffff81116a7e>] ? SYSC_renameat+0x1c8/0x257
> [4583991.478622]  [<ffffffff810fb0fd>] ? __cache_free.isra.45+0x178/0x187
> [4583991.478625]  [<ffffffff81117eb1>] ? SyS_mkdirat+0x2e/0xce
> [4583991.478629]  [<ffffffff8100d56a>] ? do_notify_resume+0x53/0x68
> [4583991.478633]  [<ffffffff81395429>] ? system_call_fastpath+0x16/0x1b
>
> It'll be stuck on this:
>
> 	hdr = kmem_alloc(mp->m_dirblksize, KM_SLEEP);
>
> which is trying to allocate a contiguous 64k buffer to copy the
> directory contents into before freeing the block and then formatting
> them into the inode. The failure will be caused by memory
> fragmentation, and the only way around it is to avoid the contiguous
> allocation of that size.
>
> Which, I think, is pretty easy to do. Yup, barely smoke tested patch
> below that demonstrates the fix. Beware - patch may eat babies and
> ask for more. Use it at your own risk!
>
> I'll post it for review once it's had some testing and I know it
> doesn't corrupt directories all over the place.
>
> > This may be related to a friend's problem here:
> >
> > http://tracker.ceph.com/issues/6386
>
> Doesn't look related, unless the OOM killer is being triggered
> somehow...
>
> Hmmmm - there's also a good chance that the transaction commit code
> has this same contiguous allocation problem, given that it has to
> allocate enough space to log an entire directory buffer. Good guess -
> there's another thread stuck on exactly that:
>
> [4583991.476833] ceph-osd        D ffff88047fc33f40     0 11072      1 0x00000004
> [4583991.476836]  ffff88038b32a040 0000000000000082 ffffffff81047e75 ffff88046f946040
> [4583991.476840]  0000000000013f40 ffff88048ea11fd8 ffff88048ea11fd8 ffff88038b32a040
> [4583991.476844]  0000000000000000 00000001444d68be ffff88038b32a040 0000000000000005
> [4583991.476848] Call Trace:
> [4583991.476852]  [<ffffffff81047e75>] ? internal_add_timer+0xd/0x28
> [4583991.476855]  [<ffffffff8138e34a>] ? schedule_timeout+0xeb/0x123
> [4583991.476859]  [<ffffffff81047e63>] ? ftrace_raw_event_timer_class+0x9d/0x9d
> [4583991.476862]  [<ffffffff8138edb6>] ? io_schedule_timeout+0x60/0x86
> [4583991.476867]  [<ffffffff810d85ad>] ? congestion_wait+0x70/0xdb
> [4583991.476870]  [<ffffffff8105858f>] ? abort_exclusive_wait+0x79/0x79
> [4583991.476883]  [<ffffffffa056e3f9>] ? kmem_alloc+0x65/0x6f [xfs]
> [4583991.476899]  [<ffffffffa05a77a6>] ? xfs_log_commit_cil+0xe8/0x3d1 [xfs]
> [4583991.476904]  [<ffffffff810748ab>] ? current_kernel_time+0x9/0x30
> [4583991.476909]  [<ffffffff81041942>] ? current_fs_time+0x27/0x2d
> [4583991.476925]  [<ffffffffa05a3b7b>] ? xfs_trans_commit+0x62/0x1cf [xfs]
> [4583991.476939]  [<ffffffffa056d3ad>] ? xfs_create+0x41e/0x54f [xfs]
> [4583991.476943]  [<ffffffff81114574>] ? lookup_fast+0x3d/0x215
> [4583991.476954]  [<ffffffffa056cf29>] ? xfs_lookup+0x88/0xee [xfs]
> [4583991.476966]  [<ffffffffa0567428>] ? xfs_vn_mknod+0xb7/0x162 [xfs]
> [4583991.476970]  [<ffffffff81115fda>] ? vfs_create+0x62/0x8b
> [4583991.476974]  [<ffffffff81113d0b>] ? lookup_dcache+0x22/0x95
> [4583991.476978]  [<ffffffff81117179>] ? do_last+0x595/0xa16
> [4583991.476982]  [<ffffffff811176be>] ? path_openat+0xc4/0x335
> [4583991.476985]  [<ffffffff81117bda>] ? do_filp_open+0x2a/0x6e
> [4583991.476989]  [<ffffffff81120b62>] ? __alloc_fd+0xd0/0xe1
> [4583991.476993]  [<ffffffff8110b684>] ? do_sys_open+0x5c/0xe0
> [4583991.476996]  [<ffffffff81395429>] ? system_call_fastpath+0x16/0x1b
>
> That one isn't so easy to fix, unfortunately.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
>
> xfs: xfs_dir2_block_to_sf temp buffer allocation fails
>
> From: Dave Chinner <dchinner@redhat.com>
>
> If we are using a large directory block size, and memory becomes
> fragmented, we can get memory allocation failures trying to
> kmem_alloc(64k) for a temporary buffer. However, there is no need
> for a directory-buffer-sized allocation, as the end result ends up
> in the inode literal area. This is, at most, slightly less than 2k
> of space, and hence we don't need an allocation larger than that
> for a temporary buffer.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

D'oh, I missed this one too. If you stick 'patch' in the subject
they'll have additional visibility.

Looks good to me.

Reviewed-by: Ben Myers <bpm@sgi.com>

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
2013-12-11 20:22 ` Ben Myers
@ 2013-12-11 23:53 ` Dave Chinner
0 siblings, 0 replies; 11+ messages in thread
From: Dave Chinner @ 2013-12-11 23:53 UTC (permalink / raw)
To: Ben Myers; +Cc: Emmanuel Lacour, xfs

On Wed, Dec 11, 2013 at 02:22:26PM -0600, Ben Myers wrote:
> On Tue, Dec 03, 2013 at 11:50:57PM +1100, Dave Chinner wrote:
> > On Tue, Dec 03, 2013 at 10:53:58AM +0100, Emmanuel Lacour wrote:
> > > On Thu, Nov 28, 2013 at 09:05:21PM +1100, Dave Chinner wrote:
> > > > On Thu, Nov 28, 2013 at 10:13:22AM +0100, Emmanuel Lacour wrote:
> > > > > Dear XFS users,
> > > > > 
> > > > > I run a Ceph cluster using XFS on Debian wheezy servers and Linux 3.10
> > > > > (Debian backports). I see the following line in our logs:
> > > > > 
> > > > > XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
> > > > > 
> > > > > Does this reveal a problem in my setup, or may I ignore it? If it's a
> > > > > problem, can someone give me a hint on solving it?
> > > > 
> > > > It might be, but you need to provide more information for us to be
> > > > able to make any intelligent comment on the message. Start here:
> > > > 
> > > > http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
> > > 
> > > The problem continued and crashed my Ceph cluster again, so here is
> > > all the information requested in the FAQ:
> > > 
> > > http://people.easter-eggs.org/~manu/xfs.log
> > 
> > OK, 32GB RAM, no obvious shortage, no dirty or writeback data.
> > 2TB SATA drives, 32 AGs; the only unusual setting is the 64k
> > directory block size.
> > 
> > Yup, there's your problem:
> > 
> > [4583991.478469] ceph-osd        D ffff88047fc93f40     0 22951      1 0x00000004
> > [4583991.478471]  ffff88046d241140 0000000000000082 ffffffff81047e75 ffff88046f949800
> > [4583991.478475]  0000000000013f40 ffff88039eb0bfd8 ffff88039eb0bfd8 ffff88046d241140
> > [4583991.478479]  0000000000000000 00000001444d68bd ffff88046d241140 0000000000000005
> > [4583991.478483] Call Trace:
> > [4583991.478487]  [<ffffffff81047e75>] ? internal_add_timer+0xd/0x28
> > [4583991.478491]  [<ffffffff8138e34a>] ? schedule_timeout+0xeb/0x123
> > [4583991.478494]  [<ffffffff81047e63>] ? ftrace_raw_event_timer_class+0x9d/0x9d
> > [4583991.478498]  [<ffffffff8138edb6>] ? io_schedule_timeout+0x60/0x86
> > [4583991.478502]  [<ffffffff810d85ad>] ? congestion_wait+0x70/0xdb
> > [4583991.478505]  [<ffffffff8105858f>] ? abort_exclusive_wait+0x79/0x79
> > [4583991.478518]  [<ffffffffa056e3f9>] ? kmem_alloc+0x65/0x6f [xfs]
> > [4583991.478535]  [<ffffffffa0592c4a>] ? xfs_dir2_block_to_sf+0x5b/0x1fb [xfs]
> > [4583991.478550]  [<ffffffffa0592be0>] ? xfs_dir2_block_sfsize+0x15b/0x16a [xfs]
> > [4583991.478566]  [<ffffffffa058bf9a>] ? xfs_dir2_block_removename+0x1c7/0x208 [xfs]
> > [4583991.478581]  [<ffffffffa058ab4a>] ? xfs_dir_removename+0xda/0x114 [xfs]
> > [4583991.478594]  [<ffffffffa056a55c>] ? xfs_rename+0x428/0x554 [xfs]
> > [4583991.478606]  [<ffffffffa0567321>] ? xfs_vn_rename+0x5e/0x65 [xfs]
> > [4583991.478610]  [<ffffffff8111677b>] ? vfs_rename+0x224/0x35f
> > [4583991.478614]  [<ffffffff81113d0b>] ? lookup_dcache+0x22/0x95
> > [4583991.478618]  [<ffffffff81116a7e>] ? SYSC_renameat+0x1c8/0x257
> > [4583991.478622]  [<ffffffff810fb0fd>] ? __cache_free.isra.45+0x178/0x187
> > [4583991.478625]  [<ffffffff81117eb1>] ? SyS_mkdirat+0x2e/0xce
> > [4583991.478629]  [<ffffffff8100d56a>] ? do_notify_resume+0x53/0x68
> > [4583991.478633]  [<ffffffff81395429>] ? system_call_fastpath+0x16/0x1b
> > 
> > It'll be stuck on this:
> > 
> > 	hdr = kmem_alloc(mp->m_dirblksize, KM_SLEEP);
> > 
> > which is trying to allocate a contiguous 64k buffer to copy the
> > directory contents into before freeing the block and then formatting
> > them into the inode. The failure will be caused by memory
> > fragmentation, and the only way around it is to avoid the contiguous
> > allocation of that size.
> > 
> > Which, I think, is pretty easy to do. Yup, barely smoke tested patch
> > below that demonstrates the fix. Beware - patch may eat babies and
> > ask for more. Use it at your own risk!
> > 
> > I'll post it for review once it's had some testing and I know it
> > doesn't corrupt directories all over the place.
> > 
> > > This may be related to a similar problem reported here:
> > > 
> > > http://tracker.ceph.com/issues/6386
> > 
> > Doesn't look related, unless the OOM killer is being triggered
> > somehow...
> > 
> > Hmmmm - there's also a good chance that the transaction commit code
> > has this same contiguous allocation problem, given that it has to
> > allocate enough space to log an entire directory buffer. Good
> > guess - there's another thread stuck on exactly that:
> > 
> > [4583991.476833] ceph-osd        D ffff88047fc33f40     0 11072      1 0x00000004
> > [4583991.476836]  ffff88038b32a040 0000000000000082 ffffffff81047e75 ffff88046f946040
> > [4583991.476840]  0000000000013f40 ffff88048ea11fd8 ffff88048ea11fd8 ffff88038b32a040
> > [4583991.476844]  0000000000000000 00000001444d68be ffff88038b32a040 0000000000000005
> > [4583991.476848] Call Trace:
> > [4583991.476852]  [<ffffffff81047e75>] ? internal_add_timer+0xd/0x28
> > [4583991.476855]  [<ffffffff8138e34a>] ? schedule_timeout+0xeb/0x123
> > [4583991.476859]  [<ffffffff81047e63>] ? ftrace_raw_event_timer_class+0x9d/0x9d
> > [4583991.476862]  [<ffffffff8138edb6>] ? io_schedule_timeout+0x60/0x86
> > [4583991.476867]  [<ffffffff810d85ad>] ? congestion_wait+0x70/0xdb
> > [4583991.476870]  [<ffffffff8105858f>] ? abort_exclusive_wait+0x79/0x79
> > [4583991.476883]  [<ffffffffa056e3f9>] ? kmem_alloc+0x65/0x6f [xfs]
> > [4583991.476899]  [<ffffffffa05a77a6>] ? xfs_log_commit_cil+0xe8/0x3d1 [xfs]
> > [4583991.476904]  [<ffffffff810748ab>] ? current_kernel_time+0x9/0x30
> > [4583991.476909]  [<ffffffff81041942>] ? current_fs_time+0x27/0x2d
> > [4583991.476925]  [<ffffffffa05a3b7b>] ? xfs_trans_commit+0x62/0x1cf [xfs]
> > [4583991.476939]  [<ffffffffa056d3ad>] ? xfs_create+0x41e/0x54f [xfs]
> > [4583991.476943]  [<ffffffff81114574>] ? lookup_fast+0x3d/0x215
> > [4583991.476954]  [<ffffffffa056cf29>] ? xfs_lookup+0x88/0xee [xfs]
> > [4583991.476966]  [<ffffffffa0567428>] ? xfs_vn_mknod+0xb7/0x162 [xfs]
> > [4583991.476970]  [<ffffffff81115fda>] ? vfs_create+0x62/0x8b
> > [4583991.476974]  [<ffffffff81113d0b>] ? lookup_dcache+0x22/0x95
> > [4583991.476978]  [<ffffffff81117179>] ? do_last+0x595/0xa16
> > [4583991.476982]  [<ffffffff811176be>] ? path_openat+0xc4/0x335
> > [4583991.476985]  [<ffffffff81117bda>] ? do_filp_open+0x2a/0x6e
> > [4583991.476989]  [<ffffffff81120b62>] ? __alloc_fd+0xd0/0xe1
> > [4583991.476993]  [<ffffffff8110b684>] ? do_sys_open+0x5c/0xe0
> > [4583991.476996]  [<ffffffff81395429>] ? system_call_fastpath+0x16/0x1b
> > 
> > That one isn't so easy to fix, unfortunately.
> > 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
> > 
> > xfs: xfs_dir2_block_to_sf temp buffer allocation fails
> > 
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > If we are using a large directory block size, and memory becomes
> > fragmented, we can get memory allocation failures trying to
> > kmem_alloc(64k) for a temporary buffer. However, there is no need
> > for a directory-buffer-sized allocation, as the end result ends up
> > in the inode literal area. This is, at most, slightly less than 2k
> > of space, and hence we don't need an allocation larger than that
> > for a temporary buffer.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> 
> D'oh, I missed this one too. If you stick 'patch' in the subject it'll have
> additional visibility.
> 
> Looks good to me.
> 
> Reviewed-by: Ben Myers <bpm@sgi.com>

When I reply inline like this, it's more a case of a "please test this
patch to see if it fixes your problem" question than an official posting
of a patch, because it's very likely I have just written the patch and
have only done a 5-minute smoke test of it. So in this context, it's not
really a "please review and commit" request. The tell is that I haven't
reposted the patch in a separate series asking for review and commit, as
I normally do once I've tested it properly and am happy with it.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
end of thread, other threads:[~2013-12-11 23:53 UTC | newest]

Thread overview: 11+ messages
2013-08-21 15:24 XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250) Josef 'Jeff' Sipek
2013-08-22  2:25 ` Dave Chinner
2013-08-22 15:07   ` Josef 'Jeff' Sipek
-- strict thread matches above, loose matches on Subject: below --
2013-11-28  9:13 Emmanuel Lacour
2013-11-28 10:05 ` Dave Chinner
2013-12-03  9:53   ` Emmanuel Lacour
2013-12-03 12:50     ` Dave Chinner
2013-12-03 16:28       ` Yann Dupont
2013-12-09  9:47       ` Emmanuel Lacour
2013-12-11 20:22         ` Ben Myers
2013-12-11 23:53           ` Dave Chinner