[PATCH] mm: compaction beware writeback

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] mm: compaction beware writeback
@ 2011-03-20  6:27 Hugh Dickins
  2011-03-20 17:47 ` Andrea Arcangeli
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Hugh Dickins @ 2011-03-20  6:27 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrea Arcangeli, Andrew Morton, linux-mm

I notice there's a Bug 31142 "Large write to USB stick freezes"
discussion happening (which I've not digested), for which Andrea
is proposing a patch which reminds me of this one.  Thought I'd
better throw this into the mix for consideration.

I'd not sent it in yet, because I only see the problem on one machine,
and then only with a shmem patch I'm working up; but can't see how
that patch would actually be necessary to create the problem.

It happens in my extfs-on-loop-on-tmpfs swapping tests, when copying
in the kernel tree.  I believe the relevant traces are these three:
I notice sync_supers there every time it hangs, but I guess it comes
along after, and gets stuck on the same page which cp is waiting for.

D  sync_supers:
schedule +0x670
io_schedule +0x50
sync_buffer +0x68
__wait_on_bit +0x90
out_of_line_wait_on_bit +0x98
__wait_on_buffer +0x30
__sync_dirty_buffer +0xc0
ext4_commit_super +0x2c4
ext4_write_super +0x28
sync_supers +0xdc
bdi_sync_supers +0x40
kthread +0xac
kernel_thread +0x54

D  loop0:
schedule +0x670
io_schedule +0x50
sync_page +0x84
__wait_on_bit +0x90
wait_on_page_bit +0xa4
unmap_and_move +0x180
migrate_pages +0xbc
compact_zone +0xbc
compact_zone_order +0xc8
try_to_compact_pages +0x104
__alloc_pages_direct_compact +0xc0
__alloc_pages_nodemask +0x68c
allocate_slab +0x84
new_slab +0x58
__slab_alloc +0x1ec
kmem_cache_alloc +0x7c
radix_tree_preload +0x94
add_to_page_cache_locked +0x78
shmem_getpage +0x208
pagecache_write_begin +0x2c
do_lo_send_aops +0xc0
do_bio_filebacked +0x11c
loop_thread +0x204
kthread +0xac
kernel_thread +0x54

D  cp:
schedule +0x670
io_schedule +0x50
sync_buffer +0x68
__wait_on_bit +0x90
out_of_line_wait_on_bit +0x98
__wait_on_buffer +0x30
ext4_find_entry +0x230
ext4_lookup +0x44
d_alloc_and_lookup +0x74
do_last +0xe0
do_filp_open +0x2b8
do_sys_open +0x8c
compat_sys_open +0x24
syscall_exit +0x0

I believe (but haven't verified for sure) that what happens is that
compaction (when trying to allocate a radix_tree node - SLUB asks
for order 2 - in the loop0 daemon trace) chooses the cp page under
writeback which is waiting for loop0 to write it.

So I've extended your earlier PF_MEMALLOC patch to prevent waiting for
writeback as well as waiting for pagelock.  And I've never seen the
hang again since putting this patch in.

Signed-off-by: Hugh Dickins <hughd@google.com>
---

 mm/migrate.c |   38 +++++++++++++++++++++-----------------
 1 file changed, 21 insertions(+), 17 deletions(-)

--- 2.6.38/mm/migrate.c	2011-03-14 18:20:32.000000000 -0700
+++ linux/mm/migrate.c	2011-03-15 06:36:26.000000000 -0700
@@ -637,29 +637,33 @@ static int unmap_and_move(new_page_t get
 		if (unlikely(split_huge_page(page)))
 			goto move_newpage;
 
+	/*
+	 * It's not safe for direct compaction to call lock_page.
+	 * For example, during page readahead pages are added locked
+	 * to the LRU. Later, when the IO completes the pages are
+	 * marked uptodate and unlocked. However, the queueing
+	 * could be merging multiple pages for one bio (e.g.
+	 * mpage_readpages). If an allocation happens for the
+	 * second or third page, the process can end up locking
+	 * the same page twice and deadlocking. Rather than
+	 * trying to be clever about what pages can be locked,
+	 * avoid the use of lock_page for direct compaction
+	 * altogether.
+	 *
+	 * Nor is it safe for direct compaction to wait_on_page_writeback:
+	 * we might be trying to allocate on behalf of that writeback (e.g.
+	 * slub allocating an order-2 page for a radix_tree node for the
+	 * loop device below, might target that very page under writeback).
+	 */
+	if (current->flags & PF_MEMALLOC)
+		force = 0;
+
 	/* prepare cgroup just returns 0 or -ENOMEM */
 	rc = -EAGAIN;
 
 	if (!trylock_page(page)) {
 		if (!force)
 			goto move_newpage;
-
-		/*
-		 * It's not safe for direct compaction to call lock_page.
-		 * For example, during page readahead pages are added locked
-		 * to the LRU. Later, when the IO completes the pages are
-		 * marked uptodate and unlocked. However, the queueing
-		 * could be merging multiple pages for one bio (e.g.
-		 * mpage_readpages). If an allocation happens for the
-		 * second or third page, the process can end up locking
-		 * the same page twice and deadlocking. Rather than
-		 * trying to be clever about what pages can be locked,
-		 * avoid the use of lock_page for direct compaction
-		 * altogether.
-		 */
-		if (current->flags & PF_MEMALLOC)
-			goto move_newpage;
-
 		lock_page(page);
 	}
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] mm: compaction beware writeback
  2011-03-20  6:27 [PATCH] mm: compaction beware writeback Hugh Dickins
@ 2011-03-20 17:47 ` Andrea Arcangeli
  2011-03-21  2:37   ` Hugh Dickins
  2011-03-21  9:59 ` Mel Gorman
  2011-03-29  8:27 ` Johannes Weiner
  2 siblings, 1 reply; 6+ messages in thread
From: Andrea Arcangeli @ 2011-03-20 17:47 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Mel Gorman, Andrew Morton, linux-mm

On Sat, Mar 19, 2011 at 11:27:38PM -0700, Hugh Dickins wrote:
> I notice there's a Bug 31142 "Large write to USB stick freezes"
> discussion happening (which I've not digested), for which Andrea
> is proposing a patch which reminds me of this one.  Thought I'd
> better throw this into the mix for consideration.

With regard to that bug:

https://bugzilla.kernel.org/attachment.cgi?id=51262

there's no sign of __wait_on_bit called by migrate_pages. I think the
problem there are the ->writepage run on udf to writeout dirty pages
(to retry migration in the next iteration of the loop when the page
isn't dirty anymore), which hopefully it's solved by the patch I sent.

> D  loop0:
> schedule +0x670
> io_schedule +0x50
> sync_page +0x84
> __wait_on_bit +0x90
> wait_on_page_bit +0xa4
> unmap_and_move +0x180
> migrate_pages +0xbc
> compact_zone +0xbc
> compact_zone_order +0xc8
> try_to_compact_pages +0x104
> __alloc_pages_direct_compact +0xc0
> __alloc_pages_nodemask +0x68c
> allocate_slab +0x84
> new_slab +0x58
> __slab_alloc +0x1ec
> kmem_cache_alloc +0x7c
> radix_tree_preload +0x94
> add_to_page_cache_locked +0x78
> shmem_getpage +0x208
> pagecache_write_begin +0x2c
> do_lo_send_aops +0xc0
> do_bio_filebacked +0x11c
> loop_thread +0x204
> kthread +0xac
> kernel_thread +0x54

So your patch will avoid waiting above in writeback if called by
direct compaction like above, agreed (currently we only avoid to call
lock_page but we could still wait in wait_on_page_writeback if force=1
goes on at the third pass of the migrate_pages loop). This seems a
separate issue to the last trace posted in bug 31142 as there was no
wait_on_bit in that last trace.

Interesting that slab allocates with order > 0 an object that is <4096
bytes. Is this related to slab_break_gfp_order? The comment talks
about fragmentation resistance in low memory, if that's the same
fragmentation that anti-frag solves, is this logic still actual? I
guess we still need it for cache coloring. However I noticed there's
no fallback to that, so once kmem_cache_create decided that an optimal
gfporder is > 0 (even if it doesn't need to be > 0) if the order 1
allocation fails, kmem_cache_alloc will fail too without fallback to
order 0. It could be more reliable than that.

A large allocation if nothing else reduces the frequency of gfp calls.

> So I've extended your earlier PF_MEMALLOC patch to prevent waiting for
> writeback as well as waiting for pagelock.  And I've never seen the
> hang again since putting this patch in.
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---

This change looks good to me, it will make compaction a little less
reliable with regard to writeback. You are still going to execute
->writepage on dirty pages though (except for __GFP_NO_KSWAPD
allocations that signals the fact they don't need to be reliable with
my last patch for bug 31142 and foces sync=0 and have that sync
parameter checked before calling ->writepage too).

compaction usually has an huge amount of "source" memory to move, in
the destination space, so it's not so good that migration is so
aggressive and blocks when there may be another couple of contiguous
source page to move without waiting. Ideally we should move to the
next block of source pages in compaction, before setting force=1 in a
final pass. We insist a lot in migrate_pages when breaking the loop
after 1 pass could be enough, and then it should be compaction that if
it fails migration on all candidate source pages, tries an "harder
migrate" in a second pass. We could achieve that by having compaction
twiddle with the sync bit instead of the caller and only run one pass
of migrate when sync = 0. That's a bit larger change though and it may
consume more cpu in compaction (compaction loop not so cheap), so I'm
unsure if it would be an overall improvement.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] mm: compaction beware writeback
  2011-03-20 17:47 ` Andrea Arcangeli
@ 2011-03-21  2:37   ` Hugh Dickins
  2011-03-21 12:32     ` Andrea Arcangeli
  0 siblings, 1 reply; 6+ messages in thread
From: Hugh Dickins @ 2011-03-21  2:37 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Mel Gorman, Andrew Morton, linux-mm

On Sun, 20 Mar 2011, Andrea Arcangeli wrote:
> 
> Interesting that slab allocates with order > 0 an object that is <4096
> bytes. Is this related to slab_break_gfp_order?

No, it's SLUB I'm using (partly for its excellent debugging, partly
to trigger issues like this).  Remember, that's SLUB's great weakness,
that for optimal efficiency it relies upon higher order pages than you'd
expect.  It's much better since Christoph put in the ORDER_FALLBACK, but
still makes a first attempt for a higher order page, which is liable to
stir up page_alloc more than we'd like.

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] mm: compaction beware writeback
  2011-03-20  6:27 [PATCH] mm: compaction beware writeback Hugh Dickins
  2011-03-20 17:47 ` Andrea Arcangeli
@ 2011-03-21  9:59 ` Mel Gorman
  2011-03-29  8:27 ` Johannes Weiner
  2 siblings, 0 replies; 6+ messages in thread
From: Mel Gorman @ 2011-03-21  9:59 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Andrea Arcangeli, Andrew Morton, linux-mm

On Sat, Mar 19, 2011 at 11:27:38PM -0700, Hugh Dickins wrote:
> I notice there's a Bug 31142 "Large write to USB stick freezes"
> discussion happening (which I've not digested), for which Andrea
> is proposing a patch which reminds me of this one.  Thought I'd
> better throw this into the mix for consideration.
> 
> I'd not sent it in yet, because I only see the problem on one machine,
> and then only with a shmem patch I'm working up; but can't see how
> that patch would actually be necessary to create the problem.
> 
> It happens in my extfs-on-loop-on-tmpfs swapping tests, when copying
> in the kernel tree.  I believe the relevant traces are these three:
> I notice sync_supers there every time it hangs, but I guess it comes
> along after, and gets stuck on the same page which cp is waiting for.
> 
> D  sync_supers:
> schedule +0x670
> io_schedule +0x50
> sync_buffer +0x68
> __wait_on_bit +0x90
> out_of_line_wait_on_bit +0x98
> __wait_on_buffer +0x30
> __sync_dirty_buffer +0xc0
> ext4_commit_super +0x2c4
> ext4_write_super +0x28
> sync_supers +0xdc
> bdi_sync_supers +0x40
> kthread +0xac
> kernel_thread +0x54
> 
> D  loop0:
> schedule +0x670
> io_schedule +0x50
> sync_page +0x84
> __wait_on_bit +0x90
> wait_on_page_bit +0xa4
> unmap_and_move +0x180
> migrate_pages +0xbc
> compact_zone +0xbc
> compact_zone_order +0xc8
> try_to_compact_pages +0x104
> __alloc_pages_direct_compact +0xc0
> __alloc_pages_nodemask +0x68c
> allocate_slab +0x84
> new_slab +0x58
> __slab_alloc +0x1ec
> kmem_cache_alloc +0x7c
> radix_tree_preload +0x94
> add_to_page_cache_locked +0x78
> shmem_getpage +0x208
> pagecache_write_begin +0x2c
> do_lo_send_aops +0xc0
> do_bio_filebacked +0x11c
> loop_thread +0x204
> kthread +0xac
> kernel_thread +0x54
> 
> D  cp:
> schedule +0x670
> io_schedule +0x50
> sync_buffer +0x68
> __wait_on_bit +0x90
> out_of_line_wait_on_bit +0x98
> __wait_on_buffer +0x30
> ext4_find_entry +0x230
> ext4_lookup +0x44
> d_alloc_and_lookup +0x74
> do_last +0xe0
> do_filp_open +0x2b8
> do_sys_open +0x8c
> compat_sys_open +0x24
> syscall_exit +0x0
> 
> I believe (but haven't verified for sure) that what happens is that
> compaction (when trying to allocate a radix_tree node - SLUB asks
> for order 2 - in the loop0 daemon trace) chooses the cp page under
> writeback which is waiting for loop0 to write it.
> 
> So I've extended your earlier PF_MEMALLOC patch to prevent waiting for
> writeback as well as waiting for pagelock.  And I've never seen the
> hang again since putting this patch in.
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>

Yes, it is the case that waiting on page writeback can also cause
badness and I think this should also be considered for -stable.

Acked-by: Mel Gorman <mel@csn.ul.ie>

Thanks

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] mm: compaction beware writeback
  2011-03-21  2:37   ` Hugh Dickins
@ 2011-03-21 12:32     ` Andrea Arcangeli
  0 siblings, 0 replies; 6+ messages in thread
From: Andrea Arcangeli @ 2011-03-21 12:32 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Mel Gorman, Andrew Morton, linux-mm

On Sun, Mar 20, 2011 at 07:37:02PM -0700, Hugh Dickins wrote:
> On Sun, 20 Mar 2011, Andrea Arcangeli wrote:
> > 
> > Interesting that slab allocates with order > 0 an object that is <4096
> > bytes. Is this related to slab_break_gfp_order?
> 
> No, it's SLUB I'm using (partly for its excellent debugging, partly
> to trigger issues like this).  Remember, that's SLUB's great weakness,
> that for optimal efficiency it relies upon higher order pages than you'd
> expect.  It's much better since Christoph put in the ORDER_FALLBACK, but
> still makes a first attempt for a higher order page, which is liable to
> stir up page_alloc more than we'd like.

Ah ok, that explains it... I didn't realize you used SLUB sorry.

I use SLAB as it's measurably faster in most workloads even on larger
servers (but it will consume more memory on with an huge number of
cpus, up to 128 CPUs it's no big deal). My cellphone uses SLUB though
(kabi issues with evil rfs.ko or I would have switched already).

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] mm: compaction beware writeback
  2011-03-20  6:27 [PATCH] mm: compaction beware writeback Hugh Dickins
  2011-03-20 17:47 ` Andrea Arcangeli
  2011-03-21  9:59 ` Mel Gorman
@ 2011-03-29  8:27 ` Johannes Weiner
  2 siblings, 0 replies; 6+ messages in thread
From: Johannes Weiner @ 2011-03-29  8:27 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Mel Gorman, Andrea Arcangeli, Andrew Morton, linux-mm

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2011-03-29  8:28 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-03-20  6:27 [PATCH] mm: compaction beware writeback Hugh Dickins
2011-03-20 17:47 ` Andrea Arcangeli
2011-03-21  2:37   ` Hugh Dickins
2011-03-21 12:32     ` Andrea Arcangeli
2011-03-21  9:59 ` Mel Gorman
2011-03-29  8:27 ` Johannes Weiner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).