* long sleep_on_page delays writing to slow storage
From: Andy Isaacson @ 2011-11-07 4:59 UTC
To: linux-kernel, linux-mm
I am running 1a67a573b (3.1.0-09125 plus a small local patch) on a Core
i7, 8 GB RAM, writing a few GB of data to a slow SD card attached via
usb-storage with vfat. I mounted without specifying any options,
/dev/sdb1 /mnt/usb vfat rw,nosuid,nodev,noexec,relatime,uid=22448,gid=22448,fmask=0022,dmask=0022,codepage=cp437,iocharset=utf8,shortname=mixed,errors=remount-ro 0 0
and I'm using rsync to write the data.
We end up in a fairly steady state with a half GB dirty:
Dirty: 612280 kB
The dirty count stays high despite running sync(1) in another xterm.
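The Dirty figure quoted above comes from /proc/meminfo; as a quick illustration, it can be sampled with something like:

```shell
# Sample the current dirty page count from /proc/meminfo
dirty_kb=$(awk '/^Dirty:/ {print $2}' /proc/meminfo)
echo "Dirty: ${dirty_kb} kB"
```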
The bug is,
Firefox (iceweasel 7.0.1-4) hangs at random intervals. One thread is
stuck in sleep_on_page
[<ffffffff810c50da>] sleep_on_page+0xe/0x12
[<ffffffff810c525b>] wait_on_page_bit+0x72/0x74
[<ffffffff811030f9>] migrate_pages+0x17c/0x36f
[<ffffffff810fa24a>] compact_zone+0x467/0x68b
[<ffffffff810fa6a7>] try_to_compact_pages+0x14c/0x1b3
[<ffffffff810cbda1>] __alloc_pages_direct_compact+0xa7/0x15a
[<ffffffff810cc4ec>] __alloc_pages_nodemask+0x698/0x71d
[<ffffffff810f89c2>] alloc_pages_vma+0xf5/0xfa
[<ffffffff8110683f>] do_huge_pmd_anonymous_page+0xbe/0x227
[<ffffffff810e2bf4>] handle_mm_fault+0x113/0x1ce
[<ffffffff8102fe3d>] do_page_fault+0x2d7/0x31e
[<ffffffff812fe535>] page_fault+0x25/0x30
[<ffffffffffffffff>] 0xffffffffffffffff
And it stays stuck there for long enough for me to find the thread and
attach strace. Apparently it was stuck in
1320640739.201474 munmap(0x7f5c06b00000, 2097152) = 0
for something between 20 and 60 seconds.
There's no reason to let a 6MB/sec high latency device lock up 600 MB of
dirty pages. I'll have to wait a hundred seconds after my app exits
before the system will return to usability.
And there's no way, AFAICS, for me to work around this behavior in
userland.
And I don't understand how this compact_zone thing is intended to work
in this situation.
edited but nearly full dmesg at
http://web.hexapodia.org/~adi/snow/dmesg-3.1.0-09126-g4730284.txt
Thoughts?
Thanks,
-andy
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: long sleep_on_page delays writing to slow storage
From: Jan Kara @ 2011-11-09 17:00 UTC
To: Andy Isaacson; +Cc: linux-kernel, linux-mm, mgorman, aarcange
I've added to CC some mm developers who know much more about transparent
hugepages than I do because that is what seems to cause your problems...
On Sun 06-11-11 20:59:28, Andy Isaacson wrote:
> I am running 1a67a573b (3.1.0-09125 plus a small local patch) on a Core
> i7, 8 GB RAM, writing a few GB of data to a slow SD card attached via
> usb-storage with vfat. I mounted without specifying any options,
>
> /dev/sdb1 /mnt/usb vfat rw,nosuid,nodev,noexec,relatime,uid=22448,gid=22448,fmask=0022,dmask=0022,codepage=cp437,iocharset=utf8,shortname=mixed,errors=remount-ro 0 0
>
> and I'm using rsync to write the data.
>
> We end up in a fairly steady state with a half GB dirty:
>
> Dirty: 612280 kB
>
> The dirty count stays high despite running sync(1) in another xterm.
>
> The bug is,
>
> Firefox (iceweasel 7.0.1-4) hangs at random intervals. One thread is
> stuck in sleep_on_page
>
> [<ffffffff810c50da>] sleep_on_page+0xe/0x12
> [<ffffffff810c525b>] wait_on_page_bit+0x72/0x74
> [<ffffffff811030f9>] migrate_pages+0x17c/0x36f
> [<ffffffff810fa24a>] compact_zone+0x467/0x68b
> [<ffffffff810fa6a7>] try_to_compact_pages+0x14c/0x1b3
> [<ffffffff810cbda1>] __alloc_pages_direct_compact+0xa7/0x15a
> [<ffffffff810cc4ec>] __alloc_pages_nodemask+0x698/0x71d
> [<ffffffff810f89c2>] alloc_pages_vma+0xf5/0xfa
> [<ffffffff8110683f>] do_huge_pmd_anonymous_page+0xbe/0x227
> [<ffffffff810e2bf4>] handle_mm_fault+0x113/0x1ce
> [<ffffffff8102fe3d>] do_page_fault+0x2d7/0x31e
> [<ffffffff812fe535>] page_fault+0x25/0x30
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> And it stays stuck there for long enough for me to find the thread and
> attach strace. Apparently it was stuck in
>
> 1320640739.201474 munmap(0x7f5c06b00000, 2097152) = 0
>
> for something between 20 and 60 seconds.
That's not nice. Apparently you are using transparent hugepages and the
stuck application tried to allocate a hugepage. But to allocate a hugepage
you need a physically contiguous set of pages and try_to_compact_pages()
is trying to achieve exactly that. But some of the pages that need moving
around are stuck for a long time - likely being submitted to your USB
stick for writing. So all in all I'm not *that* surprised you see what you
see.
> There's no reason to let a 6MB/sec high latency device lock up 600 MB of
> dirty pages. I'll have to wait a hundred seconds after my app exits
> before the system will return to usability.
>
> And there's no way, AFAICS, for me to work around this behavior in
> userland.
There is - you can use /sys/block/<device>/bdi/max_ratio to tune how much
of dirty cache that device can take. Dirty cache is set to 20% of your
total memory by default so that amounts to ~1.6 GB. So if you tune
max_ratio to say 5, you will get at most 80 MB of dirty pages against your
USB stick, which should be about right. You can even create a udev
rule so that when a USB stick is inserted, it automatically sets
max_ratio for it to 5...
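A rough sketch of both suggestions follows. The device name sdb is taken from the mount line earlier in the thread; the udev rule is an untested illustration (file name and match keys are hypothetical and may need adjusting for a given distribution):

```shell
# One-off, as root: cap this device's share of the dirty cache at 5%
# (the default max_ratio is 100, i.e. the whole dirty limit)
echo 5 > /sys/block/sdb/bdi/max_ratio

# Hypothetical /etc/udev/rules.d/90-usb-max-ratio.rules: apply the same
# cap whenever a USB disk appears (%k expands to the kernel device name)
# ACTION=="add", KERNEL=="sd[a-z]", SUBSYSTEMS=="usb", \
#     RUN+="/bin/sh -c 'echo 5 > /sys/block/%k/bdi/max_ratio'"
```

The write requires root and an attached device, so this is an administrative/config fragment rather than something to run verbatim.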
> And I don't understand how this compact_zone thing is intended to work
> in this situation.
>
> edited but nearly full dmesg at
> http://web.hexapodia.org/~adi/snow/dmesg-3.1.0-09126-g4730284.txt
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
* Re: long sleep_on_page delays writing to slow storage
From: Mel Gorman @ 2011-11-09 17:52 UTC
To: Jan Kara; +Cc: Andy Isaacson, linux-kernel, linux-mm, aarcange
On Wed, Nov 09, 2011 at 06:00:27PM +0100, Jan Kara wrote:
> I've added to CC some mm developers who know much more about transparent
> hugepages than I do because that is what seems to cause your problems...
>
This has hit more than once recently. It's not something I can
reproduce locally as such, but based on the reports the problem seems
to be two-fold:
1. processes getting stuck in synchronous reclaim
2. processes faulting at the same time khugepaged is allocating a huge
page with CONFIG_NUMA enabled
The first one is easily fixed, the second one not so much. I'm
prototyping two patches at the moment and sending them through tests.
> On Sun 06-11-11 20:59:28, Andy Isaacson wrote:
> > I am running 1a67a573b (3.1.0-09125 plus a small local patch) on a Core
> > i7, 8 GB RAM, writing a few GB of data to a slow SD card attached via
> > usb-storage with vfat. I mounted without specifying any options,
> >
> > /dev/sdb1 /mnt/usb vfat rw,nosuid,nodev,noexec,relatime,uid=22448,gid=22448,fmask=0022,dmask=0022,codepage=cp437,iocharset=utf8,shortname=mixed,errors=remount-ro 0 0
> >
> > and I'm using rsync to write the data.
> >
Sounds similar to the cases I'm hearing about - copying from NFS to USB
with applications freezing where disabling transparent hugepages
sometimes helps.
> > We end up in a fairly steady state with a half GB dirty:
> >
> > Dirty: 612280 kB
> >
> > The dirty count stays high despite running sync(1) in another xterm.
> >
> > The bug is,
> >
> > Firefox (iceweasel 7.0.1-4) hangs at random intervals. One thread is
> > stuck in sleep_on_page
> >
> > [<ffffffff810c50da>] sleep_on_page+0xe/0x12
> > [<ffffffff810c525b>] wait_on_page_bit+0x72/0x74
> > [<ffffffff811030f9>] migrate_pages+0x17c/0x36f
> > [<ffffffff810fa24a>] compact_zone+0x467/0x68b
> > [<ffffffff810fa6a7>] try_to_compact_pages+0x14c/0x1b3
> > [<ffffffff810cbda1>] __alloc_pages_direct_compact+0xa7/0x15a
> > [<ffffffff810cc4ec>] __alloc_pages_nodemask+0x698/0x71d
> > [<ffffffff810f89c2>] alloc_pages_vma+0xf5/0xfa
> > [<ffffffff8110683f>] do_huge_pmd_anonymous_page+0xbe/0x227
> > [<ffffffff810e2bf4>] handle_mm_fault+0x113/0x1ce
> > [<ffffffff8102fe3d>] do_page_fault+0x2d7/0x31e
> > [<ffffffff812fe535>] page_fault+0x25/0x30
> > [<ffffffffffffffff>] 0xffffffffffffffff
> >
> > And it stays stuck there for long enough for me to find the thread and
> > attach strace. Apparently it was stuck in
> >
> > 1320640739.201474 munmap(0x7f5c06b00000, 2097152) = 0
> >
> > for something between 20 and 60 seconds.
>
> That's not nice. Apparently you are using transparent hugepages and the
> stuck application tried to allocate a hugepage. But to allocate a hugepage
> you need a physically contiguous set of pages and try_to_compact_pages()
> is trying to achieve exactly that. But some of the pages that need moving
> around are stuck for a long time - likely being submitted to your USB
> stick for writing. So all in all I'm not *that* surprised you see what you
> see.
>
Neither am I. It matches other reports I've heard over the last week.
> > There's no reason to let a 6MB/sec high latency device lock up 600 MB of
> > dirty pages. I'll have to wait a hundred seconds after my app exits
> > before the system will return to usability.
> >
> > And there's no way, AFAICS, for me to work around this behavior in
> > userland.
>
> There is - you can use /sys/block/<device>/bdi/max_ratio to tune how much
> of dirty cache that device can take. Dirty cache is set to 20% of your
> total memory by default so that amounts to ~1.6 GB. So if you tune
> max_ratio to say 5, you will get at most 80 MB of dirty pages against your
> USB stick, which should be about right. You can even create a udev
> rule so that when a USB stick is inserted, it automatically sets
> max_ratio for it to 5...
>
This kinda hacks around the problem although it should work.
You should also be able to "work around" the problem by disabling
transparent hugepages.
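For reference, on kernels of this era that workaround amounts to flipping the sysfs knobs under /sys/kernel/mm/transparent_hugepage/ (the writes require root):

```shell
# Show which THP mode is active (the bracketed word)
cat /sys/kernel/mm/transparent_hugepage/enabled
# As root, disable THP entirely:
# echo never > /sys/kernel/mm/transparent_hugepage/enabled
# Or keep THP but stop defrag/compaction at fault time, which is the
# part that stalls in this thread:
# echo never > /sys/kernel/mm/transparent_hugepage/defrag
```

This is a config/admin fragment; the paths exist only on kernels built with CONFIG_TRANSPARENT_HUGEPAGE.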
Finally, can you give this patch a whirl? It's a bit rough and ready
and I'm trying to see a cleaner way of allowing khugepaged to use
sync compaction on CONFIG_NUMA, but I'd be interested in confirming
it's the right direction. I have tested it, sort of, but not against
mainline. In my tests though I found that time spent stalled in
compaction was reduced by 47 seconds during a test lasting 30 minutes.
There is a drop in the number of transparent hugepages used but it's
minor and could be within the noise as khugepaged is still able to
use sync compaction.
==== CUT HERE ====
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 3a76faf..e2fbfee 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -328,18 +328,22 @@ alloc_pages(gfp_t gfp_mask, unsigned int order)
}
extern struct page *alloc_pages_vma(gfp_t gfp_mask, int order,
struct vm_area_struct *vma, unsigned long addr,
- int node);
+ int node, bool drop_mmapsem);
#else
#define alloc_pages(gfp_mask, order) \
alloc_pages_node(numa_node_id(), gfp_mask, order)
-#define alloc_pages_vma(gfp_mask, order, vma, addr, node) \
- alloc_pages(gfp_mask, order)
+#define alloc_pages_vma(gfp_mask, order, vma, addr, node, drop_mmapsem) \
+ ({ \
+ if (drop_mmapsem) \
+ up_read(&vma->vm_mm->mmap_sem); \
+ alloc_pages(gfp_mask, order); \
+ })
#endif
#define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
#define alloc_page_vma(gfp_mask, vma, addr) \
- alloc_pages_vma(gfp_mask, 0, vma, addr, numa_node_id())
+ alloc_pages_vma(gfp_mask, 0, vma, addr, numa_node_id(), false)
#define alloc_page_vma_node(gfp_mask, vma, addr, node) \
- alloc_pages_vma(gfp_mask, 0, vma, addr, node)
+ alloc_pages_vma(gfp_mask, 0, vma, addr, node, false)
extern unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);
extern unsigned long get_zeroed_page(gfp_t gfp_mask);
diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index 3a93f73..d5e7132 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -154,7 +154,7 @@ __alloc_zeroed_user_highpage(gfp_t movableflags,
unsigned long vaddr)
{
struct page *page = alloc_page_vma(GFP_HIGHUSER | movableflags,
- vma, vaddr);
+ vma, vaddr, false);
if (page)
clear_user_highpage(page, vaddr);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e2d1587..49449ea 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -655,10 +655,11 @@ static inline gfp_t alloc_hugepage_gfpmask(int defrag, gfp_t extra_gfp)
static inline struct page *alloc_hugepage_vma(int defrag,
struct vm_area_struct *vma,
unsigned long haddr, int nd,
- gfp_t extra_gfp)
+ gfp_t extra_gfp,
+ bool drop_mmapsem)
{
return alloc_pages_vma(alloc_hugepage_gfpmask(defrag, extra_gfp),
- HPAGE_PMD_ORDER, vma, haddr, nd);
+ HPAGE_PMD_ORDER, vma, haddr, nd, drop_mmapsem);
}
#ifndef CONFIG_NUMA
@@ -683,7 +684,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (unlikely(khugepaged_enter(vma)))
return VM_FAULT_OOM;
page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
- vma, haddr, numa_node_id(), 0);
+ vma, haddr, numa_node_id(), 0, false);
if (unlikely(!page)) {
count_vm_event(THP_FAULT_FALLBACK);
goto out;
@@ -911,7 +912,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (transparent_hugepage_enabled(vma) &&
!transparent_hugepage_debug_cow())
new_page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
- vma, haddr, numa_node_id(), 0);
+ vma, haddr, numa_node_id(), 0, false);
else
new_page = NULL;
@@ -1783,15 +1784,14 @@ static void collapse_huge_page(struct mm_struct *mm,
* the userland I/O paths. Allocating memory with the
* mmap_sem in read mode is good idea also to allow greater
* scalability.
+ *
+ * alloc_pages_vma drops the mmap_sem so that if the process
+ * faults or calls mmap then khugepaged will not stall it.
+ * The mmap_sem is taken for write later to confirm the VMA
+ * is still valid
*/
new_page = alloc_hugepage_vma(khugepaged_defrag(), vma, address,
- node, __GFP_OTHER_NODE);
-
- /*
- * After allocating the hugepage, release the mmap_sem read lock in
- * preparation for taking it in write mode.
- */
- up_read(&mm->mmap_sem);
+ node, __GFP_OTHER_NODE, true);
if (unlikely(!new_page)) {
count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
*hpage = ERR_PTR(-ENOMEM);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 9c51f9f..1a8c676 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1832,7 +1832,7 @@ static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
*/
struct page *
alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
- unsigned long addr, int node)
+ unsigned long addr, int node, bool drop_mmapsem)
{
struct mempolicy *pol = get_vma_policy(current, vma, addr);
struct zonelist *zl;
@@ -1844,16 +1844,21 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order);
mpol_cond_put(pol);
+ if (drop_mmapsem)
+ up_read(&vma->vm_mm->mmap_sem);
page = alloc_page_interleave(gfp, order, nid);
put_mems_allowed();
return page;
}
zl = policy_zonelist(gfp, pol, node);
if (unlikely(mpol_needs_cond_ref(pol))) {
+ struct page *page;
/*
* slow path: ref counted shared policy
*/
- struct page *page = __alloc_pages_nodemask(gfp, order,
+ if (drop_mmapsem)
+ up_read(&vma->vm_mm->mmap_sem);
+ page = __alloc_pages_nodemask(gfp, order,
zl, policy_nodemask(gfp, pol));
__mpol_put(pol);
put_mems_allowed();
@@ -1862,6 +1867,8 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
/*
* fast path: default or task policy
*/
+ if (drop_mmapsem)
+ up_read(&vma->vm_mm->mmap_sem);
page = __alloc_pages_nodemask(gfp, order, zl,
policy_nodemask(gfp, pol));
put_mems_allowed();
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6e8ecb6..2f87f92 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2161,7 +2161,16 @@ rebalance:
sync_migration);
if (page)
goto got_pg;
- sync_migration = true;
+
+ /*
+ * Do not use sync migration for processes allocating transparent
+ * hugepages as it could stall writing back pages which is far worse
+ * than simply failing to promote a page. We still allow khugepaged
+ * to allocate as it should drop the mmap_sem before trying to
+ * allocate the page so it's acceptable for it to stall
+ */
+ sync_migration = (current->flags & PF_KTHREAD) ||
+ !(gfp_mask & __GFP_NO_KSWAPD);
/* Try direct reclaim and then allocating */
page = __alloc_pages_direct_reclaim(gfp_mask, order,
* Re: long sleep_on_page delays writing to slow storage
From: Andrea Arcangeli @ 2011-11-09 18:06 UTC
To: Mel Gorman
Cc: Jan Kara, Andy Isaacson, linux-kernel, linux-mm, Johannes Weiner
On Wed, Nov 09, 2011 at 05:52:01PM +0000, Mel Gorman wrote:
> -#define alloc_pages_vma(gfp_mask, order, vma, addr, node) \
> - alloc_pages(gfp_mask, order)
> +#define alloc_pages_vma(gfp_mask, order, vma, addr, node, drop_mmapsem) \
> + ({ \
> + if (drop_mmapsem) \
> + up_read(&vma->vm_mm->mmap_sem); \
> + alloc_pages(gfp_mask, order); \
> + })
I wouldn't change alloc_pages_vma. I think it's better to add a
variant and have it called only by khugepaged:
alloc_pages_vma_up_read(gfp_mask, order, vma, addr, node)
{
__alloc_pages_vma(gfp_mask, order, vma, addr, node, true);
}
alloc_pages_vma(gfp_mask, order, vma, addr, node)
{
__alloc_pages_vma(gfp_mask, order, vma, addr, node, false);
}
I wonder if a change like this would be enough?
sync_migration = !(gfp_mask & __GFP_NO_KSWAPD);
But even if hidden in a new function, the main downside overall is the
fact we'll pass one more var through the stack of fast paths.
Johannes I recall you reported this too and Mel suggested the above
change, did it help in the end?
Your change in khugepaged context makes perfect sense anyway, just we
should be sure it's really needed before adding more variables in fast
path I think.
* Re: long sleep_on_page delays writing to slow storage
From: Mel Gorman @ 2011-11-10 0:53 UTC
To: Andrea Arcangeli
Cc: Jan Kara, Andy Isaacson, linux-kernel, linux-mm, Johannes Weiner
On Wed, Nov 09, 2011 at 07:06:46PM +0100, Andrea Arcangeli wrote:
> On Wed, Nov 09, 2011 at 05:52:01PM +0000, Mel Gorman wrote:
> > -#define alloc_pages_vma(gfp_mask, order, vma, addr, node) \
> > - alloc_pages(gfp_mask, order)
> > +#define alloc_pages_vma(gfp_mask, order, vma, addr, node, drop_mmapsem) \
> > + ({ \
> > + if (drop_mmapsem) \
> > + up_read(&vma->vm_mm->mmap_sem); \
> > + alloc_pages(gfp_mask, order); \
> > + })
>
> I wouldn't change alloc_pages_vma. I think it's better to add and have
> that called only by khugepaged:
>
> alloc_pages_vma_up_read(gfp_mask, order, vma, addr, node)
> {
> __alloc_pages_vma(gfp_mask, order, vma, addr, node, true);
> }
>
> alloc_pages_vma(gfp_mask, order, vma, addr, node)
> {
> __alloc_pages_vma(gfp_mask, order, vma, addr, node, false);
> }
>
Yeah, it would be cleaner. I thought that myself after adjusting the
alloc_pages_vma() function but hadn't gone back to it. The "bodge" was
only written earlier.
> I wonder if a change like this would be enough?
>
> sync_migration = !(gfp_mask & __GFP_NO_KSWAPD);
>
It would, and in fact this is patch 1 of 2; it reduced all the
THP-related stalls for me. The number of THPs was reduced though,
because khugepaged was no longer using sync compaction - hence patch
2, which altered alloc_pages_vma. This was tested on CONFIG_NUMA, so
testing for PF_KTHREAD would be insufficient as khugepaged would call
sync compaction with the mmap_sem held, causing indirect stalls.
> But even if hidden in a new function, the main downside overall is the
> fact we'll pass one more var through the stack of fast paths.
>
Unfortunate but the alternative is not allowing khugepaged to use sync
compaction. Are you ok with that? The number of THPs in use was reduced
but it also was during a somewhat unrealistic stress test so it might
not matter.
> Johannes I recall you reported this too and Mel suggested the above
> change, did it help in the end?
>
> Your change in khugepaged context makes perfect sense anyway, just we
> should be sure it's really needed before adding more variables in fast
> path I think.
It's not really needed to avoid stalls - just !(gfp_mask &
__GFP_NO_KSWAPD) is enough for that. It's only needed if we want
khugepaged to continue using sync compaction without stalling processes
due to holding mmap_sem.
--
Mel Gorman
SUSE Labs
* Re: long sleep_on_page delays writing to slow storage
From: Andrea Arcangeli @ 2011-11-10 1:54 UTC
To: Mel Gorman
Cc: Jan Kara, Andy Isaacson, linux-kernel, linux-mm, Johannes Weiner
On Thu, Nov 10, 2011 at 12:53:07AM +0000, Mel Gorman wrote:
> compaction. Are you ok with that? The number of THPs in use was reduced
> but it also was during a somewhat unrealistic stress test so it might
> not matter.
I think having more THP collapsed during the unrealistic load is not
so important; the unrealistic load is likely dominated not by TLB
misses but by kernel load, so even if it materializes it shouldn't make
a difference. And khugepaged will just retry at the next pass anyway,
so it doesn't matter if it's delayed a bit I think. And retrying on
the same address with __GFP_OTHER_NODE doesn't sound like a good idea.
> It's not really needed to avoid stalls - just !(gfp_mask &
> __GFP_NO_KSWAPD) is enough for that. It's only needed if we want
I would go with this first. You can keep the second patch in queue,
but considering it's altering fast paths that affect the no-THP
config too, we could at least benchmark it to be sure the cost is not
measurable. I guess it's not, but hey, if it's not needed then we
shouldn't care.
And it was already ok, we thought it didn't matter so we reversed it
in c6a140bf164829769499b5e50d380893da39b29e but it clearly matters for
usb stick, so I would simply reapply it.
One reason we reversed it was also the fact that it wasn't so clean to
take that decision based on __GFP_NO_KSWAPD. I think it's probably
cleaner to check whether __GFP_NORETRY is set instead of whether
__GFP_NO_KSWAPD is set.
That flag should indicate we don't really care too much if we fail the
allocation or not and not to go too hard on it, and notably those are
the allocations that are totally ok to fail without having to trigger
OOM, so again not worth going the extra mile to succeed them.
Alternatively we could check __GFP_NOFAIL but that's mostly obsolete;
yet another alternative is to check order > PAGE_ALLOC_COSTLY_ORDER,
but you know any additional PAGE_ALLOC_COSTLY_ORDER check tends to
make me unhappy, as the behavior changes enormously from
PAGE_ALLOC_COSTLY_ORDER to PAGE_ALLOC_COSTLY_ORDER+1 and that's an
arbitrary number that doesn't justify a big change in behavior. So the
fewer PAGE_ALLOC_COSTLY_ORDER checks the better; ideally there shall be none :)
So I would suggest to resubmit the 1/2 patch changed to __GFP_NORETRY
or just a plain revert with __GFP_NO_KSWAPD if you don't like the
__GFP_NORETRY.
And to queue up the change to alloc_pages_vma for later; it's not
a bad idea at all, but it only pays off for khugepaged, and 99% of
userland allocations aren't happening there.
* Re: long sleep_on_page delays writing to slow storage
From: Johannes Weiner @ 2011-11-10 9:34 UTC
To: Andrea Arcangeli
Cc: Mel Gorman, Jan Kara, Andy Isaacson, linux-kernel, linux-mm
On Wed, Nov 09, 2011 at 07:06:46PM +0100, Andrea Arcangeli wrote:
> On Wed, Nov 09, 2011 at 05:52:01PM +0000, Mel Gorman wrote:
> > -#define alloc_pages_vma(gfp_mask, order, vma, addr, node) \
> > - alloc_pages(gfp_mask, order)
> > +#define alloc_pages_vma(gfp_mask, order, vma, addr, node, drop_mmapsem) \
> > + ({ \
> > + if (drop_mmapsem) \
> > + up_read(&vma->vm_mm->mmap_sem); \
> > + alloc_pages(gfp_mask, order); \
> > + })
>
> I wouldn't change alloc_pages_vma. I think it's better to add and have
> that called only by khugepaged:
>
> alloc_pages_vma_up_read(gfp_mask, order, vma, addr, node)
> {
> __alloc_pages_vma(gfp_mask, order, vma, addr, node, true);
> }
>
> alloc_pages_vma(gfp_mask, order, vma, addr, node)
> {
> __alloc_pages_vma(gfp_mask, order, vma, addr, node, false);
> }
>
> I wonder if a change like this would be enough?
>
> sync_migration = !(gfp_mask & __GFP_NO_KSWAPD);
>
> But even if hidden in a new function, the main downside overall is the
> fact we'll pass one more var through the stack of fast paths.
>
> Johannes I recall you reported this too and Mel suggested the above
> change, did it help in the end?
Yes, it completely fixed the latency problem.
That said, I haven't looked at the impact on the THP success rate, but
a regression there is probably less severe than half-minute-stalls in
interactive applications.
* Re: long sleep_on_page delays writing to slow storage
From: Dave Jones @ 2011-11-14 18:47 UTC
To: Johannes Weiner
Cc: Andrea Arcangeli, Mel Gorman, Jan Kara, Andy Isaacson,
linux-kernel, linux-mm, kernel-team
On Thu, Nov 10, 2011 at 10:34:42AM +0100, Johannes Weiner wrote:
> > I wonder if a change like this would be enough?
> >
> > sync_migration = !(gfp_mask & __GFP_NO_KSWAPD);
> >
> > But even if hidden in a new function, the main downside overall is the
> > fact we'll pass one more var through the stack of fast paths.
> >
> > Johannes I recall you reported this too and Mel suggested the above
> > change, did it help in the end?
>
> Yes, it completely fixed the latency problem.
>
> That said, I haven't looked at the impact on the THP success rate, but
> a regression there is probably less severe than half-minute-stalls in
> interactive applications.
FWIW, we've had a few reports from Fedora users since we moved to 3.x kernels
about similar problems, so whatever the fix is for this should probably
go to stable too.
I could push an update for Fedora users to test the change above if
that would be helpful?
Dave
* Re: long sleep_on_page delays writing to slow storage
From: Mel Gorman @ 2011-11-15 10:13 UTC
To: Dave Jones, Johannes Weiner, Andrea Arcangeli, Jan Kara,
Andy Isaacson, linux-kernel, linux-mm, kernel-team
On Mon, Nov 14, 2011 at 01:47:17PM -0500, Dave Jones wrote:
> On Thu, Nov 10, 2011 at 10:34:42AM +0100, Johannes Weiner wrote:
>
> > > I wonder if a change like this would be enough?
> > >
> > > sync_migration = !(gfp_mask & __GFP_NO_KSWAPD);
> > >
> > > But even if hidden in a new function, the main downside overall is the
> > > fact we'll pass one more var through the stack of fast paths.
> > >
> > > Johannes I recall you reported this too and Mel suggested the above
> > > change, did it help in the end?
> >
> > Yes, it completely fixed the latency problem.
> >
> > That said, I haven't looked at the impact on the THP success rate, but
> > a regression there is probably less severe than half-minute-stalls in
> > interactive applications.
>
> FWIW, we've had a few reports from Fedora users since we moved to 3.x kernels
> about similar problems, so whatever the fix is for this should probably
> go to stable too.
>
Agreed. I made note of that when I sent a smaller patch to Andrew so
that it would be picked up by distros.
> I could push an update for Fedora users to test the change above if
> that would be helpful ?
>
It would be helpful if you could pick up the patch at
https://lkml.org/lkml/2011/11/10/173 as this is what I expect will
reach -stable eventually. It would be even better if one of the bug
reporters could test before and after that patch and report if it
fixes their problem or not.
If they are still experiencing major stalls, I have an experimental
script that may be able to capture stack traces of processes stalled
for more than 1 second. I've had some success with it locally so
maybe they could try it out to identify if it's THP or something else.
--
Mel Gorman
SUSE Labs
* Re: long sleep_on_page delays writing to slow storage
From: Dave Jones @ 2011-11-17 19:47 UTC
To: Mel Gorman
Cc: Johannes Weiner, Andrea Arcangeli, Jan Kara, Andy Isaacson,
linux-kernel, linux-mm, kernel-team
On Tue, Nov 15, 2011 at 10:13:13AM +0000, Mel Gorman wrote:
> If they are still experiencing major stalls, I have an experimental
> script that may be able to capture stack traces of processes stalled
> for more than 1 second. I've had some success with it locally so
> maybe they could try it out to identify if it's THP or something else.
I'm not sure if it's the same problem, but I'd be interested in trying
that script.
When I build a kernel on my laptop, when it gets to the final link stage,
and there's a ton of IO, my entire X session wedges for a few seconds.
This may be unrelated, because this is on an SSD, which shouldn't suffer
from the slow IO of the USB devices mentioned in this thread.
(This is even with that patch applied btw, perhaps adding further fuel to
the idea that it's unrelated).
Dave
* Re: long sleep_on_page delays writing to slow storage
From: Andrea Arcangeli @ 2011-11-17 22:53 UTC
To: Dave Jones, Mel Gorman, Johannes Weiner, Jan Kara, Andy Isaacson,
linux-kernel, linux-mm, kernel-team
Hi Dave,
On Thu, Nov 17, 2011 at 02:47:20PM -0500, Dave Jones wrote:
> When I build a kernel on my laptop, when it gets to the final link stage,
> and there's a ton of IO, my entire X session wedges for a few seconds.
> This may be unrelated, because this is on an SSD, which shouldn't suffer
> from the slow IO of the USB devices mentioned in this thread.
>
> (This is even with that patch applied btw, perhaps adding further fuel to
> the idea that it's unrelated).
Agreed, if it happens even with the patch applied it's most certainly
not related to compaction. Maybe more IO scheduler related, or
something like that. You can also try to set
transparent_hugepage/defrag to "never" to be sure, then compaction
won't run.
* Re: long sleep_on_page delays writing to slow storage
From: Mel Gorman @ 2011-11-18 11:11 UTC
To: Dave Jones, Johannes Weiner, Andrea Arcangeli, Jan Kara,
Andy Isaacson, linux-kernel, linux-mm, kernel-team
[-- Attachment #1: Type: text/plain, Size: 2592 bytes --]
On Thu, Nov 17, 2011 at 02:47:20PM -0500, Dave Jones wrote:
> On Tue, Nov 15, 2011 at 10:13:13AM +0000, Mel Gorman wrote:
>
> > If they are still experiencing major stalls, I have an experimental
> > script that may be able to capture stack traces of processes stalled
> > for more than 1 second. I've had some success with it locally so
> > maybe they could try it out to identify if it's THP or something else.
>
> I'm not sure if it's the same problem, but I'd be interested in trying
> that script.
>
The monitor script is attached as watch-dstate.pl. Run it as
watch-dstate.pl -o logfile
I'm also attaching a post-processing script, stap-dstate-frequency.
cat logfile | stap-dstate-frequency
will report on the unique stack traces, what got stuck in them and
for how long. Unfortunately, this requires a working systemtap
installation because it had to work on systems without ftrace. Usually
getting systemtap going is a case of installing the package and the
kernel debugging symbols, but mileage varies.
I ran this for a few days on my own desktop but found that the worst
stalls for firefox and evolution were in futex_wait with the second
worst in
[<ffffffffa018e3c5>] ext4_sync_file+0x225/0x290 [ext4]
[<ffffffff81178250>] do_fsync+0x50/0x80
[<ffffffff8117852e>] sys_fdatasync+0xe/0x20
[<ffffffff81448592>] system_call_fastpath+0x16/0x1b
The stall timing is approximate at best. If you find the stall figures
are way too high or unrealistic, try running with --accurate-stall. The
stall figures will be much more accurate but depending on your
kernel version, the stack traces may be one line long ending with
kretprobe_trampoline.
> When I build a kernel on my laptop, when it gets to the final link stage,
> and there's a ton of IO, my entire X session wedges for a few seconds.
> This may be unrelated, because this is on an SSD, which shouldn't suffer
> from the slow IO of the USB devices mentioned in this thread.
>
I have a vague suspicion that there are some interactivity issues
around SSDs but I don't know why that is. I'm basing this on some
complaints of audio skipping during heavy kernel compiles on machines
very similar to my own, except that mine uses rotary storage. It's on
the Christmas list to buy myself an SSD to take a closer look.
> (This is even with that patch applied btw, perhaps adding further fuel to
> the idea that it's unrelated).
>
I suspect it's not compaction related in that case but the script may be
able to tell for sure. If it does not catch anything, alter this line to
have a smaller threshold
global stall_threshold = 1000
--
Mel Gorman
SUSE Labs
[-- Attachment #2: watch-dstate.pl --]
[-- Type: application/x-perl, Size: 10807 bytes --]
[-- Attachment #3: stap-dstate-frequency --]
[-- Type: text/plain, Size: 2432 bytes --]
#!/usr/bin/perl
# This script reads the output from the dstate monitor and reports how
# many unique stack traces there were and what the stall times were.
# The objective is to identify the worst sources of stalls.
#
# Copyright Mel Gorman 2011
use strict;
use warnings;

my %unique_event_counts;
my %unique_event_stall;
my %unique_event_stall_details;
my $total_stalled = 0;
my ($stalltime, $event) = (-1, "");
my ($trace, $first_trace, $reading_trace, $skip) = ("", "", 0, 1);

# Record the event that has just finished being parsed, unless it was
# marked uninteresting or no trace was collected for it
sub record_event {
	return if $skip || $trace eq "";
	$unique_event_counts{$trace}++;
	$unique_event_stall_details{$trace} .= $event;
	if ($stalltime >= 0) {
		$unique_event_stall{$trace} += $stalltime;
		$total_stalled += $stalltime;
	}
}

while (my $line = <>) {
	# Watch for the beginning of a new event
	if ($line =~ /^time ([0-9]*): ([0-9]*) \((.*)\) Stalled: ([-0-9]*) ms: (.*)/) {
		my ($process, $stall_ms, $function) = ($3, $4, $5);

		# Record the previous event before starting the next one
		record_event();

		# Start the next event
		$event = sprintf "%-20s %-20s %6d ms\n", $process, $function, $stall_ms;
		$reading_trace = 0;
		$stalltime = $stall_ms;
		$first_trace = "";
		$trace = "";
		$skip = 1;
	}

	# If we have reached a trace, blindly read it
	if ($reading_trace) {
		# Ignore traces that look like they are from user space
		if ($line =~ /^\[<0/) {
			$reading_trace = 0;
			next;
		}
		$trace .= $line;
		if ($first_trace eq "") {
			$first_trace = $line;

			# Skip traces that are expected to sleep
			$skip = 0;
			$skip = 1 if $first_trace =~ / do_poll\+/ ||
				     $first_trace =~ / kthread\+/ ||
				     $first_trace =~ / khugepaged\+/ ||
				     $first_trace =~ / sys_epoll_wait\+/ ||
				     $first_trace =~ / kswapd\+/;
		}
		next;
	}

	# A kernel address marks the start of a stack trace
	if ($line =~ /^\[<f/) {
		$reading_trace = 1;
		next;
	}
}

# Record the final event, which has no successor to flush it
record_event();

print "Overall stalled time: $total_stalled ms\n\n";
foreach my $t (sort {$unique_event_stall{$b} <=> $unique_event_stall{$a}} keys %unique_event_stall) {
	printf "Time stalled in this event: %8d ms\n", $unique_event_stall{$t};
	printf "Event count: %8d\n", $unique_event_counts{$t};
	print $unique_event_stall_details{$t};
	print "$t\n";
}
* Re: long sleep_on_page delays writing to slow storage
2011-11-18 11:11 ` Mel Gorman
@ 2011-11-18 12:19 ` Josh Boyer
0 siblings, 0 replies; 14+ messages in thread
From: Josh Boyer @ 2011-11-18 12:19 UTC (permalink / raw)
To: Mel Gorman
Cc: Dave Jones, Johannes Weiner, Andrea Arcangeli, Jan Kara,
Andy Isaacson, linux-kernel, linux-mm, kernel-team
On Fri, Nov 18, 2011 at 11:11:40AM +0000, Mel Gorman wrote:
> On Thu, Nov 17, 2011 at 02:47:20PM -0500, Dave Jones wrote:
> > On Tue, Nov 15, 2011 at 10:13:13AM +0000, Mel Gorman wrote:
> >
> > > If they are still experiencing major stalls, I have an experimental
> > > script that may be able to capture stack traces of processes stalled
> > > for more than 1 second. I've had some success with it locally so
> > > maybe they could try it out to identify if it's THP or something else.
> >
> > I'm not sure if it's the same problem, but I'd be interested in trying
> > that script.
> >
>
> Monitor script is attached as watch-dstate.pl. Run it as
>
> watch-dstate.pl -o logfile
>
> I'm also attaching a post-processing script stap-dstate-frequency.
>
> cat logfile | stap-dstate-frequency
>
> will report on the unique stack traces, what got stuck in them and
> for how long. Unfortunately, this requires a working systemtap
> installation because it had to work on systems without ftrace. Usually
> getting systemtap going is a case of installing the package and the
> kernel debugging symbols, but mileage varies.
>
> I ran this for a few days on my own desktop but found that the worst
> stalls for firefox and evolution were in futex_wait with the second
> worst in
>
> [<ffffffffa018e3c5>] ext4_sync_file+0x225/0x290 [ext4]
> [<ffffffff81178250>] do_fsync+0x50/0x80
> [<ffffffff8117852e>] sys_fdatasync+0xe/0x20
> [<ffffffff81448592>] system_call_fastpath+0x16/0x1b
>
> The stall timing is approximate at best. If you find the stall figures
> are way too high or unrealistic, try running with --accurate-stall. The
> stall figures will be much more accurate but depending on your
> kernel version, the stack traces may be one line long ending with
> kretprobe_trampoline.
>
> > When I build a kernel on my laptop, when it gets to the final link stage,
> > and there's a ton of IO, my entire X session wedges for a few seconds.
> > This may be unrelated, because this is on an SSD, which shouldn't suffer
> > from the slow IO of the USB devices mentioned in this thread.
> >
>
> I have a vague suspicion that there are some interactivity issues
> around SSDs but I don't know why that is. I'm basing this on some
> complaints of audio skipping during heavy kernel compiles on machines
> very similar to my own, except that mine uses rotary storage. It's on
> the Christmas list to buy myself an SSD to take a closer look.
I see similar pauses on my laptop, which doesn't have an SSD, so I don't
think it's drive related. It could certainly be I/O scheduling or ext4
sync related, but probably not a specific drive technology.
I also wonder if dirty_background_ratio and dirty_ratio might be too
high on this machine, and the flushing of the page cache is causing
excessive I/O.
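As a sketch of that experiment (the values below are illustrative
guesses, not recommendations; the writes need root):

```shell
# Current thresholds, as a percentage of memory that may be dirty
cat /proc/sys/vm/dirty_background_ratio
cat /proc/sys/vm/dirty_ratio

# Start background writeback earlier and cap total dirty memory lower
sysctl -w vm.dirty_background_ratio=2
sysctl -w vm.dirty_ratio=5
```

Restoring the defaults is just a matter of writing the old values back.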
> > (This is even with that patch applied btw, perhaps adding further fuel to
> > the idea that it's unrelated).
> >
>
> I suspect it's not compaction related in that case but the script may be
> able to tell for sure. If it does not catch anything, alter this line to
> have a smaller threshold
>
> global stall_threshold = 1000
I'll give this a try as well.
* Re: long sleep_on_page delays writing to slow storage
2011-11-17 19:47 ` Dave Jones
2011-11-17 22:53 ` Andrea Arcangeli
2011-11-18 11:11 ` Mel Gorman
@ 2011-11-21 9:18 ` Johannes Weiner
2 siblings, 0 replies; 14+ messages in thread
From: Johannes Weiner @ 2011-11-21 9:18 UTC (permalink / raw)
To: Dave Jones, Mel Gorman, Andrea Arcangeli, Jan Kara, Andy Isaacson,
linux-kernel, linux-mm, kernel-team
On Thu, Nov 17, 2011 at 02:47:20PM -0500, Dave Jones wrote:
> On Tue, Nov 15, 2011 at 10:13:13AM +0000, Mel Gorman wrote:
>
> > If they are still experiencing major stalls, I have an experimental
> > script that may be able to capture stack traces of processes stalled
> > for more than 1 second. I've had some success with it locally so
> > maybe they could try it out to identify if it's THP or something else.
>
> I'm not sure if it's the same problem, but I'd be interested in trying
> that script.
>
> When I build a kernel on my laptop, when it gets to the final link stage,
> and there's a ton of IO, my entire X session wedges for a few seconds.
> This may be unrelated, because this is on an SSD, which shouldn't suffer
> from the slow IO of the USB devices mentioned in this thread.
>
> (This is even with that patch applied btw, perhaps adding further fuel to
> the idea that it's unrelated).
We still have the problem that individual zones may fill up
disproportionately with dirty pages, and reclaim can take a while to
make progress in such zones.
Would you mind trying the per-zone dirty limits patch set? You can
find it here:
http://cmpxchg.org/~hannes/kernel/mm-per-zone-dirty-limits/
git am pzd.mbox should work on 3.2-rc1.
end of thread, other threads:[~2011-11-21 9:18 UTC | newest]
Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-11-07 4:59 long sleep_on_page delays writing to slow storage Andy Isaacson
2011-11-09 17:00 ` Jan Kara
2011-11-09 17:52 ` Mel Gorman
2011-11-09 18:06 ` Andrea Arcangeli
2011-11-10 0:53 ` Mel Gorman
2011-11-10 1:54 ` Andrea Arcangeli
2011-11-10 9:34 ` Johannes Weiner
2011-11-14 18:47 ` Dave Jones
2011-11-15 10:13 ` Mel Gorman
2011-11-17 19:47 ` Dave Jones
2011-11-17 22:53 ` Andrea Arcangeli
2011-11-18 11:11 ` Mel Gorman
2011-11-18 12:19 ` Josh Boyer
2011-11-21 9:18 ` Johannes Weiner