* long sleep_on_page delays writing to slow storage
@ 2011-11-07  4:59 Andy Isaacson
  2011-11-09 17:00 ` Jan Kara
  0 siblings, 1 reply; 14+ messages in thread
From: Andy Isaacson @ 2011-11-07  4:59 UTC (permalink / raw)
To: linux-kernel, linux-mm

I am running 1a67a573b (3.1.0-09125 plus a small local patch) on a Core
i7, 8 GB RAM, writing a few GB of data to a slow SD card attached via
usb-storage with vfat. I mounted without specifying any options,

/dev/sdb1 /mnt/usb vfat rw,nosuid,nodev,noexec,relatime,uid=22448,gid=22448,fmask=0022,dmask=0022,codepage=cp437,iocharset=utf8,shortname=mixed,errors=remount-ro 0 0

and I'm using rsync to write the data.

We end up in a fairly steady state with a half GB dirty:

Dirty: 612280 kB

The dirty count stays high despite running sync(1) in another xterm.

The bug is,

Firefox (iceweasel 7.0.1-4) hangs at random intervals. One thread is
stuck in sleep_on_page

[<ffffffff810c50da>] sleep_on_page+0xe/0x12
[<ffffffff810c525b>] wait_on_page_bit+0x72/0x74
[<ffffffff811030f9>] migrate_pages+0x17c/0x36f
[<ffffffff810fa24a>] compact_zone+0x467/0x68b
[<ffffffff810fa6a7>] try_to_compact_pages+0x14c/0x1b3
[<ffffffff810cbda1>] __alloc_pages_direct_compact+0xa7/0x15a
[<ffffffff810cc4ec>] __alloc_pages_nodemask+0x698/0x71d
[<ffffffff810f89c2>] alloc_pages_vma+0xf5/0xfa
[<ffffffff8110683f>] do_huge_pmd_anonymous_page+0xbe/0x227
[<ffffffff810e2bf4>] handle_mm_fault+0x113/0x1ce
[<ffffffff8102fe3d>] do_page_fault+0x2d7/0x31e
[<ffffffff812fe535>] page_fault+0x25/0x30
[<ffffffffffffffff>] 0xffffffffffffffff

And it stays stuck there for long enough for me to find the thread and
attach strace. Apparently it was stuck in

1320640739.201474 munmap(0x7f5c06b00000, 2097152) = 0

for something between 20 and 60 seconds.

There's no reason to let a 6MB/sec high latency device lock up 600 MB of
dirty pages. I'll have to wait a hundred seconds after my app exits
before the system will return to usability.
And there's no way, AFAICS, for me to work around this behavior in
userland. And I don't understand how this compact_zone thing is
intended to work in this situation.

edited but nearly full dmesg at
http://web.hexapodia.org/~adi/snow/dmesg-3.1.0-09126-g4730284.txt

Thoughts?

Thanks,
-andy

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: long sleep_on_page delays writing to slow storage
  2011-11-07  4:59 long sleep_on_page delays writing to slow storage Andy Isaacson
@ 2011-11-09 17:00 ` Jan Kara
  2011-11-09 17:52   ` Mel Gorman
  0 siblings, 1 reply; 14+ messages in thread
From: Jan Kara @ 2011-11-09 17:00 UTC (permalink / raw)
To: Andy Isaacson; +Cc: linux-kernel, linux-mm, mgorman, aarcange

I've added to CC some mm developers who know much more about transparent
hugepages than I do because that is what seems to cause your problems...

On Sun 06-11-11 20:59:28, Andy Isaacson wrote:
> I am running 1a67a573b (3.1.0-09125 plus a small local patch) on a Core
> i7, 8 GB RAM, writing a few GB of data to a slow SD card attached via
> usb-storage with vfat. I mounted without specifying any options,
>
> /dev/sdb1 /mnt/usb vfat rw,nosuid,nodev,noexec,relatime,uid=22448,gid=22448,fmask=0022,dmask=0022,codepage=cp437,iocharset=utf8,shortname=mixed,errors=remount-ro 0 0
>
> and I'm using rsync to write the data.
>
> We end up in a fairly steady state with a half GB dirty:
>
> Dirty: 612280 kB
>
> The dirty count stays high despite running sync(1) in another xterm.
>
> The bug is,
>
> Firefox (iceweasel 7.0.1-4) hangs at random intervals.
> One thread is stuck in sleep_on_page
>
> [<ffffffff810c50da>] sleep_on_page+0xe/0x12
> [<ffffffff810c525b>] wait_on_page_bit+0x72/0x74
> [<ffffffff811030f9>] migrate_pages+0x17c/0x36f
> [<ffffffff810fa24a>] compact_zone+0x467/0x68b
> [<ffffffff810fa6a7>] try_to_compact_pages+0x14c/0x1b3
> [<ffffffff810cbda1>] __alloc_pages_direct_compact+0xa7/0x15a
> [<ffffffff810cc4ec>] __alloc_pages_nodemask+0x698/0x71d
> [<ffffffff810f89c2>] alloc_pages_vma+0xf5/0xfa
> [<ffffffff8110683f>] do_huge_pmd_anonymous_page+0xbe/0x227
> [<ffffffff810e2bf4>] handle_mm_fault+0x113/0x1ce
> [<ffffffff8102fe3d>] do_page_fault+0x2d7/0x31e
> [<ffffffff812fe535>] page_fault+0x25/0x30
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> And it stays stuck there for long enough for me to find the thread and
> attach strace. Apparently it was stuck in
>
> 1320640739.201474 munmap(0x7f5c06b00000, 2097152) = 0
>
> for something between 20 and 60 seconds.

That's not nice. Apparently you are using transparent hugepages and the
stuck application tried to allocate a hugepage. But to allocate a hugepage
you need a physically contiguous set of pages, and try_to_compact_pages()
is trying to achieve exactly that. But some of the pages that need moving
around are stuck for a long time - likely they are being submitted to your
USB stick for writing. So all in all I'm not *that* surprised you see what
you see.

> There's no reason to let a 6MB/sec high latency device lock up 600 MB of
> dirty pages. I'll have to wait a hundred seconds after my app exits
> before the system will return to usability.
>
> And there's no way, AFAICS, for me to work around this behavior in
> userland.

There is - you can use /sys/block/<device>/bdi/max_ratio to tune how much
of the dirty cache that device can take. Dirty cache is set to 20% of your
total memory by default, so that amounts to ~1.6 GB. So if you tune
max_ratio to, say, 5, you will get at most 80 MB of dirty pages against
your USB stick, which should be about appropriate.
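The arithmetic behind that suggestion can be checked in a few lines of shell. This is an illustrative sketch rather than part of the original mail: the 8 GB RAM figure and the default dirty ratio of 20% come from the thread, while the device name sdb is only an assumption.

```shell
#!/bin/sh
# Global dirty limit = RAM * vm.dirty_ratio%; a bdi's share of it is
# then capped at max_ratio% of that global limit.
ram_kb=$((8 * 1024 * 1024))   # 8 GB machine, in kB
dirty_ratio=20                # default vm.dirty_ratio
max_ratio=5                   # proposed per-device cap

global_limit_kb=$((ram_kb * dirty_ratio / 100))
device_limit_kb=$((global_limit_kb * max_ratio / 100))

echo "global dirty limit: ${global_limit_kb} kB"  # ~1.6 GB
echo "per-device limit:   ${device_limit_kb} kB"  # ~80 MB

# To apply this cap on a live system (as root; device name assumed):
#   echo 5 > /sys/block/sdb/bdi/max_ratio
```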
You can even create a udev rule so that when a USB stick is inserted, it
automatically sets max_ratio for it to 5...

> And I don't understand how this compact_zone thing is intended to work
> in this situation.
>
> edited but nearly full dmesg at
> http://web.hexapodia.org/~adi/snow/dmesg-3.1.0-09126-g4730284.txt

								Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: long sleep_on_page delays writing to slow storage
  2011-11-09 17:00 ` Jan Kara
@ 2011-11-09 17:52   ` Mel Gorman
  2011-11-09 18:06     ` Andrea Arcangeli
  0 siblings, 1 reply; 14+ messages in thread
From: Mel Gorman @ 2011-11-09 17:52 UTC (permalink / raw)
To: Jan Kara; +Cc: Andy Isaacson, linux-kernel, linux-mm, aarcange

On Wed, Nov 09, 2011 at 06:00:27PM +0100, Jan Kara wrote:
> I've added to CC some mm developers who know much more about transparent
> hugepages than I do because that is what seems to cause your problems...
>

This has hit more than once recently. It's not something I can reproduce
locally as such, but the problem seems to be two-fold:

1. processes getting stuck in synchronous reclaim
2. processes faulting at the same time khugepaged is allocating a huge
   page with CONFIG_NUMA enabled

The first one is easily fixed, the second one not so much. I'm
prototyping two patches at the moment and sending them through tests.

> On Sun 06-11-11 20:59:28, Andy Isaacson wrote:
> > I am running 1a67a573b (3.1.0-09125 plus a small local patch) on a Core
> > i7, 8 GB RAM, writing a few GB of data to a slow SD card attached via
> > usb-storage with vfat. I mounted without specifying any options,
> >
> > /dev/sdb1 /mnt/usb vfat rw,nosuid,nodev,noexec,relatime,uid=22448,gid=22448,fmask=0022,dmask=0022,codepage=cp437,iocharset=utf8,shortname=mixed,errors=remount-ro 0 0
> >
> > and I'm using rsync to write the data.
> >

Sounds similar to the cases I'm hearing about - copying from NFS to USB
with applications freezing, where disabling transparent hugepages
sometimes helps.

> > We end up in a fairly steady state with a half GB dirty:
> >
> > Dirty: 612280 kB
> >
> > The dirty count stays high despite running sync(1) in another xterm.
> >
> > The bug is,
> >
> > Firefox (iceweasel 7.0.1-4) hangs at random intervals.
> > One thread is stuck in sleep_on_page
> >
> > [<ffffffff810c50da>] sleep_on_page+0xe/0x12
> > [<ffffffff810c525b>] wait_on_page_bit+0x72/0x74
> > [<ffffffff811030f9>] migrate_pages+0x17c/0x36f
> > [<ffffffff810fa24a>] compact_zone+0x467/0x68b
> > [<ffffffff810fa6a7>] try_to_compact_pages+0x14c/0x1b3
> > [<ffffffff810cbda1>] __alloc_pages_direct_compact+0xa7/0x15a
> > [<ffffffff810cc4ec>] __alloc_pages_nodemask+0x698/0x71d
> > [<ffffffff810f89c2>] alloc_pages_vma+0xf5/0xfa
> > [<ffffffff8110683f>] do_huge_pmd_anonymous_page+0xbe/0x227
> > [<ffffffff810e2bf4>] handle_mm_fault+0x113/0x1ce
> > [<ffffffff8102fe3d>] do_page_fault+0x2d7/0x31e
> > [<ffffffff812fe535>] page_fault+0x25/0x30
> > [<ffffffffffffffff>] 0xffffffffffffffff
> >
> > And it stays stuck there for long enough for me to find the thread and
> > attach strace. Apparently it was stuck in
> >
> > 1320640739.201474 munmap(0x7f5c06b00000, 2097152) = 0
> >
> > for something between 20 and 60 seconds.
>
> That's not nice. Apparently you are using transparent hugepages and the
> stuck application tried to allocate a hugepage. But to allocate a hugepage
> you need a physically contiguous set of pages, and try_to_compact_pages()
> is trying to achieve exactly that. But some of the pages that need moving
> around are stuck for a long time - likely they are being submitted to your
> USB stick for writing. So all in all I'm not *that* surprised you see what
> you see.
>

Neither am I. It matches other reports I've heard over the last week.

> > There's no reason to let a 6MB/sec high latency device lock up 600 MB of
> > dirty pages. I'll have to wait a hundred seconds after my app exits
> > before the system will return to usability.
> >
> > And there's no way, AFAICS, for me to work around this behavior in
> > userland.
>
> There is - you can use /sys/block/<device>/bdi/max_ratio to tune how much
> of the dirty cache that device can take.
> Dirty cache is set to 20% of your total memory by default, so that
> amounts to ~1.6 GB. So if you tune max_ratio to, say, 5, you will get
> at most 80 MB of dirty pages against your USB stick, which should be
> about appropriate. You can even create a udev rule so that when a USB
> stick is inserted, it automatically sets max_ratio for it to 5...
>

This kinda hacks around the problem, although it should work. You
should also be able to "work around" the problem by disabling
transparent hugepages.

Finally, can you give this patch a whirl? It's a bit rough and ready
and I'm trying to see a cleaner way of allowing khugepaged to use sync
compaction on CONFIG_NUMA, but I'd be interested in confirming it's the
right direction. I have tested it sort of, but not against mainline. In
my tests though, I found that time spent stalled in compaction was
reduced by 47 seconds during a test lasting 30 minutes. There is a drop
in the number of transparent hugepages used, but it's minor and could
be within the noise, as khugepaged is still able to use sync
compaction.
==== CUT HERE ====
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 3a76faf..e2fbfee 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -328,18 +328,22 @@ alloc_pages(gfp_t gfp_mask, unsigned int order)
 }
 
 extern struct page *alloc_pages_vma(gfp_t gfp_mask, int order,
 			struct vm_area_struct *vma, unsigned long addr,
-			int node);
+			int node, bool drop_mmapsem);
 #else
 #define alloc_pages(gfp_mask, order) \
 		alloc_pages_node(numa_node_id(), gfp_mask, order)
-#define alloc_pages_vma(gfp_mask, order, vma, addr, node)	\
-	alloc_pages(gfp_mask, order)
+#define alloc_pages_vma(gfp_mask, order, vma, addr, node, drop_mmapsem) \
+	({								\
+		if (drop_mmapsem)					\
+			up_read(&vma->vm_mm->mmap_sem);			\
+		alloc_pages(gfp_mask, order);				\
+	})
 #endif
 #define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
 #define alloc_page_vma(gfp_mask, vma, addr)			\
-	alloc_pages_vma(gfp_mask, 0, vma, addr, numa_node_id())
+	alloc_pages_vma(gfp_mask, 0, vma, addr, numa_node_id(), false)
 #define alloc_page_vma_node(gfp_mask, vma, addr, node)		\
-	alloc_pages_vma(gfp_mask, 0, vma, addr, node)
+	alloc_pages_vma(gfp_mask, 0, vma, addr, node, false)
 
 extern unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);
 extern unsigned long get_zeroed_page(gfp_t gfp_mask);
diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index 3a93f73..d5e7132 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -154,7 +154,7 @@ __alloc_zeroed_user_highpage(gfp_t movableflags,
 			unsigned long vaddr)
 {
 	struct page *page = alloc_page_vma(GFP_HIGHUSER | movableflags,
-			vma, vaddr);
+			vma, vaddr, false);
 
 	if (page)
 		clear_user_highpage(page, vaddr);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e2d1587..49449ea 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -655,10 +655,11 @@ static inline gfp_t alloc_hugepage_gfpmask(int defrag, gfp_t extra_gfp)
 static inline struct page *alloc_hugepage_vma(int defrag,
 					      struct vm_area_struct *vma,
 					      unsigned long haddr, int nd,
-					      gfp_t extra_gfp)
+					      gfp_t extra_gfp,
+					      bool drop_mmapsem)
 {
 	return alloc_pages_vma(alloc_hugepage_gfpmask(defrag, extra_gfp),
-			       HPAGE_PMD_ORDER, vma, haddr, nd);
+			       HPAGE_PMD_ORDER, vma, haddr, nd, drop_mmapsem);
 }
 
 #ifndef CONFIG_NUMA
@@ -683,7 +684,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		if (unlikely(khugepaged_enter(vma)))
 			return VM_FAULT_OOM;
 		page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
-					  vma, haddr, numa_node_id(), 0);
+					  vma, haddr, numa_node_id(), 0, false);
 		if (unlikely(!page)) {
 			count_vm_event(THP_FAULT_FALLBACK);
 			goto out;
@@ -911,7 +912,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (transparent_hugepage_enabled(vma) &&
 	    !transparent_hugepage_debug_cow())
 		new_page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
-					      vma, haddr, numa_node_id(), 0);
+					      vma, haddr, numa_node_id(), 0, false);
 	else
 		new_page = NULL;
@@ -1783,15 +1784,14 @@ static void collapse_huge_page(struct mm_struct *mm,
 	 * the userland I/O paths. Allocating memory with the
 	 * mmap_sem in read mode is good idea also to allow greater
 	 * scalability.
+	 *
+	 * alloc_pages_vma drops the mmap_sem so that if the process
+	 * faults or calls mmap then khugepaged will not stall it.
+	 * The mmap_sem is taken for write later to confirm the VMA
+	 * is still valid
 	 */
 	new_page = alloc_hugepage_vma(khugepaged_defrag(), vma, address,
-				      node, __GFP_OTHER_NODE);
-
-	/*
-	 * After allocating the hugepage, release the mmap_sem read lock in
-	 * preparation for taking it in write mode.
-	 */
-	up_read(&mm->mmap_sem);
+				      node, __GFP_OTHER_NODE, true);
 	if (unlikely(!new_page)) {
 		count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
 		*hpage = ERR_PTR(-ENOMEM);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 9c51f9f..1a8c676 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1832,7 +1832,7 @@ static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
  */
 struct page *
 alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
-		unsigned long addr, int node)
+		unsigned long addr, int node, bool drop_mmapsem)
 {
 	struct mempolicy *pol = get_vma_policy(current, vma, addr);
 	struct zonelist *zl;
@@ -1844,16 +1844,21 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 
 		nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order);
 		mpol_cond_put(pol);
+		if (drop_mmapsem)
+			up_read(&vma->vm_mm->mmap_sem);
 		page = alloc_page_interleave(gfp, order, nid);
 		put_mems_allowed();
 		return page;
 	}
 	zl = policy_zonelist(gfp, pol, node);
 	if (unlikely(mpol_needs_cond_ref(pol))) {
+		struct page *page;
 		/*
 		 * slow path: ref counted shared policy
 		 */
-		struct page *page = __alloc_pages_nodemask(gfp, order,
+		if (drop_mmapsem)
+			up_read(&vma->vm_mm->mmap_sem);
+		page = __alloc_pages_nodemask(gfp, order,
 						zl, policy_nodemask(gfp, pol));
 		__mpol_put(pol);
 		put_mems_allowed();
@@ -1862,6 +1867,8 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 	/*
 	 * fast path: default or task policy
 	 */
+	if (drop_mmapsem)
+		up_read(&vma->vm_mm->mmap_sem);
 	page = __alloc_pages_nodemask(gfp, order, zl,
 				      policy_nodemask(gfp, pol));
 	put_mems_allowed();
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6e8ecb6..2f87f92 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2161,7 +2161,16 @@ rebalance:
 					sync_migration);
 	if (page)
 		goto got_pg;
-	sync_migration = true;
+
+	/*
+	 * Do not use sync migration for processes allocating transparent
+	 * hugepages as it could stall writing back pages which is far worse
+	 * than simply failing to promote a page. We still allow khugepaged
+	 * to allocate as it should drop the mmap_sem before trying to
+	 * allocate the page so it's acceptable for it to stall
+	 */
+	sync_migration = (current->flags & PF_KTHREAD) ||
+			 !(gfp_mask & __GFP_NO_KSWAPD);
 
 	/* Try direct reclaim and then allocating */
 	page = __alloc_pages_direct_reclaim(gfp_mask, order,

^ permalink raw reply related	[flat|nested] 14+ messages in thread
* Re: long sleep_on_page delays writing to slow storage
  2011-11-09 17:52 ` Mel Gorman
@ 2011-11-09 18:06   ` Andrea Arcangeli
  2011-11-10  0:53     ` Mel Gorman
  2011-11-10  9:34     ` Johannes Weiner
  0 siblings, 2 replies; 14+ messages in thread
From: Andrea Arcangeli @ 2011-11-09 18:06 UTC (permalink / raw)
To: Mel Gorman
Cc: Jan Kara, Andy Isaacson, linux-kernel, linux-mm, Johannes Weiner

On Wed, Nov 09, 2011 at 05:52:01PM +0000, Mel Gorman wrote:
> -#define alloc_pages_vma(gfp_mask, order, vma, addr, node)	\
> -	alloc_pages(gfp_mask, order)
> +#define alloc_pages_vma(gfp_mask, order, vma, addr, node, drop_mmapsem) \
> +	({								\
> +		if (drop_mmapsem)					\
> +			up_read(&vma->vm_mm->mmap_sem);			\
> +		alloc_pages(gfp_mask, order);				\
> +	})

I wouldn't change alloc_pages_vma. I think it's better to add wrappers
and have the up_read variant called only by khugepaged:

alloc_pages_vma_up_read(gfp_mask, order, vma, addr, node)
{
	__alloc_pages_vma(gfp_mask, order, vma, addr, node, true);
}

alloc_pages_vma(gfp_mask, order, vma, addr, node)
{
	__alloc_pages_vma(gfp_mask, order, vma, addr, node, false);
}

I wonder if a change like this would be enough?

	sync_migration = !(gfp_mask & __GFP_NO_KSWAPD);

But even if hidden in a new function, the main downside overall is the
fact we'll pass one more var through the stack of fast paths.

Johannes, I recall you reported this too and Mel suggested the above
change - did it help in the end?

Your change in khugepaged context makes perfect sense anyway, just we
should be sure it's really needed before adding more variables in the
fast path, I think.

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: long sleep_on_page delays writing to slow storage
  2011-11-09 18:06 ` Andrea Arcangeli
@ 2011-11-10  0:53   ` Mel Gorman
  2011-11-10  1:54     ` Andrea Arcangeli
  1 sibling, 1 reply; 14+ messages in thread
From: Mel Gorman @ 2011-11-10  0:53 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Jan Kara, Andy Isaacson, linux-kernel, linux-mm, Johannes Weiner

On Wed, Nov 09, 2011 at 07:06:46PM +0100, Andrea Arcangeli wrote:
> On Wed, Nov 09, 2011 at 05:52:01PM +0000, Mel Gorman wrote:
> > -#define alloc_pages_vma(gfp_mask, order, vma, addr, node)	\
> > -	alloc_pages(gfp_mask, order)
> > +#define alloc_pages_vma(gfp_mask, order, vma, addr, node, drop_mmapsem) \
> > +	({								\
> > +		if (drop_mmapsem)					\
> > +			up_read(&vma->vm_mm->mmap_sem);			\
> > +		alloc_pages(gfp_mask, order);				\
> > +	})
>
> I wouldn't change alloc_pages_vma. I think it's better to add wrappers
> and have the up_read variant called only by khugepaged:
>
> alloc_pages_vma_up_read(gfp_mask, order, vma, addr, node)
> {
>	__alloc_pages_vma(gfp_mask, order, vma, addr, node, true);
> }
>
> alloc_pages_vma(gfp_mask, order, vma, addr, node)
> {
>	__alloc_pages_vma(gfp_mask, order, vma, addr, node, false);
> }
>

Yeah, it would. I thought that myself after adjusting the
alloc_pages_vma() function but hadn't gone back to it. The "bodge" was
only written earlier.

> I wonder if a change like this would be enough?
>
>	sync_migration = !(gfp_mask & __GFP_NO_KSWAPD);
>

Sure it does, and in fact this is patch 1 of 2; it reduced all the THP
related stalls for me. The number of THPs was reduced though, because
khugepaged was no longer using sync compaction - hence patch 2, which
altered alloc_pages_vma. This was tested on CONFIG_NUMA, so testing for
PF_KTHREAD would be insufficient as khugepaged would call sync
compaction with the mmap_sem held, causing indirect stalls.

> But even if hidden in a new function, the main downside overall is the
> fact we'll pass one more var through the stack of fast paths.
>
Unfortunate, but the alternative is not allowing khugepaged to use sync
compaction. Are you ok with that? The number of THPs in use was reduced,
but it also was during a somewhat unrealistic stress test, so it might
not matter.

> Johannes I recall you reported this too and Mel suggested the above
> change, did it help in the end?
>
> Your change in khugepaged context makes perfect sense anyway, just we
> should be sure it's really needed before adding more variables in fast
> path I think.

It's not really needed to avoid stalls - just !(gfp_mask &
__GFP_NO_KSWAPD) is enough for that. It's only needed if we want
khugepaged to continue using sync compaction without stalling processes
due to holding mmap_sem.

--
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: long sleep_on_page delays writing to slow storage
  2011-11-10  0:53 ` Mel Gorman
@ 2011-11-10  1:54   ` Andrea Arcangeli
  0 siblings, 0 replies; 14+ messages in thread
From: Andrea Arcangeli @ 2011-11-10  1:54 UTC (permalink / raw)
To: Mel Gorman
Cc: Jan Kara, Andy Isaacson, linux-kernel, linux-mm, Johannes Weiner

On Thu, Nov 10, 2011 at 12:53:07AM +0000, Mel Gorman wrote:
> compaction. Are you ok with that? The number of THPs in use was reduced
> but it also was during a somewhat unrealistic stress test so it might
> not matter.

I think having more THP collapsed during the unrealistic load is not so
important; likely the unrealistic load is dominated not by TLB misses
but by kernel load, so even if it materializes it shouldn't make a
difference. And khugepaged will just retry at the next pass anyway, so
it doesn't matter if it's delayed a bit, I think. And retrying on the
same address with __GFP_OTHER_NODE doesn't sound like a good idea.

> It's not really needed to avoid stalls - just !(gfp_mask &
> __GFP_NO_KSWAPD) is enough for that. It's only needed if we want

I would go with this first. You can keep the second patch in queue, but
considering it's altering fast paths that affect the no-THP config too,
we could at least benchmark it to be sure it's not measurable. I guess
it's not, but hey, if it's not needed then we shouldn't care.

And it was already ok: we thought it didn't matter so we reverted it in
c6a140bf164829769499b5e50d380893da39b29e, but it clearly matters for USB
sticks, so I would simply reapply it. One reason we reverted it was also
the fact it wasn't so clean to take that decision in function of
__GFP_NO_KSWAPD. I think it's probably cleaner to check if __GFP_NORETRY
is set instead of whether __GFP_NO_KSWAPD is set.
That flag should indicate we don't really care too much if we fail the
allocation or not, and not to go too hard on it. Notably, those are the
allocations that are totally ok to fail without having to trigger OOM,
so again not worth going the extra mile to succeed them.

Alternatively we could check __GFP_NOFAIL, but that's mostly obsolete.
Yet another alternative is to check order > PAGE_ALLOC_COSTLY_ORDER,
but you know any additional PAGE_ALLOC_COSTLY_ORDER check tends to make
me unhappy, as the behavior has an enormous change from
PAGE_ALLOC_COSTLY_ORDER to PAGE_ALLOC_COSTLY_ORDER+1 and that's an
arbitrary number that doesn't justify a big change in behavior. So the
fewer PAGE_ALLOC_COSTLY_ORDER checks the better; ideally there shall be
none :)

So I would suggest to resubmit the 1/2 patch changed to __GFP_NORETRY,
or just a plain revert with __GFP_NO_KSWAPD if you don't like the
__GFP_NORETRY. And to queue up the change to alloc_pages_vma for later -
it's not a bad idea at all, but it only pays off for khugepaged, and
99% of userland allocations aren't happening there.

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: long sleep_on_page delays writing to slow storage
  2011-11-09 18:06 ` Andrea Arcangeli
  2011-11-10  0:53   ` Mel Gorman
@ 2011-11-10  9:34   ` Johannes Weiner
  2011-11-14 18:47     ` Dave Jones
  1 sibling, 1 reply; 14+ messages in thread
From: Johannes Weiner @ 2011-11-10  9:34 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Mel Gorman, Jan Kara, Andy Isaacson, linux-kernel, linux-mm

On Wed, Nov 09, 2011 at 07:06:46PM +0100, Andrea Arcangeli wrote:
> On Wed, Nov 09, 2011 at 05:52:01PM +0000, Mel Gorman wrote:
> > -#define alloc_pages_vma(gfp_mask, order, vma, addr, node)	\
> > -	alloc_pages(gfp_mask, order)
> > +#define alloc_pages_vma(gfp_mask, order, vma, addr, node, drop_mmapsem) \
> > +	({								\
> > +		if (drop_mmapsem)					\
> > +			up_read(&vma->vm_mm->mmap_sem);			\
> > +		alloc_pages(gfp_mask, order);				\
> > +	})
>
> I wouldn't change alloc_pages_vma. I think it's better to add wrappers
> and have the up_read variant called only by khugepaged:
>
> alloc_pages_vma_up_read(gfp_mask, order, vma, addr, node)
> {
>	__alloc_pages_vma(gfp_mask, order, vma, addr, node, true);
> }
>
> alloc_pages_vma(gfp_mask, order, vma, addr, node)
> {
>	__alloc_pages_vma(gfp_mask, order, vma, addr, node, false);
> }
>
> I wonder if a change like this would be enough?
>
>	sync_migration = !(gfp_mask & __GFP_NO_KSWAPD);
>
> But even if hidden in a new function, the main downside overall is the
> fact we'll pass one more var through the stack of fast paths.
>
> Johannes I recall you reported this too and Mel suggested the above
> change, did it help in the end?

Yes, it completely fixed the latency problem.

That said, I haven't looked at the impact on the THP success rate, but
a regression there is probably less severe than half-minute stalls in
interactive applications.

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: long sleep_on_page delays writing to slow storage
  2011-11-10  9:34 ` Johannes Weiner
@ 2011-11-14 18:47   ` Dave Jones
  2011-11-15 10:13     ` Mel Gorman
  0 siblings, 1 reply; 14+ messages in thread
From: Dave Jones @ 2011-11-14 18:47 UTC (permalink / raw)
To: Johannes Weiner
Cc: Andrea Arcangeli, Mel Gorman, Jan Kara, Andy Isaacson,
	linux-kernel, linux-mm, kernel-team

On Thu, Nov 10, 2011 at 10:34:42AM +0100, Johannes Weiner wrote:
> > I wonder if a change like this would be enough?
> >
> >	sync_migration = !(gfp_mask & __GFP_NO_KSWAPD);
> >
> > But even if hidden in a new function, the main downside overall is the
> > fact we'll pass one more var through the stack of fast paths.
> >
> > Johannes I recall you reported this too and Mel suggested the above
> > change, did it help in the end?
>
> Yes, it completely fixed the latency problem.
>
> That said, I haven't looked at the impact on the THP success rate, but
> a regression there is probably less severe than half-minute stalls in
> interactive applications.

FWIW, we've had a few reports from Fedora users since we moved to 3.x
kernels about similar problems, so whatever the fix is for this should
probably go to stable too.

I could push an update for Fedora users to test the change above if
that would be helpful?

	Dave

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: long sleep_on_page delays writing to slow storage
  2011-11-14 18:47 ` Dave Jones
@ 2011-11-15 10:13   ` Mel Gorman
  2011-11-17 19:47     ` Dave Jones
  0 siblings, 1 reply; 14+ messages in thread
From: Mel Gorman @ 2011-11-15 10:13 UTC (permalink / raw)
To: Dave Jones, Johannes Weiner, Andrea Arcangeli, Jan Kara,
	Andy Isaacson, linux-kernel, linux-mm, kernel-team

On Mon, Nov 14, 2011 at 01:47:17PM -0500, Dave Jones wrote:
> On Thu, Nov 10, 2011 at 10:34:42AM +0100, Johannes Weiner wrote:
> > > I wonder if a change like this would be enough?
> > >
> > >	sync_migration = !(gfp_mask & __GFP_NO_KSWAPD);
> > >
> > > But even if hidden in a new function, the main downside overall is the
> > > fact we'll pass one more var through the stack of fast paths.
> > >
> > > Johannes I recall you reported this too and Mel suggested the above
> > > change, did it help in the end?
> >
> > Yes, it completely fixed the latency problem.
> >
> > That said, I haven't looked at the impact on the THP success rate, but
> > a regression there is probably less severe than half-minute stalls in
> > interactive applications.
>
> FWIW, we've had a few reports from Fedora users since we moved to 3.x
> kernels about similar problems, so whatever the fix is for this should
> probably go to stable too.
>

Agreed. I made note of that when I sent a smaller patch to Andrew so
that it would be picked up by distros.

> I could push an update for Fedora users to test the change above if
> that would be helpful?
>

It would be helpful if you could pick up the patch at
https://lkml.org/lkml/2011/11/10/173 as this is what I expect will
reach -stable eventually. It would be even better if one of the bug
reporters could test before and after that patch and report whether it
fixes their problem or not.

If they are still experiencing major stalls, I have an experimental
script that may be able to capture stack traces of processes stalled
for more than 1 second.
I've had some success with it locally, so maybe they could try it out
to identify if it's THP or something else.

--
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: long sleep_on_page delays writing to slow storage
  2011-11-15 10:13 ` Mel Gorman
@ 2011-11-17 19:47   ` Dave Jones
  2011-11-17 22:53     ` Andrea Arcangeli
  ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Dave Jones @ 2011-11-17 19:47 UTC (permalink / raw)
To: Mel Gorman
Cc: Johannes Weiner, Andrea Arcangeli, Jan Kara, Andy Isaacson,
	linux-kernel, linux-mm, kernel-team

On Tue, Nov 15, 2011 at 10:13:13AM +0000, Mel Gorman wrote:

> If they are still experiencing major stalls, I have an experimental
> script that may be able to capture stack traces of processes stalled
> for more than 1 second. I've had some success with it locally so
> maybe they could try it out to identify if it's THP or something else.

I'm not sure if it's the same problem, but I'd be interested in trying
that script.

When I build a kernel on my laptop, when it gets to the final link
stage and there's a ton of IO, my entire X session wedges for a few
seconds. This may be unrelated, because this is on an SSD, which
shouldn't suffer from the slow IO of the USB devices mentioned in this
thread.

(This is even with that patch applied btw, perhaps adding further fuel
to the idea that it's unrelated.)

	Dave

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: long sleep_on_page delays writing to slow storage
  2011-11-17 19:47 ` Dave Jones
@ 2011-11-17 22:53   ` Andrea Arcangeli
  2011-11-18 11:11   ` Mel Gorman
  2011-11-21  9:18   ` Johannes Weiner
  2 siblings, 0 replies; 14+ messages in thread
From: Andrea Arcangeli @ 2011-11-17 22:53 UTC (permalink / raw)
To: Dave Jones, Mel Gorman, Johannes Weiner, Jan Kara, Andy Isaacson,
	linux-kernel, linux-mm, kernel-team

Hi Dave,

On Thu, Nov 17, 2011 at 02:47:20PM -0500, Dave Jones wrote:
> When I build a kernel on my laptop, when it gets to the final link stage,
> and there's a ton of IO, my entire X session wedges for a few seconds.
> This may be unrelated, because this is on an SSD, which shouldn't suffer
> from the slow IO of the USB devices mentioned in this thread.
>
> (This is even with that patch applied btw, perhaps adding further fuel to
> the idea that it's unrelated).

Agreed, if it happens even with the patch applied it's most certainly
not related to compaction. Maybe it's more IO scheduler related, or
something like that.

You can also try to set transparent_hugepage/defrag to "never" to be
sure; then compaction won't run.

^ permalink raw reply	[flat|nested] 14+ messages in thread
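The knobs Andrea refers to, along with the coarser THP switch mentioned earlier in the thread, live under sysfs. A sketch of the commands (not from the original mails; run as root, and the paths assume a kernel built with transparent hugepage support):

```shell
# Show the current policies; the active setting appears in [brackets].
cat /sys/kernel/mm/transparent_hugepage/enabled
cat /sys/kernel/mm/transparent_hugepage/defrag

# Keep THP but stop page faults from invoking compaction (defrag), so
# faults no longer wait on migration of pages queued for slow storage:
echo never > /sys/kernel/mm/transparent_hugepage/defrag

# Coarser workaround: disable transparent hugepages entirely:
echo never > /sys/kernel/mm/transparent_hugepage/enabled
```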
* Re: long sleep_on_page delays writing to slow storage
  2011-11-17 19:47 ` Dave Jones
  2011-11-17 22:53   ` Andrea Arcangeli
@ 2011-11-18 11:11   ` Mel Gorman
  2011-11-18 12:19     ` Josh Boyer
  2011-11-21  9:18   ` Johannes Weiner
  2 siblings, 1 reply; 14+ messages in thread
From: Mel Gorman @ 2011-11-18 11:11 UTC (permalink / raw)
To: Dave Jones, Johannes Weiner, Andrea Arcangeli, Jan Kara,
	Andy Isaacson, linux-kernel, linux-mm, kernel-team

[-- Attachment #1: Type: text/plain, Size: 2592 bytes --]

On Thu, Nov 17, 2011 at 02:47:20PM -0500, Dave Jones wrote:
> On Tue, Nov 15, 2011 at 10:13:13AM +0000, Mel Gorman wrote:
>
> > If they are still experiencing major stalls, I have an experimental
> > script that may be able to capture stack traces of processes stalled
> > for more than 1 second. I've had some success with it locally so
> > maybe they could try it out to identify if it's THP or something else.
>
> I'm not sure if it's the same problem, but I'd be interested in trying
> that script.
>

Monitor script is attached as watch-dstate.pl. Run it as

  watch-dstate.pl -o logfile

I'm also attaching a post-processing script, stap-dstate-frequency:

  cat logfile | stap-dstate-frequency

will report on unique stack traces, what got stuck in them and for how
long. Unfortunately, this does require a working systemtap installation
because it had to work on systems without ftrace. Usually systemtap is
a case of installing debugging symbols and its package, but mileage
varies.

I ran this for a few days on my own desktop but found that the worst
stalls for firefox and evolution were in futex_wait, with the second
worst in

[<ffffffffa018e3c5>] ext4_sync_file+0x225/0x290 [ext4]
[<ffffffff81178250>] do_fsync+0x50/0x80
[<ffffffff8117852e>] sys_fdatasync+0xe/0x20
[<ffffffff81448592>] system_call_fastpath+0x16/0x1b

The stall timing is approximate at best. If you find the stall figures
are way too high or unrealistic, try running with --accurate-stall.
The stall figures will be much more accurate but depending on your kernel version, the stack traces may be one line long ending with kretprobe_trampoline. > When I build a kernel on my laptop, when it gets to the final link stage, > and there's a ton of IO, my entire X session wedges for a few seconds. > This may be unrelated, because this is on an SSD, which shouldn't suffer > from the slow IO of the USB devices mentioned in this thread. > I have a vague suspicion that there are some interactivity issues around SSDs but I don't know why that is. I'm basing this on some complaints of audio skipping with heavy kernel compiles on machines very similar to my own other than mine uses rotary storage. It's on the Christmas list to by myself a SSD to take a closer look. > (This is even with that patch applied btw, perhaps adding further fuel to > the idea that it's unrelated). > I suspect it's not compaction related in that case but the script may be able to tell for sure. If it does not catch anything, alter this line to have a smaller threshold global stall_threshold = 1000 -- Mel Gorman SUSE Labs [-- Attachment #2: watch-dstate.pl --] [-- Type: application/x-perl, Size: 10807 bytes --] [-- Attachment #3: stap-dstate-frequency --] [-- Type: text/plain, Size: 2432 bytes --] #!/usr/bin/perl # This script reads the output from the dstate monitor and reports how # many unique stack traces there were and what the stall times were. # The objective is to identify the worst sources of stalls. 
# # Copyright Mel Gorman 2011 use strict; my $line; my %unique_event_counts; my %unique_event_stall; my %unique_event_stall_details; my $total_stalled; my ($process, $stalltime, $function, $event); my ($stall_details, $trace, $first_trace, $reading_trace, $skip); while ($line = <>) { # Watch for the beginning of a new event if ($line =~ /^time ([0-9]*): ([0-9]*) \((.*)\) Stalled: ([-0-9]*) ms: (.*)/) { # Skip uninteresting traces if (!$skip) { # Record the last event $unique_event_counts{$trace}++; $unique_event_stall_details{$trace} .= $event; if ($stalltime >= 0) { $unique_event_stall{$trace} += $stalltime; $total_stalled += $stalltime; } } # Start the next event $event = sprintf "%-20s %-20s %6d ms\n", $3, $5, $4; $reading_trace = 0; $stalltime = $4; if ($stalltime == 0) { print "DEBUG $line"; } $first_trace = ""; $trace = ""; } # If we have reached a trace, blindly read it if ($reading_trace) { # Ignore traces that look like they are from user space if ($line =~ /^\[<0/) { $reading_trace = 0; next; } $trace .= $line; if ($first_trace eq "") { $first_trace = $line; $skip = 1; # Skip uninteresting traces if ($first_trace !~ / do_poll\+/ && $first_trace !~ / kthread\+/ && $first_trace !~ / khugepaged\+/ && $first_trace !~ / sys_epoll_wait\+/ && $first_trace !~ / kswapd\+/) { $skip = 0; } } next; } if ($line =~ /^\[<f/) { $reading_trace = 1; next; } } print "Overall stalled time: $total_stalled ms\n\n"; foreach my $trace (sort {$unique_event_stall{$b} <=> $unique_event_stall{$a}} keys %unique_event_stall) { #printf "Event $short_event us count %4d\n", $unique_event_counts{$event}; #print $unique_event_stall_details{$event}; printf "Time stalled in this event: %8d ms\n", $unique_event_stall{$trace}; printf "Event count: %8d\n", $unique_event_counts{$trace}; print $unique_event_stall_details{$trace}; print "$trace\n"; } #print "\nDetails\n=======\n"; #foreach my $event (sort {$unique_event_stall{$b} <=> $unique_event_stall{$a}} keys %unique_event_stall) { # print 
"Event $event us count $unique_event_counts{$event}\n"; # print "\n"; #} ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: long sleep_on_page delays writing to slow storage
  2011-11-18 12:19 UTC
From: Josh Boyer
To: Mel Gorman
Cc: Dave Jones, Johannes Weiner, Andrea Arcangeli, Jan Kara, Andy Isaacson, linux-kernel, linux-mm, kernel-team

On Fri, Nov 18, 2011 at 11:11:40AM +0000, Mel Gorman wrote:
> On Thu, Nov 17, 2011 at 02:47:20PM -0500, Dave Jones wrote:
> > On Tue, Nov 15, 2011 at 10:13:13AM +0000, Mel Gorman wrote:
> >
> > > If they are still experiencing major stalls, I have an experimental
> > > script that may be able to capture stack traces of processes stalled
> > > for more than 1 second. I've had some success with it locally so
> > > maybe they could try it out to identify if it's THP or something else.
> >
> > I'm not sure if it's the same problem, but I'd be interested in trying
> > that script.
>
> The monitor script is attached as watch-dstate.pl. Run it as
>
>     watch-dstate.pl -o logfile
>
> I'm also attaching a post-processing script, stap-dstate-frequency.
>
>     cat logfile | stap-dstate-frequency
>
> will report on unique stack traces, what got stuck in them and
> for how long. Unfortunately, this does require a working systemtap
> installation because it had to work on systems without ftrace. Usually
> systemtap is a case of installing debugging symbols and its package,
> but mileage varies.
>
> I ran this for a few days on my own desktop and found that the worst
> stalls for firefox and evolution were in futex_wait, with the second
> worst in
>
>     [<ffffffffa018e3c5>] ext4_sync_file+0x225/0x290 [ext4]
>     [<ffffffff81178250>] do_fsync+0x50/0x80
>     [<ffffffff8117852e>] sys_fdatasync+0xe/0x20
>     [<ffffffff81448592>] system_call_fastpath+0x16/0x1b
>
> The stall timing is approximate at best. If you find the stall figures
> are way too high or unrealistic, try running with --accurate-stall. The
> stall figures will then be much more accurate but, depending on your
> kernel version, the stack traces may be one line long, ending with
> kretprobe_trampoline.
>
> > When I build a kernel on my laptop, when it gets to the final link stage,
> > and there's a ton of IO, my entire X session wedges for a few seconds.
> > This may be unrelated, because this is on an SSD, which shouldn't suffer
> > from the slow IO of the USB devices mentioned in this thread.
>
> I have a vague suspicion that there are some interactivity issues
> around SSDs, but I don't know why that is. I'm basing this on some
> complaints of audio skipping with heavy kernel compiles on machines
> very similar to my own, except that mine uses rotary storage. It's on
> the Christmas list to buy myself an SSD to take a closer look.

I see similar pauses on my laptop, which doesn't have an SSD, so I don't
think it's drive related. It certainly could be I/O scheduling or ext4
sync related, but probably not a specific drive technology. I also wonder
if dirty_background_ratio and dirty_ratio might be too high on this
machine, and the flush of the page cache is causing excessive I/O.

> > (This is even with that patch applied btw, perhaps adding further fuel to
> > the idea that it's unrelated).
>
> I suspect it's not compaction related in that case, but the script may be
> able to tell for sure. If it does not catch anything, alter this line to
> have a smaller threshold:
>
>     global stall_threshold = 1000

I'll give this a try as well.
* Re: long sleep_on_page delays writing to slow storage
  2011-11-21 9:18 UTC
From: Johannes Weiner
To: Dave Jones, Mel Gorman, Andrea Arcangeli, Jan Kara, Andy Isaacson, linux-kernel, linux-mm, kernel-team

On Thu, Nov 17, 2011 at 02:47:20PM -0500, Dave Jones wrote:
> On Tue, Nov 15, 2011 at 10:13:13AM +0000, Mel Gorman wrote:
>
> > If they are still experiencing major stalls, I have an experimental
> > script that may be able to capture stack traces of processes stalled
> > for more than 1 second. I've had some success with it locally so
> > maybe they could try it out to identify if it's THP or something else.
>
> I'm not sure if it's the same problem, but I'd be interested in trying
> that script.
>
> When I build a kernel on my laptop, when it gets to the final link stage,
> and there's a ton of IO, my entire X session wedges for a few seconds.
> This may be unrelated, because this is on an SSD, which shouldn't suffer
> from the slow IO of the USB devices mentioned in this thread.
>
> (This is even with that patch applied btw, perhaps adding further fuel to
> the idea that it's unrelated).

We still have the problem that individual zones may fill up
disproportionately with dirty pages, and reclaim can take a while to make
progress in such zones. Would you mind trying the per-zone dirty limits
patch set? You can find it here:

    http://cmpxchg.org/~hannes/kernel/mm-per-zone-dirty-limits/

    git am pzd.mbox

should work on 3.2-rc1.
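[Editor's note: the idea behind Johannes' patch set can be illustrated with a toy calculation. This is the editor's sketch of the concept, not the patch set's actual code, and the zone sizes below are invented: instead of one global dirty budget, each zone gets a budget proportional to its own size, so a small zone can never fill up entirely with dirty pages.]

```shell
dirty_ratio=20   # global ratio, applied per zone in this sketch

# Hypothetical zone sizes in kB (roughly DMA, DMA32, Normal on an 8 GB box).
for zone_kb in 16384 884736 7487488; do
    echo "zone of ${zone_kb} kB -> per-zone dirty limit $((zone_kb * dirty_ratio / 100)) kB"
done
```

Under a single global limit, the entire 1.6 GB budget could in principle land in one small zone; the per-zone scheme caps, e.g., the 16 MB zone at about 3 MB of dirty pages, so compaction and reclaim in that zone are never stuck behind hundreds of MB of slow writeback.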
End of thread. Thread overview: 14+ messages

2011-11-07  4:59 long sleep_on_page delays writing to slow storage Andy Isaacson
2011-11-09 17:00 ` Jan Kara
2011-11-09 17:52   ` Mel Gorman
2011-11-09 18:06     ` Andrea Arcangeli
2011-11-10  0:53       ` Mel Gorman
2011-11-10  1:54         ` Andrea Arcangeli
2011-11-10  9:34       ` Johannes Weiner
2011-11-14 18:47 ` Dave Jones
2011-11-15 10:13   ` Mel Gorman
2011-11-17 19:47     ` Dave Jones
2011-11-17 22:53       ` Andrea Arcangeli
2011-11-18 11:11       ` Mel Gorman
2011-11-18 12:19         ` Josh Boyer
2011-11-21  9:18       ` Johannes Weiner