From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1756011Ab1KIRwI (ORCPT ); Wed, 9 Nov 2011 12:52:08 -0500
Received: from cantor2.suse.de ([195.135.220.15]:41872 "EHLO mx2.suse.de"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1754095Ab1KIRwG (ORCPT ); Wed, 9 Nov 2011 12:52:06 -0500
Date: Wed, 9 Nov 2011 17:52:01 +0000
From: Mel Gorman
To: Jan Kara
Cc: Andy Isaacson, linux-kernel@vger.kernel.org, linux-mm@vger.kernel.org,
	aarcange@redhat.com
Subject: Re: long sleep_on_page delays writing to slow storage
Message-ID: <20111109175201.GB3083@suse.de>
References: <20111107045928.GK8927@hexapodia.org> <20111109170027.GB7495@quack.suse.cz>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-15
Content-Disposition: inline
In-Reply-To: <20111109170027.GB7495@quack.suse.cz>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Nov 09, 2011 at 06:00:27PM +0100, Jan Kara wrote:
> I've added to CC some mm developers who know much more about transparent
> hugepages than I do because that is what seems to cause your problems...
>

This has hit more than once recently. It's not something I can
reproduce locally as such but the problem seems to be two-fold:

1. processes getting stuck in synchronous reclaim
2. processes faulting at the same time khugepaged is allocating a huge
   page with CONFIG_NUMA enabled

The first one is easily fixed, the second one not so much. I'm
prototyping two patches at the moment and sending them through tests.

> On Sun 06-11-11 20:59:28, Andy Isaacson wrote:
> > I am running 1a67a573b (3.1.0-09125 plus a small local patch) on a Core
> > i7, 8 GB RAM, writing a few GB of data to a slow SD card attached via
> > usb-storage with vfat. I mounted without specifying any options,
> >
> > /dev/sdb1 /mnt/usb vfat rw,nosuid,nodev,noexec,relatime,uid=22448,gid=22448,fmask=0022,dmask=0022,codepage=cp437,iocharset=utf8,shortname=mixed,errors=remount-ro 0 0
> >
> > and I'm using rsync to write the data.
> >

Sounds similar to the cases I'm hearing about - copying from NFS to USB
with applications freezing, where disabling transparent hugepages
sometimes helps.

> > We end up in a fairly steady state with a half GB dirty:
> >
> > Dirty: 612280 kB
> >
> > The dirty count stays high despite running sync(1) in another xterm.
> >
> > The bug is,
> >
> > Firefox (iceweasel 7.0.1-4) hangs at random intervals. One thread is
> > stuck in sleep_on_page
> >
> > [] sleep_on_page+0xe/0x12
> > [] wait_on_page_bit+0x72/0x74
> > [] migrate_pages+0x17c/0x36f
> > [] compact_zone+0x467/0x68b
> > [] try_to_compact_pages+0x14c/0x1b3
> > [] __alloc_pages_direct_compact+0xa7/0x15a
> > [] __alloc_pages_nodemask+0x698/0x71d
> > [] alloc_pages_vma+0xf5/0xfa
> > [] do_huge_pmd_anonymous_page+0xbe/0x227
> > [] handle_mm_fault+0x113/0x1ce
> > [] do_page_fault+0x2d7/0x31e
> > [] page_fault+0x25/0x30
> > [] 0xffffffffffffffff
> >
> > And it stays stuck there for long enough for me to find the thread and
> > attach strace. Apparently it was stuck in
> >
> > 1320640739.201474 munmap(0x7f5c06b00000, 2097152) = 0
> >
> > for something between 20 and 60 seconds.
>
> That's not nice. Apparently you are using transparent hugepages and the
> stuck application tried to allocate a hugepage. But to allocate a hugepage
> you need a physically contiguous set of pages and try_to_compact_pages()
> is trying to achieve exactly that. But some of the pages that need moving
> around are stuck for a long time - likely are being submitted to your USB
> stick for writing. So all in all I'm not *that* surprised you see what you
> see.
>

Neither am I. It matches other reports I've heard over the last week.

> > There's no reason to let a 6MB/sec high latency device lock up 600 MB of
> > dirty pages. I'll have to wait a hundred seconds after my app exits
> > before the system will return to usability.
> >
> > And there's no way, AFAICS, for me to work around this behavior in
> > userland.
>
> There is - you can use /sys/block//bdi/max_ratio to tune how much
> of dirty cache that device can take. Dirty cache is set to 20% of your
> total memory by default so that amounts to ~1.6 GB. So if you tune
> max_ratio to say 5, you will get at most 80 MB of dirty pages against your
> USB stick which should be about appropriate. You can even create a udev
> rule so that when a USB stick is inserted, it automatically sets
> max_ratio for it to 5...
>

This kinda hacks around the problem although it should work. You should
also be able to "work around" the problem by disabling transparent
hugepages.

Finally, can you give this patch a whirl? It's a bit rough and ready and
I'm still looking for a cleaner way of allowing khugepaged to use sync
compaction on CONFIG_NUMA, but I'd be interested in confirming it's the
right direction. I have tested it, sort of, but not against mainline. In
my tests though I found that time spent stalled in compaction was
reduced by 47 seconds during a test lasting 30 minutes. There is a drop
in the number of transparent hugepages used but it's minor and could be
within the noise as khugepaged is still able to use sync compaction.

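For reference, the max_ratio arithmetic quoted above can be sketched in
C as below. This is a rough userspace model, not kernel code:
bdi_dirty_cap() is a name made up for illustration, and it assumes the
dirty limit is a plain percentage of all RAM, which is the approximation
used in the quoted text. The kernel bases the limit on dirtyable memory
rather than total RAM, so real numbers will come out somewhat lower.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: approximate cap, in bytes, on dirty page cache
 * for a single device.
 *
 * mem_bytes:   memory the global dirty limit is computed from
 *              (assumption: all of RAM)
 * dirty_ratio: global dirty limit as a percentage of mem_bytes
 * max_ratio:   the device's maximum share of that limit, in percent
 */
static uint64_t bdi_dirty_cap(uint64_t mem_bytes, unsigned int dirty_ratio,
			      unsigned int max_ratio)
{
	uint64_t global_limit;

	assert(dirty_ratio <= 100 && max_ratio <= 100);

	/* Divide before multiplying so large memory sizes cannot overflow. */
	global_limit = mem_bytes / 100 * dirty_ratio;
	return global_limit / 100 * max_ratio;
}
```

With 8 GB of RAM (decimal), a dirty ratio of 20 and max_ratio set to 5,
bdi_dirty_cap(8000000000ULL, 20, 5) gives 80000000 bytes, which is the
80 MB figure above. The intermediate global limit is the ~1.6 GB.
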
==== CUT HERE ====
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 3a76faf..e2fbfee 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -328,18 +328,22 @@ alloc_pages(gfp_t gfp_mask, unsigned int order)
 }
 extern struct page *alloc_pages_vma(gfp_t gfp_mask, int order,
 			struct vm_area_struct *vma, unsigned long addr,
-			int node);
+			int node, bool drop_mmapsem);
 #else
 #define alloc_pages(gfp_mask, order) \
 		alloc_pages_node(numa_node_id(), gfp_mask, order)
-#define alloc_pages_vma(gfp_mask, order, vma, addr, node)	\
-	alloc_pages(gfp_mask, order)
+#define alloc_pages_vma(gfp_mask, order, vma, addr, node, drop_mmapsem)	\
+	({								\
+		if (drop_mmapsem)					\
+			up_read(&vma->vm_mm->mmap_sem);			\
+		alloc_pages(gfp_mask, order);				\
+	})
 #endif
 #define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
 #define alloc_page_vma(gfp_mask, vma, addr)			\
-	alloc_pages_vma(gfp_mask, 0, vma, addr, numa_node_id())
+	alloc_pages_vma(gfp_mask, 0, vma, addr, numa_node_id(), false)
 #define alloc_page_vma_node(gfp_mask, vma, addr, node)		\
-	alloc_pages_vma(gfp_mask, 0, vma, addr, node)
+	alloc_pages_vma(gfp_mask, 0, vma, addr, node, false)
 
 extern unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);
 extern unsigned long get_zeroed_page(gfp_t gfp_mask);
diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index 3a93f73..d5e7132 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -154,7 +154,7 @@ __alloc_zeroed_user_highpage(gfp_t movableflags,
 			unsigned long vaddr)
 {
 	struct page *page = alloc_page_vma(GFP_HIGHUSER | movableflags,
-			vma, vaddr);
+			vma, vaddr, false);
 
 	if (page)
 		clear_user_highpage(page, vaddr);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e2d1587..49449ea 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -655,10 +655,11 @@ static inline gfp_t alloc_hugepage_gfpmask(int defrag, gfp_t extra_gfp)
 static inline struct page *alloc_hugepage_vma(int defrag,
 					      struct vm_area_struct *vma,
 					      unsigned long haddr, int nd,
-					      gfp_t extra_gfp)
+					      gfp_t extra_gfp,
+					      bool drop_mmapsem)
 {
 	return alloc_pages_vma(alloc_hugepage_gfpmask(defrag, extra_gfp),
-			       HPAGE_PMD_ORDER, vma, haddr, nd);
+			       HPAGE_PMD_ORDER, vma, haddr, nd, drop_mmapsem);
 }
 
 #ifndef CONFIG_NUMA
@@ -683,7 +684,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			if (unlikely(khugepaged_enter(vma)))
 				return VM_FAULT_OOM;
 		page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
-					  vma, haddr, numa_node_id(), 0);
+					  vma, haddr, numa_node_id(), 0, false);
 		if (unlikely(!page)) {
 			count_vm_event(THP_FAULT_FALLBACK);
 			goto out;
@@ -911,7 +912,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (transparent_hugepage_enabled(vma) &&
 	    !transparent_hugepage_debug_cow())
 		new_page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
-					      vma, haddr, numa_node_id(), 0);
+					      vma, haddr, numa_node_id(), 0, false);
 	else
 		new_page = NULL;
 
@@ -1783,15 +1784,14 @@ static void collapse_huge_page(struct mm_struct *mm,
 	 * the userland I/O paths. Allocating memory with the
 	 * mmap_sem in read mode is good idea also to allow greater
 	 * scalability.
+	 *
+	 * alloc_pages_vma drops the mmap_sem so that if the process
+	 * faults or calls mmap then khugepaged will not stall it.
+	 * The mmap_sem is taken for write later to confirm the VMA
+	 * is still valid
 	 */
 	new_page = alloc_hugepage_vma(khugepaged_defrag(), vma, address,
-				      node, __GFP_OTHER_NODE);
-
-	/*
-	 * After allocating the hugepage, release the mmap_sem read lock in
-	 * preparation for taking it in write mode.
-	 */
-	up_read(&mm->mmap_sem);
+				      node, __GFP_OTHER_NODE, true);
 	if (unlikely(!new_page)) {
 		count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
 		*hpage = ERR_PTR(-ENOMEM);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 9c51f9f..1a8c676 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1832,7 +1832,7 @@ static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
  */
 struct page *
 alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
-		unsigned long addr, int node)
+		unsigned long addr, int node, bool drop_mmapsem)
 {
 	struct mempolicy *pol = get_vma_policy(current, vma, addr);
 	struct zonelist *zl;
@@ -1844,16 +1844,21 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 
 		nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order);
 		mpol_cond_put(pol);
+		if (drop_mmapsem)
+			up_read(&vma->vm_mm->mmap_sem);
 		page = alloc_page_interleave(gfp, order, nid);
 		put_mems_allowed();
 		return page;
 	}
 	zl = policy_zonelist(gfp, pol, node);
 	if (unlikely(mpol_needs_cond_ref(pol))) {
+		struct page *page;
 		/*
 		 * slow path: ref counted shared policy
 		 */
-		struct page *page = __alloc_pages_nodemask(gfp, order,
+		if (drop_mmapsem)
+			up_read(&vma->vm_mm->mmap_sem);
+		page = __alloc_pages_nodemask(gfp, order,
 						zl, policy_nodemask(gfp, pol));
 		__mpol_put(pol);
 		put_mems_allowed();
@@ -1862,6 +1867,8 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 	/*
 	 * fast path: default or task policy
 	 */
+	if (drop_mmapsem)
+		up_read(&vma->vm_mm->mmap_sem);
 	page = __alloc_pages_nodemask(gfp, order, zl,
 				      policy_nodemask(gfp, pol));
 	put_mems_allowed();
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6e8ecb6..2f87f92 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2161,7 +2161,16 @@ rebalance:
 					sync_migration);
 	if (page)
 		goto got_pg;
-	sync_migration = true;
+
+	/*
+	 * Do not use sync migration for processes allocating transparent
+	 * hugepages as it could stall writing back pages which is far worse
+	 * than simply failing to promote a page. We still allow khugepaged
+	 * to allocate as it should drop the mmap_sem before trying to
+	 * allocate the page so it's acceptable for it to stall
+	 */
+	sync_migration = (current->flags & PF_KTHREAD) ||
+				!(gfp_mask & __GFP_NO_KSWAPD);
 
 	/* Try direct reclaim and then allocating */
 	page = __alloc_pages_direct_reclaim(gfp_mask, order,
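
To make the intent of the page_alloc.c hunk easier to check, the new
sync_migration condition can be modelled in plain C roughly as below.
This is only a sketch: the MODEL_* bit values are placeholders rather
than the kernel's definitions, model_use_sync_migration() is a made-up
name, and it assumes that transparent hugepage allocations are the ones
carrying the no-kswapd flag.

```c
#include <stdbool.h>

/* Placeholder bit values for this model only; not the kernel's values. */
#define MODEL_PF_KTHREAD	0x1u	/* task is a kernel thread, e.g. khugepaged */
#define MODEL_GFP_NO_KSWAPD	0x2u	/* assumed to be set for THP allocations */

/*
 * Hypothetical model of the patched condition: retry with synchronous
 * migration only for kernel threads, or for allocations that are not
 * transparent hugepage allocations.
 */
static bool model_use_sync_migration(unsigned int task_flags,
				     unsigned int gfp_mask)
{
	return (task_flags & MODEL_PF_KTHREAD) ||
	       !(gfp_mask & MODEL_GFP_NO_KSWAPD);
}
```

Under this model a userspace task faulting in a hugepage stays on async
migration, khugepaged may still go synchronous, and allocations without
the no-kswapd flag behave as before.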