From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pb0-f53.google.com (mail-pb0-f53.google.com [209.85.160.53]) by kanga.kvack.org (Postfix) with ESMTP id 292C66B0036 for ; Thu, 6 Mar 2014 16:16:17 -0500 (EST) Received: by mail-pb0-f53.google.com with SMTP id rp16so3142069pbb.26 for ; Thu, 06 Mar 2014 13:16:16 -0800 (PST) Received: from userp1040.oracle.com (userp1040.oracle.com. [156.151.31.81]) by mx.google.com with ESMTPS id e2si6098565pba.181.2014.03.06.13.16.15 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Thu, 06 Mar 2014 13:16:16 -0800 (PST) Message-ID: <5318E4BC.50301@oracle.com> Date: Thu, 06 Mar 2014 16:12:28 -0500 From: Sasha Levin MIME-Version: 1.0 Subject: mm: kernel BUG at mm/mprotect.c:149 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Rik van Riel Cc: David Rientjes , Andrew Morton , "linux-mm@kvack.org" , LKML Hi Rik, While fuzzing with trinity inside a KVM tools guest running latest -next kernel I've hit the following spew. This seems to be introduced by your patch "mm,numa: reorganize change_pmd_range()". [ 886.745765] kernel BUG at mm/mprotect.c:149! [ 886.746831] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC [ 886.748511] Dumping ftrace buffer: [ 886.749998] (ftrace buffer empty) [ 886.750110] Modules linked in: [ 886.751396] CPU: 20 PID: 26219 Comm: trinity-c216 Tainted: G W 3.14.0-rc5-next-20140305-sasha-00011-ge06f5f3-dirty #105 [ 886.751396] task: ffff8800b6c80000 ti: ffff880228436000 task.ti: ffff880228436000 [ 886.751396] RIP: 0010:[] [] change_protection_range+0x3b3/0x500 [ 886.751396] RSP: 0000:ffff880228437da8 EFLAGS: 00010282 [ 886.751396] RAX: 8000000527c008e5 RBX: 00007f647916e000 RCX: 0000000000000000 [ 886.751396] RDX: ffff8802ef488e40 RSI: 00007f6479000000 RDI: 8000000527c008e5 [ 886.751396] RBP: ffff880228437e78 R08: 0000000000000000 R09: 0000000000000000 [ 886.751396] R10: 0000000000000001 R11: 0000000000000000 R12: ffff8802ef488e40 [ 886.751396] R13: 00007f6479000000 R14: 00007f647916e000 R15: 00007f646e34e000 [ 886.751396] FS: 00007f64b28d4700(0000) GS:ffff88052ba00000(0000) knlGS:0000000000000000 [ 886.751396] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 886.751396] CR2: 00007f64aed83af8 CR3: 0000000206e52000 CR4: 00000000000006a0 [ 886.751396] Stack: [ 886.751396] ffff880200000001 ffffffff8447f152 ffff880228437dd8 ffff880228af3000 [ 886.769142] 00007f646e34e000 00007f647916dfff 0000000000000000 00007f647916e000 [ 886.769142] ffff880206e527f0 00007f647916dfff 0000000000000000 0000000000000000 [ 886.769142] Call Trace: [ 886.769142] [] ? preempt_count_sub+0xe2/0x120 [ 886.769142] [] change_protection+0x25/0x30 [ 886.769142] [] change_prot_numa+0x1b/0x30 [ 886.769142] [] task_numa_work+0x279/0x360 [ 886.769142] [] task_work_run+0xae/0xf0 [ 886.769142] [] do_notify_resume+0x8e/0xe0 [ 886.769142] [] retint_signal+0x4d/0x92 [ 886.769142] Code: 49 8b 3c 24 48 83 3d fc 2e ba 04 00 75 12 0f 0b 0f 1f 84 00 00 00 00 00 eb fe 66 0f 1f 44 00 00 48 89 f8 66 66 66 90 84 c0 79 0d <0f> 0b 0f 1f 00 eb fe 66 0f 1f 44 00 00 8b 4d 9c 44 8b 4d 98 89 [ 886.769142] RIP [] change_protection_range+0x3b3/0x500 [ 886.769142] RSP Thanks, Sasha -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qc0-f169.google.com (mail-qc0-f169.google.com [209.85.216.169]) by kanga.kvack.org (Postfix) with ESMTP id 81ED96B0031 for ; Thu, 6 Mar 2014 17:37:55 -0500 (EST) Received: by mail-qc0-f169.google.com with SMTP id i17so3829463qcy.28 for ; Thu, 06 Mar 2014 14:37:55 -0800 (PST) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTP id q1si3936728qab.111.2014.03.06.14.37.53 for ; Thu, 06 Mar 2014 14:37:54 -0800 (PST) Date: Thu, 6 Mar 2014 17:31:37 -0500 From: Rik van Riel Subject: [PATCH -mm] mm,numa,mprotect: always continue after finding a stable thp page Message-ID: <20140306173137.6a23a0b2@cuia.bos.redhat.com> In-Reply-To: <5318E4BC.50301@oracle.com> References: <5318E4BC.50301@oracle.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Sasha Levin Cc: David Rientjes , Andrew Morton , "linux-mm@kvack.org" , LKML , mgorman@suse.de, hhuang@redhat.com, knoel@redhat.com, aarcange@redhat.com On Thu, 06 Mar 2014 16:12:28 -0500 Sasha Levin wrote: > While fuzzing with trinity inside a KVM tools guest running latest -next kernel I've hit the > following spew. This seems to be introduced by your patch "mm,numa: reorganize change_pmd_range()". That patch should not introduce any functional changes, except for the VM_BUG_ON that catches the fact that we fell through to the 4kB pte handling code, despite having just handled a THP pmd... Does this patch fix the issue? Mel, am I overlooking anything obvious? :) ---8<--- Subject: mm,numa,mprotect: always continue after finding a stable thp page When turning a thp pmds into a NUMA one, change_huge_pmd will return 0 when the pmd already is a NUMA pmd. However, change_pmd_range would fall through to the code that handles 4kB pages, instead of continuing on to the next pmd. Signed-off-by: Rik van Riel Reported-by: Sasha Levin Cc: Mel Gorman --- mm/mprotect.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/mprotect.c b/mm/mprotect.c index 61f0a07..4746608 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -138,8 +138,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma, pages += HPAGE_PMD_NR; nr_huge_updates++; } - continue; } + continue; } /* fall through, the trans huge pmd just split */ } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qa0-f41.google.com (mail-qa0-f41.google.com [209.85.216.41]) by kanga.kvack.org (Postfix) with ESMTP id CA1566B0035 for ; Thu, 6 Mar 2014 17:57:36 -0500 (EST) Received: by mail-qa0-f41.google.com with SMTP id j5so3328865qaq.14 for ; Thu, 06 Mar 2014 14:57:36 -0800 (PST) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTP id v2si3984615qaf.6.2014.03.06.14.57.35 for ; Thu, 06 Mar 2014 14:57:36 -0800 (PST) Message-ID: <5318FC3F.4080204@redhat.com> Date: Thu, 06 Mar 2014 17:52:47 -0500 From: Rik van Riel MIME-Version: 1.0 Subject: Re: [PATCH -mm] mm,numa,mprotect: always continue after finding a stable thp page References: <5318E4BC.50301@oracle.com> <20140306173137.6a23a0b2@cuia.bos.redhat.com> In-Reply-To: <20140306173137.6a23a0b2@cuia.bos.redhat.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Sasha Levin Cc: David Rientjes , Andrew Morton , "linux-mm@kvack.org" , LKML , mgorman@suse.de, hhuang@redhat.com, knoel@redhat.com, aarcange@redhat.com On 03/06/2014 05:31 PM, Rik van Riel wrote: > On Thu, 06 Mar 2014 16:12:28 -0500 > Sasha Levin wrote: > >> While fuzzing with trinity inside a KVM tools guest running latest -next kernel I've hit the >> following spew. This seems to be introduced by your patch "mm,numa: reorganize change_pmd_range()". > > That patch should not introduce any functional changes, except for > the VM_BUG_ON that catches the fact that we fell through to the 4kB > pte handling code, despite having just handled a THP pmd... > > Does this patch fix the issue? > > Mel, am I overlooking anything obvious? :) > > ---8<--- > > Subject: mm,numa,mprotect: always continue after finding a stable thp page > > When turning a thp pmds into a NUMA one, change_huge_pmd will > return 0 when the pmd already is a NUMA pmd. I did miss something obvious. In this case, the code returns 1. > However, change_pmd_range would fall through to the code that > handles 4kB pages, instead of continuing on to the next pmd. Maybe the case that I missed is when khugepaged is in the process of collapsing pages into a transparent huge page? If the virtual CPU gets de-scheduled by the host for long enough, it would be possible for khugepaged to run on another virtual CPU, and turn the pmd into a THP pmd, before that VM_BUG_ON test. I see that khugepaged takes the mmap_sem for writing in the collapse code, and it looks like task_numa_work takes the mmap_sem for reading, so I guess that may not be possible? Andrea, would you happen to know what case am I missing? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qc0-f180.google.com (mail-qc0-f180.google.com [209.85.216.180]) by kanga.kvack.org (Postfix) with ESMTP id 16A1E6B0031 for ; Thu, 6 Mar 2014 18:23:32 -0500 (EST) Received: by mail-qc0-f180.google.com with SMTP id x3so3826840qcv.25 for ; Thu, 06 Mar 2014 15:23:31 -0800 (PST) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTP id 108si3989080qgr.134.2014.03.06.15.23.31 for ; Thu, 06 Mar 2014 15:23:31 -0800 (PST) Date: Thu, 6 Mar 2014 17:31:12 -0500 From: Rik van Riel Subject: [PATCH -mm] mm,numa,mprotect: always continue after finding a stable thp page Message-ID: <20140306173112.3bd6802b@cuia.bos.redhat.com> In-Reply-To: <5318E4BC.50301@oracle.com> References: <5318E4BC.50301@oracle.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Sasha Levin Cc: David Rientjes , Andrew Morton , "linux-mm@kvack.org" , LKML , mgorman@suse.de, hhuang@redhat.com, knoel@redhat.com, aarcange@redhat.com On Thu, 06 Mar 2014 16:12:28 -0500 Sasha Levin wrote: > While fuzzing with trinity inside a KVM tools guest running latest -next kernel I've hit the > following spew. This seems to be introduced by your patch "mm,numa: reorganize change_pmd_range()". That patch should not introduce any functional changes, except for the VM_BUG_ON that catches the fact that we fell through to the 4kB pte handling code, despite having just handled a THP pmd... Does this patch fix the issue? Mel, am I overlooking anything obvious? :) ---8<--- Subject: mm,numa,mprotect: always continue after finding a stable thp page When turning a thp pmds into a NUMA one, change_huge_pmd will return 0 when the pmd already is a NUMA pmd. However, change_pmd_range would fall through to the code that handles 4kB pages, instead of continuing on to the next pmd. Signed-off-by: Rik van Riel Reported-by: Sasha Levin Cc: Mel Gorman --- mm/mprotect.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/mprotect.c b/mm/mprotect.c index 61f0a07..4746608 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -138,8 +138,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma, pages += HPAGE_PMD_NR; nr_huge_updates++; } - continue; } + continue; } /* fall through, the trans huge pmd just split */ } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pd0-f182.google.com (mail-pd0-f182.google.com [209.85.192.182]) by kanga.kvack.org (Postfix) with ESMTP id 4A91C6B0031 for ; Thu, 6 Mar 2014 20:06:17 -0500 (EST) Received: by mail-pd0-f182.google.com with SMTP id g10so3305795pdj.13 for ; Thu, 06 Mar 2014 17:06:16 -0800 (PST) Received: from userp1040.oracle.com (userp1040.oracle.com. [156.151.31.81]) by mx.google.com with ESMTPS id tk9si6554470pac.64.2014.03.06.17.06.15 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Thu, 06 Mar 2014 17:06:15 -0800 (PST) Message-ID: <53191B05.20102@oracle.com> Date: Thu, 06 Mar 2014 20:04:05 -0500 From: Sasha Levin MIME-Version: 1.0 Subject: Re: [PATCH -mm] mm,numa,mprotect: always continue after finding a stable thp page References: <5318E4BC.50301@oracle.com> <20140306173137.6a23a0b2@cuia.bos.redhat.com> In-Reply-To: <20140306173137.6a23a0b2@cuia.bos.redhat.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Rik van Riel Cc: David Rientjes , Andrew Morton , "linux-mm@kvack.org" , LKML , mgorman@suse.de, hhuang@redhat.com, knoel@redhat.com, aarcange@redhat.com On 03/06/2014 05:31 PM, Rik van Riel wrote: > On Thu, 06 Mar 2014 16:12:28 -0500 > Sasha Levin wrote: > >> >While fuzzing with trinity inside a KVM tools guest running latest -next kernel I've hit the >> >following spew. This seems to be introduced by your patch "mm,numa: reorganize change_pmd_range()". > That patch should not introduce any functional changes, except for > the VM_BUG_ON that catches the fact that we fell through to the 4kB > pte handling code, despite having just handled a THP pmd... > > Does this patch fix the issue? I'm seeing a different issue with this patch: [ 625.886532] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018 [ 625.888056] IP: [] __lock_acquire+0xbc/0x580 [ 625.888969] PGD 427842067 PUD 41b321067 PMD 0 [ 625.889775] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC [ 625.890773] Dumping ftrace buffer: [ 625.891454] (ftrace buffer empty) [ 625.892041] Modules linked in: [ 625.892610] CPU: 9 PID: 18991 Comm: trinity-c198 Tainted: G W 3.14.0-rc5-next-20140305-sasha-00012-g00c5c8f-dirty #110 [ 625.894293] task: ffff8804278cb000 ti: ffff88041d89a000 task.ti: ffff88041d89a000 [ 625.895678] RIP: 0010:[] [] __lock_acquire+0xbc/0x580 [ 625.896232] RSP: 0018:ffff88041d89bbe8 EFLAGS: 00010002 [ 625.896232] RAX: 0000000000000082 RBX: 0000000000000018 RCX: 0000000000000000 [ 625.896232] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000018 [ 625.896232] RBP: ffff88041d89bc58 R08: 0000000000000001 R09: 0000000000000000 [ 625.896232] R10: 0000000000000001 R11: 0000000000000001 R12: ffff8804278cb000 [ 625.896232] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001 [ 625.896232] FS: 00007f144ab81700(0000) GS:ffff880a2b800000(0000) knlGS:0000000000000000 [ 625.896232] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 625.896232] CR2: 0000000000000018 CR3: 00000004290cf000 CR4: 00000000000006a0 [ 625.896232] Stack: [ 625.896232] ffff8802295abc18 ffff8804278cb000 ffff880a2b9d84c0 ffff880a2b9d84d0 [ 625.896232] ffff88041d89bc18 ffffffff810bb084 ffff88041d89bc28 ffffffff8107912d [ 625.896232] ffff88041d89bc58 ffff8804278cb000 0000000000000000 0000000000000001 [ 625.896232] Call Trace: [ 625.896232] [] ? kvm_clock_read+0x24/0x50 [ 625.896232] [] ? sched_clock+0x1d/0x30 [ 625.896232] [] lock_acquire+0x182/0x1d0 [ 625.896232] [] ? change_pte_range+0xa3/0x3b0 [ 625.896232] [] ? sched_clock+0x1d/0x30 [ 625.896232] [] _raw_spin_lock+0x40/0x80 [ 625.896232] [] ? change_pte_range+0xa3/0x3b0 [ 625.896232] [] change_pte_range+0xa3/0x3b0 [ 625.896232] [] change_protection_range+0x3a8/0x4d0 [ 625.896232] [] ? preempt_count_sub+0xe2/0x120 [ 625.896232] [] change_protection+0x25/0x30 [ 625.896232] [] change_prot_numa+0x1b/0x30 [ 625.896232] [] task_numa_work+0x279/0x360 [ 625.896232] [] task_work_run+0xae/0xf0 [ 625.896232] [] do_notify_resume+0x8e/0xe0 [ 625.896232] [] int_signal+0x12/0x17 [ 625.896232] Code: c2 37 f6 6b 85 be fa 0b 00 00 48 c7 c7 a1 4f 6c 85 e8 09 60 f9 ff 31 c0 e9 9c 04 00 00 66 90 44 8b 1d 39 1f cd 04 45 85 db 74 0c <48> 81 3b 40 52 76 87 75 06 0f 1f 00 45 31 c0 83 fe 01 77 0c 89 [ 625.896232] RIP [] __lock_acquire+0xbc/0x580 [ 625.896232] RSP [ 625.896232] CR2: 0000000000000018 Thanks, Sasha -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wg0-f41.google.com (mail-wg0-f41.google.com [74.125.82.41]) by kanga.kvack.org (Postfix) with ESMTP id 880C66B0031 for ; Fri, 7 Mar 2014 09:06:57 -0500 (EST) Received: by mail-wg0-f41.google.com with SMTP id n12so5083009wgh.24 for ; Fri, 07 Mar 2014 06:06:56 -0800 (PST) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id xb3si5126168wjc.34.2014.03.07.06.06.54 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Fri, 07 Mar 2014 06:06:55 -0800 (PST) Date: Fri, 7 Mar 2014 14:06:50 +0000 From: Mel Gorman Subject: Re: [PATCH -mm] mm,numa,mprotect: always continue after finding a stable thp page Message-ID: <20140307140650.GA1931@suse.de> References: <5318E4BC.50301@oracle.com> <20140306173137.6a23a0b2@cuia.bos.redhat.com> <5318FC3F.4080204@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <5318FC3F.4080204@redhat.com> Sender: owner-linux-mm@kvack.org List-ID: To: Rik van Riel Cc: Sasha Levin , David Rientjes , Andrew Morton , "linux-mm@kvack.org" , LKML , hhuang@redhat.com, knoel@redhat.com, aarcange@redhat.com On Thu, Mar 06, 2014 at 05:52:47PM -0500, Rik van Riel wrote: > On 03/06/2014 05:31 PM, Rik van Riel wrote: > >On Thu, 06 Mar 2014 16:12:28 -0500 > >Sasha Levin wrote: > > > >>While fuzzing with trinity inside a KVM tools guest running latest -next kernel I've hit the > >>following spew. This seems to be introduced by your patch "mm,numa: reorganize change_pmd_range()". > > > >That patch should not introduce any functional changes, except for > >the VM_BUG_ON that catches the fact that we fell through to the 4kB > >pte handling code, despite having just handled a THP pmd... > > > >Does this patch fix the issue? > > > >Mel, am I overlooking anything obvious? :) > > > >---8<--- > > > >Subject: mm,numa,mprotect: always continue after finding a stable thp page > > > >When turning a thp pmds into a NUMA one, change_huge_pmd will > >return 0 when the pmd already is a NUMA pmd. > > I did miss something obvious. In this case, the code returns 1. > > >However, change_pmd_range would fall through to the code that > >handles 4kB pages, instead of continuing on to the next pmd. > > Maybe the case that I missed is when khugepaged is in the > process of collapsing pages into a transparent huge page? > > If the virtual CPU gets de-scheduled by the host for long > enough, it would be possible for khugepaged to run on > another virtual CPU, and turn the pmd into a THP pmd, > before that VM_BUG_ON test. > > I see that khugepaged takes the mmap_sem for writing in the > collapse code, and it looks like task_numa_work takes the > mmap_sem for reading, so I guess that may not be possible? > mmap_sem will prevent a parallel collapse but what prevents something like the following? do_huge_pmd_wp_page change_pmd_range if (!pmd_trans_huge(*pmd) && pmd_none_or_clear_bad(pmd)) continue; pmdp_clear_flush(vma, haddr, pmd); if (pmd_trans_huge(*pmd)) { .... path not taken .... } page_add_new_anon_rmap(new_page, vma, haddr); set_pmd_at(mm, haddr, pmd, entry); VM_BUG_ON(pmd_trans_huge(*pmd)); We do not hold the page table lock during the pmd_trans_huge check and we do not recheck it under PTF lock in change_pte_range() -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-we0-f175.google.com (mail-we0-f175.google.com [74.125.82.175]) by kanga.kvack.org (Postfix) with ESMTP id E81DA6B0031 for ; Fri, 7 Mar 2014 10:09:29 -0500 (EST) Received: by mail-we0-f175.google.com with SMTP id q58so5055988wes.34 for ; Fri, 07 Mar 2014 07:09:29 -0800 (PST) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id hh8si9737457wjc.166.2014.03.07.07.09.28 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Fri, 07 Mar 2014 07:09:28 -0800 (PST) Date: Fri, 7 Mar 2014 15:09:23 +0000 From: Mel Gorman Subject: Re: [PATCH -mm] mm,numa,mprotect: always continue after finding a stable thp page Message-ID: <20140307150923.GB1931@suse.de> References: <5318E4BC.50301@oracle.com> <20140306173137.6a23a0b2@cuia.bos.redhat.com> <5318FC3F.4080204@redhat.com> <20140307140650.GA1931@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20140307140650.GA1931@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Rik van Riel Cc: Sasha Levin , David Rientjes , Andrew Morton , "linux-mm@kvack.org" , LKML , hhuang@redhat.com, knoel@redhat.com, aarcange@redhat.com On Fri, Mar 07, 2014 at 02:06:50PM +0000, Mel Gorman wrote: > On Thu, Mar 06, 2014 at 05:52:47PM -0500, Rik van Riel wrote: > > On 03/06/2014 05:31 PM, Rik van Riel wrote: > > >On Thu, 06 Mar 2014 16:12:28 -0500 > > >Sasha Levin wrote: > > > > > >>While fuzzing with trinity inside a KVM tools guest running latest -next kernel I've hit the > > >>following spew. This seems to be introduced by your patch "mm,numa: reorganize change_pmd_range()". > > > > > >That patch should not introduce any functional changes, except for > > >the VM_BUG_ON that catches the fact that we fell through to the 4kB > > >pte handling code, despite having just handled a THP pmd... > > > > > >Does this patch fix the issue? > > > > > >Mel, am I overlooking anything obvious? :) > > > > > >---8<--- > > > > > >Subject: mm,numa,mprotect: always continue after finding a stable thp page > > > > > >When turning a thp pmds into a NUMA one, change_huge_pmd will > > >return 0 when the pmd already is a NUMA pmd. > > > > I did miss something obvious. In this case, the code returns 1. > > > > >However, change_pmd_range would fall through to the code that > > >handles 4kB pages, instead of continuing on to the next pmd. > > > > Maybe the case that I missed is when khugepaged is in the > > process of collapsing pages into a transparent huge page? > > > > If the virtual CPU gets de-scheduled by the host for long > > enough, it would be possible for khugepaged to run on > > another virtual CPU, and turn the pmd into a THP pmd, > > before that VM_BUG_ON test. > > > > I see that khugepaged takes the mmap_sem for writing in the > > collapse code, and it looks like task_numa_work takes the > > mmap_sem for reading, so I guess that may not be possible? > > > > mmap_sem will prevent a parallel collapse but what prevents something > like the following? > > do_huge_pmd_wp_page > change_pmd_range > if (!pmd_trans_huge(*pmd) && pmd_none_or_clear_bad(pmd)) > continue; > pmdp_clear_flush(vma, haddr, pmd); > if (pmd_trans_huge(*pmd)) { > .... path not taken .... > } > page_add_new_anon_rmap(new_page, vma, haddr); > set_pmd_at(mm, haddr, pmd, entry); > VM_BUG_ON(pmd_trans_huge(*pmd)); > > We do not hold the page table lock during the pmd_trans_huge check and we > do not recheck it under PTF lock in change_pte_range() > This is a completely untested prototype. It rechecks pmd_trans_huge under the lock and falls through if it hit a parallel split. It's not perfect because it could decide to fall through just because there was no prot_numa work to do but it's for illustration purposes. Secondly, I noted that you are calling invalidate for every pmd range. Is that not a lot of invalidations? We could do the same by just tracking the address of the first invalidation. diff --git a/mm/mprotect.c b/mm/mprotect.c index 2afc40e..a0050fc 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -46,6 +46,20 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, unsigned long pages = 0; pte = pte_offset_map_lock(mm, pmd, addr, &ptl); + + /* transhuge page could have faulted in parallel */ + if (unlikely(pmd_trans_huge(*pmd))) { + int nr_ptes; + + pte_unmap_unlock(pte, ptl); + nr_ptes = change_huge_pmd(vma, pmd, addr, newprot, prot_numa); + if (nr_ptes) + return nr_ptes; + + /* Page was split so retake the ptl and handle ptes */ + pte = pte_offset_map_lock(mm, pmd, addr, &ptl); + } + arch_enter_lazy_mmu_mode(); do { oldpte = *pte; @@ -106,14 +120,14 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, static inline unsigned long change_pmd_range(struct vm_area_struct *vma, pud_t *pud, unsigned long addr, unsigned long end, - pgprot_t newprot, int dirty_accountable, int prot_numa) + pgprot_t newprot, int dirty_accountable, int prot_numa, + unsigned long mni_start, unsigned long mni_end) { pmd_t *pmd; struct mm_struct *mm = vma->vm_mm; unsigned long next; unsigned long pages = 0; unsigned long nr_huge_updates = 0; - unsigned long mni_start = 0; pmd = pmd_offset(pud, addr); do { @@ -124,10 +138,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma, continue; /* invoke the mmu notifier if the pmd is populated */ - if (!mni_start) { - mni_start = addr; - mmu_notifier_invalidate_range_start(mm, mni_start, end); - } + if (!mni_start) + mmu_notifier_invalidate_range_start(mm, addr, mni_end); if (pmd_trans_huge(*pmd)) { if (next - addr != HPAGE_PMD_SIZE) @@ -141,8 +153,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma, pages += HPAGE_PMD_NR; nr_huge_updates++; } - continue; } + continue; } /* fall through, the trans huge pmd just split */ } @@ -152,9 +164,6 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma, pages += this_pages; } while (pmd++, addr = next, addr != end); - if (mni_start) - mmu_notifier_invalidate_range_end(mm, mni_start, end); - if (nr_huge_updates) count_vm_numa_events(NUMA_HUGE_PTE_UPDATES, nr_huge_updates); return pages; @@ -167,16 +176,25 @@ static inline unsigned long change_pud_range(struct vm_area_struct *vma, pud_t *pud; unsigned long next; unsigned long pages = 0; + unsigned long mni_start = 0; pud = pud_offset(pgd, addr); do { + unsigned long this_pages; next = pud_addr_end(addr, end); if (pud_none_or_clear_bad(pud)) continue; - pages += change_pmd_range(vma, pud, addr, next, newprot, - dirty_accountable, prot_numa); + this_pages = change_pmd_range(vma, pud, addr, next, newprot, + dirty_accountable, prot_numa, mni_start, + end); + if (this_pages) + mni_start = addr; + pages += this_pages; } while (pud++, addr = next, addr != end); + if (mni_start) + mmu_notifier_invalidate_range_end(vma->vm_mm, mni_start, end); + return pages; } -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-we0-f172.google.com (mail-we0-f172.google.com [74.125.82.172]) by kanga.kvack.org (Postfix) with ESMTP id 526B76B0031 for ; Fri, 7 Mar 2014 10:14:12 -0500 (EST) Received: by mail-we0-f172.google.com with SMTP id t61so5136032wes.3 for ; Fri, 07 Mar 2014 07:14:11 -0800 (PST) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id q5si9782959wjq.135.2014.03.07.07.14.01 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Fri, 07 Mar 2014 07:14:02 -0800 (PST) Date: Fri, 7 Mar 2014 15:13:57 +0000 From: Mel Gorman Subject: Re: [PATCH -mm] mm,numa,mprotect: always continue after finding a stable thp page Message-ID: <20140307151357.GC1931@suse.de> References: <5318E4BC.50301@oracle.com> <20140306173137.6a23a0b2@cuia.bos.redhat.com> <5318FC3F.4080204@redhat.com> <20140307140650.GA1931@suse.de> <20140307150923.GB1931@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20140307150923.GB1931@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Rik van Riel Cc: Sasha Levin , David Rientjes , Andrew Morton , "linux-mm@kvack.org" , LKML , hhuang@redhat.com, knoel@redhat.com, aarcange@redhat.com On Fri, Mar 07, 2014 at 03:09:23PM +0000, Mel Gorman wrote: > On Fri, Mar 07, 2014 at 02:06:50PM +0000, Mel Gorman wrote: > > On Thu, Mar 06, 2014 at 05:52:47PM -0500, Rik van Riel wrote: > > > On 03/06/2014 05:31 PM, Rik van Riel wrote: > > > >On Thu, 06 Mar 2014 16:12:28 -0500 > > > >Sasha Levin wrote: > > > > > > > >>While fuzzing with trinity inside a KVM tools guest running latest -next kernel I've hit the > > > >>following spew. This seems to be introduced by your patch "mm,numa: reorganize change_pmd_range()". > > > > > > > >That patch should not introduce any functional changes, except for > > > >the VM_BUG_ON that catches the fact that we fell through to the 4kB > > > >pte handling code, despite having just handled a THP pmd... > > > > > > > >Does this patch fix the issue? > > > > > > > >Mel, am I overlooking anything obvious? :) > > > > > > > >---8<--- > > > > > > > >Subject: mm,numa,mprotect: always continue after finding a stable thp page > > > > > > > >When turning a thp pmds into a NUMA one, change_huge_pmd will > > > >return 0 when the pmd already is a NUMA pmd. > > > > > > I did miss something obvious. In this case, the code returns 1. > > > > > > >However, change_pmd_range would fall through to the code that > > > >handles 4kB pages, instead of continuing on to the next pmd. > > > > > > Maybe the case that I missed is when khugepaged is in the > > > process of collapsing pages into a transparent huge page? > > > > > > If the virtual CPU gets de-scheduled by the host for long > > > enough, it would be possible for khugepaged to run on > > > another virtual CPU, and turn the pmd into a THP pmd, > > > before that VM_BUG_ON test. > > > > > > I see that khugepaged takes the mmap_sem for writing in the > > > collapse code, and it looks like task_numa_work takes the > > > mmap_sem for reading, so I guess that may not be possible? > > > > > > > mmap_sem will prevent a parallel collapse but what prevents something > > like the following? > > > > do_huge_pmd_wp_page > > change_pmd_range > > if (!pmd_trans_huge(*pmd) && pmd_none_or_clear_bad(pmd)) > > continue; > > pmdp_clear_flush(vma, haddr, pmd); > > if (pmd_trans_huge(*pmd)) { > > .... path not taken .... > > } > > page_add_new_anon_rmap(new_page, vma, haddr); > > set_pmd_at(mm, haddr, pmd, entry); > > VM_BUG_ON(pmd_trans_huge(*pmd)); > > > > We do not hold the page table lock during the pmd_trans_huge check and we > > do not recheck it under PTF lock in change_pte_range() > > > > This is a completely untested prototype. It rechecks pmd_trans_huge > under the lock and falls through if it hit a parallel split. It's not > perfect because it could decide to fall through just because there was > no prot_numa work to do but it's for illustration purposes. Nope, if there was no prot_numa work to do, it returns 1 so the check is ok. It's only if 0 is returned we need to fall through and really handle ptes. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-we0-f173.google.com (mail-we0-f173.google.com [74.125.82.173]) by kanga.kvack.org (Postfix) with ESMTP id 045B96B0031 for ; Fri, 7 Mar 2014 13:27:52 -0500 (EST) Received: by mail-we0-f173.google.com with SMTP id w61so5375161wes.4 for ; Fri, 07 Mar 2014 10:27:52 -0800 (PST) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id b18si10653102wjb.1.2014.03.07.10.27.49 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Fri, 07 Mar 2014 10:27:49 -0800 (PST) Date: Fri, 7 Mar 2014 18:27:45 +0000 From: Mel Gorman Subject: Re: [PATCH -mm] mm,numa,mprotect: always continue after finding a stable thp page Message-ID: <20140307182745.GD1931@suse.de> References: <5318E4BC.50301@oracle.com> <20140306173137.6a23a0b2@cuia.bos.redhat.com> <5318FC3F.4080204@redhat.com> <20140307140650.GA1931@suse.de> <20140307150923.GB1931@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20140307150923.GB1931@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Rik van Riel , Sasha Levin Cc: David Rientjes , Andrew Morton , "linux-mm@kvack.org" , LKML , hhuang@redhat.com, knoel@redhat.com, aarcange@redhat.com On Fri, Mar 07, 2014 at 03:09:23PM +0000, Mel Gorman wrote: > On Fri, Mar 07, 2014 at 02:06:50PM +0000, Mel Gorman wrote: > > On Thu, Mar 06, 2014 at 05:52:47PM -0500, Rik van Riel wrote: > > > On 03/06/2014 05:31 PM, Rik van Riel wrote: > > > >On Thu, 06 Mar 2014 16:12:28 -0500 > > > >Sasha Levin wrote: > > > > > > > >>While fuzzing with trinity inside a KVM tools guest running latest -next kernel I've hit the > > > >>following spew. This seems to be introduced by your patch "mm,numa: reorganize change_pmd_range()". > > > > > > > >That patch should not introduce any functional changes, except for > > > >the VM_BUG_ON that catches the fact that we fell through to the 4kB > > > >pte handling code, despite having just handled a THP pmd... > > > > > > > >Does this patch fix the issue? > > > > > > > >Mel, am I overlooking anything obvious? :) > > > > > > > >---8<--- > > > > > > > >Subject: mm,numa,mprotect: always continue after finding a stable thp page > > > > > > > >When turning a thp pmds into a NUMA one, change_huge_pmd will > > > >return 0 when the pmd already is a NUMA pmd. > > > > > > I did miss something obvious. In this case, the code returns 1. > > > > > > >However, change_pmd_range would fall through to the code that > > > >handles 4kB pages, instead of continuing on to the next pmd. > > > > > > Maybe the case that I missed is when khugepaged is in the > > > process of collapsing pages into a transparent huge page? > > > > > > If the virtual CPU gets de-scheduled by the host for long > > > enough, it would be possible for khugepaged to run on > > > another virtual CPU, and turn the pmd into a THP pmd, > > > before that VM_BUG_ON test. > > > > > > I see that khugepaged takes the mmap_sem for writing in the > > > collapse code, and it looks like task_numa_work takes the > > > mmap_sem for reading, so I guess that may not be possible? > > > > > > > mmap_sem will prevent a parallel collapse but what prevents something > > like the following? > > > > do_huge_pmd_wp_page > > change_pmd_range > > if (!pmd_trans_huge(*pmd) && pmd_none_or_clear_bad(pmd)) > > continue; > > pmdp_clear_flush(vma, haddr, pmd); > > if (pmd_trans_huge(*pmd)) { > > .... path not taken .... > > } > > page_add_new_anon_rmap(new_page, vma, haddr); > > set_pmd_at(mm, haddr, pmd, entry); > > VM_BUG_ON(pmd_trans_huge(*pmd)); > > > > We do not hold the page table lock during the pmd_trans_huge check and we > > do not recheck it under PTF lock in change_pte_range() > > > > This is a completely untested prototype. It rechecks pmd_trans_huge > under the lock and falls through if it hit a parallel split. It's not > perfect because it could decide to fall through just because there was > no prot_numa work to do but it's for illustration purposes. Secondly, > I noted that you are calling invalidate for every pmd range. Is that not > a lot of invalidations? We could do the same by just tracking the address > of the first invalidation. > And there were other minor issues. This is still untested but Sasha, can you try it out please? I discussed this with Rik on IRC for a bit and reckon this should be sufficient if the correct race has been identified. The race can only really happen for prot_numa updates and it's ok to bail on those updates if a race occurs because all we miss is a few hinting faults. That simplifies the patch considerably but throw in some comments to explain it diff --git a/mm/mprotect.c b/mm/mprotect.c index 2afc40e..72061a2 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -46,6 +46,17 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, unsigned long pages = 0; pte = pte_offset_map_lock(mm, pmd, addr, &ptl); + + /* + * For a prot_numa update we only hold mmap_sem for read so there is a + * potential race with faulting where a pmd was temporarily none so + * recheck it under the lock and bail if we race + */ + if (prot_numa && unlikely(pmd_trans_huge(*pmd))) { + pte_unmap_unlock(pte, ptl); + return 0; + } + arch_enter_lazy_mmu_mode(); do { oldpte = *pte; @@ -141,12 +152,13 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma, pages += HPAGE_PMD_NR; nr_huge_updates++; } + + /* huge pmd was handled */ continue; } } /* fall through, the trans huge pmd just split */ } - VM_BUG_ON(pmd_trans_huge(*pmd)); this_pages = change_pte_range(vma, pmd, addr, next, newprot, dirty_accountable, prot_numa); pages += this_pages; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wg0-f48.google.com (mail-wg0-f48.google.com [74.125.82.48]) by kanga.kvack.org (Postfix) with ESMTP id 2E5676B00A8 for ; Tue, 11 Mar 2014 12:28:50 -0400 (EDT) Received: by mail-wg0-f48.google.com with SMTP id l18so5926498wgh.31 for ; Tue, 11 Mar 2014 09:28:49 -0700 (PDT) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id cs8si1887018wib.70.2014.03.11.09.28.48 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Tue, 11 Mar 2014 09:28:48 -0700 (PDT) Date: Tue, 11 Mar 2014 16:28:45 +0000 From: Mel Gorman Subject: Re: [PATCH -mm] mm,numa,mprotect: always continue after finding a stable thp page Message-ID: <20140311162845.GA30604@suse.de> References: <5318E4BC.50301@oracle.com> <20140306173137.6a23a0b2@cuia.bos.redhat.com> <5318FC3F.4080204@redhat.com> <20140307140650.GA1931@suse.de> <20140307150923.GB1931@suse.de> <20140307182745.GD1931@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20140307182745.GD1931@suse.de> Sender: owner-linux-mm@kvack.org List-ID: To: Rik van Riel , Sasha Levin Cc: David Rientjes , Andrew Morton , "linux-mm@kvack.org" , LKML , hhuang@redhat.com, knoel@redhat.com, aarcange@redhat.com On Fri, Mar 07, 2014 at 06:27:45PM +0000, Mel Gorman wrote: > > This is a completely untested prototype. It rechecks pmd_trans_huge > > under the lock and falls through if it hit a parallel split. It's not > > perfect because it could decide to fall through just because there was > > no prot_numa work to do but it's for illustration purposes. Secondly, > > I noted that you are calling invalidate for every pmd range. Is that not > > a lot of invalidations? We could do the same by just tracking the address > > of the first invalidation. > > > > And there were other minor issues. This is still untested but Sasha, > can you try it out please? I discussed this with Rik on IRC for a bit and > reckon this should be sufficient if the correct race has been identified. > Any luck with this patch Sasha? It passed basic tests here but I had not seen the issue trigger either. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ob0-f169.google.com (mail-ob0-f169.google.com [209.85.214.169]) by kanga.kvack.org (Postfix) with ESMTP id 458236B00AA for ; Tue, 11 Mar 2014 12:59:41 -0400 (EDT) Received: by mail-ob0-f169.google.com with SMTP id va2so8688183obc.14 for ; Tue, 11 Mar 2014 09:59:40 -0700 (PDT) Received: from aserp1040.oracle.com (aserp1040.oracle.com. [141.146.126.69]) by mx.google.com with ESMTPS id e6si21888996oen.101.2014.03.11.09.59.40 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Tue, 11 Mar 2014 09:59:40 -0700 (PDT) Message-ID: <531F3F15.8050206@oracle.com> Date: Tue, 11 Mar 2014 12:51:33 -0400 From: Sasha Levin MIME-Version: 1.0 Subject: Re: [PATCH -mm] mm,numa,mprotect: always continue after finding a stable thp page References: <5318E4BC.50301@oracle.com> <20140306173137.6a23a0b2@cuia.bos.redhat.com> <5318FC3F.4080204@redhat.com> <20140307140650.GA1931@suse.de> <20140307150923.GB1931@suse.de> <20140307182745.GD1931@suse.de> <20140311162845.GA30604@suse.de> In-Reply-To: <20140311162845.GA30604@suse.de> Content-Type: text/plain; charset=ISO-8859-15; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman , Rik van Riel Cc: David Rientjes , Andrew Morton , "linux-mm@kvack.org" , LKML , hhuang@redhat.com, knoel@redhat.com, aarcange@redhat.com On 03/11/2014 12:28 PM, Mel Gorman wrote: > On Fri, Mar 07, 2014 at 06:27:45PM +0000, Mel Gorman wrote: >>> This is a completely untested prototype. It rechecks pmd_trans_huge >>> under the lock and falls through if it hit a parallel split. It's not >>> perfect because it could decide to fall through just because there was >>> no prot_numa work to do but it's for illustration purposes. Secondly, >>> I noted that you are calling invalidate for every pmd range. Is that not >>> a lot of invalidations? We could do the same by just tracking the address >>> of the first invalidation. >>> >> >> And there were other minor issues. This is still untested but Sasha, >> can you try it out please? I discussed this with Rik on IRC for a bit and >> reckon this should be sufficient if the correct race has been identified. >> > > Any luck with this patch Sasha? It passed basic tests here but I had not > seen the issue trigger either. > Sorry, I've been stuck in my weekend project of getting lockdep to work with page locks :) It takes a moment to test, so just to be sure - I should have only this last patch applied? Without the one in the original mail? Thanks, Sasha -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ob0-f173.google.com (mail-ob0-f173.google.com [209.85.214.173]) by kanga.kvack.org (Postfix) with ESMTP id EF8CB6B00B3 for ; Tue, 11 Mar 2014 13:36:16 -0400 (EDT) Received: by mail-ob0-f173.google.com with SMTP id gq1so8762210obb.18 for ; Tue, 11 Mar 2014 10:36:16 -0700 (PDT) Received: from userp1040.oracle.com (userp1040.oracle.com. [156.151.31.81]) by mx.google.com with ESMTPS id d7si13960825oeh.95.2014.03.11.10.36.16 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Tue, 11 Mar 2014 10:36:16 -0700 (PDT) Message-ID: <531F48CC.303@oracle.com> Date: Tue, 11 Mar 2014 13:33:00 -0400 From: Sasha Levin MIME-Version: 1.0 Subject: Re: [PATCH -mm] mm,numa,mprotect: always continue after finding a stable thp page References: <5318E4BC.50301@oracle.com> <20140306173137.6a23a0b2@cuia.bos.redhat.com> <5318FC3F.4080204@redhat.com> <20140307140650.GA1931@suse.de> <20140307150923.GB1931@suse.de> <20140307182745.GD1931@suse.de> <20140311162845.GA30604@suse.de> <531F3F15.8050206@oracle.com> <531F4128.8020109@redhat.com> In-Reply-To: <531F4128.8020109@redhat.com> Content-Type: text/plain; charset=ISO-8859-15; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Rik van Riel , Mel Gorman Cc: David Rientjes , Andrew Morton , "linux-mm@kvack.org" , LKML , hhuang@redhat.com, knoel@redhat.com, aarcange@redhat.com On 03/11/2014 01:00 PM, Rik van Riel wrote: > On 03/11/2014 12:51 PM, Sasha Levin wrote: >> On 03/11/2014 12:28 PM, Mel Gorman wrote: >>> On Fri, Mar 07, 2014 at 06:27:45PM +0000, Mel Gorman wrote: >>>>> This is a completely untested prototype. It rechecks pmd_trans_huge >>>>> under the lock and falls through if it hit a parallel split. It's not >>>>> perfect because it could decide to fall through just because there was >>>>> no prot_numa work to do but it's for illustration purposes. Secondly, >>>>> I noted that you are calling invalidate for every pmd range. Is that >>>>> not >>>>> a lot of invalidations? We could do the same by just tracking the >>>>> address >>>>> of the first invalidation. >>>>> >>>> >>>> And there were other minor issues. This is still untested but Sasha, >>>> can you try it out please? I discussed this with Rik on IRC for a bit >>>> and >>>> reckon this should be sufficient if the correct race has been >>>> identified. >>>> >>> >>> Any luck with this patch Sasha? It passed basic tests here but I had not >>> seen the issue trigger either. >>> >> >> Sorry, I've been stuck in my weekend project of getting lockdep to work >> with page locks :) >> >> It takes a moment to test, so just to be sure - I should have only this >> last patch applied? >> Without the one in the original mail? > > Indeed, only this patch should do it. Okay. So just this patch on top of the latest -next shows the following issues: 1. BUG in task_numa_work: [ 439.417171] BUG: unable to handle kernel paging request at ffff880e17530c00 [ 439.418216] IP: [] vmacache_find+0x75/0xa0 [ 439.419073] PGD 8904067 PUD 1028fcb067 PMD 1028f10067 PTE 8000000e17530060 [ 439.420340] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC [ 439.420340] Dumping ftrace buffer: [ 439.420340] (ftrace buffer empty) [ 439.420340] Modules linked in: [ 439.420340] CPU: 12 PID: 9937 Comm: trinity-c212 Tainted: G W 3.14.0-rc5-next-20140307-sasha-00009-g3b24300-dirty #137 [ 439.420340] task: ffff880e1a45b000 ti: ffff880e1a490000 task.ti: ffff880e1a490000 [ 439.420340] RIP: 0010:[] [] vmacache_find+0x75/0xa0 [ 439.420340] RSP: 0018:ffff880e1a491e68 EFLAGS: 00010286 [ 439.420340] RAX: ffff880e17530c00 RBX: 0000000000000000 RCX: 0000000000000001 [ 439.420340] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff880e1a45b000 [ 439.420340] RBP: ffff880e1a491e68 R08: 0000000000000000 R09: 0000000000000000 [ 439.420340] R10: 0000000000000001 R11: 0000000000000000 R12: ffff880e1ab75000 [ 439.420340] R13: ffff880e1ab75000 R14: 0000000000010000 R15: 0000000000000000 [ 439.420340] FS: 00007f3458c05700(0000) GS:ffff880d2b800000(0000) knlGS:0000000000000000 [ 439.420340] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 439.420340] CR2: ffff880e17530c00 CR3: 0000000e1a472000 CR4: 00000000000006a0 [ 439.420340] Stack: [ 439.420340] ffff880e1a491e98 ffffffff812a7610 ffffffff8118de40 0000000000000117 [ 439.420340] 00000001000036da ffff880e1a45b000 ffff880e1a491ef8 ffffffff8118de4b [ 439.420340] ffff880e1a491ec8 ffffffff81269575 ffff880e1ab750a8 ffff880e1a45b000 [ 439.420340] Call Trace: [ 439.420340] [] find_vma+0x20/0x90 [ 439.420340] [] ? task_numa_work+0x130/0x360 [ 439.420340] [] task_numa_work+0x13b/0x360 [ 439.420340] [] ? context_tracking_user_exit+0x195/0x1d0 [ 439.420340] [] task_work_run+0xae/0xf0 [ 439.420340] [] do_notify_resume+0x8e/0xe0 [ 439.420340] [] int_signal+0x12/0x17 [ 439.420340] Code: 42 10 00 00 00 00 48 c7 42 18 00 00 00 00 eb 38 66 0f 1f 44 00 00 31 d2 48 89 c7 48 63 ca 48 8b 84 cf b8 07 00 00 48 85 c0 74 0b <48> 39 30 77 06 48 3b 70 08 72 12 ff c2 83 fa 04 75 de 66 0f 1f [ 439.420340] RIP [] vmacache_find+0x75/0xa0 [ 439.420340] RSP [ 439.420340] CR2: ffff880e17530c00 2. Similar to the above, but with a different trace: [ 304.212158] BUG: unable to handle kernel paging request at ffff88020d37f800 [ 304.220420] IP: [] vmacache_find+0x75/0xa0 [ 304.220420] PGD 8904067 PUD 102effb067 PMD 102ef91067 PTE 800000020d37f060 [ 304.220420] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC [ 304.220420] Dumping ftrace buffer: [ 304.220420] (ftrace buffer empty) [ 304.220420] Modules linked in: [ 304.220420] CPU: 30 PID: 11157 Comm: trinity-c393 Tainted: G B W 3.14.0-rc5-next-20140307-sasha-00009-g3b24300-dirty #137 [ 304.220420] task: ffff881020a90000 ti: ffff88101fda2000 task.ti: ffff88101fda2000 [ 304.260466] RIP: 0010:[] [] vmacache_find+0x75/0xa0 [ 304.260466] RSP: 0000:ffff88101fda3da8 EFLAGS: 00010286 [ 304.260466] RAX: ffff88020d37f800 RBX: 00007f7f92f680fc RCX: 0000000000000000 [ 304.260466] RDX: 0000000000000000 RSI: 00007f7f92f680fc RDI: ffff881020a90000 [ 304.260466] RBP: ffff88101fda3da8 R08: 0000000000000001 R09: 0000000000000000 [ 304.260466] R10: 0000000000000000 R11: 0000000000000000 R12: ffff881021358000 [ 304.260466] R13: 0000000000000000 R14: ffff8810213580a8 R15: ffff881021358000 [ 304.260466] FS: 00007f7f92f8d700(0000) GS:ffff880f2ba00000(0000) knlGS:0000000000000000 [ 304.260466] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 304.260466] CR2: ffff88020d37f800 CR3: 000000101fd6b000 CR4: 00000000000006a0 [ 304.260466] Stack: [ 304.260466] ffff88101fda3dd8 ffffffff812a7610 ffffffff844abd82 00007f7f92f680fc [ 304.260466] 00007f7f92f680fc 00000000000000a8 ffff88101fda3ef8 ffffffff844abdd6 [ 304.260466] ffff88101fda3e08 ffffffff811a60d6 0000000000000000 ffff881020a90000 [ 304.260466] Call Trace: [ 304.260466] [] find_vma+0x20/0x90 [ 304.260466] [] ? __do_page_fault+0x302/0x5d0 [ 304.260466] [] __do_page_fault+0x356/0x5d0 [ 304.260466] [] ? trace_hardirqs_off_caller+0x16/0x1a0 [ 304.260466] [] ? vtime_account_user+0x96/0xb0 [ 304.260466] [] ? preempt_count_sub+0xe2/0x120 [ 304.260466] [] ? context_tracking_user_exit+0x187/0x1d0 [ 304.260466] [] ? trace_hardirqs_off_caller+0x16/0x1a0 [ 304.260466] [] do_page_fault+0x45/0x70 [ 304.260466] [] do_async_page_fault+0x36/0x100 [ 304.260466] [] async_page_fault+0x28/0x30 [ 304.260466] Code: 42 10 00 00 00 00 48 c7 42 18 00 00 00 00 eb 38 66 0f 1f 44 00 00 31 d2 48 89 c7 48 63 ca 48 8b 84 cf b8 07 00 00 48 85 c0 74 0b <48> 39 30 77 06 48 3b 70 08 72 12 ff c2 83 fa 04 75 de 66 0f 1f [ 304.260466] RIP [] vmacache_find+0x75/0xa0 [ 304.260466] RSP [ 304.260466] CR2: ffff88020d37f800 3. This one is a new issue. Might be related to something else but it seems related as I've never saw it before this patch: [ 560.473342] kernel BUG at mm/mmap.c:439! [ 560.473766] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC [ 560.474723] Dumping ftrace buffer: [ 560.475410] (ftrace buffer empty) [ 560.476098] Modules linked in: [ 560.476241] CPU: 10 PID: 17037 Comm: trinity-c84 Tainted: G W 3.14.0-rc5-next-20140307-sasha-00009-g3b24300-dirty #137 [ 560.477651] task: ffff880424310000 ti: ffff8803eec4c000 task.ti: ffff8803eec4c000 [ 560.478595] RIP: 0010:[] [] validate_mm+0x115/0x140 [ 560.479749] RSP: 0018:ffff8803eec4de98 EFLAGS: 00010296 [ 560.480471] RAX: 0000000000000012 RBX: 00007fff8ae88000 RCX: 0000000000000006 [ 560.481490] RDX: 0000000000000006 RSI: ffff880424310cf0 RDI: 0000000000000282 [ 560.481490] RBP: ffff8803eec4dec8 R08: 0000000000000001 R09: 0000000000000001 [ 560.481490] R10: 0000000000000001 R11: 0000000000000001 R12: ffff8803f61aa000 [ 560.481490] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000048 [ 560.481490] FS: 00007f799ddf6700(0000) GS:ffff880b2b800000(0000) knlGS:0000000000000000 [ 560.481490] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 560.481490] CR2: 0000000004502048 CR3: 00000003f8bcb000 CR4: 00000000000006a0 [ 560.481490] Stack: [ 560.481490] ffff8803eec4dec8 0000000000000000 ffff8803f61aa000 ffff8800a22fae00 [ 560.481490] 0000000000000000 00007f799d4dc000 ffff8803eec4df28 ffffffff812a8fb7 [ 560.481490] ffffffff00000001 ffff880b21453600 ffff8800a22fae10 ffff8803f61aa008 [ 560.481490] Call Trace: [ 560.481490] [] do_munmap+0x307/0x360 [ 560.493775] [] ? vm_munmap+0x46/0x80 [ 560.493775] [] vm_munmap+0x54/0x80 [ 560.493775] [] SyS_munmap+0x2c/0x40 [ 560.493775] [] tracesys+0xdd/0xe2 [ 560.493775] Code: 32 fd ff ff 41 8b 74 24 58 39 f0 74 19 89 c2 48 c7 c7 19 f5 6d 85 31 c0 e8 45 8c 1f 03 eb 0c 0f 1f 80 00 00 00 00 45 85 ed 74 0d <0f> 0b 66 0f 1f 84 00 00 00 00 00 eb fe 48 83 c4 08 5b 41 5c 41 [ 560.493775] RIP [] validate_mm+0x115/0x140 [ 560.493775] RSP Thanks, Sasha -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qg0-f46.google.com (mail-qg0-f46.google.com [209.85.192.46]) by kanga.kvack.org (Postfix) with ESMTP id 8450C6B00B7 for ; Tue, 11 Mar 2014 13:46:58 -0400 (EDT) Received: by mail-qg0-f46.google.com with SMTP id e89so5794908qgf.5 for ; Tue, 11 Mar 2014 10:46:58 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTP id c5si8323934qgc.117.2014.03.11.10.46.57 for ; Tue, 11 Mar 2014 10:46:58 -0700 (PDT) Message-ID: <531F4128.8020109@redhat.com> Date: Tue, 11 Mar 2014 13:00:24 -0400 From: Rik van Riel MIME-Version: 1.0 Subject: Re: [PATCH -mm] mm,numa,mprotect: always continue after finding a stable thp page References: <5318E4BC.50301@oracle.com> <20140306173137.6a23a0b2@cuia.bos.redhat.com> <5318FC3F.4080204@redhat.com> <20140307140650.GA1931@suse.de> <20140307150923.GB1931@suse.de> <20140307182745.GD1931@suse.de> <20140311162845.GA30604@suse.de> <531F3F15.8050206@oracle.com> In-Reply-To: <531F3F15.8050206@oracle.com> Content-Type: text/plain; charset=ISO-8859-15; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Sasha Levin , Mel Gorman Cc: David Rientjes , Andrew Morton , "linux-mm@kvack.org" , LKML , hhuang@redhat.com, knoel@redhat.com, aarcange@redhat.com On 03/11/2014 12:51 PM, Sasha Levin wrote: > On 03/11/2014 12:28 PM, Mel Gorman wrote: >> On Fri, Mar 07, 2014 at 06:27:45PM +0000, Mel Gorman wrote: >>>> This is a completely untested prototype. It rechecks pmd_trans_huge >>>> under the lock and falls through if it hit a parallel split. It's not >>>> perfect because it could decide to fall through just because there was >>>> no prot_numa work to do but it's for illustration purposes. Secondly, >>>> I noted that you are calling invalidate for every pmd range. Is that >>>> not >>>> a lot of invalidations? We could do the same by just tracking the >>>> address >>>> of the first invalidation. >>>> >>> >>> And there were other minor issues. This is still untested but Sasha, >>> can you try it out please? I discussed this with Rik on IRC for a bit >>> and >>> reckon this should be sufficient if the correct race has been >>> identified. >>> >> >> Any luck with this patch Sasha? It passed basic tests here but I had not >> seen the issue trigger either. >> > > Sorry, I've been stuck in my weekend project of getting lockdep to work > with page locks :) > > It takes a moment to test, so just to be sure - I should have only this > last patch applied? > Without the one in the original mail? Indeed, only this patch should do it. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f174.google.com (mail-wi0-f174.google.com [209.85.212.174]) by kanga.kvack.org (Postfix) with ESMTP id 220306B0055 for ; Tue, 11 Mar 2014 14:06:58 -0400 (EDT) Received: by mail-wi0-f174.google.com with SMTP id d1so1304243wiv.7 for ; Tue, 11 Mar 2014 11:06:57 -0700 (PDT) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id n12si21927786wjw.141.2014.03.11.11.06.55 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Tue, 11 Mar 2014 11:06:56 -0700 (PDT) Date: Tue, 11 Mar 2014 18:06:52 +0000 From: Mel Gorman Subject: Re: [PATCH -mm] mm,numa,mprotect: always continue after finding a stable thp page Message-ID: <20140311180652.GM10663@suse.de> References: <5318E4BC.50301@oracle.com> <20140306173137.6a23a0b2@cuia.bos.redhat.com> <5318FC3F.4080204@redhat.com> <20140307140650.GA1931@suse.de> <20140307150923.GB1931@suse.de> <20140307182745.GD1931@suse.de> <20140311162845.GA30604@suse.de> <531F3F15.8050206@oracle.com> <531F4128.8020109@redhat.com> <531F48CC.303@oracle.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <531F48CC.303@oracle.com> Sender: owner-linux-mm@kvack.org List-ID: To: Sasha Levin Cc: Rik van Riel , David Rientjes , Andrew Morton , "linux-mm@kvack.org" , LKML , hhuang@redhat.com, knoel@redhat.com, aarcange@redhat.com On Tue, Mar 11, 2014 at 01:33:00PM -0400, Sasha Levin wrote: > Okay. So just this patch on top of the latest -next shows the following issues: > > 1. BUG in task_numa_work: > > [ 439.417171] BUG: unable to handle kernel paging request at ffff880e17530c00 > [ 439.418216] IP: [] vmacache_find+0x75/0xa0 > [ 439.419073] PGD 8904067 PUD 1028fcb067 PMD 1028f10067 PTE 8000000e17530060 > [ 439.420340] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC > [ 439.420340] Dumping ftrace buffer: > [ 439.420340] (ftrace buffer empty) > [ 439.420340] Modules linked in: > [ 439.420340] CPU: 12 PID: 9937 Comm: trinity-c212 Tainted: G W 3.14.0-rc5-next-20140307-sasha-00009-g3b24300-dirty #137 > [ 439.420340] task: ffff880e1a45b000 ti: ffff880e1a490000 task.ti: ffff880e1a490000 > [ 439.420340] RIP: 0010:[] [] vmacache_find+0x75/0xa0 > [ 439.420340] RSP: 0018:ffff880e1a491e68 EFLAGS: 00010286 > [ 439.420340] RAX: ffff880e17530c00 RBX: 0000000000000000 RCX: 0000000000000001 > [ 439.420340] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff880e1a45b000 > [ 439.420340] RBP: ffff880e1a491e68 R08: 0000000000000000 R09: 0000000000000000 > [ 439.420340] R10: 0000000000000001 R11: 0000000000000000 R12: ffff880e1ab75000 > [ 439.420340] R13: ffff880e1ab75000 R14: 0000000000010000 R15: 0000000000000000 > [ 439.420340] FS: 00007f3458c05700(0000) GS:ffff880d2b800000(0000) knlGS:0000000000000000 > [ 439.420340] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b > [ 439.420340] CR2: ffff880e17530c00 CR3: 0000000e1a472000 CR4: 00000000000006a0 > [ 439.420340] Stack: > [ 439.420340] ffff880e1a491e98 ffffffff812a7610 ffffffff8118de40 0000000000000117 > [ 439.420340] 00000001000036da ffff880e1a45b000 ffff880e1a491ef8 ffffffff8118de4b > [ 439.420340] ffff880e1a491ec8 ffffffff81269575 ffff880e1ab750a8 ffff880e1a45b000 > [ 439.420340] Call Trace: > [ 439.420340] [] find_vma+0x20/0x90 > [ 439.420340] [] ? task_numa_work+0x130/0x360 > [ 439.420340] [] task_numa_work+0x13b/0x360 > [ 439.420340] [] ? context_tracking_user_exit+0x195/0x1d0 > [ 439.420340] [] task_work_run+0xae/0xf0 > [ 439.420340] [] do_notify_resume+0x8e/0xe0 > [ 439.420340] [] int_signal+0x12/0x17 > [ 439.420340] Code: 42 10 00 00 00 00 48 c7 42 18 00 00 00 00 eb 38 66 0f 1f 44 00 00 31 d2 48 89 c7 48 63 ca 48 8b 84 cf b8 07 00 00 48 85 c0 74 0b <48> 39 30 77 06 48 3b 70 08 72 12 ff c2 83 fa 04 75 de 66 0f 1f > [ 439.420340] RIP [] vmacache_find+0x75/0xa0 > [ 439.420340] RSP > [ 439.420340] CR2: ffff880e17530c00 > Ok, this does not look related. It looks like damage from the VMA caching patches, possibly a use-after free. I'm skeptical that it's related to automatic NUMA balancing as such based on the second trace you posted. What does addr2line -e vmlinux ffffffff81299385 say? I want to be sure it looks like a vma dereference without risking making a mistake decoding it. 1. Does this bug trigger even if automatic NUMA balancing is disabled? 2. Does this bug trigger if DEBUG_PAGEALLOC is disabled? If it's a use-after free then the bug would still be there. 3. Can you test with the following patches reverted please? e15d25d9c827b4346a36a3a78dd566d5ad353402 mm-per-thread-vma-caching-fix-fix e440e20dc76803cdab616b4756c201d5c72857f2 mm-per-thread-vma-caching-fix 0d9ad4220e6d73f63a9eeeaac031b92838f75bb3 mm: per-thread vma caching The last patch will not revert cleanly (least it didn't for me) but it was just a case of git rm the two affected files, remove any include of vmacache.h and commit the rest. Thanks! -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qa0-f49.google.com (mail-qa0-f49.google.com [209.85.216.49]) by kanga.kvack.org (Postfix) with ESMTP id E868D6B0037 for ; Tue, 11 Mar 2014 15:24:34 -0400 (EDT) Received: by mail-qa0-f49.google.com with SMTP id j7so1455772qaq.8 for ; Tue, 11 Mar 2014 12:24:34 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTP id o92si11736847qgd.107.2014.03.11.12.24.34 for ; Tue, 11 Mar 2014 12:24:34 -0700 (PDT) Message-ID: <531F48C2.6010408@redhat.com> Date: Tue, 11 Mar 2014 13:32:50 -0400 From: Rik van Riel MIME-Version: 1.0 Subject: Re: [PATCH -mm] mm,numa,mprotect: always continue after finding a stable thp page References: <5318E4BC.50301@oracle.com> <20140306173137.6a23a0b2@cuia.bos.redhat.com> <5318FC3F.4080204@redhat.com> <20140307140650.GA1931@suse.de> <20140307150923.GB1931@suse.de> <20140307182745.GD1931@suse.de> <20140311162845.GA30604@suse.de> <531F3F15.8050206@oracle.com> <531F4128.8020109@redhat.com> <531F48CC.303@oracle.com> In-Reply-To: <531F48CC.303@oracle.com> Content-Type: text/plain; charset=ISO-8859-15; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Sasha Levin , Mel Gorman Cc: David Rientjes , Andrew Morton , "linux-mm@kvack.org" , LKML , hhuang@redhat.com, knoel@redhat.com, aarcange@redhat.com On 03/11/2014 01:33 PM, Sasha Levin wrote: > On 03/11/2014 01:00 PM, Rik van Riel wrote: >> On 03/11/2014 12:51 PM, Sasha Levin wrote: >>> On 03/11/2014 12:28 PM, Mel Gorman wrote: >>>> On Fri, Mar 07, 2014 at 06:27:45PM +0000, Mel Gorman wrote: >>>>>> This is a completely untested prototype. It rechecks pmd_trans_huge >>>>>> under the lock and falls through if it hit a parallel split. It's not >>>>>> perfect because it could decide to fall through just because there >>>>>> was >>>>>> no prot_numa work to do but it's for illustration purposes. Secondly, >>>>>> I noted that you are calling invalidate for every pmd range. Is that >>>>>> not >>>>>> a lot of invalidations? We could do the same by just tracking the >>>>>> address >>>>>> of the first invalidation. >>>>>> >>>>> >>>>> And there were other minor issues. This is still untested but Sasha, >>>>> can you try it out please? I discussed this with Rik on IRC for a bit >>>>> and >>>>> reckon this should be sufficient if the correct race has been >>>>> identified. >>>>> >>>> >>>> Any luck with this patch Sasha? It passed basic tests here but I had >>>> not >>>> seen the issue trigger either. >>>> >>> >>> Sorry, I've been stuck in my weekend project of getting lockdep to work >>> with page locks :) >>> >>> It takes a moment to test, so just to be sure - I should have only this >>> last patch applied? >>> Without the one in the original mail? >> >> Indeed, only this patch should do it. > > Okay. So just this patch on top of the latest -next shows the following > issues: OK, those are all issues with Davidlohr Bueso's per-thread vma cache patch :) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ob0-f178.google.com (mail-ob0-f178.google.com [209.85.214.178]) by kanga.kvack.org (Postfix) with ESMTP id 291546B0037 for ; Tue, 11 Mar 2014 15:26:17 -0400 (EDT) Received: by mail-ob0-f178.google.com with SMTP id wp18so8933759obc.23 for ; Tue, 11 Mar 2014 12:26:16 -0700 (PDT) Received: from userp1040.oracle.com (userp1040.oracle.com. [156.151.31.81]) by mx.google.com with ESMTPS id sp3si25323577obb.17.2014.03.11.12.26.16 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Tue, 11 Mar 2014 12:26:16 -0700 (PDT) Message-ID: <531F616A.7060300@oracle.com> Date: Tue, 11 Mar 2014 15:18:02 -0400 From: Sasha Levin MIME-Version: 1.0 Subject: Re: [PATCH -mm] mm,numa,mprotect: always continue after finding a stable thp page References: <5318E4BC.50301@oracle.com> <20140306173137.6a23a0b2@cuia.bos.redhat.com> <5318FC3F.4080204@redhat.com> <20140307140650.GA1931@suse.de> <20140307150923.GB1931@suse.de> <20140307182745.GD1931@suse.de> <20140311162845.GA30604@suse.de> <531F3F15.8050206@oracle.com> <531F4128.8020109@redhat.com> <531F48CC.303@oracle.com> <20140311180652.GM10663@suse.de> In-Reply-To: <20140311180652.GM10663@suse.de> Content-Type: text/plain; charset=ISO-8859-15; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: Rik van Riel , David Rientjes , Andrew Morton , "linux-mm@kvack.org" , LKML , hhuang@redhat.com, knoel@redhat.com, aarcange@redhat.com On 03/11/2014 02:06 PM, Mel Gorman wrote: > On Tue, Mar 11, 2014 at 01:33:00PM -0400, Sasha Levin wrote: >> Okay. So just this patch on top of the latest -next shows the following issues: >> >> 1. BUG in task_numa_work: >> >> [ 439.417171] BUG: unable to handle kernel paging request at ffff880e17530c00 >> [ 439.418216] IP: [] vmacache_find+0x75/0xa0 >> [ 439.419073] PGD 8904067 PUD 1028fcb067 PMD 1028f10067 PTE 8000000e17530060 >> [ 439.420340] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC >> [ 439.420340] Dumping ftrace buffer: >> [ 439.420340] (ftrace buffer empty) >> [ 439.420340] Modules linked in: >> [ 439.420340] CPU: 12 PID: 9937 Comm: trinity-c212 Tainted: G W 3.14.0-rc5-next-20140307-sasha-00009-g3b24300-dirty #137 >> [ 439.420340] task: ffff880e1a45b000 ti: ffff880e1a490000 task.ti: ffff880e1a490000 >> [ 439.420340] RIP: 0010:[] [] vmacache_find+0x75/0xa0 >> [ 439.420340] RSP: 0018:ffff880e1a491e68 EFLAGS: 00010286 >> [ 439.420340] RAX: ffff880e17530c00 RBX: 0000000000000000 RCX: 0000000000000001 >> [ 439.420340] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff880e1a45b000 >> [ 439.420340] RBP: ffff880e1a491e68 R08: 0000000000000000 R09: 0000000000000000 >> [ 439.420340] R10: 0000000000000001 R11: 0000000000000000 R12: ffff880e1ab75000 >> [ 439.420340] R13: ffff880e1ab75000 R14: 0000000000010000 R15: 0000000000000000 >> [ 439.420340] FS: 00007f3458c05700(0000) GS:ffff880d2b800000(0000) knlGS:0000000000000000 >> [ 439.420340] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b >> [ 439.420340] CR2: ffff880e17530c00 CR3: 0000000e1a472000 CR4: 00000000000006a0 >> [ 439.420340] Stack: >> [ 439.420340] ffff880e1a491e98 ffffffff812a7610 ffffffff8118de40 0000000000000117 >> [ 439.420340] 00000001000036da ffff880e1a45b000 ffff880e1a491ef8 ffffffff8118de4b >> [ 439.420340] ffff880e1a491ec8 ffffffff81269575 ffff880e1ab750a8 ffff880e1a45b000 >> [ 439.420340] Call Trace: >> [ 439.420340] [] find_vma+0x20/0x90 >> [ 439.420340] [] ? task_numa_work+0x130/0x360 >> [ 439.420340] [] task_numa_work+0x13b/0x360 >> [ 439.420340] [] ? context_tracking_user_exit+0x195/0x1d0 >> [ 439.420340] [] task_work_run+0xae/0xf0 >> [ 439.420340] [] do_notify_resume+0x8e/0xe0 >> [ 439.420340] [] int_signal+0x12/0x17 >> [ 439.420340] Code: 42 10 00 00 00 00 48 c7 42 18 00 00 00 00 eb 38 66 0f 1f 44 00 00 31 d2 48 89 c7 48 63 ca 48 8b 84 cf b8 07 00 00 48 85 c0 74 0b <48> 39 30 77 06 48 3b 70 08 72 12 ff c2 83 fa 04 75 de 66 0f 1f >> [ 439.420340] RIP [] vmacache_find+0x75/0xa0 >> [ 439.420340] RSP >> [ 439.420340] CR2: ffff880e17530c00 >> > > Ok, this does not look related. It looks like damage from the VMA caching > patches, possibly a use-after free. I'm skeptical that it's related to > automatic NUMA balancing as such based on the second trace you posted. It's a bit weird because right after applying the patch I've hit this trace twice while never seeing it previously. But it's very possible since there are quite a lot of different mm issues around right now. > What does addr2line -e vmlinux ffffffff81299385 say? I want to be sure > it looks like a vma dereference without risking making a mistake > decoding it. mm/vmacache.c:75 Or: if (vma && vma->vm_start <= addr && vma->vm_end > addr) > 1. Does this bug trigger even if automatic NUMA balancing is disabled? (assuming we're talking about kernel BUG at mm/mmap.c:439!) Yes, same result with numa_balancing=disable. > 2. Does this bug trigger if DEBUG_PAGEALLOC is disabled? If it's a > use-after free then the bug would still be there. Still there. > 3. Can you test with the following patches reverted please? > > e15d25d9c827b4346a36a3a78dd566d5ad353402 mm-per-thread-vma-caching-fix-fix > e440e20dc76803cdab616b4756c201d5c72857f2 mm-per-thread-vma-caching-fix > 0d9ad4220e6d73f63a9eeeaac031b92838f75bb3 mm: per-thread vma caching > > The last patch will not revert cleanly (least it didn't for me) but it > was just a case of git rm the two affected files, remove any include of > vmacache.h and commit the rest. Don't see the issues I've reported before now. Thanks, Sasha -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f41.google.com (mail-pa0-f41.google.com [209.85.220.41]) by kanga.kvack.org (Postfix) with ESMTP id 5CC076B0037 for ; Tue, 11 Mar 2014 15:29:03 -0400 (EDT) Received: by mail-pa0-f41.google.com with SMTP id fa1so9307103pad.0 for ; Tue, 11 Mar 2014 12:29:03 -0700 (PDT) Received: from mail.linuxfoundation.org (mail.linuxfoundation.org. [140.211.169.12]) by mx.google.com with ESMTP id bs8si1577711pad.331.2014.03.11.12.29.00 for ; Tue, 11 Mar 2014 12:29:01 -0700 (PDT) Date: Tue, 11 Mar 2014 12:28:59 -0700 From: Andrew Morton Subject: Re: [PATCH -mm] mm,numa,mprotect: always continue after finding a stable thp page Message-Id: <20140311122859.fb6c1e772d82d9f4edd02f52@linux-foundation.org> In-Reply-To: <531F616A.7060300@oracle.com> References: <5318E4BC.50301@oracle.com> <20140306173137.6a23a0b2@cuia.bos.redhat.com> <5318FC3F.4080204@redhat.com> <20140307140650.GA1931@suse.de> <20140307150923.GB1931@suse.de> <20140307182745.GD1931@suse.de> <20140311162845.GA30604@suse.de> <531F3F15.8050206@oracle.com> <531F4128.8020109@redhat.com> <531F48CC.303@oracle.com> <20140311180652.GM10663@suse.de> <531F616A.7060300@oracle.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Sasha Levin Cc: Mel Gorman , Rik van Riel , David Rientjes , "linux-mm@kvack.org" , LKML , hhuang@redhat.com, knoel@redhat.com, aarcange@redhat.com, Davidlohr Bueso On Tue, 11 Mar 2014 15:18:02 -0400 Sasha Levin wrote: > > 3. Can you test with the following patches reverted please? > > > > e15d25d9c827b4346a36a3a78dd566d5ad353402 mm-per-thread-vma-caching-fix-fix > > e440e20dc76803cdab616b4756c201d5c72857f2 mm-per-thread-vma-caching-fix > > 0d9ad4220e6d73f63a9eeeaac031b92838f75bb3 mm: per-thread vma caching > > > > The last patch will not revert cleanly (least it didn't for me) but it > > was just a case of git rm the two affected files, remove any include of > > vmacache.h and commit the rest. > > Don't see the issues I've reported before now. This is foggy. Do you mean that all the bugs went away when per-thread-vma-caching was reverted? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f47.google.com (mail-pa0-f47.google.com [209.85.220.47]) by kanga.kvack.org (Postfix) with ESMTP id 157D06B0037 for ; Tue, 11 Mar 2014 15:31:17 -0400 (EDT) Received: by mail-pa0-f47.google.com with SMTP id lj1so9199799pab.6 for ; Tue, 11 Mar 2014 12:31:16 -0700 (PDT) Received: from g4t3427.houston.hp.com (g4t3427.houston.hp.com. [15.201.208.55]) by mx.google.com with ESMTPS id zt8si21072787pbc.345.2014.03.11.12.31.04 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Tue, 11 Mar 2014 12:31:05 -0700 (PDT) Message-ID: <1394566262.2786.23.camel@buesod1.americas.hpqcorp.net> Subject: Re: [PATCH -mm] mm,numa,mprotect: always continue after finding a stable thp page From: Davidlohr Bueso Date: Tue, 11 Mar 2014 12:31:02 -0700 In-Reply-To: <20140311180652.GM10663@suse.de> References: <5318E4BC.50301@oracle.com> <20140306173137.6a23a0b2@cuia.bos.redhat.com> <5318FC3F.4080204@redhat.com> <20140307140650.GA1931@suse.de> <20140307150923.GB1931@suse.de> <20140307182745.GD1931@suse.de> <20140311162845.GA30604@suse.de> <531F3F15.8050206@oracle.com> <531F4128.8020109@redhat.com> <531F48CC.303@oracle.com> <20140311180652.GM10663@suse.de> Content-Type: text/plain; charset="UTF-8" Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: Sasha Levin , Rik van Riel , David Rientjes , Andrew Morton , "linux-mm@kvack.org" , LKML , hhuang@redhat.com, knoel@redhat.com, aarcange@redhat.com On Tue, 2014-03-11 at 18:06 +0000, Mel Gorman wrote: > On Tue, Mar 11, 2014 at 01:33:00PM -0400, Sasha Levin wrote: > > Okay. So just this patch on top of the latest -next shows the following issues: > > > > 1. BUG in task_numa_work: > > > > [ 439.417171] BUG: unable to handle kernel paging request at ffff880e17530c00 > > [ 439.418216] IP: [] vmacache_find+0x75/0xa0 > > [ 439.419073] PGD 8904067 PUD 1028fcb067 PMD 1028f10067 PTE 8000000e17530060 > > [ 439.420340] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC > > [ 439.420340] Dumping ftrace buffer: > > [ 439.420340] (ftrace buffer empty) > > [ 439.420340] Modules linked in: > > [ 439.420340] CPU: 12 PID: 9937 Comm: trinity-c212 Tainted: G W 3.14.0-rc5-next-20140307-sasha-00009-g3b24300-dirty #137 > > [ 439.420340] task: ffff880e1a45b000 ti: ffff880e1a490000 task.ti: ffff880e1a490000 > > [ 439.420340] RIP: 0010:[] [] vmacache_find+0x75/0xa0 > > [ 439.420340] RSP: 0018:ffff880e1a491e68 EFLAGS: 00010286 > > [ 439.420340] RAX: ffff880e17530c00 RBX: 0000000000000000 RCX: 0000000000000001 > > [ 439.420340] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff880e1a45b000 > > [ 439.420340] RBP: ffff880e1a491e68 R08: 0000000000000000 R09: 0000000000000000 > > [ 439.420340] R10: 0000000000000001 R11: 0000000000000000 R12: ffff880e1ab75000 > > [ 439.420340] R13: ffff880e1ab75000 R14: 0000000000010000 R15: 0000000000000000 > > [ 439.420340] FS: 00007f3458c05700(0000) GS:ffff880d2b800000(0000) knlGS:0000000000000000 > > [ 439.420340] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b > > [ 439.420340] CR2: ffff880e17530c00 CR3: 0000000e1a472000 CR4: 00000000000006a0 > > [ 439.420340] Stack: > > [ 439.420340] ffff880e1a491e98 ffffffff812a7610 ffffffff8118de40 0000000000000117 > > [ 439.420340] 00000001000036da ffff880e1a45b000 ffff880e1a491ef8 ffffffff8118de4b > > [ 439.420340] ffff880e1a491ec8 ffffffff81269575 ffff880e1ab750a8 ffff880e1a45b000 > > [ 439.420340] Call Trace: > > [ 439.420340] [] find_vma+0x20/0x90 > > [ 439.420340] [] ? task_numa_work+0x130/0x360 > > [ 439.420340] [] task_numa_work+0x13b/0x360 > > [ 439.420340] [] ? context_tracking_user_exit+0x195/0x1d0 > > [ 439.420340] [] task_work_run+0xae/0xf0 > > [ 439.420340] [] do_notify_resume+0x8e/0xe0 > > [ 439.420340] [] int_signal+0x12/0x17 > > [ 439.420340] Code: 42 10 00 00 00 00 48 c7 42 18 00 00 00 00 eb 38 66 0f 1f 44 00 00 31 d2 48 89 c7 48 63 ca 48 8b 84 cf b8 07 00 00 48 85 c0 74 0b <48> 39 30 77 06 48 3b 70 08 72 12 ff c2 83 fa 04 75 de 66 0f 1f > > [ 439.420340] RIP [] vmacache_find+0x75/0xa0 > > [ 439.420340] RSP > > [ 439.420340] CR2: ffff880e17530c00 > > > > Ok, this does not look related. It looks like damage from the VMA caching > patches, possibly a use-after free. I'm skeptical that it's related to > automatic NUMA balancing as such based on the second trace you posted. Indeed, this issue is separate. It was reported a few days ago by Fengguang: http://lkml.org/lkml/2014/3/9/201 Apparently this has only been seen in DEBUG_PAGEALLOC + trinity scenarios, just like in Sasha's trace. I'm suspecting some invalidation is missing (not sure what DEBUG_PAGEALLOC enables differently in this aspect), as when we do the vmacache_find() the vma is bogus which triggers the paging request error. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ig0-f169.google.com (mail-ig0-f169.google.com [209.85.213.169]) by kanga.kvack.org (Postfix) with ESMTP id 3A7B16B0037 for ; Tue, 11 Mar 2014 15:34:07 -0400 (EDT) Received: by mail-ig0-f169.google.com with SMTP id h18so12441705igc.0 for ; Tue, 11 Mar 2014 12:34:07 -0700 (PDT) Received: from userp1040.oracle.com (userp1040.oracle.com. [156.151.31.81]) by mx.google.com with ESMTPS id pg8si33105702icb.135.2014.03.11.12.34.06 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Tue, 11 Mar 2014 12:34:06 -0700 (PDT) Message-ID: <531F6527.6010508@oracle.com> Date: Tue, 11 Mar 2014 15:33:59 -0400 From: Sasha Levin MIME-Version: 1.0 Subject: Re: [PATCH -mm] mm,numa,mprotect: always continue after finding a stable thp page References: <5318E4BC.50301@oracle.com> <20140306173137.6a23a0b2@cuia.bos.redhat.com> <5318FC3F.4080204@redhat.com> <20140307140650.GA1931@suse.de> <20140307150923.GB1931@suse.de> <20140307182745.GD1931@suse.de> <20140311162845.GA30604@suse.de> <531F3F15.8050206@oracle.com> <531F4128.8020109@redhat.com> <531F48CC.303@oracle.com> <20140311180652.GM10663@suse.de> <531F616A.7060300@oracle.com> <20140311122859.fb6c1e772d82d9f4edd02f52@linux-foundation.org> In-Reply-To: <20140311122859.fb6c1e772d82d9f4edd02f52@linux-foundation.org> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Mel Gorman , Rik van Riel , David Rientjes , "linux-mm@kvack.org" , LKML , hhuang@redhat.com, knoel@redhat.com, aarcange@redhat.com, Davidlohr Bueso On 03/11/2014 03:28 PM, Andrew Morton wrote: > On Tue, 11 Mar 2014 15:18:02 -0400 Sasha Levin wrote: > >>> 3. Can you test with the following patches reverted please? >>> >>> e15d25d9c827b4346a36a3a78dd566d5ad353402 mm-per-thread-vma-caching-fix-fix >>> e440e20dc76803cdab616b4756c201d5c72857f2 mm-per-thread-vma-caching-fix >>> 0d9ad4220e6d73f63a9eeeaac031b92838f75bb3 mm: per-thread vma caching >>> >>> The last patch will not revert cleanly (least it didn't for me) but it >>> was just a case of git rm the two affected files, remove any include of >>> vmacache.h and commit the rest. >> >> Don't see the issues I've reported before now. > > This is foggy. Do you mean that all the bugs went away when > per-thread-vma-caching was reverted? No, sorry, just the vmacache_find and the mm/mmap.c:439 BUGs. Thanks, Sasha -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wg0-f49.google.com (mail-wg0-f49.google.com [74.125.82.49]) by kanga.kvack.org (Postfix) with ESMTP id A45A16B0095 for ; Wed, 12 Mar 2014 06:36:07 -0400 (EDT) Received: by mail-wg0-f49.google.com with SMTP id a1so6937634wgh.20 for ; Wed, 12 Mar 2014 03:36:06 -0700 (PDT) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id s4si13377016wjq.83.2014.03.12.03.36.05 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Wed, 12 Mar 2014 03:36:05 -0700 (PDT) Date: Wed, 12 Mar 2014 10:36:02 +0000 From: Mel Gorman Subject: [PATCH] mm: numa: Recheck for transhuge pages under lock during protection changes Message-ID: <20140312103602.GN10663@suse.de> References: <20140307140650.GA1931@suse.de> <20140307150923.GB1931@suse.de> <20140307182745.GD1931@suse.de> <20140311162845.GA30604@suse.de> <531F3F15.8050206@oracle.com> <531F4128.8020109@redhat.com> <531F48CC.303@oracle.com> <20140311180652.GM10663@suse.de> <531F616A.7060300@oracle.com> <20140311122859.fb6c1e772d82d9f4edd02f52@linux-foundation.org> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20140311122859.fb6c1e772d82d9f4edd02f52@linux-foundation.org> Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Sasha Levin , Rik van Riel , David Rientjes , hhuang@redhat.com, knoel@redhat.com, aarcange@redhat.com, Davidlohr Bueso , "linux-mm@kvack.org" , LKML Andrew, this should go with the patches mmnuma-reorganize-change_pmd_range.patch mmnuma-reorganize-change_pmd_range-fix.patch move-mmu-notifier-call-from-change_protection-to-change_pmd_range.patch in mmotm please. Thanks. ---8<--- From: Mel Gorman Subject: [PATCH] mm: numa: Recheck for transhuge pages under lock during protection changes Sasha Levin reported the following bug using trinity [ 886.745765] kernel BUG at mm/mprotect.c:149! [ 886.746831] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC [ 886.748511] Dumping ftrace buffer: [ 886.749998] (ftrace buffer empty) [ 886.750110] Modules linked in: [ 886.751396] CPU: 20 PID: 26219 Comm: trinity-c216 Tainted: G W 3.14.0-rc5-next-20140305-sasha-00011-ge06f5f3-dirty #105 [ 886.751396] task: ffff8800b6c80000 ti: ffff880228436000 task.ti: ffff880228436000 [ 886.751396] RIP: 0010:[] [] change_protection_range+0x3b3/0x500 [ 886.751396] RSP: 0000:ffff880228437da8 EFLAGS: 00010282 [ 886.751396] RAX: 8000000527c008e5 RBX: 00007f647916e000 RCX: 0000000000000000 [ 886.751396] RDX: ffff8802ef488e40 RSI: 00007f6479000000 RDI: 8000000527c008e5 [ 886.751396] RBP: ffff880228437e78 R08: 0000000000000000 R09: 0000000000000000 [ 886.751396] R10: 0000000000000001 R11: 0000000000000000 R12: ffff8802ef488e40 [ 886.751396] R13: 00007f6479000000 R14: 00007f647916e000 R15: 00007f646e34e000 [ 886.751396] FS: 00007f64b28d4700(0000) GS:ffff88052ba00000(0000) knlGS:0000000000000000 [ 886.751396] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 886.751396] CR2: 00007f64aed83af8 CR3: 0000000206e52000 CR4: 00000000000006a0 [ 886.751396] Stack: [ 886.751396] ffff880200000001 ffffffff8447f152 ffff880228437dd8 ffff880228af3000 [ 886.769142] 00007f646e34e000 00007f647916dfff 0000000000000000 00007f647916e000 [ 886.769142] ffff880206e527f0 00007f647916dfff 0000000000000000 0000000000000000 [ 886.769142] Call Trace: [ 886.769142] [] ? preempt_count_sub+0xe2/0x120 [ 886.769142] [] change_protection+0x25/0x30 [ 886.769142] [] change_prot_numa+0x1b/0x30 [ 886.769142] [] task_numa_work+0x279/0x360 [ 886.769142] [] task_work_run+0xae/0xf0 [ 886.769142] [] do_notify_resume+0x8e/0xe0 [ 886.769142] [] retint_signal+0x4d/0x92 [ 886.769142] Code: 49 8b 3c 24 48 83 3d fc 2e ba 04 00 75 12 0f 0b 0f 1f 84 00 00 00 00 00 eb fe 66 0f 1f 44 00 00 48 89 f8 66 66 66 90 84 c0 79 0d <0f> 0b 0f 1f 00 eb fe 66 0f 1f 44 00 00 8b 4d 9c 44 8b 4d 98 +89 [ 886.769142] RIP [] change_protection_range+0x3b3/0x500 [ 886.769142] RSP The VM_BUG_ON was added by the patch "mm,numa: reorganize change_pmd_range". The race existed without the patch but was just harder to hit. The problem is that a transhuge check is made without holding the PTL. It's possible at the time of the check that a parallel fault clears the pmd and inserts a new one which then triggers the VM_BUG_ON check. This patch removes the VM_BUG_ON but fixes the race by rechecking transhuge under the PTL when marking page tables for NUMA hinting and bailing if a race occurred. It is not a problem for calls to mprotect() as they hold mmap_sem for write. Reported-by: Sasha Levin Signed-off-by: Mel Gorman --- mm/mprotect.c | 14 +++++++++++++- 1 file changed, 13 insertions(+), 1 deletion(-) diff --git a/mm/mprotect.c b/mm/mprotect.c index 2afc40e..72061a2 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -46,6 +46,17 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, unsigned long pages = 0; pte = pte_offset_map_lock(mm, pmd, addr, &ptl); + + /* + * For a prot_numa update we only hold mmap_sem for read so there is a + * potential race with faulting where a pmd was temporarily none so + * recheck it under the lock and bail if we race + */ + if (prot_numa && unlikely(pmd_trans_huge(*pmd))) { + pte_unmap_unlock(pte, ptl); + return 0; + } + arch_enter_lazy_mmu_mode(); do { oldpte = *pte; @@ -141,12 +152,13 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma, pages += HPAGE_PMD_NR; nr_huge_updates++; } + + /* huge pmd was handled */ continue; } } /* fall through, the trans huge pmd just split */ } - VM_BUG_ON(pmd_trans_huge(*pmd)); this_pages = change_pte_range(vma, pmd, addr, next, newprot, dirty_accountable, prot_numa); pages += this_pages; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f172.google.com (mail-wi0-f172.google.com [209.85.212.172]) by kanga.kvack.org (Postfix) with ESMTP id CE5696B0099 for ; Wed, 12 Mar 2014 08:16:30 -0400 (EDT) Received: by mail-wi0-f172.google.com with SMTP id hi5so2277675wib.17 for ; Wed, 12 Mar 2014 05:16:30 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTP id dr11si3790599wid.3.2014.03.12.05.16.28 for ; Wed, 12 Mar 2014 05:16:29 -0700 (PDT) Message-ID: <53205014.2050602@redhat.com> Date: Wed, 12 Mar 2014 08:16:20 -0400 From: Rik van Riel MIME-Version: 1.0 Subject: Re: [PATCH] mm: numa: Recheck for transhuge pages under lock during protection changes References: <20140307140650.GA1931@suse.de> <20140307150923.GB1931@suse.de> <20140307182745.GD1931@suse.de> <20140311162845.GA30604@suse.de> <531F3F15.8050206@oracle.com> <531F4128.8020109@redhat.com> <531F48CC.303@oracle.com> <20140311180652.GM10663@suse.de> <531F616A.7060300@oracle.com> <20140311122859.fb6c1e772d82d9f4edd02f52@linux-foundation.org> <20140312103602.GN10663@suse.de> In-Reply-To: <20140312103602.GN10663@suse.de> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman , Andrew Morton Cc: Sasha Levin , David Rientjes , hhuang@redhat.com, knoel@redhat.com, aarcange@redhat.com, Davidlohr Bueso , "linux-mm@kvack.org" , LKML On 03/12/2014 06:36 AM, Mel Gorman wrote: > Andrew, this should go with the patches > mmnuma-reorganize-change_pmd_range.patch > mmnuma-reorganize-change_pmd_range-fix.patch > move-mmu-notifier-call-from-change_protection-to-change_pmd_range.patch > in mmotm please. > > Thanks. That would be nice indeed :) I am still not entirely sure why the kernel did not hit this race before my reorganize change_pmd_range patch. Maybe gcc used to do one load and now it does two? > The problem is that a transhuge check is made without holding the PTL. It's > possible at the time of the check that a parallel fault clears the pmd > and inserts a new one which then triggers the VM_BUG_ON check. This patch > removes the VM_BUG_ON but fixes the race by rechecking transhuge under the > PTL when marking page tables for NUMA hinting and bailing if a race occurred. > It is not a problem for calls to mprotect() as they hold mmap_sem for write. > > Reported-by: Sasha Levin > Signed-off-by: Mel Gorman Reviewed-by: Rik van Riel -- All rights reversed -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ig0-f169.google.com (mail-ig0-f169.google.com [209.85.213.169]) by kanga.kvack.org (Postfix) with ESMTP id 788836B0037 for ; Fri, 14 Mar 2014 23:15:50 -0400 (EDT) Received: by mail-ig0-f169.google.com with SMTP id h18so3284916igc.0 for ; Fri, 14 Mar 2014 20:15:50 -0700 (PDT) Received: from userp1040.oracle.com (userp1040.oracle.com. [156.151.31.81]) by mx.google.com with ESMTPS id nv2si31992icc.117.2014.03.14.20.15.49 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Fri, 14 Mar 2014 20:15:49 -0700 (PDT) Message-ID: <5323C5D9.2070902@oracle.com> Date: Fri, 14 Mar 2014 23:15:37 -0400 From: Sasha Levin MIME-Version: 1.0 Subject: Re: [PATCH] mm: numa: Recheck for transhuge pages under lock during protection changes References: <20140307140650.GA1931@suse.de> <20140307150923.GB1931@suse.de> <20140307182745.GD1931@suse.de> <20140311162845.GA30604@suse.de> <531F3F15.8050206@oracle.com> <531F4128.8020109@redhat.com> <531F48CC.303@oracle.com> <20140311180652.GM10663@suse.de> <531F616A.7060300@oracle.com> <20140311122859.fb6c1e772d82d9f4edd02f52@linux-foundation.org> <20140312103602.GN10663@suse.de> In-Reply-To: <20140312103602.GN10663@suse.de> Content-Type: text/plain; charset=ISO-8859-15; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman , Andrew Morton Cc: Rik van Riel , David Rientjes , hhuang@redhat.com, knoel@redhat.com, aarcange@redhat.com, Davidlohr Bueso , "linux-mm@kvack.org" , LKML On 03/12/2014 06:36 AM, Mel Gorman wrote: > Andrew, this should go with the patches > mmnuma-reorganize-change_pmd_range.patch > mmnuma-reorganize-change_pmd_range-fix.patch > move-mmu-notifier-call-from-change_protection-to-change_pmd_range.patch > in mmotm please. > > Thanks. > > ---8<--- > From: Mel Gorman > Subject: [PATCH] mm: numa: Recheck for transhuge pages under lock during protection changes > > Sasha Levin reported the following bug using trinity I'm seeing a different issue with this patch. A NULL ptr deref occurs in the pte_offset_map_lock() macro right before the new recheck code: [ 1877.093980] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018 [ 1877.095174] IP: __lock_acquire+0xbc/0x5a0 (kernel/locking/lockdep.c:3069) [ 1877.096069] PGD 6dee7a067 PUD 6dee7b067 PMD 0 [ 1877.096821] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC [ 1877.097706] Dumping ftrace buffer: [ 1877.098281] (ftrace buffer empty) [ 1877.098825] Modules linked in: [ 1877.099327] CPU: 19 PID: 27913 Comm: trinity-c100 Tainted: G W 3.14.0-rc6-next-20140314-sasha-00012-g5590866 #219 [ 1877.100044] task: ffff8808f4280000 ti: ffff8806e1e54000 task.ti: ffff8806e1e54000 [ 1877.100044] RIP: __lock_acquire+0xbc/0x5a0 (kernel/locking/lockdep.c:3069) [ 1877.100044] RSP: 0000:ffff8806e1e55be8 EFLAGS: 00010002 [ 1877.100044] RAX: 0000000000000082 RBX: 0000000000000018 RCX: 0000000000000000 [ 1877.100044] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000018 [ 1877.100044] RBP: ffff8806e1e55c58 R08: 0000000000000001 R09: 0000000000000000 [ 1877.100044] R10: 0000000000000001 R11: 0000000000000001 R12: ffff8808f4280000 [ 1877.100044] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001 [ 1877.100044] FS: 00007fe3fe152700(0000) GS:ffff88042ba00000(0000) knlGS:0000000000000000 [ 1877.100044] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 1877.100044] CR2: 0000000000000018 CR3: 00000006dee79000 CR4: 00000000000006a0 [ 1877.100044] DR0: 0000000000698000 DR1: 0000000000698000 DR2: 0000000000698000 [ 1877.100044] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 000000000009060a [ 1877.100044] Stack: [ 1877.100044] ffff8806e1e55c18 ffffffff81184e95 ffff8808f4280038 00000000001d8500 [ 1877.100044] ffff88042bbd8500 0000000000000013 ffff8806e1e55c48 ffffffff81185108 [ 1877.100044] ffffffff87c13bd0 ffff8808f4280000 0000000000000000 0000000000000001 [ 1877.100044] Call Trace: [ 1877.100044] ? sched_clock_local+0x25/0x90 (kernel/sched/clock.c:205) [ 1877.100044] ? sched_clock_cpu+0xb8/0x100 (kernel/sched/clock.c:310) [ 1877.100044] lock_acquire+0x182/0x1d0 (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602) [ 1877.100044] ? change_pte_range+0xa3/0x410 (mm/mprotect.c:55) [ 1877.100044] ? __lock_release+0x1e2/0x200 (kernel/locking/lockdep.c:3506) [ 1877.100044] _raw_spin_lock+0x40/0x80 (include/linux/spinlock_api_smp.h:143 kernel/locking/spinlock.c:151) [ 1877.100044] ? change_pte_range+0xa3/0x410 (mm/mprotect.c:55) [ 1877.100044] ? _raw_spin_unlock+0x35/0x60 (arch/x86/include/asm/preempt.h:98 include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183) [ 1877.100044] change_pte_range+0xa3/0x410 (mm/mprotect.c:55) [ 1877.100044] change_protection_range+0x3a8/0x4d0 (mm/mprotect.c:164 mm/mprotect.c:188 mm/mprotect.c:213) [ 1877.100044] ? preempt_count_sub+0xe2/0x120 (kernel/sched/core.c:2529) [ 1877.100044] change_protection+0x25/0x30 (mm/mprotect.c:237) [ 1877.100044] change_prot_numa+0x1b/0x30 (mm/mempolicy.c:559) [ 1877.100044] task_numa_work+0x279/0x360 (kernel/sched/fair.c:1911) [ 1877.100044] task_work_run+0xae/0xf0 (kernel/task_work.c:125) [ 1877.100044] do_notify_resume+0x8e/0xe0 (include/linux/tracehook.h:196 arch/x86/kernel/signal.c:751) [ 1877.100044] retint_signal+0x4d/0x92 (arch/x86/kernel/entry_64.S:1096) [ 1877.100044] Code: c2 6f 3b 6d 85 be fa 0b 00 00 48 c7 c7 ce 94 6d 85 e8 f9 78 f9 ff 31 c0 e9 bc 04 00 00 66 90 44 8b 1d 29 69 cd 04 45 85 db 74 0c <48> 81 3b 80 f2 75 87 75 06 0f 1f 00 45 31 c0 83 fe 01 77 0c 89 [ 1877.100044] RIP __lock_acquire+0xbc/0x5a0 (kernel/locking/lockdep.c:3069) [ 1877.100044] RSP [ 1877.100044] CR2: 0000000000000018 Thanks, Sasha -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f171.google.com (mail-wi0-f171.google.com [209.85.212.171]) by kanga.kvack.org (Postfix) with ESMTP id B01B16B0166 for ; Wed, 19 Mar 2014 10:38:37 -0400 (EDT) Received: by mail-wi0-f171.google.com with SMTP id hn9so5102403wib.10 for ; Wed, 19 Mar 2014 07:38:37 -0700 (PDT) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id li2si9521675wjc.170.2014.03.19.07.38.35 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Wed, 19 Mar 2014 07:38:36 -0700 (PDT) Date: Wed, 19 Mar 2014 14:38:32 +0000 From: Mel Gorman Subject: Re: [PATCH] mm: numa: Recheck for transhuge pages under lock during protection changes Message-ID: <20140319143831.GA4751@suse.de> References: <20140307182745.GD1931@suse.de> <20140311162845.GA30604@suse.de> <531F3F15.8050206@oracle.com> <531F4128.8020109@redhat.com> <531F48CC.303@oracle.com> <20140311180652.GM10663@suse.de> <531F616A.7060300@oracle.com> <20140311122859.fb6c1e772d82d9f4edd02f52@linux-foundation.org> <20140312103602.GN10663@suse.de> <5323C5D9.2070902@oracle.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <5323C5D9.2070902@oracle.com> Sender: owner-linux-mm@kvack.org List-ID: To: Sasha Levin Cc: Andrew Morton , Rik van Riel , David Rientjes , hhuang@redhat.com, knoel@redhat.com, aarcange@redhat.com, Davidlohr Bueso , "linux-mm@kvack.org" , LKML On Fri, Mar 14, 2014 at 11:15:37PM -0400, Sasha Levin wrote: > On 03/12/2014 06:36 AM, Mel Gorman wrote: > >Andrew, this should go with the patches > >mmnuma-reorganize-change_pmd_range.patch > >mmnuma-reorganize-change_pmd_range-fix.patch > >move-mmu-notifier-call-from-change_protection-to-change_pmd_range.patch > >in mmotm please. > > > >Thanks. > > > >---8<--- > >From: Mel Gorman > >Subject: [PATCH] mm: numa: Recheck for transhuge pages under lock during protection changes > > > >Sasha Levin reported the following bug using trinity > > I'm seeing a different issue with this patch. A NULL ptr deref occurs in the > pte_offset_map_lock() macro right before the new recheck code: > This on top? I tried testing it but got all sorts of carnage that trinity throw up in the mix and ordinary testing does not trigger the race. I've no idea which of the current mess of trinity-exposed bugs you've encountered and got fixed already. ---8<--- From: Mel Gorman Subject: [PATCH] mm: numa: Recheck for transhuge pages under lock during protection changes -fix Signed-off-by: Mel Gorman --- mm/mprotect.c | 40 ++++++++++++++++++++++++++++++---------- 1 file changed, 30 insertions(+), 10 deletions(-) diff --git a/mm/mprotect.c b/mm/mprotect.c index 66973db..c43d557 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -36,6 +36,34 @@ static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot) } #endif +/* + * For a prot_numa update we only hold mmap_sem for read so there is a + * potential race with faulting where a pmd was temporarily none. This + * function checks for a transhuge pmd under the appropriate lock. It + * returns a pte if it was successfully locked or NULL if it raced with + * a transhuge insertion. + */ +static pte_t *lock_pte_protection(struct vm_area_struct *vma, pmd_t *pmd, + unsigned long addr, int prot_numa, spinlock_t **ptl) +{ + pte_t *pte; + spinlock_t *pmdl; + + /* !prot_numa is protected by mmap_sem held for write */ + if (!prot_numa) + return pte_offset_map_lock(vma->vm_mm, pmd, addr, ptl); + + pmdl = pmd_lock(vma->vm_mm, pmd); + if (unlikely(pmd_trans_huge(*pmd) || pmd_none(*pmd))) { + spin_unlock(pmdl); + return NULL; + } + + pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, ptl); + spin_unlock(pmdl); + return pte; +} + static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr, unsigned long end, pgprot_t newprot, int dirty_accountable, int prot_numa) @@ -45,17 +73,9 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, spinlock_t *ptl; unsigned long pages = 0; - pte = pte_offset_map_lock(mm, pmd, addr, &ptl); - - /* - * For a prot_numa update we only hold mmap_sem for read so there is a - * potential race with faulting where a pmd was temporarily none so - * recheck it under the lock and bail if we race - */ - if (prot_numa && unlikely(pmd_trans_huge(*pmd))) { - pte_unmap_unlock(pte, ptl); + pte = lock_pte_protection(vma, pmd, addr, prot_numa, &ptl); + if (!pte) return 0; - } arch_enter_lazy_mmu_mode(); do { -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f47.google.com (mail-pa0-f47.google.com [209.85.220.47]) by kanga.kvack.org (Postfix) with ESMTP id 23EDA6B0281 for ; Fri, 21 Mar 2014 18:07:01 -0400 (EDT) Received: by mail-pa0-f47.google.com with SMTP id lj1so2938607pab.34 for ; Fri, 21 Mar 2014 15:07:00 -0700 (PDT) Received: from mail.linuxfoundation.org (mail.linuxfoundation.org. [140.211.169.12]) by mx.google.com with ESMTP id m8si4173915pbd.288.2014.03.21.15.06.59 for ; Fri, 21 Mar 2014 15:07:00 -0700 (PDT) Date: Fri, 21 Mar 2014 15:06:58 -0700 From: Andrew Morton Subject: Re: [PATCH] mm: numa: Recheck for transhuge pages under lock during protection changes Message-Id: <20140321150658.386926ac3afa263946b1a2aa@linux-foundation.org> In-Reply-To: <20140319143831.GA4751@suse.de> References: <20140307182745.GD1931@suse.de> <20140311162845.GA30604@suse.de> <531F3F15.8050206@oracle.com> <531F4128.8020109@redhat.com> <531F48CC.303@oracle.com> <20140311180652.GM10663@suse.de> <531F616A.7060300@oracle.com> <20140311122859.fb6c1e772d82d9f4edd02f52@linux-foundation.org> <20140312103602.GN10663@suse.de> <5323C5D9.2070902@oracle.com> <20140319143831.GA4751@suse.de> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: Sasha Levin , Rik van Riel , David Rientjes , hhuang@redhat.com, knoel@redhat.com, aarcange@redhat.com, Davidlohr Bueso , "linux-mm@kvack.org" , LKML On Wed, 19 Mar 2014 14:38:32 +0000 Mel Gorman wrote: > On Fri, Mar 14, 2014 at 11:15:37PM -0400, Sasha Levin wrote: > > On 03/12/2014 06:36 AM, Mel Gorman wrote: > > >Andrew, this should go with the patches > > >mmnuma-reorganize-change_pmd_range.patch > > >mmnuma-reorganize-change_pmd_range-fix.patch > > >move-mmu-notifier-call-from-change_protection-to-change_pmd_range.patch > > >in mmotm please. > > > > > >Thanks. > > > > > >---8<--- > > >From: Mel Gorman > > >Subject: [PATCH] mm: numa: Recheck for transhuge pages under lock during protection changes > > > > > >Sasha Levin reported the following bug using trinity > > > > I'm seeing a different issue with this patch. A NULL ptr deref occurs in the > > pte_offset_map_lock() macro right before the new recheck code: > > > > This on top? > > I tried testing it but got all sorts of carnage that trinity throw up > in the mix and ordinary testing does not trigger the race. I've no idea > which of the current mess of trinity-exposed bugs you've encountered and > got fixed already. > Where are we at with this one? With this patch I'm now sitting on mmnuma-reorganize-change_pmd_range.patch mmnuma-reorganize-change_pmd_range-fix.patch mm-numa-recheck-for-transhuge-pages-under-lock-during-protection-changes.patch mm-numa-recheck-for-transhuge-pages-under-lock-during-protection-changes-fix.patch move-mmu-notifier-call-from-change_protection-to-change_pmd_range.patch Where the first four are prep work for the fifth. move-mmu-notifier-call-from-change_protection-to-change_pmd_range.patch is a performance optimisation for KVM guests. Should I drop all five, try again next time? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wg0-f52.google.com (mail-wg0-f52.google.com [74.125.82.52]) by kanga.kvack.org (Postfix) with ESMTP id 707126B0283 for ; Fri, 21 Mar 2014 18:24:21 -0400 (EDT) Received: by mail-wg0-f52.google.com with SMTP id k14so1969456wgh.11 for ; Fri, 21 Mar 2014 15:24:20 -0700 (PDT) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTP id ef5si2770448wib.79.2014.03.21.15.24.19 for ; Fri, 21 Mar 2014 15:24:20 -0700 (PDT) Message-ID: <532CBC0C.1080403@redhat.com> Date: Fri, 21 Mar 2014 18:24:12 -0400 From: Rik van Riel MIME-Version: 1.0 Subject: Re: [PATCH] mm: numa: Recheck for transhuge pages under lock during protection changes References: <20140307182745.GD1931@suse.de> <20140311162845.GA30604@suse.de> <531F3F15.8050206@oracle.com> <531F4128.8020109@redhat.com> <531F48CC.303@oracle.com> <20140311180652.GM10663@suse.de> <531F616A.7060300@oracle.com> <20140311122859.fb6c1e772d82d9f4edd02f52@linux-foundation.org> <20140312103602.GN10663@suse.de> <5323C5D9.2070902@oracle.com> <20140319143831.GA4751@suse.de> In-Reply-To: <20140319143831.GA4751@suse.de> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman , Sasha Levin Cc: Andrew Morton , David Rientjes , hhuang@redhat.com, knoel@redhat.com, aarcange@redhat.com, Davidlohr Bueso , "linux-mm@kvack.org" , LKML On 03/19/2014 10:38 AM, Mel Gorman wrote: > On Fri, Mar 14, 2014 at 11:15:37PM -0400, Sasha Levin wrote: >> On 03/12/2014 06:36 AM, Mel Gorman wrote: >>> Andrew, this should go with the patches >>> mmnuma-reorganize-change_pmd_range.patch >>> mmnuma-reorganize-change_pmd_range-fix.patch >>> move-mmu-notifier-call-from-change_protection-to-change_pmd_range.patch >>> in mmotm please. >>> >>> Thanks. >>> >>> ---8<--- >>> From: Mel Gorman >>> Subject: [PATCH] mm: numa: Recheck for transhuge pages under lock during protection changes >>> >>> Sasha Levin reported the following bug using trinity >> >> I'm seeing a different issue with this patch. A NULL ptr deref occurs in the >> pte_offset_map_lock() macro right before the new recheck code: >> > > This on top? > > I tried testing it but got all sorts of carnage that trinity throw up > in the mix and ordinary testing does not trigger the race. I've no idea > which of the current mess of trinity-exposed bugs you've encountered and > got fixed already. Eeeep indeed. If we re-test the transhuge status, we need to take the pmd lock, and not the potentially non-existent pte lock. Good catch. > ---8<--- > From: Mel Gorman > Subject: [PATCH] mm: numa: Recheck for transhuge pages under lock during protection changes -fix > > Signed-off-by: Mel Gorman Reviewed-by: Rik van Riel > --- > mm/mprotect.c | 40 ++++++++++++++++++++++++++++++---------- > 1 file changed, 30 insertions(+), 10 deletions(-) > > diff --git a/mm/mprotect.c b/mm/mprotect.c > index 66973db..c43d557 100644 > --- a/mm/mprotect.c > +++ b/mm/mprotect.c > @@ -36,6 +36,34 @@ static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot) > } > #endif > > +/* > + * For a prot_numa update we only hold mmap_sem for read so there is a > + * potential race with faulting where a pmd was temporarily none. This > + * function checks for a transhuge pmd under the appropriate lock. It > + * returns a pte if it was successfully locked or NULL if it raced with > + * a transhuge insertion. > + */ > +static pte_t *lock_pte_protection(struct vm_area_struct *vma, pmd_t *pmd, > + unsigned long addr, int prot_numa, spinlock_t **ptl) > +{ > + pte_t *pte; > + spinlock_t *pmdl; > + > + /* !prot_numa is protected by mmap_sem held for write */ > + if (!prot_numa) > + return pte_offset_map_lock(vma->vm_mm, pmd, addr, ptl); > + > + pmdl = pmd_lock(vma->vm_mm, pmd); > + if (unlikely(pmd_trans_huge(*pmd) || pmd_none(*pmd))) { > + spin_unlock(pmdl); > + return NULL; > + } > + > + pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, ptl); > + spin_unlock(pmdl); > + return pte; > +} > + > static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, > unsigned long addr, unsigned long end, pgprot_t newprot, > int dirty_accountable, int prot_numa) > @@ -45,17 +73,9 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, > spinlock_t *ptl; > unsigned long pages = 0; > > - pte = pte_offset_map_lock(mm, pmd, addr, &ptl); > - > - /* > - * For a prot_numa update we only hold mmap_sem for read so there is a > - * potential race with faulting where a pmd was temporarily none so > - * recheck it under the lock and bail if we race > - */ > - if (prot_numa && unlikely(pmd_trans_huge(*pmd))) { > - pte_unmap_unlock(pte, ptl); > + pte = lock_pte_protection(vma, pmd, addr, prot_numa, &ptl); > + if (!pte) > return 0; > - } > > arch_enter_lazy_mmu_mode(); > do { > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org > -- All rights reversed -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-yh0-f42.google.com (mail-yh0-f42.google.com [209.85.213.42]) by kanga.kvack.org (Postfix) with ESMTP id 124246B0283 for ; Fri, 21 Mar 2014 18:37:19 -0400 (EDT) Received: by mail-yh0-f42.google.com with SMTP id t59so3068836yho.15 for ; Fri, 21 Mar 2014 15:37:18 -0700 (PDT) Received: from aserp1040.oracle.com (aserp1040.oracle.com. [141.146.126.69]) by mx.google.com with ESMTPS id y29si7548942yhg.113.2014.03.21.15.37.18 for (version=TLSv1 cipher=RC4-SHA bits=128/128); Fri, 21 Mar 2014 15:37:18 -0700 (PDT) Message-ID: <532CBF08.4030100@oracle.com> Date: Fri, 21 Mar 2014 18:36:56 -0400 From: Sasha Levin MIME-Version: 1.0 Subject: Re: [PATCH] mm: numa: Recheck for transhuge pages under lock during protection changes References: <20140307182745.GD1931@suse.de> <20140311162845.GA30604@suse.de> <531F3F15.8050206@oracle.com> <531F4128.8020109@redhat.com> <531F48CC.303@oracle.com> <20140311180652.GM10663@suse.de> <531F616A.7060300@oracle.com> <20140311122859.fb6c1e772d82d9f4edd02f52@linux-foundation.org> <20140312103602.GN10663@suse.de> <5323C5D9.2070902@oracle.com> <20140319143831.GA4751@suse.de> <20140321150658.386926ac3afa263946b1a2aa@linux-foundation.org> In-Reply-To: <20140321150658.386926ac3afa263946b1a2aa@linux-foundation.org> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton , Mel Gorman Cc: Rik van Riel , David Rientjes , hhuang@redhat.com, knoel@redhat.com, aarcange@redhat.com, Davidlohr Bueso , "linux-mm@kvack.org" , LKML On 03/21/2014 06:06 PM, Andrew Morton wrote: > On Wed, 19 Mar 2014 14:38:32 +0000 Mel Gorman wrote: > >> >On Fri, Mar 14, 2014 at 11:15:37PM -0400, Sasha Levin wrote: >>> > >On 03/12/2014 06:36 AM, Mel Gorman wrote: >>>> > > >Andrew, this should go with the patches >>>> > > >mmnuma-reorganize-change_pmd_range.patch >>>> > > >mmnuma-reorganize-change_pmd_range-fix.patch >>>> > > >move-mmu-notifier-call-from-change_protection-to-change_pmd_range.patch >>>> > > >in mmotm please. >>>> > > > >>>> > > >Thanks. >>>> > > > >>>> > > >---8<--- >>>> > > >From: Mel Gorman >>>> > > >Subject: [PATCH] mm: numa: Recheck for transhuge pages under lock during protection changes >>>> > > > >>>> > > >Sasha Levin reported the following bug using trinity >>> > > >>> > >I'm seeing a different issue with this patch. A NULL ptr deref occurs in the >>> > >pte_offset_map_lock() macro right before the new recheck code: >>> > > >> > >> >This on top? >> > >> >I tried testing it but got all sorts of carnage that trinity throw up >> >in the mix and ordinary testing does not trigger the race. I've no idea >> >which of the current mess of trinity-exposed bugs you've encountered and >> >got fixed already. >> > > Where are we at with this one? Looking good here, haven't seen any of the issues reported in this thread. Thanks, Sasha -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753772AbaCFVMh (ORCPT ); Thu, 6 Mar 2014 16:12:37 -0500 Received: from userp1040.oracle.com ([156.151.31.81]:42700 "EHLO userp1040.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753750AbaCFVMd (ORCPT ); Thu, 6 Mar 2014 16:12:33 -0500 Message-ID: <5318E4BC.50301@oracle.com> Date: Thu, 06 Mar 2014 16:12:28 -0500 From: Sasha Levin User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.2.0 MIME-Version: 1.0 To: Rik van Riel CC: David Rientjes , Andrew Morton , "linux-mm@kvack.org" , LKML Subject: mm: kernel BUG at mm/mprotect.c:149 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-Source-IP: acsinet22.oracle.com [141.146.126.238] Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi Rik, While fuzzing with trinity inside a KVM tools guest running latest -next kernel I've hit the following spew. This seems to be introduced by your patch "mm,numa: reorganize change_pmd_range()". [ 886.745765] kernel BUG at mm/mprotect.c:149! [ 886.746831] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC [ 886.748511] Dumping ftrace buffer: [ 886.749998] (ftrace buffer empty) [ 886.750110] Modules linked in: [ 886.751396] CPU: 20 PID: 26219 Comm: trinity-c216 Tainted: G W 3.14.0-rc5-next-20140305-sasha-00011-ge06f5f3-dirty #105 [ 886.751396] task: ffff8800b6c80000 ti: ffff880228436000 task.ti: ffff880228436000 [ 886.751396] RIP: 0010:[] [] change_protection_range+0x3b3/0x500 [ 886.751396] RSP: 0000:ffff880228437da8 EFLAGS: 00010282 [ 886.751396] RAX: 8000000527c008e5 RBX: 00007f647916e000 RCX: 0000000000000000 [ 886.751396] RDX: ffff8802ef488e40 RSI: 00007f6479000000 RDI: 8000000527c008e5 [ 886.751396] RBP: ffff880228437e78 R08: 0000000000000000 R09: 0000000000000000 [ 886.751396] R10: 0000000000000001 R11: 0000000000000000 R12: ffff8802ef488e40 [ 886.751396] R13: 00007f6479000000 R14: 00007f647916e000 R15: 00007f646e34e000 [ 886.751396] FS: 00007f64b28d4700(0000) GS:ffff88052ba00000(0000) knlGS:0000000000000000 [ 886.751396] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 886.751396] CR2: 00007f64aed83af8 CR3: 0000000206e52000 CR4: 00000000000006a0 [ 886.751396] Stack: [ 886.751396] ffff880200000001 ffffffff8447f152 ffff880228437dd8 ffff880228af3000 [ 886.769142] 00007f646e34e000 00007f647916dfff 0000000000000000 00007f647916e000 [ 886.769142] ffff880206e527f0 00007f647916dfff 0000000000000000 0000000000000000 [ 886.769142] Call Trace: [ 886.769142] [] ? preempt_count_sub+0xe2/0x120 [ 886.769142] [] change_protection+0x25/0x30 [ 886.769142] [] change_prot_numa+0x1b/0x30 [ 886.769142] [] task_numa_work+0x279/0x360 [ 886.769142] [] task_work_run+0xae/0xf0 [ 886.769142] [] do_notify_resume+0x8e/0xe0 [ 886.769142] [] retint_signal+0x4d/0x92 [ 886.769142] Code: 49 8b 3c 24 48 83 3d fc 2e ba 04 00 75 12 0f 0b 0f 1f 84 00 00 00 00 00 eb fe 66 0f 1f 44 00 00 48 89 f8 66 66 66 90 84 c0 79 0d <0f> 0b 0f 1f 00 eb fe 66 0f 1f 44 00 00 8b 4d 9c 44 8b 4d 98 89 [ 886.769142] RIP [] change_protection_range+0x3b3/0x500 [ 886.769142] RSP Thanks, Sasha From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751946AbaCFWcs (ORCPT ); Thu, 6 Mar 2014 17:32:48 -0500 Received: from mx1.redhat.com ([209.132.183.28]:38825 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751531AbaCFWcr (ORCPT ); Thu, 6 Mar 2014 17:32:47 -0500 Date: Thu, 6 Mar 2014 17:31:12 -0500 From: Rik van Riel To: Sasha Levin Cc: David Rientjes , Andrew Morton , "linux-mm@kvack.org" , LKML , mgorman@suse.de, hhuang@redhat.com, knoel@redhat.com, aarcange@redhat.com Subject: [PATCH -mm] mm,numa,mprotect: always continue after finding a stable thp page Message-ID: <20140306173112.3bd6802b@cuia.bos.redhat.com> In-Reply-To: <5318E4BC.50301@oracle.com> References: <5318E4BC.50301@oracle.com> Organization: Red Hat, Inc Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, 06 Mar 2014 16:12:28 -0500 Sasha Levin wrote: > While fuzzing with trinity inside a KVM tools guest running latest -next kernel I've hit the > following spew. This seems to be introduced by your patch "mm,numa: reorganize change_pmd_range()". That patch should not introduce any functional changes, except for the VM_BUG_ON that catches the fact that we fell through to the 4kB pte handling code, despite having just handled a THP pmd... Does this patch fix the issue? Mel, am I overlooking anything obvious? :) ---8<--- Subject: mm,numa,mprotect: always continue after finding a stable thp page When turning a thp pmds into a NUMA one, change_huge_pmd will return 0 when the pmd already is a NUMA pmd. However, change_pmd_range would fall through to the code that handles 4kB pages, instead of continuing on to the next pmd. Signed-off-by: Rik van Riel Reported-by: Sasha Levin Cc: Mel Gorman --- mm/mprotect.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/mprotect.c b/mm/mprotect.c index 61f0a07..4746608 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -138,8 +138,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma, pages += HPAGE_PMD_NR; nr_huge_updates++; } - continue; } + continue; } /* fall through, the trans huge pmd just split */ } From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752232AbaCFWdN (ORCPT ); Thu, 6 Mar 2014 17:33:13 -0500 Received: from mx1.redhat.com ([209.132.183.28]:64243 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750919AbaCFWdK (ORCPT ); Thu, 6 Mar 2014 17:33:10 -0500 Date: Thu, 6 Mar 2014 17:31:37 -0500 From: Rik van Riel To: Sasha Levin Cc: David Rientjes , Andrew Morton , "linux-mm@kvack.org" , LKML , mgorman@suse.de, hhuang@redhat.com, knoel@redhat.com, aarcange@redhat.com Subject: [PATCH -mm] mm,numa,mprotect: always continue after finding a stable thp page Message-ID: <20140306173137.6a23a0b2@cuia.bos.redhat.com> In-Reply-To: <5318E4BC.50301@oracle.com> References: <5318E4BC.50301@oracle.com> Organization: Red Hat, Inc Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, 06 Mar 2014 16:12:28 -0500 Sasha Levin wrote: > While fuzzing with trinity inside a KVM tools guest running latest -next kernel I've hit the > following spew. This seems to be introduced by your patch "mm,numa: reorganize change_pmd_range()". That patch should not introduce any functional changes, except for the VM_BUG_ON that catches the fact that we fell through to the 4kB pte handling code, despite having just handled a THP pmd... Does this patch fix the issue? Mel, am I overlooking anything obvious? :) ---8<--- Subject: mm,numa,mprotect: always continue after finding a stable thp page When turning a thp pmds into a NUMA one, change_huge_pmd will return 0 when the pmd already is a NUMA pmd. However, change_pmd_range would fall through to the code that handles 4kB pages, instead of continuing on to the next pmd. Signed-off-by: Rik van Riel Reported-by: Sasha Levin Cc: Mel Gorman --- mm/mprotect.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/mprotect.c b/mm/mprotect.c index 61f0a07..4746608 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -138,8 +138,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma, pages += HPAGE_PMD_NR; nr_huge_updates++; } - continue; } + continue; } /* fall through, the trans huge pmd just split */ } From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751530AbaCFWyY (ORCPT ); Thu, 6 Mar 2014 17:54:24 -0500 Received: from mx1.redhat.com ([209.132.183.28]:34596 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751095AbaCFWyW (ORCPT ); Thu, 6 Mar 2014 17:54:22 -0500 Message-ID: <5318FC3F.4080204@redhat.com> Date: Thu, 06 Mar 2014 17:52:47 -0500 From: Rik van Riel User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.1.0 MIME-Version: 1.0 To: Sasha Levin CC: David Rientjes , Andrew Morton , "linux-mm@kvack.org" , LKML , mgorman@suse.de, hhuang@redhat.com, knoel@redhat.com, aarcange@redhat.com Subject: Re: [PATCH -mm] mm,numa,mprotect: always continue after finding a stable thp page References: <5318E4BC.50301@oracle.com> <20140306173137.6a23a0b2@cuia.bos.redhat.com> In-Reply-To: <20140306173137.6a23a0b2@cuia.bos.redhat.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 03/06/2014 05:31 PM, Rik van Riel wrote: > On Thu, 06 Mar 2014 16:12:28 -0500 > Sasha Levin wrote: > >> While fuzzing with trinity inside a KVM tools guest running latest -next kernel I've hit the >> following spew. This seems to be introduced by your patch "mm,numa: reorganize change_pmd_range()". > > That patch should not introduce any functional changes, except for > the VM_BUG_ON that catches the fact that we fell through to the 4kB > pte handling code, despite having just handled a THP pmd... > > Does this patch fix the issue? > > Mel, am I overlooking anything obvious? :) > > ---8<--- > > Subject: mm,numa,mprotect: always continue after finding a stable thp page > > When turning a thp pmds into a NUMA one, change_huge_pmd will > return 0 when the pmd already is a NUMA pmd. I did miss something obvious. In this case, the code returns 1. > However, change_pmd_range would fall through to the code that > handles 4kB pages, instead of continuing on to the next pmd. Maybe the case that I missed is when khugepaged is in the process of collapsing pages into a transparent huge page? If the virtual CPU gets de-scheduled by the host for long enough, it would be possible for khugepaged to run on another virtual CPU, and turn the pmd into a THP pmd, before that VM_BUG_ON test. I see that khugepaged takes the mmap_sem for writing in the collapse code, and it looks like task_numa_work takes the mmap_sem for reading, so I guess that may not be possible? Andrea, would you happen to know what case am I missing? From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753181AbaCGBEU (ORCPT ); Thu, 6 Mar 2014 20:04:20 -0500 Received: from userp1040.oracle.com ([156.151.31.81]:27005 "EHLO userp1040.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751438AbaCGBET (ORCPT ); Thu, 6 Mar 2014 20:04:19 -0500 Message-ID: <53191B05.20102@oracle.com> Date: Thu, 06 Mar 2014 20:04:05 -0500 From: Sasha Levin User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.2.0 MIME-Version: 1.0 To: Rik van Riel CC: David Rientjes , Andrew Morton , "linux-mm@kvack.org" , LKML , mgorman@suse.de, hhuang@redhat.com, knoel@redhat.com, aarcange@redhat.com Subject: Re: [PATCH -mm] mm,numa,mprotect: always continue after finding a stable thp page References: <5318E4BC.50301@oracle.com> <20140306173137.6a23a0b2@cuia.bos.redhat.com> In-Reply-To: <20140306173137.6a23a0b2@cuia.bos.redhat.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-Source-IP: acsinet21.oracle.com [141.146.126.237] Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 03/06/2014 05:31 PM, Rik van Riel wrote: > On Thu, 06 Mar 2014 16:12:28 -0500 > Sasha Levin wrote: > >> >While fuzzing with trinity inside a KVM tools guest running latest -next kernel I've hit the >> >following spew. This seems to be introduced by your patch "mm,numa: reorganize change_pmd_range()". > That patch should not introduce any functional changes, except for > the VM_BUG_ON that catches the fact that we fell through to the 4kB > pte handling code, despite having just handled a THP pmd... > > Does this patch fix the issue? I'm seeing a different issue with this patch: [ 625.886532] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018 [ 625.888056] IP: [] __lock_acquire+0xbc/0x580 [ 625.888969] PGD 427842067 PUD 41b321067 PMD 0 [ 625.889775] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC [ 625.890773] Dumping ftrace buffer: [ 625.891454] (ftrace buffer empty) [ 625.892041] Modules linked in: [ 625.892610] CPU: 9 PID: 18991 Comm: trinity-c198 Tainted: G W 3.14.0-rc5-next-20140305-sasha-00012-g00c5c8f-dirty #110 [ 625.894293] task: ffff8804278cb000 ti: ffff88041d89a000 task.ti: ffff88041d89a000 [ 625.895678] RIP: 0010:[] [] __lock_acquire+0xbc/0x580 [ 625.896232] RSP: 0018:ffff88041d89bbe8 EFLAGS: 00010002 [ 625.896232] RAX: 0000000000000082 RBX: 0000000000000018 RCX: 0000000000000000 [ 625.896232] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000018 [ 625.896232] RBP: ffff88041d89bc58 R08: 0000000000000001 R09: 0000000000000000 [ 625.896232] R10: 0000000000000001 R11: 0000000000000001 R12: ffff8804278cb000 [ 625.896232] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001 [ 625.896232] FS: 00007f144ab81700(0000) GS:ffff880a2b800000(0000) knlGS:0000000000000000 [ 625.896232] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 625.896232] CR2: 0000000000000018 CR3: 00000004290cf000 CR4: 00000000000006a0 [ 625.896232] Stack: [ 625.896232] ffff8802295abc18 ffff8804278cb000 ffff880a2b9d84c0 ffff880a2b9d84d0 [ 625.896232] ffff88041d89bc18 ffffffff810bb084 ffff88041d89bc28 ffffffff8107912d [ 625.896232] ffff88041d89bc58 ffff8804278cb000 0000000000000000 0000000000000001 [ 625.896232] Call Trace: [ 625.896232] [] ? kvm_clock_read+0x24/0x50 [ 625.896232] [] ? sched_clock+0x1d/0x30 [ 625.896232] [] lock_acquire+0x182/0x1d0 [ 625.896232] [] ? change_pte_range+0xa3/0x3b0 [ 625.896232] [] ? sched_clock+0x1d/0x30 [ 625.896232] [] _raw_spin_lock+0x40/0x80 [ 625.896232] [] ? change_pte_range+0xa3/0x3b0 [ 625.896232] [] change_pte_range+0xa3/0x3b0 [ 625.896232] [] change_protection_range+0x3a8/0x4d0 [ 625.896232] [] ? preempt_count_sub+0xe2/0x120 [ 625.896232] [] change_protection+0x25/0x30 [ 625.896232] [] change_prot_numa+0x1b/0x30 [ 625.896232] [] task_numa_work+0x279/0x360 [ 625.896232] [] task_work_run+0xae/0xf0 [ 625.896232] [] do_notify_resume+0x8e/0xe0 [ 625.896232] [] int_signal+0x12/0x17 [ 625.896232] Code: c2 37 f6 6b 85 be fa 0b 00 00 48 c7 c7 a1 4f 6c 85 e8 09 60 f9 ff 31 c0 e9 9c 04 00 00 66 90 44 8b 1d 39 1f cd 04 45 85 db 74 0c <48> 81 3b 40 52 76 87 75 06 0f 1f 00 45 31 c0 83 fe 01 77 0c 89 [ 625.896232] RIP [] __lock_acquire+0xbc/0x580 [ 625.896232] RSP [ 625.896232] CR2: 0000000000000018 Thanks, Sasha From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753472AbaCGOG5 (ORCPT ); Fri, 7 Mar 2014 09:06:57 -0500 Received: from cantor2.suse.de ([195.135.220.15]:48897 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753138AbaCGOG4 (ORCPT ); Fri, 7 Mar 2014 09:06:56 -0500 Date: Fri, 7 Mar 2014 14:06:50 +0000 From: Mel Gorman To: Rik van Riel Cc: Sasha Levin , David Rientjes , Andrew Morton , "linux-mm@kvack.org" , LKML , hhuang@redhat.com, knoel@redhat.com, aarcange@redhat.com Subject: Re: [PATCH -mm] mm,numa,mprotect: always continue after finding a stable thp page Message-ID: <20140307140650.GA1931@suse.de> References: <5318E4BC.50301@oracle.com> <20140306173137.6a23a0b2@cuia.bos.redhat.com> <5318FC3F.4080204@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <5318FC3F.4080204@redhat.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Mar 06, 2014 at 05:52:47PM -0500, Rik van Riel wrote: > On 03/06/2014 05:31 PM, Rik van Riel wrote: > >On Thu, 06 Mar 2014 16:12:28 -0500 > >Sasha Levin wrote: > > > >>While fuzzing with trinity inside a KVM tools guest running latest -next kernel I've hit the > >>following spew. This seems to be introduced by your patch "mm,numa: reorganize change_pmd_range()". > > > >That patch should not introduce any functional changes, except for > >the VM_BUG_ON that catches the fact that we fell through to the 4kB > >pte handling code, despite having just handled a THP pmd... > > > >Does this patch fix the issue? > > > >Mel, am I overlooking anything obvious? :) > > > >---8<--- > > > >Subject: mm,numa,mprotect: always continue after finding a stable thp page > > > >When turning a thp pmds into a NUMA one, change_huge_pmd will > >return 0 when the pmd already is a NUMA pmd. > > I did miss something obvious. In this case, the code returns 1. > > >However, change_pmd_range would fall through to the code that > >handles 4kB pages, instead of continuing on to the next pmd. > > Maybe the case that I missed is when khugepaged is in the > process of collapsing pages into a transparent huge page? > > If the virtual CPU gets de-scheduled by the host for long > enough, it would be possible for khugepaged to run on > another virtual CPU, and turn the pmd into a THP pmd, > before that VM_BUG_ON test. > > I see that khugepaged takes the mmap_sem for writing in the > collapse code, and it looks like task_numa_work takes the > mmap_sem for reading, so I guess that may not be possible? > mmap_sem will prevent a parallel collapse but what prevents something like the following? do_huge_pmd_wp_page change_pmd_range if (!pmd_trans_huge(*pmd) && pmd_none_or_clear_bad(pmd)) continue; pmdp_clear_flush(vma, haddr, pmd); if (pmd_trans_huge(*pmd)) { .... path not taken .... } page_add_new_anon_rmap(new_page, vma, haddr); set_pmd_at(mm, haddr, pmd, entry); VM_BUG_ON(pmd_trans_huge(*pmd)); We do not hold the page table lock during the pmd_trans_huge check and we do not recheck it under PTF lock in change_pte_range() -- Mel Gorman SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752780AbaCGPJa (ORCPT ); Fri, 7 Mar 2014 10:09:30 -0500 Received: from cantor2.suse.de ([195.135.220.15]:49883 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751744AbaCGPJ3 (ORCPT ); Fri, 7 Mar 2014 10:09:29 -0500 Date: Fri, 7 Mar 2014 15:09:23 +0000 From: Mel Gorman To: Rik van Riel Cc: Sasha Levin , David Rientjes , Andrew Morton , "linux-mm@kvack.org" , LKML , hhuang@redhat.com, knoel@redhat.com, aarcange@redhat.com Subject: Re: [PATCH -mm] mm,numa,mprotect: always continue after finding a stable thp page Message-ID: <20140307150923.GB1931@suse.de> References: <5318E4BC.50301@oracle.com> <20140306173137.6a23a0b2@cuia.bos.redhat.com> <5318FC3F.4080204@redhat.com> <20140307140650.GA1931@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20140307140650.GA1931@suse.de> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Mar 07, 2014 at 02:06:50PM +0000, Mel Gorman wrote: > On Thu, Mar 06, 2014 at 05:52:47PM -0500, Rik van Riel wrote: > > On 03/06/2014 05:31 PM, Rik van Riel wrote: > > >On Thu, 06 Mar 2014 16:12:28 -0500 > > >Sasha Levin wrote: > > > > > >>While fuzzing with trinity inside a KVM tools guest running latest -next kernel I've hit the > > >>following spew. This seems to be introduced by your patch "mm,numa: reorganize change_pmd_range()". > > > > > >That patch should not introduce any functional changes, except for > > >the VM_BUG_ON that catches the fact that we fell through to the 4kB > > >pte handling code, despite having just handled a THP pmd... > > > > > >Does this patch fix the issue? > > > > > >Mel, am I overlooking anything obvious? :) > > > > > >---8<--- > > > > > >Subject: mm,numa,mprotect: always continue after finding a stable thp page > > > > > >When turning a thp pmds into a NUMA one, change_huge_pmd will > > >return 0 when the pmd already is a NUMA pmd. > > > > I did miss something obvious. In this case, the code returns 1. > > > > >However, change_pmd_range would fall through to the code that > > >handles 4kB pages, instead of continuing on to the next pmd. > > > > Maybe the case that I missed is when khugepaged is in the > > process of collapsing pages into a transparent huge page? > > > > If the virtual CPU gets de-scheduled by the host for long > > enough, it would be possible for khugepaged to run on > > another virtual CPU, and turn the pmd into a THP pmd, > > before that VM_BUG_ON test. > > > > I see that khugepaged takes the mmap_sem for writing in the > > collapse code, and it looks like task_numa_work takes the > > mmap_sem for reading, so I guess that may not be possible? > > > > mmap_sem will prevent a parallel collapse but what prevents something > like the following? > > do_huge_pmd_wp_page > change_pmd_range > if (!pmd_trans_huge(*pmd) && pmd_none_or_clear_bad(pmd)) > continue; > pmdp_clear_flush(vma, haddr, pmd); > if (pmd_trans_huge(*pmd)) { > .... path not taken .... > } > page_add_new_anon_rmap(new_page, vma, haddr); > set_pmd_at(mm, haddr, pmd, entry); > VM_BUG_ON(pmd_trans_huge(*pmd)); > > We do not hold the page table lock during the pmd_trans_huge check and we > do not recheck it under PTF lock in change_pte_range() > This is a completely untested prototype. It rechecks pmd_trans_huge under the lock and falls through if it hit a parallel split. It's not perfect because it could decide to fall through just because there was no prot_numa work to do but it's for illustration purposes. Secondly, I noted that you are calling invalidate for every pmd range. Is that not a lot of invalidations? We could do the same by just tracking the address of the first invalidation. diff --git a/mm/mprotect.c b/mm/mprotect.c index 2afc40e..a0050fc 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -46,6 +46,20 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, unsigned long pages = 0; pte = pte_offset_map_lock(mm, pmd, addr, &ptl); + + /* transhuge page could have faulted in parallel */ + if (unlikely(pmd_trans_huge(*pmd))) { + int nr_ptes; + + pte_unmap_unlock(pte, ptl); + nr_ptes = change_huge_pmd(vma, pmd, addr, newprot, prot_numa); + if (nr_ptes) + return nr_ptes; + + /* Page was split so retake the ptl and handle ptes */ + pte = pte_offset_map_lock(mm, pmd, addr, &ptl); + } + arch_enter_lazy_mmu_mode(); do { oldpte = *pte; @@ -106,14 +120,14 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, static inline unsigned long change_pmd_range(struct vm_area_struct *vma, pud_t *pud, unsigned long addr, unsigned long end, - pgprot_t newprot, int dirty_accountable, int prot_numa) + pgprot_t newprot, int dirty_accountable, int prot_numa, + unsigned long mni_start, unsigned long mni_end) { pmd_t *pmd; struct mm_struct *mm = vma->vm_mm; unsigned long next; unsigned long pages = 0; unsigned long nr_huge_updates = 0; - unsigned long mni_start = 0; pmd = pmd_offset(pud, addr); do { @@ -124,10 +138,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma, continue; /* invoke the mmu notifier if the pmd is populated */ - if (!mni_start) { - mni_start = addr; - mmu_notifier_invalidate_range_start(mm, mni_start, end); - } + if (!mni_start) + mmu_notifier_invalidate_range_start(mm, addr, mni_end); if (pmd_trans_huge(*pmd)) { if (next - addr != HPAGE_PMD_SIZE) @@ -141,8 +153,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma, pages += HPAGE_PMD_NR; nr_huge_updates++; } - continue; } + continue; } /* fall through, the trans huge pmd just split */ } @@ -152,9 +164,6 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma, pages += this_pages; } while (pmd++, addr = next, addr != end); - if (mni_start) - mmu_notifier_invalidate_range_end(mm, mni_start, end); - if (nr_huge_updates) count_vm_numa_events(NUMA_HUGE_PTE_UPDATES, nr_huge_updates); return pages; @@ -167,16 +176,25 @@ static inline unsigned long change_pud_range(struct vm_area_struct *vma, pud_t *pud; unsigned long next; unsigned long pages = 0; + unsigned long mni_start = 0; pud = pud_offset(pgd, addr); do { + unsigned long this_pages; next = pud_addr_end(addr, end); if (pud_none_or_clear_bad(pud)) continue; - pages += change_pmd_range(vma, pud, addr, next, newprot, - dirty_accountable, prot_numa); + this_pages = change_pmd_range(vma, pud, addr, next, newprot, + dirty_accountable, prot_numa, mni_start, + end); + if (this_pages) + mni_start = addr; + pages += this_pages; } while (pud++, addr = next, addr != end); + if (mni_start) + mmu_notifier_invalidate_range_end(vma->vm_mm, mni_start, end); + return pages; } -- Mel Gorman SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753579AbaCGPOF (ORCPT ); Fri, 7 Mar 2014 10:14:05 -0500 Received: from cantor2.suse.de ([195.135.220.15]:49995 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753559AbaCGPOC (ORCPT ); Fri, 7 Mar 2014 10:14:02 -0500 Date: Fri, 7 Mar 2014 15:13:57 +0000 From: Mel Gorman To: Rik van Riel Cc: Sasha Levin , David Rientjes , Andrew Morton , "linux-mm@kvack.org" , LKML , hhuang@redhat.com, knoel@redhat.com, aarcange@redhat.com Subject: Re: [PATCH -mm] mm,numa,mprotect: always continue after finding a stable thp page Message-ID: <20140307151357.GC1931@suse.de> References: <5318E4BC.50301@oracle.com> <20140306173137.6a23a0b2@cuia.bos.redhat.com> <5318FC3F.4080204@redhat.com> <20140307140650.GA1931@suse.de> <20140307150923.GB1931@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20140307150923.GB1931@suse.de> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Mar 07, 2014 at 03:09:23PM +0000, Mel Gorman wrote: > On Fri, Mar 07, 2014 at 02:06:50PM +0000, Mel Gorman wrote: > > On Thu, Mar 06, 2014 at 05:52:47PM -0500, Rik van Riel wrote: > > > On 03/06/2014 05:31 PM, Rik van Riel wrote: > > > >On Thu, 06 Mar 2014 16:12:28 -0500 > > > >Sasha Levin wrote: > > > > > > > >>While fuzzing with trinity inside a KVM tools guest running latest -next kernel I've hit the > > > >>following spew. This seems to be introduced by your patch "mm,numa: reorganize change_pmd_range()". > > > > > > > >That patch should not introduce any functional changes, except for > > > >the VM_BUG_ON that catches the fact that we fell through to the 4kB > > > >pte handling code, despite having just handled a THP pmd... > > > > > > > >Does this patch fix the issue? > > > > > > > >Mel, am I overlooking anything obvious? :) > > > > > > > >---8<--- > > > > > > > >Subject: mm,numa,mprotect: always continue after finding a stable thp page > > > > > > > >When turning a thp pmds into a NUMA one, change_huge_pmd will > > > >return 0 when the pmd already is a NUMA pmd. > > > > > > I did miss something obvious. In this case, the code returns 1. > > > > > > >However, change_pmd_range would fall through to the code that > > > >handles 4kB pages, instead of continuing on to the next pmd. > > > > > > Maybe the case that I missed is when khugepaged is in the > > > process of collapsing pages into a transparent huge page? > > > > > > If the virtual CPU gets de-scheduled by the host for long > > > enough, it would be possible for khugepaged to run on > > > another virtual CPU, and turn the pmd into a THP pmd, > > > before that VM_BUG_ON test. > > > > > > I see that khugepaged takes the mmap_sem for writing in the > > > collapse code, and it looks like task_numa_work takes the > > > mmap_sem for reading, so I guess that may not be possible? > > > > > > > mmap_sem will prevent a parallel collapse but what prevents something > > like the following? > > > > do_huge_pmd_wp_page > > change_pmd_range > > if (!pmd_trans_huge(*pmd) && pmd_none_or_clear_bad(pmd)) > > continue; > > pmdp_clear_flush(vma, haddr, pmd); > > if (pmd_trans_huge(*pmd)) { > > .... path not taken .... > > } > > page_add_new_anon_rmap(new_page, vma, haddr); > > set_pmd_at(mm, haddr, pmd, entry); > > VM_BUG_ON(pmd_trans_huge(*pmd)); > > > > We do not hold the page table lock during the pmd_trans_huge check and we > > do not recheck it under PTF lock in change_pte_range() > > > > This is a completely untested prototype. It rechecks pmd_trans_huge > under the lock and falls through if it hit a parallel split. It's not > perfect because it could decide to fall through just because there was > no prot_numa work to do but it's for illustration purposes. Nope, if there was no prot_numa work to do, it returns 1 so the check is ok. It's only if 0 is returned we need to fall through and really handle ptes. -- Mel Gorman SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752480AbaCGS1v (ORCPT ); Fri, 7 Mar 2014 13:27:51 -0500 Received: from cantor2.suse.de ([195.135.220.15]:53505 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751302AbaCGS1u (ORCPT ); Fri, 7 Mar 2014 13:27:50 -0500 Date: Fri, 7 Mar 2014 18:27:45 +0000 From: Mel Gorman To: Rik van Riel , Sasha Levin Cc: David Rientjes , Andrew Morton , "linux-mm@kvack.org" , LKML , hhuang@redhat.com, knoel@redhat.com, aarcange@redhat.com Subject: Re: [PATCH -mm] mm,numa,mprotect: always continue after finding a stable thp page Message-ID: <20140307182745.GD1931@suse.de> References: <5318E4BC.50301@oracle.com> <20140306173137.6a23a0b2@cuia.bos.redhat.com> <5318FC3F.4080204@redhat.com> <20140307140650.GA1931@suse.de> <20140307150923.GB1931@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20140307150923.GB1931@suse.de> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Mar 07, 2014 at 03:09:23PM +0000, Mel Gorman wrote: > On Fri, Mar 07, 2014 at 02:06:50PM +0000, Mel Gorman wrote: > > On Thu, Mar 06, 2014 at 05:52:47PM -0500, Rik van Riel wrote: > > > On 03/06/2014 05:31 PM, Rik van Riel wrote: > > > >On Thu, 06 Mar 2014 16:12:28 -0500 > > > >Sasha Levin wrote: > > > > > > > >>While fuzzing with trinity inside a KVM tools guest running latest -next kernel I've hit the > > > >>following spew. This seems to be introduced by your patch "mm,numa: reorganize change_pmd_range()". > > > > > > > >That patch should not introduce any functional changes, except for > > > >the VM_BUG_ON that catches the fact that we fell through to the 4kB > > > >pte handling code, despite having just handled a THP pmd... > > > > > > > >Does this patch fix the issue? > > > > > > > >Mel, am I overlooking anything obvious? :) > > > > > > > >---8<--- > > > > > > > >Subject: mm,numa,mprotect: always continue after finding a stable thp page > > > > > > > >When turning a thp pmds into a NUMA one, change_huge_pmd will > > > >return 0 when the pmd already is a NUMA pmd. > > > > > > I did miss something obvious. In this case, the code returns 1. > > > > > > >However, change_pmd_range would fall through to the code that > > > >handles 4kB pages, instead of continuing on to the next pmd. > > > > > > Maybe the case that I missed is when khugepaged is in the > > > process of collapsing pages into a transparent huge page? > > > > > > If the virtual CPU gets de-scheduled by the host for long > > > enough, it would be possible for khugepaged to run on > > > another virtual CPU, and turn the pmd into a THP pmd, > > > before that VM_BUG_ON test. > > > > > > I see that khugepaged takes the mmap_sem for writing in the > > > collapse code, and it looks like task_numa_work takes the > > > mmap_sem for reading, so I guess that may not be possible? > > > > > > > mmap_sem will prevent a parallel collapse but what prevents something > > like the following? > > > > do_huge_pmd_wp_page > > change_pmd_range > > if (!pmd_trans_huge(*pmd) && pmd_none_or_clear_bad(pmd)) > > continue; > > pmdp_clear_flush(vma, haddr, pmd); > > if (pmd_trans_huge(*pmd)) { > > .... path not taken .... > > } > > page_add_new_anon_rmap(new_page, vma, haddr); > > set_pmd_at(mm, haddr, pmd, entry); > > VM_BUG_ON(pmd_trans_huge(*pmd)); > > > > We do not hold the page table lock during the pmd_trans_huge check and we > > do not recheck it under PTF lock in change_pte_range() > > > > This is a completely untested prototype. It rechecks pmd_trans_huge > under the lock and falls through if it hit a parallel split. It's not > perfect because it could decide to fall through just because there was > no prot_numa work to do but it's for illustration purposes. Secondly, > I noted that you are calling invalidate for every pmd range. Is that not > a lot of invalidations? We could do the same by just tracking the address > of the first invalidation. > And there were other minor issues. This is still untested but Sasha, can you try it out please? I discussed this with Rik on IRC for a bit and reckon this should be sufficient if the correct race has been identified. The race can only really happen for prot_numa updates and it's ok to bail on those updates if a race occurs because all we miss is a few hinting faults. That simplifies the patch considerably but throw in some comments to explain it diff --git a/mm/mprotect.c b/mm/mprotect.c index 2afc40e..72061a2 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -46,6 +46,17 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, unsigned long pages = 0; pte = pte_offset_map_lock(mm, pmd, addr, &ptl); + + /* + * For a prot_numa update we only hold mmap_sem for read so there is a + * potential race with faulting where a pmd was temporarily none so + * recheck it under the lock and bail if we race + */ + if (prot_numa && unlikely(pmd_trans_huge(*pmd))) { + pte_unmap_unlock(pte, ptl); + return 0; + } + arch_enter_lazy_mmu_mode(); do { oldpte = *pte; @@ -141,12 +152,13 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma, pages += HPAGE_PMD_NR; nr_huge_updates++; } + + /* huge pmd was handled */ continue; } } /* fall through, the trans huge pmd just split */ } - VM_BUG_ON(pmd_trans_huge(*pmd)); this_pages = change_pte_range(vma, pmd, addr, next, newprot, dirty_accountable, prot_numa); pages += this_pages; From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753543AbaCKQ2u (ORCPT ); Tue, 11 Mar 2014 12:28:50 -0400 Received: from cantor2.suse.de ([195.135.220.15]:58857 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752669AbaCKQ2t (ORCPT ); Tue, 11 Mar 2014 12:28:49 -0400 Date: Tue, 11 Mar 2014 16:28:45 +0000 From: Mel Gorman To: Rik van Riel , Sasha Levin Cc: David Rientjes , Andrew Morton , "linux-mm@kvack.org" , LKML , hhuang@redhat.com, knoel@redhat.com, aarcange@redhat.com Subject: Re: [PATCH -mm] mm,numa,mprotect: always continue after finding a stable thp page Message-ID: <20140311162845.GA30604@suse.de> References: <5318E4BC.50301@oracle.com> <20140306173137.6a23a0b2@cuia.bos.redhat.com> <5318FC3F.4080204@redhat.com> <20140307140650.GA1931@suse.de> <20140307150923.GB1931@suse.de> <20140307182745.GD1931@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20140307182745.GD1931@suse.de> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Mar 07, 2014 at 06:27:45PM +0000, Mel Gorman wrote: > > This is a completely untested prototype. It rechecks pmd_trans_huge > > under the lock and falls through if it hit a parallel split. It's not > > perfect because it could decide to fall through just because there was > > no prot_numa work to do but it's for illustration purposes. Secondly, > > I noted that you are calling invalidate for every pmd range. Is that not > > a lot of invalidations? We could do the same by just tracking the address > > of the first invalidation. > > > > And there were other minor issues. This is still untested but Sasha, > can you try it out please? I discussed this with Rik on IRC for a bit and > reckon this should be sufficient if the correct race has been identified. > Any luck with this patch Sasha? It passed basic tests here but I had not seen the issue trigger either. -- Mel Gorman SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754429AbaCKQvp (ORCPT ); Tue, 11 Mar 2014 12:51:45 -0400 Received: from aserp1040.oracle.com ([141.146.126.69]:28497 "EHLO aserp1040.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753924AbaCKQvo (ORCPT ); Tue, 11 Mar 2014 12:51:44 -0400 Message-ID: <531F3F15.8050206@oracle.com> Date: Tue, 11 Mar 2014 12:51:33 -0400 From: Sasha Levin User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.2.0 MIME-Version: 1.0 To: Mel Gorman , Rik van Riel CC: David Rientjes , Andrew Morton , "linux-mm@kvack.org" , LKML , hhuang@redhat.com, knoel@redhat.com, aarcange@redhat.com Subject: Re: [PATCH -mm] mm,numa,mprotect: always continue after finding a stable thp page References: <5318E4BC.50301@oracle.com> <20140306173137.6a23a0b2@cuia.bos.redhat.com> <5318FC3F.4080204@redhat.com> <20140307140650.GA1931@suse.de> <20140307150923.GB1931@suse.de> <20140307182745.GD1931@suse.de> <20140311162845.GA30604@suse.de> In-Reply-To: <20140311162845.GA30604@suse.de> Content-Type: text/plain; charset=ISO-8859-15; format=flowed Content-Transfer-Encoding: 7bit X-Source-IP: ucsinet21.oracle.com [156.151.31.93] Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 03/11/2014 12:28 PM, Mel Gorman wrote: > On Fri, Mar 07, 2014 at 06:27:45PM +0000, Mel Gorman wrote: >>> This is a completely untested prototype. It rechecks pmd_trans_huge >>> under the lock and falls through if it hit a parallel split. It's not >>> perfect because it could decide to fall through just because there was >>> no prot_numa work to do but it's for illustration purposes. Secondly, >>> I noted that you are calling invalidate for every pmd range. Is that not >>> a lot of invalidations? We could do the same by just tracking the address >>> of the first invalidation. >>> >> >> And there were other minor issues. This is still untested but Sasha, >> can you try it out please? I discussed this with Rik on IRC for a bit and >> reckon this should be sufficient if the correct race has been identified. >> > > Any luck with this patch Sasha? It passed basic tests here but I had not > seen the issue trigger either. > Sorry, I've been stuck in my weekend project of getting lockdep to work with page locks :) It takes a moment to test, so just to be sure - I should have only this last patch applied? Without the one in the original mail? Thanks, Sasha From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754961AbaCKRWF (ORCPT ); Tue, 11 Mar 2014 13:22:05 -0400 Received: from mx1.redhat.com ([209.132.183.28]:30300 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752395AbaCKRWD (ORCPT ); Tue, 11 Mar 2014 13:22:03 -0400 Message-ID: <531F4128.8020109@redhat.com> Date: Tue, 11 Mar 2014 13:00:24 -0400 From: Rik van Riel User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.1.0 MIME-Version: 1.0 To: Sasha Levin , Mel Gorman CC: David Rientjes , Andrew Morton , "linux-mm@kvack.org" , LKML , hhuang@redhat.com, knoel@redhat.com, aarcange@redhat.com Subject: Re: [PATCH -mm] mm,numa,mprotect: always continue after finding a stable thp page References: <5318E4BC.50301@oracle.com> <20140306173137.6a23a0b2@cuia.bos.redhat.com> <5318FC3F.4080204@redhat.com> <20140307140650.GA1931@suse.de> <20140307150923.GB1931@suse.de> <20140307182745.GD1931@suse.de> <20140311162845.GA30604@suse.de> <531F3F15.8050206@oracle.com> In-Reply-To: <531F3F15.8050206@oracle.com> Content-Type: text/plain; charset=ISO-8859-15; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 03/11/2014 12:51 PM, Sasha Levin wrote: > On 03/11/2014 12:28 PM, Mel Gorman wrote: >> On Fri, Mar 07, 2014 at 06:27:45PM +0000, Mel Gorman wrote: >>>> This is a completely untested prototype. It rechecks pmd_trans_huge >>>> under the lock and falls through if it hit a parallel split. It's not >>>> perfect because it could decide to fall through just because there was >>>> no prot_numa work to do but it's for illustration purposes. Secondly, >>>> I noted that you are calling invalidate for every pmd range. Is that >>>> not >>>> a lot of invalidations? We could do the same by just tracking the >>>> address >>>> of the first invalidation. >>>> >>> >>> And there were other minor issues. This is still untested but Sasha, >>> can you try it out please? I discussed this with Rik on IRC for a bit >>> and >>> reckon this should be sufficient if the correct race has been >>> identified. >>> >> >> Any luck with this patch Sasha? It passed basic tests here but I had not >> seen the issue trigger either. >> > > Sorry, I've been stuck in my weekend project of getting lockdep to work > with page locks :) > > It takes a moment to test, so just to be sure - I should have only this > last patch applied? > Without the one in the original mail? Indeed, only this patch should do it. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753563AbaCKRdQ (ORCPT ); Tue, 11 Mar 2014 13:33:16 -0400 Received: from userp1040.oracle.com ([156.151.31.81]:31874 "EHLO userp1040.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752056AbaCKRdO (ORCPT ); Tue, 11 Mar 2014 13:33:14 -0400 Message-ID: <531F48CC.303@oracle.com> Date: Tue, 11 Mar 2014 13:33:00 -0400 From: Sasha Levin User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.2.0 MIME-Version: 1.0 To: Rik van Riel , Mel Gorman CC: David Rientjes , Andrew Morton , "linux-mm@kvack.org" , LKML , hhuang@redhat.com, knoel@redhat.com, aarcange@redhat.com Subject: Re: [PATCH -mm] mm,numa,mprotect: always continue after finding a stable thp page References: <5318E4BC.50301@oracle.com> <20140306173137.6a23a0b2@cuia.bos.redhat.com> <5318FC3F.4080204@redhat.com> <20140307140650.GA1931@suse.de> <20140307150923.GB1931@suse.de> <20140307182745.GD1931@suse.de> <20140311162845.GA30604@suse.de> <531F3F15.8050206@oracle.com> <531F4128.8020109@redhat.com> In-Reply-To: <531F4128.8020109@redhat.com> Content-Type: text/plain; charset=ISO-8859-15; format=flowed Content-Transfer-Encoding: 7bit X-Source-IP: ucsinet21.oracle.com [156.151.31.93] Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 03/11/2014 01:00 PM, Rik van Riel wrote: > On 03/11/2014 12:51 PM, Sasha Levin wrote: >> On 03/11/2014 12:28 PM, Mel Gorman wrote: >>> On Fri, Mar 07, 2014 at 06:27:45PM +0000, Mel Gorman wrote: >>>>> This is a completely untested prototype. It rechecks pmd_trans_huge >>>>> under the lock and falls through if it hit a parallel split. It's not >>>>> perfect because it could decide to fall through just because there was >>>>> no prot_numa work to do but it's for illustration purposes. Secondly, >>>>> I noted that you are calling invalidate for every pmd range. Is that >>>>> not >>>>> a lot of invalidations? We could do the same by just tracking the >>>>> address >>>>> of the first invalidation. >>>>> >>>> >>>> And there were other minor issues. This is still untested but Sasha, >>>> can you try it out please? I discussed this with Rik on IRC for a bit >>>> and >>>> reckon this should be sufficient if the correct race has been >>>> identified. >>>> >>> >>> Any luck with this patch Sasha? It passed basic tests here but I had not >>> seen the issue trigger either. >>> >> >> Sorry, I've been stuck in my weekend project of getting lockdep to work >> with page locks :) >> >> It takes a moment to test, so just to be sure - I should have only this >> last patch applied? >> Without the one in the original mail? > > Indeed, only this patch should do it. Okay. So just this patch on top of the latest -next shows the following issues: 1. BUG in task_numa_work: [ 439.417171] BUG: unable to handle kernel paging request at ffff880e17530c00 [ 439.418216] IP: [] vmacache_find+0x75/0xa0 [ 439.419073] PGD 8904067 PUD 1028fcb067 PMD 1028f10067 PTE 8000000e17530060 [ 439.420340] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC [ 439.420340] Dumping ftrace buffer: [ 439.420340] (ftrace buffer empty) [ 439.420340] Modules linked in: [ 439.420340] CPU: 12 PID: 9937 Comm: trinity-c212 Tainted: G W 3.14.0-rc5-next-20140307-sasha-00009-g3b24300-dirty #137 [ 439.420340] task: ffff880e1a45b000 ti: ffff880e1a490000 task.ti: ffff880e1a490000 [ 439.420340] RIP: 0010:[] [] vmacache_find+0x75/0xa0 [ 439.420340] RSP: 0018:ffff880e1a491e68 EFLAGS: 00010286 [ 439.420340] RAX: ffff880e17530c00 RBX: 0000000000000000 RCX: 0000000000000001 [ 439.420340] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff880e1a45b000 [ 439.420340] RBP: ffff880e1a491e68 R08: 0000000000000000 R09: 0000000000000000 [ 439.420340] R10: 0000000000000001 R11: 0000000000000000 R12: ffff880e1ab75000 [ 439.420340] R13: ffff880e1ab75000 R14: 0000000000010000 R15: 0000000000000000 [ 439.420340] FS: 00007f3458c05700(0000) GS:ffff880d2b800000(0000) knlGS:0000000000000000 [ 439.420340] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 439.420340] CR2: ffff880e17530c00 CR3: 0000000e1a472000 CR4: 00000000000006a0 [ 439.420340] Stack: [ 439.420340] ffff880e1a491e98 ffffffff812a7610 ffffffff8118de40 0000000000000117 [ 439.420340] 00000001000036da ffff880e1a45b000 ffff880e1a491ef8 ffffffff8118de4b [ 439.420340] ffff880e1a491ec8 ffffffff81269575 ffff880e1ab750a8 ffff880e1a45b000 [ 439.420340] Call Trace: [ 439.420340] [] find_vma+0x20/0x90 [ 439.420340] [] ? task_numa_work+0x130/0x360 [ 439.420340] [] task_numa_work+0x13b/0x360 [ 439.420340] [] ? context_tracking_user_exit+0x195/0x1d0 [ 439.420340] [] task_work_run+0xae/0xf0 [ 439.420340] [] do_notify_resume+0x8e/0xe0 [ 439.420340] [] int_signal+0x12/0x17 [ 439.420340] Code: 42 10 00 00 00 00 48 c7 42 18 00 00 00 00 eb 38 66 0f 1f 44 00 00 31 d2 48 89 c7 48 63 ca 48 8b 84 cf b8 07 00 00 48 85 c0 74 0b <48> 39 30 77 06 48 3b 70 08 72 12 ff c2 83 fa 04 75 de 66 0f 1f [ 439.420340] RIP [] vmacache_find+0x75/0xa0 [ 439.420340] RSP [ 439.420340] CR2: ffff880e17530c00 2. Similar to the above, but with a different trace: [ 304.212158] BUG: unable to handle kernel paging request at ffff88020d37f800 [ 304.220420] IP: [] vmacache_find+0x75/0xa0 [ 304.220420] PGD 8904067 PUD 102effb067 PMD 102ef91067 PTE 800000020d37f060 [ 304.220420] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC [ 304.220420] Dumping ftrace buffer: [ 304.220420] (ftrace buffer empty) [ 304.220420] Modules linked in: [ 304.220420] CPU: 30 PID: 11157 Comm: trinity-c393 Tainted: G B W 3.14.0-rc5-next-20140307-sasha-00009-g3b24300-dirty #137 [ 304.220420] task: ffff881020a90000 ti: ffff88101fda2000 task.ti: ffff88101fda2000 [ 304.260466] RIP: 0010:[] [] vmacache_find+0x75/0xa0 [ 304.260466] RSP: 0000:ffff88101fda3da8 EFLAGS: 00010286 [ 304.260466] RAX: ffff88020d37f800 RBX: 00007f7f92f680fc RCX: 0000000000000000 [ 304.260466] RDX: 0000000000000000 RSI: 00007f7f92f680fc RDI: ffff881020a90000 [ 304.260466] RBP: ffff88101fda3da8 R08: 0000000000000001 R09: 0000000000000000 [ 304.260466] R10: 0000000000000000 R11: 0000000000000000 R12: ffff881021358000 [ 304.260466] R13: 0000000000000000 R14: ffff8810213580a8 R15: ffff881021358000 [ 304.260466] FS: 00007f7f92f8d700(0000) GS:ffff880f2ba00000(0000) knlGS:0000000000000000 [ 304.260466] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 304.260466] CR2: ffff88020d37f800 CR3: 000000101fd6b000 CR4: 00000000000006a0 [ 304.260466] Stack: [ 304.260466] ffff88101fda3dd8 ffffffff812a7610 ffffffff844abd82 00007f7f92f680fc [ 304.260466] 00007f7f92f680fc 00000000000000a8 ffff88101fda3ef8 ffffffff844abdd6 [ 304.260466] ffff88101fda3e08 ffffffff811a60d6 0000000000000000 ffff881020a90000 [ 304.260466] Call Trace: [ 304.260466] [] find_vma+0x20/0x90 [ 304.260466] [] ? __do_page_fault+0x302/0x5d0 [ 304.260466] [] __do_page_fault+0x356/0x5d0 [ 304.260466] [] ? trace_hardirqs_off_caller+0x16/0x1a0 [ 304.260466] [] ? vtime_account_user+0x96/0xb0 [ 304.260466] [] ? preempt_count_sub+0xe2/0x120 [ 304.260466] [] ? context_tracking_user_exit+0x187/0x1d0 [ 304.260466] [] ? trace_hardirqs_off_caller+0x16/0x1a0 [ 304.260466] [] do_page_fault+0x45/0x70 [ 304.260466] [] do_async_page_fault+0x36/0x100 [ 304.260466] [] async_page_fault+0x28/0x30 [ 304.260466] Code: 42 10 00 00 00 00 48 c7 42 18 00 00 00 00 eb 38 66 0f 1f 44 00 00 31 d2 48 89 c7 48 63 ca 48 8b 84 cf b8 07 00 00 48 85 c0 74 0b <48> 39 30 77 06 48 3b 70 08 72 12 ff c2 83 fa 04 75 de 66 0f 1f [ 304.260466] RIP [] vmacache_find+0x75/0xa0 [ 304.260466] RSP [ 304.260466] CR2: ffff88020d37f800 3. This one is a new issue. Might be related to something else but it seems related as I've never saw it before this patch: [ 560.473342] kernel BUG at mm/mmap.c:439! [ 560.473766] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC [ 560.474723] Dumping ftrace buffer: [ 560.475410] (ftrace buffer empty) [ 560.476098] Modules linked in: [ 560.476241] CPU: 10 PID: 17037 Comm: trinity-c84 Tainted: G W 3.14.0-rc5-next-20140307-sasha-00009-g3b24300-dirty #137 [ 560.477651] task: ffff880424310000 ti: ffff8803eec4c000 task.ti: ffff8803eec4c000 [ 560.478595] RIP: 0010:[] [] validate_mm+0x115/0x140 [ 560.479749] RSP: 0018:ffff8803eec4de98 EFLAGS: 00010296 [ 560.480471] RAX: 0000000000000012 RBX: 00007fff8ae88000 RCX: 0000000000000006 [ 560.481490] RDX: 0000000000000006 RSI: ffff880424310cf0 RDI: 0000000000000282 [ 560.481490] RBP: ffff8803eec4dec8 R08: 0000000000000001 R09: 0000000000000001 [ 560.481490] R10: 0000000000000001 R11: 0000000000000001 R12: ffff8803f61aa000 [ 560.481490] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000048 [ 560.481490] FS: 00007f799ddf6700(0000) GS:ffff880b2b800000(0000) knlGS:0000000000000000 [ 560.481490] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 560.481490] CR2: 0000000004502048 CR3: 00000003f8bcb000 CR4: 00000000000006a0 [ 560.481490] Stack: [ 560.481490] ffff8803eec4dec8 0000000000000000 ffff8803f61aa000 ffff8800a22fae00 [ 560.481490] 0000000000000000 00007f799d4dc000 ffff8803eec4df28 ffffffff812a8fb7 [ 560.481490] ffffffff00000001 ffff880b21453600 ffff8800a22fae10 ffff8803f61aa008 [ 560.481490] Call Trace: [ 560.481490] [] do_munmap+0x307/0x360 [ 560.493775] [] ? vm_munmap+0x46/0x80 [ 560.493775] [] vm_munmap+0x54/0x80 [ 560.493775] [] SyS_munmap+0x2c/0x40 [ 560.493775] [] tracesys+0xdd/0xe2 [ 560.493775] Code: 32 fd ff ff 41 8b 74 24 58 39 f0 74 19 89 c2 48 c7 c7 19 f5 6d 85 31 c0 e8 45 8c 1f 03 eb 0c 0f 1f 80 00 00 00 00 45 85 ed 74 0d <0f> 0b 66 0f 1f 84 00 00 00 00 00 eb fe 48 83 c4 08 5b 41 5c 41 [ 560.493775] RIP [] validate_mm+0x115/0x140 [ 560.493775] RSP Thanks, Sasha From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754915AbaCKSG6 (ORCPT ); Tue, 11 Mar 2014 14:06:58 -0400 Received: from cantor2.suse.de ([195.135.220.15]:60842 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753314AbaCKSG5 (ORCPT ); Tue, 11 Mar 2014 14:06:57 -0400 Date: Tue, 11 Mar 2014 18:06:52 +0000 From: Mel Gorman To: Sasha Levin Cc: Rik van Riel , David Rientjes , Andrew Morton , "linux-mm@kvack.org" , LKML , hhuang@redhat.com, knoel@redhat.com, aarcange@redhat.com Subject: Re: [PATCH -mm] mm,numa,mprotect: always continue after finding a stable thp page Message-ID: <20140311180652.GM10663@suse.de> References: <5318E4BC.50301@oracle.com> <20140306173137.6a23a0b2@cuia.bos.redhat.com> <5318FC3F.4080204@redhat.com> <20140307140650.GA1931@suse.de> <20140307150923.GB1931@suse.de> <20140307182745.GD1931@suse.de> <20140311162845.GA30604@suse.de> <531F3F15.8050206@oracle.com> <531F4128.8020109@redhat.com> <531F48CC.303@oracle.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <531F48CC.303@oracle.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Mar 11, 2014 at 01:33:00PM -0400, Sasha Levin wrote: > Okay. So just this patch on top of the latest -next shows the following issues: > > 1. BUG in task_numa_work: > > [ 439.417171] BUG: unable to handle kernel paging request at ffff880e17530c00 > [ 439.418216] IP: [] vmacache_find+0x75/0xa0 > [ 439.419073] PGD 8904067 PUD 1028fcb067 PMD 1028f10067 PTE 8000000e17530060 > [ 439.420340] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC > [ 439.420340] Dumping ftrace buffer: > [ 439.420340] (ftrace buffer empty) > [ 439.420340] Modules linked in: > [ 439.420340] CPU: 12 PID: 9937 Comm: trinity-c212 Tainted: G W 3.14.0-rc5-next-20140307-sasha-00009-g3b24300-dirty #137 > [ 439.420340] task: ffff880e1a45b000 ti: ffff880e1a490000 task.ti: ffff880e1a490000 > [ 439.420340] RIP: 0010:[] [] vmacache_find+0x75/0xa0 > [ 439.420340] RSP: 0018:ffff880e1a491e68 EFLAGS: 00010286 > [ 439.420340] RAX: ffff880e17530c00 RBX: 0000000000000000 RCX: 0000000000000001 > [ 439.420340] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff880e1a45b000 > [ 439.420340] RBP: ffff880e1a491e68 R08: 0000000000000000 R09: 0000000000000000 > [ 439.420340] R10: 0000000000000001 R11: 0000000000000000 R12: ffff880e1ab75000 > [ 439.420340] R13: ffff880e1ab75000 R14: 0000000000010000 R15: 0000000000000000 > [ 439.420340] FS: 00007f3458c05700(0000) GS:ffff880d2b800000(0000) knlGS:0000000000000000 > [ 439.420340] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b > [ 439.420340] CR2: ffff880e17530c00 CR3: 0000000e1a472000 CR4: 00000000000006a0 > [ 439.420340] Stack: > [ 439.420340] ffff880e1a491e98 ffffffff812a7610 ffffffff8118de40 0000000000000117 > [ 439.420340] 00000001000036da ffff880e1a45b000 ffff880e1a491ef8 ffffffff8118de4b > [ 439.420340] ffff880e1a491ec8 ffffffff81269575 ffff880e1ab750a8 ffff880e1a45b000 > [ 439.420340] Call Trace: > [ 439.420340] [] find_vma+0x20/0x90 > [ 439.420340] [] ? task_numa_work+0x130/0x360 > [ 439.420340] [] task_numa_work+0x13b/0x360 > [ 439.420340] [] ? context_tracking_user_exit+0x195/0x1d0 > [ 439.420340] [] task_work_run+0xae/0xf0 > [ 439.420340] [] do_notify_resume+0x8e/0xe0 > [ 439.420340] [] int_signal+0x12/0x17 > [ 439.420340] Code: 42 10 00 00 00 00 48 c7 42 18 00 00 00 00 eb 38 66 0f 1f 44 00 00 31 d2 48 89 c7 48 63 ca 48 8b 84 cf b8 07 00 00 48 85 c0 74 0b <48> 39 30 77 06 48 3b 70 08 72 12 ff c2 83 fa 04 75 de 66 0f 1f > [ 439.420340] RIP [] vmacache_find+0x75/0xa0 > [ 439.420340] RSP > [ 439.420340] CR2: ffff880e17530c00 > Ok, this does not look related. It looks like damage from the VMA caching patches, possibly a use-after free. I'm skeptical that it's related to automatic NUMA balancing as such based on the second trace you posted. What does addr2line -e vmlinux ffffffff81299385 say? I want to be sure it looks like a vma dereference without risking making a mistake decoding it. 1. Does this bug trigger even if automatic NUMA balancing is disabled? 2. Does this bug trigger if DEBUG_PAGEALLOC is disabled? If it's a use-after free then the bug would still be there. 3. Can you test with the following patches reverted please? e15d25d9c827b4346a36a3a78dd566d5ad353402 mm-per-thread-vma-caching-fix-fix e440e20dc76803cdab616b4756c201d5c72857f2 mm-per-thread-vma-caching-fix 0d9ad4220e6d73f63a9eeeaac031b92838f75bb3 mm: per-thread vma caching The last patch will not revert cleanly (least it didn't for me) but it was just a case of git rm the two affected files, remove any include of vmacache.h and commit the rest. Thanks! -- Mel Gorman SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755175AbaCKTMx (ORCPT ); Tue, 11 Mar 2014 15:12:53 -0400 Received: from mx1.redhat.com ([209.132.183.28]:14869 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753750AbaCKTMv (ORCPT ); Tue, 11 Mar 2014 15:12:51 -0400 Message-ID: <531F48C2.6010408@redhat.com> Date: Tue, 11 Mar 2014 13:32:50 -0400 From: Rik van Riel User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.1.0 MIME-Version: 1.0 To: Sasha Levin , Mel Gorman CC: David Rientjes , Andrew Morton , "linux-mm@kvack.org" , LKML , hhuang@redhat.com, knoel@redhat.com, aarcange@redhat.com Subject: Re: [PATCH -mm] mm,numa,mprotect: always continue after finding a stable thp page References: <5318E4BC.50301@oracle.com> <20140306173137.6a23a0b2@cuia.bos.redhat.com> <5318FC3F.4080204@redhat.com> <20140307140650.GA1931@suse.de> <20140307150923.GB1931@suse.de> <20140307182745.GD1931@suse.de> <20140311162845.GA30604@suse.de> <531F3F15.8050206@oracle.com> <531F4128.8020109@redhat.com> <531F48CC.303@oracle.com> In-Reply-To: <531F48CC.303@oracle.com> Content-Type: text/plain; charset=ISO-8859-15; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 03/11/2014 01:33 PM, Sasha Levin wrote: > On 03/11/2014 01:00 PM, Rik van Riel wrote: >> On 03/11/2014 12:51 PM, Sasha Levin wrote: >>> On 03/11/2014 12:28 PM, Mel Gorman wrote: >>>> On Fri, Mar 07, 2014 at 06:27:45PM +0000, Mel Gorman wrote: >>>>>> This is a completely untested prototype. It rechecks pmd_trans_huge >>>>>> under the lock and falls through if it hit a parallel split. It's not >>>>>> perfect because it could decide to fall through just because there >>>>>> was >>>>>> no prot_numa work to do but it's for illustration purposes. Secondly, >>>>>> I noted that you are calling invalidate for every pmd range. Is that >>>>>> not >>>>>> a lot of invalidations? We could do the same by just tracking the >>>>>> address >>>>>> of the first invalidation. >>>>>> >>>>> >>>>> And there were other minor issues. This is still untested but Sasha, >>>>> can you try it out please? I discussed this with Rik on IRC for a bit >>>>> and >>>>> reckon this should be sufficient if the correct race has been >>>>> identified. >>>>> >>>> >>>> Any luck with this patch Sasha? It passed basic tests here but I had >>>> not >>>> seen the issue trigger either. >>>> >>> >>> Sorry, I've been stuck in my weekend project of getting lockdep to work >>> with page locks :) >>> >>> It takes a moment to test, so just to be sure - I should have only this >>> last patch applied? >>> Without the one in the original mail? >> >> Indeed, only this patch should do it. > > Okay. So just this patch on top of the latest -next shows the following > issues: OK, those are all issues with Davidlohr Bueso's per-thread vma cache patch :) From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755258AbaCKTSP (ORCPT ); Tue, 11 Mar 2014 15:18:15 -0400 Received: from userp1040.oracle.com ([156.151.31.81]:24986 "EHLO userp1040.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752478AbaCKTSN (ORCPT ); Tue, 11 Mar 2014 15:18:13 -0400 Message-ID: <531F616A.7060300@oracle.com> Date: Tue, 11 Mar 2014 15:18:02 -0400 From: Sasha Levin User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.2.0 MIME-Version: 1.0 To: Mel Gorman CC: Rik van Riel , David Rientjes , Andrew Morton , "linux-mm@kvack.org" , LKML , hhuang@redhat.com, knoel@redhat.com, aarcange@redhat.com Subject: Re: [PATCH -mm] mm,numa,mprotect: always continue after finding a stable thp page References: <5318E4BC.50301@oracle.com> <20140306173137.6a23a0b2@cuia.bos.redhat.com> <5318FC3F.4080204@redhat.com> <20140307140650.GA1931@suse.de> <20140307150923.GB1931@suse.de> <20140307182745.GD1931@suse.de> <20140311162845.GA30604@suse.de> <531F3F15.8050206@oracle.com> <531F4128.8020109@redhat.com> <531F48CC.303@oracle.com> <20140311180652.GM10663@suse.de> In-Reply-To: <20140311180652.GM10663@suse.de> Content-Type: text/plain; charset=ISO-8859-15; format=flowed Content-Transfer-Encoding: 7bit X-Source-IP: ucsinet22.oracle.com [156.151.31.94] Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 03/11/2014 02:06 PM, Mel Gorman wrote: > On Tue, Mar 11, 2014 at 01:33:00PM -0400, Sasha Levin wrote: >> Okay. So just this patch on top of the latest -next shows the following issues: >> >> 1. BUG in task_numa_work: >> >> [ 439.417171] BUG: unable to handle kernel paging request at ffff880e17530c00 >> [ 439.418216] IP: [] vmacache_find+0x75/0xa0 >> [ 439.419073] PGD 8904067 PUD 1028fcb067 PMD 1028f10067 PTE 8000000e17530060 >> [ 439.420340] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC >> [ 439.420340] Dumping ftrace buffer: >> [ 439.420340] (ftrace buffer empty) >> [ 439.420340] Modules linked in: >> [ 439.420340] CPU: 12 PID: 9937 Comm: trinity-c212 Tainted: G W 3.14.0-rc5-next-20140307-sasha-00009-g3b24300-dirty #137 >> [ 439.420340] task: ffff880e1a45b000 ti: ffff880e1a490000 task.ti: ffff880e1a490000 >> [ 439.420340] RIP: 0010:[] [] vmacache_find+0x75/0xa0 >> [ 439.420340] RSP: 0018:ffff880e1a491e68 EFLAGS: 00010286 >> [ 439.420340] RAX: ffff880e17530c00 RBX: 0000000000000000 RCX: 0000000000000001 >> [ 439.420340] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff880e1a45b000 >> [ 439.420340] RBP: ffff880e1a491e68 R08: 0000000000000000 R09: 0000000000000000 >> [ 439.420340] R10: 0000000000000001 R11: 0000000000000000 R12: ffff880e1ab75000 >> [ 439.420340] R13: ffff880e1ab75000 R14: 0000000000010000 R15: 0000000000000000 >> [ 439.420340] FS: 00007f3458c05700(0000) GS:ffff880d2b800000(0000) knlGS:0000000000000000 >> [ 439.420340] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b >> [ 439.420340] CR2: ffff880e17530c00 CR3: 0000000e1a472000 CR4: 00000000000006a0 >> [ 439.420340] Stack: >> [ 439.420340] ffff880e1a491e98 ffffffff812a7610 ffffffff8118de40 0000000000000117 >> [ 439.420340] 00000001000036da ffff880e1a45b000 ffff880e1a491ef8 ffffffff8118de4b >> [ 439.420340] ffff880e1a491ec8 ffffffff81269575 ffff880e1ab750a8 ffff880e1a45b000 >> [ 439.420340] Call Trace: >> [ 439.420340] [] find_vma+0x20/0x90 >> [ 439.420340] [] ? task_numa_work+0x130/0x360 >> [ 439.420340] [] task_numa_work+0x13b/0x360 >> [ 439.420340] [] ? context_tracking_user_exit+0x195/0x1d0 >> [ 439.420340] [] task_work_run+0xae/0xf0 >> [ 439.420340] [] do_notify_resume+0x8e/0xe0 >> [ 439.420340] [] int_signal+0x12/0x17 >> [ 439.420340] Code: 42 10 00 00 00 00 48 c7 42 18 00 00 00 00 eb 38 66 0f 1f 44 00 00 31 d2 48 89 c7 48 63 ca 48 8b 84 cf b8 07 00 00 48 85 c0 74 0b <48> 39 30 77 06 48 3b 70 08 72 12 ff c2 83 fa 04 75 de 66 0f 1f >> [ 439.420340] RIP [] vmacache_find+0x75/0xa0 >> [ 439.420340] RSP >> [ 439.420340] CR2: ffff880e17530c00 >> > > Ok, this does not look related. It looks like damage from the VMA caching > patches, possibly a use-after free. I'm skeptical that it's related to > automatic NUMA balancing as such based on the second trace you posted. It's a bit weird because right after applying the patch I've hit this trace twice while never seeing it previously. But it's very possible since there are quite a lot of different mm issues around right now. > What does addr2line -e vmlinux ffffffff81299385 say? I want to be sure > it looks like a vma dereference without risking making a mistake > decoding it. mm/vmacache.c:75 Or: if (vma && vma->vm_start <= addr && vma->vm_end > addr) > 1. Does this bug trigger even if automatic NUMA balancing is disabled? (assuming we're talking about kernel BUG at mm/mmap.c:439!) Yes, same result with numa_balancing=disable. > 2. Does this bug trigger if DEBUG_PAGEALLOC is disabled? If it's a > use-after free then the bug would still be there. Still there. > 3. Can you test with the following patches reverted please? > > e15d25d9c827b4346a36a3a78dd566d5ad353402 mm-per-thread-vma-caching-fix-fix > e440e20dc76803cdab616b4756c201d5c72857f2 mm-per-thread-vma-caching-fix > 0d9ad4220e6d73f63a9eeeaac031b92838f75bb3 mm: per-thread vma caching > > The last patch will not revert cleanly (least it didn't for me) but it > was just a case of git rm the two affected files, remove any include of > vmacache.h and commit the rest. Don't see the issues I've reported before now. Thanks, Sasha From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754355AbaCKT3C (ORCPT ); Tue, 11 Mar 2014 15:29:02 -0400 Received: from mail.linuxfoundation.org ([140.211.169.12]:44614 "EHLO mail.linuxfoundation.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752919AbaCKT3A (ORCPT ); Tue, 11 Mar 2014 15:29:00 -0400 Date: Tue, 11 Mar 2014 12:28:59 -0700 From: Andrew Morton To: Sasha Levin Cc: Mel Gorman , Rik van Riel , David Rientjes , "linux-mm@kvack.org" , LKML , hhuang@redhat.com, knoel@redhat.com, aarcange@redhat.com, Davidlohr Bueso Subject: Re: [PATCH -mm] mm,numa,mprotect: always continue after finding a stable thp page Message-Id: <20140311122859.fb6c1e772d82d9f4edd02f52@linux-foundation.org> In-Reply-To: <531F616A.7060300@oracle.com> References: <5318E4BC.50301@oracle.com> <20140306173137.6a23a0b2@cuia.bos.redhat.com> <5318FC3F.4080204@redhat.com> <20140307140650.GA1931@suse.de> <20140307150923.GB1931@suse.de> <20140307182745.GD1931@suse.de> <20140311162845.GA30604@suse.de> <531F3F15.8050206@oracle.com> <531F4128.8020109@redhat.com> <531F48CC.303@oracle.com> <20140311180652.GM10663@suse.de> <531F616A.7060300@oracle.com> X-Mailer: Sylpheed 3.2.0beta5 (GTK+ 2.24.10; x86_64-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, 11 Mar 2014 15:18:02 -0400 Sasha Levin wrote: > > 3. Can you test with the following patches reverted please? > > > > e15d25d9c827b4346a36a3a78dd566d5ad353402 mm-per-thread-vma-caching-fix-fix > > e440e20dc76803cdab616b4756c201d5c72857f2 mm-per-thread-vma-caching-fix > > 0d9ad4220e6d73f63a9eeeaac031b92838f75bb3 mm: per-thread vma caching > > > > The last patch will not revert cleanly (least it didn't for me) but it > > was just a case of git rm the two affected files, remove any include of > > vmacache.h and commit the rest. > > Don't see the issues I've reported before now. This is foggy. Do you mean that all the bugs went away when per-thread-vma-caching was reverted? From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755246AbaCKTbH (ORCPT ); Tue, 11 Mar 2014 15:31:07 -0400 Received: from g4t3427.houston.hp.com ([15.201.208.55]:43301 "EHLO g4t3427.houston.hp.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755126AbaCKTbF (ORCPT ); Tue, 11 Mar 2014 15:31:05 -0400 Message-ID: <1394566262.2786.23.camel@buesod1.americas.hpqcorp.net> Subject: Re: [PATCH -mm] mm,numa,mprotect: always continue after finding a stable thp page From: Davidlohr Bueso To: Mel Gorman Cc: Sasha Levin , Rik van Riel , David Rientjes , Andrew Morton , "linux-mm@kvack.org" , LKML , hhuang@redhat.com, knoel@redhat.com, aarcange@redhat.com Date: Tue, 11 Mar 2014 12:31:02 -0700 In-Reply-To: <20140311180652.GM10663@suse.de> References: <5318E4BC.50301@oracle.com> <20140306173137.6a23a0b2@cuia.bos.redhat.com> <5318FC3F.4080204@redhat.com> <20140307140650.GA1931@suse.de> <20140307150923.GB1931@suse.de> <20140307182745.GD1931@suse.de> <20140311162845.GA30604@suse.de> <531F3F15.8050206@oracle.com> <531F4128.8020109@redhat.com> <531F48CC.303@oracle.com> <20140311180652.GM10663@suse.de> Content-Type: text/plain; charset="UTF-8" X-Mailer: Evolution 3.6.4 (3.6.4-3.fc18) Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, 2014-03-11 at 18:06 +0000, Mel Gorman wrote: > On Tue, Mar 11, 2014 at 01:33:00PM -0400, Sasha Levin wrote: > > Okay. So just this patch on top of the latest -next shows the following issues: > > > > 1. BUG in task_numa_work: > > > > [ 439.417171] BUG: unable to handle kernel paging request at ffff880e17530c00 > > [ 439.418216] IP: [] vmacache_find+0x75/0xa0 > > [ 439.419073] PGD 8904067 PUD 1028fcb067 PMD 1028f10067 PTE 8000000e17530060 > > [ 439.420340] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC > > [ 439.420340] Dumping ftrace buffer: > > [ 439.420340] (ftrace buffer empty) > > [ 439.420340] Modules linked in: > > [ 439.420340] CPU: 12 PID: 9937 Comm: trinity-c212 Tainted: G W 3.14.0-rc5-next-20140307-sasha-00009-g3b24300-dirty #137 > > [ 439.420340] task: ffff880e1a45b000 ti: ffff880e1a490000 task.ti: ffff880e1a490000 > > [ 439.420340] RIP: 0010:[] [] vmacache_find+0x75/0xa0 > > [ 439.420340] RSP: 0018:ffff880e1a491e68 EFLAGS: 00010286 > > [ 439.420340] RAX: ffff880e17530c00 RBX: 0000000000000000 RCX: 0000000000000001 > > [ 439.420340] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff880e1a45b000 > > [ 439.420340] RBP: ffff880e1a491e68 R08: 0000000000000000 R09: 0000000000000000 > > [ 439.420340] R10: 0000000000000001 R11: 0000000000000000 R12: ffff880e1ab75000 > > [ 439.420340] R13: ffff880e1ab75000 R14: 0000000000010000 R15: 0000000000000000 > > [ 439.420340] FS: 00007f3458c05700(0000) GS:ffff880d2b800000(0000) knlGS:0000000000000000 > > [ 439.420340] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b > > [ 439.420340] CR2: ffff880e17530c00 CR3: 0000000e1a472000 CR4: 00000000000006a0 > > [ 439.420340] Stack: > > [ 439.420340] ffff880e1a491e98 ffffffff812a7610 ffffffff8118de40 0000000000000117 > > [ 439.420340] 00000001000036da ffff880e1a45b000 ffff880e1a491ef8 ffffffff8118de4b > > [ 439.420340] ffff880e1a491ec8 ffffffff81269575 ffff880e1ab750a8 ffff880e1a45b000 > > [ 439.420340] Call Trace: > > [ 439.420340] [] find_vma+0x20/0x90 > > [ 439.420340] [] ? task_numa_work+0x130/0x360 > > [ 439.420340] [] task_numa_work+0x13b/0x360 > > [ 439.420340] [] ? context_tracking_user_exit+0x195/0x1d0 > > [ 439.420340] [] task_work_run+0xae/0xf0 > > [ 439.420340] [] do_notify_resume+0x8e/0xe0 > > [ 439.420340] [] int_signal+0x12/0x17 > > [ 439.420340] Code: 42 10 00 00 00 00 48 c7 42 18 00 00 00 00 eb 38 66 0f 1f 44 00 00 31 d2 48 89 c7 48 63 ca 48 8b 84 cf b8 07 00 00 48 85 c0 74 0b <48> 39 30 77 06 48 3b 70 08 72 12 ff c2 83 fa 04 75 de 66 0f 1f > > [ 439.420340] RIP [] vmacache_find+0x75/0xa0 > > [ 439.420340] RSP > > [ 439.420340] CR2: ffff880e17530c00 > > > > Ok, this does not look related. It looks like damage from the VMA caching > patches, possibly a use-after free. I'm skeptical that it's related to > automatic NUMA balancing as such based on the second trace you posted. Indeed, this issue is separate. It was reported a few days ago by Fengguang: http://lkml.org/lkml/2014/3/9/201 Apparently this has only been seen in DEBUG_PAGEALLOC + trinity scenarios, just like in Sasha's trace. I'm suspecting some invalidation is missing (not sure what DEBUG_PAGEALLOC enables differently in this aspect), as when we do the vmacache_find() the vma is bogus which triggers the paging request error. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755238AbaCKTeO (ORCPT ); Tue, 11 Mar 2014 15:34:14 -0400 Received: from userp1040.oracle.com ([156.151.31.81]:33552 "EHLO userp1040.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753782AbaCKTeM (ORCPT ); Tue, 11 Mar 2014 15:34:12 -0400 Message-ID: <531F6527.6010508@oracle.com> Date: Tue, 11 Mar 2014 15:33:59 -0400 From: Sasha Levin User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.2.0 MIME-Version: 1.0 To: Andrew Morton CC: Mel Gorman , Rik van Riel , David Rientjes , "linux-mm@kvack.org" , LKML , hhuang@redhat.com, knoel@redhat.com, aarcange@redhat.com, Davidlohr Bueso Subject: Re: [PATCH -mm] mm,numa,mprotect: always continue after finding a stable thp page References: <5318E4BC.50301@oracle.com> <20140306173137.6a23a0b2@cuia.bos.redhat.com> <5318FC3F.4080204@redhat.com> <20140307140650.GA1931@suse.de> <20140307150923.GB1931@suse.de> <20140307182745.GD1931@suse.de> <20140311162845.GA30604@suse.de> <531F3F15.8050206@oracle.com> <531F4128.8020109@redhat.com> <531F48CC.303@oracle.com> <20140311180652.GM10663@suse.de> <531F616A.7060300@oracle.com> <20140311122859.fb6c1e772d82d9f4edd02f52@linux-foundation.org> In-Reply-To: <20140311122859.fb6c1e772d82d9f4edd02f52@linux-foundation.org> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-Source-IP: ucsinet21.oracle.com [156.151.31.93] Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 03/11/2014 03:28 PM, Andrew Morton wrote: > On Tue, 11 Mar 2014 15:18:02 -0400 Sasha Levin wrote: > >>> 3. Can you test with the following patches reverted please? >>> >>> e15d25d9c827b4346a36a3a78dd566d5ad353402 mm-per-thread-vma-caching-fix-fix >>> e440e20dc76803cdab616b4756c201d5c72857f2 mm-per-thread-vma-caching-fix >>> 0d9ad4220e6d73f63a9eeeaac031b92838f75bb3 mm: per-thread vma caching >>> >>> The last patch will not revert cleanly (least it didn't for me) but it >>> was just a case of git rm the two affected files, remove any include of >>> vmacache.h and commit the rest. >> >> Don't see the issues I've reported before now. > > This is foggy. Do you mean that all the bugs went away when > per-thread-vma-caching was reverted? No, sorry, just the vmacache_find and the mm/mmap.c:439 BUGs. Thanks, Sasha From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753445AbaCLKgK (ORCPT ); Wed, 12 Mar 2014 06:36:10 -0400 Received: from cantor2.suse.de ([195.135.220.15]:42392 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752543AbaCLKgH (ORCPT ); Wed, 12 Mar 2014 06:36:07 -0400 Date: Wed, 12 Mar 2014 10:36:02 +0000 From: Mel Gorman To: Andrew Morton Cc: Sasha Levin , Rik van Riel , David Rientjes , hhuang@redhat.com, knoel@redhat.com, aarcange@redhat.com, Davidlohr Bueso , "linux-mm@kvack.org" , LKML Subject: [PATCH] mm: numa: Recheck for transhuge pages under lock during protection changes Message-ID: <20140312103602.GN10663@suse.de> References: <20140307140650.GA1931@suse.de> <20140307150923.GB1931@suse.de> <20140307182745.GD1931@suse.de> <20140311162845.GA30604@suse.de> <531F3F15.8050206@oracle.com> <531F4128.8020109@redhat.com> <531F48CC.303@oracle.com> <20140311180652.GM10663@suse.de> <531F616A.7060300@oracle.com> <20140311122859.fb6c1e772d82d9f4edd02f52@linux-foundation.org> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <20140311122859.fb6c1e772d82d9f4edd02f52@linux-foundation.org> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Andrew, this should go with the patches mmnuma-reorganize-change_pmd_range.patch mmnuma-reorganize-change_pmd_range-fix.patch move-mmu-notifier-call-from-change_protection-to-change_pmd_range.patch in mmotm please. Thanks. ---8<--- From: Mel Gorman Subject: [PATCH] mm: numa: Recheck for transhuge pages under lock during protection changes Sasha Levin reported the following bug using trinity [ 886.745765] kernel BUG at mm/mprotect.c:149! [ 886.746831] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC [ 886.748511] Dumping ftrace buffer: [ 886.749998] (ftrace buffer empty) [ 886.750110] Modules linked in: [ 886.751396] CPU: 20 PID: 26219 Comm: trinity-c216 Tainted: G W 3.14.0-rc5-next-20140305-sasha-00011-ge06f5f3-dirty #105 [ 886.751396] task: ffff8800b6c80000 ti: ffff880228436000 task.ti: ffff880228436000 [ 886.751396] RIP: 0010:[] [] change_protection_range+0x3b3/0x500 [ 886.751396] RSP: 0000:ffff880228437da8 EFLAGS: 00010282 [ 886.751396] RAX: 8000000527c008e5 RBX: 00007f647916e000 RCX: 0000000000000000 [ 886.751396] RDX: ffff8802ef488e40 RSI: 00007f6479000000 RDI: 8000000527c008e5 [ 886.751396] RBP: ffff880228437e78 R08: 0000000000000000 R09: 0000000000000000 [ 886.751396] R10: 0000000000000001 R11: 0000000000000000 R12: ffff8802ef488e40 [ 886.751396] R13: 00007f6479000000 R14: 00007f647916e000 R15: 00007f646e34e000 [ 886.751396] FS: 00007f64b28d4700(0000) GS:ffff88052ba00000(0000) knlGS:0000000000000000 [ 886.751396] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 886.751396] CR2: 00007f64aed83af8 CR3: 0000000206e52000 CR4: 00000000000006a0 [ 886.751396] Stack: [ 886.751396] ffff880200000001 ffffffff8447f152 ffff880228437dd8 ffff880228af3000 [ 886.769142] 00007f646e34e000 00007f647916dfff 0000000000000000 00007f647916e000 [ 886.769142] ffff880206e527f0 00007f647916dfff 0000000000000000 0000000000000000 [ 886.769142] Call Trace: [ 886.769142] [] ? preempt_count_sub+0xe2/0x120 [ 886.769142] [] change_protection+0x25/0x30 [ 886.769142] [] change_prot_numa+0x1b/0x30 [ 886.769142] [] task_numa_work+0x279/0x360 [ 886.769142] [] task_work_run+0xae/0xf0 [ 886.769142] [] do_notify_resume+0x8e/0xe0 [ 886.769142] [] retint_signal+0x4d/0x92 [ 886.769142] Code: 49 8b 3c 24 48 83 3d fc 2e ba 04 00 75 12 0f 0b 0f 1f 84 00 00 00 00 00 eb fe 66 0f 1f 44 00 00 48 89 f8 66 66 66 90 84 c0 79 0d <0f> 0b 0f 1f 00 eb fe 66 0f 1f 44 00 00 8b 4d 9c 44 8b 4d 98 +89 [ 886.769142] RIP [] change_protection_range+0x3b3/0x500 [ 886.769142] RSP The VM_BUG_ON was added by the patch "mm,numa: reorganize change_pmd_range". The race existed without the patch but was just harder to hit. The problem is that a transhuge check is made without holding the PTL. It's possible at the time of the check that a parallel fault clears the pmd and inserts a new one which then triggers the VM_BUG_ON check. This patch removes the VM_BUG_ON but fixes the race by rechecking transhuge under the PTL when marking page tables for NUMA hinting and bailing if a race occurred. It is not a problem for calls to mprotect() as they hold mmap_sem for write. Reported-by: Sasha Levin Signed-off-by: Mel Gorman --- mm/mprotect.c | 14 +++++++++++++- 1 file changed, 13 insertions(+), 1 deletion(-) diff --git a/mm/mprotect.c b/mm/mprotect.c index 2afc40e..72061a2 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -46,6 +46,17 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, unsigned long pages = 0; pte = pte_offset_map_lock(mm, pmd, addr, &ptl); + + /* + * For a prot_numa update we only hold mmap_sem for read so there is a + * potential race with faulting where a pmd was temporarily none so + * recheck it under the lock and bail if we race + */ + if (prot_numa && unlikely(pmd_trans_huge(*pmd))) { + pte_unmap_unlock(pte, ptl); + return 0; + } + arch_enter_lazy_mmu_mode(); do { oldpte = *pte; @@ -141,12 +152,13 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma, pages += HPAGE_PMD_NR; nr_huge_updates++; } + + /* huge pmd was handled */ continue; } } /* fall through, the trans huge pmd just split */ } - VM_BUG_ON(pmd_trans_huge(*pmd)); this_pages = change_pte_range(vma, pmd, addr, next, newprot, dirty_accountable, prot_numa); pages += this_pages; From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753817AbaCLMQi (ORCPT ); Wed, 12 Mar 2014 08:16:38 -0400 Received: from mx1.redhat.com ([209.132.183.28]:20038 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753473AbaCLMQc (ORCPT ); Wed, 12 Mar 2014 08:16:32 -0400 Message-ID: <53205014.2050602@redhat.com> Date: Wed, 12 Mar 2014 08:16:20 -0400 From: Rik van Riel User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.2.0 MIME-Version: 1.0 To: Mel Gorman , Andrew Morton CC: Sasha Levin , David Rientjes , hhuang@redhat.com, knoel@redhat.com, aarcange@redhat.com, Davidlohr Bueso , "linux-mm@kvack.org" , LKML Subject: Re: [PATCH] mm: numa: Recheck for transhuge pages under lock during protection changes References: <20140307140650.GA1931@suse.de> <20140307150923.GB1931@suse.de> <20140307182745.GD1931@suse.de> <20140311162845.GA30604@suse.de> <531F3F15.8050206@oracle.com> <531F4128.8020109@redhat.com> <531F48CC.303@oracle.com> <20140311180652.GM10663@suse.de> <531F616A.7060300@oracle.com> <20140311122859.fb6c1e772d82d9f4edd02f52@linux-foundation.org> <20140312103602.GN10663@suse.de> In-Reply-To: <20140312103602.GN10663@suse.de> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 03/12/2014 06:36 AM, Mel Gorman wrote: > Andrew, this should go with the patches > mmnuma-reorganize-change_pmd_range.patch > mmnuma-reorganize-change_pmd_range-fix.patch > move-mmu-notifier-call-from-change_protection-to-change_pmd_range.patch > in mmotm please. > > Thanks. That would be nice indeed :) I am still not entirely sure why the kernel did not hit this race before my reorganize change_pmd_range patch. Maybe gcc used to do one load and now it does two? > The problem is that a transhuge check is made without holding the PTL. It's > possible at the time of the check that a parallel fault clears the pmd > and inserts a new one which then triggers the VM_BUG_ON check. This patch > removes the VM_BUG_ON but fixes the race by rechecking transhuge under the > PTL when marking page tables for NUMA hinting and bailing if a race occurred. > It is not a problem for calls to mprotect() as they hold mmap_sem for write. > > Reported-by: Sasha Levin > Signed-off-by: Mel Gorman Reviewed-by: Rik van Riel -- All rights reversed From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755336AbaCODPy (ORCPT ); Fri, 14 Mar 2014 23:15:54 -0400 Received: from userp1040.oracle.com ([156.151.31.81]:51739 "EHLO userp1040.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754391AbaCODPx (ORCPT ); Fri, 14 Mar 2014 23:15:53 -0400 Message-ID: <5323C5D9.2070902@oracle.com> Date: Fri, 14 Mar 2014 23:15:37 -0400 From: Sasha Levin User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.2.0 MIME-Version: 1.0 To: Mel Gorman , Andrew Morton CC: Rik van Riel , David Rientjes , hhuang@redhat.com, knoel@redhat.com, aarcange@redhat.com, Davidlohr Bueso , "linux-mm@kvack.org" , LKML Subject: Re: [PATCH] mm: numa: Recheck for transhuge pages under lock during protection changes References: <20140307140650.GA1931@suse.de> <20140307150923.GB1931@suse.de> <20140307182745.GD1931@suse.de> <20140311162845.GA30604@suse.de> <531F3F15.8050206@oracle.com> <531F4128.8020109@redhat.com> <531F48CC.303@oracle.com> <20140311180652.GM10663@suse.de> <531F616A.7060300@oracle.com> <20140311122859.fb6c1e772d82d9f4edd02f52@linux-foundation.org> <20140312103602.GN10663@suse.de> In-Reply-To: <20140312103602.GN10663@suse.de> Content-Type: text/plain; charset=ISO-8859-15; format=flowed Content-Transfer-Encoding: 7bit X-Source-IP: ucsinet22.oracle.com [156.151.31.94] Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 03/12/2014 06:36 AM, Mel Gorman wrote: > Andrew, this should go with the patches > mmnuma-reorganize-change_pmd_range.patch > mmnuma-reorganize-change_pmd_range-fix.patch > move-mmu-notifier-call-from-change_protection-to-change_pmd_range.patch > in mmotm please. > > Thanks. > > ---8<--- > From: Mel Gorman > Subject: [PATCH] mm: numa: Recheck for transhuge pages under lock during protection changes > > Sasha Levin reported the following bug using trinity I'm seeing a different issue with this patch. A NULL ptr deref occurs in the pte_offset_map_lock() macro right before the new recheck code: [ 1877.093980] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018 [ 1877.095174] IP: __lock_acquire+0xbc/0x5a0 (kernel/locking/lockdep.c:3069) [ 1877.096069] PGD 6dee7a067 PUD 6dee7b067 PMD 0 [ 1877.096821] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC [ 1877.097706] Dumping ftrace buffer: [ 1877.098281] (ftrace buffer empty) [ 1877.098825] Modules linked in: [ 1877.099327] CPU: 19 PID: 27913 Comm: trinity-c100 Tainted: G W 3.14.0-rc6-next-20140314-sasha-00012-g5590866 #219 [ 1877.100044] task: ffff8808f4280000 ti: ffff8806e1e54000 task.ti: ffff8806e1e54000 [ 1877.100044] RIP: __lock_acquire+0xbc/0x5a0 (kernel/locking/lockdep.c:3069) [ 1877.100044] RSP: 0000:ffff8806e1e55be8 EFLAGS: 00010002 [ 1877.100044] RAX: 0000000000000082 RBX: 0000000000000018 RCX: 0000000000000000 [ 1877.100044] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000018 [ 1877.100044] RBP: ffff8806e1e55c58 R08: 0000000000000001 R09: 0000000000000000 [ 1877.100044] R10: 0000000000000001 R11: 0000000000000001 R12: ffff8808f4280000 [ 1877.100044] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001 [ 1877.100044] FS: 00007fe3fe152700(0000) GS:ffff88042ba00000(0000) knlGS:0000000000000000 [ 1877.100044] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 1877.100044] CR2: 0000000000000018 CR3: 00000006dee79000 CR4: 00000000000006a0 [ 1877.100044] DR0: 0000000000698000 DR1: 0000000000698000 DR2: 0000000000698000 [ 1877.100044] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 000000000009060a [ 1877.100044] Stack: [ 1877.100044] ffff8806e1e55c18 ffffffff81184e95 ffff8808f4280038 00000000001d8500 [ 1877.100044] ffff88042bbd8500 0000000000000013 ffff8806e1e55c48 ffffffff81185108 [ 1877.100044] ffffffff87c13bd0 ffff8808f4280000 0000000000000000 0000000000000001 [ 1877.100044] Call Trace: [ 1877.100044] ? sched_clock_local+0x25/0x90 (kernel/sched/clock.c:205) [ 1877.100044] ? sched_clock_cpu+0xb8/0x100 (kernel/sched/clock.c:310) [ 1877.100044] lock_acquire+0x182/0x1d0 (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602) [ 1877.100044] ? change_pte_range+0xa3/0x410 (mm/mprotect.c:55) [ 1877.100044] ? __lock_release+0x1e2/0x200 (kernel/locking/lockdep.c:3506) [ 1877.100044] _raw_spin_lock+0x40/0x80 (include/linux/spinlock_api_smp.h:143 kernel/locking/spinlock.c:151) [ 1877.100044] ? change_pte_range+0xa3/0x410 (mm/mprotect.c:55) [ 1877.100044] ? _raw_spin_unlock+0x35/0x60 (arch/x86/include/asm/preempt.h:98 include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183) [ 1877.100044] change_pte_range+0xa3/0x410 (mm/mprotect.c:55) [ 1877.100044] change_protection_range+0x3a8/0x4d0 (mm/mprotect.c:164 mm/mprotect.c:188 mm/mprotect.c:213) [ 1877.100044] ? preempt_count_sub+0xe2/0x120 (kernel/sched/core.c:2529) [ 1877.100044] change_protection+0x25/0x30 (mm/mprotect.c:237) [ 1877.100044] change_prot_numa+0x1b/0x30 (mm/mempolicy.c:559) [ 1877.100044] task_numa_work+0x279/0x360 (kernel/sched/fair.c:1911) [ 1877.100044] task_work_run+0xae/0xf0 (kernel/task_work.c:125) [ 1877.100044] do_notify_resume+0x8e/0xe0 (include/linux/tracehook.h:196 arch/x86/kernel/signal.c:751) [ 1877.100044] retint_signal+0x4d/0x92 (arch/x86/kernel/entry_64.S:1096) [ 1877.100044] Code: c2 6f 3b 6d 85 be fa 0b 00 00 48 c7 c7 ce 94 6d 85 e8 f9 78 f9 ff 31 c0 e9 bc 04 00 00 66 90 44 8b 1d 29 69 cd 04 45 85 db 74 0c <48> 81 3b 80 f2 75 87 75 06 0f 1f 00 45 31 c0 83 fe 01 77 0c 89 [ 1877.100044] RIP __lock_acquire+0xbc/0x5a0 (kernel/locking/lockdep.c:3069) [ 1877.100044] RSP [ 1877.100044] CR2: 0000000000000018 Thanks, Sasha From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S965409AbaCSOii (ORCPT ); Wed, 19 Mar 2014 10:38:38 -0400 Received: from cantor2.suse.de ([195.135.220.15]:60908 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S964970AbaCSOig (ORCPT ); Wed, 19 Mar 2014 10:38:36 -0400 Date: Wed, 19 Mar 2014 14:38:32 +0000 From: Mel Gorman To: Sasha Levin Cc: Andrew Morton , Rik van Riel , David Rientjes , hhuang@redhat.com, knoel@redhat.com, aarcange@redhat.com, Davidlohr Bueso , "linux-mm@kvack.org" , LKML Subject: Re: [PATCH] mm: numa: Recheck for transhuge pages under lock during protection changes Message-ID: <20140319143831.GA4751@suse.de> References: <20140307182745.GD1931@suse.de> <20140311162845.GA30604@suse.de> <531F3F15.8050206@oracle.com> <531F4128.8020109@redhat.com> <531F48CC.303@oracle.com> <20140311180652.GM10663@suse.de> <531F616A.7060300@oracle.com> <20140311122859.fb6c1e772d82d9f4edd02f52@linux-foundation.org> <20140312103602.GN10663@suse.de> <5323C5D9.2070902@oracle.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: <5323C5D9.2070902@oracle.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Mar 14, 2014 at 11:15:37PM -0400, Sasha Levin wrote: > On 03/12/2014 06:36 AM, Mel Gorman wrote: > >Andrew, this should go with the patches > >mmnuma-reorganize-change_pmd_range.patch > >mmnuma-reorganize-change_pmd_range-fix.patch > >move-mmu-notifier-call-from-change_protection-to-change_pmd_range.patch > >in mmotm please. > > > >Thanks. > > > >---8<--- > >From: Mel Gorman > >Subject: [PATCH] mm: numa: Recheck for transhuge pages under lock during protection changes > > > >Sasha Levin reported the following bug using trinity > > I'm seeing a different issue with this patch. A NULL ptr deref occurs in the > pte_offset_map_lock() macro right before the new recheck code: > This on top? I tried testing it but got all sorts of carnage that trinity throw up in the mix and ordinary testing does not trigger the race. I've no idea which of the current mess of trinity-exposed bugs you've encountered and got fixed already. ---8<--- From: Mel Gorman Subject: [PATCH] mm: numa: Recheck for transhuge pages under lock during protection changes -fix Signed-off-by: Mel Gorman --- mm/mprotect.c | 40 ++++++++++++++++++++++++++++++---------- 1 file changed, 30 insertions(+), 10 deletions(-) diff --git a/mm/mprotect.c b/mm/mprotect.c index 66973db..c43d557 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -36,6 +36,34 @@ static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot) } #endif +/* + * For a prot_numa update we only hold mmap_sem for read so there is a + * potential race with faulting where a pmd was temporarily none. This + * function checks for a transhuge pmd under the appropriate lock. It + * returns a pte if it was successfully locked or NULL if it raced with + * a transhuge insertion. + */ +static pte_t *lock_pte_protection(struct vm_area_struct *vma, pmd_t *pmd, + unsigned long addr, int prot_numa, spinlock_t **ptl) +{ + pte_t *pte; + spinlock_t *pmdl; + + /* !prot_numa is protected by mmap_sem held for write */ + if (!prot_numa) + return pte_offset_map_lock(vma->vm_mm, pmd, addr, ptl); + + pmdl = pmd_lock(vma->vm_mm, pmd); + if (unlikely(pmd_trans_huge(*pmd) || pmd_none(*pmd))) { + spin_unlock(pmdl); + return NULL; + } + + pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, ptl); + spin_unlock(pmdl); + return pte; +} + static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr, unsigned long end, pgprot_t newprot, int dirty_accountable, int prot_numa) @@ -45,17 +73,9 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, spinlock_t *ptl; unsigned long pages = 0; - pte = pte_offset_map_lock(mm, pmd, addr, &ptl); - - /* - * For a prot_numa update we only hold mmap_sem for read so there is a - * potential race with faulting where a pmd was temporarily none so - * recheck it under the lock and bail if we race - */ - if (prot_numa && unlikely(pmd_trans_huge(*pmd))) { - pte_unmap_unlock(pte, ptl); + pte = lock_pte_protection(vma, pmd, addr, prot_numa, &ptl); + if (!pte) return 0; - } arch_enter_lazy_mmu_mode(); do { From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752088AbaCUWHD (ORCPT ); Fri, 21 Mar 2014 18:07:03 -0400 Received: from mail.linuxfoundation.org ([140.211.169.12]:48824 "EHLO mail.linuxfoundation.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750799AbaCUWHA (ORCPT ); Fri, 21 Mar 2014 18:07:00 -0400 Date: Fri, 21 Mar 2014 15:06:58 -0700 From: Andrew Morton To: Mel Gorman Cc: Sasha Levin , Rik van Riel , David Rientjes , hhuang@redhat.com, knoel@redhat.com, aarcange@redhat.com, Davidlohr Bueso , "linux-mm@kvack.org" , LKML Subject: Re: [PATCH] mm: numa: Recheck for transhuge pages under lock during protection changes Message-Id: <20140321150658.386926ac3afa263946b1a2aa@linux-foundation.org> In-Reply-To: <20140319143831.GA4751@suse.de> References: <20140307182745.GD1931@suse.de> <20140311162845.GA30604@suse.de> <531F3F15.8050206@oracle.com> <531F4128.8020109@redhat.com> <531F48CC.303@oracle.com> <20140311180652.GM10663@suse.de> <531F616A.7060300@oracle.com> <20140311122859.fb6c1e772d82d9f4edd02f52@linux-foundation.org> <20140312103602.GN10663@suse.de> <5323C5D9.2070902@oracle.com> <20140319143831.GA4751@suse.de> X-Mailer: Sylpheed 3.2.0beta5 (GTK+ 2.24.10; x86_64-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, 19 Mar 2014 14:38:32 +0000 Mel Gorman wrote: > On Fri, Mar 14, 2014 at 11:15:37PM -0400, Sasha Levin wrote: > > On 03/12/2014 06:36 AM, Mel Gorman wrote: > > >Andrew, this should go with the patches > > >mmnuma-reorganize-change_pmd_range.patch > > >mmnuma-reorganize-change_pmd_range-fix.patch > > >move-mmu-notifier-call-from-change_protection-to-change_pmd_range.patch > > >in mmotm please. > > > > > >Thanks. > > > > > >---8<--- > > >From: Mel Gorman > > >Subject: [PATCH] mm: numa: Recheck for transhuge pages under lock during protection changes > > > > > >Sasha Levin reported the following bug using trinity > > > > I'm seeing a different issue with this patch. A NULL ptr deref occurs in the > > pte_offset_map_lock() macro right before the new recheck code: > > > > This on top? > > I tried testing it but got all sorts of carnage that trinity throw up > in the mix and ordinary testing does not trigger the race. I've no idea > which of the current mess of trinity-exposed bugs you've encountered and > got fixed already. > Where are we at with this one? With this patch I'm now sitting on mmnuma-reorganize-change_pmd_range.patch mmnuma-reorganize-change_pmd_range-fix.patch mm-numa-recheck-for-transhuge-pages-under-lock-during-protection-changes.patch mm-numa-recheck-for-transhuge-pages-under-lock-during-protection-changes-fix.patch move-mmu-notifier-call-from-change_protection-to-change_pmd_range.patch Where the first four are prep work for the fifth. move-mmu-notifier-call-from-change_protection-to-change_pmd_range.patch is a performance optimisation for KVM guests. Should I drop all five, try again next time? From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752293AbaCUWYZ (ORCPT ); Fri, 21 Mar 2014 18:24:25 -0400 Received: from mx1.redhat.com ([209.132.183.28]:10840 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750803AbaCUWYX (ORCPT ); Fri, 21 Mar 2014 18:24:23 -0400 Message-ID: <532CBC0C.1080403@redhat.com> Date: Fri, 21 Mar 2014 18:24:12 -0400 From: Rik van Riel User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.2.0 MIME-Version: 1.0 To: Mel Gorman , Sasha Levin CC: Andrew Morton , David Rientjes , hhuang@redhat.com, knoel@redhat.com, aarcange@redhat.com, Davidlohr Bueso , "linux-mm@kvack.org" , LKML Subject: Re: [PATCH] mm: numa: Recheck for transhuge pages under lock during protection changes References: <20140307182745.GD1931@suse.de> <20140311162845.GA30604@suse.de> <531F3F15.8050206@oracle.com> <531F4128.8020109@redhat.com> <531F48CC.303@oracle.com> <20140311180652.GM10663@suse.de> <531F616A.7060300@oracle.com> <20140311122859.fb6c1e772d82d9f4edd02f52@linux-foundation.org> <20140312103602.GN10663@suse.de> <5323C5D9.2070902@oracle.com> <20140319143831.GA4751@suse.de> In-Reply-To: <20140319143831.GA4751@suse.de> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 03/19/2014 10:38 AM, Mel Gorman wrote: > On Fri, Mar 14, 2014 at 11:15:37PM -0400, Sasha Levin wrote: >> On 03/12/2014 06:36 AM, Mel Gorman wrote: >>> Andrew, this should go with the patches >>> mmnuma-reorganize-change_pmd_range.patch >>> mmnuma-reorganize-change_pmd_range-fix.patch >>> move-mmu-notifier-call-from-change_protection-to-change_pmd_range.patch >>> in mmotm please. >>> >>> Thanks. >>> >>> ---8<--- >>> From: Mel Gorman >>> Subject: [PATCH] mm: numa: Recheck for transhuge pages under lock during protection changes >>> >>> Sasha Levin reported the following bug using trinity >> >> I'm seeing a different issue with this patch. A NULL ptr deref occurs in the >> pte_offset_map_lock() macro right before the new recheck code: >> > > This on top? > > I tried testing it but got all sorts of carnage that trinity throw up > in the mix and ordinary testing does not trigger the race. I've no idea > which of the current mess of trinity-exposed bugs you've encountered and > got fixed already. Eeeep indeed. If we re-test the transhuge status, we need to take the pmd lock, and not the potentially non-existent pte lock. Good catch. > ---8<--- > From: Mel Gorman > Subject: [PATCH] mm: numa: Recheck for transhuge pages under lock during protection changes -fix > > Signed-off-by: Mel Gorman Reviewed-by: Rik van Riel > --- > mm/mprotect.c | 40 ++++++++++++++++++++++++++++++---------- > 1 file changed, 30 insertions(+), 10 deletions(-) > > diff --git a/mm/mprotect.c b/mm/mprotect.c > index 66973db..c43d557 100644 > --- a/mm/mprotect.c > +++ b/mm/mprotect.c > @@ -36,6 +36,34 @@ static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot) > } > #endif > > +/* > + * For a prot_numa update we only hold mmap_sem for read so there is a > + * potential race with faulting where a pmd was temporarily none. This > + * function checks for a transhuge pmd under the appropriate lock. It > + * returns a pte if it was successfully locked or NULL if it raced with > + * a transhuge insertion. > + */ > +static pte_t *lock_pte_protection(struct vm_area_struct *vma, pmd_t *pmd, > + unsigned long addr, int prot_numa, spinlock_t **ptl) > +{ > + pte_t *pte; > + spinlock_t *pmdl; > + > + /* !prot_numa is protected by mmap_sem held for write */ > + if (!prot_numa) > + return pte_offset_map_lock(vma->vm_mm, pmd, addr, ptl); > + > + pmdl = pmd_lock(vma->vm_mm, pmd); > + if (unlikely(pmd_trans_huge(*pmd) || pmd_none(*pmd))) { > + spin_unlock(pmdl); > + return NULL; > + } > + > + pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, ptl); > + spin_unlock(pmdl); > + return pte; > +} > + > static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, > unsigned long addr, unsigned long end, pgprot_t newprot, > int dirty_accountable, int prot_numa) > @@ -45,17 +73,9 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, > spinlock_t *ptl; > unsigned long pages = 0; > > - pte = pte_offset_map_lock(mm, pmd, addr, &ptl); > - > - /* > - * For a prot_numa update we only hold mmap_sem for read so there is a > - * potential race with faulting where a pmd was temporarily none so > - * recheck it under the lock and bail if we race > - */ > - if (prot_numa && unlikely(pmd_trans_huge(*pmd))) { > - pte_unmap_unlock(pte, ptl); > + pte = lock_pte_protection(vma, pmd, addr, prot_numa, &ptl); > + if (!pte) > return 0; > - } > > arch_enter_lazy_mmu_mode(); > do { > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org > -- All rights reversed From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752700AbaCUWhZ (ORCPT ); Fri, 21 Mar 2014 18:37:25 -0400 Received: from aserp1040.oracle.com ([141.146.126.69]:46464 "EHLO aserp1040.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751845AbaCUWhX (ORCPT ); Fri, 21 Mar 2014 18:37:23 -0400 Message-ID: <532CBF08.4030100@oracle.com> Date: Fri, 21 Mar 2014 18:36:56 -0400 From: Sasha Levin User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.2.0 MIME-Version: 1.0 To: Andrew Morton , Mel Gorman CC: Rik van Riel , David Rientjes , hhuang@redhat.com, knoel@redhat.com, aarcange@redhat.com, Davidlohr Bueso , "linux-mm@kvack.org" , LKML Subject: Re: [PATCH] mm: numa: Recheck for transhuge pages under lock during protection changes References: <20140307182745.GD1931@suse.de> <20140311162845.GA30604@suse.de> <531F3F15.8050206@oracle.com> <531F4128.8020109@redhat.com> <531F48CC.303@oracle.com> <20140311180652.GM10663@suse.de> <531F616A.7060300@oracle.com> <20140311122859.fb6c1e772d82d9f4edd02f52@linux-foundation.org> <20140312103602.GN10663@suse.de> <5323C5D9.2070902@oracle.com> <20140319143831.GA4751@suse.de> <20140321150658.386926ac3afa263946b1a2aa@linux-foundation.org> In-Reply-To: <20140321150658.386926ac3afa263946b1a2aa@linux-foundation.org> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-Source-IP: acsinet21.oracle.com [141.146.126.237] Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 03/21/2014 06:06 PM, Andrew Morton wrote: > On Wed, 19 Mar 2014 14:38:32 +0000 Mel Gorman wrote: > >> >On Fri, Mar 14, 2014 at 11:15:37PM -0400, Sasha Levin wrote: >>> > >On 03/12/2014 06:36 AM, Mel Gorman wrote: >>>> > > >Andrew, this should go with the patches >>>> > > >mmnuma-reorganize-change_pmd_range.patch >>>> > > >mmnuma-reorganize-change_pmd_range-fix.patch >>>> > > >move-mmu-notifier-call-from-change_protection-to-change_pmd_range.patch >>>> > > >in mmotm please. >>>> > > > >>>> > > >Thanks. >>>> > > > >>>> > > >---8<--- >>>> > > >From: Mel Gorman >>>> > > >Subject: [PATCH] mm: numa: Recheck for transhuge pages under lock during protection changes >>>> > > > >>>> > > >Sasha Levin reported the following bug using trinity >>> > > >>> > >I'm seeing a different issue with this patch. A NULL ptr deref occurs in the >>> > >pte_offset_map_lock() macro right before the new recheck code: >>> > > >> > >> >This on top? >> > >> >I tried testing it but got all sorts of carnage that trinity throw up >> >in the mix and ordinary testing does not trigger the race. I've no idea >> >which of the current mess of trinity-exposed bugs you've encountered and >> >got fixed already. >> > > Where are we at with this one? Looking good here, haven't seen any of the issues reported in this thread. Thanks, Sasha