* oops in copy_page_rep() @ 2013-01-05 15:22 Dave Jones 2013-01-06 3:57 ` Linus Torvalds 2013-01-06 11:55 ` Hillf Danton 0 siblings, 2 replies; 26+ messages in thread From: Dave Jones @ 2013-01-05 15:22 UTC (permalink / raw) To: Linux Kernel; +Cc: Linus Torvalds I have no idea what happened here, but this is the first time I've seen this one. This was running a tree pulled yesterday afternoon. BUG: unable to handle kernel paging request at ffff880100201000 IP: [<ffffffff81333235>] copy_page_rep+0x5/0x10 PGD 1c0c063 PUD cfbff067 PMD cfc01067 PTE 8000000100201160 Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC Modules linked in: nfnetlink_log hidp fuse bnep llc2 rose caif_socket caif af_rxrpc phonet netrom af_key binfmt_misc rfcomm l2tp_ppp l2tp_core pppoe pppox ppp_generic slhc ipt_ULOG scsi_transp ort_iscsi can_raw nfnetlink ipx x25 p8023 p8022 nfc ax25 decnet rds can_bcm irda crc_ccitt can appletalk atm psnap llc lockd sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6ta ble_filter ip6_tables snd_hda_codec_realtek btusb snd_hda_intel bluetooth snd_hda_codec usb_debug microcode rfkill snd_pcm serio_raw snd_page_alloc snd_timer pcspkr edac_core snd soundcore r8169 mii vhost_net tun macvtap macvlan kvm_amd kvm CPU 0 Pid: 3505, comm: trinity-child0 Not tainted 3.8.0-rc2+ #45 Gigabyte Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H RIP: 0010:[<ffffffff81333235>] [<ffffffff81333235>] copy_page_rep+0x5/0x10 RSP: 0018:ffff88001ecabd00 EFLAGS: 00010286 RAX: 0000000100201000 RBX: 000000011d215000 RCX: 0000000000000200 RDX: cccccccccccccccd RSI: ffff880100201000 RDI: ffff88011d215000 RBP: ffff88001ecabd98 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000008 R13: 000000000500a050 R14: ffff8800916af080 R15: ffff880095435668 FS: 00007f48a2280740(0000) GS:ffff88012ee00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: ffff880100201000 CR3: 0000000054eda000 CR4: 00000000000007f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process trinity-child0 (pid: 3505, threadinfo ffff88001ecaa000, task ffff8800a9628000) Stack: ffffffff8119a9c7 ffff88001ecabd28 ffff8800a9628000 000000000157e088 000000000157e088 000000000157e088 ffff880095435668 ffff8800a9077600 ffff880095435668 80000001002000e5 ffff8800ba515050 0000000001400000 Call Trace: [<ffffffff8119a9c7>] ? do_huge_pmd_wp_page+0x707/0xc00 [<ffffffff81165f1c>] handle_mm_fault+0x14c/0x590 [<ffffffff810b35ce>] ? __lock_is_held+0x5e/0x90 [<ffffffff816a280c>] __do_page_fault+0x15c/0x4e0 [<ffffffff8100a1b6>] ? native_sched_clock+0x26/0x90 [<ffffffff810b28e8>] ? trace_hardirqs_off_caller+0x28/0xc0 [<ffffffff81334cbd>] ? trace_hardirqs_off_thunk+0x3a/0x3c [<ffffffff816a2b9e>] do_page_fault+0xe/0x10 [<ffffffff8169f822>] page_fault+0x22/0x30 Code: 90 90 90 90 90 90 9c fa 65 48 3b 06 75 14 65 48 3b 56 08 75 0d 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 9d 30 c0 c3 b9 00 02 00 00 <f3> 48 a5 c3 0f 1f 80 00 00 00 00 eb ee 66 66 66 90 66 66 66 90 RIP [<ffffffff81333235>] copy_page_rep+0x5/0x10 RSP <ffff88001ecabd00> CR2: ffff880100201000 ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: oops in copy_page_rep() 2013-01-05 15:22 oops in copy_page_rep() Dave Jones @ 2013-01-06 3:57 ` Linus Torvalds 2013-01-07 0:37 ` Dave Jones 2013-01-06 11:55 ` Hillf Danton 1 sibling, 1 reply; 26+ messages in thread From: Linus Torvalds @ 2013-01-06 3:57 UTC (permalink / raw) To: Dave Jones, Linux Kernel, Andrea Arcangeli, Hugh Dickins, Andrew Morton Adding more people in case somebody else has any idea. Anybody? On Sat, Jan 5, 2013 at 7:22 AM, Dave Jones <davej@redhat.com> wrote: > I have no idea what happened here, but this is the first time I've seen this one. > This was running a tree pulled yesterday afternoon. > > BUG: unable to handle kernel paging request at ffff880100201000 This is %rsi, which is the source for the page copy: copy_user_highpage()-> copy_user_page()-> copy_page()-> copy_page_rep I don't know exactly which copy_user_highpage() case this is from, the call trace implies this *could* be a hugepage, and those functions do copy pages individually in a loop too. > IP: [<ffffffff81333235>] copy_page_rep+0x5/0x10 > PGD 1c0c063 PUD cfbff067 PMD cfc01067 PTE 8000000100201160 Hmm. That PTE looks really odd. If I read the PUD/PMD contents right, the page tables are for individual pages, but then the PTE doesn't have the present bit set: other than that it looks like it could be a valid PTE (NX and global bit set, Accessed and dirty also set, but the two low bits are clear: present and writable are clear). I think it's due to DEBUG_PAGEALLOC, so the (free) page has been unmapped from the kernel mapping. But how could a page that is the source of a page fault be free? > Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC > Pid: 3505, comm: trinity-child0 Not tainted 3.8.0-rc2+ #45 Gigabyte Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H > RIP: 0010:[<ffffffff81333235>] [<ffffffff81333235>] copy_page_rep+0x5/0x10 > RAX: 0000000100201000 RBX: 000000011d215000 RCX: 0000000000000200 The RCX value is 0x200, so this is the first access to that page. As expected. > RDX: cccccccccccccccd RSI: ffff880100201000 RDI: ffff88011d215000 RSI (source) and RDI (destination) both look like valid kernel mapping pages. But RSI isn't mapped, presumably because debug-pagealloc thinks it is free. Anybody with any ideas? The call trace indicates a normal page fault from user space, so.. Linus > Call Trace: > [<ffffffff8119a9c7>] ? do_huge_pmd_wp_page+0x707/0xc00 > [<ffffffff81165f1c>] handle_mm_fault+0x14c/0x590 > [<ffffffff810b35ce>] ? __lock_is_held+0x5e/0x90 > [<ffffffff816a280c>] __do_page_fault+0x15c/0x4e0 > [<ffffffff8100a1b6>] ? native_sched_clock+0x26/0x90 > [<ffffffff810b28e8>] ? trace_hardirqs_off_caller+0x28/0xc0 > [<ffffffff81334cbd>] ? trace_hardirqs_off_thunk+0x3a/0x3c > [<ffffffff816a2b9e>] do_page_fault+0xe/0x10 > [<ffffffff8169f822>] page_fault+0x22/0x30 > Code: 90 90 90 90 90 90 9c fa 65 48 3b 06 75 14 65 48 3b 56 08 75 0d 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 9d 30 c0 c3 b9 00 02 00 00 <f3> 48 a5 c3 0f 1f 80 00 00 00 00 eb ee 66 66 66 90 66 66 66 90 > RIP [<ffffffff81333235>] copy_page_rep+0x5/0x10 ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: oops in copy_page_rep() 2013-01-06 3:57 ` Linus Torvalds @ 2013-01-07 0:37 ` Dave Jones 2013-01-07 3:38 ` Hugh Dickins 0 siblings, 1 reply; 26+ messages in thread From: Dave Jones @ 2013-01-07 0:37 UTC (permalink / raw) To: Linus Torvalds Cc: Linux Kernel, Andrea Arcangeli, Hugh Dickins, Andrew Morton On Sat, Jan 05, 2013 at 07:57:39PM -0800, Linus Torvalds wrote: > Adding more people in case somebody else has any idea. Anybody? > > On Sat, Jan 5, 2013 at 7:22 AM, Dave Jones <davej@redhat.com> wrote: > > I have no idea what happened here, but this is the first time I've seen this one. > > This was running a tree pulled yesterday afternoon. > > > > BUG: unable to handle kernel paging request at ffff880100201000 > > This is %rsi, which is the source for the page copy: > > copy_user_highpage()-> > copy_user_page()-> > copy_page()-> > copy_page_rep > > I don't know exactly which copy_user_highpage() case this is from, the > call trace implies this *could* be a hugepage, and those functions do > copy pages individually in a loop too. investigating the huge page theory a little further I'm a bit confused. The kernel on that machine has THP enabled, and the cpu supports it (an old amd64), but.. $ cat /sys/kernel/mm/hugepages/hugepages-2048kB/* 0 0 0 0 0 0 I was expecting at least one of those to be non-zero. /sys/kernel/mm/transparent_hugepage/khugepaged/full_scans and pages_collapsed are both non-zero, so it's been busy doing _something_. Is this expected behaviour ? Dave ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: oops in copy_page_rep() 2013-01-07 0:37 ` Dave Jones @ 2013-01-07 3:38 ` Hugh Dickins 0 siblings, 0 replies; 26+ messages in thread From: Hugh Dickins @ 2013-01-07 3:38 UTC (permalink / raw) To: Dave Jones; +Cc: Linus Torvalds, Andrea Arcangeli, Andrew Morton, linux-kernel On Sun, 6 Jan 2013, Dave Jones wrote: > > investigating the huge page theory a little further I'm a bit confused. > The kernel on that machine has THP enabled, and the cpu supports it (an old amd64), but.. > > $ cat /sys/kernel/mm/hugepages/hugepages-2048kB/* > 0 > 0 > 0 > 0 > 0 > 0 > > I was expecting at least one of those to be non-zero. > > /sys/kernel/mm/transparent_hugepage/khugepaged/full_scans and pages_collapsed > are both non-zero, so it's been busy doing _something_. > > Is this expected behaviour ? Yes. /sys/kernel/mm/hugepages/ is all about those dusty old opaque hugepages you might have got from fs/hugetlbfs and mm/hugetlb.c; whereas the splitting issue we suspect is peculiar to the dazzling new transparent hugepages you get from mm/huge_memory.c. Hugh ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: oops in copy_page_rep() 2013-01-05 15:22 oops in copy_page_rep() Dave Jones 2013-01-06 3:57 ` Linus Torvalds @ 2013-01-06 11:55 ` Hillf Danton 2013-01-06 16:10 ` Dave Jones 2013-01-06 19:06 ` Hugh Dickins 1 sibling, 2 replies; 26+ messages in thread From: Hillf Danton @ 2013-01-06 11:55 UTC (permalink / raw) To: Dave Jones, Linux Kernel, Linus Torvalds Hi Dave On Sat, Jan 5, 2013 at 11:22 PM, Dave Jones <davej@redhat.com> wrote: > I have no idea what happened here, but this is the first time I've seen this one. > This was running a tree pulled yesterday afternoon. > > BUG: unable to handle kernel paging request at ffff880100201000 > IP: [<ffffffff81333235>] copy_page_rep+0x5/0x10 > PGD 1c0c063 PUD cfbff067 PMD cfc01067 PTE 8000000100201160 > Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC > Modules linked in: nfnetlink_log hidp fuse bnep llc2 rose caif_socket caif af_rxrpc phonet netrom af_key binfmt_misc rfcomm l2tp_ppp l2tp_core pppoe pppox ppp_generic slhc ipt_ULOG scsi_transp > ort_iscsi can_raw nfnetlink ipx x25 p8023 p8022 nfc ax25 decnet rds can_bcm irda crc_ccitt can appletalk atm psnap llc lockd sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6ta > ble_filter ip6_tables snd_hda_codec_realtek btusb snd_hda_intel bluetooth snd_hda_codec usb_debug microcode rfkill snd_pcm serio_raw snd_page_alloc snd_timer pcspkr edac_core snd soundcore r8169 mii vhost_net > tun macvtap macvlan kvm_amd kvm > CPU 0 > Pid: 3505, comm: trinity-child0 Not tainted 3.8.0-rc2+ #45 Gigabyte Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H > RIP: 0010:[<ffffffff81333235>] [<ffffffff81333235>] copy_page_rep+0x5/0x10 > RSP: 0018:ffff88001ecabd00 EFLAGS: 00010286 > RAX: 0000000100201000 RBX: 000000011d215000 RCX: 0000000000000200 > RDX: cccccccccccccccd RSI: ffff880100201000 RDI: ffff88011d215000 > RBP: ffff88001ecabd98 R08: 0000000000000001 R09: 0000000000000000 > R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000008 > R13: 000000000500a050 R14: ffff8800916af080 R15: ffff880095435668 > FS: 00007f48a2280740(0000) GS:ffff88012ee00000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b > CR2: ffff880100201000 CR3: 0000000054eda000 CR4: 00000000000007f0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > Process trinity-child0 (pid: 3505, threadinfo ffff88001ecaa000, task ffff8800a9628000) > Stack: > ffffffff8119a9c7 ffff88001ecabd28 ffff8800a9628000 000000000157e088 > 000000000157e088 000000000157e088 ffff880095435668 ffff8800a9077600 > ffff880095435668 80000001002000e5 ffff8800ba515050 0000000001400000 > Call Trace: > [<ffffffff8119a9c7>] ? do_huge_pmd_wp_page+0x707/0xc00 > [<ffffffff81165f1c>] handle_mm_fault+0x14c/0x590 > [<ffffffff810b35ce>] ? __lock_is_held+0x5e/0x90 > [<ffffffff816a280c>] __do_page_fault+0x15c/0x4e0 > [<ffffffff8100a1b6>] ? native_sched_clock+0x26/0x90 > [<ffffffff810b28e8>] ? trace_hardirqs_off_caller+0x28/0xc0 > [<ffffffff81334cbd>] ? trace_hardirqs_off_thunk+0x3a/0x3c > [<ffffffff816a2b9e>] do_page_fault+0xe/0x10 > [<ffffffff8169f822>] page_fault+0x22/0x30 > Code: 90 90 90 90 90 90 9c fa 65 48 3b 06 75 14 65 48 3b 56 08 75 0d 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 9d 30 c0 c3 b9 00 02 00 00 <f3> 48 a5 c3 0f 1f 80 00 00 00 00 eb ee 66 66 66 90 66 66 66 90 > RIP [<ffffffff81333235>] copy_page_rep+0x5/0x10 > RSP <ffff88001ecabd00> > CR2: ffff880100201000 > Would you please try the following patch? Hillf --- --- a/mm/memory.c Sun Jan 6 19:49:50 2013 +++ b/mm/memory.c Sun Jan 6 19:52:42 2013 @@ -3710,7 +3710,9 @@ retry: return do_huge_pmd_numa_page(mm, vma, address, orig_pmd, pmd); - if (dirty && !pmd_write(orig_pmd)) { + if (dirty && !pmd_write(orig_pmd) && + !pmd_trans_splitting(orig_pmd)) { + ret = do_huge_pmd_wp_page(mm, vma, address, pmd, orig_pmd); /* -- ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: oops in copy_page_rep() 2013-01-06 11:55 ` Hillf Danton @ 2013-01-06 16:10 ` Dave Jones 2013-01-06 19:06 ` Hugh Dickins 1 sibling, 0 replies; 26+ messages in thread From: Dave Jones @ 2013-01-06 16:10 UTC (permalink / raw) To: Hillf Danton; +Cc: Linux Kernel, Linus Torvalds On Sun, Jan 06, 2013 at 07:55:53PM +0800, Hillf Danton wrote: > On Sat, Jan 5, 2013 at 11:22 PM, Dave Jones <davej@redhat.com> wrote: > > I have no idea what happened here, but this is the first time I've seen this one. > > This was running a tree pulled yesterday afternoon. > > > Would you please try the following patch? I can, though it took a long time for me to hit this, so I can't guarantee I'll be able to say with certainty whether it's working. (perhaps there's a printk we could add for debugging to say "would have crashed here" ?) (The fuzzer frequently walks into other bugs first, it's unusual that it ran for the length of time it did when this happened) Dave ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: oops in copy_page_rep() 2013-01-06 11:55 ` Hillf Danton 2013-01-06 16:10 ` Dave Jones @ 2013-01-06 19:06 ` Hugh Dickins 2013-01-07 12:24 ` Hillf Danton 1 sibling, 1 reply; 26+ messages in thread From: Hugh Dickins @ 2013-01-06 19:06 UTC (permalink / raw) To: Hillf Danton Cc: Dave Jones, Linux Kernel, Linus Torvalds, Andrea Arcangeli, Andrew Morton, Mel Gorman On Sun, 6 Jan 2013, Hillf Danton wrote: > On Sat, Jan 5, 2013 at 11:22 PM, Dave Jones <davej@redhat.com> wrote: > > I have no idea what happened here, but this is the first time I've seen this one. > > This was running a tree pulled yesterday afternoon. > > > > BUG: unable to handle kernel paging request at ffff880100201000 > > IP: [<ffffffff81333235>] copy_page_rep+0x5/0x10 > > PGD 1c0c063 PUD cfbff067 PMD cfc01067 PTE 8000000100201160 > > Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC > > Modules linked in: nfnetlink_log hidp fuse bnep llc2 rose caif_socket caif af_rxrpc phonet netrom af_key binfmt_misc rfcomm l2tp_ppp l2tp_core pppoe pppox ppp_generic slhc ipt_ULOG scsi_transp > > ort_iscsi can_raw nfnetlink ipx x25 p8023 p8022 nfc ax25 decnet rds can_bcm irda crc_ccitt can appletalk atm psnap llc lockd sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6ta > > ble_filter ip6_tables snd_hda_codec_realtek btusb snd_hda_intel bluetooth snd_hda_codec usb_debug microcode rfkill snd_pcm serio_raw snd_page_alloc snd_timer pcspkr edac_core snd soundcore r8169 mii vhost_net > > tun macvtap macvlan kvm_amd kvm > > CPU 0 > > Pid: 3505, comm: trinity-child0 Not tainted 3.8.0-rc2+ #45 Gigabyte Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H > > RIP: 0010:[<ffffffff81333235>] [<ffffffff81333235>] copy_page_rep+0x5/0x10 > > RSP: 0018:ffff88001ecabd00 EFLAGS: 00010286 > > RAX: 0000000100201000 RBX: 000000011d215000 RCX: 0000000000000200 > > RDX: cccccccccccccccd RSI: ffff880100201000 RDI: ffff88011d215000 > > RBP: ffff88001ecabd98 R08: 0000000000000001 R09: 0000000000000000 > > R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000008 > > R13: 000000000500a050 R14: ffff8800916af080 R15: ffff880095435668 > > FS: 00007f48a2280740(0000) GS:ffff88012ee00000(0000) knlGS:0000000000000000 > > CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b > > CR2: ffff880100201000 CR3: 0000000054eda000 CR4: 00000000000007f0 > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > > Process trinity-child0 (pid: 3505, threadinfo ffff88001ecaa000, task ffff8800a9628000) > > Stack: > > ffffffff8119a9c7 ffff88001ecabd28 ffff8800a9628000 000000000157e088 > > 000000000157e088 000000000157e088 ffff880095435668 ffff8800a9077600 > > ffff880095435668 80000001002000e5 ffff8800ba515050 0000000001400000 > > Call Trace: > > [<ffffffff8119a9c7>] ? do_huge_pmd_wp_page+0x707/0xc00 > > [<ffffffff81165f1c>] handle_mm_fault+0x14c/0x590 > > [<ffffffff810b35ce>] ? __lock_is_held+0x5e/0x90 > > [<ffffffff816a280c>] __do_page_fault+0x15c/0x4e0 > > [<ffffffff8100a1b6>] ? native_sched_clock+0x26/0x90 > > [<ffffffff810b28e8>] ? trace_hardirqs_off_caller+0x28/0xc0 > > [<ffffffff81334cbd>] ? trace_hardirqs_off_thunk+0x3a/0x3c > > [<ffffffff816a2b9e>] do_page_fault+0xe/0x10 > > [<ffffffff8169f822>] page_fault+0x22/0x30 > > Code: 90 90 90 90 90 90 9c fa 65 48 3b 06 75 14 65 48 3b 56 08 75 0d 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 9d 30 c0 c3 b9 00 02 00 00 <f3> 48 a5 c3 0f 1f 80 00 00 00 00 eb ee 66 66 66 90 66 66 66 90 > > RIP [<ffffffff81333235>] copy_page_rep+0x5/0x10 > > RSP <ffff88001ecabd00> > > CR2: ffff880100201000 > > > Would you please try the following patch? > > Hillf > --- > --- a/mm/memory.c Sun Jan 6 19:49:50 2013 > +++ b/mm/memory.c Sun Jan 6 19:52:42 2013 > @@ -3710,7 +3710,9 @@ retry: > return do_huge_pmd_numa_page(mm, vma, address, > orig_pmd, pmd); > > - if (dirty && !pmd_write(orig_pmd)) { > + if (dirty && !pmd_write(orig_pmd) && > + !pmd_trans_splitting(orig_pmd)) { > + > ret = do_huge_pmd_wp_page(mm, vma, address, pmd, > orig_pmd); > /* > -- Excellent suggestion! I don't think it need wait on Dave trying+failing to reproduce his oops, which strongly suggested that we're involved with a page which had very recently been split out from a hugepage, and in fact had got freed. It's clear that 3.7 had an important pmd_trans_splitting(orig_pmd) check there, which went AWOL in d10e63f29488 "mm: numa: Create basic numa page hinting infrastructure". Perhaps intended to be moved into do_huge_pmd_wp_page, but the checks there are pmd_same against orig_pmd, so vital to get orig_pmd right. I don't entirely like your patch (or the original code): shouldn't there be a wait_split_huge_page(), rather than hammering back with repeated faults until the split has completed? Or perhaps it makes little difference. Let's see what Mel or Andrea suggest. Hugh ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: oops in copy_page_rep() 2013-01-06 19:06 ` Hugh Dickins @ 2013-01-07 12:24 ` Hillf Danton 2013-01-07 17:34 ` Linus Torvalds 0 siblings, 1 reply; 26+ messages in thread From: Hillf Danton @ 2013-01-07 12:24 UTC (permalink / raw) To: Hugh Dickins Cc: Dave Jones, Linux Kernel, Linus Torvalds, Andrea Arcangeli, Andrew Morton, Mel Gorman Hello Hugh On Mon, Jan 7, 2013 at 3:06 AM, Hugh Dickins <hughd@google.com> wrote: > I don't entirely like your patch (or the original code): shouldn't > there be a wait_split_huge_page(), rather than hammering back with > repeated faults until the split has completed? > I take another try with waiting added, take a look please. Hillf --- --- a/mm/memory.c Sun Jan 6 19:49:50 2013 +++ b/mm/memory.c Mon Jan 7 20:16:14 2013 @@ -3710,6 +3710,11 @@ retry: return do_huge_pmd_numa_page(mm, vma, address, orig_pmd, pmd); + /* Check if pmd is stable */ + if (pmd_trans_splitting(orig_pmd)) { + wait_split_huge_page(vma->anon_vma, pmd); + goto retry; + } if (dirty && !pmd_write(orig_pmd)) { ret = do_huge_pmd_wp_page(mm, vma, address, pmd, orig_pmd); -- ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: oops in copy_page_rep() 2013-01-07 12:24 ` Hillf Danton @ 2013-01-07 17:34 ` Linus Torvalds 2013-01-08 13:04 ` Hillf Danton 0 siblings, 1 reply; 26+ messages in thread From: Linus Torvalds @ 2013-01-07 17:34 UTC (permalink / raw) To: Hillf Danton Cc: Hugh Dickins, Dave Jones, Linux Kernel, Andrea Arcangeli, Andrew Morton, Mel Gorman On Mon, Jan 7, 2013 at 4:24 AM, Hillf Danton <dhillf@gmail.com> wrote: > > I take another try with waiting added, take a look please. Hmm. Is there some reason we never need to worry about it for the "pmd_numa()" case just above? A comment about this all might be a really good idea. Linus ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: oops in copy_page_rep() 2013-01-07 17:34 ` Linus Torvalds @ 2013-01-08 13:04 ` Hillf Danton 2013-01-08 15:37 ` Linus Torvalds 0 siblings, 1 reply; 26+ messages in thread From: Hillf Danton @ 2013-01-08 13:04 UTC (permalink / raw) To: Linus Torvalds Cc: Hugh Dickins, Dave Jones, Linux Kernel, Andrea Arcangeli, Andrew Morton, Mel Gorman, Linux-MM On Tue, Jan 8, 2013 at 1:34 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Mon, Jan 7, 2013 at 4:24 AM, Hillf Danton <dhillf@gmail.com> wrote: >> >> I take another try with waiting added, take a look please. > > Hmm. Is there some reason we never need to worry about it for the > "pmd_numa()" case just above? > > A comment about this all might be a really good idea. > Yes Sir, added. --- From: Hillf Danton <dhillf@gmail.com> Subject: [PATCH] mm: restore huge pmd splitting check Hugh said, it's clear that 3.7 had an important pmd_trans_splitting(orig_pmd) check there, which went AWOL in d10e63f29488 "mm: numa: Create basic numa page hinting infrastructure". It is restored for handling stable page fault, with wait_split_huge_page() added, as suggested also by Hugh, to avoid reapted faults until the split has completed. This work is inspired by the oops reported by Dave Jones at https://lkml.org/lkml/2013/1/5/115 Signed-off-by: Hillf Danton <dhillf@gmail.com> --- --- a/mm/memory.c Sun Jan 6 19:49:50 2013 +++ b/mm/memory.c Tue Jan 8 20:28:04 2013 @@ -3710,6 +3710,14 @@ retry: return do_huge_pmd_numa_page(mm, vma, address, orig_pmd, pmd); + /* + * Check if pmd is stable + * (numa pmd is stable, see change_huge_pmd()) + */ + if (pmd_trans_splitting(orig_pmd)) { + wait_split_huge_page(vma->anon_vma, pmd); + goto retry; + } if (dirty && !pmd_write(orig_pmd)) { ret = do_huge_pmd_wp_page(mm, vma, address, pmd, orig_pmd); -- ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: oops in copy_page_rep() 2013-01-08 13:04 ` Hillf Danton @ 2013-01-08 15:37 ` Linus Torvalds 2013-01-08 16:31 ` Kirill A. Shutemov 0 siblings, 1 reply; 26+ messages in thread From: Linus Torvalds @ 2013-01-08 15:37 UTC (permalink / raw) To: Hillf Danton Cc: Hugh Dickins, Dave Jones, Linux Kernel, Andrea Arcangeli, Andrew Morton, Mel Gorman, Linux-MM On Tue, Jan 8, 2013 at 5:04 AM, Hillf Danton <dhillf@gmail.com> wrote: > On Tue, Jan 8, 2013 at 1:34 AM, Linus Torvalds > <torvalds@linux-foundation.org> wrote: >> >> Hmm. Is there some reason we never need to worry about it for the >> "pmd_numa()" case just above? >> >> A comment about this all might be a really good idea. >> > Yes Sir, added. Heh. I was more thinking about why do_huge_pmd_wp_page() needs it, but do_huge_pmd_numa_page() does not. Also, do we actually need it for huge_pmd_set_accessed()? The *placement* of that thing confuses me. And because it confuses me, I'd like to understand it. Linus ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: oops in copy_page_rep() 2013-01-08 15:37 ` Linus Torvalds @ 2013-01-08 16:31 ` Kirill A. Shutemov 2013-01-08 16:52 ` Linus Torvalds 0 siblings, 1 reply; 26+ messages in thread From: Kirill A. Shutemov @ 2013-01-08 16:31 UTC (permalink / raw) To: Linus Torvalds Cc: Hillf Danton, Hugh Dickins, Dave Jones, Linux Kernel, Andrea Arcangeli, Andrew Morton, Mel Gorman, Linux-MM On Tue, Jan 08, 2013 at 07:37:06AM -0800, Linus Torvalds wrote: > On Tue, Jan 8, 2013 at 5:04 AM, Hillf Danton <dhillf@gmail.com> wrote: > > On Tue, Jan 8, 2013 at 1:34 AM, Linus Torvalds > > <torvalds@linux-foundation.org> wrote: > >> > >> Hmm. Is there some reason we never need to worry about it for the > >> "pmd_numa()" case just above? > >> > >> A comment about this all might be a really good idea. > >> > > Yes Sir, added. > > Heh. I was more thinking about why do_huge_pmd_wp_page() needs it, but > do_huge_pmd_numa_page() does not. It does. The check should be moved up. > Also, do we actually need it for huge_pmd_set_accessed()? The > *placement* of that thing confuses me. And because it confuses me, I'd > like to understand it. We need it for huge_pmd_set_accessed() too. Looks like a mis-merge. The original patch for huge_pmd_set_accessed() was correct: http://lkml.org/lkml/2012/10/25/402 -- Kirill A. Shutemov ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: oops in copy_page_rep() 2013-01-08 16:31 ` Kirill A. Shutemov @ 2013-01-08 16:52 ` Linus Torvalds 2013-01-08 17:30 ` Kirill A. Shutemov ` (2 more replies) 0 siblings, 3 replies; 26+ messages in thread From: Linus Torvalds @ 2013-01-08 16:52 UTC (permalink / raw) To: Kirill A. Shutemov Cc: Hillf Danton, Hugh Dickins, Dave Jones, Linux Kernel, Andrea Arcangeli, Andrew Morton, Mel Gorman, Linux-MM, Rik van Riel On Tue, Jan 8, 2013 at 8:31 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote: >> >> Heh. I was more thinking about why do_huge_pmd_wp_page() needs it, but >> do_huge_pmd_numa_page() does not. > > It does. The check should be moved up. > >> Also, do we actually need it for huge_pmd_set_accessed()? The >> *placement* of that thing confuses me. And because it confuses me, I'd >> like to understand it. > > We need it for huge_pmd_set_accessed() too. > > Looks like a mis-merge. The original patch for huge_pmd_set_accessed() was > correct: http://lkml.org/lkml/2012/10/25/402 Not a merge error: the pmd_trans_splitting() check was removed by commit d10e63f29488 ("mm: numa: Create basic numa page hinting infrastructure"). Now, *why* it was removed, I can't tell. And it's not clear why the original code just had it in a conditional, while the suggested patch has that "goto repeat" thing. I suspect re-trying the fault (which I assume the original code did) is actually better, because that way you go through all the "should I reschedule as I return through the exception" stuff. I dunno. Mel, that original patch came from you , although it was based on previous work by Peter/Ingo/Andrea. Can you walk us through the history and thinking about the loss of pmd_trans_splitting(). Was it purely a mistake? It looks intentional. Linus ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: oops in copy_page_rep() 2013-01-08 16:52 ` Linus Torvalds @ 2013-01-08 17:30 ` Kirill A. Shutemov 2013-01-08 17:38 ` Linus Torvalds 2013-01-08 17:49 ` Andrea Arcangeli 2013-01-08 17:37 ` Andrea Arcangeli 2013-01-09 11:44 ` Mel Gorman 2 siblings, 2 replies; 26+ messages in thread From: Kirill A. Shutemov @ 2013-01-08 17:30 UTC (permalink / raw) To: Linus Torvalds Cc: Hillf Danton, Hugh Dickins, Dave Jones, Linux Kernel, Andrea Arcangeli, Andrew Morton, Mel Gorman, Linux-MM, Rik van Riel On Tue, Jan 08, 2013 at 08:52:14AM -0800, Linus Torvalds wrote: > On Tue, Jan 8, 2013 at 8:31 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote: > >> > >> Heh. I was more thinking about why do_huge_pmd_wp_page() needs it, but > >> do_huge_pmd_numa_page() does not. > > > > It does. The check should be moved up. > > > >> Also, do we actually need it for huge_pmd_set_accessed()? The > >> *placement* of that thing confuses me. And because it confuses me, I'd > >> like to understand it. > > > > We need it for huge_pmd_set_accessed() too. > > > > Looks like a mis-merge. The original patch for huge_pmd_set_accessed() was > > correct: http://lkml.org/lkml/2012/10/25/402 > > Not a merge error: the pmd_trans_splitting() check was removed by > commit d10e63f29488 ("mm: numa: Create basic numa page hinting > infrastructure"). Check difference between patch above and merged one -- a1dd450. Merged patch is obviously broken: huge_pmd_set_accessed() can be called only if the pmd is under splitting. -- Kirill A. Shutemov ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: oops in copy_page_rep() 2013-01-08 17:30 ` Kirill A. Shutemov @ 2013-01-08 17:38 ` Linus Torvalds 2013-01-08 17:49 ` Andrea Arcangeli 1 sibling, 0 replies; 26+ messages in thread From: Linus Torvalds @ 2013-01-08 17:38 UTC (permalink / raw) To: Kirill A. Shutemov Cc: Hillf Danton, Hugh Dickins, Dave Jones, Linux Kernel, Andrea Arcangeli, Andrew Morton, Mel Gorman, Linux-MM, Rik van Riel On Tue, Jan 8, 2013 at 9:30 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote: > > Check difference between patch above and merged one -- a1dd450. > Merged patch is obviously broken: huge_pmd_set_accessed() can be called > only if the pmd is under splitting. Ok, that's a totally different issue, and seems to be due to different versions (Andrew - any idea why http://lkml.org/lkml/2012/10/25/402 and commit a1dd450bcb1a ("mm: thp: set the accessed flag for old pages on access fault") are different? That said, I actually think that commit a1dd450bcb1a is correct: huge_pmd_set_accessed() can not *possibly* need to check the splitting issue, since it takes the page table lock and re-verifies that the pmd entry is identical, before just setting the access flags. So that whole thing is irrelevant. huge_pmd_set_accessed() almost certainly simply doesn't care about splitting. But look at commit d10e63f29488. That's the one that removes pmd_trans_splitting() entirely, and does it for the case that *does* seem to care, namely do_huge_pmd_wp_page(). Linus ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: oops in copy_page_rep() 2013-01-08 17:30 ` Kirill A. Shutemov 2013-01-08 17:38 ` Linus Torvalds @ 2013-01-08 17:49 ` Andrea Arcangeli 2013-01-08 18:03 ` Kirill A. Shutemov 2013-01-11 7:50 ` Simon Jeons 1 sibling, 2 replies; 26+ messages in thread From: Andrea Arcangeli @ 2013-01-08 17:49 UTC (permalink / raw) To: Kirill A. Shutemov Cc: Linus Torvalds, Hillf Danton, Hugh Dickins, Dave Jones, Linux Kernel, Andrew Morton, Mel Gorman, Linux-MM, Rik van Riel Hi Kirill, On Tue, Jan 08, 2013 at 07:30:58PM +0200, Kirill A. Shutemov wrote: > Merged patch is obviously broken: huge_pmd_set_accessed() can be called > only if the pmd is under splitting. Of course I assume you meant "only if the pmd is not under splitting". But no, setting a bitflag like the young bit or clearing or setting the numa bit won't screw with split_huge_page and it's safe even if the pmd is under splitting. Those bits are only checked here at the last stage of split_huge_page_map after taking the PT lock: spin_lock(&mm->page_table_lock); pmd = page_check_address_pmd(page, mm, address, PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG); if (pmd) { pgtable = pgtable_trans_huge_withdraw(mm); pmd_populate(mm, &_pmd, pgtable); haddr = address; for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) { pte_t *pte, entry; BUG_ON(PageCompound(page+i)); entry = mk_pte(page + i, vma->vm_page_prot); entry = maybe_mkwrite(pte_mkdirty(entry), vma); if (!pmd_write(*pmd)) entry = pte_wrprotect(entry); else BUG_ON(page_mapcount(page) != 1); if (!pmd_young(*pmd)) entry = pte_mkold(entry); if (pmd_numa(*pmd)) entry = pte_mknuma(entry); pte = pte_offset_map(&_pmd, haddr); BUG_ON(!pte_none(*pte)); set_pte_at(mm, haddr, pte, entry); pte_unmap(pte); } If "young" or "numa" bitflags changed on the original *pmd for the previous part of split_huge_page, nothing will go wrong by the time we get to split_huge_page_map (the same is not true if the pfn changes!). If you think this is too tricky, we could also decide to forbid huge_pmd_set_accessed if the pmd is in splitting state, but I don't think that flipping young/numa bits while in splitting state, can cause any problem (if done correctly with PT lock + pmd_same). Thanks! ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: oops in copy_page_rep() 2013-01-08 17:49 ` Andrea Arcangeli @ 2013-01-08 18:03 ` Kirill A. Shutemov 2013-01-11 7:50 ` Simon Jeons 1 sibling, 0 replies; 26+ messages in thread From: Kirill A. Shutemov @ 2013-01-08 18:03 UTC (permalink / raw) To: Andrea Arcangeli Cc: Linus Torvalds, Hillf Danton, Hugh Dickins, Dave Jones, Linux Kernel, Andrew Morton, Mel Gorman, Linux-MM, Rik van Riel On Tue, Jan 08, 2013 at 06:49:51PM +0100, Andrea Arcangeli wrote: > Hi Kirill, > > On Tue, Jan 08, 2013 at 07:30:58PM +0200, Kirill A. Shutemov wrote: > > Merged patch is obviously broken: huge_pmd_set_accessed() can be called > > only if the pmd is under splitting. > > Of course I assume you meant "only if the pmd is not under splitting". The broken merged patch has this: + if (dirty && !pmd_write(orig_pmd) && !pmd_trans_splitting(orig_pmd)) { [...] + } else { + huge_pmd_set_accessed(mm, vma, address, pmd, + orig_pmd, dirty); } > But no, setting a bitflag like the young bit or clearing or setting > the numa bit won't screw with split_huge_page and it's safe even if > the pmd is under splitting. Okay. Thanks for clarification for me. -- Kirill A. Shutemov ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: oops in copy_page_rep() 2013-01-08 17:49 ` Andrea Arcangeli 2013-01-08 18:03 ` Kirill A. Shutemov @ 2013-01-11 7:50 ` Simon Jeons 2013-01-11 14:01 ` Andrea Arcangeli 1 sibling, 1 reply; 26+ messages in thread From: Simon Jeons @ 2013-01-11 7:50 UTC (permalink / raw) To: Andrea Arcangeli Cc: Kirill A. Shutemov, Linus Torvalds, Hillf Danton, Hugh Dickins, Dave Jones, Linux Kernel, Andrew Morton, Mel Gorman, Linux-MM, Rik van Riel On Tue, 2013-01-08 at 18:49 +0100, Andrea Arcangeli wrote: > Hi Kirill, > > On Tue, Jan 08, 2013 at 07:30:58PM +0200, Kirill A. Shutemov wrote: > > Merged patch is obviously broken: huge_pmd_set_accessed() can be called > > only if the pmd is under splitting. > > Of course I assume you meant "only if the pmd is not under splitting". > > But no, setting a bitflag like the young bit or clearing or setting > the numa bit won't screw with split_huge_page and it's safe even if > the pmd is under splitting. > > Those bits are only checked here at the last stage of > split_huge_page_map after taking the PT lock: > > spin_lock(&mm->page_table_lock); > pmd = page_check_address_pmd(page, mm, address, > PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG); > if (pmd) { > pgtable = pgtable_trans_huge_withdraw(mm); > pmd_populate(mm, &_pmd, pgtable); > > haddr = address; > for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) { > pte_t *pte, entry; > BUG_ON(PageCompound(page+i)); > entry = mk_pte(page + i, vma->vm_page_prot); > entry = maybe_mkwrite(pte_mkdirty(entry), vma); > if (!pmd_write(*pmd)) > entry = pte_wrprotect(entry); > else > BUG_ON(page_mapcount(page) != 1); > if (!pmd_young(*pmd)) > entry = pte_mkold(entry); > if (pmd_numa(*pmd)) > entry = pte_mknuma(entry); > pte = pte_offset_map(&_pmd, haddr); > BUG_ON(!pte_none(*pte)); > set_pte_at(mm, haddr, pte, entry); > pte_unmap(pte); > } > > If "young" or "numa" bitflags changed on the original *pmd for the > previous part of split_huge_page, nothing will go wrong by the time we > get to split_huge_page_map (the same is not true if the pfn changes!). > But this time BUG_ON(mapcount != mapcount2) in function __split_huge_page will be trigged. > If you think this is too tricky, we could also decide to forbid > huge_pmd_set_accessed if the pmd is in splitting state, but I don't > think that flipping young/numa bits while in splitting state, can > cause any problem (if done correctly with PT lock + pmd_same). > > Thanks! > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: oops in copy_page_rep() 2013-01-11 7:50 ` Simon Jeons @ 2013-01-11 14:01 ` Andrea Arcangeli 0 siblings, 0 replies; 26+ messages in thread From: Andrea Arcangeli @ 2013-01-11 14:01 UTC (permalink / raw) To: Simon Jeons Cc: Kirill A. Shutemov, Linus Torvalds, Hillf Danton, Hugh Dickins, Dave Jones, Linux Kernel, Andrew Morton, Mel Gorman, Linux-MM, Rik van Riel On Fri, Jan 11, 2013 at 01:50:44AM -0600, Simon Jeons wrote: > On Tue, 2013-01-08 at 18:49 +0100, Andrea Arcangeli wrote: > > Hi Kirill, > > > > On Tue, Jan 08, 2013 at 07:30:58PM +0200, Kirill A. Shutemov wrote: > > > Merged patch is obviously broken: huge_pmd_set_accessed() can be called > > > only if the pmd is under splitting. > > > > Of course I assume you meant "only if the pmd is not under splitting". > > > > But no, setting a bitflag like the young bit or clearing or setting > > the numa bit won't screw with split_huge_page and it's safe even if > > the pmd is under splitting. > > > > Those bits are only checked here at the last stage of > > split_huge_page_map after taking the PT lock: > > > > spin_lock(&mm->page_table_lock); > > pmd = page_check_address_pmd(page, mm, address, > > PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG); > > if (pmd) { > > pgtable = pgtable_trans_huge_withdraw(mm); > > pmd_populate(mm, &_pmd, pgtable); > > > > haddr = address; > > for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) { > > pte_t *pte, entry; > > BUG_ON(PageCompound(page+i)); > > entry = mk_pte(page + i, vma->vm_page_prot); > > entry = maybe_mkwrite(pte_mkdirty(entry), vma); > > if (!pmd_write(*pmd)) > > entry = pte_wrprotect(entry); > > else > > BUG_ON(page_mapcount(page) != 1); > > if (!pmd_young(*pmd)) > > entry = pte_mkold(entry); > > if (pmd_numa(*pmd)) > > entry = pte_mknuma(entry); > > pte = pte_offset_map(&_pmd, haddr); > > BUG_ON(!pte_none(*pte)); > > set_pte_at(mm, haddr, pte, entry); > > pte_unmap(pte); > > } > > > > If "young" or "numa" bitflags changed on the original *pmd for the > > previous part of split_huge_page, nothing will go wrong by the time we > > get to split_huge_page_map (the same is not true if the pfn changes!). > > > > But this time BUG_ON(mapcount != mapcount2) in function > __split_huge_page will be trigged. "young" or "numa" bitflags in the pmd don't alter rmap/mapcount/pagecount/pfn or anything that could affect such BUG_ON, so I'm not sure why you think so. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: oops in copy_page_rep() 2013-01-08 16:52 ` Linus Torvalds 2013-01-08 17:30 ` Kirill A. Shutemov @ 2013-01-08 17:37 ` Andrea Arcangeli 2013-01-08 17:51 ` Linus Torvalds 2013-01-09 4:23 ` Hugh Dickins 2013-01-09 11:44 ` Mel Gorman 2 siblings, 2 replies; 26+ messages in thread From: Andrea Arcangeli @ 2013-01-08 17:37 UTC (permalink / raw) To: Linus Torvalds Cc: Kirill A. Shutemov, Hillf Danton, Hugh Dickins, Dave Jones, Linux Kernel, Andrew Morton, Mel Gorman, Linux-MM, Rik van Riel Hi, On Tue, Jan 08, 2013 at 08:52:14AM -0800, Linus Torvalds wrote: > On Tue, Jan 8, 2013 at 8:31 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote: > >> > >> Heh. I was more thinking about why do_huge_pmd_wp_page() needs it, but > >> do_huge_pmd_numa_page() does not. > > > > It does. The check should be moved up. > > > >> Also, do we actually need it for huge_pmd_set_accessed()? The > >> *placement* of that thing confuses me. And because it confuses me, I'd > >> like to understand it. > > > > We need it for huge_pmd_set_accessed() too. > > > > Looks like a mis-merge. The original patch for huge_pmd_set_accessed() was > > correct: http://lkml.org/lkml/2012/10/25/402 > > Not a merge error: the pmd_trans_splitting() check was removed by > commit d10e63f29488 ("mm: numa: Create basic numa page hinting > infrastructure"). > > Now, *why* it was removed, I can't tell. And it's not clear why the > original code just had it in a conditional, while the suggested patch > has that "goto repeat" thing. I suspect re-trying the fault (which I > assume the original code did) is actually better, because that way you > go through all the "should I reschedule as I return through the > exception" stuff. I dunno. The reason it returned to userland and retried the fault is that this should be infrequent enough not to worry about it and this was marginally simpler but it could be changed. If we don't want to return to userland we should wait on the splitting bit and then take the pte walking routines like if the pmd wasn't huge. This is not related to the below though. > Mel, that original patch came from you , although it was based on > previous work by Peter/Ingo/Andrea. Can you walk us through the > history and thinking about the loss of pmd_trans_splitting(). Was it > purely a mistake? It looks intentional. d10e63f29488 is wrong in removing the pmd_splitting check, I assume it's an accidental leftover. My code did this: @@ -3530,6 +3534,9 @@ retry: */ orig_pmd = ACCESS_ONCE(*pmd); if (pmd_trans_huge(orig_pmd)) { + if (pmd_numa(*pmd)) + return huge_pmd_numa_fixup(mm, address, + orig_pmd, pmd); if (flags & FAULT_FLAG_WRITE && !pmd_write(orig_pmd) && !pmd_trans_splitting(orig_pmd)) { The function I was calling was this: +#ifdef CONFIG_AUTONUMA +/* NUMA hinting page fault entry point for trans huge pmds */ +int huge_pmd_numa_fixup(struct mm_struct *mm, unsigned long addr, + pmd_t pmd, pmd_t *pmdp) +{ + struct page *page; + bool migrated; + + spin_lock(&mm->page_table_lock); + if (unlikely(!pmd_same(pmd, *pmdp))) + goto out_unlock; + + page = pmd_page(pmd); + pmd = pmd_mknonnuma(pmd); + set_pmd_at(mm, addr & HPAGE_PMD_MASK, pmdp, pmd); + VM_BUG_ON(pmd_numa(*pmdp)); + if (unlikely(page_mapcount(page) != 1)) + goto out_unlock; + get_page(page); + spin_unlock(&mm->page_table_lock); + + migrated = numa_hinting_fault(page, HPAGE_PMD_NR); + if (!migrated) + put_page(page); + +out: + return 0; + +out_unlock: + spin_unlock(&mm->page_table_lock); + goto out; +} +#endif Taking the PT lock, checking pmd_same and clearing the numa bitflag was perfectly ok even if the pmd was in splitting state the whole time. However do_huge_pmd_numa_page is slightly more complex than the above: the problem is that migrate_misplaced_transhuge_page gets pmdp and pmd as parameters (unlike the above numa_hinting_fault() function) and it relies internally to pmd_same too. /* Recheck the target PMD */ spin_lock(&mm->page_table_lock); if (unlikely(!pmd_same(*pmd, entry))) { spin_unlock(&mm->page_table_lock); [..] entry = mk_pmd(new_page, vma->vm_page_prot); entry = pmd_mknonnuma(entry); entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma); entry = pmd_mkhuge(entry); page_add_new_anon_rmap(new_page, vma, haddr); set_pmd_at(mm, haddr, pmd, entry); And this kind of mangling isn't ok if the pmd was in splitting state because split_huge_page won't expect the pmd to change (if the numa bit changes is ok but the pfn cannot change or split_huge_page_map will go wrong). So you're right that if migrate_misplaced_transhuge_page will continue to do the pmd_same check, we should add the pmd_splitting bit in memory.c for the pmd_numa() path too. Looking at this, one thing that isn't clear is where the page_count is checked in migrate_misplaced_transhuge_page. Ok that it's unable to migrate anon transhuge COW shared pages so it doesn't need to mess with rmap (the mapcount check makes it safe), but it shouldn't be allowed to migrate memory that has gup direct-IO in flight (and that can only be detected with a page_count vs mapcount check). Real migrate does page_freeze_refs to be safe. Mel comments? Thanks, Andrea ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: oops in copy_page_rep() 2013-01-08 17:37 ` Andrea Arcangeli @ 2013-01-08 17:51 ` Linus Torvalds 2013-01-08 18:03 ` Andrea Arcangeli 2013-01-09 4:23 ` Hugh Dickins 1 sibling, 1 reply; 26+ messages in thread From: Linus Torvalds @ 2013-01-08 17:51 UTC (permalink / raw) To: Andrea Arcangeli Cc: Kirill A. Shutemov, Hillf Danton, Hugh Dickins, Dave Jones, Linux Kernel, Andrew Morton, Mel Gorman, Linux-MM, Rik van Riel [-- Attachment #1: Type: text/plain, Size: 916 bytes --] On Tue, Jan 8, 2013 at 9:37 AM, Andrea Arcangeli <aarcange@redhat.com> wrote: > > The reason it returned to userland and retried the fault is that this > should be infrequent enough not to worry about it and this was > marginally simpler but it could be changed. Yeah, that was my suspicion. And as mentioned, returning to user land might actually help with scheduling and/or signal handling latencies etc, so it might be the right thing to do. Especially if the alternative is to just busy-loop. > If we don't want to return to userland we should wait on the splitting > bit and then take the pte walking routines like if the pmd wasn't > huge. This is not related to the below though. How does this patch sound to people? It does the splitting check before the access bit set (even though I don't think it matters), and at least talks about the alternatives and the issues a bit. Hmm? Linus [-- Attachment #2: mm.patch --] [-- Type: application/octet-stream, Size: 750 bytes --] mm/memory.c | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/mm/memory.c b/mm/memory.c index 49fb1cf08611..f5ec3ae03f44 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3715,6 +3715,18 @@ retry: return do_huge_pmd_numa_page(mm, vma, address, orig_pmd, pmd); + /* + * If the pmd is splitting, return and retry the + * the fault. We *could* set just the accessed flag, + * but it's better to just avoid the races with + * splitting entirely. + * + * Alternative: wait until the split is done, and + * goto retry. + */ + if (pmd_trans_splitting(orig_pmd)) + return 0; + if (dirty && !pmd_write(orig_pmd)) { ret = do_huge_pmd_wp_page(mm, vma, address, pmd, orig_pmd); ^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: oops in copy_page_rep() 2013-01-08 17:51 ` Linus Torvalds @ 2013-01-08 18:03 ` Andrea Arcangeli 2013-01-08 18:21 ` Linus Torvalds 0 siblings, 1 reply; 26+ messages in thread From: Andrea Arcangeli @ 2013-01-08 18:03 UTC (permalink / raw) To: Linus Torvalds Cc: Kirill A. Shutemov, Hillf Danton, Hugh Dickins, Dave Jones, Linux Kernel, Andrew Morton, Mel Gorman, Linux-MM, Rik van Riel On Tue, Jan 08, 2013 at 09:51:47AM -0800, Linus Torvalds wrote: > On Tue, Jan 8, 2013 at 9:37 AM, Andrea Arcangeli <aarcange@redhat.com> wrote: > > > > The reason it returned to userland and retried the fault is that this > > should be infrequent enough not to worry about it and this was > > marginally simpler but it could be changed. > > Yeah, that was my suspicion. And as mentioned, returning to user land > might actually help with scheduling and/or signal handling latencies > etc, so it might be the right thing to do. Especially if the > alternative is to just busy-loop. > > > If we don't want to return to userland we should wait on the splitting > > bit and then take the pte walking routines like if the pmd wasn't > > huge. This is not related to the below though. > > How does this patch sound to people? It does the splitting check > before the access bit set (even though I don't think it matters), and > at least talks about the alternatives and the issues a bit. > > Hmm? It looks very fine to me, but I suggest to move it above the pmd_numa() check because of the newly introduced migrate_misplaced_transhuge_page method relying on pmd_same too. Thanks! ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: oops in copy_page_rep() 2013-01-08 18:03 ` Andrea Arcangeli @ 2013-01-08 18:21 ` Linus Torvalds 2013-01-09 11:38 ` Hillf Danton 0 siblings, 1 reply; 26+ messages in thread From: Linus Torvalds @ 2013-01-08 18:21 UTC (permalink / raw) To: Andrea Arcangeli Cc: Kirill A. Shutemov, Hillf Danton, Hugh Dickins, Dave Jones, Linux Kernel, Andrew Morton, Mel Gorman, Linux-MM, Rik van Riel On Tue, Jan 8, 2013 at 10:03 AM, Andrea Arcangeli <aarcange@redhat.com> wrote: > > It looks very fine to me, but I suggest to move it above the > pmd_numa() check because of the newly introduced > migrate_misplaced_transhuge_page method relying on pmd_same too. Hmm. If we need it there, then we need to fix the *later* case of pmd_numa() too: if (pmd_numa(*pmd)) return do_pmd_numa_page(mm, vma, address, pmd); Also, and more fundamentally, since do_pmd_numa_page() doesn't take the orig_pmd thing as an argument (and re-check it under the page-table lock), testing pmd_trans_splitting() on it is pointless, since it can change later. So no, moving the check up does *not* make sense, at least not without other changes. Because if I read things right, pmd_trans_splitting() really has to be done with the page-table lock protection (where "with page-table lock protection" does *not* mean that it has to be done under the page table lock, but if it is done outside, then the pmd entry has to be re-verified after getting the lock - which both do_huge_pmd_wp_page() and huge_pmd_set_accessed() correctly do). Comments? Linus ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: oops in copy_page_rep() 2013-01-08 18:21 ` Linus Torvalds @ 2013-01-09 11:38 ` Hillf Danton 0 siblings, 0 replies; 26+ messages in thread From: Hillf Danton @ 2013-01-09 11:38 UTC (permalink / raw) To: Linus Torvalds Cc: Andrea Arcangeli, Kirill A. Shutemov, Hugh Dickins, Dave Jones, Linux Kernel, Andrew Morton, Mel Gorman, Linux-MM, Rik van Riel On Wed, Jan 9, 2013 at 2:21 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Tue, Jan 8, 2013 at 10:03 AM, Andrea Arcangeli <aarcange@redhat.com> wrote: >> >> It looks very fine to me, but I suggest to move it above the >> pmd_numa() check because of the newly introduced >> migrate_misplaced_transhuge_page method relying on pmd_same too. > > Hmm. If we need it there, then we need to fix the *later* case of > pmd_numa() too: > > if (pmd_numa(*pmd)) > return do_pmd_numa_page(mm, vma, address, pmd); > > Also, and more fundamentally, since do_pmd_numa_page() doesn't take > the orig_pmd thing as an argument (and re-check it under the > page-table lock), testing pmd_trans_splitting() on it is pointless, > since it can change later. > A splitting pmd has to be huge first, and we do handle huge pmd already in the up dozen of lines. That said, we can igore splitting check in this case, Sir. Hillf > So no, moving the check up does *not* make sense, at least not without > other changes. Because if I read things right, pmd_trans_splitting() > really has to be done with the page-table lock protection (where "with > page-table lock protection" does *not* mean that it has to be done > under the page table lock, but if it is done outside, then the pmd > entry has to be re-verified after getting the lock - which both > do_huge_pmd_wp_page() and huge_pmd_set_accessed() correctly do). > > Comments? > > Linus ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: oops in copy_page_rep() 2013-01-08 17:37 ` Andrea Arcangeli 2013-01-08 17:51 ` Linus Torvalds @ 2013-01-09 4:23 ` Hugh Dickins 1 sibling, 0 replies; 26+ messages in thread From: Hugh Dickins @ 2013-01-09 4:23 UTC (permalink / raw) To: Andrea Arcangeli Cc: Linus Torvalds, Kirill A. Shutemov, Hillf Danton, Dave Jones, Linux Kernel, Andrew Morton, Mel Gorman, Linux-MM, Rik van Riel On Tue, 8 Jan 2013, Andrea Arcangeli wrote: > > Looking at this, one thing that isn't clear is where the page_count is > checked in migrate_misplaced_transhuge_page. Ok that it's unable to > migrate anon transhuge COW shared pages so it doesn't need to mess > with rmap (the mapcount check makes it safe), but it shouldn't be > allowed to migrate memory that has gup direct-IO in flight (and that > can only be detected with a page_count vs mapcount check). Real > migrate does page_freeze_refs to be safe. Mel comments? Yes, I protested to Mel about that before the holidays, and he quickly provided a patch, now in akpm's tree; but checking it again today, I believe it's still not quite right yet - see other mail. Hugh ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: oops in copy_page_rep() 2013-01-08 16:52 ` Linus Torvalds 2013-01-08 17:30 ` Kirill A. Shutemov 2013-01-08 17:37 ` Andrea Arcangeli @ 2013-01-09 11:44 ` Mel Gorman 2 siblings, 0 replies; 26+ messages in thread From: Mel Gorman @ 2013-01-09 11:44 UTC (permalink / raw) To: Linus Torvalds Cc: Kirill A. Shutemov, Hillf Danton, Hugh Dickins, Dave Jones, Linux Kernel, Andrea Arcangeli, Andrew Morton, Linux-MM, Rik van Riel On Tue, Jan 08, 2013 at 08:52:14AM -0800, Linus Torvalds wrote: > On Tue, Jan 8, 2013 at 8:31 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote: > >> > >> Heh. I was more thinking about why do_huge_pmd_wp_page() needs it, but > >> do_huge_pmd_numa_page() does not. > > > > It does. The check should be moved up. > > > >> Also, do we actually need it for huge_pmd_set_accessed()? The > >> *placement* of that thing confuses me. And because it confuses me, I'd > >> like to understand it. > > > > We need it for huge_pmd_set_accessed() too. > > > > Looks like a mis-merge. The original patch for huge_pmd_set_accessed() was > > correct: http://lkml.org/lkml/2012/10/25/402 > > Not a merge error: the pmd_trans_splitting() check was removed by > commit d10e63f29488 ("mm: numa: Create basic numa page hinting > infrastructure"). > > Now, *why* it was removed, I can't tell. And it's not clear why the > original code just had it in a conditional, while the suggested patch > has that "goto repeat" thing. It was a mistake by me to remove it and as I screwed up in October I no longer remember how I managed it. The retry versus "goto repeat" is a detail. By retrying the full fault there is a possibility the split will still be in progress on fault retry or that a new THP is collapsed underneath and a new split started while the mmap_sem is released but both are unlikely. On the other side, taking the anon_vma rwsem for write in wait_split_huge_page() could cause delays elsewhere that would be almost impossible to detect so it is not necessarily better. Retrying the fault as your patch does is reasonable. > I suspect re-trying the fault (which I > assume the original code did) is actually better, because that way you > go through all the "should I reschedule as I return through the > exception" stuff. I dunno. > > Mel, that original patch came from you , although it was based on > previous work by Peter/Ingo/Andrea. Can you walk us through the > history and thinking about the loss of pmd_trans_splitting(). Was it > purely a mistake? It looks intentional. > Mistake. Andrea, Peter and Ingo did not make similar mistakes. Looking at your patch, I also think that the check needs to be made before the call to do_huge_pmd_numa_page() so it can reply on a pmd_same() check to make sure a split did not start before the page table lock was taken. In response you said to Andrea Also, and more fundamentally, since do_pmd_numa_page() doesn't take the orig_pmd thing as an argument (and re-check it under the page-table lock), testing pmd_trans_splitting() on it is pointless, since it can change later. do_pmd_numa_page() is called for a normal PMD that is marked pmd_numa(), not a THP PMD. As the mmap_sem is held it cannot collapse to a THP underneath us after the pmd_trans_huge() check so it should be unnecessary to check pmd_trans_splitting() there. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 26+ messages in thread
end of thread, other threads:[~2013-01-11 14:03 UTC | newest] Thread overview: 26+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2013-01-05 15:22 oops in copy_page_rep() Dave Jones 2013-01-06 3:57 ` Linus Torvalds 2013-01-07 0:37 ` Dave Jones 2013-01-07 3:38 ` Hugh Dickins 2013-01-06 11:55 ` Hillf Danton 2013-01-06 16:10 ` Dave Jones 2013-01-06 19:06 ` Hugh Dickins 2013-01-07 12:24 ` Hillf Danton 2013-01-07 17:34 ` Linus Torvalds 2013-01-08 13:04 ` Hillf Danton 2013-01-08 15:37 ` Linus Torvalds 2013-01-08 16:31 ` Kirill A. Shutemov 2013-01-08 16:52 ` Linus Torvalds 2013-01-08 17:30 ` Kirill A. Shutemov 2013-01-08 17:38 ` Linus Torvalds 2013-01-08 17:49 ` Andrea Arcangeli 2013-01-08 18:03 ` Kirill A. Shutemov 2013-01-11 7:50 ` Simon Jeons 2013-01-11 14:01 ` Andrea Arcangeli 2013-01-08 17:37 ` Andrea Arcangeli 2013-01-08 17:51 ` Linus Torvalds 2013-01-08 18:03 ` Andrea Arcangeli 2013-01-08 18:21 ` Linus Torvalds 2013-01-09 11:38 ` Hillf Danton 2013-01-09 4:23 ` Hugh Dickins 2013-01-09 11:44 ` Mel Gorman
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox