oops in copy_page

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* oops in copy_page_rep()
@ 2013-01-05 15:22 Dave Jones
  2013-01-06  3:57 ` Linus Torvalds
  2013-01-06 11:55 ` Hillf Danton
  0 siblings, 2 replies; 26+ messages in thread
From: Dave Jones @ 2013-01-05 15:22 UTC (permalink / raw)
  To: Linux Kernel; +Cc: Linus Torvalds

I have no idea what happened here, but this is the first time I've seen this one.
This was running a tree pulled yesterday afternoon.

BUG: unable to handle kernel paging request at ffff880100201000
IP: [<ffffffff81333235>] copy_page_rep+0x5/0x10
PGD 1c0c063 PUD cfbff067 PMD cfc01067 PTE 8000000100201160
Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in: nfnetlink_log hidp fuse bnep llc2 rose caif_socket caif af_rxrpc phonet netrom af_key binfmt_misc rfcomm l2tp_ppp l2tp_core pppoe pppox ppp_generic slhc ipt_ULOG scsi_transp
ort_iscsi can_raw nfnetlink ipx x25 p8023 p8022 nfc ax25 decnet rds can_bcm irda crc_ccitt can appletalk atm psnap llc lockd sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6ta
ble_filter ip6_tables snd_hda_codec_realtek btusb snd_hda_intel bluetooth snd_hda_codec usb_debug microcode rfkill snd_pcm serio_raw snd_page_alloc snd_timer pcspkr edac_core snd soundcore r8169 mii vhost_net
 tun macvtap macvlan kvm_amd kvm
CPU 0 
Pid: 3505, comm: trinity-child0 Not tainted 3.8.0-rc2+ #45 Gigabyte Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H
RIP: 0010:[<ffffffff81333235>]  [<ffffffff81333235>] copy_page_rep+0x5/0x10
RSP: 0018:ffff88001ecabd00  EFLAGS: 00010286
RAX: 0000000100201000 RBX: 000000011d215000 RCX: 0000000000000200
RDX: cccccccccccccccd RSI: ffff880100201000 RDI: ffff88011d215000
RBP: ffff88001ecabd98 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000008
R13: 000000000500a050 R14: ffff8800916af080 R15: ffff880095435668
FS:  00007f48a2280740(0000) GS:ffff88012ee00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffff880100201000 CR3: 0000000054eda000 CR4: 00000000000007f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process trinity-child0 (pid: 3505, threadinfo ffff88001ecaa000, task ffff8800a9628000)
Stack:
 ffffffff8119a9c7 ffff88001ecabd28 ffff8800a9628000 000000000157e088
 000000000157e088 000000000157e088 ffff880095435668 ffff8800a9077600
 ffff880095435668 80000001002000e5 ffff8800ba515050 0000000001400000
Call Trace:
 [<ffffffff8119a9c7>] ? do_huge_pmd_wp_page+0x707/0xc00
 [<ffffffff81165f1c>] handle_mm_fault+0x14c/0x590
 [<ffffffff810b35ce>] ? __lock_is_held+0x5e/0x90
 [<ffffffff816a280c>] __do_page_fault+0x15c/0x4e0
 [<ffffffff8100a1b6>] ? native_sched_clock+0x26/0x90
 [<ffffffff810b28e8>] ? trace_hardirqs_off_caller+0x28/0xc0
 [<ffffffff81334cbd>] ? trace_hardirqs_off_thunk+0x3a/0x3c
 [<ffffffff816a2b9e>] do_page_fault+0xe/0x10
 [<ffffffff8169f822>] page_fault+0x22/0x30
Code: 90 90 90 90 90 90 9c fa 65 48 3b 06 75 14 65 48 3b 56 08 75 0d 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 9d 30 c0 c3 b9 00 02 00 00 <f3> 48 a5 c3 0f 1f 80 00 00 00 00 eb ee 66 66 66 90 66 66 66 90 
RIP  [<ffffffff81333235>] copy_page_rep+0x5/0x10
 RSP <ffff88001ecabd00>
CR2: ffff880100201000



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: oops in copy_page_rep()
  2013-01-05 15:22 oops in copy_page_rep() Dave Jones
@ 2013-01-06  3:57 ` Linus Torvalds
  2013-01-07  0:37   ` Dave Jones
  2013-01-06 11:55 ` Hillf Danton
  1 sibling, 1 reply; 26+ messages in thread
From: Linus Torvalds @ 2013-01-06  3:57 UTC (permalink / raw)
  To: Dave Jones, Linux Kernel, Andrea Arcangeli, Hugh Dickins,
	Andrew Morton

Adding more people in case somebody else has any idea. Anybody?

On Sat, Jan 5, 2013 at 7:22 AM, Dave Jones <davej@redhat.com> wrote:
> I have no idea what happened here, but this is the first time I've seen this one.
> This was running a tree pulled yesterday afternoon.
>
> BUG: unable to handle kernel paging request at ffff880100201000

This is %rsi, which is the source for the page copy:

  copy_user_highpage()->
    copy_user_page()->
      copy_page()->
        copy_page_rep

I don't know exactly which copy_user_highpage() case this is from, the
call trace implies this *could* be a hugepage, and those functions do
copy pages individually in a loop too.

> IP: [<ffffffff81333235>] copy_page_rep+0x5/0x10
> PGD 1c0c063 PUD cfbff067 PMD cfc01067 PTE 8000000100201160

Hmm. That PTE looks really odd. If I read the PUD/PMD contents right,
the page tables are for individual pages, but then the PTE doesn't
have the present bit set: other than that it looks like it could be a
valid PTE (NX and global bit set, Accessed and dirty also set, but the
two low bits are clear: present and writable are clear).

 I think it's due to DEBUG_PAGEALLOC, so the (free) page has been
unmapped from the kernel mapping.

But how could a page that is the source of a page fault be free?

> Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> Pid: 3505, comm: trinity-child0 Not tainted 3.8.0-rc2+ #45 Gigabyte Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H
> RIP: 0010:[<ffffffff81333235>]  [<ffffffff81333235>] copy_page_rep+0x5/0x10
> RAX: 0000000100201000 RBX: 000000011d215000 RCX: 0000000000000200

The RCX value is 0x200, so this is the first access to that page. As expected.

> RDX: cccccccccccccccd RSI: ffff880100201000 RDI: ffff88011d215000

RSI (source) and RDI (destination) both look like valid kernel mapping
pages. But RSI isn't mapped, presumably because debug-pagealloc thinks
it is free.

Anybody with any ideas? The call trace indicates a normal page fault
from user space, so..

                   Linus

> Call Trace:
>  [<ffffffff8119a9c7>] ? do_huge_pmd_wp_page+0x707/0xc00
>  [<ffffffff81165f1c>] handle_mm_fault+0x14c/0x590
>  [<ffffffff810b35ce>] ? __lock_is_held+0x5e/0x90
>  [<ffffffff816a280c>] __do_page_fault+0x15c/0x4e0
>  [<ffffffff8100a1b6>] ? native_sched_clock+0x26/0x90
>  [<ffffffff810b28e8>] ? trace_hardirqs_off_caller+0x28/0xc0
>  [<ffffffff81334cbd>] ? trace_hardirqs_off_thunk+0x3a/0x3c
>  [<ffffffff816a2b9e>] do_page_fault+0xe/0x10
>  [<ffffffff8169f822>] page_fault+0x22/0x30
> Code: 90 90 90 90 90 90 9c fa 65 48 3b 06 75 14 65 48 3b 56 08 75 0d 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 9d 30 c0 c3 b9 00 02 00 00 <f3> 48 a5 c3 0f 1f 80 00 00 00 00 eb ee 66 66 66 90 66 66 66 90
> RIP  [<ffffffff81333235>] copy_page_rep+0x5/0x10

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: oops in copy_page_rep()
  2013-01-05 15:22 oops in copy_page_rep() Dave Jones
  2013-01-06  3:57 ` Linus Torvalds
@ 2013-01-06 11:55 ` Hillf Danton
  2013-01-06 16:10   ` Dave Jones
  2013-01-06 19:06   ` Hugh Dickins
  1 sibling, 2 replies; 26+ messages in thread
From: Hillf Danton @ 2013-01-06 11:55 UTC (permalink / raw)
  To: Dave Jones, Linux Kernel, Linus Torvalds

Hi Dave

On Sat, Jan 5, 2013 at 11:22 PM, Dave Jones <davej@redhat.com> wrote:
> I have no idea what happened here, but this is the first time I've seen this one.
> This was running a tree pulled yesterday afternoon.
>
> BUG: unable to handle kernel paging request at ffff880100201000
> IP: [<ffffffff81333235>] copy_page_rep+0x5/0x10
> PGD 1c0c063 PUD cfbff067 PMD cfc01067 PTE 8000000100201160
> Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> Modules linked in: nfnetlink_log hidp fuse bnep llc2 rose caif_socket caif af_rxrpc phonet netrom af_key binfmt_misc rfcomm l2tp_ppp l2tp_core pppoe pppox ppp_generic slhc ipt_ULOG scsi_transp
> ort_iscsi can_raw nfnetlink ipx x25 p8023 p8022 nfc ax25 decnet rds can_bcm irda crc_ccitt can appletalk atm psnap llc lockd sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6ta
> ble_filter ip6_tables snd_hda_codec_realtek btusb snd_hda_intel bluetooth snd_hda_codec usb_debug microcode rfkill snd_pcm serio_raw snd_page_alloc snd_timer pcspkr edac_core snd soundcore r8169 mii vhost_net
>  tun macvtap macvlan kvm_amd kvm
> CPU 0
> Pid: 3505, comm: trinity-child0 Not tainted 3.8.0-rc2+ #45 Gigabyte Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H
> RIP: 0010:[<ffffffff81333235>]  [<ffffffff81333235>] copy_page_rep+0x5/0x10
> RSP: 0018:ffff88001ecabd00  EFLAGS: 00010286
> RAX: 0000000100201000 RBX: 000000011d215000 RCX: 0000000000000200
> RDX: cccccccccccccccd RSI: ffff880100201000 RDI: ffff88011d215000
> RBP: ffff88001ecabd98 R08: 0000000000000001 R09: 0000000000000000
> R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000008
> R13: 000000000500a050 R14: ffff8800916af080 R15: ffff880095435668
> FS:  00007f48a2280740(0000) GS:ffff88012ee00000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: ffff880100201000 CR3: 0000000054eda000 CR4: 00000000000007f0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process trinity-child0 (pid: 3505, threadinfo ffff88001ecaa000, task ffff8800a9628000)
> Stack:
>  ffffffff8119a9c7 ffff88001ecabd28 ffff8800a9628000 000000000157e088
>  000000000157e088 000000000157e088 ffff880095435668 ffff8800a9077600
>  ffff880095435668 80000001002000e5 ffff8800ba515050 0000000001400000
> Call Trace:
>  [<ffffffff8119a9c7>] ? do_huge_pmd_wp_page+0x707/0xc00
>  [<ffffffff81165f1c>] handle_mm_fault+0x14c/0x590
>  [<ffffffff810b35ce>] ? __lock_is_held+0x5e/0x90
>  [<ffffffff816a280c>] __do_page_fault+0x15c/0x4e0
>  [<ffffffff8100a1b6>] ? native_sched_clock+0x26/0x90
>  [<ffffffff810b28e8>] ? trace_hardirqs_off_caller+0x28/0xc0
>  [<ffffffff81334cbd>] ? trace_hardirqs_off_thunk+0x3a/0x3c
>  [<ffffffff816a2b9e>] do_page_fault+0xe/0x10
>  [<ffffffff8169f822>] page_fault+0x22/0x30
> Code: 90 90 90 90 90 90 9c fa 65 48 3b 06 75 14 65 48 3b 56 08 75 0d 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 9d 30 c0 c3 b9 00 02 00 00 <f3> 48 a5 c3 0f 1f 80 00 00 00 00 eb ee 66 66 66 90 66 66 66 90
> RIP  [<ffffffff81333235>] copy_page_rep+0x5/0x10
>  RSP <ffff88001ecabd00>
> CR2: ffff880100201000
>
Would you please try the following patch?

Hillf
---
--- a/mm/memory.c	Sun Jan  6 19:49:50 2013
+++ b/mm/memory.c	Sun Jan  6 19:52:42 2013
@@ -3710,7 +3710,9 @@ retry:
 				return do_huge_pmd_numa_page(mm, vma, address,
 							     orig_pmd, pmd);

-			if (dirty && !pmd_write(orig_pmd)) {
+			if (dirty && !pmd_write(orig_pmd) &&
+					!pmd_trans_splitting(orig_pmd)) {
+
 				ret = do_huge_pmd_wp_page(mm, vma, address, pmd,
 							  orig_pmd);
 				/*
--

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: oops in copy_page_rep()
  2013-01-06 11:55 ` Hillf Danton
@ 2013-01-06 16:10   ` Dave Jones
  2013-01-06 19:06   ` Hugh Dickins
  1 sibling, 0 replies; 26+ messages in thread
From: Dave Jones @ 2013-01-06 16:10 UTC (permalink / raw)
  To: Hillf Danton; +Cc: Linux Kernel, Linus Torvalds

On Sun, Jan 06, 2013 at 07:55:53PM +0800, Hillf Danton wrote:

 > On Sat, Jan 5, 2013 at 11:22 PM, Dave Jones <davej@redhat.com> wrote:
 > > I have no idea what happened here, but this is the first time I've seen this one.
 > > This was running a tree pulled yesterday afternoon.
 > >
 > Would you please try the following patch?

I can, though it took a long time for me to hit this, so I can't guarantee
I'll be able to say with certainty whether it's working.
(perhaps there's a printk we could add for debugging to say "would have crashed here" ?)

(The fuzzer frequently walks into other bugs first, it's unusual that
it ran for the length of time it did when this happened)

	Dave


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: oops in copy_page_rep()
  2013-01-06 11:55 ` Hillf Danton
  2013-01-06 16:10   ` Dave Jones
@ 2013-01-06 19:06   ` Hugh Dickins
  2013-01-07 12:24     ` Hillf Danton
  1 sibling, 1 reply; 26+ messages in thread
From: Hugh Dickins @ 2013-01-06 19:06 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Dave Jones, Linux Kernel, Linus Torvalds, Andrea Arcangeli,
	Andrew Morton, Mel Gorman

On Sun, 6 Jan 2013, Hillf Danton wrote:
> On Sat, Jan 5, 2013 at 11:22 PM, Dave Jones <davej@redhat.com> wrote:
> > I have no idea what happened here, but this is the first time I've seen this one.
> > This was running a tree pulled yesterday afternoon.
> >
> > BUG: unable to handle kernel paging request at ffff880100201000
> > IP: [<ffffffff81333235>] copy_page_rep+0x5/0x10
> > PGD 1c0c063 PUD cfbff067 PMD cfc01067 PTE 8000000100201160
> > Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> > Modules linked in: nfnetlink_log hidp fuse bnep llc2 rose caif_socket caif af_rxrpc phonet netrom af_key binfmt_misc rfcomm l2tp_ppp l2tp_core pppoe pppox ppp_generic slhc ipt_ULOG scsi_transp
> > ort_iscsi can_raw nfnetlink ipx x25 p8023 p8022 nfc ax25 decnet rds can_bcm irda crc_ccitt can appletalk atm psnap llc lockd sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6ta
> > ble_filter ip6_tables snd_hda_codec_realtek btusb snd_hda_intel bluetooth snd_hda_codec usb_debug microcode rfkill snd_pcm serio_raw snd_page_alloc snd_timer pcspkr edac_core snd soundcore r8169 mii vhost_net
> >  tun macvtap macvlan kvm_amd kvm
> > CPU 0
> > Pid: 3505, comm: trinity-child0 Not tainted 3.8.0-rc2+ #45 Gigabyte Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H
> > RIP: 0010:[<ffffffff81333235>]  [<ffffffff81333235>] copy_page_rep+0x5/0x10
> > RSP: 0018:ffff88001ecabd00  EFLAGS: 00010286
> > RAX: 0000000100201000 RBX: 000000011d215000 RCX: 0000000000000200
> > RDX: cccccccccccccccd RSI: ffff880100201000 RDI: ffff88011d215000
> > RBP: ffff88001ecabd98 R08: 0000000000000001 R09: 0000000000000000
> > R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000008
> > R13: 000000000500a050 R14: ffff8800916af080 R15: ffff880095435668
> > FS:  00007f48a2280740(0000) GS:ffff88012ee00000(0000) knlGS:0000000000000000
> > CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > CR2: ffff880100201000 CR3: 0000000054eda000 CR4: 00000000000007f0
> > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > Process trinity-child0 (pid: 3505, threadinfo ffff88001ecaa000, task ffff8800a9628000)
> > Stack:
> >  ffffffff8119a9c7 ffff88001ecabd28 ffff8800a9628000 000000000157e088
> >  000000000157e088 000000000157e088 ffff880095435668 ffff8800a9077600
> >  ffff880095435668 80000001002000e5 ffff8800ba515050 0000000001400000
> > Call Trace:
> >  [<ffffffff8119a9c7>] ? do_huge_pmd_wp_page+0x707/0xc00
> >  [<ffffffff81165f1c>] handle_mm_fault+0x14c/0x590
> >  [<ffffffff810b35ce>] ? __lock_is_held+0x5e/0x90
> >  [<ffffffff816a280c>] __do_page_fault+0x15c/0x4e0
> >  [<ffffffff8100a1b6>] ? native_sched_clock+0x26/0x90
> >  [<ffffffff810b28e8>] ? trace_hardirqs_off_caller+0x28/0xc0
> >  [<ffffffff81334cbd>] ? trace_hardirqs_off_thunk+0x3a/0x3c
> >  [<ffffffff816a2b9e>] do_page_fault+0xe/0x10
> >  [<ffffffff8169f822>] page_fault+0x22/0x30
> > Code: 90 90 90 90 90 90 9c fa 65 48 3b 06 75 14 65 48 3b 56 08 75 0d 65 48 89 1e 65 48 89 4e 08 9d b0 01 c3 9d 30 c0 c3 b9 00 02 00 00 <f3> 48 a5 c3 0f 1f 80 00 00 00 00 eb ee 66 66 66 90 66 66 66 90
> > RIP  [<ffffffff81333235>] copy_page_rep+0x5/0x10
> >  RSP <ffff88001ecabd00>
> > CR2: ffff880100201000
> >
> Would you please try the following patch?
> 
> Hillf
> ---
> --- a/mm/memory.c	Sun Jan  6 19:49:50 2013
> +++ b/mm/memory.c	Sun Jan  6 19:52:42 2013
> @@ -3710,7 +3710,9 @@ retry:
>  				return do_huge_pmd_numa_page(mm, vma, address,
>  							     orig_pmd, pmd);
> 
> -			if (dirty && !pmd_write(orig_pmd)) {
> +			if (dirty && !pmd_write(orig_pmd) &&
> +					!pmd_trans_splitting(orig_pmd)) {
> +
>  				ret = do_huge_pmd_wp_page(mm, vma, address, pmd,
>  							  orig_pmd);
>  				/*
> --

Excellent suggestion!

I don't think it need wait on Dave trying+failing to reproduce his oops,
which strongly suggested that we're involved with a page which had very
recently been split out from a hugepage, and in fact had got freed.

It's clear that 3.7 had an important pmd_trans_splitting(orig_pmd)
check there, which went AWOL in
d10e63f29488 "mm: numa: Create basic numa page hinting infrastructure".
Perhaps intended to be moved into do_huge_pmd_wp_page, but the checks
there are pmd_same against orig_pmd, so vital to get orig_pmd right.

I don't entirely like your patch (or the original code): shouldn't
there be a wait_split_huge_page(), rather than hammering back with
repeated faults until the split has completed?  Or perhaps it makes
little difference.  Let's see what Mel or Andrea suggest.

Hugh

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: oops in copy_page_rep()
  2013-01-06  3:57 ` Linus Torvalds
@ 2013-01-07  0:37   ` Dave Jones
  2013-01-07  3:38     ` Hugh Dickins
  0 siblings, 1 reply; 26+ messages in thread
From: Dave Jones @ 2013-01-07  0:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel, Andrea Arcangeli, Hugh Dickins, Andrew Morton

On Sat, Jan 05, 2013 at 07:57:39PM -0800, Linus Torvalds wrote:
 > Adding more people in case somebody else has any idea. Anybody?
 > 
 > On Sat, Jan 5, 2013 at 7:22 AM, Dave Jones <davej@redhat.com> wrote:
 > > I have no idea what happened here, but this is the first time I've seen this one.
 > > This was running a tree pulled yesterday afternoon.
 > >
 > > BUG: unable to handle kernel paging request at ffff880100201000
 > 
 > This is %rsi, which is the source for the page copy:
 > 
 >   copy_user_highpage()->
 >     copy_user_page()->
 >       copy_page()->
 >         copy_page_rep
 > 
 > I don't know exactly which copy_user_highpage() case this is from, the
 > call trace implies this *could* be a hugepage, and those functions do
 > copy pages individually in a loop too.

investigating the huge page theory a little further I'm a bit confused.
The kernel on that machine has THP enabled, and the cpu supports it (an old amd64), but..

$ cat /sys/kernel/mm/hugepages/hugepages-2048kB/*
0
0
0
0
0
0

I was expecting at least one of those to be non-zero.

/sys/kernel/mm/transparent_hugepage/khugepaged/full_scans and pages_collapsed
are both non-zero, so it's been busy doing _something_.

Is this expected behaviour ?

	Dave


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: oops in copy_page_rep()
  2013-01-07  0:37   ` Dave Jones
@ 2013-01-07  3:38     ` Hugh Dickins
  0 siblings, 0 replies; 26+ messages in thread
From: Hugh Dickins @ 2013-01-07  3:38 UTC (permalink / raw)
  To: Dave Jones; +Cc: Linus Torvalds, Andrea Arcangeli, Andrew Morton, linux-kernel

On Sun, 6 Jan 2013, Dave Jones wrote:
> 
> investigating the huge page theory a little further I'm a bit confused.
> The kernel on that machine has THP enabled, and the cpu supports it (an old amd64), but..
> 
> $ cat /sys/kernel/mm/hugepages/hugepages-2048kB/*
> 0
> 0
> 0
> 0
> 0
> 0
> 
> I was expecting at least one of those to be non-zero.
> 
> /sys/kernel/mm/transparent_hugepage/khugepaged/full_scans and pages_collapsed
> are both non-zero, so it's been busy doing _something_.
> 
> Is this expected behaviour ?

Yes.  /sys/kernel/mm/hugepages/ is all about those dusty old opaque
hugepages you might have got from fs/hugetlbfs and mm/hugetlb.c; whereas
the splitting issue we suspect is peculiar to the dazzling new transparent
hugepages you get from mm/huge_memory.c.

Hugh

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: oops in copy_page_rep()
  2013-01-06 19:06   ` Hugh Dickins
@ 2013-01-07 12:24     ` Hillf Danton
  2013-01-07 17:34       ` Linus Torvalds
  0 siblings, 1 reply; 26+ messages in thread
From: Hillf Danton @ 2013-01-07 12:24 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Dave Jones, Linux Kernel, Linus Torvalds, Andrea Arcangeli,
	Andrew Morton, Mel Gorman

Hello Hugh

On Mon, Jan 7, 2013 at 3:06 AM, Hugh Dickins <hughd@google.com> wrote:
> I don't entirely like your patch (or the original code): shouldn't
> there be a wait_split_huge_page(), rather than hammering back with
> repeated faults until the split has completed?
>
I take another try with waiting added, take a look please.

Hillf
---
--- a/mm/memory.c	Sun Jan  6 19:49:50 2013
+++ b/mm/memory.c	Mon Jan  7 20:16:14 2013
@@ -3710,6 +3710,11 @@ retry:
 				return do_huge_pmd_numa_page(mm, vma, address,
 							     orig_pmd, pmd);

+			/* Check if pmd is stable */
+			if (pmd_trans_splitting(orig_pmd)) {
+				wait_split_huge_page(vma->anon_vma, pmd);
+				goto retry;
+			}
 			if (dirty && !pmd_write(orig_pmd)) {
 				ret = do_huge_pmd_wp_page(mm, vma, address, pmd,
 							  orig_pmd);
--

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: oops in copy_page_rep()
  2013-01-07 12:24     ` Hillf Danton
@ 2013-01-07 17:34       ` Linus Torvalds
  2013-01-08 13:04         ` Hillf Danton
  0 siblings, 1 reply; 26+ messages in thread
From: Linus Torvalds @ 2013-01-07 17:34 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Hugh Dickins, Dave Jones, Linux Kernel, Andrea Arcangeli,
	Andrew Morton, Mel Gorman

On Mon, Jan 7, 2013 at 4:24 AM, Hillf Danton <dhillf@gmail.com> wrote:
>
> I take another try with waiting added, take a look please.

Hmm. Is there some reason we never need to worry about it for the
"pmd_numa()" case just above?

A comment about this all might be a really good idea.

                   Linus

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: oops in copy_page_rep()
  2013-01-07 17:34       ` Linus Torvalds
@ 2013-01-08 13:04         ` Hillf Danton
  2013-01-08 15:37           ` Linus Torvalds
  0 siblings, 1 reply; 26+ messages in thread
From: Hillf Danton @ 2013-01-08 13:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Dave Jones, Linux Kernel, Andrea Arcangeli,
	Andrew Morton, Mel Gorman, Linux-MM

On Tue, Jan 8, 2013 at 1:34 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Mon, Jan 7, 2013 at 4:24 AM, Hillf Danton <dhillf@gmail.com> wrote:
>>
>> I take another try with waiting added, take a look please.
>
> Hmm. Is there some reason we never need to worry about it for the
> "pmd_numa()" case just above?
>
> A comment about this all might be a really good idea.
>
Yes Sir, added.
---
From: Hillf Danton <dhillf@gmail.com>
Subject: [PATCH] mm: restore huge pmd splitting check

Hugh said, it's clear that 3.7 had an important pmd_trans_splitting(orig_pmd)
check there, which went AWOL in
d10e63f29488 "mm: numa: Create basic numa page hinting infrastructure".

It is restored for handling stable page fault, with wait_split_huge_page()
added, as suggested also by Hugh, to avoid reapted faults until the split
has completed.

This work is inspired by the oops reported by Dave Jones at
https://lkml.org/lkml/2013/1/5/115

Signed-off-by: Hillf Danton <dhillf@gmail.com>
---

--- a/mm/memory.c	Sun Jan  6 19:49:50 2013
+++ b/mm/memory.c	Tue Jan  8 20:28:04 2013
@@ -3710,6 +3710,14 @@ retry:
 				return do_huge_pmd_numa_page(mm, vma, address,
 							     orig_pmd, pmd);

+			/*
+			 * Check if pmd is stable
+			 * (numa pmd is stable, see change_huge_pmd())
+			 */
+			if (pmd_trans_splitting(orig_pmd)) {
+				wait_split_huge_page(vma->anon_vma, pmd);
+				goto retry;
+			}
 			if (dirty && !pmd_write(orig_pmd)) {
 				ret = do_huge_pmd_wp_page(mm, vma, address, pmd,
 							  orig_pmd);
--

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: oops in copy_page_rep()
  2013-01-08 13:04         ` Hillf Danton
@ 2013-01-08 15:37           ` Linus Torvalds
  2013-01-08 16:31             ` Kirill A. Shutemov
  0 siblings, 1 reply; 26+ messages in thread
From: Linus Torvalds @ 2013-01-08 15:37 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Hugh Dickins, Dave Jones, Linux Kernel, Andrea Arcangeli,
	Andrew Morton, Mel Gorman, Linux-MM

On Tue, Jan 8, 2013 at 5:04 AM, Hillf Danton <dhillf@gmail.com> wrote:
> On Tue, Jan 8, 2013 at 1:34 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> Hmm. Is there some reason we never need to worry about it for the
>> "pmd_numa()" case just above?
>>
>> A comment about this all might be a really good idea.
>>
> Yes Sir, added.

Heh. I was more thinking about why do_huge_pmd_wp_page() needs it, but
do_huge_pmd_numa_page() does not.

Also, do we actually need it for huge_pmd_set_accessed()? The
*placement* of that thing confuses me. And because it confuses me, I'd
like to understand it.

                  Linus

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: oops in copy_page_rep()
  2013-01-08 15:37           ` Linus Torvalds
@ 2013-01-08 16:31             ` Kirill A. Shutemov
  2013-01-08 16:52               ` Linus Torvalds
  0 siblings, 1 reply; 26+ messages in thread
From: Kirill A. Shutemov @ 2013-01-08 16:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hillf Danton, Hugh Dickins, Dave Jones, Linux Kernel,
	Andrea Arcangeli, Andrew Morton, Mel Gorman, Linux-MM

On Tue, Jan 08, 2013 at 07:37:06AM -0800, Linus Torvalds wrote:
> On Tue, Jan 8, 2013 at 5:04 AM, Hillf Danton <dhillf@gmail.com> wrote:
> > On Tue, Jan 8, 2013 at 1:34 AM, Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> >>
> >> Hmm. Is there some reason we never need to worry about it for the
> >> "pmd_numa()" case just above?
> >>
> >> A comment about this all might be a really good idea.
> >>
> > Yes Sir, added.
> 
> Heh. I was more thinking about why do_huge_pmd_wp_page() needs it, but
> do_huge_pmd_numa_page() does not.

It does. The check should be moved up.

> Also, do we actually need it for huge_pmd_set_accessed()? The
> *placement* of that thing confuses me. And because it confuses me, I'd
> like to understand it.

We need it for huge_pmd_set_accessed() too.

Looks like a mis-merge. The original patch for huge_pmd_set_accessed() was
correct: http://lkml.org/lkml/2012/10/25/402

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: oops in copy_page_rep()
  2013-01-08 16:31             ` Kirill A. Shutemov
@ 2013-01-08 16:52               ` Linus Torvalds
  2013-01-08 17:30                 ` Kirill A. Shutemov
                                   ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: Linus Torvalds @ 2013-01-08 16:52 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Hillf Danton, Hugh Dickins, Dave Jones, Linux Kernel,
	Andrea Arcangeli, Andrew Morton, Mel Gorman, Linux-MM,
	Rik van Riel

On Tue, Jan 8, 2013 at 8:31 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
>>
>> Heh. I was more thinking about why do_huge_pmd_wp_page() needs it, but
>> do_huge_pmd_numa_page() does not.
>
> It does. The check should be moved up.
>
>> Also, do we actually need it for huge_pmd_set_accessed()? The
>> *placement* of that thing confuses me. And because it confuses me, I'd
>> like to understand it.
>
> We need it for huge_pmd_set_accessed() too.
>
> Looks like a mis-merge. The original patch for huge_pmd_set_accessed() was
> correct: http://lkml.org/lkml/2012/10/25/402

Not a merge error: the pmd_trans_splitting() check was removed by
commit d10e63f29488 ("mm: numa: Create basic numa page hinting
infrastructure").

Now, *why* it was removed, I can't tell. And it's not clear why the
original code just had it in a conditional, while the suggested patch
has that "goto repeat" thing. I suspect re-trying the fault (which I
assume the original code did) is actually better, because that way you
go through all the "should I reschedule as I return through the
exception" stuff. I dunno.

Mel, that original patch came from you , although it was based on
previous work by Peter/Ingo/Andrea. Can you walk us through the
history and thinking about the loss of pmd_trans_splitting(). Was it
purely a mistake? It looks intentional.

               Linus

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: oops in copy_page_rep()
  2013-01-08 16:52               ` Linus Torvalds
@ 2013-01-08 17:30                 ` Kirill A. Shutemov
  2013-01-08 17:38                   ` Linus Torvalds
  2013-01-08 17:49                   ` Andrea Arcangeli
  2013-01-08 17:37                 ` Andrea Arcangeli
  2013-01-09 11:44                 ` Mel Gorman
  2 siblings, 2 replies; 26+ messages in thread
From: Kirill A. Shutemov @ 2013-01-08 17:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hillf Danton, Hugh Dickins, Dave Jones, Linux Kernel,
	Andrea Arcangeli, Andrew Morton, Mel Gorman, Linux-MM,
	Rik van Riel

On Tue, Jan 08, 2013 at 08:52:14AM -0800, Linus Torvalds wrote:
> On Tue, Jan 8, 2013 at 8:31 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
> >>
> >> Heh. I was more thinking about why do_huge_pmd_wp_page() needs it, but
> >> do_huge_pmd_numa_page() does not.
> >
> > It does. The check should be moved up.
> >
> >> Also, do we actually need it for huge_pmd_set_accessed()? The
> >> *placement* of that thing confuses me. And because it confuses me, I'd
> >> like to understand it.
> >
> > We need it for huge_pmd_set_accessed() too.
> >
> > Looks like a mis-merge. The original patch for huge_pmd_set_accessed() was
> > correct: http://lkml.org/lkml/2012/10/25/402
> 
> Not a merge error: the pmd_trans_splitting() check was removed by
> commit d10e63f29488 ("mm: numa: Create basic numa page hinting
> infrastructure").

Check difference between patch above and merged one -- a1dd450.
Merged patch is obviously broken: huge_pmd_set_accessed() can be called
only if the pmd is under splitting.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: oops in copy_page_rep()
  2013-01-08 16:52               ` Linus Torvalds
  2013-01-08 17:30                 ` Kirill A. Shutemov
@ 2013-01-08 17:37                 ` Andrea Arcangeli
  2013-01-08 17:51                   ` Linus Torvalds
  2013-01-09  4:23                   ` Hugh Dickins
  2013-01-09 11:44                 ` Mel Gorman
  2 siblings, 2 replies; 26+ messages in thread
From: Andrea Arcangeli @ 2013-01-08 17:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kirill A. Shutemov, Hillf Danton, Hugh Dickins, Dave Jones,
	Linux Kernel, Andrew Morton, Mel Gorman, Linux-MM, Rik van Riel

Hi,

On Tue, Jan 08, 2013 at 08:52:14AM -0800, Linus Torvalds wrote:
> On Tue, Jan 8, 2013 at 8:31 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
> >>
> >> Heh. I was more thinking about why do_huge_pmd_wp_page() needs it, but
> >> do_huge_pmd_numa_page() does not.
> >
> > It does. The check should be moved up.
> >
> >> Also, do we actually need it for huge_pmd_set_accessed()? The
> >> *placement* of that thing confuses me. And because it confuses me, I'd
> >> like to understand it.
> >
> > We need it for huge_pmd_set_accessed() too.
> >
> > Looks like a mis-merge. The original patch for huge_pmd_set_accessed() was
> > correct: http://lkml.org/lkml/2012/10/25/402
> 
> Not a merge error: the pmd_trans_splitting() check was removed by
> commit d10e63f29488 ("mm: numa: Create basic numa page hinting
> infrastructure").
> 
> Now, *why* it was removed, I can't tell. And it's not clear why the
> original code just had it in a conditional, while the suggested patch
> has that "goto repeat" thing. I suspect re-trying the fault (which I
> assume the original code did) is actually better, because that way you
> go through all the "should I reschedule as I return through the
> exception" stuff. I dunno.

The reason it returned to userland and retried the fault is that this
should be infrequent enough not to worry about it and this was
marginally simpler but it could be changed.

If we don't want to return to userland we should wait on the splitting
bit and then take the pte walking routines like if the pmd wasn't
huge. This is not related to the below though.

> Mel, that original patch came from you , although it was based on
> previous work by Peter/Ingo/Andrea. Can you walk us through the
> history and thinking about the loss of pmd_trans_splitting(). Was it
> purely a mistake? It looks intentional.

d10e63f29488 is wrong in removing the pmd_splitting check, I assume
it's an accidental leftover.

My code did this:

@@ -3530,6 +3534,9 @@ retry:
                 */
                orig_pmd = ACCESS_ONCE(*pmd);
                if (pmd_trans_huge(orig_pmd)) {
+                       if (pmd_numa(*pmd))
+                               return huge_pmd_numa_fixup(mm, address,
+                                                          orig_pmd, pmd);
                        if (flags & FAULT_FLAG_WRITE &&
                            !pmd_write(orig_pmd) &&
                            !pmd_trans_splitting(orig_pmd)) {

The function I was calling was this:

+#ifdef CONFIG_AUTONUMA
+/* NUMA hinting page fault entry point for trans huge pmds */
+int huge_pmd_numa_fixup(struct mm_struct *mm, unsigned long addr,
+                       pmd_t pmd, pmd_t *pmdp)
+{
+       struct page *page;
+       bool migrated;
+
+       spin_lock(&mm->page_table_lock);
+       if (unlikely(!pmd_same(pmd, *pmdp)))
+               goto out_unlock;
+
+       page = pmd_page(pmd);
+       pmd = pmd_mknonnuma(pmd);
+       set_pmd_at(mm, addr & HPAGE_PMD_MASK, pmdp, pmd);
+       VM_BUG_ON(pmd_numa(*pmdp));
+       if (unlikely(page_mapcount(page) != 1))
+               goto out_unlock;
+       get_page(page);
+       spin_unlock(&mm->page_table_lock);
+
+       migrated = numa_hinting_fault(page, HPAGE_PMD_NR);
+       if (!migrated)
+               put_page(page);
+
+out:
+       return 0;
+
+out_unlock:
+       spin_unlock(&mm->page_table_lock);
+       goto out;
+}
+#endif

Taking the PT lock, checking pmd_same and clearing the numa bitflag
was perfectly ok even if the pmd was in splitting state the whole
time.

However do_huge_pmd_numa_page is slightly more complex than the above:
the problem is that migrate_misplaced_transhuge_page gets pmdp and pmd
as parameters (unlike the above numa_hinting_fault() function) and it
relies internally to pmd_same too.

	/* Recheck the target PMD */
	spin_lock(&mm->page_table_lock);
	if (unlikely(!pmd_same(*pmd, entry))) {
		spin_unlock(&mm->page_table_lock);
        [..]
	entry = mk_pmd(new_page, vma->vm_page_prot);
	entry = pmd_mknonnuma(entry);
	entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
	entry = pmd_mkhuge(entry);

	page_add_new_anon_rmap(new_page, vma, haddr);

	set_pmd_at(mm, haddr, pmd, entry);

And this kind of mangling isn't ok if the pmd was in splitting state
because split_huge_page won't expect the pmd to change (if the numa
bit changes is ok but the pfn cannot change or split_huge_page_map
will go wrong).

So you're right that if migrate_misplaced_transhuge_page will continue
to do the pmd_same check, we should add the pmd_splitting bit in
memory.c for the pmd_numa() path too.

Looking at this, one thing that isn't clear is where the page_count is
checked in migrate_misplaced_transhuge_page. Ok that it's unable to
migrate anon transhuge COW shared pages so it doesn't need to mess
with rmap (the mapcount check makes it safe), but it shouldn't be
allowed to migrate memory that has gup direct-IO in flight (and that
can only be detected with a page_count vs mapcount check). Real
migrate does page_freeze_refs to be safe. Mel comments?

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: oops in copy_page_rep()
  2013-01-08 17:30                 ` Kirill A. Shutemov
@ 2013-01-08 17:38                   ` Linus Torvalds
  2013-01-08 17:49                   ` Andrea Arcangeli
  1 sibling, 0 replies; 26+ messages in thread
From: Linus Torvalds @ 2013-01-08 17:38 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Hillf Danton, Hugh Dickins, Dave Jones, Linux Kernel,
	Andrea Arcangeli, Andrew Morton, Mel Gorman, Linux-MM,
	Rik van Riel

On Tue, Jan 8, 2013 at 9:30 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> Check difference between patch above and merged one -- a1dd450.
> Merged patch is obviously broken: huge_pmd_set_accessed() can be called
> only if the pmd is under splitting.

Ok, that's a totally different issue, and seems to be due to different
versions (Andrew - any idea why

  http://lkml.org/lkml/2012/10/25/402

and commit a1dd450bcb1a ("mm: thp: set the accessed flag for old pages
on access fault") are different?

That said, I actually think that commit a1dd450bcb1a is correct:
huge_pmd_set_accessed() can not *possibly* need to check the splitting
issue, since it takes the page table lock and re-verifies that the pmd
entry is identical, before just setting the access flags.

So that whole thing is irrelevant. huge_pmd_set_accessed() almost
certainly simply doesn't care about splitting.

But look at commit d10e63f29488. That's the one that removes
pmd_trans_splitting() entirely, and does it for the case that *does*
seem to care, namely do_huge_pmd_wp_page().

                 Linus

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: oops in copy_page_rep()
  2013-01-08 17:30                 ` Kirill A. Shutemov
  2013-01-08 17:38                   ` Linus Torvalds
@ 2013-01-08 17:49                   ` Andrea Arcangeli
  2013-01-08 18:03                     ` Kirill A. Shutemov
  2013-01-11  7:50                     ` Simon Jeons
  1 sibling, 2 replies; 26+ messages in thread
From: Andrea Arcangeli @ 2013-01-08 17:49 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Linus Torvalds, Hillf Danton, Hugh Dickins, Dave Jones,
	Linux Kernel, Andrew Morton, Mel Gorman, Linux-MM, Rik van Riel

Hi Kirill,

On Tue, Jan 08, 2013 at 07:30:58PM +0200, Kirill A. Shutemov wrote:
> Merged patch is obviously broken: huge_pmd_set_accessed() can be called
> only if the pmd is under splitting.

Of course I assume you meant "only if the pmd is not under splitting".

But no, setting a bitflag like the young bit or clearing or setting
the numa bit won't screw with split_huge_page and it's safe even if
the pmd is under splitting.

Those bits are only checked here at the last stage of
split_huge_page_map after taking the PT lock:

	spin_lock(&mm->page_table_lock);
	pmd = page_check_address_pmd(page, mm, address,
				     PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG);
	if (pmd) {
		pgtable = pgtable_trans_huge_withdraw(mm);
		pmd_populate(mm, &_pmd, pgtable);

		haddr = address;
		for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
			pte_t *pte, entry;
			BUG_ON(PageCompound(page+i));
			entry = mk_pte(page + i, vma->vm_page_prot);
			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
			if (!pmd_write(*pmd))
				entry = pte_wrprotect(entry);
			else
				BUG_ON(page_mapcount(page) != 1);
			if (!pmd_young(*pmd))
				entry = pte_mkold(entry);
			if (pmd_numa(*pmd))
				entry = pte_mknuma(entry);
			pte = pte_offset_map(&_pmd, haddr);
			BUG_ON(!pte_none(*pte));
			set_pte_at(mm, haddr, pte, entry);
			pte_unmap(pte);
		}

If "young" or "numa" bitflags changed on the original *pmd for the
previous part of split_huge_page, nothing will go wrong by the time we
get to split_huge_page_map (the same is not true if the pfn changes!).

If you think this is too tricky, we could also decide to forbid
huge_pmd_set_accessed if the pmd is in splitting state, but I don't
think that flipping young/numa bits while in splitting state, can
cause any problem (if done correctly with PT lock + pmd_same).

Thanks!

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: oops in copy_page_rep()
  2013-01-08 17:37                 ` Andrea Arcangeli
@ 2013-01-08 17:51                   ` Linus Torvalds
  2013-01-08 18:03                     ` Andrea Arcangeli
  2013-01-09  4:23                   ` Hugh Dickins
  1 sibling, 1 reply; 26+ messages in thread
From: Linus Torvalds @ 2013-01-08 17:51 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Kirill A. Shutemov, Hillf Danton, Hugh Dickins, Dave Jones,
	Linux Kernel, Andrew Morton, Mel Gorman, Linux-MM, Rik van Riel

[-- Attachment #1: Type: text/plain, Size: 916 bytes --]

On Tue, Jan 8, 2013 at 9:37 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
>
> The reason it returned to userland and retried the fault is that this
> should be infrequent enough not to worry about it and this was
> marginally simpler but it could be changed.

Yeah, that was my suspicion. And as mentioned, returning to user land
might actually help with scheduling and/or signal handling latencies
etc, so it might be the right thing to do.  Especially if the
alternative is to just busy-loop.

> If we don't want to return to userland we should wait on the splitting
> bit and then take the pte walking routines like if the pmd wasn't
> huge. This is not related to the below though.

How does this patch sound to people? It does the splitting check
before the access bit set (even though I don't think it matters), and
at least talks about the alternatives and the issues a bit.

Hmm?

                 Linus

[-- Attachment #2: mm.patch --]
[-- Type: application/octet-stream, Size: 750 bytes --]

 mm/memory.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 49fb1cf08611..f5ec3ae03f44 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3715,6 +3715,18 @@ retry:
 				return do_huge_pmd_numa_page(mm, vma, address,
 							     orig_pmd, pmd);
 
+			/*
+			 * If the pmd is splitting, return and retry the
+			 * the fault. We *could* set just the accessed flag,
+			 * but it's better to just avoid the races with
+			 * splitting entirely.
+			 *
+			 * Alternative: wait until the split is done, and
+			 * goto retry.
+			 */
+			if (pmd_trans_splitting(orig_pmd))
+				return 0;
+
 			if (dirty && !pmd_write(orig_pmd)) {
 				ret = do_huge_pmd_wp_page(mm, vma, address, pmd,
 							  orig_pmd);

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: oops in copy_page_rep()
  2013-01-08 17:49                   ` Andrea Arcangeli
@ 2013-01-08 18:03                     ` Kirill A. Shutemov
  2013-01-11  7:50                     ` Simon Jeons
  1 sibling, 0 replies; 26+ messages in thread
From: Kirill A. Shutemov @ 2013-01-08 18:03 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linus Torvalds, Hillf Danton, Hugh Dickins, Dave Jones,
	Linux Kernel, Andrew Morton, Mel Gorman, Linux-MM, Rik van Riel

On Tue, Jan 08, 2013 at 06:49:51PM +0100, Andrea Arcangeli wrote:
> Hi Kirill,
> 
> On Tue, Jan 08, 2013 at 07:30:58PM +0200, Kirill A. Shutemov wrote:
> > Merged patch is obviously broken: huge_pmd_set_accessed() can be called
> > only if the pmd is under splitting.
> 
> Of course I assume you meant "only if the pmd is not under splitting".

The broken merged patch has this:

+                       if (dirty && !pmd_write(orig_pmd) &&
                            !pmd_trans_splitting(orig_pmd)) {
			[...]
+                       } else {
+                               huge_pmd_set_accessed(mm, vma, address, pmd,
+                                                     orig_pmd, dirty);
                        }

> But no, setting a bitflag like the young bit or clearing or setting
> the numa bit won't screw with split_huge_page and it's safe even if
> the pmd is under splitting.

Okay. Thanks for clarification for me.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: oops in copy_page_rep()
  2013-01-08 17:51                   ` Linus Torvalds
@ 2013-01-08 18:03                     ` Andrea Arcangeli
  2013-01-08 18:21                       ` Linus Torvalds
  0 siblings, 1 reply; 26+ messages in thread
From: Andrea Arcangeli @ 2013-01-08 18:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kirill A. Shutemov, Hillf Danton, Hugh Dickins, Dave Jones,
	Linux Kernel, Andrew Morton, Mel Gorman, Linux-MM, Rik van Riel

On Tue, Jan 08, 2013 at 09:51:47AM -0800, Linus Torvalds wrote:
> On Tue, Jan 8, 2013 at 9:37 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> >
> > The reason it returned to userland and retried the fault is that this
> > should be infrequent enough not to worry about it and this was
> > marginally simpler but it could be changed.
> 
> Yeah, that was my suspicion. And as mentioned, returning to user land
> might actually help with scheduling and/or signal handling latencies
> etc, so it might be the right thing to do.  Especially if the
> alternative is to just busy-loop.
> 
> > If we don't want to return to userland we should wait on the splitting
> > bit and then take the pte walking routines like if the pmd wasn't
> > huge. This is not related to the below though.
> 
> How does this patch sound to people? It does the splitting check
> before the access bit set (even though I don't think it matters), and
> at least talks about the alternatives and the issues a bit.
> 
> Hmm?

It looks very fine to me, but I suggest to move it above the
pmd_numa() check because of the newly introduced
migrate_misplaced_transhuge_page method relying on pmd_same too.

Thanks!

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: oops in copy_page_rep()
  2013-01-08 18:03                     ` Andrea Arcangeli
@ 2013-01-08 18:21                       ` Linus Torvalds
  2013-01-09 11:38                         ` Hillf Danton
  0 siblings, 1 reply; 26+ messages in thread
From: Linus Torvalds @ 2013-01-08 18:21 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Kirill A. Shutemov, Hillf Danton, Hugh Dickins, Dave Jones,
	Linux Kernel, Andrew Morton, Mel Gorman, Linux-MM, Rik van Riel

On Tue, Jan 8, 2013 at 10:03 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
>
> It looks very fine to me, but I suggest to move it above the
> pmd_numa() check because of the newly introduced
> migrate_misplaced_transhuge_page method relying on pmd_same too.

Hmm. If we need it there, then we need to fix the *later* case of
pmd_numa() too:

        if (pmd_numa(*pmd))
                return do_pmd_numa_page(mm, vma, address, pmd);

Also, and more fundamentally, since do_pmd_numa_page() doesn't take
the orig_pmd thing as an argument (and re-check it under the
page-table lock), testing pmd_trans_splitting() on it is pointless,
since it can change later.

So no, moving the check up does *not* make sense, at least not without
other changes. Because if I read things right, pmd_trans_splitting()
really has to be done with the page-table lock protection (where "with
page-table lock protection" does *not* mean that it has to be done
under the page table lock, but if it is done outside, then the pmd
entry has to be re-verified after getting the lock - which both
do_huge_pmd_wp_page() and huge_pmd_set_accessed() correctly do).

Comments?

                Linus

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: oops in copy_page_rep()
  2013-01-08 17:37                 ` Andrea Arcangeli
  2013-01-08 17:51                   ` Linus Torvalds
@ 2013-01-09  4:23                   ` Hugh Dickins
  1 sibling, 0 replies; 26+ messages in thread
From: Hugh Dickins @ 2013-01-09  4:23 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linus Torvalds, Kirill A. Shutemov, Hillf Danton, Dave Jones,
	Linux Kernel, Andrew Morton, Mel Gorman, Linux-MM, Rik van Riel

On Tue, 8 Jan 2013, Andrea Arcangeli wrote:
> 
> Looking at this, one thing that isn't clear is where the page_count is
> checked in migrate_misplaced_transhuge_page. Ok that it's unable to
> migrate anon transhuge COW shared pages so it doesn't need to mess
> with rmap (the mapcount check makes it safe), but it shouldn't be
> allowed to migrate memory that has gup direct-IO in flight (and that
> can only be detected with a page_count vs mapcount check). Real
> migrate does page_freeze_refs to be safe. Mel comments?

Yes, I protested to Mel about that before the holidays, and he
quickly provided a patch, now in akpm's tree; but checking it again
today, I believe it's still not quite right yet - see other mail.

Hugh

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: oops in copy_page_rep()
  2013-01-08 18:21                       ` Linus Torvalds
@ 2013-01-09 11:38                         ` Hillf Danton
  0 siblings, 0 replies; 26+ messages in thread
From: Hillf Danton @ 2013-01-09 11:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrea Arcangeli, Kirill A. Shutemov, Hugh Dickins, Dave Jones,
	Linux Kernel, Andrew Morton, Mel Gorman, Linux-MM, Rik van Riel

On Wed, Jan 9, 2013 at 2:21 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Tue, Jan 8, 2013 at 10:03 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
>>
>> It looks very fine to me, but I suggest to move it above the
>> pmd_numa() check because of the newly introduced
>> migrate_misplaced_transhuge_page method relying on pmd_same too.
>
> Hmm. If we need it there, then we need to fix the *later* case of
> pmd_numa() too:
>
>         if (pmd_numa(*pmd))
>                 return do_pmd_numa_page(mm, vma, address, pmd);
>
> Also, and more fundamentally, since do_pmd_numa_page() doesn't take
> the orig_pmd thing as an argument (and re-check it under the
> page-table lock), testing pmd_trans_splitting() on it is pointless,
> since it can change later.
>
A splitting pmd has to be huge first, and we do handle huge pmd already in
the up dozen of lines.
That said, we can igore splitting check in this case, Sir.

Hillf

> So no, moving the check up does *not* make sense, at least not without
> other changes. Because if I read things right, pmd_trans_splitting()
> really has to be done with the page-table lock protection (where "with
> page-table lock protection" does *not* mean that it has to be done
> under the page table lock, but if it is done outside, then the pmd
> entry has to be re-verified after getting the lock - which both
> do_huge_pmd_wp_page() and huge_pmd_set_accessed() correctly do).
>
> Comments?
>
>                 Linus

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: oops in copy_page_rep()
  2013-01-08 16:52               ` Linus Torvalds
  2013-01-08 17:30                 ` Kirill A. Shutemov
  2013-01-08 17:37                 ` Andrea Arcangeli
@ 2013-01-09 11:44                 ` Mel Gorman
  2 siblings, 0 replies; 26+ messages in thread
From: Mel Gorman @ 2013-01-09 11:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kirill A. Shutemov, Hillf Danton, Hugh Dickins, Dave Jones,
	Linux Kernel, Andrea Arcangeli, Andrew Morton, Linux-MM,
	Rik van Riel

On Tue, Jan 08, 2013 at 08:52:14AM -0800, Linus Torvalds wrote:
> On Tue, Jan 8, 2013 at 8:31 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
> >>
> >> Heh. I was more thinking about why do_huge_pmd_wp_page() needs it, but
> >> do_huge_pmd_numa_page() does not.
> >
> > It does. The check should be moved up.
> >
> >> Also, do we actually need it for huge_pmd_set_accessed()? The
> >> *placement* of that thing confuses me. And because it confuses me, I'd
> >> like to understand it.
> >
> > We need it for huge_pmd_set_accessed() too.
> >
> > Looks like a mis-merge. The original patch for huge_pmd_set_accessed() was
> > correct: http://lkml.org/lkml/2012/10/25/402
> 
> Not a merge error: the pmd_trans_splitting() check was removed by
> commit d10e63f29488 ("mm: numa: Create basic numa page hinting
> infrastructure").
> 
> Now, *why* it was removed, I can't tell. And it's not clear why the
> original code just had it in a conditional, while the suggested patch
> has that "goto repeat" thing.

It was a mistake by me to remove it and as I screwed up in October I no
longer remember how I managed it.

The retry versus "goto repeat" is a detail. By retrying the full fault
there is a possibility the split will still be in progress on fault
retry or that a new THP is collapsed underneath and a new split started
while the mmap_sem is released but both are unlikely. On the other side,
taking the anon_vma rwsem for write in wait_split_huge_page() could cause
delays elsewhere that would be almost impossible to detect so it is not
necessarily better. Retrying the fault as your patch does is reasonable.

> I suspect re-trying the fault (which I
> assume the original code did) is actually better, because that way you
> go through all the "should I reschedule as I return through the
> exception" stuff. I dunno.
> 
> Mel, that original patch came from you , although it was based on
> previous work by Peter/Ingo/Andrea. Can you walk us through the
> history and thinking about the loss of pmd_trans_splitting(). Was it
> purely a mistake? It looks intentional.
> 

Mistake. Andrea, Peter and Ingo did not make similar mistakes.

Looking at your patch, I also think that the check needs to be made before
the call to do_huge_pmd_numa_page() so it can reply on a pmd_same() check
to make sure a split did not start before the page table lock was taken.

In response you said to Andrea

	Also, and more fundamentally, since do_pmd_numa_page() doesn't
	take the orig_pmd thing as an argument (and re-check it under the
	page-table lock), testing pmd_trans_splitting() on it is pointless,
	since it can change later.

do_pmd_numa_page() is called for a normal PMD that is marked pmd_numa(), not
a THP PMD. As the mmap_sem is held it cannot collapse to a THP underneath us
after the pmd_trans_huge() check so it should be unnecessary to check
pmd_trans_splitting() there.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: oops in copy_page_rep()
  2013-01-08 17:49                   ` Andrea Arcangeli
  2013-01-08 18:03                     ` Kirill A. Shutemov
@ 2013-01-11  7:50                     ` Simon Jeons
  2013-01-11 14:01                       ` Andrea Arcangeli
  1 sibling, 1 reply; 26+ messages in thread
From: Simon Jeons @ 2013-01-11  7:50 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Kirill A. Shutemov, Linus Torvalds, Hillf Danton, Hugh Dickins,
	Dave Jones, Linux Kernel, Andrew Morton, Mel Gorman, Linux-MM,
	Rik van Riel

On Tue, 2013-01-08 at 18:49 +0100, Andrea Arcangeli wrote:
> Hi Kirill,
> 
> On Tue, Jan 08, 2013 at 07:30:58PM +0200, Kirill A. Shutemov wrote:
> > Merged patch is obviously broken: huge_pmd_set_accessed() can be called
> > only if the pmd is under splitting.
> 
> Of course I assume you meant "only if the pmd is not under splitting".
> 
> But no, setting a bitflag like the young bit or clearing or setting
> the numa bit won't screw with split_huge_page and it's safe even if
> the pmd is under splitting.
> 
> Those bits are only checked here at the last stage of
> split_huge_page_map after taking the PT lock:
> 
> 	spin_lock(&mm->page_table_lock);
> 	pmd = page_check_address_pmd(page, mm, address,
> 				     PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG);
> 	if (pmd) {
> 		pgtable = pgtable_trans_huge_withdraw(mm);
> 		pmd_populate(mm, &_pmd, pgtable);
> 
> 		haddr = address;
> 		for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
> 			pte_t *pte, entry;
> 			BUG_ON(PageCompound(page+i));
> 			entry = mk_pte(page + i, vma->vm_page_prot);
> 			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> 			if (!pmd_write(*pmd))
> 				entry = pte_wrprotect(entry);
> 			else
> 				BUG_ON(page_mapcount(page) != 1);
> 			if (!pmd_young(*pmd))
> 				entry = pte_mkold(entry);
> 			if (pmd_numa(*pmd))
> 				entry = pte_mknuma(entry);
> 			pte = pte_offset_map(&_pmd, haddr);
> 			BUG_ON(!pte_none(*pte));
> 			set_pte_at(mm, haddr, pte, entry);
> 			pte_unmap(pte);
> 		}
> 
> If "young" or "numa" bitflags changed on the original *pmd for the
> previous part of split_huge_page, nothing will go wrong by the time we
> get to split_huge_page_map (the same is not true if the pfn changes!).
> 

But this time BUG_ON(mapcount != mapcount2) in function
__split_huge_page will be trigged.

> If you think this is too tricky, we could also decide to forbid
> huge_pmd_set_accessed if the pmd is in splitting state, but I don't
> think that flipping young/numa bits while in splitting state, can
> cause any problem (if done correctly with PT lock + pmd_same).
> 
> Thanks!
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: oops in copy_page_rep()
  2013-01-11  7:50                     ` Simon Jeons
@ 2013-01-11 14:01                       ` Andrea Arcangeli
  0 siblings, 0 replies; 26+ messages in thread
From: Andrea Arcangeli @ 2013-01-11 14:01 UTC (permalink / raw)
  To: Simon Jeons
  Cc: Kirill A. Shutemov, Linus Torvalds, Hillf Danton, Hugh Dickins,
	Dave Jones, Linux Kernel, Andrew Morton, Mel Gorman, Linux-MM,
	Rik van Riel

On Fri, Jan 11, 2013 at 01:50:44AM -0600, Simon Jeons wrote:
> On Tue, 2013-01-08 at 18:49 +0100, Andrea Arcangeli wrote:
> > Hi Kirill,
> > 
> > On Tue, Jan 08, 2013 at 07:30:58PM +0200, Kirill A. Shutemov wrote:
> > > Merged patch is obviously broken: huge_pmd_set_accessed() can be called
> > > only if the pmd is under splitting.
> > 
> > Of course I assume you meant "only if the pmd is not under splitting".
> > 
> > But no, setting a bitflag like the young bit or clearing or setting
> > the numa bit won't screw with split_huge_page and it's safe even if
> > the pmd is under splitting.
> > 
> > Those bits are only checked here at the last stage of
> > split_huge_page_map after taking the PT lock:
> > 
> > 	spin_lock(&mm->page_table_lock);
> > 	pmd = page_check_address_pmd(page, mm, address,
> > 				     PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG);
> > 	if (pmd) {
> > 		pgtable = pgtable_trans_huge_withdraw(mm);
> > 		pmd_populate(mm, &_pmd, pgtable);
> > 
> > 		haddr = address;
> > 		for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
> > 			pte_t *pte, entry;
> > 			BUG_ON(PageCompound(page+i));
> > 			entry = mk_pte(page + i, vma->vm_page_prot);
> > 			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> > 			if (!pmd_write(*pmd))
> > 				entry = pte_wrprotect(entry);
> > 			else
> > 				BUG_ON(page_mapcount(page) != 1);
> > 			if (!pmd_young(*pmd))
> > 				entry = pte_mkold(entry);
> > 			if (pmd_numa(*pmd))
> > 				entry = pte_mknuma(entry);
> > 			pte = pte_offset_map(&_pmd, haddr);
> > 			BUG_ON(!pte_none(*pte));
> > 			set_pte_at(mm, haddr, pte, entry);
> > 			pte_unmap(pte);
> > 		}
> > 
> > If "young" or "numa" bitflags changed on the original *pmd for the
> > previous part of split_huge_page, nothing will go wrong by the time we
> > get to split_huge_page_map (the same is not true if the pfn changes!).
> > 
> 
> But this time BUG_ON(mapcount != mapcount2) in function
> __split_huge_page will be trigged.

"young" or "numa" bitflags in the pmd don't alter
rmap/mapcount/pagecount/pfn or anything that could affect such BUG_ON,
so I'm not sure why you think so.

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2013-01-11 14:03 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-01-05 15:22 oops in copy_page_rep() Dave Jones
2013-01-06  3:57 ` Linus Torvalds
2013-01-07  0:37   ` Dave Jones
2013-01-07  3:38     ` Hugh Dickins
2013-01-06 11:55 ` Hillf Danton
2013-01-06 16:10   ` Dave Jones
2013-01-06 19:06   ` Hugh Dickins
2013-01-07 12:24     ` Hillf Danton
2013-01-07 17:34       ` Linus Torvalds
2013-01-08 13:04         ` Hillf Danton
2013-01-08 15:37           ` Linus Torvalds
2013-01-08 16:31             ` Kirill A. Shutemov
2013-01-08 16:52               ` Linus Torvalds
2013-01-08 17:30                 ` Kirill A. Shutemov
2013-01-08 17:38                   ` Linus Torvalds
2013-01-08 17:49                   ` Andrea Arcangeli
2013-01-08 18:03                     ` Kirill A. Shutemov
2013-01-11  7:50                     ` Simon Jeons
2013-01-11 14:01                       ` Andrea Arcangeli
2013-01-08 17:37                 ` Andrea Arcangeli
2013-01-08 17:51                   ` Linus Torvalds
2013-01-08 18:03                     ` Andrea Arcangeli
2013-01-08 18:21                       ` Linus Torvalds
2013-01-09 11:38                         ` Hillf Danton
2013-01-09  4:23                   ` Hugh Dickins
2013-01-09 11:44                 ` Mel Gorman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox