* Re: [BUG] at include/linux/page-flags.h:415 (PageTransHuge) [not found] ` <52D69AB4.6000309@suse.cz> @ 2014-01-15 16:06 ` Daniel Borkmann 2014-01-31 14:40 ` Vlastimil Babka 0 siblings, 1 reply; 7+ messages in thread From: Daniel Borkmann @ 2014-01-15 16:06 UTC (permalink / raw) To: Vlastimil Babka Cc: Andrew Morton, linux-kernel, Michel Lespinasse, linux-mm, Jared Hulbert, netdev [keeping netdev in loop as well] On 01/15/2014 03:27 PM, Vlastimil Babka wrote: > On 01/13/2014 12:39 PM, Daniel Borkmann wrote: >> On 01/13/2014 11:16 AM, Vlastimil Babka wrote: >>> On 01/11/2014 02:32 PM, Daniel Borkmann wrote: >>>> On 01/11/2014 07:22 AM, Andrew Morton wrote: >>>>> On Fri, 10 Jan 2014 19:23:26 +0100 Daniel Borkmann <borkmann@iogearbox.net> wrote: >>>>> >>>>>> This is being reliably triggered for each mmaped() packet(7) >>>>>> socket from user space, basically during unmapping resp. >>>>>> closing the TX socket. >>>>>> >>>>>> I believe due to some change in transparent hugepages code ? >>>>>> >>>>>> When I disable transparent hugepages, everything works fine, >>>>>> no BUG triggered. >>>>>> >>>>>> I'd be happy to test patches. >>>>> >>>>> Did the inclusion of c424be1cbbf852e46acc8 ("mm: munlock: fix a bug >>>>> where THP tail page is encountered") in current mainline fix this? >>>> >>>> Thanks for your answer Andrew! >>>> >>>> Hm, I just cherry-picked that onto current net-next as I have some work >>>> there, and this time I got ... >>>> >>>> (User space uses packet mmap() and mlockall(MCL_CURRENT | MCL_FUTURE) >>>> and on shutdown munlockall() ...) >>>> >>>> [ 63.863672] ------------[ cut here ]------------ >>>> [ 63.863702] kernel BUG at mm/mlock.c:507! >>>> [ 63.863721] invalid opcode: 0000 [#1] SMP >>>> [ 63.863743] Modules linked in: fuse ebtable_nat xt_CHECKSUM nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6table_nat nf_nat_ipv6 ip6table_mangle ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat iptable_mangle nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack bridge ebtable_filter ebtables stp llc ip6table_filter ip6_tables rfcomm bnep snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec iwlwifi cfg80211 snd_hwdep btusb snd_seq bluetooth sdhci_pci snd_seq_device e1000e tpm_tis snd_pcm thinkpad_acpi sdhci ptp tpm uvcvideo pps_core snd_page_alloc snd_timer snd rfkill mmc_core iTCO_wdt iTCO_vendor_support lpc_ich mfd_core soundcore joydev wmi videobuf2_vmalloc videobuf2_memops videobuf2_core i2c_i801 pcspkr videodev media uinput i915 >>>> [ 63.864152] i2c_algo_bit drm_kms_helper drm i2c_core video >>>> [ 63.864181] CPU: 1 PID: 1617 Comm: trafgen Not tainted 3.13.0-rc6+ #15 >>>> [ 63.864209] Hardware name: LENOVO 2429BP3/2429BP3, BIOS G4ET37WW (1.12 ) 05/29/2012 >>>> [ 63.864242] task: ffff8801ee060000 ti: ffff8800b5954000 task.ti: ffff8800b5954000 >>>> [ 63.864274] RIP: 0010:[<ffffffff8116fa9a>] [<ffffffff8116fa9a>] munlock_vma_pages_range+0x2ea/0x2f0 >>>> [ 63.864318] RSP: 0018:ffff8800b5955e08 EFLAGS: 00010202 >>>> [ 63.864341] RAX: 00000000000001ff RBX: ffff8800b58f7508 RCX: 0000000000000034 >>>> [ 63.864372] RDX: 00000007f0708992 RSI: ffffea0002c3e700 RDI: ffffea0002c3e700 >>>> [ 63.864402] RBP: ffff8800b5955ee0 R08: 3800000000000000 R09: a8000b0f9c000000 >>>> [ 63.864432] R10: 57ffdef066c3e700 R11: ffffff5cfb00c14a R12: ffffea0002c3e700 >>>> [ 63.864462] R13: ffff8800b5955f48 R14: 00007f0708992000 R15: 00007f0708992000 >>>> [ 63.864492] FS: 00007f0708b92740(0000) GS:ffff88021e240000(0000) knlGS:0000000000000000 >>>> [ 63.864526] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >>>> [ 63.864551] CR2: 00007f33bb373000 CR3: 00000000b2a2c000 CR4: 00000000001407e0 >>>> [ 63.864581] Stack: >>>> [ 63.864593] ffff8800b5955ed0 00007f0708b91fff 00007f0708b92000 ffff8800b5955e48 >>>> [ 63.864632] 000001ff810c864b ffff8801ee060000 0000000000000000 0000000000000000 >>>> [ 63.864669] ffff8800b5955e58 ffff8801ee060000 0000000700000086 ffff8801ee060000 >>>> [ 63.864708] Call Trace: >>>> [ 63.864724] [<ffffffff816956bc>] ? _raw_spin_unlock_irq+0x2c/0x30 >>>> [ 63.864754] [<ffffffff81171b52>] ? vma_merge+0xc2/0x330 >>>> [ 63.864786] [<ffffffff8116fb9c>] mlock_fixup+0xfc/0x190 >>>> [ 63.864812] [<ffffffff8116fde7>] do_mlockall+0x87/0xc0 >>>> [ 63.864836] [<ffffffff811702df>] sys_munlockall+0x2f/0x50 >>>> [ 63.864873] [<ffffffff8169e192>] system_call_fastpath+0x16/0x1b >>>> [ 63.864898] Code: d7 48 89 95 28 ff ff ff e8 a4 04 fe ff 84 c0 48 8b 95 28 ff ff ff 0f 85 5a ff ff ff e9 46 ff ff ff e8 3f ac 51 00 e8 34 ac 51 00 <0f> 0b 0f 1f 40 00 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 >>>> [ 63.865114] RIP [<ffffffff8116fa9a>] munlock_vma_pages_range+0x2ea/0x2f0 >>>> [ 63.865148] RSP <ffff8800b5955e08> >>>> [ 63.874968] ------------[ cut here ]------------ >>>> >>>> ... when I find some time, I'll try with normal torvalds' tree, maybe some >>>> other patches are missing as well, not sure right now. >>> >>> Uh so the triggered assertion is the one added by this very patch, and there are no more changes wrt this in mainline. >>> >>> If you can still try debug patches, please try this. Thanks. >> >> Yes, thanks, I'll come back to you some time by today. > > Daniel sent me (off-list) instructions to reproduce: > >> Then in the kernel source tree, you'll find: >> >> tools/testing/selftests/net/ >> >> There, just do a 'make' and run ./psock_tpacket > > It reproduces deterministically in mainline since 3.12, i.e. my munlock > performance series. Based on the initial debug output, I've expanded the > debug patch below a bit: > >>> From: Vlastimil Babka <vbabka@suse.cz> >>> Date: Mon, 13 Jan 2014 11:13:53 +0100 >>> Subject: [PATCH] debug munlock_vma_pages_range >>> >>> --- >>> mm/mlock.c | 22 ++++++++++++++++++++-- >>> 1 file changed, 20 insertions(+), 2 deletions(-) >>> >>> diff --git a/mm/mlock.c b/mm/mlock.c >>> index c59c420..7d0e29a 100644 >>> --- a/mm/mlock.c >>> +++ b/mm/mlock.c >>> @@ -448,12 +448,14 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec, >>> void munlock_vma_pages_range(struct vm_area_struct *vma, >>> unsigned long start, unsigned long end) >>> { >>> + unsigned long orig_start = start; >>> + unsigned long page_increm = 0; >>> + >>> vma->vm_flags &= ~VM_LOCKED; >>> >>> while (start < end) { >>> struct page *page = NULL; >>> unsigned int page_mask; >>> - unsigned long page_increm; >>> struct pagevec pvec; >>> struct zone *zone; >>> int zoneid; >>> @@ -504,7 +506,23 @@ void munlock_vma_pages_range(struct vm_area_struct *vma, >>> } >>> } >>> /* It's a bug to munlock in the middle of a THP page */ >>> - VM_BUG_ON((start >> PAGE_SHIFT) & page_mask); >>> + if ((start >> PAGE_SHIFT) & page_mask) { >>> + dump_page(page); >>> + printk("start=%lu pfn=%lu orig_start=%lu " >>> + "prev_page_increm=%lu page_mask=%u " >>> + "vm_start=%lu vm_end=%lu vm_flags=%lu\n", >>> + start, page_to_pfn(page), orig_start, >>> + page_increm, page_mask, >>> + vma->vm_start, vma->vm_end, >>> + vma->vm_flags); > + printk("vm_ops=%pF, open=%pF, fault=%pF, remap_pages=%pF\n", vma->vm_ops, > + vma->vm_ops->open, vma->vm_ops->fault, vma->vm_ops->remap_pages); > + if (PageCompound(page)) { > + printk("page is compound with order=%d\n", compound_order(page)); > + } >>> + if (PageTail(page)) { >>> + struct page *first_page = page->first_page; >>> + printk("first_page pfn=%lu\n", >>> + page_to_pfn(first_page)); >>> + dump_page(first_page); >>> + } >>> + VM_BUG_ON(true); >>> + } >>> page_increm = 1 + page_mask; >>> start += page_increm * PAGE_SIZE; >>> next: >>> > > And got output like this: > > page:ffffea0002474a40 count:5 mapcount:1 mapping: (null) index:0x0 > page flags: 0x100000000004004(referenced|head) > start=140242647736320 pfn=682616 orig_start=140242647736320 prev_page_increm=0 page_mask=511 vm_start=140242647736320 vm_end=140242651930624 vm_flags=268435707 > vm_ops=packet_mmap_ops+0x0/0xfffffffffffff8e0 [af_packet], open=packet_mm_open+0x0/0x30 [af_packet], fault= (null), remap_pages= (null) > page is compound with order=2 > > Observations: > - address 140242647736320 is where the vma starts, and is not aligned to 512 pages > (so it cannot be a THP head which the munlock expects). Yet there is a head page > that triggers the PageTransHuge() and consequently hpage_nr_pages() in munlock_vma_page() > That's why page_mask is determined to be 511 and the code thinks it's in the > middle of a THP page. > - in fact, the page is a compound page with order=2 > - the VM flags (except (may)read/write) are VM_SHARED and VM_MIXEDMAP > - the vma was mmapped by packet_mmap() (net/packet/af_packet.c) which uses > vm_insert_page(), which adds the VM_MIXEDMAP flag > - the buffers that are mapped were allocated by alloc_one_pg_vec_page() > where flags indeed include __GFP_COMP > > So clearly there is a way to have mlock/munlock operate on a vma that contains > compound pages and confuse the checks for PageTransHuge(). > > The checks for THP in munlock came with commit ff6a6da60b89 ("mm: accelerate munlock() > treatment of THP pages"), i.e. since 3.9, but did not trigger a bug. It however > makes munlock_vma_pages_range() skip pages until the next 512-pages-aligned page, > when it encounters a head page. If the head page is of smaller order and is followed > by normal LRU pages (theoretically, I'm not sure if that's possible, or done anywhere), > they wouldn't get munlocked. > > My commit 7225522bb429 ("mm: munlock: batch non-THP page isolation and > munlock+putback using pagevec") (since 3.12) has added a new PageTransHuge() check > that can trigger on tail pages of the compound page here. Commit c424be1cbbf852e46acc8 > ("mm: munlock: fix a bug where THP tail page is encountered") in current rc's removes > one class of bugs here, but still non-THP compound pages are not expected in mlock/munlock, > which leads to this assertion failing. > > The question is what is the correct fix, and I'm not that familiar with VM_MIXEDMAP > to decide. > > Option 1: mlocking VM_MIXEDMAP vma's has no sense. They should be treated like VM_PFNMAP > and added to VM_SPECIAL, which makes m(un)lock skip them completely. > > Option 2: if indeed VM_MIXEDMAP can contain PageLRU pages for which mlocking is useful, > VM_NO_THP should be checked in munlock before attempting PageTransHuge() and > friends. VM_NO_THP already contains VM_MIXEDMAP, so knowing that there can be > no THP means we don't try optimize for it and no unexpected head pages trip us. > > Thoughts? ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [BUG] at include/linux/page-flags.h:415 (PageTransHuge) 2014-01-15 16:06 ` [BUG] at include/linux/page-flags.h:415 (PageTransHuge) Daniel Borkmann @ 2014-01-31 14:40 ` Vlastimil Babka 2014-01-31 14:58 ` Thomas Hellstrom 2014-02-07 18:58 ` Hannes Frederic Sowa 0 siblings, 2 replies; 7+ messages in thread From: Vlastimil Babka @ 2014-01-31 14:40 UTC (permalink / raw) To: Daniel Borkmann Cc: Andrew Morton, linux-kernel, Michel Lespinasse, linux-mm, Jared Hulbert, netdev, Thomas Hellstrom, John David Anglin, HATAYAMA Daisuke, Konstantin Khlebnikov, Carsten Otte, Peter Zijlstra On 01/15/2014 05:06 PM, Daniel Borkmann wrote: > [keeping netdev in loop as well] > > On 01/15/2014 03:27 PM, Vlastimil Babka wrote: >> On 01/13/2014 12:39 PM, Daniel Borkmann wrote: >>> On 01/13/2014 11:16 AM, Vlastimil Babka wrote: >>>> On 01/11/2014 02:32 PM, Daniel Borkmann wrote: >>>>> On 01/11/2014 07:22 AM, Andrew Morton wrote: >>>>>> On Fri, 10 Jan 2014 19:23:26 +0100 Daniel Borkmann <borkmann@iogearbox.net> wrote: >>>>>> >>>>>>> This is being reliably triggered for each mmaped() packet(7) >>>>>>> socket from user space, basically during unmapping resp. >>>>>>> closing the TX socket. >>>>>>> >>>>>>> I believe due to some change in transparent hugepages code ? >>>>>>> >>>>>>> When I disable transparent hugepages, everything works fine, >>>>>>> no BUG triggered. >>>>>>> >>>>>>> I'd be happy to test patches. >>>>>> >>>>>> Did the inclusion of c424be1cbbf852e46acc8 ("mm: munlock: fix a bug >>>>>> where THP tail page is encountered") in current mainline fix this? >>>>> >>>>> Thanks for your answer Andrew! >>>>> >>>>> Hm, I just cherry-picked that onto current net-next as I have some work >>>>> there, and this time I got ... >>>>> >>>>> (User space uses packet mmap() and mlockall(MCL_CURRENT | MCL_FUTURE) >>>>> and on shutdown munlockall() ...) >>>>> >>>>> [ 63.863672] ------------[ cut here ]------------ >>>>> [ 63.863702] kernel BUG at mm/mlock.c:507! >>>>> [ 63.863721] invalid opcode: 0000 [#1] SMP >>>>> [ 63.863743] Modules linked in: fuse ebtable_nat xt_CHECKSUM nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6table_nat nf_nat_ipv6 ip6table_mangle ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat iptable_mangle nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack bridge ebtable_filter ebtables stp llc ip6table_filter ip6_tables rfcomm bnep snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec iwlwifi cfg80211 snd_hwdep btusb snd_seq bluetooth sdhci_pci snd_seq_device e1000e tpm_tis snd_pcm thinkpad_acpi sdhci ptp tpm uvcvideo pps_core snd_page_alloc snd_timer snd rfkill mmc_core iTCO_wdt iTCO_vendor_support lpc_ich mfd_core soundcore joydev wmi videobuf2_vmalloc videobuf2_memops videobuf2_core i2c_i801 pcspkr videodev media uinput i915 >>>>> [ 63.864152] i2c_algo_bit drm_kms_helper drm i2c_core video >>>>> [ 63.864181] CPU: 1 PID: 1617 Comm: trafgen Not tainted 3.13.0-rc6+ #15 >>>>> [ 63.864209] Hardware name: LENOVO 2429BP3/2429BP3, BIOS G4ET37WW (1.12 ) 05/29/2012 >>>>> [ 63.864242] task: ffff8801ee060000 ti: ffff8800b5954000 task.ti: ffff8800b5954000 >>>>> [ 63.864274] RIP: 0010:[<ffffffff8116fa9a>] [<ffffffff8116fa9a>] munlock_vma_pages_range+0x2ea/0x2f0 >>>>> [ 63.864318] RSP: 0018:ffff8800b5955e08 EFLAGS: 00010202 >>>>> [ 63.864341] RAX: 00000000000001ff RBX: ffff8800b58f7508 RCX: 0000000000000034 >>>>> [ 63.864372] RDX: 00000007f0708992 RSI: ffffea0002c3e700 RDI: ffffea0002c3e700 >>>>> [ 63.864402] RBP: ffff8800b5955ee0 R08: 3800000000000000 R09: a8000b0f9c000000 >>>>> [ 63.864432] R10: 57ffdef066c3e700 R11: ffffff5cfb00c14a R12: ffffea0002c3e700 >>>>> [ 63.864462] R13: ffff8800b5955f48 R14: 00007f0708992000 R15: 00007f0708992000 >>>>> [ 63.864492] FS: 00007f0708b92740(0000) GS:ffff88021e240000(0000) knlGS:0000000000000000 >>>>> [ 63.864526] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >>>>> [ 63.864551] CR2: 00007f33bb373000 CR3: 00000000b2a2c000 CR4: 00000000001407e0 >>>>> [ 63.864581] Stack: >>>>> [ 63.864593] ffff8800b5955ed0 00007f0708b91fff 00007f0708b92000 ffff8800b5955e48 >>>>> [ 63.864632] 000001ff810c864b ffff8801ee060000 0000000000000000 0000000000000000 >>>>> [ 63.864669] ffff8800b5955e58 ffff8801ee060000 0000000700000086 ffff8801ee060000 >>>>> [ 63.864708] Call Trace: >>>>> [ 63.864724] [<ffffffff816956bc>] ? _raw_spin_unlock_irq+0x2c/0x30 >>>>> [ 63.864754] [<ffffffff81171b52>] ? vma_merge+0xc2/0x330 >>>>> [ 63.864786] [<ffffffff8116fb9c>] mlock_fixup+0xfc/0x190 >>>>> [ 63.864812] [<ffffffff8116fde7>] do_mlockall+0x87/0xc0 >>>>> [ 63.864836] [<ffffffff811702df>] sys_munlockall+0x2f/0x50 >>>>> [ 63.864873] [<ffffffff8169e192>] system_call_fastpath+0x16/0x1b >>>>> [ 63.864898] Code: d7 48 89 95 28 ff ff ff e8 a4 04 fe ff 84 c0 48 8b 95 28 ff ff ff 0f 85 5a ff ff ff e9 46 ff ff ff e8 3f ac 51 00 e8 34 ac 51 00 <0f> 0b 0f 1f 40 00 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 >>>>> [ 63.865114] RIP [<ffffffff8116fa9a>] munlock_vma_pages_range+0x2ea/0x2f0 >>>>> [ 63.865148] RSP <ffff8800b5955e08> >>>>> [ 63.874968] ------------[ cut here ]------------ >>>>> >>>>> ... when I find some time, I'll try with normal torvalds' tree, maybe some >>>>> other patches are missing as well, not sure right now. >>>> >>>> Uh so the triggered assertion is the one added by this very patch, and there are no more changes wrt this in mainline. >>>> >>>> If you can still try debug patches, please try this. Thanks. >>> >>> Yes, thanks, I'll come back to you some time by today. >> >> Daniel sent me (off-list) instructions to reproduce: >> >>> Then in the kernel source tree, you'll find: >>> >>> tools/testing/selftests/net/ >>> >>> There, just do a 'make' and run ./psock_tpacket >> >> It reproduces deterministically in mainline since 3.12, i.e. my munlock >> performance series. Based on the initial debug output, I've expanded the >> debug patch below a bit: >> >>>> From: Vlastimil Babka <vbabka@suse.cz> >>>> Date: Mon, 13 Jan 2014 11:13:53 +0100 >>>> Subject: [PATCH] debug munlock_vma_pages_range >>>> >>>> --- >>>> mm/mlock.c | 22 ++++++++++++++++++++-- >>>> 1 file changed, 20 insertions(+), 2 deletions(-) >>>> >>>> diff --git a/mm/mlock.c b/mm/mlock.c >>>> index c59c420..7d0e29a 100644 >>>> --- a/mm/mlock.c >>>> +++ b/mm/mlock.c >>>> @@ -448,12 +448,14 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec, >>>> void munlock_vma_pages_range(struct vm_area_struct *vma, >>>> unsigned long start, unsigned long end) >>>> { >>>> + unsigned long orig_start = start; >>>> + unsigned long page_increm = 0; >>>> + >>>> vma->vm_flags &= ~VM_LOCKED; >>>> >>>> while (start < end) { >>>> struct page *page = NULL; >>>> unsigned int page_mask; >>>> - unsigned long page_increm; >>>> struct pagevec pvec; >>>> struct zone *zone; >>>> int zoneid; >>>> @@ -504,7 +506,23 @@ void munlock_vma_pages_range(struct vm_area_struct *vma, >>>> } >>>> } >>>> /* It's a bug to munlock in the middle of a THP page */ >>>> - VM_BUG_ON((start >> PAGE_SHIFT) & page_mask); >>>> + if ((start >> PAGE_SHIFT) & page_mask) { >>>> + dump_page(page); >>>> + printk("start=%lu pfn=%lu orig_start=%lu " >>>> + "prev_page_increm=%lu page_mask=%u " >>>> + "vm_start=%lu vm_end=%lu vm_flags=%lu\n", >>>> + start, page_to_pfn(page), orig_start, >>>> + page_increm, page_mask, >>>> + vma->vm_start, vma->vm_end, >>>> + vma->vm_flags); >> + printk("vm_ops=%pF, open=%pF, fault=%pF, remap_pages=%pF\n", vma->vm_ops, >> + vma->vm_ops->open, vma->vm_ops->fault, vma->vm_ops->remap_pages); >> + if (PageCompound(page)) { >> + printk("page is compound with order=%d\n", compound_order(page)); >> + } >>>> + if (PageTail(page)) { >>>> + struct page *first_page = page->first_page; >>>> + printk("first_page pfn=%lu\n", >>>> + page_to_pfn(first_page)); >>>> + dump_page(first_page); >>>> + } >>>> + VM_BUG_ON(true); >>>> + } >>>> page_increm = 1 + page_mask; >>>> start += page_increm * PAGE_SIZE; >>>> next: >>>> >> >> And got output like this: >> >> page:ffffea0002474a40 count:5 mapcount:1 mapping: (null) index:0x0 >> page flags: 0x100000000004004(referenced|head) >> start=140242647736320 pfn=682616 orig_start=140242647736320 prev_page_increm=0 page_mask=511 vm_start=140242647736320 vm_end=140242651930624 vm_flags=268435707 >> vm_ops=packet_mmap_ops+0x0/0xfffffffffffff8e0 [af_packet], open=packet_mm_open+0x0/0x30 [af_packet], fault= (null), remap_pages= (null) >> page is compound with order=2 >> >> Observations: >> - address 140242647736320 is where the vma starts, and is not aligned to 512 pages >> (so it cannot be a THP head which the munlock expects). Yet there is a head page >> that triggers the PageTransHuge() and consequently hpage_nr_pages() in munlock_vma_page() >> That's why page_mask is determined to be 511 and the code thinks it's in the >> middle of a THP page. >> - in fact, the page is a compound page with order=2 >> - the VM flags (except (may)read/write) are VM_SHARED and VM_MIXEDMAP >> - the vma was mmapped by packet_mmap() (net/packet/af_packet.c) which uses >> vm_insert_page(), which adds the VM_MIXEDMAP flag >> - the buffers that are mapped were allocated by alloc_one_pg_vec_page() >> where flags indeed include __GFP_COMP >> >> So clearly there is a way to have mlock/munlock operate on a vma that contains >> compound pages and confuse the checks for PageTransHuge(). >> >> The checks for THP in munlock came with commit ff6a6da60b89 ("mm: accelerate munlock() >> treatment of THP pages"), i.e. since 3.9, but did not trigger a bug. It however >> makes munlock_vma_pages_range() skip pages until the next 512-pages-aligned page, >> when it encounters a head page. If the head page is of smaller order and is followed >> by normal LRU pages (theoretically, I'm not sure if that's possible, or done anywhere), >> they wouldn't get munlocked. >> >> My commit 7225522bb429 ("mm: munlock: batch non-THP page isolation and >> munlock+putback using pagevec") (since 3.12) has added a new PageTransHuge() check >> that can trigger on tail pages of the compound page here. Commit c424be1cbbf852e46acc8 >> ("mm: munlock: fix a bug where THP tail page is encountered") in current rc's removes >> one class of bugs here, but still non-THP compound pages are not expected in mlock/munlock, >> which leads to this assertion failing. >> >> The question is what is the correct fix, and I'm not that familiar with VM_MIXEDMAP >> to decide. >> >> Option 1: mlocking VM_MIXEDMAP vma's has no sense. They should be treated like VM_PFNMAP >> and added to VM_SPECIAL, which makes m(un)lock skip them completely. >> >> Option 2: if indeed VM_MIXEDMAP can contain PageLRU pages for which mlocking is useful, >> VM_NO_THP should be checked in munlock before attempting PageTransHuge() and >> friends. VM_NO_THP already contains VM_MIXEDMAP, so knowing that there can be >> no THP means we don't try optimize for it and no unexpected head pages trip us. >> >> Thoughts? OK, here's a RFC patch to hopefully help get us somewhere. I went for Option1, as I didn't see anyone using VM_MIXEDMAP also for LRU pages, and Option2 was ugly to implement and also seemed quite arbitrary. I'm not sure if making VM_MIXEDMAP also non-mergeable this way is an issue though. Actually a nice third option would be that the vma in question would include also another flag (other than VM_MIXEDMAP) that would declare it can contain compound pages. But I didn't find any such rule documented, perhaps VM_IO would be a candidate? I added people that touched VM_MIXEDMAP in the past to CC. -------8<------- From: Vlastimil Babka <vbabka@suse.cz> Date: Fri, 31 Jan 2014 11:50:21 +0100 Subject: [PATCH] mm: include VM_MIXEDMAP flag in the VM_SPECIAL list to avoid m(un)locking Daniel Borkmann reported a bug with VM_BUG_ON assertions failing where munlock_vma_pages_range() thinks it's unexpectedly in the middle of a THP page. This can be reproduced in tools/testing/selftests/net/ by running make and then ./psock_tpacket. The problem is that an order=2 compound page (allocated by alloc_one_pg_vec_page() is part of the munlocked VM_MIXEDMAP vma (mapped by packet_mmap()) and mistaken for a THP page and assumed to be order=9. The checks for THP in munlock came with commit ff6a6da60b89 ("mm: accelerate munlock() treatment of THP pages"), i.e. since 3.9, but did not trigger a bug. It just makes munlock_vma_pages_range() skip such compound pages until the next 512-pages-aligned page, when it encounters a head page. This is however not a problem for vma's where mlocking has no effect anyway, but it can distort the accounting. Since commit 7225522bb ("mm: munlock: batch non-THP page isolation and munlock+putback using pagevec") this can trigger a VM_BUG_ON in PageTransHuge() check. This patch fixes the issue by adding VM_MIXEDMAP flag to VM_SPECIAL - a list of flags that make vma's non-mlockable and non-mergeable. The reasoning is that VM_MIXEDMAP vma's are similar to VM_PFNMAP, which is already on the VM_SPECIAL list, and both are intended for non-LRU pages where mlocking makes no sense anyway. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Daniel Borkmann <borkmann@iogearbox.net> Cc: Thomas Hellstrom <thellstrom@vmware.com> Cc: John David Anglin <dave.anglin@bell.net> Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Carsten Otte <cotte@de.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Jared Hulbert <jaredeh@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> --- include/linux/mm.h | 2 +- mm/huge_memory.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index f28f46e..f9b04ac 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -175,7 +175,7 @@ extern unsigned int kobjsize(const void *objp); * Special vmas that are non-mergable, non-mlock()able. * Note: mm/huge_memory.c VM_NO_THP depends on this definition. */ -#define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_PFNMAP) +#define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_PFNMAP | VM_MIXEDMAP) /* * mapping from the currently active vm_flags protection bits (the diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 82166bf..f32fffb 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1963,7 +1963,7 @@ out: return ret; } -#define VM_NO_THP (VM_SPECIAL|VM_MIXEDMAP|VM_HUGETLB|VM_SHARED|VM_MAYSHARE) +#define VM_NO_THP (VM_SPECIAL|VM_HUGETLB|VM_SHARED|VM_MAYSHARE) int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags, int advice) -- 1.8.4.5 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [BUG] at include/linux/page-flags.h:415 (PageTransHuge) 2014-01-31 14:40 ` Vlastimil Babka @ 2014-01-31 14:58 ` Thomas Hellstrom 2014-01-31 15:25 ` Vlastimil Babka 2014-02-07 18:58 ` Hannes Frederic Sowa 1 sibling, 1 reply; 7+ messages in thread From: Thomas Hellstrom @ 2014-01-31 14:58 UTC (permalink / raw) To: Vlastimil Babka Cc: Daniel Borkmann, Andrew Morton, linux-kernel, Michel Lespinasse, linux-mm, Jared Hulbert, netdev, John David Anglin, HATAYAMA Daisuke, Konstantin Khlebnikov, Carsten Otte, Peter Zijlstra On 01/31/2014 03:40 PM, Vlastimil Babka wrote: > On 01/15/2014 05:06 PM, Daniel Borkmann wrote: >> [keeping netdev in loop as well] >> >> On 01/15/2014 03:27 PM, Vlastimil Babka wrote: >>> On 01/13/2014 12:39 PM, Daniel Borkmann wrote: >>>> On 01/13/2014 11:16 AM, Vlastimil Babka wrote: >>>>> On 01/11/2014 02:32 PM, Daniel Borkmann wrote: >>>>>> On 01/11/2014 07:22 AM, Andrew Morton wrote: >>>>>>> On Fri, 10 Jan 2014 19:23:26 +0100 Daniel Borkmann <borkmann@iogearbox.net> wrote: >>>>>>> >>>>>>>> This is being reliably triggered for each mmaped() packet(7) >>>>>>>> socket from user space, basically during unmapping resp. >>>>>>>> closing the TX socket. >>>>>>>> >>>>>>>> I believe due to some change in transparent hugepages code ? >>>>>>>> >>>>>>>> When I disable transparent hugepages, everything works fine, >>>>>>>> no BUG triggered. >>>>>>>> >>>>>>>> I'd be happy to test patches. >>>>>>> Did the inclusion of c424be1cbbf852e46acc8 ("mm: munlock: fix a bug >>>>>>> where THP tail page is encountered") in current mainline fix this? >>>>>> Thanks for your answer Andrew! >>>>>> >>>>>> Hm, I just cherry-picked that onto current net-next as I have some work >>>>>> there, and this time I got ... >>>>>> >>>>>> (User space uses packet mmap() and mlockall(MCL_CURRENT | MCL_FUTURE) >>>>>> and on shutdown munlockall() ...) >>>>>> >>>>>> [ 63.863672] ------------[ cut here ]------------ >>>>>> [ 63.863702] kernel BUG at mm/mlock.c:507! >>>>>> [ 63.863721] invalid opcode: 0000 [#1] SMP >>>>>> [ 63.863743] Modules linked in: fuse ebtable_nat xt_CHECKSUM nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6table_nat nf_nat_ipv6 ip6table_mangle ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat iptable_mangle nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack bridge ebtable_filter ebtables stp llc ip6table_filter ip6_tables rfcomm bnep snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec iwlwifi cfg80211 snd_hwdep btusb snd_seq bluetooth sdhci_pci snd_seq_device e1000e tpm_tis snd_pcm thinkpad_acpi sdhci ptp tpm uvcvideo pps_core snd_page_alloc snd_timer snd rfkill mmc_core iTCO_wdt iTCO_vendor_support lpc_ich mfd_core soundcore joydev wmi videobuf2_vmalloc videobuf2_memops videobuf2_core i2c_i801 pcspkr videode v media uinput i915 >>>>>> [ 63.864152] i2c_algo_bit drm_kms_helper drm i2c_core video >>>>>> [ 63.864181] CPU: 1 PID: 1617 Comm: trafgen Not tainted 3.13.0-rc6+ #15 >>>>>> [ 63.864209] Hardware name: LENOVO 2429BP3/2429BP3, BIOS G4ET37WW (1.12 ) 05/29/2012 >>>>>> [ 63.864242] task: ffff8801ee060000 ti: ffff8800b5954000 task.ti: ffff8800b5954000 >>>>>> [ 63.864274] RIP: 0010:[<ffffffff8116fa9a>] [<ffffffff8116fa9a>] munlock_vma_pages_range+0x2ea/0x2f0 >>>>>> [ 63.864318] RSP: 0018:ffff8800b5955e08 EFLAGS: 00010202 >>>>>> [ 63.864341] RAX: 00000000000001ff RBX: ffff8800b58f7508 RCX: 0000000000000034 >>>>>> [ 63.864372] RDX: 00000007f0708992 RSI: ffffea0002c3e700 RDI: ffffea0002c3e700 >>>>>> [ 63.864402] RBP: ffff8800b5955ee0 R08: 3800000000000000 R09: a8000b0f9c000000 >>>>>> [ 63.864432] R10: 57ffdef066c3e700 R11: ffffff5cfb00c14a R12: ffffea0002c3e700 >>>>>> [ 63.864462] R13: ffff8800b5955f48 R14: 00007f0708992000 R15: 00007f0708992000 >>>>>> [ 63.864492] FS: 00007f0708b92740(0000) GS:ffff88021e240000(0000) knlGS:0000000000000000 >>>>>> [ 63.864526] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >>>>>> [ 63.864551] CR2: 00007f33bb373000 CR3: 00000000b2a2c000 CR4: 00000000001407e0 >>>>>> [ 63.864581] Stack: >>>>>> [ 63.864593] ffff8800b5955ed0 00007f0708b91fff 00007f0708b92000 ffff8800b5955e48 >>>>>> [ 63.864632] 000001ff810c864b ffff8801ee060000 0000000000000000 0000000000000000 >>>>>> [ 63.864669] ffff8800b5955e58 ffff8801ee060000 0000000700000086 ffff8801ee060000 >>>>>> [ 63.864708] Call Trace: >>>>>> [ 63.864724] [<ffffffff816956bc>] ? _raw_spin_unlock_irq+0x2c/0x30 >>>>>> [ 63.864754] [<ffffffff81171b52>] ? vma_merge+0xc2/0x330 >>>>>> [ 63.864786] [<ffffffff8116fb9c>] mlock_fixup+0xfc/0x190 >>>>>> [ 63.864812] [<ffffffff8116fde7>] do_mlockall+0x87/0xc0 >>>>>> [ 63.864836] [<ffffffff811702df>] sys_munlockall+0x2f/0x50 >>>>>> [ 63.864873] [<ffffffff8169e192>] system_call_fastpath+0x16/0x1b >>>>>> [ 63.864898] Code: d7 48 89 95 28 ff ff ff e8 a4 04 fe ff 84 c0 48 8b 95 28 ff ff ff 0f 85 5a ff ff ff e9 46 ff ff ff e8 3f ac 51 00 e8 34 ac 51 00 <0f> 0b 0f 1f 40 00 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 >>>>>> [ 63.865114] RIP [<ffffffff8116fa9a>] munlock_vma_pages_range+0x2ea/0x2f0 >>>>>> [ 63.865148] RSP <ffff8800b5955e08> >>>>>> [ 63.874968] ------------[ cut here ]------------ >>>>>> >>>>>> ... when I find some time, I'll try with normal torvalds' tree, maybe some >>>>>> other patches are missing as well, not sure right now. >>>>> Uh so the triggered assertion is the one added by this very patch, and there are no more changes wrt this in mainline. >>>>> >>>>> If you can still try debug patches, please try this. Thanks. >>>> Yes, thanks, I'll come back to you some time by today. >>> Daniel sent me (off-list) instructions to reproduce: >>> >>>> Then in the kernel source tree, you'll find: >>>> >>>> tools/testing/selftests/net/ >>>> >>>> There, just do a 'make' and run ./psock_tpacket >>> It reproduces deterministically in mainline since 3.12, i.e. my munlock >>> performance series. Based on the initial debug output, I've expanded the >>> debug patch below a bit: >>> >>>>> From: Vlastimil Babka <vbabka@suse.cz> >>>>> Date: Mon, 13 Jan 2014 11:13:53 +0100 >>>>> Subject: [PATCH] debug munlock_vma_pages_range >>>>> >>>>> --- >>>>> mm/mlock.c | 22 ++++++++++++++++++++-- >>>>> 1 file changed, 20 insertions(+), 2 deletions(-) >>>>> >>>>> diff --git a/mm/mlock.c b/mm/mlock.c >>>>> index c59c420..7d0e29a 100644 >>>>> --- a/mm/mlock.c >>>>> +++ b/mm/mlock.c >>>>> @@ -448,12 +448,14 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec, >>>>> void munlock_vma_pages_range(struct vm_area_struct *vma, >>>>> unsigned long start, unsigned long end) >>>>> { >>>>> + unsigned long orig_start = start; >>>>> + unsigned long page_increm = 0; >>>>> + >>>>> vma->vm_flags &= ~VM_LOCKED; >>>>> >>>>> while (start < end) { >>>>> struct page *page = NULL; >>>>> unsigned int page_mask; >>>>> - unsigned long page_increm; >>>>> struct pagevec pvec; >>>>> struct zone *zone; >>>>> int zoneid; >>>>> @@ -504,7 +506,23 @@ void munlock_vma_pages_range(struct vm_area_struct *vma, >>>>> } >>>>> } >>>>> /* It's a bug to munlock in the middle of a THP page */ >>>>> - VM_BUG_ON((start >> PAGE_SHIFT) & page_mask); >>>>> + if ((start >> PAGE_SHIFT) & page_mask) { >>>>> + dump_page(page); >>>>> + printk("start=%lu pfn=%lu orig_start=%lu " >>>>> + "prev_page_increm=%lu page_mask=%u " >>>>> + "vm_start=%lu vm_end=%lu vm_flags=%lu\n", >>>>> + start, page_to_pfn(page), orig_start, >>>>> + page_increm, page_mask, >>>>> + vma->vm_start, vma->vm_end, >>>>> + vma->vm_flags); >>> + printk("vm_ops=%pF, open=%pF, fault=%pF, remap_pages=%pF\n", vma->vm_ops, >>> + vma->vm_ops->open, vma->vm_ops->fault, vma->vm_ops->remap_pages); >>> + if (PageCompound(page)) { >>> + printk("page is compound with order=%d\n", compound_order(page)); >>> + } >>>>> + if (PageTail(page)) { >>>>> + struct page *first_page = page->first_page; >>>>> + printk("first_page pfn=%lu\n", >>>>> + page_to_pfn(first_page)); >>>>> + dump_page(first_page); >>>>> + } >>>>> + VM_BUG_ON(true); >>>>> + } >>>>> page_increm = 1 + page_mask; >>>>> start += page_increm * PAGE_SIZE; >>>>> next: >>>>> >>> And got output like this: >>> >>> page:ffffea0002474a40 count:5 mapcount:1 mapping: (null) index:0x0 >>> page flags: 0x100000000004004(referenced|head) >>> start=140242647736320 pfn=682616 orig_start=140242647736320 prev_page_increm=0 page_mask=511 vm_start=140242647736320 vm_end=140242651930624 vm_flags=268435707 >>> vm_ops=packet_mmap_ops+0x0/0xfffffffffffff8e0 [af_packet], open=packet_mm_open+0x0/0x30 [af_packet], fault= (null), remap_pages= (null) >>> page is compound with order=2 >>> >>> Observations: >>> - address 140242647736320 is where the vma starts, and is not aligned to 512 pages >>> (so it cannot be a THP head which the munlock expects). Yet there is a head page >>> that triggers the PageTransHuge() and consequently hpage_nr_pages() in munlock_vma_page() >>> That's why page_mask is determined to be 511 and the code thinks it's in the >>> middle of a THP page. >>> - in fact, the page is a compound page with order=2 >>> - the VM flags (except (may)read/write) are VM_SHARED and VM_MIXEDMAP >>> - the vma was mmapped by packet_mmap() (net/packet/af_packet.c) which uses >>> vm_insert_page(), which adds the VM_MIXEDMAP flag >>> - the buffers that are mapped were allocated by alloc_one_pg_vec_page() >>> where flags indeed include __GFP_COMP >>> >>> So clearly there is a way to have mlock/munlock operate on a vma that contains >>> compound pages and confuse the checks for PageTransHuge(). >>> >>> The checks for THP in munlock came with commit ff6a6da60b89 ("mm: accelerate munlock() >>> treatment of THP pages"), i.e. since 3.9, but did not trigger a bug. It however >>> makes munlock_vma_pages_range() skip pages until the next 512-pages-aligned page, >>> when it encounters a head page. If the head page is of smaller order and is followed >>> by normal LRU pages (theoretically, I'm not sure if that's possible, or done anywhere), >>> they wouldn't get munlocked. >>> >>> My commit 7225522bb429 ("mm: munlock: batch non-THP page isolation and >>> munlock+putback using pagevec") (since 3.12) has added a new PageTransHuge() check >>> that can trigger on tail pages of the compound page here. Commit c424be1cbbf852e46acc8 >>> ("mm: munlock: fix a bug where THP tail page is encountered") in current rc's removes >>> one class of bugs here, but still non-THP compound pages are not expected in mlock/munlock, >>> which leads to this assertion failing. >>> >>> The question is what is the correct fix, and I'm not that familiar with VM_MIXEDMAP >>> to decide. >>> >>> Option 1: mlocking VM_MIXEDMAP vma's has no sense. They should be treated like VM_PFNMAP >>> and added to VM_SPECIAL, which makes m(un)lock skip them completely. >>> >>> Option 2: if indeed VM_MIXEDMAP can contain PageLRU pages for which mlocking is useful, >>> VM_NO_THP should be checked in munlock before attempting PageTransHuge() and >>> friends. VM_NO_THP already contains VM_MIXEDMAP, so knowing that there can be >>> no THP means we don't try optimize for it and no unexpected head pages trip us. >>> >>> Thoughts? > OK, here's a RFC patch to hopefully help get us somewhere. I went for > Option1, as I didn't see anyone using VM_MIXEDMAP also for LRU pages, > and Option2 was ugly to implement and also seemed quite arbitrary. I'm > not sure if making VM_MIXEDMAP also non-mergeable this way is an issue > though. > > Hi! Forgive an ignorant question, but are anonymous COW'd pages LRU pages? The reason I'm asking is that TTM VM_MIXEDMAP vmas may contain such pages. /Thomas -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [BUG] at include/linux/page-flags.h:415 (PageTransHuge) 2014-01-31 14:58 ` Thomas Hellstrom @ 2014-01-31 15:25 ` Vlastimil Babka 2014-01-31 15:35 ` Thomas Hellstrom 0 siblings, 1 reply; 7+ messages in thread From: Vlastimil Babka @ 2014-01-31 15:25 UTC (permalink / raw) To: Thomas Hellstrom Cc: Daniel Borkmann, Andrew Morton, linux-kernel, Michel Lespinasse, linux-mm, Jared Hulbert, netdev, John David Anglin, HATAYAMA Daisuke, Konstantin Khlebnikov, Carsten Otte, Peter Zijlstra On 01/31/2014 03:58 PM, Thomas Hellstrom wrote: > On 01/31/2014 03:40 PM, Vlastimil Babka wrote: >> On 01/15/2014 05:06 PM, Daniel Borkmann wrote: >>> [keeping netdev in loop as well] >>> >>> On 01/15/2014 03:27 PM, Vlastimil Babka wrote: >>>> On 01/13/2014 12:39 PM, Daniel Borkmann wrote: >>>>> On 01/13/2014 11:16 AM, Vlastimil Babka wrote: >>>>>> On 01/11/2014 02:32 PM, Daniel Borkmann wrote: >>>>>>> On 01/11/2014 07:22 AM, Andrew Morton wrote: >>>>>>>> On Fri, 10 Jan 2014 19:23:26 +0100 Daniel Borkmann <borkmann@iogearbox.net> wrote: >>>>>>>> >>>>>>>>> This is being reliably triggered for each mmaped() packet(7) >>>>>>>>> socket from user space, basically during unmapping resp. >>>>>>>>> closing the TX socket. >>>>>>>>> >>>>>>>>> I believe due to some change in transparent hugepages code ? >>>>>>>>> >>>>>>>>> When I disable transparent hugepages, everything works fine, >>>>>>>>> no BUG triggered. >>>>>>>>> >>>>>>>>> I'd be happy to test patches. >>>>>>>> Did the inclusion of c424be1cbbf852e46acc8 ("mm: munlock: fix a bug >>>>>>>> where THP tail page is encountered") in current mainline fix this? >>>>>>> Thanks for your answer Andrew! >>>>>>> >>>>>>> Hm, I just cherry-picked that onto current net-next as I have some work >>>>>>> there, and this time I got ... >>>>>>> >>>>>>> (User space uses packet mmap() and mlockall(MCL_CURRENT | MCL_FUTURE) >>>>>>> and on shutdown munlockall() ...) >>>>>>> >>>>>>> [ 63.863672] ------------[ cut here ]------------ >>>>>>> [ 63.863702] kernel BUG at mm/mlock.c:507! >>>>>>> [ 63.863721] invalid opcode: 0000 [#1] SMP >>>>>>> [ 63.863743] Modules linked in: fuse ebtable_nat xt_CHECKSUM nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6table_nat nf_nat_ipv6 ip6table_mangle ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat iptable_mangle nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack bridge ebtable_filter ebtables stp llc ip6table_filter ip6_tables rfcomm bnep snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec iwlwifi cfg80211 snd_hwdep btusb snd_seq bluetooth sdhci_pci snd_seq_device e1000e tpm_tis snd_pcm thinkpad_acpi sdhci ptp tpm uvcvideo pps_core snd_page_alloc snd_timer snd rfkill mmc_core iTCO_wdt iTCO_vendor_support lpc_ich mfd_core soundcore joydev wmi videobuf2_vmalloc videobuf2_memops videobuf2_core i2c_i801 pcspkr videod ev media uinput i915 >>>>>>> [ 63.864152] i2c_algo_bit drm_kms_helper drm i2c_core video >>>>>>> [ 63.864181] CPU: 1 PID: 1617 Comm: trafgen Not tainted 3.13.0-rc6+ #15 >>>>>>> [ 63.864209] Hardware name: LENOVO 2429BP3/2429BP3, BIOS G4ET37WW (1.12 ) 05/29/2012 >>>>>>> [ 63.864242] task: ffff8801ee060000 ti: ffff8800b5954000 task.ti: ffff8800b5954000 >>>>>>> [ 63.864274] RIP: 0010:[<ffffffff8116fa9a>] [<ffffffff8116fa9a>] munlock_vma_pages_range+0x2ea/0x2f0 >>>>>>> [ 63.864318] RSP: 0018:ffff8800b5955e08 EFLAGS: 00010202 >>>>>>> [ 63.864341] RAX: 00000000000001ff RBX: ffff8800b58f7508 RCX: 0000000000000034 >>>>>>> [ 63.864372] RDX: 00000007f0708992 RSI: ffffea0002c3e700 RDI: ffffea0002c3e700 >>>>>>> [ 63.864402] RBP: ffff8800b5955ee0 R08: 3800000000000000 R09: a8000b0f9c000000 >>>>>>> [ 63.864432] R10: 57ffdef066c3e700 R11: ffffff5cfb00c14a R12: ffffea0002c3e700 >>>>>>> [ 63.864462] R13: ffff8800b5955f48 R14: 00007f0708992000 R15: 00007f0708992000 >>>>>>> [ 63.864492] FS: 00007f0708b92740(0000) GS:ffff88021e240000(0000) knlGS:0000000000000000 >>>>>>> [ 63.864526] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >>>>>>> [ 63.864551] CR2: 00007f33bb373000 CR3: 00000000b2a2c000 CR4: 00000000001407e0 >>>>>>> [ 63.864581] Stack: >>>>>>> [ 63.864593] ffff8800b5955ed0 00007f0708b91fff 00007f0708b92000 ffff8800b5955e48 >>>>>>> [ 63.864632] 000001ff810c864b ffff8801ee060000 0000000000000000 0000000000000000 >>>>>>> [ 63.864669] ffff8800b5955e58 ffff8801ee060000 0000000700000086 ffff8801ee060000 >>>>>>> [ 63.864708] Call Trace: >>>>>>> [ 63.864724] [<ffffffff816956bc>] ? _raw_spin_unlock_irq+0x2c/0x30 >>>>>>> [ 63.864754] [<ffffffff81171b52>] ? vma_merge+0xc2/0x330 >>>>>>> [ 63.864786] [<ffffffff8116fb9c>] mlock_fixup+0xfc/0x190 >>>>>>> [ 63.864812] [<ffffffff8116fde7>] do_mlockall+0x87/0xc0 >>>>>>> [ 63.864836] [<ffffffff811702df>] sys_munlockall+0x2f/0x50 >>>>>>> [ 63.864873] [<ffffffff8169e192>] system_call_fastpath+0x16/0x1b >>>>>>> [ 63.864898] Code: d7 48 89 95 28 ff ff ff e8 a4 04 fe ff 84 c0 48 8b 95 28 ff ff ff 0f 85 5a ff ff ff e9 46 ff ff ff e8 3f ac 51 00 e8 34 ac 51 00 <0f> 0b 0f 1f 40 00 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 >>>>>>> [ 63.865114] RIP [<ffffffff8116fa9a>] munlock_vma_pages_range+0x2ea/0x2f0 >>>>>>> [ 63.865148] RSP <ffff8800b5955e08> >>>>>>> [ 63.874968] ------------[ cut here ]------------ >>>>>>> >>>>>>> ... when I find some time, I'll try with normal torvalds' tree, maybe some >>>>>>> other patches are missing as well, not sure right now. >>>>>> Uh so the triggered assertion is the one added by this very patch, and there are no more changes wrt this in mainline. >>>>>> >>>>>> If you can still try debug patches, please try this. Thanks. >>>>> Yes, thanks, I'll come back to you some time by today. >>>> Daniel sent me (off-list) instructions to reproduce: >>>> >>>>> Then in the kernel source tree, you'll find: >>>>> >>>>> tools/testing/selftests/net/ >>>>> >>>>> There, just do a 'make' and run ./psock_tpacket >>>> It reproduces deterministically in mainline since 3.12, i.e. my munlock >>>> performance series. Based on the initial debug output, I've expanded the >>>> debug patch below a bit: >>>> >>>>>> From: Vlastimil Babka <vbabka@suse.cz> >>>>>> Date: Mon, 13 Jan 2014 11:13:53 +0100 >>>>>> Subject: [PATCH] debug munlock_vma_pages_range >>>>>> >>>>>> --- >>>>>> mm/mlock.c | 22 ++++++++++++++++++++-- >>>>>> 1 file changed, 20 insertions(+), 2 deletions(-) >>>>>> >>>>>> diff --git a/mm/mlock.c b/mm/mlock.c >>>>>> index c59c420..7d0e29a 100644 >>>>>> --- a/mm/mlock.c >>>>>> +++ b/mm/mlock.c >>>>>> @@ -448,12 +448,14 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec, >>>>>> void munlock_vma_pages_range(struct vm_area_struct *vma, >>>>>> unsigned long start, unsigned long end) >>>>>> { >>>>>> + unsigned long orig_start = start; >>>>>> + unsigned long page_increm = 0; >>>>>> + >>>>>> vma->vm_flags &= ~VM_LOCKED; >>>>>> >>>>>> while (start < end) { >>>>>> struct page *page = NULL; >>>>>> unsigned int page_mask; >>>>>> - unsigned long page_increm; >>>>>> struct pagevec pvec; >>>>>> struct zone *zone; >>>>>> int zoneid; >>>>>> @@ -504,7 +506,23 @@ void munlock_vma_pages_range(struct vm_area_struct *vma, >>>>>> } >>>>>> } >>>>>> /* It's a bug to munlock in the middle of a THP page */ >>>>>> - VM_BUG_ON((start >> PAGE_SHIFT) & page_mask); >>>>>> + if ((start >> PAGE_SHIFT) & page_mask) { >>>>>> + dump_page(page); >>>>>> + printk("start=%lu pfn=%lu orig_start=%lu " >>>>>> + "prev_page_increm=%lu page_mask=%u " >>>>>> + "vm_start=%lu vm_end=%lu vm_flags=%lu\n", >>>>>> + start, page_to_pfn(page), orig_start, >>>>>> + page_increm, page_mask, >>>>>> + vma->vm_start, vma->vm_end, >>>>>> + vma->vm_flags); >>>> + printk("vm_ops=%pF, open=%pF, fault=%pF, remap_pages=%pF\n", vma->vm_ops, >>>> + vma->vm_ops->open, vma->vm_ops->fault, vma->vm_ops->remap_pages); >>>> + if (PageCompound(page)) { >>>> + printk("page is compound with order=%d\n", compound_order(page)); >>>> + } >>>>>> + if (PageTail(page)) { >>>>>> + struct page *first_page = page->first_page; >>>>>> + printk("first_page pfn=%lu\n", >>>>>> + page_to_pfn(first_page)); >>>>>> + dump_page(first_page); >>>>>> + } >>>>>> + VM_BUG_ON(true); >>>>>> + } >>>>>> page_increm = 1 + page_mask; >>>>>> start += page_increm * PAGE_SIZE; >>>>>> next: >>>>>> >>>> And got output like this: >>>> >>>> page:ffffea0002474a40 count:5 mapcount:1 mapping: (null) index:0x0 >>>> page flags: 0x100000000004004(referenced|head) >>>> start=140242647736320 pfn=682616 orig_start=140242647736320 prev_page_increm=0 page_mask=511 vm_start=140242647736320 vm_end=140242651930624 vm_flags=268435707 >>>> vm_ops=packet_mmap_ops+0x0/0xfffffffffffff8e0 [af_packet], open=packet_mm_open+0x0/0x30 [af_packet], fault= (null), remap_pages= (null) >>>> page is compound with order=2 >>>> >>>> Observations: >>>> - address 140242647736320 is where the vma starts, and is not aligned to 512 pages >>>> (so it cannot be a THP head which the munlock expects). Yet there is a head page >>>> that triggers the PageTransHuge() and consequently hpage_nr_pages() in munlock_vma_page() >>>> That's why page_mask is determined to be 511 and the code thinks it's in the >>>> middle of a THP page. >>>> - in fact, the page is a compound page with order=2 >>>> - the VM flags (except (may)read/write) are VM_SHARED and VM_MIXEDMAP >>>> - the vma was mmapped by packet_mmap() (net/packet/af_packet.c) which uses >>>> vm_insert_page(), which adds the VM_MIXEDMAP flag >>>> - the buffers that are mapped were allocated by alloc_one_pg_vec_page() >>>> where flags indeed include __GFP_COMP >>>> >>>> So clearly there is a way to have mlock/munlock operate on a vma that contains >>>> compound pages and confuse the checks for PageTransHuge(). >>>> >>>> The checks for THP in munlock came with commit ff6a6da60b89 ("mm: accelerate munlock() >>>> treatment of THP pages"), i.e. since 3.9, but did not trigger a bug. It however >>>> makes munlock_vma_pages_range() skip pages until the next 512-pages-aligned page, >>>> when it encounters a head page. If the head page is of smaller order and is followed >>>> by normal LRU pages (theoretically, I'm not sure if that's possible, or done anywhere), >>>> they wouldn't get munlocked. >>>> >>>> My commit 7225522bb429 ("mm: munlock: batch non-THP page isolation and >>>> munlock+putback using pagevec") (since 3.12) has added a new PageTransHuge() check >>>> that can trigger on tail pages of the compound page here. Commit c424be1cbbf852e46acc8 >>>> ("mm: munlock: fix a bug where THP tail page is encountered") in current rc's removes >>>> one class of bugs here, but still non-THP compound pages are not expected in mlock/munlock, >>>> which leads to this assertion failing. >>>> >>>> The question is what is the correct fix, and I'm not that familiar with VM_MIXEDMAP >>>> to decide. >>>> >>>> Option 1: mlocking VM_MIXEDMAP vma's has no sense. They should be treated like VM_PFNMAP >>>> and added to VM_SPECIAL, which makes m(un)lock skip them completely. >>>> >>>> Option 2: if indeed VM_MIXEDMAP can contain PageLRU pages for which mlocking is useful, >>>> VM_NO_THP should be checked in munlock before attempting PageTransHuge() and >>>> friends. VM_NO_THP already contains VM_MIXEDMAP, so knowing that there can be >>>> no THP means we don't try optimize for it and no unexpected head pages trip us. >>>> >>>> Thoughts? >> OK, here's a RFC patch to hopefully help get us somewhere. I went for >> Option1, as I didn't see anyone using VM_MIXEDMAP also for LRU pages, >> and Option2 was ugly to implement and also seemed quite arbitrary. I'm >> not sure if making VM_MIXEDMAP also non-mergeable this way is an issue >> though. >> >> > > Hi! > > Forgive an ignorant question, but are anonymous COW'd pages LRU pages? I believe so, but I've checked where VM_MIXEDMAP is used in TTM and it seems all those vma's are also VM_IO which means they are already included in VM_SPECIAL and this change won't affect them. Vlastimil > The reason I'm asking is that TTM VM_MIXEDMAP vmas may contain such pages. > > /Thomas > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [BUG] at include/linux/page-flags.h:415 (PageTransHuge) 2014-01-31 15:25 ` Vlastimil Babka @ 2014-01-31 15:35 ` Thomas Hellstrom 0 siblings, 0 replies; 7+ messages in thread From: Thomas Hellstrom @ 2014-01-31 15:35 UTC (permalink / raw) To: Vlastimil Babka Cc: Daniel Borkmann, Andrew Morton, linux-kernel, Michel Lespinasse, linux-mm, Jared Hulbert, netdev, John David Anglin, HATAYAMA Daisuke, Konstantin Khlebnikov, Carsten Otte, Peter Zijlstra On 01/31/2014 04:25 PM, Vlastimil Babka wrote: > On 01/31/2014 03:58 PM, Thomas Hellstrom wrote: >> On 01/31/2014 03:40 PM, Vlastimil Babka wrote: >>> On 01/15/2014 05:06 PM, Daniel Borkmann wrote: >>>> [keeping netdev in loop as well] >>>> >>>> On 01/15/2014 03:27 PM, Vlastimil Babka wrote: >>>>> On 01/13/2014 12:39 PM, Daniel Borkmann wrote: >>>>>> On 01/13/2014 11:16 AM, Vlastimil Babka wrote: >>>>>>> On 01/11/2014 02:32 PM, Daniel Borkmann wrote: >>>>>>>> On 01/11/2014 07:22 AM, Andrew Morton wrote: >>>>>>>>> On Fri, 10 Jan 2014 19:23:26 +0100 Daniel Borkmann >>>>>>>>> <borkmann@iogearbox.net> wrote: >>>>>>>>> >>>>>>>>>> This is being reliably triggered for each mmaped() packet(7) >>>>>>>>>> socket from user space, basically during unmapping resp. >>>>>>>>>> closing the TX socket. >>>>>>>>>> >>>>>>>>>> I believe due to some change in transparent hugepages code ? >>>>>>>>>> >>>>>>>>>> When I disable transparent hugepages, everything works fine, >>>>>>>>>> no BUG triggered. >>>>>>>>>> >>>>>>>>>> I'd be happy to test patches. >>>>>>>>> Did the inclusion of c424be1cbbf852e46acc8 ("mm: munlock: fix >>>>>>>>> a bug >>>>>>>>> where THP tail page is encountered") in current mainline fix >>>>>>>>> this? >>>>>>>> Thanks for your answer Andrew! >>>>>>>> >>>>>>>> Hm, I just cherry-picked that onto current net-next as I have >>>>>>>> some work >>>>>>>> there, and this time I got ... >>>>>>>> >>>>>>>> (User space uses packet mmap() and mlockall(MCL_CURRENT | >>>>>>>> MCL_FUTURE) >>>>>>>> and on shutdown munlockall() ...) >>>>>>>> >>>>>>>> [ 63.863672] ------------[ cut here ]------------ >>>>>>>> [ 63.863702] kernel BUG at mm/mlock.c:507! >>>>>>>> [ 63.863721] invalid opcode: 0000 [#1] SMP >>>>>>>> [ 63.863743] Modules linked in: fuse ebtable_nat xt_CHECKSUM >>>>>>>> nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE >>>>>>>> ip6table_nat nf_nat_ipv6 ip6table_mangle ip6t_REJECT >>>>>>>> nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat >>>>>>>> iptable_mangle nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack >>>>>>>> nf_conntrack bridge ebtable_filter ebtables stp llc >>>>>>>> ip6table_filter ip6_tables rfcomm bnep snd_hda_codec_hdmi >>>>>>>> snd_hda_codec_realtek snd_hda_intel snd_hda_codec iwlwifi >>>>>>>> cfg80211 snd_hwdep btusb snd_seq bluetooth sdhci_pci >>>>>>>> snd_seq_device e1000e tpm_tis snd_pcm thinkpad_acpi sdhci ptp >>>>>>>> tpm uvcvideo pps_core snd_page_alloc snd_timer snd rfkill >>>>>>>> mmc_core iTCO_wdt iTCO_vendor_support lpc_ich mfd_core >>>>>>>> soundcore joydev wmi videobuf2_vmalloc videobuf2_memops >>>>>>>> videobuf2_core i2c_i801 pcspkr videodev media uinput i915 >>>>>>>> [ 63.864152] i2c_algo_bit drm_kms_helper drm i2c_core video >>>>>>>> [ 63.864181] CPU: 1 PID: 1617 Comm: trafgen Not tainted >>>>>>>> 3.13.0-rc6+ #15 >>>>>>>> [ 63.864209] Hardware name: LENOVO 2429BP3/2429BP3, BIOS >>>>>>>> G4ET37WW (1.12 ) 05/29/2012 >>>>>>>> [ 63.864242] task: ffff8801ee060000 ti: ffff8800b5954000 >>>>>>>> task.ti: ffff8800b5954000 >>>>>>>> [ 63.864274] RIP: 0010:[<ffffffff8116fa9a>] >>>>>>>> [<ffffffff8116fa9a>] munlock_vma_pages_range+0x2ea/0x2f0 >>>>>>>> [ 63.864318] RSP: 0018:ffff8800b5955e08 EFLAGS: 00010202 >>>>>>>> [ 63.864341] RAX: 00000000000001ff RBX: ffff8800b58f7508 RCX: >>>>>>>> 0000000000000034 >>>>>>>> [ 63.864372] RDX: 00000007f0708992 RSI: ffffea0002c3e700 RDI: >>>>>>>> ffffea0002c3e700 >>>>>>>> [ 63.864402] RBP: ffff8800b5955ee0 R08: 3800000000000000 R09: >>>>>>>> a8000b0f9c000000 >>>>>>>> [ 63.864432] R10: 57ffdef066c3e700 R11: ffffff5cfb00c14a R12: >>>>>>>> ffffea0002c3e700 >>>>>>>> [ 63.864462] R13: ffff8800b5955f48 R14: 00007f0708992000 R15: >>>>>>>> 00007f0708992000 >>>>>>>> [ 63.864492] FS: 00007f0708b92740(0000) >>>>>>>> GS:ffff88021e240000(0000) knlGS:0000000000000000 >>>>>>>> [ 63.864526] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >>>>>>>> [ 63.864551] CR2: 00007f33bb373000 CR3: 00000000b2a2c000 CR4: >>>>>>>> 00000000001407e0 >>>>>>>> [ 63.864581] Stack: >>>>>>>> [ 63.864593] ffff8800b5955ed0 00007f0708b91fff >>>>>>>> 00007f0708b92000 ffff8800b5955e48 >>>>>>>> [ 63.864632] 000001ff810c864b ffff8801ee060000 >>>>>>>> 0000000000000000 0000000000000000 >>>>>>>> [ 63.864669] ffff8800b5955e58 ffff8801ee060000 >>>>>>>> 0000000700000086 ffff8801ee060000 >>>>>>>> [ 63.864708] Call Trace: >>>>>>>> [ 63.864724] [<ffffffff816956bc>] ? >>>>>>>> _raw_spin_unlock_irq+0x2c/0x30 >>>>>>>> [ 63.864754] [<ffffffff81171b52>] ? vma_merge+0xc2/0x330 >>>>>>>> [ 63.864786] [<ffffffff8116fb9c>] mlock_fixup+0xfc/0x190 >>>>>>>> [ 63.864812] [<ffffffff8116fde7>] do_mlockall+0x87/0xc0 >>>>>>>> [ 63.864836] [<ffffffff811702df>] sys_munlockall+0x2f/0x50 >>>>>>>> [ 63.864873] [<ffffffff8169e192>] >>>>>>>> system_call_fastpath+0x16/0x1b >>>>>>>> [ 63.864898] Code: d7 48 89 95 28 ff ff ff e8 a4 04 fe ff 84 >>>>>>>> c0 48 8b 95 28 ff ff ff 0f 85 5a ff ff ff e9 46 ff ff ff e8 3f >>>>>>>> ac 51 00 e8 34 ac 51 00 <0f> 0b 0f 1f 40 00 0f 1f 44 00 00 55 >>>>>>>> 48 89 e5 41 57 41 56 41 55 >>>>>>>> [ 63.865114] RIP [<ffffffff8116fa9a>] >>>>>>>> munlock_vma_pages_range+0x2ea/0x2f0 >>>>>>>> [ 63.865148] RSP <ffff8800b5955e08> >>>>>>>> [ 63.874968] ------------[ cut here ]------------ >>>>>>>> >>>>>>>> ... when I find some time, I'll try with normal torvalds' tree, >>>>>>>> maybe some >>>>>>>> other patches are missing as well, not sure right now. >>>>>>> Uh so the triggered assertion is the one added by this very >>>>>>> patch, and there are no more changes wrt this in mainline. >>>>>>> >>>>>>> If you can still try debug patches, please try this. Thanks. >>>>>> Yes, thanks, I'll come back to you some time by today. >>>>> Daniel sent me (off-list) instructions to reproduce: >>>>> >>>>>> Then in the kernel source tree, you'll find: >>>>>> >>>>>> tools/testing/selftests/net/ >>>>>> >>>>>> There, just do a 'make' and run ./psock_tpacket >>>>> It reproduces deterministically in mainline since 3.12, i.e. my >>>>> munlock >>>>> performance series. Based on the initial debug output, I've >>>>> expanded the >>>>> debug patch below a bit: >>>>> >>>>>>> From: Vlastimil Babka <vbabka@suse.cz> >>>>>>> Date: Mon, 13 Jan 2014 11:13:53 +0100 >>>>>>> Subject: [PATCH] debug munlock_vma_pages_range >>>>>>> >>>>>>> --- >>>>>>> mm/mlock.c | 22 ++++++++++++++++++++-- >>>>>>> 1 file changed, 20 insertions(+), 2 deletions(-) >>>>>>> >>>>>>> diff --git a/mm/mlock.c b/mm/mlock.c >>>>>>> index c59c420..7d0e29a 100644 >>>>>>> --- a/mm/mlock.c >>>>>>> +++ b/mm/mlock.c >>>>>>> @@ -448,12 +448,14 @@ static unsigned long >>>>>>> __munlock_pagevec_fill(struct pagevec *pvec, >>>>>>> void munlock_vma_pages_range(struct vm_area_struct *vma, >>>>>>> unsigned long start, unsigned long end) >>>>>>> { >>>>>>> + unsigned long orig_start = start; >>>>>>> + unsigned long page_increm = 0; >>>>>>> + >>>>>>> vma->vm_flags &= ~VM_LOCKED; >>>>>>> >>>>>>> while (start < end) { >>>>>>> struct page *page = NULL; >>>>>>> unsigned int page_mask; >>>>>>> - unsigned long page_increm; >>>>>>> struct pagevec pvec; >>>>>>> struct zone *zone; >>>>>>> int zoneid; >>>>>>> @@ -504,7 +506,23 @@ void munlock_vma_pages_range(struct >>>>>>> vm_area_struct *vma, >>>>>>> } >>>>>>> } >>>>>>> /* It's a bug to munlock in the middle of a THP >>>>>>> page */ >>>>>>> - VM_BUG_ON((start >> PAGE_SHIFT) & page_mask); >>>>>>> + if ((start >> PAGE_SHIFT) & page_mask) { >>>>>>> + dump_page(page); >>>>>>> + printk("start=%lu pfn=%lu orig_start=%lu " >>>>>>> + "prev_page_increm=%lu page_mask=%u " >>>>>>> + "vm_start=%lu vm_end=%lu vm_flags=%lu\n", >>>>>>> + start, page_to_pfn(page), orig_start, >>>>>>> + page_increm, page_mask, >>>>>>> + vma->vm_start, vma->vm_end, >>>>>>> + vma->vm_flags); >>>>> + printk("vm_ops=%pF, open=%pF, fault=%pF, >>>>> remap_pages=%pF\n", vma->vm_ops, >>>>> + vma->vm_ops->open, >>>>> vma->vm_ops->fault, vma->vm_ops->remap_pages); >>>>> + if (PageCompound(page)) { >>>>> + printk("page is compound with >>>>> order=%d\n", compound_order(page)); >>>>> + } >>>>>>> + if (PageTail(page)) { >>>>>>> + struct page *first_page = page->first_page; >>>>>>> + printk("first_page pfn=%lu\n", >>>>>>> + page_to_pfn(first_page)); >>>>>>> + dump_page(first_page); >>>>>>> + } >>>>>>> + VM_BUG_ON(true); >>>>>>> + } >>>>>>> page_increm = 1 + page_mask; >>>>>>> start += page_increm * PAGE_SIZE; >>>>>>> next: >>>>>>> >>>>> And got output like this: >>>>> >>>>> page:ffffea0002474a40 count:5 mapcount:1 mapping: (null) >>>>> index:0x0 >>>>> page flags: 0x100000000004004(referenced|head) >>>>> start=140242647736320 pfn=682616 orig_start=140242647736320 >>>>> prev_page_increm=0 page_mask=511 vm_start=140242647736320 >>>>> vm_end=140242651930624 vm_flags=268435707 >>>>> vm_ops=packet_mmap_ops+0x0/0xfffffffffffff8e0 [af_packet], >>>>> open=packet_mm_open+0x0/0x30 [af_packet], fault= (null), >>>>> remap_pages= (null) >>>>> page is compound with order=2 >>>>> >>>>> Observations: >>>>> - address 140242647736320 is where the vma starts, and is not >>>>> aligned to 512 pages >>>>> (so it cannot be a THP head which the munlock expects). Yet >>>>> there is a head page >>>>> that triggers the PageTransHuge() and consequently >>>>> hpage_nr_pages() in munlock_vma_page() >>>>> That's why page_mask is determined to be 511 and the code >>>>> thinks it's in the >>>>> middle of a THP page. >>>>> - in fact, the page is a compound page with order=2 >>>>> - the VM flags (except (may)read/write) are VM_SHARED and VM_MIXEDMAP >>>>> - the vma was mmapped by packet_mmap() (net/packet/af_packet.c) >>>>> which uses >>>>> vm_insert_page(), which adds the VM_MIXEDMAP flag >>>>> - the buffers that are mapped were allocated by >>>>> alloc_one_pg_vec_page() >>>>> where flags indeed include __GFP_COMP >>>>> >>>>> So clearly there is a way to have mlock/munlock operate on a vma >>>>> that contains >>>>> compound pages and confuse the checks for PageTransHuge(). >>>>> >>>>> The checks for THP in munlock came with commit ff6a6da60b89 ("mm: >>>>> accelerate munlock() >>>>> treatment of THP pages"), i.e. since 3.9, but did not trigger a >>>>> bug. It however >>>>> makes munlock_vma_pages_range() skip pages until the next >>>>> 512-pages-aligned page, >>>>> when it encounters a head page. If the head page is of smaller >>>>> order and is followed >>>>> by normal LRU pages (theoretically, I'm not sure if that's >>>>> possible, or done anywhere), >>>>> they wouldn't get munlocked. >>>>> >>>>> My commit 7225522bb429 ("mm: munlock: batch non-THP page isolation >>>>> and >>>>> munlock+putback using pagevec") (since 3.12) has added a new >>>>> PageTransHuge() check >>>>> that can trigger on tail pages of the compound page here. Commit >>>>> c424be1cbbf852e46acc8 >>>>> ("mm: munlock: fix a bug where THP tail page is encountered") in >>>>> current rc's removes >>>>> one class of bugs here, but still non-THP compound pages are not >>>>> expected in mlock/munlock, >>>>> which leads to this assertion failing. >>>>> >>>>> The question is what is the correct fix, and I'm not that familiar >>>>> with VM_MIXEDMAP >>>>> to decide. >>>>> >>>>> Option 1: mlocking VM_MIXEDMAP vma's has no sense. They should be >>>>> treated like VM_PFNMAP >>>>> and added to VM_SPECIAL, which makes m(un)lock skip >>>>> them completely. >>>>> >>>>> Option 2: if indeed VM_MIXEDMAP can contain PageLRU pages for >>>>> which mlocking is useful, >>>>> VM_NO_THP should be checked in munlock before >>>>> attempting PageTransHuge() and >>>>> friends. VM_NO_THP already contains VM_MIXEDMAP, so >>>>> knowing that there can be >>>>> no THP means we don't try optimize for it and no >>>>> unexpected head pages trip us. >>>>> >>>>> Thoughts? >>> OK, here's a RFC patch to hopefully help get us somewhere. I went for >>> Option1, as I didn't see anyone using VM_MIXEDMAP also for LRU pages, >>> and Option2 was ugly to implement and also seemed quite arbitrary. I'm >>> not sure if making VM_MIXEDMAP also non-mergeable this way is an issue >>> though. >>> >>> >> >> Hi! >> >> Forgive an ignorant question, but are anonymous COW'd pages LRU pages? > > I believe so, but I've checked where VM_MIXEDMAP is used in TTM and it > seems all those vma's are also VM_IO which means they are already > included in VM_SPECIAL and this change won't affect them. > > Vlastimil > OK. Thanks. /Thomas >> The reason I'm asking is that TTM VM_MIXEDMAP vmas may contain such >> pages. >> >> /Thomas >> >> -- >> To unsubscribe, send a message with 'unsubscribe linux-mm' in >> the body to majordomo@kvack.org. For more info on Linux MM, >> see: >> https://urldefense.proofpoint.com/v1/url?u=http://www.linux-mm.org/&k=oIvRg1%2BdGAgOoM1BIlLLqw%3D%3D%0A&r=l5Ago9ekmVFZ3c4M6eauqrJWGwjf6fTb%2BP3CxbBFkVM%3D%0A&m=Wlb%2FkDdCXWj2m8QNoBUogfTl0sK0cH2%2BOONacP0U1SE%3D%0A&s=84b13d34ca94efa34cd185d7ae467b18377a92330174671ac93dbb2948cda967 >> . >> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> >> ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [BUG] at include/linux/page-flags.h:415 (PageTransHuge) 2014-01-31 14:40 ` Vlastimil Babka 2014-01-31 14:58 ` Thomas Hellstrom @ 2014-02-07 18:58 ` Hannes Frederic Sowa 2014-02-12 12:02 ` Daniel Borkmann 1 sibling, 1 reply; 7+ messages in thread From: Hannes Frederic Sowa @ 2014-02-07 18:58 UTC (permalink / raw) To: Vlastimil Babka Cc: Daniel Borkmann, Andrew Morton, linux-kernel, Michel Lespinasse, linux-mm, Jared Hulbert, netdev, Thomas Hellstrom, John David Anglin, HATAYAMA Daisuke, Konstantin Khlebnikov, Carsten Otte, Peter Zijlstra Hi! On Fri, Jan 31, 2014 at 03:40:38PM +0100, Vlastimil Babka wrote: > From: Vlastimil Babka <vbabka@suse.cz> > Date: Fri, 31 Jan 2014 11:50:21 +0100 > Subject: [PATCH] mm: include VM_MIXEDMAP flag in the VM_SPECIAL list to avoid > m(un)locking > > Daniel Borkmann reported a bug with VM_BUG_ON assertions failing where > munlock_vma_pages_range() thinks it's unexpectedly in the middle of a THP page. > This can be reproduced in tools/testing/selftests/net/ by running make and > then ./psock_tpacket. > > The problem is that an order=2 compound page (allocated by > alloc_one_pg_vec_page() is part of the munlocked VM_MIXEDMAP vma (mapped by > packet_mmap()) and mistaken for a THP page and assumed to be order=9. > > The checks for THP in munlock came with commit ff6a6da60b89 ("mm: accelerate > munlock() treatment of THP pages"), i.e. since 3.9, but did not trigger a bug. > It just makes munlock_vma_pages_range() skip such compound pages until the next > 512-pages-aligned page, when it encounters a head page. This is however not a > problem for vma's where mlocking has no effect anyway, but it can distort the > accounting. > Since commit 7225522bb ("mm: munlock: batch non-THP page isolation and > munlock+putback using pagevec") this can trigger a VM_BUG_ON in PageTransHuge() > check. > > This patch fixes the issue by adding VM_MIXEDMAP flag to VM_SPECIAL - a list of > flags that make vma's non-mlockable and non-mergeable. The reasoning is that > VM_MIXEDMAP vma's are similar to VM_PFNMAP, which is already on the VM_SPECIAL > list, and both are intended for non-LRU pages where mlocking makes no sense > anyway. I also ran into this problem and wanted to ask what the status of this patch is? Does it need further testing? I can surely help with that. ;) Thanks, Hannes -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [BUG] at include/linux/page-flags.h:415 (PageTransHuge) 2014-02-07 18:58 ` Hannes Frederic Sowa @ 2014-02-12 12:02 ` Daniel Borkmann 0 siblings, 0 replies; 7+ messages in thread From: Daniel Borkmann @ 2014-02-12 12:02 UTC (permalink / raw) To: Vlastimil Babka, Andrew Morton, linux-kernel, Michel Lespinasse, linux-mm, Jared Hulbert, netdev, Thomas Hellstrom, John David Anglin, HATAYAMA Daisuke, Konstantin Khlebnikov, Carsten Otte, Peter Zijlstra Hi Vlastimil, On 02/07/2014 07:58 PM, Hannes Frederic Sowa wrote: > On Fri, Jan 31, 2014 at 03:40:38PM +0100, Vlastimil Babka wrote: >> From: Vlastimil Babka <vbabka@suse.cz> >> Date: Fri, 31 Jan 2014 11:50:21 +0100 >> Subject: [PATCH] mm: include VM_MIXEDMAP flag in the VM_SPECIAL list to avoid >> m(un)locking >> >> Daniel Borkmann reported a bug with VM_BUG_ON assertions failing where >> munlock_vma_pages_range() thinks it's unexpectedly in the middle of a THP page. >> This can be reproduced in tools/testing/selftests/net/ by running make and >> then ./psock_tpacket. >> >> The problem is that an order=2 compound page (allocated by >> alloc_one_pg_vec_page() is part of the munlocked VM_MIXEDMAP vma (mapped by >> packet_mmap()) and mistaken for a THP page and assumed to be order=9. >> >> The checks for THP in munlock came with commit ff6a6da60b89 ("mm: accelerate >> munlock() treatment of THP pages"), i.e. since 3.9, but did not trigger a bug. >> It just makes munlock_vma_pages_range() skip such compound pages until the next >> 512-pages-aligned page, when it encounters a head page. This is however not a >> problem for vma's where mlocking has no effect anyway, but it can distort the >> accounting. >> Since commit 7225522bb ("mm: munlock: batch non-THP page isolation and >> munlock+putback using pagevec") this can trigger a VM_BUG_ON in PageTransHuge() >> check. >> >> This patch fixes the issue by adding VM_MIXEDMAP flag to VM_SPECIAL - a list of >> flags that make vma's non-mlockable and non-mergeable. The reasoning is that >> VM_MIXEDMAP vma's are similar to VM_PFNMAP, which is already on the VM_SPECIAL >> list, and both are intended for non-LRU pages where mlocking makes no sense >> anyway. Thanks a lot for your efforts. Is your patch queued up somewhere for mainline and stable? > I also ran into this problem and wanted to ask what the status of this > patch is? Does it need further testing? I can surely help with that. ;) > > Thanks, > > Hannes > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2014-02-12 12:02 UTC | newest] Thread overview: 7+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- [not found] <52D03A9E.2030309@iogearbox.net> [not found] ` <20140110222248.4e8419ca.akpm@linux-foundation.org> [not found] ` <52D147F1.3040803@iogearbox.net> [not found] ` <52D3BCE9.4020405@suse.cz> [not found] ` <52D3D060.1010301@iogearbox.net> [not found] ` <52D69AB4.6000309@suse.cz> 2014-01-15 16:06 ` [BUG] at include/linux/page-flags.h:415 (PageTransHuge) Daniel Borkmann 2014-01-31 14:40 ` Vlastimil Babka 2014-01-31 14:58 ` Thomas Hellstrom 2014-01-31 15:25 ` Vlastimil Babka 2014-01-31 15:35 ` Thomas Hellstrom 2014-02-07 18:58 ` Hannes Frederic Sowa 2014-02-12 12:02 ` Daniel Borkmann
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).