* bad page state in 3.13-rc4 @ 2013-12-19 4:07 Dave Jones 2013-12-19 4:40 ` Linus Torvalds 0 siblings, 1 reply; 26+ messages in thread From: Dave Jones @ 2013-12-19 4:07 UTC (permalink / raw) To: Linux Kernel; +Cc: linux-mm, Linus Torvalds Just hit this while fuzzing with lots of child processes. (trinity -C128) BUG: Bad page state in process trinity-c93 pfn:100499 ------------[ cut here ]------------ kernel BUG at include/linux/mm.h:439! invalid opcode: 0000 [#1] PREEMPT SMP Modules linked in: dlci sctp snd_seq_dummy hidp fuse rfcomm bnep tun can_raw can_bcm bluetooth can rose phonet pppoe pppox ppp_generic slhc llc2 af_rxrpc af_key netrom caif_socket caif ipt_ULOG nfnetlink nfc af_802154 irda crc_ccitt rds scsi_transport_iscsi x25 atm appletalk ipx p8023 psnap p8022 llc ax25 cfg80211 rfkill xfs snd_hda_codec_hdmi snd_hda_codec_realtek libcrc32c snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_page_alloc snd_timer snd shpchp soundcore coretemp hwmon x86_pkg_temp_thermal kvm_intel kvm crct10dif_pclmul crc32c_intel ghash_clmulni_intel microcode serio_raw pcspkr usb_debug e1000e ptp pps_core CPU: 0 PID: 4408 Comm: trinity-c39 Not tainted 3.13.0-rc4+ #5 task: ffff88021b1d5be0 ti: ffff88011fd40000 task.ti: ffff88011fd40000 RIP: 0010:[<ffffffff816d8f89>] [<ffffffff816d8f89>] get_page.part.20+0x4/0x6 RSP: 0018:ffff88011fd418c8 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff88024e20f780 RCX: 0000000000017151 RDX: 0000000000000000 RSI: 0000000000000017 RDI: 0000000000000001 RBP: ffff88011fd418c8 R08: fffffffffffffffd R09: 00000000000170f8 R10: ffff88024e5d3e80 R11: 0000000000000017 R12: ffffea0004012640 R13: 0000000000000000 R14: ffffea0004184200 R15: 0000000000000000 FS: 00007f2203b5e740(0000) GS:ffff88024e200000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000018ca000 CR3: 000000014dcf4000 CR4: 00000000001407f0 DR0: 0000000000f88000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 Stack: ffff88011fd418e8 ffffffff81141385 ffffea0004012640 ffffea0004012642 ffff88011fd418f8 ffffffff8114161c ffff88011fd41920 ffffffff81145572 ffffea0004012640 ffffea0004012600 ffffea0004012660 ffff88011fd419a8 Call Trace: [<ffffffff81141385>] __lru_cache_add+0xa5/0xc0 [<ffffffff8114161c>] lru_cache_add+0x1c/0x30 [<ffffffff81145572>] putback_lru_page+0x72/0xb0 [<ffffffff8118c3f9>] migrate_pages+0x479/0x780 [<ffffffff81155480>] ? isolate_freepages_block+0x360/0x360 [<ffffffff8115638a>] compact_zone+0x2aa/0x460 [<ffffffff8109ff90>] ? debug_check_no_locks_freed+0xb0/0x150 [<ffffffff811565d4>] compact_zone_order+0x94/0xe0 [<ffffffff81156911>] try_to_compact_pages+0xe1/0x110 [<ffffffff816d8c88>] __alloc_pages_direct_compact+0xac/0x1d0 [<ffffffff8113c86a>] __alloc_pages_nodemask+0x90a/0xab0 [<ffffffff8117e1b1>] alloc_pages_vma+0xf1/0x1b0 [<ffffffff81190e0d>] ? do_huge_pmd_anonymous_page+0xfd/0x3a0 [<ffffffff81190e0d>] do_huge_pmd_anonymous_page+0xfd/0x3a0 [<ffffffff8115beec>] ? follow_page_mask+0x24c/0x510 [<ffffffff8115d919>] handle_mm_fault+0x479/0xbb0 [<ffffffff816e5591>] ? _raw_spin_unlock+0x31/0x50 [<ffffffff8115e1fe>] __get_user_pages+0x1ae/0x5f0 [<ffffffff8116028c>] __mlock_vma_pages_range+0x8c/0xa0 [<ffffffff81160a00>] __mm_populate+0xc0/0x150 [<ffffffff8114f486>] vm_mmap_pgoff+0xb6/0xc0 [<ffffffff81162d16>] SyS_mmap_pgoff+0x116/0x270 [<ffffffff81010495>] ? syscall_trace_enter+0x145/0x270 [<ffffffff81007b52>] SyS_mmap+0x22/0x30 [<ffffffff816edb24>] tracesys+0xdd/0xe2 Code: 48 c8 48 85 d2 48 0f 49 c2 48 01 c8 49 89 06 58 5b 41 5c 41 5d 41 5e 41 5f 5d c3 55 48 89 e5 0f 0b 55 48 89 e5 0f 0b 55 48 89 e5 <0f> 0b 80 3d 59 6a 60 00 00 75 1d 55 be 6c 00 00 00 48 c7 c7 e6 RIP [<ffffffff816d8f89>] get_page.part.20+0x4/0x6 RSP <ffff88011fd418c8> page:ffffea0004012640 count:0 mapcount:0 mapping: (null) index:0x389 page flags: 0x2000000000000c(referenced|uptodate) Modules linked in: dlci sctp snd_seq_dummy hidp fuse rfcomm bnep tun can_raw can_bcm bluetooth can rose phonet pppoe pppox ppp_generic slhc llc2 af_rxrpc af_key netrom caif_socket caif ipt_ULOG nfnetlink nfc af_802154 irda crc_ccitt rds scsi_transport_iscsi x25 atm appletalk ipx p8023 psnap p8022 llc ax25 cfg80211 rfkill xfs snd_hda_codec_hdmi snd_hda_codec_realtek libcrc32c snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_page_alloc snd_timer snd shpchp soundcore coretemp hwmon x86_pkg_temp_thermal kvm_intel kvm crct10dif_pclmul crc32c_intel ghash_clmulni_intel microcode serio_raw pcspkr usb_debug e1000e ptp pps_core CPU: 2 PID: 4395 Comm: trinity-c93 Tainted: G D 3.13.0-rc4+ #5 0000000000000000 ffff88004517fde8 ffffffff816db2f5 ffffea0004012640 ffff88004517fe00 ffffffff816d8b05 ffffea0004012640 ffff88004517fe38 ffffffff8113a645 ffffea0004012640 0000000000000000 002000000000001d Call Trace: [<ffffffff816db2f5>] dump_stack+0x4e/0x7a [<ffffffff816d8b05>] bad_page.part.71+0xcf/0xe8 [<ffffffff8113a645>] free_pages_prepare+0x185/0x190 [<ffffffff8113b085>] free_hot_cold_page+0x35/0x180 [<ffffffff811403f3>] __put_single_page+0x23/0x30 [<ffffffff81140665>] put_page+0x35/0x50 [<ffffffff811e8705>] aio_free_ring+0x55/0xf0 [<ffffffff811e9c5a>] SyS_io_setup+0x59a/0xbe0 [<ffffffff816edb24>] tracesys+0xdd/0xe2 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: bad page state in 3.13-rc4 2013-12-19 4:07 bad page state in 3.13-rc4 Dave Jones @ 2013-12-19 4:40 ` Linus Torvalds 2013-12-19 15:41 ` Christoph Lameter 2013-12-19 15:53 ` Dave Jones 0 siblings, 2 replies; 26+ messages in thread From: Linus Torvalds @ 2013-12-19 4:40 UTC (permalink / raw) To: Dave Jones, Linux Kernel, linux-mm, Christoph Lameter On Wed, Dec 18, 2013 at 8:07 PM, Dave Jones <davej@redhat.com> wrote: > Just hit this while fuzzing with lots of child processes. > (trinity -C128) Ok, there's a BUG_ON() in the middle, the "bad page" part is just this: > BUG: Bad page state in process trinity-c93 pfn:100499 > page:ffffea0004012640 count:0 mapcount:0 mapping: (null) index:0x389 > page flags: 0x2000000000000c(referenced|uptodate) > Call Trace: > [<ffffffff816db2f5>] dump_stack+0x4e/0x7a > [<ffffffff816d8b05>] bad_page.part.71+0xcf/0xe8 > [<ffffffff8113a645>] free_pages_prepare+0x185/0x190 > [<ffffffff8113b085>] free_hot_cold_page+0x35/0x180 > [<ffffffff811403f3>] __put_single_page+0x23/0x30 > [<ffffffff81140665>] put_page+0x35/0x50 > [<ffffffff811e8705>] aio_free_ring+0x55/0xf0 > [<ffffffff811e9c5a>] SyS_io_setup+0x59a/0xbe0 > [<ffffffff816edb24>] tracesys+0xdd/0xe2 at free_pages() time, and I don't see anything bad in the printout wrt the page counts of flags. Which makes me wonder if this is mem_cgroup_bad_page_check() triggering. Of course, if it's a race, it may be that by the time we print out the counts they all look good, even if they weren't good at the time we did that bad_page() *check*. And the fact that we do have a concurrent BUG_ON() triggering with a zero page count obviously does look suspicious. Looks like a possible race with memory compaction happening at the same time aio_free_ring() frees the page. Somebody who knows the migration code needs to look at this. ChristophL? Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: bad page state in 3.13-rc4 2013-12-19 4:40 ` Linus Torvalds @ 2013-12-19 15:41 ` Christoph Lameter 2013-12-19 20:11 ` Mel Gorman 2013-12-19 15:53 ` Dave Jones 1 sibling, 1 reply; 26+ messages in thread From: Christoph Lameter @ 2013-12-19 15:41 UTC (permalink / raw) To: Linus Torvalds; +Cc: Dave Jones, Mel Gorman, Linux Kernel, linux-mm On Wed, 18 Dec 2013, Linus Torvalds wrote: > Somebody who knows the migration code needs to look at this. ChristophL? Its been awhile sorry and there has been a huge amount of work done on top of my earlier work. Cannot debug that anymore and I am finding myself in the role of the old guy who just complains a lot. Some of that functionality seems bizarre to me like the on the fly conversion between huge pages and regular pages, weird and complex page count handling etc etc. The last time I looked at the code I was horrified to find that the new huge page migration does not use migration ptes to create a cooldown phase but directly swaps the pmd. That used to cause huge problems with regular pages in the past. But I was told that was all safe. Mel? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: bad page state in 3.13-rc4 2013-12-19 15:41 ` Christoph Lameter @ 2013-12-19 20:11 ` Mel Gorman 2013-12-19 20:30 ` Dave Jones 0 siblings, 1 reply; 26+ messages in thread From: Mel Gorman @ 2013-12-19 20:11 UTC (permalink / raw) To: Christoph Lameter Cc: Linus Torvalds, Sasha Levin, Joonsoo Kim, Andrew Morton, Kirill A. Shutemov, Hugh Dickins, Dave Jones, Linux Kernel, linux-mm On Thu, Dec 19, 2013 at 09:41:50AM -0600, Christoph Lameter wrote: > On Wed, 18 Dec 2013, Linus Torvalds wrote: > > > Somebody who knows the migration code needs to look at this. ChristophL? > > Its been awhile sorry and there has been a huge amount of work done on top > of my earlier work. Cannot debug that anymore and I am finding myself in > the role of the old guy who just complains a lot. Shake your fist and tell the kids to get off your lawn. > Some of that > functionality seems bizarre to me like the on the fly conversion between > huge pages and regular pages, weird and complex page count handling etc > etc. > > The last time I looked at the code I was horrified to find that the new > huge page migration does not use migration ptes to create a cooldown phase > but directly swaps the pmd. That used to cause huge problems with regular > pages in the past. But I was told that was all safe. Mel? THP migration is specific to automatic NUMA balancing and the safety of how it works is dependant upon how pmds are marked NUMA and how they are cleared and migrated. Dave, was this a NUMA machine? If yes, was CONFIG_NUMA_BALANCING set? If yes, was NUMA_BALANCING_DEFAULT_ENABLED set or was numa_balancing=enable specified on the kernel command line? I'm skeptical that this is related to THP migration largely because the initial stack trace was in the compaction path which does not deal with THP migration. If this is recent, then an outside possibility is that this is related to pmd-level split locks and mm->page_table_lock was protecting us from some split THP vs migration race or possibly a gup page for aio vs migration race we were previously unaware of (e.g. aio taking a reference on a page that migration has frozen the references on, bug would be a case where get_page instead of get_page_unless_zero was used) . Dave, when this this bug start triggering? If it's due to a recent change in trinity, can you check if 3.12 is also affected? If not, can you check if the bug started happening somewhere around these commits? ea1e7ed33708c7a760419ff9ded0a6cb90586a50 mm: create a separate slab for page->ptl allocation 539edb5846c740d78a8b6c2e43a99ca4323df68f mm: properly separate the bloated ptl from the regular case 49076ec2ccaf68610aa03d96bced9a6694b93ca1 mm: dynamically allocate page->ptl if it cannot be embedded to struct page e009bb30c8df8a52a9622b616b67436b6a03a0cd mm: implement split page table lock for PMD level c4088ebdca64c9a2e34a38177d2249805ede1f4b mm: convert the rest to new page table lock api cb900f41215447433cbc456d1c4294e858a84d7c mm, hugetlb: convert hugetlbfs to use split pmd lock c389a250ab4cfa4a3775d9f2c45271618af6d5b2 mm, thp: do not access mm->pmd_huge_pte directly 117b0791ac42f2ec447bc864e70ad622b5604059 mm, thp: move ptl taking inside page_check_address_pmd() bf929152e9f6c49b66fad4ebf08cc95b02ce48f5 mm, thp: change pmd_trans_huge_lock() to return taken lock e1f56c89b040134add93f686931cc266541d239a mm: convert mm->nr_ptes to atomic_long_t e9bb18c7b95d4dcf8c7f0e14f920ca6f03109e75 mm: avoid increase sizeof(struct page) due to split page table lock b77d88d493b8fc7a4c2dadd3bb86d1dee2f53a56 mm: drop actor argument of do_generic_file_read() A few bad state bugs have shown up on linux-mm recently but my impression was that they were related to rmap_walk changes currently in next. The initial log indicated that this was 3.13-rc4 but is it really 3.13-rc4 or are there any -next patches applied? -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: bad page state in 3.13-rc4 2013-12-19 20:11 ` Mel Gorman @ 2013-12-19 20:30 ` Dave Jones 0 siblings, 0 replies; 26+ messages in thread From: Dave Jones @ 2013-12-19 20:30 UTC (permalink / raw) To: Mel Gorman Cc: Christoph Lameter, Linus Torvalds, Sasha Levin, Joonsoo Kim, Andrew Morton, Kirill A. Shutemov, Hugh Dickins, Linux Kernel, linux-mm On Thu, Dec 19, 2013 at 08:11:58PM +0000, Mel Gorman wrote: > Dave, was this a NUMA machine? It's a dual core i5-4670T with hyperthreading. > If yes, was CONFIG_NUMA_BALANCING set? no. > Dave, when this this > bug start triggering? If it's due to a recent change in trinity, can you > check if 3.12 is also affected? If not, can you check if the bug started > happening somewhere around these commits? Right now it can take hours for it to reproduce. Until I can narrow it down to something repeatable, bisecting and trying old builds is going to be really time-consuming. Given the other VM bugs that have Sasha and I have been finding since I added the mmap reuse code to trinity, this is probably something else that has been there for a while. > A few bad state bugs have shown up on linux-mm recently but my impression > was that they were related to rmap_walk changes currently in next. The > initial log indicated that this was 3.13-rc4 but is it really 3.13-rc4 or > are there any -next patches applied? no, just rc4 (plus a handful of small patches to fix oopses etc that I've already diagnosed). I'm glad Sasha spends time running this stuff on -next, because there aren't enough hours in the day for me to look at the stuff I find in Linus' tree without looking at what's coming next. Dave -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: bad page state in 3.13-rc4 2013-12-19 4:40 ` Linus Torvalds 2013-12-19 15:41 ` Christoph Lameter @ 2013-12-19 15:53 ` Dave Jones 2013-12-19 17:07 ` Linus Torvalds 1 sibling, 1 reply; 26+ messages in thread From: Dave Jones @ 2013-12-19 15:53 UTC (permalink / raw) To: Linus Torvalds; +Cc: Linux Kernel, linux-mm, Christoph Lameter On Wed, Dec 18, 2013 at 08:40:07PM -0800, Linus Torvalds wrote: > On Wed, Dec 18, 2013 at 8:07 PM, Dave Jones <davej@redhat.com> wrote: > > Just hit this while fuzzing with lots of child processes. > > (trinity -C128) > > Ok, there's a BUG_ON() in the middle, the "bad page" part is just this: > > > BUG: Bad page state in process trinity-c93 pfn:100499 > > page:ffffea0004012640 count:0 mapcount:0 mapping: (null) index:0x389 > > page flags: 0x2000000000000c(referenced|uptodate) > > Call Trace: > > [<ffffffff816db2f5>] dump_stack+0x4e/0x7a > > [<ffffffff816d8b05>] bad_page.part.71+0xcf/0xe8 > > [<ffffffff8113a645>] free_pages_prepare+0x185/0x190 > > [<ffffffff8113b085>] free_hot_cold_page+0x35/0x180 > > [<ffffffff811403f3>] __put_single_page+0x23/0x30 > > [<ffffffff81140665>] put_page+0x35/0x50 > > [<ffffffff811e8705>] aio_free_ring+0x55/0xf0 > > [<ffffffff811e9c5a>] SyS_io_setup+0x59a/0xbe0 > > [<ffffffff816edb24>] tracesys+0xdd/0xe2 > > at free_pages() time, and I don't see anything bad in the printout wrt > the page counts of flags. Overnight run hit another bad page state with a different trace. WARNING: CPU: 2 PID: 14107 at mm/truncate.c:331 truncate_inode_pages_range+0x5e4/0x610() Modules linked in: snd_seq_dummy fuse sctp tun 8021q garp stp hidp nfnetlink bnep rfcomm scsi_transport_iscsi nfc caif_socket caif af_802154 phonet af_rxrpc can_bcm can_raw bluetooth can llc2 pppoe pppox ppp_generic slhc irda crc_ccitt rds ipt_ULOG af_key rose x25 atm netrom appletalk ipx p8023 psnap p8022 llc ax25 cfg80211 rfkill xfs libcrc32c snd_hda_codec_realtek snd_hda_co dec_hdmi coretemp hwmon x86_pkg_temp_thermal kvm_intel kvm crct10dif_pclmul crc32c_intel ghash_clmulni_intel microcode serio_raw pcspkr usb_debug snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_s eq_device e1000e ptp shpchp snd_pcm pps_core snd_page_alloc[31342.836846] BUG: Bad page state in process modprobe pfn:10f05c page:ffffea00043c1700 count:0 mapcount:0 mapping: (null) index:0x0 page flags: 0x20000000000001(locked) Modules linked in: snd_seq_dummy fuse sctp tun 8021q garp stp hidp nfnetlink bnep rfcomm scsi_transport_iscsi nfc caif_socket caif af_802154 phonet af_rxrpc can_bcm can_raw bluetooth can llc2 pppoe pppox ppp_generic slhc irda crc_ccitt rds ipt_ULOG af_key rose x25 atm netrom appletalk ipx p8023 psnap p8022 llc ax25 cfg80211 rfkill xfs libcrc32c snd_hda_codec_realtek snd_hda_co dec_hdmi coretemp hwmon x86_pkg_temp_thermal kvm_intel kvm crct10dif_pclmul crc32c_intel ghash_clmulni_intel microcode serio_raw pcspkr usb_debug snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_s eq_device e1000e ptp shpchp snd_pcm pps_core snd_page_alloc snd_timer snd soundcore CPU: 0 PID: 14132 Comm: modprobe Not tainted 3.13.0-rc4+ #5 ffff88024e217108 ffff88000870fad0 ffffffff816db2f5 ffffea00043c1700 ffff88000870fae8 ffffffff816d8b05 ffff88024e2170f8 ffff88000870fbd0 ffffffff8113be1e 0000000000000000 0000000000000000 ffff88024e5d4d08 Call Trace: [<ffffffff816db2f5>] dump_stack+0x4e/0x7a [<ffffffff816d8b05>] bad_page.part.71+0xcf/0xe8 [<ffffffff8113be1e>] get_page_from_freelist+0x80e/0x950 [<ffffffff8117e1b1>] ? alloc_pages_vma+0xf1/0x1b0 [<ffffffff8117c6e6>] ? alloc_pages_current+0x106/0x1f0 [<ffffffff8113c178>] __alloc_pages_nodemask+0x218/0xab0 [<ffffffff81132c55>] ? find_get_page+0x5/0x110 [<ffffffff8103cf27>] ? pte_alloc_one+0x17/0x70 [<ffffffff8117c6e6>] alloc_pages_current+0x106/0x1f0 [<ffffffff8103cf27>] ? pte_alloc_one+0x17/0x70 [<ffffffff8103cf27>] pte_alloc_one+0x17/0x70 [<ffffffff8115a027>] __pte_alloc+0x27/0x130 [<ffffffff8115df9c>] handle_mm_fault+0xafc/0xbb0 [<ffffffff816e9321>] ? __do_page_fault+0x101/0x610 [<ffffffff816e938f>] __do_page_fault+0x16f/0x610 [<ffffffff810f7d7f>] ? __acct_update_integrals+0x7f/0x100 [<ffffffff816e5591>] ? _raw_spin_unlock+0x31/0x50 [<ffffffff8108c381>] ? vtime_account_user+0x91/0xa0 [<ffffffff811318bb>] ? context_tracking_user_exit+0x9b/0x100 [<ffffffff816e984a>] do_page_fault+0x1a/0x70 [<ffffffff816e6492>] page_fault+0x22/0x30 CPU: 2 PID: 14107 Comm: trinity-c88 Tainted: G B 3.13.0-rc4+ #5 ffffffff81a28ae4 ffff8801577d7d30 ffffffff816db2f5 0000000000000000 ffff8801577d7d68 ffffffff810529ad ffffffffffffffff ffff8801577d7da0 0000000000000000 0000000000000001 ffffea00043c1700 ffff8801577d7d78 Call Trace: [<ffffffff816db2f5>] dump_stack+0x4e/0x7a [<ffffffff810529ad>] warn_slowpath_common+0x7d/0xa0 [<ffffffff81052a8a>] warn_slowpath_null+0x1a/0x20 [<ffffffff81142834>] truncate_inode_pages_range+0x5e4/0x610 [<ffffffff811428c7>] truncate_pagecache+0x47/0x60 [<ffffffff811428f2>] truncate_setsize+0x12/0x20 [<ffffffff811e8652>] put_aio_ring_file.isra.11+0x22/0x80 [<ffffffff811e871f>] aio_free_ring+0x6f/0xf0 [<ffffffff811e9c5a>] SyS_io_setup+0x59a/0xbe0 [<ffffffff816edb24>] tracesys+0xdd/0xe2 Interesting that CPU2 was doing sys_io_setup again. Different trace though. Dave -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: bad page state in 3.13-rc4 2013-12-19 15:53 ` Dave Jones @ 2013-12-19 17:07 ` Linus Torvalds 2013-12-19 17:17 ` Dave Jones 2013-12-19 18:11 ` Kent Overstreet 0 siblings, 2 replies; 26+ messages in thread From: Linus Torvalds @ 2013-12-19 17:07 UTC (permalink / raw) To: Dave Jones, Linus Torvalds, Linux Kernel, linux-mm, Christoph Lameter, Benjamin LaHaise, Kent Overstreet, Al Viro On Thu, Dec 19, 2013 at 7:53 AM, Dave Jones <davej@redhat.com> wrote: > > Interesting that CPU2 was doing sys_io_setup again. Different trace though. Well, it was once again in aio_free_ring() - double free or freeing while already in use? And this time the other end of the complaint was allocating a new page that definitely was still busily in use (it's locked). And there's no sign of migration, although obviously that could have happened or be in progress on another CPU and just didn't notice the mess. But yes, based on the two traces, fs/aio.c:io_setup() would seem to be the main point of interest. Have you started doing something new in trinity wrt AIO, and io_setup() in particular? Or anything else different that might have started triggering this? But we do have new AIO code, and these two in particular look suspicious: - new page migration logic: 71ad7490c1f3 rework aio migrate pages to use aio fs - trying to fix double frees and error cases: e34ecee2ae79 aio: Fix a trinity splat d558023207e0 aio: prevent double free in ioctx_alloc d1b9432712a2 aio: clean up aio ring in the fail path and some kind of double free in an error path would certainly explain this (with io_setup() . And the first oops reported obviously had that migration thing. So maybe those "fixes" weren't fixing things at all (or just moved the error case around). Btw, that "rework aio migrate pages to use aio fs" looks odd. It has Ben LaHaise marked as author, but no sign-off, instead "Tested-by" and "Acked-by". Al, Ben, Kent, see the beginning thread on lkml (https://lkml.org/lkml/2013/12/18/932). Any comments? Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: bad page state in 3.13-rc4 2013-12-19 17:07 ` Linus Torvalds @ 2013-12-19 17:17 ` Dave Jones 2013-12-19 18:11 ` Kent Overstreet 1 sibling, 0 replies; 26+ messages in thread From: Dave Jones @ 2013-12-19 17:17 UTC (permalink / raw) To: Linus Torvalds Cc: Linux Kernel, linux-mm, Christoph Lameter, Benjamin LaHaise, Kent Overstreet, Al Viro On Thu, Dec 19, 2013 at 09:07:27AM -0800, Linus Torvalds wrote: > On Thu, Dec 19, 2013 at 7:53 AM, Dave Jones <davej@redhat.com> wrote: > > > > Interesting that CPU2 was doing sys_io_setup again. Different trace though. > > Well, it was once again in aio_free_ring() - double free or freeing > while already in use? And this time the other end of the complaint was > allocating a new page that definitely was still busily in use (it's > locked). > > And there's no sign of migration, although obviously that could have > happened or be in progress on another CPU and just didn't notice the > mess. But yes, based on the two traces, fs/aio.c:io_setup() would seem > to be the main point of interest. > > Have you started doing something new in trinity wrt AIO, and > io_setup() in particular? Or anything else different that might have > started triggering this? Nothing special for aio, it's always had support for creating things that look like iovecs, though now maybe it's filling those iovec's with mmaps that it created (including potentially huge pages) instead of just mallocs. Dave -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: bad page state in 3.13-rc4 2013-12-19 17:07 ` Linus Torvalds 2013-12-19 17:17 ` Dave Jones @ 2013-12-19 18:11 ` Kent Overstreet 2013-12-19 18:29 ` Benjamin LaHaise 1 sibling, 1 reply; 26+ messages in thread From: Kent Overstreet @ 2013-12-19 18:11 UTC (permalink / raw) To: Linus Torvalds Cc: Dave Jones, Linux Kernel, linux-mm, Christoph Lameter, Benjamin LaHaise, Al Viro On Thu, Dec 19, 2013 at 09:07:27AM -0800, Linus Torvalds wrote: > On Thu, Dec 19, 2013 at 7:53 AM, Dave Jones <davej@redhat.com> wrote: > > > > Interesting that CPU2 was doing sys_io_setup again. Different trace though. > > Well, it was once again in aio_free_ring() - double free or freeing > while already in use? And this time the other end of the complaint was > allocating a new page that definitely was still busily in use (it's > locked). > > And there's no sign of migration, although obviously that could have > happened or be in progress on another CPU and just didn't notice the > mess. But yes, based on the two traces, fs/aio.c:io_setup() would seem > to be the main point of interest. > > Have you started doing something new in trinity wrt AIO, and > io_setup() in particular? Or anything else different that might have > started triggering this? > > But we do have new AIO code, and these two in particular look suspicious: > > - new page migration logic: > > 71ad7490c1f3 rework aio migrate pages to use aio fs > > - trying to fix double frees and error cases: > > e34ecee2ae79 aio: Fix a trinity splat > d558023207e0 aio: prevent double free in ioctx_alloc > d1b9432712a2 aio: clean up aio ring in the fail path > > and some kind of double free in an error path would certainly explain > this (with io_setup() . And the first oops reported obviously had that > migration thing. So maybe those "fixes" weren't fixing things at all > (or just moved the error case around). > > Btw, that "rework aio migrate pages to use aio fs" looks odd. It has > Ben LaHaise marked as author, but no sign-off, instead "Tested-by" and > "Acked-by". I could certainly believe a double free, but rereading the current code I can't find anything, and I just manually tested all the relevant error paths in ioctx_alloc() and aio_setup_ring() without finding anything. I don't get wtf that loop at line 350 is supposed to be for though. You'd think if it was doing anything important it would be doing something more intelligent than just breaking on error (?). But I haven't slept yet and maybe I'm just being dumb. I don't understand this page migration stuff at all, and I actually don't think I understand the refcounting w.r.t. the page cache either. But looking at (say) the aio_free_ring() call at line 409 - we just did one put_page() in aio_setup_ring(), and then _another_ put_page() in aio_free_ring()... ok, one of those corresponds to the get get_user_pages() did, but what's the other correspond to? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: bad page state in 3.13-rc4 2013-12-19 18:11 ` Kent Overstreet @ 2013-12-19 18:29 ` Benjamin LaHaise 2013-12-19 18:35 ` Dave Jones 2013-12-19 19:19 ` Linus Torvalds 0 siblings, 2 replies; 26+ messages in thread From: Benjamin LaHaise @ 2013-12-19 18:29 UTC (permalink / raw) To: Kent Overstreet Cc: Linus Torvalds, Dave Jones, Linux Kernel, linux-mm, Christoph Lameter, Al Viro On Thu, Dec 19, 2013 at 10:11:34AM -0800, Kent Overstreet wrote: > On Thu, Dec 19, 2013 at 09:07:27AM -0800, Linus Torvalds wrote: > > On Thu, Dec 19, 2013 at 7:53 AM, Dave Jones <davej@redhat.com> wrote: > > > > > > Interesting that CPU2 was doing sys_io_setup again. Different trace though. > > > > Well, it was once again in aio_free_ring() - double free or freeing > > while already in use? And this time the other end of the complaint was > > allocating a new page that definitely was still busily in use (it's > > locked). > > > > And there's no sign of migration, although obviously that could have > > happened or be in progress on another CPU and just didn't notice the > > mess. But yes, based on the two traces, fs/aio.c:io_setup() would seem > > to be the main point of interest. > > > > Have you started doing something new in trinity wrt AIO, and > > io_setup() in particular? Or anything else different that might have > > started triggering this? > > > > But we do have new AIO code, and these two in particular look suspicious: > > > > - new page migration logic: > > > > 71ad7490c1f3 rework aio migrate pages to use aio fs > > > > - trying to fix double frees and error cases: > > > > e34ecee2ae79 aio: Fix a trinity splat > > d558023207e0 aio: prevent double free in ioctx_alloc > > d1b9432712a2 aio: clean up aio ring in the fail path > > > > and some kind of double free in an error path would certainly explain > > this (with io_setup() . And the first oops reported obviously had that > > migration thing. So maybe those "fixes" weren't fixing things at all > > (or just moved the error case around). > > > > Btw, that "rework aio migrate pages to use aio fs" looks odd. It has > > Ben LaHaise marked as author, but no sign-off, instead "Tested-by" and > > "Acked-by". > > I could certainly believe a double free, but rereading the current code > I can't find anything, and I just manually tested all the relevant error > paths in ioctx_alloc() and aio_setup_ring() without finding anything. The same here. It would be very helpful to know what syscalls trinity is issuing in the lead up to the bug. > I don't get wtf that loop at line 350 is supposed to be for though. > You'd think if it was doing anything important it would be doing > something more intelligent than just breaking on error (?). But I > haven't slept yet and maybe I'm just being dumb. The loop at 350 is just instantiating the page cache pages for the ring buffer with freshly zeroed pages. > I don't understand this page migration stuff at all, and I actually > don't think I understand the refcounting w.r.t. the page cache either. > But looking at (say) the aio_free_ring() call at line 409 - we just did > one put_page() in aio_setup_ring(), and then _another_ put_page() in > aio_free_ring()... ok, one of those corresponds to the get > get_user_pages() did, but what's the other correspond to? The second put_page() should be dropping the page from the page cache. Perhaps it would be better to rely on a truncate of the file to remove the pages from the page cache. I'd hoped someone with a clue about how migration is supposed to work would have chimed in during the review, but nobody has stepped up with expertise in this area yet. -ben -- "Thought is the essence of where you are now." -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: bad page state in 3.13-rc4 2013-12-19 18:29 ` Benjamin LaHaise @ 2013-12-19 18:35 ` Dave Jones 2013-12-19 19:19 ` Linus Torvalds 1 sibling, 0 replies; 26+ messages in thread From: Dave Jones @ 2013-12-19 18:35 UTC (permalink / raw) To: Benjamin LaHaise Cc: Kent Overstreet, Linus Torvalds, Linux Kernel, linux-mm, Christoph Lameter, Al Viro On Thu, Dec 19, 2013 at 01:29:21PM -0500, Benjamin LaHaise wrote: > > > and some kind of double free in an error path would certainly explain > > > this (with io_setup() . And the first oops reported obviously had that > > > migration thing. So maybe those "fixes" weren't fixing things at all > > > (or just moved the error case around). > > > > > > Btw, that "rework aio migrate pages to use aio fs" looks odd. It has > > > Ben LaHaise marked as author, but no sign-off, instead "Tested-by" and > > > "Acked-by". > > > > I could certainly believe a double free, but rereading the current code > > I can't find anything, and I just manually tested all the relevant error > > paths in ioctx_alloc() and aio_setup_ring() without finding anything. > > The same here. It would be very helpful to know what syscalls trinity is > issuing in the lead up to the bug. Working on narrowing it down. The io_setup fuzzer is actually incredibly dumb, and 99.9% of the time will just EFAULT or EINVAL. I'll see if I can smarten it up to succeed more often, in the hope that it can reproduce this faster, because right now it looks like it needs the planets to line up just right to hit the bug (even though I've hit it twice in the last 24 hrs) Dave -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: bad page state in 3.13-rc4 2013-12-19 18:29 ` Benjamin LaHaise 2013-12-19 18:35 ` Dave Jones @ 2013-12-19 19:19 ` Linus Torvalds 2013-12-19 19:26 ` Benjamin LaHaise 1 sibling, 1 reply; 26+ messages in thread From: Linus Torvalds @ 2013-12-19 19:19 UTC (permalink / raw) To: Benjamin LaHaise Cc: Kent Overstreet, Dave Jones, Linux Kernel, linux-mm, Christoph Lameter, Al Viro On Thu, Dec 19, 2013 at 10:29 AM, Benjamin LaHaise <bcrl@kvack.org> wrote: > >> I don't understand this page migration stuff at all, and I actually >> don't think I understand the refcounting w.r.t. the page cache either. >> But looking at (say) the aio_free_ring() call at line 409 - we just did >> one put_page() in aio_setup_ring(), and then _another_ put_page() in >> aio_free_ring()... ok, one of those corresponds to the get >> get_user_pages() did, but what's the other correspond to? > > The second put_page() should be dropping the page from the page cache. > Perhaps it would be better to rely on a truncate of the file to remove the > pages from the page cache. Yeah, that looks horribly buggy, if that's the intent. You can't just put_page() to remove something from the page cache. You need to do the whole "remove from radix tree" rigamarole, see for example delete_from_page_cache(). And you can't even do that blindly, because if the page is under writeback or otherwise busy, just removing it from the page cache and freeing it is wrong too. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: bad page state in 3.13-rc4 2013-12-19 19:19 ` Linus Torvalds @ 2013-12-19 19:26 ` Benjamin LaHaise 2013-12-19 19:45 ` Linus Torvalds 0 siblings, 1 reply; 26+ messages in thread From: Benjamin LaHaise @ 2013-12-19 19:26 UTC (permalink / raw) To: Linus Torvalds Cc: Kent Overstreet, Dave Jones, Linux Kernel, linux-mm, Christoph Lameter, Al Viro On Fri, Dec 20, 2013 at 04:19:15AM +0900, Linus Torvalds wrote: > Yeah, that looks horribly buggy, if that's the intent. > > You can't just put_page() to remove something from the page cache. You > need to do the whole "remove from radix tree" rigamarole, see for > example delete_from_page_cache(). And you can't even do that blindly, > because if the page is under writeback or otherwise busy, just > removing it from the page cache and freeing it is wrong too. Okay, I'll rewriting it to use truncate to free the pages. -ben > Linus -- "Thought is the essence of where you are now." -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: bad page state in 3.13-rc4 2013-12-19 19:26 ` Benjamin LaHaise @ 2013-12-19 19:45 ` Linus Torvalds 2013-12-19 19:53 ` Benjamin LaHaise 0 siblings, 1 reply; 26+ messages in thread From: Linus Torvalds @ 2013-12-19 19:45 UTC (permalink / raw) To: Benjamin LaHaise Cc: Kent Overstreet, Dave Jones, Linux Kernel, linux-mm, Christoph Lameter, Al Viro On Fri, Dec 20, 2013 at 4:26 AM, Benjamin LaHaise <bcrl@kvack.org> wrote: > > Okay, I'll rewriting it to use truncate to free the pages. It already does that in put_aio_ring_file() afaik. No? Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: bad page state in 3.13-rc4 2013-12-19 19:45 ` Linus Torvalds @ 2013-12-19 19:53 ` Benjamin LaHaise 2013-12-19 20:02 ` Linus Torvalds 2013-12-19 20:24 ` Dave Jones 0 siblings, 2 replies; 26+ messages in thread From: Benjamin LaHaise @ 2013-12-19 19:53 UTC (permalink / raw) To: Linus Torvalds Cc: Kent Overstreet, Dave Jones, Linux Kernel, linux-mm, Christoph Lameter, Al Viro On Fri, Dec 20, 2013 at 04:45:38AM +0900, Linus Torvalds wrote: > On Fri, Dec 20, 2013 at 4:26 AM, Benjamin LaHaise <bcrl@kvack.org> wrote: > > > > Okay, I'll rewriting it to use truncate to free the pages. > > It already does that in put_aio_ring_file() afaik. No? Yes, that's what I found when I started looking into this in detail again. I think the page reference counting is actually correct. There are 2 references on each page: the first is from the find_or_create_page() call, and the second is from the get_user_pages() (which also makes sure the page is populated into the page tables). The only place I can see things going off the rails is if the get_user_pages() call fails. It's possible trinity could be arranging things so that the get_user_pages() call is failing somehow. Also, if it were a double free of a page, we should at least get a VM_BUG() occuring when the page's count is 0. Dave -- do you have CONFIG_DEBUG_VM on in your test rig? > Linus -- "Thought is the essence of where you are now." -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: bad page state in 3.13-rc4 2013-12-19 19:53 ` Benjamin LaHaise @ 2013-12-19 20:02 ` Linus Torvalds 2013-12-19 20:11 ` Linus Torvalds 2013-12-19 20:24 ` Dave Jones 1 sibling, 1 reply; 26+ messages in thread From: Linus Torvalds @ 2013-12-19 20:02 UTC (permalink / raw) To: Benjamin LaHaise Cc: Kent Overstreet, Dave Jones, Linux Kernel, linux-mm, Christoph Lameter, Al Viro [-- Attachment #1: Type: text/plain, Size: 870 bytes --] On Fri, Dec 20, 2013 at 4:53 AM, Benjamin LaHaise <bcrl@kvack.org> wrote: > > Yes, that's what I found when I started looking into this in detail again. > I think the page reference counting is actually correct. There are 2 > references on each page: the first is from the find_or_create_page() call, > and the second is from the get_user_pages() (which also makes sure the page > is populated into the page tables). Ok, I'm sorry, but that's just pure bullshit then. So it has the page array in the page cache, then mmap's it in, and uses get_user_pages() to get the pages back that it *just* created. This code is pure and utter garbage. It's beyond the pale how crazy it is. Why not just get rid of the idiotic get_user_pages() crap then? Something like the attached patch? Totally untested, but at least it makes *some* amount of sense. Linus [-- Attachment #2: patch.diff --] [-- Type: text/plain, Size: 1655 bytes --] fs/aio.c | 20 +++----------------- 1 file changed, 3 insertions(+), 17 deletions(-) diff --git a/fs/aio.c b/fs/aio.c index 6efb7f6cb22e..e1b02dd1be9e 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -358,6 +358,8 @@ static int aio_setup_ring(struct kioctx *ctx) SetPageUptodate(page); SetPageDirty(page); unlock_page(page); + + ctx->ring_pages[i] = page; } ctx->aio_ring_file = file; nr_events = (PAGE_SIZE * nr_pages - sizeof(struct aio_ring)) @@ -380,8 +382,8 @@ static int aio_setup_ring(struct kioctx *ctx) ctx->mmap_base = do_mmap_pgoff(ctx->aio_ring_file, 0, ctx->mmap_size, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE, 0, &populate); + up_write(&mm->mmap_sem); if (IS_ERR((void *)ctx->mmap_base)) { - up_write(&mm->mmap_sem); ctx->mmap_size = 0; aio_free_ring(ctx); return -EAGAIN; @@ -389,22 +391,6 @@ static int aio_setup_ring(struct kioctx *ctx) pr_debug("mmap address: 0x%08lx\n", ctx->mmap_base); - /* We must do this while still holding mmap_sem for write, as we - * need to be protected against userspace attempting to mremap() - * or munmap() the ring buffer. - */ - ctx->nr_pages = get_user_pages(current, mm, ctx->mmap_base, nr_pages, - 1, 0, ctx->ring_pages, NULL); - - /* Dropping the reference here is safe as the page cache will hold - * onto the pages for us. It is also required so that page migration - * can unmap the pages and get the right reference count. - */ - for (i = 0; i < ctx->nr_pages; i++) - put_page(ctx->ring_pages[i]); - - up_write(&mm->mmap_sem); - if (unlikely(ctx->nr_pages != nr_pages)) { aio_free_ring(ctx); return -EAGAIN; ^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: bad page state in 3.13-rc4 2013-12-19 20:02 ` Linus Torvalds @ 2013-12-19 20:11 ` Linus Torvalds 2013-12-19 20:31 ` Benjamin LaHaise 2013-12-19 20:31 ` Linus Torvalds 0 siblings, 2 replies; 26+ messages in thread From: Linus Torvalds @ 2013-12-19 20:11 UTC (permalink / raw) To: Benjamin LaHaise Cc: Kent Overstreet, Dave Jones, Linux Kernel, linux-mm, Christoph Lameter, Al Viro [-- Attachment #1: Type: text/plain, Size: 589 bytes --] On Fri, Dec 20, 2013 at 5:02 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Why not just get rid of the idiotic get_user_pages() crap then? > Something like the attached patch? > > Totally untested, but at least it makes *some* amount of sense. Ok, that can't work, since the ring_pages[] allocation happens later. So that part needs to be moved up, and it needs to initialize 'nr_pages'. So here's the same patch, but with stuff moved around a bit, and the "oops, couldn't create page" part fixed. Bit it's still totally and entirely untested. Linus [-- Attachment #2: patch.diff --] [-- Type: text/plain, Size: 2772 bytes --] fs/aio.c | 54 +++++++++++++++++++++--------------------------------- 1 file changed, 21 insertions(+), 33 deletions(-) diff --git a/fs/aio.c b/fs/aio.c index 6efb7f6cb22e..3e857e98fb87 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -347,6 +347,20 @@ static int aio_setup_ring(struct kioctx *ctx) return -EAGAIN; } + ctx->aio_ring_file = file; + nr_events = (PAGE_SIZE * nr_pages - sizeof(struct aio_ring)) + / sizeof(struct io_event); + + ctx->ring_pages = ctx->internal_pages; + if (nr_pages > AIO_RING_PAGES) { + ctx->ring_pages = kcalloc(nr_pages, sizeof(struct page *), + GFP_KERNEL); + if (!ctx->ring_pages) { + put_aio_ring_file(ctx); + return -ENOMEM; + } + } + for (i = 0; i < nr_pages; i++) { struct page *page; page = find_or_create_page(file->f_inode->i_mapping, @@ -358,19 +372,14 @@ static int aio_setup_ring(struct kioctx *ctx) SetPageUptodate(page); SetPageDirty(page); unlock_page(page); + + ctx->ring_pages[i] = page; } - ctx->aio_ring_file = file; - nr_events = (PAGE_SIZE * nr_pages - sizeof(struct aio_ring)) - / sizeof(struct io_event); + ctx->nr_pages = i; - ctx->ring_pages = ctx->internal_pages; - if (nr_pages > AIO_RING_PAGES) { - ctx->ring_pages = kcalloc(nr_pages, sizeof(struct page *), - GFP_KERNEL); - if (!ctx->ring_pages) { - put_aio_ring_file(ctx); - return -ENOMEM; - } + if (unlikely(i != nr_pages)) { + aio_free_ring(ctx); + return -EAGAIN; } ctx->mmap_size = nr_pages * PAGE_SIZE; @@ -380,8 +389,8 @@ static int aio_setup_ring(struct kioctx *ctx) ctx->mmap_base = do_mmap_pgoff(ctx->aio_ring_file, 0, ctx->mmap_size, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE, 0, &populate); + up_write(&mm->mmap_sem); if (IS_ERR((void *)ctx->mmap_base)) { - up_write(&mm->mmap_sem); ctx->mmap_size = 0; aio_free_ring(ctx); return -EAGAIN; @@ -389,27 +398,6 @@ static int aio_setup_ring(struct kioctx *ctx) pr_debug("mmap address: 0x%08lx\n", ctx->mmap_base); - /* We must do this while still holding mmap_sem for write, as we - * need to be protected against userspace attempting to mremap() - * or munmap() the ring buffer. - */ - ctx->nr_pages = get_user_pages(current, mm, ctx->mmap_base, nr_pages, - 1, 0, ctx->ring_pages, NULL); - - /* Dropping the reference here is safe as the page cache will hold - * onto the pages for us. It is also required so that page migration - * can unmap the pages and get the right reference count. - */ - for (i = 0; i < ctx->nr_pages; i++) - put_page(ctx->ring_pages[i]); - - up_write(&mm->mmap_sem); - - if (unlikely(ctx->nr_pages != nr_pages)) { - aio_free_ring(ctx); - return -EAGAIN; - } - ctx->user_id = ctx->mmap_base; ctx->nr_events = nr_events; /* trusted copy */ ^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: bad page state in 3.13-rc4 2013-12-19 20:11 ` Linus Torvalds @ 2013-12-19 20:31 ` Benjamin LaHaise 2013-12-19 20:31 ` Linus Torvalds 1 sibling, 0 replies; 26+ messages in thread From: Benjamin LaHaise @ 2013-12-19 20:31 UTC (permalink / raw) To: Linus Torvalds Cc: Kent Overstreet, Dave Jones, Linux Kernel, linux-mm, Christoph Lameter, Al Viro On Fri, Dec 20, 2013 at 05:11:12AM +0900, Linus Torvalds wrote: > On Fri, Dec 20, 2013 at 5:02 AM, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > Why not just get rid of the idiotic get_user_pages() crap then? > > Something like the attached patch? > > > > Totally untested, but at least it makes *some* amount of sense. > > Ok, that can't work, since the ring_pages[] allocation happens later. > So that part needs to be moved up, and it needs to initialize > 'nr_pages'. > > So here's the same patch, but with stuff moved around a bit, and the > "oops, couldn't create page" part fixed. > > Bit it's still totally and entirely untested. That looks much better. I think the following is also needed to nail down the migratepage operation as well. I'll give these two a few tests together. -ben -- "Thought is the essence of where you are now." diff --git a/fs/aio.c b/fs/aio.c index 6efb7f6..eec0ae4 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -244,8 +244,13 @@ static void aio_free_ring(struct kioctx *ctx) int i; for (i = 0; i < ctx->nr_pages; i++) { + struct page *page; pr_debug("pid(%d) [%d] page->count=%d\n", current->pid, i, page_count(ctx->ring_pages[i])); + page = ctx->ring_pages[i]; + if (!page) + continue; + ctx->ring_pages[i] = NULL; put_page(ctx->ring_pages[i]); } @@ -280,18 +285,42 @@ static int aio_migratepage(struct address_space *mapping, struct page *new, unsigned long flags; int rc; + /* Serialize access to the old page */ + if (!trylock_page(old)) + return -EAGAIN; + + rc = 0; + + spin_lock(&mapping->private_lock); + ctx = mapping->private_data; + if (ctx) { + pgoff_t idx; + spin_lock_irqsave(&ctx->completion_lock, flags); + idx = old->index; + if (idx < (pgoff_t)ctx->nr_pages) { + if (ctx->ring_pages[idx] != old) + rc = -EAGAIN; + } else + rc = -EINVAL; + spin_unlock_irqrestore(&ctx->completion_lock, flags); + } else + rc = -EINVAL; + spin_unlock(&mapping->private_lock); + + if (rc != 0) + return rc; + /* Writeback must be complete */ BUG_ON(PageWriteback(old)); - put_page(old); + get_page(new); - rc = migrate_page_move_mapping(mapping, new, old, NULL, mode); + rc = migrate_page_move_mapping(mapping, new, old, NULL, mode, 1); if (rc != MIGRATEPAGE_SUCCESS) { - get_page(old); + unlock_page(old); + put_page(new); return rc; } - get_page(new); - /* We can potentially race against kioctx teardown here. Use the * address_space's private data lock to protect the mapping's * private_data. @@ -305,10 +334,16 @@ static int aio_migratepage(struct address_space *mapping, struct page *new, idx = old->index; if (idx < (pgoff_t)ctx->nr_pages) ctx->ring_pages[idx] = new; + else + rc = -EINVAL; spin_unlock_irqrestore(&ctx->completion_lock, flags); } else rc = -EBUSY; spin_unlock(&mapping->private_lock); + unlock_page(old); + + if (rc == MIGRATEPAGE_SUCCESS) + put_page(old); return rc; } diff --git a/include/linux/migrate.h b/include/linux/migrate.h index b7717d7..f015c05 100644 --- a/include/linux/migrate.h +++ b/include/linux/migrate.h @@ -55,7 +55,8 @@ extern int migrate_huge_page_move_mapping(struct address_space *mapping, struct page *newpage, struct page *page); extern int migrate_page_move_mapping(struct address_space *mapping, struct page *newpage, struct page *page, - struct buffer_head *head, enum migrate_mode mode); + struct buffer_head *head, enum migrate_mode mode, + int extra_count); #else static inline void putback_lru_pages(struct list_head *l) {} diff --git a/mm/migrate.c b/mm/migrate.c index e9b7102..e73823e 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -317,14 +317,15 @@ static inline bool buffer_migrate_lock_buffers(struct buffer_head *head, */ int migrate_page_move_mapping(struct address_space *mapping, struct page *newpage, struct page *page, - struct buffer_head *head, enum migrate_mode mode) + struct buffer_head *head, enum migrate_mode mode, + int extra_count) { int expected_count = 0; void **pslot; if (!mapping) { /* Anonymous page without mapping */ - if (page_count(page) != 1) + if (page_count(page) != (expected_count + 1)) return -EAGAIN; return MIGRATEPAGE_SUCCESS; } @@ -584,7 +585,7 @@ int migrate_page(struct address_space *mapping, BUG_ON(PageWriteback(page)); /* Writeback must be complete */ - rc = migrate_page_move_mapping(mapping, newpage, page, NULL, mode); + rc = migrate_page_move_mapping(mapping, newpage, page, NULL, mode, 0); if (rc != MIGRATEPAGE_SUCCESS) return rc; @@ -611,7 +612,7 @@ int buffer_migrate_page(struct address_space *mapping, head = page_buffers(page); - rc = migrate_page_move_mapping(mapping, newpage, page, head, mode); + rc = migrate_page_move_mapping(mapping, newpage, page, head, mode, 0); if (rc != MIGRATEPAGE_SUCCESS) return rc; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: bad page state in 3.13-rc4 2013-12-19 20:11 ` Linus Torvalds 2013-12-19 20:31 ` Benjamin LaHaise @ 2013-12-19 20:31 ` Linus Torvalds 2013-12-19 20:42 ` Benjamin LaHaise 1 sibling, 1 reply; 26+ messages in thread From: Linus Torvalds @ 2013-12-19 20:31 UTC (permalink / raw) To: Benjamin LaHaise Cc: Kent Overstreet, Dave Jones, Linux Kernel, linux-mm, Christoph Lameter, Al Viro On Fri, Dec 20, 2013 at 5:11 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > So here's the same patch, but with stuff moved around a bit, and the > "oops, couldn't create page" part fixed. > > Bit it's still totally and entirely untested. Btw, I think this actually fixes a bug, in that it doesn't leak the page reference count if the do_mmap_pgoff() call fails. That said, that looks like just a memory leak, not explaining the problem Dave sees. And maybe I'm missing something. And no, I still haven't actually tested this at all. Is there an aio tester that is worth trying? Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: bad page state in 3.13-rc4 2013-12-19 20:31 ` Linus Torvalds @ 2013-12-19 20:42 ` Benjamin LaHaise 0 siblings, 0 replies; 26+ messages in thread From: Benjamin LaHaise @ 2013-12-19 20:42 UTC (permalink / raw) To: Linus Torvalds Cc: Kent Overstreet, Dave Jones, Linux Kernel, linux-mm, Christoph Lameter, Al Viro On Fri, Dec 20, 2013 at 05:31:29AM +0900, Linus Torvalds wrote: > On Fri, Dec 20, 2013 at 5:11 AM, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > So here's the same patch, but with stuff moved around a bit, and the > > "oops, couldn't create page" part fixed. > > > > Bit it's still totally and entirely untested. > > Btw, I think this actually fixes a bug, in that it doesn't leak the > page reference count if the do_mmap_pgoff() call fails. > > That said, that looks like just a memory leak, not explaining the > problem Dave sees. And maybe I'm missing something. > > And no, I still haven't actually tested this at all. Is there an aio > tester that is worth trying? There are a few tests in the libaio source, some in xfstests and fio, and a few I've got sitting around. To specifically exercise the page migration code path, there's a test at http://www.kvack.org/~bcrl/aio-numa-test.c . -ben > Linus -- "Thought is the essence of where you are now." -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: bad page state in 3.13-rc4 2013-12-19 19:53 ` Benjamin LaHaise 2013-12-19 20:02 ` Linus Torvalds @ 2013-12-19 20:24 ` Dave Jones 2013-12-19 23:38 ` Benjamin LaHaise 1 sibling, 1 reply; 26+ messages in thread From: Dave Jones @ 2013-12-19 20:24 UTC (permalink / raw) To: Benjamin LaHaise Cc: Linus Torvalds, Kent Overstreet, Linux Kernel, linux-mm, Christoph Lameter, Al Viro On Thu, Dec 19, 2013 at 02:53:52PM -0500, Benjamin LaHaise wrote: > is populated into the page tables). The only place I can see things going > off the rails is if the get_user_pages() call fails. It's possible trinity > could be arranging things so that the get_user_pages() call is failing > somehow. Also, if it were a double free of a page, we should at least get > a VM_BUG() occuring when the page's count is 0. > > Dave -- do you have CONFIG_DEBUG_VM on in your test rig? Yes. Note the original trace in this thread was a VM_BUG_ON(atomic_read(&page->_count) <= 0); Right after these crashes btw, the box locks up solid. So bad that traces don't always make it over usb-serial. Annoying. Dave -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: bad page state in 3.13-rc4 2013-12-19 20:24 ` Dave Jones @ 2013-12-19 23:38 ` Benjamin LaHaise 2013-12-20 1:00 ` Dave Jones 0 siblings, 1 reply; 26+ messages in thread From: Benjamin LaHaise @ 2013-12-19 23:38 UTC (permalink / raw) To: Dave Jones, Linus Torvalds, Kent Overstreet, Linux Kernel, linux-mm, Christoph Lameter, Al Viro On Thu, Dec 19, 2013 at 03:24:16PM -0500, Dave Jones wrote: > Yes. Note the original trace in this thread was a VM_BUG_ON(atomic_read(&page->_count) <= 0); > > Right after these crashes btw, the box locks up solid. So bad that traces don't > always make it over usb-serial. Annoying. I think I finally have an idea what's going on now. Kent's changes in e34ecee2ae791df674dfb466ce40692ca6218e43 are broken and result in a memory leak of the aio kioctx. This eventually leads to the system running out of memory, which ends up triggering the otherwise hard to hit error paths in aio_setup_ring(). Linus' suggested changes should fix the badness in the aio_setup_ring(), but more work has to be done to fix up the percpu reference counting tie in with the aio code. I'll fix this up in the morning if nobody beats me to it over night, as I'm just heading out right now. -ben > Dave > -- "Thought is the essence of where you are now." -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: bad page state in 3.13-rc4 2013-12-19 23:38 ` Benjamin LaHaise @ 2013-12-20 1:00 ` Dave Jones 2013-12-21 23:06 ` [PATCHes - aio / migrate page, please review] " Benjamin LaHaise 0 siblings, 1 reply; 26+ messages in thread From: Dave Jones @ 2013-12-20 1:00 UTC (permalink / raw) To: Benjamin LaHaise Cc: Linus Torvalds, Kent Overstreet, Linux Kernel, linux-mm, Christoph Lameter, Al Viro On Thu, Dec 19, 2013 at 06:38:54PM -0500, Benjamin LaHaise wrote: > On Thu, Dec 19, 2013 at 03:24:16PM -0500, Dave Jones wrote: > > Yes. Note the original trace in this thread was a VM_BUG_ON(atomic_read(&page->_count) <= 0); > > > > Right after these crashes btw, the box locks up solid. So bad that traces don't > > always make it over usb-serial. Annoying. > > I think I finally have an idea what's going on now. Kent's changes in > e34ecee2ae791df674dfb466ce40692ca6218e43 are broken and result in a memory > leak of the aio kioctx. This eventually leads to the system running out of > memory, which ends up triggering the otherwise hard to hit error paths in > aio_setup_ring(). Linus' suggested changes should fix the badness in the > aio_setup_ring(), but more work has to be done to fix up the percpu > reference counting tie in with the aio code. I'll fix this up in the > morning if nobody beats me to it over night, as I'm just heading out right > now. That would explain why I'm having difficulty repeating it in a hurry if it takes hours of runtime for the leak to reach a point where it becomes a problem. Dave -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 26+ messages in thread
* [PATCHes - aio / migrate page, please review] Re: bad page state in 3.13-rc4 2013-12-20 1:00 ` Dave Jones @ 2013-12-21 23:06 ` Benjamin LaHaise 2013-12-22 19:09 ` Linus Torvalds 0 siblings, 1 reply; 26+ messages in thread From: Benjamin LaHaise @ 2013-12-21 23:06 UTC (permalink / raw) To: Dave Jones, Linus Torvalds, Kent Overstreet, Linux Kernel, linux-mm, Christoph Lameter, Al Viro [ Patches inline below for people to comment on ] On Thu, Dec 19, 2013 at 08:00:42PM -0500, Dave Jones wrote: > On Thu, Dec 19, 2013 at 06:38:54PM -0500, Benjamin LaHaise wrote: > > On Thu, Dec 19, 2013 at 03:24:16PM -0500, Dave Jones wrote: > > > Yes. Note the original trace in this thread was a VM_BUG_ON(atomic_read(&page->_count) <= 0); > > > > > > Right after these crashes btw, the box locks up solid. So bad that traces don't > > > always make it over usb-serial. Annoying. > > > > I think I finally have an idea what's going on now. Kent's changes in > > e34ecee2ae791df674dfb466ce40692ca6218e43 are broken and result in a memory > > leak of the aio kioctx. This eventually leads to the system running out of > > memory, which ends up triggering the otherwise hard to hit error paths in > > aio_setup_ring(). Linus' suggested changes should fix the badness in the > > aio_setup_ring(), but more work has to be done to fix up the percpu > > reference counting tie in with the aio code. I'll fix this up in the > > morning if nobody beats me to it over night, as I'm just heading out right > > now. > > That would explain why I'm having difficulty repeating it in a hurry if it > takes hours of runtime for the leak to reach a point where it becomes a problem. Okay, I've put the below two patches plus Linus's change through a round of tests, and it passes millions of iterations of the aio numa migratepage test, as well as a number of repetitions of a few simple read and write tests. The first patch fixes the memory leak Kent introduced, while the second patch makes aio_migratepage() much more paranoid and robust. Those two changes are in git://git.kvack.org/~bcrl/aio-next.git -- can a few other folks please review to make sure I haven't missed anything? Linus, feel free to add my Signed-off-by: to your sanitization of aio_setup_ring() as well, as it works okay in my testing. -ben -- "Thought is the essence of where you are now." aio.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) commit 1881686f842065d2f92ec9c6424830ffc17d23b0 Author: Benjamin LaHaise <bcrl@kvack.org> Date: Sat Dec 21 15:49:28 2013 -0500 aio: fix kioctx leak introduced by "aio: Fix a trinity splat" e34ecee2ae791df674dfb466ce40692ca6218e43 reworked the percpu reference counting to correct a bug trinity found. Unfortunately, the change lead to kioctxes being leaked because there was no final reference count to put. Add that reference count back in to fix things. Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Cc: stable@vger.kernel.org diff --git a/fs/aio.c b/fs/aio.c index 6efb7f6..fd1c0ba 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -652,7 +652,8 @@ static struct kioctx *ioctx_alloc(unsigned nr_events) aio_nr += ctx->max_reqs; spin_unlock(&aio_nr_lock); - percpu_ref_get(&ctx->users); /* io_setup() will drop this ref */ + percpu_ref_get(&ctx->users); /* io_setup() will drop this ref */ + percpu_ref_get(&ctx->reqs); /* free_ioctx_users() will drop this */ err = ioctx_add_table(ctx, mm); if (err) fs/aio.c | 52 ++++++++++++++++++++++++++++++++++++++++-------- include/linux/migrate.h | 3 +- mm/migrate.c | 13 ++++++------ 3 files changed, 53 insertions(+), 15 deletions(-) commit 8e321fefb0e60bae4e2a28d20fc4fa30758d27c6 Author: Benjamin LaHaise <bcrl@kvack.org> Date: Sat Dec 21 17:56:08 2013 -0500 aio/migratepages: make aio migrate pages sane The arbitrary restriction on page counts offered by the core migrate_page_move_mapping() code results in rather suspicious looking fiddling with page reference counts in the aio_migratepage() operation. To fix this, make migrate_page_move_mapping() take an extra_count parameter that allows aio to tell the code about its own reference count on the page being migrated. While cleaning up aio_migratepage(), make it validate that the old page being passed in is actually what aio_migratepage() expects to prevent misbehaviour in the case of races. Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> diff --git a/fs/aio.c b/fs/aio.c index fd1c0ba..efa708b 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -244,9 +244,14 @@ static void aio_free_ring(struct kioctx *ctx) int i; for (i = 0; i < ctx->nr_pages; i++) { + struct page *page; pr_debug("pid(%d) [%d] page->count=%d\n", current->pid, i, page_count(ctx->ring_pages[i])); - put_page(ctx->ring_pages[i]); + page = ctx->ring_pages[i]; + if (!page) + continue; + ctx->ring_pages[i] = NULL; + put_page(page); } put_aio_ring_file(ctx); @@ -280,18 +285,38 @@ static int aio_migratepage(struct address_space *mapping, struct page *new, unsigned long flags; int rc; + rc = 0; + + /* Make sure the old page hasn't already been changed */ + spin_lock(&mapping->private_lock); + ctx = mapping->private_data; + if (ctx) { + pgoff_t idx; + spin_lock_irqsave(&ctx->completion_lock, flags); + idx = old->index; + if (idx < (pgoff_t)ctx->nr_pages) { + if (ctx->ring_pages[idx] != old) + rc = -EAGAIN; + } else + rc = -EINVAL; + spin_unlock_irqrestore(&ctx->completion_lock, flags); + } else + rc = -EINVAL; + spin_unlock(&mapping->private_lock); + + if (rc != 0) + return rc; + /* Writeback must be complete */ BUG_ON(PageWriteback(old)); - put_page(old); + get_page(new); - rc = migrate_page_move_mapping(mapping, new, old, NULL, mode); + rc = migrate_page_move_mapping(mapping, new, old, NULL, mode, 1); if (rc != MIGRATEPAGE_SUCCESS) { - get_page(old); + put_page(new); return rc; } - get_page(new); - /* We can potentially race against kioctx teardown here. Use the * address_space's private data lock to protect the mapping's * private_data. @@ -303,13 +328,24 @@ static int aio_migratepage(struct address_space *mapping, struct page *new, spin_lock_irqsave(&ctx->completion_lock, flags); migrate_page_copy(new, old); idx = old->index; - if (idx < (pgoff_t)ctx->nr_pages) - ctx->ring_pages[idx] = new; + if (idx < (pgoff_t)ctx->nr_pages) { + /* And only do the move if things haven't changed */ + if (ctx->ring_pages[idx] == old) + ctx->ring_pages[idx] = new; + else + rc = -EAGAIN; + } else + rc = -EINVAL; spin_unlock_irqrestore(&ctx->completion_lock, flags); } else rc = -EBUSY; spin_unlock(&mapping->private_lock); + if (rc == MIGRATEPAGE_SUCCESS) + put_page(old); + else + put_page(new); + return rc; } #endif diff --git a/include/linux/migrate.h b/include/linux/migrate.h index b7717d7..f015c05 100644 --- a/include/linux/migrate.h +++ b/include/linux/migrate.h @@ -55,7 +55,8 @@ extern int migrate_huge_page_move_mapping(struct address_space *mapping, struct page *newpage, struct page *page); extern int migrate_page_move_mapping(struct address_space *mapping, struct page *newpage, struct page *page, - struct buffer_head *head, enum migrate_mode mode); + struct buffer_head *head, enum migrate_mode mode, + int extra_count); #else static inline void putback_lru_pages(struct list_head *l) {} diff --git a/mm/migrate.c b/mm/migrate.c index e9b7102..9194375 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -317,14 +317,15 @@ static inline bool buffer_migrate_lock_buffers(struct buffer_head *head, */ int migrate_page_move_mapping(struct address_space *mapping, struct page *newpage, struct page *page, - struct buffer_head *head, enum migrate_mode mode) + struct buffer_head *head, enum migrate_mode mode, + int extra_count) { - int expected_count = 0; + int expected_count = 1 + extra_count; void **pslot; if (!mapping) { /* Anonymous page without mapping */ - if (page_count(page) != 1) + if (page_count(page) != expected_count) return -EAGAIN; return MIGRATEPAGE_SUCCESS; } @@ -334,7 +335,7 @@ int migrate_page_move_mapping(struct address_space *mapping, pslot = radix_tree_lookup_slot(&mapping->page_tree, page_index(page)); - expected_count = 2 + page_has_private(page); + expected_count += 1 + page_has_private(page); if (page_count(page) != expected_count || radix_tree_deref_slot_protected(pslot, &mapping->tree_lock) != page) { spin_unlock_irq(&mapping->tree_lock); @@ -584,7 +585,7 @@ int migrate_page(struct address_space *mapping, BUG_ON(PageWriteback(page)); /* Writeback must be complete */ - rc = migrate_page_move_mapping(mapping, newpage, page, NULL, mode); + rc = migrate_page_move_mapping(mapping, newpage, page, NULL, mode, 0); if (rc != MIGRATEPAGE_SUCCESS) return rc; @@ -611,7 +612,7 @@ int buffer_migrate_page(struct address_space *mapping, head = page_buffers(page); - rc = migrate_page_move_mapping(mapping, newpage, page, head, mode); + rc = migrate_page_move_mapping(mapping, newpage, page, head, mode, 0); if (rc != MIGRATEPAGE_SUCCESS) return rc; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [PATCHes - aio / migrate page, please review] Re: bad page state in 3.13-rc4 2013-12-21 23:06 ` [PATCHes - aio / migrate page, please review] " Benjamin LaHaise @ 2013-12-22 19:09 ` Linus Torvalds 2013-12-22 21:30 ` Dave Jones 0 siblings, 1 reply; 26+ messages in thread From: Linus Torvalds @ 2013-12-22 19:09 UTC (permalink / raw) To: Benjamin LaHaise Cc: Dave Jones, Kent Overstreet, Linux Kernel, linux-mm, Christoph Lameter, Al Viro On Sat, Dec 21, 2013 at 3:06 PM, Benjamin LaHaise <bcrl@kvack.org> wrote: > > Linus, feel free to add my Signed-off-by: to your sanitization of > aio_setup_ring() as well, as it works okay in my testing. Nobody commented on your request for comments, so I applied my patch and pulled your branch, because I'm going to do -rc5 in a few and at least we want this to get testing. Dave, let's hope that the leak fixes and reference count fixes solve your problem. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCHes - aio / migrate page, please review] Re: bad page state in 3.13-rc4 2013-12-22 19:09 ` Linus Torvalds @ 2013-12-22 21:30 ` Dave Jones 0 siblings, 0 replies; 26+ messages in thread From: Dave Jones @ 2013-12-22 21:30 UTC (permalink / raw) To: Linus Torvalds Cc: Benjamin LaHaise, Kent Overstreet, Linux Kernel, linux-mm, Christoph Lameter, Al Viro On Sun, Dec 22, 2013 at 11:09:34AM -0800, Linus Torvalds wrote: > On Sat, Dec 21, 2013 at 3:06 PM, Benjamin LaHaise <bcrl@kvack.org> wrote: > > > > Linus, feel free to add my Signed-off-by: to your sanitization of > > aio_setup_ring() as well, as it works okay in my testing. > > Nobody commented on your request for comments, so I applied my patch > and pulled your branch, because I'm going to do -rc5 in a few and at > least we want this to get testing. > > Dave, let's hope that the leak fixes and reference count fixes solve > your problem. Yeah, I'm not really going to have time to run tests again until after the holidays. I'll shout when I get back if things still don't look right. (or more likely, if I find something else ;-) Dave -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 26+ messages in thread
end of thread, other threads:[~2013-12-22 21:38 UTC | newest] Thread overview: 26+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2013-12-19 4:07 bad page state in 3.13-rc4 Dave Jones 2013-12-19 4:40 ` Linus Torvalds 2013-12-19 15:41 ` Christoph Lameter 2013-12-19 20:11 ` Mel Gorman 2013-12-19 20:30 ` Dave Jones 2013-12-19 15:53 ` Dave Jones 2013-12-19 17:07 ` Linus Torvalds 2013-12-19 17:17 ` Dave Jones 2013-12-19 18:11 ` Kent Overstreet 2013-12-19 18:29 ` Benjamin LaHaise 2013-12-19 18:35 ` Dave Jones 2013-12-19 19:19 ` Linus Torvalds 2013-12-19 19:26 ` Benjamin LaHaise 2013-12-19 19:45 ` Linus Torvalds 2013-12-19 19:53 ` Benjamin LaHaise 2013-12-19 20:02 ` Linus Torvalds 2013-12-19 20:11 ` Linus Torvalds 2013-12-19 20:31 ` Benjamin LaHaise 2013-12-19 20:31 ` Linus Torvalds 2013-12-19 20:42 ` Benjamin LaHaise 2013-12-19 20:24 ` Dave Jones 2013-12-19 23:38 ` Benjamin LaHaise 2013-12-20 1:00 ` Dave Jones 2013-12-21 23:06 ` [PATCHes - aio / migrate page, please review] " Benjamin LaHaise 2013-12-22 19:09 ` Linus Torvalds 2013-12-22 21:30 ` Dave Jones
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).