linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* bad page state in 3.13-rc4
@ 2013-12-19  4:07 Dave Jones
  2013-12-19  4:40 ` Linus Torvalds
  0 siblings, 1 reply; 26+ messages in thread
From: Dave Jones @ 2013-12-19  4:07 UTC (permalink / raw)
  To: Linux Kernel; +Cc: linux-mm, Linus Torvalds

Just hit this while fuzzing with lots of child processes. 
(trinity -C128)

BUG: Bad page state in process trinity-c93  pfn:100499
------------[ cut here ]------------
kernel BUG at include/linux/mm.h:439!
invalid opcode: 0000 [#1] PREEMPT SMP 
Modules linked in: dlci sctp snd_seq_dummy hidp fuse rfcomm bnep tun can_raw can_bcm bluetooth can rose phonet pppoe pppox ppp_generic slhc llc2 af_rxrpc af_key netrom caif_socket caif ipt_ULOG nfnetlink nfc af_802154 irda crc_ccitt rds scsi_transport_iscsi x25 atm appletalk ipx p8023 psnap p8022 llc ax25 cfg80211 rfkill xfs snd_hda_codec_hdmi snd_hda_codec_realtek libcrc32c snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_page_alloc snd_timer snd shpchp soundcore coretemp hwmon x86_pkg_temp_thermal kvm_intel kvm crct10dif_pclmul crc32c_intel ghash_clmulni_intel microcode serio_raw pcspkr usb_debug e1000e ptp pps_core
CPU: 0 PID: 4408 Comm: trinity-c39 Not tainted 3.13.0-rc4+ #5 
task: ffff88021b1d5be0 ti: ffff88011fd40000 task.ti: ffff88011fd40000
RIP: 0010:[<ffffffff816d8f89>]  [<ffffffff816d8f89>] get_page.part.20+0x4/0x6
RSP: 0018:ffff88011fd418c8  EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff88024e20f780 RCX: 0000000000017151
RDX: 0000000000000000 RSI: 0000000000000017 RDI: 0000000000000001
RBP: ffff88011fd418c8 R08: fffffffffffffffd R09: 00000000000170f8
R10: ffff88024e5d3e80 R11: 0000000000000017 R12: ffffea0004012640
R13: 0000000000000000 R14: ffffea0004184200 R15: 0000000000000000
FS:  00007f2203b5e740(0000) GS:ffff88024e200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000018ca000 CR3: 000000014dcf4000 CR4: 00000000001407f0
DR0: 0000000000f88000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
Stack:
 ffff88011fd418e8 ffffffff81141385 ffffea0004012640 ffffea0004012642
 ffff88011fd418f8 ffffffff8114161c ffff88011fd41920 ffffffff81145572
 ffffea0004012640 ffffea0004012600 ffffea0004012660 ffff88011fd419a8
Call Trace:
 [<ffffffff81141385>] __lru_cache_add+0xa5/0xc0
 [<ffffffff8114161c>] lru_cache_add+0x1c/0x30
 [<ffffffff81145572>] putback_lru_page+0x72/0xb0
 [<ffffffff8118c3f9>] migrate_pages+0x479/0x780
 [<ffffffff81155480>] ? isolate_freepages_block+0x360/0x360
 [<ffffffff8115638a>] compact_zone+0x2aa/0x460
 [<ffffffff8109ff90>] ? debug_check_no_locks_freed+0xb0/0x150
 [<ffffffff811565d4>] compact_zone_order+0x94/0xe0
 [<ffffffff81156911>] try_to_compact_pages+0xe1/0x110
 [<ffffffff816d8c88>] __alloc_pages_direct_compact+0xac/0x1d0
 [<ffffffff8113c86a>] __alloc_pages_nodemask+0x90a/0xab0
 [<ffffffff8117e1b1>] alloc_pages_vma+0xf1/0x1b0
 [<ffffffff81190e0d>] ? do_huge_pmd_anonymous_page+0xfd/0x3a0
 [<ffffffff81190e0d>] do_huge_pmd_anonymous_page+0xfd/0x3a0
 [<ffffffff8115beec>] ? follow_page_mask+0x24c/0x510
 [<ffffffff8115d919>] handle_mm_fault+0x479/0xbb0
 [<ffffffff816e5591>] ? _raw_spin_unlock+0x31/0x50
 [<ffffffff8115e1fe>] __get_user_pages+0x1ae/0x5f0
 [<ffffffff8116028c>] __mlock_vma_pages_range+0x8c/0xa0
 [<ffffffff81160a00>] __mm_populate+0xc0/0x150
 [<ffffffff8114f486>] vm_mmap_pgoff+0xb6/0xc0
 [<ffffffff81162d16>] SyS_mmap_pgoff+0x116/0x270
 [<ffffffff81010495>] ? syscall_trace_enter+0x145/0x270
 [<ffffffff81007b52>] SyS_mmap+0x22/0x30
 [<ffffffff816edb24>] tracesys+0xdd/0xe2
Code: 48 c8 48 85 d2 48 0f 49 c2 48 01 c8 49 89 06 58 5b 41 5c 41 5d 41 5e 41 5f 5d c3 55 48 89 e5 0f 0b 55 48 89 e5 0f 0b 55 48 89 e5 <0f> 0b 80 3d 59 6a 60 00 00 75 1d 55 be 6c 00 00 00 48 c7 c7 e6 
RIP  [<ffffffff816d8f89>] get_page.part.20+0x4/0x6
 RSP <ffff88011fd418c8>
page:ffffea0004012640 count:0 mapcount:0 mapping:          (null) index:0x389
page flags: 0x2000000000000c(referenced|uptodate)
Modules linked in: dlci sctp snd_seq_dummy hidp fuse rfcomm bnep tun can_raw can_bcm bluetooth can rose phonet pppoe pppox ppp_generic slhc llc2 af_rxrpc af_key netrom caif_socket caif ipt_ULOG nfnetlink nfc af_802154 irda crc_ccitt rds scsi_transport_iscsi x25 atm appletalk ipx p8023 psnap p8022 llc ax25 cfg80211 rfkill xfs snd_hda_codec_hdmi snd_hda_codec_realtek libcrc32c snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_page_alloc snd_timer snd shpchp soundcore coretemp hwmon x86_pkg_temp_thermal kvm_intel kvm crct10dif_pclmul crc32c_intel ghash_clmulni_intel microcode serio_raw pcspkr usb_debug e1000e ptp pps_core
CPU: 2 PID: 4395 Comm: trinity-c93 Tainted: G      D      3.13.0-rc4+ #5 
 0000000000000000 ffff88004517fde8 ffffffff816db2f5 ffffea0004012640
 ffff88004517fe00 ffffffff816d8b05 ffffea0004012640 ffff88004517fe38
 ffffffff8113a645 ffffea0004012640 0000000000000000 002000000000001d
Call Trace:
 [<ffffffff816db2f5>] dump_stack+0x4e/0x7a
 [<ffffffff816d8b05>] bad_page.part.71+0xcf/0xe8
 [<ffffffff8113a645>] free_pages_prepare+0x185/0x190
 [<ffffffff8113b085>] free_hot_cold_page+0x35/0x180
 [<ffffffff811403f3>] __put_single_page+0x23/0x30
 [<ffffffff81140665>] put_page+0x35/0x50
 [<ffffffff811e8705>] aio_free_ring+0x55/0xf0
 [<ffffffff811e9c5a>] SyS_io_setup+0x59a/0xbe0
 [<ffffffff816edb24>] tracesys+0xdd/0xe2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: bad page state in 3.13-rc4
  2013-12-19  4:07 bad page state in 3.13-rc4 Dave Jones
@ 2013-12-19  4:40 ` Linus Torvalds
  2013-12-19 15:41   ` Christoph Lameter
  2013-12-19 15:53   ` Dave Jones
  0 siblings, 2 replies; 26+ messages in thread
From: Linus Torvalds @ 2013-12-19  4:40 UTC (permalink / raw)
  To: Dave Jones, Linux Kernel, linux-mm, Christoph Lameter

On Wed, Dec 18, 2013 at 8:07 PM, Dave Jones <davej@redhat.com> wrote:
> Just hit this while fuzzing with lots of child processes.
> (trinity -C128)

Ok, there's a BUG_ON() in the middle, the "bad page" part is just this:

> BUG: Bad page state in process trinity-c93  pfn:100499
> page:ffffea0004012640 count:0 mapcount:0 mapping:          (null) index:0x389
> page flags: 0x2000000000000c(referenced|uptodate)
> Call Trace:
>  [<ffffffff816db2f5>] dump_stack+0x4e/0x7a
>  [<ffffffff816d8b05>] bad_page.part.71+0xcf/0xe8
>  [<ffffffff8113a645>] free_pages_prepare+0x185/0x190
>  [<ffffffff8113b085>] free_hot_cold_page+0x35/0x180
>  [<ffffffff811403f3>] __put_single_page+0x23/0x30
>  [<ffffffff81140665>] put_page+0x35/0x50
>  [<ffffffff811e8705>] aio_free_ring+0x55/0xf0
>  [<ffffffff811e9c5a>] SyS_io_setup+0x59a/0xbe0
>  [<ffffffff816edb24>] tracesys+0xdd/0xe2

at free_pages() time, and I don't see anything bad in the printout wrt
the page counts of flags.

Which makes me wonder if this is mem_cgroup_bad_page_check()
triggering. Of course, if it's a race, it may be that by the time we
print out the counts they all look good, even if they weren't good at
the time we did that bad_page() *check*.

And the fact that we do have a concurrent BUG_ON() triggering with a
zero page count obviously does look suspicious. Looks like a possible
race with memory compaction happening at the same time aio_free_ring()
frees the page.

Somebody who knows the migration code needs to look at this. ChristophL?

                        Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: bad page state in 3.13-rc4
  2013-12-19  4:40 ` Linus Torvalds
@ 2013-12-19 15:41   ` Christoph Lameter
  2013-12-19 20:11     ` Mel Gorman
  2013-12-19 15:53   ` Dave Jones
  1 sibling, 1 reply; 26+ messages in thread
From: Christoph Lameter @ 2013-12-19 15:41 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Dave Jones, Mel Gorman, Linux Kernel, linux-mm

On Wed, 18 Dec 2013, Linus Torvalds wrote:

> Somebody who knows the migration code needs to look at this. ChristophL?

Its been awhile sorry and there has been a huge amount of work done on top
of my earlier work. Cannot debug that anymore and I am finding myself in
the role of the old guy who just complains a lot. Some of that
functionality seems bizarre to me like the on the fly conversion between
huge pages and regular pages, weird and complex page count handling etc
etc.

The last time I looked at the code I was horrified to find that the new
huge page migration does not use migration ptes to create a cooldown phase
but directly swaps the pmd. That used to cause huge problems with regular
pages in the past. But I was told that was all safe. Mel?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: bad page state in 3.13-rc4
  2013-12-19  4:40 ` Linus Torvalds
  2013-12-19 15:41   ` Christoph Lameter
@ 2013-12-19 15:53   ` Dave Jones
  2013-12-19 17:07     ` Linus Torvalds
  1 sibling, 1 reply; 26+ messages in thread
From: Dave Jones @ 2013-12-19 15:53 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Linux Kernel, linux-mm, Christoph Lameter

On Wed, Dec 18, 2013 at 08:40:07PM -0800, Linus Torvalds wrote:
 > On Wed, Dec 18, 2013 at 8:07 PM, Dave Jones <davej@redhat.com> wrote:
 > > Just hit this while fuzzing with lots of child processes.
 > > (trinity -C128)
 > 
 > Ok, there's a BUG_ON() in the middle, the "bad page" part is just this:
 > 
 > > BUG: Bad page state in process trinity-c93  pfn:100499
 > > page:ffffea0004012640 count:0 mapcount:0 mapping:          (null) index:0x389
 > > page flags: 0x2000000000000c(referenced|uptodate)
 > > Call Trace:
 > >  [<ffffffff816db2f5>] dump_stack+0x4e/0x7a
 > >  [<ffffffff816d8b05>] bad_page.part.71+0xcf/0xe8
 > >  [<ffffffff8113a645>] free_pages_prepare+0x185/0x190
 > >  [<ffffffff8113b085>] free_hot_cold_page+0x35/0x180
 > >  [<ffffffff811403f3>] __put_single_page+0x23/0x30
 > >  [<ffffffff81140665>] put_page+0x35/0x50
 > >  [<ffffffff811e8705>] aio_free_ring+0x55/0xf0
 > >  [<ffffffff811e9c5a>] SyS_io_setup+0x59a/0xbe0
 > >  [<ffffffff816edb24>] tracesys+0xdd/0xe2
 > 
 > at free_pages() time, and I don't see anything bad in the printout wrt
 > the page counts of flags.

Overnight run hit another bad page state with a different trace.

WARNING: CPU: 2 PID: 14107 at mm/truncate.c:331 truncate_inode_pages_range+0x5e4/0x610()
Modules linked in: snd_seq_dummy fuse sctp tun 8021q garp stp hidp nfnetlink bnep rfcomm scsi_transport_iscsi nfc caif_socket caif af_802154 phonet af_rxrpc can_bcm can_raw bluetooth
 can llc2 pppoe pppox ppp_generic slhc irda crc_ccitt rds ipt_ULOG af_key rose x25 atm netrom appletalk ipx p8023 psnap p8022 llc ax25 cfg80211 rfkill xfs libcrc32c snd_hda_codec_realtek snd_hda_co
dec_hdmi coretemp hwmon x86_pkg_temp_thermal kvm_intel kvm crct10dif_pclmul crc32c_intel ghash_clmulni_intel microcode serio_raw pcspkr usb_debug snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_s
eq_device e1000e ptp shpchp snd_pcm pps_core snd_page_alloc[31342.836846] BUG: Bad page state in process modprobe  pfn:10f05c
page:ffffea00043c1700 count:0 mapcount:0 mapping:          (null) index:0x0
page flags: 0x20000000000001(locked)
Modules linked in: snd_seq_dummy fuse sctp tun 8021q garp stp hidp nfnetlink bnep rfcomm scsi_transport_iscsi nfc caif_socket caif af_802154 phonet af_rxrpc can_bcm can_raw bluetooth
 can llc2 pppoe pppox ppp_generic slhc irda crc_ccitt rds ipt_ULOG af_key rose x25 atm netrom appletalk ipx p8023 psnap p8022 llc ax25 cfg80211 rfkill xfs libcrc32c snd_hda_codec_realtek snd_hda_co
dec_hdmi coretemp hwmon x86_pkg_temp_thermal kvm_intel kvm crct10dif_pclmul crc32c_intel ghash_clmulni_intel microcode serio_raw pcspkr usb_debug snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_s
eq_device e1000e ptp shpchp snd_pcm pps_core snd_page_alloc snd_timer snd soundcore
CPU: 0 PID: 14132 Comm: modprobe Not tainted 3.13.0-rc4+ #5
 ffff88024e217108 ffff88000870fad0 ffffffff816db2f5 ffffea00043c1700
 ffff88000870fae8 ffffffff816d8b05 ffff88024e2170f8 ffff88000870fbd0
 ffffffff8113be1e 0000000000000000 0000000000000000 ffff88024e5d4d08
Call Trace:
 [<ffffffff816db2f5>] dump_stack+0x4e/0x7a
 [<ffffffff816d8b05>] bad_page.part.71+0xcf/0xe8
 [<ffffffff8113be1e>] get_page_from_freelist+0x80e/0x950
 [<ffffffff8117e1b1>] ? alloc_pages_vma+0xf1/0x1b0
 [<ffffffff8117c6e6>] ? alloc_pages_current+0x106/0x1f0
 [<ffffffff8113c178>] __alloc_pages_nodemask+0x218/0xab0
 [<ffffffff81132c55>] ? find_get_page+0x5/0x110
 [<ffffffff8103cf27>] ? pte_alloc_one+0x17/0x70
 [<ffffffff8117c6e6>] alloc_pages_current+0x106/0x1f0
 [<ffffffff8103cf27>] ? pte_alloc_one+0x17/0x70
 [<ffffffff8103cf27>] pte_alloc_one+0x17/0x70
 [<ffffffff8115a027>] __pte_alloc+0x27/0x130
 [<ffffffff8115df9c>] handle_mm_fault+0xafc/0xbb0
 [<ffffffff816e9321>] ? __do_page_fault+0x101/0x610
 [<ffffffff816e938f>] __do_page_fault+0x16f/0x610
 [<ffffffff810f7d7f>] ? __acct_update_integrals+0x7f/0x100
 [<ffffffff816e5591>] ? _raw_spin_unlock+0x31/0x50
 [<ffffffff8108c381>] ? vtime_account_user+0x91/0xa0
 [<ffffffff811318bb>] ? context_tracking_user_exit+0x9b/0x100
 [<ffffffff816e984a>] do_page_fault+0x1a/0x70
 [<ffffffff816e6492>] page_fault+0x22/0x30

CPU: 2 PID: 14107 Comm: trinity-c88 Tainted: G    B        3.13.0-rc4+ #5
 ffffffff81a28ae4 ffff8801577d7d30 ffffffff816db2f5 0000000000000000
 ffff8801577d7d68 ffffffff810529ad ffffffffffffffff ffff8801577d7da0
 0000000000000000 0000000000000001 ffffea00043c1700 ffff8801577d7d78
Call Trace:
 [<ffffffff816db2f5>] dump_stack+0x4e/0x7a
 [<ffffffff810529ad>] warn_slowpath_common+0x7d/0xa0
 [<ffffffff81052a8a>] warn_slowpath_null+0x1a/0x20
 [<ffffffff81142834>] truncate_inode_pages_range+0x5e4/0x610
 [<ffffffff811428c7>] truncate_pagecache+0x47/0x60
 [<ffffffff811428f2>] truncate_setsize+0x12/0x20
 [<ffffffff811e8652>] put_aio_ring_file.isra.11+0x22/0x80
 [<ffffffff811e871f>] aio_free_ring+0x6f/0xf0
 [<ffffffff811e9c5a>] SyS_io_setup+0x59a/0xbe0
 [<ffffffff816edb24>] tracesys+0xdd/0xe2

Interesting that CPU2 was doing sys_io_setup again. Different trace though.

	Dave

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: bad page state in 3.13-rc4
  2013-12-19 15:53   ` Dave Jones
@ 2013-12-19 17:07     ` Linus Torvalds
  2013-12-19 17:17       ` Dave Jones
  2013-12-19 18:11       ` Kent Overstreet
  0 siblings, 2 replies; 26+ messages in thread
From: Linus Torvalds @ 2013-12-19 17:07 UTC (permalink / raw)
  To: Dave Jones, Linus Torvalds, Linux Kernel, linux-mm,
	Christoph Lameter, Benjamin LaHaise, Kent Overstreet, Al Viro

On Thu, Dec 19, 2013 at 7:53 AM, Dave Jones <davej@redhat.com> wrote:
>
> Interesting that CPU2 was doing sys_io_setup again. Different trace though.

Well, it was once again in aio_free_ring() - double free or freeing
while already in use? And this time the other end of the complaint was
allocating a new page that definitely was still busily in use (it's
locked).

And there's no sign of migration, although obviously that could have
happened or be in progress on another CPU and just didn't notice the
mess. But yes, based on the two traces, fs/aio.c:io_setup() would seem
to be the main point of interest.

Have you started doing something new in trinity wrt AIO, and
io_setup() in particular? Or anything else different that might have
started triggering this?

But we do have new AIO code, and these two in particular look suspicious:

 - new page migration logic:

    71ad7490c1f3 rework aio migrate pages to use aio fs

 - trying to fix double frees and error cases:

    e34ecee2ae79 aio: Fix a trinity splat
    d558023207e0 aio: prevent double free in ioctx_alloc
    d1b9432712a2 aio: clean up aio ring in the fail path

and some kind of double free in an error path would certainly explain
this (with io_setup() . And the first oops reported obviously had that
migration thing. So maybe those "fixes" weren't fixing things at all
(or just moved the error case around).

Btw, that "rework aio migrate pages to use aio fs" looks odd. It has
Ben LaHaise marked as author, but no sign-off, instead "Tested-by" and
"Acked-by".

Al, Ben, Kent, see the beginning thread on lkml
(https://lkml.org/lkml/2013/12/18/932). Any comments?

                      Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: bad page state in 3.13-rc4
  2013-12-19 17:07     ` Linus Torvalds
@ 2013-12-19 17:17       ` Dave Jones
  2013-12-19 18:11       ` Kent Overstreet
  1 sibling, 0 replies; 26+ messages in thread
From: Dave Jones @ 2013-12-19 17:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel, linux-mm, Christoph Lameter, Benjamin LaHaise,
	Kent Overstreet, Al Viro

On Thu, Dec 19, 2013 at 09:07:27AM -0800, Linus Torvalds wrote:
 > On Thu, Dec 19, 2013 at 7:53 AM, Dave Jones <davej@redhat.com> wrote:
 > >
 > > Interesting that CPU2 was doing sys_io_setup again. Different trace though.
 > 
 > Well, it was once again in aio_free_ring() - double free or freeing
 > while already in use? And this time the other end of the complaint was
 > allocating a new page that definitely was still busily in use (it's
 > locked).
 > 
 > And there's no sign of migration, although obviously that could have
 > happened or be in progress on another CPU and just didn't notice the
 > mess. But yes, based on the two traces, fs/aio.c:io_setup() would seem
 > to be the main point of interest.
 > 
 > Have you started doing something new in trinity wrt AIO, and
 > io_setup() in particular? Or anything else different that might have
 > started triggering this?

Nothing special for aio, it's always had support for creating things that
look like iovecs, though now maybe it's filling those iovec's with mmaps
that it created (including potentially huge pages) instead of just mallocs.

	Dave

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: bad page state in 3.13-rc4
  2013-12-19 17:07     ` Linus Torvalds
  2013-12-19 17:17       ` Dave Jones
@ 2013-12-19 18:11       ` Kent Overstreet
  2013-12-19 18:29         ` Benjamin LaHaise
  1 sibling, 1 reply; 26+ messages in thread
From: Kent Overstreet @ 2013-12-19 18:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Jones, Linux Kernel, linux-mm, Christoph Lameter,
	Benjamin LaHaise, Al Viro

On Thu, Dec 19, 2013 at 09:07:27AM -0800, Linus Torvalds wrote:
> On Thu, Dec 19, 2013 at 7:53 AM, Dave Jones <davej@redhat.com> wrote:
> >
> > Interesting that CPU2 was doing sys_io_setup again. Different trace though.
> 
> Well, it was once again in aio_free_ring() - double free or freeing
> while already in use? And this time the other end of the complaint was
> allocating a new page that definitely was still busily in use (it's
> locked).
> 
> And there's no sign of migration, although obviously that could have
> happened or be in progress on another CPU and just didn't notice the
> mess. But yes, based on the two traces, fs/aio.c:io_setup() would seem
> to be the main point of interest.
> 
> Have you started doing something new in trinity wrt AIO, and
> io_setup() in particular? Or anything else different that might have
> started triggering this?
> 
> But we do have new AIO code, and these two in particular look suspicious:
> 
>  - new page migration logic:
> 
>     71ad7490c1f3 rework aio migrate pages to use aio fs
> 
>  - trying to fix double frees and error cases:
> 
>     e34ecee2ae79 aio: Fix a trinity splat
>     d558023207e0 aio: prevent double free in ioctx_alloc
>     d1b9432712a2 aio: clean up aio ring in the fail path
> 
> and some kind of double free in an error path would certainly explain
> this (with io_setup() . And the first oops reported obviously had that
> migration thing. So maybe those "fixes" weren't fixing things at all
> (or just moved the error case around).
> 
> Btw, that "rework aio migrate pages to use aio fs" looks odd. It has
> Ben LaHaise marked as author, but no sign-off, instead "Tested-by" and
> "Acked-by".

I could certainly believe a double free, but rereading the current code
I can't find anything, and I just manually tested all the relevant error
paths in ioctx_alloc() and aio_setup_ring() without finding anything.

I don't get wtf that loop at line 350 is supposed to be for though.
You'd think if it was doing anything important it would be doing
something more intelligent than just breaking on error (?). But I
haven't slept yet and maybe I'm just being dumb.

I don't understand this page migration stuff at all, and I actually
don't think I understand the refcounting w.r.t. the page cache either.
But looking at (say) the aio_free_ring() call at line 409 - we just did
one put_page() in aio_setup_ring(), and then _another_ put_page() in
aio_free_ring()... ok, one of those corresponds to the get
get_user_pages() did, but what's the other correspond to?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: bad page state in 3.13-rc4
  2013-12-19 18:11       ` Kent Overstreet
@ 2013-12-19 18:29         ` Benjamin LaHaise
  2013-12-19 18:35           ` Dave Jones
  2013-12-19 19:19           ` Linus Torvalds
  0 siblings, 2 replies; 26+ messages in thread
From: Benjamin LaHaise @ 2013-12-19 18:29 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Linus Torvalds, Dave Jones, Linux Kernel, linux-mm,
	Christoph Lameter, Al Viro

On Thu, Dec 19, 2013 at 10:11:34AM -0800, Kent Overstreet wrote:
> On Thu, Dec 19, 2013 at 09:07:27AM -0800, Linus Torvalds wrote:
> > On Thu, Dec 19, 2013 at 7:53 AM, Dave Jones <davej@redhat.com> wrote:
> > >
> > > Interesting that CPU2 was doing sys_io_setup again. Different trace though.
> > 
> > Well, it was once again in aio_free_ring() - double free or freeing
> > while already in use? And this time the other end of the complaint was
> > allocating a new page that definitely was still busily in use (it's
> > locked).
> > 
> > And there's no sign of migration, although obviously that could have
> > happened or be in progress on another CPU and just didn't notice the
> > mess. But yes, based on the two traces, fs/aio.c:io_setup() would seem
> > to be the main point of interest.
> > 
> > Have you started doing something new in trinity wrt AIO, and
> > io_setup() in particular? Or anything else different that might have
> > started triggering this?
> > 
> > But we do have new AIO code, and these two in particular look suspicious:
> > 
> >  - new page migration logic:
> > 
> >     71ad7490c1f3 rework aio migrate pages to use aio fs
> > 
> >  - trying to fix double frees and error cases:
> > 
> >     e34ecee2ae79 aio: Fix a trinity splat
> >     d558023207e0 aio: prevent double free in ioctx_alloc
> >     d1b9432712a2 aio: clean up aio ring in the fail path
> > 
> > and some kind of double free in an error path would certainly explain
> > this (with io_setup() . And the first oops reported obviously had that
> > migration thing. So maybe those "fixes" weren't fixing things at all
> > (or just moved the error case around).
> > 
> > Btw, that "rework aio migrate pages to use aio fs" looks odd. It has
> > Ben LaHaise marked as author, but no sign-off, instead "Tested-by" and
> > "Acked-by".
> 
> I could certainly believe a double free, but rereading the current code
> I can't find anything, and I just manually tested all the relevant error
> paths in ioctx_alloc() and aio_setup_ring() without finding anything.

The same here.  It would be very helpful to know what syscalls trinity is 
issuing in the lead up to the bug.

> I don't get wtf that loop at line 350 is supposed to be for though.
> You'd think if it was doing anything important it would be doing
> something more intelligent than just breaking on error (?). But I
> haven't slept yet and maybe I'm just being dumb.

The loop at 350 is just instantiating the page cache pages for the ring 
buffer with freshly zeroed pages.

> I don't understand this page migration stuff at all, and I actually
> don't think I understand the refcounting w.r.t. the page cache either.
> But looking at (say) the aio_free_ring() call at line 409 - we just did
> one put_page() in aio_setup_ring(), and then _another_ put_page() in
> aio_free_ring()... ok, one of those corresponds to the get
> get_user_pages() did, but what's the other correspond to?

The second put_page() should be dropping the page from the page cache.  
Perhaps it would be better to rely on a truncate of the file to remove the 
pages from the page cache.  I'd hoped someone with a clue about how 
migration is supposed to work would have chimed in during the review, but 
nobody has stepped up with expertise in this area yet.

		-ben
-- 
"Thought is the essence of where you are now."

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: bad page state in 3.13-rc4
  2013-12-19 18:29         ` Benjamin LaHaise
@ 2013-12-19 18:35           ` Dave Jones
  2013-12-19 19:19           ` Linus Torvalds
  1 sibling, 0 replies; 26+ messages in thread
From: Dave Jones @ 2013-12-19 18:35 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Kent Overstreet, Linus Torvalds, Linux Kernel, linux-mm,
	Christoph Lameter, Al Viro

On Thu, Dec 19, 2013 at 01:29:21PM -0500, Benjamin LaHaise wrote:
 
 > > > and some kind of double free in an error path would certainly explain
 > > > this (with io_setup() . And the first oops reported obviously had that
 > > > migration thing. So maybe those "fixes" weren't fixing things at all
 > > > (or just moved the error case around).
 > > > 
 > > > Btw, that "rework aio migrate pages to use aio fs" looks odd. It has
 > > > Ben LaHaise marked as author, but no sign-off, instead "Tested-by" and
 > > > "Acked-by".
 > > 
 > > I could certainly believe a double free, but rereading the current code
 > > I can't find anything, and I just manually tested all the relevant error
 > > paths in ioctx_alloc() and aio_setup_ring() without finding anything.
 > 
 > The same here.  It would be very helpful to know what syscalls trinity is 
 > issuing in the lead up to the bug.

Working on narrowing it down.  The io_setup fuzzer is actually incredibly dumb,
and 99.9% of the time will just EFAULT or EINVAL. I'll see if I can smarten it
up to succeed more often, in the hope that it can reproduce this faster, because
right now it looks like it needs the planets to line up just right to hit
the bug (even though I've hit it twice in the last 24 hrs)

	Dave

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: bad page state in 3.13-rc4
  2013-12-19 18:29         ` Benjamin LaHaise
  2013-12-19 18:35           ` Dave Jones
@ 2013-12-19 19:19           ` Linus Torvalds
  2013-12-19 19:26             ` Benjamin LaHaise
  1 sibling, 1 reply; 26+ messages in thread
From: Linus Torvalds @ 2013-12-19 19:19 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Kent Overstreet, Dave Jones, Linux Kernel, linux-mm,
	Christoph Lameter, Al Viro

On Thu, Dec 19, 2013 at 10:29 AM, Benjamin LaHaise <bcrl@kvack.org> wrote:
>
>> I don't understand this page migration stuff at all, and I actually
>> don't think I understand the refcounting w.r.t. the page cache either.
>> But looking at (say) the aio_free_ring() call at line 409 - we just did
>> one put_page() in aio_setup_ring(), and then _another_ put_page() in
>> aio_free_ring()... ok, one of those corresponds to the get
>> get_user_pages() did, but what's the other correspond to?
>
> The second put_page() should be dropping the page from the page cache.
> Perhaps it would be better to rely on a truncate of the file to remove the
> pages from the page cache.

Yeah, that looks horribly buggy, if that's the intent.

You can't just put_page() to remove something from the page cache. You
need to do the whole "remove from radix tree" rigamarole, see for
example delete_from_page_cache(). And you can't even do that blindly,
because if the page is under writeback or otherwise busy, just
removing it from the page cache and freeing it is wrong too.

                 Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: bad page state in 3.13-rc4
  2013-12-19 19:19           ` Linus Torvalds
@ 2013-12-19 19:26             ` Benjamin LaHaise
  2013-12-19 19:45               ` Linus Torvalds
  0 siblings, 1 reply; 26+ messages in thread
From: Benjamin LaHaise @ 2013-12-19 19:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kent Overstreet, Dave Jones, Linux Kernel, linux-mm,
	Christoph Lameter, Al Viro

On Fri, Dec 20, 2013 at 04:19:15AM +0900, Linus Torvalds wrote:
> Yeah, that looks horribly buggy, if that's the intent.
> 
> You can't just put_page() to remove something from the page cache. You
> need to do the whole "remove from radix tree" rigamarole, see for
> example delete_from_page_cache(). And you can't even do that blindly,
> because if the page is under writeback or otherwise busy, just
> removing it from the page cache and freeing it is wrong too.

Okay, I'll rewriting it to use truncate to free the pages.

		-ben

>                  Linus

-- 
"Thought is the essence of where you are now."

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: bad page state in 3.13-rc4
  2013-12-19 19:26             ` Benjamin LaHaise
@ 2013-12-19 19:45               ` Linus Torvalds
  2013-12-19 19:53                 ` Benjamin LaHaise
  0 siblings, 1 reply; 26+ messages in thread
From: Linus Torvalds @ 2013-12-19 19:45 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Kent Overstreet, Dave Jones, Linux Kernel, linux-mm,
	Christoph Lameter, Al Viro

On Fri, Dec 20, 2013 at 4:26 AM, Benjamin LaHaise <bcrl@kvack.org> wrote:
>
> Okay, I'll rewriting it to use truncate to free the pages.

It already does that in put_aio_ring_file() afaik. No?

                Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: bad page state in 3.13-rc4
  2013-12-19 19:45               ` Linus Torvalds
@ 2013-12-19 19:53                 ` Benjamin LaHaise
  2013-12-19 20:02                   ` Linus Torvalds
  2013-12-19 20:24                   ` Dave Jones
  0 siblings, 2 replies; 26+ messages in thread
From: Benjamin LaHaise @ 2013-12-19 19:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kent Overstreet, Dave Jones, Linux Kernel, linux-mm,
	Christoph Lameter, Al Viro

On Fri, Dec 20, 2013 at 04:45:38AM +0900, Linus Torvalds wrote:
> On Fri, Dec 20, 2013 at 4:26 AM, Benjamin LaHaise <bcrl@kvack.org> wrote:
> >
> > Okay, I'll rewriting it to use truncate to free the pages.
> 
> It already does that in put_aio_ring_file() afaik. No?

Yes, that's what I found when I started looking into this in detail again.  
I think the page reference counting is actually correct.  There are 2 
references on each page: the first is from the find_or_create_page() call, 
and the second is from the get_user_pages() (which also makes sure the page 
is populated into the page tables).  The only place I can see things going 
off the rails is if the get_user_pages() call fails.  It's possible trinity 
could be arranging things so that the get_user_pages() call is failing 
somehow.  Also, if it were a double free of a page, we should at least get 
a VM_BUG() occuring when the page's count is 0.

Dave -- do you have CONFIG_DEBUG_VM on in your test rig?

>                 Linus

-- 
"Thought is the essence of where you are now."

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: bad page state in 3.13-rc4
  2013-12-19 19:53                 ` Benjamin LaHaise
@ 2013-12-19 20:02                   ` Linus Torvalds
  2013-12-19 20:11                     ` Linus Torvalds
  2013-12-19 20:24                   ` Dave Jones
  1 sibling, 1 reply; 26+ messages in thread
From: Linus Torvalds @ 2013-12-19 20:02 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Kent Overstreet, Dave Jones, Linux Kernel, linux-mm,
	Christoph Lameter, Al Viro

[-- Attachment #1: Type: text/plain, Size: 870 bytes --]

On Fri, Dec 20, 2013 at 4:53 AM, Benjamin LaHaise <bcrl@kvack.org> wrote:
>
> Yes, that's what I found when I started looking into this in detail again.
> I think the page reference counting is actually correct.  There are 2
> references on each page: the first is from the find_or_create_page() call,
> and the second is from the get_user_pages() (which also makes sure the page
> is populated into the page tables).

Ok, I'm sorry, but that's just pure bullshit then.

So it has the page array in the page cache, then mmap's it in, and
uses get_user_pages() to get the pages back that it *just* created.

This code is pure and utter garbage. It's beyond the pale how crazy it is.

Why not just get rid of the idiotic get_user_pages() crap then?
Something like the attached patch?

Totally untested, but at least it makes *some* amount of sense.

                Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/plain, Size: 1655 bytes --]

 fs/aio.c | 20 +++-----------------
 1 file changed, 3 insertions(+), 17 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 6efb7f6cb22e..e1b02dd1be9e 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -358,6 +358,8 @@ static int aio_setup_ring(struct kioctx *ctx)
 		SetPageUptodate(page);
 		SetPageDirty(page);
 		unlock_page(page);
+
+		ctx->ring_pages[i] = page;
 	}
 	ctx->aio_ring_file = file;
 	nr_events = (PAGE_SIZE * nr_pages - sizeof(struct aio_ring))
@@ -380,8 +382,8 @@ static int aio_setup_ring(struct kioctx *ctx)
 	ctx->mmap_base = do_mmap_pgoff(ctx->aio_ring_file, 0, ctx->mmap_size,
 				       PROT_READ | PROT_WRITE,
 				       MAP_SHARED | MAP_POPULATE, 0, &populate);
+	up_write(&mm->mmap_sem);
 	if (IS_ERR((void *)ctx->mmap_base)) {
-		up_write(&mm->mmap_sem);
 		ctx->mmap_size = 0;
 		aio_free_ring(ctx);
 		return -EAGAIN;
@@ -389,22 +391,6 @@ static int aio_setup_ring(struct kioctx *ctx)
 
 	pr_debug("mmap address: 0x%08lx\n", ctx->mmap_base);
 
-	/* We must do this while still holding mmap_sem for write, as we
-	 * need to be protected against userspace attempting to mremap()
-	 * or munmap() the ring buffer.
-	 */
-	ctx->nr_pages = get_user_pages(current, mm, ctx->mmap_base, nr_pages,
-				       1, 0, ctx->ring_pages, NULL);
-
-	/* Dropping the reference here is safe as the page cache will hold
-	 * onto the pages for us.  It is also required so that page migration
-	 * can unmap the pages and get the right reference count.
-	 */
-	for (i = 0; i < ctx->nr_pages; i++)
-		put_page(ctx->ring_pages[i]);
-
-	up_write(&mm->mmap_sem);
-
 	if (unlikely(ctx->nr_pages != nr_pages)) {
 		aio_free_ring(ctx);
 		return -EAGAIN;

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: bad page state in 3.13-rc4
  2013-12-19 20:02                   ` Linus Torvalds
@ 2013-12-19 20:11                     ` Linus Torvalds
  2013-12-19 20:31                       ` Benjamin LaHaise
  2013-12-19 20:31                       ` Linus Torvalds
  0 siblings, 2 replies; 26+ messages in thread
From: Linus Torvalds @ 2013-12-19 20:11 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Kent Overstreet, Dave Jones, Linux Kernel, linux-mm,
	Christoph Lameter, Al Viro

[-- Attachment #1: Type: text/plain, Size: 589 bytes --]

On Fri, Dec 20, 2013 at 5:02 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Why not just get rid of the idiotic get_user_pages() crap then?
> Something like the attached patch?
>
> Totally untested, but at least it makes *some* amount of sense.

Ok, that can't work, since the ring_pages[] allocation happens later.
So that part needs to be moved up, and it needs to initialize
'nr_pages'.

So here's the same patch, but with stuff moved around a bit, and the
"oops, couldn't create page" part fixed.

Bit it's still totally and entirely untested.

                   Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/plain, Size: 2772 bytes --]

 fs/aio.c | 54 +++++++++++++++++++++---------------------------------
 1 file changed, 21 insertions(+), 33 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 6efb7f6cb22e..3e857e98fb87 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -347,6 +347,20 @@ static int aio_setup_ring(struct kioctx *ctx)
 		return -EAGAIN;
 	}
 
+	ctx->aio_ring_file = file;
+	nr_events = (PAGE_SIZE * nr_pages - sizeof(struct aio_ring))
+			/ sizeof(struct io_event);
+
+	ctx->ring_pages = ctx->internal_pages;
+	if (nr_pages > AIO_RING_PAGES) {
+		ctx->ring_pages = kcalloc(nr_pages, sizeof(struct page *),
+					  GFP_KERNEL);
+		if (!ctx->ring_pages) {
+			put_aio_ring_file(ctx);
+			return -ENOMEM;
+		}
+	}
+
 	for (i = 0; i < nr_pages; i++) {
 		struct page *page;
 		page = find_or_create_page(file->f_inode->i_mapping,
@@ -358,19 +372,14 @@ static int aio_setup_ring(struct kioctx *ctx)
 		SetPageUptodate(page);
 		SetPageDirty(page);
 		unlock_page(page);
+
+		ctx->ring_pages[i] = page;
 	}
-	ctx->aio_ring_file = file;
-	nr_events = (PAGE_SIZE * nr_pages - sizeof(struct aio_ring))
-			/ sizeof(struct io_event);
+	ctx->nr_pages = i;
 
-	ctx->ring_pages = ctx->internal_pages;
-	if (nr_pages > AIO_RING_PAGES) {
-		ctx->ring_pages = kcalloc(nr_pages, sizeof(struct page *),
-					  GFP_KERNEL);
-		if (!ctx->ring_pages) {
-			put_aio_ring_file(ctx);
-			return -ENOMEM;
-		}
+	if (unlikely(i != nr_pages)) {
+		aio_free_ring(ctx);
+		return -EAGAIN;
 	}
 
 	ctx->mmap_size = nr_pages * PAGE_SIZE;
@@ -380,8 +389,8 @@ static int aio_setup_ring(struct kioctx *ctx)
 	ctx->mmap_base = do_mmap_pgoff(ctx->aio_ring_file, 0, ctx->mmap_size,
 				       PROT_READ | PROT_WRITE,
 				       MAP_SHARED | MAP_POPULATE, 0, &populate);
+	up_write(&mm->mmap_sem);
 	if (IS_ERR((void *)ctx->mmap_base)) {
-		up_write(&mm->mmap_sem);
 		ctx->mmap_size = 0;
 		aio_free_ring(ctx);
 		return -EAGAIN;
@@ -389,27 +398,6 @@ static int aio_setup_ring(struct kioctx *ctx)
 
 	pr_debug("mmap address: 0x%08lx\n", ctx->mmap_base);
 
-	/* We must do this while still holding mmap_sem for write, as we
-	 * need to be protected against userspace attempting to mremap()
-	 * or munmap() the ring buffer.
-	 */
-	ctx->nr_pages = get_user_pages(current, mm, ctx->mmap_base, nr_pages,
-				       1, 0, ctx->ring_pages, NULL);
-
-	/* Dropping the reference here is safe as the page cache will hold
-	 * onto the pages for us.  It is also required so that page migration
-	 * can unmap the pages and get the right reference count.
-	 */
-	for (i = 0; i < ctx->nr_pages; i++)
-		put_page(ctx->ring_pages[i]);
-
-	up_write(&mm->mmap_sem);
-
-	if (unlikely(ctx->nr_pages != nr_pages)) {
-		aio_free_ring(ctx);
-		return -EAGAIN;
-	}
-
 	ctx->user_id = ctx->mmap_base;
 	ctx->nr_events = nr_events; /* trusted copy */
 

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: bad page state in 3.13-rc4
  2013-12-19 15:41   ` Christoph Lameter
@ 2013-12-19 20:11     ` Mel Gorman
  2013-12-19 20:30       ` Dave Jones
  0 siblings, 1 reply; 26+ messages in thread
From: Mel Gorman @ 2013-12-19 20:11 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Linus Torvalds, Sasha Levin, Joonsoo Kim, Andrew Morton,
	Kirill A. Shutemov, Hugh Dickins, Dave Jones, Linux Kernel,
	linux-mm

On Thu, Dec 19, 2013 at 09:41:50AM -0600, Christoph Lameter wrote:
> On Wed, 18 Dec 2013, Linus Torvalds wrote:
> 
> > Somebody who knows the migration code needs to look at this. ChristophL?
> 
> Its been awhile sorry and there has been a huge amount of work done on top
> of my earlier work. Cannot debug that anymore and I am finding myself in
> the role of the old guy who just complains a lot.

Shake your fist and tell the kids to get off your lawn.

> Some of that
> functionality seems bizarre to me like the on the fly conversion between
> huge pages and regular pages, weird and complex page count handling etc
> etc.
> 
> The last time I looked at the code I was horrified to find that the new
> huge page migration does not use migration ptes to create a cooldown phase
> but directly swaps the pmd. That used to cause huge problems with regular
> pages in the past. But I was told that was all safe. Mel?

THP migration is specific to automatic NUMA balancing and the safety of
how it works is dependant upon how pmds are marked NUMA and how they are
cleared and migrated.

Dave, was this a NUMA machine? If yes, was CONFIG_NUMA_BALANCING set? If
yes, was NUMA_BALANCING_DEFAULT_ENABLED set or was numa_balancing=enable
specified on the kernel command line? I'm skeptical that this is related to
THP migration largely because the initial stack trace was in the compaction
path which does not deal with THP migration.

If this is recent, then an outside possibility is that this is related to
pmd-level split locks and mm->page_table_lock was protecting us from some
split THP vs migration race or possibly a gup page for aio vs migration
race we were previously unaware of (e.g. aio taking a reference on a page
that migration has frozen the references on, bug would be a case where
get_page instead of get_page_unless_zero was used) . Dave, when this this
bug start triggering? If it's due to a recent change in trinity, can you
check if 3.12 is also affected? If not, can you check if the bug started
happening somewhere around these commits?

ea1e7ed33708c7a760419ff9ded0a6cb90586a50 mm: create a separate slab for page->ptl allocation
539edb5846c740d78a8b6c2e43a99ca4323df68f mm: properly separate the bloated ptl from the regular case
49076ec2ccaf68610aa03d96bced9a6694b93ca1 mm: dynamically allocate page->ptl if it cannot be embedded to struct page
e009bb30c8df8a52a9622b616b67436b6a03a0cd mm: implement split page table lock for PMD level
c4088ebdca64c9a2e34a38177d2249805ede1f4b mm: convert the rest to new page table lock api
cb900f41215447433cbc456d1c4294e858a84d7c mm, hugetlb: convert hugetlbfs to use split pmd lock
c389a250ab4cfa4a3775d9f2c45271618af6d5b2 mm, thp: do not access mm->pmd_huge_pte directly
117b0791ac42f2ec447bc864e70ad622b5604059 mm, thp: move ptl taking inside page_check_address_pmd()
bf929152e9f6c49b66fad4ebf08cc95b02ce48f5 mm, thp: change pmd_trans_huge_lock() to return taken lock
e1f56c89b040134add93f686931cc266541d239a mm: convert mm->nr_ptes to atomic_long_t
e9bb18c7b95d4dcf8c7f0e14f920ca6f03109e75 mm: avoid increase sizeof(struct page) due to split page table lock
b77d88d493b8fc7a4c2dadd3bb86d1dee2f53a56 mm: drop actor argument of do_generic_file_read()

A few bad state bugs have shown up on linux-mm recently but my impression
was that they were related to rmap_walk changes currently in next. The
initial log indicated that this was 3.13-rc4 but is it really 3.13-rc4 or
are there any -next patches applied?

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: bad page state in 3.13-rc4
  2013-12-19 19:53                 ` Benjamin LaHaise
  2013-12-19 20:02                   ` Linus Torvalds
@ 2013-12-19 20:24                   ` Dave Jones
  2013-12-19 23:38                     ` Benjamin LaHaise
  1 sibling, 1 reply; 26+ messages in thread
From: Dave Jones @ 2013-12-19 20:24 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Linus Torvalds, Kent Overstreet, Linux Kernel, linux-mm,
	Christoph Lameter, Al Viro

On Thu, Dec 19, 2013 at 02:53:52PM -0500, Benjamin LaHaise wrote:

 > is populated into the page tables).  The only place I can see things going 
 > off the rails is if the get_user_pages() call fails.  It's possible trinity 
 > could be arranging things so that the get_user_pages() call is failing 
 > somehow.  Also, if it were a double free of a page, we should at least get 
 > a VM_BUG() occuring when the page's count is 0.
 > 
 > Dave -- do you have CONFIG_DEBUG_VM on in your test rig?

Yes. Note the original trace in this thread was a VM_BUG_ON(atomic_read(&page->_count) <= 0);

Right after these crashes btw, the box locks up solid. So bad that traces don't
always make it over usb-serial. Annoying.

	Dave


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: bad page state in 3.13-rc4
  2013-12-19 20:11     ` Mel Gorman
@ 2013-12-19 20:30       ` Dave Jones
  0 siblings, 0 replies; 26+ messages in thread
From: Dave Jones @ 2013-12-19 20:30 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Christoph Lameter, Linus Torvalds, Sasha Levin, Joonsoo Kim,
	Andrew Morton, Kirill A. Shutemov, Hugh Dickins, Linux Kernel,
	linux-mm

On Thu, Dec 19, 2013 at 08:11:58PM +0000, Mel Gorman wrote:
 
 > Dave, was this a NUMA machine?

It's a dual core i5-4670T with hyperthreading.

 > If yes, was CONFIG_NUMA_BALANCING set?

no.

 > Dave, when this this
 > bug start triggering? If it's due to a recent change in trinity, can you
 > check if 3.12 is also affected? If not, can you check if the bug started
 > happening somewhere around these commits?
 
Right now it can take hours for it to reproduce. Until I can narrow it down to
something repeatable, bisecting and trying old builds is going to be really time-consuming.

Given the other VM bugs that have Sasha and I have been finding since I added
the mmap reuse code to trinity, this is probably something else that has been
there for a while.

 > A few bad state bugs have shown up on linux-mm recently but my impression
 > was that they were related to rmap_walk changes currently in next. The
 > initial log indicated that this was 3.13-rc4 but is it really 3.13-rc4 or
 > are there any -next patches applied?

no, just rc4 (plus a handful of small patches to fix oopses etc that I've already
diagnosed).  I'm glad Sasha spends time running this stuff on -next, because 
there aren't enough hours in the day for me to look at the stuff I find
in Linus' tree without looking at what's coming next.

	Dave

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: bad page state in 3.13-rc4
  2013-12-19 20:11                     ` Linus Torvalds
@ 2013-12-19 20:31                       ` Benjamin LaHaise
  2013-12-19 20:31                       ` Linus Torvalds
  1 sibling, 0 replies; 26+ messages in thread
From: Benjamin LaHaise @ 2013-12-19 20:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kent Overstreet, Dave Jones, Linux Kernel, linux-mm,
	Christoph Lameter, Al Viro

On Fri, Dec 20, 2013 at 05:11:12AM +0900, Linus Torvalds wrote:
> On Fri, Dec 20, 2013 at 5:02 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Why not just get rid of the idiotic get_user_pages() crap then?
> > Something like the attached patch?
> >
> > Totally untested, but at least it makes *some* amount of sense.
> 
> Ok, that can't work, since the ring_pages[] allocation happens later.
> So that part needs to be moved up, and it needs to initialize
> 'nr_pages'.
> 
> So here's the same patch, but with stuff moved around a bit, and the
> "oops, couldn't create page" part fixed.
> 
> Bit it's still totally and entirely untested.

That looks much better.  I think the following is also needed to nail down 
the migratepage operation as well.  I'll give these two a few tests 
together.

		-ben
-- 
"Thought is the essence of where you are now."


diff --git a/fs/aio.c b/fs/aio.c
index 6efb7f6..eec0ae4 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -244,8 +244,13 @@ static void aio_free_ring(struct kioctx *ctx)
 	int i;
 
 	for (i = 0; i < ctx->nr_pages; i++) {
+		struct page *page;
 		pr_debug("pid(%d) [%d] page->count=%d\n", current->pid, i,
 				page_count(ctx->ring_pages[i]));
+		page = ctx->ring_pages[i];
+		if (!page)
+			continue;
+		ctx->ring_pages[i] = NULL;
 		put_page(ctx->ring_pages[i]);
 	}
 
@@ -280,18 +285,42 @@ static int aio_migratepage(struct address_space *mapping, struct page *new,
 	unsigned long flags;
 	int rc;
 
+	/* Serialize access to the old page */
+	if (!trylock_page(old))
+		return -EAGAIN;
+
+	rc = 0;
+
+	spin_lock(&mapping->private_lock);
+	ctx = mapping->private_data;
+	if (ctx) {
+		pgoff_t idx;
+		spin_lock_irqsave(&ctx->completion_lock, flags);
+		idx = old->index;
+		if (idx < (pgoff_t)ctx->nr_pages) {
+			if (ctx->ring_pages[idx] != old)
+				rc = -EAGAIN;
+		} else
+			rc = -EINVAL;
+		spin_unlock_irqrestore(&ctx->completion_lock, flags);
+	} else
+		rc = -EINVAL;
+	spin_unlock(&mapping->private_lock);
+
+	if (rc != 0)
+		return rc;
+
 	/* Writeback must be complete */
 	BUG_ON(PageWriteback(old));
-	put_page(old);
+	get_page(new);
 
-	rc = migrate_page_move_mapping(mapping, new, old, NULL, mode);
+	rc = migrate_page_move_mapping(mapping, new, old, NULL, mode, 1);
 	if (rc != MIGRATEPAGE_SUCCESS) {
-		get_page(old);
+		unlock_page(old);
+		put_page(new);
 		return rc;
 	}
 
-	get_page(new);
-
 	/* We can potentially race against kioctx teardown here.  Use the
 	 * address_space's private data lock to protect the mapping's
 	 * private_data.
@@ -305,10 +334,16 @@ static int aio_migratepage(struct address_space *mapping, struct page *new,
 		idx = old->index;
 		if (idx < (pgoff_t)ctx->nr_pages)
 			ctx->ring_pages[idx] = new;
+		else
+			rc = -EINVAL;
 		spin_unlock_irqrestore(&ctx->completion_lock, flags);
 	} else
 		rc = -EBUSY;
 	spin_unlock(&mapping->private_lock);
+	unlock_page(old);
+
+	if (rc == MIGRATEPAGE_SUCCESS)
+		put_page(old);
 
 	return rc;
 }
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index b7717d7..f015c05 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -55,7 +55,8 @@ extern int migrate_huge_page_move_mapping(struct address_space *mapping,
 				  struct page *newpage, struct page *page);
 extern int migrate_page_move_mapping(struct address_space *mapping,
 		struct page *newpage, struct page *page,
-		struct buffer_head *head, enum migrate_mode mode);
+		struct buffer_head *head, enum migrate_mode mode,
+		int extra_count);
 #else
 
 static inline void putback_lru_pages(struct list_head *l) {}
diff --git a/mm/migrate.c b/mm/migrate.c
index e9b7102..e73823e 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -317,14 +317,15 @@ static inline bool buffer_migrate_lock_buffers(struct buffer_head *head,
  */
 int migrate_page_move_mapping(struct address_space *mapping,
 		struct page *newpage, struct page *page,
-		struct buffer_head *head, enum migrate_mode mode)
+		struct buffer_head *head, enum migrate_mode mode,
+		int extra_count)
 {
 	int expected_count = 0;
 	void **pslot;
 
 	if (!mapping) {
 		/* Anonymous page without mapping */
-		if (page_count(page) != 1)
+		if (page_count(page) != (expected_count + 1))
 			return -EAGAIN;
 		return MIGRATEPAGE_SUCCESS;
 	}
@@ -584,7 +585,7 @@ int migrate_page(struct address_space *mapping,
 
 	BUG_ON(PageWriteback(page));	/* Writeback must be complete */
 
-	rc = migrate_page_move_mapping(mapping, newpage, page, NULL, mode);
+	rc = migrate_page_move_mapping(mapping, newpage, page, NULL, mode, 0);
 
 	if (rc != MIGRATEPAGE_SUCCESS)
 		return rc;
@@ -611,7 +612,7 @@ int buffer_migrate_page(struct address_space *mapping,
 
 	head = page_buffers(page);
 
-	rc = migrate_page_move_mapping(mapping, newpage, page, head, mode);
+	rc = migrate_page_move_mapping(mapping, newpage, page, head, mode, 0);
 
 	if (rc != MIGRATEPAGE_SUCCESS)
 		return rc;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: bad page state in 3.13-rc4
  2013-12-19 20:11                     ` Linus Torvalds
  2013-12-19 20:31                       ` Benjamin LaHaise
@ 2013-12-19 20:31                       ` Linus Torvalds
  2013-12-19 20:42                         ` Benjamin LaHaise
  1 sibling, 1 reply; 26+ messages in thread
From: Linus Torvalds @ 2013-12-19 20:31 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Kent Overstreet, Dave Jones, Linux Kernel, linux-mm,
	Christoph Lameter, Al Viro

On Fri, Dec 20, 2013 at 5:11 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So here's the same patch, but with stuff moved around a bit, and the
> "oops, couldn't create page" part fixed.
>
> Bit it's still totally and entirely untested.

Btw, I think this actually fixes a bug, in that it doesn't leak the
page reference count if the do_mmap_pgoff() call fails.

That said, that looks like just a memory leak, not explaining the
problem Dave sees. And maybe I'm missing something.

And no, I still haven't actually tested this at all. Is there an aio
tester that is worth trying?

                   Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: bad page state in 3.13-rc4
  2013-12-19 20:31                       ` Linus Torvalds
@ 2013-12-19 20:42                         ` Benjamin LaHaise
  0 siblings, 0 replies; 26+ messages in thread
From: Benjamin LaHaise @ 2013-12-19 20:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kent Overstreet, Dave Jones, Linux Kernel, linux-mm,
	Christoph Lameter, Al Viro

On Fri, Dec 20, 2013 at 05:31:29AM +0900, Linus Torvalds wrote:
> On Fri, Dec 20, 2013 at 5:11 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > So here's the same patch, but with stuff moved around a bit, and the
> > "oops, couldn't create page" part fixed.
> >
> > Bit it's still totally and entirely untested.
> 
> Btw, I think this actually fixes a bug, in that it doesn't leak the
> page reference count if the do_mmap_pgoff() call fails.
> 
> That said, that looks like just a memory leak, not explaining the
> problem Dave sees. And maybe I'm missing something.
> 
> And no, I still haven't actually tested this at all. Is there an aio
> tester that is worth trying?

There are a few tests in the libaio source, some in xfstests and fio, and 
a few I've got sitting around.  To specifically exercise the page migration 
code path, there's a test at http://www.kvack.org/~bcrl/aio-numa-test.c .

		-ben

>                    Linus

-- 
"Thought is the essence of where you are now."

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: bad page state in 3.13-rc4
  2013-12-19 20:24                   ` Dave Jones
@ 2013-12-19 23:38                     ` Benjamin LaHaise
  2013-12-20  1:00                       ` Dave Jones
  0 siblings, 1 reply; 26+ messages in thread
From: Benjamin LaHaise @ 2013-12-19 23:38 UTC (permalink / raw)
  To: Dave Jones, Linus Torvalds, Kent Overstreet, Linux Kernel,
	linux-mm, Christoph Lameter, Al Viro

On Thu, Dec 19, 2013 at 03:24:16PM -0500, Dave Jones wrote:
> Yes. Note the original trace in this thread was a VM_BUG_ON(atomic_read(&page->_count) <= 0);
> 
> Right after these crashes btw, the box locks up solid. So bad that traces don't
> always make it over usb-serial. Annoying.

I think I finally have an idea what's going on now.  Kent's changes in 
e34ecee2ae791df674dfb466ce40692ca6218e43 are broken and result in a memory 
leak of the aio kioctx.  This eventually leads to the system running out of 
memory, which ends up triggering the otherwise hard to hit error paths in 
aio_setup_ring().  Linus' suggested changes should fix the badness in the 
aio_setup_ring(), but more work has to be done to fix up the percpu 
reference counting tie in with the aio code.  I'll fix this up in the 
morning if nobody beats me to it over night, as I'm just heading out right 
now.

		-ben

> 	Dave
> 

-- 
"Thought is the essence of where you are now."

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: bad page state in 3.13-rc4
  2013-12-19 23:38                     ` Benjamin LaHaise
@ 2013-12-20  1:00                       ` Dave Jones
  2013-12-21 23:06                         ` [PATCHes - aio / migrate page, please review] " Benjamin LaHaise
  0 siblings, 1 reply; 26+ messages in thread
From: Dave Jones @ 2013-12-20  1:00 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Linus Torvalds, Kent Overstreet, Linux Kernel, linux-mm,
	Christoph Lameter, Al Viro

On Thu, Dec 19, 2013 at 06:38:54PM -0500, Benjamin LaHaise wrote:
 > On Thu, Dec 19, 2013 at 03:24:16PM -0500, Dave Jones wrote:
 > > Yes. Note the original trace in this thread was a VM_BUG_ON(atomic_read(&page->_count) <= 0);
 > > 
 > > Right after these crashes btw, the box locks up solid. So bad that traces don't
 > > always make it over usb-serial. Annoying.
 > 
 > I think I finally have an idea what's going on now.  Kent's changes in 
 > e34ecee2ae791df674dfb466ce40692ca6218e43 are broken and result in a memory 
 > leak of the aio kioctx.  This eventually leads to the system running out of 
 > memory, which ends up triggering the otherwise hard to hit error paths in 
 > aio_setup_ring().  Linus' suggested changes should fix the badness in the 
 > aio_setup_ring(), but more work has to be done to fix up the percpu 
 > reference counting tie in with the aio code.  I'll fix this up in the 
 > morning if nobody beats me to it over night, as I'm just heading out right 
 > now.

That would explain why I'm having difficulty repeating it in a hurry if it
takes hours of runtime for the leak to reach a point where it becomes a problem.

	Dave

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCHes - aio / migrate page, please review] Re: bad page state in 3.13-rc4
  2013-12-20  1:00                       ` Dave Jones
@ 2013-12-21 23:06                         ` Benjamin LaHaise
  2013-12-22 19:09                           ` Linus Torvalds
  0 siblings, 1 reply; 26+ messages in thread
From: Benjamin LaHaise @ 2013-12-21 23:06 UTC (permalink / raw)
  To: Dave Jones, Linus Torvalds, Kent Overstreet, Linux Kernel,
	linux-mm, Christoph Lameter, Al Viro

[ Patches inline below for people to comment on ]

On Thu, Dec 19, 2013 at 08:00:42PM -0500, Dave Jones wrote:
> On Thu, Dec 19, 2013 at 06:38:54PM -0500, Benjamin LaHaise wrote:
>  > On Thu, Dec 19, 2013 at 03:24:16PM -0500, Dave Jones wrote:
>  > > Yes. Note the original trace in this thread was a VM_BUG_ON(atomic_read(&page->_count) <= 0);
>  > > 
>  > > Right after these crashes btw, the box locks up solid. So bad that traces don't
>  > > always make it over usb-serial. Annoying.
>  > 
>  > I think I finally have an idea what's going on now.  Kent's changes in 
>  > e34ecee2ae791df674dfb466ce40692ca6218e43 are broken and result in a memory 
>  > leak of the aio kioctx.  This eventually leads to the system running out of 
>  > memory, which ends up triggering the otherwise hard to hit error paths in 
>  > aio_setup_ring().  Linus' suggested changes should fix the badness in the 
>  > aio_setup_ring(), but more work has to be done to fix up the percpu 
>  > reference counting tie in with the aio code.  I'll fix this up in the 
>  > morning if nobody beats me to it over night, as I'm just heading out right 
>  > now.
> 
> That would explain why I'm having difficulty repeating it in a hurry if it
> takes hours of runtime for the leak to reach a point where it becomes a problem.

Okay, I've put the below two patches plus Linus's change through a round 
of tests, and it passes millions of iterations of the aio numa migratepage 
test, as well as a number of repetitions of a few simple read and write 
tests.  The first patch fixes the memory leak Kent introduced, while the 
second patch makes aio_migratepage() much more paranoid and robust.  Those 
two changes are in git://git.kvack.org/~bcrl/aio-next.git -- can a few 
other folks please review to make sure I haven't missed anything?

Linus, feel free to add my Signed-off-by: to your sanitization of 
aio_setup_ring() as well, as it works okay in my testing.

		-ben
-- 
"Thought is the essence of where you are now."

 aio.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

commit 1881686f842065d2f92ec9c6424830ffc17d23b0
Author: Benjamin LaHaise <bcrl@kvack.org>
Date:   Sat Dec 21 15:49:28 2013 -0500

    aio: fix kioctx leak introduced by "aio: Fix a trinity splat"
    
    e34ecee2ae791df674dfb466ce40692ca6218e43 reworked the percpu reference
    counting to correct a bug trinity found.  Unfortunately, the change lead
    to kioctxes being leaked because there was no final reference count to
    put.  Add that reference count back in to fix things.
    
    Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
    Cc: stable@vger.kernel.org

diff --git a/fs/aio.c b/fs/aio.c
index 6efb7f6..fd1c0ba 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -652,7 +652,8 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
 	aio_nr += ctx->max_reqs;
 	spin_unlock(&aio_nr_lock);
 
-	percpu_ref_get(&ctx->users); /* io_setup() will drop this ref */
+	percpu_ref_get(&ctx->users);	/* io_setup() will drop this ref */
+	percpu_ref_get(&ctx->reqs);	/* free_ioctx_users() will drop this */
 
 	err = ioctx_add_table(ctx, mm);
 	if (err)


 fs/aio.c                |   52 ++++++++++++++++++++++++++++++++++++++++--------
 include/linux/migrate.h |    3 +-
 mm/migrate.c            |   13 ++++++------
 3 files changed, 53 insertions(+), 15 deletions(-)

commit 8e321fefb0e60bae4e2a28d20fc4fa30758d27c6
Author: Benjamin LaHaise <bcrl@kvack.org>
Date:   Sat Dec 21 17:56:08 2013 -0500

    aio/migratepages: make aio migrate pages sane
    
    The arbitrary restriction on page counts offered by the core
    migrate_page_move_mapping() code results in rather suspicious looking
    fiddling with page reference counts in the aio_migratepage() operation.
    To fix this, make migrate_page_move_mapping() take an extra_count parameter
    that allows aio to tell the code about its own reference count on the page
    being migrated.
    
    While cleaning up aio_migratepage(), make it validate that the old page
    being passed in is actually what aio_migratepage() expects to prevent
    misbehaviour in the case of races.
    
    Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>

diff --git a/fs/aio.c b/fs/aio.c
index fd1c0ba..efa708b 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -244,9 +244,14 @@ static void aio_free_ring(struct kioctx *ctx)
 	int i;
 
 	for (i = 0; i < ctx->nr_pages; i++) {
+		struct page *page;
 		pr_debug("pid(%d) [%d] page->count=%d\n", current->pid, i,
 				page_count(ctx->ring_pages[i]));
-		put_page(ctx->ring_pages[i]);
+		page = ctx->ring_pages[i];
+		if (!page)
+			continue;
+		ctx->ring_pages[i] = NULL;
+		put_page(page);
 	}
 
 	put_aio_ring_file(ctx);
@@ -280,18 +285,38 @@ static int aio_migratepage(struct address_space *mapping, struct page *new,
 	unsigned long flags;
 	int rc;
 
+	rc = 0;
+
+	/* Make sure the old page hasn't already been changed */
+	spin_lock(&mapping->private_lock);
+	ctx = mapping->private_data;
+	if (ctx) {
+		pgoff_t idx;
+		spin_lock_irqsave(&ctx->completion_lock, flags);
+		idx = old->index;
+		if (idx < (pgoff_t)ctx->nr_pages) {
+			if (ctx->ring_pages[idx] != old)
+				rc = -EAGAIN;
+		} else
+			rc = -EINVAL;
+		spin_unlock_irqrestore(&ctx->completion_lock, flags);
+	} else
+		rc = -EINVAL;
+	spin_unlock(&mapping->private_lock);
+
+	if (rc != 0)
+		return rc;
+
 	/* Writeback must be complete */
 	BUG_ON(PageWriteback(old));
-	put_page(old);
+	get_page(new);
 
-	rc = migrate_page_move_mapping(mapping, new, old, NULL, mode);
+	rc = migrate_page_move_mapping(mapping, new, old, NULL, mode, 1);
 	if (rc != MIGRATEPAGE_SUCCESS) {
-		get_page(old);
+		put_page(new);
 		return rc;
 	}
 
-	get_page(new);
-
 	/* We can potentially race against kioctx teardown here.  Use the
 	 * address_space's private data lock to protect the mapping's
 	 * private_data.
@@ -303,13 +328,24 @@ static int aio_migratepage(struct address_space *mapping, struct page *new,
 		spin_lock_irqsave(&ctx->completion_lock, flags);
 		migrate_page_copy(new, old);
 		idx = old->index;
-		if (idx < (pgoff_t)ctx->nr_pages)
-			ctx->ring_pages[idx] = new;
+		if (idx < (pgoff_t)ctx->nr_pages) {
+			/* And only do the move if things haven't changed */
+			if (ctx->ring_pages[idx] == old)
+				ctx->ring_pages[idx] = new;
+			else
+				rc = -EAGAIN;
+		} else
+			rc = -EINVAL;
 		spin_unlock_irqrestore(&ctx->completion_lock, flags);
 	} else
 		rc = -EBUSY;
 	spin_unlock(&mapping->private_lock);
 
+	if (rc == MIGRATEPAGE_SUCCESS)
+		put_page(old);
+	else
+		put_page(new);
+
 	return rc;
 }
 #endif
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index b7717d7..f015c05 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -55,7 +55,8 @@ extern int migrate_huge_page_move_mapping(struct address_space *mapping,
 				  struct page *newpage, struct page *page);
 extern int migrate_page_move_mapping(struct address_space *mapping,
 		struct page *newpage, struct page *page,
-		struct buffer_head *head, enum migrate_mode mode);
+		struct buffer_head *head, enum migrate_mode mode,
+		int extra_count);
 #else
 
 static inline void putback_lru_pages(struct list_head *l) {}
diff --git a/mm/migrate.c b/mm/migrate.c
index e9b7102..9194375 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -317,14 +317,15 @@ static inline bool buffer_migrate_lock_buffers(struct buffer_head *head,
  */
 int migrate_page_move_mapping(struct address_space *mapping,
 		struct page *newpage, struct page *page,
-		struct buffer_head *head, enum migrate_mode mode)
+		struct buffer_head *head, enum migrate_mode mode,
+		int extra_count)
 {
-	int expected_count = 0;
+	int expected_count = 1 + extra_count;
 	void **pslot;
 
 	if (!mapping) {
 		/* Anonymous page without mapping */
-		if (page_count(page) != 1)
+		if (page_count(page) != expected_count)
 			return -EAGAIN;
 		return MIGRATEPAGE_SUCCESS;
 	}
@@ -334,7 +335,7 @@ int migrate_page_move_mapping(struct address_space *mapping,
 	pslot = radix_tree_lookup_slot(&mapping->page_tree,
  					page_index(page));
 
-	expected_count = 2 + page_has_private(page);
+	expected_count += 1 + page_has_private(page);
 	if (page_count(page) != expected_count ||
 		radix_tree_deref_slot_protected(pslot, &mapping->tree_lock) != page) {
 		spin_unlock_irq(&mapping->tree_lock);
@@ -584,7 +585,7 @@ int migrate_page(struct address_space *mapping,
 
 	BUG_ON(PageWriteback(page));	/* Writeback must be complete */
 
-	rc = migrate_page_move_mapping(mapping, newpage, page, NULL, mode);
+	rc = migrate_page_move_mapping(mapping, newpage, page, NULL, mode, 0);
 
 	if (rc != MIGRATEPAGE_SUCCESS)
 		return rc;
@@ -611,7 +612,7 @@ int buffer_migrate_page(struct address_space *mapping,
 
 	head = page_buffers(page);
 
-	rc = migrate_page_move_mapping(mapping, newpage, page, head, mode);
+	rc = migrate_page_move_mapping(mapping, newpage, page, head, mode, 0);
 
 	if (rc != MIGRATEPAGE_SUCCESS)
 		return rc;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCHes - aio / migrate page, please review] Re: bad page state in 3.13-rc4
  2013-12-21 23:06                         ` [PATCHes - aio / migrate page, please review] " Benjamin LaHaise
@ 2013-12-22 19:09                           ` Linus Torvalds
  2013-12-22 21:30                             ` Dave Jones
  0 siblings, 1 reply; 26+ messages in thread
From: Linus Torvalds @ 2013-12-22 19:09 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Dave Jones, Kent Overstreet, Linux Kernel, linux-mm,
	Christoph Lameter, Al Viro

On Sat, Dec 21, 2013 at 3:06 PM, Benjamin LaHaise <bcrl@kvack.org> wrote:
>
> Linus, feel free to add my Signed-off-by: to your sanitization of
> aio_setup_ring() as well, as it works okay in my testing.

Nobody commented on your request for comments, so I applied my patch
and pulled your branch, because I'm going to do -rc5 in a few and at
least we want this to get testing.

Dave, let's hope that the leak fixes and reference count fixes solve
your problem.

                Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCHes - aio / migrate page, please review] Re: bad page state in 3.13-rc4
  2013-12-22 19:09                           ` Linus Torvalds
@ 2013-12-22 21:30                             ` Dave Jones
  0 siblings, 0 replies; 26+ messages in thread
From: Dave Jones @ 2013-12-22 21:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Benjamin LaHaise, Kent Overstreet, Linux Kernel, linux-mm,
	Christoph Lameter, Al Viro

On Sun, Dec 22, 2013 at 11:09:34AM -0800, Linus Torvalds wrote:
 > On Sat, Dec 21, 2013 at 3:06 PM, Benjamin LaHaise <bcrl@kvack.org> wrote:
 > >
 > > Linus, feel free to add my Signed-off-by: to your sanitization of
 > > aio_setup_ring() as well, as it works okay in my testing.
 > 
 > Nobody commented on your request for comments, so I applied my patch
 > and pulled your branch, because I'm going to do -rc5 in a few and at
 > least we want this to get testing.
 > 
 > Dave, let's hope that the leak fixes and reference count fixes solve
 > your problem.

Yeah, I'm not really going to have time to run tests again until after
the holidays. I'll shout when I get back if things still don't look right.
(or more likely, if I find something else ;-)

	Dave

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2013-12-22 21:38 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-12-19  4:07 bad page state in 3.13-rc4 Dave Jones
2013-12-19  4:40 ` Linus Torvalds
2013-12-19 15:41   ` Christoph Lameter
2013-12-19 20:11     ` Mel Gorman
2013-12-19 20:30       ` Dave Jones
2013-12-19 15:53   ` Dave Jones
2013-12-19 17:07     ` Linus Torvalds
2013-12-19 17:17       ` Dave Jones
2013-12-19 18:11       ` Kent Overstreet
2013-12-19 18:29         ` Benjamin LaHaise
2013-12-19 18:35           ` Dave Jones
2013-12-19 19:19           ` Linus Torvalds
2013-12-19 19:26             ` Benjamin LaHaise
2013-12-19 19:45               ` Linus Torvalds
2013-12-19 19:53                 ` Benjamin LaHaise
2013-12-19 20:02                   ` Linus Torvalds
2013-12-19 20:11                     ` Linus Torvalds
2013-12-19 20:31                       ` Benjamin LaHaise
2013-12-19 20:31                       ` Linus Torvalds
2013-12-19 20:42                         ` Benjamin LaHaise
2013-12-19 20:24                   ` Dave Jones
2013-12-19 23:38                     ` Benjamin LaHaise
2013-12-20  1:00                       ` Dave Jones
2013-12-21 23:06                         ` [PATCHes - aio / migrate page, please review] " Benjamin LaHaise
2013-12-22 19:09                           ` Linus Torvalds
2013-12-22 21:30                             ` Dave Jones

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).