linux-kernel.vger.kernel.org archive mirror
* kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
@ 2011-08-15 16:43 Justin Piszcz
  2011-08-15 18:18 ` Hugh Dickins
  2011-08-21 17:59 ` Maciej Rutecki
  0 siblings, 2 replies; 43+ messages in thread
From: Justin Piszcz @ 2011-08-15 16:43 UTC (permalink / raw)
  To: linux-kernel

Hello,

What causes this(?) -- am I out of memory(?) or is this a kernel bug?

[104147.717815] BUG: soft lockup - CPU#13 stuck for 23s! [kswapd0:934]
[104147.717820] Modules linked in: dm_mod tcp_diag parport_pc ppdev lp parport inet_diag ub pl2303 ftdi_sio snd_usb_audio snd_pcm_oss snd_mixer_oss snd_pcm snd_page_alloc snd_hwdep snd_usbmidi_lib snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer snd_seq_device snd soundcore usbserial cdc_acm nouveau serio_raw ttm drm_kms_helper drm agpgart i2c_algo_bit mxm_wmi wmi i7core_edac edac_core video
[104147.717872] CPU 13 
[104147.717874] Modules linked in: dm_mod tcp_diag parport_pc ppdev lp parport inet_diag ub pl2303 ftdi_sio snd_usb_audio snd_pcm_oss snd_mixer_oss snd_pcm snd_page_alloc snd_hwdep snd_usbmidi_lib snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer snd_seq_device snd soundcore usbserial cdc_acm nouveau serio_raw ttm drm_kms_helper drm agpgart i2c_algo_bit mxm_wmi wmi i7core_edac edac_core video
[104147.717916] 
[104147.717921] Pid: 934, comm: kswapd0 Not tainted 3.0.0 #4 Supermicro X8DTH-i/6/iF/6F/X8DTH
[104147.717928] RIP: 0010:[<ffffffff81081f71>]  [<ffffffff81081f71>] find_get_pages+0x51/0x110
[104147.717940] RSP: 0018:ffff880626127ba0  EFLAGS: 00000246
[104147.717944] RAX: ffff8800b5e94d10 RBX: ffff880626127b68 RCX: 000000000000000e
[104147.717948] RDX: 0000000000000003 RSI: 0000000000000000 RDI: ffffea001b29c8e8
[104147.717952] RBP: 0000000000000cd7 R08: 0000000000000001 R09: ffffea001acce040
[104147.717955] R10: 0000000000000001 R11: ffff8800ad03b930 R12: ffffffff815b47ce
[104147.717959] R13: ffffea001ec13740 R14: ffff880626127dc0 R15: ffffffff81087ce4
[104147.717964] FS:  0000000000000000(0000) GS:ffff88063fce0000(0000) knlGS:0000000000000000
[104147.717969] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[104147.717973] CR2: 000000007ddf2000 CR3: 0000000001843000 CR4: 00000000000006e0
[104147.717976] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[104147.717979] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[104147.717983] Process kswapd0 (pid: 934, threadinfo ffff880626126000, task ffff880626125810)
[104147.717986] Stack:
[104147.717988]  0000000000000001 ffffffff810807fe ffff8800463ed6d8 ffff880626127c10
[104147.717998]  00000000000000a9 ffffffffffffffff ffff8800463ed6d8 ffffea001ec13740
[104147.718007]  0000000000000002 ffffffff81089e15 0000000000000cd7 ffffffff8108b905
[104147.718015] Call Trace:
[104147.718023]  [<ffffffff810807fe>] ? __delete_from_page_cache+0x3e/0xe0
[104147.718030]  [<ffffffff81089e15>] ? pagevec_lookup+0x15/0x20
[104147.718036]  [<ffffffff8108b905>] ? invalidate_mapping_pages+0x55/0x130
[104147.718045]  [<ffffffff810d6835>] ? shrink_icache_memory+0x2c5/0x310
[104147.718052]  [<ffffffff8108c254>] ? shrink_slab+0x104/0x170
[104147.718058]  [<ffffffff8108eda2>] ? balance_pgdat+0x492/0x600
[104147.718064]  [<ffffffff8108efbc>] ? kswapd+0xac/0x250
[104147.718071]  [<ffffffff81050fd0>] ? abort_exclusive_wait+0xb0/0xb0
[104147.718077]  [<ffffffff8108ef10>] ? balance_pgdat+0x600/0x600
[104147.718082]  [<ffffffff8105082e>] ? kthread+0x7e/0x90
[104147.718090]  [<ffffffff815b4e14>] ? kernel_thread_helper+0x4/0x10
[104147.718096]  [<ffffffff810507b0>] ? kthread_worker_fn+0x120/0x120
[104147.718102]  [<ffffffff815b4e10>] ? gs_change+0xb/0xb
[104147.718105] Code: e7 e8 f4 8b 25 00 85 c0 89 c1 0f 84 ba 00 00 00 49 89 df 31 d2 45 31 f6 66 90 49 8b 07 48 8b 38 48 85 ff 74 3a 40 f6 c7 01 75 57 <8b> 77 08 85 f6 74 ee 44 8d 46 01 89 f0 4c 8d 4f 08 f0 44 0f b1 
[104147.718149] Call Trace:
[104147.718155]  [<ffffffff810807fe>] ? __delete_from_page_cache+0x3e/0xe0
[104147.718161]  [<ffffffff81089e15>] ? pagevec_lookup+0x15/0x20
[104147.718167]  [<ffffffff8108b905>] ? invalidate_mapping_pages+0x55/0x130
[104147.718174]  [<ffffffff810d6835>] ? shrink_icache_memory+0x2c5/0x310
[104147.718180]  [<ffffffff8108c254>] ? shrink_slab+0x104/0x170
[104147.718186]  [<ffffffff8108eda2>] ? balance_pgdat+0x492/0x600
[104147.718193]  [<ffffffff8108efbc>] ? kswapd+0xac/0x250
[104147.718199]  [<ffffffff81050fd0>] ? abort_exclusive_wait+0xb0/0xb0
[104147.718206]  [<ffffffff8108ef10>] ? balance_pgdat+0x600/0x600
[104147.718211]  [<ffffffff8105082e>] ? kthread+0x7e/0x90
[104147.718217]  [<ffffffff815b4e14>] ? kernel_thread_helper+0x4/0x10
[104147.718224]  [<ffffffff810507b0>] ? kthread_worker_fn+0x120/0x120
[104147.718229]  [<ffffffff815b4e10>] ? gs_change+0xb/0xb

Justin.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-08-15 16:43 kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110 Justin Piszcz
@ 2011-08-15 18:18 ` Hugh Dickins
  2011-08-15 19:02   ` Justin Piszcz
  2011-08-21 17:59 ` Maciej Rutecki
  1 sibling, 1 reply; 43+ messages in thread
From: Hugh Dickins @ 2011-08-15 18:18 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: linux-kernel

On Mon, 15 Aug 2011, Justin Piszcz wrote:
> Hello,
> 
> What causes this(?) -- am I out of memory(?) or is this a kernel bug?

It would be a kernel bug to lock up even if you are out of memory.

It does look like you're under memory pressure, but I don't see any OOM.

Is this something you've noticed just once, or does it happen repeatedly?

Does it always hit somewhere in find_get_pages(), or does the loop span
wider than that?

I'm answering out of interest in find_get_pages(), which does contain
a number of gotos that could result in endless looping; except that
they're all supposed to be for very transitory conditions which a
second glance at the RCU-protected tree should correct.

But if a radix_tree node got corrupted, then yes, it could loop forever.
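For reference, the loop in question (simplified from 3.0's mm/filemap.c,
sketched from memory, so check your own tree) looks roughly like this:

```c
/*
 * Simplified sketch of the 3.0-era find_get_pages() lookup loop.
 * Each goto handles a supposedly transient race under RCU; if a
 * radix_tree node were corrupted, they could fire indefinitely.
 */
restart:
	nr_found = radix_tree_gang_lookup_slot(&mapping->page_tree,
				(void ***)pages, start, nr_pages);
	ret = 0;
	for (i = 0; i < nr_found; i++) {
		struct page *page;
repeat:
		page = radix_tree_deref_slot((void **)pages[i]);
		if (unlikely(!page))
			continue;
		if (radix_tree_deref_retry(page))
			goto restart;	/* tree changed under us: rescan */
		if (!page_cache_get_speculative(page))
			goto repeat;	/* page being freed: re-deref slot */
		/* Has the page moved? */
		if (unlikely(page != *((void **)pages[i]))) {
			page_cache_release(page);
			goto repeat;
		}
		pages[ret] = page;
		ret++;
	}
```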

If it's repeatable, please try again with slab poisoning (and frame
pointers) enabled?

Hugh

> 
> [104147.717815] BUG: soft lockup - CPU#13 stuck for 23s! [kswapd0:934]
> [104147.717820] Modules linked in: dm_mod tcp_diag parport_pc ppdev lp
> parport inet_diag ub pl2303 ftdi_sio snd_usb_audio snd_pcm_oss snd_mixer_oss
> snd_pcm snd_page_alloc snd_hwdep snd_usbmidi_lib snd_seq_dummy snd_seq_oss
> snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer snd_seq_device
> snd soundcore usbserial cdc_acm nouveau serio_raw ttm drm_kms_helper drm
> agpgart i2c_algo_bit mxm_wmi wmi i7core_edac edac_core video
> [104147.717872] CPU 13 [104147.717874] Modules linked in: dm_mod tcp_diag
> parport_pc ppdev lp parport inet_diag ub pl2303 ftdi_sio snd_usb_audio
> snd_pcm_oss snd_mixer_oss snd_pcm snd_page_alloc snd_hwdep snd_usbmidi_lib
> snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi snd_seq_midi_event
> snd_seq snd_timer snd_seq_device snd soundcore usbserial cdc_acm nouveau
> serio_raw ttm drm_kms_helper drm agpgart i2c_algo_bit mxm_wmi wmi
> i7core_edac edac_core video
> [104147.717916] [104147.717921] Pid: 934, comm: kswapd0 Not tainted 3.0.0 #4
> Supermicro X8DTH-i/6/iF/6F/X8DTH
> [104147.717928] RIP: 0010:[<ffffffff81081f71>]  [<ffffffff81081f71>]
> find_get_pages+0x51/0x110
> [104147.717940] RSP: 0018:ffff880626127ba0  EFLAGS: 00000246
> [104147.717944] RAX: ffff8800b5e94d10 RBX: ffff880626127b68 RCX:
> 000000000000000e
> [104147.717948] RDX: 0000000000000003 RSI: 0000000000000000 RDI:
> ffffea001b29c8e8
> [104147.717952] RBP: 0000000000000cd7 R08: 0000000000000001 R09:
> ffffea001acce040
> [104147.717955] R10: 0000000000000001 R11: ffff8800ad03b930 R12:
> ffffffff815b47ce
> [104147.717959] R13: ffffea001ec13740 R14: ffff880626127dc0 R15:
> ffffffff81087ce4
> [104147.717964] FS:  0000000000000000(0000) GS:ffff88063fce0000(0000)
> knlGS:0000000000000000
> [104147.717969] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [104147.717973] CR2: 000000007ddf2000 CR3: 0000000001843000 CR4:
> 00000000000006e0
> [104147.717976] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [104147.717979] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
> 0000000000000400
> [104147.717983] Process kswapd0 (pid: 934, threadinfo ffff880626126000, task
> ffff880626125810)
> [104147.717986] Stack:
> [104147.717988]  0000000000000001 ffffffff810807fe ffff8800463ed6d8
> ffff880626127c10
> [104147.717998]  00000000000000a9 ffffffffffffffff ffff8800463ed6d8
> ffffea001ec13740
> [104147.718007]  0000000000000002 ffffffff81089e15 0000000000000cd7
> ffffffff8108b905
> [104147.718015] Call Trace:
> [104147.718023]  [<ffffffff810807fe>] ? __delete_from_page_cache+0x3e/0xe0
> [104147.718030]  [<ffffffff81089e15>] ? pagevec_lookup+0x15/0x20
> [104147.718036]  [<ffffffff8108b905>] ? invalidate_mapping_pages+0x55/0x130
> [104147.718045]  [<ffffffff810d6835>] ? shrink_icache_memory+0x2c5/0x310
> [104147.718052]  [<ffffffff8108c254>] ? shrink_slab+0x104/0x170
> [104147.718058]  [<ffffffff8108eda2>] ? balance_pgdat+0x492/0x600
> [104147.718064]  [<ffffffff8108efbc>] ? kswapd+0xac/0x250
> [104147.718071]  [<ffffffff81050fd0>] ? abort_exclusive_wait+0xb0/0xb0
> [104147.718077]  [<ffffffff8108ef10>] ? balance_pgdat+0x600/0x600
> [104147.718082]  [<ffffffff8105082e>] ? kthread+0x7e/0x90
> [104147.718090]  [<ffffffff815b4e14>] ? kernel_thread_helper+0x4/0x10
> [104147.718096]  [<ffffffff810507b0>] ? kthread_worker_fn+0x120/0x120
> [104147.718102]  [<ffffffff815b4e10>] ? gs_change+0xb/0xb
> [104147.718105] Code: e7 e8 f4 8b 25 00 85 c0 89 c1 0f 84 ba 00 00 00 49 89
> df 31 d2 45 31 f6 66 90 49 8b 07 48 8b 38 48 85 ff 74 3a 40 f6 c7 01 75 57
> <8b> 77 08 85 f6 74 ee 44 8d 46 01 89 f0 4c 8d 4f 08 f0 44 0f b1
> [104147.718149] Call Trace:
> [104147.718155]  [<ffffffff810807fe>] ? __delete_from_page_cache+0x3e/0xe0
> [104147.718161]  [<ffffffff81089e15>] ? pagevec_lookup+0x15/0x20
> [104147.718167]  [<ffffffff8108b905>] ? invalidate_mapping_pages+0x55/0x130
> [104147.718174]  [<ffffffff810d6835>] ? shrink_icache_memory+0x2c5/0x310
> [104147.718180]  [<ffffffff8108c254>] ? shrink_slab+0x104/0x170
> [104147.718186]  [<ffffffff8108eda2>] ? balance_pgdat+0x492/0x600
> [104147.718193]  [<ffffffff8108efbc>] ? kswapd+0xac/0x250
> [104147.718199]  [<ffffffff81050fd0>] ? abort_exclusive_wait+0xb0/0xb0
> [104147.718206]  [<ffffffff8108ef10>] ? balance_pgdat+0x600/0x600
> [104147.718211]  [<ffffffff8105082e>] ? kthread+0x7e/0x90
> [104147.718217]  [<ffffffff815b4e14>] ? kernel_thread_helper+0x4/0x10
> [104147.718224]  [<ffffffff810507b0>] ? kthread_worker_fn+0x120/0x120
> [104147.718229]  [<ffffffff815b4e10>] ? gs_change+0xb/0xb
> 
> Justin.


* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-08-15 18:18 ` Hugh Dickins
@ 2011-08-15 19:02   ` Justin Piszcz
  2011-08-15 19:53     ` Hugh Dickins
  0 siblings, 1 reply; 43+ messages in thread
From: Justin Piszcz @ 2011-08-15 19:02 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: linux-kernel



On Mon, 15 Aug 2011, Hugh Dickins wrote:

> On Mon, 15 Aug 2011, Justin Piszcz wrote:
>> Hello,
>>
>> What causes this(?) -- am I out of memory(?) or is this a kernel bug?
>
> It would be a kernel bug to lock up even if you are out of memory.
This machine has 48GB of RAM and it's just a Linux router with some gqview
instances running.

>
> It does look like you're under memory pressure, but I don't see any OOM.
>
> Is this something you've noticed just once, or does it happen repeatedly?
This has happened once before (I e-mailed LKML about it last weekend or
thereabouts, but nobody responded).

It is here:
http://lkml.org/lkml/2011/8/12/54 (down?)
http://comments.gmane.org/gmane.linux.kernel/1178570

>
> Does it always hit somewhere in find_get_pages(), or does the loop span
> wider than that?
Per: http://comments.gmane.org/gmane.linux.kernel/1178570

Slightly different (from August 12):

   [330509.718763] Call Trace:
   [330509.718771]  [<ffffffff81089e15>] ? pagevec_lookup+0x15/0x20
   [330509.718776]  [<ffffffff8108b905>] ? invalidate_mapping_pages+0x55/0x130
   [330509.718784]  [<ffffffff810d6835>] ? shrink_icache_memory+0x2c5/0x310
   [330509.718788]  [<ffffffff8108c254>] ? shrink_slab+0x104/0x170
   [330509.718793]  [<ffffffff8108eda2>] ? balance_pgdat+0x492/0x600
   [330509.718798]  [<ffffffff8108efbc>] ? kswapd+0xac/0x250
   [330509.718803]  [<ffffffff81050fd0>] ? abort_exclusive_wait+0xb0/0xb0
   [330509.718807]  [<ffffffff8108ef10>] ? balance_pgdat+0x600/0x600
   [330509.718811]  [<ffffffff8105082e>] ? kthread+0x7e/0x90
   [330509.718818]  [<ffffffff815b4e14>] ? kernel_thread_helper+0x4/0x10
   [330509.718822]  [<ffffffff810507b0>] ? kthread_worker_fn+0x120/0x120
   [330509.718825]  [<ffffffff815b4e10>] ? gs_change+0xb/0xb

The first time it happened was when running a lot of I/O
(dumps and streams/backups over SSH).

>
> I'm answering out of interest in find_get_pages(): which does contain
> a number of gotos which could result in endless looping; except that
> they're all supposed to be for very transitory conditions which a
> second glance at the RCU-protected tree should correct.
I am using 'server' for the workload type, not 'low latency', which exposes
more bugs/problems.

>
> But if a radix_tree node got corrupted, then yes, it could loop forever.
>
> If it's repeatable, please try again with slab poisoning (and frame
> pointers) enabled?
I will enable frame pointers and wait for the next error/problem and report
back if/when it recurs, thanks!

Justin.



* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-08-15 19:02   ` Justin Piszcz
@ 2011-08-15 19:53     ` Hugh Dickins
  0 siblings, 0 replies; 43+ messages in thread
From: Hugh Dickins @ 2011-08-15 19:53 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: linux-kernel

On Mon, 15 Aug 2011, Justin Piszcz wrote:
> On Mon, 15 Aug 2011, Hugh Dickins wrote:
> > 
> > Does it always hit somewhere in find_get_pages(), or does the loop span
> > wider than that?
> Per: http://comments.gmane.org/gmane.linux.kernel/1178570

You've cut the RIP line out of that report, so I cannot see whether it's also in
find_get_pages() or not.  I suspect that it is, but it would be nice to see confirmation.

Ah, I missed the "(Continue reading)" button, which reveals "Full output
here": that shows more, and indeed, RIP is at find_get_pages+0x46.

Interesting, but I don't know what's going on here.

Hugh


* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-08-15 16:43 kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110 Justin Piszcz
  2011-08-15 18:18 ` Hugh Dickins
@ 2011-08-21 17:59 ` Maciej Rutecki
  2011-08-21 18:49   ` Justin Piszcz
  1 sibling, 1 reply; 43+ messages in thread
From: Maciej Rutecki @ 2011-08-21 17:59 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: linux-kernel

On Monday, 15 August 2011 at 18:43:22 Justin Piszcz wrote:
> Hello,
> 
> What causes this(?) -- am I out of memory(?) or is this a kernel bug?
> 
> [104147.717815] BUG: soft lockup - CPU#13 stuck for 23s! [kswapd0:934]
> [104147.717820] Modules linked in: dm_mod tcp_diag parport_pc ppdev lp
> parport inet_diag ub pl2303 ftdi_sio snd_usb_audio snd_pcm_oss
> snd_mixer_oss snd_pcm snd_page_alloc snd_hwdep snd_usbmidi_lib
> snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi snd_seq_midi_event
> snd_seq snd_timer snd_seq_device snd soundcore usbserial cdc_acm nouveau
> serio_raw ttm drm_kms_helper drm agpgart i2c_algo_bit mxm_wmi wmi
> i7core_edac edac_core video [104147.717872] CPU 13
> [104147.717874] Modules linked in: dm_mod tcp_diag parport_pc ppdev lp
> parport inet_diag ub pl2303 ftdi_sio snd_usb_audio snd_pcm_oss
> snd_mixer_oss snd_pcm snd_page_alloc snd_hwdep snd_usbmidi_lib
> snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi snd_seq_midi_event
> snd_seq snd_timer snd_seq_device snd soundcore usbserial cdc_acm nouveau
> serio_raw ttm drm_kms_helper drm agpgart i2c_algo_bit mxm_wmi wmi
> i7core_edac edac_core video [104147.717916]
> [104147.717921] Pid: 934, comm: kswapd0 Not tainted 3.0.0 #4 Supermicro
> X8DTH-i/6/iF/6F/X8DTH [104147.717928] RIP: 0010:[<ffffffff81081f71>] 
> [<ffffffff81081f71>] find_get_pages+0x51/0x110 [104147.717940] RSP:
> 0018:ffff880626127ba0  EFLAGS: 00000246
[...]

Is this a regression? Does 2.6.39 work OK?

Regards
-- 
Maciej Rutecki
http://www.maciek.unixy.pl


* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-08-21 17:59 ` Maciej Rutecki
@ 2011-08-21 18:49   ` Justin Piszcz
  0 siblings, 0 replies; 43+ messages in thread
From: Justin Piszcz @ 2011-08-21 18:49 UTC (permalink / raw)
  To: Maciej Rutecki; +Cc: linux-kernel



On Sun, 21 Aug 2011, Maciej Rutecki wrote:

> On Monday, 15 August 2011 at 18:43:22 Justin Piszcz wrote:
>> Hello,
>>
>> What causes this(?) -- am I out of memory(?) or is this a kernel bug?
>>
>> [104147.717815] BUG: soft lockup - CPU#13 stuck for 23s! [kswapd0:934]
>> [104147.717820] Modules linked in: dm_mod tcp_diag parport_pc ppdev lp
>> parport inet_diag ub pl2303 ftdi_sio snd_usb_audio snd_pcm_oss
>> snd_mixer_oss snd_pcm snd_page_alloc snd_hwdep snd_usbmidi_lib
>> snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi snd_seq_midi_event
>> snd_seq snd_timer snd_seq_device snd soundcore usbserial cdc_acm nouveau
>> serio_raw ttm drm_kms_helper drm agpgart i2c_algo_bit mxm_wmi wmi
>> i7core_edac edac_core video [104147.717872] CPU 13
>> [104147.717874] Modules linked in: dm_mod tcp_diag parport_pc ppdev lp
>> parport inet_diag ub pl2303 ftdi_sio snd_usb_audio snd_pcm_oss
>> snd_mixer_oss snd_pcm snd_page_alloc snd_hwdep snd_usbmidi_lib
>> snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi snd_seq_midi_event
>> snd_seq snd_timer snd_seq_device snd soundcore usbserial cdc_acm nouveau
>> serio_raw ttm drm_kms_helper drm agpgart i2c_algo_bit mxm_wmi wmi
>> i7core_edac edac_core video [104147.717916]
>> [104147.717921] Pid: 934, comm: kswapd0 Not tainted 3.0.0 #4 Supermicro
>> X8DTH-i/6/iF/6F/X8DTH [104147.717928] RIP: 0010:[<ffffffff81081f71>]
>> [<ffffffff81081f71>] find_get_pages+0x51/0x110 [104147.717940] RSP:
>> 0018:ffff880626127ba0  EFLAGS: 00000246
> [...]
>
> Is this a regression? Does 2.6.39 work OK?

Hello,

It appears to have been occurring for some time (although not since I
enabled frame pointers in the kernel to debug the problem):

Kernels: 2.6.39.2, 3.0, 3.0.1.

kern.log:Jul 30 02:10:40 p34 kernel: [27760.840979] BUG: soft lockup - CPU#10 stuck for 23s! [kswapd1:946]
kern.log:Jul 30 02:11:08 p34 kernel: [27788.799309] BUG: soft lockup - CPU#10 stuck for 22s! [kswapd1:946]
kern.log:Jul 30 02:15:35 p34 kernel: [28000.480936] BUG: soft lockup - CPU#10 stuck for 21s! [kswapd1:946]
kern.log:Jul 30 02:15:35 p34 kernel: [28028.438740] BUG: soft lockup - CPU#10 stuck for 22s! [kswapd1:946]
kern.log:Jul 30 02:15:52 p34 kernel: [28072.364088] BUG: soft lockup - CPU#10 stuck for 21s! [kswapd1:946]
kern.log:Jul 30 02:16:20 p34 kernel: [28100.310535] BUG: soft lockup - CPU#10 stuck for 22s! [kswapd1:946]
kern.log:Jul 30 02:16:48 p34 kernel: [28128.260612] BUG: soft lockup - CPU#10 stuck for 22s! [kswapd1:946]
kern.log:Aug 12 01:46:47 p34 kernel: [330481.760506] BUG: soft lockup - CPU#7 stuck for 23s! [kswapd1:935]
kern.log:Aug 12 01:47:15 p34 kernel: [330509.718618] BUG: soft lockup - CPU#7 stuck for 23s! [kswapd1:935]
kern.log:Aug 15 11:57:38 p34 kernel: [104147.717815] BUG: soft lockup - CPU#13 stuck for 23s! [kswapd0:934]
kern.log:Aug 18 06:15:28 p34 kernel: [50020.901890] BUG: soft lockup - CPU#4 stuck for 22s! [kswapd0:934]

Justin.



* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
@ 2011-10-12 18:12 Paweł Sikora
  2011-10-13 23:16 ` Hugh Dickins
  0 siblings, 1 reply; 43+ messages in thread
From: Paweł Sikora @ 2011-10-12 18:12 UTC (permalink / raw)
  To: Hugh Dickins, linux-mm; +Cc: jpiszcz, arekm, linux-kernel

Hi Hugh,
I'm resending the previous private email with a larger Cc list, as you've requested.


Last weekend my server died again (processes stuck for 22/23s!), but this time I have more logs for you.
 
On my dual-Opteron machines I have non-standard settings:
- DISABLED swap space (fast user-process killing is better for my autotest farm
  than long disk swapping, and 64GB of ECC RAM is enough for my processing),
- vm.overcommit_memory = 2,
- vm.overcommit_ratio = 100.
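(For anyone wanting to reproduce this configuration, the equivalent commands
would be something like:)

```shell
# Reproduce the stated configuration (run as root; not persistent across reboot)
swapoff -a                          # disable all swap space
sysctl -w vm.overcommit_memory=2    # strict overcommit accounting (no overcommit)
sysctl -w vm.overcommit_ratio=100   # allow commits up to 100% of physical RAM
```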

After the initial BUG_ON (pasted below) there's a flood of 'rcu_sched_state detected stalls / CPU#X stuck for 22s!' messages
(the full compressed log is available at: http://pluto.agmk.net/kernel/kernel.bz2)
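
For context, include/linux/swapops.h:105 in 3.0 is the locked-page assertion in
migration_entry_to_page() -- a sketch from memory, so verify against your tree:

```c
/*
 * The check that fires at include/linux/swapops.h:105 in 3.0:
 * any use of a migration entry may only occur while the
 * corresponding page is locked.
 */
static inline struct page *migration_entry_to_page(swp_entry_t entry)
{
	struct page *p = pfn_to_page(swp_offset(entry));
	BUG_ON(!PageLocked(p));
	return p;
}
```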

Oct  9 08:06:43 hal kernel: [408578.629070] ------------[ cut here ]------------
Oct  9 08:06:43 hal kernel: [408578.629143] kernel BUG at include/linux/swapops.h:105!
Oct  9 08:06:43 hal kernel: [408578.629143] invalid opcode: 0000 [#1] SMP
Oct  9 08:06:43 hal kernel: [408578.629143] CPU 14
Oct  9 08:06:43 hal kernel: [408578.629143] Modules linked in: nfs fscache binfmt_misc nfsd lockd nfs_acl auth_rpcgss sunrpc ipmi_si ipmi_devintf ipmi_msghandler sch_sfq iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 iptable_filter xt_TCPMSS xt_tcpudp iptable_mangle ip_tables ip6table_filter ip6_tables x_tables ext4 jbd2 crc16 raid10 raid0 dm_mod uvesafb autofs4 dummy aoe joydev usbhid hid ide_cd_mod cdrom ata_generic pata_acpi ohci_hcd pata_atiixp sp5100_tco ide_pci_generic ssb ehci_hcd igb i2c_piix4 pcmcia evdev atiixp pcmcia_core psmouse ide_core usbcore i2c_core amd64_edac_mod mmc_core edac_core pcspkr sg edac_mce_amd button serio_raw processor dca ghes k10temp hwmon hed sd_mod crc_t10dif raid1 md_mod ext3 jbd mbcache ahci libahci libata scsi_mod [last unloaded: scsi_wait_scan]
Oct  9 08:06:43 hal kernel: [408578.629143]
Oct  9 08:06:43 hal kernel: [408578.629143] Pid: 29214, comm: bitgen Not tainted 3.0.4 #5 Supermicro H8DGU/H8DGU
Oct  9 08:06:43 hal kernel: [408578.629143] RIP: 0010:[<ffffffff81127b76>]  [<ffffffff81127b76>] migration_entry_wait+0x156/0x160
Oct  9 08:06:43 hal kernel: [408578.629143] RSP: 0000:ffff88021cee7d18  EFLAGS: 00010246
Oct  9 08:06:43 hal kernel: [408578.629143] RAX: 0a00000000080068 RBX: ffffea001d1dbe70 RCX: ffff880c02d18978
Oct  9 08:06:43 hal kernel: [408578.629143] RDX: 0000000000851a42 RSI: ffff880d8fe33618 RDI: ffffea002a09dd50
Oct  9 08:06:43 hal kernel: [408578.629143] RBP: ffff88021cee7d38 R08: ffff880d8fe33618 R09: 0000000000000028
Oct  9 08:06:43 hal kernel: [408578.629143] R10: ffff881006eb0f00 R11: f800000000851a42 R12: ffffea002a09dd40
Oct  9 08:06:43 hal kernel: [408578.629143] R13: 0000000c02d18978 R14: 00000000d872f000 R15: ffff880d8fe33618
Oct  9 08:06:43 hal kernel: [408578.629143] FS:  00007f864d7fa700(0000) GS:ffff880c1fc80000(0063) knlGS:00000000f432a910
Oct  9 08:06:43 hal kernel: [408578.629143] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
Oct  9 08:06:43 hal kernel: [408578.629143] CR2: 00000000d872f000 CR3: 000000100668c000 CR4: 00000000000006e0
Oct  9 08:06:43 hal kernel: [408578.629143] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Oct  9 08:06:43 hal kernel: [408578.629143] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Oct  9 08:06:43 hal kernel: [408578.629143] Process bitgen (pid: 29214, threadinfo ffff88021cee6000, task ffff880407b06900)
Oct  9 08:06:43 hal kernel: [408578.629143] Stack:
Oct  9 08:06:43 hal kernel: [408578.629143]  00000000dcef4000 ffff880c00968c98 ffff880c02d18978 000000010a34843e
Oct  9 08:06:43 hal kernel: [408578.629143]  ffff88021cee7de8 ffffffff811016a1 ffff88021cee7d78 ffff881006eb0f00
Oct  9 08:06:43 hal kernel: [408578.629143]  ffff88021cee7d98 ffffffff810feee2 ffff880c06f0d170 8000000b98d14067
Oct  9 08:06:43 hal kernel: [408578.629143] Call Trace:
Oct  9 08:06:43 hal kernel: [408578.629143]  [<ffffffff811016a1>] handle_pte_fault+0xae1/0xaf0
Oct  9 08:06:43 hal kernel: [408578.629143]  [<ffffffff810feee2>] ? __pte_alloc+0x42/0x120
Oct  9 08:06:43 hal kernel: [408578.629143]  [<ffffffff8112c26b>] ? do_huge_pmd_anonymous_page+0xab/0x310
Oct  9 08:06:43 hal kernel: [408578.629143]  [<ffffffff81102a31>] handle_mm_fault+0x181/0x310
Oct  9 08:06:43 hal kernel: [408578.629143]  [<ffffffff81106097>] ? vma_adjust+0x537/0x570
Oct  9 08:06:43 hal kernel: [408578.629143]  [<ffffffff81424bed>] do_page_fault+0x11d/0x4e0
Oct  9 08:06:43 hal kernel: [408578.629143]  [<ffffffff81109a05>] ? do_mremap+0x2d5/0x570
Oct  9 08:06:43 hal kernel: [408578.629143]  [<ffffffff81421d5f>] page_fault+0x1f/0x30
Oct  9 08:06:43 hal kernel: [408578.629143] Code: 80 00 00 00 00 31 f6 48 89 df e8 e6 58 fb ff eb d7 85 c9 0f 84 44 ff ff ff 8d 51 01 89 c8 f0 0f b1 16 39 c1 90 74 b5 89 c1 eb e6 <0f> 0b 0f 1f 84 00 00 00 00 00 55 48 89 e5 48 83 ec 20 48 85 ff
Oct  9 08:06:43 hal kernel: [408578.629143] RIP  [<ffffffff81127b76>] migration_entry_wait+0x156/0x160
Oct  9 08:06:43 hal kernel: [408578.629143]  RSP <ffff88021cee7d18>
Oct  9 08:06:43 hal kernel: [408578.642823] ---[ end trace 0a37362301163711 ]---
Oct  9 08:07:10 hal kernel: [408605.283257] BUG: soft lockup - CPU#12 stuck for 23s! [par:29801]
Oct  9 08:07:10 hal kernel: [408605.283367] Modules linked in: nfs fscache binfmt_misc nfsd lockd nfs_acl auth_rpcgss sunrpc ipmi_si ipmi_devintf ipmi_msghandler sch_sfq iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 iptable_filter xt_TCPMSS xt_tcpudp iptable_mangle ip_tables ip6table_filter ip6_tables x_tables ext4 jbd2 crc16 raid10 raid0 dm_mod uvesafb autofs4 dummy aoe joydev usbhid hid ide_cd_mod cdrom ata_generic pata_acpi ohci_hcd pata_atiixp sp5100_tco ide_pci_generic ssb ehci_hcd igb i2c_piix4 pcmcia evdev atiixp pcmcia_core psmouse ide_core usbcore i2c_core amd64_edac_mod mmc_core edac_core pcspkr sg edac_mce_amd button serio_raw processor dca ghes k10temp hwmon hed sd_mod crc_t10dif raid1 md_mod ext3 jbd mbcache ahci libahci libata scsi_mod [last unloaded: scsi_wait_scan]
Oct  9 08:07:10 hal kernel: [408605.285807] CPU 12
Oct  9 08:07:10 hal kernel: [408605.285807] Modules linked in: nfs fscache binfmt_misc nfsd lockd nfs_acl auth_rpcgss sunrpc ipmi_si ipmi_devintf ipmi_msghandler sch_sfq iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 iptable_filter xt_TCPMSS xt_tcpudp iptable_mangle ip_tables ip6table_filter ip6_tables x_tables ext4 jbd2 crc16 raid10 raid0 dm_mod uvesafb autofs4 dummy aoe joydev usbhid hid ide_cd_mod cdrom ata_generic pata_acpi ohci_hcd pata_atiixp sp5100_tco ide_pci_generic ssb ehci_hcd igb i2c_piix4 pcmcia evdev atiixp pcmcia_core psmouse ide_core usbcore i2c_core amd64_edac_mod mmc_core edac_core pcspkr sg edac_mce_amd button serio_raw processor dca ghes k10temp hwmon hed sd_mod crc_t10dif raid1 md_mod ext3 jbd mbcache ahci libahci libata scsi_mod [last unloaded: scsi_wait_scan]
Oct  9 08:07:10 hal kernel: [408605.285807]
Oct  9 08:07:10 hal kernel: [408605.285807] Pid: 29801, comm: par Tainted: G      D     3.0.4 #5 Supermicro H8DGU/H8DGU
Oct  9 08:07:10 hal kernel: [408605.285807] RIP: 0010:[<ffffffff814216a4>]  [<ffffffff814216a4>] _raw_spin_lock+0x14/0x20
Oct  9 08:07:10 hal kernel: [408605.285807] RSP: 0018:ffff880c02def808  EFLAGS: 00000293
Oct  9 08:07:10 hal kernel: [408605.285807] RAX: 0000000000000b09 RBX: ffffea002741f6b8 RCX: ffff880000000000
Oct  9 08:07:10 hal kernel: [408605.285807] RDX: ffffea0000000000 RSI: 000000002a09dd40 RDI: ffffea002a09dd50
Oct  9 08:07:10 hal kernel: [408605.285807] RBP: ffff880c02def808 R08: 0000000000000000 R09: ffff880f2e4f4d70
Oct  9 08:07:10 hal kernel: [408605.285807] R10: ffff880c1fffbe00 R11: 0000000000000050 R12: ffffffff8142988e
Oct  9 08:07:10 hal kernel: [408605.285807] R13: ffff880c02def7b8 R14: ffffffff810e63dc R15: ffff880c02def7b8
Oct  9 08:07:10 hal kernel: [408605.285807] FS:  00007fe6b677c720(0000) GS:ffff880c1fc00000(0063) knlGS:00000000f6e0d910
Oct  9 08:07:10 hal kernel: [408605.285807] CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b
Oct  9 08:07:10 hal kernel: [408605.285807] CR2: 00000000dd40012c CR3: 00000009f6b78000 CR4: 00000000000006e0
Oct  9 08:07:10 hal kernel: [408605.285807] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Oct  9 08:07:10 hal kernel: [408605.285807] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Oct  9 08:07:10 hal kernel: [408605.285807] Process par (pid: 29801, threadinfo ffff880c02dee000, task ffff880c07c90700)
Oct  9 08:07:10 hal kernel: [408605.285807] Stack:
Oct  9 08:07:10 hal kernel: [408605.285807]  ffff880c02def858 ffffffff8110a2d7 ffffea001fc16fe0 ffff880c02d183a8
Oct  9 08:07:10 hal kernel: [408605.285807]  ffff880c02def8c8 ffffea001fc16fa8 ffff880c00968c98 ffff881006eb0f00
Oct  9 08:07:10 hal kernel: [408605.285807]  0000000000000301 00000000d8675000 ffff880c02def8c8 ffffffff8110aa6a
Oct  9 08:07:10 hal kernel: [408605.285807] Call Trace:
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110a2d7>] __page_check_address+0x107/0x1a0
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110aa6a>] try_to_unmap_one+0x3a/0x420
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110be44>] try_to_unmap_anon+0xb4/0x130
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110bf75>] try_to_unmap+0x65/0x80
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff811285d0>] migrate_pages+0x310/0x4c0
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810e93c2>] ? ____pagevec_lru_add+0x12/0x20
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8111cbf0>] ? ftrace_define_fields_mm_compaction_isolate_template+0x70/0x70
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8111d5da>] compact_zone+0x52a/0x8c0
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810f9919>] ? zone_statistics+0x99/0xc0
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8111dade>] compact_zone_order+0x7e/0xb0
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810e46a8>] ? get_page_from_freelist+0x3b8/0x7e0
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8111dbcd>] try_to_compact_pages+0xbd/0xf0
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810e5148>] __alloc_pages_direct_compact+0xa8/0x180
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810e56c5>] __alloc_pages_nodemask+0x4a5/0x7f0
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810e9698>] ? lru_cache_add_lru+0x28/0x50
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110a92d>] ? page_add_new_anon_rmap+0x9d/0xb0
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8111b865>] alloc_pages_vma+0x95/0x180
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8112c2f8>] do_huge_pmd_anonymous_page+0x138/0x310
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff81102ace>] handle_mm_fault+0x21e/0x310
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff81001716>] ? __switch_to+0x1e6/0x2c0
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff81424bed>] do_page_fault+0x11d/0x4e0
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8141f178>] ? schedule+0x308/0xa10
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff811077a7>] ? do_mmap_pgoff+0x357/0x370
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110790d>] ? sys_mmap_pgoff+0x14d/0x220
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff81421d5f>] page_fault+0x1f/0x30
Oct  9 08:07:10 hal kernel: [408605.285807] Code: 0f b6 c2 85 c0 0f 95 c0 0f b6 c0 5d c3 66 2e 0f 1f 84 00 00 00 00 00 55 b8 00 01 00 00 48 89 e5 f0 66 0f c1 07 38 e0 74 06 f3 90 <8a> 07 eb f6 5d c3 66 0f 1f 44 00 00 55 48 89 e5 9c 58 fa ba 00
Oct  9 08:07:10 hal kernel: [408605.285807] Call Trace:
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110a2d7>] __page_check_address+0x107/0x1a0
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110aa6a>] try_to_unmap_one+0x3a/0x420
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110be44>] try_to_unmap_anon+0xb4/0x130
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110bf75>] try_to_unmap+0x65/0x80
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff811285d0>] migrate_pages+0x310/0x4c0
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810e93c2>] ? ____pagevec_lru_add+0x12/0x20
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8111cbf0>] ? ftrace_define_fields_mm_compaction_isolate_template+0x70/0x70
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8111d5da>] compact_zone+0x52a/0x8c0
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810f9919>] ? zone_statistics+0x99/0xc0
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8111dade>] compact_zone_order+0x7e/0xb0
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810e46a8>] ? get_page_from_freelist+0x3b8/0x7e0
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8111dbcd>] try_to_compact_pages+0xbd/0xf0
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810e5148>] __alloc_pages_direct_compact+0xa8/0x180
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810e56c5>] __alloc_pages_nodemask+0x4a5/0x7f0
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810e9698>] ? lru_cache_add_lru+0x28/0x50
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110a92d>] ? page_add_new_anon_rmap+0x9d/0xb0
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8111b865>] alloc_pages_vma+0x95/0x180
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8112c2f8>] do_huge_pmd_anonymous_page+0x138/0x310
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff81102ace>] handle_mm_fault+0x21e/0x310
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff81001716>] ? __switch_to+0x1e6/0x2c0
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff81424bed>] do_page_fault+0x11d/0x4e0
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8141f178>] ? schedule+0x308/0xa10
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff811077a7>] ? do_mmap_pgoff+0x357/0x370
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110790d>] ? sys_mmap_pgoff+0x14d/0x220
Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff81421d5f>] page_fault+0x1f/0x30

BR,
Paweł.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-10-12 18:12 Paweł Sikora
@ 2011-10-13 23:16 ` Hugh Dickins
  2011-10-13 23:30   ` Hugh Dickins
                     ` (3 more replies)
  0 siblings, 4 replies; 43+ messages in thread
From: Hugh Dickins @ 2011-10-13 23:16 UTC (permalink / raw)
  To: Pawel Sikora
  Cc: Andrew Morton, Mel Gorman, Andrea Arcangeli, linux-mm, jpiszcz,
	arekm, linux-kernel

[ Subject refers to a different, unexplained 3.0 bug from Pawel ]

On Wed, 12 Oct 2011, Pawel Sikora wrote:

> Hi Hugh,
> I'm resending the previous private email with a larger cc list, as you've requested.

Thanks, yes, on this one I think I do have an answer;
and we ought to bring Mel and Andrea in too.

> 
> Last weekend my server died again (processes stuck for 22/23s!), but this time I have more logs for you.
>  
> on my dual-opteron machines i have non-standard settings:
> - DISABLED swap space (fast user process killing is better for my autotest farm than long disk swapping
>                        and 64GB ecc-ram is enough for my processing).
> - vm.overcommit_memory = 2,
> - vm.overcommit_ratio = 100.
> 
> after initial BUG_ON (pasted below) there's a flood of 'rcu_sched_state detected stalls / CPU#X stuck for 22s!'

Yes, those are just a tiresome consequence of exiting from a BUG
while holding the page table lock(s).

> (full compressed log is available at: http://pluto.agmk.net/kernel/kernel.bz2)
> 
> Oct  9 08:06:43 hal kernel: [408578.629070] ------------[ cut here ]------------
> Oct  9 08:06:43 hal kernel: [408578.629143] kernel BUG at include/linux/swapops.h:105!
> Oct  9 08:06:43 hal kernel: [408578.629143] invalid opcode: 0000 [#1] SMP
> Oct  9 08:06:43 hal kernel: [408578.629143] CPU 14
[ I'm deleting that irrelevant long line list of modules ]
> Oct  9 08:06:43 hal kernel: [408578.629143]
> Oct  9 08:06:43 hal kernel: [408578.629143] Pid: 29214, comm: bitgen Not tainted 3.0.4 #5 Supermicro H8DGU/H8DGU
> Oct  9 08:06:43 hal kernel: [408578.629143] RIP: 0010:[<ffffffff81127b76>]  [<ffffffff81127b76>] migration_entry_wait+0x156/0x160
> Oct  9 08:06:43 hal kernel: [408578.629143] RSP: 0000:ffff88021cee7d18  EFLAGS: 00010246
> Oct  9 08:06:43 hal kernel: [408578.629143] RAX: 0a00000000080068 RBX: ffffea001d1dbe70 RCX: ffff880c02d18978
> Oct  9 08:06:43 hal kernel: [408578.629143] RDX: 0000000000851a42 RSI: ffff880d8fe33618 RDI: ffffea002a09dd50
> Oct  9 08:06:43 hal kernel: [408578.629143] RBP: ffff88021cee7d38 R08: ffff880d8fe33618 R09: 0000000000000028
> Oct  9 08:06:43 hal kernel: [408578.629143] R10: ffff881006eb0f00 R11: f800000000851a42 R12: ffffea002a09dd40
> Oct  9 08:06:43 hal kernel: [408578.629143] R13: 0000000c02d18978 R14: 00000000d872f000 R15: ffff880d8fe33618
> Oct  9 08:06:43 hal kernel: [408578.629143] FS:  00007f864d7fa700(0000) GS:ffff880c1fc80000(0063) knlGS:00000000f432a910
> Oct  9 08:06:43 hal kernel: [408578.629143] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
> Oct  9 08:06:43 hal kernel: [408578.629143] CR2: 00000000d872f000 CR3: 000000100668c000 CR4: 00000000000006e0
> Oct  9 08:06:43 hal kernel: [408578.629143] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Oct  9 08:06:43 hal kernel: [408578.629143] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Oct  9 08:06:43 hal kernel: [408578.629143] Process bitgen (pid: 29214, threadinfo ffff88021cee6000, task ffff880407b06900)
> Oct  9 08:06:43 hal kernel: [408578.629143] Stack:
> Oct  9 08:06:43 hal kernel: [408578.629143]  00000000dcef4000 ffff880c00968c98 ffff880c02d18978 000000010a34843e
> Oct  9 08:06:43 hal kernel: [408578.629143]  ffff88021cee7de8 ffffffff811016a1 ffff88021cee7d78 ffff881006eb0f00
> Oct  9 08:06:43 hal kernel: [408578.629143]  ffff88021cee7d98 ffffffff810feee2 ffff880c06f0d170 8000000b98d14067
> Oct  9 08:06:43 hal kernel: [408578.629143] Call Trace:
> Oct  9 08:06:43 hal kernel: [408578.629143]  [<ffffffff811016a1>] handle_pte_fault+0xae1/0xaf0
> Oct  9 08:06:43 hal kernel: [408578.629143]  [<ffffffff810feee2>] ? __pte_alloc+0x42/0x120
> Oct  9 08:06:43 hal kernel: [408578.629143]  [<ffffffff8112c26b>] ? do_huge_pmd_anonymous_page+0xab/0x310
> Oct  9 08:06:43 hal kernel: [408578.629143]  [<ffffffff81102a31>] handle_mm_fault+0x181/0x310
> Oct  9 08:06:43 hal kernel: [408578.629143]  [<ffffffff81106097>] ? vma_adjust+0x537/0x570
> Oct  9 08:06:43 hal kernel: [408578.629143]  [<ffffffff81424bed>] do_page_fault+0x11d/0x4e0
> Oct  9 08:06:43 hal kernel: [408578.629143]  [<ffffffff81109a05>] ? do_mremap+0x2d5/0x570
> Oct  9 08:06:43 hal kernel: [408578.629143]  [<ffffffff81421d5f>] page_fault+0x1f/0x30
> Oct  9 08:06:43 hal kernel: [408578.629143] Code: 80 00 00 00 00 31 f6 48 89 df e8 e6 58 fb ff eb d7 85 c9 0f 84 44 ff ff ff 8d 51 01 89 c8 f0 0f b1 16 39 c1 90 74 b5 89 c1 eb e6 <0f> 0b 0f 1f 84 00 00 00 00 00 55 48 89 e5 48 83 ec 20 48 85 ff
> Oct  9 08:06:43 hal kernel: [408578.629143] RIP  [<ffffffff81127b76>] migration_entry_wait+0x156/0x160
> Oct  9 08:06:43 hal kernel: [408578.629143]  RSP <ffff88021cee7d18>
> Oct  9 08:06:43 hal kernel: [408578.642823] ---[ end trace 0a37362301163711 ]---
> Oct  9 08:07:10 hal kernel: [408605.283257] BUG: soft lockup - CPU#12 stuck for 23s! [par:29801]
> Oct  9 08:07:10 hal kernel: [408605.285807] CPU 12
> Oct  9 08:07:10 hal kernel: [408605.285807]
> Oct  9 08:07:10 hal kernel: [408605.285807] Pid: 29801, comm: par Tainted: G      D     3.0.4 #5 Supermicro H8DGU/H8DGU
> Oct  9 08:07:10 hal kernel: [408605.285807] RIP: 0010:[<ffffffff814216a4>]  [<ffffffff814216a4>] _raw_spin_lock+0x14/0x20
> Oct  9 08:07:10 hal kernel: [408605.285807] RSP: 0018:ffff880c02def808  EFLAGS: 00000293
> Oct  9 08:07:10 hal kernel: [408605.285807] RAX: 0000000000000b09 RBX: ffffea002741f6b8 RCX: ffff880000000000
> Oct  9 08:07:10 hal kernel: [408605.285807] RDX: ffffea0000000000 RSI: 000000002a09dd40 RDI: ffffea002a09dd50
> Oct  9 08:07:10 hal kernel: [408605.285807] RBP: ffff880c02def808 R08: 0000000000000000 R09: ffff880f2e4f4d70
> Oct  9 08:07:10 hal kernel: [408605.285807] R10: ffff880c1fffbe00 R11: 0000000000000050 R12: ffffffff8142988e
> Oct  9 08:07:10 hal kernel: [408605.285807] R13: ffff880c02def7b8 R14: ffffffff810e63dc R15: ffff880c02def7b8
> Oct  9 08:07:10 hal kernel: [408605.285807] FS:  00007fe6b677c720(0000) GS:ffff880c1fc00000(0063) knlGS:00000000f6e0d910
> Oct  9 08:07:10 hal kernel: [408605.285807] CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b
> Oct  9 08:07:10 hal kernel: [408605.285807] CR2: 00000000dd40012c CR3: 00000009f6b78000 CR4: 00000000000006e0
> Oct  9 08:07:10 hal kernel: [408605.285807] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Oct  9 08:07:10 hal kernel: [408605.285807] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Oct  9 08:07:10 hal kernel: [408605.285807] Process par (pid: 29801, threadinfo ffff880c02dee000, task ffff880c07c90700)
> Oct  9 08:07:10 hal kernel: [408605.285807] Stack:
> Oct  9 08:07:10 hal kernel: [408605.285807]  ffff880c02def858 ffffffff8110a2d7 ffffea001fc16fe0 ffff880c02d183a8
> Oct  9 08:07:10 hal kernel: [408605.285807]  ffff880c02def8c8 ffffea001fc16fa8 ffff880c00968c98 ffff881006eb0f00
> Oct  9 08:07:10 hal kernel: [408605.285807]  0000000000000301 00000000d8675000 ffff880c02def8c8 ffffffff8110aa6a
> Oct  9 08:07:10 hal kernel: [408605.285807] Call Trace:
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110a2d7>] __page_check_address+0x107/0x1a0
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110aa6a>] try_to_unmap_one+0x3a/0x420
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110be44>] try_to_unmap_anon+0xb4/0x130
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110bf75>] try_to_unmap+0x65/0x80
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff811285d0>] migrate_pages+0x310/0x4c0
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810e93c2>] ? ____pagevec_lru_add+0x12/0x20
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8111cbf0>] ? ftrace_define_fields_mm_compaction_isolate_template+0x70/0x70
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8111d5da>] compact_zone+0x52a/0x8c0
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810f9919>] ? zone_statistics+0x99/0xc0
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8111dade>] compact_zone_order+0x7e/0xb0
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810e46a8>] ? get_page_from_freelist+0x3b8/0x7e0
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8111dbcd>] try_to_compact_pages+0xbd/0xf0
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810e5148>] __alloc_pages_direct_compact+0xa8/0x180
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810e56c5>] __alloc_pages_nodemask+0x4a5/0x7f0
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810e9698>] ? lru_cache_add_lru+0x28/0x50
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110a92d>] ? page_add_new_anon_rmap+0x9d/0xb0
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8111b865>] alloc_pages_vma+0x95/0x180
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8112c2f8>] do_huge_pmd_anonymous_page+0x138/0x310
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff81102ace>] handle_mm_fault+0x21e/0x310
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff81001716>] ? __switch_to+0x1e6/0x2c0
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff81424bed>] do_page_fault+0x11d/0x4e0
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8141f178>] ? schedule+0x308/0xa10
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff811077a7>] ? do_mmap_pgoff+0x357/0x370
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110790d>] ? sys_mmap_pgoff+0x14d/0x220
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff81421d5f>] page_fault+0x1f/0x30
> Oct  9 08:07:10 hal kernel: [408605.285807] Code: 0f b6 c2 85 c0 0f 95 c0 0f b6 c0 5d c3 66 2e 0f 1f 84 00 00 00 00 00 55 b8 00 01 00 00 48 89 e5 f0 66 0f c1 07 38 e0 74 06 f3 90 <8a> 07 eb f6 5d c3 66 0f 1f 44 00 00 55 48 89 e5 9c 58 fa ba 00

I guess this is the only time you've seen this?  In which case, ideally
I would try to devise a testcase to demonstrate the issue below instead;
but that may involve more ingenuity than I can find time for; let's
see if people approve of this patch anyway (it applies to 3.1 or 3.0,
and earlier releases except that i_mmap_mutex used to be i_mmap_lock).


[PATCH] mm: add anon_vma locking to mremap move

I don't usually pay much attention to the stale "? " addresses in
stack backtraces, but this lucky report from Pawel Sikora hints that
mremap's move_ptes() has inadequate locking against page migration.

 3.0 BUG_ON(!PageLocked(p)) in migration_entry_to_page():
 kernel BUG at include/linux/swapops.h:105!
 RIP: 0010:[<ffffffff81127b76>]  [<ffffffff81127b76>]
                       migration_entry_wait+0x156/0x160
  [<ffffffff811016a1>] handle_pte_fault+0xae1/0xaf0
  [<ffffffff810feee2>] ? __pte_alloc+0x42/0x120
  [<ffffffff8112c26b>] ? do_huge_pmd_anonymous_page+0xab/0x310
  [<ffffffff81102a31>] handle_mm_fault+0x181/0x310
  [<ffffffff81106097>] ? vma_adjust+0x537/0x570
  [<ffffffff81424bed>] do_page_fault+0x11d/0x4e0
  [<ffffffff81109a05>] ? do_mremap+0x2d5/0x570
  [<ffffffff81421d5f>] page_fault+0x1f/0x30

mremap's down_write of mmap_sem, together with i_mmap_mutex/lock,
and pagetable locks, were good enough before page migration (with its
requirement that every migration entry be found) came in; and enough
while migration always held mmap_sem.  But not enough nowadays, when
there's memory hotremove and compaction: anon_vma lock is also needed,
to make sure a migration entry is not dodging around behind our back.

It appears that Mel's a8bef8ff6ea1 "mm: migration: avoid race between
shift_arg_pages() and rmap_walk() during migration by not migrating
temporary stacks" was actually a workaround for this in the special
common case of exec's use of move_pagetables(); and we should probably
now remove that VM_STACK_INCOMPLETE_SETUP stuff as a separate cleanup.

Reported-by: Pawel Sikora <pluto@agmk.net>
Cc: stable@kernel.org
Signed-off-by: Hugh Dickins <hughd@google.com>
---

 mm/mremap.c |    5 +++++
 1 file changed, 5 insertions(+)

--- 3.1-rc9/mm/mremap.c	2011-07-21 19:17:23.000000000 -0700
+++ linux/mm/mremap.c	2011-10-13 14:36:25.097780974 -0700
@@ -77,6 +77,7 @@ static void move_ptes(struct vm_area_str
 		unsigned long new_addr)
 {
 	struct address_space *mapping = NULL;
+	struct anon_vma *anon_vma = vma->anon_vma;
 	struct mm_struct *mm = vma->vm_mm;
 	pte_t *old_pte, *new_pte, pte;
 	spinlock_t *old_ptl, *new_ptl;
@@ -95,6 +96,8 @@ static void move_ptes(struct vm_area_str
 		mapping = vma->vm_file->f_mapping;
 		mutex_lock(&mapping->i_mmap_mutex);
 	}
+	if (anon_vma)
+		anon_vma_lock(anon_vma);
 
 	/*
 	 * We don't have to worry about the ordering of src and dst
@@ -121,6 +124,8 @@ static void move_ptes(struct vm_area_str
 		spin_unlock(new_ptl);
 	pte_unmap(new_pte - 1);
 	pte_unmap_unlock(old_pte - 1, old_ptl);
+	if (anon_vma)
+		anon_vma_unlock(anon_vma);
 	if (mapping)
 		mutex_unlock(&mapping->i_mmap_mutex);
 	mmu_notifier_invalidate_range_end(vma->vm_mm, old_start, old_end);

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-10-13 23:16 ` Hugh Dickins
@ 2011-10-13 23:30   ` Hugh Dickins
  2011-10-16 16:11     ` Christoph Hellwig
  2011-10-16 23:54     ` Andrea Arcangeli
  2011-10-16 22:37   ` Linus Torvalds
                     ` (2 subsequent siblings)
  3 siblings, 2 replies; 43+ messages in thread
From: Hugh Dickins @ 2011-10-13 23:30 UTC (permalink / raw)
  To: Pawel Sikora
  Cc: Andrew Morton, Mel Gorman, Andrea Arcangeli, linux-mm, jpiszcz,
	arekm, linux-kernel

[ Subject refers to a different, unexplained 3.0 bug from Pawel ]
[ Resend with correct address for linux-mm@kvack.org ]

On Wed, 12 Oct 2011, Pawel Sikora wrote:

> Hi Hugh,
> I'm resending the previous private email with a larger cc list, as you've requested.

Thanks, yes, on this one I think I do have an answer;
and we ought to bring Mel and Andrea in too.

> 
> Last weekend my server died again (processes stuck for 22/23s!), but this time I have more logs for you.
>  
> on my dual-opteron machines i have non-standard settings:
> - DISABLED swap space (fast user process killing is better for my autotest farm than long disk swapping
>                        and 64GB ecc-ram is enough for my processing).
> - vm.overcommit_memory = 2,
> - vm.overcommit_ratio = 100.
> 
> after initial BUG_ON (pasted below) there's a flood of 'rcu_sched_state detected stalls / CPU#X stuck for 22s!'

Yes, those are just a tiresome consequence of exiting from a BUG
while holding the page table lock(s).

> (full compressed log is available at: http://pluto.agmk.net/kernel/kernel.bz2)
> 
> Oct  9 08:06:43 hal kernel: [408578.629070] ------------[ cut here ]------------
> Oct  9 08:06:43 hal kernel: [408578.629143] kernel BUG at include/linux/swapops.h:105!
> Oct  9 08:06:43 hal kernel: [408578.629143] invalid opcode: 0000 [#1] SMP
> Oct  9 08:06:43 hal kernel: [408578.629143] CPU 14
[ I'm deleting that irrelevant long line list of modules ]
> Oct  9 08:06:43 hal kernel: [408578.629143]
> Oct  9 08:06:43 hal kernel: [408578.629143] Pid: 29214, comm: bitgen Not tainted 3.0.4 #5 Supermicro H8DGU/H8DGU
> Oct  9 08:06:43 hal kernel: [408578.629143] RIP: 0010:[<ffffffff81127b76>]  [<ffffffff81127b76>] migration_entry_wait+0x156/0x160
> Oct  9 08:06:43 hal kernel: [408578.629143] RSP: 0000:ffff88021cee7d18  EFLAGS: 00010246
> Oct  9 08:06:43 hal kernel: [408578.629143] RAX: 0a00000000080068 RBX: ffffea001d1dbe70 RCX: ffff880c02d18978
> Oct  9 08:06:43 hal kernel: [408578.629143] RDX: 0000000000851a42 RSI: ffff880d8fe33618 RDI: ffffea002a09dd50
> Oct  9 08:06:43 hal kernel: [408578.629143] RBP: ffff88021cee7d38 R08: ffff880d8fe33618 R09: 0000000000000028
> Oct  9 08:06:43 hal kernel: [408578.629143] R10: ffff881006eb0f00 R11: f800000000851a42 R12: ffffea002a09dd40
> Oct  9 08:06:43 hal kernel: [408578.629143] R13: 0000000c02d18978 R14: 00000000d872f000 R15: ffff880d8fe33618
> Oct  9 08:06:43 hal kernel: [408578.629143] FS:  00007f864d7fa700(0000) GS:ffff880c1fc80000(0063) knlGS:00000000f432a910
> Oct  9 08:06:43 hal kernel: [408578.629143] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
> Oct  9 08:06:43 hal kernel: [408578.629143] CR2: 00000000d872f000 CR3: 000000100668c000 CR4: 00000000000006e0
> Oct  9 08:06:43 hal kernel: [408578.629143] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Oct  9 08:06:43 hal kernel: [408578.629143] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Oct  9 08:06:43 hal kernel: [408578.629143] Process bitgen (pid: 29214, threadinfo ffff88021cee6000, task ffff880407b06900)
> Oct  9 08:06:43 hal kernel: [408578.629143] Stack:
> Oct  9 08:06:43 hal kernel: [408578.629143]  00000000dcef4000 ffff880c00968c98 ffff880c02d18978 000000010a34843e
> Oct  9 08:06:43 hal kernel: [408578.629143]  ffff88021cee7de8 ffffffff811016a1 ffff88021cee7d78 ffff881006eb0f00
> Oct  9 08:06:43 hal kernel: [408578.629143]  ffff88021cee7d98 ffffffff810feee2 ffff880c06f0d170 8000000b98d14067
> Oct  9 08:06:43 hal kernel: [408578.629143] Call Trace:
> Oct  9 08:06:43 hal kernel: [408578.629143]  [<ffffffff811016a1>] handle_pte_fault+0xae1/0xaf0
> Oct  9 08:06:43 hal kernel: [408578.629143]  [<ffffffff810feee2>] ? __pte_alloc+0x42/0x120
> Oct  9 08:06:43 hal kernel: [408578.629143]  [<ffffffff8112c26b>] ? do_huge_pmd_anonymous_page+0xab/0x310
> Oct  9 08:06:43 hal kernel: [408578.629143]  [<ffffffff81102a31>] handle_mm_fault+0x181/0x310
> Oct  9 08:06:43 hal kernel: [408578.629143]  [<ffffffff81106097>] ? vma_adjust+0x537/0x570
> Oct  9 08:06:43 hal kernel: [408578.629143]  [<ffffffff81424bed>] do_page_fault+0x11d/0x4e0
> Oct  9 08:06:43 hal kernel: [408578.629143]  [<ffffffff81109a05>] ? do_mremap+0x2d5/0x570
> Oct  9 08:06:43 hal kernel: [408578.629143]  [<ffffffff81421d5f>] page_fault+0x1f/0x30
> Oct  9 08:06:43 hal kernel: [408578.629143] Code: 80 00 00 00 00 31 f6 48 89 df e8 e6 58 fb ff eb d7 85 c9 0f 84 44 ff ff ff 8d 51 01 89 c8 f0 0f b1 16 39 c1 90 74 b5 89 c1 eb e6 <0f> 0b 0f 1f 84 00 00 00 00 00 55 48 89 e5 48 83 ec 20 48 85 ff
> Oct  9 08:06:43 hal kernel: [408578.629143] RIP  [<ffffffff81127b76>] migration_entry_wait+0x156/0x160
> Oct  9 08:06:43 hal kernel: [408578.629143]  RSP <ffff88021cee7d18>
> Oct  9 08:06:43 hal kernel: [408578.642823] ---[ end trace 0a37362301163711 ]---
> Oct  9 08:07:10 hal kernel: [408605.283257] BUG: soft lockup - CPU#12 stuck for 23s! [par:29801]
> Oct  9 08:07:10 hal kernel: [408605.285807] CPU 12
> Oct  9 08:07:10 hal kernel: [408605.285807]
> Oct  9 08:07:10 hal kernel: [408605.285807] Pid: 29801, comm: par Tainted: G      D     3.0.4 #5 Supermicro H8DGU/H8DGU
> Oct  9 08:07:10 hal kernel: [408605.285807] RIP: 0010:[<ffffffff814216a4>]  [<ffffffff814216a4>] _raw_spin_lock+0x14/0x20
> Oct  9 08:07:10 hal kernel: [408605.285807] RSP: 0018:ffff880c02def808  EFLAGS: 00000293
> Oct  9 08:07:10 hal kernel: [408605.285807] RAX: 0000000000000b09 RBX: ffffea002741f6b8 RCX: ffff880000000000
> Oct  9 08:07:10 hal kernel: [408605.285807] RDX: ffffea0000000000 RSI: 000000002a09dd40 RDI: ffffea002a09dd50
> Oct  9 08:07:10 hal kernel: [408605.285807] RBP: ffff880c02def808 R08: 0000000000000000 R09: ffff880f2e4f4d70
> Oct  9 08:07:10 hal kernel: [408605.285807] R10: ffff880c1fffbe00 R11: 0000000000000050 R12: ffffffff8142988e
> Oct  9 08:07:10 hal kernel: [408605.285807] R13: ffff880c02def7b8 R14: ffffffff810e63dc R15: ffff880c02def7b8
> Oct  9 08:07:10 hal kernel: [408605.285807] FS:  00007fe6b677c720(0000) GS:ffff880c1fc00000(0063) knlGS:00000000f6e0d910
> Oct  9 08:07:10 hal kernel: [408605.285807] CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b
> Oct  9 08:07:10 hal kernel: [408605.285807] CR2: 00000000dd40012c CR3: 00000009f6b78000 CR4: 00000000000006e0
> Oct  9 08:07:10 hal kernel: [408605.285807] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Oct  9 08:07:10 hal kernel: [408605.285807] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Oct  9 08:07:10 hal kernel: [408605.285807] Process par (pid: 29801, threadinfo ffff880c02dee000, task ffff880c07c90700)
> Oct  9 08:07:10 hal kernel: [408605.285807] Stack:
> Oct  9 08:07:10 hal kernel: [408605.285807]  ffff880c02def858 ffffffff8110a2d7 ffffea001fc16fe0 ffff880c02d183a8
> Oct  9 08:07:10 hal kernel: [408605.285807]  ffff880c02def8c8 ffffea001fc16fa8 ffff880c00968c98 ffff881006eb0f00
> Oct  9 08:07:10 hal kernel: [408605.285807]  0000000000000301 00000000d8675000 ffff880c02def8c8 ffffffff8110aa6a
> Oct  9 08:07:10 hal kernel: [408605.285807] Call Trace:
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110a2d7>] __page_check_address+0x107/0x1a0
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110aa6a>] try_to_unmap_one+0x3a/0x420
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110be44>] try_to_unmap_anon+0xb4/0x130
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110bf75>] try_to_unmap+0x65/0x80
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff811285d0>] migrate_pages+0x310/0x4c0
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810e93c2>] ? ____pagevec_lru_add+0x12/0x20
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8111cbf0>] ? ftrace_define_fields_mm_compaction_isolate_template+0x70/0x70
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8111d5da>] compact_zone+0x52a/0x8c0
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810f9919>] ? zone_statistics+0x99/0xc0
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8111dade>] compact_zone_order+0x7e/0xb0
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810e46a8>] ? get_page_from_freelist+0x3b8/0x7e0
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8111dbcd>] try_to_compact_pages+0xbd/0xf0
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810e5148>] __alloc_pages_direct_compact+0xa8/0x180
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810e56c5>] __alloc_pages_nodemask+0x4a5/0x7f0
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810e9698>] ? lru_cache_add_lru+0x28/0x50
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110a92d>] ? page_add_new_anon_rmap+0x9d/0xb0
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8111b865>] alloc_pages_vma+0x95/0x180
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8112c2f8>] do_huge_pmd_anonymous_page+0x138/0x310
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff81102ace>] handle_mm_fault+0x21e/0x310
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff81001716>] ? __switch_to+0x1e6/0x2c0
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff81424bed>] do_page_fault+0x11d/0x4e0
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8141f178>] ? schedule+0x308/0xa10
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff811077a7>] ? do_mmap_pgoff+0x357/0x370
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110790d>] ? sys_mmap_pgoff+0x14d/0x220
> Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff81421d5f>] page_fault+0x1f/0x30
> Oct  9 08:07:10 hal kernel: [408605.285807] Code: 0f b6 c2 85 c0 0f 95 c0 0f b6 c0 5d c3 66 2e 0f 1f 84 00 00 00 00 00 55 b8 00 01 00 00 48 89 e5 f0 66 0f c1 07 38 e0 74 06 f3 90 <8a> 07 eb f6 5d c3 66 0f 1f 44 00 00 55 48 89 e5 9c 58 fa ba 00

I guess this is the only time you've seen this?  In which case, ideally
I would try to devise a testcase to demonstrate the issue below instead;
but that may involve more ingenuity than I can find time for; let's
see if people approve of this patch anyway (it applies to 3.1 or 3.0,
and earlier releases except that i_mmap_mutex used to be i_mmap_lock).


[PATCH] mm: add anon_vma locking to mremap move

I don't usually pay much attention to the stale "? " addresses in
stack backtraces, but this lucky report from Pawel Sikora hints that
mremap's move_ptes() has inadequate locking against page migration.

 3.0 BUG_ON(!PageLocked(p)) in migration_entry_to_page():
 kernel BUG at include/linux/swapops.h:105!
 RIP: 0010:[<ffffffff81127b76>]  [<ffffffff81127b76>]
                       migration_entry_wait+0x156/0x160
  [<ffffffff811016a1>] handle_pte_fault+0xae1/0xaf0
  [<ffffffff810feee2>] ? __pte_alloc+0x42/0x120
  [<ffffffff8112c26b>] ? do_huge_pmd_anonymous_page+0xab/0x310
  [<ffffffff81102a31>] handle_mm_fault+0x181/0x310
  [<ffffffff81106097>] ? vma_adjust+0x537/0x570
  [<ffffffff81424bed>] do_page_fault+0x11d/0x4e0
  [<ffffffff81109a05>] ? do_mremap+0x2d5/0x570
  [<ffffffff81421d5f>] page_fault+0x1f/0x30

mremap's down_write of mmap_sem, together with i_mmap_mutex/lock,
and pagetable locks, were good enough before page migration (with its
requirement that every migration entry be found) came in; and enough
while migration always held mmap_sem.  But not enough nowadays, when
there's memory hotremove and compaction: anon_vma lock is also needed,
to make sure a migration entry is not dodging around behind our back.

It appears that Mel's a8bef8ff6ea1 "mm: migration: avoid race between
shift_arg_pages() and rmap_walk() during migration by not migrating
temporary stacks" was actually a workaround for this in the special
common case of exec's use of move_pagetables(); and we should probably
now remove that VM_STACK_INCOMPLETE_SETUP stuff as a separate cleanup.

Reported-by: Pawel Sikora <pluto@agmk.net>
Cc: stable@kernel.org
Signed-off-by: Hugh Dickins <hughd@google.com>
---

 mm/mremap.c |    5 +++++
 1 file changed, 5 insertions(+)

--- 3.1-rc9/mm/mremap.c	2011-07-21 19:17:23.000000000 -0700
+++ linux/mm/mremap.c	2011-10-13 14:36:25.097780974 -0700
@@ -77,6 +77,7 @@ static void move_ptes(struct vm_area_str
 		unsigned long new_addr)
 {
 	struct address_space *mapping = NULL;
+	struct anon_vma *anon_vma = vma->anon_vma;
 	struct mm_struct *mm = vma->vm_mm;
 	pte_t *old_pte, *new_pte, pte;
 	spinlock_t *old_ptl, *new_ptl;
@@ -95,6 +96,8 @@ static void move_ptes(struct vm_area_str
 		mapping = vma->vm_file->f_mapping;
 		mutex_lock(&mapping->i_mmap_mutex);
 	}
+	if (anon_vma)
+		anon_vma_lock(anon_vma);
 
 	/*
 	 * We don't have to worry about the ordering of src and dst
@@ -121,6 +124,8 @@ static void move_ptes(struct vm_area_str
 		spin_unlock(new_ptl);
 	pte_unmap(new_pte - 1);
 	pte_unmap_unlock(old_pte - 1, old_ptl);
+	if (anon_vma)
+		anon_vma_unlock(anon_vma);
 	if (mapping)
 		mutex_unlock(&mapping->i_mmap_mutex);
 	mmu_notifier_invalidate_range_end(vma->vm_mm, old_start, old_end);

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-10-13 23:30   ` Hugh Dickins
@ 2011-10-16 16:11     ` Christoph Hellwig
  2011-10-16 23:54     ` Andrea Arcangeli
  1 sibling, 0 replies; 43+ messages in thread
From: Christoph Hellwig @ 2011-10-16 16:11 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Pawel Sikora, Andrew Morton, Mel Gorman, Andrea Arcangeli,
	linux-mm, jpiszcz, arekm, linux-kernel, Anders Ossowicki

Btw, 

Anders Ossowicki reported a very similar soft lockup on 2.6.38 recently,
although without a preceding BUG_ON.

Here is the pointer: https://lkml.org/lkml/2011/10/11/87

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-10-13 23:16 ` Hugh Dickins
  2011-10-13 23:30   ` Hugh Dickins
@ 2011-10-16 22:37   ` Linus Torvalds
  2011-10-17  3:02     ` Hugh Dickins
  2011-10-18 19:17   ` Paweł Sikora
  2011-10-19  7:30   ` Mel Gorman
  3 siblings, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2011-10-16 22:37 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Pawel Sikora, Andrew Morton, Mel Gorman, Andrea Arcangeli,
	linux-mm, jpiszcz, arekm, linux-kernel

What's the status of this thing? Is it stable/3.1 material? Do we have
ack/nak's for it? Anybody?

                               Linus

On Thu, Oct 13, 2011 at 4:16 PM, Hugh Dickins <hughd@google.com> wrote:
>
> [PATCH] mm: add anon_vma locking to mremap move
>
> I don't usually pay much attention to the stale "? " addresses in
> stack backtraces, but this lucky report from Pawel Sikora hints that
> mremap's move_ptes() has inadequate locking against page migration.
>
>  3.0 BUG_ON(!PageLocked(p)) in migration_entry_to_page():
>  kernel BUG at include/linux/swapops.h:105!
>  RIP: 0010:[<ffffffff81127b76>]  [<ffffffff81127b76>]
>                       migration_entry_wait+0x156/0x160
>  [<ffffffff811016a1>] handle_pte_fault+0xae1/0xaf0
>  [<ffffffff810feee2>] ? __pte_alloc+0x42/0x120
>  [<ffffffff8112c26b>] ? do_huge_pmd_anonymous_page+0xab/0x310
>  [<ffffffff81102a31>] handle_mm_fault+0x181/0x310
>  [<ffffffff81106097>] ? vma_adjust+0x537/0x570
>  [<ffffffff81424bed>] do_page_fault+0x11d/0x4e0
>  [<ffffffff81109a05>] ? do_mremap+0x2d5/0x570
>  [<ffffffff81421d5f>] page_fault+0x1f/0x30
>
> mremap's down_write of mmap_sem, together with i_mmap_mutex/lock,
> and pagetable locks, were good enough before page migration (with its
> requirement that every migration entry be found) came in; and enough
> while migration always held mmap_sem.  But not enough nowadays, when
> there's memory hotremove and compaction: anon_vma lock is also needed,
> to make sure a migration entry is not dodging around behind our back.
>
> It appears that Mel's a8bef8ff6ea1 "mm: migration: avoid race between
> shift_arg_pages() and rmap_walk() during migration by not migrating
> temporary stacks" was actually a workaround for this in the special
> common case of exec's use of move_pagetables(); and we should probably
> now remove that VM_STACK_INCOMPLETE_SETUP stuff as a separate cleanup.
>
> Reported-by: Pawel Sikora <pluto@agmk.net>
> Cc: stable@kernel.org
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
>
>  mm/mremap.c |    5 +++++
>  1 file changed, 5 insertions(+)
>
> --- 3.1-rc9/mm/mremap.c 2011-07-21 19:17:23.000000000 -0700
> +++ linux/mm/mremap.c   2011-10-13 14:36:25.097780974 -0700
> @@ -77,6 +77,7 @@ static void move_ptes(struct vm_area_str
>                unsigned long new_addr)
>  {
>        struct address_space *mapping = NULL;
> +       struct anon_vma *anon_vma = vma->anon_vma;
>        struct mm_struct *mm = vma->vm_mm;
>        pte_t *old_pte, *new_pte, pte;
>        spinlock_t *old_ptl, *new_ptl;
> @@ -95,6 +96,8 @@ static void move_ptes(struct vm_area_str
>                mapping = vma->vm_file->f_mapping;
>                mutex_lock(&mapping->i_mmap_mutex);
>        }
> +       if (anon_vma)
> +               anon_vma_lock(anon_vma);
>
>        /*
>         * We don't have to worry about the ordering of src and dst
> @@ -121,6 +124,8 @@ static void move_ptes(struct vm_area_str
>                spin_unlock(new_ptl);
>        pte_unmap(new_pte - 1);
>        pte_unmap_unlock(old_pte - 1, old_ptl);
> +       if (anon_vma)
> +               anon_vma_unlock(anon_vma);
>        if (mapping)
>                mutex_unlock(&mapping->i_mmap_mutex);
>        mmu_notifier_invalidate_range_end(vma->vm_mm, old_start, old_end);

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-10-13 23:30   ` Hugh Dickins
  2011-10-16 16:11     ` Christoph Hellwig
@ 2011-10-16 23:54     ` Andrea Arcangeli
  2011-10-17 18:51       ` Hugh Dickins
  2011-10-20  9:11       ` Nai Xia
  1 sibling, 2 replies; 43+ messages in thread
From: Andrea Arcangeli @ 2011-10-16 23:54 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Pawel Sikora, Andrew Morton, Mel Gorman, linux-mm, jpiszcz, arekm,
	linux-kernel

On Thu, Oct 13, 2011 at 04:30:09PM -0700, Hugh Dickins wrote:
> mremap's down_write of mmap_sem, together with i_mmap_mutex/lock,
> and pagetable locks, were good enough before page migration (with its
> requirement that every migration entry be found) came in; and enough
> while migration always held mmap_sem.  But not enough nowadays, when
> there's memory hotremove and compaction: anon_vma lock is also needed,
> to make sure a migration entry is not dodging around behind our back.

For things like migrate and split_huge_page, the anon_vma layer must
guarantee that the page is reachable by the rmap walk at all times,
regardless of whether it's at the old or new address.

This should be guaranteed by copy_vma, called by move_vma well
before move_page_tables/move_ptes can run.

copy_vma obviously takes the anon_vma lock to insert the new "dst" vma
into the anon_vma chain structures (vma_link does that), and that
happens before any pte can be moved.

Because we keep two vmas mapped over both the src and dst ranges, each
with a vma->vm_pgoff that is valid for the page (the page doesn't
change its page->index), rmap should always be able to find _all_ of
the page's ptes at any given time.

There may be other variables at play, such as whether the order of
insertion in the anon_vma chain matches our direction of copy and
removal of the old pte. But I think the double locking of the PT lock
makes the order in the anon_vma chain absolutely irrelevant (the
rmap_walk obviously takes the PT lock too), and furthermore the
anon_vma_chain insertion order is likely favorable anyway (the dst vma
is inserted last and checked last). But it shouldn't matter.

Another thing could be the copy_vma vma_merge branch succeeding
(returning non-NULL), but I doubt we risk falling into that one. For
the rmap_walk to keep working on both the src and dst ranges, the two
vmas' vm_pgoff values must be different, so we can't possibly be ok if
there's just 1 vma covering the whole range. I rule that case out
because the pgoff passed to copy_vma is different from the vma->vm_pgoff
of the vma given to copy_vma, so vma_merge can't possibly succeed.

Yet another point to investigate is where we tear down the old vma and
leave the new vma generated by copy_vma established. That's apparently
taken care of by do_munmap in move_vma, so that should be safe too, as
munmap is safe in the first place.

Overall I don't think this patch is needed; it seems a no-op.

> It appears that Mel's a8bef8ff6ea1 "mm: migration: avoid race between
> shift_arg_pages() and rmap_walk() during migration by not migrating
> temporary stacks" was actually a workaround for this in the special
> common case of exec's use of move_pagetables(); and we should probably
> now remove that VM_STACK_INCOMPLETE_SETUP stuff as a separate cleanup.

I don't think this patch can help with that. The problem of execve vs
rmap_walk is that there's a single vma covering both the src and dst
virtual ranges while execve runs move_page_tables. So there is no
possible way that rmap_walk can be guaranteed to find _all_ ptes
mapping a page if there's just one vma mapping either the src or dst
range while move_page_tables runs. No addition of locking whatsoever
can fix that bug, because we're missing a vma (well, modulo locking
that prevents rmap_walk from running at all until we're finished with
execve, which is more or less what VM_STACK_INCOMPLETE_SETUP does...).

The only way to fix this is to prevent migrate (or any other rmap_walk
user that requires 100% reliability from the rmap layer; for example,
swap doesn't require 100% reliability and can still run and gracefully
fail at finding the pte) while we're moving pagetables in execve. And
that's what Mel's above-mentioned patch does.

The other way to fix that bug, which I implemented, was to do copy_vma
in execve, so that we still have both the src and dst ranges of
move_page_tables covered by 2 (not 1) vmas, each with the proper
vma->vm_pgoff. So my approach fixed that bug as well, but it requires
a vma allocation in execve, so it was dropped in favor of Mel's patch,
which is totally fine with me as both approaches fix the bug equally
well. (Even if we now have to deal with the special case of rmap_walk
sometimes having false negatives while the vma flag is set; the
important thing is that after VM_STACK_INCOMPLETE_SETUP has been
cleared it won't ever be set again for the whole lifetime of the vma.)

I may be missing something; I've only done a short review so far, just
so the patch doesn't get merged if it isn't needed. I mean, I think it
needs a bit more eyes on it... The fact that the i_mmap_mutex was
taken but the anon_vma lock was not (while in every other place both
are needed) certainly makes the patch look correct, but that's just a
misleading coincidence, I think.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-10-16 22:37   ` Linus Torvalds
@ 2011-10-17  3:02     ` Hugh Dickins
  2011-10-17  3:09       ` Linus Torvalds
  0 siblings, 1 reply; 43+ messages in thread
From: Hugh Dickins @ 2011-10-17  3:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Pawel Sikora, Andrew Morton, Mel Gorman, Andrea Arcangeli,
	linux-mm, jpiszcz, arekm, linux-kernel

I've not read through and digested Andrea's reply yet, but I'd say
this is not something we need to rush into 3.1 at the last moment,
before it's been fully considered: the bug here is hard to hit,
ancient, made more likely in 2.6.35 by compaction and in 2.6.38 by
THP's reliance on compaction, but not a regression in 3.1 at all - let
it wait until stable.

Hugh

On Sun, Oct 16, 2011 at 3:37 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> What's the status of this thing? Is it stable/3.1 material? Do we have
> ack/nak's for it? Anybody?
>
>                               Linus
>
> On Thu, Oct 13, 2011 at 4:16 PM, Hugh Dickins <hughd@google.com> wrote:
>>
>> [PATCH] mm: add anon_vma locking to mremap move
>>
>> [...]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-10-17  3:02     ` Hugh Dickins
@ 2011-10-17  3:09       ` Linus Torvalds
  0 siblings, 0 replies; 43+ messages in thread
From: Linus Torvalds @ 2011-10-17  3:09 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Pawel Sikora, Andrew Morton, Mel Gorman, Andrea Arcangeli,
	linux-mm, jpiszcz, arekm, linux-kernel

On Sun, Oct 16, 2011 at 8:02 PM, Hugh Dickins <hughd@google.com> wrote:
> I've not read through and digested Andrea's reply yet, but I'd say
> this is not something we need to rush into 3.1 at the last moment,
> before it's been fully considered: the bug here is hard to hit,
> ancient, made more likely in 2.6.35 by compaction and in 2.6.38 by
> THP's reliance on compaction, but not a regression in 3.1 at all - let
> it wait until stable.

Ok, thanks. Just wanted to check.

                     Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-10-16 23:54     ` Andrea Arcangeli
@ 2011-10-17 18:51       ` Hugh Dickins
  2011-10-17 22:05         ` Andrea Arcangeli
  2011-10-19  7:43         ` Mel Gorman
  2011-10-20  9:11       ` Nai Xia
  1 sibling, 2 replies; 43+ messages in thread
From: Hugh Dickins @ 2011-10-17 18:51 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Pawel Sikora, Andrew Morton, Mel Gorman, Linus Torvalds, linux-mm,
	jpiszcz, arekm, linux-kernel

On Mon, 17 Oct 2011, Andrea Arcangeli wrote:
> On Thu, Oct 13, 2011 at 04:30:09PM -0700, Hugh Dickins wrote:
> > mremap's down_write of mmap_sem, together with i_mmap_mutex/lock,
> > and pagetable locks, were good enough before page migration (with its
> > requirement that every migration entry be found) came in; and enough
> > while migration always held mmap_sem.  But not enough nowadays, when
> > there's memory hotremove and compaction: anon_vma lock is also needed,
> > to make sure a migration entry is not dodging around behind our back.
> 
> For things like migrate and split_huge_page, the anon_vma layer must
> guarantee the page is reachable by rmap walk at all times regardless
> if it's at the old or new address.
> 
> This shall be guaranteed by the copy_vma called by move_vma well
> before move_page_tables/move_ptes can run.
> 
> copy_vma obviously takes the anon_vma lock to insert the new "dst" vma
> into the anon_vma chains structures (vma_link does that). That before
> any pte can be moved.
> 
> Because we keep two vmas mapped on both src and dst range, with
> different vma->vm_pgoff that is valid for the page (the page doesn't
> change its page->index) the page should always find _all_ its pte at
> any given time.
> 
> There may be other variables at play like the order of insertion in
> the anon_vma chain matches our direction of copy and removal of the
> old pte. But I think the double locking of the PT lock should make the
> order in the anon_vma chain absolutely irrelevant (the rmap_walk
> obviously takes the PT lock too), and furthermore likely the
> anon_vma_chain insertion is favorable (the dst vma is inserted last
> and checked last). But it shouldn't matter.

Thanks a lot for thinking it over.  I _almost_ agree with you, except
there's one aspect that I forgot to highlight in the patch comment:
remove_migration_pte() behaves as page_check_address() does by default,
it peeks to see if what it wants is there _before_ taking ptlock.

And therefore, I think, it is possible that during mremap move, the swap
pte is in neither of the locations it tries at the instant it peeks there.

We could put a stop to that: see plausible alternative patch below.
Though I have dithered from one to the other and back, I think on the
whole I still prefer the anon_vma locking in move_ptes(): we don't care
too deeply about the speed of mremap, but we do care about the speed of
exec, and this does add another lock/unlock there, but it will always
be uncontended; whereas the patch at the migration end could be adding
a contended and unnecessary lock.

Oh, I don't know which, you vote - if you now agree there is a problem.
I'll sign off the migrate.c one if you prefer it.  But no hurry.

> 
> Another thing could be the copy_vma vma_merge branch succeeding
> (returning not NULL) but I doubt we risk to fall into that one. For
> the rmap_walk to be always working on both the src and dst
> vma->vma_pgoff the pgoff must be different so we can't possibly be ok
> if there's just 1 vma covering the whole range. I exclude this could
> be the case because the pgoff passed to copy_vma is different than the
> vma->vm_pgoff given to copy_vma, so vma_merge can't possibly succeed.
> 
> Yet another point to investigate is the point where we teardown the
> old vma and we leave the new vma generated by copy_vma
> established. That's apparently taken care of by do_munmap in move_vma
> so that shall be safe too as munmap is safe in the first place.
> 
> Overall I don't think this patch is needed and it seems a noop.
> 
> > It appears that Mel's a8bef8ff6ea1 "mm: migration: avoid race between
> > shift_arg_pages() and rmap_walk() during migration by not migrating
> > temporary stacks" was actually a workaround for this in the special
> > common case of exec's use of move_pagetables(); and we should probably
> > now remove that VM_STACK_INCOMPLETE_SETUP stuff as a separate cleanup.
> 
> I don't think this patch can help with that, the problem of execve vs
> rmap_walk is that there's 1 single vma existing for src and dst
> virtual ranges while execve runs move_page_tables. So there is no
> possible way that rmap_walk will be guaranteed to find _all_ ptes
> mapping a page if there's just one vma mapping either the src or dst
> range while move_page_table runs. No addition of locking whatsoever
> can fix that bug because we miss a vma (well modulo locking that
> prevents rmap_walk to run at all, until we're finished with execve,
> which is more or less what VM_STACK_INCOMPLETE_SETUP does...).
> 
> The only way is to fix this is prevent migrate (or any other rmap_walk
> user that requires 100% reliability from the rmap layer, for example
> swap doesn't require 100% reliability and can still run and gracefully
> fail at finding the pte) while we're moving pagetables in execve. And
> that's what Mel's above mentioned patch does.

Thanks for explaining, yes, you're right.

> 
> The other way to fix that bug that I implemented was to do copy_vma in
> execve, so that we still have both src and dst ranges of
> move_page_tables covered by 2 (not 1) vma, each with the proper
> vma->vm_pgoff, so my approach fixed that bug as well (but requires a
> vma allocation in execve so it was dropped in favor of Mel's patch
> which is totally fine with as both approaches fixes the bug equally
> well, even if now we've to deal with this special case of sometime
> rmap_walk having false negatives if the vma_flags is set, and the
> important thing is that after VM_STACK_INCOMPLETE_SETUP has been
> cleared it won't ever be set again for the whole lifetime of the vma).

I think your two-vmas approach is more aesthetically pleasing (and
matches mremap), but can see that Mel's vmaflag hack^Htechnique ends up
more economical.  It is a bit sad that we lose that all-pages-swappable
condition for unlimited args, for a brief moment, but I think no memory
allocations are made in that interval, so I guess it's fine.

Hugh

> 
> I may be missing something, I did a short review so far, just so the
> patch doesn't get merged if not needed. I mean I think it needs a bit
> more looks on it... The fact the i_mmap_mutex was taken but the
> anon_vma lock was not taken (while in every other place they both are
> needed) certainly makes the patch look correct, but that's just a
> misleading coincidence I think.
> 

--- 3.1-rc9/mm/migrate.c	2011-07-21 19:17:23.000000000 -0700
+++ linux/mm/migrate.c	2011-10-17 11:21:48.923826334 -0700
@@ -119,12 +119,6 @@ static int remove_migration_pte(struct p
 			goto out;
 
 		ptep = pte_offset_map(pmd, addr);
-
-		if (!is_swap_pte(*ptep)) {
-			pte_unmap(ptep);
-			goto out;
-		}
-
 		ptl = pte_lockptr(mm, pmd);
 	}
 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-10-17 18:51       ` Hugh Dickins
@ 2011-10-17 22:05         ` Andrea Arcangeli
  2011-10-19  7:43         ` Mel Gorman
  1 sibling, 0 replies; 43+ messages in thread
From: Andrea Arcangeli @ 2011-10-17 22:05 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Pawel Sikora, Andrew Morton, Mel Gorman, Linus Torvalds, linux-mm,
	jpiszcz, arekm, linux-kernel

On Mon, Oct 17, 2011 at 11:51:00AM -0700, Hugh Dickins wrote:
> Thanks a lot for thinking it over.  I _almost_ agree with you, except
> there's one aspect that I forgot to highlight in the patch comment:
> remove_migration_pte() behaves as page_check_address() does by default,
> it peeks to see if what it wants is there _before_ taking ptlock.
> 
> And therefore, I think, it is possible that during mremap move, the swap
> pte is in neither of the locations it tries at the instant it peeks there.

I see what you mean; I didn't realize you were fixing that race.
During mremap, for a few CPU cycles (a window which may expand if
interrupted by an irq), the migration entry lives only in the kernel
stack of the process doing mremap. So the rmap_walk may just loop
quickly and locklessly, not see it, and return, while mremap holds
both PT locks (src and dst pte).

Now, getting an irq at exactly that point of the migrate cycle doesn't
sound too easy to hit, but we still must fix this race.

Maybe whoever needs 100% reliability should not go lockless, looping
all over the vmas without taking the PT lock that provides
serialization against the pte "moving" functions, which normally do,
in order: ptep_clear_flush(src_ptep); set_pte_at(dst_ptep).

For example, I never thought of optimizing __split_huge_page_splitting;
that must be reliable, so I never felt it could be safe to go lockless
there.

So I think it's better to fix migrate, as there may be other places
like mremap. Whoever can't afford failure should do the PT locking.

But maybe it's possible to find good reasons to fix the race in the
other way too.

> We could put a stop to that: see plausible alternative patch below.
> Though I have dithered from one to the other and back, I think on the
> whole I still prefer the anon_vma locking in move_ptes(): we don't care
> too deeply about the speed of mremap, but we do care about the speed of
> exec, and this does add another lock/unlock there, but it will always
> be uncontended; whereas the patch at the migration end could be adding
> a contended and unnecessary lock.
> 
> Oh, I don't know which, you vote - if you now agree there is a problem.
> I'll sign off the migrate.c one if you prefer it.  But no hurry.

Adding more locking in migrate than in the mremap fast path should be
better performance-wise. Java GC uses mremap; migrate is somewhat less
performance critical, but I guess there may be other workloads where
migrate runs more often than mremap. It also depends on the false
positive ratio of rmap_walk: if that's normally low, the patch to
migrate may actually result in an optimization, while the mremap patch
can't possibly speed anything up.

In short, I'm slightly more inclined to prefer the fix to migrate,
enforcing that all rmap walkers that can't afford failure must not go
lockless and speculative on the ptes, but must take the lock before
checking whether the pte they're searching for is there.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-10-13 23:16 ` Hugh Dickins
  2011-10-13 23:30   ` Hugh Dickins
  2011-10-16 22:37   ` Linus Torvalds
@ 2011-10-18 19:17   ` Paweł Sikora
  2011-10-19  7:30   ` Mel Gorman
  3 siblings, 0 replies; 43+ messages in thread
From: Paweł Sikora @ 2011-10-18 19:17 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Mel Gorman, Andrea Arcangeli, linux-mm, jpiszcz,
	arekm, linux-kernel

On Friday 14 of October 2011 01:16:01 Hugh Dickins wrote:
> [ Subject refers to a different, unexplained 3.0 bug from Pawel ]
> 
> On Wed, 12 Oct 2011, Pawel Sikora wrote:
> 
> > Hi Hugh,
> > i'm resending previous private email with larger cc list as you've requested.
> 
> Thanks, yes, on this one I think I do have an answer;
> and we ought to bring Mel and Andrea in too.
> 
> > 
> > in the last weekend my server died again (processes stuck for 22/23s!) but this time i have more logs for you.
> >  
> > on my dual-opteron machines i have non-standard settings:
> > - DISABLED swap space (fast user process killing is better for my autotest farm than long disk swapping
> >                        and 64GB ecc-ram is enough for my processing).
> > - vm.overcommit_memory = 2,
> > - vm.overcommit_ratio = 100.
> > 
> > after initial BUG_ON (pasted below) there's a flood of 'rcu_sched_state detected stalls / CPU#X stuck for 22s!'
> 
> Yes, those are just a tiresome consequence of exiting from a BUG
> while holding the page table lock(s).
> 
> > (full compressed log is available at: http://pluto.agmk.net/kernel/kernel.bz2)
> > 
> > Oct  9 08:06:43 hal kernel: [408578.629070] ------------[ cut here ]------------
> > Oct  9 08:06:43 hal kernel: [408578.629143] kernel BUG at include/linux/swapops.h:105!
> > Oct  9 08:06:43 hal kernel: [408578.629143] invalid opcode: 0000 [#1] SMP
> > Oct  9 08:06:43 hal kernel: [408578.629143] CPU 14
> [ I'm deleting that irrelevant long line list of modules ]
> > Oct  9 08:06:43 hal kernel: [408578.629143]
> > Oct  9 08:06:43 hal kernel: [408578.629143] Pid: 29214, comm: bitgen Not tainted 3.0.4 #5 Supermicro H8DGU/H8DGU
> > Oct  9 08:06:43 hal kernel: [408578.629143] RIP: 0010:[<ffffffff81127b76>]  [<ffffffff81127b76>] migration_entry_wait+0x156/0x160
> > Oct  9 08:06:43 hal kernel: [408578.629143] RSP: 0000:ffff88021cee7d18  EFLAGS: 00010246
> > Oct  9 08:06:43 hal kernel: [408578.629143] RAX: 0a00000000080068 RBX: ffffea001d1dbe70 RCX: ffff880c02d18978
> > Oct  9 08:06:43 hal kernel: [408578.629143] RDX: 0000000000851a42 RSI: ffff880d8fe33618 RDI: ffffea002a09dd50
> > Oct  9 08:06:43 hal kernel: [408578.629143] RBP: ffff88021cee7d38 R08: ffff880d8fe33618 R09: 0000000000000028
> > Oct  9 08:06:43 hal kernel: [408578.629143] R10: ffff881006eb0f00 R11: f800000000851a42 R12: ffffea002a09dd40
> > Oct  9 08:06:43 hal kernel: [408578.629143] R13: 0000000c02d18978 R14: 00000000d872f000 R15: ffff880d8fe33618
> > Oct  9 08:06:43 hal kernel: [408578.629143] FS:  00007f864d7fa700(0000) GS:ffff880c1fc80000(0063) knlGS:00000000f432a910
> > Oct  9 08:06:43 hal kernel: [408578.629143] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
> > Oct  9 08:06:43 hal kernel: [408578.629143] CR2: 00000000d872f000 CR3: 000000100668c000 CR4: 00000000000006e0
> > Oct  9 08:06:43 hal kernel: [408578.629143] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > Oct  9 08:06:43 hal kernel: [408578.629143] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > Oct  9 08:06:43 hal kernel: [408578.629143] Process bitgen (pid: 29214, threadinfo ffff88021cee6000, task ffff880407b06900)
> > Oct  9 08:06:43 hal kernel: [408578.629143] Stack:
> > Oct  9 08:06:43 hal kernel: [408578.629143]  00000000dcef4000 ffff880c00968c98 ffff880c02d18978 000000010a34843e
> > Oct  9 08:06:43 hal kernel: [408578.629143]  ffff88021cee7de8 ffffffff811016a1 ffff88021cee7d78 ffff881006eb0f00
> > Oct  9 08:06:43 hal kernel: [408578.629143]  ffff88021cee7d98 ffffffff810feee2 ffff880c06f0d170 8000000b98d14067
> > Oct  9 08:06:43 hal kernel: [408578.629143] Call Trace:
> > Oct  9 08:06:43 hal kernel: [408578.629143]  [<ffffffff811016a1>] handle_pte_fault+0xae1/0xaf0
> > Oct  9 08:06:43 hal kernel: [408578.629143]  [<ffffffff810feee2>] ? __pte_alloc+0x42/0x120
> > Oct  9 08:06:43 hal kernel: [408578.629143]  [<ffffffff8112c26b>] ? do_huge_pmd_anonymous_page+0xab/0x310
> > Oct  9 08:06:43 hal kernel: [408578.629143]  [<ffffffff81102a31>] handle_mm_fault+0x181/0x310
> > Oct  9 08:06:43 hal kernel: [408578.629143]  [<ffffffff81106097>] ? vma_adjust+0x537/0x570
> > Oct  9 08:06:43 hal kernel: [408578.629143]  [<ffffffff81424bed>] do_page_fault+0x11d/0x4e0
> > Oct  9 08:06:43 hal kernel: [408578.629143]  [<ffffffff81109a05>] ? do_mremap+0x2d5/0x570
> > Oct  9 08:06:43 hal kernel: [408578.629143]  [<ffffffff81421d5f>] page_fault+0x1f/0x30
> > Oct  9 08:06:43 hal kernel: [408578.629143] Code: 80 00 00 00 00 31 f6 48 89 df e8 e6 58 fb ff eb d7 85 c9 0f 84 44 ff ff ff 8d 51 01 89 c8 f0 0f b1 16 39 c1 90 74 b5 89 c1 eb e6 <0f> 0b 0f 1f 84 00 00 00 00 00 55 48 89 e5 48 83 ec 20 48 85 ff
> > Oct  9 08:06:43 hal kernel: [408578.629143] RIP  [<ffffffff81127b76>] migration_entry_wait+0x156/0x160
> > Oct  9 08:06:43 hal kernel: [408578.629143]  RSP <ffff88021cee7d18>
> > Oct  9 08:06:43 hal kernel: [408578.642823] ---[ end trace 0a37362301163711 ]---
> > Oct  9 08:07:10 hal kernel: [408605.283257] BUG: soft lockup - CPU#12 stuck for 23s! [par:29801]
> > Oct  9 08:07:10 hal kernel: [408605.285807] CPU 12
> > Oct  9 08:07:10 hal kernel: [408605.285807]
> > Oct  9 08:07:10 hal kernel: [408605.285807] Pid: 29801, comm: par Tainted: G      D     3.0.4 #5 Supermicro H8DGU/H8DGU
> > Oct  9 08:07:10 hal kernel: [408605.285807] RIP: 0010:[<ffffffff814216a4>]  [<ffffffff814216a4>] _raw_spin_lock+0x14/0x20
> > Oct  9 08:07:10 hal kernel: [408605.285807] RSP: 0018:ffff880c02def808  EFLAGS: 00000293
> > Oct  9 08:07:10 hal kernel: [408605.285807] RAX: 0000000000000b09 RBX: ffffea002741f6b8 RCX: ffff880000000000
> > Oct  9 08:07:10 hal kernel: [408605.285807] RDX: ffffea0000000000 RSI: 000000002a09dd40 RDI: ffffea002a09dd50
> > Oct  9 08:07:10 hal kernel: [408605.285807] RBP: ffff880c02def808 R08: 0000000000000000 R09: ffff880f2e4f4d70
> > Oct  9 08:07:10 hal kernel: [408605.285807] R10: ffff880c1fffbe00 R11: 0000000000000050 R12: ffffffff8142988e
> > Oct  9 08:07:10 hal kernel: [408605.285807] R13: ffff880c02def7b8 R14: ffffffff810e63dc R15: ffff880c02def7b8
> > Oct  9 08:07:10 hal kernel: [408605.285807] FS:  00007fe6b677c720(0000) GS:ffff880c1fc00000(0063) knlGS:00000000f6e0d910
> > Oct  9 08:07:10 hal kernel: [408605.285807] CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b
> > Oct  9 08:07:10 hal kernel: [408605.285807] CR2: 00000000dd40012c CR3: 00000009f6b78000 CR4: 00000000000006e0
> > Oct  9 08:07:10 hal kernel: [408605.285807] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > Oct  9 08:07:10 hal kernel: [408605.285807] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > Oct  9 08:07:10 hal kernel: [408605.285807] Process par (pid: 29801, threadinfo ffff880c02dee000, task ffff880c07c90700)
> > Oct  9 08:07:10 hal kernel: [408605.285807] Stack:
> > Oct  9 08:07:10 hal kernel: [408605.285807]  ffff880c02def858 ffffffff8110a2d7 ffffea001fc16fe0 ffff880c02d183a8
> > Oct  9 08:07:10 hal kernel: [408605.285807]  ffff880c02def8c8 ffffea001fc16fa8 ffff880c00968c98 ffff881006eb0f00
> > Oct  9 08:07:10 hal kernel: [408605.285807]  0000000000000301 00000000d8675000 ffff880c02def8c8 ffffffff8110aa6a
> > Oct  9 08:07:10 hal kernel: [408605.285807] Call Trace:
> > Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110a2d7>] __page_check_address+0x107/0x1a0
> > Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110aa6a>] try_to_unmap_one+0x3a/0x420
> > Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110be44>] try_to_unmap_anon+0xb4/0x130
> > Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110bf75>] try_to_unmap+0x65/0x80
> > Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff811285d0>] migrate_pages+0x310/0x4c0
> > Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810e93c2>] ? ____pagevec_lru_add+0x12/0x20
> > Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8111cbf0>] ? ftrace_define_fields_mm_compaction_isolate_template+0x70/0x70
> > Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8111d5da>] compact_zone+0x52a/0x8c0
> > Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810f9919>] ? zone_statistics+0x99/0xc0
> > Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8111dade>] compact_zone_order+0x7e/0xb0
> > Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810e46a8>] ? get_page_from_freelist+0x3b8/0x7e0
> > Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8111dbcd>] try_to_compact_pages+0xbd/0xf0
> > Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810e5148>] __alloc_pages_direct_compact+0xa8/0x180
> > Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810e56c5>] __alloc_pages_nodemask+0x4a5/0x7f0
> > Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff810e9698>] ? lru_cache_add_lru+0x28/0x50
> > Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110a92d>] ? page_add_new_anon_rmap+0x9d/0xb0
> > Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8111b865>] alloc_pages_vma+0x95/0x180
> > Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8112c2f8>] do_huge_pmd_anonymous_page+0x138/0x310
> > Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff81102ace>] handle_mm_fault+0x21e/0x310
> > Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff81001716>] ? __switch_to+0x1e6/0x2c0
> > Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff81424bed>] do_page_fault+0x11d/0x4e0
> > Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8141f178>] ? schedule+0x308/0xa10
> > Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff811077a7>] ? do_mmap_pgoff+0x357/0x370
> > Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff8110790d>] ? sys_mmap_pgoff+0x14d/0x220
> > Oct  9 08:07:10 hal kernel: [408605.285807]  [<ffffffff81421d5f>] page_fault+0x1f/0x30
> > Oct  9 08:07:10 hal kernel: [408605.285807] Code: 0f b6 c2 85 c0 0f 95 c0 0f b6 c0 5d c3 66 2e 0f 1f 84 00 00 00 00 00 55 b8 00 01 00 00 48 89 e5 f0 66 0f c1 07 38 e0 74 06 f3 90 <8a> 07 eb f6 5d c3 66 0f 1f 44 00 00 55 48 89 e5 9c 58 fa ba 00
> 
> I guess this is the only time you've seen this?  In which case, ideally
> I would try to devise a testcase to demonstrate the issue below instead;
> but that may involve more ingenuity than I can find time for, let's see
> if people approve of this patch anyway (it applies to 3.1 or 3.0,
> and earlier releases except that i_mmap_mutex used to be i_mmap_lock).
> 
> 
> [PATCH] mm: add anon_vma locking to mremap move
> 
> I don't usually pay much attention to the stale "? " addresses in
> stack backtraces, but this lucky report from Pawel Sikora hints that
> mremap's move_ptes() has inadequate locking against page migration.
> 
>  3.0 BUG_ON(!PageLocked(p)) in migration_entry_to_page():
>  kernel BUG at include/linux/swapops.h:105!
>  RIP: 0010:[<ffffffff81127b76>]  [<ffffffff81127b76>]
>                        migration_entry_wait+0x156/0x160
>   [<ffffffff811016a1>] handle_pte_fault+0xae1/0xaf0
>   [<ffffffff810feee2>] ? __pte_alloc+0x42/0x120
>   [<ffffffff8112c26b>] ? do_huge_pmd_anonymous_page+0xab/0x310
>   [<ffffffff81102a31>] handle_mm_fault+0x181/0x310
>   [<ffffffff81106097>] ? vma_adjust+0x537/0x570
>   [<ffffffff81424bed>] do_page_fault+0x11d/0x4e0
>   [<ffffffff81109a05>] ? do_mremap+0x2d5/0x570
>   [<ffffffff81421d5f>] page_fault+0x1f/0x30
> 
> mremap's down_write of mmap_sem, together with i_mmap_mutex/lock,
> and pagetable locks, were good enough before page migration (with its
> requirement that every migration entry be found) came in; and enough
> while migration always held mmap_sem.  But not enough nowadays, when
> there's memory hotremove and compaction: anon_vma lock is also needed,
> to make sure a migration entry is not dodging around behind our back.
> 
> It appears that Mel's a8bef8ff6ea1 "mm: migration: avoid race between
> shift_arg_pages() and rmap_walk() during migration by not migrating
> temporary stacks" was actually a workaround for this in the special
> common case of exec's use of move_pagetables(); and we should probably
> now remove that VM_STACK_INCOMPLETE_SETUP stuff as a separate cleanup.
> 
> Reported-by: Pawel Sikora <pluto@agmk.net>
> Cc: stable@kernel.org
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
> 
>  mm/mremap.c |    5 +++++
>  1 file changed, 5 insertions(+)
> 
> --- 3.1-rc9/mm/mremap.c	2011-07-21 19:17:23.000000000 -0700
> +++ linux/mm/mremap.c	2011-10-13 14:36:25.097780974 -0700
> @@ -77,6 +77,7 @@ static void move_ptes(struct vm_area_str
>  		unsigned long new_addr)
>  {
>  	struct address_space *mapping = NULL;
> +	struct anon_vma *anon_vma = vma->anon_vma;
>  	struct mm_struct *mm = vma->vm_mm;
>  	pte_t *old_pte, *new_pte, pte;
>  	spinlock_t *old_ptl, *new_ptl;
> @@ -95,6 +96,8 @@ static void move_ptes(struct vm_area_str
>  		mapping = vma->vm_file->f_mapping;
>  		mutex_lock(&mapping->i_mmap_mutex);
>  	}
> +	if (anon_vma)
> +		anon_vma_lock(anon_vma);
>  
>  	/*
>  	 * We don't have to worry about the ordering of src and dst
> @@ -121,6 +124,8 @@ static void move_ptes(struct vm_area_str
>  		spin_unlock(new_ptl);
>  	pte_unmap(new_pte - 1);
>  	pte_unmap_unlock(old_pte - 1, old_ptl);
> +	if (anon_vma)
> +		anon_vma_unlock(anon_vma);
>  	if (mapping)
>  		mutex_unlock(&mapping->i_mmap_mutex);
>  	mmu_notifier_invalidate_range_end(vma->vm_mm, old_start, old_end);
> 

Hi,

1).
with this patch applied to the vanilla 3.0.6 kernel my opterons have been running stably for ~4 days so far.
nice :)

2).
with this patch i can't reproduce the soft lockup described at https://lkml.org/lkml/2011/8/30/112
nice :)

3).
now i've started more tests with this patch + 3.0.4 + vserver 2.3.1 to check possibly related lockups
described on the vserver mailing list http://list.linux-vserver.org/archive?mss:5264:201108:odomikkjgoemcaomgidl
and lkml archive https://lkml.org/lkml/2011/5/23/398

1h uptime and still going...

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-10-13 23:16 ` Hugh Dickins
                     ` (2 preceding siblings ...)
  2011-10-18 19:17   ` Paweł Sikora
@ 2011-10-19  7:30   ` Mel Gorman
  2011-10-21 12:44     ` Mel Gorman
  3 siblings, 1 reply; 43+ messages in thread
From: Mel Gorman @ 2011-10-19  7:30 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Pawel Sikora, Andrew Morton, Andrea Arcangeli, linux-mm, jpiszcz,
	arekm, linux-kernel

On Thu, Oct 13, 2011 at 04:16:01PM -0700, Hugh Dickins wrote:
> <SNIP>
> 
> I guess this is the only time you've seen this?  In which case, ideally
> I would try to devise a testcase to demonstrate the issue below instead;

Considering that mremap workloads have been tested fairly heavily and
this hasn't triggered before (or at least not been reported), I would not be
confident it can be easily reproduced. Maybe reproducing is easier if
interrupts are also high.

> but that may involve more ingenuity than I can find time for, let's see
> if people approve of this patch anyway (it applies to 3.1 or 3.0,
> and earlier releases except that i_mmap_mutex used to be i_mmap_lock).
> 
> 
> [PATCH] mm: add anon_vma locking to mremap move
> 
> I don't usually pay much attention to the stale "? " addresses in
> stack backtraces, but this lucky report from Pawel Sikora hints that
> mremap's move_ptes() has inadequate locking against page migration.
> 
>  3.0 BUG_ON(!PageLocked(p)) in migration_entry_to_page():
>  kernel BUG at include/linux/swapops.h:105!

This check is triggered if migration PTEs are left behind. In the few
cases I saw this during compaction development, it was because a VMA was
unreachable during remove_migration_pte. With the anon_vma changes, the
locking during VMA insertion is meant to protect against this, and the order
in which VMAs are linked matters so that the right anon_vma lock is found.

I don't think it is an unreachable VMA problem because if it was, the
problem would trigger much more frequently and not be exclusive to
mremap.

>  RIP: 0010:[<ffffffff81127b76>]  [<ffffffff81127b76>]
>                        migration_entry_wait+0x156/0x160
>   [<ffffffff811016a1>] handle_pte_fault+0xae1/0xaf0
>   [<ffffffff810feee2>] ? __pte_alloc+0x42/0x120
>   [<ffffffff8112c26b>] ? do_huge_pmd_anonymous_page+0xab/0x310
>   [<ffffffff81102a31>] handle_mm_fault+0x181/0x310
>   [<ffffffff81106097>] ? vma_adjust+0x537/0x570
>   [<ffffffff81424bed>] do_page_fault+0x11d/0x4e0
>   [<ffffffff81109a05>] ? do_mremap+0x2d5/0x570
>   [<ffffffff81421d5f>] page_fault+0x1f/0x30
> 
> mremap's down_write of mmap_sem, together with i_mmap_mutex/lock,
> and pagetable locks, were good enough before page migration (with its
> requirement that every migration entry be found) came in; and enough
> while migration always held mmap_sem.  But not enough nowadays, when
> there's memory hotremove and compaction: anon_vma lock is also needed,
> to make sure a migration entry is not dodging around behind our back.
> 

migration holds the anon_vma lock while it unmaps the pages and keeps holding
it until after remove_migration_ptes is called.  There are two anon vmas
that should exist during mremap that were created for the move. They
should not be able to disappear while migration runs, and right now I'm
not seeing how the VMA can get lost :(

I think a consequence of this patch is that migration and mremap are now
serialised by anon_vma lock. As a result, it might still fix the problem
if there is some race between mremap and migration simply by stopping
them playing with each other.

> It appears that Mel's a8bef8ff6ea1 "mm: migration: avoid race between
> shift_arg_pages() and rmap_walk() during migration by not migrating
> temporary stacks" was actually a workaround for this in the special
> common case of exec's use of move_pagetables(); and we should probably
> now remove that VM_STACK_INCOMPLETE_SETUP stuff as a separate cleanup.
> 

The problem was that there was only one VMA for two page table
ranges. The neater fix was to create a second VMA but that required a
kmalloc and additional VMA work during exec which was considered too
heavy. VM_STACK_INCOMPLETE_SETUP is less clean but it is faster.


* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-10-17 18:51       ` Hugh Dickins
  2011-10-17 22:05         ` Andrea Arcangeli
@ 2011-10-19  7:43         ` Mel Gorman
  2011-10-19 13:39           ` Linus Torvalds
  1 sibling, 1 reply; 43+ messages in thread
From: Mel Gorman @ 2011-10-19  7:43 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrea Arcangeli, Pawel Sikora, Andrew Morton, Linus Torvalds,
	linux-mm, jpiszcz, arekm, linux-kernel

On Mon, Oct 17, 2011 at 11:51:00AM -0700, Hugh Dickins wrote:
> On Mon, 17 Oct 2011, Andrea Arcangeli wrote:
> > On Thu, Oct 13, 2011 at 04:30:09PM -0700, Hugh Dickins wrote:
> > > mremap's down_write of mmap_sem, together with i_mmap_mutex/lock,
> > > and pagetable locks, were good enough before page migration (with its
> > > requirement that every migration entry be found) came in; and enough
> > > while migration always held mmap_sem.  But not enough nowadays, when
> > > there's memory hotremove and compaction: anon_vma lock is also needed,
> > > to make sure a migration entry is not dodging around behind our back.
> > 
> > For things like migrate and split_huge_page, the anon_vma layer must
> > guarantee the page is reachable by rmap walk at all times regardless
> > if it's at the old or new address.
> > 
> > This shall be guaranteed by the copy_vma called by move_vma well
> > before move_page_tables/move_ptes can run.
> > 
> > copy_vma obviously takes the anon_vma lock to insert the new "dst" vma
> > into the anon_vma chains structures (vma_link does that). That before
> > any pte can be moved.
> > 
> > Because we keep two vmas mapped on both src and dst range, with
> > different vma->vm_pgoff that is valid for the page (the page doesn't
> > change its page->index) the page should always find _all_ its pte at
> > any given time.
> > 
> > There may be other variables at play like the order of insertion in
> > the anon_vma chain matches our direction of copy and removal of the
> > old pte. But I think the double locking of the PT lock should make the
> > order in the anon_vma chain absolutely irrelevant (the rmap_walk
> > obviously takes the PT lock too), and furthermore likely the
> > anon_vma_chain insertion is favorable (the dst vma is inserted last
> > and checked last). But it shouldn't matter.
> 
> Thanks a lot for thinking it over.  I _almost_ agree with you, except
> there's one aspect that I forgot to highlight in the patch comment:
> remove_migration_pte() behaves as page_check_address() does by default,
> it peeks to see if what it wants is there _before_ taking ptlock.
> 
> And therefore, I think, it is possible that during mremap move, the swap
> pte is in neither of the locations it tries at the instant it peeks there.
> 

I should have read the rest of the thread before responding :/.

This makes more sense and is a relief in a sense. There is nothing known
wrong with the VMA locking or ordering. The correct PTE is found but it is
in the wrong state.

> We could put a stop to that: see plausible alternative patch below.
> Though I have dithered from one to the other and back, I think on the
> whole I still prefer the anon_vma locking in move_ptes(): we don't care
> too deeply about the speed of mremap, but we do care about the speed of

I still think the anon_vma lock serialises mremap and migration. If that
is correct, it could cause things like huge page collapsing stalling mremap
operations. That might cause slowdowns in JVMs during GC, which is undesirable.

> exec, and this does add another lock/unlock there, but it will always
> be uncontended; whereas the patch at the migration end could be adding
> a contended and unnecessary lock.
> 
> Oh, I don't know which, you vote - if you now agree there is a problem.
> I'll sign off the migrate.c one if you prefer it.  But no hurry.
> 

My vote is with the migration change. While there are occasionally
patches to make migration go faster, I don't consider it a hot path.
mremap may be used intensively by JVMs, so I'd be loath to hurt it.

Thanks Hugh.

-- 
Mel Gorman
SUSE Labs


* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-10-19  7:43         ` Mel Gorman
@ 2011-10-19 13:39           ` Linus Torvalds
  2011-10-19 19:42             ` Hugh Dickins
  0 siblings, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2011-10-19 13:39 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Hugh Dickins, Andrea Arcangeli, Pawel Sikora, Andrew Morton,
	linux-mm, jpiszcz, arekm, linux-kernel

On Wed, Oct 19, 2011 at 12:43 AM, Mel Gorman <mgorman@suse.de> wrote:
>
> My vote is with the migration change. While there are occasionally
> patches to make migration go faster, I don't consider it a hot path.
> mremap may be used intensively by JVMs, so I'd be loath to hurt it.

Ok, everybody seems to like that more, and it removes code rather than
adds it, so I certainly prefer it too. Paweł, can you test that other
patch (to mm/migrate.c) that Hugh posted? Instead of the mremap vma
locking patch that you already verified for your setup?

Hugh - that one didn't have a changelog/sign-off, so if you could
write that up, and Paweł's testing is successful, I can apply it...
Looks like we have acks from both Andrea and Mel.

                  Linus


* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-10-19 13:39           ` Linus Torvalds
@ 2011-10-19 19:42             ` Hugh Dickins
  2011-10-20  6:30               ` Paweł Sikora
  2011-10-20 12:51               ` Nai Xia
  0 siblings, 2 replies; 43+ messages in thread
From: Hugh Dickins @ 2011-10-19 19:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mel Gorman, Andrea Arcangeli, Pawel Sikora, Andrew Morton,
	linux-mm, jpiszcz, arekm, linux-kernel

On Wed, 19 Oct 2011, Linus Torvalds wrote:
> On Wed, Oct 19, 2011 at 12:43 AM, Mel Gorman <mgorman@suse.de> wrote:
> >
> > My vote is with the migration change. While there are occasionally
> > patches to make migration go faster, I don't consider it a hot path.
> > mremap may be used intensively by JVMs, so I'd be loath to hurt it.
> 
> Ok, everybody seems to like that more, and it removes code rather than
> adds it, so I certainly prefer it too. Pawel, can you test that other
> patch (to mm/migrate.c) that Hugh posted? Instead of the mremap vma
> locking patch that you already verified for your setup?
> 
> Hugh - that one didn't have a changelog/sign-off, so if you could
> write that up, and Pawel's testing is successful, I can apply it...
> Looks like we have acks from both Andrea and Mel.

Yes, I'm glad to have that input from Andrea and Mel, thank you.

Here we go.  I can't add a Tested-by since Pawel was reporting on the
alternative patch, but perhaps you'll be able to add that in later.

I may have read too much into Pawel's mail, but it sounded like he
would have expected an eponymous find_get_pages() lockup by now,
and was pleased that this patch appeared to have cured that.

I've spent quite a while trying to explain find_get_pages() lockup by
a missed migration entry, but I just don't see it: I don't expect this
(or the alternative) patch to do anything to fix that problem.  I won't
mind if it magically goes away, but I expect we'll need more info from
the debug patch I sent Justin a couple of days ago.

Ah, I'd better send the patch separately as
"[PATCH] mm: fix race between mremap and removing migration entry":
Paweł's "ł" makes my old alpine setup choose quoted printable when
I reply to your mail.

Hugh


* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-10-19 19:42             ` Hugh Dickins
@ 2011-10-20  6:30               ` Paweł Sikora
  2011-10-20  6:51                 ` Linus Torvalds
  2011-10-21  6:54                 ` Nai Xia
  2011-10-20 12:51               ` Nai Xia
  1 sibling, 2 replies; 43+ messages in thread
From: Paweł Sikora @ 2011-10-20  6:30 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Linus Torvalds, Mel Gorman, Andrea Arcangeli, Andrew Morton,
	linux-mm, jpiszcz, arekm, linux-kernel

On Wednesday 19 of October 2011 21:42:15 Hugh Dickins wrote:
> On Wed, 19 Oct 2011, Linus Torvalds wrote:
> > On Wed, Oct 19, 2011 at 12:43 AM, Mel Gorman <mgorman@suse.de> wrote:
> > >
> > > My vote is with the migration change. While there are occasionally
> > > patches to make migration go faster, I don't consider it a hot path.
> > > mremap may be used intensively by JVMs, so I'd be loath to hurt it.
> > 
> > Ok, everybody seems to like that more, and it removes code rather than
> > adds it, so I certainly prefer it too. Pawel, can you test that other
> > patch (to mm/migrate.c) that Hugh posted? Instead of the mremap vma
> > locking patch that you already verified for your setup?
> > 
> > Hugh - that one didn't have a changelog/sign-off, so if you could
> > write that up, and Pawel's testing is successful, I can apply it...
> > Looks like we have acks from both Andrea and Mel.
> 
> Yes, I'm glad to have that input from Andrea and Mel, thank you.
> 
> Here we go.  I can't add a Tested-by since Pawel was reporting on the
> alternative patch, but perhaps you'll be able to add that in later.
> 
> I may have read too much into Pawel's mail, but it sounded like he
> would have expected an eponymous find_get_pages() lockup by now,
> and was pleased that this patch appeared to have cured that.
> 
> I've spent quite a while trying to explain find_get_pages() lockup by
> a missed migration entry, but I just don't see it: I don't expect this
> (or the alternative) patch to do anything to fix that problem.  I won't
> mind if it magically goes away, but I expect we'll need more info from
> the debug patch I sent Justin a couple of days ago.

the latest patch (mm/migrate.c) applied on 3.0.4 also survives points
1) and 2) described previously (https://lkml.org/lkml/2011/10/18/427),
so please apply it to the upstream/stable git tree.

on the other hand, neither patch helps with the 3.0.4+vserver host soft lockup,
which dies after a few hours of stressing. iirc this lockup started with 2.6.38.
is there any major change in the memory management area in 2.6.38 that i can bisect
and test with vserver?

BR,
Paweł.


* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-10-20  6:30               ` Paweł Sikora
@ 2011-10-20  6:51                 ` Linus Torvalds
  2011-10-21  6:54                 ` Nai Xia
  1 sibling, 0 replies; 43+ messages in thread
From: Linus Torvalds @ 2011-10-20  6:51 UTC (permalink / raw)
  To: Paweł Sikora
  Cc: Hugh Dickins, Mel Gorman, Andrea Arcangeli, Andrew Morton,
	linux-mm, jpiszcz, arekm, linux-kernel

2011/10/19 Paweł Sikora <pluto@agmk.net>:
>
> the latest patch (mm/migrate.c) applied on 3.0.4 also survives points
> 1) and 2) described previously (https://lkml.org/lkml/2011/10/18/427),
> so please apply it to the upstream/stable git tree.

Ok, thanks, applied and pushed out.

> on the other hand, neither patch helps with the 3.0.4+vserver host soft lockup,
> which dies after a few hours of stressing. iirc this lockup started with 2.6.38.
> is there any major change in the memory management area in 2.6.38 that i can bisect
> and test with vserver?

I suspect you'd be best off simply just doing a full bisect. Yes, if
2.6.37 is the last known working kernel for you, and 38 breaks, that's
a lot of commits (about 10k, to be exact), and it will take an
annoying number of reboots and tests, but assuming you don't hit any
problems, it should still be "only" about 14 bisection points or so.

You could *try* to minimize the bisect by only looking at commits that
change mm/, but quite frankly, partial tree bisects tend to not be all
that reliable. But if you want to try, you could do basically

   git bisect start -- mm/
   git bisect good v2.6.37
   git bisect bad v2.6.38

and go from there. That will try to do a more specific bisect, and you
should have fewer test points, but the end result really is much less
reliable. But it might help narrow things down a bit.

             Linus


* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-10-16 23:54     ` Andrea Arcangeli
  2011-10-17 18:51       ` Hugh Dickins
@ 2011-10-20  9:11       ` Nai Xia
  2011-10-21 15:56         ` Mel Gorman
  1 sibling, 1 reply; 43+ messages in thread
From: Nai Xia @ 2011-10-20  9:11 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Hugh Dickins, Pawel Sikora, Andrew Morton, Mel Gorman, linux-mm,
	jpiszcz, arekm, linux-kernel

Hi Andrea,

On Mon, Oct 17, 2011 at 7:54 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> On Thu, Oct 13, 2011 at 04:30:09PM -0700, Hugh Dickins wrote:
>> mremap's down_write of mmap_sem, together with i_mmap_mutex/lock,
>> and pagetable locks, were good enough before page migration (with its
>> requirement that every migration entry be found) came in; and enough
>> while migration always held mmap_sem.  But not enough nowadays, when
>> there's memory hotremove and compaction: anon_vma lock is also needed,
>> to make sure a migration entry is not dodging around behind our back.
>
> For things like migrate and split_huge_page, the anon_vma layer must
> guarantee the page is reachable by rmap walk at all times regardless
> if it's at the old or new address.
>
> This shall be guaranteed by the copy_vma called by move_vma well
> before move_page_tables/move_ptes can run.
>
> copy_vma obviously takes the anon_vma lock to insert the new "dst" vma
> into the anon_vma chains structures (vma_link does that). That before
> any pte can be moved.
>
> Because we keep two vmas mapped on both src and dst range, with
> different vma->vm_pgoff that is valid for the page (the page doesn't
> change its page->index) the page should always find _all_ its pte at
> any given time.
>
> There may be other variables at play like the order of insertion in
> the anon_vma chain matches our direction of copy and removal of the
> old pte. But I think the double locking of the PT lock should make the
> order in the anon_vma chain absolutely irrelevant (the rmap_walk
> obviously takes the PT lock too), and furthermore likely the
> anon_vma_chain insertion is favorable (the dst vma is inserted last
> and checked last). But it shouldn't matter.

I happened to be reading this code last week.

And I do think this order matters; the reason is quite similar to why we
need i_mmap_lock in move_ptes():
if rmap_walk goes dst--->src, then when it first looks into dst, ok, the
pte is not there, and it happily skips it and releases the PTL.
Then, just before it looks into src, move_ptes() comes in, takes the locks
and moves the pte from src to dst. So when rmap_walk() looks
into src, it finds an empty pte again. The pte still exists (now at dst),
but rmap_walk() has missed it!

IMO, this can really happen when vma_merge() succeeds.
Imagine that the src vma is faulted late, and in anon_vma_prepare()
it gets the same anon_vma as an existing vma (call it evil_vma) through
find_mergeable_anon_vma().  This can potentially make the vma_merge() in
copy_vma() return evil_vma for some new relocation request, even though src_vma
is really linked _after_ evil_vma/new_vma/dst_vma.
In this way, the ordering protocol of the anon_vma chain is broken.
This should be a rare case, because I think in most cases,
if two VMAs pass reusable_anon_vma(), they have already been merged.

What do you think?

And if my reasoning is sound and this bug is really triggered by it,
Hugh's first patch should be the right fix :)


Regards,

Nai Xia

>
> Another thing could be the copy_vma vma_merge branch succeeding
> (returning not NULL) but I doubt we risk to fall into that one. For
> the rmap_walk to be always working on both the src and dst
> vma->vma_pgoff the pgoff must be different so we can't possibly be ok
> if there's just 1 vma covering the whole range. I exclude this could
> be the case because the pgoff passed to copy_vma is different than the
> vma->vm_pgoff given to copy_vma, so vma_merge can't possibly succeed.
>
> Yet another point to investigate is the point where we teardown the
> old vma and we leave the new vma generated by copy_vma
> established. That's apparently taken care of by do_munmap in move_vma
> so that shall be safe too as munmap is safe in the first place.
>
> Overall I don't think this patch is needed and it seems a noop.
>
>> It appears that Mel's a8bef8ff6ea1 "mm: migration: avoid race between
>> shift_arg_pages() and rmap_walk() during migration by not migrating
>> temporary stacks" was actually a workaround for this in the special
>> common case of exec's use of move_pagetables(); and we should probably
>> now remove that VM_STACK_INCOMPLETE_SETUP stuff as a separate cleanup.
>
> I don't think this patch can help with that, the problem of execve vs
> rmap_walk is that there's 1 single vma existing for src and dst
> virtual ranges while execve runs move_page_tables. So there is no
> possible way that rmap_walk will be guaranteed to find _all_ ptes
> mapping a page if there's just one vma mapping either the src or dst
> range while move_page_table runs. No addition of locking whatsoever
> can fix that bug because we miss a vma (well modulo locking that
> prevents rmap_walk to run at all, until we're finished with execve,
> which is more or less what VM_STACK_INCOMPLETE_SETUP does...).
>
> The only way is to fix this is prevent migrate (or any other rmap_walk
> user that requires 100% reliability from the rmap layer, for example
> swap doesn't require 100% reliability and can still run and gracefully
> fail at finding the pte) while we're moving pagetables in execve. And
> that's what Mel's above mentioned patch does.
>
> The other way to fix that bug that I implemented was to do copy_vma in
> execve, so that we still have both src and dst ranges of
> move_page_tables covered by 2 (not 1) vma, each with the proper
> vma->vm_pgoff, so my approach fixed that bug as well (but requires a
> vma allocation in execve so it was dropped in favor of Mel's patch
> which is totally fine with me, as both approaches fix the bug equally
> well, even if now we have to deal with this special case of sometimes
> rmap_walk having false negatives if the vma flag is set, and the
> important thing is that after VM_STACK_INCOMPLETE_SETUP has been
> cleared it won't ever be set again for the whole lifetime of the vma).
>
> I may be missing something, I did a short review so far, just so the
> patch doesn't get merged if not needed. I mean I think it needs a bit
> more looks on it... The fact the i_mmap_mutex was taken but the
> anon_vma lock was not taken (while in every other place they both are
> needed) certainly makes the patch look correct, but that's just a
> misleading coincidence I think.
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>


* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-10-19 19:42             ` Hugh Dickins
  2011-10-20  6:30               ` Paweł Sikora
@ 2011-10-20 12:51               ` Nai Xia
       [not found]                 ` <CANsGZ6a6_q8+88FRV2froBsVEq7GhtKd9fRnB-0M2MD3a7tnSw@mail.gmail.com>
  1 sibling, 1 reply; 43+ messages in thread
From: Nai Xia @ 2011-10-20 12:51 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Linus Torvalds, Mel Gorman, Andrea Arcangeli, Pawel Sikora,
	Andrew Morton, linux-mm, jpiszcz, arekm, linux-kernel

On Thursday 20 October 2011 03:42:15 Hugh Dickins wrote:
> On Wed, 19 Oct 2011, Linus Torvalds wrote:
> > On Wed, Oct 19, 2011 at 12:43 AM, Mel Gorman <mgorman@suse.de> wrote:
> > >
> > > My vote is with the migration change. While there are occasionally
> > > patches to make migration go faster, I don't consider it a hot path.
> > > mremap may be used intensively by JVMs so I'd loathe to hurt it.
> > 
> > Ok, everybody seems to like that more, and it removes code rather than
> > adds it, so I certainly prefer it too. Pawel, can you test that other
> > patch (to mm/migrate.c) that Hugh posted? Instead of the mremap vma
> > locking patch that you already verified for your setup?
> > 
> > Hugh - that one didn't have a changelog/sign-off, so if you could
> > write that up, and Pawel's testing is successful, I can apply it...
> > Looks like we have acks from both Andrea and Mel.
> 
> Yes, I'm glad to have that input from Andrea and Mel, thank you.
> 
> Here we go.  I can't add a Tested-by since Pawel was reporting on the
> alternative patch, but perhaps you'll be able to add that in later.
> 
> I may have read too much into Pawel's mail, but it sounded like he
> would have expected an eponymous find_get_pages() lockup by now,
> and was pleased that this patch appeared to have cured that.
> 
> I've spent quite a while trying to explain find_get_pages() lockup by
> a missed migration entry, but I just don't see it: I don't expect this
> (or the alternative) patch to do anything to fix that problem.  I won't
> mind if it magically goes away, but I expect we'll need more info from
> the debug patch I sent Justin a couple of days ago.

Hi Hugh, 

Will you please look into my explanation in my reply to Andrea in this thread
and see if it's what you are seeking?


Thanks,

Nai Xia


> 
> Ah, I'd better send the patch separately as
> "[PATCH] mm: fix race between mremap and removing migration entry":
> Paweł's "ł" makes my old alpine setup choose quoted printable when
> I reply to your mail.
> 
> Hugh
> 


* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
       [not found]                 ` <CANsGZ6a6_q8+88FRV2froBsVEq7GhtKd9fRnB-0M2MD3a7tnSw@mail.gmail.com>
@ 2011-10-21  6:22                   ` Nai Xia
  2011-10-21  8:07                     ` Pawel Sikora
  0 siblings, 1 reply; 43+ messages in thread
From: Nai Xia @ 2011-10-21  6:22 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: arekm, Linus Torvalds, linux-mm, Mel Gorman, jpiszcz,
	linux-kernel, Andrew Morton, Pawel Sikora, Andrea Arcangeli

On Fri, Oct 21, 2011 at 2:36 AM, Hugh Dickins <hughd@google.com> wrote:
> I'm travelling at the moment, my brain is not in gear, the source is not in
> front of me, and I'm not used to typing on my phone much!  Excuses, excuses
>
> I flip between thinking you are right, and I'm a fool, and thinking you are
> wrong, and I'm still a fool.

Ha, well, human brains are all weak at thoroughly searching a racy state space,
while automated model checking is still far from applicable to complex
real-world code like the kernel source. Maybe some day someone will produce
a human-guided, computer-aided tool to help us search the combinations of all
involved code paths to validate a specific high-level logic assertion.


>
> Please work it out with Linus, Andrea and Mel: I may not be able to reply
> for a couple of days - thanks.

OK.

As a side note: since I notice that Pawel's workload may include OOM,
I'd like to give a hypothetical series of events that could trigger such a bug.

1.  do_brk() wants to expand a vma, but vma_merge() fails because of a
transient ENOMEM; it nevertheless succeeds in creating a new vma at the boundary.

    vma_a           vma_b
|----------------|---------------------|

2.  A page fault in vma_b gives it an anon_vma; then a page fault in vma_a
reuses the anon_vma of vma_b.


3.  vma_a is remapped to somewhere irrelevant; a new vma_c is created
and linked by anon_vma_clone(). In the anon_vma chain of vma_b,
vma_c is linked after vma_b:

    vma_a           vma_b                   vma_c
|----------------|---------------------|   |==============|

           vma_b                   vma_c
|---------------------|   |==============|



4.  vma_c is remapped back to the original place where vma_a was.
vma_merge() in copy_vma() decides that this request can be merged
into vma_b, and it returns vma_b.

5.  move_page_tables() moves the ptes from vma_c to vma_b, and races with
rmap_walk. The reverse ordering of vma_b and vma_c in the anon_vma chain
makes rmap_walk miss an entry in the way I explained.

Well, it seems a very tricky construction, but it also seems possible to me.
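The five steps can be condensed into a small toy model (plain Python, purely illustrative and not kernel code; the dict standing in for the page tables, the list standing in for the anon_vma chain, and the single race window are all assumptions of the sketch):

```python
# Toy model of steps 1-5: one migration entry, two places it can live
# ("vma_b" at the merged destination, "vma_c" at the source), and an
# anon_vma chain whose order decides which place rmap_walk visits first.

def move_ptes(src, dst):
    """The racing mremap: move the entry from src to dst, if present."""
    def run(ptes):
        if ptes.get(src):
            ptes[src] = False
            ptes[dst] = True
    return run

def rmap_walk(chain, ptes, race):
    """Visit each vma in chain order; `race` fires in the window after
    the first vma is checked (its PT lock dropped) and before the second."""
    found = 0
    for i, vma in enumerate(chain):
        if ptes.get(vma):
            ptes[vma] = False   # entry found and removed
            found += 1
        if i == 0:
            race(ptes)          # move_ptes() wins the race here
    return found

# Normal ordering (src linked before dst): the entry is always found.
ptes = {"vma_c": True, "vma_b": False}
print(rmap_walk(["vma_c", "vma_b"], ptes, move_ptes("vma_c", "vma_b")))  # 1

# Step 4's vma_merge() returned vma_b, which sits *before* vma_c in the
# chain: rmap_walk checks vma_b (empty), mremap then moves the entry
# from vma_c into vma_b, and the walk checks vma_c (empty again).
ptes = {"vma_c": True, "vma_b": False}
print(rmap_walk(["vma_b", "vma_c"], ptes, move_ptes("vma_c", "vma_b")))  # 0: entry missed
```

A 0 from the second walk corresponds to the symptom being discussed: a migration entry the rmap walk never finds, leaving the faulting task to wait on it forever.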

Will Linus, Andrea, Mel, or anyone else please look into my construction
and judge whether it's valid?

Thanks

Nai Xia

>
> Hugh
>
> On Oct 20, 2011 5:51 AM, "Nai Xia" <nai.xia@gmail.com> wrote:
>>
>> On Thursday 20 October 2011 03:42:15 Hugh Dickins wrote:
>> > On Wed, 19 Oct 2011, Linus Torvalds wrote:
>> > > On Wed, Oct 19, 2011 at 12:43 AM, Mel Gorman <mgorman@suse.de> wrote:
>> > > >
>> > > > My vote is with the migration change. While there are occasionally
>> > > > patches to make migration go faster, I don't consider it a hot path.
>> > > > mremap may be used intensively by JVMs so I'd loathe to hurt it.
>> > >
>> > > Ok, everybody seems to like that more, and it removes code rather than
>> > > adds it, so I certainly prefer it too. Pawel, can you test that other
>> > > patch (to mm/migrate.c) that Hugh posted? Instead of the mremap vma
>> > > locking patch that you already verified for your setup?
>> > >
>> > > Hugh - that one didn't have a changelog/sign-off, so if you could
>> > > write that up, and Pawel's testing is successful, I can apply it...
>> > > Looks like we have acks from both Andrea and Mel.
>> >
>> > Yes, I'm glad to have that input from Andrea and Mel, thank you.
>> >
>> > Here we go.  I can't add a Tested-by since Pawel was reporting on the
>> > alternative patch, but perhaps you'll be able to add that in later.
>> >
>> > I may have read too much into Pawel's mail, but it sounded like he
>> > would have expected an eponymous find_get_pages() lockup by now,
>> > and was pleased that this patch appeared to have cured that.
>> >
>> > I've spent quite a while trying to explain find_get_pages() lockup by
>> > a missed migration entry, but I just don't see it: I don't expect this
>> > (or the alternative) patch to do anything to fix that problem.  I won't
>> > mind if it magically goes away, but I expect we'll need more info from
>> > the debug patch I sent Justin a couple of days ago.
>>
>> Hi Hugh,
>>
>> Will you please look into my explanation in my reply to Andrea in this
>> thread
>> and see if it's what you are seeking?
>>
>>
>> Thanks,
>>
>> Nai Xia
>>
>>
>> >
>> > Ah, I'd better send the patch separately as
>> > "[PATCH] mm: fix race between mremap and removing migration entry":
>> > Pawel's "l" makes my old alpine setup choose quoted printable when
>> > I reply to your mail.
>> >
>> > Hugh
>> >
>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-10-20  6:30               ` Paweł Sikora
  2011-10-20  6:51                 ` Linus Torvalds
@ 2011-10-21  6:54                 ` Nai Xia
  2011-10-21  7:35                   ` Pawel Sikora
  1 sibling, 1 reply; 43+ messages in thread
From: Nai Xia @ 2011-10-21  6:54 UTC (permalink / raw)
  To: Paweł Sikora
  Cc: Hugh Dickins, Linus Torvalds, Mel Gorman, Andrea Arcangeli,
	Andrew Morton, linux-mm, jpiszcz, arekm, linux-kernel

2011/10/20 Paweł Sikora <pluto@agmk.net>:
> On Wednesday 19 of October 2011 21:42:15 Hugh Dickins wrote:
>> On Wed, 19 Oct 2011, Linus Torvalds wrote:
>> > On Wed, Oct 19, 2011 at 12:43 AM, Mel Gorman <mgorman@suse.de> wrote:
>> > >
>> > > My vote is with the migration change. While there are occasionally
>> > > patches to make migration go faster, I don't consider it a hot path.
>> > > mremap may be used intensively by JVMs so I'd loathe to hurt it.
>> >
>> > Ok, everybody seems to like that more, and it removes code rather than
>> > adds it, so I certainly prefer it too. Pawel, can you test that other
>> > patch (to mm/migrate.c) that Hugh posted? Instead of the mremap vma
>> > locking patch that you already verified for your setup?
>> >
>> > Hugh - that one didn't have a changelog/sign-off, so if you could
>> > write that up, and Pawel's testing is successful, I can apply it...
>> > Looks like we have acks from both Andrea and Mel.
>>
>> Yes, I'm glad to have that input from Andrea and Mel, thank you.
>>
>> Here we go.  I can't add a Tested-by since Pawel was reporting on the
>> alternative patch, but perhaps you'll be able to add that in later.
>>
>> I may have read too much into Pawel's mail, but it sounded like he
>> would have expected an eponymous find_get_pages() lockup by now,
>> and was pleased that this patch appeared to have cured that.
>>
>> I've spent quite a while trying to explain find_get_pages() lockup by
>> a missed migration entry, but I just don't see it: I don't expect this
>> (or the alternative) patch to do anything to fix that problem.  I won't
>> mind if it magically goes away, but I expect we'll need more info from
>> the debug patch I sent Justin a couple of days ago.
>
> the latest patch (mm/migrate.c) applied on 3.0.4 also survives points
> 1) and 2) described previously (https://lkml.org/lkml/2011/10/18/427),
> so please apply it to the upstream/stable git tree.
>
> from the other side, both patches don't help for 3.0.4+vserver host soft-lock

Hi Paweł,

Did your "both" mean that you applied each patch and ran the tests separately,
or that you applied both patches and ran them together?

Maybe more than one bug is at play here with the same effect;
not fixing all of them wouldn't help at all.

Thanks,

Nai Xia


> which dies in few hours of stressing. iirc this lock has started with 2.6.38.
> is there any major change in memory managment area in 2.6.38 that i can bisect
> and test with vserver?
>
> BR,
> Paweł.
>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-10-21  6:54                 ` Nai Xia
@ 2011-10-21  7:35                   ` Pawel Sikora
  0 siblings, 0 replies; 43+ messages in thread
From: Pawel Sikora @ 2011-10-21  7:35 UTC (permalink / raw)
  To: Nai Xia
  Cc: Hugh Dickins, Linus Torvalds, Mel Gorman, Andrea Arcangeli,
	Andrew Morton, linux-mm, jpiszcz, arekm, linux-kernel

On Friday 21 of October 2011 14:54:29 Nai Xia wrote:
> 2011/10/20 Paweł Sikora <pluto@agmk.net>:
> > On Wednesday 19 of October 2011 21:42:15 Hugh Dickins wrote:
> >> On Wed, 19 Oct 2011, Linus Torvalds wrote:
> >> > On Wed, Oct 19, 2011 at 12:43 AM, Mel Gorman <mgorman@suse.de> wrote:
> >> > >
> >> > > My vote is with the migration change. While there are occasionally
> >> > > patches to make migration go faster, I don't consider it a hot path.
> >> > > mremap may be used intensively by JVMs so I'd loathe to hurt it.
> >> >
> >> > Ok, everybody seems to like that more, and it removes code rather than
> >> > adds it, so I certainly prefer it too. Pawel, can you test that other
> >> > patch (to mm/migrate.c) that Hugh posted? Instead of the mremap vma
> >> > locking patch that you already verified for your setup?
> >> >
> >> > Hugh - that one didn't have a changelog/sign-off, so if you could
> >> > write that up, and Pawel's testing is successful, I can apply it...
> >> > Looks like we have acks from both Andrea and Mel.
> >>
> >> Yes, I'm glad to have that input from Andrea and Mel, thank you.
> >>
> >> Here we go.  I can't add a Tested-by since Pawel was reporting on the
> >> alternative patch, but perhaps you'll be able to add that in later.
> >>
> >> I may have read too much into Pawel's mail, but it sounded like he
> >> would have expected an eponymous find_get_pages() lockup by now,
> >> and was pleased that this patch appeared to have cured that.
> >>
> >> I've spent quite a while trying to explain find_get_pages() lockup by
> >> a missed migration entry, but I just don't see it: I don't expect this
> >> (or the alternative) patch to do anything to fix that problem.  I won't
> >> mind if it magically goes away, but I expect we'll need more info from
> >> the debug patch I sent Justin a couple of days ago.
> >
> > the latest patch (mm/migrate.c) applied on 3.0.4 also survives points
> > 1) and 2) described previously (https://lkml.org/lkml/2011/10/18/427),
> > so please apply it to the upstream/stable git tree.
> >
> > from the other side, both patches don't help for 3.0.4+vserver host soft-lock
> 
> Hi Paweł,
> 
> Did your "both" mean that you applied each patch and run the tests separately,

yes, i've tested Hugh's patches separately.

> Maybe there were more than one bugs dancing but having a same effect,
> not fixing all of them wouldn't help at all.

i suppose that vserver patch only exposes some tricky bug introduced in 2.6.38.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-10-21  6:22                   ` Nai Xia
@ 2011-10-21  8:07                     ` Pawel Sikora
  2011-10-21  9:07                       ` Nai Xia
  0 siblings, 1 reply; 43+ messages in thread
From: Pawel Sikora @ 2011-10-21  8:07 UTC (permalink / raw)
  To: Nai Xia
  Cc: Hugh Dickins, arekm, Linus Torvalds, linux-mm, Mel Gorman,
	jpiszcz, linux-kernel, Andrew Morton, Andrea Arcangeli

On Friday 21 of October 2011 14:22:37 Nai Xia wrote:

> And as a side note. Since I notice that Pawel's workload may include OOM,

my last tests on the patched (3.0.4 + migrate.c fix + vserver) kernel produce full cpu load
on dual 8-core opterons like on this htop screenshot -> http://pluto.agmk.net/kernel/screen1.png
afaics all userspace applications usually don't use more than half of the physical memory,
and the so-called "cache" on the htop bar doesn't reach 100%.

the patched kernel with CONFIG_TRANSPARENT_HUGEPAGE disabled (a new feature in 2.6.38)
died at night, so now i'm going to also disable CONFIG_COMPACTION/MIGRATION in the next
steps and stress this machine again...


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-10-21  8:07                     ` Pawel Sikora
@ 2011-10-21  9:07                       ` Nai Xia
  2011-10-21 21:36                         ` Paweł Sikora
  0 siblings, 1 reply; 43+ messages in thread
From: Nai Xia @ 2011-10-21  9:07 UTC (permalink / raw)
  To: Pawel Sikora
  Cc: Hugh Dickins, arekm, Linus Torvalds, linux-mm, Mel Gorman,
	jpiszcz, linux-kernel, Andrew Morton, Andrea Arcangeli

On Fri, Oct 21, 2011 at 4:07 PM, Pawel Sikora <pluto@agmk.net> wrote:
> On Friday 21 of October 2011 14:22:37 Nai Xia wrote:
>
>> And as a side note. Since I notice that Pawel's workload may include OOM,
>
> my last tests on patched (3.0.4 + migrate.c fix + vserver) kernel produce full cpu load
> on dual 8-cores opterons like on this htop screenshot -> http://pluto.agmk.net/kernel/screen1.png
> afaics all userspace applications usualy don't use more than half of physical memory
> and so called "cache" on htop bar doesn't reach the 100%.

OK, did you log any OOM kills when there was a burst of memory usage?
That said, my OOM reasoning above is a direct shortcut to one imagined
root cause: the "adjacent VMAs which should have been merged but in
fact were not merged" case.
Maybe there are other cases that can lead to this, or maybe it's a
totally different bug...

But I still think that, if my reasoning is sound, similar bad things
will happen again some time in the future, even if it was not your case here...

>
> the patched kernel with disabled CONFIG_TRANSPARENT_HUGEPAGE (new thing in 2.6.38)
> died at night, so now i'm going to disable also CONFIG_COMPACTION/MIGRATION in next
> steps and stress this machine again...

OK, it's smart to narrow down the range first....

>
>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-10-19  7:30   ` Mel Gorman
@ 2011-10-21 12:44     ` Mel Gorman
  0 siblings, 0 replies; 43+ messages in thread
From: Mel Gorman @ 2011-10-21 12:44 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Pawel Sikora, Andrew Morton, Andrea Arcangeli, linux-mm, jpiszcz,
	arekm, linux-kernel

On Wed, Oct 19, 2011 at 09:30:36AM +0200, Mel Gorman wrote:
> >  RIP: 0010:[<ffffffff81127b76>]  [<ffffffff81127b76>]
> >                        migration_entry_wait+0x156/0x160
> >   [<ffffffff811016a1>] handle_pte_fault+0xae1/0xaf0
> >   [<ffffffff810feee2>] ? __pte_alloc+0x42/0x120
> >   [<ffffffff8112c26b>] ? do_huge_pmd_anonymous_page+0xab/0x310
> >   [<ffffffff81102a31>] handle_mm_fault+0x181/0x310
> >   [<ffffffff81106097>] ? vma_adjust+0x537/0x570
> >   [<ffffffff81424bed>] do_page_fault+0x11d/0x4e0
> >   [<ffffffff81109a05>] ? do_mremap+0x2d5/0x570
> >   [<ffffffff81421d5f>] page_fault+0x1f/0x30
> > 
> > mremap's down_write of mmap_sem, together with i_mmap_mutex/lock,
> > and pagetable locks, were good enough before page migration (with its
> > requirement that every migration entry be found) came in; and enough
> > while migration always held mmap_sem.  But not enough nowadays, when
> > there's memory hotremove and compaction: anon_vma lock is also needed,
> > to make sure a migration entry is not dodging around behind our back.
> > 
> 
> migration holds the anon_vma lock while it unmaps the pages and keeps holding
> it until after remove_migration_ptes is called. 

I reread this today and realised I was sloppy with my writing. migration
holds the anon_vma lock while it unmaps the pages. It also holds the
anon_vma lock during remove_migration_ptes. For the migration operation,
a reference count is held on anon_vma but not the lock itself.

> There are two anon vmas
> that should exist during mremap that were created for the move. They
> should not be able to disappear while migration runs and right now,

And what is preventing them disappearing is not the lock but the
reference count.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-10-20  9:11       ` Nai Xia
@ 2011-10-21 15:56         ` Mel Gorman
  2011-10-21 17:21           ` Nai Xia
  2011-10-21 17:41           ` Andrea Arcangeli
  0 siblings, 2 replies; 43+ messages in thread
From: Mel Gorman @ 2011-10-21 15:56 UTC (permalink / raw)
  To: Nai Xia
  Cc: Andrea Arcangeli, Hugh Dickins, Pawel Sikora, Andrew Morton,
	linux-mm, jpiszcz, arekm, linux-kernel

On Thu, Oct 20, 2011 at 05:11:28PM +0800, Nai Xia wrote:
> On Mon, Oct 17, 2011 at 7:54 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> > On Thu, Oct 13, 2011 at 04:30:09PM -0700, Hugh Dickins wrote:
> >> mremap's down_write of mmap_sem, together with i_mmap_mutex/lock,
> >> and pagetable locks, were good enough before page migration (with its
> >> requirement that every migration entry be found) came in; and enough
> >> while migration always held mmap_sem.  But not enough nowadays, when
> >> there's memory hotremove and compaction: anon_vma lock is also needed,
> >> to make sure a migration entry is not dodging around behind our back.
> >
> > For things like migrate and split_huge_page, the anon_vma layer must
> > guarantee the page is reachable by rmap walk at all times regardless
> > if it's at the old or new address.
> >
> > This shall be guaranteed by the copy_vma called by move_vma well
> > before move_page_tables/move_ptes can run.
> >
> > copy_vma obviously takes the anon_vma lock to insert the new "dst" vma
> > into the anon_vma chains structures (vma_link does that). That before
> > any pte can be moved.
> >
> > Because we keep two vmas mapped on both src and dst range, with
> > different vma->vm_pgoff that is valid for the page (the page doesn't
> > change its page->index) the page should always find _all_ its pte at
> > any given time.
> >
> > There may be other variables at play like the order of insertion in
> > the anon_vma chain matches our direction of copy and removal of the
> > old pte. But I think the double locking of the PT lock should make the
> > order in the anon_vma chain absolutely irrelevant (the rmap_walk
> > obviously takes the PT lock too), and furthermore likely the
> > anon_vma_chain insertion is favorable (the dst vma is inserted last
> > and checked last). But it shouldn't matter.
> 
> I happened to be reading these code last week.
> 
> And I do think this order matters, the reason is just quite similar why we
> need i_mmap_lock in move_ptes():
> If rmap_walk goes dst--->src, then when it first look into dst, ok, the

You might be right in that the ordering matters. We do link new VMAs at
the end of the list in anon_vma_chain_list so remove_migrate_ptes should
be walking from src->dst.

If remove_migrate_pte finds src first, it will remove the pte and the
correct version will get copied. If move_ptes runs between when
remove_migrate_ptes moves from src to dst, then the PTE at dst will
still be correct.

> pte is not there, and it happily skip it and release the PTL.
> Then just before it look into src, move_ptes() comes in, takes the locks
> and moves the pte from src to dst. And then when rmap_walk() look
> into src,  it will find an empty pte again. The pte is still there,
> but rmap_walk() missed it !
> 

I believe the ordering is correct though and protects us in this case.
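That argument can be cross-checked with a toy interleaving model (plain Python, not kernel code; the three race points are the sketch's assumption that move_ptes can run before, inside, or after the walk's window, and the names are illustrative):

```python
# Toy check of the ordering argument: with the anon_vma chain walked
# src -> dst, the racing move can run before the walk, in the window
# between the two visits, or after the walk; the migration entry is
# found (exactly once) in every interleaving.

def move_ptes(src, dst):
    """The racing mremap: move the entry from src to dst, if present."""
    def run(ptes):
        if ptes.get(src):
            ptes[src] = False
            ptes[dst] = True
    return run

def walk(chain, ptes, race_point, move):
    """Walk the chain; inject the racing move at one of three points."""
    found = 0
    if race_point == 0:
        move(ptes)               # move completes before the walk starts
    for i, vma in enumerate(chain):
        if ptes.get(vma):
            ptes[vma] = False    # entry found and removed
            found += 1
        if i == 0 and race_point == 1:
            move(ptes)           # move hits the window between visits
    if race_point == 2:
        move(ptes)               # move runs after the walk finishes
    return found

for race_point in (0, 1, 2):
    ptes = {"src": True, "dst": False}
    assert walk(["src", "dst"], ptes, race_point, move_ptes("src", "dst")) == 1
print("src->dst walk finds the entry in every interleaving")
```

Reversing the chain to dst -> src is exactly what breaks this invariant, which is why the vma_merge ordering question matters.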

> IMO, this can really happen in case of vma_merge() succeeding.
> Imagine that src vma is lately faulted and in anon_vma_prepare()
> it got a same anon_vma with an existing vma ( named evil_vma )through
> find_mergeable_anon_vma().  This can potentially make the vma_merge() in
> copy_vma() return with evil_vma on some new relocation request. But src_vma
> is really linked _after_  evil_vma/new_vma/dst_vma.
> In this way, the ordering protocol  of anon_vma chain is broken.
> This should be a rare case because I think in most cases
> if two VMAs can reusable_anon_vma() they were already merged.
> 
> How do you think  ?
> 

Despite the comments in anon_vma_compatible(), I would expect that VMAs
that can share an anon_vma from find_mergeable_anon_vma() will also get
merged. When the new VMA is created, it will be linked in the usual
manner and the oldest->newest ordering is what is required. That's not
that important though.

What is important is if mremap is moving src to a dst that is adjacent
to another anon_vma. If src has never been faulted, it's not an issue
because there are also no migration PTEs. If src has been faulted, then
is_mergeable_anon_vma() should fail as anon_vma1 != anon_vma2 and they
are not compatible. The ordering is preserved and we are still ok.

All that said, while I don't think there is a problem, I can't convince
myself 100% of it. Andrea, can you spot a flaw?

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-10-21 15:56         ` Mel Gorman
@ 2011-10-21 17:21           ` Nai Xia
  2011-10-21 17:41           ` Andrea Arcangeli
  1 sibling, 0 replies; 43+ messages in thread
From: Nai Xia @ 2011-10-21 17:21 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrea Arcangeli, Hugh Dickins, Pawel Sikora, Andrew Morton,
	linux-mm, jpiszcz, arekm, linux-kernel

On Fri, Oct 21, 2011 at 11:56 PM, Mel Gorman <mgorman@suse.de> wrote:
> On Thu, Oct 20, 2011 at 05:11:28PM +0800, Nai Xia wrote:
>> On Mon, Oct 17, 2011 at 7:54 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
>> > On Thu, Oct 13, 2011 at 04:30:09PM -0700, Hugh Dickins wrote:
>> >> mremap's down_write of mmap_sem, together with i_mmap_mutex/lock,
>> >> and pagetable locks, were good enough before page migration (with its
>> >> requirement that every migration entry be found) came in; and enough
>> >> while migration always held mmap_sem.  But not enough nowadays, when
>> >> there's memory hotremove and compaction: anon_vma lock is also needed,
>> >> to make sure a migration entry is not dodging around behind our back.
>> >
>> > For things like migrate and split_huge_page, the anon_vma layer must
>> > guarantee the page is reachable by rmap walk at all times regardless
>> > if it's at the old or new address.
>> >
>> > This shall be guaranteed by the copy_vma called by move_vma well
>> > before move_page_tables/move_ptes can run.
>> >
>> > copy_vma obviously takes the anon_vma lock to insert the new "dst" vma
>> > into the anon_vma chains structures (vma_link does that). That before
>> > any pte can be moved.
>> >
>> > Because we keep two vmas mapped on both src and dst range, with
>> > different vma->vm_pgoff that is valid for the page (the page doesn't
>> > change its page->index) the page should always find _all_ its pte at
>> > any given time.
>> >
>> > There may be other variables at play like the order of insertion in
>> > the anon_vma chain matches our direction of copy and removal of the
>> > old pte. But I think the double locking of the PT lock should make the
>> > order in the anon_vma chain absolutely irrelevant (the rmap_walk
>> > obviously takes the PT lock too), and furthermore likely the
>> > anon_vma_chain insertion is favorable (the dst vma is inserted last
>> > and checked last). But it shouldn't matter.
>>
>> I happened to be reading these code last week.
>>
>> And I do think this order matters, the reason is just quite similar why we
>> need i_mmap_lock in move_ptes():
>> If rmap_walk goes dst--->src, then when it first look into dst, ok, the
>
> You might be right in that the ordering matters. We do link new VMAs at
> the end of the list in anon_vma_chain_list so remove_migrate_ptes should
> be walking from src->dst.
>
> If remove_migrate_pte finds src first, it will remove the pte and the
> correct version will get copied. If move_ptes runs between when
> remove_migrate_ptes moves from src to dst, then the PTE at dst will
> still be correct.
>
>> pte is not there, and it happily skip it and release the PTL.
>> Then just before it look into src, move_ptes() comes in, takes the locks
>> and moves the pte from src to dst. And then when rmap_walk() look
>> into src,  it will find an empty pte again. The pte is still there,
>> but rmap_walk() missed it !
>>
>
> I believe the ordering is correct though and protects us in this case.
>
>> IMO, this can really happen in case of vma_merge() succeeding.
>> Imagine that src vma is lately faulted and in anon_vma_prepare()
>> it got a same anon_vma with an existing vma ( named evil_vma )through
>> find_mergeable_anon_vma().  This can potentially make the vma_merge() in
>> copy_vma() return with evil_vma on some new relocation request. But src_vma
>> is really linked _after_  evil_vma/new_vma/dst_vma.
>> In this way, the ordering protocol  of anon_vma chain is broken.
>> This should be a rare case because I think in most cases
>> if two VMAs can reusable_anon_vma() they were already merged.
>>
>> How do you think  ?
>>
>
> Despite the comments in anon_vma_compatible(), I would expect that VMAs
> that can share an anon_vma from find_mergeable_anon_vma() will also get
> merged. When the new VMA is created, it will be linked in the usual
> manner and the oldest->newest ordering is what is required. That's not
> that important though.
>
> What is important is if mremap is moving src to a dst that is adjacent
> to another anon_vma. If src has never been faulted, it's not an issue
> because there are also no migration PTEs. If src has been faulted, then
> is_mergeable_anon_vma() should fail as anon_vma1 != anon_vma2 and they
> are not compatible. The ordering is preserved and we are still ok.

Hi Mel,

Thanks for the input. I agree with _almost_ all your reasoning above.

But there is a tricky series of events I mentioned in
https://lkml.org/lkml/2011/10/21/14
which, I think, can really lead to anon_vma1 == anon_vma2 in this case.
That series starts with do_brk() failing in vma_merge() due to ENOMEM,
rare as that may be. And I am still not sure whether there exist any
other corner cases in which "should be merged" VMAs just sit there side
by side for some reason -- normally that does not trigger BUGs, so it
may be hard to detect in a real workload.

Please refer to my link; I think the construction is quite clear, unless
I have missed something subtle.

Thanks,

Nai Xia
>
> All that said, while I don't think there is a problem, I can't convince
> myself 100% of it. Andrea, can you spot a flaw?
>
> --
> Mel Gorman
> SUSE Labs
>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-10-21 15:56         ` Mel Gorman
  2011-10-21 17:21           ` Nai Xia
@ 2011-10-21 17:41           ` Andrea Arcangeli
  2011-10-21 22:50             ` Andrea Arcangeli
  2011-10-22  5:07             ` Nai Xia
  1 sibling, 2 replies; 43+ messages in thread
From: Andrea Arcangeli @ 2011-10-21 17:41 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Nai Xia, Hugh Dickins, Pawel Sikora, Andrew Morton, linux-mm,
	jpiszcz, arekm, linux-kernel

On Fri, Oct 21, 2011 at 05:56:32PM +0200, Mel Gorman wrote:
> On Thu, Oct 20, 2011 at 05:11:28PM +0800, Nai Xia wrote:
> > On Mon, Oct 17, 2011 at 7:54 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> > > On Thu, Oct 13, 2011 at 04:30:09PM -0700, Hugh Dickins wrote:
> > >> mremap's down_write of mmap_sem, together with i_mmap_mutex/lock,
> > >> and pagetable locks, were good enough before page migration (with its
> > >> requirement that every migration entry be found) came in; and enough
> > >> while migration always held mmap_sem.  But not enough nowadays, when
> > >> there's memory hotremove and compaction: anon_vma lock is also needed,
> > >> to make sure a migration entry is not dodging around behind our back.
> > >
> > > For things like migrate and split_huge_page, the anon_vma layer must
> > > guarantee the page is reachable by rmap walk at all times regardless
> > > if it's at the old or new address.
> > >
> > > This shall be guaranteed by the copy_vma called by move_vma well
> > > before move_page_tables/move_ptes can run.
> > >
> > > copy_vma obviously takes the anon_vma lock to insert the new "dst" vma
> > > into the anon_vma chains structures (vma_link does that). That before
> > > any pte can be moved.
> > >
> > > Because we keep two vmas mapped on both src and dst range, with
> > > different vma->vm_pgoff that is valid for the page (the page doesn't
> > > change its page->index) the page should always find _all_ its pte at
> > > any given time.
> > >
> > > There may be other variables at play like the order of insertion in
> > > the anon_vma chain matches our direction of copy and removal of the
> > > old pte. But I think the double locking of the PT lock should make the
> > > order in the anon_vma chain absolutely irrelevant (the rmap_walk
> > > obviously takes the PT lock too), and furthermore likely the
> > > anon_vma_chain insertion is favorable (the dst vma is inserted last
> > > and checked last). But it shouldn't matter.
> > 
> > I happened to be reading these code last week.
> > 
> > And I do think this order matters, the reason is just quite similar why we
> > need i_mmap_lock in move_ptes():
> > If rmap_walk goes dst--->src, then when it first look into dst, ok, the
> 
> You might be right in that the ordering matters. We do link new VMAs at

Yes I also think ordering matters as I mentioned in the previous email
that Nai answered to.

> the end of the list in anon_vma_chain_list so remove_migrate_ptes should
> be walking from src->dst.

Correct. As I mentioned in that previous email that Nai answered,
that would only fail to be OK if vma_merge succeeds, and I didn't change
my mind about that...

copy_vma is only called by mremap, so supposedly that path can
trigger. It looks like I was wrong about vma_merge not being able to
succeed in copy_vma, and if it does succeed I still think it's a
problem, as we have no ordering guarantee.

The only other place that depends on the anon_vma_chain order is fork,
and there, no vma_merge can happen, so that is safe.

> If remove_migrate_pte finds src first, it will remove the pte and the
> correct version will get copied. If move_ptes runs between when
> remove_migrate_ptes moves from src to dst, then the PTE at dst will
> still be correct.

The problem is rmap_walk will search dst before src. So it will do
nothing on dst. Then mremap moves the pte from src to dst. When rmap
walk then checks "src" it finds nothing again.

> > pte is not there, and it happily skip it and release the PTL.
> > Then just before it look into src, move_ptes() comes in, takes the locks
> > and moves the pte from src to dst. And then when rmap_walk() look
> > into src,  it will find an empty pte again. The pte is still there,
> > but rmap_walk() missed it !
> > 
> 
> I believe the ordering is correct though and protects us in this case.

Normally it is, the only problem is vma_merge succeeding I think.

> > IMO, this can really happen in case of vma_merge() succeeding.
> > Imagine that src vma is lately faulted and in anon_vma_prepare()
> > it got a same anon_vma with an existing vma ( named evil_vma )through
> > find_mergeable_anon_vma().  This can potentially make the vma_merge() in
> > copy_vma() return with evil_vma on some new relocation request. But src_vma
> > is really linked _after_  evil_vma/new_vma/dst_vma.
> > In this way, the ordering protocol  of anon_vma chain is broken.
> > This should be a rare case because I think in most cases
> > if two VMAs can reusable_anon_vma() they were already merged.
> > 
> > How do you think  ?
> > 

I tried to understand the above scenario yesterday, but with 12 hours
of travel on me I just couldn't.

Yesterday, however, I thought of another, simpler case:

part of a vma is moved elsewhere with mremap, and is then moved back to
its original place. vma_merge will then succeed, and the "src" of
mremap is now queued last in the anon_vma chain: wrong ordering.

Today I read an email from Nai who showed apparently the same
scenario I was thinking of, without evil vmas or the like.

I have a hard time imagining a vma_merge succeeding on a vma that
isn't going back to its original place. The vm_pgoff + vma->anon_vma
checks should keep some linearity, so going back to the original place
sounds like the only way vma_merge can succeed in copy_vma. But it can
still happen in that case, I think (so I'm not sure how the above
scenario with an evil_vma could ever happen, if it has a different
anon_vma and it's not part of a vma that is going back to its original
place like in the second scenario Nai also posted about).

That Nai and I came up with the same scenario hypothesis
independently (Nai's second hypothesis, not the first quoted above),
plus the fact that copy_vma does vma_merge and is only called by
mremap, suggests it can really happen.

> Despite the comments in anon_vma_compatible(), I would expect that VMAs
> that can share an anon_vma from find_mergeable_anon_vma() will also get
> merged. When the new VMA is created, it will be linked in the usual
> manner and the oldest->newest ordering is what is required. That's not
> that important though.
> 
> What is important is if mremap is moving src to a dst that is adjacent
> to another anon_vma. If src has never been faulted, it's not an issue
> because there are also no migration PTEs. If src has been faulted, then
> is_mergeable_anon_vma() should fail as anon_vma1 != anon_vma2 and they
> are not compatible. The ordering is preserved and we are still ok.

I was thinking along these lines, the only pitfall should be when
something is moved and put back into its original place. When it is
moved, a new vma is created and queued last. When it's put back to its
original location, vma_merge will succeed, and "src" is now the
previous "dst" so queued last and that breaks.

> All that said, while I don't think there is a problem, I can't convince
> myself 100% of it. Andrea, can you spot a flaw?

I think Nai's correct, only second hypothesis though.

We have two options:

1) we remove the vma_merge call from copy_vma and do the vma_merge
manually after mremap succeeds (so then we're as safe as fork is and
we rely on the ordering). No locks, but we'll do one more allocation
for an additional temporary vma that will be removed after mremap
completes.

2) Hugh's original fix.

The first option is probably faster and preferable; the vma_merge
there should only trigger when putting things back to their origin, I
suspect, and never with random mremaps, though I'm not sure how common
it is to put things back. If we're in a hurry we can merge Hugh's
patch and optimize it later. We can still retain the migrate fix if we
intend to take route 1 later. I never liked migrate doing speculative
access on ptes that it cannot afford to miss without crashing.

That said, the fix merged upstream is 99% certain to fix things in
practice already, so I doubt we're in a hurry. And if things go wrong,
these issues don't go unnoticed and they shouldn't corrupt memory even
if they trigger. I'm 100% certain it can't do damage (other than a
BUG_ON) for split_huge_page: I count the pmds encountered in the
rmap_walk when I set the splitting bit and compare that count with
page_mapcount, and BUG_ON if they don't match; later I repeat the same
comparison in the second rmap_walk that establishes the ptes and
downgrades the hugepmd to pmd, and BUG_ON again if the count doesn't
match the previous rmap_walk's. It may be possible to trigger the
BUG_ON with some malicious activity, but it won't be too easy either,
because it's not an instant thing; a race still has to trigger, and
it's hard to reproduce.

The anon_vma lock is quite a wide lock, as it's shared by all parent
anon_vma_chains too; a slab allocation from the local cpu may actually
be faster in some conditions (even when the slab allocation is
superfluous). But then I'm not sure. So I'm not against applying
Hugh's fix even for the long run. I wouldn't git revert the migration
change, but if we go with Hugh's fix it would probably be safe to.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-10-21  9:07                       ` Nai Xia
@ 2011-10-21 21:36                         ` Paweł Sikora
  2011-10-22  6:21                           ` Nai Xia
  0 siblings, 1 reply; 43+ messages in thread
From: Paweł Sikora @ 2011-10-21 21:36 UTC (permalink / raw)
  To: Nai Xia
  Cc: Hugh Dickins, arekm, Linus Torvalds, linux-mm, Mel Gorman,
	jpiszcz, linux-kernel, Andrew Morton, Andrea Arcangeli

On Friday 21 of October 2011 11:07:56 Nai Xia wrote:
> On Fri, Oct 21, 2011 at 4:07 PM, Pawel Sikora <pluto@agmk.net> wrote:
> > On Friday 21 of October 2011 14:22:37 Nai Xia wrote:
> >
> >> And as a side note. Since I notice that Pawel's workload may include OOM,
> >
> > my last tests on patched (3.0.4 + migrate.c fix + vserver) kernel produce full cpu load
> > on dual 8-cores opterons like on this htop screenshot -> http://pluto.agmk.net/kernel/screen1.png
> > afaics all userspace applications usually don't use more than half of physical memory
> > and so called "cache" on htop bar doesn't reach the 100%.
> 
> OK, did you log any OOM killing if there was some memory usage burst?
> But, well my above OOM reasoning is a direct short cut to imagined
> root cause of "adjacent VMAs which
> should have been merged but in fact not merged" case.
> Maybe there are other cases that can lead to this or maybe it's
> totally another bug....

i don't see any OOM killing with my conservative settings
(vm.overcommit_memory=2, vm.overcommit_ratio=100).

> But still I think if my reasoning is good, similar bad things will
> happen again some time in the future,
> even if it was not your case here...
> 
> >
> > the patched kernel with disabled CONFIG_TRANSPARENT_HUGEPAGE (new thing in 2.6.38)
> > died at night, so now i'm going to disable also CONFIG_COMPACTION/MIGRATION in next
> > steps and stress this machine again...
> 
> OK, it's smart to narrow down the range first....

disabling hugepage/compaction didn't help, but disabling hugepage/compaction/migration keeps
the opterons stable for ~9h so far. userspace uses ~40GB (of 64) ram, caches reach 100% on the htop bar,
average load ~16. i wonder if it survives the weekend...

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-10-21 17:41           ` Andrea Arcangeli
@ 2011-10-21 22:50             ` Andrea Arcangeli
  2011-10-22  5:52               ` Nai Xia
  2011-10-22  5:07             ` Nai Xia
  1 sibling, 1 reply; 43+ messages in thread
From: Andrea Arcangeli @ 2011-10-21 22:50 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Nai Xia, Hugh Dickins, Pawel Sikora, Andrew Morton, linux-mm,
	jpiszcz, arekm, linux-kernel

On Fri, Oct 21, 2011 at 07:41:20PM +0200, Andrea Arcangeli wrote:
> We have two options:
> 
> 1) we remove the vma_merge call from copy_vma and do the vma_merge
> manually after mremap succeeds (so then we're as safe as fork is and
> we rely on the ordering). No locks, but we'll do one more allocation
> for an additional temporary vma that will be removed after mremap
> completes.
> 
> 2) Hugh's original fix.

3) put the src vma at the tail if vma_merge succeeds and the src vma
and dst vma aren't the same

I tried to implement this, but I'm still wondering about its safety
with concurrent processes all calling mremap at the same time on the
same anon_vma's same_anon_vma list; the reasoning for why I think it
may be safe is in the comment. I ran a few mremaps with my benchmark,
where the THP-aware mremap in -mm gets an x10 boost and moves 5G, and
it didn't crash, but that's about it and not conclusive; if you
review, please comment...

I've to pack luggage and prepare to fly to KS tomorrow so I may not be
responsive in the next few days.

===
>From f2898ff06b5a9a14b9d957c7696137f42a2438e9 Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli <aarcange@redhat.com>
Date: Sat, 22 Oct 2011 00:11:49 +0200
Subject: [PATCH] mremap: enforce rmap src/dst vma ordering in case of
 vma_merge succeeding in copy_vma

migrate was doing a rmap_walk with speculative lock-less access on
pagetables. That could lead it to not serialize properly against
mremap PT locks. But a second problem remains in the order of vmas in
the same_anon_vma list used by the rmap_walk.

If vma_merge would succeed in copy_vma, the src vma could be placed
after the dst vma in the same_anon_vma list. That could still lead
migrate to miss some pte.

This patch adds an anon_vma_order_tail() function that forces the dst
vma to the end of the list before mremap starts, to solve the problem.

If the mremap is very large and there are lots of parents or children
sharing the anon_vma root lock, this should still scale better than
taking the anon_vma root lock around every pte copy for practically
the whole duration of mremap.
---
 include/linux/rmap.h |    1 +
 mm/mmap.c            |    8 ++++++++
 mm/rmap.c            |   43 +++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 52 insertions(+), 0 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 2148b12..45eb098 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -120,6 +120,7 @@ void anon_vma_init(void);	/* create anon_vma_cachep */
 int  anon_vma_prepare(struct vm_area_struct *);
 void unlink_anon_vmas(struct vm_area_struct *);
 int anon_vma_clone(struct vm_area_struct *, struct vm_area_struct *);
+void anon_vma_order_tail(struct vm_area_struct *);
 int anon_vma_fork(struct vm_area_struct *, struct vm_area_struct *);
 void __anon_vma_link(struct vm_area_struct *);
 
diff --git a/mm/mmap.c b/mm/mmap.c
index a65efd4..a5858dc 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2339,7 +2339,15 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
 		 */
 		if (vma_start >= new_vma->vm_start &&
 		    vma_start < new_vma->vm_end)
+			/*
+			 * No need to call anon_vma_order_tail() in
+			 * this case because the same PT lock will
+			 * serialize the rmap_walk against both src
+			 * and dst vmas.
+			 */
 			*vmap = new_vma;
+		else
+			anon_vma_order_tail(new_vma);
 	} else {
 		new_vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
 		if (new_vma) {
diff --git a/mm/rmap.c b/mm/rmap.c
index 8005080..170cece 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -272,6 +272,49 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
 }
 
 /*
+ * Some rmap walks that need to find all ptes/hugepmds without false
+ * negatives (like migrate and split_huge_page), when running
+ * concurrently with operations that copy or move pagetables (like
+ * mremap() and fork()), depend for their safety on the anon_vma
+ * "same_anon_vma" list being in a certain order: the dst_vma must be
+ * placed after the src_vma in the list. This is always guaranteed by
+ * fork(), but mremap() needs to call this function to enforce it
+ * when the dst_vma isn't newly allocated and chained with
+ * anon_vma_clone() but is just a pre-existing vma extended through
+ * vma_merge.
+ *
+ * NOTE: the same_anon_vma list can still be changed by other
+ * processes while mremap runs, because mremap doesn't hold the
+ * anon_vma mutex to prevent modifications to the list. All we need
+ * to enforce is that the relative order of this process's vmas
+ * doesn't change (we don't care about the order of other vmas).
+ * Each vma corresponds to an anon_vma_chain structure, so other
+ * processes calling anon_vma_order_tail() and changing the
+ * same_anon_vma list under mremap() can't screw with the relative
+ * order of this process's vmas, because we never alter the order of
+ * vmas that don't belong to this process. And there can't be
+ * another anon_vma_order_tail() running concurrently with mremap()
+ * from this process, because we hold the mmap_sem for the whole
+ * mremap(). The fork() ordering dependency also shouldn't be
+ * affected: we only care that parent vmas precede child vmas, and
+ * anon_vma_order_tail() won't reorder vmas from either of them.
+ */
+void anon_vma_order_tail(struct vm_area_struct *dst)
+{
+	struct anon_vma_chain *pavc;
+	struct anon_vma *root = NULL;
+
+	list_for_each_entry_reverse(pavc, &dst->anon_vma_chain, same_vma) {
+		struct anon_vma *anon_vma = pavc->anon_vma;
+		VM_BUG_ON(pavc->vma != dst);
+		root = lock_anon_vma_root(root, anon_vma);
+		list_del(&pavc->same_anon_vma);
+		list_add_tail(&pavc->same_anon_vma, &anon_vma->head);
+	}
+	unlock_anon_vma_root(root);
+}
+
+/*
  * Attach vma to its own anon_vma, as well as to the anon_vmas that
  * the corresponding VMA in the parent process is attached to.
  * Returns 0 on success, non-zero on failure.

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-10-21 17:41           ` Andrea Arcangeli
  2011-10-21 22:50             ` Andrea Arcangeli
@ 2011-10-22  5:07             ` Nai Xia
  2011-10-31 16:34               ` Andrea Arcangeli
  1 sibling, 1 reply; 43+ messages in thread
From: Nai Xia @ 2011-10-22  5:07 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Mel Gorman, Hugh Dickins, Pawel Sikora, Andrew Morton, linux-mm,
	jpiszcz, arekm, linux-kernel

On Saturday 22 October 2011 01:41:20 Andrea Arcangeli wrote:
> On Fri, Oct 21, 2011 at 05:56:32PM +0200, Mel Gorman wrote:
> > On Thu, Oct 20, 2011 at 05:11:28PM +0800, Nai Xia wrote:
> > > On Mon, Oct 17, 2011 at 7:54 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> > > > On Thu, Oct 13, 2011 at 04:30:09PM -0700, Hugh Dickins wrote:
> > > >> mremap's down_write of mmap_sem, together with i_mmap_mutex/lock,
> > > >> and pagetable locks, were good enough before page migration (with its
> > > >> requirement that every migration entry be found) came in; and enough
> > > >> while migration always held mmap_sem.  But not enough nowadays, when
> > > >> there's memory hotremove and compaction: anon_vma lock is also needed,
> > > >> to make sure a migration entry is not dodging around behind our back.
> > > >
> > > > For things like migrate and split_huge_page, the anon_vma layer must
> > > > guarantee the page is reachable by rmap walk at all times regardless
> > > > if it's at the old or new address.
> > > >
> > > > This shall be guaranteed by the copy_vma called by move_vma well
> > > > before move_page_tables/move_ptes can run.
> > > >
> > > > copy_vma obviously takes the anon_vma lock to insert the new "dst" vma
> > > > into the anon_vma chains structures (vma_link does that). That before
> > > > any pte can be moved.
> > > >
> > > > Because we keep two vmas mapped on both src and dst range, with
> > > > different vma->vm_pgoff that is valid for the page (the page doesn't
> > > > change its page->index) the page should always find _all_ its pte at
> > > > any given time.
> > > >
> > > > There may be other variables at play like the order of insertion in
> > > > the anon_vma chain matches our direction of copy and removal of the
> > > > old pte. But I think the double locking of the PT lock should make the
> > > > order in the anon_vma chain absolutely irrelevant (the rmap_walk
> > > > obviously takes the PT lock too), and furthermore likely the
> > > > anon_vma_chain insertion is favorable (the dst vma is inserted last
> > > > and checked last). But it shouldn't matter.
> > > 
> > > I happened to be reading these code last week.
> > > 
> > > And I do think this order matters, the reason is just quite similar why we
> > > need i_mmap_lock in move_ptes():
> > > If rmap_walk goes dst--->src, then when it first look into dst, ok, the
> > 
> > You might be right in that the ordering matters. We do link new VMAs at
> 
> Yes I also think ordering matters as I mentioned in the previous email
> that Nai answered to.
> 
> > the end of the list in anon_vma_chain_list so remove_migrate_ptes should
> > be walking from src->dst.
> 
> Correct. Like I mentioned in that previous email that Nai answered,
> that wouldn't be ok only if vma_merge succeeds and I didn't change my mind
> about that...
> 
> copy_vma is only called by mremap so supposedly that path can
> trigger. Looks like I was wrong about vma_merge being able to succeed
> in copy_vma, and if it does I still think it's a problem as we have no
> ordering guarantee.
> 
> The only other place that depends on the anon_vma_chain order is fork,
> and there, no vma_merge can happen, so that is safe.
> 
> > If remove_migrate_pte finds src first, it will remove the pte and the
> > correct version will get copied. If move_ptes runs between when
> > remove_migrate_ptes moves from src to dst, then the PTE at dst will
> > still be correct.
> 
> The problem is that rmap_walk will search dst before src. So it will
> do nothing on dst. Then mremap moves the pte from src to dst. When
> rmap_walk then checks "src" it finds nothing again.
> 
> > > pte is not there, and it happily skip it and release the PTL.
> > > Then just before it look into src, move_ptes() comes in, takes the locks
> > > and moves the pte from src to dst. And then when rmap_walk() look
> > > into src,  it will find an empty pte again. The pte is still there,
> > > but rmap_walk() missed it !
> > > 
> > 
> > I believe the ordering is correct though and protects us in this case.
> 
> Normally it is, the only problem is vma_merge succeeding I think.
> 
> > > IMO, this can really happen in case of vma_merge() succeeding.
> > > Imagine that src vma is lately faulted and in anon_vma_prepare()
> > > it got a same anon_vma with an existing vma ( named evil_vma )through
> > > find_mergeable_anon_vma().  This can potentially make the vma_merge() in
> > > copy_vma() return with evil_vma on some new relocation request. But src_vma
> > > is really linked _after_  evil_vma/new_vma/dst_vma.
> > > In this way, the ordering protocol  of anon_vma chain is broken.
> > > This should be a rare case because I think in most cases
> > > if two VMAs can reusable_anon_vma() they were already merged.
> > > 
> > > How do you think  ?
> > > 
> 
> I tried to understand the above scenario yesterday but, with 12 hours
> of travel on me, I just couldn't.

Oh, yes, the first hypothesis was actually a vague feeling that
things might go wrong in that direction; the details in it were
somewhat misleading. But following that direction, I found the second,
clear hypothesis that leads to this bug step by step.

> 
> Yesterday however I thought of another simpler case:
> 
> part of a vma is moved with mremap elsewhere. Then it is moved back to
> its original place. So then vma_merge will succeed, and the "src" of
> mremap is now queued last in anon_vma_chain, wrong ordering.

Oh, yes, partial mremapping will do the trick. I was too fixated on
finding a case where two VMAs missed a normal merge chance but would
merge later on. The only such case I can find so far is an ENOMEM in
vma_adjust().

Partial mremapping is a simpler case and definitely more likely to
happen.

> 
> Today I read an email from Nai who showed apparently the same
> scenario I was thinking of, without evil vmas or the like.
> 
> I have a hard time imagining a vma_merge succeeding on a vma that
> isn't going back to its original place. The vm_pgoff + vma->anon_vma
> checks should keep some linearity, so going back to the original place
> sounds like the only way vma_merge can succeed in copy_vma. But it can
> still happen in that case, I think (so I'm not sure how the above
> scenario with an evil_vma could ever happen, if it has a different
> anon_vma and it's not part of a vma that is going back to its original
> place like in the second scenario Nai also posted about).
> 
> That Nai and I came up with the same scenario hypothesis
> independently (Nai's second hypothesis, not the first quoted above),
> plus the fact that copy_vma does vma_merge and is only called by
> mremap, suggests it can really happen.
> 
> > Despite the comments in anon_vma_compatible(), I would expect that VMAs
> > that can share an anon_vma from find_mergeable_anon_vma() will also get
> > merged. When the new VMA is created, it will be linked in the usual
> > manner and the oldest->newest ordering is what is required. That's not
> > that important though.
> > 
> > What is important is if mremap is moving src to a dst that is adjacent
> > to another anon_vma. If src has never been faulted, it's not an issue
> > because there are also no migration PTEs. If src has been faulted, then
> > is_mergeable_anon_vma() should fail as anon_vma1 != anon_vma2 and they
> > are not compatible. The ordering is preserved and we are still ok.
> 
> I was thinking along these lines, the only pitfall should be when
> something is moved and put back into its original place. When it is
> moved, a new vma is created and queued last. When it's put back to its
> original location, vma_merge will succeed, and "src" is now the
> previous "dst" so queued last and that breaks.
> 
> > All that said, while I don't think there is a problem, I can't convince
> > myself 100% of it. Andrea, can you spot a flaw?
> 
> I think Nai's correct, only second hypothesis though.
> 
> We have two options:
> 
> 1) we remove the vma_merge call from copy_vma and do the vma_merge
> manually after mremap succeeds (so then we're as safe as fork is and
> we rely on the ordering). No locks, but we'll do one more allocation
> for an additional temporary vma that will be removed after mremap
> completes.
> 
> 2) Hugh's original fix.
> 
> The first option is probably faster and preferable; the vma_merge
> there should only trigger when putting things back to their origin, I
> suspect, and never with random mremaps, though I'm not sure how common
> it is to put things back. If we're in a hurry we can merge Hugh's
> patch and optimize it later. We can still retain the migrate fix if we
> intend to take route 1 later. I never liked migrate doing speculative
> access on ptes that it cannot afford to miss without crashing.

Me too; I think it's error-prone, or at least we must be very careful
that it doesn't do something evil. If the speculative access does not
save much time, we need not bother wasting our mind power over it.

> 
> That said, the fix merged upstream is 99% certain to fix things in
> practice already, so I doubt we're in a hurry. And if things go wrong,
> these issues don't go unnoticed and they shouldn't corrupt memory even
> if they trigger. I'm 100% certain it can't do damage (other than a
> BUG_ON) for split_huge_page: I count the pmds encountered in the
> rmap_walk when I set the splitting bit and compare that count with
> page_mapcount, and BUG_ON if they don't match; later I repeat the same
> comparison in the second rmap_walk that establishes the ptes and
> downgrades the hugepmd to pmd, and BUG_ON again if the count doesn't
> match the previous rmap_walk's. It may be possible to trigger the
> BUG_ON with some malicious activity, but it won't be too easy either,
> because it's not an instant thing; a race still has to trigger, and
> it's hard to reproduce.
> 
> The anon_vma lock is quite a wide lock, as it's shared by all parent
> anon_vma_chains too; a slab allocation from the local cpu may actually
> be faster in some conditions (even when the slab allocation is
> superfluous). But then I'm not sure. So I'm not against applying
> Hugh's fix even for the long run. I wouldn't git revert the migration
> change, but if we go with Hugh's fix it would probably be safe to.
> 

Yeah, the anon_vma root lock is a big lock. And JFYI, I am actually
doing some very nasty hacking on anon_vma, and one of the side effects
is breaking the root lock into pieces. But this area is pretty
convoluted, with many race conditions. I hope some day I will finally
make my patch work and have your precious review of it. :-)


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-10-21 22:50             ` Andrea Arcangeli
@ 2011-10-22  5:52               ` Nai Xia
  2011-10-31 17:14                 ` Andrea Arcangeli
  0 siblings, 1 reply; 43+ messages in thread
From: Nai Xia @ 2011-10-22  5:52 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Mel Gorman, Hugh Dickins, Pawel Sikora, Andrew Morton, linux-mm,
	jpiszcz, arekm, linux-kernel

On Saturday 22 October 2011 06:50:08 Andrea Arcangeli wrote:
> On Fri, Oct 21, 2011 at 07:41:20PM +0200, Andrea Arcangeli wrote:
> > We have two options:
> > 
> > 1) we remove the vma_merge call from copy_vma and do the vma_merge
> > manually after mremap succeeds (so then we're as safe as fork is
> > and we rely on the ordering). No locks, but we'll do one more
> > allocation for an additional temporary vma that will be removed
> > after mremap completes.
> > 
> > 2) Hugh's original fix.
> 
> 3) put the src vma at the tail if vma_merge succeeds and the src vma
> and dst vma aren't the same
> 
> I tried to implement this but I'm still wondering about the safety of
> this with concurrent processes all calling mremap at the same time on
> the same anon_vma same_anon_vma list, the reasoning I think it may be
> safe is in the comment. I run a few mremap with my benchmark where the
> THP aware mremap in -mm gets a x10 boost and moves 5G and it didn't

BTW, I am curious what benchmark you ran, and whether the "x10 boost"
is compared against Hugh's anon_vma locking fix?

> crash but that's about it and not conclusive, if you review please
> comment...

My comment is at the bottom of this post.

> 
> I've to pack luggage and prepare to fly to KS tomorrow so I may not be
> responsive in the next few days.
> 
> ===
> From f2898ff06b5a9a14b9d957c7696137f42a2438e9 Mon Sep 17 00:00:00 2001
> From: Andrea Arcangeli <aarcange@redhat.com>
> Date: Sat, 22 Oct 2011 00:11:49 +0200
> Subject: [PATCH] mremap: enforce rmap src/dst vma ordering in case of
>  vma_merge succeeding in copy_vma
> 
> migrate was doing a rmap_walk with speculative lock-less access on
> pagetables. That could lead it to not serialize properly against
> mremap PT locks. But a second problem remains in the order of vmas in
> the same_anon_vma list used by the rmap_walk.
> 
> If vma_merge would succeed in copy_vma, the src vma could be placed
> after the dst vma in the same_anon_vma list. That could still lead
> migrate to miss some pte.
> 
> This patch adds an anon_vma_order_tail() function that forces the dst
> vma to the end of the list before mremap starts, to solve the problem.
> 
> If the mremap is very large and there are lots of parents or children
> sharing the anon_vma root lock, this should still scale better than
> taking the anon_vma root lock around every pte copy for practically
> the whole duration of mremap.
> ---
>  include/linux/rmap.h |    1 +
>  mm/mmap.c            |    8 ++++++++
>  mm/rmap.c            |   43 +++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 52 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index 2148b12..45eb098 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -120,6 +120,7 @@ void anon_vma_init(void);	/* create anon_vma_cachep */
>  int  anon_vma_prepare(struct vm_area_struct *);
>  void unlink_anon_vmas(struct vm_area_struct *);
>  int anon_vma_clone(struct vm_area_struct *, struct vm_area_struct *);
> +void anon_vma_order_tail(struct vm_area_struct *);
>  int anon_vma_fork(struct vm_area_struct *, struct vm_area_struct *);
>  void __anon_vma_link(struct vm_area_struct *);
>  
> diff --git a/mm/mmap.c b/mm/mmap.c
> index a65efd4..a5858dc 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2339,7 +2339,15 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
>  		 */
>  		if (vma_start >= new_vma->vm_start &&
>  		    vma_start < new_vma->vm_end)
> +			/*
> +			 * No need to call anon_vma_order_tail() in
> +			 * this case because the same PT lock will
> +			 * serialize the rmap_walk against both src
> +			 * and dst vmas.
> +			 */
>  			*vmap = new_vma;
> +		else
> +			anon_vma_order_tail(new_vma);
>  	} else {
>  		new_vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
>  		if (new_vma) {
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 8005080..170cece 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -272,6 +272,49 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
>  }
>  
>  /*
> + * Some rmap walks that need to find all ptes/hugepmds without false
> + * negatives (like migrate and split_huge_page), when running
> + * concurrently with operations that copy or move pagetables (like
> + * mremap() and fork()), depend for their safety on the anon_vma
> + * "same_anon_vma" list being in a certain order: the dst_vma must be
> + * placed after the src_vma in the list. This is always guaranteed by
> + * fork(), but mremap() needs to call this function to enforce it
> + * when the dst_vma isn't newly allocated and chained with
> + * anon_vma_clone() but is just a pre-existing vma extended through
> + * vma_merge.
> + *
> + * NOTE: the same_anon_vma list can still be changed by other
> + * processes while mremap runs, because mremap doesn't hold the
> + * anon_vma mutex to prevent modifications to the list. All we need
> + * to enforce is that the relative order of this process's vmas
> + * doesn't change (we don't care about the order of other vmas).
> + * Each vma corresponds to an anon_vma_chain structure, so other
> + * processes calling anon_vma_order_tail() and changing the
> + * same_anon_vma list under mremap() can't screw with the relative
> + * order of this process's vmas, because we never alter the order of
> + * vmas that don't belong to this process. And there can't be
> + * another anon_vma_order_tail() running concurrently with mremap()
> + * from this process, because we hold the mmap_sem for the whole
> + * mremap(). The fork() ordering dependency also shouldn't be
> + * affected: we only care that parent vmas precede child vmas, and
> + * anon_vma_order_tail() won't reorder vmas from either of them.
> + */
> +void anon_vma_order_tail(struct vm_area_struct *dst)
> +{
> +	struct anon_vma_chain *pavc;
> +	struct anon_vma *root = NULL;
> +
> +	list_for_each_entry_reverse(pavc, &dst->anon_vma_chain, same_vma) {
> +		struct anon_vma *anon_vma = pavc->anon_vma;
> +		VM_BUG_ON(pavc->vma != dst);
> +		root = lock_anon_vma_root(root, anon_vma);
> +		list_del(&pavc->same_anon_vma);
> +		list_add_tail(&pavc->same_anon_vma, &anon_vma->head);
> +	}
> +	unlock_anon_vma_root(root);
> +}

This patch, together with the reasoning, looks good to me.
But I wonder whether this patch makes the anon_vma chain ordering game
more complex and harder to play in the future.
However, if it does bring much performance benefit, I vote for this
patch, because it balances all three requirements here: bug free, good
performance, and no two VMAs left unmerged for no good reason.

Our situation again gives me the strong feeling that we are really in
bad need of a computer-aided way to traverse all of the possible state
space. There are some guys around me who do automatic software testing
research, but I am afraid our problem is too much "real world" for
them... sigh...




^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-10-21 21:36                         ` Paweł Sikora
@ 2011-10-22  6:21                           ` Nai Xia
  2011-10-22 16:42                             ` Paweł Sikora
  0 siblings, 1 reply; 43+ messages in thread
From: Nai Xia @ 2011-10-22  6:21 UTC (permalink / raw)
  To: Paweł Sikora
  Cc: Hugh Dickins, arekm, Linus Torvalds, linux-mm, Mel Gorman,
	jpiszcz, linux-kernel, Andrew Morton, Andrea Arcangeli

On Saturday 22 October 2011 05:36:46 Paweł Sikora wrote:
> On Friday 21 of October 2011 11:07:56 Nai Xia wrote:
> > On Fri, Oct 21, 2011 at 4:07 PM, Pawel Sikora <pluto@agmk.net> wrote:
> > > On Friday 21 of October 2011 14:22:37 Nai Xia wrote:
> > >
> > >> And as a side note. Since I notice that Pawel's workload may include OOM,
> > >
> > > my last tests on a patched (3.0.4 + migrate.c fix + vserver) kernel produce full cpu load
> > > on dual 8-core opterons, like on this htop screenshot -> http://pluto.agmk.net/kernel/screen1.png
> > > afaics all userspace applications usually don't use more than half of physical memory,
> > > and the so-called "cache" on the htop bar doesn't reach 100%.
> > 
> > OK, did you log any OOM killing when there was a memory usage burst?
> > But, well, my OOM reasoning above is a direct shortcut to the
> > imagined root cause of the "adjacent VMAs which should have been
> > merged but in fact were not merged" case. Maybe there are other
> > cases that can lead to this, or maybe it's a totally different bug....
> 
> i don't see any OOM killing with my conservative settings
> (vm.overcommit_memory=2, vm.overcommit_ratio=100).

OK, that does not matter now. Andrea showed us a simpler way to get to
this bug.

> 
> > But I still think that if my reasoning is sound, similar bad things
> > will happen again some time in the future, even if that was not your
> > case here...
> > 
> > >
> > > the patched kernel with CONFIG_TRANSPARENT_HUGEPAGE disabled (new in 2.6.38)
> > > died at night, so now i'm also going to disable CONFIG_COMPACTION/MIGRATION
> > > in the next steps and stress this machine again...
> > 
> > OK, it's smart to narrow down the range first....
> 
> disabling hugepage/compacting didn't help, but disabling hugepage/compacting/migration
> has kept the opterons stable for ~9h so far. userspace uses ~40GB (of 64) ram, caches
> reach 100% on the htop bar, average load ~16. i wonder if it will survive the weekend...
> 

Maybe you should give Andrea's latest anon_vma_order_tail() patch another shot. :)


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-10-22  6:21                           ` Nai Xia
@ 2011-10-22 16:42                             ` Paweł Sikora
       [not found]                               ` <CAPQyPG5HJKTo8AEy_khdJeciTgtNQepK6XLcpzvPF8PYS0V-Lw@mail.gmail.com>
  0 siblings, 1 reply; 43+ messages in thread
From: Paweł Sikora @ 2011-10-22 16:42 UTC (permalink / raw)
  To: nai.xia
  Cc: Hugh Dickins, arekm, Linus Torvalds, linux-mm, Mel Gorman,
	jpiszcz, linux-kernel, Andrew Morton, Andrea Arcangeli

On Saturday 22 of October 2011 08:21:23 Nai Xia wrote:
> On Saturday 22 October 2011 05:36:46 Paweł Sikora wrote:
> > On Friday 21 of October 2011 11:07:56 Nai Xia wrote:
> > > On Fri, Oct 21, 2011 at 4:07 PM, Pawel Sikora <pluto@agmk.net> wrote:
> > > > On Friday 21 of October 2011 14:22:37 Nai Xia wrote:
> > > >
> > > >> And as a side note. Since I notice that Pawel's workload may include OOM,
> > > >
> > > > my last tests on a patched (3.0.4 + migrate.c fix + vserver) kernel produce full cpu load
> > > > on dual 8-core opterons, like on this htop screenshot -> http://pluto.agmk.net/kernel/screen1.png
> > > > afaics all userspace applications usually don't use more than half of physical memory,
> > > > and the so-called "cache" on the htop bar doesn't reach 100%.
> > > 
> > > OK, did you log any OOM killing when there was a memory usage burst?
> > > But, well, my OOM reasoning above is a direct shortcut to the
> > > imagined root cause of the "adjacent VMAs which should have been
> > > merged but in fact were not merged" case. Maybe there are other
> > > cases that can lead to this, or maybe it's a totally different bug....
> > 
> > i don't see any OOM killing with my conservative settings
> > (vm.overcommit_memory=2, vm.overcommit_ratio=100).
> 
> OK, that does not matter now. Andrea showed us a simpler way to get to
> this bug.
> 
> > 
> > > But I still think that if my reasoning is sound, similar bad things
> > > will happen again some time in the future, even if that was not your
> > > case here...
> > > 
> > > >
> > > > the patched kernel with CONFIG_TRANSPARENT_HUGEPAGE disabled (new in 2.6.38)
> > > > died at night, so now i'm also going to disable CONFIG_COMPACTION/MIGRATION
> > > > in the next steps and stress this machine again...
> > > 
> > > OK, it's smart to narrow down the range first....
> > 
> > disabling hugepage/compacting didn't help, but disabling hugepage/compacting/migration
> > has kept the opterons stable for ~9h so far. userspace uses ~40GB (of 64) ram, caches
> > reach 100% on the htop bar, average load ~16. i wonder if it will survive the weekend...
> > 
> 
> Maybe you should give Andrea's latest anon_vma_order_tail() patch another shot. :)
> 

all my attempts at disabling thp/compaction/migration failed (machine locked).
now i'm testing 3.0.7 + vserver + Hugh's + Andrea's patches with a few kernel debug options enabled.

so far it has logged only something unrelated to the memory management subsystem:

[  258.397014] =======================================================
[  258.397209] [ INFO: possible circular locking dependency detected ]
[  258.397311] 3.0.7-vs2.3.1-dirty #1
[  258.397402] -------------------------------------------------------
[  258.397503] slave_odra_g_00/19432 is trying to acquire lock:
[  258.397603]  (&(&sig->cputimer.lock)->rlock){-.....}, at: [<ffffffff8103adfc>] update_curr+0xfc/0x190
[  258.397912] 
[  258.397912] but task is already holding lock:
[  258.398090]  (&rq->lock){-.-.-.}, at: [<ffffffff81041a8e>] scheduler_tick+0x4e/0x280
[  258.398387] 
[  258.398388] which lock already depends on the new lock.
[  258.398389] 
[  258.398652] 
[  258.398653] the existing dependency chain (in reverse order) is:
[  258.398836] 
[  258.398837] -> #2 (&rq->lock){-.-.-.}:
[  258.399178]        [<ffffffff810959ee>] lock_acquire+0x8e/0x120
[  258.399336]        [<ffffffff81466e5c>] _raw_spin_lock+0x2c/0x40
[  258.399495]        [<ffffffff81040bd7>] wake_up_new_task+0x97/0x1c0
[  258.399652]        [<ffffffff81047db6>] do_fork+0x176/0x460
[  258.399807]        [<ffffffff8100999c>] kernel_thread+0x6c/0x70
[  258.399964]        [<ffffffff8144715d>] rest_init+0x21/0xc4
[  258.400120]        [<ffffffff818adbd2>] start_kernel+0x3d6/0x3e1
[  258.400280]        [<ffffffff818ad322>] x86_64_start_reservations+0x132/0x136
[  258.400336]        [<ffffffff818ad416>] x86_64_start_kernel+0xf0/0xf7
[  258.400336] 
[  258.400336] -> #1 (&p->pi_lock){-.-.-.}:
[  258.400336]        [<ffffffff810959ee>] lock_acquire+0x8e/0x120
[  258.400336]        [<ffffffff81466f5c>] _raw_spin_lock_irqsave+0x3c/0x60
[  258.400336]        [<ffffffff8106f328>] thread_group_cputimer+0x38/0x100
[  258.400336]        [<ffffffff8106f41d>] cpu_timer_sample_group+0x2d/0xa0
[  258.400336]        [<ffffffff8107080a>] set_process_cpu_timer+0x3a/0x110
[  258.400336]        [<ffffffff8107091a>] update_rlimit_cpu+0x3a/0x60
[  258.400336]        [<ffffffff81062c0e>] do_prlimit+0x19e/0x240
[  258.400336]        [<ffffffff81063008>] sys_setrlimit+0x48/0x60
[  258.400336]        [<ffffffff8146efbb>] system_call_fastpath+0x16/0x1b
[  258.400336] 
[  258.400336] -> #0 (&(&sig->cputimer.lock)->rlock){-.....}:
[  258.400336]        [<ffffffff810951e7>] __lock_acquire+0x1aa7/0x1cc0
[  258.400336]        [<ffffffff810959ee>] lock_acquire+0x8e/0x120
[  258.400336]        [<ffffffff81466e5c>] _raw_spin_lock+0x2c/0x40
[  258.400336]        [<ffffffff8103adfc>] update_curr+0xfc/0x190
[  258.400336]        [<ffffffff8103b22d>] task_tick_fair+0x2d/0x140
[  258.400336]        [<ffffffff81041b0f>] scheduler_tick+0xcf/0x280
[  258.400336]        [<ffffffff8105a439>] update_process_times+0x69/0x80
[  258.400336]        [<ffffffff8108e0cf>] tick_sched_timer+0x5f/0xc0
[  258.400336]        [<ffffffff81071339>] __run_hrtimer+0x79/0x1f0
[  258.400336]        [<ffffffff81071ce3>] hrtimer_interrupt+0xf3/0x220
[  258.400336]        [<ffffffff8101daa4>] smp_apic_timer_interrupt+0x64/0xa0
[  258.400336]        [<ffffffff8146f9d3>] apic_timer_interrupt+0x13/0x20
[  258.400336]        [<ffffffff8107092d>] update_rlimit_cpu+0x4d/0x60
[  258.400336]        [<ffffffff81062c0e>] do_prlimit+0x19e/0x240
[  258.400336]        [<ffffffff81063008>] sys_setrlimit+0x48/0x60
[  258.400336]        [<ffffffff8146efbb>] system_call_fastpath+0x16/0x1b
[  258.400336] 
[  258.400336] other info that might help us debug this:
[  258.400336] 
[  258.400336] Chain exists of:
[  258.400336]   &(&sig->cputimer.lock)->rlock --> &p->pi_lock --> &rq->lock
[  258.400336] 
[  258.400336]  Possible unsafe locking scenario:
[  258.400336] 
[  258.400336]        CPU0                    CPU1
[  258.400336]        ----                    ----
[  258.400336]   lock(&rq->lock);
[  258.400336]                                lock(&p->pi_lock);
[  258.400336]                                lock(&rq->lock);
[  258.400336]   lock(&(&sig->cputimer.lock)->rlock);
[  258.400336] 
[  258.400336]  *** DEADLOCK ***
[  258.400336] 
[  258.400336] 2 locks held by slave_odra_g_00/19432:
[  258.400336]  #0:  (tasklist_lock){.+.+..}, at: [<ffffffff81062acd>] do_prlimit+0x5d/0x240
[  258.400336]  #1:  (&rq->lock){-.-.-.}, at: [<ffffffff81041a8e>] scheduler_tick+0x4e/0x280
[  258.400336] 
[  258.400336] stack backtrace:
[  258.400336] Pid: 19432, comm: slave_odra_g_00 Not tainted 3.0.7-vs2.3.1-dirty #1
[  258.400336] Call Trace:
[  258.400336]  <IRQ>  [<ffffffff8145e204>] print_circular_bug+0x23d/0x24e
[  258.400336]  [<ffffffff810951e7>] __lock_acquire+0x1aa7/0x1cc0
[  258.400336]  [<ffffffff8109264d>] ? mark_lock+0x2dd/0x330
[  258.400336]  [<ffffffff81093bfd>] ? __lock_acquire+0x4bd/0x1cc0
[  258.400336]  [<ffffffff8103adfc>] ? update_curr+0xfc/0x190
[  258.400336]  [<ffffffff810959ee>] lock_acquire+0x8e/0x120
[  258.400336]  [<ffffffff8103adfc>] ? update_curr+0xfc/0x190
[  258.400336]  [<ffffffff81466e5c>] _raw_spin_lock+0x2c/0x40
[  258.400336]  [<ffffffff8103adfc>] ? update_curr+0xfc/0x190
[  258.400336]  [<ffffffff8103adfc>] update_curr+0xfc/0x190
[  258.400336]  [<ffffffff8103b22d>] task_tick_fair+0x2d/0x140
[  258.400336]  [<ffffffff81041b0f>] scheduler_tick+0xcf/0x280
[  258.400336]  [<ffffffff8105a439>] update_process_times+0x69/0x80
[  258.400336]  [<ffffffff8108e0cf>] tick_sched_timer+0x5f/0xc0
[  258.400336]  [<ffffffff81071339>] __run_hrtimer+0x79/0x1f0
[  258.400336]  [<ffffffff8108e070>] ? tick_nohz_handler+0x100/0x100
[  258.400336]  [<ffffffff81071ce3>] hrtimer_interrupt+0xf3/0x220
[  258.400336]  [<ffffffff8101daa4>] smp_apic_timer_interrupt+0x64/0xa0
[  258.400336]  [<ffffffff8146f9d3>] apic_timer_interrupt+0x13/0x20
[  258.400336]  <EOI>  [<ffffffff814674e0>] ? _raw_spin_unlock_irq+0x30/0x40
[  258.400336]  [<ffffffff8107092d>] update_rlimit_cpu+0x4d/0x60
[  258.400336]  [<ffffffff81062c0e>] do_prlimit+0x19e/0x240
[  258.400336]  [<ffffffff81063008>] sys_setrlimit+0x48/0x60
[  258.400336]  [<ffffffff8146efbb>] system_call_fastpath+0x16/0x1b

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
       [not found]                               ` <CAPQyPG5HJKTo8AEy_khdJeciTgtNQepK6XLcpzvPF8PYS0V-Lw@mail.gmail.com>
@ 2011-10-25  7:33                                 ` Pawel Sikora
  0 siblings, 0 replies; 43+ messages in thread
From: Pawel Sikora @ 2011-10-25  7:33 UTC (permalink / raw)
  To: Nai Xia; +Cc: linux-kernel, akpm, aarcange, mgorman, hughd, torvalds

On Tuesday 25 of October 2011 12:21:30 Nai Xia wrote:
> 2011/10/23 Paweł Sikora <pluto@agmk.net>:
> > On Saturday 22 of October 2011 08:21:23 Nai Xia wrote:
> >> On Saturday 22 October 2011 05:36:46 Paweł Sikora wrote:
> >> > On Friday 21 of October 2011 11:07:56 Nai Xia wrote:
> >> > > On Fri, Oct 21, 2011 at 4:07 PM, Pawel Sikora <pluto@agmk.net> wrote:
> >> > > > On Friday 21 of October 2011 14:22:37 Nai Xia wrote:
> >> > > >
> >> > > >> And as a side note. Since I notice that Pawel's workload may include OOM,
> >> > > >
> >> > > > my last tests on a patched (3.0.4 + migrate.c fix + vserver) kernel produce full cpu load
> >> > > > on dual 8-core opterons, like on this htop screenshot -> http://pluto.agmk.net/kernel/screen1.png
> >> > > > afaics all userspace applications usually don't use more than half of physical memory,
> >> > > > and the so-called "cache" on the htop bar doesn't reach 100%.
> >> > >
> >> > > OK, did you log any OOM killing when there was a memory usage burst?
> >> > > But, well, my OOM reasoning above is a direct shortcut to the
> >> > > imagined root cause of the "adjacent VMAs which should have been
> >> > > merged but in fact were not merged" case. Maybe there are other
> >> > > cases that can lead to this, or maybe it's a totally different bug....
> >> >
> >> > i don't see any OOM killing with my conservative settings
> >> > (vm.overcommit_memory=2, vm.overcommit_ratio=100).
> >>
> >> OK, that does not matter now. Andrea showed us a simpler way to get to
> >> this bug.
> >>
> >> >
> >> > > But I still think that if my reasoning is sound, similar bad things
> >> > > will happen again some time in the future, even if that was not your
> >> > > case here...
> >> > >
> >> > > >
> >> > > > the patched kernel with CONFIG_TRANSPARENT_HUGEPAGE disabled (new in 2.6.38)
> >> > > > died at night, so now i'm also going to disable CONFIG_COMPACTION/MIGRATION
> >> > > > in the next steps and stress this machine again...
> >> > >
> >> > > OK, it's smart to narrow down the range first....
> >> >
> >> > disabling hugepage/compacting didn't help, but disabling hugepage/compacting/migration
> >> > has kept the opterons stable for ~9h so far. userspace uses ~40GB (of 64) ram, caches
> >> > reach 100% on the htop bar, average load ~16. i wonder if it will survive the weekend...
> >> >
> >>
> >> Maybe you should give Andrea's latest anon_vma_order_tail() patch another shot. :)
> >>
> >
> > all my attempts at disabling thp/compaction/migration failed (machine locked).
> > now i'm testing 3.0.7 + vserver + Hugh's + Andrea's patches with a few kernel debug options enabled.
> 
> Have you got the result of this patch combination by now?

yes, this combination has been working *stable* for ~2 days so far (under heavy stress).

moreover, i've isolated/reported the faulty code in the vserver patch that causes
cryptic deadlocks on 2.6.38+ kernels: http://list.linux-vserver.org/archive?msp:5420:mdaibmimlbgoligkjdma

> I have no clue about the locking below; indeed, it seems like another bug......

this might be fixed in 3.0.8 (https://lkml.org/lkml/2011/10/23/26); i'll test it soon...

> >
> > so far it has logged only something unrelated to the memory management subsystem:
> >
> > [  258.397014] =======================================================
> > [  258.397209] [ INFO: possible circular locking dependency detected ]
> > [  258.397311] 3.0.7-vs2.3.1-dirty #1
> > [  258.397402] -------------------------------------------------------
> > [  258.397503] slave_odra_g_00/19432 is trying to acquire lock:
> > [  258.397603]  (&(&sig->cputimer.lock)->rlock){-.....}, at: [<ffffffff8103adfc>] update_curr+0xfc/0x190

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-10-22  5:07             ` Nai Xia
@ 2011-10-31 16:34               ` Andrea Arcangeli
  0 siblings, 0 replies; 43+ messages in thread
From: Andrea Arcangeli @ 2011-10-31 16:34 UTC (permalink / raw)
  To: Nai Xia
  Cc: Mel Gorman, Hugh Dickins, Pawel Sikora, Andrew Morton, linux-mm,
	jpiszcz, arekm, linux-kernel

Hi Nai,

On Sat, Oct 22, 2011 at 01:07:11PM +0800, Nai Xia wrote:
> Yeah, the anon_vma root lock is a big lock. And JFYI, I am actually
> doing some very nasty hacking on anon_vma, and one of the side effects
> is breaking the root lock into pieces. But this area is pretty
> convoluted, with many race conditions. I hope some day I will finally
> make my patch work and have your precious review of it. :-)

:) It's not going to be trivial. Initially it was not a shared lock,
but it wasn't safe that way (especially since migrate requires a
reliable rmap_walk), and using a shared lock across all the
same_anon_vma/same_vma lists was the only way to be safe and solve the
races.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110
  2011-10-22  5:52               ` Nai Xia
@ 2011-10-31 17:14                 ` Andrea Arcangeli
  0 siblings, 0 replies; 43+ messages in thread
From: Andrea Arcangeli @ 2011-10-31 17:14 UTC (permalink / raw)
  To: Nai Xia
  Cc: Mel Gorman, Hugh Dickins, Pawel Sikora, Andrew Morton, linux-mm,
	jpiszcz, arekm, linux-kernel

On Sat, Oct 22, 2011 at 01:52:22PM +0800, Nai Xia wrote:
> BTW, I am curious which benchmark you ran, and what the "x10 boost"
> means compared to Hugh's anon_vma locking fix?

I was referring to the mremap optimizations I pushed in -mm.

> This patch, together with the reasoning, looks good to me.
> But I wonder whether this patch makes the anon_vma chain ordering game
> more complex and harder to play in the future.

Well, we don't know yet what the future will bring... at least this adds
some documentation of the fact that the order matters for
fork/mremap/migrate/split_huge_page. As far as I can tell, those are the
4 pieces of the VM where the rmap_walk order matters. And
split_huge_page and migrate are the only two where, if the rmap_walk
fails, we can't safely continue and have to BUG_ON.

> However, if it does bring a significant performance benefit, I vote for
> this patch, because it balances all three requirements here: bug free,
> performance, and no two VMAs staying unmerged for no good reason.

I suppose it should bring an SMP performance benefit, as the critical
section is reduced, but we'll have to do some more list_del/add_tail
operations than if we took the global lock...

> Our situation again gives me the strong feeling that we are really in
> dire need of a computer-aided way to traverse the whole space of
> possible states. There are some people around me who do research on
> automatic software testing, but I am afraid our problem is too much
> "real world" for them... sigh...

Also the code changes too fast for that...

I'll send the patch again with signoff.

^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2011-10-31 17:56 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-08-15 16:43 kernel 3.0: BUG: soft lockup: find_get_pages+0x51/0x110 Justin Piszcz
2011-08-15 18:18 ` Hugh Dickins
2011-08-15 19:02   ` Justin Piszcz
2011-08-15 19:53     ` Hugh Dickins
2011-08-21 17:59 ` Maciej Rutecki
2011-08-21 18:49   ` Justin Piszcz
  -- strict thread matches above, loose matches on Subject: below --
2011-10-12 18:12 Paweł Sikora
2011-10-13 23:16 ` Hugh Dickins
2011-10-13 23:30   ` Hugh Dickins
2011-10-16 16:11     ` Christoph Hellwig
2011-10-16 23:54     ` Andrea Arcangeli
2011-10-17 18:51       ` Hugh Dickins
2011-10-17 22:05         ` Andrea Arcangeli
2011-10-19  7:43         ` Mel Gorman
2011-10-19 13:39           ` Linus Torvalds
2011-10-19 19:42             ` Hugh Dickins
2011-10-20  6:30               ` Paweł Sikora
2011-10-20  6:51                 ` Linus Torvalds
2011-10-21  6:54                 ` Nai Xia
2011-10-21  7:35                   ` Pawel Sikora
2011-10-20 12:51               ` Nai Xia
     [not found]                 ` <CANsGZ6a6_q8+88FRV2froBsVEq7GhtKd9fRnB-0M2MD3a7tnSw@mail.gmail.com>
2011-10-21  6:22                   ` Nai Xia
2011-10-21  8:07                     ` Pawel Sikora
2011-10-21  9:07                       ` Nai Xia
2011-10-21 21:36                         ` Paweł Sikora
2011-10-22  6:21                           ` Nai Xia
2011-10-22 16:42                             ` Paweł Sikora
     [not found]                               ` <CAPQyPG5HJKTo8AEy_khdJeciTgtNQepK6XLcpzvPF8PYS0V-Lw@mail.gmail.com>
2011-10-25  7:33                                 ` Pawel Sikora
2011-10-20  9:11       ` Nai Xia
2011-10-21 15:56         ` Mel Gorman
2011-10-21 17:21           ` Nai Xia
2011-10-21 17:41           ` Andrea Arcangeli
2011-10-21 22:50             ` Andrea Arcangeli
2011-10-22  5:52               ` Nai Xia
2011-10-31 17:14                 ` Andrea Arcangeli
2011-10-22  5:07             ` Nai Xia
2011-10-31 16:34               ` Andrea Arcangeli
2011-10-16 22:37   ` Linus Torvalds
2011-10-17  3:02     ` Hugh Dickins
2011-10-17  3:09       ` Linus Torvalds
2011-10-18 19:17   ` Paweł Sikora
2011-10-19  7:30   ` Mel Gorman
2011-10-21 12:44     ` Mel Gorman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).