Linux-mm Archive on lore.kernel.org

* Yet another page allocation stall on 4.9
From: Cong Wang @ 2017-05-24 22:55 UTC
  To: linux-mm

Hello, mm experts


I know there are at least two similar reports of page allocation stalls
on 4.9, but I am not sure whether they all have the same cause, nor could
I find any fix for the problem.

Below is the one we got when running the LTP memcg_stress test with 150
memcg groups, each with 0.5g of memory, on a 64G memory host. So far, this
is not reproducible at all.

Please let me know if I can provide any other information you need.

Thanks.

[16211.987039]  [<ffffffff86395ab7>] dump_stack+0x4d/0x66
[16211.997600]  [<ffffffff8619a6c6>] warn_alloc+0x116/0x130
[16212.017235]  [<ffffffff8619b0bf>] __alloc_pages_slowpath+0x96f/0xbd0
[16212.037413]  [<ffffffff8619b4f6>] __alloc_pages_nodemask+0x1d6/0x230
[16212.057215]  [<ffffffff861e61d5>] alloc_pages_current+0x95/0x140
[16212.077023]  [<ffffffff8619114a>] __page_cache_alloc+0xca/0xe0
[16212.087943]  [<ffffffff86194312>] filemap_fault+0x312/0x4d0
[16212.107591]  [<ffffffff862a8196>] ext4_filemap_fault+0x36/0x50
[16212.127232]  [<ffffffff861c2c71>] __do_fault+0x71/0x130
[16212.146862]  [<ffffffff861c6cec>] handle_mm_fault+0xebc/0x13a0
[16212.166836]  [<ffffffff860509e4>] __do_page_fault+0x254/0x4a0
[16212.177664]  [<ffffffff86050c50>] do_page_fault+0x20/0x70
[16212.197438]  [<ffffffff86700aa2>] page_fault+0x22/0x30
[16212.217026] CPU: 4 PID: 3872 Comm: scribed Not tainted
4.9.23.el7.twitter.x86_64 #1
[16212.217035] Mem-Info:
[16212.217041] active_anon:16069537 inactive_anon:5561 isolated_anon:0
[16212.217041]  active_file:1301 inactive_file:1449 isolated_file:0
[16212.217041]  unevictable:0 dirty:0 writeback:0 unstable:0
[16212.217041]  slab_reclaimable:22962 slab_unreclaimable:79806
[16212.217041]  mapped:6434 shmem:6365 pagetables:34668 bounce:0
[16212.217041]  free:161016 free_pcp:955 free_cma:0
[16212.217047] Node 0 active_anon:31718548kB inactive_anon:8728kB
active_file:4988kB inactive_file:5584kB unevictable:0kB
isolated(anon):0kB isolated(file):0kB mapped:11524kB dirty:0kB
writeback:0kB shmem:8832kB shmem_thp: 0kB shmem_pmdmapped: 0kB
anon_thp: 0kB writeback_tmp:0kB unstable:0kB pages_scanned:0
all_unreclaimable? no
[16212.217051] Node 1 active_anon:32559600kB inactive_anon:13516kB
active_file:216kB inactive_file:212kB unevictable:0kB
isolated(anon):0kB isolated(file):0kB mapped:14208kB dirty:0kB
writeback:0kB shmem:16628kB shmem_thp: 0kB shmem_pmdmapped: 0kB
anon_thp: 0kB writeback_tmp:0kB unstable:0kB pages_scanned:683486
all_unreclaimable? yes
[16212.217056] Node 0 DMA free:15888kB min:20kB low:32kB high:44kB
active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB
unevictable:0kB writepending:0kB present:15972kB managed:15888kB
mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB
kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB
free_cma:0kB
[16212.217058] lowmem_reserve[]: 0 1903 32095 32095
[16212.217062] Node 0 DMA32 free:123344kB min:2668kB low:4616kB
high:6564kB active_anon:1820276kB inactive_anon:0kB active_file:0kB
inactive_file:0kB unevictable:0kB writepending:0kB present:2015240kB
managed:1949672kB mlocked:0kB slab_reclaimable:64kB
slab_unreclaimable:2216kB kernel_stack:0kB pagetables:3544kB
bounce:0kB free_pcp:120kB local_pcp:0kB free_cma:0kB
[16212.217064] lowmem_reserve[]: 0 0 30191 30191
[16212.217068] Node 0 Normal free:460088kB min:42308kB low:73224kB
high:104140kB active_anon:29898272kB inactive_anon:8728kB
active_file:4988kB inactive_file:5584kB unevictable:0kB
writepending:0kB present:31457280kB managed:30916476kB mlocked:0kB
slab_reclaimable:52244kB slab_unreclaimable:172728kB
kernel_stack:5976kB pagetables:64740kB bounce:0kB free_pcp:2588kB
local_pcp:120kB free_cma:0kB
[16212.217070] lowmem_reserve[]: 0 0 0 0
[16212.217074] Node 1 Normal free:44744kB min:45108kB low:78068kB
high:111028kB active_anon:32559600kB inactive_anon:13516kB
active_file:216kB inactive_file:212kB unevictable:0kB writepending:0kB
present:33554432kB managed:32962516kB mlocked:0kB
slab_reclaimable:39540kB slab_unreclaimable:144280kB
kernel_stack:5208kB pagetables:70388kB bounce:0kB free_pcp:1112kB
local_pcp:0kB free_cma:0kB
[16212.217075] lowmem_reserve[]: 0 0 0 0
[16212.217083] Node 0 DMA: 0*4kB 0*8kB 1*16kB (U) 0*32kB 2*64kB (U)
1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M)
= 15888kB
[16212.217093] Node 0 DMA32: 30*4kB (U) 13*8kB (UM) 9*16kB (UM)
17*32kB (UM) 5*64kB (UME) 2*128kB (ME) 2*256kB (UE) 3*512kB (UE)
3*1024kB (UE) 3*2048kB (UME) 27*4096kB (M) = 123344kB
[16212.217101] Node 0 Normal: 1259*4kB (UMEH) 3848*8kB (UMEH)
3612*16kB (UMEH) 3740*32kB (UMEH) 3581*64kB (UMEH) 41*128kB (MEH)
12*256kB (UEH) 2*512kB (ME) 2*1024kB (E) 3*2048kB (UME) 0*4096kB =
460012kB
[16212.217109] Node 1 Normal: 520*4kB (UMEH) 135*8kB (UMEH) 69*16kB
(UMEH) 31*32kB (UME) 161*64kB (UMEH) 170*128kB (UM) 29*256kB (U)
0*512kB 0*1024kB 0*2048kB 0*4096kB = 44744kB
[16212.217110] Node 0 hugepages_total=0 hugepages_free=0
hugepages_surp=0 hugepages_size=1048576kB
[16212.217111] Node 0 hugepages_total=0 hugepages_free=0
hugepages_surp=0 hugepages_size=2048kB
[16212.217112] Node 1 hugepages_total=0 hugepages_free=0
hugepages_surp=0 hugepages_size=1048576kB
[16212.217112] Node 1 hugepages_total=0 hugepages_free=0
hugepages_surp=0 hugepages_size=2048kB
[16212.217113] 9058 total pagecache pages
[16212.217114] 0 pages in swap cache
[16212.217114] Swap cache stats: add 0, delete 0, find 0/0
[16212.217115] Free swap  = 0kB
[16212.217115] Total swap = 0kB
[16212.217116] 16760731 pages RAM
[16212.217116] 0 pages HighMem/MovableOnly
[16212.217117] 299593 pages reserved
[16212.217117] 13 pages hwpoisoned
[16213.387131] Hardware name: Dell Inc. PowerEdge C6220/03C9JJ, BIOS
2.2.3 11/07/2013
[16213.407413]  ffffaac5cd0bba88 ffffffff86395ab7 ffffffff86a3b280
0000000000000001
[16213.436908]  ffffaac5cd0bbb08 ffffffff8619a6c6 024201cacd0bbaf0
ffffffff86a3b280
[16213.457248]  ffffaac5cd0bbab0 0100000000000010 ffffaac5cd0bbb18
ffffaac5cd0bbac8
[16213.477525] Call Trace:
[16213.487314]  [<ffffffff86395ab7>] dump_stack+0x4d/0x66
[16213.497723]  [<ffffffff8619a6c6>] warn_alloc+0x116/0x130
[16213.505627] NMI watchdog: BUG: soft lockup - CPU#5 stuck for 23s!
[cleanup:7598]
[16213.505710] Modules linked in: dummy veth tun xfs libcrc32c
intel_rapl sb_edac edac_core x86_pkg_temp_thermal coretemp
crct10dif_pclmul crc32_pclmul iTCO_wdt iTCO_vendor_support dcdbas
ghash_clmulni_intel lpc_ich i2c_i801 hed wmi i2c_smbus shpchp i2c_core
ioatdma dca acpi_cpufreq tcp_diag inet_diag ipmi_si ipmi_devintf
ipmi_msghandler sch_fq_codel mlx4_en ptp pps_core crc32c_intel
mlx4_core devlink ipv6 crc_ccitt
[16213.505713] CPU: 5 PID: 7598 Comm: cleanup Not tainted
4.9.23.el7.twitter.x86_64 #1
[16213.505714] Hardware name: Dell Inc. PowerEdge C6220/03C9JJ, BIOS
2.2.3 11/07/2013
[16213.505717] task: ffff8af1bf098000 task.stack: ffffaac5e60c8000
[16213.505722] RIP: 0010:[<ffffffff86395a93>]  [<ffffffff86395a93>]
dump_stack+0x29/0x66
[16213.505724] RSP: 0000:ffffaac5e60cba78  EFLAGS: 00000286
[16213.505727] RAX: 0000000000000004 RBX: 0000000000000286 RCX:
00000000ffffffff
[16213.505729] RDX: 0000000000000005 RSI: 0000000000000292 RDI:
ffffffff86c51be0
[16213.505730] RBP: ffffaac5e60cba88 R08: 0000000000000000 R09:
0000000000000031
[16213.505733] R10: 0000000000000000 R11: 0000000003cd5438 R12:
0000000000000001
[16213.505737] R13: ffffffff86d43c80 R14: ffff8af1bf098000 R15:
ffffaac5e60cbc40
[16213.505742] FS:  00007ff77f22e840(0000) GS:ffff8af1dfb40000(0000)
knlGS:0000000000000000
[16213.505746] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[16213.505748] CR2: 00007fe7603d0650 CR3: 000000022db50000 CR4:
00000000000406e0
[16213.505750] Stack:
[16213.505767]  ffffffff86a3b280 0000000000000001 ffffaac5e60cbb08
ffffffff8619a6c6
[16213.505780]  024201cae60cbaf0 ffffffff86a3b280 ffffaac5e60cbab0
0100000000000010
[16213.505792]  ffffaac5e60cbb18 ffffaac5e60cbac8 ffff8af1bf098000
0000000000000000
[16213.505795] Call Trace:
[16213.505799]  [<ffffffff8619a6c6>] warn_alloc+0x116/0x130
[16213.505804]  [<ffffffff8619b0bf>] __alloc_pages_slowpath+0x96f/0xbd0
[16213.505807]  [<ffffffff8619b4f6>] __alloc_pages_nodemask+0x1d6/0x230
[16213.505810]  [<ffffffff861e61d5>] alloc_pages_current+0x95/0x140
[16213.505814]  [<ffffffff8619114a>] __page_cache_alloc+0xca/0xe0
[16213.505822]  [<ffffffff86194312>] filemap_fault+0x312/0x4d0
[16213.505826]  [<ffffffff862a8196>] ext4_filemap_fault+0x36/0x50
[16213.505828]  [<ffffffff861c2c71>] __do_fault+0x71/0x130
[16213.505830]  [<ffffffff861c6cec>] handle_mm_fault+0xebc/0x13a0
[16213.505832]  [<ffffffff863b434d>] ? list_del+0xd/0x30
[16213.505833]  [<ffffffff86257e58>] ? ep_poll+0x308/0x320
[16213.505835]  [<ffffffff860509e4>] __do_page_fault+0x254/0x4a0
[16213.505837]  [<ffffffff86050c50>] do_page_fault+0x20/0x70
[16213.505839]  [<ffffffff86700aa2>] page_fault+0x22/0x30
[16213.505877] Code: 5d c3 55 83 c9 ff 48 89 e5 41 54 53 9c 5b fa 65
8b 15 4a 47 c7 79 89 c8 f0 0f b1 15 48 a3 92 00 83 f8 ff 74 0a 39 c2
74 0b 53 9d <f3> 90 eb dd 45 31 e4 eb 06 41 bc 01 00 00 00 48 c7 c7 41
1a a2
[16214.250659] NMI watchdog: BUG: soft lockup - CPU#17 stuck for 22s!
[scribed:3905]
[16214.250762] Modules linked in: dummy veth tun xfs libcrc32c
intel_rapl sb_edac edac_core x86_pkg_temp_thermal coretemp
crct10dif_pclmul crc32_pclmul iTCO_wdt iTCO_vendor_support dcdbas
ghash_clmulni_intel lpc_ich i2c_i801 hed wmi i2c_smbus shpchp i2c_core
ioatdma dca acpi_cpufreq tcp_diag inet_diag ipmi_si ipmi_devintf
ipmi_msghandler sch_fq_codel mlx4_en ptp pps_core crc32c_intel
mlx4_core devlink ipv6 crc_ccitt
[16214.250765] CPU: 17 PID: 3905 Comm: scribed Tainted: G
L  4.9.23.el7.twitter.x86_64 #1
[16214.250767] Hardware name: Dell Inc. PowerEdge C6220/03C9JJ, BIOS
2.2.3 11/07/2013
[16214.250770] task: ffff8af9cb938000 task.stack: ffffaac5cd1c0000
[16214.250776] RIP: 0010:[<ffffffff86395a93>]  [<ffffffff86395a93>]
dump_stack+0x29/0x66
[16214.250778] RSP: 0000:ffffaac5cd1c3a78  EFLAGS: 00000286
[16214.250781] RAX: 0000000000000004 RBX: 0000000000000286 RCX:
00000000ffffffff
[16214.250783] RDX: 0000000000000011 RSI: 0000000000000292 RDI:
ffffffff86c51be0
[16214.250787] RBP: ffffaac5cd1c3a88 R08: 0000000000000000 R09:
0000000000000031
[16214.250789] R10: 0000000000000000 R11: 0000000003cd574c R12:
0000000000000001
[16214.250791] R13: ffffffff86d43c80 R14: ffff8af9cb938000 R15:
ffffaac5cd1c3c40
[16214.250793] FS:  00007fc4c17fa700(0000) GS:ffff8af1dfcc0000(0000)
knlGS:0000000000000000
[16214.250795] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[16214.250798] CR2: 00007f14196cb650 CR3: 000000083d948000 CR4:
00000000000406e0
[16214.250800] Stack:
[16214.250815]  ffffffff86a3b280 0000000000000001 ffffaac5cd1c3b08
ffffffff8619a6c6
[16214.250827]  024201ca860f0940 ffffffff86a3b280 ffffaac5cd1c3ab0
0000000000000010
[16214.250838]  ffffaac5cd1c3b18 ffffaac5cd1c3ac8 000000000000000f
0000000000000000
[16214.250840] Call Trace:
[16214.250843]  [<ffffffff8619a6c6>] warn_alloc+0x116/0x130
[16214.250846]  [<ffffffff8619b0bf>] __alloc_pages_slowpath+0x96f/0xbd0
[16214.250849]  [<ffffffff8619b4f6>] __alloc_pages_nodemask+0x1d6/0x230
[16214.250853]  [<ffffffff861e61d5>] alloc_pages_current+0x95/0x140
[16214.250855]  [<ffffffff8619114a>] __page_cache_alloc+0xca/0xe0
[16214.250857]  [<ffffffff86194312>] filemap_fault+0x312/0x4d0
[16214.250859]  [<ffffffff862a8196>] ext4_filemap_fault+0x36/0x50
[16214.250860]  [<ffffffff861c2c71>] __do_fault+0x71/0x130
[16214.250863]  [<ffffffff861c6cec>] handle_mm_fault+0xebc/0x13a0
[16214.250865]  [<ffffffff860c25c1>] ? pick_next_task_fair+0x471/0x4a0
[16214.250869]  [<ffffffff860509e4>] __do_page_fault+0x254/0x4a0
[16214.250871]  [<ffffffff86050c50>] do_page_fault+0x20/0x70
[16214.250873]  [<ffffffff86700aa2>] page_fault+0x22/0x30
[16214.250918] Code: 5d c3 55 83 c9 ff 48 89 e5 41 54 53 9c 5b fa 65
8b 15 4a 47 c7 79 89 c8 f0 0f b1 15 48 a3 92 00 83 f8 ff 74 0a 39 c2
74 0b 53 9d <f3> 90 eb dd 45 31 e4 eb 06 41 bc 01 00 00 00 48 c7 c7 41
1a a2
[16215.157526]  [<ffffffff8619b0bf>] __alloc_pages_slowpath+0x96f/0xbd0
[16215.177523]  [<ffffffff8619b4f6>] __alloc_pages_nodemask+0x1d6/0x230
[16215.197540]  [<ffffffff861e61d5>] alloc_pages_current+0x95/0x140
[16215.217331]  [<ffffffff8619114a>] __page_cache_alloc+0xca/0xe0
[16215.237374]  [<ffffffff86194312>] filemap_fault+0x312/0x4d0
[16215.257136]  [<ffffffff862a8196>] ext4_filemap_fault+0x36/0x50
[16215.276950]  [<ffffffff861c2c71>] __do_fault+0x71/0x130
[16215.287555]  [<ffffffff861c6cec>] handle_mm_fault+0xebc/0x13a0
[16215.307538]  [<ffffffff860509e4>] __do_page_fault+0x254/0x4a0
[16215.327165]  [<ffffffff86050c50>] do_page_fault+0x20/0x70
[16215.346964]  [<ffffffff86700aa2>] page_fault+0x22/0x30
[16215.357554] CPU: 20 PID: 7812 Comm: proxymap Tainted: G
L  4.9.23.el7.twitter.x86_64 #1
[16215.357557] Mem-Info:
[16215.357563] active_anon:16069475 inactive_anon:5560 isolated_anon:0
[16215.357563]  active_file:1319 inactive_file:1356 isolated_file:0
[16215.357563]  unevictable:0 dirty:0 writeback:0 unstable:0
[16215.357563]  slab_reclaimable:22986 slab_unreclaimable:79813
[16215.357563]  mapped:6433 shmem:6364 pagetables:34697 bounce:0
[16215.357563]  free:160997 free_pcp:1010 free_cma:0
[16215.357567] Node 0 active_anon:31718344kB inactive_anon:8724kB
active_file:5052kB inactive_file:5200kB unevictable:0kB
isolated(anon):0kB isolated(file):0kB mapped:11524kB dirty:0kB
writeback:0kB shmem:8828kB shmem_thp: 0kB shmem_pmdmapped: 0kB
anon_thp: 0kB writeback_tmp:0kB unstable:0kB pages_scanned:39
all_unreclaimable? no
[16215.357571] Node 1 active_anon:32559556kB inactive_anon:13516kB
active_file:224kB inactive_file:224kB unevictable:0kB
isolated(anon):0kB isolated(file):0kB mapped:14208kB dirty:0kB
writeback:0kB shmem:16628kB shmem_thp: 0kB shmem_pmdmapped: 0kB
anon_thp: 0kB writeback_tmp:0kB unstable:0kB pages_scanned:683486
all_unreclaimable? yes
[16215.357574] Node 0 DMA free:15888kB min:20kB low:32kB high:44kB
active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB
unevictable:0kB writepending:0kB present:15972kB managed:15888kB
mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB
kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB
free_cma:0kB
[16215.357576] lowmem_reserve[]: 0 1903 32095 32095
[16215.357579] Node 0 DMA32 free:123344kB min:2668kB low:4616kB
high:6564kB active_anon:1820276kB inactive_anon:0kB active_file:0kB
inactive_file:0kB unevictable:0kB writepending:0kB present:2015240kB
managed:1949672kB mlocked:0kB slab_reclaimable:64kB
slab_unreclaimable:2216kB kernel_stack:0kB pagetables:3544kB
bounce:0kB free_pcp:120kB local_pcp:0kB free_cma:0kB
[16215.357580] lowmem_reserve[]: 0 0 30191 30191
[16215.357584] Node 0 Normal free:460012kB min:42308kB low:73224kB
high:104140kB active_anon:29898068kB inactive_anon:8724kB
active_file:5052kB inactive_file:5200kB unevictable:0kB
writepending:0kB present:31457280kB managed:30916476kB mlocked:0kB
slab_reclaimable:52340kB slab_unreclaimable:172756kB
kernel_stack:5992kB pagetables:64856kB bounce:0kB free_pcp:2808kB
local_pcp:116kB free_cma:0kB
[16215.357585] lowmem_reserve[]: 0 0 0 0
[16215.357588] Node 1 Normal free:44744kB min:45108kB low:78068kB
high:111028kB active_anon:32559556kB inactive_anon:13516kB
active_file:224kB inactive_file:224kB unevictable:0kB writepending:0kB
present:33554432kB managed:32962516kB mlocked:0kB
slab_reclaimable:39540kB slab_unreclaimable:144280kB
kernel_stack:5224kB pagetables:70388kB bounce:0kB free_pcp:1112kB
local_pcp:0kB free_cma:0kB
[16215.357589] lowmem_reserve[]: 0 0 0 0
[16215.357595] Node 0 DMA: 0*4kB 0*8kB 1*16kB (U) 0*32kB 2*64kB (U)
1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M)
= 15888kB
[16215.357602] Node 0 DMA32: 30*4kB (U) 13*8kB (UM) 9*16kB (UM)
17*32kB (UM) 5*64kB (UME) 2*128kB (ME) 2*256kB (UE) 3*512kB (UE)
3*1024kB (UE) 3*2048kB (UME) 27*4096kB (M) = 123344kB
[16215.357609] Node 0 Normal: 1259*4kB (UMEH) 3848*8kB (UMEH)
3612*16kB (UMEH) 3740*32kB (UMEH) 3581*64kB (UMEH) 41*128kB (MEH)
12*256kB (UEH) 2*512kB (ME) 2*1024kB (E) 3*2048kB (UME) 0*4096kB =
460012kB
[16215.357614] Node 1 Normal: 520*4kB (UMEH) 135*8kB (UMEH) 69*16kB
(UMEH) 31*32kB (UME) 161*64kB (UMEH) 170*128kB (UM) 29*256kB (U)
0*512kB 0*1024kB 0*2048kB 0*4096kB = 44744kB
[16215.357615] Node 0 hugepages_total=0 hugepages_free=0
hugepages_surp=0 hugepages_size=1048576kB
[16215.357616] Node 0 hugepages_total=0 hugepages_free=0
hugepages_surp=0 hugepages_size=2048kB
[16215.357617] Node 1 hugepages_total=0 hugepages_free=0
hugepages_surp=0 hugepages_size=1048576kB
[16215.357618] Node 1 hugepages_total=0 hugepages_free=0
hugepages_surp=0 hugepages_size=2048kB
[16215.357618] 9075 total pagecache pages
[16215.357619] 0 pages in swap cache
[16215.357619] Swap cache stats: add 0, delete 0, find 0/0
[16215.357620] Free swap  = 0kB
[16215.357620] Total swap = 0kB
[16215.357620] 16760731 pages RAM
[16215.357621] 0 pages HighMem/MovableOnly
[16215.357621] 299593 pages reserved
[16215.357621] 13 pages hwpoisoned
[16216.520770] warn_alloc: 5 callbacks suppressed
[16216.520775] scribed: page allocation stalls for 35691ms, order:0,
mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD)
[16216.587564] Hardware name: Dell Inc. PowerEdge C6220/03C9JJ, BIOS
2.2.3 11/07/2013
[16216.607766]  ffffaac5e6403a88 ffffffff86395ab7 ffffffff86a3b280
0000000000000001[16216.631514] memcg_process_s:
[16216.631519] page allocation stalls for 31710ms, order:0,
mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD)
[16216.667939]  ffffaac5e6403b08 ffffffff8619a6c6 024201cae6403af0
ffffffff86a3b280
[16216.697354]  ffffaac5e6403ab0 0100000000000010 ffffaac5e6403b18
ffffaac5e6403ac8
[16216.717761] Call Trace:
[16216.727390]  [<ffffffff86395ab7>] dump_stack+0x4d/0x66
[16216.746985]  [<ffffffff8619a6c6>] warn_alloc+0x116/0x130
[16216.757761]  [<ffffffff8619b0bf>] __alloc_pages_slowpath+0x96f/0xbd0
[16216.777571]  [<ffffffff8619b4f6>] __alloc_pages_nodemask+0x1d6/0x230
[16216.797558]  [<ffffffff861e61d5>] alloc_pages_current+0x95/0x140
[16216.817570]  [<ffffffff8619114a>] __page_cache_alloc+0xca/0xe0
[16216.837349]  [<ffffffff86194312>] filemap_fault+0x312/0x4d0
[16216.854787] scribed: page allocation stalls for 35977ms, order:0,
mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD)
[16216.887239]  [<ffffffff862a8196>] ext4_filemap_fault+0x36/0x50
[16216.907027]  [<ffffffff861c2c71>] __do_fault+0x71/0x130
[16216.917933]  [<ffffffff861c6cec>] handle_mm_fault+0xebc/0x13a0
[16216.937650]  [<ffffffff863b434d>] ? list_del+0xd/0x30
[16216.957041]  [<ffffffff86257e58>] ? ep_poll+0x308/0x320
[16216.967687]  [<ffffffff860509e4>] __do_page_fault+0x254/0x4a0
[16216.984835] scribed: page allocation stalls for 36056ms, order:0,
mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD)
[16217.017608]  [<ffffffff86050c50>] do_page_fault+0x20/0x70
[16217.037351]  [<ffffffff86700aa2>] page_fault+0x22/0x30
[16217.047932] CPU: 8 PID: 827 Comm: crond Tainted: G             L
4.9.23.el7.twitter.x86_64 #1
[16217.047945] Mem-Info:
[16217.047955] active_anon:16071724 inactive_anon:3194 isolated_anon:0
[16217.047955]  active_file:1617 inactive_file:843 isolated_file:0
[16217.047955]  unevictable:0 dirty:0 writeback:0 unstable:0
[16217.047955]  slab_reclaimable:22986 slab_unreclaimable:79813
[16217.047955]  mapped:5722 shmem:6364 pagetables:34673 bounce:0
[16217.047955]  free:161140 free_pcp:1145 free_cma:0
[16217.047961] Node 0 active_anon:31722972kB inactive_anon:3628kB
active_file:6244kB inactive_file:3184kB unevictable:0kB
isolated(anon):0kB isolated(file):0kB mapped:9044kB dirty:0kB
writeback:0kB shmem:8828kB shmem_thp: 0kB shmem_pmdmapped: 0kB
anon_thp: 0kB writeback_tmp:0kB unstable:0kB pages_scanned:29
all_unreclaimable? no
[16217.047966] Node 1 active_anon:32563924kB inactive_anon:9148kB
active_file:224kB inactive_file:188kB unevictable:0kB
isolated(anon):0kB isolated(file):0kB mapped:13844kB dirty:0kB
writeback:0kB shmem:16628kB shmem_thp: 0kB shmem_pmdmapped: 0kB
anon_thp: 0kB writeback_tmp:0kB unstable:0kB pages_scanned:686021
all_unreclaimable? yes
[16217.047970] Node 0 DMA free:15888kB min:20kB low:32kB high:44kB
active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB
unevictable:0kB writepending:0kB present:15972kB managed:15888kB
mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB
kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB
free_cma:0kB
[16217.047972] lowmem_reserve[]: 0 1903 32095 32095
[16217.047976] Node 0 DMA32 free:123344kB min:2668kB low:4616kB
high:6564kB active_anon:1820276kB inactive_anon:0kB active_file:0kB
inactive_file:0kB unevictable:0kB writepending:0kB present:2015240kB
managed:1949672kB mlocked:0kB slab_reclaimable:64kB
slab_unreclaimable:2216kB kernel_stack:0kB pagetables:3544kB
bounce:0kB free_pcp:120kB local_pcp:0kB free_cma:0kB
[16217.047978] lowmem_reserve[]: 0 0 30191 30191
[16217.047982] Node 0 Normal free:460584kB min:42308kB low:73224kB
high:104140kB active_anon:29902696kB inactive_anon:3628kB
active_file:6244kB inactive_file:3184kB unevictable:0kB
writepending:0kB present:31457280kB managed:30916476kB mlocked:0kB
slab_reclaimable:52340kB slab_unreclaimable:172756kB
kernel_stack:5976kB pagetables:64760kB bounce:0kB free_pcp:3204kB
local_pcp:0kB free_cma:0kB
[16217.047984] lowmem_reserve[]: 0 0 0 0
[16217.047988] Node 1 Normal free:44744kB min:45108kB low:78068kB
high:111028kB active_anon:32563924kB inactive_anon:9148kB
active_file:224kB inactive_file:188kB unevictable:0kB writepending:0kB
present:33554432kB managed:32962516kB mlocked:0kB
slab_reclaimable:39540kB slab_unreclaimable:144280kB
kernel_stack:5224kB pagetables:70388kB bounce:0kB free_pcp:1256kB
local_pcp:116kB free_cma:0kB
[16217.047989] lowmem_reserve[]: 0 0 0 0
[16217.047997] Node 0 DMA: 0*4kB 0*8kB 1*16kB (U) 0*32kB 2*64kB (U)
1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M)
= 15888kB
[16217.048007] Node 0 DMA32: 30*4kB (U) 13*8kB (UM) 9*16kB (UM)
17*32kB (UM) 5*64kB (UME) 2*128kB (ME) 2*256kB (UE) 3*512kB (UE)
3*1024kB (UE) 3*2048kB (UME) 27*4096kB (M) = 123344kB
[16217.048015] Node 0 Normal: 1352*4kB (UMEH) 3846*8kB (UMEH)
3610*16kB (UMEH) 3742*32kB (UMEH) 3581*64kB (UMEH) 41*128kB (MEH)
12*256kB (UEH) 2*512kB (ME) 2*1024kB (E) 3*2048kB (UME) 0*4096kB =
460400kB
[16217.048023] Node 1 Normal: 520*4kB (UMEH) 135*8kB (UMEH) 69*16kB
(UMEH) 31*32kB (UME) 161*64kB (UMEH) 170*128kB (UM) 29*256kB (U)
0*512kB 0*1024kB 0*2048kB 0*4096kB = 44744kB
[16217.048025] Node 0 hugepages_total=0 hugepages_free=0
hugepages_surp=0 hugepages_size=1048576kB
[16217.048026] Node 0 hugepages_total=0 hugepages_free=0
hugepages_surp=0 hugepages_size=2048kB
[16217.048026] Node 1 hugepages_total=0 hugepages_free=0
hugepages_surp=0 hugepages_size=1048576kB
[16217.048027] Node 1 hugepages_total=0 hugepages_free=0
hugepages_surp=0 hugepages_size=2048kB
[16217.048028] 8969 total pagecache pages
[16217.048029] 0 pages in swap cache
[16217.048030] Swap cache stats: add 0, delete 0, find 0/0
[16217.048030] Free swap  = 0kB
[16217.048031] Total swap = 0kB
[16217.048031] 16760731 pages RAM
[16217.048032] 0 pages HighMem/MovableOnly
[16217.048032] 299593 pages reserved
[16217.048033] 13 pages hwpoisoned
[16217.075797] memcg_process_s: page allocation stalls for 32206ms,
order:0, mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD)
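
For readers skimming the dumps: in each Mem-Info block the "Node 1 Normal"
zone reports free:44744kB against min:45108kB, while the other zones sit
above their min watermark. A minimal sketch of pulling that comparison out
of a saved log is below. The names ZONE_RE and zones_below_min are made up
for this illustration, the sample text is copied from the first dump, and
the script only reads the numbers; it is not a diagnosis of the stall.

```python
import re

# Matches the per-zone header of a Mem-Info dump, for example
# "Node 1 Normal free:44744kB min:45108kB low:78068kB high:111028kB".
# \s+ is used between fields so the pattern survives mail line wrapping.
ZONE_RE = re.compile(
    r"Node\s+(\d+)\s+(\w+)\s+free:(\d+)kB\s+min:(\d+)kB"
    r"\s+low:(\d+)kB\s+high:(\d+)kB"
)


def zones_below_min(dump):
    """Return (node, zone, free_kb, min_kb) for zones with free < min."""
    hits = []
    for node, zone, free, wmin, _low, _high in ZONE_RE.findall(dump):
        if int(free) < int(wmin):
            hits.append((int(node), zone, int(free), int(wmin)))
    return hits


# Two zone headers copied from the first dump (hypothetical variable name).
SAMPLE = """\
[16212.217068] Node 0 Normal free:460088kB min:42308kB low:73224kB
high:104140kB active_anon:29898272kB inactive_anon:8728kB
[16212.217074] Node 1 Normal free:44744kB min:45108kB low:78068kB
high:111028kB active_anon:32559600kB inactive_anon:13516kB
"""

print(zones_below_min(SAMPLE))  # [(1, 'Normal', 44744, 45108)]
```

The per-node summary lines ("Node 1 active_anon:...") are skipped because
the pattern requires a zone name followed by "free:".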

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

* Re: mm/migrate: Fix ref-count handling when !hugepage_migration_supported()
From: Manoj Iyer @ 2017-05-24 22:19 UTC
  To: Punit Agrawal
  Cc: Andrew Morton, will.deacon, catalin.marinas, linux-kernel,
	linux-arm-kernel, linux-mm, tbaicar, Tabi, Timur, Joonsoo Kim,
	Naoya Horiguchi, Wanpeng Li, Christoph Lameter
In-Reply-To: <20170524154728.2492-1-punit.agrawal@arm.com>

On Wed, 24 May 2017, Punit Agrawal wrote:

> On failing to migrate a page, soft_offline_huge_page() performs the
> necessary update to the hugepage ref-count. When
> !hugepage_migration_supported(), unmap_and_move_huge_page() also
> decrements the page ref-count for the hugepage. The combined behaviour
> leaves the ref-count in an inconsistent state.
>
> This leads to soft lockups when running the overcommitted hugepage test
> from mce-tests suite.
>
> Soft offlining pfn 0x83ed600 at process virtual address 0x400000000000
> soft offline: 0x83ed600: migration failed 1, type
> 1fffc00000008008 (uptodate|head)
> INFO: rcu_preempt detected stalls on CPUs/tasks:
> Tasks blocked on level-0 rcu_node (CPUs 0-7): P2715
>  (detected by 7, t=5254 jiffies, g=963, c=962, q=321)
>  thugetlb_overco R  running task        0  2715   2685 0x00000008
>  Call trace:
>  [<ffff000008089f90>] dump_backtrace+0x0/0x268
>  [<ffff00000808a2d4>] show_stack+0x24/0x30
>  [<ffff000008100d34>] sched_show_task+0x134/0x180
>  [<ffff0000081c90fc>] rcu_print_detail_task_stall_rnp+0x54/0x7c
>  [<ffff00000813cfd4>] rcu_check_callbacks+0xa74/0xb08
>  [<ffff000008143a3c>] update_process_times+0x34/0x60
>  [<ffff0000081550e8>] tick_sched_handle.isra.7+0x38/0x70
>  [<ffff00000815516c>] tick_sched_timer+0x4c/0x98
>  [<ffff0000081442e0>] __hrtimer_run_queues+0xc0/0x300
>  [<ffff000008144fa4>] hrtimer_interrupt+0xac/0x228
>  [<ffff0000089a56d4>] arch_timer_handler_phys+0x3c/0x50
>  [<ffff00000812f1bc>] handle_percpu_devid_irq+0x8c/0x290
>  [<ffff0000081297fc>] generic_handle_irq+0x34/0x50
>  [<ffff000008129f00>] __handle_domain_irq+0x68/0xc0
>  [<ffff0000080816b4>] gic_handle_irq+0x5c/0xb0
>
> Fix this by dropping the ref-count decrement in
> unmap_and_move_huge_page() when !hugepage_migration_supported().
>
> Fixes: 32665f2bbfed ("mm/migrate: correct failure handling if !hugepage_migration_support()")
> Reported-by: Manoj Iyer <manoj.iyer@canonical.com>
> Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
> Cc: Christoph Lameter <cl@linux.com>
> ---
> Hi Andrew,
>
> We ran into this bug when working towards enabling memory corruption
> handling on arm64. The patch was tested on an arm64 platform running
> v4.12-rc2 with the series to enable memory corruption handling[0].
>
> Please consider merging as a fix for the 4.12 release.
>
> Thanks,
> Punit
>

I applied this patch to the Ubuntu Zesty (4.10) kernel and tested it on a
QDF2400 platform with mce-test, without config migration enabled.

== dmesg ==
[   91.569358] Soft offlining page 0x1763c00 at 0x400000000000
[   91.569364] soft offline: 0x1763c00: migration failed 1, type 
100000000008008
[  150.282911] Soft offlining page 0x1763c00 at 0x400000000000
[  150.282917] soft offline: 0x1763c00: migration failed 1, type 
100000000008008

The mce-test failed as expected, but I did not encounter the soft lockups.
(The test case might have an error: it is missing an 'echo' in the failure
case.)

$ sudo ./run_hugepage_overcommit.sh

***************************************************************************
Pay attention:

This test checks that hugepage soft-offlining works under overcommitting.
***************************************************************************


-------------------------------------
TestCase ./thugetlb_overcommit 1
FAIL: migration failed.
Unpoisoning.

 	Num of Executed Test Case: 1	Num of Failed Case: 1

Tested-By: Manoj Iyer <manoj.iyer@canonical.com>

Thanks
Manoj Iyer

> [0] https://www.spinics.net/lists/arm-kernel/msg581657.html
> ---
> mm/migrate.c | 4 +---
> 1 file changed, 1 insertion(+), 3 deletions(-)
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 89a0a1707f4c..187abd1526df 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1201,10 +1201,8 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
> 	 * tables or check whether the hugepage is pmd-based or not before
> 	 * kicking migration.
> 	 */
> -	if (!hugepage_migration_supported(page_hstate(hpage))) {
> -		putback_active_hugepage(hpage);
> +	if (!hugepage_migration_supported(page_hstate(hpage)))
> 		return -ENOSYS;
> -	}
>
> 	new_hpage = get_new_page(hpage, private, &result);
> 	if (!new_hpage)
>
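
The double decrement described in the quoted commit message can be shown
with a toy model. This is a sketch only, not kernel code: the HugePage
class, the function names and the callee_puts_on_enosys flag are all
hypothetical, and the real functions do far more than this. It assumes the
caller holds one reference and drops it itself when migration fails.

```python
class HugePage:
    """Toy stand-in for a hugepage with a reference count."""

    def __init__(self, refs=1):
        self.refs = refs

    def put(self):
        # Drop one reference.
        self.refs -= 1


def unmap_and_move(page, migration_supported, callee_puts_on_enosys):
    """Model of the callee: returns 0 on success, "ENOSYS" otherwise."""
    if not migration_supported:
        if callee_puts_on_enosys:  # behaviour before the patch
            page.put()
        return "ENOSYS"
    return 0


def soft_offline(page, migration_supported, callee_puts_on_enosys):
    """Model of the caller: it drops the reference itself on failure."""
    ret = unmap_and_move(page, migration_supported, callee_puts_on_enosys)
    if ret != 0:
        page.put()
    return page.refs


# Before the patch both sides drop the reference, so the count goes to -1.
print(soft_offline(HugePage(), False, True))   # -1
# With the callee's put removed, only the caller drops it.
print(soft_offline(HugePage(), False, False))  # 0
```

In the model, removing the put from the callee's ENOSYS path leaves the
caller as the only place that drops the reference on failure.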

--
============================
Manoj Iyer
Ubuntu/Canonical
ARM Servers - Cloud
============================

* Re: [PATCH 1/3] mm/slub: Only define kmalloc_large_node_hook() for NUMA systems
From: Matthias Kaehlcke @ 2017-05-24 22:09 UTC
  To: David Rientjes
  Cc: Andrew Morton, Christoph Lameter, Pekka Enberg, Joonsoo Kim,
	linux-mm, linux-kernel, Douglas Anderson
In-Reply-To: <alpine.DEB.2.10.1705241326200.49680@chino.kir.corp.google.com>

Hi David,

On Wed, May 24, 2017 at 01:36:21PM -0700, David Rientjes wrote:

> On Tue, 23 May 2017, Matthias Kaehlcke wrote:
> 
> > > diff --git a/include/linux/compiler-clang.h b/include/linux/compiler-clang.h
> > > index de179993e039..e1895ce6fa1b 100644
> > > --- a/include/linux/compiler-clang.h
> > > +++ b/include/linux/compiler-clang.h
> > > @@ -15,3 +15,8 @@
> > >   * with any version that can compile the kernel
> > >   */
> > >  #define __UNIQUE_ID(prefix) __PASTE(__PASTE(__UNIQUE_ID_, prefix), __COUNTER__)
> > > +
> > > +#ifdef inline
> > > +#undef inline
> > > +#define inline __attribute__((unused))
> > > +#endif
> > 
> > Thanks for the suggestion!
> > 
> > > Nothing breaks and the warnings are silenced. It seems we could use
> > > this if there is strong opposition to having warnings on unused
> > > static inline functions in .c files.
> > 
> 
> It would be slightly different, it would be:
> 
> #define inline inline __attribute__((unused))
> 
> to still inline the functions, I was just seeing if there was anything 
> else that clang was warning about that was unrelated to a function's 
> inlining.
> 
> > Still I am not convinced that gcc's behavior is preferable in this
> > case. True, it saves us from adding a bunch of __maybe_unused or
> > #ifdefs, on the other hand the warning is a useful tool to spot truly
> > unused code. So far about 50% of the warnings I looked into fall into
> > this category.
> > 
> 
> I think gcc's behavior is a result of how it does preprocessing and is a 
> clearly defined and long-standing semantic given in the gcc manual 
> regarding -Wunused-function.
> 
> #define IS_PAGE_ALIGNED(__size)	(!(__size & ((size_t)PAGE_SIZE - 1)))
> static inline int is_page_aligned(size_t size)
> {
> 	return !(size & ((size_t)PAGE_SIZE - 1));
> }
> 
> Gcc will not warn about either of these being unused, regardless of -Wall, 
> -Wunused-function, or -pedantic.  Clang, correct me if I'm wrong, will 
> only warn about is_page_aligned().

Indeed, clang does not warn about unused defines.

> So the argument could be made that one of the additional benefits of 
> static inline functions is that a subset of compilers, heavily in the 
> minority, will detect whether it's unused and we'll get patches that 
> remove them.  Functionally, it would only result in LOC reduction.  But, 
> isn't adding #ifdef's to silence the warning just adding more LOC?

The LOC reduction comes from the removal of the actual dead code that
is spotted because the warning was enabled and pointed it out :)

Using #ifdef is one option, in most cases the function can be marked as
__maybe_unused, which technically doesn't (necessarily) increase
LOC. However some maintainers prefer the use of #ifdef over
__maybe_unused in certain cases.
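
To make the two options concrete, here is a minimal userspace sketch.
CONFIG_DEBUG_FOO, debug_checksum(), debug_clamp_len() and process() are
hypothetical names made up for illustration, and __maybe_unused is
defined locally as a stand-in for the kernel's definition:

```c
#include <stddef.h>

/* Local stand-in for the kernel's __maybe_unused (an assumption made so
 * that this builds outside the kernel tree). */
#ifndef __maybe_unused
#define __maybe_unused __attribute__((unused))
#endif

/*
 * Option 1: annotate the helper. It is always compiled, and neither
 * gcc nor clang warns when no configuration ends up calling it.
 */
static inline __maybe_unused unsigned int debug_checksum(const unsigned char *buf,
							  size_t len)
{
	unsigned int sum = 0;
	size_t i;

	for (i = 0; i < len; i++)
		sum += buf[i];
	return sum;
}

/*
 * Option 2: guard the helper with the same #ifdef as its only caller.
 * CONFIG_DEBUG_FOO is a hypothetical config symbol.
 */
#ifdef CONFIG_DEBUG_FOO
static inline size_t debug_clamp_len(size_t len)
{
	return len > 4096 ? 4096 : len;
}
#endif

/* Returns the number of bytes it would process. */
size_t process(const unsigned char *buf, size_t len)
{
#ifdef CONFIG_DEBUG_FOO
	len = debug_clamp_len(len);
	(void)debug_checksum(buf, len);
#else
	(void)buf;
#endif
	return len;
}
```

With option 1 the helper costs one attribute and no extra lines; with
option 2 the helper disappears entirely from builds that do not need it,
at the price of a second #ifdef block.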

> I have no preference either way, I think it would be up to the person who 
> is maintaining the code and has to deal with the patches.

I think it would be good to have a general policy/agreement, to either
disable the warning completely (not my preference) or 'allow' the use
of one of the available mechanisms to suppress the warning for
functions that are not used in some configurations or only kept around
for reference/debugging/symmetry.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [RFC PATCH 2/2] mm, memory_hotplug: drop CONFIG_MOVABLE_NODE
From: Reza Arbab @ 2017-05-24 21:51 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Mel Gorman, Vlastimil Babka,
	Andrea Arcangeli, Jerome Glisse, Yasuaki Ishimatsu, qiuxishi,
	Kani Toshimitsu, slaoub, Joonsoo Kim, Andi Kleen, David Rientjes,
	Daniel Kiper, Igor Mammedov, Vitaly Kuznetsov, LKML, Michal Hocko
In-Reply-To: <20170524122411.25212-3-mhocko@kernel.org>

On Wed, May 24, 2017 at 02:24:11PM +0200, Michal Hocko wrote:
>20b2f52b73fe ("numa: add CONFIG_MOVABLE_NODE for movable-dedicated
>node") has introduced CONFIG_MOVABLE_NODE without a good explanation on
>why it is actually useful. It makes a lot of sense to make movable node
>semantic opt in but we already have that because the feature has to be
>explicitly enabled on the kernel command line. A config option on top
>only makes the configuration space larger without a good reason. It also
>adds an additional ifdefery that pollutes the code. Just drop the config
>option and make it de-facto always enabled. This shouldn't introduce any
>change to the semantic.

Acked-by: Reza Arbab <arbab@linux.vnet.ibm.com>

-- 
Reza Arbab

^ permalink raw reply

* Re: [RFC PATCH 1/2] mm, memory_hotplug: drop artificial restriction on online/offline
From: Reza Arbab @ 2017-05-24 21:50 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Mel Gorman, Vlastimil Babka,
	Andrea Arcangeli, Jerome Glisse, Yasuaki Ishimatsu, qiuxishi,
	Kani Toshimitsu, slaoub, Joonsoo Kim, Andi Kleen, David Rientjes,
	Daniel Kiper, Igor Mammedov, Vitaly Kuznetsov, LKML, Michal Hocko
In-Reply-To: <20170524122411.25212-2-mhocko@kernel.org>

On Wed, May 24, 2017 at 02:24:10PM +0200, Michal Hocko wrote:
>74d42d8fe146 ("memory_hotplug: ensure every online node has NORMAL
>memory") has added can_offline_normal which checks the amount of
>memory in !movable zones as long as CONFIG_MOVABLE_NODE is disabled.
>It disallows offlining memory if there is nothing left with a
>justification that "memory-management acts bad when we have nodes which
>is online but don't have any normal memory".
>
>74d42d8fe146 ("memory_hotplug: ensure every online node has NORMAL
>memory") has introduced a restriction that every numa node has to have
>at least some memory in !movable zones before a first movable memory
>can be onlined if !CONFIG_MOVABLE_NODE with the same justification.
>
>While it is true that not having _any_ memory for kernel allocations on
>a NUMA node is far from great, and such a node would be quite suboptimal
>because all kernel allocations will have to fall back to another NUMA
>node, there is no reason to disallow such a configuration in
>principle.
>
>Besides that there is not really a big difference between having one
>memblock for ZONE_NORMAL available or none. With 128MB size memblocks the
>system might thrash on the kernel allocation requests anyway. It is really
>hard to draw a line on how much normal memory is really sufficient, so
>we have to rely on the administrator to configure the system sanely.
>Therefore drop the artificial restriction and remove can_offline_normal
>and can_online_high_movable altogether.

I'm really liking all this cleanup of the memory hotplug code. Thanks!  
Much appreciated.

Acked-by: Reza Arbab <arbab@linux.vnet.ibm.com>

-- 
Reza Arbab

^ permalink raw reply

* Re: [patch] compiler, clang: suppress warning for unused static inline functions
From: Andrew Morton @ 2017-05-24 21:32 UTC (permalink / raw)
  To: Matthias Kaehlcke
  Cc: David Rientjes, Christoph Lameter, Pekka Enberg, Joonsoo Kim,
	linux-mm, linux-kernel, Douglas Anderson, Mark Brown, Ingo Molnar,
	David Miller
In-Reply-To: <20170524212229.GR141096@google.com>

On Wed, 24 May 2017 14:22:29 -0700 Matthias Kaehlcke <mka@chromium.org> wrote:

> I'm not a kernel maintainer, so it's not my decision whether this
> warning should be silenced, my personal opinion is that its benefits
> outweigh the inconveniences of dealing with half-false positives,
> generally caused by the heavy use of #ifdef by the kernel itself.

Please resend and include this info in the changelog.  Describe
instances where this warning has resulted in actual runtime or
developer-visible benefits.

Where possible and appropriate, I suggest it is better to move the
offending function into a header file, rather than adding ifdefs.

^ permalink raw reply

* Re: [RFC PATCH v2 11/17] cgroup: Implement new thread mode semantics
From: Tejun Heo @ 2017-05-24 21:27 UTC (permalink / raw)
  To: Waiman Long
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, linux-mm, kernel-team, pjt, luto, efault
In-Reply-To: <9b147a7e-fec3-3b78-7587-3890efcd42f2@redhat.com>

Hello,

On Wed, May 24, 2017 at 05:17:13PM -0400, Waiman Long wrote:
> An alternative is to have separate enabling for thread root. For example,
> 
> # echo root > cgroup.threads
> # echo enable > child/cgroup.threads
> 
> The first statement makes the current cgroup the thread root. However,
> setting it to a thread root doesn't make its children threaded. This
> has to be explicitly done on each of the children. Once a child cgroup
> is made to be threaded, all its descendants will be threaded. That will
> have the same effect as the current patch.

Yeah, I'm toying with different ideas.  I'll get back to you once
things get more concrete.

> With delegation, do you mean the relationship between a container and
> its host?

It can be but doesn't have to be.  For example, it can be delegations
to users without namespace / container being involved.

Thanks.

-- 
tejun

^ permalink raw reply

* Re: [patch] compiler, clang: suppress warning for unused static inline functions
From: Matthias Kaehlcke @ 2017-05-24 21:22 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Christoph Lameter, Pekka Enberg, Joonsoo Kim,
	linux-mm, linux-kernel, Douglas Anderson, Mark Brown, Ingo Molnar,
	David Miller
In-Reply-To: <alpine.DEB.2.10.1705241400510.49680@chino.kir.corp.google.com>

On Wed, May 24, 2017 at 02:01:15PM -0700, David Rientjes wrote:

> GCC explicitly does not warn for unused static inline functions for
> -Wunused-function.  The manual states:
> 
> 	Warn whenever a static function is declared but not defined or
> 	a non-inline static function is unused.
> 
> Clang does warn for static inline functions that are unused.
> 
> It turns out that suppressing the warnings avoids potentially complex
> #ifdef directives, which also reduces LOC.
> 
> Suppress the warning for clang.
> 
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---

As expressed earlier in other threads, I don't think gcc's behavior is
preferable in this case. The warning on static inline functions (only
in .c files) allows detecting truly unused code. About 50% of the
warnings I have looked into so far fall into this category.

In my opinion it is more valuable to detect dead code than to avoid
a few more __maybe_unused attributes (there aren't really that many
instances, at least with x86 and arm64 defconfig). In most cases it is
not necessary to use #ifdef; it is an option which is preferred by
some maintainers. The reduced LOC is arguable, since detecting dead
code allows removing it.

I'm not a kernel maintainer, so it's not my decision whether this
warning should be silenced, my personal opinion is that its benefits
outweigh the inconveniences of dealing with half-false positives,
generally caused by the heavy use of #ifdef by the kernel itself.

^ permalink raw reply

* Re: [RFC PATCH v2 11/17] cgroup: Implement new thread mode semantics
From: Waiman Long @ 2017-05-24 21:17 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, linux-mm, kernel-team, pjt, luto, efault
In-Reply-To: <20170524203616.GO24798@htj.duckdns.org>

On 05/24/2017 04:36 PM, Tejun Heo wrote:
> Hello, Waiman.
>
> On Mon, May 22, 2017 at 01:13:16PM -0400, Waiman Long wrote:
>>> Maybe I'm misunderstanding the design, but this seems to push the
>>> processes which belong to the threaded subtree to the parent which is
>>> part of the usual resource domain hierarchy thus breaking the no
>>> internal competition constraint.  I'm not sure this is something we'd
>>> want.  Given that the limitation of the original threaded mode was the
>>> required nesting below root and that we treat root special anyway
>>> (exactly in the way necessary), I wonder whether it'd be better to
>>> simply allow root to be both domain and thread root.
>> Yes, root can be both domain and thread root. I haven't placed any
>> restriction on that.
> I've been playing with the proposed "make the parent resource domain".
> Unfortunately, the parent - child relationship becomes weird.
>
> The parent becomes the thread root, which means that its
> cgroup.threads file becomes writable and threads can be put in there.
> It's really weird to write to a child's interface and have the
> parent's behavior changed.  This becomes weirder with delegation.  If
> a cgroup is delegated, its cgroup.threads should be delegated too but
> if the child enables threaded mode, that makes the undelegated parent
> thread root, which means that either 1. the delegatee can't migrate
> threads to the thread root or 2. if the parent's cgroup.threads is
> writeable, the delegatee can mess with other descendants under it
> which shouldn't be allowed.
>
> I think the operation of making a cgroup a thread root should happen
> on the cgroup where that's requested; otherwise, nesting becomes too
> twisted.  This should be solvable.  Will think more about it.
>
> Thanks.
>
An alternative is to have separate enabling for thread root. For example,

# echo root > cgroup.threads
# echo enable > child/cgroup.threads

The first statement makes the current cgroup the thread root. However,
setting it to a thread root doesn't make its children threaded. This
has to be explicitly done on each of the children. Once a child cgroup
is made to be threaded, all its descendants will be threaded. That will
have the same effect as the current patch.
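
As a rough illustration of the semantics I have in mind, here is a toy
userspace model. It is only a sketch, not code from the patch: struct
cgroup_model, its two flags and is_threaded() are made-up names, and the
flags stand for the two writes to cgroup.threads shown above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model of the proposal; all names here are hypothetical. */
struct cgroup_model {
	struct cgroup_model *parent;
	bool thread_root;	/* "echo root > cgroup.threads" was done here */
	bool threaded_enabled;	/* "echo enable > cgroup.threads" was done here */
};

/*
 * A cgroup is threaded if it, or one of its ancestors below the nearest
 * thread root, was explicitly enabled. The thread root itself and
 * children that were never enabled stay non-threaded.
 */
static bool is_threaded(const struct cgroup_model *cg)
{
	bool enabled = false;

	for (; cg; cg = cg->parent) {
		if (cg->thread_root)
			return enabled;
		enabled = enabled || cg->threaded_enabled;
	}
	return false;	/* no thread root above this cgroup */
}
```

In this model, enabling one child makes that child and everything below
it threaded, while a sibling that was never enabled is unaffected.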

With delegation, do you mean the relationship between a container and
its host?

Cheers,
Longman


^ permalink raw reply

* [patch] compiler, clang: suppress warning for unused static inline functions
From: David Rientjes @ 2017-05-24 21:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Matthias Kaehlcke, Christoph Lameter, Pekka Enberg, Joonsoo Kim,
	linux-mm, linux-kernel, Douglas Anderson

GCC explicitly does not warn for unused static inline functions for
-Wunused-function.  The manual states:

	Warn whenever a static function is declared but not defined or
	a non-inline static function is unused.

Clang does warn for static inline functions that are unused.

It turns out that suppressing the warnings avoids potentially complex
#ifdef directives, which also reduces LOC.

Suppress the warning for clang.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 include/linux/compiler-clang.h | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/include/linux/compiler-clang.h b/include/linux/compiler-clang.h
--- a/include/linux/compiler-clang.h
+++ b/include/linux/compiler-clang.h
@@ -15,3 +15,10 @@
  * with any version that can compile the kernel
  */
 #define __UNIQUE_ID(prefix) __PASTE(__PASTE(__UNIQUE_ID_, prefix), __COUNTER__)
+
+/*
+ * GCC does not warn about unused static inline functions for
+ * -Wunused-function.  This turns out to avoid the need for complex #ifdef
+ * directives.  Suppress the warning in clang as well.
+ */
+#define inline inline __attribute__((unused))
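
For illustration only, the effect of the macro can be tried in a small
standalone file. This is a sketch under assumptions: the PAGE_SIZE value
is made up for the example, and the define is placed after the standard
header so that only the code below it is affected.

```c
#include <stddef.h>

/* Same definition as in the patch above. A macro is not expanded
 * recursively, so the inner "inline" stays the keyword. */
#define inline inline __attribute__((unused))

#ifndef PAGE_SIZE
#define PAGE_SIZE 4096UL	/* assumed value, for illustration only */
#endif

/*
 * After preprocessing this reads
 *   static inline __attribute__((unused)) int is_page_aligned(size_t size)
 * so -Wunused-function has nothing to report even if nobody calls it.
 */
static inline int is_page_aligned(size_t size)
{
	return !(size & ((size_t)PAGE_SIZE - 1));
}
```

The function is still declared inline; only the unused attribute is
added to every definition that uses the keyword.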

^ permalink raw reply

* Re: [PATCH] mm/oom_kill: count global and memory cgroup oom kills
From: David Rientjes @ 2017-05-24 20:43 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Roman Guschin, linux-mm, Andrew Morton, Tejun Heo, cgroups,
	linux-kernel, Vlastimil Babka, Michal Hocko, hannes
In-Reply-To: <0f67046d-cdf6-1264-26f6-11c82978c621@yandex-team.ru>

On Tue, 23 May 2017, Konstantin Khlebnikov wrote:

> This is a worthwhile addition. Let's call it "oom_victim" for short.
> 
> It allows locating leaky parts if they are spread over sub-containers within
> a common limit.
> But it doesn't tell which limit caused this kill. For hierarchical limits this
> might not be so easy.
> 
> I think oom_kill is better suited for automatic actions - restart the affected
> hierarchy, increase limits, etc.
> But oom_victim allows determining the container affected by the global oom
> killer.
> 
> So, probably it's worth merging them together and incrementing oom_kill by
> the global killer for the victim memcg:
> 
> 	if (!is_memcg_oom(oc)) {
> 		count_vm_event(OOM_KILL);
> 		mem_cgroup_count_vm_event(mm, OOM_KILL);
> 	} else
> 		mem_cgroup_event(oc->memcg, OOM_KILL);
> 

Our complete solution is that we have a complementary 
memory.oom_kill_control that allows users to register for eventfd(2) 
notification when the kernel oom killer kills a victim, but this is 
because we have had complete support for userspace oom handling for years.  
When read, it exports three classes of information:

 - the "total" (hierarchical) and "local" (memcg specific) number of oom
   kills for system oom conditions (overcommit),

 - the "total" and "local" number of oom kills for memcg oom conditions, 
   and
 
 - the total number of processes in the hierarchy where an oom victim was
   reaped successfully and unsuccessfully.

One benefit of this is that it prevents us from having to scrape the 
kernel log for oom events which has been troublesome in the past, but 
userspace can easily do so when the eventfd triggers for the kill 
notification.
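
As a sketch of the userspace side of such a notification, the two
helpers below wrap the eventfd(2) counter. They are illustrative only:
signal_events() and wait_for_events() are made-up names,
signal_events() merely stands in for the kernel signalling a kill, and
the step that registers the descriptor with memory.oom_kill_control is
left out because that file is not an upstream interface.

```c
#include <stdint.h>
#include <unistd.h>
#include <sys/eventfd.h>

/*
 * Stand-in for the kernel side: add n events to the eventfd counter.
 * Returns 0 on success, -1 on error.
 */
static int signal_events(int efd, uint64_t n)
{
	return write(efd, &n, sizeof(n)) == (ssize_t)sizeof(n) ? 0 : -1;
}

/*
 * Block until the eventfd has been signalled at least once, then return
 * the number of events accumulated since the previous read (0 on error).
 * Without EFD_SEMAPHORE a read returns the whole counter and resets it.
 */
static uint64_t wait_for_events(int efd)
{
	uint64_t count = 0;

	if (read(efd, &count, sizeof(count)) != (ssize_t)sizeof(count))
		return 0;
	return count;
}
```

A monitoring daemon would create the descriptor with eventfd(0, 0),
register it, and then call wait_for_events() in a loop, reading the
counters (or the kernel log) only when it returns.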

^ permalink raw reply

* Re: [PATCH 1/3] mm/slub: Only define kmalloc_large_node_hook() for NUMA systems
From: David Rientjes @ 2017-05-24 20:36 UTC (permalink / raw)
  To: Matthias Kaehlcke
  Cc: Andrew Morton, Christoph Lameter, Pekka Enberg, Joonsoo Kim,
	linux-mm, linux-kernel, Douglas Anderson
In-Reply-To: <20170523165608.GN141096@google.com>

On Tue, 23 May 2017, Matthias Kaehlcke wrote:

> > diff --git a/include/linux/compiler-clang.h b/include/linux/compiler-clang.h
> > index de179993e039..e1895ce6fa1b 100644
> > --- a/include/linux/compiler-clang.h
> > +++ b/include/linux/compiler-clang.h
> > @@ -15,3 +15,8 @@
> >   * with any version that can compile the kernel
> >   */
> >  #define __UNIQUE_ID(prefix) __PASTE(__PASTE(__UNIQUE_ID_, prefix), __COUNTER__)
> > +
> > +#ifdef inline
> > +#undef inline
> > +#define inline __attribute__((unused))
> > +#endif
> 
> Thanks for the suggestion!
> 
> Nothing breaks and the warnings are silenced. It seems we could use
> this if there is a strong opposition against having warnings on unused
> static inline functions in .c files.
> 

It would be slightly different, it would be:

#define inline inline __attribute__((unused))

to still inline the functions, I was just seeing if there was anything 
else that clang was warning about that was unrelated to a function's 
inlining.

> Still I am not convinced that gcc's behavior is preferable in this
> case. True, it saves us from adding a bunch of __maybe_unused or
> #ifdefs, on the other hand the warning is a useful tool to spot truly
> unused code. So far about 50% of the warnings I looked into fall into
> this category.
> 

I think gcc's behavior is a result of how it does preprocessing and is a 
clearly defined and long-standing semantic given in the gcc manual 
regarding -Wunused-function.

#define IS_PAGE_ALIGNED(__size)	(!(__size & ((size_t)PAGE_SIZE - 1)))
static inline int is_page_aligned(size_t size)
{
	return !(size & ((size_t)PAGE_SIZE - 1));
}

Gcc will not warn about either of these being unused, regardless of -Wall, 
-Wunused-function, or -pedantic.  Clang, correct me if I'm wrong, will 
only warn about is_page_aligned().

So the argument could be made that one of the additional benefits of 
static inline functions is that a subset of compilers, heavily in the 
minority, will detect whether it's unused and we'll get patches that 
remove them.  Functionally, it would only result in LOC reduction.  But, 
isn't adding #ifdef's to silence the warning just adding more LOC?

I have no preference either way, I think it would be up to the person who 
is maintaining the code and has to deal with the patches.

^ permalink raw reply

* Re: [RFC PATCH v2 11/17] cgroup: Implement new thread mode semantics
From: Tejun Heo @ 2017-05-24 20:36 UTC (permalink / raw)
  To: Waiman Long
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, linux-mm, kernel-team, pjt, luto, efault
In-Reply-To: <b1d02881-f522-8baa-5ebe-9b1ad74a03e4@redhat.com>

Hello, Waiman.

On Mon, May 22, 2017 at 01:13:16PM -0400, Waiman Long wrote:
> > Maybe I'm misunderstanding the design, but this seems to push the
> > processes which belong to the threaded subtree to the parent which is
> > part of the usual resource domain hierarchy thus breaking the no
> > internal competition constraint.  I'm not sure this is something we'd
> > want.  Given that the limitation of the original threaded mode was the
> > required nesting below root and that we treat root special anyway
> > (exactly in the way necessary), I wonder whether it'd be better to
> > simply allow root to be both domain and thread root.
> 
> Yes, root can be both domain and thread root. I haven't placed any
> restriction on that.

I've been playing with the proposed "make the parent resource domain".
Unfortunately, the parent - child relationship becomes weird.

The parent becomes the thread root, which means that its
cgroup.threads file becomes writable and threads can be put in there.
It's really weird to write to a child's interface and have the
parent's behavior changed.  This becomes weirder with delegation.  If
a cgroup is delegated, its cgroup.threads should be delegated too but
if the child enables threaded mode, that makes the undelegated parent
thread root, which means that either 1. the delegatee can't migrate
threads to the thread root or 2. if the parent's cgroup.threads is
writeable, the delegatee can mess with other descendants under it
which shouldn't be allowed.

I think the operation of making a cgroup a thread root should happen
on the cgroup where that's requested; otherwise, nesting becomes too
twisted.  This should be solvable.  Will think more about it.

Thanks.

-- 
tejun

^ permalink raw reply

* Re: [PATCH] mm/migrate: Fix ref-count handling when !hugepage_migration_supported()
From: Andrew Morton @ 2017-05-24 19:56 UTC (permalink / raw)
  To: Punit Agrawal
  Cc: will.deacon, catalin.marinas, manoj.iyer, linux-kernel,
	linux-arm-kernel, linux-mm, tbaicar, timur, Joonsoo Kim,
	Naoya Horiguchi, Wanpeng Li, Christoph Lameter
In-Reply-To: <20170524154728.2492-1-punit.agrawal@arm.com>

On Wed, 24 May 2017 16:47:28 +0100 Punit Agrawal <punit.agrawal@arm.com> wrote:

> On failing to migrate a page, soft_offline_huge_page() performs the
> necessary update to the hugepage ref-count. When
> !hugepage_migration_supported() , unmap_and_move_hugepage() also
> decrements the page ref-count for the hugepage. The combined behaviour
> leaves the ref-count in an inconsistent state.
> 
> This leads to soft lockups when running the overcommitted hugepage test
> from mce-tests suite.
> 
> Soft offlining pfn 0x83ed600 at process virtual address 0x400000000000
> soft offline: 0x83ed600: migration failed 1, type
> 1fffc00000008008 (uptodate|head)
> INFO: rcu_preempt detected stalls on CPUs/tasks:
>  Tasks blocked on level-0 rcu_node (CPUs 0-7): P2715
>   (detected by 7, t=5254 jiffies, g=963, c=962, q=321)
>   thugetlb_overco R  running task        0  2715   2685 0x00000008
>   Call trace:
>   [<ffff000008089f90>] dump_backtrace+0x0/0x268
>   [<ffff00000808a2d4>] show_stack+0x24/0x30
>   [<ffff000008100d34>] sched_show_task+0x134/0x180
>   [<ffff0000081c90fc>] rcu_print_detail_task_stall_rnp+0x54/0x7c
>   [<ffff00000813cfd4>] rcu_check_callbacks+0xa74/0xb08
>   [<ffff000008143a3c>] update_process_times+0x34/0x60
>   [<ffff0000081550e8>] tick_sched_handle.isra.7+0x38/0x70
>   [<ffff00000815516c>] tick_sched_timer+0x4c/0x98
>   [<ffff0000081442e0>] __hrtimer_run_queues+0xc0/0x300
>   [<ffff000008144fa4>] hrtimer_interrupt+0xac/0x228
>   [<ffff0000089a56d4>] arch_timer_handler_phys+0x3c/0x50
>   [<ffff00000812f1bc>] handle_percpu_devid_irq+0x8c/0x290
>   [<ffff0000081297fc>] generic_handle_irq+0x34/0x50
>   [<ffff000008129f00>] __handle_domain_irq+0x68/0xc0
>   [<ffff0000080816b4>] gic_handle_irq+0x5c/0xb0
> 
> Fix this by dropping the ref-count decrement in
> unmap_and_move_hugepage() when !hugepage_migration_supported().
> 
> Fixes: 32665f2bbfed ("mm/migrate: correct failure handling if !hugepage_migration_support()")
> Reported-by: Manoj Iyer <manoj.iyer@canonical.com>
> Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
> Cc: Christoph Lameter <cl@linux.com>

32665f2bbfed was three years ago.  Do you have any theory as to why
this took so long to be detected?  And do you believe a -stable
backport is warranted?

^ permalink raw reply

* [PATCH] mm: add counters for different page fault types
From: Luigi Semenzato @ 2017-05-24 19:41 UTC (permalink / raw)
  To: linux-mm, dianders, dtor, sonnyrao; +Cc: Luigi Semenzato, Luigi Semenzato

VM event counters are added to keep track of anonymous
vs. file vs. shmem page faults.  They are: pgmajfault_a,
pgmajfault_f and pgmajfault_s.  These are useful to
analyze system performance, particularly when the cost
of a fault for a file page is very different from that
of an anonymous page, as would happen, for instance, in
the presence of zram.

The PGMAJFAULT counter is no longer directly maintained.
Instead the three new counters are added whenever the
total count is needed.

Signed-off-by: Luigi Semenzato <semenzato@google.com>
---
 arch/s390/appldata/appldata_mem.c | 9 ++++++++-
 drivers/virtio/virtio_balloon.c   | 5 ++++-
 fs/dax.c                          | 5 +++--
 fs/ncpfs/mmap.c                   | 4 ++--
 include/linux/vm_event_item.h     | 1 +
 mm/filemap.c                      | 4 ++--
 mm/memcontrol.c                   | 7 ++++++-
 mm/memory.c                       | 4 ++--
 mm/shmem.c                        | 4 ++--
 mm/vmstat.c                       | 5 +++++
 10 files changed, 35 insertions(+), 13 deletions(-)

diff --git a/arch/s390/appldata/appldata_mem.c b/arch/s390/appldata/appldata_mem.c
index 598df5708501..adb8b6412ffa 100644
--- a/arch/s390/appldata/appldata_mem.c
+++ b/arch/s390/appldata/appldata_mem.c
@@ -62,6 +62,9 @@ struct appldata_mem_data {
 	u64 pgalloc;		/* page allocations */
 	u64 pgfault;		/* page faults (major+minor) */
 	u64 pgmajfault;		/* page faults (major only) */
+	u64 pgmajfault_s;	/* shmem page faults (major only) */
+	u64 pgmajfault_a;	/* anonymous page faults (major only) */
+	u64 pgmajfault_f;	/* file page faults (major only) */
 // <-- New in 2.6
 
 } __packed;
@@ -93,7 +96,11 @@ static void appldata_get_mem_data(void *data)
 	mem_data->pgalloc    = ev[PGALLOC_NORMAL];
 	mem_data->pgalloc    += ev[PGALLOC_DMA];
 	mem_data->pgfault    = ev[PGFAULT];
-	mem_data->pgmajfault = ev[PGMAJFAULT];
+	mem_data->pgmajfault =
+		ev[PGMAJFAULT_S] + ev[PGMAJFAULT_A] + ev[PGMAJFAULT_F];
+	mem_data->pgmajfault_s = ev[PGMAJFAULT_S];
+	mem_data->pgmajfault_a = ev[PGMAJFAULT_A];
+	mem_data->pgmajfault_f = ev[PGMAJFAULT_F];
 
 	si_meminfo(&val);
 	mem_data->sharedram = val.sharedram;
diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 408c174ef0d5..ed7100645d25 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -259,7 +259,10 @@ static unsigned int update_balloon_stats(struct virtio_balloon *vb)
 				pages_to_bytes(events[PSWPIN]));
 	update_stat(vb, idx++, VIRTIO_BALLOON_S_SWAP_OUT,
 				pages_to_bytes(events[PSWPOUT]));
-	update_stat(vb, idx++, VIRTIO_BALLOON_S_MAJFLT, events[PGMAJFAULT]);
+	update_stat(vb, idx++, VIRTIO_BALLOON_S_MAJFLT,
+		    events[PGMAJFAULT_S] +
+		    events[PGMAJFAULT_A] +
+		    events[PGMAJFAULT_F]);
 	update_stat(vb, idx++, VIRTIO_BALLOON_S_MINFLT, events[PGFAULT]);
 #endif
 	update_stat(vb, idx++, VIRTIO_BALLOON_S_MEMFREE,
diff --git a/fs/dax.c b/fs/dax.c
index c22eaf162f95..3c92f2af0514 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1200,8 +1200,9 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
 	switch (iomap.type) {
 	case IOMAP_MAPPED:
 		if (iomap.flags & IOMAP_F_NEW) {
-			count_vm_event(PGMAJFAULT);
-			mem_cgroup_count_vm_event(vmf->vma->vm_mm, PGMAJFAULT);
+			count_vm_event(PGMAJFAULT_F);
+			mem_cgroup_count_vm_event(vmf->vma->vm_mm,
+						  PGMAJFAULT_F);
 			major = VM_FAULT_MAJOR;
 		}
 		error = dax_insert_mapping(mapping, iomap.bdev, iomap.dax_dev,
diff --git a/fs/ncpfs/mmap.c b/fs/ncpfs/mmap.c
index 0c3905e0542e..ae04b9d86288 100644
--- a/fs/ncpfs/mmap.c
+++ b/fs/ncpfs/mmap.c
@@ -88,8 +88,8 @@ static int ncp_file_mmap_fault(struct vm_fault *vmf)
 	 * fetches from the network, here the analogue of disk.
 	 * -- nyc
 	 */
-	count_vm_event(PGMAJFAULT);
-	mem_cgroup_count_vm_event(vmf->vma->vm_mm, PGMAJFAULT);
+	count_vm_event(PGMAJFAULT_F);
+	mem_cgroup_count_vm_event(vmf->vma->vm_mm, PGMAJFAULT_F);
 	return VM_FAULT_MAJOR;
 }
 
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index d84ae90ccd5c..2d2df45d4520 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -27,6 +27,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		FOR_ALL_ZONES(PGSCAN_SKIP),
 		PGFREE, PGACTIVATE, PGDEACTIVATE, PGLAZYFREE,
 		PGFAULT, PGMAJFAULT,
+		PGMAJFAULT_S, PGMAJFAULT_A, PGMAJFAULT_F,
 		PGLAZYFREED,
 		PGREFILL,
 		PGSTEAL_KSWAPD,
diff --git a/mm/filemap.c b/mm/filemap.c
index 6f1be573a5e6..d2b187b648b3 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2225,8 +2225,8 @@ int filemap_fault(struct vm_fault *vmf)
 	} else if (!page) {
 		/* No page in the page cache at all */
 		do_sync_mmap_readahead(vmf->vma, ra, file, offset);
-		count_vm_event(PGMAJFAULT);
-		mem_cgroup_count_vm_event(vmf->vma->vm_mm, PGMAJFAULT);
+		count_vm_event(PGMAJFAULT_F);
+		mem_cgroup_count_vm_event(vmf->vma->vm_mm, PGMAJFAULT_F);
 		ret = VM_FAULT_MAJOR;
 retry_find:
 		page = find_get_page(mapping, offset);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 94172089f52f..045361f2b8fa 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3122,6 +3122,8 @@ unsigned int memcg1_events[] = {
 	PGPGOUT,
 	PGFAULT,
 	PGMAJFAULT,
+	PGMAJFAULT_A,
+	PGMAJFAULT_F,
 };
 
 static const char *const memcg1_event_names[] = {
@@ -3129,6 +3131,8 @@ static const char *const memcg1_event_names[] = {
 	"pgpgout",
 	"pgfault",
 	"pgmajfault",
+	"pgmajfault_a",
+	"pgmajfault_f",
 };
 
 static int memcg_stat_show(struct seq_file *m, void *v)
@@ -5229,7 +5233,8 @@ static int memory_stat_show(struct seq_file *m, void *v)
 	/* Accumulated memory events */
 
 	seq_printf(m, "pgfault %lu\n", events[PGFAULT]);
-	seq_printf(m, "pgmajfault %lu\n", events[PGMAJFAULT]);
+	seq_printf(m, "pgmajfault %lu\n", events[PGMAJFAULT_S] +
+			events[PGMAJFAULT_A] + events[PGMAJFAULT_F]);
 
 	seq_printf(m, "workingset_refault %lu\n",
 		   stat[WORKINGSET_REFAULT]);
diff --git a/mm/memory.c b/mm/memory.c
index 6ff5d729ded0..2c2b7b3ffe7f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2718,8 +2718,8 @@ int do_swap_page(struct vm_fault *vmf)
 
 		/* Had to read the page from swap area: Major fault */
 		ret = VM_FAULT_MAJOR;
-		count_vm_event(PGMAJFAULT);
-		mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
+		count_vm_event(PGMAJFAULT_A);
+		mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT_A);
 	} else if (PageHWPoison(page)) {
 		/*
 		 * hwpoisoned dirty swapcache pages are kept for killing
diff --git a/mm/shmem.c b/mm/shmem.c
index e67d6ba4e98e..5eea045575c4 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1644,9 +1644,9 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
 			/* Or update major stats only when swapin succeeds?? */
 			if (fault_type) {
 				*fault_type |= VM_FAULT_MAJOR;
-				count_vm_event(PGMAJFAULT);
+				count_vm_event(PGMAJFAULT_S);
 				mem_cgroup_count_vm_event(charge_mm,
-							  PGMAJFAULT);
+							  PGMAJFAULT_S);
 			}
 			/* Here we actually start the io */
 			page = shmem_swapin(swap, gfp, info, index);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 76f73670200a..741bb14761cd 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -995,6 +995,9 @@ const char * const vmstat_text[] = {
 
 	"pgfault",
 	"pgmajfault",
+	"pgmajfault_s",
+	"pgmajfault_a",
+	"pgmajfault_f",
 	"pglazyfreed",
 
 	"pgrefill",
@@ -1511,6 +1514,8 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos)
 	all_vm_events(v);
 	v[PGPGIN] /= 2;		/* sectors -> kbytes */
 	v[PGPGOUT] /= 2;
+	/* Add up page faults */
+	v[PGMAJFAULT] = v[PGMAJFAULT_S] + v[PGMAJFAULT_A] + v[PGMAJFAULT_F];
 #endif
 	return (unsigned long *)m->private + *pos;
 }
-- 
2.13.0.219.gdb65acc882-goog

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* [PATCH V2] swap: add block io poll in swapin path
From: Shaohua Li @ 2017-05-24 19:27 UTC (permalink / raw)
  To: linux-mm @ kvack . org
  Cc: Andrew Morton, Kernel Team, Tim Chen, Huang Ying, Jens Axboe

For fast flash disks, async IO can introduce overhead because of
context switches. block-mq now supports IO polling, which improves
performance and latency a lot. Swapin is a good place to use this
technique, because the task is waiting for the swapin page to continue
execution.

In my virtual machine, directly reading 4k of data from an NVMe device
with iopoll is about 60% better than without poll. With the swapin
iopoll patch applied, my microbenchmark (a task doing random memory
writes) is about 10% ~ 25% faster. CPU utilization increases a lot
though, to 2x and even 3x. This will depend on disk speed. While iopoll
in swapin isn't intended for all use cases, it's a win for latency
sensitive workloads with a high speed swap disk. The block layer has a
knob to control polling at runtime. If polling isn't enabled in the
block layer, there should be no noticeable change in swapin.

I got a chance to run the same test on an NVMe device with DRAM as the
media. In a simple fio IO test, blkpoll boosts performance by 50% in the
single thread test and by ~20% in the 8 thread test. So this is the
baseline. In the above swap test, blkpoll boosts performance by ~27% in
the single thread test. blkpoll uses 2x CPU time though. If we enable
hybrid polling, the performance gain drops very slightly but CPU time is
only 50% worse than without blkpoll. We can also adjust the hybrid poll
parameter, which reduces the CPU time penalty further. In the 8 thread
test, blkpoll doesn't help though. The performance is similar to that
without blkpoll, and CPU utilization is similar too. There is lock
contention in the swap path. The CPU time spent on blkpoll isn't high.
So overall, blkpoll swapin isn't worse than swapin without it.

The swapin readahead might read in several pages at the same time and
form a big IO request. Since the IO will take longer, it doesn't make
sense to poll, so the patch only does iopoll for single page swapin.
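
As a rough userspace sketch of that rule (should_poll_swapin() is a
made-up name for illustration; the patch itself carries the decision in
a local do_poll flag inside swapin_readahead()), polling is only chosen
when the readahead window is exactly one page:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper, not part of the patch: mirrors how
 * swapin_readahead() derives do_poll from the readahead window size
 * (the value returned by swapin_nr_pages() in the kernel).
 */
static bool should_poll_swapin(unsigned long nr_readahead_pages)
{
	/* A readahead window always covers at least the faulting page. */
	assert(nr_readahead_pages >= 1);

	/* Same computation as "mask = swapin_nr_pages(offset) - 1". */
	unsigned long mask = nr_readahead_pages - 1;

	/* mask == 0: single page swapin, the only case that polls. */
	return mask == 0;
}
```

With a window of 1 page the helper returns true; for any larger window
it returns false, matching the "goto skip" versus "do_poll = false"
split in the diff below.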

V1->V2:
- Don't use PageLocked to quit poll loop, which has races

Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Jens Axboe <axboe@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 include/linux/swap.h |  5 +++--
 mm/madvise.c         |  4 ++--
 mm/page_io.c         | 23 +++++++++++++++++++++--
 mm/swap_state.c      | 10 ++++++----
 mm/swapfile.c        |  2 +-
 5 files changed, 33 insertions(+), 11 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 5ab1c98..b0e7562 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -331,7 +331,7 @@ extern void kswapd_stop(int nid);
 #include <linux/blk_types.h> /* for bio_end_io_t */
 
 /* linux/mm/page_io.c */
-extern int swap_readpage(struct page *);
+extern int swap_readpage(struct page *, bool do_poll);
 extern int swap_writepage(struct page *page, struct writeback_control *wbc);
 extern void end_swap_bio_write(struct bio *bio);
 extern int __swap_writepage(struct page *page, struct writeback_control *wbc,
@@ -362,7 +362,8 @@ extern void free_page_and_swap_cache(struct page *);
 extern void free_pages_and_swap_cache(struct page **, int);
 extern struct page *lookup_swap_cache(swp_entry_t);
 extern struct page *read_swap_cache_async(swp_entry_t, gfp_t,
-			struct vm_area_struct *vma, unsigned long addr);
+			struct vm_area_struct *vma, unsigned long addr,
+			bool do_poll);
 extern struct page *__read_swap_cache_async(swp_entry_t, gfp_t,
 			struct vm_area_struct *vma, unsigned long addr,
 			bool *new_page_allocated);
diff --git a/mm/madvise.c b/mm/madvise.c
index 25b78ee..8eda184 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -205,7 +205,7 @@ static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
 			continue;
 
 		page = read_swap_cache_async(entry, GFP_HIGHUSER_MOVABLE,
-								vma, index);
+							vma, index, false);
 		if (page)
 			put_page(page);
 	}
@@ -246,7 +246,7 @@ static void force_shm_swapin_readahead(struct vm_area_struct *vma,
 		}
 		swap = radix_to_swp_entry(page);
 		page = read_swap_cache_async(swap, GFP_HIGHUSER_MOVABLE,
-								NULL, 0);
+							NULL, 0, false);
 		if (page)
 			put_page(page);
 	}
diff --git a/mm/page_io.c b/mm/page_io.c
index 23f6d0d..d0b78a9 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -117,6 +117,7 @@ static void swap_slot_free_notify(struct page *page)
 static void end_swap_bio_read(struct bio *bio)
 {
 	struct page *page = bio->bi_io_vec[0].bv_page;
+	struct task_struct *waiter = bio->bi_private;
 
 	if (bio->bi_error) {
 		SetPageError(page);
@@ -132,7 +133,9 @@ static void end_swap_bio_read(struct bio *bio)
 	swap_slot_free_notify(page);
 out:
 	unlock_page(page);
+	WRITE_ONCE(bio->bi_private, NULL);
 	bio_put(bio);
+	wake_up_process(waiter);
 }
 
 int generic_swapfile_activate(struct swap_info_struct *sis,
@@ -329,11 +332,13 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
 	return ret;
 }
 
-int swap_readpage(struct page *page)
+int swap_readpage(struct page *page, bool do_poll)
 {
 	struct bio *bio;
 	int ret = 0;
 	struct swap_info_struct *sis = page_swap_info(page);
+	blk_qc_t qc;
+	struct block_device *bdev;
 
 	VM_BUG_ON_PAGE(!PageSwapCache(page), page);
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
@@ -372,9 +377,23 @@ int swap_readpage(struct page *page)
 		ret = -ENOMEM;
 		goto out;
 	}
+	bdev = bio->bi_bdev;
+	bio->bi_private = current;
 	bio_set_op_attrs(bio, REQ_OP_READ, 0);
 	count_vm_event(PSWPIN);
-	submit_bio(bio);
+	bio_get(bio);
+	qc = submit_bio(bio);
+	while (do_poll) {
+		set_current_state(TASK_UNINTERRUPTIBLE);
+		if (!READ_ONCE(bio->bi_private))
+			break;
+
+		if (!blk_mq_poll(bdev_get_queue(bdev), qc))
+			break;
+	}
+	__set_current_state(TASK_RUNNING);
+	bio_put(bio);
+
 out:
 	return ret;
 }
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 9c71b6b..6683c02 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -412,14 +412,14 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
  * the swap entry is no longer in use.
  */
 struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
-			struct vm_area_struct *vma, unsigned long addr)
+			struct vm_area_struct *vma, unsigned long addr, bool do_poll)
 {
 	bool page_was_allocated;
 	struct page *retpage = __read_swap_cache_async(entry, gfp_mask,
 			vma, addr, &page_was_allocated);
 
 	if (page_was_allocated)
-		swap_readpage(retpage);
+		swap_readpage(retpage, do_poll);
 
 	return retpage;
 }
@@ -496,11 +496,13 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	unsigned long start_offset, end_offset;
 	unsigned long mask;
 	struct blk_plug plug;
+	bool do_poll = true;
 
 	mask = swapin_nr_pages(offset) - 1;
 	if (!mask)
 		goto skip;
 
+	do_poll = false;
 	/* Read a page_cluster sized and aligned cluster around offset. */
 	start_offset = offset & ~mask;
 	end_offset = offset | mask;
@@ -511,7 +513,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	for (offset = start_offset; offset <= end_offset ; offset++) {
 		/* Ok, do the async read-ahead now */
 		page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
-						gfp_mask, vma, addr);
+						gfp_mask, vma, addr, false);
 		if (!page)
 			continue;
 		if (offset != entry_offset && likely(!PageTransCompound(page)))
@@ -522,7 +524,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 
 	lru_add_drain();	/* Push any new pages onto the LRU now */
 skip:
-	return read_swap_cache_async(entry, gfp_mask, vma, addr);
+	return read_swap_cache_async(entry, gfp_mask, vma, addr, do_poll);
 }
 
 int init_swap_address_space(unsigned int type, unsigned long nr_pages)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 8a6cdf9..9d7e9ad 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1852,7 +1852,7 @@ int try_to_unuse(unsigned int type, bool frontswap,
 		swap_map = &si->swap_map[i];
 		entry = swp_entry(type, i);
 		page = read_swap_cache_async(entry,
-					GFP_HIGHUSER_MOVABLE, NULL, 0);
+					GFP_HIGHUSER_MOVABLE, NULL, 0, false);
 		if (!page) {
 			/*
 			 * Either swap_duplicate() failed because entry
-- 
2.9.3


^ permalink raw reply related

* Re: [PATCH v4 8/8] mm: rmap: Use correct helper when poisoning hugepages
From: kbuild test robot @ 2017-05-24 19:20 UTC (permalink / raw)
  To: Punit Agrawal
  Cc: kbuild-all, akpm, linux-mm, linux-kernel, linux-arm-kernel,
	catalin.marinas, will.deacon, n-horiguchi, kirill.shutemov,
	mike.kravetz, steve.capper, mark.rutland, linux-arch,
	aneesh.kumar
In-Reply-To: <20170524115409.31309-9-punit.agrawal@arm.com>

[-- Attachment #1: Type: text/plain, Size: 1469 bytes --]

Hi Punit,

[auto build test ERROR on linus/master]
[also build test ERROR on v4.12-rc2 next-20170524]
[cannot apply to mmotm/master]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Punit-Agrawal/Support-for-contiguous-pte-hugepages/20170524-221905
config: x86_64-kexec (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All errors (new ones prefixed by >>):

   mm/rmap.c: In function 'try_to_unmap_one':
>> mm/rmap.c:1386:5: error: implicit declaration of function 'set_huge_swap_pte_at' [-Werror=implicit-function-declaration]
        set_huge_swap_pte_at(mm, address,
        ^~~~~~~~~~~~~~~~~~~~
   cc1: some warnings being treated as errors

vim +/set_huge_swap_pte_at +1386 mm/rmap.c

  1380	
  1381			if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
  1382				pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
  1383				if (PageHuge(page)) {
  1384					int nr = 1 << compound_order(page);
  1385					hugetlb_count_sub(nr, mm);
> 1386					set_huge_swap_pte_at(mm, address,
  1387							     pvmw.pte, pteval,
  1388							     vma_mmu_pagesize(vma));
  1389				} else {

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 25343 bytes --]

^ permalink raw reply

* Re: [PATCH] mm: make kswapd try harder to keep active pages in cache
From: Josef Bacik @ 2017-05-24 18:50 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Josef Bacik, akpm, kernel-team, riel, linux-mm
In-Reply-To: <20170524174610.GB22174@cmpxchg.org>

On Wed, May 24, 2017 at 01:46:10PM -0400, Johannes Weiner wrote:
> Hi Josef,
> 
> On Tue, May 23, 2017 at 10:23:23AM -0400, Josef Bacik wrote:
> > @@ -308,7 +317,8 @@ EXPORT_SYMBOL(unregister_shrinker);
> >  static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
> >  				    struct shrinker *shrinker,
> >  				    unsigned long nr_scanned,
> > -				    unsigned long nr_eligible)
> > +				    unsigned long nr_eligible,
> > +				    unsigned long *slab_scanned)
> 
> Once you pass in pool size ratios here, nr_scanned and nr_eligible
> become confusing. Can you update the names?
> 

Yeah I kept changing them and eventually decided my names were equally as
shitty, so I just left them.  I'll change them to something useful.

> > @@ -2292,6 +2310,15 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
> >  				scan = 0;
> >  			}
> >  			break;
> > +		case SCAN_INACTIVE:
> > +			if (file && !is_active_lru(lru)) {
> > +				if (scan && size > sc->nr_to_reclaim)
> > +					scan = sc->nr_to_reclaim;
> 
> Why is the scan target different than with regular cache reclaim? I'd
> expect that we only need to zero the active list sizes here, not that
> we'd also need any further updates to 'scan'.
> 

Huh I actually screwed this up slightly from what I wanted.  Since

scan = size >> sc->priority

we'd sometimes end up with scan < nr_to_reclaim, but since we're only scanning
inactive we really want to try as hard as possible to reclaim what we need from
inactive.  What I should have done is something like

scan = max(sc->nr_to_reclaim, scan);

instead, I'll fix that.
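
To make the arithmetic concrete, here is a small standalone sketch
(inactive_scan_target() is a hypothetical name, and the plain
parameters stand in for the scan_control fields; this is not the code
from the patch):

```c
#include <assert.h>

/*
 * Hypothetical helper: computes the inactive-list scan target the way
 * the reply above proposes, i.e. max(nr_to_reclaim, size >> priority).
 */
static unsigned long inactive_scan_target(unsigned long size, int priority,
					  unsigned long nr_to_reclaim)
{
	/* Shifting by the full type width or more would be undefined. */
	assert(priority >= 0 && priority < (int)(sizeof(unsigned long) * 8));

	unsigned long scan = size >> priority;

	/* Never aim lower than what we still need to reclaim. */
	return scan > nr_to_reclaim ? scan : nr_to_reclaim;
}
```

For example, with size = 4096, priority = 12 and nr_to_reclaim = 32 the
shift alone gives 1, so the max() raises the target to 32. With
size = 1 << 20 the shift gives 256 and the max() leaves it alone.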

> > @@ -2509,8 +2536,62 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> >  {
> >  	struct reclaim_state *reclaim_state = current->reclaim_state;
> >  	unsigned long nr_reclaimed, nr_scanned;
> > +	unsigned long nr_reclaim, nr_slab, total_high_wmark = 0, nr_inactive;
> > +	int z;
> >  	bool reclaimable = false;
> > +	bool skip_slab = false;
> > +
> > +	nr_slab = sum_zone_node_page_state(pgdat->node_id,
> > +					   NR_SLAB_RECLAIMABLE);
> > +	nr_inactive = node_page_state(pgdat, NR_INACTIVE_FILE);
> > +	nr_reclaim = pgdat_reclaimable_pages(pgdat);
> > +
> > +	for (z = 0; z < MAX_NR_ZONES; z++) {
> > +		struct zone *zone = &pgdat->node_zones[z];
> > +		if (!managed_zone(zone))
> > +			continue;
> > +		total_high_wmark += high_wmark_pages(zone);
> > +	}
> 
> This function is used for memcg target reclaim, in which case you're
> only looking at a subset of the pgdats and zones. Any pgdat or zone
> state read here would be scoped incorrectly; and the ratios on the
> node level are independent from ratios on the cgroup level and can
> diverge heavily from each other.
> 
> These size inquiries to drive the balancing will have to be made
> inside the memcg iteration loop further down with per-cgroup numbers.
> 

Ok so I suppose I need to look at the actual lru list sizes instead for these
numbers for !global_reclaim(sc)?

> > +	/*
> > +	 * If we don't have a lot of inactive or slab pages then there's no
> > +	 * point in trying to free them exclusively, do the normal scan stuff.
> > +	 */
> > +	if (nr_inactive < total_high_wmark && nr_slab < total_high_wmark)
> > +		sc->inactive_only = 0;
> 
> Yes, we need something like this, to know when to fall back to full
> reclaim. Cgroups don't have high watermarks, but I guess some magic
> number for "too few pages" could do the trick.
> 
> > +	/*
> > +	 * We don't have historical information, we can't make good decisions
> > +	 * about ratio's and where we should put pressure, so just apply
> > +	 * pressure based on overall consumption ratios.
> > +	 */
> > +	if (!sc->slab_diff && !sc->inactive_diff)
> > +		sc->inactive_only = 0;
> 
> This one I'm not sure about. If we have enough slabs and inactive
> pages why shouldn't we go for them first anyway - regardless of
> whether they have grown since the last reclaim invocation?
> 

Because we use them for the ratio of where to put pressure, but I suppose I
could just drop this and do

foo = max(sc->slab_diff, 1);
bar = max(sc->inactive_diff, 1);

so if we have no historical information we just equally scan both.  I'll do that
instead.
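
A minimal sketch of that fallback, assuming the two diffs are plain
unsigned counters (struct pressure_ratio and pressure_from_diffs() are
illustrative names only, not something in the patch):

```c
#include <assert.h>

/* Illustrative container for the two sides of the pressure ratio. */
struct pressure_ratio {
	unsigned long slab;
	unsigned long inactive;
};

/*
 * Hypothetical helper: clamp both diffs to at least 1 so that a 0:0
 * history (no information yet) degrades to an even 1:1 split instead
 * of a division by zero or an inactive_only bail-out.
 */
static struct pressure_ratio pressure_from_diffs(unsigned long slab_diff,
						 unsigned long inactive_diff)
{
	struct pressure_ratio r;

	r.slab = slab_diff ? slab_diff : 1;
	r.inactive = inactive_diff ? inactive_diff : 1;

	assert(r.slab >= 1 && r.inactive >= 1);
	return r;
}
```

With no history (0, 0) this yields 1:1, while real growth such as
(300, 100) passes through unchanged as 300:100.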

> > @@ -2543,10 +2626,27 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> >  			shrink_node_memcg(pgdat, memcg, sc, &lru_pages);
> >  			node_lru_pages += lru_pages;
> >  
> > -			if (memcg)
> > -				shrink_slab(sc->gfp_mask, pgdat->node_id,
> > -					    memcg, sc->nr_scanned - scanned,
> > -					    lru_pages);
> > +			/*
> > +			 * We don't want to put a lot of pressure on all of the
> > +			 * slabs if a memcg is mostly full, so use the ratio of
> > +			 * the lru size to the total reclaimable space on the
> > +			 * system.  If we have sc->inactive_only set then we
> > +			 * want to use the ratio of the difference between the
> > +			 * two since the last kswapd run so we apply pressure to
> > +			 * the consumer appropriately.
> > +			 */
> > +			if (memcg && !skip_slab) {
> > +				unsigned long numerator = lru_pages;
> > +				unsigned long denominator = nr_reclaim;
> 
> I don't quite follow this.
> 
> It calculates the share of this memcg's pages on the node, which is
> the ratio we should apply to the global slab pool to have equivalent
> pressure. However, it's being applied to the *memcg's* share of slab
> pages. This means that the smaller the memcg relative to the node, the
> less of its tiny share of slab objects we reclaim.
> 
> We're not translating from fraction to total, we're translating from
> fraction to fraction. Shouldn't the ratio be always 1:1?
> 
> For example, if there is only one cgroup on the node, the ratio would
> be 1 share of LRU pages and 1 share of slab pages. But if there are
> two cgroups, we still scan one share of each cgroup's LRU pages but
> only half a share of each cgroup's slab pages. That doesn't add up.
> 
> Am I missing something?
> 

We hashed this out offline, and basically we agreed to add a memcg-specific
slab reclaimable counter so we can make these ratios consistent with the
global ratios.
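
For reference, the mismatch in the quoted example can be put in numbers
with a toy model. It only captures the proportional scaling step; the
real shrinker math has more terms, and scaled_slab_scan() plus the page
counts are made up for illustration:

```c
#include <assert.h>

/*
 * Toy model: scale a memcg's own slab pages by numerator/denominator,
 * where the patch under review used numerator = memcg lru_pages and
 * denominator = node-wide reclaimable pages.
 */
static unsigned long scaled_slab_scan(unsigned long memcg_slab,
				      unsigned long numerator,
				      unsigned long denominator)
{
	assert(denominator != 0);
	return memcg_slab * numerator / denominator;
}
```

One cgroup owning the whole node (1000 of 1000 LRU pages, 200 slab
pages) gets 200, a full share of its slab. Two equal cgroups (500 of
1000 LRU pages, 100 slab pages each) get 50 each, only half of their
own slab, which is the "fraction to fraction" problem pointed out
above.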

> > +				if (sc->inactive_only) {
> > +					numerator = sc->slab_diff;
> > +					denominator = sc->inactive_diff;
> > +				}
> 
> Shouldn't these diffs already be reflected in the pool sizes? If we
> scan pools proportional to their size, we also go more aggressively
> for the one that grows faster relative to the other one, right?
> 

Sure unless the aggressive growth is from a different cgroup, we want to apply
proportional pressure everywhere.  I suppose that should only be done in the
global reclaim case.

> I guess this *could* be more adaptive to fluctuations, but I wonder if
> that's an optimization that could be split out into a separate patch,
> to make it easier to review on its own merit. Start with a simple size
> based balancing in the first patch, add improved adaptiveness after.
> 
> As mentioned above, this function is used not just by kswapd but also
> by direct reclaim, which doesn't initialize these fields and so always
> passes 0:0. We should be able to retain sensible balancing for them as
> well, but it would require moving the diff sampling.
> 
> To make it work for cgroup reclaim, it would have to look at the
> growths not on the node level, but on the lruvec level in
> get_scan_count() or thereabouts.
> 
> Anyway, I think it might be less confusing to nail down the size-based
> pressure balancing for slab caches first, and then do the recent diff
> balancing on top of it as an optimization.

Yeah I had it all separate but it got kind of weird and hard to tell which part
was needed where.  Now that I've taken a step back I see where I can split it
up, so I'll fix these things and split the patches up.  Thanks!

Josef


^ permalink raw reply

* Re: [RFC PATCH v2 12/17] cgroup: Remove cgroup v2 no internal process constraint
From: Waiman Long @ 2017-05-24 18:19 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, linux-mm, kernel-team, pjt, luto, efault
In-Reply-To: <20170524170527.GH24798@htj.duckdns.org>

On 05/24/2017 01:05 PM, Tejun Heo wrote:
> Hello,
>
> On Mon, May 22, 2017 at 12:56:08PM -0400, Waiman Long wrote:
>> All controllers can use the special sub-directory if userland chooses to
>> do so. The problem that I am trying to address in this patch is to allow
>> more natural hierarchy that reflect a certain purpose, like the task
>> classification done by systemd. Restricting tasks only to leaf nodes
>> makes the hierarchy unnatural and probably difficult to manage.
> I see but how is this different from userland just creating the leaf
> cgroup?  I'm not sure what this actually enables in terms of what can
> be achieved with cgroup.  I suppose we can argue that this is more
> convenient but I'd like to keep the interface orthogonal as much as
> reasonably possible.
>
> Thanks.
>
I am just thinking that it is a bit more natural with the concept of the
special resource domain sub-directory. You are right that the same
effect can be achieved by proper placement of tasks and enabling of
controllers.

A (cpu,memory) [T1] - B(cpu,memory) [T2]
                                  \ cgroups.resource_domain (memory)

A (cpu,memory)  - B(cpu,memory) [T2]
                            \ C (memory) [T1]

With respect to the tasks T1 and T2, the above 2 configurations are the
same.

I am OK to drop this patch. However, I still think the current
no-internal-process constraint is too restrictive. I would suggest either

 1. Allow internal processes and document the way to avoid internal
    process competition as shown above from the userland, or
 2. Mark only certain controllers as not allowing internal processes
    when they are enabled.

What do you think about this?

Cheers,
Longman


^ permalink raw reply

* Re: [RFC PATCH v2 13/17] cgroup: Allow fine-grained controllers control in cgroup v2
From: Waiman Long @ 2017-05-24 18:17 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, linux-mm, kernel-team, pjt, luto, efault
In-Reply-To: <20170524175600.GL24798@htj.duckdns.org>

On 05/24/2017 01:56 PM, Tejun Heo wrote:
> Hello,
>
> On Wed, May 24, 2017 at 01:49:46PM -0400, Waiman Long wrote:
>> What I am saying is as follows:
>>     / A
>> P - B
>>    \ C
>>
>> # echo +memory > P/cgroup.subtree_control
>> # echo -memory > P/A/cgroup.controllers
>> # echo "#memory" > P/B/cgroup.controllers
>>
>> The parent grants the memory controller to its children - A, B and C.
>> Child A has the memory controller explicitly disabled. Child B has the
>> memory controller in pass-through mode, while child C has the memory
>> controller enabled by default. "echo +memory > cgroup.controllers" is
>> not allowed. There are 2 possible choices with regard to the '-' or '#'
>> prefixes. We can allow them before the grant from the parent or only
>> after that. In the former case, the state remains dormant until after
>> the grant from the parent.
> Ah, I see, you want cgroup.controllers to be able to mask available
> controllers by the parent.  Can you expand your example with further
> nesting and how #memory on cgroup.controllers would affect the nested
> descendant?
>
> Thanks.
>
I would allow enabling the controller in subtree_control if granted from
the parent and not explicitly disabled. IOW, both B and C can "echo
+memory" to their subtree_control to grant memory controller to their
children, but not A. A has to re-enable memory controller or set it to
pass-through mode before it can enable it in subtree_control. I need to
clarify that "echo +memory > cgroup.controllers" is allowed to re-enable
it, but not without the granting from its parent.
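
Put as a small truth table in C (purely a model of the proposed
semantics; the enum, the function name and the parameters are invented
here and do not correspond to existing cgroup code):

```c
#include <assert.h>
#include <stdbool.h>

/* Per-child controller state as proposed for cgroup.controllers. */
enum ctrl_state {
	CTRL_ENABLED,		/* default once the parent grants it */
	CTRL_DISABLED,		/* "-memory" */
	CTRL_PASSTHROUGH,	/* "#memory" */
};

/*
 * Hypothetical predicate: may this cgroup write "+memory" to its own
 * subtree_control?  Only if the parent granted the controller and the
 * cgroup has not explicitly disabled it.
 */
static bool can_enable_in_subtree_control(bool granted_by_parent,
					  enum ctrl_state state)
{
	assert(state == CTRL_ENABLED || state == CTRL_DISABLED ||
	       state == CTRL_PASSTHROUGH);

	return granted_by_parent && state != CTRL_DISABLED;
}
```

In the P/A/B/C example, B (pass-through) and C (enabled) pass the
check while A (disabled) does not, and nothing passes without the
grant from P.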

Cheers,
Longman



^ permalink raw reply

* Re: [RFC PATCH v2 12/17] cgroup: Remove cgroup v2 no internal process constraint
From: Waiman Long @ 2017-05-24 18:09 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, linux-mm, kernel-team, pjt, luto, efault
In-Reply-To: <20170524170527.GH24798@htj.duckdns.org>

[-- Attachment #1: Type: text/plain, Size: 1746 bytes --]

On 05/24/2017 01:05 PM, Tejun Heo wrote:
> Hello,
>
> On Mon, May 22, 2017 at 12:56:08PM -0400, Waiman Long wrote:
>> All controllers can use the special sub-directory if userland chooses to
>> do so. The problem that I am trying to address in this patch is to allow
>> more natural hierarchy that reflect a certain purpose, like the task
>> classification done by systemd. Restricting tasks only to leaf nodes
>> makes the hierarchy unnatural and probably difficult to manage.
> I see but how is this different from userland just creating the leaf
> cgroup?  I'm not sure what this actually enables in terms of what can
> be achieved with cgroup.  I suppose we can argue that this is more
> convenient but I'd like to keep the interface orthogonal as much as
> reasonably possible.
>
> Thanks.
>
I am just thinking that it is a bit more natural with the concept of the
special resource domain sub-directory. You are right that the same
effect can be achieved by proper placement of tasks and enabling of
controllers.

A (cpu,memory) [T1] - B(cpu,memory) [T2]
                                  \ cgroups.resource_domain (memory)

A (cpu,memory)  - B(cpu,memory) [T2]
                            \ C (memory) [T1]

With respect to the tasks T1 and T2, the above 2 configurations are the
same.

I am OK to drop this patch. However, I still think the current
no-internal-process constraint is too restrictive. I would suggest either

 1. Allow internal processes and document the way to avoid internal
    process competition as shown above from the userland, or
 2. Mark only certain controllers as not allowing internal processes
    when they are enabled.

What do you think about this?

Cheers,
Longman

[-- Attachment #2: Type: text/html, Size: 2347 bytes --]

^ permalink raw reply

* Re: [RFC PATCH v2 13/17] cgroup: Allow fine-grained controllers control in cgroup v2
From: Tejun Heo @ 2017-05-24 17:56 UTC (permalink / raw)
  To: Waiman Long
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, linux-mm, kernel-team, pjt, luto, efault
In-Reply-To: <29bc746d-f89b-3385-fd5c-314bcd22f9f7@redhat.com>

Hello,

On Wed, May 24, 2017 at 01:49:46PM -0400, Waiman Long wrote:
> What I am saying is as follows:
>     / A
> P - B
>    \ C
> 
> # echo +memory > P/cgroup.subtree_control
> # echo -memory > P/A/cgroup.controllers
> # echo "#memory" > P/B/cgroup.controllers
> 
> The parent grants the memory controller to its children - A, B and C.
> Child A has the memory controller explicitly disabled. Child B has the
> memory controller in pass-through mode, while child C has the memory
> controller enabled by default. "echo +memory > cgroup.controllers" is
> not allowed. There are 2 possible choices with regard to the '-' or '#'
> prefixes. We can allow them before the grant from the parent or only
> after that. In the former case, the state remains dormant until after
> the grant from the parent.

Ah, I see, you want cgroup.controllers to be able to mask available
controllers by the parent.  Can you expand your example with further
nesting and how #memory on cgroup.controllers would affect the nested
descendant?

Thanks.

-- 
tejun


^ permalink raw reply

* Re: [HMM 00/15] HMM (Heterogeneous Memory Management) v22
From: Jerome Glisse @ 2017-05-24 17:53 UTC (permalink / raw)
  To: Balbir Singh
  Cc: akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-mm,
	John Hubbard, David Nellans
In-Reply-To: <CAKTCnzn2rTnqq62JY3GfAd7SCv1PChTrHSB6ikJzdjNzXC9cGA@mail.gmail.com>

On Wed, May 24, 2017 at 11:55:12AM +1000, Balbir Singh wrote:
> On Tue, May 23, 2017 at 2:51 AM, Jerome Glisse <jglisse@redhat.com> wrote:
> > Patchset is on top of mmotm mmotm-2017-05-18, git branch:
> >
> > https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v22
> >
> > Change since v21 is adding back special refcounting in put_page() to
> > catch when a ZONE_DEVICE page is free (refcount going from 2 to 1
> > unlike regular page where a refcount of 0 means the page is free).
> > See patch 8 of this serie for this refcounting. I did not use static
> > keys because it kind of scares me to do that for an inline function.
> > If people strongly feel about this i can try to make static key works
> > here. Kirill will most likely want to review this.
> >
> >
> > Everything else is the same. Below is the long description of what HMM
> > is about and why. At the end of this email i describe briefly each patch
> > and suggest reviewers for each of them.
> >
> >
> > Heterogeneous Memory Management (HMM) (description and justification)
> >
> 
> Thanks for the patches! These patches are very helpful. There are a
> few additional things we would need on top of this (once HMM the base
> is merged)
> 
> 1. Support for other architectures, we'd like to make sure we can get
> this working for powerpc for example. As a first step we have
> ZONE_DEVICE enablement patches, but I think we need some additional
> patches for iomem space searching and memory hotplug, IIRC
> 2. HMM-CDM and physical address based migration bits. In a recent RFC
> we decided to try and use the HMM CDM route as a route to implementing
> coherent device memory as a starting point. It would be nice to have
> those patches on top of these once these make it to mm -
> https://lwn.net/Articles/720380/
> 

I intend to post the updated HMM CDM patchset early next week. I am
tied up in a couple of internal backports, but I should be able to
resume work on that this week.

Cheers,
Jerome


^ permalink raw reply

* Re: [RFC PATCH v2 13/17] cgroup: Allow fine-grained controllers control in cgroup v2
From: Waiman Long @ 2017-05-24 17:49 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups,
	linux-kernel, linux-doc, linux-mm, kernel-team, pjt, luto, efault
In-Reply-To: <20170524173144.GI24798@htj.duckdns.org>

On 05/24/2017 01:31 PM, Tejun Heo wrote:
> Hello, Waiman.
>
> On Fri, May 19, 2017 at 05:20:01PM -0400, Waiman Long wrote:
>>> This breaks the invariant that in a cgroup its resource control knobs
>>> control distribution of resources from its parent.  IOW, the resource
>>> control knobs of a cgroup always belong to the parent.  This is also
>>> reflected in how delegation is done.  The delegatee assumes ownership
>>> of the cgroup itself and the ability to manage sub-cgroups but doesn't
>>> get the ownership of the resource control knobs as otherwise the
>>> parent would lose control over how it distributes its resources.
>> One twist that I am thinking is to have a controller enabled by the
>> parent in subtree_control, but then allow the child to either disable it
>> or set it in pass-through mode by writing to controllers file. IOW, a
>> child cannot enable a controller without parent's permission. Once a
>> child has permission, it can do whatever it wants. A parent cannot force
>> a child to have a controller enabled.
> Heh, I think I need more details to follow your proposal.  Anyways,
> what we need to guarantee is that a descendant is never allowed to
> pull in more resources than its ancestors want it to.

What I am saying is as follows:
    / A
P - B
   \ C

# echo +memory > P/cgroup.subtree_control
# echo -memory > P/A/cgroup.controllers
# echo "#memory" > P/B/cgroup.controllers

The parent grants the memory controller to its children - A, B and C.
Child A has the memory controller explicitly disabled. Child B has the
memory controller in pass-through mode, while child C has the memory
controller enabled by default. "echo +memory > cgroup.controllers" is
not allowed. There are 2 possible choices with regard to the '-' or '#'
prefixes. We can allow them before the grant from the parent or only
after that. In the former case, the state remains dormant until after
the grant from the parent.
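
As a rough illustration only, the proposed semantics can be modelled in
user space as below. This is a sketch of the RFC idea, not existing
kernel behaviour; the enum and function names are made up for the
example, and it assumes the "former case", where a '-' or '#' written
before the parent's grant stays dormant.

```c
#include <stdbool.h>

/* What a child wrote to its own (proposed) cgroup.controllers file. */
enum child_pref {
	PREF_DEFAULT,      /* nothing written: follow the parent's grant */
	PREF_DISABLE,      /* "-memory" */
	PREF_PASSTHROUGH,  /* "#memory" */
};

/* Resulting state of the controller in the child (hypothetical names). */
enum ctrl_state {
	CTRL_NOT_GRANTED,  /* parent has not enabled it in subtree_control */
	CTRL_ENABLED,
	CTRL_DISABLED,
	CTRL_PASSTHROUGH,
};

enum ctrl_state effective_state(bool granted_by_parent, enum child_pref pref)
{
	/*
	 * A child can never enable a controller without the parent's
	 * grant; a preference written early simply stays dormant.
	 */
	if (!granted_by_parent)
		return CTRL_NOT_GRANTED;

	switch (pref) {
	case PREF_DISABLE:
		return CTRL_DISABLED;
	case PREF_PASSTHROUGH:
		return CTRL_PASSTHROUGH;
	case PREF_DEFAULT:
	default:
		return CTRL_ENABLED;
	}
}
```

With the parent's grant in place, A (PREF_DISABLE) ends up disabled, B
(PREF_PASSTHROUGH) in pass-through mode, and C (PREF_DEFAULT) enabled.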

Cheers,
Longman


^ permalink raw reply

* Re: [PATCH 1/1] Sealable memory support
From: Igor Stoppa @ 2017-05-24 17:45 UTC (permalink / raw)
  To: Kees Cook
  Cc: Casey Schaufler, Michal Hocko, Dave Hansen, Laura Abbott,
	Linux-MM, kernel-hardening@lists.openwall.com, LKML, Daniel Micay,
	Greg KH, James Morris, Stephen Smalley
In-Reply-To: <CAGXu5jK25XvX4vSODg7rkdBPj_FzveUSODFUKu1=KatmKhFVzg@mail.gmail.com>



On 23/05/17 23:11, Kees Cook wrote:
> On Tue, May 23, 2017 at 2:43 AM, Igor Stoppa <igor.stoppa@huawei.com> wrote:

[...]

> I would want hardened usercopy support as a requirement for using
> smalloc(). Without it, we're regressing the over-read protection that
> already exists for slab objects, if kernel code switched from slab to
> smalloc. It should be very similar to the existing slab checks. "Is
> this a smalloc object? Have we read beyond the end of a given object?"
> etc. The metadata is all there, except for an efficient way to mark a
> page as a smalloc page, but I think that just requires a new Page$Name
> bit test, as done for slab.

ok
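
For reference, the kind of bounds test being asked for could look
roughly like the user-space sketch below. The names smalloc_obj and
smalloc_copy_ok are hypothetical and not part of the posted patch, and
the step of recognising a page as a smalloc page is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-object metadata: start address and size in bytes. */
struct smalloc_obj {
	uintptr_t start;
	size_t size;
};

/*
 * Return true only if the n bytes starting at ptr lie entirely inside
 * obj. Written with subtractions so that ptr + n cannot overflow.
 */
bool smalloc_copy_ok(const struct smalloc_obj *obj, uintptr_t ptr, size_t n)
{
	size_t offset;

	assert(obj != NULL);

	if (ptr < obj->start)
		return false;		/* starts before the object */

	offset = ptr - obj->start;
	if (offset > obj->size)
		return false;		/* starts past the end */

	return n <= obj->size - offset;	/* must not read beyond the end */
}
```

A copy that starts inside an object but runs one byte past its end is
rejected, which is the over-read case mentioned above.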

[...]

> I meant this:
> 
> CPU 1     CPU 2
> create
> alloc
> write
> seal
> ...
> unseal
>                 write
> write
> seal
> 
> The CPU 2 write would be, for example, an attacker using a
> vulnerability to attempt to write to memory in the sealed area. All it
> would need to do to succeed would be to trigger an action in the
> kernel that would do a "legitimate" write (which requires the unseal),
> and race it. Unsealing should be CPU-local, if the API is going to
> support this kind of access.

I see.
If the CPU1 were to forcibly halt anything that can race with it, then
it would be sure that there was no interference.
A reactive approach could be, instead, to re-validate the content after
the sealing, assuming that it is possible.
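
The interleaving in the diagram can be mimicked with a toy,
single-threaded model. This is only a sketch: struct region and
region_write() are invented for the example, and real page permissions
are reduced to two fields, one for a global unseal and one for a
hypothetical CPU-local unseal.

```c
#include <stdbool.h>

#define NO_CPU (-1)

/* Toy model of a sealed region: one value plus its protection state. */
struct region {
	int value;
	bool globally_unsealed;	/* unseal visible to every CPU */
	int unsealed_cpu;	/* CPU holding a local unseal, or NO_CPU */
};

/* Attempt a write from the given CPU; returns true if it lands. */
bool region_write(struct region *r, int cpu, int value)
{
	if (!r->globally_unsealed && r->unsealed_cpu != cpu)
		return false;	/* would fault on read-only memory */

	r->value = value;
	return true;
}
```

While CPU 1 holds a global unseal, region_write() from CPU 2 succeeds,
which is the race described above. With unsealed_cpu set to 1 instead,
the same write from CPU 2 is refused while CPU 1 can still update the
value.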

[...]

> I am more concerned about _any_ unseal after initial seal. And even
> then, it'd be nice to keep things CPU-local. My concerns are related
> to the write-rarely proposal (https://lkml.org/lkml/2017/3/29/704)
> which is kind of like this, but focused on the .data section, not
> dynamic memory. It has similar concerns about CPU-locality.
> Additionally, even writing to memory and then making it read-only
> later runs risks (see threads about BPF JIT races vs making things
> read-only: https://patchwork.kernel.org/patch/9662653/ Alexei's NAK
> doesn't change the risk this series is fixing: races with attacker
> writes during assignment but before read-only marking).

If you are talking about an attacker, rather than protection against
accidental overwrites, how can hashing be enough?
Couldn't the attacker compromise that too?

> So, while smalloc would hugely reduce the window an attacker has
> available to change data contents, this API doesn't eliminate it. (To
> eliminate it, there would need to be a CPU-local page permission view
> that let only the current CPU write to the page, and then restore it to
> read-only to match the global read-only view.)

That or, if one is ready to take the hit, freeze every other possible
attack vector. But I'm not sure this could be justifiable.

[...]

> Ah! In that case, sure. This isn't what the proposed API provided,
> though, so let's adjust it to only perform the unseal at destroy time.
> That makes it much saner, IMO. "Write once" dynamic allocations, or
> "read-only after seal". woalloc? :P yay naming

For now I'm still using smalloc.
Anything that is either [x]malloc or [yz]malloc is fine, lengthwise.
Other options might require some re-formatting.

[...]

> I think a shared global pool would need to be eliminated for true
> write-once semantics.

ok

[...]

>> I'd rather not add extra locking to something that doesn't need it:
>> Allocate - write - seal - read, read, read, ... - unseal - destroy.
> 
> Yup, I would be totally fine with this. It still has a race between
> allocate and seal, but it's a huge improvement over the existing state
> of the world where all dynamic memory is writable. :)

Great!
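
To pin down the agreed sequence (allocate, write, seal, read, and
unseal only as part of destroy), here is a small state-machine sketch.
It is illustrative only; the pool_* names and the three states are
assumptions made for the example, not the API of the patch.

```c
#include <stdbool.h>

enum pool_state { POOL_WRITABLE, POOL_SEALED, POOL_DESTROYED };

struct pool {
	enum pool_state state;
};

/* Writes are legal only between allocation and the one-time seal. */
bool pool_write_allowed(const struct pool *p)
{
	return p->state == POOL_WRITABLE;
}

/* Reads are legal until the pool is destroyed. */
bool pool_read_allowed(const struct pool *p)
{
	return p->state != POOL_DESTROYED;
}

/* Seal once; there is no way back to POOL_WRITABLE. */
bool pool_seal(struct pool *p)
{
	if (p->state != POOL_WRITABLE)
		return false;
	p->state = POOL_SEALED;
	return true;
}

/* The only "unseal" is the implicit one performed at destroy time. */
void pool_destroy(struct pool *p)
{
	p->state = POOL_DESTROYED;
}
```

The model deliberately has no pool_unseal(): once sealed, the only
remaining transition is destruction.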


[...]

> Ah, okay. Most of the LSM is happily covered by __ro_after_init. If we
> could just drop the runtime disabling of SELinux, we'd be fine.

I am not sure I understand this point.
If the kernel is properly configured, the master toggle variable
disappears, right?
Or do you mean the disabling through modifications of the linked list of
the hooks?

[...]


> Hm, I just meant add a char[] to the metadata and pass it in during
> create(). Then it's possible to report which smalloc cache is being
> examined during hardened usercopy checks.

Ok, that is not a big deal.
wrt this, I have spent some time writing a debug module, which currently
dumps into a debugfs entry a bunch of info about the various pools.
I could split it across multiple entries, using the label to generate
their names.
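
A minimal sketch of what that could look like, in user space. The
16-byte label size, struct pool_meta, and the "smalloc_" prefix for the
entry name are all assumptions for the sake of the example.

```c
#include <stdio.h>
#include <string.h>

#define POOL_LABEL_LEN 16	/* assumed size of the char[] in the metadata */

/* Hypothetical pool metadata carrying the label passed at create time. */
struct pool_meta {
	char label[POOL_LABEL_LEN];
};

/* Store the label, truncating over-long input and always NUL-terminating. */
void pool_meta_init(struct pool_meta *m, const char *label)
{
	size_t n = strlen(label);

	if (n >= sizeof(m->label))
		n = sizeof(m->label) - 1;
	memcpy(m->label, label, n);
	m->label[n] = '\0';
}

/*
 * Build a per-pool entry name from the label. Returns the snprintf()
 * result, i.e. the length the full name would need.
 */
int pool_entry_name(const struct pool_meta *m, char *buf, size_t len)
{
	return snprintf(buf, len, "smalloc_%s", m->label);
}
```

The same label could then be used both in usercopy reports and as the
name of the per-pool debug entry.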

[...]

> It seems like smalloc pools could also be refcounted?

I am not sure what you mean.
What do you want to count?
Number of pools? Nodes per pool? Allocations per node?

And what for?

At least in the case of tearing down a pool, when a module is unloaded,
nobody needs to free anything that was allocated with smalloc.
The teardown function will free the pages from each node.

Is this the place where you think there should be a check on the number
of pages freed?

[...]

>>>> +#define NODE_HEADER                                    \
>>>> +       struct {                                        \
>>>> +               __SMALLOC_ALIGNED__ struct {            \
>>>> +                       struct list_head list;          \
>>>> +                       align_t *free;                  \
>>>> +                       unsigned long available_words;  \
>>>> +               };                                      \
>>>> +       }
>>
>> Does this look ok? ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 
> It's probably a sufficient starting point, depending on how the API
> shakes out. Without unseal-write-seal properties, I care much less
> about redzoning, etc.

ok, but my question (I am not sure if it was clear) was about the use of
a macro for the nameless structure that contains the header.
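
To show the pattern in isolation, here is a user-space sketch of one
way such a macro can be used: once to measure the header, and once as
an anonymous member so the fields are reachable directly from the node.
The list_head stand-in, the 4096-byte node size, and the align_t
typedef are assumptions, and __SMALLOC_ALIGNED__ is left out because
its definition is not shown here.

```c
#include <stddef.h>

typedef unsigned long align_t;		/* assumed word type */
#define NODE_SIZE 4096			/* assumed: one page per node */

/* User-space stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

#define NODE_HEADER					\
	struct {					\
		struct list_head list;			\
		align_t *free;				\
		unsigned long available_words;		\
	}

/* First use: a named type, only to measure the header. */
typedef NODE_HEADER node_header_t;

#define NODE_DATA_WORDS \
	((NODE_SIZE - sizeof(node_header_t)) / sizeof(align_t))

/* Second use: anonymous member (C11), so node->free works directly. */
struct node {
	NODE_HEADER;
	align_t data[NODE_DATA_WORDS];
};
```

With these definitions sizeof(struct node) comes out at NODE_SIZE, and
the payload starts right after the header, without repeating the field
list in two places.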

[...]

> Well, a poor example would be struct sock, since it needs to be
> regularly written to, but it has function pointers near the end which
> have been a very common target for attackers. (Though this is less so
> now that INET_DIAG no longer exposes the kernel addresses to allocated
> struct socks.)

Ok, this could be the scope for a further set of patches, after this one
is done.



One more thing: how should I tie this allocator to the rest?
I have verified that it seems to work with both SLUB and SLAB.
Can I make it depend on either of them being enabled?

Should it be optionally enabled?
What to default to, if it's not enabled? vmalloc?

---
thanks, igor


^ permalink raw reply
