Linux-mm Archive on lore.kernel.org

* Yet another page allocation stall on 4.9
From: Cong Wang @ 2017-05-24 22:55 UTC (permalink / raw)
  To: linux-mm

Hello, mm experts


I know there are at least two similar reports of page allocation stalls
on 4.9, but I am not sure whether they all share the same cause, nor could
I find any fix for the problem.

Below is the one we got when running the LTP memcg_stress test with 150
memcg groups, each with 0.5G of memory, on a host with 64G of memory. So
far, we have not been able to reproduce it.

Please let me know if I can provide any other information you need.

Thanks.

[16211.987039]  [<ffffffff86395ab7>] dump_stack+0x4d/0x66^M
[16211.997600]  [<ffffffff8619a6c6>] warn_alloc+0x116/0x130^M
[16212.017235]  [<ffffffff8619b0bf>] __alloc_pages_slowpath+0x96f/0xbd0^M
[16212.037413]  [<ffffffff8619b4f6>] __alloc_pages_nodemask+0x1d6/0x230^M
[16212.057215]  [<ffffffff861e61d5>] alloc_pages_current+0x95/0x140^M
[16212.077023]  [<ffffffff8619114a>] __page_cache_alloc+0xca/0xe0^M
[16212.087943]  [<ffffffff86194312>] filemap_fault+0x312/0x4d0^M
[16212.107591]  [<ffffffff862a8196>] ext4_filemap_fault+0x36/0x50^M
[16212.127232]  [<ffffffff861c2c71>] __do_fault+0x71/0x130^M
[16212.146862]  [<ffffffff861c6cec>] handle_mm_fault+0xebc/0x13a0^M
[16212.166836]  [<ffffffff860509e4>] __do_page_fault+0x254/0x4a0^M
[16212.177664]  [<ffffffff86050c50>] do_page_fault+0x20/0x70^M
[16212.197438]  [<ffffffff86700aa2>] page_fault+0x22/0x30^M
[16212.217026] CPU: 4 PID: 3872 Comm: scribed Not tainted
4.9.23.el7.twitter.x86_64 #1^M
[16212.217035] Mem-Info:^M
[16212.217041] active_anon:16069537 inactive_anon:5561 isolated_anon:0^M
[16212.217041]  active_file:1301 inactive_file:1449 isolated_file:0^M
[16212.217041]  unevictable:0 dirty:0 writeback:0 unstable:0^M
[16212.217041]  slab_reclaimable:22962 slab_unreclaimable:79806^M
[16212.217041]  mapped:6434 shmem:6365 pagetables:34668 bounce:0^M
[16212.217041]  free:161016 free_pcp:955 free_cma:0^M
[16212.217047] Node 0 active_anon:31718548kB inactive_anon:8728kB
active_file:4988kB inactive_file:5584kB unevictable:0kB
isolated(anon):0kB isolated(file):0kB mapped:11524kB dirty:0kB
writeback:0kB shmem:8832kB shmem_thp: 0kB shmem_pmdmapped: 0kB
anon_thp: 0kB writeback_tmp:0kB unstable:0kB pages_scanned:0
all_unreclaimable? no^M
[16212.217051] Node 1 active_anon:32559600kB inactive_anon:13516kB
active_file:216kB inactive_file:212kB unevictable:0kB
isolated(anon):0kB isolated(file):0kB mapped:14208kB dirty:0kB
writeback:0kB shmem:16628kB shmem_thp: 0kB shmem_pmdmapped: 0kB
anon_thp: 0kB writeback_tmp:0kB unstable:0kB pages_scanned:683486
all_unreclaimable? yes^M
[16212.217056] Node 0 DMA free:15888kB min:20kB low:32kB high:44kB
active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB
unevictable:0kB writepending:0kB present:15972kB managed:15888kB
mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB
kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB
free_cma:0kB^M
[16212.217058] lowmem_reserve[]: 0 1903 32095 32095^M
[16212.217062] Node 0 DMA32 free:123344kB min:2668kB low:4616kB
high:6564kB active_anon:1820276kB inactive_anon:0kB active_file:0kB
inactive_file:0kB unevictable:0kB writepending:0kB present:2015240kB
managed:1949672kB mlocked:0kB slab_reclaimable:64kB
slab_unreclaimable:2216kB kernel_stack:0kB pagetables:3544kB
bounce:0kB free_pcp:120kB local_pcp:0kB free_cma:0kB^M
[16212.217064] lowmem_reserve[]: 0 0 30191 30191^M
[16212.217068] Node 0 Normal free:460088kB min:42308kB low:73224kB
high:104140kB active_anon:29898272kB inactive_anon:8728kB
active_file:4988kB inactive_file:5584kB unevictable:0kB
writepending:0kB present:31457280kB managed:30916476kB mlocked:0kB
slab_reclaimable:52244kB slab_unreclaimable:172728kB
kernel_stack:5976kB pagetables:64740kB bounce:0kB free_pcp:2588kB
local_pcp:120kB free_cma:0kB^M
[16212.217070] lowmem_reserve[]: 0 0 0 0^M
[16212.217074] Node 1 Normal free:44744kB min:45108kB low:78068kB
high:111028kB active_anon:32559600kB inactive_anon:13516kB
active_file:216kB inactive_file:212kB unevictable:0kB writepending:0kB
present:33554432kB managed:32962516kB mlocked:0kB
slab_reclaimable:39540kB slab_unreclaimable:144280kB
kernel_stack:5208kB pagetables:70388kB bounce:0kB free_pcp:1112kB
local_pcp:0kB free_cma:0kB^M
[16212.217075] lowmem_reserve[]: 0 0 0 0^M
[16212.217083] Node 0 DMA: 0*4kB 0*8kB 1*16kB (U) 0*32kB 2*64kB (U)
1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M)
= 15888kB^M
[16212.217093] Node 0 DMA32: 30*4kB (U) 13*8kB (UM) 9*16kB (UM)
17*32kB (UM) 5*64kB (UME) 2*128kB (ME) 2*256kB (UE) 3*512kB (UE)
3*1024kB (UE) 3*2048kB (UME) 27*4096kB (M) = 123344kB^M
[16212.217101] Node 0 Normal: 1259*4kB (UMEH) 3848*8kB (UMEH)
3612*16kB (UMEH) 3740*32kB (UMEH) 3581*64kB (UMEH) 41*128kB (MEH)
12*256kB (UEH) 2*512kB (ME) 2*1024kB (E) 3*2048kB (UME) 0*4096kB =
460012kB^M
[16212.217109] Node 1 Normal: 520*4kB (UMEH) 135*8kB (UMEH) 69*16kB
(UMEH) 31*32kB (UME) 161*64kB (UMEH) 170*128kB (UM) 29*256kB (U)
0*512kB 0*1024kB 0*2048kB 0*4096kB = 44744kB^M
[16212.217110] Node 0 hugepages_total=0 hugepages_free=0
hugepages_surp=0 hugepages_size=1048576kB^M
[16212.217111] Node 0 hugepages_total=0 hugepages_free=0
hugepages_surp=0 hugepages_size=2048kB^M
[16212.217112] Node 1 hugepages_total=0 hugepages_free=0
hugepages_surp=0 hugepages_size=1048576kB^M
[16212.217112] Node 1 hugepages_total=0 hugepages_free=0
hugepages_surp=0 hugepages_size=2048kB^M
[16212.217113] 9058 total pagecache pages^M
[16212.217114] 0 pages in swap cache^M
[16212.217114] Swap cache stats: add 0, delete 0, find 0/0^M
[16212.217115] Free swap  = 0kB^M
[16212.217115] Total swap = 0kB^M
[16212.217116] 16760731 pages RAM^M
[16212.217116] 0 pages HighMem/MovableOnly^M
[16212.217117] 299593 pages reserved^M
[16212.217117] 13 pages hwpoisoned^M
[16213.387131] Hardware name: Dell Inc. PowerEdge C6220/03C9JJ, BIOS
2.2.3 11/07/2013^M
[16213.407413]  ffffaac5cd0bba88 ffffffff86395ab7 ffffffff86a3b280
0000000000000001^M
[16213.436908]  ffffaac5cd0bbb08 ffffffff8619a6c6 024201cacd0bbaf0
ffffffff86a3b280^M
[16213.457248]  ffffaac5cd0bbab0 0100000000000010 ffffaac5cd0bbb18
ffffaac5cd0bbac8^M
[16213.477525] Call Trace:^M
[16213.487314]  [<ffffffff86395ab7>] dump_stack+0x4d/0x66^M
[16213.497723]  [<ffffffff8619a6c6>] warn_alloc+0x116/0x130^M
[16213.505627] NMI watchdog: BUG: soft lockup - CPU#5 stuck for 23s!
[cleanup:7598]^M
[16213.505710] Modules linked in: dummy veth tun xfs libcrc32c
intel_rapl sb_edac edac_core x86_pkg_temp_thermal coretemp
crct10dif_pclmul crc32_pclmul iTCO_wdt iTCO_vendor_support dcdbas
ghash_clmulni_intel lpc_ich i2c_i801 hed wmi i2c_smbus shpchp i2c_core
ioatdma dca acpi_cpufreq tcp_diag inet_diag ipmi_si ipmi_devintf
ipmi_msghandler sch_fq_codel mlx4_en ptp pps_core crc32c_intel
mlx4_core devlink ipv6 crc_ccitt^M
[16213.505713] CPU: 5 PID: 7598 Comm: cleanup Not tainted
4.9.23.el7.twitter.x86_64 #1^M
[16213.505714] Hardware name: Dell Inc. PowerEdge C6220/03C9JJ, BIOS
2.2.3 11/07/2013^M
[16213.505717] task: ffff8af1bf098000 task.stack: ffffaac5e60c8000^M
[16213.505722] RIP: 0010:[<ffffffff86395a93>]  [<ffffffff86395a93>]
dump_stack+0x29/0x66^M
[16213.505724] RSP: 0000:ffffaac5e60cba78  EFLAGS: 00000286^M
[16213.505727] RAX: 0000000000000004 RBX: 0000000000000286 RCX:
00000000ffffffff^M
[16213.505729] RDX: 0000000000000005 RSI: 0000000000000292 RDI:
ffffffff86c51be0^M
[16213.505730] RBP: ffffaac5e60cba88 R08: 0000000000000000 R09:
0000000000000031^M
[16213.505733] R10: 0000000000000000 R11: 0000000003cd5438 R12:
0000000000000001^M
[16213.505737] R13: ffffffff86d43c80 R14: ffff8af1bf098000 R15:
ffffaac5e60cbc40^M
[16213.505742] FS:  00007ff77f22e840(0000) GS:ffff8af1dfb40000(0000)
knlGS:0000000000000000^M
[16213.505746] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M
[16213.505748] CR2: 00007fe7603d0650 CR3: 000000022db50000 CR4:
00000000000406e0^M
[16213.505750] Stack:^M
[16213.505767]  ffffffff86a3b280 0000000000000001 ffffaac5e60cbb08
ffffffff8619a6c6^M
[16213.505780]  024201cae60cbaf0 ffffffff86a3b280 ffffaac5e60cbab0
0100000000000010^M
[16213.505792]  ffffaac5e60cbb18 ffffaac5e60cbac8 ffff8af1bf098000
0000000000000000^M
[16213.505795] Call Trace:^M
[16213.505799]  [<ffffffff8619a6c6>] warn_alloc+0x116/0x130^M
[16213.505804]  [<ffffffff8619b0bf>] __alloc_pages_slowpath+0x96f/0xbd0^M
[16213.505807]  [<ffffffff8619b4f6>] __alloc_pages_nodemask+0x1d6/0x230^M
[16213.505810]  [<ffffffff861e61d5>] alloc_pages_current+0x95/0x140^M
[16213.505814]  [<ffffffff8619114a>] __page_cache_alloc+0xca/0xe0^M
[16213.505822]  [<ffffffff86194312>] filemap_fault+0x312/0x4d0^M
[16213.505826]  [<ffffffff862a8196>] ext4_filemap_fault+0x36/0x50^M
[16213.505828]  [<ffffffff861c2c71>] __do_fault+0x71/0x130^M
[16213.505830]  [<ffffffff861c6cec>] handle_mm_fault+0xebc/0x13a0^M
[16213.505832]  [<ffffffff863b434d>] ? list_del+0xd/0x30^M
[16213.505833]  [<ffffffff86257e58>] ? ep_poll+0x308/0x320^M
[16213.505835]  [<ffffffff860509e4>] __do_page_fault+0x254/0x4a0^M
[16213.505837]  [<ffffffff86050c50>] do_page_fault+0x20/0x70^M
[16213.505839]  [<ffffffff86700aa2>] page_fault+0x22/0x30^M
[16213.505877] Code: 5d c3 55 83 c9 ff 48 89 e5 41 54 53 9c 5b fa 65
8b 15 4a 47 c7 79 89 c8 f0 0f b1 15 48 a3 92 00 83 f8 ff 74 0a 39 c2
74 0b 53 9d <f3> 90 eb dd 45 31 e4 eb 06 41 bc 01 00 00 00 48 c7 c7 41
1a a2 ^M
[16214.250659] NMI watchdog: BUG: soft lockup - CPU#17 stuck for 22s!
[scribed:3905]^M
[16214.250762] Modules linked in: dummy veth tun xfs libcrc32c
intel_rapl sb_edac edac_core x86_pkg_temp_thermal coretemp
crct10dif_pclmul crc32_pclmul iTCO_wdt iTCO_vendor_support dcdbas
ghash_clmulni_intel lpc_ich i2c_i801 hed wmi i2c_smbus shpchp i2c_core
ioatdma dca acpi_cpufreq tcp_diag inet_diag ipmi_si ipmi_devintf
ipmi_msghandler sch_fq_codel mlx4_en ptp pps_core crc32c_intel
mlx4_core devlink ipv6 crc_ccitt^M
[16214.250765] CPU: 17 PID: 3905 Comm: scribed Tainted: G
L  4.9.23.el7.twitter.x86_64 #1^M
[16214.250767] Hardware name: Dell Inc. PowerEdge C6220/03C9JJ, BIOS
2.2.3 11/07/2013^M
[16214.250770] task: ffff8af9cb938000 task.stack: ffffaac5cd1c0000^M
[16214.250776] RIP: 0010:[<ffffffff86395a93>]  [<ffffffff86395a93>]
dump_stack+0x29/0x66^M
[16214.250778] RSP: 0000:ffffaac5cd1c3a78  EFLAGS: 00000286^M
[16214.250781] RAX: 0000000000000004 RBX: 0000000000000286 RCX:
00000000ffffffff^M
[16214.250783] RDX: 0000000000000011 RSI: 0000000000000292 RDI:
ffffffff86c51be0^M
[16214.250787] RBP: ffffaac5cd1c3a88 R08: 0000000000000000 R09:
0000000000000031^M
[16214.250789] R10: 0000000000000000 R11: 0000000003cd574c R12:
0000000000000001^M
[16214.250791] R13: ffffffff86d43c80 R14: ffff8af9cb938000 R15:
ffffaac5cd1c3c40^M
[16214.250793] FS:  00007fc4c17fa700(0000) GS:ffff8af1dfcc0000(0000)
knlGS:0000000000000000^M
[16214.250795] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M
[16214.250798] CR2: 00007f14196cb650 CR3: 000000083d948000 CR4:
00000000000406e0^M
[16214.250800] Stack:^M
[16214.250815]  ffffffff86a3b280 0000000000000001 ffffaac5cd1c3b08
ffffffff8619a6c6^M
[16214.250827]  024201ca860f0940 ffffffff86a3b280 ffffaac5cd1c3ab0
0000000000000010^M
[16214.250838]  ffffaac5cd1c3b18 ffffaac5cd1c3ac8 000000000000000f
0000000000000000^M
[16214.250840] Call Trace:^M
[16214.250843]  [<ffffffff8619a6c6>] warn_alloc+0x116/0x130^M
[16214.250846]  [<ffffffff8619b0bf>] __alloc_pages_slowpath+0x96f/0xbd0^M
[16214.250849]  [<ffffffff8619b4f6>] __alloc_pages_nodemask+0x1d6/0x230^M
[16214.250853]  [<ffffffff861e61d5>] alloc_pages_current+0x95/0x140^M
[16214.250855]  [<ffffffff8619114a>] __page_cache_alloc+0xca/0xe0^M
[16214.250857]  [<ffffffff86194312>] filemap_fault+0x312/0x4d0^M
[16214.250859]  [<ffffffff862a8196>] ext4_filemap_fault+0x36/0x50^M
[16214.250860]  [<ffffffff861c2c71>] __do_fault+0x71/0x130^M
[16214.250863]  [<ffffffff861c6cec>] handle_mm_fault+0xebc/0x13a0^M
[16214.250865]  [<ffffffff860c25c1>] ? pick_next_task_fair+0x471/0x4a0^M
[16214.250869]  [<ffffffff860509e4>] __do_page_fault+0x254/0x4a0^M
[16214.250871]  [<ffffffff86050c50>] do_page_fault+0x20/0x70^M
[16214.250873]  [<ffffffff86700aa2>] page_fault+0x22/0x30^M
[16214.250918] Code: 5d c3 55 83 c9 ff 48 89 e5 41 54 53 9c 5b fa 65
8b 15 4a 47 c7 79 89 c8 f0 0f b1 15 48 a3 92 00 83 f8 ff 74 0a 39 c2
74 0b 53 9d <f3> 90 eb dd 45 31 e4 eb 06 41 bc 01 00 00 00 48 c7 c7 41
1a a2 ^M
[16215.157526]  [<ffffffff8619b0bf>] __alloc_pages_slowpath+0x96f/0xbd0^M
[16215.177523]  [<ffffffff8619b4f6>] __alloc_pages_nodemask+0x1d6/0x230^M
[16215.197540]  [<ffffffff861e61d5>] alloc_pages_current+0x95/0x140^M
[16215.217331]  [<ffffffff8619114a>] __page_cache_alloc+0xca/0xe0^M
[16215.237374]  [<ffffffff86194312>] filemap_fault+0x312/0x4d0^M
[16215.257136]  [<ffffffff862a8196>] ext4_filemap_fault+0x36/0x50^M
[16215.276950]  [<ffffffff861c2c71>] __do_fault+0x71/0x130^M
[16215.287555]  [<ffffffff861c6cec>] handle_mm_fault+0xebc/0x13a0^M
[16215.307538]  [<ffffffff860509e4>] __do_page_fault+0x254/0x4a0^M
[16215.327165]  [<ffffffff86050c50>] do_page_fault+0x20/0x70^M
[16215.346964]  [<ffffffff86700aa2>] page_fault+0x22/0x30^M
[16215.357554] CPU: 20 PID: 7812 Comm: proxymap Tainted: G
L  4.9.23.el7.twitter.x86_64 #1^M
[16215.357557] Mem-Info:^M
[16215.357563] active_anon:16069475 inactive_anon:5560 isolated_anon:0^M
[16215.357563]  active_file:1319 inactive_file:1356 isolated_file:0^M
[16215.357563]  unevictable:0 dirty:0 writeback:0 unstable:0^M
[16215.357563]  slab_reclaimable:22986 slab_unreclaimable:79813^M
[16215.357563]  mapped:6433 shmem:6364 pagetables:34697 bounce:0^M
[16215.357563]  free:160997 free_pcp:1010 free_cma:0^M
[16215.357567] Node 0 active_anon:31718344kB inactive_anon:8724kB
active_file:5052kB inactive_file:5200kB unevictable:0kB
isolated(anon):0kB isolated(file):0kB mapped:11524kB dirty:0kB
writeback:0kB shmem:8828kB shmem_thp: 0kB shmem_pmdmapped: 0kB
anon_thp: 0kB writeback_tmp:0kB unstable:0kB pages_scanned:39
all_unreclaimable? no^M
[16215.357571] Node 1 active_anon:32559556kB inactive_anon:13516kB
active_file:224kB inactive_file:224kB unevictable:0kB
isolated(anon):0kB isolated(file):0kB mapped:14208kB dirty:0kB
writeback:0kB shmem:16628kB shmem_thp: 0kB shmem_pmdmapped: 0kB
anon_thp: 0kB writeback_tmp:0kB unstable:0kB pages_scanned:683486
all_unreclaimable? yes^M
[16215.357574] Node 0 DMA free:15888kB min:20kB low:32kB high:44kB
active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB
unevictable:0kB writepending:0kB present:15972kB managed:15888kB
mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB
kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB
free_cma:0kB^M
[16215.357576] lowmem_reserve[]: 0 1903 32095 32095^M
[16215.357579] Node 0 DMA32 free:123344kB min:2668kB low:4616kB
high:6564kB active_anon:1820276kB inactive_anon:0kB active_file:0kB
inactive_file:0kB unevictable:0kB writepending:0kB present:2015240kB
managed:1949672kB mlocked:0kB slab_reclaimable:64kB
slab_unreclaimable:2216kB kernel_stack:0kB pagetables:3544kB
bounce:0kB free_pcp:120kB local_pcp:0kB free_cma:0kB^M
[16215.357580] lowmem_reserve[]: 0 0 30191 30191^M
[16215.357584] Node 0 Normal free:460012kB min:42308kB low:73224kB
high:104140kB active_anon:29898068kB inactive_anon:8724kB
active_file:5052kB inactive_file:5200kB unevictable:0kB
writepending:0kB present:31457280kB managed:30916476kB mlocked:0kB
slab_reclaimable:52340kB slab_unreclaimable:172756kB
kernel_stack:5992kB pagetables:64856kB bounce:0kB free_pcp:2808kB
local_pcp:116kB free_cma:0kB^M
[16215.357585] lowmem_reserve[]: 0 0 0 0^M
[16215.357588] Node 1 Normal free:44744kB min:45108kB low:78068kB
high:111028kB active_anon:32559556kB inactive_anon:13516kB
active_file:224kB inactive_file:224kB unevictable:0kB writepending:0kB
present:33554432kB managed:32962516kB mlocked:0kB
slab_reclaimable:39540kB slab_unreclaimable:144280kB
kernel_stack:5224kB pagetables:70388kB bounce:0kB free_pcp:1112kB
local_pcp:0kB free_cma:0kB^M
[16215.357589] lowmem_reserve[]: 0 0 0 0^M
[16215.357595] Node 0 DMA: 0*4kB 0*8kB 1*16kB (U) 0*32kB 2*64kB (U)
1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M)
= 15888kB^M
[16215.357602] Node 0 DMA32: 30*4kB (U) 13*8kB (UM) 9*16kB (UM)
17*32kB (UM) 5*64kB (UME) 2*128kB (ME) 2*256kB (UE) 3*512kB (UE)
3*1024kB (UE) 3*2048kB (UME) 27*4096kB (M) = 123344kB^M
[16215.357609] Node 0 Normal: 1259*4kB (UMEH) 3848*8kB (UMEH)
3612*16kB (UMEH) 3740*32kB (UMEH) 3581*64kB (UMEH) 41*128kB (MEH)
12*256kB (UEH) 2*512kB (ME) 2*1024kB (E) 3*2048kB (UME) 0*4096kB =
460012kB^M
[16215.357614] Node 1 Normal: 520*4kB (UMEH) 135*8kB (UMEH) 69*16kB
(UMEH) 31*32kB (UME) 161*64kB (UMEH) 170*128kB (UM) 29*256kB (U)
0*512kB 0*1024kB 0*2048kB 0*4096kB = 44744kB^M
[16215.357615] Node 0 hugepages_total=0 hugepages_free=0
hugepages_surp=0 hugepages_size=1048576kB^M
[16215.357616] Node 0 hugepages_total=0 hugepages_free=0
hugepages_surp=0 hugepages_size=2048kB^M
[16215.357617] Node 1 hugepages_total=0 hugepages_free=0
hugepages_surp=0 hugepages_size=1048576kB^M
[16215.357618] Node 1 hugepages_total=0 hugepages_free=0
hugepages_surp=0 hugepages_size=2048kB^M
[16215.357618] 9075 total pagecache pages^M
[16215.357619] 0 pages in swap cache^M
[16215.357619] Swap cache stats: add 0, delete 0, find 0/0^M
[16215.357620] Free swap  = 0kB^M
[16215.357620] Total swap = 0kB^M
[16215.357620] 16760731 pages RAM^M
[16215.357621] 0 pages HighMem/MovableOnly^M
[16215.357621] 299593 pages reserved^M
[16215.357621] 13 pages hwpoisoned^M
[16216.520770] warn_alloc: 5 callbacks suppressed^M
[16216.520775] scribed: page allocation stalls for 35691ms, order:0,
mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD)^M
[16216.587564] Hardware name: Dell Inc. PowerEdge C6220/03C9JJ, BIOS
2.2.3 11/07/2013^M
[16216.607766]  ffffaac5e6403a88 ffffffff86395ab7 ffffffff86a3b280
0000000000000001[16216.631514] memcg_process_s: ^M
[16216.631519] page allocation stalls for 31710ms, order:0,
mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD)^M
[16216.667939]  ffffaac5e6403b08 ffffffff8619a6c6 024201cae6403af0
ffffffff86a3b280^M
[16216.697354]  ffffaac5e6403ab0 0100000000000010 ffffaac5e6403b18
ffffaac5e6403ac8^M
[16216.717761] Call Trace:^M
[16216.727390]  [<ffffffff86395ab7>] dump_stack+0x4d/0x66^M
[16216.746985]  [<ffffffff8619a6c6>] warn_alloc+0x116/0x130^M
[16216.757761]  [<ffffffff8619b0bf>] __alloc_pages_slowpath+0x96f/0xbd0^M
[16216.777571]  [<ffffffff8619b4f6>] __alloc_pages_nodemask+0x1d6/0x230^M
[16216.797558]  [<ffffffff861e61d5>] alloc_pages_current+0x95/0x140^M
[16216.817570]  [<ffffffff8619114a>] __page_cache_alloc+0xca/0xe0^M
[16216.837349]  [<ffffffff86194312>] filemap_fault+0x312/0x4d0^M
[16216.854787] scribed: page allocation stalls for 35977ms, order:0,
mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD)^M
[16216.887239]  [<ffffffff862a8196>] ext4_filemap_fault+0x36/0x50^M
[16216.907027]  [<ffffffff861c2c71>] __do_fault+0x71/0x130^M
[16216.917933]  [<ffffffff861c6cec>] handle_mm_fault+0xebc/0x13a0^M
[16216.937650]  [<ffffffff863b434d>] ? list_del+0xd/0x30^M
[16216.957041]  [<ffffffff86257e58>] ? ep_poll+0x308/0x320^M
[16216.967687]  [<ffffffff860509e4>] __do_page_fault+0x254/0x4a0^M
[16216.984835] scribed: page allocation stalls for 36056ms, order:0,
mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD)^M
[16217.017608]  [<ffffffff86050c50>] do_page_fault+0x20/0x70^M
[16217.037351]  [<ffffffff86700aa2>] page_fault+0x22/0x30^M
[16217.047932] CPU: 8 PID: 827 Comm: crond Tainted: G             L
4.9.23.el7.twitter.x86_64 #1^M
[16217.047945] Mem-Info:^M
[16217.047955] active_anon:16071724 inactive_anon:3194 isolated_anon:0^M
[16217.047955]  active_file:1617 inactive_file:843 isolated_file:0^M
[16217.047955]  unevictable:0 dirty:0 writeback:0 unstable:0^M
[16217.047955]  slab_reclaimable:22986 slab_unreclaimable:79813^M
[16217.047955]  mapped:5722 shmem:6364 pagetables:34673 bounce:0^M
[16217.047955]  free:161140 free_pcp:1145 free_cma:0^M
[16217.047961] Node 0 active_anon:31722972kB inactive_anon:3628kB
active_file:6244kB inactive_file:3184kB unevictable:0kB
isolated(anon):0kB isolated(file):0kB mapped:9044kB dirty:0kB
writeback:0kB shmem:8828kB shmem_thp: 0kB shmem_pmdmapped: 0kB
anon_thp: 0kB writeback_tmp:0kB unstable:0kB pages_scanned:29
all_unreclaimable? no^M
[16217.047966] Node 1 active_anon:32563924kB inactive_anon:9148kB
active_file:224kB inactive_file:188kB unevictable:0kB
isolated(anon):0kB isolated(file):0kB mapped:13844kB dirty:0kB
writeback:0kB shmem:16628kB shmem_thp: 0kB shmem_pmdmapped: 0kB
anon_thp: 0kB writeback_tmp:0kB unstable:0kB pages_scanned:686021
all_unreclaimable? yes^M
[16217.047970] Node 0 DMA free:15888kB min:20kB low:32kB high:44kB
active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB
unevictable:0kB writepending:0kB present:15972kB managed:15888kB
mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB
kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB
free_cma:0kB^M
[16217.047972] lowmem_reserve[]: 0 1903 32095 32095^M
[16217.047976] Node 0 DMA32 free:123344kB min:2668kB low:4616kB
high:6564kB active_anon:1820276kB inactive_anon:0kB active_file:0kB
inactive_file:0kB unevictable:0kB writepending:0kB present:2015240kB
managed:1949672kB mlocked:0kB slab_reclaimable:64kB
slab_unreclaimable:2216kB kernel_stack:0kB pagetables:3544kB
bounce:0kB free_pcp:120kB local_pcp:0kB free_cma:0kB^M
[16217.047978] lowmem_reserve[]: 0 0 30191 30191^M
[16217.047982] Node 0 Normal free:460584kB min:42308kB low:73224kB
high:104140kB active_anon:29902696kB inactive_anon:3628kB
active_file:6244kB inactive_file:3184kB unevictable:0kB
writepending:0kB present:31457280kB managed:30916476kB mlocked:0kB
slab_reclaimable:52340kB slab_unreclaimable:172756kB
kernel_stack:5976kB pagetables:64760kB bounce:0kB free_pcp:3204kB
local_pcp:0kB free_cma:0kB^M
[16217.047984] lowmem_reserve[]: 0 0 0 0^M
[16217.047988] Node 1 Normal free:44744kB min:45108kB low:78068kB
high:111028kB active_anon:32563924kB inactive_anon:9148kB
active_file:224kB inactive_file:188kB unevictable:0kB writepending:0kB
present:33554432kB managed:32962516kB mlocked:0kB
slab_reclaimable:39540kB slab_unreclaimable:144280kB
kernel_stack:5224kB pagetables:70388kB bounce:0kB free_pcp:1256kB
local_pcp:116kB free_cma:0kB^M
[16217.047989] lowmem_reserve[]: 0 0 0 0^M
[16217.047997] Node 0 DMA: 0*4kB 0*8kB 1*16kB (U) 0*32kB 2*64kB (U)
1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M)
= 15888kB^M
[16217.048007] Node 0 DMA32: 30*4kB (U) 13*8kB (UM) 9*16kB (UM)
17*32kB (UM) 5*64kB (UME) 2*128kB (ME) 2*256kB (UE) 3*512kB (UE)
3*1024kB (UE) 3*2048kB (UME) 27*4096kB (M) = 123344kB^M
[16217.048015] Node 0 Normal: 1352*4kB (UMEH) 3846*8kB (UMEH)
3610*16kB (UMEH) 3742*32kB (UMEH) 3581*64kB (UMEH) 41*128kB (MEH)
12*256kB (UEH) 2*512kB (ME) 2*1024kB (E) 3*2048kB (UME) 0*4096kB =
460400kB^M
[16217.048023] Node 1 Normal: 520*4kB (UMEH) 135*8kB (UMEH) 69*16kB
(UMEH) 31*32kB (UME) 161*64kB (UMEH) 170*128kB (UM) 29*256kB (U)
0*512kB 0*1024kB 0*2048kB 0*4096kB = 44744kB^M
[16217.048025] Node 0 hugepages_total=0 hugepages_free=0
hugepages_surp=0 hugepages_size=1048576kB^M
[16217.048026] Node 0 hugepages_total=0 hugepages_free=0
hugepages_surp=0 hugepages_size=2048kB^M
[16217.048026] Node 1 hugepages_total=0 hugepages_free=0
hugepages_surp=0 hugepages_size=1048576kB^M
[16217.048027] Node 1 hugepages_total=0 hugepages_free=0
hugepages_surp=0 hugepages_size=2048kB^M
[16217.048028] 8969 total pagecache pages^M
[16217.048029] 0 pages in swap cache^M
[16217.048030] Swap cache stats: add 0, delete 0, find 0/0^M
[16217.048030] Free swap  = 0kB^M
[16217.048031] Total swap = 0kB^M
[16217.048031] 16760731 pages RAM^M
[16217.048032] 0 pages HighMem/MovableOnly^M
[16217.048032] 299593 pages reserved^M
[16217.048033] 13 pages hwpoisoned^M
[16217.075797] memcg_process_s: page allocation stalls for 32206ms,
order:0, mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

* Re: [patch] compiler, clang: suppress warning for unused static inline functions
From: Doug Anderson @ 2017-05-24 23:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Matthias Kaehlcke, David Rientjes, Christoph Lameter,
	Pekka Enberg, Joonsoo Kim, linux-mm, linux-kernel@vger.kernel.org,
	Mark Brown, Ingo Molnar, David Miller
In-Reply-To: <20170524143205.cae1a02ab2ad7348c1a59e0c@linux-foundation.org>

Hi,

On Wed, May 24, 2017 at 2:32 PM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Wed, 24 May 2017 14:22:29 -0700 Matthias Kaehlcke <mka@chromium.org> wrote:
>
>> I'm not a kernel maintainer, so it's not my decision whether this
>> warning should be silenced, my personal opinion is that its benefits
>> outweigh the inconveniences of dealing with half-false positives,
>> generally caused by the heavy use of #ifdef by the kernel itself.
>
> Please resend and include this info in the changelog.  Describe
> instances where this warning has resulted in actual runtime or
> developer-visible benefits.
>
> Where possible and appropriate, I suggest it is better to move the
> offending function into a header file, rather than adding ifdefs.

Can you clarify what you're asking for here?

* Matthias has been sending out individual patches that take each
particular case into account to try to remove the warnings.  In some
cases this removes totally dead code.  In other cases this adds
__maybe_unused.  ...and as a last resort it uses #ifdef.  In each of
these individual patches we wouldn't want a list of all other patches,
I think.

* Matthias is arguing here _against_ David's patch.

The best I can understand is that you're asking David to add
Matthias's objections into his patch description, then say why we're
still disabling this warning?

---

If you just want a list of things in response to this thread...

Clang's behavior has found some dead code, as shown by:

* https://patchwork.kernel.org/patch/9732161/
  ring-buffer: Remove unused function __rb_data_page_index()
* https://patchwork.kernel.org/patch/9735027/
  r8152: Remove unused function usb_ocp_read()
* https://patchwork.kernel.org/patch/9735053/
  net1080: Remove unused function nc_dump_ttl()
* https://patchwork.kernel.org/patch/9741513/
  crypto: rng: Remove unused function __crypto_rng_cast()
* https://patchwork.kernel.org/patch/9741539/
  x86/ioapic: Remove unused function IO_APIC_irq_trigger()
* https://patchwork.kernel.org/patch/9741549/
  ASoC: Intel: sst: Remove unused function sst_restore_shim64()
* https://patchwork.kernel.org/patch/9743225/
  ASoC: cht_bsw_max98090_ti: Remove unused function cht_get_codec_dai()

...plus more examples...

However, clang's behavior has also led to patches that add a
"__maybe_unused" attribute (usually no increase in LOC unless it
causes word wrap) and also added a handful of #ifdefs, as you've
pointed out.  The example we already talked about was:

* https://patchwork.kernel.org/patch/9738139/
  mm/slub: Only define kmalloc_large_node_hook() for NUMA systems

We can, of course, discuss the best way to solve each individual
issue.  ...and if we can find a way around #ifdef in most places that
seems ideal.  If people really think the ability to spot dead code is
not important, though, then disabling the warning globally like
David's patch is the way to go.

Note that in addition to spotting some dead code, clang's warnings
also have the ability to identify "paste-o" bugs during development
that would be harder to find if these warnings were disabled.  It's
unlikely problems like this would last long in the kernel, but
certainly I've made paste-o mistakes like this and then spent quite a
while trying to figure out why things weren't working until my eyes
finally spotted my stupidity.  Like:

static inline void its_a_dog(void) {
  pr_info("It's a dog\n");
}

static inline void its_a_cat(void) {
  pr_info("It's a dog\n");
}

static void foo(void) {
  if (strcmp(animal, "cat") == 0) {
    /* It's a cat! */
    its_a_cat();
  } else {
    /* It's a dog! */
    its_a_cat();
  }
}

Clang would (nicely) tell me that its_a_dog() is unused.  This is a
stupid example but I've made this type of mistake in the past for
sure.
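
For comparison, here is a minimal sketch of what the intended version of the
example above might look like. It is user-space C rather than kernel code, so
the helpers return their message instead of calling pr_info(), and
describe_animal() is a hypothetical stand-in for foo(). Because both helpers
are now referenced, an unused-function warning would have nothing left to
report.

```c
#include <string.h>

/* User-space stand-ins for the helpers in the example: they return the
 * message instead of calling pr_info(), so the result can be checked. */
static inline const char *its_a_dog(void)
{
	return "It's a dog";
}

static inline const char *its_a_cat(void)
{
	return "It's a cat";
}

/* describe_animal() is a hypothetical name standing in for foo().
 * Each branch calls its own helper, so neither helper is left unused. */
const char *describe_animal(const char *animal)
{
	if (strcmp(animal, "cat") == 0)
		return its_a_cat();
	return its_a_dog();
}
```

Calling describe_animal("cat") and describe_animal("dog") should give the two
different messages. In the broken version only one helper is ever reached,
which is exactly the situation the warning points at.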

-Doug


* Re: [PATCH] mm: hwpoison: Use compound_head() flags for huge pages
From: Naoya Horiguchi @ 2017-05-24 23:30 UTC (permalink / raw)
  To: James Morse; +Cc: linux-mm@kvack.org, Punit Agrawal
In-Reply-To: <20170524130204.21845-1-james.morse@arm.com>

On Wed, May 24, 2017 at 02:02:04PM +0100, James Morse wrote:
> memory_failure() chooses a recovery action function based on the page
> flags. For huge pages it uses the tail page flags which don't have
> anything interesting set, resulting in:
> > Memory failure: 0x9be3b4: Unknown page state
> > Memory failure: 0x9be3b4: recovery action for unknown page: Failed
> 
> Instead, save a copy of the head page's flags if this is a huge page.
> This means that if there are no relevant flags for this tail page, we use
> the head page's flags instead. This results in the me_huge_page()
> recovery action being called:
> > Memory failure: 0x9b7969: recovery action for huge page: Delayed
> 
> For hugepages that have not yet been allocated, this allows the hugepage
> to be dequeued.
> 
> CC: Punit Agrawal <punit.agrawal@arm.com>
> Signed-off-by: James Morse <james.morse@arm.com>

Looks good to me.

Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
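
The flag selection described in the changelog can be illustrated with a small
user-space model. This is only a sketch under stated assumptions: struct
toy_page, TOY_PG_HUGE_HEAD and toy_saved_flags() are hypothetical names
invented for the illustration, not kernel definitions, and a non-NULL head
pointer stands in for PageHuge().

```c
#include <stddef.h>

/* Toy stand-in for struct page; not the kernel definition.
 * head is NULL for an ordinary page and points at the head page
 * (possibly the page itself) for any page that is part of a huge page. */
struct toy_page {
	unsigned long flags;
	struct toy_page *head;
};

/* Hypothetical flag bit, used only in this illustration. */
#define TOY_PG_HUGE_HEAD 0x1UL

/* Models the patched logic: a huge page is judged by its head page's
 * flags, any other page by its own flags. */
unsigned long toy_saved_flags(const struct toy_page *p)
{
	if (p->head)
		return p->head->flags;
	return p->flags;
}
```

In this model a tail page whose own flags are empty still yields the head
page's flags, which is what lets the huge page recovery action be selected.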

> ---
> This is intended as a fix, but I can't find the patch that introduced this
> behaviour. (not recent, and there is a lot of history down there!)

Please add a tag

Fixes: 524fca1e7356 ("HWPOISON: fix misjudgement of page_action() for errors on mlocked pages")

> 
> This doesn't apply to stable trees before v3.10...
> Cc: stable@vger.kernel.org # 3.10.105

You can skip older stable kernels to which the fix isn't cleanly applicable.

Thanks,
Naoya Horiguchi

> 
>  mm/memory-failure.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 2527dfeddb00..44a6a33af219 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1184,7 +1184,10 @@ int memory_failure(unsigned long pfn, int trapno, int flags)
>  	 * page_remove_rmap() in try_to_unmap_one(). So to determine page status
>  	 * correctly, we save a copy of the page flags at this time.
>  	 */
> -	page_flags = p->flags;
> +	if (PageHuge(p))
> +		page_flags = hpage->flags;
> +	else
> +		page_flags = p->flags;
>  
>  	/*
>  	 * unpoison always clear PG_hwpoison inside page lock
> -- 
> 2.11.0
> 
> 

* Re: [PATCH] mm: add counters for different page fault types
From: Minchan Kim @ 2017-05-25  0:19 UTC (permalink / raw)
  To: Luigi Semenzato; +Cc: linux-mm, dianders, dtor, sonnyrao, Luigi Semenzato
In-Reply-To: <20170524194126.18040-1-semenzato@chromium.org>

Hi Luigi,

On Wed, May 24, 2017 at 12:41:26PM -0700, Luigi Semenzato wrote:
> VM event counters are added to keep track of anonymous
> vs. file vs. shmem page faults.  They are: pgmajfault_a,
> pgmajfault_f and pgmajfault_s.  These are useful to
> analyze system performance, particularly when the cost
> of a fault for a file page is very different from that
> of an anonymous page, as would happen, for instance, in
> the presence of zram.

Yeb, it's useful with zram and the way I have used is 

        PGMAJFAULT - PSWPIN

With that, I can tell what portion of the major faults stems from
file-backed pages, while the rest come from swap.
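A rough userspace sketch of the subtraction above, assuming the
counters are available as text in the "name value" per-line format
that /proc/vmstat uses. The helper names (vmstat_lookup,
file_majfaults) are made up for illustration. As far as I understand,
PSWPIN counts pages rather than fault events, so with swap readahead
the result is only an estimate and can even go negative.

```c
#include <stdio.h>
#include <string.h>

/*
 * Look up one "name value" pair in vmstat-style text (one counter per
 * line).  Returns 0 and stores the value on success, -1 if not found.
 * Hypothetical helper, not a kernel or libc API.
 */
static int vmstat_lookup(const char *text, const char *name,
			 unsigned long long *val)
{
	size_t len = strlen(name);
	const char *line = text;

	while (line && *line) {
		/* Match the whole counter name, followed by a space. */
		if (!strncmp(line, name, len) && line[len] == ' ')
			return sscanf(line + len, "%llu", val) == 1 ? 0 : -1;
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}

/*
 * Estimate file-backed major faults as pgmajfault - pswpin.
 * Returns 0 on success, -1 if either counter is missing.
 */
static int file_majfaults(const char *text, long long *out)
{
	unsigned long long maj, swpin;

	if (vmstat_lookup(text, "pgmajfault", &maj) ||
	    vmstat_lookup(text, "pswpin", &swpin))
		return -1;
	*out = (long long)maj - (long long)swpin;
	return 0;
}
```

To use it, read /proc/vmstat into a NUL-terminated buffer and pass
that buffer as "text"; sampling twice and diffing gives a rate.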

Doesn't that meet your requirement?

Thanks.

> 
> The PGMAJFAULT counter is no longer directly maintained.
> Instead the three new counters are added whenever the
> total count is needed.
> 
> Signed-off-by: Luigi Semenzato <semenzato@google.com>
> ---
>  arch/s390/appldata/appldata_mem.c | 9 ++++++++-
>  drivers/virtio/virtio_balloon.c   | 5 ++++-
>  fs/dax.c                          | 5 +++--
>  fs/ncpfs/mmap.c                   | 4 ++--
>  include/linux/vm_event_item.h     | 1 +
>  mm/filemap.c                      | 4 ++--
>  mm/memcontrol.c                   | 7 ++++++-
>  mm/memory.c                       | 4 ++--
>  mm/shmem.c                        | 4 ++--
>  mm/vmstat.c                       | 5 +++++
>  10 files changed, 35 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/s390/appldata/appldata_mem.c b/arch/s390/appldata/appldata_mem.c
> index 598df5708501..adb8b6412ffa 100644
> --- a/arch/s390/appldata/appldata_mem.c
> +++ b/arch/s390/appldata/appldata_mem.c
> @@ -62,6 +62,9 @@ struct appldata_mem_data {
>  	u64 pgalloc;		/* page allocations */
>  	u64 pgfault;		/* page faults (major+minor) */
>  	u64 pgmajfault;		/* page faults (major only) */
> +	u64 pgmajfault_s;	/* shmem page faults (major only) */
> +	u64 pgmajfault_a;	/* anonymous page faults (major only) */
> +	u64 pgmajfault_f;	/* file page faults (major only) */
>  // <-- New in 2.6
>  
>  } __packed;
> @@ -93,7 +96,11 @@ static void appldata_get_mem_data(void *data)
>  	mem_data->pgalloc    = ev[PGALLOC_NORMAL];
>  	mem_data->pgalloc    += ev[PGALLOC_DMA];
>  	mem_data->pgfault    = ev[PGFAULT];
> -	mem_data->pgmajfault = ev[PGMAJFAULT];
> +	mem_data->pgmajfault =
> +		ev[PGMAJFAULT_S] + ev[PGMAJFAULT_A] + ev[PGMAJFAULT_F];
> +	mem_data->pgmajfault_s = ev[PGMAJFAULT_S];
> +	mem_data->pgmajfault_a = ev[PGMAJFAULT_A];
> +	mem_data->pgmajfault_f = ev[PGMAJFAULT_F];
>  
>  	si_meminfo(&val);
>  	mem_data->sharedram = val.sharedram;
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index 408c174ef0d5..ed7100645d25 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -259,7 +259,10 @@ static unsigned int update_balloon_stats(struct virtio_balloon *vb)
>  				pages_to_bytes(events[PSWPIN]));
>  	update_stat(vb, idx++, VIRTIO_BALLOON_S_SWAP_OUT,
>  				pages_to_bytes(events[PSWPOUT]));
> -	update_stat(vb, idx++, VIRTIO_BALLOON_S_MAJFLT, events[PGMAJFAULT]);
> +	update_stat(vb, idx++, VIRTIO_BALLOON_S_MAJFLT,
> +		    events[PGMAJFAULT_S] +
> +		    events[PGMAJFAULT_A] +
> +		    events[PGMAJFAULT_F]);
>  	update_stat(vb, idx++, VIRTIO_BALLOON_S_MINFLT, events[PGFAULT]);
>  #endif
>  	update_stat(vb, idx++, VIRTIO_BALLOON_S_MEMFREE,
> diff --git a/fs/dax.c b/fs/dax.c
> index c22eaf162f95..3c92f2af0514 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1200,8 +1200,9 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
>  	switch (iomap.type) {
>  	case IOMAP_MAPPED:
>  		if (iomap.flags & IOMAP_F_NEW) {
> -			count_vm_event(PGMAJFAULT);
> -			mem_cgroup_count_vm_event(vmf->vma->vm_mm, PGMAJFAULT);
> +			count_vm_event(PGMAJFAULT_F);
> +			mem_cgroup_count_vm_event(vmf->vma->vm_mm,
> +						  PGMAJFAULT_F);
>  			major = VM_FAULT_MAJOR;
>  		}
>  		error = dax_insert_mapping(mapping, iomap.bdev, iomap.dax_dev,
> diff --git a/fs/ncpfs/mmap.c b/fs/ncpfs/mmap.c
> index 0c3905e0542e..ae04b9d86288 100644
> --- a/fs/ncpfs/mmap.c
> +++ b/fs/ncpfs/mmap.c
> @@ -88,8 +88,8 @@ static int ncp_file_mmap_fault(struct vm_fault *vmf)
>  	 * fetches from the network, here the analogue of disk.
>  	 * -- nyc
>  	 */
> -	count_vm_event(PGMAJFAULT);
> -	mem_cgroup_count_vm_event(vmf->vma->vm_mm, PGMAJFAULT);
> +	count_vm_event(PGMAJFAULT_F);
> +	mem_cgroup_count_vm_event(vmf->vma->vm_mm, PGMAJFAULT_F);
>  	return VM_FAULT_MAJOR;
>  }
>  
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index d84ae90ccd5c..2d2df45d4520 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -27,6 +27,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>  		FOR_ALL_ZONES(PGSCAN_SKIP),
>  		PGFREE, PGACTIVATE, PGDEACTIVATE, PGLAZYFREE,
>  		PGFAULT, PGMAJFAULT,
> +		PGMAJFAULT_S, PGMAJFAULT_A, PGMAJFAULT_F,
>  		PGLAZYFREED,
>  		PGREFILL,
>  		PGSTEAL_KSWAPD,
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 6f1be573a5e6..d2b187b648b3 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2225,8 +2225,8 @@ int filemap_fault(struct vm_fault *vmf)
>  	} else if (!page) {
>  		/* No page in the page cache at all */
>  		do_sync_mmap_readahead(vmf->vma, ra, file, offset);
> -		count_vm_event(PGMAJFAULT);
> -		mem_cgroup_count_vm_event(vmf->vma->vm_mm, PGMAJFAULT);
> +		count_vm_event(PGMAJFAULT_F);
> +		mem_cgroup_count_vm_event(vmf->vma->vm_mm, PGMAJFAULT_F);
>  		ret = VM_FAULT_MAJOR;
>  retry_find:
>  		page = find_get_page(mapping, offset);
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 94172089f52f..045361f2b8fa 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3122,6 +3122,8 @@ unsigned int memcg1_events[] = {
>  	PGPGOUT,
>  	PGFAULT,
>  	PGMAJFAULT,
> +	PGMAJFAULT_A,
> +	PGMAJFAULT_F,
>  };
>  
>  static const char *const memcg1_event_names[] = {
> @@ -3129,6 +3131,8 @@ static const char *const memcg1_event_names[] = {
>  	"pgpgout",
>  	"pgfault",
>  	"pgmajfault",
> +	"pgmajfault_a",
> +	"pgmajfault_f",
>  };
>  
>  static int memcg_stat_show(struct seq_file *m, void *v)
> @@ -5229,7 +5233,8 @@ static int memory_stat_show(struct seq_file *m, void *v)
>  	/* Accumulated memory events */
>  
>  	seq_printf(m, "pgfault %lu\n", events[PGFAULT]);
> -	seq_printf(m, "pgmajfault %lu\n", events[PGMAJFAULT]);
> +	seq_printf(m, "pgmajfault %lu\n", events[PGMAJFAULT_S] +
> +			events[PGMAJFAULT_A] + events[PGMAJFAULT_F]);
>  
>  	seq_printf(m, "workingset_refault %lu\n",
>  		   stat[WORKINGSET_REFAULT]);
> diff --git a/mm/memory.c b/mm/memory.c
> index 6ff5d729ded0..2c2b7b3ffe7f 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2718,8 +2718,8 @@ int do_swap_page(struct vm_fault *vmf)
>  
>  		/* Had to read the page from swap area: Major fault */
>  		ret = VM_FAULT_MAJOR;
> -		count_vm_event(PGMAJFAULT);
> -		mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
> +		count_vm_event(PGMAJFAULT_A);
> +		mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT_A);
>  	} else if (PageHWPoison(page)) {
>  		/*
>  		 * hwpoisoned dirty swapcache pages are kept for killing
> diff --git a/mm/shmem.c b/mm/shmem.c
> index e67d6ba4e98e..5eea045575c4 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1644,9 +1644,9 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
>  			/* Or update major stats only when swapin succeeds?? */
>  			if (fault_type) {
>  				*fault_type |= VM_FAULT_MAJOR;
> -				count_vm_event(PGMAJFAULT);
> +				count_vm_event(PGMAJFAULT_S);
>  				mem_cgroup_count_vm_event(charge_mm,
> -							  PGMAJFAULT);
> +							  PGMAJFAULT_S);
>  			}
>  			/* Here we actually start the io */
>  			page = shmem_swapin(swap, gfp, info, index);
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 76f73670200a..741bb14761cd 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -995,6 +995,9 @@ const char * const vmstat_text[] = {
>  
>  	"pgfault",
>  	"pgmajfault",
> +	"pgmajfault_s",
> +	"pgmajfault_a",
> +	"pgmajfault_f",
>  	"pglazyfreed",
>  
>  	"pgrefill",
> @@ -1511,6 +1514,8 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos)
>  	all_vm_events(v);
>  	v[PGPGIN] /= 2;		/* sectors -> kbytes */
>  	v[PGPGOUT] /= 2;
> +	/* Add up page faults */
> +	v[PGMAJFAULT] = v[PGMAJFAULT_S] + v[PGMAJFAULT_A] + v[PGMAJFAULT_F];
>  #endif
>  	return (unsigned long *)m->private + *pos;
>  }
> -- 
> 2.13.0.219.gdb65acc882-goog
> 

^ permalink raw reply

* Re: [PATCH v1 00/11] mm/kasan: support per-page shadow memory to reduce memory consumption
From: Joonsoo Kim @ 2017-05-25  0:41 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Andrew Morton, Andrey Ryabinin, Alexander Potapenko, kasan-dev,
	linux-mm@kvack.org, LKML, Thomas Gleixner, Ingo Molnar,
	H . Peter Anvin, kernel-team
In-Reply-To: <CACT4Y+ZwL+iTMvF5NpsovThQrdhunCc282ffjqQcgZg3tAQH4w@mail.gmail.com>

On Wed, May 24, 2017 at 07:19:50PM +0200, Dmitry Vyukov wrote:
> On Wed, May 24, 2017 at 9:45 AM, Joonsoo Kim <js1304@gmail.com> wrote:
> >> > What does make your current patch work then?
> >> > Say we map a new shadow page, update the page shadow to say that there
> >> > is mapped shadow. Then another CPU loads the page shadow and then
> >> > loads from the newly mapped shadow. If we don't flush TLB, what makes
> >> > the second CPU see the newly mapped shadow?
> >>
> >> /\/\/\/\/\/\
> >>
> >> Joonsoo, please answer this question above.
> >
> > Hello, I've answered it in another e-mail however it would not be
> > sufficient. I try again.
> >
> > If the page isn't used for kernel stack, slab, and global variable
> > (aka. kernel memory), black shadow is mapped for the page. We map a
> > new shadow page if the page will be used for kernel memory. We need to
> > flush TLB in all cpus when mapping a new shadow however it's not
> > possible in some cases. So, this patch does just flushing local cpu's
> > TLB. Another cpu could have stale TLB that points black shadow for
> > this page. If that cpu with stale TLB tries to check validity of the
> > object on this page, result would be invalid since stale TLB points
> > the black shadow and its shadow value is non-zero. We need a magic
> > here. At this moment, we cannot make sure if invalid is correct result
> > or not since we didn't do full TLB flush. So fixup processing is
> > started. It is implemented in check_memory_region_slow(). Flushing
> > local TLB and re-checking the shadow value. With flushing local TLB,
> > we will use fresh TLB at this time. Therefore, we can pass the
> > validity check as usual.
> >
> >> I am trying to understand if there is any chance to make mapping a
> >> single page for all non-interesting shadow ranges work. That would be
> >
> > This is what this patchset does. Mapping a single (zero/black) shadow
> > page for all non-interesting (non-kernel memory) shadow ranges.
> > There is only single instance of zero/black shadow page. On v1,
> > I used black shadow page only so fail to get enough performance. On
> > v2 mentioned in another thread, I use zero shadow for some region. I
> > guess that performance problem would be gone.
> 
> 
> I can't say I understand everything here, but after staring at the
> patch I don't understand why we need pshadow at all now. Especially
> with this commit
> https://github.com/JoonsooKim/linux/commit/be36ee65f185e3c4026fe93b633056ea811120fb.
> It seems that the current shadow is enough.

pshadow exists for non-kernel memory such as page cache or anonymous pages.
This patch doesn't map a new shadow (per-byte shadow) for those pages
to reduce memory consumption. However, we need to know if those pages
are allocated or not in order to check the validity of access to those
pages. We cannot utilize the zero/black shadow page here, since mapping a
single zero/black shadow page represents eight real pages' shadow
values. Instead, we use per-page shadow here and mark/unmark it when
allocation and free happen. With it, we can know the state of the
page and we can determine the validity of access to it.

> If we see bad shadow when the actual shadow value is good, we fall
> onto slow path, flush tlb, reload shadow, see that it is good and
> return. Pshadow is not needed in this case.

For the kernel memory, if we see bad shadow due to *stale TLB*, we
fall onto slow path (check_memory_region_slow()) and flush tlb and
reload shadow.

For the non-kernel memory, if we see bad shadow, we fall onto
pshadow_val() check and we can see actual state of the page.

> If we see good shadow when the actual shadow value is bad, we return
> immediately and get false negative. Pshadow is not involved as well.
> What am I missing?

In this patchset, there is no case where we see good shadow when the
actual (p)shadow value is bad. This case must not happen, since we
would miss an actual error.

Please let me know if this explanation is insufficient. I will try
again. :)

Thanks.

^ permalink raw reply

* Re: [PATCH v1 00/11] mm/kasan: support per-page shadow memory to reduce memory consumption
From: Joonsoo Kim @ 2017-05-25  0:46 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Andrey Ryabinin, Andrew Morton, Alexander Potapenko, kasan-dev,
	linux-mm@kvack.org, LKML, Thomas Gleixner, Ingo Molnar,
	H . Peter Anvin, kernel-team
In-Reply-To: <CACT4Y+b56nGv6WcTqysa=Xxdksxr-c9-tCzBxEY8PzfVYAUbrA@mail.gmail.com>

On Wed, May 24, 2017 at 06:31:04PM +0200, Dmitry Vyukov wrote:
> On Wed, May 24, 2017 at 8:04 AM, Joonsoo Kim <js1304@gmail.com> wrote:
> >> >> > From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> >> >> >
> >> >> > Hello, all.
> >> >> >
> >> >> > This is an attempt to reduce memory consumption of KASAN. Please see
> >> >> > following description to get the more information.
> >> >> >
> >> >> > 1. What is per-page shadow memory
> >> >> >
> >> >> > This patch introduces infrastructure to support per-page shadow memory.
> >> >> > Per-page shadow memory is the same with original shadow memory except
> >> >> > the granularity. Its one byte shows the shadow value for the page.
> >> >> > The purpose of introducing this new shadow memory is to save memory
> >> >> > consumption.
> >> >> >
> >> >> > 2. Problem of current approach
> >> >> >
> >> >> > Until now, KASAN needs shadow memory for all the range of the memory
> >> >> > so the amount of statically allocated memory is so large. It causes
> >> >> > the problem that KASAN cannot run on the system with hard memory
> >> >> > constraint. Even if KASAN can run, large memory consumption due to
> >> >> > KASAN changes behaviour of the workload so we cannot validate
> >> >> > the moment that we want to check.
> >> >> >
> >> >> > 3. How does this patch fix the problem
> >> >> >
> >> >> > This patch tries to fix the problem by reducing memory consumption for
> >> >> > the shadow memory. There are two observations.
> >> >> >
> >> >>
> >> >>
> >> >> I think that the best way to deal with your problem is to increase shadow scale size.
> >> >>
> >> >> You'll need to add tunable to gcc to control shadow size. I expect that gcc has some
> >> >> places where 8-shadow scale size is hardcoded, but it should be fixable.
> >> >>
> >> >> The kernel also have some small amount of code written with KASAN_SHADOW_SCALE_SIZE == 8 in mind,
> >> >> which should be easy to fix.
> >> >>
> >> >> Note that bigger shadow scale size requires bigger alignment of allocated memory and variables.
> >> >> However, according to comments in gcc/asan.c gcc already aligns stack and global variables and at
> >> >> 32-bytes boundary.
> >> >> So we could bump shadow scale up to 32 without increasing current stack consumption.
> >> >>
> >> >> On a small machine (1Gb) 1/32 of shadow is just 32Mb which is comparable to yours 30Mb, but I expect it to be
> >> >> much faster. More importantly, this will require only small amount of simple changes in code, which will be
> >> >> a *lot* more easier to maintain.
> >>
> >>
> >> Interesting option. We never considered increasing scale in user space
> >> due to performance implications. But the algorithm always supported up
> >> to 128x scale. Definitely worth considering as an option.
> >
> > Could you explain me how does increasing scale reduce performance? I
> > tried to guess the reason but failed.
> 
> 
> The main reason is inline instrumentation. Inline instrumentation for
> a check of 8-byte access (which are very common in 64-bit code) is
> just a check of the shadow byte for 0. For smaller accesses we have
> more complex instrumentation that first checks shadow for 0 and then
> does precise check based on size/offset of the access + shadow value.
> That's slower and also increases register pressure and code size
> (which can further reduce performance due to icache overflow). If we
> increase scale to 16/32, all accesses will need that slow path.
> Another thing is stack instrumentation: larger scale will require
> larger redzones to ensure proper alignment. That will increase stack
> frames and also more instructions to poison/unpoison stack shadow on
> function entry/exit.

Now, I see. Thanks for explanation.
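For readers following along, the difference between the two checks
described above can be modelled in userspace roughly as follows. This
is only a toy model assuming the usual 8-bytes-per-shadow-byte
encoding (0 means the whole granule is accessible, 1..7 means only
the first k bytes are, negative means none); the names access8_ok and
access_small_ok are invented here and are not what the compiler or
the kernel emit.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SHADOW_SCALE_SHIFT 3	/* one shadow byte per 8 bytes of memory */
#define SHADOW_MASK ((1u << SHADOW_SCALE_SHIFT) - 1)

/* Toy shadow covering 64 bytes of "memory" at offsets 0..63. */
static int8_t shadow[8];

/* Aligned 8-byte access: a single compare against zero is enough. */
static bool access8_ok(size_t off)
{
	return shadow[off >> SHADOW_SCALE_SHIFT] == 0;
}

/* 1/2/4-byte access inside one granule: needs the precise check. */
static bool access_small_ok(size_t off, size_t size)
{
	int8_t s = shadow[off >> SHADOW_SCALE_SHIFT];

	if (s == 0)
		return true;
	/* The last byte touched (within the granule) must be below s. */
	return (int8_t)((off & SHADOW_MASK) + size - 1) < s;
}
```

With a scale of 16 or 32, an 8-byte access no longer covers a whole
granule, so it would have to go through something like
access_small_ok() as well, which is the extra cost mentioned above.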

Thanks.

^ permalink raw reply

* [PATCH -mm -v6] mm, swap: Sort swap entries before free
From: Huang, Ying @ 2017-05-25  0:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Huang Ying, Hugh Dickins, Shaohua Li,
	Minchan Kim, Rik van Riel

From: Huang Ying <ying.huang@intel.com>

To reduce the lock contention of swap_info_struct->lock when freeing
swap entries, the freed swap entries are first collected in a per-CPU
buffer and then really freed later in a batch.  During the batch
freeing, if the consecutive swap entries in the per-CPU buffer belong
to the same swap device, the swap_info_struct->lock needs to be
acquired/released only once, so the lock contention can be
reduced greatly.  But if there are multiple swap devices, it is
possible that the lock may be unnecessarily released/acquired because
the swap entries belonging to the same swap device are non-consecutive
in the per-CPU buffer.

To solve the issue, the per-CPU buffer is sorted according to the swap
device before freeing the swap entries.

With the patch, the time to free the memory (some of it swapped out)
is reduced by 11.6% (from 2.65s to 2.35s) in the vm-scalability
swap-w-rand test case with 16 processes.  The test is done on a Xeon
E5 v3 system.  The swap device used is a RAM simulated PMEM
(persistent memory) device.  To test swapping, the test case creates
16 processes, which allocate and write to the anonymous pages until
the RAM and part of the swap device are used up; finally, the memory
(some of it swapped out) is freed before exit.

Signed-off-by: Huang Ying <ying.huang@intel.com>
Acked-by: Tim Chen <tim.c.chen@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>

v6:

- Revert to a simpler way to determine whether sort is necessary,
  because it is found the overhead of sort is very small.

v5:

- Use a smarter way to determine whether sort is necessary.

v4:

- Avoid unnecessary sort if all entries are from one swap device.

v3:

- Add some comments in code per Rik's suggestion.

v2:

- Avoid sort swap entries if there is only one swap device.
---
 mm/swapfile.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 8a6cdf9e55f9..07b1a3d4910a 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -37,6 +37,7 @@
 #include <linux/swapfile.h>
 #include <linux/export.h>
 #include <linux/swap_slots.h>
+#include <linux/sort.h>

 #include <asm/pgtable.h>
 #include <asm/tlbflush.h>
@@ -1198,6 +1199,13 @@ void put_swap_page(struct page *page, swp_entry_t entry)
 		swapcache_free_cluster(entry);
 }

+static int swp_entry_cmp(const void *ent1, const void *ent2)
+{
+	const swp_entry_t *e1 = ent1, *e2 = ent2;
+
+	return (int)swp_type(*e1) - (int)swp_type(*e2);
+}
+
 void swapcache_free_entries(swp_entry_t *entries, int n)
 {
 	struct swap_info_struct *p, *prev;
@@ -1208,6 +1216,15 @@ void swapcache_free_entries(swp_entry_t *entries, int n)

 	prev = NULL;
 	p = NULL;
+
+	/*
+	 * Sort swap entries by swap device, so each lock is only
+	 * taken once.  nr_swapfiles isn't absolutely correct, but
+	 * the overhead of sort() is so low that it isn't
+	 * necessary to optimize further.
+	 */
+	if (nr_swapfiles > 1)
+		sort(entries, n, sizeof(entries[0]), swp_entry_cmp, NULL);
 	for (i = 0; i < n; ++i) {
 		p = swap_info_get_cont(entries[i], prev);
 		if (p)
-- 
2.11.0
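The effect of the sort can be illustrated outside the kernel with a
small sketch. It assumes a simplified stand-in for swp_entry_t
(struct entry, hypothetical) and uses qsort() in place of the kernel's
sort(); a "lock acquisition" is modelled as one per run of
consecutive entries with the same device type, which is how the loop
in swapcache_free_entries() behaves as described above.

```c
#include <stdlib.h>

/* Hypothetical stand-in for swp_entry_t: only the device type matters. */
struct entry {
	unsigned int type;	/* swap device index */
	unsigned long offset;	/* offset within the device */
};

/* Same idea as swp_entry_cmp() in the patch: order by device only. */
static int entry_cmp(const void *a, const void *b)
{
	const struct entry *e1 = a, *e2 = b;

	return (int)e1->type - (int)e2->type;
}

/* One lock acquisition per run of consecutive same-device entries. */
static int count_lock_acquisitions(const struct entry *ents, int n)
{
	int i, locks = 0;

	for (i = 0; i < n; i++)
		if (i == 0 || ents[i].type != ents[i - 1].type)
			locks++;
	return locks;
}

/* Sort by device first, then count: at most one run per device. */
static int sorted_lock_acquisitions(struct entry *ents, int n)
{
	qsort(ents, n, sizeof(ents[0]), entry_cmp);
	return count_lock_acquisitions(ents, n);
}
```

For a buffer that alternates between two devices, the unsorted count
equals the number of entries, while the sorted count drops to two.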

^ permalink raw reply related

* Re: [Question] Mlocked count will not be decreased
From: Yisheng Xie @ 2017-05-25  1:00 UTC (permalink / raw)
  To: Vlastimil Babka, Xishi Qiu
  Cc: Kefeng Wang, linux-mm, linux-kernel, zhongjiang
In-Reply-To: <d354b321-0d11-4308-0b0e-aacef5a5e34b@suse.cz>

Hi Vlastimil,

Thanks for the comment.
On 2017/5/24 19:52, Vlastimil Babka wrote:
> On 05/24/2017 01:38 PM, Xishi Qiu wrote:
>>>
>>> Race condition with what? Who else would isolate our pages?
>>>
>>
>> Hi Vlastimil,
>>
>> I find the root cause, if the page was not cached on the current cpu,
>> lru_add_drain() will not push it to LRU. So we should handle fail
>> case in mlock_vma_page().
> 
> Yeah that would explain it.
> 
>> follow_page_pte()
>> 		...
>> 		if (page->mapping && trylock_page(page)) {
>> 			lru_add_drain();  /* push cached pages to LRU */
>> 			/*
>> 			 * Because we lock page here, and migration is
>> 			 * blocked by the pte's page reference, and we
>> 			 * know the page is still mapped, we don't even
>> 			 * need to check for file-cache page truncation.
>> 			 */
>> 			mlock_vma_page(page);
>> 			unlock_page(page);
>> 		}
>> 		...
>>
>> I think we should add yisheng's patch, also we should add the following change.
>> I think it is better than use lru_add_drain_all().
> 
> I agree about yisheng's fix (but v2 didn't address my comments). I don't
> think we should add the hunk below, as that deviates from the rest of
> the design.
> 
Sorry, I had sent the patch before your comment. Anyway, I will send another version
following your suggestion.

Thanks
Yisheng Xie



^ permalink raw reply

* Re: [RFC PATCH 2/4] mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic
From: NeilBrown @ 2017-05-25  1:21 UTC (permalink / raw)
  To: Michal Hocko, linux-mm
  Cc: Vlastimil Babka, Johannes Weiner, Mel Gorman, Andrew Morton, LKML,
	Michal Hocko
In-Reply-To: <20170307154843.32516-3-mhocko@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 6692 bytes --]

On Tue, Mar 07 2017, Michal Hocko wrote:

> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 2bfcfd33e476..60af7937c6f2 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -25,7 +25,7 @@ struct vm_area_struct;
>  #define ___GFP_FS		0x80u
>  #define ___GFP_COLD		0x100u
>  #define ___GFP_NOWARN		0x200u
> -#define ___GFP_REPEAT		0x400u
> +#define ___GFP_RETRY_MAYFAIL		0x400u
>  #define ___GFP_NOFAIL		0x800u
>  #define ___GFP_NORETRY		0x1000u
>  #define ___GFP_MEMALLOC		0x2000u
> @@ -136,26 +136,38 @@ struct vm_area_struct;
>   *
>   * __GFP_RECLAIM is shorthand to allow/forbid both direct and kswapd reclaim.
>   *
> - * __GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt
> - *   _might_ fail.  This depends upon the particular VM implementation.
> + * The default allocator behavior depends on the request size. We have a concept
> + * of so called costly allocations (with order > PAGE_ALLOC_COSTLY_ORDER).

Boundary conditions is one of my pet peeves....
The description here suggests that an allocation of
"1<<PAGE_ALLOC_COSTLY_ORDER" pages is not "costly", which is
inconsistent with how those words would normally be interpreted.

Looking at the code I see comparisons like:

   order < PAGE_ALLOC_COSTLY_ORDER
or
   order >= PAGE_ALLOC_COSTLY_ORDER

which supports the documented (but incoherent) meaning.

But I also see:

  order = max_t(int, PAGE_ALLOC_COSTLY_ORDER - 1, 0);

which looks like it is trying to perform the largest non-costly
allocation, but is making a smaller allocation than necessary.

I would *really* like it if the constant actually meant what its name
implied.

 PAGE_ALLOC_MAX_NON_COSTLY
??
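To make the boundary concrete, here is a tiny sketch of the documented
meaning. PAGE_ALLOC_COSTLY_ORDER is 3 in include/linux/mmzone.h; the
helper names below (is_costly, largest_non_costly_order,
conservative_order) are made up for illustration and do not exist in
the kernel.

```c
#include <stdbool.h>

#define PAGE_ALLOC_COSTLY_ORDER 3	/* as in include/linux/mmzone.h */

/* Documented meaning: only orders strictly above the constant are costly. */
static bool is_costly(unsigned int order)
{
	return order > PAGE_ALLOC_COSTLY_ORDER;
}

/* So the largest non-costly order is the constant itself (8 pages). */
static unsigned int largest_non_costly_order(void)
{
	return PAGE_ALLOC_COSTLY_ORDER;
}

/* Mirrors max_t(int, PAGE_ALLOC_COSTLY_ORDER - 1, 0) quoted above. */
static unsigned int conservative_order(void)
{
	int order = PAGE_ALLOC_COSTLY_ORDER - 1;

	return order > 0 ? (unsigned int)order : 0;
}
```

Under that reading, the max_t() expression picks order 2 (4 pages)
even though order 3 would still count as non-costly.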

> + * !costly allocations are too essential to fail so they are implicitly
> + * non-failing (with some exceptions like OOM victims might fail) by default while
> + * costly requests try to be not disruptive and back off even without invoking
> + * the OOM killer. The following three modifiers might be used to override some of
> + * these implicit rules
> + *
> + * __GFP_NORETRY: The VM implementation must not retry indefinitely and will
> + *   return NULL when direct reclaim and memory compaction have failed to allow
> + *   the allocation to succeed.  The OOM killer is not called with the current
> + *   implementation. This is a default mode for costly allocations.

The name here is "NORETRY", but the text says "not retry indefinitely".
So does it retry or not?
I would assume it "tried" once, and only once.
However it could be that a "try" is not a simple well defined task.
Maybe some escalation happens on the 2nd or 3rd "try", so they are really
trying different things?

The word "indefinitely" implies there is a definite limit.  It might
help to say what that is, or at least say that it is small.

Also, this documentation is phrased to tell the VM implementor what is,
or is not, allowed.  Most readers will be more interested in the
responsibilities of the caller.

  __GFP_NORETRY: The VM implementation will not retry after all
     reasonable avenues for finding free memory have been pursued.  The
     implementation may sleep (i.e. call 'schedule()'), but only while
     waiting for another task to perform some specific action.
     The caller must handle failure.  This flag is suitable when failure can
     easily be handled at small cost, such as reduced throughput.

> + *
> + * __GFP_RETRY_MAYFAIL: Try hard to allocate the memory, but the allocation attempt
> + *   _might_ fail. All viable forms of memory reclaim are tried before the fail.
> + *   The OOM killer is excluded because this would be too disruptive. This can be
> + *   used to override non-failing default behavior for !costly requests as well as
> + *   fortify costly requests.

What does "Try hard" mean?
In part, it means "retry everything a few more times", I guess in the
hope that something happened in the mean time.
It also seems to mean waiting for compaction to happen, which I
guess is only relevant for >PAGE_SIZE allocations?
Maybe it also means waiting for page-out to complete.
So the summary would be that it waits for a little while, hoping for a
miracle.

   __GFP_RETRY_MAYFAIL:  The VM implementation will retry memory reclaim
     procedures that have previously failed if there is some indication
     that progress has been made elsewhere.  It can wait for other
     tasks to attempt high level approaches to freeing memory such as
     compaction (which removes fragmentation) and page-out.
     There is still a definite limit to the number of retries, but it is
     a larger limit than with __GFP_NORETRY.
     Allocations with this flag may fail, but only when there is
     genuinely little unused memory.  While these allocations do not
     directly trigger the OOM killer, their failure indicates that the
     system is likely to need to use the OOM killer soon.
     The caller must handle failure, but can reasonably do so by failing
     a higher-level request, or completing it only in a much less
     efficient manner.
     If the allocation does fail, and the caller is in a position to
     free some non-essential memory, doing so could benefit the system
     as a whole.
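As an illustration of "completing it only in a much less efficient
manner", here is a userspace analogy of a caller that copes with a
failing allocation by retrying with a smaller size. It does not use
any kernel API: alloc_fn, limited_alloc and alloc_with_fallback are
hypothetical names, and limited_alloc simply pretends that anything
above 4096 bytes cannot be satisfied.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical hook standing in for an allocation that may fail. */
typedef void *(*alloc_fn)(size_t size);

/* Demo allocator: pretends requests above 4096 bytes always fail. */
static void *limited_alloc(size_t size)
{
	return size > 4096 ? NULL : malloc(size);
}

/*
 * Try the preferred size first, then halve the request until it
 * succeeds or drops below the smallest size the caller can work with.
 * Returns NULL (and *got == 0) if even that fails, so the caller can
 * fail the higher-level request.
 */
static void *alloc_with_fallback(alloc_fn try_alloc, size_t want,
				 size_t min, size_t *got)
{
	size_t size;

	if (min == 0)
		min = 1;	/* avoid looping forever on size 0 */

	for (size = want; size >= min; size /= 2) {
		void *p = try_alloc(size);

		if (p) {
			*got = size;
			return p;
		}
	}
	*got = 0;
	return NULL;
}
```

A caller would pass its preferred buffer size as "want" and the
smallest size it can still make progress with as "min", then check
*got to see how much it actually received.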

>   *
>   * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
>   *   cannot handle allocation failures. New users should be evaluated carefully
>   *   (and the flag should be used only when there is no reasonable failure
>   *   policy) but it is definitely preferable to use the flag rather than
> - *   opencode endless loop around allocator.
> - *
> - * __GFP_NORETRY: The VM implementation must not retry indefinitely and will
> - *   return NULL when direct reclaim and memory compaction have failed to allow
> - *   the allocation to succeed.  The OOM killer is not called with the current
> - *   implementation.
> + *   opencode endless loop around allocator. Using this flag for costly allocations
> + *   is _highly_ discouraged.

Should this explicitly say that the OOM killer might be invoked in an attempt
to satisfy this allocation?  Is the OOM killer *only* invoked from
allocations with __GFP_NOFAIL ?
Maybe be extra explicit "The allocation could block indefinitely but
will never return with failure.  Testing for failure is pointless.".

I've probably got several specifics wrong.  I've tried to answer the
questions that I would like to see answered by the documentation.   If
you can fix it up so that those questions are answered correctly, that
would be great.

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* Re: [Question] Mlocked count will not be decreased
From: Xishi Qiu @ 2017-05-25  1:16 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Yisheng Xie, Kefeng Wang, linux-mm, linux-kernel, zhongjiang
In-Reply-To: <b41f2c9a-7e74-529f-2ec1-3d9ae369dcb5@suse.cz>

On 2017/5/24 21:16, Vlastimil Babka wrote:

> On 05/24/2017 02:10 PM, Xishi Qiu wrote:
>> On 2017/5/24 19:52, Vlastimil Babka wrote:
>>
>>> On 05/24/2017 01:38 PM, Xishi Qiu wrote:
>>>>>
>>>>> Race condition with what? Who else would isolate our pages?
>>>>>
>>>>
>>>> Hi Vlastimil,
>>>>
>>>> I find the root cause, if the page was not cached on the current cpu,
>>>> lru_add_drain() will not push it to LRU. So we should handle fail
>>>> case in mlock_vma_page().
>>>
>>> Yeah that would explain it.
>>>
>>>> follow_page_pte()
>>>> 		...
>>>> 		if (page->mapping && trylock_page(page)) {
>>>> 			lru_add_drain();  /* push cached pages to LRU */
>>>> 			/*
>>>> 			 * Because we lock page here, and migration is
>>>> 			 * blocked by the pte's page reference, and we
>>>> 			 * know the page is still mapped, we don't even
>>>> 			 * need to check for file-cache page truncation.
>>>> 			 */
>>>> 			mlock_vma_page(page);
>>>> 			unlock_page(page);
>>>> 		}
>>>> 		...
>>>>
>>>> I think we should add yisheng's patch, also we should add the following change.
>>>> I think it is better than use lru_add_drain_all().
>>>
>>> I agree about yisheng's fix (but v2 didn't address my comments). I don't
>>> think we should add the hunk below, as that deviates from the rest of
>>> the design.
>>
>> Hi Vlastimil,
>>
>> The rest of the design is that mlock should always success here, right?
> 
> The rest of the design allows a temporary disconnect between mlocked
> flag and being placed on unevictable lru.
> 
>> If we don't handle the fail case, the page will be in anon/file lru list
>> later when call __pagevec_lru_add(), but NR_MLOCK increased,
>> this is wrong, right?
> 
> It's not wrong, the page cannot get evicted even if on wrong lru, so
> effectively it's already mlocked. We would be underaccounting NR_MLOCK.
> 

Hi Vlastimil,

I don't quite understand why the page cannot get evicted even if it is on the wrong lru.
__isolate_lru_page() will only skip PageUnevictable(page), but this flag has not
been set; we only set PageMlocked.

Thanks,
Xishi Qiu

>> Thanks,
>> Xishi Qiu
>>
>>>
>>> Thanks,
>>> Vlastimil
>>>
>>>> diff --git a/mm/mlock.c b/mm/mlock.c
>>>> index 3d3ee6c..ca2aeb9 100644
>>>> --- a/mm/mlock.c
>>>> +++ b/mm/mlock.c
>>>> @@ -88,6 +88,11 @@ void mlock_vma_page(struct page *page)
>>>>  		count_vm_event(UNEVICTABLE_PGMLOCKED);
>>>>  		if (!isolate_lru_page(page))
>>>>  			putback_lru_page(page);
>>>> +		else {
>>>> +			ClearPageMlocked(page);
>>>> +			mod_zone_page_state(page_zone(page), NR_MLOCK,
>>>> +					-hpage_nr_pages(page));
>>>> +		}
>>>>  	}
>>>>  }
>>>>
>>>> Thanks,
>>>> Xishi Qiu
>>>>



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* Re: [PATCH] mm/migrate: Fix ref-count handling when !hugepage_migration_supported()
From: Naoya Horiguchi @ 2017-05-25  1:59 UTC (permalink / raw)
  To: Punit Agrawal
  Cc: Andrew Morton, will.deacon@arm.com, catalin.marinas@arm.com,
	manoj.iyer@arm.com, linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, linux-mm@kvack.org,
	tbaicar@codeaurora.org, timur@qti.qualcomm.com, Joonsoo Kim,
	Wanpeng Li, Christoph Lameter
In-Reply-To: <20170524154728.2492-1-punit.agrawal@arm.com>

On Wed, May 24, 2017 at 04:47:28PM +0100, Punit Agrawal wrote:
> On failing to migrate a page, soft_offline_huge_page() performs the
> necessary update to the hugepage ref-count. When
> !hugepage_migration_supported(), unmap_and_move_huge_page() also
> decrements the page ref-count for the hugepage. The combined behaviour
> leaves the ref-count in an inconsistent state.
> 
> This leads to soft lockups when running the overcommitted hugepage test
> from mce-tests suite.
> 
> Soft offlining pfn 0x83ed600 at process virtual address 0x400000000000
> soft offline: 0x83ed600: migration failed 1, type
> 1fffc00000008008 (uptodate|head)
> INFO: rcu_preempt detected stalls on CPUs/tasks:
>  Tasks blocked on level-0 rcu_node (CPUs 0-7): P2715
>   (detected by 7, t=5254 jiffies, g=963, c=962, q=321)
>   thugetlb_overco R  running task        0  2715   2685 0x00000008
>   Call trace:
>   [<ffff000008089f90>] dump_backtrace+0x0/0x268
>   [<ffff00000808a2d4>] show_stack+0x24/0x30
>   [<ffff000008100d34>] sched_show_task+0x134/0x180
>   [<ffff0000081c90fc>] rcu_print_detail_task_stall_rnp+0x54/0x7c
>   [<ffff00000813cfd4>] rcu_check_callbacks+0xa74/0xb08
>   [<ffff000008143a3c>] update_process_times+0x34/0x60
>   [<ffff0000081550e8>] tick_sched_handle.isra.7+0x38/0x70
>   [<ffff00000815516c>] tick_sched_timer+0x4c/0x98
>   [<ffff0000081442e0>] __hrtimer_run_queues+0xc0/0x300
>   [<ffff000008144fa4>] hrtimer_interrupt+0xac/0x228
>   [<ffff0000089a56d4>] arch_timer_handler_phys+0x3c/0x50
>   [<ffff00000812f1bc>] handle_percpu_devid_irq+0x8c/0x290
>   [<ffff0000081297fc>] generic_handle_irq+0x34/0x50
>   [<ffff000008129f00>] __handle_domain_irq+0x68/0xc0
>   [<ffff0000080816b4>] gic_handle_irq+0x5c/0xb0
> 
> Fix this by dropping the ref-count decrement in
> unmap_and_move_huge_page() when !hugepage_migration_supported().
> 
> Fixes: 32665f2bbfed ("mm/migrate: correct failure handling if !hugepage_migration_support()")
> Reported-by: Manoj Iyer <manoj.iyer@canonical.com>
> Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
> Cc: Christoph Lameter <cl@linux.com>
> 
> --
> Hi Andrew,
> 
> We ran into this bug when working towards enabling memory corruption
> handling on arm64. The patch was tested on an arm64 platform running
> v4.12-rc2 with the series to enable memory corruption handling[0].
> 
> Please consider merging as a fix for the 4.12 release.
> 
> Thanks,
> Punit
> 
> [0] https://www.spinics.net/lists/arm-kernel/msg581657.html
> ---
>  mm/migrate.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 89a0a1707f4c..187abd1526df 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1201,10 +1201,8 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
>  	 * tables or check whether the hugepage is pmd-based or not before
>  	 * kicking migration.
>  	 */
> -	if (!hugepage_migration_supported(page_hstate(hpage))) {
> -		putback_active_hugepage(hpage);

Thank you for the report and the suggestion, Punit, Manoj.

Simply dropping this putback_active_hugepage() may reintroduce the failure
counting issue addressed in 32665f2bbfed, so I would recommend calling
putback_movable_pages() in the failure path of soft_offline_huge_page().

@@ -1600,7 +1600,8 @@ static int soft_offline_huge_page(struct page *page, int flags)
 		 * only one hugepage pointed to by hpage, so we need not
 		 * run through the pagelist here.
 		 */
-		putback_active_hugepage(hpage);
+		if (!list_empty(&pagelist))
+			putback_movable_pages(&pagelist);
 		if (ret > 0)
 			ret = -EIO;
 	} else {

Could you check whether this works for you?

Thanks,
Naoya Horiguchi

* Re: [PATCH] mm/migrate: Fix ref-count handling when !hugepage_migration_supported()
From: Naoya Horiguchi @ 2017-05-25  2:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Punit Agrawal, will.deacon@arm.com, catalin.marinas@arm.com,
	manoj.iyer@arm.com, linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, linux-mm@kvack.org,
	tbaicar@codeaurora.org, timur@qti.qualcomm.com, Joonsoo Kim,
	Wanpeng Li, Christoph Lameter
In-Reply-To: <20170524125610.8fbc644f8fa1cf8175b7757b@linux-foundation.org>

On Wed, May 24, 2017 at 12:56:10PM -0700, Andrew Morton wrote:
> On Wed, 24 May 2017 16:47:28 +0100 Punit Agrawal <punit.agrawal@arm.com> wrote:
> 
> > On failing to migrate a page, soft_offline_huge_page() performs the
> > necessary update to the hugepage ref-count. When
> > !hugepage_migration_supported(), unmap_and_move_huge_page() also
> > decrements the page ref-count for the hugepage. The combined behaviour
> > leaves the ref-count in an inconsistent state.
> > 
> > This leads to soft lockups when running the overcommitted hugepage test
> > from mce-tests suite.
> > 
> > Soft offlining pfn 0x83ed600 at process virtual address 0x400000000000
> > soft offline: 0x83ed600: migration failed 1, type
> > 1fffc00000008008 (uptodate|head)
> > INFO: rcu_preempt detected stalls on CPUs/tasks:
> >  Tasks blocked on level-0 rcu_node (CPUs 0-7): P2715
> >   (detected by 7, t=5254 jiffies, g=963, c=962, q=321)
> >   thugetlb_overco R  running task        0  2715   2685 0x00000008
> >   Call trace:
> >   [<ffff000008089f90>] dump_backtrace+0x0/0x268
> >   [<ffff00000808a2d4>] show_stack+0x24/0x30
> >   [<ffff000008100d34>] sched_show_task+0x134/0x180
> >   [<ffff0000081c90fc>] rcu_print_detail_task_stall_rnp+0x54/0x7c
> >   [<ffff00000813cfd4>] rcu_check_callbacks+0xa74/0xb08
> >   [<ffff000008143a3c>] update_process_times+0x34/0x60
> >   [<ffff0000081550e8>] tick_sched_handle.isra.7+0x38/0x70
> >   [<ffff00000815516c>] tick_sched_timer+0x4c/0x98
> >   [<ffff0000081442e0>] __hrtimer_run_queues+0xc0/0x300
> >   [<ffff000008144fa4>] hrtimer_interrupt+0xac/0x228
> >   [<ffff0000089a56d4>] arch_timer_handler_phys+0x3c/0x50
> >   [<ffff00000812f1bc>] handle_percpu_devid_irq+0x8c/0x290
> >   [<ffff0000081297fc>] generic_handle_irq+0x34/0x50
> >   [<ffff000008129f00>] __handle_domain_irq+0x68/0xc0
> >   [<ffff0000080816b4>] gic_handle_irq+0x5c/0xb0
> > 
> > Fix this by dropping the ref-count decrement in
> > unmap_and_move_huge_page() when !hugepage_migration_supported().
> > 
> > Fixes: 32665f2bbfed ("mm/migrate: correct failure handling if !hugepage_migration_support()")
> > Reported-by: Manoj Iyer <manoj.iyer@canonical.com>
> > Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
> > Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> > Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
> > Cc: Christoph Lameter <cl@linux.com>
> 
> 32665f2bbfed was three years ago.  Do you have any theory as to why
> this took so long to be detected?

My per-release testing only covered the "hugepage_migration_supported() == true"
setting (i.e. x86 with CONFIG_HUGETLB_PAGE=y), so I need to extend the coverage.
Also, developers on other architectures have only recently become interested
in hugepage migration.

>  And do you believe a -stable
> backport is warranted?

I agree that the fix should go to stable, so a stable tag is warranted.

Cc: stable@kernel.org   # v3.14+

Thanks,
Naoya Horiguchi

* [PATCH v2] mlock: fix mlock count can not decrease in race condition
From: Yisheng Xie @ 2017-05-25  2:13 UTC (permalink / raw)
  To: akpm
  Cc: vbabka, joern, mgorman, walken, hughd, riel, hannes, mhocko,
	qiuxishi, zhongjiang, guohanjun, wangkefeng.wang, stable,
	linux-kernel, linux-mm

Kefeng reported that when running the following test, the Mlocked count in
meminfo cannot be decreased:
 [1] testcase
 linux:~ # cat test_mlockal
 grep Mlocked /proc/meminfo
 for j in `seq 0 10`
 do
 	for i in `seq 4 15`
 	do
 		./p_mlockall >> log &
 	done
 	sleep 0.2
 done
 # wait some time to let the mlock counter decrease; 5s may not be enough
 sleep 5
 grep Mlocked /proc/meminfo

 linux:~ # cat p_mlockall.c
 #include <sys/mman.h>
 #include <stdlib.h>
 #include <stdio.h>

 #define SPACE_LEN	4096

 int main(int argc, char ** argv)
 {
 	int ret;
 	void *adr = malloc(SPACE_LEN);
 	if (!adr)
 		return -1;

 	ret = mlockall(MCL_CURRENT | MCL_FUTURE);
 	printf("mlcokall ret = %d\n", ret);

 	ret = munlockall();
 	printf("munlcokall ret = %d\n", ret);

 	free(adr);
 	return 0;
 }

In __munlock_pagevec(), a race can make isolation fail after we have already
cleared PageMlocked. Such pages are not counted in delta_munlocked, which
leaves the mlock counter incorrect: PageMlocked is already cleared, so the
count can never be decremented in the future.

Fix it by counting the number of pages whose PageMlocked flag was cleared.

Fixes: 1ebb7cc6a583 ("mm: munlock: batch NR_MLOCK zone state updates")
Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com>
Reported-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Tested-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Joern Engel <joern@logfs.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michel Lespinasse <walken@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Xishi Qiu <qiuxishi@huawei.com>
CC: zhongjiang <zhongjiang@huawei.com>
Cc: Hanjun Guo <guohanjun@huawei.com>
Cc: <stable@vger.kernel.org>
---
v2:
 - use delta_munlocked so that the fast path does not do the increment - Vlastimil

Hi Andrew:
Could you please help to fold this?

Thanks
Yisheng Xie

 mm/mlock.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/mlock.c b/mm/mlock.c
index c483c5c..b562b55 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -284,7 +284,7 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 {
 	int i;
 	int nr = pagevec_count(pvec);
-	int delta_munlocked;
+	int delta_munlocked = -nr;
 	struct pagevec pvec_putback;
 	int pgrescued = 0;
 
@@ -304,6 +304,8 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 				continue;
 			else
 				__munlock_isolation_failed(page);
+		} else {
+			delta_munlocked++;
 		}
 
 		/*
@@ -315,7 +317,6 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 		pagevec_add(&pvec_putback, pvec->pages[i]);
 		pvec->pages[i] = NULL;
 	}
-	delta_munlocked = -nr + pagevec_count(&pvec_putback);
 	__mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
 	spin_unlock_irq(zone_lru_lock(zone));
 
-- 
1.7.12.4
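
To illustrate the accounting change, here is a minimal user-space sketch
(not kernel code). struct toy_page and both helper functions are made up
for illustration; they only model the two conditions that the loop in
__munlock_pagevec() branches on, as described in the changelog above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct page: only the two facts the loop branches on. */
struct toy_page {
	bool mlocked;     /* PageMlocked was set when the loop reached the page */
	bool isolate_ok;  /* isolating the page from the LRU succeeds */
};

/* Before the patch: every page that is put back counts as "not munlocked". */
static int delta_before_patch(const struct toy_page *pages, size_t nr)
{
	int putback = 0;

	for (size_t i = 0; i < nr; i++) {
		if (pages[i].mlocked && pages[i].isolate_ok)
			continue;  /* left for the second phase, not put back */
		putback++;         /* also hit by mlocked pages whose isolation failed */
	}
	return -(int)nr + putback;
}

/* After the patch: only pages whose flag was not set are excluded. */
static int delta_after_patch(const struct toy_page *pages, size_t nr)
{
	int delta = -(int)nr;

	for (size_t i = 0; i < nr; i++) {
		if (!pages[i].mlocked)
			delta++;   /* flag was clear already, nothing to uncount */
	}
	return delta;
}
```

For three pages where one is mlocked and isolates fine, one is mlocked but
fails isolation, and one was never mlocked, the old formula yields -1 while
the patched one yields -2, so the page whose isolation failed is no longer
lost from NR_MLOCK.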


* Re: [PATCH] mm/vmalloc: a slight change of compare target in __insert_vmap_area()
From: zhong jiang @ 2017-05-25  3:04 UTC (permalink / raw)
  To: Wei Yang; +Cc: akpm, mhocko, linux-mm, linux-kernel
In-Reply-To: <20170524100347.8131-1-richard.weiyang@gmail.com>

I have hit the overlap issue, but it is hard to reproduce. If you think it is
safe and the situation cannot happen, then AFAIC there is no need to add the code.

If you insist on the point, maybe VM_WARN_ON is an option.

Regards
zhongjiang
On 2017/5/24 18:03, Wei Yang wrote:
> The vmap RB tree stores its elements in order, with no overlap between any
> of them. The comparison in __insert_vmap_area() decides which direction
> the search should follow and makes sure the new vmap_area does not overlap
> with any other.
>
> The current implementation fails to do the overlap check.
>
> When the first "if" is not true, it means
>
>     va->va_start >= tmp_va->va_end
>
> And given the invariant
>
>     xxx->va_end > xxx->va_start
>
> it follows that
>
>     va->va_end > tmp_va->va_start
>
> which is the condition in the second "if".
>
> This patch slightly changes the comparison in __insert_vmap_area() to
> make sure it forbids an overlapping vmap_area.
>
> Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
> ---
>  mm/vmalloc.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 0b057628a7ba..8087451cb332 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -360,9 +360,9 @@ static void __insert_vmap_area(struct vmap_area *va)
>  
>  		parent = *p;
>  		tmp_va = rb_entry(parent, struct vmap_area, rb_node);
> -		if (va->va_start < tmp_va->va_end)
> +		if (va->va_end <= tmp_va->va_start)
>  			p = &(*p)->rb_left;
> -		else if (va->va_end > tmp_va->va_start)
> +		else if (va->va_start >= tmp_va->va_end)
>  			p = &(*p)->rb_right;
>  		else
>  			BUG();
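
The difference between the two comparisons can be checked in user space.
The sketch below is only an illustration: struct toy_range, the enum and
both function names are invented here, and ranges are assumed to be
half-open, [start, end), with end > start as in the quoted reasoning.

```c
/* Possible outcomes of one comparison step while descending the tree. */
enum toy_dir { GO_LEFT, GO_RIGHT, OVERLAP };

/* Half-open address range [start, end); assumed to satisfy end > start. */
struct toy_range {
	unsigned long start;
	unsigned long end;
};

/* Comparison before the patch. */
static enum toy_dir old_compare(struct toy_range va, struct toy_range tmp)
{
	if (va.start < tmp.end)
		return GO_LEFT;
	else if (va.end > tmp.start)
		return GO_RIGHT;
	return OVERLAP;  /* unreachable for valid ranges, see the deduction above */
}

/* Comparison after the patch. */
static enum toy_dir new_compare(struct toy_range va, struct toy_range tmp)
{
	if (va.end <= tmp.start)
		return GO_LEFT;   /* va lies entirely below tmp */
	else if (va.start >= tmp.end)
		return GO_RIGHT;  /* va lies entirely above tmp */
	return OVERLAP;           /* the ranges share at least one address */
}
```

With tmp = [10, 20) and an overlapping va = [15, 25), old_compare() returns
GO_LEFT while new_compare() returns OVERLAP, which corresponds to reaching
BUG() in the patched code. For non-overlapping ranges both agree.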



* Re: [PATCH] mm/zsmalloc: fix -Wunneeded-internal-declaration warning
From: Minchan Kim @ 2017-05-25  5:38 UTC (permalink / raw)
  To: Nick Desaulniers
  Cc: md, mka, Nitin Gupta, Sergey Senozhatsky, linux-mm, linux-kernel
In-Reply-To: <20170524053859.29059-1-nick.desaulniers@gmail.com>

On Tue, May 23, 2017 at 10:38:57PM -0700, Nick Desaulniers wrote:
> is_first_page() is only called from the macro VM_BUG_ON_PAGE() which is
> only compiled in as a runtime check when CONFIG_DEBUG_VM is set,
> otherwise is checked at compile time and not actually compiled in.
> 
> Fixes the following warning, found with Clang:
> 
> mm/zsmalloc.c:472:12: warning: function 'is_first_page' is not needed and
> will not be emitted [-Wunneeded-internal-declaration]
> static int is_first_page(struct page *page)
>            ^
> 
> Signed-off-by: Nick Desaulniers <nick.desaulniers@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>

Thanks.


* Re: [PATCH] mm/hugetlb: Report -EHWPOISON not -EFAULT when FOLL_HWPOISON is specified
From: Naoya Horiguchi @ 2017-05-25  5:38 UTC (permalink / raw)
  To: James Morse
  Cc: linux-mm@kvack.org, Kirill A . Shutemov, Andrew Morton,
	Punit Agrawal
In-Reply-To: <20170524160900.28786-1-james.morse@arm.com>

On Wed, May 24, 2017 at 05:09:00PM +0100, James Morse wrote:
> KVM uses get_user_pages() to resolve its stage2 faults. KVM sets the
> FOLL_HWPOISON flag causing faultin_page() to return -EHWPOISON when it
> finds a VM_FAULT_HWPOISON. KVM handles these hwpoison pages as a special
> case. (check_user_page_hwpoison())
> 
> When huge pages are involved, this doesn't work so well. get_user_pages()
> calls follow_hugetlb_page(), which stops early if it receives
> VM_FAULT_HWPOISON from hugetlb_fault(), eventually returning -EFAULT to
> the caller. The step to map this to -EHWPOISON based on the FOLL_ flags
> is missing. The hwpoison special case is skipped, and -EFAULT is returned
> to user-space, causing Qemu or kvmtool to exit.
> 
> Instead, move this VM_FAULT_ to errno mapping code into a header file
> and use it from faultin_page() and follow_hugetlb_page().
> 
> With this, KVM works as expected.
> 
> CC: Punit Agrawal <punit.agrawal@arm.com>
> Signed-off-by: James Morse <james.morse@arm.com>
> ---
> This isn't a problem for arm64 today as we haven't enabled MEMORY_FAILURE,
> but I can't see any reason this doesn't happen on x86 too, so I think this
> should be a fix. This doesn't apply earlier than stable's v4.11.1 due to
> all sorts of cleanup. My best offer is:
> Cc: stable@vger.kernel.org # 4.11.1
> 
>  include/linux/mm.h | 10 ++++++++++
>  mm/gup.c           |  9 +++------
>  mm/hugetlb.c       |  3 +++
>  3 files changed, 16 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 7cb17c6b97de..48b47c214c50 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2327,6 +2327,16 @@ static inline struct page *follow_page(struct vm_area_struct *vma,
>  #define FOLL_REMOTE	0x2000	/* we are working on non-current tsk/mm */
>  #define FOLL_COW	0x4000	/* internal GUP flag */
>  
> +static inline int vm_fault_to_errno(int vm_fault, int foll_flags) {

According to the coding style, the opening brace of a function should go on
a new line.

> +	if (vm_fault & VM_FAULT_OOM)
> +		return -ENOMEM;
> +	if (vm_fault & (VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE))
> +		return (foll_flags & FOLL_HWPOISON) ? -EHWPOISON : -EFAULT;
> +	if (vm_fault & (VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV))
> +		return -EFAULT;
> +	return 0;
> +}
> +

Can you apply this function to fixup_user_fault()?  fixup_user_fault()
now returns -EHWPOISON if handle_mm_fault returns VM_FAULT_HWPOISON*,
but I think there's no specific reason to choose EHWPOISON.
Callers of fixup_user_fault() have no interest in hwpoison code, and
they just use the return value to check success/failure (== 0 or != 0.)
So using vm_fault_to_errno(ret, 0) should be OK.
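
As a rough user-space sketch of the mapping being discussed: the TOY_*
bit values below are placeholders chosen for illustration (the real
VM_FAULT_* and FOLL_* definitions live in include/linux/mm.h), and the
EHWPOISON fallback assumes the value used on x86. Only the order of the
checks is taken from the quoted patch.

```c
#include <errno.h>

#ifndef EHWPOISON
#define EHWPOISON 133  /* assumed x86 Linux value, for hosts that lack it */
#endif

/* Placeholder bits for illustration only; not the kernel's values. */
#define TOY_VM_FAULT_OOM            0x01
#define TOY_VM_FAULT_SIGBUS         0x02
#define TOY_VM_FAULT_HWPOISON       0x10
#define TOY_VM_FAULT_HWPOISON_LARGE 0x20
#define TOY_VM_FAULT_SIGSEGV        0x40
#define TOY_FOLL_HWPOISON           0x100

/* Same check order as the helper in the patch: OOM, hwpoison, signals. */
static int toy_vm_fault_to_errno(int vm_fault, int foll_flags)
{
	if (vm_fault & TOY_VM_FAULT_OOM)
		return -ENOMEM;
	if (vm_fault & (TOY_VM_FAULT_HWPOISON | TOY_VM_FAULT_HWPOISON_LARGE))
		return (foll_flags & TOY_FOLL_HWPOISON) ? -EHWPOISON : -EFAULT;
	if (vm_fault & (TOY_VM_FAULT_SIGBUS | TOY_VM_FAULT_SIGSEGV))
		return -EFAULT;
	return 0;  /* no error bit recognised */
}
```

Passing 0 as the flags, as suggested for fixup_user_fault(), turns a
hwpoison fault into -EFAULT, which is enough for callers that only test
the result for zero or non-zero.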

>  typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
>  			void *data);
>  extern int apply_to_page_range(struct mm_struct *mm, unsigned long address,
> diff --git a/mm/gup.c b/mm/gup.c
> index d9e6fddcc51f..69f6cec279b3 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -407,12 +407,9 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
>  
>  	ret = handle_mm_fault(vma, address, fault_flags);
>  	if (ret & VM_FAULT_ERROR) {
> -		if (ret & VM_FAULT_OOM)
> -			return -ENOMEM;
> -		if (ret & (VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE))
> -			return *flags & FOLL_HWPOISON ? -EHWPOISON : -EFAULT;
> -		if (ret & (VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV))
> -			return -EFAULT;
> +		int err = vm_fault_to_errno(ret, *flags);
> +		if (err)
> +			return err;
>  		BUG();
>  	}
>  
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index e5828875f7bb..08f69dadbc63 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4170,7 +4170,10 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  			}
>  			ret = hugetlb_fault(mm, vma, vaddr, fault_flags);
>  			if (ret & VM_FAULT_ERROR) {
> +				int err = vm_fault_to_errno(ret, flags);
>  				remainder = 0;
> +				if (err)
> +					return err;

(nitpick) Shouldn't the err check come before remainder = 0?
# Although the compiler will probably optimize it anyway.

Thanks,
Naoya Horiguchi

* Re: [PATCH] mm/vmalloc: a slight change of compare target in __insert_vmap_area()
From: Michal Hocko @ 2017-05-25  5:39 UTC (permalink / raw)
  To: Wei Yang; +Cc: akpm, linux-mm, linux-kernel
In-Reply-To: <20170524150730.GA8445@WeideMacBook-Pro.local>

On Wed 24-05-17 23:07:30, Wei Yang wrote:
> On Wed, May 24, 2017 at 02:11:35PM +0200, Michal Hocko wrote:
> >On Wed 24-05-17 18:03:47, Wei Yang wrote:
> >> The vmap RB tree stores its elements in order, with no overlap between any
> >> of them. The comparison in __insert_vmap_area() decides which direction
> >> the search should follow and makes sure the new vmap_area does not overlap
> >> with any other.
> >> 
> >> The current implementation fails to do the overlap check.
> >> 
> >> When the first "if" is not true, it means
> >> 
> >>     va->va_start >= tmp_va->va_end
> >> 
> >> And given the invariant
> >> 
> >>     xxx->va_end > xxx->va_start
> >> 
> >> it follows that
> >> 
> >>     va->va_end > tmp_va->va_start
> >> 
> >> which is the condition in the second "if".
> >> 
> >> This patch slightly changes the comparison in __insert_vmap_area() to
> >> make sure it forbids an overlapping vmap_area.
> >
> >Why do we care about overlapping vmap areas at this level. This is an
> >internal function and all the sanity checks should have been done by
> >that time AFAIR. Could you describe the problem which you are trying to
> >fix/address?
> >
> 
> No problem it tries to fix.

I would prefer not to touch the code if there is no problem to fix.
-- 
Michal Hocko
SUSE Labs


* Re: [patch] compiler, clang: suppress warning for unused static inline functions
From: Ingo Molnar @ 2017-05-25  5:52 UTC (permalink / raw)
  To: Matthias Kaehlcke
  Cc: David Rientjes, Andrew Morton, Christoph Lameter, Pekka Enberg,
	Joonsoo Kim, linux-mm, linux-kernel, Douglas Anderson, Mark Brown,
	David Miller
In-Reply-To: <20170524212229.GR141096@google.com>


* Matthias Kaehlcke <mka@chromium.org> wrote:

> On Wed, May 24, 2017 at 02:01:15PM -0700, David Rientjes wrote:
> 
> > GCC explicitly does not warn for unused static inline functions for
> > -Wunused-function.  The manual states:
> > 
> > 	Warn whenever a static function is declared but not defined or
> > 	a non-inline static function is unused.
> > 
> > Clang does warn for static inline functions that are unused.
> > 
> > It turns out that suppressing the warnings avoids potentially complex
> > #ifdef directives, which also reduces LOC.
> > 
> > Suppress the warning for clang.
> > 
> > Signed-off-by: David Rientjes <rientjes@google.com>
> > ---
> 
> As expressed earlier in other threads, I don't think gcc's behavior is
> preferable in this case. The warning on static inline functions (only
> in .c files) makes it possible to detect truly unused code. About 50% of
> the warnings I have looked into so far fall into this category.
> 
> In my opinion it is more valuable to detect dead code than to avoid
> a few more __maybe_unused attributes (there aren't really that many
> instances, at least with x86 and arm64 defconfig). In most cases it is
> not necessary to use #ifdef; it is an option which is preferred by
> some maintainers. The reduced LOC is arguable, since detecting dead
> code makes it possible to remove it.

Static inline functions in headers are often not dead code.

Thanks,

	Ingo
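
A small example of the two cases under discussion, as a sketch only.
toy_maybe_unused is a local stand-in for the kernel's __maybe_unused
(assumed here to expand to the GCC/Clang unused attribute), TOY_DEBUG is
a hypothetical config switch, and the function names are invented.

```c
/* Local stand-in for the kernel's __maybe_unused annotation. */
#define toy_maybe_unused __attribute__((unused))

/*
 * Only called when the hypothetical TOY_DEBUG is defined. The annotation
 * is the alternative to wrapping the definition itself in #ifdef.
 */
static inline toy_maybe_unused int toy_debug_scramble(int x)
{
	return x ^ 0x5a;
}

/* Always called, so it is never reported as unused. */
static inline int toy_double(int x)
{
	return 2 * x;
}

int toy_caller(int x)
{
#ifdef TOY_DEBUG
	/* XOR with the same constant twice leaves the value unchanged. */
	x = toy_debug_scramble(toy_debug_scramble(x));
#endif
	return toy_double(x);
}
```

Going by the discussion above, without the annotation Clang would be
expected to report toy_debug_scramble() as unused when this lives in a .c
file and TOY_DEBUG is not defined, while GCC would stay silent.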


* Re: [Question] Mlocked count will not be decreased
From: Vlastimil Babka @ 2017-05-25  6:12 UTC (permalink / raw)
  To: Xishi Qiu; +Cc: Yisheng Xie, Kefeng Wang, linux-mm, linux-kernel, zhongjiang
In-Reply-To: <5926306F.2060205@huawei.com>

On 05/25/2017 03:16 AM, Xishi Qiu wrote:
> On 2017/5/24 21:16, Vlastimil Babka wrote:
>>>>
>>>> I agree about yisheng's fix (but v2 didn't address my comments). I don't
>>>> think we should add the hunk below, as that deviates from the rest of
>>>> the design.
>>>
>>> Hi Vlastimil,
>>>
>>> The rest of the design is that mlock should always succeed here, right?
>>
>> The rest of the design allows a temporary disconnect between mlocked
>> flag and being placed on unevictable lru.
>>
>>> If we don't handle the failure case, the page will later be put on the
>>> anon/file LRU list when __pagevec_lru_add() is called, but NR_MLOCK has
>>> already been increased. This is wrong, right?
>>
>> It's not wrong, the page cannot get evicted even if on wrong lru, so
>> effectively it's already mlocked. We would be underaccounting NR_MLOCK.
>>
> 
> Hi Vlastimil,
> 
> I don't quite understand why the page cannot get evicted even if it is on the
> wrong LRU. __isolate_lru_page() only skips pages with PageUnevictable(page)
> set, but that flag has not been set; we only set PageMlocked.

The isolated page has to be unmapped from all vma's that map it. See
try_to_unmap_one() and this check:

                if (!(flags & TTU_IGNORE_MLOCK)) {
                        if (vma->vm_flags & VM_LOCKED) {
				...
				ret = false;

This VM_LOCKED is what actually controls whether the page is evictable. The
rest is optimization (a separate lru list so we don't scan the pages in
reclaim if they can't be evicted anyway), and accounting (pages with the
PageMlocked flag are counted as NR_MLOCK). That's why temporary
inconsistency isn't a problem.
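
The point can be modelled in a few lines of user-space C. This is a toy
sketch, not the kernel's try_to_unmap_one(): struct toy_vma, TOY_VM_LOCKED
and the function name are invented, and the model only captures the check
quoted above.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_VM_LOCKED 0x1UL  /* placeholder bit, not the kernel's value */

/* Toy stand-in for a vma that maps the page. */
struct toy_vma {
	unsigned long vm_flags;
};

/*
 * Returns true only if the page could be unmapped from every vma, which is
 * the precondition for evicting it. Which LRU list the page sits on and
 * whether PageMlocked is set play no role in the decision.
 */
static bool toy_try_to_unmap(const struct toy_vma *vmas, size_t nr,
			     bool ignore_mlock)
{
	for (size_t i = 0; i < nr; i++) {
		if (!ignore_mlock && (vmas[i].vm_flags & TOY_VM_LOCKED))
			return false;  /* a locked mapping keeps the page in memory */
	}
	return true;
}
```

A page mapped by at least one vma with the locked bit set is refused here
regardless of its own flags, which is why a page on the "wrong" list is
still effectively mlocked in this model.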


> Thanks,
> Xishi Qiu
> 
>>> Thanks,
>>> Xishi Qiu
>>>
>>>>
>>>> Thanks,
>>>> Vlastimil
>>>>
>>>>> diff --git a/mm/mlock.c b/mm/mlock.c
>>>>> index 3d3ee6c..ca2aeb9 100644
>>>>> --- a/mm/mlock.c
>>>>> +++ b/mm/mlock.c
>>>>> @@ -88,6 +88,11 @@ void mlock_vma_page(struct page *page)
>>>>>  		count_vm_event(UNEVICTABLE_PGMLOCKED);
>>>>>  		if (!isolate_lru_page(page))
>>>>>  			putback_lru_page(page);
>>>>> +		else {
>>>>> +			ClearPageMlocked(page);
>>>>> +			mod_zone_page_state(page_zone(page), NR_MLOCK,
>>>>> +					-hpage_nr_pages(page));
>>>>> +		}
>>>>>  	}
>>>>>  }
>>>>>
>>>>> Thanks,
>>>>> Xishi Qiu
>>>>>

* Re: [RFC PATCH 2/2] mm, memory_hotplug: drop CONFIG_MOVABLE_NODE
From: Michal Hocko @ 2017-05-25  6:27 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Andrew Morton, Mel Gorman, Andrea Arcangeli,
	Jerome Glisse, Reza Arbab, Yasuaki Ishimatsu, qiuxishi,
	Kani Toshimitsu, slaoub, Joonsoo Kim, Andi Kleen, David Rientjes,
	Daniel Kiper, Igor Mammedov, Vitaly Kuznetsov, LKML
In-Reply-To: <6a0bd7c7-8beb-d599-ed31-caca68cd8b30@suse.cz>

On Wed 24-05-17 17:17:08, Vlastimil Babka wrote:
> On 05/24/2017 03:42 PM, Michal Hocko wrote:
[...]
> >>> --- a/mm/Kconfig
> >>> +++ b/mm/Kconfig
> >>> @@ -149,32 +149,6 @@ config NO_BOOTMEM
> >>>  config MEMORY_ISOLATION
> >>>  	bool
> >>>  
> >>> -config MOVABLE_NODE
> >>> -	bool "Enable to assign a node which has only movable memory"
> >>> -	depends on HAVE_MEMBLOCK
> >>> -	depends on NO_BOOTMEM
> >>> -	depends on X86_64 || OF_EARLY_FLATTREE || MEMORY_HOTPLUG
> >>> -	depends on NUMA
> >>
> >> That's a lot of depends. What happens if some of them are not met and
> >> the movable_node bootparam is used?
> > 
> > Good question. I haven't explored that, to be honest. Now that I am looking closer
> > I am not even sure why all those dependencies are there. MEMORY_HOTPLUG
> > is clear and OF_EARLY_FLATTREE is explained by 41a9ada3e6b4 ("of/fdt:
> > mark hotpluggable memory"). NUMA is less clear to me because
> > MEMORY_HOTPLUG doesn't really depend on NUMA systems. Dependency on
> > NO_BOOTMEM is also not clear to me because zones layout
> > doesn't really depend on the specific boot time allocator.
> > 
> > So we are left with HAVE_MEMBLOCK which seems to be there because
> > movable_node_enabled is defined there while the parameter handling is in
> > the hotplug proper. But there is no real reason to have it like that.
> > This compiles but I will have to throw my full compile battery at it
> > to be sure. I will make it a separate patch.
> 
> I'd expect stuff might compile and work (run without crash), just in
> some cases the boot option could be effectively ignored? In that case
> it's just a matter of documenting the option, possibly also some warning
> when used, e.g. "node_movable was ignored because CONFIG_FOO is not
> enabled"?

Hmm, I can make the cmd parameter available only when
CONFIG_HAVE_MEMBLOCK_NODE_MAP but I am not sure how helpful it would be.
AFAIR unrecognized options are just ignored. On the other hand debugging
why the parameter doesn't do anything might be really frustrating. Here
is the patch I will put on top of the two posted. Strictly speaking it
breaks bisection, but switching the order would be a rather pointless
ifdefery game and I do not see that it would matter all that much. I can
rework it if you guys think otherwise though.
---


* Re: [RFC PATCH 1/2] mm, memory_hotplug: drop artificial restriction on online/offline
From: Michal Hocko @ 2017-05-25  6:28 UTC (permalink / raw)
  To: Reza Arbab
  Cc: linux-mm, Andrew Morton, Mel Gorman, Vlastimil Babka,
	Andrea Arcangeli, Jerome Glisse, Yasuaki Ishimatsu, qiuxishi,
	Kani Toshimitsu, slaoub, Joonsoo Kim, Andi Kleen, David Rientjes,
	Daniel Kiper, Igor Mammedov, Vitaly Kuznetsov, LKML
In-Reply-To: <20170524215056.h4r3sdk23bn4c2sr@arbab-laptop.localdomain>

On Wed 24-05-17 16:50:56, Reza Arbab wrote:
> On Wed, May 24, 2017 at 02:24:10PM +0200, Michal Hocko wrote:
> >74d42d8fe146 ("memory_hotplug: ensure every online node has NORMAL
> >memory") has added can_offline_normal which checks the amount of
> >memory in !movable zones as long as CONFIG_MOVABLE_NODE is disabled.
> >It disallows offlining memory if there is nothing left, with the
> >justification that "memory-management acts bad when we have nodes which
> >is online but don't have any normal memory".
> >
> >74d42d8fe146 ("memory_hotplug: ensure every online node has NORMAL
> >memory") has introduced a restriction that every numa node has to have
> >at least some memory in !movable zones before a first movable memory
> >can be onlined if !CONFIG_MOVABLE_NODE with the same justification
> >
> >While it is true that not having _any_ memory for kernel allocations on
> >a NUMA node is far from great, and such a node would be quite suboptimal
> >because all kernel allocations would have to fall back to another NUMA
> >node, there is no reason to disallow such a configuration in
> >principle.
> >
> >Besides that, there is not really a big difference between having one
> >memblock for ZONE_NORMAL available and having none. With 128MB memblocks
> >the system might thrash on kernel allocation requests anyway. It is
> >really hard to draw a line on how much normal memory is sufficient, so
> >we have to rely on the administrator to configure the system sanely.
> >Therefore drop the artificial restriction and remove can_offline_normal
> >and can_online_high_movable altogether.
> 
> I'm really liking all this cleanup of the memory hotplug code. Thanks!  Much
> appreciated.

I am glad to hear that and more is to come.

> Acked-by: Reza Arbab <arbab@linux.vnet.ibm.com>

Thanks!
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

* Re: [PATCH v2] mlock: fix mlock count can not decrease in race condition
From: Vlastimil Babka @ 2017-05-25  6:32 UTC (permalink / raw)
  To: Yisheng Xie, akpm
  Cc: joern, mgorman, walken, hughd, riel, hannes, mhocko, qiuxishi,
	zhongjiang, guohanjun, wangkefeng.wang, stable, linux-kernel,
	linux-mm
In-Reply-To: <1495678405-54569-1-git-send-email-xieyisheng1@huawei.com>

On 05/25/2017 04:13 AM, Yisheng Xie wrote:
> Kefeng reported that when running the following test, the mlock count
> in meminfo cannot be decreased:

"increases permanently."?

>  [1] testcase
>  linux:~ # cat test_mlockal
>  grep Mlocked /proc/meminfo
>   for j in `seq 0 10`
>   do
>  	for i in `seq 4 15`
>  	do
>  		./p_mlockall >> log &
>  	done
>  	sleep 0.2
>  done
>  # wait some time to let mlock counter decrease and 5s may not enough
>  sleep 5
>  grep Mlocked /proc/meminfo
> 
>  linux:~ # cat p_mlockall.c
>  #include <sys/mman.h>
>  #include <stdlib.h>
>  #include <stdio.h>
> 
>  #define SPACE_LEN	4096
> 
>  int main(int argc, char ** argv)
>  {
>  	int ret;
>  	void *adr = malloc(SPACE_LEN);
>  	if (!adr)
>  		return -1;
> 
>  	ret = mlockall(MCL_CURRENT | MCL_FUTURE);
>  	printf("mlockall ret = %d\n", ret);
> 
>  	ret = munlockall();
>  	printf("munlockall ret = %d\n", ret);
> 
>  	free(adr);
>  	return 0;
>  }
> 
> In __munlock_pagevec(), we clear the PageMlocked flag, but isolating the
> page can then fail in a race condition. Such pages are not counted in
> delta_munlocked, which makes the mlock counter incorrect: the flag has
> already been cleared, so the count can never be decremented in the future.

Can I suggest the following instead:

In __munlock_pagevec() we should decrement NR_MLOCK for each page where
we clear the PageMlocked flag. Commit 1ebb7cc6a583 ("mm: munlock: batch
NR_MLOCK zone state updates") has introduced a bug where we don't
decrement NR_MLOCK for pages where we clear the flag, but fail to
isolate them from the lru list (e.g. when the pages are on some other
cpu's percpu pagevec). Since PageMlocked stays cleared, the NR_MLOCK
accounting gets permanently disrupted by this.
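
As an illustration of the accounting described above, here is a minimal
user-space sketch. It is not kernel code: struct model_page and both
functions are hypothetical names made up for this example, and the
"isolatable" field merely stands in for whether LRU isolation would
succeed. It only contrasts the delta computed when isolation failures
are skipped with the delta computed when every cleared flag is counted.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a page; not the kernel's struct page. */
struct model_page {
	bool mlocked;		/* models the PageMlocked flag */
	bool isolatable;	/* models whether LRU isolation succeeds */
};

/*
 * Buggy accounting: the counter is only decremented for pages whose
 * flag was cleared AND which were successfully isolated.
 */
static long munlock_delta_buggy(struct model_page *pages, size_t nr)
{
	long delta = 0;
	size_t i;

	for (i = 0; i < nr; i++) {
		if (!pages[i].mlocked)
			continue;
		pages[i].mlocked = false;	/* flag is cleared either way */
		if (pages[i].isolatable)
			delta--;
	}
	return delta;
}

/* Fixed accounting: decrement once for every page whose flag is cleared. */
static long munlock_delta_fixed(struct model_page *pages, size_t nr)
{
	long delta = 0;
	size_t i;

	for (i = 0; i < nr; i++) {
		if (!pages[i].mlocked)
			continue;
		pages[i].mlocked = false;
		delta--;
	}
	return delta;
}
```

With one mlocked page that isolates, one mlocked page that does not, and
one page that was never mlocked, the first function yields -1 and the
second yields -2, although two flags were cleared in both cases. The
difference is the count that is lost for good.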

> Fix it by counting the number of pages whose PageMlocked flag is cleared.
> 
> Fixes: 1ebb7cc6a583 ("mm: munlock: batch NR_MLOCK zone state updates")
> Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com>
> Reported-by: Kefeng Wang <wangkefeng.wang@huawei.com>
> Tested-by: Kefeng Wang <wangkefeng.wang@huawei.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

Thanks!

> Cc: Joern Engel <joern@logfs.org>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Michel Lespinasse <walken@google.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@suse.cz>
> Cc: Xishi Qiu <qiuxishi@huawei.com>
> CC: zhongjiang <zhongjiang@huawei.com>
> Cc: Hanjun Guo <guohanjun@huawei.com>
> Cc: <stable@vger.kernel.org>
> ---
> v2:
>  - use delta_munlocked for it doesn't do the increment in fastpath - Vlastimil
> 
> Hi Andrew:
> Could you please help to fold this?
> 
> Thanks
> Yisheng Xie
> 
>  mm/mlock.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/mlock.c b/mm/mlock.c
> index c483c5c..b562b55 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -284,7 +284,7 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
>  {
>  	int i;
>  	int nr = pagevec_count(pvec);
> -	int delta_munlocked;
> +	int delta_munlocked = -nr;
>  	struct pagevec pvec_putback;
>  	int pgrescued = 0;
>  
> @@ -304,6 +304,8 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
>  				continue;
>  			else
>  				__munlock_isolation_failed(page);
> +		} else {
> +			delta_munlocked++;
>  		}
>  
>  		/*
> @@ -315,7 +317,6 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
>  		pagevec_add(&pvec_putback, pvec->pages[i]);
>  		pvec->pages[i] = NULL;
>  	}
> -	delta_munlocked = -nr + pagevec_count(&pvec_putback);
>  	__mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
>  	spin_unlock_irq(zone_lru_lock(zone));
>  
> 

* [PATCH -mm 00/13] mm, THP, swap: Delay splitting THP after swapped out
From: Huang, Ying @ 2017-05-25  6:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Johannes Weiner, Minchan Kim,
	Hugh Dickins, Shaohua Li, Rik van Riel, Andrea Arcangeli,
	Kirill A . Shutemov, Jens Axboe, Michal Hocko, Huang Ying

From: Huang Ying <ying.huang@intel.com>

Hi, Andrew, could you help me to check whether the overall design is
reasonable?

Hi, Johannes and Minchan, Thanks a lot for your review to the first
step of the THP swap optimization!  Could you help me to review the
second step in this patchset?

Hi, Hugh, Shaohua, Minchan and Rik, could you help me to review the
swap part of the patchset?  Especially [01/13], [02/13], [03/13],
[04/13], [07/13], [12/13], and [13/13].

Hi, Andrea and Kirill, could you help me to review the THP part of the
patchset?  Especially [01/13], [03/13], [08/13], [09/13], [10/13],
[12/13].

Hi, Jens and Shaohua, could you help me to review the block part of
the patchset?  Especially [05/13], [06/13], and [07/13].

Hi, Johannes, Michal, could you help me to review the cgroup part of
the patchset?  Especially [09/13], [10/13], and [11/13].

And for all, Any comment is welcome!

This is the second step of THP (Transparent Huge Page) swap
optimization.  In the first step, splitting the huge page was delayed
from almost the first step of swapping out to after allocating the
swap space for the THP and adding the THP into the swap cache.  In the
second step, the splitting is delayed further, to after the swapping
out has finished.  The plan is to delay splitting the THP step by step,
and finally to avoid splitting the THP during swap out altogether and
swap out/in the THP as a whole.

In this patchset, more operations in anonymous THP reclaim, such as
TLB flushing, writing the THP to the swap device, and removing the THP
from the swap cache, are batched, so that the performance of anonymous
THP swap out is improved.

This patchset is based on the 5/18 head of mmotm/master.

During the development, the following scenarios/code paths have been
checked,

- swap out/in
- swap off
- write protect page fault
- madvise_free
- process exit
- split huge page

Please let me know if I missed something.

With the patchset, the swap out throughput improves by 42% (from about
5.81GB/s to about 8.25GB/s) in the vm-scalability swap-w-seq test case
with 16 processes.  At the same time, the number of IPIs (which
reflects TLB flushing) is reduced by about 78.9%.  The test was done on
a Xeon E5 v3 system.  The swap device used is a RAM-simulated PMEM
(persistent memory) device.  To test sequential swapping out, the test
case creates 8 processes, which sequentially allocate and write to
anonymous pages until the RAM and part of the swap device are used up.

Below is the part of the cover letter for the first step patchset of
THP swap optimization which applies to all steps.

----------------------------------------------------------------->

Recently, the performance of storage devices has improved so fast that
we cannot saturate the disk bandwidth with a single logical CPU when
doing page swap out, even on a high-end server machine, because the
performance of storage devices has improved faster than that of a
single logical CPU.  And it seems that the trend will not change in
the near future.  On the other hand, THP is becoming more and more
popular because of increased memory sizes.  So it becomes necessary to
optimize THP swap performance.

The advantages of the THP swap support include:

- Batch the swap operations for the THP to reduce TLB flushing and
  lock acquiring/releasing, including allocating/freeing the swap
  space, adding/deleting to/from the swap cache, and writing/reading
  the swap space, etc.  This will help improve the performance of the
  THP swap.

- The THP swap space read/write will be 2M sequential IO.  This is
  particularly helpful for swap reads, which are usually 4k random
  IO.  This will improve the performance of the THP swap too.

- It will help reduce memory fragmentation, especially when the THP is
  heavily used by the applications.  The 2M of contiguous pages will
  be freed up after the THP is swapped out.

- It will improve the THP utilization on systems with swap turned on,
  because khugepaged is quite slow at collapsing normal pages into a
  THP.  After the THP is split during swapping out, it takes quite a
  long time for the normal pages to collapse back into a THP after
  being swapped in.  High THP utilization helps the efficiency of the
  page-based memory management too.

There are some concerns regarding THP swap in, mainly because the
possibly enlarged read/write IO size (for swap in/out) may put more
overhead on the storage device.  To deal with that, THP swap in should
be turned on only when necessary.  For example, it can be selected via
"always/never/madvise" logic, to be turned on globally, turned off
globally, or turned on only for VMAs with MADV_HUGEPAGE, etc.

Best Regards,
Huang, Ying

* [PATCH -mm 01/13] mm, THP, swap: Support to clear swap cache flag for THP swapped out
From: Huang, Ying @ 2017-05-25  6:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Huang Ying, Johannes Weiner, Minchan Kim,
	Hugh Dickins, Shaohua Li, Rik van Riel
In-Reply-To: <20170525064635.2832-1-ying.huang@intel.com>

From: Huang Ying <ying.huang@intel.com>

Previously, swapcache_free_cluster() was used only in the error path of
shrink_page_list() to free the swap cluster just allocated if the
THP (Transparent Huge Page) fails to be split.  In this patch, it
is enhanced to clear the swap cache flag (SWAP_HAS_CACHE) for the swap
cluster that holds the contents of a swapped-out THP.

This will be used to support delaying the THP split until after
swapping out.  Because swapping in a THP as a whole is not supported
yet, the swap cluster backing the swapped-out THP is split after the
swap cache flag is cleared, so that the swap slots in the swap cluster
can be swapped in as normal pages later.
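
The diff below distinguishes three cases depending on how many slots in
the cluster are held only by the swap cache.  The following user-space
sketch models just that decision.  It is an illustration, not the
kernel code: the enum, the function name, and the MODEL_* constants are
made up for the example, and the cluster size of 8 is an assumed value
chosen to keep the example small.

```c
#include <stddef.h>

/* Assumed values for illustration only. */
#define MODEL_SWAP_HAS_CACHE	0x40	/* slot is referenced by the swap cache */
#define MODEL_CLUSTER_SIZE	8	/* slots per cluster in this model */

enum cluster_action {
	CLEAR_CACHE_FLAG_ONLY,	/* every slot still has another user */
	FREE_WHOLE_CLUSTER,	/* no slot has a user besides the swap cache */
	FREE_SLOTS_ONE_BY_ONE,	/* mixed: drop the cache reference per slot */
};

/* Decide what to do with a cluster based on its per-slot map bytes. */
static enum cluster_action classify_cluster(const unsigned char *map)
{
	size_t free_entries = 0;
	size_t i;

	/* A slot whose value is exactly the cache flag has no other user. */
	for (i = 0; i < MODEL_CLUSTER_SIZE; i++) {
		if (map[i] == MODEL_SWAP_HAS_CACHE)
			free_entries++;
	}

	if (free_entries == 0)
		return CLEAR_CACHE_FLAG_ONLY;
	if (free_entries == MODEL_CLUSTER_SIZE)
		return FREE_WHOLE_CLUSTER;
	return FREE_SLOTS_ONE_BY_ONE;
}
```

Counting first and acting afterwards keeps the common cases cheap: the
whole cluster can be released in one step, or only the flag is cleared,
and the per-slot path is taken only for a mixed cluster.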

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
---
 mm/swapfile.c | 32 +++++++++++++++++++++++++-------
 1 file changed, 25 insertions(+), 7 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 8a6cdf9e55f9..4cd02dec6894 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1167,22 +1167,40 @@ static void swapcache_free_cluster(swp_entry_t entry)
 	struct swap_cluster_info *ci;
 	struct swap_info_struct *si;
 	unsigned char *map;
-	unsigned int i;
+	unsigned int i, free_entries = 0;
+	unsigned char val;
 
-	si = swap_info_get(entry);
+	si = _swap_info_get(entry);
 	if (!si)
 		return;
 
 	ci = lock_cluster(si, offset);
 	map = si->swap_map + offset;
 	for (i = 0; i < SWAPFILE_CLUSTER; i++) {
-		VM_BUG_ON(map[i] != SWAP_HAS_CACHE);
-		map[i] = 0;
+		val = map[i];
+		VM_BUG_ON(!(val & SWAP_HAS_CACHE));
+		if (val == SWAP_HAS_CACHE)
+			free_entries++;
+	}
+	if (!free_entries) {
+		for (i = 0; i < SWAPFILE_CLUSTER; i++)
+			map[i] &= ~SWAP_HAS_CACHE;
 	}
 	unlock_cluster(ci);
-	mem_cgroup_uncharge_swap(entry, SWAPFILE_CLUSTER);
-	swap_free_cluster(si, idx);
-	spin_unlock(&si->lock);
+	if (free_entries == SWAPFILE_CLUSTER) {
+		spin_lock(&si->lock);
+		ci = lock_cluster(si, offset);
+		memset(map, 0, SWAPFILE_CLUSTER);
+		unlock_cluster(ci);
+		mem_cgroup_uncharge_swap(entry, SWAPFILE_CLUSTER);
+		swap_free_cluster(si, idx);
+		spin_unlock(&si->lock);
+	} else if (free_entries) {
+		for (i = 0; i < SWAPFILE_CLUSTER; i++, entry.val++) {
+			if (!__swap_entry_free(si, entry, SWAP_HAS_CACHE))
+				free_swap_slot(entry);
+		}
+	}
 }
 #else
 static inline void swapcache_free_cluster(swp_entry_t entry)
-- 
2.11.0

* [PATCH -mm 02/13] mm, THP, swap: Support to reclaim swap space for THP swapped out
From: Huang, Ying @ 2017-05-25  6:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Huang Ying, Johannes Weiner, Minchan Kim,
	Hugh Dickins, Shaohua Li, Rik van Riel
In-Reply-To: <20170525064635.2832-1-ying.huang@intel.com>

From: Huang Ying <ying.huang@intel.com>

Normally, a swap slot can be reclaimed when its swap count reaches
SWAP_HAS_CACHE.  But for swap slots backing a THP, all swap slots
backing that THP must be reclaimed together, because any of the swap
slots may be used again when the THP is swapped out again later.
So the swap slots backing one THP can be reclaimed together once the
swap count of all of them has reached SWAP_HAS_CACHE.  In this patch,
the functions to check whether the swap count of all swap slots
backing one THP has reached SWAP_HAS_CACHE are implemented and used
when checking whether a swap slot can be reclaimed.

To make it easier to determine whether a swap slot is backing a THP, a
new swap cluster flag named CLUSTER_FLAG_HUGE is added to mark a swap
cluster which is backing a THP (Transparent Huge Page).  Because
swapping in a THP as a whole isn't supported yet, the
CLUSTER_FLAG_HUGE flag is cleared after the THP is deleted from the
swap cache (for example, when swapping out has finished), so that the
normal pages inside the THP can be swapped in individually.
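
The "all slots together" rule can be sketched in user space as follows.
This is only a model of the check, not the kernel implementation: the
function name and the MODEL_* constants are hypothetical, the cluster
size of 8 is an assumed value, and locking is left out entirely.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed values for illustration only. */
#define MODEL_SWAP_HAS_CACHE	0x40	/* slot is referenced by the swap cache */
#define MODEL_CLUSTER_SIZE	8	/* slots per cluster in this model */

/*
 * Return true if the slot must be treated as still in use.
 *
 * For a normal cluster only the slot itself is examined.  For a cluster
 * marked huge, the slot counts as in use while ANY slot of that cluster
 * has a user other than the swap cache.
 */
static bool model_slot_swapped(const unsigned char *map, size_t slot,
			       bool cluster_is_huge)
{
	/* Round the slot index down to the start of its cluster. */
	size_t base = slot - (slot % MODEL_CLUSTER_SIZE);
	size_t i;

	if (!cluster_is_huge)
		return map[slot] != MODEL_SWAP_HAS_CACHE;

	for (i = 0; i < MODEL_CLUSTER_SIZE; i++) {
		if (map[base + i] != MODEL_SWAP_HAS_CACHE)
			return true;
	}
	return false;
}
```

In this model a slot in a huge cluster is reported as in use as soon as
one sibling slot has an extra reference, even if the slot itself is
held only by the swap cache.  Once the huge marking is gone, the same
slot is judged on its own value.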

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
---
 include/linux/swap.h |  1 +
 mm/swapfile.c        | 78 +++++++++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 72 insertions(+), 7 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 5ab1c98c7d27..c563c45b30b4 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -188,6 +188,7 @@ struct swap_cluster_info {
 };
 #define CLUSTER_FLAG_FREE 1 /* This cluster is free */
 #define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */
+#define CLUSTER_FLAG_HUGE 4 /* This cluster is backing a transparent huge page */
 
 /*
  * We assign a cluster to each CPU, so each CPU can allocate swap entry from
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 4cd02dec6894..675afc235de1 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -264,6 +264,16 @@ static inline void cluster_set_null(struct swap_cluster_info *info)
 	info->data = 0;
 }
 
+static inline bool cluster_is_huge(struct swap_cluster_info *info)
+{
+	return info->flags & CLUSTER_FLAG_HUGE;
+}
+
+static inline void cluster_clear_huge(struct swap_cluster_info *info)
+{
+	info->flags &= ~CLUSTER_FLAG_HUGE;
+}
+
 static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
 						     unsigned long offset)
 {
@@ -845,7 +855,7 @@ static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot)
 	offset = idx * SWAPFILE_CLUSTER;
 	ci = lock_cluster(si, offset);
 	alloc_cluster(si, idx);
-	cluster_set_count_flag(ci, SWAPFILE_CLUSTER, 0);
+	cluster_set_count_flag(ci, SWAPFILE_CLUSTER, CLUSTER_FLAG_HUGE);
 
 	map = si->swap_map + offset;
 	for (i = 0; i < SWAPFILE_CLUSTER; i++)
@@ -1175,6 +1185,7 @@ static void swapcache_free_cluster(swp_entry_t entry)
 		return;
 
 	ci = lock_cluster(si, offset);
+	VM_BUG_ON(!cluster_is_huge(ci));
 	map = si->swap_map + offset;
 	for (i = 0; i < SWAPFILE_CLUSTER; i++) {
 		val = map[i];
@@ -1186,6 +1197,7 @@ static void swapcache_free_cluster(swp_entry_t entry)
 		for (i = 0; i < SWAPFILE_CLUSTER; i++)
 			map[i] &= ~SWAP_HAS_CACHE;
 	}
+	cluster_clear_huge(ci);
 	unlock_cluster(ci);
 	if (free_entries == SWAPFILE_CLUSTER) {
 		spin_lock(&si->lock);
@@ -1334,6 +1346,54 @@ int swp_swapcount(swp_entry_t entry)
 	return count;
 }
 
+#ifdef CONFIG_THP_SWAP
+static bool swap_page_trans_huge_swapped(struct swap_info_struct *si,
+					 swp_entry_t entry)
+{
+	struct swap_cluster_info *ci;
+	unsigned char *map = si->swap_map;
+	unsigned long roffset = swp_offset(entry);
+	unsigned long offset = round_down(roffset, SWAPFILE_CLUSTER);
+	int i;
+	bool ret = false;
+
+	ci = lock_cluster_or_swap_info(si, offset);
+	if (!cluster_is_huge(ci)) {
+		if (map[roffset] != SWAP_HAS_CACHE)
+			ret = true;
+		goto unlock_out;
+	}
+	for (i = 0; i < SWAPFILE_CLUSTER; i++) {
+		if (map[offset + i] != SWAP_HAS_CACHE) {
+			ret = true;
+			break;
+		}
+	}
+unlock_out:
+	unlock_cluster_or_swap_info(si, ci);
+	return ret;
+}
+
+static bool page_swapped(struct page *page)
+{
+	swp_entry_t entry;
+	struct swap_info_struct *si;
+
+	if (likely(!PageTransCompound(page)))
+		return page_swapcount(page) != 0;
+
+	page = compound_head(page);
+	entry.val = page_private(page);
+	si = _swap_info_get(entry);
+	if (si)
+		return swap_page_trans_huge_swapped(si, entry);
+	return false;
+}
+#else
+#define swap_page_trans_huge_swapped(si, entry)	swap_swapcount(si, entry)
+#define page_swapped(page)			(page_swapcount(page) != 0)
+#endif
+
 /*
  * We can write to an anon page without COW if there are no other references
  * to it.  And as a side-effect, free up its swap: because the old content
@@ -1388,7 +1448,7 @@ int try_to_free_swap(struct page *page)
 		return 0;
 	if (PageWriteback(page))
 		return 0;
-	if (page_swapcount(page))
+	if (page_swapped(page))
 		return 0;
 
 	/*
@@ -1409,6 +1469,7 @@ int try_to_free_swap(struct page *page)
 	if (pm_suspended_storage())
 		return 0;
 
+	page = compound_head(page);
 	delete_from_swap_cache(page);
 	SetPageDirty(page);
 	return 1;
@@ -1430,7 +1491,8 @@ int free_swap_and_cache(swp_entry_t entry)
 	p = _swap_info_get(entry);
 	if (p) {
 		count = __swap_entry_free(p, entry, 1);
-		if (count == SWAP_HAS_CACHE) {
+		if (count == SWAP_HAS_CACHE &&
+		    !swap_page_trans_huge_swapped(p, entry)) {
 			page = find_get_page(swap_address_space(entry),
 					     swp_offset(entry));
 			if (page && !trylock_page(page)) {
@@ -1447,7 +1509,8 @@ int free_swap_and_cache(swp_entry_t entry)
 		 */
 		if (PageSwapCache(page) && !PageWriteback(page) &&
 		    (!page_mapped(page) || mem_cgroup_swap_full(page)) &&
-		    !swap_swapcount(p, entry)) {
+		    !swap_page_trans_huge_swapped(p, entry)) {
+			page = compound_head(page);
 			delete_from_swap_cache(page);
 			SetPageDirty(page);
 		}
@@ -2001,7 +2064,7 @@ int try_to_unuse(unsigned int type, bool frontswap,
 				.sync_mode = WB_SYNC_NONE,
 			};
 
-			swap_writepage(page, &wbc);
+			swap_writepage(compound_head(page), &wbc);
 			lock_page(page);
 			wait_on_page_writeback(page);
 		}
@@ -2014,8 +2077,9 @@ int try_to_unuse(unsigned int type, bool frontswap,
 		 * delete, since it may not have been written out to swap yet.
 		 */
 		if (PageSwapCache(page) &&
-		    likely(page_private(page) == entry.val))
-			delete_from_swap_cache(page);
+		    likely(page_private(page) == entry.val) &&
+		    !page_swapped(page))
+			delete_from_swap_cache(compound_head(page));
 
 		/*
 		 * So we could skip searching mms once swap count went
-- 
2.11.0
