* isolate_freepages_block and excessive CPU usage by OSD process
@ 2014-11-15 11:48 Andrey Korolyov
2014-11-15 16:32 ` Vlastimil Babka
0 siblings, 1 reply; 36+ messages in thread
From: Andrey Korolyov @ 2014-11-15 11:48 UTC (permalink / raw)
To: ceph-users@lists.ceph.com; +Cc: riel, Mark Nelson, linux-mm
[-- Attachment #1: Type: text/plain, Size: 2957 bytes --]
Hello,
I recently found that the OSD daemons under certain conditions
(moderate VM pressure, moderate I/O, slightly altered VM settings) can
go into a loop involving isolate_freepages and effectively hurt Ceph
cluster performance. I found this thread
https://lkml.org/lkml/2012/6/27/545, but it looks like the
significant decrease of bdi max_ratio did not help at all.
Although I have approximately half of physical memory available for
cache-like usage, the mm problem persists, so I would like to try
suggestions from other people. In the current testing iteration I
decreased vfs_cache_pressure to 10 and raised vm_dirty_ratio and the
background ratio to 15 and 10 respectively (because the default values
are too spiky for my workloads). The host kernel is a linux-stable
3.10.
Non-default VM settings are:
vm.swappiness = 5
vm.dirty_ratio=10
vm.dirty_background_ratio=5
bdi_max_ratio was 100%, right now 20%; at a glance the situation has
worsened, because an unstable OSD host causes a domino-like effect on
other hosts, which start to flap too, and only a cache flush via
drop_caches helps.
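The settings above can be collected into one dry-run script. This is a
sketch only: it prints the commands instead of executing them (pipe the
output to a root shell to apply), and the bdi device "8:16" is a
hypothetical example; real hosts list their devices under /sys/class/bdi/.

```shell
#!/bin/sh
# Dry-run sketch of the non-default VM settings described in the message.
# Prints the commands rather than running them; "8:16" is a hypothetical
# block-device major:minor under /sys/class/bdi/.
cmds='sysctl -w vm.swappiness=5
sysctl -w vm.dirty_ratio=10
sysctl -w vm.dirty_background_ratio=5
echo 20 > /sys/class/bdi/8:16/max_ratio'
printf '%s\n' "$cmds"
```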
Unfortunately there is no slab info from the "exhausted" state due to the
sporadic nature of this bug; I will try to catch it next time.
slabtop (normal state):
Active / Total Objects (% used) : 8675843 / 8965833 (96.8%)
Active / Total Slabs (% used) : 224858 / 224858 (100.0%)
Active / Total Caches (% used) : 86 / 132 (65.2%)
Active / Total Size (% used) : 1152171.37K / 1253116.37K (91.9%)
Minimum / Average / Maximum Object : 0.01K / 0.14K / 15.75K
OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
6890130 6889185 99% 0.10K 176670 39 706680K buffer_head
751232 721707 96% 0.06K 11738 64 46952K kmalloc-64
251636 226228 89% 0.55K 8987 28 143792K radix_tree_node
121696 45710 37% 0.25K 3803 32 30424K kmalloc-256
113022 80618 71% 0.19K 2691 42 21528K dentry
112672 35160 31% 0.50K 3521 32 56336K kmalloc-512
73136 72800 99% 0.07K 1306 56 5224K Acpi-ParseExt
61696 58644 95% 0.02K 241 256 964K kmalloc-16
54348 36649 67% 0.38K 1294 42 20704K ip6_dst_cache
53136 51787 97% 0.11K 1476 36 5904K sysfs_dir_cache
51200 50724 99% 0.03K 400 128 1600K kmalloc-32
49120 46105 93% 1.00K 1535 32 49120K xfs_inode
30702 30702 100% 0.04K 301 102 1204K Acpi-Namespace
28224 25742 91% 0.12K 882 32 3528K kmalloc-128
28028 22691 80% 0.18K 637 44 5096K vm_area_struct
28008 28008 100% 0.22K 778 36 6224K xfs_ili
18944 18944 100% 0.01K 37 512 148K kmalloc-8
16576 15154 91% 0.06K 259 64 1036K anon_vma
16475 14200 86% 0.16K 659 25 2636K sigqueue
zoneinfo (normal state, attached)
[-- Attachment #2: zoneinfo --]
[-- Type: application/octet-stream, Size: 15098 bytes --]
Node 0, zone DMA
pages free 3973
min 5
low 6
high 7
scanned 0
spanned 4095
present 3994
managed 3973
nr_free_pages 3973
nr_inactive_anon 0
nr_active_anon 0
nr_inactive_file 0
nr_active_file 0
nr_unevictable 0
nr_mlock 0
nr_anon_pages 0
nr_mapped 0
nr_file_pages 0
nr_dirty 0
nr_writeback 0
nr_slab_reclaimable 0
nr_slab_unreclaimable 0
nr_page_table_pages 0
nr_kernel_stack 0
nr_unstable 0
nr_bounce 0
nr_vmscan_write 0
nr_vmscan_immediate_reclaim 0
nr_writeback_temp 0
nr_isolated_anon 0
nr_isolated_file 0
nr_shmem 0
nr_dirtied 0
nr_written 0
numa_hit 0
numa_miss 0
numa_foreign 0
numa_interleave 0
numa_local 0
numa_other 0
nr_anon_transparent_hugepages 0
nr_free_cma 0
protection: (0, 1914, 32121, 32121)
pagesets
cpu: 0
count: 0
high: 0
batch: 1
vm stats threshold: 10
cpu: 1
count: 0
high: 0
batch: 1
vm stats threshold: 10
cpu: 2
count: 0
high: 0
batch: 1
vm stats threshold: 10
cpu: 3
count: 0
high: 0
batch: 1
vm stats threshold: 10
cpu: 4
count: 0
high: 0
batch: 1
vm stats threshold: 10
cpu: 5
count: 0
high: 0
batch: 1
vm stats threshold: 10
cpu: 6
count: 0
high: 0
batch: 1
vm stats threshold: 10
cpu: 7
count: 0
high: 0
batch: 1
vm stats threshold: 10
cpu: 8
count: 0
high: 0
batch: 1
vm stats threshold: 10
cpu: 9
count: 0
high: 0
batch: 1
vm stats threshold: 10
cpu: 10
count: 0
high: 0
batch: 1
vm stats threshold: 10
cpu: 11
count: 0
high: 0
batch: 1
vm stats threshold: 10
cpu: 12
count: 0
high: 0
batch: 1
vm stats threshold: 10
cpu: 13
count: 0
high: 0
batch: 1
vm stats threshold: 10
cpu: 14
count: 0
high: 0
batch: 1
vm stats threshold: 10
cpu: 15
count: 0
high: 0
batch: 1
vm stats threshold: 10
cpu: 16
count: 0
high: 0
batch: 1
vm stats threshold: 10
cpu: 17
count: 0
high: 0
batch: 1
vm stats threshold: 10
cpu: 18
count: 0
high: 0
batch: 1
vm stats threshold: 10
cpu: 19
count: 0
high: 0
batch: 1
vm stats threshold: 10
cpu: 20
count: 0
high: 0
batch: 1
vm stats threshold: 10
cpu: 21
count: 0
high: 0
batch: 1
vm stats threshold: 10
cpu: 22
count: 0
high: 0
batch: 1
vm stats threshold: 10
cpu: 23
count: 0
high: 0
batch: 1
vm stats threshold: 10
all_unreclaimable: 1
start_pfn: 1
inactive_ratio: 1
Node 0, zone DMA32
pages free 32223
min 669
low 836
high 1003
scanned 0
spanned 1044480
present 511926
managed 490239
nr_free_pages 32223
nr_inactive_anon 277
nr_active_anon 45533
nr_inactive_file 227698
nr_active_file 122112
nr_unevictable 4760
nr_mlock 4760
nr_anon_pages 49781
nr_mapped 133
nr_file_pages 350087
nr_dirty 160
nr_writeback 0
nr_slab_reclaimable 20418
nr_slab_unreclaimable 30228
nr_page_table_pages 190
nr_kernel_stack 436
nr_unstable 0
nr_bounce 0
nr_vmscan_write 2
nr_vmscan_immediate_reclaim 3499
nr_writeback_temp 0
nr_isolated_anon 0
nr_isolated_file 0
nr_shmem 277
nr_dirtied 609807631
nr_written 609734467
numa_hit 6979761185
numa_miss 3941324201
numa_foreign 0
numa_interleave 0
numa_local 6979751851
numa_other 3941333535
nr_anon_transparent_hugepages 1
nr_free_cma 0
protection: (0, 0, 30206, 30206)
pagesets
cpu: 0
count: 12
high: 186
batch: 31
vm stats threshold: 50
cpu: 1
count: 8
high: 186
batch: 31
vm stats threshold: 50
cpu: 2
count: 60
high: 186
batch: 31
vm stats threshold: 50
cpu: 3
count: 45
high: 186
batch: 31
vm stats threshold: 50
cpu: 4
count: 12
high: 186
batch: 31
vm stats threshold: 50
cpu: 5
count: 3
high: 186
batch: 31
vm stats threshold: 50
cpu: 6
count: 49
high: 186
batch: 31
vm stats threshold: 50
cpu: 7
count: 28
high: 186
batch: 31
vm stats threshold: 50
cpu: 8
count: 0
high: 186
batch: 31
vm stats threshold: 50
cpu: 9
count: 5
high: 186
batch: 31
vm stats threshold: 50
cpu: 10
count: 0
high: 186
batch: 31
vm stats threshold: 50
cpu: 11
count: 0
high: 186
batch: 31
vm stats threshold: 50
cpu: 12
count: 19
high: 186
batch: 31
vm stats threshold: 50
cpu: 13
count: 1
high: 186
batch: 31
vm stats threshold: 50
cpu: 14
count: 12
high: 186
batch: 31
vm stats threshold: 50
cpu: 15
count: 162
high: 186
batch: 31
vm stats threshold: 50
cpu: 16
count: 14
high: 186
batch: 31
vm stats threshold: 50
cpu: 17
count: 0
high: 186
batch: 31
vm stats threshold: 50
cpu: 18
count: 3
high: 186
batch: 31
vm stats threshold: 50
cpu: 19
count: 0
high: 186
batch: 31
vm stats threshold: 50
cpu: 20
count: 0
high: 186
batch: 31
vm stats threshold: 50
cpu: 21
count: 0
high: 186
batch: 31
vm stats threshold: 50
cpu: 22
count: 0
high: 186
batch: 31
vm stats threshold: 50
cpu: 23
count: 0
high: 186
batch: 31
vm stats threshold: 50
all_unreclaimable: 0
start_pfn: 4096
inactive_ratio: 3
Node 0, zone Normal
pages free 32960
min 10568
low 13210
high 15852
scanned 0
spanned 7864320
present 7864320
managed 7732828
nr_free_pages 32960
nr_inactive_anon 11191
nr_active_anon 3036913
nr_inactive_file 3223885
nr_active_file 1127966
nr_unevictable 4086
nr_mlock 4086
nr_anon_pages 2363745
nr_mapped 34191
nr_file_pages 4358872
nr_dirty 2926
nr_writeback 0
nr_slab_reclaimable 82623
nr_slab_unreclaimable 24026
nr_page_table_pages 12611
nr_kernel_stack 1842
nr_unstable 0
nr_bounce 0
nr_vmscan_write 59
nr_vmscan_immediate_reclaim 29602
nr_writeback_temp 0
nr_isolated_anon 0
nr_isolated_file 0
nr_shmem 6348
nr_dirtied 8347305401
nr_written 8343222456
numa_hit 49594613817
numa_miss 635457096
numa_foreign 391251876
numa_interleave 20063
numa_local 49594490600
numa_other 635580313
nr_anon_transparent_hugepages 1331
nr_free_cma 0
protection: (0, 0, 0, 0)
pagesets
cpu: 0
count: 58
high: 186
batch: 31
vm stats threshold: 90
cpu: 1
count: 161
high: 186
batch: 31
vm stats threshold: 90
cpu: 2
count: 159
high: 186
batch: 31
vm stats threshold: 90
cpu: 3
count: 170
high: 186
batch: 31
vm stats threshold: 90
cpu: 4
count: 159
high: 186
batch: 31
vm stats threshold: 90
cpu: 5
count: 78
high: 186
batch: 31
vm stats threshold: 90
cpu: 6
count: 64
high: 186
batch: 31
vm stats threshold: 90
cpu: 7
count: 151
high: 186
batch: 31
vm stats threshold: 90
cpu: 8
count: 182
high: 186
batch: 31
vm stats threshold: 90
cpu: 9
count: 173
high: 186
batch: 31
vm stats threshold: 90
cpu: 10
count: 164
high: 186
batch: 31
vm stats threshold: 90
cpu: 11
count: 165
high: 186
batch: 31
vm stats threshold: 90
cpu: 12
count: 176
high: 186
batch: 31
vm stats threshold: 90
cpu: 13
count: 156
high: 186
batch: 31
vm stats threshold: 90
cpu: 14
count: 157
high: 186
batch: 31
vm stats threshold: 90
cpu: 15
count: 135
high: 186
batch: 31
vm stats threshold: 90
cpu: 16
count: 158
high: 186
batch: 31
vm stats threshold: 90
cpu: 17
count: 172
high: 186
batch: 31
vm stats threshold: 90
cpu: 18
count: 167
high: 186
batch: 31
vm stats threshold: 90
cpu: 19
count: 171
high: 186
batch: 31
vm stats threshold: 90
cpu: 20
count: 169
high: 186
batch: 31
vm stats threshold: 90
cpu: 21
count: 157
high: 186
batch: 31
vm stats threshold: 90
cpu: 22
count: 177
high: 186
batch: 31
vm stats threshold: 90
cpu: 23
count: 161
high: 186
batch: 31
vm stats threshold: 90
all_unreclaimable: 0
start_pfn: 1048576
inactive_ratio: 17
Node 1, zone Normal
pages free 14880
min 11284
low 14105
high 16926
scanned 0
spanned 8388608
present 8388608
managed 8257056
nr_free_pages 14880
nr_inactive_anon 13140
nr_active_anon 2569269
nr_inactive_file 3715797
nr_active_file 1659970
nr_unevictable 15464
nr_mlock 15464
nr_anon_pages 1310698
nr_mapped 45301
nr_file_pages 5387102
nr_dirty 3551
nr_writeback 0
nr_slab_reclaimable 135572
nr_slab_unreclaimable 24093
nr_page_table_pages 6677
nr_kernel_stack 775
nr_unstable 0
nr_bounce 0
nr_vmscan_write 0
nr_vmscan_immediate_reclaim 57854
nr_writeback_temp 0
nr_isolated_anon 0
nr_isolated_file 0
nr_shmem 10317
nr_dirtied 13325911763
nr_written 13320630581
numa_hit 43510008565
numa_miss 391251876
numa_foreign 4576781297
numa_interleave 19867
numa_local 43509973410
numa_other 391287031
nr_anon_transparent_hugepages 2492
nr_free_cma 0
protection: (0, 0, 0, 0)
pagesets
cpu: 0
count: 155
high: 186
batch: 31
vm stats threshold: 90
cpu: 1
count: 173
high: 186
batch: 31
vm stats threshold: 90
cpu: 2
count: 104
high: 186
batch: 31
vm stats threshold: 90
cpu: 3
count: 168
high: 186
batch: 31
vm stats threshold: 90
cpu: 4
count: 158
high: 186
batch: 31
vm stats threshold: 90
cpu: 5
count: 169
high: 186
batch: 31
vm stats threshold: 90
cpu: 6
count: 53
high: 186
batch: 31
vm stats threshold: 90
cpu: 7
count: 81
high: 186
batch: 31
vm stats threshold: 90
cpu: 8
count: 63
high: 186
batch: 31
vm stats threshold: 90
cpu: 9
count: 168
high: 186
batch: 31
vm stats threshold: 90
cpu: 10
count: 46
high: 186
batch: 31
vm stats threshold: 90
cpu: 11
count: 28
high: 186
batch: 31
vm stats threshold: 90
cpu: 12
count: 161
high: 186
batch: 31
vm stats threshold: 90
cpu: 13
count: 177
high: 186
batch: 31
vm stats threshold: 90
cpu: 14
count: 155
high: 186
batch: 31
vm stats threshold: 90
cpu: 15
count: 181
high: 186
batch: 31
vm stats threshold: 90
cpu: 16
count: 164
high: 186
batch: 31
vm stats threshold: 90
cpu: 17
count: 185
high: 186
batch: 31
vm stats threshold: 90
cpu: 18
count: 69
high: 186
batch: 31
vm stats threshold: 90
cpu: 19
count: 75
high: 186
batch: 31
vm stats threshold: 90
cpu: 20
count: 151
high: 186
batch: 31
vm stats threshold: 90
cpu: 21
count: 91
high: 186
batch: 31
vm stats threshold: 90
cpu: 22
count: 51
high: 186
batch: 31
vm stats threshold: 90
cpu: 23
count: 56
high: 186
batch: 31
vm stats threshold: 90
all_unreclaimable: 0
start_pfn: 8912896
inactive_ratio: 17
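A dump like the attachment above can be summarized per zone by comparing
"pages free" against the min/low/high watermarks. The following is a small
sketch (not from the original thread) that works on either a saved dump or
the live /proc/zoneinfo:

```shell
#!/bin/sh
# Summarize "pages free" vs. the min/low/high watermarks for each zone in a
# zoneinfo dump (pass a file argument, or nothing for the live /proc/zoneinfo).
zone_watermarks() {
  awk '
    /^ *Node/       { node=$2; sub(/,/, "", node); zone=$4 }
    /^ *pages free/ { free=$3 }
    /^ *min /       { min=$2 }
    /^ *low /       { low=$2 }
    /^ *high /      { high=$2; printf "node %s zone %-8s free=%-8s min=%-6s low=%-6s high=%-6s%s\n", node, zone, free, min, low, high, (free+0 < low+0 ? " <-- below low watermark" : "") }
  ' "${1:-/proc/zoneinfo}"
}
if [ -r "${1:-/proc/zoneinfo}" ]; then
  zone_watermarks "$@"
fi
```

(The per-cpu "high:" pageset lines are skipped because they carry a colon, so
only the watermark lines match.)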
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: isolate_freepages_block and excessive CPU usage by OSD process
2014-11-15 11:48 Andrey Korolyov
@ 2014-11-15 16:32 ` Vlastimil Babka
2014-11-15 17:10 ` Andrey Korolyov
0 siblings, 1 reply; 36+ messages in thread
From: Vlastimil Babka @ 2014-11-15 16:32 UTC (permalink / raw)
To: Andrey Korolyov, ceph-users@lists.ceph.com
Cc: riel, Mark Nelson, linux-mm, David Rientjes, Joonsoo Kim
On 11/15/2014 12:48 PM, Andrey Korolyov wrote:
> Hello,
>
> I had found recently that the OSD daemons under certain conditions
> (moderate vm pressure, moderate I/O, slightly altered vm settings) can
> go into loop involving isolate_freepages and effectively hit Ceph
> cluster performance. I found this thread
Do you feel it is a regression, compared to some older kernel version or something?
> https://lkml.org/lkml/2012/6/27/545, but looks like that the
> significant decrease of bdi max_ratio did not helped even for a bit.
> Although I have approximately a half of physical memory for cache-like
> stuff, the problem with mm persists, so I would like to try
> suggestions from the other people. In current testing iteration I had
> decreased vfs_cache_pressure to 10 and raised vm_dirty_ratio and
> background ratio to 15 and 10 correspondingly (because default values
> are too spiky for mine workloads). The host kernel is a linux-stable
> 3.10.
Well I'm glad to hear it's not 3.18-rc3 this time. But I would recommend trying
it, or at least 3.17. A lot of patches to reduce compaction overhead
(especially for transparent hugepages) went in since 3.10.
> Non-default VM settings are:
> vm.swappiness = 5
> vm.dirty_ratio=10
> vm.dirty_background_ratio=5
> bdi_max_ratio was 100%, right now 20%, at a glance it looks like the
> situation worsened, because unstable OSD host cause domino-like effect
> on other hosts, which are starting to flap too and only cache flush
> via drop_caches is helping.
>
> Unfortunately there are no slab info from "exhausted" state due to
> sporadic nature of this bug, will try to catch next time.
>
> slabtop (normal state):
> Active / Total Objects (% used) : 8675843 / 8965833 (96.8%)
> Active / Total Slabs (% used) : 224858 / 224858 (100.0%)
> Active / Total Caches (% used) : 86 / 132 (65.2%)
> Active / Total Size (% used) : 1152171.37K / 1253116.37K (91.9%)
> Minimum / Average / Maximum Object : 0.01K / 0.14K / 15.75K
>
> OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
> 6890130 6889185 99% 0.10K 176670 39 706680K buffer_head
> 751232 721707 96% 0.06K 11738 64 46952K kmalloc-64
> 251636 226228 89% 0.55K 8987 28 143792K radix_tree_node
> 121696 45710 37% 0.25K 3803 32 30424K kmalloc-256
> 113022 80618 71% 0.19K 2691 42 21528K dentry
> 112672 35160 31% 0.50K 3521 32 56336K kmalloc-512
> 73136 72800 99% 0.07K 1306 56 5224K Acpi-ParseExt
> 61696 58644 95% 0.02K 241 256 964K kmalloc-16
> 54348 36649 67% 0.38K 1294 42 20704K ip6_dst_cache
> 53136 51787 97% 0.11K 1476 36 5904K sysfs_dir_cache
> 51200 50724 99% 0.03K 400 128 1600K kmalloc-32
> 49120 46105 93% 1.00K 1535 32 49120K xfs_inode
> 30702 30702 100% 0.04K 301 102 1204K Acpi-Namespace
> 28224 25742 91% 0.12K 882 32 3528K kmalloc-128
> 28028 22691 80% 0.18K 637 44 5096K vm_area_struct
> 28008 28008 100% 0.22K 778 36 6224K xfs_ili
> 18944 18944 100% 0.01K 37 512 148K kmalloc-8
> 16576 15154 91% 0.06K 259 64 1036K anon_vma
> 16475 14200 86% 0.16K 659 25 2636K sigqueue
>
> zoneinfo (normal state, attached)
>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: isolate_freepages_block and excessive CPU usage by OSD process
2014-11-15 16:32 ` Vlastimil Babka
@ 2014-11-15 17:10 ` Andrey Korolyov
2014-11-15 18:45 ` Vlastimil Babka
0 siblings, 1 reply; 36+ messages in thread
From: Andrey Korolyov @ 2014-11-15 17:10 UTC (permalink / raw)
To: Vlastimil Babka
Cc: ceph-users@lists.ceph.com, riel, Mark Nelson, linux-mm,
David Rientjes, Joonsoo Kim
On Sat, Nov 15, 2014 at 7:32 PM, Vlastimil Babka <vbabka@suse.cz> wrote:
> On 11/15/2014 12:48 PM, Andrey Korolyov wrote:
>> Hello,
>>
>> I had found recently that the OSD daemons under certain conditions
>> (moderate vm pressure, moderate I/O, slightly altered vm settings) can
>> go into loop involving isolate_freepages and effectively hit Ceph
>> cluster performance. I found this thread
>
> Do you feel it is a regression, compared to some older kernel version or something?
No, it's just rare but very concerning. The higher the pressure is, the
greater the chance of hitting this particular issue, although the
absolute numbers are still very large (e.g. room for cache memory). Some
googling also turned up a similar question on serverfault:
http://serverfault.com/questions/642883/cause-of-page-fragmentation-on-large-server-with-xfs-20-disks-and-ceph
but unfortunately there is no perf info there, so I cannot say whether
the issue is the same or not.
>
>> https://lkml.org/lkml/2012/6/27/545, but looks like that the
>> significant decrease of bdi max_ratio did not helped even for a bit.
>> Although I have approximately a half of physical memory for cache-like
>> stuff, the problem with mm persists, so I would like to try
>> suggestions from the other people. In current testing iteration I had
>> decreased vfs_cache_pressure to 10 and raised vm_dirty_ratio and
>> background ratio to 15 and 10 correspondingly (because default values
>> are too spiky for mine workloads). The host kernel is a linux-stable
>> 3.10.
>
> Well I'm glad to hear it's not 3.18-rc3 this time. But I would recommend trying
> it, or at least 3.17. Lot of patches went to reduce compaction overhead for
> (especially for transparent hugepages) since 3.10.
Heh, I would say I am limited to tuning knobs on 3.10, because it has a
well-known set of problems and any major version switch will lead to
months-long QA procedures, but I may try that if none of my knob
selections helps. I am not a THP user; the problem happens with regular
4k pages and almost-default VM settings. It is also worth mentioning
that the kernel messages are not complaining about allocation failures,
as in the case in the URL above: compaction just tightens up to some
limit and (after it has 'locked' the system for a couple of minutes,
reducing actual I/O and the derived amount of memory operations) goes
back to normal. A cache flush fixes this in a moment, as a large margin
for min_free_kbytes should as well. Over a couple of days, depending on
which nodes with which settings the issue reappears, I can judge whether
my ideas were wrong.
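To tell whether the stall reappears after tuning, one way (not from the
original thread, a suggested sketch) is to snapshot min_free_kbytes together
with the compaction counters in /proc/vmstat; a reoccurrence shows up as a
jump in compact_stall between snapshots. The exact set of compact_* counters
varies by kernel version and config.

```shell
#!/bin/sh
# Sketch: snapshot the knob being tuned and the compaction counters, so a
# reoccurring stall shows up as a jump in compact_stall between snapshots.
snapshot_compaction() {
  echo "min_free_kbytes: $(cat /proc/sys/vm/min_free_kbytes 2>/dev/null || echo 'n/a')"
  grep '^compact_' /proc/vmstat 2>/dev/null \
    || echo 'no compact_* counters (CONFIG_COMPACTION off?)'
}
snapshot_compaction
```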
>
>> Non-default VM settings are:
>> vm.swappiness = 5
>> vm.dirty_ratio=10
>> vm.dirty_background_ratio=5
>> bdi_max_ratio was 100%, right now 20%, at a glance it looks like the
>> situation worsened, because unstable OSD host cause domino-like effect
>> on other hosts, which are starting to flap too and only cache flush
>> via drop_caches is helping.
>>
>> Unfortunately there are no slab info from "exhausted" state due to
>> sporadic nature of this bug, will try to catch next time.
>>
>> slabtop (normal state):
>> Active / Total Objects (% used) : 8675843 / 8965833 (96.8%)
>> Active / Total Slabs (% used) : 224858 / 224858 (100.0%)
>> Active / Total Caches (% used) : 86 / 132 (65.2%)
>> Active / Total Size (% used) : 1152171.37K / 1253116.37K (91.9%)
>> Minimum / Average / Maximum Object : 0.01K / 0.14K / 15.75K
>>
>> OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
>> 6890130 6889185 99% 0.10K 176670 39 706680K buffer_head
>> 751232 721707 96% 0.06K 11738 64 46952K kmalloc-64
>> 251636 226228 89% 0.55K 8987 28 143792K radix_tree_node
>> 121696 45710 37% 0.25K 3803 32 30424K kmalloc-256
>> 113022 80618 71% 0.19K 2691 42 21528K dentry
>> 112672 35160 31% 0.50K 3521 32 56336K kmalloc-512
>> 73136 72800 99% 0.07K 1306 56 5224K Acpi-ParseExt
>> 61696 58644 95% 0.02K 241 256 964K kmalloc-16
>> 54348 36649 67% 0.38K 1294 42 20704K ip6_dst_cache
>> 53136 51787 97% 0.11K 1476 36 5904K sysfs_dir_cache
>> 51200 50724 99% 0.03K 400 128 1600K kmalloc-32
>> 49120 46105 93% 1.00K 1535 32 49120K xfs_inode
>> 30702 30702 100% 0.04K 301 102 1204K Acpi-Namespace
>> 28224 25742 91% 0.12K 882 32 3528K kmalloc-128
>> 28028 22691 80% 0.18K 637 44 5096K vm_area_struct
>> 28008 28008 100% 0.22K 778 36 6224K xfs_ili
>> 18944 18944 100% 0.01K 37 512 148K kmalloc-8
>> 16576 15154 91% 0.06K 259 64 1036K anon_vma
>> 16475 14200 86% 0.16K 659 25 2636K sigqueue
>>
>> zoneinfo (normal state, attached)
>>
>
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: isolate_freepages_block and excessive CPU usage by OSD process
2014-11-15 17:10 ` Andrey Korolyov
@ 2014-11-15 18:45 ` Vlastimil Babka
2014-11-15 18:52 ` Andrey Korolyov
0 siblings, 1 reply; 36+ messages in thread
From: Vlastimil Babka @ 2014-11-15 18:45 UTC (permalink / raw)
To: Andrey Korolyov
Cc: ceph-users@lists.ceph.com, riel, Mark Nelson, linux-mm,
David Rientjes, Joonsoo Kim, Johannes Weiner
On 11/15/2014 06:10 PM, Andrey Korolyov wrote:
> On Sat, Nov 15, 2014 at 7:32 PM, Vlastimil Babka <vbabka@suse.cz> wrote:
>> On 11/15/2014 12:48 PM, Andrey Korolyov wrote:
>>> Hello,
>>>
>>> I had found recently that the OSD daemons under certain conditions
>>> (moderate vm pressure, moderate I/O, slightly altered vm settings) can
>>> go into loop involving isolate_freepages and effectively hit Ceph
>>> cluster performance. I found this thread
>>
>> Do you feel it is a regression, compared to some older kernel version or something?
>
> No, it`s just a rare but very concerning stuff. The higher pressure
> is, the more chance to hit this particular issue, although absolute
> numbers are still very large (e.g. room for cache memory). Some
> googling also found simular question on sf:
> http://serverfault.com/questions/642883/cause-of-page-fragmentation-on-large-server-with-xfs-20-disks-and-ceph
> but there are no perf info unfortunately so I cannot say if the issue
> is the same or not.
Well it would be useful to find out what's doing the high-order allocations.
With 'perf record -g -a' and then 'perf report -g' you can determine the call
stacks. Order and allocation flags can be captured by enabling the page_alloc
tracepoint.
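As concrete commands (to be run as root on the affected host), the steps
above look roughly like the following; here the page_alloc tracepoint is
assumed to be the kmem:mm_page_alloc event under tracefs, and the commands
are printed rather than executed:

```shell
#!/bin/sh
# The suggested profiling steps as commands. Printed as a dry run; run them
# as root on the affected host. Assumes debugfs is mounted at
# /sys/kernel/debug and the tracepoint is kmem:mm_page_alloc.
profile_cmds='perf record -g -a -- sleep 30
perf report -g
echo 1 > /sys/kernel/debug/tracing/events/kmem/mm_page_alloc/enable
head -n 100 /sys/kernel/debug/tracing/trace_pipe'
printf '%s\n' "$profile_cmds"
```

The trace_pipe output includes the order= and gfp_flags= fields for each
allocation, which identifies who is asking for high-order pages.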
>>
>>> https://lkml.org/lkml/2012/6/27/545, but looks like that the
>>> significant decrease of bdi max_ratio did not helped even for a bit.
>>> Although I have approximately a half of physical memory for cache-like
>>> stuff, the problem with mm persists, so I would like to try
>>> suggestions from the other people. In current testing iteration I had
>>> decreased vfs_cache_pressure to 10 and raised vm_dirty_ratio and
>>> background ratio to 15 and 10 correspondingly (because default values
>>> are too spiky for mine workloads). The host kernel is a linux-stable
>>> 3.10.
>>
>> Well I'm glad to hear it's not 3.18-rc3 this time. But I would recommend trying
>> it, or at least 3.17. Lot of patches went to reduce compaction overhead for
>> (especially for transparent hugepages) since 3.10.
>
> Heh, I may say that I limited to pushing knobs in 3.10, because it has
> a well-known set of problems and any major version switch will lead to
> months-long QA procedures, but I may try that if none of mine knob
> selection will help. I am not THP user, the problem is happening with
> regular 4k pages and almost default VM settings. Also it worth to mean
OK, that's useful to know. So it might be some driver (do you also have
mellanox?) or maybe SLUB (do you have it enabled?) trying high-order
allocations.
> that kernel messages are not complaining about allocation failures, as
> in case in URL from above, compaction just tightens up to some limit
Since there are no warnings, we need tracing/profiling to find out what's
causing it.
> and (after it 'locked' system for a couple of minutes, reducing actual
> I/O and derived amount of memory operations) it goes back to normal.
> Cache flush fixing this just in a moment, so should large room for
That could perhaps suggest poor coordination between reclaim and compaction,
made worse by the fact that there are multiple parallel ongoing attempts and
the watermark checking doesn't take that into account.
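The race described above can be illustrated with a toy numeric example (all
numbers made up, and this is a deliberate simplification of the real kernel
logic): several parallel allocation attempts each pass the same watermark
check before any of them actually allocates, so together they push the zone
below the watermark.

```shell
#!/bin/sh
# Toy illustration: N parallel attempts all see free above the watermark and
# proceed, but their combined allocations drop the zone below it.
free=1000 watermark=900 request=60 threads=4
passed=0
i=1
while [ "$i" -le "$threads" ]; do
  # every attempt checks before anyone allocates, so all see free=1000
  if [ "$free" -ge "$watermark" ]; then
    echo "attempt $i: watermark check passed (free=$free)"
    passed=$((passed + 1))
  fi
  i=$((i + 1))
done
free=$((free - passed * request))
echo "after all $passed allocations: free=$free, watermark=$watermark"
```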
> min_free_kbytes. Over couple of days, depends on which nodes with
> certain settings issue will reappear, I may judge if my ideas was
> wrong.
>
>>
>>> Non-default VM settings are:
>>> vm.swappiness = 5
>>> vm.dirty_ratio=10
>>> vm.dirty_background_ratio=5
>>> bdi_max_ratio was 100%, right now 20%, at a glance it looks like the
>>> situation worsened, because unstable OSD host cause domino-like effect
>>> on other hosts, which are starting to flap too and only cache flush
>>> via drop_caches is helping.
>>>
>>> Unfortunately there are no slab info from "exhausted" state due to
>>> sporadic nature of this bug, will try to catch next time.
>>>
>>> slabtop (normal state):
>>> Active / Total Objects (% used) : 8675843 / 8965833 (96.8%)
>>> Active / Total Slabs (% used) : 224858 / 224858 (100.0%)
>>> Active / Total Caches (% used) : 86 / 132 (65.2%)
>>> Active / Total Size (% used) : 1152171.37K / 1253116.37K (91.9%)
>>> Minimum / Average / Maximum Object : 0.01K / 0.14K / 15.75K
>>>
>>> OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
>>> 6890130 6889185 99% 0.10K 176670 39 706680K buffer_head
>>> 751232 721707 96% 0.06K 11738 64 46952K kmalloc-64
>>> 251636 226228 89% 0.55K 8987 28 143792K radix_tree_node
>>> 121696 45710 37% 0.25K 3803 32 30424K kmalloc-256
>>> 113022 80618 71% 0.19K 2691 42 21528K dentry
>>> 112672 35160 31% 0.50K 3521 32 56336K kmalloc-512
>>> 73136 72800 99% 0.07K 1306 56 5224K Acpi-ParseExt
>>> 61696 58644 95% 0.02K 241 256 964K kmalloc-16
>>> 54348 36649 67% 0.38K 1294 42 20704K ip6_dst_cache
>>> 53136 51787 97% 0.11K 1476 36 5904K sysfs_dir_cache
>>> 51200 50724 99% 0.03K 400 128 1600K kmalloc-32
>>> 49120 46105 93% 1.00K 1535 32 49120K xfs_inode
>>> 30702 30702 100% 0.04K 301 102 1204K Acpi-Namespace
>>> 28224 25742 91% 0.12K 882 32 3528K kmalloc-128
>>> 28028 22691 80% 0.18K 637 44 5096K vm_area_struct
>>> 28008 28008 100% 0.22K 778 36 6224K xfs_ili
>>> 18944 18944 100% 0.01K 37 512 148K kmalloc-8
>>> 16576 15154 91% 0.06K 259 64 1036K anon_vma
>>> 16475 14200 86% 0.16K 659 25 2636K sigqueue
>>>
>>> zoneinfo (normal state, attached)
>>>
>>
>
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: isolate_freepages_block and excessive CPU usage by OSD process
2014-11-15 18:45 ` Vlastimil Babka
@ 2014-11-15 18:52 ` Andrey Korolyov
0 siblings, 0 replies; 36+ messages in thread
From: Andrey Korolyov @ 2014-11-15 18:52 UTC (permalink / raw)
To: Vlastimil Babka
Cc: ceph-users@lists.ceph.com, riel, Mark Nelson, linux-mm,
David Rientjes, Joonsoo Kim, Johannes Weiner
On Sat, Nov 15, 2014 at 9:45 PM, Vlastimil Babka <vbabka@suse.cz> wrote:
> On 11/15/2014 06:10 PM, Andrey Korolyov wrote:
>> On Sat, Nov 15, 2014 at 7:32 PM, Vlastimil Babka <vbabka@suse.cz> wrote:
>>> On 11/15/2014 12:48 PM, Andrey Korolyov wrote:
>>>> Hello,
>>>>
>>>> I had found recently that the OSD daemons under certain conditions
>>>> (moderate vm pressure, moderate I/O, slightly altered vm settings) can
>>>> go into loop involving isolate_freepages and effectively hit Ceph
>>>> cluster performance. I found this thread
>>>
>>> Do you feel it is a regression, compared to some older kernel version or something?
>>
>> No, it`s just a rare but very concerning stuff. The higher pressure
>> is, the more chance to hit this particular issue, although absolute
>> numbers are still very large (e.g. room for cache memory). Some
>> googling also found simular question on sf:
>> http://serverfault.com/questions/642883/cause-of-page-fragmentation-on-large-server-with-xfs-20-disks-and-ceph
>> but there are no perf info unfortunately so I cannot say if the issue
>> is the same or not.
>
> Well it would be useful to find out what's doing the high-order allocations.
> With 'perf -g -a' and then 'perf report -g' determine the call stack. Order and
> allocation flags can be captured by enabling the page_alloc tracepoint.
Thanks, please give me some time to go through the testing iterations so
that I can collect appropriate perf data.
>
>>>
>>>> https://lkml.org/lkml/2012/6/27/545, but looks like that the
>>>> significant decrease of bdi max_ratio did not helped even for a bit.
>>>> Although I have approximately a half of physical memory for cache-like
>>>> stuff, the problem with mm persists, so I would like to try
>>>> suggestions from the other people. In current testing iteration I had
>>>> decreased vfs_cache_pressure to 10 and raised vm_dirty_ratio and
>>>> background ratio to 15 and 10 correspondingly (because default values
>>>> are too spiky for mine workloads). The host kernel is a linux-stable
>>>> 3.10.
>>>
>>> Well I'm glad to hear it's not 3.18-rc3 this time. But I would recommend trying
>>> it, or at least 3.17. Lot of patches went to reduce compaction overhead for
>>> (especially for transparent hugepages) since 3.10.
>>
>> Heh, I may say that I limited to pushing knobs in 3.10, because it has
>> a well-known set of problems and any major version switch will lead to
>> months-long QA procedures, but I may try that if none of mine knob
>> selection will help. I am not THP user, the problem is happening with
>> regular 4k pages and almost default VM settings. Also it worth to mean
>
> OK that's useful to know. So it might be some driver (do you also have
> mellanox?) or maybe SLUB (do you have it enabled?) is trying high-order allocations.
Yes, I am using the mellanox transport there and the SLUB allocator, as
SLAB had some issues with uneven node fill-up during allocations on the
two-head systems I primarily use.
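Whether SLUB is the source of high-order requests can be checked directly:
each cache under /sys/kernel/slab exposes its slab page order (the directory
only exists with CONFIG_SLUB). A small sketch, not from the original thread:

```shell
#!/bin/sh
# Sketch: list SLUB caches whose slabs use high-order (>0) page allocations.
# /sys/kernel/slab/ is only present when the kernel uses SLUB.
high_order_slabs() {
  for f in /sys/kernel/slab/*/order; do
    o=$(cat "$f" 2>/dev/null) || continue
    [ "$o" -gt 0 ] 2>/dev/null && echo "${f%/order}: order $o"
  done
}
high_order_slabs | sort -u
```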
>
>> that kernel messages are not complaining about allocation failures, as
>> in case in URL from above, compaction just tightens up to some limit
>
> Without the warnings, that's why we need tracing/profiling to find out what's
> causing it.
>
>> and (after it 'locked' system for a couple of minutes, reducing actual
>> I/O and derived amount of memory operations) it goes back to normal.
>> Cache flush fixing this just in a moment, so should large room for
>
> That could perhaps suggest a poor coordination between reclaim and compaction,
> made worse by the fact that there are more parallel ongoing attempts and the
> watermark checking doesn't take that into account.
>
>> min_free_kbytes. Over couple of days, depends on which nodes with
>> certain settings issue will reappear, I may judge if my ideas was
>> wrong.
>>
>>>
>>>> Non-default VM settings are:
>>>> vm.swappiness = 5
>>>> vm.dirty_ratio=10
>>>> vm.dirty_background_ratio=5
>>>> bdi_max_ratio was 100%, right now 20%, at a glance it looks like the
>>>> situation worsened, because unstable OSD host cause domino-like effect
>>>> on other hosts, which are starting to flap too and only cache flush
>>>> via drop_caches is helping.
>>>>
>>>> Unfortunately there are no slab info from "exhausted" state due to
>>>> sporadic nature of this bug, will try to catch next time.
>>>>
>>>> slabtop (normal state):
>>>> Active / Total Objects (% used) : 8675843 / 8965833 (96.8%)
>>>> Active / Total Slabs (% used) : 224858 / 224858 (100.0%)
>>>> Active / Total Caches (% used) : 86 / 132 (65.2%)
>>>> Active / Total Size (% used) : 1152171.37K / 1253116.37K (91.9%)
>>>> Minimum / Average / Maximum Object : 0.01K / 0.14K / 15.75K
>>>>
>>>> OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
>>>> 6890130 6889185 99% 0.10K 176670 39 706680K buffer_head
>>>> 751232 721707 96% 0.06K 11738 64 46952K kmalloc-64
>>>> 251636 226228 89% 0.55K 8987 28 143792K radix_tree_node
>>>> 121696 45710 37% 0.25K 3803 32 30424K kmalloc-256
>>>> 113022 80618 71% 0.19K 2691 42 21528K dentry
>>>> 112672 35160 31% 0.50K 3521 32 56336K kmalloc-512
>>>> 73136 72800 99% 0.07K 1306 56 5224K Acpi-ParseExt
>>>> 61696 58644 95% 0.02K 241 256 964K kmalloc-16
>>>> 54348 36649 67% 0.38K 1294 42 20704K ip6_dst_cache
>>>> 53136 51787 97% 0.11K 1476 36 5904K sysfs_dir_cache
>>>> 51200 50724 99% 0.03K 400 128 1600K kmalloc-32
>>>> 49120 46105 93% 1.00K 1535 32 49120K xfs_inode
>>>> 30702 30702 100% 0.04K 301 102 1204K Acpi-Namespace
>>>> 28224 25742 91% 0.12K 882 32 3528K kmalloc-128
>>>> 28028 22691 80% 0.18K 637 44 5096K vm_area_struct
>>>> 28008 28008 100% 0.22K 778 36 6224K xfs_ili
>>>> 18944 18944 100% 0.01K 37 512 148K kmalloc-8
>>>> 16576 15154 91% 0.06K 259 64 1036K anon_vma
>>>> 16475 14200 86% 0.16K 659 25 2636K sigqueue
>>>>
>>>> zoneinfo (normal state, attached)
>>>>
>>>
>>
>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: isolate_freepages_block and excessive CPU usage by OSD process
[not found] <CABYiri-do2YdfBx=r+u1kwXkEwN4v+yeRSHB-ODXo4gMFgW-Fg.mail.gmail.com>
@ 2014-11-19 1:21 ` Christian Marie
2014-11-19 18:03 ` Andrey Korolyov
0 siblings, 1 reply; 36+ messages in thread
From: Christian Marie @ 2014-11-19 1:21 UTC (permalink / raw)
To: linux-mm
[-- Attachment #1: Type: text/plain, Size: 1802 bytes --]
> Hello,
>
> I had found recently that the OSD daemons under certain conditions
> (moderate vm pressure, moderate I/O, slightly altered vm settings) can
> go into loop involving isolate_freepages and effectively hit Ceph
> cluster performance.
Hi! I'm the creator of the server fault issue you reference:
http://serverfault.com/questions/642883/cause-of-page-fragmentation-on-large-server-with-xfs-20-disks-and-ceph
I'd very much like to get to the bottom of this. I'm seeing a very similar
pattern on 3.10.0-123.9.3.el7.x86_64; if this is fixed in later versions,
perhaps we could backport something.
Here is some perf output:
http://ponies.io/raw/compaction.png
Looks pretty similar. I also have hundreds of MB of logs and traces, should we
need some specific question answered.
I've managed to reproduce many failed compactions with this:
https://gist.github.com/christian-marie/cde7e80c5edb889da541
I took some compaction stress test code and bolted on a little loop to mmap a
large sparse file and read every PAGE_SIZEth byte.
Run it once and compactions seem to do okay; run it again and they're really
slow. This seems to be because my little trick to fill up cache memory only
works exactly half the time. Note that transparent huge pages are only used to
introduce fragmentation/pressure here; turning them off
doesn't make the slightest difference to the spinning-in-reclaim issue.
We are using Mellanox ipoib drivers, which do not do scatter-gather, so I'm
currently working on adding support for that (the hardware supports it). Are
you also using ipoib, or do you have something else doing high-order
allocations? It's a bit concerning for me if you don't, as it would suggest
that cutting down on those allocations won't help.
* Re: isolate_freepages_block and excessive CPU usage by OSD process
2014-11-19 1:21 ` isolate_freepages_block and excessive CPU usage by OSD process Christian Marie
@ 2014-11-19 18:03 ` Andrey Korolyov
2014-11-19 21:20 ` Christian Marie
0 siblings, 1 reply; 36+ messages in thread
From: Andrey Korolyov @ 2014-11-19 18:03 UTC (permalink / raw)
To: Christian Marie; +Cc: linux-mm
On Wed, Nov 19, 2014 at 4:21 AM, Christian Marie <christian@ponies.io> wrote:
>> Hello,
>>
>> I had found recently that the OSD daemons under certain conditions
>> (moderate vm pressure, moderate I/O, slightly altered vm settings) can
>> go into loop involving isolate_freepages and effectively hit Ceph
>> cluster performance.
>
> Hi! I'm the creator of the server fault issue you reference:
>
> http://serverfault.com/questions/642883/cause-of-page-fragmentation-on-large-server-with-xfs-20-disks-and-ceph
>
> I'd like to get to the bottom of this very much, I'm seeing a very similar
> pattern on 3.10.0-123.9.3.el7.x86_64, if this is fixed in later versions
> perhaps we could backport something.
>
> Here is some perf output:
>
> http://ponies.io/raw/compaction.png
>
> Looks pretty similar. I also have hundreds of MB logs and traces should we need
> some specific question answered.
>
> I've managed to reproduce many failed compactions with this:
>
> https://gist.github.com/christian-marie/cde7e80c5edb889da541
>
> I took some compaction stress test code and bolted on a little loop to mmap a
> large sparse file and read every PAGE_SIZEth byte.
>
> Run it once, compactions seem to do okay, run it again and they're really slow.
> This seems to be because my little trick to fill up cache memory only seems to
> work exactly half the time. Note that transhuge pages are only used to
> introduce fragmentation/pressure here, turning transparent huge pages off
> doesn't seem to make the slightest difference to the spinning-in-reclaim issue.
>
> We are using Mellanox ipoib drivers which do not do scatter-gather, so I'm
> currently working on adding support for that (the hardware supports it). Are
> you also using ipoib or have something else doing high order allocations? It's
> a bit concerning for me if you don't as it would suggest that cutting down on
> those allocations won't help.
I am, too. On a test environment with regular ten-gig cards I was unable
to reproduce the issue. Honestly, I thought that almost every
contemporary driver for high-speed cards works with
scatter-gather, so I did not have mlx in mind as a potential cause of
this problem from the very beginning. There are a couple of reports on
the ceph lists complaining about OSD flapping/unresponsiveness without
clear reason under certain (not always well-defined) conditions, which
may have the same root cause. I wonder if a numad-like mechanism would
help there, but its usage is generally an anti-performance pattern in
my experience.
* Re: isolate_freepages_block and excessive CPU usage by OSD process
2014-11-19 18:03 ` Andrey Korolyov
@ 2014-11-19 21:20 ` Christian Marie
2014-11-19 23:10 ` Vlastimil Babka
0 siblings, 1 reply; 36+ messages in thread
From: Christian Marie @ 2014-11-19 21:20 UTC (permalink / raw)
To: linux-mm
[-- Attachment #1: Type: text/plain, Size: 2338 bytes --]
On Wed, Nov 19, 2014 at 10:03:44PM +0400, Andrey Korolyov wrote:
> > We are using Mellanox ipoib drivers which do not do scatter-gather, so I'm
> > currently working on adding support for that (the hardware supports it). Are
> > you also using ipoib or have something else doing high order allocations? It's
> > a bit concerning for me if you don't as it would suggest that cutting down on
> > those allocations won't help.
>
> So do I. On a test environment with regular tengig cards I was unable to
> reproduce the issue. Honestly, I thought that almost every contemporary
> driver for high-speed cards is working with scatter-gather, so I had not mlx
> in mind as a potential cause of this problem from very beginning.
Right, the drivers handle SG just fine, even in UD mode. It's just that as soon
as you switch to CM they turn off hardware IP csums and SG support. The only
question that remains before testing a patched driver is whether or not
the messages sent by Ceph are fragmented enough to save allocations. If not, we
could always patch Ceph as well, but this is beginning to snowball.
Here is the untested WIP patch for SG support in ipoib CM mode; I'm currently
talking to the original author of a larger patch about reviewing and splitting
it to get them both upstream:
https://gist.github.com/christian-marie/e8048b9c118bd3925957
> There are a couple of reports in ceph lists, complaining for OSD
> flapping/unresponsiveness without clear reason on certain (not always clear
> though) conditions which may have same root cause.
Possibly, though ipoib and Ceph seem to be a relatively rare combination.
Someone will likely find this thread if it is the same root cause.
> Wonder if numad-like mechanism will help there, but its usage is generally an
> anti-performance pattern in my experience.
We've played with zone_reclaim_mode and numad to no avail. The only thing we
haven't tried is striping, which I don't want to do anyway.
If these large allocations are indeed a reasonable thing to ask of the
compaction/reclaim subsystem, that seems like the best way forward. I have two
questions that follow from this conjecture:
Is compaction behaving badly, or are we just asking for too many high-order
allocations?
Is this fixed in a later kernel? I haven't tested yet.
* Re: isolate_freepages_block and excessive CPU usage by OSD process
2014-11-19 21:20 ` Christian Marie
@ 2014-11-19 23:10 ` Vlastimil Babka
2014-11-19 23:49 ` Andrey Korolyov
` (2 more replies)
0 siblings, 3 replies; 36+ messages in thread
From: Vlastimil Babka @ 2014-11-19 23:10 UTC (permalink / raw)
To: linux-mm
On 11/19/2014 10:20 PM, Christian Marie wrote:
> On Wed, Nov 19, 2014 at 10:03:44PM +0400, Andrey Korolyov wrote:
>> > We are using Mellanox ipoib drivers which do not do scatter-gather, so I'm
>> > currently working on adding support for that (the hardware supports it). Are
>> > you also using ipoib or have something else doing high order allocations? It's
>> > a bit concerning for me if you don't as it would suggest that cutting down on
>> > those allocations won't help.
>>
>> So do I. On a test environment with regular tengig cards I was unable to
>> reproduce the issue. Honestly, I thought that almost every contemporary
>> driver for high-speed cards is working with scatter-gather, so I had not mlx
>> in mind as a potential cause of this problem from very beginning.
>
> Right, the drivers handle SG just fine, even in UD mode. It's just that as soon
> as you go switch to CM they turn of hardware IP csums and SG support. The only
> question I remain to answer before testing a patched driver is whether or not
> the messages sent by Ceph are fragmented enough to save allocations. If not, we
> could always patch Ceph as well but this is beginning to snowball.
>
> Here is the untested WIP patch for SG support in ipoib CM mode, I'm currently
> talking to the original author of a larger patch to review and split that and
> get them both upstream.:
>
> https://gist.github.com/christian-marie/e8048b9c118bd3925957
>
>> There are a couple of reports in ceph lists, complaining for OSD
>> flapping/unresponsiveness without clear reason on certain (not always clear
>> though) conditions which may have same root cause.
>
> Possibly, though ipoib and Ceph seem to be a relatively rare combination.
> Someone will likely find this thread if it is the same root cause.
>
>> Wonder if numad-like mechanism will help there, but its usage is generally an
>> anti-performance pattern in my experience.
>
> We've played with zone_reclaim_mode and numad to no avail. Only thing we haven't
> tried is striping, which I don't want to do anyway.
>
> If these large allocations are indeed a reasonable thing to ask of the
> compaction/reclaim subsystem that seems like the best way forward. I have two
> questions that follow from this conjecture:
>
> Are compaction behaving badly or are we just asking for too many high order
> allocations?
>
> Is this fixed in a later kernel? I haven't tested yet.
As I said, recent kernels received many compaction performance tuning patches,
and reclaim as well. I would recommend trying them, if it's possible.
You mention 3.10.0-123.9.3.el7.x86_64, which I have no idea how it relates to
the upstream stable kernel. Upstream version 3.10.44 received several compaction
fixes that I'd deem critical for compaction to work as intended, and the lack of
them could explain your problems:
mm: compaction: reset cached scanner pfn's before reading them
commit d3132e4b83e6bd383c74d716f7281d7c3136089c upstream.
mm: compaction: detect when scanners meet in isolate_freepages
commit 7ed695e069c3cbea5e1fd08f84a04536da91f584 upstream.
mm/compaction: make isolate_freepages start at pageblock boundary
commit 49e068f0b73dd042c186ffa9b420a9943e90389a upstream.
You might want to check if those are included in your kernel package, and/or try
upstream stable 3.10 (if you can't use the latest for some reason).
Vlastimil
* Re: isolate_freepages_block and excessive CPU usage by OSD process
2014-11-19 23:10 ` Vlastimil Babka
@ 2014-11-19 23:49 ` Andrey Korolyov
2014-11-20 3:30 ` Christian Marie
2014-11-21 2:35 ` Christian Marie
2 siblings, 0 replies; 36+ messages in thread
From: Andrey Korolyov @ 2014-11-19 23:49 UTC (permalink / raw)
To: Vlastimil Babka; +Cc: linux-mm, Christian Marie
On Thu, Nov 20, 2014 at 2:10 AM, Vlastimil Babka <vbabka@suse.cz> wrote:
> On 11/19/2014 10:20 PM, Christian Marie wrote:
>> On Wed, Nov 19, 2014 at 10:03:44PM +0400, Andrey Korolyov wrote:
>>> > We are using Mellanox ipoib drivers which do not do scatter-gather, so I'm
>>> > currently working on adding support for that (the hardware supports it). Are
>>> > you also using ipoib or have something else doing high order allocations? It's
>>> > a bit concerning for me if you don't as it would suggest that cutting down on
>>> > those allocations won't help.
>>>
>>> So do I. On a test environment with regular tengig cards I was unable to
>>> reproduce the issue. Honestly, I thought that almost every contemporary
>>> driver for high-speed cards is working with scatter-gather, so I had not mlx
>>> in mind as a potential cause of this problem from very beginning.
>>
>> Right, the drivers handle SG just fine, even in UD mode. It's just that as soon
>> as you go switch to CM they turn of hardware IP csums and SG support. The only
>> question I remain to answer before testing a patched driver is whether or not
>> the messages sent by Ceph are fragmented enough to save allocations. If not, we
>> could always patch Ceph as well but this is beginning to snowball.
>>
>> Here is the untested WIP patch for SG support in ipoib CM mode, I'm currently
>> talking to the original author of a larger patch to review and split that and
>> get them both upstream.:
>>
>> https://gist.github.com/christian-marie/e8048b9c118bd3925957
>>
>>> There are a couple of reports in ceph lists, complaining for OSD
>>> flapping/unresponsiveness without clear reason on certain (not always clear
>>> though) conditions which may have same root cause.
>>
>> Possibly, though ipoib and Ceph seem to be a relatively rare combination.
>> Someone will likely find this thread if it is the same root cause.
>>
>>> Wonder if numad-like mechanism will help there, but its usage is generally an
>>> anti-performance pattern in my experience.
>>
>> We've played with zone_reclaim_mode and numad to no avail. Only thing we haven't
>> tried is striping, which I don't want to do anyway.
>>
>> If these large allocations are indeed a reasonable thing to ask of the
>> compaction/reclaim subsystem that seems like the best way forward. I have two
>> questions that follow from this conjecture:
>>
>> Are compaction behaving badly or are we just asking for too many high order
>> allocations?
>>
>> Is this fixed in a later kernel? I haven't tested yet.
>
> As I said, recent kernels received many compaction performance tuning patches,
> and reclaim as well. I would recommend trying them, if it's possible.
>
> You mention 3.10.0-123.9.3.el7.x86_64 which I have no idea how it relates to
> upstream stable kernel. Upstream version 3.10.44 received several compaction
> fixes that I'd deem critical for compaction to work as intended, and lack of
> them could explain your problems:
>
> mm: compaction: reset cached scanner pfn's before reading them
> commit d3132e4b83e6bd383c74d716f7281d7c3136089c upstream.
>
> mm: compaction: detect when scanners meet in isolate_freepages
> commit 7ed695e069c3cbea5e1fd08f84a04536da91f584 upstream.
>
> mm/compaction: make isolate_freepages start at pageblock boundary
> commit 49e068f0b73dd042c186ffa9b420a9943e90389a upstream.
>
> You might want to check if those are included in your kernel package, and/or try
> upstream stable 3.10 (if you can't use the latest for some reason).
>
> Vlastimil
Thanks, neither Christian's build nor mine includes those. I
mentioned that I run -stable 3.10, but it was derived from the public
branch probably as early as RH's and received only
performance/security fixes at most. Will check the issue soon and
report back.
* Re: isolate_freepages_block and excessive CPU usage by OSD process
2014-11-19 23:10 ` Vlastimil Babka
2014-11-19 23:49 ` Andrey Korolyov
@ 2014-11-20 3:30 ` Christian Marie
2014-11-21 2:35 ` Christian Marie
2 siblings, 0 replies; 36+ messages in thread
From: Christian Marie @ 2014-11-20 3:30 UTC (permalink / raw)
To: linux-mm
[-- Attachment #1: Type: text/plain, Size: 1423 bytes --]
On Thu, Nov 20, 2014 at 12:10:30AM +0100, Vlastimil Babka wrote:
> > Is this fixed in a later kernel? I haven't tested yet.
>
> As I said, recent kernels received many compaction performance tuning patches,
> and reclaim as well. I would recommend trying them, if it's possible.
>
> You mention 3.10.0-123.9.3.el7.x86_64 which I have no idea how it relates to
> upstream stable kernel. Upstream version 3.10.44 received several compaction
> fixes that I'd deem critical for compaction to work as intended, and lack of
> them could explain your problems:
>
> mm: compaction: reset cached scanner pfn's before reading them
> commit d3132e4b83e6bd383c74d716f7281d7c3136089c upstream.
>
> mm: compaction: detect when scanners meet in isolate_freepages
> commit 7ed695e069c3cbea5e1fd08f84a04536da91f584 upstream.
>
> mm/compaction: make isolate_freepages start at pageblock boundary
> commit 49e068f0b73dd042c186ffa9b420a9943e90389a upstream.
>
> You might want to check if those are included in your kernel package, and/or try
> upstream stable 3.10 (if you can't use the latest for some reason).
Excellent, thank you.
I realised there were a lot of changes, but this list of specific fixes might
help narrow down the actual cause here. I've just built a kernel that's exactly
the same as the exploding one, with just these three patches added, and will be
back tomorrow with the results of testing.
* Re: isolate_freepages_block and excessive CPU usage by OSD process
2014-11-19 23:10 ` Vlastimil Babka
2014-11-19 23:49 ` Andrey Korolyov
2014-11-20 3:30 ` Christian Marie
@ 2014-11-21 2:35 ` Christian Marie
2014-11-23 9:33 ` Christian Marie
2 siblings, 1 reply; 36+ messages in thread
From: Christian Marie @ 2014-11-21 2:35 UTC (permalink / raw)
To: linux-mm
[-- Attachment #1: Type: text/plain, Size: 1417 bytes --]
On Thu, Nov 20, 2014 at 12:10:30AM +0100, Vlastimil Babka wrote:
> As I said, recent kernels received many compaction performance tuning patches,
> and reclaim as well. I would recommend trying them, if it's possible.
>
> You mention 3.10.0-123.9.3.el7.x86_64 which I have no idea how it relates to
> upstream stable kernel. Upstream version 3.10.44 received several compaction
> fixes that I'd deem critical for compaction to work as intended, and lack of
> them could explain your problems:
>
> mm: compaction: reset cached scanner pfn's before reading them
> commit d3132e4b83e6bd383c74d716f7281d7c3136089c upstream.
>
> mm: compaction: detect when scanners meet in isolate_freepages
> commit 7ed695e069c3cbea5e1fd08f84a04536da91f584 upstream.
>
> mm/compaction: make isolate_freepages start at pageblock boundary
> commit 49e068f0b73dd042c186ffa9b420a9943e90389a upstream.
>
> You might want to check if those are included in your kernel package, and/or try
> upstream stable 3.10 (if you can't use the latest for some reason).
I built exactly the same kernel with these patches applied; unfortunately it
suffered the same problem. I will now try the latest (3.18-rc5) release
candidate and report back.
Do you have any ideas of where I could be looking to collect data to track down
what is happening here? Here is some perf output again:
http://ponies.io/raw/compaction.png
* Re: isolate_freepages_block and excessive CPU usage by OSD process
2014-11-21 2:35 ` Christian Marie
@ 2014-11-23 9:33 ` Christian Marie
2014-11-24 21:48 ` Andrey Korolyov
0 siblings, 1 reply; 36+ messages in thread
From: Christian Marie @ 2014-11-23 9:33 UTC (permalink / raw)
To: linux-mm
[-- Attachment #1: Type: text/plain, Size: 775 bytes --]
Here's an update:
Tried running 3.18.0-rc5 over the weekend, to no avail. Under a load spike
through Ceph it brought no perceived improvement over the chassis running 3.10
kernels.
Here is a graph of *system* cpu time (not user), note that 3.18 was a005.block:
http://ponies.io/raw/cluster.png
It is perhaps faring a little better than those chassis running 3.10, in
that it did not have min_free_kbytes raised to 2GB as the others did; instead
it was sitting around 90MB.
The perf recording did look a little different. Not sure if this was just the
luck of the draw in how the fractal rendering works:
http://ponies.io/raw/perf-3.10.png
Any pointers on how we can track this down? There are at least three of us
following this now, so we should have plenty of room to test.
* Re: isolate_freepages_block and excessive CPU usage by OSD process
2014-11-23 9:33 ` Christian Marie
@ 2014-11-24 21:48 ` Andrey Korolyov
2014-11-28 8:03 ` Joonsoo Kim
0 siblings, 1 reply; 36+ messages in thread
From: Andrey Korolyov @ 2014-11-24 21:48 UTC (permalink / raw)
To: linux-mm
On Sun, Nov 23, 2014 at 12:33 PM, Christian Marie <christian@ponies.io> wrote:
> Here's an update:
>
> Tried running 3.18.0-rc5 over the weekend to no avail. A load spike through
> Ceph brings no perceived improvement over the chassis running 3.10 kernels.
>
> Here is a graph of *system* cpu time (not user), note that 3.18 was a005.block:
>
> http://ponies.io/raw/cluster.png
>
> It is perhaps faring a little better that those chassis running the 3.10 in
> that it did not have min_free_kbytes raised to 2GB as the others did, instead
> it was sitting around 90MB.
>
> The perf recording did look a little different. Not sure if this was just the
> luck of the draw in how the fractal rendering works:
>
> http://ponies.io/raw/perf-3.10.png
>
> Any pointers on how we can track this down? There's at least three of us
> following at this now so we should have plenty of area to test.
Checked against 3.16 (3.17 hung for an unrelated problem); the issue
is present on single- and two-headed systems as well. Ceph users
reported the problem on 3.17 too, so we are probably facing a
generic compaction issue.
* Re: isolate_freepages_block and excessive CPU usage by OSD process
2014-11-24 21:48 ` Andrey Korolyov
@ 2014-11-28 8:03 ` Joonsoo Kim
2014-11-28 9:26 ` Vlastimil Babka
0 siblings, 1 reply; 36+ messages in thread
From: Joonsoo Kim @ 2014-11-28 8:03 UTC (permalink / raw)
To: Andrey Korolyov
Cc: linux-mm, Christoph Lameter, David Rientjes, Andrew Morton,
Vlastimil Babka
On Tue, Nov 25, 2014 at 01:48:42AM +0400, Andrey Korolyov wrote:
> On Sun, Nov 23, 2014 at 12:33 PM, Christian Marie <christian@ponies.io> wrote:
> > Here's an update:
> >
> > Tried running 3.18.0-rc5 over the weekend to no avail. A load spike through
> > Ceph brings no perceived improvement over the chassis running 3.10 kernels.
> >
> > Here is a graph of *system* cpu time (not user), note that 3.18 was a005.block:
> >
> > http://ponies.io/raw/cluster.png
> >
> > It is perhaps faring a little better that those chassis running the 3.10 in
> > that it did not have min_free_kbytes raised to 2GB as the others did, instead
> > it was sitting around 90MB.
> >
> > The perf recording did look a little different. Not sure if this was just the
> > luck of the draw in how the fractal rendering works:
> >
> > http://ponies.io/raw/perf-3.10.png
> >
> > Any pointers on how we can track this down? There's at least three of us
> > following at this now so we should have plenty of area to test.
>
>
> Checked against 3.16 (3.17 hanged for an unrelated problem), the issue
> is presented for single- and two-headed systems as well. Ceph-users
> reported presence of the problem for 3.17, so probably we are facing
> generic compaction issue.
>
Hello,
I didn't follow this discussion closely, but, at a glance, this excessive CPU
usage by compaction is related to the following fixes.
Could you test the following two patches?
If these fix your problem, I will resubmit the patches with proper commit
descriptions.
Thanks.
-------->8-------------
* Re: isolate_freepages_block and excessive CPU usage by OSD process
2014-11-28 8:03 ` Joonsoo Kim
@ 2014-11-28 9:26 ` Vlastimil Babka
2014-12-01 8:31 ` Joonsoo Kim
0 siblings, 1 reply; 36+ messages in thread
From: Vlastimil Babka @ 2014-11-28 9:26 UTC (permalink / raw)
To: Joonsoo Kim, Andrey Korolyov
Cc: linux-mm, Christoph Lameter, David Rientjes, Andrew Morton
On 28.11.2014 9:03, Joonsoo Kim wrote:
> On Tue, Nov 25, 2014 at 01:48:42AM +0400, Andrey Korolyov wrote:
>> On Sun, Nov 23, 2014 at 12:33 PM, Christian Marie <christian@ponies.io> wrote:
>>> Here's an update:
>>>
>>> Tried running 3.18.0-rc5 over the weekend to no avail. A load spike through
>>> Ceph brings no perceived improvement over the chassis running 3.10 kernels.
>>>
>>> Here is a graph of *system* cpu time (not user), note that 3.18 was a005.block:
>>>
>>> http://ponies.io/raw/cluster.png
>>>
>>> It is perhaps faring a little better that those chassis running the 3.10 in
>>> that it did not have min_free_kbytes raised to 2GB as the others did, instead
>>> it was sitting around 90MB.
>>>
>>> The perf recording did look a little different. Not sure if this was just the
>>> luck of the draw in how the fractal rendering works:
>>>
>>> http://ponies.io/raw/perf-3.10.png
>>>
>>> Any pointers on how we can track this down? There's at least three of us
>>> following at this now so we should have plenty of area to test.
>>
>> Checked against 3.16 (3.17 hanged for an unrelated problem), the issue
>> is presented for single- and two-headed systems as well. Ceph-users
>> reported presence of the problem for 3.17, so probably we are facing
>> generic compaction issue.
>>
> Hello,
>
> I didn't follow-up this discussion, but, at glance, this excessive CPU
> usage by compaction is related to following fixes.
>
> Could you test following two patches?
>
> If these fixes your problem, I will resumit patches with proper commit
> description.
>
> Thanks.
>
> -------->8-------------
> From 079f3f119f1e3cbe9d981e7d0cada94e0c532162 Mon Sep 17 00:00:00 2001
> From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Date: Fri, 28 Nov 2014 16:36:00 +0900
> Subject: [PATCH 1/2] mm/compaction: fix wrong order check in
> compact_finished()
>
> What we want to check here is whether there is a high-order freepage
> in the buddy list of another migratetype, in order to steal it without
> fragmentation. But the current code just checks cc->order, which is the
> allocation request order. So this is wrong.
>
> Without this fix, non-movable synchronous compaction below pageblock order
> would not stop until compaction completes, because the migratetype of most
> pageblocks is movable and cc->order is always below pageblock order
> in this case.
>
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> ---
> mm/compaction.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index b544d61..052194f 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1082,7 +1082,7 @@ static int compact_finished(struct zone *zone, struct compact_control *cc,
> return COMPACT_PARTIAL;
>
> /* Job done if allocation would set block type */
> - if (cc->order >= pageblock_order && area->nr_free)
> + if (order >= pageblock_order && area->nr_free)
> return COMPACT_PARTIAL;
Dang, good catch!
But I wonder, are MIGRATE_RESERVE pages counted towards area->nr_free?
It seems to me that they are, so this check can have false positives?
And for an unmovable allocation, MIGRATE_CMA pages are probably the same case?
Vlastimil
> }
>
* Re: isolate_freepages_block and excessive CPU usage by OSD process
2014-11-28 9:26 ` Vlastimil Babka
@ 2014-12-01 8:31 ` Joonsoo Kim
2014-12-02 1:47 ` Christian Marie
0 siblings, 1 reply; 36+ messages in thread
From: Joonsoo Kim @ 2014-12-01 8:31 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Andrey Korolyov, linux-mm, Christoph Lameter, David Rientjes,
Andrew Morton
On Fri, Nov 28, 2014 at 10:26:15AM +0100, Vlastimil Babka wrote:
> On 28.11.2014 9:03, Joonsoo Kim wrote:
> >On Tue, Nov 25, 2014 at 01:48:42AM +0400, Andrey Korolyov wrote:
> >>On Sun, Nov 23, 2014 at 12:33 PM, Christian Marie <christian@ponies.io> wrote:
> >>>Here's an update:
> >>>
> >>>Tried running 3.18.0-rc5 over the weekend to no avail. A load spike through
> >>>Ceph brings no perceived improvement over the chassis running 3.10 kernels.
> >>>
> >>>Here is a graph of *system* cpu time (not user), note that 3.18 was a005.block:
> >>>
> >>>http://ponies.io/raw/cluster.png
> >>>
> >>>It is perhaps faring a little better that those chassis running the 3.10 in
> >>>that it did not have min_free_kbytes raised to 2GB as the others did, instead
> >>>it was sitting around 90MB.
> >>>
> >>>The perf recording did look a little different. Not sure if this was just the
> >>>luck of the draw in how the fractal rendering works:
> >>>
> >>>http://ponies.io/raw/perf-3.10.png
> >>>
> >>>Any pointers on how we can track this down? There's at least three of us
> >>>following at this now so we should have plenty of area to test.
> >>
> >>Checked against 3.16 (3.17 hanged for an unrelated problem), the issue
> >>is presented for single- and two-headed systems as well. Ceph-users
> >>reported presence of the problem for 3.17, so probably we are facing
> >>generic compaction issue.
> >>
> >Hello,
> >
> >I didn't follow-up this discussion, but, at glance, this excessive CPU
> >usage by compaction is related to following fixes.
> >
> >Could you test following two patches?
> >
> >If these fixes your problem, I will resumit patches with proper commit
> >description.
> >
> >Thanks.
> >
> >-------->8-------------
> > From 079f3f119f1e3cbe9d981e7d0cada94e0c532162 Mon Sep 17 00:00:00 2001
> >From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> >Date: Fri, 28 Nov 2014 16:36:00 +0900
> >Subject: [PATCH 1/2] mm/compaction: fix wrong order check in
> > compact_finished()
> >
> >What we want to check here is whether there is highorder freepage
> >in buddy list of other migratetype in order to steal it without
> >fragmentation. But, current code just checks cc->order which means
> >allocation request order. So, this is wrong.
> >
> >Without this fix, non-movable synchronous compaction below pageblock order
> >would not stopped until compaction complete, because migratetype of most
> >pageblocks are movable and cc->order is always below than pageblock order
> >in this case.
> >
> >Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> >---
> > mm/compaction.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> >diff --git a/mm/compaction.c b/mm/compaction.c
> >index b544d61..052194f 100644
> >--- a/mm/compaction.c
> >+++ b/mm/compaction.c
> >@@ -1082,7 +1082,7 @@ static int compact_finished(struct zone *zone, struct compact_control *cc,
> > return COMPACT_PARTIAL;
> > /* Job done if allocation would set block type */
> >- if (cc->order >= pageblock_order && area->nr_free)
> >+ if (order >= pageblock_order && area->nr_free)
> > return COMPACT_PARTIAL;
>
> Dang, good catch!
> But I wonder, are MIGRATE_RESERVE pages counted towards area->nr_free?
> Seems to me that they are, so this check can have false positives?
> Hm probably for unmovable allocation, MIGRATE_CMA pages is the same case?
>
Hello,
Although MIGRATE_RESERVE pages are counted in area->nr_free, at this
point there are no free pages left in MIGRATE_RESERVE; they would
already have been used before compaction was triggered.
In the case of MIGRATE_CMA, false positives are possible. But the same
accounting is already broken in __zone_watermark_ok(); without an
area->nr_free_cma counter we can't make this check accurate. Please
see the following link.
https://lkml.org/lkml/2014/6/2/1
Thanks.
* Re: isolate_freepages_block and excessive CPU usage by OSD process
2014-12-01 8:31 ` Joonsoo Kim
@ 2014-12-02 1:47 ` Christian Marie
2014-12-02 4:53 ` Joonsoo Kim
0 siblings, 1 reply; 36+ messages in thread
From: Christian Marie @ 2014-12-02 1:47 UTC (permalink / raw)
To: linux-mm
[-- Attachment #1: Type: text/plain, Size: 1061 bytes --]
On 28.11.2014 9:03, Joonsoo Kim wrote:
> Hello,
>
> I didn't follow-up this discussion, but, at glance, this excessive CPU
> usage by compaction is related to following fixes.
>
> Could you test following two patches?
>
> If these fixes your problem, I will resumit patches with proper commit
> description.
>
> -------- 8< ---------
Thanks for looking into this. Running 3.18-rc5 kernel with your patches has
produced some interesting results.
Load average still spikes to around 2000-3000 with the processors spinning 100%
doing compaction related things when min_free_kbytes is left at the default.
However, unlike before, the system is now completely stable. Pre-patch it would
be almost completely unresponsive (having to wait 30 seconds to establish an
SSH connection and several seconds to send a character).
Is it reasonable to guess that ipoib is giving compaction a hard time and
fixing this bug has allowed the system to at least not lock up?
I will try back-porting this to 3.10 and seeing if it is stable under these
strange conditions also.
* Re: isolate_freepages_block and excessive CPU usage by OSD process
2014-12-02 1:47 ` Christian Marie
@ 2014-12-02 4:53 ` Joonsoo Kim
2014-12-02 5:06 ` Christian Marie
2014-12-02 15:46 ` Vlastimil Babka
0 siblings, 2 replies; 36+ messages in thread
From: Joonsoo Kim @ 2014-12-02 4:53 UTC (permalink / raw)
To: linux-mm
On Tue, Dec 02, 2014 at 12:47:24PM +1100, Christian Marie wrote:
> On 28.11.2014 9:03, Joonsoo Kim wrote:
> > Hello,
> >
> > I didn't follow-up this discussion, but, at glance, this excessive CPU
> > usage by compaction is related to following fixes.
> >
> > Could you test following two patches?
> >
> > If these fixes your problem, I will resumit patches with proper commit
> > description.
> >
> > -------- 8< ---------
>
>
> Thanks for looking into this. Running 3.18-rc5 kernel with your patches has
> produced some interesting results.
>
> Load average still spikes to around 2000-3000 with the processors spinning 100%
> doing compaction related things when min_free_kbytes is left at the default.
>
> However, unlike before, the system is now completely stable. Pre-patch it would
> be almost completely unresponsive (having to wait 30 seconds to establish an
> SSH connection and several seconds to send a character).
>
> Is it reasonable to guess that ipoib is giving compaction a hard time and
> fixing this bug has allowed the system to at least not lock up?
>
> I will try back-porting this to 3.10 and seeing if it is stable under these
> strange conditions also.
Hello,
Good to hear!
The load average spikes may be related to skip-bit management. Currently
there is no way to maintain the skip bits permanently: after one iteration
of compaction finishes and the skip bits are reset, all pageblocks have to
be re-scanned.
Your system has the Mellanox driver, and although I don't know exactly
what it does, I have heard that it allocates an enormous number of pages
and uses get_user_pages() to pin them in memory. That memory isn't
available to compaction, but compaction still scans it every time.
This is just my assumption, so if possible, please check it with the
compaction tracepoints. If that is what is happening, we can come up
with a solution for this problem.
Anyway, could you test one more time without the second patch?
IMO, the first patch is reasonable to backport, because it fixes a real
bug, but I'm not sure whether the second patch needs to be backported.
One more test will help us understand the effect of each patch.
Thanks.
* Re: isolate_freepages_block and excessive CPU usage by OSD process
2014-12-02 4:53 ` Joonsoo Kim
@ 2014-12-02 5:06 ` Christian Marie
2014-12-03 4:04 ` Christian Marie
2014-12-03 7:57 ` Joonsoo Kim
2014-12-02 15:46 ` Vlastimil Babka
1 sibling, 2 replies; 36+ messages in thread
From: Christian Marie @ 2014-12-02 5:06 UTC (permalink / raw)
To: linux-mm
[-- Attachment #1: Type: text/plain, Size: 619 bytes --]
On Tue, Dec 02, 2014 at 01:53:24PM +0900, Joonsoo Kim wrote:
> This is just my assumption, so if possible, please check it with
> compaction tracepoint. If it is, we can make a solution for this
> problem.
Which event/function would you like me to trace specifically?
> Anyway, could you test one more time without second patch?
> IMO, first patch is reasonable to backport, because it fixes a real bug.
> But, I'm not sure if second patch is needed to backport or not.
> One more testing will help us to understand the effect of patch.
I will attempt to do this tomorrow and should have results in around 24 hours.
* Re: isolate_freepages_block and excessive CPU usage by OSD process
2014-12-02 4:53 ` Joonsoo Kim
2014-12-02 5:06 ` Christian Marie
@ 2014-12-02 15:46 ` Vlastimil Babka
2014-12-03 7:49 ` Joonsoo Kim
1 sibling, 1 reply; 36+ messages in thread
From: Vlastimil Babka @ 2014-12-02 15:46 UTC (permalink / raw)
To: Joonsoo Kim, linux-mm
On 12/02/2014 05:53 AM, Joonsoo Kim wrote:
> On Tue, Dec 02, 2014 at 12:47:24PM +1100, Christian Marie wrote:
>> On 28.11.2014 9:03, Joonsoo Kim wrote:
>>> Hello,
>>>
>>> I didn't follow-up this discussion, but, at glance, this excessive CPU
>>> usage by compaction is related to following fixes.
>>>
>>> Could you test following two patches?
>>>
>>> If these fixes your problem, I will resumit patches with proper commit
>>> description.
>>>
>>> -------- 8< ---------
>>
>>
>> Thanks for looking into this. Running 3.18-rc5 kernel with your patches has
>> produced some interesting results.
>>
>> Load average still spikes to around 2000-3000 with the processors spinning 100%
>> doing compaction related things when min_free_kbytes is left at the default.
>>
>> However, unlike before, the system is now completely stable. Pre-patch it would
>> be almost completely unresponsive (having to wait 30 seconds to establish an
>> SSH connection and several seconds to send a character).
>>
>> Is it reasonable to guess that ipoib is giving compaction a hard time and
>> fixing this bug has allowed the system to at least not lock up?
>>
>> I will try back-porting this to 3.10 and seeing if it is stable under these
>> strange conditions also.
>
> Hello,
>
> Good to hear!
Indeed, although I somehow doubt your first patch could have made such
a difference. It only matters when you have a whole pageblock free.
Without the patch, the particular compaction attempt that managed to
free the block might not be terminated ASAP, but then the free pageblock
is still allocatable by the following allocation attempts, so it
shouldn't result in a stream of complete compactions.
So I would expect it's either a fluke, or the second patch made the
difference, to either SLUB or something else making such fallback-able
allocations.
But hmm, I've never considered the implications of compact_finished()
migratetypes handling on unmovable allocations. Regardless of cc->order,
it often has to free a whole pageblock to succeed, as it's unlikely it
will succeed compacting within a pageblock already marked as UNMOVABLE.
Guess it's to prevent further fragmentation and that makes sense, but it
does make high-order unmovable allocations problematic. At least the
watermark checks for allowing compaction in the first place are then
wrong: we decide that based on cc->order, but in fact we need at least
a pageblock's worth of space free to actually succeed.
> Load average spike may be related to skip bit management. Currently, there is
> no way to maintain skip bit permanently. So, after one iteration of compaction
> is finished and skip bit is reset, all pageblocks should be re-scanned.
It shouldn't be "after one iteration of compaction": the bits are cleared
only when compaction is restarting after being deferred, or when kswapd
goes to sleep.
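The deferral bookkeeping referred to here can be sketched as a hypothetical
userspace model (simplified from the kernel's defer_compaction() and
compaction_deferred(); the order-based conditions are omitted). After each
failure, compaction is skipped for an exponentially growing number of
subsequent attempts:

```c
#include <assert.h>
#include <stdbool.h>

#define COMPACT_MAX_DEFER_SHIFT 6

struct zone_model {
    unsigned int compact_considered;   /* attempts since last defer */
    unsigned int compact_defer_shift;  /* log2 of the skip window */
};

/* Called when compaction failed: widen the skip window. */
static void defer_compaction(struct zone_model *z)
{
    z->compact_considered = 0;
    if (z->compact_defer_shift < COMPACT_MAX_DEFER_SHIFT)
        z->compact_defer_shift++;
}

/* Returns true if this compaction attempt should be skipped. */
static bool compaction_deferred(struct zone_model *z)
{
    unsigned int defer_limit = 1u << z->compact_defer_shift;

    if (++z->compact_considered >= defer_limit) {
        z->compact_considered = defer_limit;
        return false;   /* window elapsed, try compacting again */
    }
    return true;
}
```

When a retry is allowed and compaction restarts, that is the point at which
the skip bits are cleared and pageblocks get re-scanned.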
> Your system has mellanox driver and although I don't know exactly what it is,
> I heard that it allocates enormous pages and do get_user_pages() to
> pin pages in memory. These memory aren't available to compaction, but,
> compaction always scan it.
>
> This is just my assumption, so if possible, please check it with
> compaction tracepoint. If it is, we can make a solution for this
> problem.
>
> Anyway, could you test one more time without second patch?
> IMO, first patch is reasonable to backport, because it fixes a real bug.
> But, I'm not sure if second patch is needed to backport or not.
> One more testing will help us to understand the effect of patch.
>
> Thanks.
>
* Re: isolate_freepages_block and excessive CPU usage by OSD process
2014-12-02 5:06 ` Christian Marie
@ 2014-12-03 4:04 ` Christian Marie
2014-12-03 8:05 ` Joonsoo Kim
2014-12-04 23:30 ` Vlastimil Babka
2014-12-03 7:57 ` Joonsoo Kim
1 sibling, 2 replies; 36+ messages in thread
From: Christian Marie @ 2014-12-03 4:04 UTC (permalink / raw)
To: linux-mm
[-- Attachment #1: Type: text/plain, Size: 3602 bytes --]
On Tue, Dec 02, 2014 at 04:06:08PM +1100, Christian Marie wrote:
> I will attempt to do this tomorrow and should have results in around 24 hours.
I ran said test today and wasn't able to pinpoint a solid difference between a kernel
with both patches and one with only the first. The one with both patches "felt"
a little more responsive, probably a fluke.
I'd really like to write a stress test that simulates what ceph/ipoib is doing
here so that I can test this in a more scientific manner.
Here is some perf output; the kernel with only the first patch is on the right:
http://ponies.io/raw/before-after.png
A note in passing: we left the cluster running with min_free_kbytes set to the
default last night and within a few hours it started spewing the usual
pre-patch allocation failures. So whilst this patch appears to make the system
more responsive under adverse conditions, the underlying
not-keeping-up-with-pressure issue is still there.
There's enough starvation to break single page allocations.
Keep in mind that this is on a 3.10 kernel with the patches applied, so I'm
not expecting anyone to particularly care; I'm simply running out of time to
test the whole cluster on 3.18. I really do think that replicating the
allocation pattern is the best way forward, but my attempts at simply sending
a lot of similar-looking packets with lots of page cache haven't reproduced it.
Those allocation failures on 3.10 with both patches look like this:
[73138.803800] ceph-osd: page allocation failure: order:0, mode:0x20
[73138.803802] CPU: 0 PID: 9214 Comm: ceph-osd Tainted: GF
O-------------- 3.10.0-123.9.3.anchor.x86_64 #1
[73138.803803] Hardware name: Dell Inc. PowerEdge R720xd/0X3D66, BIOS 2.2.2
01/16/2014
[73138.803803] 0000000000000020 00000000d6532f99 ffff88081fa03aa0
ffffffff815e23bb
[73138.803806] ffff88081fa03b30 ffffffff81147340 00000000ffffffff
ffff8807da887900
[73138.803808] ffff88083ffd9e80 ffff8800b2242900 ffff8807d843c050
00000000d6532f99
[73138.803812] Call Trace:
[73138.803813] <IRQ> [<ffffffff815e23bb>] dump_stack+0x19/0x1b
[73138.803817] [<ffffffff81147340>] warn_alloc_failed+0x110/0x180
[73138.803819] [<ffffffff8114b4ee>] __alloc_pages_nodemask+0x91e/0xb20
[73138.803821] [<ffffffff8152f82a>] ? tcp_v4_rcv+0x67a/0x7c0
[73138.803823] [<ffffffff81509710>] ? ip_rcv_finish+0x350/0x350
[73138.803826] [<ffffffff81188369>] alloc_pages_current+0xa9/0x170
[73138.803828] [<ffffffff814bedb1>] __netdev_alloc_frag+0x91/0x140
[73138.803831] [<ffffffff814c0df7>] __netdev_alloc_skb+0x77/0xc0
[73138.803834] [<ffffffffa06b54c5>] ipoib_cm_handle_rx_wc+0xf5/0x940
[ib_ipoib]
[73138.803838] [<ffffffffa0625e78>] ? mlx4_ib_poll_cq+0xc8/0x210 [mlx4_ib]
[73138.803841] [<ffffffffa06a90ed>] ipoib_poll+0x8d/0x150 [ib_ipoib]
[73138.803843] [<ffffffff814d05aa>] net_rx_action+0x15a/0x250
[73138.803846] [<ffffffff81067047>] __do_softirq+0xf7/0x290
[73138.803848] [<ffffffff815f43dc>] call_softirq+0x1c/0x30
[73138.803851] [<ffffffff81014d25>] do_softirq+0x55/0x90
[73138.803853] [<ffffffff810673e5>] irq_exit+0x115/0x120
[73138.803855] [<ffffffff815f4cd8>] do_IRQ+0x58/0xf0
[73138.803857] [<ffffffff815e9e2d>] common_interrupt+0x6d/0x6d
[73138.803858] <EOI> [<ffffffff815f2bc0>] ? sysret_audit+0x17/0x21
We get some like this, also:
[ 1293.152415] SLUB: Unable to allocate memory on node -1 (gfp=0x20)
[ 1293.152416] cache: kmalloc-256, object size: 256, buffer size: 256,
default order: 1, min order: 0
[ 1293.152417] node 0: slabs: 1789, objs: 57248, free: 0
[ 1293.152418] node 1: slabs: 449, objs: 14368, free: 2
* Re: isolate_freepages_block and excessive CPU usage by OSD process
2014-12-02 15:46 ` Vlastimil Babka
@ 2014-12-03 7:49 ` Joonsoo Kim
2014-12-03 12:43 ` Vlastimil Babka
0 siblings, 1 reply; 36+ messages in thread
From: Joonsoo Kim @ 2014-12-03 7:49 UTC (permalink / raw)
To: Vlastimil Babka; +Cc: linux-mm
On Tue, Dec 02, 2014 at 04:46:33PM +0100, Vlastimil Babka wrote:
> On 12/02/2014 05:53 AM, Joonsoo Kim wrote:
> >On Tue, Dec 02, 2014 at 12:47:24PM +1100, Christian Marie wrote:
> >>On 28.11.2014 9:03, Joonsoo Kim wrote:
> >>>Hello,
> >>>
> >>>I didn't follow-up this discussion, but, at glance, this excessive CPU
> >>>usage by compaction is related to following fixes.
> >>>
> >>>Could you test following two patches?
> >>>
> >>>If these fixes your problem, I will resumit patches with proper commit
> >>>description.
> >>>
> >>>-------- 8< ---------
> >>
> >>
> >>Thanks for looking into this. Running 3.18-rc5 kernel with your patches has
> >>produced some interesting results.
> >>
> >>Load average still spikes to around 2000-3000 with the processors spinning 100%
> >>doing compaction related things when min_free_kbytes is left at the default.
> >>
> >>However, unlike before, the system is now completely stable. Pre-patch it would
> >>be almost completely unresponsive (having to wait 30 seconds to establish an
> >>SSH connection and several seconds to send a character).
> >>
> >>Is it reasonable to guess that ipoib is giving compaction a hard time and
> >>fixing this bug has allowed the system to at least not lock up?
> >>
> >>I will try back-porting this to 3.10 and seeing if it is stable under these
> >>strange conditions also.
> >
> >Hello,
> >
> >Good to hear!
>
> Indeed, although I somehow doubt your first patch could have made
> such difference. It only matters when you have a whole pageblock
> free. Without the patch, the particular compaction attempt that
> managed to free the block might not be terminated ASAP, but then the
> free pageblock is still allocatable by the following allocation
> attempts, so it shouldn't result in a stream of complete
> compactions.
A high-order free page made by compaction can be broken up by subsequent
order-0 allocation attempts, so following high-order allocation attempts
can trigger a new round of compaction. It would depend on the workload.
Anyway, we should fix cc->order to order. :)
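The break-up Joonsoo describes can be illustrated with a toy buddy-list
model (hypothetical userspace code, not the kernel's allocation path): a
single order-0 allocation splits a pageblock-order free page into smaller
buddies, so the next pageblock-order request fails and compaction may be
invoked again.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_ORDER 11        /* stand-in constants */
#define PAGEBLOCK_ORDER 9

static int nr_free[MAX_ORDER];  /* free pages per order */

/* Allocate one page of 'order', splitting a larger buddy if needed;
 * one buddy is left free at each intermediate order, as in a buddy system. */
static bool alloc_order(int order)
{
    int o = order;

    while (o < MAX_ORDER && nr_free[o] == 0)
        o++;
    if (o == MAX_ORDER)
        return false;        /* no buddy large enough */

    nr_free[o]--;
    while (o > order)
        nr_free[--o]++;      /* split: leave the other half free */
    return true;
}
```

Starting from one free pageblock, a single order-0 allocation leaves only
sub-pageblock buddies behind, even though almost all of the memory is
still free.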
>
> So I would expect it's either a fluke, or the second patch made the
> difference, to either SLUB or something else making such
> fallback-able allocations.
>
> But hmm, I've never considered the implications of
> compact_finished() migratetypes handling on unmovable allocations.
> Regardless of cc->order, it often has to free a whole pageblock to
> succeed, as it's unlikely it will succeed compacting within a
> pageblock already marked as UNMOVABLE. Guess it's to prevent further
> fragmentation and that makes sense, but it does make high-order
> unmovable allocations problematic. At least the watermark checks for
> allowing compaction in the first place are then wrong - we decide
> that based on cc->order, but in we fact need at least a pageblock
> worth of space free to actually succeed.
I think the watermark check is okay, but we need an elegant way to decide
the best time for compaction to stop. I made the following two patches
about this; they would make non-movable compaction less aggressive. This
is just a draft, so ignore my poor description. :)
Could you comment on it?
--------->8-----------------
* Re: isolate_freepages_block and excessive CPU usage by OSD process
2014-12-02 5:06 ` Christian Marie
2014-12-03 4:04 ` Christian Marie
@ 2014-12-03 7:57 ` Joonsoo Kim
2014-12-04 7:30 ` Christian Marie
1 sibling, 1 reply; 36+ messages in thread
From: Joonsoo Kim @ 2014-12-03 7:57 UTC (permalink / raw)
To: linux-mm
On Tue, Dec 02, 2014 at 04:06:08PM +1100, Christian Marie wrote:
> On Tue, Dec 02, 2014 at 01:53:24PM +0900, Joonsoo Kim wrote:
> > This is just my assumption, so if possible, please check it with
> > compaction tracepoint. If it is, we can make a solution for this
> > problem.
>
> Which event/function would you like me to trace specifically?
Hello,
It'd be very helpful to get the output of booting with
"trace_event=compaction:*,kmem:mm_page_alloc_extfrag" on a kernel
with my tracepoint patches applied.
See the following link; there are 3 patches:
https://lkml.org/lkml/2014/12/3/71
Thanks.
* Re: isolate_freepages_block and excessive CPU usage by OSD process
2014-12-03 4:04 ` Christian Marie
@ 2014-12-03 8:05 ` Joonsoo Kim
2014-12-04 23:30 ` Vlastimil Babka
1 sibling, 0 replies; 36+ messages in thread
From: Joonsoo Kim @ 2014-12-03 8:05 UTC (permalink / raw)
To: linux-mm
On Wed, Dec 03, 2014 at 03:04:04PM +1100, Christian Marie wrote:
> On Tue, Dec 02, 2014 at 04:06:08PM +1100, Christian Marie wrote:
> > I will attempt to do this tomorrow and should have results in around 24 hours.
>
> I ran said test today and wasn't able to pinpoint a solid difference between a kernel
> with both patches and one with only the first. The one with both patches "felt"
> a little more responsive, probably a fluke.
Thanks! It would help me.
>
> I'd really like to write a stress test that simulates what ceph/ipoib is doing
> here so that I can test this in a more scientific manner.
>
> Here is some perf output, the kernel with only the first patch is on the right:
>
> http://ponies.io/raw/before-after.png
>
>
> A note in passing: we left the cluster running with min_free_kbytes set to the
> default last night and within a few hours it started spewing the usual
> pre-patch allocation failures, so whilst this patch appears to make the system
> more responsive under adverse conditions the underlying
> not-keeping-up-with-pressure issue is still there.
I guess that it is caused by allocations arriving too fast. If your
allocation rate is higher than kswapd's reclaim rate and the allocation has
no GFP_WAIT, failures would be possible; the failure log below looks like
that case. Here, enlarging min_free_kbytes may be the right solution, but
I'm no expert on that, so please consult the other MM folks.
> There's enough starvation to break single page allocations.
>
> Keep in mind that this is on a 3.10 kernel with the patches applied so I'm not
> expecting anyone to particularly care. I'm running out of time to test the
> whole cluster at 3.18 is all, I really do think that replicating the allocation
> pattern is the best way forward but my attempts at simply sending a lot of
> packets that look similar with lots of page cache don't do it.
>
> Those allocation failures on 3.10 with both patches look like this:
>
> [73138.803800] ceph-osd: page allocation failure: order:0, mode:0x20
> [73138.803802] CPU: 0 PID: 9214 Comm: ceph-osd Tainted: GF
> O-------------- 3.10.0-123.9.3.anchor.x86_64 #1
> [73138.803803] Hardware name: Dell Inc. PowerEdge R720xd/0X3D66, BIOS 2.2.2
> 01/16/2014
> [73138.803803] 0000000000000020 00000000d6532f99 ffff88081fa03aa0
> ffffffff815e23bb
> [73138.803806] ffff88081fa03b30 ffffffff81147340 00000000ffffffff
> ffff8807da887900
> [73138.803808] ffff88083ffd9e80 ffff8800b2242900 ffff8807d843c050
> 00000000d6532f99
> [73138.803812] Call Trace:
> [73138.803813] <IRQ> [<ffffffff815e23bb>] dump_stack+0x19/0x1b
> [73138.803817] [<ffffffff81147340>] warn_alloc_failed+0x110/0x180
> [73138.803819] [<ffffffff8114b4ee>] __alloc_pages_nodemask+0x91e/0xb20
> [73138.803821] [<ffffffff8152f82a>] ? tcp_v4_rcv+0x67a/0x7c0
> [73138.803823] [<ffffffff81509710>] ? ip_rcv_finish+0x350/0x350
> [73138.803826] [<ffffffff81188369>] alloc_pages_current+0xa9/0x170
> [73138.803828] [<ffffffff814bedb1>] __netdev_alloc_frag+0x91/0x140
> [73138.803831] [<ffffffff814c0df7>] __netdev_alloc_skb+0x77/0xc0
> [73138.803834] [<ffffffffa06b54c5>] ipoib_cm_handle_rx_wc+0xf5/0x940
> [ib_ipoib]
> [73138.803838] [<ffffffffa0625e78>] ? mlx4_ib_poll_cq+0xc8/0x210 [mlx4_ib]
> [73138.803841] [<ffffffffa06a90ed>] ipoib_poll+0x8d/0x150 [ib_ipoib]
> [73138.803843] [<ffffffff814d05aa>] net_rx_action+0x15a/0x250
> [73138.803846] [<ffffffff81067047>] __do_softirq+0xf7/0x290
> [73138.803848] [<ffffffff815f43dc>] call_softirq+0x1c/0x30
> [73138.803851] [<ffffffff81014d25>] do_softirq+0x55/0x90
> [73138.803853] [<ffffffff810673e5>] irq_exit+0x115/0x120
> [73138.803855] [<ffffffff815f4cd8>] do_IRQ+0x58/0xf0
> [73138.803857] [<ffffffff815e9e2d>] common_interrupt+0x6d/0x6d
> [73138.803858] <EOI> [<ffffffff815f2bc0>] ? sysret_audit+0x17/0x21
>
> We get some like this, also:
>
> [ 1293.152415] SLUB: Unable to allocate memory on node -1 (gfp=0x20)
> [ 1293.152416] cache: kmalloc-256, object size: 256, buffer size: 256,
> default order: 1, min order: 0
> [ 1293.152417] node 0: slabs: 1789, objs: 57248, free: 0
> [ 1293.152418] node 1: slabs: 449, objs: 14368, free: 2
>
* Re: isolate_freepages_block and excessive CPU usage by OSD process
2014-12-03 7:49 ` Joonsoo Kim
@ 2014-12-03 12:43 ` Vlastimil Babka
2014-12-04 6:53 ` Joonsoo Kim
0 siblings, 1 reply; 36+ messages in thread
From: Vlastimil Babka @ 2014-12-03 12:43 UTC (permalink / raw)
To: Joonsoo Kim; +Cc: linux-mm
On 12/03/2014 08:49 AM, Joonsoo Kim wrote:
> On Tue, Dec 02, 2014 at 04:46:33PM +0100, Vlastimil Babka wrote:
>>
>> Indeed, although I somehow doubt your first patch could have made
>> such difference. It only matters when you have a whole pageblock
>> free. Without the patch, the particular compaction attempt that
>> managed to free the block might not be terminated ASAP, but then the
>> free pageblock is still allocatable by the following allocation
>> attempts, so it shouldn't result in a stream of complete
>> compactions.
>
> High-order freepage made by compaction could be broken by other
> order-0 allocation attempts, so following high-order allocation attempts
> could result in new compaction. It would be dependent on workload.
>
> Anyway, we should fix cc->order to order. :)
Sure, no doubts about it.
>>
>> So I would expect it's either a fluke, or the second patch made the
>> difference, to either SLUB or something else making such
>> fallback-able allocations.
>>
>> But hmm, I've never considered the implications of
>> compact_finished() migratetypes handling on unmovable allocations.
>> Regardless of cc->order, it often has to free a whole pageblock to
>> succeed, as it's unlikely it will succeed compacting within a
>> pageblock already marked as UNMOVABLE. Guess it's to prevent further
>> fragmentation and that makes sense, but it does make high-order
>> unmovable allocations problematic. At least the watermark checks for
>> allowing compaction in the first place are then wrong - we decide
>> that based on cc->order, but in we fact need at least a pageblock
>> worth of space free to actually succeed.
>
> I think that watermark check is okay but we need a elegant way to decide
> the best timing compaction should be stopped. I made following two patches
> about this. This patch would make non-movable compaction less
> aggressive. This is just draft so ignore my poor description. :)
>
> Could you comment it?
>
> --------->8-----------------
> From bd6b285c38fd94e5ec03a720bed4debae3914bde Mon Sep 17 00:00:00 2001
> From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Date: Mon, 1 Dec 2014 11:56:57 +0900
> Subject: [PATCH 1/2] mm/page_alloc: expands broken freepage to proper buddy
> list when steal
>
> There is odd behaviour when we steal freepages from other migratetype
> buddy list. In try_to_steal_freepages(), we move all freepages in
> the pageblock that founded freepage is belong to to the request
> migratetype in order to mitigate fragmentation. If the number of moved
> pages are enough to change pageblock migratetype, there is no problem. If
> not enough, we don't change pageblock migratetype and add broken freepages
> to the original migratetype buddy list rather than request migratetype
> one. For me, this is odd, because we already moved all freepages in this
> pageblock to the request migratetype. This patch fixes this situation to
> add broken freepages to the request migratetype buddy list in this case.
>
Yeah, I noticed this a while ago and traced the history of how it
happened. But surprisingly, just changing this back didn't evaluate as a
clear win, so I have added some further tuning. I will try to send this ASAP.
> This patch introduce new function that can help to decide if we can
> steal the page without resulting in fragmentation. It will be used in
> following patch for compaction finish criteria.
>
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> ---
> +static bool can_steal_freepages(unsigned int order,
> + int start_mt, int fallback_mt)
> +{
> + /*
> + * When borrowing from MIGRATE_CMA, we need to release the excess
> + * buddy pages to CMA itself. We also ensure the freepage_migratetype
> + * is set to CMA so it is returned to the correct freelist in case
> + * the page ends up being not actually allocated from the pcp lists.
> + */
> + if (is_migrate_cma(fallback_mt))
> + return false;
>
> - }
> + /* Can take ownership for orders >= pageblock_order */
> + if (order >= pageblock_order)
> + return true;
> +
> + if (order >= pageblock_order / 2 ||
> + start_mt == MIGRATE_RECLAIMABLE ||
> + page_group_by_mobility_disabled)
> + return true;
>
> - return fallback_type;
> + return false;
Note that this is not exactly consistent between compaction and allocation.
Allocation will succeed as long as a large enough fallback page exists; it
just might not steal the extra free pages if the fallback page order is low
(or the allocation is not MIGRATE_RECLAIMABLE). But for compaction, with
your patches you still evaluate whether it can also steal the extra pages,
so it's a stricter condition. That might make sense, but let's not claim
it's fully consistent? And it definitely needs evaluation...
Vlastimil
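For reference, the decision logic of the quoted draft can be modelled in
userspace roughly as follows (hypothetical sketch; the migratetype values
and pageblock order are simplified stand-ins for the kernel's):

```c
#include <assert.h>
#include <stdbool.h>

#define PAGEBLOCK_ORDER 9   /* stand-in for pageblock_order */

enum migratetype {
    MIGRATE_UNMOVABLE,
    MIGRATE_MOVABLE,
    MIGRATE_RECLAIMABLE,
    MIGRATE_CMA,
};

static bool page_group_by_mobility_disabled = false;  /* small-memory knob */

static bool is_migrate_cma(int mt)
{
    return mt == MIGRATE_CMA;
}

/* Can we steal a free page of 'order' found on fallback_mt's buddy list
 * for a start_mt allocation without making fragmentation worse? */
static bool can_steal_freepages(unsigned int order,
                                int start_mt, int fallback_mt)
{
    /* CMA pages must stay on the CMA freelist */
    if (is_migrate_cma(fallback_mt))
        return false;

    /* Can take ownership of the whole pageblock */
    if (order >= PAGEBLOCK_ORDER)
        return true;

    if (order >= PAGEBLOCK_ORDER / 2 ||
        start_mt == MIGRATE_RECLAIMABLE ||
        page_group_by_mobility_disabled)
        return true;

    return false;
}
```

As Vlastimil notes above, allocation itself will still fall back to a large
enough page even when this predicate says stealing the rest is not worthwhile.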
* Re: isolate_freepages_block and excessive CPU usage by OSD process
2014-12-03 12:43 ` Vlastimil Babka
@ 2014-12-04 6:53 ` Joonsoo Kim
0 siblings, 0 replies; 36+ messages in thread
From: Joonsoo Kim @ 2014-12-04 6:53 UTC (permalink / raw)
To: Vlastimil Babka; +Cc: linux-mm
On Wed, Dec 03, 2014 at 01:43:31PM +0100, Vlastimil Babka wrote:
> On 12/03/2014 08:49 AM, Joonsoo Kim wrote:
> > On Tue, Dec 02, 2014 at 04:46:33PM +0100, Vlastimil Babka wrote:
> >>
> >> Indeed, although I somehow doubt your first patch could have made
> >> such difference. It only matters when you have a whole pageblock
> >> free. Without the patch, the particular compaction attempt that
> >> managed to free the block might not be terminated ASAP, but then the
> >> free pageblock is still allocatable by the following allocation
> >> attempts, so it shouldn't result in a stream of complete
> >> compactions.
> >
> > High-order freepage made by compaction could be broken by other
> > order-0 allocation attempts, so following high-order allocation attempts
> > could result in new compaction. It would be dependent on workload.
> >
> > Anyway, we should fix cc->order to order. :)
>
> Sure, no doubts about it.
Okay.
>
> >>
> >> So I would expect it's either a fluke, or the second patch made the
> >> difference, to either SLUB or something else making such
> >> fallback-able allocations.
> >>
> >> But hmm, I've never considered the implications of
> >> compact_finished() migratetypes handling on unmovable allocations.
> >> Regardless of cc->order, it often has to free a whole pageblock to
> >> succeed, as it's unlikely it will succeed compacting within a
> >> pageblock already marked as UNMOVABLE. Guess it's to prevent further
> >> fragmentation and that makes sense, but it does make high-order
> >> unmovable allocations problematic. At least the watermark checks for
> >> allowing compaction in the first place are then wrong - we decide
> that based on cc->order, but we in fact need at least a pageblock
> >> worth of space free to actually succeed.
> >
> > I think that the watermark check is okay, but we need an elegant way to decide
> > when compaction should be stopped. I made the following two patches
> > about this. They would make non-movable compaction less
> > aggressive. This is just a draft, so ignore my poor description. :)
> >
> > Could you comment it?
> >
> > --------->8-----------------
> > From bd6b285c38fd94e5ec03a720bed4debae3914bde Mon Sep 17 00:00:00 2001
> > From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> > Date: Mon, 1 Dec 2014 11:56:57 +0900
> > Subject: [PATCH 1/2] mm/page_alloc: expand broken freepages to the proper
> > buddy list when stealing
> >
> > There is odd behaviour when we steal freepages from another migratetype's
> > buddy list. In try_to_steal_freepages(), we move all freepages in
> > the pageblock that the found freepage belongs to to the requested
> > migratetype, in order to mitigate fragmentation. If the number of moved
> > pages is enough to change the pageblock migratetype, there is no problem. If
> > not, we don't change the pageblock migratetype, and we add the broken
> > freepages to the original migratetype's buddy list rather than the requested
> > one. To me, this is odd, because we already moved all freepages in this
> > pageblock to the requested migratetype. This patch fixes this situation by
> > adding the broken freepages to the requested migratetype's buddy list.
> >
>
> Yeah, I have noticed this a while ago, and traced the history of how this
> happened. But surprisingly, just changing this back didn't evaluate as a clear
> win, so I have added some further tuning. I will try to send this ASAP.
I'd like to see it.
Anyway, if you found no remarkable degradation, merging this patch
is better than leaving things as they are. The current logic is odd; we don't
understand how it works or whether it is better or not. So making the logic
understandable is worth considering.
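As a side note for readers following along, the behaviour being discussed can be modelled in userspace roughly as below. This is only an illustrative sketch: the function name, the enum values, and the half-pageblock threshold constant are stand-ins for the kernel's actual code, not copies of it.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative model only: which buddy list the leftover ("broken")
 * freepages land on after try_to_steal_freepages() has moved the
 * pageblock's free pages.  The half-pageblock threshold mirrors the
 * kernel's rule of thumb for changing the pageblock's migratetype. */
enum migratetype { MIGRATE_UNMOVABLE, MIGRATE_MOVABLE, MIGRATE_RECLAIMABLE };

#define PAGES_PER_PAGEBLOCK 512

static enum migratetype broken_freepage_list(int pages_moved,
					     enum migratetype start_mt,
					     enum migratetype fallback_mt,
					     bool patched)
{
	if (patched)
		return start_mt;	/* proposed: always the requested type */
	if (pages_moved >= PAGES_PER_PAGEBLOCK / 2)
		return start_mt;	/* enough moved: ownership changes too */
	return fallback_mt;		/* the old behaviour called "odd" above */
}
```

With few pages moved, the unpatched model hands the broken pages back to the fallback type even though those pages were already moved to start_mt's list, which is exactly the inconsistency the patch removes.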
>
> > This patch introduces a new function that can help decide whether we can
> > steal a page without causing fragmentation. It will be used in
> > a following patch for the compaction finish criteria.
> >
> > Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> > ---
> > +static bool can_steal_freepages(unsigned int order,
> > + int start_mt, int fallback_mt)
> > +{
> > + /*
> > + * When borrowing from MIGRATE_CMA, we need to release the excess
> > + * buddy pages to CMA itself. We also ensure the freepage_migratetype
> > + * is set to CMA so it is returned to the correct freelist in case
> > + * the page ends up being not actually allocated from the pcp lists.
> > + */
> > + if (is_migrate_cma(fallback_mt))
> > + return false;
> >
> > - }
> > + /* Can take ownership for orders >= pageblock_order */
> > + if (order >= pageblock_order)
> > + return true;
> > +
> > + if (order >= pageblock_order / 2 ||
> > + start_mt == MIGRATE_RECLAIMABLE ||
> > + page_group_by_mobility_disabled)
> > + return true;
> >
> > - return fallback_type;
> > + return false;
>
> Note that this is not exactly consistent for compaction and allocation.
Yes, I know. That's why I asked you to ignore my poor description. :)
Sorry about that.
> Allocation will succeed as long as a large enough fallback page exists - it might
> just not steal the extra free pages if the fallback page order is low (or it's not
> a MIGRATE_RECLAIMABLE allocation). But for compaction, with your patches you
> still evaluate whether it can also steal the extra pages, so it's a stricter
> condition. It might make sense, but let's not claim it's fully consistent. And
> it definitely needs evaluation...
IMO, it's stricter, but makes more sense than the current one. Do you agree?
Anyway, it needs evaluation. My quick attempt gave good results. I will share
them after more testing.
Thanks.
* Re: isolate_freepages_block and excessive CPU usage by OSD process
2014-12-03 7:57 ` Joonsoo Kim
@ 2014-12-04 7:30 ` Christian Marie
2014-12-04 7:51 ` Christian Marie
2014-12-05 1:07 ` Joonsoo Kim
0 siblings, 2 replies; 36+ messages in thread
From: Christian Marie @ 2014-12-04 7:30 UTC (permalink / raw)
To: linux-mm
[-- Attachment #1: Type: text/plain, Size: 1308 bytes --]
On Wed, Dec 03, 2014 at 04:57:47PM +0900, Joonsoo Kim wrote:
> It'd be very helpful to get output of
> "trace_event=compaction:*,kmem:mm_page_alloc_extfrag" on the kernel
> with my tracepoint patches below.
>
> See following link. There is 3 patches.
>
> https://lkml.org/lkml/2014/12/3/71
I have just finished testing 3.18rc5 with both of the small patches mentioned
earlier in this thread and 2/3 of your event patches. The second patch
(https://lkml.org/lkml/2014/12/3/72) did not apply due to compaction_suitable
being different (am I missing another patch you are basing this off?).
My compaction_suitable is:
unsigned long compaction_suitable(struct zone *zone, int order)
Results without that second event patch are as follows:
Trace under heavy load but before any spiking system usage or significant
compaction spinning:
http://ponies.io/raw/compaction_events/before.gz
Trace during 100% cpu utilization, much of which was in system:
http://ponies.io/raw/compaction_events/during.gz
perf report at the time of during.gz:
http://ponies.io/raw/compaction_events/perf.png
Interested to see what you make of the limited information. I may be able to
try all of your patches some time next week against whatever they apply cleanly
to, if that is needed.
[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]
* Re: isolate_freepages_block and excessive CPU usage by OSD process
2014-12-04 7:30 ` Christian Marie
@ 2014-12-04 7:51 ` Christian Marie
2014-12-05 1:07 ` Joonsoo Kim
1 sibling, 0 replies; 36+ messages in thread
From: Christian Marie @ 2014-12-04 7:51 UTC (permalink / raw)
To: linux-mm
[-- Attachment #1: Type: text/plain, Size: 1534 bytes --]
An extra note that may or may not be related: I just saw this whilst load
testing:
[177586.215195] swap_free: Unused swap offset entry 0000365b
[177586.215224] BUG: Bad page map in process ceph-osd pte:006cb600
pmd:fea8a8067
[177586.215260] addr:00007f12dff8a000 vm_flags:00100077
anon_vma:ffff8807e6002000 mapping: (null) index:7f12dff8a
[177586.215316] CPU: 22 PID: 48567 Comm: ceph-osd Tainted: GF B
O-------------- 3.10.0-123.9.3.anchor.x86_64 #1
[177586.215318] Hardware name: Dell Inc. PowerEdge R720xd/0X3D66, BIOS 2.2.2
01/16/2014
[177586.215319] 00007f12dff8a000 00000000cdae60bd ffff88062ff6bc70
ffffffff815e23bb
[177586.215324] ffff88062ff6bcb8 ffffffff81167b48 00000000006cb600
00000007f12dff8a
[177586.215329] ffff880fea8a8c50 00000000006cb600 00007f12dff8a000
00007f12dffde000
[177586.215333] Call Trace:
[177586.215337] [<ffffffff815e23bb>] dump_stack+0x19/0x1b
[177586.215340] [<ffffffff81167b48>] print_bad_pte+0x1a8/0x240
[177586.215343] [<ffffffff811694b0>] unmap_page_range+0x5b0/0x860
[177586.215348] [<ffffffff811697e1>] unmap_single_vma+0x81/0xf0
[177586.215353] [<ffffffff8114fade>] ? lru_add_drain_cpu+0xce/0xe0
[177586.215358] [<ffffffff8116a9f5>] zap_page_range+0x105/0x170
[177586.215361] [<ffffffff81167354>] SyS_madvise+0x394/0x810
[177586.215366] [<ffffffff810c30a0>] ? SyS_futex+0x80/0x180
This was on a 3.10 kernel with the two patches mentioned earlier in this
thread. I'm not suggesting it's related, just thought I'd note it as I've never
seen a bad page mapping before.
[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]
* Re: isolate_freepages_block and excessive CPU usage by OSD process
2014-12-03 4:04 ` Christian Marie
2014-12-03 8:05 ` Joonsoo Kim
@ 2014-12-04 23:30 ` Vlastimil Babka
2014-12-05 5:50 ` Christian Marie
1 sibling, 1 reply; 36+ messages in thread
From: Vlastimil Babka @ 2014-12-04 23:30 UTC (permalink / raw)
To: linux-mm
On 3.12.2014 5:04, Christian Marie wrote:
> On Tue, Dec 02, 2014 at 04:06:08PM +1100, Christian Marie wrote:
>> I will attempt to do this tomorrow and should have results in around 24 hours.
> I ran said test today and wasn't able to pinpoint a solid difference between a kernel
> with both patches and one with only the first. The one with both patches "felt"
> a little more responsive, probably a fluke.
>
> I'd really like to write a stress test that simulates what ceph/ipoib is doing
> here so that I can test this in a more scientific manner.
>
> Here is some perf output, the kernel with only the first patch is on the right:
>
> http://ponies.io/raw/before-after.png
>
>
> A note in passing: we left the cluster running with min_free_kbytes set to the
> default last night and within a few hours it started spewing the usual
> pre-patch allocation failures, so whilst this patch appears to make the system
> more responsive under adverse conditions the underlying
> not-keeping-up-with-pressure issue is still there.
>
> There's enough starvation to break single page allocations.
Oh, I would think that if you can't allocate single pages, then there's
little wonder that compaction also spends all its time looking for single
free pages. Did that happen just now for the single page allocations,
or was it always the case?
>
> Keep in mind that this is on a 3.10 kernel with the patches applied so I'm not
> expecting anyone to particularly care. I'm running out of time to test the
> whole cluster at 3.18 is all, I really do think that replicating the allocation
> pattern is the best way forward but my attempts at simply sending a lot of
> packets that look similar with lots of page cache don't do it.
>
> Those allocation failures on 3.10 with both patches look like this:
>
> [73138.803800] ceph-osd: page allocation failure: order:0, mode:0x20
> [73138.803802] CPU: 0 PID: 9214 Comm: ceph-osd Tainted: GF
> O-------------- 3.10.0-123.9.3.anchor.x86_64 #1
> [73138.803803] Hardware name: Dell Inc. PowerEdge R720xd/0X3D66, BIOS 2.2.2
> 01/16/2014
> [73138.803803] 0000000000000020 00000000d6532f99 ffff88081fa03aa0
> ffffffff815e23bb
> [73138.803806] ffff88081fa03b30 ffffffff81147340 00000000ffffffff
> ffff8807da887900
> [73138.803808] ffff88083ffd9e80 ffff8800b2242900 ffff8807d843c050
> 00000000d6532f99
> [73138.803812] Call Trace:
> [73138.803813] <IRQ> [<ffffffff815e23bb>] dump_stack+0x19/0x1b
> [73138.803817] [<ffffffff81147340>] warn_alloc_failed+0x110/0x180
> [73138.803819] [<ffffffff8114b4ee>] __alloc_pages_nodemask+0x91e/0xb20
> [73138.803821] [<ffffffff8152f82a>] ? tcp_v4_rcv+0x67a/0x7c0
> [73138.803823] [<ffffffff81509710>] ? ip_rcv_finish+0x350/0x350
> [73138.803826] [<ffffffff81188369>] alloc_pages_current+0xa9/0x170
> [73138.803828] [<ffffffff814bedb1>] __netdev_alloc_frag+0x91/0x140
> [73138.803831] [<ffffffff814c0df7>] __netdev_alloc_skb+0x77/0xc0
> [73138.803834] [<ffffffffa06b54c5>] ipoib_cm_handle_rx_wc+0xf5/0x940
> [ib_ipoib]
> [73138.803838] [<ffffffffa0625e78>] ? mlx4_ib_poll_cq+0xc8/0x210 [mlx4_ib]
> [73138.803841] [<ffffffffa06a90ed>] ipoib_poll+0x8d/0x150 [ib_ipoib]
> [73138.803843] [<ffffffff814d05aa>] net_rx_action+0x15a/0x250
> [73138.803846] [<ffffffff81067047>] __do_softirq+0xf7/0x290
> [73138.803848] [<ffffffff815f43dc>] call_softirq+0x1c/0x30
> [73138.803851] [<ffffffff81014d25>] do_softirq+0x55/0x90
> [73138.803853] [<ffffffff810673e5>] irq_exit+0x115/0x120
> [73138.803855] [<ffffffff815f4cd8>] do_IRQ+0x58/0xf0
> [73138.803857] [<ffffffff815e9e2d>] common_interrupt+0x6d/0x6d
> [73138.803858] <EOI> [<ffffffff815f2bc0>] ? sysret_audit+0x17/0x21
>
> We get some like this, also:
>
> [ 1293.152415] SLUB: Unable to allocate memory on node -1 (gfp=0x20)
> [ 1293.152416] cache: kmalloc-256, object size: 256, buffer size: 256,
> default order: 1, min order: 0
> [ 1293.152417] node 0: slabs: 1789, objs: 57248, free: 0
> [ 1293.152418] node 1: slabs: 449, objs: 14368, free: 2
>
* Re: isolate_freepages_block and excessive CPU usage by OSD process
2014-12-04 7:30 ` Christian Marie
2014-12-04 7:51 ` Christian Marie
@ 2014-12-05 1:07 ` Joonsoo Kim
2014-12-05 5:55 ` Christian Marie
2014-12-10 15:06 ` Vlastimil Babka
1 sibling, 2 replies; 36+ messages in thread
From: Joonsoo Kim @ 2014-12-05 1:07 UTC (permalink / raw)
To: linux-mm
On Thu, Dec 04, 2014 at 06:30:45PM +1100, Christian Marie wrote:
> On Wed, Dec 03, 2014 at 04:57:47PM +0900, Joonsoo Kim wrote:
> > It'd be very helpful to get output of
> > "trace_event=compaction:*,kmem:mm_page_alloc_extfrag" on the kernel
> > with my tracepoint patches below.
> >
> > See following link. There is 3 patches.
> >
> > https://lkml.org/lkml/2014/12/3/71
>
> I have just finished testing 3.18rc5 with both of the small patches mentioned
> earlier in this thread and 2/3 of your event patches. The second patch
> (https://lkml.org/lkml/2014/12/3/72) did not apply due to compaction_suitable
> being different (am I missing another patch you are basing this off?).
In fact, I'm using the next-20141124 kernel, not just the mainline one. There
are a lot of fixes from Vlastimil there, which may have caused the apply failure.
But it's not that important in this case. I have gotten enough information
about this problem from your log below.
>
> My compaction_suitable is:
>
> unsigned long compaction_suitable(struct zone *zone, int order)
>
> Results without that second event patch are as follows:
>
> Trace under heavy load but before any spiking system usage or significant
> compaction spinning:
>
> http://ponies.io/raw/compaction_events/before.gz
>
> Trace during 100% cpu utilization, much of which was in system:
>
> http://ponies.io/raw/compaction_events/during.gz
It looks like there is no stop condition in isolate_freepages(). In
this period, your system does not have enough freepages, and many processes
try to find freepages for compaction. Because there is no stop
condition, they iterate over almost the whole memory range every time. At the
bottom of this mail, I attach one more fix, although I haven't tested it
yet. It will cause a lot of failures for the allocations your network layer
needs. Those are order-5 allocation requests with the __GFP_NOWARN gfp flag,
so I assume that a failed allocation request is not a problem,
but I'm not sure.
The watermark check in this patch needs cc->classzone_idx and cc->alloc_flags,
which come from Vlastimil's recent changes. If you want to test it with
3.18rc5, please remove them. It doesn't matter much.
Anyway, I hope it also helps you.
> perf report at the time of during.gz:
>
> http://ponies.io/raw/compaction_events/perf.png
Judging from this perf report, my second patch would have no impact
on your system. I thought that this excessive cpu usage started in
SLUB, but an order-5 kmalloc request is just forwarded to the page
allocator in the current SLUB implementation, so patch 2 from me would not
help with this problem.
By the way, is it common for the network layer to need order-5 allocations?
IMHO, it'd be better to avoid such high-order requests, because the kernel
easily fails to handle this kind of request.
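To make the arithmetic behind the order-5 requests concrete, here is a small sketch of how a buffer size maps to a buddy-allocator order (assuming 4 KiB pages; the helper name is mine, and the exact skb overhead is an assumption, since headroom and metadata vary):

```c
#include <assert.h>

#define PAGE_SIZE 4096UL

/* Smallest buddy order whose block (2^order pages) can hold `bytes`. */
static unsigned int order_for_size(unsigned long bytes)
{
	unsigned long pages = (bytes + PAGE_SIZE - 1) / PAGE_SIZE;
	unsigned int order = 0;

	while ((1UL << order) < pages)
		order++;
	return order;
}
```

A 65535-byte payload alone fits in order 4 (16 pages), but once skb headroom and shared-info overhead push the buffer past 64 KiB, the request becomes order 5 - which is presumably why the ipoib receive path in the traces above hits order-5 allocations.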
Thanks.
>
> Interested to see what you make of the limited information. I may be able to
> try all of your patches some time next week against whatever they apply cleanly
> to, if that is needed.
------------>8-----------------
* Re: isolate_freepages_block and excessive CPU usage by OSD process
2014-12-04 23:30 ` Vlastimil Babka
@ 2014-12-05 5:50 ` Christian Marie
0 siblings, 0 replies; 36+ messages in thread
From: Christian Marie @ 2014-12-05 5:50 UTC (permalink / raw)
To: linux-mm
[-- Attachment #1: Type: text/plain, Size: 644 bytes --]
On Fri, Dec 05, 2014 at 12:30:37AM +0100, Vlastimil Babka wrote:
> Oh, I would think that if you can't allocate single pages, then there's
> little wonder that compaction also spends all its time looking for single
> free pages. Did that happen just now for the single page allocations,
> or was it always the case?
This has always been the case with the default min_free_kbytes, given enough
pressure for enough time. I have just been hoping that compaction should
be "smart" enough to let reclaim do its stuff quickly if single page
allocations are failing.
Raising min_free_kbytes makes these order-0 allocation failures never happen.
[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]
* Re: isolate_freepages_block and excessive CPU usage by OSD process
2014-12-05 1:07 ` Joonsoo Kim
@ 2014-12-05 5:55 ` Christian Marie
2014-12-08 7:19 ` Joonsoo Kim
2014-12-10 15:06 ` Vlastimil Babka
1 sibling, 1 reply; 36+ messages in thread
From: Christian Marie @ 2014-12-05 5:55 UTC (permalink / raw)
To: linux-mm
[-- Attachment #1: Type: text/plain, Size: 1958 bytes --]
On Fri, Dec 05, 2014 at 10:07:33AM +0900, Joonsoo Kim wrote:
> It looks like there is no stop condition in isolate_freepages(). In
> this period, your system does not have enough freepages, and many processes
> try to find freepages for compaction. Because there is no stop
> condition, they iterate over almost the whole memory range every time. At the
> bottom of this mail, I attach one more fix, although I haven't tested it
> yet. It will cause a lot of failures for the allocations your network layer
> needs. Those are order-5 allocation requests with the __GFP_NOWARN gfp flag,
> so I assume that a failed allocation request is not a problem,
> but I'm not sure.
>
> The watermark check in this patch needs cc->classzone_idx and cc->alloc_flags,
> which come from Vlastimil's recent changes. If you want to test it with
> 3.18rc5, please remove them. It doesn't matter much.
>
> Anyway, I hope it also helps you.
Thank you, I will try this next week. If it improves the situation do you think
that we have a good chance of merging it upstream? I should think that
backporting such a fix would be a hard sell.
> Judging from this perf report, my second patch would have no impact
> on your system. I thought that this excessive cpu usage started in
> SLUB, but an order-5 kmalloc request is just forwarded to the page
> allocator in the current SLUB implementation, so patch 2 from me would not
> help with this problem.
I agree with this.
>
> By the way, is it common for the network layer to need order-5 allocations?
> IMHO, it'd be better to avoid such high-order requests, because the kernel
> easily fails to handle this kind of request.
Yes, agreed. I'm trying to sort that issue out concurrently. I'm currently
collaborating on a patch to get Scatter Gather support for the network layer so
that we can avoid these huge allocations. They are large because ipoib in
Connected Mode wants a very large MTU (around 65535) and does not do SG in CM.
[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]
* Re: isolate_freepages_block and excessive CPU usage by OSD process
2014-12-05 5:55 ` Christian Marie
@ 2014-12-08 7:19 ` Joonsoo Kim
0 siblings, 0 replies; 36+ messages in thread
From: Joonsoo Kim @ 2014-12-08 7:19 UTC (permalink / raw)
To: linux-mm
On Fri, Dec 05, 2014 at 04:55:44PM +1100, Christian Marie wrote:
> On Fri, Dec 05, 2014 at 10:07:33AM +0900, Joonsoo Kim wrote:
> > It looks like there is no stop condition in isolate_freepages(). In
> > this period, your system does not have enough freepages, and many processes
> > try to find freepages for compaction. Because there is no stop
> > condition, they iterate over almost the whole memory range every time. At the
> > bottom of this mail, I attach one more fix, although I haven't tested it
> > yet. It will cause a lot of failures for the allocations your network layer
> > needs. Those are order-5 allocation requests with the __GFP_NOWARN gfp flag,
> > so I assume that a failed allocation request is not a problem,
> > but I'm not sure.
> >
> > The watermark check in this patch needs cc->classzone_idx and cc->alloc_flags,
> > which come from Vlastimil's recent changes. If you want to test it with
> > 3.18rc5, please remove them. It doesn't matter much.
> >
> > Anyway, I hope it also helps you.
>
> Thank you, I will try this next week. If it improves the situation do you think
> that we have a good chance of merging it upstream? I should think that
> backporting such a fix would be a hard sell.
I think that if it improves the situation, it could be merged upstream.
If the patch fixes a real issue, it is a candidate for the stable tree.
Thanks.
* Re: isolate_freepages_block and excessive CPU usage by OSD process
2014-12-05 1:07 ` Joonsoo Kim
2014-12-05 5:55 ` Christian Marie
@ 2014-12-10 15:06 ` Vlastimil Babka
2014-12-11 3:08 ` Joonsoo Kim
1 sibling, 1 reply; 36+ messages in thread
From: Vlastimil Babka @ 2014-12-10 15:06 UTC (permalink / raw)
To: Joonsoo Kim, linux-mm
On 12/05/2014 02:07 AM, Joonsoo Kim wrote:
> ------------>8-----------------
> From b7daa232c327a4ebbb48ca0538a2dbf9ca83ca1f Mon Sep 17 00:00:00 2001
> From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Date: Fri, 5 Dec 2014 09:38:30 +0900
> Subject: [PATCH] mm/compaction: stop the compaction if there isn't enough
> freepage
>
> After compaction_suitable() has passed, there is no check whether the system
> has enough memory to compact, and we blindly try to find freepages by
> iterating over the whole memory range. This causes excessive cpu usage in
> low free memory conditions, and the compaction eventually fails. It makes
> sense for compaction to be stopped if there aren't enough freepages. So,
> this patch adds a watermark check to isolate_freepages() in order to stop
> the compaction in this case.
>
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> ---
> mm/compaction.c | 9 +++++++++
> 1 file changed, 9 insertions(+)
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index e005620..31c4009 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -828,6 +828,7 @@ static void isolate_freepages(struct compact_control *cc)
> unsigned long low_pfn; /* lowest pfn scanner is able to scan */
> int nr_freepages = cc->nr_freepages;
> struct list_head *freelist = &cc->freepages;
> + unsigned long watermark = low_wmark_pages(zone) + (2UL << cc->order);
Given that we maybe have already isolated up to 31 free pages (if
cc->nr_migratepages is the maximum 32), then this is somewhat stricter
than the check in isolation_suitable() (when nothing was isolated yet)
and may interrupt us prematurely. We should allow for some slack.
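The concern can be made concrete with a small userspace model (the helper name and the slack term are mine, not from the patch; only the `low_wmark + (2 << order)` threshold comes from the code above):

```c
#include <assert.h>
#include <stdbool.h>

/* Model of the proposed stop condition: the patch requires
 * free_pages >= low_wmark + (2 << order).  Adding back the pages this
 * compaction run has already isolated provides the "slack" suggested
 * above, so a scanner that just isolated 31 pages is not stopped
 * prematurely by its own isolations. */
static bool can_continue_compaction(unsigned long free_pages,
				    unsigned long low_wmark,
				    unsigned int order,
				    unsigned long already_isolated)
{
	unsigned long watermark = low_wmark + (2UL << order);

	return free_pages + already_isolated >= watermark;
}
```

Without the slack term (already_isolated = 0), a run that temporarily pulled 31 pages off the free lists could fail the check even though the system had enough memory a moment earlier.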
>
> /*
> * Initialise the free scanner. The starting point is where we last
> @@ -903,6 +904,14 @@ static void isolate_freepages(struct compact_control *cc)
> */
> if (cc->contended)
> break;
> +
> + /*
> + * Watermarks for order-0 must be met for compaction.
> + * See compaction_suitable for more detailed explanation.
> + */
> + if (!zone_watermark_ok(zone, 0, watermark,
> + cc->classzone_idx, cc->alloc_flags))
> + break;
> }
I'm also a bit concerned about the overhead of doing this in each pageblock.
I wonder if there could be a mechanism where a process entering reclaim
or compaction with the goal of meeting the watermarks to allocate,
should increase the watermarks needed for further parallel allocation
attempts to pass. Then it shouldn't happen that somebody else steals the
memory.
> /* split_free_page does not map the pages */
>
* Re: isolate_freepages_block and excessive CPU usage by OSD process
2014-12-10 15:06 ` Vlastimil Babka
@ 2014-12-11 3:08 ` Joonsoo Kim
0 siblings, 0 replies; 36+ messages in thread
From: Joonsoo Kim @ 2014-12-11 3:08 UTC (permalink / raw)
To: Vlastimil Babka; +Cc: linux-mm
On Wed, Dec 10, 2014 at 04:06:19PM +0100, Vlastimil Babka wrote:
> On 12/05/2014 02:07 AM, Joonsoo Kim wrote:
> >------------>8-----------------
> > From b7daa232c327a4ebbb48ca0538a2dbf9ca83ca1f Mon Sep 17 00:00:00 2001
> >From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> >Date: Fri, 5 Dec 2014 09:38:30 +0900
> >Subject: [PATCH] mm/compaction: stop the compaction if there isn't enough
> > freepage
> >
> >After compaction_suitable() has passed, there is no check whether the system
> >has enough memory to compact, and we blindly try to find freepages by
> >iterating over the whole memory range. This causes excessive cpu usage in
> >low free memory conditions, and the compaction eventually fails. It makes
> >sense for compaction to be stopped if there aren't enough freepages. So,
> >this patch adds a watermark check to isolate_freepages() in order to stop
> >the compaction in this case.
> >
> >Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> >---
> > mm/compaction.c | 9 +++++++++
> > 1 file changed, 9 insertions(+)
> >
> >diff --git a/mm/compaction.c b/mm/compaction.c
> >index e005620..31c4009 100644
> >--- a/mm/compaction.c
> >+++ b/mm/compaction.c
> >@@ -828,6 +828,7 @@ static void isolate_freepages(struct compact_control *cc)
> > unsigned long low_pfn; /* lowest pfn scanner is able to scan */
> > int nr_freepages = cc->nr_freepages;
> > struct list_head *freelist = &cc->freepages;
> >+ unsigned long watermark = low_wmark_pages(zone) + (2UL << cc->order);
>
> Given that we maybe have already isolated up to 31 free pages (if
> cc->nr_migratepages is the maximum 32), then this is somewhat
> stricter than the check in isolation_suitable() (when nothing was
> isolated yet) and may interrupt us prematurely. We should allow for
> some slack.
Okay. Will allow some slack.
>
> >
> > /*
> > * Initialise the free scanner. The starting point is where we last
> >@@ -903,6 +904,14 @@ static void isolate_freepages(struct compact_control *cc)
> > */
> > if (cc->contended)
> > break;
> >+
> >+ /*
> >+ * Watermarks for order-0 must be met for compaction.
> >+ * See compaction_suitable for more detailed explanation.
> >+ */
> >+ if (!zone_watermark_ok(zone, 0, watermark,
> >+ cc->classzone_idx, cc->alloc_flags))
> >+ break;
> > }
>
> I'm also a bit concerned about the overhead of doing this in each pageblock.
Yep, we can do it whenever SWAP_CLUSTER_MAX pageblocks have been scanned. That
will reduce the overhead somewhat. I will change it.
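A rough model of how much that helps (SWAP_CLUSTER_MAX is 32 in the kernel; the counting helper below is illustrative, not kernel code):

```c
#include <assert.h>

#define SWAP_CLUSTER_MAX 32UL

/* Count how many watermark checks the free scanner performs while
 * visiting `pageblocks` pageblocks: one per block, or one per
 * SWAP_CLUSTER_MAX blocks under the rate-limited scheme proposed above. */
static unsigned long watermark_checks(unsigned long pageblocks,
				      int rate_limited)
{
	unsigned long scanned, checks = 0;

	for (scanned = 1; scanned <= pageblocks; scanned++)
		if (!rate_limited || scanned % SWAP_CLUSTER_MAX == 0)
			checks++;
	return checks;
}
```

For a scan covering 1024 pageblocks, the rate-limited variant performs 32 checks instead of 1024, while still noticing a depleted zone within at most SWAP_CLUSTER_MAX pageblocks.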
>
> I wonder if there could be a mechanism where a process entering
> reclaim or compaction with the goal of meeting the watermarks to
> allocate, should increase the watermarks needed for further parallel
> allocation attempts to pass. Then it shouldn't happen that somebody
> else steals the memory.
I don't know either.
Thanks.
>
> > /* split_free_page does not map the pages */
> >
>