linux-mm.kvack.org archive mirror
* isolate_freepages_block and excessive CPU usage by OSD process
@ 2014-11-15 11:48 Andrey Korolyov
  2014-11-15 16:32 ` Vlastimil Babka
  0 siblings, 1 reply; 36+ messages in thread
From: Andrey Korolyov @ 2014-11-15 11:48 UTC (permalink / raw)
  To: ceph-users@lists.ceph.com; +Cc: riel, Mark Nelson, linux-mm

[-- Attachment #1: Type: text/plain, Size: 2957 bytes --]

Hello,

I recently found that the OSD daemons, under certain conditions
(moderate VM pressure, moderate I/O, slightly altered VM settings), can
go into a loop involving isolate_freepages and effectively hurt Ceph
cluster performance. I found this thread
https://lkml.org/lkml/2012/6/27/545, but it looks like the significant
decrease of bdi max_ratio did not help even a bit.
Although I have approximately half of physical memory available for
cache-like use, the mm problem persists, so I would like to hear
suggestions from other people. In the current testing iteration I
decreased vfs_cache_pressure to 10 and raised vm.dirty_ratio and
vm.dirty_background_ratio to 15 and 10 respectively (because the default
values are too spiky for my workloads). The host kernel is linux-stable
3.10.

Non-default VM settings are:
vm.swappiness = 5
vm.dirty_ratio=10
vm.dirty_background_ratio=5
bdi max_ratio was 100%, right now it is 20%; at a glance the situation
looks worse, because an unstable OSD host causes a domino-like effect
on other hosts, which start to flap too, and only a cache flush via
drop_caches helps.
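
For reference, roughly how I apply these settings on the hosts (the bdi
device name below is illustrative, it differs per machine and per disk):

  # non-default VM tunables, values as listed above
  sysctl -w vm.swappiness=5
  sysctl -w vm.dirty_ratio=10
  sysctl -w vm.dirty_background_ratio=5

  # per-backing-device writeback limit, previously 100, now 20
  echo 20 > /sys/class/bdi/8:0/max_ratio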

Unfortunately there is no slab info from the "exhausted" state due to the
sporadic nature of this bug; I will try to catch it next time.

slabtop (normal state):
 Active / Total Objects (% used)    : 8675843 / 8965833 (96.8%)
 Active / Total Slabs (% used)      : 224858 / 224858 (100.0%)
 Active / Total Caches (% used)     : 86 / 132 (65.2%)
 Active / Total Size (% used)       : 1152171.37K / 1253116.37K (91.9%)
 Minimum / Average / Maximum Object : 0.01K / 0.14K / 15.75K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
6890130 6889185  99%    0.10K 176670       39    706680K buffer_head
751232 721707  96%    0.06K  11738       64     46952K kmalloc-64
251636 226228  89%    0.55K   8987       28    143792K radix_tree_node
121696  45710  37%    0.25K   3803       32     30424K kmalloc-256
113022  80618  71%    0.19K   2691       42     21528K dentry
112672  35160  31%    0.50K   3521       32     56336K kmalloc-512
 73136  72800  99%    0.07K   1306       56      5224K Acpi-ParseExt
 61696  58644  95%    0.02K    241      256       964K kmalloc-16
 54348  36649  67%    0.38K   1294       42     20704K ip6_dst_cache
 53136  51787  97%    0.11K   1476       36      5904K sysfs_dir_cache
 51200  50724  99%    0.03K    400      128      1600K kmalloc-32
 49120  46105  93%    1.00K   1535       32     49120K xfs_inode
 30702  30702 100%    0.04K    301      102      1204K Acpi-Namespace
 28224  25742  91%    0.12K    882       32      3528K kmalloc-128
 28028  22691  80%    0.18K    637       44      5096K vm_area_struct
 28008  28008 100%    0.22K    778       36      6224K xfs_ili
 18944  18944 100%    0.01K     37      512       148K kmalloc-8
 16576  15154  91%    0.06K    259       64      1036K anon_vma
 16475  14200  86%    0.16K    659       25      2636K sigqueue

zoneinfo (normal state, attached)

[-- Attachment #2: zoneinfo --]
[-- Type: application/octet-stream, Size: 15098 bytes --]

Node 0, zone      DMA
  pages free     3973
        min      5
        low      6
        high     7
        scanned  0
        spanned  4095
        present  3994
        managed  3973
    nr_free_pages 3973
    nr_inactive_anon 0
    nr_active_anon 0
    nr_inactive_file 0
    nr_active_file 0
    nr_unevictable 0
    nr_mlock     0
    nr_anon_pages 0
    nr_mapped    0
    nr_file_pages 0
    nr_dirty     0
    nr_writeback 0
    nr_slab_reclaimable 0
    nr_slab_unreclaimable 0
    nr_page_table_pages 0
    nr_kernel_stack 0
    nr_unstable  0
    nr_bounce    0
    nr_vmscan_write 0
    nr_vmscan_immediate_reclaim 0
    nr_writeback_temp 0
    nr_isolated_anon 0
    nr_isolated_file 0
    nr_shmem     0
    nr_dirtied   0
    nr_written   0
    numa_hit     0
    numa_miss    0
    numa_foreign 0
    numa_interleave 0
    numa_local   0
    numa_other   0
    nr_anon_transparent_hugepages 0
    nr_free_cma  0
        protection: (0, 1914, 32121, 32121)
  pagesets
    cpu: 0
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 10
    cpu: 1
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 10
    cpu: 2
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 10
    cpu: 3
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 10
    cpu: 4
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 10
    cpu: 5
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 10
    cpu: 6
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 10
    cpu: 7
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 10
    cpu: 8
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 10
    cpu: 9
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 10
    cpu: 10
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 10
    cpu: 11
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 10
    cpu: 12
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 10
    cpu: 13
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 10
    cpu: 14
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 10
    cpu: 15
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 10
    cpu: 16
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 10
    cpu: 17
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 10
    cpu: 18
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 10
    cpu: 19
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 10
    cpu: 20
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 10
    cpu: 21
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 10
    cpu: 22
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 10
    cpu: 23
              count: 0
              high:  0
              batch: 1
  vm stats threshold: 10
  all_unreclaimable: 1
  start_pfn:         1
  inactive_ratio:    1
Node 0, zone    DMA32
  pages free     32223
        min      669
        low      836
        high     1003
        scanned  0
        spanned  1044480
        present  511926
        managed  490239
    nr_free_pages 32223
    nr_inactive_anon 277
    nr_active_anon 45533
    nr_inactive_file 227698
    nr_active_file 122112
    nr_unevictable 4760
    nr_mlock     4760
    nr_anon_pages 49781
    nr_mapped    133
    nr_file_pages 350087
    nr_dirty     160
    nr_writeback 0
    nr_slab_reclaimable 20418
    nr_slab_unreclaimable 30228
    nr_page_table_pages 190
    nr_kernel_stack 436
    nr_unstable  0
    nr_bounce    0
    nr_vmscan_write 2
    nr_vmscan_immediate_reclaim 3499
    nr_writeback_temp 0
    nr_isolated_anon 0
    nr_isolated_file 0
    nr_shmem     277
    nr_dirtied   609807631
    nr_written   609734467
    numa_hit     6979761185
    numa_miss    3941324201
    numa_foreign 0
    numa_interleave 0
    numa_local   6979751851
    numa_other   3941333535
    nr_anon_transparent_hugepages 1
    nr_free_cma  0
        protection: (0, 0, 30206, 30206)
  pagesets
    cpu: 0
              count: 12
              high:  186
              batch: 31
  vm stats threshold: 50
    cpu: 1
              count: 8
              high:  186
              batch: 31
  vm stats threshold: 50
    cpu: 2
              count: 60
              high:  186
              batch: 31
  vm stats threshold: 50
    cpu: 3
              count: 45
              high:  186
              batch: 31
  vm stats threshold: 50
    cpu: 4
              count: 12
              high:  186
              batch: 31
  vm stats threshold: 50
    cpu: 5
              count: 3
              high:  186
              batch: 31
  vm stats threshold: 50
    cpu: 6
              count: 49
              high:  186
              batch: 31
  vm stats threshold: 50
    cpu: 7
              count: 28
              high:  186
              batch: 31
  vm stats threshold: 50
    cpu: 8
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 50
    cpu: 9
              count: 5
              high:  186
              batch: 31
  vm stats threshold: 50
    cpu: 10
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 50
    cpu: 11
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 50
    cpu: 12
              count: 19
              high:  186
              batch: 31
  vm stats threshold: 50
    cpu: 13
              count: 1
              high:  186
              batch: 31
  vm stats threshold: 50
    cpu: 14
              count: 12
              high:  186
              batch: 31
  vm stats threshold: 50
    cpu: 15
              count: 162
              high:  186
              batch: 31
  vm stats threshold: 50
    cpu: 16
              count: 14
              high:  186
              batch: 31
  vm stats threshold: 50
    cpu: 17
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 50
    cpu: 18
              count: 3
              high:  186
              batch: 31
  vm stats threshold: 50
    cpu: 19
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 50
    cpu: 20
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 50
    cpu: 21
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 50
    cpu: 22
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 50
    cpu: 23
              count: 0
              high:  186
              batch: 31
  vm stats threshold: 50
  all_unreclaimable: 0
  start_pfn:         4096
  inactive_ratio:    3
Node 0, zone   Normal
  pages free     32960
        min      10568
        low      13210
        high     15852
        scanned  0
        spanned  7864320
        present  7864320
        managed  7732828
    nr_free_pages 32960
    nr_inactive_anon 11191
    nr_active_anon 3036913
    nr_inactive_file 3223885
    nr_active_file 1127966
    nr_unevictable 4086
    nr_mlock     4086
    nr_anon_pages 2363745
    nr_mapped    34191
    nr_file_pages 4358872
    nr_dirty     2926
    nr_writeback 0
    nr_slab_reclaimable 82623
    nr_slab_unreclaimable 24026
    nr_page_table_pages 12611
    nr_kernel_stack 1842
    nr_unstable  0
    nr_bounce    0
    nr_vmscan_write 59
    nr_vmscan_immediate_reclaim 29602
    nr_writeback_temp 0
    nr_isolated_anon 0
    nr_isolated_file 0
    nr_shmem     6348
    nr_dirtied   8347305401
    nr_written   8343222456
    numa_hit     49594613817
    numa_miss    635457096
    numa_foreign 391251876
    numa_interleave 20063
    numa_local   49594490600
    numa_other   635580313
    nr_anon_transparent_hugepages 1331
    nr_free_cma  0
        protection: (0, 0, 0, 0)
  pagesets
    cpu: 0
              count: 58
              high:  186
              batch: 31
  vm stats threshold: 90
    cpu: 1
              count: 161
              high:  186
              batch: 31
  vm stats threshold: 90
    cpu: 2
              count: 159
              high:  186
              batch: 31
  vm stats threshold: 90
    cpu: 3
              count: 170
              high:  186
              batch: 31
  vm stats threshold: 90
    cpu: 4
              count: 159
              high:  186
              batch: 31
  vm stats threshold: 90
    cpu: 5
              count: 78
              high:  186
              batch: 31
  vm stats threshold: 90
    cpu: 6
              count: 64
              high:  186
              batch: 31
  vm stats threshold: 90
    cpu: 7
              count: 151
              high:  186
              batch: 31
  vm stats threshold: 90
    cpu: 8
              count: 182
              high:  186
              batch: 31
  vm stats threshold: 90
    cpu: 9
              count: 173
              high:  186
              batch: 31
  vm stats threshold: 90
    cpu: 10
              count: 164
              high:  186
              batch: 31
  vm stats threshold: 90
    cpu: 11
              count: 165
              high:  186
              batch: 31
  vm stats threshold: 90
    cpu: 12
              count: 176
              high:  186
              batch: 31
  vm stats threshold: 90
    cpu: 13
              count: 156
              high:  186
              batch: 31
  vm stats threshold: 90
    cpu: 14
              count: 157
              high:  186
              batch: 31
  vm stats threshold: 90
    cpu: 15
              count: 135
              high:  186
              batch: 31
  vm stats threshold: 90
    cpu: 16
              count: 158
              high:  186
              batch: 31
  vm stats threshold: 90
    cpu: 17
              count: 172
              high:  186
              batch: 31
  vm stats threshold: 90
    cpu: 18
              count: 167
              high:  186
              batch: 31
  vm stats threshold: 90
    cpu: 19
              count: 171
              high:  186
              batch: 31
  vm stats threshold: 90
    cpu: 20
              count: 169
              high:  186
              batch: 31
  vm stats threshold: 90
    cpu: 21
              count: 157
              high:  186
              batch: 31
  vm stats threshold: 90
    cpu: 22
              count: 177
              high:  186
              batch: 31
  vm stats threshold: 90
    cpu: 23
              count: 161
              high:  186
              batch: 31
  vm stats threshold: 90
  all_unreclaimable: 0
  start_pfn:         1048576
  inactive_ratio:    17
Node 1, zone   Normal
  pages free     14880
        min      11284
        low      14105
        high     16926
        scanned  0
        spanned  8388608
        present  8388608
        managed  8257056
    nr_free_pages 14880
    nr_inactive_anon 13140
    nr_active_anon 2569269
    nr_inactive_file 3715797
    nr_active_file 1659970
    nr_unevictable 15464
    nr_mlock     15464
    nr_anon_pages 1310698
    nr_mapped    45301
    nr_file_pages 5387102
    nr_dirty     3551
    nr_writeback 0
    nr_slab_reclaimable 135572
    nr_slab_unreclaimable 24093
    nr_page_table_pages 6677
    nr_kernel_stack 775
    nr_unstable  0
    nr_bounce    0
    nr_vmscan_write 0
    nr_vmscan_immediate_reclaim 57854
    nr_writeback_temp 0
    nr_isolated_anon 0
    nr_isolated_file 0
    nr_shmem     10317
    nr_dirtied   13325911763
    nr_written   13320630581
    numa_hit     43510008565
    numa_miss    391251876
    numa_foreign 4576781297
    numa_interleave 19867
    numa_local   43509973410
    numa_other   391287031
    nr_anon_transparent_hugepages 2492
    nr_free_cma  0
        protection: (0, 0, 0, 0)
  pagesets
    cpu: 0
              count: 155
              high:  186
              batch: 31
  vm stats threshold: 90
    cpu: 1
              count: 173
              high:  186
              batch: 31
  vm stats threshold: 90
    cpu: 2
              count: 104
              high:  186
              batch: 31
  vm stats threshold: 90
    cpu: 3
              count: 168
              high:  186
              batch: 31
  vm stats threshold: 90
    cpu: 4
              count: 158
              high:  186
              batch: 31
  vm stats threshold: 90
    cpu: 5
              count: 169
              high:  186
              batch: 31
  vm stats threshold: 90
    cpu: 6
              count: 53
              high:  186
              batch: 31
  vm stats threshold: 90
    cpu: 7
              count: 81
              high:  186
              batch: 31
  vm stats threshold: 90
    cpu: 8
              count: 63
              high:  186
              batch: 31
  vm stats threshold: 90
    cpu: 9
              count: 168
              high:  186
              batch: 31
  vm stats threshold: 90
    cpu: 10
              count: 46
              high:  186
              batch: 31
  vm stats threshold: 90
    cpu: 11
              count: 28
              high:  186
              batch: 31
  vm stats threshold: 90
    cpu: 12
              count: 161
              high:  186
              batch: 31
  vm stats threshold: 90
    cpu: 13
              count: 177
              high:  186
              batch: 31
  vm stats threshold: 90
    cpu: 14
              count: 155
              high:  186
              batch: 31
  vm stats threshold: 90
    cpu: 15
              count: 181
              high:  186
              batch: 31
  vm stats threshold: 90
    cpu: 16
              count: 164
              high:  186
              batch: 31
  vm stats threshold: 90
    cpu: 17
              count: 185
              high:  186
              batch: 31
  vm stats threshold: 90
    cpu: 18
              count: 69
              high:  186
              batch: 31
  vm stats threshold: 90
    cpu: 19
              count: 75
              high:  186
              batch: 31
  vm stats threshold: 90
    cpu: 20
              count: 151
              high:  186
              batch: 31
  vm stats threshold: 90
    cpu: 21
              count: 91
              high:  186
              batch: 31
  vm stats threshold: 90
    cpu: 22
              count: 51
              high:  186
              batch: 31
  vm stats threshold: 90
    cpu: 23
              count: 56
              high:  186
              batch: 31
  vm stats threshold: 90
  all_unreclaimable: 0
  start_pfn:         8912896
  inactive_ratio:    17


* Re: isolate_freepages_block and excessive CPU usage by OSD process
  2014-11-15 11:48 Andrey Korolyov
@ 2014-11-15 16:32 ` Vlastimil Babka
  2014-11-15 17:10   ` Andrey Korolyov
  0 siblings, 1 reply; 36+ messages in thread
From: Vlastimil Babka @ 2014-11-15 16:32 UTC (permalink / raw)
  To: Andrey Korolyov, ceph-users@lists.ceph.com
  Cc: riel, Mark Nelson, linux-mm, David Rientjes, Joonsoo Kim

On 11/15/2014 12:48 PM, Andrey Korolyov wrote:
> Hello,
> 
> I had found recently that the OSD daemons under certain conditions
> (moderate vm pressure, moderate I/O, slightly altered vm settings) can
> go into loop involving isolate_freepages and effectively hit Ceph
> cluster performance. I found this thread

Do you feel it is a regression, compared to some older kernel version or something?

> https://lkml.org/lkml/2012/6/27/545, but looks like that the
> significant decrease of bdi max_ratio did not helped even for a bit.
> Although I have approximately a half of physical memory for cache-like
> stuff, the problem with mm persists, so I would like to try
> suggestions from the other people. In current testing iteration I had
> decreased vfs_cache_pressure to 10 and raised vm_dirty_ratio and
> background ratio to 15 and 10 correspondingly (because default values
> are too spiky for mine workloads). The host kernel is a linux-stable
> 3.10.

Well, I'm glad to hear it's not 3.18-rc3 this time. But I would recommend trying
it, or at least 3.17. A lot of patches went in to reduce compaction overhead
(especially for transparent hugepages) since 3.10.

> Non-default VM settings are:
> vm.swappiness = 5
> vm.dirty_ratio=10
> vm.dirty_background_ratio=5
> bdi_max_ratio was 100%, right now 20%, at a glance it looks like the
> situation worsened, because unstable OSD host cause domino-like effect
> on other hosts, which are starting to flap too and only cache flush
> via drop_caches is helping.
> 
> Unfortunately there are no slab info from "exhausted" state due to
> sporadic nature of this bug, will try to catch next time.
> 
> slabtop (normal state):
>  Active / Total Objects (% used)    : 8675843 / 8965833 (96.8%)
>  Active / Total Slabs (% used)      : 224858 / 224858 (100.0%)
>  Active / Total Caches (% used)     : 86 / 132 (65.2%)
>  Active / Total Size (% used)       : 1152171.37K / 1253116.37K (91.9%)
>  Minimum / Average / Maximum Object : 0.01K / 0.14K / 15.75K
> 
>   OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
> 6890130 6889185  99%    0.10K 176670       39    706680K buffer_head
> 751232 721707  96%    0.06K  11738       64     46952K kmalloc-64
> 251636 226228  89%    0.55K   8987       28    143792K radix_tree_node
> 121696  45710  37%    0.25K   3803       32     30424K kmalloc-256
> 113022  80618  71%    0.19K   2691       42     21528K dentry
> 112672  35160  31%    0.50K   3521       32     56336K kmalloc-512
>  73136  72800  99%    0.07K   1306       56      5224K Acpi-ParseExt
>  61696  58644  95%    0.02K    241      256       964K kmalloc-16
>  54348  36649  67%    0.38K   1294       42     20704K ip6_dst_cache
>  53136  51787  97%    0.11K   1476       36      5904K sysfs_dir_cache
>  51200  50724  99%    0.03K    400      128      1600K kmalloc-32
>  49120  46105  93%    1.00K   1535       32     49120K xfs_inode
>  30702  30702 100%    0.04K    301      102      1204K Acpi-Namespace
>  28224  25742  91%    0.12K    882       32      3528K kmalloc-128
>  28028  22691  80%    0.18K    637       44      5096K vm_area_struct
>  28008  28008 100%    0.22K    778       36      6224K xfs_ili
>  18944  18944 100%    0.01K     37      512       148K kmalloc-8
>  16576  15154  91%    0.06K    259       64      1036K anon_vma
>  16475  14200  86%    0.16K    659       25      2636K sigqueue
> 
> zoneinfo (normal state, attached)
> 


* Re: isolate_freepages_block and excessive CPU usage by OSD process
  2014-11-15 16:32 ` Vlastimil Babka
@ 2014-11-15 17:10   ` Andrey Korolyov
  2014-11-15 18:45     ` Vlastimil Babka
  0 siblings, 1 reply; 36+ messages in thread
From: Andrey Korolyov @ 2014-11-15 17:10 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: ceph-users@lists.ceph.com, riel, Mark Nelson, linux-mm,
	David Rientjes, Joonsoo Kim

On Sat, Nov 15, 2014 at 7:32 PM, Vlastimil Babka <vbabka@suse.cz> wrote:
> On 11/15/2014 12:48 PM, Andrey Korolyov wrote:
>> Hello,
>>
>> I had found recently that the OSD daemons under certain conditions
>> (moderate vm pressure, moderate I/O, slightly altered vm settings) can
>> go into loop involving isolate_freepages and effectively hit Ceph
>> cluster performance. I found this thread
>
> Do you feel it is a regression, compared to some older kernel version or something?

No, it's just rare but very concerning behaviour. The higher the pressure
is, the more chance there is of hitting this particular issue, although the
absolute numbers are still very large (e.g. plenty of room for cache memory).
Some googling also found a similar question on Server Fault:
http://serverfault.com/questions/642883/cause-of-page-fragmentation-on-large-server-with-xfs-20-disks-and-ceph
but there is no perf info there, unfortunately, so I cannot say whether the
issue is the same or not.

>
>> https://lkml.org/lkml/2012/6/27/545, but looks like that the
>> significant decrease of bdi max_ratio did not helped even for a bit.
>> Although I have approximately a half of physical memory for cache-like
>> stuff, the problem with mm persists, so I would like to try
>> suggestions from the other people. In current testing iteration I had
>> decreased vfs_cache_pressure to 10 and raised vm_dirty_ratio and
>> background ratio to 15 and 10 correspondingly (because default values
>> are too spiky for mine workloads). The host kernel is a linux-stable
>> 3.10.
>
> Well I'm glad to hear it's not 3.18-rc3 this time. But I would recommend trying
> it, or at least 3.17. Lot of patches went to reduce compaction overhead for
> (especially for transparent hugepages) since 3.10.

Heh, I may say that I have limited myself to pushing knobs in 3.10, because it
has a well-known set of problems and any major version switch would lead to
months-long QA procedures, but I may try that if none of my knob selections
help. I am not a THP user; the problem happens with regular 4k pages and
almost default VM settings. It is also worth mentioning that the kernel
messages do not complain about allocation failures, as in the case in the URL
above; compaction just tightens up to some limit and (after it has 'locked'
the system for a couple of minutes, reducing actual I/O and the derived
amount of memory operations) it goes back to normal. A cache flush fixes this
in a moment, and so should a large margin for min_free_kbytes. Over a couple
of days, depending on which nodes with certain settings the issue reappears
on, I may judge whether my ideas were wrong.
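
For completeness, the manual 'fix' I apply when a node locks up, and the
min_free_kbytes experiment, look roughly like this (the min_free_kbytes value
is just what I am trying at the moment, not a recommendation):

  # flush page cache, dentries and inodes -- recovers the node immediately
  sync
  echo 3 > /proc/sys/vm/drop_caches

  # leave a larger free reserve for the allocator to work against
  sysctl -w vm.min_free_kbytes=262144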

>
>> Non-default VM settings are:
>> vm.swappiness = 5
>> vm.dirty_ratio=10
>> vm.dirty_background_ratio=5
>> bdi_max_ratio was 100%, right now 20%, at a glance it looks like the
>> situation worsened, because unstable OSD host cause domino-like effect
>> on other hosts, which are starting to flap too and only cache flush
>> via drop_caches is helping.
>>
>> Unfortunately there are no slab info from "exhausted" state due to
>> sporadic nature of this bug, will try to catch next time.
>>
>> slabtop (normal state):
>>  Active / Total Objects (% used)    : 8675843 / 8965833 (96.8%)
>>  Active / Total Slabs (% used)      : 224858 / 224858 (100.0%)
>>  Active / Total Caches (% used)     : 86 / 132 (65.2%)
>>  Active / Total Size (% used)       : 1152171.37K / 1253116.37K (91.9%)
>>  Minimum / Average / Maximum Object : 0.01K / 0.14K / 15.75K
>>
>>   OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
>> 6890130 6889185  99%    0.10K 176670       39    706680K buffer_head
>> 751232 721707  96%    0.06K  11738       64     46952K kmalloc-64
>> 251636 226228  89%    0.55K   8987       28    143792K radix_tree_node
>> 121696  45710  37%    0.25K   3803       32     30424K kmalloc-256
>> 113022  80618  71%    0.19K   2691       42     21528K dentry
>> 112672  35160  31%    0.50K   3521       32     56336K kmalloc-512
>>  73136  72800  99%    0.07K   1306       56      5224K Acpi-ParseExt
>>  61696  58644  95%    0.02K    241      256       964K kmalloc-16
>>  54348  36649  67%    0.38K   1294       42     20704K ip6_dst_cache
>>  53136  51787  97%    0.11K   1476       36      5904K sysfs_dir_cache
>>  51200  50724  99%    0.03K    400      128      1600K kmalloc-32
>>  49120  46105  93%    1.00K   1535       32     49120K xfs_inode
>>  30702  30702 100%    0.04K    301      102      1204K Acpi-Namespace
>>  28224  25742  91%    0.12K    882       32      3528K kmalloc-128
>>  28028  22691  80%    0.18K    637       44      5096K vm_area_struct
>>  28008  28008 100%    0.22K    778       36      6224K xfs_ili
>>  18944  18944 100%    0.01K     37      512       148K kmalloc-8
>>  16576  15154  91%    0.06K    259       64      1036K anon_vma
>>  16475  14200  86%    0.16K    659       25      2636K sigqueue
>>
>> zoneinfo (normal state, attached)
>>
>


* Re: isolate_freepages_block and excessive CPU usage by OSD process
  2014-11-15 17:10   ` Andrey Korolyov
@ 2014-11-15 18:45     ` Vlastimil Babka
  2014-11-15 18:52       ` Andrey Korolyov
  0 siblings, 1 reply; 36+ messages in thread
From: Vlastimil Babka @ 2014-11-15 18:45 UTC (permalink / raw)
  To: Andrey Korolyov
  Cc: ceph-users@lists.ceph.com, riel, Mark Nelson, linux-mm,
	David Rientjes, Joonsoo Kim, Johannes Weiner

On 11/15/2014 06:10 PM, Andrey Korolyov wrote:
> On Sat, Nov 15, 2014 at 7:32 PM, Vlastimil Babka <vbabka@suse.cz> wrote:
>> On 11/15/2014 12:48 PM, Andrey Korolyov wrote:
>>> Hello,
>>>
>>> I had found recently that the OSD daemons under certain conditions
>>> (moderate vm pressure, moderate I/O, slightly altered vm settings) can
>>> go into loop involving isolate_freepages and effectively hit Ceph
>>> cluster performance. I found this thread
>>
>> Do you feel it is a regression, compared to some older kernel version or something?
> 
> No, it`s just a rare but very concerning stuff. The higher pressure
> is, the more chance to hit this particular issue, although absolute
> numbers are still very large (e.g. room for cache memory). Some
> googling also found simular question on sf:
> http://serverfault.com/questions/642883/cause-of-page-fragmentation-on-large-server-with-xfs-20-disks-and-ceph
> but there are no perf info unfortunately so I cannot say if the issue
> is the same or not.

Well, it would be useful to find out what's doing the high-order allocations.
With 'perf record -g -a' and then 'perf report -g' you can determine the call
stacks. Order and allocation flags can be captured by enabling the page_alloc
tracepoint.
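
Something along these lines should do it (just a sketch; it assumes debugfs is
mounted at /sys/kernel/debug and uses the kmem:mm_page_alloc tracepoint, which
records order and gfp flags):

  # sample call stacks system-wide while the stall is happening
  perf record -g -a -- sleep 30
  perf report -g

  # capture page allocations with order and gfp flags
  perf record -e kmem:mm_page_alloc -a -- sleep 30
  perf script | less

  # or the same tracepoint via ftrace
  echo 1 > /sys/kernel/debug/tracing/events/kmem/mm_page_alloc/enable
  cat /sys/kernel/debug/tracing/trace_pipe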

>>
>>> https://lkml.org/lkml/2012/6/27/545, but looks like that the
>>> significant decrease of bdi max_ratio did not helped even for a bit.
>>> Although I have approximately a half of physical memory for cache-like
>>> stuff, the problem with mm persists, so I would like to try
>>> suggestions from the other people. In current testing iteration I had
>>> decreased vfs_cache_pressure to 10 and raised vm_dirty_ratio and
>>> background ratio to 15 and 10 correspondingly (because default values
>>> are too spiky for mine workloads). The host kernel is a linux-stable
>>> 3.10.
>>
>> Well I'm glad to hear it's not 3.18-rc3 this time. But I would recommend trying
>> it, or at least 3.17. Lot of patches went to reduce compaction overhead for
>> (especially for transparent hugepages) since 3.10.
> 
> Heh, I may say that I limited to pushing knobs in 3.10, because it has
> a well-known set of problems and any major version switch will lead to
> months-long QA procedures, but I may try that if none of mine knob
> selection will help. I am not THP user, the problem is happening with
> regular 4k pages and almost default VM settings. Also it worth to mean

OK, that's useful to know. So it might be some driver (do you also have
mellanox?) or maybe SLUB (do you have it enabled?) that is trying high-order
allocations.

> that kernel messages are not complaining about allocation failures, as
> in case in URL from above, compaction just tightens up to some limit

Since there are no warnings, we need tracing/profiling to find out what's
causing it.

> and (after it 'locked' system for a couple of minutes, reducing actual
> I/O and derived amount of memory operations) it goes back to normal.
> Cache flush fixing this just in a moment, so should large room for

That could perhaps suggest a poor coordination between reclaim and compaction,
made worse by the fact that there are more parallel ongoing attempts and the
watermark checking doesn't take that into account.

> min_free_kbytes. Over couple of days, depends on which nodes with
> certain settings issue will reappear, I may judge if my ideas was
> wrong.
> 
>>
>>> Non-default VM settings are:
>>> vm.swappiness = 5
>>> vm.dirty_ratio=10
>>> vm.dirty_background_ratio=5
>>> bdi_max_ratio was 100%, right now 20%, at a glance it looks like the
>>> situation worsened, because unstable OSD host cause domino-like effect
>>> on other hosts, which are starting to flap too and only cache flush
>>> via drop_caches is helping.
>>>
>>> Unfortunately there are no slab info from "exhausted" state due to
>>> sporadic nature of this bug, will try to catch next time.
>>>
>>> slabtop (normal state):
>>>  Active / Total Objects (% used)    : 8675843 / 8965833 (96.8%)
>>>  Active / Total Slabs (% used)      : 224858 / 224858 (100.0%)
>>>  Active / Total Caches (% used)     : 86 / 132 (65.2%)
>>>  Active / Total Size (% used)       : 1152171.37K / 1253116.37K (91.9%)
>>>  Minimum / Average / Maximum Object : 0.01K / 0.14K / 15.75K
>>>
>>>   OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
>>> 6890130 6889185  99%    0.10K 176670       39    706680K buffer_head
>>> 751232 721707  96%    0.06K  11738       64     46952K kmalloc-64
>>> 251636 226228  89%    0.55K   8987       28    143792K radix_tree_node
>>> 121696  45710  37%    0.25K   3803       32     30424K kmalloc-256
>>> 113022  80618  71%    0.19K   2691       42     21528K dentry
>>> 112672  35160  31%    0.50K   3521       32     56336K kmalloc-512
>>>  73136  72800  99%    0.07K   1306       56      5224K Acpi-ParseExt
>>>  61696  58644  95%    0.02K    241      256       964K kmalloc-16
>>>  54348  36649  67%    0.38K   1294       42     20704K ip6_dst_cache
>>>  53136  51787  97%    0.11K   1476       36      5904K sysfs_dir_cache
>>>  51200  50724  99%    0.03K    400      128      1600K kmalloc-32
>>>  49120  46105  93%    1.00K   1535       32     49120K xfs_inode
>>>  30702  30702 100%    0.04K    301      102      1204K Acpi-Namespace
>>>  28224  25742  91%    0.12K    882       32      3528K kmalloc-128
>>>  28028  22691  80%    0.18K    637       44      5096K vm_area_struct
>>>  28008  28008 100%    0.22K    778       36      6224K xfs_ili
>>>  18944  18944 100%    0.01K     37      512       148K kmalloc-8
>>>  16576  15154  91%    0.06K    259       64      1036K anon_vma
>>>  16475  14200  86%    0.16K    659       25      2636K sigqueue
>>>
>>> zoneinfo (normal state, attached)
>>>
>>
> 


* Re: isolate_freepages_block and excessive CPU usage by OSD process
  2014-11-15 18:45     ` Vlastimil Babka
@ 2014-11-15 18:52       ` Andrey Korolyov
  0 siblings, 0 replies; 36+ messages in thread
From: Andrey Korolyov @ 2014-11-15 18:52 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: ceph-users@lists.ceph.com, riel, Mark Nelson, linux-mm,
	David Rientjes, Joonsoo Kim, Johannes Weiner

On Sat, Nov 15, 2014 at 9:45 PM, Vlastimil Babka <vbabka@suse.cz> wrote:
> On 11/15/2014 06:10 PM, Andrey Korolyov wrote:
>> On Sat, Nov 15, 2014 at 7:32 PM, Vlastimil Babka <vbabka@suse.cz> wrote:
>>> On 11/15/2014 12:48 PM, Andrey Korolyov wrote:
>>>> Hello,
>>>>
>>>> I had found recently that the OSD daemons under certain conditions
>>>> (moderate vm pressure, moderate I/O, slightly altered vm settings) can
>>>> go into loop involving isolate_freepages and effectively hit Ceph
>>>> cluster performance. I found this thread
>>>
>>> Do you feel it is a regression, compared to some older kernel version or something?
>>
>> No, it`s just a rare but very concerning stuff. The higher pressure
>> is, the more chance to hit this particular issue, although absolute
>> numbers are still very large (e.g. room for cache memory). Some
>> googling also found simular question on sf:
>> http://serverfault.com/questions/642883/cause-of-page-fragmentation-on-large-server-with-xfs-20-disks-and-ceph
>> but there are no perf info unfortunately so I cannot say if the issue
>> is the same or not.
>
> Well it would be useful to find out what's doing the high-order allocations.
> With 'perf -g -a' and then 'perf report -g' determine the call stack. Order and
> allocation flags can be captured by enabling the page_alloc tracepoint.

Thanks, please give me some time to go through the testing iterations so
that I can collect appropriate perf.data.
>
>>>
>>>> https://lkml.org/lkml/2012/6/27/545, but looks like that the
>>>> significant decrease of bdi max_ratio did not helped even for a bit.
>>>> Although I have approximately a half of physical memory for cache-like
>>>> stuff, the problem with mm persists, so I would like to try
>>>> suggestions from the other people. In current testing iteration I had
>>>> decreased vfs_cache_pressure to 10 and raised vm_dirty_ratio and
>>>> background ratio to 15 and 10 correspondingly (because default values
>>>> are too spiky for mine workloads). The host kernel is a linux-stable
>>>> 3.10.
>>>
>>> Well I'm glad to hear it's not 3.18-rc3 this time. But I would recommend trying
>>> it, or at least 3.17. Lot of patches went to reduce compaction overhead for
>>> (especially for transparent hugepages) since 3.10.
>>
>> Heh, I may say that I limited to pushing knobs in 3.10, because it has
>> a well-known set of problems and any major version switch will lead to
>> months-long QA procedures, but I may try that if none of mine knob
>> selection will help. I am not THP user, the problem is happening with
>> regular 4k pages and almost default VM settings. Also it worth to mean
>
> OK that's useful to know. So it might be some driver (do you also have
> mellanox?) or maybe SLUB (do you have it enabled?) is trying high-order allocations.

Yes, I am using the Mellanox transport there and the SLUB allocator, as SLAB
had some issues with allocations under uneven node fill-up on the two-head
systems which I am primarily using.
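
In case it is useful, the quick checks I used to confirm SLUB and to look at
per-cache allocation orders (these sysfs paths only exist with SLUB; the
cache names are just the ones prominent in my slabtop output):

  # SLUB exposes /sys/kernel/slab, SLAB does not
  test -d /sys/kernel/slab && echo "SLUB in use"

  # page order used for each cache's slabs
  cat /sys/kernel/slab/kmalloc-512/order
  cat /sys/kernel/slab/buffer_head/order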

>
>> that kernel messages are not complaining about allocation failures, as
>> in case in URL from above, compaction just tightens up to some limit
>
> Without the warnings, that's why we need tracing/profiling to find out what's
> causing it.
>
>> and (after it 'locked' system for a couple of minutes, reducing actual
>> I/O and derived amount of memory operations) it goes back to normal.
>> Cache flush fixing this just in a moment, so should large room for
>
> That could perhaps suggest a poor coordination between reclaim and compaction,
> made worse by the fact that there are more parallel ongoing attempts and the
> watermark checking doesn't take that into account.
>
>> min_free_kbytes. Over couple of days, depends on which nodes with
>> certain settings issue will reappear, I may judge if my ideas was
>> wrong.
>>
>>>
>>>> Non-default VM settings are:
>>>> vm.swappiness = 5
>>>> vm.dirty_ratio=10
>>>> vm.dirty_background_ratio=5
>>>> bdi_max_ratio was 100%, right now 20%, at a glance it looks like the
>>>> situation worsened, because unstable OSD host cause domino-like effect
>>>> on other hosts, which are starting to flap too and only cache flush
>>>> via drop_caches is helping.
>>>>
>>>> Unfortunately there are no slab info from "exhausted" state due to
>>>> sporadic nature of this bug, will try to catch next time.
>>>>
>>>> slabtop (normal state):
>>>>  Active / Total Objects (% used)    : 8675843 / 8965833 (96.8%)
>>>>  Active / Total Slabs (% used)      : 224858 / 224858 (100.0%)
>>>>  Active / Total Caches (% used)     : 86 / 132 (65.2%)
>>>>  Active / Total Size (% used)       : 1152171.37K / 1253116.37K (91.9%)
>>>>  Minimum / Average / Maximum Object : 0.01K / 0.14K / 15.75K
>>>>
>>>>   OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
>>>> 6890130 6889185  99%    0.10K 176670       39    706680K buffer_head
>>>> 751232 721707  96%    0.06K  11738       64     46952K kmalloc-64
>>>> 251636 226228  89%    0.55K   8987       28    143792K radix_tree_node
>>>> 121696  45710  37%    0.25K   3803       32     30424K kmalloc-256
>>>> 113022  80618  71%    0.19K   2691       42     21528K dentry
>>>> 112672  35160  31%    0.50K   3521       32     56336K kmalloc-512
>>>>  73136  72800  99%    0.07K   1306       56      5224K Acpi-ParseExt
>>>>  61696  58644  95%    0.02K    241      256       964K kmalloc-16
>>>>  54348  36649  67%    0.38K   1294       42     20704K ip6_dst_cache
>>>>  53136  51787  97%    0.11K   1476       36      5904K sysfs_dir_cache
>>>>  51200  50724  99%    0.03K    400      128      1600K kmalloc-32
>>>>  49120  46105  93%    1.00K   1535       32     49120K xfs_inode
>>>>  30702  30702 100%    0.04K    301      102      1204K Acpi-Namespace
>>>>  28224  25742  91%    0.12K    882       32      3528K kmalloc-128
>>>>  28028  22691  80%    0.18K    637       44      5096K vm_area_struct
>>>>  28008  28008 100%    0.22K    778       36      6224K xfs_ili
>>>>  18944  18944 100%    0.01K     37      512       148K kmalloc-8
>>>>  16576  15154  91%    0.06K    259       64      1036K anon_vma
>>>>  16475  14200  86%    0.16K    659       25      2636K sigqueue
>>>>
>>>> zoneinfo (normal state, attached)
>>>>
>>>
>>
>


* Re: isolate_freepages_block and excessive CPU usage by OSD process
       [not found] <CABYiri-do2YdfBx=r+u1kwXkEwN4v+yeRSHB-ODXo4gMFgW-Fg.mail.gmail.com>
@ 2014-11-19  1:21 ` Christian Marie
  2014-11-19 18:03   ` Andrey Korolyov
  0 siblings, 1 reply; 36+ messages in thread
From: Christian Marie @ 2014-11-19  1:21 UTC (permalink / raw)
  To: linux-mm

[-- Attachment #1: Type: text/plain, Size: 1802 bytes --]

> Hello,
> 
> I had found recently that the OSD daemons under certain conditions
> (moderate vm pressure, moderate I/O, slightly altered vm settings) can
> go into loop involving isolate_freepages and effectively hit Ceph
> cluster performance.

Hi! I'm the author of the Server Fault question you reference:

http://serverfault.com/questions/642883/cause-of-page-fragmentation-on-large-server-with-xfs-20-disks-and-ceph

I'd very much like to get to the bottom of this. I'm seeing a very similar
pattern on 3.10.0-123.9.3.el7.x86_64; if this is fixed in later versions
perhaps we could backport something.

Here is some perf output:

http://ponies.io/raw/compaction.png

Looks pretty similar. I also have hundreds of MB of logs and traces should we
need some specific question answered.

I've managed to reproduce many failed compactions with this:

https://gist.github.com/christian-marie/cde7e80c5edb889da541

I took some compaction stress test code and bolted on a little loop to mmap a
large sparse file and read every PAGE_SIZEth byte.

Run it once and compactions seem to do okay; run it again and they're really
slow. This seems to be because my little trick to fill up cache memory only
works exactly half the time. Note that transparent huge pages are only used to
introduce fragmentation/pressure here; turning them off doesn't seem to make
the slightest difference to the spinning-in-reclaim issue.

We are using Mellanox ipoib drivers which do not do scatter-gather, so I'm
currently working on adding support for that (the hardware supports it). Are
you also using ipoib, or do you have something else doing high-order
allocations? It's a bit concerning for me if you don't, as it would suggest
that cutting down on those allocations won't help.



* Re: isolate_freepages_block and excessive CPU usage by OSD process
  2014-11-19  1:21 ` isolate_freepages_block and excessive CPU usage by OSD process Christian Marie
@ 2014-11-19 18:03   ` Andrey Korolyov
  2014-11-19 21:20     ` Christian Marie
  0 siblings, 1 reply; 36+ messages in thread
From: Andrey Korolyov @ 2014-11-19 18:03 UTC (permalink / raw)
  To: Christian Marie; +Cc: linux-mm

On Wed, Nov 19, 2014 at 4:21 AM, Christian Marie <christian@ponies.io> wrote:
>> Hello,
>>
>> I had found recently that the OSD daemons under certain conditions
>> (moderate vm pressure, moderate I/O, slightly altered vm settings) can
>> go into loop involving isolate_freepages and effectively hit Ceph
>> cluster performance.
>
> Hi! I'm the creator of the server fault issue you reference:
>
> http://serverfault.com/questions/642883/cause-of-page-fragmentation-on-large-server-with-xfs-20-disks-and-ceph
>
> I'd like to get to the bottom of this very much, I'm seeing a very similar
> pattern on 3.10.0-123.9.3.el7.x86_64, if this is fixed in later versions
> perhaps we could backport something.
>
> Here is some perf output:
>
> http://ponies.io/raw/compaction.png
>
> Looks pretty similar. I also have hundreds of MB logs and traces should we need
> some specific question answered.
>
> I've managed to reproduce many failed compactions with this:
>
> https://gist.github.com/christian-marie/cde7e80c5edb889da541
>
> I took some compaction stress test code and bolted on a little loop to mmap a
> large sparse file and read every PAGE_SIZEth byte.
>
> Run it once, compactions seem to do okay, run it again and they're really slow.
> This seems to be because my little trick to fill up cache memory only seems to
> work exactly half the time. Note that transhuge pages are only used to
> introduce fragmentation/pressure here, turning transparent huge pages off
> doesn't seem to make the slightest difference to the spinning-in-reclaim issue.
>
> We are using Mellanox ipoib drivers which do not do scatter-gather, so I'm
> currently working on adding support for that (the hardware supports it). Are
> you also using ipoib or have something else doing high order allocations? It's
> a bit concerning for me if you don't as it would suggest that cutting down on
> those allocations won't help.

So am I. In a test environment with regular ten-gig cards I was unable
to reproduce the issue. Honestly, I thought that almost every
contemporary driver for high-speed cards works with scatter-gather, so
I did not have mlx in mind as a potential cause of this problem from the
very beginning. There are a couple of reports on the ceph lists
complaining about OSD flapping/unresponsiveness without a clear reason
under certain (not always clear, though) conditions, which may have the
same root cause. I wonder whether a numad-like mechanism would help there,
but its usage is generally an anti-performance pattern in my experience.


* Re: isolate_freepages_block and excessive CPU usage by OSD process
  2014-11-19 18:03   ` Andrey Korolyov
@ 2014-11-19 21:20     ` Christian Marie
  2014-11-19 23:10       ` Vlastimil Babka
  0 siblings, 1 reply; 36+ messages in thread
From: Christian Marie @ 2014-11-19 21:20 UTC (permalink / raw)
  To: linux-mm

[-- Attachment #1: Type: text/plain, Size: 2338 bytes --]

On Wed, Nov 19, 2014 at 10:03:44PM +0400, Andrey Korolyov wrote:
> > We are using Mellanox ipoib drivers which do not do scatter-gather, so I'm
> > currently working on adding support for that (the hardware supports it). Are
> > you also using ipoib or have something else doing high order allocations? It's
> > a bit concerning for me if you don't as it would suggest that cutting down on
> > those allocations won't help.
> 
> So do I. On a test environment with regular tengig cards I was unable to
> reproduce the issue. Honestly, I thought that almost every contemporary
> driver for high-speed cards is working with scatter-gather, so I had not mlx
> in mind as a potential cause of this problem from very beginning.

Right, the drivers handle SG just fine, even in UD mode. It's just that as soon
as you switch to CM they turn off hardware IP csums and SG support. The only
question that remains to answer before testing a patched driver is whether or
not the messages sent by Ceph are fragmented enough to save allocations. If
not, we could always patch Ceph as well, but this is beginning to snowball.

Here is the untested WIP patch for SG support in ipoib CM mode; I'm currently
talking to the original author of a larger patch about reviewing and splitting
it so we can get both upstream:

https://gist.github.com/christian-marie/e8048b9c118bd3925957

> There are a couple of reports in ceph lists, complaining for OSD
> flapping/unresponsiveness without clear reason on certain (not always clear
> though) conditions which may have same root cause.

Possibly, though ipoib and Ceph seem to be a relatively rare combination.
Someone will likely find this thread if it is the same root cause.

> Wonder if numad-like mechanism will help there, but its usage is generally an
> anti-performance pattern in my experience.

We've played with zone_reclaim_mode and numad to no avail. The only thing we
haven't tried is striping, which I don't want to do anyway.

If these large allocations are indeed a reasonable thing to ask of the
compaction/reclaim subsystem, that seems like the best way forward. I have two
questions that follow from this conjecture:

Is compaction behaving badly, or are we just asking for too many high-order
allocations?

Is this fixed in a later kernel? I haven't tested yet.



* Re: isolate_freepages_block and excessive CPU usage by OSD process
  2014-11-19 21:20     ` Christian Marie
@ 2014-11-19 23:10       ` Vlastimil Babka
  2014-11-19 23:49         ` Andrey Korolyov
                           ` (2 more replies)
  0 siblings, 3 replies; 36+ messages in thread
From: Vlastimil Babka @ 2014-11-19 23:10 UTC (permalink / raw)
  To: linux-mm

On 11/19/2014 10:20 PM, Christian Marie wrote:
> On Wed, Nov 19, 2014 at 10:03:44PM +0400, Andrey Korolyov wrote:
>> > We are using Mellanox ipoib drivers which do not do scatter-gather, so I'm
>> > currently working on adding support for that (the hardware supports it). Are
>> > you also using ipoib or have something else doing high order allocations? It's
>> > a bit concerning for me if you don't as it would suggest that cutting down on
>> > those allocations won't help.
>> 
>> So do I. On a test environment with regular tengig cards I was unable to
>> reproduce the issue. Honestly, I thought that almost every contemporary
>> driver for high-speed cards is working with scatter-gather, so I had not mlx
>> in mind as a potential cause of this problem from very beginning.
> 
> Right, the drivers handle SG just fine, even in UD mode. It's just that as soon
> as you go switch to CM they turn of hardware IP csums and SG support. The only
> question I remain to answer before testing a patched driver is whether or not
> the messages sent by Ceph are fragmented enough to save allocations. If not, we
> could always patch Ceph as well but this is beginning to snowball.
> 
> Here is the untested WIP patch for SG support in ipoib CM mode, I'm currently
> talking to the original author of a larger patch to review and split that and
> get them both upstream.:
> 
> https://gist.github.com/christian-marie/e8048b9c118bd3925957
> 
>> There are a couple of reports in ceph lists, complaining for OSD
>> flapping/unresponsiveness without clear reason on certain (not always clear
>> though) conditions which may have same root cause.
> 
> Possibly, though ipoib and Ceph seem to be a relatively rare combination.
> Someone will likely find this thread if it is the same root cause.
> 
>> Wonder if numad-like mechanism will help there, but its usage is generally an
>> anti-performance pattern in my experience.
> 
> We've played with zone_reclaim_mode and numad to no avail. Only thing we haven't
> tried is striping, which I don't want to do anyway.
> 
> If these large allocations are indeed a reasonable thing to ask of the
> compaction/reclaim subsystem that seems like the best way forward. I have two
> questions that follow from this conjecture:
> 
> Are compaction behaving badly or are we just asking for too many high order
> allocations?
> 
> Is this fixed in a later kernel? I haven't tested yet.

As I said, recent kernels have received many compaction performance tuning
patches, and reclaim fixes as well. I would recommend trying them, if possible.

You mention 3.10.0-123.9.3.el7.x86_64, and I have no idea how that relates to
the upstream stable kernel. Upstream version 3.10.44 received several compaction
fixes that I'd deem critical for compaction to work as intended, and their
absence could explain your problems:

mm: compaction: reset cached scanner pfn's before reading them
commit d3132e4b83e6bd383c74d716f7281d7c3136089c upstream.

mm: compaction: detect when scanners meet in isolate_freepages
commit 7ed695e069c3cbea5e1fd08f84a04536da91f584 upstream.

mm/compaction: make isolate_freepages start at pageblock boundary
commit 49e068f0b73dd042c186ffa9b420a9943e90389a upstream.

You might want to check if those are included in your kernel package, and/or try
upstream stable 3.10 (if you can't use the latest for some reason).

Vlastimil


* Re: isolate_freepages_block and excessive CPU usage by OSD process
  2014-11-19 23:10       ` Vlastimil Babka
@ 2014-11-19 23:49         ` Andrey Korolyov
  2014-11-20  3:30         ` Christian Marie
  2014-11-21  2:35         ` Christian Marie
  2 siblings, 0 replies; 36+ messages in thread
From: Andrey Korolyov @ 2014-11-19 23:49 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: linux-mm, Christian Marie

On Thu, Nov 20, 2014 at 2:10 AM, Vlastimil Babka <vbabka@suse.cz> wrote:
> On 11/19/2014 10:20 PM, Christian Marie wrote:
>> On Wed, Nov 19, 2014 at 10:03:44PM +0400, Andrey Korolyov wrote:
>>> > We are using Mellanox ipoib drivers which do not do scatter-gather, so I'm
>>> > currently working on adding support for that (the hardware supports it). Are
>>> > you also using ipoib or have something else doing high order allocations? It's
>>> > a bit concerning for me if you don't as it would suggest that cutting down on
>>> > those allocations won't help.
>>>
>>> So do I. On a test environment with regular tengig cards I was unable to
>>> reproduce the issue. Honestly, I thought that almost every contemporary
>>> driver for high-speed cards is working with scatter-gather, so I had not mlx
>>> in mind as a potential cause of this problem from very beginning.
>>
>> Right, the drivers handle SG just fine, even in UD mode. It's just that as soon
>> as you go switch to CM they turn of hardware IP csums and SG support. The only
>> question I remain to answer before testing a patched driver is whether or not
>> the messages sent by Ceph are fragmented enough to save allocations. If not, we
>> could always patch Ceph as well but this is beginning to snowball.
>>
>> Here is the untested WIP patch for SG support in ipoib CM mode, I'm currently
>> talking to the original author of a larger patch to review and split that and
>> get them both upstream.:
>>
>> https://gist.github.com/christian-marie/e8048b9c118bd3925957
>>
>>> There are a couple of reports in ceph lists, complaining for OSD
>>> flapping/unresponsiveness without clear reason on certain (not always clear
>>> though) conditions which may have same root cause.
>>
>> Possibly, though ipoib and Ceph seem to be a relatively rare combination.
>> Someone will likely find this thread if it is the same root cause.
>>
>>> Wonder if numad-like mechanism will help there, but its usage is generally an
>>> anti-performance pattern in my experience.
>>
>> We've played with zone_reclaim_mode and numad to no avail. Only thing we haven't
>> tried is striping, which I don't want to do anyway.
>>
>> If these large allocations are indeed a reasonable thing to ask of the
>> compaction/reclaim subsystem that seems like the best way forward. I have two
>> questions that follow from this conjecture:
>>
>> Are compaction behaving badly or are we just asking for too many high order
>> allocations?
>>
>> Is this fixed in a later kernel? I haven't tested yet.
>
> As I said, recent kernels received many compaction performance tuning patches,
> and reclaim as well. I would recommend trying them, if it's possible.
>
> You mention 3.10.0-123.9.3.el7.x86_64 which I have no idea how it relates to
> upstream stable kernel. Upstream version 3.10.44 received several compaction
> fixes that I'd deem critical for compaction to work as intended, and lack of
> them could explain your problems:
>
> mm: compaction: reset cached scanner pfn's before reading them
> commit d3132e4b83e6bd383c74d716f7281d7c3136089c upstream.
>
> mm: compaction: detect when scanners meet in isolate_freepages
> commit 7ed695e069c3cbea5e1fd08f84a04536da91f584 upstream.
>
> mm/compaction: make isolate_freepages start at pageblock boundary
> commit 49e068f0b73dd042c186ffa9b420a9943e90389a upstream.
>
> You might want to check if those are included in your kernel package, and/or try
> upstream stable 3.10 (if you can't use the latest for some reason).
>
> Vlastimil

Thanks, neither Christian's builds nor mine include those. I mentioned
that I run -stable 3.10, but it was derived from a public branch
probably as early as RH's and has received only performance/security
fixes at most. Will check the issue soon and report back.

>


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: isolate_freepages_block and excessive CPU usage by OSD process
  2014-11-19 23:10       ` Vlastimil Babka
  2014-11-19 23:49         ` Andrey Korolyov
@ 2014-11-20  3:30         ` Christian Marie
  2014-11-21  2:35         ` Christian Marie
  2 siblings, 0 replies; 36+ messages in thread
From: Christian Marie @ 2014-11-20  3:30 UTC (permalink / raw)
  To: linux-mm

[-- Attachment #1: Type: text/plain, Size: 1423 bytes --]

On Thu, Nov 20, 2014 at 12:10:30AM +0100, Vlastimil Babka wrote:
> > Is this fixed in a later kernel? I haven't tested yet.
> 
> As I said, recent kernels received many compaction performance tuning patches,
> and reclaim as well. I would recommend trying them, if it's possible.
> 
> You mention 3.10.0-123.9.3.el7.x86_64 which I have no idea how it relates to
> upstream stable kernel. Upstream version 3.10.44 received several compaction
> fixes that I'd deem critical for compaction to work as intended, and lack of
> them could explain your problems:
> 
> mm: compaction: reset cached scanner pfn's before reading them
> commit d3132e4b83e6bd383c74d716f7281d7c3136089c upstream.
> 
> mm: compaction: detect when scanners meet in isolate_freepages
> commit 7ed695e069c3cbea5e1fd08f84a04536da91f584 upstream.
> 
> mm/compaction: make isolate_freepages start at pageblock boundary
> commit 49e068f0b73dd042c186ffa9b420a9943e90389a upstream.
> 
> You might want to check if those are included in your kernel package, and/or try
> upstream stable 3.10 (if you can't use the latest for some reason).

Excellent, thank you.

I realised there were a lot of changes, but this list of specific fixes might
help narrow down the actual cause here. I've just built a kernel that's exactly
the same as the exploding one with just these three patches added, and will be
back tomorrow with the results of testing.

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: isolate_freepages_block and excessive CPU usage by OSD process
  2014-11-19 23:10       ` Vlastimil Babka
  2014-11-19 23:49         ` Andrey Korolyov
  2014-11-20  3:30         ` Christian Marie
@ 2014-11-21  2:35         ` Christian Marie
  2014-11-23  9:33           ` Christian Marie
  2 siblings, 1 reply; 36+ messages in thread
From: Christian Marie @ 2014-11-21  2:35 UTC (permalink / raw)
  To: linux-mm

[-- Attachment #1: Type: text/plain, Size: 1417 bytes --]

On Thu, Nov 20, 2014 at 12:10:30AM +0100, Vlastimil Babka wrote:
> As I said, recent kernels received many compaction performance tuning patches,
> and reclaim as well. I would recommend trying them, if it's possible.
> 
> You mention 3.10.0-123.9.3.el7.x86_64 which I have no idea how it relates to
> upstream stable kernel. Upstream version 3.10.44 received several compaction
> fixes that I'd deem critical for compaction to work as intended, and lack of
> them could explain your problems:
> 
> mm: compaction: reset cached scanner pfn's before reading them
> commit d3132e4b83e6bd383c74d716f7281d7c3136089c upstream.
> 
> mm: compaction: detect when scanners meet in isolate_freepages
> commit 7ed695e069c3cbea5e1fd08f84a04536da91f584 upstream.
> 
> mm/compaction: make isolate_freepages start at pageblock boundary
> commit 49e068f0b73dd042c186ffa9b420a9943e90389a upstream.
> 
> You might want to check if those are included in your kernel package, and/or try
> upstream stable 3.10 (if you can't use the latest for some reason).

I built exactly the same kernel with these patches applied; unfortunately it
suffered the same problem. I will now try the latest release candidate
(3.18-rc5) and report back.

Do you have any ideas about where I should be looking to collect data to track
down what is happening here? Here is some perf output again:

http://ponies.io/raw/compaction.png

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: isolate_freepages_block and excessive CPU usage by OSD process
  2014-11-21  2:35         ` Christian Marie
@ 2014-11-23  9:33           ` Christian Marie
  2014-11-24 21:48             ` Andrey Korolyov
  0 siblings, 1 reply; 36+ messages in thread
From: Christian Marie @ 2014-11-23  9:33 UTC (permalink / raw)
  To: linux-mm

[-- Attachment #1: Type: text/plain, Size: 775 bytes --]

Here's an update:

Tried running 3.18.0-rc5 over the weekend, to no avail. A load spike through
Ceph brought no perceived improvement over the chassis running 3.10 kernels.

Here is a graph of *system* cpu time (not user); note that 3.18 was on a005.block:

http://ponies.io/raw/cluster.png

It is perhaps faring a little better than the chassis running 3.10, in that it
did not have min_free_kbytes raised to 2GB as the others did; instead it was
sitting at around 90MB.

The perf recording did look a little different. I'm not sure if this was just
the luck of the draw in how the fractal rendering works:

http://ponies.io/raw/perf-3.10.png

Any pointers on how we can track this down? There are at least three of us
following this now, so we should have plenty of scope for testing.

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: isolate_freepages_block and excessive CPU usage by OSD process
  2014-11-23  9:33           ` Christian Marie
@ 2014-11-24 21:48             ` Andrey Korolyov
  2014-11-28  8:03               ` Joonsoo Kim
  0 siblings, 1 reply; 36+ messages in thread
From: Andrey Korolyov @ 2014-11-24 21:48 UTC (permalink / raw)
  To: linux-mm

On Sun, Nov 23, 2014 at 12:33 PM, Christian Marie <christian@ponies.io> wrote:
> Here's an update:
>
> Tried running 3.18.0-rc5 over the weekend to no avail. A load spike through
> Ceph brings no perceived improvement over the chassis running 3.10 kernels.
>
> Here is a graph of *system* cpu time (not user), note that 3.18 was a005.block:
>
> http://ponies.io/raw/cluster.png
>
> It is perhaps faring a little better that those chassis running the 3.10 in
> that it did not have min_free_kbytes raised to 2GB as the others did, instead
> it was sitting around 90MB.
>
> The perf recording did look a little different. Not sure if this was just the
> luck of the draw in how the fractal rendering works:
>
> http://ponies.io/raw/perf-3.10.png
>
> Any pointers on how we can track this down? There's at least three of us
> following at this now so we should have plenty of area to test.


Checked against 3.16 (3.17 hung due to an unrelated problem); the issue is
present on single- and two-headed systems as well. Ceph users reported the
presence of the problem on 3.17 too, so we are probably facing a generic
compaction issue.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: isolate_freepages_block and excessive CPU usage by OSD process
  2014-11-24 21:48             ` Andrey Korolyov
@ 2014-11-28  8:03               ` Joonsoo Kim
  2014-11-28  9:26                 ` Vlastimil Babka
  0 siblings, 1 reply; 36+ messages in thread
From: Joonsoo Kim @ 2014-11-28  8:03 UTC (permalink / raw)
  To: Andrey Korolyov
  Cc: linux-mm, Christoph Lameter, David Rientjes, Andrew Morton,
	Vlastimil Babka

On Tue, Nov 25, 2014 at 01:48:42AM +0400, Andrey Korolyov wrote:
> On Sun, Nov 23, 2014 at 12:33 PM, Christian Marie <christian@ponies.io> wrote:
> > Here's an update:
> >
> > Tried running 3.18.0-rc5 over the weekend to no avail. A load spike through
> > Ceph brings no perceived improvement over the chassis running 3.10 kernels.
> >
> > Here is a graph of *system* cpu time (not user), note that 3.18 was a005.block:
> >
> > http://ponies.io/raw/cluster.png
> >
> > It is perhaps faring a little better that those chassis running the 3.10 in
> > that it did not have min_free_kbytes raised to 2GB as the others did, instead
> > it was sitting around 90MB.
> >
> > The perf recording did look a little different. Not sure if this was just the
> > luck of the draw in how the fractal rendering works:
> >
> > http://ponies.io/raw/perf-3.10.png
> >
> > Any pointers on how we can track this down? There's at least three of us
> > following at this now so we should have plenty of area to test.
> 
> 
> Checked against 3.16 (3.17 hanged for an unrelated problem), the issue
> is presented for single- and two-headed systems as well. Ceph-users
> reported presence of the problem for 3.17, so probably we are facing
> generic compaction issue.
> 

Hello,

I didn't follow up on this discussion, but, at a glance, this excessive CPU
usage by compaction is related to the following fixes.

Could you test the following two patches?

If these fix your problem, I will resubmit the patches with a proper commit
description.

Thanks.

-------->8-------------

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: isolate_freepages_block and excessive CPU usage by OSD process
  2014-11-28  8:03               ` Joonsoo Kim
@ 2014-11-28  9:26                 ` Vlastimil Babka
  2014-12-01  8:31                   ` Joonsoo Kim
  0 siblings, 1 reply; 36+ messages in thread
From: Vlastimil Babka @ 2014-11-28  9:26 UTC (permalink / raw)
  To: Joonsoo Kim, Andrey Korolyov
  Cc: linux-mm, Christoph Lameter, David Rientjes, Andrew Morton

On 28.11.2014 9:03, Joonsoo Kim wrote:
> On Tue, Nov 25, 2014 at 01:48:42AM +0400, Andrey Korolyov wrote:
>> On Sun, Nov 23, 2014 at 12:33 PM, Christian Marie <christian@ponies.io> wrote:
>>> Here's an update:
>>>
>>> Tried running 3.18.0-rc5 over the weekend to no avail. A load spike through
>>> Ceph brings no perceived improvement over the chassis running 3.10 kernels.
>>>
>>> Here is a graph of *system* cpu time (not user), note that 3.18 was a005.block:
>>>
>>> http://ponies.io/raw/cluster.png
>>>
>>> It is perhaps faring a little better that those chassis running the 3.10 in
>>> that it did not have min_free_kbytes raised to 2GB as the others did, instead
>>> it was sitting around 90MB.
>>>
>>> The perf recording did look a little different. Not sure if this was just the
>>> luck of the draw in how the fractal rendering works:
>>>
>>> http://ponies.io/raw/perf-3.10.png
>>>
>>> Any pointers on how we can track this down? There's at least three of us
>>> following at this now so we should have plenty of area to test.
>>
>> Checked against 3.16 (3.17 hanged for an unrelated problem), the issue
>> is presented for single- and two-headed systems as well. Ceph-users
>> reported presence of the problem for 3.17, so probably we are facing
>> generic compaction issue.
>>
> Hello,
>
> I didn't follow-up this discussion, but, at glance, this excessive CPU
> usage by compaction is related to following fixes.
>
> Could you test following two patches?
>
> If these fixes your problem, I will resumit patches with proper commit
> description.
>
> Thanks.
>
> -------->8-------------
>  From 079f3f119f1e3cbe9d981e7d0cada94e0c532162 Mon Sep 17 00:00:00 2001
> From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Date: Fri, 28 Nov 2014 16:36:00 +0900
> Subject: [PATCH 1/2] mm/compaction: fix wrong order check in
>   compact_finished()
>
> What we want to check here is whether there is highorder freepage
> in buddy list of other migratetype in order to steal it without
> fragmentation. But, current code just checks cc->order which means
> allocation request order. So, this is wrong.
>
> Without this fix, non-movable synchronous compaction below pageblock order
> would not stopped until compaction complete, because migratetype of most
> pageblocks are movable and cc->order is always below than pageblock order
> in this case.
>
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> ---
>   mm/compaction.c |    2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index b544d61..052194f 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1082,7 +1082,7 @@ static int compact_finished(struct zone *zone, struct compact_control *cc,
>   			return COMPACT_PARTIAL;
>   
>   		/* Job done if allocation would set block type */
> -		if (cc->order >= pageblock_order && area->nr_free)
> +		if (order >= pageblock_order && area->nr_free)
>   			return COMPACT_PARTIAL;

Dang, good catch!
But I wonder, are MIGRATE_RESERVE pages counted towards area->nr_free?
Seems to me that they are, so this check can have false positives?
Hm, and for unmovable allocations, MIGRATE_CMA pages are probably the same case?

Vlastimil

>   	}
>   
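For readers following the diff above, here is a minimal sketch of the
surrounding loop in compact_finished() (paraphrased from a ~3.18 tree; the
exact code in any given kernel may differ). It shows why the check has to use
the loop variable 'order' walking the free lists rather than the request size
cc->order:

	/* sketch only -- not the verbatim upstream code */
	for (order = cc->order; order < MAX_ORDER; order++) {
		struct free_area *area = &zone->free_area[order];

		/* Job done if a page is free of the right migratetype */
		if (!list_empty(&area->free_list[migratetype]))
			return COMPACT_PARTIAL;

		/*
		 * Job done if allocation would set block type: a free area
		 * at order >= pageblock_order can be stolen wholesale for
		 * the requested migratetype. With cc->order used here, this
		 * branch never fired for requests below pageblock_order.
		 */
		if (order >= pageblock_order && area->nr_free)
			return COMPACT_PARTIAL;
	}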


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: isolate_freepages_block and excessive CPU usage by OSD process
  2014-11-28  9:26                 ` Vlastimil Babka
@ 2014-12-01  8:31                   ` Joonsoo Kim
  2014-12-02  1:47                     ` Christian Marie
  0 siblings, 1 reply; 36+ messages in thread
From: Joonsoo Kim @ 2014-12-01  8:31 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrey Korolyov, linux-mm, Christoph Lameter, David Rientjes,
	Andrew Morton

On Fri, Nov 28, 2014 at 10:26:15AM +0100, Vlastimil Babka wrote:
> On 28.11.2014 9:03, Joonsoo Kim wrote:
> >On Tue, Nov 25, 2014 at 01:48:42AM +0400, Andrey Korolyov wrote:
> >>On Sun, Nov 23, 2014 at 12:33 PM, Christian Marie <christian@ponies.io> wrote:
> >>>Here's an update:
> >>>
> >>>Tried running 3.18.0-rc5 over the weekend to no avail. A load spike through
> >>>Ceph brings no perceived improvement over the chassis running 3.10 kernels.
> >>>
> >>>Here is a graph of *system* cpu time (not user), note that 3.18 was a005.block:
> >>>
> >>>http://ponies.io/raw/cluster.png
> >>>
> >>>It is perhaps faring a little better that those chassis running the 3.10 in
> >>>that it did not have min_free_kbytes raised to 2GB as the others did, instead
> >>>it was sitting around 90MB.
> >>>
> >>>The perf recording did look a little different. Not sure if this was just the
> >>>luck of the draw in how the fractal rendering works:
> >>>
> >>>http://ponies.io/raw/perf-3.10.png
> >>>
> >>>Any pointers on how we can track this down? There's at least three of us
> >>>following at this now so we should have plenty of area to test.
> >>
> >>Checked against 3.16 (3.17 hanged for an unrelated problem), the issue
> >>is presented for single- and two-headed systems as well. Ceph-users
> >>reported presence of the problem for 3.17, so probably we are facing
> >>generic compaction issue.
> >>
> >Hello,
> >
> >I didn't follow-up this discussion, but, at glance, this excessive CPU
> >usage by compaction is related to following fixes.
> >
> >Could you test following two patches?
> >
> >If these fixes your problem, I will resumit patches with proper commit
> >description.
> >
> >Thanks.
> >
> >-------->8-------------
> > From 079f3f119f1e3cbe9d981e7d0cada94e0c532162 Mon Sep 17 00:00:00 2001
> >From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> >Date: Fri, 28 Nov 2014 16:36:00 +0900
> >Subject: [PATCH 1/2] mm/compaction: fix wrong order check in
> >  compact_finished()
> >
> >What we want to check here is whether there is highorder freepage
> >in buddy list of other migratetype in order to steal it without
> >fragmentation. But, current code just checks cc->order which means
> >allocation request order. So, this is wrong.
> >
> >Without this fix, non-movable synchronous compaction below pageblock order
> >would not stopped until compaction complete, because migratetype of most
> >pageblocks are movable and cc->order is always below than pageblock order
> >in this case.
> >
> >Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> >---
> >  mm/compaction.c |    2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> >diff --git a/mm/compaction.c b/mm/compaction.c
> >index b544d61..052194f 100644
> >--- a/mm/compaction.c
> >+++ b/mm/compaction.c
> >@@ -1082,7 +1082,7 @@ static int compact_finished(struct zone *zone, struct compact_control *cc,
> >  			return COMPACT_PARTIAL;
> >  		/* Job done if allocation would set block type */
> >-		if (cc->order >= pageblock_order && area->nr_free)
> >+		if (order >= pageblock_order && area->nr_free)
> >  			return COMPACT_PARTIAL;
> 
> Dang, good catch!
> But I wonder, are MIGRATE_RESERVE pages counted towards area->nr_free?
> Seems to me that they are, so this check can have false positives?
> Hm probably for unmovable allocation, MIGRATE_CMA pages is the same case?
> 

Hello,

Although MIGRATE_RESERVE pages are counted in area->nr_free, at this
point there are no free pages left on MIGRATE_RESERVE; they would
already have been used before compaction was triggered.

In the case of MIGRATE_CMA, false positives are possible. But the same
check is also broken in __zone_watermark_ok(); without an
area->nr_free_cma counter we can't make the check accurate. Please see
the following link.

https://lkml.org/lkml/2014/6/2/1
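
For context, a condensed sketch of the relevant part of __zone_watermark_ok()
(paraphrased from a ~3.18 tree; details vary between versions), showing why
the per-order check cannot exclude CMA pages:

	/* sketch only -- abridged from __zone_watermark_ok() */
#ifdef CONFIG_CMA
	/* the zone-wide total can exclude free CMA pages... */
	if (!(alloc_flags & ALLOC_CMA))
		free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES);
#endif
	...
	for (o = 0; o < order; o++) {
		/*
		 * ...but free_area[o].nr_free has no CMA-only counterpart
		 * (no nr_free_cma), so the per-order check can still count
		 * CMA pages that an unmovable allocation cannot use.
		 */
		free_pages -= z->free_area[o].nr_free;

		/* Require a fraction to be free at each higher order */
		min >>= 1;

		if (free_pages <= min)
			return false;
	}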

Thanks.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: isolate_freepages_block and excessive CPU usage by OSD process
  2014-12-01  8:31                   ` Joonsoo Kim
@ 2014-12-02  1:47                     ` Christian Marie
  2014-12-02  4:53                       ` Joonsoo Kim
  0 siblings, 1 reply; 36+ messages in thread
From: Christian Marie @ 2014-12-02  1:47 UTC (permalink / raw)
  To: linux-mm

[-- Attachment #1: Type: text/plain, Size: 1061 bytes --]

On 28.11.2014 9:03, Joonsoo Kim wrote:
> Hello,
>
> I didn't follow-up this discussion, but, at glance, this excessive CPU
> usage by compaction is related to following fixes.
>
> Could you test following two patches?
>
> If these fixes your problem, I will resumit patches with proper commit
> description.
>
> -------- 8< ---------


Thanks for looking into this. Running a 3.18-rc5 kernel with your patches has
produced some interesting results.

Load average still spikes to around 2000-3000, with the processors spinning at
100% doing compaction-related things, when min_free_kbytes is left at the default.

However, unlike before, the system is now completely stable. Pre-patch it would
be almost completely unresponsive (having to wait 30 seconds to establish an
SSH connection and several seconds to send a character).

Is it reasonable to guess that ipoib is giving compaction a hard time, and
that fixing this bug has allowed the system to at least not lock up?

I will try back-porting this to 3.10 and seeing if it is stable under these
strange conditions also.

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: isolate_freepages_block and excessive CPU usage by OSD process
  2014-12-02  1:47                     ` Christian Marie
@ 2014-12-02  4:53                       ` Joonsoo Kim
  2014-12-02  5:06                         ` Christian Marie
  2014-12-02 15:46                         ` Vlastimil Babka
  0 siblings, 2 replies; 36+ messages in thread
From: Joonsoo Kim @ 2014-12-02  4:53 UTC (permalink / raw)
  To: linux-mm

On Tue, Dec 02, 2014 at 12:47:24PM +1100, Christian Marie wrote:
> On 28.11.2014 9:03, Joonsoo Kim wrote:
> > Hello,
> >
> > I didn't follow-up this discussion, but, at glance, this excessive CPU
> > usage by compaction is related to following fixes.
> >
> > Could you test following two patches?
> >
> > If these fixes your problem, I will resumit patches with proper commit
> > description.
> >
> > -------- 8< ---------
> 
> 
> Thanks for looking into this. Running 3.18-rc5 kernel with your patches has
> produced some interesting results.
> 
> Load average still spikes to around 2000-3000 with the processors spinning 100%
> doing compaction related things when min_free_kbytes is left at the default.
> 
> However, unlike before, the system is now completely stable. Pre-patch it would
> be almost completely unresponsive (having to wait 30 seconds to establish an
> SSH connection and several seconds to send a character).
> 
> Is it reasonable to guess that ipoib is giving compaction a hard time and
> fixing this bug has allowed the system to at least not lock up?
> 
> I will try back-porting this to 3.10 and seeing if it is stable under these
> strange conditions also.

Hello,

Good to hear!
The load average spike may be related to skip bit management. Currently, there
is no way to maintain the skip bits permanently, so after one iteration of
compaction finishes and the skip bits are reset, all pageblocks have to be
re-scanned.

Your system has the Mellanox driver, and although I don't know exactly what it
does, I have heard that it allocates an enormous number of pages and does
get_user_pages() to pin them in memory. That memory isn't available to
compaction, but compaction still scans it every time.
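
As an illustration of that cost, here is a sketch paraphrased from
isolate_migratepages_block() in trees that carry the "avoid isolating pinned
pages" change (roughly v3.15 onwards; not the verbatim code): the migration
scanner still walks over pinned pages and can only skip them one by one:

	/*
	 * Migration will fail if an anonymous page is pinned in memory
	 * (e.g. by get_user_pages()), so avoid taking lru_lock and
	 * isolating it unnecessarily in an admittedly racy check.
	 */
	if (!page_mapping(page) &&
	    page_count(page) > page_mapcount(page))
		continue;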

This is just my assumption, so if possible, please check it with the
compaction tracepoints. If it turns out to be the case, we can come up with a
solution for this problem.

Anyway, could you test one more time without the second patch?
IMO, the first patch is reasonable to backport because it fixes a real bug,
but I'm not sure whether the second patch needs to be backported or not.
One more round of testing will help us understand the effect of each patch.

Thanks.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: isolate_freepages_block and excessive CPU usage by OSD process
  2014-12-02  4:53                       ` Joonsoo Kim
@ 2014-12-02  5:06                         ` Christian Marie
  2014-12-03  4:04                           ` Christian Marie
  2014-12-03  7:57                           ` Joonsoo Kim
  2014-12-02 15:46                         ` Vlastimil Babka
  1 sibling, 2 replies; 36+ messages in thread
From: Christian Marie @ 2014-12-02  5:06 UTC (permalink / raw)
  To: linux-mm

[-- Attachment #1: Type: text/plain, Size: 619 bytes --]

On Tue, Dec 02, 2014 at 01:53:24PM +0900, Joonsoo Kim wrote:
> This is just my assumption, so if possible, please check it with
> compaction tracepoint. If it is, we can make a solution for this
> problem.

Which event/function would you like me to trace specifically?

> Anyway, could you test one more time without second patch?
> IMO, first patch is reasonable to backport, because it fixes a real bug.
> But, I'm not sure if second patch is needed to backport or not.
> One more testing will help us to understand the effect of patch.

I will attempt to do this tomorrow and should have results in around 24 hours.

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: isolate_freepages_block and excessive CPU usage by OSD process
  2014-12-02  4:53                       ` Joonsoo Kim
  2014-12-02  5:06                         ` Christian Marie
@ 2014-12-02 15:46                         ` Vlastimil Babka
  2014-12-03  7:49                           ` Joonsoo Kim
  1 sibling, 1 reply; 36+ messages in thread
From: Vlastimil Babka @ 2014-12-02 15:46 UTC (permalink / raw)
  To: Joonsoo Kim, linux-mm

On 12/02/2014 05:53 AM, Joonsoo Kim wrote:
> On Tue, Dec 02, 2014 at 12:47:24PM +1100, Christian Marie wrote:
>> On 28.11.2014 9:03, Joonsoo Kim wrote:
>>> Hello,
>>>
>>> I didn't follow-up this discussion, but, at glance, this excessive CPU
>>> usage by compaction is related to following fixes.
>>>
>>> Could you test following two patches?
>>>
>>> If these fixes your problem, I will resumit patches with proper commit
>>> description.
>>>
>>> -------- 8< ---------
>>
>>
>> Thanks for looking into this. Running 3.18-rc5 kernel with your patches has
>> produced some interesting results.
>>
>> Load average still spikes to around 2000-3000 with the processors spinning 100%
>> doing compaction related things when min_free_kbytes is left at the default.
>>
>> However, unlike before, the system is now completely stable. Pre-patch it would
>> be almost completely unresponsive (having to wait 30 seconds to establish an
>> SSH connection and several seconds to send a character).
>>
>> Is it reasonable to guess that ipoib is giving compaction a hard time and
>> fixing this bug has allowed the system to at least not lock up?
>>
>> I will try back-porting this to 3.10 and seeing if it is stable under these
>> strange conditions also.
>
> Hello,
>
> Good to hear!

Indeed, although I somehow doubt your first patch could have made such a
difference. It only matters when you have a whole pageblock free.
Without the patch, the particular compaction attempt that managed to 
free the block might not be terminated ASAP, but then the free pageblock 
is still allocatable by the following allocation attempts, so it 
shouldn't result in a stream of complete compactions.

So I would expect it's either a fluke, or the second patch made the 
difference, to either SLUB or something else making such fallback-able 
allocations.

But hmm, I've never considered the implications of compact_finished() 
migratetypes handling on unmovable allocations. Regardless of cc->order, 
it often has to free a whole pageblock to succeed, as it's unlikely it 
will succeed compacting within a pageblock already marked as UNMOVABLE. 
Guess it's to prevent further fragmentation and that makes sense, but it 
does make high-order unmovable allocations problematic. At least the 
watermark checks for allowing compaction in the first place are then 
wrong: we decide that based on cc->order, but we in fact need at least
a pageblock's worth of free space to actually succeed.
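
To make that concrete, here is a rough sketch of the entry check (paraphrased
from compaction_suitable() in a ~3.18 tree; exact details may differ). The
watermark is derived from the request order even though, for an unmovable
request, success effectively requires freeing a whole pageblock:

	/* sketch only */
	unsigned long compaction_suitable(struct zone *zone, int order)
	{
		unsigned long watermark;
		...
		/*
		 * The order-0 watermark plus 2 << order must be met so that
		 * migration targets can be allocated; note this scales with
		 * the request order, not with pageblock_order.
		 */
		watermark = low_wmark_pages(zone) + (2UL << order);
		if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
			return COMPACT_SKIPPED;
		...
	}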

> Load average spike may be related to skip bit management. Currently, there is
> no way to maintain skip bit permanently. So, after one iteration of compaction
> is finished and skip bit is reset, all pageblocks should be re-scanned.

It shouldn't be "after one iteration of compaction"; the bits are cleared
only when compaction is restarting after being deferred, or when kswapd
goes to sleep.
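
For reference, a rough sketch of where those resets happen in a ~3.18 tree
(paraphrased; the exact call sites vary between versions):

	/* compact_zone(): only when retrying after deferred compaction */
	if (compaction_restarting(zone, cc->order) && !current_is_kswapd())
		__reset_isolation_suitable(zone);

	/*
	 * kswapd_try_to_sleep(): reset cached pfns and skip bits on the
	 * assumption that compaction may succeed once kswapd has balanced
	 * the node and is about to sleep.
	 */
	reset_isolation_suitable(pgdat);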

> Your system has mellanox driver and although I don't know exactly what it is,
> I heard that it allocates enormous pages and do get_user_pages() to
> pin pages in memory. These memory aren't available to compaction, but,
> compaction always scan it.
>
> This is just my assumption, so if possible, please check it with
> compaction tracepoint. If it is, we can make a solution for this
> problem.
>
> Anyway, could you test one more time without second patch?
> IMO, first patch is reasonable to backport, because it fixes a real bug.
> But, I'm not sure if second patch is needed to backport or not.
> One more testing will help us to understand the effect of patch.
>
> Thanks.
>
>


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: isolate_freepages_block and excessive CPU usage by OSD process
  2014-12-02  5:06                         ` Christian Marie
@ 2014-12-03  4:04                           ` Christian Marie
  2014-12-03  8:05                             ` Joonsoo Kim
  2014-12-04 23:30                             ` Vlastimil Babka
  2014-12-03  7:57                           ` Joonsoo Kim
  1 sibling, 2 replies; 36+ messages in thread
From: Christian Marie @ 2014-12-03  4:04 UTC (permalink / raw)
  To: linux-mm

[-- Attachment #1: Type: text/plain, Size: 3602 bytes --]

On Tue, Dec 02, 2014 at 04:06:08PM +1100, Christian Marie wrote:
> I will attempt to do this tomorrow and should have results in around 24 hours.

I ran said test today and wasn't able to pinpoint a solid difference between a
kernel with both patches and one with only the first. The one with both patches
"felt" a little more responsive, but that is probably a fluke.

I'd really like to write a stress test that simulates what ceph/ipoib is doing
here so that I can test this in a more scientific manner.

Here is some perf output; the kernel with only the first patch is on the right:

http://ponies.io/raw/before-after.png


A note in passing: we left the cluster running with min_free_kbytes set to the
default last night, and within a few hours it started spewing the usual
pre-patch allocation failures. So whilst this patch appears to make the system
more responsive under adverse conditions, the underlying
not-keeping-up-with-pressure issue is still there.

There's enough starvation to break single page allocations.

Keep in mind that this is on a 3.10 kernel with the patches applied, so I'm not
expecting anyone to particularly care; I'm just running out of time to test the
whole cluster on 3.18. I really do think that replicating the allocation
pattern is the best way forward, but my attempts at simply sending a lot of
similar-looking packets with lots of page cache don't reproduce it.

Those allocation failures on 3.10 with both patches look like this:

	[73138.803800] ceph-osd: page allocation failure: order:0, mode:0x20
	[73138.803802] CPU: 0 PID: 9214 Comm: ceph-osd Tainted: GF O--------------   3.10.0-123.9.3.anchor.x86_64 #1
	[73138.803803] Hardware name: Dell Inc. PowerEdge R720xd/0X3D66, BIOS 2.2.2 01/16/2014
	[73138.803803]  0000000000000020 00000000d6532f99 ffff88081fa03aa0 ffffffff815e23bb
	[73138.803806]  ffff88081fa03b30 ffffffff81147340 00000000ffffffff ffff8807da887900
	[73138.803808]  ffff88083ffd9e80 ffff8800b2242900 ffff8807d843c050 00000000d6532f99
	[73138.803812] Call Trace:
	[73138.803813]  <IRQ>  [<ffffffff815e23bb>] dump_stack+0x19/0x1b
	[73138.803817]  [<ffffffff81147340>] warn_alloc_failed+0x110/0x180
	[73138.803819]  [<ffffffff8114b4ee>] __alloc_pages_nodemask+0x91e/0xb20
	[73138.803821]  [<ffffffff8152f82a>] ? tcp_v4_rcv+0x67a/0x7c0
	[73138.803823]  [<ffffffff81509710>] ? ip_rcv_finish+0x350/0x350
	[73138.803826]  [<ffffffff81188369>] alloc_pages_current+0xa9/0x170
	[73138.803828]  [<ffffffff814bedb1>] __netdev_alloc_frag+0x91/0x140
	[73138.803831]  [<ffffffff814c0df7>] __netdev_alloc_skb+0x77/0xc0
	[73138.803834]  [<ffffffffa06b54c5>] ipoib_cm_handle_rx_wc+0xf5/0x940 [ib_ipoib]
	[73138.803838]  [<ffffffffa0625e78>] ? mlx4_ib_poll_cq+0xc8/0x210 [mlx4_ib]
	[73138.803841]  [<ffffffffa06a90ed>] ipoib_poll+0x8d/0x150 [ib_ipoib]
	[73138.803843]  [<ffffffff814d05aa>] net_rx_action+0x15a/0x250
	[73138.803846]  [<ffffffff81067047>] __do_softirq+0xf7/0x290
	[73138.803848]  [<ffffffff815f43dc>] call_softirq+0x1c/0x30
	[73138.803851]  [<ffffffff81014d25>] do_softirq+0x55/0x90
	[73138.803853]  [<ffffffff810673e5>] irq_exit+0x115/0x120
	[73138.803855]  [<ffffffff815f4cd8>] do_IRQ+0x58/0xf0
	[73138.803857]  [<ffffffff815e9e2d>] common_interrupt+0x6d/0x6d
	[73138.803858]  <EOI>  [<ffffffff815f2bc0>] ? sysret_audit+0x17/0x21

We also get some like this:

[ 1293.152415] SLUB: Unable to allocate memory on node -1 (gfp=0x20)
[ 1293.152416]   cache: kmalloc-256, object size: 256, buffer size: 256, default order: 1, min order: 0
[ 1293.152417]   node 0: slabs: 1789, objs: 57248, free: 0
[ 1293.152418]   node 1: slabs: 449, objs: 14368, free: 2


[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: isolate_freepages_block and excessive CPU usage by OSD process
  2014-12-02 15:46                         ` Vlastimil Babka
@ 2014-12-03  7:49                           ` Joonsoo Kim
  2014-12-03 12:43                             ` Vlastimil Babka
  0 siblings, 1 reply; 36+ messages in thread
From: Joonsoo Kim @ 2014-12-03  7:49 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: linux-mm

On Tue, Dec 02, 2014 at 04:46:33PM +0100, Vlastimil Babka wrote:
> On 12/02/2014 05:53 AM, Joonsoo Kim wrote:
> >On Tue, Dec 02, 2014 at 12:47:24PM +1100, Christian Marie wrote:
> >>On 28.11.2014 9:03, Joonsoo Kim wrote:
> >>>Hello,
> >>>
> >>>I didn't follow-up this discussion, but, at glance, this excessive CPU
> >>>usage by compaction is related to following fixes.
> >>>
> >>>Could you test following two patches?
> >>>
> >>>If these fixes your problem, I will resumit patches with proper commit
> >>>description.
> >>>
> >>>-------- 8< ---------
> >>
> >>
> >>Thanks for looking into this. Running 3.18-rc5 kernel with your patches has
> >>produced some interesting results.
> >>
> >>Load average still spikes to around 2000-3000 with the processors spinning 100%
> >>doing compaction related things when min_free_kbytes is left at the default.
> >>
> >>However, unlike before, the system is now completely stable. Pre-patch it would
> >>be almost completely unresponsive (having to wait 30 seconds to establish an
> >>SSH connection and several seconds to send a character).
> >>
> >>Is it reasonable to guess that ipoib is giving compaction a hard time and
> >>fixing this bug has allowed the system to at least not lock up?
> >>
> >>I will try back-porting this to 3.10 and seeing if it is stable under these
> >>strange conditions also.
> >
> >Hello,
> >
> >Good to hear!
> 
> Indeed, although I somehow doubt your first patch could have made
> such difference. It only matters when you have a whole pageblock
> free. Without the patch, the particular compaction attempt that
> managed to free the block might not be terminated ASAP, but then the
> free pageblock is still allocatable by the following allocation
> attempts, so it shouldn't result in a stream of complete
> compactions.

A high-order freepage made by compaction could be broken up by other
order-0 allocation attempts, so subsequent high-order allocation attempts
could result in new compaction. It would depend on the workload.

Anyway, we should fix cc->order to order. :)

> 
> So I would expect it's either a fluke, or the second patch made the
> difference, to either SLUB or something else making such
> fallback-able allocations.
> 
> But hmm, I've never considered the implications of
> compact_finished() migratetypes handling on unmovable allocations.
> Regardless of cc->order, it often has to free a whole pageblock to
> succeed, as it's unlikely it will succeed compacting within a
> pageblock already marked as UNMOVABLE. Guess it's to prevent further
> fragmentation and that makes sense, but it does make high-order
> unmovable allocations problematic. At least the watermark checks for
> allowing compaction in the first place are then wrong - we decide
> that based on cc->order, but in we fact need at least a pageblock
> worth of space free to actually succeed.

I think that the watermark check is okay, but we need an elegant way to decide
the best time for compaction to be stopped. I made the following two patches
about this. This patch would make non-movable compaction less
aggressive. This is just a draft, so ignore my poor description. :)

Could you comment on it?

--------->8-----------------

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: isolate_freepages_block and excessive CPU usage by OSD process
  2014-12-02  5:06                         ` Christian Marie
  2014-12-03  4:04                           ` Christian Marie
@ 2014-12-03  7:57                           ` Joonsoo Kim
  2014-12-04  7:30                             ` Christian Marie
  1 sibling, 1 reply; 36+ messages in thread
From: Joonsoo Kim @ 2014-12-03  7:57 UTC (permalink / raw)
  To: linux-mm

On Tue, Dec 02, 2014 at 04:06:08PM +1100, Christian Marie wrote:
> On Tue, Dec 02, 2014 at 01:53:24PM +0900, Joonsoo Kim wrote:
> > This is just my assumption, so if possible, please check it with
> > compaction tracepoint. If it is, we can make a solution for this
> > problem.
> 
> Which event/function would you like me to trace specifically?

Hello,

It'd be very helpful to get the output of
"trace_event=compaction:*,kmem:mm_page_alloc_extfrag" on a kernel
with my tracepoint patches below.

See the following link. There are 3 patches.

https://lkml.org/lkml/2014/12/3/71

Thanks.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: isolate_freepages_block and excessive CPU usage by OSD process
  2014-12-03  4:04                           ` Christian Marie
@ 2014-12-03  8:05                             ` Joonsoo Kim
  2014-12-04 23:30                             ` Vlastimil Babka
  1 sibling, 0 replies; 36+ messages in thread
From: Joonsoo Kim @ 2014-12-03  8:05 UTC (permalink / raw)
  To: linux-mm

On Wed, Dec 03, 2014 at 03:04:04PM +1100, Christian Marie wrote:
> On Tue, Dec 02, 2014 at 04:06:08PM +1100, Christian Marie wrote:
> > I will attempt to do this tomorrow and should have results in around 24 hours.
> 
> I ran said test today and wasn't able to pinpoint a solid difference between a kernel
> with both patches and one with only the first. The one with both patches "felt"
> a little more responsive, probably a fluke.

Thanks! It would help me.

> 
> I'd really like to write a stress test that simulates what ceph/ipoib is doing
> here so that I can test this in a more scientific manner.
> 
> Here is some perf output, the kernel with only the first patch is on the right:
> 
> http://ponies.io/raw/before-after.png
> 
> 
> A note in passing: we left the cluster running with min_free_kbytes set to the
> default last night and within a few hours it started spewing the usual
> pre-patch allocation failures, so whilst this patch appears to make the system
> more responsive under adverse conditions the underlying
> not-keeping-up-with-pressure issue is still there.

I guess that it is caused by allocations arriving too fast. If your allocation
rate is higher than kswapd's reclaim rate and the allocation has no GFP_WAIT,
failure is possible. The following failure log looks like that case. In this
case, enlarging min_free_kbytes may be the right solution, but I'm not an
expert on that, so please consult the other MM guys.
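
For what it's worth, decoding that mode value (assuming 3.10-era flag values
from include/linux/gfp.h; an abridged sketch, not the full header):

	#define ___GFP_WAIT	0x10u	/* can sleep and enter reclaim */
	#define ___GFP_HIGH	0x20u	/* may dip into emergency reserves */
	/*
	 * mode:0x20 is therefore __GFP_HIGH without __GFP_WAIT, i.e.
	 * GFP_ATOMIC: the ipoib receive path runs in softirq context, can
	 * neither sleep nor reclaim, and fails as soon as free pages drop
	 * below the watermark that kswapd is failing to maintain.
	 */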

> There's enough starvation to break single page allocations.
> 
> Keep in mind that this is on a 3.10 kernel with the patches applied so I'm not
> expecting anyone to particularly care. I'm running out of time to test the
> whole cluster at 3.18 is all, I really do think that replicating the allocation
> pattern is the best way forward but my attempts at simply sending a lot of
> packets that look similar with lots of page cache don't do it.
> 
> Those allocation failures on 3.10 with both patches look like this:
> 
> 	[73138.803800] ceph-osd: page allocation failure: order:0, mode:0x20
> 	[73138.803802] CPU: 0 PID: 9214 Comm: ceph-osd Tainted: GF
> 	O--------------   3.10.0-123.9.3.anchor.x86_64 #1
> 	[73138.803803] Hardware name: Dell Inc. PowerEdge R720xd/0X3D66, BIOS 2.2.2
> 	01/16/2014
> 	[73138.803803]  0000000000000020 00000000d6532f99 ffff88081fa03aa0
> 	ffffffff815e23bb
> 	[73138.803806]  ffff88081fa03b30 ffffffff81147340 00000000ffffffff
> 	ffff8807da887900
> 	[73138.803808]  ffff88083ffd9e80 ffff8800b2242900 ffff8807d843c050
> 	00000000d6532f99
> 	[73138.803812] Call Trace:
> 	[73138.803813]  <IRQ>  [<ffffffff815e23bb>] dump_stack+0x19/0x1b
> 	[73138.803817]  [<ffffffff81147340>] warn_alloc_failed+0x110/0x180
> 	[73138.803819]  [<ffffffff8114b4ee>] __alloc_pages_nodemask+0x91e/0xb20
> 	[73138.803821]  [<ffffffff8152f82a>] ? tcp_v4_rcv+0x67a/0x7c0
> 	[73138.803823]  [<ffffffff81509710>] ? ip_rcv_finish+0x350/0x350
> 	[73138.803826]  [<ffffffff81188369>] alloc_pages_current+0xa9/0x170
> 	[73138.803828]  [<ffffffff814bedb1>] __netdev_alloc_frag+0x91/0x140
> 	[73138.803831]  [<ffffffff814c0df7>] __netdev_alloc_skb+0x77/0xc0
> 	[73138.803834]  [<ffffffffa06b54c5>] ipoib_cm_handle_rx_wc+0xf5/0x940
> 	[ib_ipoib]
> 	[73138.803838]  [<ffffffffa0625e78>] ? mlx4_ib_poll_cq+0xc8/0x210 [mlx4_ib]
> 	[73138.803841]  [<ffffffffa06a90ed>] ipoib_poll+0x8d/0x150 [ib_ipoib]
> 	[73138.803843]  [<ffffffff814d05aa>] net_rx_action+0x15a/0x250
> 	[73138.803846]  [<ffffffff81067047>] __do_softirq+0xf7/0x290
> 	[73138.803848]  [<ffffffff815f43dc>] call_softirq+0x1c/0x30
> 	[73138.803851]  [<ffffffff81014d25>] do_softirq+0x55/0x90
> 	[73138.803853]  [<ffffffff810673e5>] irq_exit+0x115/0x120
> 	[73138.803855]  [<ffffffff815f4cd8>] do_IRQ+0x58/0xf0
> 	[73138.803857]  [<ffffffff815e9e2d>] common_interrupt+0x6d/0x6d
> 	[73138.803858]  <EOI>  [<ffffffff815f2bc0>] ? sysret_audit+0x17/0x21
> 
> We get some like this, also:
> 
> [ 1293.152415] SLUB: Unable to allocate memory on node -1 (gfp=0x20)
> [ 1293.152416]   cache: kmalloc-256, object size: 256, buffer size: 256,
> default order: 1, min order: 0
> [ 1293.152417]   node 0: slabs: 1789, objs: 57248, free: 0
> [ 1293.152418]   node 1: slabs: 449, objs: 14368, free: 2
> 



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: isolate_freepages_block and excessive CPU usage by OSD process
  2014-12-03  7:49                           ` Joonsoo Kim
@ 2014-12-03 12:43                             ` Vlastimil Babka
  2014-12-04  6:53                               ` Joonsoo Kim
  0 siblings, 1 reply; 36+ messages in thread
From: Vlastimil Babka @ 2014-12-03 12:43 UTC (permalink / raw)
  To: Joonsoo Kim; +Cc: linux-mm

On 12/03/2014 08:49 AM, Joonsoo Kim wrote:
> On Tue, Dec 02, 2014 at 04:46:33PM +0100, Vlastimil Babka wrote:
>> 
>> Indeed, although I somehow doubt your first patch could have made
>> such difference. It only matters when you have a whole pageblock
>> free. Without the patch, the particular compaction attempt that
>> managed to free the block might not be terminated ASAP, but then the
>> free pageblock is still allocatable by the following allocation
>> attempts, so it shouldn't result in a stream of complete
>> compactions.
> 
> High-order freepage made by compaction could be broken by other
> order-0 allocation attempts, so following high-order allocation attempts
> could result in new compaction. It would be dependent on workload.
> 
> Anyway, we should fix cc->order to order. :)

Sure, no doubts about it.

>> 
>> So I would expect it's either a fluke, or the second patch made the
>> difference, to either SLUB or something else making such
>> fallback-able allocations.
>> 
>> But hmm, I've never considered the implications of
>> compact_finished() migratetypes handling on unmovable allocations.
>> Regardless of cc->order, it often has to free a whole pageblock to
>> succeed, as it's unlikely it will succeed compacting within a
>> pageblock already marked as UNMOVABLE. Guess it's to prevent further
>> fragmentation and that makes sense, but it does make high-order
>> unmovable allocations problematic. At least the watermark checks for
>> allowing compaction in the first place are then wrong - we decide
>> that based on cc->order, but in we fact need at least a pageblock
>> worth of space free to actually succeed.
> 
> I think that watermark check is okay but we need a elegant way to decide
> the best timing compaction should be stopped. I made following two patches
> about this. This patch would make non-movable compaction less
> aggressive. This is just draft so ignore my poor description. :)
> 
> Could you comment it?
> 
> --------->8-----------------
> From bd6b285c38fd94e5ec03a720bed4debae3914bde Mon Sep 17 00:00:00 2001
> From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Date: Mon, 1 Dec 2014 11:56:57 +0900
> Subject: [PATCH 1/2] mm/page_alloc: expands broken freepage to proper buddy
>  list when steal
> 
> There is odd behaviour when we steal freepages from other migratetype
> buddy list. In try_to_steal_freepages(), we move all freepages in
> the pageblock that founded freepage is belong to to the request
> migratetype in order to mitigate fragmentation. If the number of moved
> pages are enough to change pageblock migratetype, there is no problem. If
> not enough, we don't change pageblock migratetype and add broken freepages
> to the original migratetype buddy list rather than request migratetype
> one. For me, this is odd, because we already moved all freepages in this
> pageblock to the request migratetype. This patch fixes this situation to
> add broken freepages to the request migratetype buddy list in this case.
>

Yeah, I noticed this a while ago and traced the history of how it happened.
But surprisingly, just changing this back didn't evaluate as a clear win, so I
have added some further tuning. I will try to send this ASAP.

> This patch introduce new function that can help to decide if we can
> steal the page without resulting in fragmentation. It will be used in
> following patch for compaction finish criteria.
> 
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> ---
> +static bool can_steal_freepages(unsigned int order,
> +			int start_mt, int fallback_mt)
> +{
> +	/*
> +	 * When borrowing from MIGRATE_CMA, we need to release the excess
> +	 * buddy pages to CMA itself. We also ensure the freepage_migratetype
> +	 * is set to CMA so it is returned to the correct freelist in case
> +	 * the page ends up being not actually allocated from the pcp lists.
> +	 */
> +	if (is_migrate_cma(fallback_mt))
> +		return false;
>  
> -	}
> +	/* Can take ownership for orders >= pageblock_order */
> +	if (order >= pageblock_order)
> +		return true;
> +
> +	if (order >= pageblock_order / 2 ||
> +		start_mt == MIGRATE_RECLAIMABLE ||
> +		page_group_by_mobility_disabled)
> +		return true;
>  
> -	return fallback_type;
> +	return false;

Note that this is not exactly consistent between compaction and allocation.
Allocation will succeed as long as a large enough fallback page exists - it
just might not steal the extra free pages if the fallback page order is low (or
the allocation is not MIGRATE_RECLAIMABLE). But for compaction, with your
patches you still evaluate whether it can also steal the extra pages, so it's a
stricter condition. It might make sense, but let's not claim it's fully
consistent? And it definitely needs evaluation...

Vlastimil


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: isolate_freepages_block and excessive CPU usage by OSD process
  2014-12-03 12:43                             ` Vlastimil Babka
@ 2014-12-04  6:53                               ` Joonsoo Kim
  0 siblings, 0 replies; 36+ messages in thread
From: Joonsoo Kim @ 2014-12-04  6:53 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: linux-mm

On Wed, Dec 03, 2014 at 01:43:31PM +0100, Vlastimil Babka wrote:
> On 12/03/2014 08:49 AM, Joonsoo Kim wrote:
> > On Tue, Dec 02, 2014 at 04:46:33PM +0100, Vlastimil Babka wrote:
> >> 
> >> Indeed, although I somehow doubt your first patch could have made
> >> such difference. It only matters when you have a whole pageblock
> >> free. Without the patch, the particular compaction attempt that
> >> managed to free the block might not be terminated ASAP, but then the
> >> free pageblock is still allocatable by the following allocation
> >> attempts, so it shouldn't result in a stream of complete
> >> compactions.
> > 
> > High-order freepage made by compaction could be broken by other
> > order-0 allocation attempts, so following high-order allocation attempts
> > could result in new compaction. It would be dependent on workload.
> > 
> > Anyway, we should fix cc->order to order. :)
> 
> Sure, no doubts about it.

Okay.

> 
> >> 
> >> So I would expect it's either a fluke, or the second patch made the
> >> difference, to either SLUB or something else making such
> >> fallback-able allocations.
> >> 
> >> But hmm, I've never considered the implications of
> >> compact_finished() migratetypes handling on unmovable allocations.
> >> Regardless of cc->order, it often has to free a whole pageblock to
> >> succeed, as it's unlikely it will succeed compacting within a
> >> pageblock already marked as UNMOVABLE. Guess it's to prevent further
> >> fragmentation and that makes sense, but it does make high-order
> >> unmovable allocations problematic. At least the watermark checks for
> >> allowing compaction in the first place are then wrong - we decide
> >> that based on cc->order, but in we fact need at least a pageblock
> >> worth of space free to actually succeed.
> > 
> > I think that watermark check is okay but we need a elegant way to decide
> > the best timing compaction should be stopped. I made following two patches
> > about this. This patch would make non-movable compaction less
> > aggressive. This is just draft so ignore my poor description. :)
> > 
> > Could you comment it?
> > 
> > --------->8-----------------
> > From bd6b285c38fd94e5ec03a720bed4debae3914bde Mon Sep 17 00:00:00 2001
> > From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> > Date: Mon, 1 Dec 2014 11:56:57 +0900
> > Subject: [PATCH 1/2] mm/page_alloc: expands broken freepage to proper buddy
> >  list when steal
> > 
> > There is odd behaviour when we steal freepages from other migratetype
> > buddy list. In try_to_steal_freepages(), we move all freepages in
> > the pageblock that founded freepage is belong to to the request
> > migratetype in order to mitigate fragmentation. If the number of moved
> > pages are enough to change pageblock migratetype, there is no problem. If
> > not enough, we don't change pageblock migratetype and add broken freepages
> > to the original migratetype buddy list rather than request migratetype
> > one. For me, this is odd, because we already moved all freepages in this
> > pageblock to the request migratetype. This patch fixes this situation to
> > add broken freepages to the request migratetype buddy list in this case.
> >
> 
> Yeah, I have noticed this a while ago, and traced the history of how this
> happened. But surprisingly just changing this back didn't evaluate as a clear
> win, so I have added some further tunning. I will try to send this ASAP.

I'd like to see it.

Anyway, if you found no remarkable degradation, merging this patch is better
than leaving things as they are. The current logic is odd, and we don't really
understand how it works or whether it is better or not. So making the logic
understandable deserves consideration.

> 
> > This patch introduce new function that can help to decide if we can
> > steal the page without resulting in fragmentation. It will be used in
> > following patch for compaction finish criteria.
> > 
> > Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> > ---
> > +static bool can_steal_freepages(unsigned int order,
> > +			int start_mt, int fallback_mt)
> > +{
> > +	/*
> > +	 * When borrowing from MIGRATE_CMA, we need to release the excess
> > +	 * buddy pages to CMA itself. We also ensure the freepage_migratetype
> > +	 * is set to CMA so it is returned to the correct freelist in case
> > +	 * the page ends up being not actually allocated from the pcp lists.
> > +	 */
> > +	if (is_migrate_cma(fallback_mt))
> > +		return false;
> >  
> > -	}
> > +	/* Can take ownership for orders >= pageblock_order */
> > +	if (order >= pageblock_order)
> > +		return true;
> > +
> > +	if (order >= pageblock_order / 2 ||
> > +		start_mt == MIGRATE_RECLAIMABLE ||
> > +		page_group_by_mobility_disabled)
> > +		return true;
> >  
> > -	return fallback_type;
> > +	return false;
> 
> Note that this is not exactly consistent for compaction and allocation.

Yes, I know. That's why I asked you to ignore my poor description. :)
Sorry about that.

> Allocation will succeed as long as a large enough fallback page exist - it might
> not just steal extra free pages if the fallback page order is low (or it's not
> for MIGRATE_RECLAIMABLE allocation). But for compaction, with your patches you
> still evaluate whether it can steal also the extra pages, so it's more strict
> condition. It might make sense, but let's not claim it's fully consistent? And
> it definitely needs evaluation...

IMO, it's stricter, but it makes more sense than the current one. Do you agree?
Anyway, I need to evaluate it. My quick attempt produced good results. I will
share them after more testing.

Thanks.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: isolate_freepages_block and excessive CPU usage by OSD process
  2014-12-03  7:57                           ` Joonsoo Kim
@ 2014-12-04  7:30                             ` Christian Marie
  2014-12-04  7:51                               ` Christian Marie
  2014-12-05  1:07                               ` Joonsoo Kim
  0 siblings, 2 replies; 36+ messages in thread
From: Christian Marie @ 2014-12-04  7:30 UTC (permalink / raw)
  To: linux-mm

[-- Attachment #1: Type: text/plain, Size: 1308 bytes --]

On Wed, Dec 03, 2014 at 04:57:47PM +0900, Joonsoo Kim wrote:
> It'd be very helpful to get output of
> "trace_event=compaction:*,kmem:mm_page_alloc_extfrag" on the kernel
> with my tracepoint patches below.
> 
> See following link. There is 3 patches.
> 
> https://lkml.org/lkml/2014/12/3/71

I have just finished testing 3.18-rc5 with both of the small patches mentioned
earlier in this thread and two of your three event patches. The second patch
(https://lkml.org/lkml/2014/12/3/72) did not apply because compaction_suitable
is different here (am I missing another patch you are basing this on?).

My compaction_suitable is:

	unsigned long compaction_suitable(struct zone *zone, int order)

Results without that second event patch are as follows:

Trace under heavy load but before any spiking system usage or significant
compaction spinning:

http://ponies.io/raw/compaction_events/before.gz

Trace during 100% cpu utilization, much of which was in system:

http://ponies.io/raw/compaction_events/during.gz

perf report at the time of during.gz:

http://ponies.io/raw/compaction_events/perf.png

I'm interested to see what you make of the limited information. I may be able
to try all of your patches some time next week, against whatever they apply
cleanly to, if that is needed.

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: isolate_freepages_block and excessive CPU usage by OSD process
  2014-12-04  7:30                             ` Christian Marie
@ 2014-12-04  7:51                               ` Christian Marie
  2014-12-05  1:07                               ` Joonsoo Kim
  1 sibling, 0 replies; 36+ messages in thread
From: Christian Marie @ 2014-12-04  7:51 UTC (permalink / raw)
  To: linux-mm

[-- Attachment #1: Type: text/plain, Size: 1534 bytes --]

An extra note that may or may not be related: I just saw this whilst load
testing:

[177586.215195] swap_free: Unused swap offset entry 0000365b
[177586.215224] BUG: Bad page map in process ceph-osd  pte:006cb600
pmd:fea8a8067
[177586.215260] addr:00007f12dff8a000 vm_flags:00100077
anon_vma:ffff8807e6002000 mapping:          (null) index:7f12dff8a
[177586.215316] CPU: 22 PID: 48567 Comm: ceph-osd Tainted: GF   B
O--------------   3.10.0-123.9.3.anchor.x86_64 #1
[177586.215318] Hardware name: Dell Inc. PowerEdge R720xd/0X3D66, BIOS 2.2.2
01/16/2014
[177586.215319]  00007f12dff8a000 00000000cdae60bd ffff88062ff6bc70
ffffffff815e23bb
[177586.215324]  ffff88062ff6bcb8 ffffffff81167b48 00000000006cb600
00000007f12dff8a
[177586.215329]  ffff880fea8a8c50 00000000006cb600 00007f12dff8a000
00007f12dffde000
[177586.215333] Call Trace:
[177586.215337]  [<ffffffff815e23bb>] dump_stack+0x19/0x1b
[177586.215340]  [<ffffffff81167b48>] print_bad_pte+0x1a8/0x240
[177586.215343]  [<ffffffff811694b0>] unmap_page_range+0x5b0/0x860
[177586.215348]  [<ffffffff811697e1>] unmap_single_vma+0x81/0xf0
[177586.215353]  [<ffffffff8114fade>] ? lru_add_drain_cpu+0xce/0xe0
[177586.215358]  [<ffffffff8116a9f5>] zap_page_range+0x105/0x170
[177586.215361]  [<ffffffff81167354>] SyS_madvise+0x394/0x810
[177586.215366]  [<ffffffff810c30a0>] ? SyS_futex+0x80/0x180

This was on a 3.10 kernel with the two patches mentioned earlier in this
thread. I'm not suggesting it's related; I just thought I'd note it, as I've
never seen a bad page mapping before.

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: isolate_freepages_block and excessive CPU usage by OSD process
  2014-12-03  4:04                           ` Christian Marie
  2014-12-03  8:05                             ` Joonsoo Kim
@ 2014-12-04 23:30                             ` Vlastimil Babka
  2014-12-05  5:50                               ` Christian Marie
  1 sibling, 1 reply; 36+ messages in thread
From: Vlastimil Babka @ 2014-12-04 23:30 UTC (permalink / raw)
  To: linux-mm

On 3.12.2014 5:04, Christian Marie wrote:
> On Tue, Dec 02, 2014 at 04:06:08PM +1100, Christian Marie wrote:
>> I will attempt to do this tomorrow and should have results in around 24 hours.
> I ran said test today and wasn't able to pinpoint a solid difference between a kernel
> with both patches and one with only the first. The one with both patches "felt"
> a little more responsive, probably a fluke.
>
> I'd really like to write a stress test that simulates what ceph/ipoib is doing
> here so that I can test this in a more scientific manner.
>
> Here is some perf output, the kernel with only the first patch is on the right:
>
> http://ponies.io/raw/before-after.png
>
>
> A note in passing: we left the cluster running with min_free_kbytes set to the
> default last night and within a few hours it started spewing the usual
> pre-patch allocation failures, so whilst this patch appears to make the system
> more responsive under adverse conditions the underlying
> not-keeping-up-with-pressure issue is still there.
>
> There's enough starvation to break single page allocations.

Oh, I would think that if you can't allocate single pages, then there's
little wonder that compaction also spends all its time looking for single
free pages. Did that happen just now for the single page allocations,
or was it always the case?

>
> Keep in mind that this is on a 3.10 kernel with the patches applied so I'm not
> expecting anyone to particularly care. I'm running out of time to test the
> whole cluster at 3.18 is all, I really do think that replicating the allocation
> pattern is the best way forward but my attempts at simply sending a lot of
> packets that look similar with lots of page cache don't do it.
>
> Those allocation failures on 3.10 with both patches look like this:
>
> 	[73138.803800] ceph-osd: page allocation failure: order:0, mode:0x20
> 	[73138.803802] CPU: 0 PID: 9214 Comm: ceph-osd Tainted: GF
> 	O--------------   3.10.0-123.9.3.anchor.x86_64 #1
> 	[73138.803803] Hardware name: Dell Inc. PowerEdge R720xd/0X3D66, BIOS 2.2.2
> 	01/16/2014
> 	[73138.803803]  0000000000000020 00000000d6532f99 ffff88081fa03aa0
> 	ffffffff815e23bb
> 	[73138.803806]  ffff88081fa03b30 ffffffff81147340 00000000ffffffff
> 	ffff8807da887900
> 	[73138.803808]  ffff88083ffd9e80 ffff8800b2242900 ffff8807d843c050
> 	00000000d6532f99
> 	[73138.803812] Call Trace:
> 	[73138.803813]  <IRQ>  [<ffffffff815e23bb>] dump_stack+0x19/0x1b
> 	[73138.803817]  [<ffffffff81147340>] warn_alloc_failed+0x110/0x180
> 	[73138.803819]  [<ffffffff8114b4ee>] __alloc_pages_nodemask+0x91e/0xb20
> 	[73138.803821]  [<ffffffff8152f82a>] ? tcp_v4_rcv+0x67a/0x7c0
> 	[73138.803823]  [<ffffffff81509710>] ? ip_rcv_finish+0x350/0x350
> 	[73138.803826]  [<ffffffff81188369>] alloc_pages_current+0xa9/0x170
> 	[73138.803828]  [<ffffffff814bedb1>] __netdev_alloc_frag+0x91/0x140
> 	[73138.803831]  [<ffffffff814c0df7>] __netdev_alloc_skb+0x77/0xc0
> 	[73138.803834]  [<ffffffffa06b54c5>] ipoib_cm_handle_rx_wc+0xf5/0x940
> 	[ib_ipoib]
> 	[73138.803838]  [<ffffffffa0625e78>] ? mlx4_ib_poll_cq+0xc8/0x210 [mlx4_ib]
> 	[73138.803841]  [<ffffffffa06a90ed>] ipoib_poll+0x8d/0x150 [ib_ipoib]
> 	[73138.803843]  [<ffffffff814d05aa>] net_rx_action+0x15a/0x250
> 	[73138.803846]  [<ffffffff81067047>] __do_softirq+0xf7/0x290
> 	[73138.803848]  [<ffffffff815f43dc>] call_softirq+0x1c/0x30
> 	[73138.803851]  [<ffffffff81014d25>] do_softirq+0x55/0x90
> 	[73138.803853]  [<ffffffff810673e5>] irq_exit+0x115/0x120
> 	[73138.803855]  [<ffffffff815f4cd8>] do_IRQ+0x58/0xf0
> 	[73138.803857]  [<ffffffff815e9e2d>] common_interrupt+0x6d/0x6d
> 	[73138.803858]  <EOI>  [<ffffffff815f2bc0>] ? sysret_audit+0x17/0x21
>
> We get some like this, also:
>
> [ 1293.152415] SLUB: Unable to allocate memory on node -1 (gfp=0x20)
> [ 1293.152416]   cache: kmalloc-256, object size: 256, buffer size: 256,
> default order: 1, min order: 0
> [ 1293.152417]   node 0: slabs: 1789, objs: 57248, free: 0
> [ 1293.152418]   node 1: slabs: 449, objs: 14368, free: 2
>




^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: isolate_freepages_block and excessive CPU usage by OSD process
  2014-12-04  7:30                             ` Christian Marie
  2014-12-04  7:51                               ` Christian Marie
@ 2014-12-05  1:07                               ` Joonsoo Kim
  2014-12-05  5:55                                 ` Christian Marie
  2014-12-10 15:06                                 ` Vlastimil Babka
  1 sibling, 2 replies; 36+ messages in thread
From: Joonsoo Kim @ 2014-12-05  1:07 UTC (permalink / raw)
  To: linux-mm

On Thu, Dec 04, 2014 at 06:30:45PM +1100, Christian Marie wrote:
> On Wed, Dec 03, 2014 at 04:57:47PM +0900, Joonsoo Kim wrote:
> > It'd be very helpful to get output of
> > "trace_event=compaction:*,kmem:mm_page_alloc_extfrag" on the kernel
> > with my tracepoint patches below.
> > 
> > See following link. There is 3 patches.
> > 
> > https://lkml.org/lkml/2014/12/3/71
> 
> I have just finished testing 3.18rc5 with both of the small patches mentioned
> earlier in this thread and 2/3 of your event patches. The second patch
> (https://lkml.org/lkml/2014/12/3/72) did not apply due to compaction_suitable
> being different (am I missing another patch you are basing this off?).

In fact, I'm using the next-20141124 kernel, not just the mainline one. There
are a lot of fixes from Vlastimil in it, which may be why the patch failed to
apply. But that's not important in this case. I have gotten enough information
about this problem from your logs below.

> 
> My compaction_suitable is:
> 
> 	unsigned long compaction_suitable(struct zone *zone, int order)
> 
> Results without that second event patch are as follows:
> 
> Trace under heavy load but before any spiking system usage or significant
> compaction spinning:
> 
> http://ponies.io/raw/compaction_events/before.gz
> 
> Trace during 100% cpu utilization, much of which was in system:
> 
> http://ponies.io/raw/compaction_events/during.gz

It looks like there is no stop condition in isolate_freepages(). During this
period, your system does not have enough free pages and many processes try to
find free pages for compaction. Because there is no stop condition, they
iterate over almost the whole memory range every time. At the bottom of this
mail, I attach one more fix, although I haven't tested it yet. It will cause a
lot of failures for the allocations your network layer needs. They are order-5
allocation requests with the __GFP_NOWARN gfp flag, so I assume there is no
problem if an allocation request fails, but I'm not sure.

The watermark check in this patch needs cc->classzone_idx and cc->alloc_flags,
which come from Vlastimil's recent changes. If you want to test it with
3.18rc5, please remove them. It doesn't matter much.
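
On plain 3.18rc5 the check would roughly reduce to the sketch below. This is
only an illustration of the intended stop condition, not the exact patch; the
classzone_idx/alloc_flags arguments are simply dropped there, as elsewhere in
that kernel:

	/* sketch for 3.18rc5, inside the isolate_freepages() scan loop */
	unsigned long watermark = low_wmark_pages(zone) + (2UL << cc->order);

	if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
		break;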

Anyway, I hope it also helps you.

> perf report at the time of during.gz:
> 
> http://ponies.io/raw/compaction_events/perf.png

Judging from this perf report, my second patch would have no impact on your
system. I thought that this excessive CPU usage started from SLUB, but an
order-5 kmalloc request is just forwarded to the page allocator in the current
SLUB implementation, so patch 2 from me would not help with this problem.

By the way, is it common for the network layer to need order-5 allocations?
IMHO, it'd be better to avoid these high-order requests, because the kernel
easily fails to handle this kind of request.

Thanks.

> 
> Interested to see what you make of the limited information. I may be able to
> try all of your patches some time next week against whatever they apply cleanly
> to. If that is needed.

------------>8-----------------

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: isolate_freepages_block and excessive CPU usage by OSD process
  2014-12-04 23:30                             ` Vlastimil Babka
@ 2014-12-05  5:50                               ` Christian Marie
  0 siblings, 0 replies; 36+ messages in thread
From: Christian Marie @ 2014-12-05  5:50 UTC (permalink / raw)
  To: linux-mm

[-- Attachment #1: Type: text/plain, Size: 644 bytes --]

On Fri, Dec 05, 2014 at 12:30:37AM +0100, Vlastimil Babka wrote:
> Oh, I would think that if you can't allocate single pages, then there's
> little wonder that compaction also spends all its time looking for single
> free pages. Did that happen just now for the single page allocations,
> or was it always the case?

This has always been the case with the default min_free_kbytes, given enough
pressure for enough time. I have just been hoping that compaction would be
"smart" enough to let reclaim do its thing quickly if single page allocations
are failing.

Raising min_free_kbytes makes these order-0 allocation failures never happen.
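
Raising it helps because the zone watermarks are derived from it. A rough
sketch of that relationship, as I understand the 3.x setup_per_zone_wmarks()
logic (the example value is arbitrary and per-zone scaling is omitted):

	#include <stdio.h>

	int main(void)
	{
		unsigned long min_free_kbytes = 90112;	/* example value only */
		unsigned long page_kb = 4;		/* 4KB pages assumed */
		unsigned long min = min_free_kbytes / page_kb;	/* in pages */

		/* low = min + min/4, high = min + min/2 (summed over zones) */
		printf("min  watermark: %lu pages\n", min);
		printf("low  watermark: %lu pages\n", min + min / 4);
		printf("high watermark: %lu pages\n", min + min / 2);
		return 0;
	}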

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: isolate_freepages_block and excessive CPU usage by OSD process
  2014-12-05  1:07                               ` Joonsoo Kim
@ 2014-12-05  5:55                                 ` Christian Marie
  2014-12-08  7:19                                   ` Joonsoo Kim
  2014-12-10 15:06                                 ` Vlastimil Babka
  1 sibling, 1 reply; 36+ messages in thread
From: Christian Marie @ 2014-12-05  5:55 UTC (permalink / raw)
  To: linux-mm

[-- Attachment #1: Type: text/plain, Size: 1958 bytes --]

On Fri, Dec 05, 2014 at 10:07:33AM +0900, Joonsoo Kim wrote:
> It looks that there is no stop condition in isolate_freepages(). In
> this period, your system have not enough freepage and many processes
> try to find freepage for compaction. Because there is no stop
> condition, they iterate almost all memory range every time. At the
> bottom of this mail, I attach one more fix although I don't test it
> yet. It will cause a lot of allocation failure that your network layer
> need. It is order 5 allocation request and with __GFP_NOWARN gfp flag,
> so I assume that there is no problem if allocation request is failed,
> but, I'm not sure.
> 
> watermark check on this patch needs cc->classzone_idx, cc->alloc_flags
> that comes from Vlastimil's recent change. If you want to test it with
> 3.18rc5, please remove it. It doesn't much matter.
> 
> Anyway, I hope it also helps you.

Thank you, I will try this next week. If it improves the situation, do you
think we have a good chance of merging it upstream? I should think that
backporting such a fix would be a hard sell.

> By judging from this perf report, my second patch would have no impact
> to your system. I thought that this excessive cpu usage is started from
> the SLUB, but, order 5 kmalloc request is just forwarded to page
> allocator in current SLUB implementation, so patch 2 from me would not
> work on this problem.

I agree with this.

> 
> By the way, is it common that network layer needs order 5 allocation?
> IMHO, it'd be better to avoid this highorder request, because the kernel
> easily fail to handle this kind of request.

Yes, agreed. I'm trying to sort that issue out concurrently. I'm currently
collaborating on a patch to get scatter-gather support into this part of the
network layer so that we can avoid these huge allocations. They are large
because ipoib in Connected Mode wants a very large MTU (around 65535) and does
not do SG in CM. The arithmetic behind the order-5 requests is sketched below.
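
A tiny userspace illustration of why a ~64KB MTU receive buffer turns into an
order-5 page allocation on 4KB-page systems (the overhead figure is an
assumption standing in for skb padding and shared_info, not the real ipoib
numbers):

	#include <stdio.h>

	/* mimic the kernel's get_order(): smallest order whose span fits size */
	static int order_for(unsigned long size, unsigned long page_size)
	{
		int order = 0;
		unsigned long span = page_size;

		while (span < size) {
			span <<= 1;
			order++;
		}
		return order;
	}

	int main(void)
	{
		unsigned long mtu = 65535;	/* ipoib CM MTU, roughly */
		unsigned long overhead = 512;	/* assumed skb overhead */

		/* 65535 alone fits in order 4 (64KB); the overhead tips it to order 5 */
		printf("order = %d\n", order_for(mtu + overhead, 4096));
		return 0;
	}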

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: isolate_freepages_block and excessive CPU usage by OSD process
  2014-12-05  5:55                                 ` Christian Marie
@ 2014-12-08  7:19                                   ` Joonsoo Kim
  0 siblings, 0 replies; 36+ messages in thread
From: Joonsoo Kim @ 2014-12-08  7:19 UTC (permalink / raw)
  To: linux-mm

On Fri, Dec 05, 2014 at 04:55:44PM +1100, Christian Marie wrote:
> On Fri, Dec 05, 2014 at 10:07:33AM +0900, Joonsoo Kim wrote:
> > It looks that there is no stop condition in isolate_freepages(). In
> > this period, your system have not enough freepage and many processes
> > try to find freepage for compaction. Because there is no stop
> > condition, they iterate almost all memory range every time. At the
> > bottom of this mail, I attach one more fix although I don't test it
> > yet. It will cause a lot of allocation failure that your network layer
> > need. It is order 5 allocation request and with __GFP_NOWARN gfp flag,
> > so I assume that there is no problem if allocation request is failed,
> > but, I'm not sure.
> > 
> > watermark check on this patch needs cc->classzone_idx, cc->alloc_flags
> > that comes from Vlastimil's recent change. If you want to test it with
> > 3.18rc5, please remove it. It doesn't much matter.
> > 
> > Anyway, I hope it also helps you.
> 
> Thank you, I will try this next week. If it improves the situation do you think
> that we have a good chance of merging it upstream? I should think that
> backporting such a fix would be a hard sell.

I think that if it improves the situation, it could be merged upstream.
If the patch fixes a real issue, it is also a candidate for the stable tree.

Thanks.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: isolate_freepages_block and excessive CPU usage by OSD process
  2014-12-05  1:07                               ` Joonsoo Kim
  2014-12-05  5:55                                 ` Christian Marie
@ 2014-12-10 15:06                                 ` Vlastimil Babka
  2014-12-11  3:08                                   ` Joonsoo Kim
  1 sibling, 1 reply; 36+ messages in thread
From: Vlastimil Babka @ 2014-12-10 15:06 UTC (permalink / raw)
  To: Joonsoo Kim, linux-mm

On 12/05/2014 02:07 AM, Joonsoo Kim wrote:
> ------------>8-----------------
>  From b7daa232c327a4ebbb48ca0538a2dbf9ca83ca1f Mon Sep 17 00:00:00 2001
> From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Date: Fri, 5 Dec 2014 09:38:30 +0900
> Subject: [PATCH] mm/compaction: stop the compaction if there isn't enough
>   freepage
>
> After compaction_suitable() passed, there is no check whether the system
> has enough memory to compact and blindly try to find freepage through
> iterating all memory range. This causes excessive cpu usage in low free
> memory condition and finally compaction would be failed. It makes sense
> that compaction would be stopped if there isn't enough freepage. So,
> this patch adds watermark check to isolate_freepages() in order to stop
> the compaction in this case.
>
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> ---
>   mm/compaction.c |    9 +++++++++
>   1 file changed, 9 insertions(+)
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index e005620..31c4009 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -828,6 +828,7 @@ static void isolate_freepages(struct compact_control *cc)
>   	unsigned long low_pfn;	     /* lowest pfn scanner is able to scan */
>   	int nr_freepages = cc->nr_freepages;
>   	struct list_head *freelist = &cc->freepages;
> +	unsigned long watermark = low_wmark_pages(zone) + (2UL << cc->order);

Given that we may have already isolated up to 31 free pages (if
cc->nr_migratepages is the maximum 32), then this is somewhat stricter
than the check in isolation_suitable() (when nothing was isolated yet)
and may interrupt us prematurely. We should allow for some slack.

>
>   	/*
>   	 * Initialise the free scanner. The starting point is where we last
> @@ -903,6 +904,14 @@ static void isolate_freepages(struct compact_control *cc)
>   		 */
>   		if (cc->contended)
>   			break;
> +
> +		/*
> +		 * Watermarks for order-0 must be met for compaction.
> +		 * See compaction_suitable for more detailed explanation.
> +		 */
> +		if (!zone_watermark_ok(zone, 0, watermark,
> +			cc->classzone_idx, cc->alloc_flags))
> +			break;
>   	}

I'm also a bit concerned about the overhead of doing this in each pageblock.

I wonder if there could be a mechanism where a process entering reclaim 
or compaction with the goal of meeting the watermarks to allocate, 
should increase the watermarks needed for further parallel allocation 
attempts to pass. Then it shouldn't happen that somebody else steals the 
memory.

>   	/* split_free_page does not map the pages */
>


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: isolate_freepages_block and excessive CPU usage by OSD process
  2014-12-10 15:06                                 ` Vlastimil Babka
@ 2014-12-11  3:08                                   ` Joonsoo Kim
  0 siblings, 0 replies; 36+ messages in thread
From: Joonsoo Kim @ 2014-12-11  3:08 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: linux-mm

On Wed, Dec 10, 2014 at 04:06:19PM +0100, Vlastimil Babka wrote:
> On 12/05/2014 02:07 AM, Joonsoo Kim wrote:
> >------------>8-----------------
> > From b7daa232c327a4ebbb48ca0538a2dbf9ca83ca1f Mon Sep 17 00:00:00 2001
> >From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> >Date: Fri, 5 Dec 2014 09:38:30 +0900
> >Subject: [PATCH] mm/compaction: stop the compaction if there isn't enough
> >  freepage
> >
> >After compaction_suitable() passed, there is no check whether the system
> >has enough memory to compact and blindly try to find freepage through
> >iterating all memory range. This causes excessive cpu usage in low free
> >memory condition and finally compaction would be failed. It makes sense
> >that compaction would be stopped if there isn't enough freepage. So,
> >this patch adds watermark check to isolate_freepages() in order to stop
> >the compaction in this case.
> >
> >Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> >---
> >  mm/compaction.c |    9 +++++++++
> >  1 file changed, 9 insertions(+)
> >
> >diff --git a/mm/compaction.c b/mm/compaction.c
> >index e005620..31c4009 100644
> >--- a/mm/compaction.c
> >+++ b/mm/compaction.c
> >@@ -828,6 +828,7 @@ static void isolate_freepages(struct compact_control *cc)
> >  	unsigned long low_pfn;	     /* lowest pfn scanner is able to scan */
> >  	int nr_freepages = cc->nr_freepages;
> >  	struct list_head *freelist = &cc->freepages;
> >+	unsigned long watermark = low_wmark_pages(zone) + (2UL << cc->order);
> 
> Given that we may have already isolated up to 31 free pages (if
> cc->nr_migratepages is the maximum 32), then this is somewhat
> stricter than the check in isolation_suitable() (when nothing was
> isolated yet) and may interrupt us prematurely. We should allow for
> some slack.

Okay. Will allow some slack.

> 
> >
> >  	/*
> >  	 * Initialise the free scanner. The starting point is where we last
> >@@ -903,6 +904,14 @@ static void isolate_freepages(struct compact_control *cc)
> >  		 */
> >  		if (cc->contended)
> >  			break;
> >+
> >+		/*
> >+		 * Watermarks for order-0 must be met for compaction.
> >+		 * See compaction_suitable for more detailed explanation.
> >+		 */
> >+		if (!zone_watermark_ok(zone, 0, watermark,
> >+			cc->classzone_idx, cc->alloc_flags))
> >+			break;
> >  	}
> 
> I'm also a bit concerned about the overhead of doing this in each pageblock.

Yep, we can do it whenever SWAP_CLUSTER_MAX pageblocks have been scanned. That
will reduce the overhead somewhat. I will change it.
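
A rough sketch of the combination I have in mind (the scan counter and the
exact slack amount are placeholders, not the final patch):

	/* sketch: inside the isolate_freepages() scan loop */
	if (!(++nr_scanned_blocks % SWAP_CLUSTER_MAX)) {
		/* allow slack for freepages this run has already isolated */
		unsigned long mark = watermark + cc->nr_freepages;

		if (!zone_watermark_ok(zone, 0, mark,
				       cc->classzone_idx, cc->alloc_flags))
			break;
	}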

> 
> I wonder if there could be a mechanism where a process entering
> reclaim or compaction with the goal of meeting the watermarks to
> allocate, should increase the watermarks needed for further parallel
> allocation attempts to pass. Then it shouldn't happen that somebody
> else steals the memory.

I don't know either.

Thanks.

> 
> >  	/* split_free_page does not map the pages */
> >
> 


^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2014-12-11  3:04 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CABYiri-do2YdfBx=r+u1kwXkEwN4v+yeRSHB-ODXo4gMFgW-Fg.mail.gmail.com>
2014-11-19  1:21 ` isolate_freepages_block and excessive CPU usage by OSD process Christian Marie
2014-11-19 18:03   ` Andrey Korolyov
2014-11-19 21:20     ` Christian Marie
2014-11-19 23:10       ` Vlastimil Babka
2014-11-19 23:49         ` Andrey Korolyov
2014-11-20  3:30         ` Christian Marie
2014-11-21  2:35         ` Christian Marie
2014-11-23  9:33           ` Christian Marie
2014-11-24 21:48             ` Andrey Korolyov
2014-11-28  8:03               ` Joonsoo Kim
2014-11-28  9:26                 ` Vlastimil Babka
2014-12-01  8:31                   ` Joonsoo Kim
2014-12-02  1:47                     ` Christian Marie
2014-12-02  4:53                       ` Joonsoo Kim
2014-12-02  5:06                         ` Christian Marie
2014-12-03  4:04                           ` Christian Marie
2014-12-03  8:05                             ` Joonsoo Kim
2014-12-04 23:30                             ` Vlastimil Babka
2014-12-05  5:50                               ` Christian Marie
2014-12-03  7:57                           ` Joonsoo Kim
2014-12-04  7:30                             ` Christian Marie
2014-12-04  7:51                               ` Christian Marie
2014-12-05  1:07                               ` Joonsoo Kim
2014-12-05  5:55                                 ` Christian Marie
2014-12-08  7:19                                   ` Joonsoo Kim
2014-12-10 15:06                                 ` Vlastimil Babka
2014-12-11  3:08                                   ` Joonsoo Kim
2014-12-02 15:46                         ` Vlastimil Babka
2014-12-03  7:49                           ` Joonsoo Kim
2014-12-03 12:43                             ` Vlastimil Babka
2014-12-04  6:53                               ` Joonsoo Kim
2014-11-15 11:48 Andrey Korolyov
2014-11-15 16:32 ` Vlastimil Babka
2014-11-15 17:10   ` Andrey Korolyov
2014-11-15 18:45     ` Vlastimil Babka
2014-11-15 18:52       ` Andrey Korolyov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).