All of lore.kernel.org
 help / color / mirror / Atom feed
From: Rabin Vincent <rabin.vincent@axis.com>
To: fengguang.wu@intel.com, akpm@linux-foundation.org
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: [PATCH?] Non-throttling of mkfs leads to OOM
Date: Thu, 6 Aug 2015 00:21:52 +0200	[thread overview]
Message-ID: <20150805222151.GA24795@axis.com> (raw)

Hi,

I received some reports of mkfs.ext4 on an SD card triggering the OOM
killer on a swapless system with a low amount of free memory.  I have
made a reproducible setup of this by using null_blk with a large
completion delay.  The problem appears to be that we do not throttle to
the level of vm_dirty_ratio in throttle_vm_writeout().

I configure null_blk like this: null_blk.irqmode=2
null_blk.completion_nsec=30000000 null_blk.queue_mode=1 and run the test under
a qemu-kvm instance.

The kernel is 4.2-rc5 along with the patch I posted earlier today which fixes
the initial dirty limit after the cgroups changes
(https://lkml.org/lkml/2015/8/5/650).  The problem is not related to the
recent writeback cgroups changes though; earlier kernels show the same
behaviour.

/proc/meminfo looks like this just before run of mkfs.ext4

  MemTotal:         197032 kB
  MemFree:            4184 kB
  MemAvailable:      14376 kB
  Buffers:             332 kB
  Cached:            64528 kB
  SwapCached:            0 kB
  Active:             2840 kB
  Inactive:          62612 kB
  Active(anon):        640 kB
  Inactive(anon):    61448 kB
  Active(file):       2200 kB
  Inactive(file):     1164 kB
  Unevictable:           0 kB
  Mlocked:               0 kB
  SwapTotal:             0 kB
  SwapFree:              0 kB
  Dirty:                28 kB
  Writeback:             0 kB
  AnonPages:           660 kB
  Mapped:             2776 kB
  Shmem:             61456 kB
  Slab:              26580 kB
  SReclaimable:      12884 kB
  SUnreclaim:        13696 kB
  KernelStack:         576 kB
  PageTables:          220 kB
  NFS_Unstable:          0 kB
  Bounce:                0 kB
  WritebackTmp:          0 kB
  CommitLimit:       98516 kB
  Committed_AS:      63424 kB
  VmallocTotal:   34359738367 kB
  VmallocUsed:       68532 kB
  VmallocChunk:   34359667708 kB
  DirectMap4k:       16256 kB
  DirectMap2M:      208896 kB
  DirectMap1G:           0 kB

And mkfs.ext4 runs like this:

# mkfs.ext4 -F -O ^extent -E lazy_itable_init=0,lazy_journal_init=0 /dev/nullb0:
 mke2fs 1.42.13 (17-May-2015)
 Creating filesystem with 65536000 4k blocks and 16384000 inodes
 Filesystem UUID: 970d9572-67d4-4a1f-a4d0-8376c6235c6a
 Superblock backups stored on blocks: 
 	32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 
 	4096000, 7962624, 11239424, 20480000, 23887872
 
 Allocating group tables: done                            
 Writing inode tables: [    4.557641] mkfs.ext4 invoked oom-killer: gfp_mask=0x10200d0, order=0, oom_score_adj=0
 [    4.558395] mkfs.ext4 cpuset=/ mems_allowed=0
 [    4.558814] CPU: 0 PID: 667 Comm: mkfs.ext4 Not tainted 4.2.0-rc5+ #292
 [    4.559391] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
 [    4.562285] Call Trace:
 [    4.562509]  [<ffffffff81515c64>] dump_stack+0x4f/0x7b
 [    4.562957]  [<ffffffff81513eb8>] dump_header.isra.9+0x76/0x38f
 [    4.563474]  [<ffffffff8109d95d>] ? trace_hardirqs_on+0xd/0x10
 [    4.564023]  [<ffffffff8151ca2a>] ? _raw_spin_unlock_irqrestore+0x4a/0x80
 [    4.564614]  [<ffffffff81123cac>] oom_kill_process+0x38c/0x470
 [    4.565190]  [<ffffffff81057cb5>] ? has_ns_capability_noaudit+0x5/0x160
 [    4.573104]  [<ffffffff8107bc51>] ? get_parent_ip+0x11/0x50
 [    4.573608]  [<ffffffff81124145>] ? out_of_memory+0x355/0x420
 [    4.574143]  [<ffffffff811241a7>] out_of_memory+0x3b7/0x420
 [    4.574634]  [<ffffffff81128dc0>] ? __alloc_pages_nodemask+0x810/0xb10
 [    4.575203]  [<ffffffff81128fc6>] __alloc_pages_nodemask+0xa16/0xb10
 [    4.575755]  [<ffffffff8111fd91>] pagecache_get_page+0x101/0x1c0
 [    4.576287]  [<ffffffff811a2cf0>] ? I_BDEV+0x20/0x20
 [    4.576722]  [<ffffffff8112092d>] grab_cache_page_write_begin+0x2d/0x50
 [    4.577298]  [<ffffffff811a114d>] block_write_begin+0x2d/0x80
 [    4.577809]  [<ffffffff811a34f5>] ? blkdev_write_begin+0x5/0x30
 [    4.578325]  [<ffffffff811a3513>] blkdev_write_begin+0x23/0x30
 [    4.578830]  [<ffffffff81120abf>] generic_perform_write+0xaf/0x1b0
 [    4.579368]  [<ffffffff81121c60>] __generic_file_write_iter+0x190/0x1f0
 [    4.579975]  [<ffffffff811a3b28>] blkdev_write_iter+0x78/0x100
 [    4.580497]  [<ffffffff81167daa>] __vfs_write+0xaa/0xe0
 [    4.580984]  [<ffffffff811681a7>] vfs_write+0x97/0x100
 [    4.581434]  [<ffffffff811689a7>] SyS_pwrite64+0x77/0x90
 [    4.581923]  [<ffffffff8151d3ae>] entry_SYSCALL_64_fastpath+0x12/0x76
 [    4.585984] Mem-Info:
 [    4.586558] active_anon:207 inactive_anon:15363 isolated_anon:0
 [    4.586558]  active_file:295 inactive_file:347 isolated_file:0
 [    4.586558]  unevictable:0 dirty:0 writeback:0 unstable:0
 [    4.586558]  slab_reclaimable:3260 slab_unreclaimable:3452
 [    4.586558]  mapped:288 shmem:15363 pagetables:56 bounce:0
 [    4.586558]  free:1112 free_pcp:32 free_cma:0
 [    4.589928] DMA free:1032kB min:144kB low:180kB high:216kB active_anon:52kB inactive_anon:5920kB active_file:48kB inactive_file:128kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:0kB dirty:4kB writeback:0kB mapped:88kB shmem:5920kB slab_reclaimable:164kB slab_unreclaimable:648kB kernel_stack:16kB pagetables:28kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
 [    4.594056] lowmem_reserve[]: 0 173 173 173
 [    4.596036] DMA32 free:3136kB min:1608kB low:2008kB high:2412kB active_anon:776kB inactive_anon:55532kB active_file:1132kB inactive_file:1380kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:208768kB managed:181124kB mlocked:0kB dirty:0kB writeback:0kB mapped:1176kB shmem:55532kB slab_reclaimable:12876kB slab_unreclaimable:13160kB kernel_stack:592kB pagetables:196kB unstable:0kB bounce:0kB free_pcp:140kB local_pcp:140kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
 [    4.600369] lowmem_reserve[]: 0 0 0 0
 [    4.601393] DMA: 14*4kB (UM) 20*8kB (UM) 20*16kB (UM) 13*32kB (UM) 1*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1016kB
 [    4.602710] DMA32: 44*4kB (UEM) 128*8kB (UEM) 44*16kB (UEM) 24*32kB (UEM) 3*64kB (M) 2*128kB (M) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3120kB
 [    4.604158] 16088 total pagecache pages
 [    4.604509] 56190 pages RAM
 [    4.604768] 0 pages HighMem/MovableOnly
 [    4.605148] 6932 pages reserved
 [    4.605437] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
 [    4.606238] [   75]     0    75     1134       16       8       3        0             0 klogd
 [    4.607012] [  140]     0   140     2667      140      10       3        0             0 dropbear
 [    4.607809] [  141]     0   141     1660      176       8       3        0             0 sh
 [    4.608737] [  661]     0   661     1134      167       8       3        0             0 exe
 [    4.611211] [  667]     0   667     4271      321      14       3        0             0 mkfs.ext4
 [    4.612198] Out of memory: Kill process 667 (mkfs.ext4) score 6 or sacrifice child
 [    4.614630] Killed process 667 (mkfs.ext4) total-vm:17084kB, anon-rss:184kB, file-rss:1100kB
 Killed

The last balance_dirty_pages shows that the task is within its ratelimit and so
is not paused there:

       mkfs.ext4-664   [000]     4.773565: balance_dirty_pages:  bdi 253:0: limit=9233 setpoint=4685 dirty=576 bdi_setpoint=4532 bdi_dirty=576 dirty_ratelimit=102400 task_ratelimit=181400 dirtied=58 dirtied_pause=512 paused=0 pause=-32 period=0 think=32

Then the page allocator gets called and starts direct reclaim:

       mkfs.ext4-664   [000]     4.773565: function:             blkdev_write_begin <-- generic_perform_write
       mkfs.ext4-664   [000]     4.773566: function:             __alloc_pages_nodemask <-- pagecache_get_page
       mkfs.ext4-664   [000]     4.773566: function:                __alloc_pages_direct_compact <-- __alloc_pages_nodemask
       mkfs.ext4-664   [000]     4.773566: function:                try_to_free_pages <-- __alloc_pages_nodemask
       mkfs.ext4-664   [000]     4.773567: mm_vmscan_direct_reclaim_begin: order=0 may_writepage=1 gfp_flags=GFP_USER

However, even though NR_WRITEBACK is much larger than thresh,
confgestion_wait() is not called because throttle_vm_writeout() uses the much
higher domain->dirty_limit (limit) rather that vm_dirty_ratio (thresh):

       mkfs.ext4-664   [000]     4.773567: function:             throttle_vm_writeout <-- shrink_lruvec.isra.59
       mkfs.ext4-664   [000]     4.773567: global_dirty_state:   dirty=3 writeback=573 unstable=0 bg_thresh=91 thresh=183 limit=9233 dirtied=3465 written=2889

None of the wait_iff_congested() calls have any effect since the bdi is not
detected as being congested (queue/nr_request is the default of 128 and
fewer than 30 requests are currently on it):

       mkfs.ext4-664   [000]     4.773583: function:             wait_iff_congested <-- shrink_inactive_list
       mkfs.ext4-664   [000]     4.773583: writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=0

try_to_free_pages() also keeps on waking up the flusher threads:

       mkfs.ext4-664   [000]     4.776026: kmalloc:              (wb_start_writeback+0x42) call_site=ffffffff811970d2 ptr=0xffff88000d748498 bytes_req=64 bytes_alloc=392 gfp_flags=GFP_ATOMIC|GFP_ZERO
       mkfs.ext4-664   [000]     4.776026: writeback_queue:      bdi 254:0: sb_dev 0:0 nr_pages=1105 sync_mode=0 kupdate=0 range_cyclic=0 background=0 reason=try_to_free_pages
       mkfs.ext4-664   [000]     4.776032: kmalloc:              (wb_start_writeback+0x42) call_site=ffffffff811970d2 ptr=0xffff88000d749188 bytes_req=64 bytes_alloc=392 gfp_flags=GFP_ATOMIC|GFP_ZERO
       mkfs.ext4-664   [000]     4.776033: writeback_queue:      bdi 253:0: sb_dev 0:0 nr_pages=1105 sync_mode=0 kupdate=0 range_cyclic=0 background=0 reason=try_to_free_pages
       mkfs.ext4-664   [000]     4.776040: kmalloc:              (wb_start_writeback+0x42) call_site=ffffffff811970d2 ptr=0xffff88000d749620 bytes_req=64 bytes_alloc=392 gfp_flags=GFP_ATOMIC|GFP_ZERO
       mkfs.ext4-664   [000]     4.776040: writeback_queue:      bdi 254:0: sb_dev 0:0 nr_pages=1105 sync_mode=0 kupdate=0 range_cyclic=0 background=0 reason=try_to_free_pages
       mkfs.ext4-664   [000]     4.776047: kmalloc:              (wb_start_writeback+0x42) call_site=ffffffff811970d2 ptr=0xffff88000d74a498 bytes_req=64 bytes_alloc=392 gfp_flags=GFP_ATOMIC|GFP_ZERO
       mkfs.ext4-664   [000]     4.776047: writeback_queue:      bdi 253:0: sb_dev 0:0 nr_pages=1105 sync_mode=0 kupdate=0 range_cyclic=0 background=0 reason=try_to_free_pages
       mkfs.ext4-664   [000]     4.776054: kmalloc:              (wb_start_writeback+0x42) call_site=ffffffff811970d2 ptr=0xffff88000d74a620 bytes_req=64 bytes_alloc=392 gfp_flags=GFP_ATOMIC|GFP_ZERO
       mkfs.ext4-664   [000]     4.776054: writeback_queue:      bdi 254:0: sb_dev 0:0 nr_pages=1105 sync_mode=0 kupdate=0 range_cyclic=0 background=0 reason=try_to_free_pages
       mkfs.ext4-664   [000]     4.776061: kmalloc:              (wb_start_writeback+0x42) call_site=ffffffff811970d2 ptr=0xffff88000d74b310 bytes_req=64 bytes_alloc=392 gfp_flags=GFP_ATOMIC|GFP_ZERO
       mkfs.ext4-664   [000]     4.776061: writeback_queue:      bdi 253:0: sb_dev 0:0 nr_pages=1105 sync_mode=0 kupdate=0 range_cyclic=0 background=0 reason=try_to_free_pages
       mkfs.ext4-664   [000]     4.776068: kmalloc:              (wb_start_writeback+0x42) call_site=ffffffff811970d2 ptr=0xffff88000d74b620 bytes_req=64 bytes_alloc=392 gfp_flags=GFP_ATOMIC|GFP_ZERO

All to no avail since we never wait for any writes and end up OOMing
after some rounds of direct reclaim:

       mkfs.ext4-664   [000]     4.776174: mm_vmscan_direct_reclaim_end: nr_reclaimed=0
       mkfs.ext4-664   [000]     4.776192: function:             out_of_memory <-- __alloc_pages_nodemask

(A full ftrace is available at https://drive.google.com/file/d/0B4tMLbMvJ-l6MndXbG5wQmZrTm8/view)

The problem seems to be that throttle_vm_writeout() does not respect the set
vm_dirty_ratio values.  With the following change, mkfs.ext4 is able to
complete successfully every time in this scenario without triggering OOM:

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 5cccc12..47f1d09 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1916,7 +1920,6 @@ void throttle_vm_writeout(gfp_t gfp_mask)
 
         for ( ; ; ) {
 		global_dirty_limits(&background_thresh, &dirty_thresh);
-		dirty_thresh = hard_dirty_limit(&global_wb_domain, dirty_thresh);
 
                 /*
                  * Boost the allowable dirty threshold a bit for page

But this is a revert of the following commit and presumably reintroduces the
problem for the use case described there:

  commit 47a133339c332f9f8e155c70f5da401aded69948
  Author: Fengguang Wu <fengguang.wu@intel.com>
  Date:   Wed Mar 21 16:34:09 2012 -0700
  
      mm: use global_dirty_limit in throttle_vm_writeout()
  
      When starting a memory hog task, a desktop box w/o swap is found to go
      unresponsive for a long time.  It's solely caused by lots of congestion
      waits in throttle_vm_writeout():
      
      ...
      
      The root cause is, the dirty threshold is knocked down a lot by the memory
      hog task.  Fixed by using global_dirty_limit which decreases gradually on
      such events and can guarantee we stay above (the also decreasing) nr_dirty
      in the progress of following down to the new dirty threshold.

Comments?

Thanks,
/Rabin

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

WARNING: multiple messages have this Message-ID (diff)
From: Rabin Vincent <rabin.vincent@axis.com>
To: fengguang.wu@intel.com, akpm@linux-foundation.org
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: [PATCH?] Non-throttling of mkfs leads to OOM
Date: Thu, 6 Aug 2015 00:21:52 +0200	[thread overview]
Message-ID: <20150805222151.GA24795@axis.com> (raw)

Hi,

I received some reports of mkfs.ext4 on an SD card triggering the OOM
killer on a swapless system with a low amount of free memory.  I have
made a reproducible setup of this by using null_blk with a large
completion delay.  The problem appears to be that we do not throttle to
the level of vm_dirty_ratio in throttle_vm_writeout().

I configure null_blk like this: null_blk.irqmode=2
null_blk.completion_nsec=30000000 null_blk.queue_mode=1 and run the test under
a qemu-kvm instance.

The kernel is 4.2-rc5 along with the patch I posted earlier today which fixes
the initial dirty limit after the cgroups changes
(https://lkml.org/lkml/2015/8/5/650).  The problem is not related to the
recent writeback cgroups changes though; earlier kernels show the same
behaviour.

/proc/meminfo looks like this just before run of mkfs.ext4

  MemTotal:         197032 kB
  MemFree:            4184 kB
  MemAvailable:      14376 kB
  Buffers:             332 kB
  Cached:            64528 kB
  SwapCached:            0 kB
  Active:             2840 kB
  Inactive:          62612 kB
  Active(anon):        640 kB
  Inactive(anon):    61448 kB
  Active(file):       2200 kB
  Inactive(file):     1164 kB
  Unevictable:           0 kB
  Mlocked:               0 kB
  SwapTotal:             0 kB
  SwapFree:              0 kB
  Dirty:                28 kB
  Writeback:             0 kB
  AnonPages:           660 kB
  Mapped:             2776 kB
  Shmem:             61456 kB
  Slab:              26580 kB
  SReclaimable:      12884 kB
  SUnreclaim:        13696 kB
  KernelStack:         576 kB
  PageTables:          220 kB
  NFS_Unstable:          0 kB
  Bounce:                0 kB
  WritebackTmp:          0 kB
  CommitLimit:       98516 kB
  Committed_AS:      63424 kB
  VmallocTotal:   34359738367 kB
  VmallocUsed:       68532 kB
  VmallocChunk:   34359667708 kB
  DirectMap4k:       16256 kB
  DirectMap2M:      208896 kB
  DirectMap1G:           0 kB

And mkfs.ext4 runs like this:

# mkfs.ext4 -F -O ^extent -E lazy_itable_init=0,lazy_journal_init=0 /dev/nullb0:
 mke2fs 1.42.13 (17-May-2015)
 Creating filesystem with 65536000 4k blocks and 16384000 inodes
 Filesystem UUID: 970d9572-67d4-4a1f-a4d0-8376c6235c6a
 Superblock backups stored on blocks: 
 	32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 
 	4096000, 7962624, 11239424, 20480000, 23887872
 
 Allocating group tables: done                            
 Writing inode tables: [    4.557641] mkfs.ext4 invoked oom-killer: gfp_mask=0x10200d0, order=0, oom_score_adj=0
 [    4.558395] mkfs.ext4 cpuset=/ mems_allowed=0
 [    4.558814] CPU: 0 PID: 667 Comm: mkfs.ext4 Not tainted 4.2.0-rc5+ #292
 [    4.559391] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
 [    4.562285] Call Trace:
 [    4.562509]  [<ffffffff81515c64>] dump_stack+0x4f/0x7b
 [    4.562957]  [<ffffffff81513eb8>] dump_header.isra.9+0x76/0x38f
 [    4.563474]  [<ffffffff8109d95d>] ? trace_hardirqs_on+0xd/0x10
 [    4.564023]  [<ffffffff8151ca2a>] ? _raw_spin_unlock_irqrestore+0x4a/0x80
 [    4.564614]  [<ffffffff81123cac>] oom_kill_process+0x38c/0x470
 [    4.565190]  [<ffffffff81057cb5>] ? has_ns_capability_noaudit+0x5/0x160
 [    4.573104]  [<ffffffff8107bc51>] ? get_parent_ip+0x11/0x50
 [    4.573608]  [<ffffffff81124145>] ? out_of_memory+0x355/0x420
 [    4.574143]  [<ffffffff811241a7>] out_of_memory+0x3b7/0x420
 [    4.574634]  [<ffffffff81128dc0>] ? __alloc_pages_nodemask+0x810/0xb10
 [    4.575203]  [<ffffffff81128fc6>] __alloc_pages_nodemask+0xa16/0xb10
 [    4.575755]  [<ffffffff8111fd91>] pagecache_get_page+0x101/0x1c0
 [    4.576287]  [<ffffffff811a2cf0>] ? I_BDEV+0x20/0x20
 [    4.576722]  [<ffffffff8112092d>] grab_cache_page_write_begin+0x2d/0x50
 [    4.577298]  [<ffffffff811a114d>] block_write_begin+0x2d/0x80
 [    4.577809]  [<ffffffff811a34f5>] ? blkdev_write_begin+0x5/0x30
 [    4.578325]  [<ffffffff811a3513>] blkdev_write_begin+0x23/0x30
 [    4.578830]  [<ffffffff81120abf>] generic_perform_write+0xaf/0x1b0
 [    4.579368]  [<ffffffff81121c60>] __generic_file_write_iter+0x190/0x1f0
 [    4.579975]  [<ffffffff811a3b28>] blkdev_write_iter+0x78/0x100
 [    4.580497]  [<ffffffff81167daa>] __vfs_write+0xaa/0xe0
 [    4.580984]  [<ffffffff811681a7>] vfs_write+0x97/0x100
 [    4.581434]  [<ffffffff811689a7>] SyS_pwrite64+0x77/0x90
 [    4.581923]  [<ffffffff8151d3ae>] entry_SYSCALL_64_fastpath+0x12/0x76
 [    4.585984] Mem-Info:
 [    4.586558] active_anon:207 inactive_anon:15363 isolated_anon:0
 [    4.586558]  active_file:295 inactive_file:347 isolated_file:0
 [    4.586558]  unevictable:0 dirty:0 writeback:0 unstable:0
 [    4.586558]  slab_reclaimable:3260 slab_unreclaimable:3452
 [    4.586558]  mapped:288 shmem:15363 pagetables:56 bounce:0
 [    4.586558]  free:1112 free_pcp:32 free_cma:0
 [    4.589928] DMA free:1032kB min:144kB low:180kB high:216kB active_anon:52kB inactive_anon:5920kB active_file:48kB inactive_file:128kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:0kB dirty:4kB writeback:0kB mapped:88kB shmem:5920kB slab_reclaimable:164kB slab_unreclaimable:648kB kernel_stack:16kB pagetables:28kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
 [    4.594056] lowmem_reserve[]: 0 173 173 173
 [    4.596036] DMA32 free:3136kB min:1608kB low:2008kB high:2412kB active_anon:776kB inactive_anon:55532kB active_file:1132kB inactive_file:1380kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:208768kB managed:181124kB mlocked:0kB dirty:0kB writeback:0kB mapped:1176kB shmem:55532kB slab_reclaimable:12876kB slab_unreclaimable:13160kB kernel_stack:592kB pagetables:196kB unstable:0kB bounce:0kB free_pcp:140kB local_pcp:140kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
 [    4.600369] lowmem_reserve[]: 0 0 0 0
 [    4.601393] DMA: 14*4kB (UM) 20*8kB (UM) 20*16kB (UM) 13*32kB (UM) 1*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1016kB
 [    4.602710] DMA32: 44*4kB (UEM) 128*8kB (UEM) 44*16kB (UEM) 24*32kB (UEM) 3*64kB (M) 2*128kB (M) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3120kB
 [    4.604158] 16088 total pagecache pages
 [    4.604509] 56190 pages RAM
 [    4.604768] 0 pages HighMem/MovableOnly
 [    4.605148] 6932 pages reserved
 [    4.605437] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
 [    4.606238] [   75]     0    75     1134       16       8       3        0             0 klogd
 [    4.607012] [  140]     0   140     2667      140      10       3        0             0 dropbear
 [    4.607809] [  141]     0   141     1660      176       8       3        0             0 sh
 [    4.608737] [  661]     0   661     1134      167       8       3        0             0 exe
 [    4.611211] [  667]     0   667     4271      321      14       3        0             0 mkfs.ext4
 [    4.612198] Out of memory: Kill process 667 (mkfs.ext4) score 6 or sacrifice child
 [    4.614630] Killed process 667 (mkfs.ext4) total-vm:17084kB, anon-rss:184kB, file-rss:1100kB
 Killed

The last balance_dirty_pages shows that the task is within its ratelimit and so
is not paused there:

       mkfs.ext4-664   [000]     4.773565: balance_dirty_pages:  bdi 253:0: limit=9233 setpoint=4685 dirty=576 bdi_setpoint=4532 bdi_dirty=576 dirty_ratelimit=102400 task_ratelimit=181400 dirtied=58 dirtied_pause=512 paused=0 pause=-32 period=0 think=32

Then the page allocator gets called and starts direct reclaim:

       mkfs.ext4-664   [000]     4.773565: function:             blkdev_write_begin <-- generic_perform_write
       mkfs.ext4-664   [000]     4.773566: function:             __alloc_pages_nodemask <-- pagecache_get_page
       mkfs.ext4-664   [000]     4.773566: function:                __alloc_pages_direct_compact <-- __alloc_pages_nodemask
       mkfs.ext4-664   [000]     4.773566: function:                try_to_free_pages <-- __alloc_pages_nodemask
       mkfs.ext4-664   [000]     4.773567: mm_vmscan_direct_reclaim_begin: order=0 may_writepage=1 gfp_flags=GFP_USER

However, even though NR_WRITEBACK is much larger than thresh,
confgestion_wait() is not called because throttle_vm_writeout() uses the much
higher domain->dirty_limit (limit) rather that vm_dirty_ratio (thresh):

       mkfs.ext4-664   [000]     4.773567: function:             throttle_vm_writeout <-- shrink_lruvec.isra.59
       mkfs.ext4-664   [000]     4.773567: global_dirty_state:   dirty=3 writeback=573 unstable=0 bg_thresh=91 thresh=183 limit=9233 dirtied=3465 written=2889

None of the wait_iff_congested() calls have any effect since the bdi is not
detected as being congested (queue/nr_request is the default of 128 and
fewer than 30 requests are currently on it):

       mkfs.ext4-664   [000]     4.773583: function:             wait_iff_congested <-- shrink_inactive_list
       mkfs.ext4-664   [000]     4.773583: writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=0

try_to_free_pages() also keeps on waking up the flusher threads:

       mkfs.ext4-664   [000]     4.776026: kmalloc:              (wb_start_writeback+0x42) call_site=ffffffff811970d2 ptr=0xffff88000d748498 bytes_req=64 bytes_alloc=392 gfp_flags=GFP_ATOMIC|GFP_ZERO
       mkfs.ext4-664   [000]     4.776026: writeback_queue:      bdi 254:0: sb_dev 0:0 nr_pages=1105 sync_mode=0 kupdate=0 range_cyclic=0 background=0 reason=try_to_free_pages
       mkfs.ext4-664   [000]     4.776032: kmalloc:              (wb_start_writeback+0x42) call_site=ffffffff811970d2 ptr=0xffff88000d749188 bytes_req=64 bytes_alloc=392 gfp_flags=GFP_ATOMIC|GFP_ZERO
       mkfs.ext4-664   [000]     4.776033: writeback_queue:      bdi 253:0: sb_dev 0:0 nr_pages=1105 sync_mode=0 kupdate=0 range_cyclic=0 background=0 reason=try_to_free_pages
       mkfs.ext4-664   [000]     4.776040: kmalloc:              (wb_start_writeback+0x42) call_site=ffffffff811970d2 ptr=0xffff88000d749620 bytes_req=64 bytes_alloc=392 gfp_flags=GFP_ATOMIC|GFP_ZERO
       mkfs.ext4-664   [000]     4.776040: writeback_queue:      bdi 254:0: sb_dev 0:0 nr_pages=1105 sync_mode=0 kupdate=0 range_cyclic=0 background=0 reason=try_to_free_pages
       mkfs.ext4-664   [000]     4.776047: kmalloc:              (wb_start_writeback+0x42) call_site=ffffffff811970d2 ptr=0xffff88000d74a498 bytes_req=64 bytes_alloc=392 gfp_flags=GFP_ATOMIC|GFP_ZERO
       mkfs.ext4-664   [000]     4.776047: writeback_queue:      bdi 253:0: sb_dev 0:0 nr_pages=1105 sync_mode=0 kupdate=0 range_cyclic=0 background=0 reason=try_to_free_pages
       mkfs.ext4-664   [000]     4.776054: kmalloc:              (wb_start_writeback+0x42) call_site=ffffffff811970d2 ptr=0xffff88000d74a620 bytes_req=64 bytes_alloc=392 gfp_flags=GFP_ATOMIC|GFP_ZERO
       mkfs.ext4-664   [000]     4.776054: writeback_queue:      bdi 254:0: sb_dev 0:0 nr_pages=1105 sync_mode=0 kupdate=0 range_cyclic=0 background=0 reason=try_to_free_pages
       mkfs.ext4-664   [000]     4.776061: kmalloc:              (wb_start_writeback+0x42) call_site=ffffffff811970d2 ptr=0xffff88000d74b310 bytes_req=64 bytes_alloc=392 gfp_flags=GFP_ATOMIC|GFP_ZERO
       mkfs.ext4-664   [000]     4.776061: writeback_queue:      bdi 253:0: sb_dev 0:0 nr_pages=1105 sync_mode=0 kupdate=0 range_cyclic=0 background=0 reason=try_to_free_pages
       mkfs.ext4-664   [000]     4.776068: kmalloc:              (wb_start_writeback+0x42) call_site=ffffffff811970d2 ptr=0xffff88000d74b620 bytes_req=64 bytes_alloc=392 gfp_flags=GFP_ATOMIC|GFP_ZERO

All to no avail since we never wait for any writes and end up OOMing
after some rounds of direct reclaim:

       mkfs.ext4-664   [000]     4.776174: mm_vmscan_direct_reclaim_end: nr_reclaimed=0
       mkfs.ext4-664   [000]     4.776192: function:             out_of_memory <-- __alloc_pages_nodemask

(A full ftrace is available at https://drive.google.com/file/d/0B4tMLbMvJ-l6MndXbG5wQmZrTm8/view)

The problem seems to be that throttle_vm_writeout() does not respect the set
vm_dirty_ratio values.  With the following change, mkfs.ext4 is able to
complete successfully every time in this scenario without triggering OOM:

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 5cccc12..47f1d09 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1916,7 +1920,6 @@ void throttle_vm_writeout(gfp_t gfp_mask)
 
         for ( ; ; ) {
 		global_dirty_limits(&background_thresh, &dirty_thresh);
-		dirty_thresh = hard_dirty_limit(&global_wb_domain, dirty_thresh);
 
                 /*
                  * Boost the allowable dirty threshold a bit for page

But this is a revert of the following commit and presumably reintroduces the
problem for the use case described there:

  commit 47a133339c332f9f8e155c70f5da401aded69948
  Author: Fengguang Wu <fengguang.wu@intel.com>
  Date:   Wed Mar 21 16:34:09 2012 -0700
  
      mm: use global_dirty_limit in throttle_vm_writeout()
  
      When starting a memory hog task, a desktop box w/o swap is found to go
      unresponsive for a long time.  It's solely caused by lots of congestion
      waits in throttle_vm_writeout():
      
      ...
      
      The root cause is, the dirty threshold is knocked down a lot by the memory
      hog task.  Fixed by using global_dirty_limit which decreases gradually on
      such events and can guarantee we stay above (the also decreasing) nr_dirty
      in the progress of following down to the new dirty threshold.

Comments?

Thanks,
/Rabin

             reply	other threads:[~2015-08-05 22:21 UTC|newest]

Thread overview: 2+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-08-05 22:21 Rabin Vincent [this message]
2015-08-05 22:21 ` [PATCH?] Non-throttling of mkfs leads to OOM Rabin Vincent

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20150805222151.GA24795@axis.com \
    --to=rabin.vincent@axis.com \
    --cc=akpm@linux-foundation.org \
    --cc=fengguang.wu@intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.