* 4.3-rc1 dirty page count underflow (cgroup-related?)
@ 2015-09-18 4:29 Dave Hansen
2015-09-18 6:09 ` Greg Thelen
0 siblings, 1 reply; 6+ messages in thread
From: Dave Hansen @ 2015-09-18 4:29 UTC (permalink / raw)
To: Johannes Weiner, Michal Hocko, Tejun Heo, Jens Axboe,
Andrew Morton, Jan Kara,
open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG),
open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG),
open list, Greg Thelen
[-- Attachment #1: Type: text/plain, Size: 4536 bytes --]
I've been seeing some strange behavior with 4.3-rc1 kernels on my Ubuntu
14.04.3 system. The system will run fine for a few hours, but suddenly
start becoming horribly I/O bound. A compile of perf for instance takes
20-30 minutes and the compile seems entirely I/O bound. But, the SSD is
only seeing tens or hundreds of KB/s of writes.
Looking at some writeback tracepoints shows it hitting
balance_dirty_pages() pretty hard with a pretty large number of dirty
pages. :)
> ld-27008 [000] ...1 88895.190770: balance_dirty_pages: bdi
> 8:0: limit=234545 setpoint=204851 dirty=18446744073709513951
> bdi_setpoint=184364 bdi_dirty=33 dirty_ratelimit=24 task_ratelimit=0
> dirtied=1 dirtied_pause=0 paused=0 pause=136 period=136 think=0
> cgroup=/user/1000.user/c2.session
So something is underflowing dirty.
I added the attached patch and got a warning pretty quickly, so this
looks pretty reproducible for me.
I'm not 100% sure this is from the 4.3 merge window. I was running the
4.2-rcs, but they seemed to have their own issues. Ubuntu seems to be
automatically creating some cgroups, so they're definitely in play here.
Any ideas what is going on?
> [ 12.415472] ------------[ cut here ]------------
> [ 12.415481] WARNING: CPU: 1 PID: 1684 at mm/page-writeback.c:2435 account_page_cleaned+0x101/0x110()
> [ 12.415483] MEM_CGROUP_STAT_DIRTY bogus
> [ 12.415484] Modules linked in: ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_CHECKSUM iptable_mangle xt_tcpudp bridge stp llc iptable_filter ip_tables ebtable_nat ebtables x_tables dm_crypt cmac rfcomm bnep arc4 iwldvm mac80211 btusb snd_hda_codec_hdmi iwlwifi btrtl btbcm btintel snd_hda_codec_realtek snd_hda_codec_generic bluetooth snd_hda_intel snd_hda_codec cfg80211 snd_hwdep snd_hda_core intel_rapl iosf_mbi hid_logitech_hidpp x86_pkg_temp_thermal snd_pcm coretemp ghash_clmulni_intel thinkpad_acpi joydev snd_seq_midi snd_seq_midi_event snd_rawmidi nvram snd_seq snd_timer snd_seq_device wmi snd soundcore mac_hid lpc_ich aesni_intel aes_x86_64 glue_helper lrw gf128mul ablk_helper cryptd kvm_intel kvm hid_logitech_dj sdhci_pci sdhci usbhid hid
> [ 12.415538] CPU: 1 PID: 1684 Comm: indicator-keybo Not tainted 4.3.0-rc1-dirty #25
> [ 12.415540] Hardware name: LENOVO 2325AR2/2325AR2, BIOS G2ETA4WW (2.64 ) 04/09/2015
> [ 12.415542] ffffffff81aa8172 ffff8800c926ba30 ffffffff8132e3e2 ffff8800c926ba78
> [ 12.415544] ffff8800c926ba68 ffffffff8105e386 ffffea000fca4700 ffff880409b96420
> [ 12.415547] ffff8803ef979490 ffff880403ff4800 0000000000000000 ffff8800c926bac8
> [ 12.415550] Call Trace:
> [ 12.415555] [<ffffffff8132e3e2>] dump_stack+0x4b/0x69
> [ 12.415560] [<ffffffff8105e386>] warn_slowpath_common+0x86/0xc0
> [ 12.415563] [<ffffffff8105e40c>] warn_slowpath_fmt+0x4c/0x50
> [ 12.415566] [<ffffffff8115f951>] account_page_cleaned+0x101/0x110
> [ 12.415568] [<ffffffff8115fa1d>] cancel_dirty_page+0xbd/0xf0
> [ 12.415571] [<ffffffff811fc044>] try_to_free_buffers+0x94/0xb0
> [ 12.415575] [<ffffffff81296740>] jbd2_journal_try_to_free_buffers+0x100/0x130
> [ 12.415578] [<ffffffff812498b2>] ext4_releasepage+0x52/0xa0
> [ 12.415582] [<ffffffff81153765>] try_to_release_page+0x35/0x50
> [ 12.415585] [<ffffffff811fcc13>] block_invalidatepage+0x113/0x130
> [ 12.415587] [<ffffffff81249d5e>] ext4_invalidatepage+0x5e/0xb0
> [ 12.415590] [<ffffffff8124a6f0>] ext4_da_invalidatepage+0x40/0x310
> [ 12.415593] [<ffffffff81162b03>] truncate_inode_page+0x83/0x90
> [ 12.415595] [<ffffffff81162ce9>] truncate_inode_pages_range+0x199/0x730
> [ 12.415598] [<ffffffff8109c494>] ? __wake_up+0x44/0x50
> [ 12.415600] [<ffffffff812962ca>] ? jbd2_journal_stop+0x1ba/0x3b0
> [ 12.415603] [<ffffffff8125a754>] ? ext4_unlink+0x2f4/0x330
> [ 12.415607] [<ffffffff811eed3d>] ? __inode_wait_for_writeback+0x6d/0xc0
> [ 12.415609] [<ffffffff811632ec>] truncate_inode_pages_final+0x4c/0x60
> [ 12.415612] [<ffffffff81250736>] ext4_evict_inode+0x116/0x4c0
> [ 12.415615] [<ffffffff811e139c>] evict+0xbc/0x190
> [ 12.415617] [<ffffffff811e1d2d>] iput+0x17d/0x1e0
> [ 12.415620] [<ffffffff811d623b>] do_unlinkat+0x1ab/0x2b0
> [ 12.415622] [<ffffffff811d6cb6>] SyS_unlink+0x16/0x20
> [ 12.415626] [<ffffffff817d7f97>] entry_SYSCALL_64_fastpath+0x12/0x6a
> [ 12.415628] ---[ end trace 6fba1ddd3d240e13 ]---
> [ 12.418211] ------------[ cut here ]------------
[-- Attachment #2: dirtywarn.patch --]
[-- Type: text/x-patch, Size: 3944 bytes --]
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6ddaeba..f130d02 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3757,6 +3757,9 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pavail,
unsigned long file_pages;
*pdirty = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_DIRTY);
+ WARN(*pdirty > (1<<30), "%s() huge dirty: %ld\n", __func__, *pdirty);
+ if (*pdirty > (1<<30))
+ *pdirty = 0;
/* this should eventually include NR_UNSTABLE_NFS */
*pwriteback = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_WRITEBACK);
@@ -4607,6 +4610,8 @@ static int mem_cgroup_move_account(struct page *page,
nr_pages);
__this_cpu_add(to->stat->count[MEM_CGROUP_STAT_DIRTY],
nr_pages);
+ WARN_ONCE(__this_cpu_read(from->stat->count[MEM_CGROUP_STAT_DIRTY]) > (1UL<<30), "MEM_CGROUP_STAT_DIRTY bogus");
+ WARN_ONCE(__this_cpu_read( to->stat->count[MEM_CGROUP_STAT_DIRTY]) > (1UL<<30), "MEM_CGROUP_STAT_DIRTY bogus");
}
}
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 0a931cd..9dfd8fa 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1546,6 +1546,7 @@ static void balance_dirty_pages(struct address_space *mapping,
global_page_state(NR_UNSTABLE_NFS);
gdtc->avail = global_dirtyable_memory();
gdtc->dirty = nr_reclaimable + global_page_state(NR_WRITEBACK);
+ WARN(gdtc->dirty > (1<<30), "%s() huge dirty: %ld\n", __func__, gdtc->dirty);
domain_dirty_limits(gdtc);
@@ -1570,6 +1571,7 @@ static void balance_dirty_pages(struct address_space *mapping,
*/
mem_cgroup_wb_stats(wb, &mdtc->avail, &mdtc->dirty,
&writeback);
+ WARN(mdtc->dirty > (1<<30), "%s() huge dirty: %ld\n", __func__, mdtc->dirty);
mdtc_cap_avail(mdtc);
mdtc->dirty += writeback;
@@ -1884,6 +1886,7 @@ bool wb_over_bg_thresh(struct bdi_writeback *wb)
gdtc->avail = global_dirtyable_memory();
gdtc->dirty = global_page_state(NR_FILE_DIRTY) +
global_page_state(NR_UNSTABLE_NFS);
+ WARN(gdtc->dirty > (1<<30), "%s() huge dirty: %ld\n", __func__, gdtc->dirty);
domain_dirty_limits(gdtc);
if (gdtc->dirty > gdtc->bg_thresh)
@@ -1896,6 +1899,7 @@ bool wb_over_bg_thresh(struct bdi_writeback *wb)
unsigned long writeback;
mem_cgroup_wb_stats(wb, &mdtc->avail, &mdtc->dirty, &writeback);
+ WARN(mdtc->dirty > (1<<30), "%s() huge dirty: %ld\n", __func__, mdtc->dirty);
mdtc_cap_avail(mdtc);
domain_dirty_limits(mdtc); /* ditto, ignore writeback */
@@ -2404,7 +2408,9 @@ void account_page_dirtied(struct page *page, struct address_space *mapping,
inode_attach_wb(inode, page);
wb = inode_to_wb(inode);
+ WARN_ONCE(__this_cpu_read(memcg->stat->count[MEM_CGROUP_STAT_DIRTY]) > (1UL<<30), "MEM_CGROUP_STAT_DIRTY bogus");
mem_cgroup_inc_page_stat(memcg, MEM_CGROUP_STAT_DIRTY);
+ WARN_ONCE(__this_cpu_read(memcg->stat->count[MEM_CGROUP_STAT_DIRTY]) > (1UL<<30), "MEM_CGROUP_STAT_DIRTY bogus");
__inc_zone_page_state(page, NR_FILE_DIRTY);
__inc_zone_page_state(page, NR_DIRTIED);
__inc_wb_stat(wb, WB_RECLAIMABLE);
@@ -2426,6 +2432,7 @@ void account_page_cleaned(struct page *page, struct address_space *mapping,
{
if (mapping_cap_account_dirty(mapping)) {
mem_cgroup_dec_page_stat(memcg, MEM_CGROUP_STAT_DIRTY);
+ WARN_ONCE(__this_cpu_read(memcg->stat->count[MEM_CGROUP_STAT_DIRTY]) > (1UL<<30), "MEM_CGROUP_STAT_DIRTY bogus");
dec_zone_page_state(page, NR_FILE_DIRTY);
dec_wb_stat(wb, WB_RECLAIMABLE);
task_io_account_cancelled_write(PAGE_CACHE_SIZE);
@@ -2686,6 +2693,7 @@ int clear_page_dirty_for_io(struct page *page)
wb = unlocked_inode_to_wb_begin(inode, &locked);
if (TestClearPageDirty(page)) {
mem_cgroup_dec_page_stat(memcg, MEM_CGROUP_STAT_DIRTY);
+ WARN_ONCE(__this_cpu_read(memcg->stat->count[MEM_CGROUP_STAT_DIRTY]) > (1UL<<30), "MEM_CGROUP_STAT_DIRTY bogus");
dec_zone_page_state(page, NR_FILE_DIRTY);
dec_wb_stat(wb, WB_RECLAIMABLE);
ret = 1;
* Re: 4.3-rc1 dirty page count underflow (cgroup-related?)
2015-09-18 4:29 4.3-rc1 dirty page count underflow (cgroup-related?) Dave Hansen
@ 2015-09-18 6:09 ` Greg Thelen
2015-09-18 8:31 ` Greg Thelen
2015-09-18 14:50 ` Dave Hansen
0 siblings, 2 replies; 6+ messages in thread
From: Greg Thelen @ 2015-09-18 6:09 UTC (permalink / raw)
To: Dave Hansen
Cc: Johannes Weiner, Michal Hocko, Tejun Heo, Jens Axboe,
Andrew Morton, Jan Kara,
open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG),
open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG),
open list
Dave Hansen wrote:
> I've been seeing some strange behavior with 4.3-rc1 kernels on my Ubuntu
> 14.04.3 system. The system will run fine for a few hours, but suddenly
> start becoming horribly I/O bound. A compile of perf for instance takes
> 20-30 minutes and the compile seems entirely I/O bound. But, the SSD is
> only seeing tens or hundreds of KB/s of writes.
>
> Looking at some writeback tracepoints shows it hitting
> balance_dirty_pages() pretty hard with a pretty large number of dirty
> pages. :)
>
>> ld-27008 [000] ...1 88895.190770: balance_dirty_pages: bdi
>> 8:0: limit=234545 setpoint=204851 dirty=18446744073709513951
>> bdi_setpoint=184364 bdi_dirty=33 dirty_ratelimit=24 task_ratelimit=0
>> dirtied=1 dirtied_pause=0 paused=0 pause=136 period=136 think=0
>> cgroup=/user/1000.user/c2.session
>
> So something is underflowing dirty.
>
> I added the attached patch and got a warning pretty quickly, so this
> looks pretty reproducible for me.
>
> I'm not 100% sure this is from the 4.3 merge window. I was running the
> 4.2-rcs, but they seemed to have their own issues. Ubuntu seems to be
> automatically creating some cgroups, so they're definitely in play here.
>
> Any ideas what is going on?
>
>> [ 12.415472] ------------[ cut here ]------------
>> [ 12.415481] WARNING: CPU: 1 PID: 1684 at mm/page-writeback.c:2435 account_page_cleaned+0x101/0x110()
>> [ 12.415483] MEM_CGROUP_STAT_DIRTY bogus
>> [ 12.415484] Modules linked in: ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_CHECKSUM iptable_mangle xt_tcpudp bridge stp llc iptable_filter ip_tables ebtable_nat ebtables x_tables dm_crypt cmac rfcomm bnep arc4 iwldvm mac80211 btusb snd_hda_codec_hdmi iwlwifi btrtl btbcm btintel snd_hda_codec_realtek snd_hda_codec_generic bluetooth snd_hda_intel snd_hda_codec cfg80211 snd_hwdep snd_hda_core intel_rapl iosf_mbi hid_logitech_hidpp x86_pkg_temp_thermal snd_pcm coretemp ghash_clmulni_intel thinkpad_acpi joydev snd_seq_midi snd_seq_midi_event snd_rawmidi nvram snd_seq snd_timer snd_seq_device wmi snd soundcore mac_hid lpc_ich aesni_intel aes_x86_64 glue_helper lrw gf128mul ablk_helper cryptd kvm_intel kvm hid_logitech_dj sdhci_pci sdhci usbhid hid
>> [ 12.415538] CPU: 1 PID: 1684 Comm: indicator-keybo Not tainted 4.3.0-rc1-dirty #25
>> [ 12.415540] Hardware name: LENOVO 2325AR2/2325AR2, BIOS G2ETA4WW (2.64 ) 04/09/2015
>> [ 12.415542] ffffffff81aa8172 ffff8800c926ba30 ffffffff8132e3e2 ffff8800c926ba78
>> [ 12.415544] ffff8800c926ba68 ffffffff8105e386 ffffea000fca4700 ffff880409b96420
>> [ 12.415547] ffff8803ef979490 ffff880403ff4800 0000000000000000 ffff8800c926bac8
>> [ 12.415550] Call Trace:
>> [ 12.415555] [<ffffffff8132e3e2>] dump_stack+0x4b/0x69
>> [ 12.415560] [<ffffffff8105e386>] warn_slowpath_common+0x86/0xc0
>> [ 12.415563] [<ffffffff8105e40c>] warn_slowpath_fmt+0x4c/0x50
>> [ 12.415566] [<ffffffff8115f951>] account_page_cleaned+0x101/0x110
>> [ 12.415568] [<ffffffff8115fa1d>] cancel_dirty_page+0xbd/0xf0
>> [ 12.415571] [<ffffffff811fc044>] try_to_free_buffers+0x94/0xb0
>> [ 12.415575] [<ffffffff81296740>] jbd2_journal_try_to_free_buffers+0x100/0x130
>> [ 12.415578] [<ffffffff812498b2>] ext4_releasepage+0x52/0xa0
>> [ 12.415582] [<ffffffff81153765>] try_to_release_page+0x35/0x50
>> [ 12.415585] [<ffffffff811fcc13>] block_invalidatepage+0x113/0x130
>> [ 12.415587] [<ffffffff81249d5e>] ext4_invalidatepage+0x5e/0xb0
>> [ 12.415590] [<ffffffff8124a6f0>] ext4_da_invalidatepage+0x40/0x310
>> [ 12.415593] [<ffffffff81162b03>] truncate_inode_page+0x83/0x90
>> [ 12.415595] [<ffffffff81162ce9>] truncate_inode_pages_range+0x199/0x730
>> [ 12.415598] [<ffffffff8109c494>] ? __wake_up+0x44/0x50
>> [ 12.415600] [<ffffffff812962ca>] ? jbd2_journal_stop+0x1ba/0x3b0
>> [ 12.415603] [<ffffffff8125a754>] ? ext4_unlink+0x2f4/0x330
>> [ 12.415607] [<ffffffff811eed3d>] ? __inode_wait_for_writeback+0x6d/0xc0
>> [ 12.415609] [<ffffffff811632ec>] truncate_inode_pages_final+0x4c/0x60
>> [ 12.415612] [<ffffffff81250736>] ext4_evict_inode+0x116/0x4c0
>> [ 12.415615] [<ffffffff811e139c>] evict+0xbc/0x190
>> [ 12.415617] [<ffffffff811e1d2d>] iput+0x17d/0x1e0
>> [ 12.415620] [<ffffffff811d623b>] do_unlinkat+0x1ab/0x2b0
>> [ 12.415622] [<ffffffff811d6cb6>] SyS_unlink+0x16/0x20
>> [ 12.415626] [<ffffffff817d7f97>] entry_SYSCALL_64_fastpath+0x12/0x6a
>> [ 12.415628] ---[ end trace 6fba1ddd3d240e13 ]---
>> [ 12.418211] ------------[ cut here ]------------
I'm not denying the issue, but the WARNING splat isn't necessarily
catching a problem. The corresponding code comes from your debug patch:
+ WARN_ONCE(__this_cpu_read(memcg->stat->count[MEM_CGROUP_STAT_DIRTY]) > (1UL<<30), "MEM_CGROUP_STAT_DIRTY bogus");
This only checks a single cpu's counter, which can be negative. The sum
of all counters is what matters.
Imagine:
cpu1) dirty page: inc
cpu2) clean page: dec
The sum is properly zero, but cpu2 is -1, which will trigger the WARN.
I'll look at the code and also see if I can reproduce the failure using
mem_cgroup_read_stat() for all of the new WARNs.
Did you notice if the global /proc/meminfo:Dirty count also underflowed?
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
* Re: 4.3-rc1 dirty page count underflow (cgroup-related?)
2015-09-18 6:09 ` Greg Thelen
@ 2015-09-18 8:31 ` Greg Thelen
2015-09-18 14:50 ` Dave Hansen
1 sibling, 0 replies; 6+ messages in thread
From: Greg Thelen @ 2015-09-18 8:31 UTC (permalink / raw)
To: Dave Hansen
Cc: Johannes Weiner, Michal Hocko, Tejun Heo, Jens Axboe,
Andrew Morton, Jan Kara,
open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG),
open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG),
open list
Greg Thelen wrote:
> Dave Hansen wrote:
>
>> I've been seeing some strange behavior with 4.3-rc1 kernels on my Ubuntu
>> 14.04.3 system. The system will run fine for a few hours, but suddenly
>> start becoming horribly I/O bound. A compile of perf for instance takes
>> 20-30 minutes and the compile seems entirely I/O bound. But, the SSD is
>> only seeing tens or hundreds of KB/s of writes.
>>
>> Looking at some writeback tracepoints shows it hitting
>> balance_dirty_pages() pretty hard with a pretty large number of dirty
>> pages. :)
>>
>>> ld-27008 [000] ...1 88895.190770: balance_dirty_pages: bdi
>>> 8:0: limit=234545 setpoint=204851 dirty=18446744073709513951
>>> bdi_setpoint=184364 bdi_dirty=33 dirty_ratelimit=24 task_ratelimit=0
>>> dirtied=1 dirtied_pause=0 paused=0 pause=136 period=136 think=0
>>> cgroup=/user/1000.user/c2.session
>>
>> So something is underflowing dirty.
>>
>> I added the attached patch and got a warning pretty quickly, so this
>> looks pretty reproducible for me.
>>
>> I'm not 100% sure this is from the 4.3 merge window. I was running the
>> 4.2-rcs, but they seemed to have their own issues. Ubuntu seems to be
>> automatically creating some cgroups, so they're definitely in play here.
>>
>> Any ideas what is going on?
>>
>>> [ 12.415472] ------------[ cut here ]------------
>>> [ 12.415481] WARNING: CPU: 1 PID: 1684 at mm/page-writeback.c:2435 account_page_cleaned+0x101/0x110()
>>> [ 12.415483] MEM_CGROUP_STAT_DIRTY bogus
>>> [ 12.415484] Modules linked in: ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_CHECKSUM iptable_mangle xt_tcpudp bridge stp llc iptable_filter ip_tables ebtable_nat ebtables x_tables dm_crypt cmac rfcomm bnep arc4 iwldvm mac80211 btusb snd_hda_codec_hdmi iwlwifi btrtl btbcm btintel snd_hda_codec_realtek snd_hda_codec_generic bluetooth snd_hda_intel snd_hda_codec cfg80211 snd_hwdep snd_hda_core intel_rapl iosf_mbi hid_logitech_hidpp x86_pkg_temp_thermal snd_pcm coretemp ghash_clmulni_intel thinkpad_acpi joydev snd_seq_midi snd_seq_midi_event snd_rawmidi nvram snd_seq snd_timer snd_seq_device wmi snd soundcore mac_hid lpc_ich aesni_intel aes_x86_64 glue_helper lrw gf128mul ablk_helper cryptd kvm_intel kvm hid_logitech_dj sdhci_pci sdhci usbhid hid
>>> [ 12.415538] CPU: 1 PID: 1684 Comm: indicator-keybo Not tainted 4.3.0-rc1-dirty #25
>>> [ 12.415540] Hardware name: LENOVO 2325AR2/2325AR2, BIOS G2ETA4WW (2.64 ) 04/09/2015
>>> [ 12.415542] ffffffff81aa8172 ffff8800c926ba30 ffffffff8132e3e2 ffff8800c926ba78
>>> [ 12.415544] ffff8800c926ba68 ffffffff8105e386 ffffea000fca4700 ffff880409b96420
>>> [ 12.415547] ffff8803ef979490 ffff880403ff4800 0000000000000000 ffff8800c926bac8
>>> [ 12.415550] Call Trace:
>>> [ 12.415555] [<ffffffff8132e3e2>] dump_stack+0x4b/0x69
>>> [ 12.415560] [<ffffffff8105e386>] warn_slowpath_common+0x86/0xc0
>>> [ 12.415563] [<ffffffff8105e40c>] warn_slowpath_fmt+0x4c/0x50
>>> [ 12.415566] [<ffffffff8115f951>] account_page_cleaned+0x101/0x110
>>> [ 12.415568] [<ffffffff8115fa1d>] cancel_dirty_page+0xbd/0xf0
>>> [ 12.415571] [<ffffffff811fc044>] try_to_free_buffers+0x94/0xb0
>>> [ 12.415575] [<ffffffff81296740>] jbd2_journal_try_to_free_buffers+0x100/0x130
>>> [ 12.415578] [<ffffffff812498b2>] ext4_releasepage+0x52/0xa0
>>> [ 12.415582] [<ffffffff81153765>] try_to_release_page+0x35/0x50
>>> [ 12.415585] [<ffffffff811fcc13>] block_invalidatepage+0x113/0x130
>>> [ 12.415587] [<ffffffff81249d5e>] ext4_invalidatepage+0x5e/0xb0
>>> [ 12.415590] [<ffffffff8124a6f0>] ext4_da_invalidatepage+0x40/0x310
>>> [ 12.415593] [<ffffffff81162b03>] truncate_inode_page+0x83/0x90
>>> [ 12.415595] [<ffffffff81162ce9>] truncate_inode_pages_range+0x199/0x730
>>> [ 12.415598] [<ffffffff8109c494>] ? __wake_up+0x44/0x50
>>> [ 12.415600] [<ffffffff812962ca>] ? jbd2_journal_stop+0x1ba/0x3b0
>>> [ 12.415603] [<ffffffff8125a754>] ? ext4_unlink+0x2f4/0x330
>>> [ 12.415607] [<ffffffff811eed3d>] ? __inode_wait_for_writeback+0x6d/0xc0
>>> [ 12.415609] [<ffffffff811632ec>] truncate_inode_pages_final+0x4c/0x60
>>> [ 12.415612] [<ffffffff81250736>] ext4_evict_inode+0x116/0x4c0
>>> [ 12.415615] [<ffffffff811e139c>] evict+0xbc/0x190
>>> [ 12.415617] [<ffffffff811e1d2d>] iput+0x17d/0x1e0
>>> [ 12.415620] [<ffffffff811d623b>] do_unlinkat+0x1ab/0x2b0
>>> [ 12.415622] [<ffffffff811d6cb6>] SyS_unlink+0x16/0x20
>>> [ 12.415626] [<ffffffff817d7f97>] entry_SYSCALL_64_fastpath+0x12/0x6a
>>> [ 12.415628] ---[ end trace 6fba1ddd3d240e13 ]---
>>> [ 12.418211] ------------[ cut here ]------------
>
> I'm not denying the issue, but the WARNING splat isn't necessarily
> catching a problem. The corresponding code comes from your debug patch:
> + WARN_ONCE(__this_cpu_read(memcg->stat->count[MEM_CGROUP_STAT_DIRTY]) > (1UL<<30), "MEM_CGROUP_STAT_DIRTY bogus");
>
> This only checks a single cpu's counter, which can be negative. The sum
> of all counters is what matters.
> Imagine:
> cpu1) dirty page: inc
> cpu2) clean page: dec
> The sum is properly zero, but cpu2 is -1, which will trigger the WARN.
>
> I'll look at the code and also see if I can reproduce the failure using
> mem_cgroup_read_stat() for all of the new WARNs.
>
> Did you notice if the global /proc/meminfo:Dirty count also underflowed?
Looks to be miscommunication between memcg dirty page accounting and
page migration.
* Re: 4.3-rc1 dirty page count underflow (cgroup-related?)
2015-09-18 6:09 ` Greg Thelen
2015-09-18 8:31 ` Greg Thelen
@ 2015-09-18 14:50 ` Dave Hansen
2015-09-21 8:06 ` Greg Thelen
1 sibling, 1 reply; 6+ messages in thread
From: Dave Hansen @ 2015-09-18 14:50 UTC (permalink / raw)
To: Greg Thelen
Cc: Johannes Weiner, Michal Hocko, Tejun Heo, Jens Axboe,
Andrew Morton, Jan Kara,
open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG),
open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG),
open list
On 09/17/2015 11:09 PM, Greg Thelen wrote:
> I'm not denying the issue, but the WARNING splat isn't necessarily
> catching a problem. The corresponding code comes from your debug patch:
> + WARN_ONCE(__this_cpu_read(memcg->stat->count[MEM_CGROUP_STAT_DIRTY]) > (1UL<<30), "MEM_CGROUP_STAT_DIRTY bogus");
>
> This only checks a single cpu's counter, which can be negative. The sum
> of all counters is what matters.
> Imagine:
> cpu1) dirty page: inc
> cpu2) clean page: dec
> The sum is properly zero, but cpu2 is -1, which will trigger the WARN.
>
> I'll look at the code and also see if I can reproduce the failure using
> mem_cgroup_read_stat() for all of the new WARNs.
D'oh. I'll replace those with the proper mem_cgroup_read_stat() and
test with your patch to see if anything still triggers.
> Did you notice if the global /proc/meminfo:Dirty count also underflowed?
It did not underflow. It was one of the first things I looked at and it
looked fine, went down near 0 at 'sync', etc...
* Re: 4.3-rc1 dirty page count underflow (cgroup-related?)
2015-09-18 14:50 ` Dave Hansen
@ 2015-09-21 8:06 ` Greg Thelen
2015-09-22 9:03 ` Michal Hocko
0 siblings, 1 reply; 6+ messages in thread
From: Greg Thelen @ 2015-09-21 8:06 UTC (permalink / raw)
To: Dave Hansen
Cc: Johannes Weiner, Michal Hocko, Tejun Heo, Jens Axboe,
Andrew Morton, Jan Kara,
open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG),
open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG),
open list
Dave Hansen wrote:
> On 09/17/2015 11:09 PM, Greg Thelen wrote:
>> I'm not denying the issue, but the WARNING splat isn't necessarily
>> catching a problem. The corresponding code comes from your debug patch:
>> + WARN_ONCE(__this_cpu_read(memcg->stat->count[MEM_CGROUP_STAT_DIRTY]) > (1UL<<30), "MEM_CGROUP_STAT_DIRTY bogus");
>>
>> This only checks a single cpu's counter, which can be negative. The sum
>> of all counters is what matters.
>> Imagine:
>> cpu1) dirty page: inc
>> cpu2) clean page: dec
>> The sum is properly zero, but cpu2 is -1, which will trigger the WARN.
>>
>> I'll look at the code and also see if I can reproduce the failure using
>> mem_cgroup_read_stat() for all of the new WARNs.
>
> D'oh. I'll replace those with the proper mem_cgroup_read_stat() and
> test with your patch to see if anything still triggers.
Thanks Dave.
Here's what I think we should use to fix the issue. I tagged this for
v4.2 stable given the way that unpatched performance falls apart without
warning or workaround (besides deleting and recreating affected memcg).
Feedback welcome.
* Re: 4.3-rc1 dirty page count underflow (cgroup-related?)
2015-09-21 8:06 ` Greg Thelen
@ 2015-09-22 9:03 ` Michal Hocko
0 siblings, 0 replies; 6+ messages in thread
From: Michal Hocko @ 2015-09-22 9:03 UTC (permalink / raw)
To: Greg Thelen
Cc: Dave Hansen, Johannes Weiner, Tejun Heo, Jens Axboe,
Andrew Morton, Jan Kara,
open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG),
open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG),
open list
[I am sorry I didn't get to this earlier because I was at an internal
conference last week]
On Mon 21-09-15 01:06:58, Greg Thelen wrote:
[...]
> From f5c39c2e8471c10fe0464ca7b6e6f743ce6920a6 Mon Sep 17 00:00:00 2001
> From: Greg Thelen <gthelen@google.com>
> Date: Sat, 19 Sep 2015 16:21:18 -0700
> Subject: [PATCH] memcg: fix dirty page migration
>
> The problem starts with a file backed dirty page which is charged to a
> memcg. Then page migration is used to move oldpage to newpage.
> Migration:
> - copies the oldpage's data to newpage
> - clears oldpage.PG_dirty
> - sets newpage.PG_dirty
> - uncharges oldpage from memcg
> - charges newpage to memcg
>
> Clearing oldpage.PG_dirty decrements the charged memcg's dirty page
> count. However, because newpage is not yet charged, setting
> newpage.PG_dirty does not increment the memcg's dirty page count. After
> migration completes newpage.PG_dirty is eventually cleared, often in
> account_page_cleaned(). At this time newpage is charged to a memcg so
> the memcg's dirty page count is decremented which causes underflow
> because the count was not previously incremented by migration. This
> underflow causes balance_dirty_pages() to see a very large unsigned
> number of dirty memcg pages which leads to aggressive throttling of
> buffered writes by processes in non root memcg.
Very well spotted!
> This issue:
> - can harm performance of non root memcg buffered writes.
> - can report too small (even negative) values in
> memory.stat[(total_)dirty] counters of all memcg, including the root.
>
> To avoid polluting migrate.c with #ifdef CONFIG_MEMCG checks, introduce
> page_memcg() and set_page_memcg() helpers.
>
> Test:
> 0) setup and enter limited memcg
> mkdir /sys/fs/cgroup/test
> echo 1G > /sys/fs/cgroup/test/memory.limit_in_bytes
> echo $$ > /sys/fs/cgroup/test/cgroup.procs
>
> 1) buffered writes baseline
> dd if=/dev/zero of=/data/tmp/foo bs=1M count=1k
> sync
> grep ^dirty /sys/fs/cgroup/test/memory.stat
>
> 2) buffered writes with compaction antagonist to induce migration
> yes 1 > /proc/sys/vm/compact_memory &
> rm -rf /data/tmp/foo
> dd if=/dev/zero of=/data/tmp/foo bs=1M count=1k
> kill %
> sync
> grep ^dirty /sys/fs/cgroup/test/memory.stat
>
> 3) buffered writes without antagonist, should match baseline
> rm -rf /data/tmp/foo
> dd if=/dev/zero of=/data/tmp/foo bs=1M count=1k
> sync
> grep ^dirty /sys/fs/cgroup/test/memory.stat
>
> (speed, dirty residue)
> unpatched patched
> 1) 841 MB/s 0 dirty pages 886 MB/s 0 dirty pages
> 2) 611 MB/s -33427456 dirty pages 793 MB/s 0 dirty pages
> 3) 114 MB/s -33427456 dirty pages 891 MB/s 0 dirty pages
>
> Notice that unpatched baseline performance (1) fell after
> migration (3): 841 -> 114 MB/s. In the patched kernel, post
> migration performance matches baseline.
>
> Fixes: c4843a7593a9 ("memcg: add per cgroup dirty page accounting")
> Cc: <stable@vger.kernel.org> # 4.2+
> Reported-by: Dave Hansen <dave.hansen@intel.com>
> Signed-off-by: Greg Thelen <gthelen@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Thanks!
> ---
> include/linux/mm.h | 21 +++++++++++++++++++++
> mm/migrate.c | 12 +++++++++++-
> 2 files changed, 32 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 91c08f6f0dc9..80001de019ba 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -905,6 +905,27 @@ static inline void set_page_links(struct page *page, enum zone_type zone,
> #endif
> }
>
> +#ifdef CONFIG_MEMCG
> +static inline struct mem_cgroup *page_memcg(struct page *page)
> +{
> + return page->mem_cgroup;
> +}
> +
> +static inline void set_page_memcg(struct page *page, struct mem_cgroup *memcg)
> +{
> + page->mem_cgroup = memcg;
> +}
> +#else
> +static inline struct mem_cgroup *page_memcg(struct page *page)
> +{
> + return NULL;
> +}
> +
> +static inline void set_page_memcg(struct page *page, struct mem_cgroup *memcg)
> +{
> +}
> +#endif
> +
> /*
> * Some inline functions in vmstat.h depend on page_zone()
> */
> diff --git a/mm/migrate.c b/mm/migrate.c
> index c3cb566af3e2..6116b8f64d27 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -740,6 +740,15 @@ static int move_to_new_page(struct page *newpage, struct page *page,
> if (PageSwapBacked(page))
> SetPageSwapBacked(newpage);
>
> + /*
> + * Indirectly called below, migrate_page_copy() copies PG_dirty and thus
> + * needs newpage's memcg set to transfer memcg dirty page accounting.
> + * So perform memcg migration in two steps:
> + * 1. set newpage->mem_cgroup (here)
> + * 2. clear page->mem_cgroup (below)
> + */
> + set_page_memcg(newpage, page_memcg(page));
> +
> mapping = page_mapping(page);
> if (!mapping)
> rc = migrate_page(mapping, newpage, page, mode);
> @@ -756,9 +765,10 @@ static int move_to_new_page(struct page *newpage, struct page *page,
> rc = fallback_migrate_page(mapping, newpage, page, mode);
>
> if (rc != MIGRATEPAGE_SUCCESS) {
> + set_page_memcg(newpage, NULL);
> newpage->mapping = NULL;
> } else {
> - mem_cgroup_migrate(page, newpage, false);
> + set_page_memcg(page, NULL);
> if (page_was_mapped)
> remove_migration_ptes(page, newpage);
> page->mapping = NULL;
> --
> 2.6.0.rc0.131.gf624c3d
--
Michal Hocko
SUSE Labs