* 4.3-rc1 dirty page count underflow (cgroup-related?)
@ 2015-09-18 4:29 Dave Hansen
2015-09-18 6:09 ` Greg Thelen
0 siblings, 1 reply; 6+ messages in thread
From: Dave Hansen @ 2015-09-18 4:29 UTC (permalink / raw)
To: Johannes Weiner, Michal Hocko, Tejun Heo, Jens Axboe,
Andrew Morton, Jan Kara,
open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG),
open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG),
open list, Greg Thelen
[-- Attachment #1: Type: text/plain, Size: 4536 bytes --]
I've been seeing some strange behavior with 4.3-rc1 kernels on my Ubuntu
14.04.3 system. The system will run fine for a few hours, but suddenly
start becoming horribly I/O bound. A compile of perf for instance takes
20-30 minutes and the compile seems entirely I/O bound. But, the SSD is
only seeing tens or hundreds of KB/s of writes.
Looking at some writeback tracepoints shows it hitting
balance_dirty_pages() pretty hard with a pretty large number of dirty
pages. :)
> ld-27008 [000] ...1 88895.190770: balance_dirty_pages: bdi
> 8:0: limit=234545 setpoint=204851 dirty=18446744073709513951
> bdi_setpoint=184364 bdi_dirty=33 dirty_ratelimit=24 task_ratelimit=0
> dirtied=1 dirtied_pause=0 paused=0 pause=136 period=136 think=0
> cgroup=/user/1000.user/c2.session
So something is underflowing dirty.
I added the attached patch and got a warning pretty quickly, so this
looks pretty reproducible for me.
I'm not 100% sure this is from the 4.3 merge window. I was running the
4.2-rcs, but they seemed to have their own issues. Ubuntu seems to be
automatically creating some cgroups, so they're definitely in play here.
Any ideas what is going on?
> [ 12.415472] ------------[ cut here ]------------
> [ 12.415481] WARNING: CPU: 1 PID: 1684 at mm/page-writeback.c:2435 account_page_cleaned+0x101/0x110()
> [ 12.415483] MEM_CGROUP_STAT_DIRTY bogus
> [ 12.415484] Modules linked in: ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_CHECKSUM iptable_mangle xt_tcpudp bridge stp llc iptable_filter ip_tables ebtable_nat ebtables x_tables dm_crypt cmac rfcomm bnep arc4 iwldvm mac80211 btusb snd_hda_codec_hdmi iwlwifi btrtl btbcm btintel snd_hda_codec_realtek snd_hda_codec_generic bluetooth snd_hda_intel snd_hda_codec cfg80211 snd_hwdep snd_hda_core intel_rapl iosf_mbi hid_logitech_hidpp x86_pkg_temp_thermal snd_pcm coretemp ghash_clmulni_intel thinkpad_acpi joydev snd_seq_midi snd_seq_midi_event snd_rawmidi nvram snd_seq snd_timer snd_seq_device wmi snd soundcore mac_hid lpc_ich aesni_intel aes_x86_64 glue_helper lrw gf128mul ablk_helper cryptd kvm_intel kvm hid_logitech_dj sdhci_pci sdhci usbhid hid
> [ 12.415538] CPU: 1 PID: 1684 Comm: indicator-keybo Not tainted 4.3.0-rc1-dirty #25
> [ 12.415540] Hardware name: LENOVO 2325AR2/2325AR2, BIOS G2ETA4WW (2.64 ) 04/09/2015
> [ 12.415542] ffffffff81aa8172 ffff8800c926ba30 ffffffff8132e3e2 ffff8800c926ba78
> [ 12.415544] ffff8800c926ba68 ffffffff8105e386 ffffea000fca4700 ffff880409b96420
> [ 12.415547] ffff8803ef979490 ffff880403ff4800 0000000000000000 ffff8800c926bac8
> [ 12.415550] Call Trace:
> [ 12.415555] [<ffffffff8132e3e2>] dump_stack+0x4b/0x69
> [ 12.415560] [<ffffffff8105e386>] warn_slowpath_common+0x86/0xc0
> [ 12.415563] [<ffffffff8105e40c>] warn_slowpath_fmt+0x4c/0x50
> [ 12.415566] [<ffffffff8115f951>] account_page_cleaned+0x101/0x110
> [ 12.415568] [<ffffffff8115fa1d>] cancel_dirty_page+0xbd/0xf0
> [ 12.415571] [<ffffffff811fc044>] try_to_free_buffers+0x94/0xb0
> [ 12.415575] [<ffffffff81296740>] jbd2_journal_try_to_free_buffers+0x100/0x130
> [ 12.415578] [<ffffffff812498b2>] ext4_releasepage+0x52/0xa0
> [ 12.415582] [<ffffffff81153765>] try_to_release_page+0x35/0x50
> [ 12.415585] [<ffffffff811fcc13>] block_invalidatepage+0x113/0x130
> [ 12.415587] [<ffffffff81249d5e>] ext4_invalidatepage+0x5e/0xb0
> [ 12.415590] [<ffffffff8124a6f0>] ext4_da_invalidatepage+0x40/0x310
> [ 12.415593] [<ffffffff81162b03>] truncate_inode_page+0x83/0x90
> [ 12.415595] [<ffffffff81162ce9>] truncate_inode_pages_range+0x199/0x730
> [ 12.415598] [<ffffffff8109c494>] ? __wake_up+0x44/0x50
> [ 12.415600] [<ffffffff812962ca>] ? jbd2_journal_stop+0x1ba/0x3b0
> [ 12.415603] [<ffffffff8125a754>] ? ext4_unlink+0x2f4/0x330
> [ 12.415607] [<ffffffff811eed3d>] ? __inode_wait_for_writeback+0x6d/0xc0
> [ 12.415609] [<ffffffff811632ec>] truncate_inode_pages_final+0x4c/0x60
> [ 12.415612] [<ffffffff81250736>] ext4_evict_inode+0x116/0x4c0
> [ 12.415615] [<ffffffff811e139c>] evict+0xbc/0x190
> [ 12.415617] [<ffffffff811e1d2d>] iput+0x17d/0x1e0
> [ 12.415620] [<ffffffff811d623b>] do_unlinkat+0x1ab/0x2b0
> [ 12.415622] [<ffffffff811d6cb6>] SyS_unlink+0x16/0x20
> [ 12.415626] [<ffffffff817d7f97>] entry_SYSCALL_64_fastpath+0x12/0x6a
> [ 12.415628] ---[ end trace 6fba1ddd3d240e13 ]---
> [ 12.418211] ------------[ cut here ]------------
[-- Attachment #2: dirtywarn.patch --]
[-- Type: text/x-patch, Size: 3944 bytes --]
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6ddaeba..f130d02 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3757,6 +3757,9 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pavail,
unsigned long file_pages;
*pdirty = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_DIRTY);
+ WARN(*pdirty > (1<<30), "%s() huge dirty: %ld\n", __func__, *pdirty);
+ if (*pdirty > (1<<30))
+ *pdirty = 0;
/* this should eventually include NR_UNSTABLE_NFS */
*pwriteback = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_WRITEBACK);
@@ -4607,6 +4610,8 @@ static int mem_cgroup_move_account(struct page *page,
nr_pages);
__this_cpu_add(to->stat->count[MEM_CGROUP_STAT_DIRTY],
nr_pages);
+ WARN_ONCE(__this_cpu_read(from->stat->count[MEM_CGROUP_STAT_DIRTY]) > (1UL<<30), "MEM_CGROUP_STAT_DIRTY bogus");
+ WARN_ONCE(__this_cpu_read( to->stat->count[MEM_CGROUP_STAT_DIRTY]) > (1UL<<30), "MEM_CGROUP_STAT_DIRTY bogus");
}
}
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 0a931cd..9dfd8fa 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1546,6 +1546,7 @@ static void balance_dirty_pages(struct address_space *mapping,
global_page_state(NR_UNSTABLE_NFS);
gdtc->avail = global_dirtyable_memory();
gdtc->dirty = nr_reclaimable + global_page_state(NR_WRITEBACK);
+ WARN(gdtc->dirty > (1<<30), "%s() huge dirty: %ld\n", __func__, gdtc->dirty);
domain_dirty_limits(gdtc);
@@ -1570,6 +1571,7 @@ static void balance_dirty_pages(struct address_space *mapping,
*/
mem_cgroup_wb_stats(wb, &mdtc->avail, &mdtc->dirty,
&writeback);
+ WARN(mdtc->dirty > (1<<30), "%s() huge dirty: %ld\n", __func__, mdtc->dirty);
mdtc_cap_avail(mdtc);
mdtc->dirty += writeback;
@@ -1884,6 +1886,7 @@ bool wb_over_bg_thresh(struct bdi_writeback *wb)
gdtc->avail = global_dirtyable_memory();
gdtc->dirty = global_page_state(NR_FILE_DIRTY) +
global_page_state(NR_UNSTABLE_NFS);
+ WARN(gdtc->dirty > (1<<30), "%s() huge dirty: %ld\n", __func__, gdtc->dirty);
domain_dirty_limits(gdtc);
if (gdtc->dirty > gdtc->bg_thresh)
@@ -1896,6 +1899,7 @@ bool wb_over_bg_thresh(struct bdi_writeback *wb)
unsigned long writeback;
mem_cgroup_wb_stats(wb, &mdtc->avail, &mdtc->dirty, &writeback);
+ WARN(mdtc->dirty > (1<<30), "%s() huge dirty: %ld\n", __func__, mdtc->dirty);
mdtc_cap_avail(mdtc);
domain_dirty_limits(mdtc); /* ditto, ignore writeback */
@@ -2404,7 +2408,9 @@ void account_page_dirtied(struct page *page, struct address_space *mapping,
inode_attach_wb(inode, page);
wb = inode_to_wb(inode);
+ WARN_ONCE(__this_cpu_read(memcg->stat->count[MEM_CGROUP_STAT_DIRTY]) > (1UL<<30), "MEM_CGROUP_STAT_DIRTY bogus");
mem_cgroup_inc_page_stat(memcg, MEM_CGROUP_STAT_DIRTY);
+ WARN_ONCE(__this_cpu_read(memcg->stat->count[MEM_CGROUP_STAT_DIRTY]) > (1UL<<30), "MEM_CGROUP_STAT_DIRTY bogus");
__inc_zone_page_state(page, NR_FILE_DIRTY);
__inc_zone_page_state(page, NR_DIRTIED);
__inc_wb_stat(wb, WB_RECLAIMABLE);
@@ -2426,6 +2432,7 @@ void account_page_cleaned(struct page *page, struct address_space *mapping,
{
if (mapping_cap_account_dirty(mapping)) {
mem_cgroup_dec_page_stat(memcg, MEM_CGROUP_STAT_DIRTY);
+ WARN_ONCE(__this_cpu_read(memcg->stat->count[MEM_CGROUP_STAT_DIRTY]) > (1UL<<30), "MEM_CGROUP_STAT_DIRTY bogus");
dec_zone_page_state(page, NR_FILE_DIRTY);
dec_wb_stat(wb, WB_RECLAIMABLE);
task_io_account_cancelled_write(PAGE_CACHE_SIZE);
@@ -2686,6 +2693,7 @@ int clear_page_dirty_for_io(struct page *page)
wb = unlocked_inode_to_wb_begin(inode, &locked);
if (TestClearPageDirty(page)) {
mem_cgroup_dec_page_stat(memcg, MEM_CGROUP_STAT_DIRTY);
+ WARN_ONCE(__this_cpu_read(memcg->stat->count[MEM_CGROUP_STAT_DIRTY]) > (1UL<<30), "MEM_CGROUP_STAT_DIRTY bogus");
dec_zone_page_state(page, NR_FILE_DIRTY);
dec_wb_stat(wb, WB_RECLAIMABLE);
ret = 1;
* Re: 4.3-rc1 dirty page count underflow (cgroup-related?)
2015-09-18 4:29 4.3-rc1 dirty page count underflow (cgroup-related?) Dave Hansen
@ 2015-09-18 6:09 ` Greg Thelen
2015-09-18 8:31 ` Greg Thelen
2015-09-18 14:50 ` Dave Hansen
0 siblings, 2 replies; 6+ messages in thread
From: Greg Thelen @ 2015-09-18 6:09 UTC (permalink / raw)
To: Dave Hansen
Cc: Johannes Weiner, Michal Hocko, Tejun Heo, Jens Axboe,
Andrew Morton, Jan Kara,
open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG),
open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG),
open list
Dave Hansen wrote:
> I've been seeing some strange behavior with 4.3-rc1 kernels on my Ubuntu
> 14.04.3 system. The system will run fine for a few hours, but suddenly
> start becoming horribly I/O bound. A compile of perf for instance takes
> 20-30 minutes and the compile seems entirely I/O bound. But, the SSD is
> only seeing tens or hundreds of KB/s of writes.
>
> Looking at some writeback tracepoints shows it hitting
> balance_dirty_pages() pretty hard with a pretty large number of dirty
> pages. :)
>
>> ld-27008 [000] ...1 88895.190770: balance_dirty_pages: bdi
>> 8:0: limit=234545 setpoint=204851 dirty=18446744073709513951
>> bdi_setpoint=184364 bdi_dirty=33 dirty_ratelimit=24 task_ratelimit=0
>> dirtied=1 dirtied_pause=0 paused=0 pause=136 period=136 think=0
>> cgroup=/user/1000.user/c2.session
>
> So something is underflowing dirty.
>
> I added the attached patch and got a warning pretty quickly, so this
> looks pretty reproducible for me.
>
> I'm not 100% sure this is from the 4.3 merge window. I was running the
> 4.2-rcs, but they seemed to have their own issues. Ubuntu seems to be
> automatically creating some cgroups, so they're definitely in play here.
>
> Any ideas what is going on?
>
>> [ 12.415472] ------------[ cut here ]------------
>> [ 12.415481] WARNING: CPU: 1 PID: 1684 at mm/page-writeback.c:2435 account_page_cleaned+0x101/0x110()
>> [ 12.415483] MEM_CGROUP_STAT_DIRTY bogus
>> [ 12.415484] Modules linked in: ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_CHECKSUM iptable_mangle xt_tcpudp bridge stp llc iptable_filter ip_tables ebtable_nat ebtables x_tables dm_crypt cmac rfcomm bnep arc4 iwldvm mac80211 btusb snd_hda_codec_hdmi iwlwifi btrtl btbcm btintel snd_hda_codec_realtek snd_hda_codec_generic bluetooth snd_hda_intel snd_hda_codec cfg80211 snd_hwdep snd_hda_core intel_rapl iosf_mbi hid_logitech_hidpp x86_pkg_temp_thermal snd_pcm coretemp ghash_clmulni_intel thinkpad_acpi joydev snd_seq_midi snd_seq_midi_event snd_rawmidi nvram snd_seq snd_timer snd_seq_device wmi snd soundcore mac_hid lpc_ich aesni_intel aes_x86_64 glue_helper lrw gf128mul ablk_helper cryptd kvm_intel kvm hid_logitech_dj sdhci_pci sdhci usbhid hid
>> [ 12.415538] CPU: 1 PID: 1684 Comm: indicator-keybo Not tainted 4.3.0-rc1-dirty #25
>> [ 12.415540] Hardware name: LENOVO 2325AR2/2325AR2, BIOS G2ETA4WW (2.64 ) 04/09/2015
>> [ 12.415542] ffffffff81aa8172 ffff8800c926ba30 ffffffff8132e3e2 ffff8800c926ba78
>> [ 12.415544] ffff8800c926ba68 ffffffff8105e386 ffffea000fca4700 ffff880409b96420
>> [ 12.415547] ffff8803ef979490 ffff880403ff4800 0000000000000000 ffff8800c926bac8
>> [ 12.415550] Call Trace:
>> [ 12.415555] [<ffffffff8132e3e2>] dump_stack+0x4b/0x69
>> [ 12.415560] [<ffffffff8105e386>] warn_slowpath_common+0x86/0xc0
>> [ 12.415563] [<ffffffff8105e40c>] warn_slowpath_fmt+0x4c/0x50
>> [ 12.415566] [<ffffffff8115f951>] account_page_cleaned+0x101/0x110
>> [ 12.415568] [<ffffffff8115fa1d>] cancel_dirty_page+0xbd/0xf0
>> [ 12.415571] [<ffffffff811fc044>] try_to_free_buffers+0x94/0xb0
>> [ 12.415575] [<ffffffff81296740>] jbd2_journal_try_to_free_buffers+0x100/0x130
>> [ 12.415578] [<ffffffff812498b2>] ext4_releasepage+0x52/0xa0
>> [ 12.415582] [<ffffffff81153765>] try_to_release_page+0x35/0x50
>> [ 12.415585] [<ffffffff811fcc13>] block_invalidatepage+0x113/0x130
>> [ 12.415587] [<ffffffff81249d5e>] ext4_invalidatepage+0x5e/0xb0
>> [ 12.415590] [<ffffffff8124a6f0>] ext4_da_invalidatepage+0x40/0x310
>> [ 12.415593] [<ffffffff81162b03>] truncate_inode_page+0x83/0x90
>> [ 12.415595] [<ffffffff81162ce9>] truncate_inode_pages_range+0x199/0x730
>> [ 12.415598] [<ffffffff8109c494>] ? __wake_up+0x44/0x50
>> [ 12.415600] [<ffffffff812962ca>] ? jbd2_journal_stop+0x1ba/0x3b0
>> [ 12.415603] [<ffffffff8125a754>] ? ext4_unlink+0x2f4/0x330
>> [ 12.415607] [<ffffffff811eed3d>] ? __inode_wait_for_writeback+0x6d/0xc0
>> [ 12.415609] [<ffffffff811632ec>] truncate_inode_pages_final+0x4c/0x60
>> [ 12.415612] [<ffffffff81250736>] ext4_evict_inode+0x116/0x4c0
>> [ 12.415615] [<ffffffff811e139c>] evict+0xbc/0x190
>> [ 12.415617] [<ffffffff811e1d2d>] iput+0x17d/0x1e0
>> [ 12.415620] [<ffffffff811d623b>] do_unlinkat+0x1ab/0x2b0
>> [ 12.415622] [<ffffffff811d6cb6>] SyS_unlink+0x16/0x20
>> [ 12.415626] [<ffffffff817d7f97>] entry_SYSCALL_64_fastpath+0x12/0x6a
>> [ 12.415628] ---[ end trace 6fba1ddd3d240e13 ]---
>> [ 12.418211] ------------[ cut here ]------------
I'm not denying the issue, but the WARNING splat isn't necessarily
catching a problem. The corresponding code comes from your debug patch:
+ WARN_ONCE(__this_cpu_read(memcg->stat->count[MEM_CGROUP_STAT_DIRTY]) > (1UL<<30), "MEM_CGROUP_STAT_DIRTY bogus");
This only checks a single cpu's counter, which can be negative. The sum
of all counters is what matters.
Imagine:
cpu1) dirty page: inc
cpu2) clean page: dec
The sum is properly zero, but cpu2 is -1, which will trigger the WARN.
I'll look at the code and also see if I can reproduce the failure using
mem_cgroup_read_stat() for all of the new WARNs.
Did you notice if the global /proc/meminfo:Dirty count also underflowed?
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
* Re: 4.3-rc1 dirty page count underflow (cgroup-related?)
2015-09-18 6:09 ` Greg Thelen
@ 2015-09-18 8:31 ` Greg Thelen
2015-09-18 14:50 ` Dave Hansen
1 sibling, 0 replies; 6+ messages in thread
From: Greg Thelen @ 2015-09-18 8:31 UTC (permalink / raw)
To: Dave Hansen
Cc: Johannes Weiner, Michal Hocko, Tejun Heo, Jens Axboe,
Andrew Morton, Jan Kara,
open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG),
open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG),
open list
Greg Thelen wrote:
> Dave Hansen wrote:
>
>> I've been seeing some strange behavior with 4.3-rc1 kernels on my Ubuntu
>> 14.04.3 system. The system will run fine for a few hours, but suddenly
>> start becoming horribly I/O bound. A compile of perf for instance takes
>> 20-30 minutes and the compile seems entirely I/O bound. But, the SSD is
>> only seeing tens or hundreds of KB/s of writes.
>>
>> Looking at some writeback tracepoints shows it hitting
>> balance_dirty_pages() pretty hard with a pretty large number of dirty
>> pages. :)
>>
>>> ld-27008 [000] ...1 88895.190770: balance_dirty_pages: bdi
>>> 8:0: limit=234545 setpoint=204851 dirty=18446744073709513951
>>> bdi_setpoint=184364 bdi_dirty=33 dirty_ratelimit=24 task_ratelimit=0
>>> dirtied=1 dirtied_pause=0 paused=0 pause=136 period=136 think=0
>>> cgroup=/user/1000.user/c2.session
>>
>> So something is underflowing dirty.
>>
>> I added the attached patch and got a warning pretty quickly, so this
>> looks pretty reproducible for me.
>>
>> I'm not 100% sure this is from the 4.3 merge window. I was running the
>> 4.2-rcs, but they seemed to have their own issues. Ubuntu seems to be
>> automatically creating some cgroups, so they're definitely in play here.
>>
>> Any ideas what is going on?
>>
>>> [ 12.415472] ------------[ cut here ]------------
>>> [ 12.415481] WARNING: CPU: 1 PID: 1684 at mm/page-writeback.c:2435 account_page_cleaned+0x101/0x110()
>>> [ 12.415483] MEM_CGROUP_STAT_DIRTY bogus
>>> [ 12.415484] Modules linked in: ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_CHECKSUM iptable_mangle xt_tcpudp bridge stp llc iptable_filter ip_tables ebtable_nat ebtables x_tables dm_crypt cmac rfcomm bnep arc4 iwldvm mac80211 btusb snd_hda_codec_hdmi iwlwifi btrtl btbcm btintel snd_hda_codec_realtek snd_hda_codec_generic bluetooth snd_hda_intel snd_hda_codec cfg80211 snd_hwdep snd_hda_core intel_rapl iosf_mbi hid_logitech_hidpp x86_pkg_temp_thermal snd_pcm coretemp ghash_clmulni_intel thinkpad_acpi joydev snd_seq_midi snd_seq_midi_event snd_rawmidi nvram snd_seq snd_timer snd_seq_device wmi snd soundcore mac_hid lpc_ich aesni_intel aes_x86_64 glue_helper lrw gf128mul ablk_helper cryptd kvm_intel kvm hid_logitech_dj sdhci_pci sdhci usbhid hid
>>> [ 12.415538] CPU: 1 PID: 1684 Comm: indicator-keybo Not tainted 4.3.0-rc1-dirty #25
>>> [ 12.415540] Hardware name: LENOVO 2325AR2/2325AR2, BIOS G2ETA4WW (2.64 ) 04/09/2015
>>> [ 12.415542] ffffffff81aa8172 ffff8800c926ba30 ffffffff8132e3e2 ffff8800c926ba78
>>> [ 12.415544] ffff8800c926ba68 ffffffff8105e386 ffffea000fca4700 ffff880409b96420
>>> [ 12.415547] ffff8803ef979490 ffff880403ff4800 0000000000000000 ffff8800c926bac8
>>> [ 12.415550] Call Trace:
>>> [ 12.415555] [<ffffffff8132e3e2>] dump_stack+0x4b/0x69
>>> [ 12.415560] [<ffffffff8105e386>] warn_slowpath_common+0x86/0xc0
>>> [ 12.415563] [<ffffffff8105e40c>] warn_slowpath_fmt+0x4c/0x50
>>> [ 12.415566] [<ffffffff8115f951>] account_page_cleaned+0x101/0x110
>>> [ 12.415568] [<ffffffff8115fa1d>] cancel_dirty_page+0xbd/0xf0
>>> [ 12.415571] [<ffffffff811fc044>] try_to_free_buffers+0x94/0xb0
>>> [ 12.415575] [<ffffffff81296740>] jbd2_journal_try_to_free_buffers+0x100/0x130
>>> [ 12.415578] [<ffffffff812498b2>] ext4_releasepage+0x52/0xa0
>>> [ 12.415582] [<ffffffff81153765>] try_to_release_page+0x35/0x50
>>> [ 12.415585] [<ffffffff811fcc13>] block_invalidatepage+0x113/0x130
>>> [ 12.415587] [<ffffffff81249d5e>] ext4_invalidatepage+0x5e/0xb0
>>> [ 12.415590] [<ffffffff8124a6f0>] ext4_da_invalidatepage+0x40/0x310
>>> [ 12.415593] [<ffffffff81162b03>] truncate_inode_page+0x83/0x90
>>> [ 12.415595] [<ffffffff81162ce9>] truncate_inode_pages_range+0x199/0x730
>>> [ 12.415598] [<ffffffff8109c494>] ? __wake_up+0x44/0x50
>>> [ 12.415600] [<ffffffff812962ca>] ? jbd2_journal_stop+0x1ba/0x3b0
>>> [ 12.415603] [<ffffffff8125a754>] ? ext4_unlink+0x2f4/0x330
>>> [ 12.415607] [<ffffffff811eed3d>] ? __inode_wait_for_writeback+0x6d/0xc0
>>> [ 12.415609] [<ffffffff811632ec>] truncate_inode_pages_final+0x4c/0x60
>>> [ 12.415612] [<ffffffff81250736>] ext4_evict_inode+0x116/0x4c0
>>> [ 12.415615] [<ffffffff811e139c>] evict+0xbc/0x190
>>> [ 12.415617] [<ffffffff811e1d2d>] iput+0x17d/0x1e0
>>> [ 12.415620] [<ffffffff811d623b>] do_unlinkat+0x1ab/0x2b0
>>> [ 12.415622] [<ffffffff811d6cb6>] SyS_unlink+0x16/0x20
>>> [ 12.415626] [<ffffffff817d7f97>] entry_SYSCALL_64_fastpath+0x12/0x6a
>>> [ 12.415628] ---[ end trace 6fba1ddd3d240e13 ]---
>>> [ 12.418211] ------------[ cut here ]------------
>
> I'm not denying the issue, but the WARNING splat isn't necessarily
> catching a problem. The corresponding code comes from your debug patch:
> + WARN_ONCE(__this_cpu_read(memcg->stat->count[MEM_CGROUP_STAT_DIRTY]) > (1UL<<30), "MEM_CGROUP_STAT_DIRTY bogus");
>
> This only checks a single cpu's counter, which can be negative. The sum
> of all counters is what matters.
> Imagine:
> cpu1) dirty page: inc
> cpu2) clean page: dec
> The sum is properly zero, but cpu2 is -1, which will trigger the WARN.
>
> I'll look at the code and also see if I can reproduce the failure using
> mem_cgroup_read_stat() for all of the new WARNs.
>
> Did you notice if the global /proc/meminfo:Dirty count also underflowed?
Looks to be miscommunication between memcg dirty page accounting and
page migration.
* Re: 4.3-rc1 dirty page count underflow (cgroup-related?)
2015-09-18 6:09 ` Greg Thelen
2015-09-18 8:31 ` Greg Thelen
@ 2015-09-18 14:50 ` Dave Hansen
2015-09-21 8:06 ` Greg Thelen
1 sibling, 1 reply; 6+ messages in thread
From: Dave Hansen @ 2015-09-18 14:50 UTC (permalink / raw)
To: Greg Thelen
Cc: Johannes Weiner, Michal Hocko, Tejun Heo, Jens Axboe,
Andrew Morton, Jan Kara,
open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG),
open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG),
open list
On 09/17/2015 11:09 PM, Greg Thelen wrote:
> I'm not denying the issue, but the WARNING splat isn't necessarily
> catching a problem. The corresponding code comes from your debug patch:
> + WARN_ONCE(__this_cpu_read(memcg->stat->count[MEM_CGROUP_STAT_DIRTY]) > (1UL<<30), "MEM_CGROUP_STAT_DIRTY bogus");
>
> This only checks a single cpu's counter, which can be negative. The sum
> of all counters is what matters.
> Imagine:
> cpu1) dirty page: inc
> cpu2) clean page: dec
> The sum is properly zero, but cpu2 is -1, which will trigger the WARN.
>
> I'll look at the code and also see if I can reproduce the failure using
> mem_cgroup_read_stat() for all of the new WARNs.
D'oh. I'll replace those with the proper mem_cgroup_read_stat() and
test with your patch to see if anything still triggers.
> Did you notice if the global /proc/meminfo:Dirty count also underflowed?
It did not underflow. It was one of the first things I looked at and it
looked fine, went down near 0 at 'sync', etc...
* Re: 4.3-rc1 dirty page count underflow (cgroup-related?)
2015-09-18 14:50 ` Dave Hansen
@ 2015-09-21 8:06 ` Greg Thelen
2015-09-22 9:03 ` Michal Hocko
0 siblings, 1 reply; 6+ messages in thread
From: Greg Thelen @ 2015-09-21 8:06 UTC (permalink / raw)
To: Dave Hansen
Cc: Johannes Weiner, Michal Hocko, Tejun Heo, Jens Axboe,
Andrew Morton, Jan Kara,
open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG),
open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG),
open list
Dave Hansen wrote:
> On 09/17/2015 11:09 PM, Greg Thelen wrote:
>> I'm not denying the issue, but the WARNING splat isn't necessarily
>> catching a problem. The corresponding code comes from your debug patch:
>> + WARN_ONCE(__this_cpu_read(memcg->stat->count[MEM_CGROUP_STAT_DIRTY]) > (1UL<<30), "MEM_CGROUP_STAT_DIRTY bogus");
>>
>> This only checks a single cpu's counter, which can be negative. The sum
>> of all counters is what matters.
>> Imagine:
>> cpu1) dirty page: inc
>> cpu2) clean page: dec
>> The sum is properly zero, but cpu2 is -1, which will trigger the WARN.
>>
>> I'll look at the code and also see if I can reproduce the failure using
>> mem_cgroup_read_stat() for all of the new WARNs.
>
> D'oh. I'll replace those with the proper mem_cgroup_read_stat() and
> test with your patch to see if anything still triggers.
Thanks Dave.
Here's what I think we should use to fix the issue. I tagged this for
v4.2 stable given the way that unpatched performance falls apart without
warning or workaround (besides deleting and recreating affected memcg).
Feedback welcome.
* Re: 4.3-rc1 dirty page count underflow (cgroup-related?)
2015-09-21 8:06 ` Greg Thelen
@ 2015-09-22 9:03 ` Michal Hocko
0 siblings, 0 replies; 6+ messages in thread
From: Michal Hocko @ 2015-09-22 9:03 UTC (permalink / raw)
To: Greg Thelen
Cc: Dave Hansen, Johannes Weiner, Tejun Heo, Jens Axboe,
Andrew Morton, Jan Kara,
open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG),
open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG),
open list
[I am sorry I didn't get to this earlier because I was at an internal
conference last week]
On Mon 21-09-15 01:06:58, Greg Thelen wrote:
[...]
> From f5c39c2e8471c10fe0464ca7b6e6f743ce6920a6 Mon Sep 17 00:00:00 2001
> From: Greg Thelen <gthelen@google.com>
> Date: Sat, 19 Sep 2015 16:21:18 -0700
> Subject: [PATCH] memcg: fix dirty page migration
>
> The problem starts with a file backed dirty page which is charged to a
> memcg. Then page migration is used to move oldpage to newpage.
> Migration:
> - copies the oldpage's data to newpage
> - clears oldpage.PG_dirty
> - sets newpage.PG_dirty
> - uncharges oldpage from memcg
> - charges newpage to memcg
>
> Clearing oldpage.PG_dirty decrements the charged memcg's dirty page
> count. However, because newpage is not yet charged, setting
> newpage.PG_dirty does not increment the memcg's dirty page count. After
> migration completes newpage.PG_dirty is eventually cleared, often in
> account_page_cleaned(). At this time newpage is charged to a memcg so
> the memcg's dirty page count is decremented which causes underflow
> because the count was not previously incremented by migration. This
> underflow causes balance_dirty_pages() to see a very large unsigned
> number of dirty memcg pages which leads to aggressive throttling of
> buffered writes by processes in non root memcg.
Very well spotted!
> This issue:
> - can harm performance of non root memcg buffered writes.
> - can report too small (even negative) values in
> memory.stat[(total_)dirty] counters of all memcg, including the root.
>
> To avoid polluting migrate.c with #ifdef CONFIG_MEMCG checks, introduce
> page_memcg() and set_page_memcg() helpers.
>
> Test:
> 0) setup and enter limited memcg
> mkdir /sys/fs/cgroup/test
> echo 1G > /sys/fs/cgroup/test/memory.limit_in_bytes
> echo $$ > /sys/fs/cgroup/test/cgroup.procs
>
> 1) buffered writes baseline
> dd if=/dev/zero of=/data/tmp/foo bs=1M count=1k
> sync
> grep ^dirty /sys/fs/cgroup/test/memory.stat
>
> 2) buffered writes with compaction antagonist to induce migration
> yes 1 > /proc/sys/vm/compact_memory &
> rm -rf /data/tmp/foo
> dd if=/dev/zero of=/data/tmp/foo bs=1M count=1k
> kill %
> sync
> grep ^dirty /sys/fs/cgroup/test/memory.stat
>
> 3) buffered writes without antagonist, should match baseline
> rm -rf /data/tmp/foo
> dd if=/dev/zero of=/data/tmp/foo bs=1M count=1k
> sync
> grep ^dirty /sys/fs/cgroup/test/memory.stat
>
> (speed, dirty residue)
> unpatched patched
> 1) 841 MB/s 0 dirty pages 886 MB/s 0 dirty pages
> 2) 611 MB/s -33427456 dirty pages 793 MB/s 0 dirty pages
> 3) 114 MB/s -33427456 dirty pages 891 MB/s 0 dirty pages
>
> Notice that unpatched baseline performance (1) fell after
> migration (3): 841 -> 114 MB/s. In the patched kernel, post
> migration performance matches baseline.
>
> Fixes: c4843a7593a9 ("memcg: add per cgroup dirty page accounting")
> Cc: <stable@vger.kernel.org> # 4.2+
> Reported-by: Dave Hansen <dave.hansen@intel.com>
> Signed-off-by: Greg Thelen <gthelen@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Thanks!
> ---
> include/linux/mm.h | 21 +++++++++++++++++++++
> mm/migrate.c | 12 +++++++++++-
> 2 files changed, 32 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 91c08f6f0dc9..80001de019ba 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -905,6 +905,27 @@ static inline void set_page_links(struct page *page, enum zone_type zone,
> #endif
> }
>
> +#ifdef CONFIG_MEMCG
> +static inline struct mem_cgroup *page_memcg(struct page *page)
> +{
> + return page->mem_cgroup;
> +}
> +
> +static inline void set_page_memcg(struct page *page, struct mem_cgroup *memcg)
> +{
> + page->mem_cgroup = memcg;
> +}
> +#else
> +static inline struct mem_cgroup *page_memcg(struct page *page)
> +{
> + return NULL;
> +}
> +
> +static inline void set_page_memcg(struct page *page, struct mem_cgroup *memcg)
> +{
> +}
> +#endif
> +
> /*
> * Some inline functions in vmstat.h depend on page_zone()
> */
> diff --git a/mm/migrate.c b/mm/migrate.c
> index c3cb566af3e2..6116b8f64d27 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -740,6 +740,15 @@ static int move_to_new_page(struct page *newpage, struct page *page,
> if (PageSwapBacked(page))
> SetPageSwapBacked(newpage);
>
> + /*
> + * Indirectly called below, migrate_page_copy() copies PG_dirty and thus
> + * needs newpage's memcg set to transfer memcg dirty page accounting.
> + * So perform memcg migration in two steps:
> + * 1. set newpage->mem_cgroup (here)
> + * 2. clear page->mem_cgroup (below)
> + */
> + set_page_memcg(newpage, page_memcg(page));
> +
> mapping = page_mapping(page);
> if (!mapping)
> rc = migrate_page(mapping, newpage, page, mode);
> @@ -756,9 +765,10 @@ static int move_to_new_page(struct page *newpage, struct page *page,
> rc = fallback_migrate_page(mapping, newpage, page, mode);
>
> if (rc != MIGRATEPAGE_SUCCESS) {
> + set_page_memcg(newpage, NULL);
> newpage->mapping = NULL;
> } else {
> - mem_cgroup_migrate(page, newpage, false);
> + set_page_memcg(page, NULL);
> if (page_was_mapped)
> remove_migration_ptes(page, newpage);
> page->mapping = NULL;
> --
> 2.6.0.rc0.131.gf624c3d
--
Michal Hocko
SUSE Labs