* cgroup v1 and balance_dirty_pages
@ 2022-11-17 6:54 Aneesh Kumar K.V
From: Aneesh Kumar K.V @ 2022-11-17 6:54 UTC (permalink / raw)
To: Tejun Heo, Zefan Li, Johannes Weiner; +Cc: cgroups-u79uwXL29TY76Z2rM5mHXA
Hi,
Currently, we don't pause in balance_dirty_pages() with cgroup v1 when a
task dirties too many pages relative to the memory limit in its memcg.
This is because with cgroup v1 all the limits are checked against globally
available resources. So on a system with a large amount of memory, a
cgroup with a smaller limit can easily hit OOM if a task within the
cgroup continuously dirties pages.
Shouldn't we throttle the task based on the memcg limits in this case?
Commit 9badce000e2c ("cgroup, writeback: don't enable cgroup writeback
on traditional hierarchies") indicates we run into issues with enabling
cgroup writeback on v1. But could we keep the global writeback domain
and still check the throttling needs against the memcg limits in
balance_dirty_pages()?
-aneesh
* Re: cgroup v1 and balance_dirty_pages
From: Johannes Weiner @ 2022-11-17 15:12 UTC (permalink / raw)
To: Aneesh Kumar K.V; +Cc: Tejun Heo, Zefan Li, cgroups-u79uwXL29TY76Z2rM5mHXA

Hi Aneesh,

On Thu, Nov 17, 2022 at 12:24:13PM +0530, Aneesh Kumar K.V wrote:
> Currently, we don't pause in balance_dirty_pages with cgroup v1 when we
> have task dirtying too many pages w.r.t to memory limit in the memcg.
> This is because with cgroup v1 all the limits are checked against global
> available resources. So on a system with a large amount of memory, a
> cgroup with a smaller limit can easily hit OOM if the task within the
> cgroup continuously dirty pages.

Page reclaim has special writeback throttling for cgroup1, see the
folio_wait_writeback() in shrink_folio_list(). It's not as smooth as
proper dirty throttling, but it should prevent OOMs.

Is this not working anymore?

> Shouldn't we throttle the task based on the memcg limits in this case?
> commit 9badce000e2c ("cgroup, writeback: don't enable cgroup writeback
> on traditional hierarchies") indicates we run into issues with enabling
> cgroup writeback with v1. But we still can keep the global writeback
> domain, but check the throttling needs against memcg limits in
> balance_dirty_pages()?

Deciding when to throttle is only one side of the coin, though.

The other side is selective flushing in the IO context of whoever
generated the dirty data, and matching the rate of dirtying to the
rate of writeback. This isn't really possible in cgroup1, as the
domains for memory and IO control could be disjunct.

For example, if a fast-IO cgroup shares memory with a slow-IO cgroup,
what's the IO context for flushing the shared dirty data? What's the
throttling rate you apply to dirtiers?
* Re: cgroup v1 and balance_dirty_pages
From: Aneesh Kumar K V @ 2022-11-17 15:42 UTC (permalink / raw)
To: Johannes Weiner; +Cc: Tejun Heo, Zefan Li, cgroups-u79uwXL29TY76Z2rM5mHXA

On 11/17/22 8:42 PM, Johannes Weiner wrote:
> Hi Aneesh,
>
> On Thu, Nov 17, 2022 at 12:24:13PM +0530, Aneesh Kumar K.V wrote:
>> Currently, we don't pause in balance_dirty_pages with cgroup v1 when we
>> have task dirtying too many pages w.r.t to memory limit in the memcg.
>> This is because with cgroup v1 all the limits are checked against global
>> available resources. So on a system with a large amount of memory, a
>> cgroup with a smaller limit can easily hit OOM if the task within the
>> cgroup continuously dirty pages.
>
> Page reclaim has special writeback throttling for cgroup1, see the
> folio_wait_writeback() in shrink_folio_list(). It's not as smooth as
> proper dirty throttling, but it should prevent OOMs.
>
> Is this not working anymore?

The test is a simple dd test on a 256GB system.

root@lp2:/sys/fs/cgroup/memory# mkdir test
root@lp2:/sys/fs/cgroup/memory# cd test/
root@lp2:/sys/fs/cgroup/memory/test# echo 120M > memory.limit_in_bytes
root@lp2:/sys/fs/cgroup/memory/test# echo $$ > tasks
root@lp2:/sys/fs/cgroup/memory/test# dd if=/dev/zero of=/home/kvaneesh/test bs=1M
Killed

Will it hit folio_wait_writeback() at all? Because this is sequential
I/O, none of the folios we are writing will be under writeback.

>> Shouldn't we throttle the task based on the memcg limits in this case?
>> commit 9badce000e2c ("cgroup, writeback: don't enable cgroup writeback
>> on traditional hierarchies") indicates we run into issues with enabling
>> cgroup writeback with v1. But we still can keep the global writeback
>> domain, but check the throttling needs against memcg limits in
>> balance_dirty_pages()?
>
> Deciding when to throttle is only one side of the coin, though.
>
> The other side is selective flushing in the IO context of whoever
> generated the dirty data, and matching the rate of dirtying to the
> rate of writeback. This isn't really possible in cgroup1, as the
> domains for memory and IO control could be disjunct.
>
> For example, if a fast-IO cgroup shares memory with a slow-IO cgroup,
> what's the IO context for flushing the shared dirty data? What's the
> throttling rate you apply to dirtiers?

I am not using the I/O controller at all. Only the cpu and memory
controllers are used, and what I am observing is that depending on the
system memory size, a container with the same memory limit will hit OOM
on some machines and not on others.

One of the challenges with the above test is that we are not able to
reclaim via shrink_folio_list() because these are dirty file LRU pages
and we take the below code path:

	if (folio_is_file_lru(folio) &&
	    (!current_is_kswapd() ||
	     !folio_test_reclaim(folio) ||
	     !test_bit(PGDAT_DIRTY, &pgdat->flags))) {
		......
		goto activate_locked;
	}

-aneesh
* Re: cgroup v1 and balance_dirty_pages
From: Aneesh Kumar K V @ 2022-11-17 15:51 UTC (permalink / raw)
To: Johannes Weiner; +Cc: Tejun Heo, Zefan Li, cgroups-u79uwXL29TY76Z2rM5mHXA

On 11/17/22 9:12 PM, Aneesh Kumar K V wrote:
> On 11/17/22 8:42 PM, Johannes Weiner wrote:
>> Hi Aneesh,
>>
>> On Thu, Nov 17, 2022 at 12:24:13PM +0530, Aneesh Kumar K.V wrote:
>>> [...]
>>
>> Page reclaim has special writeback throttling for cgroup1, see the
>> folio_wait_writeback() in shrink_folio_list(). It's not as smooth as
>> proper dirty throttling, but it should prevent OOMs.
>>
>> Is this not working anymore?
>
> The test is a simple dd test on a 256GB system.
>
> root@lp2:/sys/fs/cgroup/memory# mkdir test
> root@lp2:/sys/fs/cgroup/memory# cd test/
> root@lp2:/sys/fs/cgroup/memory/test# echo 120M > memory.limit_in_bytes
> root@lp2:/sys/fs/cgroup/memory/test# echo $$ > tasks
> root@lp2:/sys/fs/cgroup/memory/test# dd if=/dev/zero of=/home/kvaneesh/test bs=1M
> Killed
>
> Will it hit folio_wait_writeback() at all? Because this is sequential
> I/O, none of the folios we are writing will be under writeback.

Another way to look at this: if writeback is never started via
balance_dirty_pages(), will we ever find folios in shrink_folio_list()
that are under writeback?

[...]

-aneesh
* Re: cgroup v1 and balance_dirty_pages
From: Johannes Weiner @ 2022-11-17 16:31 UTC (permalink / raw)
To: Aneesh Kumar K V; +Cc: Tejun Heo, Zefan Li, cgroups-u79uwXL29TY76Z2rM5mHXA

On Thu, Nov 17, 2022 at 09:21:10PM +0530, Aneesh Kumar K V wrote:
> On 11/17/22 9:12 PM, Aneesh Kumar K V wrote:
> > On 11/17/22 8:42 PM, Johannes Weiner wrote:
> >> [...]
> >> Is this not working anymore?
> >
> > The test is a simple dd test on a 256GB system.
> >
> > root@lp2:/sys/fs/cgroup/memory# mkdir test
> > root@lp2:/sys/fs/cgroup/memory# cd test/
> > root@lp2:/sys/fs/cgroup/memory/test# echo 120M > memory.limit_in_bytes
> > root@lp2:/sys/fs/cgroup/memory/test# echo $$ > tasks
> > root@lp2:/sys/fs/cgroup/memory/test# dd if=/dev/zero of=/home/kvaneesh/test bs=1M
> > Killed
> >
> > Will it hit folio_wait_writeback() at all? Because this is sequential
> > I/O, none of the folios we are writing will be under writeback.
>
> Another way to look at this: if writeback is never started via
> balance_dirty_pages(), will we ever find folios in shrink_folio_list()
> that are under writeback?

The flushers are started from reclaim if necessary. See this code from
shrink_inactive_list():

	/*
	 * If dirty folios are scanned that are not queued for IO, it
	 * implies that flushers are not doing their job. This can
	 * happen when memory pressure pushes dirty folios to the end of
	 * the LRU before the dirty limits are breached and the dirty
	 * data has expired. It can also happen when the proportion of
	 * dirty folios grows not through writes but through memory
	 * pressure reclaiming all the clean cache. And in some cases,
	 * the flushers simply cannot keep up with the allocation
	 * rate. Nudge the flusher threads in case they are asleep.
	 */
	if (stat.nr_unqueued_dirty == nr_taken)
		wakeup_flusher_threads(WB_REASON_VMSCAN);

It sounds like there isn't enough time for writeback to commence
before the memcg already declares OOM.

If you place a reclaim_throttle(VMSCAN_THROTTLE_WRITEBACK) after that
wakeup, does that fix the issue?
* Re: cgroup v1 and balance_dirty_pages
From: Aneesh Kumar K V @ 2022-11-17 17:16 UTC (permalink / raw)
To: Johannes Weiner; +Cc: Tejun Heo, Zefan Li, cgroups-u79uwXL29TY76Z2rM5mHXA

On 11/17/22 10:01 PM, Johannes Weiner wrote:
> On Thu, Nov 17, 2022 at 09:21:10PM +0530, Aneesh Kumar K V wrote:
>> [...]
>> Another way to look at this: if writeback is never started via
>> balance_dirty_pages(), will we ever find folios in shrink_folio_list()
>> that are under writeback?
>
> The flushers are started from reclaim if necessary. See this code from
> shrink_inactive_list():
>
> 	/*
> 	 * If dirty folios are scanned that are not queued for IO, it
> 	 * implies that flushers are not doing their job. This can
> 	 * happen when memory pressure pushes dirty folios to the end of
> 	 * the LRU before the dirty limits are breached and the dirty
> 	 * data has expired. It can also happen when the proportion of
> 	 * dirty folios grows not through writes but through memory
> 	 * pressure reclaiming all the clean cache. And in some cases,
> 	 * the flushers simply cannot keep up with the allocation
> 	 * rate. Nudge the flusher threads in case they are asleep.
> 	 */
> 	if (stat.nr_unqueued_dirty == nr_taken)
> 		wakeup_flusher_threads(WB_REASON_VMSCAN);
>
> It sounds like there isn't enough time for writeback to commence
> before the memcg already declares OOM.
>
> If you place a reclaim_throttle(VMSCAN_THROTTLE_WRITEBACK) after that
> wakeup, does that fix the issue?

Yes, that helped. One thing I noticed is that with that reclaim_throttle()
we don't end up calling folio_wait_writeback() at all, but dd was still
able to continue until the file system got full.

Without that reclaim_throttle(), we do end up calling folio_wait_writeback()
but at some point hit OOM:

[   78.274704] vmscan: memcg throttling
[   78.422914] dd invoked oom-killer: gfp_mask=0x101c4a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE), order=0, oom_score_adj=0
[   78.422927] CPU: 33 PID: 1185 Comm: dd Not tainted 6.0.0-dirty #394
[   78.422933] Call Trace:
[   78.422935] [c00000001d0ab1d0] [c000000000cbcba4] dump_stack_lvl+0x98/0xe0 (unreliable)
[   78.422947] [c00000001d0ab210] [c0000000004ef618] dump_header+0x68/0x470
[   78.422955] [c00000001d0ab2a0] [c0000000004ed6e0] oom_kill_process+0x410/0x440
[   78.422961] [c00000001d0ab2e0] [c0000000004eedf0] out_of_memory+0x230/0x950
[   78.422968] [c00000001d0ab380] [c00000000063e748] mem_cgroup_out_of_memory+0x148/0x190
[   78.422975] [c00000001d0ab410] [c00000000064b54c] try_charge_memcg+0x95c/0x9d0
[   78.422982] [c00000001d0ab570] [c00000000064c83c] charge_memcg+0x6c/0x180
[   78.422988] [c00000001d0ab5b0] [c00000000064f9b8] __mem_cgroup_charge+0x48/0xb0
[   78.422993] [c00000001d0ab5f0] [c0000000004dfedc] __filemap_add_folio+0x2cc/0x870
[   78.423000] [c00000001d0ab6b0] [c0000000004e04fc] filemap_add_folio+0x7c/0x130
[   78.423006] [c00000001d0ab710] [c0000000004e1d4c] __filemap_get_folio+0x2dc/0xb00
[   78.423012] [c00000001d0ab840] [c000000000771f64] iomap_write_begin+0x2a4/0xba0
[   78.423018] [c00000001d0ab9a0] [c000000000772a28] iomap_file_buffered_write+0x1c8/0x460
[   78.423024] [c00000001d0abb60] [c0000000009c1bf8] xfs_file_buffered_write+0x158/0x4f0
* Re: cgroup v1 and balance_dirty_pages
From: Johannes Weiner @ 2022-11-17 17:50 UTC (permalink / raw)
To: Aneesh Kumar K V; +Cc: Tejun Heo, Zefan Li, cgroups-u79uwXL29TY76Z2rM5mHXA

On Thu, Nov 17, 2022 at 10:46:53PM +0530, Aneesh Kumar K V wrote:
> On 11/17/22 10:01 PM, Johannes Weiner wrote:
> > [...]
> > It sounds like there isn't enough time for writeback to commence
> > before the memcg already declares OOM.
> >
> > If you place a reclaim_throttle(VMSCAN_THROTTLE_WRITEBACK) after that
> > wakeup, does that fix the issue?
>
> Yes, that helped. One thing I noticed is that with that reclaim_throttle()
> we don't end up calling folio_wait_writeback() at all, but dd was still
> able to continue until the file system got full.
>
> Without that reclaim_throttle(), we do end up calling folio_wait_writeback()
> but at some point hit OOM

Interesting. This is probably owed to the discrepancy between total
memory and the cgroup size. The flusher might put the occasional
cgroup page under writeback, but cgroup reclaim will still see mostly
dirty pages and not slow down enough.

Would you mind sending a patch for adding that reclaim_throttle()?
Gated on !writeback_throttling_sane(), with a short comment explaining
that the flushers may not issue writeback quickly enough for cgroup1
writeback throttling to work on larger systems with small cgroups.
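For reference, the change being requested would presumably sit right after the wakeup_flusher_threads() call quoted earlier in the thread. A rough, untested sketch only — not the patch that was actually submitted:

```c
	if (stat.nr_unqueued_dirty == nr_taken) {
		wakeup_flusher_threads(WB_REASON_VMSCAN);
		/*
		 * On cgroup1, the flushers may not issue writeback
		 * quickly enough for writeback throttling to work on
		 * large systems with small cgroups, so stall the
		 * reclaimer directly until writeback catches up.
		 */
		if (!writeback_throttling_sane(sc))
			reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
	}
```

Both writeback_throttling_sane() and reclaim_throttle(VMSCAN_THROTTLE_WRITEBACK) are named in the thread itself; only the placement and the comment are filled in here.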
* Re: cgroup v1 and balance_dirty_pages
From: Aneesh Kumar K V @ 2022-11-18 3:56 UTC (permalink / raw)
To: Johannes Weiner; +Cc: Tejun Heo, Zefan Li, cgroups-u79uwXL29TY76Z2rM5mHXA

On 11/17/22 11:20 PM, Johannes Weiner wrote:
> On Thu, Nov 17, 2022 at 10:46:53PM +0530, Aneesh Kumar K V wrote:
>> [...]
>> Without that reclaim_throttle(), we do end up calling folio_wait_writeback()
>> but at some point hit OOM
>
> Interesting. This is probably owed to the discrepancy between total
> memory and the cgroup size. The flusher might put the occasional
> cgroup page under writeback, but cgroup reclaim will still see mostly
> dirty pages and not slow down enough.
>
> Would you mind sending a patch for adding that reclaim_throttle()?
> Gated on !writeback_throttling_sane(), with a short comment explaining
> that the flushers may not issue writeback quickly enough for cgroup1
> writeback throttling to work on larger systems with small cgroups.

I will do that.

-aneesh
Thread overview: 8+ messages
2022-11-17 6:54 cgroup v1 and balance_dirty_pages Aneesh Kumar K.V
[not found] ` <87wn7uf4ve.fsf-tEXmvtCZX7AybS5Ee8rs3A@public.gmane.org>
2022-11-17 15:12 ` Johannes Weiner
[not found] ` <Y3ZPZyaX1WN3tad4-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
2022-11-17 15:42 ` Aneesh Kumar K V
[not found] ` <697e50fd-1954-4642-9f61-1afad0ebf8c6-tEXmvtCZX7AybS5Ee8rs3A@public.gmane.org>
2022-11-17 15:51 ` Aneesh Kumar K V
[not found] ` <9fb5941b-2c74-87af-a476-ce94b43bb542-tEXmvtCZX7AybS5Ee8rs3A@public.gmane.org>
2022-11-17 16:31 ` Johannes Weiner
[not found] ` <Y3ZhyfROmGKn/jfr-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
2022-11-17 17:16 ` Aneesh Kumar K V
[not found] ` <db372090-cd6d-32e9-2ed1-0d5f9dc9c1df-tEXmvtCZX7AybS5Ee8rs3A@public.gmane.org>
2022-11-17 17:50 ` Johannes Weiner
[not found] ` <Y3Z0ZIroRFd1B6ad-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
2022-11-18 3:56 ` Aneesh Kumar K V