linux-fsdevel.vger.kernel.org archive mirror
From: Jens Axboe <axboe@kernel.dk>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	hannes@cmpxchg.org, jack@suse.cz, torvalds@linux-foundation.org
Subject: Re: [PATCH 10/12] writeback: only allow one inflight and pending full flush
Date: Fri, 29 Sep 2017 02:15:58 +0200
Message-ID: <05736c6b-f401-2d02-432c-2fd6966abbd4@kernel.dk>
In-Reply-To: <20170928144100.e11801ef742521e0e3f4b8df@linux-foundation.org>

On 09/28/2017 11:41 PM, Andrew Morton wrote:
> On Wed, 27 Sep 2017 14:13:57 -0600 Jens Axboe <axboe@kernel.dk> wrote:
> 
>> When someone calls wakeup_flusher_threads() or
>> wakeup_flusher_threads_bdi(), they schedule writeback of all dirty
>> pages in the system (or on that bdi). If we are tight on memory, we
>> can get tons of these queued from kswapd/vmscan. This causes (at
>> least) two problems:
>>
>> 1) We consume a ton of memory just allocating writeback work items.
>>    We've seen as much as 600 million of these writeback work items
>>    pending. That's a lot of memory to pointlessly hold hostage,
>>    while the box is under memory pressure.
>>
>> 2) We spend so much time processing these work items, that we
>>    introduce a softlockup in writeback processing. This is because
>>    each of the writeback work items don't end up doing any work (it's
>>    hard when you have millions of identical ones coming in to the
>>    flush machinery), so we just sit in a tight loop pulling work
>>    items and deleting/freeing them.
>>
>> Fix this by adding a 'start_all' bit to the writeback structure, and
>> set that when someone attempts to flush all dirty pages. The bit is
>> cleared when we start writeback on that work item. If the bit is
>> already set when we attempt to queue !nr_pages writeback, then we
>> simply ignore it.
>>
>> This provides us one full flush in flight, with one pending as well,
>> and makes for more efficient handling of this type of writeback.
>>
>> ...
>>
>> @@ -953,12 +954,27 @@ static void wb_start_writeback(struct bdi_writeback *wb, bool range_cyclic,
>>  		return;
>>  
>>  	/*
>> +	 * All callers of this function want to start writeback of all
>> +	 * dirty pages. Places like vmscan can call this at a very
>> +	 * high frequency, causing pointless allocations of tons of
>> +	 * work items and keeping the flusher threads busy retrieving
>> +	 * that work. Ensure that we only allow one of them pending and
>> +	 * inflight at the time. It doesn't matter if we race a little
>> +	 * bit on this, so use the faster separate test/set bit variants.
>> +	 */
>> +	if (test_bit(WB_start_all, &wb->state))
>> +		return;
>> +
>> +	set_bit(WB_start_all, &wb->state);
> 
> test_and_set_bit()?

As Linus says, this is done on purpose. I even included a note about
it in the comment above, though maybe it's not clear enough. I've used
this trick in blk-mq quite a bit as well, and for high-frequency calls
it can make a substantial difference not to redirty that cache line if
you can avoid it.

If you do care about atomicity, this works really well too:

if (test_bit(bit, addr) || test_and_set_bit(bit, addr))
	...

which avoids the locked operation when the bit is already set. Also see
this commit:
commit 7fcbbaf18392f0b17c95e2f033c8ccf87eecde1d
Author: Jens Axboe <axboe@fb.com>
Date:   Thu May 22 11:54:16 2014 -0700

    mm/filemap.c: avoid always dirtying mapping->flags on O_DIRECT

where there are some actual numbers on a specific case.
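
The test-before-test_and_set idiom above can be sketched in userspace
roughly as follows. This is only an illustration, assuming C11 atomics
as stand-ins for the kernel bitops; the sketch_* names are hypothetical
and are not kernel API.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for test_bit(): a plain read. */
static bool sketch_test_bit(unsigned int bit, atomic_ulong *addr)
{
	/* A load does not need exclusive ownership of the cache line. */
	return (atomic_load_explicit(addr, memory_order_relaxed) &
		(1UL << bit)) != 0;
}

/* Hypothetical stand-in for test_and_set_bit(): atomic read-modify-write. */
static bool sketch_test_and_set_bit(unsigned int bit, atomic_ulong *addr)
{
	/* Returns the previous value of the bit, like the kernel helper. */
	unsigned long old = atomic_fetch_or(addr, 1UL << bit);

	return (old & (1UL << bit)) != 0;
}

/*
 * Returns true only for the caller that actually set the bit. If the
 * bit is already set, the cheap read short-circuits and the atomic
 * read-modify-write is never issued.
 */
static bool sketch_try_claim(unsigned int bit, atomic_ulong *addr)
{
	if (sketch_test_bit(bit, addr) || sketch_test_and_set_bit(bit, addr))
		return false;
	return true;
}
```

With a zeroed word, the first sketch_try_claim() call on a bit succeeds
and later calls on the same bit fail without writing to the word.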

For the case at hand, we don't even need the test_and_set variant,
since we don't care about a small race there.
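
The separate test/set gating can be sketched the same way. Again, this
is a hedged userspace illustration and not the patch itself: struct
sketch_wb, SKETCH_START_ALL and the queued counter are assumptions made
up for the example, with the counter standing in for queueing a work
item.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bit number, standing in for WB_start_all. */
#define SKETCH_START_ALL 0

struct sketch_wb {
	atomic_ulong state;	/* flag word, like wb->state */
	atomic_uint queued;	/* full-flush items queued so far */
};

/* Returns true if a full-flush work item was queued. */
static bool sketch_start_full_flush(struct sketch_wb *wb)
{
	unsigned long mask = 1UL << SKETCH_START_ALL;

	/* Read-only check: the common "already pending" case writes nothing. */
	if (atomic_load_explicit(&wb->state, memory_order_relaxed) & mask)
		return false;

	/*
	 * Separate set. Two racing callers can both pass the check above
	 * and both queue an item; that small race is accepted here.
	 */
	atomic_fetch_or(&wb->state, mask);
	atomic_fetch_add(&wb->queued, 1);
	return true;
}

/* The flusher picked up the item: allow the next one to be queued. */
static void sketch_flush_started(struct sketch_wb *wb)
{
	atomic_fetch_and(&wb->state, ~(1UL << SKETCH_START_ALL));
}
```

Repeated calls while the bit is set return early, so at most one item
is pending; once the flusher clears the bit, one more can be queued
behind the one in flight.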

-- 
Jens Axboe
