From: Jan Kara <jack@suse.cz>
To: Roman Gushchin <guro@fb.com>
Cc: Jan Kara <jack@suse.cz>, Tejun Heo <tj@kernel.org>,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, Alexander Viro <viro@zeniv.linux.org.uk>,
	Dennis Zhou <dennis@kernel.org>,
	Dave Chinner <dchinner@redhat.com>,
	cgroups@vger.kernel.org
Subject: Re: [PATCH v7 6/6] writeback, cgroup: release dying cgwbs by switching attached inodes
Date: Mon, 7 Jun 2021 11:24:26 +0200
Message-ID: <20210607092426.GC30275@quack2.suse.cz>
In-Reply-To: <20210604013159.3126180-7-guro@fb.com>

On Thu 03-06-21 18:31:59, Roman Gushchin wrote:
> Asynchronously try to release dying cgwbs by switching attached inodes
> to the bdi's wb. It helps to get rid of per-cgroup writeback
> structures themselves and of pinned memory and block cgroups, which
> are significantly larger structures (mostly due to large per-cpu
> statistics data). This prevents memory waste and helps to avoid
> different scalability problems caused by large piles of dying cgroups.
> 
> Reuse the existing mechanism of inode switching used for foreign inode
> detection. To speed things up batch up to 115 inode switching in a
> single operation (the maximum number is selected so that the resulting
> struct inode_switch_wbs_context can fit into 1024 bytes). Because
> every switching consists of two steps divided by an RCU grace period,
> it would be too slow without batching. Please note that the whole
> batch counts as a single operation (when increasing/decreasing
> isw_nr_in_flight). This allows to keep umounting working (flush the
> switching queue), however prevents cleanups from consuming the whole
> switching quota and effectively blocking the frn switching.

Hum, your comment about unmount made me think... Isn't all that stuff racy?
generic_shutdown_super() has:
                sync_filesystem(sb);
                sb->s_flags &= ~SB_ACTIVE;

                cgroup_writeback_umount();

and cgroup_writeback_umount() is:
        if (atomic_read(&isw_nr_in_flight)) {
                /*
                 * Use rcu_barrier() to wait for all pending callbacks to
                 * ensure that all in-flight wb switches are in the workqueue.
                 */
                rcu_barrier();
                flush_workqueue(isw_wq);
        }

So we are clearly missing an smp_mb() here (likely in
cgroup_writeback_umount()), as the clearing of SB_ACTIVE needs to reliably
happen before atomic_read(&isw_nr_in_flight).
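
To illustrate the ordering requirement, here is a minimal userspace sketch
using C11 atomics. It is not kernel code: sb_active, nr_in_flight and the
model_*() helpers are hypothetical stand-ins for SB_ACTIVE, isw_nr_in_flight
and the two code paths, and it assumes the switching side re-checks the flag
after bumping the counter. The seq_cst fences play the role of smp_mb().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for SB_ACTIVE and isw_nr_in_flight. */
static atomic_bool sb_active = true;
static atomic_int nr_in_flight = 0;

/*
 * Unmount side: clear the flag, issue a full fence, then read the counter.
 * Returns true if there are in-flight switches that must be flushed.
 */
static bool model_umount_must_flush(void)
{
	atomic_store_explicit(&sb_active, false, memory_order_relaxed);
	/* Plays the role of the missing smp_mb(). */
	atomic_thread_fence(memory_order_seq_cst);
	return atomic_load_explicit(&nr_in_flight, memory_order_relaxed) != 0;
}

/*
 * Switching side (assumed pairing): bump the counter, issue a full fence,
 * then check the flag. Returns true if the switch may proceed.
 */
static bool model_switch_begin(void)
{
	atomic_fetch_add_explicit(&nr_in_flight, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
	if (!atomic_load_explicit(&sb_active, memory_order_relaxed)) {
		/* Unmount already started: back off without touching inodes. */
		atomic_fetch_sub_explicit(&nr_in_flight, 1, memory_order_relaxed);
		return false;
	}
	return true;
}

/* Called when a switch that was allowed to proceed has finished. */
static void model_switch_end(void)
{
	int prev = atomic_fetch_sub_explicit(&nr_in_flight, 1,
					     memory_order_release);
	assert(prev > 0);
	(void)prev;
}
```

With a full fence on both sides, at least one side observes the other's
store: either the switcher sees the flag cleared and backs off, or unmount
sees a non-zero counter and flushes. Without the fence on the unmount side,
both could miss each other.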

Also ...

> +bool cleanup_offline_cgwb(struct bdi_writeback *wb)
> +{
> +	struct inode_switch_wbs_context *isw;
> +	struct inode *inode;
> +	int nr;
> +	bool restart = false;
> +
> +	isw = kzalloc(sizeof(*isw) + WB_MAX_INODES_PER_ISW *
> +		      sizeof(struct inode *), GFP_KERNEL);
> +	if (!isw)
> +		return restart;
> +
> +	/* no need to call wb_get() here: bdi's root wb is not refcounted */
> +	isw->new_wb = &wb->bdi->wb;
> +
> +	nr = 0;
> +	spin_lock(&wb->list_lock);
> +	list_for_each_entry(inode, &wb->b_attached, i_io_list) {
> +		if (!inode_prepare_wbs_switch(inode, isw->new_wb))
> +			continue;
> +
> +		isw->inodes[nr++] = inode;
> +
> +		if (nr >= WB_MAX_INODES_PER_ISW - 1) {
> +			restart = true;
> +			break;
> +		}
> +	}
> +	spin_unlock(&wb->list_lock);
> +
> +	/* no attached inodes? bail out */
> +	if (nr == 0) {
> +		kfree(isw);
> +		return restart;
> +	}
> +
> +	/*
> +	 * In addition to synchronizing among switchers, I_WB_SWITCH tells
> +	 * the RCU protected stat update paths to grab the i_page
> +	 * lock so that stat transfer can synchronize against them.
> +	 * Let's continue after I_WB_SWITCH is guaranteed to be visible.
> +	 */
> +	INIT_RCU_WORK(&isw->work, inode_switch_wbs_work_fn);
> +	queue_rcu_work(isw_wq, &isw->work);
> +
> +	atomic_inc(&isw_nr_in_flight);

... the increment of isw_nr_in_flight needs to happen before we start to
grab any inodes. Otherwise unmount can get past cgroup_writeback_umount()
while we are still holding inode references in cleanup_offline_cgwb(). The
result will be a "Busy inodes after unmount." message and use-after-free
issues (with inode->i_sb, which gets freed).
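
A rough userspace sketch of the ordering I have in mind, with the counter
taken before anything gets pinned and dropped again if the batch ends up
empty. This only models the control flow: model_in_flight, MODEL_MAX_BATCH
and model_collect_batch() are hypothetical names, and a zero entry stands in
for inode_prepare_wbs_switch() refusing an inode.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define MODEL_MAX_BATCH 4	/* hypothetical; stands in for the batch limit */

/* Hypothetical stand-in for isw_nr_in_flight. */
static atomic_int model_in_flight = 0;

/*
 * Collect up to MODEL_MAX_BATCH non-zero entries from src into batch.
 * Returns the number of entries collected. The in-flight count stays
 * elevated on return only if something was collected.
 */
static size_t model_collect_batch(const int *src, size_t n, int *batch)
{
	size_t nr = 0;

	assert(src != NULL && batch != NULL);

	/* Announce the operation before pinning anything. */
	atomic_fetch_add(&model_in_flight, 1);

	for (size_t i = 0; i < n && nr < MODEL_MAX_BATCH; i++) {
		if (src[i] == 0)	/* models a failed prepare step */
			continue;
		batch[nr++] = src[i];
	}

	/* Nothing pinned: drop the count again on the bail-out path. */
	if (nr == 0)
		atomic_fetch_sub(&model_in_flight, 1);

	return nr;
}
```

The point is only that every path which can hold a reference runs with the
counter already non-zero, so the unmount-side check cannot be passed while
references are held.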

Frankly, I think a much safer option would be to wait in evict() for
I_WB_SWITCH, similarly to how we wait for I_SYNC (through
inode_wait_for_writeback()). With that we could do away with
cgroup_writeback_umount() altogether. But I guess that's out of the scope of
this series.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
