From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=ba4g=PT=vger.kernel.org=linux-fsdevel-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-2.5 required=3.0 tests=HEADER_FROM_DIFFERENT_DOMAINS,
	MAILING_LIST_MULTI,SPF_PASS,USER_AGENT_MUTT autolearn=ham autolearn_force=no
	version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id CD8D1C43387
	for <linux-fsdevel@archiver.kernel.org>; Fri, 11 Jan 2019 07:23:19 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
	by mail.kernel.org (Postfix) with ESMTP id A810D20872
	for <linux-fsdevel@archiver.kernel.org>; Fri, 11 Jan 2019 07:23:19 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1730336AbfAKHXS (ORCPT
        <rfc822;linux-fsdevel@archiver.kernel.org>);
        Fri, 11 Jan 2019 02:23:18 -0500
Received: from zeniv.linux.org.uk ([195.92.253.2]:42898 "EHLO
        ZenIV.linux.org.uk" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1729499AbfAKHXS (ORCPT
        <rfc822;linux-fsdevel@vger.kernel.org>);
        Fri, 11 Jan 2019 02:23:18 -0500
Received: from viro by ZenIV.linux.org.uk with local (Exim 4.91 #2 (Red Hat Linux))
        id 1ghr9q-0000eP-00; Fri, 11 Jan 2019 07:23:10 +0000
Date:   Fri, 11 Jan 2019 07:23:09 +0000
From:   Al Viro <viro@zeniv.linux.org.uk>
To:     Zefan Li <lizefan@huawei.com>
Cc:     Tejun Heo <tj@kernel.org>,
        Linus Torvalds <torvalds@linux-foundation.org>,
        linux-fsdevel@vger.kernel.org
Subject: [heads-up] buggered refcounting logics in cgroup1_mount()
Message-ID: <20190111072309.GT2217@ZenIV.linux.org.uk>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.10.1 (2018-07-13)
Sender: linux-fsdevel-owner@vger.kernel.org
Precedence: bulk
List-ID: <linux-fsdevel.vger.kernel.org>
X-Mailing-List: linux-fsdevel@vger.kernel.org

	There'd been a long string of kludges in that area, and unfortunately
even the latest variant of solution is broken.  You rely upon the trick with
percpu_ref_reinit() being done only after cgroup_do_mount().  That prevents
parallel mount picking cgroup_root of superblock in the middle of setting
up, then losing CPU, original mount succeeding and umount marking our
cgroup_root killed; then the second mount regains CPU and proceeds to set
a new superblock.  Umount of _that_ would try to kill cgroup_root one more
time, with obvious nastiness.

	Delaying the moment when cgroup_root can be grabbed solves that,
but it does nothing for another similar case:
	* cgroup1 is mounted and populated
	* umount doesn't kill cgroup_root since it's non-empty
	* mount attempt picks cgroup_root, tries to do kernfs_pin_sb()
(which finds no superblocks) and successfully bumps the refcount.  Then
it drops cgroup_mutex and proceeds to lose CPU (preemption, blocking
allocation in kernfs, whatever).
	* another mount attempt picks the same cgroup_root, again
finds no superblock, bumps the refcount and proceeds to mount the
sucker.  Then, before the first one regains CPU, it manages to
empty the tree and umount it, marking cgroup_root killed.
	* now the first attempt regains CPU and we are in the same
situation as with the original bug.

	Another problem with that is that (without any races) failing
allocation in d_make_root() within kernfs_fill_super() will end up
in cgroup_kill_sb(), with cgroup_root _still_ marked killed (your
kludge creates it that way and we hadn't reached the reinit yet).

	AFAICS, a cleaner solution would be this:
* to hell with kernfs_pin_sb(); just try to grab a reference to
cgroup_root on reuse.
* have cgroup_kill_sb() treat "it's already marked killed" as
"just drop the reference, then".
* after cgroup_do_mount() check if cgroup_root got marked killed and
do deactivate_locked_super() in such case (with the same
restart_syscall() failure exit).

	Objections?  I would love to kill kernfs_pin_sb() as
a followup (it's a fundamentally broken API), but that's not
a stable fodder; some fix of refcounting bugs, OTOH, should be.