Date: Mon, 9 Feb 2026 00:34:37 +0000
From: Al Viro
To: linux-fsdevel@vger.kernel.org
Cc: Linus Torvalds, Christian Brauner, Jan Kara, "H. Peter Anvin",
	Werner Almesberger
Subject: [RFC] pivot_root(2) races
Message-ID: <20260209003437.GF3183987@ZenIV>

pivot_root(2) semantics is unique in one respect: this is the only
syscall that changes current directory and root of other processes.
All other syscalls affect that only for the caller and the threads
that happen to share its fs_struct (i.e. if CLONE_FS had been used
all the way back to common ancestor).  pivot_root(), OTOH, goes over
all threads in the system and every thread that had the same root as
the caller gets switched to new root; ditto for current directories.

AFAICS, the original rationale had been about the kernel threads that
would otherwise keep the old root busy.  These days that could've been
dealt with much easier, but that behaviour is cast in stone - it's been
25 years since Werner had brought that thing into the tree
(2.3.41-pre4), and the time for objections is long gone.

Unfortunately, the way it's been done (all the way since the original
posting) is racy.

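To make the switchover concrete, here is a minimal user-space sketch of
what that walk over all threads amounts to.  It is a toy model, not
kernel code: struct path is reduced to a pair of integers, the task list
is a plain array, all locking and refcounting is left out, and the names
replace_path(), model_chroot_fs_refs() and model_demo() are made up for
the illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins: the kernel's struct path is a (mount, dentry) pair. */
struct path { int mnt; int dentry; };
struct fs_struct { struct path root, pwd; };
struct task { struct fs_struct *fs; };

/* Replace *p with *repl if it refers to the same location as *old. */
static int replace_path(struct path *p, const struct path *old,
			const struct path *repl)
{
	if (p->mnt != old->mnt || p->dentry != old->dentry)
		return 0;
	*p = *repl;
	return 1;
}

/*
 * Visit every task on the list; wherever root or pwd matches old_root,
 * switch it to new_root.  Returns the number of references replaced.
 * Only fs_structs reachable through the list are ever looked at.
 */
static int model_chroot_fs_refs(struct task *tasks, size_t n,
				const struct path *old_root,
				const struct path *new_root)
{
	int hits = 0;

	assert(old_root && new_root);
	for (size_t i = 0; i < n; i++) {
		struct fs_struct *fs = tasks[i].fs;

		if (!fs)	/* a task may have no fs_struct at all */
			continue;
		hits += replace_path(&fs->root, old_root, new_root);
		hits += replace_path(&fs->pwd, old_root, new_root);
	}
	return hits;
}

/* Three tasks: two have the caller's root, one is chrooted elsewhere. */
static int model_demo(void)
{
	struct path old_root = { 1, 1 }, new_root = { 3, 1 };
	struct fs_struct a = { { 1, 1 }, { 1, 1 } };	/* root and cwd at old root */
	struct fs_struct b = { { 1, 1 }, { 1, 5 } };	/* cwd somewhere else */
	struct fs_struct c = { { 2, 7 }, { 2, 7 } };	/* different root */
	struct task tasks[] = { { &a }, { &b }, { &c } };

	return model_chroot_fs_refs(tasks, 3, &old_root, &new_root);
}
```

In this model model_demo() replaces three references: root and pwd of
the first task and root of the second; the third task is left alone.
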
If pivot_root() is called while another thread is in the middle of
fork(), it will not see the fs_struct of the child-to-be.  The race
window is from the call of copy_fs() to the point where copy_process()
finally decides that everything is ready and grabs tasklist_lock.
There are enough blocking allocations in that range to make it an issue
even on UP, without any kind of preemption, etc. - that race has been
there since the very beginning.

If pivot_root() comes before that window, the child will get the root
and current directory already switched; if it comes after the window,
the child will have its root and current directory switched by
pivot_root() itself.  If it's in the window, though, the child ends up
born chrooted into wherever the original root has been moved to by
pivot_root().

Similar races exist for other syscalls that create new fs_struct
instances (unshare(2) and setns(2)), with the same underlying
mechanism - embryonic fs_struct instances are missed by the function
that does that switchover (chroot_fs_refs()).

Making those instances visible to it is possible; I've tried to play
with that idea, but that leads to really disgusting code and adds a
cost to each fork(), all for the sake of a rarely used syscall.  IMO
it's a non-starter.

If it were just fork() alone, we could deal with that simply by
delaying the copying of ->fs->{root,pwd} until copy_process() has
grabbed tasklist_lock.  Unfortunately, things are more complicated.

First of all, there's CLONE_NEWNS to deal with.  It does flip the root
and current directory of the caller from locations in original
namespace to the corresponding locations in the copy.  These days it's
done by passing child's fs_struct all the way to copy_mnt_ns(), where
we use that fs_struct to pick the original locations from *and* to
store the new locations into.
That, of course, opens the same pivot_root() race; child's fs_struct is
invisible to it, and if it has happened between copy_fs() and
copy_mnt_ns() we'll get the new namespace with mount tree reflecting
the changes from pivot_root() and child chrooted into the subtree where
the original root had been moved to.  pivot_root() done to the old
namespace after that point is not an issue (we get the same result as
if clone(CLONE_NEWNS) had won the race) and pivot_root() to the new
namespace is not possible until the child becomes visible to
chroot_fs_refs().

Note that fs_struct passed to copy_mnt_ns() serves two purposes - we
need the original locations to calculate the new ones and we need some
way to report those locations to the rest of the system.  The former
role should be served by current->fs; for the latter I would prefer to
give it a pointer to a pair of struct path, so that setting the child's
fs_struct would be done by the caller.  We could keep using child's
fs_struct for that (it will always be an embryonic instance -
CLONE_NEWNS is mutually exclusive with CLONE_FS), but that makes for
considerably messier cleanup logic in copy_process().

Doing that closes the race for all clone(2) variants - we add a
two-element array of struct path in copy_process(), initialize it with
all NULLs and pass it to copy_namespaces() instead of child's ->fs.
Then, once we have grabbed the tasklist_lock, we do the following:

	struct fs_struct *fs = current->fs;
	struct fs_struct *new_fs = p->fs;

	read_seqlock_excl(&fs->seq);
	if (fs != new_fs) {
		/* !CLONE_FS: fill the embryonic instance; path[] is
		 * non-NULL only if CLONE_NEWNS has set new locations */
		new_fs->root = likely(!path[0].mnt) ? fs->root : path[0];
		new_fs->pwd = likely(!path[1].mnt) ? fs->pwd : path[1];
		/* new_fs gets its own references; the ones in path[]
		 * are dropped separately */
		path_get(&new_fs->root);
		path_get(&new_fs->pwd);
	} else {
		/* CLONE_FS: child shares the parent's instance */
		fs->users++;
	}
	read_sequnlock_excl(&fs->seq);

for switchover.  copy_fs() would allocate the new fs_struct (in
!CLONE_FS case, that is), but do no copying of pwd/root into it.
Failure exits would not use exit_fs() - that has a side benefit of
using exit_fs() only for current; new fs_struct would be freed directly
and references in path[] would be dropped by path_put(), success or
failure.

unshare(2) also has a similar race, with similar solution:
	* copy pwd/root from original to replacement fs_struct in the
	  same scope where we change current->fs
	* pass a pair of struct path to unshare_nsproxy_namespaces()
	  instead of giving it fs_struct
	* use exact same logic as in copy_process() for filling ->root
	  and ->pwd of new fs_struct
(that gives a side benefit in unshare_nsproxy_namespaces() - instead of
the
kernel/nsproxy.c:226:	*new_nsp = create_new_namespaces(unshare_flags, current, user_ns,
kernel/nsproxy.c:227:					 new_fs ? new_fs : current->fs);
we simply pass struct path array as-is)

Other callers of create_new_namespaces() can simply pass NULL instead
of bothering with struct path pairs - they never get CLONE_NEWNS in
flags, so...

At that point we are left with two callers of copy_fs_struct().  One is
unshare_fs_struct(), with only one user (knfsd thread setup).  It does
have the same race with pivot_root(), not that it had been likely, but
that race is trivially closed in the same way as with fork() et al. -
just delay copying until we are about to switch current->fs.

The remaining caller is something entirely different - it's
prepare_nsset().  There the embryonic fs_struct is not going to end up
as any task's ->fs; it's used only to get the root of namespace we are
joining - mntns_install() sets it, and later it goes into
current->fs->{root,pwd} once we are sure that joining other namespaces
won't fail (if it's only CLONE_NEWNS, nothing gets allocated and the
damn thing goes straight into current->fs->{root,pwd}).  It's also
racy; unlike copy_mnt_ns(), there's no exclusion whatsoever wrt
pivot_root(2) (cloning namespace obviously needs it to stabilize the
mount tree it's copying).
Even the case of pure CLONE_NEWNS is racy - it switches fs->{root,pwd}
to whatever overmounts the namespace root and each of those is done
under fs->seq, but that's two scopes and neither covers finding that
overmount, so chroot_fs_refs() might come and mess things up.  If it's
more than just CLONE_NEWNS and we end up with the damn thing stored in
the embryonic fs_struct, the race window includes joining the other
namespaces as well.

Another problem with the "CLONE_NEWNS + something else" case is that we
end up screwing the logic in mntns_install() that tried to reject the
case of shared current->fs; embryo isn't shared, so the test in there
passes just fine and it's not repeated in the caller, so it's possible
to have one of the fs_struct-sharing threads join a namespace,
switching the root and current directory for the rest without having
their namespace switched.

TBH, looking at that one I'd say that we move finding the namespace
root to a separate helper and have _that_ switch root and current
directory of the caller, with commit_nsset() using it.  The interesting
part is whether we want to deal with the possibility of errors at that
point...  An error *can*, in theory, happen, but only if the namespace
root is overmounted by a mount trap and stepping onto it ends up
failing.  That's an insane setup, of course, but...

Comments?

PS: Werner and hpa Cc'd as the folks involved in introducing
pivot_root(2) in the first place.
See https://lkml.org/lkml/2000/1/25/111; thanks to Sasha Levin for
finding that thread, BTW - lore doesn't seem to have it and google not
only fails to find it, their "AI" fed me an impressive gaslighting
session, complete with inventing nonexistent l-k postings from me,
claiming that the syscall in question had been introduced by one
Al Viro ;-/  Telling it that no such postings exist on any of the
suggested sites got the expected waffling; a request for the URL of a
specific posting ended up with "https://lore.kernel.org", once we got
through the difference between the site name and a clickable link...
Pity whoever tries to use that shite for "research".

For the record - pivot_root(2) is from Werner Almesberger, with
suggestions from Linus and Peter; I hadn't been involved at all.
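
For illustration only, the fork() window discussed in this message can
be replayed in a toy user-space model.  Everything here is an
assumption made for the sketch: struct path is a pair of integers,
"visible" stands in for the set of fs_structs reachable via the task
list, and model_pivot(), model_fork(), enum pivot_time and the
delay_copy flag are hypothetical names.  delay_copy loosely corresponds
to the idea of copying root/pwd only at the point where the child
becomes visible; no locking is modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct path { int mnt; int dentry; };
struct fs_struct { struct path root, pwd; };

enum pivot_time { PIVOT_BEFORE_COPY, PIVOT_IN_WINDOW, PIVOT_AFTER_PUBLISH };

static bool path_equal(const struct path *a, const struct path *b)
{
	return a->mnt == b->mnt && a->dentry == b->dentry;
}

/* Switch root/pwd of every *visible* fs_struct from old_root to new_root. */
static void model_pivot(struct fs_struct **visible, int n,
			const struct path *old_root,
			const struct path *new_root)
{
	for (int i = 0; i < n; i++) {
		if (path_equal(&visible[i]->root, old_root))
			visible[i]->root = *new_root;
		if (path_equal(&visible[i]->pwd, old_root))
			visible[i]->pwd = *new_root;
	}
}

/*
 * Replay one fork() with pivot_root() landing at the given point.
 * Returns true if the child ends up with the new root.
 */
static bool model_fork(enum pivot_time when, bool delay_copy)
{
	const struct path old_root = { 1, 1 }, new_root = { 2, 1 };
	struct fs_struct parent, child = { { 0, 0 }, { 0, 0 } };
	struct fs_struct *visible[2] = { &parent, NULL };
	int nvisible = 1;

	parent.root = old_root;
	parent.pwd = old_root;

	if (when == PIVOT_BEFORE_COPY)
		model_pivot(visible, nvisible, &old_root, &new_root);

	if (!delay_copy)
		child = parent;		/* copy early, child still invisible */

	if (when == PIVOT_IN_WINDOW)
		model_pivot(visible, nvisible, &old_root, &new_root);

	if (delay_copy)
		child = parent;		/* copy at the moment of publication */
	assert(nvisible < 2);
	visible[nvisible++] = &child;	/* child is now on the list */

	if (when == PIVOT_AFTER_PUBLISH)
		model_pivot(visible, nvisible, &old_root, &new_root);

	return path_equal(&child.root, &new_root);
}
```

With the early copy, only the in-window case leaves the child on the
old root; with delay_copy set, all three orderings give the child the
new root in this model.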