From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from hr2.samba.org (hr2.samba.org [144.76.82.148])
        (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
        (No client certificate requested)
        by smtp.subspace.kernel.org (Postfix) with ESMTPS id E614E40242E;
        Tue, 28 Apr 2026 13:49:18 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=144.76.82.148
ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
        t=1777384161; cv=none;
        b=Q5U5Iiwg9tMZX+w0Ibc1/AhdUlIYPCTcbQGnl3igc++vxRZibUoSK8KmsE61q+2d9EAYmsVZDNgvycO5tPOfm5U5/jkcOI9GLYVcmr5NGmgH7+RIJzCAbX5NneLfkhUwwxmtwVBDVIFk6Ybu1hIV5iIwEFhvG8iHumsJ0cVYIak=
ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org;
        s=arc-20240116; t=1777384161; c=relaxed/simple;
        bh=VJLrfiffR+uX3dDo67yHkGsVUjs8Td/NdtRSRAxK5rI=;
        h=Message-ID:Date:MIME-Version:Subject:From:To:Cc:References:
         In-Reply-To:Content-Type;
        b=u8qO8WZFXtMZIn7eN17w1aP84xQEJ4OzacFS9IvYh21GnyUKKEJo9o1+MLF4izR5q/yAc7IDHtAZTAsiFr+UXeHUcHci1nYrO1MgXpoG0DvipN9BelALI6J506hKLZN5ewTyFTnvdS7Lg2QT4C3WSBLjd6xPzXJpMyLtZk8NnFI=
ARC-Authentication-Results:i=1; smtp.subspace.kernel.org;
        dmarc=pass (p=quarantine dis=none) header.from=samba.org;
        spf=pass smtp.mailfrom=samba.org;
        dkim=pass (3072-bit key) header.d=samba.org header.i=@samba.org header.b=siWTW250;
        arc=none smtp.client-ip=144.76.82.148
Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=samba.org
Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=samba.org
Authentication-Results: smtp.subspace.kernel.org;
        dkim=pass (3072-bit key) header.d=samba.org header.i=@samba.org header.b="siWTW250"
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=samba.org;
        s=42; h=Cc:To:From:Date:Message-ID;
        bh=+iF/EIyfETkkbf77GRpdzpYQNxQ1hDmi2QfyhcPABCY=;
        b=siWTW2506jrtLtVqOowPdAGirO
        RRXzU+xeifRsIuNqb4TKbkG4e9gKIjttiNM6XrikI9SnIwWWLeNnD8C70hnsI8mc0v2QnQH1NQSlt
        HYcx/4GNhBvw0GeVPK6euYsiUUJK4NYDMQGx4CHoQduMvFLu/Gfx2+h1zPQzmRAEhWTm7yoyWXZ1T
        7BnlTUq33vLlNQCrcC2xRQq7iO/Dv7w+Fr5dC5i65vNtaH94DEF3Cy4GszXYRW0S2cunlRkoz0Jpa
        MFsGbsqOGuA4uln6vygb/NSZHF+mJkKwJl2ftVL5GqTW9MXMrr/x6rRrIFOUgm3E+aD6RA0bkOZH+
        YTSgZdfg0kHspsn6ctJBm2xJgqDAZbE/xo8yAlu5kfcDJQSwkttSk6eFFdJR75vYUvsP+e6ldWg//
        lvertcgX+y4hmUpUzlQ4Ez4njYX1ckI3z4F81it4MzsRmhZnPQIXetfrucYuleEk7x8HT3p3sUgQB
        6wq/sEHp7hQtHp/rVLAXqQso;
Received: from [127.0.0.2] (localhost [127.0.0.1])
        by hr2.samba.org with esmtpsa (TLS1.3:ECDHE_SECP256R1__ECDSA_SECP256R1_SHA256__CHACHA20_POLY1305:256)
        (Exim)
        id 1wHioN-00000004jxr-3oW5;
        Tue, 28 Apr 2026 13:49:15 +0000
Message-ID: <1906c171-e18d-48bd-9529-3960e9b8a284@samba.org>
Date: Tue, 28 Apr 2026 15:49:15 +0200
Precedence: bulk
X-Mailing-List: linux-api@vger.kernel.org
List-Id:
List-Subscribe:
List-Unsubscribe:
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: [RFC PATCH v2 1/2] vfs: syscalls: add mkdirat2() that returns an O_DIRECTORY fd
From: Stefan Metzmacher
To: Christian Brauner, Jori Koolstra, Jeff Layton
Cc: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
        Dave Hansen, x86@kernel.org, Alexander Viro, Arnd Bergmann, "H .
        Peter Anvin", Jan Kara, Peter Zijlstra, Andrey Albershteyn,
        Masami Hiramatsu, Jiri Olsa, =?UTF-8?Q?Thomas_Wei=C3=9Fschuh?=,
        Mathieu Desnoyers, Aleksa Sarai, cmirabil@redhat.com, Greg Kroah-Hartman,
        linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
        linux-api@vger.kernel.org, linux-arch@vger.kernel.org
References: <20260412135434.3095416-1-jkoolstra@xs4all.nl>
        <20260412135434.3095416-2-jkoolstra@xs4all.nl>
        <20260427-umlegen-aufbau-ee3a97f1528a@brauner>
Content-Language: en-US
In-Reply-To:
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit

On 28.04.26 at 15:39, Stefan Metzmacher wrote:
> On 27.04.26 at 17:48, Christian Brauner wrote:
>> On Sun, Apr 12, 2026 at 03:54:33PM +0200, Jori Koolstra wrote:
>>> Currently there is no way to race-freely create and open a directory.
>>> For regular files we have open(O_CREAT) for creating a new file inode,
>>> and returning a pinning fd to it. The lack of such functionality for
>>> directories means that when populating a directory tree there's always
>>> a race involved: the inodes first need to be created, and then opened
>>> to adjust their permissions/ownership/labels/timestamps/acls/xattrs/...,
>>> but in the time window between the creation and the opening they might
>>> be replaced by something else.
>>>
>>> Addressing this race without proper APIs is possible (by immediately
>>> fstat()ing what was opened, to verify that it has the right inode type),
>>> but difficult to get right. Hence, mkdirat2() that creates a directory
>>> and returns an O_DIRECTORY fd is useful.
>>>
>>> This feature idea (and description) is taken from the UAPI group:
>>> https://github.com/uapi-group/kernel-features?tab=readme-ov-file#race-free-creation-and-opening-of-non-file-inodes
>>>
>>> Signed-off-by: Jori Koolstra
>>> ---
>>>   arch/x86/entry/syscalls/syscall_64.tbl |  1 +
>>>   fs/internal.h                          |  2 ++
>>>   fs/namei.c                             | 44 +++++++++++++++++++++++---
>>>   include/linux/syscalls.h               |  2 ++
>>>   include/uapi/asm-generic/unistd.h      |  5 ++-
>>>   scripts/syscall.tbl                    |  1 +
>>>   6 files changed, 50 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
>>> index 524155d655da..e200ca2067a4 100644
>>> --- a/arch/x86/entry/syscalls/syscall_64.tbl
>>> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
>>> @@ -396,6 +396,7 @@
>>>   469    common    file_setattr        sys_file_setattr
>>>   470    common    listns              sys_listns
>>>   471    common    rseq_slice_yield    sys_rseq_slice_yield
>>> +472    common    mkdirat2            sys_mkdirat2
>>>   #
>>>   # Due to a historical design error, certain syscalls are numbered differently
>>> diff --git a/fs/internal.h b/fs/internal.h
>>> index cbc384a1aa09..c6a79afadacf 100644
>>> --- a/fs/internal.h
>>> +++ b/fs/internal.h
>>> @@ -59,6 +59,8 @@ int may_linkat(struct mnt_idmap *idmap, const struct path *link);
>>>   int filename_renameat2(int olddfd, struct filename *oldname, int newdfd,
>>>            struct filename *newname, unsigned int flags);
>>>   int filename_mkdirat(int dfd, struct filename *name, umode_t mode);
>>> +struct file *do_file_mkdirat(int dfd, struct filename *name, umode_t mode,
>>> +        unsigned int flags, bool open);
>>>   int filename_mknodat(int dfd, struct filename *name, umode_t mode, unsigned int dev);
>>>   int filename_symlinkat(struct filename *from, int newdfd, struct filename *to);
>>>   int filename_linkat(int olddfd, struct filename *old, int newdfd,
>>> diff --git a/fs/namei.c b/fs/namei.c
>>> index a880454a6415..6451e96dc225 100644
>>> --- a/fs/namei.c
>>> +++ b/fs/namei.c
>>> @@ -5255,18 +5255,36 @@ struct dentry *vfs_mkdir(struct mnt_idmap *idmap, struct inode *dir,
>>>   }
>>>   EXPORT_SYMBOL(vfs_mkdir);
>>> -int filename_mkdirat(int dfd, struct filename *name, umode_t mode)
>>> +static int mkdirat_lookup_flags(unsigned int flags)
>>> +{
>>> +    int lookup_flags = LOOKUP_DIRECTORY;
>>> +
>>> +    if (!(flags & AT_SYMLINK_NOFOLLOW))
>>> +        lookup_flags |= LOOKUP_FOLLOW;
>>> +    if (!(flags & AT_NO_AUTOMOUNT))
>>> +        lookup_flags |= LOOKUP_AUTOMOUNT;
>>> +
>>> +    return lookup_flags;
>>> +}
>>> +
>>> +int filename_mkdirat(int dfd, struct filename *name, umode_t mode) {
>>> +    return PTR_ERR_OR_ZERO(do_file_mkdirat(dfd, name, mode, 0, false));
>>> +}
>>> +
>>> +struct file *do_file_mkdirat(int dfd, struct filename *name, umode_t mode,
>>> +        unsigned int flags, bool open)
>>>   {
>>>       struct dentry *dentry;
>>>       struct path path;
>>>       int error;
>>> -    unsigned int lookup_flags = LOOKUP_DIRECTORY;
>>> +    struct file *filp = NULL;
>>> +    unsigned int lookup_flags = mkdirat_lookup_flags(flags);
>>>       struct delegated_inode delegated_inode = { };
>>>   retry:
>>>       dentry = filename_create(dfd, name, &path, lookup_flags);
>>>       if (IS_ERR(dentry))
>>> -        return PTR_ERR(dentry);
>>> +        return ERR_CAST(dentry);
>>>       error = security_path_mkdir(&path, dentry,
>>>               mode_strip_umask(path.dentry->d_inode, mode));
>>> @@ -5276,6 +5294,10 @@ int filename_mkdirat(int dfd, struct filename *name, umode_t mode)
>>>           if (IS_ERR(dentry))
>>>               error = PTR_ERR(dentry);
>>>       }
>>> +    if (open && !error && !is_delegated(&delegated_inode)) {
>>> +        const struct path new_path = { .mnt = path.mnt, .dentry = dentry };
>>> +        filp = dentry_open(&new_path, O_DIRECTORY, current_cred());
>>> +    }
>>
>> So definitely a patchset worth doing but this will be hairy. And
>> Mateusz is right. As written this doesn't work. The canonical pattern
>> how e.g., dentry_open() does it is to preallocate the file.
>>
>> I do wonder though whether we shouldn't just make O_CREAT | O_DIRECTORY
>> work. I remember that I had a vague comment about this in [1] a few
>> years ago (cf. [1]). It might even be less hairy to get that one right
>> as all the thinking for O_CREAT is already there.
>>
>> What was the rationale for mkdirat2() instead of threading this through
>> openat()/openat2() with O_CREAT?
>>
>> And side-question: @Jeff, can nfs atomic open deal with O_CREAT |
>> O_DIRECTORY?
>
> If it helps, the SMB2/3 protocol only has a single SMB2 Create operation
> that uses FILE_CREATE+FILE_NON_DIRECTORY_FILE or FILE_CREATE+FILE_DIRECTORY_FILE.
>
> Given that openat() ignores unknown flags or combinations, maybe this
> should be openat2() only and even a new flag (at least for the userspace interface),
> or do_sys_open() will reject it for open() and openat().

I just found the interaction of __O_TMPFILE and O_DIRECTORY, so there should be
an O_MKDIR or something similar that's openat2() only.

> While we're there, an O_TMPDIR would also be wonderful to have.
> Currently Samba works around it by using a hidden directory name, invisible
> to SMB clients, but NFS and local users see it.

That should also be openat2() only if added.

metze
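
For reference, the userspace workaround mentioned in the quoted commit
message (create, open, then fstat() to check the inode type) could look
roughly like the sketch below. The helper name mkdir_and_open() is made up
for illustration and is not part of the patch; it only uses the existing
mkdirat(2), openat(2) and fstat(2) calls.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the patch): create `name` below `dfd` and
 * return an O_DIRECTORY fd for it, or -1 with errno set.
 */
static int mkdir_and_open(int dfd, const char *name, mode_t mode)
{
    struct stat st;
    int fd, saved;

    assert(name != NULL);

    if (mkdirat(dfd, name, mode) < 0)
        return -1;

    /* Race window: `name` can be replaced between mkdirat() and openat(). */
    fd = openat(dfd, name, O_RDONLY | O_DIRECTORY | O_NOFOLLOW | O_CLOEXEC);
    if (fd < 0)
        return -1;

    /* Check the inode type of what was actually opened. */
    if (fstat(fd, &st) < 0) {
        saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }
    if (!S_ISDIR(st.st_mode)) {
        close(fd);
        errno = ENOTDIR;
        return -1;
    }
    return fd;
}
```

Note that the fstat() check only confirms that a directory was opened, not
that it is the directory mkdirat() just created, so the window between the
two calls remains.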