* Testing if two open descriptors refer to the same inode
@ 2024-07-29 6:55 Florian Weimer
2024-07-29 9:09 ` Aleksa Sarai
` (2 more replies)
0 siblings, 3 replies; 24+ messages in thread
From: Florian Weimer @ 2024-07-29 6:55 UTC (permalink / raw)
To: linux-fsdevel, linux-kernel, linux-api; +Cc: Dave Chinner
It was pointed out to me that inode numbers on Linux are no longer
expected to be unique per file system, even for local file systems.
Applications sometimes need to check if two (open) files are the same.
For example, a program may want to use a temporary file if is invoked
with input and output files referring to the same file.
How can we check for this? The POSIX way is to compare st_ino and
st_dev in stat output, but if inode numbers are not unique, that will
result in files falsely being reported as identical. It's harmless in
the temporary file case, but it in other scenarios, it may result in
data loss.
Thanks,
Florian
^ permalink raw reply [flat|nested] 24+ messages in thread* Re: Testing if two open descriptors refer to the same inode 2024-07-29 6:55 Testing if two open descriptors refer to the same inode Florian Weimer @ 2024-07-29 9:09 ` Aleksa Sarai 2024-07-29 9:29 ` Florian Weimer 2024-07-29 10:18 ` Mateusz Guzik 2024-07-29 15:24 ` Jeff Layton 2 siblings, 1 reply; 24+ messages in thread From: Aleksa Sarai @ 2024-07-29 9:09 UTC (permalink / raw) To: Florian Weimer; +Cc: linux-fsdevel, linux-kernel, linux-api, Dave Chinner [-- Attachment #1: Type: text/plain, Size: 1953 bytes --] On 2024-07-29, Florian Weimer <fweimer@redhat.com> wrote: > It was pointed out to me that inode numbers on Linux are no longer > expected to be unique per file system, even for local file systems. > Applications sometimes need to check if two (open) files are the same. > For example, a program may want to use a temporary file if is invoked > with input and output files referring to the same file. Based on the discussions we had at LSF/MM, I believe the "correct" way now is to do name_to_handle_at(fd, "", ..., AT_EMPTY_PATH|AT_HANDLE_FID) and then use the fhandle as the key to compare inodes. AT_HANDLE_FID is needed for filesystems that don't support decoding file handles, and was added in Linux 6.6[1]. However, I think this inode issue is only relevant for btree filesystems, and I think both btrfs and bcachefs both support decoding fhandles so this should work on fairly old kernels without issue (though I haven't checked). Lennart suggested there should be a way to get this information from statx(2) so that you can get this new inode identifier without doing a bunch of extra syscalls to verify that inode didn't change between the two syscalls. I have a patchset for this, but I suspect it's too ugly (we can't return the full file handle so we need to hash it). I'll send an RFC later this week or next. [1]: commit 96b2b072ee62 ("exportfs: allow exporting non-decodeable file handles to userspace") > How can we check for this? The POSIX way is to compare st_ino and > st_dev in stat output, but if inode numbers are not unique, that will > result in files falsely being reported as identical. It's harmless in > the temporary file case, but it in other scenarios, it may result in > data loss. (Another problem is that st_dev can be different for the same mount due to subvolumes.) -- Aleksa Sarai Senior Software Engineer (Containers) SUSE Linux GmbH <https://www.cyphar.com/> [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 228 bytes --] ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Testing if two open descriptors refer to the same inode 2024-07-29 9:09 ` Aleksa Sarai @ 2024-07-29 9:29 ` Florian Weimer 0 siblings, 0 replies; 24+ messages in thread From: Florian Weimer @ 2024-07-29 9:29 UTC (permalink / raw) To: Aleksa Sarai; +Cc: linux-fsdevel, linux-kernel, linux-api, Dave Chinner * Aleksa Sarai: > On 2024-07-29, Florian Weimer <fweimer@redhat.com> wrote: >> It was pointed out to me that inode numbers on Linux are no longer >> expected to be unique per file system, even for local file systems. >> Applications sometimes need to check if two (open) files are the same. >> For example, a program may want to use a temporary file if is invoked >> with input and output files referring to the same file. > > Based on the discussions we had at LSF/MM, I believe the "correct" way > now is to do > > name_to_handle_at(fd, "", ..., AT_EMPTY_PATH|AT_HANDLE_FID) > > and then use the fhandle as the key to compare inodes. AT_HANDLE_FID is > needed for filesystems that don't support decoding file handles, and was > added in Linux 6.6[1]. However, I think this inode issue is only > relevant for btree filesystems, and I think both btrfs and bcachefs both > support decoding fhandles so this should work on fairly old kernels > without issue (though I haven't checked). > [1]: commit 96b2b072ee62 ("exportfs: allow exporting non-decodeable file handles to userspace") Thanks, it's not too bad. The name_to_handle_at manual page says that the handle is supposed to be treated as an opaque value, although it mentions AT_HANDLE_FID. I think this needs to be fixed that it's expected to compare the handle bytes, and also say whether it's necessary to compare the type or not. > Lennart suggested there should be a way to get this information from > statx(2) so that you can get this new inode identifier without doing a > bunch of extra syscalls to verify that inode didn't change between the > two syscalls. I have a patchset for this, but I suspect it's too ugly > (we can't return the full file handle so we need to hash it). I'll send > an RFC later this week or next. Hashing these things is rather nasty because it makes things impossible to test. >> How can we check for this? The POSIX way is to compare st_ino and >> st_dev in stat output, but if inode numbers are not unique, that will >> result in files falsely being reported as identical. It's harmless in >> the temporary file case, but it in other scenarios, it may result in >> data loss. > > (Another problem is that st_dev can be different for the same mount due > to subvolumes.) Uh-oh. If st_dev are different, is it still possible that truncating one path will affect the other with the different st_dev value? Thanks, Florian ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Testing if two open descriptors refer to the same inode 2024-07-29 6:55 Testing if two open descriptors refer to the same inode Florian Weimer 2024-07-29 9:09 ` Aleksa Sarai @ 2024-07-29 10:18 ` Mateusz Guzik 2024-07-29 10:40 ` Florian Weimer ` (2 more replies) 2024-07-29 15:24 ` Jeff Layton 2 siblings, 3 replies; 24+ messages in thread From: Mateusz Guzik @ 2024-07-29 10:18 UTC (permalink / raw) To: Florian Weimer; +Cc: linux-fsdevel, linux-kernel, linux-api, Dave Chinner On Mon, Jul 29, 2024 at 08:55:46AM +0200, Florian Weimer wrote: > It was pointed out to me that inode numbers on Linux are no longer > expected to be unique per file system, even for local file systems. I don't know if I'm parsing this correctly. Are you claiming on-disk inode numbers are not guaranteed unique per filesystem? It sounds like utter breakage, with capital 'f'. I know the 32-bit inode allocation code can result in unintentional duplicates after wrap around (see get_next_ino), but that's for in-memory stuff only(?) like pipes, so perhaps tolerable. Anyhow, the kernel recently got F_DUPFD_QUERY which tests if the *file* object is the same. While the above is not what's needed here, I guess it sets a precedent for F_DUPINODE_QUERY (or whatever other name) to be added to handily compare inode pointers. It may be worthwhile regardless of the above. (or maybe kcmp could be extended?) ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Testing if two open descriptors refer to the same inode 2024-07-29 10:18 ` Mateusz Guzik @ 2024-07-29 10:40 ` Florian Weimer 2024-07-29 10:50 ` Mateusz Guzik 2024-07-29 12:26 ` Christian Brauner 2024-07-29 13:36 ` Theodore Ts'o 2 siblings, 1 reply; 24+ messages in thread From: Florian Weimer @ 2024-07-29 10:40 UTC (permalink / raw) To: Mateusz Guzik; +Cc: linux-fsdevel, linux-kernel, linux-api, Dave Chinner * Mateusz Guzik: > On Mon, Jul 29, 2024 at 08:55:46AM +0200, Florian Weimer wrote: >> It was pointed out to me that inode numbers on Linux are no longer >> expected to be unique per file system, even for local file systems. > > I don't know if I'm parsing this correctly. > > Are you claiming on-disk inode numbers are not guaranteed unique per > filesystem? It sounds like utter breakage, with capital 'f'. Yes, POSIX semantics and traditional Linux semantics for POSIX-like local file systems are different. > While the above is not what's needed here, I guess it sets a precedent > for F_DUPINODE_QUERY (or whatever other name) to be added to handily > compare inode pointers. It may be worthwhile regardless of the above. > (or maybe kcmp could be extended?) I looked at kcmp as well, but I think it's dependent on checkpoint/restore. File sameness checks are much more basic than that. Thanks, Florian ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Testing if two open descriptors refer to the same inode 2024-07-29 10:40 ` Florian Weimer @ 2024-07-29 10:50 ` Mateusz Guzik 2024-07-29 10:56 ` Mateusz Guzik 2024-07-29 10:57 ` Florian Weimer 0 siblings, 2 replies; 24+ messages in thread From: Mateusz Guzik @ 2024-07-29 10:50 UTC (permalink / raw) To: Florian Weimer; +Cc: linux-fsdevel, linux-kernel, linux-api, Dave Chinner On Mon, Jul 29, 2024 at 12:40:35PM +0200, Florian Weimer wrote: > * Mateusz Guzik: > > > On Mon, Jul 29, 2024 at 08:55:46AM +0200, Florian Weimer wrote: > >> It was pointed out to me that inode numbers on Linux are no longer > >> expected to be unique per file system, even for local file systems. > > > > I don't know if I'm parsing this correctly. > > > > Are you claiming on-disk inode numbers are not guaranteed unique per > > filesystem? It sounds like utter breakage, with capital 'f'. > > Yes, POSIX semantics and traditional Linux semantics for POSIX-like > local file systems are different. > Can you link me some threads about this? > > While the above is not what's needed here, I guess it sets a precedent > > for F_DUPINODE_QUERY (or whatever other name) to be added to handily > > compare inode pointers. It may be worthwhile regardless of the above. > > (or maybe kcmp could be extended?) > > I looked at kcmp as well, but I think it's dependent on > checkpoint/restore. File sameness checks are much more basic than that. > I had this in mind (untested modulo compilation): diff --git a/fs/fcntl.c b/fs/fcntl.c index 300e5d9ad913..5723c3e82eac 100644 --- a/fs/fcntl.c +++ b/fs/fcntl.c @@ -343,6 +343,13 @@ static long f_dupfd_query(int fd, struct file *filp) return f.file == filp; } +static long f_dupfd_query_inode(int fd, struct file *filp) +{ + CLASS(fd_raw, f)(fd); + + return f.file->f_inode == filp->f_inode; +} + static long do_fcntl(int fd, unsigned int cmd, unsigned long arg, struct file *filp) { @@ -361,6 +368,9 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg, case F_DUPFD_QUERY: err = f_dupfd_query(argi, filp); break; + case F_DUPFD_QUERY_INODE: + err = f_dupfd_query_inode(argi, filp); + break; case F_GETFD: err = get_close_on_exec(fd) ? FD_CLOEXEC : 0; break; diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h index c0bcc185fa48..2e93dbdd8fd2 100644 --- a/include/uapi/linux/fcntl.h +++ b/include/uapi/linux/fcntl.h @@ -16,6 +16,8 @@ #define F_DUPFD_QUERY (F_LINUX_SPECIFIC_BASE + 3) +#define F_DUPFD_QUERY_INODE (F_LINUX_SPECIFIC_BASE + 4) + /* * Cancel a blocking posix lock; internal use only until we expose an * asynchronous lock api to userspace: ^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: Testing if two open descriptors refer to the same inode 2024-07-29 10:50 ` Mateusz Guzik @ 2024-07-29 10:56 ` Mateusz Guzik 2024-07-29 10:57 ` Florian Weimer 1 sibling, 0 replies; 24+ messages in thread From: Mateusz Guzik @ 2024-07-29 10:56 UTC (permalink / raw) To: Florian Weimer; +Cc: linux-fsdevel, linux-kernel, linux-api, Dave Chinner On Mon, Jul 29, 2024 at 12:50 PM Mateusz Guzik <mjguzik@gmail.com> wrote: > > On Mon, Jul 29, 2024 at 12:40:35PM +0200, Florian Weimer wrote: > > * Mateusz Guzik: > > > > > On Mon, Jul 29, 2024 at 08:55:46AM +0200, Florian Weimer wrote: > > >> It was pointed out to me that inode numbers on Linux are no longer > > >> expected to be unique per file system, even for local file systems. > > > > > > I don't know if I'm parsing this correctly. > > > > > > Are you claiming on-disk inode numbers are not guaranteed unique per > > > filesystem? It sounds like utter breakage, with capital 'f'. > > > > Yes, POSIX semantics and traditional Linux semantics for POSIX-like > > local file systems are different. > > > > Can you link me some threads about this? > > > > While the above is not what's needed here, I guess it sets a precedent > > > for F_DUPINODE_QUERY (or whatever other name) to be added to handily > > > compare inode pointers. It may be worthwhile regardless of the above. > > > (or maybe kcmp could be extended?) > > > > I looked at kcmp as well, but I think it's dependent on > > checkpoint/restore. File sameness checks are much more basic than that. > > > > I had this in mind (untested modulo compilation): > > diff --git a/fs/fcntl.c b/fs/fcntl.c > index 300e5d9ad913..5723c3e82eac 100644 > --- a/fs/fcntl.c > +++ b/fs/fcntl.c > @@ -343,6 +343,13 @@ static long f_dupfd_query(int fd, struct file *filp) > return f.file == filp; > } > > +static long f_dupfd_query_inode(int fd, struct file *filp) > +{ > + CLASS(fd_raw, f)(fd); > + > + return f.file->f_inode == filp->f_inode; > +} > + > static long do_fcntl(int fd, unsigned int cmd, unsigned long arg, > struct file *filp) > { > @@ -361,6 +368,9 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg, > case F_DUPFD_QUERY: > err = f_dupfd_query(argi, filp); > break; > + case F_DUPFD_QUERY_INODE: > + err = f_dupfd_query_inode(argi, filp); > + break; > case F_GETFD: > err = get_close_on_exec(fd) ? FD_CLOEXEC : 0; > break; > diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h > index c0bcc185fa48..2e93dbdd8fd2 100644 > --- a/include/uapi/linux/fcntl.h > +++ b/include/uapi/linux/fcntl.h > @@ -16,6 +16,8 @@ > > #define F_DUPFD_QUERY (F_LINUX_SPECIFIC_BASE + 3) > > +#define F_DUPFD_QUERY_INODE (F_LINUX_SPECIFIC_BASE + 4) > + > /* > * Cancel a blocking posix lock; internal use only until we expose an > * asynchronous lock api to userspace: To clarify, if indeed the dev + ino combo is no longer sufficient to conclude it's the same thing, an explicit & handy to use way should be provided. For a case where you got both inodes open something like the above will do the trick, unless I'm missing something. I can do a proper patch posting later after runtime testing and whatnot if you confirm this sorts out your problem. -- Mateusz Guzik <mjguzik gmail.com> ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Testing if two open descriptors refer to the same inode 2024-07-29 10:50 ` Mateusz Guzik 2024-07-29 10:56 ` Mateusz Guzik @ 2024-07-29 10:57 ` Florian Weimer 2024-07-29 11:06 ` Mateusz Guzik 2024-07-29 23:08 ` Dave Chinner 1 sibling, 2 replies; 24+ messages in thread From: Florian Weimer @ 2024-07-29 10:57 UTC (permalink / raw) To: Mateusz Guzik; +Cc: linux-fsdevel, linux-kernel, linux-api, Dave Chinner * Mateusz Guzik: > On Mon, Jul 29, 2024 at 12:40:35PM +0200, Florian Weimer wrote: >> * Mateusz Guzik: >> >> > On Mon, Jul 29, 2024 at 08:55:46AM +0200, Florian Weimer wrote: >> >> It was pointed out to me that inode numbers on Linux are no longer >> >> expected to be unique per file system, even for local file systems. >> > >> > I don't know if I'm parsing this correctly. >> > >> > Are you claiming on-disk inode numbers are not guaranteed unique per >> > filesystem? It sounds like utter breakage, with capital 'f'. >> >> Yes, POSIX semantics and traditional Linux semantics for POSIX-like >> local file systems are different. > > Can you link me some threads about this? Sorry, it was an internal thread. It's supposed to be common knowledge among Linux file system developers. Aleksa referenced LSF/MM discussions. > I had this in mind (untested modulo compilation): > > diff --git a/fs/fcntl.c b/fs/fcntl.c > index 300e5d9ad913..5723c3e82eac 100644 > --- a/fs/fcntl.c > +++ b/fs/fcntl.c > @@ -343,6 +343,13 @@ static long f_dupfd_query(int fd, struct file *filp) > return f.file == filp; > } > > +static long f_dupfd_query_inode(int fd, struct file *filp) > +{ > + CLASS(fd_raw, f)(fd); > + > + return f.file->f_inode == filp->f_inode; > +} > + > static long do_fcntl(int fd, unsigned int cmd, unsigned long arg, > struct file *filp) > { > @@ -361,6 +368,9 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg, > case F_DUPFD_QUERY: > err = f_dupfd_query(argi, filp); > break; > + case F_DUPFD_QUERY_INODE: > + err = f_dupfd_query_inode(argi, filp); > + break; > case F_GETFD: > err = get_close_on_exec(fd) ? FD_CLOEXEC : 0; > break; > diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h > index c0bcc185fa48..2e93dbdd8fd2 100644 > --- a/include/uapi/linux/fcntl.h > +++ b/include/uapi/linux/fcntl.h > @@ -16,6 +16,8 @@ > > #define F_DUPFD_QUERY (F_LINUX_SPECIFIC_BASE + 3) > > +#define F_DUPFD_QUERY_INODE (F_LINUX_SPECIFIC_BASE + 4) > + > /* > * Cancel a blocking posix lock; internal use only until we expose an > * asynchronous lock api to userspace: It's certainly much easier to use than name_to_handle_at, so it looks like a useful option to have. Could we return a three-way comparison result for sorting? Or would that expose too much about kernel pointer values? Thanks, Florian ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Testing if two open descriptors refer to the same inode 2024-07-29 10:57 ` Florian Weimer @ 2024-07-29 11:06 ` Mateusz Guzik 2024-07-29 11:36 ` Florian Weimer ` (2 more replies) 2024-07-29 23:08 ` Dave Chinner 1 sibling, 3 replies; 24+ messages in thread From: Mateusz Guzik @ 2024-07-29 11:06 UTC (permalink / raw) To: Florian Weimer Cc: linux-fsdevel, linux-kernel, linux-api, Dave Chinner, Christian Brauner On Mon, Jul 29, 2024 at 12:57 PM Florian Weimer <fweimer@redhat.com> wrote: > > * Mateusz Guzik: > > > On Mon, Jul 29, 2024 at 12:40:35PM +0200, Florian Weimer wrote: > >> * Mateusz Guzik: > >> > >> > On Mon, Jul 29, 2024 at 08:55:46AM +0200, Florian Weimer wrote: > >> >> It was pointed out to me that inode numbers on Linux are no longer > >> >> expected to be unique per file system, even for local file systems. > >> > > >> > I don't know if I'm parsing this correctly. > >> > > >> > Are you claiming on-disk inode numbers are not guaranteed unique per > >> > filesystem? It sounds like utter breakage, with capital 'f'. > >> > >> Yes, POSIX semantics and traditional Linux semantics for POSIX-like > >> local file systems are different. > > > > Can you link me some threads about this? > > Sorry, it was an internal thread. It's supposed to be common knowledge > among Linux file system developers. Aleksa referenced LSF/MM > discussions. > So much for open development :-P > > I had this in mind (untested modulo compilation): > > > > diff --git a/fs/fcntl.c b/fs/fcntl.c > > index 300e5d9ad913..5723c3e82eac 100644 > > --- a/fs/fcntl.c > > +++ b/fs/fcntl.c > > @@ -343,6 +343,13 @@ static long f_dupfd_query(int fd, struct file *filp) > > return f.file == filp; > > } > > > > +static long f_dupfd_query_inode(int fd, struct file *filp) > > +{ > > + CLASS(fd_raw, f)(fd); > > + > > + return f.file->f_inode == filp->f_inode; > > +} > > + > > static long do_fcntl(int fd, unsigned int cmd, unsigned long arg, > > struct file *filp) > > { > > @@ -361,6 +368,9 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg, > > case F_DUPFD_QUERY: > > err = f_dupfd_query(argi, filp); > > break; > > + case F_DUPFD_QUERY_INODE: > > + err = f_dupfd_query_inode(argi, filp); > > + break; > > case F_GETFD: > > err = get_close_on_exec(fd) ? FD_CLOEXEC : 0; > > break; > > diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h > > index c0bcc185fa48..2e93dbdd8fd2 100644 > > --- a/include/uapi/linux/fcntl.h > > +++ b/include/uapi/linux/fcntl.h > > @@ -16,6 +16,8 @@ > > > > #define F_DUPFD_QUERY (F_LINUX_SPECIFIC_BASE + 3) > > > > +#define F_DUPFD_QUERY_INODE (F_LINUX_SPECIFIC_BASE + 4) > > + > > /* > > * Cancel a blocking posix lock; internal use only until we expose an > > * asynchronous lock api to userspace: > > It's certainly much easier to use than name_to_handle_at, so it looks > like a useful option to have. > > Could we return a three-way comparison result for sorting? Or would > that expose too much about kernel pointer values? > As is this would sort by inode *address* which I don't believe is of any use -- the order has to be assumed arbitrary. Perhaps there is something which is reliably the same and can be combined with something else to be unique system-wide (the magic handle thing?). But even then you would need to justify trying to sort by fcntl calls, which sounds pretty dodgey to me. Given that thing I *suspect* statx() may want to get extended with some guaranteed unique identifier. Then you can sort in userspace all you want. Based on your opening mail I assumed you only need to check 2 files, for which the proposed fcntl does the trick. Or to put it differently: there seems to be more to the picture than in the opening mail, so perhaps you could outline what you are looking for. -- Mateusz Guzik <mjguzik gmail.com> ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Testing if two open descriptors refer to the same inode 2024-07-29 11:06 ` Mateusz Guzik @ 2024-07-29 11:36 ` Florian Weimer 2024-07-29 12:00 ` Mateusz Guzik 2024-07-29 11:40 ` Aleksa Sarai 2024-07-29 11:47 ` Aleksa Sarai 2 siblings, 1 reply; 24+ messages in thread From: Florian Weimer @ 2024-07-29 11:36 UTC (permalink / raw) To: Mateusz Guzik Cc: linux-fsdevel, linux-kernel, linux-api, Dave Chinner, Christian Brauner * Mateusz Guzik: > On Mon, Jul 29, 2024 at 12:57 PM Florian Weimer <fweimer@redhat.com> wrote: >> >> * Mateusz Guzik: >> >> > On Mon, Jul 29, 2024 at 12:40:35PM +0200, Florian Weimer wrote: >> >> * Mateusz Guzik: >> >> >> >> > On Mon, Jul 29, 2024 at 08:55:46AM +0200, Florian Weimer wrote: >> >> >> It was pointed out to me that inode numbers on Linux are no longer >> >> >> expected to be unique per file system, even for local file systems. >> >> > >> >> > I don't know if I'm parsing this correctly. >> >> > >> >> > Are you claiming on-disk inode numbers are not guaranteed unique per >> >> > filesystem? It sounds like utter breakage, with capital 'f'. >> >> >> >> Yes, POSIX semantics and traditional Linux semantics for POSIX-like >> >> local file systems are different. >> > >> > Can you link me some threads about this? >> >> Sorry, it was an internal thread. It's supposed to be common knowledge >> among Linux file system developers. Aleksa referenced LSF/MM >> discussions. >> > > So much for open development :-P I found this pretty quickly, so it does seem widely known: [LSF TOPIC] statx extensions for subvol/snapshot filesystems & more <https://lore.kernel.org/linux-fsdevel/2uvhm6gweyl7iyyp2xpfryvcu2g3padagaeqcbiavjyiis6prl@yjm725bizncq/> >> It's certainly much easier to use than name_to_handle_at, so it looks >> like a useful option to have. >> >> Could we return a three-way comparison result for sorting? Or would >> that expose too much about kernel pointer values? >> > > As is this would sort by inode *address* which I don't believe is of > any use -- the order has to be assumed arbitrary. Doesn't the order remain valid while the files remain open? Anything else doesn't seem reasonable to expect anyway. Thanks, Florian ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Testing if two open descriptors refer to the same inode 2024-07-29 11:36 ` Florian Weimer @ 2024-07-29 12:00 ` Mateusz Guzik 0 siblings, 0 replies; 24+ messages in thread From: Mateusz Guzik @ 2024-07-29 12:00 UTC (permalink / raw) To: Florian Weimer Cc: linux-fsdevel, linux-kernel, linux-api, Dave Chinner, Christian Brauner On Mon, Jul 29, 2024 at 1:37 PM Florian Weimer <fweimer@redhat.com> wrote: > > * Mateusz Guzik: > > > On Mon, Jul 29, 2024 at 12:57 PM Florian Weimer <fweimer@redhat.com> wrote: > >> > >> * Mateusz Guzik: > >> > >> > On Mon, Jul 29, 2024 at 12:40:35PM +0200, Florian Weimer wrote: > >> >> * Mateusz Guzik: > >> >> > >> >> > On Mon, Jul 29, 2024 at 08:55:46AM +0200, Florian Weimer wrote: > >> >> >> It was pointed out to me that inode numbers on Linux are no longer > >> >> >> expected to be unique per file system, even for local file systems. > >> >> > > >> >> > I don't know if I'm parsing this correctly. > >> >> > > >> >> > Are you claiming on-disk inode numbers are not guaranteed unique per > >> >> > filesystem? It sounds like utter breakage, with capital 'f'. > >> >> > >> >> Yes, POSIX semantics and traditional Linux semantics for POSIX-like > >> >> local file systems are different. > >> > > >> > Can you link me some threads about this? > >> > >> Sorry, it was an internal thread. It's supposed to be common knowledge > >> among Linux file system developers. Aleksa referenced LSF/MM > >> discussions. > >> > > > > So much for open development :-P > > I found this pretty quickly, so it does seem widely known: > > [LSF TOPIC] statx extensions for subvol/snapshot filesystems & more > <https://lore.kernel.org/linux-fsdevel/2uvhm6gweyl7iyyp2xpfryvcu2g3padagaeqcbiavjyiis6prl@yjm725bizncq/> > Huh, thanks. > >> It's certainly much easier to use than name_to_handle_at, so it looks > >> like a useful option to have. > >> > >> Could we return a three-way comparison result for sorting? Or would > >> that expose too much about kernel pointer values? > >> > > > > As is this would sort by inode *address* which I don't believe is of > > any use -- the order has to be assumed arbitrary. > > Doesn't the order remain valid while the files remain open? Anything > else doesn't seem reasonable to expect anyway. > They will indeed remain stable in that setting, I am saying ordering may be different after a reboot or if there was some memory reclamation going on between restarts of the program. This is quite a difference from dev + ino combo not suffering these problems. That is to say I don't see what is the benefit of having the kernel provide a way to sort inodes in a way which can give different results. -- Mateusz Guzik <mjguzik gmail.com> ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Testing if two open descriptors refer to the same inode 2024-07-29 11:06 ` Mateusz Guzik 2024-07-29 11:36 ` Florian Weimer @ 2024-07-29 11:40 ` Aleksa Sarai 2024-07-31 18:07 ` David Sterba 2024-07-29 11:47 ` Aleksa Sarai 2 siblings, 1 reply; 24+ messages in thread From: Aleksa Sarai @ 2024-07-29 11:40 UTC (permalink / raw) To: Mateusz Guzik Cc: Florian Weimer, linux-fsdevel, linux-kernel, linux-api, Dave Chinner, Christian Brauner [-- Attachment #1: Type: text/plain, Size: 2182 bytes --] On 2024-07-29, Mateusz Guzik <mjguzik@gmail.com> wrote: > On Mon, Jul 29, 2024 at 12:57 PM Florian Weimer <fweimer@redhat.com> wrote: > > > On Mon, Jul 29, 2024 at 12:40:35PM +0200, Florian Weimer wrote: > > >> > On Mon, Jul 29, 2024 at 08:55:46AM +0200, Florian Weimer wrote: > > >> >> It was pointed out to me that inode numbers on Linux are no longer > > >> >> expected to be unique per file system, even for local file systems. > > >> > > > >> > I don't know if I'm parsing this correctly. > > >> > > > >> > Are you claiming on-disk inode numbers are not guaranteed unique per > > >> > filesystem? It sounds like utter breakage, with capital 'f'. > > >> > > >> Yes, POSIX semantics and traditional Linux semantics for POSIX-like > > >> local file systems are different. > > > > > > Can you link me some threads about this? > > > > Sorry, it was an internal thread. It's supposed to be common knowledge > > among Linux file system developers. Aleksa referenced LSF/MM > > discussions. > > So much for open development :-P To be clear, this wasn't _decided_ at LSF/MM, it was brought up as a topic. There is an LWN article about the session that mentions the issue[1]. My understanding is that the btrfs and bcachefs folks independently determined they cannot provide this guarantee. As far as I understand, the reason why is that inode number allocation on btree filesystems stores information about location and some other bits (maybe subvolumes) in the bits, making it harder to guarantee there will be no collisions. Don't quote me on that though, I'm sure they'll tell me I'm wrong when they wake up. :D As the article mentions, Kent Overstreet suggested trying to see how many things break if you build a kernel where all inode numbers are the same (my guess would be "very badly" -- aside from the classic problem of hardlink detection, a lot of programs key things by (dev, ino), and some inode numbers are guaranteed by the kernel for pseudo-filesystems like PROC_ROOT_INO). [1]: https://lwn.net/Articles/975444/ -- Aleksa Sarai Senior Software Engineer (Containers) SUSE Linux GmbH <https://www.cyphar.com/> [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 228 bytes --] ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Testing if two open descriptors refer to the same inode 2024-07-29 11:40 ` Aleksa Sarai @ 2024-07-31 18:07 ` David Sterba 0 siblings, 0 replies; 24+ messages in thread From: David Sterba @ 2024-07-31 18:07 UTC (permalink / raw) To: Aleksa Sarai Cc: Mateusz Guzik, Florian Weimer, linux-fsdevel, linux-kernel, linux-api, Dave Chinner, Christian Brauner On Mon, Jul 29, 2024 at 09:40:57PM +1000, Aleksa Sarai wrote: > On 2024-07-29, Mateusz Guzik <mjguzik@gmail.com> wrote: > > On Mon, Jul 29, 2024 at 12:57 PM Florian Weimer <fweimer@redhat.com> wrote: > > > > On Mon, Jul 29, 2024 at 12:40:35PM +0200, Florian Weimer wrote: > > > >> > On Mon, Jul 29, 2024 at 08:55:46AM +0200, Florian Weimer wrote: > > > >> >> It was pointed out to me that inode numbers on Linux are no longer > > > >> >> expected to be unique per file system, even for local file systems. > > > >> > > > > >> > I don't know if I'm parsing this correctly. > > > >> > > > > >> > Are you claiming on-disk inode numbers are not guaranteed unique per > > > >> > filesystem? It sounds like utter breakage, with capital 'f'. > > > >> > > > >> Yes, POSIX semantics and traditional Linux semantics for POSIX-like > > > >> local file systems are different. > > > > > > > > Can you link me some threads about this? > > > > > > Sorry, it was an internal thread. It's supposed to be common knowledge > > > among Linux file system developers. Aleksa referenced LSF/MM > > > discussions. > > > > So much for open development :-P > > To be clear, this wasn't _decided_ at LSF/MM, it was brought up as a > topic. There is an LWN article about the session that mentions the > issue[1]. A discussion about inode numbers or subvolumes comes up every year with better of worse suggestions what to do about it. > My understanding is that the btrfs and bcachefs folks independently > determined they cannot provide this guarantee. As far as I understand, > the reason why is that inode number allocation on btree filesystems > stores information about location and some other bits (maybe subvolumes) > in the bits, making it harder to guarantee there will be no collisions. No, on btrfs the inode numbers don't encode anything about location, it's a simple number. The inode numbers remain the same when a snapshot is taken as it's a 1:1 clone of the file hierarchy, the directory representing a subvolume/snapshot has fixed inode number 256. The only difference is the internal subvolume id. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Testing if two open descriptors refer to the same inode 2024-07-29 11:06 ` Mateusz Guzik 2024-07-29 11:36 ` Florian Weimer 2024-07-29 11:40 ` Aleksa Sarai @ 2024-07-29 11:47 ` Aleksa Sarai 2024-07-29 12:12 ` Mateusz Guzik 2 siblings, 1 reply; 24+ messages in thread From: Aleksa Sarai @ 2024-07-29 11:47 UTC (permalink / raw) To: Mateusz Guzik Cc: Florian Weimer, linux-fsdevel, linux-kernel, linux-api, Dave Chinner, Christian Brauner [-- Attachment #1: Type: text/plain, Size: 4688 bytes --] On 2024-07-29, Mateusz Guzik <mjguzik@gmail.com> wrote: > On Mon, Jul 29, 2024 at 12:57 PM Florian Weimer <fweimer@redhat.com> wrote: > > > > * Mateusz Guzik: > > > > > On Mon, Jul 29, 2024 at 12:40:35PM +0200, Florian Weimer wrote: > > >> * Mateusz Guzik: > > >> > > >> > On Mon, Jul 29, 2024 at 08:55:46AM +0200, Florian Weimer wrote: > > >> >> It was pointed out to me that inode numbers on Linux are no longer > > >> >> expected to be unique per file system, even for local file systems. > > >> > > > >> > I don't know if I'm parsing this correctly. > > >> > > > >> > Are you claiming on-disk inode numbers are not guaranteed unique per > > >> > filesystem? It sounds like utter breakage, with capital 'f'. > > >> > > >> Yes, POSIX semantics and traditional Linux semantics for POSIX-like > > >> local file systems are different. > > > > > > Can you link me some threads about this? > > > > Sorry, it was an internal thread. It's supposed to be common knowledge > > among Linux file system developers. Aleksa referenced LSF/MM > > discussions. > > > > So much for open development :-P > > > > I had this in mind (untested modulo compilation): > > > > > > diff --git a/fs/fcntl.c b/fs/fcntl.c > > > index 300e5d9ad913..5723c3e82eac 100644 > > > --- a/fs/fcntl.c > > > +++ b/fs/fcntl.c > > > @@ -343,6 +343,13 @@ static long f_dupfd_query(int fd, struct file *filp) > > > return f.file == filp; > > > } > > > > > > +static long f_dupfd_query_inode(int fd, struct file *filp) > > > +{ > > > + CLASS(fd_raw, f)(fd); > > > + > > > + return f.file->f_inode == filp->f_inode; > > > +} > > > + > > > static long do_fcntl(int fd, unsigned int cmd, unsigned long arg, > > > struct file *filp) > > > { > > > @@ -361,6 +368,9 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg, > > > case F_DUPFD_QUERY: > > > err = f_dupfd_query(argi, filp); > > > break; > > > + case F_DUPFD_QUERY_INODE: > > > + err = f_dupfd_query_inode(argi, filp); > > > + break; > > > case F_GETFD: > > > err = get_close_on_exec(fd) ? FD_CLOEXEC : 0; > > > break; > > > diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h > > > index c0bcc185fa48..2e93dbdd8fd2 100644 > > > --- a/include/uapi/linux/fcntl.h > > > +++ b/include/uapi/linux/fcntl.h > > > @@ -16,6 +16,8 @@ > > > > > > #define F_DUPFD_QUERY (F_LINUX_SPECIFIC_BASE + 3) > > > > > > +#define F_DUPFD_QUERY_INODE (F_LINUX_SPECIFIC_BASE + 4) > > > + > > > /* > > > * Cancel a blocking posix lock; internal use only until we expose an > > > * asynchronous lock api to userspace: > > > > It's certainly much easier to use than name_to_handle_at, so it looks > > like a useful option to have. > > > > Could we return a three-way comparison result for sorting? Or would > > that expose too much about kernel pointer values? > > > > As is this would sort by inode *address* which I don't believe is of > any use -- the order has to be assumed arbitrary. > > Perhaps there is something which is reliably the same and can be > combined with something else to be unique system-wide (the magic > handle thing?). > > But even then you would need to justify trying to sort by fcntl calls, > which sounds pretty dodgey to me. Programs need to key things by (dev, ino) currently, so you need to be able to get some kind of ordinal that you can sort with. If we really want to make an interface to let you do this without exposing hashes in statx, then kcmp(2) makes more sense, but having to keep a file descriptor for each entry in a hashtable would obviously cause -EMFILE issues. > Given that thing I *suspect* statx() may want to get extended with > some guaranteed unique identifier. Then you can sort in userspace all > you want. Yeah, this is what the hashed fhandle patch I have does. > Based on your opening mail I assumed you only need to check 2 files, > for which the proposed fcntl does the trick. > > Or to put it differently: there seems to be more to the picture than > in the opening mail, so perhaps you could outline what you are looking > for. Hardlink detection requires creating a hashmap of (dev, ino) to find hardlinks. Pair-wise checking is not sufficient for that usecase (which AFAIK is the most common thing people use inode numbers for -- it's at least probably the most common thing people run in practice since archive tools do this.) -- Aleksa Sarai Senior Software Engineer (Containers) SUSE Linux GmbH <https://www.cyphar.com/> [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 228 bytes --] ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Testing if two open descriptors refer to the same inode 2024-07-29 11:47 ` Aleksa Sarai @ 2024-07-29 12:12 ` Mateusz Guzik 2024-07-29 23:19 ` Dave Chinner 0 siblings, 1 reply; 24+ messages in thread From: Mateusz Guzik @ 2024-07-29 12:12 UTC (permalink / raw) To: Aleksa Sarai Cc: Florian Weimer, linux-fsdevel, linux-kernel, linux-api, Dave Chinner, Christian Brauner On Mon, Jul 29, 2024 at 1:47 PM Aleksa Sarai <cyphar@cyphar.com> wrote: > > On 2024-07-29, Mateusz Guzik <mjguzik@gmail.com> wrote: > > On Mon, Jul 29, 2024 at 12:57 PM Florian Weimer <fweimer@redhat.com> wrote: > > > > > > * Mateusz Guzik: > > > > > > > On Mon, Jul 29, 2024 at 12:40:35PM +0200, Florian Weimer wrote: > > > >> * Mateusz Guzik: > > > >> > > > >> > On Mon, Jul 29, 2024 at 08:55:46AM +0200, Florian Weimer wrote: > > > >> >> It was pointed out to me that inode numbers on Linux are no longer > > > >> >> expected to be unique per file system, even for local file systems. > > > >> > > > > >> > I don't know if I'm parsing this correctly. > > > >> > > > > >> > Are you claiming on-disk inode numbers are not guaranteed unique per > > > >> > filesystem? It sounds like utter breakage, with capital 'f'. > > > >> > > > >> Yes, POSIX semantics and traditional Linux semantics for POSIX-like > > > >> local file systems are different. > > > > > > > > Can you link me some threads about this? > > > > > > Sorry, it was an internal thread. It's supposed to be common knowledge > > > among Linux file system developers. Aleksa referenced LSF/MM > > > discussions. > > > > > > > So much for open development :-P > > > > > > I had this in mind (untested modulo compilation): > > > > > > > > diff --git a/fs/fcntl.c b/fs/fcntl.c > > > > index 300e5d9ad913..5723c3e82eac 100644 > > > > --- a/fs/fcntl.c > > > > +++ b/fs/fcntl.c > > > > @@ -343,6 +343,13 @@ static long f_dupfd_query(int fd, struct file *filp) > > > > return f.file == filp; > > > > } > > > > > > > > +static long f_dupfd_query_inode(int fd, struct file *filp) > > > > +{ > > > > + CLASS(fd_raw, f)(fd); > > > > + > > > > + return f.file->f_inode == filp->f_inode; > > > > +} > > > > + > > > > static long do_fcntl(int fd, unsigned int cmd, unsigned long arg, > > > > struct file *filp) > > > > { > > > > @@ -361,6 +368,9 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg, > > > > case F_DUPFD_QUERY: > > > > err = f_dupfd_query(argi, filp); > > > > break; > > > > + case F_DUPFD_QUERY_INODE: > > > > + err = f_dupfd_query_inode(argi, filp); > > > > + break; > > > > case F_GETFD: > > > > err = get_close_on_exec(fd) ? FD_CLOEXEC : 0; > > > > break; > > > > diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h > > > > index c0bcc185fa48..2e93dbdd8fd2 100644 > > > > --- a/include/uapi/linux/fcntl.h > > > > +++ b/include/uapi/linux/fcntl.h > > > > @@ -16,6 +16,8 @@ > > > > > > > > #define F_DUPFD_QUERY (F_LINUX_SPECIFIC_BASE + 3) > > > > > > > > +#define F_DUPFD_QUERY_INODE (F_LINUX_SPECIFIC_BASE + 4) > > > > + > > > > /* > > > > * Cancel a blocking posix lock; internal use only until we expose an > > > > * asynchronous lock api to userspace: > > > > > > It's certainly much easier to use than name_to_handle_at, so it looks > > > like a useful option to have. > > > > > > Could we return a three-way comparison result for sorting? Or would > > > that expose too much about kernel pointer values? > > > > > > > As is this would sort by inode *address* which I don't believe is of > > any use -- the order has to be assumed arbitrary. > > > > Perhaps there is something which is reliably the same and can be > > combined with something else to be unique system-wide (the magic > > handle thing?). > > > > But even then you would need to justify trying to sort by fcntl calls, > > which sounds pretty dodgey to me. > > Programs need to key things by (dev, ino) currently, so you need to be > able to get some kind of ordinal that you can sort with. > That I know, except normally that's done by sorting by (f)stat results. > If we really want to make an interface to let you do this without > exposing hashes in statx, then kcmp(2) makes more sense, but having to > keep a file descriptor for each entry in a hashtable would obviously > cause -EMFILE issues. > Agreed, hence the proposal to extend statx. > > Given that thing I *suspect* statx() may want to get extended with > > some guaranteed unique identifier. Then you can sort in userspace all > > you want. > > Yeah, this is what the hashed fhandle patch I have does. > Ok, I see your other e-mail. > > Based on your opening mail I assumed you only need to check 2 files, > > for which the proposed fcntl does the trick. > > > > Or to put it differently: there seems to be more to the picture than > > in the opening mail, so perhaps you could outline what you are looking > > for. > > Hardlink detection requires creating a hashmap of (dev, ino) to find > hardlinks. Pair-wise checking is not sufficient for that usecase (which > AFAIK is the most common thing people use inode numbers for -- it's at > least probably the most common thing people run in practice since > archive tools do this.) > So if you have *numerous* files to check then indeed the fcntl is no good, but the sorting variant is no good either -- you don't know what key to look stuff up by since you don't know any of the addresses (obfuscated or otherwise). There needs to be a dev + ino replacement which retains the property of being reliable between reboots and so on. Since you said you have a patchset which exports something in statx, chances are this is sorted out -- I'm gong to wait for that, meanwhile I'm not going to submit my fcntl anywhere -- hopefuly it will be avoided. :) -- Mateusz Guzik <mjguzik gmail.com> ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Testing if two open descriptors refer to the same inode 2024-07-29 12:12 ` Mateusz Guzik @ 2024-07-29 23:19 ` Dave Chinner 0 siblings, 0 replies; 24+ messages in thread From: Dave Chinner @ 2024-07-29 23:19 UTC (permalink / raw) To: Mateusz Guzik Cc: Aleksa Sarai, Florian Weimer, linux-fsdevel, linux-kernel, linux-api, Dave Chinner, Christian Brauner On Mon, Jul 29, 2024 at 02:12:15PM +0200, Mateusz Guzik wrote: > There needs to be a dev + ino replacement which retains the property > of being reliable between reboots and so on. Yes, that's exactly what filehandles provide. OTOH, dev + ino by itself does not provide this guarantee, as st_dev is not guaranteed to be constant across reboots because of things like async device discovery and enumeration. i.e. userspace has to do a bunch of work to determine the "st_dev" for a given filesystem actually refers to the same filesystem it did before the system restarted (e.g. via FS uuid checks). Filehandles already contain persistent filesystem identifiers, and so userspace doesn't have to do anything special to ensure that comparisons across reboots are valid. -Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Testing if two open descriptors refer to the same inode 2024-07-29 10:57 ` Florian Weimer 2024-07-29 11:06 ` Mateusz Guzik @ 2024-07-29 23:08 ` Dave Chinner 1 sibling, 0 replies; 24+ messages in thread From: Dave Chinner @ 2024-07-29 23:08 UTC (permalink / raw) To: Florian Weimer Cc: Mateusz Guzik, linux-fsdevel, linux-kernel, linux-api, Dave Chinner On Mon, Jul 29, 2024 at 12:57:14PM +0200, Florian Weimer wrote: > * Mateusz Guzik: > > > On Mon, Jul 29, 2024 at 12:40:35PM +0200, Florian Weimer wrote: > >> * Mateusz Guzik: > >> > >> > On Mon, Jul 29, 2024 at 08:55:46AM +0200, Florian Weimer wrote: > >> >> It was pointed out to me that inode numbers on Linux are no longer > >> >> expected to be unique per file system, even for local file systems. > >> > > >> > I don't know if I'm parsing this correctly. > >> > > >> > Are you claiming on-disk inode numbers are not guaranteed unique per > >> > filesystem? It sounds like utter breakage, with capital 'f'. > >> > >> Yes, POSIX semantics and traditional Linux semantics for POSIX-like > >> local file systems are different. > > > > Can you link me some threads about this? > > Sorry, it was an internal thread. It's supposed to be common knowledge > among Linux file system developers. btrfs has been dealing with this issue since snapshots/subvols were first introduced some 15-odd years ago. This isn't a new problem... https://lore.kernel.org/linux-fsdevel/20231025210654.GA2892534@perftesting/ > Aleksa referenced LSF/MM > discussions. I also referenced those discussions and the -fsdevel discussions that have been happening for quite some time. I'll reference some of them here again, because they are all out in the open.... https://lwn.net/Articles/975444/ https://lore.kernel.org/linux-fsdevel/?q=st_vol https://lore.kernel.org/linux-fsdevel/20231211233231.oiazgkqs7yahruuw@moria.home.lan/ -Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Testing if two open descriptors refer to the same inode 2024-07-29 10:18 ` Mateusz Guzik 2024-07-29 10:40 ` Florian Weimer @ 2024-07-29 12:26 ` Christian Brauner 2024-07-29 13:36 ` Theodore Ts'o 2 siblings, 0 replies; 24+ messages in thread From: Christian Brauner @ 2024-07-29 12:26 UTC (permalink / raw) To: Mateusz Guzik Cc: Florian Weimer, linux-fsdevel, linux-kernel, linux-api, Dave Chinner On Mon, Jul 29, 2024 at 12:18:15PM GMT, Mateusz Guzik wrote: > On Mon, Jul 29, 2024 at 08:55:46AM +0200, Florian Weimer wrote: > > It was pointed out to me that inode numbers on Linux are no longer > > expected to be unique per file system, even for local file systems. > > I don't know if I'm parsing this correctly. > > Are you claiming on-disk inode numbers are not guaranteed unique per > filesystem? It sounds like utter breakage, with capital 'f'. > > I know the 32-bit inode allocation code can result in unintentional > duplicates after wrap around (see get_next_ino), but that's for > in-memory stuff only(?) like pipes, so perhaps tolerable. > > Anyhow, the kernel recently got F_DUPFD_QUERY which tests if the *file* > object is the same. > > While the above is not what's needed here, I guess it sets a precedent > for F_DUPINODE_QUERY (or whatever other name) to be added to handily > compare inode pointers. It may be worthwhile regardless of the above. > (or maybe kcmp could be extended?) We don't use kcmp() for such core operations. It's overkill and security sensitive so quite a few configs don't enable it. See [1] for details how kcmp() can be used to facilitate UAF attacks and defeat probabilistic UAF mitigations by using ordering information. So that ordering requirement should really go out the window. Another thing is that kcmp() works cross-process which is also out of scope for this. kcmp() really should be reserved for stuff that checkpoint/restore in userspace needs and that's otherwise too fruity to be implemented in our regular apis. [1]: https://googleprojectzero.blogspot.com/2021/10/how-simple-linux-kernel-memory.html ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Testing if two open descriptors refer to the same inode 2024-07-29 10:18 ` Mateusz Guzik 2024-07-29 10:40 ` Florian Weimer 2024-07-29 12:26 ` Christian Brauner @ 2024-07-29 13:36 ` Theodore Ts'o 2024-07-30 2:31 ` Dave Chinner 2 siblings, 1 reply; 24+ messages in thread From: Theodore Ts'o @ 2024-07-29 13:36 UTC (permalink / raw) To: Mateusz Guzik Cc: Florian Weimer, linux-fsdevel, linux-kernel, linux-api, Dave Chinner On Mon, Jul 29, 2024 at 12:18:15PM +0200, Mateusz Guzik wrote: > > Are you claiming on-disk inode numbers are not guaranteed unique per > filesystem? It sounds like utter breakage, with capital 'f'. The reality is that there exists file systems which do not return unique inode numbers. For example, there are virtiofs implementations which pass the inode numbers straight through with a fixed dev_t. If you have a large number of packages mounted via iscsi, and those packages include shared libraries, then you can have two different shared libraries with the same inode number, and then you can watch the dynamic liunker get Very Confused, and debugging the problem can be.... interesting. (Three gueses how I found out about this, and the first two don't count. Yes, we figured out a workaround.) So that breakage exists already, today. For people who don't like this, they can stick to those file systems that still guarantee unique inode numbers, at least for local disk file systems --- for example, to use ext4 and xfs, over btrfs and bcachefs. However, this is a short-term expedient, and in the long term, we will need to guide userspace to use something that is more likely to work, such as file handles. And ideally, this needs to be standardized at venues such as the Austin Group, so that it becomes interfaces which are used across operating systems, not just for Linux. It's going to be a multi-year, if not decade-long, effort... - Ted ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Testing if two open descriptors refer to the same inode 2024-07-29 13:36 ` Theodore Ts'o @ 2024-07-30 2:31 ` Dave Chinner 2024-07-30 4:19 ` Theodore Ts'o 2024-07-30 15:38 ` Christoph Hellwig 0 siblings, 2 replies; 24+ messages in thread From: Dave Chinner @ 2024-07-30 2:31 UTC (permalink / raw) To: Theodore Ts'o Cc: Mateusz Guzik, Florian Weimer, linux-fsdevel, linux-kernel, linux-api, Dave Chinner On Mon, Jul 29, 2024 at 09:36:01AM -0400, Theodore Ts'o wrote: > On Mon, Jul 29, 2024 at 12:18:15PM +0200, Mateusz Guzik wrote: > > > > Are you claiming on-disk inode numbers are not guaranteed unique per > > filesystem? It sounds like utter breakage, with capital 'f'. > > The reality is that there exists file systems which do not return > unique inode numbers. For example, there are virtiofs implementations > which pass the inode numbers straight through with a fixed dev_t. If > you have a large number of packages mounted via iscsi, and those > packages include shared libraries, then you can have two different > shared libraries with the same inode number, and then you can watch > the dynamic liunker get Very Confused, and debugging the problem can > be.... interesting. (Three gueses how I found out about this, and the > first two don't count. Yes, we figured out a workaround.) > > So that breakage exists already, today. > > For people who don't like this, they can stick to those file systems > that still guarantee unique inode numbers, at least for local disk > file systems --- for example, to use ext4 and xfs, over btrfs and > bcachefs. I don't think you can make such a simplistic delineation, because there's more than one issue at play here. There are at least two different "is this inode identical" use cases that {st_dev,st_ino} is being used for. The first, as Florian described, is to determine if two open fds refer to the same inode for collision avoidance. This works on traditional filesystems like ext4 and XFS, but isn't reliable on filesystems with integrated snapshot/subvolume functionality. The second is that {dev,ino} is being used to disambiguate paths that point to hardlinked inodes for the purposes of identifying and optimising access and replication of shared (i.e. non-unique) file data. This works on traditional filesystems like ext4, but fails badly on filesystem that support FICLONERANGE (XFS, btrfs, NFS, CIFS, bcachefs, etc) because cloned files have unique inodes but non-unique data. > However, this is a short-term expedient, and in the long term, we will > need to guide userspace to use something that is more likely to work, > such as file handles. The first case can be done with filehandles - it's a simple resolution of fds to filehandles and compare the opaque filehandles. That's the short-term solution because it's the easy one to solve. However, filehandles do not work for the solving the second case. Hardlinks are basically a mechanism for avoiding data copying within the same filesystem. i.e. hardlink disambiguation is essentially the process of detecting which dirents point to the same shared data. We can detect a hardlinked inode simply by looking at the link count, and we use the inode number to determine that they point to the same physical storage. Applications like tar and rsync are detecting hard links to avoid two main issues: - moving the same data multiple times - causing destination write amplification by storing the same data in multiple places They avoid these by creating a hardlink map of the directory structure being processed, and then recreate that hardlink map at the destination. We could use filehandles for that, too, and then we wouldn't be relying on {dev,ino} for this, either. However, any application that is using inode number or filehandle comparisons to detect data uniqueness does not work if other applications and utilities are using reflink copies rather than hardlinks for space efficient data copying. Let's all keep this in mind here: the default behaviour of `cp` is to use file clones on filesystems that support them over physical data copies. I have maybe half a dozen hardlinks in most of my local XFS filesystems, but I have many tens of thousands of cloned files in those same filesystems. IOWs, any tool that is using {dev,ino} as a proxy for data uniqueness is fundamentally deficient on any filesystem that supports file cloning. Given the potential for badness in replicating filesystems full of cloned data, it's far more important and higher priority for such utilities to move away from using {dev,ino} to detect data uniqueness. Handling cloned data efficiently requires this, and that's a far better reason for moving away from {dev,ino} based disambiguation than "oh, it doesn't work on btrfs properly". Detecting "is the data unique" is not that hard - e.g. we could add a statx flag to say "this inode has shared data" - and then userspace can tag that inode as needing data disambiguation before starting to move data. However, data disambiguation (i.e. finding what inodes share the data at which file offset) is a much harder problem. This largely requires knowledge of the entire layout of the filesystem, and so it's really only a question the filesystem itself can resolve. We already provide such an interface for XFS with ioctl(GETFSMAP). It is driven by the on-disk reverse mapping btrees, and so can quickly answer the "what {inode,offset,len} tuples share this physical extent" question. The interface is generic, however, so asking such a question and determining the answer is .... complex. That is our long term challenge: replacing the use of {dev,ino} for data uniqueness disambiguation. Making the identification of owners of non-unique/shared data simple for applications to use and fast for filesystems to resolve will be a challenge. -Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Testing if two open descriptors refer to the same inode 2024-07-30 2:31 ` Dave Chinner @ 2024-07-30 4:19 ` Theodore Ts'o 2024-07-30 15:38 ` Christoph Hellwig 1 sibling, 0 replies; 24+ messages in thread From: Theodore Ts'o @ 2024-07-30 4:19 UTC (permalink / raw) To: Dave Chinner Cc: Mateusz Guzik, Florian Weimer, linux-fsdevel, linux-kernel, linux-api, Dave Chinner On Tue, Jul 30, 2024 at 12:31:57PM +1000, Dave Chinner wrote: > I don't think you can make such a simplistic delineation, because > there's more than one issue at play here. > > There are at least two different "is this inode identical" > use cases that {st_dev,st_ino} is being used for. > > The first, as Florian described, is to determine if two open fds > refer to the same inode for collision avoidance. > > This works on traditional filesystems like ext4 and XFS, but isn't > reliable on filesystems with integrated snapshot/subvolume > functionality. That's fair, but the first is the problem I think is more important, because there are existing programs which are depending on st_dev/st_ino as being a reliable way of detecting uniqueness --- and if this breas, file data may get corrpted, or the dyanmic linker will assume that two unrelated shared libraries are actually the same, with hilarity then ensuing --- because an interface which is guaranteed by decade of Unix history, and POSIX, is broken by some Linux file systems. > The second is that {dev,ino} is being used to disambiguate paths > that point to hardlinked inodes for the purposes of identifying > and optimising access and replication of shared (i.e. non-unique) > file data. > > This works on traditional filesystems like ext4, but fails badly on > filesystem that support FICLONERANGE (XFS, btrfs, NFS, CIFS, > bcachefs, etc) because cloned files have unique inodes but > non-unique data. That's a problem, yes --- but it's a previously unsolved problem, and the failure will cause inefficiency, but it doesn't actually cause data corruption. It's also a very hard problem, especially if we're considering the full, general FICLONERANGE interface, where an arbitrary subset of blocks at one offset, might be cloned at a completely different offset in a different file, and where a single file might have cloned ranges from a dozens of other files. How this information would be communicated to userspace, so they could copy a directory hierarchy without an increased expansion is a hard problem. Given that we have a simple solution (filehandles) to fix a problem where false positives causes lost data or core dumps, let's solve the simple problem and try to get it standardized acrossed operating systems beyond Linux, and in parallel, we can try to figure out a much more complicated interface to solve this other problem. - Ted ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Testing if two open descriptors refer to the same inode 2024-07-30 2:31 ` Dave Chinner 2024-07-30 4:19 ` Theodore Ts'o @ 2024-07-30 15:38 ` Christoph Hellwig 1 sibling, 0 replies; 24+ messages in thread From: Christoph Hellwig @ 2024-07-30 15:38 UTC (permalink / raw) To: Dave Chinner Cc: Theodore Ts'o, Mateusz Guzik, Florian Weimer, linux-fsdevel, linux-kernel, linux-api, Dave Chinner On Tue, Jul 30, 2024 at 12:31:57PM +1000, Dave Chinner wrote: > There are at least two different "is this inode identical" > use cases that {st_dev,st_ino} is being used for. > > The first, as Florian described, is to determine if two open fds > refer to the same inode for collision avoidance. > > This works on traditional filesystems like ext4 and XFS, but isn't > reliable on filesystems with integrated snapshot/subvolume > functionality. It's not about snapshot, it's about file systems being broken. Even btrfs for example always has a unique st_dev,st_ino pair, it can just unexpectly change at any subvolume root and not just at a mount point. > That is our long term challenge: replacing the use of {dev,ino} for > data uniqueness disambiguation. Making the identification of owners > of non-unique/shared data simple for applications to use and fast > for filesystems to resolve will be a challenge. I don't think there is any way to provide such a guarantee as there is so many levels of cloning or dedup, many of which are totally invisible to the high level file system interface. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Testing if two open descriptors refer to the same inode 2024-07-29 6:55 Testing if two open descriptors refer to the same inode Florian Weimer 2024-07-29 9:09 ` Aleksa Sarai 2024-07-29 10:18 ` Mateusz Guzik @ 2024-07-29 15:24 ` Jeff Layton 2024-07-29 15:39 ` Florian Weimer 2 siblings, 1 reply; 24+ messages in thread From: Jeff Layton @ 2024-07-29 15:24 UTC (permalink / raw) To: Florian Weimer, linux-fsdevel, linux-kernel, linux-api; +Cc: Dave Chinner On Mon, 2024-07-29 at 08:55 +0200, Florian Weimer wrote: > It was pointed out to me that inode numbers on Linux are no longer > expected to be unique per file system, even for local file systems. > Applications sometimes need to check if two (open) files are the > same. > For example, a program may want to use a temporary file if is invoked > with input and output files referring to the same file. > > How can we check for this? The POSIX way is to compare st_ino and > st_dev in stat output, but if inode numbers are not unique, that will > result in files falsely being reported as identical. It's harmless > in > the temporary file case, but it in other scenarios, it may result in > data loss. > I believe this is the problem that STATX_SUBVOL was intended to solve. Both bcachefs and btrfs will provide this attribute if requested. So, basically to uniquely ID an inode using statx, you need a tuple of: stx_dev_major/minor stx_subvol stx_ino If the filesystem doesn't provide STATX_SUBVOL, then one can (likely) conclude that stx_dev_* and stx_ino are enough. -- Jeff Layton <jlayton@kernel.org> ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Testing if two open descriptors refer to the same inode 2024-07-29 15:24 ` Jeff Layton @ 2024-07-29 15:39 ` Florian Weimer 0 siblings, 0 replies; 24+ messages in thread From: Florian Weimer @ 2024-07-29 15:39 UTC (permalink / raw) To: Jeff Layton; +Cc: linux-fsdevel, linux-kernel, linux-api, Dave Chinner * Jeff Layton: > On Mon, 2024-07-29 at 08:55 +0200, Florian Weimer wrote: >> It was pointed out to me that inode numbers on Linux are no longer >> expected to be unique per file system, even for local file systems. >> Applications sometimes need to check if two (open) files are the >> same. >> For example, a program may want to use a temporary file if is invoked >> with input and output files referring to the same file. >> >> How can we check for this? The POSIX way is to compare st_ino and >> st_dev in stat output, but if inode numbers are not unique, that will >> result in files falsely being reported as identical. It's harmless >> in >> the temporary file case, but it in other scenarios, it may result in >> data loss. >> > > I believe this is the problem that STATX_SUBVOL was intended to solve. > > Both bcachefs and btrfs will provide this attribute if requested. So, > basically to uniquely ID an inode using statx, you need a tuple of: > > stx_dev_major/minor > stx_subvol > stx_ino > > If the filesystem doesn't provide STATX_SUBVOL, then one can (likely) > conclude that stx_dev_* and stx_ino are enough. Does this really work for the virtiofs case, though? It has to pass through all three *and* make things unique relative to the host, I think. Thanks, Florian ^ permalink raw reply [flat|nested] 24+ messages in thread
end of thread, other threads:[~2024-07-31 18:07 UTC | newest] Thread overview: 24+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2024-07-29 6:55 Testing if two open descriptors refer to the same inode Florian Weimer 2024-07-29 9:09 ` Aleksa Sarai 2024-07-29 9:29 ` Florian Weimer 2024-07-29 10:18 ` Mateusz Guzik 2024-07-29 10:40 ` Florian Weimer 2024-07-29 10:50 ` Mateusz Guzik 2024-07-29 10:56 ` Mateusz Guzik 2024-07-29 10:57 ` Florian Weimer 2024-07-29 11:06 ` Mateusz Guzik 2024-07-29 11:36 ` Florian Weimer 2024-07-29 12:00 ` Mateusz Guzik 2024-07-29 11:40 ` Aleksa Sarai 2024-07-31 18:07 ` David Sterba 2024-07-29 11:47 ` Aleksa Sarai 2024-07-29 12:12 ` Mateusz Guzik 2024-07-29 23:19 ` Dave Chinner 2024-07-29 23:08 ` Dave Chinner 2024-07-29 12:26 ` Christian Brauner 2024-07-29 13:36 ` Theodore Ts'o 2024-07-30 2:31 ` Dave Chinner 2024-07-30 4:19 ` Theodore Ts'o 2024-07-30 15:38 ` Christoph Hellwig 2024-07-29 15:24 ` Jeff Layton 2024-07-29 15:39 ` Florian Weimer
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).