* Testing if two open descriptors refer to the same inode
@ 2024-07-29 6:55 Florian Weimer
2024-07-29 9:09 ` Aleksa Sarai
` (2 more replies)
0 siblings, 3 replies; 24+ messages in thread
From: Florian Weimer @ 2024-07-29 6:55 UTC (permalink / raw)
To: linux-fsdevel, linux-kernel, linux-api; +Cc: Dave Chinner
It was pointed out to me that inode numbers on Linux are no longer
expected to be unique per file system, even for local file systems.
Applications sometimes need to check if two (open) files are the same.
For example, a program may want to use a temporary file if is invoked
with input and output files referring to the same file.
How can we check for this? The POSIX way is to compare st_ino and
st_dev in stat output, but if inode numbers are not unique, that will
result in files falsely being reported as identical. It's harmless in
the temporary file case, but it in other scenarios, it may result in
data loss.
Thanks,
Florian
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Testing if two open descriptors refer to the same inode
2024-07-29 6:55 Testing if two open descriptors refer to the same inode Florian Weimer
@ 2024-07-29 9:09 ` Aleksa Sarai
2024-07-29 9:29 ` Florian Weimer
2024-07-29 10:18 ` Mateusz Guzik
2024-07-29 15:24 ` Jeff Layton
2 siblings, 1 reply; 24+ messages in thread
From: Aleksa Sarai @ 2024-07-29 9:09 UTC (permalink / raw)
To: Florian Weimer; +Cc: linux-fsdevel, linux-kernel, linux-api, Dave Chinner
[-- Attachment #1: Type: text/plain, Size: 1953 bytes --]
On 2024-07-29, Florian Weimer <fweimer@redhat.com> wrote:
> It was pointed out to me that inode numbers on Linux are no longer
> expected to be unique per file system, even for local file systems.
> Applications sometimes need to check if two (open) files are the same.
> For example, a program may want to use a temporary file if is invoked
> with input and output files referring to the same file.
Based on the discussions we had at LSF/MM, I believe the "correct" way
now is to do
name_to_handle_at(fd, "", ..., AT_EMPTY_PATH|AT_HANDLE_FID)
and then use the fhandle as the key to compare inodes. AT_HANDLE_FID is
needed for filesystems that don't support decoding file handles, and was
added in Linux 6.6[1]. However, I think this inode issue is only
relevant for btree filesystems, and I think both btrfs and bcachefs both
support decoding fhandles so this should work on fairly old kernels
without issue (though I haven't checked).
Lennart suggested there should be a way to get this information from
statx(2) so that you can get this new inode identifier without doing a
bunch of extra syscalls to verify that inode didn't change between the
two syscalls. I have a patchset for this, but I suspect it's too ugly
(we can't return the full file handle so we need to hash it). I'll send
an RFC later this week or next.
[1]: commit 96b2b072ee62 ("exportfs: allow exporting non-decodeable file handles to userspace")
> How can we check for this? The POSIX way is to compare st_ino and
> st_dev in stat output, but if inode numbers are not unique, that will
> result in files falsely being reported as identical. It's harmless in
> the temporary file case, but it in other scenarios, it may result in
> data loss.
(Another problem is that st_dev can be different for the same mount due
to subvolumes.)
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Testing if two open descriptors refer to the same inode
2024-07-29 9:09 ` Aleksa Sarai
@ 2024-07-29 9:29 ` Florian Weimer
0 siblings, 0 replies; 24+ messages in thread
From: Florian Weimer @ 2024-07-29 9:29 UTC (permalink / raw)
To: Aleksa Sarai; +Cc: linux-fsdevel, linux-kernel, linux-api, Dave Chinner
* Aleksa Sarai:
> On 2024-07-29, Florian Weimer <fweimer@redhat.com> wrote:
>> It was pointed out to me that inode numbers on Linux are no longer
>> expected to be unique per file system, even for local file systems.
>> Applications sometimes need to check if two (open) files are the same.
>> For example, a program may want to use a temporary file if is invoked
>> with input and output files referring to the same file.
>
> Based on the discussions we had at LSF/MM, I believe the "correct" way
> now is to do
>
> name_to_handle_at(fd, "", ..., AT_EMPTY_PATH|AT_HANDLE_FID)
>
> and then use the fhandle as the key to compare inodes. AT_HANDLE_FID is
> needed for filesystems that don't support decoding file handles, and was
> added in Linux 6.6[1]. However, I think this inode issue is only
> relevant for btree filesystems, and I think both btrfs and bcachefs both
> support decoding fhandles so this should work on fairly old kernels
> without issue (though I haven't checked).
> [1]: commit 96b2b072ee62 ("exportfs: allow exporting non-decodeable file handles to userspace")
Thanks, it's not too bad. The name_to_handle_at manual page says that
the handle is supposed to be treated as an opaque value, although it
mentions AT_HANDLE_FID. I think this needs to be fixed that it's
expected to compare the handle bytes, and also say whether it's
necessary to compare the type or not.
> Lennart suggested there should be a way to get this information from
> statx(2) so that you can get this new inode identifier without doing a
> bunch of extra syscalls to verify that inode didn't change between the
> two syscalls. I have a patchset for this, but I suspect it's too ugly
> (we can't return the full file handle so we need to hash it). I'll send
> an RFC later this week or next.
Hashing these things is rather nasty because it makes things impossible
to test.
>> How can we check for this? The POSIX way is to compare st_ino and
>> st_dev in stat output, but if inode numbers are not unique, that will
>> result in files falsely being reported as identical. It's harmless in
>> the temporary file case, but it in other scenarios, it may result in
>> data loss.
>
> (Another problem is that st_dev can be different for the same mount due
> to subvolumes.)
Uh-oh. If st_dev are different, is it still possible that truncating
one path will affect the other with the different st_dev value?
Thanks,
Florian
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Testing if two open descriptors refer to the same inode
2024-07-29 6:55 Testing if two open descriptors refer to the same inode Florian Weimer
2024-07-29 9:09 ` Aleksa Sarai
@ 2024-07-29 10:18 ` Mateusz Guzik
2024-07-29 10:40 ` Florian Weimer
` (2 more replies)
2024-07-29 15:24 ` Jeff Layton
2 siblings, 3 replies; 24+ messages in thread
From: Mateusz Guzik @ 2024-07-29 10:18 UTC (permalink / raw)
To: Florian Weimer; +Cc: linux-fsdevel, linux-kernel, linux-api, Dave Chinner
On Mon, Jul 29, 2024 at 08:55:46AM +0200, Florian Weimer wrote:
> It was pointed out to me that inode numbers on Linux are no longer
> expected to be unique per file system, even for local file systems.
I don't know if I'm parsing this correctly.
Are you claiming on-disk inode numbers are not guaranteed unique per
filesystem? It sounds like utter breakage, with capital 'f'.
I know the 32-bit inode allocation code can result in unintentional
duplicates after wrap around (see get_next_ino), but that's for
in-memory stuff only(?) like pipes, so perhaps tolerable.
Anyhow, the kernel recently got F_DUPFD_QUERY which tests if the *file*
object is the same.
While the above is not what's needed here, I guess it sets a precedent
for F_DUPINODE_QUERY (or whatever other name) to be added to handily
compare inode pointers. It may be worthwhile regardless of the above.
(or maybe kcmp could be extended?)
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Testing if two open descriptors refer to the same inode
2024-07-29 10:18 ` Mateusz Guzik
@ 2024-07-29 10:40 ` Florian Weimer
2024-07-29 10:50 ` Mateusz Guzik
2024-07-29 12:26 ` Christian Brauner
2024-07-29 13:36 ` Theodore Ts'o
2 siblings, 1 reply; 24+ messages in thread
From: Florian Weimer @ 2024-07-29 10:40 UTC (permalink / raw)
To: Mateusz Guzik; +Cc: linux-fsdevel, linux-kernel, linux-api, Dave Chinner
* Mateusz Guzik:
> On Mon, Jul 29, 2024 at 08:55:46AM +0200, Florian Weimer wrote:
>> It was pointed out to me that inode numbers on Linux are no longer
>> expected to be unique per file system, even for local file systems.
>
> I don't know if I'm parsing this correctly.
>
> Are you claiming on-disk inode numbers are not guaranteed unique per
> filesystem? It sounds like utter breakage, with capital 'f'.
Yes, POSIX semantics and traditional Linux semantics for POSIX-like
local file systems are different.
> While the above is not what's needed here, I guess it sets a precedent
> for F_DUPINODE_QUERY (or whatever other name) to be added to handily
> compare inode pointers. It may be worthwhile regardless of the above.
> (or maybe kcmp could be extended?)
I looked at kcmp as well, but I think it's dependent on
checkpoint/restore. File sameness checks are much more basic than that.
Thanks,
Florian
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Testing if two open descriptors refer to the same inode
2024-07-29 10:40 ` Florian Weimer
@ 2024-07-29 10:50 ` Mateusz Guzik
2024-07-29 10:56 ` Mateusz Guzik
2024-07-29 10:57 ` Florian Weimer
0 siblings, 2 replies; 24+ messages in thread
From: Mateusz Guzik @ 2024-07-29 10:50 UTC (permalink / raw)
To: Florian Weimer; +Cc: linux-fsdevel, linux-kernel, linux-api, Dave Chinner
On Mon, Jul 29, 2024 at 12:40:35PM +0200, Florian Weimer wrote:
> * Mateusz Guzik:
>
> > On Mon, Jul 29, 2024 at 08:55:46AM +0200, Florian Weimer wrote:
> >> It was pointed out to me that inode numbers on Linux are no longer
> >> expected to be unique per file system, even for local file systems.
> >
> > I don't know if I'm parsing this correctly.
> >
> > Are you claiming on-disk inode numbers are not guaranteed unique per
> > filesystem? It sounds like utter breakage, with capital 'f'.
>
> Yes, POSIX semantics and traditional Linux semantics for POSIX-like
> local file systems are different.
>
Can you link me some threads about this?
> > While the above is not what's needed here, I guess it sets a precedent
> > for F_DUPINODE_QUERY (or whatever other name) to be added to handily
> > compare inode pointers. It may be worthwhile regardless of the above.
> > (or maybe kcmp could be extended?)
>
> I looked at kcmp as well, but I think it's dependent on
> checkpoint/restore. File sameness checks are much more basic than that.
>
I had this in mind (untested modulo compilation):
diff --git a/fs/fcntl.c b/fs/fcntl.c
index 300e5d9ad913..5723c3e82eac 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -343,6 +343,13 @@ static long f_dupfd_query(int fd, struct file *filp)
return f.file == filp;
}
+static long f_dupfd_query_inode(int fd, struct file *filp)
+{
+ CLASS(fd_raw, f)(fd);
+
+ return f.file->f_inode == filp->f_inode;
+}
+
static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
struct file *filp)
{
@@ -361,6 +368,9 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
case F_DUPFD_QUERY:
err = f_dupfd_query(argi, filp);
break;
+ case F_DUPFD_QUERY_INODE:
+ err = f_dupfd_query_inode(argi, filp);
+ break;
case F_GETFD:
err = get_close_on_exec(fd) ? FD_CLOEXEC : 0;
break;
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index c0bcc185fa48..2e93dbdd8fd2 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -16,6 +16,8 @@
#define F_DUPFD_QUERY (F_LINUX_SPECIFIC_BASE + 3)
+#define F_DUPFD_QUERY_INODE (F_LINUX_SPECIFIC_BASE + 4)
+
/*
* Cancel a blocking posix lock; internal use only until we expose an
* asynchronous lock api to userspace:
^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: Testing if two open descriptors refer to the same inode
2024-07-29 10:50 ` Mateusz Guzik
@ 2024-07-29 10:56 ` Mateusz Guzik
2024-07-29 10:57 ` Florian Weimer
1 sibling, 0 replies; 24+ messages in thread
From: Mateusz Guzik @ 2024-07-29 10:56 UTC (permalink / raw)
To: Florian Weimer; +Cc: linux-fsdevel, linux-kernel, linux-api, Dave Chinner
On Mon, Jul 29, 2024 at 12:50 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> On Mon, Jul 29, 2024 at 12:40:35PM +0200, Florian Weimer wrote:
> > * Mateusz Guzik:
> >
> > > On Mon, Jul 29, 2024 at 08:55:46AM +0200, Florian Weimer wrote:
> > >> It was pointed out to me that inode numbers on Linux are no longer
> > >> expected to be unique per file system, even for local file systems.
> > >
> > > I don't know if I'm parsing this correctly.
> > >
> > > Are you claiming on-disk inode numbers are not guaranteed unique per
> > > filesystem? It sounds like utter breakage, with capital 'f'.
> >
> > Yes, POSIX semantics and traditional Linux semantics for POSIX-like
> > local file systems are different.
> >
>
> Can you link me some threads about this?
>
> > > While the above is not what's needed here, I guess it sets a precedent
> > > for F_DUPINODE_QUERY (or whatever other name) to be added to handily
> > > compare inode pointers. It may be worthwhile regardless of the above.
> > > (or maybe kcmp could be extended?)
> >
> > I looked at kcmp as well, but I think it's dependent on
> > checkpoint/restore. File sameness checks are much more basic than that.
> >
>
> I had this in mind (untested modulo compilation):
>
> diff --git a/fs/fcntl.c b/fs/fcntl.c
> index 300e5d9ad913..5723c3e82eac 100644
> --- a/fs/fcntl.c
> +++ b/fs/fcntl.c
> @@ -343,6 +343,13 @@ static long f_dupfd_query(int fd, struct file *filp)
> return f.file == filp;
> }
>
> +static long f_dupfd_query_inode(int fd, struct file *filp)
> +{
> + CLASS(fd_raw, f)(fd);
> +
> + return f.file->f_inode == filp->f_inode;
> +}
> +
> static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
> struct file *filp)
> {
> @@ -361,6 +368,9 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
> case F_DUPFD_QUERY:
> err = f_dupfd_query(argi, filp);
> break;
> + case F_DUPFD_QUERY_INODE:
> + err = f_dupfd_query_inode(argi, filp);
> + break;
> case F_GETFD:
> err = get_close_on_exec(fd) ? FD_CLOEXEC : 0;
> break;
> diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
> index c0bcc185fa48..2e93dbdd8fd2 100644
> --- a/include/uapi/linux/fcntl.h
> +++ b/include/uapi/linux/fcntl.h
> @@ -16,6 +16,8 @@
>
> #define F_DUPFD_QUERY (F_LINUX_SPECIFIC_BASE + 3)
>
> +#define F_DUPFD_QUERY_INODE (F_LINUX_SPECIFIC_BASE + 4)
> +
> /*
> * Cancel a blocking posix lock; internal use only until we expose an
> * asynchronous lock api to userspace:
To clarify, if indeed the dev + ino combo is no longer sufficient to
conclude it's the same thing, an explicit & handy to use way should be
provided.
For a case where you got both inodes open something like the above
will do the trick, unless I'm missing something. I can do a proper
patch posting later after runtime testing and whatnot if you confirm
this sorts out your problem.
--
Mateusz Guzik <mjguzik gmail.com>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Testing if two open descriptors refer to the same inode
2024-07-29 10:50 ` Mateusz Guzik
2024-07-29 10:56 ` Mateusz Guzik
@ 2024-07-29 10:57 ` Florian Weimer
2024-07-29 11:06 ` Mateusz Guzik
2024-07-29 23:08 ` Dave Chinner
1 sibling, 2 replies; 24+ messages in thread
From: Florian Weimer @ 2024-07-29 10:57 UTC (permalink / raw)
To: Mateusz Guzik; +Cc: linux-fsdevel, linux-kernel, linux-api, Dave Chinner
* Mateusz Guzik:
> On Mon, Jul 29, 2024 at 12:40:35PM +0200, Florian Weimer wrote:
>> * Mateusz Guzik:
>>
>> > On Mon, Jul 29, 2024 at 08:55:46AM +0200, Florian Weimer wrote:
>> >> It was pointed out to me that inode numbers on Linux are no longer
>> >> expected to be unique per file system, even for local file systems.
>> >
>> > I don't know if I'm parsing this correctly.
>> >
>> > Are you claiming on-disk inode numbers are not guaranteed unique per
>> > filesystem? It sounds like utter breakage, with capital 'f'.
>>
>> Yes, POSIX semantics and traditional Linux semantics for POSIX-like
>> local file systems are different.
>
> Can you link me some threads about this?
Sorry, it was an internal thread. It's supposed to be common knowledge
among Linux file system developers. Aleksa referenced LSF/MM
discussions.
> I had this in mind (untested modulo compilation):
>
> diff --git a/fs/fcntl.c b/fs/fcntl.c
> index 300e5d9ad913..5723c3e82eac 100644
> --- a/fs/fcntl.c
> +++ b/fs/fcntl.c
> @@ -343,6 +343,13 @@ static long f_dupfd_query(int fd, struct file *filp)
> return f.file == filp;
> }
>
> +static long f_dupfd_query_inode(int fd, struct file *filp)
> +{
> + CLASS(fd_raw, f)(fd);
> +
> + return f.file->f_inode == filp->f_inode;
> +}
> +
> static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
> struct file *filp)
> {
> @@ -361,6 +368,9 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
> case F_DUPFD_QUERY:
> err = f_dupfd_query(argi, filp);
> break;
> + case F_DUPFD_QUERY_INODE:
> + err = f_dupfd_query_inode(argi, filp);
> + break;
> case F_GETFD:
> err = get_close_on_exec(fd) ? FD_CLOEXEC : 0;
> break;
> diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
> index c0bcc185fa48..2e93dbdd8fd2 100644
> --- a/include/uapi/linux/fcntl.h
> +++ b/include/uapi/linux/fcntl.h
> @@ -16,6 +16,8 @@
>
> #define F_DUPFD_QUERY (F_LINUX_SPECIFIC_BASE + 3)
>
> +#define F_DUPFD_QUERY_INODE (F_LINUX_SPECIFIC_BASE + 4)
> +
> /*
> * Cancel a blocking posix lock; internal use only until we expose an
> * asynchronous lock api to userspace:
It's certainly much easier to use than name_to_handle_at, so it looks
like a useful option to have.
Could we return a three-way comparison result for sorting? Or would
that expose too much about kernel pointer values?
Thanks,
Florian
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Testing if two open descriptors refer to the same inode
2024-07-29 10:57 ` Florian Weimer
@ 2024-07-29 11:06 ` Mateusz Guzik
2024-07-29 11:36 ` Florian Weimer
` (2 more replies)
2024-07-29 23:08 ` Dave Chinner
1 sibling, 3 replies; 24+ messages in thread
From: Mateusz Guzik @ 2024-07-29 11:06 UTC (permalink / raw)
To: Florian Weimer
Cc: linux-fsdevel, linux-kernel, linux-api, Dave Chinner,
Christian Brauner
On Mon, Jul 29, 2024 at 12:57 PM Florian Weimer <fweimer@redhat.com> wrote:
>
> * Mateusz Guzik:
>
> > On Mon, Jul 29, 2024 at 12:40:35PM +0200, Florian Weimer wrote:
> >> * Mateusz Guzik:
> >>
> >> > On Mon, Jul 29, 2024 at 08:55:46AM +0200, Florian Weimer wrote:
> >> >> It was pointed out to me that inode numbers on Linux are no longer
> >> >> expected to be unique per file system, even for local file systems.
> >> >
> >> > I don't know if I'm parsing this correctly.
> >> >
> >> > Are you claiming on-disk inode numbers are not guaranteed unique per
> >> > filesystem? It sounds like utter breakage, with capital 'f'.
> >>
> >> Yes, POSIX semantics and traditional Linux semantics for POSIX-like
> >> local file systems are different.
> >
> > Can you link me some threads about this?
>
> Sorry, it was an internal thread. It's supposed to be common knowledge
> among Linux file system developers. Aleksa referenced LSF/MM
> discussions.
>
So much for open development :-P
> > I had this in mind (untested modulo compilation):
> >
> > diff --git a/fs/fcntl.c b/fs/fcntl.c
> > index 300e5d9ad913..5723c3e82eac 100644
> > --- a/fs/fcntl.c
> > +++ b/fs/fcntl.c
> > @@ -343,6 +343,13 @@ static long f_dupfd_query(int fd, struct file *filp)
> > return f.file == filp;
> > }
> >
> > +static long f_dupfd_query_inode(int fd, struct file *filp)
> > +{
> > + CLASS(fd_raw, f)(fd);
> > +
> > + return f.file->f_inode == filp->f_inode;
> > +}
> > +
> > static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
> > struct file *filp)
> > {
> > @@ -361,6 +368,9 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
> > case F_DUPFD_QUERY:
> > err = f_dupfd_query(argi, filp);
> > break;
> > + case F_DUPFD_QUERY_INODE:
> > + err = f_dupfd_query_inode(argi, filp);
> > + break;
> > case F_GETFD:
> > err = get_close_on_exec(fd) ? FD_CLOEXEC : 0;
> > break;
> > diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
> > index c0bcc185fa48..2e93dbdd8fd2 100644
> > --- a/include/uapi/linux/fcntl.h
> > +++ b/include/uapi/linux/fcntl.h
> > @@ -16,6 +16,8 @@
> >
> > #define F_DUPFD_QUERY (F_LINUX_SPECIFIC_BASE + 3)
> >
> > +#define F_DUPFD_QUERY_INODE (F_LINUX_SPECIFIC_BASE + 4)
> > +
> > /*
> > * Cancel a blocking posix lock; internal use only until we expose an
> > * asynchronous lock api to userspace:
>
> It's certainly much easier to use than name_to_handle_at, so it looks
> like a useful option to have.
>
> Could we return a three-way comparison result for sorting? Or would
> that expose too much about kernel pointer values?
>
As is this would sort by inode *address* which I don't believe is of
any use -- the order has to be assumed arbitrary.
Perhaps there is something which is reliably the same and can be
combined with something else to be unique system-wide (the magic
handle thing?).
But even then you would need to justify trying to sort by fcntl calls,
which sounds pretty dodgey to me.
Given that thing I *suspect* statx() may want to get extended with
some guaranteed unique identifier. Then you can sort in userspace all
you want.
Based on your opening mail I assumed you only need to check 2 files,
for which the proposed fcntl does the trick.
Or to put it differently: there seems to be more to the picture than
in the opening mail, so perhaps you could outline what you are looking
for.
--
Mateusz Guzik <mjguzik gmail.com>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Testing if two open descriptors refer to the same inode
2024-07-29 11:06 ` Mateusz Guzik
@ 2024-07-29 11:36 ` Florian Weimer
2024-07-29 12:00 ` Mateusz Guzik
2024-07-29 11:40 ` Aleksa Sarai
2024-07-29 11:47 ` Aleksa Sarai
2 siblings, 1 reply; 24+ messages in thread
From: Florian Weimer @ 2024-07-29 11:36 UTC (permalink / raw)
To: Mateusz Guzik
Cc: linux-fsdevel, linux-kernel, linux-api, Dave Chinner,
Christian Brauner
* Mateusz Guzik:
> On Mon, Jul 29, 2024 at 12:57 PM Florian Weimer <fweimer@redhat.com> wrote:
>>
>> * Mateusz Guzik:
>>
>> > On Mon, Jul 29, 2024 at 12:40:35PM +0200, Florian Weimer wrote:
>> >> * Mateusz Guzik:
>> >>
>> >> > On Mon, Jul 29, 2024 at 08:55:46AM +0200, Florian Weimer wrote:
>> >> >> It was pointed out to me that inode numbers on Linux are no longer
>> >> >> expected to be unique per file system, even for local file systems.
>> >> >
>> >> > I don't know if I'm parsing this correctly.
>> >> >
>> >> > Are you claiming on-disk inode numbers are not guaranteed unique per
>> >> > filesystem? It sounds like utter breakage, with capital 'f'.
>> >>
>> >> Yes, POSIX semantics and traditional Linux semantics for POSIX-like
>> >> local file systems are different.
>> >
>> > Can you link me some threads about this?
>>
>> Sorry, it was an internal thread. It's supposed to be common knowledge
>> among Linux file system developers. Aleksa referenced LSF/MM
>> discussions.
>>
>
> So much for open development :-P
I found this pretty quickly, so it does seem widely known:
[LSF TOPIC] statx extensions for subvol/snapshot filesystems & more
<https://lore.kernel.org/linux-fsdevel/2uvhm6gweyl7iyyp2xpfryvcu2g3padagaeqcbiavjyiis6prl@yjm725bizncq/>
>> It's certainly much easier to use than name_to_handle_at, so it looks
>> like a useful option to have.
>>
>> Could we return a three-way comparison result for sorting? Or would
>> that expose too much about kernel pointer values?
>>
>
> As is this would sort by inode *address* which I don't believe is of
> any use -- the order has to be assumed arbitrary.
Doesn't the order remain valid while the files remain open? Anything
else doesn't seem reasonable to expect anyway.
Thanks,
Florian
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Testing if two open descriptors refer to the same inode
2024-07-29 11:06 ` Mateusz Guzik
2024-07-29 11:36 ` Florian Weimer
@ 2024-07-29 11:40 ` Aleksa Sarai
2024-07-31 18:07 ` David Sterba
2024-07-29 11:47 ` Aleksa Sarai
2 siblings, 1 reply; 24+ messages in thread
From: Aleksa Sarai @ 2024-07-29 11:40 UTC (permalink / raw)
To: Mateusz Guzik
Cc: Florian Weimer, linux-fsdevel, linux-kernel, linux-api,
Dave Chinner, Christian Brauner
[-- Attachment #1: Type: text/plain, Size: 2182 bytes --]
On 2024-07-29, Mateusz Guzik <mjguzik@gmail.com> wrote:
> On Mon, Jul 29, 2024 at 12:57 PM Florian Weimer <fweimer@redhat.com> wrote:
> > > On Mon, Jul 29, 2024 at 12:40:35PM +0200, Florian Weimer wrote:
> > >> > On Mon, Jul 29, 2024 at 08:55:46AM +0200, Florian Weimer wrote:
> > >> >> It was pointed out to me that inode numbers on Linux are no longer
> > >> >> expected to be unique per file system, even for local file systems.
> > >> >
> > >> > I don't know if I'm parsing this correctly.
> > >> >
> > >> > Are you claiming on-disk inode numbers are not guaranteed unique per
> > >> > filesystem? It sounds like utter breakage, with capital 'f'.
> > >>
> > >> Yes, POSIX semantics and traditional Linux semantics for POSIX-like
> > >> local file systems are different.
> > >
> > > Can you link me some threads about this?
> >
> > Sorry, it was an internal thread. It's supposed to be common knowledge
> > among Linux file system developers. Aleksa referenced LSF/MM
> > discussions.
>
> So much for open development :-P
To be clear, this wasn't _decided_ at LSF/MM, it was brought up as a
topic. There is an LWN article about the session that mentions the
issue[1].
My understanding is that the btrfs and bcachefs folks independently
determined they cannot provide this guarantee. As far as I understand,
the reason why is that inode number allocation on btree filesystems
stores information about location and some other bits (maybe subvolumes)
in the bits, making it harder to guarantee there will be no collisions.
Don't quote me on that though, I'm sure they'll tell me I'm wrong when
they wake up. :D
As the article mentions, Kent Overstreet suggested trying to see how
many things break if you build a kernel where all inode numbers are the
same (my guess would be "very badly" -- aside from the classic problem
of hardlink detection, a lot of programs key things by (dev, ino), and
some inode numbers are guaranteed by the kernel for pseudo-filesystems
like PROC_ROOT_INO).
[1]: https://lwn.net/Articles/975444/
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Testing if two open descriptors refer to the same inode
2024-07-29 11:06 ` Mateusz Guzik
2024-07-29 11:36 ` Florian Weimer
2024-07-29 11:40 ` Aleksa Sarai
@ 2024-07-29 11:47 ` Aleksa Sarai
2024-07-29 12:12 ` Mateusz Guzik
2 siblings, 1 reply; 24+ messages in thread
From: Aleksa Sarai @ 2024-07-29 11:47 UTC (permalink / raw)
To: Mateusz Guzik
Cc: Florian Weimer, linux-fsdevel, linux-kernel, linux-api,
Dave Chinner, Christian Brauner
[-- Attachment #1: Type: text/plain, Size: 4688 bytes --]
On 2024-07-29, Mateusz Guzik <mjguzik@gmail.com> wrote:
> On Mon, Jul 29, 2024 at 12:57 PM Florian Weimer <fweimer@redhat.com> wrote:
> >
> > * Mateusz Guzik:
> >
> > > On Mon, Jul 29, 2024 at 12:40:35PM +0200, Florian Weimer wrote:
> > >> * Mateusz Guzik:
> > >>
> > >> > On Mon, Jul 29, 2024 at 08:55:46AM +0200, Florian Weimer wrote:
> > >> >> It was pointed out to me that inode numbers on Linux are no longer
> > >> >> expected to be unique per file system, even for local file systems.
> > >> >
> > >> > I don't know if I'm parsing this correctly.
> > >> >
> > >> > Are you claiming on-disk inode numbers are not guaranteed unique per
> > >> > filesystem? It sounds like utter breakage, with capital 'f'.
> > >>
> > >> Yes, POSIX semantics and traditional Linux semantics for POSIX-like
> > >> local file systems are different.
> > >
> > > Can you link me some threads about this?
> >
> > Sorry, it was an internal thread. It's supposed to be common knowledge
> > among Linux file system developers. Aleksa referenced LSF/MM
> > discussions.
> >
>
> So much for open development :-P
>
> > > I had this in mind (untested modulo compilation):
> > >
> > > diff --git a/fs/fcntl.c b/fs/fcntl.c
> > > index 300e5d9ad913..5723c3e82eac 100644
> > > --- a/fs/fcntl.c
> > > +++ b/fs/fcntl.c
> > > @@ -343,6 +343,13 @@ static long f_dupfd_query(int fd, struct file *filp)
> > > return f.file == filp;
> > > }
> > >
> > > +static long f_dupfd_query_inode(int fd, struct file *filp)
> > > +{
> > > + CLASS(fd_raw, f)(fd);
> > > +
> > > + return f.file->f_inode == filp->f_inode;
> > > +}
> > > +
> > > static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
> > > struct file *filp)
> > > {
> > > @@ -361,6 +368,9 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
> > > case F_DUPFD_QUERY:
> > > err = f_dupfd_query(argi, filp);
> > > break;
> > > + case F_DUPFD_QUERY_INODE:
> > > + err = f_dupfd_query_inode(argi, filp);
> > > + break;
> > > case F_GETFD:
> > > err = get_close_on_exec(fd) ? FD_CLOEXEC : 0;
> > > break;
> > > diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
> > > index c0bcc185fa48..2e93dbdd8fd2 100644
> > > --- a/include/uapi/linux/fcntl.h
> > > +++ b/include/uapi/linux/fcntl.h
> > > @@ -16,6 +16,8 @@
> > >
> > > #define F_DUPFD_QUERY (F_LINUX_SPECIFIC_BASE + 3)
> > >
> > > +#define F_DUPFD_QUERY_INODE (F_LINUX_SPECIFIC_BASE + 4)
> > > +
> > > /*
> > > * Cancel a blocking posix lock; internal use only until we expose an
> > > * asynchronous lock api to userspace:
> >
> > It's certainly much easier to use than name_to_handle_at, so it looks
> > like a useful option to have.
> >
> > Could we return a three-way comparison result for sorting? Or would
> > that expose too much about kernel pointer values?
> >
>
> As is this would sort by inode *address* which I don't believe is of
> any use -- the order has to be assumed arbitrary.
>
> Perhaps there is something which is reliably the same and can be
> combined with something else to be unique system-wide (the magic
> handle thing?).
>
> But even then you would need to justify trying to sort by fcntl calls,
> which sounds pretty dodgey to me.
Programs need to key things by (dev, ino) currently, so you need to be
able to get some kind of ordinal that you can sort with.
If we really want to make an interface to let you do this without
exposing hashes in statx, then kcmp(2) makes more sense, but having to
keep a file descriptor for each entry in a hashtable would obviously
cause -EMFILE issues.
> Given that thing I *suspect* statx() may want to get extended with
> some guaranteed unique identifier. Then you can sort in userspace all
> you want.
Yeah, this is what the hashed fhandle patch I have does.
> Based on your opening mail I assumed you only need to check 2 files,
> for which the proposed fcntl does the trick.
>
> Or to put it differently: there seems to be more to the picture than
> in the opening mail, so perhaps you could outline what you are looking
> for.
Hardlink detection requires creating a hashmap of (dev, ino) to find
hardlinks. Pair-wise checking is not sufficient for that usecase (which
AFAIK is the most common thing people use inode numbers for -- it's at
least probably the most common thing people run in practice since
archive tools do this.)
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Testing if two open descriptors refer to the same inode
2024-07-29 11:36 ` Florian Weimer
@ 2024-07-29 12:00 ` Mateusz Guzik
0 siblings, 0 replies; 24+ messages in thread
From: Mateusz Guzik @ 2024-07-29 12:00 UTC (permalink / raw)
To: Florian Weimer
Cc: linux-fsdevel, linux-kernel, linux-api, Dave Chinner,
Christian Brauner
On Mon, Jul 29, 2024 at 1:37 PM Florian Weimer <fweimer@redhat.com> wrote:
>
> * Mateusz Guzik:
>
> > On Mon, Jul 29, 2024 at 12:57 PM Florian Weimer <fweimer@redhat.com> wrote:
> >>
> >> * Mateusz Guzik:
> >>
> >> > On Mon, Jul 29, 2024 at 12:40:35PM +0200, Florian Weimer wrote:
> >> >> * Mateusz Guzik:
> >> >>
> >> >> > On Mon, Jul 29, 2024 at 08:55:46AM +0200, Florian Weimer wrote:
> >> >> >> It was pointed out to me that inode numbers on Linux are no longer
> >> >> >> expected to be unique per file system, even for local file systems.
> >> >> >
> >> >> > I don't know if I'm parsing this correctly.
> >> >> >
> >> >> > Are you claiming on-disk inode numbers are not guaranteed unique per
> >> >> > filesystem? It sounds like utter breakage, with capital 'f'.
> >> >>
> >> >> Yes, POSIX semantics and traditional Linux semantics for POSIX-like
> >> >> local file systems are different.
> >> >
> >> > Can you link me some threads about this?
> >>
> >> Sorry, it was an internal thread. It's supposed to be common knowledge
> >> among Linux file system developers. Aleksa referenced LSF/MM
> >> discussions.
> >>
> >
> > So much for open development :-P
>
> I found this pretty quickly, so it does seem widely known:
>
> [LSF TOPIC] statx extensions for subvol/snapshot filesystems & more
> <https://lore.kernel.org/linux-fsdevel/2uvhm6gweyl7iyyp2xpfryvcu2g3padagaeqcbiavjyiis6prl@yjm725bizncq/>
>
Huh, thanks.
> >> It's certainly much easier to use than name_to_handle_at, so it looks
> >> like a useful option to have.
> >>
> >> Could we return a three-way comparison result for sorting? Or would
> >> that expose too much about kernel pointer values?
> >>
> >
> > As is this would sort by inode *address* which I don't believe is of
> > any use -- the order has to be assumed arbitrary.
>
> Doesn't the order remain valid while the files remain open? Anything
> else doesn't seem reasonable to expect anyway.
>
They will indeed remain stable in that setting, I am saying ordering
may be different after a reboot or if there was some memory
reclamation going on between restarts of the program.
This is quite a difference from dev + ino combo not suffering these problems.
That is to say I don't see what is the benefit of having the kernel
provide a way to sort inodes in a way which can give different
results.
--
Mateusz Guzik <mjguzik gmail.com>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Testing if two open descriptors refer to the same inode
2024-07-29 11:47 ` Aleksa Sarai
@ 2024-07-29 12:12 ` Mateusz Guzik
2024-07-29 23:19 ` Dave Chinner
0 siblings, 1 reply; 24+ messages in thread
From: Mateusz Guzik @ 2024-07-29 12:12 UTC (permalink / raw)
To: Aleksa Sarai
Cc: Florian Weimer, linux-fsdevel, linux-kernel, linux-api,
Dave Chinner, Christian Brauner
On Mon, Jul 29, 2024 at 1:47 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
>
> On 2024-07-29, Mateusz Guzik <mjguzik@gmail.com> wrote:
> > On Mon, Jul 29, 2024 at 12:57 PM Florian Weimer <fweimer@redhat.com> wrote:
> > >
> > > * Mateusz Guzik:
> > >
> > > > On Mon, Jul 29, 2024 at 12:40:35PM +0200, Florian Weimer wrote:
> > > >> * Mateusz Guzik:
> > > >>
> > > >> > On Mon, Jul 29, 2024 at 08:55:46AM +0200, Florian Weimer wrote:
> > > >> >> It was pointed out to me that inode numbers on Linux are no longer
> > > >> >> expected to be unique per file system, even for local file systems.
> > > >> >
> > > >> > I don't know if I'm parsing this correctly.
> > > >> >
> > > >> > Are you claiming on-disk inode numbers are not guaranteed unique per
> > > >> > filesystem? It sounds like utter breakage, with capital 'f'.
> > > >>
> > > >> Yes, POSIX semantics and traditional Linux semantics for POSIX-like
> > > >> local file systems are different.
> > > >
> > > > Can you link me some threads about this?
> > >
> > > Sorry, it was an internal thread. It's supposed to be common knowledge
> > > among Linux file system developers. Aleksa referenced LSF/MM
> > > discussions.
> > >
> >
> > So much for open development :-P
> >
> > > > I had this in mind (untested modulo compilation):
> > > >
> > > > diff --git a/fs/fcntl.c b/fs/fcntl.c
> > > > index 300e5d9ad913..5723c3e82eac 100644
> > > > --- a/fs/fcntl.c
> > > > +++ b/fs/fcntl.c
> > > > @@ -343,6 +343,13 @@ static long f_dupfd_query(int fd, struct file *filp)
> > > > return f.file == filp;
> > > > }
> > > >
> > > > +static long f_dupfd_query_inode(int fd, struct file *filp)
> > > > +{
> > > > + CLASS(fd_raw, f)(fd);
> > > > +
> > > > + return f.file->f_inode == filp->f_inode;
> > > > +}
> > > > +
> > > > static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
> > > > struct file *filp)
> > > > {
> > > > @@ -361,6 +368,9 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
> > > > case F_DUPFD_QUERY:
> > > > err = f_dupfd_query(argi, filp);
> > > > break;
> > > > + case F_DUPFD_QUERY_INODE:
> > > > + err = f_dupfd_query_inode(argi, filp);
> > > > + break;
> > > > case F_GETFD:
> > > > err = get_close_on_exec(fd) ? FD_CLOEXEC : 0;
> > > > break;
> > > > diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
> > > > index c0bcc185fa48..2e93dbdd8fd2 100644
> > > > --- a/include/uapi/linux/fcntl.h
> > > > +++ b/include/uapi/linux/fcntl.h
> > > > @@ -16,6 +16,8 @@
> > > >
> > > > #define F_DUPFD_QUERY (F_LINUX_SPECIFIC_BASE + 3)
> > > >
> > > > +#define F_DUPFD_QUERY_INODE (F_LINUX_SPECIFIC_BASE + 4)
> > > > +
> > > > /*
> > > > * Cancel a blocking posix lock; internal use only until we expose an
> > > > * asynchronous lock api to userspace:
> > >
> > > It's certainly much easier to use than name_to_handle_at, so it looks
> > > like a useful option to have.
> > >
> > > Could we return a three-way comparison result for sorting? Or would
> > > that expose too much about kernel pointer values?
> > >
> >
> > As is this would sort by inode *address* which I don't believe is of
> > any use -- the order has to be assumed arbitrary.
> >
> > Perhaps there is something which is reliably the same and can be
> > combined with something else to be unique system-wide (the magic
> > handle thing?).
> >
> > But even then you would need to justify trying to sort by fcntl calls,
> > which sounds pretty dodgey to me.
>
> Programs need to key things by (dev, ino) currently, so you need to be
> able to get some kind of ordinal that you can sort with.
>
That I know, except normally that's done by sorting by (f)stat results.
> If we really want to make an interface to let you do this without
> exposing hashes in statx, then kcmp(2) makes more sense, but having to
> keep a file descriptor for each entry in a hashtable would obviously
> cause -EMFILE issues.
>
Agreed, hence the proposal to extend statx.
> > Given that thing I *suspect* statx() may want to get extended with
> > some guaranteed unique identifier. Then you can sort in userspace all
> > you want.
>
> Yeah, this is what the hashed fhandle patch I have does.
>
Ok, I see your other e-mail.
> > Based on your opening mail I assumed you only need to check 2 files,
> > for which the proposed fcntl does the trick.
> >
> > Or to put it differently: there seems to be more to the picture than
> > in the opening mail, so perhaps you could outline what you are looking
> > for.
>
> Hardlink detection requires creating a hashmap of (dev, ino) to find
> hardlinks. Pair-wise checking is not sufficient for that usecase (which
> AFAIK is the most common thing people use inode numbers for -- it's at
> least probably the most common thing people run in practice since
> archive tools do this.)
>
So if you have *numerous* files to check then indeed the fcntl is no
good, but the sorting variant is no good either -- you don't know what
key to look stuff up by since you don't know any of the addresses
(obfuscated or otherwise).
There needs to be a dev + ino replacement which retains the property
of being reliable between reboots and so on.
Since you said you have a patchset which exports something in statx,
chances are this is sorted out -- I'm gong to wait for that, meanwhile
I'm not going to submit my fcntl anywhere -- hopefuly it will be
avoided. :)
--
Mateusz Guzik <mjguzik gmail.com>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Testing if two open descriptors refer to the same inode
2024-07-29 10:18 ` Mateusz Guzik
2024-07-29 10:40 ` Florian Weimer
@ 2024-07-29 12:26 ` Christian Brauner
2024-07-29 13:36 ` Theodore Ts'o
2 siblings, 0 replies; 24+ messages in thread
From: Christian Brauner @ 2024-07-29 12:26 UTC (permalink / raw)
To: Mateusz Guzik
Cc: Florian Weimer, linux-fsdevel, linux-kernel, linux-api,
Dave Chinner
On Mon, Jul 29, 2024 at 12:18:15PM GMT, Mateusz Guzik wrote:
> On Mon, Jul 29, 2024 at 08:55:46AM +0200, Florian Weimer wrote:
> > It was pointed out to me that inode numbers on Linux are no longer
> > expected to be unique per file system, even for local file systems.
>
> I don't know if I'm parsing this correctly.
>
> Are you claiming on-disk inode numbers are not guaranteed unique per
> filesystem? It sounds like utter breakage, with capital 'f'.
>
> I know the 32-bit inode allocation code can result in unintentional
> duplicates after wrap around (see get_next_ino), but that's for
> in-memory stuff only(?) like pipes, so perhaps tolerable.
>
> Anyhow, the kernel recently got F_DUPFD_QUERY which tests if the *file*
> object is the same.
>
> While the above is not what's needed here, I guess it sets a precedent
> for F_DUPINODE_QUERY (or whatever other name) to be added to handily
> compare inode pointers. It may be worthwhile regardless of the above.
> (or maybe kcmp could be extended?)
We don't use kcmp() for such core operations. It's overkill and security
sensitive so quite a few configs don't enable it. See [1] for details
how kcmp() can be used to facilitate UAF attacks and defeat
probabilistic UAF mitigations by using ordering information. So that
ordering requirement should really go out the window.
Another thing is that kcmp() works cross-process which is also out of
scope for this. kcmp() really should be reserved for stuff that
checkpoint/restore in userspace needs and that's otherwise too fruity to
be implemented in our regular apis.
[1]: https://googleprojectzero.blogspot.com/2021/10/how-simple-linux-kernel-memory.html
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Testing if two open descriptors refer to the same inode
2024-07-29 10:18 ` Mateusz Guzik
2024-07-29 10:40 ` Florian Weimer
2024-07-29 12:26 ` Christian Brauner
@ 2024-07-29 13:36 ` Theodore Ts'o
2024-07-30 2:31 ` Dave Chinner
2 siblings, 1 reply; 24+ messages in thread
From: Theodore Ts'o @ 2024-07-29 13:36 UTC (permalink / raw)
To: Mateusz Guzik
Cc: Florian Weimer, linux-fsdevel, linux-kernel, linux-api,
Dave Chinner
On Mon, Jul 29, 2024 at 12:18:15PM +0200, Mateusz Guzik wrote:
>
> Are you claiming on-disk inode numbers are not guaranteed unique per
> filesystem? It sounds like utter breakage, with capital 'f'.
The reality is that there exists file systems which do not return
unique inode numbers. For example, there are virtiofs implementations
which pass the inode numbers straight through with a fixed dev_t. If
you have a large number of packages mounted via iscsi, and those
packages include shared libraries, then you can have two different
shared libraries with the same inode number, and then you can watch
the dynamic liunker get Very Confused, and debugging the problem can
be.... interesting. (Three gueses how I found out about this, and the
first two don't count. Yes, we figured out a workaround.)
So that breakage exists already, today.
For people who don't like this, they can stick to those file systems
that still guarantee unique inode numbers, at least for local disk
file systems --- for example, to use ext4 and xfs, over btrfs and
bcachefs.
However, this is a short-term expedient, and in the long term, we will
need to guide userspace to use something that is more likely to work,
such as file handles. And ideally, this needs to be standardized at
venues such as the Austin Group, so that it becomes interfaces which
are used across operating systems, not just for Linux. It's going to
be a multi-year, if not decade-long, effort...
- Ted
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Testing if two open descriptors refer to the same inode
2024-07-29 6:55 Testing if two open descriptors refer to the same inode Florian Weimer
2024-07-29 9:09 ` Aleksa Sarai
2024-07-29 10:18 ` Mateusz Guzik
@ 2024-07-29 15:24 ` Jeff Layton
2024-07-29 15:39 ` Florian Weimer
2 siblings, 1 reply; 24+ messages in thread
From: Jeff Layton @ 2024-07-29 15:24 UTC (permalink / raw)
To: Florian Weimer, linux-fsdevel, linux-kernel, linux-api; +Cc: Dave Chinner
On Mon, 2024-07-29 at 08:55 +0200, Florian Weimer wrote:
> It was pointed out to me that inode numbers on Linux are no longer
> expected to be unique per file system, even for local file systems.
> Applications sometimes need to check if two (open) files are the
> same.
> For example, a program may want to use a temporary file if is invoked
> with input and output files referring to the same file.
>
> How can we check for this? The POSIX way is to compare st_ino and
> st_dev in stat output, but if inode numbers are not unique, that will
> result in files falsely being reported as identical. It's harmless
> in
> the temporary file case, but it in other scenarios, it may result in
> data loss.
>
I believe this is the problem that STATX_SUBVOL was intended to solve.
Both bcachefs and btrfs will provide this attribute if requested. So,
basically to uniquely ID an inode using statx, you need a tuple of:
stx_dev_major/minor
stx_subvol
stx_ino
If the filesystem doesn't provide STATX_SUBVOL, then one can (likely)
conclude that stx_dev_* and stx_ino are enough.
--
Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Testing if two open descriptors refer to the same inode
2024-07-29 15:24 ` Jeff Layton
@ 2024-07-29 15:39 ` Florian Weimer
0 siblings, 0 replies; 24+ messages in thread
From: Florian Weimer @ 2024-07-29 15:39 UTC (permalink / raw)
To: Jeff Layton; +Cc: linux-fsdevel, linux-kernel, linux-api, Dave Chinner
* Jeff Layton:
> On Mon, 2024-07-29 at 08:55 +0200, Florian Weimer wrote:
>> It was pointed out to me that inode numbers on Linux are no longer
>> expected to be unique per file system, even for local file systems.
>> Applications sometimes need to check if two (open) files are the
>> same.
>> For example, a program may want to use a temporary file if is invoked
>> with input and output files referring to the same file.
>>
>> How can we check for this? The POSIX way is to compare st_ino and
>> st_dev in stat output, but if inode numbers are not unique, that will
>> result in files falsely being reported as identical. It's harmless
>> in
>> the temporary file case, but it in other scenarios, it may result in
>> data loss.
>>
>
> I believe this is the problem that STATX_SUBVOL was intended to solve.
>
> Both bcachefs and btrfs will provide this attribute if requested. So,
> basically to uniquely ID an inode using statx, you need a tuple of:
>
> stx_dev_major/minor
> stx_subvol
> stx_ino
>
> If the filesystem doesn't provide STATX_SUBVOL, then one can (likely)
> conclude that stx_dev_* and stx_ino are enough.
Does this really work for the virtiofs case, though? It has to pass
through all three *and* make things unique relative to the host, I
think.
Thanks,
Florian
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Testing if two open descriptors refer to the same inode
2024-07-29 10:57 ` Florian Weimer
2024-07-29 11:06 ` Mateusz Guzik
@ 2024-07-29 23:08 ` Dave Chinner
1 sibling, 0 replies; 24+ messages in thread
From: Dave Chinner @ 2024-07-29 23:08 UTC (permalink / raw)
To: Florian Weimer
Cc: Mateusz Guzik, linux-fsdevel, linux-kernel, linux-api,
Dave Chinner
On Mon, Jul 29, 2024 at 12:57:14PM +0200, Florian Weimer wrote:
> * Mateusz Guzik:
>
> > On Mon, Jul 29, 2024 at 12:40:35PM +0200, Florian Weimer wrote:
> >> * Mateusz Guzik:
> >>
> >> > On Mon, Jul 29, 2024 at 08:55:46AM +0200, Florian Weimer wrote:
> >> >> It was pointed out to me that inode numbers on Linux are no longer
> >> >> expected to be unique per file system, even for local file systems.
> >> >
> >> > I don't know if I'm parsing this correctly.
> >> >
> >> > Are you claiming on-disk inode numbers are not guaranteed unique per
> >> > filesystem? It sounds like utter breakage, with capital 'f'.
> >>
> >> Yes, POSIX semantics and traditional Linux semantics for POSIX-like
> >> local file systems are different.
> >
> > Can you link me some threads about this?
>
> Sorry, it was an internal thread. It's supposed to be common knowledge
> among Linux file system developers.
btrfs has been dealing with this issue since snapshots/subvols were
first introduced some 15-odd years ago. This isn't a new problem...
https://lore.kernel.org/linux-fsdevel/20231025210654.GA2892534@perftesting/
> Aleksa referenced LSF/MM
> discussions.
I also referenced those discussions and the -fsdevel discussions
that have been happening for quite some time. I'll reference some
of them here again, because they are all out in the open....
https://lwn.net/Articles/975444/
https://lore.kernel.org/linux-fsdevel/?q=st_vol
https://lore.kernel.org/linux-fsdevel/20231211233231.oiazgkqs7yahruuw@moria.home.lan/
-Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Testing if two open descriptors refer to the same inode
2024-07-29 12:12 ` Mateusz Guzik
@ 2024-07-29 23:19 ` Dave Chinner
0 siblings, 0 replies; 24+ messages in thread
From: Dave Chinner @ 2024-07-29 23:19 UTC (permalink / raw)
To: Mateusz Guzik
Cc: Aleksa Sarai, Florian Weimer, linux-fsdevel, linux-kernel,
linux-api, Dave Chinner, Christian Brauner
On Mon, Jul 29, 2024 at 02:12:15PM +0200, Mateusz Guzik wrote:
> There needs to be a dev + ino replacement which retains the property
> of being reliable between reboots and so on.
Yes, that's exactly what filehandles provide.
OTOH, dev + ino by itself does not provide this guarantee, as st_dev
is not guaranteed to be constant across reboots because of things
like async device discovery and enumeration. i.e. userspace has to
do a bunch of work to determine the "st_dev" for a given filesystem
actually refers to the same filesystem it did before the system
restarted (e.g. via FS uuid checks).
Filehandles already contain persistent filesystem identifiers, and
so userspace doesn't have to do anything special to ensure that
comparisons across reboots are valid.
-Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Testing if two open descriptors refer to the same inode
2024-07-29 13:36 ` Theodore Ts'o
@ 2024-07-30 2:31 ` Dave Chinner
2024-07-30 4:19 ` Theodore Ts'o
2024-07-30 15:38 ` Christoph Hellwig
0 siblings, 2 replies; 24+ messages in thread
From: Dave Chinner @ 2024-07-30 2:31 UTC (permalink / raw)
To: Theodore Ts'o
Cc: Mateusz Guzik, Florian Weimer, linux-fsdevel, linux-kernel,
linux-api, Dave Chinner
On Mon, Jul 29, 2024 at 09:36:01AM -0400, Theodore Ts'o wrote:
> On Mon, Jul 29, 2024 at 12:18:15PM +0200, Mateusz Guzik wrote:
> >
> > Are you claiming on-disk inode numbers are not guaranteed unique per
> > filesystem? It sounds like utter breakage, with capital 'f'.
>
> The reality is that there exists file systems which do not return
> unique inode numbers. For example, there are virtiofs implementations
> which pass the inode numbers straight through with a fixed dev_t. If
> you have a large number of packages mounted via iscsi, and those
> packages include shared libraries, then you can have two different
> shared libraries with the same inode number, and then you can watch
> the dynamic liunker get Very Confused, and debugging the problem can
> be.... interesting. (Three gueses how I found out about this, and the
> first two don't count. Yes, we figured out a workaround.)
>
> So that breakage exists already, today.
>
> For people who don't like this, they can stick to those file systems
> that still guarantee unique inode numbers, at least for local disk
> file systems --- for example, to use ext4 and xfs, over btrfs and
> bcachefs.
I don't think you can make such a simplistic delineation, because
there's more than one issue at play here.
There are at least two different "is this inode identical"
use cases that {st_dev,st_ino} is being used for.
The first, as Florian described, is to determine if two open fds
refer to the same inode for collision avoidance.
This works on traditional filesystems like ext4 and XFS, but isn't
reliable on filesystems with integrated snapshot/subvolume
functionality.
The second is that {dev,ino} is being used to disambiguate paths
that point to hardlinked inodes for the purposes of identifying
and optimising access and replication of shared (i.e. non-unique)
file data.
This works on traditional filesystems like ext4, but fails badly on
filesystem that support FICLONERANGE (XFS, btrfs, NFS, CIFS,
bcachefs, etc) because cloned files have unique inodes but
non-unique data.
> However, this is a short-term expedient, and in the long term, we will
> need to guide userspace to use something that is more likely to work,
> such as file handles.
The first case can be done with filehandles - it's a simple
resolution of fds to filehandles and compare the opaque filehandles.
That's the short-term solution because it's the easy one to solve.
However, filehandles do not work for the solving the second case.
Hardlinks are basically a mechanism for avoiding data copying within
the same filesystem. i.e. hardlink disambiguation is essentially
the process of detecting which dirents point to the same shared
data. We can detect a hardlinked inode simply by looking at the
link count, and we use the inode number to determine that they point
to the same physical storage.
Applications like tar and rsync are detecting hard links to avoid
two main issues:
- moving the same data multiple times
- causing destination write amplification by storing the
same data in multiple places
They avoid these by creating a hardlink map of the directory
structure being processed, and then recreate that hardlink map at
the destination. We could use filehandles for that, too, and then
we wouldn't be relying on {dev,ino} for this, either.
However, any application that is using inode number or filehandle
comparisons to detect data uniqueness does not work if other
applications and utilities are using reflink copies rather than
hardlinks for space efficient data copying.
Let's all keep this in mind here: the default behaviour of `cp` is
to use file clones on filesystems that support them over physical
data copies. I have maybe half a dozen hardlinks in most of my local
XFS filesystems, but I have many tens of thousands of cloned files
in those same filesystems.
IOWs, any tool that is using {dev,ino} as a proxy for data
uniqueness is fundamentally deficient on any filesystem that
supports file cloning.
Given the potential for badness in replicating filesystems full of
cloned data, it's far more important and higher priority for such
utilities to move away from using {dev,ino} to detect data
uniqueness. Handling cloned data efficiently requires this, and
that's a far better reason for moving away from {dev,ino} based
disambiguation than "oh, it doesn't work on btrfs properly".
Detecting "is the data unique" is not that hard - e.g. we could
add a statx flag to say "this inode has shared data" - and then
userspace can tag that inode as needing data disambiguation before
starting to move data.
However, data disambiguation (i.e. finding what inodes share the
data at which file offset) is a much harder problem. This largely
requires knowledge of the entire layout of the filesystem, and so
it's really only a question the filesystem itself can resolve.
We already provide such an interface for XFS with ioctl(GETFSMAP).
It is driven by the on-disk reverse mapping btrees, and so can
quickly answer the "what {inode,offset,len} tuples share this
physical extent" question. The interface is generic, however, so
asking such a question and determining the answer is .... complex.
That is our long term challenge: replacing the use of {dev,ino} for
data uniqueness disambiguation. Making the identification of owners
of non-unique/shared data simple for applications to use and fast
for filesystems to resolve will be a challenge.
-Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Testing if two open descriptors refer to the same inode
2024-07-30 2:31 ` Dave Chinner
@ 2024-07-30 4:19 ` Theodore Ts'o
2024-07-30 15:38 ` Christoph Hellwig
1 sibling, 0 replies; 24+ messages in thread
From: Theodore Ts'o @ 2024-07-30 4:19 UTC (permalink / raw)
To: Dave Chinner
Cc: Mateusz Guzik, Florian Weimer, linux-fsdevel, linux-kernel,
linux-api, Dave Chinner
On Tue, Jul 30, 2024 at 12:31:57PM +1000, Dave Chinner wrote:
> I don't think you can make such a simplistic delineation, because
> there's more than one issue at play here.
>
> There are at least two different "is this inode identical"
> use cases that {st_dev,st_ino} is being used for.
>
> The first, as Florian described, is to determine if two open fds
> refer to the same inode for collision avoidance.
>
> This works on traditional filesystems like ext4 and XFS, but isn't
> reliable on filesystems with integrated snapshot/subvolume
> functionality.
That's fair, but the first is the problem I think is more important,
because there are existing programs which are depending on
st_dev/st_ino as being a reliable way of detecting uniqueness --- and
if this breas, file data may get corrpted, or the dyanmic linker will
assume that two unrelated shared libraries are actually the same, with
hilarity then ensuing --- because an interface which is guaranteed by
decade of Unix history, and POSIX, is broken by some Linux file
systems.
> The second is that {dev,ino} is being used to disambiguate paths
> that point to hardlinked inodes for the purposes of identifying
> and optimising access and replication of shared (i.e. non-unique)
> file data.
>
> This works on traditional filesystems like ext4, but fails badly on
> filesystem that support FICLONERANGE (XFS, btrfs, NFS, CIFS,
> bcachefs, etc) because cloned files have unique inodes but
> non-unique data.
That's a problem, yes --- but it's a previously unsolved problem, and
the failure will cause inefficiency, but it doesn't actually cause
data corruption.
It's also a very hard problem, especially if we're considering the
full, general FICLONERANGE interface, where an arbitrary subset of
blocks at one offset, might be cloned at a completely different offset
in a different file, and where a single file might have cloned ranges
from a dozens of other files. How this information would be
communicated to userspace, so they could copy a directory hierarchy
without an increased expansion is a hard problem.
Given that we have a simple solution (filehandles) to fix a problem
where false positives causes lost data or core dumps, let's solve the
simple problem and try to get it standardized acrossed operating
systems beyond Linux, and in parallel, we can try to figure out a much
more complicated interface to solve this other problem.
- Ted
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Testing if two open descriptors refer to the same inode
2024-07-30 2:31 ` Dave Chinner
2024-07-30 4:19 ` Theodore Ts'o
@ 2024-07-30 15:38 ` Christoph Hellwig
1 sibling, 0 replies; 24+ messages in thread
From: Christoph Hellwig @ 2024-07-30 15:38 UTC (permalink / raw)
To: Dave Chinner
Cc: Theodore Ts'o, Mateusz Guzik, Florian Weimer, linux-fsdevel,
linux-kernel, linux-api, Dave Chinner
On Tue, Jul 30, 2024 at 12:31:57PM +1000, Dave Chinner wrote:
> There are at least two different "is this inode identical"
> use cases that {st_dev,st_ino} is being used for.
>
> The first, as Florian described, is to determine if two open fds
> refer to the same inode for collision avoidance.
>
> This works on traditional filesystems like ext4 and XFS, but isn't
> reliable on filesystems with integrated snapshot/subvolume
> functionality.
It's not about snapshot, it's about file systems being broken. Even
btrfs for example always has a unique st_dev,st_ino pair, it can
just unexpectly change at any subvolume root and not just at a mount
point.
> That is our long term challenge: replacing the use of {dev,ino} for
> data uniqueness disambiguation. Making the identification of owners
> of non-unique/shared data simple for applications to use and fast
> for filesystems to resolve will be a challenge.
I don't think there is any way to provide such a guarantee as there
is so many levels of cloning or dedup, many of which are totally
invisible to the high level file system interface.
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Testing if two open descriptors refer to the same inode
2024-07-29 11:40 ` Aleksa Sarai
@ 2024-07-31 18:07 ` David Sterba
0 siblings, 0 replies; 24+ messages in thread
From: David Sterba @ 2024-07-31 18:07 UTC (permalink / raw)
To: Aleksa Sarai
Cc: Mateusz Guzik, Florian Weimer, linux-fsdevel, linux-kernel,
linux-api, Dave Chinner, Christian Brauner
On Mon, Jul 29, 2024 at 09:40:57PM +1000, Aleksa Sarai wrote:
> On 2024-07-29, Mateusz Guzik <mjguzik@gmail.com> wrote:
> > On Mon, Jul 29, 2024 at 12:57 PM Florian Weimer <fweimer@redhat.com> wrote:
> > > > On Mon, Jul 29, 2024 at 12:40:35PM +0200, Florian Weimer wrote:
> > > >> > On Mon, Jul 29, 2024 at 08:55:46AM +0200, Florian Weimer wrote:
> > > >> >> It was pointed out to me that inode numbers on Linux are no longer
> > > >> >> expected to be unique per file system, even for local file systems.
> > > >> >
> > > >> > I don't know if I'm parsing this correctly.
> > > >> >
> > > >> > Are you claiming on-disk inode numbers are not guaranteed unique per
> > > >> > filesystem? It sounds like utter breakage, with capital 'f'.
> > > >>
> > > >> Yes, POSIX semantics and traditional Linux semantics for POSIX-like
> > > >> local file systems are different.
> > > >
> > > > Can you link me some threads about this?
> > >
> > > Sorry, it was an internal thread. It's supposed to be common knowledge
> > > among Linux file system developers. Aleksa referenced LSF/MM
> > > discussions.
> >
> > So much for open development :-P
>
> To be clear, this wasn't _decided_ at LSF/MM, it was brought up as a
> topic. There is an LWN article about the session that mentions the
> issue[1].
A discussion about inode numbers or subvolumes comes up every year with
better of worse suggestions what to do about it.
> My understanding is that the btrfs and bcachefs folks independently
> determined they cannot provide this guarantee. As far as I understand,
> the reason why is that inode number allocation on btree filesystems
> stores information about location and some other bits (maybe subvolumes)
> in the bits, making it harder to guarantee there will be no collisions.
No, on btrfs the inode numbers don't encode anything about location,
it's a simple number. The inode numbers remain the same when a snapshot
is taken as it's a 1:1 clone of the file hierarchy, the directory
representing a subvolume/snapshot has fixed inode number 256. The only
difference is the internal subvolume id.
^ permalink raw reply [flat|nested] 24+ messages in thread
end of thread, other threads:[~2024-07-31 18:07 UTC | newest]
Thread overview: 24+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-07-29 6:55 Testing if two open descriptors refer to the same inode Florian Weimer
2024-07-29 9:09 ` Aleksa Sarai
2024-07-29 9:29 ` Florian Weimer
2024-07-29 10:18 ` Mateusz Guzik
2024-07-29 10:40 ` Florian Weimer
2024-07-29 10:50 ` Mateusz Guzik
2024-07-29 10:56 ` Mateusz Guzik
2024-07-29 10:57 ` Florian Weimer
2024-07-29 11:06 ` Mateusz Guzik
2024-07-29 11:36 ` Florian Weimer
2024-07-29 12:00 ` Mateusz Guzik
2024-07-29 11:40 ` Aleksa Sarai
2024-07-31 18:07 ` David Sterba
2024-07-29 11:47 ` Aleksa Sarai
2024-07-29 12:12 ` Mateusz Guzik
2024-07-29 23:19 ` Dave Chinner
2024-07-29 23:08 ` Dave Chinner
2024-07-29 12:26 ` Christian Brauner
2024-07-29 13:36 ` Theodore Ts'o
2024-07-30 2:31 ` Dave Chinner
2024-07-30 4:19 ` Theodore Ts'o
2024-07-30 15:38 ` Christoph Hellwig
2024-07-29 15:24 ` Jeff Layton
2024-07-29 15:39 ` Florian Weimer
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).