* [PATCH] stat: don't fail if the major number is >= 256
@ 2022-04-11 14:43 Mikulas Patocka
2022-04-11 15:03 ` Matthew Wilcox
2022-04-11 16:30 ` Linus Torvalds
0 siblings, 2 replies; 15+ messages in thread
From: Mikulas Patocka @ 2022-04-11 14:43 UTC (permalink / raw)
To: Linus Torvalds, Alexander Viro; +Cc: linux-fsdevel, linux-kernel
If you run a program compiled with OpenWatcom for Linux on a filesystem on
NVMe, all "stat" syscalls fail with -EOVERFLOW. The reason is that the
NVMe driver allocates a device with the major number 259 and it doesn't
pass the "old_valid_dev" test.
This patch removes the tests - it's better to wrap around than to return
an error. (note that cp_old_stat also doesn't report an error and wraps
the number around)
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
---
fs/stat.c | 6 ------
1 file changed, 6 deletions(-)
Index: linux-5.17.2/fs/stat.c
===================================================================
--- linux-5.17.2.orig/fs/stat.c 2022-04-10 21:39:27.000000000 +0200
+++ linux-5.17.2/fs/stat.c 2022-04-10 21:42:43.000000000 +0200
@@ -334,7 +334,6 @@ SYSCALL_DEFINE2(fstat, unsigned int, fd,
# define choose_32_64(a,b) b
#endif
-#define valid_dev(x) choose_32_64(old_valid_dev(x),true)
#define encode_dev(x) choose_32_64(old_encode_dev,new_encode_dev)(x)
#ifndef INIT_STRUCT_STAT_PADDING
@@ -345,8 +344,6 @@ static int cp_new_stat(struct kstat *sta
{
struct stat tmp;
- if (!valid_dev(stat->dev) || !valid_dev(stat->rdev))
- return -EOVERFLOW;
#if BITS_PER_LONG == 32
if (stat->size > MAX_NON_LFS)
return -EOVERFLOW;
@@ -644,9 +641,6 @@ static int cp_compat_stat(struct kstat *
{
struct compat_stat tmp;
- if (!old_valid_dev(stat->dev) || !old_valid_dev(stat->rdev))
- return -EOVERFLOW;
-
memset(&tmp, 0, sizeof(tmp));
tmp.st_dev = old_encode_dev(stat->dev);
tmp.st_ino = stat->ino;
^ permalink raw reply [flat|nested] 15+ messages in thread* Re: [PATCH] stat: don't fail if the major number is >= 256 2022-04-11 14:43 [PATCH] stat: don't fail if the major number is >= 256 Mikulas Patocka @ 2022-04-11 15:03 ` Matthew Wilcox 2022-04-11 15:16 ` Christoph Hellwig 2022-04-11 16:30 ` Linus Torvalds 1 sibling, 1 reply; 15+ messages in thread From: Matthew Wilcox @ 2022-04-11 15:03 UTC (permalink / raw) To: Mikulas Patocka Cc: Linus Torvalds, Alexander Viro, linux-fsdevel, linux-kernel On Mon, Apr 11, 2022 at 10:43:33AM -0400, Mikulas Patocka wrote: > If you run a program compiled with OpenWatcom for Linux on a filesystem on > NVMe, all "stat" syscalls fail with -EOVERFLOW. The reason is that the > NVMe driver allocates a device with the major number 259 and it doesn't > pass the "old_valid_dev" test. > > This patch removes the tests - it's better to wrap around than to return > an error. (note that cp_old_stat also doesn't report an error and wraps > the number around) Is it better? You've done a good job arguing why it is for this particular situation, but if there's a program which compares files by st_dev+st_ino, it might think two files are identical when they're actually different and, eg, skip backing up a file because it thinks it already did it. That would be a silent failure, which is worse than this noisy failure. The real problem is clearly that Linus denied my request for a real major number for NVMe back in 2012 or whenever it was :-P > Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> > > --- > fs/stat.c | 6 ------ > 1 file changed, 6 deletions(-) > > Index: linux-5.17.2/fs/stat.c > =================================================================== > --- linux-5.17.2.orig/fs/stat.c 2022-04-10 21:39:27.000000000 +0200 > +++ linux-5.17.2/fs/stat.c 2022-04-10 21:42:43.000000000 +0200 > @@ -334,7 +334,6 @@ SYSCALL_DEFINE2(fstat, unsigned int, fd, > # define choose_32_64(a,b) b > #endif > > -#define valid_dev(x) choose_32_64(old_valid_dev(x),true) > #define encode_dev(x) choose_32_64(old_encode_dev,new_encode_dev)(x) > > #ifndef INIT_STRUCT_STAT_PADDING > @@ -345,8 +344,6 @@ static int cp_new_stat(struct kstat *sta > { > struct stat tmp; > > - if (!valid_dev(stat->dev) || !valid_dev(stat->rdev)) > - return -EOVERFLOW; > #if BITS_PER_LONG == 32 > if (stat->size > MAX_NON_LFS) > return -EOVERFLOW; > @@ -644,9 +641,6 @@ static int cp_compat_stat(struct kstat * > { > struct compat_stat tmp; > > - if (!old_valid_dev(stat->dev) || !old_valid_dev(stat->rdev)) > - return -EOVERFLOW; > - > memset(&tmp, 0, sizeof(tmp)); > tmp.st_dev = old_encode_dev(stat->dev); > tmp.st_ino = stat->ino; > ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] stat: don't fail if the major number is >= 256 2022-04-11 15:03 ` Matthew Wilcox @ 2022-04-11 15:16 ` Christoph Hellwig 0 siblings, 0 replies; 15+ messages in thread From: Christoph Hellwig @ 2022-04-11 15:16 UTC (permalink / raw) To: Matthew Wilcox Cc: Mikulas Patocka, Linus Torvalds, Alexander Viro, linux-fsdevel, linux-kernel On Mon, Apr 11, 2022 at 04:03:37PM +0100, Matthew Wilcox wrote: > Is it better? You've done a good job arguing why it is for this particular > situation, but if there's a program which compares files by > st_dev+st_ino, it might think two files are identical when they're > actually different and, eg, skip backing up a file because it thinks > it already did it. That would be a silent failure, which is worse > than this noisy failure. Agreed. > The real problem is clearly that Linus denied my request for a real > major number for NVMe back in 2012 or whenever it was :-P I don't think that is the real probem. The whole dynamic dev_t scheme has worked out really well and I'm glade drivers don't need a major allocation anymore. But I'm not sure why the dynamic major had to be past 255 given these legacy issues, especially as there are plenty of free ones with lower numbers. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] stat: don't fail if the major number is >= 256 2022-04-11 14:43 [PATCH] stat: don't fail if the major number is >= 256 Mikulas Patocka 2022-04-11 15:03 ` Matthew Wilcox @ 2022-04-11 16:30 ` Linus Torvalds 2022-04-11 17:13 ` Mikulas Patocka 1 sibling, 1 reply; 15+ messages in thread From: Linus Torvalds @ 2022-04-11 16:30 UTC (permalink / raw) To: Mikulas Patocka; +Cc: Alexander Viro, linux-fsdevel, Linux Kernel Mailing List On Mon, Apr 11, 2022 at 4:43 AM Mikulas Patocka <mpatocka@redhat.com> wrote: > > If you run a program compiled with OpenWatcom for Linux on a filesystem on > NVMe, all "stat" syscalls fail with -EOVERFLOW. The reason is that the > NVMe driver allocates a device with the major number 259 and it doesn't > pass the "old_valid_dev" test. OpenWatcom? Really? > This patch removes the tests - it's better to wrap around than to return > an error. (note that cp_old_stat also doesn't report an error and wraps > the number around) Hmm. We've used majors over 256 for a long time, but some of them are admittedly very rare (SCSI OSD?) Unfortunate. And in this case 259 aliases to 3, which is the old HD/IDE0 major number. That's not great - there would be other numbers that didn't have that problem (ie 4-6 are all currently only character device majors, I think). Anyway, I think that check is just bogus. The cp_new_stat() thing uses 'struct stat' and it has unsigned long st_dev; /* Device. */ unsigned long st_rdev; /* Device number, if device. */ so there's no reason to limit things to the old 8-bit behavior. Yes, it does that #define valid_dev(x) choose_32_64(old_valid_dev(x),true) #define encode_dev(x) choose_32_64(old_encode_dev,new_encode_dev)(x) static __always_inline u16 old_encode_dev(dev_t dev) { return (MAJOR(dev) << 8) | MINOR(dev); } which currently drops bits, but we should just *fix* that. We can put the high bits in the upper bits, not limit it to 16 bits when we have more space than that. Even the *really* old 'struct old_stat' doesn't really have a 16-bit st_dev/rdev. Linus ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] stat: don't fail if the major number is >= 256 2022-04-11 16:30 ` Linus Torvalds @ 2022-04-11 17:13 ` Mikulas Patocka 2022-04-12 5:37 ` Linus Torvalds 0 siblings, 1 reply; 15+ messages in thread From: Mikulas Patocka @ 2022-04-11 17:13 UTC (permalink / raw) To: Linus Torvalds Cc: Alexander Viro, linux-fsdevel, Linux Kernel Mailing List, Matthew Wilcox On Mon, 11 Apr 2022, Linus Torvalds wrote: > On Mon, Apr 11, 2022 at 4:43 AM Mikulas Patocka <mpatocka@redhat.com> wrote: > > > > If you run a program compiled with OpenWatcom for Linux on a filesystem on > > NVMe, all "stat" syscalls fail with -EOVERFLOW. The reason is that the > > NVMe driver allocates a device with the major number 259 and it doesn't > > pass the "old_valid_dev" test. > > OpenWatcom? Really? Yes. I use OpenWatcom to verify that my programs are clean ANSI C without any gccisms. Other than that, it is not much useful - it has it's own libc, it's own module format, and programs compiled with OpenWatcom cannot be linked with existing *.a or *.so libraries. > > This patch removes the tests - it's better to wrap around than to return > > an error. (note that cp_old_stat also doesn't report an error and wraps > > the number around) > > Hmm. We've used majors over 256 for a long time, but some of them are > admittedly very rare (SCSI OSD?) > > Unfortunate. And in this case 259 aliases to 3, which is the old > HD/IDE0 major number. That's not great - there would be other numbers > that didn't have that problem (ie 4-6 are all currently only character > device majors, I think). Should we perhaps hash the number, take 16 bits of the hash and hope than the collision won't happen? > Anyway, I think that check is just bogus. The cp_new_stat() thing uses > 'struct stat' and it has > > unsigned long st_dev; /* Device. */ > unsigned long st_rdev; /* Device number, if device. */ > > so there's no reason to limit things to the old 8-bit behavior. > > Yes, it does that > > #define valid_dev(x) choose_32_64(old_valid_dev(x),true) > #define encode_dev(x) choose_32_64(old_encode_dev,new_encode_dev)(x) > > static __always_inline u16 old_encode_dev(dev_t dev) > { > return (MAJOR(dev) << 8) | MINOR(dev); > } > > which currently drops bits, but we should just *fix* that. We can put > the high bits in the upper bits, not limit it to 16 bits when we have > more space than that. Yes - we can return values larger than 16-bit here. But there's a risk that the userspace code will extract the values using macros like this and lose the upper bits: #define major(device) ((int)(((device) >> 8) & 0xFF)) #define minor(device) ((int)((device) & 0xff)) > Even the *really* old 'struct old_stat' doesn't really have a 16-bit > st_dev/rdev. > > Linus For me, the failure happens in cp_compat_stat (I have a 64-bit kernel). In struct compat_stat in arch/x86/include/asm/compat.h, st_dev and st_rdev are compat_dev_t which is 16-bit. But they are followed by 16-bit paddings, so they could be extended. If you have a native 32-bit kernel, it uses 'struct stat' defined at the beginning of arch/x86/include/uapi/asm/stat.h that has 32-bit st_dev and st_rdev. If you use a 64-bit kernel with 32-bit compat, it uses 'struct compat_stat' defined in arch/x86/include/asm/compat.h that has 16-bit st_dev and st_rdev. That's an inconsistency that should be resolved. What did glibc do? Did it use 16-bit dev_t with following padding or 32-bit dev_t? (the current glibc just uses stat64 and 64-bit dev_t always) Mikulas ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] stat: don't fail if the major number is >= 256 2022-04-11 17:13 ` Mikulas Patocka @ 2022-04-12 5:37 ` Linus Torvalds 2022-04-12 5:41 ` Christoph Hellwig ` (3 more replies) 0 siblings, 4 replies; 15+ messages in thread From: Linus Torvalds @ 2022-04-12 5:37 UTC (permalink / raw) To: Mikulas Patocka Cc: Alexander Viro, linux-fsdevel, Linux Kernel Mailing List, Matthew Wilcox On Mon, Apr 11, 2022 at 7:13 AM Mikulas Patocka <mpatocka@redhat.com> wrote: > > Should we perhaps hash the number, take 16 bits of the hash and hope > than the collision won't happen? That would "work", but I think it would be incredibly annoying to users with basically random results. I think the solution is to just put the bits in the high bits. Yes, they might be masked off if people use 'MAJOR()' to pick them out, but the common "compare st_dev and st_ino" model at least works. That's the one that wants unique numbers. > For me, the failure happens in cp_compat_stat (I have a 64-bit kernel). In > struct compat_stat in arch/x86/include/asm/compat.h, st_dev and st_rdev > are compat_dev_t which is 16-bit. But they are followed by 16-bit > paddings, so they could be extended. Ok, that actually looks like a bug. The compat structure should match the native structure. Those "u16 __padX" fields seem to be just a symptom of the bug. The only user of that compat_stat structure is the kernel, so that should just be fixed. Of course, who knows what the libraries have done, so user space could still have screwed up. > If you have a native 32-bit kernel, it uses 'struct stat' defined at the > beginning of arch/x86/include/uapi/asm/stat.h that has 32-bit st_dev and > st_rdev. Correct. It's literally the compat structure that has no basis in reality. Or it might be some truly ancient thing, but I really don't think so. Regardless, if we fill in the high 16 bits of that field, in the _worst_ case things will act as if your patch had been applied, but in any sane case you'll have that working "unique st_dev" thing. I'd love to actually use the better "modern" 64-bit encoding (12+20 bits, or whatever it is we do, too lazy to look it up), but hey, that historical thing is what it is. Realistically, nobody uses it. Apparently your OpenWatcom use is also really just fairly theoretical, not some "this app is used by people and doesn't work with a filesystem on NVMe". Linus ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] stat: don't fail if the major number is >= 256 2022-04-12 5:37 ` Linus Torvalds @ 2022-04-12 5:41 ` Christoph Hellwig 2022-04-12 5:47 ` Linus Torvalds 2022-04-12 6:49 ` Linus Torvalds ` (2 subsequent siblings) 3 siblings, 1 reply; 15+ messages in thread From: Christoph Hellwig @ 2022-04-12 5:41 UTC (permalink / raw) To: Linus Torvalds Cc: Mikulas Patocka, Alexander Viro, linux-fsdevel, Linux Kernel Mailing List, Matthew Wilcox On Mon, Apr 11, 2022 at 07:37:44PM -1000, Linus Torvalds wrote: > Realistically, nobody uses it. Apparently your OpenWatcom use is also > really just fairly theoretical, not some "this app is used by people > and doesn't work with a filesystem on NVMe". And that is easily fixed by using a lower major for the block dynamic dev_t. In theory userspace should be able to copy with any major for it, but I fear something might break so we could make it configurable. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] stat: don't fail if the major number is >= 256 2022-04-12 5:41 ` Christoph Hellwig @ 2022-04-12 5:47 ` Linus Torvalds 2022-04-12 5:52 ` Christoph Hellwig 0 siblings, 1 reply; 15+ messages in thread From: Linus Torvalds @ 2022-04-12 5:47 UTC (permalink / raw) To: Christoph Hellwig Cc: Mikulas Patocka, Alexander Viro, linux-fsdevel, Linux Kernel Mailing List, Matthew Wilcox On Mon, Apr 11, 2022 at 7:41 PM Christoph Hellwig <hch@infradead.org> wrote: > > And that is easily fixed by using a lower major for the block dynamic > dev_t. In theory userspace should be able to copy with any major for > it, but I fear something might break so we could make it configurable. We actually have a ton of major numbers free, although it's not obvious because of the whole "they are used by character devices, not block devices" issue. Ie 4-6, 12, 14, 19 are all free, afaik. But yeah, somebody might have a static /dev for some odd embedded case, I guess. That said, it really does look like we just return -EOVERFLOW much too eagerly, for stupid and bad reasons. Considering that BLOCK_EXT_MAJOR has been 259 since 2008, and this is the first time anybody has hit this, I don't think there's much reason to change that major number when the whole error case seems to have been largely a mistake to begin with. Linus ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] stat: don't fail if the major number is >= 256 2022-04-12 5:47 ` Linus Torvalds @ 2022-04-12 5:52 ` Christoph Hellwig 0 siblings, 0 replies; 15+ messages in thread From: Christoph Hellwig @ 2022-04-12 5:52 UTC (permalink / raw) To: Linus Torvalds Cc: Christoph Hellwig, Mikulas Patocka, Alexander Viro, linux-fsdevel, Linux Kernel Mailing List, Matthew Wilcox On Mon, Apr 11, 2022 at 07:47:44PM -1000, Linus Torvalds wrote: > Considering that BLOCK_EXT_MAJOR has been 259 since 2008, and this is > the first time anybody has hit this, I don't think there's much reason > to change that major number when the whole error case seems to have > been largely a mistake to begin with. Yeah. If Mikulas still has a problem he could just patch BLOCK_EXT_MAJOR locally. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] stat: don't fail if the major number is >= 256 2022-04-12 5:37 ` Linus Torvalds 2022-04-12 5:41 ` Christoph Hellwig @ 2022-04-12 6:49 ` Linus Torvalds 2022-04-12 8:19 ` Andreas Schwab 2022-04-12 9:41 ` [PATCH] stat: fix inconsistency between struct stat and struct compat_stat Mikulas Patocka 3 siblings, 0 replies; 15+ messages in thread From: Linus Torvalds @ 2022-04-12 6:49 UTC (permalink / raw) To: Mikulas Patocka Cc: Alexander Viro, linux-fsdevel, Linux Kernel Mailing List, Matthew Wilcox On Mon, Apr 11, 2022 at 7:37 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Correct. It's literally the compat structure that has no basis in reality. > > Or it might be some truly ancient thing, but I really don't think so. I was intrigued, so I went back and checked. unsigned short st_dev; unsigned short __pad1; is in fact historical. But it was changed to unsigned long st_dev; (for i386, so this is a 32-bit 'unsigned long') on April 2, 2003. From the BK tree conversion: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/?id=e95b2065677fe32512a597a79db94b77b90c968d so I think we should just make sure that the 64-bit compat system call is compatible with that 2003+ state, not with some truly ancient state. Linus ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] stat: don't fail if the major number is >= 256 2022-04-12 5:37 ` Linus Torvalds 2022-04-12 5:41 ` Christoph Hellwig 2022-04-12 6:49 ` Linus Torvalds @ 2022-04-12 8:19 ` Andreas Schwab 2022-04-12 9:41 ` [PATCH] stat: fix inconsistency between struct stat and struct compat_stat Mikulas Patocka 3 siblings, 0 replies; 15+ messages in thread From: Andreas Schwab @ 2022-04-12 8:19 UTC (permalink / raw) To: Linus Torvalds Cc: Mikulas Patocka, Alexander Viro, linux-fsdevel, Linux Kernel Mailing List, Matthew Wilcox On Apr 11 2022, Linus Torvalds wrote: >> For me, the failure happens in cp_compat_stat (I have a 64-bit kernel). In >> struct compat_stat in arch/x86/include/asm/compat.h, st_dev and st_rdev >> are compat_dev_t which is 16-bit. But they are followed by 16-bit >> paddings, so they could be extended. > > Ok, that actually looks like a bug. > > The compat structure should match the native structure. Those "u16 > __padX" fields seem to be just a symptom of the bug. Looks like the move to 32-bit st_[r]dev was never applied to struct compat_stat, see commit e95b206567 ("[PATCH] struct stat - support larger dev_t") from tglx/history.git. -- Andreas Schwab, schwab@linux-m68k.org GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510 2552 DF73 E780 A9DA AEC1 "And now for something completely different." ^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH] stat: fix inconsistency between struct stat and struct compat_stat 2022-04-12 5:37 ` Linus Torvalds ` (2 preceding siblings ...) 2022-04-12 8:19 ` Andreas Schwab @ 2022-04-12 9:41 ` Mikulas Patocka 2022-04-12 16:30 ` Linus Torvalds 3 siblings, 1 reply; 15+ messages in thread From: Mikulas Patocka @ 2022-04-12 9:41 UTC (permalink / raw) To: Linus Torvalds Cc: Alexander Viro, linux-fsdevel, Linux Kernel Mailing List, Matthew Wilcox On Mon, 11 Apr 2022, Linus Torvalds wrote: > On Mon, Apr 11, 2022 at 7:13 AM Mikulas Patocka <mpatocka@redhat.com> wrote: > > > > Should we perhaps hash the number, take 16 bits of the hash and hope > > than the collision won't happen? > > That would "work", but I think it would be incredibly annoying to > users with basically random results. > > I think the solution is to just put the bits in the high bits. Yes, > they might be masked off if people use 'MAJOR()' to pick them out, but > the common "compare st_dev and st_ino" model at least works. That's > the one that wants unique numbers. > > > For me, the failure happens in cp_compat_stat (I have a 64-bit kernel). In > > struct compat_stat in arch/x86/include/asm/compat.h, st_dev and st_rdev > > are compat_dev_t which is 16-bit. But they are followed by 16-bit > > paddings, so they could be extended. > > Ok, that actually looks like a bug. > > The compat structure should match the native structure. Those "u16 > __padX" fields seem to be just a symptom of the bug. > > The only user of that compat_stat structure is the kernel, so that > should just be fixed. > > Of course, who knows what the libraries have done, so user space could > still have screwed up. Here I'm sending a patch that makes struct compat_stat match struct stat. stat: fix inconsistency between struct stat and struct compat_stat struct stat (defined in arch/x86/include/uapi/asm/stat.h) has 32-bit st_dev and st_rdev; struct compat_stat (defined in arch/x86/include/asm/compat.h) has 16-bit st_dev and st_rdev followed by a 16-bit padding. This patch fixes struct compat_stat to match struct stat. Note that we can't change compat_dev_t because it is used by compat_loop_info. Also, if the st_dev and st_rdev values are 32-bit, we don't have to use old_valid_dev to test if the value fits into them. This fixes -EOVERFLOW on filesystems that are on NVMe because NVMe uses the major number 259. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> --- arch/x86/include/asm/compat.h | 6 ++---- fs/stat.c | 19 ++++++++++--------- 2 files changed, 12 insertions(+), 13 deletions(-) Index: linux-5.17.2/arch/x86/include/asm/compat.h =================================================================== --- linux-5.17.2.orig/arch/x86/include/asm/compat.h 2022-01-21 10:29:12.000000000 +0100 +++ linux-5.17.2/arch/x86/include/asm/compat.h 2022-04-12 11:27:14.000000000 +0200 @@ -28,15 +28,13 @@ typedef u16 compat_ipc_pid_t; typedef __kernel_fsid_t compat_fsid_t; struct compat_stat { - compat_dev_t st_dev; - u16 __pad1; + u32 st_dev; compat_ino_t st_ino; compat_mode_t st_mode; compat_nlink_t st_nlink; __compat_uid_t st_uid; __compat_gid_t st_gid; - compat_dev_t st_rdev; - u16 __pad2; + u32 st_rdev; u32 st_size; u32 st_blksize; u32 st_blocks; Index: linux-5.17.2/fs/stat.c =================================================================== --- linux-5.17.2.orig/fs/stat.c 2022-04-12 10:39:46.000000000 +0200 +++ linux-5.17.2/fs/stat.c 2022-04-12 10:58:28.000000000 +0200 @@ -334,9 +334,6 @@ SYSCALL_DEFINE2(fstat, unsigned int, fd, # define choose_32_64(a,b) b #endif -#define valid_dev(x) choose_32_64(old_valid_dev(x),true) -#define encode_dev(x) choose_32_64(old_encode_dev,new_encode_dev)(x) - #ifndef INIT_STRUCT_STAT_PADDING # define INIT_STRUCT_STAT_PADDING(st) memset(&st, 0, sizeof(st)) #endif @@ -345,7 +342,9 @@ static int cp_new_stat(struct kstat *sta { struct stat tmp; - if (!valid_dev(stat->dev) || !valid_dev(stat->rdev)) + if (sizeof(tmp.st_dev) < 4 && !old_valid_dev(stat->dev)) + return -EOVERFLOW; + if (sizeof(tmp.st_rdev) < 4 && !old_valid_dev(stat->rdev)) return -EOVERFLOW; #if BITS_PER_LONG == 32 if (stat->size > MAX_NON_LFS) @@ -353,7 +352,7 @@ static int cp_new_stat(struct kstat *sta #endif INIT_STRUCT_STAT_PADDING(tmp); - tmp.st_dev = encode_dev(stat->dev); + tmp.st_dev = new_encode_dev(stat->dev); tmp.st_ino = stat->ino; if (sizeof(tmp.st_ino) < sizeof(stat->ino) && tmp.st_ino != stat->ino) return -EOVERFLOW; @@ -363,7 +362,7 @@ static int cp_new_stat(struct kstat *sta return -EOVERFLOW; SET_UID(tmp.st_uid, from_kuid_munged(current_user_ns(), stat->uid)); SET_GID(tmp.st_gid, from_kgid_munged(current_user_ns(), stat->gid)); - tmp.st_rdev = encode_dev(stat->rdev); + tmp.st_rdev = new_encode_dev(stat->rdev); tmp.st_size = stat->size; tmp.st_atime = stat->atime.tv_sec; tmp.st_mtime = stat->mtime.tv_sec; @@ -644,11 +643,13 @@ static int cp_compat_stat(struct kstat * { struct compat_stat tmp; - if (!old_valid_dev(stat->dev) || !old_valid_dev(stat->rdev)) + if (sizeof(tmp.st_dev) < 4 && !old_valid_dev(stat->dev)) + return -EOVERFLOW; + if (sizeof(tmp.st_rdev) < 4 && !old_valid_dev(stat->rdev)) return -EOVERFLOW; memset(&tmp, 0, sizeof(tmp)); - tmp.st_dev = old_encode_dev(stat->dev); + tmp.st_dev = new_encode_dev(stat->dev); tmp.st_ino = stat->ino; if (sizeof(tmp.st_ino) < sizeof(stat->ino) && tmp.st_ino != stat->ino) return -EOVERFLOW; @@ -658,7 +659,7 @@ static int cp_compat_stat(struct kstat * return -EOVERFLOW; SET_UID(tmp.st_uid, from_kuid_munged(current_user_ns(), stat->uid)); SET_GID(tmp.st_gid, from_kgid_munged(current_user_ns(), stat->gid)); - tmp.st_rdev = old_encode_dev(stat->rdev); + tmp.st_rdev = new_encode_dev(stat->rdev); if ((u64) stat->size > MAX_NON_LFS) return -EOVERFLOW; tmp.st_size = stat->size; ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] stat: fix inconsistency between struct stat and struct compat_stat 2022-04-12 9:41 ` [PATCH] stat: fix inconsistency between struct stat and struct compat_stat Mikulas Patocka @ 2022-04-12 16:30 ` Linus Torvalds 2022-04-12 17:42 ` Mikulas Patocka 0 siblings, 1 reply; 15+ messages in thread From: Linus Torvalds @ 2022-04-12 16:30 UTC (permalink / raw) To: Mikulas Patocka Cc: Alexander Viro, linux-fsdevel, Linux Kernel Mailing List, Matthew Wilcox On Mon, Apr 11, 2022 at 11:41 PM Mikulas Patocka <mpatocka@redhat.com> wrote: > > Also, if the st_dev and st_rdev values are 32-bit, we don't have to use > old_valid_dev to test if the value fits into them. This fixes -EOVERFLOW > on filesystems that are on NVMe because NVMe uses the major number 259. The problem with this part of the patch is that this: > @@ -353,7 +352,7 @@ static int cp_new_stat(struct kstat *sta > #endif > > INIT_STRUCT_STAT_PADDING(tmp); > - tmp.st_dev = encode_dev(stat->dev); > + tmp.st_dev = new_encode_dev(stat->dev); completely changes the format of that st_dev field. For completely insane historical reasons, we have had the rule that - 32-bit architectures encode the device into a 16 bit value - 64-bit architectures encode the device number into a 32 bit value and that has been true *despite* the fact that the actual "st_dev" field has been 32-bit and 64-bit respectively since 2003! And it doesn't help that to confuse things even more, the _naming_ of those "encode_dev" functions is "old and new", so that logically you'd think that "cp_new_stat()" would use "new_encode_dev()". Nope. So on 32-bit architectures, cp_new_stat() uses "old_encode_dev()", which historically put the minor number in bits 0..7, and the major number in bits 8..15. End result: on a 32-bit system (or in the compat syscall mode), changing to new_encode_dev() would confuse anybody (like just "ls -l /dev") that uses that old stat call and tries to print out major/minor numbers. Now,. the good news is that (a) nobody should use that old stat call, since the new world order is called "stat64" and has been for a loooong time - also since at least 2003) (b) we could just hide the bits in upper bits instead. So what I suggest we do is to make old_encode_dev() put the minor bits in bits 0..7 _and_ 16..23, and the major bits in 8..15 _and_ 24..32. And then the -EOVERFLOW should be something like unsigned int st_dev = encode_dev(stat->dev); tmp.st_dev = st_dev; if (st_dev != tmp.st_dev) return -EOVERFLOW; for the lcase that tmp.st_dev is actually 16-bit (ie the compat case for some architecture where the padding wasn't there?) NOTE: That will still screw up 'ls -l' output, but only for the devices that previously would have returned -EOVERFLOW. And it will make anybopdy who does that "stat1->st_dev == stat2->st_dev && ino == ino2" thing for testing "same inode" work just fine. Linus ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] stat: fix inconsistency between struct stat and struct compat_stat 2022-04-12 16:30 ` Linus Torvalds @ 2022-04-12 17:42 ` Mikulas Patocka 2022-04-12 23:25 ` Linus Torvalds 0 siblings, 1 reply; 15+ messages in thread From: Mikulas Patocka @ 2022-04-12 17:42 UTC (permalink / raw) To: Linus Torvalds Cc: Alexander Viro, linux-fsdevel, Linux Kernel Mailing List, Matthew Wilcox On Tue, 12 Apr 2022, Linus Torvalds wrote: > On Mon, Apr 11, 2022 at 11:41 PM Mikulas Patocka <mpatocka@redhat.com> wrote: > > > > Also, if the st_dev and st_rdev values are 32-bit, we don't have to use > > old_valid_dev to test if the value fits into them. This fixes -EOVERFLOW > > on filesystems that are on NVMe because NVMe uses the major number 259. > > The problem with this part of the patch is that this: > > > @@ -353,7 +352,7 @@ static int cp_new_stat(struct kstat *sta > > #endif > > > > INIT_STRUCT_STAT_PADDING(tmp); > > - tmp.st_dev = encode_dev(stat->dev); > > + tmp.st_dev = new_encode_dev(stat->dev); > > completely changes the format of that st_dev field. we have these definitions: static __always_inline u16 old_encode_dev(dev_t dev) { return (MAJOR(dev) << 8) | MINOR(dev); } static __always_inline u32 new_encode_dev(dev_t dev) { unsigned major = MAJOR(dev); unsigned minor = MINOR(dev); return (minor & 0xff) | (major << 8) | ((minor & ~0xff) << 12); } As long as both major and minor numbers are less than 256, these functions return equivalent results. So, I think it's safe to replace old_encode_dev with new_encode_dev. old_encode_dev shouldn't be called with minor >= 256, because it blends the upper minor bits into the major field - the kernel doesn't do this and checks the value with old_valid_dev before calling old_encode_dev. But when old_valid_dev returns true, it doesn't matter if you use old_encode_dev or new_encode_dev - both give equivalent results. When I tested it, both gcc and openwatcom return st_dev 0x10301, which is the expected value (the NVMe device has major 259 and minor 1). > (b) we could just hide the bits in upper bits instead. > > So what I suggest we do is to make old_encode_dev() put the minor bits > in bits 0..7 _and_ 16..23, and the major bits in 8..15 _and_ 24..32. new_encode_dev puts the minor value into bits 0..7, 20..31 and the major value into bits 8..19 So, we can use this instead of inventing a new format. Mikulas ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] stat: fix inconsistency between struct stat and struct compat_stat 2022-04-12 17:42 ` Mikulas Patocka @ 2022-04-12 23:25 ` Linus Torvalds 0 siblings, 0 replies; 15+ messages in thread From: Linus Torvalds @ 2022-04-12 23:25 UTC (permalink / raw) To: Mikulas Patocka Cc: Alexander Viro, linux-fsdevel, Linux Kernel Mailing List, Matthew Wilcox On Tue, Apr 12, 2022 at 7:42 AM Mikulas Patocka <mpatocka@redhat.com> wrote: > > As long as both major and minor numbers are less than 256, these functions > return equivalent results. So, I think it's safe to replace old_encode_dev > with new_encode_dev. You are of course 100% right, and I should have looked more closely at the code rather than going by my (broken) assumptions based on old memory of what we did when we did that "new" stat expansion. I take back all my objections that were completely bogus. Linus ^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2022-04-12 23:41 UTC | newest] Thread overview: 15+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2022-04-11 14:43 [PATCH] stat: don't fail if the major number is >= 256 Mikulas Patocka 2022-04-11 15:03 ` Matthew Wilcox 2022-04-11 15:16 ` Christoph Hellwig 2022-04-11 16:30 ` Linus Torvalds 2022-04-11 17:13 ` Mikulas Patocka 2022-04-12 5:37 ` Linus Torvalds 2022-04-12 5:41 ` Christoph Hellwig 2022-04-12 5:47 ` Linus Torvalds 2022-04-12 5:52 ` Christoph Hellwig 2022-04-12 6:49 ` Linus Torvalds 2022-04-12 8:19 ` Andreas Schwab 2022-04-12 9:41 ` [PATCH] stat: fix inconsistency between struct stat and struct compat_stat Mikulas Patocka 2022-04-12 16:30 ` Linus Torvalds 2022-04-12 17:42 ` Mikulas Patocka 2022-04-12 23:25 ` Linus Torvalds
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox