* XFS Bug Report @ 2017-08-22 21:10 Ruoxin Jiang 2017-08-22 22:33 ` Eric Sandeen 0 siblings, 1 reply; 8+ messages in thread From: Ruoxin Jiang @ 2017-08-22 21:10 UTC (permalink / raw) To: linux-xfs; +Cc: Suman Jana [-- Attachment #1: Type: text/plain, Size: 934 bytes --] Hello, We are researchers from Columbia University, New York. As part of our current research we have run into some issues using xfs filesystem. For example, we encountered a case where the setuid bit of a modified executable was not removed as desired. We have attached three cases, one involving the setuid bit in directory 1, and a data corruption issue in directory 2. In directory 3 is a semantic discrepancy we discovered between xfs and other popular filesystems (ext4/btrfs) regarding the setgid bit. In each directory, you will find a Readme describing the issue and pointing to the code that may lead to the issue. For your convenience, we also included test programs (min.cpp) and instructions in Readme to help reproduce the issues. We tried to report them on bugzilla@oss.sig.com but the website seems to be down. We would appreciate very much if you could confirm the three cases at your conveniences. Thanks, Amy [-- Attachment #2: xfs_issues.tar.gz --] [-- Type: application/x-gzip, Size: 2550 bytes --] ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: XFS Bug Report 2017-08-22 21:10 XFS Bug Report Ruoxin Jiang @ 2017-08-22 22:33 ` Eric Sandeen 2017-08-22 22:43 ` Eric Sandeen 2017-08-22 23:28 ` Dave Chinner 0 siblings, 2 replies; 8+ messages in thread From: Eric Sandeen @ 2017-08-22 22:33 UTC (permalink / raw) To: Ruoxin Jiang, linux-xfs; +Cc: Suman Jana On 8/22/17 4:10 PM, Ruoxin Jiang wrote: > Hello, > > We are researchers from Columbia University, New York. As part of our > current research we have run into some issues using xfs filesystem. > For example, we encountered a case where the setuid bit of a modified > executable was not removed as desired. Thanks for the reports. For case 1: > * In xfs, function `xfs_file_aio_write_checks` calls `file_remove_privs` which > removes special file priviledges when the executable is written to or truncated. > > * If an DIO write is not aligned to device logical sector size, the syscall > fails with EINVAL` before `xfs_file_aio_write_checks` is called. Thus the setuid > bit will not be removed from the modified executable. If the write did not happen, why should the setuid bit be removed? The write failed and the file's contents were not modified. It seems like xfs's behavior makes sense; is there a posix specification that indicates the behavior should be different? Case 2 does look problematic (failed mmap write extending the file EOF) Case 3 seems to show xfs violating this rule, I guess? When the owner or group of an executable file are changed by a non-superuser, the S_ISUID and S_ISGID mode bits are cleared. POSIX does not specify whether this also should happen when root does the chown(); the Linux behavior depends on the kernel ver- sion. In case of a non-group-executable file (i.e., one for which the S_IXGRP bit is not set) the S_ISGID bit indicates mandatory locking, and is not cleared by a chown(). I thought this was all handled at the vfs, though, odd. -Eric ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: XFS Bug Report 2017-08-22 22:33 ` Eric Sandeen @ 2017-08-22 22:43 ` Eric Sandeen 2017-08-22 23:28 ` Dave Chinner 1 sibling, 0 replies; 8+ messages in thread From: Eric Sandeen @ 2017-08-22 22:43 UTC (permalink / raw) To: Ruoxin Jiang, linux-xfs; +Cc: Suman Jana On 8/22/17 5:33 PM, Eric Sandeen wrote: > On 8/22/17 4:10 PM, Ruoxin Jiang wrote: >> Hello, >> >> We are researchers from Columbia University, New York. As part of our >> current research we have run into some issues using xfs filesystem. >> For example, we encountered a case where the setuid bit of a modified >> executable was not removed as desired. > > Thanks for the reports. > > For case 1: > >> * In xfs, function `xfs_file_aio_write_checks` calls `file_remove_privs` which >> removes special file priviledges when the executable is written to or truncated. >> >> * If an DIO write is not aligned to device logical sector size, the syscall >> fails with EINVAL` before `xfs_file_aio_write_checks` is called. Thus the setuid >> bit will not be removed from the modified executable. > > If the write did not happen, why should the setuid bit be removed? > The write failed and the file's contents were not modified. It seems > like xfs's behavior makes sense; is there a posix specification that > indicates the behavior should be different? In fact, POSIX says: "/Upon successful completion/ where nbyte is greater than 0, write() shall mark for update the st_ctime and st_mtime fields of the file, and if the file is a regular file, the S_ISUID and S_ISGID bits of the file mode may be cleared." (emphasis mine ... in your case, the write did not successfully complete, and 0 bytes were written) -Eric ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: XFS Bug Report 2017-08-22 22:33 ` Eric Sandeen 2017-08-22 22:43 ` Eric Sandeen @ 2017-08-22 23:28 ` Dave Chinner 2017-08-23 2:20 ` Eric Sandeen 2017-08-23 16:35 ` Darrick J. Wong 1 sibling, 2 replies; 8+ messages in thread From: Dave Chinner @ 2017-08-22 23:28 UTC (permalink / raw) To: Eric Sandeen; +Cc: Ruoxin Jiang, linux-xfs, Suman Jana On Tue, Aug 22, 2017 at 05:33:24PM -0500, Eric Sandeen wrote: > On 8/22/17 4:10 PM, Ruoxin Jiang wrote: > > Hello, > > > > We are researchers from Columbia University, New York. As part of our > > current research we have run into some issues using xfs filesystem. > > For example, we encountered a case where the setuid bit of a modified > > executable was not removed as desired. > > Thanks for the reports. > > For case 1: > > > * In xfs, function `xfs_file_aio_write_checks` calls `file_remove_privs` which > > removes special file priviledges when the executable is written to or truncated. > > > > * If an DIO write is not aligned to device logical sector size, the syscall > > fails with EINVAL` before `xfs_file_aio_write_checks` is called. Thus the setuid > > bit will not be removed from the modified executable. > > If the write did not happen, why should the setuid bit be removed? > The write failed and the file's contents were not modified. It seems > like xfs's behavior makes sense; is there a posix specification that > indicates the behavior should be different? > > Case 2 does look problematic (failed mmap write extending the file EOF) mmap should not be able to extend the file at all. It should segv if an attempt to write past EOF occurs. > Case 3 seems to show xfs violating this rule, I guess? > > When the owner or group of an executable file are changed by a > non-superuser, the S_ISUID and S_ISGID mode bits are cleared. > POSIX does not specify whether this also should happen when root > does the chown(); the Linux behavior depends on the kernel ver- > sion. In case of a non-group-executable file (i.e., one for > which the S_IXGRP bit is not set) the S_ISGID bit indicates > mandatory locking, and is not cleared by a chown(). > > I thought this was all handled at the vfs, though, odd. xfs_setattr_nonsize() does: /* * CAP_FSETID overrides the following restrictions: * * The set-user-ID and set-group-ID bits of a file will be * cleared upon successful return from chown() */ if ((inode->i_mode & (S_ISUID|S_ISGID)) && !capable(CAP_FSETID)) inode->i_mode &= ~(S_ISUID|S_ISGID); Whereas the VFS has this check in should_remove_suid(): /* * sgid without any exec bits is just a mandatory locking mark; leave * it alone. If some exec bits are set, it's a real sgid; kill it. */ if (unlikely((mode & S_ISGID) && (mode & S_IXGRP))) kill |= ATTR_KILL_SGID; if (unlikely(kill && !capable(CAP_FSETID) && S_ISREG(mode))) return kill; So the VFS is doing the right thing but the XFS code does it's own thing for reasons I don't remember. Given the VFS handles this, maybe we should finally remove it from the XFS code? Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: XFS Bug Report 2017-08-22 23:28 ` Dave Chinner @ 2017-08-23 2:20 ` Eric Sandeen 2017-08-23 16:35 ` Darrick J. Wong 1 sibling, 0 replies; 8+ messages in thread From: Eric Sandeen @ 2017-08-23 2:20 UTC (permalink / raw) To: Dave Chinner; +Cc: Ruoxin Jiang, linux-xfs, Suman Jana On 8/22/17 6:28 PM, Dave Chinner wrote: > On Tue, Aug 22, 2017 at 05:33:24PM -0500, Eric Sandeen wrote: ... >> Case 3 seems to show xfs violating this rule, I guess? >> >> When the owner or group of an executable file are changed by a >> non-superuser, the S_ISUID and S_ISGID mode bits are cleared. >> POSIX does not specify whether this also should happen when root >> does the chown(); the Linux behavior depends on the kernel ver- >> sion. In case of a non-group-executable file (i.e., one for >> which the S_IXGRP bit is not set) the S_ISGID bit indicates >> mandatory locking, and is not cleared by a chown(). >> >> I thought this was all handled at the vfs, though, odd. > > xfs_setattr_nonsize() does: > > /* > * CAP_FSETID overrides the following restrictions: > * > * The set-user-ID and set-group-ID bits of a file will be > * cleared upon successful return from chown() > */ > if ((inode->i_mode & (S_ISUID|S_ISGID)) && > !capable(CAP_FSETID)) > inode->i_mode &= ~(S_ISUID|S_ISGID); > > Whereas the VFS has this check in should_remove_suid(): > > /* > * sgid without any exec bits is just a mandatory locking mark; leave > * it alone. If some exec bits are set, it's a real sgid; kill it. > */ > if (unlikely((mode & S_ISGID) && (mode & S_IXGRP))) > kill |= ATTR_KILL_SGID; > > if (unlikely(kill && !capable(CAP_FSETID) && S_ISREG(mode))) > return kill; Thanks, not sure why I didn't find that. > So the VFS is doing the right thing but the XFS code does it's own > thing for reasons I don't remember. Given the VFS handles this, maybe > we should finally remove it from the XFS code? Yep, probably another remnant of irix code not having a vfs to rely on as much? And at least one version of POSIX says: "If the specified file is a regular file, one or more of the S_IXUSR, S_IXGRP, or S_IXOTH bits of the file mode are set, and the process has appropriate privileges, it is implementation-defined whether the set-user-ID and set-group-ID bits are altered." Sooo I suppose that was just the Irix implementation ;) I wonder if LTP tests this, I would have expected it to. -Eric ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: XFS Bug Report 2017-08-22 23:28 ` Dave Chinner 2017-08-23 2:20 ` Eric Sandeen @ 2017-08-23 16:35 ` Darrick J. Wong 2017-08-23 23:59 ` Dave Chinner 1 sibling, 1 reply; 8+ messages in thread From: Darrick J. Wong @ 2017-08-23 16:35 UTC (permalink / raw) To: Dave Chinner; +Cc: Eric Sandeen, Ruoxin Jiang, linux-xfs, Suman Jana On Wed, Aug 23, 2017 at 09:28:31AM +1000, Dave Chinner wrote: > On Tue, Aug 22, 2017 at 05:33:24PM -0500, Eric Sandeen wrote: > > On 8/22/17 4:10 PM, Ruoxin Jiang wrote: > > > Hello, > > > > > > We are researchers from Columbia University, New York. As part of our > > > current research we have run into some issues using xfs filesystem. > > > For example, we encountered a case where the setuid bit of a modified > > > executable was not removed as desired. > > > > Thanks for the reports. > > > > For case 1: > > > > > * In xfs, function `xfs_file_aio_write_checks` calls `file_remove_privs` which > > > removes special file priviledges when the executable is written to or truncated. > > > > > > * If an DIO write is not aligned to device logical sector size, the syscall > > > fails with EINVAL` before `xfs_file_aio_write_checks` is called. Thus the setuid > > > bit will not be removed from the modified executable. > > > > If the write did not happen, why should the setuid bit be removed? > > The write failed and the file's contents were not modified. It seems > > like xfs's behavior makes sense; is there a posix specification that > > indicates the behavior should be different? > > > > Case 2 does look problematic (failed mmap write extending the file EOF) > > mmap should not be able to extend the file at all. It should segv > if an attempt to write past EOF occurs. const char *filename = "file0"; r[0] = creat(filename, 00666); r[1] = write(r[0], "1234567890", 10); printf("write()=%ld, errno=%d\n", r[1], errno); Ok, so we create "file0" as r[0] and write 10 bytes... r[2] = creat(filename, 00444); ...then open again as r[2] and truncate it... r[3] = write(r[2], "A", 1); ...and write a single byte to the truncated file. i_size is now 1, and we have two file descriptions r0 and r2 both pointing to it. r0's position is 10 and r2's position is 1. printf("write()=%ld, errno=%d\n", r[3], errno); r[4] = syscall(__NR_mmap, 0x20000000ul, 0xc000ul, 0x0ul, 0x32ul, 0xfffffffffffffffful, 0x0ul, 0, 0, 0); Create a 48k mapping at 0x20000000 to some anonymous memory... with PROT_NONE. r[5] = syscall(__NR_write, r[0], 0x2000b000ul, 0x1ul, 0, 0, 0, 0, 0, 0); printf("write()=%ld, errno=%d\n", r[5], errno); Then try to write 1 byte to r0 (whose pos is 10) from this anonymous region, which bails out with EFAULT because the region is PROT_NONE. This is effectively pwrite(r[0], <unreadable>, 1, 10); Therefore, it's not an mmap write that extends the file length, it's the write call that moves out EOF in preparation for the write, only to have the buffered write fail because the process doesn't have read access to the memory buffer. Not sure what to do about this corner case, though -- I guess you could check that at least the first byte of the write will succeed, and only then call xfs_zero_eof. Could still be a TOCTOU race though. Thoughts? --D > > > Case 3 seems to show xfs violating this rule, I guess? > > > > When the owner or group of an executable file are changed by a > > non-superuser, the S_ISUID and S_ISGID mode bits are cleared. > > POSIX does not specify whether this also should happen when root > > does the chown(); the Linux behavior depends on the kernel ver- > > sion. In case of a non-group-executable file (i.e., one for > > which the S_IXGRP bit is not set) the S_ISGID bit indicates > > mandatory locking, and is not cleared by a chown(). > > > > I thought this was all handled at the vfs, though, odd. > > xfs_setattr_nonsize() does: > > /* > * CAP_FSETID overrides the following restrictions: > * > * The set-user-ID and set-group-ID bits of a file will be > * cleared upon successful return from chown() > */ > if ((inode->i_mode & (S_ISUID|S_ISGID)) && > !capable(CAP_FSETID)) > inode->i_mode &= ~(S_ISUID|S_ISGID); > > Whereas the VFS has this check in should_remove_suid(): > > /* > * sgid without any exec bits is just a mandatory locking mark; leave > * it alone. If some exec bits are set, it's a real sgid; kill it. > */ > if (unlikely((mode & S_ISGID) && (mode & S_IXGRP))) > kill |= ATTR_KILL_SGID; > > if (unlikely(kill && !capable(CAP_FSETID) && S_ISREG(mode))) > return kill; > > So the VFS is doing the right thing but the XFS code does it's own > thing for reasons I don't remember. Given the VFS handles this, maybe > we should finally remove it from the XFS code? > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: XFS Bug Report 2017-08-23 16:35 ` Darrick J. Wong @ 2017-08-23 23:59 ` Dave Chinner 2017-08-24 0:26 ` Darrick J. Wong 0 siblings, 1 reply; 8+ messages in thread From: Dave Chinner @ 2017-08-23 23:59 UTC (permalink / raw) To: Darrick J. Wong; +Cc: Eric Sandeen, Ruoxin Jiang, linux-xfs, Suman Jana On Wed, Aug 23, 2017 at 09:35:43AM -0700, Darrick J. Wong wrote: > On Wed, Aug 23, 2017 at 09:28:31AM +1000, Dave Chinner wrote: > > On Tue, Aug 22, 2017 at 05:33:24PM -0500, Eric Sandeen wrote: > > > On 8/22/17 4:10 PM, Ruoxin Jiang wrote: > > > > Hello, > > > > > > > > We are researchers from Columbia University, New York. As part of our > > > > current research we have run into some issues using xfs filesystem. > > > > For example, we encountered a case where the setuid bit of a modified > > > > executable was not removed as desired. > > > > > > Thanks for the reports. > > > > > > For case 1: > > > > > > > * In xfs, function `xfs_file_aio_write_checks` calls `file_remove_privs` which > > > > removes special file priviledges when the executable is written to or truncated. > > > > > > > > * If an DIO write is not aligned to device logical sector size, the syscall > > > > fails with EINVAL` before `xfs_file_aio_write_checks` is called. Thus the setuid > > > > bit will not be removed from the modified executable. > > > > > > If the write did not happen, why should the setuid bit be removed? > > > The write failed and the file's contents were not modified. It seems > > > like xfs's behavior makes sense; is there a posix specification that > > > indicates the behavior should be different? > > > > > > Case 2 does look problematic (failed mmap write extending the file EOF) > > > > mmap should not be able to extend the file at all. It should segv > > if an attempt to write past EOF occurs. .... > Create a 48k mapping at 0x20000000 to some anonymous memory... with > PROT_NONE. > > r[5] = syscall(__NR_write, r[0], 0x2000b000ul, 0x1ul, 0, 0, 0, 0, 0, 0); > printf("write()=%ld, errno=%d\n", r[5], errno); > > Then try to write 1 byte to r0 (whose pos is 10) from this anonymous > region, which bails out with EFAULT because the region is PROT_NONE. > This is effectively pwrite(r[0], <unreadable>, 1, 10); My bad - I didn't look at the code for that case, I was just commenting on Eric's statement about mmap extending the file. > Therefore, it's not an mmap write that extends the file length, it's the > write call that moves out EOF in preparation for the write, only to have > the buffered write fail because the process doesn't have read access to > the memory buffer. > > Not sure what to do about this corner case, though -- I guess you could > check that at least the first byte of the write will succeed, and only > then call xfs_zero_eof. Could still be a TOCTOU race though. > > Thoughts? The zeroing succeeded, yes? So there's no stale data exposure at all? Given it's such a whacky corner case, do we really care what happens to the file size on error as long as we don't expose stale data? Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: XFS Bug Report 2017-08-23 23:59 ` Dave Chinner @ 2017-08-24 0:26 ` Darrick J. Wong 0 siblings, 0 replies; 8+ messages in thread From: Darrick J. Wong @ 2017-08-24 0:26 UTC (permalink / raw) To: Dave Chinner; +Cc: Eric Sandeen, Ruoxin Jiang, linux-xfs, Suman Jana On Thu, Aug 24, 2017 at 09:59:46AM +1000, Dave Chinner wrote: > On Wed, Aug 23, 2017 at 09:35:43AM -0700, Darrick J. Wong wrote: > > On Wed, Aug 23, 2017 at 09:28:31AM +1000, Dave Chinner wrote: > > > On Tue, Aug 22, 2017 at 05:33:24PM -0500, Eric Sandeen wrote: > > > > On 8/22/17 4:10 PM, Ruoxin Jiang wrote: > > > > > Hello, > > > > > > > > > > We are researchers from Columbia University, New York. As part of our > > > > > current research we have run into some issues using xfs filesystem. > > > > > For example, we encountered a case where the setuid bit of a modified > > > > > executable was not removed as desired. > > > > > > > > Thanks for the reports. > > > > > > > > For case 1: > > > > > > > > > * In xfs, function `xfs_file_aio_write_checks` calls `file_remove_privs` which > > > > > removes special file priviledges when the executable is written to or truncated. > > > > > > > > > > * If an DIO write is not aligned to device logical sector size, the syscall > > > > > fails with EINVAL` before `xfs_file_aio_write_checks` is called. Thus the setuid > > > > > bit will not be removed from the modified executable. > > > > > > > > If the write did not happen, why should the setuid bit be removed? > > > > The write failed and the file's contents were not modified. It seems > > > > like xfs's behavior makes sense; is there a posix specification that > > > > indicates the behavior should be different? > > > > > > > > Case 2 does look problematic (failed mmap write extending the file EOF) > > > > > > mmap should not be able to extend the file at all. It should segv > > > if an attempt to write past EOF occurs. > > .... > > Create a 48k mapping at 0x20000000 to some anonymous memory... with > > PROT_NONE. > > > > r[5] = syscall(__NR_write, r[0], 0x2000b000ul, 0x1ul, 0, 0, 0, 0, 0, 0); > > printf("write()=%ld, errno=%d\n", r[5], errno); > > > > Then try to write 1 byte to r0 (whose pos is 10) from this anonymous > > region, which bails out with EFAULT because the region is PROT_NONE. > > This is effectively pwrite(r[0], <unreadable>, 1, 10); > > My bad - I didn't look at the code for that case, I was just > commenting on Eric's statement about mmap extending the file. > > > Therefore, it's not an mmap write that extends the file length, it's the > > write call that moves out EOF in preparation for the write, only to have > > the buffered write fail because the process doesn't have read access to > > the memory buffer. > > > > Not sure what to do about this corner case, though -- I guess you could > > check that at least the first byte of the write will succeed, and only > > then call xfs_zero_eof. Could still be a TOCTOU race though. > > > > Thoughts? > > The zeroing succeeded, yes? So there's no stale data exposure at > all? Correct. > Given it's such a whacky corner case, do we really care what > happens to the file size on error as long as we don't expose stale > data? I'd just as soon focus on sharper problems, so I'll just throw out the usual "patches welcomed" and see if anyone feels motivated to fix this. :) --D > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2017-08-24 0:27 UTC | newest] Thread overview: 8+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2017-08-22 21:10 XFS Bug Report Ruoxin Jiang 2017-08-22 22:33 ` Eric Sandeen 2017-08-22 22:43 ` Eric Sandeen 2017-08-22 23:28 ` Dave Chinner 2017-08-23 2:20 ` Eric Sandeen 2017-08-23 16:35 ` Darrick J. Wong 2017-08-23 23:59 ` Dave Chinner 2017-08-24 0:26 ` Darrick J. Wong
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).