[PATCH] ioctl_getfsmap.2: document the GETFSMAP ioctl

* [PATCH] ioctl_getfsmap.2: document the GETFSMAP ioctl
       [not found] <148738063792.29384.10681837280402457846.stgit@birch.djwong.org>
@ 2017-02-21 22:14 ` Darrick J. Wong
  0 siblings, 0 replies; 7+ messages in thread
From: Darrick J. Wong @ 2017-02-21 22:14 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel, linux-ext4; +Cc: linux-api, linux-man, linux-btrfs

Document the new GETFSMAP ioctl that returns the physical layout of a
(disk-based) filesystem.  This time around the fs-specific parts have
been moved to a separate section; I'll move them into separate
xfsprogs/e2fsprogs manpages when we get closer to landing the ioctl.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 man2/ioctl_getfsmap.2 |  359 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 359 insertions(+)
 create mode 100644 man2/ioctl_getfsmap.2

diff --git a/man2/ioctl_getfsmap.2 b/man2/ioctl_getfsmap.2
new file mode 100644
index 0000000..7121d61
--- /dev/null
+++ b/man2/ioctl_getfsmap.2
@@ -0,0 +1,359 @@
+.\" Copyright (c) 2017, Oracle.  All rights reserved.
+.\"
+.\" %%%LICENSE_START(GPLv2+_DOC_FULL)
+.\" This is free documentation; you can redistribute it and/or
+.\" modify it under the terms of the GNU General Public License as
+.\" published by the Free Software Foundation; either version 2 of
+.\" the License, or (at your option) any later version.
+.\"
+.\" The GNU General Public License's references to "object code"
+.\" and "executables" are to be interpreted as the output of any
+.\" document formatting or typesetting system, including
+.\" intermediate and printed output.
+.\"
+.\" This manual is distributed in the hope that it will be useful,
+.\" but WITHOUT ANY WARRANTY; without even the implied warranty of
+.\" MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+.\" GNU General Public License for more details.
+.\"
+.\" You should have received a copy of the GNU General Public
+.\" License along with this manual; if not, see
+.\" <http://www.gnu.org/licenses/>.
+.\" %%%LICENSE_END
+.TH IOCTL-GETFSMAP 2 2017-02-10 "Linux" "Linux Programmer's Manual"
+.SH NAME
+ioctl_getfsmap \- retrieve the physical layout of the filesystem
+.SH SYNOPSIS
+.br
+.B #include <sys/ioctl.h>
+.br
+.B #include <linux/fs.h>
+.br
+.B #include <linux/fsmap.h>
+.sp
+.BI "int ioctl(int " fd ", GETFSMAP, struct fsmap_head * " arg );
+.SH DESCRIPTION
+This
+.BR ioctl (2)
+retrieves physical extent mappings for a filesystem.
+This information can be used to discover which files are mapped to a physical
+block, examine free space, or find known bad blocks, among other things.
+
+The sole argument to this ioctl should be a pointer to a single
+.BR "struct fsmap_head" ":"
+.in +4n
+.nf
+
+struct fsmap {
+	__u32		fmr_device;	/* device id */
+	__u32		fmr_flags;	/* mapping flags */
+	__u64		fmr_physical;	/* device offset of segment */
+	__u64		fmr_owner;	/* owner id */
+	__u64		fmr_offset;	/* file offset of segment */
+	__u64		fmr_length;	/* length of segment */
+	__u64		fmr_reserved[3];	/* must be zero */
+};
+
+struct fsmap_head {
+	__u32		fmh_iflags;	/* control flags */
+	__u32		fmh_oflags;	/* output flags */
+	__u32		fmh_count;	/* # of entries in array incl. input */
+	__u32		fmh_entries;	/* # of entries filled in (output). */
+	__u64		fmh_reserved[6];	/* must be zero */
+
+	struct fsmap	fmh_keys[2];	/* low and high keys for the mapping search */
+	struct fsmap	fmh_recs[];	/* returned records */
+};
+
+.fi
+.in
+The two
+.I fmh_keys
+array elements specify the lowest and highest reverse-mapping
+keys, respectively, for which userspace would like physical mapping
+information.
+A reverse mapping key consists of the tuple (device, block, owner, offset).
+The owner and offset fields are part of the key because some filesystems
+support sharing physical blocks between multiple files and
+therefore may return multiple mappings for a given physical block.
+.PP
+Filesystem mappings are copied into the
+.I fmh_recs
+array, which immediately follows the header data.
+.SS Fields of struct fsmap_head
+.PP
+The
+.I fmh_iflags
+field is a bitmask passed to the kernel to alter the output.
+There are no flags defined, so this value must be zero.
+
+.PP
+The
+.I fmh_oflags
+field is a bitmask of flags that concern all output mappings.
+If
+.B FMH_OF_DEV_T
+is set, then the
+.I fmr_device
+field represents a
+.B dev_t
+structure containing the major and minor numbers of the block device.
+
+.PP
+The
+.I fmh_count
+field contains the number of elements in the array being passed to the
+kernel.
+If this value is 0,
+.I fmh_entries
+will be set to the number of records that would have been returned had
+the array been large enough;
+no mapping information will be returned.
+
+.PP
+The
+.I fmh_entries
+field contains the number of elements in the
+.I fmh_recs
+array that contain useful information.
+
+.PP
+The
+.I fmh_reserved
+fields must be set to zero.
+
+.SS Keys
+.PP
+The two key records in
+.B fsmap_head.fmh_keys
+specify the lowest and highest extent records in the keyspace that the caller
+wants returned.
+A filesystem that can share blocks between files likely requires the tuple
+.RI "(" "device" ", " "physical" ", " "owner" ", " "offset" ", " "flags" ")"
+to uniquely index any filesystem mapping record.
+Classic non-sharing filesystems might be able to identify any record with only
+.RI "(" "device" ", " "physical" ", " "flags" ")."
+For example, if the low key is set to (0, 36864, 0, 0, 0), the filesystem will
+only return records for extents starting at or above 36KiB on disk.
+If the high key is set to (0, 1048576, 0, 0, 0), only records below 1MiB will
+be returned.
+By convention, the field
+.B fsmap_head.fmh_keys[0]
+must contain the low key and
+.B fsmap_head.fmh_keys[1]
+must contain the high key for the request.
+.PP
+For convenience, if
+.B fmr_length
+is set in the low key, it will be added to
+.IR fmr_physical " or " fmr_offset
+as appropriate.
+The caller can take advantage of this subtlety to set up subsequent calls
+by copying
+.B fsmap_head.fmh_recs[fsmap_head.fmh_entries - 1]
+into the low key.
+The function
+.B fsmap_advance
+provides this functionality.
+
+.SS Fields of struct fsmap
+.PP
+The
+.I fmr_device
+field contains a 32-bit cookie to uniquely identify the underlying storage
+device.
+If the
+.B FMH_OF_DEV_T
+flag is set in the header's
+.I fmh_oflags
+field, this field contains a
+.B dev_t
+from which major and minor numbers can be extracted.
+If the flag is not set, this field contains a value that must be unique
+for each unique storage device.
+
+.PP
+The
+.I fmr_physical
+field contains the disk address of the extent in bytes.
+
+.PP
+The
+.I fmr_owner
+field contains the owner of the extent.
+This is an inode number unless
+.B FMR_OF_SPECIAL_OWNER
+is set in the
+.I fmr_flags
+field, in which case the value is determined by the filesystem.
+See the section below about special owner values for more details.
+
+.PP
+The
+.I fmr_offset
+field contains the logical address in the mapping record in bytes.
+This field has no meaning if the
+.BR FMR_OF_SPECIAL_OWNER " or " FMR_OF_EXTENT_MAP
+flags are set in
+.IR fmr_flags "."
+
+.PP
+The
+.I fmr_length
+field contains the length of the extent in bytes.
+
+.PP
+The
+.I fmr_flags
+field is a bitmask of extent state flags.
+The bits are:
+.RS 0.4i
+.TP
+.B FMR_OF_PREALLOC
+The extent is allocated but not yet written.
+.TP
+.B FMR_OF_ATTR_FORK
+This extent contains extended attribute data.
+.TP
+.B FMR_OF_EXTENT_MAP
+This extent contains extent map information for the owner.
+.TP
+.B FMR_OF_SHARED
+Parts of this extent may be shared.
+.TP
+.B FMR_OF_SPECIAL_OWNER
+The
+.I fmr_owner
+field contains a special value instead of an inode number.
+.TP
+.B FMR_OF_LAST
+This is the last record in the filesystem.
+.RE
+
+.PP
+The
+.I fmr_reserved
+field will be set to zero.
+
+.SS Special Owner Values
+The following special owner values are generic to all filesystems:
+.RS 0.4i
+.TP
+.B FMR_OWN_FREE
+Free space.
+.TP
+.B FMR_OWN_UNKNOWN
+This extent is in use but its owner is not known.
+.TP
+.B FMR_OWN_METADATA
+This extent is filesystem metadata.
+.RE
+
+XFS can return the following special owner values:
+.RS 0.4i
+.TP
+.B XFS_FMR_OWN_FREE
+Free space.
+.TP
+.B XFS_FMR_OWN_UNKNOWN
+This extent is in use but its owner is not known.
+.TP
+.B XFS_FMR_OWN_FS
+Static filesystem metadata which exists at a fixed address.
+These are the AG superblock, the AGF, the AGFL, and the AGI headers.
+.TP
+.B XFS_FMR_OWN_LOG
+The filesystem journal.
+.TP
+.B XFS_FMR_OWN_AG
+Allocation group metadata, such as the free space btrees and the
+reverse mapping btrees.
+.TP
+.B XFS_FMR_OWN_INOBT
+The inode and free inode btrees.
+.TP
+.B XFS_FMR_OWN_INODES
+Inode records.
+.TP
+.B XFS_FMR_OWN_REFC
+Reference count information.
+.TP
+.B XFS_FMR_OWN_COW
+This extent is being used to stage a copy-on-write.
+.TP
+.B XFS_FMR_OWN_DEFECTIVE
+This extent has been marked defective either by the filesystem or the
+underlying device.
+.RE
+
+ext4 can return the following special owner values:
+.RS 0.4i
+.TP
+.B EXT4_FMR_OWN_FREE
+Free space.
+.TP
+.B EXT4_FMR_OWN_UNKNOWN
+This extent is in use but its owner is not known.
+.TP
+.B EXT4_FMR_OWN_FS
+Static filesystem metadata which exists at a fixed address.
+This is the superblock and the group descriptors.
+.TP
+.B EXT4_FMR_OWN_LOG
+The filesystem journal.
+.TP
+.B EXT4_FMR_OWN_INODES
+Inode records.
+.TP
+.B EXT4_FMR_OWN_BLKBM
+Block bitmap.
+.TP
+.B EXT4_FMR_OWN_INOBM
+Inode bitmap.
+.RE
+
+.SH RETURN VALUE
+On error, \-1 is returned, and
+.I errno
+is set to indicate the error.
+.PP
+.SH ERRORS
+Error codes can be one of, but are not limited to, the following:
+.TP
+.B EINVAL
+The array is not long enough, or a non-zero value was passed in one of the
+fields that must be zero.
+.TP
+.B EFAULT
+The pointer passed in was not mapped to a valid memory address.
+.TP
+.B EBADF
+.IR fd
+is not open for reading.
+.TP
+.B EPERM
+This query is not allowed.
+.TP
+.B EOPNOTSUPP
+The filesystem does not support this command.
+.TP
+.B EUCLEAN
+The filesystem metadata is corrupt and needs repair.
+.TP
+.B EBADMSG
+The filesystem has detected a checksum error in the metadata.
+.TP
+.B ENOMEM
+Insufficient memory to process the request.
+
+.SH EXAMPLE
+.PP
+Please see io/fsmap.c in the xfsprogs distribution for a sample program.
+
+.SH CONFORMING TO
+This API is Linux-specific.
+Not all filesystems support it.
+.fi
+.in
+.SH SEE ALSO
+.BR ioctl (2)
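
A minimal C sketch of the key handling described in the Keys section of the
patch above.  The structures are local mirrors of the man page definitions so
the fragment builds without <linux/fsmap.h>; fsmap_alloc, fsmap_set_range, and
advance_low_key are hypothetical helper names, not part of any header.  The
ioctl call itself is omitted: real code would issue it after fsmap_set_range,
call advance_low_key on the result, and repeat until a record carrying
FMR_OF_LAST is seen or fmh_entries comes back as zero.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Local mirrors of the structures in the man page (assumption: same layout). */
struct fsmap {
	uint32_t fmr_device;		/* device id */
	uint32_t fmr_flags;		/* mapping flags */
	uint64_t fmr_physical;		/* device offset of segment */
	uint64_t fmr_owner;		/* owner id */
	uint64_t fmr_offset;		/* file offset of segment */
	uint64_t fmr_length;		/* length of segment */
	uint64_t fmr_reserved[3];	/* must be zero */
};

struct fsmap_head {
	uint32_t fmh_iflags;		/* control flags */
	uint32_t fmh_oflags;		/* output flags */
	uint32_t fmh_count;		/* # of entries in array */
	uint32_t fmh_entries;		/* # of entries filled in (output) */
	uint64_t fmh_reserved[6];	/* must be zero */
	struct fsmap fmh_keys[2];	/* low and high keys */
	struct fsmap fmh_recs[];	/* returned records */
};

/* Hypothetical helper: zeroed header with room for nr records.
 * calloc() also clears the reserved fields, which must be zero. */
static struct fsmap_head *fsmap_alloc(uint32_t nr)
{
	struct fsmap_head *head;

	head = calloc(1, sizeof(*head) + (size_t)nr * sizeof(struct fsmap));
	if (head)
		head->fmh_count = nr;
	return head;
}

/* Hypothetical helper: set the low and high keys for one device.
 * Owner, offset and flags stay zero, as in the page's
 * (0, 36864, 0, 0, 0) example. */
static void fsmap_set_range(struct fsmap_head *head, uint32_t dev,
			    uint64_t low, uint64_t high)
{
	assert(head != NULL);
	memset(head->fmh_keys, 0, sizeof(head->fmh_keys));
	head->fmh_keys[0].fmr_device = dev;
	head->fmh_keys[0].fmr_physical = low;
	head->fmh_keys[1].fmr_device = dev;
	head->fmh_keys[1].fmr_physical = high;
}

/* Hypothetical helper: copy the last returned record into the low key.
 * Per the page, a nonzero fmr_length in the low key is added to
 * fmr_physical or fmr_offset by the kernel, so the next call resumes
 * after this record.  Returns -1 if there is nothing to advance past. */
static int advance_low_key(struct fsmap_head *head)
{
	if (head->fmh_entries == 0)
		return -1;
	head->fmh_keys[0] = head->fmh_recs[head->fmh_entries - 1];
	return 0;
}
```

Setting fmh_count to zero in the first call (that is, fsmap_alloc(0)) matches
the counting mode the page describes, where only fmh_entries is filled in.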


* Re: [PATCH] ioctl_getfsmap.2: document the GETFSMAP ioctl
       [not found]                 ` <87mvakpl5m.fsf@xmission.com>
@ 2017-05-10 20:14                   ` Darrick J. Wong
  2017-05-11  5:10                     ` Eric Biggers
  0 siblings, 1 reply; 7+ messages in thread
From: Darrick J. Wong @ 2017-05-10 20:14 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Theodore Ts'o, Eric Biggers, Jann Horn,
	Michael Kerrisk-manpages, linux-xfs, linux-fsdevel, linux-ext4,
	Linux API, linux-man, linux-btrfs

[cc btrfs, since afaict that's where most of the dedupe tool authors hang out]

On Wed, May 10, 2017 at 02:27:33PM -0500, Eric W. Biederman wrote:
> Theodore Ts'o <tytso@mit.edu> writes:
> 
> > On Tue, May 09, 2017 at 02:17:46PM -0700, Eric Biggers wrote:
> >> 1.) Privacy implications.  Say the filesystem is being shared between multiple
> >>     users, and one user unpacks foo.tar.gz into their home directory, which
> >>     they've set to mode 700 to hide from other users.  Because of this new
> >>     ioctl, all users will be able to see every (inode number, size in blocks)
> >>     pair that was added to the filesystem, as well as the exact layout of the
> >>     physical block allocations which might hint at how the files were created.
> >>     If there is a known "fingerprint" for the unpacked foo.tar.gz in this
> >>     regard, its presence on the filesystem will be revealed to all users.  And
> >>     if any filesystems happen to prefer allocating blocks near the containing
> >>     directory, the directory the files are in would likely be revealed too.

Frankly, why are container users even allowed to make unrestricted ioctl
calls?  I thought we had a bunch of security infrastructure to constrain
what userspace can do to a system, so why don't ioctls fall under these
same protections?  If your containers are really that adversarial, you
ought to be blacklisting as much as you can.

> > Unix/Linux has historically not been terribly concerned about trying
> > to protect this kind of privacy between users.  So for example, in
> > order to do this, you would have to call GETFSMAP continuously to track
> > this sort of thing.  Someone who wanted to do this could probably get
> > this information (and much, much more) by continuously running "ps" to
> > see what processes are running.
> >
> > (I will note, wryly, that in the bad old days, when dozens of users
> > were sharing a one MIPS Vax/780, it was considered a *good* thing
> > that social pressure could be applied when it was found that someone
> > was running a CPU or memory hogger on a time sharing system.  The
> > privacy right of someone running "xtrek" to be able to hide this from
> > other users on the system was never considered important at all.  :-)

Not to mention someone running GETFSMAP in a loop will be pretty obvious
both from the high kernel cpu usage and the huge number of metadata
operations.

> > Fortunately, the days of timesharing seem to be well behind us.  For
> > those people who think that containers are as secure as VM's (hah,
> > hah, hah), it might be that best way to handle this is to have a mount
> > option that requires root access to this functionality.  For those
> > people who really care about this, they can disable access.

Or use separate filesystems for each container so that exploitable bugs
that shut down the filesystem can't be used to kill the other
containers.  You could use a torrent of metadata-heavy operations
(fallocate a huge file, punch every block, truncate file, repeat) to DoS
the other containers.

> What would be the reason for not putting this behind
> capable(CAP_SYS_ADMIN)?
> 
> What possible legitimate function could this functionality serve to
> users who don't own your filesystem?

As I've said before, it's to enable dedupe tools to decide, given a set
of files with shareable blocks, roughly how many other times each of
those shareable blocks are shared so that they can make better decisions
about which file keeps its shareable blocks, and which file gets
remapped.  Dedupe is not a privileged operation, nor are any of the
tools.
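
To make the dedupe use case concrete, here is a rough C sketch of how a tool
might estimate sharing from records it has already fetched: count how many
mapping records overlap a given physical byte range.  struct map_rec and
count_sharers are hypothetical names for illustration; the fields correspond
to fmr_physical, fmr_owner, and fmr_length, and nothing here is taken from an
existing dedupe tool.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative subset of a mapping record (assumption: byte units). */
struct map_rec {
	uint64_t physical;	/* device offset of the extent */
	uint64_t owner;		/* inode number of the owner */
	uint64_t length;	/* extent length */
};

/* Count the records whose physical extent overlaps [start, start + len).
 * Each overlapping record is one mapping of (part of) that range, so the
 * result is a rough share count for the range. */
static size_t count_sharers(const struct map_rec *recs, size_t nr,
			    uint64_t start, uint64_t len)
{
	size_t i, hits = 0;

	for (i = 0; i < nr; i++) {
		uint64_t rstart = recs[i].physical;
		uint64_t rend = rstart + recs[i].length;

		/* Half-open interval overlap test. */
		if (rstart < start + len && start < rend)
			hits++;
	}
	return hits;
}
```

A tool could then prefer to keep the copy whose blocks already have the
higher count and remap the other file onto it.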

> I have seen several people speak up how this is a concern I don't see
> anyone saying here is a legitimate use for a non-system administrator.

/I/ said that a few emails ago.

--D

> This doesn't seem like something where abuses of time-sharing systems
> can be observed.
> 
> Eric


* Re: [PATCH] ioctl_getfsmap.2: document the GETFSMAP ioctl
  2017-05-10 20:14                   ` Darrick J. Wong
@ 2017-05-11  5:10                     ` Eric Biggers
  2017-05-14  1:41                       ` Andreas Dilger
  0 siblings, 1 reply; 7+ messages in thread
From: Eric Biggers @ 2017-05-11  5:10 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Eric W. Biederman, Theodore Ts'o, Jann Horn,
	Michael Kerrisk-manpages, linux-xfs, linux-fsdevel, linux-ext4,
	Linux API, linux-man, linux-btrfs

On Wed, May 10, 2017 at 01:14:37PM -0700, Darrick J. Wong wrote:
> [cc btrfs, since afaict that's where most of the dedupe tool authors hang out]
> 
> On Wed, May 10, 2017 at 02:27:33PM -0500, Eric W. Biederman wrote:
> > Theodore Ts'o <tytso@mit.edu> writes:
> > 
> > > On Tue, May 09, 2017 at 02:17:46PM -0700, Eric Biggers wrote:
> > >> 1.) Privacy implications.  Say the filesystem is being shared between multiple
> > >>     users, and one user unpacks foo.tar.gz into their home directory, which
> > >>     they've set to mode 700 to hide from other users.  Because of this new
> > >>     ioctl, all users will be able to see every (inode number, size in blocks)
> > >>     pair that was added to the filesystem, as well as the exact layout of the
> > >>     physical block allocations which might hint at how the files were created.
> > >>     If there is a known "fingerprint" for the unpacked foo.tar.gz in this
> > >>     regard, its presence on the filesystem will be revealed to all users.  And
> > >>     if any filesystems happen to prefer allocating blocks near the containing
> > >>     directory, the directory the files are in would likely be revealed too.
> 
> Frankly, why are container users even allowed to make unrestricted ioctl
> calls?  I thought we had a bunch of security infrastructure to constrain
> what userspace can do to a system, so why don't ioctls fall under these
> same protections?  If your containers are really that adversarial, you
> ought to be blacklisting as much as you can.
> 

Personally I don't find the presence of sandboxing features to be a very good
excuse for introducing random insecure ioctls.  Not everyone has everything
perfectly "sandboxed" all the time, for obvious reasons.  It's easy to forget
about the filesystem ioctls, too, since they can be executed on any regular
file, without having to open some device node in /dev.

(And this actually does happen; the SELinux policy in Android, for example,
still allows apps to call any ioctl on their data files, despite all the effort
that has gone into whitelisting other types of ioctls.  Which should be fixed,
of course, but it shows that this kind of mistake is very easy to make.)

> > > Unix/Linux has historically not been terribly concerned about trying
> > > to protect this kind of privacy between users.  So for example, in
> > > order to do this, you would have to call GETFSMAP continuously to track
> > > this sort of thing.  Someone who wanted to do this could probably get
> > > this information (and much, much more) by continuously running "ps" to
> > > see what processes are running.
> > >
> > > (I will note, wryly, that in the bad old days, when dozens of users
> > > were sharing a one MIPS Vax/780, it was considered a *good* thing
> > > that social pressure could be applied when it was found that someone
> > > was running a CPU or memory hogger on a time sharing system.  The
> > > privacy right of someone running "xtrek" to be able to hide this from
> > > other users on the system was never considered important at all.  :-)
> 
> Not to mention someone running GETFSMAP in a loop will be pretty obvious
> both from the high kernel cpu usage and the huge number of metadata
> operations.

Well, only if that someone running GETFSMAP actually wants to watch things in
real-time (it's not necessary for all scenarios that have been mentioned), *and*
there is monitoring in place which actually detects it and can do something
about it.

Yes, PIDs have traditionally been global, but today we have PID namespaces, and
many other isolation features such as mount namespaces.  Nothing is perfect, of
course, and containers are a lot worse than VMs, but it seems weird to use that
as an excuse to knowingly make things worse...

> 
> > > Fortunately, the days of timesharing seem to be well behind us.  For
> > > those people who think that containers are as secure as VM's (hah,
> > > hah, hah), it might be that best way to handle this is to have a mount
> > > option that requires root access to this functionality.  For those
> > > people who really care about this, they can disable access.
> 
> Or use separate filesystems for each container so that exploitable bugs
> that shut down the filesystem can't be used to kill the other
> containers.  You could use a torrent of metadata-heavy operations
> (fallocate a huge file, punch every block, truncate file, repeat) to DoS
> the other containers.
> 
> > What would be the reason for not putting this behind
> > capable(CAP_SYS_ADMIN)?
> > 
> > What possible legitimate function could this functionality serve to
> > users who don't own your filesystem?
> 
> As I've said before, it's to enable dedupe tools to decide, given a set
> of files with shareable blocks, roughly how many other times each of
> those shareable blocks are shared so that they can make better decisions
> about which file keeps its shareable blocks, and which file gets
> remapped.  Dedupe is not a privileged operation, nor are any of the
> tools.
> 

So why does the ioctl need to return all extent mappings for the entire
filesystem, instead of just the share count of each block in the file that the
ioctl is called on?

- Eric


* Re: [PATCH] ioctl_getfsmap.2: document the GETFSMAP ioctl
  2017-05-11  5:10                     ` Eric Biggers
@ 2017-05-14  1:41                       ` Andreas Dilger
  2017-05-14  4:25                         ` Darrick J. Wong
  2017-05-14 13:56                         ` Andy Lutomirski
  0 siblings, 2 replies; 7+ messages in thread
From: Andreas Dilger @ 2017-05-14  1:41 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Darrick J. Wong, Eric W. Biederman, Theodore Ts'o, Jann Horn,
	Michael Kerrisk-manpages, linux-xfs, linux-fsdevel, linux-ext4,
	Linux API, linux-man, linux-btrfs

On May 10, 2017, at 11:10 PM, Eric Biggers <ebiggers3@gmail.com> wrote:
> 
> On Wed, May 10, 2017 at 01:14:37PM -0700, Darrick J. Wong wrote:
>> [cc btrfs, since afaict that's where most of the dedupe tool authors hang out]
>> 
>> On Wed, May 10, 2017 at 02:27:33PM -0500, Eric W. Biederman wrote:
>>> Theodore Ts'o <tytso@mit.edu> writes:
>>> 
>>>> On Tue, May 09, 2017 at 02:17:46PM -0700, Eric Biggers wrote:
>>>>> 1.) Privacy implications.  Say the filesystem is being shared between multiple
>>>>>    users, and one user unpacks foo.tar.gz into their home directory, which
>>>>>    they've set to mode 700 to hide from other users.  Because of this new
>>>>>    ioctl, all users will be able to see every (inode number, size in blocks)
>>>>>    pair that was added to the filesystem, as well as the exact layout of the
>>>>>    physical block allocations which might hint at how the files were created.
>>>>>    If there is a known "fingerprint" for the unpacked foo.tar.gz in this
>>>>>    regard, its presence on the filesystem will be revealed to all users.  And
>>>>>    if any filesystems happen to prefer allocating blocks near the containing
>>>>>    directory, the directory the files are in would likely be revealed too.
>> 
>> Frankly, why are container users even allowed to make unrestricted ioctl
>> calls?  I thought we had a bunch of security infrastructure to constrain
>> what userspace can do to a system, so why don't ioctls fall under these
>> same protections?  If your containers are really that adversarial, you
>> ought to be blacklisting as much as you can.
>> 
> 
> Personally I don't find the presence of sandboxing features to be a very good
> excuse for introducing random insecure ioctls.  Not everyone has everything
> perfectly "sandboxed" all the time, for obvious reasons.  It's easy to forget
> about the filesystem ioctls, too, since they can be executed on any regular
> file, without having to open some device node in /dev.
> 
> (And this actually does happen; the SELinux policy in Android, for example,
> still allows apps to call any ioctl on their data files, despite all the effort
> that has gone into whitelisting other types of ioctls.  Which should be fixed,
> of course, but it shows that this kind of mistake is very easy to make.)
> 
>>>> Unix/Linux has historically not been terribly concerned about trying
>>>> to protect this kind of privacy between users.  So for example, in
>>>> order to do this, you would have to call GETFSMAP continuously to track
>>>> this sort of thing.  Someone who wanted to do this could probably get
>>>> this information (and much, much more) by continuously running "ps" to
>>>> see what processes are running.
>>>> 
>>>> (I will note, wryly, that in the bad old days, when dozens of users
>>>> were sharing a one MIPS Vax/780, it was considered a *good* thing
>>>> that social pressure could be applied when it was found that someone
>>>> was running a CPU or memory hogger on a time sharing system.  The
>>>> privacy right of someone running "xtrek" to be able to hide this from
>>>> other users on the system was never considered important at all.  :-)
>> 
>> Not to mention someone running GETFSMAP in a loop will be pretty obvious
>> both from the high kernel cpu usage and the huge number of metadata
>> operations.
> 
> Well, only if that someone running GETFSMAP actually wants to watch things in
> real-time (it's not necessary for all scenarios that have been mentioned), *and*
> there is monitoring in place which actually detects it and can do something
> about it.
> 
> Yes, PIDs have traditionally been global, but today we have PID namespaces, and
> many other isolation features such as mount namespaces.  Nothing is perfect, of
> course, and containers are a lot worse than VMs, but it seems weird to use that
> as an excuse to knowingly make things worse...
> 
>> 
>>>> Fortunately, the days of timesharing seem to be well behind us.  For
>>>> those people who think that containers are as secure as VM's (hah,
>>>> hah, hah), it might be that best way to handle this is to have a mount
>>>> option that requires root access to this functionality.  For those
>>>> people who really care about this, they can disable access.
>> 
>> Or use separate filesystems for each container so that exploitable bugs
>> that shut down the filesystem can't be used to kill the other
>> containers.  You could use a torrent of metadata-heavy operations
>> (fallocate a huge file, punch every block, truncate file, repeat) to DoS
>> the other containers.
>> 
>>> What would be the reason for not putting this behind
>>> capable(CAP_SYS_ADMIN)?
>>> 
>>> What possible legitimate function could this functionality serve to
>>> users who don't own your filesystem?
>> 
>> As I've said before, it's to enable dedupe tools to decide, given a set
>> of files with shareable blocks, roughly how many other times each of
>> those shareable blocks are shared so that they can make better decisions
>> about which file keeps its shareable blocks, and which file gets
>> remapped.  Dedupe is not a privileged operation, nor are any of the
>> tools.
>> 
> 
> So why does the ioctl need to return all extent mappings for the entire
> filesystem, instead of just the share count of each block in the file that the
> ioctl is called on?

One possibility is that the ioctl() can return the mapping for all inodes
owned by the calling PID (or others if CAP_SYS_ADMIN, CAP_DAC_OVERRIDE,
or CAP_FOWNER is set), and return a "filesystem aggregate inode" (or more
than one if there is a reason to do so) with all the other allocated blocks
for inodes the user doesn't have permission to access?

IMHO, this would allow a non-root user the main benefit of GETFSMAP, which
is trying to determine how fragmented their files are and/or how fragmented
the free space is, without leaking any information about file sizes, location,
or other information the user can't already get today in a less efficient
manner.

I don't know how hard this is to implement, but seems not impossible.

Cheers, Andreas








* Re: [PATCH] ioctl_getfsmap.2: document the GETFSMAP ioctl
  2017-05-14  1:41                       ` Andreas Dilger
@ 2017-05-14  4:25                         ` Darrick J. Wong
  2017-05-14 13:56                         ` Andy Lutomirski
  1 sibling, 0 replies; 7+ messages in thread
From: Darrick J. Wong @ 2017-05-14  4:25 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Eric Biggers, Eric W. Biederman, Theodore Ts'o, Jann Horn,
	Michael Kerrisk-manpages, linux-xfs, linux-fsdevel, linux-ext4,
	Linux API, linux-man, linux-btrfs

On Sat, May 13, 2017 at 07:41:24PM -0600, Andreas Dilger wrote:
> On May 10, 2017, at 11:10 PM, Eric Biggers <ebiggers3@gmail.com> wrote:
> > 
> > On Wed, May 10, 2017 at 01:14:37PM -0700, Darrick J. Wong wrote:
> >> [cc btrfs, since afaict that's where most of the dedupe tool authors hang out]
> >> 
> >> On Wed, May 10, 2017 at 02:27:33PM -0500, Eric W. Biederman wrote:
> >>> Theodore Ts'o <tytso@mit.edu> writes:
> >>> 
> >>>> On Tue, May 09, 2017 at 02:17:46PM -0700, Eric Biggers wrote:
> >>>>> 1.) Privacy implications.  Say the filesystem is being shared between multiple
> >>>>>    users, and one user unpacks foo.tar.gz into their home directory, which
> >>>>>    they've set to mode 700 to hide from other users.  Because of this new
> >>>>>    ioctl, all users will be able to see every (inode number, size in blocks)
> >>>>>    pair that was added to the filesystem, as well as the exact layout of the
> >>>>>    physical block allocations which might hint at how the files were created.
> >>>>>    If there is a known "fingerprint" for the unpacked foo.tar.gz in this
> >>>>>    regard, its presence on the filesystem will be revealed to all users.  And
> >>>>>    if any filesystems happen to prefer allocating blocks near the containing
> >>>>>    directory, the directory the files are in would likely be revealed too.
> >> 
> >> Frankly, why are container users even allowed to make unrestricted ioctl
> >> calls?  I thought we had a bunch of security infrastructure to constrain
> >> what userspace can do to a system, so why don't ioctls fall under these
> >> same protections?  If your containers are really that adversarial, you
> >> ought to be blacklisting as much as you can.
> >> 
> > 
> > Personally I don't find the presence of sandboxing features to be a very good
> > excuse for introducing random insecure ioctls.  Not everyone has everything
> > perfectly "sandboxed" all the time, for obvious reasons.  It's easy to forget
> > about the filesystem ioctls, too, since they can be executed on any regular
> > file, without having to open some device node in /dev.
> > 
> > (And this actually does happen; the SELinux policy in Android, for example,
> > still allows apps to call any ioctl on their data files, despite all the effort
> > that has gone into whitelisting other types of ioctls.  Which should be fixed,
> > of course, but it shows that this kind of mistake is very easy to make.)
> > 
> >>>> Unix/Linux has historically not been terribly concerned about trying
> >>>> to protect this kind of privacy between users.  So for example, in
> > >>>> order to do this, you would have to call GETFSMAP continuously to track
> >>>> this sort of thing.  Someone who wanted to do this could probably get
> >>>> this information (and much, much more) by continuously running "ps" to
> >>>> see what processes are running.
> >>>> 
> > >>>> (I will note, wryly, that in the bad old days, when dozens of users
> >>>> were sharing a one MIPS Vax/780, it was considered a *good* thing
> >>>> that social pressure could be applied when it was found that someone
> >>>> was running a CPU or memory hogger on a time sharing system.  The
> >>>> privacy right of someone running "xtrek" to be able to hide this from
> >>>> other users on the system was never considered important at all.  :-)
> >> 
> >> Not to mention someone running GETFSMAP in a loop will be pretty obvious
> >> both from the high kernel cpu usage and the huge number of metadata
> >> operations.
> > 
> > Well, only if that someone running GETFSMAP actually wants to watch things in
> > real-time (it's not necessary for all scenarios that have been mentioned), *and*
> > there is monitoring in place which actually detects it and can do something
> > about it.
> > 
> > Yes, PIDs have traditionally been global, but today we have PID namespaces, and
> > many other isolation features such as mount namespaces.  Nothing is perfect, of
> > course, and containers are a lot worse than VMs, but it seems weird to use that
> > as an excuse to knowingly make things worse...
> > 
> >> 
> >>>> Fortunately, the days of timesharing seem to be well behind us.  For
> >>>> those people who think that containers are as secure as VM's (hah,
> >>>> hah, hah), it might be that the best way to handle this is to have a mount
> >>>> option that requires root access to this functionality.  For those
> >>>> people who really care about this, they can disable access.
> >> 
> >> Or use separate filesystems for each container so that exploitable bugs
> >> that shut down the filesystem can't be used to kill the other
> >> containers.  You could use a torrent of metadata-heavy operations
> >> (fallocate a huge file, punch every block, truncate file, repeat) to DoS
> >> the other containers.
> >> 
> >>> What would be the reason for not putting this behind
> >>> capable(CAP_SYS_ADMIN)?
> >>> 
> >>> What possible legitimate function could this functionality serve to
> >>> users who don't own your filesystem?
> >> 
> >> As I've said before, it's to enable dedupe tools to decide, given a set
> >> of files with shareable blocks, roughly how many other times each of
> >> those shareable blocks are shared so that they can make better decisions
> >> about which file keeps its shareable blocks, and which file gets
> >> remapped.  Dedupe is not a privileged operation, nor are any of the
> >> tools.
> >> 
> > 
> > So why does the ioctl need to return all extent mappings for the entire
> > filesystem, instead of just the share count of each block in the file that the
> > ioctl is called on?
> 
> One possibility is that the ioctl() can return the mapping for all inodes
> owned by the calling PID (or others if CAP_SYS_ADMIN, CAP_DAC_OVERRIDE,
> or CAP_FOWNER is set), and return a "filesystem aggregate inode" (or more
> than one if there is a reason to do so) with all the other allocated blocks
> for inodes the user doesn't have permission to access?

Hmm, CAP_DAC_OVERRIDE/CAP_FOWNER?  That might be a reasonable set of
capabilities to grant access...

> IMHO, this would allow a non-root user the main benefit of GETFSMAP,  which
> is trying to determine how fragmented their files are and/or how fragmented
> the free space is, without leaking any information about file sizes, location,
> or other information the user can't already get today in a less efficient
> manner.
> 
> I don't know how hard this is to implement, but it does not seem impossible.

It's already implemented in both XFS and ext4. <cough>

File extents are marked as "owned" by "unknown".

Now, I suppose one could devise a scheme such that files that the caller
can open actually do get inode numbers returned, but ... that's more
engineering work, let's see if anyone asks for that (vs. asks for any of
the magic capability bits).
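
For illustration only, here is a minimal userspace sketch of the hypothetical scheme mentioned above, in which a caller sees real inode numbers only for extents it is allowed to see and everything else is reported as owned by "unknown". This is not the XFS or ext4 code. The struct layouts, OWNER_UNKNOWN, visible_owner(), mask_extents(), and the "privileged" flag (a stand-in for whichever capability check is eventually chosen) are all made-up names, and both the plain uid comparison and the merging of adjacent hidden extents are assumptions based on the proposal in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sentinel for "owner not disclosed"; not the kernel's value. */
#define OWNER_UNKNOWN UINT64_MAX

/* Simplified, hypothetical view of what the filesystem knows per extent. */
struct extent_rec {
	uint64_t physical;   /* start of the extent on the device, in bytes */
	uint64_t length;     /* length of the extent, in bytes */
	uint64_t owner_ino;  /* inode that owns the extent */
	uint32_t owner_uid;  /* uid that owns that inode */
};

/* Hypothetical record handed back to the caller. */
struct visible_rec {
	uint64_t physical;
	uint64_t length;
	uint64_t owner;      /* inode number, or OWNER_UNKNOWN */
};

/* Hypothetical description of the caller. */
struct caller {
	uint32_t uid;
	bool privileged;     /* stands in for the chosen capability check */
};

/* Decide which owner value the caller is allowed to see for one extent. */
static uint64_t visible_owner(const struct extent_rec *rec,
			      const struct caller *c)
{
	if (c->privileged || rec->owner_uid == c->uid)
		return rec->owner_ino;
	return OWNER_UNKNOWN;
}

/*
 * Copy n records from src to dst, hiding owners the caller may not see.
 * Physically adjacent hidden extents are merged into one record so that
 * the boundaries between other users' files are not exposed.
 * dst must have room for n records.  Returns the number of records written.
 */
static size_t mask_extents(const struct extent_rec *src, size_t n,
			   const struct caller *c, struct visible_rec *dst)
{
	size_t out = 0;

	assert(src != NULL && c != NULL && dst != NULL);

	for (size_t i = 0; i < n; i++) {
		uint64_t owner = visible_owner(&src[i], c);

		/* Extend the previous hidden record if this one abuts it. */
		if (out > 0 && owner == OWNER_UNKNOWN &&
		    dst[out - 1].owner == OWNER_UNKNOWN &&
		    dst[out - 1].physical + dst[out - 1].length ==
			    src[i].physical) {
			dst[out - 1].length += src[i].length;
			continue;
		}

		dst[out].physical = src[i].physical;
		dst[out].length = src[i].length;
		dst[out].owner = owner;
		out++;
	}
	return out;
}
```

With an unprivileged caller of uid 1000 and three neighboring extents of which only the first belongs to uid 1000, the output would hold one record carrying the real inode number followed by a single merged "unknown" record covering the other two.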

--D

> 
> Cheers, Andreas

* Re: [PATCH] ioctl_getfsmap.2: document the GETFSMAP ioctl
  2017-05-14  1:41                       ` Andreas Dilger
  2017-05-14  4:25                         ` Darrick J. Wong
@ 2017-05-14 13:56                         ` Andy Lutomirski
  2017-05-18  2:04                           ` Darrick J. Wong
  1 sibling, 1 reply; 7+ messages in thread
From: Andy Lutomirski @ 2017-05-14 13:56 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Eric Biggers, Darrick J. Wong, Eric W. Biederman,
	Theodore Ts'o, Jann Horn, Michael Kerrisk-manpages, linux-xfs,
	Linux FS Devel, linux-ext4@vger.kernel.org, Linux API, linux-man,
	linux-btrfs

On Sat, May 13, 2017 at 6:41 PM, Andreas Dilger <adilger@dilger.ca> wrote:
> On May 10, 2017, at 11:10 PM, Eric Biggers <ebiggers3@gmail.com> wrote:
>>
>> On Wed, May 10, 2017 at 01:14:37PM -0700, Darrick J. Wong wrote:
>>> [cc btrfs, since afaict that's where most of the dedupe tool authors hang out]

>> Yes, PIDs have traditionally been global, but today we have PID namespaces, and
>> many other isolation features such as mount namespaces.  Nothing is perfect, of
>> course, and containers are a lot worse than VMs, but it seems weird to use that
>> as an excuse to knowingly make things worse...
>>

Indeed.  Not only PID namespaces -- we have hidepid and we can simply
unmount /proc.  "There are other info leaks" is a poor excuse.

>>>
>>>>> Fortunately, the days of timesharing seem to be well behind us.  For
>>>>> those people who think that containers are as secure as VM's (hah,
>>>>> hah, hah), it might be that the best way to handle this is to have a mount
>>>>> option that requires root access to this functionality.  For those
>>>>> people who really care about this, they can disable access.
>>>
>>> Or use separate filesystems for each container so that exploitable bugs
>>> that shut down the filesystem can't be used to kill the other
>>> containers.  You could use a torrent of metadata-heavy operations
>>> (fallocate a huge file, punch every block, truncate file, repeat) to DoS
>>> the other containers.
>>>
>>>> What would be the reason for not putting this behind
>>>> capable(CAP_SYS_ADMIN)?
>>>>
>>>> What possible legitimate function could this functionality serve to
>>>> users who don't own your filesystem?
>>>
>>> As I've said before, it's to enable dedupe tools to decide, given a set
>>> of files with shareable blocks, roughly how many other times each of
>>> those shareable blocks are shared so that they can make better decisions
>>> about which file keeps its shareable blocks, and which file gets
>>> remapped.  Dedupe is not a privileged operation, nor are any of the
>>> tools.
>>>
>>
>> So why does the ioctl need to return all extent mappings for the entire
>> filesystem, instead of just the share count of each block in the file that the
>> ioctl is called on?
>
> One possibility is that the ioctl() can return the mapping for all inodes
> owned by the calling PID (or others if CAP_SYS_ADMIN, CAP_DAC_OVERRIDE,
> or CAP_FOWNER is set), and return a "filesystem aggregate inode" (or more
> than one if there is a reason to do so) with all the other allocated blocks
> for inodes the user doesn't have permission to access?

Sounds like it could be reasonable.  But you don't want "owned by the
calling PID" precisely -- you also need to check
kgid_has_mapping(current_user_ns(), inode->i_gid), I think.


* Re: [PATCH] ioctl_getfsmap.2: document the GETFSMAP ioctl
  2017-05-14 13:56                         ` Andy Lutomirski
@ 2017-05-18  2:04                           ` Darrick J. Wong
  0 siblings, 0 replies; 7+ messages in thread
From: Darrick J. Wong @ 2017-05-18  2:04 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andreas Dilger, Eric Biggers, Eric W. Biederman,
	Theodore Ts'o, Jann Horn, Michael Kerrisk-manpages, linux-xfs,
	Linux FS Devel, linux-ext4@vger.kernel.org, Linux API, linux-man,
	linux-btrfs

On Sun, May 14, 2017 at 06:56:10AM -0700, Andy Lutomirski wrote:
> On Sat, May 13, 2017 at 6:41 PM, Andreas Dilger <adilger@dilger.ca> wrote:
> > On May 10, 2017, at 11:10 PM, Eric Biggers <ebiggers3@gmail.com> wrote:
> >>
> >> On Wed, May 10, 2017 at 01:14:37PM -0700, Darrick J. Wong wrote:
> >>> [cc btrfs, since afaict that's where most of the dedupe tool authors hang out]
> 
> >> Yes, PIDs have traditionally been global, but today we have PID namespaces, and
> >> many other isolation features such as mount namespaces.  Nothing is perfect, of
> >> course, and containers are a lot worse than VMs, but it seems weird to use that
> >> as an excuse to knowingly make things worse...
> >>
> 
> Indeed.  Not only PID namespaces -- we have hidepid and we can simply
> unmount /proc.  "There are other info leaks" is a poor excuse.

Eh.  From the sounds of it I'm not all that impressed at the isolation
and leakproofness of any of these schemes.  Regardless, I will rephrase
the manpage to emphasize more strongly that filesystems are under no
obligation to share inode numbers, privileged callers or otherwise.

> >>>
> >>>>> Fortunately, the days of timesharing seem to be well behind us.  For
> >>>>> those people who think that containers are as secure as VM's (hah,
> >>>>> hah, hah), it might be that the best way to handle this is to have a mount
> >>>>> option that requires root access to this functionality.  For those
> >>>>> people who really care about this, they can disable access.
> >>>
> >>> Or use separate filesystems for each container so that exploitable bugs
> >>> that shut down the filesystem can't be used to kill the other
> >>> containers.  You could use a torrent of metadata-heavy operations
> >>> (fallocate a huge file, punch every block, truncate file, repeat) to DoS
> >>> the other containers.
> >>>
> >>>> What would be the reason for not putting this behind
> >>>> capable(CAP_SYS_ADMIN)?
> >>>>
> >>>> What possible legitimate function could this functionality serve to
> >>>> users who don't own your filesystem?
> >>>
> >>> As I've said before, it's to enable dedupe tools to decide, given a set
> >>> of files with shareable blocks, roughly how many other times each of
> >>> those shareable blocks are shared so that they can make better decisions
> >>> about which file keeps its shareable blocks, and which file gets
> >>> remapped.  Dedupe is not a privileged operation, nor are any of the
> >>> tools.
> >>>
> >>
> >> So why does the ioctl need to return all extent mappings for the entire
> >> filesystem, instead of just the share count of each block in the file that the
> >> ioctl is called on?
> >
> > One possibility is that the ioctl() can return the mapping for all inodes
> > owned by the calling PID (or others if CAP_SYS_ADMIN, CAP_DAC_OVERRIDE,
> > or CAP_FOWNER is set), and return a "filesystem aggregate inode" (or more
> > than one if there is a reason to do so) with all the other allocated blocks
> > for inodes the user doesn't have permission to access?
> 
> Sounds like it could be reasonable.  But you don't want "owned by the
> calling PID" precisely -- you also need to check
> kgid_has_mapping(current_user_ns(), inode->i_gid), I think.

Not to mention that I don't want to go xfs_igetting every inode across
the entire filesystem... :)

--D



end of thread, other threads:[~2017-05-18  2:04 UTC | newest]

Thread overview: 7+ messages
     [not found] <148738063792.29384.10681837280402457846.stgit@birch.djwong.org>
2017-02-21 22:14 ` [PATCH] ioctl_getfsmap.2: document the GETFSMAP ioctl Darrick J. Wong
     [not found] <20170507155855.GD5970@birch.djwong.org>
     [not found] ` <CAG48ez1AWewJRg8gySgihn0y15jRhC6C+5DNwGsDpAhtokB=Lw@mail.gmail.com>
     [not found]   ` <20170508184112.GJ5973@birch.djwong.org>
     [not found]     ` <CAG48ez3e+2VuvjtEfJuMujEo6PWBO3z8oM-otN2juq96jKdjCw@mail.gmail.com>
     [not found]       ` <20170508204738.GL5973@birch.djwong.org>
     [not found]         ` <CAG48ez0iLRazKvXty9CG8ENXvkG6b1xjO0Q75p+16HKNptFnow@mail.gmail.com>
     [not found]           ` <20170509015324.GM5973@birch.djwong.org>
     [not found]             ` <20170509211746.GA87747@gmail.com>
     [not found]               ` <20170510163818.7bleiykxgnx3pkds@thunk.org>
     [not found]                 ` <87mvakpl5m.fsf@xmission.com>
2017-05-10 20:14                   ` Darrick J. Wong
2017-05-11  5:10                     ` Eric Biggers
2017-05-14  1:41                       ` Andreas Dilger
2017-05-14  4:25                         ` Darrick J. Wong
2017-05-14 13:56                         ` Andy Lutomirski
2017-05-18  2:04                           ` Darrick J. Wong
