Linux userland API discussions
 help / color / mirror / Atom feed
* Re: [RFC PATCH v2 2/5] fs: add RWF_ENCODED for reading/writing compressed data
From: Aleksa Sarai @ 2019-10-21 18:38 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Omar Sandoval, linux-fsdevel, linux-btrfs, Dave Chinner,
	Jann Horn, linux-api, kernel-team
In-Reply-To: <20191021182806.GA6706@magnolia>

[-- Attachment #1: Type: text/plain, Size: 9945 bytes --]

On 2019-10-21, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> On Tue, Oct 15, 2019 at 11:42:40AM -0700, Omar Sandoval wrote:
> > From: Omar Sandoval <osandov@fb.com>
> > 
> > Btrfs supports transparent compression: data written by the user can be
> > compressed when written to disk and decompressed when read back.
> > However, we'd like to add an interface to write pre-compressed data
> > directly to the filesystem, and the matching interface to read
> > compressed data without decompressing it. This adds support for
> > so-called "encoded I/O" via preadv2() and pwritev2().
> > 
> > A new RWF_ENCODED flags indicates that a read or write is "encoded". If
> > this flag is set, iov[0].iov_base points to a struct encoded_iov which
> > is used for metadata: namely, the compression algorithm, unencoded
> > (i.e., decompressed) length, and what subrange of the unencoded data
> > should be used (needed for truncated or hole-punched extents and when
> > reading in the middle of an extent). For reads, the filesystem returns
> > this information; for writes, the caller provides it to the filesystem.
> > iov[0].iov_len must be set to sizeof(struct encoded_iov), which can be
> > used to extend the interface in the future. The remaining iovecs contain
> > the encoded extent.
> > 
> > Filesystems must indicate that they support encoded writes by setting
> > FMODE_ENCODED_IO in ->file_open().
> > 
> > Signed-off-by: Omar Sandoval <osandov@fb.com>
> > ---
> >  include/linux/fs.h      | 14 +++++++
> >  include/uapi/linux/fs.h | 26 ++++++++++++-
> >  mm/filemap.c            | 82 ++++++++++++++++++++++++++++++++++-------
> >  3 files changed, 108 insertions(+), 14 deletions(-)
> > 
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index e0d909d35763..54681f21e05e 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -175,6 +175,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
> >  /* File does not contribute to nr_files count */
> >  #define FMODE_NOACCOUNT		((__force fmode_t)0x20000000)
> >  
> > +/* File supports encoded IO */
> > +#define FMODE_ENCODED_IO	((__force fmode_t)0x40000000)
> > +
> >  /*
> >   * Flag for rw_copy_check_uvector and compat_rw_copy_check_uvector
> >   * that indicates that they should check the contents of the iovec are
> > @@ -314,6 +317,7 @@ enum rw_hint {
> >  #define IOCB_SYNC		(1 << 5)
> >  #define IOCB_WRITE		(1 << 6)
> >  #define IOCB_NOWAIT		(1 << 7)
> > +#define IOCB_ENCODED		(1 << 8)
> >  
> >  struct kiocb {
> >  	struct file		*ki_filp;
> > @@ -3088,6 +3092,11 @@ extern int sb_min_blocksize(struct super_block *, int);
> >  extern int generic_file_mmap(struct file *, struct vm_area_struct *);
> >  extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *);
> >  extern ssize_t generic_write_checks(struct kiocb *, struct iov_iter *);
> > +struct encoded_iov;
> > +extern int generic_encoded_write_checks(struct kiocb *, struct encoded_iov *);
> > +extern ssize_t check_encoded_read(struct kiocb *, struct iov_iter *);
> > +extern int import_encoded_write(struct kiocb *, struct encoded_iov *,
> > +				struct iov_iter *);
> >  extern int generic_remap_checks(struct file *file_in, loff_t pos_in,
> >  				struct file *file_out, loff_t pos_out,
> >  				loff_t *count, unsigned int remap_flags);
> > @@ -3403,6 +3412,11 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags)
> >  			return -EOPNOTSUPP;
> >  		ki->ki_flags |= IOCB_NOWAIT;
> >  	}
> > +	if (flags & RWF_ENCODED) {
> > +		if (!(ki->ki_filp->f_mode & FMODE_ENCODED_IO))
> > +			return -EOPNOTSUPP;
> > +		ki->ki_flags |= IOCB_ENCODED;
> > +	}
> >  	if (flags & RWF_HIPRI)
> >  		ki->ki_flags |= IOCB_HIPRI;
> >  	if (flags & RWF_DSYNC)
> > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> > index 379a612f8f1d..ed92a8a257cb 100644
> > --- a/include/uapi/linux/fs.h
> > +++ b/include/uapi/linux/fs.h
> > @@ -284,6 +284,27 @@ struct fsxattr {
> >  
> >  typedef int __bitwise __kernel_rwf_t;
> >  
> > +enum {
> > +	ENCODED_IOV_COMPRESSION_NONE,
> > +	ENCODED_IOV_COMPRESSION_ZLIB,
> > +	ENCODED_IOV_COMPRESSION_LZO,
> > +	ENCODED_IOV_COMPRESSION_ZSTD,
> > +	ENCODED_IOV_COMPRESSION_TYPES = ENCODED_IOV_COMPRESSION_ZSTD,
> > +};
> > +
> > +enum {
> > +	ENCODED_IOV_ENCRYPTION_NONE,
> > +	ENCODED_IOV_ENCRYPTION_TYPES = ENCODED_IOV_ENCRYPTION_NONE,
> > +};
> > +
> > +struct encoded_iov {
> > +	__u64 len;
> > +	__u64 unencoded_len;
> > +	__u64 unencoded_offset;
> > +	__u32 compression;
> > +	__u32 encryption;
> 
> Can we add some must-be-zero padding space at the end here for whomever
> comes along next wanting to add more encoding info?

I would suggest to copy the extension design of copy_struct_from_user().
Adding must-be-zero padding is a less-ideal solution to the extension
problem than length-based extension.

Also (I might be wrong) but shouldn't the __u64s be __aligned_u64 (as
with syscall structure arguments)?

> (And maybe a manpage and some basic testing, to reiterate Dave...)
> 
> --D
> 
> > +};
> > +
> >  /* high priority request, poll if possible */
> >  #define RWF_HIPRI	((__force __kernel_rwf_t)0x00000001)
> >  
> > @@ -299,8 +320,11 @@ typedef int __bitwise __kernel_rwf_t;
> >  /* per-IO O_APPEND */
> >  #define RWF_APPEND	((__force __kernel_rwf_t)0x00000010)
> >  
> > +/* encoded (e.g., compressed or encrypted) IO */
> > +#define RWF_ENCODED	((__force __kernel_rwf_t)0x00000020)
> > +
> >  /* mask of flags supported by the kernel */
> >  #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
> > -			 RWF_APPEND)
> > +			 RWF_APPEND | RWF_ENCODED)
> >  
> >  #endif /* _UAPI_LINUX_FS_H */
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index 1146fcfa3215..d2e6d9caf353 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -2948,24 +2948,15 @@ static int generic_write_check_limits(struct file *file, loff_t pos,
> >  	return 0;
> >  }
> >  
> > -/*
> > - * Performs necessary checks before doing a write
> > - *
> > - * Can adjust writing position or amount of bytes to write.
> > - * Returns appropriate error code that caller should return or
> > - * zero in case that write should be allowed.
> > - */
> > -inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
> > +static int generic_write_checks_common(struct kiocb *iocb, loff_t *count)
> >  {
> >  	struct file *file = iocb->ki_filp;
> >  	struct inode *inode = file->f_mapping->host;
> > -	loff_t count;
> > -	int ret;
> >  
> >  	if (IS_SWAPFILE(inode))
> >  		return -ETXTBSY;
> >  
> > -	if (!iov_iter_count(from))
> > +	if (!*count)
> >  		return 0;
> >  
> >  	/* FIXME: this is for backwards compatibility with 2.4 */
> > @@ -2975,8 +2966,21 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
> >  	if ((iocb->ki_flags & IOCB_NOWAIT) && !(iocb->ki_flags & IOCB_DIRECT))
> >  		return -EINVAL;
> >  
> > -	count = iov_iter_count(from);
> > -	ret = generic_write_check_limits(file, iocb->ki_pos, &count);
> > +	return generic_write_check_limits(iocb->ki_filp, iocb->ki_pos, count);
> > +}
> > +
> > +/*
> > + * Performs necessary checks before doing a write
> > + *
> > + * Can adjust writing position or amount of bytes to write.
> > + * Returns a negative errno or the new number of bytes to write.
> > + */
> > +inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
> > +{
> > +	loff_t count = iov_iter_count(from);
> > +	int ret;
> > +
> > +	ret = generic_write_checks_common(iocb, &count);
> >  	if (ret)
> >  		return ret;
> >  
> > @@ -2985,6 +2989,58 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
> >  }
> >  EXPORT_SYMBOL(generic_write_checks);
> >  
> > +int generic_encoded_write_checks(struct kiocb *iocb,
> > +				 struct encoded_iov *encoded)
> > +{
> > +	loff_t count = encoded->unencoded_len;
> > +	int ret;
> > +
> > +	ret = generic_write_checks_common(iocb, &count);
> > +	if (ret)
> > +		return ret;
> > +
> > +	if (count != encoded->unencoded_len) {
> > +		/*
> > +		 * The write got truncated by generic_write_checks_common(). We
> > +		 * can't do a partial encoded write.
> > +		 */
> > +		return -EFBIG;
> > +	}
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL(generic_encoded_write_checks);
> > +
> > +ssize_t check_encoded_read(struct kiocb *iocb, struct iov_iter *iter)
> > +{
> > +	if (!(iocb->ki_filp->f_flags & O_ENCODED))
> > +		return -EPERM;
> > +	if (iov_iter_single_seg_count(iter) != sizeof(struct encoded_iov))
> > +		return -EINVAL;
> > +	return iov_iter_count(iter) - sizeof(struct encoded_iov);
> > +}
> > +EXPORT_SYMBOL(check_encoded_read);
> > +
> > +int import_encoded_write(struct kiocb *iocb, struct encoded_iov *encoded,
> > +			 struct iov_iter *from)
> > +{
> > +	if (!(iocb->ki_filp->f_flags & O_ENCODED))
> > +		return -EPERM;
> > +	if (iov_iter_single_seg_count(from) != sizeof(*encoded))
> > +		return -EINVAL;
> > +	if (copy_from_iter(encoded, sizeof(*encoded), from) != sizeof(*encoded))
> > +		return -EFAULT;
> > +	if (encoded->compression == ENCODED_IOV_COMPRESSION_NONE &&
> > +	    encoded->encryption == ENCODED_IOV_ENCRYPTION_NONE)
> > +		return -EINVAL;
> > +	if (encoded->compression > ENCODED_IOV_COMPRESSION_TYPES ||
> > +	    encoded->encryption > ENCODED_IOV_ENCRYPTION_TYPES)
> > +		return -EINVAL;
> > +	if (encoded->unencoded_offset >= encoded->unencoded_len)
> > +		return -EINVAL;
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL(import_encoded_write);
> > +
> >  /*
> >   * Performs necessary checks before doing a clone.
> >   *
> > -- 
> > 2.23.0
> > 


-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply

* Re: [RFC PATCH v2 2/5] fs: add RWF_ENCODED for reading/writing compressed data
From: Darrick J. Wong @ 2019-10-21 18:28 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: linux-fsdevel, linux-btrfs, Dave Chinner, Jann Horn, linux-api,
	kernel-team
In-Reply-To: <7f98cf5409cf2b583cd5b3451fc739fd3428873b.1571164762.git.osandov@fb.com>

On Tue, Oct 15, 2019 at 11:42:40AM -0700, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> Btrfs supports transparent compression: data written by the user can be
> compressed when written to disk and decompressed when read back.
> However, we'd like to add an interface to write pre-compressed data
> directly to the filesystem, and the matching interface to read
> compressed data without decompressing it. This adds support for
> so-called "encoded I/O" via preadv2() and pwritev2().
> 
> A new RWF_ENCODED flags indicates that a read or write is "encoded". If
> this flag is set, iov[0].iov_base points to a struct encoded_iov which
> is used for metadata: namely, the compression algorithm, unencoded
> (i.e., decompressed) length, and what subrange of the unencoded data
> should be used (needed for truncated or hole-punched extents and when
> reading in the middle of an extent). For reads, the filesystem returns
> this information; for writes, the caller provides it to the filesystem.
> iov[0].iov_len must be set to sizeof(struct encoded_iov), which can be
> used to extend the interface in the future. The remaining iovecs contain
> the encoded extent.
> 
> Filesystems must indicate that they support encoded writes by setting
> FMODE_ENCODED_IO in ->file_open().
> 
> Signed-off-by: Omar Sandoval <osandov@fb.com>
> ---
>  include/linux/fs.h      | 14 +++++++
>  include/uapi/linux/fs.h | 26 ++++++++++++-
>  mm/filemap.c            | 82 ++++++++++++++++++++++++++++++++++-------
>  3 files changed, 108 insertions(+), 14 deletions(-)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index e0d909d35763..54681f21e05e 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -175,6 +175,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
>  /* File does not contribute to nr_files count */
>  #define FMODE_NOACCOUNT		((__force fmode_t)0x20000000)
>  
> +/* File supports encoded IO */
> +#define FMODE_ENCODED_IO	((__force fmode_t)0x40000000)
> +
>  /*
>   * Flag for rw_copy_check_uvector and compat_rw_copy_check_uvector
>   * that indicates that they should check the contents of the iovec are
> @@ -314,6 +317,7 @@ enum rw_hint {
>  #define IOCB_SYNC		(1 << 5)
>  #define IOCB_WRITE		(1 << 6)
>  #define IOCB_NOWAIT		(1 << 7)
> +#define IOCB_ENCODED		(1 << 8)
>  
>  struct kiocb {
>  	struct file		*ki_filp;
> @@ -3088,6 +3092,11 @@ extern int sb_min_blocksize(struct super_block *, int);
>  extern int generic_file_mmap(struct file *, struct vm_area_struct *);
>  extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *);
>  extern ssize_t generic_write_checks(struct kiocb *, struct iov_iter *);
> +struct encoded_iov;
> +extern int generic_encoded_write_checks(struct kiocb *, struct encoded_iov *);
> +extern ssize_t check_encoded_read(struct kiocb *, struct iov_iter *);
> +extern int import_encoded_write(struct kiocb *, struct encoded_iov *,
> +				struct iov_iter *);
>  extern int generic_remap_checks(struct file *file_in, loff_t pos_in,
>  				struct file *file_out, loff_t pos_out,
>  				loff_t *count, unsigned int remap_flags);
> @@ -3403,6 +3412,11 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags)
>  			return -EOPNOTSUPP;
>  		ki->ki_flags |= IOCB_NOWAIT;
>  	}
> +	if (flags & RWF_ENCODED) {
> +		if (!(ki->ki_filp->f_mode & FMODE_ENCODED_IO))
> +			return -EOPNOTSUPP;
> +		ki->ki_flags |= IOCB_ENCODED;
> +	}
>  	if (flags & RWF_HIPRI)
>  		ki->ki_flags |= IOCB_HIPRI;
>  	if (flags & RWF_DSYNC)
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index 379a612f8f1d..ed92a8a257cb 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -284,6 +284,27 @@ struct fsxattr {
>  
>  typedef int __bitwise __kernel_rwf_t;
>  
> +enum {
> +	ENCODED_IOV_COMPRESSION_NONE,
> +	ENCODED_IOV_COMPRESSION_ZLIB,
> +	ENCODED_IOV_COMPRESSION_LZO,
> +	ENCODED_IOV_COMPRESSION_ZSTD,
> +	ENCODED_IOV_COMPRESSION_TYPES = ENCODED_IOV_COMPRESSION_ZSTD,
> +};
> +
> +enum {
> +	ENCODED_IOV_ENCRYPTION_NONE,
> +	ENCODED_IOV_ENCRYPTION_TYPES = ENCODED_IOV_ENCRYPTION_NONE,
> +};
> +
> +struct encoded_iov {
> +	__u64 len;
> +	__u64 unencoded_len;
> +	__u64 unencoded_offset;
> +	__u32 compression;
> +	__u32 encryption;

Can we add some must-be-zero padding space at the end here for whomever
comes along next wanting to add more encoding info?

(And maybe a manpage and some basic testing, to reiterate Dave...)

--D

> +};
> +
>  /* high priority request, poll if possible */
>  #define RWF_HIPRI	((__force __kernel_rwf_t)0x00000001)
>  
> @@ -299,8 +320,11 @@ typedef int __bitwise __kernel_rwf_t;
>  /* per-IO O_APPEND */
>  #define RWF_APPEND	((__force __kernel_rwf_t)0x00000010)
>  
> +/* encoded (e.g., compressed or encrypted) IO */
> +#define RWF_ENCODED	((__force __kernel_rwf_t)0x00000020)
> +
>  /* mask of flags supported by the kernel */
>  #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
> -			 RWF_APPEND)
> +			 RWF_APPEND | RWF_ENCODED)
>  
>  #endif /* _UAPI_LINUX_FS_H */
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 1146fcfa3215..d2e6d9caf353 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2948,24 +2948,15 @@ static int generic_write_check_limits(struct file *file, loff_t pos,
>  	return 0;
>  }
>  
> -/*
> - * Performs necessary checks before doing a write
> - *
> - * Can adjust writing position or amount of bytes to write.
> - * Returns appropriate error code that caller should return or
> - * zero in case that write should be allowed.
> - */
> -inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
> +static int generic_write_checks_common(struct kiocb *iocb, loff_t *count)
>  {
>  	struct file *file = iocb->ki_filp;
>  	struct inode *inode = file->f_mapping->host;
> -	loff_t count;
> -	int ret;
>  
>  	if (IS_SWAPFILE(inode))
>  		return -ETXTBSY;
>  
> -	if (!iov_iter_count(from))
> +	if (!*count)
>  		return 0;
>  
>  	/* FIXME: this is for backwards compatibility with 2.4 */
> @@ -2975,8 +2966,21 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
>  	if ((iocb->ki_flags & IOCB_NOWAIT) && !(iocb->ki_flags & IOCB_DIRECT))
>  		return -EINVAL;
>  
> -	count = iov_iter_count(from);
> -	ret = generic_write_check_limits(file, iocb->ki_pos, &count);
> +	return generic_write_check_limits(iocb->ki_filp, iocb->ki_pos, count);
> +}
> +
> +/*
> + * Performs necessary checks before doing a write
> + *
> + * Can adjust writing position or amount of bytes to write.
> + * Returns a negative errno or the new number of bytes to write.
> + */
> +inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
> +{
> +	loff_t count = iov_iter_count(from);
> +	int ret;
> +
> +	ret = generic_write_checks_common(iocb, &count);
>  	if (ret)
>  		return ret;
>  
> @@ -2985,6 +2989,58 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
>  }
>  EXPORT_SYMBOL(generic_write_checks);
>  
> +int generic_encoded_write_checks(struct kiocb *iocb,
> +				 struct encoded_iov *encoded)
> +{
> +	loff_t count = encoded->unencoded_len;
> +	int ret;
> +
> +	ret = generic_write_checks_common(iocb, &count);
> +	if (ret)
> +		return ret;
> +
> +	if (count != encoded->unencoded_len) {
> +		/*
> +		 * The write got truncated by generic_write_checks_common(). We
> +		 * can't do a partial encoded write.
> +		 */
> +		return -EFBIG;
> +	}
> +	return 0;
> +}
> +EXPORT_SYMBOL(generic_encoded_write_checks);
> +
> +ssize_t check_encoded_read(struct kiocb *iocb, struct iov_iter *iter)
> +{
> +	if (!(iocb->ki_filp->f_flags & O_ENCODED))
> +		return -EPERM;
> +	if (iov_iter_single_seg_count(iter) != sizeof(struct encoded_iov))
> +		return -EINVAL;
> +	return iov_iter_count(iter) - sizeof(struct encoded_iov);
> +}
> +EXPORT_SYMBOL(check_encoded_read);
> +
> +int import_encoded_write(struct kiocb *iocb, struct encoded_iov *encoded,
> +			 struct iov_iter *from)
> +{
> +	if (!(iocb->ki_filp->f_flags & O_ENCODED))
> +		return -EPERM;
> +	if (iov_iter_single_seg_count(from) != sizeof(*encoded))
> +		return -EINVAL;
> +	if (copy_from_iter(encoded, sizeof(*encoded), from) != sizeof(*encoded))
> +		return -EFAULT;
> +	if (encoded->compression == ENCODED_IOV_COMPRESSION_NONE &&
> +	    encoded->encryption == ENCODED_IOV_ENCRYPTION_NONE)
> +		return -EINVAL;
> +	if (encoded->compression > ENCODED_IOV_COMPRESSION_TYPES ||
> +	    encoded->encryption > ENCODED_IOV_ENCRYPTION_TYPES)
> +		return -EINVAL;
> +	if (encoded->unencoded_offset >= encoded->unencoded_len)
> +		return -EINVAL;
> +	return 0;
> +}
> +EXPORT_SYMBOL(import_encoded_write);
> +
>  /*
>   * Performs necessary checks before doing a clone.
>   *
> -- 
> 2.23.0
> 

^ permalink raw reply

* Re: [RFC PATCH v2 5/5] btrfs: implement RWF_ENCODED writes
From: Omar Sandoval @ 2019-10-21 18:05 UTC (permalink / raw)
  To: dsterba, Nikolay Borisov, linux-btrfs, linux-fsdevel, kernel-team,
	Dave Chinner, Jann Horn, linux-api
In-Reply-To: <20191021131452.GH3001@twin.jikos.cz>

On Mon, Oct 21, 2019 at 03:14:52PM +0200, David Sterba wrote:
> On Fri, Oct 18, 2019 at 03:55:13PM -0700, Omar Sandoval wrote:
> > > > +	nr_pages = (disk_num_bytes + PAGE_SIZE - 1) >> PAGE_SHIFT;
> > > 
> > > nit: nr_pages = DIV_ROUND_UP(disk_num_bytes, PAGE_SIZE)
> > 
> > disk_num_bytes is a u64, so that would expand to a 64-bit division. The
> > compiler is probably smart enough to optimize it to a shift, but I
> > didn't want to rely on that, because that would cause build failures on
> > 32-bit.
> 
> There are several DIV_ROUND_UP(u64, PAGE_SIZE) in btrfs code, no build
> brekages have been reported so far, you can use it.

Good to know, I'll fix both places I'm doing this, then.

^ permalink raw reply

* Re: [PATCH v3 1/2] clone3: add CLONE_CLEAR_SIGHAND
From: Oleg Nesterov @ 2019-10-21 15:12 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-kernel, Florian Weimer, Arnd Bergmann, libc-alpha,
	David Howells, Jann Horn, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Shuah Khan, Andrew Morton, Michal Hocko,
	Elena Reshetova, Thomas Gleixner, Roman Gushchin,
	Andrea Arcangeli
In-Reply-To: <20191021144633.GA2720@redhat.com>

On 10/21, Oleg Nesterov wrote:
>
> On 10/14, Christian Brauner wrote:
> >
> > The child helper process on Linux posix_spawn must ensure that no signal
> > handlers are enabled, so the signal disposition must be either SIG_DFL
> > or SIG_IGN. However, it requires a sigprocmask to obtain the current
> > signal mask and at least _NSIG sigaction calls to reset the signal
> > handlers for each posix_spawn call
>
> Plus the caller has to block/unblock all signals around clone(VM|VFORK).
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

just in case... I meant that posix_spawn() has to block/unblock, not its
caller.

> Can this justify the new CLONE_ flag? Honestly, I have no idea. But the
> patch is simple and looks technically correct to me. FWIW,
> 
> Reviewed-by: Oleg Nesterov <oleg@redhat.com>

^ permalink raw reply

* Re: [PATCH v3 1/2] clone3: add CLONE_CLEAR_SIGHAND
From: Oleg Nesterov @ 2019-10-21 14:46 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-kernel, Florian Weimer, Arnd Bergmann, libc-alpha,
	David Howells, Jann Horn, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Shuah Khan, Andrew Morton, Michal Hocko,
	Elena Reshetova, Thomas Gleixner, Roman Gushchin,
	Andrea Arcangeli
In-Reply-To: <20191014104538.3096-1-christian.brauner@ubuntu.com>

On 10/14, Christian Brauner wrote:
>
> The child helper process on Linux posix_spawn must ensure that no signal
> handlers are enabled, so the signal disposition must be either SIG_DFL
> or SIG_IGN. However, it requires a sigprocmask to obtain the current
> signal mask and at least _NSIG sigaction calls to reset the signal
> handlers for each posix_spawn call

Plus the caller has to block/unblock all signals around clone(VM|VFORK).

Can this justify the new CLONE_ flag? Honestly, I have no idea. But the
patch is simple and looks technically correct to me. FWIW,

Reviewed-by: Oleg Nesterov <oleg@redhat.com>

^ permalink raw reply

* Re: [RFC PATCH v2 5/5] btrfs: implement RWF_ENCODED writes
From: David Sterba @ 2019-10-21 13:14 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: Nikolay Borisov, linux-btrfs, linux-fsdevel, kernel-team,
	Dave Chinner, Jann Horn, linux-api
In-Reply-To: <20191018225513.GD59713@vader>

On Fri, Oct 18, 2019 at 03:55:13PM -0700, Omar Sandoval wrote:
> > > +	nr_pages = (disk_num_bytes + PAGE_SIZE - 1) >> PAGE_SHIFT;
> > 
> > nit: nr_pages = DIV_ROUND_UP(disk_num_bytes, PAGE_SIZE)
> 
> disk_num_bytes is a u64, so that would expand to a 64-bit division. The
> compiler is probably smart enough to optimize it to a shift, but I
> didn't want to rely on that, because that would cause build failures on
> 32-bit.

There are several DIV_ROUND_UP(u64, PAGE_SIZE) in btrfs code, no build
brekages have been reported so far, you can use it.

^ permalink raw reply

* Re: [PATCH man-pages] Document encoded I/O
From: Amir Goldstein @ 2019-10-21  6:18 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: linux-fsdevel, Linux Btrfs, Dave Chinner, Jann Horn, Linux API,
	kernel-team, Theodore Tso
In-Reply-To: <c7e8f93596fee7bb818dc0edf29f484036be1abb.1571164851.git.osandov@fb.com>

CC: Ted

What ever happened to read/write ext4 encrypted data API?
https://marc.info/?l=linux-ext4&m=145030599010416&w=2

Can we learn anything from the ext4 experience to improve
the new proposed API?


On Wed, Oct 16, 2019 at 12:29 AM Omar Sandoval <osandov@osandov.com> wrote:
>
> From: Omar Sandoval <osandov@fb.com>
>
> This adds a new page, rwf_encoded(7), providing an overview of encoded
> I/O and updates fcntl(2), open(2), and preadv2(2)/pwritev2(2) to
> reference it.
>
> Signed-off-by: Omar Sandoval <osandov@fb.com>
> ---
>  man2/fcntl.2       |  10 +-
>  man2/open.2        |  13 ++
>  man2/readv.2       |  46 +++++++
>  man7/rwf_encoded.7 | 297 +++++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 365 insertions(+), 1 deletion(-)
>  create mode 100644 man7/rwf_encoded.7
>
> diff --git a/man2/fcntl.2 b/man2/fcntl.2
> index fce4f4c2b..76fe9cc6f 100644
> --- a/man2/fcntl.2
> +++ b/man2/fcntl.2
> @@ -222,8 +222,9 @@ On Linux, this command can change only the
>  .BR O_ASYNC ,
>  .BR O_DIRECT ,
>  .BR O_NOATIME ,
> +.BR O_NONBLOCK ,
>  and
> -.B O_NONBLOCK
> +.B O_ENCODED
>  flags.
>  It is not possible to change the
>  .BR O_DSYNC
> @@ -1803,6 +1804,13 @@ Attempted to clear the
>  flag on a file that has the append-only attribute set.
>  .TP
>  .B EPERM
> +Attempted to set the
> +.B O_ENCODED
> +flag and the calling process did not have the
> +.B CAP_SYS_ADMIN
> +capability.
> +.TP
> +.B EPERM
>  .I cmd
>  was
>  .BR F_ADD_SEALS ,
> diff --git a/man2/open.2 b/man2/open.2
> index b0f485b41..cdd3c549c 100644
> --- a/man2/open.2
> +++ b/man2/open.2
> @@ -421,6 +421,14 @@ was followed by a call to
>  .BR fdatasync (2)).
>  .IR "See NOTES below" .
>  .TP
> +.B O_ENCODED
> +Open the file with encoded I/O permissions;

1. I find the name of the flag confusing.
Yes, most people don't read documentation so carefully (or at all)
so they will assume O_ENCODED will affect read/write or that it
relates to RWF_ENCODED in a similar way that O_SYNC relates
to RWF_SYNC (i.e. logical OR and not logical AND).

I am not good at naming and to prove it I will propose:
O_PROMISCUOUS, O_MAINTENANCE, O_ALLOW_ENCODED

2. While I see no harm in adding O_ flag to open(2) for this
use case, I also don't see a major benefit in adding it.
What if we only allowed setting the flag via fcntl(2) which returns
an error on old kernels?
Since unlike most O_ flags, O_ENCODED does NOT affect file
i/o without additional opt-in flags, it is not standard anyway and
therefore I find that setting it only via fcntl(2) is less error prone.


> +see
> +.BR rwf_encoded (7).
> +The caller must have the
> +.B CAP_SYS_ADMIN
> +capabilty.
> +.TP
>  .B O_EXCL
>  Ensure that this call creates the file:
>  if this flag is specified in conjunction with
> @@ -1168,6 +1176,11 @@ did not match the owner of the file and the caller was not privileged.
>  The operation was prevented by a file seal; see
>  .BR fcntl (2).
>  .TP
> +.B EPERM
> +The
> +.B O_ENCODED
> +flag was specified, but the caller was not privileged.
> +.TP
>  .B EROFS
>  .I pathname
>  refers to a file on a read-only filesystem and write access was
> diff --git a/man2/readv.2 b/man2/readv.2
> index af27aa63e..aa60b980a 100644
> --- a/man2/readv.2
> +++ b/man2/readv.2
> @@ -265,6 +265,11 @@ the data is always appended to the end of the file.
>  However, if the
>  .I offset
>  argument is \-1, the current file offset is updated.
> +.TP
> +.BR RWF_ENCODED " (since Linux 5.6)"
> +Read or write encoded (e.g., compressed) data.
> +See
> +.BR rwf_encoded (7).
>  .SH RETURN VALUE
>  On success,
>  .BR readv (),
> @@ -284,6 +289,13 @@ than requested (see
>  and
>  .BR write (2)).
>  .PP
> +If
> +.B
> +RWF_ENCODED
> +was specified in
> +.IR flags ,
> +then the return value is the number of encoded bytes.
> +.PP
>  On error, \-1 is returned, and \fIerrno\fP is set appropriately.
>  .SH ERRORS
>  The errors are as given for
> @@ -314,6 +326,40 @@ is less than zero or greater than the permitted maximum.
>  .TP
>  .B EOPNOTSUPP
>  An unknown flag is specified in \fIflags\fP.
> +.TP
> +.B EOPNOTSUPP
> +.B RWF_ENCODED
> +is specified in
> +.I flags
> +and the filesystem does not implement encoded I/O.
> +.TP
> +.B EPERM
> +.B RWF_ENCODED
> +is specified in
> +.I flags
> +and the file was not opened with the
> +.B O_ENCODED
> +flag.
> +.PP
> +.BR preadv2 ()
> +can fail for the following reasons:
> +.TP
> +.B EFBIG
> +.B RWF_ENCODED
> +is specified in
> +.I flags
> +and buffers in
> +.I iov
> +were not big enough to return the encoded data.

I don't like it that EFBIG is returned for read.
While EXXX values meaning is often very vague, EFBIG meaning
is still quite consistent - it is always a write related error when trying
to change i_size above fs/system limits.

In the case above, I find E2BIG much more appropriate.
Although its original meaning was too long arg list, it already grew
several cases where it generally means "buffer cannot hold the result"
like the case with msgrcv(2) and listxattr(2).

Thanks,
Amir.

> +.PP
> +.BR pwritev2 ()
> +can fail for the following reasons:
> +.TP
> +.B EINVAL
> +.B RWF_ENCODED
> +is specified in
> +.I flags
> +and the alignment and/or size requirements are not met.
>  .SH VERSIONS
>  .BR preadv ()
>  and
> diff --git a/man7/rwf_encoded.7 b/man7/rwf_encoded.7
> new file mode 100644
> index 000000000..90f5292e2
> --- /dev/null
> +++ b/man7/rwf_encoded.7
> @@ -0,0 +1,297 @@
> +.\" Copyright (c) 2019 by Omar Sandoval <osandov@fb.com>
> +.\"
> +.\" %%%LICENSE_START(VERBATIM)
> +.\" Permission is granted to make and distribute verbatim copies of this
> +.\" manual provided the copyright notice and this permission notice are
> +.\" preserved on all copies.
> +.\"
> +.\" Permission is granted to copy and distribute modified versions of this
> +.\" manual under the conditions for verbatim copying, provided that the
> +.\" entire resulting derived work is distributed under the terms of a
> +.\" permission notice identical to this one.
> +.\"
> +.\" Since the Linux kernel and libraries are constantly changing, this
> +.\" manual page may be incorrect or out-of-date.  The author(s) assume no
> +.\" responsibility for errors or omissions, or for damages resulting from
> +.\" the use of the information contained herein.  The author(s) may not
> +.\" have taken the same level of care in the production of this manual,
> +.\" which is licensed free of charge, as they might when working
> +.\" professionally.
> +.\"
> +.\" Formatted or processed versions of this manual, if unaccompanied by
> +.\" the source, must acknowledge the copyright and authors of this work.
> +.\" %%%LICENSE_END
> +.\"
> +.\"
> +.TH RWF_ENCODED  7 2019-10-14 "Linux" "Linux Programmer's Manual"
> +.SH NAME
> +rwf_encoded \- overview of encoded I/O
> +.SH DESCRIPTION
> +Several filesystems (e.g., Btrfs) support transparent encoding
> +(e.g., compression, encryption) of data on disk:
> +written data is encoded by the kernel before it is written to disk,
> +and read data is decoded before being returned to the user.
> +In some cases, it is useful to skip this encoding step.
> +For example, the user may want to read the compressed contents of a file
> +or write pre-compressed data directly to a file.
> +This is referred to as "encoded I/O".
> +.SS Encoded I/O API
> +Encoded I/O is specified with the
> +.B RWF_ENCODED
> +flag to
> +.BR preadv2 (2)
> +and
> +.BR pwritev2 (2).
> +If
> +.B RWF_ENCODED
> +is specified, then
> +.I iov[0].iov_base
> +points to an
> +.I
> +encoded_iov
> +structure, defined in
> +.I <linux/fs.h>
> +as:
> +.PP
> +.in +4n
> +.EX
> +struct encoded_iov {
> +    __u64 len;
> +    __u64 unencoded_len;
> +    __u64 unencoded_offset;
> +    __u32 compression;
> +    __u32 encryption;
> +
> +};
> +.EE
> +.in
> +.PP
> +.I iov[0].iov_len
> +must be set to
> +.IR "sizeof(struct\ encoded_iov)" .
> +The remaining buffers contain the encoded data.
> +.PP
> +.I compression
> +and
> +.I encryption
> +are the encoding fields.
> +.I compression
> +is one of
> +.B ENCODED_IOV_COMPRESSION_NONE
> +(zero),
> +.BR ENCODED_IOV_COMPRESSION_ZLIB ,
> +.BR ENCODED_IOV_COMPRESSION_LZO ,
> +or
> +.BR ENCODED_IOV_COMPRESSION_ZSTD .
> +.I encryption
> +is currently always
> +.B ENCODED_IOV_ENCRYPTION_NONE
> +(zero).
> +.PP
> +.I unencoded_len
> +is the length of the unencoded (i.e., decrypted and decompressed) data.
> +.I unencoded_offset
> +is the offset into the unencoded data where the data in the file begins
> +(strictly less than
> +.IR unencoded_len ).
> +.I len
> +is the length of the data in the file.
> +.PP
> +In most cases,
> +.I len
> +is equal to
> +.I unencoded_len
> +and
> +.I unencoded_offset
> +is zero.
> +However, it may be necessary to refer to a subset of the unencoded data,
> +usually because a read occurred in the middle of an encoded extent,
> +because part of an extent was overwritten or deallocated in some
> +way (e.g., with
> +.BR write (2),
> +.BR truncate (2),
> +or
> +.BR fallocate (2))
> +or because part of an extent was added to the file (e.g., with
> +.BR ioctl_ficlonerange (2)
> +or
> +.BR ioctl_fideduperange (2)).
> +For example, if
> +.I len
> +is 300,
> +.I unencoded_len
> +is 1000,
> +and
> +.I unencoded_offset
> +is 600,
> +then the encoded data is 1000 bytes long when decoded,
> +of which only the 300 bytes starting at offset 600 are used;
> +the first 600 and last 100 bytes should be ignored.
> +.PP
> +Additionally,
> +.I len
> +may be greater than
> +.I unencoded_len
> +-
> +.IR unencoded_offset;
> +in this case, the data in the file is longer than the unencoded data,
> +and the difference is zero-filled.
> +.PP
> +If the unencoded data is actually longer than
> +.IR unencoded_len ,
> +then it is truncated;
> +if it is shorter, then it is extended with zeroes.
> +.PP
> +For
> +.BR pwritev2 (),
> +the metadata should be specified in
> +.IR iov[0] ,
> +and the encoded data should be passed in the remaining buffers.
> +This returns the number of encoded bytes written (that is, the sum of
> +.I iov[n].iov_len
> +for 1 <=
> +.I n
> +<
> +.IR iovcnt ;
> +partial writes will not occur).
> +If the
> +.I offset
> +argument to
> +.BR pwritev2 ()
> +is -1, then the file offset is incremented by
> +.IR len .
> +At least one encoding field must be non-zero.
> +Note that the encoded data is not validated when it is written;
> +if it is not valid (e.g., it cannot be decompressed),
> +then a subsequent read may result in an error.
> +.PP
> +For
> +.BR preadv2 (),
> +the metadata is returned in
> +.IR iov[0] ,
> +and the encoded data is returned in the remaining buffers.
> +This returns the number of encoded bytes read.
> +Note that a return value of zero does not indicate end of file;
> +one should refer to
> +.I len
> +(for example, a hole in the file has a non-zero
> +.I len
> +but a zero return value).
> +A
> +.I len
> +of zero indicates end of file.
> +If the
> +.I offset
> +argument to
> +.BR preadv2 ()
> +is -1, then the file offset is incremented by
> +.IR len .
> +If the provided buffers are not large enough to return an entire encoded
> +extent,
> +then this returns -1 and sets
> +.I errno
> +to
> +.BR EFBIG .
> +This will only return one encoded extent per call.
> +This can also read data which is not encoded;
> +all encoding fields will be zero in that case.
> +.SS Security
> +Encoded I/O creates the potential for some security issues:
> +.IP * 3
> +Encoded writes allow writing arbitrary data which the kernel will decode on
> +a subsequent read. Decompression algorithms are complex and may have bugs
> +which can be exploited by malicous data.
> +.IP *
> +Encoded reads may return data which is not logically present in the file
> +(see the discussion of
> +.I len
> +vs.
> +.I unencoded_len
> +above).
> +It may not be intended for this data to be readable.
> +.PP
> +Therefore, encoded I/O requires privilege.
> +Namely, the
> +.B RWF_ENCODED
> +flag may only be used when the file was opened with the
> +.B O_ENCODED
> +flag to
> +.BR open (2),
> +which requires the
> +.B CAP_SYS_ADMIN
> +capability.
> +.B O_ENCODED
> +may be set and cleared with
> +.BR fcntl (2).
> +Note that it is not cleared on
> +.BR fork (2)
> +or
> +.BR execve (2);
> +one may wish to use
> +.B O_CLOEXEC
> +with
> +.BR O_ENCODED .
> +.SS Filesystem support
> +Encoded I/O is supported on the following filesystems:
> +.TP
> +Btrfs (since Linux 5.6)
> +.IP
> +Btrfs supports encoded reads and writes of compressed data.
> +The data is encoded as follows:
> +.RS
> +.IP * 3
> +If
> +.I compression
> +is
> +.BR ENCODED_IOV_COMPRESSION_ZLIB ,
> +then the encoded data is a single zlib stream.
> +.IP *
> +If
> +.I compression
> +is
> +.BR ENCODED_IOV_COMPRESSION_LZO ,
> +then the encoded data is compressed page by page with LZO1X
> +and wrapped in the format described in the Linux kernel source file
> +.IR fs/btrfs/lzo.c .
> +.IP *
> +If
> +.I compression
> +is
> +.BR ENCODED_IOV_COMPRESSION_ZSTD ,
> +then the encoded data is a single zstd frame compressed with the
> +.I windowLog
> +compression parameter set to no more than 17.
> +.RE
> +.IP
> +Additionally, there are some restrictions on
> +.BR pwritev2 ():
> +.RS
> +.IP * 3
> +.I offset
> +(or the current file offset if
> +.I offset
> +is -1) must be aligned to the sector size of the filesystem.
> +.IP *
> +.I len
> +must be aligned to the sector size of the filesystem
> +unless the data ends at or beyond the current end of the file.
> +.IP *
> +.I unencoded_len
> +and the length of the encoded data must each be no more than 128 KiB.
> +This limit may increase in the future.
> +.IP *
> +The length of the encoded data rounded up to the nearest sector must be
> +less than
> +.I unencoded_len
> +rounded up to the nearest sector.
> +.IP *
> +Referring to a subset of unencoded data is not yet implemented; i.e.,
> +.I len
> +must equal
> +.I unencoded_len
> +and
> +.I unencoded_offset
> +must be zero.
> +.IP *
> +Writing compressed inline extents is not yet implemented.
> +.RE
> --
> 2.23.0
>

^ permalink raw reply

* Re: [RFC PATCH v2 0/5] fs: interface for directly reading/writing compressed data
From: Dave Chinner @ 2019-10-20 23:05 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: linux-fsdevel, linux-btrfs, Jann Horn, linux-api, kernel-team
In-Reply-To: <cover.1571164762.git.osandov@fb.com>

On Tue, Oct 15, 2019 at 11:42:38AM -0700, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> Hello,
> 
> This series adds an API for reading compressed data on a filesystem
> without decompressing it as well as support for writing compressed data
> directly to the filesystem. It is based on my previous series which
> added a Btrfs-specific ioctl [1], but it is now an extension to
> preadv2()/pwritev2() as suggested by Dave Chinner [2]. I've included a
> man page patch describing the API in detail. Test cases and examples
> programs are available [3].
> 
> The use case that I have in mind is Btrfs send/receive: currently, when
> sending data from one compressed filesystem to another, the sending side
> decompresses the data and the receiving side recompresses it before
> writing it out. This is wasteful and can be avoided if we can just send
> and write compressed extents. The send part will be implemented in a
> separate series, as this API can stand alone.
> 
> Patches 1 and 2 add the VFS support. Patch 3 is a Btrfs prep patch.
> Patch 4 implements encoded reads for Btrfs, and patch 5 implements
> encoded writes.
> 
> Changes from v1 [4]:
> 
> - Encoded reads are now also implemented.
> - The encoded_iov structure now includes metadata for referring to a
>   subset of decoded data. This is required to handle certain cases where
>   a compressed extent is truncated, hole punched, or otherwise sliced up
>   and Btrfs chooses to reflect this in metadata instead of decompressing
>   the whole extent and rewriting the pieces. We call these "bookend
>   extents" in Btrfs, but any filesystem supporting transparent encoding
>   is likely to have a similar concept.

Where's the in-kernel documentation for this API? You're encoding a
specific set of behaviours into the user API, so this needs a whole
heap of documentation in the generic code to describe how it works
so that other filesystems implementing have a well defined guideline
to what they need to support.

Also, I don't see any test code for this - can you please add
support for RWF_ENCODED to xfs_io and write a suite of unit tests
for fstests that exercise the user API fully?  Given our history of
screwing up new user APIs, this absolutely should not be merged
until there is a full set of generic unit tests written and reviewed
for it and support has been added to fsstress, fsx, and other test
utilities to fuzz and stress the implementation as part of normal
day-to-day filesystem development...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply

* Re: [musl] [PATCH] arm64: uapi: Fix user space compile with musl libc
From: Hauke Mehrtens @ 2019-10-19 21:51 UTC (permalink / raw)
  To: musl, Rich Felker
  Cc: catalin.marinas, will, linux-arm-kernel, linux-kernel, linux-api,
	stable
In-Reply-To: <20191019202950.GC16318@brightrain.aerifal.cx>

On 10/19/19 10:29 PM, Rich Felker wrote:
> On Sat, Oct 19, 2019 at 10:17:17PM +0200, Hauke Mehrtens wrote:
>> musl libc also defines the structures in their arch/aarch64/bits/signal.h
>> header file. Some applications like strace and gdb include both of them
>> and then the structure definitions are clashing and the build of these
>> user space applications fails.
>>
>> This patch allows a libc to define a constant which tells the kernel
>> header file that the libc already defined these structures and that they
>> should not be defined by the kernel uapi header files any more to
>> prevent clashes. This is done in a similar way as it is already done for
>> other header files.
>>
>> When this patch was accepted into the kernel I will also update musl
>> libc to define these constants.
> 
> I don't entirely object to this outright, but I'd really like to avoid
> adding further __UAPI_DEF_* suppressions. AIUI asm/sigcontext.h is not
> intended to be used with userspace headers. Is it still being
> indirectly included via some other uapi headers? (I thought that was
> fixed..) If so, that should really be fixed first, and then we can see
> if there's still motivation for the patch here.
> 
> Rich
> 
Hi Rich,

I did some more research and it looks like this patch also fixes my
problem with strace and gdb compile:
https://git.kernel.org/linus/9966a05c7b80f075f2bc7e48dbb108d3f2927234

I will backport it in OpenWrt to kernel 4.19.

Please drop my patch.

It would be nice if it could go into the stable 4.19 kernel.

Hauke

^ permalink raw reply

* Re: [musl] [PATCH] arm64: uapi: Fix user space compile with musl libc
From: Rich Felker @ 2019-10-19 20:29 UTC (permalink / raw)
  To: Hauke Mehrtens
  Cc: catalin.marinas, musl, linux-kernel, stable, linux-api, will,
	linux-arm-kernel
In-Reply-To: <20191019201717.15358-1-hauke@hauke-m.de>

On Sat, Oct 19, 2019 at 10:17:17PM +0200, Hauke Mehrtens wrote:
> musl libc also defines the structures in their arch/aarch64/bits/signal.h
> header file. Some applications like strace and gdb include both of them
> and then the structure definitions are clashing and the build of these
> user space applications fails.
> 
> This patch allows a libc to define a constant which tells the kernel
> header file that the libc already defined these structures and that they
> should not be defined by the kernel uapi header files any more to
> prevent clashes. This is done in a similar way as it is already done for
> other header files.
> 
> When this patch was accepted into the kernel I will also update musl
> libc to define these constants.

I don't entirely object to this outright, but I'd really like to avoid
adding further __UAPI_DEF_* suppressions. AIUI asm/sigcontext.h is not
intended to be used with userspace headers. Is it still being
indirectly included via some other uapi headers? (I thought that was
fixed..) If so, that should really be fixed first, and then we can see
if there's still motivation for the patch here.

Rich


> Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de>
> Cc: stable@vger.kernel.org
> ---
>  arch/arm64/include/uapi/asm/sigcontext.h | 13 +++++++++++++
>  include/uapi/linux/libc-compat.h         | 20 ++++++++++++++++++++
>  2 files changed, 33 insertions(+)
> 
> diff --git a/arch/arm64/include/uapi/asm/sigcontext.h b/arch/arm64/include/uapi/asm/sigcontext.h
> index 8b0ebce92427..92d911146137 100644
> --- a/arch/arm64/include/uapi/asm/sigcontext.h
> +++ b/arch/arm64/include/uapi/asm/sigcontext.h
> @@ -20,7 +20,9 @@
>  #ifndef __ASSEMBLY__
>  
>  #include <linux/types.h>
> +#include <linux/libc-compat.h>
>  
> +#if __UAPI_DEF_SIGCONTEXT
>  /*
>   * Signal context structure - contains all info to do with the state
>   * before the signal handler was invoked.
> @@ -35,6 +37,7 @@ struct sigcontext {
>  	/* 4K reserved for FP/SIMD state and future expansion */
>  	__u8 __reserved[4096] __attribute__((__aligned__(16)));
>  };
> +#endif
>  
>  /*
>   * Allocation of __reserved[]:
> @@ -57,6 +60,7 @@ struct sigcontext {
>   * generated when userspace does not opt in for any such extension.
>   */
>  
> +#if __UAPI_DEF_AARCH64_CTX
>  /*
>   * Header to be used at the beginning of structures extending the user
>   * context. Such structures must be placed after the rt_sigframe on the stack
> @@ -67,7 +71,9 @@ struct _aarch64_ctx {
>  	__u32 magic;
>  	__u32 size;
>  };
> +#endif
>  
> +#if __UAPI_DEF_FPSIMD_CONTEXT
>  #define FPSIMD_MAGIC	0x46508001
>  
>  struct fpsimd_context {
> @@ -76,7 +82,9 @@ struct fpsimd_context {
>  	__u32 fpcr;
>  	__uint128_t vregs[32];
>  };
> +#endif
>  
> +#if __UAPI_DEF_ESR_CONTEXT
>  /*
>   * Note: similarly to all other integer fields, each V-register is stored in an
>   * endianness-dependent format, with the byte at offset i from the start of the
> @@ -93,7 +101,9 @@ struct esr_context {
>  	struct _aarch64_ctx head;
>  	__u64 esr;
>  };
> +#endif
>  
> +#if __UAPI_DEF_EXTRA_CONTEXT
>  /*
>   * extra_context: describes extra space in the signal frame for
>   * additional structures that don't fit in sigcontext.__reserved[].
> @@ -128,7 +138,9 @@ struct extra_context {
>  	__u32 size; /* size in bytes of the extra space */
>  	__u32 __reserved[3];
>  };
> +#endif
>  
> +#if __UAPI_DEF_SVE_CONTEXT
>  #define SVE_MAGIC	0x53564501
>  
>  struct sve_context {
> @@ -136,6 +148,7 @@ struct sve_context {
>  	__u16 vl;
>  	__u16 __reserved[3];
>  };
> +#endif
>  
>  #endif /* !__ASSEMBLY__ */
>  
> diff --git a/include/uapi/linux/libc-compat.h b/include/uapi/linux/libc-compat.h
> index 8254c937c9f4..a863130f4638 100644
> --- a/include/uapi/linux/libc-compat.h
> +++ b/include/uapi/linux/libc-compat.h
> @@ -264,4 +264,24 @@
>  
>  #endif /* __GLIBC__ */
>  
> +/* Definitions for arch/arm64/include/uapi/asm/sigcontext.h */
> +#ifndef __UAPI_DEF_SIGCONTEXT
> +#define __UAPI_DEF_SIGCONTEXT		1
> +#endif
> +#ifndef __UAPI_DEF_AARCH64_CTX
> +#define __UAPI_DEF_AARCH64_CTX		1
> +#endif
> +#ifndef __UAPI_DEF_FPSIMD_CONTEXT
> +#define __UAPI_DEF_FPSIMD_CONTEXT	1
> +#endif
> +#ifndef __UAPI_DEF_ESR_CONTEXT
> +#define __UAPI_DEF_ESR_CONTEXT		1
> +#endif
> +#ifndef __UAPI_DEF_EXTRA_CONTEXT
> +#define __UAPI_DEF_EXTRA_CONTEXT	1
> +#endif
> +#ifndef __UAPI_DEF_SVE_CONTEXT
> +#define __UAPI_DEF_SVE_CONTEXT		1
> +#endif
> +
>  #endif /* _UAPI_LIBC_COMPAT_H */
> -- 
> 2.20.1

^ permalink raw reply

* [PATCH] arm64: uapi: Fix user space compile with musl libc
From: Hauke Mehrtens @ 2019-10-19 20:17 UTC (permalink / raw)
  To: catalin.marinas, will
  Cc: linux-arm-kernel, linux-kernel, linux-api, musl, Hauke Mehrtens,
	stable

musl libc also defines the structures in their arch/aarch64/bits/signal.h
header file. Some applications like strace and gdb include both of them
and then the structure definitions are clashing and the build of these
user space applications fails.

This patch allows a libc to define a constant which tells the kernel
header file that the libc already defined these structures and that they
should not be defined by the kernel uapi header files any more to
prevent clashes. This is done in a similar way as it is already done for
other header files.

When this patch was accepted into the kernel I will also update musl
libc to define these constants.

Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de>
Cc: stable@vger.kernel.org
---
 arch/arm64/include/uapi/asm/sigcontext.h | 13 +++++++++++++
 include/uapi/linux/libc-compat.h         | 20 ++++++++++++++++++++
 2 files changed, 33 insertions(+)

diff --git a/arch/arm64/include/uapi/asm/sigcontext.h b/arch/arm64/include/uapi/asm/sigcontext.h
index 8b0ebce92427..92d911146137 100644
--- a/arch/arm64/include/uapi/asm/sigcontext.h
+++ b/arch/arm64/include/uapi/asm/sigcontext.h
@@ -20,7 +20,9 @@
 #ifndef __ASSEMBLY__
 
 #include <linux/types.h>
+#include <linux/libc-compat.h>
 
+#if __UAPI_DEF_SIGCONTEXT
 /*
  * Signal context structure - contains all info to do with the state
  * before the signal handler was invoked.
@@ -35,6 +37,7 @@ struct sigcontext {
 	/* 4K reserved for FP/SIMD state and future expansion */
 	__u8 __reserved[4096] __attribute__((__aligned__(16)));
 };
+#endif
 
 /*
  * Allocation of __reserved[]:
@@ -57,6 +60,7 @@ struct sigcontext {
  * generated when userspace does not opt in for any such extension.
  */
 
+#if __UAPI_DEF_AARCH64_CTX
 /*
  * Header to be used at the beginning of structures extending the user
  * context. Such structures must be placed after the rt_sigframe on the stack
@@ -67,7 +71,9 @@ struct _aarch64_ctx {
 	__u32 magic;
 	__u32 size;
 };
+#endif
 
+#if __UAPI_DEF_FPSIMD_CONTEXT
 #define FPSIMD_MAGIC	0x46508001
 
 struct fpsimd_context {
@@ -76,7 +82,9 @@ struct fpsimd_context {
 	__u32 fpcr;
 	__uint128_t vregs[32];
 };
+#endif
 
+#if __UAPI_DEF_ESR_CONTEXT
 /*
  * Note: similarly to all other integer fields, each V-register is stored in an
  * endianness-dependent format, with the byte at offset i from the start of the
@@ -93,7 +101,9 @@ struct esr_context {
 	struct _aarch64_ctx head;
 	__u64 esr;
 };
+#endif
 
+#if __UAPI_DEF_EXTRA_CONTEXT
 /*
  * extra_context: describes extra space in the signal frame for
  * additional structures that don't fit in sigcontext.__reserved[].
@@ -128,7 +138,9 @@ struct extra_context {
 	__u32 size; /* size in bytes of the extra space */
 	__u32 __reserved[3];
 };
+#endif
 
+#if __UAPI_DEF_SVE_CONTEXT
 #define SVE_MAGIC	0x53564501
 
 struct sve_context {
@@ -136,6 +148,7 @@ struct sve_context {
 	__u16 vl;
 	__u16 __reserved[3];
 };
+#endif
 
 #endif /* !__ASSEMBLY__ */
 
diff --git a/include/uapi/linux/libc-compat.h b/include/uapi/linux/libc-compat.h
index 8254c937c9f4..a863130f4638 100644
--- a/include/uapi/linux/libc-compat.h
+++ b/include/uapi/linux/libc-compat.h
@@ -264,4 +264,24 @@
 
 #endif /* __GLIBC__ */
 
+/* Definitions for arch/arm64/include/uapi/asm/sigcontext.h */
+#ifndef __UAPI_DEF_SIGCONTEXT
+#define __UAPI_DEF_SIGCONTEXT		1
+#endif
+#ifndef __UAPI_DEF_AARCH64_CTX
+#define __UAPI_DEF_AARCH64_CTX		1
+#endif
+#ifndef __UAPI_DEF_FPSIMD_CONTEXT
+#define __UAPI_DEF_FPSIMD_CONTEXT	1
+#endif
+#ifndef __UAPI_DEF_ESR_CONTEXT
+#define __UAPI_DEF_ESR_CONTEXT		1
+#endif
+#ifndef __UAPI_DEF_EXTRA_CONTEXT
+#define __UAPI_DEF_EXTRA_CONTEXT	1
+#endif
+#ifndef __UAPI_DEF_SVE_CONTEXT
+#define __UAPI_DEF_SVE_CONTEXT		1
+#endif
+
 #endif /* _UAPI_LIBC_COMPAT_H */
-- 
2.20.1

^ permalink raw reply related

* Re: [RFC PATCH v2 2/5] fs: add RWF_ENCODED for reading/writing compressed data
From: Aleksa Sarai @ 2019-10-19  5:01 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: linux-fsdevel, linux-btrfs, Dave Chinner, Jann Horn, linux-api,
	kernel-team, christian.brauner
In-Reply-To: <7f98cf5409cf2b583cd5b3451fc739fd3428873b.1571164762.git.osandov@fb.com>

[-- Attachment #1: Type: text/plain, Size: 10424 bytes --]

On 2019-10-15, Omar Sandoval <osandov@osandov.com> wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> Btrfs supports transparent compression: data written by the user can be
> compressed when written to disk and decompressed when read back.
> However, we'd like to add an interface to write pre-compressed data
> directly to the filesystem, and the matching interface to read
> compressed data without decompressing it. This adds support for
> so-called "encoded I/O" via preadv2() and pwritev2().
> 
> A new RWF_ENCODED flags indicates that a read or write is "encoded". If
> this flag is set, iov[0].iov_base points to a struct encoded_iov which
> is used for metadata: namely, the compression algorithm, unencoded
> (i.e., decompressed) length, and what subrange of the unencoded data
> should be used (needed for truncated or hole-punched extents and when
> reading in the middle of an extent). For reads, the filesystem returns
> this information; for writes, the caller provides it to the filesystem.
> iov[0].iov_len must be set to sizeof(struct encoded_iov), which can be
> used to extend the interface in the future. The remaining iovecs contain
> the encoded extent.
> 
> Filesystems must indicate that they support encoded writes by setting
> FMODE_ENCODED_IO in ->file_open().
> 
> Signed-off-by: Omar Sandoval <osandov@fb.com>
> ---
>  include/linux/fs.h      | 14 +++++++
>  include/uapi/linux/fs.h | 26 ++++++++++++-
>  mm/filemap.c            | 82 ++++++++++++++++++++++++++++++++++-------
>  3 files changed, 108 insertions(+), 14 deletions(-)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index e0d909d35763..54681f21e05e 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -175,6 +175,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
>  /* File does not contribute to nr_files count */
>  #define FMODE_NOACCOUNT		((__force fmode_t)0x20000000)
>  
> +/* File supports encoded IO */
> +#define FMODE_ENCODED_IO	((__force fmode_t)0x40000000)
> +
>  /*
>   * Flag for rw_copy_check_uvector and compat_rw_copy_check_uvector
>   * that indicates that they should check the contents of the iovec are
> @@ -314,6 +317,7 @@ enum rw_hint {
>  #define IOCB_SYNC		(1 << 5)
>  #define IOCB_WRITE		(1 << 6)
>  #define IOCB_NOWAIT		(1 << 7)
> +#define IOCB_ENCODED		(1 << 8)
>  
>  struct kiocb {
>  	struct file		*ki_filp;
> @@ -3088,6 +3092,11 @@ extern int sb_min_blocksize(struct super_block *, int);
>  extern int generic_file_mmap(struct file *, struct vm_area_struct *);
>  extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *);
>  extern ssize_t generic_write_checks(struct kiocb *, struct iov_iter *);
> +struct encoded_iov;
> +extern int generic_encoded_write_checks(struct kiocb *, struct encoded_iov *);
> +extern ssize_t check_encoded_read(struct kiocb *, struct iov_iter *);
> +extern int import_encoded_write(struct kiocb *, struct encoded_iov *,
> +				struct iov_iter *);
>  extern int generic_remap_checks(struct file *file_in, loff_t pos_in,
>  				struct file *file_out, loff_t pos_out,
>  				loff_t *count, unsigned int remap_flags);
> @@ -3403,6 +3412,11 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags)
>  			return -EOPNOTSUPP;
>  		ki->ki_flags |= IOCB_NOWAIT;
>  	}
> +	if (flags & RWF_ENCODED) {
> +		if (!(ki->ki_filp->f_mode & FMODE_ENCODED_IO))
> +			return -EOPNOTSUPP;
> +		ki->ki_flags |= IOCB_ENCODED;
> +	}
>  	if (flags & RWF_HIPRI)
>  		ki->ki_flags |= IOCB_HIPRI;
>  	if (flags & RWF_DSYNC)
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index 379a612f8f1d..ed92a8a257cb 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -284,6 +284,27 @@ struct fsxattr {
>  
>  typedef int __bitwise __kernel_rwf_t;
>  
> +enum {
> +	ENCODED_IOV_COMPRESSION_NONE,
> +	ENCODED_IOV_COMPRESSION_ZLIB,
> +	ENCODED_IOV_COMPRESSION_LZO,
> +	ENCODED_IOV_COMPRESSION_ZSTD,
> +	ENCODED_IOV_COMPRESSION_TYPES = ENCODED_IOV_COMPRESSION_ZSTD,
> +};
> +
> +enum {
> +	ENCODED_IOV_ENCRYPTION_NONE,
> +	ENCODED_IOV_ENCRYPTION_TYPES = ENCODED_IOV_ENCRYPTION_NONE,
> +};
> +
> +struct encoded_iov {
> +	__u64 len;
> +	__u64 unencoded_len;
> +	__u64 unencoded_offset;
> +	__u32 compression;
> +	__u32 encryption;
> +};
> +
>  /* high priority request, poll if possible */
>  #define RWF_HIPRI	((__force __kernel_rwf_t)0x00000001)
>  
> @@ -299,8 +320,11 @@ typedef int __bitwise __kernel_rwf_t;
>  /* per-IO O_APPEND */
>  #define RWF_APPEND	((__force __kernel_rwf_t)0x00000010)
>  
> +/* encoded (e.g., compressed or encrypted) IO */
> +#define RWF_ENCODED	((__force __kernel_rwf_t)0x00000020)
> +
>  /* mask of flags supported by the kernel */
>  #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
> -			 RWF_APPEND)
> +			 RWF_APPEND | RWF_ENCODED)
>  
>  #endif /* _UAPI_LINUX_FS_H */
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 1146fcfa3215..d2e6d9caf353 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2948,24 +2948,15 @@ static int generic_write_check_limits(struct file *file, loff_t pos,
>  	return 0;
>  }
>  
> -/*
> - * Performs necessary checks before doing a write
> - *
> - * Can adjust writing position or amount of bytes to write.
> - * Returns appropriate error code that caller should return or
> - * zero in case that write should be allowed.
> - */
> -inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
> +static int generic_write_checks_common(struct kiocb *iocb, loff_t *count)
>  {
>  	struct file *file = iocb->ki_filp;
>  	struct inode *inode = file->f_mapping->host;
> -	loff_t count;
> -	int ret;
>  
>  	if (IS_SWAPFILE(inode))
>  		return -ETXTBSY;
>  
> -	if (!iov_iter_count(from))
> +	if (!*count)
>  		return 0;
>  
>  	/* FIXME: this is for backwards compatibility with 2.4 */
> @@ -2975,8 +2966,21 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
>  	if ((iocb->ki_flags & IOCB_NOWAIT) && !(iocb->ki_flags & IOCB_DIRECT))
>  		return -EINVAL;
>  
> -	count = iov_iter_count(from);
> -	ret = generic_write_check_limits(file, iocb->ki_pos, &count);
> +	return generic_write_check_limits(iocb->ki_filp, iocb->ki_pos, count);
> +}
> +
> +/*
> + * Performs necessary checks before doing a write
> + *
> + * Can adjust writing position or amount of bytes to write.
> + * Returns a negative errno or the new number of bytes to write.
> + */
> +inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
> +{
> +	loff_t count = iov_iter_count(from);
> +	int ret;
> +
> +	ret = generic_write_checks_common(iocb, &count);
>  	if (ret)
>  		return ret;
>  
> @@ -2985,6 +2989,58 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
>  }
>  EXPORT_SYMBOL(generic_write_checks);
>  
> +int generic_encoded_write_checks(struct kiocb *iocb,
> +				 struct encoded_iov *encoded)
> +{
> +	loff_t count = encoded->unencoded_len;
> +	int ret;
> +
> +	ret = generic_write_checks_common(iocb, &count);
> +	if (ret)
> +		return ret;
> +
> +	if (count != encoded->unencoded_len) {
> +		/*
> +		 * The write got truncated by generic_write_checks_common(). We
> +		 * can't do a partial encoded write.
> +		 */
> +		return -EFBIG;
> +	}
> +	return 0;
> +}
> +EXPORT_SYMBOL(generic_encoded_write_checks);
> +
> +ssize_t check_encoded_read(struct kiocb *iocb, struct iov_iter *iter)
> +{
> +	if (!(iocb->ki_filp->f_flags & O_ENCODED))
> +		return -EPERM;
> +	if (iov_iter_single_seg_count(iter) != sizeof(struct encoded_iov))
> +		return -EINVAL;

I'm not sure what is precisely the right way of doing this within the
iov_iter world (maybe we should write some new helpers for that
usecase), but if you want to make forwards-compatibility much more
smooth please take a look at the new copy_struct_from_user() semantics
(it allows you to extend userspace-exposed structures without making it
painful for new software on old kernels).

Basically the semantics boil down to the following (ksize is the
kernel's struct size, and usize is the size that userspace used). All
new features must have their zero-value mean "don't use the new
feature".

  1. If ksize == usize, use it verbatim.
  2. If ksize > usize (old userspace), zero-fill the rest of the kernel
	 struct -- thus if userspace doesn't know about a new feature, it is
	 disabled.
  3. If ksize < usize (old kernel), check whether the trailing
     (usize - ksize) bytes are zero. If they are, then just use the
	 ksize prefix of the userspace struct. If they are non-zero then
	 give -E2BIG. Thus if userspace is newer than the kernel but isn't
	 using a new feature, it won't get an error.

This is how clone3(2) works (and openat2(2) will work), as well as some
older syscalls like perf_event_open(2) and sched_setattr(2). BPF also
has some similar semantics.

I really would like us to have a much more uniform way of defining
extensible APIs in Linux. By returning -EINVAL for all differently-sized
structs means that any new programs will give errors on older kernels
(even if they aren't using any new features).

> +	return iov_iter_count(iter) - sizeof(struct encoded_iov);
> +}
> +EXPORT_SYMBOL(check_encoded_read);
> +
> +int import_encoded_write(struct kiocb *iocb, struct encoded_iov *encoded,
> +			 struct iov_iter *from)
> +{
> +	if (!(iocb->ki_filp->f_flags & O_ENCODED))
> +		return -EPERM;
> +	if (iov_iter_single_seg_count(from) != sizeof(*encoded))
> +		return -EINVAL;
> +	if (copy_from_iter(encoded, sizeof(*encoded), from) != sizeof(*encoded))
> +		return -EFAULT;
> +	if (encoded->compression == ENCODED_IOV_COMPRESSION_NONE &&
> +	    encoded->encryption == ENCODED_IOV_ENCRYPTION_NONE)
> +		return -EINVAL;
> +	if (encoded->compression > ENCODED_IOV_COMPRESSION_TYPES ||
> +	    encoded->encryption > ENCODED_IOV_ENCRYPTION_TYPES)
> +		return -EINVAL;
> +	if (encoded->unencoded_offset >= encoded->unencoded_len)
> +		return -EINVAL;
> +	return 0;
> +}
> +EXPORT_SYMBOL(import_encoded_write);
> +
>  /*
>   * Performs necessary checks before doing a clone.
>   *

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply

* Re: [RFC PATCH v2 1/5] fs: add O_ENCODED open flag
From: Aleksa Sarai @ 2019-10-19  4:50 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: linux-fsdevel, linux-btrfs, Dave Chinner, Jann Horn, linux-api,
	kernel-team
In-Reply-To: <c4d2e911b7b04df9aa8418c8b11bc4c194e3808c.1571164762.git.osandov@fb.com>

[-- Attachment #1: Type: text/plain, Size: 4790 bytes --]

On 2019-10-15, Omar Sandoval <osandov@osandov.com> wrote:
> From: Omar Sandoval <osandov@fb.com>
> 
> The upcoming RWF_ENCODED operation introduces some security concerns:
> 
> 1. Compressed writes will pass arbitrary data to decompression
>    algorithms in the kernel.
> 2. Compressed reads can leak truncated/hole punched data.
> 
> Therefore, we need to require privilege for RWF_ENCODED. It's not
> possible to do the permissions checks at the time of the read or write
> because, e.g., io_uring submits IO from a worker thread. So, add an open
> flag which requires CAP_SYS_ADMIN. It can also be set and cleared with
> fcntl(). The flag is not cleared in any way on fork or exec; it should
> probably be used with O_CLOEXEC in most cases.
> 
> Note that the usual issue that unknown open flags are ignored doesn't
> really matter for O_ENCODED; if the kernel doesn't support O_ENCODED,
> then it doesn't support RWF_ENCODED, either.
> 
> Signed-off-by: Omar Sandoval <osandov@fb.com>
> ---
>  fs/fcntl.c                       | 10 ++++++++--
>  fs/namei.c                       |  4 ++++
>  include/linux/fcntl.h            |  2 +-
>  include/uapi/asm-generic/fcntl.h |  4 ++++
>  4 files changed, 17 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/fcntl.c b/fs/fcntl.c
> index 3d40771e8e7c..45ebc6df078e 100644
> --- a/fs/fcntl.c
> +++ b/fs/fcntl.c
> @@ -30,7 +30,8 @@
>  #include <asm/siginfo.h>
>  #include <linux/uaccess.h>
>  
> -#define SETFL_MASK (O_APPEND | O_NONBLOCK | O_NDELAY | O_DIRECT | O_NOATIME)
> +#define SETFL_MASK (O_APPEND | O_NONBLOCK | O_NDELAY | O_DIRECT | O_NOATIME | \
> +		    O_ENCODED)
>  
>  static int setfl(int fd, struct file * filp, unsigned long arg)
>  {
> @@ -49,6 +50,11 @@ static int setfl(int fd, struct file * filp, unsigned long arg)
>  		if (!inode_owner_or_capable(inode))
>  			return -EPERM;
>  
> +	/* O_ENCODED can only be set by superuser */
> +	if ((arg & O_ENCODED) && !(filp->f_flags & O_ENCODED) &&
> +	    !capable(CAP_SYS_ADMIN))
> +		return -EPERM;

I have a feeling the error should probably be an EACCES and not EPERM.

> +
>  	/* required for strict SunOS emulation */
>  	if (O_NONBLOCK != O_NDELAY)
>  	       if (arg & O_NDELAY)
> @@ -1031,7 +1037,7 @@ static int __init fcntl_init(void)
>  	 * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
>  	 * is defined as O_NONBLOCK on some platforms and not on others.
>  	 */
> -	BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ !=
> +	BUILD_BUG_ON(22 - 1 /* for O_RDONLY being 0 */ !=
>  		HWEIGHT32(
>  			(VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
>  			__FMODE_EXEC | __FMODE_NONOTIFY));
> diff --git a/fs/namei.c b/fs/namei.c
> index 671c3c1a3425..ae86b125888a 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -2978,6 +2978,10 @@ static int may_open(const struct path *path, int acc_mode, int flag)
>  	if (flag & O_NOATIME && !inode_owner_or_capable(inode))
>  		return -EPERM;
>  
> +	/* O_ENCODED can only be set by superuser */
> +	if ((flag & O_ENCODED) && !capable(CAP_SYS_ADMIN))
> +		return -EPERM;

I would suggest that this check be put into build_open_flags() rather
than putting it this late in open(). Also, same nit about the error
return as above.

> +
>  	return 0;
>  }
>  
> diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
> index d019df946cb2..5fac02479639 100644
> --- a/include/linux/fcntl.h
> +++ b/include/linux/fcntl.h
> @@ -9,7 +9,7 @@
>  	(O_RDONLY | O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | \
>  	 O_APPEND | O_NDELAY | O_NONBLOCK | O_NDELAY | __O_SYNC | O_DSYNC | \
>  	 FASYNC	| O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \
> -	 O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE)
> +	 O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | O_ENCODED)
>  
>  #ifndef force_o_largefile
>  #define force_o_largefile() (!IS_ENABLED(CONFIG_ARCH_32BIT_OFF_T))
> diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
> index 9dc0bf0c5a6e..8c5cbd5942e3 100644
> --- a/include/uapi/asm-generic/fcntl.h
> +++ b/include/uapi/asm-generic/fcntl.h
> @@ -97,6 +97,10 @@
>  #define O_NDELAY	O_NONBLOCK
>  #endif
>  
> +#ifndef O_ENCODED
> +#define O_ENCODED	040000000
> +#endif

You should also define this for all of the architectures which don't use
the generic O_* flag values. On alpha, O_PATH is equal to the value you
picked (just be careful on sparc -- 0x4000000 is the next free bit, but
it's used by FMODE_NONOTIFY.)

> +
>  #define F_DUPFD		0	/* dup */
>  #define F_GETFD		1	/* get close_on_exec */
>  #define F_SETFD		2	/* set/clear close_on_exec */

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply

* Re: [PATCH ghak90 V7 20/21] audit: add capcontid to set contid outside init_user_ns
From: Richard Guy Briggs @ 2019-10-19  1:39 UTC (permalink / raw)
  To: containers, linux-api, Linux-Audit Mailing List, linux-fsdevel,
	LKML, netdev, netfilter-devel
  Cc: nhorman, dhowells, ebiederm, simo, eparis, mpatel, serge
In-Reply-To: <214163d11a75126f610bcedfad67a4d89575dc77.1568834525.git.rgb@redhat.com>

On 2019-09-18 21:22, Richard Guy Briggs wrote:
> Provide a mechanism similar to CAP_AUDIT_CONTROL to explicitly give a
> process in a non-init user namespace the capability to set audit
> container identifiers.
> 
> Use audit netlink message types AUDIT_GET_CAPCONTID 1027 and
> AUDIT_SET_CAPCONTID 1028.  The message format includes the data
> structure:
> struct audit_capcontid_status {
>         pid_t   pid;
>         u32     enable;
> };

Paul, can I get a review of the general idea here to see if you're ok
with this way of effectively extending CAP_AUDIT_CONTROL for the sake of
setting contid from beyond the init user namespace where capable() can't
reach and ns_capable() is meaningless for these purposes?

Last weekend was Canadian Thanksgiving where I took an extra day for an
annual bike trip and I'm buried to my neck in a complete kitchen gut
(down to 1920 structural double brick and knob/tube wiring), but I've
got fixes or responses to almost everything else you've raised which
I'll post shortly.

Thanks!

> Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
> ---
>  include/linux/audit.h      | 14 +++++++
>  include/uapi/linux/audit.h |  2 +
>  kernel/audit.c             | 98 +++++++++++++++++++++++++++++++++++++++++++++-
>  kernel/audit.h             |  5 +++
>  4 files changed, 117 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/audit.h b/include/linux/audit.h
> index 1ce27af686ea..dcc53e62e266 100644
> --- a/include/linux/audit.h
> +++ b/include/linux/audit.h
> @@ -117,6 +117,7 @@ struct audit_task_info {
>  	kuid_t			loginuid;
>  	unsigned int		sessionid;
>  	struct audit_cont	*cont;
> +	u32			capcontid;
>  #ifdef CONFIG_AUDITSYSCALL
>  	struct audit_context	*ctx;
>  #endif
> @@ -224,6 +225,14 @@ static inline unsigned int audit_get_sessionid(struct task_struct *tsk)
>  	return tsk->audit->sessionid;
>  }
>  
> +static inline u32 audit_get_capcontid(struct task_struct *tsk)
> +{
> +	if (!tsk->audit)
> +		return 0;
> +	return tsk->audit->capcontid;
> +}
> +
> +extern int audit_set_capcontid(struct task_struct *tsk, u32 enable);
>  extern int audit_set_contid(struct task_struct *tsk, u64 contid);
>  
>  static inline u64 audit_get_contid(struct task_struct *tsk)
> @@ -309,6 +318,11 @@ static inline unsigned int audit_get_sessionid(struct task_struct *tsk)
>  	return AUDIT_SID_UNSET;
>  }
>  
> +static inline u32 audit_get_capcontid(struct task_struct *tsk)
> +{
> +	return 0;
> +}
> +
>  static inline u64 audit_get_contid(struct task_struct *tsk)
>  {
>  	return AUDIT_CID_UNSET;
> diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
> index eef42c8eea77..011b0a8ee9b2 100644
> --- a/include/uapi/linux/audit.h
> +++ b/include/uapi/linux/audit.h
> @@ -78,6 +78,8 @@
>  #define AUDIT_GET_LOGINUID	1024	/* Get loginuid of a task */
>  #define AUDIT_SET_LOGINUID	1025	/* Set loginuid of a task */
>  #define AUDIT_GET_SESSIONID	1026	/* Set sessionid of a task */
> +#define AUDIT_GET_CAPCONTID	1027	/* Get cap_contid of a task */
> +#define AUDIT_SET_CAPCONTID	1028	/* Set cap_contid of a task */
>  
>  #define AUDIT_FIRST_USER_MSG	1100	/* Userspace messages mostly uninteresting to kernel */
>  #define AUDIT_USER_AVC		1107	/* We filter this differently */
> diff --git a/kernel/audit.c b/kernel/audit.c
> index a70c9184e5d9..7160da464849 100644
> --- a/kernel/audit.c
> +++ b/kernel/audit.c
> @@ -1192,6 +1192,14 @@ static int audit_netlink_ok(struct sk_buff *skb, u16 msg_type)
>  	case AUDIT_GET_SESSIONID:
>  		return 0;
>  		break;
> +	case AUDIT_GET_CAPCONTID:
> +	case AUDIT_SET_CAPCONTID:
> +	case AUDIT_GET_CONTID:
> +	case AUDIT_SET_CONTID:
> +		if (!netlink_capable(skb, CAP_AUDIT_CONTROL) && !audit_get_capcontid(current))
> +			return -EPERM;
> +		return 0;
> +		break;
>  	default:  /* do more checks below */
>  		break;
>  	}
> @@ -1227,8 +1235,6 @@ static int audit_netlink_ok(struct sk_buff *skb, u16 msg_type)
>  	case AUDIT_TTY_SET:
>  	case AUDIT_TRIM:
>  	case AUDIT_MAKE_EQUIV:
> -	case AUDIT_GET_CONTID:
> -	case AUDIT_SET_CONTID:
>  	case AUDIT_SET_LOGINUID:
>  		/* Only support auditd and auditctl in initial pid namespace
>  		 * for now. */
> @@ -1304,6 +1310,23 @@ static int audit_get_contid_status(struct sk_buff *skb)
>  	return 0;
>  }
>  
> +static int audit_get_capcontid_status(struct sk_buff *skb)
> +{
> +	struct nlmsghdr *nlh = nlmsg_hdr(skb);
> +	u32 seq = nlh->nlmsg_seq;
> +	void *data = nlmsg_data(nlh);
> +	struct audit_capcontid_status cs;
> +
> +	cs.pid = ((struct audit_capcontid_status *)data)->pid;
> +	if (!cs.pid)
> +		cs.pid = task_tgid_nr(current);
> +	rcu_read_lock();
> +	cs.enable = audit_get_capcontid(find_task_by_vpid(cs.pid));
> +	rcu_read_unlock();
> +	audit_send_reply(skb, seq, AUDIT_GET_CAPCONTID, 0, 0, &cs, sizeof(cs));
> +	return 0;
> +}
> +
>  struct audit_loginuid_status { uid_t loginuid; };
>  
>  static int audit_get_loginuid_status(struct sk_buff *skb)
> @@ -1779,6 +1802,27 @@ static int audit_receive_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
>  		if (err)
>  			return err;
>  		break;
> +	case AUDIT_SET_CAPCONTID: {
> +		struct audit_capcontid_status *s = data;
> +		struct task_struct *tsk;
> +
> +		/* check if new data is valid */
> +		if (nlmsg_len(nlh) < sizeof(*s))
> +			return -EINVAL;
> +		tsk = find_get_task_by_vpid(s->pid);
> +		if (!tsk)
> +			return -EINVAL;
> +
> +		err = audit_set_capcontid(tsk, s->enable);
> +		put_task_struct(tsk);
> +		return err;
> +		break;
> +	}
> +	case AUDIT_GET_CAPCONTID:
> +		err = audit_get_capcontid_status(skb);
> +		if (err)
> +			return err;
> +		break;
>  	case AUDIT_SET_LOGINUID: {
>  		uid_t *loginuid = data;
>  		kuid_t kloginuid;
> @@ -2711,6 +2755,56 @@ static struct task_struct *audit_cont_owner(struct task_struct *tsk)
>  	return NULL;
>  }
>  
> +int audit_set_capcontid(struct task_struct *task, u32 enable)
> +{
> +	u32 oldcapcontid;
> +	int rc = 0;
> +	struct audit_buffer *ab;
> +	uid_t uid;
> +	struct tty_struct *tty;
> +	char comm[sizeof(current->comm)];
> +
> +	if (!task->audit)
> +		return -ENOPROTOOPT;
> +	oldcapcontid = audit_get_capcontid(task);
> +	/* if task is not descendant, block */
> +	if (task == current)
> +		rc = -EBADSLT;
> +	else if (!task_is_descendant(current, task))
> +		rc = -EXDEV;
> +	else if (current_user_ns() == &init_user_ns) {
> +		if (!capable(CAP_AUDIT_CONTROL) && !audit_get_capcontid(current))
> +			rc = -EPERM;
> +	}
> +	if (!rc)
> +		task->audit->capcontid = enable;
> +
> +	if (!audit_enabled)
> +		return rc;
> +
> +	ab = audit_log_start(audit_context(), GFP_KERNEL, AUDIT_SET_CAPCONTID);
> +	if (!ab)
> +		return rc;
> +
> +	uid = from_kuid(&init_user_ns, task_uid(current));
> +	tty = audit_get_tty();
> +	audit_log_format(ab,
> +			 "opid=%d capcontid=%u old-capcontid=%u pid=%d uid=%u auid=%u tty=%s ses=%u",
> +			 task_tgid_nr(task), enable, oldcapcontid,
> +			 task_tgid_nr(current), uid,
> +			 from_kuid(&init_user_ns, audit_get_loginuid(current)),
> +			 tty ? tty_name(tty) : "(none)",
> +			 audit_get_sessionid(current));
> +	audit_put_tty(tty);
> +	audit_log_task_context(ab);
> +	audit_log_format(ab, " comm=");
> +	audit_log_untrustedstring(ab, get_task_comm(comm, current));
> +	audit_log_d_path_exe(ab, current->mm);
> +	audit_log_format(ab, " res=%d", !rc);
> +	audit_log_end(ab);
> +	return rc;
> +}
> +
>  /*
>   * audit_set_contid - set current task's audit contid
>   * @task: target task
> diff --git a/kernel/audit.h b/kernel/audit.h
> index cb25341c1a0f..ac4694e88485 100644
> --- a/kernel/audit.h
> +++ b/kernel/audit.h
> @@ -231,6 +231,11 @@ struct audit_contid_status {
>  	u64	id;
>  };
>  
> +struct audit_capcontid_status {
> +	pid_t	pid;
> +	u32	enable;
> +};
> +
>  #define AUDIT_CONTID_DEPTH	5
>  
>  /* Indicates that audit should log the full pathname. */
> -- 
> 1.8.3.1
> 

- RGB

--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635

^ permalink raw reply

* Re: [RFC PATCH v2 5/5] btrfs: implement RWF_ENCODED writes
From: Omar Sandoval @ 2019-10-18 23:33 UTC (permalink / raw)
  To: Nikolay Borisov
  Cc: linux-btrfs, linux-fsdevel, kernel-team, Dave Chinner, Jann Horn,
	linux-api
In-Reply-To: <20191018225513.GD59713@vader>

On Fri, Oct 18, 2019 at 03:55:13PM -0700, Omar Sandoval wrote:
> On Wed, Oct 16, 2019 at 01:44:56PM +0300, Nikolay Borisov wrote:
> > 
> > 
> > On 15.10.19 г. 21:42 ч., Omar Sandoval wrote:
> > > From: Omar Sandoval <osandov@fb.com>
> > > 
> > > The implementation resembles direct I/O: we have to flush any ordered
> > > extents, invalidate the page cache, and do the io tree/delalloc/extent
> > > map/ordered extent dance. From there, we can reuse the compression code
> > > with a minor modification to distinguish the write from writeback.
> > > 
> > > Now that read and write are implemented, this also sets the
> > > FMODE_ENCODED_IO flag in btrfs_file_open().
> > > 
> > > Signed-off-by: Omar Sandoval <osandov@fb.com>
> > > ---
> > >  fs/btrfs/compression.c |   6 +-
> > >  fs/btrfs/compression.h |   5 +-
> > >  fs/btrfs/ctree.h       |   2 +
> > >  fs/btrfs/file.c        |  40 +++++++--
> > >  fs/btrfs/inode.c       | 197 ++++++++++++++++++++++++++++++++++++++++-
> > >  5 files changed, 237 insertions(+), 13 deletions(-)
> > > 

[snip]

> > > +	for (;;) {
> > > +		struct btrfs_ordered_extent *ordered;
> > > +
> > > +		ret = btrfs_wait_ordered_range(inode, start, end - start + 1);
> > > +		if (ret)
> > > +			goto out_pages;
> > > +		ret = invalidate_inode_pages2_range(inode->i_mapping,
> > > +						    start >> PAGE_SHIFT,
> > > +						    end >> PAGE_SHIFT);
> > > +		if (ret)
> > > +			goto out_pages;
> > > +		lock_extent_bits(io_tree, start, end, &cached_state);
> > > +		ordered = btrfs_lookup_ordered_range(BTRFS_I(inode), start,
> > > +						     end - start + 1);
> > > +		if (!ordered &&
> > > +		    !filemap_range_has_page(inode->i_mapping, start, end))
> > > +			break;
> > > +		if (ordered)
> > > +			btrfs_put_ordered_extent(ordered);
> > > +		unlock_extent_cached(io_tree, start, end, &cached_state);
> > > +		cond_resched();
> > > +	}
> > > +
> > > +	ret = btrfs_delalloc_reserve_space(inode, &data_reserved, start,
> > > +					   num_bytes);
> > > +	if (ret)
> > > +		goto out_unlock;
> > > +
> > > +	ret = btrfs_reserve_extent(root, num_bytes, disk_num_bytes,
> > > +				   disk_num_bytes, 0, 0, &ins, 1, 1);
> > > +	if (ret)
> > > +		goto out_delalloc_release;
> > > +
> > > +	em = create_io_em(inode, start, num_bytes, start, ins.objectid,
> > > +			  ins.offset, ins.offset, num_bytes, compression,
> > > +			  BTRFS_ORDERED_COMPRESSED);
> > > +	if (IS_ERR(em)) {
> > > +		ret = PTR_ERR(em);
> > > +		goto out_free_reserve;
> > > +	}
> > > +	free_extent_map(em);
> > > +
> > > +	ret = btrfs_add_ordered_extent_compress(inode, start, ins.objectid,
> > > +						num_bytes, ins.offset,
> > > +						BTRFS_ORDERED_COMPRESSED,
> > > +						compression);
> > > +	if (ret) {
> > > +		btrfs_drop_extent_cache(BTRFS_I(inode), start, end, 0);
> > > +		goto out_free_reserve;
> > > +	}
> > > +	btrfs_dec_block_group_reservations(fs_info, ins.objectid);
> > > +
> > > +	if (start + encoded->len > inode->i_size)
> > > +		i_size_write(inode, start + encoded->len);
> > 
> > Don't we want the inode size to be updated once data hits disk and
> > btrfs_finish_ordered_io is called?
> 
> Yup, you're right, this is too early.

Actually, no, this part is fine. Compare to the call to i_size_write()
in btrfs_get_blocks_direct_write(): we lock the extent in the io_tree,
create the ordered extent, update i_size, then unlock the extent. Anyone
else who comes in is going to find the ordered extent and wait on that.

> > > +
> > > +	unlock_extent_cached(io_tree, start, end, &cached_state);
> > > +
> > > +	btrfs_delalloc_release_extents(BTRFS_I(inode), num_bytes, false);
> > > +
> > > +	if (btrfs_submit_compressed_write(inode, start, num_bytes, ins.objectid,
> > > +					  ins.offset, pages, nr_pages, 0,
> > > +					  false)) {
> > > +		struct page *page = pages[0];
> > > +
> > > +		page->mapping = inode->i_mapping;
> > > +		btrfs_writepage_endio_finish_ordered(page, start, end, 0);
> > > +		page->mapping = NULL;
> > > +		ret = -EIO;
> > > +		goto out_pages;
> > > +	}
> 
> I also need to wait for the I/O to finish here.
> 
> > > +	iocb->ki_pos += encoded->len;
> > > +	return orig_count;
> > > +
> > > +out_free_reserve:
> > > +	btrfs_dec_block_group_reservations(fs_info, ins.objectid);
> > > +	btrfs_free_reserved_extent(fs_info, ins.objectid, ins.offset, 1);
> > > +out_delalloc_release:
> > > +	btrfs_delalloc_release_space(inode, data_reserved, start, num_bytes,
> > > +				     true);
> > > +out_unlock:
> > > +	unlock_extent_cached(io_tree, start, end, &cached_state);
> > > +out_pages:
> > > +	for (i = 0; i < nr_pages; i++) {
> > > +		if (pages[i])
> > > +			put_page(pages[i]);
> > > +	}
> > > +	kvfree(pages);
> > > +	return ret;
> > > +}
> > > +
> > >  #ifdef CONFIG_SWAP
> > >  /*
> > >   * Add an entry indicating a block group or device which is pinned by a
> > > 

^ permalink raw reply

* Re: [RFC PATCH v2 5/5] btrfs: implement RWF_ENCODED writes
From: Omar Sandoval @ 2019-10-18 22:55 UTC (permalink / raw)
  To: Nikolay Borisov
  Cc: linux-btrfs, linux-fsdevel, kernel-team, Dave Chinner, Jann Horn,
	linux-api
In-Reply-To: <0da91628-7f54-7d24-bf58-6807eb9535a5@suse.com>

On Wed, Oct 16, 2019 at 01:44:56PM +0300, Nikolay Borisov wrote:
> 
> 
> On 15.10.19 г. 21:42 ч., Omar Sandoval wrote:
> > From: Omar Sandoval <osandov@fb.com>
> > 
> > The implementation resembles direct I/O: we have to flush any ordered
> > extents, invalidate the page cache, and do the io tree/delalloc/extent
> > map/ordered extent dance. From there, we can reuse the compression code
> > with a minor modification to distinguish the write from writeback.
> > 
> > Now that read and write are implemented, this also sets the
> > FMODE_ENCODED_IO flag in btrfs_file_open().
> > 
> > Signed-off-by: Omar Sandoval <osandov@fb.com>
> > ---
> >  fs/btrfs/compression.c |   6 +-
> >  fs/btrfs/compression.h |   5 +-
> >  fs/btrfs/ctree.h       |   2 +
> >  fs/btrfs/file.c        |  40 +++++++--
> >  fs/btrfs/inode.c       | 197 ++++++++++++++++++++++++++++++++++++++++-
> >  5 files changed, 237 insertions(+), 13 deletions(-)
> > 
> > diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> > index b05b361e2062..6632dd8d2e4d 100644
> > --- a/fs/btrfs/compression.c
> > +++ b/fs/btrfs/compression.c
> > @@ -276,7 +276,8 @@ static void end_compressed_bio_write(struct bio *bio)
> >  			bio->bi_status == BLK_STS_OK);
> >  	cb->compressed_pages[0]->mapping = NULL;
> >  
> > -	end_compressed_writeback(inode, cb);
> > +	if (cb->writeback)
> > +		end_compressed_writeback(inode, cb);
> >  	/* note, our inode could be gone now */
> >  
> >  	/*
> > @@ -311,7 +312,7 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start,
> >  				 unsigned long compressed_len,
> >  				 struct page **compressed_pages,
> >  				 unsigned long nr_pages,
> > -				 unsigned int write_flags)
> > +				 unsigned int write_flags, bool writeback)
> 
> I don't see this function being called with true in this patch set,
> meaning it essentially eliminates end_compressed_writeback call in
> end_compressed_bio_write? Am I missing anything?

I'll point it out below.

> >  {
> >  	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> >  	struct bio *bio = NULL;
> > @@ -336,6 +337,7 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start,
> >  	cb->mirror_num = 0;
> >  	cb->compressed_pages = compressed_pages;
> >  	cb->compressed_len = compressed_len;
> > +	cb->writeback = writeback;
> >  	cb->orig_bio = NULL;
> >  	cb->nr_pages = nr_pages;
> >  
> > diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h
> > index 4cb8be9ff88b..d4176384ec15 100644
> > --- a/fs/btrfs/compression.h
> > +++ b/fs/btrfs/compression.h
> > @@ -47,6 +47,9 @@ struct compressed_bio {
> >  	/* the compression algorithm for this bio */
> >  	int compress_type;
> >  
> > +	/* Whether this is a write for writeback. */
> > +	bool writeback;
> > +
> >  	/* number of compressed pages in the array */
> >  	unsigned long nr_pages;
> >  
> > @@ -93,7 +96,7 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start,
> >  				  unsigned long compressed_len,
> >  				  struct page **compressed_pages,
> >  				  unsigned long nr_pages,
> > -				  unsigned int write_flags);
> > +				  unsigned int write_flags, bool writeback);
> >  blk_status_t btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
> >  				 int mirror_num, unsigned long bio_flags);
> >  
> > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > index 3b2aa1c7218c..9e1719e82cc8 100644
> > --- a/fs/btrfs/ctree.h
> > +++ b/fs/btrfs/ctree.h
> > @@ -2907,6 +2907,8 @@ int btrfs_writepage_cow_fixup(struct page *page, u64 start, u64 end);
> >  void btrfs_writepage_endio_finish_ordered(struct page *page, u64 start,
> >  					  u64 end, int uptodate);
> >  ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter);
> > +ssize_t btrfs_encoded_write(struct kiocb *iocb, struct iov_iter *from,
> > +			    struct encoded_iov *encoded);
> >  
> >  extern const struct dentry_operations btrfs_dentry_operations;
> >  
> > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> > index 51740cee39fc..8de6ac9b4b9c 100644
> > --- a/fs/btrfs/file.c
> > +++ b/fs/btrfs/file.c
> > @@ -1893,8 +1893,7 @@ static void update_time_for_write(struct inode *inode)
> >  		inode_inc_iversion(inode);
> >  }
> >  
> > -static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
> > -				    struct iov_iter *from)
> > +static ssize_t btrfs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
> >  {
> >  	struct file *file = iocb->ki_filp;
> >  	struct inode *inode = file_inode(file);
> > @@ -1904,14 +1903,22 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
> >  	u64 end_pos;
> >  	ssize_t num_written = 0;
> >  	const bool sync = iocb->ki_flags & IOCB_DSYNC;
> > +	struct encoded_iov encoded;
> >  	ssize_t err;
> >  	loff_t pos;
> >  	size_t count;
> >  	loff_t oldsize;
> >  	int clean_page = 0;
> >  
> > -	if (!(iocb->ki_flags & IOCB_DIRECT) &&
> > -	    (iocb->ki_flags & IOCB_NOWAIT))
> > +	if (iocb->ki_flags & IOCB_ENCODED) {
> > +		err = import_encoded_write(iocb, &encoded, from);
> > +		if (err)
> > +			return err;
> > +	}
> > +
> > +	if ((iocb->ki_flags & IOCB_NOWAIT) &&
> > +	    (!(iocb->ki_flags & IOCB_DIRECT) ||
> > +	     (iocb->ki_flags & IOCB_ENCODED)))
> >  		return -EOPNOTSUPP;
> >  
> >  	if (!inode_trylock(inode)) {
> > @@ -1920,14 +1927,27 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
> >  		inode_lock(inode);
> >  	}
> >  
> > -	err = generic_write_checks(iocb, from);
> > -	if (err <= 0) {
> > +	if (iocb->ki_flags & IOCB_ENCODED) {
> > +		err = generic_encoded_write_checks(iocb, &encoded);
> > +		if (err) {
> > +			inode_unlock(inode);
> > +			return err;
> > +		}
> > +		count = encoded.len;
> > +	} else {
> > +		err = generic_write_checks(iocb, from);
> > +		if (err < 0) {
> > +			inode_unlock(inode);
> > +			return err;
> > +		}
> > +		count = iov_iter_count(from);
> > +	}
> > +	if (count == 0) {
> >  		inode_unlock(inode);
> >  		return err;
> >  	}
> >  
> >  	pos = iocb->ki_pos;
> > -	count = iov_iter_count(from);
> >  	if (iocb->ki_flags & IOCB_NOWAIT) {
> >  		/*
> >  		 * We will allocate space in case nodatacow is not set,
> > @@ -1986,7 +2006,9 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
> >  	if (sync)
> >  		atomic_inc(&BTRFS_I(inode)->sync_writers);
> >  
> > -	if (iocb->ki_flags & IOCB_DIRECT) {
> > +	if (iocb->ki_flags & IOCB_ENCODED) {
> > +		num_written = btrfs_encoded_write(iocb, from, &encoded);
> > +	} else if (iocb->ki_flags & IOCB_DIRECT) {
> >  		num_written = __btrfs_direct_write(iocb, from);
> >  	} else {
> >  		num_written = btrfs_buffered_write(iocb, from);
> > @@ -3461,7 +3483,7 @@ static loff_t btrfs_file_llseek(struct file *file, loff_t offset, int whence)
> >  
> >  static int btrfs_file_open(struct inode *inode, struct file *filp)
> >  {
> > -	filp->f_mode |= FMODE_NOWAIT;
> > +	filp->f_mode |= FMODE_NOWAIT | FMODE_ENCODED_IO;
> >  	return generic_file_open(inode, filp);
> >  }
> >  
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index 174d0738d2c9..bcc5a2bed22b 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > @@ -865,7 +865,7 @@ static noinline void submit_compressed_extents(struct async_chunk *async_chunk)
> >  				    ins.objectid,
> >  				    ins.offset, async_extent->pages,
> >  				    async_extent->nr_pages,
> > -				    async_chunk->write_flags)) {
> > +				    async_chunk->write_flags, true)) {

This is the btrfs_submit_compressed_write() call, it's just so long that
the diff context doesn't include the name :)

> >  			struct page *p = async_extent->pages[0];
> >  			const u64 start = async_extent->start;
> >  			const u64 end = start + async_extent->ram_size - 1;
> > @@ -11055,6 +11055,201 @@ ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter)
> >  	return ret;
> >  }
> >  
> > +ssize_t btrfs_encoded_write(struct kiocb *iocb, struct iov_iter *from,
> > +			    struct encoded_iov *encoded)
> > +{
> > +	struct inode *inode = file_inode(iocb->ki_filp);
> > +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> > +	struct btrfs_root *root = BTRFS_I(inode)->root;
> > +	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
> > +	struct extent_changeset *data_reserved = NULL;
> > +	struct extent_state *cached_state = NULL;
> > +	int compression;
> > +	size_t orig_count;
> > +	u64 disk_num_bytes, num_bytes;
> > +	u64 start, end;
> > +	unsigned long nr_pages, i;
> > +	struct page **pages;
> > +	struct btrfs_key ins;
> > +	struct extent_map *em;
> > +	ssize_t ret;
> > +
> > +	switch (encoded->compression) {
> > +	case ENCODED_IOV_COMPRESSION_ZLIB:
> > +		compression = BTRFS_COMPRESS_ZLIB;
> > +		break;
> > +	case ENCODED_IOV_COMPRESSION_LZO:
> > +		compression = BTRFS_COMPRESS_LZO;
> > +		break;
> > +	case ENCODED_IOV_COMPRESSION_ZSTD:
> > +		compression = BTRFS_COMPRESS_ZSTD;
> > +		break;
> > +	default:
> > +		return -EINVAL;
> > +	}
> > +
> > +	disk_num_bytes = orig_count = iov_iter_count(from);
> > +
> > +	/* For now, it's too hard to support bookend extents. */
> > +	if (encoded->unencoded_len != encoded->len ||
> > +	    encoded->unencoded_offset != 0)
> > +		return -EINVAL;
> > +
> > +	/* The extent size must be sane. */
> > +	if (encoded->unencoded_len > BTRFS_MAX_UNCOMPRESSED ||
> > +	    disk_num_bytes > BTRFS_MAX_COMPRESSED || disk_num_bytes == 0)
> > +		return -EINVAL;
> > +
> > +	/*
> > +	 * The compressed data on disk must be sector-aligned. For convenience,
> > +	 * we extend it with zeroes if it isn't.
> > +	 */
> > +	disk_num_bytes = ALIGN(disk_num_bytes, fs_info->sectorsize);
> > +
> > +	/*
> > +	 * The extent in the file must also be sector-aligned. However, we allow
> > +	 * a write which ends at or extends i_size to have an unaligned length;
> > +	 * we round up the extent size and set i_size to the given length.
> > +	 */
> > +	start = iocb->ki_pos;
> > +	if (!IS_ALIGNED(start, fs_info->sectorsize))
> > +		return -EINVAL;
> > +	if (start + encoded->len >= inode->i_size) {
> > +		num_bytes = ALIGN(encoded->len, fs_info->sectorsize);
> > +	} else {
> > +		num_bytes = encoded->len;
> > +		if (!IS_ALIGNED(num_bytes, fs_info->sectorsize))
> > +			return -EINVAL;
> > +	}
> > +
> > +	/*
> > +	 * It's valid to have compressed data which is larger than or the same
> > +	 * size as the decompressed data. However, for buffered I/O, we fall
> > +	 * back to writing the decompressed data if compression didn't shrink
> > +	 * it. So, for now, let's not allow creating such extents.
> > +	 *
> > +	 * Note that for now this also implicitly prevents writing data that
> > +	 * would fit in an inline extent.
> > +	 */
> > +	if (disk_num_bytes >= num_bytes)
> > +		return -EINVAL;
> > +
> > +	end = start + num_bytes - 1;
> > +
> > +	nr_pages = (disk_num_bytes + PAGE_SIZE - 1) >> PAGE_SHIFT;
> 
> nit: nr_pages = DIV_ROUND_UP(disk_num_bytes, PAGE_SIZE)

disk_num_bytes is a u64, so that would expand to a 64-bit division. The
compiler is probably smart enough to optimize it to a shift, but I
didn't want to rely on that, because that would cause build failures on
32-bit.

> > +	pages = kvcalloc(nr_pages, sizeof(struct page *), GFP_USER);
> 
> This could be a simple GFP_KERNEL  allocation

I mixed up GFP_USER and GFP_KERNEL_ACCOUNT. I think we want
GFP_KERNEL_ACCOUNT rather than GFP_KERNEL here because the allocation is
triggered by a userspace request. (Obviously we're not very consistent
about that in Btrfs, but for new stuff we might as well be more careful
about it).

> > +	if (!pages)
> > +		return -ENOMEM;
> > +	for (i = 0; i < nr_pages; i++) {
> > +		size_t bytes = min_t(size_t, PAGE_SIZE, iov_iter_count(from));
> > +		char *kaddr;
> > +
> > +		pages[i] = alloc_page(GFP_HIGHUSER);
> 
> Why GFP_HIGHUSER? You are reading from userspace,  not writing to it. A
> plain, NOFS allocation should suffice (of course using the newer
> memalloc_nofs_save api)?

The __GFP_HIGHMEM bit is just to give the allocator more to work with
since we only ever access the pages with kmap() (we do the same thing
elsewhere in Btrfs). It doesn't need to be NOFS, but this should
probably be GFP_KERNEL_ACCOUNT | __GFP_HIGHMEM.

> > +		if (!pages[i]) {
> > +			ret = -ENOMEM;
> > +			goto out_pages;
> > +		}
> > +		kaddr = kmap(pages[i]);
> > +		if (copy_from_iter(kaddr, bytes, from) != bytes) {
> > +			kunmap(pages[i]);
> > +			ret = -EFAULT;
> > +			goto out_pages;
> > +		}
> > +		if (bytes < PAGE_SIZE)
> > +			memset(kaddr + bytes, 0, PAGE_SIZE - bytes);
> > +		kunmap(pages[i]);
> > +	}
> > +
> > +	for (;;) {
> > +		struct btrfs_ordered_extent *ordered;
> > +
> > +		ret = btrfs_wait_ordered_range(inode, start, end - start + 1);
> > +		if (ret)
> > +			goto out_pages;
> > +		ret = invalidate_inode_pages2_range(inode->i_mapping,
> > +						    start >> PAGE_SHIFT,
> > +						    end >> PAGE_SHIFT);
> > +		if (ret)
> > +			goto out_pages;
> > +		lock_extent_bits(io_tree, start, end, &cached_state);
> > +		ordered = btrfs_lookup_ordered_range(BTRFS_I(inode), start,
> > +						     end - start + 1);
> > +		if (!ordered &&
> > +		    !filemap_range_has_page(inode->i_mapping, start, end))
> > +			break;
> > +		if (ordered)
> > +			btrfs_put_ordered_extent(ordered);
> > +		unlock_extent_cached(io_tree, start, end, &cached_state);
> > +		cond_resched();
> > +	}
> > +
> > +	ret = btrfs_delalloc_reserve_space(inode, &data_reserved, start,
> > +					   num_bytes);
> > +	if (ret)
> > +		goto out_unlock;
> > +
> > +	ret = btrfs_reserve_extent(root, num_bytes, disk_num_bytes,
> > +				   disk_num_bytes, 0, 0, &ins, 1, 1);
> > +	if (ret)
> > +		goto out_delalloc_release;
> > +
> > +	em = create_io_em(inode, start, num_bytes, start, ins.objectid,
> > +			  ins.offset, ins.offset, num_bytes, compression,
> > +			  BTRFS_ORDERED_COMPRESSED);
> > +	if (IS_ERR(em)) {
> > +		ret = PTR_ERR(em);
> > +		goto out_free_reserve;
> > +	}
> > +	free_extent_map(em);
> > +
> > +	ret = btrfs_add_ordered_extent_compress(inode, start, ins.objectid,
> > +						num_bytes, ins.offset,
> > +						BTRFS_ORDERED_COMPRESSED,
> > +						compression);
> > +	if (ret) {
> > +		btrfs_drop_extent_cache(BTRFS_I(inode), start, end, 0);
> > +		goto out_free_reserve;
> > +	}
> > +	btrfs_dec_block_group_reservations(fs_info, ins.objectid);
> > +
> > +	if (start + encoded->len > inode->i_size)
> > +		i_size_write(inode, start + encoded->len);
> 
> Don't we want the inode size to be updated once data hits disk and
> btrfs_finish_ordered_io is called?

Yup, you're right, this is too early.

> > +
> > +	unlock_extent_cached(io_tree, start, end, &cached_state);
> > +
> > +	btrfs_delalloc_release_extents(BTRFS_I(inode), num_bytes, false);
> > +
> > +	if (btrfs_submit_compressed_write(inode, start, num_bytes, ins.objectid,
> > +					  ins.offset, pages, nr_pages, 0,
> > +					  false)) {
> > +		struct page *page = pages[0];
> > +
> > +		page->mapping = inode->i_mapping;
> > +		btrfs_writepage_endio_finish_ordered(page, start, end, 0);
> > +		page->mapping = NULL;
> > +		ret = -EIO;
> > +		goto out_pages;
> > +	}

I also need to wait for the I/O to finish here.

> > +	iocb->ki_pos += encoded->len;
> > +	return orig_count;
> > +
> > +out_free_reserve:
> > +	btrfs_dec_block_group_reservations(fs_info, ins.objectid);
> > +	btrfs_free_reserved_extent(fs_info, ins.objectid, ins.offset, 1);
> > +out_delalloc_release:
> > +	btrfs_delalloc_release_space(inode, data_reserved, start, num_bytes,
> > +				     true);
> > +out_unlock:
> > +	unlock_extent_cached(io_tree, start, end, &cached_state);
> > +out_pages:
> > +	for (i = 0; i < nr_pages; i++) {
> > +		if (pages[i])
> > +			put_page(pages[i]);
> > +	}
> > +	kvfree(pages);
> > +	return ret;
> > +}
> > +
> >  #ifdef CONFIG_SWAP
> >  /*
> >   * Add an entry indicating a block group or device which is pinned by a
> > 

^ permalink raw reply

* Re: [RFC PATCH v2 4/5] btrfs: implement RWF_ENCODED reads
From: Omar Sandoval @ 2019-10-18 22:23 UTC (permalink / raw)
  To: Nikolay Borisov
  Cc: linux-btrfs, linux-fsdevel, kernel-team, Dave Chinner, Jann Horn,
	linux-api
In-Reply-To: <0c0c3307-de6c-5df9-bbe1-5079cfc70480@suse.com>

On Wed, Oct 16, 2019 at 02:10:10PM +0300, Nikolay Borisov wrote:
> 
> 
> On 15.10.19 г. 21:42 ч., Omar Sandoval wrote:
> > From: Omar Sandoval <osandov@fb.com>
> > 
> > There are 4 main cases:
> > 
> > 1. Inline extents: we copy the data straight out of the extent buffer.
> > 2. Hole/preallocated extents: we indicate the size of the extent
> >    starting from the read position; we don't need to copy zeroes.
> > 3. Regular, uncompressed extents: we read the sectors we need directly
> >    from disk.
> > 4. Regular, compressed extents: we read the entire compressed extent
> >    from disk and indicate what subset of the decompressed extent is in
> >    the file.
> > 
> > This initial implementation simplifies a few things that can be improved
> > in the future:
> > 
> > - We hold the inode lock during the operation.
> > - Cases 1, 3, and 4 allocate temporary memory to read into before
> >   copying out to userspace.
> > - Cases 3 and 4 do not implement repair yet.
> > 
> > Signed-off-by: Omar Sandoval <osandov@fb.com>
> > ---
> >  fs/btrfs/ctree.h |   2 +
> >  fs/btrfs/file.c  |  12 +-
> >  fs/btrfs/inode.c | 462 +++++++++++++++++++++++++++++++++++++++++++++++
> >  3 files changed, 475 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > index 71552b2ca340..3b2aa1c7218c 100644
> > --- a/fs/btrfs/ctree.h
> > +++ b/fs/btrfs/ctree.h
> > @@ -2906,6 +2906,8 @@ int btrfs_run_delalloc_range(struct inode *inode, struct page *locked_page,
> >  int btrfs_writepage_cow_fixup(struct page *page, u64 start, u64 end);
> >  void btrfs_writepage_endio_finish_ordered(struct page *page, u64 start,
> >  					  u64 end, int uptodate);
> > +ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter);
> > +
> >  extern const struct dentry_operations btrfs_dentry_operations;
> >  
> >  /* ioctl.c */
> > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> > index 27e5b269e729..51740cee39fc 100644
> > --- a/fs/btrfs/file.c
> > +++ b/fs/btrfs/file.c
> > @@ -390,6 +390,16 @@ int btrfs_run_defrag_inodes(struct btrfs_fs_info *fs_info)
> >  	return 0;
> >  }
> >  
> > +static ssize_t btrfs_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
> > +{
> > +	if (iocb->ki_flags & IOCB_ENCODED) {
> > +		if (iocb->ki_flags & IOCB_NOWAIT)
> > +			return -EOPNOTSUPP;
> > +		return btrfs_encoded_read(iocb, iter);
> > +	}
> > +	return generic_file_read_iter(iocb, iter);
> > +}
> > +
> >  /* simple helper to fault in pages and copy.  This should go away
> >   * and be replaced with calls into generic code.
> >   */
> > @@ -3457,7 +3467,7 @@ static int btrfs_file_open(struct inode *inode, struct file *filp)
> >  
> >  const struct file_operations btrfs_file_operations = {
> >  	.llseek		= btrfs_file_llseek,
> > -	.read_iter      = generic_file_read_iter,
> > +	.read_iter      = btrfs_file_read_iter,
> >  	.splice_read	= generic_file_splice_read,
> >  	.write_iter	= btrfs_file_write_iter,
> >  	.mmap		= btrfs_file_mmap,
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index 8bce46122ef7..174d0738d2c9 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > @@ -10593,6 +10593,468 @@ void btrfs_set_range_writeback(struct extent_io_tree *tree, u64 start, u64 end)
> >  	}
> >  }
> >  
> > +static int encoded_iov_compression_from_btrfs(struct encoded_iov *encoded,
> > +					      unsigned int compress_type)
> > +{
> > +	switch (compress_type) {
> > +	case BTRFS_COMPRESS_NONE:
> > +		encoded->compression = ENCODED_IOV_COMPRESSION_NONE;
> > +		break;
> > +	case BTRFS_COMPRESS_ZLIB:
> > +		encoded->compression = ENCODED_IOV_COMPRESSION_ZLIB;
> > +		break;
> > +	case BTRFS_COMPRESS_LZO:
> > +		encoded->compression = ENCODED_IOV_COMPRESSION_LZO;
> > +		break;
> > +	case BTRFS_COMPRESS_ZSTD:
> > +		encoded->compression = ENCODED_IOV_COMPRESSION_ZSTD;
> > +		break;
> > +	default:
> > +		return -EIO;
> > +	}
> > +	return 0;
> > +}
> > +
> > +static ssize_t btrfs_encoded_read_inline(struct kiocb *iocb,
> > +					 struct iov_iter *iter, u64 start,
> > +					 u64 lockend,
> > +					 struct extent_state **cached_state,
> > +					 u64 extent_start, size_t count,
> > +					 struct encoded_iov *encoded,
> > +					 bool *unlocked)
> > +{
> > +	struct inode *inode = file_inode(iocb->ki_filp);
> > +	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
> > +	struct btrfs_path *path;
> > +	struct extent_buffer *leaf;
> > +	struct btrfs_file_extent_item *item;
> > +	u64 ram_bytes;
> > +	unsigned long ptr;
> > +	void *tmp;
> > +	ssize_t ret;
> > +
> > +	path = btrfs_alloc_path();
> > +	if (!path) {
> > +		ret = -ENOMEM;
> > +		goto out;
> > +	}
> > +	ret = btrfs_lookup_file_extent(NULL, BTRFS_I(inode)->root, path,
> > +				       btrfs_ino(BTRFS_I(inode)), extent_start,
> > +				       0);
> > +	if (ret) {
> > +		if (ret > 0) {
> > +			/* The extent item disappeared? */
> > +			ret = -EIO;
> > +		}
> > +		goto out;
> > +	}
> > +	leaf = path->nodes[0];
> > +	item = btrfs_item_ptr(leaf, path->slots[0],
> > +			      struct btrfs_file_extent_item);
> > +
> > +	ram_bytes = btrfs_file_extent_ram_bytes(leaf, item);
> > +	ptr = btrfs_file_extent_inline_start(item);
> > +
> > +	encoded->len = (min_t(u64, extent_start + ram_bytes, inode->i_size) -
> > +			iocb->ki_pos);
> > +	ret = encoded_iov_compression_from_btrfs(encoded,
> > +				 btrfs_file_extent_compression(leaf, item));
> > +	if (ret)
> > +		goto out;
> > +	if (encoded->compression) {
> > +		size_t inline_size;
> > +
> > +		inline_size = btrfs_file_extent_inline_item_len(leaf,
> > +						btrfs_item_nr(path->slots[0]));
> > +		if (inline_size > count) {
> > +			ret = -EFBIG;
> > +			goto out;
> > +		}
> > +		count = inline_size;
> > +		encoded->unencoded_len = ram_bytes;
> > +		encoded->unencoded_offset = iocb->ki_pos - extent_start;
> > +	} else {
> > +		encoded->len = encoded->unencoded_len = count =
> > +			min_t(u64, count, encoded->len);
> > +		ptr += iocb->ki_pos - extent_start;
> > +	}
> > +
> > +	tmp = kmalloc(count, GFP_NOFS);
> > +	if (!tmp) {
> > +		ret = -ENOMEM;
> > +		goto out;
> > +	}
> > +	read_extent_buffer(leaf, tmp, ptr, count);
> > +	btrfs_free_path(path);
> > +	path = NULL;
> > +	unlock_extent_cached(io_tree, start, lockend, cached_state);
> > +	inode_unlock(inode);
> > +	*unlocked = true;
> > +	if (copy_to_iter(encoded, sizeof(*encoded), iter) == sizeof(*encoded) &&
> > +	    copy_to_iter(tmp, count, iter) == count)
> > +		ret = count;
> > +	else
> > +		ret = -EFAULT;
> > +	kfree(tmp);
> > +
> > +out:
> > +	btrfs_free_path(path);
> > +	return ret;
> > +}
> > +
> > +struct btrfs_encoded_read_private {
> > +	struct inode *inode;
> > +	wait_queue_head_t wait;
> > +	atomic_t pending;
> > +	bool uptodate;
> > +	bool skip_csum;
> > +};
> > +
> > +static bool btrfs_encoded_read_check_csums(struct btrfs_io_bio *io_bio)
> > +{
> > +	struct btrfs_encoded_read_private *priv = io_bio->bio.bi_private;
> > +	struct inode *inode = priv->inode;
> > +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> > +	u32 sectorsize = fs_info->sectorsize;
> > +	struct bio_vec *bvec;
> > +	struct bvec_iter_all iter_all;
> > +	u64 offset = 0;
> > +
> > +	if (priv->skip_csum)
> > +		return true;
> > +	bio_for_each_segment_all(bvec, &io_bio->bio, iter_all) {
> > +		unsigned int i, nr_sectors, pgoff;
> > +
> > +		nr_sectors = BTRFS_BYTES_TO_BLKS(fs_info, bvec->bv_len);
> > +		pgoff = bvec->bv_offset;
> > +		for (i = 0; i < nr_sectors; i++) {
> > +			int csum_pos;
> > +
> > +			csum_pos = BTRFS_BYTES_TO_BLKS(fs_info, offset);
> > +			if (__readpage_endio_check(inode, io_bio, csum_pos,
> > +						   bvec->bv_page, pgoff,
> > +						   io_bio->logical + offset,
> > +						   sectorsize))
> > +				return false;
> > +			offset += sectorsize;
> > +			pgoff += sectorsize;
> > +		}
> > +	}
> > +	return true;
> > +}
> > +
> > +static void btrfs_encoded_read_endio(struct bio *bio)
> > +{
> > +	struct btrfs_encoded_read_private *priv = bio->bi_private;
> > +	struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
> > +
> > +	if (bio->bi_status || !btrfs_encoded_read_check_csums(io_bio))
> > +		priv->uptodate = false;
> > +	if (!atomic_dec_return(&priv->pending))
> > +		wake_up(&priv->wait);
> > +	btrfs_io_bio_free_csum(io_bio);
> > +	bio_put(bio);
> > +}
> > +
> > +static bool btrfs_submit_encoded_read(struct bio *bio)
> > +{
> > +	struct btrfs_encoded_read_private *priv = bio->bi_private;
> > +	struct inode *inode = priv->inode;
> > +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> > +	blk_status_t status;
> > +
> > +	atomic_inc(&priv->pending);
> > +
> > +	if (!priv->skip_csum) {
> > +		status = btrfs_lookup_bio_sums_at_offset(inode, bio,
> > +							 btrfs_io_bio(bio)->logical,
> > +							 NULL);
> > +		if (status)
> > +			goto out;
> > +	}
> > +
> > +	status = btrfs_bio_wq_end_io(fs_info, bio, BTRFS_WQ_ENDIO_DATA);
> > +	if (status)
> > +		goto out;
> > +
> > +	status = btrfs_map_bio(fs_info, bio, 0, 0);
> > +out:
> > +	if (status) {
> > +		bio->bi_status = status;
> > +		bio_endio(bio);
> > +		return false;
> > +	}
> > +	return true;
> > +}
> > +
> > +static ssize_t btrfs_encoded_read_regular(struct kiocb *iocb,
> > +					  struct iov_iter *iter,
> > +					  u64 start, u64 lockend,
> > +					  struct extent_state **cached_state,
> > +					  struct block_device *bdev,
> > +					  u64 offset, u64 disk_io_size,
> > +					  size_t count,
> > +					  const struct encoded_iov *encoded,
> > +					  bool *unlocked)
> > +{
> > +	struct inode *inode = file_inode(iocb->ki_filp);
> > +	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
> > +	struct btrfs_encoded_read_private priv = {
> > +		.inode = inode,
> > +		.wait = __WAIT_QUEUE_HEAD_INITIALIZER(priv.wait),
> > +		.pending = ATOMIC_INIT(1),
> > +		.uptodate = true,
> > +		.skip_csum = BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM,
> > +	};
> > +	struct page **pages;
> > +	unsigned long nr_pages, i;
> > +	struct bio *bio = NULL;
> > +	u64 cur;
> > +	size_t page_offset;
> > +	ssize_t ret;
> > +
> > +	nr_pages = (disk_io_size + PAGE_SIZE - 1) >> PAGE_SHIFT;
> > +	pages = kcalloc(nr_pages, sizeof(struct page *), GFP_NOFS);
> > +	if (!pages)
> > +		return -ENOMEM;
> > +	for (i = 0; i < nr_pages; i++) {
> > +		pages[i] = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
> > +		if (!pages[i]) {
> > +			ret = -ENOMEM;
> > +			goto out;
> > +		}
> > +	}
> > +
> > +	i = 0;
> > +	cur = 0;
> > +	while (cur < disk_io_size) {
> > +		size_t bytes = min_t(u64, disk_io_size - cur,
> > +				     PAGE_SIZE);
> > +
> > +		if (!bio) {
> > +			bio = btrfs_bio_alloc(offset + cur);
> > +			bio_set_dev(bio, bdev);
> > +			bio->bi_end_io = btrfs_encoded_read_endio;
> > +			bio->bi_private = &priv;
> > +			bio->bi_opf = REQ_OP_READ;
> > +			btrfs_io_bio(bio)->logical = start + cur;
> > +		}
> > +
> > +		if (bio_add_page(bio, pages[i], bytes, 0) < bytes) {
> > +			bool success;
> > +
> > +			success = btrfs_submit_encoded_read(bio);
> > +			bio = NULL;
> > +			if (!success)
> > +				break;
> > +			continue;
> > +		}
> > +		i++;
> > +		cur += bytes;
> > +	}
> > +
> > +	if (bio)
> > +		btrfs_submit_encoded_read(bio);
> > +	if (atomic_dec_return(&priv.pending))
> > +		wait_event(priv.wait, !atomic_read(&priv.pending));
> > +	if (!priv.uptodate) {
> > +		ret = -EIO;
> > +		goto out;
> > +	}
> > +
> > +	unlock_extent_cached(io_tree, start, lockend, cached_state);
> > +	inode_unlock(inode);
> > +	*unlocked = true;
> > +
> > +	if (copy_to_iter(encoded, sizeof(*encoded), iter) != sizeof(*encoded)) {
> > +		ret = -EFAULT;
> > +		goto out;
> > +	}
> > +	if (encoded->compression) {
> > +		i = 0;
> > +		page_offset = 0;
> > +	} else {
> > +		i = (iocb->ki_pos - start) >> PAGE_SHIFT;
> > +		page_offset = (iocb->ki_pos - start) & (PAGE_SIZE - 1);
> > +	}
> > +	cur = 0;
> > +	while (cur < count) {
> > +		size_t bytes = min_t(size_t, count - cur,
> > +				     PAGE_SIZE - page_offset);
> > +
> > +		if (copy_page_to_iter(pages[i], page_offset, bytes,
> > +				      iter) != bytes) {
> > +			ret = -EFAULT;
> > +			goto out;
> > +		}
> > +		i++;
> > +		cur += bytes;
> > +		page_offset = 0;
> > +	}
> > +	ret = count;
> > +out:
> > +	for (i = 0; i < nr_pages; i++) {
> > +		if (pages[i])
> > +			put_page(pages[i]);
> > +	}
> > +	kfree(pages);
> > +	return ret;
> > +}
> > +
> > +ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter)
> > +{
> > +	struct inode *inode = file_inode(iocb->ki_filp);
> > +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> > +	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
> > +	ssize_t ret;
> > +	size_t count;
> > +	struct block_device *em_bdev;
> > +	u64 start, lockend, offset, disk_io_size;
> > +	struct extent_state *cached_state = NULL;
> > +	struct extent_map *em;
> > +	struct encoded_iov encoded = {};
> > +	bool unlocked = false;
> > +
> > +	ret = check_encoded_read(iocb, iter);
> > +	if (ret < 0)
> > +		return ret;
> > +	if (ret == 0) {
> > +empty:
> > +		if (copy_to_iter(&encoded, sizeof(encoded), iter) ==
> > +		    sizeof(encoded))
> > +			return 0;
> > +		else
> > +			return -EFAULT;
> 
> nit: Just put the label at the end of the function since it's a simple
> error handler.

It's not really an error handler, it's a corner case. It doesn't seem
any nicer to have the corner case far away at the end of the function
and goto it from two places.

> > +	}
> > +	count = ret;
> > +
> > +	file_accessed(iocb->ki_filp);
> > +
> > +	inode_lock(inode);
> > +
> > +	if (iocb->ki_pos >= inode->i_size) {
> > +		inode_unlock(inode);
> > +		goto empty;
> 
> That way you won't have to jump backwards here ...
> 
> > +	}
> > +	start = ALIGN_DOWN(iocb->ki_pos, fs_info->sectorsize);
> > +	/*
> > +	 * We don't know how long the extent containing iocb->ki_pos is, but if
> > +	 * it's compressed we know that it won't be longer than this.
> > +	 */
> > +	lockend = start + BTRFS_MAX_UNCOMPRESSED - 1;
> > +
> > +	for (;;) {
> > +		struct btrfs_ordered_extent *ordered;
> > +
> > +		ret = btrfs_wait_ordered_range(inode, start,
> > +					       lockend - start + 1);
> > +		if (ret)
> > +			goto out_unlock_inode;
> > +		lock_extent_bits(io_tree, start, lockend, &cached_state);
> > +		ordered = btrfs_lookup_ordered_range(BTRFS_I(inode), start,
> > +						     lockend - start + 1);
> > +		if (!ordered)
> > +			break;
> > +		btrfs_put_ordered_extent(ordered);
> > +		unlock_extent_cached(io_tree, start, lockend, &cached_state);
> > +		cond_resched();
> > +	}
> > +
> > +	em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, start,
> > +			      lockend - start + 1, 0);
> > +	if (IS_ERR(em)) {
> > +		ret = PTR_ERR(em);
> > +		goto out_unlock_extent;
> > +	}
> > +	em_bdev = em->bdev;
> > +
> > +	if (em->block_start == EXTENT_MAP_INLINE) {
> > +		u64 extent_start = em->start;
> > +
> > +		/*
> > +		 * For inline extents we get everything we need out of the
> > +		 * extent item.
> > +		 */
> > +		free_extent_map(em);
> > +		em = NULL;
> > +		ret = btrfs_encoded_read_inline(iocb, iter, start, lockend,
> > +						&cached_state, extent_start,
> > +						count, &encoded, &unlocked);
> > +		goto out;
> > +	}
> > +
> > +	/*
> > +	 * We only want to return up to EOF even if the extent extends beyond
> > +	 * that.
> > +	 */
> > +	encoded.len = (min_t(u64, extent_map_end(em), inode->i_size) -
> > +		       iocb->ki_pos);
> > +	if (em->block_start == EXTENT_MAP_HOLE ||
> > +	    test_bit(EXTENT_FLAG_PREALLOC, &em->flags)) {
> > +		offset = EXTENT_MAP_HOLE;
> > +	} else if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) {
> > +		offset = em->block_start;
> > +		/*
> > +		 * Bail if the buffer isn't large enough to return the whole
> > +		 * compressed extent.
> > +		 */
> > +		if (em->block_len > count) {
> > +			ret = -EFBIG;
> > +			goto out_em;
> > +		}
> > +		disk_io_size = count = em->block_len;
> > +		encoded.unencoded_len = em->ram_bytes;
> > +		encoded.unencoded_offset = iocb->ki_pos - em->orig_start;
> > +		ret = encoded_iov_compression_from_btrfs(&encoded,
> > +							 em->compress_type);
> > +		if (ret)
> > +			goto out_em;
> > +	} else {
> > +		offset = em->block_start + (start - em->start);
> > +		if (encoded.len > count)
> > +			encoded.len = count;
> > +		/*
> > +		 * Don't read beyond what we locked. This also limits the page
> > +		 * allocations that we'll do.
> > +		 */
> > +		disk_io_size = min(lockend + 1, iocb->ki_pos + encoded.len) - start;
> > +		encoded.len = encoded.unencoded_len = count =
> > +			start + disk_io_size - iocb->ki_pos;
> > +		disk_io_size = ALIGN(disk_io_size, fs_info->sectorsize);
> > +	}
> > +	free_extent_map(em);
> > +	em = NULL;
> > +
> > +	if (offset == EXTENT_MAP_HOLE) {
> > +		unlock_extent_cached(io_tree, start, lockend, &cached_state);
> > +		inode_unlock(inode);
> > +		unlocked = true;
> > +		if (copy_to_iter(&encoded, sizeof(encoded), iter) ==
> > +		    sizeof(encoded))
> > +			ret = 0;
> > +		else
> > +			ret = -EFAULT;
> > +	} else {
> > +		ret = btrfs_encoded_read_regular(iocb, iter, start, lockend,
> > +						 &cached_state, em_bdev, offset,
> > +						 disk_io_size, count, &encoded,
> > +						 &unlocked);
> > +	}
> > +
> > +out:
> > +	if (ret >= 0)
> > +		iocb->ki_pos += encoded.len;
> > +out_em:
> > +	free_extent_map(em);
> > +out_unlock_extent:
> > +	if (!unlocked)
> > +		unlock_extent_cached(io_tree, start, lockend, &cached_state);
> > +out_unlock_inode:
> > +	if (!unlocked)
> > +		inode_unlock(inode);
> > +	return ret;
> > +}
> > +
> >  #ifdef CONFIG_SWAP
> >  /*
> >   * Add an entry indicating a block group or device which is pinned by a
> > 

^ permalink raw reply

* Re: [RFC PATCH v2 3/5] btrfs: generalize btrfs_lookup_bio_sums_dio()
From: Omar Sandoval @ 2019-10-18 22:19 UTC (permalink / raw)
  To: Nikolay Borisov
  Cc: linux-btrfs, linux-fsdevel, kernel-team, Dave Chinner, Jann Horn,
	linux-api
In-Reply-To: <aa49f032-f6be-8594-7c80-7101a0c6bcd0@suse.com>

On Wed, Oct 16, 2019 at 12:22:33PM +0300, Nikolay Borisov wrote:
> 
> 
> On 15.10.19 г. 21:42 ч., Omar Sandoval wrote:
> > From: Omar Sandoval <osandov@fb.com>
> > 
> > This isn't actually dio-specific; it just looks up the csums starting at
> > the given offset instead of using the page index. Rename it to
> > btrfs_lookup_bio_sums_at_offset() and add the dst parameter. We might
> > even want to expose __btrfs_lookup_bio_sums() as the public API instead
> > of having two trivial wrappers, but I'll leave that for another day.
> 
> IMO exposing btrfs_lookup_bio_sums and adding proper kernel doc for its
> parameters is the correct way forward. Consider doing this if the
> general direction of this patchset is accepted and before sending the
> final revision.

Ok, if I'm not the only one that thinks it's a good idea, I'll go ahead
with that.

^ permalink raw reply

* Re: [RFC PATCH v2 2/5] fs: add RWF_ENCODED for reading/writing compressed data
From: Omar Sandoval @ 2019-10-18 22:19 UTC (permalink / raw)
  To: Nikolay Borisov
  Cc: linux-btrfs, linux-fsdevel, kernel-team, Dave Chinner, Jann Horn,
	linux-api
In-Reply-To: <17ed54f3-40c2-2aea-ed9f-9c1307bdf806@suse.com>

On Wed, Oct 16, 2019 at 12:50:48PM +0300, Nikolay Borisov wrote:
> 
> 
> On 15.10.19 г. 21:42 ч., Omar Sandoval wrote:
> > From: Omar Sandoval <osandov@fb.com>
> > 
> > Btrfs supports transparent compression: data written by the user can be
> > compressed when written to disk and decompressed when read back.
> > However, we'd like to add an interface to write pre-compressed data
> > directly to the filesystem, and the matching interface to read
> > compressed data without decompressing it. This adds support for
> > so-called "encoded I/O" via preadv2() and pwritev2().
> > 
> > A new RWF_ENCODED flags indicates that a read or write is "encoded". If
> > this flag is set, iov[0].iov_base points to a struct encoded_iov which
> > is used for metadata: namely, the compression algorithm, unencoded
> > (i.e., decompressed) length, and what subrange of the unencoded data
> 
> In the future when encryption is also supported. What should be the
> mechanism to enforce ordering of encoding operations i.e. first compress
> then encrypt => uncoded_len should be the resulting size after the
> encrypt operation. To me (not being a cryptographer) this seems the
> sensible thing to do since compression will be effective that way.
> However, what if , for whatever reasons, a different filesystem wants to
> support this interface but chooses to do it the other around -> encrypt,
> then compress?

Compress-then-encrypt is the only sane way to do it (because properly
encrypted data is indistinguishable from random data, which doesn't
compress very well). When we add encryption support, we can add a note
to the man page.

If someone _really_ wants encrypt-then-compress, they can add another
encoding field, compression_after_encryption.

> > should be used (needed for truncated or hole-punched extents and when
> > reading in the middle of an extent). For reads, the filesystem returns
> > this information; for writes, the caller provides it to the filesystem.
> > iov[0].iov_len must be set to sizeof(struct encoded_iov), which can be
> > used to extend the interface in the future. The remaining iovecs contain
> > the encoded extent.
> > 
> > Filesystems must indicate that they support encoded writes by setting
> > FMODE_ENCODED_IO in ->file_open().
> > 
> > Signed-off-by: Omar Sandoval <osandov@fb.com>
> > ---
> >  include/linux/fs.h      | 14 +++++++
> >  include/uapi/linux/fs.h | 26 ++++++++++++-
> >  mm/filemap.c            | 82 ++++++++++++++++++++++++++++++++++-------
> >  3 files changed, 108 insertions(+), 14 deletions(-)
> > 
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index e0d909d35763..54681f21e05e 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -175,6 +175,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
> >  /* File does not contribute to nr_files count */
> >  #define FMODE_NOACCOUNT		((__force fmode_t)0x20000000)
> >  
> > +/* File supports encoded IO */
> > +#define FMODE_ENCODED_IO	((__force fmode_t)0x40000000)
> > +
> >  /*
> >   * Flag for rw_copy_check_uvector and compat_rw_copy_check_uvector
> >   * that indicates that they should check the contents of the iovec are
> > @@ -314,6 +317,7 @@ enum rw_hint {
> >  #define IOCB_SYNC		(1 << 5)
> >  #define IOCB_WRITE		(1 << 6)
> >  #define IOCB_NOWAIT		(1 << 7)
> > +#define IOCB_ENCODED		(1 << 8)
> >  
> >  struct kiocb {
> >  	struct file		*ki_filp;
> > @@ -3088,6 +3092,11 @@ extern int sb_min_blocksize(struct super_block *, int);
> >  extern int generic_file_mmap(struct file *, struct vm_area_struct *);
> >  extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *);
> >  extern ssize_t generic_write_checks(struct kiocb *, struct iov_iter *);
> > +struct encoded_iov;
> > +extern int generic_encoded_write_checks(struct kiocb *, struct encoded_iov *);
> > +extern ssize_t check_encoded_read(struct kiocb *, struct iov_iter *);
> > +extern int import_encoded_write(struct kiocb *, struct encoded_iov *,
> > +				struct iov_iter *);
> >  extern int generic_remap_checks(struct file *file_in, loff_t pos_in,
> >  				struct file *file_out, loff_t pos_out,
> >  				loff_t *count, unsigned int remap_flags);
> > @@ -3403,6 +3412,11 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags)
> >  			return -EOPNOTSUPP;
> >  		ki->ki_flags |= IOCB_NOWAIT;
> >  	}
> > +	if (flags & RWF_ENCODED) {
> > +		if (!(ki->ki_filp->f_mode & FMODE_ENCODED_IO))
> > +			return -EOPNOTSUPP;
> > +		ki->ki_flags |= IOCB_ENCODED;
> > +	}
> >  	if (flags & RWF_HIPRI)
> >  		ki->ki_flags |= IOCB_HIPRI;
> >  	if (flags & RWF_DSYNC)
> > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> > index 379a612f8f1d..ed92a8a257cb 100644
> > --- a/include/uapi/linux/fs.h
> > +++ b/include/uapi/linux/fs.h
> > @@ -284,6 +284,27 @@ struct fsxattr {
> >  
> >  typedef int __bitwise __kernel_rwf_t;
> >  
> > +enum {
> > +	ENCODED_IOV_COMPRESSION_NONE,
> > +	ENCODED_IOV_COMPRESSION_ZLIB,
> > +	ENCODED_IOV_COMPRESSION_LZO,
> > +	ENCODED_IOV_COMPRESSION_ZSTD,
> > +	ENCODED_IOV_COMPRESSION_TYPES = ENCODED_IOV_COMPRESSION_ZSTD,
> > +};
> > +
> > +enum {
> > +	ENCODED_IOV_ENCRYPTION_NONE,
> > +	ENCODED_IOV_ENCRYPTION_TYPES = ENCODED_IOV_ENCRYPTION_NONE,
> > +};
> > +
> > +struct encoded_iov {
> > +	__u64 len;
> > +	__u64 unencoded_len;
> > +	__u64 unencoded_offset;
> > +	__u32 compression;
> > +	__u32 encryption;
> > +};
> > +
> >  /* high priority request, poll if possible */
> >  #define RWF_HIPRI	((__force __kernel_rwf_t)0x00000001)
> >  
> > @@ -299,8 +320,11 @@ typedef int __bitwise __kernel_rwf_t;
> >  /* per-IO O_APPEND */
> >  #define RWF_APPEND	((__force __kernel_rwf_t)0x00000010)
> >  
> > +/* encoded (e.g., compressed or encrypted) IO */
> 
> nit: s/or/and\/or/ or both are exclusive?

Changed, thanks.

> > +#define RWF_ENCODED	((__force __kernel_rwf_t)0x00000020)
> > +
> >  /* mask of flags supported by the kernel */
> >  #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
> > -			 RWF_APPEND)
> > +			 RWF_APPEND | RWF_ENCODED)
> >  
> >  #endif /* _UAPI_LINUX_FS_H */
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index 1146fcfa3215..d2e6d9caf353 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -2948,24 +2948,15 @@ static int generic_write_check_limits(struct file *file, loff_t pos,
> >  	return 0;
> >  }
> >  
> > -/*
> > - * Performs necessary checks before doing a write
> > - *
> > - * Can adjust writing position or amount of bytes to write.
> > - * Returns appropriate error code that caller should return or
> > - * zero in case that write should be allowed.
> > - */
> > -inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
> > +static int generic_write_checks_common(struct kiocb *iocb, loff_t *count)
> >  {
> >  	struct file *file = iocb->ki_filp;
> >  	struct inode *inode = file->f_mapping->host;
> > -	loff_t count;
> > -	int ret;
> >  
> >  	if (IS_SWAPFILE(inode))
> >  		return -ETXTBSY;
> >  
> > -	if (!iov_iter_count(from))
> > +	if (!*count)
> >  		return 0;
> >  
> >  	/* FIXME: this is for backwards compatibility with 2.4 */
> > @@ -2975,8 +2966,21 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
> >  	if ((iocb->ki_flags & IOCB_NOWAIT) && !(iocb->ki_flags & IOCB_DIRECT))
> >  		return -EINVAL;
> >  
> > -	count = iov_iter_count(from);
> > -	ret = generic_write_check_limits(file, iocb->ki_pos, &count);
> > +	return generic_write_check_limits(iocb->ki_filp, iocb->ki_pos, count);
> > +}
> > +
> > +/*
> > + * Performs necessary checks before doing a write
> > + *
> > + * Can adjust writing position or amount of bytes to write.
> > + * Returns a negative errno or the new number of bytes to write.
> > + */
> > +inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
> > +{
> > +	loff_t count = iov_iter_count(from);
> > +	int ret;
> > +
> > +	ret = generic_write_checks_common(iocb, &count);
> >  	if (ret)
> >  		return ret;
> >  
> > @@ -2985,6 +2989,58 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
> >  }
> >  EXPORT_SYMBOL(generic_write_checks);
> >  
> > +int generic_encoded_write_checks(struct kiocb *iocb,
> > +				 struct encoded_iov *encoded)
> > +{
> > +	loff_t count = encoded->unencoded_len;
> > +	int ret;
> > +
> > +	ret = generic_write_checks_common(iocb, &count);
> 
> That's a bit confusing. You will only ever write encoded len bytes, yet
> you check the unencoded len. Presumably that's to ensure the data can be
> read back successfully? Still it feels a bit odd. IMO this warrants a
> comment.
> 
> When you use this function in patch 5 all the checks are performed
> against unencoded_len yet you do :
> 
> count = encoded.len;

Oops, this is supposed to check encoded->len. I forgot to update it when
I made the file length and unencoded length distinct fields. Good catch!

This needs to check the file length rather than the encoded length
because the checks in generic_write_check_limits() are concerned with
the file size (RLIMIT_FSIZE and s_maxbytes).

> > +	if (ret)
> > +		return ret;
> > +
> > +	if (count != encoded->unencoded_len) {
> > +		/*
> > +		 * The write got truncated by generic_write_checks_common(). We
> > +		 * can't do a partial encoded write.
> > +		 */
> > +		return -EFBIG;
> > +	}
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL(generic_encoded_write_checks);
> > +
> > +ssize_t check_encoded_read(struct kiocb *iocb, struct iov_iter *iter)
> > +{
> > +	if (!(iocb->ki_filp->f_flags & O_ENCODED))
> > +		return -EPERM;
> > +	if (iov_iter_single_seg_count(iter) != sizeof(struct encoded_iov))
> > +		return -EINVAL;
> > +	return iov_iter_count(iter) - sizeof(struct encoded_iov);
> > +}
> > +EXPORT_SYMBOL(check_encoded_read);
> > +
> > +int import_encoded_write(struct kiocb *iocb, struct encoded_iov *encoded,
> 
> nit: This might be just me but 'import' doesn't sound right, how about
> parse_encoded_write ?

The naming is borrowed from import_iovec(). IMO that's more descriptive
since we're not really parsing anything.

> > +			 struct iov_iter *from)
> > +{
> > +	if (!(iocb->ki_filp->f_flags & O_ENCODED))
> > +		return -EPERM;
> > +	if (iov_iter_single_seg_count(from) != sizeof(*encoded))
> > +		return -EINVAL;
> > +	if (copy_from_iter(encoded, sizeof(*encoded), from) != sizeof(*encoded))
> > +		return -EFAULT;
> > +	if (encoded->compression == ENCODED_IOV_COMPRESSION_NONE &&
> > +	    encoded->encryption == ENCODED_IOV_ENCRYPTION_NONE)
> > +		return -EINVAL;
> > +	if (encoded->compression > ENCODED_IOV_COMPRESSION_TYPES ||
> > +	    encoded->encryption > ENCODED_IOV_ENCRYPTION_TYPES)
> > +		return -EINVAL;
> > +	if (encoded->unencoded_offset >= encoded->unencoded_len)
> > +		return -EINVAL;
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL(import_encoded_write);
> > +
> >  /*
> >   * Performs necessary checks before doing a clone.
> >   *
> > 

^ permalink raw reply

* Re: [PATCH v3 1/5] pidfd: check pid has attached task in fdinfo
From: Christian Brauner @ 2019-10-18 15:05 UTC (permalink / raw)
  To: oleg, linux-kernel
  Cc: aarcange, akpm, christian, cyphar, elena.reshetova, guro, jannh,
	ldv, linux-api, linux-kselftest, mhocko, mingo, peterz, shuah,
	tglx, viro
In-Reply-To: <20191017101832.5985-1-christian.brauner@ubuntu.com>

On Thu, Oct 17, 2019 at 12:18:28PM +0200, Christian Brauner wrote:
> Currently, when a task is dead we still print the pid it used to use in
> the fdinfo files of its pidfds. This doesn't make much sense since the
> pid may have already been reused. So verify that the task is still alive
> by introducing the pid_has_task() helper which will be used by other
> callers in follow-up patches.
> If the task is not alive anymore, we will print -1. This allows us to
> differentiate between a task not being present in a given pid namespace
> - in which case we already print 0 - and a task having been reaped.
> 
> Note that this uses PIDTYPE_PID for the check. Technically, we could've
> checked PIDTYPE_TGID since pidfds currently only refer to thread-group
> leaders but if they won't anymore in the future then this check becomes
> problematic without it being immediately obvious to non-experts imho. If
> a thread is created via clone(CLONE_THREAD) than struct pid has a single
> non-empty list pid->tasks[PIDTYPE_PID] and this pid can't be used as a
> PIDTYPE_TGID meaning pid->tasks[PIDTYPE_TGID] will return NULL even
> though the thread-group leader might still be very much alive. So
> checking PIDTYPE_PID is fine and is easier to maintain should we ever
> allow pidfds to refer to threads.
> 
> Cc: Jann Horn <jannh@google.com>
> Cc: Christian Kellner <christian@kellner.me>
> Cc: linux-api@vger.kernel.org
> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
> Reviewed-by: Oleg Nesterov <oleg@redhat.com>

Applied this series to:
https://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux.git/log/?h=pidfd

Thanks!
Christian

^ permalink raw reply

* Re: [PATCHv7 00/33] kernel: Introduce Time Namespace
From: Andrei Vagin @ 2019-10-17 23:47 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Dmitry Safonov, linux-kernel, Dmitry Safonov, Adrian Reber,
	Andrei Vagin, Andy Lutomirski, Arnd Bergmann, Christian Brauner,
	Cyrill Gorcunov, Eric W. Biederman, H. Peter Anvin, Ingo Molnar,
	Jann Horn, Jeff Dike, Oleg Nesterov, Pavel Emelyanov, Shuah Khan,
	Vincenzo Frascino, containers, criu, linux-api, x86
In-Reply-To: <alpine.DEB.2.21.1910171122030.1824@nanos.tec.linutronix.de>

On Thu, Oct 17, 2019 at 11:24:45AM +0200, Thomas Gleixner wrote:
> On Fri, 11 Oct 2019, Dmitry Safonov wrote:
> > We wrote two small benchmarks. The first one gettime_perf.c calls
> > clock_gettime() in a loop for 3 seconds. It shows us performance with
> > a hot CPU cache (more clock_gettime() cycles - the better):
> > 
> >         | before    | CONFIG_TIME_NS=n | host      | inside timens
> > --------------------------------------------------------------
> >         | 153242367 | 153567617        | 150933203 | 139310914
> >         | 153324800 | 153115132        | 150919828 | 139299761
> >         | 153125401 | 153686868        | 150930471 | 139273917
> >         | 153399355 | 153694866        | 151083410 | 139286081
> >         | 153489417 | 153739716        | 150997262 | 139146403
> >         | 153494270 | 153724332        | 150035651 | 138835612
> > -----------------------------------------------------------
> > avg     | 153345935 | 153588088        | 150816637 | 139192114
> > diff %  | 100       | 100.1            | 98.3      | 90.7
> 
> 
> That host 98.3% number is weird and does not match the tests I did with the
> fallback code I provided you. On my limited testing that fallback hidden in
> the slowpath did not show any difference to the TIME_NS=n case when not
> inside a time namespace.

You did your experiments without a small optimization that we introduced
in the 18-th patch:

[PATCHv7 18/33] lib/vdso: Add unlikely() hint into vdso_read_begin()

When I did my measurements in the first time, I found that with this
timens change clock_gettime() shows a better performance when
CONFIG_TIME_NS isn't set. This looked weird for me, because I don't
expect to see this improvement. After analyzing a disassembled code of
vdso.so, I found that we can add the unlikely() hint into
vdso_read_begin() and this gives us 2% improvement of clock_gettime
performance on the upsteam kernel.

In my table, the "before" column is actually for the upstream kernel
with the 18-th patch. Here is the table with the real "before" column:

        | before    | with 18/33 | CONFIG_TIME_NS=n | host      | inside timens
------------------------------------------------------------------------------
avg     | 150331408 | 153345935  | 153588088        | 150816637 | 139192114
------------------------------------------------------------------------------
diff %  |       98  |      100   | 100.1            | 98.3      | 90.7
------------------------------------------------------------------------------
stdev % |       0.3 |     0.09   | 0.15             | 0.25      | 0.13

If we compare numbers in "before", "host" and "inside timens" columns, we
see the same results that you had. clock_gettime() works with the
same performance in the host namespace and 7% slower in a time
namespace.

Now let's look why we have these 2% degradation in the host time
namespace. For that, we cat look at disassembled code of do_hres:

Before:
   0:   55                      push   %rbp
   1:   48 63 f6                movslq %esi,%rsi
   4:   49 89 d1                mov    %rdx,%r9
   7:   49 89 c8                mov    %rcx,%r8
   a:   48 c1 e6 04             shl    $0x4,%rsi
   e:   48 01 fe                add    %rdi,%rsi
  11:   48 89 e5                mov    %rsp,%rbp
  14:   41 54                   push   %r12
  16:   53                      push   %rbx
  17:   44 8b 17                mov    (%rdi),%r10d
  1a:   41 f6 c2 01             test   $0x1,%r10b
  1e:   0f 85 fb 00 00 00       jne    11f <do_hres.isra.0+0x11f>
  24:   8b 47 04                mov    0x4(%rdi),%eax
  27:   83 f8 01                cmp    $0x1,%eax
  2a:   74 0f                   je     3b <do_hres.isra.0+0x3b>
  2c:   83 f8 02                cmp    $0x2,%eax
  2f:   74 72                   je     a3 <do_hres.isra.0+0xa3>
  31:   5b                      pop    %rbx
  32:   b8 ff ff ff ff          mov    $0xffffffff,%eax
  37:   41 5c                   pop    %r12
  39:   5d                      pop    %rbp
  3a:   c3                      retq
...

After:
   0:   55                      push   %rbp
   1:   4c 63 ce                movslq %esi,%r9
   4:   49 89 d0                mov    %rdx,%r8
   7:   49 c1 e1 04             shl    $0x4,%r9
   b:   49 01 f9                add    %rdi,%r9
   e:   48 89 e5                mov    %rsp,%rbp
  11:   41 56                   push   %r14
  13:   41 55                   push   %r13
  15:   41 54                   push   %r12
  17:   53                      push   %rbx
  18:   44 8b 17                mov    (%rdi),%r10d
  1b:   44 89 d0                mov    %r10d,%eax
  1e:   f7 d0                   not    %eax
  20:   83 e0 01                and    $0x1,%eax
  23:   89 c3                   mov    %eax,%ebx
  25:   0f 84 03 01 00 00       je     12e <do_hres+0x12e>
  2b:   8b 47 04                mov    0x4(%rdi),%eax
  2e:   83 f8 01                cmp    $0x1,%eax
  31:   74 13                   je     46 <do_hres+0x46>
  33:   83 f8 02                cmp    $0x2,%eax
  36:   74 7b                   je     b3 <do_hres+0xb3>
  38:   b8 ff ff ff ff          mov    $0xffffffff,%eax
  3d:   5b                      pop    %rbx
  3e:   41 5c                   pop    %r12
  40:   41 5d                   pop    %r13
  42:   41 5e                   pop    %r14
  44:   5d                      pop    %rbp
  45:   c3                      retq
...

So I think we see these 2% degradation in the host time namespace,
because we need to save to extra registers on stack. If we want to avoid
this degradation, we can mark do_hres_timens as noinline. In this case,
the disassembled code will be the same as before these changes:

  0000000000000160 <do_hres>:
  do_hres():
   160:   55                      push   %rbp
   161:   4c 63 ce                movslq %esi,%r9
   164:   49 89 d0                mov    %rdx,%r8
   167:   49 c1 e1 04             shl    $0x4,%r9
   16b:   49 01 f9                add    %rdi,%r9
   16e:   48 89 e5                mov    %rsp,%rbp
   171:   41 54                   push   %r12
   173:   53                      push   %rbx
   174:   44 8b 17                mov    (%rdi),%r10d
   177:   41 f6 c2 01             test   $0x1,%r10b
   17b:   0f 85 fc 00 00 00       jne    27d <do_hres+0x11d>
   181:   8b 47 04                mov    0x4(%rdi),%eax
   184:   83 f8 01                cmp    $0x1,%eax
   187:   74 0f                   je     198 <do_hres+0x38>
   189:   83 f8 02                cmp    $0x2,%eax
   18c:   74 73                   je     201 <do_hres+0xa1>
   18e:   5b                      pop    %rbx
   18f:   b8 ff ff ff ff          mov    $0xffffffff,%eax
   194:   41 5c                   pop    %r12
   196:   5d                      pop    %rbp
   197:   c3                      retq
...

But this change will affect the performance of clock_gettime in a time
namespace.

My experiments shows that with the noinline annotation for
do_hres_timens, clock_gettime will work with the same performance in the
host time namespace, but it will be slower on 11% in a time namespace.

Thomas, what do you think about this? Do we need to mark do_hres_timens
as noinline?

Thanks,
Andrei

^ permalink raw reply

* Re: [RFC PATCH 03/21] pipe: Use head and tail pointers for the ring, not cursor and length
From: David Howells @ 2019-10-17 10:53 UTC (permalink / raw)
  To: Rasmus Villemoes
  Cc: dhowells, torvalds, Casey Schaufler, Stephen Smalley,
	Greg Kroah-Hartman, nicolas.dichtel, raven, Christian Brauner,
	keyrings, linux-usb, linux-block, linux-security-module,
	linux-fsdevel, linux-api, linux-kernel
In-Reply-To: <b8799179-d389-8005-4f6d-845febc3bb23@rasmusvillemoes.dk>

Rasmus Villemoes <linux@rasmusvillemoes.dk> wrote:

> >  (6) The number of free slots in the ring is "(tail + pipe->ring_size) -
> >      head".
> 
> Seems an odd way of writing pipe->ring_size - (head - tail) ; i.e.
> obviously #free slots is #size minus #occupancy.

Perhaps so.  The way I was looking at it is the window into which things can
be written is tail...tail+ring_size; the number of free slots is the distance
from head to the end of the window.

Anyway, I now have a helper that does it your way.

> >  (7) The ring is full if "head >= (tail + pipe->ring_size)", which can also
> >      be written as "head - tail >= pipe->ring_size".
> >
> 
> No it cannot, it _must_ be written in the latter form.

Ah, you're right.  I have a helper now for that too.

> head-tail == pipe_size or head-tail >= pipe_size

In general, I'd prefer ">=" just in case tail gets in front of head.

Rasmus Villemoes <linux@rasmusvillemoes.dk> wrote:

> > Also split pipe->buffers into pipe->ring_size (which indicates the size of
> > the ring) and pipe->max_usage (which restricts the amount of ring that
> > write() is allowed to fill).  This allows for a pipe that is both writable
> > by the kernel notification facility and by userspace, allowing plenty of
> > ring space for notifications to be added whilst preventing userspace from
> > being able to use up too much buffer space.
> 
> That seems like something that should be added in a separate patch -
> adding ->max_usage and switching appropriate users of ->ring_size over,
> so it's more clear where you're using one or the other.

Okay.

> > +		ibuf = &pipe->bufs[tail];
> 
> I don't see where tail gets masked between tail = pipe->tail;

Yeah - I missed that one.

> In any case, how about seeding head and tail with something like 1<<20 when
> creating the pipe so bugs like that are hit more quickly.

That's sounds like a good idea.

> > +			while (tail < head) {
> > +				count += pipe->bufs[tail & mask].len;
> > +				tail++;
> >  			}
> 
> This is broken if head has wrapped but tail has not. It has to be "while
> (head - tail)" or perhaps just "while (tail != head)" or something along
> those lines.

Yeah...  It's just too easy to overlook this and use ordinary comparisons.
I've switched to "while (tail != head)".

> > +	mask = pipe->ring_size - 1;
> > +	head = pipe->head & mask;
> > +	tail = pipe->tail & mask;
> > +	n = pipe->head - pipe->tail;
> 
> I think it's confusing to "premask" head and tail here. Can you either
> drop that (pipe_set_size should hardly be a hot path?), or perhaps call
> them something else to avoid a future reader seeing an unmasked
> bufs[head] and thinking that's a bug?

I've made it now do the masking right before doing the memcpy calls and used
different variable names for it:

	if (n > 0) {
		unsigned int h = head & mask;
		unsigned int t = tail & mask;
		if (h > t) {
			memcpy(bufs, &pipe->bufs + t,
			       n * sizeof(struct pipe_buffer));
		} else {
			unsigned int tsize = pipe->ring_size - t;
			if (h > 0)
				memcpy(bufs + tsize, pipe->bufs,
				       h * sizeof(struct pipe_buffer));
			memcpy(bufs, pipe->bufs + t,
			       tsize * sizeof(struct pipe_buffer));
		}

> > -	data_start(i, &idx, start);
> > -	/* some of this one + all after this one */
> > -	npages = ((i->pipe->curbuf - idx - 1) & (i->pipe->buffers - 1)) + 1;
> > -	capacity = min(npages,maxpages) * PAGE_SIZE - *start;
> > +	data_start(i, &i_head, start);
> > +	p_tail = i->pipe->tail;
> > +	/* Amount of free space: some of this one + all after this one */
> > +	npages = (p_tail + i->pipe->ring_size) - i_head;
> 
> Hm, it's not clear that this is equivalent to the old computation. Since
> it seems repeated in a few places, could it be factored to a little
> helper (before this patch) and the "some of this one + all after this
> one" comment perhaps expanded to explain what is going on?

Yeah...  It's a bit weird, even before my changes.

However, looking at it again, it seems data_start() does the appropriate
calculations.  If there's space in the current head buffer, it returns the
offset to that and the head of that buffer, otherwise it advances the head
pointer and sets the offset to 0.

So I think the comment may actually be retrospective - referring to the state
that data_start() has given us, rather than talking about the next bit of
code.

I also wonder if pipe_get_pages_alloc() is broken.  It doesn't check to see
whether the buffer is full at the point it calls data_start().  However,
data_start() doesn't check either and, without this patch, will simply advance
and mask off the ring index - which may wrap.

The maths in the unpatched version is pretty icky and I'm not convinced it's
correct.

David

^ permalink raw reply

* [PATCH v3 5/5] pid: use pid_has_task() in pidfd_open()
From: Christian Brauner @ 2019-10-17 10:18 UTC (permalink / raw)
  To: oleg, linux-kernel
  Cc: aarcange, akpm, christian, cyphar, elena.reshetova, guro, jannh,
	ldv, linux-api, linux-kselftest, mhocko, mingo, peterz, shuah,
	tglx, viro, Christian Brauner
In-Reply-To: <20191017101832.5985-1-christian.brauner@ubuntu.com>

Use the new pid_has_task() helper in pidfd_open(). This simplifies the
code and avoids taking rcu_read_{lock,unlock}() and leads to overall
nicer code.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
---
/* pidfd selftests */
passed

/* v1 */
patch not present

/* v2 */
Link: https://lore.kernel.org/r/20191016153606.2326-5-christian.brauner@ubuntu.com
patch introduced

/* v3 */
- Oleg Nesterov <oleg@redhat.com>:
  - s/task_alive/pid_has_task/
---
 kernel/pid.c | 12 +++++-------
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/kernel/pid.c b/kernel/pid.c
index 124d40b239b1..7b5f6c963d72 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -497,7 +497,7 @@ static int pidfd_create(struct pid *pid)
  */
 SYSCALL_DEFINE2(pidfd_open, pid_t, pid, unsigned int, flags)
 {
-	int fd, ret;
+	int fd;
 	struct pid *p;
 
 	if (flags)
@@ -510,13 +510,11 @@ SYSCALL_DEFINE2(pidfd_open, pid_t, pid, unsigned int, flags)
 	if (!p)
 		return -ESRCH;
 
-	ret = 0;
-	rcu_read_lock();
-	if (!pid_task(p, PIDTYPE_TGID))
-		ret = -EINVAL;
-	rcu_read_unlock();
+	if (pid_has_task(p, PIDTYPE_TGID))
+		fd = pidfd_create(p);
+	else
+		fd = -EINVAL;
 
-	fd = ret ?: pidfd_create(p);
 	put_pid(p);
 	return fd;
 }
-- 
2.23.0

^ permalink raw reply related

* [PATCH v3 4/5] exit: use pid_has_task() in do_wait()
From: Christian Brauner @ 2019-10-17 10:18 UTC (permalink / raw)
  To: oleg, linux-kernel
  Cc: aarcange, akpm, christian, cyphar, elena.reshetova, guro, jannh,
	ldv, linux-api, linux-kselftest, mhocko, mingo, peterz, shuah,
	tglx, viro, Christian Brauner
In-Reply-To: <20191017101832.5985-1-christian.brauner@ubuntu.com>

Replace hlist_empty() with the new pid_has_task() helper which is more
idiomatic, easier to grep for, and unifies how callers perform this check.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
---
/* pidfd selftests */
passed

/* v1 */
patch not present

/* v2 */
Link: https://lore.kernel.org/r/20191016153606.2326-4-christian.brauner@ubuntu.com
patch introduced

/* v3 */
- Oleg Nesterov <oleg@redhat.com>:
  - s/task_alive/pid_has_task/
---
 kernel/exit.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index a46a50d67002..f2d20ab74422 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -1457,7 +1457,7 @@ static long do_wait(struct wait_opts *wo)
 	 */
 	wo->notask_error = -ECHILD;
 	if ((wo->wo_type < PIDTYPE_MAX) &&
-	   (!wo->wo_pid || hlist_empty(&wo->wo_pid->tasks[wo->wo_type])))
+	   (!wo->wo_pid || !pid_has_task(wo->wo_pid, wo->wo_type)))
 		goto notask;
 
 	set_current_state(TASK_INTERRUPTIBLE);
-- 
2.23.0

^ permalink raw reply related

* [PATCH v3 3/5] pid: use pid_has_task() in __change_pid()
From: Christian Brauner @ 2019-10-17 10:18 UTC (permalink / raw)
  To: oleg, linux-kernel
  Cc: aarcange, akpm, christian, cyphar, elena.reshetova, guro, jannh,
	ldv, linux-api, linux-kselftest, mhocko, mingo, peterz, shuah,
	tglx, viro, Christian Brauner
In-Reply-To: <20191017101832.5985-1-christian.brauner@ubuntu.com>

Replace hlist_empty() with the new pid_has_task() helper which is more
idiomatic, easier to grep for, and unifies how callers perform this
check.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
---
/* pidfd selftests */
passed

/* v1 */
patch not present

/* v2 */
Link: https://lore.kernel.org/r/20191016153606.2326-3-christian.brauner@ubuntu.com
patch introduced

/* v2 */
- Oleg Nesterov <oleg@redhat.com>:
  - s/task_alive/pid_has_task/g
---
 kernel/pid.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/pid.c b/kernel/pid.c
index 0a9f2e437217..124d40b239b1 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -299,7 +299,7 @@ static void __change_pid(struct task_struct *task, enum pid_type type,
 	*pid_ptr = new;
 
 	for (tmp = PIDTYPE_MAX; --tmp >= 0; )
-		if (!hlist_empty(&pid->tasks[tmp]))
+		if (pid_has_task(pid, tmp))
 			return;
 
 	free_pid(pid);
-- 
2.23.0

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox