Linux userland API discussions

Linux userland API discussions
 help / color / mirror / Atom feed

* [RFC PATCH 04/10] pipe: Use head and tail pointers for the ring, not cursor and length [ver #2]
From: David Howells @ 2019-10-23 20:17 UTC (permalink / raw)
  To: torvalds
  Cc: dhowells, Rasmus Villemoes, Greg Kroah-Hartman, Peter Zijlstra,
	nicolas.dichtel, raven, Christian Brauner
In-Reply-To: <157186182463.3995.13922458878706311997.stgit@warthog.procyon.org.uk>

Convert pipes to use head and tail pointers for the buffer ring rather than
pointer and length as the latter requires two atomic ops to update (or a
combined op) whereas the former only requires one.

 (1) The head pointer is the point at which production occurs and points to
     the slot in which the next buffer will be placed.  This is equivalent
     to pipe->curbuf + pipe->nrbufs.

     The head pointer belongs to the write-side.

 (2) The tail pointer is the point at which consumption occurs.  It points
     to the next slot to be consumed.  This is equivalent to pipe->curbuf.

     The tail pointer belongs to the read-side.

 (3) head and tail are allowed to run to UINT_MAX and wrap naturally.  They
     are only masked off when the array is being accessed, e.g.:

	pipe->bufs[head & mask]

     This means that it is not necessary to have a dead slot in the ring as
     head == tail isn't ambiguous.

 (4) The ring is empty if "head == tail".

     A helper, pipe_empty(), is provided for this.

 (5) The occupancy of the ring is "head - tail".

     A helper, pipe_occupancy(), is provided for this.

 (6) The number of free slots in the ring is "pipe->ring_size - occupancy".

     A helper, pipe_space_for_user() is provided to indicate how many slots
     userspace may use.

 (7) The ring is full if "head - tail >= pipe->ring_size".

     A helper, pipe_full(), is provided for this.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/fuse/dev.c             |   31 +++--
 fs/pipe.c                 |  169 ++++++++++++++++-------------
 fs/splice.c               |  188 ++++++++++++++++++++------------
 include/linux/pipe_fs_i.h |   86 ++++++++++++++-
 include/linux/uio.h       |    4 -
 lib/iov_iter.c            |  266 +++++++++++++++++++++++++--------------------
 6 files changed, 464 insertions(+), 280 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index dadd617d826c..1e4bc27573cc 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -703,7 +703,7 @@ static int fuse_copy_fill(struct fuse_copy_state *cs)
 			cs->pipebufs++;
 			cs->nr_segs--;
 		} else {
-			if (cs->nr_segs == cs->pipe->buffers)
+			if (cs->nr_segs >= cs->pipe->ring_size)
 				return -EIO;
 
 			page = alloc_page(GFP_HIGHUSER);
@@ -879,7 +879,7 @@ static int fuse_ref_page(struct fuse_copy_state *cs, struct page *page,
 	struct pipe_buffer *buf;
 	int err;
 
-	if (cs->nr_segs == cs->pipe->buffers)
+	if (cs->nr_segs >= cs->pipe->ring_size)
 		return -EIO;
 
 	err = unlock_request(cs->req);
@@ -1341,7 +1341,7 @@ static ssize_t fuse_dev_splice_read(struct file *in, loff_t *ppos,
 	if (!fud)
 		return -EPERM;
 
-	bufs = kvmalloc_array(pipe->buffers, sizeof(struct pipe_buffer),
+	bufs = kvmalloc_array(pipe->ring_size, sizeof(struct pipe_buffer),
 			      GFP_KERNEL);
 	if (!bufs)
 		return -ENOMEM;
@@ -1353,7 +1353,7 @@ static ssize_t fuse_dev_splice_read(struct file *in, loff_t *ppos,
 	if (ret < 0)
 		goto out;
 
-	if (pipe->nrbufs + cs.nr_segs > pipe->buffers) {
+	if (pipe_occupancy(pipe->head, pipe->tail) + cs.nr_segs > pipe->ring_size) {
 		ret = -EIO;
 		goto out;
 	}
@@ -1935,6 +1935,7 @@ static ssize_t fuse_dev_splice_write(struct pipe_inode_info *pipe,
 				     struct file *out, loff_t *ppos,
 				     size_t len, unsigned int flags)
 {
+	unsigned int head, tail, mask, count;
 	unsigned nbuf;
 	unsigned idx;
 	struct pipe_buffer *bufs;
@@ -1949,8 +1950,12 @@ static ssize_t fuse_dev_splice_write(struct pipe_inode_info *pipe,
 
 	pipe_lock(pipe);
 
-	bufs = kvmalloc_array(pipe->nrbufs, sizeof(struct pipe_buffer),
-			      GFP_KERNEL);
+	head = pipe->head;
+	tail = pipe->tail;
+	mask = pipe->ring_size - 1;
+	count = head - tail;
+
+	bufs = kvmalloc_array(count, sizeof(struct pipe_buffer), GFP_KERNEL);
 	if (!bufs) {
 		pipe_unlock(pipe);
 		return -ENOMEM;
@@ -1958,8 +1963,8 @@ static ssize_t fuse_dev_splice_write(struct pipe_inode_info *pipe,
 
 	nbuf = 0;
 	rem = 0;
-	for (idx = 0; idx < pipe->nrbufs && rem < len; idx++)
-		rem += pipe->bufs[(pipe->curbuf + idx) & (pipe->buffers - 1)].len;
+	for (idx = tail; idx < head && rem < len; idx++)
+		rem += pipe->bufs[idx & mask].len;
 
 	ret = -EINVAL;
 	if (rem < len)
@@ -1970,16 +1975,16 @@ static ssize_t fuse_dev_splice_write(struct pipe_inode_info *pipe,
 		struct pipe_buffer *ibuf;
 		struct pipe_buffer *obuf;
 
-		BUG_ON(nbuf >= pipe->buffers);
-		BUG_ON(!pipe->nrbufs);
-		ibuf = &pipe->bufs[pipe->curbuf];
+		BUG_ON(nbuf >= pipe->ring_size);
+		BUG_ON(tail == head);
+		ibuf = &pipe->bufs[tail & mask];
 		obuf = &bufs[nbuf];
 
 		if (rem >= ibuf->len) {
 			*obuf = *ibuf;
 			ibuf->ops = NULL;
-			pipe->curbuf = (pipe->curbuf + 1) & (pipe->buffers - 1);
-			pipe->nrbufs--;
+			tail++;
+			pipe_commit_read(pipe, tail);
 		} else {
 			if (!pipe_buf_get(pipe, ibuf))
 				goto out_free;
diff --git a/fs/pipe.c b/fs/pipe.c
index 8a2ab2f974bd..8a0806fe12d3 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -43,10 +43,11 @@ unsigned long pipe_user_pages_hard;
 unsigned long pipe_user_pages_soft = PIPE_DEF_BUFFERS * INR_OPEN_CUR;
 
 /*
- * We use a start+len construction, which provides full use of the 
- * allocated memory.
- * -- Florian Coosmann (FGC)
- * 
+ * We use head and tail indices that aren't masked off, except at the point of
+ * dereference, but rather they're allowed to wrap naturally.  This means there
+ * isn't a dead spot in the buffer, provided the ring size < INT_MAX.
+ * -- David Howells 2019-09-23.
+ *
  * Reads with count = 0 should always return 0.
  * -- Julian Bradfield 1999-06-07.
  *
@@ -285,10 +286,12 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
 	ret = 0;
 	__pipe_lock(pipe);
 	for (;;) {
-		int bufs = pipe->nrbufs;
-		if (bufs) {
-			int curbuf = pipe->curbuf;
-			struct pipe_buffer *buf = pipe->bufs + curbuf;
+		unsigned int head = pipe->head;
+		unsigned int tail = pipe->tail;
+		unsigned int mask = pipe->ring_size - 1;
+
+		if (!pipe_empty(head, tail)) {
+			struct pipe_buffer *buf = &pipe->bufs[tail & mask];
 			size_t chars = buf->len;
 			size_t written;
 			int error;
@@ -321,17 +324,17 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
 
 			if (!buf->len) {
 				pipe_buf_release(pipe, buf);
-				curbuf = (curbuf + 1) & (pipe->buffers - 1);
-				pipe->curbuf = curbuf;
-				pipe->nrbufs = --bufs;
+				tail++;
+				pipe_commit_read(pipe, tail);
 				do_wakeup = 1;
 			}
 			total_len -= chars;
 			if (!total_len)
 				break;	/* common path: read succeeded */
+			if (!pipe_empty(head, tail))	/* More to do? */
+				continue;
 		}
-		if (bufs)	/* More to do? */
-			continue;
+
 		if (!pipe->writers)
 			break;
 		if (!pipe->waiting_writers) {
@@ -380,6 +383,7 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from)
 {
 	struct file *filp = iocb->ki_filp;
 	struct pipe_inode_info *pipe = filp->private_data;
+	unsigned int head, tail, max_usage, mask;
 	ssize_t ret = 0;
 	int do_wakeup = 0;
 	size_t total_len = iov_iter_count(from);
@@ -397,12 +401,15 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from)
 		goto out;
 	}
 
+	tail = pipe->tail;
+	head = pipe->head;
+	max_usage = pipe->ring_size;
+	mask = pipe->ring_size - 1;
+
 	/* We try to merge small writes */
 	chars = total_len & (PAGE_SIZE-1); /* size of the last buffer */
-	if (pipe->nrbufs && chars != 0) {
-		int lastbuf = (pipe->curbuf + pipe->nrbufs - 1) &
-							(pipe->buffers - 1);
-		struct pipe_buffer *buf = pipe->bufs + lastbuf;
+	if (!pipe_empty(head, tail) && chars != 0) {
+		struct pipe_buffer *buf = &pipe->bufs[(head - 1) & mask];
 		int offset = buf->offset + buf->len;
 
 		if (pipe_buf_can_merge(buf) && offset + chars <= PAGE_SIZE) {
@@ -423,18 +430,16 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from)
 	}
 
 	for (;;) {
-		int bufs;
-
 		if (!pipe->readers) {
 			send_sig(SIGPIPE, current, 0);
 			if (!ret)
 				ret = -EPIPE;
 			break;
 		}
-		bufs = pipe->nrbufs;
-		if (bufs < pipe->buffers) {
-			int newbuf = (pipe->curbuf + bufs) & (pipe->buffers-1);
-			struct pipe_buffer *buf = pipe->bufs + newbuf;
+
+		tail = pipe->tail;
+		if (!pipe_full(head, tail, max_usage)) {
+			struct pipe_buffer *buf = &pipe->bufs[head & mask];
 			struct page *page = pipe->tmp_page;
 			int copied;
 
@@ -470,14 +475,19 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from)
 				buf->ops = &packet_pipe_buf_ops;
 				buf->flags = PIPE_BUF_FLAG_PACKET;
 			}
-			pipe->nrbufs = ++bufs;
+
+			head++;
+			pipe_commit_write(pipe, head);
 			pipe->tmp_page = NULL;
 
 			if (!iov_iter_count(from))
 				break;
 		}
-		if (bufs < pipe->buffers)
+
+		if (!pipe_full(head, tail, max_usage))
 			continue;
+
+		/* Wait for buffer space to become available. */
 		if (filp->f_flags & O_NONBLOCK) {
 			if (!ret)
 				ret = -EAGAIN;
@@ -515,17 +525,19 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from)
 static long pipe_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 {
 	struct pipe_inode_info *pipe = filp->private_data;
-	int count, buf, nrbufs;
+	int count, head, tail, mask;
 
 	switch (cmd) {
 		case FIONREAD:
 			__pipe_lock(pipe);
 			count = 0;
-			buf = pipe->curbuf;
-			nrbufs = pipe->nrbufs;
-			while (--nrbufs >= 0) {
-				count += pipe->bufs[buf].len;
-				buf = (buf+1) & (pipe->buffers - 1);
+			head = pipe->head;
+			tail = pipe->tail;
+			mask = pipe->ring_size - 1;
+
+			while (tail != head) {
+				count += pipe->bufs[tail & mask].len;
+				tail++;
 			}
 			__pipe_unlock(pipe);
 
@@ -541,21 +553,25 @@ pipe_poll(struct file *filp, poll_table *wait)
 {
 	__poll_t mask;
 	struct pipe_inode_info *pipe = filp->private_data;
-	int nrbufs;
+	unsigned int head = READ_ONCE(pipe->head);
+	unsigned int tail = READ_ONCE(pipe->tail);
 
 	poll_wait(filp, &pipe->wait, wait);
 
+	BUG_ON(pipe_occupancy(head, tail) > pipe->ring_size);
+
 	/* Reading only -- no need for acquiring the semaphore.  */
-	nrbufs = pipe->nrbufs;
 	mask = 0;
 	if (filp->f_mode & FMODE_READ) {
-		mask = (nrbufs > 0) ? EPOLLIN | EPOLLRDNORM : 0;
+		if (!pipe_empty(head, tail))
+			mask |= EPOLLIN | EPOLLRDNORM;
 		if (!pipe->writers && filp->f_version != pipe->w_counter)
 			mask |= EPOLLHUP;
 	}
 
 	if (filp->f_mode & FMODE_WRITE) {
-		mask |= (nrbufs < pipe->buffers) ? EPOLLOUT | EPOLLWRNORM : 0;
+		if (!pipe_full(head, tail, pipe->ring_size))
+			mask |= EPOLLOUT | EPOLLWRNORM;
 		/*
 		 * Most Unices do not set EPOLLERR for FIFOs but on Linux they
 		 * behave exactly like pipes for poll().
@@ -679,7 +695,7 @@ struct pipe_inode_info *alloc_pipe_info(void)
 	if (pipe->bufs) {
 		init_waitqueue_head(&pipe->wait);
 		pipe->r_counter = pipe->w_counter = 1;
-		pipe->buffers = pipe_bufs;
+		pipe->ring_size = pipe_bufs;
 		pipe->user = user;
 		mutex_init(&pipe->mutex);
 		return pipe;
@@ -697,9 +713,9 @@ void free_pipe_info(struct pipe_inode_info *pipe)
 {
 	int i;
 
-	(void) account_pipe_buffers(pipe->user, pipe->buffers, 0);
+	(void) account_pipe_buffers(pipe->user, pipe->ring_size, 0);
 	free_uid(pipe->user);
-	for (i = 0; i < pipe->buffers; i++) {
+	for (i = 0; i < pipe->ring_size; i++) {
 		struct pipe_buffer *buf = pipe->bufs + i;
 		if (buf->ops)
 			pipe_buf_release(pipe, buf);
@@ -880,7 +896,7 @@ SYSCALL_DEFINE1(pipe, int __user *, fildes)
 
 static int wait_for_partner(struct pipe_inode_info *pipe, unsigned int *cnt)
 {
-	int cur = *cnt;	
+	int cur = *cnt;
 
 	while (cur == *cnt) {
 		pipe_wait(pipe);
@@ -955,7 +971,7 @@ static int fifo_open(struct inode *inode, struct file *filp)
 			}
 		}
 		break;
-	
+
 	case FMODE_WRITE:
 	/*
 	 *  O_WRONLY
@@ -975,7 +991,7 @@ static int fifo_open(struct inode *inode, struct file *filp)
 				goto err_wr;
 		}
 		break;
-	
+
 	case FMODE_READ | FMODE_WRITE:
 	/*
 	 *  O_RDWR
@@ -1054,14 +1070,14 @@ unsigned int round_pipe_size(unsigned long size)
 static long pipe_set_size(struct pipe_inode_info *pipe, unsigned long arg)
 {
 	struct pipe_buffer *bufs;
-	unsigned int size, nr_pages;
+	unsigned int size, nr_slots, head, tail, mask, n;
 	unsigned long user_bufs;
 	long ret = 0;
 
 	size = round_pipe_size(arg);
-	nr_pages = size >> PAGE_SHIFT;
+	nr_slots = size >> PAGE_SHIFT;
 
-	if (!nr_pages)
+	if (!nr_slots)
 		return -EINVAL;
 
 	/*
@@ -1071,13 +1087,13 @@ static long pipe_set_size(struct pipe_inode_info *pipe, unsigned long arg)
 	 * Decreasing the pipe capacity is always permitted, even
 	 * if the user is currently over a limit.
 	 */
-	if (nr_pages > pipe->buffers &&
+	if (nr_slots > pipe->ring_size &&
 			size > pipe_max_size && !capable(CAP_SYS_RESOURCE))
 		return -EPERM;
 
-	user_bufs = account_pipe_buffers(pipe->user, pipe->buffers, nr_pages);
+	user_bufs = account_pipe_buffers(pipe->user, pipe->ring_size, nr_slots);
 
-	if (nr_pages > pipe->buffers &&
+	if (nr_slots > pipe->ring_size &&
 			(too_many_pipe_buffers_hard(user_bufs) ||
 			 too_many_pipe_buffers_soft(user_bufs)) &&
 			is_unprivileged_user()) {
@@ -1086,17 +1102,21 @@ static long pipe_set_size(struct pipe_inode_info *pipe, unsigned long arg)
 	}
 
 	/*
-	 * We can shrink the pipe, if arg >= pipe->nrbufs. Since we don't
-	 * expect a lot of shrink+grow operations, just free and allocate
-	 * again like we would do for growing. If the pipe currently
+	 * We can shrink the pipe, if arg is greater than the ring occupancy.
+	 * Since we don't expect a lot of shrink+grow operations, just free and
+	 * allocate again like we would do for growing.  If the pipe currently
 	 * contains more buffers than arg, then return busy.
 	 */
-	if (nr_pages < pipe->nrbufs) {
+	mask = pipe->ring_size - 1;
+	head = pipe->head;
+	tail = pipe->tail;
+	n = pipe_occupancy(pipe->head, pipe->tail);
+	if (nr_slots < n) {
 		ret = -EBUSY;
 		goto out_revert_acct;
 	}
 
-	bufs = kcalloc(nr_pages, sizeof(*bufs),
+	bufs = kcalloc(nr_slots, sizeof(*bufs),
 		       GFP_KERNEL_ACCOUNT | __GFP_NOWARN);
 	if (unlikely(!bufs)) {
 		ret = -ENOMEM;
@@ -1105,33 +1125,36 @@ static long pipe_set_size(struct pipe_inode_info *pipe, unsigned long arg)
 
 	/*
 	 * The pipe array wraps around, so just start the new one at zero
-	 * and adjust the indexes.
+	 * and adjust the indices.
 	 */
-	if (pipe->nrbufs) {
-		unsigned int tail;
-		unsigned int head;
-
-		tail = pipe->curbuf + pipe->nrbufs;
-		if (tail < pipe->buffers)
-			tail = 0;
-		else
-			tail &= (pipe->buffers - 1);
-
-		head = pipe->nrbufs - tail;
-		if (head)
-			memcpy(bufs, pipe->bufs + pipe->curbuf, head * sizeof(struct pipe_buffer));
-		if (tail)
-			memcpy(bufs + head, pipe->bufs, tail * sizeof(struct pipe_buffer));
+	if (n > 0) {
+		unsigned int h = head & mask;
+		unsigned int t = tail & mask;
+		if (h > t) {
+			memcpy(bufs, pipe->bufs + t,
+			       n * sizeof(struct pipe_buffer));
+		} else {
+			unsigned int tsize = pipe->ring_size - t;
+			if (h > 0)
+				memcpy(bufs + tsize, pipe->bufs,
+				       h * sizeof(struct pipe_buffer));
+			memcpy(bufs, pipe->bufs + t,
+			       tsize * sizeof(struct pipe_buffer));
+		}
 	}
 
-	pipe->curbuf = 0;
+	head = n;
+	tail = 0;
+
 	kfree(pipe->bufs);
 	pipe->bufs = bufs;
-	pipe->buffers = nr_pages;
-	return nr_pages * PAGE_SIZE;
+	pipe->ring_size = nr_slots;
+	pipe->tail = tail;
+	pipe->head = head;
+	return pipe->ring_size * PAGE_SIZE;
 
 out_revert_acct:
-	(void) account_pipe_buffers(pipe->user, nr_pages, pipe->buffers);
+	(void) account_pipe_buffers(pipe->user, nr_slots, pipe->ring_size);
 	return ret;
 }
 
@@ -1161,7 +1184,7 @@ long pipe_fcntl(struct file *file, unsigned int cmd, unsigned long arg)
 		ret = pipe_set_size(pipe, arg);
 		break;
 	case F_GETPIPE_SZ:
-		ret = pipe->buffers * PAGE_SIZE;
+		ret = pipe->ring_size * PAGE_SIZE;
 		break;
 	default:
 		ret = -EINVAL;
diff --git a/fs/splice.c b/fs/splice.c
index 98412721f056..bbc025236ff9 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -185,6 +185,9 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
 		       struct splice_pipe_desc *spd)
 {
 	unsigned int spd_pages = spd->nr_pages;
+	unsigned int tail = pipe->tail;
+	unsigned int head = pipe->head;
+	unsigned int mask = pipe->ring_size - 1;
 	int ret = 0, page_nr = 0;
 
 	if (!spd_pages)
@@ -196,9 +199,8 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
 		goto out;
 	}
 
-	while (pipe->nrbufs < pipe->buffers) {
-		int newbuf = (pipe->curbuf + pipe->nrbufs) & (pipe->buffers - 1);
-		struct pipe_buffer *buf = pipe->bufs + newbuf;
+	while (!pipe_full(head, tail, pipe->ring_size)) {
+		struct pipe_buffer *buf = &pipe->bufs[head & mask];
 
 		buf->page = spd->pages[page_nr];
 		buf->offset = spd->partial[page_nr].offset;
@@ -207,7 +209,8 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
 		buf->ops = spd->ops;
 		buf->flags = 0;
 
-		pipe->nrbufs++;
+		head++;
+		pipe_commit_write(pipe, head);
 		page_nr++;
 		ret += buf->len;
 
@@ -228,17 +231,19 @@ EXPORT_SYMBOL_GPL(splice_to_pipe);
 
 ssize_t add_to_pipe(struct pipe_inode_info *pipe, struct pipe_buffer *buf)
 {
+	unsigned int head = pipe->head;
+	unsigned int tail = pipe->tail;
+	unsigned int mask = pipe->ring_size - 1;
 	int ret;
 
 	if (unlikely(!pipe->readers)) {
 		send_sig(SIGPIPE, current, 0);
 		ret = -EPIPE;
-	} else if (pipe->nrbufs == pipe->buffers) {
+	} else if (pipe_full(head, tail, pipe->ring_size)) {
 		ret = -EAGAIN;
 	} else {
-		int newbuf = (pipe->curbuf + pipe->nrbufs) & (pipe->buffers - 1);
-		pipe->bufs[newbuf] = *buf;
-		pipe->nrbufs++;
+		pipe->bufs[head & mask] = *buf;
+		pipe_commit_write(pipe, head + 1);
 		return buf->len;
 	}
 	pipe_buf_release(pipe, buf);
@@ -252,14 +257,14 @@ EXPORT_SYMBOL(add_to_pipe);
  */
 int splice_grow_spd(const struct pipe_inode_info *pipe, struct splice_pipe_desc *spd)
 {
-	unsigned int buffers = READ_ONCE(pipe->buffers);
+	unsigned int max_usage = READ_ONCE(pipe->ring_size);
 
-	spd->nr_pages_max = buffers;
-	if (buffers <= PIPE_DEF_BUFFERS)
+	spd->nr_pages_max = max_usage;
+	if (max_usage <= PIPE_DEF_BUFFERS)
 		return 0;
 
-	spd->pages = kmalloc_array(buffers, sizeof(struct page *), GFP_KERNEL);
-	spd->partial = kmalloc_array(buffers, sizeof(struct partial_page),
+	spd->pages = kmalloc_array(max_usage, sizeof(struct page *), GFP_KERNEL);
+	spd->partial = kmalloc_array(max_usage, sizeof(struct partial_page),
 				     GFP_KERNEL);
 
 	if (spd->pages && spd->partial)
@@ -298,10 +303,11 @@ ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
 {
 	struct iov_iter to;
 	struct kiocb kiocb;
-	int idx, ret;
+	unsigned int i_head;
+	int ret;
 
 	iov_iter_pipe(&to, READ, pipe, len);
-	idx = to.idx;
+	i_head = to.head;
 	init_sync_kiocb(&kiocb, in);
 	kiocb.ki_pos = *ppos;
 	ret = call_read_iter(in, &kiocb, &to);
@@ -309,7 +315,7 @@ ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
 		*ppos = kiocb.ki_pos;
 		file_accessed(in);
 	} else if (ret < 0) {
-		to.idx = idx;
+		to.head = i_head;
 		to.iov_offset = 0;
 		iov_iter_advance(&to, 0); /* to free what was emitted */
 		/*
@@ -370,11 +376,12 @@ static ssize_t default_file_splice_read(struct file *in, loff_t *ppos,
 	struct iov_iter to;
 	struct page **pages;
 	unsigned int nr_pages;
+	unsigned int mask;
 	size_t offset, base, copied = 0;
 	ssize_t res;
 	int i;
 
-	if (pipe->nrbufs == pipe->buffers)
+	if (pipe_full(pipe->head, pipe->tail, pipe->ring_size))
 		return -EAGAIN;
 
 	/*
@@ -400,8 +407,9 @@ static ssize_t default_file_splice_read(struct file *in, loff_t *ppos,
 		}
 	}
 
-	pipe->bufs[to.idx].offset = offset;
-	pipe->bufs[to.idx].len -= offset;
+	mask = pipe->ring_size - 1;
+	pipe->bufs[to.head & mask].offset = offset;
+	pipe->bufs[to.head & mask].len -= offset;
 
 	for (i = 0; i < nr_pages; i++) {
 		size_t this_len = min_t(size_t, len, PAGE_SIZE - offset);
@@ -443,7 +451,8 @@ static int pipe_to_sendpage(struct pipe_inode_info *pipe,
 
 	more = (sd->flags & SPLICE_F_MORE) ? MSG_MORE : 0;
 
-	if (sd->len < sd->total_len && pipe->nrbufs > 1)
+	if (sd->len < sd->total_len &&
+	    pipe_occupancy(pipe->head, pipe->tail) > 1)
 		more |= MSG_SENDPAGE_NOTLAST;
 
 	return file->f_op->sendpage(file, buf->page, buf->offset,
@@ -481,10 +490,13 @@ static void wakeup_pipe_writers(struct pipe_inode_info *pipe)
 static int splice_from_pipe_feed(struct pipe_inode_info *pipe, struct splice_desc *sd,
 			  splice_actor *actor)
 {
+	unsigned int head = pipe->head;
+	unsigned int tail = pipe->tail;
+	unsigned int mask = pipe->ring_size - 1;
 	int ret;
 
-	while (pipe->nrbufs) {
-		struct pipe_buffer *buf = pipe->bufs + pipe->curbuf;
+	while (!pipe_empty(tail, head)) {
+		struct pipe_buffer *buf = &pipe->bufs[tail & mask];
 
 		sd->len = buf->len;
 		if (sd->len > sd->total_len)
@@ -511,8 +523,8 @@ static int splice_from_pipe_feed(struct pipe_inode_info *pipe, struct splice_des
 
 		if (!buf->len) {
 			pipe_buf_release(pipe, buf);
-			pipe->curbuf = (pipe->curbuf + 1) & (pipe->buffers - 1);
-			pipe->nrbufs--;
+			tail++;
+			pipe_commit_read(pipe, tail);
 			if (pipe->files)
 				sd->need_wakeup = true;
 		}
@@ -543,7 +555,7 @@ static int splice_from_pipe_next(struct pipe_inode_info *pipe, struct splice_des
 	if (signal_pending(current))
 		return -ERESTARTSYS;
 
-	while (!pipe->nrbufs) {
+	while (pipe_empty(pipe->head, pipe->tail)) {
 		if (!pipe->writers)
 			return 0;
 
@@ -686,7 +698,7 @@ iter_file_splice_write(struct pipe_inode_info *pipe, struct file *out,
 		.pos = *ppos,
 		.u.file = out,
 	};
-	int nbufs = pipe->buffers;
+	int nbufs = pipe->ring_size;
 	struct bio_vec *array = kcalloc(nbufs, sizeof(struct bio_vec),
 					GFP_KERNEL);
 	ssize_t ret;
@@ -699,16 +711,19 @@ iter_file_splice_write(struct pipe_inode_info *pipe, struct file *out,
 	splice_from_pipe_begin(&sd);
 	while (sd.total_len) {
 		struct iov_iter from;
+		unsigned int head = pipe->head;
+		unsigned int tail = pipe->tail;
+		unsigned int mask = pipe->ring_size - 1;
 		size_t left;
-		int n, idx;
+		int n;
 
 		ret = splice_from_pipe_next(pipe, &sd);
 		if (ret <= 0)
 			break;
 
-		if (unlikely(nbufs < pipe->buffers)) {
+		if (unlikely(nbufs < pipe->ring_size)) {
 			kfree(array);
-			nbufs = pipe->buffers;
+			nbufs = pipe->ring_size;
 			array = kcalloc(nbufs, sizeof(struct bio_vec),
 					GFP_KERNEL);
 			if (!array) {
@@ -719,16 +734,13 @@ iter_file_splice_write(struct pipe_inode_info *pipe, struct file *out,
 
 		/* build the vector */
 		left = sd.total_len;
-		for (n = 0, idx = pipe->curbuf; left && n < pipe->nrbufs; n++, idx++) {
-			struct pipe_buffer *buf = pipe->bufs + idx;
+		for (n = 0; !pipe_empty(head, tail) && left && n < nbufs; tail++, n++) {
+			struct pipe_buffer *buf = &pipe->bufs[tail & mask];
 			size_t this_len = buf->len;
 
 			if (this_len > left)
 				this_len = left;
 
-			if (idx == pipe->buffers - 1)
-				idx = -1;
-
 			ret = pipe_buf_confirm(pipe, buf);
 			if (unlikely(ret)) {
 				if (ret == -ENODATA)
@@ -752,14 +764,15 @@ iter_file_splice_write(struct pipe_inode_info *pipe, struct file *out,
 		*ppos = sd.pos;
 
 		/* dismiss the fully eaten buffers, adjust the partial one */
+		tail = pipe->tail;
 		while (ret) {
-			struct pipe_buffer *buf = pipe->bufs + pipe->curbuf;
+			struct pipe_buffer *buf = &pipe->bufs[tail & mask];
 			if (ret >= buf->len) {
 				ret -= buf->len;
 				buf->len = 0;
 				pipe_buf_release(pipe, buf);
-				pipe->curbuf = (pipe->curbuf + 1) & (pipe->buffers - 1);
-				pipe->nrbufs--;
+				tail++;
+				pipe_commit_read(pipe, tail);
 				if (pipe->files)
 					sd.need_wakeup = true;
 			} else {
@@ -942,15 +955,17 @@ ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd,
 	sd->flags &= ~SPLICE_F_NONBLOCK;
 	more = sd->flags & SPLICE_F_MORE;
 
-	WARN_ON_ONCE(pipe->nrbufs != 0);
+	WARN_ON_ONCE(!pipe_empty(pipe->head, pipe->tail));
 
 	while (len) {
+		unsigned int p_space;
 		size_t read_len;
 		loff_t pos = sd->pos, prev_pos = pos;
 
 		/* Don't try to read more the pipe has space for. */
-		read_len = min_t(size_t, len,
-				 (pipe->buffers - pipe->nrbufs) << PAGE_SHIFT);
+		p_space = pipe->ring_size -
+			pipe_occupancy(pipe->head, pipe->tail);
+		read_len = min_t(size_t, len, p_space << PAGE_SHIFT);
 		ret = do_splice_to(in, &pos, pipe, read_len, flags);
 		if (unlikely(ret <= 0))
 			goto out_release;
@@ -989,7 +1004,7 @@ ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd,
 	}
 
 done:
-	pipe->nrbufs = pipe->curbuf = 0;
+	pipe->tail = pipe->head = 0;
 	file_accessed(in);
 	return bytes;
 
@@ -998,8 +1013,8 @@ ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd,
 	 * If we did an incomplete transfer we must release
 	 * the pipe buffers in question:
 	 */
-	for (i = 0; i < pipe->buffers; i++) {
-		struct pipe_buffer *buf = pipe->bufs + i;
+	for (i = 0; i < pipe->ring_size; i++) {
+		struct pipe_buffer *buf = &pipe->bufs[i];
 
 		if (buf->ops)
 			pipe_buf_release(pipe, buf);
@@ -1075,7 +1090,7 @@ static int wait_for_space(struct pipe_inode_info *pipe, unsigned flags)
 			send_sig(SIGPIPE, current, 0);
 			return -EPIPE;
 		}
-		if (pipe->nrbufs != pipe->buffers)
+		if (!pipe_full(pipe->head, pipe->tail, pipe->ring_size))
 			return 0;
 		if (flags & SPLICE_F_NONBLOCK)
 			return -EAGAIN;
@@ -1442,16 +1457,16 @@ static int ipipe_prep(struct pipe_inode_info *pipe, unsigned int flags)
 	int ret;
 
 	/*
-	 * Check ->nrbufs without the inode lock first. This function
+	 * Check the pipe occupancy without the inode lock first. This function
 	 * is speculative anyways, so missing one is ok.
 	 */
-	if (pipe->nrbufs)
+	if (!pipe_empty(pipe->head, pipe->tail))
 		return 0;
 
 	ret = 0;
 	pipe_lock(pipe);
 
-	while (!pipe->nrbufs) {
+	while (pipe_empty(pipe->head, pipe->tail)) {
 		if (signal_pending(current)) {
 			ret = -ERESTARTSYS;
 			break;
@@ -1483,13 +1498,13 @@ static int opipe_prep(struct pipe_inode_info *pipe, unsigned int flags)
 	 * Check ->nrbufs without the inode lock first. This function
 	 * is speculative anyways, so missing one is ok.
 	 */
-	if (pipe->nrbufs < pipe->buffers)
+	if (pipe_full(pipe->head, pipe->tail, pipe->ring_size))
 		return 0;
 
 	ret = 0;
 	pipe_lock(pipe);
 
-	while (pipe->nrbufs >= pipe->buffers) {
+	while (pipe_full(pipe->head, pipe->tail, pipe->ring_size)) {
 		if (!pipe->readers) {
 			send_sig(SIGPIPE, current, 0);
 			ret = -EPIPE;
@@ -1520,7 +1535,10 @@ static int splice_pipe_to_pipe(struct pipe_inode_info *ipipe,
 			       size_t len, unsigned int flags)
 {
 	struct pipe_buffer *ibuf, *obuf;
-	int ret = 0, nbuf;
+	unsigned int i_head, o_head;
+	unsigned int i_tail, o_tail;
+	unsigned int i_mask, o_mask;
+	int ret = 0;
 	bool input_wakeup = false;
 
 
@@ -1540,7 +1558,14 @@ static int splice_pipe_to_pipe(struct pipe_inode_info *ipipe,
 	 */
 	pipe_double_lock(ipipe, opipe);
 
+	i_tail = ipipe->tail;
+	i_mask = ipipe->ring_size - 1;
+	o_head = opipe->head;
+	o_mask = opipe->ring_size - 1;
+
 	do {
+		size_t o_len;
+
 		if (!opipe->readers) {
 			send_sig(SIGPIPE, current, 0);
 			if (!ret)
@@ -1548,14 +1573,18 @@ static int splice_pipe_to_pipe(struct pipe_inode_info *ipipe,
 			break;
 		}
 
-		if (!ipipe->nrbufs && !ipipe->writers)
+		i_head = ipipe->head;
+		o_tail = opipe->tail;
+
+		if (pipe_empty(i_head, i_tail) && !ipipe->writers)
 			break;
 
 		/*
 		 * Cannot make any progress, because either the input
 		 * pipe is empty or the output pipe is full.
 		 */
-		if (!ipipe->nrbufs || opipe->nrbufs >= opipe->buffers) {
+		if (pipe_empty(i_head, i_tail) ||
+		    pipe_full(o_head, o_tail, opipe->ring_size)) {
 			/* Already processed some buffers, break */
 			if (ret)
 				break;
@@ -1575,9 +1604,8 @@ static int splice_pipe_to_pipe(struct pipe_inode_info *ipipe,
 			goto retry;
 		}
 
-		ibuf = ipipe->bufs + ipipe->curbuf;
-		nbuf = (opipe->curbuf + opipe->nrbufs) & (opipe->buffers - 1);
-		obuf = opipe->bufs + nbuf;
+		ibuf = &ipipe->bufs[i_tail & i_mask];
+		obuf = &opipe->bufs[o_head & o_mask];
 
 		if (len >= ibuf->len) {
 			/*
@@ -1585,10 +1613,12 @@ static int splice_pipe_to_pipe(struct pipe_inode_info *ipipe,
 			 */
 			*obuf = *ibuf;
 			ibuf->ops = NULL;
-			opipe->nrbufs++;
-			ipipe->curbuf = (ipipe->curbuf + 1) & (ipipe->buffers - 1);
-			ipipe->nrbufs--;
+			i_tail++;
+			pipe_commit_read(ipipe, i_tail);
 			input_wakeup = true;
+			o_len = obuf->len;
+			o_head++;
+			pipe_commit_write(opipe, o_head);
 		} else {
 			/*
 			 * Get a reference to this pipe buffer,
@@ -1610,12 +1640,14 @@ static int splice_pipe_to_pipe(struct pipe_inode_info *ipipe,
 			pipe_buf_mark_unmergeable(obuf);
 
 			obuf->len = len;
-			opipe->nrbufs++;
-			ibuf->offset += obuf->len;
-			ibuf->len -= obuf->len;
+			ibuf->offset += len;
+			ibuf->len -= len;
+			o_len = len;
+			o_head++;
+			pipe_commit_write(opipe, o_head);
 		}
-		ret += obuf->len;
-		len -= obuf->len;
+		ret += o_len;
+		len -= o_len;
 	} while (len);
 
 	pipe_unlock(ipipe);
@@ -1641,7 +1673,10 @@ static int link_pipe(struct pipe_inode_info *ipipe,
 		     size_t len, unsigned int flags)
 {
 	struct pipe_buffer *ibuf, *obuf;
-	int ret = 0, i = 0, nbuf;
+	unsigned int i_head, o_head;
+	unsigned int i_tail, o_tail;
+	unsigned int i_mask, o_mask;
+	int ret = 0;
 
 	/*
 	 * Potential ABBA deadlock, work around it by ordering lock
@@ -1650,6 +1685,11 @@ static int link_pipe(struct pipe_inode_info *ipipe,
 	 */
 	pipe_double_lock(ipipe, opipe);
 
+	i_tail = ipipe->tail;
+	i_mask = ipipe->ring_size - 1;
+	o_head = opipe->head;
+	o_mask = opipe->ring_size - 1;
+
 	do {
 		if (!opipe->readers) {
 			send_sig(SIGPIPE, current, 0);
@@ -1658,15 +1698,19 @@ static int link_pipe(struct pipe_inode_info *ipipe,
 			break;
 		}
 
+		i_head = ipipe->head;
+		o_tail = opipe->tail;
+
 		/*
-		 * If we have iterated all input buffers or ran out of
+		 * If we have iterated all input buffers or run out of
 		 * output room, break.
 		 */
-		if (i >= ipipe->nrbufs || opipe->nrbufs >= opipe->buffers)
+		if (pipe_empty(i_head, i_tail) ||
+		    pipe_full(o_head, o_tail, opipe->ring_size))
 			break;
 
-		ibuf = ipipe->bufs + ((ipipe->curbuf + i) & (ipipe->buffers-1));
-		nbuf = (opipe->curbuf + opipe->nrbufs) & (opipe->buffers - 1);
+		ibuf = &ipipe->bufs[i_tail & i_mask];
+		obuf = &opipe->bufs[o_head & o_mask];
 
 		/*
 		 * Get a reference to this pipe buffer,
@@ -1678,7 +1722,6 @@ static int link_pipe(struct pipe_inode_info *ipipe,
 			break;
 		}
 
-		obuf = opipe->bufs + nbuf;
 		*obuf = *ibuf;
 
 		/*
@@ -1691,11 +1734,12 @@ static int link_pipe(struct pipe_inode_info *ipipe,
 
 		if (obuf->len > len)
 			obuf->len = len;
-
-		opipe->nrbufs++;
 		ret += obuf->len;
 		len -= obuf->len;
-		i++;
+
+		o_head++;
+		pipe_commit_write(opipe, o_head);
+		i_tail++;
 	} while (len);
 
 	/*
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index 5c626fdc10db..fad096697ff5 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -30,9 +30,9 @@ struct pipe_buffer {
  *	struct pipe_inode_info - a linux kernel pipe
  *	@mutex: mutex protecting the whole thing
  *	@wait: reader/writer wait point in case of empty/full pipe
- *	@nrbufs: the number of non-empty pipe buffers in this pipe
- *	@buffers: total number of buffers (should be a power of 2)
- *	@curbuf: the current pipe buffer entry
+ *	@head: The point of buffer production
+ *	@tail: The point of buffer consumption
+ *	@ring_size: total number of buffers (should be a power of 2)
  *	@tmp_page: cached released page
  *	@readers: number of current readers of this pipe
  *	@writers: number of current writers of this pipe
@@ -48,7 +48,9 @@ struct pipe_buffer {
 struct pipe_inode_info {
 	struct mutex mutex;
 	wait_queue_head_t wait;
-	unsigned int nrbufs, curbuf, buffers;
+	unsigned int head;
+	unsigned int tail;
+	unsigned int ring_size;
 	unsigned int readers;
 	unsigned int writers;
 	unsigned int files;
@@ -104,6 +106,82 @@ struct pipe_buf_operations {
 	bool (*get)(struct pipe_inode_info *, struct pipe_buffer *);
 };
 
+/**
+ * pipe_commit_read - Set pipe buffer tail pointer in the read-side
+ * @pipe: The pipe in question
+ * @tail: The new tail pointer
+ *
+ * Update the tail pointer in the read-side code after a read has taken place.
+ */
+static inline void pipe_commit_read(struct pipe_inode_info *pipe,
+				    unsigned int tail)
+{
+	pipe->tail = tail;
+}
+
+/**
+ * pipe_commit_write - Set pipe buffer head pointer in the write-side
+ * @pipe: The pipe in question
+ * @head: The new head pointer
+ *
+ * Update the head pointer in the write-side code after a write has taken place.
+ */
+static inline void pipe_commit_write(struct pipe_inode_info *pipe,
+				     unsigned int head)
+{
+	pipe->head = head;
+}
+
+/**
+ * pipe_empty - Return true if the pipe is empty
+ * @head: The pipe ring head pointer
+ * @tail: The pipe ring tail pointer
+ */
+static inline bool pipe_empty(unsigned int head, unsigned int tail)
+{
+	return head == tail;
+}
+
+/**
+ * pipe_occupancy - Return number of slots used in the pipe
+ * @head: The pipe ring head pointer
+ * @tail: The pipe ring tail pointer
+ */
+static inline unsigned int pipe_occupancy(unsigned int head, unsigned int tail)
+{
+	return head - tail;
+}
+
+/**
+ * pipe_full - Return true if the pipe is full
+ * @head: The pipe ring head pointer
+ * @tail: The pipe ring tail pointer
+ * @limit: The maximum amount of slots available.
+ */
+static inline bool pipe_full(unsigned int head, unsigned int tail,
+			     unsigned int limit)
+{
+	return pipe_occupancy(head, tail) >= limit;
+}
+
+/**
+ * pipe_space_for_user - Return number of slots available to userspace
+ * @head: The pipe ring head pointer
+ * @tail: The pipe ring tail pointer
+ * @pipe: The pipe info structure
+ */
+static inline unsigned int pipe_space_for_user(unsigned int head, unsigned int tail,
+					       struct pipe_inode_info *pipe)
+{
+	unsigned int p_occupancy, p_space;
+
+	p_occupancy = pipe_occupancy(head, tail);
+	if (p_occupancy >= pipe->ring_size)
+		return 0;
+	p_space = pipe->ring_size - p_occupancy;
+	return p_space;
+}
+
 /**
  * pipe_buf_get - get a reference to a pipe_buffer
  * @pipe:	the pipe that the buffer belongs to
diff --git a/include/linux/uio.h b/include/linux/uio.h
index ab5f523bc0df..9576fd8158d7 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -45,8 +45,8 @@ struct iov_iter {
 	union {
 		unsigned long nr_segs;
 		struct {
-			int idx;
-			int start_idx;
+			unsigned int head;
+			unsigned int start_head;
 		};
 	};
 };
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 639d5e7014c1..150a40bdb21a 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -325,28 +325,33 @@ static size_t copy_page_from_iter_iovec(struct page *page, size_t offset, size_t
 static bool sanity(const struct iov_iter *i)
 {
 	struct pipe_inode_info *pipe = i->pipe;
-	int idx = i->idx;
-	int next = pipe->curbuf + pipe->nrbufs;
+	unsigned int p_head = pipe->head;
+	unsigned int p_tail = pipe->tail;
+	unsigned int p_mask = pipe->ring_size - 1;
+	unsigned int p_occupancy = pipe_occupancy(p_head, p_tail);
+	unsigned int i_head = i->head;
+	unsigned int idx;
+
 	if (i->iov_offset) {
 		struct pipe_buffer *p;
-		if (unlikely(!pipe->nrbufs))
+		if (unlikely(p_occupancy == 0))
 			goto Bad;	// pipe must be non-empty
-		if (unlikely(idx != ((next - 1) & (pipe->buffers - 1))))
+		if (unlikely(i_head != p_head - 1))
 			goto Bad;	// must be at the last buffer...
 
-		p = &pipe->bufs[idx];
+		p = &pipe->bufs[i_head & p_mask];
 		if (unlikely(p->offset + p->len != i->iov_offset))
 			goto Bad;	// ... at the end of segment
 	} else {
-		if (idx != (next & (pipe->buffers - 1)))
+		if (i_head != p_head)
 			goto Bad;	// must be right after the last buffer
 	}
 	return true;
 Bad:
-	printk(KERN_ERR "idx = %d, offset = %zd\n", i->idx, i->iov_offset);
-	printk(KERN_ERR "curbuf = %d, nrbufs = %d, buffers = %d\n",
-			pipe->curbuf, pipe->nrbufs, pipe->buffers);
-	for (idx = 0; idx < pipe->buffers; idx++)
+	printk(KERN_ERR "idx = %d, offset = %zd\n", i_head, i->iov_offset);
+	printk(KERN_ERR "head = %d, tail = %d, buffers = %d\n",
+			p_head, p_tail, pipe->ring_size);
+	for (idx = 0; idx < pipe->ring_size; idx++)
 		printk(KERN_ERR "[%p %p %d %d]\n",
 			pipe->bufs[idx].ops,
 			pipe->bufs[idx].page,
@@ -359,18 +364,15 @@ static bool sanity(const struct iov_iter *i)
 #define sanity(i) true
 #endif
 
-static inline int next_idx(int idx, struct pipe_inode_info *pipe)
-{
-	return (idx + 1) & (pipe->buffers - 1);
-}
-
 static size_t copy_page_to_iter_pipe(struct page *page, size_t offset, size_t bytes,
 			 struct iov_iter *i)
 {
 	struct pipe_inode_info *pipe = i->pipe;
 	struct pipe_buffer *buf;
+	unsigned int p_tail = pipe->tail;
+	unsigned int p_mask = pipe->ring_size - 1;
+	unsigned int i_head = i->head;
 	size_t off;
-	int idx;
 
 	if (unlikely(bytes > i->count))
 		bytes = i->count;
@@ -382,8 +384,7 @@ static size_t copy_page_to_iter_pipe(struct page *page, size_t offset, size_t by
 		return 0;
 
 	off = i->iov_offset;
-	idx = i->idx;
-	buf = &pipe->bufs[idx];
+	buf = &pipe->bufs[i_head & p_mask];
 	if (off) {
 		if (offset == off && buf->page == page) {
 			/* merge with the last one */
@@ -391,18 +392,21 @@ static size_t copy_page_to_iter_pipe(struct page *page, size_t offset, size_t by
 			i->iov_offset += bytes;
 			goto out;
 		}
-		idx = next_idx(idx, pipe);
-		buf = &pipe->bufs[idx];
+		i_head++;
+		buf = &pipe->bufs[i_head & p_mask];
 	}
-	if (idx == pipe->curbuf && pipe->nrbufs)
+	if (pipe_full(i_head, p_tail, pipe->ring_size))
 		return 0;
-	pipe->nrbufs++;
+
 	buf->ops = &page_cache_pipe_buf_ops;
-	get_page(buf->page = page);
+	get_page(page);
+	buf->page = page;
 	buf->offset = offset;
 	buf->len = bytes;
+
+	pipe_commit_read(pipe, i_head);
 	i->iov_offset = offset + bytes;
-	i->idx = idx;
+	i->head = i_head;
 out:
 	i->count -= bytes;
 	return bytes;
@@ -480,24 +484,30 @@ static inline bool allocated(struct pipe_buffer *buf)
 	return buf->ops == &default_pipe_buf_ops;
 }
 
-static inline void data_start(const struct iov_iter *i, int *idxp, size_t *offp)
+static inline void data_start(const struct iov_iter *i,
+			      unsigned int *iter_headp, size_t *offp)
 {
+	unsigned int p_mask = i->pipe->ring_size - 1;
+	unsigned int iter_head = i->head;
 	size_t off = i->iov_offset;
-	int idx = i->idx;
-	if (off && (!allocated(&i->pipe->bufs[idx]) || off == PAGE_SIZE)) {
-		idx = next_idx(idx, i->pipe);
+
+	if (off && (!allocated(&i->pipe->bufs[iter_head & p_mask]) ||
+		    off == PAGE_SIZE)) {
+		iter_head++;
 		off = 0;
 	}
-	*idxp = idx;
+	*iter_headp = iter_head;
 	*offp = off;
 }
 
 static size_t push_pipe(struct iov_iter *i, size_t size,
-			int *idxp, size_t *offp)
+			int *iter_headp, size_t *offp)
 {
 	struct pipe_inode_info *pipe = i->pipe;
+	unsigned int p_tail = pipe->tail;
+	unsigned int p_mask = pipe->ring_size - 1;
+	unsigned int iter_head;
 	size_t off;
-	int idx;
 	ssize_t left;
 
 	if (unlikely(size > i->count))
@@ -506,33 +516,34 @@ static size_t push_pipe(struct iov_iter *i, size_t size,
 		return 0;
 
 	left = size;
-	data_start(i, &idx, &off);
-	*idxp = idx;
+	data_start(i, &iter_head, &off);
+	*iter_headp = iter_head;
 	*offp = off;
 	if (off) {
 		left -= PAGE_SIZE - off;
 		if (left <= 0) {
-			pipe->bufs[idx].len += size;
+			pipe->bufs[iter_head & p_mask].len += size;
 			return size;
 		}
-		pipe->bufs[idx].len = PAGE_SIZE;
-		idx = next_idx(idx, pipe);
+		pipe->bufs[iter_head & p_mask].len = PAGE_SIZE;
+		iter_head++;
 	}
-	while (idx != pipe->curbuf || !pipe->nrbufs) {
+	while (!pipe_full(iter_head, p_tail, pipe->ring_size)) {
+		struct pipe_buffer *buf = &pipe->bufs[iter_head & p_mask];
 		struct page *page = alloc_page(GFP_USER);
 		if (!page)
 			break;
-		pipe->nrbufs++;
-		pipe->bufs[idx].ops = &default_pipe_buf_ops;
-		pipe->bufs[idx].page = page;
-		pipe->bufs[idx].offset = 0;
-		if (left <= PAGE_SIZE) {
-			pipe->bufs[idx].len = left;
+
+		buf->ops = &default_pipe_buf_ops;
+		buf->page = page;
+		buf->offset = 0;
+		buf->len = max_t(ssize_t, left, PAGE_SIZE);
+		left -= buf->len;
+		iter_head++;
+		pipe_commit_write(pipe, iter_head);
+
+		if (left == 0)
 			return size;
-		}
-		pipe->bufs[idx].len = PAGE_SIZE;
-		left -= PAGE_SIZE;
-		idx = next_idx(idx, pipe);
 	}
 	return size - left;
 }
@@ -541,23 +552,26 @@ static size_t copy_pipe_to_iter(const void *addr, size_t bytes,
 				struct iov_iter *i)
 {
 	struct pipe_inode_info *pipe = i->pipe;
+	unsigned int p_mask = pipe->ring_size - 1;
+	unsigned int i_head;
 	size_t n, off;
-	int idx;
 
 	if (!sanity(i))
 		return 0;
 
-	bytes = n = push_pipe(i, bytes, &idx, &off);
+	bytes = n = push_pipe(i, bytes, &i_head, &off);
 	if (unlikely(!n))
 		return 0;
-	for ( ; n; idx = next_idx(idx, pipe), off = 0) {
+	do {
 		size_t chunk = min_t(size_t, n, PAGE_SIZE - off);
-		memcpy_to_page(pipe->bufs[idx].page, off, addr, chunk);
-		i->idx = idx;
+		memcpy_to_page(pipe->bufs[i_head & p_mask].page, off, addr, chunk);
+		i->head = i_head;
 		i->iov_offset = off + chunk;
 		n -= chunk;
 		addr += chunk;
-	}
+		off = 0;
+		i_head++;
+	} while (n);
 	i->count -= bytes;
 	return bytes;
 }
@@ -573,28 +587,31 @@ static size_t csum_and_copy_to_pipe_iter(const void *addr, size_t bytes,
 				__wsum *csum, struct iov_iter *i)
 {
 	struct pipe_inode_info *pipe = i->pipe;
+	unsigned int p_mask = pipe->ring_size - 1;
+	unsigned int i_head;
 	size_t n, r;
 	size_t off = 0;
 	__wsum sum = *csum;
-	int idx;
 
 	if (!sanity(i))
 		return 0;
 
-	bytes = n = push_pipe(i, bytes, &idx, &r);
+	bytes = n = push_pipe(i, bytes, &i_head, &r);
 	if (unlikely(!n))
 		return 0;
-	for ( ; n; idx = next_idx(idx, pipe), r = 0) {
+	do {
 		size_t chunk = min_t(size_t, n, PAGE_SIZE - r);
-		char *p = kmap_atomic(pipe->bufs[idx].page);
+		char *p = kmap_atomic(pipe->bufs[i_head & p_mask].page);
 		sum = csum_and_memcpy(p + r, addr, chunk, sum, off);
 		kunmap_atomic(p);
-		i->idx = idx;
+		i->head = i_head;
 		i->iov_offset = r + chunk;
 		n -= chunk;
 		off += chunk;
 		addr += chunk;
-	}
+		r = 0;
+		i_head++;
+	} while (n);
 	i->count -= bytes;
 	*csum = sum;
 	return bytes;
@@ -645,29 +662,32 @@ static size_t copy_pipe_to_iter_mcsafe(const void *addr, size_t bytes,
 				struct iov_iter *i)
 {
 	struct pipe_inode_info *pipe = i->pipe;
+	unsigned int p_mask = pipe->ring_size - 1;
+	unsigned int i_head;
 	size_t n, off, xfer = 0;
-	int idx;
 
 	if (!sanity(i))
 		return 0;
 
-	bytes = n = push_pipe(i, bytes, &idx, &off);
+	bytes = n = push_pipe(i, bytes, &i_head, &off);
 	if (unlikely(!n))
 		return 0;
-	for ( ; n; idx = next_idx(idx, pipe), off = 0) {
+	do {
 		size_t chunk = min_t(size_t, n, PAGE_SIZE - off);
 		unsigned long rem;
 
-		rem = memcpy_mcsafe_to_page(pipe->bufs[idx].page, off, addr,
-				chunk);
-		i->idx = idx;
+		rem = memcpy_mcsafe_to_page(pipe->bufs[i_head & p_mask].page,
+					    off, addr, chunk);
+		i->head = i_head;
 		i->iov_offset = off + chunk - rem;
 		xfer += chunk - rem;
 		if (rem)
 			break;
 		n -= chunk;
 		addr += chunk;
-	}
+		off = 0;
+		i_head++;
+	} while (n);
 	i->count -= xfer;
 	return xfer;
 }
@@ -925,6 +945,8 @@ EXPORT_SYMBOL(copy_page_from_iter);
 static size_t pipe_zero(size_t bytes, struct iov_iter *i)
 {
 	struct pipe_inode_info *pipe = i->pipe;
+	unsigned int p_mask = pipe->ring_size - 1;
+	unsigned int i_head;
 	size_t n, off;
 	int idx;
 
@@ -935,13 +957,15 @@ static size_t pipe_zero(size_t bytes, struct iov_iter *i)
 	if (unlikely(!n))
 		return 0;
 
-	for ( ; n; idx = next_idx(idx, pipe), off = 0) {
+	do {
 		size_t chunk = min_t(size_t, n, PAGE_SIZE - off);
-		memzero_page(pipe->bufs[idx].page, off, chunk);
-		i->idx = idx;
+		memzero_page(pipe->bufs[i_head & p_mask].page, off, chunk);
+		i->head = i_head;
 		i->iov_offset = off + chunk;
 		n -= chunk;
-	}
+		off = 0;
+		i_head++;
+	} while (n);
 	i->count -= bytes;
 	return bytes;
 }
@@ -987,20 +1011,26 @@ EXPORT_SYMBOL(iov_iter_copy_from_user_atomic);
 static inline void pipe_truncate(struct iov_iter *i)
 {
 	struct pipe_inode_info *pipe = i->pipe;
-	if (pipe->nrbufs) {
+	unsigned int p_tail = pipe->tail;
+	unsigned int p_head = pipe->head;
+	unsigned int p_mask = pipe->ring_size - 1;
+
+	if (!pipe_empty(p_head, p_tail)) {
+		struct pipe_buffer *buf;
+		unsigned int i_head = i->head;
 		size_t off = i->iov_offset;
-		int idx = i->idx;
-		int nrbufs = (idx - pipe->curbuf) & (pipe->buffers - 1);
+
 		if (off) {
-			pipe->bufs[idx].len = off - pipe->bufs[idx].offset;
-			idx = next_idx(idx, pipe);
-			nrbufs++;
+			buf = &pipe->bufs[i_head & p_mask];
+			buf->len = off - buf->offset;
+			i_head++;
 		}
-		while (pipe->nrbufs > nrbufs) {
-			pipe_buf_release(pipe, &pipe->bufs[idx]);
-			idx = next_idx(idx, pipe);
-			pipe->nrbufs--;
+		while (p_head != i_head) {
+			p_head--;
+			pipe_buf_release(pipe, &pipe->bufs[p_head & p_mask]);
 		}
+
+		pipe_commit_write(pipe, p_head);
 	}
 }
 
@@ -1011,18 +1041,20 @@ static void pipe_advance(struct iov_iter *i, size_t size)
 		size = i->count;
 	if (size) {
 		struct pipe_buffer *buf;
+		unsigned int p_mask = pipe->ring_size - 1;
+		unsigned int i_head = i->head;
 		size_t off = i->iov_offset, left = size;
-		int idx = i->idx;
+
 		if (off) /* make it relative to the beginning of buffer */
-			left += off - pipe->bufs[idx].offset;
+			left += off - pipe->bufs[i_head & p_mask].offset;
 		while (1) {
-			buf = &pipe->bufs[idx];
+			buf = &pipe->bufs[i_head & p_mask];
 			if (left <= buf->len)
 				break;
 			left -= buf->len;
-			idx = next_idx(idx, pipe);
+			i_head++;
 		}
-		i->idx = idx;
+		i->head = i_head;
 		i->iov_offset = buf->offset + left;
 	}
 	i->count -= size;
@@ -1053,25 +1085,27 @@ void iov_iter_revert(struct iov_iter *i, size_t unroll)
 	i->count += unroll;
 	if (unlikely(iov_iter_is_pipe(i))) {
 		struct pipe_inode_info *pipe = i->pipe;
-		int idx = i->idx;
+		unsigned int p_mask = pipe->ring_size - 1;
+		unsigned int i_head = i->head;
 		size_t off = i->iov_offset;
 		while (1) {
-			size_t n = off - pipe->bufs[idx].offset;
+			struct pipe_buffer *b = &pipe->bufs[i_head & p_mask];
+			size_t n = off - b->offset;
 			if (unroll < n) {
 				off -= unroll;
 				break;
 			}
 			unroll -= n;
-			if (!unroll && idx == i->start_idx) {
+			if (!unroll && i_head == i->start_head) {
 				off = 0;
 				break;
 			}
-			if (!idx--)
-				idx = pipe->buffers - 1;
-			off = pipe->bufs[idx].offset + pipe->bufs[idx].len;
+			i_head--;
+			b = &pipe->bufs[i_head & p_mask];
+			off = b->offset + b->len;
 		}
 		i->iov_offset = off;
-		i->idx = idx;
+		i->head = i_head;
 		pipe_truncate(i);
 		return;
 	}
@@ -1159,13 +1193,13 @@ void iov_iter_pipe(struct iov_iter *i, unsigned int direction,
 			size_t count)
 {
 	BUG_ON(direction != READ);
-	WARN_ON(pipe->nrbufs == pipe->buffers);
+	WARN_ON(pipe_full(pipe->head, pipe->tail, pipe->ring_size));
 	i->type = ITER_PIPE | READ;
 	i->pipe = pipe;
-	i->idx = (pipe->curbuf + pipe->nrbufs) & (pipe->buffers - 1);
+	i->head = pipe->head;
 	i->iov_offset = 0;
 	i->count = count;
-	i->start_idx = i->idx;
+	i->start_head = i->head;
 }
 EXPORT_SYMBOL(iov_iter_pipe);
 
@@ -1189,11 +1223,12 @@ EXPORT_SYMBOL(iov_iter_discard);
 
 unsigned long iov_iter_alignment(const struct iov_iter *i)
 {
+	unsigned int p_mask = i->pipe->ring_size - 1;
 	unsigned long res = 0;
 	size_t size = i->count;
 
 	if (unlikely(iov_iter_is_pipe(i))) {
-		if (size && i->iov_offset && allocated(&i->pipe->bufs[i->idx]))
+		if (size && i->iov_offset && allocated(&i->pipe->bufs[i->head & p_mask]))
 			return size | i->iov_offset;
 		return size;
 	}
@@ -1231,19 +1266,20 @@ EXPORT_SYMBOL(iov_iter_gap_alignment);
 static inline ssize_t __pipe_get_pages(struct iov_iter *i,
 				size_t maxsize,
 				struct page **pages,
-				int idx,
+				int iter_head,
 				size_t *start)
 {
 	struct pipe_inode_info *pipe = i->pipe;
-	ssize_t n = push_pipe(i, maxsize, &idx, start);
+	unsigned int p_mask = pipe->ring_size - 1;
+	ssize_t n = push_pipe(i, maxsize, &iter_head, start);
 	if (!n)
 		return -EFAULT;
 
 	maxsize = n;
 	n += *start;
 	while (n > 0) {
-		get_page(*pages++ = pipe->bufs[idx].page);
-		idx = next_idx(idx, pipe);
+		get_page(*pages++ = pipe->bufs[iter_head & p_mask].page);
+		iter_head++;
 		n -= PAGE_SIZE;
 	}
 
@@ -1254,9 +1290,8 @@ static ssize_t pipe_get_pages(struct iov_iter *i,
 		   struct page **pages, size_t maxsize, unsigned maxpages,
 		   size_t *start)
 {
-	unsigned npages;
+	unsigned int iter_head, npages;
 	size_t capacity;
-	int idx;
 
 	if (!maxsize)
 		return 0;
@@ -1264,12 +1299,12 @@ static ssize_t pipe_get_pages(struct iov_iter *i,
 	if (!sanity(i))
 		return -EFAULT;
 
-	data_start(i, &idx, start);
-	/* some of this one + all after this one */
-	npages = ((i->pipe->curbuf - idx - 1) & (i->pipe->buffers - 1)) + 1;
-	capacity = min(npages,maxpages) * PAGE_SIZE - *start;
+	data_start(i, &iter_head, start);
+	/* Amount of free space: some of this one + all after this one */
+	npages = pipe_space_for_user(iter_head, i->pipe->tail, i->pipe);
+	capacity = min(npages, maxpages) * PAGE_SIZE - *start;
 
-	return __pipe_get_pages(i, min(maxsize, capacity), pages, idx, start);
+	return __pipe_get_pages(i, min(maxsize, capacity), pages, iter_head, start);
 }
 
 ssize_t iov_iter_get_pages(struct iov_iter *i,
@@ -1323,9 +1358,8 @@ static ssize_t pipe_get_pages_alloc(struct iov_iter *i,
 		   size_t *start)
 {
 	struct page **p;
+	unsigned int iter_head, npages;
 	ssize_t n;
-	int idx;
-	int npages;
 
 	if (!maxsize)
 		return 0;
@@ -1333,9 +1367,9 @@ static ssize_t pipe_get_pages_alloc(struct iov_iter *i,
 	if (!sanity(i))
 		return -EFAULT;
 
-	data_start(i, &idx, start);
-	/* some of this one + all after this one */
-	npages = ((i->pipe->curbuf - idx - 1) & (i->pipe->buffers - 1)) + 1;
+	data_start(i, &iter_head, start);
+	/* Amount of free space: some of this one + all after this one */
+	npages = pipe_space_for_user(iter_head, i->pipe->tail, i->pipe);
 	n = npages * PAGE_SIZE - *start;
 	if (maxsize > n)
 		maxsize = n;
@@ -1344,7 +1378,7 @@ static ssize_t pipe_get_pages_alloc(struct iov_iter *i,
 	p = get_pages_array(npages);
 	if (!p)
 		return -ENOMEM;
-	n = __pipe_get_pages(i, maxsize, p, idx, start);
+	n = __pipe_get_pages(i, maxsize, p, iter_head, start);
 	if (n > 0)
 		*pages = p;
 	else
@@ -1560,15 +1594,15 @@ int iov_iter_npages(const struct iov_iter *i, int maxpages)
 
 	if (unlikely(iov_iter_is_pipe(i))) {
 		struct pipe_inode_info *pipe = i->pipe;
+		unsigned int iter_head;
 		size_t off;
-		int idx;
 
 		if (!sanity(i))
 			return 0;
 
-		data_start(i, &idx, &off);
+		data_start(i, &iter_head, &off);
 		/* some of this one + all after this one */
-		npages = ((pipe->curbuf - idx - 1) & (pipe->buffers - 1)) + 1;
+		npages = pipe_space_for_user(iter_head, pipe->tail, pipe);
 		if (npages >= maxpages)
 			return maxpages;
 	} else iterate_all_kinds(i, size, v, ({

^ permalink raw reply related

* [RFC PATCH 05/10] pipe: Allow pipes to have kernel-reserved slots [ver #2]
From: David Howells @ 2019-10-23 20:17 UTC (permalink / raw)
  To: torvalds
  Cc: dhowells, Rasmus Villemoes, Greg Kroah-Hartman, Peter Zijlstra,
	nicolas.dichtel, raven, Christian Brauner
In-Reply-To: <157186182463.3995.13922458878706311997.stgit@warthog.procyon.org.uk>

Split pipe->ring_size into two numbers:

 (1) pipe->ring_size - indicates the hard size of the pipe ring.

 (2) pipe->max_usage - indicates the maximum number of pipe ring slots that
     userspace orchestrated events can fill.

This allows for a pipe that is both writable by the general kernel
notification facility and by userspace, allowing plenty of ring space for
notifications to be added whilst preventing userspace from being able to
pin too much unswappable kernel space.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/fuse/dev.c             |    8 ++++----
 fs/pipe.c                 |   10 ++++++----
 fs/splice.c               |   26 +++++++++++++-------------
 include/linux/pipe_fs_i.h |    6 +++++-
 lib/iov_iter.c            |    4 ++--
 5 files changed, 30 insertions(+), 24 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 1e4bc27573cc..5ef57a322cb8 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -703,7 +703,7 @@ static int fuse_copy_fill(struct fuse_copy_state *cs)
 			cs->pipebufs++;
 			cs->nr_segs--;
 		} else {
-			if (cs->nr_segs >= cs->pipe->ring_size)
+			if (cs->nr_segs >= cs->pipe->max_usage)
 				return -EIO;
 
 			page = alloc_page(GFP_HIGHUSER);
@@ -879,7 +879,7 @@ static int fuse_ref_page(struct fuse_copy_state *cs, struct page *page,
 	struct pipe_buffer *buf;
 	int err;
 
-	if (cs->nr_segs >= cs->pipe->ring_size)
+	if (cs->nr_segs >= cs->pipe->max_usage)
 		return -EIO;
 
 	err = unlock_request(cs->req);
@@ -1341,7 +1341,7 @@ static ssize_t fuse_dev_splice_read(struct file *in, loff_t *ppos,
 	if (!fud)
 		return -EPERM;
 
-	bufs = kvmalloc_array(pipe->ring_size, sizeof(struct pipe_buffer),
+	bufs = kvmalloc_array(pipe->max_usage, sizeof(struct pipe_buffer),
 			      GFP_KERNEL);
 	if (!bufs)
 		return -ENOMEM;
@@ -1353,7 +1353,7 @@ static ssize_t fuse_dev_splice_read(struct file *in, loff_t *ppos,
 	if (ret < 0)
 		goto out;
 
-	if (pipe_occupancy(pipe->head, pipe->tail) + cs.nr_segs > pipe->ring_size) {
+	if (pipe_occupancy(pipe->head, pipe->tail) + cs.nr_segs > pipe->max_usage) {
 		ret = -EIO;
 		goto out;
 	}
diff --git a/fs/pipe.c b/fs/pipe.c
index 8a0806fe12d3..12e47cae9425 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -403,7 +403,7 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from)
 
 	tail = pipe->tail;
 	head = pipe->head;
-	max_usage = pipe->ring_size;
+	max_usage = pipe->max_usage;
 	mask = pipe->ring_size - 1;
 
 	/* We try to merge small writes */
@@ -570,7 +570,7 @@ pipe_poll(struct file *filp, poll_table *wait)
 	}
 
 	if (filp->f_mode & FMODE_WRITE) {
-		if (!pipe_full(head, tail, pipe->ring_size))
+		if (!pipe_full(head, tail, pipe->max_usage))
 			mask |= EPOLLOUT | EPOLLWRNORM;
 		/*
 		 * Most Unices do not set EPOLLERR for FIFOs but on Linux they
@@ -695,6 +695,7 @@ struct pipe_inode_info *alloc_pipe_info(void)
 	if (pipe->bufs) {
 		init_waitqueue_head(&pipe->wait);
 		pipe->r_counter = pipe->w_counter = 1;
+		pipe->max_usage = pipe_bufs;
 		pipe->ring_size = pipe_bufs;
 		pipe->user = user;
 		mutex_init(&pipe->mutex);
@@ -1149,9 +1150,10 @@ static long pipe_set_size(struct pipe_inode_info *pipe, unsigned long arg)
 	kfree(pipe->bufs);
 	pipe->bufs = bufs;
 	pipe->ring_size = nr_slots;
+	pipe->max_usage = nr_slots;
 	pipe->tail = tail;
 	pipe->head = head;
-	return pipe->ring_size * PAGE_SIZE;
+	return pipe->max_usage * PAGE_SIZE;
 
 out_revert_acct:
 	(void) account_pipe_buffers(pipe->user, nr_slots, pipe->ring_size);
@@ -1184,7 +1186,7 @@ long pipe_fcntl(struct file *file, unsigned int cmd, unsigned long arg)
 		ret = pipe_set_size(pipe, arg);
 		break;
 	case F_GETPIPE_SZ:
-		ret = pipe->ring_size * PAGE_SIZE;
+		ret = pipe->max_usage * PAGE_SIZE;
 		break;
 	default:
 		ret = -EINVAL;
diff --git a/fs/splice.c b/fs/splice.c
index bbc025236ff9..03b96eae084b 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -199,7 +199,7 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
 		goto out;
 	}
 
-	while (!pipe_full(head, tail, pipe->ring_size)) {
+	while (!pipe_full(head, tail, pipe->max_usage)) {
 		struct pipe_buffer *buf = &pipe->bufs[head & mask];
 
 		buf->page = spd->pages[page_nr];
@@ -239,7 +239,7 @@ ssize_t add_to_pipe(struct pipe_inode_info *pipe, struct pipe_buffer *buf)
 	if (unlikely(!pipe->readers)) {
 		send_sig(SIGPIPE, current, 0);
 		ret = -EPIPE;
-	} else if (pipe_full(head, tail, pipe->ring_size)) {
+	} else if (pipe_full(head, tail, pipe->max_usage)) {
 		ret = -EAGAIN;
 	} else {
 		pipe->bufs[head & mask] = *buf;
@@ -257,7 +257,7 @@ EXPORT_SYMBOL(add_to_pipe);
  */
 int splice_grow_spd(const struct pipe_inode_info *pipe, struct splice_pipe_desc *spd)
 {
-	unsigned int max_usage = READ_ONCE(pipe->ring_size);
+	unsigned int max_usage = READ_ONCE(pipe->max_usage);
 
 	spd->nr_pages_max = max_usage;
 	if (max_usage <= PIPE_DEF_BUFFERS)
@@ -381,7 +381,7 @@ static ssize_t default_file_splice_read(struct file *in, loff_t *ppos,
 	ssize_t res;
 	int i;
 
-	if (pipe_full(pipe->head, pipe->tail, pipe->ring_size))
+	if (pipe_full(pipe->head, pipe->tail, pipe->max_usage))
 		return -EAGAIN;
 
 	/*
@@ -698,7 +698,7 @@ iter_file_splice_write(struct pipe_inode_info *pipe, struct file *out,
 		.pos = *ppos,
 		.u.file = out,
 	};
-	int nbufs = pipe->ring_size;
+	int nbufs = pipe->max_usage;
 	struct bio_vec *array = kcalloc(nbufs, sizeof(struct bio_vec),
 					GFP_KERNEL);
 	ssize_t ret;
@@ -721,9 +721,9 @@ iter_file_splice_write(struct pipe_inode_info *pipe, struct file *out,
 		if (ret <= 0)
 			break;
 
-		if (unlikely(nbufs < pipe->ring_size)) {
+		if (unlikely(nbufs < pipe->max_usage)) {
 			kfree(array);
-			nbufs = pipe->ring_size;
+			nbufs = pipe->max_usage;
 			array = kcalloc(nbufs, sizeof(struct bio_vec),
 					GFP_KERNEL);
 			if (!array) {
@@ -963,7 +963,7 @@ ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd,
 		loff_t pos = sd->pos, prev_pos = pos;
 
 		/* Don't try to read more the pipe has space for. */
-		p_space = pipe->ring_size -
+		p_space = pipe->max_usage -
 			pipe_occupancy(pipe->head, pipe->tail);
 		read_len = min_t(size_t, len, p_space << PAGE_SHIFT);
 		ret = do_splice_to(in, &pos, pipe, read_len, flags);
@@ -1090,7 +1090,7 @@ static int wait_for_space(struct pipe_inode_info *pipe, unsigned flags)
 			send_sig(SIGPIPE, current, 0);
 			return -EPIPE;
 		}
-		if (!pipe_full(pipe->head, pipe->tail, pipe->ring_size))
+		if (!pipe_full(pipe->head, pipe->tail, pipe->max_usage))
 			return 0;
 		if (flags & SPLICE_F_NONBLOCK)
 			return -EAGAIN;
@@ -1498,13 +1498,13 @@ static int opipe_prep(struct pipe_inode_info *pipe, unsigned int flags)
 	 * Check ->nrbufs without the inode lock first. This function
 	 * is speculative anyways, so missing one is ok.
 	 */
-	if (pipe_full(pipe->head, pipe->tail, pipe->ring_size))
+	if (pipe_full(pipe->head, pipe->tail, pipe->max_usage))
 		return 0;
 
 	ret = 0;
 	pipe_lock(pipe);
 
-	while (pipe_full(pipe->head, pipe->tail, pipe->ring_size)) {
+	while (pipe_full(pipe->head, pipe->tail, pipe->max_usage)) {
 		if (!pipe->readers) {
 			send_sig(SIGPIPE, current, 0);
 			ret = -EPIPE;
@@ -1584,7 +1584,7 @@ static int splice_pipe_to_pipe(struct pipe_inode_info *ipipe,
 		 * pipe is empty or the output pipe is full.
 		 */
 		if (pipe_empty(i_head, i_tail) ||
-		    pipe_full(o_head, o_tail, opipe->ring_size)) {
+		    pipe_full(o_head, o_tail, opipe->max_usage)) {
 			/* Already processed some buffers, break */
 			if (ret)
 				break;
@@ -1706,7 +1706,7 @@ static int link_pipe(struct pipe_inode_info *ipipe,
 		 * output room, break.
 		 */
 		if (pipe_empty(i_head, i_tail) ||
-		    pipe_full(o_head, o_tail, opipe->ring_size))
+		    pipe_full(o_head, o_tail, opipe->max_usage))
 			break;
 
 		ibuf = &ipipe->bufs[i_tail & i_mask];
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index fad096697ff5..90055ff16550 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -32,6 +32,7 @@ struct pipe_buffer {
  *	@wait: reader/writer wait point in case of empty/full pipe
  *	@head: The point of buffer production
  *	@tail: The point of buffer consumption
+ *	@max_usage: The maximum number of slots that may be used in the ring
  *	@ring_size: total number of buffers (should be a power of 2)
  *	@tmp_page: cached released page
  *	@readers: number of current readers of this pipe
@@ -50,6 +51,7 @@ struct pipe_inode_info {
 	wait_queue_head_t wait;
 	unsigned int head;
 	unsigned int tail;
+	unsigned int max_usage;
 	unsigned int ring_size;
 	unsigned int readers;
 	unsigned int writers;
@@ -176,9 +178,11 @@ static inline unsigned int pipe_space_for_user(unsigned int head, unsigned int t
 	unsigned int p_occupancy, p_space;
 
 	p_occupancy = pipe_occupancy(head, tail);
-	if (p_occupancy >= pipe->ring_size)
+	if (p_occupancy >= pipe->max_usage)
 		return 0;
 	p_space = pipe->ring_size - p_occupancy;
+	if (p_space > pipe->max_usage)
+		p_space = pipe->max_usage;
 	return p_space;
 }
 
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 150a40bdb21a..134737e9d4ee 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -395,7 +395,7 @@ static size_t copy_page_to_iter_pipe(struct page *page, size_t offset, size_t by
 		i_head++;
 		buf = &pipe->bufs[i_head & p_mask];
 	}
-	if (pipe_full(i_head, p_tail, pipe->ring_size))
+	if (pipe_full(i_head, p_tail, pipe->max_usage))
 		return 0;
 
 	buf->ops = &page_cache_pipe_buf_ops;
@@ -528,7 +528,7 @@ static size_t push_pipe(struct iov_iter *i, size_t size,
 		pipe->bufs[iter_head & p_mask].len = PAGE_SIZE;
 		iter_head++;
 	}
-	while (!pipe_full(iter_head, p_tail, pipe->ring_size)) {
+	while (!pipe_full(iter_head, p_tail, pipe->max_usage)) {
 		struct pipe_buffer *buf = &pipe->bufs[iter_head & p_mask];
 		struct page *page = alloc_page(GFP_USER);
 		if (!page)

^ permalink raw reply related

* [RFC PATCH 06/10] pipe: Advance tail pointer inside of wait spinlock in pipe_read() [ver #2]
From: David Howells @ 2019-10-23 20:18 UTC (permalink / raw)
  To: torvalds
  Cc: dhowells, Rasmus Villemoes, Greg Kroah-Hartman, Peter Zijlstra,
	nicolas.dichtel, raven, Christian Brauner
In-Reply-To: <157186182463.3995.13922458878706311997.stgit@warthog.procyon.org.uk>

Advance the pipe ring tail pointer inside of wait spinlock in pipe_read()
so that the pipe can be written into with kernel notifications from
contexts where pipe->mutex cannot be taken.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/pipe.c |    8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/pipe.c b/fs/pipe.c
index 12e47cae9425..1274305772fb 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -324,9 +324,14 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
 
 			if (!buf->len) {
 				pipe_buf_release(pipe, buf);
+				spin_lock_irq(&pipe->wait.lock);
 				tail++;
 				pipe_commit_read(pipe, tail);
-				do_wakeup = 1;
+				do_wakeup = 0;
+				wake_up_interruptible_sync_poll_locked(
+					&pipe->wait, EPOLLOUT | EPOLLWRNORM);
+				spin_unlock_irq(&pipe->wait.lock);
+				kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
 			}
 			total_len -= chars;
 			if (!total_len)
@@ -358,6 +363,7 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
 		if (do_wakeup) {
 			wake_up_interruptible_sync_poll(&pipe->wait, EPOLLOUT | EPOLLWRNORM);
  			kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
+			do_wakeup = 0;
 		}
 		pipe_wait(pipe);
 	}

^ permalink raw reply related

* [RFC PATCH 07/10] pipe: Conditionalise wakeup in pipe_read() [ver #2]
From: David Howells @ 2019-10-23 20:18 UTC (permalink / raw)
  To: torvalds
  Cc: dhowells, Rasmus Villemoes, Greg Kroah-Hartman, Peter Zijlstra,
	nicolas.dichtel, raven, Christian Brauner
In-Reply-To: <157186182463.3995.13922458878706311997.stgit@warthog.procyon.org.uk>

Only do a wakeup in pipe_read() if we made space in a completely full
buffer.  The producer shouldn't be waiting on pipe->wait otherwise.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/pipe.c |   15 ++++++---------
 1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/fs/pipe.c b/fs/pipe.c
index 1274305772fb..e3a8f10750c9 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -327,11 +327,13 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
 				spin_lock_irq(&pipe->wait.lock);
 				tail++;
 				pipe_commit_read(pipe, tail);
-				do_wakeup = 0;
-				wake_up_interruptible_sync_poll_locked(
-					&pipe->wait, EPOLLOUT | EPOLLWRNORM);
+				do_wakeup = 1;
+				if (head - (tail - 1) == pipe->max_usage)
+					wake_up_interruptible_sync_poll_locked(
+						&pipe->wait, EPOLLOUT | EPOLLWRNORM);
 				spin_unlock_irq(&pipe->wait.lock);
-				kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
+				if (head - (tail - 1) == pipe->max_usage)
+					kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
 			}
 			total_len -= chars;
 			if (!total_len)
@@ -360,11 +362,6 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
 				ret = -ERESTARTSYS;
 			break;
 		}
-		if (do_wakeup) {
-			wake_up_interruptible_sync_poll(&pipe->wait, EPOLLOUT | EPOLLWRNORM);
- 			kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
-			do_wakeup = 0;
-		}
 		pipe_wait(pipe);
 	}
 	__pipe_unlock(pipe);

^ permalink raw reply related

* [RFC PATCH 08/10] pipe: Rearrange sequence in pipe_write() to preallocate slot [ver #2]
From: David Howells @ 2019-10-23 20:18 UTC (permalink / raw)
  To: torvalds
  Cc: dhowells, Rasmus Villemoes, Greg Kroah-Hartman, Peter Zijlstra,
	nicolas.dichtel, raven, Christian Brauner
In-Reply-To: <157186182463.3995.13922458878706311997.stgit@warthog.procyon.org.uk>

Rearrange the sequence in pipe_write() so that the allocation of the new
buffer, the allocation of a ring slot and the attachment to the ring is
done under the pipe wait spinlock and then the lock is dropped and the
buffer can be filled.

The data copy needs to be done with the spinlock unheld and irqs enabled,
so the lock needs to be dropped first.  However, the reader can't progress
as we're holding pipe->mutex.

We also need to drop the lock as that would impact others looking at the
pipe waitqueue, such as poll(), the consumer and a future kernel message
writer.

We just abandon the preallocated slot if we get a copy error.  Future
writes may continue it and a future read will eventually recycle it.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/pipe.c |   51 +++++++++++++++++++++++++++++++++------------------
 1 file changed, 33 insertions(+), 18 deletions(-)

diff --git a/fs/pipe.c b/fs/pipe.c
index e3a8f10750c9..1bfad2212b95 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -386,7 +386,7 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from)
 {
 	struct file *filp = iocb->ki_filp;
 	struct pipe_inode_info *pipe = filp->private_data;
-	unsigned int head, tail, max_usage, mask;
+	unsigned int head, max_usage, mask;
 	ssize_t ret = 0;
 	int do_wakeup = 0;
 	size_t total_len = iov_iter_count(from);
@@ -404,14 +404,13 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from)
 		goto out;
 	}
 
-	tail = pipe->tail;
 	head = pipe->head;
 	max_usage = pipe->max_usage;
 	mask = pipe->ring_size - 1;
 
 	/* We try to merge small writes */
 	chars = total_len & (PAGE_SIZE-1); /* size of the last buffer */
-	if (!pipe_empty(head, tail) && chars != 0) {
+	if (!pipe_empty(head, pipe->tail) && chars != 0) {
 		struct pipe_buffer *buf = &pipe->bufs[(head - 1) & mask];
 		int offset = buf->offset + buf->len;
 
@@ -440,8 +439,8 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from)
 			break;
 		}
 
-		tail = pipe->tail;
-		if (!pipe_full(head, tail, max_usage)) {
+		head = pipe->head;
+		if (!pipe_full(head, pipe->tail, max_usage)) {
 			struct pipe_buffer *buf = &pipe->bufs[head & mask];
 			struct page *page = pipe->tmp_page;
 			int copied;
@@ -454,40 +453,56 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from)
 				}
 				pipe->tmp_page = page;
 			}
+
+			/* Allocate a slot in the ring in advance and attach an
+			 * empty buffer.  If we fault or otherwise fail to use
+			 * it, either the reader will consume it or it'll still
+			 * be there for the next write.
+			 */
+			spin_lock_irq(&pipe->wait.lock);
+
+			head = pipe->head;
+			pipe_commit_write(pipe, head + 1);
+
 			/* Always wake up, even if the copy fails. Otherwise
 			 * we lock up (O_NONBLOCK-)readers that sleep due to
 			 * syscall merging.
 			 * FIXME! Is this really true?
 			 */
-			do_wakeup = 1;
-			copied = copy_page_from_iter(page, 0, PAGE_SIZE, from);
-			if (unlikely(copied < PAGE_SIZE && iov_iter_count(from))) {
-				if (!ret)
-					ret = -EFAULT;
-				break;
-			}
-			ret += copied;
+			wake_up_interruptible_sync_poll_locked(
+				&pipe->wait, EPOLLIN | EPOLLRDNORM);
+
+			spin_unlock_irq(&pipe->wait.lock);
+			kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
 
 			/* Insert it into the buffer array */
+			buf = &pipe->bufs[head & mask];
 			buf->page = page;
 			buf->ops = &anon_pipe_buf_ops;
 			buf->offset = 0;
-			buf->len = copied;
+			buf->len = 0;
 			buf->flags = 0;
 			if (is_packetized(filp)) {
 				buf->ops = &packet_pipe_buf_ops;
 				buf->flags = PIPE_BUF_FLAG_PACKET;
 			}
-
-			head++;
-			pipe_commit_write(pipe, head);
 			pipe->tmp_page = NULL;
 
+			copied = copy_page_from_iter(page, 0, PAGE_SIZE, from);
+			if (unlikely(copied < PAGE_SIZE && iov_iter_count(from))) {
+				if (!ret)
+					ret = -EFAULT;
+				break;
+			}
+			ret += copied;
+			buf->offset = 0;
+			buf->len = copied;
+
 			if (!iov_iter_count(from))
 				break;
 		}
 
-		if (!pipe_full(head, tail, max_usage))
+		if (!pipe_full(head, pipe->tail, max_usage))
 			continue;
 
 		/* Wait for buffer space to become available. */

^ permalink raw reply related

* [RFC PATCH 09/10] pipe: Remove redundant wakeup from pipe_write() [ver #2]
From: David Howells @ 2019-10-23 20:18 UTC (permalink / raw)
  To: torvalds
  Cc: dhowells, Rasmus Villemoes, Greg Kroah-Hartman, Peter Zijlstra,
	nicolas.dichtel, raven, Christian Brauner
In-Reply-To: <157186182463.3995.13922458878706311997.stgit@warthog.procyon.org.uk>

Remove a redundant wakeup from pipe_write().

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/pipe.c |    5 -----
 1 file changed, 5 deletions(-)

diff --git a/fs/pipe.c b/fs/pipe.c
index 1bfad2212b95..3df93990dd9d 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -516,11 +516,6 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from)
 				ret = -ERESTARTSYS;
 			break;
 		}
-		if (do_wakeup) {
-			wake_up_interruptible_sync_poll(&pipe->wait, EPOLLIN | EPOLLRDNORM);
-			kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
-			do_wakeup = 0;
-		}
 		pipe->waiting_writers++;
 		pipe_wait(pipe);
 		pipe->waiting_writers--;

^ permalink raw reply related

* [RFC PATCH 10/10] pipe: Check for ring full inside of the spinlock in pipe_write() [ver #2]
From: David Howells @ 2019-10-23 20:18 UTC (permalink / raw)
  To: torvalds
  Cc: dhowells, Rasmus Villemoes, Greg Kroah-Hartman, Peter Zijlstra,
	nicolas.dichtel, raven, Christian Brauner
In-Reply-To: <157186182463.3995.13922458878706311997.stgit@warthog.procyon.org.uk>

Make pipe_write() check to see if the ring has become full between it
taking the pipe mutex, checking the ring status and then taking the
spinlock.

This can happen if a notification is written into the pipe as that happens
without the pipe mutex.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/pipe.c |    5 +++++
 1 file changed, 5 insertions(+)

diff --git a/fs/pipe.c b/fs/pipe.c
index 3df93990dd9d..6a982a88f658 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -462,6 +462,11 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from)
 			spin_lock_irq(&pipe->wait.lock);
 
 			head = pipe->head;
+			if (pipe_full(head, pipe->tail, max_usage)) {
+				spin_unlock_irq(&pipe->wait.lock);
+				continue;
+			}
+
 			pipe_commit_write(pipe, head + 1);
 
 			/* Always wake up, even if the copy fails. Otherwise

^ permalink raw reply related

* Re: [PATCH 3/7] Add a UFFD_SECURE flag to the userfaultfd API.
From: Andrea Arcangeli @ 2019-10-23 21:16 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jann Horn, Daniel Colascione, Linus Torvalds, Pavel Emelyanov,
	Lokesh Gidra, Nick Kralevich, Nosh Minwalla, Tim Murray,
	Mike Rapoport, Linux API, LKML, Dr. David Alan Gilbert
In-Reply-To: <CALCETrWY+5ynDct7eU_nDUqx=okQvjm=Y5wJvA4ahBja=CQXGw@mail.gmail.com>

On Wed, Oct 23, 2019 at 12:21:18PM -0700, Andy Lutomirski wrote:
> There are two things going on here.
> 
> 1. Daniel wants to add LSM labels to userfaultfd objects.  This seems
> reasonable to me.  The question, as I understand it, is: who is the
> subject that creates a uffd referring to a forked child?  I'm sure
> this is solvable in any number of straightforward ways, but I think
> it's less important than:

The new uffd created during fork would definitely need to be accounted
on the criu monitor, nor to the parent nor the child, so it'd need to
be accounted to the process/context that has the fd in its file
descriptors array. But since this is less important let's ignore this
for a second.

> 2. The existing ABI is busted independently of #1.  Suppose you call
> userfaultfd to get a userfaultfd and enable UFFD_FEATURE_EVENT_FORK.
> Then you do:
> 
> $ sudo <&[userfaultfd number]
> 
> Sudo will read it and get a new fd unexpectedly added to its fd table.
> It's worse if SCM_RIGHTS is involved.

So the problem is just that a new fd is created. So for this to turn
out to a practical issue, it requires finding a reckless suid that
won't even bother checking the return value of the open/socket
syscalls or some equivalent fd number related side effect. All right
that makes more sense now and of course I agree it needs fixing.

> So I think we either need to declare that UFFD_FEATURE_EVENT_FORK is
> only usable by global root or we need to remove it and maybe re-add it
> in some other form.

If I had a time machine, I'd rather prefer to do the below:

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index fe6d804a38dc..574062051678 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1958,7 +1958,7 @@ SYSCALL_DEFINE1(userfaultfd, int, flags)
 		return -ENOMEM;

 	refcount_set(&ctx->refcount, 1);
-	ctx->flags = flags;
+	ctx->flags = flags | UFFD_CLOEXEC;
 	ctx->features = 0;
 	ctx->state = UFFD_STATE_WAIT_API;
 	ctx->released = false;

I mean there's no strong requirement to allow any uffd to survive exec
even if UFFD_FEATURE_EVENT_FORK never existed, it's enough if it can
be passed through unix domain sockets.

Until UFFD_FEATURE_EVENT_FORK come around, there was no particular
reason to implicitly enforce O_CLOEXEC on all uffd, it was totally
possible to clone() and exec() to pass the fd to a different
process. So it never rang a bell that this would turn out to be a
problem after UFFD_FEATURE_EVENT_FORK was introduced.

There are various ways to approach this:

1) drop all non cooperative features and mark their feature bitflags
   reserved (no ABI break)

2) enforce UFFD_CLOEXEC with above patch (potential ABI break all
   userfaultfd users)

3) enforce UFFD_CLOEXEC if UFFD_FEATURE_EVENT_FORK is set (ABI break
   only if UFFD_FEATURE_EVENT_FORK is set). Note all forked uffd
   are opened with the same flags inherited from the original uffd.

4) enforce the global root permission check when creating the uffd only if
   UFFD_FEATURE_EVENT_FORK is set.

5) drop all non cooperative features from API 0xaa and introduce API
   0xab with the features back, but with UFFD_CLOEXEC implicitly
   enforced and with UFFD_CLOEXEC forbidden to be set in the flags

6) stick to API 0xaa and drop only UFFD_FEATURE_EVENT_FORK, but add a
   UFFD_FEATURE_EVENT_FORK2 that requires UFFD_CLOEXEC to be set
   (instead of implicitly enforcing it)

7) stick to API 0xaa and drop only UFFD_FEATURE_EVENT_FORK, but add a
   UFFD_FEATURE_EVENT_FORK2 that does the global root permission check 

5 is the non-ABI-break version of 2.

6 is the non-ABI-break version of 3.

7 is the non-ABI-break version of 4.

My favorite is 1) for the reason explained in the previous email.

However if postcopy live migration of bare metal containers already
runs in production anywhere or is at least very close to reach that
milestone or if the non-cooperative features are used in production in
any other way, we'd like to know where and in such case that will
totally change my mind about it. In such case I'd suggest to pick any
of the other options except 1).

In short there shall be good reason for going through further
maintenance burden.

Thanks,
Andrea

^ permalink raw reply related

* Re: [PATCH 3/7] Add a UFFD_SECURE flag to the userfaultfd API.
From: Andy Lutomirski @ 2019-10-23 21:25 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andy Lutomirski, Jann Horn, Daniel Colascione, Linus Torvalds,
	Pavel Emelyanov, Lokesh Gidra, Nick Kralevich, Nosh Minwalla,
	Tim Murray, Mike Rapoport, Linux API, LKML,
	Dr. David Alan Gilbert
In-Reply-To: <20191023211645.GC9902@redhat.com>

On Wed, Oct 23, 2019 at 2:16 PM Andrea Arcangeli <aarcange@redhat.com> wrote:
>
> On Wed, Oct 23, 2019 at 12:21:18PM -0700, Andy Lutomirski wrote:
> > There are two things going on here.
> >
> > 1. Daniel wants to add LSM labels to userfaultfd objects.  This seems
> > reasonable to me.  The question, as I understand it, is: who is the
> > subject that creates a uffd referring to a forked child?  I'm sure
> > this is solvable in any number of straightforward ways, but I think
> > it's less important than:
>
> The new uffd created during fork would definitely need to be accounted
> on the criu monitor, nor to the parent nor the child, so it'd need to
> be accounted to the process/context that has the fd in its file
> descriptors array. But since this is less important let's ignore this
> for a second.
>
> > 2. The existing ABI is busted independently of #1.  Suppose you call
> > userfaultfd to get a userfaultfd and enable UFFD_FEATURE_EVENT_FORK.
> > Then you do:
> >
> > $ sudo <&[userfaultfd number]
> >
> > Sudo will read it and get a new fd unexpectedly added to its fd table.
> > It's worse if SCM_RIGHTS is involved.
>
> So the problem is just that a new fd is created. So for this to turn
> out to a practical issue, it requires finding a reckless suid that
> won't even bother checking the return value of the open/socket
> syscalls or some equivalent fd number related side effect. All right
> that makes more sense now and of course I agree it needs fixing.

Or it requires a long-lived daemon that receives fds over SCM_RIGHTS
and reads from them.

>
> > So I think we either need to declare that UFFD_FEATURE_EVENT_FORK is
> > only usable by global root or we need to remove it and maybe re-add it
> > in some other form.
>
> If I had a time machine, I'd rather prefer to do the below:
>
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index fe6d804a38dc..574062051678 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -1958,7 +1958,7 @@ SYSCALL_DEFINE1(userfaultfd, int, flags)
>                 return -ENOMEM;
>
>         refcount_set(&ctx->refcount, 1);
> -       ctx->flags = flags;
> +       ctx->flags = flags | UFFD_CLOEXEC;

That doesn't solve the problem.  With your time machine, you should
instead use ioctl() or recvmsg().

>
> 4) enforce the global root permission check when creating the uffd only if
>    UFFD_FEATURE_EVENT_FORK is set.

This could work, but we should also add a better way to do
UFFD_FEATURE_EVENT_FORK and get CRIU to start using it.  If CRIU is
the only user, we can probably drop the old ABI after a couple of
releases, since as far as I know, CRIU users need to upgrade their
CRIU more or less in sync with the kernel so that new kernel features
get checkpointed and restored.

^ permalink raw reply

* Re: [PATCH 3/7] Add a UFFD_SECURE flag to the userfaultfd API.
From: Andrea Arcangeli @ 2019-10-23 22:41 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jann Horn, Daniel Colascione, Linus Torvalds, Pavel Emelyanov,
	Lokesh Gidra, Nick Kralevich, Nosh Minwalla, Tim Murray,
	Mike Rapoport, Linux API, LKML, Dr. David Alan Gilbert
In-Reply-To: <CALCETrVS_Ym9wpvTP-ys-OBKCgg7QQjPdhJZe5YXJ6e8JQkNQQ@mail.gmail.com>

On Wed, Oct 23, 2019 at 02:25:35PM -0700, Andy Lutomirski wrote:
> That doesn't solve the problem.  With your time machine, you should

Would you elaborate what problem remains if execve closes all uffd
so that read() cannot run post execve?

> instead use ioctl() or recvmsg().

The event delivery is modeled after eventfd.c per userfaultfd.c header
file, so would then eventfd also need to be converted to ioctl or
recvmsg to deliver its event any better? Initially I evaluated to use
eventfd for it in fact, but it wasn't possible. I didn't look like it
could get any better than eventfd in terms of event delivery.

Or do you refer to single out only the delivery of the UFFD_EVENT_FORK
event not through read()?

> > 4) enforce the global root permission check when creating the uffd only if
> >    UFFD_FEATURE_EVENT_FORK is set.
> 
> This could work, but we should also add a better way to do
> UFFD_FEATURE_EVENT_FORK and get CRIU to start using it.  If CRIU is
> the only user, we can probably drop the old ABI after a couple of
> releases, since as far as I know, CRIU users need to upgrade their
> CRIU more or less in sync with the kernel so that new kernel features
> get checkpointed and restored.

Getting CRIU stat using it shouldn't be a problem at all, but we'll be
back to square one if you just stop there.

I don't see how to lift those limitations in the wiki to make it
usable in production by just providing a better way to do
UFFD_FEATURE_EVENT_FORK.

If you're volunteering to fix the limitations and make CRIU usable in
production that would be awesome, then of course we should do whatever
possible to improve UFFD_FEATURE_EVENT_FORK.

Thanks,
Andrea

^ permalink raw reply

* Re: [PATCH 3/7] Add a UFFD_SECURE flag to the userfaultfd API.
From: Andy Lutomirski @ 2019-10-23 23:01 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andy Lutomirski, Jann Horn, Daniel Colascione, Linus Torvalds,
	Pavel Emelyanov, Lokesh Gidra, Nick Kralevich, Nosh Minwalla,
	Tim Murray, Mike Rapoport, Linux API, LKML,
	Dr. David Alan Gilbert
In-Reply-To: <20191023224110.GE9902@redhat.com>



> On Oct 23, 2019, at 3:41 PM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> 
> On Wed, Oct 23, 2019 at 02:25:35PM -0700, Andy Lutomirski wrote:
>> That doesn't solve the problem.  With your time machine, you should
> 
> Would you elaborate what problem remains if execve closes all uffd
> so that read() cannot run post execve?
> 

fcntl() can clear the CLOEXEC flag. And CLOEXEC has no effect on SCM_RIGHTS.

>> instead use ioctl() or recvmsg().
> 
> The event delivery is modeled after eventfd.c per userfaultfd.c header
> file, so would then eventfd also need to be converted to ioctl or
> recvmsg to deliver its event any better? Initially I evaluated to use
> eventfd for it in fact, but it wasn't possible. I didn't look like it
> could get any better than eventfd in terms of event delivery.
> 
> Or do you refer to single out only the delivery of the UFFD_EVENT_FORK
> event not through read()?

Delivering events through read() is just fine. The problem is when delivering an event does more than just returning bytes. As far as I’ve noticed, uffd’s read() just returns bytes as long as FORK is disabled.

^ permalink raw reply

* Re: [PATCH 3/7] Add a UFFD_SECURE flag to the userfaultfd API.
From: Andrea Arcangeli @ 2019-10-23 23:27 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, Jann Horn, Daniel Colascione, Linus Torvalds,
	Pavel Emelyanov, Lokesh Gidra, Nick Kralevich, Nosh Minwalla,
	Tim Murray, Mike Rapoport, Linux API, LKML,
	Dr. David Alan Gilbert
In-Reply-To: <EB0634BD-AAF4-4805-8178-30FFA94B7B58@amacapital.net>

On Wed, Oct 23, 2019 at 04:01:53PM -0700, Andy Lutomirski wrote:
> Delivering events through read() is just fine. The problem is when
> delivering an event does more than just returning bytes. As far as
> I’ve noticed, uffd’s read() just returns bytes as long as FORK is
> disabled.

Yes, fork is the only case where read doesn't return bytes.

Moving only the fd creation to a separate syscall would then avoid
involuntary creation of the fd.

^ permalink raw reply

* Re: [PATCH 3/7] Add a UFFD_SECURE flag to the userfaultfd API.
From: Andrea Arcangeli @ 2019-10-24  0:23 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: Andy Lutomirski, Jann Horn, Linus Torvalds, Pavel Emelyanov,
	Lokesh Gidra, Nick Kralevich, Nosh Minwalla, Tim Murray,
	Mike Rapoport, Linux API, LKML
In-Reply-To: <CAKOZuetKkM=PK2QA8LdXwM8cM8qJvFu4u5bjePWai3XRnHe-pA@mail.gmail.com>

On Wed, Oct 23, 2019 at 01:05:47PM -0700, Daniel Colascione wrote:
> This is a debate that won't get resolved here. A ton of work has gone
> into namespaces, migration, various cgroup things, and so on, and I
> don't see that work getting torn out.

This is precisely why I thought it was a good idea to support the
non-cooperative use case too even though we had no immediate use for
it.

> Sure they can. Can't we stick processes in a memcg and set a
> memory.high threshold beyond which threads in that cgroup will enter
> direct reclaim on page allocations? I'd call that throttling.

The uffd-wp solution during the throttling can resolve a wrprotect
fault in the parent for every 4k page that has been written to disk
and it'll prioritize writing to disk those userfaults that are
currently blocked. I don't see how you could reach an equivalent
optimal runtime without uffd-wp and just with memcg because the
snapshot process won't have a clue which pages are been duped by the
COWs. The uffd-wp by avoding fork will also avoid more expensive MM
switches during the snapshot.

> This issue *has* to get fixed one way or another.

Sure.

^ permalink raw reply

* Re: [PATCHv7 18/33] lib/vdso: Add unlikely() hint into vdso_read_begin()
From: Andrei Vagin @ 2019-10-24  6:13 UTC (permalink / raw)
  To: Vincenzo Frascino
  Cc: Dmitry Safonov, linux-kernel, Dmitry Safonov, Adrian Reber,
	Andrei Vagin, Andy Lutomirski, Arnd Bergmann, Christian Brauner,
	Cyrill Gorcunov, Eric W. Biederman, H. Peter Anvin, Ingo Molnar,
	Jann Horn, Jeff Dike, Oleg Nesterov, Pavel Emelyanov, Shuah Khan,
	Thomas Gleixner, containers, criu, linux-api, x86
In-Reply-To: <100f6921-9081-7eb0-7acc-f10cfb647c21@arm.com>

[-- Attachment #1: Type: text/plain, Size: 1415 bytes --]

On Wed, Oct 16, 2019 at 12:24:14PM +0100, Vincenzo Frascino wrote:
> On 10/11/19 2:23 AM, Dmitry Safonov wrote:
> > From: Andrei Vagin <avagin@gmail.com>
> > 
> > Place the branch with no concurrent write before contended case.
> > 
> > Performance numbers for Intel(R) Core(TM) i5-6300U CPU @ 2.40GHz
> > (more clock_gettime() cycles - the better):
> >         | before    | after
> > -----------------------------------
> >         | 150252214 | 153242367
> >         | 150301112 | 153324800
> >         | 150392773 | 153125401
> >         | 150373957 | 153399355
> >         | 150303157 | 153489417
> >         | 150365237 | 153494270
> > -----------------------------------
> > avg     | 150331408 | 153345935
> > diff %  | 2	    | 0
> > -----------------------------------
> > stdev % | 0.3	    | 0.1
> > 
> > Signed-off-by: Andrei Vagin <avagin@gmail.com>
> > Co-developed-by: Dmitry Safonov <dima@arista.com>
> > Signed-off-by: Dmitry Safonov <dima@arista.com>
> 
> Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
> Tested-by: Vincenzo Frascino <vincenzo.frascino@arm.com>

Hello Vincenzo,

Could you test the attached patch on aarch64? On x86, it gives about 9%
performance improvement for CLOCK_MONOTONIC and CLOCK_BOOTTIME.

Here is my test:
https://github.com/avagin/vdso-perf

It is calling clock_gettime() in a loop for three seconds and then
reports a number of iterations.

Thanks,
Andrei

[-- Attachment #2: 0001-lib-vdso-make-do_hres-and-do_coarse-as-__always_inli.patch --]
[-- Type: text/plain, Size: 2208 bytes --]

>From 5252093fec4c74802e5ef501b9f1db3369430c80 Mon Sep 17 00:00:00 2001
From: Andrei Vagin <avagin@gmail.com>
Date: Tue, 22 Oct 2019 18:23:17 -0700
Subject: [PATCH] lib/vdso: make do_hres and do_coarse as __always_inline

Performance numbers for Intel(R) Core(TM) i5-6300U CPU @ 2.40GHz
(more clock_gettime() cycles - the better):

clock            | before     | after      | diff
----------------------------------------------------------
monotonic        |  153222105 |  166775025 | 8.8%
monotonic-coarse |  671557054 |  691513017 | 3.0%
monotonic-raw    |  147116067 |  161057395 | 9.5%
boottime         |  153446224 |  166962668 | 9.1%

Signed-off-by: Andrei Vagin <avagin@gmail.com>
---
 lib/vdso/gettimeofday.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/lib/vdso/gettimeofday.c b/lib/vdso/gettimeofday.c
index e630e7ff57f1..b4f7f0f246af 100644
--- a/lib/vdso/gettimeofday.c
+++ b/lib/vdso/gettimeofday.c
@@ -38,7 +38,7 @@ u64 vdso_calc_delta(u64 cycles, u64 last, u64 mask, u32 mult)
 }
 #endif
 
-static int do_hres(const struct vdso_data *vd, clockid_t clk,
+static __always_inline int do_hres(const struct vdso_data *vd, clockid_t clk,
 		   struct __kernel_timespec *ts)
 {
 	const struct vdso_timestamp *vdso_ts = &vd->basetime[clk];
@@ -68,7 +68,7 @@ static int do_hres(const struct vdso_data *vd, clockid_t clk,
 	return 0;
 }
 
-static void do_coarse(const struct vdso_data *vd, clockid_t clk,
+static __always_inline void do_coarse(const struct vdso_data *vd, clockid_t clk,
 		      struct __kernel_timespec *ts)
 {
 	const struct vdso_timestamp *vdso_ts = &vd->basetime[clk];
@@ -97,12 +97,16 @@ __cvdso_clock_gettime_common(clockid_t clock, struct __kernel_timespec *ts)
 	 */
 	msk = 1U << clock;
 	if (likely(msk & VDSO_HRES)) {
-		return do_hres(&vd[CS_HRES_COARSE], clock, ts);
+		vd = &vd[CS_HRES_COARSE];
+out_hres:
+		return do_hres(vd, clock, ts);
 	} else if (msk & VDSO_COARSE) {
 		do_coarse(&vd[CS_HRES_COARSE], clock, ts);
 		return 0;
 	} else if (msk & VDSO_RAW) {
-		return do_hres(&vd[CS_RAW], clock, ts);
+		vd = &vd[CS_RAW];
+		/* goto allows to avoid extra inlining of do_hres. */
+		goto out_hres;
 	}
 	return -1;
 }
-- 
2.21.0


^ permalink raw reply related

* Re: [PATCH v14 2/6] namei: LOOKUP_IN_ROOT: chroot-like path resolution
From: Aleksa Sarai @ 2019-10-24  7:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	David Howells, Shuah Khan, Shuah Khan, Ingo Molnar,
	Peter Zijlstra, Eric Biederman, Andy Lutomirski, Andrew Morton,
	Alexei Starovoitov, Kees Cook, Jann Horn, Tycho Andersen,
	David Drysdale, Chanho Min, Oleg Nesterov, Rasmus Villemoes,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Christian Brauner
In-Reply-To: <20191012041541.milbmfbjpj5bcl5a@yavin.dot.cyphar.com>

[-- Attachment #1: Type: text/plain, Size: 4614 bytes --]

On 2019-10-12, Aleksa Sarai <cyphar@cyphar.com> wrote:
> On 2019-10-12, Aleksa Sarai <cyphar@cyphar.com> wrote:
> > On 2019-10-10, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > > On Wed, Oct 9, 2019 at 10:42 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
> > > >
> > > > --- a/fs/namei.c
> > > > +++ b/fs/namei.c
> > > > @@ -2277,6 +2277,11 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
> > > >
> > > >         nd->m_seq = read_seqbegin(&mount_lock);
> > > >
> > > > +       /* LOOKUP_IN_ROOT treats absolute paths as being relative-to-dirfd. */
> > > > +       if (flags & LOOKUP_IN_ROOT)
> > > > +               while (*s == '/')
> > > > +                       s++;
> > > > +
> > > >         /* Figure out the starting path and root (if needed). */
> > > >         if (*s == '/') {
> > > >                 error = nd_jump_root(nd);
> > > 
> > > Hmm. Wouldn't this make more sense all inside the if (*s =- '/') test?
> > > That way if would be where we check for "should we start at the root",
> > > which seems to make more sense conceptually.
> > 
> > I don't really agree (though I do think that both options are pretty
> > ugly). Doing it before the block makes it clear that absolute paths are
> > just treated relative-to-dirfd -- doing it inside the block makes it
> > look more like "/" is a special-case for nd_jump_root(). And while that
> 
> Sorry, I meant "special-case for LOOKUP_IN_ROOT".
> 
> > is somewhat true, this is just a side-effect of making the code more
> > clean -- my earlier versions reworked the dirfd handling to always grab
> > nd->root first if LOOKUP_IS_SCOPED. I switched to this method based on
> > Al's review.
> > 
> > In fairness, I do agree that the lonely while loop looks ugly.
> 
> And with the old way I did it (where we grabbed nd->root first) the
> semantics were slightly more clear -- stripping leading "/"s doesn't
> really look as "clearly obvious" as grabbing nd->root beforehand and
> treating "/"s normally. But the code was also needlessly more complex.
> 
> > > That test for '/' currently has a "} else if (..)", but that's
> > > pointless since it ends with a "return" anyway. So the "else" logic is
> > > just noise.
> > 
> > This depends on the fact that LOOKUP_BENEATH always triggers -EXDEV for
> > nd_jump_root() -- if we ever add another "scoped lookup" flag then the
> > logic will have to be further reworked.
> > 
> > (It should be noted that the new version doesn't always end with a
> > "return", but you could change it to act that way given the above
> > assumption.)
> > 
> > > And if you get rid of the unnecessary else, moving the LOOKUP_IN_ROOT
> > > inside the if-statement works fine.
> > > 
> > > So this could be something like
> > > 
> > >     --- a/fs/namei.c
> > >     +++ b/fs/namei.c
> > >     @@ -2194,11 +2196,19 @@ static const char *path_init(struct
> > > nameidata *nd, unsigned flags)
> > > 
> > >         nd->m_seq = read_seqbegin(&mount_lock);
> > >         if (*s == '/') {
> > >     -           set_root(nd);
> > >     -           if (likely(!nd_jump_root(nd)))
> > >     -                   return s;
> > >     -           return ERR_PTR(-ECHILD);
> > >     -   } else if (nd->dfd == AT_FDCWD) {
> > >     +           /* LOOKUP_IN_ROOT treats absolute paths as being
> > > relative-to-dirfd. */
> > >     +           if (!(flags & LOOKUP_IN_ROOT)) {
> > >     +                   set_root(nd);
> > >     +                   if (likely(!nd_jump_root(nd)))
> > >     +                           return s;
> > >     +                   return ERR_PTR(-ECHILD);
> > >     +           }
> > >     +
> > >     +           /* Skip initial '/' for LOOKUP_IN_ROOT */
> > >     +           do { s++; } while (*s == '/');
> > >     +   }
> > >     +
> > >     +   if (nd->dfd == AT_FDCWD) {
> > >                 if (flags & LOOKUP_RCU) {
> > >                         struct fs_struct *fs = current->fs;
> > >                         unsigned seq;
> > > 
> > > instead. The patch ends up slightly bigger (due to the re-indentation)
> > > but now it handles all the "start at root" in the same place. Doesn't
> > > that make sense?
> > 
> > It is correct (though I'd need to clean it up a bit to handle
> > nd_jump_root() correctly), and if you really would like me to change it
> > I will -- but I just don't agree that it's cleaner.

Linus, did you still want me to make your proposed change?

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply

* Re: [PATCH 3/7] Add a UFFD_SECURE flag to the userfaultfd API.
From: Mike Rapoport @ 2019-10-24  9:02 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andy Lutomirski, Jann Horn, Daniel Colascione, Linus Torvalds,
	Pavel Emelyanov, Lokesh Gidra, Nick Kralevich, Nosh Minwalla,
	Tim Murray, Mike Rapoport, Linux API, LKML
In-Reply-To: <20191023190959.GA9902@redhat.com>

On Wed, Oct 23, 2019 at 03:09:59PM -0400, Andrea Arcangeli wrote:
> Hello,
> 
> On Sat, Oct 12, 2019 at 06:14:23PM -0700, Andy Lutomirski wrote:
> > [adding more people because this is going to be an ABI break, sigh]
> 
> That wouldn't break the ABI, no more than when if you boot a kernel
> built with CONFIG_USERFAULTFD=n.
> 
> All non-cooperative features can be removed any time in a backwards
> compatible way, the only precaution is to mark their feature bits as
> reserved so they can't be reused for something else later.
> 
> > least severely restricted.  A .read implementation MUST NOT ACT ON THE
> > CALLING TASK.  Ever.  Just imagine the effect of passing a userfaultfd
> > as stdin to a setuid program.
> 
> With UFFD_EVENT_FORK, the newly created uffd that controls the child,
> is not passed to the parent nor to the child. Instead it's passed to
> the CRIU monitor only, which has to be already running as root and is
> fully trusted and acts a hypervisor (despite there is no hypervisor).
> 
> By the time execve runs and any suid bit in the execve'd inode becomes
> relevant, well before the new userland executable code can run, the
> kernel throws away the "old_mm" controlled by any uffd and all
> attached uffds are released as well.
> 
> All I found is your "A .read implementation MUST NOT ACT ON THE
> CALLING TASK" as an explanation that something is broken but I need
> further clarification.
> 
> Of course I can see you can always open a uffd and pass it to any task
> you are going to execve on, but that simply means the suid program
> will be able to control you, not the other way around. If you don't
> want to be controlled by the next task, no matter if suid or not, just
> don't that. What I don't see is how you're going to control the suid
> binary from the outside, the suid binary at most will block in the
> poll, read and write syscalls and get garbage or write some garbage
> and get an error, it won't get signals and it cannot block in any page
> fault either, it's not immediately clear what's out of ordinary.
> 
> On Mon, Oct 14, 2019 at 06:04:22PM +0200, Jann Horn wrote:
> > FWIW, <https://codesearch.debian.net/search?q=UFFD_FEATURE_EVENT_FORK&literal=1>
> > just shows the kernel, kernel selftests, and strace code for decoding
> > syscall arguments. CRIU uses it though (probably for postcopy live
> > migration / lazy migration?), I guess that code isn't in debian for
> > some reason.
> 
> https://criu.org/Userfaultfd#Limitations

That's no the reason that UFFD_FEATURE_EVENT_FORK does not show up in
Debian code search, CRIU simply is not there. Debian packages CRIU only in
experimental and I believe that's not indexed by the code search.

As for the limitations, the races were fixed, I just forgot to update the
wiki. As for the supported memory types and COW pages, these only affect
efficiency of post-copy, but not the correctness.
 
> The CRIU developers did a truly amazing job by making container post
> copy live migration work great for a subset of apps, that alone was an
> amazing achievement. Is that achievement enough to use post copy live
> migration of bare metal containers in production? Unfortunately
> probably not and not just in debian.
 
I don't know if anybody is using post-copy migration of containers in
production, but I don't think that the reason for that would be technical.
IMHO it's more about prevailing perception that there is no need to migrate
containers at all, not only with post-copy, and, as the result, slow rate
of adoption of container migration in general.

> If you're wrong and UFFDIO_EVENT_FORK isn't currently buggy and in
> turn it isn't causing further maintenance burden, there is no hurry of
> removing them, but in the long term, if none of the non-cooperative
> features find its way in production (like it was reasonable to expect
> initially), they must be removed from the kernel anyway, not just
> UFFD_EVEN_FORK but all non-cooperative features associated with it.

... 
 
> On my side, instead of trying to fix whatever issue in
> UFFD_EVENT_FORK, I'd prefer to spend my time reviewing the uffd-wp
> feature from Peter and the page fault enhancement patchset that Peter
> and Linus were discussing. uffd-wp has the potential to drop fork()
> from all apps calling fork() only to do an atomic snapshot of their
> memory. Replacing fork() also means the uffd manager thread can decide
> how much memory to reserve to the snapshot and it can start throttling
> waiting for I/O completion if the threshold is exceeded, while fork
> COWs cannot throttle and all apps using fork() risk to hit on x2
> memory usage which can become oom-killer material if the memory size
> of the process is huge. The side benefit is also that the way
> userfaultfd works the fault granularity is entirely in control of
> userland (because it's always userland that resolves the fault), it
> could decide to use 8k or 16k even if that doesn't match the hardware
> page size. That will allow to keep THP on without risking to hit on 2M
> cows during the snapshot. Being able to keep THP enabled in nosql db
> without hitting on slow 2M COW copies during snapshot, should allow a
> further overall performance improvement when the snapshot is not
> running than what it is possible today. In a completely different use
> case, uffd-wp will also avoid JITs to set a dirty bit every time they
> modify any data in memory. It should also be possible to provide the
> same soft-dirty information in O(1) instead of O(N).

If I remember correctly, there was an intention to deprecate soft-dirty in
favor of uffd-wp, which brings us back to the necessity to have
non-cooperative uffd because otherwise even pre-copy in CRIU will be broken
and that *is* used in production.
 
> Thanks,
> Andrea
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply

* Re: [PATCHv7 18/33] lib/vdso: Add unlikely() hint into vdso_read_begin()
From: Vincenzo Frascino @ 2019-10-24  9:30 UTC (permalink / raw)
  To: Andrei Vagin
  Cc: Dmitry Safonov, linux-kernel, Dmitry Safonov, Adrian Reber,
	Andrei Vagin, Andy Lutomirski, Arnd Bergmann, Christian Brauner,
	Cyrill Gorcunov, Eric W. Biederman, H. Peter Anvin, Ingo Molnar,
	Jann Horn, Jeff Dike, Oleg Nesterov, Pavel Emelyanov, Shuah Khan,
	Thomas Gleixner, containers, criu, linux-api, x86
In-Reply-To: <20191024061311.GA4541@gmail.com>

Hi Andrei,

On 10/24/19 7:13 AM, Andrei Vagin wrote:
> On Wed, Oct 16, 2019 at 12:24:14PM +0100, Vincenzo Frascino wrote:
>> On 10/11/19 2:23 AM, Dmitry Safonov wrote:
>>> From: Andrei Vagin <avagin@gmail.com>
>>>
>>> Place the branch with no concurrent write before contended case.
>>>
>>> Performance numbers for Intel(R) Core(TM) i5-6300U CPU @ 2.40GHz
>>> (more clock_gettime() cycles - the better):
>>>         | before    | after
>>> -----------------------------------
>>>         | 150252214 | 153242367
>>>         | 150301112 | 153324800
>>>         | 150392773 | 153125401
>>>         | 150373957 | 153399355
>>>         | 150303157 | 153489417
>>>         | 150365237 | 153494270
>>> -----------------------------------
>>> avg     | 150331408 | 153345935
>>> diff %  | 2	    | 0
>>> -----------------------------------
>>> stdev % | 0.3	    | 0.1
>>>
>>> Signed-off-by: Andrei Vagin <avagin@gmail.com>
>>> Co-developed-by: Dmitry Safonov <dima@arista.com>
>>> Signed-off-by: Dmitry Safonov <dima@arista.com>
>>
>> Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
>> Tested-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
> 
> Hello Vincenzo,
> 
> Could you test the attached patch on aarch64? On x86, it gives about 9%
> performance improvement for CLOCK_MONOTONIC and CLOCK_BOOTTIME.
> 

I did run similar tests in past with a previous version of the unified vDSO
library and what I can tell based on the results of those is that the impact of
"__always_inline" alone was around 7% on arm64, in fact I had a comment stating
"To improve performances, in this file, __always_inline it is used for the
functions called multiple times." in my implementation [1].

[1] https://bit.ly/2W9zMxB

I spent some time yesterday trying to dig out why the approach did not make the
cut but I could not infer it from the review process.

> Here is my test:
> https://github.com/avagin/vdso-perf
> 
> It is calling clock_gettime() in a loop for three seconds and then
> reports a number of iterations.
> 

I am happy to run the test on arm64 and provide some results.

> Thanks,
> Andrei
> 

-- 
Regards,
Vincenzo

^ permalink raw reply

* Re: [RFC PATCH 04/10] pipe: Use head and tail pointers for the ring, not cursor and length [ver #2]
From: David Howells @ 2019-10-24 10:32 UTC (permalink / raw)
  To: torvalds
  Cc: dhowells, Rasmus Villemoes, Greg Kroah-Hartman, Peter Zijlstra,
	nicolas.dichtel, raven, Christian Brauner, keyrings, linux-usb,
	linux-block, linux-security-module, linux-fsdevel, linux-api,
	linux-kernel
In-Reply-To: <157186186167.3995.7568100174393739543.stgit@warthog.procyon.org.uk>

I've pushed to git a new version that fixes an incomplete conversion in
pipe_zero(), ports the powerpc virtio_console driver and fixes a comment in
splice.

David

^ permalink raw reply

* Re: [PATCH] mm: allow unmapped hole at head side of mbind range
From: Vlastimil Babka @ 2019-10-24 10:48 UTC (permalink / raw)
  To: Li Xinhai, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Linux API
In-Reply-To: <TY2PR04MB29753892EBAD17D9E8FECBAEE86A0@TY2PR04MB2975.apcprd04.prod.outlook.com>

+ linux-api

On 10/24/19 9:35 AM, Li Xinhai wrote:
> From: Li Xinhai  <xinhai.li@outlook.com>
> 
> mbind_range silently ignore unmapped hole at middle and tail of the 
> specified range, but report EFAULT if hole at head side.


Hmm that's unfortunate. mbind() manpage says:

EFAULT Part or all of the memory range specified by nodemask and maxnode
points outside your accessible address space.  Or, there was an unmapped
hole in  the  specified  memory range specified by addr and len.

That sounds like any hole inside the specified range should return
EFAULT. But perhaps it can be also interpreted as you suggest, that the
whole range is an unmapped hole. There's some risk of breaking existing
userspace if we change it either way.

> It is more reasonable to support silently ignore holes at any part of 
> the range, only report EFAULT if the whole range is in hole.
> 
> Signed-off-by: Li Xinhai <xinhai.li@outlook.com>
> ---
> 
>  mm/mempolicy.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> 
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 4ae967bcf954..ae160d9936d9 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -738,7 +738,7 @@ static int mbind_range(struct mm_struct *mm, unsigned long start,
>         unsigned long vmend;
>  
>         vma = find_vma(mm, start);
> -       if (!vma || vma->vm_start > start)
> +       if (!vma || vma->vm_start >= end)
>                 return -EFAULT;
>  
>         prev = vma->vm_prev;
> 

^ permalink raw reply

* Re: [PATCHv7 18/33] lib/vdso: Add unlikely() hint into vdso_read_begin()
From: Vincenzo Frascino @ 2019-10-24 13:14 UTC (permalink / raw)
  To: Andrei Vagin
  Cc: Dmitry Safonov, linux-kernel, Dmitry Safonov, Adrian Reber,
	Andrei Vagin, Andy Lutomirski, Arnd Bergmann, Christian Brauner,
	Cyrill Gorcunov, Eric W. Biederman, H. Peter Anvin, Ingo Molnar,
	Jann Horn, Jeff Dike, Oleg Nesterov, Pavel Emelyanov, Shuah Khan,
	Thomas Gleixner, containers, criu, linux-api, x86
In-Reply-To: <9aa9857e-ee1c-0117-bfcb-45fc6bcab866@arm.com>

Hi Andrei,

On 10/24/19 10:30 AM, Vincenzo Frascino wrote:
[...]
>> Here is my test:
>> https://github.com/avagin/vdso-perf
>>
>> It is calling clock_gettime() in a loop for three seconds and then
>> reports a number of iterations.
>>
> 
> I am happy to run the test on arm64 and provide some results.
>

Please find below the results:

clock            | before     | after      | diff
==================================================
monotonic          17326692     17951770     3.6%
monotonic-coarse   43624027     44215292     1.3%
monotonic-raw      17541809     17554932     0.1%
boottime           17334982     17954361     3.5%

The improvement for arm64 for monotonic and boottime is around 3.5%.

I created a pull request on github to send to you back the changes I had
to do to vdso-perf to cross-compile it on arm64 [1].

[1] https://github.com/avagin/vdso-perf/pull/1


>> Thanks,
>> Andrei
>>
> 

-- 
Regards,
Vincenzo

^ permalink raw reply

* Re: [RFC PATCH 00/10] pipe: Notification queue preparation [ver #2]
From: Peter Zijlstra @ 2019-10-24 13:14 UTC (permalink / raw)
  To: David Howells
  Cc: torvalds, Rasmus Villemoes, Greg Kroah-Hartman, nicolas.dichtel,
	raven, Christian Brauner, keyrings, linux-usb, linux-block,
	linux-security-module, linux-fsdevel, linux-api, linux-kernel
In-Reply-To: <157186182463.3995.13922458878706311997.stgit@warthog.procyon.org.uk>

On Wed, Oct 23, 2019 at 09:17:04PM +0100, David Howells wrote:

>  (1) It removes the nr_exclusive argument from __wake_up_sync_key() as this
>      is always 1.  This prepares for step 2.
> 
>  (2) Adds wake_up_interruptible_sync_poll_locked() so that poll can be
>      woken up from a function that's holding the poll waitqueue spinlock.

>  include/linux/wait.h       |   11 +-
>  kernel/sched/wait.c        |   37 ++++--
> 

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

^ permalink raw reply

* Re: [PATCH 3/7] Add a UFFD_SECURE flag to the userfaultfd API.
From: Andrea Arcangeli @ 2019-10-24 15:10 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Andy Lutomirski, Jann Horn, Daniel Colascione, Linus Torvalds,
	Pavel Emelyanov, Lokesh Gidra, Nick Kralevich, Nosh Minwalla,
	Tim Murray, Mike Rapoport, Linux API, LKML
In-Reply-To: <20191024090258.GA9802@linux.ibm.com>

Hello,

On Thu, Oct 24, 2019 at 12:02:59PM +0300, Mike Rapoport wrote:
> That's no the reason that UFFD_FEATURE_EVENT_FORK does not show up in
> Debian code search, CRIU simply is not there. Debian packages CRIU only in
> experimental and I believe that's not indexed by the code search.
> 
> As for the limitations, the races were fixed, I just forgot to update the
> wiki. As for the supported memory types and COW pages, these only affect
> efficiency of post-copy, but not the correctness.

That's what I was hoping for. If the wiki information is stale and
there are no races it is totally plausible that it's being actively
used in production so we need to fix the kernel bug. I was just
checking because I wasn't sure anymore of the status after I read the
wiki.

If the CRIU initialization code that issues the uffd syscall runs as
global root the ABI breaking permission check from Andy sounds the
simplest for a short term fix, because it will be unnoticed by any
production usage with CIRU --lazy-pages.

Then later we could add a UFFD_FEATURE_EVENT_FORK2 that will not
require root permission.

Thanks,
Andrea

^ permalink raw reply

* [RFC PATCH 11/10] pipe: Add fsync() support [ver #2]
From: David Howells @ 2019-10-24 16:57 UTC (permalink / raw)
  To: torvalds
  Cc: dhowells, Rasmus Villemoes, Greg Kroah-Hartman, Peter Zijlstra,
	nicolas.dichtel, raven, Christian Brauner, keyrings, linux-usb,
	linux-block, linux-security-module, linux-fsdevel, linux-api,
	linux-kernel
In-Reply-To: <157186182463.3995.13922458878706311997.stgit@warthog.procyon.org.uk>

pipe: Add fsync() support

The keyrings testsuite needs the ability to wait for all the outstanding
notifications in the queue to have been processed so that it can then go
through them to find out whether the notifications it expected have been
emitted.

Implement fsync() support for pipes to provide this.  The tailmost buffer
at the point of calling is marked and fsync adds itself to the list of
waiters, noting the tail position to be waited for and marking the buffer
as no longer mergeable.  Then when the buffer is consumed, if the flag is
set, any matching waiters are woken up.

Signed-off-by: David Howells <dhowells@redhat.com>
---
 fs/fuse/dev.c             |    1 
 fs/pipe.c                 |   61 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/splice.c               |    3 ++
 include/linux/pipe_fs_i.h |   22 ++++++++++++++++
 lib/iov_iter.c            |    2 -
 5 files changed, 88 insertions(+), 1 deletion(-)


diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 5ef57a322cb8..9617a35579cb 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -1983,6 +1983,7 @@ static ssize_t fuse_dev_splice_write(struct pipe_inode_info *pipe,
 		if (rem >= ibuf->len) {
 			*obuf = *ibuf;
 			ibuf->ops = NULL;
+			pipe_wake_fsync(pipe, ibuf, tail);
 			tail++;
 			pipe_commit_read(pipe, tail);
 		} else {
diff --git a/fs/pipe.c b/fs/pipe.c
index 6a982a88f658..8e5fd7314be1 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -30,6 +30,12 @@
 
 #include "internal.h"
 
+struct pipe_fsync {
+	struct list_head	link;		/* Link in pipe->fsync */
+	struct completion	done;
+	unsigned int		tail;		/* The buffer being waited for */
+};
+
 /*
  * The max size that a non-root user is allowed to grow the pipe. Can
  * be set by root in /proc/sys/fs/pipe-max-size
@@ -269,6 +275,58 @@ static bool pipe_buf_can_merge(struct pipe_buffer *buf)
 	return buf->ops == &anon_pipe_buf_ops;
 }
 
+/*
+ * Wait for all the data currently in the pipe to be consumed.
+ */
+static int pipe_fsync(struct file *file, loff_t a, loff_t b, int datasync)
+{
+	struct pipe_inode_info *pipe = file->private_data;
+	struct pipe_buffer *buf;
+	struct pipe_fsync fsync;
+	unsigned int head, tail, mask;
+
+	pipe_lock(pipe);
+
+	head = pipe->head;
+	tail = pipe->tail;
+	mask = pipe->ring_size - 1;
+
+	if (pipe_empty(head, tail)) {
+		pipe_unlock(pipe);
+		return 0;
+	}
+
+	init_completion(&fsync.done);
+	fsync.tail = tail;
+	buf = &pipe->bufs[tail & mask];
+	buf->flags |= PIPE_BUF_FLAG_FSYNC;
+	pipe_buf_mark_unmergeable(buf);
+	list_add_tail(&fsync.link, &pipe->fsync);
+	pipe_unlock(pipe);
+
+	if (wait_for_completion_interruptible(&fsync.done) < 0) {
+		pipe_lock(pipe);
+		list_del(&fsync.link);
+		pipe_unlock(pipe);
+		return -EINTR;
+	}
+
+	return 0;
+}
+
+void __pipe_wake_fsync(struct pipe_inode_info *pipe, unsigned int tail)
+{
+	struct pipe_fsync *fsync, *p;
+
+	list_for_each_entry_safe(fsync, p, &pipe->fsync, link) {
+		if (fsync->tail == tail) {
+			list_del_init(&fsync->link);
+			complete(&fsync->done);
+		}
+	}
+}
+EXPORT_SYMBOL(__pipe_wake_fsync);
+
 static ssize_t
 pipe_read(struct kiocb *iocb, struct iov_iter *to)
 {
@@ -325,6 +383,7 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
 			if (!buf->len) {
 				pipe_buf_release(pipe, buf);
 				spin_lock_irq(&pipe->wait.lock);
+				pipe_wake_fsync(pipe, buf, tail);
 				tail++;
 				pipe_commit_read(pipe, tail);
 				do_wakeup = 1;
@@ -717,6 +776,7 @@ struct pipe_inode_info *alloc_pipe_info(void)
 		pipe->ring_size = pipe_bufs;
 		pipe->user = user;
 		mutex_init(&pipe->mutex);
+		INIT_LIST_HEAD(&pipe->fsync);
 		return pipe;
 	}
 
@@ -1060,6 +1120,7 @@ const struct file_operations pipefifo_fops = {
 	.llseek		= no_llseek,
 	.read_iter	= pipe_read,
 	.write_iter	= pipe_write,
+	.fsync		= pipe_fsync,
 	.poll		= pipe_poll,
 	.unlocked_ioctl	= pipe_ioctl,
 	.release	= pipe_release,
diff --git a/fs/splice.c b/fs/splice.c
index 3f72bc31b6ec..e106367e1be6 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -523,6 +523,7 @@ static int splice_from_pipe_feed(struct pipe_inode_info *pipe, struct splice_des
 
 		if (!buf->len) {
 			pipe_buf_release(pipe, buf);
+			pipe_wake_fsync(pipe, buf, tail);
 			tail++;
 			pipe_commit_read(pipe, tail);
 			if (pipe->files)
@@ -771,6 +772,7 @@ iter_file_splice_write(struct pipe_inode_info *pipe, struct file *out,
 				ret -= buf->len;
 				buf->len = 0;
 				pipe_buf_release(pipe, buf);
+				pipe_wake_fsync(pipe, buf, tail);
 				tail++;
 				pipe_commit_read(pipe, tail);
 				if (pipe->files)
@@ -1613,6 +1615,7 @@ static int splice_pipe_to_pipe(struct pipe_inode_info *ipipe,
 			 */
 			*obuf = *ibuf;
 			ibuf->ops = NULL;
+			pipe_wake_fsync(ipipe, ibuf, i_tail);
 			i_tail++;
 			pipe_commit_read(ipipe, i_tail);
 			input_wakeup = true;
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index 90055ff16550..1a3027089558 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -8,6 +8,7 @@
 #define PIPE_BUF_FLAG_ATOMIC	0x02	/* was atomically mapped */
 #define PIPE_BUF_FLAG_GIFT	0x04	/* page is a gift */
 #define PIPE_BUF_FLAG_PACKET	0x08	/* read() as a packet */
+#define PIPE_BUF_FLAG_FSYNC	0x10	/* fsync() is waiting for this buffer to die */
 
 /**
  *	struct pipe_buffer - a linux kernel pipe buffer
@@ -43,6 +44,7 @@ struct pipe_buffer {
  *	@w_counter: writer counter
  *	@fasync_readers: reader side fasync
  *	@fasync_writers: writer side fasync
+ *	@fsync: Waiting fsyncs
  *	@bufs: the circular array of pipe buffers
  *	@user: the user who created this pipe
  **/
@@ -62,6 +64,7 @@ struct pipe_inode_info {
 	struct page *tmp_page;
 	struct fasync_struct *fasync_readers;
 	struct fasync_struct *fasync_writers;
+	struct list_head fsync;
 	struct pipe_buffer *bufs;
 	struct user_struct *user;
 };
@@ -268,6 +271,25 @@ extern const struct pipe_buf_operations nosteal_pipe_buf_ops;
 long pipe_fcntl(struct file *, unsigned int, unsigned long arg);
 struct pipe_inode_info *get_pipe_info(struct file *file);
 
+void __pipe_wake_fsync(struct pipe_inode_info *pipe, unsigned int tail);
+
+/**
+ * pipe_wake_fsync - Wake up anyone waiting with fsync for this point
+ * @pipe: The pipe that owns the buffer
+ * @buf: The pipe buffer in question
+ * @tail: The index in the ring of the buffer
+ *
+ * Check to see if anyone is waiting for the pipe ring to clear up to and
+ * including this buffer, and, if they are, wake them up.
+ */
+static inline void pipe_wake_fsync(struct pipe_inode_info *pipe,
+				   struct pipe_buffer *buf,
+				   unsigned int tail)
+{
+	if (unlikely(buf->flags & PIPE_BUF_FLAG_FSYNC))
+		__pipe_wake_fsync(pipe, tail);
+}
+
 int create_pipe_files(struct file **, int);
 unsigned int round_pipe_size(unsigned long size);
 
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index e22f4e283f6d..38d52524cd21 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -404,7 +404,7 @@ static size_t copy_page_to_iter_pipe(struct page *page, size_t offset, size_t by
 	buf->offset = offset;
 	buf->len = bytes;
 
-	pipe_commit_read(pipe, i_head);
+	pipe_commit_write(pipe, i_head);
 	i->iov_offset = offset + bytes;
 	i->head = i_head;
 out:

^ permalink raw reply related

* Re: [PATCH ghak90 V7 20/21] audit: add capcontid to set contid outside init_user_ns
From: Richard Guy Briggs @ 2019-10-24 21:00 UTC (permalink / raw)
  To: Paul Moore
  Cc: containers, linux-api, Linux-Audit Mailing List, linux-fsdevel,
	LKML, netdev, netfilter-devel, sgrubb, omosnace, dhowells, simo,
	Eric Paris, Serge Hallyn, ebiederm, nhorman, Dan Walsh, mpatel
In-Reply-To: <CAHC9VhSiwnY-+2awxvGeO4a0NgfVkOPd8fzzBVujp=HtjskTuQ@mail.gmail.com>

On 2019-10-21 20:31, Paul Moore wrote:
> On Mon, Oct 21, 2019 at 7:58 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > On 2019-10-21 17:43, Paul Moore wrote:
> > > On Mon, Oct 21, 2019 at 5:38 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > > On 2019-10-21 15:53, Paul Moore wrote:
> > > > > On Fri, Oct 18, 2019 at 9:39 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > > > > On 2019-09-18 21:22, Richard Guy Briggs wrote:
> > > > > > > Provide a mechanism similar to CAP_AUDIT_CONTROL to explicitly give a
> > > > > > > process in a non-init user namespace the capability to set audit
> > > > > > > container identifiers.
> > > > > > >
> > > > > > > Use audit netlink message types AUDIT_GET_CAPCONTID 1027 and
> > > > > > > AUDIT_SET_CAPCONTID 1028.  The message format includes the data
> > > > > > > structure:
> > > > > > > struct audit_capcontid_status {
> > > > > > >         pid_t   pid;
> > > > > > >         u32     enable;
> > > > > > > };
> > > > > >
> > > > > > Paul, can I get a review of the general idea here to see if you're ok
> > > > > > with this way of effectively extending CAP_AUDIT_CONTROL for the sake of
> > > > > > setting contid from beyond the init user namespace where capable() can't
> > > > > > reach and ns_capable() is meaningless for these purposes?
> > > > >
> > > > > I think my previous comment about having both the procfs and netlink
> > > > > interfaces apply here.  I don't see why we need two different APIs at
> > > > > the start; explain to me why procfs isn't sufficient.  If the argument
> > > > > is simply the desire to avoid mounting procfs in the container, how
> > > > > many container orchestrators can function today without a valid /proc?
> > > >
> > > > Ok, sorry, I meant to address that question from a previous patch
> > > > comment at the same time.
> > > >
> > > > It was raised by Eric Biederman that the proc filesystem interface for
> > > > audit had its limitations and he had suggested an audit netlink
> > > > interface made more sense.
> > >
> > > I'm sure you've got it handy, so I'm going to be lazy and ask: archive
> > > pointer to Eric's comments?  Just a heads-up, I'm really *not* a fan
> > > of using the netlink interface for this, so unless Eric presents a
> > > super compelling reason for why we shouldn't use procfs I'm inclined
> > > to stick with /proc.
> >
> > It was actually a video call with Eric and Steve where that was
> > recommended, so I can't provide you with any first-hand communication
> > about it.  I'll get more details...
> 
> Yeah, that sort of information really needs to be on the list.

Here's the note I had from that meeting:

- Eric raised the issue that using /proc is likely to get more and more
  hoary due to mount namespaces and suggested that we use a netlink
audit message (or a new syscall) to set the audit container identifier
and since the loginuid is a similar type of operation, that it should be
migrated over to a similar mechanism to get it away from /proc.  Get
could be done with a netlink audit message that triggers an audit log
message to deliver the information.  I'm reluctant to further pollute
the syscall space if we can find another method.  The netlink audit
message makes sense since any audit-enabled service is likely to already
have an audit socket open.

I don't have more detailed notes about what Eric said specifically.

> > So, with that out of the way, could you please comment on the general
> > idea of what was intended to be the central idea of this mechanism to be
> > able to nest containers beyond the initial user namespace (knowing that
> > a /proc interface is available and the audit netlink interface isn't
> > necessary for it to work and the latter can be easily removed)?
> 
> I'm not entirely clear what you are asking about, are you asking why I
> care about nesting container orchestrators?  Simply put, it is not
> uncommon for the LXC/LXD folks to see nested container orchestrators,
> so I felt it was important to support that use case.  When we
> originally started this effort we probably should have done a better
> job reaching out to the LXC/LXD folks, we may have caught this
> earlier.  Regardless, we caught it, and it looks like we are on our
> way to supporting it (that's good).
> 
> Are you asking why I prefer the procfs approach to setting/getting the
> audit container ID?  For one, it makes it easier for a LSM to enforce
> the audit container ID operations independent of the other audit
> control APIs.  It also provides a simpler interface for container
> orchestrators.  Both seem like desirable traits as far as I'm
> concerned.
> 
> > > > The intent was to switch to the audit netlink interface for contid,
> > > > capcontid and to add the audit netlink interface for loginuid and
> > > > sessionid while deprecating the proc interface for loginuid and
> > > > sessionid.  This was alluded to in the cover letter, but not very clear,
> > > > I'm afraid.  I have patches to remove the contid and loginuid/sessionid
> > > > interfaces in another tree which is why I had forgotten to outline that
> > > > plan more explicitly in the cover letter.
> 
> -- 
> paul moore
> www.paul-moore.com

- RGB

--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635

^ permalink raw reply

* Re: [PATCH ghak90 V7 06/21] audit: contid limit of 32k imposed to avoid DoS
From: Richard Guy Briggs @ 2019-10-24 21:23 UTC (permalink / raw)
  To: Paul Moore
  Cc: Neil Horman, containers, linux-api, Linux-Audit Mailing List,
	linux-fsdevel, LKML, netdev, netfilter-devel, sgrubb, omosnace,
	dhowells, simo, Eric Paris, Serge Hallyn, ebiederm, Dan Walsh,
	mpatel
In-Reply-To: <CAHC9VhRbSUCB0OZorC4+y+5uJDR5uMXdRn2LOTYGu2gcFJSrcA@mail.gmail.com>

On 2019-10-10 20:38, Paul Moore wrote:
> On Fri, Sep 27, 2019 at 8:52 AM Neil Horman <nhorman@tuxdriver.com> wrote:
> > On Wed, Sep 18, 2019 at 09:22:23PM -0400, Richard Guy Briggs wrote:
> > > Set an arbitrary limit on the number of audit container identifiers to
> > > limit abuse.
> > >
> > > Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
> > > ---
> > >  kernel/audit.c | 8 ++++++++
> > >  kernel/audit.h | 4 ++++
> > >  2 files changed, 12 insertions(+)
> > >
> > > diff --git a/kernel/audit.c b/kernel/audit.c
> > > index 53d13d638c63..329916534dd2 100644
> > > --- a/kernel/audit.c
> > > +++ b/kernel/audit.c
> 
> ...
> 
> > > @@ -2465,6 +2472,7 @@ int audit_set_contid(struct task_struct *task, u64 contid)
> > >                               newcont->owner = current;
> > >                               refcount_set(&newcont->refcount, 1);
> > >                               list_add_rcu(&newcont->list, &audit_contid_hash[h]);
> > > +                             audit_contid_count++;
> > >                       } else {
> > >                               rc = -ENOMEM;
> > >                               goto conterror;
> > > diff --git a/kernel/audit.h b/kernel/audit.h
> > > index 162de8366b32..543f1334ba47 100644
> > > --- a/kernel/audit.h
> > > +++ b/kernel/audit.h
> > > @@ -219,6 +219,10 @@ static inline int audit_hash_contid(u64 contid)
> > >       return (contid & (AUDIT_CONTID_BUCKETS-1));
> > >  }
> > >
> > > +extern int audit_contid_count;
> > > +
> > > +#define AUDIT_CONTID_COUNT   1 << 16
> > > +
> >
> > Just to ask the question, since it wasn't clear in the changelog, what
> > abuse are you avoiding here?  Ostensibly you should be able to create as
> > many container ids as you have space for, and the simple creation of
> > container ids doesn't seem like the resource strain I would be concerned
> > about here, given that an orchestrator can still create as many
> > containers as the system will otherwise allow, which will consume
> > significantly more ram/disk/etc.
> 
> I've got a similar question.  Up to this point in the patchset, there
> is a potential issue of hash bucket chain lengths and traversing them
> with a spinlock held, but it seems like we shouldn't be putting an
> arbitrary limit on audit container IDs unless we have a good reason
> for it.  If for some reason we do want to enforce a limit, it should
> probably be a tunable value like a sysctl, or similar.

Can you separate and clarify the concerns here?

I plan to move this patch to the end of the patchset and make it
optional, possibly adding a tuning mechanism.  Like the migration from
/proc to netlink for loginuid/sessionid/contid/capcontid, this was Eric
Biederman's concern and suggested mitigation.

As for the first issue of the bucket chain length traversal while
holding the list spin-lock, would you prefer to use the rcu lock to
traverse the list and then only hold the spin-lock when modifying the
list, and possibly even make the spin-lock more fine-grained per list?

> paul moore

- RGB

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox