netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH][RFC] network splice receive
@ 2007-06-05  8:05 Jens Axboe
  2007-06-05 11:45 ` Jens Axboe
                   ` (2 more replies)
  0 siblings, 3 replies; 30+ messages in thread
From: Jens Axboe @ 2007-06-05  8:05 UTC (permalink / raw)
  To: netdev

[-- Attachment #1: Type: text/plain, Size: 1526 bytes --]

Hi,

Here's an implementation of tcp network splice receive support. It's
originally based on the patch set that Intel posted some time ago, but
has been (close to) 100% reworked.

Now, I'm not a networking guru by any stretch of the imagination, so I'd
like some input on the direction of the main patch. Is the approach
feasible? Glaring errors? Missing bits?

If you want to test it, I'd suggest downloading the latest splice tools
snapshot here:

http://brick.kernel.dk/snaps/splice-git-latest.tar.gz

Examples:

- Sending a small test message over the network:

  user@host1:~/splice $ ./splice-fromnet 9999 | cat
  user@host2:~ $ echo hello | netcat host1 9999

  should write "hello" on host1. Yeah, complex stuff.

- Sending a file over the network:

  user@host1:~/splice $ ./splice-fromnet 9999 | ./splice out outfile
  user@host2:~ $ cat somefile | netcat host1 9999

  should send somefile over the network, storing it in outfile.

Seems to work reasonably well for me, sometimes I do see small ranges
inside the output file that are not correct, but I haven't been able to
reproduce today. I think it has to do with page reuse, hence the
NET_COPY_SPLICE ifdef that you can enable to just plain copy the data
instead of referencing it.

Patches are against the #splice branch of the block repo, "official" url
of that is:

git://git.kernel.dk/data/git/linux-2.6-block.git/

and it's based on Linus main tree (at 2.6.22-rc4 right now). Let me know
if I should supply netdev branch patches instead.

-- 
Jens Axboe


[-- Attachment #2: 0001-NET-tcp_read_sock-alloc-recv_actor-return-retur.patch --]
[-- Type: text/x-patch, Size: 1103 bytes --]

>From 592c46ea813c31c0d6b28bf543ce2f5dd884a75e Mon Sep 17 00:00:00 2001
From: Jens Axboe <jens.axboe@oracle.com>
Date: Mon, 4 Jun 2007 15:06:43 +0200
Subject: [PATCH] [NET] tcp_read_sock: alloc recv_actor() return return negative error value

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
 net/ipv4/tcp.c |    8 ++++++--
 1 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index cd3c7e9..450f44b 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1064,7 +1064,11 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 					break;
 			}
 			used = recv_actor(desc, skb, offset, len);
-			if (used <= len) {
+			if (used < 0) {
+				if (!copied)
+					copied = used;
+				break;
+			} else if (used <= len) {
 				seq += used;
 				copied += used;
 				offset += used;
@@ -1086,7 +1090,7 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 	tcp_rcv_space_adjust(sk);
 
 	/* Clean up data we have read: This will do ACK frames. */
-	if (copied)
+	if (copied > 0)
 		tcp_cleanup_rbuf(sk, copied);
 	return copied;
 }
-- 
1.5.2.rc1


[-- Attachment #3: 0002-NET-TCP-splice-receive-support.patch --]
[-- Type: text/x-patch, Size: 11938 bytes --]

>From 10d906a9a5a16a022d5067bee3963a0e3a03ae0c Mon Sep 17 00:00:00 2001
From: Jens Axboe <jens.axboe@oracle.com>
Date: Tue, 5 Jun 2007 09:54:00 +0200
Subject: [PATCH] [NET] TCP splice receive support

Losely based on original patches from Intel, modified to actually
be zero-copy (the original patches memcpy'ed the data).

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
 include/linux/net.h    |    3 +
 include/linux/skbuff.h |    5 ++
 include/net/tcp.h      |    3 +
 net/core/skbuff.c      |  114 +++++++++++++++++++++++++++++++++++++++
 net/ipv4/af_inet.c     |    1 +
 net/ipv4/tcp.c         |  138 ++++++++++++++++++++++++++++++++++++++++++++++++
 net/socket.c           |   13 +++++
 7 files changed, 277 insertions(+), 0 deletions(-)

diff --git a/include/linux/net.h b/include/linux/net.h
index efc4517..472ee12 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -19,6 +19,7 @@
 #define _LINUX_NET_H
 
 #include <linux/wait.h>
+#include <linux/splice.h>
 #include <asm/socket.h>
 
 struct poll_table_struct;
@@ -165,6 +166,8 @@ struct proto_ops {
 				      struct vm_area_struct * vma);
 	ssize_t		(*sendpage)  (struct socket *sock, struct page *page,
 				      int offset, size_t size, int flags);
+	ssize_t 	(*splice_read)(struct socket *sock,  loff_t *ppos,
+				       struct pipe_inode_info *pipe, size_t len, unsigned int flags);
 };
 
 struct net_proto_family {
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index e7367c7..619dcf5 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1504,6 +1504,11 @@ extern int	       skb_store_bits(struct sk_buff *skb, int offset,
 extern __wsum	       skb_copy_and_csum_bits(const struct sk_buff *skb,
 					      int offset, u8 *to, int len,
 					      __wsum csum);
+extern int             skb_splice_bits(const struct sk_buff *skb,
+						unsigned int offset,
+						struct pipe_inode_info *pipe,
+						unsigned int len,
+						unsigned int flags);
 extern void	       skb_copy_and_csum_dev(const struct sk_buff *skb, u8 *to);
 extern void	       skb_split(struct sk_buff *skb,
 				 struct sk_buff *skb1, const u32 len);
diff --git a/include/net/tcp.h b/include/net/tcp.h
index a8af9ae..8e86697 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -308,6 +308,9 @@ extern int			tcp_twsk_unique(struct sock *sk,
 
 extern void			tcp_twsk_destructor(struct sock *sk);
 
+extern ssize_t			tcp_splice_read(struct socket *sk, loff_t *ppos,
+					        struct pipe_inode_info *pipe, size_t len, unsigned int flags);
+
 static inline void tcp_dec_quickack_mode(struct sock *sk,
 					 const unsigned int pkts)
 {
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 7c6a34e..4e97220 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -52,6 +52,7 @@
 #endif
 #include <linux/string.h>
 #include <linux/skbuff.h>
+#include <linux/splice.h>
 #include <linux/cache.h>
 #include <linux/rtnetlink.h>
 #include <linux/init.h>
@@ -71,6 +72,33 @@
 static struct kmem_cache *skbuff_head_cache __read_mostly;
 static struct kmem_cache *skbuff_fclone_cache __read_mostly;
 
+static void sock_pipe_buf_release(struct pipe_inode_info *pipe,
+				  struct pipe_buffer *buf)
+{
+#ifdef NET_COPY_SPLICE
+	__free_page(buf->page);
+#else
+	put_page(buf->page);
+#endif
+}
+
+static int sock_pipe_buf_steal(struct pipe_inode_info *pipe,
+			       struct pipe_buffer *buf)
+{
+	return 1;
+}
+
+/* Pipe buffer operations for a socket. */
+static struct pipe_buf_operations sock_pipe_buf_ops = {
+	.can_merge = 0,
+	.map = generic_pipe_buf_map,
+	.unmap = generic_pipe_buf_unmap,
+	.pin = generic_pipe_buf_pin,
+	.release = sock_pipe_buf_release,
+	.steal = sock_pipe_buf_steal,
+	.get = generic_pipe_buf_get,
+};
+
 /*
  *	Keep out-of-line to prevent kernel bloat.
  *	__builtin_return_address is not used because it is not always
@@ -1116,6 +1144,92 @@ fault:
 	return -EFAULT;
 }
 
+/*
+ * Fill page/offset/length into spd, if it can hold more pages.
+ */
+static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page,
+				unsigned int len, unsigned int offset)
+{
+	struct page *p;
+
+	if (unlikely(spd->nr_pages == PIPE_BUFFERS))
+		return 1;
+
+#ifdef NET_COPY_SPLICE
+	p = alloc_pages(GFP_KERNEL, 0);
+	if (!p)
+		return 1;
+
+	memcpy(page_address(p) + offset, page_address(page) + offset, len);
+#else
+	p = page;
+	get_page(p);
+#endif
+
+	spd->pages[spd->nr_pages] = p;
+	spd->partial[spd->nr_pages].len = len;
+	spd->partial[spd->nr_pages].offset = offset;
+	spd->nr_pages++;
+	return 0;
+}
+
+/*
+ * Map data from the skb to a pipe.
+ */
+int skb_splice_bits(const struct sk_buff *skb, unsigned int offset,
+		    struct pipe_inode_info *pipe, unsigned int len,
+		    unsigned int flags)
+{
+	struct partial_page partial[PIPE_BUFFERS];
+	struct page *pages[PIPE_BUFFERS];
+	int headlen, seg, i;
+	struct page *page;
+	struct splice_pipe_desc spd = {
+		.pages = pages,
+		.partial = partial,
+		.flags = flags,
+		.ops = &sock_pipe_buf_ops,
+	};
+
+	headlen = skb_headlen(skb) - offset;
+	if (headlen <= 0)
+		return 0;
+
+	/*
+	 * first map the linear region into the pages/partial map
+	 */
+	i = seg = 0;
+	while (i < headlen) {
+		void *p = skb->data + i;
+		unsigned int poff = (unsigned long) p & (PAGE_SIZE - 1);
+		unsigned int plen;
+
+		page = virt_to_page(p);
+		poff = (unsigned long) p & (PAGE_SIZE - 1);
+		plen = min_t(unsigned, (unsigned)(headlen - i), PAGE_SIZE - poff);
+
+		if (spd_fill_page(&spd, page, plen, poff))
+			break;
+
+		i += PAGE_SIZE - poff;
+	}
+
+	/*
+	 * then map the fragments
+	 */
+	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++, seg++) {
+		const skb_frag_t *f = &skb_shinfo(skb)->frags[i];
+
+		if (spd_fill_page(&spd, f->page, f->size, f->page_offset))
+			break;
+	}
+
+	if (spd.nr_pages)
+		return splice_to_pipe(pipe, &spd);
+
+	return 0;
+}
+
 /**
  *	skb_store_bits - store bits from kernel buffer to skb
  *	@skb: destination buffer
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 041fba3..0ff9f86 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -835,6 +835,7 @@ const struct proto_ops inet_stream_ops = {
 	.recvmsg	   = sock_common_recvmsg,
 	.mmap		   = sock_no_mmap,
 	.sendpage	   = tcp_sendpage,
+	.splice_read	   = tcp_splice_read,
 #ifdef CONFIG_COMPAT
 	.compat_setsockopt = compat_sock_common_setsockopt,
 	.compat_getsockopt = compat_sock_common_getsockopt,
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 450f44b..34228ca 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -253,6 +253,10 @@
 #include <linux/poll.h>
 #include <linux/init.h>
 #include <linux/fs.h>
+#include <linux/skbuff.h>
+#include <linux/splice.h>
+#include <linux/net.h>
+#include <linux/socket.h>
 #include <linux/random.h>
 #include <linux/bootmem.h>
 #include <linux/cache.h>
@@ -264,6 +268,7 @@
 #include <net/xfrm.h>
 #include <net/ip.h>
 #include <net/netdma.h>
+#include <net/sock.h>
 
 #include <asm/uaccess.h>
 #include <asm/ioctls.h>
@@ -291,6 +296,17 @@ EXPORT_SYMBOL(tcp_memory_allocated);
 EXPORT_SYMBOL(tcp_sockets_allocated);
 
 /*
+ *	Create a TCP splice context.
+ */
+struct tcp_splice_state {
+		struct pipe_inode_info *pipe;
+		void (*original_data_ready)(struct sock*, int);
+		size_t len;
+		size_t offset;
+		unsigned int flags;
+};
+
+/*
  * Pressure flag: try to collapse.
  * Technical note: it is used by multiple contexts non atomically.
  * All the sk_stream_mem_schedule() is of this nature: accounting
@@ -500,6 +516,127 @@ static inline void tcp_push(struct sock *sk, int flags, int mss_now,
 	}
 }
 
+int tcp_splice_data_recv(read_descriptor_t *rd_desc, struct sk_buff *skb,
+			 unsigned int offset, size_t len)
+{
+	struct tcp_splice_state *tss = rd_desc->arg.data;
+
+	return skb_splice_bits(skb, offset, tss->pipe, tss->len, tss->flags);
+}
+
+void tcp_splice_data_ready(struct sock *sk, int flag)
+{
+	/*
+	 * Restore splice context/ read_descriptor_t from sk->sk_user_data
+	 */
+	struct tcp_splice_state *tss = sk->sk_user_data;
+	read_descriptor_t rd_desc = {
+		.arg.data = tss,
+		.count = 1,
+	};
+
+	lock_sock(sk);
+	tcp_read_sock(sk, &rd_desc, tcp_splice_data_recv);
+	release_sock(sk);
+
+	if (tss->len == 0) {
+		/* Restore original sk_data_ready callback. */
+		sk->sk_data_ready = tss->original_data_ready;
+		/* Wakeup user thread. */
+		sk->sk_state_change(sk);
+	}
+}
+
+static int __tcp_splice_read(struct sock *sk, struct tcp_splice_state *tss)
+{
+	/* Store TCP splice context information in read_descriptor_t. */
+	read_descriptor_t rd_desc = {
+		.arg.data = tss,
+	};
+
+	tss->original_data_ready = sk->sk_data_ready;
+	sk->sk_user_data = tss;
+
+	return tcp_read_sock(sk, &rd_desc, tcp_splice_data_recv);
+}
+
+/*
+ *  tcp_splice_read - splice data from TCP socket to a pipe
+ * @sock:	socket to splice from
+ * @pipe:	pipe to splice to
+ * @len:	number of bytes to splice
+ * @flags:	splice modifier flags
+ *
+ * Will read pages from given socket and fill them into a pipe.
+ */
+ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,
+			struct pipe_inode_info *pipe, size_t len,
+			unsigned int flags)
+{
+	struct tcp_splice_state tss = {
+		.pipe = pipe,
+		.len = len,
+		.flags = flags,
+	};
+	struct sock *sk = sock->sk;
+	ssize_t spliced;
+	int ret;
+
+	if (*ppos != 0)
+		return -EINVAL;
+
+	lock_sock(sk);
+
+	ret = spliced = 0;
+	while (tss.len) {
+		ret = __tcp_splice_read(sk, &tss);
+		if (ret < 0)
+			break;
+		else if (!ret) {
+			if (flags & SPLICE_F_NONBLOCK) {
+				ret = -EAGAIN;
+				break;
+			}
+			if (spliced)
+				break;
+			if (sock_flag(sk, SOCK_DONE))
+				break;
+			if (sk->sk_err) {
+				ret = sock_error(sk);
+				break;
+			}
+			if (sk->sk_shutdown & RCV_SHUTDOWN)
+				break;
+			if (sk->sk_state == TCP_CLOSE) {
+				if (!sock_flag(sk, SOCK_DONE)) {
+					/* This occurs when user tries to read
+					 * from never connected socket.
+					 */
+					ret = -ENOTCONN;
+					break;
+				}
+				break;
+			}
+			if (signal_pending(current)) {
+				ret = -EAGAIN;
+				break;
+			}
+			printk("wait for more...\n");
+			sk_wait_data(sk, &sk->sk_rcvtimeo);
+			continue;
+		}
+		tss.len -= ret;
+		spliced += ret;
+	}
+
+	release_sock(sk);
+
+	if (spliced)
+		return spliced;
+
+	return ret;
+}
+
 static ssize_t do_tcp_sendpages(struct sock *sk, struct page **pages, int poffset,
 			 size_t psize, int flags)
 {
@@ -2515,6 +2652,7 @@ EXPORT_SYMBOL(tcp_poll);
 EXPORT_SYMBOL(tcp_read_sock);
 EXPORT_SYMBOL(tcp_recvmsg);
 EXPORT_SYMBOL(tcp_sendmsg);
+EXPORT_SYMBOL(tcp_splice_read);
 EXPORT_SYMBOL(tcp_sendpage);
 EXPORT_SYMBOL(tcp_setsockopt);
 EXPORT_SYMBOL(tcp_shutdown);
diff --git a/net/socket.c b/net/socket.c
index f453019..41240f5 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -111,6 +111,9 @@ static long compat_sock_ioctl(struct file *file,
 static int sock_fasync(int fd, struct file *filp, int on);
 static ssize_t sock_sendpage(struct file *file, struct page *page,
 			     int offset, size_t size, loff_t *ppos, int more);
+static ssize_t sock_splice_read(struct file *file, loff_t *ppos,
+			        struct pipe_inode_info *pipe, size_t len,
+				unsigned int flags);
 
 /*
  *	Socket files have a set of 'special' operations as well as the generic file ones. These don't appear
@@ -133,6 +136,7 @@ static const struct file_operations socket_file_ops = {
 	.fasync =	sock_fasync,
 	.sendpage =	sock_sendpage,
 	.splice_write = generic_splice_sendpage,
+	.splice_read =	sock_splice_read,
 };
 
 /*
@@ -691,6 +695,15 @@ static ssize_t sock_sendpage(struct file *file, struct page *page,
 	return sock->ops->sendpage(sock, page, offset, size, flags);
 }
 
+static ssize_t sock_splice_read(struct file *file, loff_t *ppos,
+			        struct pipe_inode_info *pipe, size_t len,
+				unsigned int flags)
+{
+	struct socket *sock = file->private_data;
+
+	return sock->ops->splice_read(sock, ppos, pipe, len, flags);
+}
+
 static struct sock_iocb *alloc_sock_iocb(struct kiocb *iocb,
 					 struct sock_iocb *siocb)
 {
-- 
1.5.2.rc1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH][RFC] network splice receive
  2007-06-05  8:05 [PATCH][RFC] network splice receive Jens Axboe
@ 2007-06-05 11:45 ` Jens Axboe
  2007-06-05 12:20   ` Jens Axboe
  2007-06-05 13:34 ` Evgeniy Polyakov
  2007-06-05 14:31 ` Evgeniy Polyakov
  2 siblings, 1 reply; 30+ messages in thread
From: Jens Axboe @ 2007-06-05 11:45 UTC (permalink / raw)
  To: netdev

On Tue, Jun 05 2007, Jens Axboe wrote:
> Seems to work reasonably well for me, sometimes I do see small ranges
> inside the output file that are not correct, but I haven't been able to
> reproduce today. I think it has to do with page reuse, hence the
> NET_COPY_SPLICE ifdef that you can enable to just plain copy the data
> instead of referencing it.

I managed to reproduce. It's segments of 68-80 bytes beyond corrupt in
the middle of the out, and there might be 1-3 of such occurences in the
30mb file I tested with. The first 16 bytes of the corruption are always
the same:

0000 1800 4ff3 937f e000 6381 7275 0008

Perhaps that hex pattern rings a bell with someone intimate with the
networking. The remaining wrong bytes don't seem to have anything in
common.

Slab poisoning doesn't change the pattern, so it's not use-after-free.


-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH][RFC] network splice receive
  2007-06-05 11:45 ` Jens Axboe
@ 2007-06-05 12:20   ` Jens Axboe
  2007-06-05 12:34     ` jamal
  0 siblings, 1 reply; 30+ messages in thread
From: Jens Axboe @ 2007-06-05 12:20 UTC (permalink / raw)
  To: netdev

On Tue, Jun 05 2007, Jens Axboe wrote:
> On Tue, Jun 05 2007, Jens Axboe wrote:
> > Seems to work reasonably well for me, sometimes I do see small ranges
> > inside the output file that are not correct, but I haven't been able to
> > reproduce today. I think it has to do with page reuse, hence the
> > NET_COPY_SPLICE ifdef that you can enable to just plain copy the data
> > instead of referencing it.
> 
> I managed to reproduce. It's segments of 68-80 bytes beyond corrupt in
> the middle of the out, and there might be 1-3 of such occurences in the
> 30mb file I tested with. The first 16 bytes of the corruption are always
> the same:
> 
> 0000 1800 4ff3 937f e000 6381 7275 0008
> 
> Perhaps that hex pattern rings a bell with someone intimate with the
> networking. The remaining wrong bytes don't seem to have anything in
> common.

Ok, the source mac address is 00:18:F3:4F:7F:93 and the destination is
00:E0:81:63:75:72 which are the middle 12 bytes of the 16.

Hope that helps someone clue me in as to which network part is reusing
the data. Do I need to 'pin' the sk_buff until the pipe data has been
consumed?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH][RFC] network splice receive
  2007-06-05 12:20   ` Jens Axboe
@ 2007-06-05 12:34     ` jamal
  2007-06-06  7:14       ` Jens Axboe
  0 siblings, 1 reply; 30+ messages in thread
From: jamal @ 2007-06-05 12:34 UTC (permalink / raw)
  To: Jens Axboe; +Cc: netdev

On Tue, 2007-05-06 at 14:20 +0200, Jens Axboe wrote:

> > 
> > 0000 1800 4ff3 937f e000 6381 7275 0008
> > 
> > Perhaps that hex pattern rings a bell with someone intimate with the
> > networking. The remaining wrong bytes don't seem to have anything in
> > common.
> 
> Ok, the source mac address is 00:18:F3:4F:7F:93 and the destination is
> 00:E0:81:63:75:72 which are the middle 12 bytes of the 16.
> 

It appears you may have endianness issues and perhaps a 16 bit
mis-offset. the 0008 at the end looks like a swapped 0x800 which
is the ethernet type for IPV4.

> Hope that helps someone clue me in as to which network part is reusing
> the data. Do I need to 'pin' the sk_buff until the pipe data has been
> consumed?

I would worry about the driver level first - likely thats where your
corruption is.

cheers,
jamal


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH][RFC] network splice receive
  2007-06-05  8:05 [PATCH][RFC] network splice receive Jens Axboe
  2007-06-05 11:45 ` Jens Axboe
@ 2007-06-05 13:34 ` Evgeniy Polyakov
  2007-06-05 14:31 ` Evgeniy Polyakov
  2 siblings, 0 replies; 30+ messages in thread
From: Evgeniy Polyakov @ 2007-06-05 13:34 UTC (permalink / raw)
  To: Jens Axboe; +Cc: netdev

On Tue, Jun 05, 2007 at 10:05:43AM +0200, Jens Axboe (jens.axboe@oracle.com) wrote:
> Hi,

Hi Jens.

> Here's an implementation of tcp network splice receive support. It's
> originally based on the patch set that Intel posted some time ago, but
> has been (close to) 100% reworked.
> 
> Now, I'm not a networking guru by any stretch of the imagination, so I'd
> like some input on the direction of the main patch. Is the approach
> feasible? Glaring errors? Missing bits?

First one - you seems to create new data_ready callback
tcp_splice_data_ready(), but it is unused, and actually can not be at 
all - there will be a deadlock, since sk_data_ready can be called with locked 
socket and also in bh context.

I will setup this and report abck about bugs.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH][RFC] network splice receive
  2007-06-05  8:05 [PATCH][RFC] network splice receive Jens Axboe
  2007-06-05 11:45 ` Jens Axboe
  2007-06-05 13:34 ` Evgeniy Polyakov
@ 2007-06-05 14:31 ` Evgeniy Polyakov
  2007-06-05 14:49   ` Evgeniy Polyakov
  2007-06-06  7:17   ` Jens Axboe
  2 siblings, 2 replies; 30+ messages in thread
From: Evgeniy Polyakov @ 2007-06-05 14:31 UTC (permalink / raw)
  To: Jens Axboe; +Cc: netdev

On Tue, Jun 05, 2007 at 10:05:43AM +0200, Jens Axboe (jens.axboe@oracle.com) wrote:
> Here's an implementation of tcp network splice receive support. It's
> originally based on the patch set that Intel posted some time ago, but
> has been (close to) 100% reworked.
> 
> Now, I'm not a networking guru by any stretch of the imagination, so I'd
> like some input on the direction of the main patch. Is the approach
> feasible? Glaring errors? Missing bits?

  263.709262] ------------[ cut here ]------------
  [  263.713932] kernel BUG at include/linux/mm.h:285!
  [  263.718678] invalid opcode: 0000 [1] PREEMPT SMP 
  [  263.723561] CPU 0 
  [  263.725665] Modules linked in: button loop snd_intel8x0
  snd_ac97_codec ac97_bus snd_pcm snd_timer snd soundcore psmouse
  snd_page_alloc k8temp i2c_nforcen
  [  263.755666] Pid: 2709, comm: splice-fromnet Not tainted
  2.6.22-rc4-splice #2
  [  263.762759] RIP: 0010:[<ffffffff8038c60c>]  [<ffffffff8038c60c>]
  skb_splice_bits+0xac/0x1c9
  [  263.771212] RSP: 0018:ffff81003c79fc88  EFLAGS: 00010246
  [  263.776564] RAX: 0000000000000000 RBX: 00000000000005a8 RCX:
  ffff81003ff04778
  [  263.783743] RDX: ffff81003ff04778 RSI: 0000000000000ab2 RDI:
  000000000003d52d
  [  263.790925] RBP: ffff81003c79fdd8 R08: 0000000000000000 R09:
  ffff81003d927b78
  [  263.798104] R10: ffffffff803b0181 R11: ffff81003c79fde8 R12:
  ffff81003d52d000
  [  263.805284] R13: 000000000000054e R14: ffff81003d927b78 R15:
  ffff81003bbc6ea0
  [  263.812463] FS:  00002ac4089a86d0(0000) GS:ffffffff804fb000(0000)
  knlGS:0000000000000000
  [  263.820611] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
  [  263.826396] CR2: 00002ac4088320e0 CR3: 000000003c987000 CR4:
  00000000000006e0
  [  263.833578] Process splice-fromnet (pid: 2709, threadinfo
  ffff81003c79e000, task ffff81003755c380)
  [  263.842591] Stack:  ffff81003ff04720 0000000000000000
  ffff81003755c380 0000000000000046
  [  263.850897]  00000000000000c6 0000000000000046 ffff81003b0428b8
  ffff81003d0b5b10
  [  263.858543]  00000000000000c6 ffff81003d0b5b10 ffff81003b0428b8
  ffff81003d0b5b10
  [  263.865957] Call Trace:
  [  263.868683]  [<ffffffff803dc630>] _read_unlock_irq+0x31/0x4e
  [  263.874393]  [<ffffffff803afb54>] tcp_splice_data_recv+0x20/0x22
  [  263.880447]  [<ffffffff803afa2b>] tcp_read_sock+0xa2/0x1ab
  [  263.885983]  [<ffffffff803afb34>] tcp_splice_data_recv+0x0/0x22
  [  263.891951]  [<ffffffff803b01c1>] tcp_splice_read+0xae/0x1a3
  [  263.897655]  [<ffffffff8038920f>] sock_def_readable+0x0/0x6f
  [  263.903366]  [<ffffffff80384a65>] sock_splice_read+0x15/0x17
  [  263.909072]  [<ffffffff8029e773>] do_splice_to+0x76/0x88
  [  263.914432]  [<ffffffff8029fcc8>] sys_splice+0x1a8/0x232
  [  263.919795]  [<ffffffff802097ce>] system_call+0x7e/0x83
  [  263.925067] 
  [  263.926606] 
  [  263.926607] Code: 0f 0b eb fe 44 89 e6 81 e6 ff 0f 00 00 90 ff 42
  08 48 63 55 
  [  263.936418] RIP  [<ffffffff8038c60c>] skb_splice_bits+0xac/0x1c9
  [  263.942516]  RSP <ffff81003c79fc88>

This a vm_bug_on in get_page().

> +static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page,
> +				unsigned int len, unsigned int offset)
> +{
> +	struct page *p;
> +
> +	if (unlikely(spd->nr_pages == PIPE_BUFFERS))
> +		return 1;
> +
> +#ifdef NET_COPY_SPLICE
> +	p = alloc_pages(GFP_KERNEL, 0);
> +	if (!p)
> +		return 1;
> +
> +	memcpy(page_address(p) + offset, page_address(page) + offset, len);
> +#else
> +	p = page;
> +	get_page(p);
> +#endif

Some pages have zero reference counter here.

Is commented NET_COPY_SPLICE part from old implementation?
It will be always slower than existing approach due to allocation
overhead.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH][RFC] network splice receive
  2007-06-05 14:31 ` Evgeniy Polyakov
@ 2007-06-05 14:49   ` Evgeniy Polyakov
  2007-06-06  7:17   ` Jens Axboe
  1 sibling, 0 replies; 30+ messages in thread
From: Evgeniy Polyakov @ 2007-06-05 14:49 UTC (permalink / raw)
  To: Jens Axboe; +Cc: netdev

On Tue, Jun 05, 2007 at 06:31:31PM +0400, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote:
>   [  263.936418] RIP  [<ffffffff8038c60c>] skb_splice_bits+0xac/0x1c9
>   [  263.942516]  RSP <ffff81003c79fc88>
> 
> This a vm_bug_on in get_page().
> 
> > +static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page,
> > +				unsigned int len, unsigned int offset)
> > +{
> > +	struct page *p;
> > +
> > +	if (unlikely(spd->nr_pages == PIPE_BUFFERS))
> > +		return 1;
> > +
> > +#ifdef NET_COPY_SPLICE
> > +	p = alloc_pages(GFP_KERNEL, 0);
> > +	if (!p)
> > +		return 1;
> > +
> > +	memcpy(page_address(p) + offset, page_address(page) + offset, len);
> > +#else
> > +	p = page;
> > +	get_page(p);
> > +#endif
> 
> Some pages have zero reference counter here.

Very likley bug with mac address is related to this one and you do not
have vm debug enabled in the config? Naive atomic_inc and
atomic_dec_return with bug_on < 0 instead of that get_page and put_page
in spd_fill_page()/sock_pipe_buf_release() resulted in broken file -
initial data contained 6B and 5A instead of zeroes sent. Even more naive 
atomic_add(2, page) ended with:

[   48.273345] page:ffff81003ff22a18 flags:0x0100000000000000
mapping:0000000000000000 mapcount:0 count:2
[   48.273347] Trying to fix it up, but a reboot is needed
[   48.273349] Backtrace:
[   48.295576] 
[   48.295577] Call Trace:
[   48.299624]  [<ffffffff8025f075>] bad_page+0x67/0x95
[   48.304636]  [<ffffffff8025f771>] __free_pages_ok+0x76/0x2c1
[   48.310343]  [<ffffffff8025fbec>] __free_pages+0x29/0x2b
[   48.315703]  [<ffffffff8025fc38>] free_pages+0x4a/0x4f
[   48.320884]  [<ffffffff8027b15f>] kmem_freepages+0xd9/0xe2
[   48.326416]  [<ffffffff8027bd93>] slab_destroy+0xef/0x114
[   48.331865]  [<ffffffff8027bf15>] free_block+0x15d/0x19f
[   48.337227]  [<ffffffff8027c0bb>] cache_flusharray+0x95/0xff
[   48.342933]  [<ffffffff8027c36a>] kfree+0x1cd/0x1ec
[   48.347863]  [<ffffffff8038b233>] skb_release_data+0xab/0xb0
[   48.353567]  [<ffffffff8038aff3>] kfree_skbmem+0x11/0x7e
[   48.358927]  [<ffffffff8038b104>] __kfree_skb+0xa4/0xa9
[   48.364204]  [<ffffffff803afa9e>] tcp_read_sock+0x101/0x1ab
[   48.369823]  [<ffffffff803afb48>] tcp_splice_data_recv+0x0/0x22
[   48.375791]  [<ffffffff803b01d5>] tcp_splice_read+0xae/0x1a3
[   48.381497]  [<ffffffff8038920f>] sock_def_readable+0x0/0x6f
[   48.387209]  [<ffffffff80384a65>] sock_splice_read+0x15/0x17
[   48.392913]  [<ffffffff8029e773>] do_splice_to+0x76/0x88
[   48.398273]  [<ffffffff8029fcc8>] sys_splice+0x1a8/0x232
[   48.403636]  [<ffffffff802097ce>] system_call+0x7e/0x83

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH][RFC] network splice receive
  2007-06-05 12:34     ` jamal
@ 2007-06-06  7:14       ` Jens Axboe
  0 siblings, 0 replies; 30+ messages in thread
From: Jens Axboe @ 2007-06-06  7:14 UTC (permalink / raw)
  To: jamal; +Cc: netdev

On Tue, Jun 05 2007, jamal wrote:
> On Tue, 2007-05-06 at 14:20 +0200, Jens Axboe wrote:
> 
> > > 
> > > 0000 1800 4ff3 937f e000 6381 7275 0008
> > > 
> > > Perhaps that hex pattern rings a bell with someone intimate with the
> > > networking. The remaining wrong bytes don't seem to have anything in
> > > common.
> > 
> > Ok, the source mac address is 00:18:F3:4F:7F:93 and the destination is
> > 00:E0:81:63:75:72 which are the middle 12 bytes of the 16.
> > 
> 
> It appears you may have endianness issues and perhaps a 16 bit
> mis-offset. the 0008 at the end looks like a swapped 0x800 which
> is the ethernet type for IPV4.

Yeah, it looks like the first part of the ethernet frame. I'm using an
e1000 on the receive side and a sky2 on the sender.

> > Hope that helps someone clue me in as to which network part is reusing
> > the data. Do I need to 'pin' the sk_buff until the pipe data has been
> > consumed?
> 
> I would worry about the driver level first - likely thats where your
> corruption is.

I'd just be a heck-of-a-lot more inclined to suspect my code! Hence I
thought that data reuse for perhaps another packet would be the likely
explanation, especially since it looked like valid network driver data
ending up where it should not.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH][RFC] network splice receive
  2007-06-05 14:31 ` Evgeniy Polyakov
  2007-06-05 14:49   ` Evgeniy Polyakov
@ 2007-06-06  7:17   ` Jens Axboe
  2007-06-07  8:09     ` Evgeniy Polyakov
  1 sibling, 1 reply; 30+ messages in thread
From: Jens Axboe @ 2007-06-06  7:17 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: netdev

On Tue, Jun 05 2007, Evgeniy Polyakov wrote:
> On Tue, Jun 05, 2007 at 10:05:43AM +0200, Jens Axboe (jens.axboe@oracle.com) wrote:
> > Here's an implementation of tcp network splice receive support. It's
> > originally based on the patch set that Intel posted some time ago, but
> > has been (close to) 100% reworked.
> > 
> > Now, I'm not a networking guru by any stretch of the imagination, so I'd
> > like some input on the direction of the main patch. Is the approach
> > feasible? Glaring errors? Missing bits?
> 
>   263.709262] ------------[ cut here ]------------
>   [  263.713932] kernel BUG at include/linux/mm.h:285!
>   [  263.718678] invalid opcode: 0000 [1] PREEMPT SMP 
>   [  263.723561] CPU 0 
>   [  263.725665] Modules linked in: button loop snd_intel8x0
>   snd_ac97_codec ac97_bus snd_pcm snd_timer snd soundcore psmouse
>   snd_page_alloc k8temp i2c_nforcen
>   [  263.755666] Pid: 2709, comm: splice-fromnet Not tainted
>   2.6.22-rc4-splice #2
>   [  263.762759] RIP: 0010:[<ffffffff8038c60c>]  [<ffffffff8038c60c>]
>   skb_splice_bits+0xac/0x1c9
>   [  263.771212] RSP: 0018:ffff81003c79fc88  EFLAGS: 00010246
>   [  263.776564] RAX: 0000000000000000 RBX: 00000000000005a8 RCX:
>   ffff81003ff04778
>   [  263.783743] RDX: ffff81003ff04778 RSI: 0000000000000ab2 RDI:
>   000000000003d52d
>   [  263.790925] RBP: ffff81003c79fdd8 R08: 0000000000000000 R09:
>   ffff81003d927b78
>   [  263.798104] R10: ffffffff803b0181 R11: ffff81003c79fde8 R12:
>   ffff81003d52d000
>   [  263.805284] R13: 000000000000054e R14: ffff81003d927b78 R15:
>   ffff81003bbc6ea0
>   [  263.812463] FS:  00002ac4089a86d0(0000) GS:ffffffff804fb000(0000)
>   knlGS:0000000000000000
>   [  263.820611] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>   [  263.826396] CR2: 00002ac4088320e0 CR3: 000000003c987000 CR4:
>   00000000000006e0
>   [  263.833578] Process splice-fromnet (pid: 2709, threadinfo
>   ffff81003c79e000, task ffff81003755c380)
>   [  263.842591] Stack:  ffff81003ff04720 0000000000000000
>   ffff81003755c380 0000000000000046
>   [  263.850897]  00000000000000c6 0000000000000046 ffff81003b0428b8
>   ffff81003d0b5b10
>   [  263.858543]  00000000000000c6 ffff81003d0b5b10 ffff81003b0428b8
>   ffff81003d0b5b10
>   [  263.865957] Call Trace:
>   [  263.868683]  [<ffffffff803dc630>] _read_unlock_irq+0x31/0x4e
>   [  263.874393]  [<ffffffff803afb54>] tcp_splice_data_recv+0x20/0x22
>   [  263.880447]  [<ffffffff803afa2b>] tcp_read_sock+0xa2/0x1ab
>   [  263.885983]  [<ffffffff803afb34>] tcp_splice_data_recv+0x0/0x22
>   [  263.891951]  [<ffffffff803b01c1>] tcp_splice_read+0xae/0x1a3
>   [  263.897655]  [<ffffffff8038920f>] sock_def_readable+0x0/0x6f
>   [  263.903366]  [<ffffffff80384a65>] sock_splice_read+0x15/0x17
>   [  263.909072]  [<ffffffff8029e773>] do_splice_to+0x76/0x88
>   [  263.914432]  [<ffffffff8029fcc8>] sys_splice+0x1a8/0x232
>   [  263.919795]  [<ffffffff802097ce>] system_call+0x7e/0x83
>   [  263.925067] 
>   [  263.926606] 
>   [  263.926607] Code: 0f 0b eb fe 44 89 e6 81 e6 ff 0f 00 00 90 ff 42
>   08 48 63 55 
>   [  263.936418] RIP  [<ffffffff8038c60c>] skb_splice_bits+0xac/0x1c9
>   [  263.942516]  RSP <ffff81003c79fc88>
> 
> This a vm_bug_on in get_page().
> 
> > +static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page,
> > +				unsigned int len, unsigned int offset)
> > +{
> > +	struct page *p;
> > +
> > +	if (unlikely(spd->nr_pages == PIPE_BUFFERS))
> > +		return 1;
> > +
> > +#ifdef NET_COPY_SPLICE
> > +	p = alloc_pages(GFP_KERNEL, 0);
> > +	if (!p)
> > +		return 1;
> > +
> > +	memcpy(page_address(p) + offset, page_address(page) + offset, len);
> > +#else
> > +	p = page;
> > +	get_page(p);
> > +#endif
> 
> Some pages have zero reference counter here.
> 
> Is commented NET_COPY_SPLICE part from old implementation?
> It will be always slower than existing approach due to allocation
> overhead.

NET_COPY_SPLICE is not supposed to be enabled, it's just a demonstration
of a copy operation if you wanted to test functionality without just
linking in the skb pages. At least that would allow you to test
correctness of the rest of the code, since I don't know if the skb page
linking is entirely correct yet.

But it's somewhat annoying to get pages with zero reference counts
there, I wonder how that happens. I guess if the skb->data originated
from kmalloc() then we don't really know. The main intent there was just
to ensure the page wasn't going away, but clearly it's not good enough
to ensure that reuse isn't taking place.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH][RFC] network splice receive
  2007-06-06  7:17   ` Jens Axboe
@ 2007-06-07  8:09     ` Evgeniy Polyakov
  2007-06-07 10:51       ` Jens Axboe
  0 siblings, 1 reply; 30+ messages in thread
From: Evgeniy Polyakov @ 2007-06-07  8:09 UTC (permalink / raw)
  To: Jens Axboe; +Cc: netdev

On Wed, Jun 06, 2007 at 09:17:25AM +0200, Jens Axboe (jens.axboe@oracle.com) wrote:
> > Some pages have zero reference counter here.
> 
> But it's somewhat annoying to get pages with zero reference counts
> there, I wonder how that happens. I guess if the skb->data originated
> from kmalloc() then we don't really know. The main intent there was just
> to ensure the page wasn't going away, but clearly it's not good enough
> to ensure that reuse isn't taking place.

What bout checking if page belongs to kmalloc cache (or any other cache
via priviate pointers) and do not perform any kind of reference counting
on them? I will play with this a bit later today.

> -- 
> Jens Axboe

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH][RFC] network splice receive
  2007-06-07  8:09     ` Evgeniy Polyakov
@ 2007-06-07 10:51       ` Jens Axboe
  2007-06-07 14:58         ` Evgeniy Polyakov
  0 siblings, 1 reply; 30+ messages in thread
From: Jens Axboe @ 2007-06-07 10:51 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: netdev

On Thu, Jun 07 2007, Evgeniy Polyakov wrote:
> On Wed, Jun 06, 2007 at 09:17:25AM +0200, Jens Axboe (jens.axboe@oracle.com) wrote:
> > > Some pages have zero reference counter here.
> > 
> > But it's somewhat annoying to get pages with zero reference counts
> > there, I wonder how that happens. I guess if the skb->data originated
> > from kmalloc() then we don't really know. The main intent there was just
> > to ensure the page wasn't going away, but clearly it's not good enough
> > to ensure that reuse isn't taking place.
> 
> What bout checking if page belongs to kmalloc cache (or any other cache
> via priviate pointers) and do not perform any kind of reference counting
> on them? I will play with this a bit later today.

That might work, but sounds a little dirty... But there's probably no
way around. Be sure to look at the #splice-net branch if you are playing
with this, I've updated it a number of times and fixed some bugs in
there. Notably it now gets the offset right, and handles fragments and
fraglist as well.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH][RFC] network splice receive
  2007-06-07 10:51       ` Jens Axboe
@ 2007-06-07 14:58         ` Evgeniy Polyakov
  2007-06-08  7:48           ` Jens Axboe
  0 siblings, 1 reply; 30+ messages in thread
From: Evgeniy Polyakov @ 2007-06-07 14:58 UTC (permalink / raw)
  To: Jens Axboe; +Cc: netdev

On Thu, Jun 07, 2007 at 12:51:59PM +0200, Jens Axboe (jens.axboe@oracle.com) wrote:
> > What bout checking if page belongs to kmalloc cache (or any other cache
> > via priviate pointers) and do not perform any kind of reference counting
> > on them? I will play with this a bit later today.
> 
> That might work, but sounds a little dirty... But there's probably no
> way around. Be sure to look at the #splice-net branch if you are playing
> with this, I've updated it a number of times and fixed some bugs in
> there. Notably it now gets the offset right, and handles fragments and
> fraglist as well.

I've pulled splice-net, which indeed fixed some issues, but referencing
slab pages is still is not allowed. There are at least two problems
(although they are related):
1. if we do not increment reference counter for slab pages, they
eventually get refilled and slab exploses after it understood that its
pages are in use (or user dies when page is moved out of his control in
slab).
2. get/put page does not work with slab pages, and simple
increment/decrement of the reference counters is not allowed too.

Both problems have the same root - slab does not allow anyone to 
manipulate page's members. That should be broken/changed to allow splice
to put its hands into network using the fastest way.
I will think about it.

> -- 
> Jens Axboe

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH][RFC] network splice receive
  2007-06-07 14:58         ` Evgeniy Polyakov
@ 2007-06-08  7:48           ` Jens Axboe
  2007-06-08  8:06             ` David Miller
  0 siblings, 1 reply; 30+ messages in thread
From: Jens Axboe @ 2007-06-08  7:48 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: netdev

On Thu, Jun 07 2007, Evgeniy Polyakov wrote:
> On Thu, Jun 07, 2007 at 12:51:59PM +0200, Jens Axboe (jens.axboe@oracle.com) wrote:
> > > What bout checking if page belongs to kmalloc cache (or any other cache
> > > via priviate pointers) and do not perform any kind of reference counting
> > > on them? I will play with this a bit later today.
> > 
> > That might work, but sounds a little dirty... But there's probably no
> > way around. Be sure to look at the #splice-net branch if you are playing
> > with this, I've updated it a number of times and fixed some bugs in
> > there. Notably it now gets the offset right, and handles fragments and
> > fraglist as well.
> 
> I've pulled splice-net, which indeed fixed some issues, but referencing
> slab pages is still is not allowed. There are at least two problems
> (although they are related):
> 1. if we do not increment reference counter for slab pages, they
> eventually get refilled and slab exploses after it understood that its
> pages are in use (or user dies when page is moved out of his control in
> slab).
> 2. get/put page does not work with slab pages, and simple
> increment/decrement of the reference counters is not allowed too.
> 
> Both problems have the same root - slab does not allow anyone to 
> manipulate page's members. That should be broken/changed to allow splice
> to put its hands into network using the fastest way.
> I will think about it.

Perhaps it's possible to solve this at a different level - can we hang
on to the skb until the pipe buffer has been consumed, and prevent reuse
that way? Then we don't have to care what backing the skb has, as long
as it (and its data) isn't being reused until we drop the reference to
it in sock_pipe_buf_release().

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH][RFC] network splice receive
  2007-06-08  7:48           ` Jens Axboe
@ 2007-06-08  8:06             ` David Miller
  2007-06-08  8:38               ` Jens Axboe
  0 siblings, 1 reply; 30+ messages in thread
From: David Miller @ 2007-06-08  8:06 UTC (permalink / raw)
  To: jens.axboe; +Cc: johnpol, netdev

From: Jens Axboe <jens.axboe@oracle.com>
Date: Fri, 8 Jun 2007 09:48:24 +0200

> Perhaps it's possible to solve this at a different level - can we hang
> on to the skb until the pipe buffer has been consumed, and prevent reuse
> that way? Then we don't have to care what backing the skb has, as long
> as it (and its data) isn't being reused until we drop the reference to
> it in sock_pipe_buf_release().

Depending upon whether the pipe buffer consumption is bounded of not,
this will jam up the TCP sender because the SKB data allocation is
charged against the socket send buffer allocation.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH][RFC] network splice receive
  2007-06-08  8:06             ` David Miller
@ 2007-06-08  8:38               ` Jens Axboe
  2007-06-08  8:56                 ` Evgeniy Polyakov
  0 siblings, 1 reply; 30+ messages in thread
From: Jens Axboe @ 2007-06-08  8:38 UTC (permalink / raw)
  To: David Miller; +Cc: johnpol, netdev

On Fri, Jun 08 2007, David Miller wrote:
> From: Jens Axboe <jens.axboe@oracle.com>
> Date: Fri, 8 Jun 2007 09:48:24 +0200
> 
> > Perhaps it's possible to solve this at a different level - can we hang
> > on to the skb until the pipe buffer has been consumed, and prevent reuse
> > that way? Then we don't have to care what backing the skb has, as long
> > as it (and its data) isn't being reused until we drop the reference to
> > it in sock_pipe_buf_release().
> 
> Depending upon whether the pipe buffer consumption is bounded of not,
> this will jam up the TCP sender because the SKB data allocation is
> charged against the socket send buffer allocation.

Forgive my network ignorance, but is that a problem? Since you bring it
up, I guess so :-)

We can grow the pipe, should we have to. So instead of blocking waiting
on reader consumption, we can extend the size of the pipe and keep
going.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH][RFC] network splice receive
  2007-06-08  8:38               ` Jens Axboe
@ 2007-06-08  8:56                 ` Evgeniy Polyakov
  2007-06-08  9:04                   ` Jens Axboe
  0 siblings, 1 reply; 30+ messages in thread
From: Evgeniy Polyakov @ 2007-06-08  8:56 UTC (permalink / raw)
  To: Jens Axboe; +Cc: David Miller, netdev

On Fri, Jun 08, 2007 at 10:38:53AM +0200, Jens Axboe (jens.axboe@oracle.com) wrote:
> On Fri, Jun 08 2007, David Miller wrote:
> > From: Jens Axboe <jens.axboe@oracle.com>
> > Date: Fri, 8 Jun 2007 09:48:24 +0200
> > 
> > > Perhaps it's possible to solve this at a different level - can we hang
> > > on to the skb until the pipe buffer has been consumed, and prevent reuse
> > > that way? Then we don't have to care what backing the skb has, as long
> > > as it (and its data) isn't being reused until we drop the reference to
> > > it in sock_pipe_buf_release().
> > 
> > Depending upon whether the pipe buffer consumption is bounded of not,
> > this will jam up the TCP sender because the SKB data allocation is
> > charged against the socket send buffer allocation.
> 
> Forgive my network ignorance, but is that a problem? Since you bring it
> up, I guess so :-)

David means, that socket bufer allocation is limited, and delaying
freeing can end up with exhausint that limit.

> We can grow the pipe, should we have to. So instead of blocking waiting
> on reader consumption, we can extend the size of the pipe and keep
> going.

I have a code, which roughly works (but I will test it some more), which
just introduces reference counters for slab pages, so that the would not
be actually freed via page reclaim, but only after reference counters
are dropped. That forced changes in mm/slab.c so likely it is
unacceptible solution, but it is interesting as is.

> -- 
> Jens Axboe

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH][RFC] network splice receive
  2007-06-08  8:56                 ` Evgeniy Polyakov
@ 2007-06-08  9:04                   ` Jens Axboe
  2007-06-08 13:58                     ` Evgeniy Polyakov
  0 siblings, 1 reply; 30+ messages in thread
From: Jens Axboe @ 2007-06-08  9:04 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: David Miller, netdev

On Fri, Jun 08 2007, Evgeniy Polyakov wrote:
> On Fri, Jun 08, 2007 at 10:38:53AM +0200, Jens Axboe (jens.axboe@oracle.com) wrote:
> > On Fri, Jun 08 2007, David Miller wrote:
> > > From: Jens Axboe <jens.axboe@oracle.com>
> > > Date: Fri, 8 Jun 2007 09:48:24 +0200
> > > 
> > > > Perhaps it's possible to solve this at a different level - can we hang
> > > > on to the skb until the pipe buffer has been consumed, and prevent reuse
> > > > that way? Then we don't have to care what backing the skb has, as long
> > > > as it (and its data) isn't being reused until we drop the reference to
> > > > it in sock_pipe_buf_release().
> > > 
> > > Depending upon whether the pipe buffer consumption is bounded of not,
> > > this will jam up the TCP sender because the SKB data allocation is
> > > charged against the socket send buffer allocation.
> > 
> > Forgive my network ignorance, but is that a problem? Since you bring it
> > up, I guess so :-)
> 
> David means, that socket bufer allocation is limited, and delaying
> freeing can end up with exhausint that limit.

OK, so a delayed empty of the pipe could end up causing packet drops
elsewhere due to allocation exhaustion?

> > We can grow the pipe, should we have to. So instead of blocking waiting
> > on reader consumption, we can extend the size of the pipe and keep
> > going.
> 
> I have a code, which roughly works (but I will test it some more), which
> just introduces reference counters for slab pages, so that the would not
> be actually freed via page reclaim, but only after reference counters
> are dropped. That forced changes in mm/slab.c so likely it is
> unacceptible solution, but it is interesting as is.

Hmm, still seems like it's working around the problem. We essentially
just need to ensure that the data doesn't get _reused_, not just freed.
It doesn't help holding a reference to the page, if someone else just
reuses it and fills it with other data before it has been consumed and
released by the pipe buffer operations.

That's why I thought the skb referencing was the better idea, then we
don't have to care about the backing of the skb either. Provided that
preventing the free of the skb before the pipe buffer has been consumed
guarantees that the contents aren't reused.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH][RFC] network splice receive
  2007-06-08  9:04                   ` Jens Axboe
@ 2007-06-08 13:58                     ` Evgeniy Polyakov
  2007-06-08 14:14                       ` Jens Axboe
  0 siblings, 1 reply; 30+ messages in thread
From: Evgeniy Polyakov @ 2007-06-08 13:58 UTC (permalink / raw)
  To: Jens Axboe; +Cc: David Miller, netdev

On Fri, Jun 08, 2007 at 11:04:40AM +0200, Jens Axboe (jens.axboe@oracle.com) wrote:
> OK, so a delayed empty of the pipe could end up causing packet drops
> elsewhere due to allocation exhaustion?

Yes.

> > > We can grow the pipe, should we have to. So instead of blocking waiting
> > > on reader consumption, we can extend the size of the pipe and keep
> > > going.
> > 
> > I have a code, which roughly works (but I will test it some more), which
> > just introduces reference counters for slab pages, so that the would not
> > be actually freed via page reclaim, but only after reference counters
> > are dropped. That forced changes in mm/slab.c so likely it is
> > unacceptible solution, but it is interesting as is.
> 
> Hmm, still seems like it's working around the problem. We essentially
> just need to ensure that the data doesn't get _reused_, not just freed.
> It doesn't help holding a reference to the page, if someone else just
> reuses it and fills it with other data before it has been consumed and
> released by the pipe buffer operations.
> 
> That's why I thought the skb referencing was the better idea, then we
> don't have to care about the backing of the skb either. Provided that
> preventing the free of the skb before the pipe buffer has been consumed
> guarantees that the contents aren't reused.

It is not only better idea, it is the only correct one.
Attached patch for interested reader, which does slab pages accounting,
but it is broken. It does not fires up with kernel bug, but it fills
output file with random garbage from reused and dirtied pages. And I do
not know why, but received file is always smaller than file being sent
(when file has resonable size like 10mb, with 4-40kb filesize things
seems to be ok).

I've started skb referencing work, let's see where this will end up.

diff --git a/fs/splice.c b/fs/splice.c
index 928bea0..742e1ee 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -29,6 +29,18 @@
 #include <linux/syscalls.h>
 #include <linux/uio.h>
 
+extern void slab_change_usage(struct page *p);
+
+static inline void splice_page_release(struct page *p)
+{
+	struct page *head = p->first_page;
+	if (!PageSlab(head))
+		page_cache_release(p);
+	else {
+		slab_change_usage(head);
+	}
+}
+
 /*
  * Attempt to steal a page from a pipe buffer. This should perhaps go into
  * a vm helper function, it's already simplified quite a bit by the
@@ -81,7 +93,7 @@ static int page_cache_pipe_buf_steal(struct pipe_inode_info *pipe,
 static void page_cache_pipe_buf_release(struct pipe_inode_info *pipe,
 					struct pipe_buffer *buf)
 {
-	page_cache_release(buf->page);
+	splice_page_release(buf->page);
 	buf->flags &= ~PIPE_BUF_FLAG_LRU;
 }
 
@@ -246,7 +258,7 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
 	}
 
 	while (page_nr < spd->nr_pages)
-		page_cache_release(spd->pages[page_nr++]);
+		splice_page_release(spd->pages[page_nr++]);
 
 	return ret;
 }
@@ -322,7 +334,7 @@ __generic_file_splice_read(struct file *in, loff_t *ppos,
 			error = add_to_page_cache_lru(page, mapping, index,
 					      GFP_KERNEL);
 			if (unlikely(error)) {
-				page_cache_release(page);
+				splice_page_release(page);
 				if (error == -EEXIST)
 					continue;
 				break;
@@ -448,7 +460,7 @@ fill_it:
 	 * we got, 'nr_pages' is how many pages are in the map.
 	 */
 	while (page_nr < nr_pages)
-		page_cache_release(pages[page_nr++]);
+		splice_page_release(pages[page_nr++]);
 
 	if (spd.nr_pages)
 		return splice_to_pipe(pipe, &spd);
@@ -604,7 +616,7 @@ find_page:
 
 		if (ret != AOP_TRUNCATED_PAGE)
 			unlock_page(page);
-		page_cache_release(page);
+		splice_page_release(page);
 		if (ret == AOP_TRUNCATED_PAGE)
 			goto find_page;
 
@@ -634,7 +646,7 @@ find_page:
 	ret = mapping->a_ops->commit_write(file, page, offset, offset+this_len);
 	if (ret) {
 		if (ret == AOP_TRUNCATED_PAGE) {
-			page_cache_release(page);
+			splice_page_release(page);
 			goto find_page;
 		}
 		if (ret < 0)
@@ -651,7 +663,7 @@ find_page:
 	 */
 	mark_page_accessed(page);
 out:
-	page_cache_release(page);
+	splice_page_release(page);
 	unlock_page(page);
 out_ret:
 	return ret;
diff --git a/mm/slab.c b/mm/slab.c
index 2e71a32..673383d 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1649,8 +1649,12 @@ static void *kmem_getpages(struct kmem_cache *cachep, gfp_t flags, int nodeid)
 	else
 		add_zone_page_state(page_zone(page),
 			NR_SLAB_UNRECLAIMABLE, nr_pages);
-	for (i = 0; i < nr_pages; i++)
-		__SetPageSlab(page + i);
+	for (i = 0; i < nr_pages; i++) {
+		struct page *p = page + i;
+		__SetPageSlab(p);
+		p->first_page = page;
+	}
+	atomic_inc(&page->_count);
 	return page_address(page);
 }
 
@@ -1662,6 +1666,12 @@ static void kmem_freepages(struct kmem_cache *cachep, void *addr)
 	unsigned long i = (1 << cachep->gfporder);
 	struct page *page = virt_to_page(addr);
 	const unsigned long nr_freed = i;
+	struct page *mp = page->first_page;
+
+	if (atomic_dec_return(&mp->_count) != 1) {
+		printk("%s: page: %p, head: %p, order: %d, count: %d.\n", __func__, page, mp, cachep->gfporder, atomic_read(&mp->_count));
+		return;
+	}
 
 	if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
 		sub_zone_page_state(page_zone(page),
@@ -1679,6 +1689,36 @@ static void kmem_freepages(struct kmem_cache *cachep, void *addr)
 	free_pages((unsigned long)addr, cachep->gfporder);
 }
 
+void slab_change_usage(struct page *page)
+{
+	struct kmem_cache *cachep = page_get_cache(page);
+	unsigned long i = (1 << cachep->gfporder);
+	const unsigned long nr_freed = i;
+	struct page *p = page = page->first_page;
+
+	//printk("%s: page: %p, count: %d, order: %u.\n", __func__, page, atomic_read(&page->_count), cachep->gfporder);
+
+	if (atomic_dec_return(&page->_count) != 1)
+		return;
+	
+	//printk("%s: page: %p, count: %d, freeing.\n", __func__, page, atomic_read(&page->_count));
+
+	if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
+		sub_zone_page_state(page_zone(page),
+				NR_SLAB_RECLAIMABLE, nr_freed);
+	else
+		sub_zone_page_state(page_zone(page),
+				NR_SLAB_UNRECLAIMABLE, nr_freed);
+	while (i--) {
+		BUG_ON(!PageSlab(p));
+		__ClearPageSlab(p);
+		p++;
+	}
+	if (current->reclaim_state)
+		current->reclaim_state->reclaimed_slab += nr_freed;
+	__free_pages(page, cachep->gfporder);
+}
+
 static void kmem_rcu_free(struct rcu_head *head)
 {
 	struct slab_rcu *slab_rcu = (struct slab_rcu *)head;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 0312bd3..7c027cb 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -72,13 +72,20 @@
 static struct kmem_cache *skbuff_head_cache __read_mostly;
 static struct kmem_cache *skbuff_fclone_cache __read_mostly;
 
+extern void slab_change_usage(struct page *p);
+
 static void sock_pipe_buf_release(struct pipe_inode_info *pipe,
 				  struct pipe_buffer *buf)
 {
+	struct page *p = buf->page->first_page;
 #ifdef NET_COPY_SPLICE
 	__free_page(buf->page);
 #else
-	put_page(buf->page);
+	if (!PageSlab(p))
+		put_page(buf->page);
+	else {
+		slab_change_usage(p);
+	}
 #endif
 }
 
@@ -1162,8 +1169,12 @@ static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page,
 
 	memcpy(page_address(p) + offset, page_address(page) + offset, len);
 #else
-	p = page;
-	get_page(p);
+	p = page->first_page;
+	atomic_inc(&p->_count);
+#if 0
+	printk("%s: p: %p, page: %p, slab: %d, head_count: %d, page_count: %d.\n", 
+			__func__, p, page, PageSlab(p), atomic_read(&p->_count), atomic_read(&page->_count));
+#endif
 #endif
 
 	spd->pages[spd->nr_pages] = p;
@@ -1194,7 +1205,7 @@ static int __skb_splice_bits(const struct sk_buff *skb, unsigned int *offset,
 	for (seg = 0; len < headlen && *total_len; len += PAGE_SIZE - poff) {
 		void *p = skb->data + len;
 
-		poff = (unsigned long) p & (PAGE_SIZE - 1);
+		poff = offset_in_page(p);
 		plen = min_t(unsigned, headlen - len, PAGE_SIZE - poff);
 
 		if (*offset) {
@@ -1209,7 +1220,10 @@ static int __skb_splice_bits(const struct sk_buff *skb, unsigned int *offset,
 
 		if (plen > *total_len)
 			plen = *total_len;
-
+#if 0
+		printk("%s: skb_data: %p, ptr: %p, len: %u, headlen: %u, page: %p, plen: %u, poff: %u.\n",
+				__func__, skb->data, p, len, headlen, virt_to_page(p), plen, poff);
+#endif
 		if (spd_fill_page(spd, virt_to_page(p), plen, poff))
 			break;
 


> -- 
> Jens Axboe

-- 
	Evgeniy Polyakov

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH][RFC] network splice receive
  2007-06-08 13:58                     ` Evgeniy Polyakov
@ 2007-06-08 14:14                       ` Jens Axboe
  2007-06-08 14:57                         ` Evgeniy Polyakov
  0 siblings, 1 reply; 30+ messages in thread
From: Jens Axboe @ 2007-06-08 14:14 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: David Miller, netdev

On Fri, Jun 08 2007, Evgeniy Polyakov wrote:
> On Fri, Jun 08, 2007 at 11:04:40AM +0200, Jens Axboe (jens.axboe@oracle.com) wrote:
> > OK, so a delayed empty of the pipe could end up causing packet drops
> > elsewhere due to allocation exhaustion?
> 
> Yes.
> 
> > > > We can grow the pipe, should we have to. So instead of blocking waiting
> > > > on reader consumption, we can extend the size of the pipe and keep
> > > > going.
> > > 
> > > I have a code, which roughly works (but I will test it some more), which
> > > just introduces reference counters for slab pages, so that the would not
> > > be actually freed via page reclaim, but only after reference counters
> > > are dropped. That forced changes in mm/slab.c so likely it is
> > > unacceptible solution, but it is interesting as is.
> > 
> > Hmm, still seems like it's working around the problem. We essentially
> > just need to ensure that the data doesn't get _reused_, not just freed.
> > It doesn't help holding a reference to the page, if someone else just
> > reuses it and fills it with other data before it has been consumed and
> > released by the pipe buffer operations.
> > 
> > That's why I thought the skb referencing was the better idea, then we
> > don't have to care about the backing of the skb either. Provided that
> > preventing the free of the skb before the pipe buffer has been consumed
> > guarantees that the contents aren't reused.
> 
> It is not only better idea, it is the only correct one.
> Attached patch for interested reader, which does slab pages accounting,
> but it is broken. It does not fires up with kernel bug, but it fills
> output file with random garbage from reused and dirtied pages. And I do
> not know why, but received file is always smaller than file being sent
> (when file has resonable size like 10mb, with 4-40kb filesize things
> seems to be ok).
> 
> I've started skb referencing work, let's see where this will end up.

Here's a start, for the splice side at least of storing a buf-private
entity with the ops.

diff --git a/fs/splice.c b/fs/splice.c
index 90588a8..f24e367 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -191,6 +191,7 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
 			buf->page = spd->pages[page_nr];
 			buf->offset = spd->partial[page_nr].offset;
 			buf->len = spd->partial[page_nr].len;
+			buf->private = spd->partial[page_nr].private;
 			buf->ops = spd->ops;
 			if (spd->flags & SPLICE_F_GIFT)
 				buf->flags |= PIPE_BUF_FLAG_GIFT;
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index 7ba228d..4409167 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -14,6 +14,7 @@ struct pipe_buffer {
 	unsigned int offset, len;
 	const struct pipe_buf_operations *ops;
 	unsigned int flags;
+	unsigned long private;
 };
 
 struct pipe_inode_info {
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 619dcf5..64e3eed 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1504,7 +1504,7 @@ extern int	       skb_store_bits(struct sk_buff *skb, int offset,
 extern __wsum	       skb_copy_and_csum_bits(const struct sk_buff *skb,
 					      int offset, u8 *to, int len,
 					      __wsum csum);
-extern int             skb_splice_bits(const struct sk_buff *skb,
+extern int             skb_splice_bits(struct sk_buff *skb,
 						unsigned int offset,
 						struct pipe_inode_info *pipe,
 						unsigned int len,
diff --git a/include/linux/splice.h b/include/linux/splice.h
index b3f1528..1a1182b 100644
--- a/include/linux/splice.h
+++ b/include/linux/splice.h
@@ -41,6 +41,7 @@ struct splice_desc {
 struct partial_page {
 	unsigned int offset;
 	unsigned int len;
+	unsigned long private;
 };
 
 /*
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index d2b2547..7d9ec9e 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -78,7 +78,10 @@ static void sock_pipe_buf_release(struct pipe_inode_info *pipe,
 #ifdef NET_COPY_SPLICE
 	__free_page(buf->page);
 #else
-	put_page(buf->page);
+	struct sk_buff *skb = (struct sk_buff *) buf->private;
+
+	kfree_skb(skb);
+	//put_page(buf->page);
 #endif
 }
 
@@ -1148,7 +1151,8 @@ fault:
  * Fill page/offset/length into spd, if it can hold more pages.
  */
 static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page,
-				unsigned int len, unsigned int offset)
+				unsigned int len, unsigned int offset,
+				struct sk_buff *skb)
 {
 	struct page *p;
 
@@ -1163,12 +1167,14 @@ static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page,
 	memcpy(page_address(p) + offset, page_address(page) + offset, len);
 #else
 	p = page;
-	get_page(p);
+	skb_get(skb);
+	//get_page(p);
 #endif
 
 	spd->pages[spd->nr_pages] = p;
 	spd->partial[spd->nr_pages].len = len;
 	spd->partial[spd->nr_pages].offset = offset;
+	spd->partial[spd->nr_pages].private = (unsigned long) skb;
 	spd->nr_pages++;
 	return 0;
 }
@@ -1177,7 +1183,7 @@ static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page,
  * Map linear and fragment data from the skb to spd. Returns number of
  * pages mapped.
  */
-static int __skb_splice_bits(const struct sk_buff *skb, unsigned int *offset,
+static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset,
 			     unsigned int *total_len,
 			     struct splice_pipe_desc *spd)
 {
@@ -1210,7 +1216,7 @@ static int __skb_splice_bits(const struct sk_buff *skb, unsigned int *offset,
 		if (plen > *total_len)
 			plen = *total_len;
 
-		if (spd_fill_page(spd, virt_to_page(p), plen, poff))
+		if (spd_fill_page(spd, virt_to_page(p), plen, poff, skb))
 			break;
 
 		*total_len -= plen;
@@ -1238,7 +1244,7 @@ static int __skb_splice_bits(const struct sk_buff *skb, unsigned int *offset,
 		if (plen > *total_len)
 			plen = *total_len;
 
-		if (spd_fill_page(spd, f->page, plen, poff))
+		if (spd_fill_page(spd, f->page, plen, poff, skb))
 			break;
 
 		*total_len -= plen;
@@ -1253,7 +1259,7 @@ static int __skb_splice_bits(const struct sk_buff *skb, unsigned int *offset,
  * the frag list, if such a thing exists. We'd probably need to recurse to
  * handle that cleanly.
  */
-int skb_splice_bits(const struct sk_buff *skb, unsigned int offset,
+int skb_splice_bits(struct sk_buff *skb, unsigned int offset,
 		    struct pipe_inode_info *pipe, unsigned int tlen,
 		    unsigned int flags)
 {

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH][RFC] network splice receive
  2007-06-08 14:14                       ` Jens Axboe
@ 2007-06-08 14:57                         ` Evgeniy Polyakov
  2007-06-08 15:19                           ` Jens Axboe
  2007-06-08 15:30                           ` Evgeniy Polyakov
  0 siblings, 2 replies; 30+ messages in thread
From: Evgeniy Polyakov @ 2007-06-08 14:57 UTC (permalink / raw)
  To: Jens Axboe; +Cc: David Miller, netdev

On Fri, Jun 08, 2007 at 04:14:52PM +0200, Jens Axboe (jens.axboe@oracle.com) wrote:
> Here's a start, for the splice side at least of storing a buf-private
> entity with the ops.

:) I tested the same implementation, but I put skb pointer into
page->private. My approach is not correct, since the same page can hold
several objects, so if there are several splicers, this will scream.
I've tested your patch on top of splice-net branch, here is a result:

[   44.798853] Slab corruption: skbuff_head_cache start=ffff81003b726668, len=192
[   44.806148] Redzone: 0x9f911029d74e35b/0x9f911029d74e35b.
[   44.811598] Last user: [<ffffffff803699fd>](kfree_skbmem+0x7a/0x7e)
[   44.818012] 0b0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6a 6b 6b a5
[   44.824889] Prev obj: start=ffff81003b726590, len=192
[   44.829985] Redzone: 0xd84156c5635688c0/0xd84156c5635688c0.
[   44.835604] Last user: [<ffffffff8036a22c>](__alloc_skb+0x40/0x13f)
[   44.842010] 000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[   44.848896] 010: 20 58 7e 3b 00 81 ff ff 00 00 00 00 00 00 00 00
[   44.855772] Next obj: start=ffff81003b726740, len=192
[   44.860868] Redzone: 0x9f911029d74e35b/0x9f911029d74e35b.
[   44.866314] Last user: [<ffffffff803699fd>](kfree_skbmem+0x7a/0x7e)
[   44.872721] 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
[   44.879597] 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b

I will try some things for the nearest 30-60 minutes, and then will move to
canoe trip until thuesday, so will not be able to work on this idea.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH][RFC] network splice receive
  2007-06-08 14:57                         ` Evgeniy Polyakov
@ 2007-06-08 15:19                           ` Jens Axboe
  2007-06-08 15:30                           ` Evgeniy Polyakov
  1 sibling, 0 replies; 30+ messages in thread
From: Jens Axboe @ 2007-06-08 15:19 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: David Miller, netdev

On Fri, Jun 08 2007, Evgeniy Polyakov wrote:
> On Fri, Jun 08, 2007 at 04:14:52PM +0200, Jens Axboe (jens.axboe@oracle.com) wrote:
> > Here's a start, for the splice side at least of storing a buf-private
> > entity with the ops.
> 
> :) I tested the same implementation, but I put skb pointer into
> page->private. My approach is not correct, since the same page can hold
> several objects, so if there are several splicers, this will scream.
> I've tested your patch on top of splice-net branch, here is a result:
> 
> [   44.798853] Slab corruption: skbuff_head_cache start=ffff81003b726668, len=192
> [   44.806148] Redzone: 0x9f911029d74e35b/0x9f911029d74e35b.
> [   44.811598] Last user: [<ffffffff803699fd>](kfree_skbmem+0x7a/0x7e)
> [   44.818012] 0b0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6a 6b 6b a5
> [   44.824889] Prev obj: start=ffff81003b726590, len=192
> [   44.829985] Redzone: 0xd84156c5635688c0/0xd84156c5635688c0.
> [   44.835604] Last user: [<ffffffff8036a22c>](__alloc_skb+0x40/0x13f)
> [   44.842010] 000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> [   44.848896] 010: 20 58 7e 3b 00 81 ff ff 00 00 00 00 00 00 00 00
> [   44.855772] Next obj: start=ffff81003b726740, len=192
> [   44.860868] Redzone: 0x9f911029d74e35b/0x9f911029d74e35b.
> [   44.866314] Last user: [<ffffffff803699fd>](kfree_skbmem+0x7a/0x7e)
> [   44.872721] 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> [   44.879597] 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> 
> I will try some things for the nearest 30-60 minutes, and then will move to
> canoe trip until thuesday, so will not be able to work on this idea.

I'm not surprised, it wasn't tested at all - just provides the basic
framework for storing the skb so we can access it on pipe buffer
release.

Lets talk more next week, I'll likely play with this approach on monday.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH][RFC] network splice receive
  2007-06-08 14:57                         ` Evgeniy Polyakov
  2007-06-08 15:19                           ` Jens Axboe
@ 2007-06-08 15:30                           ` Evgeniy Polyakov
  2007-06-09  6:36                             ` Jens Axboe
  2007-06-11  8:00                             ` Jens Axboe
  1 sibling, 2 replies; 30+ messages in thread
From: Evgeniy Polyakov @ 2007-06-08 15:30 UTC (permalink / raw)
  To: Jens Axboe; +Cc: David Miller, netdev

On Fri, Jun 08, 2007 at 06:57:25PM +0400, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote:
> I will try some things for the nearest 30-60 minutes, and then will move to
> canoe trip until thuesday, so will not be able to work on this idea.

Ok, replacing in fs/splice.c every page_cache_release() with
static void splice_page_release(struct page *p)
{
	if (!PageSlab(p))
		page_cache_release(p);
}

and putting cloned skb into private field instead of 
original on in spd_fill_page() ends up without kernel hung.

I'm not sure it is correct, that page can be released in fs/splice.c
without calling any callback from network code, when network data is
being processed.

Size of the received file is bigger than file sent, file contains repeated
blocks of data sometimes. Cloned skb usage is likely too big overhead,
although for receiving fast clone is unused in most cases, so there
might be some gain.

Attached your patch with above changes.

diff --git a/fs/splice.c b/fs/splice.c
index 928bea0..a75dc56 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -29,6 +29,12 @@
 #include <linux/syscalls.h>
 #include <linux/uio.h>
 
+static void splice_page_release(struct page *p)
+{
+	if (!PageSlab(p))
+		page_cache_release(p);
+}
+
 /*
  * Attempt to steal a page from a pipe buffer. This should perhaps go into
  * a vm helper function, it's already simplified quite a bit by the
@@ -81,7 +87,7 @@ static int page_cache_pipe_buf_steal(struct pipe_inode_info *pipe,
 static void page_cache_pipe_buf_release(struct pipe_inode_info *pipe,
 					struct pipe_buffer *buf)
 {
-	page_cache_release(buf->page);
+	splice_page_release(buf->page);
 	buf->flags &= ~PIPE_BUF_FLAG_LRU;
 }
 
@@ -191,6 +197,7 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
 			buf->page = spd->pages[page_nr];
 			buf->offset = spd->partial[page_nr].offset;
 			buf->len = spd->partial[page_nr].len;
+			buf->private = spd->partial[page_nr].private;
 			buf->ops = spd->ops;
 			if (spd->flags & SPLICE_F_GIFT)
 				buf->flags |= PIPE_BUF_FLAG_GIFT;
@@ -246,7 +253,7 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
 	}
 
 	while (page_nr < spd->nr_pages)
-		page_cache_release(spd->pages[page_nr++]);
+		splice_page_release(spd->pages[page_nr++]);
 
 	return ret;
 }
@@ -322,7 +329,7 @@ __generic_file_splice_read(struct file *in, loff_t *ppos,
 			error = add_to_page_cache_lru(page, mapping, index,
 					      GFP_KERNEL);
 			if (unlikely(error)) {
-				page_cache_release(page);
+				splice_page_release(page);
 				if (error == -EEXIST)
 					continue;
 				break;
@@ -448,7 +455,7 @@ fill_it:
 	 * we got, 'nr_pages' is how many pages are in the map.
 	 */
 	while (page_nr < nr_pages)
-		page_cache_release(pages[page_nr++]);
+		splice_page_release(pages[page_nr++]);
 
 	if (spd.nr_pages)
 		return splice_to_pipe(pipe, &spd);
@@ -604,7 +611,7 @@ find_page:
 
 		if (ret != AOP_TRUNCATED_PAGE)
 			unlock_page(page);
-		page_cache_release(page);
+		splice_page_release(page);
 		if (ret == AOP_TRUNCATED_PAGE)
 			goto find_page;
 
@@ -634,7 +641,7 @@ find_page:
 	ret = mapping->a_ops->commit_write(file, page, offset, offset+this_len);
 	if (ret) {
 		if (ret == AOP_TRUNCATED_PAGE) {
-			page_cache_release(page);
+			splice_page_release(page);
 			goto find_page;
 		}
 		if (ret < 0)
@@ -651,7 +658,7 @@ find_page:
 	 */
 	mark_page_accessed(page);
 out:
-	page_cache_release(page);
+	splice_page_release(page);
 	unlock_page(page);
 out_ret:
 	return ret;
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index 7ba228d..4409167 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -14,6 +14,7 @@ struct pipe_buffer {
 	unsigned int offset, len;
 	const struct pipe_buf_operations *ops;
 	unsigned int flags;
+	unsigned long private;
 };
 
 struct pipe_inode_info {
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 619dcf5..64e3eed 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1504,7 +1504,7 @@ extern int	       skb_store_bits(struct sk_buff *skb, int offset,
 extern __wsum	       skb_copy_and_csum_bits(const struct sk_buff *skb,
 					      int offset, u8 *to, int len,
 					      __wsum csum);
-extern int             skb_splice_bits(const struct sk_buff *skb,
+extern int             skb_splice_bits(struct sk_buff *skb,
 						unsigned int offset,
 						struct pipe_inode_info *pipe,
 						unsigned int len,
diff --git a/include/linux/splice.h b/include/linux/splice.h
index b3f1528..1a1182b 100644
--- a/include/linux/splice.h
+++ b/include/linux/splice.h
@@ -41,6 +41,7 @@ struct splice_desc {
 struct partial_page {
 	unsigned int offset;
 	unsigned int len;
+	unsigned long private;
 };
 
 /*
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 0312bd3..3306d90 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -78,7 +78,9 @@ static void sock_pipe_buf_release(struct pipe_inode_info *pipe,
 #ifdef NET_COPY_SPLICE
 	__free_page(buf->page);
 #else
-	put_page(buf->page);
+	struct sk_buff *skb = (struct sk_buff *) buf->private;
+
+	kfree_skb(skb);
 #endif
 }
 
@@ -1148,9 +1150,15 @@ fault:
  * Fill page/offset/length into spd, if it can hold more pages.
  */
 static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page,
-				unsigned int len, unsigned int offset)
+				unsigned int len, unsigned int offset,
+				struct sk_buff *__skb)
 {
 	struct page *p;
+	struct sk_buff *skb = skb_clone(__skb, GFP_KERNEL);
+
+	if (!skb)
+		return 1;
+
 
 	if (unlikely(spd->nr_pages == PIPE_BUFFERS))
 		return 1;
@@ -1163,12 +1171,12 @@ static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page,
 	memcpy(page_address(p) + offset, page_address(page) + offset, len);
 #else
 	p = page;
-	get_page(p);
 #endif
 
 	spd->pages[spd->nr_pages] = p;
 	spd->partial[spd->nr_pages].len = len;
 	spd->partial[spd->nr_pages].offset = offset;
+	spd->partial[spd->nr_pages].private = (unsigned long) skb;
 	spd->nr_pages++;
 	return 0;
 }
@@ -1177,7 +1185,7 @@ static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page,
  * Map linear and fragment data from the skb to spd. Returns number of
  * pages mapped.
  */
-static int __skb_splice_bits(const struct sk_buff *skb, unsigned int *offset,
+static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset,
 			     unsigned int *total_len,
 			     struct splice_pipe_desc *spd)
 {
@@ -1210,7 +1218,7 @@ static int __skb_splice_bits(const struct sk_buff *skb, unsigned int *offset,
 		if (plen > *total_len)
 			plen = *total_len;
 
-		if (spd_fill_page(spd, virt_to_page(p), plen, poff))
+		if (spd_fill_page(spd, virt_to_page(p), plen, poff, skb))
 			break;
 
 		*total_len -= plen;
@@ -1238,7 +1246,7 @@ static int __skb_splice_bits(const struct sk_buff *skb, unsigned int *offset,
 		if (plen > *total_len)
 			plen = *total_len;
 
-		if (spd_fill_page(spd, f->page, plen, poff))
+		if (spd_fill_page(spd, f->page, plen, poff, skb))
 			break;
 
 		*total_len -= plen;
@@ -1253,7 +1261,7 @@ static int __skb_splice_bits(const struct sk_buff *skb, unsigned int *offset,
  * the frag list, if such a thing exists. We'd probably need to recurse to
  * handle that cleanly.
  */
-int skb_splice_bits(const struct sk_buff *skb, unsigned int offset,
+int skb_splice_bits(struct sk_buff *skb, unsigned int offset,
 		    struct pipe_inode_info *pipe, unsigned int tlen,
 		    unsigned int flags)
 {

-- 
	Evgeniy Polyakov

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH][RFC] network splice receive
  2007-06-08 15:30                           ` Evgeniy Polyakov
@ 2007-06-09  6:36                             ` Jens Axboe
  2007-06-12 11:29                               ` Evgeniy Polyakov
  2007-06-11  8:00                             ` Jens Axboe
  1 sibling, 1 reply; 30+ messages in thread
From: Jens Axboe @ 2007-06-09  6:36 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: David Miller, netdev

On Fri, Jun 08 2007, Evgeniy Polyakov wrote:
> On Fri, Jun 08, 2007 at 06:57:25PM +0400, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote:
> > I will try some things for the nearest 30-60 minutes, and then will move to
> > canoe trip until thuesday, so will not be able to work on this idea.
> 
> Ok, replacing in fs/splice.c every page_cache_release() with
> static void splice_page_release(struct page *p)
> {
> 	if (!PageSlab(p))
> 		page_cache_release(p);
> }

Ehm, I don't see why that should be necessary. Except in
splice_to_pipe(), I have considered that we need to pass in a release
function if mapping fails at some point. But it's probably best to do
that in the caller, since they have the knowledge of how to release the
pages.

The rest of the PageSlab() tests are bogus.

> and putting cloned skb into private field instead of 
> original on in spd_fill_page() ends up without kernel hung.

Why? Seems pointless to allocate a clone just to hold on to the skb, a
reference should be equally good. I would not be opposed to doing it
this way, I just don't see what a clone buys us as compared to just
holding that reference to the skb.

> I'm not sure it is correct, that page can be released in fs/splice.c
> without calling any callback from network code, when network data is
> being processed.

Please explain!

> Size of the received file is bigger than file sent, file contains repeated
> blocks of data sometimes. Cloned skb usage is likely too big overhead,
> although for receiving fast clone is unused in most cases, so there
> might be some gain.
> 
> Attached your patch with above changes.

Thanks, I'll fiddle with this on monday.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH][RFC] network splice receive
  2007-06-08 15:30                           ` Evgeniy Polyakov
  2007-06-09  6:36                             ` Jens Axboe
@ 2007-06-11  8:00                             ` Jens Axboe
  1 sibling, 0 replies; 30+ messages in thread
From: Jens Axboe @ 2007-06-11  8:00 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: David Miller, netdev

On Fri, Jun 08 2007, Evgeniy Polyakov wrote:
> Size of the received file is bigger than file sent, file contains repeated
> blocks of data sometimes. Cloned skb usage is likely too big overhead,
> although for receiving fast clone is unused in most cases, so there
> might be some gain.

That was actually a new bug, here:

        plen -= *offset;
        poff += *offset;

in __skb_slice_bits(), we should only subtract the offset from plen, not
add to poff. Then we just create some weird hole without any meaning. So
remove those two poff additions in the two loops, and the size issue is
resolved at least.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH][RFC] network splice receive
  2007-06-09  6:36                             ` Jens Axboe
@ 2007-06-12 11:29                               ` Evgeniy Polyakov
  2007-06-12 11:33                                 ` Jens Axboe
  0 siblings, 1 reply; 30+ messages in thread
From: Evgeniy Polyakov @ 2007-06-12 11:29 UTC (permalink / raw)
  To: Jens Axboe; +Cc: David Miller, netdev

On Sat, Jun 09, 2007 at 08:36:09AM +0200, Jens Axboe (jens.axboe@oracle.com) wrote:
> On Fri, Jun 08 2007, Evgeniy Polyakov wrote:
> > On Fri, Jun 08, 2007 at 06:57:25PM +0400, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote:
> > > I will try some things for the nearest 30-60 minutes, and then will move to
> > > canoe trip until thuesday, so will not be able to work on this idea.
> > 
> > Ok, replacing in fs/splice.c every page_cache_release() with
> > static void splice_page_release(struct page *p)
> > {
> > 	if (!PageSlab(p))
> > 		page_cache_release(p);
> > }
> 
> Ehm, I don't see why that should be necessary. Except in
> splice_to_pipe(), I have considered that we need to pass in a release
> function if mapping fails at some point. But it's probably best to do
> that in the caller, since they have the knowledge of how to release the
> pages.
> 
> The rest of the PageSlab() tests are bogus.

I had a crashdump, where page was released via splice_to_pipe() indeed,
I did not investigate if it is possible to release provided page in
other places. I think if in future there will other slab usage cases
except networking receiving, that might be useful, but as is it is not
needed.

> > and putting cloned skb into private field instead of 
> > original on in spd_fill_page() ends up without kernel hung.
> 
> Why? Seems pointless to allocate a clone just to hold on to the skb, a
> reference should be equally good. I would not be opposed to doing it
> this way, I just don't see what a clone buys us as compared to just
> holding that reference to the skb.

Receiving code does not expect shared skbs - too many fields are changed
with assumptions that it is a private copy.

> > I'm not sure it is correct, that page can be released in fs/splice.c
> > without calling any callback from network code, when network data is
> > being processed.
> 
> Please explain!

I had a crashdump, where page was attempted to be released in
fs/splice.c:splice_to_pipe(), I do not have details handy, but the best
solution would be to provide a release callback and use that instead of
page_cache_release().

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH][RFC] network splice receive
  2007-06-12 11:29                               ` Evgeniy Polyakov
@ 2007-06-12 11:33                                 ` Jens Axboe
  2007-06-12 12:35                                   ` Evgeniy Polyakov
  0 siblings, 1 reply; 30+ messages in thread
From: Jens Axboe @ 2007-06-12 11:33 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: David Miller, netdev

On Tue, Jun 12 2007, Evgeniy Polyakov wrote:
> On Sat, Jun 09, 2007 at 08:36:09AM +0200, Jens Axboe (jens.axboe@oracle.com) wrote:
> > On Fri, Jun 08 2007, Evgeniy Polyakov wrote:
> > > On Fri, Jun 08, 2007 at 06:57:25PM +0400, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote:
> > > > I will try some things for the nearest 30-60 minutes, and then will move to
> > > > canoe trip until thuesday, so will not be able to work on this idea.
> > > 
> > > Ok, replacing in fs/splice.c every page_cache_release() with
> > > static void splice_page_release(struct page *p)
> > > {
> > > 	if (!PageSlab(p))
> > > 		page_cache_release(p);
> > > }
> > 
> > Ehm, I don't see why that should be necessary. Except in
> > splice_to_pipe(), I have considered that we need to pass in a release
> > function if mapping fails at some point. But it's probably best to do
> > that in the caller, since they have the knowledge of how to release the
> > pages.
> > 
> > The rest of the PageSlab() tests are bogus.
> 
> I had a crashdump, where page was released via splice_to_pipe() indeed,
> I did not investigate if it is possible to release provided page in
> other places. I think if in future there will other slab usage cases
> except networking receiving, that might be useful, but as is it is not
> needed.

Read the just posted code, it has moved way beyond this :-)

> > > and putting cloned skb into private field instead of 
> > > original on in spd_fill_page() ends up without kernel hung.
> > 
> > Why? Seems pointless to allocate a clone just to hold on to the skb, a
> > reference should be equally good. I would not be opposed to doing it
> > this way, I just don't see what a clone buys us as compared to just
> > holding that reference to the skb.
> 
> Receiving code does not expect shared skbs - too many fields are changed
> with assumptions that it is a private copy.

Actually the main problem is that tcp_read_sock() unconditionally frees
the skb, so it wouldn't help if we grabbed a reference to it. I've yet
to receive an explanation of why it does so, seem awkward and violates
the whole principle of reference counted objects. Davem??

So for now, skb_splice_bits() clones the incoming skb to avoid that. I'd
hope we can get rid of that by fixing tcp_read_sock(), though.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH][RFC] network splice receive
  2007-06-12 11:33                                 ` Jens Axboe
@ 2007-06-12 12:35                                   ` Evgeniy Polyakov
  2007-06-12 12:40                                     ` Jens Axboe
  0 siblings, 1 reply; 30+ messages in thread
From: Evgeniy Polyakov @ 2007-06-12 12:35 UTC (permalink / raw)
  To: Jens Axboe; +Cc: David Miller, netdev

On Tue, Jun 12, 2007 at 01:33:54PM +0200, Jens Axboe (jens.axboe@oracle.com) wrote:
> > I had a crashdump, where page was released via splice_to_pipe() indeed,
> > I did not investigate if it is possible to release provided page in
> > other places. I think if in future there will other slab usage cases
> > except networking receiving, that might be useful, but as is it is not
> > needed.
> 
> Read the just posted code, it has moved way beyond this :-)

It is just a side result of traditional optimization technique called 
"vim ':%s/page_cache_release/splice_page_release'" :)

> > > > and putting cloned skb into private field instead of 
> > > > original on in spd_fill_page() ends up without kernel hung.
> > > 
> > > Why? Seems pointless to allocate a clone just to hold on to the skb, a
> > > reference should be equally good. I would not be opposed to doing it
> > > this way, I just don't see what a clone buys us as compared to just
> > > holding that reference to the skb.
> > 
> > Receiving code does not expect shared skbs - too many fields are changed
> > with assumptions that it is a private copy.
> 
> Actually the main problem is that tcp_read_sock() unconditionally frees
> the skb, so it wouldn't help if we grabbed a reference to it. I've yet
> to receive an explanation of why it does so, seem awkward and violates
> the whole principle of reference counted objects. Davem??
> 
> So for now, skb_splice_bits() clones the incoming skb to avoid that. I'd
> hope we can get rid of that by fixing tcp_read_sock(), though.

It does that because it knows, that skb is not allowed to be shared
there. Similar things are being done in udp for example - code changes
internal mebers of skb, since it knows skb is not shared.

For example generic_make_request() is not allowed to change, say, 
bio->bi_sector or bi_destructor, since it does not own a block request, 
not matter what bi_cnt is. From another side, ->bi_destructor() can do
whatever it wants with bio without any check for its reference counter.

According to sk_eat_skb() - it is an optimisation to remove atomic
check.

> -- 
> Jens Axboe

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH][RFC] network splice receive
  2007-06-12 12:35                                   ` Evgeniy Polyakov
@ 2007-06-12 12:40                                     ` Jens Axboe
  2007-06-12 13:11                                       ` Evgeniy Polyakov
  0 siblings, 1 reply; 30+ messages in thread
From: Jens Axboe @ 2007-06-12 12:40 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: David Miller, netdev

On Tue, Jun 12 2007, Evgeniy Polyakov wrote:
> > > > > and putting cloned skb into private field instead of 
> > > > > original on in spd_fill_page() ends up without kernel hung.
> > > > 
> > > > Why? Seems pointless to allocate a clone just to hold on to the skb, a
> > > > reference should be equally good. I would not be opposed to doing it
> > > > this way, I just don't see what a clone buys us as compared to just
> > > > holding that reference to the skb.
> > > 
> > > Receiving code does not expect shared skbs - too many fields are changed
> > > with assumptions that it is a private copy.
> > 
> > Actually the main problem is that tcp_read_sock() unconditionally frees
> > the skb, so it wouldn't help if we grabbed a reference to it. I've yet
> > to receive an explanation of why it does so, seem awkward and violates
> > the whole principle of reference counted objects. Davem??
> > 
> > So for now, skb_splice_bits() clones the incoming skb to avoid that. I'd
> > hope we can get rid of that by fixing tcp_read_sock(), though.
> 
> It does that because it knows, that skb is not allowed to be shared
> there. Similar things are being done in udp for example - code changes
> internal mebers of skb, since it knows skb is not shared.
> 
> For example generic_make_request() is not allowed to change, say, 
> bio->bi_sector or bi_destructor, since it does not own a block request, 
> not matter what bi_cnt is. From another side, ->bi_destructor() can do
> whatever it wants with bio without any check for its reference counter.

But generic_make_request() DOES change ->bi_sector, that's how partition
remapping works :-). The destructor can of course do whatever it wants,
by definition the bio is not referenced at that point (or it would not
have been called). So while I think your analogy is quite poor, I do now
follow the code (even if I think it's ugly). There's quite a big
difference between changing parts of the elements of a structure to just
grabbing a reference to it. If the skb cannot be referenced, skb_get()
should return NULL.

But that aside, I see the issue. I'll just stick to the clone, it works
fine as-is (well almost, there's a leak there, but functionally it's
ok!).

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH][RFC] network splice receive
  2007-06-12 13:11                                       ` Evgeniy Polyakov
@ 2007-06-12 13:11                                         ` Jens Axboe
  0 siblings, 0 replies; 30+ messages in thread
From: Jens Axboe @ 2007-06-12 13:11 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: David Miller, netdev

On Tue, Jun 12 2007, Evgeniy Polyakov wrote:
> > difference between changing parts of the elements of a structure to just
> > grabbing a reference to it. If the skb cannot be referenced, skb_get()
> > should return NULL.
> > 
> > But that aside, I see the issue. I'll just stick to the clone, it works
> > fine as-is (well almost, there's a leak there, but functionally it's
> > ok!).
> 
> Btw, is it allowed to use splice from network with, say, nfs?
> Since RPC code uses sk_user_data as long as network splice.

It doesn't anymore, see the version posted today (or yesterday, but it
would be silly to read older code than the newest :-)

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH][RFC] network splice receive
  2007-06-12 12:40                                     ` Jens Axboe
@ 2007-06-12 13:11                                       ` Evgeniy Polyakov
  2007-06-12 13:11                                         ` Jens Axboe
  0 siblings, 1 reply; 30+ messages in thread
From: Evgeniy Polyakov @ 2007-06-12 13:11 UTC (permalink / raw)
  To: Jens Axboe; +Cc: David Miller, netdev

On Tue, Jun 12, 2007 at 02:40:05PM +0200, Jens Axboe (jens.axboe@oracle.com) wrote:
> On Tue, Jun 12 2007, Evgeniy Polyakov wrote:
> > > > > > and putting cloned skb into private field instead of 
> > > > > > original on in spd_fill_page() ends up without kernel hung.
> > > > > 
> > > > > Why? Seems pointless to allocate a clone just to hold on to the skb, a
> > > > > reference should be equally good. I would not be opposed to doing it
> > > > > this way, I just don't see what a clone buys us as compared to just
> > > > > holding that reference to the skb.
> > > > 
> > > > Receiving code does not expect shared skbs - too many fields are changed
> > > > with assumptions that it is a private copy.
> > > 
> > > Actually the main problem is that tcp_read_sock() unconditionally frees
> > > the skb, so it wouldn't help if we grabbed a reference to it. I've yet
> > > to receive an explanation of why it does so, seem awkward and violates
> > > the whole principle of reference counted objects. Davem??
> > > 
> > > So for now, skb_splice_bits() clones the incoming skb to avoid that. I'd
> > > hope we can get rid of that by fixing tcp_read_sock(), though.
> > 
> > It does that because it knows, that skb is not allowed to be shared
> > there. Similar things are being done in udp for example - code changes
> > internal mebers of skb, since it knows skb is not shared.
> > 
> > For example generic_make_request() is not allowed to change, say, 
> > bio->bi_sector or bi_destructor, since it does not own a block request, 
> > not matter what bi_cnt is. From another side, ->bi_destructor() can do
> > whatever it wants with bio without any check for its reference counter.
> 
> But generic_make_request() DOES change ->bi_sector, that's how partition
> remapping works :-). The destructor can of course do whatever it wants,
> by definition the bio is not referenced at that point (or it would not
> have been called). So while I think your analogy is quite poor, I do now
> follow the code (even if I think it's ugly). There's quite a big

Yeah, that was quite long time ago I hacked block layer :)
Good we found a way to explain the issue.

> difference between changing parts of the elements of a structure to just
> grabbing a reference to it. If the skb cannot be referenced, skb_get()
> should return NULL.
> 
> But that aside, I see the issue. I'll just stick to the clone, it works
> fine as-is (well almost, there's a leak there, but functionally it's
> ok!).

Btw, is it allowed to use splice from network with, say, nfs?
Since RPC code uses sk_user_data as long as network splice.

> -- 
> Jens Axboe

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2007-06-12 13:13 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-06-05  8:05 [PATCH][RFC] network splice receive Jens Axboe
2007-06-05 11:45 ` Jens Axboe
2007-06-05 12:20   ` Jens Axboe
2007-06-05 12:34     ` jamal
2007-06-06  7:14       ` Jens Axboe
2007-06-05 13:34 ` Evgeniy Polyakov
2007-06-05 14:31 ` Evgeniy Polyakov
2007-06-05 14:49   ` Evgeniy Polyakov
2007-06-06  7:17   ` Jens Axboe
2007-06-07  8:09     ` Evgeniy Polyakov
2007-06-07 10:51       ` Jens Axboe
2007-06-07 14:58         ` Evgeniy Polyakov
2007-06-08  7:48           ` Jens Axboe
2007-06-08  8:06             ` David Miller
2007-06-08  8:38               ` Jens Axboe
2007-06-08  8:56                 ` Evgeniy Polyakov
2007-06-08  9:04                   ` Jens Axboe
2007-06-08 13:58                     ` Evgeniy Polyakov
2007-06-08 14:14                       ` Jens Axboe
2007-06-08 14:57                         ` Evgeniy Polyakov
2007-06-08 15:19                           ` Jens Axboe
2007-06-08 15:30                           ` Evgeniy Polyakov
2007-06-09  6:36                             ` Jens Axboe
2007-06-12 11:29                               ` Evgeniy Polyakov
2007-06-12 11:33                                 ` Jens Axboe
2007-06-12 12:35                                   ` Evgeniy Polyakov
2007-06-12 12:40                                     ` Jens Axboe
2007-06-12 13:11                                       ` Evgeniy Polyakov
2007-06-12 13:11                                         ` Jens Axboe
2007-06-11  8:00                             ` Jens Axboe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).