netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Network/block layer race.
@ 2008-03-28  9:20 Evgeniy Polyakov
  2008-03-28 20:40 ` David Miller
  0 siblings, 1 reply; 22+ messages in thread
From: Evgeniy Polyakov @ 2008-03-28  9:20 UTC (permalink / raw)
  To: Jens Axboe, David Miller; +Cc: netdev

Hi.

There is a race between ->sendpage() and block layer, when the latter
can override the page while it is queued in hardware, qdisk or tcp
queue. Although page's reference counter is handled correctly, and page
will not be freed until fully transferred, block layer can reuse it,
since it assumes that after ->sendpage() returns, page is no longer
used. It is invalid assumption, but there is no way currently to
determine when page is no longer used by network except invoke a
callback during skb freeing.

Block layer pages do not use page->lru.next, at least in kernel afaics,
which is a kmem_cache pointer, so some users, who do know, what they are
doing, can set it up to private data structure and replace skb
destructor with own callback, which in turn will invoke sock_wfree()
when needed (transmit only is interesing), so there will not be any
changes in skb structure, maybe some extension of the sock (a single
pointer to private callback or reuse sk_user_data, which is only used by
rpc code, and export of the sock_wfree() function.

I do not know if we have to fix sendfile()/splice() since everyone is
used to have that race, but some other out-of-tree network storage
projects (like distributed storage) would greatly benefit from it.

So far it is a request for comments and idea has to be better tested if
accepted, so the question is: will such a hack be accepted?

Thanks.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Network/block layer race.
  2008-03-28  9:20 Network/block layer race Evgeniy Polyakov
@ 2008-03-28 20:40 ` David Miller
  2008-03-28 20:56   ` Evgeniy Polyakov
  0 siblings, 1 reply; 22+ messages in thread
From: David Miller @ 2008-03-28 20:40 UTC (permalink / raw)
  To: johnpol; +Cc: axboe, netdev

From: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
Date: Fri, 28 Mar 2008 12:20:36 +0300

> There is a race between ->sendpage() and block layer, when the latter
> can override the page while it is queued in hardware, qdisk or tcp
> queue. Although page's reference counter is handled correctly, and page
> will not be freed until fully transferred, block layer can reuse it,
> since it assumes that after ->sendpage() returns, page is no longer
> used. It is invalid assumption, but there is no way currently to
> determine when page is no longer used by network except invoke a
> callback during skb freeing.

It is well known that between when a page is given to sendfile path
and it is actually transmitted to the network the kernel can write
into that page multiple times.

That's why we only allow sendfile over device paths that support
checksum offloading, since the page contents can change freely at any
moment in time.

The refrence counting is to prevent leaks, rather than to protect
the integrity of the contents.

If content protection is desired, higher level things are needed.
For example, SAMBA only uses sendfile() if the remote client
has an OP lock on the file in question.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Network/block layer race.
  2008-03-28 20:40 ` David Miller
@ 2008-03-28 20:56   ` Evgeniy Polyakov
  2008-03-28 21:07     ` David Miller
  0 siblings, 1 reply; 22+ messages in thread
From: Evgeniy Polyakov @ 2008-03-28 20:56 UTC (permalink / raw)
  To: David Miller; +Cc: axboe, netdev

Hi.

On Fri, Mar 28, 2008 at 01:40:53PM -0700, David Miller (davem@davemloft.net) wrote:
> If content protection is desired, higher level things are needed.
> For example, SAMBA only uses sendfile() if the remote client
> has an OP lock on the file in question.

What I propose is to inform sender about when page is transmitted and
received on the other side via implicit acks, which are provided already
by tcp, but is not provided to higher layer.
We can use it for sendfile and thus do not require higher
layer 'distributed' locks, i.e. sync some state between hosts to inform
that data was succesfully received. Patch would be minimal, just export
sock_wfree() and setup different skb destructor if sk_sock_data is
provided, that will allow for special users (like the same samba) to use
->sendpage() path too (with some changes in the code of course, but that
is worth whole page copying).

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Network/block layer race.
  2008-03-28 20:56   ` Evgeniy Polyakov
@ 2008-03-28 21:07     ` David Miller
  2008-03-28 21:51       ` Evgeniy Polyakov
  0 siblings, 1 reply; 22+ messages in thread
From: David Miller @ 2008-03-28 21:07 UTC (permalink / raw)
  To: johnpol; +Cc: axboe, netdev

From: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
Date: Fri, 28 Mar 2008 23:56:13 +0300

> Hi.
> 
> On Fri, Mar 28, 2008 at 01:40:53PM -0700, David Miller (davem@davemloft.net) wrote:
> > If content protection is desired, higher level things are needed.
> > For example, SAMBA only uses sendfile() if the remote client
> > has an OP lock on the file in question.
> 
> What I propose is to inform sender about when page is transmitted and
> received on the other side via implicit acks, which are provided already
> by tcp, but is not provided to higher layer.
> We can use it for sendfile and thus do not require higher
> layer 'distributed' locks, i.e. sync some state between hosts to inform
> that data was succesfully received. Patch would be minimal, just export
> sock_wfree() and setup different skb destructor if sk_sock_data is
> provided, that will allow for special users (like the same samba) to use
> ->sendpage() path too (with some changes in the code of course, but that
> is worth whole page copying).

I'd have to see the patch, feel free to submit it :-)

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Network/block layer race.
  2008-03-28 21:07     ` David Miller
@ 2008-03-28 21:51       ` Evgeniy Polyakov
  2008-04-01 16:49         ` Fix for the fundamental network/block layer race in sendfile() Evgeniy Polyakov
  0 siblings, 1 reply; 22+ messages in thread
From: Evgeniy Polyakov @ 2008-03-28 21:51 UTC (permalink / raw)
  To: David Miller; +Cc: axboe, netdev

On Fri, Mar 28, 2008 at 02:07:36PM -0700, David Miller (davem@davemloft.net) wrote:
> I'd have to see the patch, feel free to submit it :-)

So, you do not afraid if word 'hack' is used in description, which is a good
sign for start :)

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Fix for the fundamental network/block layer race in sendfile().
  2008-03-28 21:51       ` Evgeniy Polyakov
@ 2008-04-01 16:49         ` Evgeniy Polyakov
  2008-04-01 17:14           ` Mika Penttilä
                             ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Evgeniy Polyakov @ 2008-04-01 16:49 UTC (permalink / raw)
  To: David Miller; +Cc: axboe, netdev

Hi.

Summary of the previous series with this pompous header:
when sendfile() returns, pages which it sent can still be queued in tcp
stack or hardware, so subsequent write into them will endup in
corrupting data which will be eventually sent. This concerns all
->sendpage() users namely sendfile() and splice().

We can only safely reuse that pages only when ack is received from the
remote side, which will force network stack to release pages.

My simple extension allows to hook into data releasing path and perform
any actions we want. This is achieved by replacing skb->destructor with
own callback registerd by interested user, for example splice/sendfile
code. Splice (pipe info structure) in turn is extended to hold atomic
counter of the pages in flight (without structure size change because of
alignment issues it has right now), so splice code will sleep when full
pipe info (->nrbufs pages) was sent, it will wait until number of pages
in flight hits zero, which is decremented in private splice callback.

Patch was tested with simple send and recv applications, which can be
found at
http://tservice.net.ru/~s0mbre/archive/tmp/sendfile_fix/

One has to run them on different machines, since loopback uses a bit
different scheme (namely page is _never_ copied, so when it is received
by 'remote' side, it still exists on the 'local' side, so modifications
will endup in data corruption).
devfs1# ./recv -a 0.0.0.0 -p 1025 -c 1024
devfs2# ./send -a devfs1 -p 1025 -f /tmp/test -c 1024

In case of failure you will get this:
Connected to devfs1:1025.
/tmp/test/1024 -> devfs1:1025
Data was corrupted: ab.

after short period of time, where above 'ab' is a hex byte writen into
mapped file, which has been sent, immediately after senfile()
returns to userspace.
Data is supposed to be always zero, and applications should run forever.
-c parameter specifies number of bytes to be sent in each run of the
sendfile(). It has to be the same on both machines.

Fix attached, consider for inclusion.

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>

diff --git a/fs/read_write.c b/fs/read_write.c
index 49a9871..8c94e03 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -16,6 +16,7 @@
 #include <linux/syscalls.h>
 #include <linux/pagemap.h>
 #include <linux/splice.h>
+#include <net/sock.h>
 #include "read_write.h"
 
 #include <asm/uaccess.h>
@@ -703,6 +704,8 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
 	loff_t pos;
 	ssize_t retval;
 	int fput_needed_in, fput_needed_out, fl;
+	struct sock *sk;
+	struct socket *sock;
 
 	/*
 	 * Get input file, and verify that it is ok..
@@ -762,6 +765,12 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
 		count = max - pos;
 	}
 
+	sock = out_file->private_data;
+	sk = sock->sk;
+
+	sk->sk_user_data = &skb_splice_destructor;
+	sk->sk_flags |= SO_PRIVATE_CALLBACK;
+
 	fl = 0;
 #if 0
 	/*
diff --git a/fs/splice.c b/fs/splice.c
index 0670c91..1432b13 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -29,6 +29,7 @@
 #include <linux/syscalls.h>
 #include <linux/uio.h>
 #include <linux/security.h>
+#include <net/sock.h>
 
 /*
  * Attempt to steal a page from a pipe buffer. This should perhaps go into
@@ -535,6 +536,7 @@ static int pipe_to_sendpage(struct pipe_inode_info *pipe,
 	if (!ret) {
 		more = (sd->flags & SPLICE_F_MORE) || sd->len < sd->total_len;
 
+		buf->page->lru.next = (void *)pipe;
 		ret = file->f_op->sendpage(file, buf->page, buf->offset,
 					   sd->len, &pos, more);
 	}
@@ -629,6 +631,8 @@ ssize_t __splice_from_pipe(struct pipe_inode_info *pipe, struct splice_desc *sd,
 	ret = 0;
 	do_wakeup = 0;
 
+	atomic_set(&pipe->god_blessed_us, pipe->nrbufs);
+
 	for (;;) {
 		if (pipe->nrbufs) {
 			struct pipe_buffer *buf = pipe->bufs + pipe->curbuf;
@@ -665,12 +669,18 @@ ssize_t __splice_from_pipe(struct pipe_inode_info *pipe, struct splice_desc *sd,
 					do_wakeup = 1;
 			}
 
-			if (!sd->total_len)
+			if (!sd->total_len) {
+				wait_event_interruptible(pipe->wait,
+					!atomic_read(&pipe->god_blessed_us));
 				break;
+			}
 		}
 
 		if (pipe->nrbufs)
 			continue;
+
+		wait_event_interruptible(pipe->wait, !atomic_read(&pipe->god_blessed_us));
+
 		if (!pipe->writers)
 			break;
 		if (!pipe->waiting_writers) {
diff --git a/include/asm-x86/socket.h b/include/asm-x86/socket.h
index 80af9c4..a4b047e 100644
--- a/include/asm-x86/socket.h
+++ b/include/asm-x86/socket.h
@@ -54,4 +54,6 @@
 
 #define SO_MARK			36
 
+#define SO_PRIVATE_CALLBACK	37
+
 #endif /* _ASM_SOCKET_H */
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index 8e41202..465405a 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -51,6 +51,7 @@ struct pipe_inode_info {
 	unsigned int waiting_writers;
 	unsigned int r_counter;
 	unsigned int w_counter;
+	atomic_t god_blessed_us;
 	struct fasync_struct *fasync_readers;
 	struct fasync_struct *fasync_writers;
 	struct inode *inode;
diff --git a/include/net/sock.h b/include/net/sock.h
index fd98760..ac7bc52 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -862,6 +862,8 @@ extern struct sk_buff		*sock_rmalloc(struct sock *sk,
 extern void			sock_wfree(struct sk_buff *skb);
 extern void			sock_rfree(struct sk_buff *skb);
 
+extern void skb_splice_destructor(struct sk_buff *skb);
+
 extern int			sock_setsockopt(struct socket *sock, int level,
 						int op, char __user *optval,
 						int optlen);
@@ -1168,6 +1170,8 @@ static inline void skb_set_owner_w(struct sk_buff *skb, struct sock *sk)
 	sock_hold(sk);
 	skb->sk = sk;
 	skb->destructor = sock_wfree;
+	if (sk->sk_flags & SO_PRIVATE_CALLBACK)
+		skb->destructor = sk->sk_user_data;
 	atomic_add(skb->truesize, &sk->sk_wmem_alloc);
 }
 
diff --git a/net/core/sock.c b/net/core/sock.c
index 2654c14..0c10581 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -112,6 +112,7 @@
 #include <linux/tcp.h>
 #include <linux/init.h>
 #include <linux/highmem.h>
+#include <linux/pipe_fs_i.h>
 
 #include <asm/uaccess.h>
 #include <asm/system.h>
@@ -1116,6 +1117,24 @@ void sock_wfree(struct sk_buff *skb)
 	sock_put(sk);
 }
 
+void skb_splice_destructor(struct sk_buff *skb)
+{
+	if (skb_shinfo(skb)->nr_frags) {
+		int i;
+		struct pipe_inode_info *pipe;
+
+		for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
+			struct page *page = skb_shinfo(skb)->frags[i].page;
+
+			pipe = (struct pipe_inode_info *)page->lru.next;
+			if (atomic_dec_return(&pipe->god_blessed_us) == 0)
+				wake_up(&pipe->wait);
+		}
+	}
+
+	sock_wfree(skb);
+}
+
 /*
  * Read buffer destructor automatically called from kfree_skb.
  */
@@ -2164,6 +2183,7 @@ EXPORT_SYMBOL(sock_no_socketpair);
 EXPORT_SYMBOL(sock_rfree);
 EXPORT_SYMBOL(sock_setsockopt);
 EXPORT_SYMBOL(sock_wfree);
+EXPORT_SYMBOL(skb_splice_destructor);
 EXPORT_SYMBOL(sock_wmalloc);
 EXPORT_SYMBOL(sock_i_uid);
 EXPORT_SYMBOL(sock_i_ino);


-- 
	Evgeniy Polyakov

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: Fix for the fundamental network/block layer race in sendfile().
  2008-04-01 16:49         ` Fix for the fundamental network/block layer race in sendfile() Evgeniy Polyakov
@ 2008-04-01 17:14           ` Mika Penttilä
  2008-04-01 17:36             ` Evgeniy Polyakov
  2008-04-01 17:19           ` Eric Dumazet
  2008-04-08 12:25           ` [take 2] " Evgeniy Polyakov
  2 siblings, 1 reply; 22+ messages in thread
From: Mika Penttilä @ 2008-04-01 17:14 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: David Miller, axboe, netdev

Evgeniy Polyakov wrote:
> Hi.
>
> Summary of the previous series with this pompous header:
> when sendfile() returns, pages which it sent can still be queued in tcp
> stack or hardware, so subsequent write into them will endup in
> corrupting data which will be eventually sent. This concerns all
> ->sendpage() users namely sendfile() and splice().
>
> We can only safely reuse that pages only when ack is received from the
> remote side, which will force network stack to release pages.
>
> My simple extension allows to hook into data releasing path and perform
> any actions we want. This is achieved by replacing skb->destructor with
> own callback registerd by interested user, for example splice/sendfile
> code. Splice (pipe info structure) in turn is extended to hold atomic
> counter of the pages in flight (without structure size change because of
> alignment issues it has right now), so splice code will sleep when full
> pipe info (->nrbufs pages) was sent, it will wait until number of pages
> in flight hits zero, which is decremented in private splice callback.
>
> Patch was tested with simple send and recv applications, which can be
> found at
> http://tservice.net.ru/~s0mbre/archive/tmp/sendfile_fix/
>
> One has to run them on different machines, since loopback uses a bit
> different scheme (namely page is _never_ copied, so when it is received
> by 'remote' side, it still exists on the 'local' side, so modifications
> will endup in data corruption).
> devfs1# ./recv -a 0.0.0.0 -p 1025 -c 1024
> devfs2# ./send -a devfs1 -p 1025 -f /tmp/test -c 1024
>
> In case of failure you will get this:
> Connected to devfs1:1025.
> /tmp/test/1024 -> devfs1:1025
> Data was corrupted: ab.
>
> after short period of time, where above 'ab' is a hex byte writen into
> mapped file, which has been sent, immediately after senfile()
> returns to userspace.
> Data is supposed to be always zero, and applications should run forever.
> -c parameter specifies number of bytes to be sent in each run of the
> sendfile(). It has to be the same on both machines.
>
> Fix attached, consider for inclusion.
>
> Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
>
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 49a9871..8c94e03 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -16,6 +16,7 @@
>  #include <linux/syscalls.h>
>  #include <linux/pagemap.h>
>  #include <linux/splice.h>
> +#include <net/sock.h>
>  #include "read_write.h"
>  
>  #include <asm/uaccess.h>
> @@ -703,6 +704,8 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
>  	loff_t pos;
>  	ssize_t retval;
>  	int fput_needed_in, fput_needed_out, fl;
> +	struct sock *sk;
> +	struct socket *sock;
>  
>  	/*
>  	 * Get input file, and verify that it is ok..
> @@ -762,6 +765,12 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
>  		count = max - pos;
>  	}
>  
> +	sock = out_file->private_data;
> +	sk = sock->sk;
> +
> +	sk->sk_user_data = &skb_splice_destructor;
> +	sk->sk_flags |= SO_PRIVATE_CALLBACK;
> +
>  	fl = 0;
>  #if 0
>  	/*
> diff --git a/fs/splice.c b/fs/splice.c
> index 0670c91..1432b13 100644
> --- a/fs/splice.c
> +++ b/fs/splice.c
> @@ -29,6 +29,7 @@
>  #include <linux/syscalls.h>
>  #include <linux/uio.h>
>  #include <linux/security.h>
> +#include <net/sock.h>
>  
>  /*
>   * Attempt to steal a page from a pipe buffer. This should perhaps go into
> @@ -535,6 +536,7 @@ static int pipe_to_sendpage(struct pipe_inode_info *pipe,
>  	if (!ret) {
>  		more = (sd->flags & SPLICE_F_MORE) || sd->len < sd->total_len;
>  
> +		buf->page->lru.next = (void *)pipe;
>  		ret = file->f_op->sendpage(file, buf->page, buf->offset,
>  					   sd->len, &pos, more);
>  	}
> @@ -629,6 +631,8 @@ ssize_t __splice_from_pipe(struct pipe_inode_info *pipe, struct splice_desc *sd,
>  	ret = 0;
>  	do_wakeup = 0;
>  
> +	atomic_set(&pipe->god_blessed_us, pipe->nrbufs);
> +
>  	for (;;) {
>  		if (pipe->nrbufs) {
>  			struct pipe_buffer *buf = pipe->bufs + pipe->curbuf;
> @@ -665,12 +669,18 @@ ssize_t __splice_from_pipe(struct pipe_inode_info *pipe, struct splice_desc *sd,
>  					do_wakeup = 1;
>  			}
>  
> -			if (!sd->total_len)
> +			if (!sd->total_len) {
> +				wait_event_interruptible(pipe->wait,
> +					!atomic_read(&pipe->god_blessed_us));
>  				break;
> +			}
>  		}
>  
>  		if (pipe->nrbufs)
>  			continue;
> +
> +		wait_event_interruptible(pipe->wait, !atomic_read(&pipe->god_blessed_us));
> +
>  		if (!pipe->writers)
>  			break;
>  		if (!pipe->waiting_writers) {
> diff --git a/include/asm-x86/socket.h b/include/asm-x86/socket.h
> index 80af9c4..a4b047e 100644
> --- a/include/asm-x86/socket.h
> +++ b/include/asm-x86/socket.h
> @@ -54,4 +54,6 @@
>  
>  #define SO_MARK			36
>  
> +#define SO_PRIVATE_CALLBACK	37
> +
>  #endif /* _ASM_SOCKET_H */
> diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
> index 8e41202..465405a 100644
> --- a/include/linux/pipe_fs_i.h
> +++ b/include/linux/pipe_fs_i.h
> @@ -51,6 +51,7 @@ struct pipe_inode_info {
>  	unsigned int waiting_writers;
>  	unsigned int r_counter;
>  	unsigned int w_counter;
> +	atomic_t god_blessed_us;
>  	struct fasync_struct *fasync_readers;
>  	struct fasync_struct *fasync_writers;
>  	struct inode *inode;
> diff --git a/include/net/sock.h b/include/net/sock.h
> index fd98760..ac7bc52 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -862,6 +862,8 @@ extern struct sk_buff		*sock_rmalloc(struct sock *sk,
>  extern void			sock_wfree(struct sk_buff *skb);
>  extern void			sock_rfree(struct sk_buff *skb);
>  
> +extern void skb_splice_destructor(struct sk_buff *skb);
> +
>  extern int			sock_setsockopt(struct socket *sock, int level,
>  						int op, char __user *optval,
>  						int optlen);
> @@ -1168,6 +1170,8 @@ static inline void skb_set_owner_w(struct sk_buff *skb, struct sock *sk)
>  	sock_hold(sk);
>  	skb->sk = sk;
>  	skb->destructor = sock_wfree;
> +	if (sk->sk_flags & SO_PRIVATE_CALLBACK)
> +		skb->destructor = sk->sk_user_data;
>  	atomic_add(skb->truesize, &sk->sk_wmem_alloc);
>  }
>  
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 2654c14..0c10581 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -112,6 +112,7 @@
>  #include <linux/tcp.h>
>  #include <linux/init.h>
>  #include <linux/highmem.h>
> +#include <linux/pipe_fs_i.h>
>  
>  #include <asm/uaccess.h>
>  #include <asm/system.h>
> @@ -1116,6 +1117,24 @@ void sock_wfree(struct sk_buff *skb)
>  	sock_put(sk);
>  }
>  
> +void skb_splice_destructor(struct sk_buff *skb)
> +{
> +	if (skb_shinfo(skb)->nr_frags) {
> +		int i;
> +		struct pipe_inode_info *pipe;
> +
> +		for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
> +			struct page *page = skb_shinfo(skb)->frags[i].page;
> +
> +			pipe = (struct pipe_inode_info *)page->lru.next;
> +			if (atomic_dec_return(&pipe->god_blessed_us) == 0)
> +				wake_up(&pipe->wait);
> +		}
> +	}
> +
> +	sock_wfree(skb);
> +}
> +
>  /*
>   * Read buffer destructor automatically called from kfree_skb.
>   */
> @@ -2164,6 +2183,7 @@ EXPORT_SYMBOL(sock_no_socketpair);
>  EXPORT_SYMBOL(sock_rfree);
>  EXPORT_SYMBOL(sock_setsockopt);
>  EXPORT_SYMBOL(sock_wfree);
> +EXPORT_SYMBOL(skb_splice_destructor);
>  EXPORT_SYMBOL(sock_wmalloc);
>  EXPORT_SYMBOL(sock_i_uid);
>  EXPORT_SYMBOL(sock_i_ino);
>
>
>   
Hi,

Aren't you breaking the other types of splices, like splice to file 
which doesn't have the wakeup thing.

--Mika


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Fix for the fundamental network/block layer race in sendfile().
  2008-04-01 16:49         ` Fix for the fundamental network/block layer race in sendfile() Evgeniy Polyakov
  2008-04-01 17:14           ` Mika Penttilä
@ 2008-04-01 17:19           ` Eric Dumazet
  2008-04-01 17:47             ` Evgeniy Polyakov
  2008-04-08 12:25           ` [take 2] " Evgeniy Polyakov
  2 siblings, 1 reply; 22+ messages in thread
From: Eric Dumazet @ 2008-04-01 17:19 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: David Miller, axboe, netdev

Evgeniy Polyakov a écrit :
> Hi.
> 
> Summary of the previous series with this pompous header:
> when sendfile() returns, pages which it sent can still be queued in tcp
> stack or hardware, so subsequent write into them will endup in
> corrupting data which will be eventually sent. This concerns all
> ->sendpage() users namely sendfile() and splice().
> 
> We can only safely reuse that pages only when ack is received from the
> remote side, which will force network stack to release pages.
> 
> My simple extension allows to hook into data releasing path and perform
> any actions we want. This is achieved by replacing skb->destructor with
> own callback registerd by interested user, for example splice/sendfile
> code. Splice (pipe info structure) in turn is extended to hold atomic
> counter of the pages in flight (without structure size change because of
> alignment issues it has right now), so splice code will sleep when full
> pipe info (->nrbufs pages) was sent, it will wait until number of pages
> in flight hits zero, which is decremented in private splice callback.
> 
> Patch was tested with simple send and recv applications, which can be
> found at
> http://tservice.net.ru/~s0mbre/archive/tmp/sendfile_fix/
> 
> One has to run them on different machines, since loopback uses a bit
> different scheme (namely page is _never_ copied, so when it is received
> by 'remote' side, it still exists on the 'local' side, so modifications
> will endup in data corruption).
> devfs1# ./recv -a 0.0.0.0 -p 1025 -c 1024
> devfs2# ./send -a devfs1 -p 1025 -f /tmp/test -c 1024
> 
> In case of failure you will get this:
> Connected to devfs1:1025.
> /tmp/test/1024 -> devfs1:1025
> Data was corrupted: ab.
> 
> after short period of time, where above 'ab' is a hex byte writen into
> mapped file, which has been sent, immediately after senfile()
> returns to userspace.
> Data is supposed to be always zero, and applications should run forever.
> -c parameter specifies number of bytes to be sent in each run of the
> sendfile(). It has to be the same on both machines.
> 
> Fix attached, consider for inclusion.
> 
> Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
> 
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 49a9871..8c94e03 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -16,6 +16,7 @@
>  #include <linux/syscalls.h>
>  #include <linux/pagemap.h>
>  #include <linux/splice.h>
> +#include <net/sock.h>
>  #include "read_write.h"
>  
>  #include <asm/uaccess.h>
> @@ -703,6 +704,8 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
>  	loff_t pos;
>  	ssize_t retval;
>  	int fput_needed_in, fput_needed_out, fl;
> +	struct sock *sk;
> +	struct socket *sock;
>  
>  	/*
>  	 * Get input file, and verify that it is ok..
> @@ -762,6 +765,12 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
>  		count = max - pos;
>  	}
>  
> +	sock = out_file->private_data;
> +	sk = sock->sk;
> +
> +	sk->sk_user_data = &skb_splice_destructor;
> +	sk->sk_flags |= SO_PRIVATE_CALLBACK;

Ouch...

this seems very ugly (sorry).

1) Is out_file garanted to be a socket ?

2) SO_PRIVATE_CALLBACK is defined to be 37 later on your patch. Is ORing 37 to 
sk_flags really working ???

3) So, once a socket has been used for a sendfile(), all subsequent sendmsg() 
will call skb_splice_destructor ? (I see no clearing of SO_PRIVATE_CALLBACK)
  This will probably crash.

4) god_blessed_us name ... Oh I got it, its April the first !!!!


> +
>  	fl = 0;
>  #if 0
>  	/*
> diff --git a/fs/splice.c b/fs/splice.c
> index 0670c91..1432b13 100644
> --- a/fs/splice.c
> +++ b/fs/splice.c
> @@ -29,6 +29,7 @@
>  #include <linux/syscalls.h>
>  #include <linux/uio.h>
>  #include <linux/security.h>
> +#include <net/sock.h>
>  
>  /*
>   * Attempt to steal a page from a pipe buffer. This should perhaps go into
> @@ -535,6 +536,7 @@ static int pipe_to_sendpage(struct pipe_inode_info *pipe,
>  	if (!ret) {
>  		more = (sd->flags & SPLICE_F_MORE) || sd->len < sd->total_len;
>  
> +		buf->page->lru.next = (void *)pipe;
>  		ret = file->f_op->sendpage(file, buf->page, buf->offset,
>  					   sd->len, &pos, more);
>  	}
> @@ -629,6 +631,8 @@ ssize_t __splice_from_pipe(struct pipe_inode_info *pipe, struct splice_desc *sd,
>  	ret = 0;
>  	do_wakeup = 0;
>  
> +	atomic_set(&pipe->god_blessed_us, pipe->nrbufs);
> +
>  	for (;;) {
>  		if (pipe->nrbufs) {
>  			struct pipe_buffer *buf = pipe->bufs + pipe->curbuf;
> @@ -665,12 +669,18 @@ ssize_t __splice_from_pipe(struct pipe_inode_info *pipe, struct splice_desc *sd,
>  					do_wakeup = 1;
>  			}
>  
> -			if (!sd->total_len)
> +			if (!sd->total_len) {
> +				wait_event_interruptible(pipe->wait,
> +					!atomic_read(&pipe->god_blessed_us));
>  				break;
> +			}
>  		}
>  
>  		if (pipe->nrbufs)
>  			continue;
> +
> +		wait_event_interruptible(pipe->wait, !atomic_read(&pipe->god_blessed_us));
> +
>  		if (!pipe->writers)
>  			break;
>  		if (!pipe->waiting_writers) {
> diff --git a/include/asm-x86/socket.h b/include/asm-x86/socket.h
> index 80af9c4..a4b047e 100644
> --- a/include/asm-x86/socket.h
> +++ b/include/asm-x86/socket.h
> @@ -54,4 +54,6 @@
>  
>  #define SO_MARK			36
>  
> +#define SO_PRIVATE_CALLBACK	37
> +
>  #endif /* _ASM_SOCKET_H */
> diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
> index 8e41202..465405a 100644
> --- a/include/linux/pipe_fs_i.h
> +++ b/include/linux/pipe_fs_i.h
> @@ -51,6 +51,7 @@ struct pipe_inode_info {
>  	unsigned int waiting_writers;
>  	unsigned int r_counter;
>  	unsigned int w_counter;
> +	atomic_t god_blessed_us;
>  	struct fasync_struct *fasync_readers;
>  	struct fasync_struct *fasync_writers;
>  	struct inode *inode;
> diff --git a/include/net/sock.h b/include/net/sock.h
> index fd98760..ac7bc52 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -862,6 +862,8 @@ extern struct sk_buff		*sock_rmalloc(struct sock *sk,
>  extern void			sock_wfree(struct sk_buff *skb);
>  extern void			sock_rfree(struct sk_buff *skb);
>  
> +extern void skb_splice_destructor(struct sk_buff *skb);
> +
>  extern int			sock_setsockopt(struct socket *sock, int level,
>  						int op, char __user *optval,
>  						int optlen);
> @@ -1168,6 +1170,8 @@ static inline void skb_set_owner_w(struct sk_buff *skb, struct sock *sk)
>  	sock_hold(sk);
>  	skb->sk = sk;
>  	skb->destructor = sock_wfree;
> +	if (sk->sk_flags & SO_PRIVATE_CALLBACK)
> +		skb->destructor = sk->sk_user_data;
>  	atomic_add(skb->truesize, &sk->sk_wmem_alloc);
>  }
>  
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 2654c14..0c10581 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -112,6 +112,7 @@
>  #include <linux/tcp.h>
>  #include <linux/init.h>
>  #include <linux/highmem.h>
> +#include <linux/pipe_fs_i.h>
>  
>  #include <asm/uaccess.h>
>  #include <asm/system.h>
> @@ -1116,6 +1117,24 @@ void sock_wfree(struct sk_buff *skb)
>  	sock_put(sk);
>  }
>  
> +void skb_splice_destructor(struct sk_buff *skb)
> +{
> +	if (skb_shinfo(skb)->nr_frags) {
> +		int i;
> +		struct pipe_inode_info *pipe;
> +
> +		for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
> +			struct page *page = skb_shinfo(skb)->frags[i].page;
> +
> +			pipe = (struct pipe_inode_info *)page->lru.next;
> +			if (atomic_dec_return(&pipe->god_blessed_us) == 0)
> +				wake_up(&pipe->wait);
> +		}
> +	}
> +
> +	sock_wfree(skb);
> +}
> +
>  /*
>   * Read buffer destructor automatically called from kfree_skb.
>   */
> @@ -2164,6 +2183,7 @@ EXPORT_SYMBOL(sock_no_socketpair);
>  EXPORT_SYMBOL(sock_rfree);
>  EXPORT_SYMBOL(sock_setsockopt);
>  EXPORT_SYMBOL(sock_wfree);
> +EXPORT_SYMBOL(skb_splice_destructor);
>  EXPORT_SYMBOL(sock_wmalloc);
>  EXPORT_SYMBOL(sock_i_uid);
>  EXPORT_SYMBOL(sock_i_ino);
> 
> 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Fix for the fundamental network/block layer race in sendfile().
  2008-04-01 17:14           ` Mika Penttilä
@ 2008-04-01 17:36             ` Evgeniy Polyakov
  0 siblings, 0 replies; 22+ messages in thread
From: Evgeniy Polyakov @ 2008-04-01 17:36 UTC (permalink / raw)
  To: Mika Penttilä; +Cc: David Miller, axboe, netdev

[-- Attachment #1: Type: text/plain, Size: 1363 bytes --]

Hi.

On Tue, Apr 01, 2008 at 08:14:47PM +0300, Mika Penttilä (mika.penttila@kolumbus.fi) wrote:
> Aren't you breaking the other types of splices, like splice to file 
> which doesn't have the wakeup thing.

Do we care about them? Likely we do...

Thanks for pointing!

Inlined fix on top of previous patch, attached full diff.
Not tested with other splices though, but is rather straightforward :)
Is there simple application to run?

--- /tmp/splice.c	2008-04-01 21:30:39.000000000 +0400
+++ ./fs/splice.c	2008-04-01 21:29:53.000000000 +0400
@@ -631,7 +631,8 @@
 	ret = 0;
 	do_wakeup = 0;
 
-	atomic_set(&pipe->god_blessed_us, pipe->nrbufs);
+	if (actor == pipe_to_sendpage)
+		atomic_set(&pipe->god_blessed_us, pipe->nrbufs);
 
 	for (;;) {
 		if (pipe->nrbufs) {
@@ -670,8 +671,9 @@
 			}
 
 			if (!sd->total_len) {
-				wait_event_interruptible(pipe->wait,
-					!atomic_read(&pipe->god_blessed_us));
+				if (actor == pipe_to_sendpage)
+					wait_event_interruptible(pipe->wait,
+						!atomic_read(&pipe->god_blessed_us));
 				break;
 			}
 		}
@@ -679,7 +681,8 @@
 		if (pipe->nrbufs)
 			continue;
 
-		wait_event_interruptible(pipe->wait, !atomic_read(&pipe->god_blessed_us));
+		if (actor == pipe_to_sendpage)
+			wait_event_interruptible(pipe->wait, !atomic_read(&pipe->god_blessed_us));
 
 		if (!pipe->writers)
 			break;


-- 
	Evgeniy Polyakov

[-- Attachment #2: 1 --]
[-- Type: text/plain, Size: 4890 bytes --]

diff --git a/fs/read_write.c b/fs/read_write.c
index 49a9871..8c94e03 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -16,6 +16,7 @@
 #include <linux/syscalls.h>
 #include <linux/pagemap.h>
 #include <linux/splice.h>
+#include <net/sock.h>
 #include "read_write.h"
 
 #include <asm/uaccess.h>
@@ -703,6 +704,8 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
 	loff_t pos;
 	ssize_t retval;
 	int fput_needed_in, fput_needed_out, fl;
+	struct sock *sk;
+	struct socket *sock;
 
 	/*
 	 * Get input file, and verify that it is ok..
@@ -762,6 +765,12 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
 		count = max - pos;
 	}
 
+	sock = out_file->private_data;
+	sk = sock->sk;
+
+	sk->sk_user_data = &skb_splice_destructor;
+	sk->sk_flags |= SO_PRIVATE_CALLBACK;
+
 	fl = 0;
 #if 0
 	/*
diff --git a/fs/splice.c b/fs/splice.c
index 0670c91..e29c485 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -29,6 +29,7 @@
 #include <linux/syscalls.h>
 #include <linux/uio.h>
 #include <linux/security.h>
+#include <net/sock.h>
 
 /*
  * Attempt to steal a page from a pipe buffer. This should perhaps go into
@@ -535,6 +536,7 @@ static int pipe_to_sendpage(struct pipe_inode_info *pipe,
 	if (!ret) {
 		more = (sd->flags & SPLICE_F_MORE) || sd->len < sd->total_len;
 
+		buf->page->lru.next = (void *)pipe;
 		ret = file->f_op->sendpage(file, buf->page, buf->offset,
 					   sd->len, &pos, more);
 	}
@@ -629,6 +631,9 @@ ssize_t __splice_from_pipe(struct pipe_inode_info *pipe, struct splice_desc *sd,
 	ret = 0;
 	do_wakeup = 0;
 
+	if (actor == pipe_to_sendpage)
+		atomic_set(&pipe->god_blessed_us, pipe->nrbufs);
+
 	for (;;) {
 		if (pipe->nrbufs) {
 			struct pipe_buffer *buf = pipe->bufs + pipe->curbuf;
@@ -665,12 +670,20 @@ ssize_t __splice_from_pipe(struct pipe_inode_info *pipe, struct splice_desc *sd,
 					do_wakeup = 1;
 			}
 
-			if (!sd->total_len)
+			if (!sd->total_len) {
+				if (actor == pipe_to_sendpage)
+					wait_event_interruptible(pipe->wait,
+						!atomic_read(&pipe->god_blessed_us));
 				break;
+			}
 		}
 
 		if (pipe->nrbufs)
 			continue;
+
+		if (actor == pipe_to_sendpage)
+			wait_event_interruptible(pipe->wait, !atomic_read(&pipe->god_blessed_us));
+
 		if (!pipe->writers)
 			break;
 		if (!pipe->waiting_writers) {
diff --git a/include/asm-x86/socket.h b/include/asm-x86/socket.h
index 80af9c4..a4b047e 100644
--- a/include/asm-x86/socket.h
+++ b/include/asm-x86/socket.h
@@ -54,4 +54,6 @@
 
 #define SO_MARK			36
 
+#define SO_PRIVATE_CALLBACK	37
+
 #endif /* _ASM_SOCKET_H */
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index 8e41202..465405a 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -51,6 +51,7 @@ struct pipe_inode_info {
 	unsigned int waiting_writers;
 	unsigned int r_counter;
 	unsigned int w_counter;
+	atomic_t god_blessed_us;
 	struct fasync_struct *fasync_readers;
 	struct fasync_struct *fasync_writers;
 	struct inode *inode;
diff --git a/include/net/sock.h b/include/net/sock.h
index fd98760..ac7bc52 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -862,6 +862,8 @@ extern struct sk_buff		*sock_rmalloc(struct sock *sk,
 extern void			sock_wfree(struct sk_buff *skb);
 extern void			sock_rfree(struct sk_buff *skb);
 
+extern void skb_splice_destructor(struct sk_buff *skb);
+
 extern int			sock_setsockopt(struct socket *sock, int level,
 						int op, char __user *optval,
 						int optlen);
@@ -1168,6 +1170,8 @@ static inline void skb_set_owner_w(struct sk_buff *skb, struct sock *sk)
 	sock_hold(sk);
 	skb->sk = sk;
 	skb->destructor = sock_wfree;
+	if (sk->sk_flags & SO_PRIVATE_CALLBACK)
+		skb->destructor = sk->sk_user_data;
 	atomic_add(skb->truesize, &sk->sk_wmem_alloc);
 }
 
diff --git a/net/core/sock.c b/net/core/sock.c
index 2654c14..0c10581 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -112,6 +112,7 @@
 #include <linux/tcp.h>
 #include <linux/init.h>
 #include <linux/highmem.h>
+#include <linux/pipe_fs_i.h>
 
 #include <asm/uaccess.h>
 #include <asm/system.h>
@@ -1116,6 +1117,24 @@ void sock_wfree(struct sk_buff *skb)
 	sock_put(sk);
 }
 
+void skb_splice_destructor(struct sk_buff *skb)
+{
+	if (skb_shinfo(skb)->nr_frags) {
+		int i;
+		struct pipe_inode_info *pipe;
+
+		for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
+			struct page *page = skb_shinfo(skb)->frags[i].page;
+
+			pipe = (struct pipe_inode_info *)page->lru.next;
+			if (atomic_dec_return(&pipe->god_blessed_us) == 0)
+				wake_up(&pipe->wait);
+		}
+	}
+
+	sock_wfree(skb);
+}
+
 /*
  * Read buffer destructor automatically called from kfree_skb.
  */
@@ -2164,6 +2183,7 @@ EXPORT_SYMBOL(sock_no_socketpair);
 EXPORT_SYMBOL(sock_rfree);
 EXPORT_SYMBOL(sock_setsockopt);
 EXPORT_SYMBOL(sock_wfree);
+EXPORT_SYMBOL(skb_splice_destructor);
 EXPORT_SYMBOL(sock_wmalloc);
 EXPORT_SYMBOL(sock_i_uid);
 EXPORT_SYMBOL(sock_i_ino);

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: Fix for the fundamental network/block layer race in sendfile().
  2008-04-01 17:19           ` Eric Dumazet
@ 2008-04-01 17:47             ` Evgeniy Polyakov
  2008-04-01 18:07               ` Evgeniy Polyakov
  0 siblings, 1 reply; 22+ messages in thread
From: Evgeniy Polyakov @ 2008-04-01 17:47 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, axboe, netdev

Hi.

On Tue, Apr 01, 2008 at 07:19:39PM +0200, Eric Dumazet (dada1@cosmosbay.com) wrote:
> >@@ -762,6 +765,12 @@ static ssize_t do_sendfile(int out_fd, int in_fd, 
> >loff_t *ppos,
> > 		count = max - pos;
> > 	}
> > 
> >+	sock = out_file->private_data;
> >+	sk = sock->sk;
> >+
> >+	sk->sk_user_data = &skb_splice_destructor;
> >+	sk->sk_flags |= SO_PRIVATE_CALLBACK;
> 
> Ouch...
> 
> this seems very ugly (sorry).
> 
> 1) Is out_file garanted to be a socket ?

Yep.

> 2) SO_PRIVATE_CALLBACK is defined to be 37 later on your patch. Is ORing 37 
> to sk_flags really working ???

Kind of... Not as supposed to be, since sk_flags is a long.
Will put that flag into socket instead.

> 3) So, once a socket has been used for a sendfile(), all subsequent 
> sendmsg() will call skb_splice_destructor ? (I see no clearing of 
> SO_PRIVATE_CALLBACK)
>  This will probably crash.

Yep, need to clear.

> 4) god_blessed_us name ... Oh I got it, its April the first !!!!

But it still works :)

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Fix for the fundamental network/block layer race in sendfile().
  2008-04-01 17:47             ` Evgeniy Polyakov
@ 2008-04-01 18:07               ` Evgeniy Polyakov
  2008-04-01 19:21                 ` Eric Dumazet
  0 siblings, 1 reply; 22+ messages in thread
From: Evgeniy Polyakov @ 2008-04-01 18:07 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, axboe, netdev

On Tue, Apr 01, 2008 at 09:47:59PM +0400, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote:
> > 1) Is out_file garanted to be a socket ?
> > 2) SO_PRIVATE_CALLBACK is defined to be 37 later on your patch. Is ORing 37 
> > to sk_flags really working ???
> > 3) So, once a socket has been used for a sendfile(), all subsequent 
> > sendmsg() will call skb_splice_destructor ? (I see no clearing of 
> > SO_PRIVATE_CALLBACK)
> >  This will probably crash.
> > 4) god_blessed_us name ... Oh I got it, its April the first !!!!

Addressed all above (2 and 3 actually).
The formed by using correct socket flag (set_bit/clear_bit, like in a
real life), the latter by clearing bit on sendfile() exit and to fix
exit/clear race each spliced page also setups lru.prev to itself, which
is checked in destructor.
Name was not changed, does it matter?

Patch still works ok :)

Thanks Eric for pointing that bugs.

Signed-off-as-usual.

diff --git a/fs/read_write.c b/fs/read_write.c
index 49a9871..fda55f6 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -16,6 +16,7 @@
 #include <linux/syscalls.h>
 #include <linux/pagemap.h>
 #include <linux/splice.h>
+#include <net/sock.h>
 #include "read_write.h"
 
 #include <asm/uaccess.h>
@@ -703,6 +704,8 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
 	loff_t pos;
 	ssize_t retval;
 	int fput_needed_in, fput_needed_out, fl;
+	struct sock *sk;
+	struct socket *sock;
 
 	/*
 	 * Get input file, and verify that it is ok..
@@ -762,6 +765,12 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
 		count = max - pos;
 	}
 
+	sock = out_file->private_data;
+	sk = sock->sk;
+
+	sk->sk_user_data = &skb_splice_destructor;
+	set_bit(SOCK_PRIVATE_CALLBACK, &sock->flags);
+
 	fl = 0;
 #if 0
 	/*
@@ -775,6 +784,8 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
 #endif
 	retval = do_splice_direct(in_file, ppos, out_file, count, fl);
 
+	clear_bit(SOCK_PRIVATE_CALLBACK, &sock->flags);
+
 	if (retval > 0) {
 		add_rchar(current, retval);
 		add_wchar(current, retval);
diff --git a/fs/splice.c b/fs/splice.c
index 0670c91..dcda32e 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -29,6 +29,7 @@
 #include <linux/syscalls.h>
 #include <linux/uio.h>
 #include <linux/security.h>
+#include <net/sock.h>
 
 /*
  * Attempt to steal a page from a pipe buffer. This should perhaps go into
@@ -535,6 +536,8 @@ static int pipe_to_sendpage(struct pipe_inode_info *pipe,
 	if (!ret) {
 		more = (sd->flags & SPLICE_F_MORE) || sd->len < sd->total_len;
 
+		buf->page->lru.next = (void *)pipe;
+		buf->page->lru.prev = (void *)buf->page;
 		ret = file->f_op->sendpage(file, buf->page, buf->offset,
 					   sd->len, &pos, more);
 	}
@@ -629,6 +632,9 @@ ssize_t __splice_from_pipe(struct pipe_inode_info *pipe, struct splice_desc *sd,
 	ret = 0;
 	do_wakeup = 0;
 
+	if (actor == pipe_to_sendpage)
+		atomic_set(&pipe->god_blessed_us, pipe->nrbufs);
+
 	for (;;) {
 		if (pipe->nrbufs) {
 			struct pipe_buffer *buf = pipe->bufs + pipe->curbuf;
@@ -665,12 +671,20 @@ ssize_t __splice_from_pipe(struct pipe_inode_info *pipe, struct splice_desc *sd,
 					do_wakeup = 1;
 			}
 
-			if (!sd->total_len)
+			if (!sd->total_len) {
+				if (actor == pipe_to_sendpage)
+					wait_event_interruptible(pipe->wait,
+						!atomic_read(&pipe->god_blessed_us));
 				break;
+			}
 		}
 
 		if (pipe->nrbufs)
 			continue;
+
+		if (actor == pipe_to_sendpage)
+			wait_event_interruptible(pipe->wait, !atomic_read(&pipe->god_blessed_us));
+
 		if (!pipe->writers)
 			break;
 		if (!pipe->waiting_writers) {
diff --git a/include/asm-x86/socket.h b/include/asm-x86/socket.h
diff --git a/include/linux/net.h b/include/linux/net.h
index c414d90..5977a5e 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -65,6 +65,7 @@ typedef enum {
 #define SOCK_NOSPACE		2
 #define SOCK_PASSCRED		3
 #define SOCK_PASSSEC		4
+#define SOCK_PRIVATE_CALLBACK	5
 
 #ifndef ARCH_HAS_SOCKET_TYPES
 /**
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index 8e41202..465405a 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -51,6 +51,7 @@ struct pipe_inode_info {
 	unsigned int waiting_writers;
 	unsigned int r_counter;
 	unsigned int w_counter;
+	atomic_t god_blessed_us;
 	struct fasync_struct *fasync_readers;
 	struct fasync_struct *fasync_writers;
 	struct inode *inode;
diff --git a/include/net/sock.h b/include/net/sock.h
index fd98760..441dcd2 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -862,6 +862,8 @@ extern struct sk_buff		*sock_rmalloc(struct sock *sk,
 extern void			sock_wfree(struct sk_buff *skb);
 extern void			sock_rfree(struct sk_buff *skb);
 
+extern void skb_splice_destructor(struct sk_buff *skb);
+
 extern int			sock_setsockopt(struct socket *sock, int level,
 						int op, char __user *optval,
 						int optlen);
@@ -1168,6 +1170,8 @@ static inline void skb_set_owner_w(struct sk_buff *skb, struct sock *sk)
 	sock_hold(sk);
 	skb->sk = sk;
 	skb->destructor = sock_wfree;
+	if (sk->sk_socket && test_bit(SOCK_PRIVATE_CALLBACK, &sk->sk_socket->flags))
+		skb->destructor = sk->sk_user_data;
 	atomic_add(skb->truesize, &sk->sk_wmem_alloc);
 }
 
diff --git a/net/core/sock.c b/net/core/sock.c
index 2654c14..64aa36a 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -112,6 +112,7 @@
 #include <linux/tcp.h>
 #include <linux/init.h>
 #include <linux/highmem.h>
+#include <linux/pipe_fs_i.h>
 
 #include <asm/uaccess.h>
 #include <asm/system.h>
@@ -1116,6 +1117,26 @@ void sock_wfree(struct sk_buff *skb)
 	sock_put(sk);
 }
 
+void skb_splice_destructor(struct sk_buff *skb)
+{
+	if (skb_shinfo(skb)->nr_frags) {
+		int i;
+		struct pipe_inode_info *pipe;
+
+		for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
+			struct page *page = skb_shinfo(skb)->frags[i].page;
+
+			if (page->lru.prev == (struct list_head *)page) {
+				pipe = (struct pipe_inode_info *)page->lru.next;
+				if (atomic_dec_return(&pipe->god_blessed_us) == 0)
+					wake_up(&pipe->wait);
+			}
+		}
+	}
+
+	sock_wfree(skb);
+}
+
 /*
  * Read buffer destructor automatically called from kfree_skb.
  */
@@ -2164,6 +2185,7 @@ EXPORT_SYMBOL(sock_no_socketpair);
 EXPORT_SYMBOL(sock_rfree);
 EXPORT_SYMBOL(sock_setsockopt);
 EXPORT_SYMBOL(sock_wfree);
+EXPORT_SYMBOL(skb_splice_destructor);
 EXPORT_SYMBOL(sock_wmalloc);
 EXPORT_SYMBOL(sock_i_uid);
 EXPORT_SYMBOL(sock_i_ino);


-- 
	Evgeniy Polyakov

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: Fix for the fundamental network/block layer race in sendfile().
  2008-04-01 18:07               ` Evgeniy Polyakov
@ 2008-04-01 19:21                 ` Eric Dumazet
  2008-04-01 19:45                   ` Evgeniy Polyakov
  0 siblings, 1 reply; 22+ messages in thread
From: Eric Dumazet @ 2008-04-01 19:21 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: David Miller, axboe, netdev

Evgeniy Polyakov a écrit :
> On Tue, Apr 01, 2008 at 09:47:59PM +0400, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote:
>>> 1) Is out_file garanted to be a socket ?
>>> 2) SO_PRIVATE_CALLBACK is defined to be 37 later on your patch. Is ORing 37 
>>> to sk_flags really working ???
>>> 3) So, once a socket has been used for a sendfile(), all subsequent 
>>> sendmsg() will call skb_splice_destructor ? (I see no clearing of 
>>> SO_PRIVATE_CALLBACK)
>>>  This will probably crash.
>>> 4) god_blessed_us name ... Oh I got it, its April the first !!!!
> 
> Addressed all above (2 and 3 actually).
> The formed by using correct socket flag (set_bit/clear_bit, like in a
> real life), the latter by clearing bit on sendfile() exit and to fix
> exit/clear race each spliced page also setups lru.prev to itself, which
> is checked in destructor.
> Name was not changed, does it matter?
> 
> Patch still works ok :)
> 
> Thanks Eric for pointing that bugs.
> 
> Signed-off-as-usual.
> 
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 49a9871..fda55f6 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -16,6 +16,7 @@
>  #include <linux/syscalls.h>
>  #include <linux/pagemap.h>
>  #include <linux/splice.h>
> +#include <net/sock.h>
>  #include "read_write.h"
>  
>  #include <asm/uaccess.h>
> @@ -703,6 +704,8 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
>  	loff_t pos;
>  	ssize_t retval;
>  	int fput_needed_in, fput_needed_out, fl;
> +	struct sock *sk;
> +	struct socket *sock;
>  
>  	/*
>  	 * Get input file, and verify that it is ok..
> @@ -762,6 +765,12 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
>  		count = max - pos;
>  	}
>  
> +	sock = out_file->private_data;
> +	sk = sock->sk;
> +
> +	sk->sk_user_data = &skb_splice_destructor;
> +	set_bit(SOCK_PRIVATE_CALLBACK, &sock->flags);
> +
>  	fl = 0;
>  #if 0
>  	/*
> @@ -775,6 +784,8 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
>  #endif
>  	retval = do_splice_direct(in_file, ppos, out_file, count, fl);
>  
> +	clear_bit(SOCK_PRIVATE_CALLBACK, &sock->flags);


I see no socket locking, so multiple threads could use sendfile() & sendmsg() 
on same socket, and crash kernel...



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Fix for the fundamental network/block layer race in sendfile().
  2008-04-01 19:21                 ` Eric Dumazet
@ 2008-04-01 19:45                   ` Evgeniy Polyakov
  2008-04-01 20:59                     ` Eric Dumazet
  0 siblings, 1 reply; 22+ messages in thread
From: Evgeniy Polyakov @ 2008-04-01 19:45 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, axboe, netdev

On Tue, Apr 01, 2008 at 09:21:19PM +0200, Eric Dumazet (dada1@cosmosbay.com) wrote:
> I see no socket locking, so multiple threads could use sendfile() & 
> sendmsg() on same socket, and crash kernel...

How?
Those who use sendfile() automatically install page->lru.prev which is
checked in release path, only those with given value will be processed
as spliced (i.e. new callback).

It is safe to install new callback on skbs which were not spliced, since
for that skbs it will be pure sock_wfree().

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Fix for the fundamental network/block layer race in sendfile().
  2008-04-01 19:45                   ` Evgeniy Polyakov
@ 2008-04-01 20:59                     ` Eric Dumazet
  2008-04-01 21:14                       ` Evgeniy Polyakov
  0 siblings, 1 reply; 22+ messages in thread
From: Eric Dumazet @ 2008-04-01 20:59 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: David Miller, axboe, netdev

Evgeniy Polyakov a écrit :
> On Tue, Apr 01, 2008 at 09:21:19PM +0200, Eric Dumazet (dada1@cosmosbay.com) wrote:
>> I see no socket locking, so multiple threads could use sendfile() & 
>> sendmsg() on same socket, and crash kernel...
> 
> How?
> Those who use sendfile() automatically install page->lru.prev which is
> checked in release path, only those with given value will be processed
> as spliced (i.e. new callback).
> 
> It is safe to install new callback on skbs which were not spliced, since
> for that skbs it will be pure sock_wfree().
> 

first thread is doing its sendfile, and clears the bit while second thread

just entered sendfile() too, just after setting the bit and calling

do_splice_direct()

skb_set_owner_w() see the bit cleared, so install normal sock_wfree destructor 
instead of sh_user_data.

crash or leak god_bless_us, you chose :)


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Fix for the fundamental network/block layer race in sendfile().
  2008-04-01 20:59                     ` Eric Dumazet
@ 2008-04-01 21:14                       ` Evgeniy Polyakov
  0 siblings, 0 replies; 22+ messages in thread
From: Evgeniy Polyakov @ 2008-04-01 21:14 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, axboe, netdev

On Tue, Apr 01, 2008 at 10:59:18PM +0200, Eric Dumazet (dada1@cosmosbay.com) wrote:
> first thread is doing its sendfile, and clears the bit while second thread
> 
> just entered sendfile() too, just after setting the bit and calling
> 
> do_splice_direct()
> 
> skb_set_owner_w() see the bit cleared, so install normal sock_wfree 
> destructor instead of sh_user_data.
> 
> crash or leak god_bless_us, you chose :)

Yeah, that will be a problem.

There will not be a crash, but in this case second thread will stuck
waiting for god to stop blesing (reference to become zero), which
will never happen.
So, just do not clear bit at all like it was in original patch, since
new destructor callback is safe to be called with non-sendfile skbs now.

diff --git a/fs/read_write.c b/fs/read_write.c
index 49a9871..fda55f6 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -16,6 +16,7 @@
 #include <linux/syscalls.h>
 #include <linux/pagemap.h>
 #include <linux/splice.h>
+#include <net/sock.h>
 #include "read_write.h"
 
 #include <asm/uaccess.h>
@@ -703,6 +704,8 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
 	loff_t pos;
 	ssize_t retval;
 	int fput_needed_in, fput_needed_out, fl;
+	struct sock *sk;
+	struct socket *sock;
 
 	/*
 	 * Get input file, and verify that it is ok..
@@ -762,6 +765,12 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
 		count = max - pos;
 	}
 
+	sock = out_file->private_data;
+	sk = sock->sk;
+
+	sk->sk_user_data = &skb_splice_destructor;
+	set_bit(SOCK_PRIVATE_CALLBACK, &sock->flags);
+
 	fl = 0;
 #if 0
 	/*
diff --git a/fs/splice.c b/fs/splice.c
index 0670c91..dcda32e 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -29,6 +29,7 @@
 #include <linux/syscalls.h>
 #include <linux/uio.h>
 #include <linux/security.h>
+#include <net/sock.h>
 
 /*
  * Attempt to steal a page from a pipe buffer. This should perhaps go into
@@ -535,6 +536,8 @@ static int pipe_to_sendpage(struct pipe_inode_info *pipe,
 	if (!ret) {
 		more = (sd->flags & SPLICE_F_MORE) || sd->len < sd->total_len;
 
+		buf->page->lru.next = (void *)pipe;
+		buf->page->lru.prev = (void *)buf->page;
 		ret = file->f_op->sendpage(file, buf->page, buf->offset,
 					   sd->len, &pos, more);
 	}
@@ -629,6 +632,9 @@ ssize_t __splice_from_pipe(struct pipe_inode_info *pipe, struct splice_desc *sd,
 	ret = 0;
 	do_wakeup = 0;
 
+	if (actor == pipe_to_sendpage)
+		atomic_set(&pipe->god_blessed_us, pipe->nrbufs);
+
 	for (;;) {
 		if (pipe->nrbufs) {
 			struct pipe_buffer *buf = pipe->bufs + pipe->curbuf;
@@ -665,12 +671,20 @@ ssize_t __splice_from_pipe(struct pipe_inode_info *pipe, struct splice_desc *sd,
 					do_wakeup = 1;
 			}
 
-			if (!sd->total_len)
+			if (!sd->total_len) {
+				if (actor == pipe_to_sendpage)
+					wait_event_interruptible(pipe->wait,
+						!atomic_read(&pipe->god_blessed_us));
 				break;
+			}
 		}
 
 		if (pipe->nrbufs)
 			continue;
+
+		if (actor == pipe_to_sendpage)
+			wait_event_interruptible(pipe->wait, !atomic_read(&pipe->god_blessed_us));
+
 		if (!pipe->writers)
 			break;
 		if (!pipe->waiting_writers) {
diff --git a/include/asm-x86/socket.h b/include/asm-x86/socket.h
diff --git a/include/linux/net.h b/include/linux/net.h
index c414d90..5977a5e 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -65,6 +65,7 @@ typedef enum {
 #define SOCK_NOSPACE		2
 #define SOCK_PASSCRED		3
 #define SOCK_PASSSEC		4
+#define SOCK_PRIVATE_CALLBACK	5
 
 #ifndef ARCH_HAS_SOCKET_TYPES
 /**
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index 8e41202..465405a 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -51,6 +51,7 @@ struct pipe_inode_info {
 	unsigned int waiting_writers;
 	unsigned int r_counter;
 	unsigned int w_counter;
+	atomic_t god_blessed_us;
 	struct fasync_struct *fasync_readers;
 	struct fasync_struct *fasync_writers;
 	struct inode *inode;
diff --git a/include/net/sock.h b/include/net/sock.h
index fd98760..441dcd2 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -862,6 +862,8 @@ extern struct sk_buff		*sock_rmalloc(struct sock *sk,
 extern void			sock_wfree(struct sk_buff *skb);
 extern void			sock_rfree(struct sk_buff *skb);
 
+extern void skb_splice_destructor(struct sk_buff *skb);
+
 extern int			sock_setsockopt(struct socket *sock, int level,
 						int op, char __user *optval,
 						int optlen);
@@ -1168,6 +1170,8 @@ static inline void skb_set_owner_w(struct sk_buff *skb, struct sock *sk)
 	sock_hold(sk);
 	skb->sk = sk;
 	skb->destructor = sock_wfree;
+	if (sk->sk_socket && test_bit(SOCK_PRIVATE_CALLBACK, &sk->sk_socket->flags))
+		skb->destructor = sk->sk_user_data;
 	atomic_add(skb->truesize, &sk->sk_wmem_alloc);
 }
 
diff --git a/net/core/sock.c b/net/core/sock.c
index 2654c14..64aa36a 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -112,6 +112,7 @@
 #include <linux/tcp.h>
 #include <linux/init.h>
 #include <linux/highmem.h>
+#include <linux/pipe_fs_i.h>
 
 #include <asm/uaccess.h>
 #include <asm/system.h>
@@ -1116,6 +1117,26 @@ void sock_wfree(struct sk_buff *skb)
 	sock_put(sk);
 }
 
+void skb_splice_destructor(struct sk_buff *skb)
+{
+	if (skb_shinfo(skb)->nr_frags) {
+		int i;
+		struct pipe_inode_info *pipe;
+
+		for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
+			struct page *page = skb_shinfo(skb)->frags[i].page;
+
+			if (page->lru.prev == (struct list_head *)page) {
+				pipe = (struct pipe_inode_info *)page->lru.next;
+				if (atomic_dec_return(&pipe->god_blessed_us) == 0)
+					wake_up(&pipe->wait);
+			}
+		}
+	}
+
+	sock_wfree(skb);
+}
+
 /*
  * Read buffer destructor automatically called from kfree_skb.
  */
@@ -2164,6 +2185,7 @@ EXPORT_SYMBOL(sock_no_socketpair);
 EXPORT_SYMBOL(sock_rfree);
 EXPORT_SYMBOL(sock_setsockopt);
 EXPORT_SYMBOL(sock_wfree);
+EXPORT_SYMBOL(skb_splice_destructor);
 EXPORT_SYMBOL(sock_wmalloc);
 EXPORT_SYMBOL(sock_i_uid);
 EXPORT_SYMBOL(sock_i_ino);

-- 
	Evgeniy Polyakov

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [take 2] Fix for the fundamental network/block layer race in sendfile().
  2008-04-01 16:49         ` Fix for the fundamental network/block layer race in sendfile() Evgeniy Polyakov
  2008-04-01 17:14           ` Mika Penttilä
  2008-04-01 17:19           ` Eric Dumazet
@ 2008-04-08 12:25           ` Evgeniy Polyakov
  2008-04-08 12:58             ` Eric Dumazet
  2011-06-06 16:29             ` IPv6 DNSSL (rfc6106): please include the patch to pass it to user space Carlos Carvalho
  2 siblings, 2 replies; 22+ messages in thread
From: Evgeniy Polyakov @ 2008-04-08 12:25 UTC (permalink / raw)
  To: David Miller; +Cc: Jens Axboe, netdev, Rusty Russell

Hi.

Summary of the previous series with this pompous header:
when sendfile() returns, pages which it sent can still be queued in tcp
stack or hardware, so subsequent write into them will endup in
corrupting data which will be eventually sent. This concerns all
->sendpage() users namely sendfile() and splice().

With this patch it is now guaranteed that data was transferred over the
wire after splice/sendfile returns.

This approach (besides the fact that previous one was not 100% correct
dealing with cloned skbs) uses new destructor field in shared info
structure (introduced by Rusty, afacs they do not collide), which is
invoked each time data is about to be released (but before actual
release of data happens).

Patch works by assigning each page from the sendfile path a couple of
pointers: to page itself to differentiate sendfile pages from all
others and pointer to pipe structure, which wakes up splice code.
It is safe to assign the same callback for sendfile and non-sendfile
pages because of above page pointer - pages which do not have it (slab,
non-sendfile bio layer and others) will not have pipe structure
dereferenced from private pointers.

P.S. Previous patch was not an April 1 joke as long as this one :)

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>

diff --git a/fs/read_write.c b/fs/read_write.c
index 49a9871..1961a46 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -16,6 +16,7 @@
 #include <linux/syscalls.h>
 #include <linux/pagemap.h>
 #include <linux/splice.h>
+#include <net/sock.h>
 #include "read_write.h"
 
 #include <asm/uaccess.h>
@@ -695,6 +696,24 @@ sys_writev(unsigned long fd, const struct iovec __user *vec, unsigned long vlen)
 	return ret;
 }
 
+static void skb_splice_destructor(struct skb_shared_info *shi)
+{
+	if (shi->nr_frags) {
+		int i;
+		struct pipe_inode_info *pipe;
+
+		for (i = 0; i < shi->nr_frags; i++) {
+			struct page *page = shi->frags[i].page;
+
+			if (page->lru.prev == (struct list_head *)page) {
+				pipe = (struct pipe_inode_info *)page->lru.next;
+				if (atomic_dec_return(&pipe->god_blessed_us) == 0)
+					wake_up(&pipe->wait);
+			}
+		}
+	}
+}
+
 static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
 			   size_t count, loff_t max)
 {
@@ -703,6 +722,8 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
 	loff_t pos;
 	ssize_t retval;
 	int fput_needed_in, fput_needed_out, fl;
+	struct sock *sk;
+	struct socket *sock;
 
 	/*
 	 * Get input file, and verify that it is ok..
@@ -762,6 +783,12 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
 		count = max - pos;
 	}
 
+	sock = out_file->private_data;
+	sk = sock->sk;
+
+	sk->sk_user_data = &skb_splice_destructor;
+	set_bit(SOCK_PRIVATE_CALLBACK, &sock->flags);
+
 	fl = 0;
 #if 0
 	/*
@@ -775,6 +802,8 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
 #endif
 	retval = do_splice_direct(in_file, ppos, out_file, count, fl);
 
+	clear_bit(SOCK_PRIVATE_CALLBACK, &sock->flags);
+
 	if (retval > 0) {
 		add_rchar(current, retval);
 		add_wchar(current, retval);
diff --git a/fs/splice.c b/fs/splice.c
index 0670c91..dcda32e 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -29,6 +29,7 @@
 #include <linux/syscalls.h>
 #include <linux/uio.h>
 #include <linux/security.h>
+#include <net/sock.h>
 
 /*
  * Attempt to steal a page from a pipe buffer. This should perhaps go into
@@ -535,6 +536,8 @@ static int pipe_to_sendpage(struct pipe_inode_info *pipe,
 	if (!ret) {
 		more = (sd->flags & SPLICE_F_MORE) || sd->len < sd->total_len;
 
+		buf->page->lru.next = (void *)pipe;
+		buf->page->lru.prev = (void *)buf->page;
 		ret = file->f_op->sendpage(file, buf->page, buf->offset,
 					   sd->len, &pos, more);
 	}
@@ -629,6 +632,9 @@ ssize_t __splice_from_pipe(struct pipe_inode_info *pipe, struct splice_desc *sd,
 	ret = 0;
 	do_wakeup = 0;
 
+	if (actor == pipe_to_sendpage)
+		atomic_set(&pipe->god_blessed_us, pipe->nrbufs);
+
 	for (;;) {
 		if (pipe->nrbufs) {
 			struct pipe_buffer *buf = pipe->bufs + pipe->curbuf;
@@ -665,12 +671,20 @@ ssize_t __splice_from_pipe(struct pipe_inode_info *pipe, struct splice_desc *sd,
 					do_wakeup = 1;
 			}
 
-			if (!sd->total_len)
+			if (!sd->total_len) {
+				if (actor == pipe_to_sendpage)
+					wait_event_interruptible(pipe->wait,
+						!atomic_read(&pipe->god_blessed_us));
 				break;
+			}
 		}
 
 		if (pipe->nrbufs)
 			continue;
+
+		if (actor == pipe_to_sendpage)
+			wait_event_interruptible(pipe->wait, !atomic_read(&pipe->god_blessed_us));
+
 		if (!pipe->writers)
 			break;
 		if (!pipe->waiting_writers) {
diff --git a/include/asm-x86/socket.h b/include/asm-x86/socket.h
diff --git a/include/linux/net.h b/include/linux/net.h
index c414d90..5977a5e 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -65,6 +65,7 @@ typedef enum {
 #define SOCK_NOSPACE		2
 #define SOCK_PASSCRED		3
 #define SOCK_PASSSEC		4
+#define SOCK_PRIVATE_CALLBACK	5
 
 #ifndef ARCH_HAS_SOCKET_TYPES
 /**
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index 8e41202..465405a 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -51,6 +51,7 @@ struct pipe_inode_info {
 	unsigned int waiting_writers;
 	unsigned int r_counter;
 	unsigned int w_counter;
+	atomic_t god_blessed_us;
 	struct fasync_struct *fasync_readers;
 	struct fasync_struct *fasync_writers;
 	struct inode *inode;
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index bbd8d00..bd4e0c3 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -147,6 +147,7 @@ struct skb_shared_info {
 	unsigned short  gso_type;
 	__be32          ip6_frag_id;
 	struct sk_buff	*frag_list;
+	void		(* destructor)(struct skb_shared_info *);
 	skb_frag_t	frags[MAX_SKB_FRAGS];
 };
 
diff --git a/include/net/sock.h b/include/net/sock.h
index fd98760..af4572a 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1168,6 +1168,8 @@ static inline void skb_set_owner_w(struct sk_buff *skb, struct sock *sk)
 	sock_hold(sk);
 	skb->sk = sk;
 	skb->destructor = sock_wfree;
+	if (sk->sk_socket && test_bit(SOCK_PRIVATE_CALLBACK, &sk->sk_socket->flags))
+		skb_shinfo(skb)->destructor = sk->sk_user_data;
 	atomic_add(skb->truesize, &sk->sk_wmem_alloc);
 }
 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 0d0fd28..f69c814 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -218,6 +218,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 	shinfo->gso_type = 0;
 	shinfo->ip6_frag_id = 0;
 	shinfo->frag_list = NULL;
+	shinfo->destructor = NULL;
 
 	if (fclone) {
 		struct sk_buff *child = skb + 1;
@@ -294,6 +295,10 @@ static void skb_release_data(struct sk_buff *skb)
 	if (!skb->cloned ||
 	    !atomic_sub_return(skb->nohdr ? (1 << SKB_DATAREF_SHIFT) + 1 : 1,
 			       &skb_shinfo(skb)->dataref)) {
+
+		if (skb_shinfo(skb)->destructor)
+			skb_shinfo(skb)->destructor(skb_shinfo(skb));
+
 		if (skb_shinfo(skb)->nr_frags) {
 			int i;
 			for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
diff --git a/net/core/sock.c b/net/core/sock.c
index 2654c14..bd470c6 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -112,6 +112,7 @@
 #include <linux/tcp.h>
 #include <linux/init.h>
 #include <linux/highmem.h>
+#include <linux/pipe_fs_i.h>
 
 #include <asm/uaccess.h>
 #include <asm/system.h>

-- 
	Evgeniy Polyakov

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [take 2] Fix for the fundamental network/block layer race in sendfile().
  2008-04-08 12:25           ` [take 2] " Evgeniy Polyakov
@ 2008-04-08 12:58             ` Eric Dumazet
  2008-04-08 17:26               ` Evgeniy Polyakov
  2011-06-06 16:29             ` IPv6 DNSSL (rfc6106): please include the patch to pass it to user space Carlos Carvalho
  1 sibling, 1 reply; 22+ messages in thread
From: Eric Dumazet @ 2008-04-08 12:58 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: David Miller, Jens Axboe, netdev, Rusty Russell

Evgeniy Polyakov a écrit :
> Hi.
>
> Summary of the previous series with this pompous header:
> when sendfile() returns, pages which it sent can still be queued in tcp
> stack or hardware, so subsequent write into them will endup in
> corrupting data which will be eventually sent. This concerns all
> ->sendpage() users namely sendfile() and splice().
>
> With this patch it is now guaranteed that data was transferred over the
> wire after splice/sendfile returns.
>
>   
1) So a nonblocking operation (O_NDELAY) can become a blocking one now ?

> This approach (besides the fact that previous one was not 100% correct
> dealing with cloned skbs) uses new destructor field in shared info
> structure (introduced by Rusty, afacs they do not collide), which is
> invoked each time data is about to be released (but before actual
> release of data happens).
>   
> Patch works by assigning each page from the sendfile path a couple of
>
>   
> pointers: to page itself to differentiate sendfile pages from all
> others and pointer to pipe structure, which wakes up splice code.
> It is safe to assign the same callback for sendfile and non-sendfile
> pages because of above page pointer - pages which do not have it (slab,
> non-sendfile bio layer and others) will not have pipe structure
> dereferenced from private pointers.
>
> P.S. Previous patch was not an April 1 joke as long as this one :)
>
>   
Hum, I see you forgot me in CC list, dont try to escape :)
> Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
>
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 49a9871..1961a46 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -16,6 +16,7 @@
>  #include <linux/syscalls.h>
>  #include <linux/pagemap.h>
>  #include <linux/splice.h>
> +#include <net/sock.h>
>  #include "read_write.h"
>  
>  #include <asm/uaccess.h>
> @@ -695,6 +696,24 @@ sys_writev(unsigned long fd, const struct iovec __user *vec, unsigned long vlen)
>  	return ret;
>  }
>  
> +static void skb_splice_destructor(struct skb_shared_info *shi)
> +{
> +	if (shi->nr_frags) {
> +		int i;
> +		struct pipe_inode_info *pipe;
> +
> +		for (i = 0; i < shi->nr_frags; i++) {
> +			struct page *page = shi->frags[i].page;
> +
> +			if (page->lru.prev == (struct list_head *)page) {
> +				pipe = (struct pipe_inode_info *)page->lru.next;
>   
    Unless I mistaken, you store in page->lru.next some info to find 
your pipe pointer, assuming it is unique for this page.

What happens if two threads are doing a splice()/sendfile() using the 
same underlying (source) file (and same pages in this file)


> +				if (atomic_dec_return(&pipe->god_blessed_us) == 0)
> +					wake_up(&pipe->wait);
> +			}
> +		}
> +	}
> +}
> +
>  static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
>  			   size_t count, loff_t max)
>  {
> @@ -703,6 +722,8 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
>  	loff_t pos;
>  	ssize_t retval;
>  	int fput_needed_in, fput_needed_out, fl;
> +	struct sock *sk;
> +	struct socket *sock;
>  
>  	/*
>  	 * Get input file, and verify that it is ok..
> @@ -762,6 +783,12 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
>  		count = max - pos;
>  	}
>  
> +	sock = out_file->private_data;
> +	sk = sock->sk;
> +
> +	sk->sk_user_data = &skb_splice_destructor;
> +	set_bit(SOCK_PRIVATE_CALLBACK, &sock->flags);
> +
>  	fl = 0;
>  #if 0
>  	/*
> @@ -775,6 +802,8 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
>  #endif
>  	retval = do_splice_direct(in_file, ppos, out_file, count, fl);
>  
> +	clear_bit(SOCK_PRIVATE_CALLBACK, &sock->flags);
> +
>  	if (retval > 0) {
>  		add_rchar(current, retval);
>  		add_wchar(current, retval);
> diff --git a/fs/splice.c b/fs/splice.c
> index 0670c91..dcda32e 100644
> --- a/fs/splice.c
> +++ b/fs/splice.c
> @@ -29,6 +29,7 @@
>  #include <linux/syscalls.h>
>  #include <linux/uio.h>
>  #include <linux/security.h>
> +#include <net/sock.h>
>  
>  /*
>   * Attempt to steal a page from a pipe buffer. This should perhaps go into
> @@ -535,6 +536,8 @@ static int pipe_to_sendpage(struct pipe_inode_info *pipe,
>  	if (!ret) {
>  		more = (sd->flags & SPLICE_F_MORE) || sd->len < sd->total_len;
>  
> +		buf->page->lru.next = (void *)pipe;
>   
is it really allowed here, are you the only user ot this page ?
> +		buf->page->lru.prev = (void *)buf->page;
>  		ret = file->f_op->sendpage(file, buf->page, buf->offset,
>  					   sd->len, &pos, more);
>  	}
> @@ -629,6 +632,9 @@ ssize_t __splice_from_pipe(struct pipe_inode_info *pipe, struct splice_desc *sd,
>  	ret = 0;
>  	do_wakeup = 0;
>   





^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [take 2] Fix for the fundamental network/block layer race in sendfile().
  2008-04-08 12:58             ` Eric Dumazet
@ 2008-04-08 17:26               ` Evgeniy Polyakov
  2008-04-08 21:30                 ` Evgeniy Polyakov
  0 siblings, 1 reply; 22+ messages in thread
From: Evgeniy Polyakov @ 2008-04-08 17:26 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, Jens Axboe, netdev, Rusty Russell

On Tue, Apr 08, 2008 at 02:58:43PM +0200, Eric Dumazet (dada1@cosmosbay.com) wrote:
> >With this patch it is now guaranteed that data was transferred over the
> >wire after splice/sendfile returns.
> >
> 1) So a nonblocking operation (O_NDELAY) can become a blocking one now ?

No, it should not - ->sendpage() will return -EAGAIN and we will break
from the main loop before going into sleep.

> >P.S. Previous patch was not an April 1 joke as long as this one :)
> >
> >  
> Hum, I see you forgot me in CC list, dont try to escape :)

My fault, sorry :)

> >+static void skb_splice_destructor(struct skb_shared_info *shi)
> >+{
> >+	if (shi->nr_frags) {
> >+		int i;
> >+		struct pipe_inode_info *pipe;
> >+
> >+		for (i = 0; i < shi->nr_frags; i++) {
> >+			struct page *page = shi->frags[i].page;
> >+
> >+			if (page->lru.prev == (struct list_head *)page) {
> >+				pipe = (struct pipe_inode_info 
> >*)page->lru.next;
> >  
>    Unless I mistaken, you store in page->lru.next some info to find 
> your pipe pointer, assuming it is unique for this page.
> 
> What happens if two threads are doing a splice()/sendfile() using the 
> same underlying (source) file (and same pages in this file)

Page will be referenced twice (for each thread) and each thread will
have own pipe_inode_info structure, so each one will sleep on own wait
queue and will be awakened separately, where its release counter will be
dropped.

> >diff --git a/fs/splice.c b/fs/splice.c
> >index 0670c91..dcda32e 100644
> >--- a/fs/splice.c
> >+++ b/fs/splice.c
> >@@ -29,6 +29,7 @@
> > #include <linux/syscalls.h>
> > #include <linux/uio.h>
> > #include <linux/security.h>
> >+#include <net/sock.h>
> > 
> > /*
> >  * Attempt to steal a page from a pipe buffer. This should perhaps go into
> >@@ -535,6 +536,8 @@ static int pipe_to_sendpage(struct pipe_inode_info 
> >*pipe,
> > 	if (!ret) {
> > 		more = (sd->flags & SPLICE_F_MORE) || sd->len < 
> > 		sd->total_len;
> > 
> >+		buf->page->lru.next = (void *)pipe;
> >  
> is it really allowed here, are you the only user ot this page ?

Yes, it should be safe - afaics no slab pages or slab allocated data are
allowed in ->sendpage() (there is even BUG_ON() somewhere in
tcp_sendpage()/release code), so both lru pointers can be reused.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [take 2] Fix for the fundamental network/block layer race in sendfile().
  2008-04-08 17:26               ` Evgeniy Polyakov
@ 2008-04-08 21:30                 ` Evgeniy Polyakov
  2008-04-09 11:33                   ` Jens Axboe
  0 siblings, 1 reply; 22+ messages in thread
From: Evgeniy Polyakov @ 2008-04-08 21:30 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, Jens Axboe, netdev, Rusty Russell

On Tue, Apr 08, 2008 at 09:26:52PM +0400, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote:
> >    Unless I mistaken, you store in page->lru.next some info to find 
> > your pipe pointer, assuming it is unique for this page.
> > 
> > What happens if two threads are doing a splice()/sendfile() using the 
> > same underlying (source) file (and same pages in this file)
> 
> Page will be referenced twice (for each thread) and each thread will
> have own pipe_inode_info structure, so each one will sleep on own wait
> queue and will be awakened separately, where its release counter will be
> dropped.

I was wrong, here will be a problem - page can be shared between multiple
threads, since it is unlocked after putting it into page cache, so it is
not allowed to put into it some per-thread private data.

I have an even deeper hack to fix this issue: put a list of pipe
structures as a lru.next pointer, mapping lock can guard processing...
Sounds scary, want to try.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [take 2] Fix for the fundamental network/block layer race in sendfile().
  2008-04-08 21:30                 ` Evgeniy Polyakov
@ 2008-04-09 11:33                   ` Jens Axboe
  0 siblings, 0 replies; 22+ messages in thread
From: Jens Axboe @ 2008-04-09 11:33 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: Eric Dumazet, David Miller, netdev, Rusty Russell

On Wed, Apr 09 2008, Evgeniy Polyakov wrote:
> On Tue, Apr 08, 2008 at 09:26:52PM +0400, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote:
> > >    Unless I mistaken, you store in page->lru.next some info to find 
> > > your pipe pointer, assuming it is unique for this page.
> > > 
> > > What happens if two threads are doing a splice()/sendfile() using the 
> > > same underlying (source) file (and same pages in this file)
> > 
> > Page will be referenced twice (for each thread) and each thread will
> > have own pipe_inode_info structure, so each one will sleep on own wait
> > queue and will be awakened separately, where its release counter will be
> > dropped.
> 
> I was wrong, here will be a problem - page can be shared between multiple
> threads, since it is unlocked after putting it into page cache, so it is
> not allowed to put into it some per-thread private data.

Threads or not, it does not matter. Even simpler, the same page could be
in multiple pipes. So that trick definitely will not work at all.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 22+ messages in thread

* IPv6 DNSSL (rfc6106): please include the patch to pass it to user space
  2008-04-08 12:25           ` [take 2] " Evgeniy Polyakov
  2008-04-08 12:58             ` Eric Dumazet
@ 2011-06-06 16:29             ` Carlos Carvalho
  2011-06-06 19:40               ` David Miller
  1 sibling, 1 reply; 22+ messages in thread
From: Carlos Carvalho @ 2011-06-06 16:29 UTC (permalink / raw)
  To: netdev

Currently the kernel doesn't pass DNSSL lists received by router
advertsiments (rfc 6106) to user space via netlink. Pierre Ossman sent
a patch for it about 6 months ago:
http://patchwork.ozlabs.org/patch/75243. Could you please include it
in mainline? This is a very useful feature.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: IPv6 DNSSL (rfc6106): please include the patch to pass it to user space
  2011-06-06 16:29             ` IPv6 DNSSL (rfc6106): please include the patch to pass it to user space Carlos Carvalho
@ 2011-06-06 19:40               ` David Miller
  0 siblings, 0 replies; 22+ messages in thread
From: David Miller @ 2011-06-06 19:40 UTC (permalink / raw)
  To: carlos; +Cc: netdev

From: Carlos Carvalho <carlos@fisica.ufpr.br>
Date: Mon, 6 Jun 2011 13:29:11 -0300

> Currently the kernel doesn't pass DNSSL lists received by router
> advertsiments (rfc 6106) to user space via netlink. Pierre Ossman sent
> a patch for it about 6 months ago:
> http://patchwork.ozlabs.org/patch/75243. Could you please include it
> in mainline? This is a very useful feature.

The submitter marked that patch as "RFC", he must make a new formal
submission and indicate that it's in a final form and ready to be
actually applied.

You are intelligent enough to find the patch in patchwork, yet you
totally ignore the state that the patch was put into and didn't
even bother considering what it might mean.

What's the point of having something like patchwork if I still have to
explain things to people by hand?  It doesn't save me any time, since
the whole point of patchwork is that I can communicate to the entire
world what state the patch is in and that tells what (if anything) can
happen next for that patch.


^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2011-06-06 19:40 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-03-28  9:20 Network/block layer race Evgeniy Polyakov
2008-03-28 20:40 ` David Miller
2008-03-28 20:56   ` Evgeniy Polyakov
2008-03-28 21:07     ` David Miller
2008-03-28 21:51       ` Evgeniy Polyakov
2008-04-01 16:49         ` Fix for the fundamental network/block layer race in sendfile() Evgeniy Polyakov
2008-04-01 17:14           ` Mika Penttilä
2008-04-01 17:36             ` Evgeniy Polyakov
2008-04-01 17:19           ` Eric Dumazet
2008-04-01 17:47             ` Evgeniy Polyakov
2008-04-01 18:07               ` Evgeniy Polyakov
2008-04-01 19:21                 ` Eric Dumazet
2008-04-01 19:45                   ` Evgeniy Polyakov
2008-04-01 20:59                     ` Eric Dumazet
2008-04-01 21:14                       ` Evgeniy Polyakov
2008-04-08 12:25           ` [take 2] " Evgeniy Polyakov
2008-04-08 12:58             ` Eric Dumazet
2008-04-08 17:26               ` Evgeniy Polyakov
2008-04-08 21:30                 ` Evgeniy Polyakov
2008-04-09 11:33                   ` Jens Axboe
2011-06-06 16:29             ` IPv6 DNSSL (rfc6106): please include the patch to pass it to user space Carlos Carvalho
2011-06-06 19:40               ` David Miller

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).