* Re: Dropping NETIF_F_SG since no checksum feature.
@ 2006-10-11 9:05 ` Michael S. Tsirkin
0 siblings, 0 replies; 54+ messages in thread
From: Michael S. Tsirkin @ 2006-10-11 9:05 UTC (permalink / raw)
To: David Miller; +Cc: netdev, openib-general, linux-kernel, shemminger
Quoting r. David Miller <davem@davemloft.net>:
> Subject: Re: Dropping NETIF_F_SG since no checksum feature.
>
> From: "Michael S. Tsirkin" <mst@mellanox.co.il>
> Date: Wed, 11 Oct 2006 02:13:38 +0200
>
> > Maybe I can patch linux to allow SG without checksum?
> > Dave, maybe you could drop a hint or two on whether this is worthwhile
> > and what are the issues that need addressing to make this work?
> >
> > I imagine it's not just the matter of changing net/core/dev.c :).
>
> You can't, it's a quality of implementation issue. We sendfile()
> pages directly out of the filesystem page cache without any
> blocking of modifications to the page contents, and the only way
> that works is if the card computes the checksum for us.
>
> If we sendfile() a page directly, we must compute a correct checksum
> no matter what the contents. We can't do this on the cpu before the
> data hits the device because another thread of execution can go in and
> modify the page contents which would invalidate the checksum and thus
> invalidating the packet. We cannot allow this.
>
> Blocking modifications is too expensive, so that's not an option
> either.
>
But copying still works fine, does it not?
Dave, could you clarify this please?
ssize_t tcp_sendpage(struct socket *sock, struct page *page, int offset,
size_t size, int flags)
{
ssize_t res;
struct sock *sk = sock->sk;
if (!(sk->sk_route_caps & NETIF_F_SG) ||
!(sk->sk_route_caps & NETIF_F_ALL_CSUM))
return sock_no_sendpage(sock, page, offset, size, flags);
So, it seems that if I set NETIF_F_SG but clear NETIF_F_ALL_CSUM,
data will be copied over rather than sent directly.
So why does dev.c have to force set NETIF_F_SG to off then?
--
MST
^ permalink raw reply [flat|nested] 54+ messages in thread* Re: Dropping NETIF_F_SG since no checksum feature.
2006-10-11 9:05 ` Michael S. Tsirkin
(?)
@ 2006-10-11 9:09 ` Steven Whitehouse
2006-10-11 15:01 ` Michael S. Tsirkin
-1 siblings, 1 reply; 54+ messages in thread
From: Steven Whitehouse @ 2006-10-11 9:09 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: David Miller, shemminger, linux-kernel, netdev, openib-general,
rolandd
Hi,
On Wed, Oct 11, 2006 at 11:05:04AM +0200, Michael S. Tsirkin wrote:
> Quoting r. David Miller <davem@davemloft.net>:
> > Subject: Re: Dropping NETIF_F_SG since no checksum feature.
> >
> > From: "Michael S. Tsirkin" <mst@mellanox.co.il>
> > Date: Wed, 11 Oct 2006 02:13:38 +0200
> >
> > > Maybe I can patch linux to allow SG without checksum?
> > > Dave, maybe you could drop a hint or two on whether this is worthwhile
> > > and what are the issues that need addressing to make this work?
> > >
> > > I imagine it's not just the matter of changing net/core/dev.c :).
> >
> > You can't, it's a quality of implementation issue. We sendfile()
> > pages directly out of the filesystem page cache without any
> > blocking of modifications to the page contents, and the only way
> > that works is if the card computes the checksum for us.
> >
> > If we sendfile() a page directly, we must compute a correct checksum
> > no matter what the contents. We can't do this on the cpu before the
> > data hits the device because another thread of execution can go in and
> > modify the page contents which would invalidate the checksum and thus
> > invalidating the packet. We cannot allow this.
> >
> > Blocking modifications is too expensive, so that's not an option
> > either.
> >
>
I would argue that SG does make sense without checksum for protocols that
don't need/use a checksum. DECnet for example could do zero-copy without
caring about the checksum since it doesn't have one. One of these days
I'll get around to writing that bit of code :-)
> But copying still works fine, does it not?
> Dave, could you clarify this please?
>
> ssize_t tcp_sendpage(struct socket *sock, struct page *page, int offset,
> size_t size, int flags)
> {
> ssize_t res;
> struct sock *sk = sock->sk;
>
> if (!(sk->sk_route_caps & NETIF_F_SG) ||
> !(sk->sk_route_caps & NETIF_F_ALL_CSUM))
> return sock_no_sendpage(sock, page, offset, size, flags);
>
>
> So, it seems that if I set NETIF_F_SG but clear NETIF_F_ALL_CSUM,
> data will be copied over rather than sent directly.
> So why does dev.c have to force set NETIF_F_SG to off then?
>
I agree with that analysis,
Steve.
^ permalink raw reply [flat|nested] 54+ messages in thread* Re: Dropping NETIF_F_SG since no checksum feature.
2006-10-11 9:09 ` Steven Whitehouse
@ 2006-10-11 15:01 ` Michael S. Tsirkin
2006-10-11 20:11 ` Steven Whitehouse
2006-10-11 20:52 ` David Miller
0 siblings, 2 replies; 54+ messages in thread
From: Michael S. Tsirkin @ 2006-10-11 15:01 UTC (permalink / raw)
To: Steven Whitehouse
Cc: David Miller, shemminger, linux-kernel, netdev, openib-general,
rolandd
Quoting Steven Whitehouse <steve@chygwyn.com>:
> > ssize_t tcp_sendpage(struct socket *sock, struct page *page, int offset,
> > size_t size, int flags)
> > {
> > ssize_t res;
> > struct sock *sk = sock->sk;
> >
> > if (!(sk->sk_route_caps & NETIF_F_SG) ||
> > !(sk->sk_route_caps & NETIF_F_ALL_CSUM))
> > return sock_no_sendpage(sock, page, offset, size, flags);
> >
> >
> > So, it seems that if I set NETIF_F_SG but clear NETIF_F_ALL_CSUM,
> > data will be copied over rather than sent directly.
> > So why does dev.c have to force set NETIF_F_SG to off then?
> >
> I agree with that analysis,
So, would you Ack something like the following then?
======================
Enabling NETIF_F_SG without NETIF_F_ALL_CSUM actually seems to work fine by
doing an old-fashioned data copy in tcp_sendpage.
And for devices that do not calculate IP checksum in hardware (e.g. InfiniBand)
calculating the checksum for all packets in network driver is worse than have
the CPU piggyback the checksum compitation with the copy process.
Finally, note that NETIF_F_SG is necessary to be able to allocate skbs >
PAGE_SIZE on busy systems.
So, let's allow that combination, again, for drivers that want it.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
---
diff --git a/net/core/dev.c b/net/core/dev.c
index d4a1ec3..2d731a0 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2930,14 +2930,6 @@ #endif
}
}
- /* Fix illegal SG+CSUM combinations. */
- if ((dev->features & NETIF_F_SG) &&
- !(dev->features & NETIF_F_ALL_CSUM)) {
- printk(KERN_NOTICE "%s: Dropping NETIF_F_SG since no checksum feature.\n",
- dev->name);
- dev->features &= ~NETIF_F_SG;
- }
-
/* TSO requires that SG is present as well. */
if ((dev->features & NETIF_F_TSO) &&
!(dev->features & NETIF_F_SG)) {
--
MST
^ permalink raw reply related [flat|nested] 54+ messages in thread* Re: Dropping NETIF_F_SG since no checksum feature.
2006-10-11 15:01 ` Michael S. Tsirkin
@ 2006-10-11 20:11 ` Steven Whitehouse
2006-10-11 20:52 ` Michael S. Tsirkin
2006-10-11 20:57 ` Stephen Hemminger
2006-10-11 20:52 ` David Miller
1 sibling, 2 replies; 54+ messages in thread
From: Steven Whitehouse @ 2006-10-11 20:11 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: David Miller, shemminger, linux-kernel, netdev, openib-general,
rolandd
Hi,
On Wed, Oct 11, 2006 at 05:01:03PM +0200, Michael S. Tsirkin wrote:
> Quoting Steven Whitehouse <steve@chygwyn.com>:
> > > ssize_t tcp_sendpage(struct socket *sock, struct page *page, int offset,
> > > size_t size, int flags)
> > > {
> > > ssize_t res;
> > > struct sock *sk = sock->sk;
> > >
> > > if (!(sk->sk_route_caps & NETIF_F_SG) ||
> > > !(sk->sk_route_caps & NETIF_F_ALL_CSUM))
> > > return sock_no_sendpage(sock, page, offset, size, flags);
> > >
> > >
> > > So, it seems that if I set NETIF_F_SG but clear NETIF_F_ALL_CSUM,
> > > data will be copied over rather than sent directly.
> > > So why does dev.c have to force set NETIF_F_SG to off then?
> > >
> > I agree with that analysis,
>
> So, would you Ack something like the following then?
>
In so far as I'm able to ack it, then yes, but with the following
caveats: that you also need to look at the tcp code's checks for
NETIF_F_SG (aside from the interface to tcp_sendpage which I think
we've agreed is ok) and ensure that this patch will not change their
behaviour, and here I'm thinking of the test in net/ipv4/tcp.c:select_size()
in particular - there may be others but thats the only one I can think
of off the top of my head. I think this is what davem was getting at
with his comment about copy & sum for smaller packets.
Also all subject to approval by davem and shemminger of course :-)
My general feeling is that devices should advertise the features that
they actually have and that the protocols should make the decision
as to which ones to use or not depending on the combinations available
(which I think is pretty much your argument).
Steve.
^ permalink raw reply [flat|nested] 54+ messages in thread* Re: Dropping NETIF_F_SG since no checksum feature.
2006-10-11 20:11 ` Steven Whitehouse
@ 2006-10-11 20:52 ` Michael S. Tsirkin
2006-10-11 20:57 ` Stephen Hemminger
1 sibling, 0 replies; 54+ messages in thread
From: Michael S. Tsirkin @ 2006-10-11 20:52 UTC (permalink / raw)
To: Steven Whitehouse
Cc: netdev, linux-kernel, openib-general, David Miller, shemminger
Quoting r. Steven Whitehouse <steve@chygwyn.com>:
> Subject: Re: Dropping NETIF_F_SG since no checksum feature.
>
> Hi,
>
> On Wed, Oct 11, 2006 at 05:01:03PM +0200, Michael S. Tsirkin wrote:
> > Quoting Steven Whitehouse <steve@chygwyn.com>:
> > > > ssize_t tcp_sendpage(struct socket *sock, struct page *page, int offset,
> > > > size_t size, int flags)
> > > > {
> > > > ssize_t res;
> > > > struct sock *sk = sock->sk;
> > > >
> > > > if (!(sk->sk_route_caps & NETIF_F_SG) ||
> > > > !(sk->sk_route_caps & NETIF_F_ALL_CSUM))
> > > > return sock_no_sendpage(sock, page, offset, size, flags);
> > > >
> > > >
> > > > So, it seems that if I set NETIF_F_SG but clear NETIF_F_ALL_CSUM,
> > > > data will be copied over rather than sent directly.
> > > > So why does dev.c have to force set NETIF_F_SG to off then?
> > > >
> > > I agree with that analysis,
> >
> > So, would you Ack something like the following then?
> >
>
> In so far as I'm able to ack it, then yes, but with the following
> caveats: that you also need to look at the tcp code's checks for
> NETIF_F_SG (aside from the interface to tcp_sendpage which I think
> we've agreed is ok) and ensure that this patch will not change their
> behaviour, and here I'm thinking of the test in net/ipv4/tcp.c:select_size()
> in particular - there may be others but thats the only one I can think
> of off the top of my head. I think this is what davem was getting at
> with his comment about copy & sum for smaller packets.
Will do - thanks for the tips.
> Also all subject to approval by davem and shemminger of course :-)
Goes without saying :)
> My general feeling is that devices should advertise the features that
> they actually have and that the protocols should make the decision
> as to which ones to use or not depending on the combinations available
> (which I think is pretty much your argument).
>
> Steve.
--
MST
^ permalink raw reply [flat|nested] 54+ messages in thread* Re: Dropping NETIF_F_SG since no checksum feature.
2006-10-11 20:11 ` Steven Whitehouse
2006-10-11 20:52 ` Michael S. Tsirkin
@ 2006-10-11 20:57 ` Stephen Hemminger
2006-10-11 21:23 ` Michael S. Tsirkin
1 sibling, 1 reply; 54+ messages in thread
From: Stephen Hemminger @ 2006-10-11 20:57 UTC (permalink / raw)
To: Steven Whitehouse
Cc: Michael S. Tsirkin, David Miller, linux-kernel, netdev,
openib-general, rolandd
On Wed, 11 Oct 2006 21:11:38 +0100
Steven Whitehouse <steve@chygwyn.com> wrote:
> Hi,
>
> On Wed, Oct 11, 2006 at 05:01:03PM +0200, Michael S. Tsirkin wrote:
> > Quoting Steven Whitehouse <steve@chygwyn.com>:
> > > > ssize_t tcp_sendpage(struct socket *sock, struct page *page, int offset,
> > > > size_t size, int flags)
> > > > {
> > > > ssize_t res;
> > > > struct sock *sk = sock->sk;
> > > >
> > > > if (!(sk->sk_route_caps & NETIF_F_SG) ||
> > > > !(sk->sk_route_caps & NETIF_F_ALL_CSUM))
> > > > return sock_no_sendpage(sock, page, offset, size, flags);
> > > >
> > > >
> > > > So, it seems that if I set NETIF_F_SG but clear NETIF_F_ALL_CSUM,
> > > > data will be copied over rather than sent directly.
> > > > So why does dev.c have to force set NETIF_F_SG to off then?
> > > >
> > > I agree with that analysis,
> >
> > So, would you Ack something like the following then?
> >
>
> In so far as I'm able to ack it, then yes, but with the following
> caveats: that you also need to look at the tcp code's checks for
> NETIF_F_SG (aside from the interface to tcp_sendpage which I think
> we've agreed is ok) and ensure that this patch will not change their
> behaviour, and here I'm thinking of the test in net/ipv4/tcp.c:select_size()
> in particular - there may be others but thats the only one I can think
> of off the top of my head. I think this is what davem was getting at
> with his comment about copy & sum for smaller packets.
>
> Also all subject to approval by davem and shemminger of course :-)
>
> My general feeling is that devices should advertise the features that
> they actually have and that the protocols should make the decision
> as to which ones to use or not depending on the combinations available
> (which I think is pretty much your argument).
>
> Steve.
>
You might want to try ignoring the check in dev.c and testing
to see if there is a performance gain. It wouldn't be hard to test
a modified version and validate the performance change.
You could even do what I suggested and use skb_checksum_help()
to do inplace checksumming, as a performance test.
--
Stephen Hemminger <shemminger@osdl.org>
^ permalink raw reply [flat|nested] 54+ messages in thread* Re: Dropping NETIF_F_SG since no checksum feature.
2006-10-11 20:57 ` Stephen Hemminger
@ 2006-10-11 21:23 ` Michael S. Tsirkin
0 siblings, 0 replies; 54+ messages in thread
From: Michael S. Tsirkin @ 2006-10-11 21:23 UTC (permalink / raw)
To: Stephen Hemminger
Cc: Steven Whitehouse, David Miller, linux-kernel, netdev,
openib-general, rolandd
Quoting r. Stephen Hemminger <shemminger@osdl.org>:
> Subject: Re: Dropping NETIF_F_SG since no checksum feature.
>
> On Wed, 11 Oct 2006 21:11:38 +0100
> Steven Whitehouse <steve@chygwyn.com> wrote:
>
> > Hi,
> >
> > On Wed, Oct 11, 2006 at 05:01:03PM +0200, Michael S. Tsirkin wrote:
> > > Quoting Steven Whitehouse <steve@chygwyn.com>:
> > > > > ssize_t tcp_sendpage(struct socket *sock, struct page *page, int offset,
> > > > > size_t size, int flags)
> > > > > {
> > > > > ssize_t res;
> > > > > struct sock *sk = sock->sk;
> > > > >
> > > > > if (!(sk->sk_route_caps & NETIF_F_SG) ||
> > > > > !(sk->sk_route_caps & NETIF_F_ALL_CSUM))
> > > > > return sock_no_sendpage(sock, page, offset, size, flags);
> > > > >
> > > > >
> > > > > So, it seems that if I set NETIF_F_SG but clear NETIF_F_ALL_CSUM,
> > > > > data will be copied over rather than sent directly.
> > > > > So why does dev.c have to force set NETIF_F_SG to off then?
> > > > >
> > > > I agree with that analysis,
> > >
> > > So, would you Ack something like the following then?
> > >
> >
> > In so far as I'm able to ack it, then yes, but with the following
> > caveats: that you also need to look at the tcp code's checks for
> > NETIF_F_SG (aside from the interface to tcp_sendpage which I think
> > we've agreed is ok) and ensure that this patch will not change their
> > behaviour, and here I'm thinking of the test in net/ipv4/tcp.c:select_size()
> > in particular - there may be others but thats the only one I can think
> > of off the top of my head. I think this is what davem was getting at
> > with his comment about copy & sum for smaller packets.
> >
> > Also all subject to approval by davem and shemminger of course :-)
> >
> > My general feeling is that devices should advertise the features that
> > they actually have and that the protocols should make the decision
> > as to which ones to use or not depending on the combinations available
> > (which I think is pretty much your argument).
> >
> > Steve.
> >
>
> You might want to try ignoring the check in dev.c and testing
> to see if there is a performance gain. It wouldn't be hard to test
> a modified version and validate the performance change.
Yes. With my patch, there is a huge performance gain by increasing MTU to 64K.
And it seems the only way to do this is by S/G.
> You could even do what I suggested and use skb_checksum_help()
> to do inplace checksumming, as a performance test.
I can. But as network algorithmics says (chapter 5)
"Since such bus reads are expensive, the CPU might as well piggyback
the checksum computation with the copy process".
It speaks about onboard the adapter buffers, but memory bus reads are also much slower
than CPU nowdays. So I think even if this works well in benchmark in real life
single copy should better.
--
MST
^ permalink raw reply [flat|nested] 54+ messages in thread* Re: Dropping NETIF_F_SG since no checksum feature.
@ 2006-10-11 21:23 ` Michael S. Tsirkin
0 siblings, 0 replies; 54+ messages in thread
From: Michael S. Tsirkin @ 2006-10-11 21:23 UTC (permalink / raw)
To: Stephen Hemminger
Cc: netdev, linux-kernel, openib-general, Steven Whitehouse,
David Miller
Quoting r. Stephen Hemminger <shemminger@osdl.org>:
> Subject: Re: Dropping NETIF_F_SG since no checksum feature.
>
> On Wed, 11 Oct 2006 21:11:38 +0100
> Steven Whitehouse <steve@chygwyn.com> wrote:
>
> > Hi,
> >
> > On Wed, Oct 11, 2006 at 05:01:03PM +0200, Michael S. Tsirkin wrote:
> > > Quoting Steven Whitehouse <steve@chygwyn.com>:
> > > > > ssize_t tcp_sendpage(struct socket *sock, struct page *page, int offset,
> > > > > size_t size, int flags)
> > > > > {
> > > > > ssize_t res;
> > > > > struct sock *sk = sock->sk;
> > > > >
> > > > > if (!(sk->sk_route_caps & NETIF_F_SG) ||
> > > > > !(sk->sk_route_caps & NETIF_F_ALL_CSUM))
> > > > > return sock_no_sendpage(sock, page, offset, size, flags);
> > > > >
> > > > >
> > > > > So, it seems that if I set NETIF_F_SG but clear NETIF_F_ALL_CSUM,
> > > > > data will be copied over rather than sent directly.
> > > > > So why does dev.c have to force set NETIF_F_SG to off then?
> > > > >
> > > > I agree with that analysis,
> > >
> > > So, would you Ack something like the following then?
> > >
> >
> > In so far as I'm able to ack it, then yes, but with the following
> > caveats: that you also need to look at the tcp code's checks for
> > NETIF_F_SG (aside from the interface to tcp_sendpage which I think
> > we've agreed is ok) and ensure that this patch will not change their
> > behaviour, and here I'm thinking of the test in net/ipv4/tcp.c:select_size()
> > in particular - there may be others but thats the only one I can think
> > of off the top of my head. I think this is what davem was getting at
> > with his comment about copy & sum for smaller packets.
> >
> > Also all subject to approval by davem and shemminger of course :-)
> >
> > My general feeling is that devices should advertise the features that
> > they actually have and that the protocols should make the decision
> > as to which ones to use or not depending on the combinations available
> > (which I think is pretty much your argument).
> >
> > Steve.
> >
>
> You might want to try ignoring the check in dev.c and testing
> to see if there is a performance gain. It wouldn't be hard to test
> a modified version and validate the performance change.
Yes. With my patch, there is a huge performance gain by increasing MTU to 64K.
And it seems the only way to do this is by S/G.
> You could even do what I suggested and use skb_checksum_help()
> to do inplace checksumming, as a performance test.
I can. But as network algorithmics says (chapter 5)
"Since such bus reads are expensive, the CPU might as well piggyback
the checksum computation with the copy process".
It speaks about onboard the adapter buffers, but memory bus reads are also much slower
than CPU nowdays. So I think even if this works well in benchmark in real life
single copy should better.
--
MST
^ permalink raw reply [flat|nested] 54+ messages in thread* Re: Dropping NETIF_F_SG since no checksum feature.
2006-10-11 21:23 ` Michael S. Tsirkin
(?)
@ 2006-10-11 21:29 ` Stephen Hemminger
2006-10-11 21:42 ` Michael S. Tsirkin
-1 siblings, 1 reply; 54+ messages in thread
From: Stephen Hemminger @ 2006-10-11 21:29 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Steven Whitehouse, David Miller, linux-kernel, netdev,
openib-general, rolandd
O
> >
> > You might want to try ignoring the check in dev.c and testing
> > to see if there is a performance gain. It wouldn't be hard to test
> > a modified version and validate the performance change.
>
> Yes. With my patch, there is a huge performance gain by increasing MTU to 64K.
> And it seems the only way to do this is by S/G.
>
> > You could even do what I suggested and use skb_checksum_help()
> > to do inplace checksumming, as a performance test.
>
> I can. But as network algorithmics says (chapter 5)
> "Since such bus reads are expensive, the CPU might as well piggyback
> the checksum computation with the copy process".
>
> It speaks about onboard the adapter buffers, but memory bus reads are also much slower
> than CPU nowdays. So I think even if this works well in benchmark in real life
> single copy should better.
>
The other alternative might be to make copy/checksum code smarter about using
fragments rather than allocating a large buffer. It should avoid second order
allocations (effective size > PAGESIZE).
--
Stephen Hemminger <shemminger@osdl.org>
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: Dropping NETIF_F_SG since no checksum feature.
2006-10-11 21:29 ` Stephen Hemminger
@ 2006-10-11 21:42 ` Michael S. Tsirkin
0 siblings, 0 replies; 54+ messages in thread
From: Michael S. Tsirkin @ 2006-10-11 21:42 UTC (permalink / raw)
To: Stephen Hemminger
Cc: Steven Whitehouse, David Miller, linux-kernel, netdev,
openib-general, rolandd
Quoting r. Stephen Hemminger <shemminger@osdl.org>:
> Subject: Re: Dropping NETIF_F_SG since no checksum feature.
>
> O
> > >
> > > You might want to try ignoring the check in dev.c and testing
> > > to see if there is a performance gain. It wouldn't be hard to test
> > > a modified version and validate the performance change.
> >
> > Yes. With my patch, there is a huge performance gain by increasing MTU to 64K.
> > And it seems the only way to do this is by S/G.
> >
> > > You could even do what I suggested and use skb_checksum_help()
> > > to do inplace checksumming, as a performance test.
> >
> > I can. But as network algorithmics says (chapter 5)
> > "Since such bus reads are expensive, the CPU might as well piggyback
> > the checksum computation with the copy process".
> >
> > It speaks about onboard the adapter buffers, but memory bus reads are also much slower
> > than CPU nowdays. So I think even if this works well in benchmark in real life
> > single copy should better.
> >
>
> The other alternative might be to make copy/checksum code smarter about using
> fragments rather than allocating a large buffer. It should avoid second order
> allocations (effective size > PAGESIZE).
In my testing, it seems quite smart already - once I declare F_SG and clear F_...CSUM
I start getting properly checksummed packets with lots of s/g fragments.
But I'm not catching the drift - an alternative to what this would be?
--
MST
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: Dropping NETIF_F_SG since no checksum feature.
2006-10-11 21:23 ` Michael S. Tsirkin
(?)
(?)
@ 2006-10-11 21:41 ` David Miller
2006-10-12 19:12 ` Michael S. Tsirkin
-1 siblings, 1 reply; 54+ messages in thread
From: David Miller @ 2006-10-11 21:41 UTC (permalink / raw)
To: mst; +Cc: shemminger, steve, linux-kernel, netdev, openib-general, rolandd
From: "Michael S. Tsirkin" <mst@mellanox.co.il>
Date: Wed, 11 Oct 2006 23:23:39 +0200
> With my patch, there is a huge performance gain by increasing MTU to 64K.
> And it seems the only way to do this is by S/G.
Numbers?
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: Dropping NETIF_F_SG since no checksum feature.
2006-10-11 21:41 ` David Miller
@ 2006-10-12 19:12 ` Michael S. Tsirkin
0 siblings, 0 replies; 54+ messages in thread
From: Michael S. Tsirkin @ 2006-10-12 19:12 UTC (permalink / raw)
To: David Miller
Cc: shemminger, steve, linux-kernel, netdev, openib-general, rolandd
Quoting r. David Miller <davem@davemloft.net>:
> Subject: Re: Dropping NETIF_F_SG since no checksum feature.
>
> From: "Michael S. Tsirkin" <mst@mellanox.co.il>
> Date: Wed, 11 Oct 2006 23:23:39 +0200
>
> > With my patch, there is a huge performance gain by increasing MTU to 64K.
> > And it seems the only way to do this is by S/G.
>
> Numbers?
>
I created two subnets on top of the same pair infiniband HCAs:
root@sw069 ~]# ifconfig ib0
ib0 Link encap:UNSPEC HWaddr
00-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
inet addr:12.4.3.69 Bcast:12.255.255.255 Mask:255.0.0.0
inet6 addr: fe80::202:c902:20:ee45/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1
RX packets:1382531 errors:0 dropped:0 overruns:0 frame:0
TX packets:2725206 errors:0 dropped:5 overruns:0 carrier:0
collisions:0 txqueuelen:128
RX bytes:71892772 (68.5 MiB) TX bytes:5290011992 (4.9 GiB)
[root@sw069 ~]# ifconfig ibc0
ibc0 Link encap:UNSPEC HWaddr
00-03-04-06-FE-80-00-00-00-00-00-00-00-00-00-00
inet addr:11.4.3.69 Bcast:11.255.255.255 Mask:255.0.0.0
inet6 addr: fe80::202:c902:20:ee45/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:65484 Metric:1
RX packets:115647 errors:0 dropped:0 overruns:0 frame:0
TX packets:253403 errors:0 dropped:4 overruns:0 carrier:0
collisions:0 txqueuelen:128
RX bytes:6014720 (5.7 MiB) TX bytes:16589589008 (15.4 GiB)
The other side was configured with 12.4.3.68 for MTU 65484
and 11.4.3.68 for MTU 2044.
And then I just run netperf:
[root@sw069 ~]#
[root@sw069 ~]# /mswg/work/mst/netperf-2.4.2/src/netperf -f M -H 12.4.3.68 -c -C
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 12.4.3.68 (12.4.3.68)
port 0 AF_INET
Recv Send Send Utilization Service Demand
Socket Socket Message Elapsed Send Recv Send Recv
Size Size Size Time Throughput local remote local remote
bytes bytes bytes secs. MBytes /s % S % S us/KB us/KB
87380 16384 16384 10.00 286.45 40.20 25.28 5.482 3.448
[root@sw069 ~]# /mswg/work/mst/netperf-2.4.2/src/netperf -f M -H 11.4.3.68
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 11.4.3.68 (11.4.3.68)
port 0 AF_INET
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. MBytes/sec
87380 16384 16384 10.01 782.55
This is all very preliminary - but I hope you get the idea -
increasing MTU is very helpful for infiniband, and infiniband adapters
handle large S/G lists without problems, but the verbs
do not include support for IP checksums, so these must be done in software.
So what we would like, is for the infiniband network device to say
"I don't support checksums, I only support S/G" and then for
network layer to do the checksumming for us piggybacking on data copy
at least for cases where it does perform the copy.
Does this makes sense now?
--
MST
^ permalink raw reply [flat|nested] 54+ messages in thread* Re: Dropping NETIF_F_SG since no checksum feature.
@ 2006-10-12 19:12 ` Michael S. Tsirkin
0 siblings, 0 replies; 54+ messages in thread
From: Michael S. Tsirkin @ 2006-10-12 19:12 UTC (permalink / raw)
To: David Miller; +Cc: netdev, linux-kernel, openib-general, steve, shemminger
Quoting r. David Miller <davem@davemloft.net>:
> Subject: Re: Dropping NETIF_F_SG since no checksum feature.
>
> From: "Michael S. Tsirkin" <mst@mellanox.co.il>
> Date: Wed, 11 Oct 2006 23:23:39 +0200
>
> > With my patch, there is a huge performance gain by increasing MTU to 64K.
> > And it seems the only way to do this is by S/G.
>
> Numbers?
>
I created two subnets on top of the same pair infiniband HCAs:
root@sw069 ~]# ifconfig ib0
ib0 Link encap:UNSPEC HWaddr
00-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
inet addr:12.4.3.69 Bcast:12.255.255.255 Mask:255.0.0.0
inet6 addr: fe80::202:c902:20:ee45/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1
RX packets:1382531 errors:0 dropped:0 overruns:0 frame:0
TX packets:2725206 errors:0 dropped:5 overruns:0 carrier:0
collisions:0 txqueuelen:128
RX bytes:71892772 (68.5 MiB) TX bytes:5290011992 (4.9 GiB)
[root@sw069 ~]# ifconfig ibc0
ibc0 Link encap:UNSPEC HWaddr
00-03-04-06-FE-80-00-00-00-00-00-00-00-00-00-00
inet addr:11.4.3.69 Bcast:11.255.255.255 Mask:255.0.0.0
inet6 addr: fe80::202:c902:20:ee45/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:65484 Metric:1
RX packets:115647 errors:0 dropped:0 overruns:0 frame:0
TX packets:253403 errors:0 dropped:4 overruns:0 carrier:0
collisions:0 txqueuelen:128
RX bytes:6014720 (5.7 MiB) TX bytes:16589589008 (15.4 GiB)
The other side was configured with 12.4.3.68 for MTU 65484
and 11.4.3.68 for MTU 2044.
And then I just run netperf:
[root@sw069 ~]#
[root@sw069 ~]# /mswg/work/mst/netperf-2.4.2/src/netperf -f M -H 12.4.3.68 -c -C
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 12.4.3.68 (12.4.3.68)
port 0 AF_INET
Recv Send Send Utilization Service Demand
Socket Socket Message Elapsed Send Recv Send Recv
Size Size Size Time Throughput local remote local remote
bytes bytes bytes secs. MBytes /s % S % S us/KB us/KB
87380 16384 16384 10.00 286.45 40.20 25.28 5.482 3.448
[root@sw069 ~]# /mswg/work/mst/netperf-2.4.2/src/netperf -f M -H 11.4.3.68
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 11.4.3.68 (11.4.3.68)
port 0 AF_INET
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. MBytes/sec
87380 16384 16384 10.01 782.55
This is all very preliminary - but I hope you get the idea -
increasing MTU is very helpful for infiniband, and infiniband adapters
handle large S/G lists without problems, but the verbs
do not include support for IP checksums, so these must be done in software.
So what we would like, is for the infiniband network device to say
"I don't support checksums, I only support S/G" and then for
network layer to do the checksumming for us piggybacking on data copy
at least for cases where it does perform the copy.
Does this makes sense now?
--
MST
^ permalink raw reply [flat|nested] 54+ messages in thread* Re: Dropping NETIF_F_SG since no checksum feature.
2006-10-12 19:12 ` Michael S. Tsirkin
(?)
@ 2006-10-13 4:22 ` David Miller
2006-10-13 6:17 ` Michael S. Tsirkin
-1 siblings, 1 reply; 54+ messages in thread
From: David Miller @ 2006-10-13 4:22 UTC (permalink / raw)
To: mst; +Cc: shemminger, steve, linux-kernel, netdev, openib-general, rolandd
From: "Michael S. Tsirkin" <mst@mellanox.co.il>
Date: Thu, 12 Oct 2006 21:12:06 +0200
> Quoting r. David Miller <davem@davemloft.net>:
> > Subject: Re: Dropping NETIF_F_SG since no checksum feature.
> >
> > Numbers?
>
> I created two subnets on top of the same pair infiniband HCAs:
I was asking for SG vs. non-SG numbers so I could see proof
that it really does help like you say it will.
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: Dropping NETIF_F_SG since no checksum feature.
2006-10-13 4:22 ` David Miller
@ 2006-10-13 6:17 ` Michael S. Tsirkin
0 siblings, 0 replies; 54+ messages in thread
From: Michael S. Tsirkin @ 2006-10-13 6:17 UTC (permalink / raw)
To: David Miller
Cc: shemminger, steve, linux-kernel, netdev, openib-general, rolandd
Quoting r. David Miller <davem@davemloft.net>:
> Subject: Re: Dropping NETIF_F_SG since no checksum feature.
>
> From: "Michael S. Tsirkin" <mst@mellanox.co.il>
> Date: Thu, 12 Oct 2006 21:12:06 +0200
>
> > Quoting r. David Miller <davem@davemloft.net>:
> > > Subject: Re: Dropping NETIF_F_SG since no checksum feature.
> > >
> > > Numbers?
> >
> > I created two subnets on top of the same pair infiniband HCAs:
>
> I was asking for SG vs. non-SG numbers so I could see proof
> that it really does help like you say it will.
>
Dave, thanks for the clarification.
Please note that ib0 is a non-SG device with MTU 2K,
sorry that I forgot to mention that.
so, to summarize my previous mail:
interface flags mtu bandwidth
ib0 linear(0) 2044 286.45
ibc0 _F_SG 65484 782.55
If I will set both ib0 and ibc0 to 64K MTU, then
benchmark-mode with the same MTU SG is somewhat slower than non-SG
(I tested this at some point, by some 10%, don't have the numbers at the moment -
do you want to see them?). I did not claim it is faster to do SG with same MTU
and it is I think clear why linear should be faster for copy *with the same MTU*.
But do you really think that we will be able to allocate
even a single 64K linear skb after the machine has been active for a while?
My assumption is that if I want to reliably get MTU > PAGE_SIZE I must support SG.
Is it the wrong one?
If this assumption is correct, then below is my line of thinking:
- with infiniband we provably get a 2.5x speedup with MTU of 64K vs to 2K.
- to get packets of that size reliably we must declare S/G support
- infiniband verbs do not support IP checksumming
- per network algorithmics, it is better to piggyback checksum calculation
on copying if copying takes place
For this reason, I would like to define the meaning of S/G set when checksum
bits are all clear as "we support S/G but not checksum, please checksum
for us if you copy data anyway". Alternatively, add a new
NETIF_F_??_CSUM bit to mean this capability.
Does this make sense?
Thanks,
--
MST
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: Dropping NETIF_F_SG since no checksum feature.
2006-10-11 15:01 ` Michael S. Tsirkin
@ 2006-10-11 20:52 ` David Miller
2006-10-11 20:52 ` David Miller
1 sibling, 0 replies; 54+ messages in thread
From: David Miller @ 2006-10-11 20:52 UTC (permalink / raw)
To: mst; +Cc: steve, shemminger, linux-kernel, netdev, openib-general, rolandd
From: "Michael S. Tsirkin" <mst@mellanox.co.il>
Date: Wed, 11 Oct 2006 17:01:03 +0200
> Quoting Steven Whitehouse <steve@chygwyn.com>:
> > > ssize_t tcp_sendpage(struct socket *sock, struct page *page, int offset,
> > > size_t size, int flags)
> > > {
> > > ssize_t res;
> > > struct sock *sk = sock->sk;
> > >
> > > if (!(sk->sk_route_caps & NETIF_F_SG) ||
> > > !(sk->sk_route_caps & NETIF_F_ALL_CSUM))
> > > return sock_no_sendpage(sock, page, offset, size, flags);
> > >
> > >
> > > So, it seems that if I set NETIF_F_SG but clear NETIF_F_ALL_CSUM,
> > > data will be copied over rather than sent directly.
> > > So why does dev.c have to force set NETIF_F_SG to off then?
> > >
> > I agree with that analysis,
>
> So, would you Ack something like the following then?
I certainly don't.
^ permalink raw reply [flat|nested] 54+ messages in thread* Re: Dropping NETIF_F_SG since no checksum feature.
@ 2006-10-11 20:52 ` David Miller
0 siblings, 0 replies; 54+ messages in thread
From: David Miller @ 2006-10-11 20:52 UTC (permalink / raw)
To: mst; +Cc: netdev, linux-kernel, openib-general, steve, shemminger
From: "Michael S. Tsirkin" <mst@mellanox.co.il>
Date: Wed, 11 Oct 2006 17:01:03 +0200
> Quoting Steven Whitehouse <steve@chygwyn.com>:
> > > ssize_t tcp_sendpage(struct socket *sock, struct page *page, int offset,
> > > size_t size, int flags)
> > > {
> > > ssize_t res;
> > > struct sock *sk = sock->sk;
> > >
> > > if (!(sk->sk_route_caps & NETIF_F_SG) ||
> > > !(sk->sk_route_caps & NETIF_F_ALL_CSUM))
> > > return sock_no_sendpage(sock, page, offset, size, flags);
> > >
> > >
> > > So, it seems that if I set NETIF_F_SG but clear NETIF_F_ALL_CSUM,
> > > data will be copied over rather than sent directly.
> > > So why does dev.c have to force set NETIF_F_SG to off then?
> > >
> > I agree with that analysis,
>
> So, would you Ack something like the following then?
I certainly don't.
^ permalink raw reply [flat|nested] 54+ messages in thread* Re: Dropping NETIF_F_SG since no checksum feature.
2006-10-11 20:52 ` David Miller
@ 2006-10-11 21:11 ` Michael S. Tsirkin
-1 siblings, 0 replies; 54+ messages in thread
From: Michael S. Tsirkin @ 2006-10-11 21:11 UTC (permalink / raw)
To: David Miller
Cc: steve, shemminger, linux-kernel, netdev, openib-general, rolandd
Quoting r. David Miller <davem@davemloft.net>:
> Subject: Re: Dropping NETIF_F_SG since no checksum feature.
>
> From: "Michael S. Tsirkin" <mst@mellanox.co.il>
> Date: Wed, 11 Oct 2006 17:01:03 +0200
>
> > Quoting Steven Whitehouse <steve@chygwyn.com>:
> > > > ssize_t tcp_sendpage(struct socket *sock, struct page *page, int offset,
> > > > size_t size, int flags)
> > > > {
> > > > ssize_t res;
> > > > struct sock *sk = sock->sk;
> > > >
> > > > if (!(sk->sk_route_caps & NETIF_F_SG) ||
> > > > !(sk->sk_route_caps & NETIF_F_ALL_CSUM))
> > > > return sock_no_sendpage(sock, page, offset, size, flags);
> > > >
> > > >
> > > > So, it seems that if I set NETIF_F_SG but clear NETIF_F_ALL_CSUM,
> > > > data will be copied over rather than sent directly.
> > > > So why does dev.c have to force set NETIF_F_SG to off then?
> > > >
> > > I agree with that analysis,
> >
> > So, would you Ack something like the following then?
>
> I certainly don't.
>
Dave, sorry, you lost me.
Would a new flag NETIF_F_SW_CSUM be better, to tell network core that device
computes checksums in software, so we should piggyback the checksum
computation with the copy process if possible?
Or is there some other issue?
--
MST
^ permalink raw reply [flat|nested] 54+ messages in thread* Re: Dropping NETIF_F_SG since no checksum feature.
@ 2006-10-11 21:11 ` Michael S. Tsirkin
0 siblings, 0 replies; 54+ messages in thread
From: Michael S. Tsirkin @ 2006-10-11 21:11 UTC (permalink / raw)
To: David Miller; +Cc: netdev, linux-kernel, openib-general, steve, shemminger
Quoting r. David Miller <davem@davemloft.net>:
> Subject: Re: Dropping NETIF_F_SG since no checksum feature.
>
> From: "Michael S. Tsirkin" <mst@mellanox.co.il>
> Date: Wed, 11 Oct 2006 17:01:03 +0200
>
> > Quoting Steven Whitehouse <steve@chygwyn.com>:
> > > > ssize_t tcp_sendpage(struct socket *sock, struct page *page, int offset,
> > > > size_t size, int flags)
> > > > {
> > > > ssize_t res;
> > > > struct sock *sk = sock->sk;
> > > >
> > > > if (!(sk->sk_route_caps & NETIF_F_SG) ||
> > > > !(sk->sk_route_caps & NETIF_F_ALL_CSUM))
> > > > return sock_no_sendpage(sock, page, offset, size, flags);
> > > >
> > > >
> > > > So, it seems that if I set NETIF_F_SG but clear NETIF_F_ALL_CSUM,
> > > > data will be copied over rather than sent directly.
> > > > So why does dev.c have to force set NETIF_F_SG to off then?
> > > >
> > > I agree with that analysis,
> >
> > So, would you Ack something like the following then?
>
> I certainly don't.
>
Dave, sorry, you lost me.
Would a new flag NETIF_F_SW_CSUM be better, to tell network core that device
computes checksums in software, so we should piggyback the checksum
computation with the copy process if possible?
Or is there some other issue?
--
MST
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: Dropping NETIF_F_SG since no checksum feature.
2006-10-11 9:05 ` Michael S. Tsirkin
(?)
(?)
@ 2006-10-11 9:20 ` David Miller
2006-10-11 9:46 ` Michael S. Tsirkin
2006-10-11 13:11 ` [RFC] Question about potential problem in net/ipv4/route.c Eric Dumazet
-1 siblings, 2 replies; 54+ messages in thread
From: David Miller @ 2006-10-11 9:20 UTC (permalink / raw)
To: mst; +Cc: shemminger, linux-kernel, netdev, openib-general, rolandd
From: "Michael S. Tsirkin" <mst@mellanox.co.il>
Date: Wed, 11 Oct 2006 11:05:04 +0200
> So, it seems that if I set NETIF_F_SG but clear NETIF_F_ALL_CSUM,
> data will be copied over rather than sent directly.
> So why does dev.c have to force set NETIF_F_SG to off then?
Because it's more efficient to copy into a linear destination
buffer of an SKB than page sub-chunks when doing checksum+copy.
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: Dropping NETIF_F_SG since no checksum feature.
2006-10-11 9:20 ` David Miller
@ 2006-10-11 9:46 ` Michael S. Tsirkin
2006-10-11 18:21 ` [openib-general] " Michael Krause
2006-10-11 13:11 ` [RFC] Question about potential problem in net/ipv4/route.c Eric Dumazet
1 sibling, 1 reply; 54+ messages in thread
From: Michael S. Tsirkin @ 2006-10-11 9:46 UTC (permalink / raw)
To: David Miller; +Cc: shemminger, linux-kernel, netdev, openib-general, rolandd
Quoting r. David Miller <davem@davemloft.net>:
> Subject: Re: Dropping NETIF_F_SG since no checksum feature.
>
> From: "Michael S. Tsirkin" <mst@mellanox.co.il>
> Date: Wed, 11 Oct 2006 11:05:04 +0200
>
> > So, it seems that if I set NETIF_F_SG but clear NETIF_F_ALL_CSUM,
> > data will be copied over rather than sent directly.
> > So why does dev.c have to force set NETIF_F_SG to off then?
>
> Because it's more efficient to copy into a linear destination
> buffer of an SKB than page sub-chunks when doing checksum+copy.
>
Thanks for the explanation.
Obviously its true as long as you can allocate the skb that big.
I think you won't realistically be able to get 64K in a
linear SKB on a busy system, though, is not that right?
OTOH, having large MTU (e.g. 64K) helps performance a lot since it
reduces receive side processing overhead.
So, if I understand what you are saying correctly,
things do work correctly (just slower for small skb) if NETIF_F_SG is set bug
clear, it seems that all we need to do is drop the following in dev.c:
/* Fix illegal SG+CSUM combinations. */
if ((dev->features & NETIF_F_SG) &&
!(dev->features & NETIF_F_ALL_CSUM)) {
printk(KERN_NOTICE "%s: Dropping NETIF_F_SG since no checksum feature.\n",
dev->name);
dev->features &= ~NETIF_F_SG;
}
is that right?
--
MST
^ permalink raw reply [flat|nested] 54+ messages in thread* Re: [openib-general] Dropping NETIF_F_SG since no checksum feature.
2006-10-11 9:46 ` Michael S. Tsirkin
@ 2006-10-11 18:21 ` Michael Krause
0 siblings, 0 replies; 54+ messages in thread
From: Michael Krause @ 2006-10-11 18:21 UTC (permalink / raw)
To: Michael S. Tsirkin, David Miller
Cc: netdev, openib-general, linux-kernel, shemminger
At 02:46 AM 10/11/2006, Michael S. Tsirkin wrote:
>Quoting r. David Miller <davem@davemloft.net>:
> > Subject: Re: Dropping NETIF_F_SG since no checksum feature.
> >
> > From: "Michael S. Tsirkin" <mst@mellanox.co.il>
> > Date: Wed, 11 Oct 2006 11:05:04 +0200
> >
> > > So, it seems that if I set NETIF_F_SG but clear NETIF_F_ALL_CSUM,
> > > data will be copied over rather than sent directly.
> > > So why does dev.c have to force set NETIF_F_SG to off then?
> >
> > Because it's more efficient to copy into a linear destination
> > buffer of an SKB than page sub-chunks when doing checksum+copy.
> >
>
>Thanks for the explanation.
>Obviously its true as long as you can allocate the skb that big.
>I think you won't realistically be able to get 64K in a
>linear SKB on a busy system, though, is not that right?
>
>OTOH, having large MTU (e.g. 64K) helps performance a lot since it reduces
>receive side processing overhead.
One thing to keep in mind is while it may help performance in a
micro-benchmark, the system performance or the QoS impacts to other flows
can be negatively impacted depending upon implementation. For example,
consider multiple messages interleaving (heaven help implementations that
are not able to interleave multiple messages) on either the transmit or
receive HCA / RNIC and how the time-to-completion of any message is
extended out in time as a result of the interleave. The effective
throughput in terms of useful units of work can be lower as a result. The
same effect can be observed when there are a significant number connections
in a device being simultaneously processed.
Also, if the copy-checksum is not performed on the processor where the
application resides, then the performance can also be negatively impacted
(want to have the right cache hot when initiated or concluded). While the
aggregate computational performance of systems may be increasing at a
significant rate (set aside the per core vs. aggregate core debate), the
memory performance gains are much less. If you examine the longer term
trends, there may be a flattening out of memory performance improvements by
2009/10 without some major changes in the way controllers and subsystems
are designed.
Mike
^ permalink raw reply [flat|nested] 54+ messages in thread
* [RFC] Question about potential problem in net/ipv4/route.c
2006-10-11 9:20 ` David Miller
2006-10-11 9:46 ` Michael S. Tsirkin
@ 2006-10-11 13:11 ` Eric Dumazet
2006-10-12 5:05 ` David Miller
2006-10-16 9:00 ` [PATCH] NET : Suspicious locking in reqsk_queue_hash_req() Eric Dumazet
1 sibling, 2 replies; 54+ messages in thread
From: Eric Dumazet @ 2006-10-11 13:11 UTC (permalink / raw)
To: David Miller; +Cc: netdev
Hi David
While browsing net/ipv4/route.c I discovered compare_keys() function, and a
potential bug in it.
static inline int compare_keys(struct flowi *fl1, struct flowi *fl2)
{
return memcmp(&fl1->nl_u.ip4_u, &fl2->nl_u.ip4_u,
sizeof(fl1->nl_u.ip4_u)) == 0 &&
fl1->oif == fl2->oif &&
fl1->iif == fl2->iif;
}
Using memcmp(ptr1, ptr2, sizeof(SOMEFIELD)) is dangerous because
sizeof(SOMEFIELD) can be larger than the underlying object, because of
alignment constraints.
In this case, sizeof(fl1->nl_u.ip4_u) is 16, while fl1->nl_u.ip4_u is :
struct {
__u32 daddr;
__u32 saddr;
__u32 fwmark;
__u8 tos;
__u8 scope;
} ip4_u;
So 14 bytes are really initialized, and 2 padding bytes might contain random
values...
So at the very minimum, we should avoid doing the memcmp() including those
last two bytes : It would be less bugy, and faster too... (But to get really
fast comparison, we should do some clever long/int XOR operations to avoid
many test/branches, like the optim we did in compare_ether_addr())
As shown in profiles, "repz cmpsb" is really a dog... (and its use of
esi/edi/ecx registers a high pressure for the compiler/optimizer)
Eric
^ permalink raw reply [flat|nested] 54+ messages in thread* Re: [RFC] Question about potential problem in net/ipv4/route.c
2006-10-11 13:11 ` [RFC] Question about potential problem in net/ipv4/route.c Eric Dumazet
@ 2006-10-12 5:05 ` David Miller
2006-10-12 5:31 ` Patrick McHardy
2006-10-12 5:48 ` Eric Dumazet
2006-10-16 9:00 ` [PATCH] NET : Suspicious locking in reqsk_queue_hash_req() Eric Dumazet
1 sibling, 2 replies; 54+ messages in thread
From: David Miller @ 2006-10-12 5:05 UTC (permalink / raw)
To: dada1; +Cc: netdev
From: Eric Dumazet <dada1@cosmosbay.com>
Date: Wed, 11 Oct 2006 15:11:18 +0200
> Using memcmp(ptr1, ptr2, sizeof(SOMEFIELD)) is dangerous because
> sizeof(SOMEFIELD) can be larger than the underlying object, because of
> alignment constraints.
>
> In this case, sizeof(fl1->nl_u.ip4_u) is 16, while fl1->nl_u.ip4_u is :
>
> struct {
> __u32 daddr;
> __u32 saddr;
> __u32 fwmark;
> __u8 tos;
> __u8 scope;
> } ip4_u;
>
> So 14 bytes are really initialized, and 2 padding bytes might contain random
> values...
We always explicitly initialize the flows, and even for local stack
assignment based initialization, gcc zeros out the padding bytes
always. For non-stack-local cases we do explicit memset()'s over the
entire object. So I think while not perfect, we're very much safe
here.
> fast comparison, we should do some clever long/int XOR operations to avoid
> many test/branches, like the optim we did in compare_ether_addr())
>
> As shown in profiles, "repz cmpsb" is really a dog... (and its use of
> esi/edi/ecx registers a high pressure for the compiler/optimizer)
Yes, for the fast comparisons it is almost certainly worth it to do
something saner in compare_keys().
But looking at this further, compare_keys() is only used in hotpath
situations where we are optimizing for a miss, such as during hash
insert. The optimization therefore might be less justified as a
result.
^ permalink raw reply [flat|nested] 54+ messages in thread* Re: [RFC] Question about potential problem in net/ipv4/route.c
2006-10-12 5:05 ` David Miller
@ 2006-10-12 5:31 ` Patrick McHardy
2006-10-12 5:54 ` David Miller
2006-10-12 5:48 ` Eric Dumazet
1 sibling, 1 reply; 54+ messages in thread
From: Patrick McHardy @ 2006-10-12 5:31 UTC (permalink / raw)
To: David Miller; +Cc: dada1, netdev
[-- Attachment #1: Type: text/plain, Size: 708 bytes --]
David Miller wrote:
> We always explicitly initialize the flows, and even for local stack
> assignment based initialization, gcc zeros out the padding bytes
> always.
I thought so too until I added the iptables compat functions recently
and noticed uninitialized padding of on-stack structures, which
confused iptables since it also uses memcmp.
This program demonstrates the effect, it doesn't output the expected
"1 2" but "1 4294967042" on my x86_64 (gcc-Version 4.1.2 20060901
(prerelease) (Debian 4.1.1-13)). The initialization doesn't touch
the padding bytes:
0x0000000000400494 <test+8>: movl $0x1,0xfffffffffffffff0(%rbp)
0x000000000040049b <test+15>: movb $0x2,0xfffffffffffffff4(%rbp)
[-- Attachment #2: x.c --]
[-- Type: text/x-csrc, Size: 381 bytes --]
#include <stdio.h>
struct x1 {
unsigned int x;
char y;
};
struct x2 {
unsigned int x;
unsigned int y;
};
void pollute(void)
{
struct x2 x = {
.x = ~0,
.y = ~0,
};
}
void test(void)
{
struct x1 x1 = {
.x = 1,
.y = 2,
};
struct x2 *x2 = (struct x2 *)&x1;
printf("%u %u\n", x2->x, x2->y);
}
int main(int argc, char **argv)
{
pollute();
test();
return 0;
}
^ permalink raw reply [flat|nested] 54+ messages in thread* Re: [RFC] Question about potential problem in net/ipv4/route.c
2006-10-12 5:31 ` Patrick McHardy
@ 2006-10-12 5:54 ` David Miller
0 siblings, 0 replies; 54+ messages in thread
From: David Miller @ 2006-10-12 5:54 UTC (permalink / raw)
To: kaber; +Cc: dada1, netdev
From: Patrick McHardy <kaber@trash.net>
Date: Thu, 12 Oct 2006 07:31:12 +0200
> This program demonstrates the effect, it doesn't output the expected
> "1 2" but "1 4294967042" on my x86_64 (gcc-Version 4.1.2 20060901
> (prerelease) (Debian 4.1.1-13)). The initialization doesn't touch
> the padding bytes:
>
> 0x0000000000400494 <test+8>: movl $0x1,0xfffffffffffffff0(%rbp)
> 0x000000000040049b <test+15>: movb $0x2,0xfffffffffffffff4(%rbp)
Crap, ok we need to fix this.
I even looked at the assembler for a little test foo.c bit
of code... must have misread the sparc64 asm :)
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [RFC] Question about potential problem in net/ipv4/route.c
2006-10-12 5:05 ` David Miller
2006-10-12 5:31 ` Patrick McHardy
@ 2006-10-12 5:48 ` Eric Dumazet
2006-10-12 6:02 ` David Miller
1 sibling, 1 reply; 54+ messages in thread
From: Eric Dumazet @ 2006-10-12 5:48 UTC (permalink / raw)
To: David Miller; +Cc: netdev
David Miller a écrit :
> From: Eric Dumazet <dada1@cosmosbay.com>
> Date: Wed, 11 Oct 2006 15:11:18 +0200
>
>> Using memcmp(ptr1, ptr2, sizeof(SOMEFIELD)) is dangerous because
>> sizeof(SOMEFIELD) can be larger than the underlying object, because of
>> alignment constraints.
>>
>> In this case, sizeof(fl1->nl_u.ip4_u) is 16, while fl1->nl_u.ip4_u is :
>>
>> struct {
>> __u32 daddr;
>> __u32 saddr;
>> __u32 fwmark;
>> __u8 tos;
>> __u8 scope;
>> } ip4_u;
>>
>> So 14 bytes are really initialized, and 2 padding bytes might contain random
>> values...
>
> We always explicitly initialize the flows, and even for local stack
> assignment based initialization, gcc zeros out the padding bytes
> always. For non-stack-local cases we do explicit memset()'s over the
> entire object. So I think while not perfect, we're very much safe
> here.
>
Not on my gcc here (gcc version 3.4.4) : It wont zeros out the padding bytes
# cat bug.c
struct s1 {
long d;
char c;
};
void bar()
{
struct s1 s = { .d = 123, .c = 'a'};
foo(&s, sizeof(s));
}
# gcc -O2 -S bug.c
# more bug.s
.globl bar
.type bar, @function
bar:
.LFB2:
subq $24, %rsp
.LCFI0:
movl $16, %esi
xorl %eax, %eax
movq %rsp, %rdi
movq $123, (%rsp)
movb $97, 8(%rsp)
call foo
addq $24, %rsp
ret
So 9(%rsp) -> 15(%rsp) are not initialized
Same on more recent gcc (4.1.1)
>> fast comparison, we should do some clever long/int XOR operations to avoid
>> many test/branches, like the optim we did in compare_ether_addr())
>>
>> As shown in profiles, "repz cmpsb" is really a dog... (and its use of
>> esi/edi/ecx registers a high pressure for the compiler/optimizer)
>
> Yes, for the fast comparisons it is almost certainly worth it to do
> something saner in compare_keys().
>
> But looking at this further, compare_keys() is only used in hotpath
> situations where we are optimizing for a miss, such as during hash
> insert. The optimization therefore might be less justified as a
> result.
Well, on this machine I have these oprofile numbers :
<rt_intern_hash>: /* rt_intern_hash total: 993464 0.3619 */
31751 0.0116 :ffffffff803e8d26: repz cmpsb %es:(%rdi),%ds:(%rsi)
433438 0.1579 :ffffffff803e8d28: jne ffffffff803e8f80 <rt_intern_hash+0x310>
Eric
^ permalink raw reply [flat|nested] 54+ messages in thread* Re: [RFC] Question about potential problem in net/ipv4/route.c
2006-10-12 5:48 ` Eric Dumazet
@ 2006-10-12 6:02 ` David Miller
2006-10-12 6:10 ` Patrick McHardy
2006-10-12 6:35 ` Eric Dumazet
0 siblings, 2 replies; 54+ messages in thread
From: David Miller @ 2006-10-12 6:02 UTC (permalink / raw)
To: dada1; +Cc: netdev
From: Eric Dumazet <dada1@cosmosbay.com>
Date: Thu, 12 Oct 2006 07:48:20 +0200
> Not on my gcc here (gcc version 3.4.4) : It wont zeros out the padding bytes
Patrick just proved this too :)
> Well, on this machine I have these oprofile numbers :
>
> <rt_intern_hash>: /* rt_intern_hash total: 993464 0.3619 */
>
> 31751 0.0116 :ffffffff803e8d26: repz cmpsb %es:(%rdi),%ds:(%rsi)
> 433438 0.1579 :ffffffff803e8d28: jne ffffffff803e8f80 <rt_intern_hash+0x310>
Indeed, numbers talk bullshit walks :)
How about something like this as a start?
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index c41ddba..4d4b1bd 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -566,9 +566,13 @@ static inline u32 rt_score(struct rtable
static inline int compare_keys(struct flowi *fl1, struct flowi *fl2)
{
- return memcmp(&fl1->nl_u.ip4_u, &fl2->nl_u.ip4_u, sizeof(fl1->nl_u.ip4_u)) == 0 &&
- fl1->oif == fl2->oif &&
- fl1->iif == fl2->iif;
+ return ((fl1->nl_u.ip4_u.daddr ^ fl2->nl_u.ip4_u.daddr) |
+ (fl1->nl_u.ip4_u.saddr ^ fl2->nl_u.ip4_u.saddr) |
+ (fl1->nl_u.ip4_u.fwmark ^ fl2->nl_u.ip4_u.fwmark) |
+ (*(u16 *)&fl1->nl_u.ip4_u.tos ^
+ *(u16 *)&fl2->nl_u.ip4_u.tos) |
+ (fl1->oif ^ fl2->oif) |
+ (fl1->iif ^ fl2->iif)) == 0;
}
#ifdef CONFIG_IP_ROUTE_MULTIPATH_CACHED
^ permalink raw reply related [flat|nested] 54+ messages in thread* Re: [RFC] Question about potential problem in net/ipv4/route.c
2006-10-12 6:02 ` David Miller
@ 2006-10-12 6:10 ` Patrick McHardy
2006-10-12 6:25 ` David Miller
2006-10-12 6:35 ` Eric Dumazet
1 sibling, 1 reply; 54+ messages in thread
From: Patrick McHardy @ 2006-10-12 6:10 UTC (permalink / raw)
To: David Miller; +Cc: dada1, netdev
David Miller wrote:
> Indeed, numbers talk bullshit walks :)
:)
> How about something like this as a start?
>
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
The same problem is also present in dn_route.c (3 uninitialized
bytes in dn_u).
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [RFC] Question about potential problem in net/ipv4/route.c
2006-10-12 6:10 ` Patrick McHardy
@ 2006-10-12 6:25 ` David Miller
0 siblings, 0 replies; 54+ messages in thread
From: David Miller @ 2006-10-12 6:25 UTC (permalink / raw)
To: kaber; +Cc: dada1, netdev
From: Patrick McHardy <kaber@trash.net>
Date: Thu, 12 Oct 2006 08:10:43 +0200
> David Miller wrote:
> > Indeed, numbers talk bullshit walks :)
>
> :)
>
> > How about something like this as a start?
> >
> > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
>
> The same problem is also present in dn_route.c (3 uninitialized
> bytes in dn_u).
This should take care of that case too:
diff --git a/net/decnet/dn_route.c b/net/decnet/dn_route.c
index dd0761e..33ccc56 100644
--- a/net/decnet/dn_route.c
+++ b/net/decnet/dn_route.c
@@ -267,9 +267,12 @@ static void dn_dst_link_failure(struct s
static inline int compare_keys(struct flowi *fl1, struct flowi *fl2)
{
- return memcmp(&fl1->nl_u.dn_u, &fl2->nl_u.dn_u, sizeof(fl1->nl_u.dn_u)) == 0 &&
- fl1->oif == fl2->oif &&
- fl1->iif == fl2->iif;
+ return ((fl1->nl_u.dn_u.daddr ^ fl2->nl_u.dn_u.daddr) |
+ (fl1->nl_u.dn_u.saddr ^ fl2->nl_u.dn_u.saddr) |
+ (fl1->nl_u.dn_u.fwmark ^ fl2->nl_u.dn_u.fwmark) |
+ (fl1->nl_u.dn_u.scope ^ fl2->nl_u.dn_u.scope) |
+ (fl1->oif ^ fl2->oif) |
+ (fl1->iif ^ fl2->iif)) == 0;
}
static int dn_insert_route(struct dn_route *rt, unsigned hash, struct dn_route **rp)
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index c41ddba..4d4b1bd 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -566,9 +566,13 @@ static inline u32 rt_score(struct rtable
static inline int compare_keys(struct flowi *fl1, struct flowi *fl2)
{
- return memcmp(&fl1->nl_u.ip4_u, &fl2->nl_u.ip4_u, sizeof(fl1->nl_u.ip4_u)) == 0 &&
- fl1->oif == fl2->oif &&
- fl1->iif == fl2->iif;
+ return ((fl1->nl_u.ip4_u.daddr ^ fl2->nl_u.ip4_u.daddr) |
+ (fl1->nl_u.ip4_u.saddr ^ fl2->nl_u.ip4_u.saddr) |
+ (fl1->nl_u.ip4_u.fwmark ^ fl2->nl_u.ip4_u.fwmark) |
+ (*(u16 *)&fl1->nl_u.ip4_u.tos ^
+ *(u16 *)&fl2->nl_u.ip4_u.tos) |
+ (fl1->oif ^ fl2->oif) |
+ (fl1->iif ^ fl2->iif)) == 0;
}
#ifdef CONFIG_IP_ROUTE_MULTIPATH_CACHED
^ permalink raw reply related [flat|nested] 54+ messages in thread
* Re: [RFC] Question about potential problem in net/ipv4/route.c
2006-10-12 6:02 ` David Miller
2006-10-12 6:10 ` Patrick McHardy
@ 2006-10-12 6:35 ` Eric Dumazet
2006-10-12 7:48 ` David Miller
1 sibling, 1 reply; 54+ messages in thread
From: Eric Dumazet @ 2006-10-12 6:35 UTC (permalink / raw)
To: David Miller; +Cc: netdev
David Miller a écrit :
> From: Eric Dumazet <dada1@cosmosbay.com>
> Date: Thu, 12 Oct 2006 07:48:20 +0200
>
>> Not on my gcc here (gcc version 3.4.4) : It wont zeros out the padding bytes
>
> Patrick just proved this too :)
>
>> Well, on this machine I have these oprofile numbers :
>>
>> <rt_intern_hash>: /* rt_intern_hash total: 993464 0.3619 */
>>
>> 31751 0.0116 :ffffffff803e8d26: repz cmpsb %es:(%rdi),%ds:(%rsi)
>> 433438 0.1579 :ffffffff803e8d28: jne ffffffff803e8f80 <rt_intern_hash+0x310>
>
> Indeed, numbers talk bullshit walks :)
>
> How about something like this as a start?
>
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index c41ddba..4d4b1bd 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -566,9 +566,13 @@ static inline u32 rt_score(struct rtable
>
> static inline int compare_keys(struct flowi *fl1, struct flowi *fl2)
> {
> - return memcmp(&fl1->nl_u.ip4_u, &fl2->nl_u.ip4_u, sizeof(fl1->nl_u.ip4_u)) == 0 &&
> - fl1->oif == fl2->oif &&
> - fl1->iif == fl2->iif;
> + return ((fl1->nl_u.ip4_u.daddr ^ fl2->nl_u.ip4_u.daddr) |
> + (fl1->nl_u.ip4_u.saddr ^ fl2->nl_u.ip4_u.saddr) |
> + (fl1->nl_u.ip4_u.fwmark ^ fl2->nl_u.ip4_u.fwmark) |
> + (*(u16 *)&fl1->nl_u.ip4_u.tos ^
> + *(u16 *)&fl2->nl_u.ip4_u.tos) |
> + (fl1->oif ^ fl2->oif) |
> + (fl1->iif ^ fl2->iif)) == 0;
> }
>
> #ifdef CONFIG_IP_ROUTE_MULTIPATH_CACHED
>
>
Yes it seems a nice start :)
How about avoiding the fwmark thing if !CONFIG_IP_ROUTE_FWMARK
> + return ((fl1->nl_u.ip4_u.daddr ^ fl2->nl_u.ip4_u.daddr) |
> + (fl1->nl_u.ip4_u.saddr ^ fl2->nl_u.ip4_u.saddr) |
#ifdef CONFIG_IP_ROUTE_FWMARK
> + (fl1->nl_u.ip4_u.fwmark ^ fl2->nl_u.ip4_u.fwmark) |
#endif
> + (*(u16 *)&fl1->nl_u.ip4_u.tos ^
> + *(u16 *)&fl2->nl_u.ip4_u.tos) |
> + (fl1->oif ^ fl2->oif) |
> + (fl1->iif ^ fl2->iif)) == 0;
Eric
^ permalink raw reply [flat|nested] 54+ messages in thread
* [PATCH] NET : Suspicious locking in reqsk_queue_hash_req()
2006-10-11 13:11 ` [RFC] Question about potential problem in net/ipv4/route.c Eric Dumazet
2006-10-12 5:05 ` David Miller
@ 2006-10-16 9:00 ` Eric Dumazet
2006-10-16 9:07 ` Eric Dumazet
2006-10-16 20:41 ` David Miller
1 sibling, 2 replies; 54+ messages in thread
From: Eric Dumazet @ 2006-10-16 9:00 UTC (permalink / raw)
To: David Miller; +Cc: netdev
[-- Attachment #1: Type: text/plain, Size: 217 bytes --]
Hi David
While browsing include/net/request_sock.h I found this suspicious locking
protecting the SYN table hash table. I think this patch is necessary.
Thank you
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
[-- Attachment #2: reqsk_queue_hash_req.patch --]
[-- Type: text/plain, Size: 454 bytes --]
--- linux-2.6.18/include/net/request_sock.h.orig 2006-10-16 10:53:11.000000000 +0200
+++ linux-2.6.18-ed/include/net/request_sock.h 2006-10-16 10:53:24.000000000 +0200
@@ -251,9 +251,9 @@
req->expires = jiffies + timeout;
req->retrans = 0;
req->sk = NULL;
- req->dl_next = lopt->syn_table[hash];
write_lock(&queue->syn_wait_lock);
+ req->dl_next = lopt->syn_table[hash];
lopt->syn_table[hash] = req;
write_unlock(&queue->syn_wait_lock);
}
^ permalink raw reply [flat|nested] 54+ messages in thread
* [PATCH] NET : Suspicious locking in reqsk_queue_hash_req()
2006-10-16 9:00 ` [PATCH] NET : Suspicious locking in reqsk_queue_hash_req() Eric Dumazet
@ 2006-10-16 9:07 ` Eric Dumazet
2006-10-16 16:16 ` Arnaldo Carvalho de Melo
2006-10-16 20:41 ` David Miller
1 sibling, 1 reply; 54+ messages in thread
From: Eric Dumazet @ 2006-10-16 9:07 UTC (permalink / raw)
To: David Miller; +Cc: netdev
[-- Attachment #1: Type: text/plain, Size: 250 bytes --]
(Sorry, patch inlined this time)
Hi David
While browsing include/net/request_sock.h I found this suspicious locking
protecting the SYN table hash table. I think this patch is necessary.
Thank you
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
[-- Attachment #2: reqsk_queue_hash_req.patch --]
[-- Type: text/plain, Size: 454 bytes --]
--- linux-2.6.18/include/net/request_sock.h.orig 2006-10-16 10:53:11.000000000 +0200
+++ linux-2.6.18-ed/include/net/request_sock.h 2006-10-16 10:53:24.000000000 +0200
@@ -251,9 +251,9 @@
req->expires = jiffies + timeout;
req->retrans = 0;
req->sk = NULL;
- req->dl_next = lopt->syn_table[hash];
write_lock(&queue->syn_wait_lock);
+ req->dl_next = lopt->syn_table[hash];
lopt->syn_table[hash] = req;
write_unlock(&queue->syn_wait_lock);
}
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] NET : Suspicious locking in reqsk_queue_hash_req()
2006-10-16 9:07 ` Eric Dumazet
@ 2006-10-16 16:16 ` Arnaldo Carvalho de Melo
2006-10-16 16:56 ` Eric Dumazet
0 siblings, 1 reply; 54+ messages in thread
From: Arnaldo Carvalho de Melo @ 2006-10-16 16:16 UTC (permalink / raw)
To: Eric Dumazet; +Cc: David Miller, netdev
On 10/16/06, Eric Dumazet <dada1@cosmosbay.com> wrote:
> (Sorry, patch inlined this time)
>
> Hi David
>
> While browsing include/net/request_sock.h I found this suspicious locking
> protecting the SYN table hash table. I think this patch is necessary.
>
> Thank you
Interesting, just checked and it was there before I moved this out of tcp land:
http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=0e87506fcc734647c7b2497eee4eb81e785c857a
@@ -898,18 +898,10 @@ static struct request_sock *tcp_v4_searc
static void tcp_v4_synq_add(struct sock *sk, struct request_sock *req)
{
struct tcp_sock *tp = tcp_sk(sk);
- struct tcp_listen_opt *lopt = tp->listen_opt;
+ struct tcp_listen_opt *lopt = tp->accept_queue.listen_opt;
u32 h = tcp_v4_synq_hash(inet_rsk(req)->rmt_addr,
inet_rsk(req)->rmt_port, lopt->hash_rnd);
- req->expires = jiffies + TCP_TIMEOUT_INIT;
- req->retrans = 0;
- req->sk = NULL;
- req->dl_next = lopt->syn_table[h];
-
- write_lock(&tp->syn_wait_lock);
- lopt->syn_table[h] = req;
- write_unlock(&tp->syn_wait_lock);
-
+ reqsk_queue_hash_req(&tp->accept_queue, h, req, TCP_TIMEOUT_INIT);
tcp_synq_added(sk);
}
> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
>
>
> --- linux-2.6.18/include/net/request_sock.h.orig 2006-10-16 10:53:11.000000000 +0200
> +++ linux-2.6.18-ed/include/net/request_sock.h 2006-10-16 10:53:24.000000000 +0200
> @@ -251,9 +251,9 @@
> req->expires = jiffies + timeout;
> req->retrans = 0;
> req->sk = NULL;
> - req->dl_next = lopt->syn_table[hash];
>
> write_lock(&queue->syn_wait_lock);
> + req->dl_next = lopt->syn_table[hash];
> lopt->syn_table[hash] = req;
> write_unlock(&queue->syn_wait_lock);
> }
^ permalink raw reply [flat|nested] 54+ messages in thread* Re: [PATCH] NET : Suspicious locking in reqsk_queue_hash_req()
2006-10-16 16:16 ` Arnaldo Carvalho de Melo
@ 2006-10-16 16:56 ` Eric Dumazet
2006-10-16 17:39 ` Eric Dumazet
0 siblings, 1 reply; 54+ messages in thread
From: Eric Dumazet @ 2006-10-16 16:56 UTC (permalink / raw)
To: Arnaldo Carvalho de Melo; +Cc: David Miller, netdev
On Monday 16 October 2006 18:16, Arnaldo Carvalho de Melo wrote:
> On 10/16/06, Eric Dumazet <dada1@cosmosbay.com> wrote:
> > (Sorry, patch inlined this time)
> >
> > Hi David
> >
> > While browsing include/net/request_sock.h I found this suspicious locking
> > protecting the SYN table hash table. I think this patch is necessary.
> >
> > Thank you
>
> Interesting, just checked and it was there before I moved this out of tcp
> land:
Well, the bug was there before you put your hands on the code (I checked
linux-2.4.33 & linux-2.4.1 , bug present on both versions)
:)
Eric
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] NET : Suspicious locking in reqsk_queue_hash_req()
2006-10-16 16:56 ` Eric Dumazet
@ 2006-10-16 17:39 ` Eric Dumazet
0 siblings, 0 replies; 54+ messages in thread
From: Eric Dumazet @ 2006-10-16 17:39 UTC (permalink / raw)
To: Arnaldo Carvalho de Melo; +Cc: David Miller, netdev
[-- Attachment #1: Type: text/plain, Size: 1486 bytes --]
On Monday 16 October 2006 18:56, Eric Dumazet wrote:
> On Monday 16 October 2006 18:16, Arnaldo Carvalho de Melo wrote:
> > On 10/16/06, Eric Dumazet <dada1@cosmosbay.com> wrote:
> > > (Sorry, patch inlined this time)
> > >
> > > Hi David
> > >
> > > While browsing include/net/request_sock.h I found this suspicious
> > > locking protecting the SYN table hash table. I think this patch is
> > > necessary.
> > >
> > > Thank you
> >
> > Interesting, just checked and it was there before I moved this out of tcp
> > land:
>
> Well, the bug was there before you put your hands on the code (I checked
> linux-2.4.33 & linux-2.4.1 , bug present on both versions)
Well, 'bug' is not appropriate in fact. Overkill maybe ?
The comment from include/net/request_sock.h explain the thing...
* %syn_wait_lock is necessary only to avoid proc interface having to grab the
main
* lock sock while browsing the listening hash (otherwise it's deadlock
prone).
*
* This lock is acquired in read mode only from listening_get_next() seq_file
* op and it's acquired in write mode _only_ from code that is actively
* changing rskq_accept_head. All readers that are holding the master sock
lock
* don't need to grab this lock in read mode too as rskq_accept_head. writes
* are always protected from the main sock lock.
I bet a more appropriate code (and less prone to reading errors for kernel
gurus/newbies) would be :
What do you think ?
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
[-- Attachment #2: reqsk_queue_hash_req.patch --]
[-- Type: text/plain, Size: 541 bytes --]
--- linux-2.6.19-rc2/include/net/request_sock.h 2006-10-13 18:25:04.000000000 +0200
+++ linux-2.6.19-rc2-ed/include/net/request_sock.h 2006-10-16 19:34:19.000000000 +0200
@@ -254,9 +254,13 @@
req->sk = NULL;
req->dl_next = lopt->syn_table[hash];
- write_lock(&queue->syn_wait_lock);
+ /*
+ * We want previous writes being commited before doing this change,
+ * so that readers of the chain are not confused.
+ */
+ smp_mb();
+
lopt->syn_table[hash] = req;
- write_unlock(&queue->syn_wait_lock);
}
#endif /* _REQUEST_SOCK_H */
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH] NET : Suspicious locking in reqsk_queue_hash_req()
2006-10-16 9:00 ` [PATCH] NET : Suspicious locking in reqsk_queue_hash_req() Eric Dumazet
2006-10-16 9:07 ` Eric Dumazet
@ 2006-10-16 20:41 ` David Miller
1 sibling, 0 replies; 54+ messages in thread
From: David Miller @ 2006-10-16 20:41 UTC (permalink / raw)
To: dada1; +Cc: netdev
From: Eric Dumazet <dada1@cosmosbay.com>
Date: Mon, 16 Oct 2006 11:00:22 +0200
> While browsing include/net/request_sock.h I found this suspicious locking
> protecting the SYN table hash table. I think this patch is necessary.
>
> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
People get tripped up by this one all the time.
We hold a higher level lock which protects other
inserts from happening, namely the listening socket
lock, it works here like the RTNL semaphore does.
We only need to protect the actual change of the hash
head, as lookups can occur asynchronously and we want
linkage seen by lookups to be consistent.
Alexey likes to do this locking trick a lot.
Feel free to add a comment. :-)
^ permalink raw reply [flat|nested] 54+ messages in thread