Netdev List
 help / color / mirror / Atom feed
* Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
From: Krishna Kumar2 @ 2010-11-09 15:28 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: anthony, arnd, avi, davem, eric.dumazet, kvm, netdev, rusty
In-Reply-To: <20101109132239.GF22705@redhat.com>

"Michael S. Tsirkin" <mst@redhat.com> wrote on 11/09/2010 06:52:39 PM:

> > > Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
> > >
> > > On Mon, Oct 25, 2010 at 09:20:38PM +0530, Krishna Kumar2 wrote:
> > > > > Krishna Kumar2/India/IBM@IBMIN wrote on 10/20/2010 02:24:52 PM:
> > > >
> > > > Any feedback, comments, objections, issues or bugs about the
> > > > patches? Please let me know if something needs to be done.
> > > >
> > > > Some more test results:
> > > > _____________________________________________________
> > > >          Host->Guest BW (numtxqs=2)
> > > > #       BW%     CPU%    RCPU%   SD%     RSD%
> > > > _____________________________________________________
> > >
> > > I think we discussed the need for external to guest testing
> > > over 10G. For large messages we should not see any change
> > > but you should be able to get better numbers for small messages
> > > assuming a MQ NIC card.
> >
> > I had to make a few changes to qemu (and a minor change in macvtap
> > driver) to get multiple TXQ support using macvtap working. The NIC
> > is a ixgbe card.
> >
> >
__________________________________________________________________________
> >             Org vs New (I/O: 512 bytes, #numtxqs=2, #vhosts=3)
> > #      BW1     BW2 (%)       SD1    SD2 (%)        RSD1    RSD2 (%)
> >
__________________________________________________________________________
> > 1      14367   13142 (-8.5)  56     62 (10.7)      8        8 (0)
> > 2      3652    3855 (5.5)    37     35 (-5.4)      7        6 (-14.2)
> > 4      12529   12059 (-3.7)  65     77 (18.4)      35       35 (0)
> > 8      13912   14668 (5.4)   288    332 (15.2)     175      184 (5.1)
> > 16     13433   14455 (7.6)   1218   1321 (8.4)     920      943 (2.5)
> > 24     12750   13477 (5.7)   2876   2985 (3.7)     2514     2348 (-6.6)
> > 32     11729   12632 (7.6)   5299   5332 (.6)      4934     4497 (-8.8)
> > 40     11061   11923 (7.7)   8482   8364 (-1.3)    8374     7495
(-10.4)
> > 48     10624   11267 (6.0)   12329  12258 (-.5)    12762    11538
(-9.5)
> > 64     10524   10596 (.6)    21689  22859 (5.3)    23626    22403
(-5.1)
> > 80     9856    10284 (4.3)   35769  36313 (1.5)    39932    36419
(-8.7)
> > 96     9691    10075 (3.9)   52357  52259 (-.1)    58676    53463
(-8.8)
> > 128    9351    9794 (4.7)    114707 94275 (-17.8)  114050   97337
(-14.6)
> >
__________________________________________________________________________
> > Avg:      BW: (3.3)      SD: (-7.3)      RSD: (-11.0)
> >
> >
__________________________________________________________________________
> >             Org vs New (I/O: 1K, #numtxqs=8, #vhosts=5)
> > #      BW1      BW2 (%)       SD1   SD2 (%)        RSD1   RSD2 (%)
> >
__________________________________________________________________________
> > 1      16509    15985 (-3.1)  45    47 (4.4)       7       7 (0)
> > 2      6963     4499 (-35.3)  17    51 (200.0)     7       7 (0)
> > 4      12932    11080 (-14.3) 49    74 (51.0)      35      35 (0)
> > 8      13878    14095 (1.5)   223   292 (30.9)     175     181 (3.4)
> > 16     13440    13698 (1.9)   980   1131 (15.4)    926     942 (1.7)
> > 24     12680    12927 (1.9)   2387  2463 (3.1)     2526    2342 (-7.2)
> > 32     11714    12261 (4.6)   4506  4486 (-.4)     4941    4463 (-9.6)
> > 40     11059    11651 (5.3)   7244  7081 (-2.2)    8349    7437 (-10.9)
> > 48     10580    11095 (4.8)   10811 10500 (-2.8)   12809   11403
(-10.9)
> > 64     10569    10566 (0)     19194 19270 (.3)     23648   21717 (-8.1)
> > 80     9827     10753 (9.4)   31668 29425 (-7.0)   39991   33824
(-15.4)
> > 96     10043    10150 (1.0)   45352 44227 (-2.4)   57766   51131
(-11.4)
> > 128    9360     9979 (6.6)    92058 79198 (-13.9)  114381  92873
(-18.8)
> >
__________________________________________________________________________
> > Avg:      BW: (-.5)      SD: (-7.5)      RSD: (-14.7)
> >
> > Is there anything else you would like me to test/change, or shall
> > I submit the next version (with the above macvtap changes)?
> >
> > Thanks,
> >
> > - KK
>
> Something strange here, right?
> 1. You are consistently getting >10G/s here, and even with a single
stream?

Sorry, I should have mentioned this though I had stated in my
earlier mails. Each test result has two iterations, each of 60
seconds, except when #netperfs is 1 for which I do 10 iteration
(sum across 10 iterations).  I started doing many more iterations
for 1 netperf after finding the issue earlier with single stream.
So the BW is only 4.5-7 Gbps.

> 2. With 2 streams, is where we get < 10G/s originally. Instead of
>    doubling that we get a marginal improvement with 2 queues and
>    about 30% worse with 1 queue.

(doubling happens consistently for guest -> host, but never for
remote host) I tried 512/txqs=2 and 1024/txqs=8 to get a varied
testing scenario. In first case, there is a slight improvement in
BW and good reduction in SD. In the second case, only SD improves
(though BW drops for 2 stream for some reason).  In both cases,
BW and SD improves as the number of sessions increase.

> Is your card MQ?

Yes, the card is MQ. ixgbe 10g card.

Thanks,

- KK


^ permalink raw reply

* Re: [PATCH] virtio_net: Fix queue full check
From: Krishna Kumar2 @ 2010-11-09 15:30 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: davem, netdev, Rusty Russell, yvugenfi
In-Reply-To: <20101109131555.GE22705@redhat.com>

"Michael S. Tsirkin" <mst@redhat.com> wrote on 11/09/2010 06:45:55 PM:

> Re: [PATCH] virtio_net: Fix queue full check
>
> On Tue, Nov 09, 2010 at 09:56:03AM +0530, Krishna Kumar2 wrote:
> > Rusty Russell <rusty@rustcorp.com.au> wrote on 11/08/2010 04:38:47 AM:
> >
> > > Re: [PATCH] virtio_net: Fix queue full check
> > >
> > > On Thu, 4 Nov 2010 10:54:24 pm Michael S. Tsirkin wrote:
> > > > I thought about this some more.  I think the original
> > > > code is actually correct in returning ENOSPC: indirect
> > > > buffers are nice, but it's a mistake
> > > > to rely on them as a memory allocation might fail.
> > > >
> > > > And if you look at virtio-net, it is dropping packets
> > > > under memory pressure which is not really a happy outcome:
> > > > the packet will get freed, reallocated and we get another one,
> > > > adding pressure on the allocator instead of releasing it
> > > > until we free up some buffers.
> > > >
> > > > So I now think we should calculate the capacity
> > > > assuming non-indirect entries, and if we manage to
> > > > use indirect, all the better.
> > >
> > > I've long said it's a weakness in the network stack that it insists
> > > drivers stop the tx queue before they *might* run out of room,
leading to
> > > worst-case assumptions and underutilization of the tx ring.
> > >
> > > However, I lost that debate, and so your patch is the way it's
supposed
> > to
> > > work.  The other main indirect user (block) doesn't care as its queue
> > > allows for post-attempt blocking.
> > >
> > > I enhanced your commentry a little:
> > >
> > > Subject: virtio: return correct capacity to users
> > > Date: Thu, 4 Nov 2010 14:24:24 +0200
> > > From: "Michael S. Tsirkin" <mst@redhat.com>
> > >
> > > We can't rely on indirect buffers for capacity
> > > calculations because they need a memory allocation
> > > which might fail.  In particular, virtio_net can get
> > > into this situation under stress, and it drops packets
> > > and performs badly.
> > >
> > > So return the number of buffers we can guarantee users.
> > >
> > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
> > > Reported-By: Krishna Kumar2 <krkumar2@in.ibm.com>
> >
> > I have tested this patch for 3-4 hours but so far I have not got the tx
> > full
> > error. I am not sure if "Tested-By" applies to this situation, but just
in
> > case:
> >
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
> > Reported-By: Krishna Kumar2 <krkumar2@in.ibm.com>
> > Tested-By: Krishna Kumar2 <krkumar2@in.ibm.com>
> >
> > I think both this patch and the original patch I submitted
> > are needed? That patch removes ENOMEM check and the increment
> > of dev->stats.tx_fifo_errors, and reports "memory failure".
> >
> > Thanks,
> >
> > - KK
>
> So I think your patch on top of this one would be wrong:
> we actually make sure out of memory does not affect TX path
> at all, so any error would be unexpected.
>
> Incrementing tx fifo errors is probably also helpful for debugging.
>
> If you like, we could kill the special handling for ENOMEM.
> Not sure whether it matters.

Since that is dead code, we could remove it (and the fifo error
should disappear too - tx_dropped should be the only counter to
be incremented?). Sorry if I misunderstood something.

Thanks,

- KK


^ permalink raw reply

* Re: [PATCH] virtio_net: Fix queue full check
From: Michael S. Tsirkin @ 2010-11-09 15:30 UTC (permalink / raw)
  To: Krishna Kumar2; +Cc: davem, netdev, Rusty Russell, yvugenfi
In-Reply-To: <OF5BF09BF3.7DE39268-ON652577D6.004B4830-652577D6.0054E713@in.ibm.com>

On Tue, Nov 09, 2010 at 09:00:58PM +0530, Krishna Kumar2 wrote:
> "Michael S. Tsirkin" <mst@redhat.com> wrote on 11/09/2010 06:45:55 PM:
> 
> > Re: [PATCH] virtio_net: Fix queue full check
> >
> > On Tue, Nov 09, 2010 at 09:56:03AM +0530, Krishna Kumar2 wrote:
> > > Rusty Russell <rusty@rustcorp.com.au> wrote on 11/08/2010 04:38:47 AM:
> > >
> > > > Re: [PATCH] virtio_net: Fix queue full check
> > > >
> > > > On Thu, 4 Nov 2010 10:54:24 pm Michael S. Tsirkin wrote:
> > > > > I thought about this some more.  I think the original
> > > > > code is actually correct in returning ENOSPC: indirect
> > > > > buffers are nice, but it's a mistake
> > > > > to rely on them as a memory allocation might fail.
> > > > >
> > > > > And if you look at virtio-net, it is dropping packets
> > > > > under memory pressure which is not really a happy outcome:
> > > > > the packet will get freed, reallocated and we get another one,
> > > > > adding pressure on the allocator instead of releasing it
> > > > > until we free up some buffers.
> > > > >
> > > > > So I now think we should calculate the capacity
> > > > > assuming non-indirect entries, and if we manage to
> > > > > use indirect, all the better.
> > > >
> > > > I've long said it's a weakness in the network stack that it insists
> > > > drivers stop the tx queue before they *might* run out of room,
> leading to
> > > > worst-case assumptions and underutilization of the tx ring.
> > > >
> > > > However, I lost that debate, and so your patch is the way it's
> supposed
> > > to
> > > > work.  The other main indirect user (block) doesn't care as its queue
> > > > allows for post-attempt blocking.
> > > >
> > > > I enhanced your commentry a little:
> > > >
> > > > Subject: virtio: return correct capacity to users
> > > > Date: Thu, 4 Nov 2010 14:24:24 +0200
> > > > From: "Michael S. Tsirkin" <mst@redhat.com>
> > > >
> > > > We can't rely on indirect buffers for capacity
> > > > calculations because they need a memory allocation
> > > > which might fail.  In particular, virtio_net can get
> > > > into this situation under stress, and it drops packets
> > > > and performs badly.
> > > >
> > > > So return the number of buffers we can guarantee users.
> > > >
> > > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > > Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
> > > > Reported-By: Krishna Kumar2 <krkumar2@in.ibm.com>
> > >
> > > I have tested this patch for 3-4 hours but so far I have not got the tx
> > > full
> > > error. I am not sure if "Tested-By" applies to this situation, but just
> in
> > > case:
> > >
> > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
> > > Reported-By: Krishna Kumar2 <krkumar2@in.ibm.com>
> > > Tested-By: Krishna Kumar2 <krkumar2@in.ibm.com>
> > >
> > > I think both this patch and the original patch I submitted
> > > are needed? That patch removes ENOMEM check and the increment
> > > of dev->stats.tx_fifo_errors, and reports "memory failure".
> > >
> > > Thanks,
> > >
> > > - KK
> >
> > So I think your patch on top of this one would be wrong:
> > we actually make sure out of memory does not affect TX path
> > at all, so any error would be unexpected.
> >
> > Incrementing tx fifo errors is probably also helpful for debugging.
> >
> > If you like, we could kill the special handling for ENOMEM.
> > Not sure whether it matters.
> 
> Since that is dead code, we could remove it (and the fifo error
> should disappear too - tx_dropped should be the only counter to
> be incremented?). Sorry if I misunderstood something.
> 
> Thanks,
> 
> - KK

It's just a sanity check. The capacity checking is tricky enough
that I'm happier with some way to detect overflow both from ifconfig
and dmesg.

I don't really care which counter gets incremented really,
since this is some TX bug fifo error seems appropriate
but I don't really care much.


-- 
MST

^ permalink raw reply

* Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
From: Michael S. Tsirkin @ 2010-11-09 15:33 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: anthony, arnd, avi, davem, eric.dumazet, kvm, netdev, rusty
In-Reply-To: <OF24E08752.2087FFA4-ON652577D6.00532DF1-652577D6.0054B291@in.ibm.com>

On Tue, Nov 09, 2010 at 08:58:44PM +0530, Krishna Kumar2 wrote:
> "Michael S. Tsirkin" <mst@redhat.com> wrote on 11/09/2010 06:52:39 PM:
> 
> > > > Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
> > > >
> > > > On Mon, Oct 25, 2010 at 09:20:38PM +0530, Krishna Kumar2 wrote:
> > > > > > Krishna Kumar2/India/IBM@IBMIN wrote on 10/20/2010 02:24:52 PM:
> > > > >
> > > > > Any feedback, comments, objections, issues or bugs about the
> > > > > patches? Please let me know if something needs to be done.
> > > > >
> > > > > Some more test results:
> > > > > _____________________________________________________
> > > > >          Host->Guest BW (numtxqs=2)
> > > > > #       BW%     CPU%    RCPU%   SD%     RSD%
> > > > > _____________________________________________________
> > > >
> > > > I think we discussed the need for external to guest testing
> > > > over 10G. For large messages we should not see any change
> > > > but you should be able to get better numbers for small messages
> > > > assuming a MQ NIC card.
> > >
> > > I had to make a few changes to qemu (and a minor change in macvtap
> > > driver) to get multiple TXQ support using macvtap working. The NIC
> > > is a ixgbe card.
> > >
> > >
> __________________________________________________________________________
> > >             Org vs New (I/O: 512 bytes, #numtxqs=2, #vhosts=3)
> > > #      BW1     BW2 (%)       SD1    SD2 (%)        RSD1    RSD2 (%)
> > >
> __________________________________________________________________________
> > > 1      14367   13142 (-8.5)  56     62 (10.7)      8        8 (0)
> > > 2      3652    3855 (5.5)    37     35 (-5.4)      7        6 (-14.2)
> > > 4      12529   12059 (-3.7)  65     77 (18.4)      35       35 (0)
> > > 8      13912   14668 (5.4)   288    332 (15.2)     175      184 (5.1)
> > > 16     13433   14455 (7.6)   1218   1321 (8.4)     920      943 (2.5)
> > > 24     12750   13477 (5.7)   2876   2985 (3.7)     2514     2348 (-6.6)
> > > 32     11729   12632 (7.6)   5299   5332 (.6)      4934     4497 (-8.8)
> > > 40     11061   11923 (7.7)   8482   8364 (-1.3)    8374     7495
> (-10.4)
> > > 48     10624   11267 (6.0)   12329  12258 (-.5)    12762    11538
> (-9.5)
> > > 64     10524   10596 (.6)    21689  22859 (5.3)    23626    22403
> (-5.1)
> > > 80     9856    10284 (4.3)   35769  36313 (1.5)    39932    36419
> (-8.7)
> > > 96     9691    10075 (3.9)   52357  52259 (-.1)    58676    53463
> (-8.8)
> > > 128    9351    9794 (4.7)    114707 94275 (-17.8)  114050   97337
> (-14.6)
> > >
> __________________________________________________________________________
> > > Avg:      BW: (3.3)      SD: (-7.3)      RSD: (-11.0)
> > >
> > >
> __________________________________________________________________________
> > >             Org vs New (I/O: 1K, #numtxqs=8, #vhosts=5)
> > > #      BW1      BW2 (%)       SD1   SD2 (%)        RSD1   RSD2 (%)
> > >
> __________________________________________________________________________
> > > 1      16509    15985 (-3.1)  45    47 (4.4)       7       7 (0)
> > > 2      6963     4499 (-35.3)  17    51 (200.0)     7       7 (0)
> > > 4      12932    11080 (-14.3) 49    74 (51.0)      35      35 (0)
> > > 8      13878    14095 (1.5)   223   292 (30.9)     175     181 (3.4)
> > > 16     13440    13698 (1.9)   980   1131 (15.4)    926     942 (1.7)
> > > 24     12680    12927 (1.9)   2387  2463 (3.1)     2526    2342 (-7.2)
> > > 32     11714    12261 (4.6)   4506  4486 (-.4)     4941    4463 (-9.6)
> > > 40     11059    11651 (5.3)   7244  7081 (-2.2)    8349    7437 (-10.9)
> > > 48     10580    11095 (4.8)   10811 10500 (-2.8)   12809   11403
> (-10.9)
> > > 64     10569    10566 (0)     19194 19270 (.3)     23648   21717 (-8.1)
> > > 80     9827     10753 (9.4)   31668 29425 (-7.0)   39991   33824
> (-15.4)
> > > 96     10043    10150 (1.0)   45352 44227 (-2.4)   57766   51131
> (-11.4)
> > > 128    9360     9979 (6.6)    92058 79198 (-13.9)  114381  92873
> (-18.8)
> > >
> __________________________________________________________________________
> > > Avg:      BW: (-.5)      SD: (-7.5)      RSD: (-14.7)
> > >
> > > Is there anything else you would like me to test/change, or shall
> > > I submit the next version (with the above macvtap changes)?
> > >
> > > Thanks,
> > >
> > > - KK
> >
> > Something strange here, right?
> > 1. You are consistently getting >10G/s here, and even with a single
> stream?
> 
> Sorry, I should have mentioned this though I had stated in my
> earlier mails. Each test result has two iterations, each of 60
> seconds, except when #netperfs is 1 for which I do 10 iteration
> (sum across 10 iterations).

So need to divide the number by 10?

>  I started doing many more iterations
> for 1 netperf after finding the issue earlier with single stream.
> So the BW is only 4.5-7 Gbps.
> 
> > 2. With 2 streams, is where we get < 10G/s originally. Instead of
> >    doubling that we get a marginal improvement with 2 queues and
> >    about 30% worse with 1 queue.
> 
> (doubling happens consistently for guest -> host, but never for
> remote host) I tried 512/txqs=2 and 1024/txqs=8 to get a varied
> testing scenario. In first case, there is a slight improvement in
> BW and good reduction in SD. In the second case, only SD improves
> (though BW drops for 2 stream for some reason).  In both cases,
> BW and SD improves as the number of sessions increase.

I guess this is another indication that something's wrong.
We are quite far from line rate, the fact BW does not scale
means there's some contention in the code.

> > Is your card MQ?
> 
> Yes, the card is MQ. ixgbe 10g card.
> 
> Thanks,
> 
> - KK

^ permalink raw reply

* Re: ip_summed setting for TCP pure-ACK packets
From: Ben Hutchings @ 2010-11-09 16:24 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, sf-linux-drivers, netdev
In-Reply-To: <1289235366.15376.3.camel@edumazet-laptop>

On Mon, 2010-11-08 at 17:56 +0100, Eric Dumazet wrote: 
> Le lundi 08 novembre 2010 à 16:35 +0000, Ben Hutchings a écrit :
> > As we discussed at LPC:
> > 
> > Current controllers handled by the sfc driver have a per-queue (rather
> > than per-packet) option for checksum generation.  Currently pure-ACK
> > packets sent by TCP have ip_summed == CHECKSUM_NONE and we must send
> > them on hardware queues with checksum generation disabled.  To support
> > this, we allocate 2 hardware queues per core TX queue.
> > 
> > To reduce the risk of reordering (and possibly the number of hardware TX
> > queues required), it would be helpful for TCP to set ip_summed ==
> > CHECKSUM_PARTIAL on pure-ACK packets when the output device is known to
> > support checksum generation.
> > 
> > Ben.
> > 
> 
> Hmm
> 
> Do you mean commit 2e8e18ef52e7dd1af0a3bd1 is not enough ?

It might well be... I must admit I hadn't thought to check whether this
issue had gone away.

Yes, that does the trick.  Sorry for wasting people's time on this.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.



^ permalink raw reply

* Re: [PATCH] inet: fix ip_mc_drop_socket()
From: David Miller @ 2010-11-09 16:26 UTC (permalink / raw)
  To: eric.dumazet
  Cc: markus, paulmck, miles.lane, ilpo.jarvinen, linux-kernel, lenb,
	netdev
In-Reply-To: <1289250954.2790.11.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 08 Nov 2010 22:15:54 +0100

> Hmm, I believe I found the bug.
> 
> Thanks guys !
> 
> [PATCH] inet: fix ip_mc_drop_socket()
> 
> commit 8723e1b4ad9be4444 (inet: RCU changes in inetdev_by_index())
> forgot one call site in ip_mc_drop_socket()
> 
> We should not decrease idev refcount after inetdev_by_index() call,
> since refcount is not increased anymore.
> 
> Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
> Reported-by: Miles Lane <miles.lane@gmail.com>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied, thanks a lot!

^ permalink raw reply

* Re: [PATCH 1/2] r8169: revert "Handle rxfifo errors on 8168 chips"
From: David Miller @ 2010-11-09 16:27 UTC (permalink / raw)
  To: romieu; +Cc: netdev, a.radke, mjg, daniel.blueman
In-Reply-To: <20101108232305.GA13720@electric-eye.fr.zoreil.com>

From: Francois Romieu <romieu@fr.zoreil.com>
Date: Tue, 9 Nov 2010 00:23:05 +0100

> The original patch helps under obscure conditions (no pun) but
> some 8168 do not like it. The change needs to be tightened with
> a specific 8168 version.
> 
> This reverts commit 801e147cde02f04b5c2f42764cd43a89fc7400a2.
> 
> Regression at https://bugzilla.kernel.org/show_bug.cgi?id=20882
> 
> Signed-off-by: Francois Romieu <romieu@fr.zoreil.com>
> Tested-by: Andreas Radke <a.radke@arcor.de>
> Cc: Matthew Garrett <mjg@redhat.com>
> Cc: Daniel J Blueman <daniel.blueman@gmail.com>

Applied.

^ permalink raw reply

* Re: [PATCH 2/2] r8169: fix sleeping while holding spinlock.
From: David Miller @ 2010-11-09 16:27 UTC (permalink / raw)
  To: romieu; +Cc: netdev, daniel.blueman, rjw
In-Reply-To: <20101108232358.GB13720@electric-eye.fr.zoreil.com>

From: Francois Romieu <romieu@fr.zoreil.com>
Date: Tue, 9 Nov 2010 00:23:58 +0100

> As device_set_wakeup_enable can now sleep, move the call to outside
> the critical section.
> 
> Signed-off-by: Daniel J Blueman <daniel.blueman@gmail.com>
> Acked-by: Rafael J. Wysocki <rjw@sisk.pl>

Applied.

^ permalink raw reply

* Re: ip_summed setting for TCP pure-ACK packets
From: Eric Dumazet @ 2010-11-09 16:28 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: David Miller, sf-linux-drivers, netdev
In-Reply-To: <1289319889.2238.195.camel@achroite.uk.solarflarecom.com>

Le mardi 09 novembre 2010 à 16:24 +0000, Ben Hutchings a écrit :
> On Mon, 2010-11-08 at 17:56 +0100, Eric Dumazet wrote: 
> > Le lundi 08 novembre 2010 à 16:35 +0000, Ben Hutchings a écrit :
> > > As we discussed at LPC:
> > > 
> > > Current controllers handled by the sfc driver have a per-queue (rather
> > > than per-packet) option for checksum generation.  Currently pure-ACK
> > > packets sent by TCP have ip_summed == CHECKSUM_NONE and we must send
> > > them on hardware queues with checksum generation disabled.  To support
> > > this, we allocate 2 hardware queues per core TX queue.
> > > 
> > > To reduce the risk of reordering (and possibly the number of hardware TX
> > > queues required), it would be helpful for TCP to set ip_summed ==
> > > CHECKSUM_PARTIAL on pure-ACK packets when the output device is known to
> > > support checksum generation.
> > > 
> > > Ben.
> > > 
> > 
> > Hmm
> > 
> > Do you mean commit 2e8e18ef52e7dd1af0a3bd1 is not enough ?
> 
> It might well be... I must admit I hadn't thought to check whether this
> issue had gone away.
> 
> Yes, that does the trick.  Sorry for wasting people's time on this.
> 

You're welcome ;)




^ permalink raw reply

* [PATCH] net: replace min with min_t in pktgen_if_write
From: Davidlohr Bueso @ 2010-11-09 16:37 UTC (permalink / raw)
  To: David Miller, Robert Olsson; +Cc: LKML, Network Development

From: Davidlohr Bueso <dave@gnu.org>

This addresses the following compiler (gcc 4.4.5) warning:

  CC [M]  net/core/pktgen.o
net/core/pktgen.c: In function ‘pktgen_if_write’:
net/core/pktgen.c:890: warning: comparison of distinct pointer types lacks a cast

Signed-off-by: Davidlohr Bueso <dave@gnu.org>
---
 net/core/pktgen.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index fbce4b0..1992cd0 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -887,7 +887,7 @@ static ssize_t pktgen_if_write(struct file *file,
 	i += len;
 
 	if (debug) {
-		size_t copy = min(count, 1023);
+		size_t copy = min_t(size_t, count, 1023);
 		char tb[copy + 1];
 		if (copy_from_user(tb, user_buffer, copy))
 			return -EFAULT;
-- 
1.7.1

^ permalink raw reply related

* Re: [PATCH] net: replace min with min_t in pktgen_if_write
From: David Miller @ 2010-11-09 16:40 UTC (permalink / raw)
  To: dave; +Cc: robert.olsson, linux-kernel, netdev
In-Reply-To: <1289320633.2260.12.camel@cowboy>

From: Davidlohr Bueso <dave@gnu.org>
Date: Tue, 09 Nov 2010 13:37:13 -0300

> From: Davidlohr Bueso <dave@gnu.org>
> 
> This addresses the following compiler (gcc 4.4.5) warning:
> 
>   CC [M]  net/core/pktgen.o
> net/core/pktgen.c: In function ‘pktgen_if_write’:
> net/core/pktgen.c:890: warning: comparison of distinct pointer types lacks a cast
> 
> Signed-off-by: Davidlohr Bueso <dave@gnu.org>

Already fixed in net-2.6:

commit 86c2c0a8a4965ae5cbc0ff97ed39a4472e8e9b23
Author: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Date:   Sat Nov 6 20:11:38 2010 +0000

    NET: pktgen - fix compile warning
    
    This should fix the following warning:
    
    net/core/pktgen.c: In function ‘pktgen_if_write’:
    net/core/pktgen.c:890: warning: comparison of distinct pointer types lacks a cast
    
    Signed-off-by: Dmitry Torokhov <dtor@mail.ru>
    Reviewed-by: Nelson Elhage <nelhage@ksplice.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index fbce4b0..1992cd0 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -887,7 +887,7 @@ static ssize_t pktgen_if_write(struct file *file,
 	i += len;
 
 	if (debug) {
-		size_t copy = min(count, 1023);
+		size_t copy = min_t(size_t, count, 1023);
 		char tb[copy + 1];
 		if (copy_from_user(tb, user_buffer, copy))
 			return -EFAULT;

^ permalink raw reply related

* Re: [PATCH] Fix CAN info leak/minor heap overflow
From: David Miller @ 2010-11-09 17:05 UTC (permalink / raw)
  To: socketcan; +Cc: urs, netdev, drosenberg, security, torvalds
In-Reply-To: <4CD8FDB5.6060905@hartkopp.net>

From: Oliver Hartkopp <socketcan@hartkopp.net>
Date: Tue, 09 Nov 2010 08:52:21 +0100

> Once this patch is applied (and the procfs layout is changed anyway), i'd also
> like to send a patch from my backlog that would extend the procfs output for
> can-bcm with an additional drop counter.

I find this kind of discussion extremely disappointing.

All of this stuff you CAN guys do with procfs files and version
strings is completely wrong and bogus.

Once you create a procfs file layout, you're basically stuck and you
can at best only reasonably add new fields at the end, you can't
really change existing fields.

And sysfs would have been a lot more appropriate, you could use
attributes for each value you want to export and then just add new
sysfs attributes when you want to export new values which has very
clear semantics and backwards compatability implications.

^ permalink raw reply

* Re: [PATCH v15 00/17] Provide a zero-copy method on KVM virtio-net.
From: David Miller @ 2010-11-09 17:15 UTC (permalink / raw)
  To: xiaohui.xin; +Cc: netdev, kvm, linux-kernel, mst, mingo, herbert, jdike
In-Reply-To: <1289293402-4791-1-git-send-email-xiaohui.xin@intel.com>

From: xiaohui.xin@intel.com
Date: Tue,  9 Nov 2010 17:03:05 +0800

> We provide an zero-copy method which driver side may get external
> buffers to DMA. Here external means driver don't use kernel space
> to allocate skb buffers. Currently the external buffer can be from
> guest virtio-net driver.

1) Please use kernel-doc style comments to document structure
   members.

2) The idea to key off of skb->dev in skb_release_data() is
   fundamentally flawed since many actions can change skb->dev on you,
   which will end up causing a leak of your external data areas.

^ permalink raw reply

* Re: [PATCH] drivers/net: normalize TX_TIMEOUT
From: David Miller @ 2010-11-09 17:18 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev
In-Reply-To: <1288860575.2659.34.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 04 Nov 2010 09:49:35 +0100

> Some network drivers use old TX_TIMEOUT definitions, assuming HZ=100 of
> old kernels.
> 
> Convert these definitions to include HZ, since HZ can be 1000 these
> days.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied, thanks Eric.

^ permalink raw reply

* Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
From: Krishna Kumar2 @ 2010-11-09 17:24 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: anthony, arnd, avi, davem, eric.dumazet, kvm, netdev, rusty
In-Reply-To: <20101109153325.GB26153@redhat.com>

"Michael S. Tsirkin" <mst@redhat.com> wrote on 11/09/2010 09:03:25 PM:

> > > Something strange here, right?
> > > 1. You are consistently getting >10G/s here, and even with a single
> > stream?
> >
> > Sorry, I should have mentioned this though I had stated in my
> > earlier mails. Each test result has two iterations, each of 60
> > seconds, except when #netperfs is 1 for which I do 10 iteration
> > (sum across 10 iterations).
>
> So need to divide the number by 10?

Yes, that is what I get with 512/1K macvtap I/O size :)

> >  I started doing many more iterations
> > for 1 netperf after finding the issue earlier with single stream.
> > So the BW is only 4.5-7 Gbps.
> >
> > > 2. With 2 streams, is where we get < 10G/s originally. Instead of
> > >    doubling that we get a marginal improvement with 2 queues and
> > >    about 30% worse with 1 queue.
> >
> > (doubling happens consistently for guest -> host, but never for
> > remote host) I tried 512/txqs=2 and 1024/txqs=8 to get a varied
> > testing scenario. In first case, there is a slight improvement in
> > BW and good reduction in SD. In the second case, only SD improves
> > (though BW drops for 2 stream for some reason).  In both cases,
> > BW and SD improves as the number of sessions increase.
>
> I guess this is another indication that something's wrong.

The patch - both virtio-net and vhost-net, doesn't have any
locking/mutex's/ or any synchronization method. Guest -> host
performance improvement of upto 100% shows the patch is not
doing anything wrong.

> We are quite far from line rate, the fact BW does not scale
> means there's some contention in the code.

Attaining line speed with macvtap seems to be a generic issue
and unrelated to my patch specifically. IMHO if there is nothing
wrong in the code (review) and is accepted, it will benefit as
others can also help to find what needs to be implemented in
vhost/macvtap/qemu to get line speed for guest->remote-host.

PS: bare-metal performance for host->remote-host is also
    2.7 Gbps and 2.8 Gbps for 512/1024 for the same card.

Thanks,

- KK


^ permalink raw reply

* Re: [PATCH 36/39] net/ipv4/tcp.c: Update WARN uses
From: David Miller @ 2010-11-09 17:23 UTC (permalink / raw)
  To: joe; +Cc: trivial, kuznet, pekkas, jmorris, yoshfuji, kaber, netdev,
	linux-kernel
In-Reply-To: <2afee732758b93d7464e6bb07a52b3bc67f164a4.1288471898.git.joe@perches.com>

From: Joe Perches <joe@perches.com>
Date: Sat, 30 Oct 2010 14:08:53 -0700

> Coalesce long formats.
> Align arguments.
> Remove KERN_<level>.
> 
> Signed-off-by: Joe Perches <joe@perches.com>

Applied.

^ permalink raw reply

* Re: [PATCH 35/39] net/core/dev.c: Update WARN uses
From: David Miller @ 2010-11-09 17:23 UTC (permalink / raw)
  To: joe; +Cc: trivial, netdev, linux-kernel
In-Reply-To: <2f05cefa4dc86594a2d7335c7978b1ae781bef0e.1288471898.git.joe@perches.com>

From: Joe Perches <joe@perches.com>
Date: Sat, 30 Oct 2010 14:08:52 -0700

> Coalesce long formats.
> Add missing newlines.
> 
> Signed-off-by: Joe Perches <joe@perches.com>

Applied.

^ permalink raw reply

* Re: [PATCH 17/39] drivers/net/usb: Update WARN uses
From: David Miller @ 2010-11-09 17:23 UTC (permalink / raw)
  To: joe; +Cc: trivial, gregkh, linux-usb, netdev, linux-kernel
In-Reply-To: <4ef13c27899febe0cef4f1633321ddedbe97cfcf.1288471898.git.joe@perches.com>

From: Joe Perches <joe@perches.com>
Date: Sat, 30 Oct 2010 14:08:34 -0700

> Add missing newlines.
> 
> Signed-off-by: Joe Perches <joe@perches.com>

Applied.

^ permalink raw reply

* Re: [PATCH 16/39] drivers/net/can: Update WARN uses
From: David Miller @ 2010-11-09 17:23 UTC (permalink / raw)
  To: joe-6d6DIl74uiNBDgjK7y7TUQ
  Cc: socketcan-core-0fE9KPoRgkgATYTw5x5z8w,
	netdev-u79uwXL29TY76Z2rM5mHXA, trivial-DgEjT+Ai2ygdnm+yROfE0A,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, wg-5Yr1BZd7O62+XT7JhA+gdA
In-Reply-To: <2a00e2a377a44e544cec15356cf3794abe47b306.1288471898.git.joe-6d6DIl74uiNBDgjK7y7TUQ@public.gmane.org>

From: Joe Perches <joe-6d6DIl74uiNBDgjK7y7TUQ@public.gmane.org>
Date: Sat, 30 Oct 2010 14:08:33 -0700

> Add missing newlines.
> 
> Signed-off-by: Joe Perches <joe-6d6DIl74uiNBDgjK7y7TUQ@public.gmane.org>

Applied.

^ permalink raw reply

* Re: [PATCH 3/3] net: tipc: fix information leak to userland
From: David Miller @ 2010-11-09 17:26 UTC (permalink / raw)
  To: segooon
  Cc: jon.maloy, netdev, kernel-janitors, linux-kernel, tipc-discussion,
	allan.stephens
In-Reply-To: <1288545032-16481-1-git-send-email-segooon@gmail.com>

From: Vasiliy Kulikov <segooon@gmail.com>
Date: Sun, 31 Oct 2010 20:10:32 +0300

> Structure sockaddr_tipc is copied to userland with padding bytes after
> "id" field in union field "name" unitialized.  It leads to leaking of
> contents of kernel stack memory.  We have to initialize them to zero.
> 
> Signed-off-by: Vasiliy Kulikov <segooon@gmail.com>

Applied.

Patches #1 and #2 were given feedback which I need you to integrate
and submit new patches based upon, thanks.

------------------------------------------------------------------------------
The Next 800 Companies to Lead America's Growth: New Video Whitepaper
David G. Thomson, author of the best-selling book "Blueprint to a 
Billion" shares his insights and actions to help propel your 
business during the next growth cycle. Listen Now!
http://p.sf.net/sfu/SAP-dev2dev

^ permalink raw reply

* [PATCH] Enhance AF_PACKET implementation to not require high order contiguous memory allocation (v2)
From: nhorman @ 2010-11-09 17:46 UTC (permalink / raw)
  To: netdev; +Cc: davem, eric.dumazet, zenczykowski, Neil Horman
In-Reply-To: <1288033566-2091-1-git-send-email-nhorman@tuxdriver.com>

From: Neil Horman <nhorman@tuxdriver.com>

Version 2 of this patch.  Sorry its been awhile, but I've had other issues come
up.  I've re-written this patch taking into account the change notes from the
last round

Change notes:
1) Rewrote the patch to not do all my previous stupid vmaps on single page
arrays.  Instead we just use vmalloc now.

2) Checked into the behavior of tcpdump on high order allocations in the failing
case.  Tcpdump (more specifically I think libpcap) does attempt to minimize the
allocation order on the ring buffer, unfortunately the lowest it will push the
ordrer too given a sufficiently large buffer is order 5, so its still a very
large contiguous allocation, and that says nothing of other malicious
applications that might try to do more.

Summary:
It was shown to me recently that systems under high load were driven very deep
into swap when tcpdump was run.  The reason this happened was because the
AF_PACKET protocol has a SET_RINGBUFFER socket option that allows the user space
application to specify how many entries an AF_PACKET socket will have and how
large each entry will be.  It seems the default setting for tcpdump is to set
the ring buffer to 32 entries of 64 Kb each, which implies 32 order 5
allocation.  Thats difficult under good circumstances, and horrid under memory
pressure.

I thought it would be good to make that a bit more usable.  I was going to do a
simple conversion of the ring buffer from contigous pages to iovecs, but
unfortunately, the metadata which AF_PACKET places in these buffers can easily
span a page boundary, and given that these buffers get mapped into user space,
and the data layout doesn't easily allow for a change to padding between frames
to avoid that, a simple iovec change is just going to break user space ABI
consistency.

So I've done this, I've added a three tiered mechanism to the af_packet set_ring
socket option.  It attempts to allocate memory in the following order:

1) Using __get_free_pages with GFP_NORETRY set, so as to fail quickly without
digging into swap

2) Using vmalloc

3) Using __get_free_pages with GFP_NORETRY clear, causing us to try as hard as
needed to get the memory

The effect is that we don't disturb the system as much when we're under load,
while still being able to conduct tcpdumps effectively.

Tested successfully by me.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
---
 net/packet/af_packet.c |   84 ++++++++++++++++++++++++++++++++++++++---------
 1 files changed, 68 insertions(+), 16 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 3616f27..d1148ac 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -163,8 +163,14 @@ struct packet_mreq_max {
 static int packet_set_ring(struct sock *sk, struct tpacket_req *req,
 		int closing, int tx_ring);
 
+#define PGV_FROM_VMALLOC 1
+struct pgv {
+	char *buffer;
+	unsigned char flags;
+};
+
 struct packet_ring_buffer {
-	char			**pg_vec;
+	struct pgv		*pg_vec;
 	unsigned int		head;
 	unsigned int		frames_per_block;
 	unsigned int		frame_size;
@@ -283,7 +289,8 @@ static void *packet_lookup_frame(struct packet_sock *po,
 	pg_vec_pos = position / rb->frames_per_block;
 	frame_offset = position % rb->frames_per_block;
 
-	h.raw = rb->pg_vec[pg_vec_pos] + (frame_offset * rb->frame_size);
+	h.raw = rb->pg_vec[pg_vec_pos].buffer +
+		(frame_offset * rb->frame_size);
 
 	if (status != __packet_get_status(po, h.raw))
 		return NULL;
@@ -2322,37 +2329,74 @@ static const struct vm_operations_struct packet_mmap_ops = {
 	.close	=	packet_mm_close,
 };
 
-static void free_pg_vec(char **pg_vec, unsigned int order, unsigned int len)
+static void free_pg_vec(struct pgv *pg_vec, unsigned int order,
+			unsigned int len)
 {
 	int i;
 
 	for (i = 0; i < len; i++) {
-		if (likely(pg_vec[i]))
-			free_pages((unsigned long) pg_vec[i], order);
+		if (likely(pg_vec[i].buffer)) {
+			if (pg_vec[i].flags & PGV_FROM_VMALLOC)
+				vfree(pg_vec[i].buffer);
+			else
+				free_pages((unsigned long)pg_vec[i].buffer,
+					   order);
+			pg_vec[i].buffer = NULL;
+		}
 	}
 	kfree(pg_vec);
 }
 
-static inline char *alloc_one_pg_vec_page(unsigned long order)
+static inline char *alloc_one_pg_vec_page(unsigned long order,
+					  unsigned char *flags)
 {
-	gfp_t gfp_flags = GFP_KERNEL | __GFP_COMP | __GFP_ZERO | __GFP_NOWARN;
+	char *buffer = NULL;
+	gfp_t gfp_flags = GFP_KERNEL | __GFP_COMP |
+			  __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY;
+
+	buffer = (char *) __get_free_pages(gfp_flags, order);
+
+	if (buffer)
+		return buffer;
+
+	/*
+	 * __get_free_pages failed, fall back to vmalloc
+	 */
+	*flags |= PGV_FROM_VMALLOC;
+	buffer = vmalloc((1 << order) * PAGE_SIZE);
 
-	return (char *) __get_free_pages(gfp_flags, order);
+	if (buffer)
+		return buffer;
+
+	/*
+	 * vmalloc failed, lets dig into swap here
+	 */
+	*flags = 0;
+	gfp_flags &= ~__GFP_NORETRY;
+	buffer = (char *)__get_free_pages(gfp_flags, order);
+	if (buffer)
+		return buffer;
+
+	/*
+	 * complete and utter failure
+	 */
+	return NULL;
 }
 
-static char **alloc_pg_vec(struct tpacket_req *req, int order)
+static struct pgv *alloc_pg_vec(struct tpacket_req *req, int order)
 {
 	unsigned int block_nr = req->tp_block_nr;
-	char **pg_vec;
+	struct pgv *pg_vec;
 	int i;
 
-	pg_vec = kzalloc(block_nr * sizeof(char *), GFP_KERNEL);
+	pg_vec = kzalloc(block_nr * sizeof(struct pgv), GFP_KERNEL);
 	if (unlikely(!pg_vec))
 		goto out;
 
 	for (i = 0; i < block_nr; i++) {
-		pg_vec[i] = alloc_one_pg_vec_page(order);
-		if (unlikely(!pg_vec[i]))
+		pg_vec[i].buffer = alloc_one_pg_vec_page(order,
+							 &pg_vec[i].flags);
+		if (unlikely(!pg_vec[i].buffer))
 			goto out_free_pgvec;
 	}
 
@@ -2361,6 +2405,7 @@ out:
 
 out_free_pgvec:
 	free_pg_vec(pg_vec, order, block_nr);
+	kfree(pg_vec);
 	pg_vec = NULL;
 	goto out;
 }
@@ -2368,7 +2413,7 @@ out_free_pgvec:
 static int packet_set_ring(struct sock *sk, struct tpacket_req *req,
 		int closing, int tx_ring)
 {
-	char **pg_vec = NULL;
+	struct pgv *pg_vec = NULL;
 	struct packet_sock *po = pkt_sk(sk);
 	int was_running, order = 0;
 	struct packet_ring_buffer *rb;
@@ -2530,15 +2575,22 @@ static int packet_mmap(struct file *file, struct socket *sock,
 			continue;
 
 		for (i = 0; i < rb->pg_vec_len; i++) {
-			struct page *page = virt_to_page(rb->pg_vec[i]);
+			struct page *page;
+			void *kaddr = rb->pg_vec[i].buffer;
 			int pg_num;
 
 			for (pg_num = 0; pg_num < rb->pg_vec_pages;
-					pg_num++, page++) {
+					pg_num++) {
+				if (rb->pg_vec[i].flags & PGV_FROM_VMALLOC)
+					page = vmalloc_to_page(kaddr);
+				else
+					page = virt_to_page(kaddr);
+
 				err = vm_insert_page(vma, start, page);
 				if (unlikely(err))
 					goto out;
 				start += PAGE_SIZE;
+				kaddr += PAGE_SIZE;
 			}
 		}
 	}
-- 
1.7.2.3


^ permalink raw reply related

* Re: [PATCH] Enhance AF_PACKET implementation to not require high order contiguous memory allocation (v2)
From: Eric Dumazet @ 2010-11-09 18:02 UTC (permalink / raw)
  To: nhorman; +Cc: netdev, davem, zenczykowski
In-Reply-To: <1289324799-2256-1-git-send-email-nhorman@tuxdriver.com>

Le mardi 09 novembre 2010 à 12:46 -0500, nhorman@tuxdriver.com a écrit :
> From: Neil Horman <nhorman@tuxdriver.com>
> 
> Version 2 of this patch.  Sorry its been awhile, but I've had other issues come
> up.  I've re-written this patch taking into account the change notes from the
> last round
> 
> Change notes:
> 1) Rewrote the patch to not do all my previous stupid vmaps on single page
> arrays.  Instead we just use vmalloc now.
> 
> 2) Checked into the behavior of tcpdump on high order allocations in the failing
> case.  Tcpdump (more specifically I think libpcap) does attempt to minimize the
> allocation order on the ring buffer, unfortunately the lowest it will push the
> ordrer too given a sufficiently large buffer is order 5, so its still a very
> large contiguous allocation, and that says nothing of other malicious
> applications that might try to do more.
> 
> Summary:
> It was shown to me recently that systems under high load were driven very deep
> into swap when tcpdump was run.  The reason this happened was because the
> AF_PACKET protocol has a SET_RINGBUFFER socket option that allows the user space
> application to specify how many entries an AF_PACKET socket will have and how
> large each entry will be.  It seems the default setting for tcpdump is to set
> the ring buffer to 32 entries of 64 Kb each, which implies 32 order 5
> allocation.  Thats difficult under good circumstances, and horrid under memory
> pressure.
> 
> I thought it would be good to make that a bit more usable.  I was going to do a
> simple conversion of the ring buffer from contigous pages to iovecs, but
> unfortunately, the metadata which AF_PACKET places in these buffers can easily
> span a page boundary, and given that these buffers get mapped into user space,
> and the data layout doesn't easily allow for a change to padding between frames
> to avoid that, a simple iovec change is just going to break user space ABI
> consistency.
> 
> So I've done this, I've added a three tiered mechanism to the af_packet set_ring
> socket option.  It attempts to allocate memory in the following order:
> 
> 1) Using __get_free_pages with GFP_NORETRY set, so as to fail quickly without
> digging into swap
> 
> 2) Using vmalloc
> 
> 3) Using __get_free_pages with GFP_NORETRY clear, causing us to try as hard as
> needed to get the memory
> 
> The effect is that we don't disturb the system as much when we're under load,
> while still being able to conduct tcpdumps effectively.
> 
> Tested successfully by me.
> 
> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> ---
>  net/packet/af_packet.c |   84 ++++++++++++++++++++++++++++++++++++++---------
>  1 files changed, 68 insertions(+), 16 deletions(-)
> 
> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> index 3616f27..d1148ac 100644
> --- a/net/packet/af_packet.c
> +++ b/net/packet/af_packet.c
> @@ -163,8 +163,14 @@ struct packet_mreq_max {
>  static int packet_set_ring(struct sock *sk, struct tpacket_req *req,
>  		int closing, int tx_ring);
>  
> +#define PGV_FROM_VMALLOC 1
> +struct pgv {
> +	char *buffer;
> +	unsigned char flags;
> +};
> +
>  struct packet_ring_buffer {
> -	char			**pg_vec;
> +	struct pgv		*pg_vec;
>  	unsigned int		head;
>  	unsigned int		frames_per_block;
>  	unsigned int		frame_size;
> @@ -283,7 +289,8 @@ static void *packet_lookup_frame(struct packet_sock *po,
>  	pg_vec_pos = position / rb->frames_per_block;
>  	frame_offset = position % rb->frames_per_block;
>  
> -	h.raw = rb->pg_vec[pg_vec_pos] + (frame_offset * rb->frame_size);
> +	h.raw = rb->pg_vec[pg_vec_pos].buffer +
> +		(frame_offset * rb->frame_size);
>  
>  	if (status != __packet_get_status(po, h.raw))
>  		return NULL;
> @@ -2322,37 +2329,74 @@ static const struct vm_operations_struct packet_mmap_ops = {
>  	.close	=	packet_mm_close,
>  };
>  
> -static void free_pg_vec(char **pg_vec, unsigned int order, unsigned int len)
> +static void free_pg_vec(struct pgv *pg_vec, unsigned int order,
> +			unsigned int len)
>  {
>  	int i;
>  
>  	for (i = 0; i < len; i++) {
> -		if (likely(pg_vec[i]))
> -			free_pages((unsigned long) pg_vec[i], order);
> +		if (likely(pg_vec[i].buffer)) {
> +			if (pg_vec[i].flags & PGV_FROM_VMALLOC)
> +				vfree(pg_vec[i].buffer);
> +			else
> +				free_pages((unsigned long)pg_vec[i].buffer,
> +					   order);
> +			pg_vec[i].buffer = NULL;
> +		}
>  	}
>  	kfree(pg_vec);
>  }
>  
> -static inline char *alloc_one_pg_vec_page(unsigned long order)
> +static inline char *alloc_one_pg_vec_page(unsigned long order,
> +					  unsigned char *flags)
>  {
> -	gfp_t gfp_flags = GFP_KERNEL | __GFP_COMP | __GFP_ZERO | __GFP_NOWARN;
> +	char *buffer = NULL;
> +	gfp_t gfp_flags = GFP_KERNEL | __GFP_COMP |
> +			  __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY;
> +
> +	buffer = (char *) __get_free_pages(gfp_flags, order);
> +
> +	if (buffer)
> +		return buffer;
> +
> +	/*
> +	 * __get_free_pages failed, fall back to vmalloc
> +	 */
> +	*flags |= PGV_FROM_VMALLOC;
> +	buffer = vmalloc((1 << order) * PAGE_SIZE);
>  
> -	return (char *) __get_free_pages(gfp_flags, order);
> +	if (buffer)
> +		return buffer;
> +
> +	/*
> +	 * vmalloc failed, lets dig into swap here
> +	 */
> +	*flags = 0;
> +	gfp_flags &= ~__GFP_NORETRY;
> +	buffer = (char *)__get_free_pages(gfp_flags, order);
> +	if (buffer)
> +		return buffer;
> +
> +	/*
> +	 * complete and utter failure
> +	 */
> +	return NULL;
>  }
>  
> -static char **alloc_pg_vec(struct tpacket_req *req, int order)
> +static struct pgv *alloc_pg_vec(struct tpacket_req *req, int order)
>  {
>  	unsigned int block_nr = req->tp_block_nr;
> -	char **pg_vec;
> +	struct pgv *pg_vec;
>  	int i;
>  
> -	pg_vec = kzalloc(block_nr * sizeof(char *), GFP_KERNEL);
> +	pg_vec = kzalloc(block_nr * sizeof(struct pgv), GFP_KERNEL);

While we are at it, we could check block_nr being a sane value here ;)

Nice stuff Neil !



^ permalink raw reply

* Re: radvd and auto-ipv6 address regression from 2.6.31 to 2.6.34+
From: Ben Greear @ 2010-11-09 18:07 UTC (permalink / raw)
  To: Brian Haley; +Cc: Mikael Abrahamsson, NetDev
In-Reply-To: <4CD81D24.1040609@hp.com>

On 11/08/2010 07:54 AM, Brian Haley wrote:
> On 11/06/2010 02:17 AM, Mikael Abrahamsson wrote:
>> I would not be surprised if the IPv6 stack on Linux would gain IPv6
>> addresses using SLAAC from any interface if it sees RAs on it,
>> regardless if these are locally generated or not.
>>
>> I guess any Linux device running radvd should turn off autoconf on those
>> interface that radvd is acting on?
>>
>> net.ipv6.conf.veth0.accept_ra=0 should do the trick?
>
> I believe radvd will turn-on IPv6 forwarding on all the interfaces, at
> least it does on Debian in /etc/init.d/radvd, which essentially disables
> the reception of RA's for address configuration purposes.  I'm curious
> what these values are set to, and if something just got missed.
>
> It might be something specific to veth too, not sure how packets are
> copied/looped-back on these devices from the stack.

Thanks for the hint.  I did not have ipv6 forwarding enabled for the
radvd interfaces.  Once I fixed that, it worked as expected with no
auto-created global IPv6 address.  I have been starting radvd manually,
so I wouldn't benefit from any OS specific start scripts.

Thanks,
Ben

>
> -Brian


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


^ permalink raw reply

* Networking hangs when too many parallel requests are made at once
From: Luke Hutchison @ 2010-11-09 18:30 UTC (permalink / raw)
  To: netdev

Since around Linux kernel 2.6.33 or so (but maybe as early as
2.6.31, not sure exactly what version), when restoring a crashed or
closed browser session of either Firefox or Chrome where lots of tabs
(say 10-40) open simultaneously, the networking stack is brought to
its knees -- most or all the tabs eventually time out without data, or
a few tabs might get some data and then display a partial web page.
This behavior occurs with either wifi or ethernet, and occurs when
booting from Fedora 14 on liveusb, so it does not appear to be a
configuration problem. I have a Toshiba Satellite Pro S300M-S2142
laptop with a Core 2 Duo P8600 CPU, Intel GM45 gfx, Intel 82567V
Gigabit Ethernet and Intel 5100 Wifi, running kernel
kernel-2.6.36-1.1.fc15.x86_64 on top of Fedora 14.

Sorry for the length of the following bug report, but it's quite hard
to describe the behavior succinctly.

Even after all tabs have timed out, it's impossible to get data by
opening a new tab -- nothing seems able to access the network
connection.  Networking is broken for other processes too -- for
example, commandline tools like ping don't work either.  The
connection still shows as up in NetworkManager, and sometimes after
5-10 minutes goes back to normal, but not always. "service network
restart" and/or "service NetworkManager restart" and/or "ifdown eth0 ;
ifup eth0" sometimes fixes the problem, but sometimes normal network
activity isn't restored for several minutes and may not act completely
normal again until a reboot.

DNS resolution is the most obviously affected by this.  If I reopen a
browser session and wait a few seconds for networking to hang, I can't
usually ping by domain name but I can (usually) ping by IP address.
However new browser tabs will hang at either name resolution *or*
waiting for data, so I'm not convinced this is just a problem with DNS
resolution.

Also sometimes (but not always) whatever weird state the network stack
on my laptop gets into, things are funky enough to screw up my home
router (two different Motorola Surfboard cable modems/routers), and
the cable modem sometimes has to be reset to get the connection back
to full speed again.  However it is not a router problem in general,
because:

(1) all these symptoms (except this last one where the router somehow
gets screwed up by the laptop's odd behavior) are present whether I
use a wired or wireless connection, and regardless of which network I
am connected to (home or anywhere else, or even when tethered to my
Nexus One), and in multiple countries I have been to in the last 6
months (Portugal, Germany, China).

also
(2) I used to be able to reopen a closed browser session with 40 tabs
and they would all load up just fine.  Then at some point after a
Rawhide update, this broke.

I can't put my finger on exactly when this broke, because I was
dealing with worse breakage for a while since Fedora kernel 2.6.31.5,
as I reported at the following link:

https://bugzilla.redhat.com/show_bug.cgi?id=555213#c1

Synopsis of the above "worse" bug report:
Basically in the very same situation (opening lots of browser tabs),
the machine would lock up hard and the fan would immediately blow at
100% speed.  It took a couple of months of Rawhide updates for this
bug to go away, but by the time this lockup bug was fixed around the
release of Fedora 13 at kernel version 2.6.33, the other network
issues I have described above became evident, and were triggered in
the same way -- thus I believe the two bugs may be related somehow.

My computer has been close to unusable for moderate browsing activity
for about 8 months of the year so far, across nearly two releases of
Fedora (F13 and F14 beta).  I filed the above bug report but it was
never commented on by RedHat engineers.  I figured the bug was
probably visible enough that somebody else should notice it and I just
kept hoping the next update would contain a fix, but not yet.  I
emailed one of the Red Hat kernel engineers and he suggested I ask
upstream.

Please advise me as to how to debug this problem further.  (I haven't
seen anything that looks suspicious in
dmesg output or /var/log/messages, to start with.)

Thank you,
Luke Hutchison

^ permalink raw reply

* alloc_netdev_mq() and multiqueues
From: Kevin Wilson @ 2010-11-09 18:33 UTC (permalink / raw)
  To: netdev

Hello,
I have a short question about multiqueues  and I will appreciate if
somebody can answer shortly in 2-3 sentences.
When talking about multiqueues  I refer for example, to
http://nfws.edenwall.com/nfws_userday/David-Miller_Linux-Multiqueue-Networking.pdf,
alloc_netdev_mq() and friends.

1) Does an ordinary network driver code can be adjusted to use
multiqueues ? or do we need some
special hardware feature ?
2) How can I know if a certain device support multiqueus?\

3) Can anybody name some network cards which support  multiqueues?

Regards,
Kevin

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox