Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH net-next-2.6 v2] can: Topcliff: PCH_CAN driver: Fix build warnings
From: Marc Kleine-Budde @ 2010-11-09 14:47 UTC (permalink / raw)
  To: Tomoya MORINAGA
  Cc: andrew.chih.howe.khor-ral2JQCrhuEAvxtiuMwx3w, Masayuki Ohtake,
	Samuel Ortiz, margie.foster-ral2JQCrhuEAvxtiuMwx3w,
	Oliver Hartkopp, LKML, socketcan-core-0fE9KPoRgkgATYTw5x5z8w,
	yong.y.wang-ral2JQCrhuEAvxtiuMwx3w,
	kok.howg.ewe-ral2JQCrhuEAvxtiuMwx3w, Wolfgang Grandegger,
	netdev-u79uwXL29TY76Z2rM5mHXA, joel.clark-ral2JQCrhuEAvxtiuMwx3w,
	David S. Miller, Christian Pellegrin,
	qi.wang-ral2JQCrhuEAvxtiuMwx3w
In-Reply-To: <00fd01cb8009$5efc5fd0$66f8800a-a06+6cuVnkTSQfdrb5gaxUEOCMrvLtNR@public.gmane.org>


[-- Attachment #1.1: Type: text/plain, Size: 2312 bytes --]

On 11/09/2010 01:26 PM, Tomoya MORINAGA wrote:
>> It's just a question if the driver for an Intel Chipset should/could ignore
>> the endian problematic.
>>
>> If this is intended the driver should only appear in Kconfig depending on X86
>> or little endian systems.
> 
> This driver is for only x86 processor.
> I have no intention to support processor except x86.

Fair enough, if you do the endianess conversion correct, either with my
proposed cpu_to_be16 or with the simpler and probably unnoticeable
slower version from Oliver. With Olivers version it's a bit harder to
build a loop that just copies the number of bytes specified in dlc.

>> Besides this remark, the struct pch_can_regs could also be defined in a way
>> that every single CAN payload data byte could be accessed directly to allow
>> e.g. this code in pch_can_rx_normal():
>>
>> cf->data[0] = ioread8(&priv->regs->if1_data0);
>> cf->data[1] = ioread8(&priv->regs->if1_data1);
>> cf->data[2] = ioread8(&priv->regs->if1_data2);
>> (..)
>> cf->data[6] = ioread8(&priv->regs->if1_data6);
>> cf->data[7] = ioread8(&priv->regs->if1_data7);
>>
>> This is easy to understand and additionally endian aware.
> 
> I agree. Thease codes are very simple.
> I will modify like above.
> 
>>
>> In opposite to this:
>>
>> + if (cf->can_dlc > 2) {
>> + u32 data1 = *((u16 *)&cf->data[2]);
>> + iowrite32(data1, &priv->regs->if2_dataa2);
>> + }
>>
>> Which puts a little? endian u16 into the big endian register layout.
>> Sorry i'm just getting curious on this register access implementation.
> 
> I think cf->data is like below
> MSB----------LSB
> D3-D2-D1-D0
> D7-D6-D5-D4

No, cf->data is an array of 8 u8, so it looks this way:

D0 D1 D2 D3  D4 D5 D6 D7

> *((u16 *)&cf->data[2]) means "D3-D2".

No, it's D2 D3. This is why you need the endianess conversion.

> This order coprresponds to register order.
> data register is like below
> MSB----------LSB
> **-**-D3-D2
> 
> ** means reserved area.

cheers, Marc


-- 
Pengutronix e.K.                  | Marc Kleine-Budde           |
Industrial Linux Solutions        | Phone: +49-231-2826-924     |
Vertretung West/Dortmund          | Fax:   +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686  | http://www.pengutronix.de   |


[-- Attachment #1.2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 262 bytes --]

[-- Attachment #2: Type: text/plain, Size: 188 bytes --]

_______________________________________________
Socketcan-core mailing list
Socketcan-core-0fE9KPoRgkgATYTw5x5z8w@public.gmane.org
https://lists.berlios.de/mailman/listinfo/socketcan-core

^ permalink raw reply

* Re: Netlink limitations
From: Thomas Graf @ 2010-11-09 14:49 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: Jan Engelhardt, David S. Miller, pablo, netdev
In-Reply-To: <4CD91406.4000603@trash.net>

On Tue, Nov 09, 2010 at 10:27:34AM +0100, Patrick McHardy wrote:
> Am 08.11.2010 16:16, schrieb Thomas Graf:
> > On Sun, Nov 07, 2010 at 06:17:43PM +0100, Patrick McHardy wrote:
> >> On 07.11.2010 17:44, Jan Engelhardt wrote:
> >>> ...
> > 
> >> That's somewhat similar to the nlattr32 idea, but a length of 0 makes
> >> more sense since that's currently not used. In that case the length
> >> would be read from a second length field which has 32 bits.
> > 
> > I agree. This makes perfect sense. I was just thinking wheter we want
> > to encode more information into the attribute header if we extend it
> > anyways.
> 
> What kind of additional information are you thinking about?

We have tried to come up with ways of forwarding netlink messages to
other machines several times. It always failed due to the fact that
protocols encode attributes/data differently without having the
ability to specify the encoding. E.g. we have nfnetlink encoding
everything in network byte order, most others encode it in host byte
order. Therefore all parsers must be aware of all protocols if they
want to forward/do something with the data. A flag would help there.

I haven't given up on the idea of self describing netlink protocols
yet. For example we could encode the attribute type
(i8|u16|u32|u16|string) in additional to the existing nested attribute
flag.

Given this additional meta data to each attribute and given we limit
ourselves to only using strict netlink attribtes going forward, i.e.
we no longer encode whole structs in attributes we'd be given netlink
protocols which can be forwarded, saved, audited while still being
fairly efficient.

^ permalink raw reply

* Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
From: Krishna Kumar2 @ 2010-11-09 15:28 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: anthony, arnd, avi, davem, eric.dumazet, kvm, netdev, rusty
In-Reply-To: <20101109132239.GF22705@redhat.com>

"Michael S. Tsirkin" <mst@redhat.com> wrote on 11/09/2010 06:52:39 PM:

> > > Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
> > >
> > > On Mon, Oct 25, 2010 at 09:20:38PM +0530, Krishna Kumar2 wrote:
> > > > > Krishna Kumar2/India/IBM@IBMIN wrote on 10/20/2010 02:24:52 PM:
> > > >
> > > > Any feedback, comments, objections, issues or bugs about the
> > > > patches? Please let me know if something needs to be done.
> > > >
> > > > Some more test results:
> > > > _____________________________________________________
> > > >          Host->Guest BW (numtxqs=2)
> > > > #       BW%     CPU%    RCPU%   SD%     RSD%
> > > > _____________________________________________________
> > >
> > > I think we discussed the need for external to guest testing
> > > over 10G. For large messages we should not see any change
> > > but you should be able to get better numbers for small messages
> > > assuming a MQ NIC card.
> >
> > I had to make a few changes to qemu (and a minor change in macvtap
> > driver) to get multiple TXQ support using macvtap working. The NIC
> > is a ixgbe card.
> >
> >
__________________________________________________________________________
> >             Org vs New (I/O: 512 bytes, #numtxqs=2, #vhosts=3)
> > #      BW1     BW2 (%)       SD1    SD2 (%)        RSD1    RSD2 (%)
> >
__________________________________________________________________________
> > 1      14367   13142 (-8.5)  56     62 (10.7)      8        8 (0)
> > 2      3652    3855 (5.5)    37     35 (-5.4)      7        6 (-14.2)
> > 4      12529   12059 (-3.7)  65     77 (18.4)      35       35 (0)
> > 8      13912   14668 (5.4)   288    332 (15.2)     175      184 (5.1)
> > 16     13433   14455 (7.6)   1218   1321 (8.4)     920      943 (2.5)
> > 24     12750   13477 (5.7)   2876   2985 (3.7)     2514     2348 (-6.6)
> > 32     11729   12632 (7.6)   5299   5332 (.6)      4934     4497 (-8.8)
> > 40     11061   11923 (7.7)   8482   8364 (-1.3)    8374     7495
(-10.4)
> > 48     10624   11267 (6.0)   12329  12258 (-.5)    12762    11538
(-9.5)
> > 64     10524   10596 (.6)    21689  22859 (5.3)    23626    22403
(-5.1)
> > 80     9856    10284 (4.3)   35769  36313 (1.5)    39932    36419
(-8.7)
> > 96     9691    10075 (3.9)   52357  52259 (-.1)    58676    53463
(-8.8)
> > 128    9351    9794 (4.7)    114707 94275 (-17.8)  114050   97337
(-14.6)
> >
__________________________________________________________________________
> > Avg:      BW: (3.3)      SD: (-7.3)      RSD: (-11.0)
> >
> >
__________________________________________________________________________
> >             Org vs New (I/O: 1K, #numtxqs=8, #vhosts=5)
> > #      BW1      BW2 (%)       SD1   SD2 (%)        RSD1   RSD2 (%)
> >
__________________________________________________________________________
> > 1      16509    15985 (-3.1)  45    47 (4.4)       7       7 (0)
> > 2      6963     4499 (-35.3)  17    51 (200.0)     7       7 (0)
> > 4      12932    11080 (-14.3) 49    74 (51.0)      35      35 (0)
> > 8      13878    14095 (1.5)   223   292 (30.9)     175     181 (3.4)
> > 16     13440    13698 (1.9)   980   1131 (15.4)    926     942 (1.7)
> > 24     12680    12927 (1.9)   2387  2463 (3.1)     2526    2342 (-7.2)
> > 32     11714    12261 (4.6)   4506  4486 (-.4)     4941    4463 (-9.6)
> > 40     11059    11651 (5.3)   7244  7081 (-2.2)    8349    7437 (-10.9)
> > 48     10580    11095 (4.8)   10811 10500 (-2.8)   12809   11403
(-10.9)
> > 64     10569    10566 (0)     19194 19270 (.3)     23648   21717 (-8.1)
> > 80     9827     10753 (9.4)   31668 29425 (-7.0)   39991   33824
(-15.4)
> > 96     10043    10150 (1.0)   45352 44227 (-2.4)   57766   51131
(-11.4)
> > 128    9360     9979 (6.6)    92058 79198 (-13.9)  114381  92873
(-18.8)
> >
__________________________________________________________________________
> > Avg:      BW: (-.5)      SD: (-7.5)      RSD: (-14.7)
> >
> > Is there anything else you would like me to test/change, or shall
> > I submit the next version (with the above macvtap changes)?
> >
> > Thanks,
> >
> > - KK
>
> Something strange here, right?
> 1. You are consistently getting >10G/s here, and even with a single
stream?

Sorry, I should have mentioned this though I had stated in my
earlier mails. Each test result has two iterations, each of 60
seconds, except when #netperfs is 1 for which I do 10 iteration
(sum across 10 iterations).  I started doing many more iterations
for 1 netperf after finding the issue earlier with single stream.
So the BW is only 4.5-7 Gbps.

> 2. With 2 streams, is where we get < 10G/s originally. Instead of
>    doubling that we get a marginal improvement with 2 queues and
>    about 30% worse with 1 queue.

(doubling happens consistently for guest -> host, but never for
remote host) I tried 512/txqs=2 and 1024/txqs=8 to get a varied
testing scenario. In first case, there is a slight improvement in
BW and good reduction in SD. In the second case, only SD improves
(though BW drops for 2 stream for some reason).  In both cases,
BW and SD improves as the number of sessions increase.

> Is your card MQ?

Yes, the card is MQ. ixgbe 10g card.

Thanks,

- KK


^ permalink raw reply

* Re: [PATCH] virtio_net: Fix queue full check
From: Krishna Kumar2 @ 2010-11-09 15:30 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: davem, netdev, Rusty Russell, yvugenfi
In-Reply-To: <20101109131555.GE22705@redhat.com>

"Michael S. Tsirkin" <mst@redhat.com> wrote on 11/09/2010 06:45:55 PM:

> Re: [PATCH] virtio_net: Fix queue full check
>
> On Tue, Nov 09, 2010 at 09:56:03AM +0530, Krishna Kumar2 wrote:
> > Rusty Russell <rusty@rustcorp.com.au> wrote on 11/08/2010 04:38:47 AM:
> >
> > > Re: [PATCH] virtio_net: Fix queue full check
> > >
> > > On Thu, 4 Nov 2010 10:54:24 pm Michael S. Tsirkin wrote:
> > > > I thought about this some more.  I think the original
> > > > code is actually correct in returning ENOSPC: indirect
> > > > buffers are nice, but it's a mistake
> > > > to rely on them as a memory allocation might fail.
> > > >
> > > > And if you look at virtio-net, it is dropping packets
> > > > under memory pressure which is not really a happy outcome:
> > > > the packet will get freed, reallocated and we get another one,
> > > > adding pressure on the allocator instead of releasing it
> > > > until we free up some buffers.
> > > >
> > > > So I now think we should calculate the capacity
> > > > assuming non-indirect entries, and if we manage to
> > > > use indirect, all the better.
> > >
> > > I've long said it's a weakness in the network stack that it insists
> > > drivers stop the tx queue before they *might* run out of room,
leading to
> > > worst-case assumptions and underutilization of the tx ring.
> > >
> > > However, I lost that debate, and so your patch is the way it's
supposed
> > to
> > > work.  The other main indirect user (block) doesn't care as its queue
> > > allows for post-attempt blocking.
> > >
> > > I enhanced your commentry a little:
> > >
> > > Subject: virtio: return correct capacity to users
> > > Date: Thu, 4 Nov 2010 14:24:24 +0200
> > > From: "Michael S. Tsirkin" <mst@redhat.com>
> > >
> > > We can't rely on indirect buffers for capacity
> > > calculations because they need a memory allocation
> > > which might fail.  In particular, virtio_net can get
> > > into this situation under stress, and it drops packets
> > > and performs badly.
> > >
> > > So return the number of buffers we can guarantee users.
> > >
> > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
> > > Reported-By: Krishna Kumar2 <krkumar2@in.ibm.com>
> >
> > I have tested this patch for 3-4 hours but so far I have not got the tx
> > full
> > error. I am not sure if "Tested-By" applies to this situation, but just
in
> > case:
> >
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
> > Reported-By: Krishna Kumar2 <krkumar2@in.ibm.com>
> > Tested-By: Krishna Kumar2 <krkumar2@in.ibm.com>
> >
> > I think both this patch and the original patch I submitted
> > are needed? That patch removes ENOMEM check and the increment
> > of dev->stats.tx_fifo_errors, and reports "memory failure".
> >
> > Thanks,
> >
> > - KK
>
> So I think your patch on top of this one would be wrong:
> we actually make sure out of memory does not affect TX path
> at all, so any error would be unexpected.
>
> Incrementing tx fifo errors is probably also helpful for debugging.
>
> If you like, we could kill the special handling for ENOMEM.
> Not sure whether it matters.

Since that is dead code, we could remove it (and the fifo error
should disappear too - tx_dropped should be the only counter to
be incremented?). Sorry if I misunderstood something.

Thanks,

- KK


^ permalink raw reply

* Re: [PATCH] virtio_net: Fix queue full check
From: Michael S. Tsirkin @ 2010-11-09 15:30 UTC (permalink / raw)
  To: Krishna Kumar2; +Cc: davem, netdev, Rusty Russell, yvugenfi
In-Reply-To: <OF5BF09BF3.7DE39268-ON652577D6.004B4830-652577D6.0054E713@in.ibm.com>

On Tue, Nov 09, 2010 at 09:00:58PM +0530, Krishna Kumar2 wrote:
> "Michael S. Tsirkin" <mst@redhat.com> wrote on 11/09/2010 06:45:55 PM:
> 
> > Re: [PATCH] virtio_net: Fix queue full check
> >
> > On Tue, Nov 09, 2010 at 09:56:03AM +0530, Krishna Kumar2 wrote:
> > > Rusty Russell <rusty@rustcorp.com.au> wrote on 11/08/2010 04:38:47 AM:
> > >
> > > > Re: [PATCH] virtio_net: Fix queue full check
> > > >
> > > > On Thu, 4 Nov 2010 10:54:24 pm Michael S. Tsirkin wrote:
> > > > > I thought about this some more.  I think the original
> > > > > code is actually correct in returning ENOSPC: indirect
> > > > > buffers are nice, but it's a mistake
> > > > > to rely on them as a memory allocation might fail.
> > > > >
> > > > > And if you look at virtio-net, it is dropping packets
> > > > > under memory pressure which is not really a happy outcome:
> > > > > the packet will get freed, reallocated and we get another one,
> > > > > adding pressure on the allocator instead of releasing it
> > > > > until we free up some buffers.
> > > > >
> > > > > So I now think we should calculate the capacity
> > > > > assuming non-indirect entries, and if we manage to
> > > > > use indirect, all the better.
> > > >
> > > > I've long said it's a weakness in the network stack that it insists
> > > > drivers stop the tx queue before they *might* run out of room,
> leading to
> > > > worst-case assumptions and underutilization of the tx ring.
> > > >
> > > > However, I lost that debate, and so your patch is the way it's
> supposed
> > > to
> > > > work.  The other main indirect user (block) doesn't care as its queue
> > > > allows for post-attempt blocking.
> > > >
> > > > I enhanced your commentry a little:
> > > >
> > > > Subject: virtio: return correct capacity to users
> > > > Date: Thu, 4 Nov 2010 14:24:24 +0200
> > > > From: "Michael S. Tsirkin" <mst@redhat.com>
> > > >
> > > > We can't rely on indirect buffers for capacity
> > > > calculations because they need a memory allocation
> > > > which might fail.  In particular, virtio_net can get
> > > > into this situation under stress, and it drops packets
> > > > and performs badly.
> > > >
> > > > So return the number of buffers we can guarantee users.
> > > >
> > > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > > Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
> > > > Reported-By: Krishna Kumar2 <krkumar2@in.ibm.com>
> > >
> > > I have tested this patch for 3-4 hours but so far I have not got the tx
> > > full
> > > error. I am not sure if "Tested-By" applies to this situation, but just
> in
> > > case:
> > >
> > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
> > > Reported-By: Krishna Kumar2 <krkumar2@in.ibm.com>
> > > Tested-By: Krishna Kumar2 <krkumar2@in.ibm.com>
> > >
> > > I think both this patch and the original patch I submitted
> > > are needed? That patch removes ENOMEM check and the increment
> > > of dev->stats.tx_fifo_errors, and reports "memory failure".
> > >
> > > Thanks,
> > >
> > > - KK
> >
> > So I think your patch on top of this one would be wrong:
> > we actually make sure out of memory does not affect TX path
> > at all, so any error would be unexpected.
> >
> > Incrementing tx fifo errors is probably also helpful for debugging.
> >
> > If you like, we could kill the special handling for ENOMEM.
> > Not sure whether it matters.
> 
> Since that is dead code, we could remove it (and the fifo error
> should disappear too - tx_dropped should be the only counter to
> be incremented?). Sorry if I misunderstood something.
> 
> Thanks,
> 
> - KK

It's just a sanity check. The capacity checking is tricky enough
that I'm happier with some way to detect overflow both from ifconfig
and dmesg.

I don't really care which counter gets incremented really,
since this is some TX bug fifo error seems appropriate
but I don't really care much.


-- 
MST

^ permalink raw reply

* Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
From: Michael S. Tsirkin @ 2010-11-09 15:33 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: anthony, arnd, avi, davem, eric.dumazet, kvm, netdev, rusty
In-Reply-To: <OF24E08752.2087FFA4-ON652577D6.00532DF1-652577D6.0054B291@in.ibm.com>

On Tue, Nov 09, 2010 at 08:58:44PM +0530, Krishna Kumar2 wrote:
> "Michael S. Tsirkin" <mst@redhat.com> wrote on 11/09/2010 06:52:39 PM:
> 
> > > > Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
> > > >
> > > > On Mon, Oct 25, 2010 at 09:20:38PM +0530, Krishna Kumar2 wrote:
> > > > > > Krishna Kumar2/India/IBM@IBMIN wrote on 10/20/2010 02:24:52 PM:
> > > > >
> > > > > Any feedback, comments, objections, issues or bugs about the
> > > > > patches? Please let me know if something needs to be done.
> > > > >
> > > > > Some more test results:
> > > > > _____________________________________________________
> > > > >          Host->Guest BW (numtxqs=2)
> > > > > #       BW%     CPU%    RCPU%   SD%     RSD%
> > > > > _____________________________________________________
> > > >
> > > > I think we discussed the need for external to guest testing
> > > > over 10G. For large messages we should not see any change
> > > > but you should be able to get better numbers for small messages
> > > > assuming a MQ NIC card.
> > >
> > > I had to make a few changes to qemu (and a minor change in macvtap
> > > driver) to get multiple TXQ support using macvtap working. The NIC
> > > is a ixgbe card.
> > >
> > >
> __________________________________________________________________________
> > >             Org vs New (I/O: 512 bytes, #numtxqs=2, #vhosts=3)
> > > #      BW1     BW2 (%)       SD1    SD2 (%)        RSD1    RSD2 (%)
> > >
> __________________________________________________________________________
> > > 1      14367   13142 (-8.5)  56     62 (10.7)      8        8 (0)
> > > 2      3652    3855 (5.5)    37     35 (-5.4)      7        6 (-14.2)
> > > 4      12529   12059 (-3.7)  65     77 (18.4)      35       35 (0)
> > > 8      13912   14668 (5.4)   288    332 (15.2)     175      184 (5.1)
> > > 16     13433   14455 (7.6)   1218   1321 (8.4)     920      943 (2.5)
> > > 24     12750   13477 (5.7)   2876   2985 (3.7)     2514     2348 (-6.6)
> > > 32     11729   12632 (7.6)   5299   5332 (.6)      4934     4497 (-8.8)
> > > 40     11061   11923 (7.7)   8482   8364 (-1.3)    8374     7495
> (-10.4)
> > > 48     10624   11267 (6.0)   12329  12258 (-.5)    12762    11538
> (-9.5)
> > > 64     10524   10596 (.6)    21689  22859 (5.3)    23626    22403
> (-5.1)
> > > 80     9856    10284 (4.3)   35769  36313 (1.5)    39932    36419
> (-8.7)
> > > 96     9691    10075 (3.9)   52357  52259 (-.1)    58676    53463
> (-8.8)
> > > 128    9351    9794 (4.7)    114707 94275 (-17.8)  114050   97337
> (-14.6)
> > >
> __________________________________________________________________________
> > > Avg:      BW: (3.3)      SD: (-7.3)      RSD: (-11.0)
> > >
> > >
> __________________________________________________________________________
> > >             Org vs New (I/O: 1K, #numtxqs=8, #vhosts=5)
> > > #      BW1      BW2 (%)       SD1   SD2 (%)        RSD1   RSD2 (%)
> > >
> __________________________________________________________________________
> > > 1      16509    15985 (-3.1)  45    47 (4.4)       7       7 (0)
> > > 2      6963     4499 (-35.3)  17    51 (200.0)     7       7 (0)
> > > 4      12932    11080 (-14.3) 49    74 (51.0)      35      35 (0)
> > > 8      13878    14095 (1.5)   223   292 (30.9)     175     181 (3.4)
> > > 16     13440    13698 (1.9)   980   1131 (15.4)    926     942 (1.7)
> > > 24     12680    12927 (1.9)   2387  2463 (3.1)     2526    2342 (-7.2)
> > > 32     11714    12261 (4.6)   4506  4486 (-.4)     4941    4463 (-9.6)
> > > 40     11059    11651 (5.3)   7244  7081 (-2.2)    8349    7437 (-10.9)
> > > 48     10580    11095 (4.8)   10811 10500 (-2.8)   12809   11403
> (-10.9)
> > > 64     10569    10566 (0)     19194 19270 (.3)     23648   21717 (-8.1)
> > > 80     9827     10753 (9.4)   31668 29425 (-7.0)   39991   33824
> (-15.4)
> > > 96     10043    10150 (1.0)   45352 44227 (-2.4)   57766   51131
> (-11.4)
> > > 128    9360     9979 (6.6)    92058 79198 (-13.9)  114381  92873
> (-18.8)
> > >
> __________________________________________________________________________
> > > Avg:      BW: (-.5)      SD: (-7.5)      RSD: (-14.7)
> > >
> > > Is there anything else you would like me to test/change, or shall
> > > I submit the next version (with the above macvtap changes)?
> > >
> > > Thanks,
> > >
> > > - KK
> >
> > Something strange here, right?
> > 1. You are consistently getting >10G/s here, and even with a single
> stream?
> 
> Sorry, I should have mentioned this though I had stated in my
> earlier mails. Each test result has two iterations, each of 60
> seconds, except when #netperfs is 1 for which I do 10 iteration
> (sum across 10 iterations).

So need to divide the number by 10?

>  I started doing many more iterations
> for 1 netperf after finding the issue earlier with single stream.
> So the BW is only 4.5-7 Gbps.
> 
> > 2. With 2 streams, is where we get < 10G/s originally. Instead of
> >    doubling that we get a marginal improvement with 2 queues and
> >    about 30% worse with 1 queue.
> 
> (doubling happens consistently for guest -> host, but never for
> remote host) I tried 512/txqs=2 and 1024/txqs=8 to get a varied
> testing scenario. In first case, there is a slight improvement in
> BW and good reduction in SD. In the second case, only SD improves
> (though BW drops for 2 stream for some reason).  In both cases,
> BW and SD improves as the number of sessions increase.

I guess this is another indication that something's wrong.
We are quite far from line rate, the fact BW does not scale
means there's some contention in the code.

> > Is your card MQ?
> 
> Yes, the card is MQ. ixgbe 10g card.
> 
> Thanks,
> 
> - KK

^ permalink raw reply

* Re: ip_summed setting for TCP pure-ACK packets
From: Ben Hutchings @ 2010-11-09 16:24 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, sf-linux-drivers, netdev
In-Reply-To: <1289235366.15376.3.camel@edumazet-laptop>

On Mon, 2010-11-08 at 17:56 +0100, Eric Dumazet wrote: 
> Le lundi 08 novembre 2010 à 16:35 +0000, Ben Hutchings a écrit :
> > As we discussed at LPC:
> > 
> > Current controllers handled by the sfc driver have a per-queue (rather
> > than per-packet) option for checksum generation.  Currently pure-ACK
> > packets sent by TCP have ip_summed == CHECKSUM_NONE and we must send
> > them on hardware queues with checksum generation disabled.  To support
> > this, we allocate 2 hardware queues per core TX queue.
> > 
> > To reduce the risk of reordering (and possibly the number of hardware TX
> > queues required), it would be helpful for TCP to set ip_summed ==
> > CHECKSUM_PARTIAL on pure-ACK packets when the output device is known to
> > support checksum generation.
> > 
> > Ben.
> > 
> 
> Hmm
> 
> Do you mean commit 2e8e18ef52e7dd1af0a3bd1 is not enough ?

It might well be... I must admit I hadn't thought to check whether this
issue had gone away.

Yes, that does the trick.  Sorry for wasting people's time on this.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.



^ permalink raw reply

* Re: [PATCH] inet: fix ip_mc_drop_socket()
From: David Miller @ 2010-11-09 16:26 UTC (permalink / raw)
  To: eric.dumazet
  Cc: markus, paulmck, miles.lane, ilpo.jarvinen, linux-kernel, lenb,
	netdev
In-Reply-To: <1289250954.2790.11.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 08 Nov 2010 22:15:54 +0100

> Hmm, I believe I found the bug.
> 
> Thanks guys !
> 
> [PATCH] inet: fix ip_mc_drop_socket()
> 
> commit 8723e1b4ad9be4444 (inet: RCU changes in inetdev_by_index())
> forgot one call site in ip_mc_drop_socket()
> 
> We should not decrease idev refcount after inetdev_by_index() call,
> since refcount is not increased anymore.
> 
> Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
> Reported-by: Miles Lane <miles.lane@gmail.com>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied, thanks a lot!

^ permalink raw reply

* Re: [PATCH 1/2] r8169: revert "Handle rxfifo errors on 8168 chips"
From: David Miller @ 2010-11-09 16:27 UTC (permalink / raw)
  To: romieu; +Cc: netdev, a.radke, mjg, daniel.blueman
In-Reply-To: <20101108232305.GA13720@electric-eye.fr.zoreil.com>

From: Francois Romieu <romieu@fr.zoreil.com>
Date: Tue, 9 Nov 2010 00:23:05 +0100

> The original patch helps under obscure conditions (no pun) but
> some 8168 do not like it. The change needs to be tightened with
> a specific 8168 version.
> 
> This reverts commit 801e147cde02f04b5c2f42764cd43a89fc7400a2.
> 
> Regression at https://bugzilla.kernel.org/show_bug.cgi?id=20882
> 
> Signed-off-by: Francois Romieu <romieu@fr.zoreil.com>
> Tested-by: Andreas Radke <a.radke@arcor.de>
> Cc: Matthew Garrett <mjg@redhat.com>
> Cc: Daniel J Blueman <daniel.blueman@gmail.com>

Applied.

^ permalink raw reply

* Re: [PATCH 2/2] r8169: fix sleeping while holding spinlock.
From: David Miller @ 2010-11-09 16:27 UTC (permalink / raw)
  To: romieu; +Cc: netdev, daniel.blueman, rjw
In-Reply-To: <20101108232358.GB13720@electric-eye.fr.zoreil.com>

From: Francois Romieu <romieu@fr.zoreil.com>
Date: Tue, 9 Nov 2010 00:23:58 +0100

> As device_set_wakeup_enable can now sleep, move the call to outside
> the critical section.
> 
> Signed-off-by: Daniel J Blueman <daniel.blueman@gmail.com>
> Acked-by: Rafael J. Wysocki <rjw@sisk.pl>

Applied.

^ permalink raw reply

* Re: ip_summed setting for TCP pure-ACK packets
From: Eric Dumazet @ 2010-11-09 16:28 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: David Miller, sf-linux-drivers, netdev
In-Reply-To: <1289319889.2238.195.camel@achroite.uk.solarflarecom.com>

Le mardi 09 novembre 2010 à 16:24 +0000, Ben Hutchings a écrit :
> On Mon, 2010-11-08 at 17:56 +0100, Eric Dumazet wrote: 
> > Le lundi 08 novembre 2010 à 16:35 +0000, Ben Hutchings a écrit :
> > > As we discussed at LPC:
> > > 
> > > Current controllers handled by the sfc driver have a per-queue (rather
> > > than per-packet) option for checksum generation.  Currently pure-ACK
> > > packets sent by TCP have ip_summed == CHECKSUM_NONE and we must send
> > > them on hardware queues with checksum generation disabled.  To support
> > > this, we allocate 2 hardware queues per core TX queue.
> > > 
> > > To reduce the risk of reordering (and possibly the number of hardware TX
> > > queues required), it would be helpful for TCP to set ip_summed ==
> > > CHECKSUM_PARTIAL on pure-ACK packets when the output device is known to
> > > support checksum generation.
> > > 
> > > Ben.
> > > 
> > 
> > Hmm
> > 
> > Do you mean commit 2e8e18ef52e7dd1af0a3bd1 is not enough ?
> 
> It might well be... I must admit I hadn't thought to check whether this
> issue had gone away.
> 
> Yes, that does the trick.  Sorry for wasting people's time on this.
> 

You're welcome ;)




^ permalink raw reply

* [PATCH] net: replace min with min_t in pktgen_if_write
From: Davidlohr Bueso @ 2010-11-09 16:37 UTC (permalink / raw)
  To: David Miller, Robert Olsson; +Cc: LKML, Network Development

From: Davidlohr Bueso <dave@gnu.org>

This addresses the following compiler (gcc 4.4.5) warning:

  CC [M]  net/core/pktgen.o
net/core/pktgen.c: In function ‘pktgen_if_write’:
net/core/pktgen.c:890: warning: comparison of distinct pointer types lacks a cast

Signed-off-by: Davidlohr Bueso <dave@gnu.org>
---
 net/core/pktgen.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index fbce4b0..1992cd0 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -887,7 +887,7 @@ static ssize_t pktgen_if_write(struct file *file,
 	i += len;
 
 	if (debug) {
-		size_t copy = min(count, 1023);
+		size_t copy = min_t(size_t, count, 1023);
 		char tb[copy + 1];
 		if (copy_from_user(tb, user_buffer, copy))
 			return -EFAULT;
-- 
1.7.1

^ permalink raw reply related

* Re: [PATCH] net: replace min with min_t in pktgen_if_write
From: David Miller @ 2010-11-09 16:40 UTC (permalink / raw)
  To: dave; +Cc: robert.olsson, linux-kernel, netdev
In-Reply-To: <1289320633.2260.12.camel@cowboy>

From: Davidlohr Bueso <dave@gnu.org>
Date: Tue, 09 Nov 2010 13:37:13 -0300

> From: Davidlohr Bueso <dave@gnu.org>
> 
> This addresses the following compiler (gcc 4.4.5) warning:
> 
>   CC [M]  net/core/pktgen.o
> net/core/pktgen.c: In function ‘pktgen_if_write’:
> net/core/pktgen.c:890: warning: comparison of distinct pointer types lacks a cast
> 
> Signed-off-by: Davidlohr Bueso <dave@gnu.org>

Already fixed in net-2.6:

commit 86c2c0a8a4965ae5cbc0ff97ed39a4472e8e9b23
Author: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Date:   Sat Nov 6 20:11:38 2010 +0000

    NET: pktgen - fix compile warning
    
    This should fix the following warning:
    
    net/core/pktgen.c: In function ‘pktgen_if_write’:
    net/core/pktgen.c:890: warning: comparison of distinct pointer types lacks a cast
    
    Signed-off-by: Dmitry Torokhov <dtor@mail.ru>
    Reviewed-by: Nelson Elhage <nelhage@ksplice.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index fbce4b0..1992cd0 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -887,7 +887,7 @@ static ssize_t pktgen_if_write(struct file *file,
 	i += len;
 
 	if (debug) {
-		size_t copy = min(count, 1023);
+		size_t copy = min_t(size_t, count, 1023);
 		char tb[copy + 1];
 		if (copy_from_user(tb, user_buffer, copy))
 			return -EFAULT;

^ permalink raw reply related

* Re: [PATCH] Fix CAN info leak/minor heap overflow
From: David Miller @ 2010-11-09 17:05 UTC (permalink / raw)
  To: socketcan; +Cc: urs, netdev, drosenberg, security, torvalds
In-Reply-To: <4CD8FDB5.6060905@hartkopp.net>

From: Oliver Hartkopp <socketcan@hartkopp.net>
Date: Tue, 09 Nov 2010 08:52:21 +0100

> Once this patch is applied (and the procfs layout is changed anyway), i'd also
> like to send a patch from my backlog that would extend the procfs output for
> can-bcm with an additional drop counter.

I find this kind of discussion extremely disappointing.

All of this stuff you CAN guys do with procfs files and version
strings is completely wrong and bogus.

Once you create a procfs file layout, you're basically stuck and you
can at best only reasonably add new fields at the end, you can't
really change existing fields.

And sysfs would have been a lot more appropriate, you could use
attributes for each value you want to export and then just add new
sysfs attributes when you want to export new values which has very
clear semantics and backwards compatability implications.

^ permalink raw reply

* Re: [PATCH v15 00/17] Provide a zero-copy method on KVM virtio-net.
From: David Miller @ 2010-11-09 17:15 UTC (permalink / raw)
  To: xiaohui.xin; +Cc: netdev, kvm, linux-kernel, mst, mingo, herbert, jdike
In-Reply-To: <1289293402-4791-1-git-send-email-xiaohui.xin@intel.com>

From: xiaohui.xin@intel.com
Date: Tue,  9 Nov 2010 17:03:05 +0800

> We provide an zero-copy method which driver side may get external
> buffers to DMA. Here external means driver don't use kernel space
> to allocate skb buffers. Currently the external buffer can be from
> guest virtio-net driver.

1) Please use kernel-doc style comments to document structure
   members.

2) The idea to key off of skb->dev in skb_release_data() is
   fundamentally flawed since many actions can change skb->dev on you,
   which will end up causing a leak of your external data areas.

^ permalink raw reply

* Re: [PATCH] drivers/net: normalize TX_TIMEOUT
From: David Miller @ 2010-11-09 17:18 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev
In-Reply-To: <1288860575.2659.34.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 04 Nov 2010 09:49:35 +0100

> Some network drivers use old TX_TIMEOUT definitions, assuming HZ=100 of
> old kernels.
> 
> Convert these definitions to include HZ, since HZ can be 1000 these
> days.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied, thanks Eric.

^ permalink raw reply

* Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
From: Krishna Kumar2 @ 2010-11-09 17:24 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: anthony, arnd, avi, davem, eric.dumazet, kvm, netdev, rusty
In-Reply-To: <20101109153325.GB26153@redhat.com>

"Michael S. Tsirkin" <mst@redhat.com> wrote on 11/09/2010 09:03:25 PM:

> > > Something strange here, right?
> > > 1. You are consistently getting >10G/s here, and even with a single
> > stream?
> >
> > Sorry, I should have mentioned this though I had stated in my
> > earlier mails. Each test result has two iterations, each of 60
> > seconds, except when #netperfs is 1 for which I do 10 iteration
> > (sum across 10 iterations).
>
> So need to divide the number by 10?

Yes, that is what I get with 512/1K macvtap I/O size :)

> >  I started doing many more iterations
> > for 1 netperf after finding the issue earlier with single stream.
> > So the BW is only 4.5-7 Gbps.
> >
> > > 2. With 2 streams, is where we get < 10G/s originally. Instead of
> > >    doubling that we get a marginal improvement with 2 queues and
> > >    about 30% worse with 1 queue.
> >
> > (doubling happens consistently for guest -> host, but never for
> > remote host) I tried 512/txqs=2 and 1024/txqs=8 to get a varied
> > testing scenario. In first case, there is a slight improvement in
> > BW and good reduction in SD. In the second case, only SD improves
> > (though BW drops for 2 stream for some reason).  In both cases,
> > BW and SD improves as the number of sessions increase.
>
> I guess this is another indication that something's wrong.

The patch - both virtio-net and vhost-net, doesn't have any
locking/mutex's/ or any synchronization method. Guest -> host
performance improvement of upto 100% shows the patch is not
doing anything wrong.

> We are quite far from line rate, the fact BW does not scale
> means there's some contention in the code.

Attaining line speed with macvtap seems to be a generic issue
and unrelated to my patch specifically. IMHO if there is nothing
wrong in the code (review) and is accepted, it will benefit as
others can also help to find what needs to be implemented in
vhost/macvtap/qemu to get line speed for guest->remote-host.

PS: bare-metal performance for host->remote-host is also
    2.7 Gbps and 2.8 Gbps for 512/1024 for the same card.

Thanks,

- KK


^ permalink raw reply

* Re: [PATCH 36/39] net/ipv4/tcp.c: Update WARN uses
From: David Miller @ 2010-11-09 17:23 UTC (permalink / raw)
  To: joe; +Cc: trivial, kuznet, pekkas, jmorris, yoshfuji, kaber, netdev,
	linux-kernel
In-Reply-To: <2afee732758b93d7464e6bb07a52b3bc67f164a4.1288471898.git.joe@perches.com>

From: Joe Perches <joe@perches.com>
Date: Sat, 30 Oct 2010 14:08:53 -0700

> Coalesce long formats.
> Align arguments.
> Remove KERN_<level>.
> 
> Signed-off-by: Joe Perches <joe@perches.com>

Applied.

^ permalink raw reply

* Re: [PATCH 35/39] net/core/dev.c: Update WARN uses
From: David Miller @ 2010-11-09 17:23 UTC (permalink / raw)
  To: joe; +Cc: trivial, netdev, linux-kernel
In-Reply-To: <2f05cefa4dc86594a2d7335c7978b1ae781bef0e.1288471898.git.joe@perches.com>

From: Joe Perches <joe@perches.com>
Date: Sat, 30 Oct 2010 14:08:52 -0700

> Coalesce long formats.
> Add missing newlines.
> 
> Signed-off-by: Joe Perches <joe@perches.com>

Applied.

^ permalink raw reply

* Re: [PATCH 17/39] drivers/net/usb: Update WARN uses
From: David Miller @ 2010-11-09 17:23 UTC (permalink / raw)
  To: joe; +Cc: trivial, gregkh, linux-usb, netdev, linux-kernel
In-Reply-To: <4ef13c27899febe0cef4f1633321ddedbe97cfcf.1288471898.git.joe@perches.com>

From: Joe Perches <joe@perches.com>
Date: Sat, 30 Oct 2010 14:08:34 -0700

> Add missing newlines.
> 
> Signed-off-by: Joe Perches <joe@perches.com>

Applied.

^ permalink raw reply

* Re: [PATCH 16/39] drivers/net/can: Update WARN uses
From: David Miller @ 2010-11-09 17:23 UTC (permalink / raw)
  To: joe-6d6DIl74uiNBDgjK7y7TUQ
  Cc: socketcan-core-0fE9KPoRgkgATYTw5x5z8w,
	netdev-u79uwXL29TY76Z2rM5mHXA, trivial-DgEjT+Ai2ygdnm+yROfE0A,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, wg-5Yr1BZd7O62+XT7JhA+gdA
In-Reply-To: <2a00e2a377a44e544cec15356cf3794abe47b306.1288471898.git.joe-6d6DIl74uiNBDgjK7y7TUQ@public.gmane.org>

From: Joe Perches <joe-6d6DIl74uiNBDgjK7y7TUQ@public.gmane.org>
Date: Sat, 30 Oct 2010 14:08:33 -0700

> Add missing newlines.
> 
> Signed-off-by: Joe Perches <joe-6d6DIl74uiNBDgjK7y7TUQ@public.gmane.org>

Applied.

^ permalink raw reply

* Re: [PATCH 3/3] net: tipc: fix information leak to userland
From: David Miller @ 2010-11-09 17:26 UTC (permalink / raw)
  To: segooon
  Cc: jon.maloy, netdev, kernel-janitors, linux-kernel, tipc-discussion,
	allan.stephens
In-Reply-To: <1288545032-16481-1-git-send-email-segooon@gmail.com>

From: Vasiliy Kulikov <segooon@gmail.com>
Date: Sun, 31 Oct 2010 20:10:32 +0300

> Structure sockaddr_tipc is copied to userland with padding bytes after
> "id" field in union field "name" unitialized.  It leads to leaking of
> contents of kernel stack memory.  We have to initialize them to zero.
> 
> Signed-off-by: Vasiliy Kulikov <segooon@gmail.com>

Applied.

Patches #1 and #2 were given feedback which I need you to integrate
and submit new patches based upon, thanks.

------------------------------------------------------------------------------
The Next 800 Companies to Lead America's Growth: New Video Whitepaper
David G. Thomson, author of the best-selling book "Blueprint to a 
Billion" shares his insights and actions to help propel your 
business during the next growth cycle. Listen Now!
http://p.sf.net/sfu/SAP-dev2dev

^ permalink raw reply

* [PATCH] Enhance AF_PACKET implementation to not require high order contiguous memory allocation (v2)
From: nhorman @ 2010-11-09 17:46 UTC (permalink / raw)
  To: netdev; +Cc: davem, eric.dumazet, zenczykowski, Neil Horman
In-Reply-To: <1288033566-2091-1-git-send-email-nhorman@tuxdriver.com>

From: Neil Horman <nhorman@tuxdriver.com>

Version 2 of this patch.  Sorry its been awhile, but I've had other issues come
up.  I've re-written this patch taking into account the change notes from the
last round

Change notes:
1) Rewrote the patch to not do all my previous stupid vmaps on single page
arrays.  Instead we just use vmalloc now.

2) Checked into the behavior of tcpdump on high order allocations in the failing
case.  Tcpdump (more specifically I think libpcap) does attempt to minimize the
allocation order on the ring buffer, unfortunately the lowest it will push the
ordrer too given a sufficiently large buffer is order 5, so its still a very
large contiguous allocation, and that says nothing of other malicious
applications that might try to do more.

Summary:
It was shown to me recently that systems under high load were driven very deep
into swap when tcpdump was run.  The reason this happened was because the
AF_PACKET protocol has a SET_RINGBUFFER socket option that allows the user space
application to specify how many entries an AF_PACKET socket will have and how
large each entry will be.  It seems the default setting for tcpdump is to set
the ring buffer to 32 entries of 64 Kb each, which implies 32 order 5
allocation.  Thats difficult under good circumstances, and horrid under memory
pressure.

I thought it would be good to make that a bit more usable.  I was going to do a
simple conversion of the ring buffer from contigous pages to iovecs, but
unfortunately, the metadata which AF_PACKET places in these buffers can easily
span a page boundary, and given that these buffers get mapped into user space,
and the data layout doesn't easily allow for a change to padding between frames
to avoid that, a simple iovec change is just going to break user space ABI
consistency.

So I've done this, I've added a three tiered mechanism to the af_packet set_ring
socket option.  It attempts to allocate memory in the following order:

1) Using __get_free_pages with GFP_NORETRY set, so as to fail quickly without
digging into swap

2) Using vmalloc

3) Using __get_free_pages with GFP_NORETRY clear, causing us to try as hard as
needed to get the memory

The effect is that we don't disturb the system as much when we're under load,
while still being able to conduct tcpdumps effectively.

Tested successfully by me.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
---
 net/packet/af_packet.c |   84 ++++++++++++++++++++++++++++++++++++++---------
 1 files changed, 68 insertions(+), 16 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 3616f27..d1148ac 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -163,8 +163,14 @@ struct packet_mreq_max {
 static int packet_set_ring(struct sock *sk, struct tpacket_req *req,
 		int closing, int tx_ring);
 
+#define PGV_FROM_VMALLOC 1
+struct pgv {
+	char *buffer;
+	unsigned char flags;
+};
+
 struct packet_ring_buffer {
-	char			**pg_vec;
+	struct pgv		*pg_vec;
 	unsigned int		head;
 	unsigned int		frames_per_block;
 	unsigned int		frame_size;
@@ -283,7 +289,8 @@ static void *packet_lookup_frame(struct packet_sock *po,
 	pg_vec_pos = position / rb->frames_per_block;
 	frame_offset = position % rb->frames_per_block;
 
-	h.raw = rb->pg_vec[pg_vec_pos] + (frame_offset * rb->frame_size);
+	h.raw = rb->pg_vec[pg_vec_pos].buffer +
+		(frame_offset * rb->frame_size);
 
 	if (status != __packet_get_status(po, h.raw))
 		return NULL;
@@ -2322,37 +2329,74 @@ static const struct vm_operations_struct packet_mmap_ops = {
 	.close	=	packet_mm_close,
 };
 
-static void free_pg_vec(char **pg_vec, unsigned int order, unsigned int len)
+static void free_pg_vec(struct pgv *pg_vec, unsigned int order,
+			unsigned int len)
 {
 	int i;
 
 	for (i = 0; i < len; i++) {
-		if (likely(pg_vec[i]))
-			free_pages((unsigned long) pg_vec[i], order);
+		if (likely(pg_vec[i].buffer)) {
+			if (pg_vec[i].flags & PGV_FROM_VMALLOC)
+				vfree(pg_vec[i].buffer);
+			else
+				free_pages((unsigned long)pg_vec[i].buffer,
+					   order);
+			pg_vec[i].buffer = NULL;
+		}
 	}
 	kfree(pg_vec);
 }
 
-static inline char *alloc_one_pg_vec_page(unsigned long order)
+static inline char *alloc_one_pg_vec_page(unsigned long order,
+					  unsigned char *flags)
 {
-	gfp_t gfp_flags = GFP_KERNEL | __GFP_COMP | __GFP_ZERO | __GFP_NOWARN;
+	char *buffer = NULL;
+	gfp_t gfp_flags = GFP_KERNEL | __GFP_COMP |
+			  __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY;
+
+	buffer = (char *) __get_free_pages(gfp_flags, order);
+
+	if (buffer)
+		return buffer;
+
+	/*
+	 * __get_free_pages failed, fall back to vmalloc
+	 */
+	*flags |= PGV_FROM_VMALLOC;
+	buffer = vmalloc((1 << order) * PAGE_SIZE);
 
-	return (char *) __get_free_pages(gfp_flags, order);
+	if (buffer)
+		return buffer;
+
+	/*
+	 * vmalloc failed, lets dig into swap here
+	 */
+	*flags = 0;
+	gfp_flags &= ~__GFP_NORETRY;
+	buffer = (char *)__get_free_pages(gfp_flags, order);
+	if (buffer)
+		return buffer;
+
+	/*
+	 * complete and utter failure
+	 */
+	return NULL;
 }
 
-static char **alloc_pg_vec(struct tpacket_req *req, int order)
+static struct pgv *alloc_pg_vec(struct tpacket_req *req, int order)
 {
 	unsigned int block_nr = req->tp_block_nr;
-	char **pg_vec;
+	struct pgv *pg_vec;
 	int i;
 
-	pg_vec = kzalloc(block_nr * sizeof(char *), GFP_KERNEL);
+	pg_vec = kzalloc(block_nr * sizeof(struct pgv), GFP_KERNEL);
 	if (unlikely(!pg_vec))
 		goto out;
 
 	for (i = 0; i < block_nr; i++) {
-		pg_vec[i] = alloc_one_pg_vec_page(order);
-		if (unlikely(!pg_vec[i]))
+		pg_vec[i].buffer = alloc_one_pg_vec_page(order,
+							 &pg_vec[i].flags);
+		if (unlikely(!pg_vec[i].buffer))
 			goto out_free_pgvec;
 	}
 
@@ -2361,6 +2405,7 @@ out:
 
 out_free_pgvec:
 	free_pg_vec(pg_vec, order, block_nr);
+	kfree(pg_vec);
 	pg_vec = NULL;
 	goto out;
 }
@@ -2368,7 +2413,7 @@ out_free_pgvec:
 static int packet_set_ring(struct sock *sk, struct tpacket_req *req,
 		int closing, int tx_ring)
 {
-	char **pg_vec = NULL;
+	struct pgv *pg_vec = NULL;
 	struct packet_sock *po = pkt_sk(sk);
 	int was_running, order = 0;
 	struct packet_ring_buffer *rb;
@@ -2530,15 +2575,22 @@ static int packet_mmap(struct file *file, struct socket *sock,
 			continue;
 
 		for (i = 0; i < rb->pg_vec_len; i++) {
-			struct page *page = virt_to_page(rb->pg_vec[i]);
+			struct page *page;
+			void *kaddr = rb->pg_vec[i].buffer;
 			int pg_num;
 
 			for (pg_num = 0; pg_num < rb->pg_vec_pages;
-					pg_num++, page++) {
+					pg_num++) {
+				if (rb->pg_vec[i].flags & PGV_FROM_VMALLOC)
+					page = vmalloc_to_page(kaddr);
+				else
+					page = virt_to_page(kaddr);
+
 				err = vm_insert_page(vma, start, page);
 				if (unlikely(err))
 					goto out;
 				start += PAGE_SIZE;
+				kaddr += PAGE_SIZE;
 			}
 		}
 	}
-- 
1.7.2.3


^ permalink raw reply related

* Re: [PATCH] Enhance AF_PACKET implementation to not require high order contiguous memory allocation (v2)
From: Eric Dumazet @ 2010-11-09 18:02 UTC (permalink / raw)
  To: nhorman; +Cc: netdev, davem, zenczykowski
In-Reply-To: <1289324799-2256-1-git-send-email-nhorman@tuxdriver.com>

Le mardi 09 novembre 2010 à 12:46 -0500, nhorman@tuxdriver.com a écrit :
> From: Neil Horman <nhorman@tuxdriver.com>
> 
> Version 2 of this patch.  Sorry its been awhile, but I've had other issues come
> up.  I've re-written this patch taking into account the change notes from the
> last round
> 
> Change notes:
> 1) Rewrote the patch to not do all my previous stupid vmaps on single page
> arrays.  Instead we just use vmalloc now.
> 
> 2) Checked into the behavior of tcpdump on high order allocations in the failing
> case.  Tcpdump (more specifically I think libpcap) does attempt to minimize the
> allocation order on the ring buffer, unfortunately the lowest it will push the
> ordrer too given a sufficiently large buffer is order 5, so its still a very
> large contiguous allocation, and that says nothing of other malicious
> applications that might try to do more.
> 
> Summary:
> It was shown to me recently that systems under high load were driven very deep
> into swap when tcpdump was run.  The reason this happened was because the
> AF_PACKET protocol has a SET_RINGBUFFER socket option that allows the user space
> application to specify how many entries an AF_PACKET socket will have and how
> large each entry will be.  It seems the default setting for tcpdump is to set
> the ring buffer to 32 entries of 64 Kb each, which implies 32 order 5
> allocation.  Thats difficult under good circumstances, and horrid under memory
> pressure.
> 
> I thought it would be good to make that a bit more usable.  I was going to do a
> simple conversion of the ring buffer from contigous pages to iovecs, but
> unfortunately, the metadata which AF_PACKET places in these buffers can easily
> span a page boundary, and given that these buffers get mapped into user space,
> and the data layout doesn't easily allow for a change to padding between frames
> to avoid that, a simple iovec change is just going to break user space ABI
> consistency.
> 
> So I've done this, I've added a three tiered mechanism to the af_packet set_ring
> socket option.  It attempts to allocate memory in the following order:
> 
> 1) Using __get_free_pages with GFP_NORETRY set, so as to fail quickly without
> digging into swap
> 
> 2) Using vmalloc
> 
> 3) Using __get_free_pages with GFP_NORETRY clear, causing us to try as hard as
> needed to get the memory
> 
> The effect is that we don't disturb the system as much when we're under load,
> while still being able to conduct tcpdumps effectively.
> 
> Tested successfully by me.
> 
> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> ---
>  net/packet/af_packet.c |   84 ++++++++++++++++++++++++++++++++++++++---------
>  1 files changed, 68 insertions(+), 16 deletions(-)
> 
> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> index 3616f27..d1148ac 100644
> --- a/net/packet/af_packet.c
> +++ b/net/packet/af_packet.c
> @@ -163,8 +163,14 @@ struct packet_mreq_max {
>  static int packet_set_ring(struct sock *sk, struct tpacket_req *req,
>  		int closing, int tx_ring);
>  
> +#define PGV_FROM_VMALLOC 1
> +struct pgv {
> +	char *buffer;
> +	unsigned char flags;
> +};
> +
>  struct packet_ring_buffer {
> -	char			**pg_vec;
> +	struct pgv		*pg_vec;
>  	unsigned int		head;
>  	unsigned int		frames_per_block;
>  	unsigned int		frame_size;
> @@ -283,7 +289,8 @@ static void *packet_lookup_frame(struct packet_sock *po,
>  	pg_vec_pos = position / rb->frames_per_block;
>  	frame_offset = position % rb->frames_per_block;
>  
> -	h.raw = rb->pg_vec[pg_vec_pos] + (frame_offset * rb->frame_size);
> +	h.raw = rb->pg_vec[pg_vec_pos].buffer +
> +		(frame_offset * rb->frame_size);
>  
>  	if (status != __packet_get_status(po, h.raw))
>  		return NULL;
> @@ -2322,37 +2329,74 @@ static const struct vm_operations_struct packet_mmap_ops = {
>  	.close	=	packet_mm_close,
>  };
>  
> -static void free_pg_vec(char **pg_vec, unsigned int order, unsigned int len)
> +static void free_pg_vec(struct pgv *pg_vec, unsigned int order,
> +			unsigned int len)
>  {
>  	int i;
>  
>  	for (i = 0; i < len; i++) {
> -		if (likely(pg_vec[i]))
> -			free_pages((unsigned long) pg_vec[i], order);
> +		if (likely(pg_vec[i].buffer)) {
> +			if (pg_vec[i].flags & PGV_FROM_VMALLOC)
> +				vfree(pg_vec[i].buffer);
> +			else
> +				free_pages((unsigned long)pg_vec[i].buffer,
> +					   order);
> +			pg_vec[i].buffer = NULL;
> +		}
>  	}
>  	kfree(pg_vec);
>  }
>  
> -static inline char *alloc_one_pg_vec_page(unsigned long order)
> +static inline char *alloc_one_pg_vec_page(unsigned long order,
> +					  unsigned char *flags)
>  {
> -	gfp_t gfp_flags = GFP_KERNEL | __GFP_COMP | __GFP_ZERO | __GFP_NOWARN;
> +	char *buffer = NULL;
> +	gfp_t gfp_flags = GFP_KERNEL | __GFP_COMP |
> +			  __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY;
> +
> +	buffer = (char *) __get_free_pages(gfp_flags, order);
> +
> +	if (buffer)
> +		return buffer;
> +
> +	/*
> +	 * __get_free_pages failed, fall back to vmalloc
> +	 */
> +	*flags |= PGV_FROM_VMALLOC;
> +	buffer = vmalloc((1 << order) * PAGE_SIZE);
>  
> -	return (char *) __get_free_pages(gfp_flags, order);
> +	if (buffer)
> +		return buffer;
> +
> +	/*
> +	 * vmalloc failed, lets dig into swap here
> +	 */
> +	*flags = 0;
> +	gfp_flags &= ~__GFP_NORETRY;
> +	buffer = (char *)__get_free_pages(gfp_flags, order);
> +	if (buffer)
> +		return buffer;
> +
> +	/*
> +	 * complete and utter failure
> +	 */
> +	return NULL;
>  }
>  
> -static char **alloc_pg_vec(struct tpacket_req *req, int order)
> +static struct pgv *alloc_pg_vec(struct tpacket_req *req, int order)
>  {
>  	unsigned int block_nr = req->tp_block_nr;
> -	char **pg_vec;
> +	struct pgv *pg_vec;
>  	int i;
>  
> -	pg_vec = kzalloc(block_nr * sizeof(char *), GFP_KERNEL);
> +	pg_vec = kzalloc(block_nr * sizeof(struct pgv), GFP_KERNEL);

While we are at it, we could check block_nr being a sane value here ;)

Nice stuff Neil !



^ permalink raw reply

* Re: radvd and auto-ipv6 address regression from 2.6.31 to 2.6.34+
From: Ben Greear @ 2010-11-09 18:07 UTC (permalink / raw)
  To: Brian Haley; +Cc: Mikael Abrahamsson, NetDev
In-Reply-To: <4CD81D24.1040609@hp.com>

On 11/08/2010 07:54 AM, Brian Haley wrote:
> On 11/06/2010 02:17 AM, Mikael Abrahamsson wrote:
>> I would not be surprised if the IPv6 stack on Linux would gain IPv6
>> addresses using SLAAC from any interface if it sees RAs on it,
>> regardless if these are locally generated or not.
>>
>> I guess any Linux device running radvd should turn off autoconf on those
>> interface that radvd is acting on?
>>
>> net.ipv6.conf.veth0.accept_ra=0 should do the trick?
>
> I believe radvd will turn-on IPv6 forwarding on all the interfaces, at
> least it does on Debian in /etc/init.d/radvd, which essentially disables
> the reception of RA's for address configuration purposes.  I'm curious
> what these values are set to, and if something just got missed.
>
> It might be something specific to veth too, not sure how packets are
> copied/looped-back on these devices from the stack.

Thanks for the hint.  I did not have ipv6 forwarding enabled for the
radvd interfaces.  Once I fixed that, it worked as expected with no
auto-created global IPv6 address.  I have been starting radvd manually,
so I wouldn't benefit from any OS specific start scripts.

Thanks,
Ben

>
> -Brian


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox