Netdev List
* Re: [Xen-devel] [PATCH 5/6] xen-netback: coalesce slots before copying
From: David Vrabel @ 2013-03-26 11:13 UTC (permalink / raw)
  To: Wei Liu
  Cc: Wei Liu, Ian Campbell, konrad.wilk@oracle.com,
	netdev@vger.kernel.org, xen-devel@lists.xen.org,
	annie.li@oracle.com
In-Reply-To: <20130325190911.GC7004@zion.uk.xensource.com>

On 25/03/13 19:09, Wei Liu wrote:
> On Mon, Mar 25, 2013 at 06:29:29PM +0000, David Vrabel wrote:
>>>>
>>>
>>> Are you suggesting moving the default macro value to a header file? It is
>>> just an estimate; I do not know the accurate maximum value,
>>> so I think making it part of the protocol is a bad idea.
>>
>> How is the author of a new frontend supposed to know how many slots they
>> can use per packet if it is not precisely defined?
>>
> 
> A new frontend should use the scheme you mentioned below to get the
> maximum value. For old frontends that cannot be fixed, administrator can
> configure max_skb_slots to accommodate their need.

I'm happy for the threshold for fatal errors to be configurable via a
module parameter.

>>> Do you have a handle on the maximum value?
>>
>> Backends should provide the value to the frontend via a xenstore key
>> (e.g., max-slots-per-frame).  This value should be at least 18 (the
>> historical value of MAX_SKB_FRAGS).
>>
>> The frontend may use up to this specified value or 17 if the
>> max-slots-per-frame key is missing.
>>
>> Supporting at least 18 in the backend is required for existing
>> frontends.  Limiting frontends to 17 allows them to work with all
>> backends (including recent Linux versions that only supported 17).
>>
>> It's not clear why 19 or 20 were suggested as possible values.  I
>> checked back to 2.6.18 and MAX_SKB_FRAGS there is (65536/PAGE_SIZE + 2)
> 
> Because the check is >= MAX_SKB_FRAGS originally and James Harper told
> me that "Windows stops counting on 20".
> 
>> == 18.
>>
>> Separately, it may be sensible for the backend to drop packets with more
>> frags than max-slots-per-frame up to some threshold where anything more
>> is considered malicious (i.e., 1 - 18 slots is a valid packet, 19-20 are
>> dropped and 21 or more is a fatal error).
>>
> 
> Why drop the packet when we are able to process it? Frontend cannot know
> it has crossed the line anyway.

Because it's a change to the protocol and we do not want to do this for
a regression fix.

As a separate fix we can consider increasing the number of slots
per-packet once there is a mechanism to report this to the front end.
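
As a minimal sketch of the limits discussed above (assuming 4 KiB pages, so
65536/PAGE_SIZE + 2 == 18), the frontend fallback and the backend thresholds
could look roughly like this. All names here (frontend_slot_limit,
classify_slots, enum slot_verdict) are hypothetical and do not exist in
netback or netfront; 18 and 20 are only the example values from this thread.

```c
#include <stdbool.h>

/* Hypothetical names: none of these identifiers come from netback/netfront. */
#define LEGACY_FRONTEND_SLOTS 17                 /* safe limit when the key is missing */
#define MIN_BACKEND_SLOTS     (65536 / 4096 + 2) /* 18, assuming 4 KiB pages */

enum slot_verdict { SLOT_OK, SLOT_DROP, SLOT_FATAL };

/* Frontend side: how many slots per packet may be used. */
static unsigned int frontend_slot_limit(bool key_present,
					unsigned int advertised)
{
	return key_present ? advertised : LEGACY_FRONTEND_SLOTS;
}

/* Backend side: accept, drop, or treat the frontend as malicious. */
static enum slot_verdict classify_slots(unsigned int slots,
					unsigned int max_slots,
					unsigned int fatal_threshold)
{
	if (slots <= max_slots)
		return SLOT_OK;
	if (slots <= fatal_threshold)
		return SLOT_DROP;
	return SLOT_FATAL;
}
```

With max_slots = 18 and fatal_threshold = 20, classify_slots() accepts up to
18 slots, drops 19-20 and reports 21 or more as fatal.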

David


* Re: [PATCH net-next] netlink: remove duplicated NLMSG_ALIGN
From: Thomas Graf @ 2013-03-26 11:13 UTC (permalink / raw)
  To: Hong Zhiguo; +Cc: netdev, davem, stephen, zhiguo.hong
In-Reply-To: <1364274245-20689-1-git-send-email-honkiko@gmail.com>

On 03/26/13 at 01:04pm, Hong Zhiguo wrote:
> NLMSG_HDRLEN is already an aligned value. It can be referenced directly
> without extra alignment.
> 
> The redundant alignment here may confuse the API users.
> 
> Signed-off-by: Hong Zhiguo <honkiko@gmail.com>

Acked-by: Thomas Graf <tgraf@suug.ch>

This is actually an obsoleted API that we only want to keep around
for backwards compatibility with user space. It would be great to
replace all in kernel usages of NLMSG_LENGTH() with the type safe
variants nlmsg_*() in <net/netlink.h>


* RE: [Xen-devel] [PATCH 5/6] xen-netback: coalesce slots before copying
From: James Harper @ 2013-03-26 11:15 UTC (permalink / raw)
  To: David Vrabel
  Cc: Wei Liu, Ian Campbell, konrad.wilk@oracle.com,
	netdev@vger.kernel.org, Wei Liu, xen-devel@lists.xen.org,
	annie.li@oracle.com
In-Reply-To: <5151812F.3080905@citrix.com>

> 
> On 26/03/13 10:52, James Harper wrote:
> >>> It's not clear why 19 or 20 were suggested as possible values.
> >>> I checked back to 2.6.18 and MAX_SKB_FRAGS there is
> >>> (65536/PAGE_SIZE + 2)
> >>
> >> Because the check is >= MAX_SKB_FRAGS originally and James Harper
> >> told me that "Windows stops counting on 20".
> >>
> >
> > I've obviously not been clear enough here... GPLPV stopped counting
> > at 20 (only needed to know if <20 or not). Windows itself can submit
> > a packet to NDIS with hundreds of buffers. It doesn't really matter
> > if it's 21 or 1021, I just didn't want to be misquoted.
> 
> This still isn't clear.  What's the maximum number of ring entries that
> GPLPV driver will use per packet?  Are you saying it's 20?  If so how
> has the GPLPV driver ever worked well with Linux's netback (with its
> historical limit of 18)?
> 

GPLPV will limit to 19, which I thought was the historic Linux limit but obviously not. I'd better look in to that.

I added a debug statement to catch what Windows would give to GPLPV, and it seemed that the maximum was 20, but then I double checked and GPLPV only needs to know if there are >19 frags or not, so it stops counting at 20. The actual number Windows will use internally is not limited so coalescing is required, and no sane amount of bumping up the Linux limit will reduce the requirement that a Windows driver will need to coalesce.

James


* RE: [Xen-devel] [PATCH 5/6] xen-netback: coalesce slots before copying
From: Paul Durrant @ 2013-03-26 11:24 UTC (permalink / raw)
  To: James Harper, Wei Liu, David Vrabel
  Cc: Ian Campbell, Wei Liu, netdev@vger.kernel.org,
	konrad.wilk@oracle.com, xen-devel@lists.xen.org,
	annie.li@oracle.com
In-Reply-To: <6035A0D088A63A46850C3988ED045A4B3880B4F4@BITCOM1.int.sbss.com.au>

> -----Original Message-----
> From: James Harper [mailto:james.harper@bendigoit.com.au]
> Sent: 26 March 2013 11:01
> To: Paul Durrant; Wei Liu; David Vrabel
> Cc: Ian Campbell; Wei Liu; netdev@vger.kernel.org;
> konrad.wilk@oracle.com; xen-devel@lists.xen.org; annie.li@oracle.com
> Subject: RE: [Xen-devel] [PATCH 5/6] xen-netback: coalesce slots before
> copying
> 
> > > Because the check is >= MAX_SKB_FRAGS originally and James Harper
> told
> > > me that "Windows stops counting on 20".
> > >
> >
> > For the Citrix PV drivers I lifted the #define of MAX_SKB_FRAGS from the
> > dom0 kernel (i.e. 18). If a packet coming from the stack has more than that
> > number of fragments then it's copied and coalesced. The value advertised
> > for TSO size is chosen such that a maximally sized TSO will always fit in 18
> > fragments after coalescing but (since this is Windows) the drivers don't
> trust
> > the stack to stick to that limit and will drop a packet if it won't fit.
> >
> > It seems reasonable that, since the backend is copying anyway, that it
> should
> > handle any fragment list coming from the frontend that it can. This would
> > allow the copy-and-coalesce code to be removed from the frontend (and
> the
> > double-copy avoided). If there is a maximum backend packet size though
> > then I think this needs to be advertised to the frontend. The backend
> should
> > clearly bin packets coming from the frontend that exceed that limit but
> > advertising that limit in xenstore allows the frontend to choose the right
> TSO
> > maximum size to advertise to its stack, rather than having to make it based
> > on some historical value that actually has little meaning (in the absence of
> > grant mapping).
> >
> 
> As stated previously, I've observed windows issuing staggering numbers of
> buffers to NDIS miniport drivers, so you will need to coalesce in a windows
> driver anyway. I'm not sure what the break even point is but I think it's safe
> to say that in the choice between using 1000 (worst case) ring slots (with the
> resulting mapping overheads) and coalescing in the frontend, coalescing is
> going to be the better option.
> 

Oh quite, if the backend is mapping and not copying then coalescing in the frontend is the right way to go. I guess coalescing once the frag count reaches a full ring count is probably necessary (since we can't push a partial packet) but it would be nice not to have to do it if the backend is going to copy anyway.

  Paul


* RE: [Xen-devel] [PATCH 5/6] xen-netback: coalesce slots before copying
From: David Laight @ 2013-03-26 11:27 UTC (permalink / raw)
  To: James Harper, Paul Durrant, Wei Liu, David Vrabel
  Cc: Ian Campbell, Wei Liu, netdev, konrad.wilk, xen-devel, annie.li
In-Reply-To: <6035A0D088A63A46850C3988ED045A4B3880B4F4@BITCOM1.int.sbss.com.au>

> As stated previously, I've observed windows issuing staggering
> numbers of buffers to NDIS miniport drivers, so you will need
> to coalesce in a windows driver anyway. I'm not sure what the
> break even point is but I think it's safe to say that in the
> choice between using 1000 (worst case) ring slots (with the
> resulting mapping overheads) and coalescing in the frontend,
> coalescing is going to be the better option.

A long time ago we did some calculation on a sparc mbus/sbus
system (that has an iommu requiring setup for dma) and got
a breakeven point of (about) 1k.
(And I'm not sure we arrange to do aligned copies.)

Clearly that isn't directly relevant here...

It is even likely that the ethernet chips will underrun
if requested to do too many ring operations - especially
at their maximum speed.
I guess none of the modern ones require the first fragment
to be at least 100 bytes in order to guarantee retransmission
after a collision.

	David


* RE: [Xen-devel] [PATCH 5/6] xen-netback: coalesce slots before copying
From: James Harper @ 2013-03-26 11:29 UTC (permalink / raw)
  To: Paul Durrant, Wei Liu, David Vrabel
  Cc: Ian Campbell, Wei Liu, netdev@vger.kernel.org,
	konrad.wilk@oracle.com, xen-devel@lists.xen.org,
	annie.li@oracle.com
In-Reply-To: <291EDFCB1E9E224A99088639C4762022013F7D8E0E06@LONPMAILBOX01.citrite.net>

> > As stated previously, I've observed windows issuing staggering numbers of
> > buffers to NDIS miniport drivers, so you will need to coalesce in a windows
> > driver anyway. I'm not sure what the break even point is but I think it's safe
> > to say that in the choice between using 1000 (worst case) ring slots (with
> > the
> > resulting mapping overheads) and coalescing in the frontend, coalescing is
> > going to be the better option.
> >
> 
> Oh quite, if the backend is mapping and not copying then coalescing in the
> frontend is the right way to go. I guess coalescing once the frag count
> reaches a full ring count is probably necessary (since we can't push a partial
> packet) but it would be nice not to have to do it if the backend is going to
> copy anyway.
> 

For a 9k packet with 100 frags (not a common case, but an example), what is the cost of mapping those 100 frags into the backend vs coalescing to three pages in the frontend and mapping those?

I may be misremembering but wasn't there a patch floating around for persistent mapping to avoid some of this overhead? (not applicable here but I thought it meant that the cost wasn't insignificant)
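
To put rough numbers on the 9k example, here is a small sketch of the slot
arithmetic. It assumes 4 KiB pages and that coalescing packs the payload into
page-aligned buffers; the names slots_after_coalescing and slots_saved are
made up for illustration and it says nothing about the actual per-slot cost.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size */

/* Slots needed if the payload is packed into full, page-aligned pages. */
static size_t slots_after_coalescing(size_t packet_bytes)
{
	return (packet_bytes + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
}

/* Ring slots (and grant operations) avoided by coalescing nr_frags frags. */
static size_t slots_saved(size_t nr_frags, size_t packet_bytes)
{
	size_t packed = slots_after_coalescing(packet_bytes);

	return nr_frags > packed ? nr_frags - packed : 0;
}
```

For a 9000 byte packet in 100 frags this gives 3 slots after coalescing,
i.e. 97 fewer slots for the backend to handle.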

James


* Re: [Xen-devel] [PATCH 5/6] xen-netback: coalesce slots before copying
From: Wei Liu @ 2013-03-26 11:29 UTC (permalink / raw)
  To: David Vrabel
  Cc: Wei Liu, Ian Campbell, konrad.wilk@oracle.com,
	netdev@vger.kernel.org, xen-devel@lists.xen.org,
	annie.li@oracle.com
In-Reply-To: <515182E2.50103@citrix.com>

On Tue, Mar 26, 2013 at 11:13:38AM +0000, David Vrabel wrote:
> >>
> >> Separately, it may be sensible for the backend to drop packets with more
> >> frags than max-slots-per-frame up to some threshold where anything more
> >> is considered malicious (i.e., 1 - 18 slots is a valid packet, 19-20 are
> >> dropped and 21 or more is a fatal error).
> >>
> > 
> > Why drop the packet when we are able to process it? Frontend cannot know
> > it has crossed the line anyway.
> 
> Because it's a change to the protocol and we do not want to do this for
> a regression fix.
> 

If I understand correctly, the regression you talked about was introduced
by the harsh punishment in XSA-39? If so, this is the patch you need to fix
that. The frontend only knows whether it has connectivity or not. This patch
guarantees that an old netfront with a larger MAX_SKB_FRAGS still sees the
same thing from its point of view.  Netfront cannot know the
intermediate state between 18 and 20.

> As a separate fix we can consider increasing the number of slots
> per-packet once there is a mechanism to report this to the front end.
> 

Sure, that's on my TODO list.


Wei.


* Re: [Eulerkernel] [PATCH] af_unix: dont send SCM_CREDENTIAL when dest socket is NULL
From: dingtianhong @ 2013-03-26 11:35 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David S. Miller, Eric Dumazet, netdev, Li Zefan, Xinwei Hu
In-Reply-To: <1364272360.1716.11.camel@edumazet-glaptop>

On 2013/3/26 12:32, Eric Dumazet wrote:
> On Tue, 2013-03-26 at 11:08 +0800, dingtianhong wrote:
>> On 2013/3/25 22:04, Eric Dumazet wrote:
>>> On Mon, 2013-03-25 at 18:28 +0800, dingtianhong wrote:
>>>> SCM_CREDENTIALS should apply to write() syscalls only if either the source or destination
>>>> socket asserted SOCK_PASSCRED. The original implementation in maybe_add_creds is wrong,
>>>> and breaks several LSB testcases (e.g. /tset/LSB.os/network/recvfrom/T.recvfrom).
>>>>
>>>> Origionally-authored-by: Karel Srot <ksrot@redhat.com>
>>>> Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
>>>> ---
>>>>    net/unix/af_unix.c | 4 ++--
>>>>    1 file changed, 2 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
>>>> index 51be64f..99189fd 100644
>>>> --- a/net/unix/af_unix.c
>>>> +++ b/net/unix/af_unix.c
>>>> @@ -1413,8 +1413,8 @@ static void maybe_add_creds(struct sk_buff *skb, const struct socket *sock,
>>>>           if (UNIXCB(skb).cred)
>>>>                   return;
>>>>           if (test_bit(SOCK_PASSCRED, &sock->flags) ||
>>>> -           !other->sk_socket ||
>>>> -           test_bit(SOCK_PASSCRED, &other->sk_socket->flags)) {
>>>> +           (other->sk_socket &&
>>>> +           test_bit(SOCK_PASSCRED, &other->sk_socket->flags))) {
>>>>                   UNIXCB(skb).pid  = get_pid(task_tgid(current));
>>>>                   UNIXCB(skb).cred = get_current_cred();
>>>>           }
>>>
>>> I am not sure why adding credentials if other->sk_socket is NULL could
>>> break an application ?
>> The bug has been reported in bugzilla: https://lsbbugs.linuxfoundation.org/show_bug.cgi?id=3523
>>
> 
> OK
> 
>>>
>>> This was the case before commit introducing this code.
>>
>> The commit 16e5726269 (af_unix: dont send SCM_CREDENTIALS by default) may have introduced the problem.
>>
> 
> So the problem is that two messages have different credentials,
> because other->sk_socket changed between first and second message.
> 
> and unix_stream_recvmsg() has the following check :
> 
>                 if (check_creds) {
>                         /* Never glue messages from different writers */
>                         if ((UNIXCB(skb).pid  != siocb->scm->pid) ||
>                             (UNIXCB(skb).cred != siocb->scm->cred))
>                                 break;
>                 } else {
>                         /* Copy credentials */
>                         scm_set_cred(siocb->scm, UNIXCB(skb).pid, UNIXCB(skb).cred);
>                         check_creds = 1;
>                 }
> 
> In the case the receiver doesn't care at all (using recvfrom(), not recvmsg()),
> we probably should not even call scm_set_cred() and avoid extra refcounting.
> 
I think that if we do not call scm_set_cred(), the credentials would be useless in recvmsg().
We could remove this code:
                if (check_creds) {
                        /* Never glue messages from different writers */
                        if ((UNIXCB(skb).pid  != siocb->scm->pid) ||
                            (UNIXCB(skb).cred != siocb->scm->cred))
                                break;
                } else {
                        /* Copy credentials */
                        scm_set_cred(siocb->scm, UNIXCB(skb).pid, UNIXCB(skb).cred);
                        check_creds = 1;
                }
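
For reference, a simplified sketch of how the patched condition in
maybe_add_creds() differs from the old one, and why that matters for the
glue check quoted above. The helper names and boolean parameters are
hypothetical; the real code tests socket flags and compares pid/cred
pointers, which is reduced here to "credentials attached or not" for a
single writer.

```c
#include <stdbool.h>

/* Old test: a missing peer socket counted as "attach credentials". */
static bool add_creds_old(bool self_passcred, bool peer_has_socket,
			  bool peer_passcred)
{
	return self_passcred || !peer_has_socket || peer_passcred;
}

/* Patched test: attach only when some socket actually asked for them. */
static bool add_creds_new(bool self_passcred, bool peer_has_socket,
			  bool peer_passcred)
{
	return self_passcred || (peer_has_socket && peer_passcred);
}

/*
 * Simplified stand-in for the "never glue messages from different
 * writers" check: for one writer, two queued messages compare equal
 * only if both carry credentials or both carry none.
 */
static bool can_glue(bool first_has_creds, bool second_has_creds)
{
	return first_has_creds == second_has_creds;
}
```

If other->sk_socket is NULL for the first message and set for the second,
and nobody asserted SOCK_PASSCRED, the old test gives (true, false) and the
receiver stops after the first message; the patched test gives (false, false)
and both messages can be returned by one read.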
> 
> 
> 
> .
> 


* RE: [Xen-devel] [PATCH 5/6] xen-netback: coalesce slots before copying
From: Paul Durrant @ 2013-03-26 11:38 UTC (permalink / raw)
  To: James Harper, Wei Liu, David Vrabel
  Cc: Ian Campbell, Wei Liu, netdev@vger.kernel.org,
	konrad.wilk@oracle.com, xen-devel@lists.xen.org,
	annie.li@oracle.com
In-Reply-To: <6035A0D088A63A46850C3988ED045A4B3880B6AE@BITCOM1.int.sbss.com.au>

> -----Original Message-----
> From: James Harper [mailto:james.harper@bendigoit.com.au]
> Sent: 26 March 2013 11:29
> To: Paul Durrant; Wei Liu; David Vrabel
> Cc: Ian Campbell; Wei Liu; netdev@vger.kernel.org;
> konrad.wilk@oracle.com; xen-devel@lists.xen.org; annie.li@oracle.com
> Subject: RE: [Xen-devel] [PATCH 5/6] xen-netback: coalesce slots before
> copying
> 
> > > As stated previously, I've observed windows issuing staggering numbers
> of
> > > buffers to NDIS miniport drivers, so you will need to coalesce in a
> windows
> > > driver anyway. I'm not sure what the break even point is but I think it's
> safe
> > > to say that in the choice between using 1000 (worst case) ring slots (with
> > > the
> > > resulting mapping overheads) and coalescing in the frontend, coalescing
> is
> > > going to be the better option.
> > >
> >
> > Oh quite, if the backend is mapping and not copying then coalescing in the
> > frontend is the right way to go. I guess coalescing once the frag count
> > reaches a full ring count is probably necessary (since we can't push a partial
> > packet) but it would be nice not to have to do it if the backend is going to
> > copy anyway.
> >
> 
> For a 9k packet with 100 frags (not a common case, but an example), what is
> the cost of mapping those 100 frags into the backend vs coalescing to three
> pages in the frontend and mapping those?
> 
> I may be misremembering but wasn't there a patch floating around for
> persistent mapping to avoid some of this overhead? (not applicable here but
> I thought it meant that the cost wasn't insignificant)
> 

The current version of netback does not map, it always grant-copies.

  Paul


* Re: [PATCH v2 net-next 00/12] 6lowpan: Some more bug fixes
From: Alan Ott @ 2013-03-26 11:48 UTC (permalink / raw)
  To: David S. Miller
  Cc: Tony Cheneau, Eric Dumazet, Alexander Smirnov, netdev,
	linux-zigbee-devel
In-Reply-To: <1364270372-19430-1-git-send-email-tony.cheneau@amnesiak.org>

On 03/25/2013 11:59 PM, Tony Cheneau wrote:
> This patchset fixes serious bugs within the 6LoWPAN modules. I wrote a script
> (available at [1]) to prove the issues are real.  One can try and see that
> without these patches, most of the tests fail (e.g. packet dropped by the
> receiver or node crashing). With all patches applied, all tests succeed. The
> tests themselves are very basic: sending ICMP packets, sending UDP packets,
> sending TCP packets, varying size of the packets. This actually triggers some
> 6LoWPAN specific code, namely fragmentation, packet reassembly and header
> compression.
> 
> This code passed the checkpatch.pl tool with a few warnings, that I believe
> are OK. It should apply cleanly on the latest net-next.
> 

I have been running some form of this patchset since October, and
have reviewed it several times.

Reviewed-by: Alan Ott <alan@signal11.us>
Tested-by: Alan Ott <alan@signal11.us>


> Regards,
> 	Tony Cheneau
> 
> [1]: https://github.com/tcheneau/linux802154-regression-tests
> 
> Tony Cheneau (12):
>   6lowpan: lowpan_is_iid_16_bit_compressable() does not detect
>     compressible address correctly
>   6lowpan: next header is not properly set upon decompression of a UDP
>     header.
>   6lowpan: always enable link-layer acknowledgments
>   mac802154: turn on ACK when enabled by the upper layers
>   6lowpan: use short IEEE 802.15.4 addresses for broadcast destination
>   6lowpan: fix first fragment (FRAG1) handling
>   6lowpan: add debug messages for 6LoWPAN fragmentation
>   6lowpan: store fragment tag values per device instead of net stack
>     wide
>   mac802154: re-introduce mac802154_dev_get_dsn()
>   6lowpan: obtain IEEE802.15.4 sequence number from the MAC layer
>   6lowpan: use the PANID provided by the device instead of a static
>     value
>   6lowpan: modify udp compression/uncompression to match the standard
> 
>  net/ieee802154/6lowpan.c  | 136 +++++++++++++++++++++++++++++++++++++---------
>  net/ieee802154/6lowpan.h  |   7 ++-
>  net/mac802154/mac802154.h |   1 +
>  net/mac802154/mac_cmd.c   |   1 +
>  net/mac802154/mib.c       |   9 +++
>  net/mac802154/wpan.c      |   2 +
>  6 files changed, 127 insertions(+), 29 deletions(-)
> 


* Re: [Linux-zigbee-devel] [PATCH v2 net-next 05/12] 6lowpan: use short IEEE 802.15.4 addresses for broadcast destination
From: Alexander Aring @ 2013-03-26 11:56 UTC (permalink / raw)
  To: Tony Cheneau
  Cc: David S. Miller, Eric Dumazet, netdev, linux-zigbee-devel,
	Alan Ott
In-Reply-To: <1364270372-19430-6-git-send-email-tony.cheneau@amnesiak.org>

Hi Tony,

On Mon, Mar 25, 2013 at 11:59:25PM -0400, Tony Cheneau wrote:
> The IEEE 802.15.4 standard uses the 0xFFFF short address (2 bytes) for message
> broadcasting.
> 
> Signed-off-by: Tony Cheneau <tony.cheneau@amnesiak.org>
> ---
>  net/ieee802154/6lowpan.c | 23 +++++++++++++++--------
>  1 file changed, 15 insertions(+), 8 deletions(-)
> 
> diff --git a/net/ieee802154/6lowpan.c b/net/ieee802154/6lowpan.c
> index e7f61de..0eebb96 100644
> --- a/net/ieee802154/6lowpan.c
> +++ b/net/ieee802154/6lowpan.c
> @@ -572,21 +572,28 @@ static int lowpan_header_create(struct sk_buff *skb,
>  	 * this isn't implemented in mainline yet, so currently we assign 0xff
>  	 */
>  	{
> +		mac_cb(skb)->flags = IEEE802154_FC_TYPE_DATA;
> +
>  		/* prepare wpan address data */
>  		sa.addr_type = IEEE802154_ADDR_LONG;
>  		sa.pan_id = 0xff;
> -
> -		da.addr_type = IEEE802154_ADDR_LONG;
> -		da.pan_id = 0xff;
> -
> -		memcpy(&(da.hwaddr), daddr, 8);
>  		memcpy(&(sa.hwaddr), saddr, 8);
>  
> -		mac_cb(skb)->flags = IEEE802154_FC_TYPE_DATA;
> +		da.pan_id = 0xff;
> +		/*
> +		 * if the destination address is the broadcast address, use the
> +		 * corresponding short address
> +		 */
> +		if (lowpan_is_addr_broadcast(daddr)) {
> +			da.addr_type = IEEE802154_ADDR_SHORT;
> +			da.short_addr = IEEE802154_ADDR_BROADCAST;
> +		} else {
> +			da.addr_type = IEEE802154_ADDR_LONG;
> +			memcpy(&(da.hwaddr), daddr, 8);

Just a nitpick here:
it may be better to use IEEE802154_ADDR_LEN instead of 8.

I mean we have this define, and we should use it if we mean the
ieee802154 address length.
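
A standalone sketch of what I mean, using a named length constant in both
the broadcast test and the copy. Everything prefixed with sketch_/SKETCH_ is
a stand-in for illustration, not the kernel definitions, and the broadcast
test (all bytes 0xff) is my assumption about what lowpan_is_addr_broadcast()
checks.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Stand-in values for the sketch, not copied from the kernel headers. */
#define SKETCH_ADDR_LEN        8	/* extended address length in bytes */
#define SKETCH_SHORT_BROADCAST 0xffffu	/* 802.15.4 broadcast short address */

enum sketch_addr_type { SKETCH_ADDR_SHORT, SKETCH_ADDR_LONG };

struct sketch_addr {
	enum sketch_addr_type type;
	uint16_t short_addr;
	uint8_t hwaddr[SKETCH_ADDR_LEN];
};

/* Assumed broadcast test: every byte of the long address is 0xff. */
static bool sketch_is_broadcast(const uint8_t *addr)
{
	for (size_t i = 0; i < SKETCH_ADDR_LEN; i++)
		if (addr[i] != 0xff)
			return false;
	return true;
}

/* Pick the short broadcast address or copy the long unicast address. */
static void sketch_fill_dest(struct sketch_addr *da, const uint8_t *daddr)
{
	memset(da, 0, sizeof(*da));
	if (sketch_is_broadcast(daddr)) {
		da->type = SKETCH_ADDR_SHORT;
		da->short_addr = SKETCH_SHORT_BROADCAST;
	} else {
		da->type = SKETCH_ADDR_LONG;
		memcpy(da->hwaddr, daddr, SKETCH_ADDR_LEN);
	}
}
```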

Alex


* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete
From: Benoit Lourdelet @ 2013-03-26 11:51 UTC (permalink / raw)
  To: Stephen Hemminger, Eric W. Biederman; +Cc: netdev@vger.kernel.org, Serge Hallyn
In-Reply-To: <CAOaVG17Cjj3epC2LRkDVdoqNWno=XjH4nhfiNVeceS=0d=Nyrw@mail.gmail.com>

Hello,

I re-tested with the patch and got the following results on a system
with 32 cores at 2 GHz.

# veth 	add 	delete
1000 	36 	34
3000 	259 	137
4000 	462 	195
5000 	729     N/A

The script to create is the following :
for i in `seq 1 5000`; do
	sudo ip link add type veth
done


The script to delete:
for d in /sys/class/net/veth*; do
	ip link del `basename $d` 2>/dev/null || true
done

There is a very good improvement in deletion.



iproute2 does not seem to be well multithreaded, as I get the time divided by a
factor of 2 on a system with 8 cores at 3.2 GHz.

I don't know if that is the improvement you expected ?

Would the iproute2 redesign you mentioned help improve performance even
further ?


As a reference : Iproute2 baseline w/o patch:

# veth 	add 	delete

1000 	57 	70
2000 	193 	250
3000 	435 	510
4000 	752 	824
5000 	1123 	1185
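
The baseline numbers grow roughly with the square of the interface count.
As a rough sketch of why that would happen if every command first dumps all
existing interfaces: the helper below (a hypothetical name, and it counts one
interface per command, ignoring that veth creates pairs) sums the entries
listed while creating n interfaces one by one.

```c
#include <stdint.h>

/*
 * Total interface entries dumped while creating n interfaces, if the
 * k-th command first lists the k-1 interfaces that already exist:
 * 0 + 1 + ... + (n - 1) = n * (n - 1) / 2.
 */
static uint64_t entries_dumped(uint64_t n)
{
	return n * (n - 1) / 2;
}
```

Going from 1000 to 5000 interfaces multiplies this count by about 25, which
is the same order as the 57 to 1123 increase in the add column above.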

Regards

Benoit




On 22/03/2013 23:27, "Stephen Hemminger" <stephen@networkplumber.org>
wrote:

>The whole ifindex map is a design mistake at this point.
>Better off to do a lazy cache or something like that.
>
>
>On Fri, Mar 22, 2013 at 3:23 PM, Eric W. Biederman
><ebiederm@xmission.com> wrote:
>>
>> Because ip link add, set, and delete map the interface name to the
>> interface index by dumping all of the interfaces before performing
>> their respective commands, operations that should be constant time
>> slow down when lots of network interfaces are in use, resulting
>> in O(N^2) time to work with O(N) devices.
>>
>> Make the work that iproute does constant time by passing the interface
>> name to the kernel instead.
>>
>> In small scale testing on my system this shows dramatic performance
>> increases of ip link add from 120s to just 11s to add 5000 network
>> devices.  And from longer than I cared to wait to just 58s to delete
>> all of those interfaces again.
>>
>> Cc: Serge Hallyn <serge.hallyn@ubuntu.com>
>> Reported-by: Benoit Lourdelet <blourdel@juniper.net>
>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>> ---
>>
>> I think I am bungling the case where people specify an ifindex as ifNNNN
>> but does anyone care?
>>
>>  ip/iplink.c |   19 +------------------
>>  1 files changed, 1 insertions(+), 18 deletions(-)
>>
>> diff --git a/ip/iplink.c b/ip/iplink.c
>> index ad33611..6dffbf0 100644
>> --- a/ip/iplink.c
>> +++ b/ip/iplink.c
>> @@ -533,8 +533,6 @@ static int iplink_modify(int cmd, unsigned int flags, int argc, char **argv)
>>                 }
>>         }
>>
>> -       ll_init_map(&rth);
>> -
>>         if (!(flags & NLM_F_CREATE)) {
>>                 if (!dev) {
>>                         fprintf(stderr, "Not enough information: \"dev\" "
>> @@ -542,27 +540,12 @@ static int iplink_modify(int cmd, unsigned int flags, int argc, char **argv)
>>                         exit(-1);
>>                 }
>>
>> -               req.i.ifi_index = ll_name_to_index(dev);
>> -               if (req.i.ifi_index == 0) {
>> -                       fprintf(stderr, "Cannot find device \"%s\"\n", dev);
>> -                       return -1;
>> -               }
>> +               name = dev;
>>         } else {
>>                 /* Allow "ip link add dev" and "ip link add name" */
>>                 if (!name)
>>                         name = dev;
>>
>> -               if (link) {
>> -                       int ifindex;
>> -
>> -                       ifindex = ll_name_to_index(link);
>> -                       if (ifindex == 0) {
>> -                               fprintf(stderr, "Cannot find device \"%s\"\n",
>> -                                       link);
>> -                               return -1;
>> -                       }
>> -                       addattr_l(&req.n, sizeof(req), IFLA_LINK, &ifindex, 4);
>> -               }
>>         }
>>
>>         if (name) {
>> --
>> 1.7.5.4
>>
>


* Re: [PATCH v2 net-next 08/12] 6lowpan: store fragment tag values per device instead of net stack wide
From: Sergei Shtylyov @ 2013-03-26 12:38 UTC (permalink / raw)
  To: Tony Cheneau
  Cc: David S. Miller, Eric Dumazet, Alan Ott, Alexander Smirnov,
	netdev, linux-zigbee-devel
In-Reply-To: <1364270372-19430-9-git-send-email-tony.cheneau@amnesiak.org>

Hello.

On 26-03-2013 7:59, Tony Cheneau wrote:

> Signed-off-by: Tony Cheneau <tony.cheneau@amnesiak.org>
> ---
>   net/ieee802154/6lowpan.c | 9 +++++----
>   1 file changed, 5 insertions(+), 4 deletions(-)

> diff --git a/net/ieee802154/6lowpan.c b/net/ieee802154/6lowpan.c
> index 61eee9d..f952451 100644
> --- a/net/ieee802154/6lowpan.c
> +++ b/net/ieee802154/6lowpan.c
> @@ -104,6 +104,7 @@ static const u8 lowpan_llprefix[] = {0xfe, 0x80};
>   struct lowpan_dev_info {
>   	struct net_device	*real_dev; /* real WPAN device ptr */
>   	struct mutex		dev_list_mtx; /* mutex for list ops */
> +	unsigned short fragment_tag;

    Small formatting nit: align field name with the others above please.

WBR, Sergei


* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete
From: Eric W. Biederman @ 2013-03-26 12:40 UTC (permalink / raw)
  To: Benoit Lourdelet; +Cc: Stephen Hemminger, netdev@vger.kernel.org, Serge Hallyn
In-Reply-To: <CD773D0D.7614%blourdel@juniper.net>

Benoit Lourdelet <blourdel@juniper.net> writes:

> Hello,
>
> I re-tested with the patch and got the following results on a 32x 2Ghz
> core system.
>
> # veth 	add 	delete
> 1000 	36 	34
> 3000 	259 	137
> 4000 	462 	195
> 5000 	729     N/A
>
> The script to create is the following :
> for i in `seq 1 5000`; do
> 	sudo ip link add type veth
> done

Which performs horribly as I mentioned earlier because you are asking
the kernel to create the names.  If you want performance you need to
specify the names of the network devices you are creating.

aka ip link add a$i type veth peer name b$i

> The script to delete:
> for d in /sys/class/net/veth*; do
> 	ip link del `basename $d` 2>/dev/null || true
> done
>
> There is a very good improvement in deletion.
>
>
>
> iproute2 does not seem to be well multithreaded, as I get the time divided by a
> factor of 2 on a system with 8 cores at 3.2 GHz.

All netlink traffic and all network stack configuration is serialized by
the rtnl_lock in the kernel.  This is the slow path in the kernel, not
the fast path.

> I don't know if that is the improvement you expected ?
>
> Would the iproute2 redesign you mentioned help improve performance even
> further ?

Specifying the names would dramatically improve your creation
performance.  It should only take you about 10s for 5000 veth pairs.
But you have to specify the names.

Anyway I have exhausted my time, and inclination in this matter.  Good
luck with whatever your problem is.

Eric


* Re: [Eulerkernel] [PATCH] af_unix: dont send SCM_CREDENTIAL when dest socket is NULL
From: Eric Dumazet @ 2013-03-26 13:46 UTC (permalink / raw)
  To: dingtianhong; +Cc: David S. Miller, Eric Dumazet, netdev, Li Zefan, Xinwei Hu
In-Reply-To: <515187F6.4030905@huawei.com>

On Tue, 2013-03-26 at 19:35 +0800, dingtianhong wrote:

> I think that if we do not call scm_set_cred(), the credentials would be useless in recvmsg().
> We could remove this code:
>                 if (check_creds) {
>                         /* Never glue messages from different writers */
>                         if ((UNIXCB(skb).pid  != siocb->scm->pid) ||
>                             (UNIXCB(skb).cred != siocb->scm->cred))
>                                 break;
>                 } else {
>                         /* Copy credentials */
>                         scm_set_cred(siocb->scm, UNIXCB(skb).pid, UNIXCB(skb).cred);
>                         check_creds = 1;
>                 }

Are you paraphrasing me or saying something different ?
 


* niu lock-up (Transmit timed out, resetting) and NETDEV WATCHDOG
From: Andrew Brooks @ 2013-03-26 13:46 UTC (permalink / raw)
  To: Linux Net-Dev Mailing List
In-Reply-To: <CAHOfOo21+hSwFrXZuoMppcivcOonhx-m1p-yyZsm6c5UCh0joQ@mail.gmail.com>

Hello

I am using the niu driver with this card: Oracle/SUN Multithreaded 10-Gigabit
Ethernet Network Controller.
After a period (often less than 24 hours) the interface hangs, with this
error logged every 5 seconds:
"niu: xxx: eth2: Transmit timed out, resetting"

Sometimes also in syslog are messages
WARNING: at sch_generic:255 dev_watchdog
NETDEV WATCHDOG: eth2 (niu): transmit queue 10 timed out

I've seen this in kernel 3.5.0-26-generic #42~precise1-Ubuntu SMP
but I've not seen it in kernel 3.2.0-38-generic #61-Ubuntu SMP

Is there some change between kernels which has broken the driver
or is the difference elsewhere?

Thanks

Andrew


* [PATCH net-next] ptp_pch: fix typo in module parameter description
From: Jiri Benc @ 2013-03-26 13:54 UTC (permalink / raw)
  To: netdev; +Cc: Richard Cochran, Takahiro Shimizu

Signed-off-by: Jiri Benc <jbenc@redhat.com>
---
 drivers/ptp/ptp_pch.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/ptp/ptp_pch.c b/drivers/ptp/ptp_pch.c
index 1367655..e85926b 100644
--- a/drivers/ptp/ptp_pch.c
+++ b/drivers/ptp/ptp_pch.c
@@ -725,7 +725,7 @@ module_exit(ptp_pch_exit);
 
 module_param_string(station, pch_param.station, sizeof pch_param.station, 0444);
 MODULE_PARM_DESC(station,
-	 "IEEE 1588 station address to use - column separated hex values");
+	 "IEEE 1588 station address to use - colon separated hex values");
 
 MODULE_AUTHOR("LAPIS SEMICONDUCTOR, <tshimizu818@gmail.com>");
 MODULE_DESCRIPTION("PTP clock using the EG20T timer");
-- 
1.7.6.5
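
For illustration, the corrected description says the station parameter takes
colon separated hex values. Below is a minimal userspace sketch of parsing
such a string, assuming a six-octet address written as xx:xx:xx:xx:xx:xx.
The helper name parse_station() is hypothetical, and this is not the parser
the driver itself uses.

```c
#include <stdio.h>

/*
 * Hypothetical helper (not from ptp_pch.c): parse "xx:xx:xx:xx:xx:xx"
 * into six octets.  Returns 0 on success, -1 if the string is malformed.
 */
int parse_station(const char *s, unsigned char out[6])
{
	unsigned int v[6];
	char tail;
	int i;

	/* The trailing %c only converts if junk follows the sixth octet. */
	if (sscanf(s, "%2x:%2x:%2x:%2x:%2x:%2x%c",
		   &v[0], &v[1], &v[2], &v[3], &v[4], &v[5], &tail) != 6)
		return -1;

	for (i = 0; i < 6; i++)
		out[i] = (unsigned char)v[i];
	return 0;
}
```

With this sketch a string such as "00:1b:21:aa:bb:cc" is accepted, while the
same digits separated by spaces are rejected because the literal ':' in the
format no longer matches.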


* [PATCH net-next] MAINTAINERS: add netdev list for PTP (IEEE 1588)
From: Jiri Benc @ 2013-03-26 14:01 UTC (permalink / raw)
  To: netdev; +Cc: Richard Cochran

Signed-off-by: Jiri Benc <jbenc@redhat.com>
---
 MAINTAINERS |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 86c0843..77b3748 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6310,6 +6310,7 @@ F:	drivers/acpi/apei/erst.c
 
 PTP HARDWARE CLOCK SUPPORT
 M:	Richard Cochran <richardcochran@gmail.com>
+L:	netdev@vger.kernel.org
 S:	Maintained
 W:	http://linuxptp.sourceforge.net/
 F:	Documentation/ABI/testing/sysfs-ptp
-- 
1.7.6.5


* [PATCH] bonding: cleanup unneeded rcu_read_lock()
From: Veaceslav Falico @ 2013-03-26 14:10 UTC (permalink / raw)
  To: netdev; +Cc: vfalico, andy, fubar

bond_resend_igmp_join_requests_delayed() calls
bond_resend_igmp_join_requests() under rcu_read_lock(), although the callee
already takes its own rcu_read_lock() for the whole function. Remove the
redundant lock from the _delayed function.

Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
---
 drivers/net/bonding/bond_main.c |    3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 6bbd90e..11a8cb3 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -796,9 +796,8 @@ static void bond_resend_igmp_join_requests_delayed(struct work_struct *work)
 {
 	struct bonding *bond = container_of(work, struct bonding,
 					    mcast_work.work);
-	rcu_read_lock();
+
 	bond_resend_igmp_join_requests(bond);
-	rcu_read_unlock();
 }
 
 /*
-- 
1.7.1
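
For illustration, nesting RCU read-side critical sections is legal, so the
old code was not wrong, only redundant. The userspace toy model below sketches
that point: the toy_* functions are hypothetical stand-ins that merely count
nesting depth, and they are not the kernel primitives.

```c
/*
 * Toy stand-ins for rcu_read_lock()/rcu_read_unlock(): they only track
 * how deeply read-side critical sections are nested.
 */
int toy_depth;		/* current nesting depth */
int toy_max_depth;	/* deepest nesting seen so far */

void toy_read_lock(void)
{
	if (++toy_depth > toy_max_depth)
		toy_max_depth = toy_depth;
}

void toy_read_unlock(void)
{
	--toy_depth;
}

/* Models bond_resend_igmp_join_requests(): it takes the lock itself. */
void toy_resend(void)
{
	toy_read_lock();
	/* ... walk the protected data here ... */
	toy_read_unlock();
}

/* Before the patch: the caller wraps an already protected callee. */
void toy_delayed_before(void)
{
	toy_read_lock();
	toy_resend();
	toy_read_unlock();
}

/* After the patch: the callee's own lock is the only one taken. */
void toy_delayed_after(void)
{
	toy_resend();
}
```

In this model the "before" variant reaches a nesting depth of two and the
"after" variant a depth of one, with the protected walk covered in both cases.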


* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete
From: Serge Hallyn @ 2013-03-26 14:17 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Benoit Lourdelet, Stephen Hemminger, netdev@vger.kernel.org
In-Reply-To: <87wqsu72r6.fsf@xmission.com>

Quoting Eric W. Biederman (ebiederm@xmission.com):
> Specifying the names would dramatically improve your creation
> performance.  It should only take you about 10s for 5000 veth pairs.
> But you have to specify the names.

Thanks, Eric.  I'm going to update lxc to always specify names for
the veth pairs, rather than only when they are requested by the
user's configuration file.

-serge


* kmem_cache_create(nf_conntrack_expect): Cache name already exists.
From: Dave Jones @ 2013-03-26 14:18 UTC (permalink / raw)
  To: netdev

We had a user report this against a 3.6.6 kernel.
Given the uptimes he usually sees on that box, it may be a while before
he gets a chance to see this again if it hasn't been fixed.

Does this look familiar to anyone ?

Mar 21 04:00:12 kernel: [8176848.470356] nf_conntrack version 0.5.0 (16049 buckets, 64196 max)
Mar 21 04:00:12 kernel: [8176848.471261] kmem_cache_create(nf_conntrack_expect): Cache name already exists.
Mar 21 04:00:12 kernel: [8176848.471794] Pid: 32711, comm: modprobe Not tainted 3.6.6-1.fc16.i686 #1
Mar 21 04:00:12 kernel: [8176848.472321] Call Trace:
Mar 21 04:00:12 kernel: [8176848.472959]  [<c0511b44>] kmem_cache_create+0x144/0x190
Mar 21 04:00:12 kernel: [8176848.473684]  [<f7d8b7e8>] nf_conntrack_expect_init+0xe8/0x120 [nf_conntrack]
Mar 21 04:00:12 kernel: [8176848.474358]  [<f7d89b06>] nf_conntrack_init+0xe6/0x320 [nf_conntrack]
Mar 21 04:00:12 kernel: [8176848.475045]  [<f7d8a164>] nf_conntrack_net_init+0x14/0x170 [nf_conntrack]
Mar 21 04:00:12 kernel: [8176848.475732]  [<f7da9000>] ? 0xf7da8fff
Mar 21 04:00:12 kernel: [8176848.476394]  [<c085c1b9>] ops_init+0x39/0x110
Mar 21 04:00:12 kernel: [8176848.477052]  [<c085c41c>] register_pernet_operations+0xcc/0x140
Mar 21 04:00:12 kernel: [8176848.477686]  [<c085c511>] register_pernet_subsys+0x21/0x40
Mar 21 04:00:12 kernel: [8176848.478348]  [<f7da900d>] nf_conntrack_standalone_init+0xd/0x1000 [nf_conntrack]
Mar 21 04:00:12 kernel: [8176848.478965]  [<c0401124>] do_one_initcall+0x34/0x170
Mar 21 04:00:12 kernel: [8176848.479608]  [<f7da9000>] ? 0xf7da8fff
Mar 21 04:00:12 kernel: [8176848.480227]  [<c049977c>] sys_init_module+0xfcc/0x1cf0
Mar 21 04:00:12 kernel: [8176848.480821]  [<c055b73a>] ? mntput_no_expire+0x3a/0x110
Mar 21 04:00:12 kernel: [8176848.481456]  [<c0958f5f>] sysenter_do_call+0x12/0x28

The user has a cron job that restarts his firewall setup every morning, and
this occurred during that restart, while things were apparently low on memory.


Mar 21 04:00:12 modprobe: FATAL: Error inserting iptable_nat (/lib/modules/3.6.6-1.fc16.i686/kernel/net/ipv4/netfilter/iptable_nat.ko): Cannot allocate memory


* Re: [PATCH] libertas: drop maintainership
From: Dan Williams @ 2013-03-26 14:22 UTC (permalink / raw)
  To: John W. Linville
  Cc: Joe Perches, netdev, linux-wireless, Daniel Drake, Bing Zhao
In-Reply-To: <20130325185942.GC17454@tuxdriver.com>

On Mon, 2013-03-25 at 14:59 -0400, John W. Linville wrote:
> On Mon, Mar 18, 2013 at 12:51:35PM -0500, Dan Williams wrote:
> > On Mon, 2013-03-18 at 10:23 -0700, Joe Perches wrote:
> > > On Mon, 2013-03-18 at 11:48 -0500, Dan Williams wrote:
> > > > Would be better maintained by somebody who actually has time for it.
> > > []
> > > > diff --git a/MAINTAINERS b/MAINTAINERS
> > > []
> > > > -MARVELL LIBERTAS WIRELESS DRIVER
> > > > -M:	Dan Williams <dcbw@redhat.com>
> > > > -L:	libertas-dev@lists.infradead.org
> > > > -S:	Maintained
> > > > -F:	drivers/net/wireless/libertas/
> > > 
> > > I think it better to mark it as Orphan
> > > and maybe leave the list.
> > > 
> > > Maybe:
> > > 
> > > MARVELL LIBERTAS WIRELESS DRIVER
> > > L:	libertas-dev@lists.infradead.org
> > > S:	Orphan
> > > F:	drivers/net/wireless/libertas/
> > > 
> > > or
> > > 
> > > MARVELL LIBERTAS WIRELESS DRIVER
> > > S:	Orphan
> > > F:	drivers/net/wireless/libertas/
> > 
> > I can do that; I wasn't quite sure how to do this.  A quick check showed
> > patches that did what mine did, and oddly MAINTAINERS has no section for
> > dropping maintainership that I could quickly find.  If this is what
> > others prefer I'm happy to resubmit?
> 
> I probably would prefer an "Orphan" listing as well...

Ok, will do.

Dan


* defxx: skb_push() failing?
From: David Oostdyk @ 2013-03-26 14:29 UTC (permalink / raw)
  To: netdev; +Cc: macro

Hello,

In dfx_xmt_queue_pkt() in defxx.c, there is a skb_push(3) call which 
makes room for 3 packet request header bytes.  There is some discussion 
in the driver explaining why those three bytes will be available.  I 
have an old FDDI card that I'm trying to bring up:

05:05.0 FDDI network controller: Digital Equipment Corporation 
PCI-to-PDQ Interface Chip [PFI] (rev 02)

Most skbuffs that come through dfx_xmt_queue_pkt() have 11 bytes 
between skb->head and skb->data.  On the other hand, at almost exactly 
60-second intervals, an skb arrives that has zero bytes between 
skb->head and skb->data.  This normally causes a kernel panic, so for 
the time being I just skip over such skbs.

Does anyone have advice on where I should start digging to find the 
cause of this?

Thanks in advance!
- David Oostdyk
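
For illustration, the headroom described above is the distance between
skb->head and skb->data, and a push of 3 bytes only fits when that distance
is at least 3. The userspace toy model below sketches the check; struct
toy_skb and the toy_* helpers are hypothetical and are not kernel code. The
real skb_push() panics on underrun rather than returning NULL, so a driver
would have to test skb_headroom() before pushing.

```c
#include <stddef.h>

/* Toy model of the two sk_buff pointers that matter here. */
struct toy_skb {
	unsigned char *head;	/* start of the allocated buffer */
	unsigned char *data;	/* start of the current packet data */
};

/* Bytes available in front of the packet data. */
size_t toy_headroom(const struct toy_skb *skb)
{
	return (size_t)(skb->data - skb->head);
}

/* Move data back by len bytes, or return NULL if there is no room. */
unsigned char *toy_push(struct toy_skb *skb, size_t len)
{
	if (toy_headroom(skb) < len)
		return NULL;
	skb->data -= len;
	return skb->data;
}
```

With 11 bytes of headroom a 3-byte push succeeds and leaves 8 bytes; with
zero headroom the push is refused and the buffer is left untouched.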


* Re: [RFC][PATCH] iproute: Faster ip link add, set and delete
From: Serge Hallyn @ 2013-03-26 14:33 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Benoit Lourdelet, Stephen Hemminger, netdev@vger.kernel.org
In-Reply-To: <87wqsu72r6.fsf@xmission.com>

Actually, lxc is using random names now, so it's ok.

Benoit, can you use the patches from Eric with lxc (or use the script
you were using before but specify names as he said)?

-serge


* [net-next.git 0/8 (v2)] stmmac: update to March_2013 (ext desc, PTP, SGMII)
From: Giuseppe CAVALLARO @ 2013-03-26 14:43 UTC (permalink / raw)
  To: netdev; +Cc: rayagond, richardcochran, Giuseppe Cavallaro

These patches enhance the driver by adding PTP support and the initial code
for RGMII/SGMII/TBI/RTBI modes.
They also rework the driver to remove the Kconfig options for selecting
between chain and ring modes, which is really useful for validating the
driver at build time.
Before adding PTP, extended descriptor support was added because it
is mandatory for saving HW timestamps in the new dedicated descriptors. No
Kconfig option was added in this case either.

Concerning PTP, I have hacked on, reviewed and tested many
parts of these patches, and also verified backward compatibility on
several boards and chips.

Concerning SGMII/RGMII, the support has already been discussed on the
netdev mailing list with Byungho, where these patches were partially
analysed.
So I have only ported them to the latest net-next (on top of PTP) and
added some missing pieces, e.g. some parts of the ethtool support for
ANE. As we clarified with Byungho, we will add further
enhancements on top of these patches if needed.

I have also built everything for ARM/SH/x86 platforms and seen no issues on
ST boxes.

Thanks go to Rayagond, who wrote and tested the PTP support, and to Byungho
for SGMII.

V2: This version has the fixes discussed on the ML, for example:
    o completely remove the Kconfig options; all the decisions are made at probe time
    o review the PTP patches and organize them better, in just two patches
    o add all the fixes provided by Richard for the PTP and CLK driver.

Giuseppe Cavallaro (5):
  stmmac: reorganize chain/ring modes removing Koptions
  stmmac: support extend descriptors
  stmmac: start adding pcs and rgmii core irq
  stmmac: initial support to manage pcs modes
  stmmac: update the Doc and Version (PTP+SGMII)

Rayagond Kokatanur (3):
  stmmac: add tx_skbuff_dma to save descriptors used by PTP
  stmmac: add IEEE PTPv1 and PTPv2 support.
  stmmac: add the support for PTP hw clock driver

 Documentation/networking/stmmac.txt                |   33 +-
 drivers/net/ethernet/stmicro/stmmac/Kconfig        |   19 +-
 drivers/net/ethernet/stmicro/stmmac/Makefile       |    8 +-
 drivers/net/ethernet/stmicro/stmmac/chain_mode.c   |   90 ++-
 drivers/net/ethernet/stmicro/stmmac/common.h       |  122 ++-
 drivers/net/ethernet/stmicro/stmmac/descs.h        |   51 +-
 drivers/net/ethernet/stmicro/stmmac/descs_com.h    |   44 +-
 drivers/net/ethernet/stmicro/stmmac/dwmac1000.h    |   40 +-
 .../net/ethernet/stmicro/stmmac/dwmac1000_core.c   |  104 ++-
 .../net/ethernet/stmicro/stmmac/dwmac1000_dma.c    |    8 +-
 .../net/ethernet/stmicro/stmmac/dwmac100_core.c    |    3 +-
 drivers/net/ethernet/stmicro/stmmac/dwmac100_dma.c |    4 +-
 drivers/net/ethernet/stmicro/stmmac/enh_desc.c     |  151 +++-
 drivers/net/ethernet/stmicro/stmmac/norm_desc.c    |   85 ++-
 drivers/net/ethernet/stmicro/stmmac/ring_mode.c    |   38 +-
 drivers/net/ethernet/stmicro/stmmac/stmmac.h       |   23 +-
 .../net/ethernet/stmicro/stmmac/stmmac_ethtool.c   |  156 +++-
 .../net/ethernet/stmicro/stmmac/stmmac_hwtstamp.c  |  148 +++
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c  |  994 ++++++++++++++++----
 drivers/net/ethernet/stmicro/stmmac/stmmac_ptp.c   |  215 +++++
 drivers/net/ethernet/stmicro/stmmac/stmmac_ptp.h   |   74 ++
 21 files changed, 2019 insertions(+), 391 deletions(-)
 create mode 100644 drivers/net/ethernet/stmicro/stmmac/stmmac_hwtstamp.c
 create mode 100644 drivers/net/ethernet/stmicro/stmmac/stmmac_ptp.c
 create mode 100644 drivers/net/ethernet/stmicro/stmmac/stmmac_ptp.h

-- 
1.7.4.4
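
For illustration, selecting chain or ring mode at probe time instead of via
Kconfig means both variants are always compiled and one is picked at runtime.
Below is a minimal sketch of that pattern, assuming a simple flag supplied by
the platform. All toy_* names are hypothetical, and the real driver's
structures are not shown in this cover letter.

```c
/*
 * Hypothetical descriptor-mode description; the real driver's
 * structures and names differ.
 */
struct toy_mode_ops {
	const char *name;
	int chained;	/* 1 if each descriptor links to the next one */
};

/* Both variants are always built, so neither can break unnoticed. */
const struct toy_mode_ops toy_ring_ops  = { "ring",  0 };
const struct toy_mode_ops toy_chain_ops = { "chain", 1 };

/* Called once at probe time with a platform-supplied flag. */
const struct toy_mode_ops *toy_select_mode(int want_chain)
{
	return want_chain ? &toy_chain_ops : &toy_ring_ops;
}
```

The design choice sketched here trades a small amount of always-built code
for build coverage of both modes in every configuration.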


