Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH v2] CDC NCM: release interfaces fix in unbind()
From: Alan Stern @ 2011-05-18 14:54 UTC (permalink / raw)
  To: Alexey Orishko; +Cc: netdev, davem, oliver, linux-usb, gregkh, Alexey Orishko
In-Reply-To: <1305662344-28355-1-git-send-email-alexey.orishko@stericsson.com>

On Tue, 17 May 2011, Alexey Orishko wrote:

> Changes:
> - claim slave/data interface during bind() and release in
>  unbind() unconditionally
> - in case of error during bind(), release claimed data
>  interface in the same function
> - remove obsolited "*_claimed" entries from driver context

...

> @@ -572,42 +559,32 @@ advance:
>  		goto error;
>  
>  	/* claim interfaces, if any */
> -	if (ctx->data != intf) {
> -		temp = usb_driver_claim_interface(driver, ctx->data, dev);
> -		if (temp)
> -			goto error;
> -		ctx->data_claimed = 1;
> -	}
> -
> -	if (ctx->control != intf) {
> -		temp = usb_driver_claim_interface(driver, ctx->control, dev);
> -		if (temp)
> -			goto error;
> -		ctx->control_claimed = 1;
> -	}
> +	temp = usb_driver_claim_interface(driver, ctx->data, dev);
> +	if (temp)
> +		goto error;

Here and later on, the patch seems to have forgotten about the control 
interface.  Is this deliberate or an oversight?

Alan Stern


^ permalink raw reply

* Re: PATCH] SCTP: fix race between sctp_bind_addr_free() and sctp_bind_addr_conflict()
From: Jacek Luczak @ 2011-05-18 14:48 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, Vladislav Yasevich
In-Reply-To: <1305729762.2991.34.camel@edumazet-laptop>

2011/5/18 Eric Dumazet <eric.dumazet@gmail.com>:
> Le mercredi 18 mai 2011 à 16:36 +0200, Jacek Luczak a écrit :
>
>> ---
>>  net/sctp/bind_addr.c |   10 ++++------
>>  1 files changed, 4 insertions(+), 6 deletions(-)
>>
>> diff --git a/net/sctp/bind_addr.c b/net/sctp/bind_addr.c
>> index faf71d1..6150ac5 100644
>> --- a/net/sctp/bind_addr.c
>> +++ b/net/sctp/bind_addr.c
>> @@ -140,14 +140,12 @@ void sctp_bind_addr_init(struct sctp_bind_addr
>> *bp, __u16 port)
>>  /* Dispose of the address list. */
>
> I am afraid your mailer mangled the patch.
>
> You should have only one line :
>
> @@ -140,14 +140,12 @@ void sctp_bind_addr_init(struct sctp_bind_addr *bp, __u16 port)
>
> Instead, there is a line wrap :
>
> @@ -140,14 +140,12 @@ void sctp_bind_addr_init(struct sctp_bind_addr
> *bp, __u16 port)
>

Sadly I'm completely aware of that. I don't have any such here that
actually would not do this. Patch is attached to email just because of
that. I can resend this in the evening from home.

-Jacek

^ permalink raw reply

* Re: PATCH] SCTP: fix race between sctp_bind_addr_free() and sctp_bind_addr_conflict()
From: Eric Dumazet @ 2011-05-18 14:42 UTC (permalink / raw)
  To: Jacek Luczak; +Cc: netdev, Vladislav Yasevich
In-Reply-To: <BANLkTimWpmFj1gx=XvvoXFVxL9q927iWfw@mail.gmail.com>

Le mercredi 18 mai 2011 à 16:36 +0200, Jacek Luczak a écrit :

> ---
>  net/sctp/bind_addr.c |   10 ++++------
>  1 files changed, 4 insertions(+), 6 deletions(-)
> 
> diff --git a/net/sctp/bind_addr.c b/net/sctp/bind_addr.c
> index faf71d1..6150ac5 100644
> --- a/net/sctp/bind_addr.c
> +++ b/net/sctp/bind_addr.c
> @@ -140,14 +140,12 @@ void sctp_bind_addr_init(struct sctp_bind_addr
> *bp, __u16 port)
>  /* Dispose of the address list. */

I am afraid your mailer mangled the patch.

You should have only one line :

@@ -140,14 +140,12 @@ void sctp_bind_addr_init(struct sctp_bind_addr *bp, __u16 port)

Instead, there is a line wrap :

@@ -140,14 +140,12 @@ void sctp_bind_addr_init(struct sctp_bind_addr
*bp, __u16 port)




^ permalink raw reply

* Re: [PATCH V5 2/6 net-next] netdevice.h: Add zero-copy flag in netdevice
From: Shirley Ma @ 2011-05-18 14:38 UTC (permalink / raw)
  To: Michał Mirosław
  Cc: Michael S. Tsirkin, Ben Hutchings, David Miller, Eric Dumazet,
	Avi Kivity, Arnd Bergmann, netdev, kvm, linux-kernel
In-Reply-To: <BANLkTi=g_PFfS5653K8ZoQ2Jhp8DhCV1hg@mail.gmail.com>

On Wed, 2011-05-18 at 13:40 +0200, Michał Mirosław wrote:
> >> >> Not more other restrictions, skb clone is OK. pskb_expand_head()
> looks
> >> >> OK to me from code review.
> >> > Hmm. pskb_expand_head calls skb_release_data while keeping
> >> > references to pages. How is that ok? What do I miss?
> >> It's making copy of the skb_shinfo earlier, so the pages refcount
> >> stays the same.
> > Exactly. But the callback is invoked so the guest thinks it's ok to
> > change this memory. If it does a corrupted packet will be sent out.
> 
> Hmm. I tool a quick look at skb_clone(), and it looks like this
> sequence will break this scheme:
> 
> skb2 = skb_clone(skb...);
> kfree_skb(skb) or pskb_expand_head(skb);  /* callback called */
> [use skb2, pages still referenced]
> kfree_skb(skb); /* callback called again */
> 
> This sequence is common in bridge, might be in other places.
> 
> Maybe this ubuf thing should just track clones? This will make it work
> on all devices then.

The callback was only invoked when last reference of skb was gone.
skb_clone does increase skb refcnt. I tested tcpdump on lower device, it
worked.

For the sequence of:

skb_clone  -> last refcnt + 1
kfree_skb() or pskb_expand_head -> callback not called
kfree_skb() -> callback called

I will check page refcount to see whether it's balanced.

Thanks
shirley

^ permalink raw reply

* PATCH] SCTP: fix race between sctp_bind_addr_free() and sctp_bind_addr_conflict()
From: Jacek Luczak @ 2011-05-18 14:36 UTC (permalink / raw)
  To: netdev; +Cc: Vladislav Yasevich, Eric Dumazet

[-- Attachment #1: Type: text/plain, Size: 2850 bytes --]

During the sctp_close() call, we do not use rcu primitives to
destroy the address list attached to the endpoint.  At the same
time, we do the removal of addresses from this list before
attempting to remove the socket from the port hash

As a result, it is possible for another process to find the socket
in the port hash that is in the process of being closed.  It then
proceeds to traverse the address list to find the conflict, only
to have that address list suddenly disappear without rcu() critical
section.

Fix issue by closing address list removal inside RCU critical
section.

Race can result in a kernel crash with general protection fault or
kernel NULL pointer dereference:

kernel: general protection fault: 0000 [#1] SMP
kernel: RIP: 0010:[<ffffffffa02f3dde>]  [<ffffffffa02f3dde>]
sctp_bind_addr_conflict+0x64/0x82 [sctp]
kernel: Call Trace:
kernel:  [<ffffffffa02f415f>] ? sctp_get_port_local+0x17b/0x2a3 [sctp]
kernel:  [<ffffffffa02f3d45>] ? sctp_bind_addr_match+0x33/0x68 [sctp]
kernel:  [<ffffffffa02f4416>] ? sctp_do_bind+0xd3/0x141 [sctp]
kernel:  [<ffffffffa02f5030>] ? sctp_bindx_add+0x4d/0x8e [sctp]
kernel:  [<ffffffffa02f5183>] ? sctp_setsockopt_bindx+0x112/0x4a4 [sctp]
kernel:  [<ffffffff81089e82>] ? generic_file_aio_write+0x7f/0x9b
kernel:  [<ffffffffa02f763e>] ? sctp_setsockopt+0x14f/0xfee [sctp]
kernel:  [<ffffffff810c11fb>] ? do_sync_write+0xab/0xeb
kernel:  [<ffffffff810e82ab>] ? fsnotify+0x239/0x282
kernel:  [<ffffffff810c2462>] ? alloc_file+0x18/0xb1
kernel:  [<ffffffff8134a0b1>] ? compat_sys_setsockopt+0x1a5/0x1d9
kernel:  [<ffffffff8134aaf1>] ? compat_sys_socketcall+0x143/0x1a4
kernel:  [<ffffffff810467dc>] ? sysenter_dispatch+0x7/0x32

Signed-off-by: Jacek Luczak <luczak.jacek@gmail.com>
Acked-by: Vlad Yasevich <vladislav.yasevich@hp.com>
CC: Eric Dumazet <eric.dumazet@gmail.com>

---
 net/sctp/bind_addr.c |   10 ++++------
 1 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/net/sctp/bind_addr.c b/net/sctp/bind_addr.c
index faf71d1..6150ac5 100644
--- a/net/sctp/bind_addr.c
+++ b/net/sctp/bind_addr.c
@@ -140,14 +140,12 @@ void sctp_bind_addr_init(struct sctp_bind_addr
*bp, __u16 port)
 /* Dispose of the address list. */
 static void sctp_bind_addr_clean(struct sctp_bind_addr *bp)
 {
-       struct sctp_sockaddr_entry *addr;
-       struct list_head *pos, *temp;
+       struct sctp_sockaddr_entry *addr, *temp;

        /* Empty the bind address list. */
-       list_for_each_safe(pos, temp, &bp->address_list) {
-               addr = list_entry(pos, struct sctp_sockaddr_entry, list);
-               list_del(pos);
-               kfree(addr);
+       list_for_each_entry_safe(addr, temp, &bp->address_list, list) {
+               list_del_rcu(&addr->list);
+               call_rcu(&addr->rcu, sctp_local_addr_free);
                SCTP_DBG_OBJCNT_DEC(addr);
        }
 }

[-- Attachment #2: sctp_fix_close_vs_conflict_race_in_bind_addr.patch --]
[-- Type: application/octet-stream, Size: 805 bytes --]

diff --git a/net/sctp/bind_addr.c b/net/sctp/bind_addr.c
index faf71d1..6150ac5 100644
--- a/net/sctp/bind_addr.c
+++ b/net/sctp/bind_addr.c
@@ -140,14 +140,12 @@ void sctp_bind_addr_init(struct sctp_bind_addr *bp, __u16 port)
 /* Dispose of the address list. */
 static void sctp_bind_addr_clean(struct sctp_bind_addr *bp)
 {
-	struct sctp_sockaddr_entry *addr;
-	struct list_head *pos, *temp;
+	struct sctp_sockaddr_entry *addr, *temp;
 
 	/* Empty the bind address list. */
-	list_for_each_safe(pos, temp, &bp->address_list) {
-		addr = list_entry(pos, struct sctp_sockaddr_entry, list);
-		list_del(pos);
-		kfree(addr);
+	list_for_each_entry_safe(addr, temp, &bp->address_list, list) {
+		list_del_rcu(&addr->list);
+		call_rcu(&addr->rcu, sctp_local_addr_free);
 		SCTP_DBG_OBJCNT_DEC(addr);
 	}
 }

^ permalink raw reply related

* Re: [PATCH net-2.6] net: add skb_dst_force() in sock_queue_err_skb()
From: Eric Dumazet @ 2011-05-18 14:34 UTC (permalink / raw)
  To: Witold Baryluk; +Cc: David Miller, netdev, shemminger
In-Reply-To: <20110518142519.GH3539@smp.if.uj.edu.pl>

Le mercredi 18 mai 2011 à 16:25 +0200, Witold Baryluk a écrit :

> Patch solves problem for me.
> 
> Initial warning was looking harmless anyway to me,
> but good it is fixed now.

It was a real bug, thanks for reporting it and testing the fix.



^ permalink raw reply

* Re: [PATCH net-2.6] net: add skb_dst_force() in sock_queue_err_skb()
From: Witold Baryluk @ 2011-05-18 14:25 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, netdev, shemminger
In-Reply-To: <20110518.022151.1670373958008022789.davem@davemloft.net>

[-- Attachment #1: Type: text/plain, Size: 795 bytes --]

On 05-18 02:21, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Wed, 18 May 2011 01:10:46 +0200
> 
> > Commit 7fee226ad239 (add a noref bit on skb dst) forgot to use
> > skb_dst_force() on packets queued in sk_error_queue
> > 
> > This triggers following warning, for applications using IP_CMSG_PKTINFO
> > receiving one error status
>  ...
> > Close bug https://bugzilla.kernel.org/show_bug.cgi?id=34622
> > 
> > Reported-by: Witold Baryluk <baryluk@smp.if.uj.edu.pl>
> > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> > CC: Stephen Hemminger <shemminger@vyatta.com>
> 
> Applied, thanks.

Patch solves problem for me.

Initial warning was looking harmless anyway to me,
but good it is fixed now.

Thanks.
Witek

-- 
Witold Baryluk

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply

* Re: Identifying network namespaces (was: Network namespace manipulation with file descriptors)
From: Alexey Dobriyan @ 2011-05-18 14:13 UTC (permalink / raw)
  To: David Lamparter
  Cc: Eric W. Biederman, linux-arch, linux-kernel, netdev,
	linux-fsdevel, jamal, Daniel Lezcano, Linux Containers,
	Renato Westphal
In-Reply-To: <20110518133352.GE3762520@jupiter.n2.diac24.net>

On Wed, May 18, 2011 at 4:33 PM, David Lamparter <equinox@diac24.net> wrote:
> On Wed, May 18, 2011 at 04:03:03PM +0300, Alexey Dobriyan wrote:
>> On Wed, May 18, 2011 at 3:43 PM, David Lamparter <equinox@diac24.net> wrote:
>> > -   processes cannot easily be cross referenced with each other
>> >
>> >  in the case of user space stuff running astray - like management
>> >  software crashing, routing daemons screwing up, etc. - it becomes
>> >  fairly difficult to shut down a network namespace (or even reaquire
>> >  physical devices that have been reassigned)
>>
>> It shutdowns itself when last process using netns disappeares,
>> so if you kill your routing daemons you should be fine.
>> Physical netdevices are moved to init_net.
>
> Now assume I'm running pptpd, which forks a new pppd for each
> connection. Even if I kill pptpd, the pppd keeps running... now how do I
> find the pppds that belong to that one namespace that I'm trying to
> get rid of?

That's a valid question.

>> > So, considering this set of premises (feedback welcome) I looked for
>> > some suitable means of identification. I discarded going for any process
>> > identifiers since Eric's patches allow for network namespaces without
>> > any process holding a reference, using bind mounts instead.
>>
>> If anything it should be netns->id, /proc/*/netns outputting id
>> where id is not derived from kernel pointer.

Actually it should be symlink
/proc/net/netns -> 0         # for init_net
/proc/net/netns -> u32 (> 0)   # for the rest
to extract information by 1 syscall, not 3
where netns id is totally random, so userspace won't make assumptions.

^ permalink raw reply

* Re: [PATCH] SCTP: fix race between sctp_bind_addr_free() and sctp_bind_addr_conflict()
From: Vladislav Yasevich @ 2011-05-18 13:32 UTC (permalink / raw)
  To: Jacek Luczak; +Cc: Eric Dumazet, Wei Yongjun, netdev
In-Reply-To: <BANLkTi=cDwLFSqaBt5szaFOaWmxYzkya9g@mail.gmail.com>

On 05/18/2011 09:11 AM, Jacek Luczak wrote:
> 2011/5/18 Eric Dumazet <eric.dumazet@gmail.com>:
>> Le mercredi 18 mai 2011 à 14:47 +0200, Jacek Luczak a écrit :
>>
>>> OK then, at the end what Eric suggested is IMO valid:
>>>
>>> diff --git a/net/sctp/bind_addr.c b/net/sctp/bind_addr.c
>>> index faf71d1..0025d90 100644
>>> --- a/net/sctp/bind_addr.c
>>> +++ b/net/sctp/bind_addr.c
>>> @@ -144,10 +144,9 @@ static void sctp_bind_addr_clean(struct sctp_bind_addr *bp)
>>>         struct list_head *pos, *temp;
>>>
>>>         /* Empty the bind address list. */
>>> -       list_for_each_safe(pos, temp, &bp->address_list) {
>>> -               addr = list_entry(pos, struct sctp_sockaddr_entry, list);
>>> -               list_del(pos);
>>> -               kfree(addr);
>>> +       list_for_each_entry(pos, &bp->address_list, list) {
>>
>> a 'safe' version is needed here, since we remove items in iterator.
> 
> Yep.
> 
> diff --git a/net/sctp/bind_addr.c b/net/sctp/bind_addr.c
> index faf71d1..6150ac5 100644
> --- a/net/sctp/bind_addr.c
> +++ b/net/sctp/bind_addr.c
> @@ -140,14 +140,12 @@ void sctp_bind_addr_init(struct sctp_bind_addr
> *bp, __u16 port)
>  /* Dispose of the address list. */
>  static void sctp_bind_addr_clean(struct sctp_bind_addr *bp)
>  {
> -       struct sctp_sockaddr_entry *addr;
> -       struct list_head *pos, *temp;
> +       struct sctp_sockaddr_entry *addr, *temp;
> 
>         /* Empty the bind address list. */
> -       list_for_each_safe(pos, temp, &bp->address_list) {
> -               addr = list_entry(pos, struct sctp_sockaddr_entry, list);
> -               list_del(pos);
> -               kfree(addr);
> +       list_for_each_entry_safe(addr, temp, &bp->address_list, list) {
> +               list_del_rcu(&addr->list);
> +               call_rcu(&addr->rcu, sctp_local_addr_free);
>                 SCTP_DBG_OBJCNT_DEC(addr);
>         }
>  }
> 
> Does it now look good?

Yes.  It should the fix the race.

-vlad

> 
> -Jacek
> 


^ permalink raw reply

* Re: [patch 1/1] qeth: use ndo_set_features callback for initial setup and recovery
From: Michał Mirosław @ 2011-05-18 13:43 UTC (permalink / raw)
  To: frank.blaschka; +Cc: davem, netdev, linux-s390
In-Reply-To: <20110518132905.064355130@de.ibm.com>

2011/5/18  <frank.blaschka@de.ibm.com>:
> From: Frank Blaschka <frank.blaschka@de.ibm.com>
>
> This patch uses the ndo_set_features callback during normal device
> startup or recovery to turn on hardware RX checksum. Patch was done
> with much help from Michal Miroslaw, thx!!!
>
> Signed-off-by: Frank Blaschka <frank.blaschka@de.ibm.com>

[regarding usage of ndo_set_features and friends]
Reviewed-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>

^ permalink raw reply

* Re: [PATCH] SCTP: fix race between sctp_bind_addr_free() and sctp_bind_addr_conflict()
From: Jacek Luczak @ 2011-05-18 13:39 UTC (permalink / raw)
  To: Vladislav Yasevich; +Cc: Eric Dumazet, Wei Yongjun, netdev
In-Reply-To: <4DD3CA57.5060709@hp.com>

2011/5/18 Vladislav Yasevich <vladislav.yasevich@hp.com>:
> On 05/18/2011 09:11 AM, Jacek Luczak wrote:
>> 2011/5/18 Eric Dumazet <eric.dumazet@gmail.com>:
>>> Le mercredi 18 mai 2011 à 14:47 +0200, Jacek Luczak a écrit :
>>>
>>>> OK then, at the end what Eric suggested is IMO valid:
>>>>
>>>> diff --git a/net/sctp/bind_addr.c b/net/sctp/bind_addr.c
>>>> index faf71d1..0025d90 100644
>>>> --- a/net/sctp/bind_addr.c
>>>> +++ b/net/sctp/bind_addr.c
>>>> @@ -144,10 +144,9 @@ static void sctp_bind_addr_clean(struct sctp_bind_addr *bp)
>>>>         struct list_head *pos, *temp;
>>>>
>>>>         /* Empty the bind address list. */
>>>> -       list_for_each_safe(pos, temp, &bp->address_list) {
>>>> -               addr = list_entry(pos, struct sctp_sockaddr_entry, list);
>>>> -               list_del(pos);
>>>> -               kfree(addr);
>>>> +       list_for_each_entry(pos, &bp->address_list, list) {
>>>
>>> a 'safe' version is needed here, since we remove items in iterator.
>>
>> Yep.
>>
>> diff --git a/net/sctp/bind_addr.c b/net/sctp/bind_addr.c
>> index faf71d1..6150ac5 100644
>> --- a/net/sctp/bind_addr.c
>> +++ b/net/sctp/bind_addr.c
>> @@ -140,14 +140,12 @@ void sctp_bind_addr_init(struct sctp_bind_addr
>> *bp, __u16 port)
>>  /* Dispose of the address list. */
>>  static void sctp_bind_addr_clean(struct sctp_bind_addr *bp)
>>  {
>> -       struct sctp_sockaddr_entry *addr;
>> -       struct list_head *pos, *temp;
>> +       struct sctp_sockaddr_entry *addr, *temp;
>>
>>         /* Empty the bind address list. */
>> -       list_for_each_safe(pos, temp, &bp->address_list) {
>> -               addr = list_entry(pos, struct sctp_sockaddr_entry, list);
>> -               list_del(pos);
>> -               kfree(addr);
>> +       list_for_each_entry_safe(addr, temp, &bp->address_list, list) {
>> +               list_del_rcu(&addr->list);
>> +               call_rcu(&addr->rcu, sctp_local_addr_free);
>>                 SCTP_DBG_OBJCNT_DEC(addr);
>>         }
>>  }
>>
>> Does it now look good?
>
> Yes.  It should the fix the race.
>

Thanks guys then for your guidance. I will repost final patch.

-Jacek

^ permalink raw reply

* Re: Identifying network namespaces (was: Network namespace manipulation with file descriptors)
From: David Lamparter @ 2011-05-18 13:33 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: linux-arch-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Linux Containers,
	Eric W. Biederman, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	David Lamparter
In-Reply-To: <BANLkTikmrC86hk=W84UBwhJLe_uGAN4w9w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Wed, May 18, 2011 at 04:03:03PM +0300, Alexey Dobriyan wrote:
> On Wed, May 18, 2011 at 3:43 PM, David Lamparter <equinox@diac24.net> wrote:
> > -   processes cannot easily be cross referenced with each other
> >
> >  in the case of user space stuff running astray - like management
> >  software crashing, routing daemons screwing up, etc. - it becomes
> >  fairly difficult to shut down a network namespace (or even reaquire
> >  physical devices that have been reassigned)
> 
> It shutdowns itself when last process using netns disappeares,
> so if you kill your routing daemons you should be fine.
> Physical netdevices are moved to init_net.

Now assume I'm running pptpd, which forks a new pppd for each
connection. Even if I kill pptpd, the pppd keeps running... now how do I
find the pppds that belong to that one namespace that I'm trying to
get rid of?

> > -   namespaces cannot adequately be identified to the user
> >
> >  for debugging, some kernel messages become useless. most prominently,
> >  "unregister_netdevice: waiting for lo to become free. Usage count = 123"
> >  could certainly use some clarification, *which* lo is meant...
> 
> There is no "netns %p" or something, because right now the only unique
> netns identifier is kernel pointer (which better not be exposed to userspace).
> Printing such thing would be quite useless since it's not printed
> at netns creation.

I agree printing the kernel pointer is point-less.

> > So, considering this set of premises (feedback welcome) I looked for
> > some suitable means of identification. I discarded going for any process
> > identifiers since Eric's patches allow for network namespaces without
> > any process holding a reference, using bind mounts instead.
> 
> If anything it should be netns->id, /proc/*/netns outputting id
> where id is not derived from kernel pointer.
> 
> > Solution?
> > [ using lo interface index ]
> 
> What a hack! :-)

Well, you could create another counter and count it up on namespace
creation. But the interface index is readily available to userspace as
is, and it uniquely identifies the network namespace.

(Stupidest thing you can do to break this is renaming the loopback
device; but even if you do that userspace can still look at the LOOPBACK
flag.)


-David

_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

^ permalink raw reply

* [patch 0/1] s390: one more qeth patch for net-next
From: frank.blaschka @ 2011-05-18 13:28 UTC (permalink / raw)
  To: davem; +Cc: netdev, linux-s390


Hi Dave,

thanks to the help and patience of Michal Miroslaw here is a
patch to finalize the qeth hw_features rework.

shortlog:
Frank Blaschka (1)
qeth: use ndo_set_features callback for initial setup and recovery

Thanks,
        Frank


^ permalink raw reply

* [patch 1/1] qeth: use ndo_set_features callback for initial setup and recovery
From: frank.blaschka @ 2011-05-18 13:28 UTC (permalink / raw)
  To: davem; +Cc: netdev, linux-s390
In-Reply-To: <20110518132834.728225299@de.ibm.com>

[-- Attachment #1: qeth_fix_hw_feature.patch --]
[-- Type: text/plain, Size: 3128 bytes --]

From: Frank Blaschka <frank.blaschka@de.ibm.com>

This patch uses the ndo_set_features callback during normal device
startup or recovery to turn on hardware RX checksum. Patch was done
with much help from Michal Miroslaw, thx!!!

Signed-off-by: Frank Blaschka <frank.blaschka@de.ibm.com>
---
 drivers/s390/net/qeth_l3_main.c |   77 ++++++++++++----------------------------
 1 file changed, 25 insertions(+), 52 deletions(-)

--- a/drivers/s390/net/qeth_l3_main.c
+++ b/drivers/s390/net/qeth_l3_main.c
@@ -1417,63 +1417,33 @@ int qeth_l3_set_rx_csum(struct qeth_card
 	int rc = 0;
 
 	if (on) {
-		if (card->state != CARD_STATE_DOWN) {
-			if (!qeth_is_supported(card,
-				    IPA_INBOUND_CHECKSUM))
-					return -EPERM;
-			rc = qeth_l3_send_checksum_command(card);
-			if (rc)
-				return -EIO;
-		}
-		card->dev->features |= NETIF_F_RXCSUM;
+		rc = qeth_l3_send_checksum_command(card);
+		if (rc)
+			return -EIO;
+		dev_info(&card->gdev->dev,
+			"HW Checksumming (inbound) enabled\n");
 	} else {
-		if (card->state != CARD_STATE_DOWN) {
-			rc = qeth_l3_send_simple_setassparms(card,
-				IPA_INBOUND_CHECKSUM, IPA_CMD_ASS_STOP, 0);
-			if (rc)
-				return -EIO;
-		}
-		card->dev->features &= ~NETIF_F_RXCSUM;
+		rc = qeth_l3_send_simple_setassparms(card,
+			IPA_INBOUND_CHECKSUM, IPA_CMD_ASS_STOP, 0);
+		if (rc)
+			return -EIO;
 	}
 
-	return rc;
+	return 0;
 }
 
 static int qeth_l3_start_ipa_checksum(struct qeth_card *card)
 {
-	int rc = 0;
-
 	QETH_CARD_TEXT(card, 3, "strtcsum");
 
 	if (card->dev->features & NETIF_F_RXCSUM) {
-		/* hw may have changed during offline or recovery */
-		if (!qeth_is_supported(card, IPA_INBOUND_CHECKSUM)) {
-			dev_info(&card->gdev->dev,
-			"Inbound HW Checksumming not "
-			"supported on %s,\ncontinuing "
-			"using Inbound SW Checksumming\n",
-			QETH_CARD_IFNAME(card));
-			goto update_feature;
-		}
-
-		rc = qeth_l3_send_checksum_command(card);
-		if (!rc)
-			dev_info(&card->gdev->dev,
-			"HW Checksumming (inbound) enabled\n");
-		else
-			goto update_feature;
-	} else
-		dev_info(&card->gdev->dev,
-			"Using SW checksumming on %s.\n",
-			QETH_CARD_IFNAME(card));
+		rtnl_lock();
+		/* force set_features call */
+		card->dev->features &= ~NETIF_F_RXCSUM;
+		netdev_update_features(card->dev);
+		rtnl_unlock();
+	}
 	return 0;
-
-update_feature:
-	rtnl_lock();
-	card->dev->features &= ~NETIF_F_RXCSUM;
-	netdev_update_features(card->dev);
-	rtnl_unlock();
-	return rc;
 }
 
 static int qeth_l3_start_ipa_tx_checksum(struct qeth_card *card)
@@ -3196,17 +3166,20 @@ static int qeth_l3_set_features(struct n
 {
 	struct qeth_card *card = dev->ml_priv;
 	u32 changed = dev->features ^ features;
-	int on;
+	int err;
 
 	if (!(changed & NETIF_F_RXCSUM))
 		return 0;
 
-	if (features & NETIF_F_RXCSUM)
-		on = 1;
-	else
-		on = 0;
+	if (card->state == CARD_STATE_DOWN ||
+	    card->state == CARD_STATE_RECOVER)
+		return 0;
+
+	err = qeth_l3_set_rx_csum(card, features & NETIF_F_RXCSUM);
+	if (err)
+		dev->features = features ^ NETIF_F_RXCSUM;
 
-	return qeth_l3_set_rx_csum(card, on);
+	return err;
 }
 
 static const struct ethtool_ops qeth_l3_ethtool_ops = {


^ permalink raw reply

* [PATCH net-next-2.6] be2net: Enable SR-IOV for Lancer
From: Padmanabh Ratnakar @ 2011-05-18 13:26 UTC (permalink / raw)
  To: davem, netdev; +Cc: Padmanabh Ratnakar, Mammatha Edhala

From: Mammatha Edhala <mammatha.edhala@emulex.com>

Enable SR-IOV for Lancer

Signed-off-by: Mammatha Edhala <mammatha.edhala@emulex.com>
Signed-off-by: Padmanabh Ratnakar <padmanabh.ratnakar@emulex.com>
---
 drivers/net/benet/be.h      |    5 ++++-
 drivers/net/benet/be_main.c |   17 +++++++++--------
 2 files changed, 13 insertions(+), 9 deletions(-)

diff --git a/drivers/net/benet/be.h b/drivers/net/benet/be.h
index 0b73dcf..a7db870 100644
--- a/drivers/net/benet/be.h
+++ b/drivers/net/benet/be.h
@@ -49,6 +49,7 @@
 #define OC_DEVICE_ID1		0x700	/* Device Id for BE2 cards */
 #define OC_DEVICE_ID2		0x710	/* Device Id for BE3 cards */
 #define OC_DEVICE_ID3		0xe220	/* Device id for Lancer cards */
+#define OC_DEVICE_ID4           0xe228   /* Device id for VF in Lancer */
 
 static inline char *nic_name(struct pci_dev *pdev)
 {
@@ -58,6 +59,7 @@ static inline char *nic_name(struct pci_dev *pdev)
 	case OC_DEVICE_ID2:
 		return OC_NAME_BE;
 	case OC_DEVICE_ID3:
+	case OC_DEVICE_ID4:
 		return OC_NAME_LANCER;
 	case BE_DEVICE_ID2:
 		return BE3_NAME;
@@ -383,7 +385,8 @@ struct be_adapter {
 #define BE_GEN2 2
 #define BE_GEN3 3
 
-#define lancer_chip(adapter)		(adapter->pdev->device == OC_DEVICE_ID3)
+#define lancer_chip(adapter)	((adapter->pdev->device == OC_DEVICE_ID3) || \
+				 (adapter->pdev->device == OC_DEVICE_ID4))
 
 extern const struct ethtool_ops be_ethtool_ops;
 
diff --git a/drivers/net/benet/be_main.c b/drivers/net/benet/be_main.c
index 7322a51..ce6edac 100644
--- a/drivers/net/benet/be_main.c
+++ b/drivers/net/benet/be_main.c
@@ -42,6 +42,7 @@ static DEFINE_PCI_DEVICE_TABLE(be_dev_ids) = {
 	{ PCI_DEVICE(BE_VENDOR_ID, OC_DEVICE_ID1) },
 	{ PCI_DEVICE(BE_VENDOR_ID, OC_DEVICE_ID2) },
 	{ PCI_DEVICE(EMULEX_VENDOR_ID, OC_DEVICE_ID3)},
+	{ PCI_DEVICE(EMULEX_VENDOR_ID, OC_DEVICE_ID4)},
 	{ 0 }
 };
 MODULE_DEVICE_TABLE(pci, be_dev_ids);
@@ -3161,7 +3162,8 @@ static int be_get_config(struct be_adapter *adapter)
 
 	memset(mac, 0, ETH_ALEN);
 
-	if (be_physfn(adapter)) {
+	/* A default permanent address is given to each VF for Lancer*/
+	if (be_physfn(adapter) || lancer_chip(adapter)) {
 		status = be_cmd_mac_addr_query(adapter, mac,
 			MAC_ADDRESS_TYPE_NETWORK, true /*permanent */, 0);
 
@@ -3203,6 +3205,7 @@ static int be_dev_family_check(struct be_adapter *adapter)
 		adapter->generation = BE_GEN3;
 		break;
 	case OC_DEVICE_ID3:
+	case OC_DEVICE_ID4:
 		pci_read_config_dword(pdev, SLI_INTF_REG_OFFSET, &sli_intf);
 		if_type = (sli_intf & SLI_INTF_IF_TYPE_MASK) >>
 						SLI_INTF_IF_TYPE_SHIFT;
@@ -3212,10 +3215,6 @@ static int be_dev_family_check(struct be_adapter *adapter)
 			dev_err(&pdev->dev, "SLI_INTF reg val is not valid\n");
 			return -EINVAL;
 		}
-		if (num_vfs > 0) {
-			dev_err(&pdev->dev, "VFs not supported\n");
-			return -EINVAL;
-		}
 		adapter->sli_family = ((sli_intf & SLI_INTF_FAMILY_MASK) >>
 					 SLI_INTF_FAMILY_SHIFT);
 		adapter->generation = BE_GEN3;
@@ -3381,9 +3380,11 @@ static int __devinit be_probe(struct pci_dev *pdev,
 		bool link_up;
 		u16 vf, lnk_speed;
 
-		status = be_vf_eth_addr_config(adapter);
-		if (status)
-			goto unreg_netdev;
+		if (!lancer_chip(adapter)) {
+			status = be_vf_eth_addr_config(adapter);
+			if (status)
+				goto unreg_netdev;
+		}
 
 		for (vf = 0; vf < num_vfs; vf++) {
 			status = be_cmd_link_status_query(adapter, &link_up,
-- 
1.6.0.2


^ permalink raw reply related

* Re: [PATCH] SCTP: fix race between sctp_bind_addr_free() and sctp_bind_addr_conflict()
From: Eric Dumazet @ 2011-05-18 13:20 UTC (permalink / raw)
  To: Jacek Luczak; +Cc: Vladislav Yasevich, Wei Yongjun, netdev
In-Reply-To: <BANLkTi=cDwLFSqaBt5szaFOaWmxYzkya9g@mail.gmail.com>

Le mercredi 18 mai 2011 à 15:11 +0200, Jacek Luczak a écrit :

> diff --git a/net/sctp/bind_addr.c b/net/sctp/bind_addr.c
> index faf71d1..6150ac5 100644
> --- a/net/sctp/bind_addr.c
> +++ b/net/sctp/bind_addr.c
> @@ -140,14 +140,12 @@ void sctp_bind_addr_init(struct sctp_bind_addr
> *bp, __u16 port)
>  /* Dispose of the address list. */
>  static void sctp_bind_addr_clean(struct sctp_bind_addr *bp)
>  {
> -       struct sctp_sockaddr_entry *addr;
> -       struct list_head *pos, *temp;
> +       struct sctp_sockaddr_entry *addr, *temp;
> 
>         /* Empty the bind address list. */
> -       list_for_each_safe(pos, temp, &bp->address_list) {
> -               addr = list_entry(pos, struct sctp_sockaddr_entry, list);
> -               list_del(pos);
> -               kfree(addr);
> +       list_for_each_entry_safe(addr, temp, &bp->address_list, list) {
> +               list_del_rcu(&addr->list);
> +               call_rcu(&addr->rcu, sctp_local_addr_free);
>                 SCTP_DBG_OBJCNT_DEC(addr);
>         }
>  }
> 
> Does it now look good?
> 

It seems fine yes !




^ permalink raw reply

* Re: [PATCH V5 2/6 net-next] netdevice.h: Add zero-copy flag in netdevice
From: Michael S. Tsirkin @ 2011-05-18 13:19 UTC (permalink / raw)
  To: Michał Mirosław
  Cc: Shirley Ma, Ben Hutchings, David Miller, Eric Dumazet, Avi Kivity,
	Arnd Bergmann, netdev, kvm, linux-kernel
In-Reply-To: <BANLkTikOsR9RUJboG1P=oG8ofr5BLbgQ4Q@mail.gmail.com>

On Wed, May 18, 2011 at 02:48:24PM +0200, Michał Mirosław wrote:
> W dniu 18 maja 2011 13:56 użytkownik Michael S. Tsirkin
> <mst@redhat.com> napisał:
> > On Wed, May 18, 2011 at 01:47:33PM +0200, Michał Mirosław wrote:
> >> W dniu 18 maja 2011 13:17 użytkownik Michael S. Tsirkin
> >> <mst@redhat.com> napisał:
> >> > On Wed, May 18, 2011 at 01:10:50PM +0200, Michał Mirosław wrote:
> >> >> 2011/5/18 Michael S. Tsirkin <mst@redhat.com>:
> >> >> > On Tue, May 17, 2011 at 03:28:38PM -0700, Shirley Ma wrote:
> >> >> >> On Tue, 2011-05-17 at 23:48 +0200, Michał Mirosław wrote:
> >> >> >> > 2011/5/17 Shirley Ma <mashirle@us.ibm.com>:
> >> >> >> > > Hello Michael,
> >> >> >> > >
> >> >> >> > > Looks like to use a new flag requires more time/work. I am thinking
> >> >> >> > > whether we can just use HIGHDMA flag to enable zero-copy in macvtap
> >> >> >> > to
> >> >> >> > > avoid the new flag for now since mavctap uses real NICs as lower
> >> >> >> > device?
> >> >> >> >
> >> >> >> > Is there any other restriction besides requiring driver to not recycle
> >> >> >> > the skb? Are there any drivers that recycle TX skbs?
> >> >> > Not just recycling skbs, keeping reference to any of the pages in the
> >> >> > skb. Another requirement is to invoke the callback
> >> >> > in a timely fashion.  For example virtio-net doesn't limit the time until
> >> >> > that happens (skbs are only freed when some other packet is
> >> >> > transmitted), so we need to avoid zcopy for such (nested-virt)
> >> >> > scenarious, right?
> >> >> Hmm. But every hardware driver supporting SG will keep reference to
> >> >> the pages until the packet is sent (or DMA'd to the device). This can
> >> >> take a long time if hardware queue happens to stall for some reason.
> >> > That's a fundamental property of zero copy transmit.
> >> > You can't let the application/guest reuse the memory until
> >> > no one looks at it anymore.
> >>
> >> One more question: is userspace (or whatever is sending those packets)
> >> denied from modifying passed pages? I assume it is, but just want to
> >> be sure.
> >>
> >> Best Regards,
> >> Michał Mirosław
> >
> > Good point.
> >
> > It's not denied in the sense that it still can modify them if it's
> > buggy (the pages might not be read-only).
> > But well-behaved userspace won't modify them until the callback
> > is invoked.
> >
> > That would be a problem if the underlying device is
> > a bridge where we might try to e.g. filter these packets -
> > data can get modified after the filter. We'd have to copy
> > whatever the filter accesses and use the copy - it's rarely
> > the data itself.
> >
> > That's not normally a problem for macvtap connected to a physical NIC,
> > as that already bypasses any and all filtering.
> >
> > But that's another limitation we should note in the comment,
> > and another reason to limit to specific devices.
> 
> It looks like this feature can be used only in very strict circumstances.

True. I think it's reasonable to try and start with something
restricted and then add features though - past attempts to solve the problem
generally right away did not bear fruit.

> What about tcpdump listening on the device or lowerdev? This path
> might clone the skb for any device.
> 
> Best Regards,
> Michał Mirosław

Thanks for bringing this up: taps do need to be fixed as they can hang
on to a page for unlimited time. Further, as a malicious guest can
change the packet at any time, data that taps get wouldn't be correct.
We can either linearize the problematic skbs or disable
zero copy if there are any taps for the given device.

-- 
MST

^ permalink raw reply

* Re: [PATCH] SCTP: fix race between sctp_bind_addr_free() and sctp_bind_addr_conflict()
From: Jacek Luczak @ 2011-05-18 13:11 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Vladislav Yasevich, Wei Yongjun, netdev
In-Reply-To: <1305723046.2991.19.camel@edumazet-laptop>

2011/5/18 Eric Dumazet <eric.dumazet@gmail.com>:
> Le mercredi 18 mai 2011 à 14:47 +0200, Jacek Luczak a écrit :
>
>> OK then, at the end what Eric suggested is IMO valid:
>>
>> diff --git a/net/sctp/bind_addr.c b/net/sctp/bind_addr.c
>> index faf71d1..0025d90 100644
>> --- a/net/sctp/bind_addr.c
>> +++ b/net/sctp/bind_addr.c
>> @@ -144,10 +144,9 @@ static void sctp_bind_addr_clean(struct sctp_bind_addr *bp)
>>         struct list_head *pos, *temp;
>>
>>         /* Empty the bind address list. */
>> -       list_for_each_safe(pos, temp, &bp->address_list) {
>> -               addr = list_entry(pos, struct sctp_sockaddr_entry, list);
>> -               list_del(pos);
>> -               kfree(addr);
>> +       list_for_each_entry(pos, &bp->address_list, list) {
>
> a 'safe' version is needed here, since we remove items in iterator.

Yep.

diff --git a/net/sctp/bind_addr.c b/net/sctp/bind_addr.c
index faf71d1..6150ac5 100644
--- a/net/sctp/bind_addr.c
+++ b/net/sctp/bind_addr.c
@@ -140,14 +140,12 @@ void sctp_bind_addr_init(struct sctp_bind_addr
*bp, __u16 port)
 /* Dispose of the address list. */
 static void sctp_bind_addr_clean(struct sctp_bind_addr *bp)
 {
-       struct sctp_sockaddr_entry *addr;
-       struct list_head *pos, *temp;
+       struct sctp_sockaddr_entry *addr, *temp;

        /* Empty the bind address list. */
-       list_for_each_safe(pos, temp, &bp->address_list) {
-               addr = list_entry(pos, struct sctp_sockaddr_entry, list);
-               list_del(pos);
-               kfree(addr);
+       list_for_each_entry_safe(addr, temp, &bp->address_list, list) {
+               list_del_rcu(&addr->list);
+               call_rcu(&addr->rcu, sctp_local_addr_free);
                SCTP_DBG_OBJCNT_DEC(addr);
        }
 }

Does it now look good?

-Jacek

^ permalink raw reply related

* Re: Identifying network namespaces (was: Network namespace manipulation with file descriptors)
From: Alexey Dobriyan @ 2011-05-18 13:03 UTC (permalink / raw)
  To: David Lamparter
  Cc: Eric W. Biederman, linux-arch, linux-kernel, netdev,
	linux-fsdevel, jamal, Daniel Lezcano, Linux Containers,
	Renato Westphal
In-Reply-To: <20110518124307.GD3762520@jupiter.n2.diac24.net>

On Wed, May 18, 2011 at 3:43 PM, David Lamparter <equinox@diac24.net> wrote:
> -   processes cannot easily be cross referenced with each other
>
>  in the case of user space stuff running astray - like management
>  software crashing, routing daemons screwing up, etc. - it becomes
>  fairly difficult to shut down a network namespace (or even reaquire
>  physical devices that have been reassigned)

It shutdowns itself when last process using netns disappeares,
so if you kill your routing daemons you should be fine.
Physical netdevices are moved to init_net.

> -   namespaces cannot adequately be identified to the user
>
>  for debugging, some kernel messages become useless. most prominently,
>  "unregister_netdevice: waiting for lo to become free. Usage count = 123"
>  could certainly use some clarification, *which* lo is meant...

There is no "netns %p" or something, because right now the only unique
netns identifier is kernel pointer (which better not be exposed to userspace).
Printing such thing would be quite useless since it's not printed
at netns creation.

> So, considering this set of premises (feedback welcome) I looked for
> some suitable means of identification. I discarded going for any process
> identifiers since Eric's patches allow for network namespaces without
> any process holding a reference, using bind mounts instead.

If anything it should be netns->id, /proc/*/netns outputting id
where id is not derived from kernel pointer.

> Solution?
> =========

What a hack! :-)

^ permalink raw reply

* Re: [PATCH] SCTP: fix race between sctp_bind_addr_free() and sctp_bind_addr_conflict()
From: Eric Dumazet @ 2011-05-18 12:50 UTC (permalink / raw)
  To: Jacek Luczak; +Cc: Vladislav Yasevich, Wei Yongjun, netdev
In-Reply-To: <BANLkTinQAy+jq_VUwky2rxchomuN3avPQg@mail.gmail.com>

Le mercredi 18 mai 2011 à 14:47 +0200, Jacek Luczak a écrit :

> OK then, at the end what Eric suggested is IMO valid:
> 
> diff --git a/net/sctp/bind_addr.c b/net/sctp/bind_addr.c
> index faf71d1..0025d90 100644
> --- a/net/sctp/bind_addr.c
> +++ b/net/sctp/bind_addr.c
> @@ -144,10 +144,9 @@ static void sctp_bind_addr_clean(struct sctp_bind_addr *bp)
>         struct list_head *pos, *temp;
> 
>         /* Empty the bind address list. */
> -       list_for_each_safe(pos, temp, &bp->address_list) {
> -               addr = list_entry(pos, struct sctp_sockaddr_entry, list);
> -               list_del(pos);
> -               kfree(addr);
> +       list_for_each_entry(pos, &bp->address_list, list) {

a 'safe' version is needed here, since we remove items in iterator.

> +               list_del_rcu(&pos->list);
> +               call_rcu(&pos->rcu, sctp_local_addr_free);
>                 SCTP_DBG_OBJCNT_DEC(addr);
>         }
>  }
> 
> 
> I will test this. Should be safe, avoid race not only in that case and
> it consistent.
> 
> -Jacek



^ permalink raw reply

* Re: [PATCH V5 2/6 net-next] netdevice.h: Add zero-copy flag in netdevice
From: Michał Mirosław @ 2011-05-18 12:48 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Shirley Ma, Ben Hutchings, David Miller, Eric Dumazet, Avi Kivity,
	Arnd Bergmann, netdev, kvm, linux-kernel
In-Reply-To: <20110518115602.GT7589@redhat.com>

W dniu 18 maja 2011 13:56 użytkownik Michael S. Tsirkin
<mst@redhat.com> napisał:
> On Wed, May 18, 2011 at 01:47:33PM +0200, Michał Mirosław wrote:
>> W dniu 18 maja 2011 13:17 użytkownik Michael S. Tsirkin
>> <mst@redhat.com> napisał:
>> > On Wed, May 18, 2011 at 01:10:50PM +0200, Michał Mirosław wrote:
>> >> 2011/5/18 Michael S. Tsirkin <mst@redhat.com>:
>> >> > On Tue, May 17, 2011 at 03:28:38PM -0700, Shirley Ma wrote:
>> >> >> On Tue, 2011-05-17 at 23:48 +0200, Michał Mirosław wrote:
>> >> >> > 2011/5/17 Shirley Ma <mashirle@us.ibm.com>:
>> >> >> > > Hello Michael,
>> >> >> > >
>> >> >> > > Looks like to use a new flag requires more time/work. I am thinking
>> >> >> > > whether we can just use HIGHDMA flag to enable zero-copy in macvtap
>> >> >> > to
>> >> >> > > avoid the new flag for now since mavctap uses real NICs as lower
>> >> >> > device?
>> >> >> >
>> >> >> > Is there any other restriction besides requiring driver to not recycle
>> >> >> > the skb? Are there any drivers that recycle TX skbs?
>> >> > Not just recycling skbs, keeping reference to any of the pages in the
>> >> > skb. Another requirement is to invoke the callback
>> >> > in a timely fashion.  For example virtio-net doesn't limit the time until
>> >> > that happens (skbs are only freed when some other packet is
>> >> > transmitted), so we need to avoid zcopy for such (nested-virt)
>> >> > scenarious, right?
>> >> Hmm. But every hardware driver supporting SG will keep reference to
>> >> the pages until the packet is sent (or DMA'd to the device). This can
>> >> take a long time if hardware queue happens to stall for some reason.
>> > That's a fundamental property of zero copy transmit.
>> > You can't let the application/guest reuse the memory until
>> > no one looks at it anymore.
>>
>> One more question: is userspace (or whatever is sending those packets)
>> denied from modifying passed pages? I assume it is, but just want to
>> be sure.
>>
>> Best Regards,
>> Michał Mirosław
>
> Good point.
>
> It's not denied in the sense that it still can modify them if it's
> buggy (the pages might not be read-only).
> But well-behaved userspace won't modify them until the callback
> is invoked.
>
> That would be a problem if the underlying device is
> a bridge where we might try to e.g. filter these packets -
> data can get modified after the filter. We'd have to copy
> whatever the filter accesses and use the copy - it's rarely
> the data itself.
>
> That's not normally a problem for macvtap connected to a physical NIC,
> as that already bypasses any and all filtering.
>
> But that's another limitation we should note in the comment,
> and another reason to limit to specific devices.

It looks like this feature can be used only in very strict circumstances.

What about tcpdump listening on the device or lowerdev? This path
might clone the skb for any device.

Best Regards,
Michał Mirosław

^ permalink raw reply

* Re: [PATCH] SCTP: fix race between sctp_bind_addr_free() and sctp_bind_addr_conflict()
From: Jacek Luczak @ 2011-05-18 12:47 UTC (permalink / raw)
  To: Vladislav Yasevich; +Cc: Wei Yongjun, Eric Dumazet, netdev
In-Reply-To: <4DD3BC8F.9050802@hp.com>

2011/5/18 Vladislav Yasevich <vladislav.yasevich@hp.com>:
> On 05/18/2011 05:02 AM, Wei Yongjun wrote:
>
>> fix the race between sctp_bind_addr_free() and sctp_bind_addr_conflict(), maybe you just
>> need to remove the socket from the port hash before empty the bind address list.
>> some thing like this, not test.
>>
>> diff --git a/net/sctp/endpointola.c b/net/sctp/endpointola.c
>> index c8cc24e..924d846 100644
>> --- a/net/sctp/endpointola.c
>> +++ b/net/sctp/endpointola.c
>> @@ -268,12 +268,13 @@ static void sctp_endpoint_destroy(struct sctp_endpoint *ep)
>>
>>       /* Cleanup. */
>>       sctp_inq_free(&ep->base.inqueue);
>> -     sctp_bind_addr_free(&ep->base.bind_addr);
>>
>>       /* Remove and free the port */
>>       if (sctp_sk(ep->base.sk)->bind_hash)
>>               sctp_put_port(ep->base.sk);
>>
>> +     sctp_bind_addr_free(&ep->base.bind_addr);
>> +
>>       /* Give up our hold on the sock. */
>>       if (ep->base.sk)
>>               sock_put(ep->base.sk);
>>
>>
>
> I am not sure that this will guarantee avoidance of this crash, even though it is the right
> thing to do in general.
>
> We simply make the race condition much smaller and much harder to hit.  Potentially, one
> cpu may be doing lookup of the socket while the other is doing the destroy.  If the lookup cpu
> finds the socket just as this code removes the socket from the hash, we still have this potential
> race condition.
>
> I agree with Eric, rcu_read_lock() is not strictly necessary, as what we are really after is call_rcu()
> based destruction.  We need to delay memory destruction for the rcu grace period.
>
> Thinking a little more about how bind_addr_clean() is used, it would probably benefit from getting
> converted to call_rcu().  That function is used as local clean-up in case of malloc failure; however,
> that doesn't preclude a potential race.  The fact that we do not have this race simply points out that
> we don't have any kind of parallel lookup that can hit it (and the code confirms it).  This doesn't
> mean that we wouldn't have it in the future.
>

OK then, at the end what Eric suggested is IMO valid:

diff --git a/net/sctp/bind_addr.c b/net/sctp/bind_addr.c
index faf71d1..0025d90 100644
--- a/net/sctp/bind_addr.c
+++ b/net/sctp/bind_addr.c
@@ -144,10 +144,9 @@ static void sctp_bind_addr_clean(struct sctp_bind_addr *bp)
        struct list_head *pos, *temp;

        /* Empty the bind address list. */
-       list_for_each_safe(pos, temp, &bp->address_list) {
-               addr = list_entry(pos, struct sctp_sockaddr_entry, list);
-               list_del(pos);
-               kfree(addr);
+       list_for_each_entry(pos, &bp->address_list, list) {
+               list_del_rcu(&pos->list);
+               call_rcu(&pos->rcu, sctp_local_addr_free);
                SCTP_DBG_OBJCNT_DEC(addr);
        }
 }


I will test this. Should be safe, avoid race not only in that case and
it consistent.

-Jacek

^ permalink raw reply related

* Re: Bug, kernel panic, NULL dereference , cleanup_once / icmp_route_lookup.clone.19.clone / nat , 2.6.39-rc7-git11
From: Denys Fedoryshchenko @ 2011-05-18 12:46 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1305719047.2991.4.camel@edumazet-laptop>

 On Wed, 18 May 2011 13:44:07 +0200, Eric Dumazet wrote:
> Le mercredi 18 mai 2011 à 12:05 +0200, Eric Dumazet a écrit :
>> Le mercredi 18 mai 2011 à 12:53 +0300, Denys Fedoryshchenko a écrit 
>> :
>>
>> >  Yes, i will try. I should enable SLUB debugging only, or also 
>> anything
>> >  else?
>> >
>> >  But possible it will take time to reproduce bug, seems it is 
>> occuring
>> >  rare. With 2.6.39 release i will rollout update to few hundreds 
>> PPPoE's,
>> >  maybe it will increase chances to get information.
>> >
>>
>> I would try both : slub_debug=ZFP slub_nomerge
>>
>> or maybe only slub_debug=ZFPU
>>
>> Thanks !
>>
>
> Hmm, do you have both ipv6/ipv4 trafic on your machine by any chance 
> ?
 It is NAS, has ipv6 enabled recently (i am preparing for ipv6), but 
 ipv6 is not routed anywhere, maybe just automatically addresses 
 appearing on interfaces, including the one looking to customer subnet. 
 ppp is ipv4 only.


^ permalink raw reply

* Identifying network namespaces (was: Network namespace manipulation with file descriptors)
From: David Lamparter @ 2011-05-18 12:43 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-arch, linux-kernel, netdev, linux-fsdevel, jamal,
	Daniel Lezcano, Linux Containers, Renato Westphal
In-Reply-To: <m1tyd7p7tq.fsf@fess.ebiederm.org>

On Fri, May 06, 2011 at 07:23:29PM -0700, Eric W. Biederman wrote:
> Today there are something things you can use namespaces to implement but
> the userspace code is unnecessarily complex and fragile because of
> limitations of the kernel interfaces.

(I hope I'm not duplicating stuff here, maybe this is already taken care
of...)

There still is the unresolved issue of identifying and/or equality
testing network namespaces. The problems I see are the following:

Problems
========

-   there is no simple list of the existing namespaces
 
  most importantly, network namespaces can be abused very sneakily to
  provide trojans or other malicious software with a covert
  communication channel.

  Just create a macvlan device on the local ethernet, shift it into a
  new namespace, grab an IP - and the admin of the host will be left
  dumbfounded with his plain old netstat or ss as to where those weird
  packets are coming from.

  (Granted, this needs root privileges and you can still see everything
  in /proc/<pid>/net/{tcp,udp,...})

-   processes cannot easily be cross referenced with each other

  in the case of user space stuff running astray - like management
  software crashing, routing daemons screwing up, etc. - it becomes
  fairly difficult to shut down a network namespace (or even reaquire
  physical devices that have been reassigned)

-   namespaces cannot adequately be identified to the user

  for debugging, some kernel messages become useless. most prominently,
  "unregister_netdevice: waiting for lo to become free. Usage count = 123"
  could certainly use some clarification, *which* lo is meant...

So, considering this set of premises (feedback welcome) I looked for
some suitable means of identification. I discarded going for any process
identifiers since Eric's patches allow for network namespaces without
any process holding a reference, using bind mounts instead.

Solution?
=========

I do think I found another identifier suitable for both exporting a full 
list and providing "getid" functionality: the "lo" interface index.
I am not fully sure on the rules here, but to my understanding the lo
interface index should be constant for the lifetime of a network
namespace, and furthermore the interface index namespace is global to
the kernel.

This can already be used right now:
- you can identify the init namespace since its lo index is 1
- you can use netlink (or SIOCGIFINDEX, ew.) to get your own ID
- as soon as you have the means to attach to a namespace, you
  automatically also have the means to get its ID

What might need to be added:
1) a list of lo indices
2) a way to get a hold of the namespace using its lo index
3) maybe dmesg output on creation/destruction for the abuse case above
4) possibly a file in /proc/<pid>/net that just contains the lo index,
  for ease of use

3) and 4) should be trivial; 1) and 2) could be solved by having a
directory in /proc (?) that has all the network namespace "files" with
their lo indices as names.


So, ... Opinions?


-David

^ permalink raw reply

* Re: [PATCH] SCTP: fix race between sctp_bind_addr_free() and sctp_bind_addr_conflict()
From: Vladislav Yasevich @ 2011-05-18 12:33 UTC (permalink / raw)
  To: Wei Yongjun; +Cc: Jacek Luczak, Eric Dumazet, netdev
In-Reply-To: <4DD38B30.9090601@cn.fujitsu.com>

On 05/18/2011 05:02 AM, Wei Yongjun wrote:
 
> fix the race between sctp_bind_addr_free() and sctp_bind_addr_conflict(), maybe you just
> need to remove the socket from the port hash before empty the bind address list.
> some thing like this, not test.
> 
> diff --git a/net/sctp/endpointola.c b/net/sctp/endpointola.c
> index c8cc24e..924d846 100644
> --- a/net/sctp/endpointola.c
> +++ b/net/sctp/endpointola.c
> @@ -268,12 +268,13 @@ static void sctp_endpoint_destroy(struct sctp_endpoint *ep)
>  
>  	/* Cleanup. */
>  	sctp_inq_free(&ep->base.inqueue);
> -	sctp_bind_addr_free(&ep->base.bind_addr);
>  
>  	/* Remove and free the port */
>  	if (sctp_sk(ep->base.sk)->bind_hash)
>  		sctp_put_port(ep->base.sk);
>  
> +	sctp_bind_addr_free(&ep->base.bind_addr);
> +
>  	/* Give up our hold on the sock. */
>  	if (ep->base.sk)
>  		sock_put(ep->base.sk);
> 
> 

I am not sure that this will guarantee avoidance of this crash, even though it is the right
thing to do in general.

We simply make the race condition much smaller and much harder to hit.  Potentially, one
cpu may be doing lookup of the socket while the other is doing the destroy.  If the lookup cpu
finds the socket just as this code removes the socket from the hash, we still have this potential
race condition.

I agree with Eric, rcu_read_lock() is not strictly necessary, as what we are really after is call_rcu()
based destruction.  We need to delay memory destruction for the rcu grace period.

Thinking a little more about how bind_addr_clean() is used, it would probably benefit from getting
converted to call_rcu().  That function is used as local clean-up in case of malloc failure; however,
that doesn't preclude a potential race.  The fact that we do not have this race simply points out that
we don't have any kind of parallel lookup that can hit it (and the code confirms it).  This doesn't
mean that we wouldn't have it in the future.

-vlad

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox