Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: Question regarding expected behavior of two udp sockets with SO_REUSEADDR set
From: Neil Horman @ 2010-11-21  0:50 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, davem
In-Reply-To: <1290267887.2756.75.camel@edumazet-laptop>

On Sat, Nov 20, 2010 at 04:44:47PM +0100, Eric Dumazet wrote:
> Le samedi 20 novembre 2010 à 10:04 -0500, Neil Horman a écrit :
> 
> > Agreed.  My thought was to add logic to udp_lib_lport_inuse such that, if
> > sk_reuse is set on both sockets, and the input snum is 0 (indicating autobind)
> > we should not allow binding sk to inet_sk(sk2)->num.  Thoughts?
> 
> I dont know, problem is this could be possible right now if sk2 is bound
> on 127.0.0.2 address. Adding this test would reduce possible space.
> 
Well, yes, that is in fact the exact problem that I origionally described.  Both
sockets are bound to the same port on 127.0.0.1 on the same system.  Since both
sockets have SO_REUSEADDR set, this is allowed, but when either socket sends  a
messaage, the delivery code does a lookup in __udp4_lib_lookup, and the sock
structure that gets returned there is ambiguous. It could be either of the two
bound sockets, and the result will depend on which order they exist in the
hslot->head list (since they will both by definition hash to the same bucket).

Another solution might be to treat daddr of 127 specially, and force a traversal
of the entire hslot->head list, delivering to every matching packet, but we
might need to do that for all locally owned daddrs, so I'm not sure thats a
great solution.  Other thoughts welcome, including just leave it all well enough
alone :)


> Autobind is tricky, it chooses a port while address is part of the
> problem.
> 
Agreed, I'm comming to understand that :)

Neil


^ permalink raw reply

* Re: [PATCH net-next-2.6] packet: use vzalloc()
From: Neil Horman @ 2010-11-21  0:43 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev
In-Reply-To: <1290274314.2756.84.camel@edumazet-laptop>

On Sat, Nov 20, 2010 at 06:31:54PM +0100, Eric Dumazet wrote:
> alloc_one_pg_vec_page() is supposed to return zeroed memory, so use
> vzalloc() instead of vmalloc()
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> Cc: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>

Thanks Eric!

> ---
>  net/packet/af_packet.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> index b6372dd..422705d 100644
> --- a/net/packet/af_packet.c
> +++ b/net/packet/af_packet.c
> @@ -2367,7 +2367,7 @@ static inline char *alloc_one_pg_vec_page(unsigned long order,
>  	 * __get_free_pages failed, fall back to vmalloc
>  	 */
>  	*flags |= PGV_FROM_VMALLOC;
> -	buffer = vmalloc((1 << order) * PAGE_SIZE);
> +	buffer = vzalloc((1 << order) * PAGE_SIZE);
>  
>  	if (buffer)
>  		return buffer;
> 
> 
> 

^ permalink raw reply

* [PATCH] netfilter: change IPS_*_BIT to IPS_*
From: Changli Gao @ 2010-11-21  0:13 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: David S. Miller, netfilter-devel, netdev, Changli Gao,
	Tim Gardner, Eric Dumazet

My previous patch made a mistake when converting atomic_set to a normal
bit 'or'. IPS_*_BIT should be replaced with IPS_*.

  commit 76a2d3bcfcc86e2a8044258515b86492a37631a3
  Author: Changli Gao <xiaosuo@gmail.com>
  Date:   Mon Nov 15 11:59:03 2010 +0100

  netfilter: nf_nat: don't use atomic bit operation

  As we own the conntrack and the others can't see it until we confirm it,
  we don't need to use atomic bit operation on ct->status.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Cc: Tim Gardner <tim.gardner@canonical.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/net/netfilter/nf_nat_core.h |    4 ++--
 net/ipv4/netfilter/nf_nat_core.c    |    4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/include/net/netfilter/nf_nat_core.h b/include/net/netfilter/nf_nat_core.h
index 5aec85c..3dc7b98 100644
--- a/include/net/netfilter/nf_nat_core.h
+++ b/include/net/netfilter/nf_nat_core.h
@@ -21,9 +21,9 @@ static inline int nf_nat_initialized(struct nf_conn *ct,
 				     enum nf_nat_manip_type manip)
 {
 	if (manip == IP_NAT_MANIP_SRC)
-		return ct->status & IPS_SRC_NAT_DONE_BIT;
+		return ct->status & IPS_SRC_NAT_DONE;
 	else
-		return ct->status & IPS_DST_NAT_DONE_BIT;
+		return ct->status & IPS_DST_NAT_DONE;
 }
 
 struct nlattr;
diff --git a/net/ipv4/netfilter/nf_nat_core.c b/net/ipv4/netfilter/nf_nat_core.c
index eb55835..21e6e92 100644
--- a/net/ipv4/netfilter/nf_nat_core.c
+++ b/net/ipv4/netfilter/nf_nat_core.c
@@ -323,9 +323,9 @@ nf_nat_setup_info(struct nf_conn *ct,
 
 	/* It's done. */
 	if (maniptype == IP_NAT_MANIP_DST)
-		ct->status |= IPS_DST_NAT_DONE_BIT;
+		ct->status |= IPS_DST_NAT_DONE;
 	else
-		ct->status |= IPS_SRC_NAT_DONE_BIT;
+		ct->status |= IPS_SRC_NAT_DONE;
 
 	return NF_ACCEPT;
 }

^ permalink raw reply related

* Re: [PATCH] netfilter: don't use atomic bit operation
From: Changli Gao @ 2010-11-20 23:40 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: David S. Miller, netfilter-devel, netdev
In-Reply-To: <4CE1129E.5040505@trash.net>

On Mon, Nov 15, 2010 at 6:59 PM, Patrick McHardy <kaber@trash.net> wrote:
> On 14.11.2010 10:05, Changli Gao wrote:
>> As we own ct, and the others can't see it until we confirm it, we don't
>> need to use atomic bit operation on ct->status.
>>
>> Signed-off-by: Changli Gao <xiaosuo@gmail.com>
>> ---
>>  include/net/netfilter/nf_nat_core.h |    4 ++--
>>  net/ipv4/netfilter/nf_nat_core.c    |    4 ++--
>>  2 files changed, 4 insertions(+), 4 deletions(-)
>> diff --git a/include/net/netfilter/nf_nat_core.h b/include/net/netfilter/nf_nat_core.h
>> index 33602ab..52ac1d8 100644
>> --- a/include/net/netfilter/nf_nat_core.h
>> +++ b/include/net/netfilter/nf_nat_core.h
>> @@ -21,9 +21,9 @@ static inline int nf_nat_initialized(struct nf_conn *ct,
>>                                    enum nf_nat_manip_type manip)
>>  {
>>       if (manip == IP_NAT_MANIP_SRC)
>> -             return test_bit(IPS_SRC_NAT_DONE_BIT, &ct->status);
>> +             return IPS_SRC_NAT_DONE_BIT & ct->status;
>>       else
>> -             return test_bit(IPS_DST_NAT_DONE_BIT, &ct->status);
>> +             return IPS_DST_NAT_DONE_BIT & ct->status;
>>  }
>
> Looks fine, but I changed the order to ct->status & ...
>

Sorry, I made a mistake. The suffix _BIT should be removed.


-- 
Regards,
Changli Gao(xiaosuo@gmail.com)
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] netfilter: remove an atomic bit operation
From: Changli Gao @ 2010-11-20 23:38 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: tim.gardner, Patrick McHardy, David S. Miller, netfilter-devel,
	netdev
In-Reply-To: <1290271804.2756.78.camel@edumazet-laptop>

On Sun, Nov 21, 2010 at 12:50 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Le samedi 20 novembre 2010 à 09:03 -0700, Tim Gardner a écrit :
>
>> NAK, set_bit() takes a bit number, not a mask. That is, assuming you can
>> get away with a non-atomic operation on this field. I'll defer to
>> Patrick on that.
>>
>> I think you have to use IPS_CONFIRMED instead, e.g.,
>>
>>       ct->status |= IPS_CONFIRMED;
>>
>
> Or
>
>        __set_bit(IPS_CONFIRMED_BIT, &ct->status);
>
>
> set_bit() is atomic, while __set_bit() is not
>

Oh, you are right. Thanks. My a previous patch accepted by Patrick has
the same problem.

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Winning Notice.........Congratulations
From: ITOCHU Corporation. @ 2010-11-19 19:56 UTC (permalink / raw)


We are please to inform you that your e-mail just won you 1,000.000.00 pounds
sterlings in the Yahoo/Msn.
lottery program fill your claim by sending your contact info such as
Names:Age:Address: occupation:country: Agent Name: Mrs. Rose Elvis
Email:claimsmicrosoft56@yahoo.com.hk


^ permalink raw reply

* [PATCH] bonding: change list contact to netdev@vger.kernel.org
From: Simon Horman @ 2010-11-20 21:48 UTC (permalink / raw)
  To: netdev; +Cc: Jay Vosburgh

bonding-devel@lists.sourceforge.net seems only receive spam
and discussion seems to already occur on netdev@vger.kernel.org.

Signed-off-by: Simon Horman <horms@verge.net.au>
---
 MAINTAINERS |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 476fb3a..c88b39e 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1377,7 +1377,7 @@ F:	include/net/bluetooth/
 
 BONDING DRIVER
 M:	Jay Vosburgh <fubar@us.ibm.com>
-L:	bonding-devel@lists.sourceforge.net
+L:	netdev@vger.kernel.org
 W:	http://sourceforge.net/projects/bonding/
 S:	Supported
 F:	drivers/net/bonding/
-- 
1.7.2.3


^ permalink raw reply related

* Transfer Of $30,000,000.00 Million Dollars**
From: Dr.Raymond Kuo Fung CH'IEN. @ 2010-11-19 16:28 UTC (permalink / raw)


I am the Executive  Director and Chief Financial Officer of the operations of
the Hang Seng Bank Ltd.I have a secured business suggestion for you,with a value
of $30,000,000.00 Million. Kindly contact me form more details EMAIL :
drraymondkuochien1@xnmsn.cn
Kind Regards,
Dr.Raymond Kuo Fung CH'IEN.




^ permalink raw reply

* Greetings from Japan
From: Mr. Tsutomu Kimoto @ 2010-11-20 15:14 UTC (permalink / raw)




Greetings from Japan,
I need your partnership in receiving funds on my behalf.However is not 
mandatory nor will I in any manner compel you to honor against your will. 
Please do treat with regards as I look forward to your reply and possibly our 
partnership.
Sincerely,
Mr. Tsutomu Kimoto

-------------------------------------------------
This mail sent through IMP: http://horde.org/imp/


^ permalink raw reply

* [PATCH net-2.6] net: allow GFP_HIGHMEM in __vmalloc()
From: Eric Dumazet @ 2010-11-20 17:46 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

We forgot to use __GFP_HIGHMEM in several __vmalloc() calls.

In ceph, add the missing flag.

In fib_trie.c, xfrm_hash.c and request_sock.c, using vzalloc() is
cleaner and allows using HIGHMEM pages as well.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/ceph/buffer.c       |    2 +-
 net/core/request_sock.c |    4 +---
 net/ipv4/fib_trie.c     |    2 +-
 net/xfrm/xfrm_hash.c    |    2 +-
 4 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/net/ceph/buffer.c b/net/ceph/buffer.c
index 53d8abf..bf3e6a1 100644
--- a/net/ceph/buffer.c
+++ b/net/ceph/buffer.c
@@ -19,7 +19,7 @@ struct ceph_buffer *ceph_buffer_new(size_t len, gfp_t gfp)
 	if (b->vec.iov_base) {
 		b->is_vmalloc = false;
 	} else {
-		b->vec.iov_base = __vmalloc(len, gfp, PAGE_KERNEL);
+		b->vec.iov_base = __vmalloc(len, gfp | __GFP_HIGHMEM, PAGE_KERNEL);
 		if (!b->vec.iov_base) {
 			kfree(b);
 			return NULL;
diff --git a/net/core/request_sock.c b/net/core/request_sock.c
index 7552495..fceeb37 100644
--- a/net/core/request_sock.c
+++ b/net/core/request_sock.c
@@ -45,9 +45,7 @@ int reqsk_queue_alloc(struct request_sock_queue *queue,
 	nr_table_entries = roundup_pow_of_two(nr_table_entries + 1);
 	lopt_size += nr_table_entries * sizeof(struct request_sock *);
 	if (lopt_size > PAGE_SIZE)
-		lopt = __vmalloc(lopt_size,
-			GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO,
-			PAGE_KERNEL);
+		lopt = vzalloc(lopt_size);
 	else
 		lopt = kzalloc(lopt_size, GFP_KERNEL);
 	if (lopt == NULL)
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 200eb53..0f28034 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -365,7 +365,7 @@ static struct tnode *tnode_alloc(size_t size)
 	if (size <= PAGE_SIZE)
 		return kzalloc(size, GFP_KERNEL);
 	else
-		return __vmalloc(size, GFP_KERNEL | __GFP_ZERO, PAGE_KERNEL);
+		return vzalloc(size);
 }
 
 static void __tnode_vfree(struct work_struct *arg)
diff --git a/net/xfrm/xfrm_hash.c b/net/xfrm/xfrm_hash.c
index a2023ec..1e98bc0 100644
--- a/net/xfrm/xfrm_hash.c
+++ b/net/xfrm/xfrm_hash.c
@@ -19,7 +19,7 @@ struct hlist_head *xfrm_hash_alloc(unsigned int sz)
 	if (sz <= PAGE_SIZE)
 		n = kzalloc(sz, GFP_KERNEL);
 	else if (hashdist)
-		n = __vmalloc(sz, GFP_KERNEL | __GFP_ZERO, PAGE_KERNEL);
+		n = vzalloc(sz);
 	else
 		n = (struct hlist_head *)
 			__get_free_pages(GFP_KERNEL | __GFP_NOWARN | __GFP_ZERO,



^ permalink raw reply related

* [PATCH net-next-2.6] packet: use vzalloc()
From: Eric Dumazet @ 2010-11-20 17:31 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Neil Horman

alloc_one_pg_vec_page() is supposed to return zeroed memory, so use
vzalloc() instead of vmalloc()

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
---
 net/packet/af_packet.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index b6372dd..422705d 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2367,7 +2367,7 @@ static inline char *alloc_one_pg_vec_page(unsigned long order,
 	 * __get_free_pages failed, fall back to vmalloc
 	 */
 	*flags |= PGV_FROM_VMALLOC;
-	buffer = vmalloc((1 << order) * PAGE_SIZE);
+	buffer = vzalloc((1 << order) * PAGE_SIZE);
 
 	if (buffer)
 		return buffer;



^ permalink raw reply related

* Re: [PATCH] netfilter: remove an atomic bit operation
From: Eric Dumazet @ 2010-11-20 16:50 UTC (permalink / raw)
  To: tim.gardner
  Cc: Changli Gao, Patrick McHardy, David S. Miller, netfilter-devel,
	netdev
In-Reply-To: <4CE7F160.9050107@canonical.com>

Le samedi 20 novembre 2010 à 09:03 -0700, Tim Gardner a écrit :
> On 11/20/2010 12:55 AM, Changli Gao wrote:
> > As this ct won't be seen by the others, we don't need to set the
> > IPS_CONFIRMED_BIT in atomic way.
> >
> > Signed-off-by: Changli Gao<xiaosuo@gmail.com>
> > ---
> >   net/netfilter/nf_conntrack_core.c |    2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> > diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
> > index 27a5ea6..c708248 100644
> > --- a/net/netfilter/nf_conntrack_core.c
> > +++ b/net/netfilter/nf_conntrack_core.c
> > @@ -486,7 +486,7 @@ __nf_conntrack_confirm(struct sk_buff *skb)
> >   	ct->timeout.expires += jiffies;
> >   	add_timer(&ct->timeout);
> >   	atomic_inc(&ct->ct_general.use);
> > -	set_bit(IPS_CONFIRMED_BIT,&ct->status);
> > +	ct->status |= IPS_CONFIRMED_BIT;
> >
> >   	/* Since the lookup is lockless, hash insertion must be done after
> >   	 * starting the timer and setting the CONFIRMED bit. The RCU barriers
> > --

> NAK, set_bit() takes a bit number, not a mask. That is, assuming you can 
> get away with a non-atomic operation on this field. I'll defer to 
> Patrick on that.
> 
> I think you have to use IPS_CONFIRMED instead, e.g.,
> 
> 	ct->status |= IPS_CONFIRMED;
> 

Or

	__set_bit(IPS_CONFIRMED_BIT, &ct->status);


set_bit() is atomic, while __set_bit() is not



--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] netfilter: remove an atomic bit operation
From: Tim Gardner @ 2010-11-20 16:03 UTC (permalink / raw)
  To: Changli Gao; +Cc: Patrick McHardy, David S. Miller, netfilter-devel, netdev
In-Reply-To: <1290239717-12664-1-git-send-email-xiaosuo@gmail.com>

On 11/20/2010 12:55 AM, Changli Gao wrote:
> As this ct won't be seen by the others, we don't need to set the
> IPS_CONFIRMED_BIT in atomic way.
>
> Signed-off-by: Changli Gao<xiaosuo@gmail.com>
> ---
>   net/netfilter/nf_conntrack_core.c |    2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
> index 27a5ea6..c708248 100644
> --- a/net/netfilter/nf_conntrack_core.c
> +++ b/net/netfilter/nf_conntrack_core.c
> @@ -486,7 +486,7 @@ __nf_conntrack_confirm(struct sk_buff *skb)
>   	ct->timeout.expires += jiffies;
>   	add_timer(&ct->timeout);
>   	atomic_inc(&ct->ct_general.use);
> -	set_bit(IPS_CONFIRMED_BIT,&ct->status);
> +	ct->status |= IPS_CONFIRMED_BIT;
>
>   	/* Since the lookup is lockless, hash insertion must be done after
>   	 * starting the timer and setting the CONFIRMED bit. The RCU barriers
> --
> To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

NAK, set_bit() takes a bit number, not a mask. That is, assuming you can 
get away with a non-atomic operation on this field. I'll defer to 
Patrick on that.

I think you have to use IPS_CONFIRMED instead, e.g.,

	ct->status |= IPS_CONFIRMED;

rtg
-- 
Tim Gardner tim.gardner@canonical.com

^ permalink raw reply

* Re: Question regarding expected behavior of two udp sockets with SO_REUSEADDR set
From: Eric Dumazet @ 2010-11-20 15:44 UTC (permalink / raw)
  To: Neil Horman; +Cc: netdev, davem
In-Reply-To: <20101120150441.GA17907@hmsreliant.think-freely.org>

Le samedi 20 novembre 2010 à 10:04 -0500, Neil Horman a écrit :

> Agreed.  My thought was to add logic to udp_lib_lport_inuse such that, if
> sk_reuse is set on both sockets, and the input snum is 0 (indicating autobind)
> we should not allow binding sk to inet_sk(sk2)->num.  Thoughts?

I dont know, problem is this could be possible right now if sk2 is bound
on 127.0.0.2 address. Adding this test would reduce possible space.

Autobind is tricky, it chooses a port while address is part of the
problem.

Some loopback scalability problems could be solved if full range of
127.0.0.0/8 addresses were used, avoiding the dst refcount false
sharing :)




^ permalink raw reply

* Re: Question regarding expected behavior of two udp sockets with SO_REUSEADDR set
From: Eric Dumazet @ 2010-11-20 15:38 UTC (permalink / raw)
  To: Neil Horman; +Cc: netdev, davem
In-Reply-To: <20101120150441.GA17907@hmsreliant.think-freely.org>

Le samedi 20 novembre 2010 à 10:04 -0500, Neil Horman a écrit :
> https://bugzilla.redhat.com/show_bug.cgi?id=643911

I am not allowed to see this bug report.

' You are not authorized to access bug #643911 '




^ permalink raw reply

* Re: Question regarding expected behavior of two udp sockets with SO_REUSEADDR set
From: Neil Horman @ 2010-11-20 15:04 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, davem
In-Reply-To: <1290226015.2756.14.camel@edumazet-laptop>

On Sat, Nov 20, 2010 at 05:06:55AM +0100, Eric Dumazet wrote:
> Le vendredi 19 novembre 2010 à 19:48 -0500, Neil Horman a écrit :
> > Hey all-
> > 
> > 	Got a question regarding expected/desired behavior of $SUBJECT
> > 
> > 
> > I have a report of a problem with a program that opens two sockets:
> > 
> > The first socket is UDP and binds to 127.0.0.1 on a randomly selected port
> > 
> > The second socket is UDP and calls connect, sending to the first socket
> > 
> > Both sockets are part of the same process and have SO_REUSEADDR set
> > 
> > After the connect the second socket sends a message to the first socket.  The
> > first socket waits for the message by calling select().
> > 
> > Its observed that occasionally the first socket fails to receive the message,
> > which is odd, given that the system is unloaded, and this is the only message
> > being sent.  A little investigation shows that when this happens, the client and
> > the server wind up bound to the same port.
> > 
> > This happens because the second socket calls inet_autobind during the connect
> > call, and since both it and the server have SO_REUSEADDR set, it is possible
> > that the autobind will select the same port that the first socket is bound to.
> > When this happens the sendmsg path can get confused.  Specifically, when the skb
> > is delivered to the destination socket, the hash lookup might find the wrong
> > entry and enqueue the skb to the second socket instead of the first.
> > 
> > Questions:
> > 
> > 1) Is that expected?
> > 
> 
> Is SO_REUSADDR used on both sockets ?
> 
Yes, both udp sockets have SO_REUSEADDR set on them

> May I ask why SO_REUSEADDR is set in the first place on UDP sockets ?
> 
Honestly, I don't know.  This was reported to me as part of:
https://bugzilla.redhat.com/show_bug.cgi?id=643911
At first the consensus was that this bug was fixed by your patch series that
adds a secondary hash for udp sockets, but on closer inspection it appears that
this is just a case of what I described above.  Specifically, that two sockets
are inadvertently binding to the same port/address, and as such when someone
sends from socket A to socket B, A is actually the socket that receives the
frame rather than socket B (as the program might have intended).

> I use it before a bind() on a given port (non null), but apparently your
> program doesnt bind() the 2nd socket before its connect() ?
> 
Correct, the second socket is autobound, via inet_autobind as called from
connect(), which we call on the second udp socket.  When that happens the socket
is bound to a random port.  But sometimes if the socket has SO_REUSEADDR set it
winds up binding to the same port that the first socket is bound to, resulting
in the above problem.

> 
> > 2) If not, what do you think the best way to fix it is?
> > 
> > 	a) Deny autobinds to the same port when SO_REUSEADDR is set, but allow
> > explicity binds to the same port?
> > 
> > 	b) Deny both autobinds and explicit binds to the same port/addr,
> > effectively disablind SO_REUSEADDR with UDP, kind of like with listening TCP
> > sockets
> > 
> > 	c) Add magic to udp_rcv to detect skbs originating from local sockets,
> > and _dont_ deliver to the socket it originated from
> 
> Why ? Its a valid use case IMHO, even with a single socket.
> 
> > 
> > I'm inclined to say, no this is not expected behavior, and that we should fix it
> > with option A, but I'm interested in getting other opinions before I go down any
> > particular path.
> > 
> 
> autobind certainly is a problem, we tried to 'fix' it in recent past and
> had to revert some patches. We tried to allow more sockets to be used
> but we failed.
> 
Agreed.  My thought was to add logic to udp_lib_lport_inuse such that, if
sk_reuse is set on both sockets, and the input snum is 0 (indicating autobind)
we should not allow binding sk to inet_sk(sk2)->num.  Thoughts?
Neil

> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply

* [PATCH] netfilter: remove an atomic bit operation
From: Changli Gao @ 2010-11-20  7:55 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: David S. Miller, netfilter-devel, netdev, Changli Gao

As this ct won't be seen by the others, we don't need to set the
IPS_CONFIRMED_BIT in atomic way.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
---
 net/netfilter/nf_conntrack_core.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 27a5ea6..c708248 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -486,7 +486,7 @@ __nf_conntrack_confirm(struct sk_buff *skb)
 	ct->timeout.expires += jiffies;
 	add_timer(&ct->timeout);
 	atomic_inc(&ct->ct_general.use);
-	set_bit(IPS_CONFIRMED_BIT, &ct->status);
+	ct->status |= IPS_CONFIRMED_BIT;

 	/* Since the lookup is lockless, hash insertion must be done after
 	 * starting the timer and setting the CONFIRMED bit. The RCU barriers

^ permalink raw reply related

* Re: netlink stats and message ordering.
From: Ben Greear @ 2010-11-20  5:10 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: NetDev
In-Reply-To: <1290228542.2756.57.camel@edumazet-laptop>

On 11/19/2010 08:49 PM, Eric Dumazet wrote:
> Le vendredi 19 novembre 2010 à 10:20 -0800, Ben Greear a écrit :
>> I have a single netlink socket that listens for various things,
>> as well as requests netdev stats and such.
>>
>> I am seeing a case during interface creation (mac-vlans) where I get one set of stats
>> that appears to show 1 packet transmitted, and then immediately after that,
>> a second set of stats for the same device that shows all zero counters.
>>
>> Since I'm trying to handle wraps properly, my code tends to consider this
>> a wrap and of course the numbers go all wrong since it wasn't really a wrap.
>>
>> I am wondering if it's possible that netlink messages are somehow being
>> re-ordered before they are sent to my application?
>
> That's strange, maybe you could add a timestamp (or a sequence number)
> in the stats message to check if two messages are eventually re-ordered.

I am becoming suspicious that I may have a re-entrancy issue in my code..maybe while
handling a dev-create message I am calling into the netlink-read logic again before
handling all of the messages that were read in the initial call.

I'm going to have to think on this more, and if nothing obvious comes up,
I'll try your suggestion for incrementing one of the stats by one each time.

Thanks,
Ben

>
> diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
> index 93f0ba2..1d271ca 100644
> --- a/drivers/net/macvlan.c
> +++ b/drivers/net/macvlan.c
> @@ -444,6 +444,7 @@ static struct rtnl_link_stats64 *macvlan_dev_get_stats64(struct net_device *dev,
>   {
>   	struct macvlan_dev *vlan = netdev_priv(dev);
>
> +	atomic_long_inc(&dev->rx_dropped);
>   	if (vlan->pcpu_stats) {
>   		struct macvlan_pcpu_stats *p;
>   		u64 rx_packets, rx_bytes, rx_multicast, tx_packets, tx_bytes;
>


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

^ permalink raw reply

* [PATCH net-2.6] atl1c: Fix hardware type check for enabling OTP CLK
From: Ben Hutchings @ 2010-11-20  4:55 UTC (permalink / raw)
  To: Luis R. Rodriguez, Jie Yang, linux-team; +Cc: David Miller, netdev

Commit 496c185c9495629ef1c65387cb2594578393cfe0 "atl1c: Add support
for Atheros AR8152 and AR8152" added the condition:

             if (hw->nic_type == athr_l1c || hw->nic_type == athr_l2c_b)

for enabling OTP CLK, and the condition:

             if (hw->nic_type == athr_l1c || hw->nic_type == athr_l2c)

for disabling OTP CLK.  Since the two previously defined hardware
types are athr_l1c and athr_l2c, the latter condition appears to be
the correct one.  Change the former to match.

Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
---
This is compile-tested only.  It looks like the current code may fail to
read non-volatile settings on the L2C hardware since the EEPROM will not
be clocked.  Please check this.

Ben.

 drivers/net/atl1c/atl1c_hw.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/atl1c/atl1c_hw.c b/drivers/net/atl1c/atl1c_hw.c
index 919080b..1bf6720 100644
--- a/drivers/net/atl1c/atl1c_hw.c
+++ b/drivers/net/atl1c/atl1c_hw.c
@@ -82,7 +82,7 @@ static int atl1c_get_permanent_address(struct atl1c_hw *hw)
 	addr[0] = addr[1] = 0;
 	AT_READ_REG(hw, REG_OTP_CTRL, &otp_ctrl_data);
 	if (atl1c_check_eeprom_exist(hw)) {
-		if (hw->nic_type == athr_l1c || hw->nic_type == athr_l2c_b) {
+		if (hw->nic_type == athr_l1c || hw->nic_type == athr_l2c) {
 			/* Enable OTP CLK */
 			if (!(otp_ctrl_data & OTP_CTRL_CLK_EN)) {
 				otp_ctrl_data |= OTP_CTRL_CLK_EN;
-- 
1.7.2.3

^ permalink raw reply related

* Re: netlink stats and message ordering.
From: Eric Dumazet @ 2010-11-20  4:49 UTC (permalink / raw)
  To: Ben Greear; +Cc: NetDev
In-Reply-To: <4CE6BFE1.6020404@candelatech.com>

Le vendredi 19 novembre 2010 à 10:20 -0800, Ben Greear a écrit :
> I have a single netlink socket that listens for various things,
> as well as requests netdev stats and such.
> 
> I am seeing a case during interface creation (mac-vlans) where I get one set of stats
> that appears to show 1 packet transmitted, and then immediately after that,
> a second set of stats for the same device that shows all zero counters.
> 
> Since I'm trying to handle wraps properly, my code tends to consider this
> a wrap and of course the numbers go all wrong since it wasn't really a wrap.
> 
> I am wondering if it's possible that netlink messages are somehow being
> re-ordered before they are sent to my application?

That's strange, maybe you could add a timestamp (or a sequence number)
in the stats message to check if two messages are eventually re-ordered.

As a fast check you could try to increment an error counter like this.
If you see the rx_dropped going back, then for sure there is a
problem...

diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index 93f0ba2..1d271ca 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -444,6 +444,7 @@ static struct rtnl_link_stats64 *macvlan_dev_get_stats64(struct net_device *dev,
 {
 	struct macvlan_dev *vlan = netdev_priv(dev);
 
+	atomic_long_inc(&dev->rx_dropped);
 	if (vlan->pcpu_stats) {
 		struct macvlan_pcpu_stats *p;
 		u64 rx_packets, rx_bytes, rx_multicast, tx_packets, tx_bytes;



^ permalink raw reply related

* Re: [PATCH v2] of/phylib: Use device tree properties to initialize Marvell PHYs.
From: Grant Likely @ 2010-11-20  4:28 UTC (permalink / raw)
  To: David Daney
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
	devicetree-discuss-uLR06cmDAlY/bJ5BZ2RsiQ, Cyril Chemparathy,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Arnaud Patard
In-Reply-To: <1290204798-28569-1-git-send-email-ddaney-M3mlKVOIwJVv6pq1l3V1OdBPR1lH4CV8@public.gmane.org>

On Fri, Nov 19, 2010 at 02:13:18PM -0800, David Daney wrote:
> Some aspects of PHY initialization are board dependent, things like
> indicator LED connections and some clocking modes cannot be determined
> by probing.  The dev_flags element of struct phy_device can be used to
> control these things if an appropriate value can be passed from the
> Ethernet driver.  We run into problems however if the PHY connections
> are specified by the device tree.  There is no way for the Ethernet
> driver to know what flags it should pass.
> 
> If we are using the device tree, the struct phy_device will be
> populated with the device tree node corresponding to the PHY, and we
> can extract extra configuration information from there.
> 
> The next question is what should the format of that information be?
> It is highly device specific, and the device tree representation
> should not be tied to any arbitrary kernel defined constants.  A
> straight forward representation is just to specify the exact bits that
> should be set using the "marvell,reg-init" property:
> 
>       phy5: ethernet-phy@5 {
>         reg = <5>;
>         compatible = "marvell,88e1149r";
>         marvell,reg-init =
>                 /* led[0]:1000, led[1]:100, led[2]:10, led[3]:tx */
>                 <3 0x10 0 0x5777>, /* Reg 3,16 <- 0x5777 */
>                 /* mix %:0, led[0123]:drive low off hiZ */
>                 <3 0x11 0 0x00aa>, /* Reg 3,17 <- 0x00aa */
>                 /* default blink periods. */
>                 <3 0x12 0 0x4105>, /* Reg 3,18 <- 0x4105 */
>                 /* led[4]:rx, led[5]:dplx, led[45]:drive low off hiZ */
>                 <3 0x13 0 0x0a60>; /* Reg 3,19 <- 0x0a60 */
>       };
> 
>       phy6: ethernet-phy@6 {
>         reg = <6>;
>         compatible = "marvell,88e1118";
>         marvell,reg-init =
>                 /* Fix rx and tx clock transition timing */
>                 <2 0x15 0xffcf 0>, /* Reg 2,21 Clear bits 4, 5 */
>                 /* Adjust LED drive. */
>                 <3 0x11 0 0x442a>, /* Reg 3,17 <- 0442a */
>                 /* irq, blink-activity, blink-link */
>                 <3 0x10 0 0x0242>; /* Reg 3,16 <- 0x0242 */
>       };
> 
> The Marvell PHYs have a page select register at register 22 (0x16), we
> can specify any register by its page and register number.  These are
> the first and second word.  The third word contains a mask to be ANDed
> with the existing register value, and the fourth word is ORed with the
> result to yield the new register value.  The new marvell_of_reg_init
> function leaves the page select register unchanged, so a call to it
> can be dropped into the .config_init functions without unduly
> affecting the state of the PHY.
> 
> If CONFIG_OF_MDIO is not set, there is no of_node, or no
> "marvell,reg-init" property, the PHY initialization is unchanged.
> 
> Signed-off-by: David Daney <ddaney-M3mlKVOIwJVv6pq1l3V1OdBPR1lH4CV8@public.gmane.org>
> Cc: Grant Likely <grant.likely-s3s/WqlpOiPyB63q8FvJNQ@public.gmane.org>
> Cc: Cyril Chemparathy <cyril-l0cyMroinI0@public.gmane.org>
> Cc: David Daney <ddaney-M3mlKVOIwJVv6pq1l3V1OdBPR1lH4CV8@public.gmane.org>
> Cc: Arnaud Patard <arnaud.patard-dQbF7i+pzddAfugRpC6u6w@public.gmane.org>
> Cc: Benjamin Herrenschmidt <benh-XVmvHMARGAS8U2dJNN8I7kB+6BGkLq7r@public.gmane.org>

Untested/compiled, but looks good to me.

Reviewed-by: Grant Likely <grant.likely-s3s/WqlpOiPyB63q8FvJNQ@public.gmane.org>

> ---
> 
> Note: this patch now depends on the recently sent 88E1149R support
> patch, which was seperated into its own patch set.
> 
> I think I have incororated all feedback from Grant Likely and Cyril
> Chemparathy, and Milton Miller.
> 
> David Daney
> 
>  drivers/net/phy/marvell.c |   97 +++++++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 97 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/net/phy/marvell.c b/drivers/net/phy/marvell.c
> index def19d7..e8b9c53 100644
> --- a/drivers/net/phy/marvell.c
> +++ b/drivers/net/phy/marvell.c
> @@ -30,6 +30,7 @@
>  #include <linux/ethtool.h>
>  #include <linux/phy.h>
>  #include <linux/marvell_phy.h>
> +#include <linux/of.h>
>  
>  #include <asm/io.h>
>  #include <asm/irq.h>
> @@ -187,6 +188,87 @@ static int marvell_config_aneg(struct phy_device *phydev)
>  	return 0;
>  }
>  
> +#ifdef CONFIG_OF_MDIO
> +/*
> + * Set and/or override some configuration registers based on the
> + * marvell,reg-init property stored in the of_node for the phydev.
> + *
> + * marvell,reg-init = <reg-page reg mask value>,...;
> + *
> + * There may be one or more sets of <reg-page reg mask value>:
> + *
> + * reg-page: which register bank to use.
> + * reg: the register.
> + * mask: if non-zero, ANDed with existing register value.
> + * value: ORed with the masked value and written to the regiser.
> + *
> + */
> +static int marvell_of_reg_init(struct phy_device *phydev)
> +{
> +	const __be32 *paddr;
> +	int len, i, saved_page, current_page, page_changed, ret;
> +
> +	if (!phydev->dev.of_node)
> +		return 0;
> +
> +	paddr = of_get_property(phydev->dev.of_node, "marvell,reg-init", &len);
> +	if (!paddr || len < (4 * sizeof(*paddr)))
> +		return 0;
> +
> +	saved_page = phy_read(phydev, MII_MARVELL_PHY_PAGE);
> +	if (saved_page < 0)
> +		return saved_page;
> +	page_changed = 0;
> +	current_page = saved_page;
> +
> +	ret = 0;
> +	len /= sizeof(*paddr);
> +	for (i = 0; i < len - 3; i += 4) {
> +		u16 reg_page = be32_to_cpup(paddr + i);
> +		u16 reg = be32_to_cpup(paddr + i + 1);
> +		u16 mask = be32_to_cpup(paddr + i + 2);
> +		u16 val_bits = be32_to_cpup(paddr + i + 3);
> +		int val;
> +
> +		if (reg_page != current_page) {
> +			current_page = reg_page;
> +			page_changed = 1;
> +			ret = phy_write(phydev, MII_MARVELL_PHY_PAGE, reg_page);
> +			if (ret < 0)
> +				goto err;
> +		}
> +
> +		val = 0;
> +		if (mask) {
> +			val = phy_read(phydev, reg);
> +			if (val < 0) {
> +				ret = val;
> +				goto err;
> +			}
> +			val &= mask;
> +		}
> +		val |= val_bits;
> +
> +		ret = phy_write(phydev, reg, val);
> +		if (ret < 0)
> +			goto err;
> +
> +	}
> +err:
> +	if (page_changed) {
> +		i = phy_write(phydev, MII_MARVELL_PHY_PAGE, saved_page);
> +		if (ret == 0)
> +			ret = i;
> +	}
> +	return ret;
> +}
> +#else
> +static int marvell_of_reg_init(struct phy_device *phydev)
> +{
> +	return 0;
> +}
> +#endif /* CONFIG_OF_MDIO */
> +
>  static int m88e1121_config_aneg(struct phy_device *phydev)
>  {
>  	int err, oldpage, mscr;
> @@ -369,6 +451,9 @@ static int m88e1111_config_init(struct phy_device *phydev)
>  			return err;
>  	}
>  
> +	err = marvell_of_reg_init(phydev);
> +	if (err < 0)
> +		return err;
>  
>  	err = phy_write(phydev, MII_BMCR, BMCR_RESET);
>  	if (err < 0)
> @@ -421,6 +506,10 @@ static int m88e1118_config_init(struct phy_device *phydev)
>  	if (err < 0)
>  		return err;
>  
> +	err = marvell_of_reg_init(phydev);
> +	if (err < 0)
> +		return err;
> +
>  	/* Reset address */
>  	err = phy_write(phydev, MII_MARVELL_PHY_PAGE, 0x0);
>  	if (err < 0)
> @@ -447,6 +536,10 @@ static int m88e1149_config_init(struct phy_device *phydev)
>  	if (err < 0)
>  		return err;
>  
> +	err = marvell_of_reg_init(phydev);
> +	if (err < 0)
> +		return err;
> +
>  	/* Reset address */
>  	err = phy_write(phydev, MII_MARVELL_PHY_PAGE, 0x0);
>  	if (err < 0)
> @@ -518,6 +611,10 @@ static int m88e1145_config_init(struct phy_device *phydev)
>  		}
>  	}
>  
> +	err = marvell_of_reg_init(phydev);
> +	if (err < 0)
> +		return err;
> +
>  	return 0;
>  }
>  
> -- 
> 1.7.2.3
> 

^ permalink raw reply

* Re: [PATCH v2] of/phylib: Use device tree properties to initialize Marvell PHYs.
From: Grant Likely @ 2010-11-20  4:15 UTC (permalink / raw)
  To: David Daney
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
	devicetree-discuss-uLR06cmDAlY/bJ5BZ2RsiQ,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Cyril Chemparathy,
	Andy Fleming, Arnaud Patard
In-Reply-To: <4CE70A67.9010203-M3mlKVOIwJVv6pq1l3V1OdBPR1lH4CV8@public.gmane.org>

On Fri, Nov 19, 2010 at 03:38:15PM -0800, David Daney wrote:
> On 11/19/2010 03:02 PM, Andy Fleming wrote:
> >On Fri, Nov 19, 2010 at 4:13 PM, David Daney<ddaney-M3mlKVOIwJVv6pq1l3V1OdBPR1lH4CV8@public.gmane.org>  wrote:
> >>Some aspects of PHY initialization are board dependent, things like
> >>indicator LED connections and some clocking modes cannot be determined
> >>by probing.  The dev_flags element of struct phy_device can be used to
> >>control these things if an appropriate value can be passed from the
> >>Ethernet driver.  We run into problems however if the PHY connections
> >>are specified by the device tree.  There is no way for the Ethernet
> >>driver to know what flags it should pass.
> >>
> >>If we are using the device tree, the struct phy_device will be
> >>populated with the device tree node corresponding to the PHY, and we
> >>can extract extra configuration information from there.
> >>
> >>The next question is what should the format of that information be?
> >>It is highly device specific, and the device tree representation
> >>should not be tied to any arbitrary kernel defined constants.  A
> >>straight forward representation is just to specify the exact bits that
> >>should be set using the "marvell,reg-init" property:
> >>
> >>      phy5: ethernet-phy@5 {
> >>        reg =<5>;
> >>        compatible = "marvell,88e1149r";
> >>        marvell,reg-init =
> >>                /* led[0]:1000, led[1]:100, led[2]:10, led[3]:tx */
> >>                <3 0x10 0 0x5777>, /* Reg 3,16<- 0x5777 */
> >>                /* mix %:0, led[0123]:drive low off hiZ */
> >>                <3 0x11 0 0x00aa>, /* Reg 3,17<- 0x00aa */
> >>                /* default blink periods. */
> >>                <3 0x12 0 0x4105>, /* Reg 3,18<- 0x4105 */
> >>                /* led[4]:rx, led[5]:dplx, led[45]:drive low off hiZ */
> >>                <3 0x13 0 0x0a60>; /* Reg 3,19<- 0x0a60 */
> >
> >
> >My inclination is to shy away from such hard-coded values.  I agreed
> >with Grant's comment about separating into separate cells, but I think
> >specification of hard-coded register writes is too rigid.  I think a
> >better approach would be to specify configuration attributes, like:
> >
> >marvell,blink-periods =<blah>;
> >marvell,led-config =<[drive type] [indicates]>;

I'm certainly a strong advocate of describing the desired behaviour
instead of trying to list a bunch of register values or similar.
However, I think the use-case here is sufficiently straight forward
(initial conditions for a set of registers) on things that will be set
once and then forgotten about that I think that the simple approach is
appropriate.

> I considered doing that, but rejected the idea because it is so
> complex.  There are literally probably a hundred different register
> fields with widths varying between 1 and 16 bits.  The number of
> settings that the driver might want to deal with are astronomical.
> For any given PHY model there could be 30 different settings that
> you might want to tweak.  In the example above I configure 5 LEDs
> with two parameters each, and about 4 or 5 more blink settings, that
> is something like 15 register fields.
> 
> Take the marvell.c file.  It currently supports nine different PHY
> models, each would have a bunch of model specific settings, the
> complexity of decoding all that is formidable (and would be a lot
> more code) also there would invariably be settings that were not
> implemented in the first version, so people would have to repeatedly
> extend it.
> 
> My patch adds 550 bytes (not including the string storage) of code
> on MIPS64, and can handle many cases without any changes.

Right; the permutations are huge in this case, and the payoff quite
low.

> >For one, I always advocate making the DTS human-readable.  For
> >another, I think that there are a number of configuration sequences
> >that require read-modify-write, or write-wait-write.  In other words,
> >I think that there are enough cases where actual software will be
> >needed, that an attempt to generically specify a register
> >initialization sequence will be impossible, and leave us with the same
> >problems to solve later on.
> 
> We can always add more properties as you suggest, if the existing
> code does not prove to be sufficiently flexible.

That is a very different situation however.  I am *strongly* against
anything that begins to look like a scripting or interpreted commands.
Initial conditions for a set of registers are one thing, but trying to
encode rmw cycles or delay cycles enters into the realm where the
driver really needs to understand the behaviour of the device and be
responsible for talking to it appropriately.

So, to summarize my opinion:
- list of initial register settings? okay given the alternatives
- method to perform arbitrary initialization sequences, including
  delays and rmw?  right out.

> 
> > For third...ly... allowing
> >device-tree-specified register initializations might encourage
> >developers to put all of their register initializations in the device
> >tree.  Especially when they realize that the LED initialization for
> >*their* PHY has to come between two standard initialization steps in
> >the driver.  Or before.  Or after.

Probably unlikely considering the driver still needs to handle the
device in the non-devicetree use case.  However, it may be appropriate
for the driver to filter which registers it will initialize from dt
data.

> >
> >By specifying actual functionality, the driver can work around those
> >problems, while being aware of the functional goal.  However, I'm
> >aware that such a path is more difficult, and perhaps just as futile,
> >as PHY vendors frequently don't want to document what their magic
> >sequences mean.
> >
> 
> I certainly understand your desire to make the dts readable, but I
> think it has to be balanced against the complexity and chance for
> bugs in the drivers as well as churn in the bindings specifications.
> 
> David Daney

^ permalink raw reply

* Re: Question regarding expected behavior of two udp sockets with SO_REUSEADDR set
From: Eric Dumazet @ 2010-11-20  4:06 UTC (permalink / raw)
  To: Neil Horman; +Cc: netdev, davem
In-Reply-To: <20101120004847.GA2590@hmsreliant.think-freely.org>

Le vendredi 19 novembre 2010 à 19:48 -0500, Neil Horman a écrit :
> Hey all-
> 
> 	Got a question regarding expected/desired behavior of $SUBJECT
> 
> 
> I have a report of a problem with a program that opens two sockets:
> 
> The first socket is UDP and binds to 127.0.0.1 on a randomly selected port
> 
> The second socket is UDP and calls connect, sending to the first socket
> 
> Both sockets are part of the same process and have SO_REUSEADDR set
> 
> After the connect the second socket sends a message to the first socket.  The
> first socket waits for the message by calling select().
> 
> Its observed that occasionally the first socket fails to receive the message,
> which is odd, given that the system is unloaded, and this is the only message
> being sent.  A little investigation shows that when this happens, the client and
> the server wind up bound to the same port.
> 
> This happens because the second socket calls inet_autobind during the connect
> call, and since both it and the server have SO_REUSEADDR set, it is possible
> that the autobind will select the same port that the first socket is bound to.
> When this happens the sendmsg path can get confused.  Specifically, when the skb
> is delivered to the destination socket, the hash lookup might find the wrong
> entry and enqueue the skb to the second socket instead of the first.
> 
> Questions:
> 
> 1) Is that expected?
> 

Is SO_REUSADDR used on both sockets ?

May I ask why SO_REUSEADDR is set in the first place on UDP sockets ?

I use it before a bind() on a given port (non null), but apparently your
program doesnt bind() the 2nd socket before its connect() ?


> 2) If not, what do you think the best way to fix it is?
> 
> 	a) Deny autobinds to the same port when SO_REUSEADDR is set, but allow
> explicity binds to the same port?
> 
> 	b) Deny both autobinds and explicit binds to the same port/addr,
> effectively disablind SO_REUSEADDR with UDP, kind of like with listening TCP
> sockets
> 
> 	c) Add magic to udp_rcv to detect skbs originating from local sockets,
> and _dont_ deliver to the socket it originated from

Why ? Its a valid use case IMHO, even with a single socket.

> 
> I'm inclined to say, no this is not expected behavior, and that we should fix it
> with option A, but I'm interested in getting other opinions before I go down any
> particular path.
> 

autobind certainly is a problem, we tried to 'fix' it in recent past and
had to revert some patches. We tried to allow more sockets to be used
but we failed.




^ permalink raw reply

* Question regarding expected behavior of two udp sockets with SO_REUSEADDR set
From: Neil Horman @ 2010-11-20  0:48 UTC (permalink / raw)
  To: netdev; +Cc: davem, eric.dumazet, nhorman

Hey all-

	Got a question regarding expected/desired behavior of $SUBJECT

I have a report of a problem with a program that opens two sockets:

The first socket is UDP and binds to 127.0.0.1 on a randomly selected port

The second socket is UDP and calls connect, sending to the first socket

Both sockets are part of the same process and have SO_REUSEADDR set

After the connect the second socket sends a message to the first socket.  The
first socket waits for the message by calling select().

Its observed that occasionally the first socket fails to receive the message,
which is odd, given that the system is unloaded, and this is the only message
being sent.  A little investigation shows that when this happens, the client and
the server wind up bound to the same port.

This happens because the second socket calls inet_autobind during the connect
call, and since both it and the server have SO_REUSEADDR set, it is possible
that the autobind will select the same port that the first socket is bound to.
When this happens the sendmsg path can get confused.  Specifically, when the skb
is delivered to the destination socket, the hash lookup might find the wrong
entry and enqueue the skb to the second socket instead of the first.

Questions:

1) Is that expected?

2) If not, what do you think the best way to fix it is?

	a) Deny autobinds to the same port when SO_REUSEADDR is set, but allow
explicity binds to the same port?

	b) Deny both autobinds and explicit binds to the same port/addr,
effectively disablind SO_REUSEADDR with UDP, kind of like with listening TCP
sockets

	c) Add magic to udp_rcv to detect skbs originating from local sockets,
and _dont_ deliver to the socket it originated from

I'm inclined to say, no this is not expected behavior, and that we should fix it
with option A, but I'm interested in getting other opinions before I go down any
particular path.

Thanks!
Neil

^ permalink raw reply

* Re: Regression 2.6.36 - driver rtl8169 crashes kernel, triggered by user app
From: Francois Romieu @ 2010-11-20  0:35 UTC (permalink / raw)
  To: Michael Monnerie; +Cc: linux-kernel, netdev
In-Reply-To: <201011181549.08545@zmi.at>

Michael Monnerie <michael.monnerie@is.it-management.at> :
> I did not get any answer, so here again:

I already answered one week ago, you were Cc:ed and I had my
answer delivered through vger. Please check your spambox.

Can you test again after reverting 801e147cde02f04b5c2f42764cd43a89fc7400a2 ?

-- 
Ueimor

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox