Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: UDP path MTU discovery
From: Glen Turner @ 2010-03-31 23:43 UTC (permalink / raw)
  To: Rick Jones; +Cc: Andi Kleen, netdev
In-Reply-To: <4BB0DCF6.9020401@hp.com>

On Mon, 2010-03-29 at 10:01 -0700, Rick Jones wrote:

> But which of the last N datagrams sent by the application should be retained for 
> retransmission?  It could be scores if not hundreds of datagrams depending on 
> the behaviour of the application and the latency to the narrow part of the network.

We don't need that sort of exotica from the kernel.  The applications
have to be prepared to retransmit lost packets in any case.

What we need is an API for an instant notification that a ICMP Packet
Too Big message has arrived concerning the socket.

Then the application simply retransmits immediately, without adding
to the exponential backoff penalty which the application maintains.
The application maintain a overall packet-transmitted limit to prevent
a DoS.

>From this application behaviour the kernel sees a stream of packets
it can use for UDP Path MTU Discovery (paced at the RTT, so not
contributing to congestion collapse). That stream halts when the
first packet makes it to the end system.

As for David Miller's rant, the applications currently have no choice
but to "do it stupidly" as the kernel doesn't pass enough information
for user space to do it intelligently.  If the kernel passed user space
the same indication as TCP gets, then we could -- and would -- do it
right.

Re-writing the applications to take advantage of the API is no great
shakes -- there aren't many of them, they are written by people with
a good knowledge of networking, but unfortunately they tend to do
important stuff (allocate addresses, serve names, authenticate link
layer access).

It would be nice if the API had some commonality between platforms.
But there's no shortage of #ifdefs already, and one more to make
these applications work well for IPv6 on jumbo frames on the platform
of choice for networking infrastructure would be seen by application
authors as well worthwhile.

Thanks for your consideration,
Glen

-- 
 Glen Turner
 www.gdt.id.au/~gdt

^ permalink raw reply

* Re: UDP path MTU discovery
From: Glen Turner @ 2010-03-31 23:57 UTC (permalink / raw)
  To: Andi Kleen; +Cc: David Miller, netdev
In-Reply-To: <87bpe825z2.fsf@basil.nowhere.org>

On Sun, 2010-03-28 at 10:41 +0200, Andi Kleen wrote:

> It means though that all IPv6 UDP applications essentially have
> to implement path mtu discovery support (which is non trivial) 

It is trivial from the applications point of view to let the
kernel find the UDP Path MTU. We just need more information
from the kernel as to when it would like to see those packets
(ie, for performance we'd like to feed in the packet to re-send
as soon as the ICMP Packet Too Big arrives for the previous
packet).

> Will be likely a long time until they're all fixed.

There's no need to make that assumption.  We'd very much like
transactional UDP protocols to work well in advanced networks.
The other choices -- holding down millions of TCP sockets,
or using new protocols (and there are competing proposals) --
don't exactly fill our operations teams with confidence.

We'd very much like to use UDP were we can and something else
where we must.

> Seems like a big hole not considered by the IPv6 designers?

Yeah. The sockets API for IPv6 required an additional feature that
the IETF did not foresee.

-- 
 Glen Turner
 www.gdt.id.au/~gdt

^ permalink raw reply

* Re: UDP path MTU discovery
From: Hagen Paul Pfeifer @ 2010-03-31 23:51 UTC (permalink / raw)
  To: Glen Turner; +Cc: David Miller, rick.jones2, netdev
In-Reply-To: <1270078923.2389.31.camel@ilion>

* Glen Turner | 2010-04-01 10:12:03 [+1030]:

>Does select() return from its blocking so the application can make
>use of this indication immediately, rather than after the
>application's exponentially-increasing wait?

Yes, poll() will return immediately with POLLERR.

>Is an incoming ICMP the only cause of EMSGSIZE?  That is, can an
>application safely retransmit immediately?

IIRC, yes.


Cheers, Hagen

-- 
Hagen Paul Pfeifer <hagen@jauu.net>  ||  http://jauu.net/
Telephone: +49 174 5455209           ||  Key Id: 0x98350C22
Key Fingerprint: 490F 557B 6C48 6D7E 5706 2EA2 4A22 8D45 9835 0C22

^ permalink raw reply

* Re: UDP path MTU discovery
From: Rick Jones @ 2010-04-01  0:06 UTC (permalink / raw)
  To: Hagen Paul Pfeifer; +Cc: Glen Turner, David Miller, netdev
In-Reply-To: <20100331235149.GE3042@nuttenaction>

Hagen Paul Pfeifer wrote:
> * Glen Turner | 2010-04-01 10:12:03 [+1030]:
> 
> 
>>Does select() return from its blocking so the application can make
>>use of this indication immediately, rather than after the
>>application's exponentially-increasing wait?
> 
> 
> Yes, poll() will return immediately with POLLERR.
> 
> 
>>Is an incoming ICMP the only cause of EMSGSIZE?  That is, can an
>>application safely retransmit immediately?
> 
> 
> IIRC, yes.

Under Linux perhaps, and assuming it can guess which prior send triggered the 
EMSGSIZE, but under HP-UX EMSGSIZE means you tried to send a datagram larger 
than the socket buffer:

tusc src/netperf -t UDP_RR -- -s 1024 -r 60K
...
send(4, 0x4000ee68, 61440, 0) ............................ ERR#218 EMSGSIZE

I've not checked BSD, Solaris or AIX.

On a 2.6.22 kernel where I do the same thing, it returns ENOBUFS instead.

strace src/netperf -H localhost -t UDP_RR -- -s 1024 -r 60K
...
send(4, "netperf\0netperf\0netperf\0netperf\0n"..., 61440, 0) = -1 ENOBUFS (No 
buffer space available)

Of course the send() manpage on various Linux systems I've tried says:

        EMSGSIZE
               The  socket  type  requires that message be sent atomically, and
               the size of the message to be sent made this impossible.

        ENOBUFS
               The output queue for a network interface was full.  This  gener-
               ally  indicates  that the interface has stopped sending, but may
               be caused by transient congestion.   (Normally,  this  does  not
               occur in Linux.  Packets are just silently dropped when a device
               queue overflows.)

I suppose they are old on that system.  Netperf interprets an ENOBUFS per the 
manpage, and will not exit immediately in a UDP_STREAM test, but will simply 
count the send as failed and try again.  Not sure if it is worth trying to teach 
netperf differently here or not.

rick jones

^ permalink raw reply

* Re: [PATCH 5/6] cxgb4: Add main driver file and driver Makefile
From: Dimitris Michailidis @ 2010-04-01  0:19 UTC (permalink / raw)
  To: David Miller; +Cc: shemminger, netdev
In-Reply-To: <20100330.225042.107712654.davem@davemloft.net>

David Miller wrote:
> From: Stephen Hemminger <shemminger@vyatta.com>
> Date: Tue, 30 Mar 2010 14:19:04 -0700
> 
>> On Tue, 30 Mar 2010 10:52:21 -0800
>> Dimitris Michailidis <dm@chelsio.com> wrote:
>>
>>> +static struct cxgb4_proc_entry proc_files[] = {
>>> +#ifdef CONFIG_PROC_FS
>>> +	{ "l2t", 0444, ADAP_NEED_L2T, 0, &t4_l2t_proc_fops },
>>> +#endif
>>> +	{ "lb_stats", 0444, 0, 0, &lb_stats_proc_fops },
>>> +	{ "path_mtus", 0644, 0, 0, &mtutab_proc_fops },
>>> +	{ "qstats", 0444, 0, 0, &sge_stats_proc_fops },
>>> +	{ "rss", 0444, 0, 0, &rss_proc_fops },
>>> +	{ "tcp_stats", 0444, 0, 0, &tcp_stats_proc_fops },
>>> +	{ "tids", 0444, ADAP_NEED_OFLD, 0, &tid_info_proc_fops },
>>> +	{ "tp_err_stats", 0444, 0, 0, &tp_err_stats_proc_fops },
>>> +	{ "trace0", 0644, 0, 0, &mps_trc_proc_fops },
>>> +	{ "trace1", 0644, 0, 1, &mps_trc_proc_fops },
>>> +	{ "trace2", 0644, 0, 2, &mps_trc_proc_fops },
>>> +	{ "trace3", 0644, 0, 3, &mps_trc_proc_fops },
>>> +	{ "uld", 0444, 0, 0, &uld_proc_fops },
>>> +};
>>> +
>> Do you really need this large number of /proc files.
>> It creates another stable API to worry about.  If it is just for
>> debugging move it to debugfs, or better yet just drop it.
> 
> I also find this a bit too much.
> 
> We have all kinds of ways to export whatever statistics you
> want.  In particular we have ethtool stats of which you
> can have as many as you wish.
> 
> If necessary, we could add a feature to define "views" of ethtool
> stats so that we can have domains of driver specific statistics if the
> problem is that you don't want all of these debugging stats to clutter
> up the "main" ethtool stats.

It wasn't a concern with having too many ethtool stats that led to the use 
of proc, rather it was that most of this information isn't associated with 
netdevs.  Let me give an example.  Like most virtualized NICs this device 
contains a switch.  Some of the switch statistics are provided in the proc 
files.  While they are statistics, logically they are not associated with 
any of the netdevs.  If ethtool were to be extended so that the specified 
interface would access the underlying HW to get information possibly 
unrelated to the interface personally I'd be OK with that, I am just 
pointing out what kind of extension we are talking about.

Also some of the files here aren't statistics but report other functionality 
  of the device, usually also not tied to netdevs (e.g., the trace* files 
are again related to the switch).

So, I propose getting rid of 3-4 of these files that are of lesser value and 
moving the rest to debugfs for now.  If some alternative through ethtool or 
something becomes available I can get rid of anything that can be handled 
through a more general facility.  Would that be acceptable?

^ permalink raw reply

* Re: [PATCH 3/4] xfrm: remove policy lock when accessing policy->walk.dead
From: Herbert Xu @ 2010-04-01  0:22 UTC (permalink / raw)
  To: David Miller; +Cc: hadi, timo.teras, netdev
In-Reply-To: <20100331.135735.00480230.davem@davemloft.net>

On Wed, Mar 31, 2010 at 01:57:35PM -0700, David Miller wrote:
> 
> Herbert, if it doesn't break anything, I feel the gains are very
> worthwhile.

OK, in that case please change xfrm_state_flush too.  Right now
it also notifies on empty.

> Have you actually tried to monitor the IPSEC netlink socket events
> when a daemon is running?  It's way too painful before Jamal's
> changes, and we already emit way too many semantically empty events in
> netlink.

Honestly I can't see how eliminating an empty flush event helps.
Flushes are very rare.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH 5/6] cxgb4: Add main driver file and driver Makefile
From: David Miller @ 2010-04-01  0:24 UTC (permalink / raw)
  To: dm; +Cc: shemminger, netdev
In-Reply-To: <4BB3E689.4050001@chelsio.com>

From: Dimitris Michailidis <dm@chelsio.com>
Date: Wed, 31 Mar 2010 17:19:21 -0700

> So, I propose getting rid of 3-4 of these files that are of lesser
> value and moving the rest to debugfs for now.  If some alternative
> through ethtool or something becomes available I can get rid of
> anything that can be handled through a more general facility.  Would
> that be acceptable?

You can use sysfs.

I have a similar issue on NIU wherein a top-level set of logic
behaves like a parent virtual device and contains attributes
I'd like to export.

I create a dummy platform device to represent this node and
hang the device sysfs information in that device's sysfs
directory.  It shows up as /sys/devices/platform/niu.%d
and the network ports underneath this parent appear as
symlinks named "port%d"  (part of this is setup in
drivers/net/niu.c:niu_get_parent() and niu_new_parent())

You could do something similar to represent the fact that
these statistics are virtualized and belong to some centralized
virtual entity that represents one or more actual device ports.

^ permalink raw reply

* Re: [PATCH] r8169: offical fix for CVE-2009-4537 (overlength frame DMAs)
From: Brandon Philips @ 2010-04-01  0:24 UTC (permalink / raw)
  To: Neil Horman
  Cc: linux-kernel, netdev, michael.s.gilbert, davem, romieu,
	eric.dumazet
In-Reply-To: <20100329160356.GC22733@hmsreliant.think-freely.org>

On 12:03 Mon 29 Mar 2010, Neil Horman wrote:
> Signed-off-by: Neil Horman <nhorman@redhat.com>

Cc: stable@kernel.org?
Tested-by: Brandon Philips <bphilips@suse.de>

> +	if (max_frame != 16383)
> +		printk(KERN_WARNING "WARNING! Changing of MTU on this NIC"
> +			"May lead to frame reception errors!\n");

This needs a space to look right. You might as well add the PFX too:

		printk(KERN_WARNING PFX "WARNING! Changing of MTU on this "
			"NIC may lead to frame reception errors!\n");

Thanks Neil.

Cheers,

	Brandon

^ permalink raw reply

* Re: [RFC] SPD basic actions per netdev
From: Herbert Xu @ 2010-04-01  0:33 UTC (permalink / raw)
  To: jamal; +Cc: Timo Teras, David S. Miller, Patrick McHardy, netdev
In-Reply-To: <1270053478.26743.111.camel@bigi>

On Wed, Mar 31, 2010 at 12:37:58PM -0400, jamal wrote:
> 
> This may be oversight in current implementation and possibly
> nobody has needed it before - hence it is not functional.
> 
> I want to have a drop-all policy on a per-interface level
> for incoming packets and add exceptions as i need them.
> [using the flow table is cheap if you have xfrm built in].
> i.e something along the lines of:
> 
> #eth0, wild-card drop all
> ip xfrm policy add src 0.0.0.0/0 dst 0.0.0.0/0 dev eth0 \
>        dir in ptype main action block priority $SOME-HIGH-value
> #eth0, exception
> ip xfrm policy add blah blah dev eth0 \
> dir in ptype main action allow priority $SOME-small-value
> #eth1, wild-card drop all
> ip xfrm policy add src 0.0.0.0/0 dst 0.0.0.0/0 dev eth1 \
>        dir in ptype main action block priority $SOME-HIGH-value
> #eth1 exception ...
> 
> The problem is this works as long as i dont specify an interface.
> i.e, this would work in the in-direction:
> 
> ip xfrm policy add src 0.0.0.0/0 dst 0.0.0.0/0 \
>         dir in ptype main action block priority $SOME-HIGH-value
> 
> This would not work:
> ip xfrm policy add src 0.0.0.0/0 dst 0.0.0.0/0 dev eth0 \
>        dir in ptype main action block priority $SOME-HIGH-value
> 
> 
> The checks in the selector matching is the culprit, example for v4:
> 
> __xfrm4_selector_match(struct xfrm_selector *sel, struct flowi *fl)
> {
>         return  .... &&
>                 .... &&
>                 (fl->oif == sel->ifindex || !sel->ifindex);
> }
> 
> i.e in the second case i have a non-zero sel->ifindex but
> a zero fl->oif; so it wont match.
> 
> One approach to fix this is to pass the direction then i can do
> in the function call, then i can do something along the lines of
> matching if:
> (fl_dir == FLOW_DIR_IN && (fl->iif == sel->ifindex || !sel->ifindex) ||
> (fl->oif == sel->ifindex || !sel->ifindex);
> 
> Is there any reason the selector matching only assumes fl->oif?
> Are there any unforeseen issues/breakages if i added a check for the
> above.

If we're going to change this then we should just add a second
interface field to the selector, rather than trying to overload
the existing one.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* [PATCH] ipv4: remove redundant verification code
From: Hagen Paul Pfeifer @ 2010-04-01  0:54 UTC (permalink / raw)
  To: netdev; +Cc: davem

The check if error signaling is wanted (inet->recverr != 0) is done by
the caller: raw.c:raw_err() and udp.c:__udp4_lib_err(), so there is no
need to check this condition again.

Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
---
 net/ipv4/ip_sockglue.c |    4 ----
 1 files changed, 0 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index 644dc43..f4b47ac 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -286,12 +286,8 @@ int ip_ra_control(struct sock *sk, unsigned char on,
 void ip_icmp_error(struct sock *sk, struct sk_buff *skb, int err,
 		   __be16 port, u32 info, u8 *payload)
 {
-	struct inet_sock *inet = inet_sk(sk);
 	struct sock_exterr_skb *serr;
 
-	if (!inet->recverr)
-		return;
-
 	skb = skb_clone(skb, GFP_ATOMIC);
 	if (!skb)
 		return;
-- 
1.6.6.196.g1f735.dirty


^ permalink raw reply related

* Re: UDP path MTU discovery
From: Andi Kleen @ 2010-04-01  0:55 UTC (permalink / raw)
  To: Glen Turner; +Cc: Rick Jones, Andi Kleen, netdev
In-Reply-To: <1270078984.2389.33.camel@ilion>

On Thu, Apr 01, 2010 at 10:13:04AM +1030, Glen Turner wrote:
> On Mon, 2010-03-29 at 10:01 -0700, Rick Jones wrote:
> 
> > But which of the last N datagrams sent by the application should be retained for 
> > retransmission?  It could be scores if not hundreds of datagrams depending on 
> > the behaviour of the application and the latency to the narrow part of the network.
> 
> We don't need that sort of exotica from the kernel.  The applications
> have to be prepared to retransmit lost packets in any case.
> 
> What we need is an API for an instant notification that a ICMP Packet
> Too Big message has arrived concerning the socket.

That already exists of course: IP_RECVERR

> As for David Miller's rant, the applications currently have no choice
> but to "do it stupidly" as the kernel doesn't pass enough information
> for user space to do it intelligently.  If the kernel passed user space
> the same indication as TCP gets, then we could -- and would -- do it
> right.

That's wrong. Linux has supported UDP/RAW pmtu discovery since many many
years.

I have a really old presentation on it (from 2000 or so):

http://halobates.de/net-topics/text33.htm
http://halobates.de/net-topics/text34.htm
http://halobates.de/net-topics/text35.htm
http://halobates.de/net-topics/text36.htm

It's also in the manpages.

However I suspect it's too much work to change a lot of applications
to that, so I suspect the IPV6_MIN_MTU workaround is still needed.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply

* Re: UDP path MTU discovery
From: Andi Kleen @ 2010-04-01  0:57 UTC (permalink / raw)
  To: Glen Turner; +Cc: Andi Kleen, David Miller, netdev
In-Reply-To: <1270079864.2389.48.camel@ilion>

> > Seems like a big hole not considered by the IPv6 designers?
> 
> Yeah. The sockets API for IPv6 required an additional feature that
> the IETF did not foresee.

Linux (or in this concrete case ANK) did foresee it.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply

* Re: [PATCH] r8169: offical fix for CVE-2009-4537 (overlength frame DMAs)
From: Neil Horman @ 2010-04-01  1:07 UTC (permalink / raw)
  To: Brandon Philips
  Cc: linux-kernel, netdev, michael.s.gilbert, davem, romieu,
	eric.dumazet
In-Reply-To: <20100401002450.GA19117@jenkins.home.ifup.org>

On Wed, Mar 31, 2010 at 05:24:50PM -0700, Brandon Philips wrote:
> On 12:03 Mon 29 Mar 2010, Neil Horman wrote:
> > Signed-off-by: Neil Horman <nhorman@redhat.com>
> 
> Cc: stable@kernel.org?
> Tested-by: Brandon Philips <bphilips@suse.de>
> 
> > +	if (max_frame != 16383)
> > +		printk(KERN_WARNING "WARNING! Changing of MTU on this NIC"
> > +			"May lead to frame reception errors!\n");
> 
> This needs a space to look right. You might as well add the PFX too:
> 
> 		printk(KERN_WARNING PFX "WARNING! Changing of MTU on this "
> 			"NIC may lead to frame reception errors!\n");
> 
> Thanks Neil.
> 
> Cheers,
> 
> 	Brandon
> 
Daves already merged my patch, but I'll submit a cleanup patch shortly to take
care of this.
Thanks!
Neil

^ permalink raw reply

* Re: [PATCH 5/6] cxgb4: Add main driver file and driver Makefile
From: Dimitris Michailidis @ 2010-04-01  1:34 UTC (permalink / raw)
  To: David Miller; +Cc: shemminger, netdev
In-Reply-To: <20100331.172401.00333850.davem@davemloft.net>

David Miller wrote:

>> So, I propose getting rid of 3-4 of these files that are of lesser
>> value and moving the rest to debugfs for now.  If some alternative
>> through ethtool or something becomes available I can get rid of
>> anything that can be handled through a more general facility.  Would
>> that be acceptable?
> 
> You can use sysfs.

sysfs is a possibility but I thought Stephen's initial concern was that I 
was adding too many of these proc files and that they were creating a 
potential API.  sysfs will result in a lot more files with its 
value-per-file model and I think sysfs and proc are similar in "APIness". 
So it's not clear to me how going to sysfs would address Stephen's point. 
The remove-a-few plus move-to-debugfs proposal was in order to end up with 
fewer files in a non-API filesystem.

As these builtin switches become more common I expect an official way to 
represent and access them will emerge but maybe it's not a good idea to 
introduce a sysfs model for them as part of this driver submission.

^ permalink raw reply

* Re: [PATCH 5/6] cxgb4: Add main driver file and driver Makefile
From: David Miller @ 2010-04-01  2:18 UTC (permalink / raw)
  To: dm; +Cc: shemminger, netdev
In-Reply-To: <4BB3F827.3010402@chelsio.com>

From: Dimitris Michailidis <dm@chelsio.com>
Date: Wed, 31 Mar 2010 18:34:31 -0700

> David Miller wrote:
> 
>>> So, I propose getting rid of 3-4 of these files that are of lesser
>>> value and moving the rest to debugfs for now.  If some alternative
>>> through ethtool or something becomes available I can get rid of
>>> anything that can be handled through a more general facility.  Would
>>> that be acceptable?
>> You can use sysfs.
> 
> sysfs is a possibility but I thought Stephen's initial concern was
> that I was adding too many of these proc files and that they were
> creating a potential API.

Yes, since procfs is essentially deprecated these days.

> sysfs will result in a lot more files with its value-per-file model
> and I think sysfs and proc are similar in "APIness". So it's not
> clear to me how going to sysfs would address Stephen's point. The
> remove-a-few plus move-to-debugfs proposal was in order to end up
> with fewer files in a non-API filesystem.

It's in fact easier to retain API by using sysfs.  Instead of having
to worry about the format of a procfs file listing entries one by one
per line, under sysfs you just add a new file to export new values.

> As these builtin switches become more common I expect an official way
> to represent and access them will emerge but maybe it's not a good
> idea to introduce a sysfs model for them as part of this driver
> submission.

What nodes you create under your own device object in sysfs is your
domain and your business.  Since it's one value per file there is no
real complexity in making sure tools can display the values properly.

^ permalink raw reply

* Re: [PATCH 3/4] xfrm: remove policy lock when accessing policy->walk.dead
From: jamal @ 2010-04-01  2:19 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David Miller, timo.teras, netdev
In-Reply-To: <20100401002244.GA18893@gondor.apana.org.au>

On Thu, 2010-04-01 at 08:22 +0800, Herbert Xu wrote:

> OK, in that case please change xfrm_state_flush too.  Right now
> it also notifies on empty.

It shouldnt. I will double check it tomorrow..

cheers,
jamal


^ permalink raw reply

* Re: pull request: wireless-2.6 2010-03-31
From: David Miller @ 2010-04-01  2:33 UTC (permalink / raw)
  To: linville-2XuSBdqkA4R54TAoqtyWWQ
  Cc: davem-fT/PcQaiUtJ0xCFg78z1Mw,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20100331144649.GC3024-2XuSBdqkA4R54TAoqtyWWQ@public.gmane.org>

From: "John W. Linville" <linville-2XuSBdqkA4R54TAoqtyWWQ@public.gmane.org>
Date: Wed, 31 Mar 2010 10:46:49 -0400

> Here is a batch of fixes intended for 2.6.34.  Included are a few
> device IDs, along with several almost-one-liners to fix a variety of
> issues, including a NULL deref, a potential overflow, a misues of the
> USB API, a regulatory error for iwlwifi, a race condition in mac80211,
> and some other more minor fixes.
> 
> I saw your note about only "eats someones disk" bugs.  I'm not
> sure all of these meet that test, but I hope you will take them.
> I've been sitting on them a while and letting them cook in linux-next,
> mostly to test one particular patch (which I backed-out yesterday).
> I promise I'll tighten-up after this batch! :-)

Ok, pulled, thanks John.
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [RFC] SPD basic actions per netdev
From: jamal @ 2010-04-01  2:35 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Timo Teras, David S. Miller, Patrick McHardy, netdev
In-Reply-To: <20100401003352.GA19147@gondor.apana.org.au>

On Thu, 2010-04-01 at 08:33 +0800, Herbert Xu wrote:

> If we're going to change this then we should just add a second
> interface field to the selector, rather than trying to overload
> the existing one.

Do you mean to have a selector->iif/oif? Sure that makes sense - but is
a much larger surgery and user space will need to be taught.

Did you look at the patch i sent? i tried to retain current behavior
except for the input check path. output path was working in classifying
with specific netdevs. 

cheers,
jamal

^ permalink raw reply

* Re: [Patch] bonding: fix potential deadlock in bond_uninit()
From: Cong Wang @ 2010-04-01  2:49 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, Jiri Pirko, Stephen Hemminger, netdev,
	David S. Miller, bonding-devel, Jay Vosburgh
In-Reply-To: <m1d3ykzq5a.fsf@fess.ebiederm.org>

Eric W. Biederman wrote:
> Amerigo Wang <amwang@redhat.com> writes:
> 
>> bond_uninit() is invoked with rtnl_lock held, when it does destroy_workqueue()
>> which will potentially flush all works in this workqueue, if we hold rtnl_lock
>> again in the work function, it will deadlock.
>>
>> So unlock rtnl_lock before calling destroy_workqueue().
> 
> Ouch.  That seems rather rude to our caller, and likely very
> dangerous.


This is reasonable, because workqueue flush functions will potentially
call all the work functions which could take the same lock taken before
the flush call, thus deadlock.

> 
> Is this a deadlock you actually hit, or is this something lockdep
> warned about?

It's only a lockdep warning.

> 
> My gut feel says we need to move the destroy_workqueue into
> the network device destructor.
> 

Oh, this seems a better idea, as long as the destructor are not called
with any locks holding.

Thanks!

^ permalink raw reply

* Re: [RFC] SPD basic actions per netdev
From: Herbert Xu @ 2010-04-01  2:52 UTC (permalink / raw)
  To: jamal; +Cc: Timo Teras, David S. Miller, Patrick McHardy, netdev
In-Reply-To: <1270089323.26743.138.camel@bigi>

On Wed, Mar 31, 2010 at 10:35:23PM -0400, jamal wrote:
> On Thu, 2010-04-01 at 08:33 +0800, Herbert Xu wrote:
> 
> > If we're going to change this then we should just add a second
> > interface field to the selector, rather than trying to overload
> > the existing one.
> 
> Do you mean to have a selector->iif/oif? Sure that makes sense - but is
> a much larger surgery and user space will need to be taught.
> 
> Did you look at the patch i sent? i tried to retain current behavior
> except for the input check path. output path was working in classifying
> with specific netdevs. 

OK, I guess the chances of an existing app breaking is slim.

BTW, you should treat FLOW_DIR_FWD as FLOW_DIR_IN.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: Undefined behaviour of connect(fd, NULL, 0);
From: Changli Gao @ 2010-04-01  3:00 UTC (permalink / raw)
  To: Neil Brown; +Cc: David Miller, shemminger, netdev
In-Reply-To: <20100401090756.69bfb57d@notabene.brown>

I think the following patch is what Neil wants. The old code implies that
connect(fd, NULL, 0) is used to check the socket connecting status, but
Stephen's patch breaks it. The old code is wrong when it checks the address's
faimly but not check the sizeof of the address to determine the family member
is valid or not before.

diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index be1a6ac..3ff51f0 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -576,7 +576,8 @@ int inet_stream_connect(struct socket *sock,
struct sockaddr *uaddr,

 	lock_sock(sk);

-	if (uaddr->sa_family == AF_UNSPEC) {
+	if (addr_len >= sizeof(uaddr->sa_family) &&
+	    uaddr->sa_family == AF_UNSPEC) {
 		err = sk->sk_prot->disconnect(sk, flags);
 		sock->state = err ? SS_DISCONNECTING : SS_UNCONNECTED;
 		goto out;

^ permalink raw reply related

* Re: Undefined behaviour of connect(fd, NULL, 0);
From: Neil Brown @ 2010-04-01  3:38 UTC (permalink / raw)
  To: Changli Gao; +Cc: David Miller, shemminger, netdev
In-Reply-To: <n2w412e6f7f1003312000o3a69802j120acd3946458517@mail.gmail.com>

On Thu, 1 Apr 2010 11:00:23 +0800
Changli Gao <xiaosuo@gmail.com> wrote:

> I think the following patch is what Neil wants. The old code implies that
> connect(fd, NULL, 0) is used to check the socket connecting status, but
> Stephen's patch breaks it. The old code is wrong when it checks the address's
> faimly but not check the sizeof of the address to determine the family member
> is valid or not before.
> 
> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> index be1a6ac..3ff51f0 100644
> --- a/net/ipv4/af_inet.c
> +++ b/net/ipv4/af_inet.c
> @@ -576,7 +576,8 @@ int inet_stream_connect(struct socket *sock,
> struct sockaddr *uaddr,
> 
>  	lock_sock(sk);
> 
> -	if (uaddr->sa_family == AF_UNSPEC) {
> +	if (addr_len >= sizeof(uaddr->sa_family) &&
> +	    uaddr->sa_family == AF_UNSPEC) {
>  		err = sk->sk_prot->disconnect(sk, flags);
>  		sock->state = err ? SS_DISCONNECTING : SS_UNCONNECTED;
>  		goto out;

I'm not sure I'd say that I "want" any particular patch.
I just want to know what "connect(fd, NULL, 0)" is supposed to do, and to
have the kernel be consistent in its behaviour.  I'm not really fussed what
the behaviour is.

I suspect the customer wants that patch you have supplied as it would mean
they don't need to change their code.  But I only want it if it is "right".

The patch you have provided does what I had assumed Stephen's patch did
before I actually read it properly.

My feeling is that this patch might be more useful than Stephen's as having
connect(fd, NULL, 0) do what the customer expects seems useful, where as
having it do the same as setting AF_UNSPEC doesn't add anything.

I've googled around a bit but cannot find any evidence of anyone passing NULL
to connect like this, and what documentation I can find doesn't really
address the issue at all.

Thanks,
NeilBrown

^ permalink raw reply

* Re: Undefined behaviour of connect(fd, NULL, 0);
From: Changli Gao @ 2010-04-01  4:16 UTC (permalink / raw)
  To: Neil Brown; +Cc: David Miller, shemminger, netdev
In-Reply-To: <20100401143805.1f83a161@notabene.brown>

On Thu, Apr 1, 2010 at 11:38 AM, Neil Brown <neilb@suse.de> wrote:
> I'm not sure I'd say that I "want" any particular patch.
> I just want to know what "connect(fd, NULL, 0)" is supposed to do, and to
> have the kernel be consistent in its behaviour.  I'm not really fussed what
> the behaviour is.
>
> I suspect the customer wants that patch you have supplied as it would mean
> they don't need to change their code.  But I only want it if it is "right".
>
> The patch you have provided does what I had assumed Stephen's patch did
> before I actually read it properly.
>
> My feeling is that this patch might be more useful than Stephen's as having
> connect(fd, NULL, 0) do what the customer expects seems useful, where as
> having it do the same as setting AF_UNSPEC doesn't add anything.
>
> I've googled around a bit but cannot find any evidence of anyone passing NULL
> to connect like this, and what documentation I can find doesn't really
> address the issue at all.
>

I found this from man page for connect(2)

       Generally, connection-based protocol sockets may successfully connect()
       only once; connectionless protocol sockets may use  connect()  multiple
       times to change their association.  Connectionless sockets may dissolve
       the association by connecting to an address with the  sa_family  member
       of sockaddr set to AF_UNSPEC (supported on Linux since kernel 2.2).

It only said the meaning of AF_UNSPEC for connectionless sockets, but
not connection sockets like TCP. It means that the behavior of
disconnecting the socket when AF_UNSEPC family address is met is also
undocumented.

The connect() API is a little different. If you try to call connect()
in non-blocking mode, and the API can't connect instantly, it will
return the error code for 'Operation In Progress'. When you call
connect() again, later, it may tell you 'Operation Already In
Progress' to let you know that it's still trying to connect, or it may
give you a successful return code, telling you that the connect has
been made.
from: http://www.scottklement.com/rpg/socktut/nonblocking.html

Someone may use connect() to check if the connection is established or
not. But there is no spec about the addr and addr_len value when
connect(2) is used  this way. Since there is no limit of addr and
addr_len, and we supports addr is NULL to check the status of socket
(Although it is buggy). I think we should treat it like a feature, and
the problem Neil reported is a bug.

-- 
Regards，
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply

* Re: [PATCH 1/2] phylib: Support phy module autoloading
From: Ben Hutchings @ 2010-04-01  4:34 UTC (permalink / raw)
  To: David Woodhouse; +Cc: davem, netdev, 553024
In-Reply-To: <1269998334.18090.278.camel@macbook.infradead.org>

[-- Attachment #1: Type: text/plain, Size: 5111 bytes --]

On Wed, 2010-03-31 at 02:18 +0100, David Woodhouse wrote:
> We don't use the normal hotplug mechanism because it doesn't work. It will
> load the module some time after the device appears, but that's not good
> enough for us -- we need the driver loaded _immediately_ because otherwise
> the NIC driver may just abort and then the phy 'device' goes away.
> 
> Instead, we just issue a request_module() directly in phy_device_create().
[...]

Thanks for doing this, David.  I had a stab at it earlier when this
problem was reported in Debian <http://bugs.debian.org/553024>.  I
didn't complete this because (a) I didn't understand all the details of
adding new device table type, and (b) I tried to avoid duplicating
information, which turns out to be rather difficult in modules with
multiple drivers.

Since you've dealt with (a), and (b) is not really as important, I would
just like to suggest some minor changes to your patch 1 (see below).
Feel free to fold them in.  Your patch 2 would then need the
substitutions s/phy_device_id/mdio_device_id/; s/TABLE(phy/TABLE(mdio/.

Ben.

From: Ben Hutchings <ben@decadent.org.uk>
Date: Thu, 1 Apr 2010 05:03:02 +0100
Subject: [PATCH] phylib: Minor cleanup of phylib autoloading

Refer to MDIO, consistent with other module aliases using bus names.
Change type names to __u32, consistent with the rest of the file.
Add kernel-doc comment to struct mdio_device_id.

Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
---
 drivers/net/phy/phy_device.c    |    2 +-
 include/linux/mod_devicetable.h |   22 ++++++++++++++--------
 scripts/mod/file2alias.c        |    8 ++++----
 3 files changed, 19 insertions(+), 13 deletions(-)

diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index b35ec7e..b0e54b4 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -188,7 +188,7 @@ struct phy_device* phy_device_create(struct mii_bus *bus, int addr, int phy_id)
 	   around for long enough for the driver to get loaded. With
 	   MDIO, the NIC driver will get bored and give up as soon
 	   as it finds that there's no driver _already_ loaded. */
-	sprintf(modid, "phy:" PHYID_FMT, PHYID_ARGS(phy_id));
+	sprintf(modid, MDIO_MODULE_PREFIX MDIO_ID_FMT, MDIO_ID_ARGS(phy_id));
 	request_module(modid);
 #endif
 
diff --git a/include/linux/mod_devicetable.h b/include/linux/mod_devicetable.h
index 0c3e300..55f1f9c 100644
--- a/include/linux/mod_devicetable.h
+++ b/include/linux/mod_devicetable.h
@@ -474,10 +474,10 @@ struct platform_device_id {
 			__attribute__((aligned(sizeof(kernel_ulong_t))));
 };
 
-#define PHY_MODULE_PREFIX	"phy:"
+#define MDIO_MODULE_PREFIX	"mdio:"
 
-#define PHYID_FMT "%d%d%d%d%d%d%d%d%d%d%d%d%d%d%d%d%d%d%d%d%d%d%d%d%d%d%d%d%d%d%d%d"
-#define PHYID_ARGS(_id) \
+#define MDIO_ID_FMT "%d%d%d%d%d%d%d%d%d%d%d%d%d%d%d%d%d%d%d%d%d%d%d%d%d%d%d%d%d%d%d%d"
+#define MDIO_ID_ARGS(_id) \
 	(_id)>>31, ((_id)>>30) & 1, ((_id)>>29) & 1, ((_id)>>28) & 1,	\
 	((_id)>>27) & 1, ((_id)>>26) & 1, ((_id)>>25) & 1, ((_id)>>24) & 1, \
 	((_id)>>23) & 1, ((_id)>>22) & 1, ((_id)>>21) & 1, ((_id)>>20) & 1, \
@@ -487,11 +487,17 @@ struct platform_device_id {
 	((_id)>>7) & 1, ((_id)>>6) & 1, ((_id)>>5) & 1, ((_id)>>4) & 1, \
 	((_id)>>3) & 1, ((_id)>>2) & 1, ((_id)>>1) & 1, (_id) & 1
 
-
-
-struct phy_device_id {
-	uint32_t phy_id;
-	uint32_t phy_id_mask;
+/**
+ * struct mdio_device_id - identifies PHY devices on an MDIO/MII bus
+ * @phy_id: The result of
+ *     (mdio_read(&MII_PHYSID1) << 16 | mdio_read(&PHYSID2)) & @phy_id_mask
+ *     for this PHY type
+ * @phy_id_mask: Defines the significant bits of @phy_id.  A value of 0
+ *     is used to terminate an array of struct mdio_device_id.
+ */
+struct mdio_device_id {
+	__u32 phy_id;
+	__u32 phy_id_mask;
 };
 
 #endif /* LINUX_MOD_DEVICETABLE_H */
diff --git a/scripts/mod/file2alias.c b/scripts/mod/file2alias.c
index b412185..0e08b8b 100644
--- a/scripts/mod/file2alias.c
+++ b/scripts/mod/file2alias.c
@@ -797,7 +797,7 @@ static int do_platform_entry(const char *filename,
 }
 
 static int do_phy_entry(const char *filename,
-			struct phy_device_id *id, char *alias)
+			struct mdio_device_id *id, char *alias)
 {
 	char str[33];
 	int i;
@@ -813,7 +813,7 @@ static int do_phy_entry(const char *filename,
 			str[i] = '0';
 	}
 
-	sprintf(alias, PHY_MODULE_PREFIX "%s", str);
+	sprintf(alias, MDIO_MODULE_PREFIX "%s", str);
 	return 1;
 }
 
@@ -964,9 +964,9 @@ void handle_moddevtable(struct module *mod, struct elf_info *info,
 		do_table(symval, sym->st_size,
 			 sizeof(struct platform_device_id), "platform",
 			 do_platform_entry, mod);
-	else if (sym_is(symname, "__mod_phy_device_table"))
+	else if (sym_is(symname, "__mod_mdio_device_table"))
 		do_table(symval, sym->st_size,
-			 sizeof(struct phy_device_id), "phy",
+			 sizeof(struct mdio_device_id), "phy",
 			 do_phy_entry, mod);
 	free(zeros);
 }
-- 
1.7.0.3


-- 
Ben Hutchings
Once a job is fouled up, anything done to improve it makes it worse.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply related

* Re: [RFC] SPD basic actions per netdev
From: Timo Teräs @ 2010-04-01  4:52 UTC (permalink / raw)
  To: Herbert Xu; +Cc: jamal, David S. Miller, Patrick McHardy, netdev
In-Reply-To: <20100401025247.GA19994@gondor.apana.org.au>

Herbert Xu wrote:
> On Wed, Mar 31, 2010 at 10:35:23PM -0400, jamal wrote:
>> On Thu, 2010-04-01 at 08:33 +0800, Herbert Xu wrote:
>>
>>> If we're going to change this then we should just add a second
>>> interface field to the selector, rather than trying to overload
>>> the existing one.
>> Do you mean to have a selector->iif/oif? Sure that makes sense - but is
>> a much larger surgery and user space will need to be taught.
>>
>> Did you look at the patch i sent? i tried to retain current behavior
>> except for the input check path. output path was working in classifying
>> with specific netdevs. 
> 
> OK, I guess the chances of an existing app breaking is slim.
> 
> BTW, you should treat FLOW_DIR_FWD as FLOW_DIR_IN.

I think we need iif and oif. The separation is clear in in/fwd policies,
as each is matched properly. But 'out' policies are used for both:
locally generated, and forwarded traffic.

Basically it goes like:
 in - for policy_check for traffic that is received locally
 fwd - for policy_check for traffic that is forwarded
 out - all (local and forwarded) traffic that goes out of box

IMHO, it's slightly confusing that in/fwd is split, but out is not.
But that's the way it works. If you now override the how interface
is checked for 'out' policy, it'll break current behaviour.

- Timo

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox