Netdev List
 help / color / mirror / Atom feed
* Re: big picture UDP/IP performance question re 2.6.18 -> 2.6.32
From: starlight @ 2011-10-07 19:41 UTC (permalink / raw)
  To: chetan loke
  Cc: Eric Dumazet, linux-kernel, netdev, Peter Zijlstra,
	Christoph Lameter, Willy Tarreau, Ingo Molnar, Stephen Hemminger,
	Benjamin LaHaise, Joe Perches, lokechetan, Con Kolivas,
	Serge Belyshev
In-Reply-To: <CAAsGZS4b2F9N3nV3TNu5xG+=2d0L0ncste4xv2vqoVFb1pOxEw @mail.gmail.com>

If they didn't want $20k to get started ($10k
for the card, $10k for the developer kit) would
probably have ported our application to run
outboard on a Tile64 NIC.  The cost + youth
of Tilra has been just a little to far out
on the edge.  If they survive (and lower the
price a bit) it will become an increasingly
easier sell.

If their patents hold up and their investors
can stomach the legal battles, they may
win the day.  Intel and AMD use ring
topologies between cores and this can
only scale so far.


At 03:27 PM 10/7/2011 -0400, chetan loke wrote:
>If at all, tile-cpu may not be able to survive the
>heat in host-mode, but there's no competition in
>device-mode. ASIC design is an expensive science
>project. All adapter vendors know this. Not to
>mention all the chicken-games they then have to
>play to incorporate multi-core ARMs(or whatever
>CPUs) to support big(40/100 G) pipes. Slowly their
>firmware needs to evolve as an OS and they start
>shopping for RT linux shops/consulting firms.Been
>there. Done that. There's a lot of bad blood and
>too much pain involved when you ask clunky
>firmware guys to abandon the big fat while loop in
>their beloved firmware and stop hacking the chip
>;). Offload it to a tile-CPU and just sell what
>you are good at - a darn protocol stack...

^ permalink raw reply

* [PATCH] bluetooth: Properly clone LSM attributes to newly created child connections
From: Paul Moore @ 2011-10-07 19:40 UTC (permalink / raw)
  To: netdev, linux-security-module, selinux

The Bluetooth stack has internal connection handlers for all of the various
Bluetooth protocols, and unfortunately, they are currently lacking the LSM
hooks found in the core network stack's connection handlers.  I say
unfortunately, because this can cause problems for users who have have an
LSM enabled and are using certain Bluetooth devices.  See one problem
report below:

 * http://bugzilla.redhat.com/show_bug.cgi?id=741703

In order to keep things simple at this point in time, this patch fixes the
problem by cloning the parent socket's LSM attributes to the newly created
child socket.  If we decide we need a more elaborate LSM marking mechanism
for Bluetooth (I somewhat doubt this) we can always revisit this decision
in the future.

Reported-by: James M. Cape <jcape@ignore-your.tv>
Signed-off-by: Paul Moore <pmoore@redhat.com>
---
 net/bluetooth/l2cap_sock.c  |    4 ++++
 net/bluetooth/rfcomm/sock.c |    3 +++
 net/bluetooth/sco.c         |    5 ++++-
 security/security.c         |    1 +
 4 files changed, 12 insertions(+), 1 deletions(-)

diff --git a/net/bluetooth/l2cap_sock.c b/net/bluetooth/l2cap_sock.c
index 61f1f62..e829236 100644
--- a/net/bluetooth/l2cap_sock.c
+++ b/net/bluetooth/l2cap_sock.c
@@ -26,6 +26,8 @@
 
 /* Bluetooth L2CAP sockets. */
 
+#include <linux/security.h>
+
 #include <net/bluetooth/bluetooth.h>
 #include <net/bluetooth/hci_core.h>
 #include <net/bluetooth/l2cap.h>
@@ -933,6 +935,8 @@ static void l2cap_sock_init(struct sock *sk, struct sock *parent)
 		chan->force_reliable = pchan->force_reliable;
 		chan->flushable = pchan->flushable;
 		chan->force_active = pchan->force_active;
+
+		security_sk_clone(parent, sk);
 	} else {
 
 		switch (sk->sk_type) {
diff --git a/net/bluetooth/rfcomm/sock.c b/net/bluetooth/rfcomm/sock.c
index 482722b..5417f61 100644
--- a/net/bluetooth/rfcomm/sock.c
+++ b/net/bluetooth/rfcomm/sock.c
@@ -42,6 +42,7 @@
 #include <linux/device.h>
 #include <linux/debugfs.h>
 #include <linux/seq_file.h>
+#include <linux/security.h>
 #include <net/sock.h>
 
 #include <asm/system.h>
@@ -264,6 +265,8 @@ static void rfcomm_sock_init(struct sock *sk, struct sock *parent)
 
 		pi->sec_level = rfcomm_pi(parent)->sec_level;
 		pi->role_switch = rfcomm_pi(parent)->role_switch;
+
+		security_sk_clone(parent, sk);
 	} else {
 		pi->dlc->defer_setup = 0;
 
diff --git a/net/bluetooth/sco.c b/net/bluetooth/sco.c
index 8270f05..a324b00 100644
--- a/net/bluetooth/sco.c
+++ b/net/bluetooth/sco.c
@@ -41,6 +41,7 @@
 #include <linux/debugfs.h>
 #include <linux/seq_file.h>
 #include <linux/list.h>
+#include <linux/security.h>
 #include <net/sock.h>
 
 #include <asm/system.h>
@@ -403,8 +404,10 @@ static void sco_sock_init(struct sock *sk, struct sock *parent)
 {
 	BT_DBG("sk %p", sk);
 
-	if (parent)
+	if (parent) {
 		sk->sk_type = parent->sk_type;
+		security_sk_clone(parent, sk);
+	}
 }
 
 static struct proto sco_proto = {
diff --git a/security/security.c b/security/security.c
index 0e4fccf..d9e1533 100644
--- a/security/security.c
+++ b/security/security.c
@@ -1097,6 +1097,7 @@ void security_sk_clone(const struct sock *sk, struct sock *newsk)
 {
 	security_ops->sk_clone_security(sk, newsk);
 }
+EXPORT_SYMBOL(security_sk_clone);
 
 void security_sk_classify_flow(struct sock *sk, struct flowi *fl)
 {


^ permalink raw reply related

* [PATCH] mscan: too much data copied to CAN frame due to 16 bit accesses
From: Wolfgang Grandegger @ 2011-10-07 19:28 UTC (permalink / raw)
  To: Netdev-u79uwXL29TY76Z2rM5mHXA
  Cc: SocketCAN Core Mailing List, Oliver Hartkopp, Andre Naujoks

Due to the 16 bit access to mscan registers there's too much data copied to
the zero initialized CAN frame when having an odd number of bytes to copy.
This patch ensures that only the requested bytes are copied by using an
8 bit access for the remaining byte.

Reported-by: Andre Naujoks <nautsch-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Signed-off-by: Oliver Hartkopp <socketcan-fJ+pQTUTwRTk1uMJSBkQmQ@public.gmane.org>
Signed-off-by: Wolfgang Grandegger <wg-5Yr1BZd7O62+XT7JhA+gdA@public.gmane.org>
---
 drivers/net/can/mscan/mscan.c |   11 ++++++++---
 1 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/net/can/mscan/mscan.c b/drivers/net/can/mscan/mscan.c
index ac42f5d..ec4a311 100644
--- a/drivers/net/can/mscan/mscan.c
+++ b/drivers/net/can/mscan/mscan.c
@@ -261,11 +261,13 @@ static netdev_tx_t mscan_start_xmit(struct sk_buff *skb, struct net_device *dev)
 		void __iomem *data = &regs->tx.dsr1_0;
 		u16 *payload = (u16 *)frame->data;
 
-		/* It is safe to write into dsr[dlc+1] */
-		for (i = 0; i < (frame->can_dlc + 1) / 2; i++) {
+		for (i = 0; i < frame->can_dlc / 2; i++) {
 			out_be16(data, *payload++);
 			data += 2 + _MSCAN_RESERVED_DSR_SIZE;
 		}
+		/* write remaining byte if necessary */
+		if (frame->can_dlc & 1)
+			out_8(data, frame->data[frame->can_dlc - 1]);
 	}
 
 	out_8(&regs->tx.dlr, frame->can_dlc);
@@ -330,10 +332,13 @@ static void mscan_get_rx_frame(struct net_device *dev, struct can_frame *frame)
 		void __iomem *data = &regs->rx.dsr1_0;
 		u16 *payload = (u16 *)frame->data;
 
-		for (i = 0; i < (frame->can_dlc + 1) / 2; i++) {
+		for (i = 0; i < frame->can_dlc / 2; i++) {
 			*payload++ = in_be16(data);
 			data += 2 + _MSCAN_RESERVED_DSR_SIZE;
 		}
+		/* read remaining byte if necessary */
+		if (frame->can_dlc & 1)
+			frame->data[frame->can_dlc - 1] = in_8(data);
 	}
 
 	out_8(&regs->canrflg, MSCAN_RXF);
-- 
1.7.4.1

^ permalink raw reply related

* Re: big picture UDP/IP performance question re 2.6.18 -> 2.6.32
From: chetan loke @ 2011-10-07 19:27 UTC (permalink / raw)
  To: starlight
  Cc: Eric Dumazet, linux-kernel, netdev, Peter Zijlstra,
	Christoph Lameter, Willy Tarreau, Ingo Molnar, Stephen Hemminger,
	Benjamin LaHaise, Joe Perches, lokechetan, Con Kolivas,
	Serge Belyshev
In-Reply-To: <6.2.5.6.2.20111007143050.039bd578@binnacle.cx>

On Fri, Oct 7, 2011 at 2:37 PM,  <starlight@binnacle.cx> wrote:
> At 02:09 PM 10/7/2011 -0400, chetan loke wrote:
>>You got it. In case of tilera there are two modes:
>>tile-cpu in device mode: beats most of the
>>non-COTS NICs. It runs linux on the adapter
>>side. Imagine having the flexibility/power to
>>program the ASIC using your favorite OS. Its
>>orgasmic. So go for it!  tile-cpu in host-mode:
>>Yes, it could be a game changer.
>
> We almost went for the 1st gen Tile64 outboard
> NIC approach, but were concerned about whether
> they would survive--still are.  Intel has
> crushed more than a few competitors along
> the way.  If Google or Facebook buys into the
> Tile-Gx it becomes a safe choice overnight.
>
>

If at all, tile-cpu may not be able to survive the heat in host-mode,
but there's no competition in device-mode. ASIC design is an expensive
science project. All adapter vendors know this. Not to mention all the
chicken-games they then have to play to incorporate multi-core ARMs(or
whatever CPUs) to support big(40/100 G) pipes. Slowly their firmware
needs to evolve as an OS and they start shopping for RT linux
shops/consulting firms.Been there. Done that. There's a lot of bad
blood and too much pain involved when you ask clunky firmware guys to
abandon the big fat while loop in their beloved firmware and stop
hacking the chip ;). Offload it to a tile-CPU and just sell what you
are good at - a darn protocol stack...

^ permalink raw reply

* Re: [PATCH] IPv6: DAD from bonding iface is treated as dup address from others
From: Neil Horman @ 2011-10-07 19:09 UTC (permalink / raw)
  To: Yinglin Sun
  Cc: Jay Vosburgh, David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, netdev
In-Reply-To: <CAN17JHUZLWmDQF8P6eCbLkTanYjwi0YmF5dXkfvH_QCRnbtaMA@mail.gmail.com>

On Fri, Oct 07, 2011 at 11:08:49AM -0700, Yinglin Sun wrote:
> On Fri, Oct 7, 2011 at 4:10 AM, Neil Horman <nhorman@tuxdriver.com> wrote:
> > On Thu, Oct 06, 2011 at 06:24:36PM -0700, Yinglin Sun wrote:
> >> On Thu, Oct 6, 2011 at 5:59 PM, Jay Vosburgh <fubar@us.ibm.com> wrote:
> >> > Yinglin Sun <Yinglin.Sun@emc.com> wrote:
> >> >
> >> >>On Thu, Oct 6, 2011 at 3:17 PM, Yinglin Sun <Yinglin.Sun@emc.com> wrote:
> >> >>>
> >> >>> On Thu, Oct 6, 2011 at 12:05 PM, Jay Vosburgh <fubar@us.ibm.com> wrote:
> >> >>> >
> >> >>> > Neil Horman <nhorman@tuxdriver.com> wrote:
> >> >>> >
> >> >>> > >On Wed, Oct 05, 2011 at 08:59:10PM -0700, Yinglin Sun wrote:
> >> >>> > >> Steps to reproduce this issue:
> >> >>> > >> 1. create bond0 over eth0 and eth1, set the mode to balance-xor
> >> >>> > >> 2. add an IPv6 address to bond0
> >> >>> > >> 3. DAD packet is sent out from one slave and then is looped back from
> >> >>> > >> the other slave. Therefore, it is treated as a duplicate address and
> >> >>> > >> stays tentative afterwards:
> >> >>> > >>    kern.info:
> >> >>> > >>        Oct  5 11:50:18 testvm1 kernel: [  129.224353] bond0: IPv6 duplicate address 1234::1 detected!
> >> >
> >> > [...]
> >> >
> >> >>> > >Nack, This seems like it will just completely break DAD.  What if theres another
> >> >>> > >system out there with the same mac address.  A response from that system would
> >> >>> > >get dropped by this filter, instead of causing The local system to stop using
> >> >>> > >the address.  What you really want to do is modify
> >> >>> > >bond_should_deliver_exact_match to detect this frame on the inactive slave or
> >> >>> > >some such, and drop the frame there.
> >> >>> >
> >> >>> >        Also NACK; and adding a bit of information.  The balance-xor
> >> >>> > mode is nominally expecting to interact with a switch whose ports are
> >> >>> > set for etherchannel ("static link aggregation"), in which case the
> >> >>> > switch will not loop the packet back around.
> >> >>> >
> >> >>> >        If your switch can do etherchannel, then enable it and the
> >> >>> > problem should go away.  If your switch cannot do this, then you may
> >> >>> > have other issues, because all of the multicast or broadcast packets
> >> >>> > going out any bonding slave will loop around to another slave.  You
> >> >>> > could also use 802.3ad / LACP if you switch supports that.
> >> >>> >
> >> >>> >        For balance-xor (or balance-rr, for that matter) mode to a
> >> >>> > non-etherchannel switch, it's going to be difficult, if not impossible,
> >> >>> > to modify bond_should_deliver_exact_match, because there are no inactive
> >> >>> > slaves.  In this mode, bonding is expecting the switch to balance
> >> >>> > incoming traffic across the ports, and not deliver looped back packets
> >> >>> > or duplicates.  There are no restrictions on what type of traffic
> >> >>> > (mcast, bcast, ucast) may arrive on any given port.
> >> >>> >
> >> >>> >        I can't think of a way to make the non-etherchannel case work
> >> >>> > for balance-xor (or balance-rr) without breaking the DAD functionality
> >> >>> > in the case of an actual duplicate.  I'm not aware of a way to
> >> >>> > distinguish a looped back DAD probe from an actual duplicate address
> >> >>> > probe elsewhere on the network.
> >> >>> >
> >> >>>
> >> >>> Hi Neil & Jay,
> >> >>>
> >> >>> Thanks a lot for the comments.
> >> >>>
> >> >>> The use case is to add IPv6 address on the bonding interface first,
> >> >>> and then set up port channel on switch. We'll hit this issue and the
> >> >>> new address will stay tentative and unusable after port channel is set
> >> >>> up on switch. This patch is for this valid use case.
> >> >>>
> >> >>> Except failover mode, all slaves are active on receiving packets, so
> >> >>> we are receiving such looped back DAD and the bonding driver cannot
> >> >>> ignore them. I cannot think of a way to distinguish if a DAD is looped
> >> >>> back or from someone else having the same mac address. They look the
> >> >>> same to the host. If there is another machine having the same mac
> >> >>> address, this code path gets executed if both are doing DAD at the
> >> >>> same time for the same IPv6 address. Maybe we should find out what the
> >> >>> specification defines for this case?
> >> >>>
> >> >>
> >> >>RFC4862 has a discussion about this issue:
> >> >>http://tools.ietf.org/html/rfc4862#appendix-A
> >> >>The better solution could be to record the number of DAD sent out. If
> >> >>we received more DAD packets than we sent out, there is someone else
> >> >>on the network who has the same mac address and sent DAD for the same
> >> >>IPv6 address. However, this solution doesn't work with bonding
> >> >>interface, since all other active slaves but the one sending out DAD
> >> >>will receive packet looped back. It doesn't seem there is a simple
> >> >>solution for this issue.
> >> >
> >> >        Why are you setting up the port channel after configuring the
> >> > bond?
> >> >
> >> >        As a possible workaround, if you have control over the setup
> >> > process (perhaps it's some sort of manual process), adding one slave to
> >> > the bond, leaving the other soon-to-be slaves down, then setting up the
> >> > switch, and finally adding the remaining slaves should work around the
> >> > issue, since if the bond has only one slave it won't see any looped
> >> > packets.
> >> >
> >> >        Or you could bring the bond up as active-backup, then change the
> >> > mode to balance-xor once the switch is configured.
> >> >
> >> >        Ultimately, though, the problem stems from the settings mismatch
> >> > between the switch and the bonding system; balance-xor is meant to
> >> > interoperate with etherchannel, and when the switch is not configured
> >> > properly, correct behavior is difficult to guarantee.
> >> >
> >>
> >> Jay,
> >>
> >> Thanks a lot for the suggestion.
> >>
> >> It's mainly about usability. We would like to provide customers with
> >> consistent IPv6 configuration procedures as IPv4.  Such workarounds
> >> could be confusing and generate customer calls.
> >>
> > Its not a workaround, its the way it has to be done.  You can't just drop dad
> > packets because you can't tell the difference between those that are looped back
> > and those that are legitimaely from other hosts, so you need to do something
> > like what Jay is suggesting.
> >
> 
> Yes, you are right. For this case, we cannot tell the difference
> between those DADs  looped back and DADs for the same IPv6 addresses,
> which are from other machines having the same mac address on the
> network.
> 
> If this happens, the network is already broken due to duplicate mac
> addresses, right? I know this is not the excuse to drop DAD from
Wrong, if you assume that the other systems on the network all implement DAD
correctly, then the network is working properly until such time as your system
joined it.  At that moment, your system is the one with the duplicate mac and
ipv6 system, and DAD is there to bring the network back into a working state.

> others having the same mac address, since those DADs are legitimate
> and indicate duplicate IPv6 address in the network. Just out of
> curiosity, is there any case that such network can still function well
> with multiple machines having the same mac address besides link
> aggregation? Maybe we could drop those DADs since the network is
> already in trouble?
> 
Well, if you're just considering end systems (i.e. not routers), then having two
systems with duplicate macs and/or ipv6 addresses isnt' catastrophic to the
network as a whole,  Those two systems will just fight for ownership of the ipv6
address in the switch arp tables and will both only be intermittently reachable.
The purpose of DAD is to ensure that both parties become aware of this
condition.  The other systems on the network will function as they normally do.

Neil

> Thanks.
> 
> Yinglin
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply

* moving GIT back to kernel.org...
From: David Miller @ 2011-10-07 18:57 UTC (permalink / raw)
  To: netdev; +Cc: linux-wireless, netfilter-devel, sparclinux, linux-ide


I'm about to setup my GIT trees on kernel.org, once that is complete
I will be solely updating those trees again.

I will notify everyone when this is ready to go.

Just a heads up for everyone...

^ permalink raw reply

* Re: big picture UDP/IP performance question re 2.6.18 -> 2.6.32
From: starlight @ 2011-10-07 18:37 UTC (permalink / raw)
  To: chetan loke
  Cc: Eric Dumazet, linux-kernel, netdev, Peter Zijlstra,
	Christoph Lameter, Willy Tarreau, Ingo Molnar, Stephen Hemminger,
	Benjamin LaHaise, Joe Perches, lokechetan, Con Kolivas,
	Serge Belyshev
In-Reply-To: <CAAsGZS4s1wTWW1j7FRUWW9jqpPUVF3Q46AMa7+njvE1ckX0Snw @mail.gmail.com>

At 02:09 PM 10/7/2011 -0400, chetan loke wrote:
>I'm a little confused. Seems like there are
>conflicting goals. If you want to bypass the
>kernel-protocol-stack then you have the following
>options: a) kernel af_packet. This is where we
>would get a chance to test all the kernel features
>etc.

Perhaps I haven't been sufficiently clear.
The "packet socket" mode I refer to in the
earlier post was using AF/PF_PACKET mode sockets
as in

   socket(PF_PACKET, SOCK_RAW, eth_p_all);

Have run it in both normal and memory mapped
modes.  MMAP mode is a slight bit more expensive
due to the cache pressure from the additional
copy.  On the 6174 MMAP seems to be a smidgen
better in certain tests, but in the end both
read() and mapped approaches are effectively
identical on performance--and generally match
the cost of UDP sockets almost exactly.

b) Use non-commodity(?) NICs(from vendors
>you mentioned): where it might have some on-board
>memory(cushion) and so it can absorb the spikes
>and can also smoothen out too many
>PCI-transactions for bursty (and small payload -
>as in 64 byte traffic). But wait, when you use the
>libs provided by these vendors, then their
>driver(especially the Rx path) is not so much
>working in inline mode as NIC drivers in case a)
>above. This driver with a special Rx-path purely
>exists for managing your mmap'd queues.So
>of-course it's going to be faster that the
>traditional inline drivers. In this partial-inline
>mode, the adapter might i) batch the packets and
>ii) send a single notification to the
>host-side. With that single event you are now
>processing 1+ packets.

Kernel bypass is probably the best answer for
what we do.  Problem has been lack of maturity
in their driver software.  Looks like it's reaching
a point where they cover our use case.  As mentioned
earlier, Solarflare could not match the Intel
82599 + ixgbe for this app last year.  Was a
disaster.  Myricom is focused on UDP (better
for us), but only just added multi-core IRQ
doorbell wakeups in recent months.  Previously
one had to accept all IRQs on a single core or
poll, neither of which works for us.

>You got it. In case of tilera there are two modes:
>tile-cpu in device mode: beats most of the
>non-COTS NICs. It runs linux on the adapter
>side. Imagine having the flexibility/power to
>program the ASIC using your favorite OS. Its
>orgasmic. So go for it!  tile-cpu in host-mode:
>Yes, it could be a game changer.

We almost went for the 1st gen Tile64 outboard
NIC approach, but were concerned about whether
they would survive--still are.  Intel has
crushed more than a few competitors along
the way.  If Google or Facebook buys into the
Tile-Gx it becomes a safe choice overnight.

^ permalink raw reply

* Re: [RFC] net: remove erroneous sk null assignment in timestamping
From: Johannes Berg @ 2011-10-07 18:42 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, richardcochran
In-Reply-To: <1318009222.3988.28.camel@jlt3.sipsolutions.net>

On Fri, 2011-10-07 at 19:40 +0200, Johannes Berg wrote:

> > It looks like skb_clone_tx_timestamp() sets clone->sk without any
> > proper refcounting, so I bet this NULL'ing it out is working
> > around that bug.
> 
> You're right. But this bug can actually trigger another way: The only
> user of this is dp83640_txtstamp() which might do kfree_skb() on this
> skb, in which case that'll likely crash trying to clean up the sk.

Maybe that's how you can trigger it: have one thread turn on and off
timestamping all the time, and another thread send frames all the time,
then eventually you'll probably run into the kfree_skb() case there. If
you ever manage to run into that case, it'll crash either when freeing
this skb or when freeing the original.

Anyway, it's broken. I'll stop thinking about it. You (Richard) should
fix it quickly though otherwise I think we should revert/delete this
code.

johannes

^ permalink raw reply

* Re: [iproute2 PATCH 11/11] Fix file descriptor leak on error in read_igmp()
From: Stephen Hemminger @ 2011-10-07 18:28 UTC (permalink / raw)
  To: Thomas Jarosch; +Cc: netdev
In-Reply-To: <4E89D3BC.5050206@intra2net.com>

All these are applied and pushed to (temporary) github repo.

^ permalink raw reply

* Re: big picture UDP/IP performance question re 2.6.18 -> 2.6.32
From: chetan loke @ 2011-10-07 18:09 UTC (permalink / raw)
  To: starlight
  Cc: Eric Dumazet, linux-kernel, netdev, Peter Zijlstra,
	Christoph Lameter, Willy Tarreau, Ingo Molnar, Stephen Hemminger,
	Benjamin LaHaise, Joe Perches, lokechetan, Con Kolivas,
	Serge Belyshev
In-Reply-To: <6.2.5.6.2.20111007020308.039be248@binnacle.cx>

On Fri, Oct 7, 2011 at 2:13 AM,  <starlight@binnacle.cx> wrote:
> At 07:40 AM 10/7/2011 +0200, Eric Dumazet wrote:
>>
>>Thats exactly the opposite : Your old kernel is not fast enough
>>to enter/exit NAPI on every incoming frame.
>>
>>Instead of one IRQ per incoming frame, you have less interrupts:
>>A napi run processes more than 1 frame.
>
> Please look at the data I posted.  Batching
> appears to give 80us *better* latency in this
> case--with the old kernel.
>

Wait till we get to a point where we can use fanout+tpacket_v3 and
then hopefully we will have eased the burden from folks where for
particular use-cases they have to mimic this in user-land. Just a FYI
and not related to this thread: A month ago or so, I quickly
tried(using David's sample fanout app-code) fanout+tpacket_v3 and the
combo-mode wasn't quite working as expected. Not sure but might need
to visit the kernel-mmap-logic to create multiple mmap'd rings in the
kernel if fanout+tpacket_v3 is being used.



> Yes, of course I am interested in Intel's flow
> director and similar solutions, netfilter especially.
> 2.6.32 is only recently available in commercial
> deployment and I will be looking at that next up.
> Mainly I'll be looking at complete kernel bypass
> with 10G.  Myricom looks like it might be good.
> Tested Solarflare last year and it was a bust
> for high volume UDP (one thread) but I've heard
> that they fixed that and will revisit.

I'm a little confused. Seems like there are conflicting goals. If you
want to bypass the kernel-protocol-stack then you have the following
options:
a) kernel af_packet. This is where we would get a chance to test all
the kernel features etc.
b) Use non-commodity(?) NICs(from vendors you mentioned): where it
might have some on-board memory(cushion) and so it can absorb the
spikes and can also smoothen out too many PCI-transactions for bursty
(and small payload - as in 64 byte traffic). But wait, when you use
the libs provided by these vendors, then their driver(especially the
Rx path) is not so much working in inline mode as NIC drivers in case
a) above. This driver with a special Rx-path purely exists for
managing your mmap'd queues.So of-course it's going to be faster that
the traditional inline drivers. In this partial-inline mode, the
adapter might i) batch the packets and ii) send a single notification
to the host-side. With that single event you are now processing 1+
packets.

And now we can't really compare a) and b) because the drivers are
working in different modes. i.e inline and partial-inline mode
respectively. In case a) you might get notifications for every-packet.
So you are now stressing the process-scheduler, networking-stack etc.
In case b) you might not be able to stress all the kernel-paths. Also,
you get what you pay. If you pay 2x or 3x the cost for a non-COTS NIC
then you can push the vendors for implementing auto-coalescing and
what not on the adapter side. For case a), networking-stack needs to
keep up with whatever firmware we get from the vendor. Also, bugs if
any, can't be fixed the community for case b). So there's a tradeoff.

Sure, we need to see where small changes elsewhere in the kernel are
causing a huge penalty for user-space but you get the point.


> Please understand that I am not a curmudgeonly
> Luddite.  I realize that sometimes it is
> necessary to trade efficiency for scalability.
> All I'm doing here is trying to quantify the
> current state of affairs and make recommendations
> in a commercial environment.  For the moment
> all the excellent enhancements designed to
> permit extreme scalability are costing too
> much in efficiency to be worth using in
> production.  When/if Tilera delivers their
> 100 core CPU in volume this state of affairs
> will likely change.  I imagine both Intel
> and AMD have many-core solutions in the pipe
> as well, though it will be interesting to see
> if Tilera has the essential patents and can
> surpass the two majors in the market and the
> courts.
>

You got it. In case of tilera there are two modes:
tile-cpu in device mode: beats most of the non-COTS NICs. It runs
linux on the adapter side. Imagine having the flexibility/power to
program the ASIC using your favorite OS. Its orgasmic. So go for it!
tile-cpu in host-mode: Yes, it could be a game changer.


Chetan Loke

^ permalink raw reply

* Re: [PATCH] IPv6: DAD from bonding iface is treated as dup address from others
From: Yinglin Sun @ 2011-10-07 18:08 UTC (permalink / raw)
  To: Neil Horman
  Cc: Jay Vosburgh, David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, netdev
In-Reply-To: <20111007111028.GA11408@hmsreliant.think-freely.org>

On Fri, Oct 7, 2011 at 4:10 AM, Neil Horman <nhorman@tuxdriver.com> wrote:
> On Thu, Oct 06, 2011 at 06:24:36PM -0700, Yinglin Sun wrote:
>> On Thu, Oct 6, 2011 at 5:59 PM, Jay Vosburgh <fubar@us.ibm.com> wrote:
>> > Yinglin Sun <Yinglin.Sun@emc.com> wrote:
>> >
>> >>On Thu, Oct 6, 2011 at 3:17 PM, Yinglin Sun <Yinglin.Sun@emc.com> wrote:
>> >>>
>> >>> On Thu, Oct 6, 2011 at 12:05 PM, Jay Vosburgh <fubar@us.ibm.com> wrote:
>> >>> >
>> >>> > Neil Horman <nhorman@tuxdriver.com> wrote:
>> >>> >
>> >>> > >On Wed, Oct 05, 2011 at 08:59:10PM -0700, Yinglin Sun wrote:
>> >>> > >> Steps to reproduce this issue:
>> >>> > >> 1. create bond0 over eth0 and eth1, set the mode to balance-xor
>> >>> > >> 2. add an IPv6 address to bond0
>> >>> > >> 3. DAD packet is sent out from one slave and then is looped back from
>> >>> > >> the other slave. Therefore, it is treated as a duplicate address and
>> >>> > >> stays tentative afterwards:
>> >>> > >>    kern.info:
>> >>> > >>        Oct  5 11:50:18 testvm1 kernel: [  129.224353] bond0: IPv6 duplicate address 1234::1 detected!
>> >
>> > [...]
>> >
>> >>> > >Nack, This seems like it will just completely break DAD.  What if theres another
>> >>> > >system out there with the same mac address.  A response from that system would
>> >>> > >get dropped by this filter, instead of causing The local system to stop using
>> >>> > >the address.  What you really want to do is modify
>> >>> > >bond_should_deliver_exact_match to detect this frame on the inactive slave or
>> >>> > >some such, and drop the frame there.
>> >>> >
>> >>> >        Also NACK; and adding a bit of information.  The balance-xor
>> >>> > mode is nominally expecting to interact with a switch whose ports are
>> >>> > set for etherchannel ("static link aggregation"), in which case the
>> >>> > switch will not loop the packet back around.
>> >>> >
>> >>> >        If your switch can do etherchannel, then enable it and the
>> >>> > problem should go away.  If your switch cannot do this, then you may
>> >>> > have other issues, because all of the multicast or broadcast packets
>> >>> > going out any bonding slave will loop around to another slave.  You
>> >>> > could also use 802.3ad / LACP if you switch supports that.
>> >>> >
>> >>> >        For balance-xor (or balance-rr, for that matter) mode to a
>> >>> > non-etherchannel switch, it's going to be difficult, if not impossible,
>> >>> > to modify bond_should_deliver_exact_match, because there are no inactive
>> >>> > slaves.  In this mode, bonding is expecting the switch to balance
>> >>> > incoming traffic across the ports, and not deliver looped back packets
>> >>> > or duplicates.  There are no restrictions on what type of traffic
>> >>> > (mcast, bcast, ucast) may arrive on any given port.
>> >>> >
>> >>> >        I can't think of a way to make the non-etherchannel case work
>> >>> > for balance-xor (or balance-rr) without breaking the DAD functionality
>> >>> > in the case of an actual duplicate.  I'm not aware of a way to
>> >>> > distinguish a looped back DAD probe from an actual duplicate address
>> >>> > probe elsewhere on the network.
>> >>> >
>> >>>
>> >>> Hi Neil & Jay,
>> >>>
>> >>> Thanks a lot for the comments.
>> >>>
>> >>> The use case is to add IPv6 address on the bonding interface first,
>> >>> and then set up port channel on switch. We'll hit this issue and the
>> >>> new address will stay tentative and unusable after port channel is set
>> >>> up on switch. This patch is for this valid use case.
>> >>>
>> >>> Except failover mode, all slaves are active on receiving packets, so
>> >>> we are receiving such looped back DAD and the bonding driver cannot
>> >>> ignore them. I cannot think of a way to distinguish if a DAD is looped
>> >>> back or from someone else having the same mac address. They look the
>> >>> same to the host. If there is another machine having the same mac
>> >>> address, this code path gets executed if both are doing DAD at the
>> >>> same time for the same IPv6 address. Maybe we should find out what the
>> >>> specification defines for this case?
>> >>>
>> >>
>> >>RFC4862 has a discussion about this issue:
>> >>http://tools.ietf.org/html/rfc4862#appendix-A
>> >>The better solution could be to record the number of DAD sent out. If
>> >>we received more DAD packets than we sent out, there is someone else
>> >>on the network who has the same mac address and sent DAD for the same
>> >>IPv6 address. However, this solution doesn't work with bonding
>> >>interface, since all other active slaves but the one sending out DAD
>> >>will receive packet looped back. It doesn't seem there is a simple
>> >>solution for this issue.
>> >
>> >        Why are you setting up the port channel after configuring the
>> > bond?
>> >
>> >        As a possible workaround, if you have control over the setup
>> > process (perhaps it's some sort of manual process), adding one slave to
>> > the bond, leaving the other soon-to-be slaves down, then setting up the
>> > switch, and finally adding the remaining slaves should work around the
>> > issue, since if the bond has only one slave it won't see any looped
>> > packets.
>> >
>> >        Or you could bring the bond up as active-backup, then change the
>> > mode to balance-xor once the switch is configured.
>> >
>> >        Ultimately, though, the problem stems from the settings mismatch
>> > between the switch and the bonding system; balance-xor is meant to
>> > interoperate with etherchannel, and when the switch is not configured
>> > properly, correct behavior is difficult to guarantee.
>> >
>>
>> Jay,
>>
>> Thanks a lot for the suggestion.
>>
>> It's mainly about usability. We would like to provide customers with
>> consistent IPv6 configuration procedures as IPv4.  Such workarounds
>> could be confusing and generate customer calls.
>>
> Its not a workaround, its the way it has to be done.  You can't just drop dad
> packets because you can't tell the difference between those that are looped back
> and those that are legitimaely from other hosts, so you need to do something
> like what Jay is suggesting.
>

Yes, you are right. For this case, we cannot tell the difference
between those DADs  looped back and DADs for the same IPv6 addresses,
which are from other machines having the same mac address on the
network.

If this happens, the network is already broken due to duplicate mac
addresses, right? I know this is not the excuse to drop DAD from
others having the same mac address, since those DADs are legitimate
and indicate duplicate IPv6 address in the network. Just out of
curiosity, is there any case that such network can still function well
with multiple machines having the same mac address besides link
aggregation? Maybe we could drop those DADs since the network is
already in trouble?

Thanks.

Yinglin

^ permalink raw reply

* Re: [RFC] net: add wireless TX status socket option
From: Johannes Berg @ 2011-10-07 18:06 UTC (permalink / raw)
  To: netdev; +Cc: linux-wireless, Neil Horman
In-Reply-To: <1318007970.3988.27.camel-8upI4CBIZJIJvtFkdXX2HixXY32XiHfO@public.gmane.org>

On Fri, 2011-10-07 at 19:19 +0200, Johannes Berg wrote:

> This works similar to the existing TX timestamping
> in that it reflects the SKB back to the socket's
> error queue with a SCM_WIFI_STATUS cmsg that has
> an int indicating ACK status (0/1).

I should give you a test application. This is based on the code in
Documentation/networking/timestamping/timestamping.c and sends the same
frames, but to UDP port 1 to a peer you give on the command line.

E.g. if you save this file as ack.c, you can do

$ ./ack 192.168.100.2
SO_WIFI_STATUS 1
1318010597.494035: sent 124 bytes
1318010597.494084: select 2505916us
1318010597.513457: select returned: 1, success
ready for reading
1318010597.513668: received error data, 194 bytes from 69.0.0.152, 72 bytes control messages
   cmsg len 20: SOL_SOCKET acked: 1
   cmsg len 48: IPPROTO_IP IP_RECVERR ee_errno 'No message of desired type' ee_origin 4 => bounced packet => GOT OUR DATA BACK (HURRAY!)
1318010597.513860: select 2486140us

(not sure what's with the bogus RX IP, probably just not a valid field here)

johannes


/*
 * This program demonstrates how the wifi ack status feature in
 * the Linux kernel works. It sends some random packets and prints
 * out whether it was acked.
 *
 * Copyright (C) 2009, 2011 Intel Corporation.
 * Author: Patrick Ohly <patrick.ohly-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
 * Author: Johannes Berg <johannes.berg-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
 *
 * This program is free software; you can redistribute it and/or modify it
 * under the terms and conditions of the GNU General Public License,
 * version 2, as published by the Free Software Foundation.
 *
 * This program is distributed in the hope it will be useful, but WITHOUT
 * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
 * FITNESS FOR A PARTICULAR PURPOSE. * See the GNU General Public License for
 * more details.
 *
 * You should have received a copy of the GNU General Public License along with
 * this program; if not, write to the Free Software Foundation, Inc.,
 * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
 */

#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <string.h>

#include <sys/time.h>
#include <sys/socket.h>
#include <sys/select.h>
#include <sys/ioctl.h>
#include <arpa/inet.h>
#include <net/if.h>

#include <asm/types.h>
#include <linux/net_tstamp.h>
#include <linux/errqueue.h>

#ifndef SO_WIFI_STATUS
# define SO_WIFI_STATUS		41
# define SCM_WIFI_STATUS	SO_WIFI_STATUS
#endif

#ifndef SO_EE_ORIGIN_TXSTATUS
#define SO_EE_ORIGIN_TXSTATUS	4
#endif

static void bail(const char *error)
{
	printf("%s: %s\n", error, strerror(errno));
	exit(1);
}

static const unsigned char sync[] = {
	0x00, 0x01, 0x00, 0x01,
	0x5f, 0x44, 0x46, 0x4c,
	0x54, 0x00, 0x00, 0x00,
	0x00, 0x00, 0x00, 0x00,
	0x00, 0x00, 0x00, 0x00,
	0x01, 0x01,

	/* fake uuid */
	0x00, 0x01,
	0x02, 0x03, 0x04, 0x05,

	0x00, 0x01, 0x00, 0x37,
	0x00, 0x00, 0x00, 0x08,
	0x00, 0x00, 0x00, 0x00,
	0x49, 0x05, 0xcd, 0x01,
	0x29, 0xb1, 0x8d, 0xb0,
	0x00, 0x00, 0x00, 0x00,
	0x00, 0x01,

	/* fake uuid */
	0x00, 0x01,
	0x02, 0x03, 0x04, 0x05,

	0x00, 0x00, 0x00, 0x37,
	0x00, 0x00, 0x00, 0x04,
	0x44, 0x46, 0x4c, 0x54,
	0x00, 0x00, 0xf0, 0x60,
	0x00, 0x01, 0x00, 0x00,
	0x00, 0x00, 0x00, 0x01,
	0x00, 0x00, 0xf0, 0x60,
	0x00, 0x00, 0x00, 0x00,
	0x00, 0x00, 0x00, 0x04,
	0x44, 0x46, 0x4c, 0x54,
	0x00, 0x01,

	/* fake uuid */
	0x00, 0x01,
	0x02, 0x03, 0x04, 0x05,

	0x00, 0x00, 0x00, 0x00,
	0x00, 0x00, 0x00, 0x00,
	0x00, 0x00, 0x00, 0x00,
	0x00, 0x00, 0x00, 0x00
};

static void sendpacket(int sock, struct sockaddr *addr, socklen_t addr_len)
{
	struct timeval now;
	int res;

	res = sendto(sock, sync, sizeof(sync), 0,
		addr, addr_len);
	gettimeofday(&now, 0);
	if (res < 0)
		printf("%s: %s\n", "send", strerror(errno));
	else
		printf("%ld.%06ld: sent %d bytes\n",
		       (long)now.tv_sec, (long)now.tv_usec,
		       res);
}

static void printpacket(struct msghdr *msg, int res,
			char *data, int sock)
{
	struct sockaddr_in *from_addr = (struct sockaddr_in *)msg->msg_name;
	struct cmsghdr *cmsg;
	struct timeval tv;
	struct timespec ts;
	struct timeval now;

	gettimeofday(&now, 0);

	printf("%ld.%06ld: received %s data, %d bytes from %s, %zu bytes control messages\n",
	       (long)now.tv_sec, (long)now.tv_usec, "error",
	       res,
	       inet_ntoa(from_addr->sin_addr),
	       msg->msg_controllen);
	for (cmsg = CMSG_FIRSTHDR(msg);
	     cmsg;
	     cmsg = CMSG_NXTHDR(msg, cmsg)) {
		printf("   cmsg len %zu: ", cmsg->cmsg_len);
		switch (cmsg->cmsg_level) {
		case SOL_SOCKET:
			printf("SOL_SOCKET ");
			switch (cmsg->cmsg_type) {
			case SCM_WIFI_STATUS: {
				int *ack = (void *)CMSG_DATA(cmsg);
				printf("acked: %d", *ack);
				break;
			}
			default:
				printf("type %d", cmsg->cmsg_type);
				break;
			}
			break;
		case IPPROTO_IP:
			printf("IPPROTO_IP ");
			switch (cmsg->cmsg_type) {
			case IP_RECVERR: {
				struct sock_extended_err *err =
					(struct sock_extended_err *)CMSG_DATA(cmsg);
				printf("IP_RECVERR ee_errno '%s' ee_origin %d => %s",
					strerror(err->ee_errno),
					err->ee_origin,
					err->ee_origin == SO_EE_ORIGIN_TXSTATUS ?
					"bounced packet" : "unexpected origin"
					);
				if (res < sizeof(sync))
					printf(" => truncated data?!");
				else if (!memcmp(sync, data + res - sizeof(sync),
						 sizeof(sync)))
					printf(" => GOT OUR DATA BACK (HURRAY!)");
				break;
			}
			case IP_PKTINFO: {
				struct in_pktinfo *pktinfo =
					(struct in_pktinfo *)CMSG_DATA(cmsg);
				printf("IP_PKTINFO interface index %u",
					pktinfo->ipi_ifindex);
				break;
			}
			default:
				printf("type %d", cmsg->cmsg_type);
				break;
			}
			break;
		default:
			printf("level %d type %d",
				cmsg->cmsg_level,
				cmsg->cmsg_type);
			break;
		}
		printf("\n");
	}
}

static void recvpacket(int sock)
{
	char data[256];
	struct msghdr msg;
	struct iovec entry;
	struct sockaddr_in from_addr;
	struct {
		struct cmsghdr cm;
		char control[512];
	} control;
	int res;

	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &entry;
	msg.msg_iovlen = 1;
	entry.iov_base = data;
	entry.iov_len = sizeof(data);
	msg.msg_name = (caddr_t)&from_addr;
	msg.msg_namelen = sizeof(from_addr);
	msg.msg_control = &control;
	msg.msg_controllen = sizeof(control);

	res = recvmsg(sock, &msg, MSG_ERRQUEUE|MSG_DONTWAIT);
	if (res < 0) {
		printf("%s %s: %s\n",
		       "recvmsg", "error",
		       strerror(errno));
	} else {
		printpacket(&msg, res, data, sock);
	}
}

int main(int argc, char **argv)
{
	int i;
	int enabled = 1;
	int sock;
	struct ifreq device;
	struct ifreq hwtstamp;
	struct sockaddr_in addr;
	struct ip_mreq imr;
	int val;
	socklen_t len;
	struct timeval next;

	addr.sin_family = AF_INET;
	addr.sin_port = htons(1);

	if (argc != 2 || inet_aton(argv[1], &addr.sin_addr) == 0) {
		printf("%s <ip addr>\n", argv[0]);
		exit(1);
	}

	sock = socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP);
	if (sock < 0)
		bail("socket");


	/* set socket option for wifi status */
	if (setsockopt(sock, SOL_SOCKET, SO_WIFI_STATUS,
		       &enabled, sizeof(enabled)) < 0) {
		bail("enable ack status");
	}

	/* request IP_PKTINFO for debugging purposes */
	if (setsockopt(sock, SOL_IP, IP_PKTINFO,
		       &enabled, sizeof(enabled)) < 0)
		printf("%s: %s\n", "setsockopt IP_PKTINFO", strerror(errno));

	/* verify socket options */
	len = sizeof(val);
	if (getsockopt(sock, SOL_SOCKET, SO_WIFI_STATUS, &val, &len) < 0)
		printf("%s: %s\n", "getsockopt SO_TIMESTAMP", strerror(errno));
	else
		printf("SO_WIFI_STATUS %d\n", val);

	/* send packets forever every five seconds */
	gettimeofday(&next, 0);
	next.tv_sec = (next.tv_sec + 1) / 5 * 5;
	next.tv_usec = 0;
	while (1) {
		struct timeval now;
		struct timeval delta;
		long delta_us;
		int res;
		fd_set readfs, errorfs;

		gettimeofday(&now, 0);
		delta_us = (long)(next.tv_sec - now.tv_sec) * 1000000 +
			(long)(next.tv_usec - now.tv_usec);
		if (delta_us > 0) {
			/* continue waiting for timeout or data */
			delta.tv_sec = delta_us / 1000000;
			delta.tv_usec = delta_us % 1000000;

			FD_ZERO(&readfs);
			FD_ZERO(&errorfs);
			FD_SET(sock, &readfs);
			FD_SET(sock, &errorfs);
			printf("%ld.%06ld: select %ldus\n",
			       (long)now.tv_sec, (long)now.tv_usec,
			       delta_us);
			res = select(sock + 1, &readfs, 0, &errorfs, &delta);
			gettimeofday(&now, 0);
			printf("%ld.%06ld: select returned: %d, %s\n",
			       (long)now.tv_sec, (long)now.tv_usec,
			       res,
			       res < 0 ? strerror(errno) : "success");
			if (res > 0) {
				if (FD_ISSET(sock, &readfs))
					printf("ready for reading\n");
				if (FD_ISSET(sock, &errorfs))
					printf("has error\n");
				recvpacket(sock);
			}
		} else {
			/* write one packet */
			sendpacket(sock,
				   (struct sockaddr *)&addr,
				   sizeof(addr));
			next.tv_sec += 5;
			continue;
		}
	}

	return 0;
}


--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] iproute2: Fix usage and man page for 'ip link'
From: Stephen Hemminger @ 2011-10-07 18:05 UTC (permalink / raw)
  To: Sridhar Samudrala; +Cc: netdev
In-Reply-To: <1317942651.6433.25.camel@w-sridhar.beaverton.ibm.com>

On Thu, 06 Oct 2011 16:10:51 -0700
Sridhar Samudrala <sri@us.ibm.com> wrote:

> Add bridge as a supported type with 'ip link' in usage and all the missing
> types in 'ip' man page. Also fixed some typos.
> 
> Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>

Applied. This manpage is getting huge.

^ permalink raw reply

* Re: [RFC] net: remove erroneous sk null assignment in timestamping
From: Johannes Berg @ 2011-10-07 17:53 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, richardcochran
In-Reply-To: <1318009644.3988.33.camel@jlt3.sipsolutions.net>

On Fri, 2011-10-07 at 19:47 +0200, Johannes Berg wrote:

> I'm afraid as is this is terminally broken. I don't have a device with a
> dp83640 PHY, but I bet you can cause kernel crashes by doing something
> like
> 
> while (1)
> 	fd = open()
> 	enable tx timestamping on fd;
> 	send(fd, frame)
> 	close(fd);

It's possible that it doesn't crash *if* (and only if!) the ethernet
driver is guaranteed to process the TXTS RX packet before freeing the
original SKB off the TX queue, and also never orphans or whatever the
original TX SKB. In that case the original TX SKB will hang on to the
socket for long enough I guess.

But that's not something I'd want to rely on. All it means that the
above code isn't an instant kernel crash. The code is still broken.

johannes

^ permalink raw reply

* Re: [RFC] net: remove erroneous sk null assignment in timestamping
From: Johannes Berg @ 2011-10-07 17:47 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, richardcochran
In-Reply-To: <1318009222.3988.28.camel@jlt3.sipsolutions.net>

On Fri, 2011-10-07 at 19:40 +0200, Johannes Berg wrote:

> > It looks like skb_clone_tx_timestamp() sets clone->sk without any
> > proper refcounting, so I bet this NULL'ing it out is working
> > around that bug.
> 
> You're right. But this bug can actually trigger another way: The only
> user of this is dp83640_txtstamp() which might do kfree_skb() on this
> skb, in which case that'll likely crash trying to clean up the sk.

Not only that -- I don't even see what causes the socket to not go away
while we're waiting for the ts_work that dp83640_txtstamp() fires off to
run.

And actually, no, this is all wrong. It shouldn't kick off ts_work
anyway, ts_work only handles *RX* timestamps, never *TX*. The *TX*
timestamps are handled by decode_txts() which is called when we receive
a TXTS frame from the device... *that* dequeues the skb from tx_queue
there.

I'm afraid as is this is terminally broken. I don't have a device with a
dp83640 PHY, but I bet you can cause kernel crashes by doing something
like

while (1)
	fd = open()
	enable tx timestamping on fd;
	send(fd, frame)
	close(fd);

I don't even think any of this is privileged.

johannes

^ permalink raw reply

* Re: [RFC] net: remove erroneous sk null assignment in timestamping
From: Johannes Berg @ 2011-10-07 17:40 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, richardcochran
In-Reply-To: <20111007.133356.489094996618032061.davem@davemloft.net>

On Fri, 2011-10-07 at 13:33 -0400, David Miller wrote:
> From: Johannes Berg <johannes@sipsolutions.net>
> Date: Fri, 07 Oct 2011 19:11:41 +0200
> 
> > From: Johannes Berg <johannes.berg@intel.com>
> > 
> > skb->sk is obviously required to be non-NULL
> > when we get into skb_complete_tx_timestamp().
> > sock_queue_err_skb() will call skb_orphan()
> > first thing which sets skb->sk = NULL itself.
> > This may crash if the skb is still charged to
> > the socket (skb->destructor is sk_wfree).
> > 
> > The assignment here thus seems to not only be
> > pointless (due to the skb_orphan() call) but
> > also dangerous (due to the crash).
> > 
> > Signed-off-by: Johannes Berg <johannes.berg@intel.com>
> 
> It looks like skb_clone_tx_timestamp() sets clone->sk without any
> proper refcounting, so I bet this NULL'ing it out is working
> around that bug.

You're right. But this bug can actually trigger another way: The only
user of this is dp83640_txtstamp() which might do kfree_skb() on this
skb, in which case that'll likely crash trying to clean up the sk.

johannes

^ permalink raw reply

* Re: [RFC] net: remove erroneous sk null assignment in timestamping
From: David Miller @ 2011-10-07 17:33 UTC (permalink / raw)
  To: johannes; +Cc: netdev, richardcochran
In-Reply-To: <1318007501.3988.20.camel@jlt3.sipsolutions.net>

From: Johannes Berg <johannes@sipsolutions.net>
Date: Fri, 07 Oct 2011 19:11:41 +0200

> From: Johannes Berg <johannes.berg@intel.com>
> 
> skb->sk is obviously required to be non-NULL
> when we get into skb_complete_tx_timestamp().
> sock_queue_err_skb() will call skb_orphan()
> first thing which sets skb->sk = NULL itself.
> This may crash if the skb is still charged to
> the socket (skb->destructor is sk_wfree).
> 
> The assignment here thus seems to not only be
> pointless (due to the skb_orphan() call) but
> also dangerous (due to the crash).
> 
> Signed-off-by: Johannes Berg <johannes.berg@intel.com>

It looks like skb_clone_tx_timestamp() sets clone->sk without any
proper refcounting, so I bet this NULL'ing it out is working
around that bug.

The TX side of this infrastructure seems very poorly tested.

^ permalink raw reply

* Re: A new 40G Network driver ready to submit to the kernel tree
From: Denys Fedoryshchenko @ 2011-10-07 17:27 UTC (permalink / raw)
  To: Joyce Yu - System Software; +Cc: netdev, jeffrey.t.kirsher
In-Reply-To: <4E8F34A1.5020503@oracle.com>

 On Fri, 07 Oct 2011 10:19:29 -0700, Joyce Yu - System Software wrote:
>> Submit the patches to netdev for review/acceptance.
>
> Shall I just send plain driverxxx.c and driverxxx.h files,   or zip 
> or
> tarball?
>
 I guess most of answers are inside

 http://www.mjmwired.net/kernel/Documentation/SubmittingDrivers
 http://www.mjmwired.net/kernel/Documentation/SubmittingPatches

 For example:

 It is also available inside kernel, in Documentation directory.

 ---
 Denys Fedoryshchenko <nuclearcat@nuclearcat.com

^ permalink raw reply

* Re: A new 40G Network driver ready to submit to the kernel tree
From: Ben Hutchings @ 2011-10-07 17:33 UTC (permalink / raw)
  To: Joyce Yu - System Software; +Cc: netdev, jeffrey.t.kirsher
In-Reply-To: <4E8F34A1.5020503@oracle.com>

On Fri, 2011-10-07 at 10:19 -0700, Joyce Yu - System Software wrote:
> > 
> > Submit the patches to netdev for review/acceptance.
> 
> Shall I just send plain driverxxx.c and driverxxx.h files,   or zip or
> tarball?

Please read Documentation/Submitting{Patches,Drivers}; then ask any
further questions that they don't answer.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: [PATCH] net: use sock_valbool_flag to set/clear SOCK_RXQ_OVFL
From: David Miller @ 2011-10-07 17:27 UTC (permalink / raw)
  To: nhorman; +Cc: johannes, netdev
In-Reply-To: <20111007134023.GC11408@hmsreliant.think-freely.org>

From: Neil Horman <nhorman@tuxdriver.com>
Date: Fri, 7 Oct 2011 09:40:23 -0400

> On Fri, Oct 07, 2011 at 03:30:20PM +0200, Johannes Berg wrote:
>> From: Johannes Berg <johannes.berg@intel.com>
>> 
>> There's no point in open-coding sock_valbool_flag().
>> 
>> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
 ...
> Looks good, thanks Johannes!
> Acked-by: Neil Horman <nhorman@tuxdriver.com>

Applied to net-next, thanks.

^ permalink raw reply

* Re: [PATCH] IPv6: DAD from bonding iface is treated as dup address from others
From: Neil Horman @ 2011-10-07 17:29 UTC (permalink / raw)
  To: Yinglin Sun; +Cc: netdev
In-Reply-To: <CAN17JHVC-qVst1vw5eOC1Wg8ekBejped5WErVo5wCAWTBF9JSw@mail.gmail.com>

On Fri, Oct 07, 2011 at 09:59:06AM -0700, Yinglin Sun wrote:
> On Thu, Oct 6, 2011 at 11:13 PM, Chuck Anderson <cra@wpi.edu> wrote:
> >
> > On Thu, Oct 06, 2011 at 06:24:36PM -0700, Yinglin Sun wrote:
> > > On Thu, Oct 6, 2011 at 5:59 PM, Jay Vosburgh <fubar@us.ibm.com> wrote:
> > > >        Why are you setting up the port channel after configuring the
> > > > bond?
> > > >
> > > >        As a possible workaround, if you have control over the setup
> > > > process (perhaps it's some sort of manual process), adding one slave to
> > > > the bond, leaving the other soon-to-be slaves down, then setting up the
> > > > switch, and finally adding the remaining slaves should work around the
> > > > issue, since if the bond has only one slave it won't see any looped
> > > > packets.
> > > >
> > > >        Or you could bring the bond up as active-backup, then change the
> > > > mode to balance-xor once the switch is configured.
> > > >
> > > >        Ultimately, though, the problem stems from the settings mismatch
> > > > between the switch and the bonding system; balance-xor is meant to
> > > > interoperate with etherchannel, and when the switch is not configured
> > > > properly, correct behavior is difficult to guarantee.
> > > >
> > >
> > > Jay,
> > >
> > > Thanks a lot for the suggestion.
> > >
> > > It's mainly about usability. We would like to provide customers with
> > > consistent IPv6 configuration procedures as IPv4.  Such workarounds
> > > could be confusing and generate customer calls.
> >
> > You've created/encouraged your customers to create a broken network
> > configuration by connecting two bonded links to a non-bonded,
> > non-etherchannel switch port pair.  This type of misconfiguration,
> > when applied to inter-switch trunks, can cause major network issues,
> > like looping and broadcast storms, taking down the entire network
> > unless something like Spanning Tree is enabled to protect against such
> > accidental loops.  It should be avoided at all costs.  Luckily, if the
> > Linux host in this case is not being used as a switch/bridge, the
> > impact of this might not be so bad--perhaps limited to the IPv6 DAD
> > issue you report.
> >
> > If you want better usability and plug-n-play bonding, then require
> > LACP/802.3ad to be used.  Please don't encourage your customers to
> > connect misconfigured devices to the network, thanks.
> 
> You are right. LACP is the good choice. It should be able to solve
> this issue, since all LACP bonding slaves are down before port channel
> is set up on switch. Thanks.
> 
> I'm not sure this is kind of broken network configuration. If
> customers happen to add some IPv6 addresses on bonding interface
> before setting up port channel on switch, they have to reconfigure all
Bringing up an interface prior to having it, and the peer interfaces configured
to use an agreed upon mode, is rather by definition broken :)

^ permalink raw reply

* Re: Asserting ECN from userspace?
From: Rick Jones @ 2011-10-07 17:25 UTC (permalink / raw)
  To: David Täht; +Cc: netdev, bloat-devel
In-Reply-To: <4E8BF6B2.6030101@gmail.com>

On 10/04/2011 11:18 PM, David Täht wrote:
>
> No sooner had I noted (with pleasure) the kernel's new ability to
> correctly set the dscp bits on IPv6 TCP streams without messing with the
> negotiated ECN status, that I found several use cases where being able
> to assert ECN from userspace (for either ipv4, or ipv6) would be useful.
> ...
> 3) Web Proxies. A web proxy could note when it was experiencing
> congestion on one side of the proxied connection (or another) and signal
> the other side to slow down.
> ...
> As for 3... perhaps a grantable network capability? A proxy could
> acquire privs to twiddle those bits before dropping root privs.

For 3, couldn't/shouldn't the proxy simply stop draining the appropriate 
socket buffer to cause TCP's existing flow control to slow down that side?

rick jones

^ permalink raw reply

* Re: A new 40G Network driver ready to submit to the kernel tree
From: Joyce Yu - System Software @ 2011-10-07 17:19 UTC (permalink / raw)
  To: netdev, jeffrey.t.kirsher


> 
> Submit the patches to netdev for review/acceptance.

Shall I just send plain driverxxx.c and driverxxx.h files,   or zip or
tarball?


Thanks,
Joyce

^ permalink raw reply

* [RFC] net: add wireless TX status socket option
From: Johannes Berg @ 2011-10-07 17:19 UTC (permalink / raw)
  To: netdev; +Cc: linux-wireless, Neil Horman

From: Johannes Berg <johannes.berg-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

The 802.1X EAPOL handshake hostapd does requires
knowing whether the frame was ack'ed by the peer.
Currently, we fudge this pretty badly by not even
transmitting the frame as a normal data frame but
injecting it with radiotap and getting the status
out of radiotap monitor as well. This is rather
complex, confuses users (mon.wlan0 presence) and
doesn't work with all hardware.

To get rid of that hack, introduce a real wifi TX
status option for data frame transmissions.

This works similar to the existing TX timestamping
in that it reflects the SKB back to the socket's
error queue with a SCM_WIFI_STATUS cmsg that has
an int indicating ACK status (0/1).

Since it is possible that at some point we will
want to have TX timestamping and wifi status in a
single errqueue SKB (there's little point in not
doing that), redefine SO_EE_ORIGIN_TIMESTAMPING
to SO_EE_ORIGIN_TXSTATUS which can collect more
than just the timestamp; keep the old constant
as an alias of course. Currently the internal APIs
don't make that possible, but it wouldn't be hard
to split them up in a way that makes it possible.

Thanks to Neil Horman for helping me figure out
the functions that add the control messages.

TODO:
 * sock_tx_timestamp() function should be renamed,
   maybe to sock_tx_status()?
 * sock_recv_timestamp() should also be renamed,
   I had a hard time figuring out the difference
   between that and sock_recv_ts_and_drops(). The
   former is generic, while the latter adds RX
   information only, maybe that should be reflected
   in new names?

Signed-off-by: Johannes Berg <johannes.berg-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 arch/alpha/include/asm/socket.h   |    3 +++
 arch/arm/include/asm/socket.h     |    3 +++
 arch/avr32/include/asm/socket.h   |    3 +++
 arch/cris/include/asm/socket.h    |    3 +++
 arch/frv/include/asm/socket.h     |    3 +++
 arch/h8300/include/asm/socket.h   |    3 +++
 arch/ia64/include/asm/socket.h    |    3 +++
 arch/m32r/include/asm/socket.h    |    3 +++
 arch/m68k/include/asm/socket.h    |    3 +++
 arch/mips/include/asm/socket.h    |    3 +++
 arch/mn10300/include/asm/socket.h |    3 +++
 arch/parisc/include/asm/socket.h  |    3 +++
 arch/powerpc/include/asm/socket.h |    3 +++
 arch/s390/include/asm/socket.h    |    3 +++
 arch/sparc/include/asm/socket.h   |    3 +++
 arch/xtensa/include/asm/socket.h  |    3 +++
 include/asm-generic/socket.h      |    3 +++
 include/linux/errqueue.h          |    3 ++-
 include/linux/skbuff.h            |   18 ++++++++++++++++--
 include/net/sock.h                |    7 +++++++
 net/core/skbuff.c                 |   20 ++++++++++++++++++++
 net/core/sock.c                   |    9 +++++++++
 net/mac80211/status.c             |   35 +++++++++++++++++++++++++++--------
 net/mac80211/tx.c                 |    8 +++++++-
 net/socket.c                      |   19 +++++++++++++++++++
 25 files changed, 158 insertions(+), 12 deletions(-)

--- a/include/asm-generic/socket.h	2011-10-07 18:59:12.000000000 +0200
+++ b/include/asm-generic/socket.h	2011-10-07 18:59:12.000000000 +0200
@@ -64,4 +64,7 @@
 #define SO_DOMAIN		39
 
 #define SO_RXQ_OVFL             40
+
+#define SO_WIFI_STATUS		41
+#define SCM_WIFI_STATUS	SO_WIFI_STATUS
 #endif /* __ASM_GENERIC_SOCKET_H */
--- a/net/core/sock.c	2011-10-07 18:59:12.000000000 +0200
+++ b/net/core/sock.c	2011-10-07 18:59:12.000000000 +0200
@@ -740,6 +740,11 @@ set_rcvbuf:
 	case SO_RXQ_OVFL:
 		sock_valbool_flag(sk, SOCK_RXQ_OVFL, valbool);
 		break;
+
+	case SO_WIFI_STATUS:
+		sock_valbool_flag(sk, SOCK_WIFI_STATUS, valbool);
+		break;
+
 	default:
 		ret = -ENOPROTOOPT;
 		break;
@@ -961,6 +966,10 @@ int sock_getsockopt(struct socket *sock,
 		v.val = !!sock_flag(sk, SOCK_RXQ_OVFL);
 		break;
 
+	case SO_WIFI_STATUS:
+		v.val = !!sock_flag(sk, SOCK_WIFI_STATUS);
+		break;
+
 	default:
 		return -ENOPROTOOPT;
 	}
--- a/include/net/sock.h	2011-10-07 18:59:12.000000000 +0200
+++ b/include/net/sock.h	2011-10-07 18:59:12.000000000 +0200
@@ -564,6 +564,7 @@ enum sock_flags {
 	SOCK_FASYNC, /* fasync() active */
 	SOCK_RXQ_OVFL,
 	SOCK_ZEROCOPY, /* buffers from userspace */
+	SOCK_WIFI_STATUS, /* push wifi status to userspace */
 };
 
 static inline void sock_copy_flags(struct sock *nsk, struct sock *osk)
@@ -1705,7 +1706,10 @@ static inline int sock_intr_errno(long t
 
 extern void __sock_recv_timestamp(struct msghdr *msg, struct sock *sk,
 	struct sk_buff *skb);
+extern void __sock_recv_wifi_status(struct msghdr *msg, struct sock *sk,
+	struct sk_buff *skb);
 
+/* XXX: rename this function now? */
 static __inline__ void
 sock_recv_timestamp(struct msghdr *msg, struct sock *sk, struct sk_buff *skb)
 {
@@ -1732,6 +1736,9 @@ sock_recv_timestamp(struct msghdr *msg,
 		__sock_recv_timestamp(msg, sk, skb);
 	else
 		sk->sk_stamp = kt;
+
+	if (sock_flag(sk, SOCK_WIFI_STATUS) && skb->wifi_acked_valid)
+		__sock_recv_wifi_status(msg, sk, skb);
 }
 
 extern void __sock_recv_ts_and_drops(struct msghdr *msg, struct sock *sk,
--- a/arch/alpha/include/asm/socket.h	2011-10-07 18:59:11.000000000 +0200
+++ b/arch/alpha/include/asm/socket.h	2011-10-07 18:59:12.000000000 +0200
@@ -69,6 +69,9 @@
 
 #define SO_RXQ_OVFL             40
 
+#define SO_WIFI_STATUS		41
+#define SCM_WIFI_STATUS		SO_WIFI_STATUS
+
 /* O_NONBLOCK clashes with the bits used for socket types.  Therefore we
  * have to define SOCK_NONBLOCK to a different value here.
  */
--- a/arch/arm/include/asm/socket.h	2011-10-07 18:59:11.000000000 +0200
+++ b/arch/arm/include/asm/socket.h	2011-10-07 18:59:12.000000000 +0200
@@ -62,4 +62,7 @@
 
 #define SO_RXQ_OVFL             40
 
+#define SO_WIFI_STATUS		41
+#define SCM_WIFI_STATUS		SO_WIFI_STATUS
+
 #endif /* _ASM_SOCKET_H */
--- a/arch/avr32/include/asm/socket.h	2011-10-07 18:59:10.000000000 +0200
+++ b/arch/avr32/include/asm/socket.h	2011-10-07 18:59:12.000000000 +0200
@@ -62,4 +62,7 @@
 
 #define SO_RXQ_OVFL             40
 
+#define SO_WIFI_STATUS		41
+#define SCM_WIFI_STATUS		SO_WIFI_STATUS
+
 #endif /* __ASM_AVR32_SOCKET_H */
--- a/arch/cris/include/asm/socket.h	2011-10-07 18:59:11.000000000 +0200
+++ b/arch/cris/include/asm/socket.h	2011-10-07 18:59:12.000000000 +0200
@@ -64,6 +64,9 @@
 
 #define SO_RXQ_OVFL             40
 
+#define SO_WIFI_STATUS		41
+#define SCM_WIFI_STATUS		SO_WIFI_STATUS
+
 #endif /* _ASM_SOCKET_H */
 
 
--- a/arch/frv/include/asm/socket.h	2011-10-07 18:59:11.000000000 +0200
+++ b/arch/frv/include/asm/socket.h	2011-10-07 18:59:12.000000000 +0200
@@ -62,5 +62,8 @@
 
 #define SO_RXQ_OVFL             40
 
+#define SO_WIFI_STATUS		41
+#define SCM_WIFI_STATUS		SO_WIFI_STATUS
+
 #endif /* _ASM_SOCKET_H */
 
--- a/arch/h8300/include/asm/socket.h	2011-10-07 18:59:10.000000000 +0200
+++ b/arch/h8300/include/asm/socket.h	2011-10-07 18:59:12.000000000 +0200
@@ -62,4 +62,7 @@
 
 #define SO_RXQ_OVFL             40
 
+#define SO_WIFI_STATUS		41
+#define SCM_WIFI_STATUS		SO_WIFI_STATUS
+
 #endif /* _ASM_SOCKET_H */
--- a/arch/ia64/include/asm/socket.h	2011-10-07 18:59:11.000000000 +0200
+++ b/arch/ia64/include/asm/socket.h	2011-10-07 18:59:12.000000000 +0200
@@ -71,4 +71,7 @@
 
 #define SO_RXQ_OVFL             40
 
+#define SO_WIFI_STATUS		41
+#define SCM_WIFI_STATUS		SO_WIFI_STATUS
+
 #endif /* _ASM_IA64_SOCKET_H */
--- a/arch/m32r/include/asm/socket.h	2011-10-07 18:59:11.000000000 +0200
+++ b/arch/m32r/include/asm/socket.h	2011-10-07 18:59:12.000000000 +0200
@@ -62,4 +62,7 @@
 
 #define SO_RXQ_OVFL             40
 
+#define SO_WIFI_STATUS		41
+#define SCM_WIFI_STATUS		SO_WIFI_STATUS
+
 #endif /* _ASM_M32R_SOCKET_H */
--- a/arch/m68k/include/asm/socket.h	2011-10-07 18:59:11.000000000 +0200
+++ b/arch/m68k/include/asm/socket.h	2011-10-07 18:59:12.000000000 +0200
@@ -62,4 +62,7 @@
 
 #define SO_RXQ_OVFL             40
 
+#define SO_WIFI_STATUS		41
+#define SCM_WIFI_STATUS		SO_WIFI_STATUS
+
 #endif /* _ASM_SOCKET_H */
--- a/arch/mips/include/asm/socket.h	2011-10-07 18:59:11.000000000 +0200
+++ b/arch/mips/include/asm/socket.h	2011-10-07 18:59:12.000000000 +0200
@@ -82,6 +82,9 @@ To add: #define SO_REUSEPORT 0x0200	/* A
 
 #define SO_RXQ_OVFL             40
 
+#define SO_WIFI_STATUS		41
+#define SCM_WIFI_STATUS		SO_WIFI_STATUS
+
 #ifdef __KERNEL__
 
 /** sock_type - Socket types
--- a/arch/mn10300/include/asm/socket.h	2011-10-07 18:59:10.000000000 +0200
+++ b/arch/mn10300/include/asm/socket.h	2011-10-07 18:59:12.000000000 +0200
@@ -62,4 +62,7 @@
 
 #define SO_RXQ_OVFL             40
 
+#define SO_WIFI_STATUS		41
+#define SCM_WIFI_STATUS		SO_WIFI_STATUS
+
 #endif /* _ASM_SOCKET_H */
--- a/arch/parisc/include/asm/socket.h	2011-10-07 18:59:10.000000000 +0200
+++ b/arch/parisc/include/asm/socket.h	2011-10-07 18:59:12.000000000 +0200
@@ -61,6 +61,9 @@
 
 #define SO_RXQ_OVFL             0x4021
 
+#define SO_WIFI_STATUS		0x4022
+#define SCM_WIFI_STATUS		SO_WIFI_STATUS
+
 /* O_NONBLOCK clashes with the bits used for socket types.  Therefore we
  * have to define SOCK_NONBLOCK to a different value here.
  */
--- a/arch/powerpc/include/asm/socket.h	2011-10-07 18:59:11.000000000 +0200
+++ b/arch/powerpc/include/asm/socket.h	2011-10-07 18:59:12.000000000 +0200
@@ -69,4 +69,7 @@
 
 #define SO_RXQ_OVFL             40
 
+#define SO_WIFI_STATUS		41
+#define SCM_WIFI_STATUS		SO_WIFI_STATUS
+
 #endif	/* _ASM_POWERPC_SOCKET_H */
--- a/arch/s390/include/asm/socket.h	2011-10-07 18:59:10.000000000 +0200
+++ b/arch/s390/include/asm/socket.h	2011-10-07 18:59:12.000000000 +0200
@@ -70,4 +70,7 @@
 
 #define SO_RXQ_OVFL             40
 
+#define SO_WIFI_STATUS		41
+#define SCM_WIFI_STATUS		SO_WIFI_STATUS
+
 #endif /* _ASM_SOCKET_H */
--- a/arch/sparc/include/asm/socket.h	2011-10-07 18:59:10.000000000 +0200
+++ b/arch/sparc/include/asm/socket.h	2011-10-07 18:59:12.000000000 +0200
@@ -58,6 +58,9 @@
 
 #define SO_RXQ_OVFL             0x0024
 
+#define SO_WIFI_STATUS		0x0025
+#define SCM_WIFI_STATUS		SO_WIFI_STATUS
+
 /* Security levels - as per NRL IPv6 - don't actually do anything */
 #define SO_SECURITY_AUTHENTICATION		0x5001
 #define SO_SECURITY_ENCRYPTION_TRANSPORT	0x5002
--- a/arch/xtensa/include/asm/socket.h	2011-10-07 18:59:10.000000000 +0200
+++ b/arch/xtensa/include/asm/socket.h	2011-10-07 18:59:12.000000000 +0200
@@ -73,4 +73,7 @@
 
 #define SO_RXQ_OVFL             40
 
+#define SO_WIFI_STATUS		41
+#define SCM_WIFI_STATUS		SO_WIFI_STATUS
+
 #endif	/* _XTENSA_SOCKET_H */
--- a/include/linux/errqueue.h	2011-10-07 18:59:11.000000000 +0200
+++ b/include/linux/errqueue.h	2011-10-07 18:59:12.000000000 +0200
@@ -17,7 +17,8 @@ struct sock_extended_err {
 #define SO_EE_ORIGIN_LOCAL	1
 #define SO_EE_ORIGIN_ICMP	2
 #define SO_EE_ORIGIN_ICMP6	3
-#define SO_EE_ORIGIN_TIMESTAMPING 4
+#define SO_EE_ORIGIN_TXSTATUS	4
+#define SO_EE_ORIGIN_TIMESTAMPING SO_EE_ORIGIN_TXSTATUS
 
 #define SO_EE_OFFENDER(ee)	((struct sockaddr*)((ee)+1))
 
--- a/include/linux/skbuff.h	2011-10-07 18:59:11.000000000 +0200
+++ b/include/linux/skbuff.h	2011-10-07 18:59:12.000000000 +0200
@@ -190,6 +190,9 @@ enum {
 
 	/* device driver supports TX zero-copy buffers */
 	SKBTX_DEV_ZEROCOPY = 1 << 4,
+
+	/* generate wifi status information (where possible) */
+	SKBTX_WIFI_STATUS = 1 << 5,
 };
 
 /*
@@ -322,6 +325,8 @@ typedef unsigned char *sk_buff_data_t;
  *	@queue_mapping: Queue mapping for multiqueue devices
  *	@ndisc_nodetype: router type (from link layer)
  *	@ooo_okay: allow the mapping of a socket to a queue to be changed
+ *	@wifi_acked_valid: wifi_acked was set
+ *	@wifi_acked: whether frame was acked on wifi or not
  *	@dma_cookie: a cookie to one of several possible DMA operations
  *		done by skb DMA functions
  *	@secmark: security marking
@@ -414,10 +419,10 @@ struct sk_buff {
 	__u8			ndisc_nodetype:2;
 #endif
 	__u8			ooo_okay:1;
+	__u8			wifi_acked_valid:1;
+	__u8			wifi_acked:1;
 	kmemcheck_bitfield_end(flags2);
 
-	/* 0/13 bit hole */
-
 #ifdef CONFIG_NET_DMA
 	dma_cookie_t		dma_cookie;
 #endif
@@ -2062,6 +2067,15 @@ static inline void skb_tx_timestamp(stru
 	sw_tx_timestamp(skb);
 }
 
+/**
+ * skb_complete_wifi_ack - deliver cloned skb with wifi status
+ *
+ * @skb: clone of the the original outgoing packet
+ * @acked: ack status
+ *
+ */
+void skb_complete_wifi_ack(struct sk_buff *skb, bool acked);
+
 extern __sum16 __skb_checksum_complete_head(struct sk_buff *skb, int len);
 extern __sum16 __skb_checksum_complete(struct sk_buff *skb);
 
--- a/net/core/skbuff.c	2011-10-07 18:59:12.000000000 +0200
+++ b/net/core/skbuff.c	2011-10-07 18:59:24.000000000 +0200
@@ -3150,6 +3150,26 @@ void skb_tstamp_tx(struct sk_buff *orig_
 }
 EXPORT_SYMBOL_GPL(skb_tstamp_tx);
 
+void skb_complete_wifi_ack(struct sk_buff *skb, bool acked)
+{
+	struct sock *sk = skb->sk;
+	struct sock_exterr_skb *serr;
+	int err;
+
+	skb->wifi_acked_valid = 1;
+	skb->wifi_acked = acked;
+
+	serr = SKB_EXT_ERR(skb);
+	memset(serr, 0, sizeof(*serr));
+	serr->ee.ee_errno = ENOMSG;
+	serr->ee.ee_origin = SO_EE_ORIGIN_TXSTATUS;
+
+	err = sock_queue_err_skb(sk, skb);
+	if (err)
+		kfree_skb(skb);
+}
+EXPORT_SYMBOL_GPL(skb_complete_wifi_ack);
+
 
 /**
  * skb_partial_csum_set - set up and verify partial csum values for packet
--- a/net/socket.c	2011-10-07 18:59:12.000000000 +0200
+++ b/net/socket.c	2011-10-07 18:59:12.000000000 +0200
@@ -531,6 +531,7 @@ void sock_release(struct socket *sock)
 }
 EXPORT_SYMBOL(sock_release);
 
+/* XXX: rename this function now */
 int sock_tx_timestamp(struct sock *sk, __u8 *tx_flags)
 {
 	*tx_flags = 0;
@@ -538,6 +539,8 @@ int sock_tx_timestamp(struct sock *sk, _
 		*tx_flags |= SKBTX_HW_TSTAMP;
 	if (sock_flag(sk, SOCK_TIMESTAMPING_TX_SOFTWARE))
 		*tx_flags |= SKBTX_SW_TSTAMP;
+	if (sock_flag(sk, SOCK_WIFI_STATUS))
+		*tx_flags |= SKBTX_WIFI_STATUS;
 	return 0;
 }
 EXPORT_SYMBOL(sock_tx_timestamp);
@@ -674,6 +677,22 @@ void __sock_recv_timestamp(struct msghdr
 }
 EXPORT_SYMBOL_GPL(__sock_recv_timestamp);
 
+void __sock_recv_wifi_status(struct msghdr *msg, struct sock *sk,
+	struct sk_buff *skb)
+{
+	int ack;
+
+	if (!sock_flag(sk, SOCK_WIFI_STATUS))
+		return;
+	if (!skb->wifi_acked_valid)
+		return;
+
+	ack = skb->wifi_acked;
+
+	put_cmsg(msg, SOL_SOCKET, SCM_WIFI_STATUS, sizeof(ack), &ack);
+}
+EXPORT_SYMBOL_GPL(__sock_recv_wifi_status);
+
 static inline void sock_recv_drops(struct msghdr *msg, struct sock *sk,
 				   struct sk_buff *skb)
 {
--- a/net/mac80211/status.c	2011-10-07 18:59:12.000000000 +0200
+++ b/net/mac80211/status.c	2011-10-07 19:05:24.000000000 +0200
@@ -252,8 +252,9 @@ void ieee80211_tx_status(struct ieee8021
 	struct sta_info *sta, *tmp;
 	int retry_count = -1, i;
 	int rates_idx = -1;
-	bool send_to_cooked;
+	bool send_to_cooked, need_for_monitors;
 	bool acked;
+	bool multicast;
 	struct ieee80211_bar *bar;
 	u16 tid;
 
@@ -278,6 +279,7 @@ void ieee80211_tx_status(struct ieee8021
 
 	sband = local->hw.wiphy->bands[info->band];
 	fc = hdr->frame_control;
+	multicast = is_multicast_ether_addr(hdr->addr1);
 
 	for_each_sta_info(local, hdr->addr1, sta, tmp) {
 		/* skip wrong virtual interface */
@@ -443,9 +445,6 @@ void ieee80211_tx_status(struct ieee8021
 			!!(info->flags & IEEE80211_TX_STAT_ACK), GFP_ATOMIC);
 	}
 
-	/* this was a transmitted frame, but now we want to reuse it */
-	skb_orphan(skb);
-
 	/* Need to make a copy before skb->cb gets cleared */
 	send_to_cooked = !!(info->flags & IEEE80211_TX_CTL_INJECTED) ||
 			(type != IEEE80211_FTYPE_DATA);
@@ -454,13 +453,33 @@ void ieee80211_tx_status(struct ieee8021
 	 * This is a bit racy but we can avoid a lot of work
 	 * with this test...
 	 */
-	if (!local->monitors && (!send_to_cooked || !local->cooked_mntrs)) {
-		dev_kfree_skb(skb);
-		return;
-	}
+	need_for_monitors = local->monitors ||
+	                    (send_to_cooked && local->cooked_mntrs);
 
 	/* send frame to monitor interfaces now */
 
+	if (!multicast && skb->sk &&
+	    skb_shinfo(skb)->tx_flags & SKBTX_WIFI_STATUS) {
+		struct sk_buff *mon_skb = NULL;
+
+                if (need_for_monitors)
+                        mon_skb = skb_clone(skb, GFP_ATOMIC);
+
+		/* consumes skb */
+		skb_complete_wifi_ack(skb, info->flags & IEEE80211_TX_STAT_ACK);
+
+		if (!mon_skb)
+			return;
+
+		skb = mon_skb;
+	} else if (!need_for_monitors) {
+	        dev_kfree_skb(skb);
+	        return;
+	} else {
+		/* this was a transmitted frame, but now we want to reuse it */
+		skb_orphan(skb);
+	}
+
 	if (skb_headroom(skb) < sizeof(*rthdr)) {
 		printk(KERN_ERR "ieee80211_tx_status: headroom too small\n");
 		dev_kfree_skb(skb);
--- a/net/mac80211/tx.c	2011-10-07 18:59:12.000000000 +0200
+++ b/net/mac80211/tx.c	2011-10-07 18:59:12.000000000 +0200
@@ -1686,6 +1686,7 @@ netdev_tx_t ieee80211_subif_start_xmit(s
 	bool wme_sta = false, authorized = false, tdls_auth = false;
 	struct sk_buff *tmp_skb;
 	bool tdls_direct = false;
+	bool multicast;
 
 	if (unlikely(skb->len < ETH_HLEN)) {
 		ret = NETDEV_TX_OK;
@@ -1872,7 +1873,8 @@ netdev_tx_t ieee80211_subif_start_xmit(s
 	 * if it is a multicast address (which can only happen
 	 * in AP mode)
 	 */
-	if (!is_multicast_ether_addr(hdr.addr1)) {
+	multicast = is_multicast_ether_addr(hdr.addr1);
+	if (!multicast) {
 		rcu_read_lock();
 		sta = sta_info_get(sdata, hdr.addr1);
 		if (sta) {
@@ -2019,6 +2021,10 @@ netdev_tx_t ieee80211_subif_start_xmit(s
 	memset(info, 0, sizeof(*info));
 
 	dev->trans_start = jiffies;
+
+	if (!multicast && skb_shinfo(skb)->tx_flags & SKBTX_WIFI_STATUS)
+		info->flags = IEEE80211_TX_CTL_REQ_TX_STATUS;
+
 	ieee80211_xmit(sdata, skb);
 
 	return NETDEV_TX_OK;


--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [RFC] net: remove erroneous sk null assignment in timestamping
From: Johannes Berg @ 2011-10-07 17:11 UTC (permalink / raw)
  To: netdev; +Cc: Richard Cochran

From: Johannes Berg <johannes.berg@intel.com>

skb->sk is obviously required to be non-NULL
when we get into skb_complete_tx_timestamp().
sock_queue_err_skb() will call skb_orphan()
first thing which sets skb->sk = NULL itself.
This may crash if the skb is still charged to
the socket (skb->destructor is sk_wfree).

The assignment here thus seems to not only be
pointless (due to the skb_orphan() call) but
also dangerous (due to the crash).

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
---
 net/core/timestamping.c |    1 -
 1 file changed, 1 deletion(-)

--- a/net/core/timestamping.c	2011-10-07 18:59:12.000000000 +0200
+++ b/net/core/timestamping.c	2011-10-07 19:07:06.000000000 +0200
@@ -85,7 +85,6 @@ void skb_complete_tx_timestamp(struct sk
 	memset(serr, 0, sizeof(*serr));
 	serr->ee.ee_errno = ENOMSG;
 	serr->ee.ee_origin = SO_EE_ORIGIN_TIMESTAMPING;
-	skb->sk = NULL;
 	err = sock_queue_err_skb(sk, skb);
 	if (err)
 		kfree_skb(skb);

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox