* Re: 2.6.37 regression: adding main interface to a bridge breaks vlan interface RX
From: chriss @ 2011-02-26 11:51 UTC (permalink / raw)
To: netdev
In-Reply-To: <AANLkTikSvs7jF9BZzbsYkLAawpCH2h1Z0r09ft219uaa@mail.gmail.com>
Jesse Gross <jesse <at> nicira.com> writes:
>
> Can you confirm this by running tcpdump -eni br0? I would expect that
> you see the correct packets but without vlan tags.
>
That's correct. I see the packets in br0 without tags, and tagged in eth1. That's
why I added the brouting rule in ebtables to drop it at eth1, and then it appears
in eth1.3 (untagged)...
* Re: [patch net-next-2.6 V3] net: convert bonding to use rx_handler
From: Nicolas de Pesloüan @ 2011-02-26 11:25 UTC (permalink / raw)
To: Jiri Pirko
Cc: David Miller, kaber, eric.dumazet, netdev, shemminger, fubar,
andy
In-Reply-To: <20110226071433.GA2783@psychotron.redhat.com>
Le 26/02/2011 08:14, Jiri Pirko a écrit :
> Sat, Feb 26, 2011 at 12:46:53AM CET, nicolas.2p.debian@gmail.com wrote:
>> Le 23/02/2011 20:05, Jiri Pirko a écrit :
>>> This patch converts bonding to use rx_handler. Results in cleaner
>>> __netif_receive_skb() with much less exceptions needed. Also
>>> bond-specific work is moved into bond code.
>>>
>>> Did performance test using pktgen and counting incoming packets by
>>> iptables. No regression noted.
>>>
>>> Signed-off-by: Jiri Pirko<jpirko@redhat.com>
>>>
>>> v1->v2:
>>> using skb_iif instead of new input_dev to remember original
>>> device
>>>
>>> v2->v3:
>>> do another loop in case skb->dev is changed. That way orig_dev
>>> core can be left untouched.
>>
>> Hi Jiri,
>>
>> Finally taking enough time for a review.
>>
>> I think we should split this change :
>>
>> 1/ Change __netif_receive_skb() to call rx_handler for diverted net_device, until rx_handler is NULL.
>>
>> 2/ Convert currently existing rx_handlers (bridge and macvlan) to use
>> this new "loop" feature, removing the need to call netif_rx() inside
>> their respective rx_handler and also removing the associated
>> overhead.
>
> This might not be possible. Macvlan uses result of called netif_rx for
> counting, bridge calls netdev_receive_skb via NF_HOOK. Nevertheless,
> this can be eventually handled later, not as a part of this patch.
Yes, I agree. Step 2 and step 3 can be swapped.
Anyway, we need to describe the options available to an rx_handler:
- Return the skb unchanged. This would cause normal delivery (ptype->dev == NULL or ptype->dev == skb->dev).
- Return the skb with skb->dev changed. __netif_receive_skb() will loop to the new device. This would cause
exact match delivery only (ptype->dev != NULL and ptype->dev == one of the orig_dev).
- Manage the skb another way and return NULL. This would prevent any protocol handler from receiving the
skb, unless the rx_handler arranges to re-inject the skb somewhere.
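The three options above can be sketched as a small user-space model. This is not kernel code: the struct layouts, the `master` field, and the `receive()` loop are simplified assumptions made for illustration, and only the three return conventions are taken from the description above.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for the kernel types; hypothetical, not the real layouts. */
struct net_device;
struct sk_buff { struct net_device *dev; };

typedef struct sk_buff *(*rx_handler_t)(struct sk_buff *skb);

struct net_device {
	const char *name;
	rx_handler_t rx_handler;	/* NULL when no handler is attached */
	struct net_device *master;	/* assumed: where a diverting handler sends the skb */
};

enum rx_result { RX_DELIVER, RX_CONSUMED };

/* Option 1: return the skb unchanged. */
struct sk_buff *pass_through(struct sk_buff *skb)
{
	return skb;
}

/* Option 2: return the skb with skb->dev changed. */
struct sk_buff *divert_to_master(struct sk_buff *skb)
{
	assert(skb->dev->master != NULL);
	skb->dev = skb->dev->master;
	return skb;
}

/* Option 3: take ownership of the skb and return NULL. */
struct sk_buff *consume(struct sk_buff *skb)
{
	(void)skb;
	return NULL;
}

/* Loop until no handler is left, the device stops changing, or the skb is consumed. */
enum rx_result receive(struct sk_buff *skb, int *rounds)
{
	assert(skb != NULL && skb->dev != NULL);
	*rounds = 0;
	for (;;) {
		struct net_device *dev = skb->dev;
		struct sk_buff *out;

		(*rounds)++;
		if (!dev->rx_handler)
			return RX_DELIVER;	/* plain device: normal delivery */
		out = dev->rx_handler(skb);
		if (!out)
			return RX_CONSUMED;	/* option 3 */
		if (out->dev == dev)
			return RX_DELIVER;	/* option 1 */
		skb = out;			/* option 2: another round */
	}
}
```

With eth0 -> bond0 -> br0 and a diverting handler on eth0 and bond0, this model takes three rounds and ends with skb->dev pointing at br0.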
>> 3/ Convert bonding to use rx_handlers.
>>
>> Also, on step 1, we definitely need to clarify what orig_dev should be.
>>
>> I now think that orig_dev should be "the device one level below the
>> current one" or NULL if current device was not diverted from another
>> one. It means that we should keep an array of crossed (diverted)
>> devices and the associated orig_dev. This array would be used to pass
>> the right orig_dev to protocol handlers, depending on the device they
>> register on :
>
> I constructed the patch in the way origdev is the same in all situations
> as before the patch. I think that this decision can be omitted at the
> moment.
Agreed, even if the current handling of orig_dev is far from bulletproof and needs to be clarified
at some point.
>> eth0 -> bond0 -> br0
>>
>> A protocol handler registered on bond0 would receive eth0 as orig_dev.
>> A protocol handler registered on br0 would receive bond0 as orig_dev.
>>
>> [snip]
>>
>>> @@ -3167,32 +3135,8 @@ static int __netif_receive_skb(struct sk_buff *skb)
>>
>> [snip]
>>
>>> +another_round:
>>> +
>>> + __this_cpu_inc(softnet_data.processed);
>>> +
>>> #ifdef CONFIG_NET_CLS_ACT
>>> if (skb->tc_verd& TC_NCLS) {
>>> skb->tc_verd = CLR_TC_NCLS(skb->tc_verd);
>>> @@ -3209,8 +3157,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
>>> #endif
>>>
>>> list_for_each_entry_rcu(ptype,&ptype_all, list) {
>>> - if (ptype->dev == null_or_orig || ptype->dev == skb->dev ||
>>> - ptype->dev == orig_dev) {
>>> + if (!ptype->dev || ptype->dev == skb->dev) {
>>> if (pt_prev)
>>> ret = deliver_skb(skb, pt_prev, orig_dev);
>>> pt_prev = ptype;
>>> @@ -3224,16 +3171,20 @@ static int __netif_receive_skb(struct sk_buff *skb)
>>> ncls:
>>> #endif
>>>
>>
>> Why do you loop to ptype_all before calling rx_handler ?
>>
>> I don't understand why ptype_all and ptype_base are not handled at
>> the same place in current __netif_receive_skb() but I think we should
>> take the opportunity to change that, unless someone knows of a good
>> reason not to do so.
>
> Again, the patch tries to make as few changes as it can. So this stays
> the same as before. In case you want to change it, feel free to submit
> patch doing that as follow-on.
The point here is that bridge and macvlan handling used to be after the ptype_all loop (hence the
place you inserted the call to rx_handler last summer), but the bonding part is currently before the
ptype_all loop.
Moving bonding handling after the ptype_all loop will cause the ptype_all loop to be run twice:
- first time, with skb->dev == eth0 and orig_dev == eth0.
- second time, with skb->dev == bond0 and orig_dev == eth0.
The first pass currently does not exist. And because bonding has not yet been given a chance to decide
that the frame should be dropped, the packet will always be delivered to eth0, causing duplicate
deliveries. Note that this is probably true for bridge and macvlan too, and that those duplicate
deliveries probably already exist.
Also, delivering skb inside a loop that may change the skb (skb->dev at least) is guaranteed to
produce strange behaviors.
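To make the duplicate-delivery concern concrete, here is a rough user-space model. Everything in it is an assumption for illustration (the `tap` struct, the `master` pointer, both `receive_*` functions). Only the match rule `!ptype->dev || ptype->dev == skb->dev` is taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Toy device: 'master' is the device the frame is diverted to, or NULL. */
struct net_device { const char *name; struct net_device *master; };

/* Toy ptype_all entry: dev == NULL means "all devices". */
struct tap { const struct net_device *dev; int hits; };

/* Same match rule as the patched ptype_all loop. */
void run_taps(struct tap *taps, size_t n, const struct net_device *cur)
{
	for (size_t i = 0; i < n; i++)
		if (!taps[i].dev || taps[i].dev == cur)
			taps[i].hits++;
}

/* Taps run on every round, before the device is diverted. */
void receive_taps_each_round(struct net_device *dev, struct tap *taps, size_t n)
{
	assert(dev != NULL);
	for (; dev; dev = dev->master)
		run_taps(taps, n, dev);
}

/* Taps run once, after all diversion is finished. */
void receive_taps_at_end(struct net_device *dev, struct tap *taps, size_t n)
{
	assert(dev != NULL);
	while (dev->master)
		dev = dev->master;
	run_taps(taps, n, dev);
}
```

In this model, with eth0 enslaved to bond0, a wildcard tap is hit twice when the taps run on every round and once when they run at the end. The second variant no longer reaches a tap bound to eth0, which is where exact-match delivery against orig_dev would have to come in.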
Can someone, knowing the history of ptype_all/ptype_base/bridge/macvlan/bonding/vlan handling in
__netif_receive_skb(), comment on this?
Are there any reasons not to process ptype_all and ptype_base at the same location, at the end of
__netif_receive_skb(), and to manage all divert features (bridge/macvlan/bonding/vlan) before?
Nicolas.
* Re: [RFC] be2net: add rxhash support
From: Eric Dumazet @ 2011-02-26 10:30 UTC (permalink / raw)
To: Ajit Khaparde; +Cc: netdev
In-Reply-To: <20110225213542.GA11773@akhaparde-VBox>
Le vendredi 25 février 2011 à 15:35 -0600, Ajit Khaparde a écrit :
> I asked that because, if a switch is part of the configuration,
> the ASIC can receive packets other than the tcp flow.
>
> And if hashing is enabled for IP packets, we can see this behavior.
> The other values indicate that hashing has been enabled for IPv4 packets.
To make sure RSS (and rxhash) was OK, I added the following debugging aid:
diff --git a/include/net/sock.h b/include/net/sock.h
index da0534d..e9b1180 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -688,6 +688,7 @@ static inline void sock_rps_save_rxhash(struct sock *sk, u32 rxhash)
{
#ifdef CONFIG_RPS
if (unlikely(sk->sk_rxhash != rxhash)) {
+ pr_err("rxhash change from %x to %x\n", sk->sk_rxhash, rxhash);
sock_rps_reset_flow(sk);
sk->sk_rxhash = rxhash;
}
And got the following traces:
[ 201.170297] change rxhash from 0 to be0b5a87
[ 232.607474] bonding: bond1: Setting eth3 as active slave.
[ 232.607478] bonding: bond1: making interface eth3 the new active one.
[ 232.710848] change rxhash from be0b5a87 to e56a3c1e
[ 300.047500] bonding: bond1: Setting eth1 as active slave.
[ 300.047504] bonding: bond1: making interface eth1 the new active one.
[ 300.159162] change rxhash from e56a3c1e to be0b5a87
The flip occurred when I changed my active slave (bonding mode=1).
eth1 is a bnx2 NIC, while eth3 is a be2net one, so it's OK for the rxhash to change in this case
(different firmware/algo).
So as far as be2net is concerned, everything seems OK: all packets for
a given flow get a unique RSS hash and can feed skb->rxhash.
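The check being instrumented can be modelled in user space roughly as follows. This is a sketch only: `struct sock_model` and `save_rxhash()` are hypothetical names, and the reset counter stands in for sock_rps_reset_flow().

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the socket fields involved. */
struct sock_model {
	uint32_t rxhash;		/* last hash recorded for this flow */
	unsigned int flow_resets;	/* stands in for sock_rps_reset_flow() */
};

/* Record a new hash; returns 1 if the stored hash changed, 0 otherwise. */
int save_rxhash(struct sock_model *sk, uint32_t rxhash)
{
	assert(sk != NULL);
	if (sk->rxhash == rxhash)
		return 0;
	fprintf(stderr, "rxhash change from %x to %x\n",
		(unsigned int)sk->rxhash, (unsigned int)rxhash);
	sk->flow_resets++;
	sk->rxhash = rxhash;
	return 1;
}
```

Feeding it the hash sequence from the traces (be0b5a87, then e56a3c1e after the first failover, then be0b5a87 again) gives three changes in total, while repeated packets carrying the same hash cause none.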
* Re: [PATCH net-next 0/6] Phonet: small pipe protocol fixes
From: Rémi Denis-Courmont @ 2011-02-26 9:15 UTC (permalink / raw)
To: David Miller; +Cc: netdev-u79uwXL29TY76Z2rM5mHXA, ofono-bdc2hr5oBkPYtjvyW6yDsg
In-Reply-To: <20110225.112406.246526410.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>
Le vendredi 25 février 2011 21:24:06 David Miller, vous avez écrit :
> > From: "Rémi Denis-Courmont" <remi.denis-courmont-xNZwKgViW5gAvxtiuMwx3w@public.gmane.org>
> > Date: Fri, 25 Feb 2011 11:13:41 +0200
> >
> >> This patch series cleans up and fixes a number of small bits in the
> >> Phonet pipe code, especially the experimental pipe controller. Once
> >> these small bits are sorted out, I will try to fix the controller
> >> protocol implementation proper so that we do not need the
> >> compile-time (experimental) flag anymore.
> >
> > All applied thanks.
> >
> > If you want to start using GIT to push phonet changes to me, frankly I
> > would welcome that :-)
No problem in principle. I need to figure out where to put linux-phonet.git
though.
> BTW, I had to add the following patch to fix a build warning:
Hmm, right. I am planning to kill this config option and reunify the cough
cough chal-len-ged *ahem* ST-Ericsson code with the Nokia code... So I confess
I did not bother to eliminate that ST-Ericsson-only mode warning.
--
Rémi Denis-Courmont
http://www.remlab.info/
http://fi.linkedin.com/in/remidenis
* Re: SO_REUSEPORT - can it be done in kernel?
From: David Miller @ 2011-02-26 7:46 UTC (permalink / raw)
To: eric.dumazet
Cc: herbert, rick.jones2, tgraf, therbert, wsommerfeld, daniel.baluta,
netdev
In-Reply-To: <1298705484.2659.126.camel@edumazet-laptop>
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sat, 26 Feb 2011 08:31:24 +0100
> UDP CORK is a problem indeed. I wonder who really uses it ?
git grep MSG_MORE -- net/sunrpc
* Re: SO_REUSEPORT - can it be done in kernel?
From: Eric Dumazet @ 2011-02-26 7:31 UTC (permalink / raw)
To: Herbert Xu
Cc: David Miller, rick.jones2, tgraf, therbert, wsommerfeld,
daniel.baluta, netdev
In-Reply-To: <20110226031118.GA21270@gondor.apana.org.au>
Le samedi 26 février 2011 à 11:11 +0800, Herbert Xu a écrit :
> On Fri, Feb 25, 2011 at 07:07:23PM -0800, David Miller wrote:
> > From: Herbert Xu <herbert@gondor.apana.org.au>
> > Date: Sat, 26 Feb 2011 10:48:48 +0800
> >
> > > I'm looking at redoing this and the bulk of the work is going to
> > > be restructuring ip_append_data/ip_push_pending_frames so that it
> > > doesn't store the states in sk/inet_sk.
> >
> > I suppose you're going to replace that stuff with an on-stack
> > control structure that gets passed around by reference or
> > similar?
>
> Either that or have ip_append_data do ip_push_pending_frames
> directly.
>
> That function's signature is a mess already and I need to think
> about this a bit more :)
>
> Cheers,
UDP CORK is a problem indeed. I wonder who really uses it ?
* Re: [patch net-next-2.6 V3] net: convert bonding to use rx_handler
From: Jiri Pirko @ 2011-02-26 7:14 UTC (permalink / raw)
To: Nicolas de Pesloüan
Cc: David Miller, kaber, eric.dumazet, netdev, shemminger, fubar,
andy
In-Reply-To: <4D683F6D.1030208@gmail.com>
Sat, Feb 26, 2011 at 12:46:53AM CET, nicolas.2p.debian@gmail.com wrote:
>Le 23/02/2011 20:05, Jiri Pirko a écrit :
>>This patch converts bonding to use rx_handler. Results in cleaner
>>__netif_receive_skb() with much less exceptions needed. Also
>>bond-specific work is moved into bond code.
>>
>>Did performance test using pktgen and counting incoming packets by
>>iptables. No regression noted.
>>
>>Signed-off-by: Jiri Pirko<jpirko@redhat.com>
>>
>>v1->v2:
>> using skb_iif instead of new input_dev to remember original
>> device
>>
>>v2->v3:
>> do another loop in case skb->dev is changed. That way orig_dev
>> core can be left untouched.
>
>Hi Jiri,
>
>Finally taking enough time for a review.
>
>I think we should split this change :
>
>1/ Change __netif_receive_skb() to call rx_handler for diverted net_device, until rx_handler is NULL.
>
>2/ Convert currently existing rx_handlers (bridge and macvlan) to use
>this new "loop" feature, removing the need to call netif_rx() inside
>their respective rx_handler and also removing the associated
>overhead.
This might not be possible. Macvlan uses result of called netif_rx for
counting, bridge calls netdev_receive_skb via NF_HOOK. Nevertheless,
this can be eventually handled later, not as a part of this patch.
>
>3/ Convert bonding to use rx_handlers.
>
>Also, on step 1, we definitely need to clarify what orig_dev should be.
>
>I now think that orig_dev should be "the device one level below the
>current one" or NULL if current device was not diverted from another
>one. It means that we should keep an array of crossed (diverted)
>devices and the associated orig_dev. This array would be used to pass
>the right orig_dev to protocol handlers, depending on the device they
>register on :
I constructed the patch in the way origdev is the same in all situations
as before the patch. I think that this decision can be omitted at the
moment.
>
>eth0 -> bond0 -> br0
>
>A protocol handler registered on bond0 would receive eth0 as orig_dev.
>A protocol handler registered on br0 would receive bond0 as orig_dev.
>
>[snip]
>
>>@@ -3167,32 +3135,8 @@ static int __netif_receive_skb(struct sk_buff *skb)
>
>[snip]
>
>>+another_round:
>>+
>>+ __this_cpu_inc(softnet_data.processed);
>>+
>> #ifdef CONFIG_NET_CLS_ACT
>> if (skb->tc_verd& TC_NCLS) {
>> skb->tc_verd = CLR_TC_NCLS(skb->tc_verd);
>>@@ -3209,8 +3157,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
>> #endif
>>
>> list_for_each_entry_rcu(ptype,&ptype_all, list) {
>>- if (ptype->dev == null_or_orig || ptype->dev == skb->dev ||
>>- ptype->dev == orig_dev) {
>>+ if (!ptype->dev || ptype->dev == skb->dev) {
>> if (pt_prev)
>> ret = deliver_skb(skb, pt_prev, orig_dev);
>> pt_prev = ptype;
>>@@ -3224,16 +3171,20 @@ static int __netif_receive_skb(struct sk_buff *skb)
>> ncls:
>> #endif
>>
>
>Why do you loop to ptype_all before calling rx_handler ?
>
>I don't understand why ptype_all and ptype_base are not handled at
>the same place in current __netif_receive_skb() but I think we should
>take the opportunity to change that, unless someone knows of a good
>reason not to do so.
Again, the patch tries to make as few changes as it can. So this stays
the same as before. In case you want to change it, feel free to submit
patch doing that as follow-on.
>
>>- /* Handle special case of bridge or macvlan */
>> rx_handler = rcu_dereference(skb->dev->rx_handler);
>> if (rx_handler) {
>
> Nicolas.
* Re: [PATCH] xps-mq: Transmit Packet Steering for multiqueue
From: David Miller @ 2011-02-26 7:09 UTC (permalink / raw)
To: bhutchings; +Cc: therbert, eric.dumazet, shemminger, netdev
In-Reply-To: <1298312395.2608.65.camel@bwh-desktop>
From: Ben Hutchings <bhutchings@solarflare.com>
Date: Mon, 21 Feb 2011 18:19:55 +0000
> On Wed, 2010-09-01 at 18:32 -0700, David Miller wrote:
>> 2) TX queue datastructures in the driver get reallocated using
>> memory in that NUMA domain.
>
> I've previously sent patches to add an ethtool API for NUMA control,
> which include the option to allocate on the same node where IRQs are
> handled. However, there is currently no function to allocate
> DMA-coherent memory on a specified NUMA node (rather than the device's
> node). This is likely to be beneficial for event rings and might be
> good for descriptor rings for some devices. (The implementation I sent
> for sfc mistakenly switched it to allocating non-coherent memory, for
> which it *is* possible to specify the node.)
The thing to do is to work with someone like FUJITA Tomonori on this.
It's simply a matter of making new APIs that take the node specifier,
have the implementations either make use of or completely ignore the node,
and have the existing APIs pass in "-1" for the node or whatever the
CPP macro is for this :-)
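The API shape described here could look roughly like the following user-space sketch. All names are hypothetical (`model_alloc_node`, `model_alloc`, `MODEL_NO_NODE`), and calloc() stands in for the real allocator, which in this model simply ignores the node.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical "no preference" value; the existing API passes this in. */
#define MODEL_NO_NODE (-1)

struct model_buf {
	void *mem;
	int node;	/* node that was requested, or MODEL_NO_NODE */
};

/* New API: takes the node specifier. This implementation ignores it. */
int model_alloc_node(struct model_buf *buf, size_t size, int node)
{
	assert(buf != NULL && node >= MODEL_NO_NODE);
	buf->mem = calloc(1, size);
	if (!buf->mem)
		return -1;
	buf->node = node;
	return 0;
}

/* Existing API: unchanged signature, now a thin wrapper. */
int model_alloc(struct model_buf *buf, size_t size)
{
	return model_alloc_node(buf, size, MODEL_NO_NODE);
}

void model_free(struct model_buf *buf)
{
	free(buf->mem);
	buf->mem = NULL;
}
```

Existing callers keep calling model_alloc() unchanged, while a driver that knows where its IRQs are handled can call model_alloc_node() with that node.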
* Re: [net-next-2.6 PATCH 02/10] ethtool: add ntuple flow specifier to network flow classifier
From: Alexander Duyck @ 2011-02-26 5:30 UTC (permalink / raw)
To: Ben Hutchings; +Cc: Alexander Duyck, davem, jeffrey.t.kirsher, netdev
In-Reply-To: <1298682048.3555.18.camel@localhost>
On Fri, Feb 25, 2011 at 5:00 PM, Ben Hutchings
<bhutchings@solarflare.com> wrote:
> On Fri, 2011-02-25 at 15:32 -0800, Alexander Duyck wrote:
>> This change is meant to add an ntuple define type to the rx network flow
>> classification specifiers. The idea is to allow ntuple to be displayed and
>> possibly configured via the network flow classification interface. To do
>> this I added a ntuple_flow_spec_ext to the list of supported filters, and
>> added a flow_type_ext value to the structure in an unused hole within the
>> ethtool_rx_flow_spec structure.
>
> There's a hole there on 64-bit architectures. Unfortunately, on i386
> and other architectures where u64 is not 64-bit-aligned, there isn't.
> We actually need to add compat handling for the commands that use it.
>
> Also, we don't want these flags to be ignored by older kernel versions
> and drivers - they should reject specs that they don't understand. So
> any extension flags need to be added to flow_type.
>
>> Due to the fact that the flow specifier structures are only 4 byte aligned
>> instead of 8 I had to break the user data field into 2 sections. In
>> addition I added the vlan ethertype field since this is what ixgbe was
>> using the user-data for currently and it allows for the fields to stay 4
>> byte aligned while occupying space at the end of the flow_spec.
>>
>> In order to guarantee byte ordering I also thought it best to keep all
>> fields in the flow_spec area a big endian value, as such I added vlan, vlan
>> ethertype, and data as big endian values.
>
> It's not important that byte order is consistent across architectures.
>
>> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
>> ---
>>
>> include/linux/ethtool.h | 20 ++++++++++++++++++++
>> 1 files changed, 20 insertions(+), 0 deletions(-)
>>
>> diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
>> index aac3e2e..3d1f8e0 100644
>> --- a/include/linux/ethtool.h
>> +++ b/include/linux/ethtool.h
>> @@ -378,10 +378,25 @@ struct ethtool_usrip4_spec {
>> };
>>
>> /**
>> + * struct ethtool_ntuple_spec_ext - flow spec extension for ntuple in nfc
>> + * @unused: space unused by extension
>> + * @vlan_etype: EtherType for vlan tagged packet to match
>> + * @vlan_tci: VLAN tag to match
>> + * @data: Driver-dependent data to match
>> + */
>> +struct ethtool_ntuple_spec_ext {
>> + __be32 unused[15];
>> + __be16 vlan_etype;
>> + __be16 vlan_tci;
>> + __be32 data[2];
>> +};
> [...]
>
> This is a really nasty way to reclaim space in the union.
>
> Let's name the union, shrink it and insert the extra fields that way:
>
> --- a/include/linux/ethtool.h
> +++ b/include/linux/ethtool.h
> @@ -377,27 +377,43 @@ struct ethtool_usrip4_spec {
> __u8 proto;
> };
>
> +union ethtool_flow_union {
> + struct ethtool_tcpip4_spec tcp_ip4_spec;
> + struct ethtool_tcpip4_spec udp_ip4_spec;
> + struct ethtool_tcpip4_spec sctp_ip4_spec;
> + struct ethtool_ah_espip4_spec ah_ip4_spec;
> + struct ethtool_ah_espip4_spec esp_ip4_spec;
> + struct ethtool_usrip4_spec usr_ip4_spec;
> + struct ethhdr ether_spec;
> + __u8 hdata[52];
> +};
> +
> +struct ethtool_flow_ext {
> + __be16 vlan_etype;
> + __be16 vlan_tci;
> + __be32 data[2];
> + __u32 reserved[2];
> +};
> +
Any chance of getting the reserved fields moved to the top of the
structure? My only concern is that we might end up with a flow spec
larger than 52 bytes at some point, and moving the reserved fields to
the front might give us a little more wiggle room for future
compatibility.
> /**
> * struct ethtool_rx_flow_spec - specification for RX flow filter
> * @flow_type: Type of match to perform, e.g. %TCP_V4_FLOW
> * @h_u: Flow fields to match (dependent on @flow_type)
> + * @h_ext: Additional fields to match
> * @m_u: Masks for flow field bits to be ignored
> + * @m_ext: Masks for additional field bits to be ignored.
> + * Note, all additional fields must be ignored unless @flow_type
> + * includes the %FLOW_EXT flag.
> * @ring_cookie: RX ring/queue index to deliver to, or %RX_CLS_FLOW_DISC
> * if packets should be discarded
> * @location: Index of filter in hardware table
> */
> struct ethtool_rx_flow_spec {
> __u32 flow_type;
> - union {
> - struct ethtool_tcpip4_spec tcp_ip4_spec;
> - struct ethtool_tcpip4_spec udp_ip4_spec;
> - struct ethtool_tcpip4_spec sctp_ip4_spec;
> - struct ethtool_ah_espip4_spec ah_ip4_spec;
> - struct ethtool_ah_espip4_spec esp_ip4_spec;
> - struct ethtool_usrip4_spec usr_ip4_spec;
> - struct ethhdr ether_spec;
> - __u8 hdata[72];
> - } h_u, m_u;
> + union ethtool_flow_union h_u;
> + struct ethtool_flow_ext h_ext;
> + union ethtool_flow_union m_u;
> + struct ethtool_flow_ext m_ext;
> __u64 ring_cookie;
> __u32 location;
> };
> @@ -954,6 +970,8 @@ struct ethtool_ops {
> #define IPV4_FLOW 0x10 /* hash only */
> #define IPV6_FLOW 0x11 /* hash only */
> #define ETHER_FLOW 0x12 /* spec only (ether_spec) */
> +/* Flag to enable additional fields in struct ethtool_rx_flow_spec */
> +#define FLOW_EXT 0x80000000
>
> /* L3-L4 network traffic flow hash options */
> #define RXH_L2DA (1 << 1)
> ---
>
> Ben.
This works for my purposes other than the one comment above. However
if you are fine with it I am good with it since I can't think of any
filters that we might need in the near future that would require more
than 52 bytes.
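As a sanity check on the layout, the sizes can be modelled in user space. This is a sketch with stand-in types: the `*_model` names are made up here, the union is reduced to its hdata[52] member, and the fixed-width types replace __be16/__be32/__u32.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for union ethtool_flow_union, reduced to its largest member. */
union flow_union_model {
	uint8_t hdata[52];
};

/* Stand-in for struct ethtool_flow_ext. */
struct flow_ext_model {
	uint16_t vlan_etype;
	uint16_t vlan_tci;
	uint32_t data[2];
	uint32_t reserved[2];
};

/* Stand-in for the reworked struct ethtool_rx_flow_spec. */
struct rx_flow_spec_model {
	uint32_t flow_type;
	union flow_union_model h_u;
	struct flow_ext_model h_ext;
	union flow_union_model m_u;
	struct flow_ext_model m_ext;
	uint64_t ring_cookie;
	uint32_t location;
};

/* Size of one slot in the old layout: the anonymous union with hdata[72]. */
size_t old_slot_size(void)
{
	return 72;
}

/* Size of one slot in the new layout: shrunk union plus extension. */
size_t new_slot_size(void)
{
	return sizeof(union flow_union_model) + sizeof(struct flow_ext_model);
}
```

In this model the 52-byte union plus the 20-byte extension add up to the old 72-byte slot, so m_u stays at offset 76 and the second slot still ends at offset 148. The offset of ring_cookie is left out on purpose, since it depends on how the architecture aligns 64-bit fields.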
Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
* Re: SO_REUSEPORT - can it be done in kernel?
From: Herbert Xu @ 2011-02-26 3:11 UTC (permalink / raw)
To: David Miller
Cc: rick.jones2, tgraf, therbert, wsommerfeld, daniel.baluta, netdev
In-Reply-To: <20110225.190723.39180243.davem@davemloft.net>
On Fri, Feb 25, 2011 at 07:07:23PM -0800, David Miller wrote:
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Date: Sat, 26 Feb 2011 10:48:48 +0800
>
> > I'm looking at redoing this and the bulk of the work is going to
> > be restructuring ip_append_data/ip_push_pending_frames so that it
> > doesn't store the states in sk/inet_sk.
>
> I suppose you're going to replace that stuff with an on-stack
> control structure that gets passed around by reference or
> similar?
Either that or have ip_append_data do ip_push_pending_frames
directly.
That function's signature is a mess already and I need to think
about this a bit more :)
Cheers,
--
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: SO_REUSEPORT - can it be done in kernel?
From: David Miller @ 2011-02-26 3:07 UTC (permalink / raw)
To: herbert; +Cc: rick.jones2, tgraf, therbert, wsommerfeld, daniel.baluta, netdev
In-Reply-To: <20110226024848.GA20993@gondor.apana.org.au>
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Sat, 26 Feb 2011 10:48:48 +0800
> I'm looking at redoing this and the bulk of the work is going to
> be restructuring ip_append_data/ip_push_pending_frames so that it
> doesn't store the states in sk/inet_sk.
I suppose you're going to replace that stuff with an on-stack
control structure that gets passed around by reference or
similar?
* Re: SO_REUSEPORT - can it be done in kernel?
From: Herbert Xu @ 2011-02-26 2:48 UTC (permalink / raw)
To: David Miller
Cc: rick.jones2, tgraf, therbert, wsommerfeld, daniel.baluta, netdev
In-Reply-To: <20110225.181244.104056532.davem@davemloft.net>
On Fri, Feb 25, 2011 at 06:12:44PM -0800, David Miller wrote:
>
> We take the lock unconditionally because we essentially have to after
> UDP takes on the socket buffer accounting facilities similar to TCP.
Well I just checked out the history tree (2.6.12) and it too had
the unconditional lock on the send path. So this predates the
system-wide buffer limit change.
I'm looking at redoing this and the bulk of the work is going to
be restructuring ip_append_data/ip_push_pending_frames so that it
doesn't store the states in sk/inet_sk.
Cheers,
--
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: SO_REUSEPORT - can it be done in kernel?
From: David Miller @ 2011-02-26 2:12 UTC (permalink / raw)
To: herbert; +Cc: rick.jones2, tgraf, therbert, wsommerfeld, daniel.baluta, netdev
In-Reply-To: <20110226005718.GA19889@gondor.apana.org.au>
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Sat, 26 Feb 2011 08:57:18 +0800
> It isn't all that hard since the easy way would be to only take
> the lock if we're already corked or about to cork.
We take the lock unconditionally because we essentially have to after
UDP takes on the socket buffer accounting facilities similar to TCP.
* [PATCH] pfkey: Use const where possible.
From: David Miller @ 2011-02-26 2:07 UTC (permalink / raw)
To: netdev
This actually pointed out a (seemingly known) bug where we mangle the
pfkey header in a potentially shared SKB, which is fixed here.
Signed-off-by: David S. Miller <davem@davemloft.net>
---
net/key/af_key.c | 201 +++++++++++++++++++++++++++++-------------------------
1 files changed, 107 insertions(+), 94 deletions(-)
diff --git a/net/key/af_key.c b/net/key/af_key.c
index 5637285..7fb5457 100644
--- a/net/key/af_key.c
+++ b/net/key/af_key.c
@@ -70,7 +70,7 @@ static inline struct pfkey_sock *pfkey_sk(struct sock *sk)
return (struct pfkey_sock *)sk;
}
-static int pfkey_can_dump(struct sock *sk)
+static int pfkey_can_dump(const struct sock *sk)
{
if (3 * atomic_read(&sk->sk_rmem_alloc) <= 2 * sk->sk_rcvbuf)
return 1;
@@ -303,12 +303,13 @@ static int pfkey_do_dump(struct pfkey_sock *pfk)
return rc;
}
-static inline void pfkey_hdr_dup(struct sadb_msg *new, struct sadb_msg *orig)
+static inline void pfkey_hdr_dup(struct sadb_msg *new,
+ const struct sadb_msg *orig)
{
*new = *orig;
}
-static int pfkey_error(struct sadb_msg *orig, int err, struct sock *sk)
+static int pfkey_error(const struct sadb_msg *orig, int err, struct sock *sk)
{
struct sk_buff *skb = alloc_skb(sizeof(struct sadb_msg) + 16, GFP_KERNEL);
struct sadb_msg *hdr;
@@ -369,13 +370,13 @@ static u8 sadb_ext_min_len[] = {
};
/* Verify sadb_address_{len,prefixlen} against sa_family. */
-static int verify_address_len(void *p)
+static int verify_address_len(const void *p)
{
- struct sadb_address *sp = p;
- struct sockaddr *addr = (struct sockaddr *)(sp + 1);
- struct sockaddr_in *sin;
+ const struct sadb_address *sp = p;
+ const struct sockaddr *addr = (const struct sockaddr *)(sp + 1);
+ const struct sockaddr_in *sin;
#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
- struct sockaddr_in6 *sin6;
+ const struct sockaddr_in6 *sin6;
#endif
int len;
@@ -411,16 +412,16 @@ static int verify_address_len(void *p)
return 0;
}
-static inline int pfkey_sec_ctx_len(struct sadb_x_sec_ctx *sec_ctx)
+static inline int pfkey_sec_ctx_len(const struct sadb_x_sec_ctx *sec_ctx)
{
return DIV_ROUND_UP(sizeof(struct sadb_x_sec_ctx) +
sec_ctx->sadb_x_ctx_len,
sizeof(uint64_t));
}
-static inline int verify_sec_ctx_len(void *p)
+static inline int verify_sec_ctx_len(const void *p)
{
- struct sadb_x_sec_ctx *sec_ctx = (struct sadb_x_sec_ctx *)p;
+ const struct sadb_x_sec_ctx *sec_ctx = p;
int len = sec_ctx->sadb_x_ctx_len;
if (len > PAGE_SIZE)
@@ -434,7 +435,7 @@ static inline int verify_sec_ctx_len(void *p)
return 0;
}
-static inline struct xfrm_user_sec_ctx *pfkey_sadb2xfrm_user_sec_ctx(struct sadb_x_sec_ctx *sec_ctx)
+static inline struct xfrm_user_sec_ctx *pfkey_sadb2xfrm_user_sec_ctx(const struct sadb_x_sec_ctx *sec_ctx)
{
struct xfrm_user_sec_ctx *uctx = NULL;
int ctx_size = sec_ctx->sadb_x_ctx_len;
@@ -455,16 +456,16 @@ static inline struct xfrm_user_sec_ctx *pfkey_sadb2xfrm_user_sec_ctx(struct sadb
return uctx;
}
-static int present_and_same_family(struct sadb_address *src,
- struct sadb_address *dst)
+static int present_and_same_family(const struct sadb_address *src,
+ const struct sadb_address *dst)
{
- struct sockaddr *s_addr, *d_addr;
+ const struct sockaddr *s_addr, *d_addr;
if (!src || !dst)
return 0;
- s_addr = (struct sockaddr *)(src + 1);
- d_addr = (struct sockaddr *)(dst + 1);
+ s_addr = (const struct sockaddr *)(src + 1);
+ d_addr = (const struct sockaddr *)(dst + 1);
if (s_addr->sa_family != d_addr->sa_family)
return 0;
if (s_addr->sa_family != AF_INET
@@ -477,15 +478,15 @@ static int present_and_same_family(struct sadb_address *src,
return 1;
}
-static int parse_exthdrs(struct sk_buff *skb, struct sadb_msg *hdr, void **ext_hdrs)
+static int parse_exthdrs(struct sk_buff *skb, const struct sadb_msg *hdr, void **ext_hdrs)
{
- char *p = (char *) hdr;
+ const char *p = (char *) hdr;
int len = skb->len;
len -= sizeof(*hdr);
p += sizeof(*hdr);
while (len > 0) {
- struct sadb_ext *ehdr = (struct sadb_ext *) p;
+ const struct sadb_ext *ehdr = (const struct sadb_ext *) p;
uint16_t ext_type;
int ext_len;
@@ -514,7 +515,7 @@ static int parse_exthdrs(struct sk_buff *skb, struct sadb_msg *hdr, void **ext_h
if (verify_sec_ctx_len(p))
return -EINVAL;
}
- ext_hdrs[ext_type-1] = p;
+ ext_hdrs[ext_type-1] = (void *) p;
}
p += ext_len;
len -= ext_len;
@@ -606,21 +607,21 @@ int pfkey_sockaddr_extract(const struct sockaddr *sa, xfrm_address_t *xaddr)
}
static
-int pfkey_sadb_addr2xfrm_addr(struct sadb_address *addr, xfrm_address_t *xaddr)
+int pfkey_sadb_addr2xfrm_addr(const struct sadb_address *addr, xfrm_address_t *xaddr)
{
return pfkey_sockaddr_extract((struct sockaddr *)(addr + 1),
xaddr);
}
-static struct xfrm_state *pfkey_xfrm_state_lookup(struct net *net, struct sadb_msg *hdr, void **ext_hdrs)
+static struct xfrm_state *pfkey_xfrm_state_lookup(struct net *net, const struct sadb_msg *hdr, void * const *ext_hdrs)
{
- struct sadb_sa *sa;
- struct sadb_address *addr;
+ const struct sadb_sa *sa;
+ const struct sadb_address *addr;
uint16_t proto;
unsigned short family;
xfrm_address_t *xaddr;
- sa = (struct sadb_sa *) ext_hdrs[SADB_EXT_SA-1];
+ sa = (const struct sadb_sa *) ext_hdrs[SADB_EXT_SA-1];
if (sa == NULL)
return NULL;
@@ -629,18 +630,18 @@ static struct xfrm_state *pfkey_xfrm_state_lookup(struct net *net, struct sadb_
return NULL;
/* sadb_address_len should be checked by caller */
- addr = (struct sadb_address *) ext_hdrs[SADB_EXT_ADDRESS_DST-1];
+ addr = (const struct sadb_address *) ext_hdrs[SADB_EXT_ADDRESS_DST-1];
if (addr == NULL)
return NULL;
- family = ((struct sockaddr *)(addr + 1))->sa_family;
+ family = ((const struct sockaddr *)(addr + 1))->sa_family;
switch (family) {
case AF_INET:
- xaddr = (xfrm_address_t *)&((struct sockaddr_in *)(addr + 1))->sin_addr;
+ xaddr = (xfrm_address_t *)&((const struct sockaddr_in *)(addr + 1))->sin_addr;
break;
#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
case AF_INET6:
- xaddr = (xfrm_address_t *)&((struct sockaddr_in6 *)(addr + 1))->sin6_addr;
+ xaddr = (xfrm_address_t *)&((const struct sockaddr_in6 *)(addr + 1))->sin6_addr;
break;
#endif
default:
@@ -691,8 +692,8 @@ static inline int pfkey_mode_to_xfrm(int mode)
}
static unsigned int pfkey_sockaddr_fill(const xfrm_address_t *xaddr, __be16 port,
- struct sockaddr *sa,
- unsigned short family)
+ struct sockaddr *sa,
+ unsigned short family)
{
switch (family) {
case AF_INET:
@@ -720,7 +721,7 @@ static unsigned int pfkey_sockaddr_fill(const xfrm_address_t *xaddr, __be16 port
return 0;
}
-static struct sk_buff *__pfkey_xfrm_state2msg(struct xfrm_state *x,
+static struct sk_buff *__pfkey_xfrm_state2msg(const struct xfrm_state *x,
int add_keys, int hsc)
{
struct sk_buff *skb;
@@ -1010,7 +1011,7 @@ static struct sk_buff *__pfkey_xfrm_state2msg(struct xfrm_state *x,
}
-static inline struct sk_buff *pfkey_xfrm_state2msg(struct xfrm_state *x)
+static inline struct sk_buff *pfkey_xfrm_state2msg(const struct xfrm_state *x)
{
struct sk_buff *skb;
@@ -1019,26 +1020,26 @@ static inline struct sk_buff *pfkey_xfrm_state2msg(struct xfrm_state *x)
return skb;
}
-static inline struct sk_buff *pfkey_xfrm_state2msg_expire(struct xfrm_state *x,
+static inline struct sk_buff *pfkey_xfrm_state2msg_expire(const struct xfrm_state *x,
int hsc)
{
return __pfkey_xfrm_state2msg(x, 0, hsc);
}
static struct xfrm_state * pfkey_msg2xfrm_state(struct net *net,
- struct sadb_msg *hdr,
- void **ext_hdrs)
+ const struct sadb_msg *hdr,
+ void * const *ext_hdrs)
{
struct xfrm_state *x;
- struct sadb_lifetime *lifetime;
- struct sadb_sa *sa;
- struct sadb_key *key;
- struct sadb_x_sec_ctx *sec_ctx;
+ const struct sadb_lifetime *lifetime;
+ const struct sadb_sa *sa;
+ const struct sadb_key *key;
+ const struct sadb_x_sec_ctx *sec_ctx;
uint16_t proto;
int err;
- sa = (struct sadb_sa *) ext_hdrs[SADB_EXT_SA-1];
+ sa = (const struct sadb_sa *) ext_hdrs[SADB_EXT_SA-1];
if (!sa ||
!present_and_same_family(ext_hdrs[SADB_EXT_ADDRESS_SRC-1],
ext_hdrs[SADB_EXT_ADDRESS_DST-1]))
@@ -1077,7 +1078,7 @@ static struct xfrm_state * pfkey_msg2xfrm_state(struct net *net,
sa->sadb_sa_encrypt > SADB_X_CALG_MAX) ||
sa->sadb_sa_encrypt > SADB_EALG_MAX)
return ERR_PTR(-EINVAL);
- key = (struct sadb_key*) ext_hdrs[SADB_EXT_KEY_AUTH-1];
+ key = (const struct sadb_key*) ext_hdrs[SADB_EXT_KEY_AUTH-1];
if (key != NULL &&
sa->sadb_sa_auth != SADB_X_AALG_NULL &&
((key->sadb_key_bits+7) / 8 == 0 ||
@@ -1104,14 +1105,14 @@ static struct xfrm_state * pfkey_msg2xfrm_state(struct net *net,
if (sa->sadb_sa_flags & SADB_SAFLAGS_NOPMTUDISC)
x->props.flags |= XFRM_STATE_NOPMTUDISC;
- lifetime = (struct sadb_lifetime*) ext_hdrs[SADB_EXT_LIFETIME_HARD-1];
+ lifetime = (const struct sadb_lifetime*) ext_hdrs[SADB_EXT_LIFETIME_HARD-1];
if (lifetime != NULL) {
x->lft.hard_packet_limit = _KEY2X(lifetime->sadb_lifetime_allocations);
x->lft.hard_byte_limit = _KEY2X(lifetime->sadb_lifetime_bytes);
x->lft.hard_add_expires_seconds = lifetime->sadb_lifetime_addtime;
x->lft.hard_use_expires_seconds = lifetime->sadb_lifetime_usetime;
}
- lifetime = (struct sadb_lifetime*) ext_hdrs[SADB_EXT_LIFETIME_SOFT-1];
+ lifetime = (const struct sadb_lifetime*) ext_hdrs[SADB_EXT_LIFETIME_SOFT-1];
if (lifetime != NULL) {
x->lft.soft_packet_limit = _KEY2X(lifetime->sadb_lifetime_allocations);
x->lft.soft_byte_limit = _KEY2X(lifetime->sadb_lifetime_bytes);
@@ -1119,7 +1120,7 @@ static struct xfrm_state * pfkey_msg2xfrm_state(struct net *net,
x->lft.soft_use_expires_seconds = lifetime->sadb_lifetime_usetime;
}
- sec_ctx = (struct sadb_x_sec_ctx *) ext_hdrs[SADB_X_EXT_SEC_CTX-1];
+ sec_ctx = (const struct sadb_x_sec_ctx *) ext_hdrs[SADB_X_EXT_SEC_CTX-1];
if (sec_ctx != NULL) {
struct xfrm_user_sec_ctx *uctx = pfkey_sadb2xfrm_user_sec_ctx(sec_ctx);
@@ -1133,7 +1134,7 @@ static struct xfrm_state * pfkey_msg2xfrm_state(struct net *net,
goto out;
}
- key = (struct sadb_key*) ext_hdrs[SADB_EXT_KEY_AUTH-1];
+ key = (const struct sadb_key*) ext_hdrs[SADB_EXT_KEY_AUTH-1];
if (sa->sadb_sa_auth) {
int keysize = 0;
struct xfrm_algo_desc *a = xfrm_aalg_get_byid(sa->sadb_sa_auth);
@@ -1202,7 +1203,7 @@ static struct xfrm_state * pfkey_msg2xfrm_state(struct net *net,
&x->id.daddr);
if (ext_hdrs[SADB_X_EXT_SA2-1]) {
- struct sadb_x_sa2 *sa2 = (void*)ext_hdrs[SADB_X_EXT_SA2-1];
+ const struct sadb_x_sa2 *sa2 = ext_hdrs[SADB_X_EXT_SA2-1];
int mode = pfkey_mode_to_xfrm(sa2->sadb_x_sa2_mode);
if (mode < 0) {
err = -EINVAL;
@@ -1213,7 +1214,7 @@ static struct xfrm_state * pfkey_msg2xfrm_state(struct net *net,
}
if (ext_hdrs[SADB_EXT_ADDRESS_PROXY-1]) {
- struct sadb_address *addr = ext_hdrs[SADB_EXT_ADDRESS_PROXY-1];
+ const struct sadb_address *addr = ext_hdrs[SADB_EXT_ADDRESS_PROXY-1];
/* Nobody uses this, but we try. */
x->sel.family = pfkey_sadb_addr2xfrm_addr(addr, &x->sel.saddr);
@@ -1224,7 +1225,7 @@ static struct xfrm_state * pfkey_msg2xfrm_state(struct net *net,
x->sel.family = x->props.family;
if (ext_hdrs[SADB_X_EXT_NAT_T_TYPE-1]) {
- struct sadb_x_nat_t_type* n_type;
+ const struct sadb_x_nat_t_type* n_type;
struct xfrm_encap_tmpl *natt;
x->encap = kmalloc(sizeof(*x->encap), GFP_KERNEL);
@@ -1236,12 +1237,12 @@ static struct xfrm_state * pfkey_msg2xfrm_state(struct net *net,
natt->encap_type = n_type->sadb_x_nat_t_type_type;
if (ext_hdrs[SADB_X_EXT_NAT_T_SPORT-1]) {
- struct sadb_x_nat_t_port* n_port =
+ const struct sadb_x_nat_t_port *n_port =
ext_hdrs[SADB_X_EXT_NAT_T_SPORT-1];
natt->encap_sport = n_port->sadb_x_nat_t_port_port;
}
if (ext_hdrs[SADB_X_EXT_NAT_T_DPORT-1]) {
- struct sadb_x_nat_t_port* n_port =
+ const struct sadb_x_nat_t_port *n_port =
ext_hdrs[SADB_X_EXT_NAT_T_DPORT-1];
natt->encap_dport = n_port->sadb_x_nat_t_port_port;
}
@@ -1261,12 +1262,12 @@ out:
return ERR_PTR(err);
}
-static int pfkey_reserved(struct sock *sk, struct sk_buff *skb, struct sadb_msg *hdr, void **ext_hdrs)
+static int pfkey_reserved(struct sock *sk, struct sk_buff *skb, const struct sadb_msg *hdr, void * const *ext_hdrs)
{
return -EOPNOTSUPP;
}
-static int pfkey_getspi(struct sock *sk, struct sk_buff *skb, struct sadb_msg *hdr, void **ext_hdrs)
+static int pfkey_getspi(struct sock *sk, struct sk_buff *skb, const struct sadb_msg *hdr, void * const *ext_hdrs)
{
struct net *net = sock_net(sk);
struct sk_buff *resp_skb;
@@ -1365,7 +1366,7 @@ static int pfkey_getspi(struct sock *sk, struct sk_buff *skb, struct sadb_msg *h
return 0;
}
-static int pfkey_acquire(struct sock *sk, struct sk_buff *skb, struct sadb_msg *hdr, void **ext_hdrs)
+static int pfkey_acquire(struct sock *sk, struct sk_buff *skb, const struct sadb_msg *hdr, void * const *ext_hdrs)
{
struct net *net = sock_net(sk);
struct xfrm_state *x;
@@ -1453,7 +1454,7 @@ static int key_notify_sa(struct xfrm_state *x, const struct km_event *c)
return 0;
}
-static int pfkey_add(struct sock *sk, struct sk_buff *skb, struct sadb_msg *hdr, void **ext_hdrs)
+static int pfkey_add(struct sock *sk, struct sk_buff *skb, const struct sadb_msg *hdr, void * const *ext_hdrs)
{
struct net *net = sock_net(sk);
struct xfrm_state *x;
@@ -1492,7 +1493,7 @@ out:
return err;
}
-static int pfkey_delete(struct sock *sk, struct sk_buff *skb, struct sadb_msg *hdr, void **ext_hdrs)
+static int pfkey_delete(struct sock *sk, struct sk_buff *skb, const struct sadb_msg *hdr, void * const *ext_hdrs)
{
struct net *net = sock_net(sk);
struct xfrm_state *x;
@@ -1534,7 +1535,7 @@ out:
return err;
}
-static int pfkey_get(struct sock *sk, struct sk_buff *skb, struct sadb_msg *hdr, void **ext_hdrs)
+static int pfkey_get(struct sock *sk, struct sk_buff *skb, const struct sadb_msg *hdr, void * const *ext_hdrs)
{
struct net *net = sock_net(sk);
__u8 proto;
@@ -1570,7 +1571,7 @@ static int pfkey_get(struct sock *sk, struct sk_buff *skb, struct sadb_msg *hdr,
return 0;
}
-static struct sk_buff *compose_sadb_supported(struct sadb_msg *orig,
+static struct sk_buff *compose_sadb_supported(const struct sadb_msg *orig,
gfp_t allocation)
{
struct sk_buff *skb;
@@ -1642,7 +1643,7 @@ out_put_algs:
return skb;
}
-static int pfkey_register(struct sock *sk, struct sk_buff *skb, struct sadb_msg *hdr, void **ext_hdrs)
+static int pfkey_register(struct sock *sk, struct sk_buff *skb, const struct sadb_msg *hdr, void * const *ext_hdrs)
{
struct pfkey_sock *pfk = pfkey_sk(sk);
struct sk_buff *supp_skb;
@@ -1671,7 +1672,7 @@ static int pfkey_register(struct sock *sk, struct sk_buff *skb, struct sadb_msg
return 0;
}
-static int unicast_flush_resp(struct sock *sk, struct sadb_msg *ihdr)
+static int unicast_flush_resp(struct sock *sk, const struct sadb_msg *ihdr)
{
struct sk_buff *skb;
struct sadb_msg *hdr;
@@ -1710,7 +1711,7 @@ static int key_notify_sa_flush(const struct km_event *c)
return 0;
}
-static int pfkey_flush(struct sock *sk, struct sk_buff *skb, struct sadb_msg *hdr, void **ext_hdrs)
+static int pfkey_flush(struct sock *sk, struct sk_buff *skb, const struct sadb_msg *hdr, void * const *ext_hdrs)
{
struct net *net = sock_net(sk);
unsigned proto;
@@ -1784,7 +1785,7 @@ static void pfkey_dump_sa_done(struct pfkey_sock *pfk)
xfrm_state_walk_done(&pfk->dump.u.state);
}
-static int pfkey_dump(struct sock *sk, struct sk_buff *skb, struct sadb_msg *hdr, void **ext_hdrs)
+static int pfkey_dump(struct sock *sk, struct sk_buff *skb, const struct sadb_msg *hdr, void * const *ext_hdrs)
{
u8 proto;
struct pfkey_sock *pfk = pfkey_sk(sk);
@@ -1805,19 +1806,29 @@ static int pfkey_dump(struct sock *sk, struct sk_buff *skb, struct sadb_msg *hdr
return pfkey_do_dump(pfk);
}
-static int pfkey_promisc(struct sock *sk, struct sk_buff *skb, struct sadb_msg *hdr, void **ext_hdrs)
+static int pfkey_promisc(struct sock *sk, struct sk_buff *skb, const struct sadb_msg *hdr, void * const *ext_hdrs)
{
struct pfkey_sock *pfk = pfkey_sk(sk);
int satype = hdr->sadb_msg_satype;
+ bool reset_errno = false;
if (hdr->sadb_msg_len == (sizeof(*hdr) / sizeof(uint64_t))) {
- /* XXX we mangle packet... */
- hdr->sadb_msg_errno = 0;
+ reset_errno = true;
if (satype != 0 && satype != 1)
return -EINVAL;
pfk->promisc = satype;
}
- pfkey_broadcast(skb_clone(skb, GFP_KERNEL), GFP_KERNEL, BROADCAST_ALL, NULL, sock_net(sk));
+ if (reset_errno && skb_cloned(skb))
+ skb = skb_copy(skb, GFP_KERNEL);
+ else
+ skb = skb_clone(skb, GFP_KERNEL);
+
+ if (reset_errno && skb) {
+ struct sadb_msg *new_hdr = (struct sadb_msg *) skb->data;
+ new_hdr->sadb_msg_errno = 0;
+ }
+
+ pfkey_broadcast(skb, GFP_KERNEL, BROADCAST_ALL, NULL, sock_net(sk));
return 0;
}
@@ -1921,7 +1932,7 @@ parse_ipsecrequests(struct xfrm_policy *xp, struct sadb_x_policy *pol)
return 0;
}
-static inline int pfkey_xfrm_policy2sec_ctx_size(struct xfrm_policy *xp)
+static inline int pfkey_xfrm_policy2sec_ctx_size(const struct xfrm_policy *xp)
{
struct xfrm_sec_ctx *xfrm_ctx = xp->security;
@@ -1933,9 +1944,9 @@ static inline int pfkey_xfrm_policy2sec_ctx_size(struct xfrm_policy *xp)
return 0;
}
-static int pfkey_xfrm_policy2msg_size(struct xfrm_policy *xp)
+static int pfkey_xfrm_policy2msg_size(const struct xfrm_policy *xp)
{
- struct xfrm_tmpl *t;
+ const struct xfrm_tmpl *t;
int sockaddr_size = pfkey_sockaddr_size(xp->family);
int socklen = 0;
int i;
@@ -1955,7 +1966,7 @@ static int pfkey_xfrm_policy2msg_size(struct xfrm_policy *xp)
pfkey_xfrm_policy2sec_ctx_size(xp);
}
-static struct sk_buff * pfkey_xfrm_policy2msg_prep(struct xfrm_policy *xp)
+static struct sk_buff * pfkey_xfrm_policy2msg_prep(const struct xfrm_policy *xp)
{
struct sk_buff *skb;
int size;
@@ -1969,7 +1980,7 @@ static struct sk_buff * pfkey_xfrm_policy2msg_prep(struct xfrm_policy *xp)
return skb;
}
-static int pfkey_xfrm_policy2msg(struct sk_buff *skb, struct xfrm_policy *xp, int dir)
+static int pfkey_xfrm_policy2msg(struct sk_buff *skb, const struct xfrm_policy *xp, int dir)
{
struct sadb_msg *hdr;
struct sadb_address *addr;
@@ -2065,8 +2076,8 @@ static int pfkey_xfrm_policy2msg(struct sk_buff *skb, struct xfrm_policy *xp, in
pol->sadb_x_policy_priority = xp->priority;
for (i=0; i<xp->xfrm_nr; i++) {
+ const struct xfrm_tmpl *t = xp->xfrm_vec + i;
struct sadb_x_ipsecrequest *rq;
- struct xfrm_tmpl *t = xp->xfrm_vec + i;
int req_size;
int mode;
@@ -2152,7 +2163,7 @@ static int key_notify_policy(struct xfrm_policy *xp, int dir, const struct km_ev
}
-static int pfkey_spdadd(struct sock *sk, struct sk_buff *skb, struct sadb_msg *hdr, void **ext_hdrs)
+static int pfkey_spdadd(struct sock *sk, struct sk_buff *skb, const struct sadb_msg *hdr, void * const *ext_hdrs)
{
struct net *net = sock_net(sk);
int err = 0;
@@ -2273,7 +2284,7 @@ out:
return err;
}
-static int pfkey_spddelete(struct sock *sk, struct sk_buff *skb, struct sadb_msg *hdr, void **ext_hdrs)
+static int pfkey_spddelete(struct sock *sk, struct sk_buff *skb, const struct sadb_msg *hdr, void * const *ext_hdrs)
{
struct net *net = sock_net(sk);
int err;
@@ -2350,7 +2361,7 @@ out:
return err;
}
-static int key_pol_get_resp(struct sock *sk, struct xfrm_policy *xp, struct sadb_msg *hdr, int dir)
+static int key_pol_get_resp(struct sock *sk, struct xfrm_policy *xp, const struct sadb_msg *hdr, int dir)
{
int err;
struct sk_buff *out_skb;
@@ -2458,7 +2469,7 @@ static int ipsecrequests_to_migrate(struct sadb_x_ipsecrequest *rq1, int len,
}
static int pfkey_migrate(struct sock *sk, struct sk_buff *skb,
- struct sadb_msg *hdr, void **ext_hdrs)
+ const struct sadb_msg *hdr, void * const *ext_hdrs)
{
int i, len, ret, err = -EINVAL;
u8 dir;
@@ -2556,7 +2567,7 @@ static int pfkey_migrate(struct sock *sk, struct sk_buff *skb,
#endif
-static int pfkey_spdget(struct sock *sk, struct sk_buff *skb, struct sadb_msg *hdr, void **ext_hdrs)
+static int pfkey_spdget(struct sock *sk, struct sk_buff *skb, const struct sadb_msg *hdr, void * const *ext_hdrs)
{
struct net *net = sock_net(sk);
unsigned int dir;
@@ -2644,7 +2655,7 @@ static void pfkey_dump_sp_done(struct pfkey_sock *pfk)
xfrm_policy_walk_done(&pfk->dump.u.policy);
}
-static int pfkey_spddump(struct sock *sk, struct sk_buff *skb, struct sadb_msg *hdr, void **ext_hdrs)
+static int pfkey_spddump(struct sock *sk, struct sk_buff *skb, const struct sadb_msg *hdr, void * const *ext_hdrs)
{
struct pfkey_sock *pfk = pfkey_sk(sk);
@@ -2680,7 +2691,7 @@ static int key_notify_policy_flush(const struct km_event *c)
}
-static int pfkey_spdflush(struct sock *sk, struct sk_buff *skb, struct sadb_msg *hdr, void **ext_hdrs)
+static int pfkey_spdflush(struct sock *sk, struct sk_buff *skb, const struct sadb_msg *hdr, void * const *ext_hdrs)
{
struct net *net = sock_net(sk);
struct km_event c;
@@ -2709,7 +2720,7 @@ static int pfkey_spdflush(struct sock *sk, struct sk_buff *skb, struct sadb_msg
}
typedef int (*pfkey_handler)(struct sock *sk, struct sk_buff *skb,
- struct sadb_msg *hdr, void **ext_hdrs);
+ const struct sadb_msg *hdr, void * const *ext_hdrs);
static pfkey_handler pfkey_funcs[SADB_MAX + 1] = {
[SADB_RESERVED] = pfkey_reserved,
[SADB_GETSPI] = pfkey_getspi,
@@ -2736,7 +2747,7 @@ static pfkey_handler pfkey_funcs[SADB_MAX + 1] = {
[SADB_X_MIGRATE] = pfkey_migrate,
};
-static int pfkey_process(struct sock *sk, struct sk_buff *skb, struct sadb_msg *hdr)
+static int pfkey_process(struct sock *sk, struct sk_buff *skb, const struct sadb_msg *hdr)
{
void *ext_hdrs[SADB_EXT_MAX];
int err;
@@ -2781,7 +2792,8 @@ static struct sadb_msg *pfkey_get_base_msg(struct sk_buff *skb, int *errp)
return hdr;
}
-static inline int aalg_tmpl_set(struct xfrm_tmpl *t, struct xfrm_algo_desc *d)
+static inline int aalg_tmpl_set(const struct xfrm_tmpl *t,
+ const struct xfrm_algo_desc *d)
{
unsigned int id = d->desc.sadb_alg_id;
@@ -2791,7 +2803,8 @@ static inline int aalg_tmpl_set(struct xfrm_tmpl *t, struct xfrm_algo_desc *d)
return (t->aalgos >> id) & 1;
}
-static inline int ealg_tmpl_set(struct xfrm_tmpl *t, struct xfrm_algo_desc *d)
+static inline int ealg_tmpl_set(const struct xfrm_tmpl *t,
+ const struct xfrm_algo_desc *d)
{
unsigned int id = d->desc.sadb_alg_id;
@@ -2801,12 +2814,12 @@ static inline int ealg_tmpl_set(struct xfrm_tmpl *t, struct xfrm_algo_desc *d)
return (t->ealgos >> id) & 1;
}
-static int count_ah_combs(struct xfrm_tmpl *t)
+static int count_ah_combs(const struct xfrm_tmpl *t)
{
int i, sz = 0;
for (i = 0; ; i++) {
- struct xfrm_algo_desc *aalg = xfrm_aalg_get_byidx(i);
+ const struct xfrm_algo_desc *aalg = xfrm_aalg_get_byidx(i);
if (!aalg)
break;
if (aalg_tmpl_set(t, aalg) && aalg->available)
@@ -2815,12 +2828,12 @@ static int count_ah_combs(struct xfrm_tmpl *t)
return sz + sizeof(struct sadb_prop);
}
-static int count_esp_combs(struct xfrm_tmpl *t)
+static int count_esp_combs(const struct xfrm_tmpl *t)
{
int i, k, sz = 0;
for (i = 0; ; i++) {
- struct xfrm_algo_desc *ealg = xfrm_ealg_get_byidx(i);
+ const struct xfrm_algo_desc *ealg = xfrm_ealg_get_byidx(i);
if (!ealg)
break;
@@ -2828,7 +2841,7 @@ static int count_esp_combs(struct xfrm_tmpl *t)
continue;
for (k = 1; ; k++) {
- struct xfrm_algo_desc *aalg = xfrm_aalg_get_byidx(k);
+ const struct xfrm_algo_desc *aalg = xfrm_aalg_get_byidx(k);
if (!aalg)
break;
@@ -2839,7 +2852,7 @@ static int count_esp_combs(struct xfrm_tmpl *t)
return sz + sizeof(struct sadb_prop);
}
-static void dump_ah_combs(struct sk_buff *skb, struct xfrm_tmpl *t)
+static void dump_ah_combs(struct sk_buff *skb, const struct xfrm_tmpl *t)
{
struct sadb_prop *p;
int i;
@@ -2851,7 +2864,7 @@ static void dump_ah_combs(struct sk_buff *skb, struct xfrm_tmpl *t)
memset(p->sadb_prop_reserved, 0, sizeof(p->sadb_prop_reserved));
for (i = 0; ; i++) {
- struct xfrm_algo_desc *aalg = xfrm_aalg_get_byidx(i);
+ const struct xfrm_algo_desc *aalg = xfrm_aalg_get_byidx(i);
if (!aalg)
break;
@@ -2871,7 +2884,7 @@ static void dump_ah_combs(struct sk_buff *skb, struct xfrm_tmpl *t)
}
}
-static void dump_esp_combs(struct sk_buff *skb, struct xfrm_tmpl *t)
+static void dump_esp_combs(struct sk_buff *skb, const struct xfrm_tmpl *t)
{
struct sadb_prop *p;
int i, k;
@@ -2883,7 +2896,7 @@ static void dump_esp_combs(struct sk_buff *skb, struct xfrm_tmpl *t)
memset(p->sadb_prop_reserved, 0, sizeof(p->sadb_prop_reserved));
for (i=0; ; i++) {
- struct xfrm_algo_desc *ealg = xfrm_ealg_get_byidx(i);
+ const struct xfrm_algo_desc *ealg = xfrm_ealg_get_byidx(i);
if (!ealg)
break;
@@ -2892,7 +2905,7 @@ static void dump_esp_combs(struct sk_buff *skb, struct xfrm_tmpl *t)
for (k = 1; ; k++) {
struct sadb_comb *c;
- struct xfrm_algo_desc *aalg = xfrm_aalg_get_byidx(k);
+ const struct xfrm_algo_desc *aalg = xfrm_aalg_get_byidx(k);
if (!aalg)
break;
if (!(aalg_tmpl_set(t, aalg) && aalg->available))
--
1.7.4.1
^ permalink raw reply related
* [PATCH] Bluetooth: Fix BT_L2CAP and BT_SCO in Kconfig
From: Gustavo F. Padovan @ 2011-02-26 1:41 UTC (permalink / raw)
To: davem-fT/PcQaiUtIeIZ0/mPfg9Q
Cc: linville-2XuSBdqkA4R54TAoqtyWWQ,
linux-bluetooth-u79uwXL29TY76Z2rM5mHXA,
netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20110226013639.GA2166@joana>
If we want a "bool" option built into something "tristate", it can't
"depend on" the tristate config option.
Report by DaveM:
I give it 'y' just to make it happen, for both, and afterwards no
matter how many times I rerun "make oldconfig" I keep seeing things
like this in my build:
scripts/kconfig/conf --silentoldconfig Kconfig
include/config/auto.conf:986:warning: symbol value 'm' invalid for BT_SCO
include/config/auto.conf:3156:warning: symbol value 'm' invalid for BT_L2CAP
Reported-by: David S. Miller <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>
Signed-off-by: Gustavo F. Padovan <padovan-Y3ZbgMPKUGA34EUeqzHoZw@public.gmane.org>
---
net/bluetooth/Kconfig | 6 ++++--
1 files changed, 4 insertions(+), 2 deletions(-)
diff --git a/net/bluetooth/Kconfig b/net/bluetooth/Kconfig
index c6f9c2f..6ae5ec5 100644
--- a/net/bluetooth/Kconfig
+++ b/net/bluetooth/Kconfig
@@ -31,9 +31,10 @@ menuconfig BT
to Bluetooth kernel modules are provided in the BlueZ packages. For
more information, see <http://www.bluez.org/>.
+if BT != n
+
config BT_L2CAP
bool "L2CAP protocol support"
- depends on BT
select CRC16
help
L2CAP (Logical Link Control and Adaptation Protocol) provides
@@ -42,11 +43,12 @@ config BT_L2CAP
config BT_SCO
bool "SCO links support"
- depends on BT
help
SCO link provides voice transport over Bluetooth. SCO support is
required for voice applications like Headset and Audio.
+endif
+
source "net/bluetooth/rfcomm/Kconfig"
source "net/bluetooth/bnep/Kconfig"
--
1.7.4.1
^ permalink raw reply related
* Re: pull request: wireless-next-2.6 2011-02-22
From: Gustavo F. Padovan @ 2011-02-26 1:36 UTC (permalink / raw)
To: David Miller, linville-2XuSBdqkA4R54TAoqtyWWQ,
linux-wireless-u79uwXL29TY76Z2rM5mHXA,
linux-bluetooth-u79uwXL29TY76Z2rM5mHXA,
netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20110225193618.GB2107@joana>
* Gustavo F. Padovan <padovan-Y3ZbgMPKUGA34EUeqzHoZw@public.gmane.org> [2011-02-25 16:36:18 -0300]:
> Hi David,
>
> * David Miller <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org> [2011-02-25 11:15:00 -0800]:
>
> > From: David Miller <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>
> > Date: Thu, 24 Feb 2011 22:43:44 -0800 (PST)
> >
> > > From: "John W. Linville" <linville-2XuSBdqkA4R54TAoqtyWWQ@public.gmane.org>
> > > Date: Tue, 22 Feb 2011 16:52:30 -0500
> > >
> > >> Here is the latest batch of wireless bits intended for 2.6.39. It seems
> > >> I neglected to send a pull request last week, so this one is a bit big
> > >> -- I apologize!
> > >>
> > >> This includes a rather large batch of bluetooth bits by way of Gustavo.
> > >> It looks like a variety of bits, including some code refactoring, some
> > >> protocol support enhancements, some bugfixes, etc. -- nothing too
> > >> unusual.
> > >>
> > >> Other items of interest include a new driver from Realtek, some ssb
> > >> support enhancements, and the usual sort of updates for mac80211 and a
> > >> variety of drivers. Also included is a wireless-2.6 pull to resolve
> > >> some build breakage.
> > >>
> > >> Please let me know if there are problems!
> > >
> > > Pulled, thanks a lot John.
> >
> > John a few things:
> >
> > 1) I had to add some vmalloc.h includes to fix the build on sparc64,
> > see commit b08cd667c4b6641c4d16a3f87f4550f81a6d69ac in net-next-2.6
> >
> > 2) Something is screwey with the bluetooth config options now.
> >
> > I have an allmodconfig tree, and when I run "make oldconfig" after
> > this pull, BT_L2CAP and BT_SCO both prompt me, claiming that they
> > can only be built statically.
> >
> > I give it 'y' just to make it happen, for both, and afterwards no
> > matter how many times I rerun "make oldconfig" I keep seeing things
> > like this in my build:
> >
> > scripts/kconfig/conf --silentoldconfig Kconfig
> > include/config/auto.conf:986:warning: symbol value 'm' invalid for BT_SCO
> > include/config/auto.conf:3156:warning: symbol value 'm' invalid for BT_L2CAP
> >
> > First, what the heck is going on here? Second, why the heck can't these
> > non-trivial pieces of code be built modular any more?
>
> We now have L2CAP and SCO built-in in the main bluetooth.ko module.
>
> >
> > You can't make something "bool", have it depend on something that
> > might be modular, and then build it into what could in fact be a
> > module. That's exactly what the bluetooth stuff seems to be doing
> > now.
>
> Seems I did the Kconfig change wrong, I'll fix it ASAP and send it to you
> guys.
I figured out the problem. When I first wrote this I based the work on other
Kconfig files in net/ (it was my very first time making this kind of change in a
Kconfig). For example, net/decnet/ and net/ax25/ do exactly the same as the
Bluetooth Kconfig: a "bool" depending on a "tristate", with both built together.
But taking another look after your report, there are some places where this is
done a bit differently; net/ipv6 and net/mac80211 are examples. So I changed
to that approach, which removes the direct dependency from BT_L2CAP and BT_SCO.
The patch follows this e-mail.
This suggests that other subsystems may be doing it wrong too, and we will have
to fix them.
--
Gustavo F. Padovan
http://profusion.mobi
^ permalink raw reply
* Re: TX VLAN acceleration on bridges broken in 2.6.37?
From: Jeff Kirsher @ 2011-02-26 1:22 UTC (permalink / raw)
To: Jesse Gross, Allan, Bruce W
Cc: Jan Niehusmann, linux-kernel@vger.kernel.org,
netdev@vger.kernel.org, Tantilov, Emil S
In-Reply-To: <AANLkTikVvjsaG94-gtauvTLjH_RW5fmva8+N7Lk-ryQ0@mail.gmail.com>
[-- Attachment #1: Type: text/plain, Size: 1610 bytes --]
On Fri, 2011-02-25 at 17:16 -0800, Jesse Gross wrote:
> On Fri, Feb 25, 2011 at 4:19 PM, Jan Niehusmann <jan@gondor.com> wrote:
> > On Fri, Feb 25, 2011 at 02:53:21PM -0800, Jesse Gross wrote:
> >> is specific to the e1000e driver. I know that some other Intel NICs
> >> require vlan stripping on receive to be enabled for vlan insertion on
> >> transmit to work. Since this driver has not been converted over to
> >> use the new vlan model yet, it only enables these things if a vlan is
> >> directly configured on it. To confirm this can you try a few things:
> >
> > My observations confirm your theory:
>
> OK, thanks for confirming. The right solution is to convert the driver
> over to the new vlan model. I don't know how soon I might get to
> this, maybe it's something that the Intel guys can take a look at?
I have made sure that Bruce is aware of the issue.
We will see what we can do to get some patches created and under
testing.
>
> > - indeed, -e is necessary to show the vlan tags. So my prior observation
> > regarding tag visibility in tcpdump was wrong. The packets still
> > have a vlan tag in the non-working case.
> >
> > (What actually is affected by the txvlan flag is the ability to filter
> > for vlan tags with tcpdump. so 'tcpdump -e -i eth0' shows the packets,
> > 'tcpdump -e -i eth0 vlan' only shows them with txvlan off. However,
> > filtering for the vlan tag also doesn't work with the vlan interface
> > on eth0.1, while the tagging actually works, as verified above.)
>
> Good to know, though that's a separate issue.
[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 490 bytes --]
^ permalink raw reply
* Re: TX VLAN acceleration on bridges broken in 2.6.37?
From: Jesse Gross @ 2011-02-26 1:16 UTC (permalink / raw)
To: Jan Niehusmann; +Cc: linux-kernel, netdev, Tantilov, Emil S, Kirsher, Jeffrey T
In-Reply-To: <20110226001908.GA10777@x61s.reliablesolutions.de>
On Fri, Feb 25, 2011 at 4:19 PM, Jan Niehusmann <jan@gondor.com> wrote:
> On Fri, Feb 25, 2011 at 02:53:21PM -0800, Jesse Gross wrote:
>> is specific to the e1000e driver. I know that some other Intel NICs
>> require vlan stripping on receive to be enabled for vlan insertion on
>> transmit to work. Since this driver has not been converted over to
>> use the new vlan model yet, it only enables these things if a vlan is
>> directly configured on it. To confirm this can you try a few things:
>
> My observations confirm your theory:
OK, thanks for confirming. The right solution is to convert the driver
over to the new vlan model. I don't know how soon I might get to
this, maybe it's something that the Intel guys can take a look at?
> - indeed, -e is necessary to show the vlan tags. So my prior observation
> regarding tag visibility in tcpdump was wrong. The packets still
> have a vlan tag in the non-working case.
>
> (What actually is affected by the txvlan flag is the ability to filter
> for vlan tags with tcpdump. so 'tcpdump -e -i eth0' shows the packets,
> 'tcpdump -e -i eth0 vlan' only shows them with txvlan off. However,
> filtering for the vlan tag also doesn't work with the vlan interface
> on eth0.1, while the tagging actually works, as verified above.)
Good to know, though that's a separate issue.
^ permalink raw reply
* Re: 2.6.37 regression: adding main interface to a bridge breaks vlan interface RX
From: Jesse Gross @ 2011-02-26 1:08 UTC (permalink / raw)
To: chriss; +Cc: netdev
In-Reply-To: <loom.20110226T011347-702@post.gmane.org>
On Fri, Feb 25, 2011 at 4:16 PM, chriss <mail_to_chriss@gmx.net> wrote:
> Jesse Gross <jesse <at> nicira.com> writes:
>
>>
>> What driver is in use with the NIC you are seeing this on?
>>
>
> Hi there
>
> the device in question is (as lspci told)
> Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8110SC/8169SC Gigabit
> Ethernet (rev 10)
>
> handled by kernel module r8169.
I'm guessing that you're hitting the special case in this code in
r8169.c:rtl8169_vlan_rx_register():
/*
* Do not disable RxVlan on 8110SCd.
*/
if (tp->vlgrp || (tp->mac_version == RTL_GIGA_MAC_VER_05))
tp->cp_cmd |= RxVlan;
else
tp->cp_cmd &= ~RxVlan;
Since before you were getting the vlans directly off the device there
was a vlan group configured. However, now that the packets are going
through the bridge, the group is not being configured on the device
and the tag gets dropped. Assuming that this is the case, the
solution is to convert the driver to use the new vlan model, which
does not require knowledge of the vlan group.
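For illustration, the condition quoted above can be modelled in a few lines
of stand-alone C. This is only a sketch, not driver code: the RX_VLAN bit
value, the MAC_VER_* enumerators and the function name update_rx_vlan() are
placeholders invented for the example, and the real register handling in
r8169.c is not shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder values for illustration only; not the real r8169 constants. */
#define RX_VLAN (1u << 6)

enum mac_version { MAC_VER_04 = 4, MAC_VER_05 = 5 };

/*
 * Mirrors the quoted condition: hardware tag stripping stays enabled when
 * a vlan group is registered, and unconditionally on the special-cased
 * chip version; otherwise it is switched off.
 */
static uint32_t update_rx_vlan(uint32_t cp_cmd, bool have_vlgrp,
                               enum mac_version ver)
{
	if (have_vlgrp || ver == MAC_VER_05)
		cp_cmd |= RX_VLAN;
	else
		cp_cmd &= ~RX_VLAN;
	return cp_cmd;
}
```

In this model, update_rx_vlan(0, false, MAC_VER_05) still returns a value
with RX_VLAN set. If the guess above is right, that is the bridge case:
stripping is on, but no vlan group is registered to receive the stripped tag.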
Can you confirm this by running tcpdump -eni br0? I would expect that
you see the correct packets but without vlan tags.
^ permalink raw reply
* Re: [net-next-2.6 PATCH 02/10] ethtool: add ntuple flow specifier to network flow classifier
From: Ben Hutchings @ 2011-02-26 1:00 UTC (permalink / raw)
To: Alexander Duyck; +Cc: davem, jeffrey.t.kirsher, netdev
In-Reply-To: <20110225233249.7920.70334.stgit@gitlad.jf.intel.com>
On Fri, 2011-02-25 at 15:32 -0800, Alexander Duyck wrote:
> This change is meant to add an ntuple define type to the rx network flow
> classification specifiers. The idea is to allow ntuple to be displayed and
> possibly configured via the network flow classification interface. To do
> this I added a ntuple_flow_spec_ext to the list of supported filters, and
> added a flow_type_ext value to the structure in an unused hole within the
> ethtool_rx_flow_spec structure.
There's a hole there on 64-bit architectures. Unfortunately, on i386
and other architectures where u64 is not 64-bit-aligned, there isn't.
We actually need to add compat handling for the commands that use it.
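The alignment point can be checked with a small stand-alone program. The
sketch below is a simplified model, not the real header: the union is reduced
to a 72-byte array plus a 32-bit member to force 4-byte alignment, fixed-width
types stand in for __u32/__u64, and the *_model and probe names are made up
for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the current union: 72 bytes, 4-byte aligned. */
union flow_union_model {
	uint32_t align4;	/* forces the 4-byte alignment of the real members */
	uint8_t  hdata[72];
};

/* Same field order as struct ethtool_rx_flow_spec before any change. */
struct rx_flow_spec_model {
	uint32_t flow_type;
	union flow_union_model h_u, m_u;
	uint64_t ring_cookie;
	uint32_t location;
};

/* Probe for how strictly this ABI aligns a 64-bit struct member. */
struct u64_probe {
	char     c;
	uint64_t v;
};

/* Bytes of padding the compiler inserted in front of ring_cookie. */
static size_t hole_before_ring_cookie(void)
{
	size_t end_of_m_u = offsetof(struct rx_flow_spec_model, m_u) +
			    sizeof(union flow_union_model);

	return offsetof(struct rx_flow_spec_model, ring_cookie) - end_of_m_u;
}

/* Alignment actually applied to a 64-bit member inside a struct. */
static size_t u64_member_alignment(void)
{
	return offsetof(struct u64_probe, v);
}
```

The two unions end at byte 148. Where 64-bit members are 8-byte aligned,
hole_before_ring_cookie() should report 4 bytes of padding; where they are
only 4-byte aligned, it should report 0, so there is nothing to reuse.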
Also, we don't want these flags to be ignored by older kernel versions
and drivers - they should reject specs that they don't understand. So
any extension flags need to be added to flow_type.
> Due to the fact that the flow specifier structures are only 4 byte aligned
> instead of 8 I had to break the user data field into 2 sections. In
> addition I added the vlan ethertype field since this is what ixgbe was
> using the user-data for currently and it allows for the fields to stay 4
> byte aligned while occupying space at the end of the flow_spec.
>
> In order to guarantee byte ordering I also thought it best to keep all
> fields in the flow_spec area a big endian value, as such I added vlan, vlan
> ethertype, and data as big endian values.
It's not important that byte order is consistent across architectures.
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> ---
>
> include/linux/ethtool.h | 20 ++++++++++++++++++++
> 1 files changed, 20 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
> index aac3e2e..3d1f8e0 100644
> --- a/include/linux/ethtool.h
> +++ b/include/linux/ethtool.h
> @@ -378,10 +378,25 @@ struct ethtool_usrip4_spec {
> };
>
> /**
> + * struct ethtool_ntuple_spec_ext - flow spec extension for ntuple in nfc
> + * @unused: space unused by extension
> + * @vlan_etype: EtherType for vlan tagged packet to match
> + * @vlan_tci: VLAN tag to match
> + * @data: Driver-dependent data to match
> + */
> +struct ethtool_ntuple_spec_ext {
> + __be32 unused[15];
> + __be16 vlan_etype;
> + __be16 vlan_tci;
> + __be32 data[2];
> +};
[...]
This is a really nasty way to reclaim space in the union.
Let's name the union, shrink it and insert the extra fields that way:
--- a/include/linux/ethtool.h
+++ b/include/linux/ethtool.h
@@ -377,27 +377,43 @@ struct ethtool_usrip4_spec {
__u8 proto;
};
+union ethtool_flow_union {
+ struct ethtool_tcpip4_spec tcp_ip4_spec;
+ struct ethtool_tcpip4_spec udp_ip4_spec;
+ struct ethtool_tcpip4_spec sctp_ip4_spec;
+ struct ethtool_ah_espip4_spec ah_ip4_spec;
+ struct ethtool_ah_espip4_spec esp_ip4_spec;
+ struct ethtool_usrip4_spec usr_ip4_spec;
+ struct ethhdr ether_spec;
+ __u8 hdata[52];
+};
+
+struct ethtool_flow_ext {
+ __be16 vlan_etype;
+ __be16 vlan_tci;
+ __be32 data[2];
+ __u32 reserved[2];
+};
+
/**
* struct ethtool_rx_flow_spec - specification for RX flow filter
* @flow_type: Type of match to perform, e.g. %TCP_V4_FLOW
* @h_u: Flow fields to match (dependent on @flow_type)
+ * @h_ext: Additional fields to match
* @m_u: Masks for flow field bits to be ignored
+ * @m_ext: Masks for additional field bits to be ignored.
+ * Note, all additional fields must be ignored unless @flow_type
+ * includes the %FLOW_EXT flag.
* @ring_cookie: RX ring/queue index to deliver to, or %RX_CLS_FLOW_DISC
* if packets should be discarded
* @location: Index of filter in hardware table
*/
struct ethtool_rx_flow_spec {
__u32 flow_type;
- union {
- struct ethtool_tcpip4_spec tcp_ip4_spec;
- struct ethtool_tcpip4_spec udp_ip4_spec;
- struct ethtool_tcpip4_spec sctp_ip4_spec;
- struct ethtool_ah_espip4_spec ah_ip4_spec;
- struct ethtool_ah_espip4_spec esp_ip4_spec;
- struct ethtool_usrip4_spec usr_ip4_spec;
- struct ethhdr ether_spec;
- __u8 hdata[72];
- } h_u, m_u;
+ union ethtool_flow_union h_u;
+ struct ethtool_flow_ext h_ext;
+ union ethtool_flow_union m_u;
+ struct ethtool_flow_ext m_ext;
__u64 ring_cookie;
__u32 location;
};
@@ -954,6 +970,8 @@ struct ethtool_ops {
#define IPV4_FLOW 0x10 /* hash only */
#define IPV6_FLOW 0x11 /* hash only */
#define ETHER_FLOW 0x12 /* spec only (ether_spec) */
+/* Flag to enable additional fields in struct ethtool_rx_flow_spec */
+#define FLOW_EXT 0x80000000
/* L3-L4 network traffic flow hash options */
#define RXH_L2DA (1 << 1)
---
Ben.
--
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
^ permalink raw reply
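Editorial sketch (not part of the thread): the arithmetic behind the proposal above can be checked in userspace. Shrinking the union from 72 to 52 bytes and following it with a 20-byte extension struct should leave every later member of the flow spec at its old offset. The type names below (tcpip4_spec, flow_union, flow_ext, old_flow_spec, new_flow_spec) and the two helper functions are simplified stand-ins invented for this illustration, not the kernel definitions; only the array sizes, the extension fields and the FLOW_EXT value are taken from the diff.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Value taken from the diff above. */
#define FLOW_EXT 0x80000000u

/* Hypothetical stand-in for one union member, so the union gets 4-byte alignment. */
struct tcpip4_spec {
	uint32_t ip4src, ip4dst;
	uint16_t psrc, pdst;
	uint8_t tos;
};

/* Old layout: two anonymous 72-byte unions. */
struct old_flow_spec {
	uint32_t flow_type;
	union { struct tcpip4_spec tcp; uint8_t hdata[72]; } h_u, m_u;
	uint64_t ring_cookie;
	uint32_t location;
};

/* New layout: named 52-byte union followed by a 20-byte extension. */
union flow_union { struct tcpip4_spec tcp; uint8_t hdata[52]; };

struct flow_ext {
	uint16_t vlan_etype;
	uint16_t vlan_tci;
	uint32_t data[2];
	uint32_t reserved[2];
};

struct new_flow_spec {
	uint32_t flow_type;
	union flow_union h_u;
	struct flow_ext h_ext;
	union flow_union m_u;
	struct flow_ext m_ext;
	uint64_t ring_cookie;
	uint32_t location;
};

/* Compile-time checks: 52 + 20 == 72, and nothing after h_u moves. */
static_assert(sizeof(union flow_union) + sizeof(struct flow_ext) == 72,
	      "union plus extension must fill the old 72 bytes");
static_assert(offsetof(struct new_flow_spec, m_u) ==
	      offsetof(struct old_flow_spec, m_u), "m_u moved");
static_assert(offsetof(struct new_flow_spec, ring_cookie) ==
	      offsetof(struct old_flow_spec, ring_cookie), "ring_cookie moved");
static_assert(sizeof(struct new_flow_spec) == sizeof(struct old_flow_spec),
	      "overall size changed");

/* The extension fields are only meaningful when FLOW_EXT is set. */
int flow_has_ext(uint32_t flow_type)
{
	return (flow_type & FLOW_EXT) != 0;
}

/* Strip the flag to recover the base flow type, e.g. ETHER_FLOW (0x12). */
uint32_t flow_base_type(uint32_t flow_type)
{
	return flow_type & ~FLOW_EXT;
}
```

If the static assertions compile, old binaries that know nothing about h_ext and m_ext still see the same offsets for m_u, ring_cookie and location, which is the point of carving the extension out of the tail of the union instead of overlaying a padded struct on it.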
* Re: SO_REUSEPORT - can it be done in kernel?
From: Herbert Xu @ 2011-02-26 0:57 UTC (permalink / raw)
To: David Miller
Cc: rick.jones2, tgraf, therbert, wsommerfeld, daniel.baluta, netdev
In-Reply-To: <20110225.112019.48513284.davem@davemloft.net>
David Miller <davem@davemloft.net> wrote:
>
> I think this is fundamentally a bind problem as well.
I'm fairly certain the bottleneck is indeed in the kernel, and
in the UDP stack in particular.
This is borne out by a test where I used two named worker threads,
both working on the same socket. Stracing shows that they're
working flat out only doing sendmsg/recvmsg.
The result was that they obtained (in aggregate) half the throughput
of a single worker thread.
I then retested by having them use two sockets and the performance
greatly improved.
Now this is actually expected since our UDP stack is essentially
single-threaded on the send side when only one socket is being
used, mostly due to the corking functionality.
I'm unsure how big a role the receive side scalability actually
plays in this case, but I suspect it isn't great.
Which is why I'm quite skeptical about this REUSEPORT patch:
IMHO the only reason it produces a great result is that it
allows parallel sends to go out.
Rather than modifying all UDP applications out there to fix what
is fundamentally a kernel problem, I think what we should do is
fix the UDP stack so that it actually scales.
It isn't all that hard since the easy way would be to only take
the lock if we're already corked or about to cork.
For the receive side we also don't need REUSEPORT as we can simply
make our UDP stack multiqueue.
Cheers,
--
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply
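Editorial sketch (not part of the thread): the idea of taking the lock only when the socket is already corked or about to cork can be modelled in userspace. Everything below is a toy: struct toy_sock, toy_sock_init() and toy_send() are hypothetical names, the "more" argument stands in for a cork request such as MSG_MORE, and "sending" is just adding to a counter. It is not the kernel's UDP code, and the unlocked read of the pending flag is deliberately simplistic; a real implementation would have to handle the races around it.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy socket: all names here are hypothetical, for illustration only. */
struct toy_sock {
	pthread_mutex_t lock;
	atomic_bool pending;	/* true while corked data is queued */
	size_t queued;		/* bytes queued while corked; protected by lock */
	size_t lock_taken;	/* how often the slow path ran; protected by lock */
	atomic_size_t sent;	/* total bytes "transmitted" */
};

void toy_sock_init(struct toy_sock *sk)
{
	assert(sk != NULL);
	pthread_mutex_init(&sk->lock, NULL);
	atomic_init(&sk->pending, false);
	atomic_init(&sk->sent, 0);
	sk->queued = 0;
	sk->lock_taken = 0;
}

/* more == true models a cork request: keep the data queued for now. */
void toy_send(struct toy_sock *sk, size_t len, bool more)
{
	/* Fast path: not corked and not about to cork, so no lock at all. */
	if (!more && !atomic_load(&sk->pending)) {
		atomic_fetch_add(&sk->sent, len);
		return;
	}

	/* Slow path: corked state is shared, so serialise on the lock. */
	pthread_mutex_lock(&sk->lock);
	sk->lock_taken++;
	sk->queued += len;
	if (more) {
		atomic_store(&sk->pending, true);
	} else {
		/* Final fragment: flush everything queued and uncork. */
		atomic_fetch_add(&sk->sent, sk->queued);
		sk->queued = 0;
		atomic_store(&sk->pending, false);
	}
	pthread_mutex_unlock(&sk->lock);
}
```

In this model, threads that never cork never touch the mutex, so uncorked sends on a single socket no longer serialise on it; only callers that actually use corking pay for the lock.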
* [net-2.6 PATCH] ethtool: prevent null pointer dereference with NTUPLE set but no set_rx_ntuple
From: Alexander Duyck @ 2011-02-26 0:42 UTC (permalink / raw)
To: davem, jeffrey.t.kirsher, bhutchings; +Cc: netdev, stable
This change is meant to prevent a possible null pointer dereference if
NETIF_F_NTUPLE is defined but the set_rx_ntuple function pointer is not.
This issue appears to affect all kernels since 2.6.34.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
net/core/ethtool.c | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index c1a71bb..4843674 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -893,6 +893,9 @@ static noinline_for_stack int ethtool_set_rx_ntuple(struct net_device *dev,
struct ethtool_rx_ntuple_flow_spec_container *fsc = NULL;
int ret;
+ if (!ops->set_rx_ntuple)
+ return -EOPNOTSUPP;
+
if (!(dev->features & NETIF_F_NTUPLE))
return -EINVAL;
^ permalink raw reply related
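Editorial sketch (not part of the thread): the effect of the added guard can be shown with a small userspace model. The names toy_dev, toy_ethtool_ops, toy_set_rx_ntuple(), toy_driver_set_rx_ntuple() and TOY_F_NTUPLE are hypothetical stand-ins for the kernel structures and for NETIF_F_NTUPLE; only the order of the two checks and the two error codes are taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_F_NTUPLE (1u << 0)	/* hypothetical stand-in for NETIF_F_NTUPLE */

struct toy_dev;

struct toy_ethtool_ops {
	int (*set_rx_ntuple)(struct toy_dev *dev);	/* may be NULL */
};

struct toy_dev {
	uint32_t features;
	const struct toy_ethtool_ops *ops;
};

/* Mirrors the order of checks in the patched function. */
int toy_set_rx_ntuple(struct toy_dev *dev)
{
	assert(dev != NULL && dev->ops != NULL);

	/* New check: without it, the call below would go through NULL. */
	if (!dev->ops->set_rx_ntuple)
		return -EOPNOTSUPP;

	/* Existing check: the feature flag must be enabled. */
	if (!(dev->features & TOY_F_NTUPLE))
		return -EINVAL;

	return dev->ops->set_rx_ntuple(dev);
}

/* Trivial driver callback used to exercise the success path. */
int toy_driver_set_rx_ntuple(struct toy_dev *dev)
{
	(void)dev;
	return 0;
}
```

A device with the flag set but no callback gets -EOPNOTSUPP instead of a call through a NULL pointer, while a device with the callback but the flag cleared still gets -EINVAL as before.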
* Re: [net-next-2.6 PATCH 01/10] ethtool: prevent null pointer dereference with NTUPLE set but no set_rx_ntuple
From: Alexander Duyck @ 2011-02-26 0:40 UTC (permalink / raw)
To: Ben Hutchings
Cc: davem@davemloft.net, Kirsher, Jeffrey T, netdev@vger.kernel.org
In-Reply-To: <1298679675.3555.4.camel@localhost>
On 2/25/2011 4:21 PM, Ben Hutchings wrote:
> On Fri, 2011-02-25 at 15:32 -0800, Alexander Duyck wrote:
>> This change is meant to prevent a possible null pointer dereference if
>> NETIF_F_NTUPLE is defined but the set_rx_ntuple function pointer is not.
>
> I think it would be a bug for NETIF_F_NTUPLE to be enabled on a device
> that doesn't have this operation. Are there any drivers for which this
> is possible?
Currently there are no drivers where this is possible. However I
encountered it as a result of testing the patches further on in this set.
>> This issue appears to affect all kernels since 2.6.34.
>
> If this can actually happen, the fix should go to net-2.6 and
> stable@kernel.org. However, I think that the null dereference is
> impossible and this really just fixes the error code.
>
> Ben.
It cannot occur with any of the in-kernel drivers since they all set the
NETIF_F_NTUPLE flag and have the function defined. However going
forward I would like to have the option of using the network flow
classifier interface instead of the set_rx_ntuple interface due to the
fact that it supports many of the features I needed.
I believe this patch should apply to net-2.6 without any changes so if
it is better placed there I will resubmit it specifically for net-2.6
and stable.
Thanks,
Alex
>> Signed-off-by: Alexander Duyck<alexander.h.duyck@intel.com>
>> ---
>>
>> net/core/ethtool.c | 3 +++
>> 1 files changed, 3 insertions(+), 0 deletions(-)
>>
>> diff --git a/net/core/ethtool.c b/net/core/ethtool.c
>> index c1a71bb..4843674 100644
>> --- a/net/core/ethtool.c
>> +++ b/net/core/ethtool.c
>> @@ -893,6 +893,9 @@ static noinline_for_stack int ethtool_set_rx_ntuple(struct net_device *dev,
>> struct ethtool_rx_ntuple_flow_spec_container *fsc = NULL;
>> int ret;
>>
>> + if (!ops->set_rx_ntuple)
>> + return -EOPNOTSUPP;
>> +
>> if (!(dev->features & NETIF_F_NTUPLE))
>> return -EINVAL;
>>
>>
>
^ permalink raw reply
* Re: [net-next-2.6 PATCH 01/10] ethtool: prevent null pointer dereference with NTUPLE set but no set_rx_ntuple
From: Ben Hutchings @ 2011-02-26 0:21 UTC (permalink / raw)
To: Alexander Duyck; +Cc: davem, jeffrey.t.kirsher, netdev
In-Reply-To: <20110225233244.7920.26742.stgit@gitlad.jf.intel.com>
On Fri, 2011-02-25 at 15:32 -0800, Alexander Duyck wrote:
> This change is meant to prevent a possible null pointer dereference if
> NETIF_F_NTUPLE is defined but the set_rx_ntuple function pointer is not.
I think it would be a bug for NETIF_F_NTUPLE to be enabled on a device
that doesn't have this operation. Are there any drivers for which this
is possible?
> This issue appears to affect all kernels since 2.6.34.
If this can actually happen, the fix should go to net-2.6 and
stable@kernel.org. However, I think that the null dereference is
impossible and this really just fixes the error code.
Ben.
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> ---
>
> net/core/ethtool.c | 3 +++
> 1 files changed, 3 insertions(+), 0 deletions(-)
>
> diff --git a/net/core/ethtool.c b/net/core/ethtool.c
> index c1a71bb..4843674 100644
> --- a/net/core/ethtool.c
> +++ b/net/core/ethtool.c
> @@ -893,6 +893,9 @@ static noinline_for_stack int ethtool_set_rx_ntuple(struct net_device *dev,
> struct ethtool_rx_ntuple_flow_spec_container *fsc = NULL;
> int ret;
>
> + if (!ops->set_rx_ntuple)
> + return -EOPNOTSUPP;
> +
> if (!(dev->features & NETIF_F_NTUPLE))
> return -EINVAL;
>
>
--
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
^ permalink raw reply
* Re: TX VLAN acceleration on bridges broken in 2.6.37?
From: Jan Niehusmann @ 2011-02-26 0:19 UTC (permalink / raw)
To: Jesse Gross; +Cc: linux-kernel, netdev
In-Reply-To: <AANLkTinoJqWA6ffnUx2KW_83srNbo+k6r24hnNpPTGvW@mail.gmail.com>
On Fri, Feb 25, 2011 at 02:53:21PM -0800, Jesse Gross wrote:
> is specific to the e1000e driver. I know that some other Intel NICs
> require vlan stripping on receive to be enabled for vlan insertion on
> transmit to work. Since this driver has not been converted over to
> use the new vlan model yet, it only enables these things if a vlan is
> directly configured on it. To confirm this can you try a few things:
My observations confirm your theory:
> * Directly configure the vlan on the device instead of going through the bridge.
- does work, but only if eth0 is not part of bridge (expected behaviour,
afaik)
> * Use the bridge but also configure an unused vlan device on the
> physical interface.
- does work
> * Double check that tcpdump with the settings that you are using shows
> vlan tags in other situations. In some cases you need to use the 'e'
> flag with tcpdump in order for it show vlan tags. If it is the
> driver/NIC that is dropping the tags, tcpdump should still show them.
- indeed, -e is necessary to show the vlan tags. So my prior observation
regarding tag visibility in tcpdump was wrong. The packets still
have a vlan tag in the non-working case.
(What is actually affected by the txvlan flag is the ability to filter
for vlan tags with tcpdump. So 'tcpdump -e -i eth0' shows the packets,
but 'tcpdump -e -i eth0 vlan' only shows them with txvlan off. However,
filtering for the vlan tag also doesn't work with the vlan interface
on eth0.1, while the tagging actually works, as verified above.)
Jan
^ permalink raw reply